欣淇 · Published 2026-05-11


🕷️ Scrapling: a 48k-Star Adaptive Web Scraping Framework with Built-in Anti-Blocking and MCP — Your Selectors Survive Site Redesigns

Repo: D4Vinci/Scrapling | ⭐ 48,725 Stars | 🐍 Python | 🏷 AI / Web Scraping / MCP | Author: Karim Shoair


Honestly, the most frustrating part of web scraping isn't writing the crawl logic. It's:

  • The site changes a CSS class and all your selectors break
  • Cloudflare Turnstile flags you as a bot and locks you out
  • Your IP gets banned mid-crawl

The old workflow went like this: write it in BeautifulSoup → get blocked → switch to Selenium → too slow → switch to Scrapy → too heavy. After all that churn, you've barely collected any data.

Scrapling exists to fix exactly these headaches. One library covers parsing, anti-blocking, and concurrent crawling, and it has a standout trick: it adapts to changes in site structure automatically, so your selectors keep working after a redesign.


1. Installation: One Command

    pip install scrapling
    

For anti-bot handling and browser automation, install the fetcher extras:

    pip install "scrapling[all]"
scrapling install      # download the browser binaries
    

That's it.


2. The Killer Feature: Adaptive Selectors

A site redesign is every scraper's nightmare. Scrapling offers auto_save and adaptive parameters: on the first crawl it records each element's features, and when the page changes later it can automatically relocate them:

    from scrapling.fetchers import StealthyFetcher
    
    StealthyFetcher.adaptive = True
    page = StealthyFetcher.fetch('https://quotes.toscrape.com/', headless=True)
    
# First run: record the element features
    quotes = page.css('.quote', auto_save=True)
    
# A week later the site is redesigned and the class changes from .quote to .item
# Pass adaptive=True and it relocates the elements on its own
    quotes = page.css('.quote', adaptive=True)
    

Don't ask how I know this matters. Anyone who has maintained BeautifulSoup scrapers will want to cry (with relief) at this feature.
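To build intuition for what "remembering element features" could mean, here is a toy sketch (not Scrapling's actual algorithm) that scores candidate elements against a saved fingerprint of tag, classes, and text, so the right element still wins after a class rename:

```python
from difflib import SequenceMatcher

# Hypothetical fingerprint saved on the first crawl: tag, classes, sample text.
saved = {"tag": "div", "classes": {"quote"}, "text": "The world as we have created it"}

# Elements found after a redesign (class renamed from .quote to .item).
candidates = [
    {"tag": "nav", "classes": {"menu"}, "text": "Home Login"},
    {"tag": "div", "classes": {"item"}, "text": "The world as we have created it"},
]

def similarity(saved, el):
    """Score a candidate element against the saved fingerprint (0..1)."""
    tag = 1.0 if el["tag"] == saved["tag"] else 0.0
    classes = len(saved["classes"] & el["classes"]) / max(len(saved["classes"] | el["classes"]), 1)
    text = SequenceMatcher(None, saved["text"], el["text"]).ratio()
    # Weight text highest: content usually survives redesigns better than class names.
    return 0.2 * tag + 0.2 * classes + 0.6 * text

best = max(candidates, key=lambda el: similarity(saved, el))
print(best["classes"])  # → {'item'}: the renamed element wins despite the class change
```

The design point: the selector string is only one signal among several, so losing it doesn't lose the element.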


3. Anti-Blocking: Through Cloudflare Turnstile in Seconds

    from scrapling.fetchers import StealthySession
    
    with StealthySession(headless=True, solve_cloudflare=True) as session:
        page = session.fetch('https://nopecha.com/demo/cloudflare')
        data = page.css('#padded_content a').getall()
    

StealthyFetcher ships with browser fingerprint spoofing and Cloudflare bypass built in, so you no longer need to wire up the undetected-chromedriver stack yourself.
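For contrast, here is a fraction of what doing this by hand looks like. This stdlib-only sketch just attaches a browser-like User-Agent to a request object; real fingerprint spoofing covers far more (TLS handshakes, navigator properties, canvas), which is exactly the part StealthyFetcher automates. The header values are illustrative, not magic:

```python
from urllib.request import Request

# Manually disguising a request: the bare minimum is a realistic User-Agent.
# (Anti-bot systems check far more than headers; this alone won't beat them.)
req = Request(
    "https://example.com/",
    headers={
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    },
)
print(req.get_header("User-agent"))  # urllib normalizes header-name casing
```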


4. A Full Spider Framework: Like Scrapy, but Lighter

Scrapling includes a built-in Spider framework with support for concurrency, multiple sessions, and pause/resume:

    from scrapling.spiders import Spider, Response
    
    class QuotesSpider(Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]
        concurrent_requests = 10
    
        async def parse(self, response: Response):
            for quote in response.css('.quote'):
                yield {
                    "text": quote.css('.text::text').get(),
                    "author": quote.css('.author::text').get(),
                }
    
            next_page = response.css('.next a')
            if next_page:
                yield response.follow(next_page[0].attrib['href'])
    
    result = QuotesSpider().start()
    result.items.to_json("quotes.json")
    

That runs 10 concurrent requests from a single call. Press Ctrl+C to pause; run it again and it resumes where it left off.
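To see what pause/resume checkpointing involves conceptually, here is a toy crawler (not Scrapling's implementation) over an in-memory link graph. It persists its frontier and visited set to a JSON file after every page, so a restarted run picks up exactly where the interrupted one stopped:

```python
import json
import os

# Toy in-memory "site": page -> links it contains.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
STATE_FILE = "crawl_state.json"

def load_state():
    """Resume from the checkpoint file if one exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            s = json.load(f)
        return s["frontier"], set(s["visited"])
    return ["/"], set()

def save_state(frontier, visited):
    with open(STATE_FILE, "w") as f:
        json.dump({"frontier": frontier, "visited": sorted(visited)}, f)

def crawl(max_pages):
    """Crawl up to max_pages, checkpointing after every page."""
    frontier, visited = load_state()
    for _ in range(max_pages):
        if not frontier:
            break
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        frontier.extend(link for link in SITE[url] if link not in visited)
        save_state(frontier, visited)  # an interruption here loses nothing
    return visited

first = crawl(max_pages=2)     # "interrupted" after two pages
resumed = crawl(max_pages=10)  # second run resumes from the checkpoint
print(sorted(resumed))  # → ['/', '/a', '/b', '/c']
os.remove(STATE_FILE)
```

The same idea scales up: as long as the frontier and visited set are durable, an interrupt at any point is recoverable.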


5. MCP Server: A Web Data Extractor for AI Agents

Scrapling ships with an MCP server, so AI agents (Claude, Cursor, and others) can use it directly to extract web content. Unlike Firecrawl, Scrapling runs locally, and your data never passes through a third party:

    pip install "scrapling[ai]"
    

Then point your MCP client at the local Scrapling server. When the agent asks "grab this page for me," Scrapling extracts the content first and feeds only the trimmed-down result to the LLM, saving both tokens and time.
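The token-saving step boils down to stripping markup before the LLM ever sees the page. A minimal illustration with the stdlib HTML parser (a toy, not Scrapling's extractor, which also drops navigation, boilerplate, and more):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello <b>agent</b></p><script>x=1</script></body></html>")
p = TextExtractor()
p.feed(html)
print(" ".join(p.parts))  # → Hello agent
```

The LLM receives "Hello agent" instead of the full markup, which is where the token savings come from.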


Summary

  • Adaptive selectors tackle the biggest pain point, site redesigns; the auto_save + adaptive combination is genuinely useful
  • Anti-blocking works out of the box; Cloudflare Turnstile needs no extra configuration
  • The Spider framework is lightweight but complete: concurrency, pause/resume, and multiple sessions are all supported
  • The MCP integration lets AI agents call it directly, a good fit for automated data pipelines
  • Performance leaves BS4 in the dust (700x+ faster per the project's benchmarks), with a much smaller memory footprint

If you're still hand-writing scrapers with requests + BeautifulSoup, give Scrapling a try. Skip the elaborate toolchain: one library is enough.


    🏷 tags: #WebScraping #Python #MCP #Crawler #AI-Agent #Cloudflare



