🕷️ Scrapling: A 48k-Star Smart Web Scraping Framework with Built-in Anti-Blocking + MCP, Unfazed by Site Redesigns
Repo: D4Vinci/Scrapling | ⭐ 48,725 Stars | 🐍 Python | 🏷 AI / Web Scraping / MCP | Author: Karim Shoair
Honestly, the most maddening part of web scraping isn't writing the crawler logic. It's:
1. The site changes its CSS classes and every selector you wrote stops working
2. Cloudflare Turnstile decides you're a bot and locks you out
3. Your IP gets banned halfway through the crawl
The old routine went: write it with BeautifulSoup → get blocked by anti-bot measures → switch to Selenium → too slow → switch to Scrapy → too heavy. After a full lap of that, you've barely collected any data.
Scrapling exists to kill exactly these headaches. One library handles parsing, anti-blocking, and concurrent crawling, and it has one genuinely slick trick: it adapts to changes in site structure automatically, so your selectors keep working after the next redesign.
1. Installation: One Command
pip install scrapling
To handle anti-bot bypass and browser automation, add the fetcher extras:
pip install "scrapling[all]"
scrapling install  # downloads the browser dependencies
Done. That's all there is to it.
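To sanity-check the install, here is a minimal sketch using the plain HTTP fetcher (no browser involved; quotes.toscrape.com is just the demo site used later in this post):
from scrapling.fetchers import Fetcher

# Simple HTTP GET; returns a parsed page you can query with CSS selectors
page = Fetcher.get('https://quotes.toscrape.com/')
print(page.status)                    # HTTP status code, e.g. 200
print(page.css_first('title::text'))  # the <title> text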
2. The Slickest Trick: Adaptive Selectors
Site redesigns are every scraper's nightmare. Scrapling's auto_save and adaptive parameters record an element's features on the first crawl and automatically relocate it after the page changes:
from scrapling.fetchers import StealthyFetcher
StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://quotes.toscrape.com/', headless=True)
# First run: record the elements' features
quotes = page.css('.quote', auto_save=True)
# A week later the site gets redesigned and .quote becomes .item
# Pass adaptive=True and Scrapling finds the elements on its own
quotes = page.css('.quote', adaptive=True)
Don't ask how I know this. Anyone who has maintained BS4 scrapers will tear up at this feature.
3. Anti-Blocking: Straight Past Cloudflare Turnstile
from scrapling.fetchers import StealthySession
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare')
    data = page.css('#padded_content a').getall()
StealthyFetcher ships with browser fingerprint spoofing and Cloudflare bypass built in, so you no longer have to wire up the whole undetected-chromedriver setup yourself.
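If you don't need to keep a session open, a one-shot stealth fetch works too; a minimal sketch, assuming solve_cloudflare is also accepted by StealthyFetcher.fetch (as shown in the project README):
from scrapling.fetchers import StealthyFetcher

# One-off stealth fetch; solve_cloudflare waits for the Turnstile
# challenge to clear before the page is returned
page = StealthyFetcher.fetch(
    'https://nopecha.com/demo/cloudflare',
    headless=True,
    solve_cloudflare=True,
)
print(page.status)  # 200 once the challenge has been solved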
4. A Full Spider Framework: Like Scrapy, Only Lighter
Scrapling ships with a built-in Spider framework that supports concurrency, multiple sessions, and pause/resume:
from scrapling.spiders import Spider, Response
class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])
result = QuotesSpider().start()
result.items.to_json("quotes.json")
That's 10 concurrent requests from a single command. Hit Ctrl+C to pause; run it again and it resumes right where it left off.
5. MCP Server: A Web Data Extractor for AI Agents
Scrapling ships with an MCP server, so AI agents (Claude, Cursor, and the like) can use it directly to extract web content. Unlike Firecrawl, Scrapling runs locally, and your data never passes through a third party:
pip install "scrapling[ai]"
Then just point your MCP client at the local Scrapling server. When the agent is asked to grab a page, Scrapling does the extraction first and feeds only the trimmed-down content to the LLM, saving both tokens and time.
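For illustration, here is a rough sketch of what the client side can look like from Python, using the official mcp SDK. The launch command ("scrapling", ["mcp"]) is a placeholder assumption; check the repo's MCP docs for the real way to start the server:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command for the local Scrapling MCP server -- see the repo docs
params = StdioServerParameters(command="scrapling", args=["mcp"])

async def main():
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # List whatever extraction tools the server exposes
            print([tool.name for tool in tools.tools])

asyncio.run(main())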
Summary
The auto_save + adaptive one-two punch alone is worth it. If you're still hand-writing scrapers with requests + BeautifulSoup, give Scrapling a try: forget the fancy multi-tool stack, one library is enough.
🏷 tags: #WebScraping #Python #MCP #Crawler #AI-Agent #Cloudflare
🕷️ Scrapling: A 48k-Star Adaptive Web Scraping Framework — Built-in Anti-Blocking + MCP, Survives Website Redesigns
Repo: D4Vinci/Scrapling | ⭐ 48,725 Stars | 🐍 Python
The most painful part of web scraping isn't the parsing logic — it's websites changing their CSS classes, Cloudflare Turnstile blocking your requests, and IP bans mid-crawl.
Scrapling solves all of that in one library. Its killer feature: adaptive selectors that automatically relocate your elements when the page structure changes.
Quick Start
pip install scrapling
pip install "scrapling[all]"
scrapling install
Adaptive Selection
from scrapling.fetchers import StealthyFetcher
StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True)
products = page.css('.product', auto_save=True) # remembers element patterns
products = page.css('.product', adaptive=True) # finds them even after redesign
Anti-Bot Bypass
Cloudflare Turnstile? No problem:
from scrapling.fetchers import StealthySession
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare')
Spider Framework
Concurrent crawling with pause/resume, multi-session support:
from scrapling.spiders import Spider, Response
class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {"text": quote.css('.text::text').get(), "author": quote.css('.author::text').get()}
result = QuotesSpider().start()
result.items.to_json("quotes.json")
MCP Server for AI Agents
Scrapling includes a built-in MCP server — your AI agent (Claude, Cursor, etc.) can use it to extract web content locally without third-party APIs. Unlike Firecrawl, everything runs on your machine.
Bottom Line
If you're still using requests + BeautifulSoup for web scraping, give Scrapling a try. Adaptive selectors, built-in anti-bot, and a proper spider framework in a single pip install. No more juggling five libraries for one job.