🔥 Firecrawl: 120k+ Star Web Scraping API for AI Agents
Firecrawl is a web scraping tool built for AI agents. It converts websites into clean Markdown or structured data. With 120k+ stars, it's the Swiss Army knife for AI data acquisition.
Installation & Usage
pip install firecrawl-py
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
# Scrape a single webpage
doc = app.scrape("https://example.com", formats=["markdown"])
print(doc.markdown)
# Crawl an entire website (up to 50 pages)
docs = app.crawl("https://docs.example.com", limit=50)
for doc in docs.data:
    print(doc.metadata.source_url, doc.markdown[:100])
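Once a crawl returns, a common next step is to write each page to disk so an agent or indexer can pick it up later. The sketch below is one way to do that using only the standard library. The helper names `slug_for` and `save_pages` are hypothetical and not part of Firecrawl; the code only assumes you have pairs of (source URL, Markdown string), such as `doc.metadata.source_url` and `doc.markdown` from the loop above.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Turn a URL into a filesystem-safe file stem (hypothetical helper)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return slug or "index"


def save_pages(pages, out_dir):
    """Write (source_url, markdown) pairs into out_dir as .md files.

    Returns the list of paths written. Pages whose URLs map to the same
    slug overwrite each other, which is acceptable for a sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for source_url, markdown in pages:
        path = out / f"{slug_for(source_url)}.md"
        path.write_text(markdown, encoding="utf-8")
        written.append(path)
    return written
```

With the crawl result from above, this could be called as `save_pages(((d.metadata.source_url, d.markdown) for d in docs.data), "crawl_output")`.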
Core Features
Search: Search the web and get the full content of each result
search_result = app.search("AI agent tools", limit=5)
Agent: Autonomous data gathering, no URLs required
result = app.agent(prompt="Find all AI agent frameworks")
print(result.data)
Batch Scrape: Scrape thousands of URLs asynchronously
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
docs = app.batch_scrape(urls)
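Before sending a large URL list to a batch job, it usually pays to drop duplicates and split the list into manageable chunks. The following is a minimal sketch of that preprocessing, not something Firecrawl requires. `prepare_batches` is a hypothetical helper, and the default batch size of 100 is an arbitrary assumption, not a documented limit.

```python
from urllib.parse import urldefrag


def prepare_batches(urls, batch_size=100):
    """Drop duplicates and #fragments, then split into fixed-size batches.

    batch_size=100 is an arbitrary default chosen for illustration.
    """
    seen = set()
    unique = []
    for url in urls:
        # "https://a.com#top" and "https://a.com" point at the same page.
        clean, _fragment = urldefrag(url.strip())
        if clean and clean not in seen:
            seen.add(clean)
            unique.append(clean)
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

Each returned batch can then be passed to `app.batch_scrape(batch)` in turn, as in the snippet above.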
CLI Tools
# Install CLI
npm install -g @mendable/firecrawl-cli
# Search
firecrawl search "AI tools" --limit 5
# Scrape webpage
firecrawl scrape https://example.com --only-main-content
# AI interaction
firecrawl scrape https://amazon.com
firecrawl interact exec --prompt "Search for mechanical keyboard"
Real-world Use Cases
- AI Training Data Collection: Automatically scrape technical docs and blog posts
- Competitive Analysis: Collect information from competitor websites in bulk
- Content Aggregation: Collect content on a topic from multiple sources
- Webpage Change Monitoring: Scrape important pages on a schedule to catch updates
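For the change-monitoring case, one simple approach is to store a hash of each page's Markdown and compare it on the next run. The sketch below assumes you already have a mapping of URL to scraped Markdown (for example, collected from `doc.markdown`). The names `fingerprint` and `detect_changes` are hypothetical, and hashing is a swapped-in technique chosen for illustration, not a description of how Firecrawl tracks changes.

```python
import hashlib


def fingerprint(markdown: str) -> str:
    """Hash whitespace-normalized Markdown so pure reflows are not flagged."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, current_pages: dict):
    """Compare stored fingerprints with freshly scraped Markdown.

    previous:      url -> fingerprint saved from the last run
    current_pages: url -> Markdown scraped in this run
    Returns (changed_urls, updated_fingerprints). URLs seen for the
    first time are reported as changed.
    """
    updated = dict(previous)
    changed = []
    for url, markdown in current_pages.items():
        digest = fingerprint(markdown)
        if previous.get(url) != digest:
            changed.append(url)
        updated[url] = digest
    return changed, updated
```

Persisting `updated_fingerprints` between runs (for instance as a JSON file) is enough to turn a scheduled scrape into a basic change monitor.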
Firecrawl's biggest advantage is a consistent output format. Whether the source is a complex e-commerce page or a simple blog, it is converted to clean Markdown that AI agents can use directly, without additional cleanup work.