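Repo: docling-project/docling | ⭐ 59.7K Stars | 🛠 Python | 🏢 IBM Research
Honestly, every pitfall I have hit while processing PDFs has been painful: incompatible formats, tables scattered all over the page, image recognition that comes down to luck. Then I found Docling.
It is an open-source document parser from IBM Research that covers a wide range of formats and currently sits at 59.7K stars. Beyond turning PDFs into Markdown, it also handles DOCX, PPTX, XLSX, HTML, LaTeX and even audio files. The slickest part is that it uses dedicated models for tables, code blocks, formulas and image classification, and in my experience the parsing quality beats most tools out there.
Try it with one command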
pip install docling
Once it is installed, run the CLI directly:
docling https://arxiv.org/pdf/2206.01062
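After a few seconds (longer on the first run, while the models download), a well-structured .md file appears in the current directory. That is all there is to it.
To handle complex layouts with a VLM (vision-language model):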
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062
Python integration, where the real productivity is
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
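A few lines and you are done. It can also export to HTML, JSON (a lossless representation of the structure), DocTags and other formats, which makes the output easy to drop into a RAG pipeline.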
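To make the export options a bit more concrete, here is a minimal sketch that writes one converted document out as Markdown, HTML and JSON. Only `export_to_markdown()` appears in the snippet above; `export_to_html()` and `export_to_dict()` are assumed to be available on the same document object, and the helper names `export_all` and `convert_and_export` are made up for illustration.

```python
import json
from pathlib import Path


def export_all(document, out_dir, stem):
    """Write Markdown, HTML and JSON renderings of a converted document.

    `document` is expected to behave like `result.document` from the
    snippet above. export_to_html() and export_to_dict() are assumptions
    about that object, not calls shown in this article.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = {
        "md": out / f"{stem}.md",
        "html": out / f"{stem}.html",
        "json": out / f"{stem}.json",
    }
    paths["md"].write_text(document.export_to_markdown(), encoding="utf-8")
    paths["html"].write_text(document.export_to_html(), encoding="utf-8")
    # The dict export carries the document structure, so it is the one to
    # keep when nothing should be lost.
    paths["json"].write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return paths


def convert_and_export(source, out_dir="out", stem="document"):
    # Imported lazily so export_all stays usable without docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return export_all(result.document, out_dir, stem)
```

Calling `convert_and_export("https://arxiv.org/pdf/2408.09869", stem="paper")` should leave three files under `out/`. For a RAG pipeline the Markdown file is usually the one to chunk, while the JSON is handy to keep around as the structured original.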
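Why is it worth using?
Docling's core selling points come down to a few things: broad format coverage, dedicated models for tables, code blocks, formulas and image classification, and export formats that slot straight into a RAG pipeline.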
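Summary
| Point | In one sentence |
|------|--------|
| Who made it | IBM Research, MIT license |
| How to install | pip install docling |
| How to use | One line on the CLI, a few lines of Python |
| Best fit | RAG data preprocessing, batch document conversion, structured extraction from PDFs |
| Pitfalls | Some complex tables occasionally lose a border line, but it is still far better than PyMuPDF |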
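For the batch-conversion scenario, a rough sketch along these lines may be a useful starting point. It relies only on the `convert()` and `export_to_markdown()` calls shown earlier; the helper name `batch_to_markdown`, the output layout and the injectable `converter` argument are illustrative choices, and the sketch assumes local file paths with distinct names.

```python
from pathlib import Path


def batch_to_markdown(sources, out_dir, converter=None):
    """Convert each source to Markdown, collecting failures instead of stopping.

    Returns (written, failed): a list of output paths and a list of
    (source, error) pairs for the documents that could not be converted.
    """
    if converter is None:
        # Imported lazily; a single converter is reused for every file.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written, failed = [], []
    for source in sources:
        try:
            result = converter.convert(source)
            markdown = result.document.export_to_markdown()
        except Exception as exc:  # one bad document should not stop the batch
            failed.append((str(source), repr(exc)))
            continue
        # Output name is derived from the input file name, e.g. a.pdf -> a.md.
        target = out / (Path(str(source)).stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written, failed
```

Something like `written, failed = batch_to_markdown(Path("docs").glob("*.pdf"), "out")` would then process a folder, and the `failed` list gives you the files worth checking by hand, which is also where the occasional mangled table tends to turn up.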
Stop fiddling with assorted PDF parsing libraries; Docling handles it all in one go.