Docling is an open-source Python library from IBM Research that parses documents such as PDF, DOCX, PPTX, images, HTML, and AsciiDoc and converts them into Markdown or JSON. Conversion runs locally, so document contents do not have to be sent to an external service. The library preserves document structure, including tables and reading order, and has built-in OCR for scanned documents. It also integrates with AI frameworks such as LlamaIndex and LangChain, which makes it suitable for complex document-processing pipelines, and it is actively developed by the open-source community.
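As a rough illustration of the local conversion described above, here is a minimal sketch based on Docling's documented quickstart at the time of writing. `DocumentConverter`, `convert`, and `export_to_markdown` come from that quickstart. The wrapper `convert_to_markdown` and its `converter` parameter are hypothetical names introduced here for illustration, and the file name in the usage comment is a placeholder.

```python
def convert_to_markdown(source, converter=None):
    """Convert a document (local path or URL) to a Markdown string.

    `convert_to_markdown` and the `converter` parameter are hypothetical
    helpers for this sketch; they are not part of Docling itself.
    """
    if converter is None:
        # Imported lazily so the module still loads where docling is not installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # convert() parses the source and returns a result wrapping the document.
    result = converter.convert(source)

    # The parsed document can be exported as Markdown text.
    return result.document.export_to_markdown()


# Usage (assumes `pip install docling` and a local file named report.pdf):
#   markdown = convert_to_markdown("report.pdf")
#   print(markdown)
```

Passing the converter in as an argument is only a convenience: it lets one converter instance be reused across many files and makes the function easy to exercise without the library installed.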
Table of contents
Docling: An OpenSource Python library for PDF Parsing | OCR Support | RAG | IBM Research
Introduction
Challenges
Experimentations
Final Thoughts
References