Towards Data Science is a community-powered publication that showcases work in data science, machine learning and artificial intelligence. Every day newcomers, seasoned researchers and industry practitioners publish tutorials, research notes and real-world case studies that help the field move forward.

Vision Language Models (VLMs) can process long documents while preserving visual information that traditional OCR discards, such as text positioning, drawings, and layout. They can also serve as an advanced OCR step: extracting Markdown, explaining visual elements, and identifying missing information. When applying VLMs to documents of more than 100 pages, account for processing power requirements (GPU access, image resolution, page count), use a hierarchical processing strategy (start with fewer pages or a lower resolution and scale up only when needed), and control cost through token caching. Open-source options like Qwen 3 VL and closed-source models like Gemini 2.5 Pro offer different tradeoffs for document understanding tasks.
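
The hierarchical strategy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `ask_vlm` is a hypothetical callable standing in for whatever model client is used (it is not part of any real library), it is assumed to return `None` when the model cannot answer from the pages it was given, and the page counts and DPI values in `DEFAULT_TIERS` are arbitrary placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tier:
    max_pages: int  # how many leading pages to send to the model
    dpi: int        # resolution at which those pages are rendered


# Illustrative tiers only: cheapest first, most expensive last.
DEFAULT_TIERS = (Tier(10, 72), Tier(50, 100), Tier(10_000, 150))

# Hypothetical signature: (pages, dpi, question) -> answer, or None if unanswerable.
AskVLM = Callable[[Sequence[object], int, str], Optional[str]]


def answer_hierarchically(
    pages: Sequence[object],
    question: str,
    ask_vlm: AskVLM,
    tiers: Sequence[Tier] = DEFAULT_TIERS,
) -> Tuple[Optional[str], Optional[Tier]]:
    """Try cheap tiers first and escalate only when the model cannot answer."""
    for tier in tiers:
        subset = pages[: tier.max_pages]
        answer = ask_vlm(subset, tier.dpi, question)
        if answer is not None:
            return answer, tier  # stop at the cheapest tier that works
    return None, None  # no tier produced an answer
```

The function returns the tier that succeeded alongside the answer, so the caller can log how often the cheap tiers suffice and adjust them. Because every call shares the same leading pages, this pattern may also pair well with token caching where the provider supports it.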