GLM-OCR is an open-source multimodal OCR model that reports state-of-the-art performance on document understanding benchmarks with only 0.9B parameters. Built on the GLM-V architecture and trained with a Multi-Token Prediction (MTP) loss, it handles complex layouts, including tables, formulas, and code. The project ships an SDK with three deployment options: a cloud API through Zhipu MaaS, self-hosting with vLLM or SGLang, and local deployment with Ollama or MLX. The pipeline runs layout detection with PP-DocLayout-V3, recognizes the detected regions in parallel, and outputs results as both JSON and Markdown.
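The flow described above (layout detection, parallel region recognition, then JSON and Markdown output) can be sketched roughly as follows. This is an illustrative outline only, not the GLM-OCR SDK's API: `Region`, `detect_layout`, `recognize_region`, and `run_pipeline` are hypothetical names, and both the layout detector and the OCR model call are replaced by stubs that return fixed data.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, asdict


@dataclass
class Region:
    index: int   # reading order assigned by the layout step
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels


def detect_layout(page):
    """Hypothetical stand-in for a layout detector; returns fixed regions."""
    return [
        Region(0, "text", (0, 0, 100, 20)),
        Region(1, "table", (0, 30, 100, 80)),
        Region(2, "formula", (0, 90, 100, 110)),
    ]


def recognize_region(page, region):
    """Hypothetical stand-in for one OCR model call on a cropped region."""
    return {**asdict(region), "content": f"<{region.kind} {region.index}>"}


def run_pipeline(page, max_workers=4):
    regions = detect_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() yields results in input order, so the output
        # stays in reading order even when calls finish out of order.
        results = list(pool.map(lambda r: recognize_region(page, r), regions))
    as_json = json.dumps(results, indent=2)
    as_markdown = "\n\n".join(r["content"] for r in results)
    return as_json, as_markdown


if __name__ == "__main__":
    as_json, as_markdown = run_pipeline(page=None)
    print(as_markdown)
```

A thread pool is a reasonable fit in a sketch like this because each region call would mostly wait on a model server (for example a vLLM or SGLang endpoint), so the work is I/O-bound rather than CPU-bound.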

7 min read · From github.com
