NVIDIA released Nemotron OCR v2, a multilingual OCR model trained on 12.2 million synthetic images across six languages (English, Japanese, Korean, Russian, Simplified and Traditional Chinese). The synthetic data pipeline, built on a modified SynthDoG renderer, generates pixel-perfect annotations at word, line, and paragraph levels with reading order graphs, using mOSCAR for source text and open-source font pools. The model uses a FOTS-inspired shared backbone architecture that reuses feature maps across detection, recognition, and relational components, achieving 34.7 pages/second on a single A100 GPU. Normalized edit distance (NED, lower is better) for non-English languages dropped from 0.56–0.92 (v1) to 0.035–0.069 (v2), outperforming specialized per-language models such as PaddleOCR. The dataset and model are publicly available on Hugging Face.
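To make the annotation format concrete, here is a minimal Python sketch of what word-, line-, and paragraph-level regions with a reading-order graph might look like. The field names, the edge-list representation of reading order, and the containment map are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    # One annotated text region: pixel-space polygon plus transcription.
    region_id: int
    level: str     # "word", "line", or "paragraph"
    polygon: list  # [(x, y), ...] corner coordinates in pixels
    text: str


@dataclass
class PageAnnotation:
    # One synthetic page: regions at all three levels plus reading-order edges.
    image_path: str
    regions: list = field(default_factory=list)
    # Hypothetical reading-order graph: edge (a, b) means region a is read
    # immediately before region b.
    reading_order: list = field(default_factory=list)
    # Hypothetical containment map linking words -> lines -> paragraphs.
    parent: dict = field(default_factory=dict)


# Example: two words grouped into one line, read left to right.
page = PageAnnotation(
    image_path="synthetic_000001.png",
    regions=[
        Region(0, "word", [(10, 10), (60, 10), (60, 30), (10, 30)], "Hello"),
        Region(1, "word", [(65, 10), (130, 10), (130, 30), (65, 30)], "world"),
        Region(2, "line", [(10, 10), (130, 10), (130, 30), (10, 30)], "Hello world"),
    ],
    reading_order=[(0, 1)],
    parent={0: 2, 1: 2},
)
print(len(page.regions), "regions,", len(page.reading_order), "reading-order edge(s)")
```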

Table of contents

- The Problem: Data, Not Architecture
- A Generic Synthetic Data Pipeline
- What the Data Looks Like
- Dataset at a Glance
- Extensibility
- The Model: Nemotron OCR v2
- Results: What Synthetic Data Buys You
- Links
- Acknowledgments

From "The Model: Nemotron OCR v2":

The architecture is based on the FOTS (Fast Oriented Text Spotting) design, which unifies detection and recognition into a single network with a shared convolutional backbone. The detection backbone (RegNetX-8GF) processes the input image once and produces feature maps that are reused by all three components. The text recognizer receives rectified feature crops from detected regions and decodes them with a small Transformer. The relational model reasons over per-region embeddings derived from the same feature maps using a compact Transformer encoder. Because the expensive convolutional pass happens only once, the downstream components add minimal overhead. This feature reuse is what drives the model's efficiency, enabling 34.7 pages/second on a single A100 GPU.
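As a rough sketch of the feature-reuse pattern described above, assuming a PyTorch-style implementation: one backbone pass produces features that the detection head, the Transformer recognizer, and the relational encoder all consume. Layer sizes, head designs, and names here are placeholders, not the actual Nemotron OCR v2 modules.

```python
import torch
import torch.nn as nn


class SharedBackboneOCR(nn.Module):
    """One expensive convolutional pass; three lightweight heads reuse the
    resulting feature map. All modules are illustrative stand-ins."""

    def __init__(self, feat_dim: int = 256, charset_size: int = 100):
        super().__init__()
        # Stand-in for the RegNetX-8GF detection backbone (run once per page).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection head: per-pixel text score map over the shared features.
        self.detect_head = nn.Conv2d(feat_dim, 1, 1)
        # Recognizer: small Transformer over rectified feature crops.
        self.recognizer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.char_head = nn.Linear(feat_dim, charset_size)
        # Relational model: compact Transformer over per-region embeddings.
        self.relational = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)

    def forward(self, image, region_crops):
        feats = self.backbone(image)        # computed once, reused by all heads
        det_map = self.detect_head(feats)   # detection reads the features directly
        # Recognition: each crop is a (T, feat_dim) sequence sliced from `feats`
        # (rectification / ROI pooling omitted in this sketch).
        rec_logits = [self.char_head(self.recognizer(c.unsqueeze(0)))
                      for c in region_crops]
        # Relational reasoning over one pooled embedding per detected region.
        region_emb = torch.stack([c.mean(dim=0) for c in region_crops]).unsqueeze(0)
        relations = self.relational(region_emb)
        return det_map, rec_logits, relations


# Toy forward pass: one page image and two detected regions of different widths.
model = SharedBackboneOCR()
det, rec, rel = model(torch.randn(1, 3, 256, 256),
                      [torch.randn(12, 256), torch.randn(7, 256)])
```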
