Grab built a custom 1B-parameter Vision LLM to extract information from Southeast Asian documents for eKYC verification. Starting with Qwen2-VL 2B, the team moved from LoRA fine-tuning to full-parameter training, then built a lightweight model from scratch that combines Qwen2-VL's vision encoder with Qwen2.5's compact language decoder. Training ran in four stages: projector alignment, vision enhancement, language-specific visual training on synthetic OCR data, and task-specific fine-tuning. The final model reached accuracy comparable to the 2B version while responding 48-56% faster, and it addresses the challenges posed by non-Latin scripts and the diverse document formats found across the region.
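
The four-stage process can be pictured as a sequential curriculum in which each stage starts from the checkpoint the previous one produced. The following is only a minimal sketch of that ordering, not Grab's training code: the `Stage` class, `run_curriculum`, and the `fake_train` callback are hypothetical names introduced for illustration, and the focus strings are loose paraphrases of the summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    name: str   # stage name, taken from the summary
    focus: str  # loose paraphrase of what the stage targets (assumption)


# Order follows the summary; everything else here is illustrative.
STAGES: List[Stage] = [
    Stage("projector_alignment", "connect the vision encoder to the language decoder"),
    Stage("vision_enhancement", "improve the vision encoder"),
    Stage("language_specific_visual_training", "synthetic OCR data"),
    Stage("task_specific_fine_tuning", "document information extraction"),
]


def run_curriculum(checkpoint: str, train_stage: Callable[[str, Stage], str]) -> str:
    """Run every stage in order, feeding each stage the previous checkpoint."""
    for stage in STAGES:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint


def fake_train(checkpoint: str, stage: Stage) -> str:
    """Hypothetical stand-in for a real training job: only records the stage."""
    return f"{checkpoint}>{stage.name}"


if __name__ == "__main__":
    print(run_curriculum("init", fake_train))
```

Swapping `fake_train` for a real training function would keep the same control flow; which parameters are trainable in each stage is not stated in the summary, so the sketch leaves that out.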

12m read time
From blog.bytebytego.com
Table of contents
Understanding Vision LLMs
Selecting the Base Model
Training Dataset Generation
The Experimentation Journey
Four-Stage Training Process
Results and Performance
Key Technical Insights
Conclusion
