How Grab Built a Vision LLM to Scan Images

Grab built a custom 1B-parameter Vision LLM to extract information from Southeast Asian documents for eKYC verification. Starting with Qwen2-VL 2B, the team progressed from LoRA fine-tuning to full-parameter training, then built a lightweight model from scratch that combines Qwen2-VL's vision encoder with Qwen2.5's compact language decoder. The four-stage training process covered projector alignment, vision enhancement, language-specific visual training on synthetic OCR data, and task-specific fine-tuning. The final model achieved accuracy comparable to the 2B version while running 48-56% faster. The approach addresses challenges with non-Latin scripts and diverse document formats across the region.

#machine-learning

#llm

#computer-vision

Feb 03 • 12m read time • From blog.bytebytego.com

Table of contents

Understanding Vision LLMs
Selecting the Base Model
Training Dataset Generation
The Experimentation Journey
Four-Stage Training Process
Results and Performance
Key Technical Insights
Conclusion
