Best of Computer Vision — February 2026

1
Article
ByteByteGo·11w
How Grab Built a Vision LLM to Scan Images
Grab built a custom 1B-parameter Vision LLM to extract information from Southeast Asian documents for eKYC verification. Starting with Qwen2-VL 2B, the team moved from LoRA fine-tuning to full-parameter training, then built a lightweight model from scratch that combines Qwen2-VL's vision encoder with Qwen2.5's compact language decoder. Training ran in four stages: projector alignment, vision enhancement, language-specific visual training on synthetic OCR data, and task-specific fine-tuning. The final model reached accuracy comparable to the 2B version while cutting latency by 48-56%, and it addresses the region's non-Latin scripts and diverse document formats.
2
Video
The Coding Gopher·10w
Meta now has the most insane AI agent
Meta's AI agent represents a shift from large language models to large action models (LAMs), which interact with computers through visual understanding and mouse/keyboard control. The system uses vision transformers to parse screen pixels and DOM annotation for web interaction, and it runs in ephemeral sandboxed microVMs for security. Because it works at the UI layer rather than requiring APIs, it enables probabilistic automation of complex workflows across legacy systems, marking a transition from text-to-text models to systems that turn multimodal input into executable actions.
3
Article
Product Hunt·11w
Polyvia: Pinecone for visual data - Visual Knowledge Index for Agents
Polyvia is a Visual Knowledge Index that lets AI agents reason across visual data such as charts, diagrams, and infographics. Unlike traditional tools that only extract or index text, Polyvia uses VLM-OCR extraction to convert visual content into structured logic, builds a graph-based facts ontology to disambiguate information, and provides agentic visual reasoning with audit-ready citations. It is available via an API, an MCP server (for Claude and Cursor), and the Polyvia Studio interface, and it targets developers building multimodal agents as well as knowledge-work teams in research, finance, and healthcare.