Best of Computer Vision · February 2026

  1.
    Article
    ByteByteGo · 11w

    How Grab Built a Vision LLM to Scan Images

    Grab built a custom 1B-parameter Vision LLM to extract information from Southeast Asian documents for eKYC verification. Starting with Qwen2-VL 2B, the team progressed from LoRA fine-tuning to full-parameter training, then built a lightweight model from scratch that combines Qwen2-VL's vision encoder with Qwen2.5's compact language decoder. The four-stage training process covered projector alignment, vision enhancement, language-specific visual training on synthetic OCR data, and task-specific fine-tuning. The final model achieved accuracy comparable to the 2B version while improving latency by 48-56%, and it addresses the challenges posed by non-Latin scripts and diverse document formats across the region.

  2.
    Video
    The Coding Gopher · 10w

    Meta now has the most insane AI agent

    Meta's AI agent represents a shift from large language models to large action models (LAMs) that can interact with computers through visual understanding and mouse/keyboard control. The system uses vision transformers to parse screen pixels and DOM annotation for web interaction, and it operates in ephemeral sandboxed microVMs for security. By working at the UI layer rather than requiring APIs, it enables probabilistic automation of complex workflows across legacy systems, marking a transition from text-to-text models to systems that turn multimodal input into executable actions.

  3.
    Article
    Product Hunt · 11w

    Polyvia: Pinecone for visual data - Visual Knowledge Index for Agents

    Polyvia is a Visual Knowledge Index that enables AI agents to reason across visual data such as charts, diagrams, and infographics. Unlike traditional tools that only extract or index text, Polyvia uses VLM-OCR extraction to convert visual content into structured logic, creates a graph-based facts ontology to disambiguate information, and provides agentic visual reasoning with audit-ready citations. It is available via an API, an MCP server (for Claude and Cursor), and the Polyvia Studio interface, and it targets developers building multimodal agents as well as knowledge-work teams in research, finance, and healthcare.