Falcon Perception


Falcon Perception is a 0.6B-parameter early-fusion Transformer model for open-vocabulary visual grounding and segmentation from natural language prompts. Unlike modular pipeline approaches, it processes image patches and text in a single shared backbone using a hybrid attention mask (bidirectional for image tokens, causal for text/task tokens). A structured 'Chain-of-Perception' interface decomposes each instance prediction into coordinate, size, and segmentation tokens. On the SA-Co benchmark, it achieves 68.0 Macro-F1 vs. 62.3 for SAM 3, with large gains on attribute-heavy, OCR-guided, spatial, and relational prompts. The team also introduces PBench, a diagnostic benchmark that isolates performance by capability level (L0–L4 plus dense scenes). Additionally, Falcon OCR — a 0.3B variant trained from scratch — achieves 80.3 on olmOCR and 88.6 on OmniDocBench, outperforming larger models including DeepSeek OCR v2 and Mistral OCR 3, with the highest throughput among open-source OCR models. Both models are open-sourced with an inference stack built on PyTorch FlexAttention.
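To make the hybrid attention mask concrete, FlexAttention lets you express it as a small predicate over (query, key) index pairs. The sketch below is not the released inference code, just a minimal illustration assuming image patch tokens occupy a fixed prefix of the sequence; `NUM_IMAGE_TOKENS` and all shapes are made up for the example, and it requires a recent PyTorch (2.5+).

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative sizes only; the real model's token layout is not given here.
NUM_IMAGE_TOKENS = 256           # image patch tokens occupy the sequence prefix
SEQ_LEN = 384                    # image tokens followed by text/task tokens
B, H, D = 1, 8, 64               # batch, heads, head dim
device = "cuda" if torch.cuda.is_available() else "cpu"

def hybrid_mask(b, h, q_idx, kv_idx):
    # Bidirectional attention within the image-token prefix ...
    both_image = (q_idx < NUM_IMAGE_TOKENS) & (kv_idx < NUM_IMAGE_TOKENS)
    # ... and ordinary causal attention everywhere else, so text/task
    # tokens see the full image plus all earlier text tokens.
    causal = kv_idx <= q_idx
    return both_image | causal

block_mask = create_block_mask(
    hybrid_mask, B=None, H=None, Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device=device
)

q = torch.randn(B, H, SEQ_LEN, D, device=device)
k = torch.randn(B, H, SEQ_LEN, D, device=device)
v = torch.randn(B, H, SEQ_LEN, D, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, SEQ_LEN, D)
```

This is the standard prefix-LM pattern (full attention within the prefix, causal attention elsewhere), which matches the "bidirectional for image tokens, causal for text/task tokens" description above.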

15 min read · From huggingface.co
Table of contents
The problem: why do perception systems end up as pipelines?
The architecture: early fusion, hybrid attention, and an efficient dense interface
PBench: a benchmark designed to isolate what is missing
Training: distillation, large-scale data, and a three-stage recipe
Results
Falcon OCR: extending early fusion to document understanding
Inference: Fast, Practical, and Open
The Bigger Picture: A "Bitter Lesson" for Perception
Citation
