Falcon Perception


Falcon Perception is a 0.6B-parameter early-fusion Transformer model for open-vocabulary visual grounding and segmentation from natural language prompts. Unlike modular pipeline approaches, it processes image patches and text in a single shared backbone using a hybrid attention mask (bidirectional for image tokens, causal for text/task tokens). A structured 'Chain-of-Perception' interface decomposes each instance prediction into coordinate, size, and segmentation tokens. On the SA-Co benchmark, it achieves 68.0 Macro-F1 vs. 62.3 for SAM 3, with large gains on attribute-heavy, OCR-guided, spatial, and relational prompts. The team also introduces PBench, a diagnostic benchmark that isolates performance by capability level (L0–L4 plus dense scenes). Additionally, Falcon OCR — a 0.3B variant trained from scratch — achieves 80.3 on olmOCR and 88.6 on OmniDocBench, outperforming larger models including DeepSeek OCR v2 and Mistral OCR 3, with the highest throughput among open-source OCR models. Both models are open-sourced with an inference stack built on PyTorch FlexAttention.
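To make the hybrid attention mask concrete, FlexAttention lets you express it as a small predicate over (query, key) index pairs. The sketch below is not the released inference code, just a minimal illustration assuming image patch tokens occupy a fixed prefix of the sequence; `NUM_IMAGE_TOKENS` and all shapes are made up for the example, and it requires a recent PyTorch (2.5+).

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative sizes only; the real model's token layout is not given here.
NUM_IMAGE_TOKENS = 256           # image patch tokens occupy the sequence prefix
SEQ_LEN = 384                    # image tokens followed by text/task tokens
B, H, D = 1, 8, 64               # batch, heads, head dim
device = "cuda" if torch.cuda.is_available() else "cpu"

def hybrid_mask(b, h, q_idx, kv_idx):
    # Bidirectional attention within the image-token prefix ...
    both_image = (q_idx < NUM_IMAGE_TOKENS) & (kv_idx < NUM_IMAGE_TOKENS)
    # ... and ordinary causal attention everywhere else, so text/task
    # tokens see the full image plus all earlier text tokens.
    causal = kv_idx <= q_idx
    return both_image | causal

block_mask = create_block_mask(
    hybrid_mask, B=None, H=None, Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device=device
)

q = torch.randn(B, H, SEQ_LEN, D, device=device)
k = torch.randn(B, H, SEQ_LEN, D, device=device)
v = torch.randn(B, H, SEQ_LEN, D, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, SEQ_LEN, D)
```

This is the standard prefix-LM pattern (full attention within the prefix, causal attention elsewhere), which matches the "bidirectional for image tokens, causal for text/task tokens" description above.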

15 min read · From huggingface.co
Table of contents
The problem: why do perception systems end up as pipelines?
The architecture: early fusion, hybrid attention, and an efficient dense interface
PBench: a benchmark designed to isolate what is missing
Training: distillation, large-scale data, and a three-stage recipe
Results
Falcon OCR: extending early fusion to document understanding
Inference: Fast, Practical, and Open
The Bigger Picture: A "Bitter Lesson" for Perception
Citation
