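Falcon Perception
Falcon Perception is a 0.6B-parameter early-fusion Transformer that processes image patches and text in a single sequence and outperforms SAM 3 on open-vocabulary grounding and segmentation.
Key points
Early-fusion design in a single architecture
Rather than relying on a conventional modular pipeline (a frozen vision backbone plus a separate fusion/decoder stage), the model processes image patches, text, and task tokens in one autoregressive Transformer, a design intended to avoid the accumulated complexity and scaling difficulties of pipelines.
Hybrid attention and an efficient output interface
Image patches and text are treated as one unified sequence. A hybrid attention mask, structured token outputs, and lightweight output heads let the model generate a variable number of instances efficiently and accurately.
PBench diagnostic benchmark released
PBench is a new benchmark that diagnoses performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and on dense, long-context scenes, making the model's strengths and weaknesses (notably presence calibration) explicit.
Falcon OCR released alongside, with high throughput
The 0.3B-parameter Falcon OCR model scores highly on olmOCR and OmniDocBench and has the highest throughput among open-source OCR models.
Single backbone with hybrid attention
Image tokens attend bidirectionally to build global visual context, while text and task tokens predict causally. This hybrid attention mask lets a single Transformer handle both kinds of structure.
Chain-of-Perception: a coarse-to-fine prediction order
Fixing the order <coord> → <size> → <seg> settles geometry first, which reduces ambiguity in mask prediction and improves conditioning, so dense outputs can be generated efficiently.
Dedicated lightweight heads and Fourier feature encoding
Coordinate and size prediction uses Fourier feature encoding with a random Gaussian projection to overcome spectral bias and localize precisely. Segmentation is handled cheaply with a dot product.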
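Impact analysis
This article points toward consolidating open-vocabulary visual understanding, which has tended to rely on modular pipelines, into a single architecture. That would lower development cost, make models easier to scale, and simplify deployment in production. In particular, releasing a capability-level diagnostic benchmark such as PBench is likely to give the open-source community clearer guidance for improving models.
Editor's comment
Moving to a single architecture that removes pipeline complexity is an important design idea that bears directly on operating cost. Releasing a capability-level diagnostic benchmark such as PBench should clarify where the open-source community can improve its models.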

Falcon Perception

TL;DR — Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in one sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3) with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense long-context crowded scenes.
We also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench while having the highest throughput of any open-source OCR model.
This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.
The problem: why do perception systems end up as pipelines?
Many open-vocabulary perception systems are built as modular pipelines: an (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. This family of designs works well in many settings, but it comes with trade-offs: it can be hard to scale cleanly, hard to attribute improvements to the right component, and easy to accumulate complexity as we add a new fix for each failure mode.
We asked a simpler question: can a single early-fusion Transformer backbone handle both perception and language modeling, if we choose the right attention pattern, output interface, and training signal?
In our experiments, the answer is largely yes. The rest of this post describes the main design choices and the evidence behind them.
The architecture: early fusion, hybrid attention, and an efficient dense interface

A single autoregressive Transformer processes a unified sequence of image patches, text, and task tokens. The model predicts object properties in a fixed order: <coord> → <size> → <seg>.
One Backbone, Two Behaviors
At its core, Falcon Perception is a dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. Instead of a separate vision backbone followed by a late-fusion decoder, we keep a single backbone and rely on masking and a lightweight output interface to make the dense prediction problem tractable.
Images and text have different structure: pixels are 2D and benefit from bidirectional context, while the prediction interface is naturally sequential. We address this with a hybrid attention mask:
- Image tokens attend to all other image tokens bidirectionally, building a global visual context (like a vision encoder would).
- Text and task tokens attend causally to everything before them: the full visual prefix plus preceding text.
This allows the same backbone to behave like a bidirectional visual encoder on image tokens, while still supporting autoregressive prediction over task tokens.
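A minimal sketch of such a mask, assuming the image tokens form a prefix of the sequence and that `True` means "may attend". The function name and the nested-list layout are illustrative choices, not the released implementation.

```python
def hybrid_attention_mask(num_image: int, num_text: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumption: the first `num_image` positions are image tokens and the
    remaining `num_text` positions are text/task tokens.
    """
    total = num_image + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image:
                # Image tokens see every image token, in both directions.
                mask[q][k] = k < num_image
            else:
                # Text/task tokens see the whole visual prefix and earlier text.
                mask[q][k] = k <= q
    return mask


mask = hybrid_attention_mask(num_image=3, num_text=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With three image tokens and two text tokens, the first three rows print `xxx..` (a full bidirectional block) and the last two print `xxxx.` and `xxxxx` (causal over the prefix plus earlier text).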
Chain-of-Perception: coarse-to-fine supervision for dense outputs
Dense perception is not a fixed-size prediction problem: an image may contain zero instances or hundreds. Autoregressive generation gives a clean variable-length interface, but fully autoregressive dense generation (e.g., polygons or high-resolution masks token-by-token) quickly becomes expensive.
We use a small structured interface, Chain-of-Perception, which decomposes each instance into three steps:
<coord> → <size> → <seg>
- Coordinate token: The model first predicts the center of the instance, resolving which object it's talking about.
- Size token: Then the spatial extent, resolving how big it is.
- Segmentation token: Finally, a single embedding that, when dot-producted with upsampled image features, produces a full-resolution binary mask.
This ordering is deliberate. Committing to geometry first reduces ambiguity (“which instance?”), and makes the mask prediction step closer to pixel refinement conditioned on the resolved object.
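To make the interface concrete, here is a small sketch of how a decoded stream of steps could be grouped into instances. The `Instance` class, the `(step_name, payload)` representation, and `group_instances` are hypothetical names for illustration; the model's actual token format is not specified in this post.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    center: tuple[float, float]   # decoded at the <coord> step
    size: tuple[float, float]     # decoded at the <size> step
    seg_embedding: list[float]    # emitted at the <seg> step


ORDER = ("coord", "size", "seg")


def group_instances(steps: list[tuple[str, object]]) -> list[Instance]:
    """Group a flat list of (step_name, payload) pairs into instances.

    Raises ValueError if the steps do not follow coord -> size -> seg.
    """
    if len(steps) % len(ORDER) != 0:
        raise ValueError("incomplete instance at end of sequence")
    instances = []
    for i in range(0, len(steps), len(ORDER)):
        chunk = steps[i:i + len(ORDER)]
        names = tuple(name for name, _ in chunk)
        if names != ORDER:
            raise ValueError(f"expected {ORDER}, got {names}")
        center, size, seg = (payload for _, payload in chunk)
        instances.append(Instance(center, size, seg))
    return instances


# Made-up payloads: two instances, three steps each.
decoded = [
    ("coord", (0.25, 0.40)), ("size", (0.10, 0.20)), ("seg", [0.3, -1.2]),
    ("coord", (0.70, 0.55)), ("size", (0.30, 0.25)), ("seg", [1.1, 0.4]),
]
instances = group_instances(decoded)
print(len(instances), instances[0].center)
```

An empty stream yields an empty list, which corresponds to the zero-instance case; each extra instance costs only three steps regardless of mask resolution.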
Specialized Heads, Minimal Overhead
The backbone is shared, while decoding uses lightweight heads tailored to the output type:
- Coordinate & Size Heads use Fourier feature encoding: continuous coordinates are mapped through a random Gaussian projection into a high-dimensional sinusoidal space. This overcomes the spectral bias of neural networks, yielding more precise localization than discrete binning alone. Decoded coordinates are re-injected into the sequence as conditioning for subsequent tokens.
- Segmentation Head computes a dot product between the <seg> token embedding and the upsampled image features.
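The two operations can be sketched in a few lines. This is an illustration under assumptions: the number of frequencies, the scale `sigma`, and the zero threshold on the mask logits are hypothetical values, and the layers that surround these steps in the real heads are not shown.

```python
import math
import random


def gaussian_projection(in_dim, num_freqs, sigma, seed=0):
    """Sample a fixed (num_freqs x in_dim) matrix with entries ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(in_dim)] for _ in range(num_freqs)]


def fourier_features(point, projection):
    """Map a continuous point p to [cos(2*pi*B p), sin(2*pi*B p)]."""
    angles = [2.0 * math.pi * sum(b * p for b, p in zip(row, point))
              for row in projection]
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


def dot_product_mask(seg_embedding, features, threshold=0.0):
    """features is an H x W grid of C-dim vectors; returns an H x W boolean mask."""
    return [[sum(e * f for e, f in zip(seg_embedding, pixel)) > threshold
             for pixel in row]
            for row in features]


B = gaussian_projection(in_dim=2, num_freqs=4, sigma=10.0)
encoded = fourier_features((0.25, 0.40), B)
print(len(encoded))  # 8 values: 4 cosines followed by 4 sines

# Toy 2 x 2 feature map with 2 channels per pixel.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[-1.0, 0.0], [0.5, 0.5]]]
toy_mask = dot_product_mask([1.0, -1.0], features)
print(toy_mask)  # [[True, False], [False, False]]
```

The mask costs one dot product per pixel for each instance, which is why a single <seg> embedding is enough to produce a full-resolution output.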
PBench: a benchmark designed to isolate what is missing
Existing referring-expression benchmarks like RefCOCO are saturated — models routinely hit 90%+ — and they conflate what went wrong. Did the model fail because it can't read text? Can't understand spatial relationships? Can't handle a crowd?
We introduce PBench, a diagnostic benchmark that separates samples by the dominant capability required:
- Attributes & subtypes: "red car", "broken fence"
- OCR-guided identification: "Diet Coke bottle", "Nike shoes"
- Spatial understanding: "car on the left", "third window from left"
- Relations & interactions: "person holding umbrella", "tallest building"
- Crowdedness stress test: hundreds of instances per image
Each sample targets one dominant capability: OCR prompts avoid spatial qualifiers, and spatial prompts avoid in-image text disambiguators. This yields a capability profile rather than a single opaque score, and makes it easier to decide where to invest next (data, training curriculum, or post-training).
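A capability profile is simply a per-capability aggregate in place of one global average. The sketch below assumes each sample carries exactly one capability tag and one score; the tag names and the scores are made up to show the shape of the result and are not PBench data.

```python
from collections import defaultdict


def capability_profile(results):
    """Average a per-sample score within each capability tag.

    `results` is a list of (capability, score) pairs, one tag per sample.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for capability, score in results:
        totals[capability][0] += score
        totals[capability][1] += 1
    return {cap: total / count for cap, (total, count) in totals.items()}


# Hypothetical scores, for illustration only.
results = [
    ("attributes", 0.9), ("attributes", 0.7),
    ("ocr", 0.4),
    ("spatial", 0.6), ("spatial", 0.8),
]
profile = capability_profile(results)
overall = sum(score for _, score in results) / len(results)
print(profile)
print(overall)
```

In this toy data the single overall number hides that the OCR bucket is far weaker than the others, which is the kind of signal a per-capability breakdown is meant to expose.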
Training: distillation, large-scale data, and a three-stage recipe
Multi-Teacher Distillation
Rather than training from random weights (which in our ablations was unstable for segmentation), Falcon Perception initializes via multi-teacher distillation. Two strong vision teachers contribute complementary signals:
DINOv3 (ViT-H): strong local features critical for segmentation
SigLIP2: language-aligned features for open-vocabulary understanding
The distilled initialization achieves 74.25% zero-shot accuracy on ImageNet-1k and 85.11% linear-probe mIoU on Pascal VOC, providing a strong visual foundation before perception-specific training.
Data: 54M Images, 195M Positive Expressions, 488M Hard Negatives
We build the training set through a multi-stage pipeline:
Hierarchical clustering of web-scraped images via DINOv3 embeddings to ensure uniform concept coverage.
VLM-driven listing generates dense object descriptions per image, categorized by PBench complexity level (60% basic, 40% advanced).
Negative mining produces semantic, visual, and fine-grained hard negatives to combat hallucination.
Ensemble consensus — SAM 3, Qwen3-VL-30B, and Moondream3 must agree (IoU > 0.8) for automatic acceptance.
Human verification — disagreements go to annotators, recovering hard samples that confuse automated systems.
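The consensus step can be sketched as a pairwise IoU check. The post only states that the three models must agree at IoU > 0.8, so requiring every pair to pass, representing masks as pixel sets, and treating two empty masks as agreeing are all assumptions made here.

```python
from itertools import combinations


def rect(r0, c0, r1, c1):
    """Pixels of an axis-aligned rectangle: rows r0..r1-1, columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def mask_iou(a: set, b: set) -> float:
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    # Assumption: two empty masks count as perfect agreement.
    return len(a & b) / union if union else 1.0


def route_sample(masks: list, threshold: float = 0.8) -> str:
    """Accept automatically only if every pair of masks exceeds the threshold."""
    agree = all(mask_iou(a, b) > threshold for a, b in combinations(masks, 2))
    return "auto_accept" if agree else "human_review"


full = rect(0, 0, 10, 10)      # 100 pixels
shifted = rect(0, 1, 10, 10)   # 90 pixels, IoU 0.9 with `full`
half = rect(0, 5, 10, 10)      # 50 pixels, IoU 0.5 with `full`

print(route_sample([full, full, shifted]))  # auto_accept
print(route_sample([full, full, half]))     # human_review
```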
We maintain a strict 1:1 ratio of positive to negative samples. This makes presence calibration a first-class target: the model should reliably say “absent,” not only draw masks when confident.
The Three Stages (~700 GT Total)
Stage 1 — In-Context Listing (450 GT): The model learns to autoregressively list scene inventories — predicting text expressions and their locations. Full causal attention between queries enables learning of object co-occurrence ("fork, then knife, then plate"). This builds broad scene understanding.
Stage 2 — Task Alignment (225 GT): The attention mask is modified so queries can no longer see each other, simulating independent queries at inference time. Loss on text tokens is masked, focusing gradient signal entirely on presence classification and localization. This stage transitions from "scene understanding" to "answer this specific question."
Stage 3 — Long-Context Finetuning (10 GT): A short phase with the mask limit raised to 600 per expression and a minimal constant learning rate. This adapts the model for extreme crowd density without forgetting earlier capabilities.
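The Stage 2 mask change can be illustrated with a small sketch: every token still sees the image prefix, but a query's tokens no longer see other queries. The packing layout (image prefix followed by consecutive queries) and the function name are assumptions for illustration.

```python
def independent_query_mask(num_image: int, query_lengths: list[int]) -> list[list[bool]]:
    """mask[q][k] is True when position q may attend to position k.

    Assumption: the sequence is an image prefix followed by the queries,
    each occupying `query_lengths[i]` consecutive positions.
    """
    # segment[i] is 0 for image tokens and 1..n for the tokens of each query.
    segment = [0] * num_image
    for index, length in enumerate(query_lengths, start=1):
        segment += [index] * length
    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if segment[k] == 0:
                # Every token may attend to the image prefix.
                mask[q][k] = True
            elif segment[q] == segment[k]:
                # Within one query, attention stays causal.
                mask[q][k] = k <= q
    return mask


stage2 = independent_query_mask(num_image=2, query_lengths=[2, 2])
for row in stage2:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid the second query's rows (`xx..x.` and `xx..xx`) show no access to the first query's positions, matching the independent-query setting described for inference.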
Key design choices validated through ablations:
Muon optimizer for the specialized heads (vs. AdamW) — yields +4.8 points on SA-Co detection
Raster ordering of instances (vs. random/size) — +10 points over random ordering on SA-Co
Gram feature regularization — prevents drift from the distillation features, improving segmentation by +1.5 points
Global loss normalization across ranks — corrects bias from variable-length packed sequences in FSDP
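Two of these choices are easy to show with toy numbers. The sketch below simulates ranks with a plain list instead of a real distributed all-reduce, and it assumes raster order means sorting instance centers top-to-bottom and then left-to-right; both are illustrative assumptions, and the numbers are made up.

```python
def mean_of_rank_means(rank_stats):
    """Average each rank's mean loss; biased when token counts differ."""
    return sum(loss_sum / tokens for loss_sum, tokens in rank_stats) / len(rank_stats)


def global_token_mean(rank_stats):
    """Divide the summed loss by the summed token count across all ranks."""
    total_loss = sum(loss_sum for loss_sum, _ in rank_stats)
    total_tokens = sum(tokens for _, tokens in rank_stats)
    return total_loss / total_tokens


def raster_order(centers):
    """Sort (x, y) centers top-to-bottom, then left-to-right."""
    return sorted(centers, key=lambda c: (c[1], c[0]))


# (loss_sum, num_tokens) per rank: packed sequences give unequal token counts.
rank_stats = [(10.0, 10), (90.0, 30)]
print(mean_of_rank_means(rank_stats))  # 2.0
print(global_token_mean(rank_stats))   # 2.5

ordered = raster_order([(0.8, 0.1), (0.2, 0.5), (0.1, 0.1)])
print(ordered)  # [(0.1, 0.1), (0.8, 0.1), (0.2, 0.5)]
```

The rank with fewer tokens is over-weighted by the mean of means (2.0 instead of 2.5), which is the bias that global normalization removes.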
SA-Co: Best-in-Class Mask Quality
On the SA-Co open-vocabulary segmentation benchmark, Falcon Perception (0.6B parameters) achieves 68.0 Macro-F1, compared to 62.3 for SAM 3, with large gains on attribute-heavy (+8.2), food & drink (+12.2), and sports equipment (+4.0) splits. At the same time, Falcon Perception lags SAM 3 on presence calibration (MCC: 0.64 vs 0.82), which is the clearest remaining improvement axis.
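For reference, the Matthews correlation coefficient (MCC) is computed from a binary confusion matrix. The sketch below assumes it is applied to present/absent decisions; how SA-Co aggregates it across prompts is not described in this post, and the counts are made up.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 100 prompts, 80 correct presence decisions.
print(mcc(tp=40, tn=40, fp=10, fn=10))  # 0.6
print(mcc(tp=50, tn=50, fp=0, fn=0))    # 1.0
```

Because false positives on absent concepts lower MCC directly, a model can have strong mask quality on present objects and still score poorly here.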
Here's an example output — the prompt "Falcon" produces precise instance masks:

Falcon Perception also performs well on referring expressions, correctly segmenting the burger with a black bun in each frame of the video:

PBench: Scaling with Prompt Complexity
This is where the early-fusion design shows the largest differences:
[Table: PBench results by prompt level for Falcon Perception and baseline models, starting from L0 (simple objects)]
On simple objects, the gap is modest. As prompts become more compositional—requiring OCR-guided disambiguation, spatial constraints, or relational binding—the gap widens.
In our PBench Dense split, Falcon Perception (0.6B) substantially outperforms generalist VLM baselines (e.g., 72.6 vs 8.9 for Qwen3-VL-30B in our evaluation setup), and matches or exceeds the 8B model on spatial and relational tiers.
Qualitative Results: OCR, Spatial, Relational, and Dense
As prompts grow more compositional — requiring OCR-guided disambiguation, spatial constraints, relational binding, or scaling to hundreds of instances — the early-fusion advantage becomes visually clear:
OCR-Guided Grounding (Level 2): When the distinguishing signal is text written on an object, Falcon Perception reads it correctly while SAM 3 cannot differentiate.
Spatial Understanding (Level 3): When prompts specify spatial relationships, Falcon Perception forms a coherent 2D scene map.
Relational Reasoning (Level 4): When the target is defined through interactions rather than appearance, Falcon Perception understands the scene graph.
Dense Scenes: Scaling to Hundreds of Instances: The autoregressive interface is particularly useful when scenes are extremely crowded, where fixed-query decoders can run into practical limits.

"168 wine bottles": Falcon Perception identifies the bottles labeled "168", while SAM 3 highlights every bottle. "Honolulu direction sign": Falcon reads the text to find the right sign.

"Lower meat skewer on left grill," "black car to the right of red car at bottom," "Belgian flag on the left" — Falcon Perception resolves the correct instance from spatial constraints. SAM 3 predicts false positives for multiple candidates.

"Pastry next to brown round bread," "person using phone," "person holding helmet in hand" — Falcon Perception identifies the interacting instance. SAM 3 highlights all instances of the object class, ignoring the relational constraint.

"Snow goose," "pigeon," "colorful canned drinks" — Falcon Perception autoregressively segments hundreds of instances. SAM 3's fixed-size decoder runs out of query tokens beyond ~200 instances.
Falcon OCR: extending early fusion to document understanding
Modern OCR has moved well beyond extracting text from clean scans. Today's systems must handle multi-column layouts, mathematical formulas, tables, charts, and multilingual content — all in one pass. Most competitive OCR VLMs tackle this with a familiar recipe: a vision encoder feeding a separate text decoder, plus task-specific glue. These systems work, but they tend to be large (1B–3B+ parameters).
We took a different path: reuse the same early-fusion dense Transformer from Falcon Perception, but train a smaller 0.3B-parameter variant from scratch specifically for OCR. The result is Falcon OCR — a single backbone that processes image patches and text tokens in a shared parameter space with the same hybrid attention mask (bidirectional for image tokens, causal for text tokens), and switches tasks through prompts rather than additional modules.
We trained from scratch (no multi-teacher distillation) because the visual features OCR needs — fine-grained glyph recognition, stroke-level discrimination — differ substantially from the object-level features useful for segmentation. Starting fresh lets the backbone develop text-optimized representations from the ground up.
We train on a curated English-language mixture spanning three core tasks: general document text parsing (digital PDFs, old scans, typewritten documents), mathematical and scientific formula recognition, and table structure recognition. The mixture also includes handwriting, real-world scene text, and synthetic samples generated from rendered LaTeX and HTML sources. The training objective is pure next-token prediction on structured text outputs.
Training proceeds in two phases: a long pre-training phase at constant learning rate where the model learns core OCR capabilities across all element types, followed by a short cosine-decay finetuning phase where the learning rate is annealed to near zero.
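A schedule of this shape can be written as one function. The step counts and learning rates below are hypothetical placeholders, since the post does not give the actual values.

```python
import math


def learning_rate(step, constant_steps, decay_steps, base_lr, final_lr=0.0):
    """Constant learning rate, then cosine decay from base_lr to final_lr."""
    if step < constant_steps:
        return base_lr
    progress = min(1.0, (step - constant_steps) / decay_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 1000 constant steps, then 100 decay steps.
steps = (0, 999, 1000, 1050, 1100)
schedule = [
    learning_rate(s, constant_steps=1000, decay_steps=100, base_lr=3e-4, final_lr=3e-6)
    for s in steps
]
for s, lr in zip(steps, schedule):
    print(s, lr)
```

The rate stays at `base_lr` through the long first phase, passes the midpoint between `base_lr` and `final_lr` halfway through the decay, and ends at `final_lr`.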
Benchmark results
We evaluate on olmOCR (binary correctness checks across diverse inputs) and OmniDocBench (continuous metrics over full-page parses). All comparison models are significantly larger and/or use proprietary infrastructure. At 80.3% on olmOCR with only 0.3B parameters, Falcon OCR is within 1.7 points of the top system and leads all models on Multi-Column (87.1%) and Tables (90.3%). On OmniDocBench it scores 88.64 overall, ahead of DeepSeek OCR v2, GPT 5.2, and Mistral OCR 3.
Serving throughput
At 0.3B parameters, Falcon OCR is roughly 3x smaller than 0.9B-class OCR VLMs, which translates directly into higher serving throughput. Measured on a single A100-80GB with vLLM at high concurrency:
Full pipeline: layout detection → crop → per-region OCR
The compact footprint and vLLM integration (continuous batching, PagedAttention, optimized CUDA kernels) make it practical for large-scale document digitization where millions of pages need processing.
What we see in the results
More broadly, these results suggest that the early-fusion single-stack Transformer is a viable alternative to the "vision encoder plus text decoder" recipe for OCR. One backbone, shared parameter space, one decoding interface, and better data and training signals rather than increasingly complex pipelines. We hope this encourages more work in this direction.
Qualitative examples
Falcon OCR processes images captured under challenging real-world conditions with varying lighting, diverse text semantics (mathematical formulae, structured tables, handwritten notes), and complex document layouts, to produce structured text output.

Falcon OCR extracts text from handwritten documents and real-world photographs with variable lighting, orientation, and content complexity.

Falcon OCR accurately reproduces cell entries and structural layout from tables of varying formats and complexity.

Falcon OCR correctly transcribes mathematical expressions ranging from simple equations to multi-line derivations with nested operators.

Falcon OCR preserves reading order and structural fidelity when extracting text from documents with multi-column layouts, figures, and footnotes.
Inference: Fast, Practical, and Open
The release includes an inference stack built on PyTorch’s FlexAttention, which makes it practical to express the custom attention patterns and efficiently serve packed variable-length sequences.
Paged Inference Engine
Paged KV cache with virtual page tables (no wasted memory from padding)
Continuous batching: new sequences enter mid-generation, finished ones release pages immediately
CUDA graph capture for the decode loop
Background tokenization overlapped with GPU compute
HR feature cache: LRU cache with pinned-memory buffers for async GPU-CPU transfer of upsampled image features — subsequent queries on the same image skip the expensive upsampling step
In our setup on an H100, typical latencies are on the order of ~100ms prefill, ~200ms upsampling (0ms if cached), and ~50ms decode for a handful of instances. (These numbers depend on resolution, sequence length, and the number of predicted instances.)
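The caching idea behind the "0ms if cached" case can be sketched with a small LRU cache. This models only the eviction logic: pinned-memory buffers and asynchronous transfers are not represented, and the class and callback names are hypothetical.

```python
from collections import OrderedDict


class FeatureCache:
    """LRU cache keyed by image id; `compute` stands in for the upsampling step."""

    def __init__(self, capacity, compute):
        self.capacity = capacity
        self.compute = compute
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, image_id):
        if image_id in self._entries:
            # Cache hit: mark as most recently used and skip recomputation.
            self._entries.move_to_end(image_id)
            return self._entries[image_id]
        self.misses += 1
        features = self.compute(image_id)
        self._entries[image_id] = features
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return features


cache = FeatureCache(capacity=2, compute=lambda image_id: f"features:{image_id}")
for image_id in ["a", "a", "b", "c", "a"]:
    cache.get(image_id)
print(cache.misses)  # 4
```

The second query on image "a" is a hit, and the final one is a miss because "a" was evicted once "b" and "c" filled the two slots.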
Docker and MLX Integration for Falcon OCR
For the Falcon OCR model, we also provide a vLLM Docker server for fast deployment and MLX integration for Apple Silicon.
Please check out the GitHub repo for details.
The Bigger Picture: A "Bitter Lesson" for Perception
Falcon Perception is intentionally minimal: one backbone, one objective family, and small heads only where outputs are continuous and dense. The working assumption is that most gains should come from data, compute, and training signals, rather than continually expanding the pipeline with specialized modules.
The architecture doesn't block any obvious scaling path: add more images and harder prompts for better grounding, mix in text-only data for better language, increase context length for denser scenes. It's still just one sequence model.
Falcon Perception is developed by the Falcon Vision Team at the Technology Innovation Institute (TII), Abu Dhabi, UAE.
If you use Falcon-Perception, please cite
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
