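TLDR AI · June 24, 2026, 09:00 · about a 4-minute read

Unlimited OCR Works (GitHub Repo)

#OCR #Computer Vision #Open Source #Baidu #Document Intelligence

TL;DR

Baidu's newly released "Unlimited OCR" is a next-generation OCR technology that uses one-shot learning to parse long documents and complex layouts with high accuracy, setting a new bar for document understanding.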

AI Deep Analysis · June 24, 2026, 16:08

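Key Points

One-shot, long-horizon parsing

For the long documents and complex layouts that conventional OCR struggles with, the project adopts a technique it calls "One-shot Long-horizon Parsing," described here as learning and inference from a single sample (one-shot).

Code and model released together

The code is published on GitHub and pretrained models are available on Hugging Face, so developers can start implementing and validating right away.

Backed by a research paper

The technical details are published on arXiv under the title "Unlimited OCR Works," documenting the method and its reported performance in a preprint.

One-shot Long-horizon Parsing

The project announces the era of "one-shot long-horizon parsing" and presents itself as an OCR tool that takes DeepSeek-OCR a step further.

Demo on Hugging Face

A demo built by AK is available on Hugging Face Spaces, and the model can also be obtained from ModelScope.

Two inference configurations

Single-image processing offers two configurations, selectable by use case: "gundam" (image_size=640 with crop_mode=True, i.e. high-resolution cropping) and "base" (image_size=1024 with crop_mode=False, i.e. the full image at once).

Multi-page and PDF support

A dedicated method, infer_multi, handles multi-page images and PDF files (pages are converted to images via PyMuPDF) and supports long outputs of up to 32,768 tokens.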

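Impact Analysis

This technology could address two long-standing weaknesses of conventional OCR, difficulty with long documents and limited understanding of complex layouts. It may significantly affect document-heavy fields such as automated legal document processing, structuring of academic papers, and digitization of large archives. In particular, the generality attributed to one-shot learning could ease adoption in niche domains where data is hard to collect.

Editorial Comment

"Unlimited OCR" is an important release that marks a shift from merely improving character recognition accuracy toward parsing that maintains context over long spans. The ready-to-try open-source setup is also a plus for developers.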

Welcome the Era of One-shot Long-horizon Parsing.

Release

[2026/06/24] 🤝 Thanks to AK for creating a demo for us. It is now available at Hugging Face Spaces.

[2026/06/23] 📄 Our paper is now available on arXiv.

[2026/06/23] 🤝 Thanks to the ModelScope community for their support. Our model is now available at ModelScope.

[2026/06/22] 🚀 We present Unlimited-OCR, aiming to push Deepseek-OCR one step further.

Inference

Transformers

Inference using Hugging Face Transformers on NVIDIA GPUs. The requirements below were tested with Python 3.12.3 and CUDA 12.9:

code

torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2

code

import os
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'baidu/Unlimited-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

# ── Single image supports two configs: gundam or base ──
# gundam: base_size=1024, image_size=640, crop_mode=True
# base: base_size=1024, image_size=1024, crop_mode=False
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024, image_size=640, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

# ── Multi page / PDF only uses base (image_size=1024) ──
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

# ── PDF (convert pages to images, then multi-page parsing) ──
import tempfile, fitz  # PyMuPDF

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths

model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)
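
The calls above pass `no_repeat_ngram_size=35` together with an `ngram_window` of 128 or 1024, but the README does not explain them. One plausible reading, stated here as an assumption, is the usual no-repeat n-gram constraint restricted to a recent window: a token is blocked if it would complete an n-gram that already occurs among the last `window` tokens. The sketch below illustrates that general technique in plain Python with a small n for readability; `banned_next_tokens` is a hypothetical helper, not the model's or SGLang's implementation.

```python
def banned_next_tokens(tokens, ngram_size, window):
    """Return the tokens that would repeat an n-gram seen in the recent window.

    Assumes ngram_size >= 1 and window >= 1 (illustrative helper only).
    """
    recent = tokens[-window:]  # only the last `window` tokens are considered
    if len(recent) < ngram_size:
        return set()  # no complete n-gram fits inside the window yet
    # The last (n - 1) tokens form the prefix of the n-gram being generated.
    prefix = tuple(recent[len(recent) - ngram_size + 1:])
    banned = set()
    for start in range(len(recent) - ngram_size + 1):
        gram = recent[start:start + ngram_size]
        if tuple(gram[:-1]) == prefix:
            banned.add(gram[-1])  # emitting this token would repeat `gram`
    return banned


history = [1, 2, 3, 1, 2]
wide = banned_next_tokens(history, ngram_size=3, window=128)
narrow = banned_next_tokens(history, ngram_size=3, window=3)
```

With the wide window the earlier trigram `1, 2, 3` is still visible, so `3` is blocked; with the narrow window it has scrolled out and nothing is blocked. Under this reading, a larger window catches repetition loops that recur further apart, at the cost of also blocking legitimate repeats over that span.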

SGLang

Set up the environment (a uv-managed virtualenv). Install the local SGLang wheel first, then pin the kernels package and install PyMuPDF for PDF-to-image conversion:

code

uv venv --python 3.12
source .venv/bin/activate

uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2

Start the SGLang server:

code

python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --page-size 1 \
    --mem-fraction-static 0.8 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --disable-overlap-schedule \
    --skip-server-warmup \
    --host 0.0.0.0 \
    --port 10000
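
Loading the model can take a while, and the command above passes `--skip-server-warmup`, so a client may want to wait until the server answers before sending work. The following is a minimal sketch of such a wait loop using only the standard library. It assumes the OpenAI-compatible `/v1/models` route is served on the same host and port; `server_is_up` and `wait_until_ready` are hypothetical helper names, not part of the repository.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url, timeout=5.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True; give up after `attempts` tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # back off before the next probe
    return False
```

A typical call would be `wait_until_ready(lambda: server_is_up("http://127.0.0.1:10000"))`. Passing the probe and the sleep function in as arguments keeps the loop easy to exercise without a running server.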

Send streaming requests to the OpenAI-compatible API:

code

import base64
import json
import os
import tempfile

import fitz
import requests
from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

server_url = "http://127.0.0.1:10000"

session = requests.Session()
session.trust_env = False

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths

def encode_image(image_path):
    ext = os.path.splitext(image_path)[1].lower()
    mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}

def build_content(prompt, image_paths):
    return [{"type": "text", "text": prompt}] + [encode_image(path) for path in image_paths]

def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": ngram_window,
        },
        "stream": True,
    }
    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1200,
        stream=True,
    )
    response.raise_for_status()

    chunks = []
    for line in response.iter_lines(chunk_size=1, decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        delta = event["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            chunks.append(delta)
    print()
    return "".join(chunks)

# Single image supports two configs: gundam or base. Example below uses gundam.
generate("document parsing.", ["your_image.jpg"], image_mode="gundam", ngram_window=128)

# Multi image (base only)
generate("Multi page parsing.", ["page1.png", "page2.png"], image_mode="base", ngram_window=1024)

# PDF (base only)
generate("Multi page parsing.", pdf_to_images("your_doc.pdf", dpi=300), image_mode="base", ngram_window=1024)
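
The streaming loop in `generate` mixes network I/O with parsing of the `data: ...` lines. As a sketch, the parsing step can be pulled into a small generator that works on any iterable of lines, which makes it possible to check the logic without a server. `iter_sse_deltas` is a hypothetical name, and the sample lines below are made up for illustration rather than captured from a real response.

```python
import json


def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style streaming lines."""
    for line in lines:
        if not line or not line.startswith("data: "):
            continue  # skip keep-alives and blank separator lines
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # end-of-stream sentinel
        event = json.loads(data)
        choices = event.get("choices") or []
        if not choices:
            continue
        delta = (choices[0].get("delta") or {}).get("content")
        if delta:
            yield delta


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "# Title"}}]}',
    'data: {"choices": [{"delta": {"content": "\\nBody"}}]}',
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
text = "".join(iter_sse_deltas(sample))
```

In the client above, the same generator could be fed with `response.iter_lines(decode_unicode=True)`.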

For batch inference, infer.py starts the SGLang server automatically and sends concurrent requests for an image directory or PDF:

code

# Image directory
python infer.py \
    --image_dir ./examples/images \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

# PDF pages
python infer.py \
    --pdf ./examples/document.pdf \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam
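
The README does not show how infer.py implements `--concurrency`. One common way to cap the number of in-flight requests is a thread pool, sketched below under that assumption. `run_bounded` and `fake_ocr` are hypothetical names; in practice the worker would be something like the `generate` function from the client example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(worker, items, concurrency=8):
    """Apply worker to each item with at most `concurrency` calls in flight.

    Results come back in input order; a failure is returned as the exception
    object for that item instead of aborting the whole batch.
    """
    def safe(item):
        try:
            return worker(item)
        except Exception as exc:  # keep going and report per-item failures
            return exc

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(safe, items))


def fake_ocr(path):
    """Stand-in worker used only to demonstrate the pattern."""
    if path.endswith(".txt"):
        raise ValueError(f"not an image: {path}")
    return f"parsed:{path}"


results = run_bounded(fake_ocr, ["a.png", "b.png", "notes.txt"], concurrency=2)
```

Threads suit this case because each worker spends its time waiting on an HTTP response rather than computing, so the pool size mainly controls load on the server.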

Useful options:

code

--model_dir baidu/Unlimited-OCR   # Local path or Hugging Face model ID
--gpu 0                           # CUDA_VISIBLE_DEVICES value
--server_log ./log/sglang_server.log

Visualization

Acknowledgement

We would like to thank Deepseek-OCR, Deepseek-OCR-2, and PaddleOCR for their valuable models and ideas.

Citation

code

@misc{yin2026unlimitedocrworks,
      title={Unlimited OCR Works}, 
      author={Youyang Yin and Huanhuan Liu and YY and Qunyi Xie and Chaorun Liu and Shiqi Yang and Shaohua Wang and Zhanlong Liu and Hao Zou and Jinyue Chen and Shu Wei and Jingjing Wu and Mingxin Huang and Zhen Wu and Guibin Wang and Tengyu Du and Lei Jia},
      year={2026},
      eprint={2606.23050},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.23050}, 
}
