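Building a Fast Multilingual OCR Model with Synthetic Data
NVIDIA has released Nemotron OCR v2, a fast multilingual OCR model built with synthetic data. It sidesteps the usual data collection bottleneck, reaching high accuracy with 12 million synthetic images and a throughput of 34.7 pages/second.
Key points
Synthetic data solves the data collection problem
Collecting high-quality annotated data has been the main obstacle in building OCR models. Rendering text programmatically produces large volumes of exactly labeled data instead.
Performance gains in Nemotron OCR v2
Trained on 12 million synthetic images across six languages, the model improves NED on non-English languages from 0.56 to 0.92 down to 0.035 to 0.069, and processes 34.7 pages/second on a single A100 GPU.
A more efficient architecture
The recognizer and the relational model share the detection backbone's features, which removes redundant computation and delivers both speed and accuracy.
Public resources and practical use
The dataset (nvidia/OCR-Synthetic-Multilingual-v1) and the model (nvidia/nemotron-ocr-v2) are public, and a browser demo is available to encourage practical use.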
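Impact analysis
By using synthetic data to get around the usual data collection bottleneck, this work delivers multilingual support and fast processing, which stands to have a significant impact on document digitization and multilingual document processing. The public release of the dataset and model should accelerate both research and real-world adoption.
Editor's comment
A good example of synthetic data driving both OCR accuracy and multilingual coverage. With the resources public, the barrier to practical use drops, and industrial adoption is likely to accelerate.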

Building a Fast Multilingual OCR Model with Synthetic Data
Training a high-quality OCR model requires a large quantity of annotated image-text pairs: images with precise bounding boxes, transcriptions, and ideally reading order information at the word, line, and paragraph level. Every approach to curating this data comes with tradeoffs. Existing benchmark datasets like ICDAR and Total-Text have clean labels but limited scale, typically tens of thousands of images skewed toward English and Chinese. Manual annotation produces the highest quality labels but is expensive and slow, making it impractical at the millions-of-images scale needed for robust multilingual models. Web-scraped PDFs offer enormous quantity, but the embedded text is often noisy: characters recorded as individual strokes instead of words, text baked into images with no extractable layer, or scanned pages where a weak OCR model was applied and the resulting text layer is unreliable. You can extract usable signal from web PDFs, but it takes significant filtering effort and the result is never perfectly clean.
Synthetic data generation offers a way out of these tradeoffs. By rendering text onto images programmatically, we get both the scale of web scraping and the label purity of hand annotation. Every bounding box, transcription, and reading order relationship is known exactly because we placed it there, and we have full control over which layouts, font styles, and edge cases appear in the training set. The challenge is realism. Simulating diverse layouts and realistic document scenarios is difficult, but with the right rendering engine and strong randomization across fonts, colors, backgrounds, augmentations, and layout structures, it is possible to build enough invariance that models trained on synthetic data generalize well to real-world documents.
Using this approach, we built Nemotron OCR v2, a multilingual OCR model that is both accurate and fast. Accuracy is driven by data: 12 million synthetic training images across six languages brought NED scores from 0.56–0.92 down to 0.035–0.069 on non-English languages. Speed is driven by architecture: a shared detection backbone whose features are reused by both the recognizer and relational model, eliminating redundant computation and enabling 34.7 pages/second on a single A100 GPU. The synthetic data pipeline is generic enough to extend to any language for which fonts and source text exist.
The dataset is publicly available at nvidia/OCR-Synthetic-Multilingual-v1 and the model at nvidia/nemotron-ocr-v2. You can try the model directly in the browser at the Nemotron OCR v2 demo.

The Problem: Data, Not Architecture
Nemotron OCR v1 was a strong English OCR model, but it was not trained on multilingual data, so it failed to read documents in other languages accurately. On our SynthDoG benchmark, v1 produced Normalized Edit Distance (NED) scores between 0.56 and 0.92 for Japanese, Korean, Russian, and Chinese. At these error rates, the model output bears little resemblance to the ground truth.
[Table: Nemotron OCR v1 NED by language, with rows including Chinese (Simplified) and Chinese (Traditional)]
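For readers unfamiliar with the metric, here is a minimal sketch of Normalized Edit Distance. It assumes the common convention of dividing the Levenshtein distance by the length of the longer string; the article does not state which normalization the benchmark uses, so treat that choice, and the function names, as assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # drop a character of a
                            curr[j - 1] + 1,      # drop a character of b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return edit_distance(pred, truth) / longest


score = normalized_edit_distance("kitten", "sitting")  # 3 edits / 7 characters
```

Under this convention a score of 0 means a perfect transcription and a score near 1 means almost every character is wrong, which is why values of 0.56 to 0.92 are effectively unusable.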
Part of the issue was the character set. The v1 model supported only 855 characters, which simply did not cover CJK (Chinese, Japanese, Korean) or Cyrillic scripts. We ran an experiment where we expanded the character set to 14,244 characters to cover all the target languages. This helped slightly, but without sufficient training data actually containing those characters, the improvement was marginal. The model could theoretically output the right characters, but it had never learned what they looked like. The bottleneck was data, not architecture.
Collecting and annotating millions of real-world images across six languages with word-, line-, and paragraph-level bounding boxes plus reading order graphs would be prohibitively expensive. We needed a different approach.
A Generic Synthetic Data Pipeline
Our key insight is that the recipe for multilingual OCR training data is fundamentally language-agnostic. You need two ingredients:
Source text in the target language, drawn from a realistic distribution
Fonts that can render that language's script
Given those, a synthetic renderer can produce unlimited annotated training images with pixel-perfect ground truth at every level of granularity for free.
For source text, we use mOSCAR, a large-scale multilingual web corpus covering 163 language subsets across dozens of scripts including Latin, CJK, Cyrillic, Arabic, Devanagari, and Thai. Sampling from mOSCAR gives us text that follows a realistic distribution of vocabulary, sentence length, and character frequency for each language. This is far more representative than dictionary word lists or machine-generated text.
Rendering: Modified SynthDoG
We built our pipeline on a heavily modified version of SynthDoG (Synthetic Document Generator) from the Donut project. The original SynthDoG generates document-like images with page-level text labels. We extended it in several important ways.
Multi-level bounding boxes. Vanilla SynthDoG provides only page-level text. Our pipeline generates pixel-precise annotations at three levels simultaneously: word, line, and paragraph. Each level includes both axis-aligned bounding boxes and 4-point quads, with indices linking words to their parent lines and paragraphs.
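To illustrate why a renderer gets these annotations for free, here is a minimal sketch of a toy layout engine that places words itself and therefore knows every box exactly. It is not the modified SynthDoG code: the fixed-width glyph metric, the greedy line wrapping, and all names (`Word`, `layout`, `to_quad`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), axis-aligned


@dataclass
class Word:
    text: str
    box: Box
    line: int       # index of the parent line
    paragraph: int  # index of the parent paragraph


def union(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box enclosing all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def to_quad(box: Box) -> List[Tuple[int, int]]:
    """4-point quad (clockwise from top-left) for an axis-aligned box."""
    x0, y0, x1, y1 = box
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


def layout(paragraphs: List[str], page_width: int = 200, char_w: int = 10,
           line_h: int = 20, para_gap: int = 10):
    """Place words with a hypothetical fixed-width font and greedy wrapping."""
    words: List[Word] = []
    line_idx, y = -1, 0
    for p_idx, text in enumerate(paragraphs):
        tokens = text.split()
        if not tokens:
            continue  # skip empty paragraphs
        line_idx += 1
        x = 0
        for token in tokens:
            w = len(token) * char_w
            if x > 0 and x + w > page_width:  # wrap to a new line
                line_idx += 1
                x, y = 0, y + line_h
            words.append(Word(token, (x, y, x + w, y + line_h), line_idx, p_idx))
            x += w + char_w  # advance past the word plus one space
        y += line_h + para_gap
    by_line: Dict[int, List[Box]] = {}
    by_para: Dict[int, List[Box]] = {}
    for wd in words:
        by_line.setdefault(wd.line, []).append(wd.box)
        by_para.setdefault(wd.paragraph, []).append(wd.box)
    lines = {i: union(bs) for i, bs in by_line.items()}
    paras = {i: union(bs) for i, bs in by_para.items()}
    return words, lines, paras


words, lines, paras = layout(["hello world foo", "bar"], page_width=120)
```

Line and paragraph boxes are just unions of their children here; a real renderer would also have to carry the boxes through any geometric augmentation applied afterwards.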
Relation graph for reading order. Most publicly available OCR datasets do not include reading order annotations. This makes it hard to train models that understand document structure beyond just detecting text. We took inspiration from the HierText dataset, which pioneered hierarchical (word, line, paragraph) annotations with structural relationships. Our synthetic pipeline generates a relation graph for every sample, encoding which words compose each line, which lines compose each paragraph, and in what order they should be read. This is what powers the relational model component of Nemotron OCR v2, which handles multi-column layouts, tables, and other structures where a simple top-to-bottom, left-to-right merge would produce garbled output.
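One possible shape for such a relation graph is sketched below. The article does not describe the actual annotation schema, so the edge types and the function name `build_relation_graph` are hypothetical; the sketch only assumes that words are listed in reading order together with their parent line and paragraph indices.

```python
from typing import Dict, List, Tuple


def build_relation_graph(word_parents: List[Tuple[int, int]]) -> Dict[str, list]:
    """Build containment and reading-order edges.

    word_parents[i] = (line_idx, paragraph_idx) for word i, in reading order.
    """
    graph: Dict[str, list] = {
        "word_in_line": [],       # (word, line) containment
        "line_in_paragraph": [],  # (line, paragraph) containment
        "next_word": [],          # reading order within a line
        "next_line": [],          # reading order within a paragraph
    }
    line_para: Dict[int, int] = {}
    line_order: List[int] = []
    for w_idx, (line, para) in enumerate(word_parents):
        graph["word_in_line"].append((w_idx, line))
        if line not in line_para:  # first time this line is seen
            line_para[line] = para
            line_order.append(line)
            graph["line_in_paragraph"].append((line, para))
    for w_idx in range(len(word_parents) - 1):
        if word_parents[w_idx][0] == word_parents[w_idx + 1][0]:
            graph["next_word"].append((w_idx, w_idx + 1))
    for a, b in zip(line_order, line_order[1:]):
        if line_para[a] == line_para[b]:  # do not link across paragraphs
            graph["next_line"].append((a, b))
    return graph


# Two words on line 0, one word on line 1 (same paragraph), one word in a new paragraph.
graph = build_relation_graph([(0, 0), (0, 0), (1, 0), (2, 1)])
```

Because the renderer decides the placement, these edges are exact by construction, which is what makes the reading-order supervision cheap to produce.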
Diverse layout modes. We created a set of layout templates that cover a range of real-world document scenarios: flowing multi-column text, scattered scene-text-like words, vertical text columns (important for Japanese and Chinese), tables with headers and borders, table-of-contents pages with dot leaders, PowerPoint-style slides, and Word-document-style pages with headings and body text. Each generation run randomly selects a layout mode, so the model sees a wide variety of structures during training.
Line-level recognition for CJK. An important design decision for the multilingual variant was moving from word-level to line-level text recognition. Languages like Chinese and Japanese do not use spaces between words, so there is no natural word boundary to segment on. Korean uses spaces inconsistently. By operating at the line level, the recognizer handles these languages naturally without needing a separate word segmentation step. The English variant continues to use word-level recognition where it makes sense.
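A small sketch shows why whitespace-based word segmentation breaks down for these scripts. The function name and the `level` parameter are hypothetical; the point is only that splitting on spaces yields natural units for English but nothing useful for unspaced Japanese text.

```python
from typing import List


def recognition_targets(line_text: str, level: str) -> List[str]:
    """Return the transcription units a recognizer would be trained on."""
    if level == "word":
        return line_text.split()  # whitespace-delimited words
    if level == "line":
        return [line_text]        # the whole line as one unit
    raise ValueError(f"unknown level: {level}")


english = recognition_targets("fast multilingual OCR", "word")
japanese = recognition_targets("合成データでOCRを訓練する", "word")
# english has three units; japanese is a single unit because the
# sentence contains no spaces, so "word" level degenerates to "line" level.
```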
Open-source font pool. We assembled 165 to 1,258 unique fonts per language from open-source collections including Google Fonts and the Noto family, covering serif, sans-serif, handwritten, decorative, and variable-weight styles.
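A font is only useful for a sample if it can draw every character in the text, otherwise the image would contain missing-glyph boxes that disagree with the label. The sketch below shows that coverage check under stated assumptions: the `font_coverage` mapping and the font names are made up for illustration, and in practice the character sets could be read from each font's character map with a font library such as fontTools.

```python
from typing import Dict, List, Set


def fonts_covering(text: str, font_coverage: Dict[str, Set[str]]) -> List[str]:
    """Names of fonts whose character set covers every non-space character."""
    needed = {ch for ch in text if not ch.isspace()}
    return sorted(name for name, chars in font_coverage.items()
                  if needed <= chars)


# Hypothetical coverage table; real tables would come from the font files.
coverage = {
    "LatinOnly": set("abcdefghijklmnopqrstuvwxyz"),
    "CJKCapable": set("abcdefghijklmnopqrstuvwxyz日本語"),
}
latin_fonts = fonts_covering("ocr data", coverage)
cjk_fonts = fonts_covering("日本語 ocr", coverage)
```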
Augmentations. Each rendered page goes through a stack of randomized augmentations to improve generalization. At the text level, these include border/outline effects, drop shadows, extrusion, and sprinkle noise on glyph edges. Custom effects modulate stroke opacity and stroke width variation across text using random fields. At the image level, we apply morphological operations (dilation, erosion), median blur, and elastic distortion, gated by a minimum text height to avoid destroying small text. At the full-page level, the pipeline applies contrast and brightness jitter, Gaussian and motion blur, color shifting, shadow overlays, and additive Gaussian noise. Backgrounds are either image textures or solid colors, with optional semi-transparent tinted rectangles behind individual words or lines.
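The randomization described above can be pictured as sampling a recipe per page. The following is a sketch only: the effect names mirror the prose, but the per-effect probability, the 12-pixel threshold, the uniform layout choice, and the function name `sample_recipe` are all assumptions, not values from the pipeline.

```python
import random
from typing import Dict, List

LAYOUT_MODES = ["multi_column", "scattered_words", "vertical_columns", "table",
                "table_of_contents", "slide", "word_document"]
TEXT_EFFECTS = ["outline", "drop_shadow", "extrusion", "sprinkle_noise"]
IMAGE_EFFECTS = ["dilation", "erosion", "median_blur", "elastic_distortion"]
PAGE_EFFECTS = ["contrast_brightness_jitter", "gaussian_blur", "motion_blur",
                "color_shift", "shadow_overlay", "gaussian_noise"]


def sample_recipe(rng: random.Random, text_height_px: int,
                  min_text_height_px: int = 12, p: float = 0.3) -> Dict[str, object]:
    """Pick a layout, a background, and a random subset of each effect group."""

    def pick(effects: List[str]) -> List[str]:
        return [e for e in effects if rng.random() < p]

    # Image-level effects are gated so they cannot destroy small text.
    if text_height_px >= min_text_height_px:
        image_effects = pick(IMAGE_EFFECTS)
    else:
        image_effects = []
    return {
        "layout": rng.choice(LAYOUT_MODES),
        "background": rng.choice(["image_texture", "solid_color"]),
        "text_effects": pick(TEXT_EFFECTS),
        "image_effects": image_effects,
        "page_effects": pick(PAGE_EFFECTS),
    }


recipe = sample_recipe(random.Random(0), text_height_px=8)
```

Passing an explicit `random.Random` instance keeps each sample reproducible from its seed, which helps when a bad training image needs to be regenerated and inspected.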
What the Data Looks Like
Here is a sample of raw synthetic images across all six languages. Each image is generated with a random layout mode, font selection, background, and augmentation stack:

Below is an annotated example showing the hierarchical structure. Dashed outlines indicate paragraph boundaries, shaded regions show line-level groupings (colored by paragraph), and arrows trace the reading order between lines within each paragraph.

Here are more annotated examples across languages showing the range of layouts, scripts, and augmentation styles the pipeline produces. Each subtitle describes what the example highlights:

Dataset at a Glance
The full dataset contains 12.2 million samples across six languages:
[Table: sample counts per language, with rows including Chinese (Simplified) and Chinese (Traditional)]
Download: nvidia/OCR-Synthetic-Multilingual-v1
The pipeline we have described is deliberately generic. We chose six languages for this release, but adding a new language only requires source text and fonts that cover the script. No changes to the model architecture or manual annotation are needed. The rendering pipeline can generate millions of annotated pages per day on a single machine, making it practical to produce large-scale training sets for new languages quickly. With mOSCAR covering 163 language subsets and the Noto font family supporting virtually every Unicode script in active use, there is a clear path to scaling this approach broadly.
The Model: Nemotron OCR v2
Nemotron OCR v2 is a production-ready, commercially usable OCR model trained on this synthetic data along with approximately 680K real-world images. It uses a three-component end-to-end architecture:
Text Detector (RegNetX-8GF backbone): localizes text regions in the image
Text Recognizer (pre-norm Transformer): transcribes detected regions
Relational Model: predicts logical groupings, reading order, and layout relationships
Two variants are available:
v2_multilingual: EN, ZH, JA, KO, RU; 6 recognizer layers; 14,244-token vocabulary
v2_english: EN; 3 recognizer layers; 855-token vocabulary
An important distinction from other OCR pipelines: Nemotron OCR v2 multilingual is a single unified model that handles all five languages simultaneously. You do not need to know the document language ahead of time or select a language-specific variant. By contrast, pipeline tools like PP-OCR v5 (PaddleOCR) and OpenOCR offer specialized per-language models that perform well on their target language but require you to either detect the language first or fall back to a base variant that is less capable across the board.
Why the Model Is Fast
The architecture is based on the FOTS (Fast Oriented Text Spotting) design, which unifies detection and recognition into a single network with a shared convolutional backbone. The detection backbone (RegNetX-8GF) processes the input image once and produces feature maps that are reused by all three components. The text recognizer receives rectified feature crops from detected regions and decodes them with a small Transformer. The relational model reasons over per-region embeddings derived from the same feature maps using a compact Transformer encoder. Because the expensive convolutional pass happens only once, the downstream components add minimal overhead. This feature reuse is what drives the model's efficiency, enabling 34.7 pages/second on a single A100 GPU.
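The control flow behind this feature reuse can be sketched with stand-in components. This is a structural toy, not the model: every method body is a stub, the class and method names are hypothetical, and the only thing it demonstrates is that the backbone runs once per image no matter how many regions the downstream heads process.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]


class SharedBackbonePipeline:
    """Toy pipeline where three heads reuse one backbone pass."""

    def __init__(self) -> None:
        self.backbone_calls = 0

    def backbone(self, image: str) -> dict:
        self.backbone_calls += 1       # the expensive step, counted
        return {"features_of": image}  # stand-in for feature maps

    def detect(self, feats: dict) -> List[Region]:
        return [(0, 0, 40, 10), (0, 20, 40, 30)]  # stub: two text regions

    def recognize(self, feats: dict, region: Region) -> str:
        return f"text@{region}"  # stub: would decode a feature crop

    def relate(self, feats: dict, regions: List[Region]) -> List[int]:
        return list(range(len(regions)))  # stub: reading order

    def run(self, image: str):
        feats = self.backbone(image)  # computed once
        regions = self.detect(feats)
        texts = [self.recognize(feats, r) for r in regions]  # reuses feats
        order = self.relate(feats, regions)                  # reuses feats
        return regions, texts, order


pipeline = SharedBackbonePipeline()
regions, texts, order = pipeline.run("page.png")
```

A pipeline built from separate detector and recognizer networks would instead run a convolutional network on every cropped region, so its cost grows with the number of text regions on the page.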
Results: What Synthetic Data Buys You
Multilingual Benchmark (SynthDoG)
Normalized Edit Distance on SynthDoG-generated pages (lower is better). The v2 multilingual model, trained on our synthetic data, brings NED scores from unusable levels down to near-zero across all target languages:
[Table: NED on SynthDoG by language, with rows including Chinese (Simplified) and Chinese (Traditional), for PaddleOCR (base), PaddleOCR (specialized), OpenOCR (server), Nemotron OCR v1, and Nemotron OCR v2 (multi)]
Note that the "PaddleOCR (specialized)" column uses language-specific models (e.g., the Korean model for Korean), which is the best-case scenario when you already know the language. Nemotron OCR v2 multilingual achieves better results on this synthetic data than even these specialized variants across all languages, using a single model.
Real-World Benchmark (OmniDocBench)
On OmniDocBench, a real-world document OCR benchmark with English, Chinese, and mixed-language content, Nemotron OCR v2 multilingual achieves competitive accuracy at 34.7 pages/second, over 28x faster than PaddleOCR v5:
[Table: OmniDocBench NED and speed for PaddleOCR v5 (server), OpenOCR (server), Nemotron OCR v2 (multi), Nemotron OCR v2 (EN), and Nemotron OCR v1]
NED scores (lower is better). Speed measured on a single A100 GPU with the v2 batched pipeline. All comparison models were benchmarked using their default detector + recognizer pipeline with optional extras disabled.
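To put the throughput figures in perspective, the short calculation below extrapolates them to a one-million-page job. It uses only the two numbers quoted in the text (34.7 pages/second and "over 28x faster"); the job size is an arbitrary example, and real throughput will vary with how text-dense the pages are.

```python
PAGES_PER_SECOND = 34.7  # Nemotron OCR v2 multilingual, single A100 (from the text)
SPEEDUP = 28             # "over 28x faster than PaddleOCR v5" (from the text)


def hours_to_process(pages: int, pages_per_second: float) -> float:
    """Wall-clock hours needed at a constant throughput."""
    return pages / pages_per_second / 3600


# A 28x gap implies the slower pipeline runs at no more than this rate.
paddle_upper_bound = PAGES_PER_SECOND / SPEEDUP  # about 1.24 pages/second

job = 1_000_000  # hypothetical corpus size
fast_hours = hours_to_process(job, PAGES_PER_SECOND)    # about 8 hours
slow_hours = hours_to_process(job, paddle_upper_bound)  # about 224 hours
```

At that scale the difference is roughly one working day versus more than nine days of GPU time.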
A note on the speed differences between variants: v1 and v2 English are faster than v2 multilingual because the multilingual recognizer is larger (6 Transformer layers with a 14,244-token vocabulary versus 3 layers with an 855-token vocabulary). The recognizer processes every detected text region, so a heavier recognizer directly reduces throughput on text-dense pages. The v2 English model is slightly faster than v1 because the backbone has been swapped from RegNetY to RegNetX.
Model: nvidia/nemotron-ocr-v2
Demo: nvidia/nemotron-ocr-v2 Space
Dataset: nvidia/OCR-Synthetic-Multilingual-v1
License: NVIDIA Open Model License (model), CC-BY-4.0 (dataset)
Acknowledgments
Thank you to Bo Liu, Théo Viel and Mike Ranzinger for contributing code, strategy and additional validation to this work.
