Building a Fast Multilingual OCR Model with Synthetic Data
Training a high-quality OCR model requires a large quantity of annotated image-text pairs: images with precise bounding boxes, transcriptions, and ideally reading order information at the word, line, and paragraph level. Every approach to curating this data comes with tradeoffs. Existing benchmark datasets like ICDAR and Total-Text have clean labels but limited scale, typically tens of thousands of images skewed toward English and Chinese. Manual annotation produces the highest quality labels but is expensive and slow, making it impractical at the millions-of-images scale needed for robust multilingual models. Web-scraped PDFs offer enormous quantity, but the embedded text is often noisy: characters recorded as individual strokes instead of words, text baked into images with no extractable layer, or scanned pages where a weak OCR model was applied and the resulting text layer is unreliable. You can extract usable signal from web PDFs, but it takes significant filtering effort and the result is never perfectly clean.
Synthetic data generation offers a way out of these tradeoffs. By rendering text onto images programmatically, we get both the scale of web scraping and the label purity of hand annotation. Every bounding box, transcription, and reading order relationship is known exactly because we placed it there, and we have full control over which layouts, font styles, and edge cases appear in the training set. The challenge is realism. Simulating diverse layouts and realistic document scenarios is difficult, but with the right rendering engine and strong randomization across fonts, colors, backgrounds, augmentations, and layout structures, it is possible to build enough invariance that models trained on synthetic data generalize well to real-world documents. Using this approach, we built Nemotron OCR v2, a multilingual OCR model that is both accurate and fast. Accuracy is driven by data: 12 million synthetic training images across six languages brought NED scores from 0.56–0.92 down to 0.035–0.069 on non-English languages. Speed is driven by architecture: a shared detection backbone whose features are reused by both the recognizer and relational model, eliminating redundant computation and enabling 34.7 pages/second on a single A100 GPU.
The synthetic data pipeline is generic enough to extend to any language for which fonts and source text exist. The dataset is publicly available at nvidia/OCR-Synthetic-Multilingual-v1 and the model at nvidia/nemotron-ocr-v2. You can try the model directly in the browser at the Nemotron OCR v2 demo.

The Problem: Data, Not Architecture

Nemotron OCR v1 was a strong English OCR model, but it was never trained on multilingual data, so it failed to read documents in other languages accurately. On our SynthDoG benchmark, v1 produced Normalized Edit Distance (NED) scores between 0.56 and 0.92 for Japanese, Korean, Russian, and Chinese. At these error rates, the model output bears little resemblance to the ground truth.

| Language | Nemotron OCR v1 NED |
| --- | --- |
| Japanese | 0.723 |
| Korean | 0.923 |
| Russian | 0.564 |
| Chinese (Simplified) | 0.784 |
| Chinese (Traditional) | 0.700 |

Part of the issue was the character set. The v1 model supported only 855 characters, which simply did not cover the CJK (Chinese, Japanese, Korean) or Cyrillic scripts.
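For reference, NED is edit distance between prediction and ground truth, normalized so that 0 is a perfect match and 1 means nothing in common. The exact normalization used in the benchmark is not spelled out here; a common convention, assumed in this sketch, divides by the longer of the two strings:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, gt: str) -> float:
    """Normalized edit distance: 0.0 = exact match, 1.0 = fully wrong."""
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))
```

Under this metric, a Korean NED of 0.923 means roughly nine out of ten characters in the output are wrong, missing, or spurious.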
We ran an experiment expanding the character set to 14,244 characters to cover all the target languages. This helped only marginally: without training data that actually contained those characters, the model could in theory output the right characters, but it had never learned what they looked like. The bottleneck was data, not architecture. Collecting and annotating millions of real-world images across six languages with word-, line-, and paragraph-level bounding boxes plus reading-order graphs would be prohibitively expensive. We needed a different approach.

A Generic Synthetic Data Pipeline

Our key insight is that the hard part of building a multilingual OCR model is the data, not the architecture. With synthetic data generation, we can create a large-scale dataset covering multiple languages and scripts, and train a model that generalizes well to real-world documents.
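The character-set expansion described above amounts to building a character-to-index vocabulary from the source text corpora the pipeline will render. A minimal sketch — the function name and the index-0 reservation are illustrative assumptions, not details of the released model:

```python
from collections import Counter

def build_charset(corpora, min_count=1):
    """Build a character-to-index vocabulary from raw text corpora.

    Coverage is what matters: a model that must read CJK and Cyrillic
    needs thousands of entries, far beyond the 855 characters of v1.
    """
    counts = Counter()
    for text in corpora:
        counts.update(text)
    chars = sorted(c for c, n in counts.items() if n >= min_count)
    # Index 0 is reserved here for a blank/padding symbol, as many
    # recognizer decoders (e.g. CTC-style) require.
    return {c: i + 1 for i, c in enumerate(chars)}
```

The point of the experiment in the text is that a bigger vocabulary alone is not enough: every entry also needs rendered training examples, which is exactly what the synthetic pipeline provides.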