Chandra 2: OCR for Documents That Break Everything Else

If you’ve ever tried to extract structured data from a scanned form, a table with merged cells, or a PDF with handwritten text mixed with printed text, you already know what traditional OCR in production looks like: fragile pipelines, post-processing hacks, and documents that simply break everything.

Chandra 2 is an open source OCR model from Datalab — the same team behind Marker and Surya — that takes a different approach. Instead of dividing pages into blocks and running inference on each piece separately, it decodes the entire page at once. The result is a model that handles layout, tables, forms, handwritten text, mathematics, and multilingual content as unified output in Markdown, HTML, or JSON.

It just achieved state-of-the-art on the olmOCR benchmark with a score of 85.9% — surpassing GPT-4o, Gemini Flash 2, Mistral OCR, and DeepSeek OCR. And with 4B parameters (down from Chandra 1’s 9B), it’s also faster.

What it generates as output

You pass a PDF or image. You get:

  • Structured Markdown with headings, tables, and lists correctly formatted
  • Layout block types: table, form, figure, equation, code block, footnote, captioned image
  • Flowcharts converted to Mermaid format
  • Charts extracted as structured data (values, axis labels, categories)
  • Mathematics in LaTeX

For those building RAG pipelines, document ingestion flows, or automations, this is the format you actually need — not plain text with broken whitespace.
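To make that concrete, here is an illustrative mock-up of the kind of Markdown a scanned report page might come back as — not actual model output, just a sketch of the shape (headings, a table, inline LaTeX):

# Q3 Financial Summary

| Region | Revenue | Growth |
|--------|---------|--------|
| EMEA   | $4.2M   | +12%   |
| APAC   | $3.1M   | +8%    |

Quarterly growth is reported as $r = \frac{R_t - R_{t-1}}{R_{t-1}}$.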

The three ways to use it

Here’s the practical part: you have three options, depending on your setup and scale.

Option 1: Hosted API (the fastest path)

The simplest option. Datalab runs the inference, you call the API. Includes $5 in free credits to get started, then pay-as-you-go. No GPU required. Ideal for: prototyping, low to medium volumes, or when you don’t want to manage infrastructure.

pip install datalab-sdk
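
A minimal sketch of what the hosted flow looks like in Python. The exact interface lives in the SDK docs — the client class and method names below are assumptions used for illustration, not a confirmed API:

from datalab_sdk import DatalabClient  # assumed import path; check the SDK docs

# Hypothetical usage: send a PDF to the hosted API, get structured Markdown back
client = DatalabClient(api_key="YOUR_DATALAB_API_KEY")
result = client.convert("contract.pdf")
print(result.markdown)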

Option 2: Local vLLM (recommended for production)

Run the model on your own GPU with vLLM as the inference backend. Lighter installation than HuggingFace, better throughput, and easier to containerize. It’s the recommended path for teams that need data privacy or predictable costs at scale.

pip install chandra-ocr
chandra_vllm  # starts the vLLM server
chandra input.pdf ./output

Configuration via environment variables:

MODEL_CHECKPOINT=datalab-to/chandra-ocr-2
VLLM_API_BASE=http://localhost:8000/v1
VLLM_GPUS=0

On an H100, you can reach 4 pages per second — roughly 345,000 pages per day.

Option 3: HuggingFace backend (maximum control)

Use transformers directly. More dependencies (torch, flash attention recommended), but gives you the most control over inference, batching, and integration with existing ML pipelines.

pip install chandra-ocr[hf]
chandra input.pdf ./output --method hf

Or directly from Python (this works with either backend):

from chandra.model import InferenceManager
from chandra.model.schema import BatchInputItem
from PIL import Image

# method="vllm" talks to the local vLLM server; use method="hf" for the transformers backend
manager = InferenceManager(method="vllm")
batch = [BatchInputItem(image=Image.open("doc.png"), prompt_type="ocr_layout")]
result = manager.generate(batch)[0]
print(result.markdown)
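
One note on multi-page PDFs: the Python API in the example above takes images, so you would presumably rasterize pages first and submit them as one batch. A sketch of that using pdf2image (my choice here, not something the Chandra docs prescribe; requires poppler installed):

from pdf2image import convert_from_path
from chandra.model import InferenceManager
from chandra.model.schema import BatchInputItem

manager = InferenceManager(method="vllm")
pages = convert_from_path("report.pdf", dpi=200)  # one PIL image per page
batch = [BatchInputItem(image=page, prompt_type="ocr_layout") for page in pages]
results = manager.generate(batch)
markdown = "\n\n".join(r.markdown for r in results)  # stitch pages back together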

License: what you need to know

The code is Apache 2.0 — completely open. The model weights use a modified OpenRAIL-M license: free for research, personal use, and startups with less than $2M in funding or revenue. Above that threshold, you need a commercial license. If you’re building a product that directly competes with Datalab’s hosted API, the open weights are not an option regardless of company size.

When to use it

Chandra shines on documents where layout matters: annual reports, government forms, invoices, academic papers with equations, historical documents, anything with tables spanning multiple columns. For simple PDFs with text only, Marker is faster and lighter. For structured enterprise documents, Chandra is the tool.

The multilingual support is also worth noting for the region: Chandra 2 covers 90 languages, with significant improvements in South Asian scripts, and performance on Spanish is solid across the benchmark.


Are you building document ingestion or RAG pipelines? What type of documents are hardest for you to process today? Let us know in the comments.

Resources: