Baidu Introduces Unlimited OCR: A 3B MoE Model Optimizing Long-Document Parsing via R-SWA

Baidu has open-sourced "Unlimited OCR," a 3-billion-parameter Mixture-of-Experts (MoE) model designed to eliminate the linear memory growth of the KV cache. The goal is efficient processing of long-form documents without the performance degradation typically associated with extended context windows.

Overcoming the KV Cache Bottleneck in End-to-End OCR

Traditional end-to-end Optical Character Recognition (OCR) models frequently encounter scalability issues when processing multi-page documents. As the model generates tokens, the Key-Value (KV) cache expands linearly, leading to increased memory consumption and computational latency. This growth often renders the parsing of dozens of pages computationally impractical for standard hardware.
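The linear growth described above can be illustrated with simple arithmetic. The following is a minimal sketch, not a description of any particular model: the layer count, head count, head dimension, and tokens-per-page figure are hypothetical values chosen only for illustration, and none of them come from the source.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the size of a full (unbounded) KV cache in bytes.

    Each layer stores one key and one value vector per token, hence the
    leading factor of 2. bytes_per_value=2 corresponds to 16-bit storage.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical decoder shape, for illustration only (not from the source).
NUM_LAYERS = 12
NUM_KV_HEADS = 8
HEAD_DIM = 128
TOKENS_PER_PAGE = 1000  # assumed average output length per page

for pages in (1, 10, 50):
    tokens = pages * TOKENS_PER_PAGE
    size = kv_cache_bytes(tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
    print(f"{pages:>3} pages -> {size / 2**20:.1f} MiB of KV cache")
```

Under these assumptions, doubling the number of generated tokens doubles the cache, which is the behavior that makes multi-page parsing expensive.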

Architectural Innovation: Reference Sliding Window Attention (R-SWA)

Unlike conventional engineering workarounds that attempt to manage memory through caching strategies, Baidu's Unlimited OCR addresses the problem at the architectural level. The model replaces every decoder attention layer with Reference Sliding Window Attention (R-SWA).

By implementing R-SWA, the model keeps the KV cache "flat": its size stays bounded instead of climbing with every generated token, as it does during conventional long-sequence generation. This allows the model to maintain consistent throughput and memory use regardless of document length.
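To make the "flat" cache idea concrete, here is a minimal sketch of a plain sliding-window KV cache. It is an assumption-laden illustration, not Baidu's implementation: the source does not describe how the "Reference" part of R-SWA works, so this sketch shows only generic sliding-window behavior. The class name `SlidingWindowKVCache`, the helper `simulate`, and the window size are all hypothetical.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps only the most recent `window` key/value entries.

    Generic sliding-window cache for illustration; it does not model
    the "Reference" mechanism of R-SWA, which the source leaves unspecified.
    """

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) drops the oldest entry once the window is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


def simulate(num_tokens, window):
    """Return the cache length after each generated token."""
    cache = SlidingWindowKVCache(window)
    sizes = []
    for t in range(num_tokens):
        cache.append(("k", t), ("v", t))  # placeholder key/value entries
        sizes.append(len(cache))
    return sizes


sizes = simulate(num_tokens=10, window=4)
print(sizes)
```

The cache length grows until it reaches the window size and then stays there, no matter how many more tokens are generated. The trade-off in a plain sliding window is that entries outside the window are no longer visible to attention.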

Model Specifications and Foundation

The Unlimited OCR model is built upon the DeepSeek OCR framework. Its architecture utilizes a Mixture-of-Experts (MoE) approach, featuring a total of 3 billion parameters, with only 500 million active parameters per token. This MoE configuration allows the model to maintain high capacity and specialized knowledge while keeping inference costs low.
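The gap between total and active parameters comes from routing each token to only a few experts. Below is a minimal sketch of top-k expert routing, under stated assumptions: the source does not give the number of experts, the value of k, or the router design, so the four toy experts, the logits, and k=2 are hypothetical. Only the 3B total and 500M active figures come from the article.

```python
import math

# Figures from the article: 3B total parameters, 500M active per token.
ACTIVE_FRACTION = 500_000_000 / 3_000_000_000  # about one sixth


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs; the others never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical toy experts: each is just a scalar function.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 0.0, lambda x: 4 * x]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

print(top_k_route(logits, k=2))
print(moe_forward(3.0, experts, logits, k=2))
print(f"active fraction: {ACTIVE_FRACTION:.1%}")
```

Only the selected experts are evaluated for a given token, which is why per-token compute tracks the active parameter count rather than the total.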

Key Technical Highlights:

Parameter Count: 3B Total / 500M Active (MoE).
Core Innovation: Integration of Reference Sliding Window Attention (R-SWA) in the decoder.
Primary Goal: Constant memory overhead (a KV cache that does not grow with output length) for long-document parsing.
Base Architecture: Derived from DeepSeek OCR.

Note: The source material available for this summary was truncated. It does not describe the exact mechanism of the "Reference" sliding windows or include comprehensive benchmark results.

Original Source


Techyon

Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing

