Baidu Introduces Unlimited OCR: A 3B MoE Model Optimizing Long-Document Parsing via R-SWA
Baidu has open-sourced "Unlimited OCR," a 3-billion-parameter Mixture-of-Experts (MoE) model designed to eliminate the linear memory growth of the KV cache during decoding. The aim is to parse long documents efficiently, without the memory and latency penalties that usually come with extended context windows.
Overcoming the KV Cache Bottleneck in End-to-End OCR
Traditional end-to-end Optical Character Recognition (OCR) models frequently run into scalability problems on multi-page documents. The Key-Value (KV) cache grows linearly with the number of tokens the model generates, so memory consumption and per-token attention cost both rise as the output gets longer. On standard hardware, this growth often makes parsing dozens of pages in a single pass impractical.
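The linear growth described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the layer count, number of KV heads, head dimension, and 2-byte precision are hypothetical defaults, not the published dimensions of Unlimited OCR.

```python
def kv_cache_bytes(num_tokens, num_layers=12, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Size of a full (unwindowed) KV cache in bytes.

    All default dimensions are hypothetical, chosen only for illustration.
    """
    # Each cached token stores one key and one value vector per head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Doubling the number of generated tokens doubles the cache size.
for tokens in (1_000, 2_000, 4_000):
    print(tokens, kv_cache_bytes(tokens) / 2**20, "MiB")
```

With these assumed dimensions each cached token costs 48 KiB, so the total scales in direct proportion to output length.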
Architectural Innovation: Reference Sliding Window Attention (R-SWA)
Unlike conventional engineering workarounds that attempt to manage memory through caching strategies, Baidu's Unlimited OCR addresses the problem at the architectural level. The model replaces every decoder attention layer with Reference Sliding Window Attention (R-SWA).
With R-SWA, the KV cache stays "flat": its size remains bounded instead of climbing with every generated token, as it does in standard long-sequence generation. This allows the model to maintain consistent throughput and memory use regardless of document length.
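The exact R-SWA mechanism is not specified here, so the following is only a generic sketch of how a sliding-window cache can stay bounded. It assumes, purely for illustration, that a fixed set of "reference" entries is pinned and never evicted while generated tokens pass through a fixed-size window. The class name `WindowedKVCache` and its methods are hypothetical and are not taken from Baidu's code.

```python
from collections import deque


class WindowedKVCache:
    """Toy single-layer cache: pinned reference entries plus a sliding window.

    Assumption (not from the source): reference entries are kept for the
    whole generation, and only generated tokens are subject to eviction.
    """

    def __init__(self, reference_entries, window_size):
        self.reference = list(reference_entries)
        # A deque with maxlen silently drops the oldest entry when full.
        self.window = deque(maxlen=window_size)

    def append(self, kv_entry):
        self.window.append(kv_entry)

    def visible(self):
        """Entries the next generated token would be allowed to attend to."""
        return self.reference + list(self.window)

    def __len__(self):
        return len(self.reference) + len(self.window)


cache = WindowedKVCache(reference_entries=["ref0", "ref1"], window_size=4)
sizes = []
for step in range(10):
    cache.append(f"tok{step}")
    sizes.append(len(cache))

print(sizes)            # grows until the window fills, then stays constant
print(cache.visible())  # pinned references plus the most recent tokens
```

In this toy setup the cache size is capped at the number of reference entries plus the window size, which is the "flat" behavior the article describes, whatever the real mechanism turns out to be.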
Model Specifications and Foundation
The Unlimited OCR model is built upon the DeepSeek OCR framework. Its architecture uses a Mixture-of-Experts (MoE) design with 3 billion total parameters, of which only 500 million are active for any given token. Because only a subset of experts runs per token, per-token compute tracks the active parameters rather than the full parameter count, which keeps inference costs low while preserving the capacity of the larger model.
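To illustrate why only a fraction of an MoE model's parameters are active per token (here roughly 500M of 3B, about one sixth), the sketch below shows generic top-k expert routing. The number of experts, the value of k, and the router logits are hypothetical; the article does not give Unlimited OCR's routing configuration.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy "experts" that simply scale their input (hypothetical stand-ins
# for feed-forward sub-networks).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
output = moe_forward(10.0, experts, logits, k=2)
print(routes, output)
```

Only the two selected experts are evaluated for this input, which is how an MoE model can hold many parameters in memory while spending compute on just a subset of them per token.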
Key Technical Highlights:
- Parameter Count: 3B Total / 500M Active (MoE).
- Core Innovation: Integration of Reference Sliding Window Attention (R-SWA) in the decoder.
- Primary Goal: Constant memory overhead (a fixed-size KV cache) for long-document parsing.
- Base Architecture: Derived from DeepSeek OCR.