OSCAR RotationZoo: Implementing Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR (RotationZoo) is an optimization technique that uses offline spectral covariance-aware rotation to quantize the KV cache of Large Language Models (LLMs) to 2 bits. The goal is to cut KV cache memory substantially while limiting the loss in model quality.

Overview of OSCAR RotationZoo

The OSCAR framework addresses one of the primary bottlenecks in LLM inference: the memory consumption of the Key-Value (KV) cache. By employing a technique known as Offline Spectral Covariance-Aware Rotation, the method allows for the aggressive quantization of the KV cache down to 2 bits (INT2). This approach aims to mitigate the precision loss typically associated with ultra-low bit-width quantization by rotating the feature space to better align with the spectral properties of the data.
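To make the memory argument concrete, the following back-of-the-envelope calculation compares an FP16 KV cache with a 2-bit one. The model shape (32 layers, 8 KV heads, head dimension 128) and the 32,768-token context are hypothetical values chosen for illustration, not the configuration of any model listed in this article, and the estimate ignores the scale and zero-point metadata that a real quantized cache also stores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed for the keys and values of one sequence.

    Ignores quantization metadata (scales, zero points), so the
    low-bit figure is a lower bound.
    """
    # Factor 2 accounts for storing both keys and values.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return num_values * bits_per_value / 8


# Hypothetical model shape, for illustration only.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**cfg, bits_per_value=16)
int2 = kv_cache_bytes(**cfg, bits_per_value=2)

print(f"FP16: {fp16 / 2**30:.2f} GiB")   # FP16: 4.00 GiB
print(f"INT2: {int2 / 2**30:.2f} GiB")   # INT2: 0.50 GiB
print(f"ratio: {fp16 / int2:.0f}x")      # ratio: 8x
```

The 8x ratio follows directly from the bit widths (16 / 2). In practice the saving is somewhat smaller once per-group metadata is counted.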

Technical Implementation and Integration

The implementation has been integrated into two inference engines, llama.cpp and sglang, so it can be deployed and tested directly. The current releases target the KV cache of several recent models. The spectral rotation is computed offline, which keeps runtime overhead during inference low.
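The source does not describe how OSCAR derives its rotation, so the following is only a generic sketch of the idea of an offline, covariance-derived rotation combined with 2-bit quantization. It assumes the rotation is taken from the eigenvectors of a calibration covariance matrix and that quantization is asymmetric and per-group. Both choices, along with the function names `fit_rotation`, `quantize_2bit`, and `dequantize_2bit`, are illustrative assumptions and not OSCAR's actual method. The sketch shows that an orthogonal rotation applied to both queries and keys leaves attention scores unchanged, which is why it can be fixed ahead of time.

```python
import numpy as np


def fit_rotation(calib):
    """Offline step: orthogonal basis from the calibration covariance."""
    centered = calib - calib.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(calib) - 1)
    # eigh on a symmetric matrix returns orthonormal eigenvectors.
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs  # shape: (head_dim, head_dim)


def quantize_2bit(x, group_size=32):
    """Asymmetric per-group quantization to 4 levels (codes 0..3)."""
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # Guard against constant groups, where hi == lo.
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    codes = np.clip(np.round((groups - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo


def dequantize_2bit(codes, scale, lo, shape):
    """Map 2-bit codes back to floating point."""
    return (codes * scale + lo).reshape(shape)


rng = np.random.default_rng(0)
head_dim = 64

# Synthetic calibration keys with uneven per-channel variance.
channel_std = rng.uniform(0.1, 5.0, size=head_dim)
calib_keys = rng.normal(size=(1024, head_dim)) * channel_std
R = fit_rotation(calib_keys)  # computed once, before inference

q = rng.normal(size=(4, head_dim))
k = rng.normal(size=(16, head_dim)) * channel_std

# Rotating both sides leaves scores unchanged: (qR)(kR)^T = q k^T.
exact_scores = q @ k.T
rotated_scores = (q @ R) @ (k @ R).T

# Only the rotated keys are stored, at 2 bits per value.
k_rot = k @ R
codes, scale, lo = quantize_2bit(k_rot)
k_hat = dequantize_2bit(codes, scale, lo, k_rot.shape)
approx_scores = (q @ R) @ k_hat.T

print("max score error:", np.abs(approx_scores - exact_scores).max())
```

Whether a rotation chosen this way reduces quantization error depends on the data and on how the quantization groups are laid out. The sketch makes no claim about accuracy, only about the mechanics.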

Supported Model Quantizations

Pre-quantized GGUF weights are available for the following models:

Gemma-4-12B-it: Optimized INT2 KV cache.
Qwen3-32B: Optimized INT2 KV cache.
Qwen3-4B-Thinking-2507: Optimized INT2 KV cache.

Code Availability

The implementation is available across two primary frameworks to support different deployment needs:

llama.cpp: Integration provided via a specialized branch for GGUF support.
sglang: Integration available within the FutureMLS repository for high-throughput serving.

Limitations and Note

Note: The source material covers only the distribution of quantized weights and repository links. It does not include the mathematical derivation of the spectral covariance-aware rotation or perplexity benchmarks comparing OSCAR with standard INT2 or INT4 quantization.
