OSCAR RotationZoo: Implementing Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
A new optimization technique called OSCAR (RotationZoo) uses offline spectral covariance-aware rotation to enable 2-bit quantization of the KV cache. Storing each cached value in 2 bits instead of 16 shrinks the cache to roughly one-eighth of its size before quantization metadata is counted, and the rotation is intended to preserve the output quality of Large Language Models (LLMs) at that bit-width.
Overview of OSCAR RotationZoo
The OSCAR framework addresses one of the primary bottlenecks in LLM inference: the memory consumption of the Key-Value (KV) cache, which grows with context length and batch size. Using Offline Spectral Covariance-Aware Rotation, the method quantizes the KV cache down to 2 bits (INT2). To mitigate the precision loss typically associated with ultra-low bit-widths, it rotates the feature space before quantization so that the quantizer is better matched to the spectral properties of the data.
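The idea of rotating before quantizing can be illustrated with a minimal NumPy sketch. It is not the OSCAR algorithm, whose details the source does not give. It assumes that "spectral covariance-aware" means an orthogonal rotation built from the eigenvectors of a covariance matrix estimated on calibration data, and it uses a generic per-group asymmetric 2-bit quantizer. The function names, group size, and synthetic data are all hypothetical.

```python
import numpy as np

def fit_rotation(calib):
    """Assumed construction: orthogonal matrix from covariance eigenvectors."""
    centered = calib - calib.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(calib) - 1)
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input gives orthonormal columns
    return eigvecs

def quantize_2bit(x, group=32):
    """Generic asymmetric 2-bit quantizer: one (scale, minimum) pair per group."""
    g = x.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0  # 2 bits give 4 levels: 0, 1, 2, 3
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo, shape):
    return (codes * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
head_dim = 64

# Synthetic, correlated "keys" standing in for real calibration activations.
mix = rng.normal(size=(head_dim, head_dim))
keys = rng.normal(size=(1024, head_dim)) @ mix
queries = rng.normal(size=(8, head_dim))

R = fit_rotation(keys)       # computed once, offline
keys_rot = keys @ R
queries_rot = queries @ R    # queries must be rotated the same way

codes, scale, lo = quantize_2bit(keys_rot)
keys_hat = dequantize_2bit(codes, scale, lo, keys_rot.shape)

scores_ref = queries @ keys.T            # original attention logits
scores_rot = queries_rot @ keys_rot.T    # identical up to float rounding
scores_q = queries_rot @ keys_hat.T      # logits from the 2-bit cache

print("rotation changes logits by:", np.abs(scores_ref - scores_rot).max())
print("2-bit cache changes logits by:", np.abs(scores_ref - scores_q).max())
```

Because R is orthogonal, rotating both queries and keys leaves their dot products unchanged, so any error in the final logits comes from the quantizer alone. Whether a given rotation actually lowers that error depends on the data and on the quantizer's grouping, which is why this sketch only prints the error rather than claiming an improvement.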
Technical Implementation and Integration
The implementation has been integrated into popular inference engines so that it can be deployed and tested directly. The current releases target the KV cache of the models listed below. The spectral rotation is computed offline, ahead of inference, to minimize runtime overhead.
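One common way an offline rotation avoids runtime cost is to fold it into the projection weights ahead of time. The source does not say whether OSCAR does this, so the following is only a sketch of the general idea for the value path of a single attention head. The rotation here is a random orthogonal matrix, and all names and shapes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, head_dim = 6, 16, 8

x = rng.normal(size=(seq, d_model))          # layer input
W_v = rng.normal(size=(d_model, head_dim))   # value projection
W_o = rng.normal(size=(head_dim, d_model))   # output projection
attn = softmax(rng.normal(size=(seq, seq)))  # stand-in attention weights

# Placeholder rotation: any orthogonal matrix works for this identity.
R, _ = np.linalg.qr(rng.normal(size=(head_dim, head_dim)))

# Offline step: fold R into W_v and its inverse (R.T) into W_o.
W_v_rot = W_v @ R
W_o_rot = R.T @ W_o

out_ref = attn @ (x @ W_v) @ W_o
# At runtime the cached values x @ W_v_rot are already rotated;
# this is the tensor a 2-bit quantizer would be applied to.
out_rot = attn @ (x @ W_v_rot) @ W_o_rot

print("max difference:", np.abs(out_ref - out_rot).max())
```

The two outputs agree up to floating-point rounding because R @ R.T is the identity, so no extra matrix multiply is needed during inference. Keys are harder to treat this way in models that apply rotary position embeddings after the key projection, since a general rotation does not commute with that step.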
Supported Model Quantizations
Pre-quantized GGUF weights are available for the following models:
- Gemma-4-12B-it: Optimized INT2 KV cache.
- Qwen3-32B: Optimized INT2 KV cache.
- Qwen3-4B-Thinking-2507: Optimized INT2 KV cache.
Code Availability
The implementation is available across two primary frameworks to support different deployment needs:
- llama.cpp: Integration provided via a specialized branch for GGUF support.
- sglang: Integration available within the FutureMLS repository for high-throughput serving.
Limitations
Note: The provided source material focuses on the distribution of quantized weights and repository links. Detailed mathematical proofs of the spectral covariance-aware rotation and specific perplexity benchmarks comparing OSCAR against standard INT2 or INT4 quantization were not provided in the source text.