Open Sourcing InfiniteKV: Extending LLM Context via Searchable KV Cache Offloading

InfiniteKV introduces a novel approach to KV cache management by storing aged tokens as compact, 104-byte searchable records in RAM or on disk, enabling models to retrieve information far beyond their native training window without the typical VRAM exhaustion.

Overcoming the VRAM Bottleneck in Long-Context Inference

One of the primary constraints in deploying Large Language Models (LLMs) is the growth of the Key-Value (KV) cache. In standard inference, the GPU must keep a key vector and a value vector for every token in the conversation, in every attention layer. This memory overhead scales linearly with context length, creating a significant barrier on consumer-grade hardware.

For example, a Llama-3.1-8B model requires approximately 0.12 MB of KV cache per token at 16-bit precision. Scaling this to 100k tokens consumes about 12 GB of VRAM, while a million tokens would require about 122 GB. Because consumer GPUs cannot accommodate such volumes, most serving stacks implement a "sliding window" or truncation strategy, silently dropping the oldest tokens once memory limits are reached. Consequently, the model loses access to early conversation history, leading to failures in retrieval and coherence.
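The per-token figure can be reproduced with a few lines of arithmetic. The sketch below assumes the published Llama-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) and 16-bit cache entries; the function name `kv_bytes_per_token` is illustrative and not part of InfiniteKV.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache added by one token: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed configuration: Llama-3.1-8B with a 16-bit (2-byte) cache.
LLAMA_31_8B = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_bytes_per_token(**LLAMA_31_8B)
print(f"per token: {per_token} bytes ({per_token / 2**20:.3f} MiB)")

for tokens in (100_000, 1_000_000):
    total_gib = per_token * tokens / 2**30
    print(f"{tokens:>9,} tokens: {total_gib:.1f} GiB")
```

Under these assumptions each token costs 131,072 bytes, which works out to roughly 12 GiB at 100k tokens and 122 GiB at one million, in line with the figures quoted above.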

How InfiniteKV Works

InfiniteKV addresses this limitation by splitting memory into two tiers: the active KV cache for recent tokens, which stays in VRAM, and an archive for aged tokens. Rather than deleting old tokens, InfiniteKV archives them as searchable records. These records are compressed into 104-byte entries that can be stored in system RAM or on disk, drastically reducing the VRAM footprint while preserving the ability to retrieve historical data.
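To make the two-tier idea concrete, here is a minimal sketch in plain Python. It is not InfiniteKV's implementation: the record layout (an 8-byte token position plus a 96-byte int8 signature, chosen only because it totals 104 bytes), the class `TwoTierCache`, the helper `quantize`, and the brute-force dot-product search are all assumptions made for illustration. The sketch also archives only a search signature, so how the real system restores key and value data for attention is left open.

```python
import struct
from collections import deque

# Hypothetical record layout: int64 position + 96 int8 values = 104 bytes.
RECORD = struct.Struct("<q96b")
SIG_DIMS = 96


def quantize(vec):
    """Scale the first SIG_DIMS components of vec into the int8 range."""
    head = list(vec[:SIG_DIMS])
    head += [0.0] * (SIG_DIMS - len(head))
    peak = max(abs(x) for x in head) or 1.0
    return [round(127 * x / peak) for x in head]


class TwoTierCache:
    def __init__(self, window):
        self.window = window
        self.hot = deque()          # stands in for the GPU-resident KV cache
        self.archive = bytearray()  # stands in for system RAM or a file on disk

    def append(self, position, key):
        """Add a token; archive the oldest one when the hot window overflows."""
        self.hot.append((position, key))
        while len(self.hot) > self.window:
            old_pos, old_key = self.hot.popleft()
            self.archive += RECORD.pack(old_pos, *quantize(old_key))

    def search(self, query, top_k=1):
        """Return positions of the archived records that best match query."""
        q = quantize(query)
        scored = []
        for offset in range(0, len(self.archive), RECORD.size):
            pos, *sig = RECORD.unpack_from(self.archive, offset)
            scored.append((sum(a * b for a, b in zip(q, sig)), pos))
        scored.sort(reverse=True)
        return [pos for _, pos in scored[:top_k]]


def one_hot(i):
    """Toy key vector with a single non-zero component."""
    vec = [0.0] * SIG_DIMS
    vec[i] = 1.0
    return vec


cache = TwoTierCache(window=2)
for position in range(5):
    cache.append(position, one_hot(position))

print(RECORD.size, len(cache.archive) // RECORD.size, cache.search(one_hot(1)))
```

With a window of two tokens, the first three tokens are evicted into the archive at 104 bytes each, and a query resembling token 1 finds its position again. A linear scan is the simplest possible search; an archive holding millions of records would presumably need an index.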

Key Performance Metrics

The efficiency of this approach allows models to operate well beyond their trained context window. In initial tests, a Mistral-7B model correctly answered a query that depended on token 76,747, effectively extending its operational range to 2.3x its original trained window.

Implementation and Accessibility

The project is now open-sourced to allow developers and researchers to implement this offloading mechanism in their own pipelines. A Colab demo has been provided to showcase the practical application of the searchable record system and its ability to maintain long-term memory without crashing the GPU.

Note: The source material gives only a high-level overview of the memory-splitting mechanism. It does not specify the search algorithm or the compression method used to reach the 104-byte record size.
