Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Tangram targets the memory bottleneck in multi-turn Large Language Model (LLM) serving with non-uniform KV cache compression: it allocates heterogeneous memory budgets across attention heads, aiming to raise throughput without sacrificing model accuracy.

The Memory Bottleneck in Multi-turn LLM Serving

In deployed Large Language Models (LLMs), particularly multi-turn dialogue systems, accumulated dialogue history poses a significant scaling challenge. The Key-Value (KV) cache grows linearly with the number of tokens in each conversation, so it expands with every turn, and its total footprint scales with the number of concurrent users. As a result, the KV cache often requires more memory than the model weights themselves, and memory, rather than compute, becomes the binding constraint on system throughput.
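To make the scaling concrete, the following back-of-the-envelope sketch estimates KV cache size for a standard multi-head attention model. The model shape (32 layers, 32 KV heads, head dimension 128, 7 billion parameters, 16-bit values) and the workload (4 concurrent users with 8,192 tokens of history each) are illustrative assumptions, not figures from the Tangram work.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   num_sequences, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_sequences` conversations
    that each hold `num_tokens` tokens of history."""
    # Leading factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * num_tokens * num_sequences * bytes_per_elem)


def weight_bytes(num_params, bytes_per_elem=2):
    """Bytes needed to hold the model weights."""
    return num_params * bytes_per_elem


# Hypothetical model and workload (assumptions for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
PARAMS = 7_000_000_000

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 1, 1)
cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 8192, 4)
weights = weight_bytes(PARAMS)

GIB = 2 ** 30
print(f"KV cache per token: {per_token / 2**20:.1f} MiB")   # 0.5 MiB
print(f"KV cache, 4 users x 8192 tokens: {cache / GIB:.1f} GiB")  # 16.0 GiB
print(f"Model weights: {weights / GIB:.1f} GiB")             # 13.0 GiB
```

Under these assumptions, four moderately long conversations already need more memory for the cache than for the weights, and doubling either the history length or the number of users doubles the cache.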

Uniform vs. Non-Uniform KV Compression

To mitigate memory pressure, existing serving stacks often employ KV cache compression. Traditional uniform schemes apply the same budget to every attention head. Research indicates, however, that non-uniform KV compression, which allocates heterogeneous budgets to attention heads according to their importance, preserves model accuracy far more effectively than uniform schemes.
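The difference between the two schemes can be sketched as a budget-allocation problem. The example below is a minimal illustration, not the allocation rule of Tangram or of any particular compression method: it assumes each head comes with a non-negative integer importance weight (hypothetical values here) and splits a fixed total token budget proportionally, with a small per-head floor.

```python
def uniform_budgets(total_budget, num_heads):
    """Uniform scheme: every head keeps the same number of KV entries."""
    return [total_budget // num_heads] * num_heads


def nonuniform_budgets(total_budget, importance, min_per_head=1):
    """Non-uniform scheme (illustrative): split the budget in proportion to
    integer importance weights, guaranteeing `min_per_head` entries per head."""
    num_heads = len(importance)
    spare = total_budget - min_per_head * num_heads
    total_importance = sum(importance)
    if spare < 0 or total_importance <= 0:
        raise ValueError("budget too small or importance weights not positive")
    # Integer floor of each head's proportional share of the spare budget.
    budgets = [min_per_head + spare * w // total_importance for w in importance]
    # Hand any rounding leftover to the most important heads, one entry each.
    leftover = total_budget - sum(budgets)
    by_importance = sorted(range(num_heads), key=lambda h: importance[h],
                           reverse=True)
    for h in by_importance[:leftover]:
        budgets[h] += 1
    return budgets


# Hypothetical importance weights for four heads (assumption for illustration).
importance = [8, 4, 2, 2]
print(uniform_budgets(4096, 4))                              # [1024, 1024, 1024, 1024]
print(nonuniform_budgets(4096, importance, min_per_head=64))  # [1984, 1024, 544, 544]
```

Both allocations retain 4,096 entries in total; the non-uniform one simply shifts entries from the less important heads to the most important one.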

The Implementation Challenge

Despite the theoretical advantages of non-uniform compression, practical implementation has remained elusive. Modern LLM serving stacks are architected under the assumption that KV lengths are identical across all heads. This rigidity creates a technical "trap": the memory freed by compressing less important heads cannot be reused, because the system still allocates space according to the longest remaining head.
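A simplified model shows how large the trapped share can be. The sketch below assumes a layout that reserves the same number of slots for every head, sized to the longest one; the per-head lengths are hypothetical and real serving stacks manage memory in more elaborate ways (for example in fixed-size blocks).

```python
def padded_slots(head_lengths):
    """Slots reserved when every head is sized to the longest head."""
    return len(head_lengths) * max(head_lengths)


def used_slots(head_lengths):
    """Slots that actually hold KV entries."""
    return sum(head_lengths)


def stranded_fraction(head_lengths):
    """Share of the reserved slots that is allocated but never used."""
    return 1 - used_slots(head_lengths) / padded_slots(head_lengths)


# Hypothetical per-head KV lengths after compression (4,096 entries in total).
nonuniform = [1984, 1024, 544, 544]
uniform = [1024, 1024, 1024, 1024]

print(padded_slots(uniform))      # 4096 slots reserved, all used
print(padded_slots(nonuniform))   # 7936 slots reserved
print(used_slots(nonuniform))     # 4096 slots used
print(f"{stranded_fraction(nonuniform):.1%} of reserved slots stranded")
```

In this toy case the non-uniform allocation keeps exactly as many entries as the uniform one, yet the equal-length layout reserves nearly twice the memory for it, which is why the savings never reach the serving system.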

Tangram's Objective

Tangram aims to unlock the potential of non-uniform KV compression by overcoming these architectural limitations, so that memory management can follow the varying importance of attention heads during inference.

Note: The source excerpt does not describe the mechanism Tangram uses to handle heterogeneous KV lengths.
