Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Researchers propose a "recoverable routing" mechanism for Vision-Language Models (VLMs) that makes inference cheaper by dynamically managing which visual tokens are active across decoder layers, rather than pruning tokens irreversibly.
The Challenge of Visual Token Overhead in VLMs
Modern Vision-Language Models (VLMs) typically project an input image into a long sequence of visual tokens, often numbering in the hundreds or thousands. While this many tokens supports detailed image understanding, it imposes a significant computational burden during decoder inference. Specifically, the cost of self-attention grows quadratically with sequence length, and the KV-cache grows with every token that is kept, so processing these tokens is expensive in both latency and memory.
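To make the scale of this overhead concrete, the following back-of-the-envelope sketch estimates KV-cache size and the number of attention pairs for a given token count. The decoder configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values) and the token counts 576 and 144 are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention over the tokens (no masking)."""
    return num_tokens * num_tokens


# Hypothetical decoder configuration, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

full_cache = kv_cache_bytes(576, LAYERS, KV_HEADS, HEAD_DIM)
reduced_cache = kv_cache_bytes(144, LAYERS, KV_HEADS, HEAD_DIM)

print(full_cache / 2**20, "MiB for 576 visual tokens")
print(reduced_cache / 2**20, "MiB for 144 visual tokens")
print(attention_pairs(576) // attention_pairs(144), "x fewer attention pairs")
```

Under these assumptions, keeping a quarter of the visual tokens cuts the cache to a quarter of its size, while the number of pairwise attention scores among those tokens drops by a factor of sixteen.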
The Limitations of the "Rank-and-Remove" Paradigm
To mitigate these costs, current token reduction strategies generally employ a "rank-and-remove" approach. In this paradigm, the model assigns importance scores to visual tokens, retains a small, compact subset of the highest-scoring tokens, and permanently discards the remainder. However, the authors argue that this irreversible action is fundamentally fragile.
The core issue lies in the dynamic nature of token importance: a visual token that appears irrelevant at an early stage of the decoder may become critical for the final output as the model processes deeper layers. By permanently removing these tokens early on, models risk losing essential spatial or semantic information required for accurate reasoning.
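The failure mode described above can be shown with a toy example. This is a minimal sketch of generic top-k pruning, not the scoring rule of any particular method; the token names and importance scores are invented for illustration.

```python
def rank_and_remove(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens in their original order; drop the rest."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


# Made-up tokens and scores: importance as seen early vs. late in the decoder.
tokens = ["sky", "sign", "road", "car"]
early_scores = {"sky": 0.1, "sign": 0.2, "road": 0.7, "car": 0.9}
late_scores = {"sky": 0.1, "sign": 0.95, "road": 0.3, "car": 0.6}

# Prune once, using only the early scores.
survivors = rank_and_remove(tokens, [early_scores[t] for t in tokens], keep=2)

# The token that matters most later on was already discarded.
most_needed_later = max(tokens, key=lambda t: late_scores[t])
lost = most_needed_later not in survivors

print(survivors, most_needed_later, lost)
```

Because the removal is permanent, no later layer can bring "sign" back, however high its score becomes.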
Introducing Recoverable Visual Token Routing
To address this fragility, the paper introduces a routing mechanism that favors rerouting over removal. Rather than deleting tokens, the proposed method lets the model dynamically manage which tokens are active at different depths of the decoder. A token that is set aside at one depth can therefore be "recovered" or reintroduced later if its relevance increases, with the aim of improving the trade-off between computational efficiency and model performance.
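One way to picture the idea is sketched below. The source does not describe the actual architecture, so everything here is an assumption made for illustration: the function `route_layer`, the per-layer score lists, the fixed `budget`, and the choice to pass inactive tokens through a layer unchanged are all hypothetical and should not be read as the authors' method.

```python
def route_layer(pool, scores, budget, layer_fn):
    """Apply `layer_fn` to the `budget` top-scoring tokens; pass the others through.

    No token is deleted: inactive tokens stay in `pool` unchanged, so a later
    layer can select them again.
    """
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    active = set(ranked[:budget])
    new_pool = [layer_fn(x) if i in active else x for i, x in enumerate(pool)]
    return new_pool, sorted(active)


def toy_layer(x):
    """Stand-in for a decoder layer (hypothetical): just shifts the value."""
    return x + 10.0


pool = [1.0, 2.0, 3.0, 4.0]        # toy one-dimensional "token states"
scores_by_layer = [
    [0.1, 0.9, 0.3, 0.8],          # layer 0: tokens 1 and 3 look important
    [0.95, 0.2, 0.1, 0.7],         # layer 1: token 0 has become important
]

history = []
for scores in scores_by_layer:
    pool, active = route_layer(pool, scores, budget=2, layer_fn=toy_layer)
    history.append(active)

print(history)
print(pool)
```

In this sketch, token 0 is skipped at layer 0 but selected at layer 1, which is exactly what a rank-and-remove scheme cannot do. The per-layer compute is still bounded by `budget`, while the full pool remains available for later selection.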
Note: The available description of the paper is partial; specific architectural details and quantitative benchmark results are not available.