VIA-SD: Enhancing Speculative Decoding via Intra-Model Routing for Verification

Researchers introduce VIA-SD, a speculative decoding framework that uses intra-model routing to send some rejected tokens to slim sub-models of the verifier, reducing the computational overhead of full verifier recomputation.

Overcoming the Bottlenecks of Speculative Decoding

Speculative Decoding (SD) has emerged as a critical technique to mitigate the high inference latency and computational costs associated with Large Language Models (LLMs). The standard architecture employs a lightweight "drafter" model to generate a sequence of candidate tokens, which are then validated in parallel by a larger, more capable "verifier" model.

Traditionally, the verification process operates on a binary logic: a candidate token is either accepted or rejected. When a token is rejected, the system typically triggers a full recomputation using the entire verifier model to generate the correct token, which consumes significant computational resources and offsets some of the efficiency gains provided by the drafting phase.
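
The binary accept/reject rule can be sketched in a few lines of Python. This is a minimal illustration that assumes greedy decoding and toy next-token functions. The names `toy_drafter`, `toy_verifier`, and `speculative_step` are hypothetical and do not come from the VIA-SD paper, and a real system would score all drafted positions in one batched verifier pass rather than in a Python loop.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_verifier(prefix: Sequence[int]) -> int:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: Sequence[int]) -> int:
    """Hypothetical 'small' model: agrees with the verifier except after a 3."""
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


def speculative_step(
    prefix: Sequence[int], drafter: NextToken, verifier: NextToken, k: int
) -> Tuple[List[int], int]:
    """Draft k tokens, then keep the longest prefix the verifier agrees with.

    Returns the emitted tokens and the number of drafted tokens accepted.
    """
    # Drafting phase: the cheap model proposes k tokens autoregressively.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(drafter(list(prefix) + drafted))

    # Verification phase: binary decision per drafted token.
    output: List[int] = []
    for token in drafted:
        target = verifier(list(prefix) + output)
        if token != target:
            # Rejected: the full verifier supplies the correction and the
            # remaining drafted tokens are discarded.
            output.append(target)
            return output, len(output) - 1
        output.append(token)  # Accepted as drafted.
    return output, len(output)


tokens, n_accepted = speculative_step([1], toy_drafter, toy_verifier, k=4)
print(tokens, n_accepted)  # [2, 3, 4] 2
```

In this toy run the first two drafted tokens are accepted, the third is rejected and replaced by the verifier's token, and the fourth is thrown away.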

The VIA-SD Approach: Intra-Model Routing

The authors of the VIA-SD paper identify a critical inefficiency in this binary decision-making process. Their research suggests that not all rejected tokens require the full capacity of the verifier for correction. Many drafted tokens that the verifier rejects can be correctly verified and corrected by a "slim sub-model" derived from the full verifier.

VIA-SD introduces Intra-Model Routing, a mechanism that routes tokens requiring only moderate verification resources to these slim sub-models rather than the full-scale verifier. By leveraging internal representations of the verifier to create these specialized paths, the framework optimizes the balance between accuracy and inference speed.
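
One way to picture the routing decision is as a threshold on a per-token difficulty score. The sketch below is only an assumption-laden illustration: the summary does not describe how VIA-SD computes its routing signal, so the `difficulty` input, the `threshold` value, and the function `correct_rejected_token` are all hypothetical. The two predictors are toy stand-ins for the slim sub-model and the full verifier.

```python
from typing import Callable, Sequence, Tuple

Predictor = Callable[[Sequence[int]], int]


def full_verifier(prefix: Sequence[int]) -> int:
    """Toy stand-in for the full verifier (hypothetical)."""
    return (prefix[-1] + 1) % 10


def slim_submodel(prefix: Sequence[int]) -> int:
    """Toy stand-in for a slim sub-model of the verifier (hypothetical).

    It agrees with the full verifier here; a real sub-model might not.
    """
    return (prefix[-1] + 1) % 10


def correct_rejected_token(
    prefix: Sequence[int],
    difficulty: float,
    slim: Predictor,
    full: Predictor,
    threshold: float = 0.5,
) -> Tuple[int, str]:
    """Route a rejected position to the slim path or the full path.

    `difficulty` is an assumed score in [0, 1]; how such a score would be
    derived from the verifier's internal representations is not specified
    in the summary.
    """
    if difficulty < threshold:
        return slim(prefix), "slim"
    return full(prefix), "full"


print(correct_rejected_token([1, 2, 3], 0.2, slim_submodel, full_verifier))
print(correct_rejected_token([1, 2, 3], 0.9, slim_submodel, full_verifier))
```

The first call is handled by the slim path and the second by the full verifier. In this sketch, only the threshold controls how much work reaches the expensive model.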

Key Technical Innovation: The Slim-Verifier

The core of the proposal is the implementation of a slim-verifier. Instead of a costly full-model forward pass for every rejection, the routing mechanism determines if a sub-component of the verifier can handle the correction. This approach effectively creates a tiered verification hierarchy, ensuring that the most computationally expensive resources are reserved only for the most complex tokens.
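
The potential saving from a tiered hierarchy can be reasoned about with a simple weighted average. The helper below is a back-of-the-envelope sketch, not a result from the paper: `slim_fraction`, `slim_cost`, and `full_cost` are hypothetical inputs, the numbers in the example are purely illustrative, and the cost of the routing decision itself is ignored.

```python
def expected_correction_cost(
    slim_fraction: float, slim_cost: float, full_cost: float
) -> float:
    """Average cost per rejected token when a share goes to the slim tier.

    slim_fraction: assumed share of rejections handled by the slim sub-model.
    slim_cost, full_cost: assumed relative costs of one correction per tier.
    """
    if not 0.0 <= slim_fraction <= 1.0:
        raise ValueError("slim_fraction must lie in [0, 1]")
    return slim_fraction * slim_cost + (1.0 - slim_fraction) * full_cost


# Illustrative values only: the full verifier costs 1.0 unit per correction,
# the slim tier 0.25 units, and half of all rejections are routed to it.
baseline = expected_correction_cost(0.0, slim_cost=0.25, full_cost=1.0)
routed = expected_correction_cost(0.5, slim_cost=0.25, full_cost=1.0)
print(baseline, routed)  # 1.0 0.625
```

Under these assumed numbers the average correction cost falls from 1.0 to 0.625 units. The actual benefit would depend on how many rejections the slim tier can handle correctly and on how cheap it is relative to the full verifier.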

Note: The provided source material is a summary; specific benchmarks, architectural details of the routing mechanism, and quantitative performance gains are not detailed in the available description.