Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding
Domino is a speculative decoding framework that decouples causal modeling from the autoregressive drafting process, with reported throughput speedups of up to 5.8x on Qwen3 models.
Optimizing Inference via Speculative Decoding
Speculative decoding is a widely used technique for accelerating Large Language Model (LLM) inference. A smaller, faster "draft" model proposes several tokens ahead, and a larger target model then verifies them in parallel. The draft model still generates its proposals one token at a time, however, so the overhead of this autoregressive drafting phase often limits the overall gain.
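The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration of conventional greedy speculative decoding, not Domino's method. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are stand-in functions that map a non-empty token sequence to the next token.

```python
from typing import Callable, List

# Assumption: a "model" is any function that maps a non-empty token
# sequence to its greedy next token. Real models return logits instead.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], num_draft: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. Drafting: the small model proposes tokens one at a time.
    #    This sequential loop is the autoregressive drafting overhead.
    proposed: List[int] = []
    for _ in range(num_draft):
        proposed.append(draft(context + proposed))

    # 2. Verification: the target checks every proposed position. A real
    #    system does this in one batched forward pass; the loop below is
    #    only for readability.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target's pass yields a bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, num_draft: int = 4) -> List[int]:
    """Repeat speculative rounds until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, num_draft))
    return out[:len(prompt) + max_new_tokens]


# Toy stand-ins: the target counts upward modulo 10; the draft agrees
# except after token 4, where it guesses wrong.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new_tokens=8))
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target would produce under plain greedy decoding. The draft model only affects how many target passes are needed.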
The Domino Approach
Domino decouples causal modeling from the autoregressive drafting process. According to the announcement, separating these two components changes how tokens are proposed and verified, and reduces the computational bottlenecks of traditional speculative decoding pipelines.
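To make the bottleneck in a traditional pipeline concrete, the sketch below evaluates the standard cost model from the speculative decoding literature. It describes conventional draft-then-verify decoding only and says nothing about how Domino works internally. The function names and the parameters `accept_rate`, `num_draft`, and `draft_cost_ratio` are illustrative assumptions, and the model assumes each draft token is accepted independently with the same probability.

```python
def expected_tokens_per_round(accept_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per draft-then-verify round.

    Assumes each draft token is accepted independently with probability
    accept_rate, which is a simplification of real model behavior.
    """
    if accept_rate >= 1.0:
        # Every draft token is accepted, plus one bonus token from the target.
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


def expected_speedup(accept_rate: float, num_draft: int,
                     draft_cost_ratio: float) -> float:
    """Expected speedup over plain autoregressive decoding with the target.

    draft_cost_ratio is the cost of one draft forward pass relative to one
    target forward pass. A round costs num_draft sequential draft passes
    plus a single target verification pass.
    """
    round_cost = num_draft * draft_cost_ratio + 1.0
    return expected_tokens_per_round(accept_rate, num_draft) / round_cost


if __name__ == "__main__":
    for ratio in (0.0, 0.1, 0.25):
        print(ratio, round(expected_speedup(0.8, 4, ratio), 2))
```

With an acceptance rate of 0.8 and four draft tokens per round, a free draft model would give roughly 3.36x under this model, while a draft pass costing a quarter of a target pass cuts that to roughly 1.68x. The sequential drafting term in the denominator is the overhead that motivates rethinking the drafting phase.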
Performance Benchmarks
According to the announcement, Domino delivers substantial performance gains. Applied to Qwen3 models, it achieved a throughput speedup of up to 5.8x.
Resources and Implementation
The paper, source code, and model weights are available from the following sources:
- Research Paper: The theoretical foundations are detailed in arXiv:2605.29707.
- Source Code: The official implementation is hosted on GitHub.
- Pre-trained Models: Model weights are available via Hugging Face.
Note: The source announcement is brief and does not describe the architecture of the decoupling mechanism. Refer to the paper listed above for the full technical specification.