Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino is a new speculative decoding framework that decouples causal modeling from the autoregressive drafting process, with reported throughput improvements of up to 5.8x on Qwen3 models.

Optimizing Inference via Speculative Decoding

Speculative decoding is a widely used technique for accelerating Large Language Model (LLM) inference: a smaller, faster "draft" model proposes several tokens, which a larger target model then verifies in parallel. The efficiency of this process is often limited by the drafting phase itself, because the draft model still generates its proposals autoregressively, one token at a time.
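The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of conventional greedy speculative decoding, not Domino's method: the names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical, and the "models" are toy functions that map a token prefix to the next token.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens accepted in it."""
    # 1. Drafting: the small model proposes k tokens one at a time (autoregressive).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verification: the target checks each proposed position.
    #    A real system scores all k positions in one batched forward pass;
    #    this toy version calls the target per position for clarity.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)

    # All k drafts matched, so the target's own next token is appended as a bonus.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative rounds until max_new_tokens tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new_tokens]


def toy_target(prefix: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Toy draft model: agrees with the target except when the target would emit 5."""
    nxt = (prefix[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


if __name__ == "__main__":
    print(speculative_step([3], toy_draft, toy_target, k=4))
    print(generate([0], toy_draft, toy_target, k=4, max_new_tokens=8))
```

With greedy verification the output matches what the target model would have produced on its own; the gain comes from accepting several tokens per target pass. The inner drafting loop is the sequential part that the paragraph above identifies as the bottleneck.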

The Domino Approach

The Domino framework decouples causal modeling from the autoregressive drafting process. According to the announcement, separating these two components changes how tokens are proposed and verified, reducing the computational bottlenecks of traditional speculative decoding pipelines.

Performance Benchmarks

According to the announcement, Domino delivers substantial performance gains: applied to Qwen3 models, the framework achieves a throughput speedup of up to 5.8x.
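For context on what drives a speedup figure in speculative decoding generally, a common back-of-envelope model relates speedup to three quantities: the per-token acceptance rate `alpha`, the number of drafted tokens `gamma`, and the cost `c` of one draft step relative to one target pass. The sketch below is not taken from the Domino announcement. It assumes independent acceptances, and the function names and sample values are purely illustrative rather than measurements.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target pass, assuming each drafted token
    is accepted independently with probability alpha."""
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    One round costs gamma draft steps (each c target-pass units) plus one
    target pass, so the round cost is gamma * c + 1.
    """
    return expected_tokens_per_round(alpha, gamma) / (gamma * c + 1)


if __name__ == "__main__":
    # Illustrative values only: the same acceptance rate with a free vs. costly drafter.
    for c in (0.0, 0.05, 0.2):
        print(f"c={c:.2f}  speedup={estimated_speedup(0.8, 4, c):.2f}")
```

In this model the `gamma * c` term is the drafting overhead: at a fixed acceptance rate, a cheaper drafting phase translates directly into a higher estimated speedup.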

Resources and Implementation

The paper, source code, and model weights are available from the following sources:

Research Paper: Detailed theoretical foundations can be found on arXiv:2605.29707.
Source Code: The official implementation is hosted on GitHub.
Pre-trained Models: Model weights are available via Hugging Face.

Note: Due to the brevity of the source announcement, specific architectural details regarding the decoupling mechanism are not provided in this summary; please refer to the linked paper for the full technical specification.
