DFlash: Accelerating LLM Throughput via Parallel Block Diffusion Speculative Decoding

Researchers from UC San Diego's z-lab have introduced DFlash, a speculative decoding framework that uses a lightweight block diffusion model to draft whole blocks of tokens in parallel, with reported throughput improvements of up to 15× on NVIDIA Blackwell hardware.

Overcoming the Serial Bottleneck in Speculative Decoding

Traditional speculative decoding typically relies on a small draft model that predicts tokens one at a time. This reduces the number of sequential forward passes the target model has to run, but the draft model still executes a serial loop in which each drafted token depends on the one before it. The serial loop is moved into a smaller architecture rather than replaced by true parallel generation, which prevents the system from fully using the compute available on modern GPUs.
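To make the serial loop concrete, here is a minimal sketch of token-by-token drafting. The names `draft_sequentially` and `toy_draft_next` are hypothetical and the "model" is a toy rule, not a real draft network; the point is only that drafting k tokens costs k dependent draft passes.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]


def toy_draft_next(tokens: List[int]) -> int:
    """Toy stand-in for a small autoregressive draft model (assumption)."""
    return (tokens[-1] + 1) % 100


def draft_sequentially(
    draft_next: NextTokenFn, context: List[int], k: int
) -> Tuple[List[int], int]:
    """Draft k tokens one at a time.

    Returns the drafted tokens and the number of draft forward passes.
    Each pass needs the token produced by the previous pass, so the
    passes cannot run in parallel.
    """
    drafted: List[int] = []
    passes = 0
    for _ in range(k):
        token = draft_next(context + drafted)  # depends on earlier drafts
        drafted.append(token)
        passes += 1
    return drafted, passes


drafted, passes = draft_sequentially(toy_draft_next, [1, 2, 3], k=4)
print(drafted, passes)
```

In this sketch the draft cost grows linearly with the number of drafted tokens, which is the bottleneck the article describes.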

The DFlash Architecture: Block Diffusion

DFlash departs from token-by-token drafting by using a lightweight block diffusion model. Instead of predicting a single next token, it drafts an entire block of tokens in a single forward pass, replacing sequential drafting with parallel block generation.

Once the block is drafted, the target model verifies all of its tokens in parallel. Because one iteration can accept several tokens at once, fewer draft-and-verify iterations are needed to generate a given length of text.
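The draft-then-verify loop described above can be sketched as follows. This is an illustration under stated assumptions, not DFlash's implementation: `draft_block` is a toy stand-in for the block diffusion drafter (with deliberately injected errors), `target_next` is a toy stand-in for the target model, and verification uses the standard greedy rule of keeping the longest drafted prefix the target agrees with, plus one token from the target itself.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption)


def target_next(tokens: List[int]) -> int:
    """Toy target model: the greedy next token is last + 1."""
    return (tokens[-1] + 1) % VOCAB


def draft_block(context: List[int], block_size: int) -> List[int]:
    """Toy block drafter: one call returns a whole block.

    Errors are injected wherever the correct token is a multiple of 7,
    so that verification has something to reject.
    """
    block = []
    for i in range(1, block_size + 1):
        token = (context[-1] + i) % VOCAB
        block.append((token + 1) % VOCAB if token % 7 == 0 else token)
    return block


def verify_block(context: List[int], block: List[int]) -> List[int]:
    """Keep the longest prefix the target agrees with, then add one target token.

    A real target model would score every position in one forward pass;
    the list comprehension below only imitates that result.
    """
    preds = [target_next(context + block[:i]) for i in range(len(block) + 1)]
    n_ok = 0
    while n_ok < len(block) and block[n_ok] == preds[n_ok]:
        n_ok += 1
    return block[:n_ok] + [preds[n_ok]]


def generate(context: List[int], n_new: int, block_size: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return the number of draft-and-verify iterations."""
    out = list(context)
    iterations = 0
    while len(out) - len(context) < n_new:
        out += verify_block(out, draft_block(out, block_size))
        iterations += 1
    return out[len(context):len(context) + n_new], iterations


tokens, iterations = generate([0], n_new=12, block_size=4)
print(tokens, iterations)
```

Under this greedy rule the output matches what the toy target would produce on its own, and every iteration emits at least one token, so a rejected draft costs speed but not correctness.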

Performance Benchmarks on NVIDIA Blackwell

The efficiency gains of DFlash are particularly pronounced on next-generation hardware. When tested with the gpt-oss-120b model on NVIDIA Blackwell GPUs, DFlash demonstrated a throughput increase of up to 15× compared to standard decoding methods. The gain is attributed to eliminating the token-by-token drafting bottleneck and to the parallel processing capabilities of the Blackwell architecture.

Note: Specific architectural hyperparameters of the diffusion model and detailed latency breakdowns were not provided in the source material.


Techyon
