Performance Degradation in Qwen 3.6 27B MTP: Impact of Speculative Decoding Parameters on Throughput

A user report indicates that enabling two speculative decoding flags in llama-server, --spec-type draft-mtp and --spec-draft-n-max, on the Qwen 3.6 27B MTP model lowers throughput in tokens per second (t/s) and leaves the GPU drawing well below its power limit.

Technical Observation

A user running Qwen 3.6 27B MTP (the quantized file unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf) under llama-server reports a performance bottleneck when Multi-Token Prediction (MTP) is used for speculative decoding. Although the hardware is high-end (an NVIDIA RTX 5090), the GPU does not appear to be fully utilized.

Hardware and Configuration

The test environment is an NVIDIA RTX 5090 with its power limit set to 475 W. The launch command included the following key parameters:

Context Window: 131,072 tokens (-c 131072)
Flash Attention: Enabled (-fa on)
Speculative Decoding: --spec-type draft-mtp with --spec-draft-n-max 2
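
To isolate the effect of the two speculative flags, one option is to launch the server twice with otherwise identical arguments. The sketch below only builds the two command lines. It assumes the model file sits in the working directory and uses llama-server's -m flag to point at it; the report lists only key parameters, so the user's actual command may have included others.

```python
import shlex

# Model file name as given in the report; its location is an assumption.
MODEL = "unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf"

# Arguments shared by both runs (context size and flash attention from the report).
BASE_ARGS = ["llama-server", "-m", MODEL, "-c", "131072", "-fa", "on"]

# Speculative decoding flags under test, exactly as quoted in the report.
SPEC_ARGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]


def build_command(speculative: bool) -> list[str]:
    """Return the argv for one run of the A/B comparison."""
    return BASE_ARGS + (SPEC_ARGS if speculative else [])


if __name__ == "__main__":
    for label, speculative in (("baseline", False), ("draft-mtp", True)):
        print(f"{label}: {shlex.join(build_command(speculative))}")
```

Running the same prompt against each configuration and comparing t/s and power draw would show whether the flags themselves account for the drop.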

Performance Bottlenecks

The main symptom is a gap between the configured power limit and the actual power draw. The GPU is capped at 475 W, but this configuration draws only about 300 W during inference. That shortfall coincides with a throughput of roughly 30 t/s, which suggests that speculative decoding overhead, or the specific MTP implementation, is keeping the GPU from being saturated.
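
The size of that gap can be put in numbers with a few lines of arithmetic. This sketch uses only the two approximate figures from the report (about 300 W drawn, 475 W limit); the function name is illustrative, and power draw is at best a rough proxy for how busy the GPU is.

```python
def power_utilization(draw_w: float, limit_w: float) -> float:
    """Fraction of the configured power limit that is actually drawn."""
    if limit_w <= 0:
        raise ValueError("power limit must be positive")
    return draw_w / limit_w


if __name__ == "__main__":
    draw_w, limit_w = 300.0, 475.0  # approximate figures from the report
    ratio = power_utilization(draw_w, limit_w)
    print(f"power utilization: {ratio:.0%}")
    print(f"unused headroom:   {limit_w - draw_w:.0f} W")
```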

Analysis of the Speculative Setup

The --spec-type draft-mtp option is intended to raise throughput by proposing several tokens at once, so that more than one token can be accepted per forward pass of the main model. In this instance, however, adding --spec-draft-n-max 2 appears to be counterproductive: the user sees t/s drop once the flags are enabled, and the result is lower than expected for a model of this size on an RTX 5090.
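
One way to reason about why a small draft length can still lose is a common simplified model of speculative decoding. The sketch below is not llama.cpp's implementation. It assumes each drafted token is accepted independently with the same probability, and it takes the relative cost of a draft-plus-verify step as an input; both accept_rate and step_cost_ratio are hypothetical parameters that the report does not provide and that would need to be measured.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens produced per verification pass.

    Assumes every drafted token is accepted independently with probability
    accept_rate and that acceptance stops at the first rejection. The main
    model always contributes one token, which is the k = 0 term.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    return sum(accept_rate ** k for k in range(n_draft + 1))


def estimated_speedup(accept_rate: float, n_draft: int, step_cost_ratio: float) -> float:
    """Speedup over plain decoding under the simplified model.

    step_cost_ratio is the cost of one draft-plus-verify step divided by the
    cost of one ordinary decode step (hypothetical, must be measured).
    """
    if step_cost_ratio <= 0:
        raise ValueError("step_cost_ratio must be positive")
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost_ratio


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the report.
    for accept_rate in (0.3, 0.5, 0.8):
        for step_cost_ratio in (1.2, 2.0):
            s = estimated_speedup(accept_rate, 2, step_cost_ratio)
            print(f"accept={accept_rate:.1f} cost={step_cost_ratio:.1f} speedup={s:.2f}x")
```

Under this model, a draft length of 2 yields at most 3 tokens per step, so any configuration where a step costs more than the expected number of accepted tokens ends up slower than plain decoding. Whether that is what happened here cannot be determined from the report alone.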

Note: The provided source is a brief user report; further benchmarking is required to determine if this is a driver-level issue, a limitation of the GGUF quantization, or an inefficiency in the llama-server MTP implementation.

Original Source

Techyon

Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization
