Performance Degradation in Qwen 3.6 27B MTP: Impact of Speculative Decoding Parameters on Throughput
A user report indicates that enabling speculative decoding with --spec-type draft-mtp and --spec-draft-n-max on the Qwen 3.6 27B MTP model leads to a significant drop in tokens per second (t/s) and leaves GPU power draw well below the configured limit.
Technical Observation
A user running Qwen 3.6 27B MTP (quantized as unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf) under llama-server reports a performance bottleneck when Multi-Token Prediction (MTP) is used for speculative decoding. Although the system runs on high-end hardware (NVIDIA RTX 5090), the GPU does not appear to reach its full compute capacity.
Hardware and Configuration
The test system is an NVIDIA RTX 5090 with its power limit set to 475 W. The llama-server command used the following key parameters:
- Context Window: 131,072 tokens (-c 131072)
- Flash Attention: Enabled (-fa on)
- Speculative Decoding: --spec-type draft-mtp with --spec-draft-n-max 2
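As a rough sketch, the parameters listed above could be assembled into a llama-server invocation as shown here. The report does not give the full command line, so the model path and the helper name build_command are placeholders for illustration. The -m flag is the only argument not quoted in the report, and any other flags the user may have passed are not reproduced.

```python
import shlex

# Hypothetical location of the GGUF file named in the report.
MODEL_PATH = "models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf"


def build_command(model_path: str, draft_n_max: int = 2) -> list[str]:
    """Assemble the llama-server arguments quoted in the report."""
    return [
        "llama-server",
        "-m", model_path,                        # model file (path is assumed)
        "-c", "131072",                          # context window in tokens
        "-fa", "on",                             # flash attention enabled
        "--spec-type", "draft-mtp",              # MTP-based speculative decoding
        "--spec-draft-n-max", str(draft_n_max),  # max drafted tokens per step
    ]


if __name__ == "__main__":
    # Print a shell-safe command line without launching the server.
    print(shlex.join(build_command(MODEL_PATH)))
```

Keeping the speculative flags as the last two pairs makes it easy to drop them for a baseline run without touching the rest of the command.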
Performance Bottlenecks
The primary issue identified is the gap between the GPU's power limit and its actual draw. Although the GPU is capped at 475 W, the current configuration draws only about 300 W during inference, roughly 63% of the limit. This under-utilization coincides with a throughput of roughly 30 tokens per second (t/s), suggesting that speculative decoding overhead or the specific MTP implementation introduces a bottleneck that keeps the GPU from being saturated.
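The power gap can be quantified with a few lines of Python. This is a minimal sketch, not part of the original report: the helper names are made up for illustration, and read_power assumes nvidia-smi is on PATH and reports both fields as plain numbers. The 300 W and 475 W figures are the ones quoted above.

```python
import subprocess


def parse_power(csv_line: str) -> tuple[float, float]:
    """Parse 'draw, limit' in watts from one nvidia-smi CSV line."""
    draw, limit = (float(field) for field in csv_line.split(","))
    return draw, limit


def utilization(draw_w: float, limit_w: float) -> float:
    """Fraction of the power limit actually drawn."""
    return draw_w / limit_w


def read_power(gpu_index: int = 0) -> tuple[float, float]:
    """Query current draw and limit; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=power.draw,power.limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_power(out.strip().splitlines()[0])


if __name__ == "__main__":
    draw, limit = 300.0, 475.0  # figures quoted in the report
    print(f"{utilization(draw, limit):.1%} of limit, {limit - draw:.0f} W unused")
    # prints "63.2% of limit, 175 W unused"
```

Sampling read_power repeatedly while a generation request is in flight would show whether the roughly 300 W figure is steady or an average over bursts.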
Analysis of the Speculative Setup
The use of --spec-type draft-mtp is intended to increase throughput by predicting multiple tokens per forward pass. Speculative decoding only pays off when enough drafted tokens are accepted to offset the cost of drafting and verifying them. In this specific instance, the addition of --spec-draft-n-max 2 appears to be counterproductive, resulting in lower t/s than expected for a model of this scale on the RTX 5090.
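To see how a speculative setup can end up slower than plain decoding, a simplified textbook model of speculative decoding is sketched below. It is not a description of the llama-server MTP implementation. It assumes each drafted token is accepted independently with probability alpha, that verifying the drafts costs about one target forward pass, and that each drafted token costs draft_cost target passes. All three parameters are hypothetical, since the report gives no acceptance rate or draft cost.

```python
def expected_tokens(alpha: float, n: int) -> float:
    """Expected tokens emitted per verification step.

    Sums alpha**k for k = 0..n: up to n drafted tokens are accepted in
    sequence, and the target model always contributes one token.
    """
    return sum(alpha ** k for k in range(n + 1))


def estimated_speedup(alpha: float, n: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding under the assumptions above.

    draft_cost is the cost of one drafted token relative to one target
    forward pass (an assumed parameter, not a measured one).
    """
    return expected_tokens(alpha, n) / (1.0 + n * draft_cost)


if __name__ == "__main__":
    n = 2  # matches --spec-draft-n-max 2
    for alpha in (0.2, 0.5, 0.8):  # hypothetical acceptance rates
        print(alpha, round(estimated_speedup(alpha, n, draft_cost=0.3), 3))
```

Under this model the estimate falls below 1 when acceptance is low relative to draft cost, which is one way a setting such as --spec-draft-n-max 2 could reduce throughput. Whether that applies here would need measured acceptance rates.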
Note: The provided source is a brief user report; further benchmarking is required to determine if this is a driver-level issue, a limitation of the GGUF quantization, or an inefficiency in the llama-server MTP implementation.
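One way to structure the follow-up benchmarking is to compare repeated runs with and without the speculative flags. The sketch below is an assumption about how such a comparison might be scored, not something taken from the report. Each run is a pair of generated token count and wall-clock seconds that the reader would collect, and all function names are illustrative.

```python
from statistics import median


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput of a single generation run."""
    return n_tokens / seconds


def summarize(runs: list[tuple[int, float]]) -> float:
    """Median t/s over repeated runs of (generated tokens, seconds)."""
    return median(tokens_per_second(n, s) for n, s in runs)


def relative_change(baseline_tps: float, candidate_tps: float) -> float:
    """Fractional change of the candidate relative to the baseline.

    Negative values mean the candidate configuration is slower.
    """
    return (candidate_tps - baseline_tps) / baseline_tps
```

Using the median rather than the mean limits the influence of warm-up runs. Holding the prompt, context size, and quantization fixed across both configurations isolates the effect of the speculative flags.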