Performance Benchmarks: Qwen3.6 35B MoE on RTX 5080 at 128k Context

Inference Performance Deep Dive: Qwen3.6 35B MoE on RTX 5080 — Analyzing MTP Convergence at 128K Context

This technical analysis benchmarks the Qwen3.6 35B MoE model, run as a GGUF under llama.cpp (build b9204), on an RTX 5080 16GB. It measures generation and prompt-processing speed at context lengths up to 128k tokens, and asks whether Multi-Token Prediction (MTP) still helps at long context when the model is only partially offloaded to the GPU.

Executive Summary of Findings

The main finding is a performance trade-off in Multi-Token Prediction (MTP). MTP gives a significant speed boost when the model fits entirely on the GPU (e.g., the 27B IQ3 configuration), but it becomes a bottleneck for larger, partially offloaded models such as the 35B Q4_K_XL. For the 35B MoE on 16GB, MTP is 23% slower than the non-MTP configuration, and at the target 128k context length the two converge to the same generation speed (~56 tok/s), so MTP offers no benefit there. Stable long-context performance depends not on MTP but on correctly configuring the `--fit-target` parameter, so that KV cache growth does not trigger Out-of-Memory (OOM) errors.
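To make the KV cache concern concrete, here is a rough sketch of how cache size scales with context length for a standard attention stack. The formula is the generic one for grouped-query attention, not a statement about Qwen3.6's actual architecture. Every dimension below (layer count, KV heads, head size, f16 cache precision) is a placeholder chosen for illustration, and models with hybrid or linear-attention layers would need less.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """Generic KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# HYPOTHETICAL dimensions for illustration only; these are NOT taken
# from the Qwen3.6 model card or from the benchmark.
PLACEHOLDER = dict(n_layers=40, n_kv_heads=4, head_dim=128)

for n_ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(n_ctx, **PLACEHOLDER) / 2**30
    print(f"{n_ctx:>7} tokens -> {gib:.2f} GiB of KV cache (f16)")
```

The useful part is the linear scaling rather than the absolute numbers: 16 times the context means 16 times the cache, and on a 16GB card that memory competes directly with model weights for VRAM.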

Architectural and Configuration Analysis

The Benchmark Setup

All tests were conducted using llama.cpp build b9204 on an RTX 5080 16GB paired with a Ryzen 9 9950X and 128GB of RAM. The objective was to test three configurations of Qwen3.6 that differ in quantization, and to compare each with and without MTP.

The core configurations tested included:

27B IQ3 + MTP: Fully on GPU (12.45 GB model size).
35B Q4_K_XL + MTP: Partial offload (~22 GB model size).
35B Q8_0 + MTP: Heavy offload (~36 GB model size).
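The placement labels in the list above follow from simple arithmetic against the card's 16 GB of VRAM. A minimal sketch, using only the model sizes quoted above; it ignores KV cache, compute buffers, and driver overhead, so the real amount held in system RAM will be larger. The function and variable names are illustrative.

```python
VRAM_GB = 16.0  # RTX 5080 capacity from the benchmark setup

# Model sizes as quoted in the configuration list (GB).
MODEL_SIZES_GB = {
    "27B IQ3": 12.45,
    "35B Q4_K_XL": 22.0,  # quoted as ~22 GB
    "35B Q8_0": 36.0,     # quoted as ~36 GB
}


def min_spill_gb(model_gb: float, vram_gb: float = VRAM_GB) -> float:
    """Lower bound on the weights that cannot live in VRAM.

    Ignores KV cache and compute buffers, so the real figure is higher.
    """
    return max(0.0, model_gb - vram_gb)


for name, size in MODEL_SIZES_GB.items():
    spill = min_spill_gb(size)
    if spill == 0:
        placement = "weights fit on the GPU"
    else:
        placement = f"at least {spill:.1f} GB of weights in system RAM"
    print(f"{name}: {size:.2f} GB -> {placement}")
```

Even as a lower bound, this shows why the two 35B quantizations behave differently from the 27B model: a sizable share of their weights has to be served from system RAM on every token.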

MTP Efficacy: A Context-Dependent Metric

Whether MTP helps depends heavily on where the model's weights sit in memory.

MTP Benefit (27B IQ3): The 27B model fits entirely on the 16GB GPU, and MTP raises generation speed from ~56 tok/s to 73 tok/s, a gain of roughly 30%. With the whole model resident in VRAM, it avoids the VRAM penalty that the MTP compute buffer imposes on partially offloaded models.
MTP Detriment (35B MoE): For the 35B MoE,


Techyon - AI News Aggregator

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help
