Llama.cpp MTP Testing on Qwen3.6 with RTX 5090

Validating Multi-Token Prediction (MTP) Support in llama.cpp using Qwen3.6 on an RTX 5090

This technical experiment details the setup and methodology for testing the Multi-Token Prediction (MTP) feature in llama.cpp, specifically using Qwen3.6 GGUF models on an RTX 5090 GPU. The tests isolate MTP behavior by holding quantization and context parameters constant while varying prompt length.

Experimental Setup and Configuration

The experiment ran on a single-GPU workstation configured to stress-test model inference. The hardware and software stack were fixed across all runs to keep the MTP comparison reliable.

Hardware and Software Stack

  • GPU Hardware: NVIDIA RTX 5090 (32 GB VRAM).
  • Operating System: Linux.
  • Inference Framework: llama.cpp, compiled from source at commit 4f13cb7. Note: the official ghcr.io/ggml-org/llama.cpp:server-cuda image had not yet incorporated the necessary MTP merge, so a custom Docker image was built with CUDA_DOCKER_ARCH=120 (compute capability 12.0, required for the RTX 5090's Blackwell architecture).
  • Model Architectures: Unsloth's Qwen3.6-27B-MTP-GGUF (Q5_K_M) and Qwen3.6-35B-A3B-MTP-GGUF (UD-Q4_K_M).
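Such a custom image can be produced roughly as follows. This is a sketch, not the exact commands from the post: the Dockerfile path and image tag are assumptions based on the llama.cpp repository's `.devops` layout, and `CUDA_DOCKER_ARCH=120` targets the RTX 5090's compute capability 12.0.

```shell
# Hypothetical build; verify the Dockerfile path against the repo's
# current .devops directory before running.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 4f13cb7   # commit used in the experiment

docker build -t llama.cpp:server-cuda-sm120 \
    --build-arg CUDA_DOCKER_ARCH=120 \
    -f .devops/cuda.Dockerfile .
```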

Methodology: Isolating MTP Functionality

A critical aspect of this test was isolating the MTP feature from the variability introduced by different quantization levels. This was achieved by running the same GGUF file for both the "MTP on" and "MTP off" configurations, toggling only the MTP-related command-line flags.

Inference Parameters

The standard inference parameters were set as follows:

  • Context window: 128k.
  • Flash attention: enabled.
  • KV cache: q8_0.
  • Temperature: 0.8.
  • Parallelism: `--parallel 1` (noted as required for MTP).
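A representative pair of server invocations under these parameters might look like the following. This is an illustrative sketch: the model filename is hypothetical, and all flags are standard llama-server options except the MTP flags, which are quoted from the post.

```shell
#!/bin/sh
# Model filename is illustrative, not the actual GGUF from the post.
MODEL=Qwen3.6-27B-MTP-Q5_K_M.gguf

# Baseline run: MTP off.
./llama-server -m "$MODEL" -c 131072 --flash-attn \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --temp 0.8 --parallel 1

# Comparison run: identical command plus the MTP flags.
./llama-server -m "$MODEL" -c 131072 --flash-attn \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --temp 0.8 --parallel 1 \
    --spec-type draft-mtp --spec-draft-n-max 3
```

Because only the last two flags differ between the runs, any throughput difference can be attributed to MTP rather than to quantization or context settings.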

Testing Protocols

MTP functionality was validated using two distinct prompt types, ensuring the test covered both short and extremely long context interactions:

  • Short Prompt Test: A request for a "short story about a cat" (approximately 400 tokens).
  • Long Prompt Test: A complex generation task: "Flappy Bird clone as a single HTML file" (approximately 3000 tokens).

To ensure statistical robustness, the tests were executed using three distinct seeds per configuration, and the results were subsequently averaged. The MTP feature was activated using the flags `--spec-type draft-mtp --spec-draft-n-max 3`.
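The seed-averaging step can be sketched as follows. This is a hypothetical helper, not code from the post; the seed values are illustrative, and `per_seed_tps` would be filled from the server's reported tokens-per-second timings.

```python
import statistics

# Three distinct seeds per configuration (values illustrative).
SEEDS = [101, 102, 103]

def average_throughput(per_seed_tps: dict[int, float]) -> float:
    """Average generation throughput (tokens/s) across the seed runs
    for one (model, MTP on/off, prompt) configuration."""
    missing = [s for s in SEEDS if s not in per_seed_tps]
    if missing:
        raise ValueError(f"missing runs for seeds: {missing}")
    return statistics.mean(per_seed_tps[s] for s in SEEDS)
```

Each (model, MTP setting, prompt) cell in the results table would then hold one such averaged figure.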

Tags: llama.cpp, Qwen3.6, MTP, RTX 5090, GGUF, LLMs, Inference, CUDA

Original Source: reddit/r/LocalLLaMA