Performance Anomalies: Analyzing PP and TG Metrics for Qwen 3.6-27B on AMD RX 7900 XTX

A report from the local LLM community describes unexpected prompt processing (PP) and token generation (TG) throughput when running the Qwen 3.6-27B model, in both non-MTP and MTP variants, on AMD hardware with the ROCm and Vulkan backends.

Hardware and Software Environment

The performance issues were observed on a system with an AMD Radeon RX 7900 XTX (gfx1100) GPU. The report lists the following software stack:

  • Operating System: Ubuntu 24.04.4 LTS
  • Linux Kernel: 6.8.0-124-generic
  • ROCm Version: 7.2.4
  • AMD Driver: 6.16.13
  • Vulkan API: 1.4.330 (Mesa 26.0.0-devel)
  • Inference Engine: llama.cpp (build b9630 / 8ed274ef4)

Benchmarking Observations

The user reports "unsatisfactory results" when running the Qwen 3.6-27B model. The concern is the gap between expected and observed throughput for both PP and TG.

Backend Comparison: ROCm vs. Vulkan

The report sets out to compare raw benchmarks of the ROCm backend against the Vulkan backend. Testing covers two model configurations:

  • Non-MTP: Standard model inference without Multi-Token Prediction.
  • MTP: Inference utilizing Multi-Token Prediction to accelerate generation.
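
A comparison of this kind reduces to a small table keyed by backend and variant. The sketch below is one way to organize it, not the reporter's method. The field names n_prompt, n_gen, and avg_ts are modeled on the JSON output of llama.cpp's llama-bench tool and should be treated as assumptions. The backend and variant labels are hypothetical tags added by whoever collects the runs. All numbers are placeholders, not measurements from the report.

```python
from collections import defaultdict


def summarize(records):
    """Group benchmark records into {(backend, variant): {"pp": tok/s, "tg": tok/s}}."""
    table = defaultdict(dict)
    for rec in records:
        # A run with prompt tokens only is a prompt-processing (PP) test;
        # a run with generated tokens only is a token-generation (TG) test.
        if rec["n_prompt"] > 0 and rec["n_gen"] == 0:
            kind = "pp"
        elif rec["n_gen"] > 0 and rec["n_prompt"] == 0:
            kind = "tg"
        else:
            continue  # skip mixed or empty tests
        table[(rec["backend"], rec["variant"])][kind] = rec["avg_ts"]
    return dict(table)


def backend_ratio(table, variant, kind, a="ROCm", b="Vulkan"):
    """Throughput of backend a divided by backend b for one variant and test kind."""
    return table[(a, variant)][kind] / table[(b, variant)][kind]


# Placeholder values only: these are NOT figures from the report.
records = [
    {"backend": "ROCm", "variant": "non-MTP", "n_prompt": 512, "n_gen": 0, "avg_ts": 200.0},
    {"backend": "ROCm", "variant": "non-MTP", "n_prompt": 0, "n_gen": 128, "avg_ts": 10.0},
    {"backend": "Vulkan", "variant": "non-MTP", "n_prompt": 512, "n_gen": 0, "avg_ts": 100.0},
    {"backend": "Vulkan", "variant": "non-MTP", "n_prompt": 0, "n_gen": 128, "avg_ts": 20.0},
]

table = summarize(records)
pp_ratio = backend_ratio(table, "non-MTP", "pp")
tg_ratio = backend_ratio(table, "non-MTP", "tg")
print(f"ROCm/Vulkan PP ratio: {pp_ratio:.2f}, TG ratio: {tg_ratio:.2f}")
```

A ratio above 1 means the first backend is faster for that test. Keeping PP and TG separate matters because a backend can lead in one and trail in the other.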

Note: The source material is a fragment of a larger discussion. It does not include the prompt-processing or decode throughput figures (tok/s), so the "strange numbers" it mentions cannot be analyzed quantitatively.
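
For reference, both metrics are plain rates: tokens divided by elapsed seconds, measured separately for the prompt phase and the decode phase. The following minimal sketch shows that arithmetic. The function names are hypothetical and the inputs are placeholders rather than values from the report.

```python
def tokens_per_second(n_tokens, seconds):
    """Return throughput in tok/s for one phase of inference."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / seconds


def throughput(n_prompt_tokens, prompt_seconds, n_generated_tokens, decode_seconds):
    """Compute PP and TG rates from separately timed phases."""
    return {
        "pp_tok_s": tokens_per_second(n_prompt_tokens, prompt_seconds),
        "tg_tok_s": tokens_per_second(n_generated_tokens, decode_seconds),
    }


# Placeholder timings only: NOT figures from the report.
result = throughput(n_prompt_tokens=1024, prompt_seconds=2.0,
                    n_generated_tokens=128, decode_seconds=8.0)
print(result)
```

Timing the two phases separately is the important part: dividing total tokens by total wall time blends PP and TG into a single number that describes neither.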

Technical Limitations

Because the source text is incomplete, the gap between expected and observed performance is not specified. Diagnosing whether the bottleneck lies in the ROCm kernels or in the Vulkan backend's compute shaders requires the quantization levels used and the actual figures behind the "strange numbers".
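
The quantization level matters because it sets a rough upper bound on TG. For a dense model, decoding one token at a time reads every weight once, so throughput cannot exceed memory bandwidth divided by model size in bytes. The sketch below works under stated assumptions: 960 GB/s is the published peak memory bandwidth of the RX 7900 XTX, the bits-per-weight values are hypothetical since the report does not state them, and KV cache and activation traffic are ignored. MTP can exceed this single-token bound because one pass over the weights may yield several accepted tokens.

```python
def model_bytes(n_params, bits_per_weight):
    """Approximate weight size in bytes for a given average bits per weight."""
    return n_params * bits_per_weight / 8


def tg_ceiling(bandwidth_bytes_per_s, n_params, bits_per_weight):
    """Bandwidth-bound upper limit on single-token decode throughput (tok/s)."""
    return bandwidth_bytes_per_s / model_bytes(n_params, bits_per_weight)


def shortfall(expected, observed):
    """Fraction by which observed throughput falls below the expected value."""
    return (expected - observed) / expected


PEAK_BANDWIDTH = 960e9  # bytes/s, published peak for the RX 7900 XTX
N_PARAMS = 27e9         # nominal parameter count of a 27B model

# Hypothetical bits-per-weight values; the report does not state the quantization.
for bpw in (4.5, 8.0, 16.0):
    ceiling = tg_ceiling(PEAK_BANDWIDTH, N_PARAMS, bpw)
    print(f"{bpw:>4} bits/weight: TG ceiling about {ceiling:.1f} tok/s")
```

Once real figures are available, shortfall(expected, observed) expresses the gap as a fraction of the expected value, which makes ROCm and Vulkan results directly comparable. Real decode rates land below the ceiling, so a large shortfall on one backend only is the signal worth chasing.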
