hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)
INSTRUCTIONS:
1. Analyze the provided news carefully (title + text/description).
2. Write a complete HTML article that includes:
- A catchy but technical title in
- A brief summary in
- Well-structured body with paragraphs and subheadings
- Link to the original source in the format Original Source
- Technical tags/labels in
RULES:
- Use precise technical language, suitable for AI developers and researchers.
- If information is insufficient, highlight the article's limitations with a brief note.
- Maintain a professional but accessible tone.
- Do NOT invent information — use only what is provided.
- Write the ENTIRE article in English, even if the source material contains Italian or other languages.
- Output ONLY valid HTML — no text before or after the HTML.
- Stop writing as soon as the article is complete. Do NOT repeat yourself or add extra content after
NEWS TO TRANSFORM:
---
Title: hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)
Source: reddit/r/localllama
URL: https://reddit.com/r/LocalLLaMA/comments/1tmq4s6/hipengine_fast_native_qwen_36_inference_for_rdna3/
Author: u/randomfoo2
Date: 2026-05-24T22:21:21
Description/Content: A few weeks ago, after finishing [FastDMS](https://www.reddit.com/r/LocalLLaMA/comments/1t3vlrx/fastdms_64x_kvcache_compression_running_faster/), I started thinking about RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough that, over the past couple of weeks, I turned those experiments into [hipEngine](https://github.com/shisa-ai/hipEngine), a new open-source (AGPLv3), ROCm-native local LLM inference engine.
It's Python-based, but with no heavy PyTorch dependency. The entire hot path is HIP/C++, making liberal use of AMD-native libraries like hipBLASLt, hipGraph, AOTriton, etc.
### gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)
The initial implementation runs Qwen 3.6 (MoE and dense) competitively with llama.cpp. The [ParoQuant](https://github.com/shisa-ai/paroquant) 4.68bpw quant (which I've also ported to be ROCm-compatible) has better c=1 prefill ("prompt processing") at every tested context length, from 512 to 128K, on gfx1100 (W7900/7900 XTX):
### Prefill tok/s
| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
| --- | ---: | ---: | ---: |
| 512/128 | **2718.497** | 2258.847 | 2436.049 |
| 4K/128 | **2838.773** | 2576.673 | 2176.905 |
| 32K/128 | **2074.699** | 1893.967 | 1496.409 |
| 128K/128 | **1055.454** | 109.152 | 85.487 |
| 32K/128 | **0.875** | 0.000 | 0.000 |
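
To make the prefill comparison easier to read, the ratios can be computed directly from the table. The sketch below is illustrative only: the `prefill` dictionary and the `speedups` helper are names made up for this example, and it uses just the four workloads from 512 to 128K (the extra 32K/128 row is left out because a ratio against 0.000 is undefined).

```python
# Prefill throughput in tok/s, copied from the table above.
# Only the four context lengths from 512 to 128K are included.
prefill = {
    "512/128": {"hipEngine PARO": 2718.497, "llama.cpp HIP": 2258.847, "llama.cpp Vulkan": 2436.049},
    "4K/128": {"hipEngine PARO": 2838.773, "llama.cpp HIP": 2576.673, "llama.cpp Vulkan": 2176.905},
    "32K/128": {"hipEngine PARO": 2074.699, "llama.cpp HIP": 1893.967, "llama.cpp Vulkan": 1496.409},
    "128K/128": {"hipEngine PARO": 1055.454, "llama.cpp HIP": 109.152, "llama.cpp Vulkan": 85.487},
}


def speedups(table, engine="hipEngine PARO"):
    """Return engine throughput divided by each other backend's throughput."""
    result = {}
    for workload, row in table.items():
        result[workload] = {
            name: row[engine] / value
            for name, value in row.items()
            if name != engine
        }
    return result


if __name__ == "__main__":
    for workload, ratios in speedups(prefill).items():
        parts = ", ".join(f"{ratio:.2f}x vs {name}" for name, ratio in ratios.items())
        print(f"{workload}: {parts}")
```

A ratio above 1.0 means hipEngine is faster for that workload. The same helper works on the decode numbers if they are put in a dictionary of the same shape.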
### Decode tok/s
| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
| --- | ---: | ---: | ---: |
| 512/128 | **62.060** | 50.537 | 57.615 |
| 4K/128 | **63.605** | 49.379 | 55.027 |
| 32K/128 | **50.629** | 43.435 | 44.576 |
| 128K/128 | 30.245 | 43.435 | 26.935 |
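
The two tables can be combined into a rough end-to-end time for a single request. This is a back-of-the-envelope sketch, not something from the benchmark itself: it assumes a workload label like `32K/128` means 32K prompt tokens followed by 128 generated tokens, that `K` means 1024, and that total time is simply prompt tokens divided by prefill tok/s plus generated tokens divided by decode tok/s. The function names are hypothetical.

```python
def parse_tokens(text):
    """Parse a token count such as '512' or '32K' (K assumed to mean 1024)."""
    text = text.strip().upper()
    if text.endswith("K"):
        return int(text[:-1]) * 1024
    return int(text)


def parse_workload(label):
    """Split 'PROMPT/GEN' into (prompt_tokens, generated_tokens)."""
    prompt, generated = label.split("/")
    return parse_tokens(prompt), parse_tokens(generated)


def estimate_seconds(label, prefill_tps, decode_tps):
    """Estimate wall time as prefill time plus decode time (assumed model)."""
    prompt_tokens, generated_tokens = parse_workload(label)
    return prompt_tokens / prefill_tps + generated_tokens / decode_tps


if __name__ == "__main__":
    # 32K/128 row: (prefill tok/s, decode tok/s) taken from the two tables.
    rows = {
        "hipEngine PARO": (2074.699, 50.629),
        "llama.cpp HIP": (1893.967, 43.435),
        "llama.cpp Vulkan": (1496.409, 44.576),
    }
    for name, (prefill_tps, decode_tps) in rows.items():
        seconds = estimate_seconds("32K/128", prefill_tps, decode_tps)
        print(f"{name}: about {seconds:.1f} s for 32K/128")
```

At long contexts the prefill term dominates this estimate, so prefill throughput matters more than decode speed for short generations on large prompts.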