hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

Source: reddit/r/localllama
URL: https://reddit.com/r/LocalLLaMA/comments/1tmq4s6/hipengine_fast_native_qwen_36_inference_for_rdna3/
Author: u/randomfoo2
Date: 2026-05-24T22:21:21

A few weeks ago, after finishing [FastDMS](https://www.reddit.com/r/LocalLLaMA/comments/1t3vlrx/fastdms_64x_kvcache_compression_running_faster/), I started thinking about RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough that, over the past couple of weeks, I turned those experiments into [hipEngine](https://github.com/shisa-ai/hipEngine), a new open source (AGPLv3) ROCm-native local LLM inference engine. It's Python-based, but with no heavy PyTorch dependency. The entire hot path is HIP/C++, making liberal use of AMD native libs like hipBLASLt, hipGraph, AOTriton, etc.

### gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)

The initial implementation has Qwen 3.6 (MoE and dense) running competitively with llama.cpp. With [ParoQuant](https://github.com/shisa-ai/paroquant) (which I've also ported to be ROCm compatible) at 4.68bpw, it has better c=1 prefill ("prompt processing") at every tested context length, from 512 to 128K, on gfx1100 (W7900/7900 XTX):

### Prefill tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
| --- | ---: | ---: | ---: |
| 512/128 | **2718.497** | 2258.847 | 2436.049 |
| 4K/128 | **2838.773** | 2576.673 | 2176.905 |
| 32K/128 | **2074.699** | 1893.967 | 1496.409 |
| 128K/128 | **1055.454** | 109.152 | 85.487 |
| 32K/128 | **0.875** | 0.000 | 0.000 |

### Decode tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
| --- | ---: | ---: | ---: |
| 512/128 | **62.060** | 50.537 | 57.615 |
| 4K/128 | **63.605** | 49.379 | 55.027 |
| 32K/128 | **50.629** | 43.435 | 44.576 |
| 128K/128 | 30.245 | 43.435 | 26.935 |
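
To put the gfx1100 figures quoted in this post into relative terms, here is a minimal Python sketch that computes speedup ratios and a rough per-request time from the published tok/s numbers. It rests on a few assumptions that the post does not state: a workload label such as `512/128` is read as prompt tokens / generated tokens, `K` is taken to mean 1024, and total time is approximated as prefill time plus decode time with no other overhead. The helper names (`parse_workload`, `speedup`, `estimated_seconds`) are hypothetical and are not part of hipEngine or llama.cpp. The second `32K/128` prefill row is left out because its label repeats an earlier one.

```python
# Throughput figures (tok/s) copied from the post's gfx1100 tables.
# Tuple order: hipEngine PARO, llama.cpp HIP, llama.cpp Vulkan.
BACKENDS = ("hipEngine PARO", "llama.cpp HIP", "llama.cpp Vulkan")

PREFILL = {
    "512/128": (2718.497, 2258.847, 2436.049),
    "4K/128": (2838.773, 2576.673, 2176.905),
    "32K/128": (2074.699, 1893.967, 1496.409),
    "128K/128": (1055.454, 109.152, 85.487),
}

DECODE = {
    "512/128": (62.060, 50.537, 57.615),
    "4K/128": (63.605, 49.379, 55.027),
    "32K/128": (50.629, 43.435, 44.576),
    "128K/128": (30.245, 43.435, 26.935),
}


def parse_workload(label):
    """Split 'PROMPT/GEN' into token counts (assumption: 'K' means 1024)."""
    def to_tokens(text):
        return int(text[:-1]) * 1024 if text.endswith("K") else int(text)

    prompt, generated = label.split("/")
    return to_tokens(prompt), to_tokens(generated)


def speedup(table, label, baseline):
    """Ratio of hipEngine throughput to the baseline backend's throughput."""
    row = table[label]
    return row[0] / row[baseline]


def estimated_seconds(label, backend):
    """Rough request time: prompt / prefill rate + generated / decode rate."""
    prompt, generated = parse_workload(label)
    return prompt / PREFILL[label][backend] + generated / DECODE[label][backend]


if __name__ == "__main__":
    for label in PREFILL:
        for idx in (1, 2):
            print(
                f"{label:>9} vs {BACKENDS[idx]:<16} "
                f"prefill x{speedup(PREFILL, label, idx):.2f}  "
                f"decode x{speedup(DECODE, label, idx):.2f}"
            )
        times = ", ".join(
            f"{name}: {estimated_seconds(label, i):.1f}s"
            for i, name in enumerate(BACKENDS)
        )
        print(f"{label:>9} estimated time -> {times}")
```

A ratio above 1 means hipEngine is faster than the baseline for that row, and a ratio below 1 means the baseline is faster. Because decode rates are far lower than prefill rates, the 128 generated tokens dominate the estimate at short contexts, while prefill time dominates at long contexts.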