Optimizing WebGPU Prefill Performance: Matrix Multiplication Refactor for K-Quants in llama.cpp

A recent pull request to the ggml-org/llama.cpp repository introduces significant performance gains for k-quantized models on WebGPU, specifically targeting matrix multiplication (matmul) efficiency during the prefill stage.

Technical Overview

Pull Request #24225, submitted by contributor yomaytk, optimizes the ggml-webgpu backend. Its primary goal is to improve prefill speed for k-quants by refactoring the matrix multiplication kernels across quantization types, including Q4, Q5, Q8, and the k-quants.
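To make the idea of a quantized matmul concrete, here is a minimal Python sketch of block-wise quantization followed by a matrix multiplication over a batch of prompt tokens. It is an illustration only and not the PR's kernel: the function names, the block size of 4, and the single-scale-per-block format are assumptions chosen for readability. The actual k-quant formats use 256-weight super-blocks with additional per-sub-block scales, and the real kernels are GPU shaders rather than Python loops.

```python
# Simplified block quantization. NOT the real ggml k-quant layout.
BLOCK = 4  # hypothetical block size, chosen only to keep the example small


def quantize_row(row, block=BLOCK):
    """Split a weight row into blocks; keep one float scale plus small ints per block."""
    blocks = []
    for i in range(0, len(row), block):
        chunk = row[i:i + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 7 if amax else 1.0  # map values onto the signed range [-7, 7]
        q = [round(v / scale) for v in chunk]
        blocks.append((scale, q))
    return blocks


def dot_quantized(blocks, x, block=BLOCK):
    """Dot product of one quantized weight row with a float activation vector."""
    total = 0.0
    for b, (scale, q) in enumerate(blocks):
        xs = x[b * block:(b + 1) * block]
        # Accumulate against the integer weights, then apply the scale once per block.
        total += scale * sum(qi * xi for qi, xi in zip(q, xs))
    return total


def matmul_quantized(w_q, xs):
    """w_q: list of quantized rows; xs: one activation vector per prompt token."""
    return [[dot_quantized(row, x) for row in w_q] for x in xs]


weights = [[7.0, -7.0, 0.0, 7.0]]
w_q = [quantize_row(r) for r in weights]
tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
out = matmul_quantized(w_q, tokens)
print(out)
```

During prefill the whole prompt is processed as a batch, so every weight row is reused across many token vectors (the outer loop in matmul_quantized). That reuse is why matrix multiplication throughput dominates this stage.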

The refactor changes how the GPU kernels process quantized weights, reducing overhead and increasing throughput during prefill, the initial pass over the input tokens. Prefill throughput directly affects time-to-first-token (TTFT) in browser-based LLM deployments.
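As a rough illustration of why prefill throughput matters for TTFT, the sketch below estimates prefill time as prompt length divided by throughput. The helper name prefill_seconds is hypothetical, and the model is deliberately simplistic: it ignores model loading, shader compilation, and the first decode step, all of which also contribute to TTFT. The throughput figures are the Q3_K (qwen35 4B) numbers from the benchmarks in this article.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate prefill time as prompt length divided by prefill throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


# Q3_K (qwen35 4B) throughput on a 512-token prompt, before and after the refactor.
before = prefill_seconds(512, 92.54)
after = prefill_seconds(512, 302.24)
print(f"before: {before:.2f} s, after: {after:.2f} s")
```

Under these assumptions, a 512-token prompt drops from roughly 5.5 seconds of prefill to roughly 1.7 seconds.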

Performance Benchmarks

Benchmarks run on an Apple M2 Pro chip using the pp512 test (prompt processing of a 512-token prompt) show substantial speedups across multiple models and quantization schemes. Throughput is reported in tokens per second (t/s). The improvements are most pronounced for the lower-bit k-quants:

Observed Speedups

  • Q2_K (qwen3 0.6B): Increased from 817.86 t/s to 1991.81 t/s, representing a 2.44x speedup.
  • Q3_K (qwen35 4B): Increased from 92.54 t/s to 302.24 t/s, representing a 3.27x speedup.
  • Q3_K (gemma4 E4B): Increased from 79.06 t/s to 298.73 t/s, representing a 3.78x speedup.
  • Q4_K (qwen35 4B): Increased from 243.82 t/s to 327.24 t/s, representing a 1.34x speedup.
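The speedup factors follow directly from the reported throughput values: each is the new throughput divided by the old one. A short sketch that recomputes them, using only the numbers listed above (the dictionary and function names are illustrative):

```python
# (before, after) pp512 throughput in tokens per second, as reported above.
results = {
    "Q2_K (qwen3 0.6B)": (817.86, 1991.81),
    "Q3_K (qwen35 4B)": (92.54, 302.24),
    "Q3_K (gemma4 E4B)": (79.06, 298.73),
    "Q4_K (qwen35 4B)": (243.82, 327.24),
}


def speedup(before: float, after: float) -> float:
    """Speedup factor: new throughput divided by old throughput."""
    return after / before


speedups = {name: round(speedup(b, a), 2) for name, (b, a) in results.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Recomputing the ratios this way is a quick check that the quoted factors match the raw throughput figures.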

Impact on Local LLM Deployment

These optimizations allow quantized models to run more efficiently in WebGPU environments, narrowing the gap between native execution and browser-based inference. By speeding up the matmul operations for k-quants, the change helps ensure that memory-efficient quantization does not come at a prohibitive cost to compute speed on compatible hardware.

Note: The reported data is limited to these benchmarks on the M2 Pro; performance gains may vary across GPU architectures and WebGPU implementations.
