SYCL Integration of Multi-Column MMVQ: Accelerating Speculative Decoding on Intel Arc GPUs

A performance optimization merged into the llama.cpp repository ports the multi-column MMVQ kernel (quantized matrix-vector multiplication, named after the backend's mul_mat_vec_q routines) from the CUDA backend to SYCL. The pull request reports a speculative decoding speedup of roughly 45% on Intel Arc GPUs.

Technical Overview

A recent contribution to the ggml-org/llama.cpp project (Pull Request #21845) ports multi-column MMVQ to the SYCL backend. MMVQ refers to the matrix-vector multiplication kernels that operate directly on quantized weights, not to a form of vector quantization. The multi-column variant was originally written for the CUDA backend, which targets NVIDIA GPUs. The port brings it to SYCL, the Khronos programming model that Intel's oneAPI toolchain implements, with the reported gains measured on Intel Arc graphics hardware.

Impact on Speculative Decoding

The primary benefit of this port is faster speculative decoding. The pull request title reports a speedup of approximately 45% on Intel Arc GPUs. In speculative decoding, a small draft model proposes several tokens and the target model verifies them in a single pass, so the target model's matrix multiplications see a handful of activation columns at once instead of one. A multi-column MMVQ kernel can process those columns in one pass over the quantized weight matrix, which lowers verification latency and raises overall tokens-per-second throughput.
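The idea of reusing each quantized weight block across several activation columns can be illustrated with a small CPU sketch in plain C++. This is not the SYCL kernel from the pull request: the `QBlock` layout, the function names, and the column-major layout of `X` and `Y` are all assumptions made for illustration, and real backends use packed formats and GPU work-groups instead of simple loops.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified quantized block: 32 int8 weights sharing one scale.
// Loosely modeled on 8-bit block formats; NOT ggml's actual memory layout.
constexpr std::size_t kBlock = 32;
struct QBlock {
    float scale;
    std::int8_t q[kBlock];
};

// Single-column baseline: y = W * x, where W is stored row by row as QBlocks.
std::vector<float> matvec_q(const std::vector<QBlock>& W, std::size_t rows,
                            std::size_t blocks_per_row,
                            const std::vector<float>& x) {
    assert(W.size() == rows * blocks_per_row);
    assert(x.size() == blocks_per_row * kBlock);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t b = 0; b < blocks_per_row; ++b) {
            const QBlock& blk = W[r * blocks_per_row + b];
            float acc = 0.0f;
            for (std::size_t i = 0; i < kBlock; ++i) {
                acc += static_cast<float>(blk.q[i]) * x[b * kBlock + i];
            }
            y[r] += blk.scale * acc;
        }
    }
    return y;
}

// Multi-column variant: Y = W * X for `ncols` activation vectors.
// Each weight block is fetched once and reused for every column, instead of
// re-reading the whole weight matrix once per column.
// X holds the columns back to back (column c starts at c * k);
// Y is laid out the same way (column c starts at c * rows).
std::vector<float> matvec_q_multicol(const std::vector<QBlock>& W,
                                     std::size_t rows,
                                     std::size_t blocks_per_row,
                                     const std::vector<float>& X,
                                     std::size_t ncols) {
    const std::size_t k = blocks_per_row * kBlock;
    assert(W.size() == rows * blocks_per_row);
    assert(X.size() == ncols * k);
    std::vector<float> Y(ncols * rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t b = 0; b < blocks_per_row; ++b) {
            const QBlock& blk = W[r * blocks_per_row + b];  // loaded once
            for (std::size_t c = 0; c < ncols; ++c) {       // reused per column
                float acc = 0.0f;
                for (std::size_t i = 0; i < kBlock; ++i) {
                    acc += static_cast<float>(blk.q[i]) * X[c * k + b * kBlock + i];
                }
                Y[c * rows + r] += blk.scale * acc;
            }
        }
    }
    return Y;
}
```

Calling `matvec_q` once per drafted token and calling `matvec_q_multicol` once for all of them produce the same outputs; the difference is that the multi-column form walks the weight data a single time. For memory-bound kernels that reduction in weight traffic is the likely source of the gain, though the source does not break down where the reported speedup comes from.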

Implementation and Availability

The optimization is included in llama.cpp starting with build b9519. Intel Arc users running the SYCL backend should update to that build or a later one to get the performance gains.

Technical Requirements

Software: llama.cpp build b9519 or newer.
Hardware: Intel Arc GPU.
Backend: SYCL.
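For scripts or tooling that need to check the software requirement, a minimal C++ sketch of comparing a build identifier against b9519 might look like the following. It assumes identifiers take the form `b` followed by a decimal number, as written in this article; the function names `parse_build_tag` and `has_multicol_mmvq` are hypothetical and are not part of llama.cpp.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Parses a tag of the assumed form "b<digits>" (for example "b9519").
// Returns the build number, or -1 if the tag does not match that form.
int parse_build_tag(const std::string& tag) {
    // Reject empty numbers and cap the length so the int cannot overflow.
    if (tag.size() < 2 || tag.size() > 10 || tag[0] != 'b') {
        return -1;
    }
    int n = 0;
    for (std::size_t i = 1; i < tag.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(tag[i]))) {
            return -1;
        }
        n = n * 10 + (tag[i] - '0');
    }
    return n;
}

// True when the build is at or past the first build the article names (b9519).
bool has_multicol_mmvq(const std::string& tag, int min_build = 9519) {
    assert(min_build >= 0);
    const int n = parse_build_tag(tag);
    return n >= 0 && n >= min_build;
}
```

A check like this only covers the software requirement; whether the SYCL backend is enabled and an Intel Arc GPU is present still has to be confirmed separately.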

Note: This article is based on a community report; the source did not provide details of how multi-column MMVQ is implemented inside the SYCL kernels.

Original Source


Techyon

sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp
