SYCL Integration of Multi-Column MMVQ: Accelerating Speculative Decoding on Intel Arc GPUs

A performance optimization has been merged into the llama.cpp repository: the multi-column MMVQ kernel (quantized matrix-vector multiplication, mul_mat_vec_q in ggml) has been ported from the CUDA backend to SYCL, with a reported speculative decoding speedup of roughly 45% for Intel Arc GPU users.

Technical Overview

A recent contribution to the ggml-org/llama.cpp project (Pull Request #21845) ports multi-column MMVQ to the SYCL backend. MMVQ is llama.cpp's shorthand for its quantized matrix-vector multiplication kernels (mul_mat_vec_q), not a quantization scheme in its own right. The multi-column variant, originally written for the CUDA backend, now runs on SYCL, the cross-architecture programming model behind llama.cpp's Intel GPU backend, and specifically targets Intel Arc graphics hardware.

Impact on Speculative Decoding

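The primary benefit of this port is faster speculative decoding. The community report cites a speedup of approximately 45% on Intel Arc GPUs; no further benchmark details (model, quantization type, or draft length) were given. The gain is plausible because of how verification works: the target model checks several drafted tokens in a single pass, so each quantized weight matrix is multiplied by a handful of activation vectors rather than just one. A multi-column MMVQ kernel handles those vectors together, reusing each loaded block of weights across all columns instead of re-reading the weights once per token. That lowers verification latency and raises overall tokens-per-second throughput.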
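To make the weight-reuse idea concrete, here is a minimal CPU-side sketch in C++. It is not the SYCL kernel from the pull request: the QBlock layout (32 int8 values with one float scale), the helper block_dot, and the function name mul_mat_vec_q_multi are all hypothetical names chosen for illustration. It only shows the loop structure in which each weight block is read once and applied to every activation column.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block format for illustration only: 32 int8 values sharing
// one float scale. This is NOT the actual ggml struct layout.
constexpr int kBlock = 32;

struct QBlock {
    float  scale;
    int8_t q[kBlock];
};

// Dot product of two quantized blocks: accumulate in integers,
// then apply both scales once at the end.
inline float block_dot(const QBlock& w, const QBlock& x) {
    int32_t acc = 0;
    for (int i = 0; i < kBlock; ++i) {
        acc += int32_t(w.q[i]) * int32_t(x.q[i]);
    }
    return w.scale * x.scale * float(acc);
}

// Multi-column quantized matrix-vector product (illustrative).
//   W: nrows x nblocks weight blocks, row-major
//   X: ncols x nblocks activation blocks, one quantized vector per column
//   returns Y with Y[c * nrows + r] = dot(row r of W, column c of X)
std::vector<float> mul_mat_vec_q_multi(const std::vector<QBlock>& W,
                                       const std::vector<QBlock>& X,
                                       std::size_t nrows,
                                       std::size_t nblocks,
                                       std::size_t ncols) {
    std::vector<float> Y(ncols * nrows, 0.0f);
    for (std::size_t r = 0; r < nrows; ++r) {
        for (std::size_t b = 0; b < nblocks; ++b) {
            const QBlock& w = W[r * nblocks + b];        // weight block read once
            for (std::size_t c = 0; c < ncols; ++c) {    // reused for every column
                Y[c * nrows + r] += block_dot(w, X[c * nblocks + b]);
            }
        }
    }
    return Y;
}
```

With ncols set to 1 this reduces to an ordinary quantized matrix-vector product. With ncols equal to the number of tokens being verified, the weights are traversed once instead of once per token, which is where the saving comes from when memory bandwidth is the bottleneck. A GPU kernel would distribute rows and blocks across work-items, which this sketch does not attempt to show.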

Implementation and Availability

The optimization is available in llama.cpp starting with build b9519. Intel Arc users running the SYCL backend need that build or a newer one to get the performance gain.

Technical Requirements

  • Software: llama.cpp build b9519 or newer.
  • Hardware: Intel Arc GPU.
  • Backend: SYCL.

Note: This article is based on a community report; the source did not describe how the multi-column MMVQ kernel is implemented in SYCL.
