Significant Performance Gains for Intel Arc via Speculative Decoding in llama.cpp

A recent contribution to the llama.cpp SYCL backend fixes a slowdown that made speculative decoding counterproductive on Intel Arc GPUs. With the change merged, reported token-generation speedups range from roughly 40% to 90%, depending on quantization.

Optimization of the SYCL Backend

For users running Large Language Models (LLMs) on Intel Arc hardware via the SYCL backend, speculative decoding has historically been a net loss. Earlier benchmarks showed that enabling Multi-Token Prediction (MTP) actually degraded performance, with some users reporting a 12% drop in speed compared to standard single-token generation on Q4-quantized models.
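Why speculative decoding can end up slower than plain decoding is easier to see with a toy cost model. The sketch below is an illustration under stated assumptions, not a measurement: it assumes each drafted token is accepted independently with probability p_accept, and all cost values are hypothetical numbers expressed relative to one ordinary single-token decode pass.

```python
def expected_tokens(p_accept: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    p_accept; the pass always yields at least one token.
    """
    if p_accept >= 1.0:
        return float(n_draft + 1)
    return (1.0 - p_accept ** (n_draft + 1)) / (1.0 - p_accept)


def relative_speed(p_accept, n_draft, verify_cost, draft_cost_per_token):
    """Throughput relative to plain decoding (1.0 means no change).

    Costs are in units of one single-token decode pass.
    """
    step_cost = verify_cost + n_draft * draft_cost_per_token
    return expected_tokens(p_accept, n_draft) / step_cost


# Hypothetical inputs, chosen only to show the two regimes.
slow_verify = relative_speed(0.7, 3, verify_cost=4.0, draft_cost_per_token=0.05)
fast_verify = relative_speed(0.7, 3, verify_cost=1.3, draft_cost_per_token=0.05)
print(f"verify pass ~4x a decode pass:   {slow_verify:.2f}x")
print(f"verify pass ~1.3x a decode pass: {fast_verify:.2f}x")
```

In this model the result falls below 1.0 whenever verifying a handful of drafted tokens costs nearly as much as decoding them one at a time, which is the kind of situation a slowdown from enabling MTP would suggest.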

The Technical Breakthrough: Multi-Column MMVQ Port

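The performance bottleneck was addressed by porting the multi-column MMVQ path from the CUDA backend to the SYCL implementation. MMVQ refers to llama.cpp's quantized matrix-vector multiplication kernels (mul_mat_vec_q), not to a quantization scheme. During the verification phase of speculative decoding, the target model evaluates several drafted tokens in one pass, so the kernel has to handle a few activation columns at once rather than a single vector. The multi-column path processes those columns together, allowing Intel Arc GPUs to use their compute capabilities more effectively during verification.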
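To illustrate what "multi-column" means for a quantized matrix-vector product, here is a minimal pure-Python sketch. It is not the SYCL or CUDA kernel: the block layout is a simplified Q8_0-style format (32 values sharing one scale), and the function names and the block_reads counter are hypothetical, introduced only to show that handling several activation columns in one pass lets each weight block be read once instead of once per column.

```python
import random

BLOCK = 32  # values per quantization block (simplified Q8_0-style layout)


def quantize_row(row):
    """Quantize a float row into a list of (scale, int8 values) blocks."""
    blocks = []
    for i in range(0, len(row), BLOCK):
        chunk = row[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks


def matvec_single(qweights, x, counter):
    """One activation column per call: every weight block is read per call."""
    out = []
    for qrow in qweights:
        acc = 0.0
        for b, (scale, q) in enumerate(qrow):
            counter["block_reads"] += 1
            base = b * BLOCK
            acc += scale * sum(qv * x[base + j] for j, qv in enumerate(q))
        out.append(acc)
    return out


def matvec_multi(qweights, xs, counter):
    """Several activation columns: each weight block is read once and reused."""
    out = [[0.0] * len(qweights) for _ in xs]
    for r, qrow in enumerate(qweights):
        for b, (scale, q) in enumerate(qrow):
            counter["block_reads"] += 1
            base = b * BLOCK
            for c, x in enumerate(xs):
                out[c][r] += scale * sum(qv * x[base + j] for j, qv in enumerate(q))
    return out


# Hypothetical sizes: 8 output rows, 64 inputs, 4 drafted tokens to verify.
random.seed(0)
rows, cols, n_draft = 8, 64, 4
W = [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
xs = [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(n_draft)]
qW = [quantize_row(r) for r in W]

single_counter = {"block_reads": 0}
ref = [matvec_single(qW, x, single_counter) for x in xs]
multi_counter = {"block_reads": 0}
got = matvec_multi(qW, xs, multi_counter)
print(single_counter["block_reads"], multi_counter["block_reads"])
```

Both routes produce the same outputs, but the single-column route touches the weights once per drafted token while the multi-column route touches them once in total. On a GPU, where this kind of kernel is typically limited by how fast weights can be read, that difference is the plausible source of the gain.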

Benchmark Improvements

With the multi-column MMVQ path in place, the reported gains vary by quantization level:

Q4 Quantization: Observed speed increase of approximately 40%.
Q8 Quantization: Observed speed increase of 90% or more.
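Readers who want to check these figures on their own hardware can compute the percent change from two tokens-per-second measurements. The helper below is a small sketch. The throughput values are hypothetical, picked only so the results line up with the percentages quoted in this article. They are not measurements from the source.

```python
def percent_change(baseline_tps: float, new_tps: float) -> float:
    """Percent change in tokens per second relative to a baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (new_tps / baseline_tps - 1.0) * 100.0


# Hypothetical measurements in tokens/s, not taken from the source.
baseline = 20.0
examples = {"regressed": 17.6, "q4_like": 28.0, "q8_like": 38.0}
changes = {
    name: round(percent_change(baseline, tps), 1)
    for name, tps in examples.items()
}
print(changes)
```

When comparing runs, keep the model, quantization, prompt, and context length identical so that the only variable is whether speculative decoding is enabled.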

Availability and Implementation

These optimizations have been merged into the master branch as of build b9519. Users can access these improvements by pulling the latest version of the llama.cpp repository and rebuilding with the SYCL backend enabled.

Note: This update specifically targets the SYCL backend; users are encouraged to verify their build version to ensure the merge is included in their current installation.
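One way to check an installation is to compare its build number against b9519. The sketch below assumes the build is reported either as a release tag such as "b9519" or as a line of the form "version: 9519 (commit)". That output format is an assumption here, and the helper names are hypothetical, so adjust the patterns to whatever your binary actually prints.

```python
import re

REQUIRED_BUILD = 9519  # build number named in the article (b9519)


def parse_build_number(text: str):
    """Extract a build number from 'version: 9519 (...)' or a 'b9519' tag.

    Returns None when no build number is found.
    """
    for pattern in (r"version:\s*(\d+)", r"\bb(\d+)\b"):
        match = re.search(pattern, text)
        if match:
            return int(match.group(1))
    return None


def includes_mmvq_port(text: str) -> bool:
    """True if the reported build is at or after the required build."""
    build = parse_build_number(text)
    return build is not None and build >= REQUIRED_BUILD


# Hypothetical version strings for illustration.
for sample in ("version: 9600 (abc1234)", "b9519", "b9400", "unknown"):
    print(sample, "->", includes_mmvq_port(sample))
```

The comparison only makes sense for builds from the master branch. A fork or a locally patched tree may report a number that does not reflect whether the merge is present.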

Original Source

Techyon

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup)