Significant Performance Gains for Intel Arc via Speculative Decoding in llama.cpp
A recent contribution to the llama.cpp SYCL backend has resolved previous performance regressions in speculative decoding, delivering substantial token generation speedups of 40% to 90% for Intel Arc GPU users.
Optimization of the SYCL Backend
Speculative decoding has historically been inefficient for users running Large Language Models (LLMs) on Intel Arc hardware through the SYCL backend. Earlier benchmarks showed that enabling Multi-Token Prediction (MTP) degraded performance: some users reported a 12% drop in speed on Q4 quantized models compared to standard single-token generation.
The Technical Breakthrough: Multi-Column MMVQ Port
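The bottleneck was addressed by porting the multi-column MMVQ path from the CUDA backend to the SYCL implementation. MMVQ is the quantized matrix-vector multiplication kernel (mul_mat_vec_q in the ggml sources). During the verification phase of speculative decoding, the target model checks several drafted tokens in a single pass, so each quantized weight matrix is multiplied by a few activation columns rather than one. The multi-column path handles these small batches inside the matrix-vector kernel, which lets Intel Arc GPUs use their compute capabilities more effectively during verification.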
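The idea behind a multi-column quantized matrix-vector product can be sketched in a few lines of Python. This is a simplified illustration, not the llama.cpp SYCL or CUDA kernel: the block size of 32, the int8-plus-scale block layout, and the function names quantize_row and mmvq_multi_column are all assumptions made for the example, and activations are kept as floats for brevity.

```python
# Simplified illustration only; NOT the llama.cpp SYCL/CUDA kernel.
# BLOCK, the block layout, and both function names are assumptions.
BLOCK = 32  # assumed number of weights per quantization block


def quantize_row(row):
    """Split a weight row into blocks of (scale, list of int8-range values)."""
    blocks = []
    for i in range(0, len(row), BLOCK):
        chunk = row[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks


def mmvq_multi_column(qweights, columns):
    """Multiply quantized weight rows by several activation columns.

    Each weight block is visited once and reused for every column,
    rather than re-reading the weights once per column.
    """
    out = [[0.0] * len(columns) for _ in qweights]
    for r, blocks in enumerate(qweights):
        for b, (scale, qs) in enumerate(blocks):
            base = b * BLOCK
            for c, col in enumerate(columns):
                acc = sum(q * col[base + k] for k, q in enumerate(qs))
                out[r][c] += scale * acc
    return out
```

With one column this reduces to an ordinary quantized matrix-vector product. With a handful of columns, one per drafted token, the weights are still traversed only once, which is the property that matters for verification.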
Benchmark Improvements
With the multi-column MMVQ path in place, the reported token generation speedups by quantization level are:
- Q4 Quantization: Observed speed increase of approximately 40%.
- Q8 Quantization: Observed speed increase of 90% or more.
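To put the percentages in concrete terms, the short sketch below applies them to a baseline throughput. The 20 tokens/s baseline is a hypothetical number chosen for illustration, not a measurement, and the sketch assumes that all three percentages (the earlier 12% slowdown and the 40% and 90% gains) are measured against the same non-speculative baseline, which the reports do not state explicitly.

```python
def projected_tokens_per_second(baseline_tps, change_percent):
    """Apply a relative change in percent to a baseline throughput."""
    return baseline_tps * (1.0 + change_percent / 100.0)


# Hypothetical baseline: 20 tokens/s without speculative decoding.
baseline = 20.0
scenarios = [
    ("Q4 before the port", -12.0),
    ("Q4 after the port", 40.0),
    ("Q8 after the port", 90.0),
]
for label, percent in scenarios:
    tps = projected_tokens_per_second(baseline, percent)
    print(f"{label}: {tps:.1f} tokens/s")
```

Under that assumption, a Q4 model moving from a 12% slowdown to a 40% speedup runs roughly 1.6 times faster with speculative decoding enabled than it did before the port (1.40 / 0.88).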
Availability and Implementation
These optimizations were merged into the master branch as of build b9519. Users can get them by pulling the latest version of the llama.cpp repository and rebuilding.
Note: This update affects only the SYCL backend. Check that your installed build is b9519 or later to confirm that it includes the merge.
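A build tag can be checked against b9519 with a small helper like the one below. It is a minimal sketch that assumes tags have the form "b" followed by a number and that build numbers increase over time on master; the function names build_number and includes_mmvq_port are hypothetical and not part of llama.cpp.

```python
import re

MERGE_BUILD = 9519  # build number given in this article


def build_number(tag):
    """Return N for a tag of the form 'bN', or None if it does not match."""
    match = re.fullmatch(r"b(\d+)", tag.strip())
    return int(match.group(1)) if match else None


def includes_mmvq_port(tag):
    """True if the tag is b9519 or later (assumes increasing build numbers)."""
    number = build_number(tag)
    return number is not None and number >= MERGE_BUILD
```

For example, includes_mmvq_port("b9519") returns True, while a tag of "b9518" or an unrecognized string such as "master" returns False.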