Optimizing Blackwell: Prebuilt llama.cpp Binaries for RTX 50-Series with MTP and TurboQuant
A new set of prebuilt llama.cpp binaries for Windows addresses a critical performance gap for NVIDIA RTX 50-series users by integrating Multi-Token Prediction (MTP), TurboQuant, and native sm_120 support, enabling significant throughput gains on Blackwell architecture.
Addressing the Blackwell Performance Gap
Users of the NVIDIA RTX 50-series (Blackwell architecture) on Windows have previously faced a fragmented ecosystem of llama.cpp builds. While various forks and pull requests introduced critical optimizations, no single prebuilt binary had combined all necessary features for maximum efficiency. This fragmentation forced users to choose between specific feature sets or compile from source.
Key Technical Enhancements
The latest release resolves these discrepancies by merging three critical components into a single distribution:
1. Multi-Token Prediction (MTP)
Integrating the functionality from PR #22673 (merged in May), the build enables Multi-Token Prediction, which lets the model predict several subsequent tokens in a single forward pass. Fewer sequential passes are then needed per generated token, which raises generation speed.
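The source does not describe how the MTP code path works internally. One common way to use extra predicted tokens is a draft-and-verify loop, and the toy Python sketch below illustrates that idea only. Every name in it (`generate_with_drafts`, `toy_target`, `toy_draft`) is hypothetical, and the "models" are simple functions over integers, not llama.cpp calls.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def generate_with_drafts(target_next: NextToken, draft_next: NextToken,
                         prompt: List[int], n_new: int,
                         n_draft: int) -> Tuple[List[int], int]:
    """Toy draft-and-verify loop. Returns (tokens, verification_passes)."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(tokens) < target_len:
        # 1. Cheaply draft n_draft tokens ahead of the current context.
        ctx = list(tokens)
        draft = []
        for _ in range(n_draft):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. One verification pass: keep the longest prefix of the draft
        #    that matches what the target model would have produced.
        passes += 1
        ctx = list(tokens)
        for tok in draft:
            if tok != target_next(ctx):
                break
            ctx.append(tok)
        # 3. The target always contributes one token of its own, either the
        #    correction for a rejected draft or a bonus token after a full accept.
        ctx.append(target_next(ctx))
        tokens = ctx[:target_len]
    return tokens, passes


def toy_target(ctx: List[int]) -> int:
    # Hypothetical "main model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical draft head: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    out, passes = generate_with_drafts(toy_target, toy_draft, [0], 12, 3)
    print("tokens:", out)
    print("verification passes:", passes, "for 12 new tokens")
```

In this sketch the output is identical to plain token-by-token decoding with the target, because rejected drafts are always replaced by the target's own choice. The saving comes only from needing fewer verification passes than generated tokens.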
2. TurboQuant Integration
Unlike upstream llama.cpp, this build includes TurboQuant, which provides quantization optimizations that reduce memory overhead and increase inference throughput without a substantial increase in perplexity.
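To give a sense of why lower-bit storage matters at long context lengths, the following back-of-the-envelope Python sketch estimates key/value cache size at different bit widths. It rests on loud assumptions: the source does not say which tensors TurboQuant targets, the model dimensions below are hypothetical placeholders rather than the real Qwen 27B configuration, and per-block scale overhead is ignored.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bits_per_value: float) -> float:
    """Approximate KV cache size: keys plus values for every layer and position."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys and values
    return n_values * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical dimensions chosen for illustration only.
    dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=256 * 1024)
    for bits in (16, 8, 4):
        gib = kv_cache_bytes(bits_per_value=bits, **dims) / 2**30
        print(f"{bits:>2}-bit values: {gib:.1f} GiB")
```

Under these placeholder dimensions the cache shrinks in direct proportion to the bit width, which is the general reason quantization makes very long contexts fit on a single consumer GPU.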
3. Native sm_120 Support
The binaries are built specifically for the sm_120 compute capability of the Blackwell architecture. This is a critical improvement over previous builds (such as TheTom's tqp-v0.1.1), which suffered a ~50% performance degradation because FORCE_CUBLAS=ON was locked in the CMakeCache, effectively disabling the high-performance MMQ kernels.
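Anyone building from source can check a build directory for the same problem. The Python sketch below parses the text of a `CMakeCache.txt` (entries have the form `KEY:TYPE=VALUE`) and reports whether any key ending in `FORCE_CUBLAS` is switched on and whether `CMAKE_CUDA_ARCHITECTURES` lists 120. The source names the flag only as FORCE_CUBLAS, so the prefixed key in the sample text is an assumption, and the helper names are hypothetical.

```python
from typing import Dict

TRUTHY = {"ON", "1", "TRUE", "YES", "Y"}


def cache_entries(cache_text: str) -> Dict[str, str]:
    """Parse KEY:TYPE=VALUE lines, skipping blanks and comment lines."""
    entries = {}
    for raw in cache_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "//")):
            continue
        key_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        entries[key_and_type.split(":", 1)[0]] = value
    return entries


def force_cublas_enabled(cache_text: str) -> bool:
    """True if any cache key ending in FORCE_CUBLAS has a truthy value."""
    return any(key.endswith("FORCE_CUBLAS") and value.upper() in TRUTHY
               for key, value in cache_entries(cache_text).items())


def targets_arch(cache_text: str, arch: str = "120") -> bool:
    """True if CMAKE_CUDA_ARCHITECTURES lists arch (e.g. 120 or 120-real)."""
    value = cache_entries(cache_text).get("CMAKE_CUDA_ARCHITECTURES", "")
    return any(item.strip().split("-")[0] == arch
               for item in value.split(";") if item.strip())


if __name__ == "__main__":
    # Sample cache text; the GGML_CUDA_ prefix is an assumed key name.
    sample = """
    // Example cache contents
    CMAKE_CUDA_ARCHITECTURES:STRING=120
    GGML_CUDA_FORCE_CUBLAS:BOOL=ON
    """
    print("FORCE_CUBLAS on:", force_cublas_enabled(sample))
    print("targets sm_120:", targets_arch(sample))
```

In practice the text would come from reading `CMakeCache.txt` in the build directory. A truthy FORCE_CUBLAS entry there is the condition the source blames for the slowdown.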
Performance Benchmarks
In initial tests combining MTP, TurboQuant, and native Blackwell kernels, the Qwen 27B model reached a throughput of 47 tokens per second (t/s) while maintaining a 256K context window.
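To put the reported figure in practical terms, this short Python sketch converts a generation throughput into wall-clock time for responses of a few lengths. It uses only the 47 t/s number from the report. It ignores prompt processing time, which the source does not quantify, and the response lengths are arbitrary examples.

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    REPORTED_TPS = 47.0  # figure quoted in the community report
    for n in (256, 1024, 4096):
        print(f"{n:>5} tokens: {generation_seconds(n, REPORTED_TPS):6.1f} s")
    print(f"per-token latency: {1000.0 / REPORTED_TPS:.1f} ms")
```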
Comparative Analysis of Previous Builds
To illustrate the necessity of this release, the following limitations were identified in existing alternatives:
- Upstream llama.cpp: Includes MTP but lacks TurboQuant.
- TheTom's tqp-v0.1.1: Includes TurboQuant but lacks MTP and suffers from CUDA 12.4 configuration issues that disable MMQ kernels.
- AmesianX/TurboQuant: Provides sm_120 binaries but lacks MTP.
- NJannasch: Successfully combined MTP and TurboQuant in the source code, but did not provide prebuilt binaries for end-users.
Note: This article is based on community reports from r/LocalLLM; official documentation for these specific combined binaries is limited to the provided source.