BeeLlama v0.3.1: Optimizing llama.cpp with DFlash, MTP, and TurboQuant for High-Throughput Inference

BeeLlama v0.3.1, a fork of llama.cpp, adds Multi-Token Prediction (MTP) support and an updated DFlash, and reports throughput of up to 177.8 tokens per second on a single RTX 3090.

Architectural Alignment and New Features

BeeLlama v0.3.1 stays closely aligned with the upstream llama.cpp repository while adding performance-focused features. New in this version are support for Multi-Token Prediction (MTP) and native support for Gemma 4 12B models.

DFlash has also been updated in this release. It now handles multi-slot environments and multi-GPU setups, which allows more flexible and scalable inference workloads.

Performance Benchmarks: Pushing the Limits of the RTX 3090

TurboQuant and the q6_0 cache optimizations deliver large throughput gains over the baseline. Results from testing on a single NVIDIA RTX 3090:

  • Models Tested: Qwen 3.6 27B and Gemma 4 31B.
  • Peak Performance: Up to 177.8 tokens per second (tps).
  • Relative Gain: 4.93x the throughput of the baseline.
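
The two figures above can be combined to estimate what the unstated baseline was. The following is a minimal sketch, not a measurement: it assumes the 4.93x factor is a plain throughput ratio and that it applies to the same run that produced the 177.8 tps peak. The function names are illustrative and do not come from BeeLlama.

```python
# Figures quoted in the benchmark list above.
PEAK_TPS = 177.8   # peak tokens per second on a single RTX 3090
SPEEDUP = 4.93     # reported gain over the baseline


def implied_baseline_tps(peak_tps: float, speedup: float) -> float:
    """Baseline throughput implied by a peak figure and a speedup ratio.

    Assumption: speedup = peak_tps / baseline_tps for the same workload.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return peak_tps / speedup


def ms_per_token(tps: float) -> float:
    """Average time per generated token, in milliseconds."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return 1000.0 / tps


baseline_tps = implied_baseline_tps(PEAK_TPS, SPEEDUP)

print(f"implied baseline: {baseline_tps:.2f} tps")
print(f"time per token at peak: {ms_per_token(PEAK_TPS):.2f} ms")
print(f"time per token at baseline: {ms_per_token(baseline_tps):.2f} ms")
```

Under these assumptions the baseline works out to roughly 36 tokens per second. The article does not say which model or configuration produced the baseline, so this number is only a consistency check on the quoted figures.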

Community Integration and Validation

The project has gained recognition within the enthusiast community, receiving a recommendation from the "club-3090" group. The development of v0.3.0 and v0.3.1 involved rigorous testing on multi-GPU configurations to ensure stability and efficiency across diverse hardware environments.

Note: The release announcement does not describe the internal mechanics of TurboQuant or the exact configuration of the q6_0 cache.
