Unsloth Releases MTP GGUF Weights for Gemma 4 Model Suite
Unsloth has published GGUF-formatted weights for the Multi-Token Prediction (MTP) variants of Google's Gemma 4, covering three model sizes for local deployment.
Enhanced Local Deployment via GGUF
Unsloth has officially released GGUF weights for the Gemma 4 architecture, specifically targeting its Multi-Token Prediction (MTP) capabilities. Because GGUF is the model file format used by llama.cpp, the weights load directly in llama.cpp and in other GGUF-compatible inference engines. This lets developers and researchers run the models locally, including on consumer-grade hardware when the chosen weight format fits in available memory.
Supported Model Variants and Quantizations
The release covers a broad spectrum of model sizes to accommodate different hardware constraints and performance requirements. The following Gemma 4 variants are now available:
- Gemma 4 31B: High-capacity model for complex reasoning tasks.
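- Gemma 4 26B-A4B: A variant whose name follows the common convention for 26B total parameters with roughly 4B active per token, which suggests a mixture-of-experts design.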
- Gemma 4 12B: An optimized balance between performance and efficiency.
Unsloth provides each model in three weight formats: Q8 (8-bit quantization), F16 (half-precision floating point), and BF16 (bfloat16). Q8 roughly halves the file size and memory footprint relative to the two 16-bit formats, typically at a small cost in perplexity, so users can choose their own trade-off between output quality and VRAM usage.
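To make the memory trade-off concrete, the following minimal sketch estimates the size of the weights alone for each variant and format. It rests on several assumptions that are not stated in the release: that "Q8" corresponds to the GGUF `Q8_0` type (8.5 bits per weight once the per-block scale is counted), that the nominal parameter counts in the model names are accurate, and that the extra MTP parameters, KV cache, and runtime buffers can be ignored.

```python
# Rough weight-only memory estimate per Gemma 4 variant and format.
# Assumptions (not from the release): "Q8" means GGUF Q8_0, which stores
# 32 weights plus one 16-bit scale per block, i.e. 8.5 bits per weight;
# parameter counts are the nominal figures from the model names.

BITS_PER_WEIGHT = {"Q8_0": 8.5, "F16": 16.0, "BF16": 16.0}

# Nominal parameter counts in billions. For 26B-A4B the total count is used,
# on the assumption that every weight must be resident in memory even if
# only part of the model is active per token.
PARAMS_BILLIONS = {"31B": 31.0, "26B-A4B": 26.0, "12B": 12.0}


def weight_gb(params_billions: float, fmt: str) -> float:
    """Return the approximate weight size in decimal gigabytes."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8.0


if __name__ == "__main__":
    for name, params in PARAMS_BILLIONS.items():
        sizes = ", ".join(
            f"{fmt}: {weight_gb(params, fmt):.1f} GB" for fmt in BITS_PER_WEIGHT
        )
        print(f"Gemma 4 {name} -> {sizes}")
```

Under these assumptions the 31B model needs about 62 GB for F16 or BF16 weights and about 33 GB for Q8_0. Actual file sizes will differ somewhat, since GGUF files can mix tensor types and carry metadata.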
Accessing the Weights
The weights are hosted on Hugging Face under the Unsloth organization. Each model size has its own repository, and the MTP files sit in an MTP directory inside it.
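One way to fetch only the MTP files in a single format is to filter the download by file pattern. The sketch below uses `snapshot_download` from the `huggingface_hub` package with its `allow_patterns` argument. The directory name `MTP`, the convention that the format tag appears in the file name, and the helper names `mtp_patterns` and `download_mtp_weights` are all assumptions made for illustration. The repository ID is left as a required argument because the article does not give exact repository names.

```python
# Hypothetical helper for downloading MTP GGUF files in one format.
# Assumptions: files live under an "MTP" directory and include the format
# tag (for example "BF16") in their names. Check the repository listing
# before relying on either.
from fnmatch import fnmatchcase


def mtp_patterns(fmt: str, mtp_dir: str = "MTP") -> list[str]:
    """Build glob patterns selecting GGUF files of one format in mtp_dir."""
    return [f"{mtp_dir}/*{fmt}*.gguf"]


def download_mtp_weights(repo_id: str, fmt: str, mtp_dir: str = "MTP") -> str:
    """Download only the matching files and return the local snapshot path."""
    # Imported lazily so mtp_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=mtp_patterns(fmt, mtp_dir),
    )


if __name__ == "__main__":
    # Offline check of the pattern against a made-up file name.
    pattern = mtp_patterns("BF16")[0]
    print(pattern, fnmatchcase("MTP/example-model-BF16.gguf", pattern))
```

Calling `download_mtp_weights` with the ID of one of the Unsloth repositories and a format tag would then fetch just that subset rather than every file in the repository.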