Unsloth Releases MTP GGUF Weights for Gemma 4 Model Suite

Unsloth has expanded its optimization library by providing GGUF-formatted weights for the Multi-Token Prediction (MTP) variants of Google's Gemma 4, covering multiple parameter scales for efficient local deployment.

Enhanced Local Deployment via GGUF

The release targets the Multi-Token Prediction (MTP) variants of Gemma 4. Because the weights ship in GGUF format, they can be loaded by llama.cpp and other GGUF-compatible inference engines, which lets developers and researchers run the models locally on consumer-grade hardware. The 8-bit quantized files in particular need roughly half the memory of the 16-bit weights.
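Before handing a downloaded file to an inference engine, it can help to confirm that it really is a GGUF file. The sketch below parses the fixed-size GGUF header (magic bytes, version, tensor count, metadata key/value count) using only the standard library. It assumes the version 2 or later header layout with little-endian fields; the function name and the synthetic header used in the demo are illustrative and do not come from the release.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumes the v2+ layout)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: magic bytes do not match")
    # "<" selects little-endian with no padding: uint32, uint64, uint64.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Demo on a synthetic header, so no model download is needed.
fake_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(fake_header))
```

For a real file, the same function can be applied to the first 24 bytes read from disk in binary mode, for example `read_gguf_header(open(path, "rb").read(24))` with `path` pointing at a downloaded `.gguf` file.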

Supported Model Variants and Quantizations

The release covers a broad spectrum of model sizes to accommodate different hardware constraints and performance requirements. The following Gemma 4 variants are now available:

Gemma 4 31B: High-capacity model for complex reasoning tasks.
Gemma 4 26B-A4B: A variant whose name suggests a mixture-of-experts design, with roughly 4B of its 26B parameters active per token.
Gemma 4 12B: An optimized balance between performance and efficiency.

Unsloth provides these models in multiple weight formats: Q8 (8-bit quantization), F16 (half-precision floating point), and BF16 (bfloat16). Q8 needs roughly half the memory of the 16-bit formats, typically at a small cost in perplexity, so users can choose the trade-off between output quality and VRAM usage that fits their hardware.
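The memory side of that trade-off can be approximated with simple arithmetic: weight size is roughly the parameter count times the bits per weight. The sketch below is a back-of-the-envelope estimate under several assumptions. It treats "Q8" as llama.cpp's Q8_0 layout (32 int8 values plus one 16-bit scale per block, about 8.5 bits per weight), takes the nominal parameter counts from the model names, and ignores the MTP-specific tensors, file metadata, the KV cache, and runtime buffers. Actual file sizes will differ.

```python
# Approximate bits stored per weight for each format.
# Q8_0 is an assumption about what the article calls "Q8":
# 32 int8 values + one 16-bit scale per block = 8.5 bits per weight.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "F16": 16.0, "BF16": 16.0}


def estimate_weight_gb(params_billion: float, fmt: str) -> float:
    """Rough weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[fmt]
    # (params_billion * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits / 8


# Nominal sizes taken from the model names in the list above.
for name, size in [("31B", 31), ("26B-A4B", 26), ("12B", 12)]:
    for fmt in BITS_PER_WEIGHT:
        print(f"Gemma 4 {name} {fmt}: ~{estimate_weight_gb(size, fmt):.1f} GB")
```

Under these assumptions the 12B model needs about 24 GB of weights at 16-bit precision and just under 13 GB at Q8_0.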

Accessing the Weights

The weights are hosted on Hugging Face under the Unsloth organization. Each model size has its own repository, and the MTP files sit in that repository's MTP directory.
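A file in a Hugging Face repository can be fetched directly through the site's `resolve` URL pattern. The sketch below only builds such a URL with the standard library. The article does not name the repositories or files, so the repository ID and filename in the demo are placeholders, not real paths; they need to be replaced with values taken from the Unsloth organization page.

```python
from urllib.parse import quote


def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hugging Face repo."""
    # Slashes in the filename are kept so subdirectories (such as an
    # MTP folder) stay part of the path; the revision is fully escaped.
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


# Placeholder values: hypothetical repo and file names for illustration only.
print(hf_resolve_url("example-org/example-repo", "MTP/example-Q8_0.gguf"))
```

The resulting URL can be passed to any HTTP client or download tool. Building the URL separately keeps the sketch free of network access and third-party dependencies.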

Techyon
