We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro

This post from Mininglamp AI gives an overview of our work on W8A8 activation quantization in MLX.


W8A8 Activation Quantization in MLX

We've been working on W8A8 activation quantization in MLX. The technique targets inference on Apple Silicon, where our goal is faster prefill and higher throughput than existing quantization methods. In our headline measurement, prefill on an M5 Pro dropped from 2.84s to 2.52s.
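To put the headline prefill numbers in perspective, here is a short worked calculation. It uses only the two figures from the title; the variable names are ours, and the post does not say here which configuration the 2.84s baseline refers to.

```python
# Headline prefill times on M5 Pro, taken from the title of this post.
baseline_prefill_s = 2.84  # before W8A8 (baseline configuration not specified here)
w8a8_prefill_s = 2.52      # with W8A8 activation quantization

saved_s = baseline_prefill_s - w8a8_prefill_s
reduction_pct = 100 * saved_s / baseline_prefill_s
speedup = baseline_prefill_s / w8a8_prefill_s

print(f"saved {saved_s:.2f}s, {reduction_pct:.1f}% less time, {speedup:.2f}x faster")
# prints: saved 0.32s, 11.3% less time, 1.13x faster
```

So the change amounts to roughly an 11% reduction in prefill time, or about a 1.13x speedup.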

W8A8 stands for 8-bit weights and 8-bit activations. Unlike weight-only schemes such as W8A16, which keep activations in 16-bit floating point, W8A8 also quantizes the activations, so the matrix multiplications that dominate prefill can operate on 8-bit operands.
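The general W8A8 idea can be sketched in a few lines. This is a minimal illustration in pure Python, assuming symmetric int8 quantization with one scale per weight row and one scale per activation vector. It is not Cider's implementation and does not use MLX; the function names quantize_int8 and w8a8_matvec are hypothetical.

```python
# Illustrative W8A8 matrix-vector product (pure Python, no MLX or NumPy).
# Assumption: symmetric int8 quantization; this is NOT Cider's actual code.

def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus one float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_matvec(weight_rows, activation):
    """Approximate W @ x using int8 weights and int8 activations."""
    # Activations are quantized at run time, since they depend on the input.
    x_q, x_scale = quantize_int8(activation)
    out = []
    for row in weight_rows:
        # Weights would normally be quantized once, offline; done inline here.
        w_q, w_scale = quantize_int8(row)
        acc = sum(w * x for w, x in zip(w_q, x_q))  # integer accumulation
        out.append(acc * w_scale * x_scale)         # one rescale back to float
    return out


W = [[0.50, -1.20, 0.30], [2.00, 0.10, -0.70]]
x = [1.5, -0.4, 0.9]

reference = [sum(w * v for w, v in zip(row, x)) for row in W]
approx = w8a8_matvec(W, x)
max_err = max(abs(a - r) for a, r in zip(approx, reference))
print("float reference:", reference)
print("W8A8 approximation:", approx)
print("max abs error:", max_err)
```

The point of the design is that the inner product runs entirely on small integers, and the floating-point scales are applied only once per output element. A W8A16 scheme would instead convert the weights back to floating point before multiplying.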

We implemented this in Cider, a small SDK that adds W8A8 activation quantization to MLX. Cider packs the quantization information it needs into a single optimized format, which reduces the number of quantization operations the inference engine has to perform.

In our quality comparison, FP16 scores 9.73 and W8A16 scores 9.71. W8A8 stays in line with these figures, giving a good balance between accuracy and speed for most applications.

This change matters for how our model performs on Apple Silicon, a platform we value for its efficiency.

We expect W8A8 to improve MLX inference performance, particularly for larger inputs and more complex inference tasks.

We're committed to providing you with a high-quality and reliable solution for your ML inference needs.

Stay tuned for our next update on how W8A8 is making its way into the MLX ecosystem!
