mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

u//u/EricBuehler 2026-06-01 · 14:10 UTC

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Article automatically generated from technical news.

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200: https://preview.redd.it/jmdsjkrbfo4h1.png?width=3312&format=png&auto=webp&s=8a69286b73a8fad4edc671cb9ca8ad3f3cd74d1c The full report includes all steps to reproduce these results. The

Fonte originale

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Related Articles

Best way to index full Italian Wikipedia for 100% offline RAG in LM Studio?

Bedrock Codex, Robust MILP, Multi‑Model Deliberation, Tree‑Based Molecule Ops, and MoE Quantization

0xPlaygrounds /rig

0x4m4 /hexstrike-ai

Google ordered to put clearer links in AI search and let UK publishers opt out