Benchmarked Ollama vs LM Studio vs raw llama.cpp across AMD APU, Apple Silicon, and NVIDIA. Out-of-the-box and matched-flags compared.


Ran a comparison across three hardware families and four model sizes (0.6B, 8B, 30B-class, 30B+ MoE). Measured TTFT (cold and warm) and decode tokens/sec. Did it twice: once with matched llama.cpp flags, once with each tool's defaults.

What I found:

- Out-of-the-box, Ollama's decode is 41-72% slower than raw llama.cpp on AMD APU; cold-RAG prefill on a 31B model on Strix Halo took roughly 4 minutes.
- LM Studio's Vulkan path wins decode on small/mid models, but pays a 1-1.5 second TTFT tax.
- A
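The post does not say how TTFT and decode tokens/sec were computed, so the following is only a minimal sketch of one common definition: TTFT is the gap between sending the request and receiving the first token, and decode rate counts the tokens after the first one over the time they took. The names `RunMetrics`, `summarize_run`, and `percent_slower` are hypothetical helpers introduced for illustration, not part of Ollama, LM Studio, or llama.cpp.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunMetrics:
    ttft_s: float            # time to first token, in seconds
    decode_tok_per_s: float  # decode throughput after the first token


def summarize_run(request_start: float, token_times: Sequence[float]) -> RunMetrics:
    """Derive TTFT and decode rate from per-token arrival timestamps (seconds).

    Assumption: token_times[i] is when the i-th generated token arrived,
    measured on the same clock as request_start.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute a decode rate")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")
    # The first token is attributed to prefill, so it is excluded from decode.
    rate = (len(token_times) - 1) / decode_window
    return RunMetrics(ttft_s=ttft, decode_tok_per_s=rate)


def percent_slower(baseline_tok_per_s: float, candidate_tok_per_s: float) -> float:
    """How much slower the candidate decodes, as a percentage of the baseline."""
    return 100.0 * (baseline_tok_per_s - candidate_tok_per_s) / baseline_tok_per_s


if __name__ == "__main__":
    # Made-up timestamps, not data from the benchmark.
    metrics = summarize_run(0.0, [1.5, 1.75, 2.0, 2.25, 2.5])
    print(metrics)
    print(percent_slower(50.0, 20.0))
```

The numbers in the usage lines are illustrative only. Excluding the first token from the decode rate keeps prefill cost out of the throughput figure, which matters when comparing a tool with a fast decode path but a slow start.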
