GLM-5.2 753B (IQ1_S) fully local across 2×M5 Max over one TB5 cable — ~16 tok/s, llama.cpp RPC [video]

Article automatically generated from technical news.

Model: GLM-5.2, 753B parameters, in Unsloth's dynamic IQ1_S build. The quant is labeled ~1.6 bits, but the dynamic mix keeps some layers at higher precision, so it lands around 2.1 bits per weight effective, or 202GB on disk. Lossy, obviously, but it stays coherent and follows instructions.

Setup: Two M5 Max machines (128GB unified memory each, 256GB combined), connected by a single Thunderbolt 5 cable and running llama.cpp's RPC backend. One box serves the endpoint and holds half the weights; the other holds the rest.
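As a rough sanity check on those figures, the sketch below recomputes the effective bits per weight from the quoted file size and parameter count, and estimates how much memory is left on each machine. It assumes decimal gigabytes, an exactly even split of the weights, and no GGUF metadata overhead. The function names are illustrative and not part of llama.cpp.

```python
# Back-of-the-envelope check of the numbers quoted in the post.
# Assumptions: decimal GB, an even weight split, metadata overhead ignored.

PARAMS = 753e9            # parameter count from the post
FILE_GB = 202             # on-disk size of the IQ1_S build from the post
NODES = 2                 # two M5 Max machines
RAM_PER_NODE_GB = 128     # unified memory per machine


def effective_bits_per_weight(file_gb: float, params: float) -> float:
    """Average number of bits stored per parameter."""
    return file_gb * 1e9 * 8 / params


def per_node_headroom_gb(file_gb: float, nodes: int, ram_per_node_gb: float) -> float:
    """Memory left on each node after an even split of the weights."""
    return ram_per_node_gb - file_gb / nodes


bpw = effective_bits_per_weight(FILE_GB, PARAMS)
headroom = per_node_headroom_gb(FILE_GB, NODES, RAM_PER_NODE_GB)

print(f"effective bits/weight: {bpw:.2f}")
print(f"headroom per node:     {headroom:.0f} GB")
```

Under these assumptions the result is about 2.15 bits per weight, in line with the "around 2.1 bits" quoted, and roughly 27GB of headroom per machine. That headroom still has to cover the KV cache, compute buffers, and the operating system.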
