Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist
The author outlines VRAM optimization challenges for large language models, detailing a quantized model configuration on an RTX 5090 and expressing a need for more flexible mmproj and MTP handling.
Optimizing VRAM usage in LLM inference means balancing several components that share the same memory: the quantized model weights, the quantized key‑value (KV) cache, and the mmproj (multimodal projector) file, which holds the vision encoder and projection weights used for image input. The author reports that on an RTX 5090, a Qwen3.5‑27B model quantized to Q6_K with mmproj enabled, MTP (multi‑token prediction) disabled, and a q8_0 KV cache supporting a 150k‑