VRAM Utilization in LLM Inference

Optimizing GPU VRAM for Small Model Inference with llama.cpp

This technical discussion explores the challenge of achieving complete GPU VRAM residency for smaller language models when utilizing inference engines like llama.cpp. Despite successful deployment of large Mixture-of-Experts (MoE) and large-context models, the user seeks to eliminate host memory overhead for quantized, small-scale models.

The Challenge of Host Memory Overhead

In local Large Language Model (LLM) inference, keeping the whole workload in GPU Video RAM (VRAM) is critical for high throughput and low latency. The goal is to move all model weights and runtime data, such as the KV cache and compute buffers, out of the CPU's host memory and into dedicated GPU memory.

Current Performance Benchmarks

The reported setup consists of an RTX 4070 (12GB VRAM), 32GB of system RAM, and an integrated GPU that drives the desktop GUI, which leaves the discrete card's VRAM free for inference. The system ran substantial models, including the Gemma4 26B and Qwen 3.6 35B MoEs, at approximately 40 tokens per second (t/s), even at high quantization levels and with large context windows.

The VRAM Residency Problem

While the setup handles large models effectively, a specific technical hurdle arises when attempting to run significantly smaller, highly quantized models, such as a Qwen3.5-9B model. The objective is to ensure the entire operation occurs solely within the 12GB VRAM, bypassing the system's host RAM entirely.
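
Whether a model can fit entirely in 12GB can be estimated from the size of its quantized weights plus its KV cache. The sketch below uses the standard KV cache size formula for a plain transformer with f16 cache entries. The layer count, KV head count, head dimension, weight size, and the 1 GiB overhead allowance are all hypothetical placeholders, not values taken from any model named here, and models with sliding-window or hybrid attention will need less than this estimate.

```python
GIB = 1024**3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes used by the K and V caches for one sequence.

    The leading factor of 2 covers the separate K and V tensors;
    bytes_per_elem=2 assumes f16 cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def fits_in_vram(weights_bytes, kv_bytes, vram_bytes, overhead_bytes=GIB):
    """Rough residency check; overhead_bytes is an assumed allowance
    for compute buffers and the GPU runtime, not a measured figure."""
    return weights_bytes + kv_bytes + overhead_bytes <= vram_bytes


# Hypothetical model shape, chosen only to illustrate the arithmetic.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
weights = int(5.5 * GIB)  # hypothetical size of a quantized model file

print(f"KV cache: {kv / GIB:.2f} GiB")
print("fits in 12 GiB:", fits_in_vram(weights, kv, 12 * GIB))
```

An estimate like this only says whether full residency is possible in principle. It does not say how much host memory the process will still hold.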

The core issue is that host memory is still required even when running small models. For instance, a test with Gemma4-e2b (IQ4_XS quantization) and a context length limited to 8192 tokens showed the process still holding approximately 3.5 GB of host RAM in addition to its GPU usage. This indicates that part of the working set, likely buffers or internal processing structures, is not resident on the GPU.
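
To see what kind of host memory a process is holding, it can help to split its resident set into anonymous and file-backed pages. The sketch below assumes Linux, where `/proc/<pid>/status` reports `VmRSS`, `RssAnon`, `RssFile`, and `RssShmem`. A large `RssFile` share would be consistent with a memory-mapped model file rather than private allocations, though the source does not establish that this is the cause here. The sample text is made up to exercise the parser and is not a measurement.

```python
import re

_FIELDS = re.compile(r"(VmRSS|RssAnon|RssFile|RssShmem):\s+(\d+)\s+kB")


def parse_rss_kib(status_text):
    """Extract the resident-set fields (in KiB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        m = _FIELDS.match(line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def read_rss_kib(pid):
    """Read and parse the status file of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_rss_kib(f.read())


# Synthetic sample, invented for illustration only.
SAMPLE = (
    "Name:\tllama-server\n"
    "VmRSS:\t    1000 kB\n"
    "RssAnon:\t     400 kB\n"
    "RssFile:\t     600 kB\n"
    "RssShmem:\t       0 kB\n"
)

rss = parse_rss_kib(SAMPLE)
print(rss)
```

On a live system, `read_rss_kib(pid)` would be called with the process ID of the running server.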

Investigation into llama.cpp Configuration

The user has systematically tried the command-line options offered by `llama-server` in an attempt to enforce full VRAM residency. However, preliminary testing suggests that the available options do not fully eliminate the residual host memory use seen with smaller, quantized models.
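
The source does not list which options were tried. As a rough illustration, the sketch below assembles a `llama-server` command line from options that llama.cpp documents: `-m` (model path), `-c` (context size), `-ngl` (number of layers to offload to the GPU), and `--no-mmap` (load the model without memory-mapping the file). The model path is a hypothetical placeholder, and whether `--no-mmap` lowers host usage in this particular case is an assumption to test, not a confirmed fix. Option names can change between llama.cpp versions, so check `llama-server --help` first.

```python
import shlex


def build_llama_server_cmd(model_path, n_ctx=8192, n_gpu_layers=99, use_mmap=False):
    """Build an argv list for llama-server.

    n_gpu_layers=99 is a common way to request more layers than the
    model has, so that every layer is offloaded to the GPU.
    """
    cmd = [
        "llama-server",
        "-m", model_path,
        "-c", str(n_ctx),
        "-ngl", str(n_gpu_layers),
    ]
    if not use_mmap:
        # Load weights without memory-mapping the model file.
        cmd.append("--no-mmap")
    return cmd


# Hypothetical model path, for illustration only.
cmd = build_llama_server_cmd("models/example-q4.gguf")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`. Comparing host memory with and without `--no-mmap` is one way to check whether the mapped model file accounts for the remaining usage.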

This scenario highlights a limitation of current LLM inference software: full VRAM residency, with no host memory footprint beyond the bare process, remains non-trivial even for models that fit comfortably in VRAM, and it requires a closer look at how the inference framework manages memory.
