CUDA Context Allocation Issues in llama-server Router Mode

A technical report on a memory management issue within llama-server where models pinned to specific GPUs still trigger CUDA context allocations across all available devices, leading to Out-of-Memory (OOM) errors.

Problem Overview: Global CUDA Context Allocation

A user reported a memory problem when running llama-server in router mode (via the --models-preset flag). Even with each model pinned to a designated GPU, the server appears to initialize a CUDA context on every GPU in the system, regardless of where the model is actually loaded.

This creates a resource conflict: when the other GPUs are already nearly full from other processes, the router's attempt to create a context on those cards fails with an Out-of-Memory (OOM) error, and the model does not load even though the target GPU has sufficient VRAM.
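The failure mode can be illustrated with a small model. This is only a sketch: the free-memory figures, the per-context overhead, and the function name failing_devices are hypothetical values chosen for illustration, not measurements from the report.

```python
def failing_devices(free_mib, context_mib, devices_touched):
    """Return the devices whose free memory cannot hold one more context."""
    return [d for d in devices_touched if free_mib[d] < context_mib]


# Hypothetical free VRAM per GPU index, in MiB (not taken from the report).
free_mib = {0: 20000, 1: 150, 2: 100, 3: 16000, 4: 12000}
CONTEXT_MIB = 300  # assumed per-device context overhead
TARGET = 0         # GPU the model is pinned to

# A context on every visible device: the nearly full cards fail.
all_devices = failing_devices(free_mib, CONTEXT_MIB, list(free_mib))
# A context only on the pinned device: nothing fails.
pinned_only = failing_devices(free_mib, CONTEXT_MIB, [TARGET])

print(all_devices)  # [1, 2]
print(pinned_only)  # []
```

Under these assumed numbers the load fails because of GPUs 1 and 2, even though GPU 0 has ample room for the model.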

Hardware Configuration

The reported issue was observed on a multi-GPU setup consisting of a heterogeneous mix of NVIDIA hardware:

2x NVIDIA RTX 3090 (used for larger models, e.g., 27B Q8)
2x NVIDIA RTX 4060 Ti (one currently inactive)
1x NVIDIA RTX 5060 Ti (used for smaller models, e.g., Gemma 4B)

Technical Analysis

The core of the issue appears to lie in how the router spawns child processes for model instances. Creating a CUDA context on a device reserves some of that device's VRAM before any model data is loaded, often on the order of a few hundred MiB depending on the GPU and driver. If the router's child processes initialize CUDA on all visible devices instead of being restricted to their assigned GPU (for example, by setting the CUDA_VISIBLE_DEVICES environment variable for each child), the load can crash on systems where VRAM is tightly allocated across multiple cards.
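Process-level isolation of this kind can be sketched as follows. This is a minimal illustration, not the router's actual spawning logic: the helper names env_pinned_to_gpus and run_pinned are hypothetical, and the child command is a stand-in that only prints the variable it received. Whether the router can pass a per-model environment to its children is not confirmed by the source.

```python
import os
import subprocess
import sys


def env_pinned_to_gpus(gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable when it initializes. Devices not listed
    # are invisible to the process, so no context is created on them.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


def run_pinned(command, gpu_indices):
    """Run `command` (a list of arguments) with only `gpu_indices` visible."""
    return subprocess.run(
        command,
        env=env_pinned_to_gpus(gpu_indices),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in child process: prints the device list it was given.
    # A real launch would pass the server command line here instead.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    print(run_pinned(probe, [0]).stdout.strip())
```

Inside a process restricted this way, the visible devices are renumbered from 0, so any GPU index given to the child must refer to the renumbered list rather than the system-wide one.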

Current Workflow

The user employs the --models-preset configuration to dynamically spawn child processes per model on demand. While this architecture allows for efficient model switching and routing, the current implementation of context allocation appears to ignore the specific GPU pinning, attempting to access all available hardware.

Note: The source is a user question, so this report includes no confirmed solution or official response from the llama.cpp developers. It remains unclear whether this is a bug in the router's process-spawning logic or a missing configuration flag.

Original Source

llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?
