CUDA Context Allocation Issues in llama-server Router Mode
A technical report on a memory management issue within llama-server where models pinned to specific GPUs still trigger CUDA context allocations across all available devices, leading to Out-of-Memory (OOM) errors.
Problem Overview: Global CUDA Context Allocation
A user reported a problem when running llama-server in router mode (enabled with the --models-preset flag). Although each model is pinned to a designated GPU, the server appears to initialize a CUDA context on every GPU in the system, regardless of where the model is actually hosted.
This creates a resource conflict: when other GPUs are already close to full because of other processes, the attempt to create a context on those cards fails with an Out-of-Memory (OOM) error, and the model does not load even though the target GPU has sufficient VRAM.
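To see which cards would be at risk in this situation, per-GPU memory use can be inspected before a model is loaded. The following is a minimal sketch, assuming nvidia-smi is installed and on the PATH. The function names are hypothetical, and the 500 MiB headroom is an illustrative placeholder rather than a measured context size.

```python
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def gpus_without_headroom(rows, headroom_mib=500):
    """Return indices of GPUs whose free memory is below headroom_mib.

    headroom_mib is an assumed stand-in for the cost of a CUDA context;
    the real figure depends on the GPU, the driver, and the loaded kernels.
    """
    return [index for index, used, total in rows if total - used < headroom_mib]


def query_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Calling gpus_without_headroom(query_gpu_memory()) on the affected machine would list the cards on which even a small extra allocation is likely to fail.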
Hardware Configuration
The reported issue was observed on a multi-GPU setup consisting of a heterogeneous mix of NVIDIA hardware:
- 2x NVIDIA RTX 3090 (used for larger models, e.g., 27B Q8)
- 2x NVIDIA RTX 4060 Ti (one currently inactive)
- 1x NVIDIA RTX 5060 Ti (used for smaller models, e.g., Gemma 4B)
Technical Analysis
The core of the issue appears to lie in how the router spawns child processes for model instances. Creating a CUDA context on a device reserves a baseline amount of VRAM on that device, even if no model data is ever placed there. If a child process initializes a context on every visible device, instead of being restricted to its target GPU (for example, through the CUDA_VISIBLE_DEVICES environment variable), that allocation can fail on any card whose VRAM is already committed to other work, and the process crashes.
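The isolation described above can be illustrated with a small launcher. This is a sketch of the general CUDA_VISIBLE_DEVICES technique, not of llama-server's own spawning code, which the source does not show. The helper names pinned_env and run_pinned are hypothetical.

```python
import os
import subprocess
import sys


def pinned_env(gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # Use PCI bus order so indices match nvidia-smi; CUDA's default
    # ordering is "fastest first", which can differ on mixed hardware.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


def run_pinned(argv, gpu_indices):
    """Run argv as a child process restricted to gpu_indices and wait for it."""
    return subprocess.run(
        argv,
        env=pinned_env(gpu_indices),
        capture_output=True,
        text=True,
        check=True,
    )
```

A process started this way cannot create a context on a hidden GPU, because the CUDA runtime never enumerates it. Inside the child, the visible devices are renumbered starting from 0, so any device index passed to the child has to be relative to the restricted list.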
Current Workflow
The user relies on the --models-preset configuration to spawn a child process per model on demand. This architecture allows efficient model switching and routing, but the context allocation step appears to ignore the GPU pinning and touches every available device.
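The on-demand pattern can be modeled with a toy spawner that starts one child per model on first use and restricts each child to its pinned GPUs. This is an illustrative sketch under stated assumptions: the class name, the preset layout, and the GPU indices are hypothetical, and it does not reproduce how llama-server implements its router.

```python
import os
import subprocess
import sys


class OnDemandSpawner:
    """Toy model of a router that starts one child process per model on first use."""

    def __init__(self, presets):
        # presets maps a model name to (argv, gpu_indices); all values are
        # hypothetical placeholders supplied by the caller.
        self.presets = presets
        self.children = {}

    def get(self, model):
        """Return the running child for model, spawning it if needed."""
        child = self.children.get(model)
        if child is None or child.poll() is not None:
            argv, gpus = self.presets[model]
            env = dict(os.environ)
            # Isolation step: the child can only enumerate its pinned GPUs.
            env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
            child = subprocess.Popen(argv, env=env)
            self.children[model] = child
        return child

    def stop_all(self):
        """Terminate every child that is still running and forget them all."""
        for child in self.children.values():
            if child.poll() is None:
                child.terminate()
            child.wait()
        self.children.clear()
```

The design choice worth noting is that the restriction is applied in the child's environment at spawn time, before any CUDA call is made; setting it after the runtime has initialized has no effect on devices that were already enumerated.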
Note: The source for this report is a user query, so it includes neither a definitive solution nor confirmation from the llama.cpp developers. It remains unclear whether this is a bug in the router's process-spawning logic or the result of a missing configuration option.