Comparative Analysis of Dual-GPU Inference: llama.cpp Row/Tensor Split vs. ik_llama Graph Split

This article compares two multi-GPU inference strategies: the row/tensor split implemented in llama.cpp and the graph split used by ik_llama.

Technical Environment and Hardware Configuration

The benchmarks were run on a dual-GPU system. According to the system logs, the environment used NVIDIA driver version 610.43.02 with CUDA UMD version 13.3. This configuration is the baseline for analyzing how the two splitting strategies affect inference throughput and latency in local Large Language Model (LLM) deployments.

Inference Splitting Methodologies

The analysis focuses on two distinct methods of distributing model weights across multiple GPUs to optimize inference speed:

llama.cpp Row/Tensor Split

In llama.cpp, the row/tensor split divides the rows of each weight matrix across the available GPUs. Each GPU computes its portion of a matrix multiplication independently, and the partial results are then gathered into the full output before the next operation.
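
The idea of splitting a weight matrix by rows can be sketched in a few lines of plain Python. This is a toy illustration only, not llama.cpp code: the function names (matvec, row_split_matvec) and the fractions parameter are hypothetical, and the "devices" are simply slices of a list. It shows that each shard can be multiplied independently and that concatenating the partial outputs reproduces the unsplit result.

```python
def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def row_split_matvec(weight, x, fractions=(0.5, 0.5)):
    """Toy row split: each simulated device owns a contiguous block of rows.

    `fractions` is a hypothetical parameter giving the share of rows
    assigned to each device; it is not a llama.cpp option.
    """
    n_rows = len(weight)
    total = sum(fractions)
    partial_outputs = []
    start = 0
    cumulative = 0.0
    for i, fraction in enumerate(fractions):
        cumulative += fraction
        # The last device takes whatever rows remain, so no row is dropped.
        if i == len(fractions) - 1:
            end = n_rows
        else:
            end = round(n_rows * cumulative / total)
        shard = weight[start:end]                  # rows held by device i
        partial_outputs.append(matvec(shard, x))   # computed independently
        start = end
    # Gather step: concatenate the per-device outputs in row order.
    return [y for part in partial_outputs for y in part]


if __name__ == "__main__":
    W = [[1, 2], [3, 4], [5, 6], [7, 8]]
    x = [1, 1]
    print(row_split_matvec(W, x))          # same values as matvec(W, x)
    print(row_split_matvec(W, x, (3, 1)))  # uneven split, same result
```

The gather step at the end is where a real multi-GPU implementation would need to move data between devices, which is why the cost of this approach depends on the interconnect.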

ik_llama Graph Split

The ik_llama implementation introduces a graph-based splitting mechanism. Rather than dividing individual tensors in isolation, graph split partitions the computational graph itself across the GPUs, with the aim of reducing synchronization overhead and making better use of the interconnect between GPUs.
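
To make the link between graph partitioning and synchronization cost concrete, the following sketch counts how many edges of a small operation graph cross a device boundary under two different placements. It is a generic illustration under stated assumptions and does not reproduce ik_llama's actual algorithm: the op names, the linear chain of edges, and both placement functions are hypothetical.

```python
def count_cross_device_edges(edges, placement):
    """Count edges whose producer and consumer sit on different devices.

    Each such edge stands for a tensor that would have to be transferred
    (and synchronized on) between GPUs.
    """
    return sum(1 for src, dst in edges if placement[src] != placement[dst])


def contiguous_placement(ops, n_devices):
    """Assign ops to devices in contiguous blocks."""
    per_device = -(-len(ops) // n_devices)  # ceiling division
    return {op: i // per_device for i, op in enumerate(ops)}


def interleaved_placement(ops, n_devices):
    """Assign ops to devices round-robin (a deliberately poor partition)."""
    return {op: i % n_devices for i, op in enumerate(ops)}


# Hypothetical linear chain of ops, used only for illustration.
OPS = ["norm", "attn", "ffn_up", "ffn_down", "out"]
EDGES = list(zip(OPS, OPS[1:]))

if __name__ == "__main__":
    for name, place in [("contiguous", contiguous_placement),
                        ("interleaved", interleaved_placement)]:
        crossings = count_cross_device_edges(EDGES, place(OPS, 2))
        print(f"{name}: {crossings} cross-device edges")
```

In this toy graph the contiguous partition cuts one edge while the interleaved one cuts four, which illustrates why the choice of where to cut the graph can matter for synchronization overhead. Real inference graphs are not simple chains, so actual savings depend on the model and the implementation.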

Analysis Limitations

Note: The source material documents the hardware environment and the objective of the comparison, but not the numerical results, tokens-per-second (t/s) figures, or final conclusions of the benchmark. This article therefore describes the methodologies being compared rather than their measured performance.
