Low-Rank Approximation of LLM Performance: Two Factors Explain 90% of Benchmark Variance

A recent analysis of 84 frontier AI models across 133 benchmarks finds that the resulting performance matrix is approximately rank-2: two underlying factors account for roughly 90% of the variance in benchmark scores.

The Dimensionality of AI Benchmarking

As the landscape of Large Language Models (LLMs) expands, models are increasingly released with extensive evaluation suites, often featuring 40 or more benchmark scores. This proliferation of metrics creates a high-dimensional space in which it is difficult to distinguish true performance gains from noise or overfitting.

A new research paper addresses this by compiling a comprehensive public matrix consisting of 84 frontier models evaluated across 133 distinct benchmarks. The study applies linear algebra techniques to analyze the variance across this dataset, discovering a surprising structural simplicity in how models perform across different tasks.

Rank-2 Matrix Analysis

The researchers found that the performance matrix is approximately rank-2. In technical terms, roughly 90% of the variation in scores across all 133 benchmarks can be explained by two latent variables. This suggests that most benchmarks are highly correlated and largely measure the same underlying capabilities.
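One common way to check a claim like this is to take the singular value decomposition of the score matrix and look at how much of the total variance the leading singular directions capture. The sketch below does that on synthetic data, not the paper's matrix. The column centering, the noise level, and the assumption of a complete matrix with no missing scores are illustrative choices, and the summary does not say how the authors handled any of them.

```python
import numpy as np


def cumulative_explained_variance(scores: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by the top-k singular directions, for each k."""
    # Center each benchmark column so the factors describe differences between
    # models rather than the average difficulty of each benchmark (an assumption).
    centered = scores - scores.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    energy = singular_values ** 2
    return np.cumsum(energy) / energy.sum()


# Synthetic stand-in for an 84 x 133 matrix: two latent factors per model,
# two loadings per benchmark, plus noise. These are NOT the paper's data.
rng = np.random.default_rng(0)
n_models, n_benchmarks = 84, 133
model_factors = rng.normal(size=(n_models, 2))
benchmark_loadings = rng.normal(size=(2, n_benchmarks))
noise = 0.3 * rng.normal(size=(n_models, n_benchmarks))
scores = model_factors @ benchmark_loadings + noise

cumulative = cumulative_explained_variance(scores)
print(f"variance explained by the first two factors: {cumulative[1]:.1%}")
```

On a matrix that really is close to rank-2, `cumulative[1]` sits near 1 and the remaining entries add very little, which is the pattern the article describes.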

Furthermore, the study demonstrates that these two factors are robust enough to reconstruct scores for benchmarks that were intentionally omitted from the matrix, indicating a high degree of predictability in model performance across diverse evaluation sets.
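A held-out reconstruction of this kind can be sketched as follows: estimate two latent coordinates per model from the benchmarks that remain, fit the omitted benchmark against those coordinates using a handful of models whose scores on it are known, then predict it for everyone else. This is one plausible procedure under stated assumptions, not necessarily the paper's. The function name `predict_held_out`, the 20 known models, and the synthetic data are all hypothetical.

```python
import numpy as np


def predict_held_out(observed, known_rows, known_scores, rank=2):
    """Predict an omitted benchmark for every model from rank-`rank` model factors.

    observed     : (models x benchmarks) matrix WITHOUT the omitted benchmark
    known_rows   : indices of models whose omitted-benchmark score is known
    known_scores : those known scores, in the same order as known_rows
    """
    col_means = observed.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(observed - col_means, full_matrices=False)
    factors = u[:, :rank] * s[:rank]  # latent coordinates, one row per model
    # Add an intercept column so the omitted benchmark can have its own mean.
    design = np.column_stack([factors, np.ones(factors.shape[0])])
    coef, *_ = np.linalg.lstsq(design[known_rows], known_scores, rcond=None)
    return design @ coef


# Synthetic rank-2 data with mild noise (illustrative only).
rng = np.random.default_rng(1)
n_models, n_benchmarks = 84, 133
model_factors = rng.normal(size=(n_models, 2))
loadings = rng.normal(size=(2, n_benchmarks))
scores = model_factors @ loadings + 0.1 * rng.normal(size=(n_models, n_benchmarks))

observed, target = scores[:, :-1], scores[:, -1]  # omit the last benchmark
known_rows = np.arange(20)                        # models with a known score
unseen_rows = np.arange(20, n_models)             # models to predict

predicted = predict_held_out(observed, known_rows, target[known_rows])
rmse = np.sqrt(np.mean((predicted[unseen_rows] - target[unseen_rows]) ** 2))
print(f"RMSE on models never seen for the omitted benchmark: {rmse:.3f}")
```

The error is measured only on models whose omitted-benchmark score was never shown to the fit, so it reflects prediction rather than memorization.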

Optimizing Evaluation: The Minimal Benchmark Set

The practical implication of this finding is the potential to sharply reduce the compute and time cost of model evaluation. Rather than running all 133 benchmarks, the researchers identified a lean set of five that can recover most of a model's performance profile across the rest.

The five recommended "core" benchmarks for high-fidelity performance estimation are:

GPQA-Diamond
HLE
Codeforces
MMLU-Pro
ARC-AGI-1
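To illustrate how five scores could stand in for a full profile, the sketch below fits a linear map from a small set of core columns to every benchmark, using models with complete scores, and then applies it to new models for which only the core scores are available. Ordinary least squares is an assumed stand-in here, since the summary does not describe the paper's estimator. The data are synthetic, and the column positions in `core_idx` are hypothetical placeholders for the five benchmarks listed above.

```python
import numpy as np


def fit_core_map(train_scores, core_idx):
    """Least-squares map from core benchmark scores (plus intercept) to all benchmarks."""
    design = np.column_stack(
        [train_scores[:, core_idx], np.ones(train_scores.shape[0])]
    )
    weights, *_ = np.linalg.lstsq(design, train_scores, rcond=None)
    return weights  # shape: (len(core_idx) + 1, n_benchmarks)


def predict_profile(core_scores, weights):
    """Estimate a model's full benchmark profile from its core scores alone."""
    return np.append(core_scores, 1.0) @ weights


# Synthetic rank-2 data (illustrative only, not the paper's matrix).
rng = np.random.default_rng(2)
n_models, n_benchmarks = 84, 133
model_factors = rng.normal(size=(n_models, 2))
loadings = rng.normal(size=(2, n_benchmarks))
scores = model_factors @ loadings + 0.1 * rng.normal(size=(n_models, n_benchmarks))

core_idx = [0, 1, 2, 3, 4]  # hypothetical column positions of the five core benchmarks
train, test = scores[:70], scores[70:]

weights = fit_core_map(train, core_idx)
predicted = np.array([predict_profile(row[core_idx], weights) for row in test])

rmse = np.sqrt(np.mean((predicted - test) ** 2))
# Baseline: predict every benchmark with its training-set mean.
baseline_rmse = np.sqrt(np.mean((test - train.mean(axis=0)) ** 2))
print(f"core-set RMSE: {rmse:.3f}   mean-baseline RMSE: {baseline_rmse:.3f}")
```

Five anchors are more than the two needed to pin down two latent factors, which leaves some slack for averaging out noise in individual scores. Which five benchmarks work best on real data is an empirical question this sketch does not address.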

Conclusion

By identifying the low-rank nature of the benchmark matrix, this research suggests that the current practice of reporting dozens of individual scores may be largely redundant. A focused evaluation on a small, strategically chosen set of benchmarks can closely approximate a model's scores on the remaining ones.

Note: This article is based on a summary of the paper; the full methodology and the specific identity of the two underlying latent factors were not detailed in the provided source.

Original Source

Techyon

A new paper finds the matrix of 84 models × 133 AI benchmarks is basically rank-2 — two numbers predict ~90% of every model's scores
