Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
PRECISE is a framework that uses Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics. It combines a small set of human labels with a large set of LLM-generated judgments so that the resulting estimates are statistically reliable.
Addressing Bias in LLM-as-a-Judge
Large Language Models (LLMs) are now commonly used as judges for evaluating ranking systems, yet they often introduce systematic biases that make the resulting performance estimates unreliable. To address this, the authors introduce PRECISE, a method that extends Prediction-Powered Inference (PPI) to ranking evaluation. The goal is to produce estimates of ranking metrics that are provably unbiased, regardless of the specific error profile of the LLM judge.
The PRECISE Framework: Mechanism and Innovation
PRECISE achieves statistical reliability by combining a small, gold-standard human-labeled dataset with a much larger dataset judged by an LLM. Under PPI, the disagreement between human and LLM labels on the small set serves as an estimate of the LLM's systematic error, and that estimate is used to correct the metric computed from the LLM judgments on the large set.
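The correction described above can be sketched with the standard PPI estimator for a mean: the LLM-only average on the large set plus a "rectifier" measured on the human-labeled set. This is a minimal illustration of generic PPI, not the PRECISE implementation; the function name `ppi_mean` and the toy relevance labels are assumptions made for the example.

```python
def ppi_mean(human_labels, llm_on_labeled, llm_on_unlabeled):
    """Bias-corrected estimate of a mean via generic PPI (illustrative sketch).

    human_labels     : gold labels for the small human-annotated set
    llm_on_labeled   : LLM judgments for those same items, in the same order
    llm_on_unlabeled : LLM judgments for the large set without human labels
    """
    if len(human_labels) != len(llm_on_labeled):
        raise ValueError("human and LLM labels must cover the same items")

    # Estimate using LLM judgments alone (inherits the LLM's bias).
    llm_only = sum(llm_on_unlabeled) / len(llm_on_unlabeled)

    # Rectifier: average gap between human and LLM labels on the labeled set.
    rectifier = sum(h - p for h, p in zip(human_labels, llm_on_labeled)) / len(human_labels)

    return llm_only + rectifier


# Toy data (hypothetical): the LLM over-predicts relevance on one of four items.
human = [1, 0, 1, 1]
llm_small = [1, 1, 1, 1]
llm_large = [1, 1, 1, 0, 1, 1, 1, 1]
estimate = ppi_mean(human, llm_small, llm_large)
```

In this toy case the LLM-only estimate is pulled downward by the measured over-prediction. With only four labeled items the rectifier is itself noisy, which is why PPI is normally paired with confidence intervals rather than point estimates alone.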
Overcoming Computational Complexity in Hierarchical Metrics
A primary challenge in applying PPI to ranking evaluation is the hierarchical nature of metrics such as Precision@K: annotations are made at the document level, but the metric is computed at the query level. Because of this mismatch, a direct application of PPI requires a computation over an exponentially large output space, $O(2^{|C|})$.
The authors reduce this cost from $O(2^{|C|})$ to $O(2^K)$, where $K$ is the rank cutoff of the metric. This makes the computation feasible for practical ranking evaluation tasks without sacrificing statistical rigor.
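One way to build intuition for the smaller bound is that Precision@K depends only on the relevance labels of the top K ranked documents, so only $2^K$ binary label patterns can affect the metric. The sketch below illustrates that observation; it is not the authors' algorithm, and the helper names (`precision_at_k`, `topk_configurations`, `expected_precision_at_k`) are hypothetical.

```python
from itertools import product


def precision_at_k(relevance, k):
    """Precision@K from binary document-level labels given in ranked order."""
    return sum(relevance[:k]) / k


def topk_configurations(k):
    """All 2**k binary relevance patterns for the top-k positions."""
    return list(product((0, 1), repeat=k))


def expected_precision_at_k(config_probs, k):
    """Expected Precision@K given probabilities over top-k label patterns.

    config_probs maps a length-k tuple of 0/1 labels to its probability.
    """
    return sum(prob * precision_at_k(list(cfg), k) for cfg, prob in config_probs.items())


# Documents ranked below K never change the metric.
same = precision_at_k([1, 0, 1, 0, 0, 0], 3) == precision_at_k([1, 0, 1, 1, 1, 1], 3)

# For K = 3 there are 8 patterns to consider, however long the ranked list is.
n_patterns = len(topk_configurations(3))
```

Any quantity defined over label patterns therefore only has to be evaluated on the top-K positions, which is the kind of saving the $O(2^K)$ bound reflects.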
Benchmark Results
The effectiveness of the PRECISE framework was tested on the ESCI benchmark. Preliminary results indicate that augmenting a small set of 30 human annotations with LLM judgments allows for significantly more reliable evaluation of ranking performance compared to relying solely on LLM judgments or limited human labels.