Fewer Battles, More Gain: An Information-Efficient Framework for Arena-based LLM Evaluation

ICLR 2026 Conference Submission
Anonymous Authors
LLM Evaluation; Chatbot Arena; Efficient Evaluation
Abstract:

Arena-based evaluation has become a key method for assessing large language models (LLMs) through head-to-head model comparisons, closely reflecting human preferences. However, current arena rating systems (e.g., the Elo rating system) often suffer from inefficiency: exhaustive or random model-pair annotation produces redundant evaluations and lengthens evaluation time. To address these challenges, we propose a novel adaptive model-pair selection algorithm. By leveraging the asymptotic normality of LLM ability estimates under sparse conditions, our approach strategically selects high-value model pairs, focusing on the confrontations that most reduce estimation variance. Specifically, we introduce Fisher information as a metric to guide model-pair selection, optimizing the evaluation process through A-optimality and D-optimality: A-optimality minimizes estimation variance, ensuring balanced reliability across models, while D-optimality reduces overall uncertainty by maximizing the determinant of the Fisher information matrix. Extensive experiments on both simulated and real-world datasets demonstrate that our method outperforms existing approaches in information efficiency and result reliability. Notably, our method offers a flexible, general toolkit that can be easily integrated into existing arena-based platforms, greatly improving the scalability and efficiency of large-scale LLM evaluation.
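
The design criteria named in the abstract can be sketched concretely. Under the Bradley-Terry model, P(i beats j) = sigmoid(theta_i - theta_j), and one battle between models i and j contributes p(1 - p)(e_i - e_j)(e_i - e_j)^T to the Fisher information matrix. The sketch below is an illustration of A-/D-optimal pair selection under that model, not the authors' implementation; the greedy loop, function names, and the ridge term (which handles the shift-invariance of Bradley-Terry ratings) are assumptions.

```python
import numpy as np

def pair_fisher_info(theta, i, j):
    """Fisher information one i-vs-j battle adds under the Bradley-Terry
    model, where P(i beats j) = sigmoid(theta_i - theta_j)."""
    m = len(theta)
    p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
    d = np.zeros(m)
    d[i], d[j] = 1.0, -1.0
    return p * (1.0 - p) * np.outer(d, d)

def select_next_pair(theta, F, criterion="D", ridge=1e-3):
    """Greedily pick the pair whose next battle best improves the design:
    'D' maximizes log det(F), 'A' minimizes trace(F^-1). The small ridge
    makes F invertible (the raw FIM is singular because Bradley-Terry
    ratings are only identified up to an additive shift)."""
    m = len(theta)
    best_pair, best_score = None, -np.inf
    for i in range(m):
        for j in range(i + 1, m):
            M = F + pair_fisher_info(theta, i, j) + ridge * np.eye(m)
            if criterion == "D":
                score = np.linalg.slogdet(M)[1]       # maximize log-determinant
            else:
                score = -np.trace(np.linalg.inv(M))   # minimize trace of inverse
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair

# Four equally rated models, but pair (0, 1) has already battled many times;
# both criteria then prefer the pair about which nothing is known yet.
theta = np.zeros(4)
F = 10 * pair_fisher_info(theta, 0, 1)
print(select_next_pair(theta, F, "D"))  # → (2, 3)
```

Both criteria steer battles away from directions the current design already covers, which is the information-efficiency mechanism the abstract describes.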

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 3

Research Landscape Overview

Core task: efficient model pair selection for arena-based LLM evaluation. The field has organized itself around several complementary branches that together address how to build, operate, and improve large-scale comparative evaluation systems. Arena-Based Evaluation Platforms and Frameworks establish the infrastructure for collecting human or automated pairwise judgments, exemplified by systems like Chatbot Arena[3] and specialized variants such as RAG-QA Arena[26]. Ranking Methodologies and Algorithms develop the statistical machinery—often rooted in Elo or Bradley-Terry models—to convert sparse comparisons into global leaderboards, with works like Elo Uncovered[5] examining the theoretical foundations. Efficient Evaluation and Sample Selection tackles the cost bottleneck by designing adaptive strategies that reduce the number of comparisons needed, while Evaluation Reliability and Bias Mitigation investigates position bias, judge consistency, and adversarial manipulation. LLM-as-Judge and Comparative Assessment explores using language models themselves as evaluators, and Benchmark Design and Evaluation Methodology addresses prompt construction, metric choice, and reproducibility across diverse tasks.

Within this landscape, a particularly active tension exists between scaling up arena coverage and controlling evaluation costs. Many studies focus on adaptive sampling or active learning to prioritize informative matchups, while others scrutinize whether automated judges can replace expensive human annotations without introducing new biases. Fewer Battles[0] sits squarely in the Adaptive Model Pair Selection cluster, proposing algorithms that intelligently choose which model pairs to compare in order to accelerate convergence of ranking estimates.
This emphasis on sample efficiency contrasts with broader platform studies like Chatbot Arena[3], which prioritize ecosystem-wide data collection, and complements reliability-focused work such as Not Fair Evaluators[2], which examines how judge biases propagate through pairwise comparisons. By targeting the algorithmic core of pair selection, Fewer Battles[0] addresses a bottleneck that affects both human-driven arenas and automated evaluation pipelines, positioning itself as a methodological contribution to making large-scale comparative assessment more practical.

Claimed Contributions

Adaptive model-pair selection algorithm for arena-based LLM evaluation

The authors introduce an adaptive algorithm that selects model pairs for evaluation by exploiting the asymptotic normality of ability estimates under sparse conditions. This approach targets high-value confrontations with minimal variance, thereby improving evaluation efficiency.

1 retrieved paper
Fisher information-based optimization using A-optimality and D-optimality

The authors propose using Fisher information to guide model pair selection, implementing two optimization criteria: A-optimality, which minimizes estimation variance for balanced reliability, and D-optimality, which reduces uncertainty by maximizing the Fisher Information Matrix determinant.

10 retrieved papers
Can Refute
Introduction of efficiency concept in arena-based LLM evaluation

The authors introduce the concept of efficiency into arena-based LLM evaluation by using statistical uncertainty measures to minimize redundant evaluations, thereby significantly improving evaluation speed and resource utilization.

3 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though that conclusion is constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Adaptive model-pair selection algorithm for arena-based LLM evaluation

The authors introduce an adaptive algorithm that selects model pairs for evaluation by exploiting the asymptotic normality of ability estimates under sparse conditions. This approach targets high-value confrontations with minimal variance, thereby improving evaluation efficiency.
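
The asymptotic-normality claim can be made concrete: for the Bradley-Terry MLE, the asymptotic covariance of the rating estimates is the inverse of the Fisher information matrix, so its diagonal gives per-model standard errors that reveal which models are under-evaluated. The sketch below illustrates that idea only; it is not the paper's algorithm, and the battle counts and the pseudo-inverse treatment of the FIM's shift-invariance null space are illustrative assumptions.

```python
import numpy as np

def bt_fisher_matrix(theta, battle_counts):
    """Fisher information matrix of Bradley-Terry ratings, given how many
    battles each pair has played (battle_counts: dict (i, j) -> count)."""
    m = len(theta)
    F = np.zeros((m, m))
    for (i, j), n in battle_counts.items():
        p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
        d = np.zeros(m)
        d[i], d[j] = 1.0, -1.0
        F += n * p * (1.0 - p) * np.outer(d, d)
    return F

def rating_standard_errors(F):
    """Asymptotic standard errors of the rating MLE: square roots of the
    diagonal of the pseudo-inverse of the FIM (the FIM is singular because
    ratings are only identified up to an additive shift)."""
    cov = np.linalg.pinv(F)
    return np.sqrt(np.diag(cov))

theta = np.zeros(3)
# Models 0 and 1 have battled often; model 2 has barely been evaluated,
# so its rating estimate carries the largest uncertainty.
F = bt_fisher_matrix(theta, {(0, 1): 100, (0, 2): 2, (1, 2): 2})
se = rating_standard_errors(F)
print(se[2] > se[0])  # → True
```

An adaptive selector in this spirit would route the next battles toward the high-variance model, which is the "high-value confrontation" behavior the contribution describes.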

Contribution

Fisher information-based optimization using A-optimality and D-optimality

The authors propose using Fisher information to guide model pair selection, implementing two optimization criteria: A-optimality, which minimizes estimation variance for balanced reliability, and D-optimality, which reduces uncertainty by maximizing the Fisher Information Matrix determinant.
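
Both criteria rest on a simple property of the Bradley-Terry likelihood: the information a single battle carries scales with p(1 - p), which peaks at 0.25 for evenly matched models and decays as the rating gap grows. The snippet below illustrates that weight; it is an observation about the model family, not text from the paper, and the function name is an assumption.

```python
import numpy as np

def battle_information(theta_i, theta_j):
    """Scalar Fisher-information weight p * (1 - p) of a single battle
    under Bradley-Terry, with p = sigmoid(theta_i - theta_j)."""
    p = 1.0 / (1.0 + np.exp(-(theta_i - theta_j)))
    return p * (1.0 - p)

# An even matchup is maximally informative; a lopsided one adds almost
# nothing to the rating estimates, so it can be deprioritized.
print(battle_information(0.0, 0.0))  # → 0.25
print(battle_information(0.0, 4.0))  # ≈ 0.0177
```

This is one reason Fisher-information criteria favor close matchups: lopsided battles whose outcome is nearly certain contribute little to either the A- or the D-criterion.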

Contribution

Introduction of efficiency concept in arena-based LLM evaluation

The authors introduce the concept of efficiency into arena-based LLM evaluation by using statistical uncertainty measures to minimize redundant evaluations, thereby significantly improving evaluation speed and resource utilization.