Fewer Battles, More Gain: An Information-Efficient Framework for Arena-based LLM Evaluation

ICLR 2026 Conference Submission
Anonymous Authors
LLM Evaluation; Chatbot Arena; Efficient Evaluation
Abstract:

Arena-based evaluation has become a key method for assessing large language models (LLMs) through head-to-head model comparisons, closely reflecting human preferences. However, current arena rating systems (e.g., the Elo rating system) often suffer from inefficiency: exhaustive or random model-pair annotation produces redundant evaluations and lengthens evaluation time. To address these challenges, we propose a novel adaptive model-pair selection algorithm. By leveraging the asymptotic normality of LLM ability estimates under sparse conditions, our approach strategically selects high-value model pairs, focusing on the confrontations that most reduce estimation variance. Specifically, we introduce Fisher information as a metric to guide model-pair selection, optimizing the evaluation process through A-optimality and D-optimality: A-optimality minimizes estimation variance, ensuring balanced reliability across models, while D-optimality reduces overall uncertainty by maximizing the determinant of the Fisher information matrix. Extensive experiments on both simulated and real-world datasets demonstrate that our method outperforms existing approaches in information efficiency and result reliability. Notably, our method offers a flexible, general toolkit that can be easily integrated into existing arena-based platforms, greatly improving the scalability and efficiency of large-scale LLM evaluation.
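
The design criteria named in the abstract can be sketched concretely. Under the Bradley-Terry model, P(i beats j) = sigmoid(theta_i - theta_j), and one battle between models i and j contributes p(1 - p)(e_i - e_j)(e_i - e_j)^T to the Fisher information matrix. The sketch below is an illustration of A-/D-optimal pair selection under that model, not the authors' implementation; the greedy loop, function names, and the ridge term (which handles the shift-invariance of Bradley-Terry ratings) are assumptions.

```python
import numpy as np

def pair_fisher_info(theta, i, j):
    """Fisher information one i-vs-j battle adds under the Bradley-Terry
    model, where P(i beats j) = sigmoid(theta_i - theta_j)."""
    m = len(theta)
    p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
    d = np.zeros(m)
    d[i], d[j] = 1.0, -1.0
    return p * (1.0 - p) * np.outer(d, d)

def select_next_pair(theta, F, criterion="D", ridge=1e-3):
    """Greedily pick the pair whose next battle best improves the design:
    'D' maximizes log det(F), 'A' minimizes trace(F^-1). The small ridge
    makes F invertible (the raw FIM is singular because Bradley-Terry
    ratings are only identified up to an additive shift)."""
    m = len(theta)
    best_pair, best_score = None, -np.inf
    for i in range(m):
        for j in range(i + 1, m):
            M = F + pair_fisher_info(theta, i, j) + ridge * np.eye(m)
            if criterion == "D":
                score = np.linalg.slogdet(M)[1]       # maximize log-determinant
            else:
                score = -np.trace(np.linalg.inv(M))   # minimize trace of inverse
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair

# Four equally rated models, but pair (0, 1) has already battled many times;
# both criteria then prefer the pair about which nothing is known yet.
theta = np.zeros(4)
F = 10 * pair_fisher_info(theta, 0, 1)
print(select_next_pair(theta, F, "D"))  # → (2, 3)
```

Both criteria steer battles away from directions the current design already covers, which is the information-efficiency mechanism the abstract describes.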

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 3

Research Landscape Overview

Core task: efficient model pair selection for arena-based LLM evaluation. The field has organized itself around several complementary branches that together address how to build, operate, and improve large-scale comparative evaluation systems. Arena-Based Evaluation Platforms and Frameworks establish the infrastructure for collecting human or automated pairwise judgments, exemplified by systems like Chatbot Arena[3] and specialized variants such as RAG-QA Arena[26]. Ranking Methodologies and Algorithms develop the statistical machinery—often rooted in Elo or Bradley-Terry models—to convert sparse comparisons into global leaderboards, with works like Elo Uncovered[5] examining the theoretical foundations. Efficient Evaluation and Sample Selection tackles the cost bottleneck by designing adaptive strategies that reduce the number of comparisons needed, while Evaluation Reliability and Bias Mitigation investigates position bias, judge consistency, and adversarial manipulation. LLM-as-Judge and Comparative Assessment explores using language models themselves as evaluators, and Benchmark Design and Evaluation Methodology addresses prompt construction, metric choice, and reproducibility across diverse tasks.

Within this landscape, a particularly active tension exists between scaling up arena coverage and controlling evaluation costs. Many studies focus on adaptive sampling or active learning to prioritize informative matchups, while others scrutinize whether automated judges can replace expensive human annotations without introducing new biases. Fewer Battles[0] sits squarely in the Adaptive Model Pair Selection cluster, proposing algorithms that intelligently choose which model pairs to compare in order to accelerate convergence of ranking estimates.
This emphasis on sample efficiency contrasts with broader platform studies like Chatbot Arena[3], which prioritize ecosystem-wide data collection, and complements reliability-focused work such as Not Fair Evaluators[2], which examines how judge biases propagate through pairwise comparisons. By targeting the algorithmic core of pair selection, Fewer Battles[0] addresses a bottleneck that affects both human-driven arenas and automated evaluation pipelines, positioning itself as a methodological contribution to making large-scale comparative assessment more practical.

Claimed Contributions

Adaptive model-pair selection algorithm for arena-based LLM evaluation

The authors introduce an adaptive algorithm that selects model pairs for evaluation by exploiting the asymptotic normality of ability estimates under sparse conditions. This approach targets high-value confrontations with minimal variance, thereby improving evaluation efficiency.

1 retrieved paper
Fisher information-based optimization using A-optimality and D-optimality

The authors propose using Fisher information to guide model pair selection, implementing two optimization criteria: A-optimality, which minimizes estimation variance for balanced reliability, and D-optimality, which reduces uncertainty by maximizing the Fisher Information Matrix determinant.

10 retrieved papers
Can Refute
Introduction of efficiency concept in arena-based LLM evaluation

The authors introduce the concept of efficiency into arena-based LLM evaluation by using statistical uncertainty measures to minimize redundant evaluations, thereby significantly improving evaluation speed and resource utilization.

3 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though that conclusion is constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Adaptive model-pair selection algorithm for arena-based LLM evaluation

The authors introduce an adaptive algorithm that selects model pairs for evaluation by exploiting the asymptotic normality of ability estimates under sparse conditions. This approach targets high-value confrontations with minimal variance, thereby improving evaluation efficiency.
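
The asymptotic-normality claim can be made concrete: for the Bradley-Terry MLE, the asymptotic covariance of the rating estimates is the inverse of the Fisher information matrix, so its diagonal gives per-model standard errors that reveal which models are under-evaluated. The sketch below illustrates that idea only; it is not the paper's algorithm, and the battle counts and the pseudo-inverse treatment of the FIM's shift-invariance null space are illustrative assumptions.

```python
import numpy as np

def bt_fisher_matrix(theta, battle_counts):
    """Fisher information matrix of Bradley-Terry ratings, given how many
    battles each pair has played (battle_counts: dict (i, j) -> count)."""
    m = len(theta)
    F = np.zeros((m, m))
    for (i, j), n in battle_counts.items():
        p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
        d = np.zeros(m)
        d[i], d[j] = 1.0, -1.0
        F += n * p * (1.0 - p) * np.outer(d, d)
    return F

def rating_standard_errors(F):
    """Asymptotic standard errors of the rating MLE: square roots of the
    diagonal of the pseudo-inverse of the FIM (the FIM is singular because
    ratings are only identified up to an additive shift)."""
    cov = np.linalg.pinv(F)
    return np.sqrt(np.diag(cov))

theta = np.zeros(3)
# Models 0 and 1 have battled often; model 2 has barely been evaluated,
# so its rating estimate carries the largest uncertainty.
F = bt_fisher_matrix(theta, {(0, 1): 100, (0, 2): 2, (1, 2): 2})
se = rating_standard_errors(F)
print(se[2] > se[0])  # → True
```

An adaptive selector in this spirit would route the next battles toward the high-variance model, which is the "high-value confrontation" behavior the contribution describes.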

Contribution

Fisher information-based optimization using A-optimality and D-optimality

The authors propose using Fisher information to guide model pair selection, implementing two optimization criteria: A-optimality, which minimizes estimation variance for balanced reliability, and D-optimality, which reduces uncertainty by maximizing the Fisher Information Matrix determinant.
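
Both criteria rest on a simple property of the Bradley-Terry likelihood: the information a single battle carries scales with p(1 - p), which peaks at 0.25 for evenly matched models and decays as the rating gap grows. The snippet below illustrates that weight; it is an observation about the model family, not text from the paper, and the function name is an assumption.

```python
import numpy as np

def battle_information(theta_i, theta_j):
    """Scalar Fisher-information weight p * (1 - p) of a single battle
    under Bradley-Terry, with p = sigmoid(theta_i - theta_j)."""
    p = 1.0 / (1.0 + np.exp(-(theta_i - theta_j)))
    return p * (1.0 - p)

# An even matchup is maximally informative; a lopsided one adds almost
# nothing to the rating estimates, so it can be deprioritized.
print(battle_information(0.0, 0.0))  # → 0.25
print(battle_information(0.0, 4.0))  # ≈ 0.0177
```

This is one reason Fisher-information criteria favor close matchups: lopsided battles whose outcome is nearly certain contribute little to either the A- or the D-criterion.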

Contribution

Introduction of efficiency concept in arena-based LLM evaluation

The authors introduce the concept of efficiency into arena-based LLM evaluation by using statistical uncertainty measures to minimize redundant evaluations, thereby significantly improving evaluation speed and resource utilization.