How Reliable is Language Model Micro-Benchmarking?

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: efficient evaluation, meta-evaluation, language models
Abstract:

Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
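To make the ranking-consistency question concrete, the sketch below estimates how often a randomly sampled micro-benchmark preserves a model pair's full-benchmark ordering. This is an illustrative construction rather than the paper's code: the per-example correctness vectors, benchmark size, accuracies, subset size, and resampling count are all assumptions.

```python
import numpy as np

def ranking_agreement_rate(correct_a, correct_b, subset_size, n_trials=1000, seed=0):
    """Estimate how often a randomly sampled micro-benchmark of `subset_size`
    examples orders model A and model B the same way the full benchmark does."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)  # per-example correctness (1 = correct)
    b = np.asarray(correct_b, dtype=float)
    full_sign = np.sign(a.mean() - b.mean())  # ordering of the pair on the full benchmark
    agree = 0
    for _ in range(n_trials):
        idx = rng.choice(a.size, size=subset_size, replace=False)
        agree += np.sign(a[idx].mean() - b[idx].mean()) == full_sign
    return agree / n_trials

# Toy data: two hypothetical models roughly 3 accuracy points apart on a
# 12,000-example benchmark (all numbers are illustrative).
rng = np.random.default_rng(1)
model_a = rng.random(12_000) < 0.62
model_b = rng.random(12_000) < 0.59
print(ranking_agreement_rate(model_a, model_b, subset_size=25))
```

In this toy setup the estimated agreement rate lands only modestly above 50%, which illustrates why very small subsets struggle to separate models that are a few accuracy points apart.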

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a meta-evaluation framework for assessing whether micro-benchmarks can reliably rank language models, focusing on the Minimum Detectable Ability Difference (MDAD) measure and pairwise ranking agreement probabilities. It resides in the 'Micro-Benchmark Reliability Analysis' leaf, which contains no other papers, indicating a relatively sparse research direction within the broader 'Micro-Benchmark Design and Validation' branch. The taxonomy contains eleven papers in total, with this leaf representing a focused but underexplored niche examining the statistical properties of small-scale evaluation methods.

The taxonomy reveals neighboring work in sibling leaves: 'Synthetic Lightweight Test Suite Generation' addresses rapid dataset creation, while 'Multi-Agent Long-Horizon Stress Testing' examines reliability in extended interaction scenarios. These adjacent directions emphasize benchmark construction and dynamic robustness rather than the statistical validation of ranking consistency that defines this paper's contribution. Parallel branches like 'Task-Specific Robustness Benchmarks' and 'Behavioral Consistency Evaluation' probe model stability under perturbations or across reasoning patterns, but do not directly address the meta-question of whether micro-benchmark rankings preserve full-benchmark orderings.

Among the twenty candidate papers examined, none of the claimed contributions was clearly refuted by prior work. The MDAD measure was assessed against ten candidates with no refutable overlaps; the pairwise ranking framework was compared against six candidates with similar results; and the actionable size-selection guidance was checked against four candidates without finding substantial prior coverage. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific framing of ranking reliability as a function of performance difference appears novel, though the analysis does not claim exhaustive coverage of the related statistical evaluation literature.

The sparse taxonomy leaf suggests the paper addresses a gap in how the field validates micro-benchmark design choices. However, given the limited search scope, the twenty-candidate examination cannot rule out relevant work in adjacent statistical or psychometric evaluation traditions outside the core language model benchmarking literature. The novelty appears strongest in operationalizing ranking agreement as a meta-evaluation criterion, though broader connections to measurement theory remain underexplored in this analysis.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating the reliability of language model micro-benchmarks. The field has organized itself around several complementary perspectives on how to assess whether small-scale evaluation tasks truly measure what they claim.

At the highest level, one branch focuses on Micro-Benchmark Design and Validation, examining the internal consistency and statistical properties of individual test suites. Parallel branches address Task-Specific Robustness Benchmarks (probing whether performance holds under input perturbations), Behavioral Consistency Evaluation (checking whether models exhibit stable reasoning patterns), and Human-Aligned Evaluation Methodologies (ensuring that automated metrics correlate with human judgments). Additional branches cover Domain-Specific Trustworthiness Frameworks, such as WebTrust[1] for web-based tasks or specialized resources for automotive IoT[5], and broader Trustworthiness Challenges and Perspectives[3][6] that situate reliability questions within ethical and societal contexts. Cross-Domain Evaluation Techniques round out the taxonomy by exploring how insights transfer across different problem settings.

Several active lines of work highlight contrasting priorities. Some studies emphasize controlled perturbation experiments to stress-test robustness (PPTC-R[4]), while others develop compact benchmarks like Tiny QA Benchmark[7] to balance coverage with efficiency. Meanwhile, domain-specific efforts (Nuclear Engineering Retrieval[9], Sentence Simplification Evaluation[8]) demonstrate that reliability concerns vary widely depending on the application.

Micro-Benchmarking Reliability[0] sits squarely within the Micro-Benchmark Design and Validation branch, focusing on the foundational question of whether small test sets yield stable and interpretable signals. Its emphasis on statistical validation and reproducibility aligns closely with works like BECEL[2], which also scrutinizes benchmark construction, yet it differs from performance-oriented tools such as Chatperftest[10] or behavioral probes like Delay-of-Gratification[11], which prioritize dynamic or longitudinal consistency over static design properties.

Claimed Contributions

Minimum Detectable Ability Difference (MDAD) meta-evaluation measure

The authors propose MDAD, a new meta-evaluation measure that determines the minimum performance difference between two models on a full benchmark required for a micro-benchmark to consistently rank them correctly at least 80% of the time. This measure provides finer-grained analysis of micro-benchmark reliability than existing aggregate metrics.

10 retrieved papers
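As a rough illustration of how a threshold like MDAD could be read off an estimated agreement curve, the sketch below returns the smallest gap whose agreement probability reaches 80%. The curve values, binning, and monotonicity assumption are hypothetical; only the 80% reliability level comes from the contribution description above.

```python
def minimum_detectable_ability_difference(gaps, agreement_probs, threshold=0.8):
    """Smallest full-benchmark performance gap at which the estimated probability
    of the micro-benchmark ranking a model pair correctly reaches `threshold`.

    `gaps` holds sorted full-benchmark accuracy differences (in points), and
    `agreement_probs[i]` is the estimated agreement probability for pairs with
    gap `gaps[i]`; a roughly monotone agreement curve is assumed."""
    for gap, prob in zip(gaps, agreement_probs):
        if prob >= threshold:
            return gap
    return float("inf")  # no gap in the observed range is reliably detectable

# Hypothetical agreement curve for a small micro-benchmark (illustrative numbers).
gaps = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0]
probs = [0.55, 0.61, 0.68, 0.74, 0.83, 0.95]
print(minimum_detectable_ability_difference(gaps, probs))  # -> 5.0
```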
Pairwise ranking agreement probability framework

The authors introduce a framework for evaluating micro-benchmarks based on the probability that pairwise model rankings on a micro-benchmark agree with those on the full benchmark, as a function of the performance difference between model pairs. This approach differs from prior work that focused on individual model accuracy or aggregate rankings.

6 retrieved papers
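The framework can be pictured as scoring every model pair for agreement and then aggregating by the pair's full-benchmark gap. The sketch below is one way to compute such a curve under stated assumptions; the accuracy dictionaries, bin edges, and single-micro-benchmark setup are hypothetical simplifications, not the paper's protocol.

```python
from itertools import combinations
import numpy as np

def agreement_by_gap(full_acc, micro_acc, bin_edges):
    """Fraction of model pairs whose micro-benchmark ordering matches their
    full-benchmark ordering, grouped by full-benchmark accuracy gap.

    `full_acc` and `micro_acc` map model name -> accuracy on the full benchmark
    and on one micro-benchmark; both dictionaries are hypothetical inputs."""
    gaps, agrees = [], []
    for m1, m2 in combinations(full_acc, 2):
        gaps.append(abs(full_acc[m1] - full_acc[m2]))
        agrees.append(np.sign(full_acc[m1] - full_acc[m2])
                      == np.sign(micro_acc[m1] - micro_acc[m2]))
    gaps = np.array(gaps)
    agrees = np.array(agrees, dtype=float)
    curve = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (gaps >= lo) & (gaps < hi)
        if mask.any():
            curve[(lo, hi)] = float(agrees[mask].mean())
    return curve

# Tiny illustrative input: three models, one sampled micro-benchmark.
full = {"model_x": 0.62, "model_y": 0.59, "model_z": 0.55}
micro = {"model_x": 0.60, "model_y": 0.64, "model_z": 0.52}
print(agreement_by_gap(full, micro, bin_edges=[0.0, 0.05, 0.10]))
```

In practice one would average such curves over many resampled micro-benchmarks and many model pairs, so that each bin estimates the agreement probability as a function of the performance gap.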
Actionable guidance for micro-benchmark size selection

The authors provide empirical findings and practical recommendations for selecting appropriate micro-benchmark sizes based on the desired ability to distinguish models with varying performance differences. They show when random sampling becomes competitive with specialized micro-benchmarking methods and identify limitations of extremely small micro-benchmarks.

4 retrieved papers
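To show how a benchmark user might act on this kind of guidance with plain random sampling, the sketch below sweeps candidate subset sizes for a single model pair and returns the smallest size whose estimated agreement rate clears a reliability target. It reuses the resampling idea from the earlier sketch and is our construction under stated assumptions, not the paper's procedure; the candidate sizes, threshold, and correctness vectors are placeholders.

```python
import numpy as np

def smallest_reliable_size(correct_a, correct_b, candidate_sizes,
                           threshold=0.8, n_trials=1000, seed=0):
    """Smallest random-sampling micro-benchmark size whose estimated probability
    of preserving this pair's full-benchmark ordering reaches `threshold`.
    Inputs are hypothetical 0/1 per-example correctness vectors."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    full_sign = np.sign(a.mean() - b.mean())
    for n in sorted(candidate_sizes):
        agree = 0
        for _ in range(n_trials):
            idx = rng.choice(a.size, size=n, replace=False)
            agree += np.sign(a[idx].mean() - b[idx].mean()) == full_sign
        if agree / n_trials >= threshold:
            return n
    return None  # no candidate size in the sweep is reliable for this pair

# Example call (correctness vectors come from full-benchmark runs):
# smallest_reliable_size(correct_a, correct_b, candidate_sizes=[10, 25, 250])
```

Repeating this sweep over many model pairs grouped by their performance gap yields the kind of size-versus-reliability view that the guidance above summarizes.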

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Minimum Detectable Ability Difference (MDAD) meta-evaluation measure

Assessed against the 10 retrieved candidate papers; no refutable overlap with prior work was identified.

Contribution: Pairwise ranking agreement probability framework

Assessed against the 6 retrieved candidate papers; no refutable overlap with prior work was identified.

Contribution: Actionable guidance for micro-benchmark size selection

Assessed against the 4 retrieved candidate papers; no substantial prior coverage was found.
