Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Preference-based Evaluations, Robustness to Data Dropping, Bradley--Terry Model, Influence Functions
Abstract:

We propose a method for evaluating the robustness of widely used LLM ranking systems---variants of a Bradley--Terry model---to dropping a worst-case very small fraction of preference data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from popular LLM ranking platforms, including Chatbot Arena and derivatives, we find that the rankings of top-performing models can be remarkably sensitive to the removal of a small fraction of preferences; for instance, dropping just 0.003% of human preferences can change the top-ranked model on Chatbot Arena. Our robustness check identifies the specific preferences most responsible for such ranking flips, allowing for inspection of these influential preferences. We observe that the rankings derived from MT-bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-bench's use of expert annotators and carefully constructed prompts. Finally, we find that neither rankings based on crowdsourced human evaluations nor those based on LLM-as-a-judge preferences are systematically more sensitive than the other.
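The abstract describes ranking systems built on variants of a Bradley--Terry model. As background, the following is a minimal, illustrative sketch of how such scores can be fit from pairwise preferences via gradient ascent on the log-likelihood; it is not the implementation used by Chatbot Arena or any other platform discussed here.

```python
import math

def fit_bradley_terry(preferences, n_models, lr=0.1, n_iters=2000):
    """Fit Bradley--Terry scores by gradient ascent on the log-likelihood.

    preferences: list of (winner, loser) model-index pairs.
    Under the model, P(i beats j) = sigmoid(s_i - s_j).
    """
    s = [0.0] * n_models
    for _ in range(n_iters):
        grad = [0.0] * n_models
        for w, l in preferences:
            p = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))  # P(w beats l)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        s = [si + lr * gi for si, gi in zip(s, grad)]
        s = [si - s[0] for si in s]  # pin s[0] = 0 (scores are shift-invariant)
    return s

# Toy example: model 0 beats model 1 in 7 of 10 preferences.
scores = fit_bradley_terry([(0, 1)] * 7 + [(1, 0)] * 3, n_models=2)
```

At the optimum the fitted gap satisfies sigmoid(s_0 - s_1) = 0.7, i.e. s_0 - s_1 = log(7/3) ≈ 0.85, which is why small numbers of flipped or dropped preferences can move scores of closely matched models.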

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a computationally efficient method for assessing how Bradley-Terry-based LLM ranking systems respond to worst-case removal of small preference data fractions. It resides in the 'Preference-Based Ranking Robustness' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy of ten papers across seven leaf nodes. This positioning suggests the work addresses a focused problem—stability of preference-driven rankings under adversarial data dropping—that has received limited prior attention compared to adjacent areas like general benchmark evaluation or machine unlearning.

The taxonomy reveals neighboring research directions that contextualize this contribution. The sibling leaf 'Benchmark Stability Under Missing Data' examines ranking stability when scores are absent but does not focus on preference-based models or adversarial dropping scenarios. Adjacent branches include 'Data Removal and Modification Techniques,' which emphasizes unlearning and safety filtering rather than ranking sensitivity analysis, and 'LLM Evaluation Frameworks,' which addresses broader assessment paradigms. The paper's scope note explicitly excludes general evaluation frameworks and data modification methods, positioning it at the intersection of ranking theory and adversarial robustness rather than comprehensive evaluation design or training-time data curation.

Among fourteen candidates examined through limited semantic search, no contributions were clearly refuted by prior work. For the core methodological contribution (evaluating worst-case robustness to data dropping), ten candidates were examined with zero refutable overlaps; for the identification of specific influential preferences, three candidates were examined, again with no refutations; and the empirical findings on platform sensitivity were compared against one candidate without overlap. These statistics suggest that, within the examined scope, the approach and findings appear distinct from the existing literature, though the limited search scale (fourteen candidates in total) means comprehensive coverage of all related work cannot be claimed.

Based on the available signals, the work appears to occupy a relatively underexplored niche within LLM evaluation research. The sparse taxonomy leaf and absence of refutable prior work among examined candidates suggest novelty, though the limited search scope (top-K semantic matches plus citations) means adjacent or parallel efforts outside this sample remain possible. The focus on adversarial data dropping for preference-based rankings distinguishes it from broader robustness studies that address missing data or general perturbations without targeting worst-case scenarios.

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating robustness of LLM ranking systems to data dropping. The field examines how ranking systems that compare or order large language models respond when portions of evaluation data are removed or modified. The taxonomy organizes this area into three main branches. The first, Ranking System Robustness and Sensitivity Analysis, investigates how stable rankings remain under perturbations, including preference-based methods that assess consistency when human or model preferences are altered. The second branch, Data Removal and Modification Techniques, encompasses approaches for systematically dropping, unlearning, or filtering data, ranging from exact unlearning methods like Exact Efficient Unlearning[2] to representation-level interventions such as Layer-aware Representation Filtering[1]. The third branch, LLM Evaluation Frameworks and Methodologies, addresses broader assessment paradigms, including multi-dimensional evaluation schemes like Multi-Dimensional LLM Evaluation[7] and benchmarks such as ONEBench[4] that test model capabilities under varied conditions.

Several active lines of work explore trade-offs between evaluation completeness and robustness. One thread examines how missing or incomplete scores affect ranking validity, as seen in Handling Missing Scores[5], while another investigates whether LLMs themselves can serve as reliable rankers, exemplified by LLM Better Ranker[6]. A related concern is maintaining stable task prioritization when data availability fluctuates, addressed by Stable Task Prioritization[8].

Within this landscape, Dropping Preferences Rankings[0] sits squarely in the preference-based robustness cluster, focusing on how rankings derived from pairwise preference judgments degrade when subsets of preferences are systematically dropped. This emphasis contrasts with nearby work like CURATRON[10], which targets data curation for training rather than post-hoc ranking stability, and differs from unlearning-focused studies like LLM Unlearning Survey[3], which prioritize removing specific knowledge rather than testing ranking resilience.

Claimed Contributions

Method for evaluating robustness of LLM ranking systems to worst-case data dropping

The authors develop a computationally efficient algorithm that assesses whether LLM leaderboard rankings remain stable when a small fraction of preference data is removed. The method extends approximate maximum influence perturbation techniques to identify influential preferences and verify ranking changes without exhaustive combinatorial search.

10 retrieved papers
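The paper's method extends approximate maximum influence perturbation. As a simplified greedy sketch (not the authors' algorithm; all function names and the toy data below are illustrative), one can score each preference by its first-order effect on the score gap between the top two fitted models, drop the k most influential, refit, and check whether the leader changes:

```python
import math

def fit_bt(prefs, n, lr=0.1, iters=2000):
    """Gradient-ascent MLE for Bradley--Terry scores, with s[0] pinned to 0."""
    s = [0.0] * n
    for _ in range(iters):
        g = [0.0] * n
        for w, l in prefs:
            p = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))
            g[w] += 1.0 - p
            g[l] -= 1.0 - p
        s = [si + lr * gi for si, gi in zip(s, g)]
        s = [si - s[0] for si in s]
    return s

def drop_and_check(prefs, n, k):
    """Heuristic robustness check: drop the k preferences whose first-order
    gradient contribution most widens the gap between the top two models,
    refit, and report whether the top rank changes."""
    s = fit_bt(prefs, n)
    order = sorted(range(n), key=lambda i: -s[i])
    top, runner = order[0], order[1]

    def influence(pref):
        w, l = pref
        p = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))
        # This preference's contribution to the gradient of (s_top - s_runner).
        g = {w: 1.0 - p, l: -(1.0 - p)}
        return g.get(top, 0.0) - g.get(runner, 0.0)

    ranked = sorted(range(len(prefs)), key=lambda i: -influence(prefs[i]))
    keep = [prefs[i] for i in range(len(prefs)) if i not in set(ranked[:k])]
    s2 = fit_bt(keep, n)
    new_top = max(range(n), key=lambda i: s2[i])
    return top, new_top, ranked[:k]

# Toy example: model 0 leads 11-10; dropping two of its wins flips the top rank.
prefs = [(0, 1)] * 11 + [(1, 0)] * 10
top, new_top, dropped = drop_and_check(prefs, n=2, k=2)
```

Returning the dropped indices (rather than just a yes/no flag) mirrors the paper's second contribution: the flagged preferences can be inspected directly. The real method must also avoid exhaustive combinatorial search over subsets, which this greedy first-order ranking approximates crudely.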
Identification of specific preferences responsible for ranking flips

The method not only detects non-robustness but also pinpoints the exact preference data points (prompt-response pairs) that drive changes in model rankings, enabling qualitative inspection of these influential evaluations.

3 retrieved papers
Empirical findings on sensitivity of popular LLM evaluation platforms

The authors apply their robustness check to multiple LLM arenas and discover that top model rankings are surprisingly fragile, with extremely small fractions of data (as low as 0.003%) sufficient to alter rankings. They also find that MT-bench rankings are notably more robust, likely due to its expert annotators and carefully designed prompts.

1 retrieved paper
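To make the 0.003% figure concrete, a quick back-of-envelope calculation (the dataset size below is purely hypothetical, not taken from the paper):

```python
def n_dropped(total_preferences, fraction_pct):
    """Number of preferences removed at a given percentage of the dataset."""
    return round(total_preferences * fraction_pct / 100.0)

# Hypothetical illustration: in a dataset of 2,000,000 preferences,
# dropping 0.003% removes only 60 pairwise judgments.
print(n_dropped(2_000_000, 0.003))  # -> 60
```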

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Method for evaluating robustness of LLM ranking systems to worst-case data dropping

No refutable overlap was found among the ten candidate papers examined for this contribution.

Contribution

Identification of specific preferences responsible for ranking flips

No refutable overlap was found among the three candidate papers examined for this contribution.

Contribution

Empirical findings on sensitivity of popular LLM evaluation platforms

No refutable overlap was found in the one candidate paper examined for this contribution.