DISCO: Diversifying Sample Condensation for Accelerating Model Evaluation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: efficient evaluation, large language models, anchor point, fingerprint
Abstract:

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. To address the growing cost of standard evaluation, new methods focused on efficient evaluation have started to appear. The typical approach follows two steps: first, select an anchor subset of data; second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is selecting samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest inter-model disagreement. Because it uses greedy, sample-wise statistics rather than global clustering, the approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. Empirically, DISCO outperforms prior methods, achieving state-of-the-art performance prediction across MMLU, Hellaswag, Winogrande, and ARC.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DISCO, a greedy sample selection method that identifies test samples maximizing inter-model disagreement to predict full benchmark performance with reduced evaluation cost. Within the taxonomy, DISCO resides in the 'Greedy Sample Selection for Performance Prediction' leaf under 'Benchmark Evaluation Acceleration'. Notably, this leaf contains no sibling papers in the current taxonomy, suggesting a relatively sparse research direction. The broader 'Benchmark Evaluation Acceleration' category includes only three leaves, indicating that greedy disagreement-based approaches represent a less crowded niche compared to training-focused sample selection methods.

The taxonomy reveals that DISCO's closest neighbors lie in adjacent leaves: 'Capability Coverage Maximization' (containing EffiEval) and 'Lifelong Benchmark Evaluation with Model Reuse'. While EffiEval emphasizes clustering-based coverage of task dimensions, DISCO adopts a simpler greedy strategy targeting model response diversity. The broader 'Efficient Evaluation Through Sample Reduction' branch contrasts with 'Strategic Sample Selection for Training Data Efficiency', which dominates the taxonomy with active learning and data curation methods. DISCO's focus on test-time efficiency without model modification distinguishes it from adaptive evaluation techniques like test-time adaptation, which belong to a separate subtopic.

Among the three contributions analyzed, the literature search examined 26 candidates total. The core DISCO method (9 candidates examined, 0 refutable) and the model signature framework (7 candidates, 0 refutable) appear relatively novel within the limited search scope. However, the information-theoretic justification for disagreement-based selection (10 candidates examined, 1 refutable) shows overlap with prior work. The analysis indicates that while the algorithmic approach may be distinctive, the theoretical grounding has some precedent among the examined candidates. The absence of sibling papers in DISCO's taxonomy leaf suggests limited direct competition, though the small search scale (26 papers) leaves open the possibility of unexamined related work.

Based on the top-26 semantic matches and taxonomy structure, DISCO appears to occupy a relatively underexplored niche within benchmark evaluation acceleration. The greedy disagreement-based approach contrasts with clustering-heavy methods in neighboring leaves, and the lack of sibling papers suggests limited direct prior work in this specific formulation. However, the analysis covers a narrow slice of the literature, and the refutable theoretical contribution indicates that some conceptual elements have precedent. A more exhaustive search might reveal additional related efforts in efficient evaluation or active testing domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: efficient model performance evaluation through sample selection. The field addresses how to reduce computational and annotation costs when assessing model quality by strategically choosing which samples to evaluate or train on. The taxonomy reveals six main branches. Strategic Sample Selection for Training Data Efficiency focuses on choosing informative subsets during training—ranging from active learning methods like PALM Active Learning[26] to curriculum-based approaches. Efficient Evaluation Through Sample Reduction targets accelerating benchmark evaluation itself, employing greedy selection or coverage-based strategies such as EffiEval Capability Coverage[2] to predict performance with fewer test samples. Model Selection and Output Ranking deals with choosing among candidate models or outputs, exemplified by Best-of-N Selection[20]. Hyperparameter and Configuration Optimization explores sample-efficient tuning, including bandit-inspired methods like Hyperband Prompt Selection[22]. Specialized Selection Techniques encompasses domain-specific sampling (e.g., Landslide Sample Sampling[14], Battery Testing Optimization[3]), while Auxiliary Methods and Applications covers supporting techniques like influence functions and variance estimation.

Several contrasting themes emerge across these branches. Training-focused selection often emphasizes diversity and informativeness to improve learning efficiency, whereas evaluation-focused selection prioritizes representativeness and correlation with full-benchmark performance. A handful of works, including Efficient Test-Time Adaptation[1] and Continual Test-Time Adaptation[5], bridge training and evaluation by adapting models during inference with minimal samples. DISCO Sample Condensation[0] sits within the Efficient Evaluation Through Sample Reduction branch, specifically under greedy sample selection for performance prediction. Its emphasis on condensing evaluation sets to predict model rankings aligns closely with EffiEval Capability Coverage[2], which also seeks capability-aware subsets, though DISCO's greedy approach contrasts with coverage-based heuristics. Compared to Efficient Lifelong Evaluation[4], which addresses continual benchmarking over time, DISCO focuses on static benchmark acceleration. This positioning highlights an ongoing tension: balancing sample reduction with faithful performance estimation across diverse model families and tasks.

Claimed Contributions

DISCO method for efficient model evaluation via sample selection

The authors propose DISCO, a method that selects evaluation samples based on inter-model disagreement rather than clustering-based representativeness. This greedy, sample-wise approach simplifies subset selection by focusing on samples that maximize diversity in model responses.

9 retrieved papers
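As a sketch, the greedy rule described above can be implemented with per-sample statistics alone. The pairwise-disagreement score below (the fraction of model pairs whose hard predictions differ on a sample) is an illustrative stand-in for the paper's disagreement measure; function names and the scoring choice are assumptions, not the authors' implementation:

```python
import numpy as np

def pairwise_disagreement(preds: np.ndarray) -> np.ndarray:
    """Per-sample disagreement: fraction of model pairs whose hard
    predictions differ. preds has shape (n_models, n_samples)."""
    m = preds.shape[0]
    # Boolean tensor (m, m, n_samples): do models i and j differ on sample s?
    diff = preds[:, None, :] != preds[None, :, :]
    # Average over the m*(m-1)/2 unordered model pairs (upper triangle).
    iu = np.triu_indices(m, k=1)
    return diff[iu].mean(axis=0)

def select_anchors(preds: np.ndarray, k: int) -> np.ndarray:
    """Greedy, sample-wise selection: take the top-k samples by
    disagreement, with no clustering or global optimization."""
    scores = pairwise_disagreement(preds)
    return np.argsort(-scores)[:k]
```

For example, a sample on which all reference models agree scores 0 and is never selected, while a sample on which every model answers differently scores 1 and is selected first.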
Information-theoretic justification for disagreement-based selection

The authors establish that inter-model disagreement, measured via Jensen-Shannon Divergence or Predictive Diversity Score, is information-theoretically optimal for selecting samples that best differentiate and rank models when estimating benchmark performance.

10 retrieved papers
Can Refute
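The generalized Jensen-Shannon Divergence named in this contribution can be computed per sample from the models' predictive distributions. The sketch below uses the standard formulation (entropy of the uniform mixture minus the mean entropy of the individual distributions); it is not necessarily the paper's exact scoring:

```python
import numpy as np

def entropy(p: np.ndarray) -> np.ndarray:
    """Shannon entropy in bits along the last (class) axis."""
    return -np.sum(p * np.log2(np.clip(p, 1e-12, 1.0)), axis=-1)

def jensen_shannon_divergence(p_models: np.ndarray) -> np.ndarray:
    """Generalized JSD with uniform weights over M models.

    p_models: shape (M, n_samples, n_classes), each row a predictive
    distribution. Returns one disagreement score per sample:
    H(mean_m p_m) - mean_m H(p_m). Zero when all models output the same
    distribution; maximal (log2 M) when the M predictions are disjoint
    one-hot vectors.
    """
    mixture = p_models.mean(axis=0)
    return entropy(mixture) - entropy(p_models).mean(axis=0)
```

Ranking samples by this score and keeping the top-k reproduces the greedy selection rule described above.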
Model signature-based performance prediction framework

The authors introduce a direct prediction approach using model signatures (concatenated outputs on selected samples) as input to simple metamodels, bypassing the complexity of estimating hidden model parameters required by prior methods like IRT-based approaches.

7 retrieved papers
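A minimal sketch of the signature-plus-metamodel idea, assuming binary correctness vectors as signatures and a plain least-squares linear metamodel (both are illustrative choices; the paper's metamodel may differ):

```python
import numpy as np

def fit_metamodel(signatures: np.ndarray,
                  full_accuracies: np.ndarray) -> np.ndarray:
    """Fit a linear metamodel mapping a model's signature (here, its 0/1
    correctness vector on the k selected samples) to its full-benchmark
    accuracy, via ordinary least squares with a bias term."""
    X = np.hstack([signatures, np.ones((len(signatures), 1))])
    weights, *_ = np.linalg.lstsq(X, full_accuracies, rcond=None)
    return weights

def predict_accuracy(weights: np.ndarray, signature: np.ndarray) -> float:
    """Predict full-benchmark accuracy for a new model's signature."""
    return float(np.append(signature, 1.0) @ weights)
```

Evaluating a new model then only requires inference on the k selected samples to form its signature; no hidden model parameters (as in IRT-based approaches) need to be estimated.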

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

1. DISCO method for efficient model evaluation via sample selection: a method that selects evaluation samples based on inter-model disagreement rather than clustering-based representativeness. This greedy, sample-wise approach simplifies subset selection by focusing on samples that maximize diversity in model responses.

2. Information-theoretic justification for disagreement-based selection: inter-model disagreement, measured via Jensen-Shannon Divergence or Predictive Diversity Score, is information-theoretically optimal for selecting samples that best differentiate and rank models when estimating benchmark performance.

3. Model signature-based performance prediction framework: a direct prediction approach using model signatures (concatenated outputs on selected samples) as input to simple metamodels, bypassing the estimation of hidden model parameters required by prior methods like IRT-based approaches.
