DISCO: Diversifying Sample Condensation for Accelerating Model Evaluation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: efficient evaluation, large language models, anchor point, fingerprint
Abstract:

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. To address the growing cost of standard evaluation, new methods focused on efficient evaluation have started to appear. The typical approach follows two steps: first, select an anchor subset of data; second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is selecting samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest inter-model disagreement. Because it uses greedy, sample-wise statistics rather than global clustering, the approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. Empirically, DISCO outperforms prior methods, achieving state-of-the-art performance prediction across MMLU, Hellaswag, Winogrande, and ARC.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DISCO, a greedy sample selection method that identifies test samples maximizing inter-model disagreement to predict full benchmark performance with reduced evaluation cost. Within the taxonomy, DISCO resides in the 'Greedy Sample Selection for Performance Prediction' leaf under 'Benchmark Evaluation Acceleration'. Notably, this leaf contains no sibling papers in the current taxonomy, suggesting a relatively sparse research direction. The broader 'Benchmark Evaluation Acceleration' category includes only three leaves, indicating that greedy disagreement-based approaches represent a less crowded niche compared to training-focused sample selection methods.

The taxonomy reveals that DISCO's closest neighbors lie in adjacent leaves: 'Capability Coverage Maximization' (containing EffiEval) and 'Lifelong Benchmark Evaluation with Model Reuse'. While EffiEval emphasizes clustering-based coverage of task dimensions, DISCO adopts a simpler greedy strategy targeting model response diversity. The broader 'Efficient Evaluation Through Sample Reduction' branch contrasts with 'Strategic Sample Selection for Training Data Efficiency', which dominates the taxonomy with active learning and data curation methods. DISCO's focus on test-time efficiency without model modification distinguishes it from adaptive evaluation techniques like test-time adaptation, which belong to a separate subtopic.

Among the three contributions analyzed, the literature search examined 26 candidates total. The core DISCO method (9 candidates examined, 0 refutable) and the model signature framework (7 candidates, 0 refutable) appear relatively novel within the limited search scope. However, the information-theoretic justification for disagreement-based selection (10 candidates examined, 1 refutable) shows overlap with prior work. The analysis indicates that while the algorithmic approach may be distinctive, the theoretical grounding has some precedent among the examined candidates. The absence of sibling papers in DISCO's taxonomy leaf suggests limited direct competition, though the small search scale (26 papers) leaves open the possibility of unexamined related work.

Based on the top-26 semantic matches and taxonomy structure, DISCO appears to occupy a relatively underexplored niche within benchmark evaluation acceleration. The greedy disagreement-based approach contrasts with clustering-heavy methods in neighboring leaves, and the lack of sibling papers suggests limited direct prior work in this specific formulation. However, the analysis covers a narrow slice of the literature, and the refutable theoretical contribution indicates that some conceptual elements have precedent. A more exhaustive search might reveal additional related efforts in efficient evaluation or active testing domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: efficient model performance evaluation through sample selection. The field addresses how to reduce computational and annotation costs when assessing model quality by strategically choosing which samples to evaluate or train on. The taxonomy reveals six main branches. Strategic Sample Selection for Training Data Efficiency focuses on choosing informative subsets during training—ranging from active learning methods like PALM Active Learning[26] to curriculum-based approaches. Efficient Evaluation Through Sample Reduction targets accelerating benchmark evaluation itself, employing greedy selection or coverage-based strategies such as EffiEval Capability Coverage[2] to predict performance with fewer test samples. Model Selection and Output Ranking deals with choosing among candidate models or outputs, exemplified by Best-of-N Selection[20]. Hyperparameter and Configuration Optimization explores sample-efficient tuning, including bandit-inspired methods like Hyperband Prompt Selection[22]. Specialized Selection Techniques encompasses domain-specific sampling (e.g., Landslide Sample Sampling[14], Battery Testing Optimization[3]), while Auxiliary Methods and Applications covers supporting techniques like influence functions and variance estimation.

Several contrasting themes emerge across these branches. Training-focused selection often emphasizes diversity and informativeness to improve learning efficiency, whereas evaluation-focused selection prioritizes representativeness and correlation with full-benchmark performance. A handful of works, including Efficient Test-Time Adaptation[1] and Continual Test-Time Adaptation[5], bridge training and evaluation by adapting models during inference with minimal samples. DISCO Sample Condensation[0] sits within the Efficient Evaluation Through Sample Reduction branch, specifically under greedy sample selection for performance prediction. Its emphasis on condensing evaluation sets to predict model rankings aligns closely with EffiEval Capability Coverage[2], which also seeks capability-aware subsets, though DISCO's greedy approach contrasts with coverage-based heuristics. Compared to Efficient Lifelong Evaluation[4], which addresses continual benchmarking over time, DISCO focuses on static benchmark acceleration. This positioning highlights an ongoing tension: balancing sample reduction with faithful performance estimation across diverse model families and tasks.

Claimed Contributions

DISCO method for efficient model evaluation via sample selection

The authors propose DISCO, a method that selects evaluation samples based on inter-model disagreement rather than clustering-based representativeness. This greedy, sample-wise approach simplifies subset selection by focusing on samples that maximize diversity in model responses.

9 retrieved papers
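As a sketch, the greedy rule described above can be implemented with per-sample statistics alone. The pairwise-disagreement score below (the fraction of model pairs whose hard predictions differ on a sample) is an illustrative stand-in for the paper's disagreement measure; function names and the scoring choice are assumptions, not the authors' implementation:

```python
import numpy as np

def pairwise_disagreement(preds: np.ndarray) -> np.ndarray:
    """Per-sample disagreement: fraction of model pairs whose hard
    predictions differ. preds has shape (n_models, n_samples)."""
    m = preds.shape[0]
    # Boolean tensor (m, m, n_samples): do models i and j differ on sample s?
    diff = preds[:, None, :] != preds[None, :, :]
    # Average over the m*(m-1)/2 unordered model pairs (upper triangle).
    iu = np.triu_indices(m, k=1)
    return diff[iu].mean(axis=0)

def select_anchors(preds: np.ndarray, k: int) -> np.ndarray:
    """Greedy, sample-wise selection: take the top-k samples by
    disagreement, with no clustering or global optimization."""
    scores = pairwise_disagreement(preds)
    return np.argsort(-scores)[:k]
```

For example, a sample on which all reference models agree scores 0 and is never selected, while a sample on which every model answers differently scores 1 and is selected first.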
Information-theoretic justification for disagreement-based selection

The authors establish that inter-model disagreement, measured via Jensen-Shannon Divergence or Predictive Diversity Score, is information-theoretically optimal for selecting samples that best differentiate and rank models when estimating benchmark performance.

10 retrieved papers
Can Refute
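The generalized Jensen-Shannon Divergence named in this contribution can be computed per sample from the models' predictive distributions. The sketch below uses the standard formulation (entropy of the uniform mixture minus the mean entropy of the individual distributions); it is not necessarily the paper's exact scoring:

```python
import numpy as np

def entropy(p: np.ndarray) -> np.ndarray:
    """Shannon entropy in bits along the last (class) axis."""
    return -np.sum(p * np.log2(np.clip(p, 1e-12, 1.0)), axis=-1)

def jensen_shannon_divergence(p_models: np.ndarray) -> np.ndarray:
    """Generalized JSD with uniform weights over M models.

    p_models: shape (M, n_samples, n_classes), each row a predictive
    distribution. Returns one disagreement score per sample:
    H(mean_m p_m) - mean_m H(p_m). Zero when all models output the same
    distribution; maximal (log2 M) when the M predictions are disjoint
    one-hot vectors.
    """
    mixture = p_models.mean(axis=0)
    return entropy(mixture) - entropy(p_models).mean(axis=0)
```

Ranking samples by this score and keeping the top-k reproduces the greedy selection rule described above.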
Model signature-based performance prediction framework

The authors introduce a direct prediction approach using model signatures (concatenated outputs on selected samples) as input to simple metamodels, bypassing the complexity of estimating hidden model parameters required by prior methods like IRT-based approaches.

7 retrieved papers
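A minimal sketch of the signature-plus-metamodel idea, assuming binary correctness vectors as signatures and a plain least-squares linear metamodel (both are illustrative choices; the paper's metamodel may differ):

```python
import numpy as np

def fit_metamodel(signatures: np.ndarray,
                  full_accuracies: np.ndarray) -> np.ndarray:
    """Fit a linear metamodel mapping a model's signature (here, its 0/1
    correctness vector on the k selected samples) to its full-benchmark
    accuracy, via ordinary least squares with a bias term."""
    X = np.hstack([signatures, np.ones((len(signatures), 1))])
    weights, *_ = np.linalg.lstsq(X, full_accuracies, rcond=None)
    return weights

def predict_accuracy(weights: np.ndarray, signature: np.ndarray) -> float:
    """Predict full-benchmark accuracy for a new model's signature."""
    return float(np.append(signature, 1.0) @ weights)
```

Evaluating a new model then only requires inference on the k selected samples to form its signature; no hidden model parameters (as in IRT-based approaches) need to be estimated.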

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

1. DISCO method for efficient model evaluation via sample selection: a method that selects evaluation samples based on inter-model disagreement rather than clustering-based representativeness. This greedy, sample-wise approach simplifies subset selection by focusing on samples that maximize diversity in model responses.

2. Information-theoretic justification for disagreement-based selection: inter-model disagreement, measured via Jensen-Shannon Divergence or Predictive Diversity Score, is information-theoretically optimal for selecting samples that best differentiate and rank models when estimating benchmark performance.

3. Model signature-based performance prediction framework: a direct prediction approach using model signatures (concatenated outputs on selected samples) as input to simple metamodels, bypassing the estimation of hidden model parameters required by prior methods like IRT-based approaches.
