SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Efficient Evaluation, LLM Evaluation
Abstract:

As large language models (LLMs) continue to scale up, their performance on downstream tasks has improved significantly. However, evaluating their capabilities has become increasingly expensive, as running inference over a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We exploit the representational capacity of an MLP to handle the sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to assess the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's τ of our method across a variety of benchmarks, showcasing its robustness and practicality in real-world scenarios.
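The abstract's headline quality metric, Kendall's τ, measures how well the ranking of models induced by anchor-based estimates agrees with the ranking from full-benchmark scores. A minimal, self-contained sketch of that check, using synthetic scores (the names and numbers below are illustrative placeholders, not the paper's data):

```python
def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# True full-benchmark accuracies for 5 hypothetical models ...
full_scores = [0.71, 0.55, 0.82, 0.60, 0.47]
# ... and scores estimated from a small anchor subset (slightly noisy).
anchor_estimates = [0.69, 0.57, 0.80, 0.58, 0.49]

# Identical induced rankings give tau = 1.0 despite the score noise.
tau = kendall_tau(full_scores, anchor_estimates)
```

A τ near 1 means the cheap estimate preserves the model leaderboard even when individual score estimates are slightly off, which is exactly the property an efficient-evaluation method needs.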

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SparseEval, a method that formulates efficient LLM benchmarking as a sparse optimization problem, using gradient descent to optimize anchor weights and iterative refinement for anchor selection. It resides in the 'Sample-Efficient and Adaptive Evaluation' leaf, which contains six papers total. This leaf sits within the broader 'Efficiency-Focused Evaluation Methods' branch, indicating a moderately populated research direction focused on reducing evaluation costs through intelligent sampling rather than comprehensive test suites.

The taxonomy reveals neighboring work in 'Test-Time Compute Optimization' (two papers) and a sibling branch 'LLM-Based Evaluation Methodologies' (twelve papers across three leaves). The scope note for the paper's leaf explicitly excludes test-time compute scaling and model compression, positioning SparseEval among methods that select representative samples rather than optimize inference itself. Related leaves like 'Task-Specific and Capability-Focused Benchmarks' (seven papers) and 'General-Purpose Multi-Dimensional Benchmarks' (five papers) address what to evaluate, while this work addresses how to evaluate efficiently.

Among thirty candidates examined, none clearly refute the three core contributions: sparse optimization formulation (ten candidates, zero refutable), the Anchor and Candidate Importance Score metrics (ten candidates, zero refutable), and the MLP-based anchor weight predictor (ten candidates, zero refutable). This suggests that within the limited search scope, the specific combination of gradient-based anchor optimization and task-aware refinement scores appears distinct from prior sample-efficient methods, though the search does not cover the entire field exhaustively.

Based on the top-thirty semantic matches and citation expansion, the work appears to occupy a recognizable niche within sample-efficient evaluation. The analysis does not capture potential overlap in broader optimization literature or recent preprints outside the search scope. The taxonomy structure indicates this is an active but not overcrowded area, with the paper's technical approach—MLP-based weight learning and iterative refinement—differentiating it from static subset selection methods among examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: efficient evaluation of large language models. The field has organized itself around several complementary perspectives. Evaluation Frameworks and Benchmarks establish standardized testbeds such as Holistic evaluation of language[2] and Promptbench[3], providing broad coverage of capabilities. Efficiency-Focused Evaluation Methods address the computational burden of assessment through sample-efficient and adaptive strategies, while LLM-Based Evaluation Methodologies explore using models themselves as judges. Domain-Specific Evaluation targets specialized contexts such as scientific reasoning or agent behavior, and Bias and Fairness Evaluation scrutinizes model outputs for harmful patterns. Uncertainty and Confidence Estimation examines calibration, Model Efficiency Techniques optimize inference and training costs, and Comparative Model Analysis systematically contrasts different architectures. Together, these branches reflect a tension between comprehensive assessment and resource constraints, with many studies seeking to balance coverage against evaluation cost.

A particularly active line of work focuses on reducing the sample complexity of evaluation without sacrificing reliability. SparseEval[0] sits squarely within this cluster, proposing adaptive sampling strategies that intelligently select test instances to maximize information gain. It shares methodological kinship with Efficient benchmarking of language[5], which similarly aims to shrink evaluation sets, and Sample-efficient human evaluation of[6], which applies efficiency principles to human annotation. Nearby efforts such as Data Efficient Evaluation of[29] and Effieval[44] explore related trade-offs between test-set size and measurement precision. The central challenge across these works is determining when a smaller, carefully chosen sample can yield statistically robust conclusions about model performance, a question that becomes increasingly urgent as models scale and evaluation budgets tighten. SparseEval[0] distinguishes itself by emphasizing dynamic instance selection, in contrast to the static subset approaches seen in some neighboring studies.

Claimed Contributions

Formulation of efficient LLM evaluation as a sparse optimization problem

The authors formulate the task of efficient benchmarking as a sparse optimization problem over a model-item performance matrix. They introduce a framework that uses gradient descent to optimize anchor weights and an iterative refinement strategy to select representative items (anchors) for evaluation.

10 retrieved papers
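The sparse-optimization view can be sketched in a few lines of numpy. This is our own simplified reading, not the paper's exact objective: learn an L1-penalized weight vector over items by proximal gradient descent (ISTA), so that the weighted item scores reproduce each model's full-benchmark mean; the items that keep nonzero weight play the role of anchors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 20, 50
Y = rng.random((n_models, n_items))        # model-item performance matrix
target = Y.mean(axis=1)                    # full-benchmark score per model

w = np.zeros(n_items)                      # item weights, start from zero
lr, lam = 0.01, 0.01                       # step size and L1 strength
for _ in range(3000):
    resid = Y @ w - target                 # per-model estimation error
    w -= lr * (Y.T @ resid / n_models)     # gradient step on 0.5 * MSE
    # proximal step: soft-thresholding induced by the L1 penalty
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

anchors = np.flatnonzero(w)                # items with surviving weight
error = float(np.mean((Y @ w - target) ** 2))
```

The L1 penalty is what makes the formulation "sparse": it drives most item weights exactly to zero, so evaluation only needs the few surviving anchor items. The paper replaces this linear aggregation with a learned MLP, but the selection principle is the same.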
Anchor Importance Score and Candidate Importance Score metrics

The authors introduce two novel metrics: Anchor Importance Score (AIS) based on gradient norms to assess anchor contribution, and Candidate Importance Score (CIS) based on dot products with residuals to identify informative candidates. These metrics enable task-aware anchor refinement.

10 retrieved papers
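The two scores can be illustrated under our own reading of this description (the exact formulas are the paper's; the uniform starting weights and variable names here are demo assumptions): AIS is the gradient-norm term attached to each current anchor's weight, and CIS is the alignment between a candidate item's score column and the current residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_items = 12, 30
Y = rng.random((n_models, n_items))        # model-item performance matrix
target = Y.mean(axis=1)                    # full-benchmark score per model

anchors = [0, 5, 9]                        # current anchor set
candidates = [i for i in range(n_items) if i not in anchors]
w = np.full(len(anchors), 1.0 / len(anchors))   # current (untuned) weights

resid = Y[:, anchors] @ w - target         # current estimation error
# AIS: |gradient of 0.5 * MSE w.r.t. each anchor's weight|; a small value
# suggests the anchor barely affects the fit and may be swapped out.
ais = np.abs(Y[:, anchors].T @ resid) / n_models
# CIS: |dot product of a candidate's score column with the residual|;
# a large value marks an item aligned with the remaining error.
cis = np.abs(Y[:, candidates].T @ resid)

weakest_anchor = anchors[int(np.argmin(ais))]
best_candidate = candidates[int(np.argmax(cis))]
```

Refinement would then swap `weakest_anchor` for `best_candidate` and re-fit, iterating until the scores stabilize; that loop is what makes the selection task-aware rather than a one-shot clustering.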
MLP-based anchor weight predictor with end-to-end optimization

The authors propose using a multi-layer perceptron (MLP) as an aggregation function to approximate anchor weights through end-to-end gradient-based optimization, replacing traditional clustering-based weight assignment methods.

10 retrieved papers
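A minimal numpy sketch of an MLP aggregator trained end to end (an assumption about the setup, not the paper's architecture): the network maps a model's scores on the anchor items to its overall score, with synthetic ground-truth anchor weights standing in for real benchmark data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_anchors, hidden = 64, 8, 8
X = rng.random((n_models, n_anchors))            # anchor scores per model
true_w = rng.random((n_anchors, 1))              # synthetic ground-truth weights
t = X @ true_w + 0.01 * rng.normal(size=(n_models, 1))  # overall scores

# One-hidden-layer MLP with tanh, trained by full-batch gradient descent.
W1 = rng.normal(0.0, 0.1, (n_anchors, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.1, (hidden, 1))
b2 = np.zeros(1)

lr = 0.02
for _ in range(20000):
    h = np.tanh(X @ W1 + b1)                     # forward pass
    pred = h @ W2 + b2
    err = pred - t                               # d(0.5 * MSE)/d(pred)
    dh = (err @ W2.T) * (1.0 - h ** 2)           # backprop through tanh
    W2 -= lr * (h.T @ err / n_models)
    b2 -= lr * err.mean(axis=0)
    W1 -= lr * (X.T @ dh / n_models)
    b1 -= lr * dh.mean(axis=0)

mse = float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - t) ** 2))
```

The design point this illustrates: because the aggregator is differentiable, anchor weighting is learned jointly with the fit instead of being fixed by cluster sizes, which is the contrast with clustering-based weight assignment the contribution claims.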

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Formulation of efficient LLM evaluation as a sparse optimization problem

The authors formulate the task of efficient benchmarking as a sparse optimization problem over a model-item performance matrix. They introduce a framework that uses gradient descent to optimize anchor weights and an iterative refinement strategy to select representative items (anchors) for evaluation.

Contribution

Anchor Importance Score and Candidate Importance Score metrics

The authors introduce two novel metrics: Anchor Importance Score (AIS) based on gradient norms to assess anchor contribution, and Candidate Importance Score (CIS) based on dot products with residuals to identify informative candidates. These metrics enable task-aware anchor refinement.

Contribution

MLP-based anchor weight predictor with end-to-end optimization

The authors propose using a multi-layer perceptron (MLP) as an aggregation function to approximate anchor weights through end-to-end gradient-based optimization, replacing traditional clustering-based weight assignment methods.