SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Efficient Evaluation, LLM Evaluation
Abstract:

As large language models (LLMs) continue to scale up, their performance on downstream tasks has improved significantly. However, evaluating their capabilities has become increasingly expensive, as running inference over a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We exploit the representational capacity of an MLP to handle the sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to assess the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's τ of our method across a variety of benchmarks, showcasing its robustness and practicality in real-world scenarios.
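The abstract's headline quality metric, Kendall's τ, measures how well the ranking of models induced by anchor-based estimates agrees with the ranking from full-benchmark scores. A minimal, self-contained sketch of that check, using synthetic scores (the names and numbers below are illustrative placeholders, not the paper's data):

```python
def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# True full-benchmark accuracies for 5 hypothetical models ...
full_scores = [0.71, 0.55, 0.82, 0.60, 0.47]
# ... and scores estimated from a small anchor subset (slightly noisy).
anchor_estimates = [0.69, 0.57, 0.80, 0.58, 0.49]

# Identical induced rankings give tau = 1.0 despite the score noise.
tau = kendall_tau(full_scores, anchor_estimates)
```

A τ near 1 means the cheap estimate preserves the model leaderboard even when individual score estimates are slightly off, which is exactly the property an efficient-evaluation method needs.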

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SparseEval, a method that formulates efficient LLM benchmarking as a sparse optimization problem, using gradient descent to optimize anchor weights and iterative refinement for anchor selection. It resides in the 'Sample-Efficient and Adaptive Evaluation' leaf, which contains six papers total. This leaf sits within the broader 'Efficiency-Focused Evaluation Methods' branch, indicating a moderately populated research direction focused on reducing evaluation costs through intelligent sampling rather than comprehensive test suites.

The taxonomy reveals neighboring work in 'Test-Time Compute Optimization' (two papers) and a sibling branch 'LLM-Based Evaluation Methodologies' (twelve papers across three leaves). The scope note for the paper's leaf explicitly excludes test-time compute scaling and model compression, positioning SparseEval among methods that select representative samples rather than optimize inference itself. Related leaves like 'Task-Specific and Capability-Focused Benchmarks' (seven papers) and 'General-Purpose Multi-Dimensional Benchmarks' (five papers) address what to evaluate, while this work addresses how to evaluate efficiently.

Among thirty candidates examined, none clearly refute the three core contributions: sparse optimization formulation (ten candidates, zero refutable), the Anchor and Candidate Importance Score metrics (ten candidates, zero refutable), and the MLP-based anchor weight predictor (ten candidates, zero refutable). This suggests that within the limited search scope, the specific combination of gradient-based anchor optimization and task-aware refinement scores appears distinct from prior sample-efficient methods, though the search does not cover the entire field exhaustively.

Based on the top-thirty semantic matches and citation expansion, the work appears to occupy a recognizable niche within sample-efficient evaluation. The analysis does not capture potential overlap in broader optimization literature or recent preprints outside the search scope. The taxonomy structure indicates this is an active but not overcrowded area, with the paper's technical approach—MLP-based weight learning and iterative refinement—differentiating it from static subset selection methods among examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: efficient evaluation of large language models. The field has organized itself around several complementary perspectives. Evaluation Frameworks and Benchmarks establish standardized testbeds such as Holistic evaluation of language[2] and Promptbench[3], providing broad coverage of capabilities. Efficiency-Focused Evaluation Methods address the computational burden of assessment through sample-efficient and adaptive strategies, while LLM-Based Evaluation Methodologies explore using models themselves as judges. Domain-Specific Evaluation targets specialized contexts such as scientific reasoning or agent behavior, and Bias and Fairness Evaluation scrutinizes model outputs for harmful patterns. Uncertainty and Confidence Estimation examines calibration, Model Efficiency Techniques optimize inference and training costs, and Comparative Model Analysis systematically contrasts different architectures. Together, these branches reflect a tension between comprehensive assessment and resource constraints, with many studies seeking to balance coverage against evaluation cost.

A particularly active line of work focuses on reducing the sample complexity of evaluation without sacrificing reliability. SparseEval[0] sits squarely within this cluster, proposing adaptive sampling strategies that intelligently select test instances to maximize information gain. It shares methodological kinship with Efficient benchmarking of language[5], which similarly aims to shrink evaluation sets, and Sample-efficient human evaluation of[6], which applies efficiency principles to human annotation. Nearby efforts such as Data Efficient Evaluation of[29] and Effieval[44] explore related trade-offs between test-set size and measurement precision. The central challenge across these works is determining when a smaller, carefully chosen sample can yield statistically robust conclusions about model performance, a question that becomes increasingly urgent as models scale and evaluation budgets tighten. SparseEval[0] distinguishes itself by emphasizing dynamic instance selection, in contrast to the static subset approaches seen in some neighboring studies.

Claimed Contributions

Formulation of efficient LLM evaluation as a sparse optimization problem

The authors formulate the task of efficient benchmarking as a sparse optimization problem over a model-item performance matrix. They introduce a framework that uses gradient descent to optimize anchor weights and an iterative refinement strategy to select representative items (anchors) for evaluation.

10 retrieved papers
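The sparse-optimization view can be sketched in a few lines of numpy. This is our own simplified reading, not the paper's exact objective: learn an L1-penalized weight vector over items by proximal gradient descent (ISTA), so that the weighted item scores reproduce each model's full-benchmark mean; the items that keep nonzero weight play the role of anchors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 20, 50
Y = rng.random((n_models, n_items))        # model-item performance matrix
target = Y.mean(axis=1)                    # full-benchmark score per model

w = np.zeros(n_items)                      # item weights, start from zero
lr, lam = 0.01, 0.01                       # step size and L1 strength
for _ in range(3000):
    resid = Y @ w - target                 # per-model estimation error
    w -= lr * (Y.T @ resid / n_models)     # gradient step on 0.5 * MSE
    # proximal step: soft-thresholding induced by the L1 penalty
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

anchors = np.flatnonzero(w)                # items with surviving weight
error = float(np.mean((Y @ w - target) ** 2))
```

The L1 penalty is what makes the formulation "sparse": it drives most item weights exactly to zero, so evaluation only needs the few surviving anchor items. The paper replaces this linear aggregation with a learned MLP, but the selection principle is the same.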
Anchor Importance Score and Candidate Importance Score metrics

The authors introduce two novel metrics: Anchor Importance Score (AIS) based on gradient norms to assess anchor contribution, and Candidate Importance Score (CIS) based on dot products with residuals to identify informative candidates. These metrics enable task-aware anchor refinement.

10 retrieved papers
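The two scores can be illustrated under our own reading of this description (the exact formulas are the paper's; the uniform starting weights and variable names here are demo assumptions): AIS is the gradient-norm term attached to each current anchor's weight, and CIS is the alignment between a candidate item's score column and the current residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_items = 12, 30
Y = rng.random((n_models, n_items))        # model-item performance matrix
target = Y.mean(axis=1)                    # full-benchmark score per model

anchors = [0, 5, 9]                        # current anchor set
candidates = [i for i in range(n_items) if i not in anchors]
w = np.full(len(anchors), 1.0 / len(anchors))   # current (untuned) weights

resid = Y[:, anchors] @ w - target         # current estimation error
# AIS: |gradient of 0.5 * MSE w.r.t. each anchor's weight|; a small value
# suggests the anchor barely affects the fit and may be swapped out.
ais = np.abs(Y[:, anchors].T @ resid) / n_models
# CIS: |dot product of a candidate's score column with the residual|;
# a large value marks an item aligned with the remaining error.
cis = np.abs(Y[:, candidates].T @ resid)

weakest_anchor = anchors[int(np.argmin(ais))]
best_candidate = candidates[int(np.argmax(cis))]
```

Refinement would then swap `weakest_anchor` for `best_candidate` and re-fit, iterating until the scores stabilize; that loop is what makes the selection task-aware rather than a one-shot clustering.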
MLP-based anchor weight predictor with end-to-end optimization

The authors propose using a multi-layer perceptron (MLP) as an aggregation function to approximate anchor weights through end-to-end gradient-based optimization, replacing traditional clustering-based weight assignment methods.

10 retrieved papers
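A minimal numpy sketch of an MLP aggregator trained end to end (an assumption about the setup, not the paper's architecture): the network maps a model's scores on the anchor items to its overall score, with synthetic ground-truth anchor weights standing in for real benchmark data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_anchors, hidden = 64, 8, 8
X = rng.random((n_models, n_anchors))            # anchor scores per model
true_w = rng.random((n_anchors, 1))              # synthetic ground-truth weights
t = X @ true_w + 0.01 * rng.normal(size=(n_models, 1))  # overall scores

# One-hidden-layer MLP with tanh, trained by full-batch gradient descent.
W1 = rng.normal(0.0, 0.1, (n_anchors, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.1, (hidden, 1))
b2 = np.zeros(1)

lr = 0.02
for _ in range(20000):
    h = np.tanh(X @ W1 + b1)                     # forward pass
    pred = h @ W2 + b2
    err = pred - t                               # d(0.5 * MSE)/d(pred)
    dh = (err @ W2.T) * (1.0 - h ** 2)           # backprop through tanh
    W2 -= lr * (h.T @ err / n_models)
    b2 -= lr * err.mean(axis=0)
    W1 -= lr * (X.T @ dh / n_models)
    b1 -= lr * dh.mean(axis=0)

mse = float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - t) ** 2))
```

The design point this illustrates: because the aggregator is differentiable, anchor weighting is learned jointly with the fit instead of being fixed by cluster sizes, which is the contrast with clustering-based weight assignment the contribution claims.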

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Formulation of efficient LLM evaluation as a sparse optimization problem

The authors formulate the task of efficient benchmarking as a sparse optimization problem over a model-item performance matrix. They introduce a framework that uses gradient descent to optimize anchor weights and an iterative refinement strategy to select representative items (anchors) for evaluation.

Contribution

Anchor Importance Score and Candidate Importance Score metrics

The authors introduce two novel metrics: Anchor Importance Score (AIS) based on gradient norms to assess anchor contribution, and Candidate Importance Score (CIS) based on dot products with residuals to identify informative candidates. These metrics enable task-aware anchor refinement.

Contribution

MLP-based anchor weight predictor with end-to-end optimization

The authors propose using a multi-layer perceptron (MLP) as an aggregation function to approximate anchor weights through end-to-end gradient-based optimization, replacing traditional clustering-based weight assignment methods.