Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Layer-wise Pruning, Cooperative Game Theory, Shapley Value Approximation
Abstract:

While large language models (LLMs) demonstrate impressive performance across various tasks, their deployment in real-world scenarios remains constrained by high computational demands. Layer-wise pruning, a commonly employed strategy for reducing inference costs, can partially address this challenge. However, existing approaches generally depend on static heuristic rules and fail to account for interdependencies among layers, thereby limiting the effectiveness of the pruning process. To this end, this paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game in which each layer acts as a player and model performance serves as the utility. Because computing exact Shapley values is computationally infeasible for LLMs, we propose a lightweight surrogate network to estimate layer-wise marginal contributions; this network predicts LLM performance for arbitrary layer combinations at low computational cost. Additionally, we employ stratified Monte Carlo mask sampling to further reduce the cost of Shapley value estimation. This approach captures inter-layer dependencies and dynamically identifies critical layers for pruning. Extensive experiments demonstrate the consistent superiority of our method in terms of perplexity and zero-shot accuracy, achieving more efficient and effective layer-wise pruning for large language models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a game-theoretic framework for layer pruning in large language models, using Shapley values to quantify layer contributions and guide removal decisions. According to the taxonomy, this work occupies the 'Game-Theoretic Contribution Estimation' leaf under 'Sparsity Allocation and Layer-wise Importance Estimation'. Notably, this leaf contains only one paper—the original submission itself—indicating that game-theoretic approaches to layer pruning represent a sparse, relatively unexplored direction within the field. The taxonomy shows 50 papers across the entire landscape, with most clustering in reconstruction-based optimization or heuristic allocation methods.

The taxonomy reveals that neighboring leaves focus on heuristic metrics (reconstruction error, activation statistics, weight norms) and optimization-based allocation (gradient-based or search procedures). These sibling categories contain multiple papers each, suggesting that the field has primarily relied on simpler scoring functions or learned allocation parameters. The game-theoretic leaf's isolation suggests that cooperative game formulations for layer importance remain underexplored. The taxonomy's scope note explicitly distinguishes game-theoretic methods from both heuristic and optimization-based approaches, positioning this work as a conceptually distinct alternative to mainstream allocation strategies.

Among 28 candidates examined, the analysis found that the core game-theoretic framework contribution (Contribution A) faces potential overlap: 3 of 10 examined candidates appear refutable. The surrogate-assisted Shapley estimation (Contribution B) shows no clear refutation among 8 candidates examined, suggesting greater novelty in the computational approximation strategy. The scalability claim (Contribution C) encounters 1 refutable candidate among 10 examined. These statistics reflect a limited search scope—top-K semantic matches plus citation expansion—not an exhaustive survey. The refutation counts indicate that while game-theoretic framing has some precedent, the specific surrogate network approach may be less anticipated.

Given the limited search scope of 28 candidates, the analysis suggests moderate novelty. The game-theoretic framing sits in an underpopulated taxonomy leaf, yet the contribution-level statistics reveal that at least a few prior works touch on similar ideas. The surrogate-assisted estimation appears more distinctive within the examined literature. The taxonomy structure indicates that the field has favored simpler heuristics and optimization-based methods, making the cooperative game perspective a less-traveled path—though not entirely unprecedented based on the candidates examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 4

Research Landscape Overview

Core task: layer-wise pruning for large language models. The field has organized itself around several complementary dimensions. At the highest level, researchers distinguish between different pruning granularities and structural approaches—ranging from unstructured weight removal to entire layer or attention-head elimination—and methods for sparsity allocation and layer-wise importance estimation, which determine how much to prune at each depth. A third branch focuses on pruning optimization frameworks and algorithms that solve the resulting combinatorial or continuous problems, while post-pruning performance recovery techniques address the accuracy drop through fine-tuning or knowledge distillation. Complementary compression techniques (quantization, low-rank factorization, KV-cache compression) often appear alongside pruning, and domain-specific or application-oriented pruning tailors these ideas to particular deployment scenarios.

Representative works such as LLM-Pruner[2] and Simple Effective Pruning[3] illustrate how structured removal and calibration-based scoring can be combined, while Blockpruner[1] and SlimGPT[4] explore block-level and layer-skipping strategies.

Within the sparsity allocation and layer-wise importance estimation branch, a particularly active line of inquiry revolves around principled scoring of layer contributions. Some methods rely on gradient-based or activation-based heuristics (e.g., FISTAPruner[5], Dynamic Layerwise Pruning[6]), while others adopt convex or game-theoretic formulations to distribute sparsity budgets more rigorously. Cooperative Game Pruning[0] sits squarely in this game-theoretic cluster, using Shapley-style contribution measures to assign importance scores across layers—an approach that contrasts with simpler magnitude or sensitivity metrics seen in nearby works like Simple Effective Pruning[3] or Fluctuation Adaptive Pruning[7]. By framing layer selection as a cooperative game, Cooperative Game Pruning[0] aims to capture synergistic effects that scalar importance scores may miss, positioning it as a more theoretically grounded alternative within the broader landscape of layer-wise importance estimation.

Claimed Contributions

Game-theoretic framework for layer pruning in LLMs

The authors introduce a novel perspective on LLM pruning by treating it as a cooperative game where each Transformer layer is a player and model performance defines the utility. This framework explicitly captures dynamic interdependencies among layers that static heuristics fail to account for.

10 retrieved papers
Can Refute
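For context, the cooperative-game framing rests on the standard Shapley value, which averages a layer's marginal contribution over all coalitions of the remaining layers. In the notation below (ours, not the paper's), $N$ is the set of layers and $v(S)$ the model's performance when only the layers in $S$ are retained:

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}\,\bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]
```

Exact evaluation requires querying $v$ on up to $2^{|N|}$ layer subsets, which is what motivates the surrogate-assisted approximation claimed in the next contribution.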
Surrogate-assisted Shapley value estimation with stratified Monte Carlo sampling

The authors develop a two-stage approximation strategy that combines stratified Monte Carlo mask sampling with a lightweight surrogate network to efficiently estimate Shapley values for layer contributions. This approach makes the computation tractable for large-scale models while preserving inter-layer dependencies.

8 retrieved papers
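As a rough illustration of how stratified Monte Carlo sampling can pair with a cheap utility oracle, the sketch below estimates per-layer Shapley values by stratifying over coalition sizes. The `utility` callable stands in for the paper's surrogate network; all names, parameters, and the sampling budget here are illustrative assumptions, not the authors' implementation.

```python
import random

def shapley_stratified(n_layers, utility, samples_per_size=32, seed=0):
    """Estimate per-layer Shapley values by stratified Monte Carlo.

    Strata are coalition sizes: for each layer i and each size s, draw
    random coalitions S of size s not containing i and average the
    marginal gain utility(S | {i}) - utility(S). The Shapley value
    weights every coalition size equally, so the final estimate is the
    mean of the per-size averages. `utility` maps a set of retained
    layer indices to a predicted performance score (the surrogate).
    """
    rng = random.Random(seed)
    phi = [0.0] * n_layers
    for i in range(n_layers):
        others = [j for j in range(n_layers) if j != i]
        size_means = []
        for s in range(n_layers):  # stratum: coalitions of size s
            gains = []
            for _ in range(samples_per_size):
                S = frozenset(rng.sample(others, s))
                gains.append(utility(S | {i}) - utility(S))
            size_means.append(sum(gains) / len(gains))
        phi[i] = sum(size_means) / n_layers
    return phi
```

With a linear (additive) utility the marginal gain of a layer is constant across coalitions, so the estimator recovers the exact Shapley values; for a real surrogate the variance shrinks as the per-stratum sample count grows.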
Scalable pruning method generalizing across architectures

The authors demonstrate that their pruning framework extends beyond Transformer-based LLMs to non-Transformer architectures and can be seamlessly combined with quantization. The method achieves consistent improvements in perplexity and zero-shot accuracy across various model sizes and architectures.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Game-theoretic framework for layer pruning in LLMs

The authors introduce a novel perspective on LLM pruning by treating it as a cooperative game where each Transformer layer is a player and model performance defines the utility. This framework explicitly captures dynamic interdependencies among layers that static heuristics fail to account for.

Contribution

Surrogate-assisted Shapley value estimation with stratified Monte Carlo sampling

The authors develop a two-stage approximation strategy that combines stratified Monte Carlo mask sampling with a lightweight surrogate network to efficiently estimate Shapley values for layer contributions. This approach makes the computation tractable for large-scale models while preserving inter-layer dependencies.

Contribution

Scalable pruning method generalizing across architectures

The authors demonstrate that their pruning framework extends beyond Transformer-based LLMs to non-Transformer architectures and can be seamlessly combined with quantization. The method achieves consistent improvements in perplexity and zero-shot accuracy across various model sizes and architectures.