PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Model Pruning, Large Language Model, Data Selection, Efficient Recovery
Abstract:

Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, irrelevant instructions may even harm model capability recovery. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify the instructions that recover the most compromised model capabilities within a given data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. The data budget is then adaptively allocated across clusters according to the degree of degradation of the corresponding model capability. Within each cluster, we prioritize the data samples on which model performance declines most severely. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4%-20% of the original post-training data. We provide the anonymous code repository in Link.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PASER, a framework for selecting post-training data to recover capabilities of pruned large language models. Within the taxonomy, it resides in the 'Post-Pruning Recovery Data Selection' leaf under 'Data Selection and Curation Methods'. This leaf contains only two papers total, including PASER itself, indicating a relatively sparse and emerging research direction. The sibling work focuses on calibration data curation for compressed models, whereas PASER targets post-pruning fine-tuning, suggesting the field is still developing specialized approaches for this recovery challenge.

The taxonomy reveals neighboring research in 'General Data Selection for LLM Training' (four papers on broad-scale curation emphasizing perplexity and diversity) and 'Model Compression and Recovery' (seven papers across structured pruning, adapter-guided methods, and iterative self-improvement). PASER bridges these areas by addressing data selection specifically for pruned models, distinct from domain-agnostic curation methods and from pruning techniques themselves. The scope notes clarify that general data selection excludes post-pruning recovery focus, while compression methods exclude data selection without pruning context, positioning PASER at their intersection.

Among twenty-five candidates examined across three contributions, none were identified as clearly refuting PASER's novelty. For the core framework, ten candidates were examined with zero refutable overlaps; ten more for the semantic clustering component and five for the capability degradation-aware selection likewise yielded no refutations. This limited search scope suggests that, within the top-ranked semantic matches and citation expansions, no prior work explicitly combines manifold-based instruction clustering with adaptive budget allocation based on capability degradation for pruned LLM recovery, though the small candidate pool means the broader literature may contain related ideas.

Based on the examined candidates and sparse taxonomy leaf, PASER appears to occupy a novel position within post-pruning data selection. However, the analysis covers only top-ranked semantic matches from a limited search, not an exhaustive survey of all compression or data curation literature. The framework's combination of semantic clustering, degradation-aware allocation, and negative tuning mitigation distinguishes it from the single sibling work and broader curation methods, though the emerging nature of this subfield means the landscape may evolve rapidly.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: post-training data selection for pruned large language model recovery. The field divides into two main branches: Data Selection and Curation Methods, which focus on identifying high-quality training samples through metrics like perplexity, diversity, and task-specific relevance, and Model Compression and Recovery, which encompasses pruning techniques and subsequent fine-tuning strategies to restore performance. Works in the first branch often explore general principles for curating pre-training or instruction data—such as Data Optimization Survey[2] and Efficient Training Survey[3]—while the second branch addresses structural modifications and recovery protocols, including methods like LoRAShear[6] and Structural Pruning Recovery[10]. These branches are complementary: effective compression depends on both the pruning algorithm and the quality of data used during recovery, making data selection a critical enabler for efficient model deployment.

Recent efforts highlight contrasting priorities between broad-scale curation and targeted post-pruning recovery. General data curation frameworks, such as Data Curation Foundation[7] and FineScope[8], emphasize scalability and diversity across large corpora, whereas post-pruning recovery methods require carefully selected calibration or fine-tuning sets that address the specific degradation introduced by compression.

PASER[0] sits squarely within this specialized niche, focusing on selecting recovery data tailored to pruned models. It shares thematic ground with Calibration Data Curation[13], which also targets data selection for compressed architectures, but PASER[0] emphasizes post-pruning fine-tuning rather than calibration alone. This positioning distinguishes it from broader surveys like Efficient Training Survey[3] and from pruning-centric works like Think Prune Train[5], which integrate pruning and training but do not isolate the data selection challenge as explicitly.

Claimed Contributions

PASER framework for post-training data selection in pruned LLM recovery

The authors introduce PASER, a novel framework that selects instruction tuning data to efficiently recover capabilities of pruned large language models. The method addresses uneven capability degradation and high computational costs by identifying the most valuable recovery instructions within a limited data budget.

10 retrieved papers
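The claimed selection flow can be sketched end to end. The snippet below is a toy, self-contained illustration only: the function names, the proportional quota rule, and the length-based scoring are assumptions made for demonstration, not the authors' implementation.

```python
def paser(instructions, budget, cluster_fn, degradation_fn, score_fn, is_clean_fn):
    """Select up to `budget` recovery samples for a pruned model (toy sketch)."""
    clusters = cluster_fn(instructions)                 # capability-specific groups
    deg = {c: degradation_fn(c) for c in clusters}      # per-capability degradation
    total = sum(deg.values()) or 1.0
    selected = []
    for c, samples in clusters.items():
        quota = round(budget * deg[c] / total)          # adaptive budget allocation
        ranked = sorted(samples, key=score_fn, reverse=True)
        selected += [s for s in ranked if is_clean_fn(s)][:quota]
    return selected

# Toy run: the "math" capability degraded more, so it receives the larger quota.
clusters = {"math": ["m-easy", "m-hard"], "chat": ["c1", "c2"]}
out = paser(
    instructions=None,
    budget=3,
    cluster_fn=lambda _: clusters,
    degradation_fn=lambda c: {"math": 0.6, "chat": 0.3}[c],
    score_fn=len,
    is_clean_fn=lambda s: True,
)
print(out)  # ['m-easy', 'm-hard', 'c1']
```

Allocating quotas proportionally to degradation is one simple reading of "adaptive allocation"; the paper may use a different allocation rule.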
Semantic-structural recovery instruction clustering (S2RIC)

The authors propose a clustering technique that uses manifold learning (diffusion kernel) and NMF-based spectral clustering to group instructions in semantic space. This reveals capability-specific instruction sets corresponding to different LLM capabilities affected by pruning.

10 retrieved papers
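A minimal, self-contained sketch of such a clustering pipeline, assuming a Gaussian kernel as the diffusion affinity and plain multiplicative-update symmetric NMF for the spectral step; the paper's actual kernel construction and factorization details may differ.

```python
import numpy as np

def diffusion_affinity(X, eps=1.0):
    """Gaussian (diffusion-kernel) affinity between instruction embeddings."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / eps)
    # Row-normalize, then symmetrize so the matrix suits symmetric NMF.
    P = K / K.sum(axis=1, keepdims=True)
    return (P + P.T) / 2

def nmf_spectral_cluster(W, k, iters=300, seed=0):
    """Symmetric NMF: W ~ H H^T; each row's argmax column is its cluster."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[0], k))
    for _ in range(iters):
        H *= (W @ H) / (H @ (H.T @ H) + 1e-12)  # multiplicative update
    return H.argmax(axis=1)

# Toy "embeddings": two tight groups standing in for two capability types.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.05, (6, 4)), rng.normal(3.0, 0.05, (6, 4))])
labels = nmf_spectral_cluster(diffusion_affinity(X), k=2)
```

In practice the embeddings would come from an encoder over the instruction text, and the number of clusters would correspond to the capability categories of interest.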
Capability degradation-aware instruction selection (CDAIS) with negative tuning mitigation

The authors develop an adaptive budget allocation mechanism that distributes data selection across capability clusters based on degradation severity, prioritizes efficiency-driven sample selection within clusters, and introduces a Concept Consistency Graph to filter conflicting or irrelevant data that could cause negative tuning effects.

5 retrieved papers
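The Concept Consistency Graph is not detailed in this report, so the following is one plausible, heavily simplified reading: record which concept pairs each sample links, and reject samples that hit a pair flagged as conflicting. The concept extraction and the conflict set below are toy stand-ins, not the paper's construction.

```python
from itertools import combinations

def ccg_filter(samples, concepts_of, conflicting):
    """Keep samples whose concept pairs never match a known-conflicting pair."""
    kept = []
    for s in samples:
        pairs = {frozenset(p) for p in combinations(concepts_of(s), 2)}
        if not any(p in conflicting for p in pairs):
            kept.append(s)
    return kept

# Toy data: a sample mixing "python2" and "python3" advice is treated as conflicting.
concept_map = {
    "a": {"sorting", "python3"},
    "b": {"python2", "python3"},   # conflicting pair -> filtered out
    "c": {"recursion", "sorting"},
}
conflicting = {frozenset({"python2", "python3"})}
kept = ccg_filter(["a", "b", "c"], concept_map.__getitem__, conflicting)
print(kept)  # ['a', 'c']
```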

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
