PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Model Pruning, Large Language Model, Data Selection, Efficient Recovery
Abstract:

Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, irrelevant instructions may even harm model capability recovery. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify the instructions that recover the most compromised model capabilities within a given data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. The data budget is then adaptively allocated across clusters according to the degree of degradation of the corresponding model capability. Within each cluster, we prioritize the data samples on which model performance declines most severely. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4%-20% of the original post-training data. We provide the anonymous code repository in Link.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PASER, a framework for selecting post-training data to recover capabilities of pruned large language models. Within the taxonomy, it resides in the 'Post-Pruning Recovery Data Selection' leaf under 'Data Selection and Curation Methods'. This leaf contains only two papers total, including PASER itself, indicating a relatively sparse and emerging research direction. The sibling work focuses on calibration data curation for compressed models, whereas PASER targets post-pruning fine-tuning, suggesting the field is still developing specialized approaches for this recovery challenge.

The taxonomy reveals neighboring research in 'General Data Selection for LLM Training' (four papers on broad-scale curation emphasizing perplexity and diversity) and 'Model Compression and Recovery' (seven papers across structured pruning, adapter-guided methods, and iterative self-improvement). PASER bridges these areas by addressing data selection specifically for pruned models, distinct from domain-agnostic curation methods and from pruning techniques themselves. The scope notes clarify that general data selection excludes post-pruning recovery focus, while compression methods exclude data selection without pruning context, positioning PASER at their intersection.

Among twenty-five candidates examined across three contributions, none were identified as clearly refuting PASER's novelty. For the core framework, ten candidates were examined with zero refutable overlaps; ten more for the semantic clustering component and five for the capability degradation-aware selection likewise yielded no refutations. This limited search scope suggests that, within the top-ranked semantic matches and citation expansions, no prior work explicitly combines manifold-based instruction clustering with adaptive budget allocation based on capability degradation for pruned LLM recovery, though the small candidate pool means the broader literature may contain related ideas.

Based on the examined candidates and sparse taxonomy leaf, PASER appears to occupy a novel position within post-pruning data selection. However, the analysis covers only top-ranked semantic matches from a limited search, not an exhaustive survey of all compression or data curation literature. The framework's combination of semantic clustering, degradation-aware allocation, and negative tuning mitigation distinguishes it from the single sibling work and broader curation methods, though the emerging nature of this subfield means the landscape may evolve rapidly.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: post-training data selection for pruned large language model recovery. The field divides into two main branches: Data Selection and Curation Methods, which focus on identifying high-quality training samples through metrics like perplexity, diversity, and task-specific relevance, and Model Compression and Recovery, which encompasses pruning techniques and subsequent fine-tuning strategies to restore performance. Works in the first branch often explore general principles for curating pre-training or instruction data—such as Data Optimization Survey[2] and Efficient Training Survey[3]—while the second branch addresses structural modifications and recovery protocols, including methods like LoRAShear[6] and Structural Pruning Recovery[10]. These branches are complementary: effective compression depends on both the pruning algorithm and the quality of data used during recovery, making data selection a critical enabler for efficient model deployment.

Recent efforts highlight contrasting priorities between broad-scale curation and targeted post-pruning recovery. General data curation frameworks, such as Data Curation Foundation[7] and FineScope[8], emphasize scalability and diversity across large corpora, whereas post-pruning recovery methods require carefully selected calibration or fine-tuning sets that address the specific degradation introduced by compression.

PASER[0] sits squarely within this specialized niche, focusing on selecting recovery data tailored to pruned models. It shares thematic ground with Calibration Data Curation[13], which also targets data selection for compressed architectures, but PASER[0] emphasizes post-pruning fine-tuning rather than calibration alone. This positioning distinguishes it from broader surveys like Efficient Training Survey[3] and from pruning-centric works like Think Prune Train[5], which integrate pruning and training but do not isolate the data selection challenge as explicitly.

Claimed Contributions

PASER framework for post-training data selection in pruned LLM recovery

The authors introduce PASER, a novel framework that selects instruction tuning data to efficiently recover capabilities of pruned large language models. The method addresses uneven capability degradation and high computational costs by identifying the most valuable recovery instructions within a limited data budget.

10 retrieved papers
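The claimed selection flow can be sketched end to end. The snippet below is a toy, self-contained illustration only: the function names, the proportional quota rule, and the length-based scoring are assumptions made for demonstration, not the authors' implementation.

```python
def paser(instructions, budget, cluster_fn, degradation_fn, score_fn, is_clean_fn):
    """Select up to `budget` recovery samples for a pruned model (toy sketch)."""
    clusters = cluster_fn(instructions)                 # capability-specific groups
    deg = {c: degradation_fn(c) for c in clusters}      # per-capability degradation
    total = sum(deg.values()) or 1.0
    selected = []
    for c, samples in clusters.items():
        quota = round(budget * deg[c] / total)          # adaptive budget allocation
        ranked = sorted(samples, key=score_fn, reverse=True)
        selected += [s for s in ranked if is_clean_fn(s)][:quota]
    return selected

# Toy run: the "math" capability degraded more, so it receives the larger quota.
clusters = {"math": ["m-easy", "m-hard"], "chat": ["c1", "c2"]}
out = paser(
    instructions=None,
    budget=3,
    cluster_fn=lambda _: clusters,
    degradation_fn=lambda c: {"math": 0.6, "chat": 0.3}[c],
    score_fn=len,
    is_clean_fn=lambda s: True,
)
print(out)  # ['m-easy', 'm-hard', 'c1']
```

Allocating quotas proportionally to degradation is one simple reading of "adaptive allocation"; the paper may use a different allocation rule.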
Semantic-structural recovery instruction clustering (S2RIC)

The authors propose a clustering technique that uses manifold learning (diffusion kernel) and NMF-based spectral clustering to group instructions in semantic space. This reveals capability-specific instruction sets corresponding to different LLM capabilities affected by pruning.

10 retrieved papers
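A minimal, self-contained sketch of such a clustering pipeline, assuming a Gaussian kernel as the diffusion affinity and plain multiplicative-update symmetric NMF for the spectral step; the paper's actual kernel construction and factorization details may differ.

```python
import numpy as np

def diffusion_affinity(X, eps=1.0):
    """Gaussian (diffusion-kernel) affinity between instruction embeddings."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / eps)
    # Row-normalize, then symmetrize so the matrix suits symmetric NMF.
    P = K / K.sum(axis=1, keepdims=True)
    return (P + P.T) / 2

def nmf_spectral_cluster(W, k, iters=300, seed=0):
    """Symmetric NMF: W ~ H H^T; each row's argmax column is its cluster."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[0], k))
    for _ in range(iters):
        H *= (W @ H) / (H @ (H.T @ H) + 1e-12)  # multiplicative update
    return H.argmax(axis=1)

# Toy "embeddings": two tight groups standing in for two capability types.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.05, (6, 4)), rng.normal(3.0, 0.05, (6, 4))])
labels = nmf_spectral_cluster(diffusion_affinity(X), k=2)
```

In practice the embeddings would come from an encoder over the instruction text, and the number of clusters would correspond to the capability categories of interest.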
Capability degradation-aware instruction selection (CDAIS) with negative tuning mitigation

The authors develop an adaptive budget allocation mechanism that distributes data selection across capability clusters based on degradation severity, prioritizes efficiency-driven sample selection within clusters, and introduces a Concept Consistency Graph to filter conflicting or irrelevant data that could cause negative tuning effects.

5 retrieved papers
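The Concept Consistency Graph is not detailed in this report, so the following is one plausible, heavily simplified reading: record which concept pairs each sample links, and reject samples that hit a pair flagged as conflicting. The concept extraction and the conflict set below are toy stand-ins, not the paper's construction.

```python
from itertools import combinations

def ccg_filter(samples, concepts_of, conflicting):
    """Keep samples whose concept pairs never match a known-conflicting pair."""
    kept = []
    for s in samples:
        pairs = {frozenset(p) for p in combinations(concepts_of(s), 2)}
        if not any(p in conflicting for p in pairs):
            kept.append(s)
    return kept

# Toy data: a sample mixing "python2" and "python3" advice is treated as conflicting.
concept_map = {
    "a": {"sorting", "python3"},
    "b": {"python2", "python3"},   # conflicting pair -> filtered out
    "c": {"recursion", "sorting"},
}
conflicting = {frozenset({"python2", "python3"})}
kept = ccg_filter(["a", "b", "c"], concept_map.__getitem__, conflicting)
print(kept)  # ['a', 'c']
```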

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
