PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery
Overview
Overall Novelty Assessment
The paper introduces PASER, a framework for selecting post-training data to recover capabilities of pruned large language models. Within the taxonomy, it resides in the 'Post-Pruning Recovery Data Selection' leaf under 'Data Selection and Curation Methods'. This leaf contains only two papers total, including PASER itself, indicating a relatively sparse and emerging research direction. The sibling work focuses on calibration data curation for compressed models, whereas PASER targets post-pruning fine-tuning, suggesting the field is still developing specialized approaches for this recovery challenge.
The taxonomy reveals neighboring research in 'General Data Selection for LLM Training' (four papers on broad-scale curation emphasizing perplexity and diversity) and 'Model Compression and Recovery' (seven papers across structured pruning, adapter-guided methods, and iterative self-improvement). PASER bridges these areas by addressing data selection specifically for pruned models, distinct from domain-agnostic curation methods and from pruning techniques themselves. The scope notes clarify that general data selection excludes post-pruning recovery focus, while compression methods exclude data selection without pruning context, positioning PASER at their intersection.
Among the twenty-five candidates examined across the three contributions, none were identified as clearly refuting PASER's novelty. Ten candidates were examined for the core framework, ten for the semantic clustering component, and five for the capability degradation-aware selection; in each case, no refutable overlaps were found. This limited search scope suggests that, within the top-ranked semantic matches and citation expansions, no prior work explicitly combines manifold-based instruction clustering with adaptive budget allocation based on capability degradation for pruned LLM recovery, though the small candidate pool means the broader literature may contain related ideas.
Based on the examined candidates and sparse taxonomy leaf, PASER appears to occupy a novel position within post-pruning data selection. However, the analysis covers only top-ranked semantic matches from a limited search, not an exhaustive survey of all compression or data curation literature. The framework's combination of semantic clustering, degradation-aware allocation, and negative tuning mitigation distinguishes it from the single sibling work and broader curation methods, though the emerging nature of this subfield means the landscape may evolve rapidly.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce PASER, a novel framework that selects instruction tuning data to efficiently recover capabilities of pruned large language models. The method addresses uneven capability degradation and high computational costs by identifying the most valuable recovery instructions within a limited data budget.
The authors propose a clustering technique that uses manifold learning (diffusion kernel) and NMF-based spectral clustering to group instructions in semantic space. This reveals capability-specific instruction sets corresponding to different LLM capabilities affected by pruning.
The authors develop an adaptive budget allocation mechanism that distributes data selection across capability clusters based on degradation severity, prioritizes efficiency-driven sample selection within clusters, and introduces a Concept Consistency Graph to filter conflicting or irrelevant data that could cause negative tuning effects.
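The three contributions above compose into a single selection loop: cluster instructions by capability, split the data budget across clusters by degradation severity, and keep the highest-value samples per cluster. The following is a minimal, self-contained sketch of that loop; the clusters, severity scores, and per-sample values are toy stand-ins (assumptions for illustration), not the paper's actual components.

```python
# Minimal sketch of a PASER-style budget-constrained selection loop.
# Clusters, severities, and scores are assumed inputs, not PASER's
# actual clustering, degradation, or scoring procedures.

def select_recovery_data(instructions, clusters, severities, scores, budget):
    """Pick up to `budget` instructions, splitting the budget across
    capability clusters in proportion to their degradation severity."""
    total = sum(severities.values()) or 1.0
    selected = []
    for name, members in clusters.items():
        quota = round(budget * severities[name] / total)  # degradation-aware quota
        ranked = sorted(members, key=lambda i: scores[i], reverse=True)
        selected.extend(instructions[i] for i in ranked[:quota])
    return selected

# Toy example: two capability clusters; "math" degraded twice as badly.
instructions = ["add 2+3", "solve x^2=4", "translate 'hi'", "summarize text"]
clusters = {"math": [0, 1], "language": [2, 3]}
severities = {"math": 0.6, "language": 0.3}   # assumed degradation scores
scores = {0: 0.9, 1: 0.4, 2: 0.7, 3: 0.8}     # assumed per-sample value
print(select_recovery_data(instructions, clusters, severities, scores, budget=3))
# → ['add 2+3', 'solve x^2=4', 'summarize text']
```

Note how the more degraded "math" cluster receives two of the three budget slots, which is the intuition behind degradation-aware allocation.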
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[13] Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization
Contribution Analysis
Detailed comparisons for each claimed contribution
PASER framework for post-training data selection in pruned LLM recovery
The authors introduce PASER, a novel framework that selects instruction tuning data to efficiently recover capabilities of pruned large language models. The method addresses uneven capability degradation and high computational costs by identifying the most valuable recovery instructions within a limited data budget.
[19] Sparse fine-tuning for inference acceleration of large language models
[20] Unified knowledge maintenance pruning and progressive recovery with weight recalling for large vision-language models
[21] RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation
[22] Empirical guidelines for deploying LLMs onto resource-constrained edge devices
[23] A comprehensive review of model compression techniques in machine learning
[24] RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation
[25] How to train data-efficient LLMs
[26] Hardening LLM fine-tuning: From differentially private data selection to trustworthy model quantization
[27] A deeper look at depth pruning of LLMs
[28] On-Device Large Language Models: A Survey of Model Compression and System Optimization
Semantic-structural recovery instruction clustering (S2RIC)
The authors propose a clustering technique that uses manifold learning (diffusion kernel) and NMF-based spectral clustering to group instructions in semantic space. This reveals capability-specific instruction sets corresponding to different LLM capabilities affected by pruning.
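One way to realize such a pipeline is to build a Gaussian (diffusion-kernel) affinity over instruction embeddings, row-normalize it into a Markov operator, diffuse it for a few steps, and then factor the resulting non-negative matrix with NMF to obtain soft cluster memberships. The sketch below follows that recipe; the parameter choices (`eps`, `t`, `k`) and the use of scikit-learn's `NMF` are illustrative assumptions, and PASER's exact construction may differ.

```python
import numpy as np
from sklearn.decomposition import NMF

def diffusion_nmf_clusters(embeddings, k, eps=1.0, t=2):
    """Assign each embedded instruction to one of k capability clusters
    via diffusion-kernel affinities and NMF-based spectral grouping."""
    X = np.asarray(embeddings, dtype=float)
    # Pairwise squared distances -> Gaussian (diffusion kernel) affinity.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    # Row-normalize to a Markov transition matrix and diffuse t steps.
    P = W / W.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(P, t)
    # Factor the non-negative diffusion matrix; the argmax over the k
    # NMF components gives a hard cluster label per instruction.
    H = NMF(n_components=k, init="nndsvda", random_state=0,
            max_iter=500).fit_transform(Pt)
    return H.argmax(axis=1)

# Two well-separated toy "semantic" groups of instruction embeddings.
emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = diffusion_nmf_clusters(emb, k=2)
```

On this toy input the two nearby pairs land in separate clusters, mimicking how capability-specific instruction sets would emerge in a real semantic space.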
[29] Semantic Spectral Clustering with Contrastive Learning and Neighbor Mining
[30] Opinion Texts Clustering Using Manifold Learning Based on Sentiment and Semantics Analysis
[31] Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
[32] Spectral Clustering-Aware Learning of Embeddings for Speaker Diarisation
[33] Improving spectral clustering with deep embedding, cluster estimation and metric learning
[34] Isotropy in the Contextual Embedding Space: Clusters and Manifolds
[35] Fast semi-supervised clustering with enhanced spectral embedding
[36] Manifold learning and spectral clustering for image phylogeny forests
[37] Toward a Universal Map of EEG: A Semantic, Low-Dimensional Manifold for EEG Classification, Clustering, and Prognostication
[38] Manifold-Constrained Sentence Embeddings via Triplet Loss: Projecting Semantics onto Spheres, Tori, and Möbius Strips
Capability degradation-aware instruction selection (CDAIS) with negative tuning mitigation
The authors develop an adaptive budget allocation mechanism that distributes data selection across capability clusters based on degradation severity, prioritizes efficiency-driven sample selection within clusters, and introduces a Concept Consistency Graph to filter conflicting or irrelevant data that could cause negative tuning effects.
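The negative-tuning filter can be pictured as a graph whose nodes are concepts and whose edges record concept co-occurrence in already-accepted samples: a candidate is rejected if admitting it would pair concepts known to conflict. The sketch below is a hedged toy rendering of that idea; the conflict list and the per-sample concept sets are assumed inputs (the paper's actual Concept Consistency Graph construction and concept extraction may differ).

```python
from itertools import combinations

def ccg_filter(samples, conflicts):
    """Keep samples whose concept pairs never hit a conflicting edge.

    samples:   list of (text, concept_set) pairs
    conflicts: set of frozenset concept pairs deemed inconsistent
    """
    edges, kept = set(), []
    for text, concepts in samples:
        pairs = {frozenset(p) for p in combinations(sorted(concepts), 2)}
        if pairs & conflicts:   # would introduce a conflicting edge
            continue            # filter out: risk of negative tuning
        edges |= pairs          # grow the concept consistency graph
        kept.append(text)
    return kept

# Toy conflict: mixing metric and imperial units in one sample.
conflicts = {frozenset({"metric units", "imperial units"})}
samples = [
    ("convert 5 km to m", {"metric units", "arithmetic"}),
    ("mix 3 miles with 2 km", {"metric units", "imperial units"}),
]
print(ccg_filter(samples, conflicts))  # → ['convert 5 km to m']
```

The accepted samples' edges accumulate in `edges`, so a richer variant could also derive new conflicts from the growing graph rather than taking them as given.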