Boomerang Distillation Enables Zero-Shot Model Size Interpolation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: knowledge distillation, pretraining, adaptive compute, model interpolation
Abstract:

Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments.
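The reconstruction step described above can be illustrated with a minimal sketch. This is not the authors' code: the function name, the list-of-layers representation, and the 2:1 block mapping are all illustrative assumptions. The point is only that building an intermediate model reduces to swapping distilled student layers back for the teacher blocks they replaced, with no training involved.

```python
# Hypothetical sketch of boomerang-style interpolation (illustrative names,
# not the authors' implementation). A teacher is distilled into a student
# whose layer i stands in for a contiguous block of teacher layers; an
# intermediate model re-incorporates some of those teacher blocks zero-shot.

def interpolate(student_layers, teacher_blocks, swap_indices):
    """Build an intermediate-sized model without any training.

    student_layers : list of student layers (one per teacher block)
    teacher_blocks : teacher_blocks[i] is the block of teacher layers
                     that student_layers[i] was distilled to replace
    swap_indices   : student positions to re-expand into teacher blocks
    """
    model = []
    for i, layer in enumerate(student_layers):
        if i in swap_indices:
            model.extend(teacher_blocks[i])   # re-incorporate teacher block
        else:
            model.append(layer)               # keep distilled student layer
    return model

# Toy example: 6 teacher layers compressed 2:1 into a 3-layer student.
teacher_blocks = [["t0", "t1"], ["t2", "t3"], ["t4", "t5"]]
student_layers = ["s0", "s1", "s2"]

mid = interpolate(student_layers, teacher_blocks, swap_indices={1})
# mid has 4 layers: ["s0", "t2", "t3", "s2"]
```

Varying `swap_indices` from the empty set to all positions traces out the full family of intermediate sizes between student and teacher.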

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces boomerang distillation, a method for zero-shot model size interpolation that progressively reconstructs intermediate-sized models by re-incorporating teacher layers into a distilled student without retraining. Within the taxonomy, it occupies the 'Zero-Shot Model Size Interpolation' leaf under 'Capacity Gap and Model Size Adaptation'. Notably, this leaf contains only the original paper itself, indicating a sparse research direction. The broader parent category includes three leaves addressing capacity gaps, dynamic compression, and interpolation strategies, suggesting the paper targets an underexplored niche within a moderately active research area.

The taxonomy reveals that neighboring work primarily focuses on capacity gap mitigation through architectural adjustments or training modifications (e.g., Bridging Capacity Gap, Lifting Capacity Gap) and dynamic capacity approaches requiring progressive training schedules (e.g., Capacity Dynamic Distillation). The original paper diverges by decoupling size selection from training, enabling post-hoc interpolation. Related branches address core distillation mechanisms (feature-based, sequence-level matching) and domain-specific applications, but these do not directly tackle zero-shot size variation. The taxonomy's scope and exclude notes clarify that methods requiring retraining for each size belong in sibling categories, positioning this work as methodologically distinct.

Among twenty-one candidates examined via semantic search and citation expansion, none clearly refute the three core contributions. The boomerang distillation phenomenon itself was examined against ten candidates with zero refutable overlaps. The claim that interpolated models match or surpass standard distilled models drew six candidates, again with no refutations. The analysis of enabling conditions examined five candidates without finding prior work that anticipates this specific alignment-based interpolation mechanism. This limited search scope suggests the contributions appear novel within the examined literature, though the small candidate pool and sparse taxonomy leaf indicate the field context remains relatively unexplored.

Given the restricted search scale and the paper's position in a singleton taxonomy leaf, the work appears to introduce a genuinely new interpolation paradigm within the examined scope. However, the analysis covers top-K semantic matches and immediate citations, not an exhaustive survey of all distillation or model compression literature. The absence of sibling papers in the same leaf and the limited candidate pool mean the novelty assessment reflects current indexed work rather than a comprehensive field review.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: Zero-shot model size interpolation through knowledge distillation.

The field of knowledge distillation has evolved into a rich landscape organized around several complementary themes. At the highest level, researchers explore core distillation mechanisms and training paradigms that define how knowledge transfers from teacher to student models, including foundational techniques like Sequence-Level Distillation[24] and interactive approaches such as Interactive Knowledge Distillation[23]. A second major branch addresses the capacity gap and model size adaptation, investigating how to bridge performance differences when student models vary significantly in scale, exemplified by works like Bridging Capacity Gap[12] and Interpolative Distillation[13]. Domain-specific applications form another branch, tailoring distillation to specialized tasks ranging from speech and music encoding (e.g., Speech Music Encoder[5]) to plant disease detection and financial document processing. Finally, complementary compression and optimization techniques integrate distillation with pruning, quantization, and other efficiency methods, as seen in Enhanced Sparsification[3] and related efforts.

Within the capacity gap and model size adaptation branch, a particularly active line of work focuses on enabling flexible deployment across diverse hardware constraints without retraining multiple models. Boomerang Distillation[0] sits squarely in this area, proposing zero-shot interpolation to generate student models of arbitrary sizes on the fly. This contrasts with earlier methods like Interpolative Distillation[13] and Capacity Dynamic Distillation[18], which typically require explicit training for each target size or rely on predefined capacity schedules. Meanwhile, approaches such as Bridging Capacity Gap[12] and Lifting Capacity Gap[11] emphasize architectural or training adjustments to narrow performance drops when scaling down, but do not directly address the zero-shot interpolation scenario. The central tension across these works revolves around balancing deployment flexibility, training efficiency, and the preservation of teacher-level performance, with Boomerang Distillation[0] offering a novel pathway by decoupling model size selection from the distillation training phase.

Claimed Contributions

Boomerang distillation phenomenon for zero-shot model size interpolation

The authors identify and introduce boomerang distillation, a novel phenomenon where a small distilled student model can be progressively patched with teacher layers to create intermediate-sized models without additional training. This process produces models whose size and performance smoothly interpolate between the student and teacher.

10 retrieved papers

Demonstration that interpolated models match or surpass standard distilled models

The authors demonstrate through experiments that models created via boomerang distillation achieve comparable or superior performance to models trained with standard knowledge distillation at the same size, despite requiring no additional training for the intermediate sizes.

6 retrieved papers

Analysis of conditions enabling boomerang distillation

The authors conduct extensive experiments and ablations to characterize when boomerang distillation succeeds, showing that student initialization from teacher weights and training with alignment losses (such as cosine distance) are essential conditions, and that the approach consistently outperforms layer pruning methods.

5 retrieved papers
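The third contribution names a cosine-distance alignment loss between student and teacher representations as one of the enabling conditions. A minimal sketch of such a loss is given below; the function name, the list-of-vectors interface, and the per-position pairing are assumptions for illustration, not the authors' implementation, which would operate on framework tensors with batching.

```python
import math

# Illustrative sketch of a cosine-distance alignment loss of the kind the
# report describes as essential for boomerang distillation. Hidden states
# are represented as plain lists of floats for self-containedness.

def cosine_alignment_loss(student_hidden, teacher_hidden):
    """Mean (1 - cosine similarity) over paired hidden-state vectors."""
    total = 0.0
    for s, t in zip(student_hidden, teacher_hidden):
        dot = sum(a * b for a, b in zip(s, t))
        ns = math.sqrt(sum(a * a for a in s)) or 1.0  # guard zero vectors
        nt = math.sqrt(sum(b * b for b in t)) or 1.0
        total += 1.0 - dot / (ns * nt)
    return total / len(student_hidden)

# Hidden states that point in the same direction incur zero loss,
# regardless of scale:
loss = cosine_alignment_loss([[1.0, 0.0]], [[2.0, 0.0]])
# loss == 0.0
```

Because cosine distance is scale-invariant, minimizing it encourages student layers to produce representations directionally compatible with the teacher's, which is what lets teacher blocks be spliced back in without retraining.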

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Boomerang distillation phenomenon for zero-shot model size interpolation

Contribution: Demonstration that interpolated models match or surpass standard distilled models

Contribution: Analysis of conditions enabling boomerang distillation