Boomerang Distillation Enables Zero-Shot Model Size Interpolation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: knowledge distillation, pretraining, adaptive compute, model interpolation
Abstract:

Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments.
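The reconstruction step described above can be illustrated with a minimal sketch. This is not the authors' code: the function name, the list-of-layers representation, and the 2:1 block mapping are all illustrative assumptions. The point is only that building an intermediate model reduces to swapping distilled student layers back for the teacher blocks they replaced, with no training involved.

```python
# Hypothetical sketch of boomerang-style interpolation (illustrative names,
# not the authors' implementation). A teacher is distilled into a student
# whose layer i stands in for a contiguous block of teacher layers; an
# intermediate model re-incorporates some of those teacher blocks zero-shot.

def interpolate(student_layers, teacher_blocks, swap_indices):
    """Build an intermediate-sized model without any training.

    student_layers : list of student layers (one per teacher block)
    teacher_blocks : teacher_blocks[i] is the block of teacher layers
                     that student_layers[i] was distilled to replace
    swap_indices   : student positions to re-expand into teacher blocks
    """
    model = []
    for i, layer in enumerate(student_layers):
        if i in swap_indices:
            model.extend(teacher_blocks[i])   # re-incorporate teacher block
        else:
            model.append(layer)               # keep distilled student layer
    return model

# Toy example: 6 teacher layers compressed 2:1 into a 3-layer student.
teacher_blocks = [["t0", "t1"], ["t2", "t3"], ["t4", "t5"]]
student_layers = ["s0", "s1", "s2"]

mid = interpolate(student_layers, teacher_blocks, swap_indices={1})
# mid has 4 layers: ["s0", "t2", "t3", "s2"]
```

Varying `swap_indices` from the empty set to all positions traces out the full family of intermediate sizes between student and teacher.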

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces boomerang distillation, a method for zero-shot model size interpolation that progressively reconstructs intermediate-sized models by re-incorporating teacher layers into a distilled student without retraining. Within the taxonomy, it occupies the 'Zero-Shot Model Size Interpolation' leaf under 'Capacity Gap and Model Size Adaptation'. Notably, this leaf contains only the original paper itself, indicating a sparse research direction. The broader parent category includes three leaves addressing capacity gaps, dynamic compression, and interpolation strategies, suggesting the paper targets an underexplored niche within a moderately active research area.

The taxonomy reveals that neighboring work primarily focuses on capacity gap mitigation through architectural adjustments or training modifications (e.g., Bridging Capacity Gap, Lifting Capacity Gap) and dynamic capacity approaches requiring progressive training schedules (e.g., Capacity Dynamic Distillation). The original paper diverges by decoupling size selection from training, enabling post-hoc interpolation. Related branches address core distillation mechanisms (feature-based, sequence-level matching) and domain-specific applications, but these do not directly tackle zero-shot size variation. The taxonomy's scope and exclude notes clarify that methods requiring retraining for each size belong in sibling categories, positioning this work as methodologically distinct.

Among twenty-one candidates examined via semantic search and citation expansion, none clearly refute the three core contributions. The boomerang distillation phenomenon itself was examined against ten candidates with zero refutable overlaps. The claim that interpolated models match or surpass standard distilled models drew six candidates, again with no refutations. The analysis of enabling conditions examined five candidates without finding prior work that anticipates this specific alignment-based interpolation mechanism. This limited search scope suggests the contributions appear novel within the examined literature, though the small candidate pool and sparse taxonomy leaf indicate the field context remains relatively unexplored.

Given the restricted search scale and the paper's position in a singleton taxonomy leaf, the work appears to introduce a genuinely new interpolation paradigm within the examined scope. However, the analysis covers top-K semantic matches and immediate citations, not an exhaustive survey of all distillation or model compression literature. The absence of sibling papers in the same leaf and the limited candidate pool mean the novelty assessment reflects current indexed work rather than a comprehensive field review.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: Zero-shot model size interpolation through knowledge distillation.

The field of knowledge distillation has evolved into a rich landscape organized around several complementary themes. At the highest level, researchers explore core distillation mechanisms and training paradigms that define how knowledge transfers from teacher to student models, including foundational techniques like Sequence-Level Distillation[24] and interactive approaches such as Interactive Knowledge Distillation[23]. A second major branch addresses the capacity gap and model size adaptation, investigating how to bridge performance differences when student models vary significantly in scale, exemplified by works like Bridging Capacity Gap[12] and Interpolative Distillation[13]. Domain-specific applications form another branch, tailoring distillation to specialized tasks ranging from speech and music encoding (e.g., Speech Music Encoder[5]) to plant disease detection and financial document processing. Finally, complementary compression and optimization techniques integrate distillation with pruning, quantization, and other efficiency methods, as seen in Enhanced Sparsification[3] and related efforts.

Within the capacity gap and model size adaptation branch, a particularly active line of work focuses on enabling flexible deployment across diverse hardware constraints without retraining multiple models. Boomerang Distillation[0] sits squarely in this area, proposing zero-shot interpolation to generate student models of arbitrary sizes on the fly. This contrasts with earlier methods like Interpolative Distillation[13] and Capacity Dynamic Distillation[18], which typically require explicit training for each target size or rely on predefined capacity schedules. Meanwhile, approaches such as Bridging Capacity Gap[12] and Lifting Capacity Gap[11] emphasize architectural or training adjustments to narrow performance drops when scaling down, but do not directly address the zero-shot interpolation scenario. The central tension across these works revolves around balancing deployment flexibility, training efficiency, and the preservation of teacher-level performance, with Boomerang Distillation[0] offering a novel pathway by decoupling model size selection from the distillation training phase.

Claimed Contributions

Boomerang distillation phenomenon for zero-shot model size interpolation

The authors identify and introduce boomerang distillation, a novel phenomenon where a small distilled student model can be progressively patched with teacher layers to create intermediate-sized models without additional training. This process produces models whose size and performance smoothly interpolate between the student and teacher.

10 retrieved papers

Demonstration that interpolated models match or surpass standard distilled models

The authors demonstrate through experiments that models created via boomerang distillation achieve comparable or superior performance to models trained with standard knowledge distillation at the same size, despite requiring no additional training for the intermediate sizes.

6 retrieved papers

Analysis of conditions enabling boomerang distillation

The authors conduct extensive experiments and ablations to characterize when boomerang distillation succeeds, showing that student initialization from teacher weights and training with alignment losses (such as cosine distance) are essential conditions, and that the approach consistently outperforms layer pruning methods.

5 retrieved papers
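The third contribution names a cosine-distance alignment loss between student and teacher representations as one of the enabling conditions. A minimal sketch of such a loss is given below; the function name, the list-of-vectors interface, and the per-position pairing are assumptions for illustration, not the authors' implementation, which would operate on framework tensors with batching.

```python
import math

# Illustrative sketch of a cosine-distance alignment loss of the kind the
# report describes as essential for boomerang distillation. Hidden states
# are represented as plain lists of floats for self-containedness.

def cosine_alignment_loss(student_hidden, teacher_hidden):
    """Mean (1 - cosine similarity) over paired hidden-state vectors."""
    total = 0.0
    for s, t in zip(student_hidden, teacher_hidden):
        dot = sum(a * b for a, b in zip(s, t))
        ns = math.sqrt(sum(a * a for a in s)) or 1.0  # guard zero vectors
        nt = math.sqrt(sum(b * b for b in t)) or 1.0
        total += 1.0 - dot / (ns * nt)
    return total / len(student_hidden)

# Hidden states that point in the same direction incur zero loss,
# regardless of scale:
loss = cosine_alignment_loss([[1.0, 0.0]], [[2.0, 0.0]])
# loss == 0.0
```

Because cosine distance is scale-invariant, minimizing it encourages student layers to produce representations directionally compatible with the teacher's, which is what lets teacher blocks be spliced back in without retraining.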

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Boomerang distillation phenomenon for zero-shot model size interpolation

Contribution: Demonstration that interpolated models match or surpass standard distilled models

Contribution: Analysis of conditions enabling boomerang distillation