ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Continual Pretraining, Large Language Models, Parameter-Efficient Training
Abstract:

Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, uniform expansion and updates still entangle general and domain learning, undermining their effectiveness. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We then propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical domains show that ADEPT outperforms full-parameter CPT by up to 5.76% on general benchmarks and 5.58% on target domain benchmarks with only 15% of parameters tuned and less than 50% of the training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://anonymous.4open.science/r/ADEPT-F2E3

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ADEPT, a two-stage framework for domain-adaptive continual pretraining that selectively expands model layers based on functional importance and applies asymmetric learning rates to balance general and domain knowledge. It resides in the Adaptive and Selective Parameter Expansion leaf, which contains only two papers within the broader Continual Pretraining Methodologies and Frameworks branch. This is a relatively sparse research direction compared to more crowded areas like Medical and Healthcare Domains (ten papers) or General Training Strategies (seven papers), suggesting the work targets a less explored methodological niche.

The taxonomy reveals that ADEPT's immediate neighbors include Catastrophic Forgetting Mitigation Techniques (five papers) and General Training Strategies (seven papers), both addressing stability and optimization during continual pretraining. The sibling paper in the same leaf, AdapterSwap, focuses on modular adapter mechanisms rather than base model expansion, highlighting a methodological divergence. Nearby branches like Cross-Lingual Adaptation and Application Domains emphasize different axes (language transfer and domain-specific corpora), while ADEPT concentrates on architecture-level adaptation strategies. The scope note clarifies that this leaf excludes uniform full-parameter methods, positioning ADEPT as a selective, function-aware alternative.

Among the twenty-four candidates examined, none clearly refutes the three main contributions. For the functional specialization perspective, ten candidates were examined with zero refutable matches, suggesting limited prior work explicitly frames continual pretraining through layer-wise functional roles. For the ADEPT framework itself, four candidates were examined, again with no refutations, indicating the two-stage design combining selective expansion and decoupled tuning may be novel within the search scope. For the empirical validation across mathematical and medical domains, ten candidates were examined without refutation, though this likely reflects the limited search scale rather than absolute novelty, as domain-specific benchmarking is common in the broader taxonomy.

Based on the limited search scope of twenty-four semantically similar candidates, the work appears to introduce a distinct methodological angle—function-aware parameter expansion—within a relatively sparse taxonomy leaf. The absence of refutable prior work across all contributions suggests potential novelty, though the small candidate pool and narrow semantic search radius mean this analysis cannot confirm whether similar ideas exist in adjacent methodological spaces or domain-specific literature not captured by top-K retrieval.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: continual pretraining for large language model domain adaptation. The field addresses how to efficiently adapt general-purpose LLMs to specialized domains, such as medicine, finance, or telecommunications, by continuing pretraining on domain-specific corpora.

The taxonomy reveals several main branches: Continual Pretraining Methodologies and Frameworks explores techniques for stable and efficient adaptation, including adaptive parameter expansion and selective updating strategies; Cross-Lingual and Multilingual Adaptation focuses on extending models to new languages or multilingual settings; Application Domains and Specialized Corpora encompasses domain-specific efforts in areas like healthcare, e-commerce, and geotechnical engineering; Evaluation and Benchmarking Studies examines how to measure adaptation success; Toolkits and Infrastructure provides practical systems for deployment; and Retrieval-Augmented and Hybrid Approaches combines pretraining with external knowledge retrieval. Representative works illustrate these themes: Lifelong Pretraining[3] and Continual Pretraining[4] establish foundational methodologies, while domain-specific efforts like ChipNeMo[25], EcomGPT[8], and TelecomGPT[6] demonstrate application-driven adaptation.

A particularly active line of work investigates the trade-offs between full continual pretraining and lighter-weight alternatives. Some studies explore parameter-efficient methods to reduce computational costs while maintaining domain performance, as seen in Efficient Domain Pretraining[7] and Domain Finetuning Strategies[14]. Others question whether continual pretraining is always necessary, with Continual Pretraining Not Needed[32] challenging conventional assumptions. Within this landscape, ADEPT[0] sits in the Adaptive and Selective Parameter Expansion cluster, emphasizing dynamic model growth to balance capacity and efficiency.
This approach contrasts with neighboring work like AdapterSwap[49], which focuses on modular adapter mechanisms rather than expanding the base model. ADEPT[0] addresses the challenge of catastrophic forgetting and stability during adaptation—issues also explored in Stability Gap Mitigation[28] and Investigating Continual Pretraining[5]—by selectively adding parameters where domain knowledge is most needed, offering a middle ground between full retraining and purely modular strategies.

Claimed Contributions

Functional specialization perspective for continual pretraining

The authors demonstrate through pilot studies that LLMs exhibit functional specialization where layers and units differentially encode general-critical capabilities. They argue that parameter expansion and optimization should be function-aware, with targeted layer expansion and decoupled training as a principled solution to domain adaptation.
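The layer-importance probing behind this claim can be illustrated with a toy sketch. The report does not state ADEPT's actual importance metric, so the ablation-based proxy, the function names, and all numbers below are hypothetical:

```python
# Hypothetical sketch: ranking layers by general-domain importance.
# Proxy used here: how much a (simulated) general-domain loss increases
# when a layer's contribution is ablated. Layers whose ablation barely
# moves the loss are "least critical" and become expansion candidates.

def layer_importance(base_loss, ablated_losses):
    """Importance of each layer = loss increase when that layer is ablated."""
    return [ablated - base_loss for ablated in ablated_losses]

def least_critical_layers(importance, k):
    """Indices of the k layers whose ablation hurts general performance least."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    return ranked[:k]

# Toy numbers: general-domain loss after ablating each of five layers.
base_loss = 2.0
ablated_losses = [2.9, 2.1, 3.5, 2.05, 2.4]
imp = layer_importance(base_loss, ablated_losses)
print(least_critical_layers(imp, 2))  # -> [3, 1]
```

Any importance signal with the same shape (one score per layer, lower meaning safer to duplicate) would plug into this selection step.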

10 retrieved papers
ADEPT framework with two-stage design

The authors introduce ADEPT, a two-stage continual pretraining framework. The first stage selectively duplicates the layers least critical for the general domain to increase capacity. The second stage decouples parameter units within the expanded layers and assigns asymmetric learning rates to balance knowledge injection and retention.
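The two stages described above can be sketched on a toy "model" represented as a list of layer names. This is a minimal illustration under assumed conventions (which layers to expand, and the learning-rate values), not the paper's implementation:

```python
# Hypothetical sketch of ADEPT's two stages on a toy model.
# Stage 1: duplicate selected (least general-critical) layers.
# Stage 2: give duplicated layers a higher learning rate (knowledge
# injection) and originals a lower one (knowledge retention).

def expand_layers(layers, expand_ids):
    """Stage 1: insert a trainable copy after each selected layer."""
    out = []
    for i, layer in enumerate(layers):
        out.append((layer, False))               # original layer
        if i in expand_ids:
            out.append((layer + "_copy", True))  # duplicated, domain-facing
    return out

def assign_learning_rates(expanded, lr_new=1e-4, lr_retain=1e-5):
    """Stage 2: asymmetric rates keyed on whether a layer is a duplicate."""
    return {name: (lr_new if is_copy else lr_retain)
            for name, is_copy in expanded}

expanded = expand_layers(["L0", "L1", "L2", "L3"], expand_ids={1, 3})
print([name for name, _ in expanded])
# -> ['L0', 'L1', 'L1_copy', 'L2', 'L3', 'L3_copy']
lrs = assign_learning_rates(expanded)
```

In a real setup, the per-layer rates would map onto per-parameter-group options of the optimizer; the paper's unit-wise decoupling further splits rates *within* each expanded layer, which this layer-granularity sketch does not capture.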

4 retrieved papers
Empirical validation across mathematical and medical domains

The authors perform comprehensive experiments showing ADEPT outperforms full-parameter continual pretraining by up to 5.76% on general benchmarks and 5.58% on target domain benchmarks, while tuning only 15% of the parameters and using less than 50% of the training time.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Functional specialization perspective for continual pretraining

The authors demonstrate through pilot studies that LLMs exhibit functional specialization where layers and units differentially encode general-critical capabilities. They argue that parameter expansion and optimization should be function-aware, with targeted layer expansion and decoupled training as a principled solution to domain adaptation.

Contribution

ADEPT framework with two-stage design

The authors introduce ADEPT, a two-stage continual pretraining framework. The first stage selectively duplicates the layers least critical for the general domain to increase capacity. The second stage decouples parameter units within the expanded layers and assigns asymmetric learning rates to balance knowledge injection and retention.

Contribution

Empirical validation across mathematical and medical domains

The authors perform comprehensive experiments showing ADEPT outperforms full-parameter continual pretraining by up to 5.76% on general benchmarks and 5.58% on target domain benchmarks, while tuning only 15% of the parameters and using less than 50% of the training time.