ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Continual Pretraining, Large Language Models, Parameter-Efficient Training
Abstract:

Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, uniform expansion and updates still entangle general and domain learning, undermining their effectiveness. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We then propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical domains show that ADEPT outperforms full-parameter CPT by up to 5.76% on general benchmarks and 5.58% on target domain benchmarks with only 15% of parameters tuned and less than 50% of the training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://anonymous.4open.science/r/ADEPT-F2E3

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ADEPT, a two-stage framework for domain-adaptive continual pretraining that selectively expands model layers based on functional importance and applies asymmetric learning rates to balance general and domain knowledge. It resides in the Adaptive and Selective Parameter Expansion leaf, which contains only two papers within the broader Continual Pretraining Methodologies and Frameworks branch. This is a relatively sparse research direction compared to more crowded areas like Medical and Healthcare Domains (ten papers) or General Training Strategies (seven papers), suggesting the work targets a less explored methodological niche.

The taxonomy reveals that ADEPT's immediate neighbors include Catastrophic Forgetting Mitigation Techniques (five papers) and General Training Strategies (seven papers), both addressing stability and optimization during continual pretraining. The sibling paper in the same leaf, AdapterSwap, focuses on modular adapter mechanisms rather than base model expansion, highlighting a methodological divergence. Nearby branches like Cross-Lingual Adaptation and Application Domains emphasize different axes (language transfer and domain-specific corpora), while ADEPT concentrates on architecture-level adaptation strategies. The scope note clarifies that this leaf excludes uniform full-parameter methods, positioning ADEPT as a selective, function-aware alternative.

Among the twenty-four candidates examined, none clearly refutes the three main contributions. For the functional specialization perspective, ten candidates were examined with zero refutable matches, suggesting limited prior work explicitly frames continual pretraining through layer-wise functional roles. For the ADEPT framework itself, four candidates were examined, again with no refutations, indicating the two-stage design combining selective expansion and decoupled tuning may be novel within the search scope. For the empirical validation across mathematical and medical domains, ten candidates were examined without refutation, though this likely reflects the limited search scale rather than absolute novelty, as domain-specific benchmarking is common in the broader taxonomy.

Based on the limited search scope of twenty-four semantically similar candidates, the work appears to introduce a distinct methodological angle—function-aware parameter expansion—within a relatively sparse taxonomy leaf. The absence of refutable prior work across all contributions suggests potential novelty, though the small candidate pool and narrow semantic search radius mean this analysis cannot confirm whether similar ideas exist in adjacent methodological spaces or domain-specific literature not captured by top-K retrieval.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: continual pretraining for large language model domain adaptation. The field addresses how to efficiently adapt general-purpose LLMs to specialized domains, such as medicine, finance, or telecommunications, by continuing pretraining on domain-specific corpora.

The taxonomy reveals several main branches: Continual Pretraining Methodologies and Frameworks explores techniques for stable and efficient adaptation, including adaptive parameter expansion and selective updating strategies; Cross-Lingual and Multilingual Adaptation focuses on extending models to new languages or multilingual settings; Application Domains and Specialized Corpora encompasses domain-specific efforts in areas like healthcare, e-commerce, and geotechnical engineering; Evaluation and Benchmarking Studies examines how to measure adaptation success; Toolkits and Infrastructure provides practical systems for deployment; and Retrieval-Augmented and Hybrid Approaches combines pretraining with external knowledge retrieval. Representative works illustrate these themes: Lifelong Pretraining[3] and Continual Pretraining[4] establish foundational methodologies, while domain-specific efforts like ChipNeMo[25], EcomGPT[8], and TelecomGPT[6] demonstrate application-driven adaptation.

A particularly active line of work investigates the trade-offs between full continual pretraining and lighter-weight alternatives. Some studies explore parameter-efficient methods to reduce computational costs while maintaining domain performance, as seen in Efficient Domain Pretraining[7] and Domain Finetuning Strategies[14]. Others question whether continual pretraining is always necessary, with Continual Pretraining Not Needed[32] challenging conventional assumptions. Within this landscape, ADEPT[0] sits in the Adaptive and Selective Parameter Expansion cluster, emphasizing dynamic model growth to balance capacity and efficiency.
This approach contrasts with neighboring work like AdapterSwap[49], which focuses on modular adapter mechanisms rather than expanding the base model. ADEPT[0] addresses the challenge of catastrophic forgetting and stability during adaptation—issues also explored in Stability Gap Mitigation[28] and Investigating Continual Pretraining[5]—by selectively adding parameters where domain knowledge is most needed, offering a middle ground between full retraining and purely modular strategies.

Claimed Contributions

Functional specialization perspective for continual pretraining

The authors demonstrate through pilot studies that LLMs exhibit functional specialization where layers and units differentially encode general-critical capabilities. They argue that parameter expansion and optimization should be function-aware, with targeted layer expansion and decoupled training as a principled solution to domain adaptation.
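The layer-importance probing behind this claim can be illustrated with a toy sketch. The report does not state ADEPT's actual importance metric, so the ablation-based proxy, the function names, and all numbers below are hypothetical:

```python
# Hypothetical sketch: ranking layers by general-domain importance.
# Proxy used here: how much a (simulated) general-domain loss increases
# when a layer's contribution is ablated. Layers whose ablation barely
# moves the loss are "least critical" and become expansion candidates.

def layer_importance(base_loss, ablated_losses):
    """Importance of each layer = loss increase when that layer is ablated."""
    return [ablated - base_loss for ablated in ablated_losses]

def least_critical_layers(importance, k):
    """Indices of the k layers whose ablation hurts general performance least."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    return ranked[:k]

# Toy numbers: general-domain loss after ablating each of five layers.
base_loss = 2.0
ablated_losses = [2.9, 2.1, 3.5, 2.05, 2.4]
imp = layer_importance(base_loss, ablated_losses)
print(least_critical_layers(imp, 2))  # -> [3, 1]
```

Any importance signal with the same shape (one score per layer, lower meaning safer to duplicate) would plug into this selection step.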

10 retrieved papers
ADEPT framework with two-stage design

The authors introduce ADEPT, a two-stage continual pretraining framework. The first stage selectively duplicates the layers least critical for the general domain to increase capacity. The second stage decouples parameter units within the expanded layers and assigns asymmetric learning rates to balance knowledge injection and retention.
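The two stages described above can be sketched on a toy "model" represented as a list of layer names. This is a minimal illustration under assumed conventions (which layers to expand, and the learning-rate values), not the paper's implementation:

```python
# Hypothetical sketch of ADEPT's two stages on a toy model.
# Stage 1: duplicate selected (least general-critical) layers.
# Stage 2: give duplicated layers a higher learning rate (knowledge
# injection) and originals a lower one (knowledge retention).

def expand_layers(layers, expand_ids):
    """Stage 1: insert a trainable copy after each selected layer."""
    out = []
    for i, layer in enumerate(layers):
        out.append((layer, False))               # original layer
        if i in expand_ids:
            out.append((layer + "_copy", True))  # duplicated, domain-facing
    return out

def assign_learning_rates(expanded, lr_new=1e-4, lr_retain=1e-5):
    """Stage 2: asymmetric rates keyed on whether a layer is a duplicate."""
    return {name: (lr_new if is_copy else lr_retain)
            for name, is_copy in expanded}

expanded = expand_layers(["L0", "L1", "L2", "L3"], expand_ids={1, 3})
print([name for name, _ in expanded])
# -> ['L0', 'L1', 'L1_copy', 'L2', 'L3', 'L3_copy']
lrs = assign_learning_rates(expanded)
```

In a real setup, the per-layer rates would map onto per-parameter-group options of the optimizer; the paper's unit-wise decoupling further splits rates *within* each expanded layer, which this layer-granularity sketch does not capture.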

4 retrieved papers
Empirical validation across mathematical and medical domains

The authors perform comprehensive experiments showing ADEPT outperforms full-parameter continual pretraining by up to 5.76% on general benchmarks and 5.58% on target domain benchmarks, while tuning only 15% of the parameters and using less than 50% of the training time.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Functional specialization perspective for continual pretraining

The authors demonstrate through pilot studies that LLMs exhibit functional specialization where layers and units differentially encode general-critical capabilities. They argue that parameter expansion and optimization should be function-aware, with targeted layer expansion and decoupled training as a principled solution to domain adaptation.

Contribution

ADEPT framework with two-stage design

The authors introduce ADEPT, a two-stage continual pretraining framework. The first stage selectively duplicates the layers least critical for the general domain to increase capacity. The second stage decouples parameter units within the expanded layers and assigns asymmetric learning rates to balance knowledge injection and retention.

Contribution

Empirical validation across mathematical and medical domains

The authors perform comprehensive experiments showing ADEPT outperforms full-parameter continual pretraining by up to 5.76% on general benchmarks and 5.58% on target domain benchmarks, while tuning only 15% of the parameters and using less than 50% of the training time.