PACE: Pretrained Audio Continual Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Audio recognition, Continual Learning, Incremental Learning, Catastrophic forgetting
Abstract:

Audio is a fundamental modality for analyzing speech, music, and environmental sounds. While pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world scenarios where data distributions evolve over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs) and provide a comprehensive analysis of its unique challenges. Unlike in the vision domain where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly applying such strategies to audio leads to poor performance. This is due to a fundamental property of audio backbones: they emphasize low-level spectral details rather than structured semantics, resulting in severe upstream–downstream misalignment. Through extensive empirical analysis, we identify a promising technical route based on analytic classifiers with first-session adaptation (FSA), but also uncover two major limitations: representation saturation in coarse-grained scenarios and representation shifts in fine-grained scenarios. To address these challenges, we propose PACE, an innovative method that improves FSA via a regularized analytic classifier and introduces multi-session adaptation through adaptive subspace-orthogonal PEFT for better semantic alignment. Additionally, we design spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments across six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, representing a significant step toward robust and scalable audio CL with PTMs.
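
As a rough illustration of what an analytic classifier over frozen features can look like, the Python sketch below fits a ridge-regression head in closed form and accumulates its statistics session by session. It is a generic sketch, not the authors' regularized classifier: the class name `RidgeAnalyticClassifier`, the `ridge` parameter, and the use of NumPy are assumptions made for illustration, and the backbone is assumed to stay fixed after first-session adaptation.

```python
import numpy as np


class RidgeAnalyticClassifier:
    """Closed-form ridge head on frozen features (hypothetical helper, not from the paper)."""

    def __init__(self, feat_dim, num_classes, ridge=1.0):
        self.ridge = ridge
        self.gram = np.zeros((feat_dim, feat_dim))      # running sum of X^T X
        self.cross = np.zeros((feat_dim, num_classes))  # running sum of X^T Y
        self.weights = np.zeros((feat_dim, num_classes))

    def fit_session(self, feats, labels):
        """Add one session's statistics and re-solve the regularized normal equations."""
        onehot = np.eye(self.cross.shape[1])[labels]
        self.gram += feats.T @ feats
        self.cross += feats.T @ onehot
        reg = self.ridge * np.eye(self.gram.shape[0])
        self.weights = np.linalg.solve(self.gram + reg, self.cross)

    def predict(self, feats):
        """Return the class index with the highest linear score."""
        return np.argmax(feats @ self.weights, axis=1)
```

Because only the Gram and cross-correlation matrices are stored, fitting sessions one after another yields the same weights as fitting on all data at once, so the head itself does not forget as long as the features stay fixed.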

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the first systematic benchmark for audio continual learning with pretrained models and proposes PACE, a method addressing representation saturation and shifts. It resides in the 'Continual Learning with Pretrained Models' leaf, which contains no sibling papers in the taxonomy. This isolation suggests the research direction is relatively sparse within the surveyed literature, indicating that audio-specific continual learning with pretrained models has received limited prior attention compared to broader continual learning methodologies or foundation model applications in vision and robotics.

The taxonomy places this work within 'Continual Learning Methodologies and Optimization,' adjacent to multi-objective optimization frameworks and single-task learning branches. Neighboring leaves include 'Foundation Models in Vision and Pathology' and 'Foundation Models in Robotics,' which explore pretrained model adaptation in other modalities. The scope note for the parent branch emphasizes sequential learning and adaptive training strategies, while excluding domain-specific applications without methodological contributions. This positioning highlights that the paper bridges methodological innovation (PACE) with domain-specific challenges (audio's low-level spectral emphasis), distinguishing it from purely algorithmic or purely applied studies.

Among the 22 candidates examined, none clearly refutes any of the three main contributions. For the benchmark contribution, 10 candidates were examined with zero refutable matches; for the PACE method, 2 candidates with zero refutations; and for the challenge identification, 10 candidates with zero refutations. Within this limited scope (top-K semantic matches plus citation expansion), no prior work explicitly addresses audio continual learning benchmarks or the specific upstream-downstream misalignment problem. The absence of refutable candidates across all contributions, combined with the sparse taxonomy leaf, indicates the work occupies a relatively unexplored niche.

Based on the limited literature search of 22 candidates, the paper appears to address a gap in audio-specific continual learning with pretrained models. However, the search was not exhaustive across continual learning or audio processing venues, and the taxonomy's sparsity in this leaf may reflect search limitations rather than absolute novelty. The methodological contributions (PACE, first-session adaptation) and empirical findings (representation saturation, spectral misalignment) seem distinct within the examined scope, though broader surveys might reveal related work in adjacent audio or continual learning communities.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

The field of audio continual learning with pretrained models addresses the challenge of adapting large-scale audio representations to sequential tasks without catastrophic forgetting. The taxonomy reveals five major branches: Continual Learning Methodologies and Optimization focuses on algorithmic strategies for mitigating forgetting and enabling incremental updates; Foundation Models and Transfer Learning examines how pretrained representations can be leveraged and fine-tuned across domains; Application Domains and Empirical Studies explores real-world deployments in speech, music, and environmental sound recognition; Research Methodology and Design Frameworks encompasses experimental protocols and evaluation metrics; and Institutional and Policy Objectives addresses broader organizational considerations. Works like Foundation Models Pathology[19] and Foundation Models Robotics[27] illustrate how pretrained architectures are being adapted beyond their original domains, while methodological contributions such as Complement Objective Training[10] and Indicator-based MOEA[9] provide optimization frameworks that balance multiple learning objectives.

Recent efforts reveal a tension between parameter efficiency and task performance, with many studies exploring how to selectively update pretrained weights while preserving prior knowledge. PACE[0] situates itself within the Continual Learning with Pretrained Models branch, emphasizing practical adaptation strategies that build on frozen or partially frozen representations. This contrasts with approaches like RORA[7], which may prioritize architectural innovations, or works such as LLM Software Development[14] that focus on deployment pipelines rather than core learning dynamics. The landscape also shows growing interest in multi-objective formulations, as seen in WSN Multi-Objective Survey[45] and Multi-Objective Systems Survey[46], reflecting the need to simultaneously optimize accuracy, memory footprint, and computational cost. Open questions remain around how to best align pretrained features with continually arriving data distributions and whether domain-specific inductive biases can be injected without full retraining.

Claimed Contributions

First systematic benchmark for audio continual learning with pretrained models

The authors construct the first comprehensive benchmark specifically designed to evaluate continual learning methods on pretrained audio models. This benchmark includes six diverse audio datasets spanning coarse-grained and fine-grained tasks, and reveals fundamental challenges unique to the audio domain such as upstream-downstream misalignment and severe representation shifts.

10 retrieved papers
PACE method for pretrained audio continual learning

The authors introduce PACE, a novel continual learning framework that addresses audio-specific challenges through three key components: improved first-session adaptation with layer-aware tuning, multi-session adaptation using adaptive subspace-orthogonal parameter-efficient fine-tuning, and boundary-aware perturbations to enhance representation stability and discriminability.

2 retrieved papers
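
To make the idea of a subspace-orthogonal update more concrete, the following Python sketch projects a candidate weight update onto the orthogonal complement of the input subspace spanned by earlier-session features. This is a generic gradient-projection illustration under stated assumptions, not PACE's adaptive subspace-orthogonal PEFT: the function names `input_subspace` and `project_update`, the `energy` threshold used to pick the rank, and the convention that a layer computes `x @ W` are all hypothetical.

```python
import numpy as np


def input_subspace(feats, energy=0.95):
    """Orthonormal basis (d, k) covering `energy` of the spectral energy of feats (n, d)."""
    _, s, vt = np.linalg.svd(feats, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose cumulative energy reaches the threshold, capped at the rank bound.
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return vt[:k].T


def project_update(delta_w, basis):
    """Remove the part of delta_w (d, out) that acts on directions inside the basis."""
    return delta_w - basis @ (basis.T @ delta_w)
```

With this projection, any input lying in the retained subspace produces the same layer output before and after the update, which is one common way to limit representation shift on earlier sessions; how the subspace is chosen adaptively in PACE is not specified in this report.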
Identification of fundamental audio continual learning challenges

The authors systematically analyze audio continual learning and discover that unlike vision, audio models suffer from representation saturation during early adaptation on coarse-grained tasks and severe representation shifts on fine-grained tasks due to the mismatch between pretraining objectives focused on low-level spectral details and downstream semantic requirements.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.
