Understanding the Learning Phases in Self-Supervised Learning via Critical Periods

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Learning Phases, Critical Periods, Self-Supervised Learning
Abstract:

Self-supervised learning (SSL) has emerged as a powerful pretraining strategy for learning transferable representations from unlabeled data. Yet it remains unclear how long SSL models should be pretrained for such representations to emerge. Contrary to the prevailing heuristic that longer pretraining translates to better downstream performance, we identify a transferability trade-off: across diverse SSL settings, intermediate checkpoints often yield stronger out-of-domain (OOD) generalization, whereas additional pretraining primarily benefits in-domain (ID) accuracy. From this observation, we hypothesize that SSL progresses through learning phases that can be characterized through the lens of critical periods (CP). Prior work on CP has shown that supervised models exhibit early phases of high plasticity, followed by a consolidation phase in which adaptability declines while task-specific performance keeps increasing. Since traditional CP analysis depends on supervised labels, we rethink CP for SSL in two ways. First, we inject deficits that perturb the pretraining data and measure the quality of the learned representations via downstream tasks. Second, to estimate network plasticity during pretraining, we compute the Fisher Information matrix on the pretext objectives, quantifying the sensitivity of model parameters to the supervisory signal defined by the pretext tasks. Our experiments demonstrate that SSL models do exhibit their own CP, with CP closure marking a sweet spot where representations are neither underdeveloped nor overfitted to the pretext task. Leveraging these insights, we propose CP-guided checkpoint selection as a mechanism for identifying intermediate checkpoints during SSL that improve OOD transferability.
Finally, to balance the transferability trade-off, we propose CP-guided self-distillation, which selectively distills layer representations from the sweet-spot (CP closure) checkpoint into their overspecialized counterparts in the final pretrained model.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates the temporal dynamics of self-supervised pretraining, identifying a transferability trade-off in which intermediate checkpoints yield stronger out-of-domain generalization while extended pretraining benefits in-domain accuracy. It resides in the 'Learning Phase Characterization' leaf under 'Theoretical Foundations and Mechanisms', which contains only this paper among the 50 papers spread across 19 leaf nodes. This placement indicates a relatively sparse research direction focused specifically on temporal phase analysis during SSL pretraining, distinguishing it from the more populated methodological and application-oriented branches.

The taxonomy reveals neighboring theoretical work in 'Transferability Analysis and Measurement' (3 papers) and 'Representation Learning Principles' (2 papers), which examine transfer capability and feature learning mechanisms but without explicit temporal phase characterization. The broader 'Pretraining Methodologies' branch contains 13 papers across contrastive, generative, and architectural innovations, while 'Transfer Learning Strategies' encompasses 11 papers on adaptation techniques. The paper's focus on learning phases during pretraining positions it at the intersection of theoretical analysis and practical transfer concerns, bridging mechanistic understanding with downstream performance implications.

Among 27 candidates examined across three contributions, no clearly refuting prior work was identified. The transferability trade-off analysis examined 10 candidates with 0 refutations, the critical period reformulation for SSL examined 7 candidates with 0 refutations, and the checkpoint selection intervention examined 10 candidates with 0 refutations. This limited search scope suggests that within the top semantic matches and citation expansions, no prior work explicitly documents the same temporal trade-off phenomenon or applies critical period analysis to self-supervised settings, though the search does not claim exhaustive coverage of all potentially relevant literature.

Based on examination of 27 semantically related candidates, the work appears to occupy a distinct position within SSL research by explicitly characterizing learning phases and their differential impact on in-domain versus out-of-domain transfer. The sparse population of its taxonomy leaf and absence of refuting candidates among examined papers suggest novelty in this specific analytical framing, though the limited search scope means potentially relevant work outside the top-K semantic neighborhood may exist but was not captured in this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: understanding learning phases and transferability in self-supervised pretraining. The field has organized itself around four major branches. Theoretical Foundations and Mechanisms investigates the underlying principles governing how self-supervised models learn and generalize, including phase transitions and critical periods during training. Pretraining Methodologies and Architectures encompasses the diverse algorithmic strategies—contrastive methods like MoCo Chest Xray[29], masked modeling approaches, and hybrid techniques—that enable models to extract useful representations from unlabeled data. Transfer Learning Strategies and Adaptation focuses on how pretrained representations are fine-tuned or adapted to downstream tasks, exploring questions of domain shift, few-shot learning as in Surgical Phases Few-Shot[50], and parameter-efficient adaptation methods such as those studied in BatchNorm Finetuning Transfer[5]. Application Domains and Empirical Studies documents the breadth of real-world deployments, from medical imaging in SSL Skin Cancer[10] and Retinal Multimodal SSL[13] to remote sensing in Consecutive Pretraining Remote Sensing[23] and specialized domains like Seismic Fault Transformer[17].

A particularly active line of work examines the temporal dynamics of pretraining: when and how representations become useful, and whether certain learning windows are more critical than others. Critical Periods SSL[0] sits squarely within this theoretical inquiry, characterizing distinct phases during self-supervised training and their impact on downstream transferability. This contrasts with more application-driven studies that take pretrained models as given and focus on adaptation strategies, such as BatchNorm Finetuning Transfer[5], which explores efficient fine-tuning by selectively updating normalization layers.

Meanwhile, works like Big SSL Semi-Supervised[3] bridge pretraining and semi-supervised learning, highlighting trade-offs between label efficiency and representation quality. By situating learning phase characterization within the broader theoretical branch, Critical Periods SSL[0] complements empirical transfer studies and offers mechanistic insights into why certain pretraining regimes yield more robust or adaptable features.

Claimed Contributions

Identification of transferability trade-off in SSL pretraining

The authors demonstrate that extended SSL pretraining creates a trade-off where intermediate checkpoints achieve better out-of-domain generalization, whereas longer pretraining primarily benefits in-domain accuracy. This challenges the prevailing heuristic that longer pretraining always improves downstream performance.

10 retrieved papers
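The selection logic behind this claimed trade-off can be sketched in a few lines: given linear-probe accuracies for a set of saved checkpoints on ID and OOD benchmarks, the checkpoint maximizing each metric is picked independently, and the two need not coincide. The accuracy numbers below are synthetic placeholders for illustration, not results from the paper.

```python
def best_checkpoint(scores, metric):
    """Return the epoch whose checkpoint maximizes the given probe metric."""
    return max(scores, key=lambda epoch: scores[epoch][metric])

# epoch -> {"id": in-domain probe accuracy, "ood": out-of-domain probe accuracy}
# (synthetic values shaped to illustrate the claimed trade-off)
probe_scores = {
    100: {"id": 0.62, "ood": 0.48},
    300: {"id": 0.71, "ood": 0.55},   # hypothetical CP-closure sweet spot
    600: {"id": 0.75, "ood": 0.53},
    1000: {"id": 0.77, "ood": 0.50},  # longest pretraining: best ID, weaker OOD
}

id_best = best_checkpoint(probe_scores, "id")    # -> 1000
ood_best = best_checkpoint(probe_scores, "ood")  # -> 300
print(id_best, ood_best)
```

Under the paper's claim, `ood_best` would systematically precede `id_best`, which is exactly the pattern CP-guided checkpoint selection is meant to exploit.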
Reformulation of critical period analysis for SSL

The authors adapt critical period analysis from supervised learning to SSL by injecting deficits into pretraining data and computing Fisher Information on pretext objectives rather than class labels. This reformulation enables tracking plasticity phases during SSL pretraining without requiring downstream supervision.

7 retrieved papers
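One common way to operationalize "Fisher Information on a pretext objective" is the diagonal empirical Fisher: the mean of squared per-sample gradients of the pretext loss, with its trace used as a scalar plasticity signal. The sketch below assumes this form with a toy linear model and a toy regression-style pretext target; it is not the authors' implementation, only a minimal illustration of the quantity.

```python
import numpy as np

def fisher_diag(w, X, t):
    """Diagonal empirical Fisher: mean of squared per-sample loss gradients.

    Per-sample pretext loss: 0.5 * (x @ w - t)^2, so the per-sample
    gradient w.r.t. w is (x @ w - t) * x.
    """
    residuals = X @ w - t                      # shape (n,)
    per_sample_grads = residuals[:, None] * X  # shape (n, d)
    return (per_sample_grads ** 2).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
t = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=256)

w_early = np.zeros(4)                      # untrained model: large residuals
w_late = np.array([1.0, -2.0, 0.5, 0.0])   # near-converged: small residuals

trace_early = fisher_diag(w_early, X, t).sum()
trace_late = fisher_diag(w_late, X, t).sum()
print(trace_early, trace_late)  # trace shrinks as the pretext task is solved
```

Tracking this trace over pretraining epochs is the kind of signal that can mark high-plasticity phases and their decline at CP closure, without ever touching downstream labels.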
CP-guided checkpoint selection and self-distillation interventions

The authors introduce two practical methods leveraging critical period insights: CP-guided checkpoint selection identifies intermediate checkpoints at CP closure for improved OOD transfer, and CP-guided self-distillation selectively distills early-layer representations from CP checkpoints into final models to balance the transferability trade-off.

10 retrieved papers
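The self-distillation part of this contribution can be sketched as a feature-matching penalty: the final model (student) keeps its own pretext objective, while a subset of layers flagged as overspecialized is pulled toward the corresponding representations of the CP-closure checkpoint (teacher). The loss form, layer choice, and `alpha` weighting below are assumptions for illustration, not the authors' exact objective.

```python
import numpy as np

def distill_loss(student_feats, teacher_feats, layers, alpha=0.5):
    """Alpha-weighted mean-squared error between student and teacher
    feature maps, averaged over the selected layers only."""
    total = 0.0
    for name in layers:
        diff = student_feats[name] - teacher_feats[name]
        total += np.mean(diff ** 2)
    return alpha * total / len(layers)

rng = np.random.default_rng(1)
# teacher: features from the hypothetical CP-closure checkpoint
teacher = {f"layer{i}": rng.normal(size=(8, 16)) for i in range(4)}
# student: final pretrained model, drifted away from the teacher
student = {k: v + 0.3 * rng.normal(size=v.shape) for k, v in teacher.items()}

# distill only the layers flagged as overspecialized (hypothetical choice)
loss = distill_loss(student, teacher, layers=["layer2", "layer3"])
print(loss)
```

In training, this term would be added to the student's pretext loss, so that OOD-friendly structure from the sweet-spot checkpoint is retained without discarding the ID gains of continued pretraining.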

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of transferability trade-off in SSL pretraining

The authors demonstrate that extended SSL pretraining creates a trade-off where intermediate checkpoints achieve better out-of-domain generalization, whereas longer pretraining primarily benefits in-domain accuracy. This challenges the prevailing heuristic that longer pretraining always improves downstream performance.

Contribution

Reformulation of critical period analysis for SSL

The authors adapt critical period analysis from supervised learning to SSL by injecting deficits into pretraining data and computing Fisher Information on pretext objectives rather than class labels. This reformulation enables tracking plasticity phases during SSL pretraining without requiring downstream supervision.

Contribution

CP-guided checkpoint selection and self-distillation interventions

The authors introduce two practical methods leveraging critical period insights: CP-guided checkpoint selection identifies intermediate checkpoints at CP closure for improved OOD transfer, and CP-guided self-distillation selectively distills early-layer representations from CP checkpoints into final models to balance the transferability trade-off.