Rethinking JEPA: Compute‑Efficient Video Self-Supervised Learning with Frozen Teachers

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: SALT, video SSL, video representation learning, masked video modeling, MAE, JEPA, latent space prediction
Abstract:

Video Joint Embedding Predictive Architectures (V‑JEPA) learn generalizable off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)‑updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked‑latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel‑reconstruction objective under V‑JEPA masking, then (ii) freeze it and train a student to predict the teacher’s latents on masked regions. This leads to a two‑stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the ability of representations to generalize under frozen evaluation. Empirically, our student models outperform recently proposed V-JEPA 2 encoders under frozen backbone evaluation across diverse benchmarks. They are also more compute‑optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V‑JEPA’s accuracy–FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This suggests that the pretraining compute budget should overwhelmingly favor the student. These results position SALT as a simple, scalable, and compute‑efficient alternative to EMA‑based self‑distillation for video representation learning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SALT, a two-stage framework that decouples pixel reconstruction (frozen teacher) from masked latent prediction (student) for video self-supervised learning. It resides in the 'Masked Latent Prediction with Static Teachers' leaf, which contains only two papers total. This is a sparse research direction within the broader taxonomy of 16 papers across multiple branches, suggesting the specific combination of frozen teachers and masked latent prediction remains relatively underexplored compared to momentum-based or cross-modal approaches.

The taxonomy reveals neighboring work in 'Image Foundation Model Adaptation for Video' (3 papers) and 'Hybrid Reconstruction and Distillation Objectives' (2 papers), both exploring frozen or semi-frozen teacher paradigms but with different architectural focuses. The sibling paper in the same leaf likely shares the static teacher premise but may differ in masking strategy or training objectives. Meanwhile, the 'Teacher-Student Frameworks with Dynamic or Momentum Updates' branch (2 papers) represents the contrasting paradigm SALT explicitly moves away from, highlighting the field's ongoing debate between static versus adaptive teacher mechanisms.

Across all three contributions, 21 candidates were examined in total. For the SALT framework contribution, 1 of 8 examined candidates is refutable, suggesting some prior exploration of static-teacher masked prediction exists but is limited in scope. The compute-efficiency claim was checked against 3 candidates with none refuting, indicating this angle may be less contested. The weak-teacher strong-student phenomenon was checked against 10 candidates with 3 refutable, pointing to more substantial prior work on teacher-student capacity mismatches, though the specific frozen-teacher context may differentiate this work from the general distillation literature.

Based on top-21 semantic matches, the paper appears to occupy a relatively sparse niche within frozen-teacher video SSL, though certain conceptual elements (teacher-student dynamics, masked prediction) connect to broader established themes. The limited search scope means adjacent work in momentum-based methods or recent V-JEPA variants may not be fully captured, and the refutability signals reflect overlap within this constrained candidate set rather than exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 4

Research Landscape Overview

Core task: video self-supervised learning with frozen teacher encoders. The field organizes around several complementary strategies for learning video representations without manual labels. Frozen Teacher Architectures and Training Strategies focus on methods that keep a pre-trained teacher network fixed during student training, often employing masked prediction objectives where the student reconstructs latent features from partial observations. Teacher-Student Frameworks with Dynamic or Momentum Updates explore alternatives that allow the teacher to evolve gradually, either through momentum-based updates or other adaptive mechanisms. Domain-Specific Video Self-Supervised Learning tailors these ideas to specialized contexts such as medical imaging or action recognition, while Cross-Modal and Multimodal Self-Supervised Learning leverages alignment between video and other modalities like text or audio. Training-Free and Zero-Shot Video Understanding investigates how pre-trained models can generalize without further fine-tuning, and Self-Similarity and Non-Local Methods exploit spatial or temporal redundancies within video data.

Within the Frozen Teacher branch, masked latent prediction with static teachers has attracted considerable attention. Works such as Unmasked Teacher[1] and Frozen CLIP Learners[2] demonstrate that a fixed teacher can provide stable supervision signals for student encoders, avoiding the complexity of momentum schedules. Rethinking JEPA[0] sits squarely in this cluster, re-examining the joint-embedding predictive architecture paradigm and proposing refinements to improve representation quality when the teacher remains frozen. Nearby, Rethinking JEPA[3] and Advancing Video SSL[5] explore related masked prediction strategies, though they may differ in architectural choices or the specific masking schemes employed.
A central question across these efforts is how to balance the simplicity of a static teacher against the potential benefits of adaptive or momentum-driven updates, as seen in works like Momentum Contrastive Teacher[8]. Rethinking JEPA[0] contributes to this ongoing dialogue by clarifying design principles for frozen-teacher setups and highlighting trade-offs that inform future architectural decisions.

Claimed Contributions

SALT: Static-teacher Asymmetric Latent Training framework

The authors propose SALT, a two-stage video representation learning method that decouples teacher and student training. Stage 1 trains a target encoder with pixel reconstruction under V-JEPA masking, then Stage 2 freezes it and trains a student to predict the teacher's latents on masked regions, eliminating the need for EMA-based self-distillation.
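The two-stage decoupling can be sketched in a few lines. The toy example below uses random linear maps as stand-in "encoders" on flattened patch tokens; all shapes, weight names, and the simplified mask are illustrative assumptions, not the paper's actual architecture or masking scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

# Toy "encoders": random linear maps (illustrative stand-ins, not the real ViTs).
D_PIX, D_LAT = 32, 16
W_teacher = 0.1 * rng.normal(size=(D_PIX, D_LAT))
W_student = 0.1 * rng.normal(size=(D_PIX, D_LAT))
W_decoder = 0.1 * rng.normal(size=(D_LAT, D_PIX))

tokens = rng.normal(size=(10, D_PIX))   # 10 patch tokens from a clip
mask = np.arange(10) % 2 == 0           # simplified stand-in for V-JEPA masking

# Stage 1: train the teacher with pixel reconstruction on masked tokens
# (only the loss is evaluated here; training would update W_teacher/W_decoder).
teacher_latents = tokens @ W_teacher
stage1_loss = mse((teacher_latents @ W_decoder)[mask], tokens[mask])

# Stage 2: freeze the teacher; the student predicts its latents on masked tokens.
frozen_targets = teacher_latents[mask]  # fixed targets; no gradient to the teacher
student_pred = (tokens @ W_student)[mask]
stage2_loss = mse(student_pred, frozen_targets)
```

The key structural point the sketch captures is that Stage 2 optimizes only the student against fixed targets, so no EMA schedule or stop-gradient bookkeeping on a moving teacher is needed.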

8 retrieved papers
Can Refute
Compute-efficient alternative to EMA-based self-distillation

SALT achieves better compute efficiency than V-JEPA by allocating minimal compute to a small frozen teacher and focusing resources on the student. At matched pretraining FLOPs, SALT achieves higher probing accuracy and its scaling curves dominate V-JEPA's accuracy–FLOPs Pareto frontier.

3 retrieved papers
Weak-teacher, strong-student phenomenon

The authors demonstrate that high-quality students can be trained from much smaller and weaker frozen teachers, challenging the assumption that strong pretrained encoders are necessary. This finding suggests compute budgets should overwhelmingly favor the student over the teacher.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SALT: Static-teacher Asymmetric Latent Training framework

The authors propose SALT, a two-stage video representation learning method that decouples teacher and student training. Stage 1 trains a target encoder with pixel reconstruction under V-JEPA masking, then Stage 2 freezes it and trains a student to predict the teacher's latents on masked regions, eliminating the need for EMA-based self-distillation.

Contribution

Compute-efficient alternative to EMA-based self-distillation

SALT achieves better compute efficiency than V-JEPA by allocating minimal compute to a small frozen teacher and focusing resources on the student. At matched pretraining FLOPs, SALT achieves higher probing accuracy and its scaling curves dominate V-JEPA's accuracy–FLOPs Pareto frontier.

Contribution

Weak-teacher, strong-student phenomenon

The authors demonstrate that high-quality students can be trained from much smaller and weaker frozen teachers, challenging the assumption that strong pretrained encoders are necessary. This finding suggests compute budgets should overwhelmingly favor the student over the teacher.