Rethinking JEPA: Compute‑Efficient Video Self-Supervised Learning with Frozen Teachers

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: SALT, video SSL, video representation learning, masked video modeling, MAE, JEPA, latent space prediction
Abstract:

Video Joint Embedding Predictive Architectures (V‑JEPA) learn generalizable off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)‑updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked‑latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel‑reconstruction objective under V‑JEPA masking, then (ii) freeze it and train a student to predict the teacher’s latents on masked regions. This leads to a two‑stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the ability of representations to generalize under frozen evaluation. Empirically, our student models outperform recently proposed V-JEPA 2 encoders under frozen backbone evaluation across diverse benchmarks. They are also more compute‑optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V‑JEPA’s accuracy–FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This suggests that the pretraining compute budget should overwhelmingly favor the student. These results position SALT as a simple, scalable, and compute‑efficient alternative to EMA‑based self‑distillation for video representation learning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SALT, a two-stage framework that decouples pixel reconstruction (frozen teacher) from masked latent prediction (student) for video self-supervised learning. It resides in the 'Masked Latent Prediction with Static Teachers' leaf, which contains only two papers total. This is a sparse research direction within the broader taxonomy of 16 papers across multiple branches, suggesting the specific combination of frozen teachers and masked latent prediction remains relatively underexplored compared to momentum-based or cross-modal approaches.

The taxonomy reveals neighboring work in 'Image Foundation Model Adaptation for Video' (3 papers) and 'Hybrid Reconstruction and Distillation Objectives' (2 papers), both exploring frozen or semi-frozen teacher paradigms but with different architectural focuses. The sibling paper in the same leaf likely shares the static teacher premise but may differ in masking strategy or training objectives. Meanwhile, the 'Teacher-Student Frameworks with Dynamic or Momentum Updates' branch (2 papers) represents the contrasting paradigm SALT explicitly moves away from, highlighting the field's ongoing debate between static versus adaptive teacher mechanisms.

Across all three contributions, 21 candidates were examined in total. For the SALT framework contribution, 1 of 8 examined candidates is refutable, suggesting some prior exploration of static-teacher masked prediction exists but is limited in scope. The compute-efficiency claim was checked against 3 candidates with none refuting, indicating this angle may be less contested. The weak-teacher strong-student phenomenon was checked against 10 candidates with 3 refutable, pointing to more substantial prior work on teacher-student capacity mismatches, though the specific frozen-teacher context may differentiate this work from the general distillation literature.

Based on top-21 semantic matches, the paper appears to occupy a relatively sparse niche within frozen-teacher video SSL, though certain conceptual elements (teacher-student dynamics, masked prediction) connect to broader established themes. The limited search scope means adjacent work in momentum-based methods or recent V-JEPA variants may not be fully captured, and the refutability signals reflect overlap within this constrained candidate set rather than exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 4

Research Landscape Overview

Core task: video self-supervised learning with frozen teacher encoders. The field organizes around several complementary strategies for learning video representations without manual labels. Frozen Teacher Architectures and Training Strategies focus on methods that keep a pre-trained teacher network fixed during student training, often employing masked prediction objectives where the student reconstructs latent features from partial observations. Teacher-Student Frameworks with Dynamic or Momentum Updates explore alternatives that allow the teacher to evolve gradually, either through momentum-based updates or other adaptive mechanisms. Domain-Specific Video Self-Supervised Learning tailors these ideas to specialized contexts such as medical imaging or action recognition, while Cross-Modal and Multimodal Self-Supervised Learning leverages alignment between video and other modalities like text or audio. Training-Free and Zero-Shot Video Understanding investigates how pre-trained models can generalize without further fine-tuning, and Self-Similarity and Non-Local Methods exploit spatial or temporal redundancies within video data.

Within the Frozen Teacher branch, masked latent prediction with static teachers has attracted considerable attention. Works such as Unmasked Teacher[1] and Frozen CLIP Learners[2] demonstrate that a fixed teacher can provide stable supervision signals for student encoders, avoiding the complexity of momentum schedules. Rethinking JEPA[0] sits squarely in this cluster, re-examining the joint-embedding predictive architecture paradigm and proposing refinements to improve representation quality when the teacher remains frozen. Nearby, Rethinking JEPA[3] and Advancing Video SSL[5] explore related masked prediction strategies, though they may differ in architectural choices or the specific masking schemes employed.
A central question across these efforts is how to balance the simplicity of a static teacher against the potential benefits of adaptive or momentum-driven updates, as seen in works like Momentum Contrastive Teacher[8]. Rethinking JEPA[0] contributes to this ongoing dialogue by clarifying design principles for frozen-teacher setups and highlighting trade-offs that inform future architectural decisions.

Claimed Contributions

SALT: Static-teacher Asymmetric Latent Training framework

The authors propose SALT, a two-stage video representation learning method that decouples teacher and student training. Stage 1 trains a target encoder with pixel reconstruction under V-JEPA masking, then Stage 2 freezes it and trains a student to predict the teacher's latents on masked regions, eliminating the need for EMA-based self-distillation.
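The two-stage decoupling can be sketched in a few lines. The toy example below uses random linear maps as stand-in "encoders" on flattened patch tokens; all shapes, weight names, and the simplified mask are illustrative assumptions, not the paper's actual architecture or masking scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

# Toy "encoders": random linear maps (illustrative stand-ins, not the real ViTs).
D_PIX, D_LAT = 32, 16
W_teacher = 0.1 * rng.normal(size=(D_PIX, D_LAT))
W_student = 0.1 * rng.normal(size=(D_PIX, D_LAT))
W_decoder = 0.1 * rng.normal(size=(D_LAT, D_PIX))

tokens = rng.normal(size=(10, D_PIX))   # 10 patch tokens from a clip
mask = np.arange(10) % 2 == 0           # simplified stand-in for V-JEPA masking

# Stage 1: train the teacher with pixel reconstruction on masked tokens
# (only the loss is evaluated here; training would update W_teacher/W_decoder).
teacher_latents = tokens @ W_teacher
stage1_loss = mse((teacher_latents @ W_decoder)[mask], tokens[mask])

# Stage 2: freeze the teacher; the student predicts its latents on masked tokens.
frozen_targets = teacher_latents[mask]  # fixed targets; no gradient to the teacher
student_pred = (tokens @ W_student)[mask]
stage2_loss = mse(student_pred, frozen_targets)
```

The key structural point the sketch captures is that Stage 2 optimizes only the student against fixed targets, so no EMA schedule or stop-gradient bookkeeping on a moving teacher is needed.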

8 retrieved papers
Can Refute
Compute-efficient alternative to EMA-based self-distillation

SALT achieves better compute efficiency than V-JEPA by allocating minimal compute to a small frozen teacher and focusing resources on the student. At matched pretraining FLOPs, SALT achieves higher probing accuracy and its scaling curves dominate V-JEPA's accuracy–FLOPs Pareto frontier.

3 retrieved papers
Weak-teacher, strong-student phenomenon

The authors demonstrate that high-quality students can be trained from much smaller and weaker frozen teachers, challenging the assumption that strong pretrained encoders are necessary. This finding suggests compute budgets should overwhelmingly favor the student over the teacher.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SALT: Static-teacher Asymmetric Latent Training framework

The authors propose SALT, a two-stage video representation learning method that decouples teacher and student training. Stage 1 trains a target encoder with pixel reconstruction under V-JEPA masking, then Stage 2 freezes it and trains a student to predict the teacher's latents on masked regions, eliminating the need for EMA-based self-distillation.

Contribution

Compute-efficient alternative to EMA-based self-distillation

SALT achieves better compute efficiency than V-JEPA by allocating minimal compute to a small frozen teacher and focusing resources on the student. At matched pretraining FLOPs, SALT achieves higher probing accuracy and its scaling curves dominate V-JEPA's accuracy–FLOPs Pareto frontier.

Contribution

Weak-teacher, strong-student phenomenon

The authors demonstrate that high-quality students can be trained from much smaller and weaker frozen teachers, challenging the assumption that strong pretrained encoders are necessary. This finding suggests compute budgets should overwhelmingly favor the student over the teacher.