Rethinking JEPA: Compute-Efficient Video Self-Supervised Learning with Frozen Teachers
Overview
Overall Novelty Assessment
The paper proposes SALT, a two-stage framework that decouples pixel reconstruction (frozen teacher) from masked latent prediction (student) for video self-supervised learning. It resides in the 'Masked Latent Prediction with Static Teachers' leaf, which contains only two papers. Within the broader taxonomy of 16 papers spanning multiple branches, this is a sparse research direction, suggesting the specific combination of frozen teachers and masked latent prediction remains relatively underexplored compared to momentum-based or cross-modal approaches.
The taxonomy reveals neighboring work in 'Image Foundation Model Adaptation for Video' (3 papers) and 'Hybrid Reconstruction and Distillation Objectives' (2 papers), both exploring frozen or semi-frozen teacher paradigms but with different architectural focuses. The sibling paper in the same leaf likely shares the static teacher premise but may differ in masking strategy or training objectives. Meanwhile, the 'Teacher-Student Frameworks with Dynamic or Momentum Updates' branch (2 papers) represents the contrasting paradigm SALT explicitly moves away from, highlighting the field's ongoing debate between static versus adaptive teacher mechanisms.
Among the 21 candidates examined, the SALT framework contribution has 1 refutable candidate out of the 8 checked against it, suggesting some prior exploration of static-teacher masked prediction exists but is limited in scope. The compute-efficiency claim was checked against 3 candidates, with none refuting it, indicating this angle may be less contested. The weak-teacher, strong-student phenomenon was checked against 10 candidates, 3 of them refutable, pointing to more substantial prior work on teacher-student capacity mismatches, though the specific frozen-teacher context may differentiate this work from the general distillation literature.
Based on top-21 semantic matches, the paper appears to occupy a relatively sparse niche within frozen-teacher video SSL, though certain conceptual elements (teacher-student dynamics, masked prediction) connect to broader established themes. The limited search scope means adjacent work in momentum-based methods or recent V-JEPA variants may not be fully captured, and the refutability signals reflect overlap within this constrained candidate set rather than exhaustive field coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose SALT, a two-stage video representation learning method that decouples teacher and student training. Stage 1 trains a target encoder with pixel reconstruction under V-JEPA masking, then Stage 2 freezes it and trains a student to predict the teacher's latents on masked regions, eliminating the need for EMA-based self-distillation.
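The two-stage recipe can be sketched concretely. The following is a minimal toy illustration, not the paper's implementation: the tiny MLP encoders, dimensions, alternating token mask, and zeroing of masked tokens are all hypothetical stand-ins for the actual video architecture and V-JEPA masking.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hid, d_out):
    """Tiny two-layer MLP standing in for a video encoder."""
    return {"W1": rng.normal(0.0, 0.1, (d_in, d_hid)),
            "W2": rng.normal(0.0, 0.1, (d_hid, d_out))}

def forward(params, x):
    return np.maximum(x @ params["W1"], 0.0) @ params["W2"]

D, N = 16, 32                           # token dim, tokens per clip (toy sizes)
tokens = rng.normal(0.0, 1.0, (N, D))   # stand-in for tubelet pixel tokens
mask = np.arange(N) % 2 == 0            # stand-in for the V-JEPA token mask
visible = np.where(mask[:, None], 0.0, tokens)  # masked tokens zeroed out

# Stage 1: train a small teacher by reconstructing pixels at masked positions.
teacher = init_mlp(D, 32, D)
recon = forward(teacher, visible)
stage1_loss = np.mean((recon[mask] - tokens[mask]) ** 2)

# Stage 2: freeze the teacher; a larger student predicts the frozen teacher's
# latents at masked positions. No EMA copy of the student is maintained.
student = init_mlp(D, 64, D)
targets = forward(teacher, tokens)      # frozen-teacher latents for the clip
preds = forward(student, visible)
stage2_loss = np.mean((preds[mask] - targets[mask]) ** 2)
```

The key structural point the sketch captures is that the Stage 2 target network is a fixed function: its outputs can even be precomputed, which is what removes the EMA bookkeeping of momentum-teacher methods.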
SALT achieves better compute efficiency than V-JEPA by allocating minimal compute to a small frozen teacher and focusing resources on the student. At matched pretraining FLOPs, SALT achieves higher probing accuracy and its scaling curves dominate V-JEPA's accuracy–FLOPs Pareto frontier.
The authors demonstrate that high-quality students can be trained from much smaller and weaker frozen teachers, challenging the assumption that strong pretrained encoders are necessary. This finding suggests compute budgets should overwhelmingly favor the student over the teacher.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
Contribution Analysis
Detailed comparisons for each claimed contribution
SALT: Static-teacher Asymmetric Latent Training framework
The authors propose SALT, a two-stage video representation learning method that decouples teacher and student training. Stage 1 trains a target encoder with pixel reconstruction under V-JEPA masking, then Stage 2 freezes it and trains a student to predict the teacher's latents on masked regions, eliminating the need for EMA-based self-distillation.
[27] V-jepa: Latent video prediction for visual representation learning
[3] Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
[12] Contrastive Masked Video Modeling for Coronary Angiography Diagnosis
[28] Autoencoders as cross-modal teachers: Can pretrained 2d image transformers help 3d representation learning?
[29] Aligning What Matters: Masked Latent Adaptation for Text-to-Audio-Video Generation
[30] VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning
[31] CAE v2: Context autoencoder with CLIP latent alignment
[32] InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Compute-efficient alternative to EMA-based self-distillation
SALT achieves better compute efficiency than V-JEPA by allocating minimal compute to a small frozen teacher and focusing resources on the student. At matched pretraining FLOPs, SALT achieves higher probing accuracy and its scaling curves dominate V-JEPA's accuracy–FLOPs Pareto frontier.
[3] Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
[33] Santa: Source anchoring network and target alignment for continual test time adaptation
[34] Combined Representation for Adept Learning (CORAL)
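The Pareto-dominance claim above lends itself to a mechanical check. Below is a minimal sketch; the (FLOPs, accuracy) points are purely illustrative placeholders, not numbers from the paper.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point, i.e. no other
    point has <= FLOPs and >= accuracy with at least one inequality strict."""
    frontier = []
    for flops, acc in points:
        dominated = any(
            (f2 <= flops and a2 >= acc) and (f2 < flops or a2 > acc)
            for f2, a2 in points
        )
        if not dominated:
            frontier.append((flops, acc))
    return sorted(frontier)

# Hypothetical (pretraining FLOPs, probing accuracy) scaling points.
salt = [(1e18, 70.0), (2e18, 73.0), (4e18, 75.0)]
vjepa = [(1e18, 68.0), (2e18, 71.0), (4e18, 74.0)]

# "Dominates the accuracy-FLOPs Pareto frontier" = every baseline point is
# strictly beaten in accuracy by some SALT point at no greater compute.
dominates = all(
    any(fs <= fv and a_s > av for fs, a_s in salt) for fv, av in vjepa
)
print(dominates)  # True for these illustrative numbers
```

With these placeholder curves, the combined frontier consists entirely of SALT points, which is exactly what a "dominated Pareto frontier" claim asserts.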
Weak-teacher, strong-student phenomenon
The authors demonstrate that high-quality students can be trained from much smaller and weaker frozen teachers, challenging the assumption that strong pretrained encoders are necessary. This finding suggests compute budgets should overwhelmingly favor the student over the teacher.