TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: zero-shot reinforcement learning, unsupervised reinforcement learning, self-predictive representations, joint embedding predictive architecture
Abstract:

Latent prediction, where agents learn by predicting their own latents, has emerged as a powerful paradigm for training general representations in machine learning. In reinforcement learning (RL), this approach has been explored to define auxiliary losses for a variety of settings, including reward-based and unsupervised RL, behavior cloning, and world modeling. While existing methods are typically limited to single-task learning, one-step prediction, or on-policy trajectory data, we show that temporal difference (TD) learning makes it possible to learn representations predictive of long-term latent dynamics across multiple policies from offline, reward-free transitions. Building on this, we introduce TD-JEPA, which brings TD-based latent-predictive representations to unsupervised RL. TD-JEPA trains explicit state and task encoders, a policy-conditioned multi-step predictor, and a set of parameterized policies directly in latent space. This enables zero-shot optimization of any reward function at test time. Theoretically, we show that an idealized variant of TD-JEPA avoids collapse with proper initialization, and learns encoders that capture a low-rank factorization of long-term policy dynamics, while the predictor recovers their successor features in latent space. Empirically, TD-JEPA matches or outperforms state-of-the-art baselines on locomotion, navigation, and manipulation tasks across 13 datasets in ExoRL and OGBench, especially in the challenging setting of zero-shot RL from pixels.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TD-JEPA, which applies temporal-difference learning to train latent-predictive representations for zero-shot reinforcement learning. It resides in the 'Self-Predictive Latent Representations' leaf, which contains five papers including the original work. This leaf sits within the broader 'Latent Dynamics Prediction and World Modeling' branch, indicating a moderately populated research direction focused on learning forward models in latent space. The sibling papers explore related themes such as compositional structure, disentanglement, and bootstrapping-based prediction, suggesting an active but not overcrowded subfield where different architectural and objective choices are still being explored.

The taxonomy reveals that TD-JEPA's leaf is adjacent to 'World Model-Based Planning and Control', which emphasizes model-predictive control rather than representation learning, and 'Reward-Free and Passive Data Learning', which focuses on learning from observational data without reward signals. The paper's emphasis on policy-conditioned multi-step prediction and zero-shot task adaptation also connects it to the 'Cross-Task and Multi-Task Generalization' branch, though it remains distinct by prioritizing latent dynamics over explicit task encoders. The taxonomy's scope and exclude notes clarify that TD-JEPA's focus on TD-based objectives differentiates it from planning-centric world models and from methods requiring reward signals during training.

Among 23 candidates examined across three contributions, none were flagged as clearly refuting the paper's claims. The first contribution (TD-based latent-predictive representations) examined three candidates with no refutations, suggesting limited prior work directly combining TD learning with policy-conditioned multi-step latent prediction. The second contribution (TD-JEPA algorithm) and third contribution (theoretical analysis) each examined ten candidates, again with no refutations. This indicates that within the limited search scope, the specific combination of TD objectives, explicit state and task encoders, and zero-shot optimization in latent space appears relatively unexplored, though the search scale is modest and may not capture all relevant prior work.

Based on the limited literature search of 23 candidates, TD-JEPA appears to occupy a distinct position within the self-predictive latent representations subfield. The absence of refutable prior work among examined candidates suggests novelty in its specific technical approach, though the search scope does not guarantee exhaustive coverage of related methods in successor features, world modeling, or unsupervised RL. The taxonomy context indicates the paper contributes to an active but not saturated research direction, where different strategies for learning predictive latent dynamics are still being actively developed and compared.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: learning latent-predictive representations for zero-shot reinforcement learning. This field centers on building compact latent encodings that capture predictive structure in sequential decision problems, enabling agents to generalize to novel tasks or environments without task-specific fine-tuning.

The taxonomy reveals several complementary research directions. Latent Dynamics Prediction and World Modeling focuses on learning forward models that simulate future states in latent space, often through self-predictive objectives as seen in Self-Predictive Representations[2] and Bootstrap Latent-Predictive[43]. Generalization and Transfer Across Tasks and Domains addresses how learned representations can be reused across different problem instances, while Reward-Predictive and Task-Conditioned Representations emphasize encoding goal-relevant information. Unsupervised and Self-Supervised Representation Learning explores methods like Contrastive Predictive Coding[19] that extract structure without explicit reward signals. Cross-Modal and Multi-Modal Representation Learning tackles scenarios where agents must integrate diverse sensory inputs, and Uncertainty Quantification and Robustness examines how to handle distributional shift and model confidence.

Within the world modeling branch, a handful of works explore different strategies for self-prediction in latent space. Self-Predictive Combinatorial[1] investigates compositional structure, while Disentangled Predictive[45] aims to separate independent factors of variation. TD-JEPA[0] sits naturally in this cluster, emphasizing temporal-difference style objectives for learning predictive embeddings that support zero-shot transfer. Compared to Bootstrap Latent-Predictive[43], which relies on bootstrapping target networks, TD-JEPA[0] integrates temporal-difference learning more directly into the representation objective.

Meanwhile, Regularized Latent Dynamics[5] highlights the importance of regularization to prevent overfitting in learned world models. These contrasting approaches reflect ongoing questions about how best to balance predictive accuracy, computational efficiency, and generalization: whether to prioritize disentanglement, compositional reasoning, or robust temporal consistency when building latent representations for zero-shot RL.

Claimed Contributions

TD-based latent-predictive representations for multi-step, policy-conditioned dynamics

The authors introduce a novel temporal-difference loss for latent-predictive representation learning that models multi-step, policy-conditioned dynamics from offline data. Unlike prior methods limited to single-step prediction or on-policy data, this approach learns representations that capture long-term features relevant for value estimation across multiple policies.

3 retrieved papers
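Based on the abstract's description, the TD objective bootstraps a policy-conditioned predictor against next-state latents, much like successor-feature TD learning with a frozen target network. The sketch below is illustrative only: `encoder`, `predictor`, and all dimensions are assumed stand-ins, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4          # latent dimension (assumed, for illustration)
gamma = 0.98   # discount factor

def encoder(s):
    # stand-in state encoder phi(s); here just an identity map
    return np.eye(len(s)) @ s

def predictor(phi_s, z):
    # stand-in policy-conditioned predictor psi(s, z), linear in phi_s;
    # in practice this would be a learned network with a frozen target copy
    return phi_s + 0.1 * z

# one offline, reward-free transition (s, s') and a policy/task latent z
s, s_next = rng.normal(size=d), rng.normal(size=d)
z = rng.normal(size=d)

# TD target: next-state latent plus discounted bootstrapped prediction
# from the next state -- analogous to successor-feature TD learning
target = encoder(s_next) + gamma * predictor(encoder(s_next), z)
td_error = predictor(encoder(s), z) - target
loss = float(np.mean(td_error ** 2))
```

The bootstrapped target is what lets the representation capture multi-step, long-horizon latent dynamics from single transitions, rather than being limited to one-step prediction.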

TD-JEPA algorithm for zero-shot unsupervised RL

The authors propose TD-JEPA, a zero-shot unsupervised RL algorithm that jointly trains state encoders, task encoders, policy-conditioned predictors, and parameterized policies end-to-end from offline reward-free transitions. The method enables zero-shot optimization of any reward function at test time entirely in latent space.

10 retrieved papers
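The zero-shot step at test time can be sketched as regressing a task latent from a handful of reward-labelled states, in the style of forward-backward methods; the names `B`, `r`, and `z` below are our assumptions for illustration, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 8   # number of reward-labelled samples, task-latent dimension

# stand-in task-encoder outputs B(s) for n states, and their rewards r(s)
B = rng.normal(size=(n, d))
r = rng.normal(size=n)

# infer the task latent z such that B(s)^T z ~= r(s), via least squares;
# no gradient steps on the encoders or policies are needed at test time
z, *_ = np.linalg.lstsq(B, r, rcond=None)

# the pretrained policy conditioned on z would then be executed zero-shot
```

Because all components are pretrained from reward-free transitions, adapting to a new reward function reduces to this single linear regression in latent space.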

Theoretical analysis connecting TD-JEPA to successor features and policy evaluation

The authors provide theoretical guarantees showing that TD-JEPA with linear predictors avoids representation collapse, recovers a low-rank factorization of successor measures, and minimizes an upper bound on policy evaluation error. These results build on a novel gradient matching argument that generalizes existing analyses of latent-predictive representations.

10 retrieved papers
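In forward-backward-style notation (the symbols below are ours, not necessarily the paper's), the claimed low-rank factorization of successor measures and the recovery of successor features can be written as:

```latex
% Successor measure of policy \pi, factorized by the two encoders
% \psi_\pi (predictor side) and \phi (state encoder), w.r.t. a data
% distribution \rho -- a hedged sketch of the stated result:
M^{\pi}(s, X) \;=\; \sum_{t \ge 0} \gamma^{t}\, \Pr\!\left(s_{t} \in X \mid s_{0} = s, \pi\right)
\;\approx\; \int_{X} \psi_{\pi}(s)^{\top} \phi(s')\, \rho(\mathrm{d}s'),

% the predictor then recovers successor features in latent space:
\psi_{\pi}(s) \;\approx\; \mathbb{E}_{\pi}\!\Big[\, \sum_{t \ge 0} \gamma^{t}\, \phi(s_{t}) \;\Big|\; s_{0} = s \Big].
```

Under such a factorization, the value of any reward expressible as $r(s') \approx \phi(s')^{\top} z$ reduces to the inner product $\psi_{\pi}(s)^{\top} z$, which is what makes zero-shot policy evaluation in latent space possible.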

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: TD-based latent-predictive representations for multi-step, policy-conditioned dynamics

Contribution 2: TD-JEPA algorithm for zero-shot unsupervised RL

Contribution 3: Theoretical analysis connecting TD-JEPA to successor features and policy evaluation

As summarized in the Overview, none of the candidates examined for these three contributions refuted the paper's claims; the full contribution descriptions appear under Claimed Contributions above.
