LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
Overview
Overall Novelty Assessment
The paper introduces LLM-JEPA, a joint embedding predictive architecture for language models that optimizes in embedding space rather than input space. In the taxonomy, this work occupies the 'Joint Embedding Predictive Architectures' leaf under 'Embedding-Space Optimization and Alignment'. Notably, this leaf contains only the original paper itself; no sibling papers exist in this specific category. This positioning suggests the work addresses a relatively unexplored direction within the broader field of embedding-space training objectives, which encompasses fifty papers across approximately thirty-six distinct topics.
The taxonomy reveals that neighboring research directions are well-populated. Adjacent leaves include 'Embedding Alignment and Steering' (three papers), 'Embedding Regularization and Stabilization' (two papers), and 'Embedding Perturbation and Exploration' (two papers). The broader 'Contrastive and Similarity-Based Learning' branch contains multiple active sub-areas, including text embedding via contrastive objectives and geometric optimization methods. The scope note for the original paper's leaf explicitly excludes contrastive learning on text pairs, distinguishing JEPA-style predictive objectives from the more established contrastive paradigm that dominates much of the embedding-space literature.
Among thirty candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three main contributions. For the core LLM-JEPA objective, ten candidates were examined with zero refutable matches. The custom attention mask implementation and the empirical validation contributions likewise each had ten candidates examined, with no clear overlap with prior work. This limited search scope (thirty papers rather than an exhaustive review) suggests that, within the examined literature, the specific combination of JEPA-style objectives applied to language model training appears relatively novel, though the analysis cannot rule out relevant work outside this search radius.
The analysis indicates apparent novelty within the examined scope, particularly given the absence of sibling papers in the taxonomy leaf and zero refutations across thirty candidates. However, the limited search scale means this assessment reflects top-K semantic matches rather than comprehensive field coverage. The work's positioning at the intersection of vision-inspired architectures and language model training may explain why standard semantic search yields few direct precedents, though related ideas in embedding-space optimization and predictive learning exist in neighboring taxonomy branches.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce LLM-JEPA, a novel training objective that combines the standard next-token prediction loss with a Joint Embedding Predictive Architecture (JEPA) term. This approach operates in embedding space using different views (e.g., text and code) while preserving the generative capabilities of LLMs, adapting JEPA methods from vision to language models.
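The shape of such a combined objective can be sketched as follows. This is a hypothetical illustration, not the paper's code: the pooling of view embeddings, the cosine-based distance, and the weighting hyperparameter `lam` are all assumptions for the sake of the sketch.

```python
import torch
import torch.nn.functional as F

def llm_jepa_loss(logits, targets, emb_text, emb_code, lam=1.0):
    """Illustrative sketch of an LLM-JEPA-style objective (assumed form).

    logits:   (batch, seq, vocab) next-token logits for the text view
    targets:  (batch, seq) shifted target token ids for the text view
    emb_text: (batch, dim) pooled/predicted embedding of the text view
    emb_code: (batch, dim) pooled embedding of the code view
    lam:      assumed weight balancing the two terms
    """
    # Generative term: standard next-token cross-entropy, preserved so the
    # model keeps its generation capabilities.
    ntp = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # JEPA term: pull the (predicted) text-view embedding toward the
    # code-view embedding in embedding space. A cosine distance is assumed
    # here; the paper's exact predictor and metric may differ.
    jepa = 1.0 - F.cosine_similarity(emb_text, emb_code, dim=-1).mean()
    return ntp + lam * jepa
```

When the two view embeddings coincide, the JEPA term vanishes and the objective reduces to the standard next-token loss, which is the sense in which the generative objective is preserved.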
The authors develop a custom attention mask mechanism that enables computing embeddings of different views in a single forward pass by making the self-attention causal per block. This implementation reduces the computational overhead from three forward passes to two, making LLM-JEPA more practical.
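A per-block causal mask of this kind can be sketched as below. This is a minimal illustration under assumptions (boolean mask convention, views packed back-to-back into one sequence); it is not the authors' implementation.

```python
import torch

def block_causal_mask(lengths):
    """Build a per-block causal attention mask (illustrative sketch).

    `lengths` lists the token count of each view packed into one sequence.
    Attention is causal within each block and disallowed across blocks, so
    a single forward pass yields an independent embedding per view.
    Returns a (total, total) boolean mask where True = attention allowed.
    """
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in lengths:
        # Lower-triangular (causal) attention restricted to this block.
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        start += n
    return mask
```

For example, `block_causal_mask([5, 3])` yields an 8x8 mask whose first 5x5 and last 3x3 diagonal blocks are causal, with all cross-block entries False, so the text and code views never attend to each other despite sharing one forward pass.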
The authors provide comprehensive empirical evidence demonstrating that LLM-JEPA outperforms standard LLM training objectives across multiple model families, datasets, model sizes, and both finetuning and pretraining scenarios. They show consistent improvements and robustness to overfitting.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
LLM-JEPA: A JEPA-based training objective for LLMs
The authors introduce LLM-JEPA, a novel training objective that combines the standard next-token prediction loss with a Joint Embedding Predictive Architecture (JEPA) term. This approach operates in embedding space using different views (e.g., text and code) while preserving the generative capabilities of LLMs, adapting JEPA methods from vision to language models.
[31] KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation
[71] Connecting joint-embedding predictive architecture with contrastive self-supervised learning
[72] ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning
[73] MuLan: A Joint Embedding of Music Audio and Natural Language
[74] JEPA4Rec: Learning Effective Language Representations for Sequential Recommendation via Joint Embedding Predictive Architecture
[75] Joint Embedding Predictive Architectures Focus on Slow Features
[76] V-jepa 2: Self-supervised video models enable understanding, prediction and planning
[77] Multimodal machine learning with large language embedding model for polymer property prediction
[78] M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture
[79] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Custom attention mask for efficient JEPA implementation
The authors develop a custom attention mask mechanism that enables computing embeddings of different views in a single forward pass by making the self-attention causal per block. This implementation reduces the computational overhead from three forward passes to two, making LLM-JEPA more practical.
[51] MAGVIT: Masked Generative Video Transformer
[52] Scene Transformer: A unified architecture for predicting multiple agent trajectories
[53] MATE: Multi-view Attention for Table Transformer Efficiency
[54] Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery
[55] Probabilistic temporal masked attention for cross-view online action detection
[56] Incomplete multi-view clustering with cross-view generation via pre-trained transformer
[57] SS-MVMETRO: Semi-supervised multi-view human mesh recovery transformer
[58] Geometry-guided diffusion model with masked transformer for robust multi-view 3d human pose estimation
[59] Multi-view masked world models for visual robotic manipulation
[60] Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers
Empirical validation across models, datasets, and training scenarios
The authors provide comprehensive empirical evidence demonstrating that LLM-JEPA outperforms standard LLM training objectives across multiple model families, datasets, model sizes, and both finetuning and pretraining scenarios. They show consistent improvements and robustness to overfitting.