Abstract:

Large Language Model (LLM) pretraining, finetuning, and evaluation rely on input-space reconstruction and generative capabilities. Yet, it has been observed in vision that embedding-space training objectives, e.g., with Joint Embedding Predictive Architectures (JEPAs), are far superior to their input-space counterparts. That mismatch in how training is performed in language versus vision raises a natural question: {\em can language training methods learn a few tricks from the vision ones?} The lack of JEPA-style LLMs is a testament to the challenge of designing such objectives for language. In this work, we propose a first step in that direction where we develop LLM-JEPA, a JEPA-based solution for LLMs applicable both to finetuning and pretraining. Thus far, LLM-JEPA is able to outperform the standard LLM training objectives by a significant margin across models, all while being robust to overfitting. Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider, RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2 and Olmo families. Code: \url{https://anonymous.4open.science/r/llm-jepa-0C6F/README.md}.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LLM-JEPA, a joint embedding predictive architecture for language models that optimizes in embedding space rather than input space. According to the taxonomy, this work occupies the 'Joint Embedding Predictive Architectures' leaf under 'Embedding-Space Optimization and Alignment'. Notably, this leaf contains only the original paper itself: no sibling papers exist in this specific category. This positioning suggests the work addresses a relatively unexplored direction within the broader field of embedding-space training objectives, which encompasses fifty papers across approximately thirty-six distinct topics.

The taxonomy reveals that neighboring research directions are well-populated. Adjacent leaves include 'Embedding Alignment and Steering' (three papers), 'Embedding Regularization and Stabilization' (two papers), and 'Embedding Perturbation and Exploration' (two papers). The broader 'Contrastive and Similarity-Based Learning' branch contains multiple active sub-areas, including text embedding via contrastive objectives and geometric optimization methods. The scope note for the original paper's leaf explicitly excludes contrastive learning on text pairs, distinguishing JEPA-style predictive objectives from the more established contrastive paradigm that dominates much of the embedding-space literature.

Among thirty candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three main contributions. For the core LLM-JEPA objective, ten candidates were examined with zero refutable matches. The custom attention mask implementation and the empirical validation contributions likewise each had ten candidates examined, again with no clear prior-work overlap. This limited search scope (thirty papers rather than an exhaustive review) means that, within the examined literature, the specific combination of JEPA-style objectives applied to language model training appears relatively novel, though the analysis cannot rule out relevant work outside this search radius.

The analysis indicates apparent novelty within the examined scope, particularly given the absence of sibling papers in the taxonomy leaf and zero refutations across thirty candidates. However, the limited search scale means this assessment reflects top-K semantic matches rather than comprehensive field coverage. The work's positioning at the intersection of vision-inspired architectures and language model training may explain why standard semantic search yields few direct precedents, though related ideas in embedding-space optimization and predictive learning exist in neighboring taxonomy branches.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: embedding-space training objectives for large language models. The field encompasses a broad spectrum of approaches that shape how LLMs learn and represent information in continuous vector spaces. At the highest level, the taxonomy divides into branches addressing optimization and alignment strategies, contrastive and similarity-based learning, multimodal and multilingual embeddings, generative and predictive architectures, adversarial robustness, knowledge integration, application-specific training, interpretability, diversity optimization, and emerging specialized uses.

Some branches focus on refining the geometric properties of embeddings, such as Contrastive and Similarity-Based Learning, which includes works like AnglE[21] and NV-Embed[12], while others emphasize cross-domain or cross-modal integration, as seen in Multimodal and Cross-Modal Embeddings. Branches like Embedding-Space Optimization and Alignment explore how to steer or align representations through techniques ranging from adversarial training (Adversarial Training LLMs[2]) to structural projection methods (Structural Embedding Projection[4]) and continuous reasoning in latent space (Continuous Latent Reasoning[5]).

A distinct, still sparsely populated line of work centers on joint embedding predictive architectures, which aim to learn representations by predicting one embedding from another without relying solely on reconstruction or contrastive losses. LLM-JEPA[0] exemplifies this direction, proposing a framework that leverages predictive objectives in embedding space to improve generalization and alignment. This approach contrasts with more traditional contrastive methods like those in Improving Text Embeddings[1] or similarity-focused techniques such as Word Embeddings Steers[3], which emphasize pairwise relationships or direct steering of token-level representations.
Compared to Continuous Latent Reasoning[5], which explores reasoning pathways within latent spaces, LLM-JEPA[0] focuses more explicitly on the predictive architecture itself as a training objective. Open questions remain around how these predictive objectives scale, how they interact with downstream task performance, and whether they offer robustness advantages over purely contrastive or generative alternatives.

Claimed Contributions

LLM-JEPA: A JEPA-based training objective for LLMs

The authors introduce LLM-JEPA, a novel training objective that combines the standard next-token prediction loss with a Joint Embedding Predictive Architecture (JEPA) term. This approach operates in embedding space using different views (e.g., text and code) while preserving the generative capabilities of LLMs, adapting JEPA methods from vision to language models.
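Schematically, such an objective adds a predictive embedding term to the usual next-token loss, e.g. L = L_NTP + lam * d(Pred(Enc(text)), Enc(code)). The sketch below is a minimal illustration under stated assumptions only: the cosine distance, the caller-supplied `predictor`, and the weight `lam` are hypothetical placeholders, not the authors' implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def llm_jepa_loss(ntp_loss, text_emb, code_emb, predictor, lam=1.0):
    """Standard next-token-prediction loss plus a JEPA term that
    predicts the code view's embedding from the text view's.
    (Illustrative sketch; distance metric and predictor are assumptions.)"""
    pred = predictor(text_emb)  # predict the code-view embedding from the text view
    return ntp_loss + lam * cosine_distance(pred, code_emb)

# Toy usage: with an identity predictor and identical views,
# the JEPA term vanishes and only the NTP loss remains.
loss = llm_jepa_loss(2.0, [1.0, 0.0], [1.0, 0.0], predictor=lambda e: e)
```

Note that the generative (next-token) term is untouched, which is how the report describes LLM-JEPA preserving generative capabilities while regularizing in embedding space.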

10 retrieved papers
Custom attention mask for efficient JEPA implementation

The authors develop a custom attention mask that makes self-attention causal per block, so the embeddings of all views can be computed in a single forward pass. This reduces the overall computational overhead from three forward passes per step to two, making LLM-JEPA more practical.
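A per-block causal mask of this kind can be sketched as follows. This is a minimal illustration assuming the views are packed into one sequence and a boolean convention where `True` means "visible"; the function name and interface are hypothetical, not the authors' code.

```python
def block_causal_mask(block_lengths):
    """Boolean mask over a packed sequence of views: entry [i][j] is True
    when token j is visible to token i.  Attention is causal *within* each
    block and fully blocked *across* blocks, so each view's embedding is
    computed independently in a single forward pass."""
    block_of = []  # block index of every position in the packed sequence
    for b, length in enumerate(block_lengths):
        block_of.extend([b] * length)
    n = len(block_of)
    return [[block_of[i] == block_of[j] and j <= i for j in range(n)]
            for i in range(n)]

# Two views of length 2 packed together: tokens of the second view
# never attend to the first view, and vice versa.
mask = block_causal_mask([2, 2])
```

In practice such a mask would be passed to the model's attention layers in place of the usual lower-triangular causal mask.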

10 retrieved papers
Empirical validation across models, datasets, and training scenarios

The authors provide comprehensive empirical evidence demonstrating that LLM-JEPA outperforms standard LLM training objectives across multiple model families, datasets, model sizes, and both finetuning and pretraining scenarios. They show consistent improvements and robustness to overfitting.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

LLM-JEPA: A JEPA-based training objective for LLMs

The authors introduce LLM-JEPA, a novel training objective that combines the standard next-token prediction loss with a Joint Embedding Predictive Architecture (JEPA) term. This approach operates in embedding space using different views (e.g., text and code) while preserving the generative capabilities of LLMs, adapting JEPA methods from vision to language models.

Contribution

Custom attention mask for efficient JEPA implementation

The authors develop a custom attention mask that makes self-attention causal per block, so the embeddings of all views can be computed in a single forward pass. This reduces the overall computational overhead from three forward passes per step to two, making LLM-JEPA more practical.

Contribution

Empirical validation across models, datasets, and training scenarios

The authors provide comprehensive empirical evidence demonstrating that LLM-JEPA outperforms standard LLM training objectives across multiple model families, datasets, model sizes, and both finetuning and pretraining scenarios. They show consistent improvements and robustness to overfitting.