VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: JEPA, VLM, video-language, efficiency
Abstract:

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model can focus on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance with 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by approximately 2.85× while maintaining performance similar to dense, non-adaptive uniform decoding. Beyond generation, the embedding-space formulation naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architectural modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2) despite having only 1.6B parameters.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

VL-JEPA introduces a vision-language model that predicts continuous embeddings rather than autoregressively generating tokens, achieving stronger performance with 50% fewer trainable parameters. The paper resides in the Joint Embedding Predictive Architectures leaf, which contains only three papers including VL-JEPA itself. This is a relatively sparse research direction within the broader Embedding Space Alignment branch, suggesting the approach occupies a less crowded niche than more established paradigms such as token-based generation or contrastive retrieval.

The taxonomy tree reveals that Joint Embedding Predictive Architectures sits alongside Structural Embedding Alignment and Semantic Space Optimization within the Embedding Space Alignment branch. Neighboring branches include Unified Multimodal Foundation Models with generative approaches like VL-GPT, and Dual-Stream Architectures featuring frozen language model adaptation methods. VL-JEPA diverges from these by operating entirely in embedding space rather than token space, while sharing the broader goal of robust cross-modal alignment pursued by contrastive methods in Cross-Modal Retrieval and prompt-based adaptation techniques.

Among the thirty candidates examined across the three contributions, no clearly refuting prior work was identified. Each contribution (the VL-JEPA architecture, the selective decoding mechanism, and the unified architecture) was compared against ten candidates, with zero refutable matches. Within the top semantic matches and citation expansions analyzed, no substantially overlapping prior work was detected. The sparse population of the Joint Embedding Predictive Architectures leaf and the absence of refuting candidates among the examined papers indicate that the approach may represent a relatively novel direction, though this analysis is not an exhaustive literature review.

Based on the top-thirty semantic matches and taxonomy structure, VL-JEPA appears to explore a less saturated research direction within vision-language modeling. The limited number of sibling papers and absence of detected overlaps suggest novelty, though this assessment reflects only the examined candidate set rather than comprehensive field coverage. The embedding-space formulation and selective decoding mechanism appear distinctive within the analyzed scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: predicting continuous embeddings for vision-language tasks. The field organizes around several complementary strategies for learning shared representations that bridge visual and textual modalities. Unified Multimodal Foundation Models such as PaLM-E[10] and VL-GPT[26] integrate vision and language within a single architecture, while Dual-Stream Architectures like ViLBERT[2] maintain separate encoders before fusion. Embedding Space Alignment focuses on methods that directly optimize for joint representations, including contrastive approaches and Joint Embedding Predictive Architectures exemplified by VL-JEPA[0] and Monet[37]. Cross-Modal Retrieval emphasizes efficient matching between modalities, and Prompt Learning and Adaptation explores how to steer pretrained models toward new tasks. Additional branches address Continual Learning to prevent catastrophic forgetting, Specialized Application Domains ranging from robotics to medical imaging, and Memory and Context Modeling for temporal reasoning. Foundational Concepts and Surveys provide overarching perspectives on multimodal embeddings.

Recent work highlights tensions between architectural complexity and alignment quality. Joint Embedding Predictive Architectures, where VL-JEPA[0] resides, pursue self-supervised objectives that predict latent representations rather than raw pixels or tokens, contrasting with next-token prediction methods like those in Next-Token Prediction[50]. Nearby efforts such as Multimodal Chain Thought[9] and Monet[37] explore structured reasoning and modular encoding within similar embedding frameworks. Meanwhile, Continuous Memory VLM[3] and related memory-augmented approaches tackle long-context scenarios that challenge static embedding models. A recurring question across these branches concerns the trade-off between end-to-end joint training and modular designs that preserve pretrained knowledge.
VL-JEPA[0] aligns closely with predictive embedding methods that emphasize latent-space objectives, positioning itself as an alternative to autoregressive generation while sharing the broader goal of robust cross-modal alignment pursued by works like BEiT Pretraining[5] and Clap4clip[6].

Claimed Contributions

VL-JEPA architecture for vision-language tasks

The authors propose VL-JEPA, a novel vision-language model that predicts continuous semantic embeddings in latent space rather than generating discrete tokens autoregressively like classical VLMs. This architecture learns in an abstract representation space, focusing on task-relevant semantics while abstracting away surface-level linguistic variability.

10 retrieved papers
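As a rough illustration of this contribution, the sketch below trains a predictor to match a target text embedding under a cosine-distance loss in latent space, rather than a token-level cross-entropy. The predictor architecture, dimensions, and choice of loss are illustrative assumptions, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions; the paper's actual sizes are not specified here.
D_VISION, D_EMB = 1024, 768

# Stand-in for VL-JEPA's trainable predictor, mapping vision features into
# the text-embedding space. The target embedding would come from a frozen
# text encoder; here both inputs are random placeholders.
predictor = torch.nn.Sequential(
    torch.nn.Linear(D_VISION, D_EMB),
    torch.nn.GELU(),
    torch.nn.Linear(D_EMB, D_EMB),
)

def jepa_loss(vision_feats: torch.Tensor, target_text_emb: torch.Tensor) -> torch.Tensor:
    """Predict the target text embedding and score it in latent space.

    Unlike token-level cross-entropy, the objective compares continuous
    vectors, so paraphrases with the same meaning incur little penalty.
    """
    pred = F.normalize(predictor(vision_feats), dim=-1)   # (B, D_EMB)
    target = F.normalize(target_text_emb, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()     # mean cosine distance

batch = torch.randn(4, D_VISION)       # placeholder vision features
targets = torch.randn(4, D_EMB)        # placeholder target text embeddings
loss = jepa_loss(batch, targets)       # scalar in [0, 2], differentiable
```

Because the loss lives in embedding space, gradient updates pull the prediction toward the semantics of the caption rather than toward one specific token sequence.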
Selective decoding mechanism for real-time video applications

The authors demonstrate that VL-JEPA's non-autoregressive design enables a selective decoding strategy where text decoding occurs only when significant semantic changes are detected in the predicted embedding stream. This reduces decoding operations substantially while preserving performance, making it suitable for real-time streaming video applications.

10 retrieved papers
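A minimal sketch of such a strategy follows, assuming cosine distance between the current predicted embedding and the last decoded one as the change signal, with an arbitrary threshold of 0.2. Both the signal and the threshold are assumptions for illustration; the paper's actual criterion may differ.

```python
import torch
import torch.nn.functional as F

def selective_decode(embeddings, decode_fn, threshold=0.2):
    """Decode text only when the predicted embedding drifts past a threshold.

    `embeddings` is a stream of per-timestep predicted embeddings and
    `decode_fn` stands in for the lightweight text decoder. The first
    timestep is always decoded; later ones only on semantic change.
    """
    outputs, last = [], None
    for t, emb in enumerate(embeddings):
        emb = F.normalize(emb, dim=-1)
        if last is None or (1.0 - torch.dot(emb, last)).item() > threshold:
            outputs.append((t, decode_fn(emb)))  # semantic change: decode
            last = emb                           # new anchor for comparison
    return outputs

# Toy stream: six near-identical embeddings, then a jump to a new direction,
# mimicking a video where the scene changes once.
base = F.normalize(torch.randn(768), dim=-1)
stream = [base + 0.01 * torch.randn(768) for _ in range(6)]
stream.append(torch.randn(768))
decoded = selective_decode(stream, decode_fn=lambda e: "<caption>")
# decoded holds entries for timesteps 0 and 6 only: 2 decodes for 7 frames.
```

The reduction factor reported in the abstract (about 2.85×) would then correspond to how often the embedding stream crosses the threshold on real video data.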
Unified architecture for multiple vision-language tasks

The authors show that VL-JEPA's embedding-based formulation enables a single unified model architecture to handle diverse tasks including generation, open-vocabulary classification, text-to-video retrieval, and discriminative visual question answering, without requiring task-specific architectural modifications.

10 retrieved papers
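The sketch below shows how a single predicted embedding can serve several tasks through similarity alone. Here `embed_texts` is a toy, hash-seeded stand-in for a frozen text encoder; the function names and dimensions are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def embed_texts(texts):
    """Toy deterministic stand-in for a frozen text encoder (hash-seeded)."""
    vecs = []
    for t in texts:
        g = torch.Generator().manual_seed(abs(hash(t)) % (2**31))
        vecs.append(torch.randn(768, generator=g))
    return F.normalize(torch.stack(vecs), dim=-1)

def classify(pred_emb, class_names):
    """Open-vocabulary classification: the nearest class embedding wins."""
    sims = F.normalize(pred_emb, dim=-1) @ embed_texts(class_names).T
    return class_names[sims.argmax().item()]

def retrieve(query_emb, video_embs, k=3):
    """Text-to-video retrieval: rank candidate videos by cosine similarity."""
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    return sims.topk(min(k, len(video_embs))).indices.tolist()

# One predicted embedding, reused for both tasks with no architecture change.
classes = ["a person cooking", "a dog running", "a car driving"]
pred = embed_texts(["a dog running"])[0]   # pretend VL-JEPA predicted this
label = classify(pred, classes)
videos = torch.randn(10, 768)              # placeholder video embeddings
top = retrieve(pred, videos, k=3)
```

Discriminative VQA fits the same pattern: score each candidate answer's embedding against the prediction and pick the maximum.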

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
