VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: JEPA, VLM, video-language, efficiency
Abstract:

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model can focus on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance with 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by approximately 2.85× while maintaining performance similar to dense, non-adaptive uniform decoding. Beyond generation, the embedding-space formulation naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architectural modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2) despite having only 1.6B parameters.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

VL-JEPA introduces a vision-language model that predicts continuous embeddings rather than autoregressively generating tokens, achieving stronger performance with 50% fewer trainable parameters. The paper resides in the Joint Embedding Predictive Architectures leaf, which contains only three papers including VL-JEPA itself. This is a relatively sparse research direction within the broader Embedding Space Alignment branch, suggesting the approach occupies a less crowded niche than more established paradigms such as token-based generation or contrastive retrieval.

The taxonomy tree reveals that Joint Embedding Predictive Architectures sits alongside Structural Embedding Alignment and Semantic Space Optimization within the Embedding Space Alignment branch. Neighboring branches include Unified Multimodal Foundation Models with generative approaches like VL-GPT, and Dual-Stream Architectures featuring frozen language model adaptation methods. VL-JEPA diverges from these by operating entirely in embedding space rather than token space, while sharing the broader goal of robust cross-modal alignment pursued by contrastive methods in Cross-Modal Retrieval and prompt-based adaptation techniques.

Among the thirty candidates examined across the three contributions, no clearly refuting prior work was identified. Each contribution (the VL-JEPA architecture, the selective decoding mechanism, and the unified architecture) was compared against ten candidates, with zero refutable matches. Within the top semantic matches and citation expansions analyzed, no substantially overlapping prior work was detected. The sparse population of the Joint Embedding Predictive Architectures leaf and the absence of refuting candidates among the examined papers indicate that the approach may represent a relatively novel direction, though this analysis is not an exhaustive literature review.

Based on the top-thirty semantic matches and taxonomy structure, VL-JEPA appears to explore a less saturated research direction within vision-language modeling. The limited number of sibling papers and absence of detected overlaps suggest novelty, though this assessment reflects only the examined candidate set rather than comprehensive field coverage. The embedding-space formulation and selective decoding mechanism appear distinctive within the analyzed scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: predicting continuous embeddings for vision-language tasks. The field organizes around several complementary strategies for learning shared representations that bridge visual and textual modalities. Unified Multimodal Foundation Models such as PaLM-E[10] and VL-GPT[26] integrate vision and language within a single architecture, while Dual-Stream Architectures like ViLBERT[2] maintain separate encoders before fusion. Embedding Space Alignment focuses on methods that directly optimize for joint representations, including contrastive approaches and Joint Embedding Predictive Architectures exemplified by VL-JEPA[0] and Monet[37]. Cross-Modal Retrieval emphasizes efficient matching between modalities, and Prompt Learning and Adaptation explores how to steer pretrained models toward new tasks. Additional branches address Continual Learning to prevent catastrophic forgetting, Specialized Application Domains ranging from robotics to medical imaging, and Memory and Context Modeling for temporal reasoning. Foundational Concepts and Surveys provide overarching perspectives on multimodal embeddings.

Recent work highlights tensions between architectural complexity and alignment quality. Joint Embedding Predictive Architectures, where VL-JEPA[0] resides, pursue self-supervised objectives that predict latent representations rather than raw pixels or tokens, contrasting with next-token prediction methods like those in Next-Token Prediction[50]. Nearby efforts such as Multimodal Chain Thought[9] and Monet[37] explore structured reasoning and modular encoding within similar embedding frameworks. Meanwhile, Continuous Memory VLM[3] and related memory-augmented approaches tackle long-context scenarios that challenge static embedding models. A recurring question across these branches concerns the trade-off between end-to-end joint training and modular designs that preserve pretrained knowledge.
VL-JEPA[0] aligns closely with predictive embedding methods that emphasize latent-space objectives, positioning itself as an alternative to autoregressive generation while sharing the broader goal of robust cross-modal alignment pursued by works like BEiT Pretraining[5] and Clap4clip[6].

Claimed Contributions

VL-JEPA architecture for vision-language tasks

The authors propose VL-JEPA, a novel vision-language model that predicts continuous semantic embeddings in latent space rather than generating discrete tokens autoregressively like classical VLMs. This architecture learns in an abstract representation space, focusing on task-relevant semantics while abstracting away surface-level linguistic variability.

10 retrieved papers
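As a rough illustration of this contribution, the sketch below trains a predictor to match a target text embedding under a cosine-distance loss in latent space, rather than a token-level cross-entropy. The predictor architecture, dimensions, and choice of loss are illustrative assumptions, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions; the paper's actual sizes are not specified here.
D_VISION, D_EMB = 1024, 768

# Stand-in for VL-JEPA's trainable predictor, mapping vision features into
# the text-embedding space. The target embedding would come from a frozen
# text encoder; here both inputs are random placeholders.
predictor = torch.nn.Sequential(
    torch.nn.Linear(D_VISION, D_EMB),
    torch.nn.GELU(),
    torch.nn.Linear(D_EMB, D_EMB),
)

def jepa_loss(vision_feats: torch.Tensor, target_text_emb: torch.Tensor) -> torch.Tensor:
    """Predict the target text embedding and score it in latent space.

    Unlike token-level cross-entropy, the objective compares continuous
    vectors, so paraphrases with the same meaning incur little penalty.
    """
    pred = F.normalize(predictor(vision_feats), dim=-1)   # (B, D_EMB)
    target = F.normalize(target_text_emb, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()     # mean cosine distance

batch = torch.randn(4, D_VISION)       # placeholder vision features
targets = torch.randn(4, D_EMB)        # placeholder target text embeddings
loss = jepa_loss(batch, targets)       # scalar in [0, 2], differentiable
```

Because the loss lives in embedding space, gradient updates pull the prediction toward the semantics of the caption rather than toward one specific token sequence.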
Selective decoding mechanism for real-time video applications

The authors demonstrate that VL-JEPA's non-autoregressive design enables a selective decoding strategy where text decoding occurs only when significant semantic changes are detected in the predicted embedding stream. This reduces decoding operations substantially while preserving performance, making it suitable for real-time streaming video applications.

10 retrieved papers
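A minimal sketch of such a strategy follows, assuming cosine distance between the current predicted embedding and the last decoded one as the change signal, with an arbitrary threshold of 0.2. Both the signal and the threshold are assumptions for illustration; the paper's actual criterion may differ.

```python
import torch
import torch.nn.functional as F

def selective_decode(embeddings, decode_fn, threshold=0.2):
    """Decode text only when the predicted embedding drifts past a threshold.

    `embeddings` is a stream of per-timestep predicted embeddings and
    `decode_fn` stands in for the lightweight text decoder. The first
    timestep is always decoded; later ones only on semantic change.
    """
    outputs, last = [], None
    for t, emb in enumerate(embeddings):
        emb = F.normalize(emb, dim=-1)
        if last is None or (1.0 - torch.dot(emb, last)).item() > threshold:
            outputs.append((t, decode_fn(emb)))  # semantic change: decode
            last = emb                           # new anchor for comparison
    return outputs

# Toy stream: six near-identical embeddings, then a jump to a new direction,
# mimicking a video where the scene changes once.
base = F.normalize(torch.randn(768), dim=-1)
stream = [base + 0.01 * torch.randn(768) for _ in range(6)]
stream.append(torch.randn(768))
decoded = selective_decode(stream, decode_fn=lambda e: "<caption>")
# decoded holds entries for timesteps 0 and 6 only: 2 decodes for 7 frames.
```

The reduction factor reported in the abstract (about 2.85×) would then correspond to how often the embedding stream crosses the threshold on real video data.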
Unified architecture for multiple vision-language tasks

The authors show that VL-JEPA's embedding-based formulation enables a single unified model architecture to handle diverse tasks including generation, open-vocabulary classification, text-to-video retrieval, and discriminative visual question answering, without requiring task-specific architectural modifications.

10 retrieved papers
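The sketch below shows how a single predicted embedding can serve several tasks through similarity alone. Here `embed_texts` is a toy, hash-seeded stand-in for a frozen text encoder; the function names and dimensions are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def embed_texts(texts):
    """Toy deterministic stand-in for a frozen text encoder (hash-seeded)."""
    vecs = []
    for t in texts:
        g = torch.Generator().manual_seed(abs(hash(t)) % (2**31))
        vecs.append(torch.randn(768, generator=g))
    return F.normalize(torch.stack(vecs), dim=-1)

def classify(pred_emb, class_names):
    """Open-vocabulary classification: the nearest class embedding wins."""
    sims = F.normalize(pred_emb, dim=-1) @ embed_texts(class_names).T
    return class_names[sims.argmax().item()]

def retrieve(query_emb, video_embs, k=3):
    """Text-to-video retrieval: rank candidate videos by cosine similarity."""
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    return sims.topk(min(k, len(video_embs))).indices.tolist()

# One predicted embedding, reused for both tasks with no architecture change.
classes = ["a person cooking", "a dog running", "a car driving"]
pred = embed_texts(["a dog running"])[0]   # pretend VL-JEPA predicted this
label = classify(pred, classes)
videos = torch.randn(10, 768)              # placeholder video embeddings
top = retrieve(pred, videos, k=3)
```

Discriminative VQA fits the same pattern: score each candidate answer's embedding against the prediction and pick the maximum.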

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
