Unified Vision–Language Modeling via Concept Space Alignment

ICLR 2026 Conference Submission
Anonymous Authors
multimodal embedding space · multilingual embedding space
Abstract:

We introduce vSONAR, a vision–language embedding space extended from the text-only embedding space SONAR, which supports 200 text languages and 37 speech languages. To construct vSONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate vSONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the SONAR text decoder, vSONAR further surpasses state-of-the-art vision–language models on video captioning tasks, including DREAM-1K (BLEU 24.3 vs. 19.6) and VATEX (BLEU 45.0 vs. 41.5).

Leveraging vSONAR, we first demonstrate that the Large Concept Model (LCM), which operates in the SONAR space and was trained on English text only, can perform both single- and multi-visual-concept understanding in a zero-shot manner. Finally, we introduce vLCM, which extends the LCM with vision–language instruction tuning. vLCM encodes vision and language inputs into a unified sequence of latent embeddings via vSONAR and SONAR, and is trained with the same latent diffusion objective for next-embedding prediction used in the LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of vLCM: it matches state-of-the-art vision–language models on tasks covering image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, spanning rich- to low-resource.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces vSONAR, a vision-language embedding space extending the text-only SONAR space to support 200 text and 37 speech languages. It resides in the Foundation Model Pre-Training and Unified Representations leaf, which contains five papers including FLAVA, mPLUG, and ImageBind. This leaf focuses on large-scale pre-training methods creating unified cross-modal embeddings, representing a moderately populated research direction within the broader taxonomy of 50 papers across 36 topics.

The taxonomy tree reveals that vSONAR's leaf sits within Alignment Mechanisms and Architectures, alongside sibling branches for Contrastive and Multi-Level Alignment Frameworks (5 papers), Pre-Alignment and Structural Embedding Strategies (6 papers), and Prompt-Based Adaptation (3 papers). Neighboring branches include Multimodal Extensions (covering audio-visual-language and temporal alignment) and Application Domains (task-specific methods). The scope note clarifies this leaf addresses unified representations through pre-training, excluding task-specific fine-tuning covered elsewhere.

Among the 30 candidates examined across the three contributions, none were found to clearly refute the work. For the vSONAR extension, 10 candidates were examined with 0 refutable; for the post-hoc alignment pipeline, 10 with 0 refutable; and for vLCM, 10 with 0 refutable. This suggests that, within the limited search scope, the specific combination of extending a multilingual text embedding space to vision and applying it to concept-level understanding appears relatively unexplored, though the search scale is modest.

Based on top-30 semantic matches, the work appears to occupy a distinct position combining multilingual text embedding extension with vision-language alignment. The sibling papers in the same leaf pursue different strategies—FLAVA uses masked modeling, ImageBind targets emergent six-modality alignment—suggesting vSONAR's post-hoc mapping approach and focus on massively multilingual support may offer a complementary angle. However, the limited search scope means broader prior work in multilingual vision-language models may exist beyond these candidates.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: vision–language embedding space alignment. The field centers on learning joint representations that bridge visual and textual modalities, enabling models to understand and relate images, videos, and language in a shared semantic space. The taxonomy reveals four main branches.

Alignment Mechanisms and Architectures focuses on foundational techniques such as contrastive learning (e.g., Triple Contrastive Learning[3]), unified pre-training strategies (e.g., FLAVA[10], mPLUG[22]), and architectural innovations that map modalities into common embeddings.

Multimodal Extensions and Cross-Modal Integration explores richer sensory combinations, incorporating video (Video LLaVA[2], Video-LLaMA[34]), audio-visual signals (AVLnet[31]), tactile data (Touch Vision Language[6]), and even brain activity (Brain Visual Linguistic[30]) to build more comprehensive multimodal models.

Application Domains and Task-Specific Alignment addresses specialized settings such as medical imaging (Multi-granularity Medical[8]), remote sensing (Remote Sensing[23]), sign language (Sign Language[38]), and zero-shot detection (Zero-shot Detection[43]), tailoring alignment methods to domain-specific challenges.

Analysis and Robustness of Alignment investigates the quality and stability of learned spaces, examining phenomena such as modality gaps (Modality Gap[49]), linear structure (Linear Structure[11]), and alignment reliability across diverse conditions.

Recent work has intensified around foundation-model pre-training and unified representations, where many studies aim to scale alignment to large vision–language models through improved architectural designs (VILA[26], SPHINX[39], Ovis[5]) and novel training objectives (Align Before Fuse[4], Subspaces Alignment[1]). A key tension lies between general-purpose alignment, which seeks broad transferability, and task-specific tuning, which optimizes for particular downstream applications.
Concept Space Alignment[0] situates itself within the Foundation Model Pre-Training cluster, emphasizing structured concept-level correspondences rather than purely instance-based contrastive signals. This approach contrasts with neighbors like ImageBind[33], which pursues emergent alignment across six modalities through large-scale pairing, and FLAVA[10], which adopts a unified masked modeling framework. By focusing on explicit concept spaces, Concept Space Alignment[0] offers a complementary perspective on how to organize and interpret the semantic structure underlying vision-language embeddings.

Claimed Contributions

vSONAR: vision–language extension of the SONAR embedding space

The authors propose vSONAR, which extends the existing SONAR text and speech embedding space to the image and video modalities through a post-hoc alignment pipeline. This creates a unified embedding space covering four modalities (text, speech, image, video) across 200 languages.

10 retrieved papers
Post-hoc coarse-to-fine alignment pipeline for vision encoder

The authors develop a three-stage curriculum that progressively aligns a vision encoder (Perception Encoder) with the SONAR space, moving from large-scale image captions, to synthetic video captions, to high-quality human-annotated video captions.

10 retrieved papers
vLCM: vision–language instruction-tuned Large Concept Model

The authors extend the Large Concept Model (LCM) to vision–language tasks by encoding multimodal inputs (images, videos, text) into a unified latent space via vSONAR and SONAR, then training with the same latent diffusion objective for next-embedding prediction.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

vSONAR: vision–language extension of the SONAR embedding space

The authors propose vSONAR, which extends the existing SONAR text and speech embedding space to the image and video modalities through a post-hoc alignment pipeline. This creates a unified embedding space covering four modalities (text, speech, image, video) across 200 languages.
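As a rough illustration of what such a post-hoc alignment could look like, the sketch below fits a linear projection from stand-in vision features into a fixed 1024-dimensional sentence-embedding space (SONAR embeddings are 1024-dimensional) by regressing against paired caption embeddings. The projection shape, the MSE objective, and all tensors are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch, assuming a simple linear map and an MSE objective.
import numpy as np

SONAR_DIM = 1024   # SONAR sentence embeddings are 1024-dimensional
VISION_DIM = 1536  # stand-in width for a pooled vision-encoder feature

rng = np.random.default_rng(0)

def init_projection(vision_dim=VISION_DIM, sonar_dim=SONAR_DIM):
    """Linear map W taking vision features into the SONAR space."""
    return rng.normal(scale=0.02, size=(vision_dim, sonar_dim))

def alignment_loss(W, vision_feats, caption_embs):
    """Mean squared error between projected vision features and the
    SONAR embeddings of the paired captions (one plausible objective)."""
    pred = vision_feats @ W
    return float(np.mean((pred - caption_embs) ** 2))

def grad_step(W, vision_feats, caption_embs, lr=1e-3):
    """One gradient step on the squared-error objective."""
    residual = vision_feats @ W - caption_embs
    grad = 2.0 * vision_feats.T @ residual / len(vision_feats)
    return W - lr * grad

# Random tensors stand in for real encoder outputs and caption embeddings.
vision_feats = rng.normal(size=(8, VISION_DIM))
caption_embs = rng.normal(size=(8, SONAR_DIM))
W = init_projection()
before = alignment_loss(W, vision_feats, caption_embs)
W = grad_step(W, vision_feats, caption_embs)
after = alignment_loss(W, vision_feats, caption_embs)
# The step reduces the alignment loss on this batch.
```

In a real pipeline the vision encoder would be a pretrained model and the targets would come from the SONAR text encoder; only the mapping into the frozen space is learned, which is what makes the alignment "post-hoc".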

Contribution

Post-hoc coarse-to-fine alignment pipeline for vision encoder

The authors develop a three-stage curriculum that progressively aligns a vision encoder (Perception Encoder) with the SONAR space, moving from large-scale image captions, to synthetic video captions, to high-quality human-annotated video captions.
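The coarse-to-fine curriculum described above can be sketched as an ordered loop over progressively cleaner data sources, with each stage continuing from the previous stage's checkpoint. The stage names and the toy training function below are placeholders, not the authors' implementation.

```python
# Hypothetical sketch of a three-stage coarse-to-fine curriculum.
CURRICULUM = [
    # (stage name, data source, relative data scale)
    ("stage1_image_captions", "large-scale image-caption pairs", "large"),
    ("stage2_synthetic_video", "synthetic video captions", "medium"),
    ("stage3_human_video", "human-annotated video captions", "small"),
]

def run_curriculum(train_stage):
    """Run the stages in order, feeding each stage's model into the next."""
    model = None
    log = []
    for name, source, scale in CURRICULUM:
        model = train_stage(model, source)
        log.append((name, scale))
    return model, log

# Toy 'training' step that just records which data each stage consumed.
def fake_train(model, source):
    history = [] if model is None else list(model)
    history.append(source)
    return history

model, log = run_curriculum(fake_train)
```

The point of the ordering is that the noisy, abundant data shapes the projection coarsely before the scarce, high-quality video captions refine it.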

Contribution

vLCM: vision–language instruction-tuned Large Concept Model

The authors extend the Large Concept Model (LCM) to vision–language tasks by encoding multimodal inputs (images, videos, text) into a unified latent space via vSONAR and SONAR, then training with the same latent diffusion objective for next-embedding prediction.
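A minimal sketch of a latent-diffusion next-embedding objective in this spirit: noise the target concept embedding, condition a denoiser on the clean prefix of preceding embeddings, and regress the clean target. The linear noise schedule and the averaging "denoiser" below are simplified stand-ins (a real model would be a Transformer over the prefix), so this shows only the shape of the objective, not the LCM's actual training recipe.

```python
# Hypothetical sketch of next-embedding prediction via latent diffusion.
import numpy as np

rng = np.random.default_rng(1)
DIM = 1024  # SONAR embedding dimension

def noising(x0, t):
    """Variance-preserving noising: x_t = sqrt(a)*x0 + sqrt(1-a)*eps,
    with a toy linear schedule a = 1 - t for t in (0, 1)."""
    alpha = 1.0 - t
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * eps

def denoiser(prefix, x_t, t):
    """Placeholder denoiser: averages the prefix context with the noisy
    target. A real denoiser would be a learned Transformer."""
    context = prefix.mean(axis=0)
    return 0.5 * (context + x_t)

def diffusion_loss(prefix, target, t=0.5):
    """MSE between the denoiser's prediction and the clean next embedding."""
    x_t = noising(target, t)
    pred = denoiser(prefix, x_t, t)
    return float(np.mean((pred - target) ** 2))

prefix = rng.normal(size=(5, DIM))  # preceding concept embeddings (text/vision)
target = rng.normal(size=(DIM,))    # next concept embedding to predict
loss = diffusion_loss(prefix, target)
```

Because the prefix can mix vSONAR (image/video) and SONAR (text) embeddings in one sequence, the same objective serves both the text-only pre-training and the multimodal instruction tuning.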