WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: audio-visual embeddings, multimodal LLMs, video retrieval
Abstract:

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding model that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes the paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WAVE, a multimodal LLM-based embedding system that unifies text, audio, and video into a shared representation space. It resides in the 'Joint Embedding Space Construction' leaf, which contains four papers total (including WAVE itself). This leaf sits within the broader 'Unified Multimodal Representation Learning' branch, indicating a moderately populated research direction. The taxonomy shows that joint embedding approaches are distinct from tokenization-based unification and modality-specific fusion strategies, suggesting WAVE occupies a well-defined but not overcrowded niche focused on continuous alignment rather than discrete token vocabularies.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Unified Tokenization Frameworks' (two papers) explores discrete token-based unification, while the parallel branch 'Modality-Specific Encoding and Fusion' (nine papers across three leaves) addresses separate encoding followed by late fusion. WAVE's hierarchical feature fusion and dual-encoder audio design connect it to 'Fine-Grained Audio-Visual Alignment' methods, yet its emphasis on a unified embedding space distinguishes it from modality-specific approaches. The taxonomy's scope notes clarify that WAVE's continuous embedding alignment excludes it from tokenization-based methods, positioning it at the intersection of representation learning and cross-modal retrieval.

Among the thirty candidates examined, the analysis found limited overlap with prior work. For the core contribution (a unified audio-visual embedding MLLM), ten candidates were examined and one refutable match was found, suggesting some precedent exists, though the search scope was narrow. The prompt-aware embeddings contribution (ten candidates, zero refutable matches) and the hierarchical fusion strategy (ten candidates, zero refutable matches) appear more distinctive within this limited sample. These statistics indicate that while WAVE's unified embedding concept overlaps with at least one prior work among the examined papers, its instruction-following and architectural innovations have less direct precedent in the top thirty semantic matches and their citations.

Based on the constrained literature search (thirty candidates from semantic retrieval), WAVE demonstrates moderate novelty in its unified embedding approach, with stronger novelty signals in its prompt-aware and hierarchical fusion components. The taxonomy context suggests the joint embedding space construction area is active but not saturated, with WAVE contributing to an established research direction while introducing specific architectural and training innovations. A more exhaustive search beyond the top-thirty candidates would be needed to fully assess the originality of the hierarchical fusion and instruction-following mechanisms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: learning unified audio-visual embeddings with multimodal large language models. The field is organized around several complementary directions. Unified Multimodal Representation Learning focuses on constructing joint embedding spaces that align audio, visual, and textual modalities, often through contrastive or alignment-based objectives. Modality-Specific Encoding and Fusion addresses how to effectively encode and combine heterogeneous signals before feeding them into LLMs, while Audio-Visual Question Answering and Reasoning and Audio-Visual Speech Recognition and Understanding tackle specific downstream tasks that require tight audio-visual coordination. Specialized Audio-Visual Applications explores domain-specific uses such as emotion recognition, video captioning, and multimedia recommendation, whereas Multimodal LLM Architectures and Training examines the design choices (adapter modules, cross-modal attention, and training strategies) that enable LLMs to process multiple modalities. Finally, Evaluation, Benchmarking, and Analysis provides the datasets and metrics needed to assess model performance across diverse audio-visual scenarios. Representative works like Video LLaVA[1], Chat UniVi[3], and VideoLLaMA 2[19] illustrate how these branches intersect in practice.

A particularly active line of work centers on joint embedding space construction, where methods such as Fine-grained Audio-Visual[2] and TEAL[4] explore fine-grained alignment between audio and visual features, often leveraging contrastive learning or cross-modal attention. These approaches contrast with modality-specific fusion strategies that preserve separate encoders before late integration, as seen in Video SALMONN[9] and BuboGPT[11]. WAVE[0] sits squarely within the joint embedding space construction cluster, emphasizing unified representations that enable seamless audio-visual reasoning.
Compared to neighbors like Aligning Agentic Workflow[8], which focuses on workflow-level alignment, and Dense Knowledge Alignment[30], which targets knowledge-grounded multimodal understanding, WAVE[0] prioritizes the direct construction of a shared embedding space. This positioning highlights ongoing trade-offs between end-to-end unification and modular, task-specific fusion, with open questions around scalability, generalization, and the optimal granularity of cross-modal alignment remaining central to the field.

Claimed Contributions

WAVE: unified and versatile audio-visual embedding MLLM

WAVE is the first multimodal LLM-based embedding model that creates a unified representation space for text, audio, silent video, and synchronised audio-visual inputs. It enables any-to-any cross-modal retrieval and achieves state-of-the-art performance on the MMEB-v2 video benchmark.

10 retrieved papers (one can refute)
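Because all modalities share one embedding space, the any-to-any retrieval this contribution claims reduces to nearest-neighbour search under cosine similarity: the same scoring function serves text-to-video, audio-to-text, video-to-audio, and so on. The sketch below illustrates that pattern with toy vectors; the function names and data are invented for illustration and are not WAVE's actual code.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project embeddings onto the unit sphere so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve(query_emb, candidate_embs, top_k=3):
    """Rank candidates (any modality) against a query (any modality) by cosine similarity.

    In a shared space, this one function covers every retrieval direction.
    """
    q = l2_normalize(query_emb)
    c = l2_normalize(candidate_embs)
    scores = c @ q                       # cosine similarities, shape (num_candidates,)
    order = np.argsort(-scores)[:top_k]  # indices of the best matches, highest first
    return order, scores[order]

# Toy demo: a "text" query against mixed "audio"/"video" candidates in a 4-d space.
rng = np.random.default_rng(0)
query = np.array([1.0, 0.0, 0.0, 0.0])
candidates = np.stack([
    np.array([0.9, 0.1, 0.0, 0.0]),  # near-duplicate of the query
    rng.normal(size=4),              # unrelated candidate
    rng.normal(size=4),              # unrelated candidate
])
idx, scores = retrieve(query, candidates, top_k=1)
```

The near-duplicate candidate ranks first, as expected under cosine scoring.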
Prompt-aware embeddings via instruction-following

WAVE leverages the instruction-following capabilities of its MLLM backbone to generate prompt-aware embeddings that can be conditioned on user instructions. This enables the model to produce task-specific representations, demonstrated by strong performance on embedding-based multimodal question answering.

10 retrieved papers (none can refute)
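The prompt-aware pattern described above can be sketched as conditioning the encoder on the instruction text, so the same content yields different task-specific embeddings under different instructions. The toy hash-based encoder and the prompt template below are stand-ins invented for illustration; WAVE's real backbone is a multimodal LLM.

```python
import hashlib
import numpy as np

def toy_encoder(text, dim=8):
    """Stand-in for the MLLM backbone: deterministically maps a string to a unit vector.
    (Purely illustrative; not WAVE's actual encoder.)"""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def prompt_aware_embed(instruction, content):
    """Condition the embedding on a user instruction by prepending it to the input,
    mirroring the instruction-following pattern the contribution describes."""
    return toy_encoder(f"Instruct: {instruction}\nInput: {content}")

video = "<video: a dog catching a frisbee>"
e_caption = prompt_aware_embed("Represent this clip for caption retrieval.", video)
e_qa = prompt_aware_embed("Represent this clip for question answering.", video)
# Same content, different instructions -> different task-specific embeddings.
different = not np.allclose(e_caption, e_qa)
```

The point of the sketch is the interface, not the encoder: the instruction is part of the encoder's input, so the embedding space itself becomes task-conditional.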
Hierarchical feature-fusion strategy and dual-encoder audio design

The authors propose a hierarchical feature-fusion strategy that aggregates representations from multiple MLLM layers to improve multimodal retrieval performance. They also introduce a dual-encoder architecture for audio that captures complementary speech and environmental sound cues, enhancing embedding expressiveness.

10 retrieved papers (none can refute)
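As a rough illustration of the two architectural ideas, the sketch below combines pooled hidden states from several LLM layers using softmax-normalized weights, and concatenates two audio feature streams before projection into the joint space. The shapes, names, and the specific softmax weighting are assumptions made for illustration; the paper's actual fusion mechanism may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_layers(layer_states, layer_logits):
    """Hierarchical feature fusion (sketch): mix hidden states from several layers
    with learned softmax weights instead of using only the final layer.

    layer_states: (num_layers, dim) pooled hidden state per layer
    layer_logits: (num_layers,) learnable fusion parameters
    """
    w = softmax(layer_logits)   # (num_layers,) weights summing to 1
    return w @ layer_states     # (dim,) weighted sum over layers

def fuse_audio(speech_feat, sound_feat):
    """Dual-encoder audio (sketch): concatenate complementary speech and
    environmental-sound features before projecting into the joint space."""
    return np.concatenate([speech_feat, sound_feat])

# Toy shapes: 4 layers with 6-d pooled states; two 3-d audio encoders.
states = np.arange(24, dtype=float).reshape(4, 6)
logits = np.zeros(4)            # uniform weights -> a simple mean over layers
fused = fuse_layers(states, logits)
audio = fuse_audio(np.ones(3), np.zeros(3))
```

With zero logits the fusion degenerates to a plain layer mean; training the logits lets the model emphasize whichever depths carry the most retrieval-relevant features.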

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WAVE: unified and versatile audio-visual embedding MLLM

WAVE is the first multimodal LLM-based embedding model that creates a unified representation space for text, audio, silent video, and synchronised audio-visual inputs. It enables any-to-any cross-modal retrieval and achieves state-of-the-art performance on the MMEB-v2 video benchmark.

Contribution

Prompt-aware embeddings via instruction-following

WAVE leverages the instruction-following capabilities of its MLLM backbone to generate prompt-aware embeddings that can be conditioned on user instructions. This enables the model to produce task-specific representations, demonstrated by strong performance on embedding-based multimodal question answering.

Contribution

Hierarchical feature-fusion strategy and dual-encoder audio design

The authors propose a hierarchical feature-fusion strategy that aggregates representations from multiple MLLM layers to improve multimodal retrieval performance. They also introduce a dual-encoder architecture for audio that captures complementary speech and environmental sound cues, enhancing embedding expressiveness.