WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: audio-visual embeddings, multimodal LLMs, video retrieval
Abstract:

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding model that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes the paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WAVE, a multimodal LLM-based embedding system that unifies text, audio, and video into a shared representation space. It resides in the 'Joint Embedding Space Construction' leaf, which contains four papers total (including WAVE itself). This leaf sits within the broader 'Unified Multimodal Representation Learning' branch, indicating a moderately populated research direction. The taxonomy shows that joint embedding approaches are distinct from tokenization-based unification and modality-specific fusion strategies, suggesting WAVE occupies a well-defined but not overcrowded niche focused on continuous alignment rather than discrete token vocabularies.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Unified Tokenization Frameworks' (two papers) explores discrete token-based unification, while the parallel branch 'Modality-Specific Encoding and Fusion' (nine papers across three leaves) addresses separate encoding followed by late fusion. WAVE's hierarchical feature fusion and dual-encoder audio design connect it to 'Fine-Grained Audio-Visual Alignment' methods, yet its emphasis on a unified embedding space distinguishes it from modality-specific approaches. The taxonomy's scope notes clarify that WAVE's continuous embedding alignment excludes it from tokenization-based methods, positioning it at the intersection of representation learning and cross-modal retrieval.

Among the thirty candidates examined, the analysis found limited overlap with prior work. For the core contribution (a unified audio-visual embedding MLLM), ten candidates were examined and one refutable match was found, suggesting some precedent exists, though the search scope was narrow. The prompt-aware embeddings contribution (ten candidates, zero refutable matches) and the hierarchical fusion strategy (ten candidates, zero refutable matches) appear more distinctive within this limited sample. These statistics indicate that while WAVE's unified embedding concept overlaps with at least one prior work among the examined papers, its instruction-following and architectural innovations have less direct precedent in the top thirty semantic matches and their citations.

Based on the constrained literature search (thirty candidates from semantic retrieval), WAVE demonstrates moderate novelty in its unified embedding approach, with stronger novelty signals in its prompt-aware and hierarchical fusion components. The taxonomy context suggests the joint embedding space construction area is active but not saturated, with WAVE contributing to an established research direction while introducing specific architectural and training innovations. A more exhaustive search beyond the top-thirty candidates would be needed to fully assess the originality of the hierarchical fusion and instruction-following mechanisms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: learning unified audio-visual embeddings with multimodal large language models. The field is organized around several complementary directions. Unified Multimodal Representation Learning focuses on constructing joint embedding spaces that align audio, visual, and textual modalities, often through contrastive or alignment-based objectives. Modality-Specific Encoding and Fusion addresses how to effectively encode and combine heterogeneous signals before feeding them into LLMs, while Audio-Visual Question Answering and Reasoning and Audio-Visual Speech Recognition and Understanding tackle specific downstream tasks that require tight audio-visual coordination. Specialized Audio-Visual Applications explores domain-specific uses such as emotion recognition, video captioning, and multimedia recommendation, whereas Multimodal LLM Architectures and Training examines the design choices (adapter modules, cross-modal attention, and training strategies) that enable LLMs to process multiple modalities. Finally, Evaluation, Benchmarking, and Analysis provides the datasets and metrics needed to assess model performance across diverse audio-visual scenarios. Representative works like Video LLaVA[1], Chat UniVi[3], and VideoLLaMA 2[19] illustrate how these branches intersect in practice.

A particularly active line of work centers on joint embedding space construction, where methods such as Fine-grained Audio-Visual[2] and TEAL[4] explore fine-grained alignment between audio and visual features, often leveraging contrastive learning or cross-modal attention. These approaches contrast with modality-specific fusion strategies that preserve separate encoders before late integration, as seen in Video SALMONN[9] and BuboGPT[11]. WAVE[0] sits squarely within the joint embedding space construction cluster, emphasizing unified representations that enable seamless audio-visual reasoning.
Compared to neighbors like Aligning Agentic Workflow[8], which focuses on workflow-level alignment, and Dense Knowledge Alignment[30], which targets knowledge-grounded multimodal understanding, WAVE[0] prioritizes the direct construction of a shared embedding space. This positioning highlights ongoing trade-offs between end-to-end unification and modular, task-specific fusion, with open questions around scalability, generalization, and the optimal granularity of cross-modal alignment remaining central to the field.

Claimed Contributions

WAVE: unified and versatile audio-visual embedding MLLM

WAVE is the first multimodal LLM-based embedding model that creates a unified representation space for text, audio, silent video, and synchronised audio-visual inputs. It enables any-to-any cross-modal retrieval and achieves state-of-the-art performance on the MMEB-v2 video benchmark.

10 retrieved papers (one can refute)
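Because all modalities share one embedding space, the any-to-any retrieval this contribution claims reduces to nearest-neighbour search under cosine similarity: the same scoring function serves text-to-video, audio-to-text, video-to-audio, and so on. The sketch below illustrates that pattern with toy vectors; the function names and data are invented for illustration and are not WAVE's actual code.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project embeddings onto the unit sphere so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve(query_emb, candidate_embs, top_k=3):
    """Rank candidates (any modality) against a query (any modality) by cosine similarity.

    In a shared space, this one function covers every retrieval direction.
    """
    q = l2_normalize(query_emb)
    c = l2_normalize(candidate_embs)
    scores = c @ q                       # cosine similarities, shape (num_candidates,)
    order = np.argsort(-scores)[:top_k]  # indices of the best matches, highest first
    return order, scores[order]

# Toy demo: a "text" query against mixed "audio"/"video" candidates in a 4-d space.
rng = np.random.default_rng(0)
query = np.array([1.0, 0.0, 0.0, 0.0])
candidates = np.stack([
    np.array([0.9, 0.1, 0.0, 0.0]),  # near-duplicate of the query
    rng.normal(size=4),              # unrelated candidate
    rng.normal(size=4),              # unrelated candidate
])
idx, scores = retrieve(query, candidates, top_k=1)
```

The near-duplicate candidate ranks first, as expected under cosine scoring.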
Prompt-aware embeddings via instruction-following

WAVE leverages the instruction-following capabilities of its MLLM backbone to generate prompt-aware embeddings that can be conditioned on user instructions. This enables the model to produce task-specific representations, demonstrated by strong performance on embedding-based multimodal question answering.

10 retrieved papers (none can refute)
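The prompt-aware pattern described above can be sketched as conditioning the encoder on the instruction text, so the same content yields different task-specific embeddings under different instructions. The toy hash-based encoder and the prompt template below are stand-ins invented for illustration; WAVE's real backbone is a multimodal LLM.

```python
import hashlib
import numpy as np

def toy_encoder(text, dim=8):
    """Stand-in for the MLLM backbone: deterministically maps a string to a unit vector.
    (Purely illustrative; not WAVE's actual encoder.)"""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def prompt_aware_embed(instruction, content):
    """Condition the embedding on a user instruction by prepending it to the input,
    mirroring the instruction-following pattern the contribution describes."""
    return toy_encoder(f"Instruct: {instruction}\nInput: {content}")

video = "<video: a dog catching a frisbee>"
e_caption = prompt_aware_embed("Represent this clip for caption retrieval.", video)
e_qa = prompt_aware_embed("Represent this clip for question answering.", video)
# Same content, different instructions -> different task-specific embeddings.
different = not np.allclose(e_caption, e_qa)
```

The point of the sketch is the interface, not the encoder: the instruction is part of the encoder's input, so the embedding space itself becomes task-conditional.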
Hierarchical feature-fusion strategy and dual-encoder audio design

The authors propose a hierarchical feature-fusion strategy that aggregates representations from multiple MLLM layers to improve multimodal retrieval performance. They also introduce a dual-encoder architecture for audio that captures complementary speech and environmental sound cues, enhancing embedding expressiveness.

10 retrieved papers (none can refute)
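As a rough illustration of the two architectural ideas, the sketch below combines pooled hidden states from several LLM layers using softmax-normalized weights, and concatenates two audio feature streams before projection into the joint space. The shapes, names, and the specific softmax weighting are assumptions made for illustration; the paper's actual fusion mechanism may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_layers(layer_states, layer_logits):
    """Hierarchical feature fusion (sketch): mix hidden states from several layers
    with learned softmax weights instead of using only the final layer.

    layer_states: (num_layers, dim) pooled hidden state per layer
    layer_logits: (num_layers,) learnable fusion parameters
    """
    w = softmax(layer_logits)   # (num_layers,) weights summing to 1
    return w @ layer_states     # (dim,) weighted sum over layers

def fuse_audio(speech_feat, sound_feat):
    """Dual-encoder audio (sketch): concatenate complementary speech and
    environmental-sound features before projecting into the joint space."""
    return np.concatenate([speech_feat, sound_feat])

# Toy shapes: 4 layers with 6-d pooled states; two 3-d audio encoders.
states = np.arange(24, dtype=float).reshape(4, 6)
logits = np.zeros(4)            # uniform weights -> a simple mean over layers
fused = fuse_layers(states, logits)
audio = fuse_audio(np.ones(3), np.zeros(3))
```

With zero logits the fusion degenerates to a plain layer mean; training the logits lets the model emphasize whichever depths carry the most retrieval-relevant features.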

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WAVE: unified and versatile audio-visual embedding MLLM

WAVE is the first multimodal LLM-based embedding model that creates a unified representation space for text, audio, silent video, and synchronised audio-visual inputs. It enables any-to-any cross-modal retrieval and achieves state-of-the-art performance on the MMEB-v2 video benchmark.

Contribution

Prompt-aware embeddings via instruction-following

WAVE leverages the instruction-following capabilities of its MLLM backbone to generate prompt-aware embeddings that can be conditioned on user instructions. This enables the model to produce task-specific representations, demonstrated by strong performance on embedding-based multimodal question answering.

Contribution

Hierarchical feature-fusion strategy and dual-encoder audio design

The authors propose a hierarchical feature-fusion strategy that aggregates representations from multiple MLLM layers to improve multimodal retrieval performance. They also introduce a dual-encoder architecture for audio that captures complementary speech and environmental sound cues, enhancing embedding expressiveness.