WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
Overview
Overall Novelty Assessment
The paper introduces WAVE, a multimodal LLM-based embedding system that unifies text, audio, and video into a shared representation space. It resides in the 'Joint Embedding Space Construction' leaf, which contains four papers total (including WAVE itself). This leaf sits within the broader 'Unified Multimodal Representation Learning' branch, indicating a moderately populated research direction. The taxonomy shows that joint embedding approaches are distinct from tokenization-based unification and modality-specific fusion strategies, suggesting WAVE occupies a well-defined but not overcrowded niche focused on continuous alignment rather than discrete token vocabularies.
The taxonomy reveals several neighboring research directions. The sibling leaf 'Unified Tokenization Frameworks' (two papers) explores discrete token-based unification, while the parallel branch 'Modality-Specific Encoding and Fusion' (nine papers across three leaves) addresses separate encoding followed by late fusion. WAVE's hierarchical feature fusion and dual-encoder audio design connect it to 'Fine-Grained Audio-Visual Alignment' methods, yet its emphasis on a unified embedding space distinguishes it from modality-specific approaches. The taxonomy's scope notes clarify that WAVE's continuous embedding alignment excludes it from tokenization-based methods, positioning it at the intersection of representation learning and cross-modal retrieval.
Across the thirty candidates examined, the analysis found limited overlap with prior work. For the core contribution (a unified audio-visual embedding MLLM), ten candidates were examined and one refutable match was found, suggesting some precedent exists, though the search scope was narrow. The prompt-aware embeddings contribution (ten candidates, zero refutable matches) and the hierarchical fusion strategy (ten candidates, zero refutable matches) appear more distinctive within this limited sample. These statistics indicate that while WAVE's unified embedding concept has at least one overlapping prior work among the examined papers, its instruction-following and architectural innovations show less direct precedent in the top-thirty semantic matches and their citations.
Based on the constrained literature search (thirty candidates from semantic retrieval), WAVE demonstrates moderate novelty in its unified embedding approach, with stronger novelty signals in its prompt-aware and hierarchical fusion components. The taxonomy context suggests the joint embedding space construction area is active but not saturated, with WAVE contributing to an established research direction while introducing specific architectural and training innovations. A more exhaustive search beyond the top-thirty candidates would be needed to fully assess the originality of the hierarchical fusion and instruction-following mechanisms.
Taxonomy
Research Landscape Overview
Claimed Contributions
WAVE is the first multimodal LLM-based embedding model that creates a unified representation space for text, audio, silent video, and synchronised audio-visual inputs. It enables any-to-any cross-modal retrieval and achieves state-of-the-art performance on the MMEB-v2 video benchmark.
WAVE leverages the instruction-following capabilities of its MLLM backbone to generate prompt-aware embeddings that can be conditioned on user instructions. This enables the model to produce task-specific representations, demonstrated by strong performance on embedding-based multimodal question answering.
The authors propose a hierarchical feature-fusion strategy that aggregates representations from multiple MLLM layers to improve multimodal retrieval performance. They also introduce a dual-encoder architecture for audio that captures complementary speech and environmental sound cues, enhancing embedding expressiveness.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Fine-grained audio-visual joint representations for multimodal large language models
[8] Aligning audio-visual joint representations with an agentic workflow
[30] Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models
Contribution Analysis
Detailed comparisons for each claimed contribution
WAVE: unified and versatile audio-visual embedding MLLM
WAVE is the first multimodal LLM-based embedding model that creates a unified representation space for text, audio, silent video, and synchronised audio-visual inputs. It enables any-to-any cross-modal retrieval and achieves state-of-the-art performance on the MMEB-v2 video benchmark.
[60] Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video
[51] End-to-end multimodal representation learning for video dialog
[52] Everything at once-multi-modal fusion transformer for video retrieval
[53] Polysemous visual-semantic embedding for cross-modal retrieval
[54] Gramian Multimodal Representation Learning and Alignment
[55] ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
[56] Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action
[57] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
[58] Retrieving multimodal information for augmented generation: A survey
[59] Disentangled Representation Learning for Text-Video Retrieval
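The unified-space claim above reduces, at inference time, to a single similarity function over embeddings from any modality. A minimal numpy sketch of any-to-any retrieval in a shared space (toy 4-d vectors and hypothetical gallery labels, not WAVE's actual encoders):

```python
import numpy as np

def normalize(x):
    # L2-normalize embeddings along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_emb, gallery_embs, top_k=1):
    """Any-to-any retrieval: because all modalities share one embedding
    space, a query from any modality scores every gallery item by
    cosine similarity, regardless of the gallery item's modality."""
    sims = normalize(gallery_embs) @ normalize(query_emb)
    return np.argsort(-sims)[:top_k]

# Toy 4-d embeddings standing in for items of different modalities
gallery = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. a text caption
    [0.0, 1.0, 0.0, 0.0],   # e.g. an unrelated audio clip
    [0.9, 0.1, 0.0, 0.0],   # e.g. a video aligned with the caption
])
query = np.array([1.0, 0.05, 0.0, 0.0])  # query from any modality
print(retrieve(query, gallery, top_k=2))  # → [0 2]: caption, then aligned video
```

The point of the sketch is that no modality-specific scoring path is needed once everything lives in one space; ranking is a single matrix-vector product.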
Prompt-aware embeddings via instruction-following
WAVE leverages the instruction-following capabilities of its MLLM backbone to generate prompt-aware embeddings that can be conditioned on user instructions. This enables the model to produce task-specific representations, demonstrated by strong performance on embedding-based multimodal question answering.
[71] Generative multimodal models are in-context learners
[72] MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
[73] Instruction-driven history-aware policies for robotic manipulations
[74] MPT: Multimodal Prompt Tuning for Zero-shot Instruction Learning
[75] Imagebind-llm: Multi-modality instruction tuning
[76] Vislinginstruct: Elevating zero-shot learning in multi-modal language models with autonomous instruction optimization
[77] MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models
[78] Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning
[79] Joint embeddings for graph instruction tuning
[80] VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
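The prompt-aware claim is that the same input yields different embeddings under different instructions. A toy numpy sketch of one way this can happen, with the instruction embedding attending over input tokens before pooling (a hypothetical stand-in for the MLLM's instruction-following behaviour, not WAVE's implementation):

```python
import numpy as np

def prompt_aware_embed(token_feats, instr_feat):
    """Toy instruction-conditioned pooling: the instruction embedding
    scores each input token, and a softmax over those scores weights
    the pooled representation, so the output embedding depends on the
    instruction as well as on the input."""
    scores = token_feats @ instr_feat                 # relevance of each token
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax attention
    return weights @ token_feats                      # weighted mean pooling

tokens = np.array([[1.0, 0.0], [0.0, 1.0]])  # two toy token features
emb_a = prompt_aware_embed(tokens, np.array([4.0, 0.0]))  # instruction A
emb_b = prompt_aware_embed(tokens, np.array([0.0, 4.0]))  # instruction B
print(np.allclose(emb_a, emb_b))  # → False: embeddings are task-specific
```

Mean pooling without the instruction term would collapse both cases to the same vector, which is exactly the limitation instruction conditioning removes.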
Hierarchical feature-fusion strategy and dual-encoder audio design
The authors propose a hierarchical feature-fusion strategy that aggregates representations from multiple MLLM layers to improve multimodal retrieval performance. They also introduce a dual-encoder architecture for audio that captures complementary speech and environmental sound cues, enhancing embedding expressiveness.
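The two mechanisms described above can be illustrated schematically. The numpy sketch below is an assumption-laden toy analogue, not the paper's architecture: mean pooling per layer, uniform layer weights, and plain concatenation of the two audio streams all stand in for whatever learned components WAVE actually uses.

```python
import numpy as np

def hierarchical_fuse(layer_states, weights=None):
    """Toy hierarchical feature fusion: pool each selected MLLM layer's
    token states, then combine the per-layer embeddings with a set of
    layer weights (uniform here for illustration)."""
    per_layer = [h.mean(axis=0) for h in layer_states]  # mean-pool tokens
    stacked = np.stack(per_layer)                        # (n_layers, dim)
    if weights is None:
        weights = np.full(len(layer_states), 1.0 / len(layer_states))
    return weights @ stacked

def dual_audio_encode(speech_emb, sound_emb):
    """Toy dual-encoder audio feature: keep a speech-oriented embedding
    and an environmental-sound embedding side by side so both cues
    survive into the fused representation."""
    return np.concatenate([speech_emb, sound_emb])

rng = np.random.default_rng(0)
layers = [rng.normal(size=(5, 8)) for _ in range(3)]  # 3 layers, 5 tokens, dim 8
fused = hierarchical_fuse(layers)
audio = dual_audio_encode(rng.normal(size=4), rng.normal(size=4))
print(fused.shape, audio.shape)  # → (8,) (8,)
```

The design intuition being illustrated: intermediate MLLM layers encode different abstraction levels, so aggregating several of them can retain retrieval-relevant detail that the final layer alone discards, and keeping separate speech and sound encoders avoids forcing one audio encoder to cover both cue types.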