WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Omni-Modality, Multimodal LLMs, Benchmark, Real-World Understanding
Abstract:

We introduce WorldSense, the first benchmark for multimodal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several distinguishing features: (i) collaboration of omni-modality: the evaluation tasks are designed around a strong coupling of audio and video, requiring models to effectively exploit the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense comprises 1,662 audio-visually synchronized videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, together with 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators through multiple rounds of correction to ensure quality. Based on WorldSense, we extensively evaluate a range of state-of-the-art models. The results indicate that existing models face significant challenges in understanding real-world scenarios, with the best model reaching only 65.1% accuracy. By analyzing the limitations of current models, we aim to provide insights that guide the development of real-world understanding. We hope WorldSense can serve as a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
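
The abstract characterizes the benchmark only at an aggregate level (1,662 videos, 8 domains, 67 subcategories, 3,172 multiple-choice QA pairs, 26 task types) and does not specify a data format. As a reading aid, the following Python sketch shows what a single WorldSense-style record might look like; every field name and the example contents are assumptions made for illustration, not details published by the authors.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OmniQARecord:
    """Hypothetical schema for one audio-visual multiple-choice QA item.

    All field names are illustrative assumptions; the release format of
    WorldSense is not described in the abstract above.
    """
    video_id: str          # identifier of the audio-visually synchronized clip
    domain: str            # one of the 8 primary domains
    subcategory: str       # one of the 67 fine-grained subcategories
    task: str              # one of the 26 task types
    question: str
    options: List[str]     # candidate answers for the multiple-choice question
    answer_index: int      # index of the correct option
    # The benchmark's stated design goal is that audio AND video are both
    # required to answer, not either modality alone.
    required_modalities: List[str] = field(default_factory=lambda: ["audio", "video"])

# Invented example item, purely for illustration:
example = OmniQARecord(
    video_id="ws_000123",
    domain="daily_life",
    subcategory="cooking",
    task="audio_visual_event_ordering",
    question="Which action happens right after the timer beeps?",
    options=["Stirring the pot", "Opening the oven", "Chopping onions", "Washing hands"],
    answer_index=1,
)
```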

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

WorldSense introduces a benchmark for omnimodal video understanding that requires synchronized audio-visual reasoning across 1,662 videos, 8 domains, 67 subcategories, and 3,172 QA pairs spanning 26 tasks. It resides in the 'Omnimodal Comprehension Benchmarks' leaf alongside three sibling papers: OmniEval, Daily-Omni, and OmniMMI. This leaf is part of the broader 'Benchmarks and Evaluation Frameworks' branch, which contains four distinct evaluation categories within a 50-paper taxonomy. The concentration of four papers in this specific leaf suggests a moderately active research direction focused on comprehensive omnimodal assessment.

The taxonomy reveals that WorldSense's parent branch sits alongside five other major research directions: Multimodal Representation Learning, Omnimodal LLMs, Audio-Visual Generation, Task-Specific Understanding, and Theoretical Frameworks. Neighboring evaluation categories include Audio-Centric benchmarks (2 papers), Long-form Video benchmarks (2 papers), and Specialized Domain datasets (3 papers). WorldSense's emphasis on synchronized audio-visual coupling distinguishes it from Audio-Centric benchmarks that prioritize auditory information, while its focus on diverse real-world scenarios differentiates it from Specialized Domain datasets targeting narrow applications like brain signal decoding or robotic performance art.

Of the 30 candidate papers examined in total, the contribution 'Design principles emphasizing omnimodal collaboration' has one refutable candidate among its 10 matches, suggesting some overlap with prior work on audio-visual coupling requirements. For the 'WorldSense benchmark' contribution itself, 10 candidates were examined with zero refutations, indicating potential novelty in its specific combination of scale, task diversity, and annotation quality. The 'Comprehensive evaluation revealing MLLM limitations' contribution likewise yielded no refutations among its 10 candidates, though this may reflect the limited search scope rather than definitive novelty. The analysis does not amount to an exhaustive comparison with all existing benchmarks in the field.

Given the limited 30-candidate search scope, WorldSense appears to occupy a recognizable position within an active but not overcrowded evaluation subfield. The presence of three sibling benchmarks suggests incremental progress rather than pioneering work, yet the specific emphasis on omnimodal collaboration and the scale of manual annotation may offer distinguishing features. The analysis captures top semantic matches and immediate neighbors but cannot assess whether similar benchmarks exist outside this search radius or in adjacent research communities.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: omnimodal video understanding with synchronized audio and visual inputs. The field has evolved into several major branches that reflect different emphases on representation, interaction, generation, and evaluation. Multimodal Representation Learning and Alignment focuses on learning joint embeddings and cross-modal correspondences, often drawing on contrastive or canonical correlation techniques to align audio and visual streams. Omnimodal Large Language Models and Interactive Systems integrate these modalities into conversational agents and reasoning frameworks, enabling richer human-machine interaction. Audio-Visual Generation with Temporal Control addresses synthesis tasks where temporal synchronization is paramount, while Task-Specific Audio-Visual Understanding targets specialized problems such as sound source localization or audio-visual scene analysis. Benchmarks and Evaluation Frameworks provide standardized testbeds for measuring omnimodal comprehension, and Multimodal Analysis and Theoretical Frameworks explore the underlying principles that govern cross-modal fusion. Representative works like Unified Video Language Audio[1] and Audio Centric Video[2] illustrate how different branches tackle alignment and task design, while foundational studies such as Audio Visual Scene Analysis[8] laid early groundwork for the field.

Recent activity highlights a tension between holistic omnimodal reasoning and task-specific optimization. Many studies pursue end-to-end architectures that unify audio, vision, and language, yet specialized benchmarks reveal that general-purpose models often struggle with fine-grained temporal or semantic alignment.

WorldSense[0] sits within the Benchmarks and Evaluation Frameworks branch, specifically under Omnimodal Comprehension Benchmarks, alongside works like OmniEval[5], Daily-Omni[12], and OmniMMI[17]. While OmniEval[5] and Daily-Omni[12] emphasize diverse question types and everyday scenarios, WorldSense[0] appears to focus on comprehensive evaluation of synchronized audio-visual understanding, providing a testbed that complements these neighboring efforts. This cluster of benchmarks collectively addresses the need for rigorous assessment of omnimodal systems, ensuring that advances in representation learning and interactive models translate into measurable improvements across varied real-world tasks.

Claimed Contributions

WorldSense benchmark for omnimodal video understanding

The authors present WorldSense, a novel benchmark specifically designed to evaluate multimodal large language models on their ability to understand real-world scenarios through integrated processing of visual, audio, and textual information from synchronized videos. The benchmark features 1,662 videos across 8 domains and 67 subcategories, with 3,172 manually annotated question-answer pairs spanning 26 distinct tasks.

Retrieved papers compared: 10

Design principles emphasizing omnimodal collaboration

The benchmark is constructed with deliberate design principles that ensure questions require tight coupling between audio and visual modalities for correct answers. This forces models to demonstrate genuine multimodal integration rather than relying on single-modality processing, establishing a rigorous evaluation framework for omnimodal understanding.

Retrieved papers compared: 10
Status: Can Refute
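
The contribution above states the design goal, but this report does not say how the audio-visual coupling was verified; according to the paper, construction relies on manual labeling with multiple rounds of correction. One common way to operationalize such a coupling check, shown here only as a hedged sketch and not as the authors' procedure, is to run single-modality baselines and keep only the questions that neither baseline answers correctly. The `audio_only` and `video_only` predictors are hypothetical placeholders.

```python
from typing import Callable, Iterable, List

# Hypothetical single-modality predictors: each takes a question and its options
# and returns the index of the chosen option. These are placeholder interfaces,
# not an API published by the WorldSense authors.
Predictor = Callable[[str, List[str]], int]

def requires_both_modalities(question: str,
                             options: List[str],
                             answer_index: int,
                             audio_only: Predictor,
                             video_only: Predictor) -> bool:
    """Return True if neither single-modality baseline solves the question."""
    solved_by_audio = audio_only(question, options) == answer_index
    solved_by_video = video_only(question, options) == answer_index
    return not (solved_by_audio or solved_by_video)

def filter_coupled_questions(items: Iterable[dict],
                             audio_only: Predictor,
                             video_only: Predictor) -> List[dict]:
    """Retain only items whose answers require joint audio-visual reasoning."""
    return [
        item for item in items
        if requires_both_modalities(item["question"], item["options"],
                                    item["answer_index"], audio_only, video_only)
    ]
```
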
Comprehensive evaluation revealing limitations of current MLLMs

The authors conduct extensive experiments on state-of-the-art models, revealing that even the best proprietary model achieves only 65.1% accuracy while open-source models perform near chance level. Through ablation studies and failure analysis, they identify key factors influencing performance and provide actionable insights for improving omnimodal understanding in future models.

Retrieved papers compared: 10
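
To put the reported numbers in context (65.1% best accuracy, open-source models near chance level), the sketch below shows the standard multiple-choice accuracy computation together with a chance-level reference. The assumption of four options per question, and hence a ~25% chance level, is made here for illustration and is not stated in this report.

```python
from typing import List, Sequence

def multiple_choice_accuracy(predictions: Sequence[int],
                             answers: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index matches the key."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def chance_level(num_options_per_question: List[int]) -> float:
    """Expected accuracy of uniform random guessing, given per-question option counts."""
    return sum(1.0 / n for n in num_options_per_question) / len(num_options_per_question)

# Toy illustration with invented numbers: assuming 4 options per question,
# random guessing yields ~25%, so a model scoring near that value has made
# little use of the audio-visual evidence.
preds = [1, 0, 2, 3, 1]
keys  = [1, 2, 2, 0, 1]
print(multiple_choice_accuracy(preds, keys))   # 0.6
print(chance_level([4] * len(keys)))           # 0.25
```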

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: WorldSense benchmark for omnimodal video understanding
Contribution 2: Design principles emphasizing omnimodal collaboration
Contribution 3: Comprehensive evaluation revealing limitations of current MLLMs

Each contribution is described above under Claimed Contributions.