WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Omni-Modality, Multimodal LLMs, Benchmark, Real-World Understanding
Abstract:

We introduce WorldSense, the first benchmark for multimodal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several distinguishing features: (i) collaboration of omni-modality: the evaluation tasks are designed around a strong coupling of audio and video, requiring models to effectively exploit the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense comprises 1,662 audio-visually synchronized videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, together with 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators through multiple rounds of correction to ensure quality. Based on WorldSense, we extensively evaluate a range of state-of-the-art models. The results indicate that existing models face significant challenges in understanding real-world scenarios, with the best model reaching only 65.1% accuracy. By analyzing the limitations of current models, we aim to provide insights that guide the development of real-world understanding. We hope WorldSense can serve as a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
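
The abstract characterizes the benchmark only at an aggregate level (1,662 videos, 8 domains, 67 subcategories, 3,172 multiple-choice QA pairs, 26 task types) and does not specify a data format. As a reading aid, the following Python sketch shows what a single WorldSense-style record might look like; every field name and the example contents are assumptions made for illustration, not details published by the authors.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OmniQARecord:
    """Hypothetical schema for one audio-visual multiple-choice QA item.

    All field names are illustrative assumptions; the release format of
    WorldSense is not described in the abstract above.
    """
    video_id: str          # identifier of the audio-visually synchronized clip
    domain: str            # one of the 8 primary domains
    subcategory: str       # one of the 67 fine-grained subcategories
    task: str              # one of the 26 task types
    question: str
    options: List[str]     # candidate answers for the multiple-choice question
    answer_index: int      # index of the correct option
    # The benchmark's stated design goal is that audio AND video are both
    # required to answer, not either modality alone.
    required_modalities: List[str] = field(default_factory=lambda: ["audio", "video"])

# Invented example item, purely for illustration:
example = OmniQARecord(
    video_id="ws_000123",
    domain="daily_life",
    subcategory="cooking",
    task="audio_visual_event_ordering",
    question="Which action happens right after the timer beeps?",
    options=["Stirring the pot", "Opening the oven", "Chopping onions", "Washing hands"],
    answer_index=1,
)
```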

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

WorldSense introduces a benchmark for omnimodal video understanding that requires synchronized audio-visual reasoning across 1,662 videos, 8 domains, 67 subcategories, and 3,172 QA pairs spanning 26 tasks. It resides in the 'Omnimodal Comprehension Benchmarks' leaf alongside three sibling papers: OmniEval, Daily-Omni, and OmniMMI. This leaf is part of the broader 'Benchmarks and Evaluation Frameworks' branch, which contains four distinct evaluation categories within a 50-paper taxonomy. The concentration of four papers in this specific leaf suggests a moderately active research direction focused on comprehensive omnimodal assessment.

The taxonomy reveals that WorldSense's parent branch sits alongside five other major research directions: Multimodal Representation Learning, Omnimodal LLMs, Audio-Visual Generation, Task-Specific Understanding, and Theoretical Frameworks. Neighboring evaluation categories include Audio-Centric benchmarks (2 papers), Long-form Video benchmarks (2 papers), and Specialized Domain datasets (3 papers). WorldSense's emphasis on synchronized audio-visual coupling distinguishes it from Audio-Centric benchmarks that prioritize auditory information, while its focus on diverse real-world scenarios differentiates it from Specialized Domain datasets targeting narrow applications like brain signal decoding or robotic performance art.

Of the 30 candidate papers examined in total, the contribution 'Design principles emphasizing omnimodal collaboration' has one refutable candidate among its 10 matches, suggesting some overlap with prior work on audio-visual coupling requirements. For the 'WorldSense benchmark' contribution itself, 10 candidates were examined with zero refutations, indicating potential novelty in its specific combination of scale, task diversity, and annotation quality. The 'Comprehensive evaluation revealing MLLM limitations' contribution likewise yielded no refutations among its 10 candidates, though this may reflect the limited search scope rather than definitive novelty. The analysis does not amount to an exhaustive comparison with all existing benchmarks in the field.

Given the limited 30-candidate search scope, WorldSense appears to occupy a recognizable position within an active but not overcrowded evaluation subfield. The presence of three sibling benchmarks suggests incremental progress rather than pioneering work, yet the specific emphasis on omnimodal collaboration and the scale of manual annotation may offer distinguishing features. The analysis captures top semantic matches and immediate neighbors but cannot assess whether similar benchmarks exist outside this search radius or in adjacent research communities.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: omnimodal video understanding with synchronized audio and visual inputs. The field has evolved into several major branches that reflect different emphases on representation, interaction, generation, and evaluation. Multimodal Representation Learning and Alignment focuses on learning joint embeddings and cross-modal correspondences, often drawing on contrastive or canonical correlation techniques to align audio and visual streams. Omnimodal Large Language Models and Interactive Systems integrate these modalities into conversational agents and reasoning frameworks, enabling richer human-machine interaction. Audio-Visual Generation with Temporal Control addresses synthesis tasks where temporal synchronization is paramount, while Task-Specific Audio-Visual Understanding targets specialized problems such as sound source localization or audio-visual scene analysis. Benchmarks and Evaluation Frameworks provide standardized testbeds for measuring omnimodal comprehension, and Multimodal Analysis and Theoretical Frameworks explore the underlying principles that govern cross-modal fusion. Representative works like Unified Video Language Audio[1] and Audio Centric Video[2] illustrate how different branches tackle alignment and task design, while foundational studies such as Audio Visual Scene Analysis[8] laid early groundwork for the field.

Recent activity highlights a tension between holistic omnimodal reasoning and task-specific optimization. Many studies pursue end-to-end architectures that unify audio, vision, and language, yet specialized benchmarks reveal that general-purpose models often struggle with fine-grained temporal or semantic alignment.

WorldSense[0] sits within the Benchmarks and Evaluation Frameworks branch, specifically under Omnimodal Comprehension Benchmarks, alongside works like OmniEval[5], Daily-Omni[12], and OmniMMI[17]. While OmniEval[5] and Daily-Omni[12] emphasize diverse question types and everyday scenarios, WorldSense[0] appears to focus on comprehensive evaluation of synchronized audio-visual understanding, providing a testbed that complements these neighboring efforts. This cluster of benchmarks collectively addresses the need for rigorous assessment of omnimodal systems, ensuring that advances in representation learning and interactive models translate into measurable improvements across varied real-world tasks.

Claimed Contributions

WorldSense benchmark for omnimodal video understanding

The authors present WorldSense, a novel benchmark specifically designed to evaluate multimodal large language models on their ability to understand real-world scenarios through integrated processing of visual, audio, and textual information from synchronized videos. The benchmark features 1,662 videos across 8 domains and 67 subcategories, with 3,172 manually annotated question-answer pairs spanning 26 distinct tasks.

Retrieved papers compared: 10

Design principles emphasizing omnimodal collaboration

The benchmark is constructed with deliberate design principles that ensure questions require tight coupling between audio and visual modalities for correct answers. This forces models to demonstrate genuine multimodal integration rather than relying on single-modality processing, establishing a rigorous evaluation framework for omnimodal understanding.

Retrieved papers compared: 10
Status: Can Refute
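
The contribution above states the design goal, but this report does not say how the audio-visual coupling was verified; according to the paper, construction relies on manual labeling with multiple rounds of correction. One common way to operationalize such a coupling check, shown here only as a hedged sketch and not as the authors' procedure, is to run single-modality baselines and keep only the questions that neither baseline answers correctly. The `audio_only` and `video_only` predictors are hypothetical placeholders.

```python
from typing import Callable, Iterable, List

# Hypothetical single-modality predictors: each takes a question and its options
# and returns the index of the chosen option. These are placeholder interfaces,
# not an API published by the WorldSense authors.
Predictor = Callable[[str, List[str]], int]

def requires_both_modalities(question: str,
                             options: List[str],
                             answer_index: int,
                             audio_only: Predictor,
                             video_only: Predictor) -> bool:
    """Return True if neither single-modality baseline solves the question."""
    solved_by_audio = audio_only(question, options) == answer_index
    solved_by_video = video_only(question, options) == answer_index
    return not (solved_by_audio or solved_by_video)

def filter_coupled_questions(items: Iterable[dict],
                             audio_only: Predictor,
                             video_only: Predictor) -> List[dict]:
    """Retain only items whose answers require joint audio-visual reasoning."""
    return [
        item for item in items
        if requires_both_modalities(item["question"], item["options"],
                                    item["answer_index"], audio_only, video_only)
    ]
```
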
Comprehensive evaluation revealing limitations of current MLLMs

The authors conduct extensive experiments on state-of-the-art models, revealing that even the best proprietary model achieves only 65.1% accuracy while open-source models perform near chance level. Through ablation studies and failure analysis, they identify key factors influencing performance and provide actionable insights for improving omnimodal understanding in future models.

Retrieved papers compared: 10
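
To put the reported numbers in context (65.1% best accuracy, open-source models near chance level), the sketch below shows the standard multiple-choice accuracy computation together with a chance-level reference. The assumption of four options per question, and hence a ~25% chance level, is made here for illustration and is not stated in this report.

```python
from typing import List, Sequence

def multiple_choice_accuracy(predictions: Sequence[int],
                             answers: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index matches the key."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def chance_level(num_options_per_question: List[int]) -> float:
    """Expected accuracy of uniform random guessing, given per-question option counts."""
    return sum(1.0 / n for n in num_options_per_question) / len(num_options_per_question)

# Toy illustration with invented numbers: assuming 4 options per question,
# random guessing yields ~25%, so a model scoring near that value has made
# little use of the audio-visual evidence.
preds = [1, 0, 2, 3, 1]
keys  = [1, 2, 2, 0, 1]
print(multiple_choice_accuracy(preds, keys))   # 0.6
print(chance_level([4] * len(keys)))           # 0.25
```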

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: WorldSense benchmark for omnimodal video understanding
Contribution 2: Design principles emphasizing omnimodal collaboration
Contribution 3: Comprehensive evaluation revealing limitations of current MLLMs

Each contribution is described above under Claimed Contributions.