STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: audio understanding, spatio-temporal reasoning, 4D intelligence
Abstract:

Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples: for foundational tasks, procedurally synthesized and physics-simulated audio; for holistic data, a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
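To make the caption-only ablation concrete, the sketch below shows one way such a protocol can be scored: the same questions are answered once from the audio and once from a text caption substituted for the audio, and the gap is reported in percentage points. This is a minimal illustration assuming exact-match multiple-choice scoring; the function names are ours, not the paper's evaluation code.

```python
# Minimal sketch of a caption-only ablation score (illustrative, not the
# paper's code). Predictions are answer strings; scoring is exact match.

def accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def caption_ablation_drop(audio_preds: list[str],
                          caption_preds: list[str],
                          golds: list[str]) -> float:
    """Percentage-point accuracy drop when audio is replaced by its caption."""
    return 100.0 * (accuracy(audio_preds, golds) - accuracy(caption_preds, golds))

# A large positive drop (like the reported 31.5/35.2 points) indicates the
# questions hinge on cues that captions fail to convey.
```

A benchmark whose items survive this test is, by construction, probing cues that are hard to express in language.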

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formalizes audio 4D intelligence as reasoning over sound dynamics in time and 3D space, introducing STAR-Bench to measure this capability through foundational acoustic perception tasks and holistic spatio-temporal reasoning challenges. Within the taxonomy, it occupies the 'Deep Spatio-Temporal Reasoning Benchmarks' leaf under 'Multimodal Spatial Reasoning and Audio-Visual Intelligence', where it is currently the sole paper. This positioning reflects an emerging and relatively sparse research direction focused specifically on benchmark design for evaluating fine-grained acoustic reasoning, distinct from neighboring leaves addressing navigation, question answering, or dense prediction tasks.

The taxonomy reveals that STAR-Bench sits within a broader multimodal reasoning branch containing five other leaves: audio-visual QA and grounding, navigation and mapping, dense prediction, general reasoning frameworks, and saliency modeling. These neighboring directions emphasize task-specific applications or cross-modal integration, whereas STAR-Bench focuses on evaluation infrastructure for spatio-temporal acoustic understanding. The benchmark's emphasis on procedurally synthesized audio and physics simulation connects it to the 'Computational Simulation and Numerical Methods' branch, while its human-validated holistic tasks bridge toward practical audio-visual intelligence applications explored in sibling leaves.

Among the 29 candidates examined across the three contributions, no refutable prior work was identified. For the formalization of audio 4D intelligence, 10 candidates were examined with zero refutations, suggesting this conceptual framing may be novel within the limited search scope. For the benchmark design contribution, 9 candidates were reviewed without finding overlapping work, indicating that the hierarchical task structure combining foundational and holistic settings appears distinctive among the examined papers. For the data curation pipeline, 10 candidates were analyzed with no refutations, though this does not preclude similar validation approaches existing in unexamined literature or in adjacent domains such as computer vision benchmarking.

Based on the limited search scope of 29 semantically similar papers, the work appears to introduce a relatively novel evaluation paradigm within the examined candidate set. The taxonomy structure confirms this is a sparse research direction with no sibling papers in the same leaf, though the broader multimodal reasoning branch contains related work on audio-visual tasks. The analysis does not cover exhaustive literature review across all benchmark design methodologies or human validation protocols in adjacent fields, leaving open questions about potential overlaps beyond the top-K semantic matches examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: reasoning over sound dynamics in time and 3D space. This field spans a broad spectrum of research directions, from foundational acoustic theory and wave propagation principles to practical applications in spatial audio rendering, source localization, and multimodal intelligence. The taxonomy reveals seven major branches:

- Acoustic Field Theory and Wave Propagation: fundamental physics and metamaterial design (e.g., Acoustic Metasurfaces Focusing[3], Nonlinear Acoustic Metamaterials[12])
- Spatial Audio Localization and Source Tracking: identifying and following sound sources in space (e.g., Sound Source Localization[6], Bat Flight Tracking[2])
- Spatial Audio Synthesis and Rendering: creating immersive auditory experiences (e.g., MPEG-H 3D Audio[43], Spatial Audio Future[15])
- Computational Simulation and Numerical Methods: tools for modeling acoustic phenomena (e.g., Space-time Isogeometric[30])
- Multimodal Spatial Reasoning and Audio-Visual Intelligence: integrating sound with visual and other sensory modalities (e.g., Multimodal Spatial Reasoning[4], Soundspaces[24])
- Acoustic Measurement and Monitoring Systems: sensor technologies and environmental monitoring (e.g., Passive Acoustic Sensors[39])
- Cross-Domain Applications: specialized uses spanning medical imaging to architectural acoustics

Within the multimodal branch, a particularly active line of work explores deep spatio-temporal reasoning, where models must jointly understand how sounds evolve over time and move through three-dimensional environments. STAR-Bench[0] situates itself squarely in this emerging area by providing a benchmark for evaluating such capabilities, complementing recent efforts like GTR[5] that tackle geometric and temporal reasoning in audio-visual contexts. While GTR[5] emphasizes geometric transformations and Multimodal Spatial Reasoning[4] explores broader cross-modal integration, STAR-Bench[0] focuses specifically on assessing models' ability to reason about dynamic acoustic scenes. This distinction highlights an open question in the field: how to design evaluation frameworks that capture the full complexity of spatial audio understanding, balancing physical realism with the demands of learning-based systems. The interplay between simulation-driven approaches (e.g., Sonic4D[20]) and data-driven benchmarks remains a key area of exploration.

Claimed Contributions

Formalization of audio 4D intelligence paradigm

The authors introduce a new paradigm, audio 4D intelligence, defined as the ability to perform deep reasoning over the dynamics of sound sources in time (1D) and three-dimensional space (3D), grounded in an understanding of the physical world. This formalization addresses the gap between current audio benchmarks, which focus on text-representable semantics, and real-world auditory intelligence, which requires reasoning about linguistically hard-to-describe acoustic cues. A representation sketch follows below.

Candidates retrieved: 10
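As one concrete reading of the "4D" framing (one temporal dimension plus three spatial ones), the sketch below shows a minimal representation of the kind of object such reasoning ranges over. The structure and names are our illustration and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class SoundEventSample:
    """One time-stamped observation of a sound source: 1D time + 3D position."""
    t: float                          # seconds
    xyz: tuple[float, float, float]   # metres, listener-centred coordinates

@dataclass
class SoundEventTrack:
    """A source's trajectory over time: the '4D' object to reason about."""
    source_label: str
    samples: list[SoundEventSample]

    def displacement(self) -> tuple[float, float, float]:
        """Net movement from the first to the last observation."""
        (x0, y0, z0) = self.samples[0].xyz
        (x1, y1, z1) = self.samples[-1].xyz
        return (x1 - x0, y1 - y0, z1 - z0)
```

Questions about dynamic trajectories then amount to queries over such tracks (e.g., whether a source approaches or recedes), which text captions rarely encode.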
STAR-Bench benchmark with hierarchical task structure

The authors introduce STAR-Bench, a comprehensive benchmark built around a hierarchical task structure with two levels: Foundational Acoustic Perception (evaluating six core audio attributes across absolute perception ranges and relative discrimination sensitivity) and Holistic Spatio-Temporal Reasoning (evaluating temporal reasoning via segment reordering and spatial reasoning covering static localization, multi-source relations, and dynamic trajectories). A sketch of one possible item schema follows below.

Candidates retrieved: 9
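The sketch below shows one plausible way to organize items under this two-level hierarchy. The schema, field names, and multiple-choice format are assumptions for illustration, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from enum import Enum

class Level(Enum):
    FOUNDATIONAL = "foundational_acoustic_perception"
    HOLISTIC = "holistic_spatio_temporal_reasoning"

# Foundational items probe six core attributes under two regimes:
# absolute (estimate a value) and relative (discriminate two stimuli).
FOUNDATIONAL_REGIMES = ("absolute", "relative")

# Holistic items cover temporal reordering plus three spatial task families.
HOLISTIC_TASKS = (
    "segment_reordering",      # temporal: continuous and discrete processes
    "static_localization",     # spatial
    "multi_source_relations",  # spatial
    "dynamic_trajectories",    # spatial
)

@dataclass
class BenchmarkItem:
    level: Level
    task: str                  # a regime or an entry of HOLISTIC_TASKS
    audio_path: str
    question: str
    choices: list[str] = field(default_factory=list)
    answer_index: int = 0
```

Keeping the level and task explicit on every item makes it easy to report the per-capability breakdowns the benchmark is designed around.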
Rigorous data curation pipeline with human validation

The authors develop a rigorous data curation pipeline that combines procedurally synthesized and physics-simulated audio for foundational perception tasks with a four-stage process for holistic reasoning tasks: taxonomy construction, AI-assisted automated filtering, human annotation with quality control, and final validation via human-performance evaluation, ensuring all benchmark items are fair, unambiguous, and solvable by humans. A pipeline sketch follows below.

Candidates retrieved: 10
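The sketch below lays the four stages out as a linear flow, with stages 1-3 stubbed and stage 4 retaining only items that humans reliably solve. All function bodies, the filename conventions, and the 0.8 threshold are invented for illustration; only the four-stage outline comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class Item:
    clip_id: str
    question: str
    answer: str
    human_accuracy: float = 0.0   # measured in stage 4

def stage1_taxonomy(clips: list[str]) -> list[str]:
    """Taxonomy construction (stub): derive categories from clip ids."""
    return sorted({c.split("/")[0] for c in clips})

def stage2_ai_filter(clips: list[str]) -> list[str]:
    """AI-assisted automated filtering (stub): drop obviously unusable clips."""
    return [c for c in clips if not c.endswith("_noisy.wav")]

def stage3_annotate(clips: list[str]) -> list[Item]:
    """Human annotation with quality control (stub)."""
    return [Item(c, f"What happens in {c}?", "placeholder") for c in clips]

def stage4_validate(items: list[Item], threshold: float = 0.8) -> list[Item]:
    """Final selection: keep only items humans reliably solve, so every
    retained item is fair, unambiguous, and answerable (threshold invented)."""
    return [it for it in items if it.human_accuracy >= threshold]
```

The key design choice is the last stage: human performance acts as the ground-truth filter, so no item enters the benchmark unless people can solve it.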

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated; this is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Formalization of audio 4D intelligence paradigm
10 candidate papers were compared; none refuted this contribution (full description under Claimed Contributions above).

Contribution 2: STAR-Bench benchmark with hierarchical task structure
9 candidate papers were compared; none refuted this contribution.

Contribution 3: Rigorous data curation pipeline with human validation
10 candidate papers were compared; none refuted this contribution.