3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: intuitive physics, cognition, point tracking, autoencoder, generative video modeling
Abstract:

AI video generation is evolving rapidly. For video generators to be useful in applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- it requires human annotation or bespoke evaluation datasets with restricted scope. Here we develop an automated evaluation framework for video realism that captures both semantics and coherent 3D structure and does not require access to a reference video. Our method, 3DSPA, is a 3D semantic point autoencoder that integrates 3D point trajectories, depth cues, and DINOv2 semantic features into a unified representation for video evaluation. 3DSPA models both how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos that violate physical laws, is more sensitive to motion artifacts than baseline metrics, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models and implicitly captures physical-rule violations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces 3DSPA, a 3D semantic point autoencoder that evaluates video realism by integrating point trajectories, depth cues, and DINOv2 semantic features. Within the taxonomy, it occupies the sole position in the 'Automated Video Realism Evaluation via 3D Semantic Trajectories' leaf under 'Video Quality Assessment and Realism Evaluation'. This leaf currently contains no sibling papers, indicating a sparse research direction. The broader taxonomy includes only three other papers across neighboring leaves, suggesting the field of automated 3D-semantic-trajectory-based realism evaluation is still emerging.

The taxonomy reveals three main branches: quality assessment, motion control in synthesis, and trajectory reconstruction. The original paper sits in the first branch, while related work like Multi-Entity 3D Motion Manipulation (one paper) addresses generation control, and Monocular 3D Trajectory Reconstruction (one paper) plus Semantic Violation Detection (one paper) focus on extracting or detecting motion patterns. The scope notes clarify that 3DSPA differs by unifying trajectory analysis with realism scoring, rather than separating reconstruction from evaluation. This positioning suggests the work bridges gaps between traditionally distinct tasks.

Among thirty candidates examined, none were found to refute any of the three contributions. Contribution A (3DSPA framework) examined ten candidates with zero refutable matches; Contribution B (3D point tracking capability) and Contribution C (physical law violation detection) each examined ten candidates with identical results. The limited search scope means these statistics reflect top-thirty semantic matches and their citations, not an exhaustive survey. All three contributions appear novel within this bounded literature search, though the small candidate pool and sparse taxonomy suggest the field lacks extensive prior work to compare against.

Given the sparse taxonomy structure and absence of sibling papers, the work appears to occupy a relatively unexplored niche. The limited search scope (thirty candidates) and zero refutable pairs indicate either genuine novelty or insufficient coverage of adjacent literature. The taxonomy's small size (three papers besides the original) reinforces that automated 3D-semantic-trajectory-based realism evaluation remains an early-stage research direction, making definitive novelty claims difficult without broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 3
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: automated evaluation of video realism using 3D semantic point trajectories. This emerging field sits at the intersection of video quality assessment, 3D motion synthesis, and trajectory reconstruction. The taxonomy reflects three main branches that together address how to generate, analyze, and evaluate realistic video content.

The first branch, Video Quality Assessment and Realism Evaluation, focuses on metrics and frameworks for judging whether synthesized or edited videos appear plausible to human observers, often leveraging perceptual models or learned features. The second branch, 3D Motion Control and Generation in Video Synthesis, emphasizes methods that produce or manipulate video by explicitly modeling three-dimensional scene dynamics, camera motion, and object trajectories. The third branch, 3D Trajectory Reconstruction and Semantic Detection, deals with extracting and interpreting motion paths from video data, including techniques for recovering depth, tracking points over time, and associating semantic labels with detected trajectories.

Within this landscape, a handful of works have begun to bridge trajectory reconstruction and quality evaluation by treating 3D motion consistency as a key indicator of realism. For instance, 3DTrajMaster[1] demonstrates how mastering trajectory representations can improve downstream video understanding tasks, while earlier efforts like Planarity Trajectory Reconstruction[3] laid groundwork for recovering geometric structure from motion cues. The original paper, 3DSPA[0], occupies a distinctive position by directly coupling semantic point trajectories with automated realism scoring, rather than treating trajectory extraction and quality assessment as separate stages.
This approach contrasts with traditional perceptual metrics that rely primarily on pixel-level or feature-level comparisons, and instead leverages the geometric and semantic coherence of tracked points to judge whether a video obeys physical plausibility. By doing so, 3DSPA[0] addresses a gap between purely appearance-based quality measures and the deeper structural cues that human viewers use to detect unrealistic motion.

Claimed Contributions

3DSPA: A 3D Semantic Point Autoencoder for Video Realism Evaluation

The authors introduce 3DSPA, an autoencoder-based framework that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation. This enables automated evaluation of video realism, temporal consistency, and physical plausibility without requiring reference videos.

10 retrieved papers
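To make the claimed unified representation concrete, here is a minimal sketch of how per-point trajectories, depth, and semantic features could be fused into one token per tracked point before autoencoding. All names, shapes, and the simple concatenation scheme are assumptions for illustration; the paper's actual architecture is not specified in this report.

```python
import numpy as np

def build_point_tokens(tracks, depths, feats):
    """Fuse geometry and semantics into one token per tracked point.

    tracks: (N, T, 2) pixel coordinates of N points over T frames (assumed)
    depths: (N, T)    depth sampled at each point per frame (assumed)
    feats:  (N, D)    a DINOv2-style semantic feature per point (assumed
                      frame-invariant in this toy sketch)
    returns: (N, T*3 + D) flat tokens, one per point
    """
    n_pts, n_frames, _ = tracks.shape
    # Stack (x, y, depth) per frame, then flatten the time axis.
    geo = np.concatenate([tracks, depths[..., None]], axis=-1)  # (N, T, 3)
    geo = geo.reshape(n_pts, n_frames * 3)
    # Append the semantic feature to the geometric trajectory.
    return np.concatenate([geo, feats], axis=-1)

rng = np.random.default_rng(0)
tokens = build_point_tokens(rng.normal(size=(4, 8, 2)),
                            rng.normal(size=(4, 8)),
                            rng.normal(size=(4, 16)))
print(tokens.shape)  # (4, 40): 8 frames * 3 coords + 16 feature dims
```

The design choice sketched here (late concatenation of geometry and semantics) is only one plausible reading of "unified representation"; cross-attention or per-frame feature sampling would be equally consistent with the description above.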
Demonstration of 3DSPA as a Capable 3D Point Tracker

The authors show that 3DSPA can accurately reconstruct 3D point tracks even with its compressed latent representation, achieving performance comparable to state-of-the-art tracking methods on the TAPVid-3D benchmark.

10 retrieved papers
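The tracking claim above can be illustrated with a simplified threshold-accuracy metric in the spirit of TAPVid-3D evaluation: the fraction of reconstructed 3D points lying within a set of distance thresholds of the ground-truth track. The thresholds and array shapes below are illustrative assumptions, not the benchmark's exact protocol.

```python
import numpy as np

def fraction_within(pred, gt, thresholds=(0.05, 0.10, 0.20)):
    """Mean fraction of predicted 3D points whose Euclidean error to the
    ground-truth track falls below each threshold, averaged over
    thresholds. A simplified stand-in for TAPVid-3D-style accuracy.

    pred, gt: (N, T, 3) arrays of 3D point positions per frame (assumed).
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # (N, T) per-point errors
    return float(np.mean([(err < t).mean() for t in thresholds]))

gt = np.zeros((2, 5, 3))
perfect = gt.copy()
off = gt + 0.15  # every point ~0.26 away in Euclidean distance
print(fraction_within(perfect, gt))  # 1.0
```

A reconstruction that survives compression through a latent bottleneck would be judged "comparable to state of the art" if its score under such a metric stays close to that of dedicated trackers.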
Reliable Detection of Physical Law Violations

The authors demonstrate that 3DSPA can distinguish between physically possible and impossible events in synthetic videos, outperforming vision-language models and other baselines in detecting violations of object permanence, immutability, continuity, and solidity.

10 retrieved papers
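One common way such an autoencoder can flag impossible events is via reconstruction error: if the model is trained only on physically plausible motion, trajectories that teleport or vanish reconstruct poorly. The scoring rule and the toy "model" below are illustrative assumptions; the report does not state how 3DSPA actually derives its violation score.

```python
import numpy as np

def plausibility_score(tokens, reconstruct):
    """Mean squared reconstruction error of trajectory tokens.
    Higher error => less physically plausible, under the assumption
    that the autoencoder was trained only on realistic motion."""
    recon = reconstruct(tokens)
    return float(np.mean((tokens - recon) ** 2))

# Toy stand-in for a trained autoencoder: shrinks inputs toward the mean,
# so smooth, low-variance trajectories reconstruct almost perfectly.
reconstruct = lambda x: 0.9 * x + 0.1 * x.mean()

rng = np.random.default_rng(1)
smooth = np.cumsum(rng.normal(scale=0.01, size=(8, 30)), axis=1)  # coherent tracks
jumpy = smooth.copy()
jumpy[:, 15:] += 5.0  # points "teleport" mid-sequence (continuity violation)

print(plausibility_score(smooth, reconstruct) < plausibility_score(jumpy, reconstruct))  # True
```

Thresholding such a score (or comparing it across candidate videos) would yield the binary possible/impossible decisions that the report says 3DSPA makes more reliably than vision-language-model baselines.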

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

3DSPA: A 3D Semantic Point Autoencoder for Video Realism Evaluation

The authors introduce 3DSPA, an autoencoder-based framework that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation. This enables automated evaluation of video realism, temporal consistency, and physical plausibility without requiring reference videos.

Contribution

Demonstration of 3DSPA as a Capable 3D Point Tracker

The authors show that 3DSPA can accurately reconstruct 3D point tracks even with its compressed latent representation, achieving performance comparable to state-of-the-art tracking methods on the TAPVid-3D benchmark.

Contribution

Reliable Detection of Physical Law Violations

The authors demonstrate that 3DSPA can distinguish between physically possible and impossible events in synthetic videos, outperforming vision-language models and other baselines in detecting violations of object permanence, immutability, continuity, and solidity.