3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: intuitive physics, cognition, point tracking, autoencoder, generative video modeling
Abstract:

AI video generation is evolving rapidly. For video generators to be useful in applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- it requires human annotation or bespoke evaluation datasets with restricted scope. Here we develop an automated evaluation framework for video realism that captures both semantics and coherent 3D structure and does not require access to a reference video. Our method, 3DSPA, is a 3D semantic point autoencoder that integrates 3D point trajectories, depth cues, and DINOv2 semantic features into a unified representation for video evaluation. 3DSPA models both how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos that violate physical laws, is more sensitive to motion artifacts than baseline metrics, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models and implicitly captures physical-rule violations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces 3DSPA, a 3D semantic point autoencoder that evaluates video realism by integrating point trajectories, depth cues, and DINOv2 semantic features. Within the taxonomy, it occupies the sole position in the 'Automated Video Realism Evaluation via 3D Semantic Trajectories' leaf under 'Video Quality Assessment and Realism Evaluation'. This leaf currently contains no sibling papers, indicating a sparse research direction. The broader taxonomy includes only three other papers across neighboring leaves, suggesting the field of automated 3D-semantic-trajectory-based realism evaluation is still emerging.

The taxonomy reveals three main branches: quality assessment, motion control in synthesis, and trajectory reconstruction. The original paper sits in the first branch, while related work like Multi-Entity 3D Motion Manipulation (one paper) addresses generation control, and Monocular 3D Trajectory Reconstruction (one paper) plus Semantic Violation Detection (one paper) focus on extracting or detecting motion patterns. The scope notes clarify that 3DSPA differs by unifying trajectory analysis with realism scoring, rather than separating reconstruction from evaluation. This positioning suggests the work bridges gaps between traditionally distinct tasks.

Among thirty candidates examined, none were found to refute any of the three contributions. Contribution A (3DSPA framework) examined ten candidates with zero refutable matches; Contribution B (3D point tracking capability) and Contribution C (physical law violation detection) each examined ten candidates with identical results. The limited search scope means these statistics reflect top-thirty semantic matches and their citations, not an exhaustive survey. All three contributions appear novel within this bounded literature search, though the small candidate pool and sparse taxonomy suggest the field lacks extensive prior work to compare against.

Given the sparse taxonomy structure and absence of sibling papers, the work appears to occupy a relatively unexplored niche. The limited search scope (thirty candidates) and zero refutable pairs indicate either genuine novelty or insufficient coverage of adjacent literature. The taxonomy's small size (three papers besides the original) reinforces that automated 3D-semantic-trajectory-based realism evaluation remains an early-stage research direction, making definitive novelty claims difficult without broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 3
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: automated evaluation of video realism using 3D semantic point trajectories. This emerging field sits at the intersection of video quality assessment, 3D motion synthesis, and trajectory reconstruction. The taxonomy reflects three main branches that together address how to generate, analyze, and evaluate realistic video content.

The first branch, Video Quality Assessment and Realism Evaluation, focuses on metrics and frameworks for judging whether synthesized or edited videos appear plausible to human observers, often leveraging perceptual models or learned features. The second branch, 3D Motion Control and Generation in Video Synthesis, emphasizes methods that produce or manipulate video by explicitly modeling three-dimensional scene dynamics, camera motion, and object trajectories. The third branch, 3D Trajectory Reconstruction and Semantic Detection, deals with extracting and interpreting motion paths from video data, including techniques for recovering depth, tracking points over time, and associating semantic labels with detected trajectories.

Within this landscape, a handful of works have begun to bridge trajectory reconstruction and quality evaluation by treating 3D motion consistency as a key indicator of realism. For instance, 3DTrajMaster[1] demonstrates how mastering trajectory representations can improve downstream video understanding tasks, while earlier efforts like Planarity Trajectory Reconstruction[3] laid groundwork for recovering geometric structure from motion cues. The original paper, 3DSPA[0], occupies a distinctive position by directly coupling semantic point trajectories with automated realism scoring, rather than treating trajectory extraction and quality assessment as separate stages.
This approach contrasts with traditional perceptual metrics that rely primarily on pixel-level or feature-level comparisons, and instead leverages the geometric and semantic coherence of tracked points to judge whether a video obeys physical plausibility. By doing so, 3DSPA[0] addresses a gap between purely appearance-based quality measures and the deeper structural cues that human viewers use to detect unrealistic motion.

Claimed Contributions

3DSPA: A 3D Semantic Point Autoencoder for Video Realism Evaluation

The authors introduce 3DSPA, an autoencoder-based framework that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation. This enables automated evaluation of video realism, temporal consistency, and physical plausibility without requiring reference videos.

10 retrieved papers
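To make the claimed unified representation concrete, here is a minimal sketch of how per-point trajectories, depth, and semantic features could be fused into one token per tracked point before autoencoding. All names, shapes, and the simple concatenation scheme are assumptions for illustration; the paper's actual architecture is not specified in this report.

```python
import numpy as np

def build_point_tokens(tracks, depths, feats):
    """Fuse geometry and semantics into one token per tracked point.

    tracks: (N, T, 2) pixel coordinates of N points over T frames (assumed)
    depths: (N, T)    depth sampled at each point per frame (assumed)
    feats:  (N, D)    a DINOv2-style semantic feature per point (assumed
                      frame-invariant in this toy sketch)
    returns: (N, T*3 + D) flat tokens, one per point
    """
    n_pts, n_frames, _ = tracks.shape
    # Stack (x, y, depth) per frame, then flatten the time axis.
    geo = np.concatenate([tracks, depths[..., None]], axis=-1)  # (N, T, 3)
    geo = geo.reshape(n_pts, n_frames * 3)
    # Append the semantic feature to the geometric trajectory.
    return np.concatenate([geo, feats], axis=-1)

rng = np.random.default_rng(0)
tokens = build_point_tokens(rng.normal(size=(4, 8, 2)),
                            rng.normal(size=(4, 8)),
                            rng.normal(size=(4, 16)))
print(tokens.shape)  # (4, 40): 8 frames * 3 coords + 16 feature dims
```

The design choice sketched here (late concatenation of geometry and semantics) is only one plausible reading of "unified representation"; cross-attention or per-frame feature sampling would be equally consistent with the description above.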
Demonstration of 3DSPA as a Capable 3D Point Tracker

The authors show that 3DSPA can accurately reconstruct 3D point tracks even with its compressed latent representation, achieving performance comparable to state-of-the-art tracking methods on the TAPVid-3D benchmark.

10 retrieved papers
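The tracking claim above can be illustrated with a simplified threshold-accuracy metric in the spirit of TAPVid-3D evaluation: the fraction of reconstructed 3D points lying within a set of distance thresholds of the ground-truth track. The thresholds and array shapes below are illustrative assumptions, not the benchmark's exact protocol.

```python
import numpy as np

def fraction_within(pred, gt, thresholds=(0.05, 0.10, 0.20)):
    """Mean fraction of predicted 3D points whose Euclidean error to the
    ground-truth track falls below each threshold, averaged over
    thresholds. A simplified stand-in for TAPVid-3D-style accuracy.

    pred, gt: (N, T, 3) arrays of 3D point positions per frame (assumed).
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # (N, T) per-point errors
    return float(np.mean([(err < t).mean() for t in thresholds]))

gt = np.zeros((2, 5, 3))
perfect = gt.copy()
off = gt + 0.15  # every point ~0.26 away in Euclidean distance
print(fraction_within(perfect, gt))  # 1.0
```

A reconstruction that survives compression through a latent bottleneck would be judged "comparable to state of the art" if its score under such a metric stays close to that of dedicated trackers.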
Reliable Detection of Physical Law Violations

The authors demonstrate that 3DSPA can distinguish between physically possible and impossible events in synthetic videos, outperforming vision-language models and other baselines in detecting violations of object permanence, immutability, continuity, and solidity.

10 retrieved papers
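One common way such an autoencoder can flag impossible events is via reconstruction error: if the model is trained only on physically plausible motion, trajectories that teleport or vanish reconstruct poorly. The scoring rule and the toy "model" below are illustrative assumptions; the report does not state how 3DSPA actually derives its violation score.

```python
import numpy as np

def plausibility_score(tokens, reconstruct):
    """Mean squared reconstruction error of trajectory tokens.
    Higher error => less physically plausible, under the assumption
    that the autoencoder was trained only on realistic motion."""
    recon = reconstruct(tokens)
    return float(np.mean((tokens - recon) ** 2))

# Toy stand-in for a trained autoencoder: shrinks inputs toward the mean,
# so smooth, low-variance trajectories reconstruct almost perfectly.
reconstruct = lambda x: 0.9 * x + 0.1 * x.mean()

rng = np.random.default_rng(1)
smooth = np.cumsum(rng.normal(scale=0.01, size=(8, 30)), axis=1)  # coherent tracks
jumpy = smooth.copy()
jumpy[:, 15:] += 5.0  # points "teleport" mid-sequence (continuity violation)

print(plausibility_score(smooth, reconstruct) < plausibility_score(jumpy, reconstruct))  # True
```

Thresholding such a score (or comparing it across candidate videos) would yield the binary possible/impossible decisions that the report says 3DSPA makes more reliably than vision-language-model baselines.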

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

3DSPA: A 3D Semantic Point Autoencoder for Video Realism Evaluation

The authors introduce 3DSPA, an autoencoder-based framework that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation. This enables automated evaluation of video realism, temporal consistency, and physical plausibility without requiring reference videos.

Contribution

Demonstration of 3DSPA as a Capable 3D Point Tracker

The authors show that 3DSPA can accurately reconstruct 3D point tracks even with its compressed latent representation, achieving performance comparable to state-of-the-art tracking methods on the TAPVid-3D benchmark.

Contribution

Reliable Detection of Physical Law Violations

The authors demonstrate that 3DSPA can distinguish between physically possible and impossible events in synthetic videos, outperforming vision-language models and other baselines in detecting violations of object permanence, immutability, continuity, and solidity.