What Happens Next? Anticipating Future Motion by Generating Point Trajectories

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: motion generation, point trajectories, flow matching
Abstract:

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formulates motion forecasting from a single image as conditional generation of dense trajectory grids, positioning itself within the Dense Trajectory and Pixel-Level Motion Prediction leaf of the taxonomy. This leaf contains only three papers total, including the original work, indicating a relatively sparse research direction. The sibling papers explore related themes: one addresses uncertain future prediction with stochastic motion models, while another examines particle-based video representations. This small cluster suggests the specific combination of single-image input, dense trajectory output, and generative modeling remains underexplored compared to video-based tracking branches, which contain significantly more papers.

The taxonomy reveals that neighboring research directions are substantially more populated. The Point Tracking in Videos branch contains multiple well-developed subcategories with works like TAPIR and TAPNext that leverage temporal sequences for correspondence. The Multi-Object Tracking subtree addresses entity-level forecasting with learned motion predictors and appearance-based methods. The paper's approach diverges from these by eliminating temporal input entirely, instead inferring motion from static visual cues alone. This boundary is reinforced by the taxonomy's explicit exclusion notes, which separate single-image methods from video-based tracking and multi-frame prediction categories, highlighting the distinct challenge of forecasting without observing actual motion.

Among the 19 candidates examined through semantic search, none clearly refute the three main contributions. For the first contribution, formulating motion forecasting as dense trajectory generation, 10 candidates were examined with no refutable overlaps found. For the second, comparing trajectory generation against regressors and video generators, 4 candidates were examined, again with no clear prior work. For the third, analyzing pixel-generation overhead in video models, 5 candidates were examined without finding substantial precedent. This suggests the specific framing and comparative analysis may be novel within the examined literature, though the small candidate pool (19 papers) means potentially relevant work outside the top semantic matches remains unassessed.

The analysis indicates the work occupies a sparsely populated research direction, with its taxonomy leaf containing minimal prior work and neighboring branches focusing on fundamentally different input modalities. The absence of refutable candidates across 19 examined papers suggests novelty in the specific formulation and comparative insights, though this conclusion is constrained by the limited search scope. A more exhaustive literature review covering broader semantic neighborhoods or citation networks might reveal additional relevant precedents, particularly in adjacent areas like image-to-video generation or probabilistic motion modeling that were less thoroughly explored in this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: forecasting motion from a single image using point trajectories. This field addresses the challenge of predicting how points in a static image will move over time, a problem that spans computer vision, robotics, and autonomous systems. The taxonomy reveals a diverse landscape organized into seven main branches. Single-Image Motion Forecasting and Trajectory Generation focuses on methods that extract future motion directly from still frames, often producing dense pixel-level or sparse point trajectories. Point Tracking in Videos and Temporal Sequences emphasizes techniques like TAPIR[47] and TAPNext[44] that follow points across video frames, building temporal correspondences. The Multi-Object Tracking with Motion Prediction and Human and Pedestrian Trajectory Prediction branches address entity-level forecasting, where works such as Pedestrian Trajectory Prediction[7] model social interactions and scene context. Autonomous Driving and Vehicle Motion Planning applies these ideas to navigation and safety-critical scenarios, while Specialized Tracking and Prediction Applications explores domain-specific problems ranging from robotic manipulation to biological movement analysis. Finally, Tracking Infrastructure and Algorithmic Techniques provides foundational methods for association, occlusion handling, and efficient computation.

Several active lines of work reveal key trade-offs between dense versus sparse representations, single-frame versus temporal modeling, and deterministic versus probabilistic forecasting. Early efforts like Uncertain Future[5] and Particle Video[14] explored stochastic motion from limited observations, while recent approaches increasingly leverage learned features and diffusion models, as seen in MoTDiff[15].

Anticipating Future Motion[0] sits within the Dense Trajectory and Pixel-Level Motion Prediction cluster, emphasizing the generation of detailed point trajectories from a single image without relying on video sequences. This contrasts with video-based trackers like Motiontrack[3], which exploit temporal continuity, and with probabilistic frameworks that model multiple plausible futures. The work shares thematic connections with Single Frame Prediction[37] in its reliance on static input, yet distinguishes itself by focusing on trajectory-level rather than frame-level synthesis, positioning it at the intersection of generative modeling and motion understanding.

Claimed Contributions

Formulation of motion forecasting as conditional generation of dense trajectory grids

The authors propose a new formulation for predicting future motion from a single image by generating dense trajectory grids rather than regressing trajectories or generating RGB pixels. This approach models scene-wide dynamics and uncertainty using flow matching in a latent space learned by a trajectory variational autoencoder.

10 retrieved papers
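The flow-matching objective referenced in this contribution can be sketched in a few lines. This is a minimal illustration only: the latent dimensionality, batch size, and the assumption of unit-Gaussian trajectory-VAE latents are hypothetical choices, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16  # assumed size of a trajectory-VAE latent (illustrative)


def flow_matching_targets(z1, rng):
    """Build one flow-matching training pair for a batch of clean latents z1.

    x_t interpolates linearly between Gaussian noise z0 and data z1;
    the regression target is the constant velocity z1 - z0.
    """
    z0 = rng.standard_normal(z1.shape)       # noise endpoint of the path
    t = rng.uniform(size=(z1.shape[0], 1))   # per-example time in [0, 1]
    xt = (1.0 - t) * z0 + t * z1             # point on the probability path
    v_target = z1 - z0                       # velocity the model regresses
    return xt, t, v_target


def fm_loss(v_pred, v_target):
    """Mean-squared flow-matching loss."""
    return float(np.mean((v_pred - v_target) ** 2))


# Toy check: a perfect predictor gives zero loss; a zero predictor's loss
# approaches E[(z1 - z0)^2] = 2 for unit-Gaussian data and noise.
z1 = rng.standard_normal((256, LATENT_DIM))
xt, t, v_target = flow_matching_targets(z1, rng)
zero_model_loss = fm_loss(np.zeros_like(v_target), v_target)
```

In the paper's formulation the predictor would be a transformer conditioned on the input image, and sampling would integrate the learned velocity field from noise to a trajectory latent; the snippet above only shows the training-target construction.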
Demonstration that trajectory generation outperforms prior regressors and video generators

The authors demonstrate through experiments that their generative approach to trajectory prediction surpasses both regression-based trajectory forecasters and large-scale pretrained video generators, even when the latter are fine-tuned on the target domain. They attribute this to modeling uncertainty and reasoning about the entire scene jointly.

4 retrieved papers
Analysis showing pixel generation overhead limits motion forecasting in video generators

The authors provide experimental evidence that state-of-the-art video generators struggle with motion forecasting not due to lack of world knowledge, but because generating RGB pixels introduces overhead that reduces focus on motion accuracy and physical plausibility. They demonstrate this by ablating output modality while keeping architecture fixed.

5 retrieved papers
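The claimed pixel-generation overhead can be illustrated with a back-of-the-envelope count of output values per sample. The resolution, horizon, and grid stride below are assumptions chosen for illustration, not the paper's experimental settings.

```python
# Compare output dimensionality of the two modalities the analysis contrasts:
# generating RGB frames versus generating a dense grid of point trajectories.

H, W, T = 256, 256, 16      # assumed video resolution and forecast horizon
GRID_STRIDE = 8             # assumed spacing of the tracked point grid


def pixel_output_dims(h, w, t, channels=3):
    """Values a video generator must produce: t full RGB frames."""
    return h * w * t * channels


def trajectory_output_dims(h, w, t, stride):
    """Values a trajectory generator must produce: an (h/stride x w/stride)
    grid of points, each with an (x, y) position per future timestep."""
    return (h // stride) * (w // stride) * t * 2


pixels = pixel_output_dims(H, W, T)
tracks = trajectory_output_dims(H, W, T, GRID_STRIDE)
ratio = pixels / tracks  # how much more the pixel model must generate
```

Under these assumed settings the pixel modality emits roughly two orders of magnitude more values per sample, most of them appearance rather than motion, which is consistent with (though not proof of) the authors' claim that pixel synthesis diverts capacity from motion accuracy.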

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Formulation of motion forecasting as conditional generation of dense trajectory grids


Contribution

Demonstration that trajectory generation outperforms prior regressors and video generators


Contribution

Analysis showing pixel generation overhead limits motion forecasting in video generators
