What Happens Next? Anticipating Future Motion by Generating Point Trajectories

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: motion generation, point trajectories, flow matching
Abstract:

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formulates motion forecasting from a single image as conditional generation of dense trajectory grids, positioning itself within the Dense Trajectory and Pixel-Level Motion Prediction leaf of the taxonomy. This leaf contains only three papers total, including the original work, indicating a relatively sparse research direction. The sibling papers explore related themes: one addresses uncertain future prediction with stochastic motion models, while another examines particle-based video representations. This small cluster suggests the specific combination of single-image input, dense trajectory output, and generative modeling remains underexplored compared to video-based tracking branches, which contain significantly more papers.

The taxonomy reveals that neighboring research directions are substantially more populated. The Point Tracking in Videos branch contains multiple well-developed subcategories with works like TAPIR and TAPNext that leverage temporal sequences for correspondence. The Multi-Object Tracking subtree addresses entity-level forecasting with learned motion predictors and appearance-based methods. The paper's approach diverges from these by eliminating temporal input entirely, instead inferring motion from static visual cues alone. This boundary is reinforced by the taxonomy's explicit exclusion notes, which separate single-image methods from video-based tracking and multi-frame prediction categories, highlighting the distinct challenge of forecasting without observing actual motion.

Among the 19 candidates examined through semantic search, none clearly refute the three main contributions. For the first contribution, formulating motion forecasting as dense trajectory generation, 10 candidates were examined with no refutable overlaps found. For the second, comparing trajectory generation against regressors and video generators, 4 candidates were examined, again with no clear prior work. For the third, analyzing pixel-generation overhead in video models, 5 candidates were examined without finding substantial precedent. This suggests the specific framing and comparative analysis may be novel within the examined literature, though the small candidate pool (19 papers) means potentially relevant work outside the top semantic matches remains unassessed.

The analysis indicates the work occupies a sparsely populated research direction, with its taxonomy leaf containing minimal prior work and neighboring branches focusing on fundamentally different input modalities. The absence of refutable candidates across 19 examined papers suggests novelty in the specific formulation and comparative insights, though this conclusion is constrained by the limited search scope. A more exhaustive literature review covering broader semantic neighborhoods or citation networks might reveal additional relevant precedents, particularly in adjacent areas like image-to-video generation or probabilistic motion modeling that were less thoroughly explored in this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: forecasting motion from a single image using point trajectories. This field addresses the challenge of predicting how points in a static image will move over time, a problem that spans computer vision, robotics, and autonomous systems. The taxonomy reveals a diverse landscape organized into seven main branches. Single-Image Motion Forecasting and Trajectory Generation focuses on methods that extract future motion directly from still frames, often producing dense pixel-level or sparse point trajectories. Point Tracking in Videos and Temporal Sequences emphasizes techniques like TAPIR[47] and TAPNext[44] that follow points across video frames, building temporal correspondences. The Multi-Object Tracking with Motion Prediction and Human and Pedestrian Trajectory Prediction branches address entity-level forecasting, where works such as Pedestrian Trajectory Prediction[7] model social interactions and scene context. Autonomous Driving and Vehicle Motion Planning applies these ideas to navigation and safety-critical scenarios, while Specialized Tracking and Prediction Applications explores domain-specific problems ranging from robotic manipulation to biological movement analysis. Finally, Tracking Infrastructure and Algorithmic Techniques provides foundational methods for association, occlusion handling, and efficient computation.

Several active lines of work reveal key trade-offs between dense versus sparse representations, single-frame versus temporal modeling, and deterministic versus probabilistic forecasting. Early efforts like Uncertain Future[5] and Particle Video[14] explored stochastic motion from limited observations, while recent approaches increasingly leverage learned features and diffusion models, as seen in MoTDiff[15].

Anticipating Future Motion[0] sits within the Dense Trajectory and Pixel-Level Motion Prediction cluster, emphasizing the generation of detailed point trajectories from a single image without relying on video sequences. This contrasts with video-based trackers like Motiontrack[3], which exploit temporal continuity, and with probabilistic frameworks that model multiple plausible futures. The work shares thematic connections with Single Frame Prediction[37] in its reliance on static input, yet distinguishes itself by focusing on trajectory-level rather than frame-level synthesis, positioning it at the intersection of generative modeling and motion understanding.

Claimed Contributions

Formulation of motion forecasting as conditional generation of dense trajectory grids

The authors propose a new formulation for predicting future motion from a single image by generating dense trajectory grids rather than regressing trajectories or generating RGB pixels. This approach models scene-wide dynamics and uncertainty using flow matching in a latent space learned by a trajectory variational autoencoder.

10 retrieved papers
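The flow-matching objective referenced in this contribution can be sketched in a few lines. This is a minimal illustration only: the latent dimensionality, batch size, and the assumption of unit-Gaussian trajectory-VAE latents are hypothetical choices, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16  # assumed size of a trajectory-VAE latent (illustrative)


def flow_matching_targets(z1, rng):
    """Build one flow-matching training pair for a batch of clean latents z1.

    x_t interpolates linearly between Gaussian noise z0 and data z1;
    the regression target is the constant velocity z1 - z0.
    """
    z0 = rng.standard_normal(z1.shape)       # noise endpoint of the path
    t = rng.uniform(size=(z1.shape[0], 1))   # per-example time in [0, 1]
    xt = (1.0 - t) * z0 + t * z1             # point on the probability path
    v_target = z1 - z0                       # velocity the model regresses
    return xt, t, v_target


def fm_loss(v_pred, v_target):
    """Mean-squared flow-matching loss."""
    return float(np.mean((v_pred - v_target) ** 2))


# Toy check: a perfect predictor gives zero loss; a zero predictor's loss
# approaches E[(z1 - z0)^2] = 2 for unit-Gaussian data and noise.
z1 = rng.standard_normal((256, LATENT_DIM))
xt, t, v_target = flow_matching_targets(z1, rng)
zero_model_loss = fm_loss(np.zeros_like(v_target), v_target)
```

In the paper's formulation the predictor would be a transformer conditioned on the input image, and sampling would integrate the learned velocity field from noise to a trajectory latent; the snippet above only shows the training-target construction.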
Demonstration that trajectory generation outperforms prior regressors and video generators

The authors demonstrate through experiments that their generative approach to trajectory prediction surpasses both regression-based trajectory forecasters and large-scale pretrained video generators, even when the latter are fine-tuned on the target domain. They attribute this to modeling uncertainty and reasoning about the entire scene jointly.

4 retrieved papers
Analysis showing pixel generation overhead limits motion forecasting in video generators

The authors provide experimental evidence that state-of-the-art video generators struggle with motion forecasting not due to lack of world knowledge, but because generating RGB pixels introduces overhead that reduces focus on motion accuracy and physical plausibility. They demonstrate this by ablating output modality while keeping architecture fixed.

5 retrieved papers
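The claimed pixel-generation overhead can be illustrated with a back-of-the-envelope count of output values per sample. The resolution, horizon, and grid stride below are assumptions chosen for illustration, not the paper's experimental settings.

```python
# Compare output dimensionality of the two modalities the analysis contrasts:
# generating RGB frames versus generating a dense grid of point trajectories.

H, W, T = 256, 256, 16      # assumed video resolution and forecast horizon
GRID_STRIDE = 8             # assumed spacing of the tracked point grid


def pixel_output_dims(h, w, t, channels=3):
    """Values a video generator must produce: t full RGB frames."""
    return h * w * t * channels


def trajectory_output_dims(h, w, t, stride):
    """Values a trajectory generator must produce: an (h/stride x w/stride)
    grid of points, each with an (x, y) position per future timestep."""
    return (h // stride) * (w // stride) * t * 2


pixels = pixel_output_dims(H, W, T)
tracks = trajectory_output_dims(H, W, T, GRID_STRIDE)
ratio = pixels / tracks  # how much more the pixel model must generate
```

Under these assumed settings the pixel modality emits roughly two orders of magnitude more values per sample, most of them appearance rather than motion, which is consistent with (though not proof of) the authors' claim that pixel synthesis diverts capacity from motion accuracy.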

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Formulation of motion forecasting as conditional generation of dense trajectory grids


Contribution

Demonstration that trajectory generation outperforms prior regressors and video generators


Contribution

Analysis showing pixel generation overhead limits motion forecasting in video generators
