Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: world model, self-supervised, unsupervised, object-centric, video prediction, video generation, imitation learning, latent particles, VAE
Abstract:

We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model that scales to real-world multi-object datasets and is applicable to decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code and pre-trained models will be made publicly available. Video rollouts are available at https://sites.google.com/view/lpwm

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a self-supervised object-centric world model using particle-based representations to discover keypoints, bounding boxes, and masks from video data. It sits within the 'Particle-Based and Graph-Based Object Modeling' leaf of the taxonomy, which contains only three papers total including this work. This represents a relatively sparse research direction compared to neighboring areas like 'Slot-Based Object Discovery' (four papers) or 'Transformer-Based Object-Centric Prediction' (three papers), suggesting the particle-based paradigm remains less explored than slot-based alternatives despite its potential for handling variable object counts and fine-grained spatial structure.

The taxonomy reveals substantial activity in adjacent areas. The parent branch 'Object-Centric Representation Learning' includes slot-based methods that use fixed-size feature vectors rather than flexible particle sets. Nearby branches address temporal dynamics through transformers or recurrent architectures, while 'Language-Conditioned and Goal-Conditioned Models' explores conditioning mechanisms similar to those claimed here. The paper's positioning bridges representation learning (particle discovery) with dynamics modeling (stochastic prediction) and decision-making applications, connecting multiple taxonomy branches. This cross-cutting nature distinguishes it from works focused solely on representation or prediction.

Among 26 candidates examined across three contributions, no clear refutations emerged. The first contribution (latent action module for particle dynamics) examined six candidates with none providing overlapping prior work. The second contribution (state-of-the-art video prediction) examined ten candidates, again with no refutations. The third contribution (goal-conditioned imitation learning application) similarly found no refuting work among ten candidates. This suggests that within the limited search scope, the combination of particle-based representations, stochastic dynamics modeling, and decision-making integration appears relatively unexplored, though the modest candidate pool (26 papers) means substantial relevant work may exist beyond this analysis.

Based on the limited literature search, the work appears to occupy a distinctive position combining particle-based scene decomposition with stochastic world modeling and control applications. The sparse population of its taxonomy leaf and absence of refuting candidates among 26 examined papers suggest novelty, though this assessment is constrained by the search scope. The cross-cutting nature—spanning representation learning, dynamics prediction, and decision-making—may contribute to the lack of direct precedents, as most prior work focuses on narrower aspects of this pipeline.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 26
Refutable papers: 0

Research Landscape Overview

Core task: object-centric stochastic video prediction and world modeling. This field aims to learn structured representations that decompose visual scenes into distinct entities and predict their future states under uncertainty. The taxonomy reveals a rich landscape organized around several complementary themes. One major branch focuses on object-centric representation learning and decomposition, where methods develop techniques to discover and track entities using slots, particles, or graph-based structures (e.g., SCALOR[9], SlotDiffusion[6]). Another branch emphasizes object-centric video prediction and dynamics modeling, building forward models that leverage these decomposed representations (e.g., Object-centric Video Prediction[1], SlotFormer[21]). Additional branches address language-conditioned and goal-conditioned models, holistic video generation and world models for broader scene synthesis, domain-specific applications such as autonomous driving (DriVerse[4]) or robotics (Robot Physical World Model[2]), physics-grounded and interpretable approaches (PhysGen[10], Physics-Grounded Motion Forecasting[5]), structured and compositional frameworks, world models tailored for reinforcement learning and planning, specialized prediction tasks, text-based and symbolic representations, and 3D geometric modeling.

Within this landscape, a distinct line of work explores particle-based and graph-based object modeling, which represents entities as sets of interacting particles or nodes rather than fixed-size slot vectors. Latent Particle World Models[0] exemplifies this direction by using particle representations to capture fine-grained spatial structure and relational dynamics in a stochastic setting. This approach contrasts with slot-based methods like SlotDiffusion[6] or SlotFormer[21], which typically employ a fixed number of abstract feature vectors, and aligns more closely with graph-structured models such as GraphMimic[12] and OCK[32], which emphasize explicit relational reasoning and flexible entity counts. The particle-based paradigm offers potential advantages in handling variable numbers of objects and modeling complex interactions, though it also raises questions about scalability and the trade-off between expressiveness and computational efficiency. Situating Latent Particle World Models[0] in this context, it occupies a niche that bridges fine-grained spatial decomposition with stochastic dynamics, offering a complementary perspective to both holistic generation approaches and more abstract slot-based frameworks.

Claimed Contributions

Self-supervised object-centric world model with novel latent action module

The authors introduce LPWM, which combines object-centric particle representations with a novel context module that predicts per-particle latent action distributions. This enables stochastic dynamics modeling and supports flexible conditioning on actions, language, image goals, and multi-view inputs, all trained end-to-end from video data.

6 retrieved papers
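To make the first contribution concrete: the claimed latent action module predicts a per-particle distribution over latent actions, from which stochastic dynamics can be sampled. The paper's exact architecture is not reproduced here; the following is a minimal numpy sketch under assumed interfaces (the function name `latent_action_module`, the linear parameterization, and all shapes are hypothetical), showing one plausible reading: each particle's features plus a shared context vector parameterize a diagonal Gaussian, sampled via the VAE reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_action_module(particles, context, w_mu, w_logvar):
    """Hypothetical sketch: predict a per-particle Gaussian over latent
    actions from particle features and a shared context vector, then
    sample with the reparameterization trick (as in a VAE)."""
    n = particles.shape[0]
    # Concatenate each particle's features with the broadcast context.
    inp = np.concatenate([particles, np.tile(context, (n, 1))], axis=1)
    mu = inp @ w_mu          # (n_particles, action_dim) means
    logvar = inp @ w_logvar  # (n_particles, action_dim) log-variances
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterized sample
    return z, mu, logvar

# Toy shapes: 5 particles with 8-dim features, a 4-dim context,
# and a 2-dim latent action per particle.
particles = rng.standard_normal((5, 8))
context = rng.standard_normal(4)
w_mu = 0.1 * rng.standard_normal((12, 2))
w_logvar = 0.1 * rng.standard_normal((12, 2))

z, mu, logvar = latent_action_module(particles, context, w_mu, w_logvar)
print(z.shape)  # one latent action per particle
```

In an actual model the linear maps would be deep networks trained end-to-end with the video reconstruction objective, and the per-particle Gaussians would be regularized (e.g., by a KL term) to keep the latent actions informative but compact.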
State-of-the-art object-centric video prediction on diverse datasets

LPWM achieves superior performance compared to existing object-centric methods across multiple real-world robotics datasets and simulated environments, demonstrating improved visual quality metrics and the ability to model complex multi-object interactions while maintaining object permanence.

10 retrieved papers
Application to decision-making via goal-conditioned imitation learning

The authors show that a pre-trained LPWM can be adapted for goal-conditioned imitation learning by learning a simple mapping from latent actions to real actions. They demonstrate competitive performance on multi-object manipulation tasks, establishing LPWM's practical utility beyond video prediction.

10 retrieved papers
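The "simple mapping from latent actions to real actions" in the third contribution can be illustrated with a small sketch. Assuming the world model yields latent actions paired with recorded robot actions from demonstrations, one minimal instantiation of such a mapping is a linear map fit by ridge regression; the data, dimensions, and the linear form are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: latent actions z (inferred by the world model from
# demonstration videos) paired with the robot's recorded real actions a.
z = rng.standard_normal((200, 2))          # 2-dim latent actions
true_map = np.array([[1.0, -0.5, 0.2],
                     [0.3,  0.8, -0.1]])   # unknown ground-truth linear map
a = z @ true_map + 0.01 * rng.standard_normal((200, 3))  # 3-dim real actions

# Fit the simple mapping in closed form (ridge regression).
lam = 1e-3
W = np.linalg.solve(z.T @ z + lam * np.eye(2), z.T @ a)

# At deployment, a latent action proposed by the world model is decoded
# into a real robot action.
a_pred = z @ W
mse = float(np.mean((a_pred - a) ** 2))
print(round(mse, 4))
```

The appeal of this setup is that the world model carries the heavy lifting (scene decomposition and dynamics), so the action decoder can remain small and data-efficient, needing only a modest number of paired demonstrations.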

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Self-supervised object-centric world model with novel latent action module

Contribution 2: State-of-the-art object-centric video prediction on diverse datasets

Contribution 3: Application to decision-making via goal-conditioned imitation learning