Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: world model, self-supervised, unsupervised, object-centric, video prediction, video generation, imitation learning, latent particles, VAE
Abstract:

We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model that scales to real-world multi-object datasets and is applicable to decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code and pre-trained models will be made publicly available. Video rollouts are available at https://sites.google.com/view/lpwm

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a self-supervised object-centric world model using particle-based representations to discover keypoints, bounding boxes, and masks from video data. It sits within the 'Particle-Based and Graph-Based Object Modeling' leaf of the taxonomy, which contains only three papers total including this work. This represents a relatively sparse research direction compared to neighboring areas like 'Slot-Based Object Discovery' (four papers) or 'Transformer-Based Object-Centric Prediction' (three papers), suggesting the particle-based paradigm remains less explored than slot-based alternatives despite its potential for handling variable object counts and fine-grained spatial structure.

The taxonomy reveals substantial activity in adjacent areas. The parent branch 'Object-Centric Representation Learning' includes slot-based methods that use fixed-size feature vectors rather than flexible particle sets. Nearby branches address temporal dynamics through transformers or recurrent architectures, while 'Language-Conditioned and Goal-Conditioned Models' explores conditioning mechanisms similar to those claimed here. The paper's positioning bridges representation learning (particle discovery) with dynamics modeling (stochastic prediction) and decision-making applications, connecting multiple taxonomy branches. This cross-cutting nature distinguishes it from works focused solely on representation or prediction.

Among 26 candidates examined across three contributions, no clear refutations emerged. The first contribution (latent action module for particle dynamics) examined six candidates with none providing overlapping prior work. The second contribution (state-of-the-art video prediction) examined ten candidates, again with no refutations. The third contribution (goal-conditioned imitation learning application) similarly found no refuting work among ten candidates. This suggests that within the limited search scope, the combination of particle-based representations, stochastic dynamics modeling, and decision-making integration appears relatively unexplored, though the modest candidate pool (26 papers) means substantial relevant work may exist beyond this analysis.

Based on the limited literature search, the work appears to occupy a distinctive position combining particle-based scene decomposition with stochastic world modeling and control applications. The sparse population of its taxonomy leaf and absence of refuting candidates among 26 examined papers suggest novelty, though this assessment is constrained by the search scope. The cross-cutting nature—spanning representation learning, dynamics prediction, and decision-making—may contribute to the lack of direct precedents, as most prior work focuses on narrower aspects of this pipeline.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 26
Refutable papers: 0

Research Landscape Overview

Core task: object-centric stochastic video prediction and world modeling. This field aims to learn structured representations that decompose visual scenes into distinct entities and predict their future states under uncertainty. The taxonomy reveals a rich landscape organized around several complementary themes. One major branch focuses on object-centric representation learning and decomposition, where methods develop techniques to discover and track entities using slots, particles, or graph-based structures (e.g., SCALOR[9], SlotDiffusion[6]). Another branch emphasizes object-centric video prediction and dynamics modeling, building forward models that leverage these decomposed representations (e.g., Object-centric Video Prediction[1], SlotFormer[21]). Additional branches address language-conditioned and goal-conditioned models, holistic video generation and world models for broader scene synthesis, domain-specific applications such as autonomous driving (DriVerse[4]) or robotics (Robot Physical World Model[2]), physics-grounded and interpretable approaches (PhysGen[10], Physics-Grounded Motion Forecasting[5]), structured and compositional frameworks, world models tailored for reinforcement learning and planning, specialized prediction tasks, text-based and symbolic representations, and 3D geometric modeling.

Within this landscape, a distinct line of work explores particle-based and graph-based object modeling, which represents entities as sets of interacting particles or nodes rather than fixed-size slot vectors. Latent Particle World Models[0] exemplifies this direction by using particle representations to capture fine-grained spatial structure and relational dynamics in a stochastic setting. This approach contrasts with slot-based methods like SlotDiffusion[6] or SlotFormer[21], which typically employ a fixed number of abstract feature vectors, and aligns more closely with graph-structured models such as GraphMimic[12] and OCK[32], which emphasize explicit relational reasoning and flexible entity counts. The particle-based paradigm offers potential advantages in handling variable numbers of objects and modeling complex interactions, though it also raises questions about scalability and the trade-off between expressiveness and computational efficiency. Situating Latent Particle World Models[0] in this context, it occupies a niche that bridges fine-grained spatial decomposition with stochastic dynamics, offering a complementary perspective to both holistic generation approaches and more abstract slot-based frameworks.

Claimed Contributions

Self-supervised object-centric world model with novel latent action module

The authors introduce LPWM, which combines object-centric particle representations with a novel context module that predicts per-particle latent action distributions. This enables stochastic dynamics modeling and supports flexible conditioning on actions, language, image goals, and multi-view inputs, all trained end-to-end from video data.

6 retrieved papers
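To make the first contribution concrete: the claimed latent action module predicts a per-particle distribution over latent actions, from which stochastic dynamics can be sampled. The paper's exact architecture is not reproduced here; the following is a minimal numpy sketch under assumed interfaces (the function name `latent_action_module`, the linear parameterization, and all shapes are hypothetical), showing one plausible reading: each particle's features plus a shared context vector parameterize a diagonal Gaussian, sampled via the VAE reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_action_module(particles, context, w_mu, w_logvar):
    """Hypothetical sketch: predict a per-particle Gaussian over latent
    actions from particle features and a shared context vector, then
    sample with the reparameterization trick (as in a VAE)."""
    n = particles.shape[0]
    # Concatenate each particle's features with the broadcast context.
    inp = np.concatenate([particles, np.tile(context, (n, 1))], axis=1)
    mu = inp @ w_mu          # (n_particles, action_dim) means
    logvar = inp @ w_logvar  # (n_particles, action_dim) log-variances
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterized sample
    return z, mu, logvar

# Toy shapes: 5 particles with 8-dim features, a 4-dim context,
# and a 2-dim latent action per particle.
particles = rng.standard_normal((5, 8))
context = rng.standard_normal(4)
w_mu = 0.1 * rng.standard_normal((12, 2))
w_logvar = 0.1 * rng.standard_normal((12, 2))

z, mu, logvar = latent_action_module(particles, context, w_mu, w_logvar)
print(z.shape)  # one latent action per particle
```

In an actual model the linear maps would be deep networks trained end-to-end with the video reconstruction objective, and the per-particle Gaussians would be regularized (e.g., by a KL term) to keep the latent actions informative but compact.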
State-of-the-art object-centric video prediction on diverse datasets

LPWM achieves superior performance compared to existing object-centric methods across multiple real-world robotics datasets and simulated environments, demonstrating improved visual quality metrics and the ability to model complex multi-object interactions while maintaining object permanence.

10 retrieved papers
Application to decision-making via goal-conditioned imitation learning

The authors show that a pre-trained LPWM can be adapted for goal-conditioned imitation learning by learning a simple mapping from latent actions to real actions. They demonstrate competitive performance on multi-object manipulation tasks, establishing LPWM's practical utility beyond video prediction.

10 retrieved papers
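The "simple mapping from latent actions to real actions" in the third contribution can be illustrated with a small sketch. Assuming the world model yields latent actions paired with recorded robot actions from demonstrations, one minimal instantiation of such a mapping is a linear map fit by ridge regression; the data, dimensions, and the linear form are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: latent actions z (inferred by the world model from
# demonstration videos) paired with the robot's recorded real actions a.
z = rng.standard_normal((200, 2))          # 2-dim latent actions
true_map = np.array([[1.0, -0.5, 0.2],
                     [0.3,  0.8, -0.1]])   # unknown ground-truth linear map
a = z @ true_map + 0.01 * rng.standard_normal((200, 3))  # 3-dim real actions

# Fit the simple mapping in closed form (ridge regression).
lam = 1e-3
W = np.linalg.solve(z.T @ z + lam * np.eye(2), z.T @ a)

# At deployment, a latent action proposed by the world model is decoded
# into a real robot action.
a_pred = z @ W
mse = float(np.mean((a_pred - a) ** 2))
print(round(mse, 4))
```

The appeal of this setup is that the world model carries the heavy lifting (scene decomposition and dynamics), so the action decoder can remain small and data-efficient, needing only a modest number of paired demonstrations.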

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Self-supervised object-centric world model with novel latent action module

Contribution 2: State-of-the-art object-centric video prediction on diverse datasets

Contribution 3: Application to decision-making via goal-conditioned imitation learning