Compositional Visual Planning via Inference-Time Diffusion Scaling

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Planning, Compositionality, Diffusion Models, Robotics
Abstract:

Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data. Existing compositional approaches stitch together short segments by separately denoising each component and averaging overlapping regions. However, this approach suffers from instability, as the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a novel combination of synchronous and asynchronous message passing that operates on Tweedie estimates, producing globally consistent guidance without requiring additional training. Our training-free framework demonstrates significant improvements over existing baselines across 100 simulation tasks spanning 4 diverse scenes, effectively generalizing to start-goal combinations not present in the training data. Project website: https://comp-visual-planning.github.io/

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a training-free compositional framework for long-horizon visual planning that enforces boundary agreement on Tweedie estimates rather than noisy intermediate states. It resides in the 'Overlapping Chunk Composition' leaf under 'Trajectory Composition and Stitching Methods', which contains only two papers total. This places the work in a relatively sparse research direction within the broader taxonomy of 32 papers across multiple branches. The sibling paper in this leaf, Generative Trajectory Stitching, also addresses overlapping chunk composition, suggesting that this specific approach to long-horizon planning is an emerging but not yet crowded area.

The taxonomy reveals that neighboring leaves include 'Progressive Trajectory Extension' (one paper) and broader sibling branches like 'Hierarchical Skill-Based Planning' (six papers across three sub-categories) and 'Constraint-Based and Compositional Planning' (four papers). The paper's focus on factor graph inference over video chunks distinguishes it from hierarchical skill decomposition methods, which learn discrete primitives, and from constraint satisfaction approaches that compose energies. The scope note for this leaf explicitly excludes progressive extension without overlap and multiscale hierarchical methods, clarifying that the paper's overlapping chunk strategy occupies a distinct methodological niche within trajectory composition.

Among 24 candidates examined across three contributions, the analysis found limited overlap with prior work. For the core contribution, boundary agreement on Tweedie estimates, four candidates were examined and none were refutable matches. For the message-passing mechanism, ten candidates were examined, again with zero refutable matches, suggesting novelty in the inference procedure. For the compositional planning benchmark, however, ten candidates were examined and two refutable matches were found, indicating that evaluation frameworks for compositional generalization may have more substantial prior work. Because the search covers only top-K semantic matches, these findings are not exhaustive.

Based on the limited literature search of 24 candidates, the work appears to introduce novel inference mechanisms within a sparse research direction. The taxonomy structure shows that overlapping chunk composition itself is an emerging area with few direct comparisons. The two refutable matches for the benchmark contribution suggest that evaluation methodologies may be less novel than the core algorithmic approach, though the restricted search scope prevents definitive conclusions about the broader landscape.

Taxonomy

Core-task Taxonomy Papers: 32
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 2

Research Landscape Overview

Core task: long-horizon visual planning through compositional diffusion models. The field addresses how to generate extended action sequences or visual trajectories by leveraging diffusion-based generative models that can compose or stitch together shorter segments. The taxonomy reveals several complementary directions: Trajectory Composition and Stitching Methods focus on connecting overlapping or adjacent trajectory chunks to extend planning horizons, as seen in works like Generative Trajectory Stitching[3] and Generative Skill Chaining[2]. Hierarchical Skill-Based Planning decomposes tasks into reusable primitives, exemplified by SkillDiffuser[4] and related approaches. Constraint-Based and Compositional Planning emphasizes satisfying logical or geometric constraints during generation, while Vision-Language Guided Planning integrates natural language instructions to steer diffusion processes. Spatiotemporal and Visuomotor Policy Learning targets direct sensorimotor control with temporal coherence, and Foundational Diffusion Planning Frameworks provide core algorithmic innovations such as Planning with Diffusion[7]. Specialized Diffusion Applications explore domain-specific uses ranging from robotics to autonomous driving.

Within this landscape, a particularly active line of work centers on trajectory stitching and chunk composition, where the challenge is to seamlessly merge locally generated segments into coherent long-horizon plans. Compositional Visual Planning[0] sits squarely in this branch, employing overlapping chunk composition to extend planning reach beyond what single-shot diffusion models can achieve. Its approach closely aligns with Generative Trajectory Stitching[3], which also addresses how to blend trajectory pieces, though the two may differ in their blending mechanisms or the granularity of overlap. Meanwhile, hierarchical methods like SkillDiffuser[4] offer an alternative by learning discrete skill libraries, trading the flexibility of continuous stitching for the interpretability of modular primitives. Across these branches, open questions remain about how to balance computational efficiency, sample quality, and the ability to handle diverse constraints; Compositional Visual Planning[0] and its neighbors continue to explore these issues through different compositional strategies.

Claimed Contributions

Compositional visual planning via boundary agreement on Tweedie estimates

The authors formulate long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks. Instead of enforcing consistency on noisy diffusion states (as in prior work), they enforce boundary agreement on Tweedie estimates (estimated clean data), addressing the core limitation that factorization assumptions break down during diffusion sampling.
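The "Tweedie estimate" here is the standard posterior-mean estimate of the clean sample used throughout diffusion-model guidance. As a minimal sketch of what such an estimate looks like, assuming the usual DDPM noising convention (NumPy arrays stand in for video chunks, and `eps_pred` stands in for the pretrained model's noise prediction):

```python
import numpy as np

def tweedie_estimate(x_t, eps_pred, alpha_bar_t):
    """Posterior-mean (Tweedie) estimate of the clean data x0 from a
    noisy diffusion state x_t, given a noise prediction eps_pred:
        x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Sanity check: if x_t was formed as
#   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
# then Tweedie with the true eps recovers x0 exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
alpha_bar = 0.5
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
assert np.allclose(tweedie_estimate(x_t, eps, alpha_bar), x0)
```

The claimed advantage is that boundary agreement enforced on `x0_hat` couples chunks through their estimated clean content rather than through states dominated by independent noise.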

4 retrieved papers
Joint synchronous and asynchronous message passing on denoised variables

The authors introduce two complementary message-passing mechanisms that operate on Tweedie estimates: a synchronous scheme treating the chain as a Gaussian linear system with parallel updates, and an asynchronous scheme using one-sided stop-gradient targets for faster convergence. These are integrated into a training-free DDIM sampler via diffusion-sphere guidance.
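The report gives no pseudocode for these updates, so the following sketch is an illustrative assumption: a simple averaging rule on 1-D NumPy arrays stands in for the actual Gaussian-linear-system solve and diffusion-sphere guidance, and the names `sync_boundary_update`, `async_boundary_update`, `overlap`, and `step` are all hypothetical. It only shows the structural difference between the two schemes: parallel (Jacobi-style) averaging versus a left-to-right sweep against frozen targets.

```python
import numpy as np

def sync_boundary_update(chunks, overlap, step=0.5):
    """Synchronous scheme (sketch): every overlapping boundary pair is
    pulled toward its mutual average in parallel, one Jacobi-style sweep
    of the chain. All averages are computed from the old values."""
    old = [c.copy() for c in chunks]
    new = [c.copy() for c in chunks]
    for i in range(len(chunks) - 1):
        avg = 0.5 * (old[i][-overlap:] + old[i + 1][:overlap])
        new[i][-overlap:] += step * (avg - old[i][-overlap:])
        new[i + 1][:overlap] += step * (avg - old[i + 1][:overlap])
    return new

def async_boundary_update(chunks, overlap, step=0.5):
    """Asynchronous scheme (sketch): sweep left to right, pulling each
    chunk toward its already-updated left neighbour, which is treated as
    a fixed ("stop-gradient") one-sided target."""
    chunks = [c.copy() for c in chunks]
    for i in range(len(chunks) - 1):
        target = chunks[i][-overlap:].copy()  # frozen left-side target
        chunks[i + 1][:overlap] += step * (target - chunks[i + 1][:overlap])
    return chunks

# Two chunks that disagree at their shared boundary (overlap of 1 frame):
left, right = np.zeros(3), np.full(3, 2.0)
synced = sync_boundary_update([left, right], overlap=1, step=1.0)
assert np.isclose(synced[0][-1], synced[1][0])  # boundary now agrees
```

In the paper's method these updates operate on Tweedie estimates inside a training-free DDIM sampler, not directly on the trajectories as in this toy version.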

10 retrieved papers
Compositional planning benchmark for evaluating generalization to unseen start-goal combinations

The authors develop a benchmark for compositional planning in robotic manipulation where training data contains only N start-goal pairs, but evaluation includes all N² − N unseen combinations. This tests whether planners can generalize by composing fragments from the training distribution to solve novel tasks.
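The N² − N count follows from crossing every training start with every training goal and excluding the N pairings seen in training. A small sketch of the enumeration (the labels `s0…`, `g0…` and the helper name are hypothetical placeholders, not the benchmark's actual task identifiers):

```python
from itertools import product

def unseen_combinations(n):
    """Training covers the n matched pairs (s_i, g_i); evaluation covers
    every cross combination (s_i, g_j) with i != j, i.e. n*n - n tasks."""
    starts = [f"s{i}" for i in range(n)]
    goals = [f"g{i}" for i in range(n)]
    seen = {(f"s{i}", f"g{i}") for i in range(n)}
    return [(s, g) for s, g in product(starts, goals) if (s, g) not in seen]

assert len(unseen_combinations(10)) == 10 * 10 - 10  # 90 unseen tasks
```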

10 retrieved papers
Can Refute (2 refutable matches found for this contribution)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
