Compositional Visual Planning via Inference-Time Diffusion Scaling

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Planning, Compositionality, Diffusion Models, Robotics
Abstract:

Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data. Existing compositional approaches stitch together short segments by separately denoising each component and averaging overlapping regions. However, this approach suffers from instability, as the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a novel combination of synchronous and asynchronous message passing that operates on Tweedie estimates, producing globally consistent guidance without requiring additional training. Our training-free framework demonstrates significant improvements over existing baselines across 100 simulation tasks spanning 4 diverse scenes, effectively generalizing to start-goal combinations not present in the training data. Project website: https://comp-visual-planning.github.io/

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a training-free compositional framework for long-horizon visual planning that enforces boundary agreement on Tweedie estimates rather than noisy intermediate states. It resides in the 'Overlapping Chunk Composition' leaf under 'Trajectory Composition and Stitching Methods', which contains only two papers total. This places the work in a relatively sparse research direction within the broader taxonomy of 32 papers across multiple branches. The sibling paper in this leaf, Generative Trajectory Stitching, also addresses overlapping chunk composition, suggesting that this specific approach to long-horizon planning is an emerging but not yet crowded area.

The taxonomy reveals that neighboring leaves include 'Progressive Trajectory Extension' (one paper) and broader sibling branches like 'Hierarchical Skill-Based Planning' (six papers across three sub-categories) and 'Constraint-Based and Compositional Planning' (four papers). The paper's focus on factor graph inference over video chunks distinguishes it from hierarchical skill decomposition methods, which learn discrete primitives, and from constraint satisfaction approaches that compose energies. The scope note for this leaf explicitly excludes progressive extension without overlap and multiscale hierarchical methods, clarifying that the paper's overlapping chunk strategy occupies a distinct methodological niche within trajectory composition.

Among 24 candidates examined across three contributions, the analysis found limited overlap with prior work. For the core contribution, boundary agreement on Tweedie estimates, four candidates were examined and none were refutable matches. For the message-passing mechanism, ten candidates were examined, again with zero refutable matches, suggesting novelty in the inference procedure. For the compositional planning benchmark, however, ten candidates were examined and two refutable matches were found, indicating that evaluation frameworks for compositional generalization may have more substantial prior work. Because the search covers only top-K semantic matches, these findings are not exhaustive.

Based on the limited literature search of 24 candidates, the work appears to introduce novel inference mechanisms within a sparse research direction. The taxonomy structure shows that overlapping chunk composition itself is an emerging area with few direct comparisons. The two refutable matches for the benchmark contribution suggest that evaluation methodologies may be less novel than the core algorithmic approach, though the restricted search scope prevents definitive conclusions about the broader landscape.

Taxonomy

Core-task Taxonomy Papers: 32
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 2

Research Landscape Overview

Core task: long-horizon visual planning through compositional diffusion models. The field addresses how to generate extended action sequences or visual trajectories by leveraging diffusion-based generative models that can compose or stitch together shorter segments. The taxonomy reveals several complementary directions: Trajectory Composition and Stitching Methods focus on connecting overlapping or adjacent trajectory chunks to extend planning horizons, as seen in works like Generative Trajectory Stitching[3] and Generative Skill Chaining[2]. Hierarchical Skill-Based Planning decomposes tasks into reusable primitives, exemplified by SkillDiffuser[4] and related approaches. Constraint-Based and Compositional Planning emphasizes satisfying logical or geometric constraints during generation, while Vision-Language Guided Planning integrates natural language instructions to steer diffusion processes. Spatiotemporal and Visuomotor Policy Learning targets direct sensorimotor control with temporal coherence, and Foundational Diffusion Planning Frameworks provide core algorithmic innovations such as Planning with Diffusion[7]. Specialized Diffusion Applications explore domain-specific uses ranging from robotics to autonomous driving.

Within this landscape, a particularly active line of work centers on trajectory stitching and chunk composition, where the challenge is to seamlessly merge locally generated segments into coherent long-horizon plans. Compositional Visual Planning[0] sits squarely in this branch, employing overlapping chunk composition to extend planning reach beyond what single-shot diffusion models can achieve. Its approach closely aligns with Generative Trajectory Stitching[3], which also addresses how to blend trajectory pieces, though the two may differ in their blending mechanisms or the granularity of overlap. Meanwhile, hierarchical methods like SkillDiffuser[4] offer an alternative by learning discrete skill libraries, trading the flexibility of continuous stitching for the interpretability of modular primitives. Across these branches, open questions remain about how to balance computational efficiency, sample quality, and the ability to handle diverse constraints; Compositional Visual Planning[0] and its neighbors continue to explore these issues through different compositional strategies.

Claimed Contributions

Compositional visual planning via boundary agreement on Tweedie estimates

The authors formulate long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks. Instead of enforcing consistency on noisy diffusion states (as in prior work), they enforce boundary agreement on Tweedie estimates (estimated clean data), addressing the core limitation that factorization assumptions break down during diffusion sampling.
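The "Tweedie estimate" here is the standard posterior-mean estimate of the clean sample used throughout diffusion-model guidance. As a minimal sketch of what such an estimate looks like, assuming the usual DDPM noising convention (NumPy arrays stand in for video chunks, and `eps_pred` stands in for the pretrained model's noise prediction):

```python
import numpy as np

def tweedie_estimate(x_t, eps_pred, alpha_bar_t):
    """Posterior-mean (Tweedie) estimate of the clean data x0 from a
    noisy diffusion state x_t, given a noise prediction eps_pred:
        x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Sanity check: if x_t was formed as
#   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
# then Tweedie with the true eps recovers x0 exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
alpha_bar = 0.5
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
assert np.allclose(tweedie_estimate(x_t, eps, alpha_bar), x0)
```

The claimed advantage is that boundary agreement enforced on `x0_hat` couples chunks through their estimated clean content rather than through states dominated by independent noise.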

4 retrieved papers
Joint synchronous and asynchronous message passing on denoised variables

The authors introduce two complementary message-passing mechanisms that operate on Tweedie estimates: a synchronous scheme treating the chain as a Gaussian linear system with parallel updates, and an asynchronous scheme using one-sided stop-gradient targets for faster convergence. These are integrated into a training-free DDIM sampler via diffusion-sphere guidance.
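The report gives no pseudocode for these updates, so the following sketch is an illustrative assumption: a simple averaging rule on 1-D NumPy arrays stands in for the actual Gaussian-linear-system solve and diffusion-sphere guidance, and the names `sync_boundary_update`, `async_boundary_update`, `overlap`, and `step` are all hypothetical. It only shows the structural difference between the two schemes: parallel (Jacobi-style) averaging versus a left-to-right sweep against frozen targets.

```python
import numpy as np

def sync_boundary_update(chunks, overlap, step=0.5):
    """Synchronous scheme (sketch): every overlapping boundary pair is
    pulled toward its mutual average in parallel, one Jacobi-style sweep
    of the chain. All averages are computed from the old values."""
    old = [c.copy() for c in chunks]
    new = [c.copy() for c in chunks]
    for i in range(len(chunks) - 1):
        avg = 0.5 * (old[i][-overlap:] + old[i + 1][:overlap])
        new[i][-overlap:] += step * (avg - old[i][-overlap:])
        new[i + 1][:overlap] += step * (avg - old[i + 1][:overlap])
    return new

def async_boundary_update(chunks, overlap, step=0.5):
    """Asynchronous scheme (sketch): sweep left to right, pulling each
    chunk toward its already-updated left neighbour, which is treated as
    a fixed ("stop-gradient") one-sided target."""
    chunks = [c.copy() for c in chunks]
    for i in range(len(chunks) - 1):
        target = chunks[i][-overlap:].copy()  # frozen left-side target
        chunks[i + 1][:overlap] += step * (target - chunks[i + 1][:overlap])
    return chunks

# Two chunks that disagree at their shared boundary (overlap of 1 frame):
left, right = np.zeros(3), np.full(3, 2.0)
synced = sync_boundary_update([left, right], overlap=1, step=1.0)
assert np.isclose(synced[0][-1], synced[1][0])  # boundary now agrees
```

In the paper's method these updates operate on Tweedie estimates inside a training-free DDIM sampler, not directly on the trajectories as in this toy version.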

10 retrieved papers
Compositional planning benchmark for evaluating generalization to unseen start-goal combinations

The authors develop a benchmark for compositional planning in robotic manipulation where training data contains only N start-goal pairs, but evaluation includes all N² − N unseen combinations. This tests whether planners can generalize by composing fragments from the training distribution to solve novel tasks.
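The N² − N count follows from crossing every training start with every training goal and excluding the N pairings seen in training. A small sketch of the enumeration (the labels `s0…`, `g0…` and the helper name are hypothetical placeholders, not the benchmark's actual task identifiers):

```python
from itertools import product

def unseen_combinations(n):
    """Training covers the n matched pairs (s_i, g_i); evaluation covers
    every cross combination (s_i, g_j) with i != j, i.e. n*n - n tasks."""
    starts = [f"s{i}" for i in range(n)]
    goals = [f"g{i}" for i in range(n)]
    seen = {(f"s{i}", f"g{i}") for i in range(n)}
    return [(s, g) for s, g in product(starts, goals) if (s, g) not in seen]

assert len(unseen_combinations(10)) == 10 * 10 - 10  # 90 unseen tasks
```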

10 retrieved papers
Can Refute (2 refutable matches found for this contribution)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
