Compositional Diffusion with Guided Search for Long-Horizon Planning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Diffusion Models, Compositional Diffusion, Goal-directed Planning
Abstract:

Generative models have emerged as powerful tools for planning, with compositional approaches offering particular promise for modeling long-horizon task distributions by composing together local, modular generative models. This compositional paradigm spans diverse domains, from multi-step manipulation planning to panoramic image synthesis to long video generation. However, compositional generative models face a critical challenge: when local distributions are multimodal, existing composition methods average incompatible modes, producing plans that are neither locally feasible nor globally coherent. We propose Compositional Diffusion with Guided Search (CDGS), which addresses this \emph{mode averaging} problem by embedding search directly within the diffusion denoising process. Our method explores diverse combinations of local modes through population-based sampling, prunes infeasible candidates using likelihood-based filtering, and enforces global consistency through iterative resampling between overlapping segments. CDGS matches oracle performance on seven robot manipulation tasks, outperforming baselines that lack compositionality or require long-horizon training data. The approach generalizes across domains, enabling coherent text-guided panoramic images and long videos through effective local-to-global message passing. More details: https://cdgsearch.github.io/

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Compositional Diffusion with Guided Search (CDGS), a method for composing short-horizon diffusion models into long-horizon robot manipulation plans. It resides in the 'Diffusion-Based Trajectory Composition' leaf, which contains only four papers total, including this work and three siblings. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting the specific approach of embedding search within diffusion denoising for compositional planning is not yet heavily explored.

The taxonomy reveals that robotic manipulation encompasses multiple alternative paradigms: hierarchical skill chaining and subgoal decomposition (five papers), imitation learning (three papers), model-based RL (two papers), and vision-language-action systems (three papers). The diffusion-based trajectory composition leaf sits adjacent to these approaches, sharing the goal of long-horizon planning but diverging in its use of probabilistic generative models rather than hierarchical abstractions or reinforcement learning. The scope note explicitly excludes non-diffusion methods, positioning this work within a narrower methodological niche focused on generative composition.

Among the 18 candidates examined across the three claimed contributions, the iterative resampling mechanism has one refutable candidate out of the 10 examined, while the core CDGS framework and the likelihood-based pruning mechanism appear more novel (zero refutable candidates among the eight examined for the framework, and no candidates retrieved at all for the pruning mechanism). The limited search scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The iterative resampling mechanism's overlap with prior work suggests this component may have precedent, while the integration of search-based mode exploration within diffusion denoising appears less directly anticipated by the examined literature.

Given the sparse four-paper leaf and limited 18-candidate search, the work appears to occupy a relatively unexplored intersection of diffusion models and search-based planning for compositional generation. The taxonomy context suggests the field is fragmented across diverse methodological branches, with diffusion-based composition representing a minority approach. However, the analysis cannot rule out relevant work outside the top-K semantic neighborhood or in adjacent communities not captured by the search strategy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 1

Research Landscape Overview

Core task: Compositional generation of long-horizon sequences from short-horizon models. This field addresses the challenge of extending predictions or plans far into the future by composing outputs from models trained on shorter temporal windows. The taxonomy reveals a diverse landscape spanning robotic manipulation and task planning, spatiotemporal forecasting (weather radar, precipitation), video generation and synthesis, motion and trajectory synthesis, sequential recommendation, domain-specific temporal modeling (medical imaging, materials science), reasoning and memory architectures, multi-step time series forecasting, long-horizon vision tasks, and bio-inspired neuromorphic memory systems.

Within robotic manipulation, diffusion-based trajectory composition has emerged as a particularly active direction, leveraging generative models to stitch together short-horizon skills or subgoals into coherent long-horizon behaviors. Meanwhile, spatiotemporal forecasting branches explore recurrent and attention-based architectures for extrapolating radar echoes or precipitation patterns, and video synthesis methods tackle the challenge of maintaining temporal consistency over extended sequences. A central tension across these branches involves balancing computational efficiency with the ability to capture long-range dependencies and avoid compounding errors.

In robotic planning, works like Trajectory Stitching Diffusion[5] and Diverse Trajectory Stitching[13] explore how to compose pre-trained diffusion models for different skills, while Compositional Diffusion Planning[0] sits within this cluster by emphasizing modular composition of trajectory segments to achieve extended task horizons. Compared to approaches that rely on hierarchical abstractions (e.g., Subgoal Manipulation[2]) or skill chaining (Skill Chaining Diffusion[7]), the diffusion-based composition methods offer flexible probabilistic blending of short-horizon priors.
In contrast, spatiotemporal forecasting branches such as LSTM Radar Extrapolation[3] and SepConv Radar Ensemble[4] focus on recurrent or convolutional architectures for weather prediction, highlighting a different set of trade-offs around spatial resolution and ensemble uncertainty. The original paper's emphasis on diffusion-based trajectory composition places it squarely within the robotic manipulation branch, where it contributes to ongoing efforts to scale planning horizons without retraining monolithic models.

Claimed Contributions

Compositional Diffusion with Guided Search (CDGS)

CDGS is a novel inference-time algorithm that integrates guided search into the diffusion denoising process to compose short-horizon local generative models into coherent long-horizon plans. The method addresses the mode-averaging problem in compositional generative models through population-based sampling, iterative resampling for global consistency, and likelihood-based pruning of infeasible candidates.

8 retrieved papers
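The population-based search loop described in this contribution can be sketched numerically. The toy below, which is an illustrative assumption and not the paper's implementation, builds candidate plans from two overlapping segments whose local models are bimodal, scores candidates by agreement on the shared state (a crude stand-in for likelihood-based filtering), prunes the worst, and resamples survivors with noise. The names `guided_search` and `overlap_gap` and all constants are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def overlap_gap(seg_a, seg_b):
    # Disagreement on the single state shared by consecutive segments;
    # a stand-in for likelihood-based feasibility filtering.
    return float(abs(seg_a[-1] - seg_b[0]))

def guided_search(n_pop=64, n_rounds=5, keep=16):
    # Each candidate pairs two length-4 segments, each drawn near one of two
    # local modes (+1 or -1), mimicking multimodal short-horizon models.
    modes = rng.choice([-1.0, 1.0], size=(n_pop, 2))
    pop = [(m[0] + 0.3 * rng.standard_normal(4),
            m[1] + 0.3 * rng.standard_normal(4)) for m in modes]
    for _ in range(n_rounds):
        # Prune: keep the candidates whose segments agree best on the overlap.
        elite = sorted(pop, key=lambda c: overlap_gap(*c))[:keep]
        # Resample: repopulate by perturbing the surviving candidates.
        pop = [(elite[i % keep][0] + 0.1 * rng.standard_normal(4),
                elite[i % keep][1] + 0.1 * rng.standard_normal(4))
               for i in range(n_pop)]
    return overlap_gap(*min(pop, key=lambda c: overlap_gap(*c)))

best_gap = guided_search()  # small residual disagreement on the overlap
```

The point of the sketch is the selection pressure: candidates mixing incompatible modes (one segment near +1, the other near -1) have a large overlap gap and are pruned, so the surviving population concentrates on mode combinations that are jointly consistent rather than averaging across modes.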
Iterative resampling mechanism for local-to-global message passing

The method introduces an iterative resampling procedure that alternates between forward noising and denoising steps to propagate information across distant segments through overlapping variables. This enables effective local-to-global message passing, ensuring that compositional sampling produces globally coherent candidate plans.

10 retrieved papers
Can Refute
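As a rough numerical analogy for the mechanism described above (not the paper's algorithm), the toy below composes two overlapping segments by alternating a forward-noising step, a toy "denoising" step that pulls each segment toward its own local target, and a reconciliation of the shared state. The shared variable settles between the two incompatible local targets instead of oscillating, illustrating how overlapping variables carry messages across segments. The denoiser, targets, and annealing schedule are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def denoise(x, target, rate=0.5):
    # Toy denoiser: move partway from x toward the local model's target.
    return x + rate * (target - x)

def resample_compose(n_iters=30, noise=0.2):
    seg_a = rng.standard_normal(4)        # states 0..3
    seg_b = rng.standard_normal(4)        # states 3..6 (state 3 is shared)
    t_a, t_b = np.zeros(4), np.ones(4)    # deliberately incompatible targets
    for _ in range(n_iters):
        # Forward noising, then local denoising of each segment.
        seg_a = denoise(seg_a + noise * rng.standard_normal(4), t_a)
        seg_b = denoise(seg_b + noise * rng.standard_normal(4), t_b)
        # Message passing: reconcile the overlapping variable.
        shared = 0.5 * (seg_a[-1] + seg_b[0])
        seg_a[-1] = seg_b[0] = shared
        noise *= 0.9                      # anneal the forward noising
    return float(shared)

shared_state = resample_compose()  # settles between the two local targets
```

The reconciliation step is what distinguishes this from independently sampling each segment: without it, the two local models would each commit to their own target and the joint plan would be discontinuous at the overlap.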
Likelihood-based pruning using DDIM inversion

The approach employs a novel pruning mechanism based on DDIM inversion to approximate local plan likelihoods and filter out incoherent global plans. The method defines a smoothness measure based on diffusion trajectory curvature to identify and eliminate plans with locally inconsistent segments that result from mode-averaging.

0 retrieved papers
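The curvature-based smoothness idea can be illustrated with a toy deterministic inversion. The sketch below does not implement DDIM inversion; a simple discrete heat-diffusion map stands in for it, chosen only because jagged inputs make its trajectory change quickly. Under that assumption, a plan that stays within one local mode barely moves across inversion steps, while a mode-averaged, locally inconsistent plan traces a sharply bending trajectory, and the mean squared second difference along the trajectory serves as the curvature score used for pruning.

```python
import numpy as np

def invert(plan, steps=8, rate=0.2):
    # Toy deterministic inversion: each step applies a discrete
    # heat-diffusion update (a stand-in for DDIM inversion).
    x = np.asarray(plan, dtype=float)
    traj = [x]
    for _ in range(steps):
        lap = np.zeros_like(x)
        lap[1:-1] = x[:-2] - 2.0 * x[1:-1] + x[2:]
        x = x + rate * lap
        traj.append(x)
    return np.stack(traj)

def curvature(traj):
    # Smoothness score: mean squared second difference along the trajectory.
    d2 = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]
    return float((d2 ** 2).mean())

c_smooth = curvature(invert([0.0, 1.0, 2.0, 3.0]))    # single-mode ramp
c_jagged = curvature(invert([1.0, -1.0, 1.0, -1.0]))  # mode-averaged plan
# the locally inconsistent plan scores higher curvature and would be pruned
```

The linear ramp is a fixed point of the toy inversion, so its trajectory has zero curvature, while the alternating plan relaxes rapidly and bends; thresholding or ranking on this score is the pruning step.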

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
