Masked Generative Policy for Robotic Control

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Imitation Learning, Masked Generative Transformer, Generative Model
Abstract:

We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across 150 tasks while cutting per-sequence inference time by up to 35×. It further improves the average success rate by 60% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail. Further results and videos are available at: https://anonymous.4open.science/r/masked_generative_policy-8BC6.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Masked Generative Policy (MGP), which represents actions as discrete tokens and employs a conditional masked transformer for parallel generation with iterative refinement. According to the taxonomy, this work resides in the 'Discrete Token Generation' leaf under 'Policy Architecture and Learning'. Notably, this leaf contains only the original paper itself—no sibling papers are listed. This suggests the discrete token generation approach for visuomotor manipulation is a relatively sparse research direction within the taxonomy's 50-paper scope, contrasting with more populated areas like diffusion policies or transformer-based methods.

The taxonomy reveals neighboring approaches in adjacent leaves: 'Diffusion Policies' (1 paper), 'Transformer-Based Policies' (1 paper), and 'Hierarchical and Compositional Policies' (1 paper). The scope note for 'Discrete Token Generation' explicitly excludes continuous action diffusion and transformer policies, positioning MGP as an alternative to these paradigms. The broader 'Policy Architecture and Learning' branch also includes recurrent architectures and hybrid imitation-RL methods, indicating diverse algorithmic strategies. MGP's masked generation mechanism appears to occupy a distinct niche between autoregressive token models and continuous diffusion approaches, though the taxonomy structure suggests limited prior exploration of this specific combination.

Among the three contributions analyzed, the literature search examined 10 candidates total. The core MGP framework (Contribution 1) had 4 candidates examined with 0 refutable, while MGP-Long with Adaptive Token Refinement (Contribution 3) examined 6 candidates with 0 refutable. MGP-Short (Contribution 2) had no candidates examined. The absence of refutable prior work across all contributions, combined with the limited search scope of 10 papers, suggests these specific mechanisms—parallel masked generation with score-based refinement and dynamic trajectory refinement—may not have direct precedents in the examined literature. However, this reflects the bounded search rather than exhaustive field coverage.

Based on the 10-candidate search and sparse taxonomy positioning, MGP appears to introduce a relatively unexplored combination of techniques within the surveyed literature. The lack of sibling papers in its taxonomy leaf and zero refutable candidates across contributions indicate potential novelty, though the limited search scope (10 papers from semantic search) means substantial related work may exist outside this analysis. The taxonomy's explicit exclusion of diffusion and transformer methods from the discrete token leaf further suggests MGP occupies a distinct methodological space, though comprehensive field coverage would require broader literature examination.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 10
Refutable papers: 0

Research Landscape Overview

Core task: visuomotor imitation learning for robotic manipulation. The field organizes around several complementary dimensions. Demonstration Modality and Source addresses how robots acquire training data, ranging from teleoperation and VR interfaces (Deep Imitation VR[20]) to cross-embodiment transfer (Cross-Morphology Demonstration[2]) and even single-demonstration methods (You Only Demonstrate Once[7]). Representation and Feature Learning explores how visual observations are encoded, including object-centric approaches (Viola Object-Centric[10]) and pre-trained vision models (Roboclip[49]). Policy Architecture and Learning encompasses the core algorithmic choices: end-to-end networks (Multi-Task End-to-End[28]), hierarchical planners (Visual Planning Acting[6]), diffusion-based policies (Diff-LfD[46]), and discrete token generation methods. Generalization and Data Efficiency investigates how policies transfer across tasks and environments, while Task-Specific Manipulation Domains targets challenges like assembly (Assembly from Demonstrations[44]) and deformable objects (Deformable Object Manipulation[48]). Execution and Deployment Optimization refines real-time performance, and Benchmarking and Empirical Analysis provides systematic evaluation frameworks.

Within Policy Architecture and Learning, a particularly active line explores discrete token generation as an alternative to continuous action prediction. Masked Generative Policy[0] exemplifies this direction by framing action sequences as masked token prediction, drawing inspiration from language modeling techniques. This contrasts with diffusion-based approaches (S-Diffusion[34], Latent Diffusion Planning[25]) that model action distributions through iterative denoising, and with hierarchical methods (Coarse-to-Fine Imitation[8]) that decompose policies into multiple stages.
The discrete tokenization strategy offers potential advantages in sample efficiency and interpretability, positioning Masked Generative Policy[0] alongside recent efforts to leverage transformer architectures for sequential decision-making. Meanwhile, works like You Only Teach Once[9] and DemoHLM[40] emphasize learning from minimal or structured demonstrations, highlighting ongoing tensions between data requirements and generalization capability across the broader policy learning landscape.

Claimed Contributions

Contribution 1: Masked Generative Policy (MGP) framework for visuomotor imitation learning

The authors introduce MGP, a new framework that represents robot actions as discrete tokens and uses a conditional masked transformer to generate these tokens in parallel with selective refinement of low-confidence tokens. This approach aims to overcome the inference bottlenecks of diffusion models and the sequential constraints of autoregressive models.

4 retrieved papers
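The parallel generation with selective refinement described above resembles confidence-based iterative masked decoding. The stdlib-only sketch below illustrates the idea under stated assumptions: `stub_model` is a placeholder for the conditional masked transformer (a real model would condition on visual observations), and the bin count, re-masking schedule, and helper names (`discretize`, `masked_generate`) are illustrative, not taken from the paper.

```python
import random

NUM_BINS = 256  # discretization resolution per action dimension (illustrative)
MASK = -1       # sentinel token id for masked positions

def discretize(value, low=-1.0, high=1.0, bins=NUM_BINS):
    """Map a continuous action value to a discrete token id by uniform binning."""
    clipped = min(max(value, low), high)
    return int((clipped - low) / (high - low) * (bins - 1))

def undiscretize(token, low=-1.0, high=1.0, bins=NUM_BINS):
    """Map a token id back to its bin-center continuous value."""
    return low + (token + 0.5) * (high - low) / bins

def stub_model(tokens, rng):
    """Stand-in for the conditional masked transformer: propose a
    (token, confidence) pair for every position."""
    return [(rng.randrange(NUM_BINS), rng.random()) for _ in tokens]

def masked_generate(seq_len, steps=4, rng=None):
    """Parallel masked generation: predict all tokens at once, then
    re-mask and re-predict only the least-confident ones each step."""
    rng = rng or random.Random(0)
    tokens = [MASK] * seq_len
    confidences = [0.0] * seq_len
    for step in range(steps):
        preds = stub_model(tokens, rng)
        for i, (tok, conf) in enumerate(preds):
            # always fill masked slots; elsewhere keep the better prediction
            if tokens[i] == MASK or conf > confidences[i]:
                tokens[i], confidences[i] = tok, conf
        # keep a shrinking fraction of low-confidence positions masked
        n_remask = int(seq_len * (1 - (step + 1) / steps))
        order = sorted(range(seq_len), key=lambda i: confidences[i])
        for i in order[:n_remask]:
            tokens[i] = MASK
    return tokens
```

The linearly shrinking re-masking schedule here is one common choice for this family of decoders; the paper's actual schedule and scoring rule may differ.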
Contribution 2: MGP-Short sampling paradigm for Markovian tasks

The authors develop MGP-Short, a sampling method that performs parallel masked token generation with score-based refinement specifically designed for Markovian manipulation tasks. This method achieves rapid inference while maintaining high success rates on standard benchmarks.

0 retrieved papers
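In the Markovian setting described above, each decoding call can condition only on the current observation. A minimal sketch of one such call, assuming a score threshold and a bounded number of refinement rounds (both hypothetical parameters; `stub_scorer` stands in for the policy head):

```python
import random

MASK = -1    # sentinel for masked positions
VOCAB = 256  # action-token vocabulary size (illustrative)

def stub_scorer(obs, tokens, rng):
    """Stand-in for the policy head: propose a (token, score) pair per
    position, conditioned only on the current observation (Markov case)."""
    return [(rng.randrange(VOCAB), rng.random()) for _ in tokens]

def mgp_short_step(obs, horizon=8, threshold=0.5, max_rounds=3):
    """Sketch of one MGP-Short call: generate a short action chunk in
    parallel, then re-predict only positions scoring below the threshold,
    for a bounded number of refinement rounds."""
    rng = random.Random(obs)
    tokens = [MASK] * horizon
    scores = [0.0] * horizon
    for _ in range(max_rounds):
        low = [i for i in range(horizon)
               if tokens[i] == MASK or scores[i] < threshold]
        if not low:
            break  # every position already meets the score threshold
        preds = stub_scorer(obs, tokens, rng)
        for i in low:
            tok, s = preds[i]
            # keep the higher-scoring prediction for each refined position
            if tokens[i] == MASK or s > scores[i]:
                tokens[i], scores[i] = tok, s
    return tokens
```

Because the task is Markovian, the controller would call `mgp_short_step` afresh at each control step with the latest observation, discarding any unexecuted tokens.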
Contribution 3: MGP-Long sampling paradigm with Adaptive Token Refinement for non-Markovian tasks

The authors propose MGP-Long, which predicts complete action trajectories in one pass and then dynamically refines low-confidence tokens using new observations through an Adaptive Token Refinement strategy. This enables globally coherent predictions and robust execution for complex, long-horizon, and non-Markovian manipulation tasks.

6 retrieved papers
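The one-pass trajectory prediction with observation-driven refinement described above can be sketched as follows. This is a toy sketch under stated assumptions: `stub_policy` stands in for the trajectory model (which in the paper conditions on observation history), and the trajectory length, confidence threshold, and function names are illustrative, not the authors' implementation.

```python
import random

VOCAB = 256  # action-token vocabulary size (illustrative)

def stub_policy(history, tokens, rng):
    """Stand-in for the trajectory model: propose a (token, confidence)
    pair for every position, conditioned on the observation history."""
    return [(rng.randrange(VOCAB), rng.random()) for _ in tokens]

def mgp_long_rollout(observations, traj_len=16, threshold=0.6):
    """Sketch of MGP-Long: predict the full token trajectory in a single
    pass, then, as each new observation arrives, refine only the
    not-yet-executed tokens whose confidence is below the threshold."""
    rng = random.Random(0)
    history = []
    # single full-trajectory pass before execution starts
    first = stub_policy(history, [None] * traj_len, rng)
    tokens = [tok for tok, _ in first]
    conf = [c for _, c in first]
    executed = []
    for t, obs in enumerate(observations):
        history.append(obs)
        # adaptive token refinement: re-predict low-confidence future tokens
        new = stub_policy(history, tokens, rng)
        for i in range(t + 1, traj_len):
            if conf[i] < threshold:
                tokens[i], conf[i] = new[i]
        executed.append(tokens[t])  # commit the current-step action token
    return executed
```

Refining only future low-confidence tokens, rather than regenerating the whole trajectory, is what keeps the prediction globally coherent while still reacting to new observations.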

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Masked Generative Policy (MGP) framework for visuomotor imitation learning

The authors introduce MGP, a new framework that represents robot actions as discrete tokens and uses a conditional masked transformer to generate these tokens in parallel with selective refinement of low-confidence tokens. This approach aims to overcome the inference bottlenecks of diffusion models and the sequential constraints of autoregressive models.

Contribution 2: MGP-Short sampling paradigm for Markovian tasks

The authors develop MGP-Short, a sampling method that performs parallel masked token generation with score-based refinement specifically designed for Markovian manipulation tasks. This method achieves rapid inference while maintaining high success rates on standard benchmarks.

Contribution 3: MGP-Long sampling paradigm with Adaptive Token Refinement for non-Markovian tasks

The authors propose MGP-Long, which predicts complete action trajectories in one pass and then dynamically refines low-confidence tokens using new observations through an Adaptive Token Refinement strategy. This enables globally coherent predictions and robust execution for complex, long-horizon, and non-Markovian manipulation tasks.