Real-Time Robot Execution with Masked Action Chunking
Research Landscape Overview
Claimed Contributions
The authors introduce REMAC, a training-time method that adapts pretrained vision-language-action policies for asynchronous inference by learning corrective adjustments through masked action chunking. This approach addresses intra-chunk inconsistency by masking arbitrary portions of action chunks during training, enabling the policy to handle misalignments between observations and executed actions without introducing additional inference latency.
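To make the idea concrete, the masking scheme described above can be sketched as a training objective: some steps of a ground-truth action chunk are treated as already-executed context, and the loss is computed only on the steps the policy must fill in. This is a minimal illustration assuming a Bernoulli mask and an MSE regression loss; the function names and masking distribution are assumptions, not the paper's exact recipe.

```python
import numpy as np

def masked_chunk_loss(pred_chunk, target_chunk, rng, mask_prob=0.5):
    """Illustrative masked action-chunking objective (assumed, not the
    paper's exact formulation).

    A random subset of chunk steps is marked as "given" context (e.g.
    actions already executed under an older observation); the regression
    loss is computed only on the remaining steps, so the policy learns
    to complete a chunk consistently with an arbitrary executed portion.
    """
    horizon = target_chunk.shape[0]
    given = rng.random(horizon) < mask_prob   # True = provided as context
    predict = ~given                          # steps the policy must fill in
    if not predict.any():                     # always leave something to predict
        predict[rng.integers(horizon)] = True
    per_step = ((pred_chunk - target_chunk) ** 2).mean(axis=-1)
    return float((per_step * predict).sum() / predict.sum())
```

Because the mask is resampled every training step, the policy sees all partition points of the chunk over the course of training, which is what lets it correct arbitrary executed prefixes at inference time without any extra latency.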
The authors identify and formalize intra-chunk inconsistency as a previously overlooked challenge in asynchronous inference with action chunking. This occurs when executed actions from a previous chunk are conditioned on outdated observations, creating a perception-action mismatch within a single chunk that degrades policy performance.
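The timing mismatch can be illustrated with a toy model of the asynchronous control loop: the policy captures an observation, inference takes some number of control steps, and every action executed in that window comes from the previous chunk and is invisible to the new chunk's conditioning. The function below is purely illustrative and not taken from the paper.

```python
def stale_action_steps(obs_step, exec_start_step):
    """Toy timing model (illustrative, not from the paper): an
    observation is captured at control step `obs_step`, and after
    inference latency the new chunk begins executing at
    `exec_start_step`. The returned steps are actions from the previous
    chunk executed after the observation was taken -- exactly the
    perception-action mismatch described above.
    """
    return list(range(obs_step, exec_start_step))
```

For example, with an observation at step 10 and a three-step inference latency, `stale_action_steps(10, 13)` returns `[10, 11, 12]`: three executed actions the new chunk was never conditioned on.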
The authors propose a prefix-preserved sampling procedure that initializes action generation using previously executed actions as priors and preserves the overlapping segment between consecutive chunks during sampling. This method enhances inter-chunk continuity by maintaining coherence across chunk boundaries during asynchronous execution.
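A minimal sketch of the sampling procedure, assuming an iterative sampler (e.g. diffusion or flow matching) over the whole chunk: the overlapping segment with the previous chunk both initializes the sample and is re-imposed after every refinement step. Here `refine` is a hypothetical callable standing in for one sampler iteration; all names are illustrative.

```python
import numpy as np

def prefix_preserved_sample(refine, prev_overlap, horizon, act_dim, n_iters, rng):
    """Sketch of prefix-preserved sampling (assumed iterative sampler).

    The overlap with the previous chunk serves as a prior for
    initialization and is clamped back after each iteration, forcing the
    newly generated suffix to stay continuous with actions already
    committed for execution.
    """
    k = prev_overlap.shape[0]
    x = rng.standard_normal((horizon, act_dim))
    x[:k] = prev_overlap               # executed actions as a prior
    for _ in range(n_iters):
        x = refine(x)                  # one denoising/refinement step
        x[:k] = prev_overlap           # preserve the overlapping segment
    return x
```

Clamping inside the sampling loop, rather than only overwriting the prefix afterwards, lets each refinement step of the suffix condition on the true committed actions, which is what maintains coherence across the chunk boundary.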
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
[7] Leave No Observation Behind: Real-Time Correction for VLA Action Chunks
Contribution Analysis
Detailed comparisons for each claimed contribution
REMAC: Real-time Execution with Masked Action Chunking
The authors introduce REMAC, a training-time method that adapts pretrained vision-language-action policies for asynchronous inference by learning corrective adjustments through masked action chunking. This approach addresses intra-chunk inconsistency by masking arbitrary portions of action chunks during training, enabling the policy to handle misalignments between observations and executed actions without introducing additional inference latency.
[7] Leave No Observation Behind: Real-Time Correction for VLA Action Chunks
[34] : a VLA That Learns From Experience
[35] AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
[36] A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation
[37] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Identification of intra-chunk inconsistency as a critical failure mode
The authors identify and formalize intra-chunk inconsistency as a previously overlooked challenge in asynchronous inference with action chunking. This occurs when executed actions from a previous chunk are conditioned on outdated observations, creating a perception-action mismatch within a single chunk that degrades policy performance.
[1] VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
[38] SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
[39] ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
[40] Mobile Robot Programming Using Natural Language
Prefix-preserved sampling procedure for inter-chunk continuity
The authors propose a prefix-preserved sampling procedure that initializes action generation using previously executed actions as priors and preserves the overlapping segment between consecutive chunks during sampling. This method enhances inter-chunk continuity by maintaining coherence across chunk boundaries during asynchronous execution.