Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: VLA, Robot Foundation Model, Robot Learning, Reinforcement Learning
Abstract:

Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a plug-and-play framework that improves VLAs through residual reinforcement learning and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist’s deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates practicality on real-world Franka arm manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs’ capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a three-stage framework (Probe, Learn, Distill) that combines residual reinforcement learning with targeted data generation to improve vision-language-action models. It resides in the 'Residual RL for Vision-Language-Action Models' leaf, which contains only two papers, including the original work. Within the broader taxonomy of eight papers across six leaf nodes, this is a relatively sparse research direction, suggesting that the specific combination of residual RL and VLA post-training remains an emerging area rather than a saturated subfield.

The taxonomy reveals three main branches: residual RL methods, data generation strategies, and VLA architecture improvements. The original paper bridges the first two branches by using residual specialists to probe failure modes and then generating training data from hybrid rollouts. Neighboring leaves include force control applications of residual RL and self-improving data generation cycles, but these focus on contact-rich tasks or autonomous dataset expansion respectively. The paper's integration of residual policy training with distribution-aware data collection distinguishes it from purely architectural or purely data-centric approaches in sibling categories.

Among nineteen candidates examined, the PLD framework contribution shows one refutable candidate out of three examined, the hybrid rollout scheme shows one refutable candidate out of six examined, and the systematic study of RL-generated data shows two refutable candidates out of ten examined. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The framework and rollout contributions appear to have less substantial prior overlap based on the examined candidates, while the systematic study dimension encounters more related work within the sample, though the absolute numbers remain small given the constrained search.

Based on the limited literature search of nineteen candidates, the work appears to occupy a relatively novel position combining residual RL with data generation for VLA models. The sparse taxonomy leaf and low refutation rates across contributions suggest incremental novelty, though the analysis cannot rule out relevant work outside the top-K semantic neighborhood. The framework's distinctiveness likely stems from its specific three-stage integration rather than individual components in isolation.

Taxonomy

Core-task Taxonomy Papers: 8
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 4

Research Landscape Overview

Core task: Improving vision-language-action models through residual reinforcement learning and data generation.

The field organizes around several complementary directions. One major branch focuses on residual reinforcement learning for policy enhancement, where methods refine pretrained vision-language-action (VLA) models by learning corrective residual policies on top of base behaviors. A second branch emphasizes data generation and augmentation strategies, exploring synthetic trajectory creation and experience replay to overcome data scarcity in robot learning. A third branch examines VLA model architectures and training recipes, investigating how to effectively combine visual perception, language grounding, and action prediction within unified frameworks. Finally, survey and taxonomic studies provide structured overviews of these evolving methodologies, as seen in works like VLA Recipe Survey[1] and VLA RL Survey[7].

Within the residual RL branch, a central theme is how to leverage large pretrained models while adapting them efficiently to specific tasks or environments. Residual RL VLA[0] exemplifies this approach by applying residual reinforcement learning directly to vision-language-action models, aiming to preserve the generalization of pretraining while fine-tuning task-specific corrections. This contrasts with VLA From Experience[4], which also builds on pretrained VLA foundations but may emphasize different aspects of experience collection or policy composition. Meanwhile, related efforts like Dexflywheel[5] explore continuous improvement loops that combine data generation with iterative policy refinement, highlighting trade-offs between sample efficiency and the complexity of maintaining stable learning dynamics.

The original paper sits squarely in this residual RL cluster, sharing with VLA From Experience[4] a focus on enhancing pretrained models, yet distinguished by its explicit residual formulation and integration of synthetic data generation to bootstrap the refinement process.

Claimed Contributions

Probe, Learn, Distill (PLD) framework for VLA post-training

The authors introduce a three-stage pipeline: it freezes the VLA backbone and trains lightweight residual actors via off-policy RL to probe failure regions; it employs a hybrid rollout scheme to collect data aligned with the base policy's distribution; and it distills the collected trajectories back into the generalist using standard supervised fine-tuning.

3 retrieved papers
Can Refute
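The residual formulation underlying Stage 1 can be sketched as follows: a small trainable actor outputs a bounded correction that is added to the frozen base policy's action. This is a minimal illustration, not the authors' implementation; the class name, callables, action dimension, and scale factor are all assumptions, and in practice the residual head would be trained with an off-policy RL algorithm.

```python
import numpy as np

# Illustrative sketch of residual specialist acquisition (Stage 1):
# a lightweight residual actor corrects a frozen base policy.
# `base_policy` and `residual_actor` are placeholder callables, not
# the paper's API; the residual would normally be trained off-policy.

class ResidualSpecialist:
    def __init__(self, base_policy, residual_actor, scale=0.1):
        self.base = base_policy         # frozen VLA backbone (no gradients)
        self.residual = residual_actor  # trainable residual head
        self.scale = scale              # bounds the correction magnitude

    def act(self, obs):
        a_base = self.base(obs)                          # generalist proposal
        delta = self.scale * self.residual(obs, a_base)  # learned correction
        return a_base + delta                            # executed action

# Toy demo with constant policies and a 7-DoF action vector.
specialist = ResidualSpecialist(
    base_policy=lambda obs: np.zeros(7),
    residual_actor=lambda obs, a: np.ones(7),
)
action = specialist.act(obs=None)  # base action shifted by 0.1 per dimension
```

Keeping the correction small (via `scale`) is one common way such schemes stay close to the base policy's behavior while still overriding it in failure states.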
Hybrid rollout scheme with base policy probing

The authors propose a data collection mechanism that first rolls out the base policy for a random number of steps, then lets the residual RL policy take over. This generates demonstration trajectories that contain recovery behaviors from suboptimal regions while remaining aligned with the base policy's state distribution.

6 retrieved papers
Can Refute
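The hybrid rollout described above can be sketched in a few lines: sample a random switch point, let the base policy act up to it, then hand control to the specialist. The environment and policy interfaces below are illustrative assumptions, not the paper's code; the toy 1-D environment exists only to make the sketch runnable.

```python
import random

# Hedged sketch of the hybrid rollout (Stage 2): the frozen base policy
# acts for a random prefix of steps, then the residual specialist takes
# over, so trajectories stay near the base policy's state distribution
# while capturing the specialist's recovery behavior.

def hybrid_rollout(env, base_policy, specialist, horizon, max_prefix):
    k = random.randint(0, max_prefix)   # random switch point
    obs = env.reset()
    trajectory = []
    for t in range(horizon):
        policy = base_policy if t < k else specialist
        action = policy(obs)
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory  # distilled into the generalist only if successful

# Toy 1-D environment: succeed by reaching position >= 3 from the origin.
class ToyEnv:
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos += action
        done = self.pos >= 3
        return self.pos, float(done), done

# Base policy is stuck (always 0); the specialist moves forward (always 1).
traj = hybrid_rollout(ToyEnv(), lambda o: 0, lambda o: 1,
                      horizon=20, max_prefix=5)
```

Randomizing the prefix length is what biases interventions toward states the base policy actually visits, rather than always starting recovery from the initial state.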
Systematic study of RL-generated data for VLA generalization

The authors provide a comprehensive empirical analysis examining how automatically generated RL data compares to human demonstrations and other data sources, demonstrating that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations for both in-distribution performance and zero-shot generalization to unseen tasks.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Probe, Learn, Distill (PLD) framework for VLA post-training

The authors introduce a three-stage pipeline: it freezes the VLA backbone and trains lightweight residual actors via off-policy RL to probe failure regions; it employs a hybrid rollout scheme to collect data aligned with the base policy's distribution; and it distills the collected trajectories back into the generalist using standard supervised fine-tuning.

Contribution

Hybrid rollout scheme with base policy probing

The authors propose a data collection mechanism that first rolls out the base policy for a random number of steps, then lets the residual RL policy take over. This generates demonstration trajectories that contain recovery behaviors from suboptimal regions while remaining aligned with the base policy's state distribution.

Contribution

Systematic study of RL-generated data for VLA generalization

The authors provide a comprehensive empirical analysis examining how automatically generated RL data compares to human demonstrations and other data sources, demonstrating that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations for both in-distribution performance and zero-shot generalization to unseen tasks.