Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: VLA, Robot Foundation Model, Robot Learning, Reinforcement Learning
Abstract:

Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a plug-and-play framework that improves VLAs through residual reinforcement learning and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist’s deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates practicality on real-world Franka arm manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs’ capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a three-stage framework (Probe, Learn, Distill) that combines residual reinforcement learning with targeted data generation to improve vision-language-action models. It resides in the 'Residual RL for Vision-Language-Action Models' leaf, which contains only two papers, including the original work. Within the broader taxonomy of eight papers across six leaf nodes, this is a relatively sparse research direction, suggesting that the specific combination of residual RL and VLA post-training remains an emerging area rather than a saturated subfield.

The taxonomy reveals three main branches: residual RL methods, data generation strategies, and VLA architecture improvements. The original paper bridges the first two branches by using residual specialists to probe failure modes and then generating training data from hybrid rollouts. Neighboring leaves include force control applications of residual RL and self-improving data generation cycles, but these focus on contact-rich tasks or autonomous dataset expansion respectively. The paper's integration of residual policy training with distribution-aware data collection distinguishes it from purely architectural or purely data-centric approaches in sibling categories.

Among nineteen candidates examined, the PLD framework contribution shows one refutable candidate out of three examined, the hybrid rollout scheme shows one refutable candidate out of six examined, and the systematic study of RL-generated data shows two refutable candidates out of ten examined. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The framework and rollout contributions appear to have less substantial prior overlap based on the examined candidates, while the systematic study dimension encounters more related work within the sample, though the absolute numbers remain small given the constrained search.

Based on the limited literature search of nineteen candidates, the work appears to occupy a relatively novel position combining residual RL with data generation for VLA models. The sparse taxonomy leaf and low refutation rates across contributions suggest incremental novelty, though the analysis cannot rule out relevant work outside the top-K semantic neighborhood. The framework's distinctiveness likely stems from its specific three-stage integration rather than individual components in isolation.

Taxonomy

Core-task Taxonomy Papers: 8
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 4

Research Landscape Overview

Core task: Improving vision-language-action models through residual reinforcement learning and data generation.

The field organizes around several complementary directions. One major branch focuses on residual reinforcement learning for policy enhancement, where methods refine pretrained vision-language-action (VLA) models by learning corrective residual policies on top of base behaviors. A second branch emphasizes data generation and augmentation strategies, exploring synthetic trajectory creation and experience replay to overcome data scarcity in robot learning. A third branch examines VLA model architectures and training recipes, investigating how to effectively combine visual perception, language grounding, and action prediction within unified frameworks. Finally, survey and taxonomic studies provide structured overviews of these evolving methodologies, as seen in works like VLA Recipe Survey[1] and VLA RL Survey[7].

Within the residual RL branch, a central theme is how to leverage large pretrained models while adapting them efficiently to specific tasks or environments. Residual RL VLA[0] exemplifies this approach by applying residual reinforcement learning directly to vision-language-action models, aiming to preserve the generalization of pretraining while fine-tuning task-specific corrections. This contrasts with VLA From Experience[4], which also builds on pretrained VLA foundations but may emphasize different aspects of experience collection or policy composition. Meanwhile, related efforts like Dexflywheel[5] explore continuous improvement loops that combine data generation with iterative policy refinement, highlighting trade-offs between sample efficiency and the complexity of maintaining stable learning dynamics.

The original paper sits squarely in this residual RL cluster, sharing with VLA From Experience[4] a focus on enhancing pretrained models, yet distinguished by its explicit residual formulation and integration of synthetic data generation to bootstrap the refinement process.

Claimed Contributions

Probe, Learn, Distill (PLD) framework for VLA post-training

The authors introduce a three-stage pipeline: it freezes the VLA backbone and trains lightweight residual actors via off-policy RL to probe failure regions; it employs a hybrid rollout scheme to collect data aligned with the base policy's distribution; and it distills the collected trajectories back into the generalist using standard supervised fine-tuning.

3 retrieved papers
Can Refute
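The residual formulation underlying Stage 1 can be sketched as follows: a small trainable actor outputs a bounded correction that is added to the frozen base policy's action. This is a minimal illustration, not the authors' implementation; the class name, callables, action dimension, and scale factor are all assumptions, and in practice the residual head would be trained with an off-policy RL algorithm.

```python
import numpy as np

# Illustrative sketch of residual specialist acquisition (Stage 1):
# a lightweight residual actor corrects a frozen base policy.
# `base_policy` and `residual_actor` are placeholder callables, not
# the paper's API; the residual would normally be trained off-policy.

class ResidualSpecialist:
    def __init__(self, base_policy, residual_actor, scale=0.1):
        self.base = base_policy         # frozen VLA backbone (no gradients)
        self.residual = residual_actor  # trainable residual head
        self.scale = scale              # bounds the correction magnitude

    def act(self, obs):
        a_base = self.base(obs)                          # generalist proposal
        delta = self.scale * self.residual(obs, a_base)  # learned correction
        return a_base + delta                            # executed action

# Toy demo with constant policies and a 7-DoF action vector.
specialist = ResidualSpecialist(
    base_policy=lambda obs: np.zeros(7),
    residual_actor=lambda obs, a: np.ones(7),
)
action = specialist.act(obs=None)  # base action shifted by 0.1 per dimension
```

Keeping the correction small (via `scale`) is one common way such schemes stay close to the base policy's behavior while still overriding it in failure states.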
Hybrid rollout scheme with base policy probing

The authors propose a data collection mechanism that first rolls out the base policy for a random number of steps, then lets the residual RL policy take over. This generates demonstration trajectories that contain recovery behaviors from suboptimal regions while remaining aligned with the base policy's state distribution.

6 retrieved papers
Can Refute
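The hybrid rollout described above can be sketched in a few lines: sample a random switch point, let the base policy act up to it, then hand control to the specialist. The environment and policy interfaces below are illustrative assumptions, not the paper's code; the toy 1-D environment exists only to make the sketch runnable.

```python
import random

# Hedged sketch of the hybrid rollout (Stage 2): the frozen base policy
# acts for a random prefix of steps, then the residual specialist takes
# over, so trajectories stay near the base policy's state distribution
# while capturing the specialist's recovery behavior.

def hybrid_rollout(env, base_policy, specialist, horizon, max_prefix):
    k = random.randint(0, max_prefix)   # random switch point
    obs = env.reset()
    trajectory = []
    for t in range(horizon):
        policy = base_policy if t < k else specialist
        action = policy(obs)
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory  # distilled into the generalist only if successful

# Toy 1-D environment: succeed by reaching position >= 3 from the origin.
class ToyEnv:
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos += action
        done = self.pos >= 3
        return self.pos, float(done), done

# Base policy is stuck (always 0); the specialist moves forward (always 1).
traj = hybrid_rollout(ToyEnv(), lambda o: 0, lambda o: 1,
                      horizon=20, max_prefix=5)
```

Randomizing the prefix length is what biases interventions toward states the base policy actually visits, rather than always starting recovery from the initial state.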
Systematic study of RL-generated data for VLA generalization

The authors provide a comprehensive empirical analysis examining how automatically generated RL data compares to human demonstrations and other data sources, demonstrating that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations for both in-distribution performance and zero-shot generalization to unseen tasks.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Probe, Learn, Distill (PLD) framework for VLA post-training

The authors introduce a three-stage pipeline: it freezes the VLA backbone and trains lightweight residual actors via off-policy RL to probe failure regions; it employs a hybrid rollout scheme to collect data aligned with the base policy's distribution; and it distills the collected trajectories back into the generalist using standard supervised fine-tuning.

Contribution

Hybrid rollout scheme with base policy probing

The authors propose a data collection mechanism that first rolls out the base policy for a random number of steps, then lets the residual RL policy take over. This generates demonstration trajectories that contain recovery behaviors from suboptimal regions while remaining aligned with the base policy's state distribution.

Contribution

Systematic study of RL-generated data for VLA generalization

The authors provide a comprehensive empirical analysis examining how automatically generated RL data compares to human demonstrations and other data sources, demonstrating that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations for both in-distribution performance and zero-shot generalization to unseen tasks.