SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: VLA Models, Reinforcement Learning, Bimanual Manipulation, Robot Learning
Abstract:

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of the large-scale robotic trajectories required to scale SFT, and (ii) limited generalization to tasks under distribution shift. To overcome these limitations, we explore reinforcement learning (RL) as a pathway to scaling VLA training beyond limited datasets. Inspired by LLM breakthroughs in which RL with outcome rewards enhances step-by-step reasoning, we ask: can outcome-driven RL improve the long-horizon, step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. Applied to OpenVLA-OFT, SimpleVLA-RL achieves 99% of SoTA performance on LIBERO and an 80% relative improvement on RoboTwin 1.0 & 2.0, outperforming π0 with our proposed exploration-enhancing strategies. SimpleVLA-RL reduces dependence on large-scale data, enables robust generalization, and remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers patterns unseen in prior training.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes the tasks and contributions of academic papers against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SimpleVLA-RL, an efficient reinforcement learning framework for vision-language-action models applied to robotic manipulation. It resides in the 'Robotic Manipulation and Embodied Control' leaf, which contains six papers including the original work. This leaf sits within the broader 'Application Domains and Task-Specific Adaptations' branch, indicating a moderately populated research direction focused on practical deployment. The taxonomy reveals that robotic manipulation is one of four application domains, suggesting this is an active but not overcrowded area compared to the algorithmic development branches.

The taxonomy structure shows that SimpleVLA-RL's leaf neighbors include autonomous driving, vision-language navigation, and GUI agents, each addressing distinct embodiment challenges. The paper's approach connects to algorithmic branches such as 'Policy Gradient and Proximal Policy Optimization Methods' and 'Online Reinforcement Learning and Interactive Training', which contain four and three papers respectively. The 'Reward Design and World Model Integration' branch, particularly 'Verifiable Reward and Outcome-Based Learning' with two papers, provides relevant context for the outcome-driven RL paradigm. This positioning suggests the work bridges application-specific concerns with established algorithmic foundations.

Among 27 candidates examined, the contribution-level analysis reveals mixed novelty signals. The core framework contribution (SimpleVLA-RL) examined 10 candidates with zero refutations, suggesting relative novelty in the specific engineering approach. Exploration-enhancing strategies examined 7 candidates, also with zero refutations, indicating potential originality in this aspect. However, the outcome-driven RL paradigm examined 10 candidates and found 1 refutable match, suggesting that using simple binary rewards for VLA training has precedent in the limited search scope. The statistics indicate a focused but not exhaustive literature review.

Based on the top-27 semantic matches examined, the work appears to offer engineering contributions in framework design and exploration strategies, while the outcome-driven RL concept shows overlap with prior work. The taxonomy placement in a moderately populated leaf suggests the paper addresses a recognized problem space with established context. The analysis does not cover the full breadth of robotics or RL literature, so definitive novelty claims require broader verification beyond this semantic search scope.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 27
Refutable papers: 1

Research Landscape Overview

Core task: scaling vision-language-action model training via reinforcement learning. The field has organized itself around several complementary branches that address different facets of this challenge. Reinforcement Learning Algorithms for VLA Fine-Tuning explores policy optimization techniques such as PPO variants and flow-based methods, while Reward Design and World Model Integration examines how to construct effective learning signals and leverage predictive models for sample efficiency. Reasoning and Chain-of-Thought Integration investigates how to incorporate deliberative planning into action selection, and Data Generation and Self-Improvement Strategies focuses on synthetic data creation and autonomous curriculum learning. Application Domains and Task-Specific Adaptations targets concrete settings like robotic manipulation, navigation, and GUI control, whereas Architectural Innovations and Training Frameworks addresses model design and scalable training infrastructure. Survey and Taxonomic Studies provide high-level perspectives on the rapidly evolving landscape, as seen in works like Pure VLA Survey[3] and VLA RL Survey[38].

A particularly active line of work centers on robotic manipulation and embodied control, where methods must balance generalization across diverse tasks with sample-efficient learning from limited real-world interactions. SimpleVLA-RL[0] sits within this branch, emphasizing straightforward RL-based fine-tuning for vision-language-action policies in manipulation settings. It shares thematic ground with VLA Online RL[5], which also explores online policy improvement, and contrasts with approaches like IRL-VLA[4] that leverage inverse reinforcement learning for reward specification. Nearby efforts such as SafeVLA[15] and SafeVLA Constrained[17] highlight the importance of safety constraints in physical systems, while MoRE[23] and MobileVLA-R1[42] address mixture-of-experts architectures and mobile manipulation, respectively.

The central tension across these works involves trading off exploration risk, computational cost, and the ability to generalize from pre-trained vision-language representations to precise low-level control, with SimpleVLA-RL[0] positioning itself as a practical entry point that prioritizes simplicity and scalability in the RL fine-tuning process.

Claimed Contributions

SimpleVLA-RL: An efficient RL framework for VLA models

The authors develop an end-to-end reinforcement learning framework specifically designed for Vision-Language-Action models. This framework extends veRL with VLA-specific components including interactive trajectory sampling, parallel multi-environment rendering, and optimized training-inference-rendering infrastructure to enable stable and sample-efficient online RL training for robotic manipulation.

10 retrieved papers
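The interactive trajectory sampling and parallel multi-environment rollout described for this contribution can be sketched as a simple rollout loop. This is an illustrative outline only, not SimpleVLA-RL's actual implementation: `ToyEnv` and `policy` are hypothetical stand-ins for a manipulation simulator and a VLA model.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a manipulation simulator with a gym-like API."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"obs": self.rng.random()}

    def step(self, action):
        self.t += 1
        done = self.t >= 5  # fixed-length toy episode
        # Sparse outcome reward: nonzero only at episode end, on "success".
        reward = 1.0 if done and action > 0.5 else 0.0
        return {"obs": self.rng.random()}, reward, done

def policy(obs, temperature=1.0):
    """Placeholder for a VLA forward pass; temperature widens exploration."""
    return min(1.0, random.random() * temperature)

def sample_trajectories(num_envs=4, seed=0):
    """Roll out one episode per environment; return (trajectory, return) pairs."""
    random.seed(seed)
    envs = [ToyEnv(seed + i) for i in range(num_envs)]
    groups = []
    for env in envs:
        obs, traj, ret, done = env.reset(), [], 0.0, False
        while not done:
            action = policy(obs["obs"])
            obs, reward, done = env.step(action)
            traj.append((action, reward))
            ret += reward
        groups.append((traj, ret))
    return groups

rollouts = sample_trajectories()
```

In the real framework the per-environment loop would run in parallel workers with rendering decoupled from inference; the sequential loop above only shows the data each rollout contributes to the RL update.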
Exploration-enhancing strategies for VLA RL training

The authors introduce three key modifications to enhance exploration during RL training: dynamic sampling that excludes uniform-reward trajectory groups, higher clipping range in the GRPO objective (from [0.8, 1.2] to [0.8, 1.28]), and increased rollout temperature (from 1.0 to 1.6). These strategies collectively improve training stability and policy performance.

7 retrieved papers
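The three exploration tweaks above can be sketched in a few lines, assuming GRPO-style group-normalized advantages. The constants mirror the reported values (clip range [0.8, 1.28]; the rollout temperature of 1.6 would be applied at sampling time), but these functions are an illustrative sketch, not the paper's code.

```python
def keep_group(rewards):
    """Dynamic sampling: drop trajectory groups whose rewards are uniform,
    since a group with identical outcomes yields zero advantage signal."""
    return len(set(rewards)) > 1

def group_advantages(rewards):
    """GRPO-style advantage: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, clip_low=0.8, clip_high=1.28):
    """PPO-style clipped surrogate with the widened upper clip bound,
    allowing larger updates toward newly discovered successful actions."""
    clipped = min(max(ratio, clip_low), clip_high)
    return min(ratio * advantage, clipped * advantage)
```

Raising only the upper clip bound (1.2 → 1.28) is asymmetric on purpose: it permits bigger probability increases for positively-advantaged actions while keeping the usual limit on decreases.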
Outcome-driven RL paradigm for VLA with simple binary rewards

The authors apply an outcome-level reinforcement learning approach to VLA models using only sparse binary rewards (1 for task success, 0 for failure) rather than hand-crafted dense rewards. This paradigm, inspired by recent LLM breakthroughs, enables VLA models to improve long-horizon action planning through trial-and-error exploration without requiring task-specific reward engineering.

10 retrieved papers
Can Refute
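The outcome-level reward described for this contribution reduces to a success indicator broadcast over the trajectory. A minimal sketch (function names are illustrative, not from the paper):

```python
def outcome_reward(task_succeeded):
    """Sparse binary reward: 1 for task success, 0 for failure.
    No dense shaping and no task-specific reward terms."""
    return 1.0 if task_succeeded else 0.0

def broadcast_return(trajectory_len, task_succeeded):
    """Every action in the trajectory receives the same terminal outcome;
    credit assignment is left entirely to the policy-gradient update."""
    r = outcome_reward(task_succeeded)
    return [r] * trajectory_len
```

The appeal is that only a success detector is needed per task; the cost, as the report notes, is that this paradigm has precedent in prior outcome-reward RL work.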

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: SimpleVLA-RL, an efficient RL framework for VLA models (described above).

Contribution 2: Exploration-enhancing strategies for VLA RL training (described above).

Contribution 3: Outcome-driven RL paradigm for VLA with simple binary rewards (described above).
