SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: VLA Models, Reinforcement Learning, Bimanual Manipulation, Robot Learning
Abstract:

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of the large-scale robotic trajectories required to scale SFT, and (ii) limited generalization to tasks under distribution shift. To overcome these limitations, we explore reinforcement learning (RL) as a pathway to scaling VLA training beyond limited datasets. Inspired by LLM breakthroughs in which RL with outcome rewards enhances step-by-step reasoning, we ask: can outcome-driven RL improve the long-horizon, step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. Applied to OpenVLA-OFT, SimpleVLA-RL achieves 99% of SoTA performance on LIBERO and an 80% relative improvement on RoboTwin 1.0 & 2.0, outperforming π0 with our proposed exploration-enhancing strategies. SimpleVLA-RL reduces dependence on large-scale data, enables robust generalization, and remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers patterns unseen in prior training.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes the tasks and contributions of academic papers against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SimpleVLA-RL, an efficient reinforcement learning framework for vision-language-action models applied to robotic manipulation. It resides in the 'Robotic Manipulation and Embodied Control' leaf, which contains six papers including the original work. This leaf sits within the broader 'Application Domains and Task-Specific Adaptations' branch, indicating a moderately populated research direction focused on practical deployment. The taxonomy reveals that robotic manipulation is one of four application domains, suggesting this is an active but not overcrowded area compared to the algorithmic development branches.

The taxonomy structure shows that SimpleVLA-RL's leaf neighbors include autonomous driving, vision-language navigation, and GUI agents, each addressing distinct embodiment challenges. The paper's approach connects to algorithmic branches such as 'Policy Gradient and Proximal Policy Optimization Methods' and 'Online Reinforcement Learning and Interactive Training', which contain four and three papers respectively. The 'Reward Design and World Model Integration' branch, particularly 'Verifiable Reward and Outcome-Based Learning' with two papers, provides relevant context for the outcome-driven RL paradigm. This positioning suggests the work bridges application-specific concerns with established algorithmic foundations.

Among 27 candidates examined, the contribution-level analysis reveals mixed novelty signals. The core framework contribution (SimpleVLA-RL) examined 10 candidates with zero refutations, suggesting relative novelty in the specific engineering approach. Exploration-enhancing strategies examined 7 candidates, also with zero refutations, indicating potential originality in this aspect. However, the outcome-driven RL paradigm examined 10 candidates and found 1 refutable match, suggesting that using simple binary rewards for VLA training has precedent in the limited search scope. The statistics indicate a focused but not exhaustive literature review.

Based on the top-27 semantic matches examined, the work appears to offer engineering contributions in framework design and exploration strategies, while the outcome-driven RL concept shows overlap with prior work. The taxonomy placement in a moderately populated leaf suggests the paper addresses a recognized problem space with established context. The analysis does not cover the full breadth of robotics or RL literature, so definitive novelty claims require broader verification beyond this semantic search scope.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 27
Refutable papers: 1

Research Landscape Overview

Core task: scaling vision-language-action model training via reinforcement learning. The field has organized itself around several complementary branches that address different facets of this challenge. Reinforcement Learning Algorithms for VLA Fine-Tuning explores policy optimization techniques such as PPO variants and flow-based methods, while Reward Design and World Model Integration examines how to construct effective learning signals and leverage predictive models for sample efficiency. Reasoning and Chain-of-Thought Integration investigates how to incorporate deliberative planning into action selection, and Data Generation and Self-Improvement Strategies focuses on synthetic data creation and autonomous curriculum learning. Application Domains and Task-Specific Adaptations targets concrete settings like robotic manipulation, navigation, and GUI control, whereas Architectural Innovations and Training Frameworks addresses model design and scalable training infrastructure. Survey and Taxonomic Studies provide high-level perspectives on the rapidly evolving landscape, as seen in works like Pure VLA Survey[3] and VLA RL Survey[38].

A particularly active line of work centers on robotic manipulation and embodied control, where methods must balance generalization across diverse tasks with sample-efficient learning from limited real-world interactions. SimpleVLA-RL[0] sits within this branch, emphasizing straightforward RL-based fine-tuning for vision-language-action policies in manipulation settings. It shares thematic ground with VLA Online RL[5], which also explores online policy improvement, and contrasts with approaches like IRL-VLA[4] that leverage inverse reinforcement learning for reward specification. Nearby efforts such as SafeVLA[15] and SafeVLA Constrained[17] highlight the importance of safety constraints in physical systems, while MoRE[23] and MobileVLA-R1[42] address mixture-of-experts architectures and mobile manipulation, respectively.

The central tension across these works involves trading off exploration risk, computational cost, and the ability to generalize from pre-trained vision-language representations to precise low-level control, with SimpleVLA-RL[0] positioning itself as a practical entry point that prioritizes simplicity and scalability in the RL fine-tuning process.

Claimed Contributions

SimpleVLA-RL: An efficient RL framework for VLA models

The authors develop an end-to-end reinforcement learning framework specifically designed for Vision-Language-Action models. This framework extends veRL with VLA-specific components including interactive trajectory sampling, parallel multi-environment rendering, and optimized training-inference-rendering infrastructure to enable stable and sample-efficient online RL training for robotic manipulation.

10 retrieved papers
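The interactive trajectory sampling and parallel multi-environment rollout described for this contribution can be sketched as a simple rollout loop. This is an illustrative outline only, not SimpleVLA-RL's actual implementation: `ToyEnv` and `policy` are hypothetical stand-ins for a manipulation simulator and a VLA model.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a manipulation simulator with a gym-like API."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"obs": self.rng.random()}

    def step(self, action):
        self.t += 1
        done = self.t >= 5  # fixed-length toy episode
        # Sparse outcome reward: nonzero only at episode end, on "success".
        reward = 1.0 if done and action > 0.5 else 0.0
        return {"obs": self.rng.random()}, reward, done

def policy(obs, temperature=1.0):
    """Placeholder for a VLA forward pass; temperature widens exploration."""
    return min(1.0, random.random() * temperature)

def sample_trajectories(num_envs=4, seed=0):
    """Roll out one episode per environment; return (trajectory, return) pairs."""
    random.seed(seed)
    envs = [ToyEnv(seed + i) for i in range(num_envs)]
    groups = []
    for env in envs:
        obs, traj, ret, done = env.reset(), [], 0.0, False
        while not done:
            action = policy(obs["obs"])
            obs, reward, done = env.step(action)
            traj.append((action, reward))
            ret += reward
        groups.append((traj, ret))
    return groups

rollouts = sample_trajectories()
```

In the real framework the per-environment loop would run in parallel workers with rendering decoupled from inference; the sequential loop above only shows the data each rollout contributes to the RL update.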
Exploration-enhancing strategies for VLA RL training

The authors introduce three key modifications to enhance exploration during RL training: dynamic sampling that excludes uniform-reward trajectory groups, higher clipping range in the GRPO objective (from [0.8, 1.2] to [0.8, 1.28]), and increased rollout temperature (from 1.0 to 1.6). These strategies collectively improve training stability and policy performance.

7 retrieved papers
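The three exploration tweaks above can be sketched in a few lines, assuming GRPO-style group-normalized advantages. The constants mirror the reported values (clip range [0.8, 1.28]; the rollout temperature of 1.6 would be applied at sampling time), but these functions are an illustrative sketch, not the paper's code.

```python
def keep_group(rewards):
    """Dynamic sampling: drop trajectory groups whose rewards are uniform,
    since a group with identical outcomes yields zero advantage signal."""
    return len(set(rewards)) > 1

def group_advantages(rewards):
    """GRPO-style advantage: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, clip_low=0.8, clip_high=1.28):
    """PPO-style clipped surrogate with the widened upper clip bound,
    allowing larger updates toward newly discovered successful actions."""
    clipped = min(max(ratio, clip_low), clip_high)
    return min(ratio * advantage, clipped * advantage)
```

Raising only the upper clip bound (1.2 → 1.28) is asymmetric on purpose: it permits bigger probability increases for positively-advantaged actions while keeping the usual limit on decreases.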
Outcome-driven RL paradigm for VLA with simple binary rewards

The authors apply an outcome-level reinforcement learning approach to VLA models using only sparse binary rewards (1 for task success, 0 for failure) rather than hand-crafted dense rewards. This paradigm, inspired by recent LLM breakthroughs, enables VLA models to improve long-horizon action planning through trial-and-error exploration without requiring task-specific reward engineering.

10 retrieved papers
Can Refute
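The outcome-level reward described for this contribution reduces to a success indicator broadcast over the trajectory. A minimal sketch (function names are illustrative, not from the paper):

```python
def outcome_reward(task_succeeded):
    """Sparse binary reward: 1 for task success, 0 for failure.
    No dense shaping and no task-specific reward terms."""
    return 1.0 if task_succeeded else 0.0

def broadcast_return(trajectory_len, task_succeeded):
    """Every action in the trajectory receives the same terminal outcome;
    credit assignment is left entirely to the policy-gradient update."""
    r = outcome_reward(task_succeeded)
    return [r] * trajectory_len
```

The appeal is that only a success detector is needed per task; the cost, as the report notes, is that this paradigm has precedent in prior outcome-reward RL work.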

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: SimpleVLA-RL, an efficient RL framework for VLA models (described above).

Contribution 2: Exploration-enhancing strategies for VLA RL training (described above).

Contribution 3: Outcome-driven RL paradigm for VLA with simple binary rewards (described above).
