On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Robustness, Vision-Language-Action Models
Abstract:

In Vision–Language–Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) π0 demonstrates superior robustness. To build multi-modal robust VLAs, we propose RobustVLA, which defends against perturbations in both VLA inputs and outputs. For output robustness, we perform offline robust optimization against the worst-case action noise that maximizes the mismatch in the flow-matching objective; this can be interpreted as combining adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper-confidence-bound (UCB) algorithm to automatically identify the most harmful noise. Experiments on LIBERO show that RobustVLA delivers absolute gains over baselines of 12.6% on the π0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieves 50.6x faster inference than the existing visual-robust BYOVLA, which requires external LLMs, and yields a 10.4% gain under mixed perturbations. On the real-world FR5 robot, under four types of multi-modal perturbations, RobustVLA shows strong low-data performance, outperforming π0 by a 65.6% success rate with 25 demonstrations. Even with abundant demonstrations, our method still outperforms π0 by a 30% success rate. Code and demo videos are available at \url{https://anonymous.4open.science/r/RobustVLA-283D}.
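The input-robustness idea described above (enforcing consistent actions across semantics-preserving input variations) can be illustrated as a consistency regularizer. The sketch below is a toy stand-in, not the paper's implementation: the linear `policy` plays the role of the VLA backbone, and Gaussian sensor noise plays the role of a semantics-preserving augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(obs, w):
    """Toy linear policy standing in for a VLA backbone: action = W @ obs."""
    return w @ obs

def augment(obs, noise_scale=0.05):
    """A semantics-preserving input variation (here: small sensor noise)."""
    return obs + noise_scale * rng.standard_normal(obs.shape)

def consistency_loss(obs, w):
    """Penalize mismatch between actions on an input and its augmented variant.
    In training, this term would be added to the imitation / flow-matching loss."""
    a_clean = policy(obs, w)
    a_aug = policy(augment(obs), w)
    return float(np.mean((a_clean - a_aug) ** 2))

w = rng.standard_normal((4, 8))   # hypothetical policy weights
obs = rng.standard_normal(8)      # hypothetical observation features
print(f"consistency penalty: {consistency_loss(obs, w):.6f}")
```

Minimizing this penalty alongside the task loss pushes the policy to ignore nuisance variation while preserving the demonstrated behavior.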

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a comprehensive robustness evaluation of VLA models under 17 perturbations across four modalities (vision, language, action, environment) and proposes RobustVLA, a framework combining offline robust optimization against worst-case action noise with input consistency enforcement. It resides in the Multi-Dimensional Perturbation Benchmarks leaf, which contains four papers including VLA Multimodal Robustness, Eva-VLA, Libero-Plus, and one other benchmark. This leaf sits within the broader Robustness Evaluation Frameworks and Benchmarks branch, indicating a moderately populated research direction focused on systematic multi-modal testing rather than single-modality assessments.

The taxonomy reveals neighboring leaves addressing complementary evaluation angles: Fuzzing and Automated Testing Frameworks explore systematic vulnerability discovery, Agent Robustness evaluates compound systems in interactive environments, and Structured Task Difficulty measures capabilities through graded task hierarchies. The paper's multi-modal focus distinguishes it from domain-specific robustness studies (Navigation, Autonomous Driving) and from defense-oriented branches (Adversarial Training, Multi-Modal Defense). Its scope note emphasizes perturbations across objects, viewpoints, instructions, and environmental conditions, explicitly excluding single-dimension studies and agent-specific frameworks that appear in sibling categories.

Among 23 candidates examined, the comprehensive evaluation contribution (10 candidates, 0 refutable) and RobustVLA framework (10 candidates, 0 refutable) show no clear prior overlap within the limited search scope. However, the offline robust optimization against worst-case action noise (3 candidates examined) is clearly refuted by one candidate, suggesting this specific technical mechanism has precedent. The evaluation and framework contributions appear more novel given the absence of refuting work among the examined candidates, though the search scale (23 papers) leaves open the possibility of relevant prior work beyond top-K semantic matches.

Based on the limited literature search covering 23 candidates from semantic retrieval, the paper's evaluation protocol and multi-modal framework appear relatively novel within the examined scope, while the action-noise optimization technique shows overlap with at least one prior method. The taxonomy structure indicates this work occupies a moderately active research area with four sibling benchmarks, suggesting incremental but meaningful progress in multi-dimensional robustness assessment rather than exploration of entirely sparse territory.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 1

Research Landscape Overview

Core task: Robustness of vision-language-action models against multi-modal perturbations. The field has organized itself around several complementary branches that together address how VLA models withstand adversarial and natural disturbances across vision, language, and action modalities. Adversarial Attack Methods explore techniques to craft perturbations that exploit model vulnerabilities, while Defense and Robustness Enhancement Techniques develop countermeasures such as adversarial training and architectural modifications. Robustness Evaluation Frameworks and Benchmarks provide standardized testbeds like Eva-VLA[7] and Libero-Plus[8] to measure model resilience under diverse perturbation scenarios. Vulnerability Analysis characterizes threat surfaces, Safety Alignment enforces behavioral constraints, and Domain-Specific Applications examine robustness in specialized contexts such as manipulation or navigation. Surveys and Cross-Cutting Analyses synthesize insights across these dimensions, while Auxiliary Techniques supply supporting methods like data augmentation or interpretability tools. Within the evaluation landscape, a particularly active line of work focuses on multi-dimensional perturbation benchmarks that systematically probe VLA models across vision, language, and action channels simultaneously. VLA Multimodal Robustness[0] exemplifies this direction by proposing comprehensive evaluation protocols that test resilience to coordinated perturbations, contrasting with earlier single-modality assessments like Video-Language Robustness[9] or domain-specific suites such as LIBERO-PRO[16]. 
Neighboring efforts like Multimodal Adversarial Robustness[3] and Evaluating VLM Robustness[4] have explored robustness from complementary angles—examining cross-modal attack transferability or establishing foundational evaluation metrics—but VLA Multimodal Robustness[0] distinguishes itself by emphasizing the interplay of perturbations across all three modalities in embodied settings. This positioning reflects a broader shift toward holistic robustness characterization, moving beyond isolated modality tests to capture the complex failure modes that emerge when vision, language, and action channels are simultaneously stressed.

Claimed Contributions

Comprehensive robustness evaluation of VLAs under multi-modal perturbations

The authors systematically evaluate existing Vision-Language-Action models against 17 different perturbations spanning action, observation, environment, and instruction modalities. This evaluation reveals that actions are the most fragile modality, existing visual-robust VLAs do not generalize to other modalities, and π0 demonstrates superior robustness.

10 retrieved papers
RobustVLA framework for multi-modal robustness enhancement

The authors introduce RobustVLA, a unified framework that enhances robustness against both input and output perturbations in VLA models. The method combines offline robust optimization against worst-case action noise for output robustness with consistency enforcement across semantically equivalent inputs for input robustness, using a multi-armed bandit formulation with UCB to balance multiple perturbation types.
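The multi-armed bandit component can be sketched with standard UCB1: each perturbation type is an arm, and the "reward" is how harmful the perturbation is (here, a simulated training-loss signal). The arm names, harm values, and exploration constant below are all illustrative assumptions, not values from the paper.

```python
import math
import random

random.seed(0)

# Hypothetical perturbation "arms"; reward = how much loss the perturbation
# induces, so UCB concentrates training effort on the most harmful noise.
ARMS = ["action_noise", "camera_jitter", "instruction_paraphrase", "lighting"]

def simulated_harm(arm):
    """Stand-in for measuring robust-training loss under this perturbation."""
    base = {"action_noise": 0.9, "camera_jitter": 0.2,
            "instruction_paraphrase": 0.15, "lighting": 0.1}[arm]
    return base + random.uniform(-0.1, 0.1)

def ucb_select(counts, values, t, c=2.0):
    """UCB1: pick the arm maximizing empirical mean + exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:                      # play every arm once first
            return i
    return max(range(len(counts)),
               key=lambda i: values[i] + math.sqrt(c * math.log(t) / counts[i]))

counts = [0] * len(ARMS)
values = [0.0] * len(ARMS)
for t in range(1, 501):
    i = ucb_select(counts, values, t)
    r = simulated_harm(ARMS[i])
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]   # incremental mean update

most_harmful = ARMS[max(range(len(ARMS)), key=lambda i: counts[i])]
print(most_harmful, counts)
```

After a few hundred rounds the selection concentrates on the arm with the highest simulated harm, which is the behavior the framework relies on to prioritize the most damaging perturbation automatically.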

10 retrieved papers
Offline robust optimization against worst-case action noise in flow matching

The authors derive and optimize against worst-case action perturbations by maximizing the flow matching loss, which can be interpreted as adversarial training, label smoothing, and outlier penalization. This approach addresses the challenge of achieving action robustness in offline settings where interactive environments are unavailable.
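For a quadratic flow-matching loss, the worst-case action perturbation on an L2 ball has a closed form: push the regression target directly away from the model's predicted velocity. The sketch below uses a rectified-flow-style target `u = (action + delta) - noise` as a toy stand-in for the actual objective; the function names, the target parameterization, and `eps` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_pred, action, noise, delta):
    """Squared error between predicted velocity and a rectified-flow-style
    target u = (action + delta) - noise, with delta perturbing the label."""
    u = (action + delta) - noise
    return float(np.sum((v_pred - u) ** 2))

def worst_case_delta(v_pred, action, noise, eps=0.1):
    """Maximizer of the quadratic loss over the L2 ball ||delta|| <= eps:
    move the target opposite to the residual, inflating the mismatch."""
    r = v_pred - (action - noise)          # residual at delta = 0
    return -eps * r / (np.linalg.norm(r) + 1e-8)

action = rng.standard_normal(7)   # hypothetical action label
noise = rng.standard_normal(7)    # hypothetical flow source sample
v_pred = rng.standard_normal(7)   # hypothetical predicted velocity

d = worst_case_delta(v_pred, action, noise)
clean = flow_matching_loss(v_pred, action, noise, np.zeros(7))
adv = flow_matching_loss(v_pred, action, noise, d)
print(f"clean={clean:.3f}  worst-case={adv:.3f}")
```

Training against `d` instead of the clean label is the adversarial-training view of the objective; because the optimal `d` simply rescales the residual, it also acts like label smoothing toward robust targets and down-weights outlier demonstrations.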

3 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Comprehensive robustness evaluation of VLAs under multi-modal perturbations

The authors systematically evaluate existing Vision-Language-Action models against 17 different perturbations spanning action, observation, environment, and instruction modalities. This evaluation reveals that actions are the most fragile modality, existing visual-robust VLAs do not generalize to other modalities, and π0 demonstrates superior robustness.

Contribution

RobustVLA framework for multi-modal robustness enhancement

The authors introduce RobustVLA, a unified framework that enhances robustness against both input and output perturbations in VLA models. The method combines offline robust optimization against worst-case action noise for output robustness with consistency enforcement across semantically equivalent inputs for input robustness, using a multi-armed bandit formulation with UCB to balance multiple perturbation types.

Contribution

Offline robust optimization against worst-case action noise in flow matching

The authors derive and optimize against worst-case action perturbations by maximizing the flow matching loss, which can be interpreted as adversarial training, label smoothing, and outlier penalization. This approach addresses the challenge of achieving action robustness in offline settings where interactive environments are unavailable.