On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Robustness, Vision-Language-Action Models
Abstract:

In Vision–Language–Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) π0 demonstrates superior robustness. To build multi-modal robust VLAs, we propose RobustVLA, which defends against perturbations in both VLA inputs and outputs. For output robustness, we perform offline robust optimization against the worst-case action noise that maximizes the mismatch in the flow-matching objective; this can be interpreted as combining adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper-confidence-bound (UCB) algorithm to automatically identify the most harmful noise. Experiments on LIBERO show that RobustVLA delivers absolute gains over baselines of 12.6% on the π0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieves 50.6x faster inference than the existing visual-robust BYOVLA, which requires external LLMs, and yields a 10.4% gain under mixed perturbations. On the real-world FR5 robot, under four types of multi-modal perturbations, RobustVLA shows strong low-data performance, outperforming π0 by a 65.6% success rate with 25 demonstrations. Even with abundant demonstrations, our method still outperforms π0 by a 30% success rate. Code and demo videos are available at \url{https://anonymous.4open.science/r/RobustVLA-283D}.
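The input-robustness idea described above (enforcing consistent actions across semantics-preserving input variations) can be illustrated as a consistency regularizer. The sketch below is a toy stand-in, not the paper's implementation: the linear `policy` plays the role of the VLA backbone, and Gaussian sensor noise plays the role of a semantics-preserving augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(obs, w):
    """Toy linear policy standing in for a VLA backbone: action = W @ obs."""
    return w @ obs

def augment(obs, noise_scale=0.05):
    """A semantics-preserving input variation (here: small sensor noise)."""
    return obs + noise_scale * rng.standard_normal(obs.shape)

def consistency_loss(obs, w):
    """Penalize mismatch between actions on an input and its augmented variant.
    In training, this term would be added to the imitation / flow-matching loss."""
    a_clean = policy(obs, w)
    a_aug = policy(augment(obs), w)
    return float(np.mean((a_clean - a_aug) ** 2))

w = rng.standard_normal((4, 8))   # hypothetical policy weights
obs = rng.standard_normal(8)      # hypothetical observation features
print(f"consistency penalty: {consistency_loss(obs, w):.6f}")
```

Minimizing this penalty alongside the task loss pushes the policy to ignore nuisance variation while preserving the demonstrated behavior.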

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a comprehensive robustness evaluation of VLA models under 17 perturbations across four modalities (vision, language, action, environment) and proposes RobustVLA, a framework combining offline robust optimization against worst-case action noise with input consistency enforcement. It resides in the Multi-Dimensional Perturbation Benchmarks leaf, which contains four papers including VLA Multimodal Robustness, Eva-VLA, Libero-Plus, and one other benchmark. This leaf sits within the broader Robustness Evaluation Frameworks and Benchmarks branch, indicating a moderately populated research direction focused on systematic multi-modal testing rather than single-modality assessments.

The taxonomy reveals neighboring leaves addressing complementary evaluation angles: Fuzzing and Automated Testing Frameworks explore systematic vulnerability discovery, Agent Robustness evaluates compound systems in interactive environments, and Structured Task Difficulty measures capabilities through graded task hierarchies. The paper's multi-modal focus distinguishes it from domain-specific robustness studies (Navigation, Autonomous Driving) and from defense-oriented branches (Adversarial Training, Multi-Modal Defense). Its scope note emphasizes perturbations across objects, viewpoints, instructions, and environmental conditions, explicitly excluding single-dimension studies and agent-specific frameworks that appear in sibling categories.

Among 23 candidates examined, the comprehensive evaluation contribution (10 candidates, 0 refutable) and RobustVLA framework (10 candidates, 0 refutable) show no clear prior overlap within the limited search scope. However, the offline robust optimization against worst-case action noise (3 candidates examined) is clearly refuted by one candidate, suggesting this specific technical mechanism has precedent. The evaluation and framework contributions appear more novel given the absence of refuting work among the examined candidates, though the search scale (23 papers) leaves open the possibility of relevant prior work beyond top-K semantic matches.

Based on the limited literature search covering 23 candidates from semantic retrieval, the paper's evaluation protocol and multi-modal framework appear relatively novel within the examined scope, while the action-noise optimization technique shows overlap with at least one prior method. The taxonomy structure indicates this work occupies a moderately active research area with four sibling benchmarks, suggesting incremental but meaningful progress in multi-dimensional robustness assessment rather than exploration of entirely sparse territory.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 1

Research Landscape Overview

Core task: Robustness of vision-language-action models against multi-modal perturbations. The field has organized itself around several complementary branches that together address how VLA models withstand adversarial and natural disturbances across vision, language, and action modalities. Adversarial Attack Methods explore techniques to craft perturbations that exploit model vulnerabilities, while Defense and Robustness Enhancement Techniques develop countermeasures such as adversarial training and architectural modifications. Robustness Evaluation Frameworks and Benchmarks provide standardized testbeds like Eva-VLA[7] and Libero-Plus[8] to measure model resilience under diverse perturbation scenarios. Vulnerability Analysis characterizes threat surfaces, Safety Alignment enforces behavioral constraints, and Domain-Specific Applications examine robustness in specialized contexts such as manipulation or navigation. Surveys and Cross-Cutting Analyses synthesize insights across these dimensions, while Auxiliary Techniques supply supporting methods like data augmentation or interpretability tools. Within the evaluation landscape, a particularly active line of work focuses on multi-dimensional perturbation benchmarks that systematically probe VLA models across vision, language, and action channels simultaneously. VLA Multimodal Robustness[0] exemplifies this direction by proposing comprehensive evaluation protocols that test resilience to coordinated perturbations, contrasting with earlier single-modality assessments like Video-Language Robustness[9] or domain-specific suites such as LIBERO-PRO[16]. 
Neighboring efforts like Multimodal Adversarial Robustness[3] and Evaluating VLM Robustness[4] have explored robustness from complementary angles—examining cross-modal attack transferability or establishing foundational evaluation metrics—but VLA Multimodal Robustness[0] distinguishes itself by emphasizing the interplay of perturbations across all three modalities in embodied settings. This positioning reflects a broader shift toward holistic robustness characterization, moving beyond isolated modality tests to capture the complex failure modes that emerge when vision, language, and action channels are simultaneously stressed.

Claimed Contributions

Comprehensive robustness evaluation of VLAs under multi-modal perturbations

The authors systematically evaluate existing Vision-Language-Action models against 17 different perturbations spanning action, observation, environment, and instruction modalities. This evaluation reveals that actions are the most fragile modality, existing visual-robust VLAs do not generalize to other modalities, and π0 demonstrates superior robustness.

10 retrieved papers
RobustVLA framework for multi-modal robustness enhancement

The authors introduce RobustVLA, a unified framework that enhances robustness against both input and output perturbations in VLA models. The method combines offline robust optimization against worst-case action noise for output robustness with consistency enforcement across semantically equivalent inputs for input robustness, using a multi-armed bandit formulation with UCB to balance multiple perturbation types.
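The multi-armed bandit component can be sketched with standard UCB1: each perturbation type is an arm, and the "reward" is how harmful the perturbation is (here, a simulated training-loss signal). The arm names, harm values, and exploration constant below are all illustrative assumptions, not values from the paper.

```python
import math
import random

random.seed(0)

# Hypothetical perturbation "arms"; reward = how much loss the perturbation
# induces, so UCB concentrates training effort on the most harmful noise.
ARMS = ["action_noise", "camera_jitter", "instruction_paraphrase", "lighting"]

def simulated_harm(arm):
    """Stand-in for measuring robust-training loss under this perturbation."""
    base = {"action_noise": 0.9, "camera_jitter": 0.2,
            "instruction_paraphrase": 0.15, "lighting": 0.1}[arm]
    return base + random.uniform(-0.1, 0.1)

def ucb_select(counts, values, t, c=2.0):
    """UCB1: pick the arm maximizing empirical mean + exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:                      # play every arm once first
            return i
    return max(range(len(counts)),
               key=lambda i: values[i] + math.sqrt(c * math.log(t) / counts[i]))

counts = [0] * len(ARMS)
values = [0.0] * len(ARMS)
for t in range(1, 501):
    i = ucb_select(counts, values, t)
    r = simulated_harm(ARMS[i])
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]   # incremental mean update

most_harmful = ARMS[max(range(len(ARMS)), key=lambda i: counts[i])]
print(most_harmful, counts)
```

After a few hundred rounds the selection concentrates on the arm with the highest simulated harm, which is the behavior the framework relies on to prioritize the most damaging perturbation automatically.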

10 retrieved papers
Offline robust optimization against worst-case action noise in flow matching

The authors derive and optimize against worst-case action perturbations by maximizing the flow matching loss, which can be interpreted as adversarial training, label smoothing, and outlier penalization. This approach addresses the challenge of achieving action robustness in offline settings where interactive environments are unavailable.
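For a quadratic flow-matching loss, the worst-case action perturbation on an L2 ball has a closed form: push the regression target directly away from the model's predicted velocity. The sketch below uses a rectified-flow-style target `u = (action + delta) - noise` as a toy stand-in for the actual objective; the function names, the target parameterization, and `eps` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_pred, action, noise, delta):
    """Squared error between predicted velocity and a rectified-flow-style
    target u = (action + delta) - noise, with delta perturbing the label."""
    u = (action + delta) - noise
    return float(np.sum((v_pred - u) ** 2))

def worst_case_delta(v_pred, action, noise, eps=0.1):
    """Maximizer of the quadratic loss over the L2 ball ||delta|| <= eps:
    move the target opposite to the residual, inflating the mismatch."""
    r = v_pred - (action - noise)          # residual at delta = 0
    return -eps * r / (np.linalg.norm(r) + 1e-8)

action = rng.standard_normal(7)   # hypothetical action label
noise = rng.standard_normal(7)    # hypothetical flow source sample
v_pred = rng.standard_normal(7)   # hypothetical predicted velocity

d = worst_case_delta(v_pred, action, noise)
clean = flow_matching_loss(v_pred, action, noise, np.zeros(7))
adv = flow_matching_loss(v_pred, action, noise, d)
print(f"clean={clean:.3f}  worst-case={adv:.3f}")
```

Training against `d` instead of the clean label is the adversarial-training view of the objective; because the optimal `d` simply rescales the residual, it also acts like label smoothing toward robust targets and down-weights outlier demonstrations.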

3 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Comprehensive robustness evaluation of VLAs under multi-modal perturbations

The authors systematically evaluate existing Vision-Language-Action models against 17 different perturbations spanning action, observation, environment, and instruction modalities. This evaluation reveals that actions are the most fragile modality, existing visual-robust VLAs do not generalize to other modalities, and π0 demonstrates superior robustness.

Contribution

RobustVLA framework for multi-modal robustness enhancement

The authors introduce RobustVLA, a unified framework that enhances robustness against both input and output perturbations in VLA models. The method combines offline robust optimization against worst-case action noise for output robustness with consistency enforcement across semantically equivalent inputs for input robustness, using a multi-armed bandit formulation with UCB to balance multiple perturbation types.

Contribution

Offline robust optimization against worst-case action noise in flow matching

The authors derive and optimize against worst-case action perturbations by maximizing the flow matching loss, which can be interpreted as adversarial training, label smoothing, and outlier penalization. This approach addresses the challenge of achieving action robustness in offline settings where interactive environments are unavailable.