On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
Overview
Overall Novelty Assessment
The paper contributes a comprehensive robustness evaluation of VLA models under 17 perturbations spanning four modalities (vision, language, action, environment) and proposes RobustVLA, a framework that combines offline robust optimization against worst-case action noise with input-consistency enforcement. It resides in the Multi-Dimensional Perturbation Benchmarks leaf, which contains four papers: VLA Multimodal Robustness, Eva-VLA, Libero-Plus, and one other benchmark. This leaf sits within the broader Robustness Evaluation Frameworks and Benchmarks branch, indicating a moderately populated research direction focused on systematic multi-modal testing rather than single-modality assessments.
The taxonomy reveals neighboring leaves addressing complementary evaluation angles: Fuzzing and Automated Testing Frameworks explore systematic vulnerability discovery, Agent Robustness evaluates compound systems in interactive environments, and Structured Task Difficulty measures capabilities through graded task hierarchies. The paper's multi-modal focus distinguishes it from domain-specific robustness studies (Navigation, Autonomous Driving) and from defense-oriented branches (Adversarial Training, Multi-Modal Defense). Its scope note emphasizes perturbations across objects, viewpoints, instructions, and environmental conditions, explicitly excluding single-dimension studies and agent-specific frameworks that appear in sibling categories.
Among the 23 candidates examined, neither the comprehensive evaluation contribution (10 candidates, none refuting) nor the RobustVLA framework (10 candidates, none refuting) shows clear prior overlap within the limited search scope. However, the offline robust optimization against worst-case action noise (3 candidates examined) is clearly refuted by one candidate, suggesting this specific technical mechanism has precedent. The evaluation and framework contributions appear more novel given the absence of refuting work among the examined candidates, though the search scale (23 papers) leaves open the possibility of relevant prior work beyond the top-K semantic matches.
Based on the limited literature search covering 23 candidates from semantic retrieval, the paper's evaluation protocol and multi-modal framework appear relatively novel within the examined scope, while the action-noise optimization technique shows overlap with at least one prior method. The taxonomy structure indicates this work occupies a moderately active research area with four sibling benchmarks, suggesting incremental but meaningful progress in multi-dimensional robustness assessment rather than exploration of entirely sparse territory.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors systematically evaluate existing Vision-Language-Action models against 17 different perturbations spanning action, observation, environment, and instruction modalities. This evaluation reveals that actions are the most fragile modality, existing visual-robust VLAs do not generalize to other modalities, and π0 demonstrates superior robustness.
The authors introduce RobustVLA, a unified framework that enhances robustness against both input and output perturbations in VLA models. The method combines offline robust optimization against worst-case action noise for output robustness with consistency enforcement across semantically equivalent inputs for input robustness, using a multi-armed bandit formulation with UCB to balance multiple perturbation types.
The authors derive and optimize against worst-case action perturbations by maximizing the flow matching loss, which can be interpreted as adversarial training, label smoothing, and outlier penalization. This approach addresses the challenge of achieving action robustness in offline settings where interactive environments are unavailable.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical Variations PDF
[8] Libero-plus: In-depth robustness analysis of vision-language-action models PDF
[16] LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Comprehensive robustness evaluation of VLAs under multi-modal perturbations
The authors systematically evaluate existing Vision-Language-Action models against 17 different perturbations spanning action, observation, environment, and instruction modalities. This evaluation reveals that actions are the most fragile modality, existing visual-robust VLAs do not generalize to other modalities, and π0 demonstrates superior robustness.
[3] Revisiting the adversarial robustness of vision language models: a multimodal perspective PDF
[4] On Evaluating Adversarial Robustness of Large Vision-Language Models PDF
[10] Rationalvla: A rational vision-language-action model with dual system PDF
[13] VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation PDF
[14] A survey of attacks on large vision-language models: Resources, advances, and future trends PDF
[42] Mvtamperbench: Evaluating robustness of vision-language models PDF
[43] Boosting Adversarial Robustness of Vision-Language Pre-training Models against Multimodal Adversarial attacks PDF
[44] On the Adversarial Robustness of Multi-Modal Foundation Models PDF
[45] On the robustness of multimodal language model towards distractions PDF
[46] Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks PDF
RobustVLA framework for multi-modal robustness enhancement
The authors introduce RobustVLA, a unified framework that enhances robustness against both input and output perturbations in VLA models. The method combines offline robust optimization against worst-case action noise for output robustness with consistency enforcement across semantically equivalent inputs for input robustness, using a multi-armed bandit formulation with UCB to balance multiple perturbation types.
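The UCB bandit component described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the arm names, the exploration constant `c`, and the reward signal (here, some robustness-gap statistic observed after training on a batch perturbed by the selected type) are all hypothetical.

```python
import math

class UCBPerturbationScheduler:
    """UCB1-style bandit over perturbation types.

    Each perturbation type is an arm; arms whose perturbed batches
    yield larger rewards (e.g. a larger consistency-loss gap) are
    sampled more often, balancing exploration and exploitation.
    """

    def __init__(self, perturbations, c=1.0):
        self.perturbations = list(perturbations)
        self.c = c  # exploration weight in the UCB bonus term
        self.counts = {p: 0 for p in self.perturbations}
        self.values = {p: 0.0 for p in self.perturbations}
        self.total = 0

    def select(self):
        # Play each arm once before applying the UCB rule.
        for p in self.perturbations:
            if self.counts[p] == 0:
                return p
        # UCB1: empirical mean + confidence bonus.
        return max(
            self.perturbations,
            key=lambda p: self.values[p]
            + self.c * math.sqrt(math.log(self.total) / self.counts[p]),
        )

    def update(self, perturbation, reward):
        # Incremental update of the arm's empirical mean reward.
        self.counts[perturbation] += 1
        self.total += 1
        n = self.counts[perturbation]
        self.values[perturbation] += (reward - self.values[perturbation]) / n
```

In use, the trainer would call `select()` to pick which perturbation to apply to the next batch and `update()` with the observed reward; over time the scheduler concentrates on the perturbation types that remain most informative.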
[1] Enhancing the robustness of vision-language foundation models by alignment perturbation PDF
[5] Model-agnostic adversarial attack and defense for vision-language-action models PDF
[9] Robustness Analysis of Video-Language Models Against Visual and Language Perturbations PDF
[10] Rationalvla: A rational vision-language-action model with dual system PDF
[13] VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation PDF
[21] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning PDF
[47] Recipe for Vision-Language-Action Models in Robotic Manipulation: A Survey PDF
[48] Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations PDF
[49] Enhance Vision-Language Alignment with Noise PDF
[50] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning PDF
Offline robust optimization against worst-case action noise in flow matching
The authors derive and optimize against worst-case action perturbations by maximizing the flow matching loss, which can be interpreted as adversarial training, label smoothing, and outlier penalization. This approach addresses the challenge of achieving action robustness in offline settings where interactive environments are unavailable.
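For the common squared-error form of the flow-matching objective, the worst-case norm-bounded perturbation of the target action admits a closed form, which the sketch below illustrates. The specific loss form, the L2-ball constraint, and all function names here are illustrative assumptions; the paper's actual derivation and parameterization may differ.

```python
import math

def worst_case_action_noise(v_pred, a, x0, eps):
    """Closed-form maximizer of a squared-error flow-matching loss
    over L2-bounded perturbations of the target action.

    With Loss(delta) = ||v_pred - ((a + delta) - x0)||^2 and
    r = v_pred - (a - x0), we get Loss(delta) = ||r - delta||^2,
    which on the ball ||delta|| <= eps is maximized by
    delta* = -eps * r / ||r|| (pushing the target away from the
    predicted velocity).
    """
    r = [vp - (ai - x0i) for vp, ai, x0i in zip(v_pred, a, x0)]
    norm = math.sqrt(sum(ri * ri for ri in r)) or 1.0  # guard r = 0
    return [-eps * ri / norm for ri in r]

def flow_matching_loss(v_pred, a, x0, delta):
    """Squared error between predicted velocity and perturbed target."""
    return sum(
        (vp - ((ai + di) - x0i)) ** 2
        for vp, ai, x0i, di in zip(v_pred, a, x0, delta)
    )
```

Training against this `delta*` instead of clean targets is one way to realize the adversarial-training interpretation noted above entirely offline, since the worst case is computed analytically from stored demonstrations rather than by interacting with an environment.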