VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models
Overview
Overall Novelty Assessment
The paper proposes a gray-box adversarial attack framework targeting the vision encoder of Large Vision-Language Models (LVLMs), aiming for efficient and transferable untargeted attacks under limited perturbation budgets. According to the taxonomy, this work resides in the 'Vision-Language Model Attacks' leaf under 'Adversarial Attack Methodologies and Frameworks'. Notably, this leaf contains only the original paper itself with no sibling papers, suggesting this is a relatively sparse or emerging research direction within the broader adversarial attack landscape. The taxonomy includes fifty papers across approximately thirty-six topics, indicating that vision-language model attacks represent a small but distinct niche.
The taxonomy reveals that the paper's immediate parent branch, 'Adversarial Attack Methodologies and Frameworks', also includes a sibling leaf on 'Robustness Problem Formulation and Analysis', which focuses on theoretical examinations of adversarial robustness definitions rather than attack implementations. Neighboring branches address 'Optimization and Learning Frameworks' and 'Task-Specific Methods and Applications', particularly 'Computer Vision and Multimodal Tasks'. The scope note for the paper's leaf explicitly excludes attacks on unimodal vision or language models alone, positioning this work at the intersection of multimodal architectures. This placement suggests the paper bridges adversarial attack research with the growing field of vision-language integration, diverging from purely vision-centric or language-centric attack strategies.
Among the three identified contributions, the analysis examined ten candidate papers for the gray-box vision-encoder-only attack framework and flagged one as potentially refuting its novelty, while the theoretical analysis and the four key observations were checked against four and ten candidates respectively, with no clear refutations. The literature search covered twenty-four candidates in total, yielding one potentially refuting match overall. This indicates that, among the limited set of semantically similar papers examined, the core attack framework may overlap with existing work, whereas the theoretical justification and the empirical observations appear more distinctive. The modest search scale means these findings reflect top-K semantic matches rather than exhaustive coverage of the adversarial attack literature.
Given the limited search scope of twenty-four candidates and the sparse taxonomy leaf with no sibling papers, the work appears to occupy a relatively novel position within vision-language model attacks specifically. However, the one potentially refuting candidate for the main framework contribution suggests that certain aspects may build incrementally on existing gray-box or encoder-targeted attack strategies. The analysis does not cover the full breadth of adversarial robustness research, particularly work published in venues outside the semantic search radius or recent preprints not yet indexed.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce VEAttack, a gray-box attack method that targets only the vision encoder of LVLMs by perturbing patch tokens and minimizing cosine similarity between clean and perturbed visual features. This approach eliminates dependence on downstream tasks, labels, and LLM gradients while achieving efficient and transferable attacks.
The authors provide theoretical analysis establishing a lower bound on perturbations in multimodal aligned features when attacking only the vision encoder. This theoretical foundation demonstrates that perturbations on patch tokens propagate more effectively to downstream LLMs than perturbations on class tokens.
The authors identify and empirically demonstrate four novel observations about LVLM vulnerabilities: vision encoder attacks induce measurable variations in the LLM's hidden layers; the attention allocated to image versus instruction tokens differs across tasks; encoder robustness bears a paradoxical relationship to attack transferability; and attack effectiveness is comparatively insensitive to the number of attack iterations.
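To make the first contribution concrete, the objective described above (perturb the input so that the vision encoder's patch-token features diverge from their clean values, measured by cosine similarity, without any labels, task head, or LLM gradients) can be sketched as an L-infinity-bounded PGD loop. The sketch below is illustrative only: the encoder is a toy stand-in (a fixed linear map with per-token L2 normalization, not the paper's CLIP encoder), and names such as `veattack_pgd` and the budget values are assumptions, not the authors' implementation.

```python
import numpy as np

def encode_patches(x, W):
    """Toy stand-in for a vision encoder: a fixed linear projection
    followed by per-token L2 normalization of the patch features."""
    feats = x @ W
    return feats / np.linalg.norm(feats, axis=-1, keepdims=True)

def veattack_pgd(x, W, eps=8 / 255, alpha=2 / 255, steps=10, seed=1):
    """Hypothetical sketch of the gray-box objective: L-infinity-bounded
    PGD that minimizes the mean cosine similarity between clean and
    perturbed patch-token features. No labels, no downstream task, and
    no LLM gradients are used -- only the (toy) encoder."""
    clean = encode_patches(x, W)  # clean patch-token features
    rng = np.random.default_rng(seed)
    # random start inside the eps-ball: at the clean point the cosine
    # similarity is exactly 1 and its gradient vanishes
    x_adv = np.clip(x + rng.uniform(-eps, eps, x.shape), 0.0, 1.0)
    for _ in range(steps):
        feats = x_adv @ W
        norms = np.linalg.norm(feats, axis=-1, keepdims=True)
        f_hat = feats / norms
        cos = np.sum(f_hat * clean, axis=-1, keepdims=True)
        # analytic gradient of cos(f_hat, clean) w.r.t. feats,
        # then the chain rule back through the linear encoder
        grad_feats = (clean - cos * f_hat) / norms
        grad_x = grad_feats @ W.T
        # descend on the similarity, then project back into the
        # eps-ball around x and the valid pixel range
        x_adv = x_adv - alpha * np.sign(grad_x)
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=(16, 48))   # 16 patch tokens, 48-dim each
W = rng.standard_normal((48, 32))          # toy encoder weights
x_adv = veattack_pgd(x, W)
sim = float(np.mean(np.sum(encode_patches(x, W) * encode_patches(x_adv, W),
                           axis=-1)))
print(f"mean cosine similarity after attack: {sim:.3f}")
```

Because the loss is defined purely on encoder features, the same perturbation can be handed to any downstream LVLM built on that encoder, which is the downstream-agnostic property the paper claims.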
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Gray-box vision-encoder-only attack framework (VEAttack)
The authors introduce VEAttack, a gray-box attack method that targets only the vision encoder of LVLMs by perturbing patch tokens and minimizing cosine similarity between clean and perturbed visual features. This approach eliminates dependence on downstream tasks, labels, and LLM gradients while achieving efficient and transferable attacks.
[53] Break the visual perception: Adversarial attacks targeting encoded visual tokens of large vision-language models
[51] Sample-agnostic adversarial perturbation for vision-language pre-training models
[52] When alignment fails: Multimodal adversarial attacks on vision-language-action models
[54] Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
[55] Attacking multimodal os agents with malicious image patches
[56] Physpatch: A physically realizable and transferable adversarial patch attack for multimodal large language models-based autonomous driving systems
[57] Transfer Attack for Bad and Good: Explain and Boost Adversarial Transferability across Multimodal Large Language Models
[58] As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?
[59] Towards Adversarial Robust Learning On Multimodal Neural Networks
[60] Attacks on multimodal models
Theoretical analysis of vision-encoder-only attack feasibility
The authors provide theoretical analysis establishing a lower bound on perturbations in multimodal aligned features when attacking only the vision encoder. This theoretical foundation demonstrates that perturbations on patch tokens propagate more effectively to downstream LLMs than perturbations on class tokens.
[61] Transferable Multimodal Attack on Vision-Language Pre-training Models
[62] Towards Adversarial Attack on Vision-Language Pre-training Models
[63] One perturbation is enough: On generating universal adversarial perturbations against vision-language pre-training models
[64] Exploring visual vulnerabilities via multi-loss adversarial search for jailbreaking vision-language models
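To give a flavor of how a lower bound of the kind claimed in the theoretical contribution can arise, here is a generic sketch under an assumed lower-Lipschitz condition on the alignment map. This is an illustrative argument with assumed notation ($f$, $g$, $c$, $\varepsilon$), not the paper's actual theorem.

```latex
% Sketch (not the paper's theorem): let $f$ be the vision encoder and
% $g$ the projector mapping visual features into the LLM's aligned
% embedding space. Assume $g$ is lower-Lipschitz with constant $c > 0$
% on the feature region of interest:
\[
\|g(u) - g(v)\| \;\ge\; c\,\|u - v\|.
\]
% Then any adversarial input $x + \delta$ that shifts the patch-token
% features by at least $\varepsilon$, i.e.
% $\|f(x+\delta) - f(x)\| \ge \varepsilon$, forces a proportional
% perturbation in the aligned features seen by the LLM:
\[
\|g(f(x+\delta)) - g(f(x))\|
\;\ge\; c\,\|f(x+\delta) - f(x)\|
\;\ge\; c\,\varepsilon .
\]
```

Under such an assumption, guaranteeing a feature-space perturbation at the encoder is enough to guarantee a perturbation downstream, which is the intuition behind attacking the vision encoder alone.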
Four key observations about LVLM vulnerabilities
The authors identify and empirically demonstrate four novel observations about LVLM vulnerabilities: vision encoder attacks induce measurable variations in the LLM's hidden layers; the attention allocated to image versus instruction tokens differs across tasks; encoder robustness bears a paradoxical relationship to attack transferability; and attack effectiveness is comparatively insensitive to the number of attack iterations.