VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: adversarial attack, vision-encoder-only, large vision language models, downstream-agnostic
Abstract:

Large Vision-Language Models (LVLMs) have demonstrated strong capabilities in multimodal understanding, yet their vulnerability to adversarial attacks raises significant concerns. To achieve practical attacks, this paper aims at efficient and transferable untargeted attacks under limited perturbation sizes. Under this objective, white-box attacks require full-model gradients and task-specific labels, so their cost scales with the number of tasks, while black-box attacks rely on proxy models and typically require large perturbation sizes and elaborate transfer strategies. Given the centrality and widespread reuse of the vision encoder in LVLMs, we adopt a gray-box setting that targets the vision encoder alone for efficient yet effective attacks. We theoretically establish the feasibility of vision-encoder-only attacks, laying the foundation for our gray-box setting. Based on this analysis, we propose perturbing patch tokens rather than the class token, informed by both theoretical and empirical insights. We generate adversarial examples by minimizing the cosine similarity between clean and perturbed visual features, without accessing the downstream models, tasks, or labels. This significantly reduces computational overhead while eliminating task and label dependence. VEAttack achieves a performance degradation of 94.5% on image captioning and 75.7% on visual question answering. We also report key observations that offer insights into LVLM attack and defense: 1) hidden-layer variations in the LLM, 2) differential token attention, 3) a Möbius-band phenomenon in transfer attacks, and 4) low sensitivity to the number of attack steps.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a gray-box adversarial attack framework targeting the vision encoder of Large Vision-Language Models (LVLMs), aiming for efficient and transferable untargeted attacks under limited perturbation budgets. According to the taxonomy, this work resides in the 'Vision-Language Model Attacks' leaf under 'Adversarial Attack Methodologies and Frameworks'. Notably, this leaf contains only the original paper itself with no sibling papers, suggesting this is a relatively sparse or emerging research direction within the broader adversarial attack landscape. The taxonomy includes fifty papers across approximately thirty-six topics, indicating that vision-language model attacks represent a small but distinct niche.

The taxonomy reveals that the paper's immediate parent branch, 'Adversarial Attack Methodologies and Frameworks', also includes a sibling leaf on 'Robustness Problem Formulation and Analysis', which focuses on theoretical examinations of adversarial robustness definitions rather than attack implementations. Neighboring branches address 'Optimization and Learning Frameworks' and 'Task-Specific Methods and Applications', particularly 'Computer Vision and Multimodal Tasks'. The scope note for the paper's leaf explicitly excludes attacks on unimodal vision or language models alone, positioning this work at the intersection of multimodal architectures. This placement suggests the paper bridges adversarial attack research with the growing field of vision-language integration, diverging from purely vision-centric or language-centric attack strategies.

Among the three identified contributions, the analysis of the gray-box vision-encoder-only attack framework examined ten candidates and found one prior work that could potentially refute it, while the theoretical analysis and the four key observations were compared against four and ten candidates respectively, with no clear refutations. The literature search covered twenty-four candidates in total, yielding one refutable pair overall. This indicates that, among the limited set of semantically similar papers examined, the core attack framework may have some overlap with existing work, whereas the theoretical justification and empirical observations appear more distinctive. The modest search scale means these findings reflect top-K semantic matches rather than exhaustive coverage of all relevant adversarial attack literature.

Given the limited search scope of twenty-four candidates and the sparse taxonomy leaf with no sibling papers, the work appears to occupy a relatively novel position within vision-language model attacks specifically. However, the presence of one refutable candidate for the main framework contribution suggests that certain aspects may build incrementally on existing gray-box or encoder-targeted attack strategies. The analysis does not cover the full breadth of adversarial robustness research, particularly work published in venues outside the semantic search radius or recent preprints not yet indexed.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Paper: 1

Research Landscape Overview

Core task: The paper addresses an unspecified core task, but the taxonomy reveals a research landscape organized around six major branches. Adversarial Attack Methodologies and Frameworks explores techniques for crafting attacks against machine learning systems, with a notable sub-area focusing on Vision-Language Model Attacks where methods target multimodal architectures. Optimization and Learning Frameworks encompasses algorithmic approaches for training and tuning models, including multi-objective optimization strategies as seen in works like Multimodal multi-objective optimization[6] and Tackling the Objective Inconsistency[5]. Task-Specific Methods and Applications groups domain-tailored solutions, while Research Methodology and Problem Formulation addresses foundational questions about how studies are designed and framed, exemplified by works such as Pragmatism as a research[4] and Objectives of the Study[1]. Domain Applications and Empirical Studies captures applied investigations across diverse fields, and Unspecified Study Objectives and Metadata collects works with less clearly defined scopes.

Within this landscape, adversarial robustness research has become particularly active, especially at the intersection of vision and language modalities. VEAttack[0] situates itself squarely in the Vision-Language Model Attacks cluster, contributing to an emerging line of work that probes vulnerabilities in systems processing both visual and textual inputs. This contrasts with branches focused on optimization theory or general research methodology, which tend to emphasize algorithmic efficiency or study design principles rather than security concerns. Compared to foundational methodological works like Objectives of the Study[1] or Pragmatism as a research[4], VEAttack[0] adopts a more applied, attack-centric perspective, seeking to expose weaknesses in specific model architectures.
The positioning highlights ongoing tensions in the field between developing robust multimodal systems and understanding their failure modes, a theme that cuts across several branches and remains an open question as vision-language models grow in capability and deployment.

Claimed Contributions

Gray-box vision-encoder-only attack framework (VEAttack)

The authors introduce VEAttack, a gray-box attack method that targets only the vision encoder of LVLMs by perturbing patch tokens and minimizing cosine similarity between clean and perturbed visual features. This approach eliminates dependence on downstream tasks, labels, and LLM gradients while achieving efficient and transferable attacks.

10 retrieved papers
Can Refute
Theoretical analysis of vision-encoder-only attack feasibility

The authors provide theoretical analysis establishing a lower bound on perturbations in multimodal aligned features when attacking only the vision encoder. This theoretical foundation demonstrates that perturbations on patch tokens propagate more effectively to downstream LLMs than perturbations on class tokens.

4 retrieved papers
Four key observations about LVLM vulnerabilities

The authors identify and empirically demonstrate four novel observations about LVLM vulnerabilities: vision encoder attacks induce hidden-layer variations in the LLM; attention to image versus instruction tokens differs across tasks; encoder robustness and attack transferability exhibit a paradoxical (Möbius-band-like) relationship; and attack success shows reduced sensitivity to the number of attack iterations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Gray-box vision-encoder-only attack framework (VEAttack)

The authors introduce VEAttack, a gray-box attack method that targets only the vision encoder of LVLMs by perturbing patch tokens and minimizing cosine similarity between clean and perturbed visual features. This approach eliminates dependence on downstream tasks, labels, and LLM gradients while achieving efficient and transferable attacks.
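To make the objective concrete, the loop below sketches a PGD-style untargeted attack that minimizes cosine similarity between clean and perturbed encoder features under an L-infinity budget. Everything here is a toy stand-in rather than the authors' implementation: `encode` replaces the LVLM's frozen vision encoder, the budget `eps`, step size `alpha`, and step count are placeholder values, and gradients are estimated by finite differences purely so the sketch is self-contained.

```python
import numpy as np

def encode(x, W):
    """Toy stand-in for the frozen vision encoder's patch-token features
    (a linear map plus a nonlinearity); the real attack would use the
    LVLM's actual encoder, e.g. a CLIP ViT."""
    return np.tanh(x @ W)

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def veattack_sketch(x, W, eps=0.03, alpha=0.01, steps=10):
    """PGD-style loop: minimize cosine similarity between clean and
    perturbed features under an L-infinity budget `eps`. Gradients are
    estimated by finite differences only to keep the sketch
    self-contained; in practice one backpropagates through the encoder."""
    f_clean = encode(x, W)
    x_adv = x.copy()
    h = 1e-4
    for _ in range(steps):
        base = cosine_sim(encode(x_adv, W), f_clean)
        grad = np.zeros_like(x_adv)
        for i in range(x_adv.size):
            probe = x_adv.copy()
            probe.flat[i] += h
            grad.flat[i] = (cosine_sim(encode(probe, W), f_clean) - base) / h
        x_adv = x_adv - alpha * np.sign(grad)      # descend on similarity
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the L-inf ball
    return x_adv
```

On random inputs this drives the feature similarity below 1 while keeping every coordinate within `eps` of the clean input; a real implementation would additionally clip to the valid pixel range and restrict the loss to patch-token features.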

Contribution

Theoretical analysis of vision-encoder-only attack feasibility

The authors provide theoretical analysis establishing a lower bound on perturbations in multimodal aligned features when attacking only the vision encoder. This theoretical foundation demonstrates that perturbations on patch tokens propagate more effectively to downstream LLMs than perturbations on class tokens.
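To convey the flavor of such a propagation bound (an illustrative sketch under a simplifying assumption, not the paper's theorem): if the projector mapping visual tokens into the LLM embedding space were a full-column-rank linear map $W_p$, then for clean and perturbed features $v$ and $v'$,

```latex
\|W_p v' - W_p v\|_2 \;\ge\; \sigma_{\min}(W_p)\,\|v' - v\|_2 ,
```

where $\sigma_{\min}(W_p)$ is the smallest singular value of $W_p$. In words, a feature-space perturbation of size $\|v'-v\|_2$ at the encoder output reaches the aligned (LLM-input) features with magnitude at least $\sigma_{\min}(W_p)\,\|v'-v\|_2$, so it cannot be silently absorbed by the projection.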

Contribution

Four key observations about LVLM vulnerabilities

The authors identify and empirically demonstrate four novel observations about LVLM vulnerabilities: vision encoder attacks induce hidden-layer variations in the LLM; attention to image versus instruction tokens differs across tasks; encoder robustness and attack transferability exhibit a paradoxical (Möbius-band-like) relationship; and attack success shows reduced sensitivity to the number of attack iterations.