Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Vision-language model, Jailbreak, Transferability
Abstract:

Vision–language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes UltraBreak, a framework for crafting universal adversarial perturbations that transfer across vision-language models and jailbreak objectives. It resides in the Gradient-Based White-Box Attacks leaf, which contains five papers including the original work. This leaf sits within Attack Methodology and Optimization, a moderately populated branch covering gradient-based, black-box, universal perturbation, and cross-modal strategies. The taxonomy reveals a crowded research area with substantial prior work on gradient-driven optimization for adversarial attacks, suggesting the paper enters a well-explored domain.

The taxonomy tree shows neighboring leaves addressing Black-Box and Transfer-Based Attacks (nine papers), Universal Adversarial Perturbations (four papers), and Cross-Modal and Multimodal Perturbation Strategies (seven papers). UltraBreak bridges gradient-based optimization with transferability concerns, connecting to both the white-box methodology branch and the broader Transferability and Generalization category. The scope note for Gradient-Based White-Box Attacks explicitly includes optimization on known model parameters, while excluding black-box methods without gradient access. This positioning suggests the work straddles methodological boundaries, leveraging white-box surrogates to achieve black-box transferability.

Among twenty-five candidates examined, the analysis found two refutable pairs across three contributions. The core UltraBreak framework and semantic adversarial target mechanism each examined ten candidates with zero refutations, indicating limited direct overlap within this search scope. However, vision-space constraints via transformations and regularization examined five candidates and identified two refutable instances, suggesting more substantial prior work in this specific technical component. The limited search scale means these statistics reflect top-K semantic matches rather than exhaustive coverage, leaving open the possibility of additional relevant work beyond the examined set.

Based on the limited search scope of twenty-five candidates, the framework-level contributions appear less directly anticipated, while the vision-space regularization techniques show clearer precedent. The taxonomy context reveals a densely populated research area with multiple overlapping methodological branches, suggesting incremental refinement rather than paradigm shift. Acknowledging the bounded search, the analysis captures immediate semantic neighbors but cannot rule out relevant work in adjacent taxonomy leaves or outside the top-K retrieval window.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 2

Research Landscape Overview

Core task: universal and transferable jailbreak attacks on vision-language models. The field structure reflects a multifaceted effort to understand and exploit vulnerabilities in multimodal systems. Attack Methodology and Optimization encompasses gradient-based white-box techniques (e.g., Universal Jailbreak VLMs[0], White-box Multimodal Jailbreaks[16]) alongside optimization strategies that craft adversarial inputs by leveraging model internals. Jailbreak Attack Techniques explores diverse manipulation strategies, from prompt-based methods (Adversarial Prompt Tuning[8]) to visual perturbations (Visual Adversarial Jailbreak[1]) and cross-modal obfuscation (Cross-Modal Obfuscation[3]). Transferability and Generalization investigates how attacks generalize across models and prompts (Transferable Black-Box Attacks[5], Model-Prompt Transferable Attackers[10]), while Robustness Evaluation and Benchmarking provides systematic assessments (JailbreakV Benchmark[15], Evaluating Adversarial Robustness[2]). Defense and Safety Mechanisms addresses mitigation strategies (UniGuard[9], SafeMLRM[44]), Domain-Specific and Application-Oriented Attacks targets specialized contexts like medical imaging (Medical VLP Vulnerable[21]) or autonomous driving (Autonomous Driving Attack[36]), and Adversarial Robustness of VLP Models examines fundamental resilience properties.

A particularly active line of work centers on gradient-based white-box attacks that optimize universal adversarial perturbations, balancing effectiveness against computational cost and detectability. Universal Jailbreak VLMs[0] sits squarely within this branch, emphasizing transferability across diverse vision-language architectures through gradient-driven optimization. Nearby works like White-box Multimodal Jailbreaks[16] and JailBound[35] share this white-box orientation but may differ in their treatment of cross-modal alignment or constraint formulations.
In contrast, Visual Adversarial Jailbreak[1] and Multi-Loss Adversarial Search[38] explore alternative optimization objectives or search strategies, highlighting trade-offs between attack universality, stealthiness, and the need for model access. Open questions persist around the interplay between transferability and imperceptibility, the role of vision-language alignment mechanisms in vulnerability, and the extent to which defenses can generalize across the spectrum of attack methodologies without sacrificing model utility.

Claimed Contributions

UltraBreak framework for universal and transferable jailbreak attacks

The authors introduce UltraBreak, a novel optimisation-based jailbreak framework that achieves both cross-target universality and cross-model transferability against vision-language models. The framework combines vision-level regularisation with semantically guided textual supervision to mitigate surrogate overfitting and enable strong transferability.

10 retrieved papers
Semantic adversarial target with attention mechanism

The authors propose a semantic-based loss function that operates in the textual embedding space rather than forcing exact token sequences. This approach includes an attention mechanism that dynamically assigns weights to target tokens, smoothing the loss landscape and improving optimisation stability compared to traditional cross-entropy loss.

10 retrieved papers
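The semantic objective described in this contribution can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the softmax attention that up-weights currently mismatched target tokens is an assumed weighting scheme, and the names `semantic_embedding_loss`, `pred_emb`, and `target_emb` are hypothetical.

```python
import numpy as np

def semantic_embedding_loss(pred_emb, target_emb):
    """Semantic loss in the textual embedding space.

    Instead of forcing exact target tokens with cross-entropy, measure
    cosine distance between predicted and target token embeddings, and
    weight each target token by a softmax attention score derived from
    how poorly it is currently matched (an assumed weighting scheme;
    the paper's exact attention mechanism may differ).

    pred_emb, target_emb: (seq_len, dim) arrays of token embeddings.
    """
    # Cosine similarity per aligned token position.
    pn = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    tn = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = np.sum(pn * tn, axis=1)            # shape: (seq_len,)

    # Attention: emphasise tokens that are currently mismatched.
    logits = -sim
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over target tokens

    # Weighted cosine distance; smooth in the embedding space.
    return float(np.sum(weights * (1.0 - sim)))

rng = np.random.default_rng(0)
pred = rng.normal(size=(8, 16))
loss_random = semantic_embedding_loss(pred, rng.normal(size=(8, 16)))
loss_matched = semantic_embedding_loss(pred, pred)  # identical embeddings
print(loss_matched < loss_random)
```

Because the loss is a smooth function of embedding similarity rather than a hard token match, small parameter updates change it gradually, which is consistent with the report's claim that the semantic objective smooths the loss landscape.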
Vision-space constraints via transformations and regularisation

The authors introduce constraints on the image optimisation space through random transformations (rotation, scaling, translation) and total variation regularisation. These constraints guide the optimiser toward discovering robust, model-agnostic jailbreak patterns that transfer effectively across different vision-language models.

5 retrieved papers (2 refutable)
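The vision-space constraints described in this contribution can be illustrated with a small sketch. This is a dependency-light NumPy stand-in, not the paper's pipeline: random translation via `np.roll` substitutes for the full rotation/scaling/translation augmentations (which would typically use a library such as torchvision or kornia), and the total variation term is the standard anisotropic formulation.

```python
import numpy as np

def random_translate(img, rng, max_shift=4):
    """Random translation as a stand-in for the rotation/scaling/
    translation transforms applied during optimisation (np.roll keeps
    this sketch dependency-free)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, shift=(int(dy), int(dx)), axis=(0, 1))

def total_variation(img):
    """Anisotropic total variation: penalises high-frequency noise so
    the optimiser favours smooth patterns that survive resizing and
    re-encoding across different vision encoders."""
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical differences
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal differences
    return float(dv + dh)

rng = np.random.default_rng(0)
noise = rng.normal(size=(32, 32))                       # noisy pattern
smooth = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))    # smooth gradient
print(total_variation(smooth) < total_variation(noise))
```

In an attack loop, the transform would be applied to the perturbed image before each forward pass, and the total variation term would be added to the loss with a small weight, steering the optimiser away from encoder-specific high-frequency artefacts.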

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: UltraBreak framework for universal and transferable jailbreak attacks (10 candidate papers compared, none refutable).

Contribution 2: Semantic adversarial target with attention mechanism (10 candidate papers compared, none refutable).

Contribution 3: Vision-space constraints via transformations and regularisation (5 candidate papers compared, 2 refutable).