Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Vision-language model, Jailbreak, Transferability
Abstract:

Vision–language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes UltraBreak, a framework for crafting universal adversarial perturbations that transfer across vision-language models and jailbreak objectives. It resides in the Gradient-Based White-Box Attacks leaf, which contains five papers including the original work. This leaf sits within Attack Methodology and Optimization, a moderately populated branch covering gradient-based, black-box, universal perturbation, and cross-modal strategies. The taxonomy reveals a crowded research area with substantial prior work on gradient-driven optimization for adversarial attacks, suggesting the paper enters a well-explored domain.

The taxonomy tree shows neighboring leaves addressing Black-Box and Transfer-Based Attacks (nine papers), Universal Adversarial Perturbations (four papers), and Cross-Modal and Multimodal Perturbation Strategies (seven papers). UltraBreak bridges gradient-based optimization with transferability concerns, connecting to both the white-box methodology branch and the broader Transferability and Generalization category. The scope note for Gradient-Based White-Box Attacks explicitly includes optimization on known model parameters, while excluding black-box methods without gradient access. This positioning suggests the work straddles methodological boundaries, leveraging white-box surrogates to achieve black-box transferability.

Among twenty-five candidates examined, the analysis found two refutable pairs across three contributions. The core UltraBreak framework and semantic adversarial target mechanism each examined ten candidates with zero refutations, indicating limited direct overlap within this search scope. However, vision-space constraints via transformations and regularization examined five candidates and identified two refutable instances, suggesting more substantial prior work in this specific technical component. The limited search scale means these statistics reflect top-K semantic matches rather than exhaustive coverage, leaving open the possibility of additional relevant work beyond the examined set.

Based on the limited search scope of twenty-five candidates, the framework-level contributions appear less directly anticipated, while the vision-space regularization techniques show clearer precedent. The taxonomy context reveals a densely populated research area with multiple overlapping methodological branches, suggesting incremental refinement rather than paradigm shift. Acknowledging the bounded search, the analysis captures immediate semantic neighbors but cannot rule out relevant work in adjacent taxonomy leaves or outside the top-K retrieval window.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 2

Research Landscape Overview

Core task: universal and transferable jailbreak attacks on vision-language models. The field structure reflects a multifaceted effort to understand and exploit vulnerabilities in multimodal systems. Attack Methodology and Optimization encompasses gradient-based white-box techniques (e.g., Universal Jailbreak VLMs[0], White-box Multimodal Jailbreaks[16]) alongside optimization strategies that craft adversarial inputs by leveraging model internals. Jailbreak Attack Techniques explores diverse manipulation strategies, from prompt-based methods (Adversarial Prompt Tuning[8]) to visual perturbations (Visual Adversarial Jailbreak[1]) and cross-modal obfuscation (Cross-Modal Obfuscation[3]). Transferability and Generalization investigates how attacks generalize across models and prompts (Transferable Black-Box Attacks[5], Model-Prompt Transferable Attackers[10]), while Robustness Evaluation and Benchmarking provides systematic assessments (JailbreakV Benchmark[15], Evaluating Adversarial Robustness[2]). Defense and Safety Mechanisms addresses mitigation strategies (UniGuard[9], SafeMLRM[44]), Domain-Specific and Application-Oriented Attacks targets specialized contexts like medical imaging (Medical VLP Vulnerable[21]) or autonomous driving (Autonomous Driving Attack[36]), and Adversarial Robustness of VLP Models examines fundamental resilience properties.

A particularly active line of work centers on gradient-based white-box attacks that optimize universal adversarial perturbations, balancing effectiveness against computational cost and detectability. Universal Jailbreak VLMs[0] sits squarely within this branch, emphasizing transferability across diverse vision-language architectures through gradient-driven optimization. Nearby works like White-box Multimodal Jailbreaks[16] and JailBound[35] share this white-box orientation but may differ in their treatment of cross-modal alignment or constraint formulations.
In contrast, Visual Adversarial Jailbreak[1] and Multi-Loss Adversarial Search[38] explore alternative optimization objectives or search strategies, highlighting trade-offs between attack universality, stealthiness, and the need for model access. Open questions persist around the interplay between transferability and imperceptibility, the role of vision-language alignment mechanisms in vulnerability, and the extent to which defenses can generalize across the spectrum of attack methodologies without sacrificing model utility.

Claimed Contributions

UltraBreak framework for universal and transferable jailbreak attacks

The authors introduce UltraBreak, a novel optimisation-based jailbreak framework that achieves both cross-target universality and cross-model transferability against vision-language models. The framework combines vision-level regularisation with semantically guided textual supervision to mitigate surrogate overfitting and enable strong transferability.

10 retrieved papers
Semantic adversarial target with attention mechanism

The authors propose a semantic-based loss function that operates in the textual embedding space rather than forcing exact token sequences. This approach includes an attention mechanism that dynamically assigns weights to target tokens, smoothing the loss landscape and improving optimisation stability compared to traditional cross-entropy loss.

10 retrieved papers
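The semantic objective described in this contribution can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the softmax attention that up-weights currently mismatched target tokens is an assumed weighting scheme, and the names `semantic_embedding_loss`, `pred_emb`, and `target_emb` are hypothetical.

```python
import numpy as np

def semantic_embedding_loss(pred_emb, target_emb):
    """Semantic loss in the textual embedding space.

    Instead of forcing exact target tokens with cross-entropy, measure
    cosine distance between predicted and target token embeddings, and
    weight each target token by a softmax attention score derived from
    how poorly it is currently matched (an assumed weighting scheme;
    the paper's exact attention mechanism may differ).

    pred_emb, target_emb: (seq_len, dim) arrays of token embeddings.
    """
    # Cosine similarity per aligned token position.
    pn = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    tn = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = np.sum(pn * tn, axis=1)            # shape: (seq_len,)

    # Attention: emphasise tokens that are currently mismatched.
    logits = -sim
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over target tokens

    # Weighted cosine distance; smooth in the embedding space.
    return float(np.sum(weights * (1.0 - sim)))

rng = np.random.default_rng(0)
pred = rng.normal(size=(8, 16))
loss_random = semantic_embedding_loss(pred, rng.normal(size=(8, 16)))
loss_matched = semantic_embedding_loss(pred, pred)  # identical embeddings
print(loss_matched < loss_random)
```

Because the loss is a smooth function of embedding similarity rather than a hard token match, small parameter updates change it gradually, which is consistent with the report's claim that the semantic objective smooths the loss landscape.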
Vision-space constraints via transformations and regularisation

The authors introduce constraints on the image optimisation space through random transformations (rotation, scaling, translation) and total variation regularisation. These constraints guide the optimiser toward discovering robust, model-agnostic jailbreak patterns that transfer effectively across different vision-language models.

5 retrieved papers (2 refutable)
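The vision-space constraints described in this contribution can be illustrated with a small sketch. This is a dependency-light NumPy stand-in, not the paper's pipeline: random translation via `np.roll` substitutes for the full rotation/scaling/translation augmentations (which would typically use a library such as torchvision or kornia), and the total variation term is the standard anisotropic formulation.

```python
import numpy as np

def random_translate(img, rng, max_shift=4):
    """Random translation as a stand-in for the rotation/scaling/
    translation transforms applied during optimisation (np.roll keeps
    this sketch dependency-free)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, shift=(int(dy), int(dx)), axis=(0, 1))

def total_variation(img):
    """Anisotropic total variation: penalises high-frequency noise so
    the optimiser favours smooth patterns that survive resizing and
    re-encoding across different vision encoders."""
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical differences
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal differences
    return float(dv + dh)

rng = np.random.default_rng(0)
noise = rng.normal(size=(32, 32))                       # noisy pattern
smooth = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))    # smooth gradient
print(total_variation(smooth) < total_variation(noise))
```

In an attack loop, the transform would be applied to the perturbed image before each forward pass, and the total variation term would be added to the loss with a small weight, steering the optimiser away from encoder-specific high-frequency artefacts.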

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: UltraBreak framework for universal and transferable jailbreak attacks (10 candidate papers compared, none refutable).

Contribution 2: Semantic adversarial target with attention mechanism (10 candidate papers compared, none refutable).

Contribution 3: Vision-space constraints via transformations and regularisation (5 candidate papers compared, 2 refutable).