Jailbreak Transferability Emerges from Shared Representations
Overview
Overall Novelty Assessment
The paper investigates why jailbreak attacks transfer across language models, proposing that shared representations rather than incidental flaws drive transferability. It resides in the 'Representation-Based Transferability Analysis' leaf, which contains only two papers total within the broader 'Jailbreak Transferability Mechanisms and Analysis' branch. This is a relatively sparse research direction compared to crowded attack-generation categories like 'Gradient-Based Suffix Generation Methods' or 'Automated Semantic Jailbreak Generation,' suggesting the paper addresses a less-explored theoretical question about transferability mechanisms rather than developing new attack techniques.
The taxonomy reveals that most jailbreak research focuses on attack methodologies—token-level optimization, semantic manipulation, multimodal techniques—rather than mechanistic explanations. The paper's branch sits adjacent to 'Cross-Language and Multilingual Transferability,' which examines transfer across linguistic boundaries, and is conceptually distinct from attack-focused branches like 'Adversarial Suffix and Token-Level Optimization Attacks' or 'Semantic and Prompt-Level Jailbreak Techniques.' The scope note clarifies this leaf excludes attack optimization methods and defense strategies, positioning the work as foundational analysis rather than applied technique development. Its emphasis on representation similarity and causal manipulation through distillation differentiates it from neighboring optimization-centric studies.
Among 22 candidates examined, none clearly refute the three main contributions. The large-scale empirical analysis (5 candidates examined, 0 refutable) and systematic attack-type characterization (7 candidates, 0 refutable) appear relatively novel within this limited search scope. The benign-only distillation protocol (10 candidates, 0 refutable) shows no substantial prior work among examined papers. These statistics suggest the contributions are distinct within the top-K semantic neighborhood, though the search scale is modest and does not guarantee exhaustive coverage of all relevant prior work in representation-based transferability analysis or causal intervention methods.
The work thus appears to occupy a relatively underexplored niche connecting representation-learning theory to jailbreak transferability. The sparse taxonomy leaf and the absence of refutable prior work among the 22 examined candidates point toward novelty, though the assessment is constrained by the limited candidate pool: a broader literature review might uncover related work in adversarial robustness or representation alignment that the semantic search strategy did not capture.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct a comprehensive empirical study across 20 open-weight models and 33 jailbreak attacks applied to 313 harmful prompts, identifying two systematic factors that predict jailbreak transferability: the strength of the jailbreak on the source model and the representational similarity between models measured under benign prompts.
The authors develop a distillation method that fine-tunes a student model exclusively on benign prompt-response pairs from a teacher model, deliberately increasing their representational similarity. This intervention causally increases jailbreak transferability from teacher to student, providing evidence that shared representations drive transfer rather than artifacts of safety training.
The authors show that persona-style jailbreaks, which use natural language and align with shared semantic representations, transfer far more reliably across models than cipher-based jailbreaks, which exploit model-specific idiosyncrasies. This finding supports the hypothesis that transferability emerges from shared representational geometry.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] Understanding and enhancing the transferability of jailbreaking attacks
Contribution Analysis
Detailed comparisons for each claimed contribution
Large-scale empirical analysis of jailbreak transferability factors
The authors conduct a comprehensive empirical study across 20 open-weight models and 33 jailbreak attacks applied to 313 harmful prompts, identifying two systematic factors that predict jailbreak transferability: the strength of the jailbreak on the source model and the representational similarity between models measured under benign prompts.
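The summary does not name the similarity metric used to compare models under benign prompts; linear centered kernel alignment (CKA) is one standard measure for comparing hidden representations across models, including models of different widths. A minimal NumPy sketch under that assumption (the function name and shapes are illustrative, not the authors' code):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X: (n_prompts, d1) hidden states from model A on a set of benign prompts.
    Y: (n_prompts, d2) hidden states from model B on the same prompts.
    Returns a similarity in [0, 1]; 1 means the representations agree up to
    an orthogonal transform and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is what makes it usable for comparing models whose hidden dimensions differ.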
[69] AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement-Introducing …
[70] CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
[71] Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
[72] SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
[73] One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs
Benign-only distillation protocol for causal manipulation of transferability
The authors develop a distillation method that fine-tunes a student model exclusively on benign prompt-response pairs from a teacher model, deliberately increasing their representational similarity. This intervention causally increases jailbreak transferability from teacher to student, providing evidence that shared representations drive transfer rather than artifacts of safety training.
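As a toy sketch of the benign-only distillation protocol (not the authors' implementation), the code below stands in for teacher and student language models with per-token next-token logit tables, collects the teacher's greedy responses on benign prompts only, and fine-tunes the student with cross-entropy on those pairs; every name and the toy model form are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size

# Toy stand-ins for language models: one row of next-token logits per prompt token.
teacher = rng.normal(size=(V, V))
student = rng.normal(size=(V, V))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Step 1: collect the teacher's greedy responses on benign prompts only.
benign_prompts = rng.integers(0, V, size=200)
teacher_responses = teacher[benign_prompts].argmax(axis=1)

initial_agreement = (student[benign_prompts].argmax(axis=1) == teacher_responses).mean()

# Step 2: fine-tune the student with cross-entropy on (prompt, teacher-response) pairs.
lr, n = 0.5, len(benign_prompts)
for _ in range(300):
    grad = softmax(student[benign_prompts])        # dL/dlogits = probs - onehot(target)
    grad[np.arange(n), teacher_responses] -= 1.0
    np.add.at(student, benign_prompts, -lr * grad / n)  # accumulate over repeated tokens

agreement = (student[benign_prompts].argmax(axis=1) == teacher_responses).mean()
```

After training, the student's greedy outputs on the benign prompts largely match the teacher's: the two "models" have been pulled closer together without any harmful data ever appearing in the fine-tuning set, which is the intervention the paper performs at full LLM scale.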
[51] Common knowledge learning for generating transferable adversarial examples
[52] Continuous transfer of neural network representational similarity for incremental learning
[53] Similarity of neural network models: A survey of functional and representational measures
[54] Similarity of neural architectures using adversarial attack transferability
[55] Data-free knowledge distillation via text-noise fusion and dynamic adversarial temperature
[56] Distillation-Based Cross-Model Transferable Adversarial Attack for Remote Sensing Image Classification
[57] Improving the transferability of adversarial examples with diverse gradients
[58] Guided adversarial contrastive distillation for robust students
[59] Distillation as a defense to adversarial perturbations against deep neural networks
[60] Distilling Adversarial Robustness Using Heterogeneous Teachers
Systematic characterization of attack-type differences in transferability
The authors show that persona-style jailbreaks, which use natural language and align with shared semantic representations, transfer far more reliably across models than cipher-based jailbreaks, which exploit model-specific idiosyncrasies. This finding supports the hypothesis that transferability emerges from shared representational geometry.