Jailbreak Transferability Emerges from Shared Representations

ICLR 2026 Conference Submission
Anonymous Authors
AI Safety and Security, Adversarial Inputs, Jailbreaking
Abstract:

Jailbreak transferability is the surprising phenomenon in which an adversarial attack that compromises one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign-only distillation causally increases transfer. Qualitative analysis reveals systematic patterns; for example, persona-style jailbreaks transfer far more often than cipher-based prompts, consistent with the idea that natural-language attacks exploit models’ shared representation space, whereas cipher-based attacks rely on idiosyncratic quirks that do not generalize. Together, these results reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates why jailbreak attacks transfer across language models, proposing that shared representations rather than incidental flaws drive transferability. It resides in the 'Representation-Based Transferability Analysis' leaf, which contains only two papers total within the broader 'Jailbreak Transferability Mechanisms and Analysis' branch. This is a relatively sparse research direction compared to crowded attack-generation categories like 'Gradient-Based Suffix Generation Methods' or 'Automated Semantic Jailbreak Generation,' suggesting the paper addresses a less-explored theoretical question about transferability mechanisms rather than developing new attack techniques.

The taxonomy reveals that most jailbreak research focuses on attack methodologies—token-level optimization, semantic manipulation, multimodal techniques—rather than mechanistic explanations. The paper's branch sits adjacent to 'Cross-Language and Multilingual Transferability,' which examines transfer across linguistic boundaries, and is conceptually distinct from attack-focused branches like 'Adversarial Suffix and Token-Level Optimization Attacks' or 'Semantic and Prompt-Level Jailbreak Techniques.' The scope note clarifies this leaf excludes attack optimization methods and defense strategies, positioning the work as foundational analysis rather than applied technique development. Its emphasis on representation similarity and causal manipulation through distillation differentiates it from neighboring optimization-centric studies.

Among 22 candidates examined, none clearly refute the three main contributions. The large-scale empirical analysis (5 candidates examined, 0 refutable) and systematic attack-type characterization (7 candidates, 0 refutable) appear relatively novel within this limited search scope. The benign-only distillation protocol (10 candidates, 0 refutable) shows no substantial prior work among examined papers. These statistics suggest the contributions are distinct within the top-K semantic neighborhood, though the search scale is modest and does not guarantee exhaustive coverage of all relevant prior work in representation-based transferability analysis or causal intervention methods.

Based on 22 examined candidates from semantic search, the work appears to occupy a relatively underexplored niche connecting representation learning theory to jailbreak transferability. The sparse taxonomy leaf and absence of refutable prior work within the search scope suggest novelty, though this assessment is constrained by the limited candidate pool. A broader literature review might uncover related work in adversarial robustness or representation alignment that was not captured by the semantic search strategy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: jailbreak transferability across language models. The field examines how adversarial prompts that successfully bypass safety mechanisms in one model can be reused or adapted to compromise others. The taxonomy reflects a rich landscape organized around attack methodologies, transferability mechanisms, defenses, and specialized contexts. Major branches include token-level optimization techniques (e.g., Universal Transferable Adversarial Attacks[1], Autodan Stealthy Jailbreak[2]) that craft adversarial suffixes, semantic and prompt-level methods that manipulate meaning rather than tokens, and multimodal attacks targeting vision-language models (Multimodal Jailbreaking Attack[3], Universal Image Jailbreaks Transfer[6]). A dedicated branch on transferability mechanisms explores why attacks generalize, while defense-focused work seeks robustness enhancements. Hybrid methodologies and contextual factors (cross-language settings, system messages, agent environments) round out the taxonomy, illustrating that jailbreak research spans diverse threat models and application domains.

Recent work highlights tensions between attack stealthiness, transferability, and computational cost. Some studies pursue query-efficient strategies (Jailbreaking Twenty Queries[5]) or ensemble-based transfer (Simulated Ensemble Attack[34]), while others investigate representation-level explanations for why certain prompts generalize. Jailbreak Transferability Shared Representations[0] sits squarely within the transferability mechanisms branch, focusing on representation-based analysis to understand cross-model generalization. Its emphasis on shared internal structures contrasts with neighboring efforts like Enhancing Jailbreak Transferability[7], which may prioritize algorithmic improvements to boost transfer rates, or Boosting Jailbreak Transferability[9], which explores optimization refinements. By probing the representational underpinnings of transferability, this work complements attack-centric studies and informs both offensive research and the design of defenses that account for common vulnerabilities across model families.

Claimed Contributions

Large-scale empirical analysis of jailbreak transferability factors

The authors conduct a comprehensive empirical study across 20 open-weight models and 33 jailbreak attacks applied to 313 harmful prompts, identifying two systematic factors that predict jailbreak transferability: the strength of the jailbreak on the source model and the representational similarity between models measured under benign prompts.

5 retrieved papers
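
This report does not state which similarity metric the paper uses. As one illustration of how representational similarity under benign prompts could be quantified, the sketch below computes linear centered kernel alignment (CKA) between two sets of activations collected on the same prompts. CKA is an assumption chosen for illustration, not necessarily the authors' measure, and all function names and toy activations are hypothetical.

```python
import math

def gram(acts):
    """Gram matrix of pairwise dot products between per-prompt activations."""
    return [[sum(a * b for a, b in zip(u, v)) for v in acts] for u in acts]

def center(k):
    """Double-center a square matrix: subtract row and column means, add the grand mean."""
    n = len(k)
    row = [sum(r) / n for r in k]
    col = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[k[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def frob_inner(a, b):
    """Frobenius inner product of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def linear_cka(acts_a, acts_b):
    """Linear CKA between two activation sets over the same prompts.

    Each argument is a list with one activation vector per prompt. The two
    models may have different hidden sizes, because only the prompt-by-prompt
    Gram matrices are compared. Undefined if either set has zero variance.
    """
    ka, kb = center(gram(acts_a)), center(gram(acts_b))
    return frob_inner(ka, kb) / math.sqrt(frob_inner(ka, ka) * frob_inner(kb, kb))

# Toy activations for four benign prompts (made-up numbers, not paper data).
model_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
# model_b is model_a rotated, embedded in 3 dimensions, and scaled by 3.
model_b = [[0.0, 0.0, -3.0], [3.0, 0.0, 0.0], [3.0, 0.0, -3.0], [-3.0, 0.0, -6.0]]
print(linear_cka(model_a, model_b))
```

Because linear CKA is invariant to rotations and uniform scaling of the activation space, a score near 1 indicates that two models arrange the same prompts in the same relative geometry, even when their hidden dimensions differ.
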
Benign-only distillation protocol for causal manipulation of transferability

The authors develop a distillation method that fine-tunes a student model exclusively on benign prompt-response pairs from a teacher model, deliberately increasing their representational similarity. This intervention causally increases jailbreak transferability from teacher to student, providing evidence that shared representations drive transfer rather than artifacts of safety training.

10 retrieved papers
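
The data-collection step of a benign-only distillation setup can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' protocol: the `is_benign` filter, the `teacher_generate` callable, and the record layout are all hypothetical, and the subsequent supervised fine-tuning of the student is not shown.

```python
from typing import Callable, Dict, Iterable, List

def build_benign_distillation_set(
    prompts: Iterable[str],
    is_benign: Callable[[str], bool],
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect (prompt, teacher response) pairs, skipping non-benign prompts."""
    pairs = []
    for prompt in prompts:
        if not is_benign(prompt):
            # Harmful or jailbreak prompts never enter the student's training data.
            continue
        pairs.append({"prompt": prompt, "response": teacher_generate(prompt)})
    return pairs

# Stand-ins for a real safety filter and a real teacher model (hypothetical).
def toy_is_benign(prompt: str) -> bool:
    return not prompt.startswith("UNSAFE")

def toy_teacher(prompt: str) -> str:
    return "teacher answer to: " + prompt

dataset = build_benign_distillation_set(
    ["How do I bake bread?", "UNSAFE: placeholder harmful request", "Explain photosynthesis."],
    toy_is_benign,
    toy_teacher,
)
print(len(dataset))
```

The point of restricting the training set this way is that any later increase in teacher-to-student jailbreak transfer cannot be attributed to the student having seen harmful or adversarial examples during distillation.
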
Systematic characterization of attack-type differences in transferability

The authors show that persona-style jailbreaks, which use natural language and align with shared semantic representations, transfer far more reliably across models than cipher-based jailbreaks, which exploit idiosyncratic model-specific quirks. This finding supports the hypothesis that transferability emerges from shared representational geometry.

7 retrieved papers
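
A per-attack-type comparison of this kind could be computed as sketched below. The definition used here is an assumption, since this report does not give the paper's exact metric: the transfer rate is taken to be the fraction of attempts that succeed on the source model and also succeed on the target model. The record fields and the toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict

def transfer_rate_by_type(records):
    """Return {attack_type: transfer rate} over attempts that succeed on the source.

    Each record is a dict with keys "attack_type", "source_success", and
    "target_success". Attack types with no source successes are omitted.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if not r["source_success"]:
            continue  # an attack that fails on the source has nothing to transfer
        totals[r["attack_type"]] += 1
        hits[r["attack_type"]] += int(r["target_success"])
    return {t: hits[t] / totals[t] for t in totals}

# Made-up outcomes for illustration only.
toy_records = [
    {"attack_type": "persona", "source_success": True, "target_success": True},
    {"attack_type": "persona", "source_success": True, "target_success": True},
    {"attack_type": "persona", "source_success": True, "target_success": False},
    {"attack_type": "cipher", "source_success": True, "target_success": False},
    {"attack_type": "cipher", "source_success": True, "target_success": False},
    {"attack_type": "cipher", "source_success": False, "target_success": True},
]
print(transfer_rate_by_type(toy_records))
```

Conditioning on source success separates the question of whether an attack transfers from the question of how strong it is on the source model, which the paper treats as a distinct factor.
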
