Automatic Instance Selection with Genetic Updating for Few-shot LLM Jailbreak

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Jailbreak attack, Few-shot, text gradient, genetic algorithm
Abstract:

This paper studies few-shot large language model (LLM) jailbreak, which aims to trigger unsafe outputs from LLMs using only a handful of adversarial examples. The effectiveness of current few-shot jailbreak attacks is limited by the difficulty of systematically selecting the most potent instances; existing methods often resort to inefficient manual or random selection. In this paper, we propose Automatic Instance Selection with Genetic Updating (ACCEPT), a novel approach for few-shot LLM jailbreak. The core of ACCEPT is to use textual gradients and fitness scores to guide the optimization process automatically. In particular, ACCEPT defines a loss objective that prioritizes successful jailbreaks, and the textual gradient of this objective guides the selection of instances. Furthermore, we construct a pool of meaningless marks and, following the genetic algorithm, encode the injection operators as chromosomes; a fitness function tailored to jailbreak scenarios then drives the evolution of prompts across generations. Extensive experiments on several benchmark datasets validate the effectiveness of ACCEPT against a wide range of baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ACCEPT, a framework for few-shot LLM jailbreak that combines textual gradient-based instance selection with genetic algorithm optimization for non-semantic mark injection. It resides in the 'Few-Shot Demonstration Injection' leaf of the taxonomy, which contains four papers total including this work. This leaf sits within the broader 'In-Context Learning Based Jailbreak Attacks' branch, indicating a moderately populated research direction focused on exploiting demonstration examples to bypass safety alignment. The taxonomy reveals this is one of several active attack paradigms, alongside prompt optimization, multimodal attacks, and reinforcement learning approaches.

The taxonomy structure shows ACCEPT's leaf neighbors other demonstration-based methods, while adjacent leaves explore 'Contextual Priming and Response Manipulation' (two papers) and parallel branches investigate 'Gradient-Free Suffix Optimization' and 'Semantic Obfuscation' techniques. The scope notes clarify that demonstration injection methods differ from many-shot attacks or non-demonstration approaches, positioning ACCEPT at the intersection of in-context learning exploitation and systematic instance selection. The broader taxonomy reveals approximately 29 papers across diverse attack mechanisms, suggesting the field has fragmented into specialized sub-problems rather than converging on unified methodologies.

Among the three identified contributions, the literature search examined five candidates total. The textual gradient-based instance selection mechanism was evaluated against four candidates with zero refutations found, while the integrated ACCEPT framework was compared to one candidate with no overlap detected. The genetic algorithm component received no direct comparison due to limited candidate availability. This limited search scope—examining roughly five semantically similar papers rather than an exhaustive survey—means the analysis captures immediate neighbors but may miss relevant work in adjacent taxonomy branches or recent preprints. The absence of refutations among examined candidates suggests potential novelty within the searched subset, though the small sample size precludes definitive conclusions.

Based on the constrained literature search covering five candidates, ACCEPT appears to occupy a distinct position combining gradient-based selection with genetic optimization for few-shot jailbreak. However, the analysis explicitly covers only top-K semantic matches and does not encompass the full taxonomy of 29 papers or adjacent research directions like reinforcement learning attacks or structural manipulation methods. The contribution-level statistics reflect what was examined, not the complete prior art landscape, leaving open questions about overlap with gradient-free optimization or evolutionary strategies in neighboring taxonomy branches.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 5
Refutable Papers: 0

Research Landscape Overview

Core task: few-shot large language model jailbreak attack. The field of jailbreak attacks on large language models has evolved into a rich taxonomy spanning multiple strategic dimensions. At the highest level, researchers explore In-Context Learning Based Jailbreak Attacks that leverage demonstration injection and contextual manipulation, Prompt Optimization and Manipulation Attacks that systematically refine adversarial inputs, and Multimodal Jailbreak Attacks that exploit vision-language interfaces. Additional branches include Model Adaptation and Fine-Tuning Attacks that poison or retrain models, Cross-Lingual and Domain-Specific Attacks that exploit linguistic or specialized vulnerabilities, Reinforcement Learning Enhanced Attacks that iteratively optimize jailbreak strategies, Self-Generated and Recursive Attacks where models produce their own adversarial prompts, and Defense and Detection Mechanisms that aim to identify or mitigate these threats.

Works such as Improved Few-shot Jailbreaking[3] and RL Few-shot Jailbreak[24] illustrate how different branches can converge on the core challenge of crafting effective few-shot demonstrations. Within this landscape, a particularly active line of inquiry focuses on how carefully selected in-context examples can bypass safety guardrails. Genetic Instance Selection[0] sits squarely in the Few-Shot Demonstration Injection cluster, emphasizing evolutionary or selection-based strategies to identify potent demonstration sets. This contrasts with approaches like Improved Few-shot Jailbreaking[3], which may prioritize systematic prompt refinement, and Self-Instruct Jailbreaking[8], which explores recursive self-generation of adversarial content. Meanwhile, defense-oriented efforts such as Few-shot Jailbreak Guard[1] and Zero-Shot Jailbreak Detection[25] highlight the ongoing arms race between attack sophistication and protective measures.
The original paper's focus on genetic or instance-selection mechanisms places it among methods that treat demonstration choice as an optimization problem, distinguishing it from purely prompt-engineering or multimodal strategies while sharing common ground with reinforcement and iterative refinement techniques.

Claimed Contributions

Textual gradient-based instance selection for few-shot jailbreak

The authors propose using textual gradients to automatically select the most effective semantic instances for few-shot jailbreak attacks. This method treats instance selection as a differentiable optimization process in text space, using LLM-generated gradient feedback to guide iterative improvements rather than relying on manual or random selection.

4 retrieved papers
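The iterative, feedback-driven selection loop described above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the `critique` function mocks the LLM-generated textual gradient, and string length serves as a toy proxy for instance "strength".

```python
import random

def critique(demos):
    # Stand-in for the LLM critic: in the described mechanism, this
    # feedback (the "textual gradient") would come from prompting an
    # LLM to explain why the current demonstration set failed.
    weakest = min(demos, key=len)
    return f"weakest: {weakest}"

def select_instances(pool, k=3, steps=5, seed=0):
    # Iteratively refine a k-shot demonstration set using textual
    # feedback instead of manual or random selection.
    rng = random.Random(seed)
    current = rng.sample(pool, k)
    for _ in range(steps):
        feedback = critique(current)
        weakest = feedback.split("weakest: ", 1)[1]
        candidates = [p for p in pool if p not in current]
        if not candidates:
            break
        # Act on the feedback: swap the named weak instance for the
        # strongest unused candidate (toy proxy: longest string).
        best_cand = max(candidates, key=len)
        if len(best_cand) <= len(weakest):
            break  # no improving swap available
        current[current.index(weakest)] = best_cand
    return current
```

A real system would replace both the critic and the swap heuristic with LLM calls scored against the loss objective described in the abstract.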
ACCEPT framework integrating semantic and non-semantic optimization

The authors introduce ACCEPT, a hybrid framework that synergistically combines two optimization mechanisms: textual gradient-guided selection of semantic instances and genetic algorithm-driven injection of non-semantic markers (emojis, special characters) into harmful prompts. This dual-layer approach addresses both instance selection and attack evasiveness simultaneously.

1 retrieved paper
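A minimal sketch of how the two layers might be orchestrated. All function bodies are placeholder stubs; the names `textual_gradient_select`, `genetic_inject`, and `jailbreak_succeeds` are hypothetical and do not come from the paper.

```python
def textual_gradient_select(pool, k):
    # Stub for the semantic layer: would return k demonstrations
    # refined via LLM "textual gradient" feedback.
    return sorted(pool, key=len, reverse=True)[:k]

def genetic_inject(prompt):
    # Stub for the non-semantic layer: would return the prompt with
    # genetically evolved mark injections (emojis, special characters).
    return "* " + prompt + " *"

def jailbreak_succeeds(prompt):
    # Placeholder criterion: a real check would score the target
    # model's response to the assembled few-shot prompt.
    return True

def accept_attack(demo_pool, harmful_prompt, k=3, max_rounds=2):
    # Dual-layer loop: pick semantic demonstrations, perturb the
    # harmful query with non-semantic marks, assemble the few-shot
    # prompt, and stop once the success criterion is met.
    for _ in range(max_rounds):
        demos = textual_gradient_select(demo_pool, k)
        query = genetic_inject(harmful_prompt)
        full_prompt = "\n".join(demos + [query])
        if jailbreak_succeeds(full_prompt):
            return full_prompt
    return None
```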
Genetic algorithm for non-semantic mark injection optimization

The authors develop a genetic algorithm-based mechanism that systematically optimizes the injection of non-semantic markers into harmful prompts. The approach encodes perturbation strategies as chromosomes with genes controlling operation type, element selection, intensity, position, and additional transformations, using evolutionary search to discover optimal evasion strategies.

0 retrieved papers
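The chromosome encoding and evolutionary loop might look roughly like the sketch below. The gene set, mark pool, and fitness function are illustrative assumptions: a real fitness would score the target model's responses for jailbreak success, whereas this toy version just rewards higher mark intensity.

```python
import random

MARKS = ["@", "#", "~", "*", "^"]      # pool of meaningless marks (illustrative)
OPS = ["insert", "append", "wrap"]     # injection operation types (illustrative)

def random_chromosome(rng):
    # Genes controlling operation type, mark choice, intensity,
    # position, and an additional transformation, per the description.
    return {
        "op": rng.choice(OPS),
        "mark": rng.choice(MARKS),
        "intensity": rng.randint(1, 3),
        "position": rng.random(),          # fraction of prompt length
        "flip_case": rng.random() < 0.5,   # extra transformation gene
    }

def apply_chromosome(chrom, prompt):
    # Decode a chromosome into a concrete perturbation of the prompt.
    idx = int(chrom["position"] * len(prompt))
    marks = chrom["mark"] * chrom["intensity"]
    if chrom["op"] == "insert":
        out = prompt[:idx] + marks + prompt[idx:]
    elif chrom["op"] == "append":
        out = prompt + marks
    else:  # wrap
        out = marks + prompt + marks
    return out.swapcase() if chrom["flip_case"] else out

def fitness(chrom):
    # Toy stand-in: the real fitness would query the target LLM and
    # score its response; here we simply reward intensity and wrapping.
    return chrom["intensity"] + (1 if chrom["op"] == "wrap" else 0)

def evolve(prompt, pop_size=8, generations=5, seed=0):
    rng = random.Random(seed)
    pop = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = {g: rng.choice([a[g], b[g]]) for g in a}  # uniform crossover
            if rng.random() < 0.3:                            # point mutation
                gene = rng.choice(list(child))
                child[gene] = random_chromosome(rng)[gene]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return apply_chromosome(best, prompt)
```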

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Textual gradient-based instance selection for few-shot jailbreak

The authors propose using textual gradients to automatically select the most effective semantic instances for few-shot jailbreak attacks. This method treats instance selection as a differentiable optimization process in text space, using LLM-generated gradient feedback to guide iterative improvements rather than relying on manual or random selection.

Contribution

ACCEPT framework integrating semantic and non-semantic optimization

The authors introduce ACCEPT, a hybrid framework that synergistically combines two optimization mechanisms: textual gradient-guided selection of semantic instances and genetic algorithm-driven injection of non-semantic markers (emojis, special characters) into harmful prompts. This dual-layer approach addresses both instance selection and attack evasiveness simultaneously.

Contribution

Genetic algorithm for non-semantic mark injection optimization

The authors develop a genetic algorithm-based mechanism that systematically optimizes the injection of non-semantic markers into harmful prompts. The approach encodes perturbation strategies as chromosomes with genes controlling operation type, element selection, intensity, position, and additional transformations, using evolutionary search to discover optimal evasion strategies.
