Automatic Instance Selection with Genetic Updating for Few-shot LLM Jailbreak

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Jailbreak attack, Few-shot, text gradient, genetic algorithm
Abstract:

This paper studies few-shot large language model (LLM) jailbreak, which aims to trigger unsafe outputs from LLMs using only a handful of adversarial examples. The effectiveness of current few-shot jailbreak attacks is limited by the difficulty of systematically selecting the most potent instances; existing methods often resort to inefficient manual or random selection. In this paper, we propose Automatic Instance Selection with Genetic Updating (ACCEPT), a novel approach for few-shot LLM jailbreak. The core of ACCEPT is to use textual gradients and fitness scores to guide the optimization process automatically. In particular, ACCEPT defines a loss objective that prioritizes successful jailbreaks, and the textual gradient of this objective guides the selection of instances. Furthermore, we construct a pool of meaningless marks and, following the genetic algorithm, encode the injection operators as chromosomes; a fitness function tailored to jailbreak scenarios then drives the evolution of prompts across generations. Extensive experiments on several benchmark datasets validate the effectiveness of ACCEPT against a wide range of baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ACCEPT, a framework for few-shot LLM jailbreak that combines textual gradient-based instance selection with genetic algorithm optimization for non-semantic mark injection. It resides in the 'Few-Shot Demonstration Injection' leaf of the taxonomy, which contains four papers total including this work. This leaf sits within the broader 'In-Context Learning Based Jailbreak Attacks' branch, indicating a moderately populated research direction focused on exploiting demonstration examples to bypass safety alignment. The taxonomy reveals this is one of several active attack paradigms, alongside prompt optimization, multimodal attacks, and reinforcement learning approaches.

The taxonomy structure shows ACCEPT's leaf neighbors other demonstration-based methods, while adjacent leaves explore 'Contextual Priming and Response Manipulation' (two papers) and parallel branches investigate 'Gradient-Free Suffix Optimization' and 'Semantic Obfuscation' techniques. The scope notes clarify that demonstration injection methods differ from many-shot attacks or non-demonstration approaches, positioning ACCEPT at the intersection of in-context learning exploitation and systematic instance selection. The broader taxonomy reveals approximately 29 papers across diverse attack mechanisms, suggesting the field has fragmented into specialized sub-problems rather than converging on unified methodologies.

Among the three identified contributions, the literature search examined five candidates total. The textual gradient-based instance selection mechanism was evaluated against four candidates with zero refutations found, while the integrated ACCEPT framework was compared to one candidate with no overlap detected. The genetic algorithm component received no direct comparison due to limited candidate availability. This limited search scope—examining roughly five semantically similar papers rather than an exhaustive survey—means the analysis captures immediate neighbors but may miss relevant work in adjacent taxonomy branches or recent preprints. The absence of refutations among examined candidates suggests potential novelty within the searched subset, though the small sample size precludes definitive conclusions.

Based on the constrained literature search covering five candidates, ACCEPT appears to occupy a distinct position combining gradient-based selection with genetic optimization for few-shot jailbreak. However, the analysis explicitly covers only top-K semantic matches and does not encompass the full taxonomy of 29 papers or adjacent research directions like reinforcement learning attacks or structural manipulation methods. The contribution-level statistics reflect what was examined, not the complete prior art landscape, leaving open questions about overlap with gradient-free optimization or evolutionary strategies in neighboring taxonomy branches.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 5
Refutable Papers: 0

Research Landscape Overview

Core task: few-shot large language model jailbreak attack. The field of jailbreak attacks on large language models has evolved into a rich taxonomy spanning multiple strategic dimensions. At the highest level, researchers explore In-Context Learning Based Jailbreak Attacks that leverage demonstration injection and contextual manipulation, Prompt Optimization and Manipulation Attacks that systematically refine adversarial inputs, and Multimodal Jailbreak Attacks that exploit vision-language interfaces. Additional branches include Model Adaptation and Fine-Tuning Attacks that poison or retrain models, Cross-Lingual and Domain-Specific Attacks that exploit linguistic or specialized vulnerabilities, Reinforcement Learning Enhanced Attacks that iteratively optimize jailbreak strategies, Self-Generated and Recursive Attacks where models produce their own adversarial prompts, and Defense and Detection Mechanisms that aim to identify or mitigate these threats.

Works such as Improved Few-shot Jailbreaking[3] and RL Few-shot Jailbreak[24] illustrate how different branches can converge on the core challenge of crafting effective few-shot demonstrations. Within this landscape, a particularly active line of inquiry focuses on how carefully selected in-context examples can bypass safety guardrails. Genetic Instance Selection[0] sits squarely in the Few-Shot Demonstration Injection cluster, emphasizing evolutionary or selection-based strategies to identify potent demonstration sets. This contrasts with approaches like Improved Few-shot Jailbreaking[3], which may prioritize systematic prompt refinement, and Self-Instruct Jailbreaking[8], which explores recursive self-generation of adversarial content. Meanwhile, defense-oriented efforts such as Few-shot Jailbreak Guard[1] and Zero-Shot Jailbreak Detection[25] highlight the ongoing arms race between attack sophistication and protective measures.
The original paper's focus on genetic or instance-selection mechanisms places it among methods that treat demonstration choice as an optimization problem, distinguishing it from purely prompt-engineering or multimodal strategies while sharing common ground with reinforcement and iterative refinement techniques.

Claimed Contributions

Textual gradient-based instance selection for few-shot jailbreak

The authors propose using textual gradients to automatically select the most effective semantic instances for few-shot jailbreak attacks. This method treats instance selection as a differentiable optimization process in text space, using LLM-generated gradient feedback to guide iterative improvements rather than relying on manual or random selection.

4 retrieved papers
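The iterative, feedback-driven selection loop described above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the `critique` function mocks the LLM-generated textual gradient, and string length serves as a toy proxy for instance "strength".

```python
import random

def critique(demos):
    # Stand-in for the LLM critic: in the described mechanism, this
    # feedback (the "textual gradient") would come from prompting an
    # LLM to explain why the current demonstration set failed.
    weakest = min(demos, key=len)
    return f"weakest: {weakest}"

def select_instances(pool, k=3, steps=5, seed=0):
    # Iteratively refine a k-shot demonstration set using textual
    # feedback instead of manual or random selection.
    rng = random.Random(seed)
    current = rng.sample(pool, k)
    for _ in range(steps):
        feedback = critique(current)
        weakest = feedback.split("weakest: ", 1)[1]
        candidates = [p for p in pool if p not in current]
        if not candidates:
            break
        # Act on the feedback: swap the named weak instance for the
        # strongest unused candidate (toy proxy: longest string).
        best_cand = max(candidates, key=len)
        if len(best_cand) <= len(weakest):
            break  # no improving swap available
        current[current.index(weakest)] = best_cand
    return current
```

A real system would replace both the critic and the swap heuristic with LLM calls scored against the loss objective described in the abstract.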
ACCEPT framework integrating semantic and non-semantic optimization

The authors introduce ACCEPT, a hybrid framework that synergistically combines two optimization mechanisms: textual gradient-guided selection of semantic instances and genetic algorithm-driven injection of non-semantic markers (emojis, special characters) into harmful prompts. This dual-layer approach addresses both instance selection and attack evasiveness simultaneously.

1 retrieved paper
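A minimal sketch of how the two layers might be orchestrated. All function bodies are placeholder stubs; the names `textual_gradient_select`, `genetic_inject`, and `jailbreak_succeeds` are hypothetical and do not come from the paper.

```python
def textual_gradient_select(pool, k):
    # Stub for the semantic layer: would return k demonstrations
    # refined via LLM "textual gradient" feedback.
    return sorted(pool, key=len, reverse=True)[:k]

def genetic_inject(prompt):
    # Stub for the non-semantic layer: would return the prompt with
    # genetically evolved mark injections (emojis, special characters).
    return "* " + prompt + " *"

def jailbreak_succeeds(prompt):
    # Placeholder criterion: a real check would score the target
    # model's response to the assembled few-shot prompt.
    return True

def accept_attack(demo_pool, harmful_prompt, k=3, max_rounds=2):
    # Dual-layer loop: pick semantic demonstrations, perturb the
    # harmful query with non-semantic marks, assemble the few-shot
    # prompt, and stop once the success criterion is met.
    for _ in range(max_rounds):
        demos = textual_gradient_select(demo_pool, k)
        query = genetic_inject(harmful_prompt)
        full_prompt = "\n".join(demos + [query])
        if jailbreak_succeeds(full_prompt):
            return full_prompt
    return None
```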
Genetic algorithm for non-semantic mark injection optimization

The authors develop a genetic algorithm-based mechanism that systematically optimizes the injection of non-semantic markers into harmful prompts. The approach encodes perturbation strategies as chromosomes with genes controlling operation type, element selection, intensity, position, and additional transformations, using evolutionary search to discover optimal evasion strategies.

0 retrieved papers
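The chromosome encoding and evolutionary loop might look roughly like the sketch below. The gene set, mark pool, and fitness function are illustrative assumptions: a real fitness would score the target model's responses for jailbreak success, whereas this toy version just rewards higher mark intensity.

```python
import random

MARKS = ["@", "#", "~", "*", "^"]      # pool of meaningless marks (illustrative)
OPS = ["insert", "append", "wrap"]     # injection operation types (illustrative)

def random_chromosome(rng):
    # Genes controlling operation type, mark choice, intensity,
    # position, and an additional transformation, per the description.
    return {
        "op": rng.choice(OPS),
        "mark": rng.choice(MARKS),
        "intensity": rng.randint(1, 3),
        "position": rng.random(),          # fraction of prompt length
        "flip_case": rng.random() < 0.5,   # extra transformation gene
    }

def apply_chromosome(chrom, prompt):
    # Decode a chromosome into a concrete perturbation of the prompt.
    idx = int(chrom["position"] * len(prompt))
    marks = chrom["mark"] * chrom["intensity"]
    if chrom["op"] == "insert":
        out = prompt[:idx] + marks + prompt[idx:]
    elif chrom["op"] == "append":
        out = prompt + marks
    else:  # wrap
        out = marks + prompt + marks
    return out.swapcase() if chrom["flip_case"] else out

def fitness(chrom):
    # Toy stand-in: the real fitness would query the target LLM and
    # score its response; here we simply reward intensity and wrapping.
    return chrom["intensity"] + (1 if chrom["op"] == "wrap" else 0)

def evolve(prompt, pop_size=8, generations=5, seed=0):
    rng = random.Random(seed)
    pop = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = {g: rng.choice([a[g], b[g]]) for g in a}  # uniform crossover
            if rng.random() < 0.3:                            # point mutation
                gene = rng.choice(list(child))
                child[gene] = random_chromosome(rng)[gene]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return apply_chromosome(best, prompt)
```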

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Textual gradient-based instance selection for few-shot jailbreak

The authors propose using textual gradients to automatically select the most effective semantic instances for few-shot jailbreak attacks. This method treats instance selection as a differentiable optimization process in text space, using LLM-generated gradient feedback to guide iterative improvements rather than relying on manual or random selection.

Contribution

ACCEPT framework integrating semantic and non-semantic optimization

The authors introduce ACCEPT, a hybrid framework that synergistically combines two optimization mechanisms: textual gradient-guided selection of semantic instances and genetic algorithm-driven injection of non-semantic markers (emojis, special characters) into harmful prompts. This dual-layer approach addresses both instance selection and attack evasiveness simultaneously.

Contribution

Genetic algorithm for non-semantic mark injection optimization

The authors develop a genetic algorithm-based mechanism that systematically optimizes the injection of non-semantic markers into harmful prompts. The approach encodes perturbation strategies as chromosomes with genes controlling operation type, element selection, intensity, position, and additional transformations, using evolutionary search to discover optimal evasion strategies.
