Automatic Instance Selection with Genetic Updating for Few-shot LLM Jailbreak
Overview
Overall Novelty Assessment
The paper proposes ACCEPT, a framework for few-shot LLM jailbreak that combines textual gradient-based instance selection with genetic algorithm optimization for non-semantic mark injection. It resides in the 'Few-Shot Demonstration Injection' leaf of the taxonomy, which contains four papers total including this work. This leaf sits within the broader 'In-Context Learning Based Jailbreak Attacks' branch, indicating a moderately populated research direction focused on exploiting demonstration examples to bypass safety alignment. The taxonomy reveals this is one of several active attack paradigms, alongside prompt optimization, multimodal attacks, and reinforcement learning approaches.
The taxonomy places ACCEPT in the same leaf as other demonstration-based methods. Adjacent leaves cover 'Contextual Priming and Response Manipulation' (two papers), and parallel branches investigate 'Gradient-Free Suffix Optimization' and 'Semantic Obfuscation' techniques. The scope notes clarify that demonstration injection methods differ from many-shot attacks and from non-demonstration approaches, positioning ACCEPT at the intersection of in-context learning exploitation and systematic instance selection. The broader taxonomy contains approximately 29 papers across diverse attack mechanisms, suggesting the field has fragmented into specialized sub-problems rather than converging on unified methodologies.
Across the three identified contributions, the literature search examined five candidates in total. The textual gradient-based instance selection mechanism was evaluated against four candidates, none of which refuted it, while the integrated ACCEPT framework was compared to one candidate with no overlap detected. The genetic algorithm component received no direct comparison because no candidates were retrieved for it. Because the search examined only about five semantically similar papers rather than surveying the field exhaustively, the analysis captures immediate neighbors but may miss relevant work in adjacent taxonomy branches or recent preprints. The absence of refutations among the examined candidates suggests potential novelty within the searched subset, though the small sample size precludes definitive conclusions.
Based on the constrained literature search covering five candidates, ACCEPT appears to occupy a distinct position combining gradient-based selection with genetic optimization for few-shot jailbreak. However, the analysis explicitly covers only top-K semantic matches and does not encompass the full taxonomy of 29 papers or adjacent research directions like reinforcement learning attacks or structural manipulation methods. The contribution-level statistics reflect what was examined, not the complete prior art landscape, leaving open questions about overlap with gradient-free optimization or evolutionary strategies in neighboring taxonomy branches.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose using textual gradients to automatically select the most effective semantic instances for few-shot jailbreak attacks. This method treats instance selection as a differentiable optimization process in text space, using LLM-generated gradient feedback to guide iterative improvements rather than relying on manual or random selection.
The authors introduce ACCEPT, a hybrid framework that combines two optimization mechanisms: textual gradient-guided selection of semantic instances and genetic algorithm-driven injection of non-semantic markers (emojis, special characters) into harmful prompts. This dual-layer approach addresses both instance selection and attack evasiveness simultaneously.
The authors develop a genetic algorithm-based mechanism that systematically optimizes the injection of non-semantic markers into harmful prompts. The approach encodes perturbation strategies as chromosomes with genes controlling operation type, element selection, intensity, position, and additional transformations, using evolutionary search to discover optimal evasion strategies.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Jailbreak and guard aligned language models with only few in-context demonstrations
[3] Improved few-shot jailbreaking can circumvent aligned language models and their defenses
[8] Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning
Contribution Analysis
Detailed comparisons for each claimed contribution
Textual gradient-based instance selection for few-shot jailbreak
The authors propose using textual gradients to automatically select the most effective semantic instances for few-shot jailbreak attacks. This method treats instance selection as a differentiable optimization process in text space, using LLM-generated gradient feedback to guide iterative improvements rather than relying on manual or random selection.
[6] Hijacking Large Language Models via Adversarial In-Context Learning
[30] Class specific autoencoders enhance sample diversity
[31] MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization
[32] Sparse Adversarial Attack For Video Via Gradient-Based Keyframe Selection
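To make the idea of feedback-guided instance selection more concrete, the following is a minimal, abstract sketch and not the authors' implementation. It assumes a hypothetical `critic` callable standing in for the LLM that produces the textual gradient, and a hypothetical `score` callable standing in for whatever objective the framework evaluates. Here the critic returns a structured swap suggestion; a real textual gradient would presumably be free-form text that has to be parsed. All names, the toy utilities, and the accept-if-better rule are illustrative assumptions.

```python
def select_instances(pool, k, score, critic, max_steps=10):
    """Refine a k-item subset of `pool` using critic feedback (illustrative only)."""
    selected = list(pool[:k])          # start from the first k instances
    best = score(selected)
    for _ in range(max_steps):
        suggestion = critic(selected, pool)   # stand-in for LLM-generated feedback
        if suggestion is None:                # critic has nothing left to change
            break
        candidate = [
            suggestion["add"] if name == suggestion["drop"] else name
            for name in selected
        ]
        candidate_score = score(candidate)
        if candidate_score <= best:           # keep a swap only if it helps
            break
        selected, best = candidate, candidate_score
    return selected, best


# Toy stand-ins (assumptions): each instance has a fixed integer utility.
UTILITY = {"demo_a": 2, "demo_b": 9, "demo_c": 5, "demo_d": 7}


def toy_score(selected):
    return sum(UTILITY[name] for name in selected)


def toy_critic(selected, pool):
    """Suggest swapping the weakest selected item for the strongest unused one."""
    unused = [name for name in pool if name not in selected]
    if not unused:
        return None
    weakest = min(selected, key=UTILITY.get)
    strongest = max(unused, key=UTILITY.get)
    if UTILITY[strongest] <= UTILITY[weakest]:
        return None
    return {"drop": weakest, "add": strongest}


chosen, total = select_instances(list(UTILITY), k=2, score=toy_score, critic=toy_critic)
# chosen == ["demo_d", "demo_b"], total == 16
```

The loop accepts a suggested swap only when the score improves, which is one simple way to turn qualitative feedback into an iterative selection procedure; the assessed paper may use a different update rule.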
ACCEPT framework integrating semantic and non-semantic optimization
The authors introduce ACCEPT, a hybrid framework that combines two optimization mechanisms: textual gradient-guided selection of semantic instances and genetic algorithm-driven injection of non-semantic markers (emojis, special characters) into harmful prompts. This dual-layer approach addresses both instance selection and attack evasiveness simultaneously.
[33] SEC-Prompt: SEmantic Complementary Prompting for Few-Shot Class-Incremental Learning
Genetic algorithm for non-semantic mark injection optimization
The authors develop a genetic algorithm-based mechanism that systematically optimizes the injection of non-semantic markers into harmful prompts. The approach encodes perturbation strategies as chromosomes with genes controlling operation type, element selection, intensity, position, and additional transformations, using evolutionary search to discover optimal evasion strategies.
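The chromosome encoding described above can be illustrated with a generic evolutionary loop. This is a hedged sketch under stated assumptions, not the paper's algorithm: the five gene roles come from the description, but the gene domains (`OPERATIONS`, `ELEMENTS`, `TRANSFORMS`, the 1 to 5 intensity range), the uniform crossover, the mutation rate, the elitist selection, and the `toy_fitness` objective are all hypothetical placeholders. The sketch searches only over abstract configuration records and does not apply them to any text.

```python
import random
from dataclasses import dataclass

# Hypothetical gene domains; the source names the gene roles but not their values.
OPERATIONS = ("insert", "replace", "wrap")
ELEMENTS = ("emoji", "special_char", "whitespace")
TRANSFORMS = ("none", "repeat", "mirror")
GENES = ("operation", "element", "intensity", "position", "transform")


@dataclass(frozen=True)
class Chromosome:
    operation: str    # operation type
    element: str      # element selection
    intensity: int    # assumed range 1..5
    position: float   # assumed relative position in [0.0, 1.0]
    transform: str    # additional transformation


def random_chromosome(rng):
    return Chromosome(
        rng.choice(OPERATIONS), rng.choice(ELEMENTS),
        rng.randint(1, 5), rng.random(), rng.choice(TRANSFORMS),
    )


def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return Chromosome(**{g: getattr(rng.choice((a, b)), g) for g in GENES})


def mutate(c, rng, rate=0.2):
    """Resample each gene independently with probability `rate`."""
    fresh = random_chromosome(rng)
    return Chromosome(
        **{g: getattr(fresh if rng.random() < rate else c, g) for g in GENES}
    )


def toy_fitness(c):
    # Placeholder objective (assumption): prefers mid intensity and mid position.
    return -abs(c.intensity - 3) - abs(c.position - 0.5)


def evolve(fitness, generations=20, pop_size=16, elite=4, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]          # elites survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=fitness)


best = evolve(toy_fitness)
```

Because the elites are carried over unchanged, the best fitness in the population never decreases from one generation to the next. In the assessed framework the fitness signal would come from the attack evaluation rather than from a closed-form function like the one assumed here.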