Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 5.5 Download Report PDF

Large Language ModelsAny-Depth AlignmentDeep-prefill attacksSafety tokenInference-time defense

Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA) an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model’s strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving benign utility with minimal over-refusal and maintaining resilience even after the base model undergoes subsequent instruction tuning.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Any-Depth Alignment (ADA), an inference-time method that reintroduces assistant header tokens mid-generation to trigger safety reassessment at arbitrary depths. It resides in the Mid-Generation Safety Intervention leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader Inference-Time Safety Mechanisms branch. This positioning suggests the work addresses a specific gap: most inference-time defenses operate at input filtering or single evaluation points, whereas this leaf focuses on continuous safety monitoring throughout token generation.

The taxonomy reveals that neighboring leaves address complementary aspects of runtime safety. External Guard Models employ separate monitoring systems for policy violations, while Step-by-Step Detoxification constrains toxic content incrementally during decoding. The broader Inference-Time Safety Mechanisms branch sits alongside Training-Based Safety Alignment (which modifies parameters) and Agent Safety frameworks (which handle multi-step task execution). The paper's focus on mid-stream intervention without parameter changes distinguishes it from reasoning-enhanced training methods and preference optimization approaches, though it shares conceptual overlap with agent safety work addressing extended generation contexts.

Among 26 candidates examined, the deep prefill attack contribution shows one refutable candidate from 10 examined, while the ADA-RK method appears more novel with zero refutations across 6 candidates. The ADA-LP linear probe method also has one refutable candidate among 10 examined. These statistics reflect a limited semantic search scope rather than exhaustive coverage. The attack methodology appears to have some prior exploration, while the core rethinking mechanism shows less direct overlap within the examined literature. The relatively small candidate pool suggests caution in interpreting these findings as definitive novelty assessments.

Given the sparse three-paper leaf and limited 26-candidate search, the work appears to occupy a relatively underexplored niche within inference-time safety. The taxonomy structure indicates this is a nascent research direction compared to more crowded areas like training-based alignment or agent benchmarking. However, the analysis cannot rule out relevant work outside the top-K semantic matches or in adjacent domains not captured by the taxonomy's scope boundaries.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: Extending safety alignment to arbitrary generation depths in large language models. The field has organized itself around several complementary perspectives on ensuring safe model behavior. Inference-Time Safety Mechanisms focus on runtime interventions that can detect and correct unsafe outputs during generation, including mid-generation checks and post-hoc filtering. Training-Based Safety Alignment encompasses methods that bake safety into model weights through supervised fine-tuning, reinforcement learning, and preference optimization. Agent Safety and Behavioral Alignment addresses the unique challenges of autonomous systems that interact with environments and tools, where safety must extend beyond single responses to multi-step decision sequences. Safety Vulnerability Analysis investigates adversarial attacks and failure modes, while Reasoning and Cognitive Processes examine how models develop and express internal deliberation. Evaluation and Benchmarking Infrastructure provides the measurement frameworks needed to assess safety across diverse scenarios, and Auxiliary Technical Foundations supplies supporting techniques like representation learning and interpretability methods. Recent work has increasingly recognized that safety cannot be treated as a single-shot output property but must persist throughout extended reasoning chains and agentic interactions. Any-Depth Alignment[0] sits within the Mid-Generation Safety Intervention cluster, addressing the challenge of maintaining safety guarantees as models produce longer, more complex outputs with intermediate reasoning steps. This emphasis contrasts with approaches like Root Defense[10] and Root Defence[19], which focus on detecting and blocking unsafe requests at the input stage before generation begins. Meanwhile, works such as AgentAlign[1] and STAIR[2] tackle safety in multi-turn agentic settings where models take actions over time, highlighting a related but distinct challenge of behavioral consistency across episodes. The central tension across these branches involves balancing the flexibility needed for capable reasoning against the robustness required to prevent safety failures at any point in the generation process.

Claimed Contributions

Deep prefill attacks for testing depth-robustness of alignment

Can Refute

10 retrieved papers

The authors introduce deep prefill attacks, a new evaluation methodology that tests whether language models can maintain safety alignment at arbitrary generation depths by using harmful assistant-prefills ranging from tens to thousands of tokens. This reveals that current alignment strategies fail to generalize beyond shallow depths.

10 retrieved papers

Can Refute

Any-Depth Alignment Rethinking (ADA-RK) method

6 retrieved papers

The authors propose ADA-RK, a training-free inference-time defense that re-injects assistant header tokens (Safety Tokens) at periodic depths during generation to trigger the model to reassess harmfulness and produce refusals at any point in generation, without requiring parameter updates.

6 retrieved papers

Any-Depth Alignment Linear Probe (ADA-LP) method

Can Refute

10 retrieved papers

The authors develop ADA-LP, which uses a lightweight linear classifier applied to the hidden states of injected Safety Tokens to detect harmfulness. This method achieves near-100% refusal rates against deep prefills and adversarial attacks while maintaining minimal over-refusal on benign tasks, by unlocking the model's innate safety representations without modifying base model parameters.

10 retrieved papers

Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[10] Root defense strategies: Ensuring safety of LLM at the decoding level PDF

Chen Jiawei, Shang Yu-ying, Tian Yu, Zeng Xin-yi, Zhang Jingyuan (2025)

[19] Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level PDF

Zeng Xin-yi, Shang Yu-ying, Xinyi Zeng, Chen Jiawei, Yuying Shang, Zhang Jingyuan, Yutao Zhu, Tian Yu, Jiawei Chen, Yu Tian (2024)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Deep prefill attacks for testing depth-robustness of alignment

[25] Safety Alignment Should Be Made More Than Just a Few Tokens Deep PDF

Can Refute

[53] Defending against alignment-breaking attacks via robustly aligned llm PDF

Cannot Refute

[54] PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking PDF

Cannot Refute

[55] Sampling-aware adversarial attacks against large language models PDF

Cannot Refute

[56] Output Length Effect on DeepSeek-R1's Safety in Forced Thinking PDF

Cannot Refute

[57] Mbias: Mitigating bias in large language models while retaining context PDF

Cannot Refute

[58] Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction PDF

Cannot Refute

[59] Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence PDF

Cannot Refute

[60] Catching Contamination Before Generation: Spectral Kill Switches for Agents PDF

Cannot Refute

[61] Energy-Oriented Alignment for Large Language Models PDF

Cannot Refute

Contribution

Any-Depth Alignment Rethinking (ADA-RK) method

[37] A watermark for large language models PDF

Cannot Refute

[38] Cleangen: Mitigating backdoor attacks for generation tasks in large language models PDF

Cannot Refute

[39] Modelshield: Adaptive and robust watermark against model extraction attack PDF

Cannot Refute

[40] Context-preserving hierarchical self-shadowing for llm internal consistency maintenance PDF

Cannot Refute

[41] TransLock: Securing LLM deployment for software applications via self-locking watermarks PDF

Cannot Refute

[42] Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling PDF

Cannot Refute

Contribution

Any-Depth Alignment Linear Probe (ADA-LP) method

[45] Spot risks before speaking! unraveling safety attention heads in large vision-language models PDF

Can Refute

[43] Lightweight safety classification using pruned language models PDF

Cannot Refute

[44] Steering language models with activation engineering PDF

Cannot Refute

[46] Real-time detection of hallucinated entities in long-form generation PDF

Cannot Refute

[47] Interpreting learned feedback patterns in large language models PDF

Cannot Refute

[48] Aligned probing: Relating toxic behavior and model internals PDF

Cannot Refute

[49] Probing Toxic Content in Large Pre-Trained Language Models PDF

Cannot Refute

[50] The first to know: How token distributions reveal hidden knowledge in large vision-language models? PDF

Cannot Refute

[51] Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks PDF

Cannot Refute

[52] Toward universal steering and monitoring of AI models PDF

Cannot Refute

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[10] Root defense strategies: Ensuring safety of LLM at the decoding level PDF

[19] Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level PDF

Contribution Analysis

Deep prefill attacks for testing depth-robustness of alignment

[25] Safety Alignment Should Be Made More Than Just a Few Tokens Deep PDF

[53] Defending against alignment-breaking attacks via robustly aligned llm PDF

[54] PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking PDF

[55] Sampling-aware adversarial attacks against large language models PDF

[56] Output Length Effect on DeepSeek-R1's Safety in Forced Thinking PDF

[57] Mbias: Mitigating bias in large language models while retaining context PDF

[58] Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction PDF

[59] Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence PDF

[60] Catching Contamination Before Generation: Spectral Kill Switches for Agents PDF

[61] Energy-Oriented Alignment for Large Language Models PDF

Any-Depth Alignment Rethinking (ADA-RK) method

[37] A watermark for large language models PDF

[38] Cleangen: Mitigating backdoor attacks for generation tasks in large language models PDF

[39] Modelshield: Adaptive and robust watermark against model extraction attack PDF

[40] Context-preserving hierarchical self-shadowing for llm internal consistency maintenance PDF

[41] TransLock: Securing LLM deployment for software applications via self-locking watermarks PDF

[42] Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling PDF

Any-Depth Alignment Linear Probe (ADA-LP) method

[45] Spot risks before speaking! unraveling safety attention heads in large vision-language models PDF

[43] Lightweight safety classification using pruned language models PDF

[44] Steering language models with activation engineering PDF

[46] Real-time detection of hallucinated entities in long-form generation PDF

[47] Interpreting learned feedback patterns in large language models PDF

[48] Aligned probing: Relating toxic behavior and model internals PDF

[49] Probing Toxic Content in Large Pre-Trained Language Models PDF

[50] The first to know: How token distributions reveal hidden knowledge in large vision-language models? PDF

[51] Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks PDF

[52] Toward universal steering and monitoring of AI models PDF

Table of Contents