Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

ICLR 2026 Conference SubmissionAnonymous Authors
Large Language ModelsAny-Depth AlignmentDeep-prefill attacksSafety tokenInference-time defense
Abstract:

Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA) an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model’s strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving benign utility with minimal over-refusal and maintaining resilience even after the base model undergoes subsequent instruction tuning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Any-Depth Alignment (ADA), an inference-time method that reintroduces assistant header tokens mid-generation to trigger safety reassessment at arbitrary depths. It resides in the Mid-Generation Safety Intervention leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader Inference-Time Safety Mechanisms branch. This positioning suggests the work addresses a specific gap: most inference-time defenses operate at input filtering or single evaluation points, whereas this leaf focuses on continuous safety monitoring throughout token generation.

The taxonomy reveals that neighboring leaves address complementary aspects of runtime safety. External Guard Models employ separate monitoring systems for policy violations, while Step-by-Step Detoxification constrains toxic content incrementally during decoding. The broader Inference-Time Safety Mechanisms branch sits alongside Training-Based Safety Alignment (which modifies parameters) and Agent Safety frameworks (which handle multi-step task execution). The paper's focus on mid-stream intervention without parameter changes distinguishes it from reasoning-enhanced training methods and preference optimization approaches, though it shares conceptual overlap with agent safety work addressing extended generation contexts.

Among 26 candidates examined, the deep prefill attack contribution shows one refutable candidate from 10 examined, while the ADA-RK method appears more novel with zero refutations across 6 candidates. The ADA-LP linear probe method also has one refutable candidate among 10 examined. These statistics reflect a limited semantic search scope rather than exhaustive coverage. The attack methodology appears to have some prior exploration, while the core rethinking mechanism shows less direct overlap within the examined literature. The relatively small candidate pool suggests caution in interpreting these findings as definitive novelty assessments.

Given the sparse three-paper leaf and limited 26-candidate search, the work appears to occupy a relatively underexplored niche within inference-time safety. The taxonomy structure indicates this is a nascent research direction compared to more crowded areas like training-based alignment or agent benchmarking. However, the analysis cannot rule out relevant work outside the top-K semantic matches or in adjacent domains not captured by the taxonomy's scope boundaries.

Taxonomy

Core-task Taxonomy Papers
36
3
Claimed Contributions
26
Contribution Candidate Papers Compared
2
Refutable Paper

Research Landscape Overview

Core task: Extending safety alignment to arbitrary generation depths in large language models. The field has organized itself around several complementary perspectives on ensuring safe model behavior. Inference-Time Safety Mechanisms focus on runtime interventions that can detect and correct unsafe outputs during generation, including mid-generation checks and post-hoc filtering. Training-Based Safety Alignment encompasses methods that bake safety into model weights through supervised fine-tuning, reinforcement learning, and preference optimization. Agent Safety and Behavioral Alignment addresses the unique challenges of autonomous systems that interact with environments and tools, where safety must extend beyond single responses to multi-step decision sequences. Safety Vulnerability Analysis investigates adversarial attacks and failure modes, while Reasoning and Cognitive Processes examine how models develop and express internal deliberation. Evaluation and Benchmarking Infrastructure provides the measurement frameworks needed to assess safety across diverse scenarios, and Auxiliary Technical Foundations supplies supporting techniques like representation learning and interpretability methods. Recent work has increasingly recognized that safety cannot be treated as a single-shot output property but must persist throughout extended reasoning chains and agentic interactions. Any-Depth Alignment[0] sits within the Mid-Generation Safety Intervention cluster, addressing the challenge of maintaining safety guarantees as models produce longer, more complex outputs with intermediate reasoning steps. This emphasis contrasts with approaches like Root Defense[10] and Root Defence[19], which focus on detecting and blocking unsafe requests at the input stage before generation begins. Meanwhile, works such as AgentAlign[1] and STAIR[2] tackle safety in multi-turn agentic settings where models take actions over time, highlighting a related but distinct challenge of behavioral consistency across episodes. The central tension across these branches involves balancing the flexibility needed for capable reasoning against the robustness required to prevent safety failures at any point in the generation process.

Claimed Contributions

Deep prefill attacks for testing depth-robustness of alignment

The authors introduce deep prefill attacks, a new evaluation methodology that tests whether language models can maintain safety alignment at arbitrary generation depths by using harmful assistant-prefills ranging from tens to thousands of tokens. This reveals that current alignment strategies fail to generalize beyond shallow depths.

10 retrieved papers
Can Refute
Any-Depth Alignment Rethinking (ADA-RK) method

The authors propose ADA-RK, a training-free inference-time defense that re-injects assistant header tokens (Safety Tokens) at periodic depths during generation to trigger the model to reassess harmfulness and produce refusals at any point in generation, without requiring parameter updates.

6 retrieved papers
Any-Depth Alignment Linear Probe (ADA-LP) method

The authors develop ADA-LP, which uses a lightweight linear classifier applied to the hidden states of injected Safety Tokens to detect harmfulness. This method achieves near-100% refusal rates against deep prefills and adversarial attacks while maintaining minimal over-refusal on benign tasks, by unlocking the model's innate safety representations without modifying base model parameters.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Deep prefill attacks for testing depth-robustness of alignment

The authors introduce deep prefill attacks, a new evaluation methodology that tests whether language models can maintain safety alignment at arbitrary generation depths by using harmful assistant-prefills ranging from tens to thousands of tokens. This reveals that current alignment strategies fail to generalize beyond shallow depths.

Contribution

Any-Depth Alignment Rethinking (ADA-RK) method

The authors propose ADA-RK, a training-free inference-time defense that re-injects assistant header tokens (Safety Tokens) at periodic depths during generation to trigger the model to reassess harmfulness and produce refusals at any point in generation, without requiring parameter updates.

Contribution

Any-Depth Alignment Linear Probe (ADA-LP) method

The authors develop ADA-LP, which uses a lightweight linear classifier applied to the hidden states of injected Safety Tokens to detect harmfulness. This method achieves near-100% refusal rates against deep prefills and adversarial attacks while maintaining minimal over-refusal on benign tasks, by unlocking the model's innate safety representations without modifying base model parameters.

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth | Novelty Validation