AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

ICLR 2026 Conference Submission · Anonymous Authors
Large Language Models · Safety · Activation Steering
Abstract:

As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to the internal activations of LLMs during inference, thereby inducing refusal behaviors. However, applying activation steering indiscriminately suffers from a fundamental trade-off between safety and utility, since the same steering vector can also cause over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it treats activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, under a null-space constraint. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising their general capabilities. Our code is available at \url{https://anonymous.4open.science/r/AlphaSteer-929C/}.
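The steering recipe the abstract describes can be sketched in a few lines: extract a refusal direction (here via a common difference-in-means heuristic) and add it to hidden activations at inference time. This is an illustrative toy setup, not the paper's implementation; all dimensions, data, and the `strength` parameter are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
# Toy hidden activations for harmful and benign prompts (assumed shapes).
h_harm = rng.normal(loc=1.0, size=(32, d_model))
h_ben = rng.normal(loc=-1.0, size=(32, d_model))

# Difference-in-means refusal direction, normalized to unit length.
r = h_harm.mean(axis=0) - h_ben.mean(axis=0)
r /= np.linalg.norm(r)

def steer(h, strength=8.0):
    """Indiscriminate steering: the same scaled vector is added to every
    activation, benign or not -- the source of the safety-utility trade-off."""
    return h + strength * r
```

Because `steer` shifts every prompt by the same vector, benign activations are perturbed just as much as malicious ones, which is exactly the over-refusal failure mode AlphaSteer targets.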

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AlphaSteer, a theoretically grounded activation steering method that applies learnable transformations to refusal direction vectors during inference to enhance LLM safety. It resides in the Activation-Based Steering Techniques leaf, which contains nine papers including the original work. This leaf sits within the broader Refusal Steering and Control Methods branch, indicating a moderately populated research direction focused on inference-time interventions. The taxonomy shows this is an active area with multiple concurrent approaches exploring how to manipulate internal activations to induce refusal behaviors without modifying model weights.

The taxonomy reveals that Activation-Based Steering Techniques is one of three sibling categories under Refusal Steering and Control Methods, alongside Training-Based Refusal Enhancement (seven papers) and Controllable and Adaptive Safety Alignment (three papers). Neighboring branches include Refusal Mechanism Analysis and Representation, which studies internal refusal structures, and Over-Refusal Mitigation and Utility Preservation, which addresses the safety-utility trade-off from a diagnostic perspective. The paper's focus on learnable steering with utility preservation connects it to over-refusal concerns while remaining distinct from training-based approaches that modify weights or adaptive frameworks that adjust safety thresholds dynamically.

Among 22 candidates examined across three contributions, no clearly refuting prior work was identified. The AlphaSteer method itself examined six candidates with zero refutable matches, the learnable transformation mechanism examined six candidates with zero refutations, and the null-space projection technique examined ten candidates with zero refutations. This suggests that within the limited search scope of top-K semantic matches and citation expansion, the specific combination of theoretical grounding, learnable transformations, and null-space constraints for utility preservation appears relatively novel. However, the analysis explicitly notes this is not an exhaustive literature search, and the moderate density of the parent leaf indicates active parallel work in activation steering.

Based on the limited search scope of 22 candidates, the work appears to occupy a distinct position within a moderately crowded research direction. The absence of refuting candidates across all three contributions suggests novelty in the specific technical approach, though the taxonomy structure shows the broader activation steering paradigm is well-established with eight sibling papers. The analysis does not cover exhaustive prior work in adjacent areas like training-based methods or adaptive alignment, which may contain relevant comparisons not captured by semantic search.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 22
- Refutable Papers: 0

Research Landscape Overview

Core task: learning refusal steering for large language model safety enhancement. The field has organized itself around several complementary branches that together address how models decide to refuse harmful requests and how those decisions can be improved. One major branch focuses on understanding refusal mechanisms through representation analysis, examining how models internally encode safety-related features and decision boundaries. Another branch develops steering and control methods that directly manipulate model activations or apply targeted interventions to guide refusal behavior, with works like Refusal Direction[5] and AlphaSteer[0] exploring activation-based techniques. A third branch tackles over-refusal mitigation, seeking to preserve utility when safety measures become too conservative, while parallel efforts concentrate on evaluation benchmarks such as Sorry Bench[2] and domain-specific safety challenges. Additional branches explore alternative response strategies beyond simple refusal, adversarial robustness against jailbreaks, and reinforcement learning approaches that optimize safety through preference data.

Within the activation-based steering cluster, a particularly active line of work investigates how to extract and apply refusal-related directions in model representation space. AlphaSteer[0] sits squarely in this area, emphasizing methods that steer model behavior by intervening on internal activations during inference. Nearby approaches like SafeSwitch[26] and Steering Without Side Effects[30] share this focus on activation manipulation but differ in their treatment of trade-offs: some prioritize minimizing unintended impacts on model capabilities, while others explore how to make steering more robust or interpretable. A contrasting thread examines whether refusal can be localized to specific layers or features, with Feature Guided SAE[46] and SARSteer[48] probing the granularity of safety representations.
The central tension across these methods involves balancing effective refusal of genuinely harmful requests against maintaining helpfulness on benign queries, a challenge that motivates ongoing exploration of how steering vectors generalize and whether they can be applied selectively without degrading overall performance.

Claimed Contributions

AlphaSteer: a theoretically grounded activation steering method with null-space constraints

The authors propose AlphaSteer, a novel activation steering approach that uses null-space constraints to preserve utility on benign prompts while learning to construct refusal direction vectors for malicious prompts. This method addresses the safety-utility trade-off through principled learning objectives rather than heuristic designs.
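One way the two objectives could fit together, shown here as a minimal sketch rather than the paper's actual formulation: constrain the learned map to act through a projector onto the null space of benign activations (so benign prompts get a near-zero steering vector by construction), then fit the remaining freedom by least squares so malicious activations map to the refusal direction. All data, dimensions, and the rank threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12
# Assumption: benign activations lie in a low-dimensional subspace,
# so the null space is non-trivial.
H_ben = rng.normal(size=(40, 5)) @ rng.normal(size=(5, d))
H_mal = rng.normal(size=(40, d)) + 2.0   # toy malicious activations
r = rng.normal(size=d)
r /= np.linalg.norm(r)                   # unit refusal direction

# Utility preservation: projector onto the null space of the benign
# activations, so P @ h is ~0 for any benign activation h.
_, S, Vt = np.linalg.svd(H_ben)
rank = int(np.sum(S > 1e-8 * S[0]))
P = np.eye(d) - Vt[:rank].T @ Vt[:rank]

# Safety enhancement: linear regression so the learned map, acting
# through P, sends malicious activations toward r.
X = H_mal @ P
R = np.tile(r, (len(H_mal), 1))
Delta_T, *_ = np.linalg.lstsq(X, R, rcond=None)

def steering_vector(h):
    """Near-zero for benign h by construction; points toward the
    refusal direction for malicious h via the fitted map."""
    return Delta_T.T @ (P @ h)
```

The null-space constraint is what makes the utility guarantee structural rather than learned: no matter how the regression turns out, benign activations are annihilated before the map is applied.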

6 retrieved papers
Learnable activation steering mechanism with transformation matrix

The authors introduce a learnable transformation matrix that dynamically constructs steering vectors based on prompt activations, enabling data-driven and fine-grained control over the steering process instead of relying on fixed vectors or manual thresholds.
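The difference from fixed-vector steering can be illustrated with a toy gradient-descent fit: instead of adding one constant vector, a matrix `Delta` is trained so the steering vector is a function of the prompt activation. The data, targets, and optimization hyperparameters below are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 128
H_ben = rng.normal(size=(n, d))          # toy benign activations
H_mal = rng.normal(size=(n, d)) + 2.0    # toy malicious (mean-shifted)
r = np.ones(d) / np.sqrt(d)              # toy unit refusal direction

H = np.vstack([H_ben, H_mal])
# Per-prompt targets: zero vector for benign, refusal direction for malicious.
T = np.vstack([np.zeros((n, d)), np.tile(r, (n, 1))])

# Learn Delta by gradient descent on ||H @ Delta.T - T||^2,
# rather than using a single fixed steering vector.
Delta = np.zeros((d, d))
lr = 0.01
for _ in range(1000):
    grad = 2 * (H @ Delta.T - T).T @ H / len(H)
    Delta -= lr * grad

def steering_vector(h):
    """Steering vector constructed dynamically from the prompt activation."""
    return Delta @ h
```

Because the output depends on `h`, the learned map can respond strongly to malicious activations while staying small on benign ones, without any manual threshold deciding when to steer.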

6 retrieved papers
Null-space projection for utility preservation

The authors develop a null-space projection method that constrains the steering transformation to produce near-zero vectors for benign prompts, ensuring their activations remain unchanged and thus preserving model utility on non-harmful tasks.
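The projector itself can be sketched from the SVD of a benign activation matrix; the toy subspace dimensions and rank threshold below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
# Assumption: benign activations span a 4-dimensional subspace of R^d.
H_ben = rng.normal(size=(50, 4)) @ rng.normal(size=(4, d))

# Projector onto the orthogonal complement of the benign row space,
# built from the right singular vectors of H_ben.
_, S, Vt = np.linalg.svd(H_ben)
rank = int(np.sum(S > 1e-8 * S[0]))
P = np.eye(d) - Vt[:rank].T @ Vt[:rank]
```

`P` is idempotent and annihilates every benign activation, so any transformation composed with it produces a near-zero steering vector on benign prompts regardless of what is learned on top.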

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AlphaSteer: a theoretically grounded activation steering method with null-space constraints

The authors propose AlphaSteer, a novel activation steering approach that uses null-space constraints to preserve utility on benign prompts while learning to construct refusal direction vectors for malicious prompts. This method addresses the safety-utility trade-off through principled learning objectives rather than heuristic designs.

Contribution

Learnable activation steering mechanism with transformation matrix

The authors introduce a learnable transformation matrix that dynamically constructs steering vectors based on prompt activations, enabling data-driven and fine-grained control over the steering process instead of relying on fixed vectors or manual thresholds.

Contribution

Null-space projection for utility preservation

The authors develop a null-space projection method that constrains the steering transformation to produce near-zero vectors for benign prompts, ensuring their activations remain unchanged and thus preserving model utility on non-harmful tasks.
