AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

ICLR 2026 Conference Submission · Anonymous Authors
Large Language Models · Safety · Activation Steering
Abstract:

As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to the internal activations of LLMs during inference, thereby inducing refusal behaviors. However, applying activation steering indiscriminately suffers from a fundamental trade-off between safety and utility, since the same steering vector can also cause over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it treats activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, under a null-space constraint. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising their general capabilities. Our code is available at \url{https://anonymous.4open.science/r/AlphaSteer-929C/}.
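The steering recipe the abstract describes can be sketched in a few lines: extract a refusal direction (here via a common difference-in-means heuristic) and add it to hidden activations at inference time. This is an illustrative toy setup, not the paper's implementation; all dimensions, data, and the `strength` parameter are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
# Toy hidden activations for harmful and benign prompts (assumed shapes).
h_harm = rng.normal(loc=1.0, size=(32, d_model))
h_ben = rng.normal(loc=-1.0, size=(32, d_model))

# Difference-in-means refusal direction, normalized to unit length.
r = h_harm.mean(axis=0) - h_ben.mean(axis=0)
r /= np.linalg.norm(r)

def steer(h, strength=8.0):
    """Indiscriminate steering: the same scaled vector is added to every
    activation, benign or not -- the source of the safety-utility trade-off."""
    return h + strength * r
```

Because `steer` shifts every prompt by the same vector, benign activations are perturbed just as much as malicious ones, which is exactly the over-refusal failure mode AlphaSteer targets.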

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AlphaSteer, a theoretically grounded activation steering method that applies learnable transformations to refusal direction vectors during inference to enhance LLM safety. It resides in the Activation-Based Steering Techniques leaf, which contains nine papers including the original work. This leaf sits within the broader Refusal Steering and Control Methods branch, indicating a moderately populated research direction focused on inference-time interventions. The taxonomy shows this is an active area with multiple concurrent approaches exploring how to manipulate internal activations to induce refusal behaviors without modifying model weights.

The taxonomy reveals that Activation-Based Steering Techniques is one of three sibling categories under Refusal Steering and Control Methods, alongside Training-Based Refusal Enhancement (seven papers) and Controllable and Adaptive Safety Alignment (three papers). Neighboring branches include Refusal Mechanism Analysis and Representation, which studies internal refusal structures, and Over-Refusal Mitigation and Utility Preservation, which addresses the safety-utility trade-off from a diagnostic perspective. The paper's focus on learnable steering with utility preservation connects it to over-refusal concerns while remaining distinct from training-based approaches that modify weights or adaptive frameworks that adjust safety thresholds dynamically.

Among 22 candidates examined across three contributions, no clearly refuting prior work was identified. The AlphaSteer method itself examined six candidates with zero refutable matches, the learnable transformation mechanism examined six candidates with zero refutations, and the null-space projection technique examined ten candidates with zero refutations. This suggests that within the limited search scope of top-K semantic matches and citation expansion, the specific combination of theoretical grounding, learnable transformations, and null-space constraints for utility preservation appears relatively novel. However, the analysis explicitly notes this is not an exhaustive literature search, and the moderate density of the parent leaf indicates active parallel work in activation steering.

Based on the limited search scope of 22 candidates, the work appears to occupy a distinct position within a moderately crowded research direction. The absence of refuting candidates across all three contributions suggests novelty in the specific technical approach, though the taxonomy structure shows the broader activation steering paradigm is well-established with eight sibling papers. The analysis does not cover exhaustive prior work in adjacent areas like training-based methods or adaptive alignment, which may contain relevant comparisons not captured by semantic search.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 22
- Refutable Papers: 0

Research Landscape Overview

Core task: learning refusal steering for large language model safety enhancement. The field has organized itself around several complementary branches that together address how models decide to refuse harmful requests and how those decisions can be improved. One major branch focuses on understanding refusal mechanisms through representation analysis, examining how models internally encode safety-related features and decision boundaries. Another branch develops steering and control methods that directly manipulate model activations or apply targeted interventions to guide refusal behavior, with works like Refusal Direction[5] and AlphaSteer[0] exploring activation-based techniques. A third branch tackles over-refusal mitigation, seeking to preserve utility when safety measures become too conservative, while parallel efforts concentrate on evaluation benchmarks such as Sorry Bench[2] and domain-specific safety challenges. Additional branches explore alternative response strategies beyond simple refusal, adversarial robustness against jailbreaks, and reinforcement learning approaches that optimize safety through preference data.

Within the activation-based steering cluster, a particularly active line of work investigates how to extract and apply refusal-related directions in model representation space. AlphaSteer[0] sits squarely in this area, emphasizing methods that steer model behavior by intervening on internal activations during inference. Nearby approaches like SafeSwitch[26] and Steering Without Side Effects[30] share this focus on activation manipulation but differ in their treatment of trade-offs: some prioritize minimizing unintended impacts on model capabilities, while others explore how to make steering more robust or interpretable. A contrasting thread examines whether refusal can be localized to specific layers or features, with Feature Guided SAE[46] and SARSteer[48] probing the granularity of safety representations.
The central tension across these methods involves balancing effective refusal of genuinely harmful requests against maintaining helpfulness on benign queries, a challenge that motivates ongoing exploration of how steering vectors generalize and whether they can be applied selectively without degrading overall performance.

Claimed Contributions

AlphaSteer: a theoretically grounded activation steering method with null-space constraints

The authors propose AlphaSteer, a novel activation steering approach that uses null-space constraints to preserve utility on benign prompts while learning to construct refusal direction vectors for malicious prompts. This method addresses the safety-utility trade-off through principled learning objectives rather than heuristic designs.
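One way the two objectives could fit together, shown here as a minimal sketch rather than the paper's actual formulation: constrain the learned map to act through a projector onto the null space of benign activations (so benign prompts get a near-zero steering vector by construction), then fit the remaining freedom by least squares so malicious activations map to the refusal direction. All data, dimensions, and the rank threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12
# Assumption: benign activations lie in a low-dimensional subspace,
# so the null space is non-trivial.
H_ben = rng.normal(size=(40, 5)) @ rng.normal(size=(5, d))
H_mal = rng.normal(size=(40, d)) + 2.0   # toy malicious activations
r = rng.normal(size=d)
r /= np.linalg.norm(r)                   # unit refusal direction

# Utility preservation: projector onto the null space of the benign
# activations, so P @ h is ~0 for any benign activation h.
_, S, Vt = np.linalg.svd(H_ben)
rank = int(np.sum(S > 1e-8 * S[0]))
P = np.eye(d) - Vt[:rank].T @ Vt[:rank]

# Safety enhancement: linear regression so the learned map, acting
# through P, sends malicious activations toward r.
X = H_mal @ P
R = np.tile(r, (len(H_mal), 1))
Delta_T, *_ = np.linalg.lstsq(X, R, rcond=None)

def steering_vector(h):
    """Near-zero for benign h by construction; points toward the
    refusal direction for malicious h via the fitted map."""
    return Delta_T.T @ (P @ h)
```

The null-space constraint is what makes the utility guarantee structural rather than learned: no matter how the regression turns out, benign activations are annihilated before the map is applied.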

6 retrieved papers
Learnable activation steering mechanism with transformation matrix

The authors introduce a learnable transformation matrix that dynamically constructs steering vectors based on prompt activations, enabling data-driven and fine-grained control over the steering process instead of relying on fixed vectors or manual thresholds.
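The difference from fixed-vector steering can be illustrated with a toy gradient-descent fit: instead of adding one constant vector, a matrix `Delta` is trained so the steering vector is a function of the prompt activation. The data, targets, and optimization hyperparameters below are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 128
H_ben = rng.normal(size=(n, d))          # toy benign activations
H_mal = rng.normal(size=(n, d)) + 2.0    # toy malicious (mean-shifted)
r = np.ones(d) / np.sqrt(d)              # toy unit refusal direction

H = np.vstack([H_ben, H_mal])
# Per-prompt targets: zero vector for benign, refusal direction for malicious.
T = np.vstack([np.zeros((n, d)), np.tile(r, (n, 1))])

# Learn Delta by gradient descent on ||H @ Delta.T - T||^2,
# rather than using a single fixed steering vector.
Delta = np.zeros((d, d))
lr = 0.01
for _ in range(1000):
    grad = 2 * (H @ Delta.T - T).T @ H / len(H)
    Delta -= lr * grad

def steering_vector(h):
    """Steering vector constructed dynamically from the prompt activation."""
    return Delta @ h
```

Because the output depends on `h`, the learned map can respond strongly to malicious activations while staying small on benign ones, without any manual threshold deciding when to steer.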

6 retrieved papers
Null-space projection for utility preservation

The authors develop a null-space projection method that constrains the steering transformation to produce near-zero vectors for benign prompts, ensuring their activations remain unchanged and thus preserving model utility on non-harmful tasks.
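The projector itself can be sketched from the SVD of a benign activation matrix; the toy subspace dimensions and rank threshold below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
# Assumption: benign activations span a 4-dimensional subspace of R^d.
H_ben = rng.normal(size=(50, 4)) @ rng.normal(size=(4, d))

# Projector onto the orthogonal complement of the benign row space,
# built from the right singular vectors of H_ben.
_, S, Vt = np.linalg.svd(H_ben)
rank = int(np.sum(S > 1e-8 * S[0]))
P = np.eye(d) - Vt[:rank].T @ Vt[:rank]
```

`P` is idempotent and annihilates every benign activation, so any transformation composed with it produces a near-zero steering vector on benign prompts regardless of what is learned on top.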

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AlphaSteer: a theoretically grounded activation steering method with null-space constraints

The authors propose AlphaSteer, a novel activation steering approach that uses null-space constraints to preserve utility on benign prompts while learning to construct refusal direction vectors for malicious prompts. This method addresses the safety-utility trade-off through principled learning objectives rather than heuristic designs.

Contribution

Learnable activation steering mechanism with transformation matrix

The authors introduce a learnable transformation matrix that dynamically constructs steering vectors based on prompt activations, enabling data-driven and fine-grained control over the steering process instead of relying on fixed vectors or manual thresholds.

Contribution

Null-space projection for utility preservation

The authors develop a null-space projection method that constrains the steering transformation to produce near-zero vectors for benign prompts, ensuring their activations remain unchanged and thus preserving model utility on non-harmful tasks.
