Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: hallucination, representation learning, interpretability, finetuning, steering
Abstract:

Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's lightweight design requires training only a submodule of a single transformer layer, yet it reduces hallucination by ~30%-40% across multiple short-form QA benchmarks. CASAL is ~30x more compute-efficient and ~20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step toward applying interpretability-inspired methods to practical deployment in production systems.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CASAL, a method that bakes activation steering benefits directly into model weights through contrastive learning, enabling models to abstain from answering questions they do not know. This work sits within the 'Contrastive Activation Steering' leaf of the taxonomy, which contains only three papers including CASAL itself. The leaf focuses specifically on methods computing steering vectors from contrastive positive-negative example pairs. This represents a relatively sparse research direction within the broader activation steering landscape, suggesting the specific approach of amortizing steering through weight updates rather than inference-time intervention occupies a less crowded niche.

The taxonomy reveals CASAL's position within a dense ecosystem of activation steering methods. Neighboring leaves include 'Adaptive and Query-Specific Steering' (3 papers), 'Concept and Representation Space Steering' (3 papers), and 'Sparse Representation and Feature-Based Control' (3 papers). The broader 'Activation Steering Methods' branch contains seven distinct approaches, indicating substantial research activity in inference-time interventions. CASAL diverges from these neighbors by moving steering from inference to training time, bridging the gap between the 'Activation Steering Methods' branch and the 'Training-Based Approaches' branch, which focuses on weight updates through fine-tuning and alignment.

Among 29 candidates examined across three contributions, no clearly refutable prior work was identified. Contribution A (CASAL framework) examined 10 candidates with 0 refutable; Contribution B (representation-level training objective) examined 10 candidates with 0 refutable; Contribution C (steering-based training for dense and MoE architectures) examined 9 candidates with 0 refutable. This suggests that within the limited search scope, the specific combination of contrastive activation steering with amortized learning through weight updates appears relatively unexplored. The two sibling papers in the same taxonomy leaf focus on inference-time steering rather than training-time amortization, indicating differentiation even within this narrow research direction.

Based on the limited literature search of 29 candidates, CASAL appears to occupy a distinctive position by combining contrastive steering principles with training-time weight updates. The taxonomy structure shows this bridges two major branches—activation steering and training-based approaches—that typically remain separate. However, the analysis covers top-K semantic matches and does not constitute an exhaustive survey of all possible related work in parameter-efficient fine-tuning, representation learning, or hallucination mitigation more broadly.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Reducing hallucinations in large language models through activation steering.

The field has organized itself around several complementary strategies for mitigating hallucinations in large language models. Activation steering methods form a central branch, encompassing techniques that manipulate internal representations to guide model behavior toward truthfulness, ranging from contrastive activation approaches like Contrastive Activation Addition[6] to more sophisticated methods such as Latent Space Steering[1] and Concept Activation Vectors[5]. Parallel branches address modality-specific challenges in vision-language and video-language models, while internal state analysis focuses on detecting and interpreting truthfulness signals within model activations. Training-based approaches and knowledge-augmented methods offer longer-term solutions, and contrastive decoding techniques manipulate outputs at generation time. Additional branches explore neuron-level interventions, adversarial robustness, and broader safety concerns including bias, deception, and misalignment.

Within activation steering, a particularly active line of work centers on contrastive methods that derive steering vectors by comparing activations from truthful versus hallucinated contexts. CASAL[0] exemplifies this approach by amortizing contrastive activation steering into model weights at training time, positioning itself alongside foundational contrastive techniques like Contrastive Activation Addition[6] and more recent innovations such as Internal Contrastive Decoding[10]. These methods share the insight that internal representations encode truthfulness signals that can be amplified or suppressed. Nearby works explore related themes: Hidden Life Tokens[3] investigates how specific token representations influence model behavior, while Sparse Representation Steering[11] and Spectral Activation Editing[12] offer alternative geometric perspectives on activation manipulation.
A key tension across these approaches involves balancing intervention strength—steering too aggressively risks degrading fluency or task performance, while subtle interventions may fail to suppress hallucinations reliably. CASAL[0] sits within this dense cluster of contrastive steering methods, emphasizing learned activation adjustments that preserve model capabilities while targeting hallucination-prone representations.
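The contrastive steering-vector idea shared by these methods can be sketched in a few lines. The sketch below uses synthetic activations and a made-up "truthfulness" direction (none of the names come from the papers above): average activations from truthful contexts, subtract the average from hallucinated contexts, and add the resulting direction to a hidden state at inference.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Synthetic stand-in data: activations from "truthful" vs. "hallucinated"
# contexts, separated along a hypothetical truthfulness axis.
truth_dir = rng.normal(size=d)
acts_truthful = rng.normal(size=(100, d)) + truth_dir
acts_hallucinated = rng.normal(size=(100, d)) - truth_dir

# Difference-in-means steering vector, normalized to unit length.
v = acts_truthful.mean(axis=0) - acts_hallucinated.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h, alpha=4.0):
    """Inference-time intervention: shift a hidden state along v."""
    return h + alpha * v
```

The tension noted above shows up directly in `alpha`: a larger coefficient suppresses hallucination-like directions more strongly but perturbs the representation further from its original value.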

Claimed Contributions

CASAL: Contrastive Activation Steering for Amortized Learning

The authors propose CASAL, a training method that embeds activation steering benefits directly into model weights by training a lightweight subnetwork to approximate steering solutions. This approach reduces hallucinations by teaching models to abstain from answering unknown questions while maintaining performance on known queries.

10 retrieved papers
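The amortization idea in this contribution can be illustrated with a deliberately tiny sketch: instead of adding a steering vector on every forward pass, fold an equivalent offset into the layer's own parameters once, so inference needs no runtime monitoring. Everything here is illustrative (the paper trains a submodule of a transformer layer; a single bias term stands in for it):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)   # hypothetical steering direction
alpha = 2.0

def layer(h):
    # Stand-in for a transformer sublayer (identity, for illustration only).
    return h

def steered_forward(h):
    # Inference-time steering: intervene on every forward pass.
    return layer(h) + alpha * v

class BakedLayer:
    """Amortized variant: the steering shift lives in the layer's own
    parameters, so no vector addition is needed at inference."""
    def __init__(self, offset):
        self.offset = offset

    def __call__(self, h):
        return layer(h) + self.offset

# In this toy, "training" collapses to copying the offset; CASAL instead
# learns the submodule's weights from contrastive examples.
baked = BakedLayer(alpha * v)
```

For the identity toy layer the two paths agree exactly; the point is only that the intervention has moved from inference-time code into weights.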
Representation-level training objective without cross-entropy loss

The method uses a local representation loss applied to residual stream activations as the sole training objective, rather than using it as an auxiliary signal alongside standard cross-entropy loss. This enables efficient single-layer training by providing learning signals from the model's own hidden representations.

10 retrieved papers
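A minimal sketch of such a representation-level objective, assuming synthetic residual-stream activations and a hypothetical steered target `h + v`: only a small linear map with bias is trained, the loss is mean squared error on activations, and no cross-entropy over the vocabulary appears anywhere. The training loop is plain gradient descent, not the paper's optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 64
H = rng.normal(size=(n, d))      # synthetic residual-stream activations
v = rng.normal(size=d)           # hypothetical steering vector
T = H + v                        # "steered" target activations

# Trainable submodule: a single linear map with bias, initialized at identity.
W = np.eye(d)
b = np.zeros(d)
lr = 0.05
for _ in range(1000):
    P = H @ W.T + b              # submodule output
    E = P - T                    # residual against the steered targets
    W -= lr * (2.0 / n) * E.T @ H        # gradient of the MSE loss w.r.t. W
    b -= lr * (2.0 / n) * E.sum(axis=0)  # gradient of the MSE loss w.r.t. b

loss = float(np.mean((H @ W.T + b - T) ** 2))
```

Because the targets differ from the inputs by a constant shift, the optimum here is `W = I`, `b = v`, which is exactly the "baked-in" steering offset; the loss signal comes entirely from the model's own hidden representations.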
First steering-based training framework for both dense and MoE architectures

The authors demonstrate that CASAL is architecture-agnostic and modality-agnostic, successfully reducing hallucinations in both dense transformers and Mixture-of-Experts models, as well as in text-only and vision-language settings.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

