Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: hallucination, representation learning, interpretability, finetuning, steering
Abstract:

Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's lightweight design requires training only a submodule of a single transformer layer, yet it reduces hallucination by ~30%-40% across multiple short-form QA benchmarks. CASAL is ~30x more compute-efficient and ~20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step toward applying interpretability-inspired methods to practical deployment in production systems.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CASAL, a method that bakes activation steering benefits directly into model weights through contrastive learning, enabling models to abstain from answering questions they do not know. This work sits within the 'Contrastive Activation Steering' leaf of the taxonomy, which contains only three papers including CASAL itself. The leaf focuses specifically on methods computing steering vectors from contrastive positive-negative example pairs. This represents a relatively sparse research direction within the broader activation steering landscape, suggesting the specific approach of amortizing steering through weight updates rather than inference-time intervention occupies a less crowded niche.

The taxonomy reveals CASAL's position within a dense ecosystem of activation steering methods. Neighboring leaves include 'Adaptive and Query-Specific Steering' (3 papers), 'Concept and Representation Space Steering' (3 papers), and 'Sparse Representation and Feature-Based Control' (3 papers). The broader 'Activation Steering Methods' branch contains seven distinct approaches, indicating substantial research activity in inference-time interventions. CASAL diverges from these neighbors by moving steering from inference to training time, bridging the gap between the 'Activation Steering Methods' branch and the 'Training-Based Approaches' branch, which focuses on weight updates through fine-tuning and alignment.

Among 29 candidates examined across three contributions, no clearly refutable prior work was identified. Contribution A (CASAL framework) examined 10 candidates with 0 refutable; Contribution B (representation-level training objective) examined 10 candidates with 0 refutable; Contribution C (steering-based training for dense and MoE architectures) examined 9 candidates with 0 refutable. This suggests that within the limited search scope, the specific combination of contrastive activation steering with amortized learning through weight updates appears relatively unexplored. The two sibling papers in the same taxonomy leaf focus on inference-time steering rather than training-time amortization, indicating differentiation even within this narrow research direction.

Based on the limited literature search of 29 candidates, CASAL appears to occupy a distinctive position by combining contrastive steering principles with training-time weight updates. The taxonomy structure shows this bridges two major branches—activation steering and training-based approaches—that typically remain separate. However, the analysis covers top-K semantic matches and does not constitute an exhaustive survey of all possible related work in parameter-efficient fine-tuning, representation learning, or hallucination mitigation more broadly.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Reducing hallucinations in large language models through activation steering.

The field has organized itself around several complementary strategies for mitigating hallucinations in large language models. Activation steering methods form a central branch, encompassing techniques that manipulate internal representations to guide model behavior toward truthfulness, ranging from contrastive activation approaches like Contrastive Activation Addition[6] to more sophisticated methods such as Latent Space Steering[1] and Concept Activation Vectors[5]. Parallel branches address modality-specific challenges in vision-language and video-language models, while internal state analysis focuses on detecting and interpreting truthfulness signals within model activations. Training-based approaches and knowledge-augmented methods offer longer-term solutions, and contrastive decoding techniques manipulate outputs at generation time. Additional branches explore neuron-level interventions, adversarial robustness, and broader safety concerns including bias, deception, and misalignment.

Within activation steering, a particularly active line of work centers on contrastive methods that derive steering vectors by comparing activations from truthful versus hallucinated contexts. CASAL[0] exemplifies this approach by amortizing contrastive activation steering into model weights at training time, positioning itself alongside foundational contrastive techniques like Contrastive Activation Addition[6] and more recent innovations such as Internal Contrastive Decoding[10]. These methods share the insight that internal representations encode truthfulness signals that can be amplified or suppressed. Nearby works explore related themes: Hidden Life Tokens[3] investigates how specific token representations influence model behavior, while Sparse Representation Steering[11] and Spectral Activation Editing[12] offer alternative geometric perspectives on activation manipulation.
A key tension across these approaches involves balancing intervention strength—steering too aggressively risks degrading fluency or task performance, while subtle interventions may fail to suppress hallucinations reliably. CASAL[0] sits within this dense cluster of contrastive steering methods, emphasizing learned activation adjustments that preserve model capabilities while targeting hallucination-prone representations.
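The contrastive steering-vector idea shared by these methods can be sketched in a few lines. The sketch below uses synthetic activations and a made-up "truthfulness" direction (none of the names come from the papers above): average activations from truthful contexts, subtract the average from hallucinated contexts, and add the resulting direction to a hidden state at inference.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Synthetic stand-in data: activations from "truthful" vs. "hallucinated"
# contexts, separated along a hypothetical truthfulness axis.
truth_dir = rng.normal(size=d)
acts_truthful = rng.normal(size=(100, d)) + truth_dir
acts_hallucinated = rng.normal(size=(100, d)) - truth_dir

# Difference-in-means steering vector, normalized to unit length.
v = acts_truthful.mean(axis=0) - acts_hallucinated.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h, alpha=4.0):
    """Inference-time intervention: shift a hidden state along v."""
    return h + alpha * v
```

The tension noted above shows up directly in `alpha`: a larger coefficient suppresses hallucination-like directions more strongly but perturbs the representation further from its original value.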

Claimed Contributions

CASAL: Contrastive Activation Steering for Amortized Learning

The authors propose CASAL, a training method that embeds activation steering benefits directly into model weights by training a lightweight subnetwork to approximate steering solutions. This approach reduces hallucinations by teaching models to abstain from answering unknown questions while maintaining performance on known queries.

10 retrieved papers
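The amortization idea in this contribution can be illustrated with a deliberately tiny sketch: instead of adding a steering vector on every forward pass, fold an equivalent offset into the layer's own parameters once, so inference needs no runtime monitoring. Everything here is illustrative (the paper trains a submodule of a transformer layer; a single bias term stands in for it):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)   # hypothetical steering direction
alpha = 2.0

def layer(h):
    # Stand-in for a transformer sublayer (identity, for illustration only).
    return h

def steered_forward(h):
    # Inference-time steering: intervene on every forward pass.
    return layer(h) + alpha * v

class BakedLayer:
    """Amortized variant: the steering shift lives in the layer's own
    parameters, so no vector addition is needed at inference."""
    def __init__(self, offset):
        self.offset = offset

    def __call__(self, h):
        return layer(h) + self.offset

# In this toy, "training" collapses to copying the offset; CASAL instead
# learns the submodule's weights from contrastive examples.
baked = BakedLayer(alpha * v)
```

For the identity toy layer the two paths agree exactly; the point is only that the intervention has moved from inference-time code into weights.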
Representation-level training objective without cross-entropy loss

The method uses a local representation loss applied to residual stream activations as the sole training objective, rather than using it as an auxiliary signal alongside standard cross-entropy loss. This enables efficient single-layer training by providing learning signals from the model's own hidden representations.

10 retrieved papers
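A minimal sketch of such a representation-level objective, assuming synthetic residual-stream activations and a hypothetical steered target `h + v`: only a small linear map with bias is trained, the loss is mean squared error on activations, and no cross-entropy over the vocabulary appears anywhere. The training loop is plain gradient descent, not the paper's optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 64
H = rng.normal(size=(n, d))      # synthetic residual-stream activations
v = rng.normal(size=d)           # hypothetical steering vector
T = H + v                        # "steered" target activations

# Trainable submodule: a single linear map with bias, initialized at identity.
W = np.eye(d)
b = np.zeros(d)
lr = 0.05
for _ in range(1000):
    P = H @ W.T + b              # submodule output
    E = P - T                    # residual against the steered targets
    W -= lr * (2.0 / n) * E.T @ H        # gradient of the MSE loss w.r.t. W
    b -= lr * (2.0 / n) * E.sum(axis=0)  # gradient of the MSE loss w.r.t. b

loss = float(np.mean((H @ W.T + b - T) ** 2))
```

Because the targets differ from the inputs by a constant shift, the optimum here is `W = I`, `b = v`, which is exactly the "baked-in" steering offset; the loss signal comes entirely from the model's own hidden representations.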
First steering-based training framework for both dense and MoE architectures

The authors demonstrate that CASAL is architecture-agnostic and modality-agnostic, successfully reducing hallucinations in both dense transformers and Mixture-of-Experts models, as well as in text-only and vision-language settings.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

