COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Steerable Generation, Large Language Models, Representation Engineering, Test-time Intervention, Learning Dynamics
Abstract:

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches either capture steering signals from labeled examples suboptimally or require hundreds to thousands of examples and a dedicated optimization procedure for each behavioral target. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline. COLD-Steer enables real-time adaptation to new steering objectives and accommodates diverse perspectives without extensive demonstration data, which we validate through experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces COLD-Steer, a training-free framework that steers LLM activations by approximating gradient descent effects from in-context examples. It resides in the Gradient-Based Activation Steering leaf, which contains only two papers total (including this one). This is a notably sparse research direction within the broader Activation and Representation Manipulation branch, suggesting the paper targets a relatively underexplored niche. The sibling paper (Inference-time Intervention) represents the primary direct comparator in this specific methodological space.

The taxonomy reveals that Gradient-Based Activation Steering sits alongside two other activation manipulation approaches: Direct Activation Intervention (concept vectors without gradients) and Representation Engineering (frameworks for analyzing concept representations). The broader Activation and Representation Manipulation branch is one of six major control paradigms, with neighboring branches covering Decoding Control, Training-Based methods, and Agent Control. COLD-Steer's gradient-based approach distinguishes it from simpler vector addition methods while remaining distinct from training-based alignment techniques, positioning it at the intersection of computational efficiency and adaptive steering.

Among 29 candidates examined, the framework's core contribution (in-context one-step learning dynamics) shows one refutable candidate out of 10 examined, while the two approximation methods and theoretical unification show no clear refutations across their respective candidate sets. The limited search scope (top-K semantic search plus citation expansion) means these statistics reflect a focused but not exhaustive literature review. The approximation methods and theoretical contributions appear more novel within this constrained examination, though the core framework concept encounters at least one overlapping prior work among the candidates reviewed.

Based on the limited search scope of 29 candidates, the work appears to occupy a sparsely populated methodological niche with modest prior overlap. The taxonomy structure confirms that gradient-based activation steering remains less crowded than decoding-based or training-based control approaches. However, the analysis cannot rule out additional relevant work beyond the top-K semantic matches examined, particularly in adjacent areas like representation engineering or direct intervention methods that might employ related approximation techniques.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
29 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: inference-time control of large language model behavior. The field encompasses diverse strategies for steering LLM outputs without retraining, organized into six main branches. Activation and Representation Manipulation methods directly modify internal model states, using techniques such as gradient-based steering (COLD-Steer[0], Inference-time Intervention[5]) or concept activation vectors (Concept Activation Vectors[4]) to guide behavior at the representation level. Decoding and Generation Control focuses on constraining or shaping outputs during token sampling, including format enforcement (Verifiable Format Control[8]) and attribute-based generation (Ctrl Transformer[6]). Training-Based Control and Alignment covers methods that prepare models for inference-time steering through specialized training objectives (Safe Alignment[9], InfAlign[20]). Agent Control and Task Execution addresses higher-level orchestration of LLM-driven agents in interactive environments (LLM-Agent-Controller[24], Executable Code Actions[33]), while Evaluation and Benchmarking provides frameworks for assessing controllability (Controllable Generation Benchmark[21]). Specialized Control Applications targets domain-specific challenges such as code security (Secure Vulnerable Code[14]) or cross-lingual adaptation (Cross-lingual Intervention[27]).

A particularly active line of work centers on activation-level interventions that manipulate latent representations to achieve fine-grained control over model behavior. COLD-Steer[0] exemplifies gradient-based activation steering, computing targeted adjustments to internal states to guide outputs toward desired attributes. This approach contrasts with Inference-time Intervention[5], which applies simpler linear interventions based on pre-identified steering vectors, trading off computational cost against flexibility. Both methods share the goal of modifying behavior without altering model weights, yet differ in how they identify and apply steering signals.

Meanwhile, works like Latent Actions Control[1] and Adaptable Logical Control[3] explore complementary strategies that encode control objectives into latent action spaces or logical constraints, highlighting ongoing questions about the optimal level of abstraction for steering. COLD-Steer[0] sits squarely within the gradient-based activation manipulation cluster, distinguished by its use of optimization-driven steering that adapts dynamically to specific prompts, offering a middle ground between the simplicity of fixed intervention vectors and the complexity of full retraining approaches.

Claimed Contributions

COLD-Steer framework for steering LLMs via in-context one-step learning dynamics

The authors propose COLD-Steer, a novel optimization-free activation steering framework that approximates how gradient updates from contextual examples would affect intermediate representations, enabling targeted causal intervention during inference without requiring parameter updates or extensive training.

10 retrieved papers
Can Refute
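The mechanics this contribution describes can be sketched in miniature. The snippet below is an illustrative toy, not the paper's method: it assumes a hypothetical scalar linear readout `W` of an activation and takes one explicit gradient-descent step on the activation itself, leaving all model weights untouched.

```python
import numpy as np

def one_step_steer(h, W, target, lr=0.1):
    """One explicit gradient-descent step on an activation h.

    Toy loss defined on the activation itself: L(h) = (W . h - target)^2.
    The step nudges h toward producing `target` under the linear
    readout W, without touching any model parameters.
    """
    grad = 2.0 * (W @ h - target) * W  # dL/dh for the toy loss
    return h - lr * grad
```

A real steering method would replace the toy readout loss with a loss computed over the in-context examples and pull its gradient back onto the intermediate activation; the shape of the update, an activation minus a scaled gradient, stays the same.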
Two complementary approximation methods: unit kernel and finite-difference

The authors develop two distinct methods for efficiently approximating learning dynamics: COLD-Kernel-Steer, which uses kernel-weighted combinations of gradient effects, and COLD-FD-Steer, which approximates gradients via finite differences, both avoiding expensive backpropagation during inference.

10 retrieved papers
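The two-forward-pass idea behind the finite-difference variant can also be illustrated with a toy sketch (an assumption-laden stand-in, not COLD-FD-Steer itself): treat the loss as a black box, estimate its directional derivative along a chosen steering direction with a central difference, and take one descent step. The two evaluations stay at two no matter how many examples the loss aggregates internally.

```python
import numpy as np

def fd_steer(h, loss_fn, direction, eps=1e-4, lr=0.1):
    """Steer activation h along `direction` using a finite-difference
    estimate of the loss slope: two loss evaluations total, regardless
    of how many in-context examples loss_fn averages over.
    """
    d = direction / np.linalg.norm(direction)
    slope = (loss_fn(h + eps * d) - loss_fn(h - eps * d)) / (2.0 * eps)
    return h - lr * slope * d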
Theoretical unification of existing contrastive methods

The authors establish that their framework provides a theoretical foundation showing how existing contrastive activation steering methods like CAA can be understood as implicit approximations of gradient descent on specific loss functions.

9 retrieved papers
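The claimed unification is easy to verify in a toy setting. Under an illustrative contrastive loss L(h) = 0.5 * mean_i(||h - p_i||^2 - ||h - n_i||^2) (an assumed form, chosen here for concreteness), the gradient with respect to h is mean(neg) - mean(pos), so one gradient-descent step adds a scaled difference-of-means vector, which is exactly the CAA steering update:

```python
import numpy as np

def caa_vector(pos, neg):
    # Contrastive Activation Addition: difference of mean activations.
    return pos.mean(axis=0) - neg.mean(axis=0)

def contrastive_loss_grad(h, pos, neg):
    # Gradient of L(h) = 0.5 * mean_i(||h - p_i||^2 - ||h - n_i||^2):
    # mean_i((h - p_i) - (h - n_i)) = mean(neg) - mean(pos).
    return (h - pos).mean(axis=0) - (h - neg).mean(axis=0)
```

Because this gradient is constant in h, the update h - lr * grad equals h + lr * caa_vector(pos, neg): steering-vector addition with the learning rate playing the role of steering strength.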

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
