COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Steerable Generation, Large Language Models, Representation Engineering, Test-time Intervention, Learning Dynamics
Abstract:

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches either capture steering signals from labeled examples suboptimally or require hundreds to thousands of examples and a dedicated optimization procedure for each behavioral target. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline. COLD-Steer enables real-time adaptation to new steering objectives and accommodates diverse perspectives without extensive demonstration data, which we validate through experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces COLD-Steer, a training-free framework that steers LLM activations by approximating gradient descent effects from in-context examples. It resides in the Gradient-Based Activation Steering leaf, which contains only two papers total (including this one). This is a notably sparse research direction within the broader Activation and Representation Manipulation branch, suggesting the paper targets a relatively underexplored niche. The sibling paper (Inference-time Intervention) represents the primary direct comparator in this specific methodological space.

The taxonomy reveals that Gradient-Based Activation Steering sits alongside two other activation manipulation approaches: Direct Activation Intervention (concept vectors without gradients) and Representation Engineering (frameworks for analyzing concept representations). The broader Activation and Representation Manipulation branch is one of six major control paradigms, with neighboring branches covering Decoding Control, Training-Based methods, and Agent Control. COLD-Steer's gradient-based approach distinguishes it from simpler vector addition methods while remaining distinct from training-based alignment techniques, positioning it at the intersection of computational efficiency and adaptive steering.

Among 29 candidates examined, the framework's core contribution (in-context one-step learning dynamics) shows one refutable candidate out of 10 examined, while the two approximation methods and theoretical unification show no clear refutations across their respective candidate sets. The limited search scope (top-K semantic search plus citation expansion) means these statistics reflect a focused but not exhaustive literature review. The approximation methods and theoretical contributions appear more novel within this constrained examination, though the core framework concept encounters at least one overlapping prior work among the candidates reviewed.

Based on the limited search scope of 29 candidates, the work appears to occupy a sparsely populated methodological niche with modest prior overlap. The taxonomy structure confirms that gradient-based activation steering remains less crowded than decoding-based or training-based control approaches. However, the analysis cannot rule out additional relevant work beyond the top-K semantic matches examined, particularly in adjacent areas like representation engineering or direct intervention methods that might employ related approximation techniques.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
29 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: inference-time control of large language model behavior. The field encompasses diverse strategies for steering LLM outputs without retraining, organized into six main branches. Activation and Representation Manipulation methods directly modify internal model states, using techniques such as gradient-based steering (COLD-Steer[0], Inference-time Intervention[5]) or concept activation vectors (Concept Activation Vectors[4]) to guide behavior at the representation level. Decoding and Generation Control focuses on constraining or shaping outputs during token sampling, including format enforcement (Verifiable Format Control[8]) and attribute-based generation (Ctrl Transformer[6]). Training-Based Control and Alignment covers methods that prepare models for inference-time steering through specialized training objectives (Safe Alignment[9], InfAlign[20]). Agent Control and Task Execution addresses higher-level orchestration of LLM-driven agents in interactive environments (LLM-Agent-Controller[24], Executable Code Actions[33]), while Evaluation and Benchmarking provides frameworks for assessing controllability (Controllable Generation Benchmark[21]). Specialized Control Applications targets domain-specific challenges such as code security (Secure Vulnerable Code[14]) or cross-lingual adaptation (Cross-lingual Intervention[27]).

A particularly active line of work centers on activation-level interventions that manipulate latent representations to achieve fine-grained control over model behavior. COLD-Steer[0] exemplifies gradient-based activation steering, computing targeted adjustments to internal states to guide outputs toward desired attributes. This approach contrasts with Inference-time Intervention[5], which applies simpler linear interventions based on pre-identified steering vectors, trading off computational cost against flexibility. Both methods share the goal of modifying behavior without altering model weights, yet differ in how they identify and apply steering signals.

Meanwhile, works like Latent Actions Control[1] and Adaptable Logical Control[3] explore complementary strategies that encode control objectives into latent action spaces or logical constraints, highlighting ongoing questions about the optimal level of abstraction for steering. COLD-Steer[0] sits squarely within the gradient-based activation manipulation cluster, distinguished by its use of optimization-driven steering that adapts dynamically to specific prompts, offering a middle ground between the simplicity of fixed intervention vectors and the complexity of full retraining approaches.

Claimed Contributions

COLD-Steer framework for steering LLMs via in-context one-step learning dynamics

The authors propose COLD-Steer, a novel optimization-free activation steering framework that approximates how gradient updates from contextual examples would affect intermediate representations, enabling targeted causal intervention during inference without requiring parameter updates or extensive training.

10 retrieved papers
Can Refute
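The mechanics this contribution describes can be sketched in miniature. The snippet below is an illustrative toy, not the paper's method: it assumes a hypothetical scalar linear readout `W` of an activation and takes one explicit gradient-descent step on the activation itself, leaving all model weights untouched.

```python
import numpy as np

def one_step_steer(h, W, target, lr=0.1):
    """One explicit gradient-descent step on an activation h.

    Toy loss defined on the activation itself: L(h) = (W . h - target)^2.
    The step nudges h toward producing `target` under the linear
    readout W, without touching any model parameters.
    """
    grad = 2.0 * (W @ h - target) * W  # dL/dh for the toy loss
    return h - lr * grad
```

A real steering method would replace the toy readout loss with a loss computed over the in-context examples and pull its gradient back onto the intermediate activation; the shape of the update, an activation minus a scaled gradient, stays the same.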
Two complementary approximation methods: unit kernel and finite-difference

The authors develop two distinct methods for efficiently approximating learning dynamics: COLD-Kernel-Steer, which uses kernel-weighted combinations of gradient effects, and COLD-FD-Steer, which approximates gradients via finite differences, both avoiding expensive backpropagation during inference.

10 retrieved papers
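The two-forward-pass idea behind the finite-difference variant can also be illustrated with a toy sketch (an assumption-laden stand-in, not COLD-FD-Steer itself): treat the loss as a black box, estimate its directional derivative along a chosen steering direction with a central difference, and take one descent step. The two evaluations stay at two no matter how many examples the loss aggregates internally.

```python
import numpy as np

def fd_steer(h, loss_fn, direction, eps=1e-4, lr=0.1):
    """Steer activation h along `direction` using a finite-difference
    estimate of the loss slope: two loss evaluations total, regardless
    of how many in-context examples loss_fn averages over.
    """
    d = direction / np.linalg.norm(direction)
    slope = (loss_fn(h + eps * d) - loss_fn(h - eps * d)) / (2.0 * eps)
    return h - lr * slope * d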
Theoretical unification of existing contrastive methods

The authors establish that their framework provides a theoretical foundation showing how existing contrastive activation steering methods like CAA can be understood as implicit approximations of gradient descent on specific loss functions.

9 retrieved papers
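The claimed unification is easy to verify in a toy setting. Under an illustrative contrastive loss L(h) = 0.5 * mean_i(||h - p_i||^2 - ||h - n_i||^2) (an assumed form, chosen here for concreteness), the gradient with respect to h is mean(neg) - mean(pos), so one gradient-descent step adds a scaled difference-of-means vector, which is exactly the CAA steering update:

```python
import numpy as np

def caa_vector(pos, neg):
    # Contrastive Activation Addition: difference of mean activations.
    return pos.mean(axis=0) - neg.mean(axis=0)

def contrastive_loss_grad(h, pos, neg):
    # Gradient of L(h) = 0.5 * mean_i(||h - p_i||^2 - ||h - n_i||^2):
    # mean_i((h - p_i) - (h - n_i)) = mean(neg) - mean(pos).
    return (h - pos).mean(axis=0) - (h - neg).mean(axis=0)
```

Because this gradient is constant in h, the update h - lr * grad equals h + lr * caa_vector(pos, neg): steering-vector addition with the learning rate playing the role of steering strength.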

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
