Evaluating steering techniques using human similarity judgments

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: cognitive science, transformers, large language models, human-AI alignment, human-centered AI, steering, cognitive benchmarking
Abstract

Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods in both steering accuracy and model-to-human alignment. We also found that LLMs were biased towards "kind" similarity and struggled to align on "size". This evaluation approach, grounded in human cognition, adds further support for the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces an evaluation framework for LLM steering techniques grounded in triadic similarity judgments, a task drawn from cognitive psychology. Within the taxonomy, it occupies the 'Cognitive Alignment Through Similarity Judgments' leaf under 'Steering and Alignment Evaluation Methods,' where it is currently the sole representative. This positioning reflects a sparse research direction: the broader 'Steering and Alignment Evaluation Methods' branch contains only two leaves, with the sibling leaf ('Mechanistic Interpretability of LLM Behavior') focusing on internal representation decomposition rather than external human-alignment tasks. The taxonomy reveals that cognitive-alignment evaluation via similarity tasks is an underexplored niche within the larger steering literature.

The taxonomy structure shows that most related work clusters in the 'Application Domains' branch, emphasizing task-specific implementations (knowledge reasoning, recommender systems, web agent security) rather than foundational evaluation methods. The neighboring 'Mechanistic Interpretability' leaf examines internal model behavior through decomposition techniques, offering a complementary but distinct approach to understanding steering effects. The paper's cognitive-psychology grounding distinguishes it from these application-oriented and mechanistic directions, bridging human cognition research with LLM steering evaluation in a way that the taxonomy suggests is relatively novel within this literature sample.

Across three identified contributions, the literature search examined 30 candidates total, with 10 candidates per contribution. None of the contributions were clearly refuted by prior work in this limited sample. The evaluation framework using triadic similarity judgments, the dual-axis competence-alignment measurement, and the discovery of privileged representational axes each showed no overlapping prior work among the examined candidates. This suggests that, within the scope of the top-30 semantic matches and their citations, the specific combination of triadic similarity tasks, dual-axis evaluation, and representational bias analysis appears distinctive, though the search scale leaves open the possibility of relevant work beyond this sample.

Based on the limited search scope, the work appears to occupy a relatively unexplored intersection of cognitive psychology and LLM steering evaluation. The taxonomy's sparse population in this direction and the absence of refuting candidates among 30 examined papers suggest novelty, though this assessment is constrained by the search methodology. A more exhaustive review of cognitive science applications to LLM evaluation or broader steering literature might reveal additional relevant precedents not captured in this top-K semantic search.

Taxonomy

Core-task Taxonomy Papers: 4
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating LLM steering techniques using triadic similarity judgments.

The field of LLM steering and alignment has grown into a diverse landscape, organized here around two main branches. The first branch, Steering and Alignment Evaluation Methods, encompasses approaches that assess how well models can be guided toward desired behaviors, including cognitive alignment strategies that probe whether models internalize human-like conceptual structures. The second branch, Application Domains and Task-Specific Implementations, focuses on deploying steering techniques in concrete settings such as recommendation systems, web agents, and knowledge-intensive tasks. Works like LLMs for Recommenders[3] illustrate how steering principles translate into domain-specific challenges, while Knowledge Graph Reflection[1] and Web Agent Security[2] demonstrate the breadth of contexts where alignment and control matter. Together, these branches reflect a field balancing foundational evaluation questions with practical deployment concerns.

Within the Steering and Alignment Evaluation Methods branch, a particularly active line of inquiry examines how to measure alignment beyond surface-level performance metrics, exploring whether models exhibit human-like reasoning patterns or merely mimic outputs. Steering Techniques Evaluation[0] sits squarely in this cognitive alignment cluster, using triadic similarity judgments, a psychologically grounded method, to assess whether steering interventions genuinely shift internal representations in interpretable ways. This contrasts with more application-driven works like LLMs for Recommenders[3], which prioritize task success over cognitive fidelity, and complements efforts like LLM Assertiveness Decomposition[4], which dissects model behavior into interpretable components. The central tension across these directions is whether evaluation should emphasize human-aligned internal structure or downstream utility; Steering Techniques Evaluation[0] leans toward the former by grounding its metrics in human similarity perception.

Claimed Contributions

Evaluation framework using triadic similarity judgments for LLM steering

The authors introduce an evaluation approach for LLM steering techniques grounded in cognitive science methods. They apply triadic similarity judgment tasks—where agents judge which of two items is most similar to a reference item along specified dimensions (size or kind)—to assess both steering accuracy and alignment with human mental representations.
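To make the task concrete, here is a minimal sketch of how a triad could be posed to a model and scored for steering accuracy. The `query_model` wrapper, the prompt wording, and the example items and gold labels are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the triadic similarity task. `query_model` is a
# hypothetical callable wrapping a (steered) LLM; the prompt wording,
# items, and gold labels are illustrative, not taken from the paper.

TRIAD_PROMPT = (
    "Which is more similar to a {ref} in terms of {dim}: "
    "a {a} or a {b}? Answer with one word: {a} or {b}."
)

def run_triad(query_model, ref, a, b, dim):
    """Ask which of two items is closer to the reference along `dim`."""
    answer = query_model(TRIAD_PROMPT.format(ref=ref, dim=dim, a=a, b=b))
    return a if a in answer.strip().lower() else b

def steering_accuracy(query_model, triads, dim):
    """Fraction of triads where the model picks the dimension-correct item.
    Each triad is (reference, option_a, option_b, gold_option)."""
    hits = sum(run_triad(query_model, ref, a, b, dim) == gold
               for ref, a, b, gold in triads)
    return hits / len(triads)

# The same triad can carry different gold answers under each dimension:
# a mouse is closer in size to a coin, but closer in kind to an elephant.
size_triads = [("mouse", "elephant", "coin", "coin")]
kind_triads = [("mouse", "elephant", "coin", "elephant")]
```

Forcing a one-word answer, as in this sketch, keeps response parsing trivial; the paper's actual prompting and scoring protocol may differ.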

Candidate papers retrieved: 10

Dual-axis evaluation measuring competence and alignment

The authors propose evaluating steering methods along two distinct dimensions: competence (task accuracy) and alignment (how well steered model representations match human representational geometry). This dual evaluation framework distinguishes between performance and cognitive similarity, addressing the gap between what systems do and how they do it.
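As one way to make the two axes concrete, the sketch below computes competence as accuracy against a dimension's ground truth and alignment as agreement with the human majority choice per triad. The majority-agreement measure is an assumption for illustration; the paper's metric over human representational geometry may be defined differently.

```python
# Minimal sketch of the dual-axis evaluation, assuming per-triad choices
# have already been collected. The majority-agreement alignment measure
# is an illustrative stand-in for the paper's exact geometry-based metric.

from collections import Counter

def competence(model_choices, gold_choices):
    """Axis 1: task accuracy against the dimension's ground truth."""
    return sum(m == g for m, g in zip(model_choices, gold_choices)) / len(gold_choices)

def alignment(model_choices, human_choices_per_triad):
    """Axis 2: fraction of triads where the model matches the human
    majority choice, regardless of the nominal ground truth."""
    agree = 0
    for model_choice, human_choices in zip(model_choices, human_choices_per_triad):
        majority = Counter(human_choices).most_common(1)[0][0]
        agree += model_choice == majority
    return agree / len(model_choices)
```

The two axes can dissociate: a model may score high competence yet low alignment on triads where humans systematically deviate from the nominal dimension, which is exactly the gap this framework is designed to expose.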

Candidate papers retrieved: 10

Discovery of privileged representational axes in LLMs

The authors identify that LLMs exhibit inherent biases in their representational structure, specifically showing stronger alignment with kind-based similarity over size-based similarity even without explicit steering. This finding reveals systematic differences in how LLMs organize semantic knowledge compared to humans.
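One simple way to probe for such a privileged axis, sketched below under assumptions, is to pose triads with a neutral prompt that names no dimension and check which dimension's ground truth the unsteered choices track more closely.

```python
# Minimal sketch of probing for a privileged representational axis.
# `neutral_choices` are an unsteered model's picks on triads posed without
# naming a dimension; the gap statistic is an illustrative choice.

def axis_bias(neutral_choices, kind_gold, size_gold):
    """Compare unsteered choices against each dimension's ground truth.
    A large positive gap suggests 'kind' acts as the default axis."""
    n = len(neutral_choices)
    kind_match = sum(c == g for c, g in zip(neutral_choices, kind_gold)) / n
    size_match = sum(c == g for c, g in zip(neutral_choices, size_gold)) / n
    return {"kind": kind_match, "size": size_match, "gap": kind_match - size_match}
```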

Candidate papers retrieved: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is the sole representative of its leaf, and the surrounding branch is sparsely populated. In this retrieved landscape, the paper appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Evaluation framework using triadic similarity judgments for LLM steering

As described under Claimed Contributions, this framework applies triadic similarity judgment tasks to assess both steering accuracy and alignment with human mental representations. Among the 10 candidate papers retrieved for this contribution, none was found to overlap with or refute it.

Contribution 2: Dual-axis evaluation measuring competence and alignment

This contribution separates competence (task accuracy) from alignment (the match between steered model representations and human representational geometry). Among the 10 candidate papers retrieved for this contribution, none was found to overlap with or refute it.

Contribution 3: Discovery of privileged representational axes in LLMs

This contribution reports that LLMs align more strongly with kind-based than size-based similarity even without explicit steering. Among the 10 candidate papers retrieved for this contribution, none was found to overlap with or refute it.