LatentQA: Teaching LLMs to Decode Activations Into Natural Language

ICLR 2026 Conference Submission · Anonymous Authors
AI Safety · Activation Engineering · Top-Down Transparency of Language Models
Abstract:

Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language and perform LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a pseudo-labeled dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder’s fidelity by assessing its ability to read and steer model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size, which is promising given how easily our approach can generate additional pseudo-labels.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these versions, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LatentQA, a task for answering open-ended questions about language model activations using a natural language probe. It resides in the Activation-to-Language Decoding leaf, which contains four papers total including this work. This leaf sits within the broader Activation-Based Decoding and Steering Methods branch, indicating a moderately populated research direction. The sibling papers in this leaf explore related but distinct approaches: privileged information activations, meta-models for behavior prediction, and pedagogical activation decoding. The taxonomy structure suggests this is an active but not overcrowded area, with clear differentiation among methods.

The taxonomy reveals that Activation-to-Language Decoding is one of three sub-branches under Activation-Based Decoding and Steering Methods, alongside Activation Intervention for Behavior Control and Layer-Contrastive Decoding Strategies. Neighboring branches include Interpretability and Explanation Methods, which focuses on post-hoc analysis without activation modification, and Controlled and Guided Text Generation, which operates at the decoding level rather than internal representation level. The scope notes clarify that this work differs from probing methods in Interpretability by generating natural language outputs rather than training diagnostic classifiers, and from steering methods by prioritizing interpretation over behavior control.

Among the three contributions analyzed, the LatentQA task and natural language probe was compared against ten candidates, one of which was found to be potentially refutable prior work, suggesting some overlap with existing activation decoding approaches. The pseudo-labeled dataset generation approach was compared against ten candidates with no clear refutations, indicating relative novelty in this specific methodology. The Latent Interpretation Tuning method was compared against three candidates with no refutations. These statistics reflect a limited search scope of twenty-three total candidates, not an exhaustive literature review. Among the examined candidates, the first contribution appears to have the most substantial prior work.

Based on the limited search of twenty-three semantically similar papers, the work appears to occupy a recognizable position within an established research direction. The taxonomy context shows this is a moderately active area with clear boundaries from neighboring topics. The contribution-level analysis suggests varying degrees of novelty across components, with the training methodology appearing more distinctive than the core task formulation among the examined candidates. This assessment is constrained by the top-K semantic search scope and does not claim comprehensive coverage of all relevant prior work.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 1

Research Landscape Overview

Core task: decoding language model activations into natural language. The field encompasses diverse approaches to understanding and manipulating the internal representations of language models. The taxonomy reveals seven major branches. Activation-Based Decoding and Steering Methods focus on directly interpreting or modifying internal states to control model behavior, including techniques like DoLa[1] and Activation Addition[4]. Interpretability and Explanation Methods develop tools and frameworks for making model decisions transparent, exemplified by works such as AllenNLP Interpret[37] and NLP Explainability[21]. Controlled and Guided Text Generation explores constrained decoding strategies like Reward-Augmented Decoding[38] and Chain-of-Thought Decoding[45]. Model Behavior Analysis and Robustness investigates how models process information and respond to perturbations, while Cross-Domain and Multimodal Representation Learning examines how representations transfer across modalities, including Brain-to-Text Decoding[20]. Text Representation and Embedding Methods study fundamental encoding schemes, and Application-Driven Interpretation targets domain-specific evaluation needs.

Within Activation-Based Decoding and Steering Methods, a particularly active line of work centers on translating internal activations into interpretable text. LatentQA[0] sits squarely in this Activation-to-Language Decoding cluster, alongside Privileged Information Activations[6] and Meta-Models[27], all aiming to extract human-readable explanations from hidden states. While Privileged Information Activations[6] focuses on leveraging information asymmetries during training, LatentQA[0] emphasizes question-answering as a decoding framework, contrasting with the meta-modeling approach of Meta-Models[27], which learns to predict model behavior from activations.

A key tension across these methods involves balancing faithfulness to the original representations versus producing fluent, actionable natural language outputs. Teaching Activation Decoding[32] explores pedagogical applications of similar techniques, highlighting how activation decoding can serve both interpretability and practical intervention goals.

Claimed Contributions

LatentQA task and natural language probe

The authors introduce LatentQA, a task that involves answering open-ended questions about model activations in natural language. They develop a probe that outputs natural language rather than scalars or single tokens, enabling richer interpretation of model behaviors.

10 retrieved papers · 1 potentially refutable
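The contrast this contribution draws, between conventional scalar probes and a natural-language probe, can be sketched as follows. Everything here (function names, the toy sentiment heuristic, the dataclass) is illustrative only; the paper's actual probe is a fine-tuned decoder LLM, not a hand-written rule:

```python
# Hypothetical sketch of the LatentQA interface: a probe that answers
# open-ended questions about activations in natural language, versus a
# conventional probe that emits a single scalar.
from dataclasses import dataclass
from typing import List


@dataclass
class LatentQAExample:
    activations: List[float]  # hidden state captured from the target model
    question: str             # open-ended question about that hidden state
    answer: str               # natural-language answer (the training label)


def classic_probe(activations: List[float]) -> float:
    """A conventional probe: one pre-chosen scalar output (e.g. sentiment)."""
    return float(sum(activations) > 0)


def latentqa_probe(activations: List[float], question: str) -> str:
    """A LatentQA-style probe: free-form natural language conditioned on a
    question. Stubbed with a toy heuristic; in the paper this is a decoder
    LLM fine-tuned on activation/QA pairs."""
    polarity = "positive" if sum(activations) > 0 else "negative"
    return f"Regarding '{question}': the internal state suggests a {polarity} stance."
```

The point of the comparison: the classic probe can only report the one quantity it was trained for, while a LatentQA-style probe accepts arbitrary questions and answers in prose.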
Pseudo-labeled dataset generation approach

The authors propose a method for curating a dataset that maps model activations to natural language question-answer pairs. This dataset is generated using control prompts and GPT-based labeling, enabling training of the decoder without manual annotation.

10 retrieved papers · no refutations
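The described pipeline, control prompts inducing behaviors, activations recorded, and a GPT labeler writing question-answer pairs, can be sketched as a simple loop. Both helper functions below are stand-ins (the real pipeline runs the target model and calls a GPT labeler); only the loop structure reflects the description above:

```python
# Hypothetical sketch of the pseudo-labeled dataset generation loop.
def get_activations(control_prompt: str) -> list:
    # Stand-in: would run the target model on the control prompt and
    # record its hidden states.
    return [float(len(control_prompt))]


def label_with_gpt(control_prompt: str) -> list:
    # Stand-in: would ask a GPT labeler to write question-answer pairs
    # describing the behavior the control prompt induces.
    return [("What behavior is the model exhibiting?",
             f"A behavior induced by the instruction: {control_prompt}")]


def build_dataset(control_prompts: list) -> list:
    """Pair each control prompt's activations with pseudo-labeled QA pairs,
    yielding training examples with no manual annotation."""
    dataset = []
    for prompt in control_prompts:
        activations = get_activations(prompt)
        for question, answer in label_with_gpt(prompt):
            dataset.append({"activations": activations,
                            "question": question,
                            "answer": answer})
    return dataset
```

Because labels come from the labeler rather than humans, the same loop can be rerun to generate additional pseudo-labels, which is what the abstract's scaling claim relies on.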
Latent Interpretation Tuning (LIT) method

The authors develop LIT, a fine-tuning method that trains a decoder LLM to predict qualitative properties of future model completions given current activations. This method enables both reading and steering of model activations using natural language.

3 retrieved papers · no refutations
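The dual use of a trained decoder, reading and steering, can be illustrated with a toy differentiable stand-in. The "decoder" below is a sigmoid mapping a scalar activation to a decoded property score; in the actual method the decoder is the fine-tuned LLM, and steering optimizes the target model's activations against a natural-language objective through it:

```python
# Hypothetical sketch of reading vs. steering with a trained decoder.
import math


def decode(activation: float) -> float:
    # Toy decoder: maps an activation to a decoded property score in (0, 1),
    # standing in for the decoder LLM's natural-language answer.
    return 1.0 / (1.0 + math.exp(-activation))


def steer(activation: float, target: float,
          lr: float = 5.0, steps: int = 500) -> float:
    """Steering: nudge the activation by gradient descent so the decoded
    property matches the target value."""
    for _ in range(steps):
        score = decode(activation)
        # gradient of 0.5 * (score - target)^2 through the sigmoid decoder
        grad = (score - target) * score * (1.0 - score)
        activation -= lr * grad
    return activation
```

Reading is a forward pass (`decode`); steering inverts it, which is why the fidelity of the decoder matters: an imprecise decoder would push activations toward the wrong behavior.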

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

