LatentQA: Teaching LLMs to Decode Activations Into Natural Language

ICLR 2026 Conference Submission · Anonymous Authors
AI Safety · Activation Engineering · Top-Down Transparency of Language Models
Abstract:

Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language and perform LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a pseudo-labeled dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder’s fidelity by assessing its ability to read and steer model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size, which is promising given how easily our approach can generate additional pseudo-labels.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these versions, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LatentQA, a task for answering open-ended questions about language model activations using a natural language probe. It resides in the Activation-to-Language Decoding leaf, which contains four papers total including this work. This leaf sits within the broader Activation-Based Decoding and Steering Methods branch, indicating a moderately populated research direction. The sibling papers in this leaf explore related but distinct approaches: privileged information activations, meta-models for behavior prediction, and pedagogical activation decoding. The taxonomy structure suggests this is an active but not overcrowded area, with clear differentiation among methods.

The taxonomy reveals that Activation-to-Language Decoding is one of three sub-branches under Activation-Based Decoding and Steering Methods, alongside Activation Intervention for Behavior Control and Layer-Contrastive Decoding Strategies. Neighboring branches include Interpretability and Explanation Methods, which focuses on post-hoc analysis without activation modification, and Controlled and Guided Text Generation, which operates at the decoding level rather than internal representation level. The scope notes clarify that this work differs from probing methods in Interpretability by generating natural language outputs rather than training diagnostic classifiers, and from steering methods by prioritizing interpretation over behavior control.

Among the three contributions analyzed, the LatentQA task and natural language probe was compared against ten candidates, one of which was found to be potentially refutable prior work, suggesting some overlap with existing activation decoding approaches. The pseudo-labeled dataset generation approach was compared against ten candidates with no clear refutations, indicating relative novelty in this specific methodology. The Latent Interpretation Tuning method was compared against three candidates with no refutations. These statistics reflect a limited search scope of twenty-three total candidates, not an exhaustive literature review. Among the examined candidates, the first contribution appears to have the most substantial prior work.

Based on the limited search of twenty-three semantically similar papers, the work appears to occupy a recognizable position within an established research direction. The taxonomy context shows this is a moderately active area with clear boundaries from neighboring topics. The contribution-level analysis suggests varying degrees of novelty across components, with the training methodology appearing more distinctive than the core task formulation among the examined candidates. This assessment is constrained by the top-K semantic search scope and does not claim comprehensive coverage of all relevant prior work.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 1

Research Landscape Overview

Core task: decoding language model activations into natural language. The field encompasses diverse approaches to understanding and manipulating the internal representations of language models. The taxonomy reveals seven major branches. Activation-Based Decoding and Steering Methods focus on directly interpreting or modifying internal states to control model behavior, including techniques like DoLa[1] and Activation Addition[4]. Interpretability and Explanation Methods develop tools and frameworks for making model decisions transparent, exemplified by works such as AllenNLP Interpret[37] and NLP Explainability[21]. Controlled and Guided Text Generation explores constrained decoding strategies like Reward-Augmented Decoding[38] and Chain-of-Thought Decoding[45]. Model Behavior Analysis and Robustness investigates how models process information and respond to perturbations, while Cross-Domain and Multimodal Representation Learning examines how representations transfer across modalities, including Brain-to-Text Decoding[20]. Text Representation and Embedding Methods study fundamental encoding schemes, and Application-Driven Interpretation targets domain-specific evaluation needs.

Within Activation-Based Decoding and Steering Methods, a particularly active line of work centers on translating internal activations into interpretable text. LatentQA[0] sits squarely in this Activation-to-Language Decoding cluster, alongside Privileged Information Activations[6] and Meta-Models[27], all aiming to extract human-readable explanations from hidden states. While Privileged Information Activations[6] focuses on leveraging information asymmetries during training, LatentQA[0] emphasizes question-answering as a decoding framework, contrasting with the meta-modeling approach of Meta-Models[27], which learns to predict model behavior from activations.

A key tension across these methods involves balancing faithfulness to the original representations versus producing fluent, actionable natural language outputs. Teaching Activation Decoding[32] explores pedagogical applications of similar techniques, highlighting how activation decoding can serve both interpretability and practical intervention goals.

Claimed Contributions

LatentQA task and natural language probe

The authors introduce LatentQA, a task that involves answering open-ended questions about model activations in natural language. They develop a probe that outputs natural language rather than scalars or single tokens, enabling richer interpretation of model behaviors.

10 retrieved papers · 1 potentially refutable
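The contrast this contribution draws, between conventional scalar probes and a natural-language probe, can be sketched as follows. Everything here (function names, the toy sentiment heuristic, the dataclass) is illustrative only; the paper's actual probe is a fine-tuned decoder LLM, not a hand-written rule:

```python
# Hypothetical sketch of the LatentQA interface: a probe that answers
# open-ended questions about activations in natural language, versus a
# conventional probe that emits a single scalar.
from dataclasses import dataclass
from typing import List


@dataclass
class LatentQAExample:
    activations: List[float]  # hidden state captured from the target model
    question: str             # open-ended question about that hidden state
    answer: str               # natural-language answer (the training label)


def classic_probe(activations: List[float]) -> float:
    """A conventional probe: one pre-chosen scalar output (e.g. sentiment)."""
    return float(sum(activations) > 0)


def latentqa_probe(activations: List[float], question: str) -> str:
    """A LatentQA-style probe: free-form natural language conditioned on a
    question. Stubbed with a toy heuristic; in the paper this is a decoder
    LLM fine-tuned on activation/QA pairs."""
    polarity = "positive" if sum(activations) > 0 else "negative"
    return f"Regarding '{question}': the internal state suggests a {polarity} stance."
```

The point of the comparison: the classic probe can only report the one quantity it was trained for, while a LatentQA-style probe accepts arbitrary questions and answers in prose.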
Pseudo-labeled dataset generation approach

The authors propose a method for curating a dataset that maps model activations to natural language question-answer pairs. This dataset is generated using control prompts and GPT-based labeling, enabling training of the decoder without manual annotation.

10 retrieved papers · no refutations
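The described pipeline, control prompts inducing behaviors, activations recorded, and a GPT labeler writing question-answer pairs, can be sketched as a simple loop. Both helper functions below are stand-ins (the real pipeline runs the target model and calls a GPT labeler); only the loop structure reflects the description above:

```python
# Hypothetical sketch of the pseudo-labeled dataset generation loop.
def get_activations(control_prompt: str) -> list:
    # Stand-in: would run the target model on the control prompt and
    # record its hidden states.
    return [float(len(control_prompt))]


def label_with_gpt(control_prompt: str) -> list:
    # Stand-in: would ask a GPT labeler to write question-answer pairs
    # describing the behavior the control prompt induces.
    return [("What behavior is the model exhibiting?",
             f"A behavior induced by the instruction: {control_prompt}")]


def build_dataset(control_prompts: list) -> list:
    """Pair each control prompt's activations with pseudo-labeled QA pairs,
    yielding training examples with no manual annotation."""
    dataset = []
    for prompt in control_prompts:
        activations = get_activations(prompt)
        for question, answer in label_with_gpt(prompt):
            dataset.append({"activations": activations,
                            "question": question,
                            "answer": answer})
    return dataset
```

Because labels come from the labeler rather than humans, the same loop can be rerun to generate additional pseudo-labels, which is what the abstract's scaling claim relies on.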
Latent Interpretation Tuning (LIT) method

The authors develop LIT, a fine-tuning method that trains a decoder LLM to predict qualitative properties of future model completions given current activations. This method enables both reading and steering of model activations using natural language.

3 retrieved papers · no refutations
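The dual use of a trained decoder, reading and steering, can be illustrated with a toy differentiable stand-in. The "decoder" below is a sigmoid mapping a scalar activation to a decoded property score; in the actual method the decoder is the fine-tuned LLM, and steering optimizes the target model's activations against a natural-language objective through it:

```python
# Hypothetical sketch of reading vs. steering with a trained decoder.
import math


def decode(activation: float) -> float:
    # Toy decoder: maps an activation to a decoded property score in (0, 1),
    # standing in for the decoder LLM's natural-language answer.
    return 1.0 / (1.0 + math.exp(-activation))


def steer(activation: float, target: float,
          lr: float = 5.0, steps: int = 500) -> float:
    """Steering: nudge the activation by gradient descent so the decoded
    property matches the target value."""
    for _ in range(steps):
        score = decode(activation)
        # gradient of 0.5 * (score - target)^2 through the sigmoid decoder
        grad = (score - target) * score * (1.0 - score)
        activation -= lr * grad
    return activation
```

Reading is a forward pass (`decode`); steering inverts it, which is why the fidelity of the decoder matters: an imprecise decoder would push activations toward the wrong behavior.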

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

