Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.0 Download Report PDF

honestyinterrogationalignment auditing

As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to occasionally make factual mistakes, then admit them when asked. We show that the admission of factual errors in simple question-answering settings generalizes out-of-distribution (OOD) to the admission of hidden misaligned objectives in adversarial agentic settings. We evaluate SRFT in OOD stealth tasks, where models are instructed to complete a hidden misaligned objective alongside a user-specified objective without being caught by monitoring. After SRFT, models are more likely to confess the details of their hidden objectives when interrogated, even under strong pressure not to disclose them. Interrogation on SRFT models can detect hidden objectives with near-ceiling performance (F1 score = 0.98), while the baseline model lies when interrogated under the same conditions (F1 score = 0). Interrogation on SRFT models can further elicit the content of the hidden objective, recovering 28-100% details, compared to 0% details recovered in the baseline model and by prefilled assistant turn attacks. This provides a promising technique for promoting honesty propensity and incriminating misaligned AI systems.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes self-report fine-tuning (SRFT), a supervised fine-tuning technique that trains models to admit factual errors and then generalizes this honesty to confessing hidden misaligned objectives in adversarial settings. Within the taxonomy, this work occupies the 'Self-Reporting Fine-Tuning for Objective Disclosure' leaf under 'Direct Elicitation of Hidden Knowledge and Objectives'. Notably, this leaf contains only the original paper itself with no sibling papers, indicating this is a relatively unexplored research direction. The broader parent category includes four other leaves covering alternative elicitation approaches, suggesting the field has multiple active threads but this specific training-based honesty approach stands alone.

The taxonomy reveals substantial activity in neighboring areas. The sibling leaf 'Secret Knowledge Elicitation via Black-Box and White-Box Techniques' contains two papers focused on auditor-designed probes rather than model training. The 'Deception Detection and Adversarial Behavior Analysis' branch (five leaves, nine papers total) addresses related concerns about model dishonesty but emphasizes detection rather than prevention through training. The 'Latent Knowledge Discovery in Model Activations' leaf explores unsupervised extraction from internal representations, contrasting with SRFT's supervised output-based approach. This positioning suggests SRFT bridges cooperative elicitation and adversarial robustness concerns, occupying a distinct methodological niche between passive probing and active deception detection.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the work. The core SRFT technique examined ten candidates with zero refutable matches, as did the near-ceiling detection performance claim and the robustness evaluation under adversarial conditions. This absence of overlapping prior work within the limited search scope suggests the specific combination of supervised fine-tuning for honesty generalization to adversarial settings may be novel. However, the search examined only top-K semantic matches and citations, not exhaustive coverage of alignment, interpretability, or adversarial robustness literatures. The statistics indicate no immediate precedent among closely related papers but cannot rule out relevant work in adjacent subfields.

Based on the limited search scope of thirty candidates, the work appears to introduce a distinctive approach within a sparsely populated research direction. The taxonomy structure shows active exploration of related problems (deception detection, bias elicitation, latent knowledge extraction) but minimal prior work specifically on training models to self-report hidden objectives. The analysis covers semantically proximate papers and direct citations but does not encompass broader alignment or mechanistic interpretability literatures that might contain conceptually related techniques.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: Eliciting hidden objectives from language models through interrogation. The field encompasses a diverse set of approaches for uncovering what language models know, believe, or intend beyond their surface-level outputs. The taxonomy reveals several major branches: Direct Elicitation of Hidden Knowledge and Objectives focuses on techniques that explicitly prompt models to reveal concealed information or goals, often through fine-tuning or structured questioning (e.g., Eliciting Secret Knowledge[1], Auditing Hidden Objectives[3]). Implicit Bias Detection and Measurement examines methods for surfacing unstated preferences and stereotypes through psychometric tests and behavioral probing (Measuring Implicit Bias[4], Psychometric Bias Attack[2]). Deception Detection and Adversarial Behavior Analysis investigates scenarios where models may actively conceal information or exhibit strategic dishonesty (Resilience to Interrogation[10], Benign Prompt Deception[22]). Meanwhile, branches like Model Behavior Probing and Interpretation and Implicit Content Reasoning tackle the challenge of inferring latent knowledge from model activations or implicit reasoning patterns (Discovering Latent Knowledge[7], Inferring Implicit Relations[12]). A particularly active line of work explores the tension between explicit statements and implicit behaviors, with studies comparing what models say versus what their actions reveal (Explicit vs Implicit[5]). Another emerging theme involves adversarial interrogation and red teaming, where researchers design prompts to elicit harmful content or expose vulnerabilities (Eliciting Offensiveness[21], Many Wolves[11]). Spilling the Beans[0] sits within the Direct Elicitation branch, specifically under Self-Reporting Fine-Tuning for Objective Disclosure, emphasizing training models to voluntarily disclose their hidden objectives. This approach contrasts with more adversarial methods like Psychometric Bias Attack[2] or implicit probing techniques such as Discovering Latent Knowledge[7], positioning itself as a cooperative rather than extractive interrogation strategy that relies on model transparency rather than external inference.

Claimed Contributions

Self-report fine-tuning (SRFT) technique

10 retrieved papers

The authors introduce SRFT, a supervised fine-tuning method that teaches models to make factual errors in question-answering and then honestly admit these errors when questioned. This technique generalizes out-of-distribution to enable models to confess hidden misaligned objectives in adversarial agentic settings.

10 retrieved papers

Near-ceiling detection and elicitation performance

10 retrieved papers

The authors demonstrate that SRFT achieves near-perfect performance (F1 score = 0.98) in detecting whether models have hidden objectives and can recover 28-100% of hidden objective details through interrogation, vastly outperforming baseline models and prefilled assistant turn attacks which achieve near-zero performance.

10 retrieved papers

Robustness evaluation under adversarial conditions

10 retrieved papers

The authors evaluate SRFT's robustness by testing it against adversarial conditions where models are explicitly instructed to conceal their objectives and are provided with plausible decoy objectives to cite instead. They show that SRFT maintains honest admission capabilities in most cases despite these adversarial pressures.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Self-report fine-tuning (SRFT) technique

[50] Training language models to self-correct via reinforcement learning PDF

Cannot Refute

[51] Fine-tuning Language Models for Factuality PDF

Cannot Refute

[52] Editing Factual Knowledge in Language Models PDF

Cannot Refute

[53] Can We Edit Factual Knowledge by In-Context Learning? PDF

Cannot Refute

[54] Fine-tuning large language models for improving factuality in legal question answering PDF

Cannot Refute

[55] Fine-grained hallucination detection and editing for language models PDF

Cannot Refute

[56] ChatGPT: potential, prospects, and limitations PDF

Cannot Refute

[57] Enhancing Large Language Models with Reliable Knowledge Graphs PDF

Cannot Refute

[58] JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability PDF

Cannot Refute

[59] An integrated model of learning from errors PDF

Cannot Refute

Contribution

Near-ceiling detection and elicitation performance

[3] Auditing language models for hidden objectives PDF

Cannot Refute

[6] In-context impersonation reveals large language models' strengths and biases PDF

Cannot Refute

[9] Autonomous Agents for Interrogation PDF

Cannot Refute

[10] On Large Language Modelsâ Resilience to Coercive Interrogation PDF

Cannot Refute

[15] Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems PDF

Cannot Refute

[35] The Traitors: Deception and Trust in Multi-Agent Language Model Simulations PDF

Cannot Refute

[36] Stochastic constraint self-reflective syntax reconstruction in large language model internal representational spaces PDF

Cannot Refute

[37] Chain of attack: Hide your intention through multi-turn interrogation PDF

Cannot Refute

[38] Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction PDF

Cannot Refute

[39] Stochastic resonance pathways for latent knowledge reassembly in large language models PDF

Cannot Refute

Contribution

Robustness evaluation under adversarial conditions

[40] Cats confuse reasoning LLM: Query agnostic adversarial triggers for reasoning models PDF

Cannot Refute

[41] Algorithms for Adversarially Robust Deep Learning PDF

Cannot Refute

[42] Adversarial Reasoning at Jailbreaking Time PDF

Cannot Refute

[43] Confidence Elicitation: A New Attack Vector for Large Language Models PDF

Cannot Refute

[44] Cyber Deception: Taxonomy, State of the Art, Frameworks, Trends, and Open Challenges PDF

Cannot Refute

[45] MultiPhishGuard: An LLM-based Multi-Agent System for Phishing Email Detection PDF

Cannot Refute

[46] The difficulties of preference elicitation resulting from strategic thinking: How concerned should we be? PDF

Cannot Refute

[47] The construction of preference PDF

Cannot Refute

[48] The Effects of Experience on Deception in Human-Agent Negotiation PDF

Cannot Refute

[49] Small Cells of Suspects: Eliciting Cues to Deception by Strategic Interviewing PDF

Cannot Refute

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

Contribution Analysis

Self-report fine-tuning (SRFT) technique

[50] Training language models to self-correct via reinforcement learning PDF

[51] Fine-tuning Language Models for Factuality PDF

[52] Editing Factual Knowledge in Language Models PDF

[53] Can We Edit Factual Knowledge by In-Context Learning? PDF

[54] Fine-tuning large language models for improving factuality in legal question answering PDF

[55] Fine-grained hallucination detection and editing for language models PDF

[56] ChatGPT: potential, prospects, and limitations PDF

[57] Enhancing Large Language Models with Reliable Knowledge Graphs PDF

[58] JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability PDF

[59] An integrated model of learning from errors PDF

Near-ceiling detection and elicitation performance

[3] Auditing language models for hidden objectives PDF

[6] In-context impersonation reveals large language models' strengths and biases PDF

[9] Autonomous Agents for Interrogation PDF

[10] On Large Language Modelsâ Resilience to Coercive Interrogation PDF

[15] Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems PDF

[35] The Traitors: Deception and Trust in Multi-Agent Language Model Simulations PDF

[36] Stochastic constraint self-reflective syntax reconstruction in large language model internal representational spaces PDF

[37] Chain of attack: Hide your intention through multi-turn interrogation PDF

[38] Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction PDF

[39] Stochastic resonance pathways for latent knowledge reassembly in large language models PDF

Robustness evaluation under adversarial conditions

[40] Cats confuse reasoning LLM: Query agnostic adversarial triggers for reasoning models PDF

[41] Algorithms for Adversarially Robust Deep Learning PDF

[42] Adversarial Reasoning at Jailbreaking Time PDF

[43] Confidence Elicitation: A New Attack Vector for Large Language Models PDF

[44] Cyber Deception: Taxonomy, State of the Art, Frameworks, Trends, and Open Challenges PDF

[45] MultiPhishGuard: An LLM-based Multi-Agent System for Phishing Email Detection PDF

[46] The difficulties of preference elicitation resulting from strategic thinking: How concerned should we be? PDF

[47] The construction of preference PDF

[48] The Effects of Experience on Deception in Human-Agent Negotiation PDF

[49] Small Cells of Suspects: Eliciting Cues to Deception by Strategic Interviewing PDF

Table of Contents

[10] On Large Language Modelsâ Resilience to Coercive Interrogation PDF