Critical Confabulation: Can LLMs Hallucinate for Social Good?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models; AI for Social Good; Hallucination and Confabulation; Narrative Modeling; Data Contamination and Memorization; Computational Creativity; Evidence-Grounded Generation
Abstract:

LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory): the use of LLM hallucinations to "fill in the gaps" left in archives by social and political inequality, and to reconstruct divergent yet evidence-bound narratives for history's "hidden figures". We simulate these gaps with an open-ended narrative cloze task, asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate fully open models audited for data contamination (the OLMo-2 family) alongside unaudited open-weight and proprietary baselines, under a range of prompts designed to elicit controlled and useful hallucinations. Our findings confirm that LLMs have the foundational narrative-understanding capabilities needed for critical confabulation, and show how controlled, well-specified hallucinations can support LLM applications for knowledge production without letting speculation collapse into historical inaccuracy and loss of fidelity.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces critical confabulation, a framework that deliberately uses LLM hallucinations to reconstruct missing historical narratives for marginalized figures, evaluated through a narrative cloze task on unpublished texts. According to the taxonomy tree, this work sits in the 'Critical Confabulation and Narrative Cloze Approaches' leaf under 'Controlled Hallucination Methods for Historical Reconstruction'. Notably, this leaf contains only the original paper; no sibling papers are listed. This suggests the paper occupies a relatively sparse research direction within the broader field of using LLMs for historical reconstruction, which comprises 21 papers across multiple branches.

The taxonomy reveals several neighboring research directions that contextualize this work's position. The sibling leaf 'Recursive and Generative Ancestral Reconstruction Systems' explores iterative human-AI methods for ancestral narratives, while 'AI-Mediated Voice Recreation for Specific Historical Figures' focuses on recreating individual voices. The broader 'Bias Analysis and Representation Studies' branch examines how LLMs encode historical inequities, and 'Community-Level Oral History and Archive Analysis' addresses collective memory preservation. The scope note for the paper's leaf explicitly excludes 'general creative generation without historical grounding', positioning critical confabulation as evidence-bound speculation rather than unconstrained creativity.

Among the 30 candidates examined through a limited semantic search, none clearly refuted any of the three main contributions. For the critical confabulation framework, 10 candidates were examined with no refutable matches; for the narrative cloze task, 10 candidates with no refutations; and for the contamination-audited dataset of unpublished texts, 10 candidates with no clear prior work. This absence of refutable candidates across all contributions, combined with the paper being the sole occupant of its taxonomy leaf, suggests that the specific combination of controlled hallucination for historical reconstruction, narrative cloze evaluation, and contamination-audited unpublished sources is a relatively unexplored configuration within the limited search scope.

Based on the limited literature search of 30 candidates, the work appears to occupy a novel position combining theoretical framing (critical confabulation), methodological innovation (narrative cloze), and dataset construction (contamination-audited unpublished texts). However, the analysis cannot assess whether more extensive searches in adjacent fields (such as digital humanities, archival studies, or computational creativity) might reveal closer precedents. The taxonomy structure suggests active research in related areas like bias analysis and oral history, but the specific intersection this paper targets remains sparsely populated within the examined scope.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: using LLM hallucinations to reconstruct missing historical narratives for marginalized figures. This emerging field addresses the challenge of recovering voices and stories systematically excluded from traditional archives by leveraging large language models' generative capacities in ethically grounded ways. The taxonomy reveals four main branches:

- Controlled Hallucination Methods for Historical Reconstruction explores techniques that deliberately harness LLM confabulation to fill archival gaps, including approaches such as narrative cloze and critical confabulation.
- Bias Analysis and Representation Studies examines how LLMs encode and reproduce historical inequities, investigating cultural biases and representational harms in model outputs (e.g., Cultural Bias Asian[5], Unequal Voices[1]).
- Community-Level Oral History and Archive Analysis focuses on integrating oral traditions and community knowledge with computational methods (Oral History Understanding[3], Archival Photographs Multimodal[7]).
- Methodological and Theoretical Frameworks addresses the philosophical and ethical foundations needed for responsible AI-assisted historiography, drawing on concepts such as epistemic injustice and structural oppression (Epistemic Injustice[10], Historical Structural Oppression[17]).

Particularly active tensions emerge between controlled generation methods and critical scholarship on representation. Works like Designing Invisible[2] and Moses Williams Representation[8] highlight how marginalized figures remain underrepresented or distorted even in computational reconstructions, while studies such as Simulating Social Perception[9] and Contextualizing Harmful Language[12] probe how models perpetuate historical biases. Critical Confabulation[0] situates itself within the controlled hallucination branch, proposing methods that intentionally use model confabulation as a historiographic tool rather than treating it as error. This approach contrasts with more cautious frameworks like Prosthetic Denial[15] and Spectral Imaginings[18], which emphasize the risks of fabricating narratives for communities already subjected to erasure. The central question across these lines of work remains how to balance generative reconstruction with epistemic humility, ensuring that computational methods amplify rather than replace marginalized voices.

Claimed Contributions

Critical confabulation framework for LLM hallucinations

The authors introduce critical confabulation as a framework that repurposes LLM hallucinations to reconstruct evidence-bounded narratives for historically under-documented figures. This approach adapts Hartman's critical fabulation methodology to leverage controlled confabulations for addressing archival silence and recovering divergent historical narratives.

Candidate papers retrieved: 10

Narrative cloze task for evaluating critical confabulation

The authors operationalize critical confabulation as a narrative cloze task where LLMs must reconstruct masked events in character timelines. This task serves as a proxy for fragmentary historical records and enables systematic evaluation of models' ability to perform evidence-bounded confabulation.

Candidate papers retrieved: 10

Contamination-audited dataset from unpublished historical texts

The authors construct a dataset from the Black Writing and Thought Collection with rigorous data contamination auditing procedures. They perform sentence-level string searches and behavioral probes to ensure the dataset represents genuinely unseen history, enabling reliable evaluation of confabulation rather than memorization.

Candidate papers retrieved: 10
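As a concrete illustration of the narrative cloze setup described above (not the authors' actual implementation), one event in a character-centric timeline can be masked and a model asked to regenerate it; the timeline format, mask token, and example events below are hypothetical:

```python
# Illustrative sketch of an open-ended narrative cloze prompt; the
# mask token, prompt wording, and events are hypothetical, not taken
# from the paper's dataset.
MASK = "[MASKED EVENT]"

def build_cloze_prompt(character, events, mask_index):
    """Hide one event in a character-centric timeline and ask a model
    to reconstruct it from the surrounding documented events."""
    masked = [MASK if i == mask_index else e for i, e in enumerate(events)]
    timeline = "\n".join(f"{i + 1}. {e}" for i, e in enumerate(masked))
    return (
        f"Timeline for {character}:\n{timeline}\n\n"
        "Propose a plausible account of the masked event, staying "
        "consistent with the documented events above."
    )

events = [
    "1921: moves to Chicago",
    "1923: publishes a first essay in a small journal",
    "1925: joins a writers' collective",
]
prompt = build_cloze_prompt("the subject", events, mask_index=1)
```

The model's completion for the masked slot can then be scored against the held-out event, which is what makes the task open-ended yet evidence-bounded.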

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Critical confabulation framework for LLM hallucinations

The authors introduce critical confabulation as a framework that repurposes LLM hallucinations to reconstruct evidence-bounded narratives for historically under-documented figures. This approach adapts Hartman's critical fabulation methodology to leverage controlled confabulations for addressing archival silence and recovering divergent historical narratives.

Comparison outcome: 10 candidate papers examined; none clearly refuted this contribution.

Contribution 2: Narrative cloze task for evaluating critical confabulation

The authors operationalize critical confabulation as a narrative cloze task in which LLMs must reconstruct masked events in character timelines. This task serves as a proxy for fragmentary historical records and enables systematic evaluation of models' ability to perform evidence-bounded confabulation.

Comparison outcome: 10 candidate papers examined; none clearly refuted this contribution.

Contribution 3: Contamination-audited dataset from unpublished historical texts

The authors construct a dataset from the Black Writing and Thought Collection with rigorous data contamination auditing procedures. They perform sentence-level string searches and behavioral probes to ensure the dataset represents genuinely unseen history, enabling reliable evaluation of confabulation rather than memorization.

Comparison outcome: 10 candidate papers examined; none clearly refuted this contribution.
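A minimal sketch of the sentence-level string search mentioned above, assuming access to a normalized text dump of the pretraining corpus; the normalization choices and example strings are my own, and real audits typically add n-gram overlap checks and behavioral probes:

```python
import re

def normalize(text):
    """Lowercase and strip punctuation/extra whitespace so trivial
    formatting differences don't hide verbatim overlap."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def contaminated_sentences(eval_sentences, corpus_text):
    """Return evaluation sentences that appear verbatim (after
    normalization) in the pretraining corpus text."""
    haystack = normalize(corpus_text)
    return [
        s for s in eval_sentences
        if normalize(s) and normalize(s) in haystack
    ]

# Hypothetical corpus snippet and probe sentences.
corpus = "In 1923 she published a first essay in a small journal."
hits = contaminated_sentences(
    [
        "In 1923, she published a FIRST essay in a small journal.",
        "This sentence does not appear in the corpus.",
    ],
    corpus,
)
```

Sentences surviving this filter (no hits) are the ones that can plausibly be treated as unseen history rather than memorized text.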