Productive LLM Hallucinations: Conditions, Mechanisms, and Benefits

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Large Language Models; Hallucination; Productive Hallucinations; Reasoning Dynamics
Abstract:

Hallucinations in large language models (LLMs) are typically regarded as harmful errors to be suppressed. We revisit this assumption and ask whether, and under what conditions, hallucinations can instead be beneficial. To address this question, we introduce HIVE (Hallucination Inference and Verification Engine), a task-agnostic framework that systematically evaluates the impact of hallucinated semantics across diverse tasks and models. By unifying generation, discrimination, and downstream evaluation, HIVE enables controlled comparative assessments of how hallucinations alter overall model performance. Extensive experiments on nine datasets and ten models show that hallucinations can yield substantial improvements, up to +17.2% in accuracy, especially in open-ended domains such as reasoning, biomedical, and vision-language tasks. Stronger models consistently harness hallucinations, while weaker ones are more volatile. Mechanistic analyses show that hallucinations broaden semantic coverage, stabilize reasoning trajectories, and follow an inverted-U profile where moderate strength maximizes benefits across diverse tasks. These findings reframe hallucination from a defect to a controllable cognitive resource, suggesting opportunities for evaluating and training LLMs not merely to avoid hallucinations, but to exploit them constructively.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HIVE, a task-agnostic framework for systematically evaluating whether hallucinations can improve LLM performance across diverse tasks. Within the taxonomy, it resides in the 'Task-Agnostic Evaluation Frameworks' leaf under 'Empirical Evaluation of Beneficial Hallucinations'. This leaf contains only two papers total: the original work and one sibling (Heaven-Sent Hell-Bent). This sparse population suggests the research direction—developing general frameworks for controlled hallucination assessment—is relatively underexplored compared to domain-specific evaluations or mitigation-focused branches.

The taxonomy tree reveals neighboring work in sibling leaves: 'Creativity and Reasoning Task Evaluations' (three papers measuring hallucination benefits in open-ended tasks) and 'Domain-Specific Application Evaluations' (two papers in specialized contexts like drug discovery). The parent branch 'Empirical Evaluation of Beneficial Hallucinations' excludes theoretical arguments and mitigation methods, positioning HIVE within a growing but still nascent empirical tradition. Nearby branches like 'Hallucination Mitigation and Detection Methods' (six papers across three subcategories) show that reliability-focused work remains more densely populated than benefit-focused evaluation frameworks.

Among the 27 candidates examined through limited semantic search, none clearly refute any of the three contributions. For the HIVE framework itself, 10 candidates were examined with zero refutable overlaps. The empirical evidence contribution (10 candidates examined) and mechanistic insights contribution (7 candidates examined) similarly show no clear prior work providing overlapping findings. This absence of refutation within the examined scope suggests the specific combination—task-agnostic framework, systematic comparative assessment, and mechanistic analysis of beneficial hallucinations—may represent a novel integration, though the limited search scale (27 papers, not exhaustive) means undiscovered prior work remains possible.

Based on the top-27 semantic matches and taxonomy structure, the work appears to occupy a sparsely populated research direction with minimal direct prior overlap among examined candidates. The framework's positioning between theoretical acceptance arguments and domain-specific applications, combined with its systematic evaluation methodology, suggests a contribution that bridges conceptual and empirical gaps. However, the analysis covers a constrained literature sample and cannot rule out relevant work outside the examined scope or emerging concurrently in this rapidly evolving subfield.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: beneficial effects of hallucinations in large language models. The field has evolved from viewing hallucinations purely as errors to recognizing their potential utility across diverse contexts. The taxonomy reflects this shift through seven main branches: Theoretical Frameworks establish conceptual foundations for understanding when and why hallucinations might be valuable; Empirical Evaluation develops systematic methods to measure beneficial outcomes; Creative and Imaginative Exploitation explores deliberate use of hallucinations for generative tasks; Hallucination Mitigation and Detection Methods addresses the dual challenge of controlling unwanted fabrications while preserving useful ones; Multimodal and Vision-Language Hallucinations[5] extends these questions beyond text; Behavioral Analysis and Mechanistic Interpretability investigates the underlying causes; and Applied Domain Studies examines context-specific benefits in education, drug discovery[13], and other fields.

Works like Confabulation Value[4] and Feature Not Bug[30] exemplify the theoretical reframing, while studies such as Creativity Perspective Survey[7] and Shakespearean Sparks[17] demonstrate practical exploitation strategies. Recent research reveals tension between harnessing creativity and maintaining reliability. Several studies explore pedagogical applications where controlled errors can enhance learning, as seen in Erroneous Math Tutoring[19] and Pedagogy-First Approach[20], contrasting with safety-focused work like Safe Trustworthy AI[11] that prioritizes mitigation.

Productive Hallucinations[0] sits within the Empirical Evaluation branch alongside Heaven-Sent Hell-Bent[25], both developing task-agnostic frameworks to systematically assess when hallucinations provide value rather than harm.
While Heaven-Sent Hell-Bent[25] may emphasize the duality of beneficial versus detrimental outcomes, Productive Hallucinations[0] appears focused on establishing rigorous evaluation criteria that transcend specific application domains. This positioning bridges theoretical insights about hallucination utility with practical measurement challenges, contributing methodology that can inform both creative exploitation strategies and context-aware mitigation approaches across the taxonomy's applied branches.

Claimed Contributions

HIVE framework for systematic hallucination evaluation

The authors propose HIVE, a general-purpose framework that unifies caption generation, hallucination discrimination, and downstream task evaluation. It enables controlled comparisons between faithful and hallucinated inputs across both text-only and multimodal tasks to measure hallucination effects on model performance.

10 retrieved papers
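As a rough illustration of the three-stage pipeline described above, the sketch below wires toy stand-ins for generation, discrimination, and downstream scoring into a controlled faithful-versus-hallucinated comparison. All names here (`generate_caption`, `discriminate`, `hive_compare`, the toy model) are hypothetical placeholders, not the authors' implementation.

```python
def generate_caption(example, hallucinate=False):
    """Stage 1 stand-in: produce a faithful or deliberately hallucinated description."""
    base = f"caption for {example['id']}"
    return base + (" + extra fabricated detail" if hallucinate else "")

def discriminate(caption):
    """Stage 2 stand-in: flag whether a caption carries hallucinated content."""
    return "fabricated" in caption

def downstream_accuracy(model, examples, hallucinate):
    """Stage 3 stand-in: answer the task from each caption and score against labels."""
    correct = 0
    for ex in examples:
        caption = generate_caption(ex, hallucinate=hallucinate)
        pred = model(caption, ex["question"])
        correct += int(pred == ex["label"])
    return correct / len(examples)

def hive_compare(model, examples):
    """Controlled comparison: same model, same questions, two input conditions."""
    return {
        "faithful": downstream_accuracy(model, examples, hallucinate=False),
        "hallucinated": downstream_accuracy(model, examples, hallucinate=True),
    }

# Toy model that happens to benefit from the injected detail.
toy_model = lambda caption, question: 1 if "fabricated" in caption else 0
data = [{"id": i, "question": "q", "label": 1} for i in range(4)]
print(hive_compare(toy_model, data))  # hallucinated condition scores higher in this toy setup
```

The point of the design is that only the input condition changes between the two runs, so any accuracy difference is attributable to the hallucinated semantics rather than to the model or task.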
Empirical evidence of beneficial hallucinations across tasks and models

The authors provide broad empirical evidence demonstrating that hallucinations can improve performance in perception-driven tasks, with gains of up to 17.2% in accuracy. The benefits vary systematically by task type and model capacity, showing that hallucinations are not uniformly harmful.

10 retrieved papers
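A headline figure like "+17.2%" is just the largest per-task accuracy delta between the faithful and hallucinated conditions. The snippet below shows how such a table of gains would be derived; all numbers are illustrative placeholders, not results from the paper.

```python
def accuracy_gain(faithful_acc, hallucinated_acc):
    """Accuracy change, in percentage points, from switching to hallucinated inputs."""
    return round((hallucinated_acc - faithful_acc) * 100, 1)

# Placeholder per-task accuracies under the two input conditions.
results = {
    "reasoning":  {"faithful": 0.610, "hallucinated": 0.782},  # illustrative +17.2-point case
    "biomedical": {"faithful": 0.540, "hallucinated": 0.585},
    "factual_qa": {"faithful": 0.720, "hallucinated": 0.701},  # a task where they hurt
}

gains = {task: accuracy_gain(r["faithful"], r["hallucinated"])
         for task, r in results.items()}
best_task = max(gains, key=gains.get)
print(gains, best_task)
```

Reporting the full table rather than only the maximum makes the sign variation across tasks visible, which is what supports the claim that hallucinations are not uniformly helpful or harmful.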
Mechanistic insights into how hallucinations enhance reasoning

The authors demonstrate through systematic analysis that hallucinations reshape semantic inputs by broadening coverage, modulate inference dynamics by altering reasoning trajectories, and exhibit an inverted-U relationship where moderate hallucination strength yields optimal performance. These findings reframe hallucination as a controllable cognitive resource rather than purely a defect.

7 retrieved papers
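The inverted-U finding amounts to sweeping a hallucination-strength knob and locating the interior peak of a benefit curve. The quadratic curve and its parameters (`peak`, `width`) below are assumptions made purely for illustration, not a model fitted in the paper.

```python
def benefit(strength, peak=0.5, width=0.35):
    """Toy inverted-U: benefit rises toward a moderate peak strength, then falls."""
    return max(0.0, 1.0 - ((strength - peak) / width) ** 2)

# Sweep strengths 0.0 .. 1.0 and find where benefit is maximized.
strengths = [i / 10 for i in range(11)]
curve = [(s, round(benefit(s), 3)) for s in strengths]
best_strength = max(curve, key=lambda point: point[1])[0]
print(best_strength)  # an interior (moderate) strength, not an endpoint
```

Under this toy curve the optimum sits strictly inside the sweep range, which is the signature of the inverted-U profile: neither suppressing hallucination entirely nor maximizing it yields the best downstream performance.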

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

HIVE framework for systematic hallucination evaluation


Contribution

Empirical evidence of beneficial hallucinations across tasks and models


Contribution

Mechanistic insights into how hallucinations enhance reasoning
