Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Uncertainty Quantification, LLMs, RAG, Contextual QA, Hallucinations
Abstract:

Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains largely unexplored despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify \emph{epistemic uncertainty}. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: \emph{context-reliance} (using the provided context rather than parametric knowledge), \emph{context comprehension} (extracting relevant information from context), and \emph{honesty} (avoiding intentional lies). Using a top-down interpretability approach, we extract these features using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring negligible inference overhead.
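One standard way to write the decomposition the abstract refers to, as a sketch in generic notation (not necessarily the paper's exact formulation): with $p$ the unknown true token distribution and $q_\theta$ the model's predictive distribution, the token-level cross-entropy splits into an aleatoric and an epistemic term.

```latex
% Cross-entropy splits into an aleatoric term (irreducible noise in the
% true distribution) and an epistemic term (the model's gap from it):
H(p, q_\theta) \;=\; \underbrace{H(p)}_{\text{aleatoric}}
\;+\; \underbrace{D_{\mathrm{KL}}\!\left(p \,\middle\|\, q_\theta\right)}_{\text{epistemic}}
```

On this reading, the KL term is the epistemic component that the paper upper-bounds and interprets as semantic feature gaps in the hidden representations.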

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a theoretically grounded framework for epistemic uncertainty quantification in contextual question answering, decomposing token-level uncertainty and approximating epistemic components via semantic feature gaps relative to an idealized model. It resides in the Feature-Based and Causal Uncertainty leaf, which contains only two papers total. This represents a relatively sparse research direction within the broader Semantic and Representation-Based Uncertainty branch, suggesting the feature-gap perspective on epistemic uncertainty remains underexplored compared to probability-based or consistency-driven methods.

The taxonomy reveals neighboring approaches in Semantic Consistency and Reformulation, which emphasize paraphrasing and output consistency checks, and Token-Level and Probability-Based Uncertainty, which derive uncertainty from logits and entropy. The paper's causal and feature-centric lens distinguishes it from these directions: rather than measuring consistency across reformulations or calibrating probabilities, it isolates epistemic gaps through hidden representation analysis. The scope note for Feature-Based and Causal Uncertainty explicitly excludes multi-granular methods, positioning this work as a single-source semantic approach focused on internal model features rather than ensemble or hybrid techniques.

Among the twenty-four candidates examined, none clearly refutes the three core contributions. For the theoretically grounded framework, four candidates were examined with zero refutations; for the three-feature approximation for contextual QA, ten candidates with zero refutations; and for the top-down interpretability method, ten candidates with zero refutations. This suggests that, within the top-K semantic matches and citation expansions, no prior work directly overlaps with the proposed feature-gap decomposition or the specific three-feature hypothesis for contextual QA. The absence of refutations across all contributions indicates potential novelty, though the search is not exhaustive.

Given the sparse taxonomy leaf and zero refutations across twenty-four candidates, the work appears to occupy a relatively unexplored niche within epistemic uncertainty quantification. The feature-gap perspective and contextual QA focus differentiate it from neighboring semantic and probability-based methods. However, the limited search scope means this assessment reflects top-K semantic matches rather than comprehensive field coverage, and broader literature may contain relevant prior work not captured in this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: epistemic uncertainty quantification in contextual question answering. The field organizes around several major branches that reflect different facets of uncertainty in modern QA systems. Uncertainty Estimation Methods and Frameworks encompasses diverse techniques ranging from semantic and representation-based approaches to ensemble and probabilistic methods, addressing how to measure what a model does not know. Retrieval-Augmented and Knowledge-Grounded Systems focuses on uncertainty arising when external knowledge sources inform answers, while Hallucination Detection and Factuality Verification targets the reliability of generated content. Application-Specific Uncertainty and QA Systems explores domain-tailored solutions such as medical or multimodal settings, and Ambiguity and Context-Aware Uncertainty examines how question ambiguity and contextual nuances shape uncertainty. Finally, Benchmarking and Evaluation provides the empirical foundations for comparing methods across these dimensions.

Within the semantic and representation-based uncertainty cluster, a handful of works explore how feature-level and causal perspectives can reveal epistemic gaps. Feature Gaps[0] investigates uncertainty through the lens of feature representations and causal structures, emphasizing how missing or misaligned features contribute to epistemic doubt. This contrasts with neighboring efforts like ESI[41], which may prioritize different semantic signals or consistency measures. Meanwhile, works such as RAG Uncertainty[1] and Ualign[2] illustrate how retrieval-augmented frameworks introduce distinct uncertainty challenges tied to document relevance and alignment, while Knowledge Graph QA[3] and Medical QA Uncertainty[4] demonstrate domain-specific adaptations. The interplay between feature-based reasoning and broader semantic consistency remains an active area, with Feature Gaps[0] offering a distinctive causal angle that complements more consistency-driven or retrieval-focused approaches in understanding what models truly know.

Claimed Contributions

Theoretically grounded epistemic uncertainty quantification framework via feature gaps

The authors propose a task-agnostic uncertainty metric that decomposes total uncertainty into epistemic and aleatoric components. They derive an upper bound showing epistemic uncertainty can be interpreted as semantic feature gaps between the actual model and an idealized perfectly-prompted model, providing theoretical grounding for their approach.

4 retrieved papers

Three-feature approximation for contextual QA epistemic uncertainty

The authors instantiate their generic framework for contextual question-answering by identifying three high-level semantic features (context-reliance, context comprehension, and honesty) that approximate the feature gap between actual and ideal models in this specific task domain.

10 retrieved papers

Top-down interpretability method for feature extraction and ensembling

The authors develop a practical method that extracts the hypothesized semantic features using contrastive prompting and PCA on a small labeled dataset, then ensembles these features into a single uncertainty score that requires only three dot products at test time without sampling.

10 retrieved papers
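As a hedged sketch of the test-time mechanics this contribution describes: feature directions are extracted from contrastive hidden-state differences via a principal component, and the final score is an ensemble of three dot products. All names, shapes, and the toy data below are hypothetical; the paper's actual prompts, layers, and ensembling weights are not specified here.

```python
import numpy as np

def extract_feature_direction(pos_states: np.ndarray, neg_states: np.ndarray) -> np.ndarray:
    """Top principal component of contrastive hidden-state differences.

    pos_states / neg_states: (n_pairs, hidden_dim) hidden states from
    contrastive prompt pairs (e.g., honest vs. dishonest system prompts).
    """
    diffs = pos_states - neg_states           # one row per contrastive pair
    mean = diffs.mean(axis=0)
    centered = diffs - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                         # unit-norm first PC
    if direction @ mean < 0:                  # resolve PCA sign: orient toward mean contrast
        direction = -direction
    return direction

def uncertainty_score(hidden: np.ndarray, directions, weights) -> float:
    """Ensemble of per-feature projections: three dot products at test time."""
    feats = np.array([hidden @ d for d in directions])
    return float(np.asarray(weights) @ feats)

# Toy demo in a 2-d "hidden space": one synthetic feature direction.
pos = np.array([[2.0, 0.1], [4.0, -0.1]])
neg = np.array([[0.0, 0.1], [1.0, -0.1]])
d = extract_feature_direction(pos, neg)       # ~[1.0, 0.0] on this data

# Three hypothetical feature directions (e.g., context-reliance,
# comprehension, honesty), uniformly weighted.
score = uncertainty_score(np.array([3.0, 4.0]),
                          [d, np.array([0.0, 1.0]), d],
                          [1 / 3, 1 / 3, 1 / 3])
```

The key cost claim from the report (three dot products, no sampling) is what `uncertainty_score` mirrors; in a real system the directions would be fit once on the small labeled set and cached.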

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Theoretically grounded epistemic uncertainty quantification framework via feature gaps

Contribution: Three-feature approximation for contextual QA epistemic uncertainty

Contribution: Top-down interpretability method for feature extraction and ensembling