Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: calibration, LLM, semantic, uncertainty, theory
Abstract:

Large Language Models (LLMs) often lack meaningful confidence estimates for the semantic content of their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, under a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in various open-ended question-answering tasks, despite being trained only on next-token prediction. To formalize this phenomenon, we introduce "B-calibration," a notion of calibration parameterized by a choice of equivalence classes. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges in base LLMs, leveraging a recent connection between calibration and local loss optimality. This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) base LLMs are semantically calibrated across question-answering tasks, (2) instruction-tuning procedures systematically break this calibration, and (3) chain-of-thought reasoning breaks calibration (intuitively, because models cannot predict their final answers before completing their generation). To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces B-calibration, a parameterized framework for semantic calibration in base LLMs, and provides a theoretical mechanism linking semantic calibration to local loss optimality. It resides in the 'Emergence and Mechanisms of Calibration' leaf under 'Theoretical Foundations and Empirical Analysis,' where it is currently the sole paper. This leaf focuses on explaining why semantic calibration emerges through theoretical analysis, distinguishing it from purely empirical evaluations or method proposals. The sparse population of this leaf suggests that theoretical explanations of calibration emergence remain underexplored in the literature, positioning the work in a relatively open research direction.

The taxonomy reveals substantial activity in neighboring areas: the sibling leaf 'Empirical Calibration Studies' contains three papers examining calibration properties across models and tasks, while 'Confidence Estimation Methods and Frameworks' encompasses multiple leaves with 20+ papers developing black-box and white-box uncertainty quantification techniques. The parent branch 'Theoretical Foundations and Empirical Analysis' also includes work on confidence-probability alignment and decoding strategy effects. The paper's theoretical focus on emergence mechanisms differentiates it from these empirical and methodological neighbors, though it shares conceptual ground with studies analyzing when and why calibration properties manifest during training or scaling.

Among the 20 candidates examined across three contributions, no clearly refuting prior work was identified. For the B-calibration framework, 10 candidates were examined with no refutable matches, suggesting novelty in formalizing semantic calibration via equivalence classes. For the theoretical mechanism linking calibration to local loss optimality, only 1 candidate was examined, reflecting limited prior theoretical work in this specific direction. For the testable-predictions contribution, 9 candidates were examined, again with no refutations, indicating that the predictive framework and its experimental validation appear distinct from existing empirical studies. The limited search scope (20 candidates in total) means these findings reflect top semantic matches rather than exhaustive coverage.

Given the sparse theoretical landscape and the absence of refuting work among examined candidates, the paper appears to occupy a relatively novel position within its immediate research context. However, the small search scope and the single-paper status of its taxonomy leaf suggest caution: while no overlapping prior work surfaced in top-20 semantic matches, a broader literature review might reveal related theoretical analyses not captured here. The work's novelty seems strongest in its formal B-calibration framework and mechanistic explanation, with empirical validation building on established evaluation paradigms from neighboring leaves.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: semantic confidence calibration in large language models, i.e., how well a model's expressed confidence aligns with the true correctness or semantic consistency of its outputs. The taxonomy organizes research into six main branches:

- Confidence Estimation Methods and Frameworks: techniques to measure uncertainty from model outputs, often using semantic clustering or consistency checks (e.g., Semantic Entropy[19], Semantic Density[18]).
- Calibration via Fine-Tuning and Optimization: training-time interventions to improve alignment between confidence and accuracy (e.g., Uncertainty-Aware Fine-tuning[47], SEED-GRPO[37]).
- Theoretical Foundations and Empirical Analysis: the underlying mechanisms and emergence of calibration properties.
- Domain-Specific Applications: calibration methods tailored to specialized settings such as medical question answering (Medical Uncertainty Challenge[8]) or text-to-SQL (Text-to-SQL Confidence[2]).
- Adaptive Inference and Selective Prediction: using confidence scores to decide when to abstain or escalate.
- Uncertainty Taxonomy and Semantic Distinctions: clarifying different notions of uncertainty, distinguishing epistemic from aleatoric sources and token-level from semantic-level measures.

A particularly active line of work contrasts token-probability-based methods with semantic-level approaches: while traditional calibration often relies on softmax probabilities or temperature scaling (Temperature Scaling[6]), many recent studies argue that semantic consistency across paraphrases or sampled outputs better captures true uncertainty (Generating with Confidence[3], Semantic Embeddings[30]). Another key tension involves whether to estimate confidence from generations alone (Generations Only[31]) or to incorporate internal model states and intermediate representations (Intermediate Representations[22]).
The original paper, Semantic Calibration[0], sits squarely within the Theoretical Foundations and Empirical Analysis branch, focusing on the emergence and mechanisms of calibration. It complements works like Calibration Pre-trained Models[4] and Revisiting Calibration[9] by examining how and why semantic-level confidence signals arise during pretraining and scaling, offering insights that inform both estimation frameworks and fine-tuning strategies across the taxonomy.
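As a concrete illustration of the sampling-based, semantic-level approaches surveyed above (e.g., Semantic Entropy[19]), the following minimal Python sketch computes an entropy over semantic classes of sampled answers. The greedy clustering and the toy equivalence predicate are our own simplifications; the original method uses a bidirectional-entailment check between answers.

```python
import math

def semantic_entropy(sampled_answers, equivalent):
    """Entropy over semantic classes induced by an equivalence relation.

    sampled_answers: list of generated answer strings for one question.
    equivalent: callable(a, b) -> bool deciding semantic equivalence
                (a stand-in for bidirectional entailment).
    """
    clusters = []  # each cluster is a list of mutually equivalent answers
    for ans in sampled_answers:
        for cluster in clusters:
            if equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    n = len(sampled_answers)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy equivalence: case-insensitive string match stands in for entailment.
answers = ["Paris", "paris", "Lyon", "Paris"]
h = semantic_entropy(answers, lambda a, b: a.lower() == b.lower())
```

Low entropy indicates that the sampled answers concentrate on one semantic class, i.e., high semantic confidence.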

Claimed Contributions

B-calibration framework for semantic calibration in LLMs

The authors introduce B-calibration, a formal framework that generalizes calibration to arbitrary equivalence classes defined by a collapsing function B. This framework enables rigorous analysis of semantic calibration by treating the LLM as inducing a classifier over semantic classes.

Retrieved papers compared: 10
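To make the framework concrete, here is a minimal sketch of how calibration under a collapsing function B could be measured in practice. This is our own illustration, not the paper's protocol: each confidence is assumed to be the model's probability mass on its chosen semantic class (e.g., estimated by sampling and collapsing through B), and we use a standard binned expected calibration error.

```python
import numpy as np

def b_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error over semantic classes.

    confidences: model's probability of its chosen semantic class.
    correct:     1 if the chosen class matches the reference answer's
                 class under B, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its mass.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

Choosing B as exact string match recovers token-level calibration of final answers; a semantic equivalence relation yields the semantic calibration the paper studies.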
Theoretical mechanism linking semantic calibration to local loss optimality

The authors establish a theoretical mechanism explaining emergent semantic calibration in base LLMs by connecting B-calibration to local loss optimality. They prove that B-calibration is equivalent to local loss optimality with respect to a corresponding perturbation family, and show when such perturbations are easy for autoregressive models to implement.

Retrieved papers compared: 1
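The mechanism can be stated schematically. The notation below is our reconstruction from the abstract's description, not the paper's exact formalism: \(f\) maps a prompt \(x\) to a distribution \(p_f(y \mid x)\) over responses \(y\), and \(B\) collapses each response to its semantic class.

```latex
% Induced classifier over semantic classes:
p_f^{B}(c \mid x) \;=\; \sum_{y \,:\, B(y)=c} p_f(y \mid x)

% B-calibration (informal): conditioned on the induced class-probability
% taking value v, the class is realized with frequency v:
\Pr\bigl[B(Y) = c \;\bigm|\; p_f^{B}(c \mid X) = v\bigr] \;=\; v
\quad \text{for all } c, v

% Claimed mechanism: this holds iff no perturbation f' in a family
% \mathcal{P}_B (reweightings acting on semantic classes) lowers the
% expected log loss:
\mathbb{E}\bigl[-\log p_{f'}(Y \mid X)\bigr] \;\ge\; \mathbb{E}\bigl[-\log p_f(Y \mid X)\bigr]
\quad \text{for all } f' \in \mathcal{P}_B(f)
```

Under this reading, such perturbations are easy for an autoregressive model to implement precisely when it can predict its own class distribution \(p_f^{B}(\cdot \mid x)\) before generating, which is what yields the paper's testable prediction.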
Testable predictions about when semantic calibration emerges

The authors derive testable predictions from their theory, stating that base LLMs exhibit semantic calibration when they can predict their own semantic class distribution before generation. They validate three specific implications: base LLMs are semantically calibrated on question-answering tasks, instruction-tuning breaks this calibration, and chain-of-thought reasoning breaks calibration.

Retrieved papers compared: 9
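The first implication rests on a sampling-based confidence estimate: collapse several sampled answers through B and read off the modal class frequency. A minimal sketch follows, with a toy stand-in for the LLM sampler; the function names and the choice of collapse function are our assumptions.

```python
import random
from collections import Counter

def sampling_confidence(question, sample_answer, collapse, k=20):
    """Estimate semantic confidence for one question by sampling.

    sample_answer: callable(question) -> one sampled answer string
                   (stands in for temperature sampling from an LLM).
    collapse:      the collapsing function B, mapping an answer to its
                   semantic class (here, simple string normalization).
    Returns (modal_class, confidence), where confidence is the empirical
    frequency of the modal semantic class among k samples.
    """
    classes = Counter(collapse(sample_answer(question)) for _ in range(k))
    modal_class, count = classes.most_common(1)[0]
    return modal_class, count / k

# Toy stand-in: a "model" that answers "4" with probability 0.8.
random.seed(0)
toy_model = lambda q: random.choices(["4", "5"], weights=[0.8, 0.2])[0]
cls, conf = sampling_confidence("2+2?", toy_model, collapse=str.strip)
```

Plotting such confidences against correctness (a reliability diagram) is the kind of check that would distinguish a calibrated base model from an instruction-tuned or chain-of-thought variant.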

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K retrieved core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: B-calibration framework for semantic calibration in LLMs

The authors introduce B-calibration, a formal framework that generalizes calibration to arbitrary equivalence classes defined by a collapsing function B. This framework enables rigorous analysis of semantic calibration by treating the LLM as inducing a classifier over semantic classes.

Contribution 2: Theoretical mechanism linking semantic calibration to local loss optimality

The authors establish a theoretical mechanism explaining emergent semantic calibration in base LLMs by connecting B-calibration to local loss optimality. They prove that B-calibration is equivalent to local loss optimality with respect to a corresponding perturbation family, and show when such perturbations are easy for autoregressive models to implement.

Contribution 3: Testable predictions about when semantic calibration emerges

The authors derive testable predictions from their theory, stating that base LLMs exhibit semantic calibration when they can predict their own semantic class distribution before generation. They validate three specific implications: base LLMs are semantically calibrated on question-answering tasks, instruction-tuning breaks this calibration, and chain-of-thought reasoning breaks calibration.