Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: calibration, LLM, semantic, uncertainty, theory
Abstract:

Large Language Models (LLMs) often lack meaningful confidence estimates for the semantic content of their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, under a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in various open-ended question-answering tasks, despite being trained only on next-token prediction. To formalize this phenomenon, we introduce "B-calibration," a notion of calibration parameterized by a choice of equivalence classes. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges in base LLMs, leveraging a recent connection between calibration and local loss optimality. This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) base LLMs are semantically calibrated across question-answering tasks, (2) instruction-tuning procedures systematically break this calibration, and (3) chain-of-thought reasoning breaks calibration (intuitively, because models cannot predict their final answers before completing their generation). To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces B-calibration, a parameterized framework for semantic calibration in base LLMs, and provides a theoretical mechanism linking semantic calibration to local loss optimality. It resides in the 'Emergence and Mechanisms of Calibration' leaf under 'Theoretical Foundations and Empirical Analysis,' where it is currently the sole paper. This leaf focuses on explaining why semantic calibration emerges through theoretical analysis, distinguishing it from purely empirical evaluations or method proposals. The sparse population of this leaf suggests that theoretical explanations of calibration emergence remain underexplored in the literature, positioning the work in a relatively open research direction.

The taxonomy reveals substantial activity in neighboring areas: the sibling leaf 'Empirical Calibration Studies' contains three papers examining calibration properties across models and tasks, while 'Confidence Estimation Methods and Frameworks' encompasses multiple leaves with 20+ papers developing black-box and white-box uncertainty quantification techniques. The parent branch 'Theoretical Foundations and Empirical Analysis' also includes work on confidence-probability alignment and decoding strategy effects. The paper's theoretical focus on emergence mechanisms differentiates it from these empirical and methodological neighbors, though it shares conceptual ground with studies analyzing when and why calibration properties manifest during training or scaling.

Among the 20 candidates examined across three contributions, no clearly refuting prior work was identified. For the B-calibration framework, 10 candidates were examined with no refutable matches, suggesting novelty in formalizing semantic calibration via equivalence classes. For the theoretical mechanism linking calibration to local loss optimality, only 1 candidate was examined, reflecting limited prior theoretical work in this specific direction. For the testable-predictions contribution, 9 candidates were examined, again with no refutations, indicating that the predictive framework and its experimental validation appear distinct from existing empirical studies. The limited search scope (20 candidates in total) means these findings reflect top semantic matches rather than exhaustive coverage.

Given the sparse theoretical landscape and the absence of refuting work among examined candidates, the paper appears to occupy a relatively novel position within its immediate research context. However, the small search scope and the single-paper status of its taxonomy leaf suggest caution: while no overlapping prior work surfaced in top-20 semantic matches, a broader literature review might reveal related theoretical analyses not captured here. The work's novelty seems strongest in its formal B-calibration framework and mechanistic explanation, with empirical validation building on established evaluation paradigms from neighboring leaves.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: semantic confidence calibration in large language models, i.e., how well a model's expressed confidence aligns with the true correctness or semantic consistency of its outputs. The taxonomy organizes research into six main branches:

- Confidence Estimation Methods and Frameworks: techniques to measure uncertainty from model outputs, often using semantic clustering or consistency checks (e.g., Semantic Entropy[19], Semantic Density[18]).
- Calibration via Fine-Tuning and Optimization: training-time interventions to improve alignment between confidence and accuracy (e.g., Uncertainty-Aware Fine-tuning[47], SEED-GRPO[37]).
- Theoretical Foundations and Empirical Analysis: the underlying mechanisms and emergence of calibration properties.
- Domain-Specific Applications: calibration methods tailored to specialized settings such as medical question answering (Medical Uncertainty Challenge[8]) or text-to-SQL (Text-to-SQL Confidence[2]).
- Adaptive Inference and Selective Prediction: using confidence scores to decide when to abstain or escalate.
- Uncertainty Taxonomy and Semantic Distinctions: clarifying different notions of uncertainty, distinguishing epistemic from aleatoric sources and token-level from semantic-level measures.

A particularly active line of work contrasts token-probability-based methods with semantic-level approaches: while traditional calibration often relies on softmax probabilities or temperature scaling (Temperature Scaling[6]), many recent studies argue that semantic consistency across paraphrases or sampled outputs better captures true uncertainty (Generating with Confidence[3], Semantic Embeddings[30]). Another key tension involves whether to estimate confidence from generations alone (Generations Only[31]) or to incorporate internal model states and intermediate representations (Intermediate Representations[22]).
The original paper, Semantic Calibration[0], sits squarely within the Theoretical Foundations and Empirical Analysis branch, focusing on the emergence and mechanisms of calibration. It complements works like Calibration Pre-trained Models[4] and Revisiting Calibration[9] by examining how and why semantic-level confidence signals arise during pretraining and scaling, offering insights that inform both estimation frameworks and fine-tuning strategies across the taxonomy.
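As a concrete illustration of the sampling-based, semantic-level approaches surveyed above (e.g., Semantic Entropy[19]), the following minimal Python sketch computes an entropy over semantic classes of sampled answers. The greedy clustering and the toy equivalence predicate are our own simplifications; the original method uses a bidirectional-entailment check between answers.

```python
import math

def semantic_entropy(sampled_answers, equivalent):
    """Entropy over semantic classes induced by an equivalence relation.

    sampled_answers: list of generated answer strings for one question.
    equivalent: callable(a, b) -> bool deciding semantic equivalence
                (a stand-in for bidirectional entailment).
    """
    clusters = []  # each cluster is a list of mutually equivalent answers
    for ans in sampled_answers:
        for cluster in clusters:
            if equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    n = len(sampled_answers)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy equivalence: case-insensitive string match stands in for entailment.
answers = ["Paris", "paris", "Lyon", "Paris"]
h = semantic_entropy(answers, lambda a, b: a.lower() == b.lower())
```

Low entropy indicates that the sampled answers concentrate on one semantic class, i.e., high semantic confidence.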

Claimed Contributions

B-calibration framework for semantic calibration in LLMs

The authors introduce B-calibration, a formal framework that generalizes calibration to arbitrary equivalence classes defined by a collapsing function B. This framework enables rigorous analysis of semantic calibration by treating the LLM as inducing a classifier over semantic classes.

Retrieved papers compared: 10
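To make the framework concrete, here is a minimal sketch of how calibration under a collapsing function B could be measured in practice. This is our own illustration, not the paper's protocol: each confidence is assumed to be the model's probability mass on its chosen semantic class (e.g., estimated by sampling and collapsing through B), and we use a standard binned expected calibration error.

```python
import numpy as np

def b_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error over semantic classes.

    confidences: model's probability of its chosen semantic class.
    correct:     1 if the chosen class matches the reference answer's
                 class under B, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its mass.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

Choosing B as exact string match recovers token-level calibration of final answers; a semantic equivalence relation yields the semantic calibration the paper studies.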
Theoretical mechanism linking semantic calibration to local loss optimality

The authors establish a theoretical mechanism explaining emergent semantic calibration in base LLMs by connecting B-calibration to local loss optimality. They prove that B-calibration is equivalent to local loss optimality with respect to a corresponding perturbation family, and show when such perturbations are easy for autoregressive models to implement.

Retrieved papers compared: 1
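The mechanism can be stated schematically. The notation below is our reconstruction from the abstract's description, not the paper's exact formalism: \(f\) maps a prompt \(x\) to a distribution \(p_f(y \mid x)\) over responses \(y\), and \(B\) collapses each response to its semantic class.

```latex
% Induced classifier over semantic classes:
p_f^{B}(c \mid x) \;=\; \sum_{y \,:\, B(y)=c} p_f(y \mid x)

% B-calibration (informal): conditioned on the induced class-probability
% taking value v, the class is realized with frequency v:
\Pr\bigl[B(Y) = c \;\bigm|\; p_f^{B}(c \mid X) = v\bigr] \;=\; v
\quad \text{for all } c, v

% Claimed mechanism: this holds iff no perturbation f' in a family
% \mathcal{P}_B (reweightings acting on semantic classes) lowers the
% expected log loss:
\mathbb{E}\bigl[-\log p_{f'}(Y \mid X)\bigr] \;\ge\; \mathbb{E}\bigl[-\log p_f(Y \mid X)\bigr]
\quad \text{for all } f' \in \mathcal{P}_B(f)
```

Under this reading, such perturbations are easy for an autoregressive model to implement precisely when it can predict its own class distribution \(p_f^{B}(\cdot \mid x)\) before generating, which is what yields the paper's testable prediction.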
Testable predictions about when semantic calibration emerges

The authors derive testable predictions from their theory, stating that base LLMs exhibit semantic calibration when they can predict their own semantic class distribution before generation. They validate three specific implications: base LLMs are semantically calibrated on question-answering tasks, instruction-tuning breaks this calibration, and chain-of-thought reasoning breaks calibration.

Retrieved papers compared: 9
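The first implication rests on a sampling-based confidence estimate: collapse several sampled answers through B and read off the modal class frequency. A minimal sketch follows, with a toy stand-in for the LLM sampler; the function names and the choice of collapse function are our assumptions.

```python
import random
from collections import Counter

def sampling_confidence(question, sample_answer, collapse, k=20):
    """Estimate semantic confidence for one question by sampling.

    sample_answer: callable(question) -> one sampled answer string
                   (stands in for temperature sampling from an LLM).
    collapse:      the collapsing function B, mapping an answer to its
                   semantic class (here, simple string normalization).
    Returns (modal_class, confidence), where confidence is the empirical
    frequency of the modal semantic class among k samples.
    """
    classes = Counter(collapse(sample_answer(question)) for _ in range(k))
    modal_class, count = classes.most_common(1)[0]
    return modal_class, count / k

# Toy stand-in: a "model" that answers "4" with probability 0.8.
random.seed(0)
toy_model = lambda q: random.choices(["4", "5"], weights=[0.8, 0.2])[0]
cls, conf = sampling_confidence("2+2?", toy_model, collapse=str.strip)
```

Plotting such confidences against correctness (a reliability diagram) is the kind of check that would distinguish a calibrated base model from an instruction-tuned or chain-of-thought variant.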

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K retrieved core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: B-calibration framework for semantic calibration in LLMs

The authors introduce B-calibration, a formal framework that generalizes calibration to arbitrary equivalence classes defined by a collapsing function B. This framework enables rigorous analysis of semantic calibration by treating the LLM as inducing a classifier over semantic classes.

Contribution 2: Theoretical mechanism linking semantic calibration to local loss optimality

The authors establish a theoretical mechanism explaining emergent semantic calibration in base LLMs by connecting B-calibration to local loss optimality. They prove that B-calibration is equivalent to local loss optimality with respect to a corresponding perturbation family, and show when such perturbations are easy for autoregressive models to implement.

Contribution 3: Testable predictions about when semantic calibration emerges

The authors derive testable predictions from their theory, stating that base LLMs exhibit semantic calibration when they can predict their own semantic class distribution before generation. They validate three specific implications: base LLMs are semantically calibrated on question-answering tasks, instruction-tuning breaks this calibration, and chain-of-thought reasoning breaks calibration.