Post-hoc Probabilistic Vision-Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Uncertainty Quantification, Active Fine-Tuning, Bayesian Deep Learning, Vision-Language Models
Abstract:

Vision-language models (VLMs), such as CLIP and SigLIP, have achieved remarkable success in classification, retrieval, and generative tasks. To do so, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed via cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when the models are used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers of VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support-set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes BayesVLM, a post-hoc Bayesian uncertainty quantification method for vision-language models using Laplace approximation over final layers. It resides in the 'Post-hoc Probabilistic Embedding Approaches' leaf, which contains only three papers total (including this work and two siblings: Probabilistic Embeddings Frozen and Intra Class Probabilistic). This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting the specific combination of post-hoc Bayesian methods and VLM embeddings remains underexplored compared to calibration-focused or training-based probabilistic approaches.

The taxonomy reveals neighboring leaves focused on training-based probabilistic modeling (four papers) and hidden representation-based uncertainty (three papers), indicating alternative strategies for VLM uncertainty. The parent branch 'Uncertainty Estimation Methods and Frameworks' encompasses multiple approaches, while sibling branches address calibration techniques (13 papers across four leaves) and application-specific evaluation (17 papers). The scope note clarifies that post-hoc methods must avoid retraining, distinguishing this work from training-based probabilistic VLMs like those requiring fine-tuning or learned uncertainty predictors. The paper's analytical uncertainty propagation connects to semantic uncertainty quantification approaches but differs by operating directly on embedding distributions rather than output consistency.

Among 28 candidates examined across three contributions, only one refutable pair emerged. The core BayesVLM framework (Contribution 1) examined nine candidates with zero refutations, suggesting limited direct prior work on Laplace-approximated VLM embeddings. Contribution 2 (analytical cosine similarity distributions) examined nine candidates and found one potential overlap, indicating some existing work on uncertainty propagation in similarity metrics. Contribution 3 (active learning demonstrations) examined ten candidates without refutation, though this may reflect the application focus rather than methodological novelty. The limited search scope (top-K semantic matches plus citations) means these statistics capture nearby prior work but not exhaustive field coverage.

Based on the constrained literature search, the work appears to occupy a relatively novel position, combining post-hoc Bayesian inference with VLM embeddings. The sparse population of its taxonomy leaf and the low refutation rate across contributions suggest only limited overlap with existing methods, though the analytical uncertainty propagation has seen some prior exploration. The analysis covers semantically proximate papers but cannot confirm the absence of related work in adjacent research communities or in recent preprints outside the search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 1

Research Landscape Overview

Core task: post-hoc uncertainty quantification for vision-language models. The field addresses how to estimate and calibrate confidence in VLM predictions without retraining from scratch. The taxonomy reveals several complementary directions:

- Uncertainty Estimation Methods and Frameworks develop techniques to quantify model uncertainty through probabilistic embeddings, Bayesian approaches, and ensemble-like strategies;
- Confidence Calibration Techniques focus on adjusting model outputs to align predicted confidence with actual accuracy, often via temperature scaling or contrastive methods;
- Uncertainty-aware Applications and Evaluation explore how uncertainty estimates can guide downstream tasks and benchmarks;
- Hallucination Detection and Mitigation targets the specific problem of identifying and reducing spurious or unfaithful outputs;
- Related Vision-Language Model Topics cover broader VLM concerns such as robustness and modality alignment.

Representative works like ProbVLM[11] and Multimodal Uncertainty Encoders[28] illustrate early probabilistic embedding strategies, while Calibrated Robust Finetuning[4] and Attentional Vision Calibration[3] exemplify calibration-focused methods. A particularly active line of work centers on post-hoc probabilistic embedding approaches, which retrofit uncertainty into frozen or minimally adapted VLMs. Posthoc Probabilistic VLM[0] sits squarely in this cluster, alongside Probabilistic Embeddings Frozen[15] and Intra Class Probabilistic[37], all aiming to capture distributional information in embedding space without extensive retraining. This contrasts with calibration-centric methods like Attentional Vision Calibration[3] or Unveiling Uncertainty[5], which adjust confidence scores but may not model full distributional uncertainty.
A key trade-off is between computational efficiency and expressiveness: probabilistic embeddings can capture richer uncertainty but require careful design to remain tractable, while calibration techniques are often simpler yet may not address epistemic uncertainty as directly. Open questions include how to best integrate these uncertainty estimates into real-world applications, balance calibration with hallucination mitigation, and evaluate uncertainty quality across diverse VLM architectures and tasks.
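As a point of contrast with the probabilistic-embedding approaches, the temperature scaling mentioned above is a one-parameter post-hoc calibration. The sketch below illustrates the idea only; the fitting of `T` on held-out data is omitted, and the function name is our own.

```python
import numpy as np

def temperature_scale(logits, T):
    """Post-hoc calibration: divide logits by a scalar temperature T > 0
    fitted on a held-out set; T > 1 softens overconfident predictions."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Note that this rescales confidence without changing the predicted class, which is why calibration alone cannot express the epistemic uncertainty that the probabilistic-embedding methods target.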

Claimed Contributions

BayesVLM: Post-hoc probabilistic vision-language models using Laplace approximation

The authors introduce BayesVLM, a training-free post-hoc uncertainty quantification method for vision-language models that leverages Laplace approximation over the last layers of VLM encoders. This approach enables uncertainty estimation without requiring architectural modifications, retraining, or additional training procedures.

Retrieved papers compared: 9

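The last-layer Laplace idea behind this contribution can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the diagonal (empirical-Fisher) curvature approximation, and the isotropic Gaussian prior are simplifying assumptions made here for illustration.

```python
import numpy as np

def diagonal_laplace_posterior(per_example_grads, w_map, prior_precision=1.0):
    """Diagonal Laplace approximation around the MAP weights of a last layer.

    per_example_grads: (N, D) gradients of the loss w.r.t. the flattened
    last-layer weights, one row per training example. The squared-gradient
    sum (empirical Fisher) stands in for the Hessian diagonal.
    Returns the posterior mean (= w_map) and diagonal posterior variance.
    """
    fisher_diag = (per_example_grads ** 2).sum(axis=0)
    posterior_var = 1.0 / (fisher_diag + prior_precision)
    return w_map, posterior_var
```

Because only the last layer is treated probabilistically and the curvature is estimated after the fact, no retraining of the VLM is needed, which matches the training-free claim.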
Analytical distribution over cosine similarities for efficient uncertainty propagation

The authors develop a novel Bayesian formulation for VLMs by introducing independent probabilistic models for each modality and deriving a closed-form Gaussian approximation (ProbCosine) of the distribution over cosine similarities. This enables efficient propagation of uncertainties from model parameters to VLM outputs.

Retrieved papers compared: 9 (1 can refute)

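The closed-form propagation described above can be sketched for the simpler case of an inner product between independent diagonal-Gaussian embeddings; the paper's ProbCosine additionally handles the normalization in the cosine, which is omitted here. `prob_dot` and the assumption of pre-normalized embedding means are illustrative.

```python
import numpy as np

def prob_dot(mu_img, var_img, mu_txt, var_txt):
    """Mean and variance of s = z_img . z_txt for independent Gaussians
    z_img ~ N(mu_img, diag(var_img)) and z_txt ~ N(mu_txt, diag(var_txt)).
    Both moments are exact for the inner product under independence."""
    mean = mu_img @ mu_txt
    var = np.sum(mu_img**2 * var_txt + mu_txt**2 * var_img + var_img * var_txt)
    return mean, var
```

When both variances are zero this reduces to the deterministic similarity, so the probabilistic score strictly generalizes the standard VLM score.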
Demonstration of BayesVLM effectiveness in zero-shot classification and active learning

The authors empirically validate BayesVLM across multiple benchmarks, showing improved calibration and uncertainty estimates in zero-shot classification tasks, and demonstrating sample-efficient active learning through uncertainty-based data selection using BALD and EPIG acquisition functions.

Retrieved papers compared: 10
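The BALD acquisition function mentioned above is standard and can be computed from Monte Carlo samples of class probabilities. The sketch below is generic, not the paper's code, and the procedure that produces `prob_samples` (e.g., drawing last-layer weights from the approximate posterior) is assumed.

```python
import numpy as np

def bald_scores(prob_samples, eps=1e-12):
    """BALD mutual information from posterior samples of class probabilities.

    prob_samples: (S, N, C) probabilities for S posterior draws, N candidate
    inputs, C classes. Score = H[E_s p] - E_s H[p]; high scores mark inputs
    on which the posterior draws disagree (epistemic uncertainty)."""
    mean_p = prob_samples.mean(axis=0)
    entropy_of_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)
    mean_entropy = -(prob_samples * np.log(prob_samples + eps)).sum(axis=-1).mean(axis=0)
    return entropy_of_mean - mean_entropy
```

Inputs with the highest scores would be the ones selected for labeling in the active-learning loop.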

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
