Contextual Similarity Distillation: Ensemble Uncertainties with a Single Model

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Uncertainty Quantification, Epistemic Uncertainty, Reinforcement Learning, Deep Ensembles, Exploration, Neural Tangent Kernel
Abstract:

Uncertainty quantification is a critical aspect of reinforcement learning and deep learning, with numerous applications ranging from efficient exploration and stable offline reinforcement learning to outlier detection in medical diagnostics. The scale of modern neural networks, however, complicates the use of many theoretically well-motivated approaches such as full Bayesian inference. Approximate methods like deep ensembles can provide reliable uncertainty estimates but remain computationally expensive. In this work, we propose contextual similarity distillation, a novel approach that explicitly estimates the variance of an ensemble of deep neural networks with a single model, without ever learning or evaluating such an ensemble in the first place. Our method builds on the predictable learning dynamics of wide neural networks, governed by the neural tangent kernel, to derive an efficient approximation of the predictive variance of an infinite ensemble. Specifically, we reinterpret the computation of ensemble variance as a supervised regression problem with kernel similarities as regression targets. The resulting model can estimate predictive variance at inference time with a single forward pass, and can make use of unlabeled target-domain data or data augmentations to refine its uncertainty estimates. We empirically validate our method across a variety of out-of-distribution detection benchmarks and sparse-reward reinforcement learning environments. We find that our single-model method performs competitively with, and sometimes better than, ensemble-based baselines, and serves as a reliable signal for efficient exploration. These results, we believe, position contextual similarity distillation as a principled and scalable alternative for uncertainty quantification in reinforcement learning and general deep learning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes the tasks and contributions of academic papers against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes contextual similarity distillation (CSD), a method to estimate ensemble variance using a single model without training or evaluating the ensemble. It resides in the 'Ensemble Distillation and Approximation' leaf, which contains only two papers including this one. This sparse population suggests the specific approach of distilling ensemble variance via kernel-based regression targets is relatively underexplored. The taxonomy shows ensemble-based uncertainty quantification is a well-established branch, but the distillation subfield remains narrow compared to broader Bayesian or deterministic categories.

The taxonomy reveals neighboring leaves include 'Deep Ensemble Methods' (full ensemble training) and 'Bayesian Neural Networks' (probabilistic weight inference). CSD bridges these directions by leveraging neural tangent kernel theory—typically associated with theoretical foundations—to approximate ensemble behavior without Bayesian sampling or multiple training runs. The 'Deterministic and Single-Pass Uncertainty Estimation' branch offers alternative efficiency strategies (feature-based metrics, learned confidence), but CSD's kernel-similarity regression formulation diverges by explicitly targeting ensemble variance rather than implicit confidence proxies. This positioning suggests the work synthesizes theoretical insights with practical ensemble approximation goals.

Among 25 candidates examined, the theoretical framework based on neural tangent kernel shows overlap with prior work (3 refutable candidates out of 10 examined for this contribution). The CSD method itself and the contextualized regression formulation appear more novel within this limited search scope (0 refutable candidates across 15 examined). The statistics indicate that while the kernel-theoretic foundation connects to existing literature, the specific distillation mechanism and unlabeled-data regression strategy have less direct precedent among the top-25 semantically similar papers. This pattern suggests incremental theoretical grounding combined with a more distinctive methodological contribution.

Based on the limited search scope (25 candidates from semantic retrieval), the work appears to occupy a sparsely populated niche within ensemble approximation. The taxonomy context confirms that distillation-based ensemble compression is less crowded than full ensemble or Bayesian methods. However, the analysis does not cover exhaustive citation networks or domain-specific ensemble literature, so the novelty assessment remains provisional. The kernel-theoretic overlap suggests the work builds on established theory while introducing a new application pathway for variance estimation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: uncertainty quantification in deep neural networks. The field has matured into a rich landscape organized around several complementary strategies. Foundational frameworks and comprehensive reviews (e.g., Survey Uncertainty Deep Networks[1], Uncertainty Quantification Deep Learning[2][3], Survey Uncertainty Quantification Methods[5]) establish the conceptual underpinnings, distinguishing epistemic uncertainty (model ignorance) from aleatoric uncertainty (inherent data noise). Bayesian and probabilistic approaches leverage posterior inference to capture model uncertainty, while ensemble-based methods aggregate predictions from multiple models or training runs to estimate epistemic confidence. Deterministic and single-pass techniques offer computational efficiency by extracting uncertainty from a single forward pass, and calibration methods refine raw confidence scores to align with empirical accuracy. Specialized branches address distribution shift and out-of-distribution detection, multi-fidelity modeling that fuses information across data sources (e.g., Aleatory Multi-Fidelity[4], Multi-Fidelity Analysis[17]), and domain-specific applications spanning medical imaging, autonomous systems, and scientific computing.

Within the ensemble-based branch, a central tension exists between the high fidelity of full ensembles and the computational cost they impose at inference time. Ensemble distillation and approximation methods seek to compress ensemble knowledge into a single, efficient model while preserving uncertainty estimates. Contextual Similarity Distillation[0] exemplifies this direction by distilling ensemble predictions through contextual similarity mechanisms, aiming to retain predictive diversity without maintaining multiple networks. This contrasts with Single Model Estimation[33], which pursues uncertainty from a lone model via architectural or training innovations, and with test-time augmentation strategies (Test-Time Augmentation[32]) that generate pseudo-ensembles on the fly. The original work thus occupies a pragmatic middle ground: it inherits the representational richness of ensembles yet targets deployment scenarios where resource constraints favor a streamlined architecture.

Claimed Contributions

Contextual Similarity Distillation (CSD) method

The authors introduce a method that approximates the predictive variance of an infinite ensemble of neural networks using only a single model. CSD reframes ensemble variance computation as a supervised regression problem where labels correspond to kernel similarities, enabling efficient uncertainty quantification without training multiple models.

10 retrieved papers
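The reframing this contribution claims can be illustrated with a toy sketch. This is not the authors' implementation: it substitutes an RBF kernel for the NTK and computes the infinite-ensemble variance in closed form, which is the quantity CSD would instead distill into a single network using kernel similarities as regression targets.

```python
import numpy as np

# Toy stand-in for the NTK (the paper uses the neural tangent kernel;
# an RBF kernel is substituted here purely for illustration).
def kernel(a, b, lengthscale=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                # training inputs
K = kernel(X, X) + 1e-6 * np.eye(len(X))    # regularized Gram matrix

def ensemble_variance(queries):
    # Posterior variance of the kernel limit of an infinite ensemble:
    #   k(x, x) - k(x, X) K^{-1} k(X, x).
    # The subtracted term is itself a kernel regression whose targets are
    # the similarities k(x_i, x) -- the quantity a CSD-style model learns
    # to predict, so that one forward pass yields the variance estimate.
    k_q = kernel(queries, X)                                  # (m, n)
    explained = np.einsum("mn,mn->m", k_q, np.linalg.solve(K, k_q.T).T)
    return kernel(queries, queries).diagonal() - explained

v_in = ensemble_variance(X[:5])                   # near the training data
v_out = ensemble_variance(np.full((5, 2), 10.0))  # far from it
print(v_in.max() < v_out.min())
```

As expected for an epistemic-uncertainty signal, variance collapses on training points and saturates far from the data.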
Theoretical framework based on Neural Tangent Kernel

The authors develop a theoretical foundation grounded in the Neural Tangent Kernel (NTK) theory to derive an analytical expression for ensemble uncertainties. This framework characterizes deep ensembles through the NTK Gaussian Process and enables the derivation of their single-model approximation method.

10 retrieved papers (can refute)
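For background, NTK theory gives a closed-form predictive variance for the infinite-width limit of an ensemble trained by gradient descent. The analytical expression the authors derive is presumably of this general form; it is shown here as standard background, not as the paper's exact result:

```latex
% Predictive variance of the NTK Gaussian process at a query x_*,
% given training inputs X and neural tangent kernel \Theta:
\sigma^2(x_*) \;=\; \Theta(x_*, x_*)
  \;-\; \Theta(x_*, X)\,\Theta(X, X)^{-1}\,\Theta(X, x_*)
```

The second term is a kernel regression evaluated at $x_*$, which is what makes it natural to recast variance estimation as supervised regression on similarity targets.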
Contextualized regression formulation with unlabeled data

The authors formulate a contextualized regression model that extends their approach to work efficiently for arbitrary query points. This formulation enables the method to leverage unlabeled data from target domains or data augmentations to improve uncertainty estimates, a capability not easily incorporated in standard ensemble methods.

5 retrieved papers
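The role of unlabeled data can be sketched in the same toy setting (again with an RBF stand-in kernel, not the authors' implementation): because the variance expression involves only inputs, unlabeled target-domain points are admissible context, and conditioning on them shrinks the uncertainty estimate in their neighborhood.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def variance(context, queries, jitter=1e-6):
    # Kernel posterior variance given a context set. No labels appear,
    # so unlabeled target-domain points can be added to the context.
    K = rbf(context, context) + jitter * np.eye(len(context))
    k_q = rbf(queries, context)
    explained = np.einsum("mn,mn->m", k_q, np.linalg.solve(K, k_q.T).T)
    return rbf(queries, queries).diagonal() - explained

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 2))               # labeled source data
X_unlab = rng.normal(loc=5.0, size=(20, 2))      # unlabeled target-domain data
queries = rng.normal(loc=5.0, size=(5, 2))       # queries in the target domain

v_before = variance(X_train, queries)
v_after = variance(np.vstack([X_train, X_unlab]), queries)
print(v_after.mean() < v_before.mean())
```

Enlarging the context with unlabeled target-domain points lowers the variance estimate at every target-domain query, mirroring the refinement capability claimed for this contribution.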

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Contextual Similarity Distillation (CSD) method

Contribution

Theoretical framework based on Neural Tangent Kernel

Contribution

Contextualized regression formulation with unlabeled data