Contextual Similarity Distillation: Ensemble Uncertainties with a Single Model

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Uncertainty Quantification, Epistemic Uncertainty, Reinforcement Learning, Deep Ensembles, Exploration, Neural Tangent Kernel
Abstract:

Uncertainty quantification is a critical aspect of reinforcement learning and deep learning, with numerous applications ranging from efficient exploration and stable offline reinforcement learning to outlier detection in medical diagnostics. The scale of modern neural networks, however, complicates the use of many theoretically well-motivated approaches such as full Bayesian inference. Approximate methods like deep ensembles can provide reliable uncertainty estimates but remain computationally expensive. In this work, we propose contextual similarity distillation, a novel approach that explicitly estimates the variance of an ensemble of deep neural networks with a single model, without ever learning or evaluating such an ensemble in the first place. Our method builds on the predictable learning dynamics of wide neural networks, governed by the neural tangent kernel, to derive an efficient approximation of the predictive variance of an infinite ensemble. Specifically, we reinterpret the computation of ensemble variance as a supervised regression problem with kernel similarities as regression targets. The resulting model can estimate predictive variance at inference time with a single forward pass, and can make use of unlabeled target-domain data or data augmentations to refine its uncertainty estimates. We empirically validate our method across a variety of out-of-distribution detection benchmarks and sparse-reward reinforcement learning environments. We find that our single-model method performs competitively with, and sometimes better than, ensemble-based baselines, and serves as a reliable signal for efficient exploration. These results, we believe, position contextual similarity distillation as a principled and scalable alternative for uncertainty quantification in reinforcement learning and general deep learning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes the tasks and contributions of academic papers against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes contextual similarity distillation (CSD), a method to estimate ensemble variance using a single model without training or evaluating the ensemble. It resides in the 'Ensemble Distillation and Approximation' leaf, which contains only two papers including this one. This sparse population suggests the specific approach of distilling ensemble variance via kernel-based regression targets is relatively underexplored. The taxonomy shows ensemble-based uncertainty quantification is a well-established branch, but the distillation subfield remains narrow compared to broader Bayesian or deterministic categories.

The taxonomy reveals neighboring leaves include 'Deep Ensemble Methods' (full ensemble training) and 'Bayesian Neural Networks' (probabilistic weight inference). CSD bridges these directions by leveraging neural tangent kernel theory—typically associated with theoretical foundations—to approximate ensemble behavior without Bayesian sampling or multiple training runs. The 'Deterministic and Single-Pass Uncertainty Estimation' branch offers alternative efficiency strategies (feature-based metrics, learned confidence), but CSD's kernel-similarity regression formulation diverges by explicitly targeting ensemble variance rather than implicit confidence proxies. This positioning suggests the work synthesizes theoretical insights with practical ensemble approximation goals.

Among 25 candidates examined, the theoretical framework based on neural tangent kernel shows overlap with prior work (3 refutable candidates out of 10 examined for this contribution). The CSD method itself and the contextualized regression formulation appear more novel within this limited search scope (0 refutable candidates across 15 examined). The statistics indicate that while the kernel-theoretic foundation connects to existing literature, the specific distillation mechanism and unlabeled-data regression strategy have less direct precedent among the top-25 semantically similar papers. This pattern suggests incremental theoretical grounding combined with a more distinctive methodological contribution.

Based on the limited search scope (25 candidates from semantic retrieval), the work appears to occupy a sparsely populated niche within ensemble approximation. The taxonomy context confirms that distillation-based ensemble compression is less crowded than full ensemble or Bayesian methods. However, the analysis does not cover exhaustive citation networks or domain-specific ensemble literature, so the novelty assessment remains provisional. The kernel-theoretic overlap suggests the work builds on established theory while introducing a new application pathway for variance estimation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: uncertainty quantification in deep neural networks. The field has matured into a rich landscape organized around several complementary strategies. Foundational frameworks and comprehensive reviews (e.g., Survey Uncertainty Deep Networks[1], Uncertainty Quantification Deep Learning[2][3], Survey Uncertainty Quantification Methods[5]) establish the conceptual underpinnings, distinguishing epistemic uncertainty (model ignorance) from aleatoric uncertainty (inherent data noise). Bayesian and probabilistic approaches leverage posterior inference to capture model uncertainty, while ensemble-based methods aggregate predictions from multiple models or training runs to estimate epistemic confidence. Deterministic and single-pass techniques offer computational efficiency by extracting uncertainty from a single forward pass, and calibration methods refine raw confidence scores to align with empirical accuracy. Specialized branches address distribution shift and out-of-distribution detection, multi-fidelity modeling that fuses information across data sources (e.g., Aleatory Multi-Fidelity[4], Multi-Fidelity Analysis[17]), and domain-specific applications spanning medical imaging, autonomous systems, and scientific computing.

Within the ensemble-based branch, a central tension exists between the high fidelity of full ensembles and the computational cost they impose at inference time. Ensemble distillation and approximation methods seek to compress ensemble knowledge into a single, efficient model while preserving uncertainty estimates. Contextual Similarity Distillation[0] exemplifies this direction by distilling ensemble predictions through contextual similarity mechanisms, aiming to retain predictive diversity without maintaining multiple networks. This contrasts with Single Model Estimation[33], which pursues uncertainty from a lone model via architectural or training innovations, and with test-time augmentation strategies (Test-Time Augmentation[32]) that generate pseudo-ensembles on the fly. The original work thus occupies a pragmatic middle ground: it inherits the representational richness of ensembles yet targets deployment scenarios where resource constraints favor a streamlined architecture.

Claimed Contributions

Contextual Similarity Distillation (CSD) method

The authors introduce a method that approximates the predictive variance of an infinite ensemble of neural networks using only a single model. CSD reframes ensemble variance computation as a supervised regression problem where labels correspond to kernel similarities, enabling efficient uncertainty quantification without training multiple models.

10 retrieved papers
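The reframing this contribution claims can be illustrated with a toy sketch. This is not the authors' implementation: it substitutes an RBF kernel for the NTK and computes the infinite-ensemble variance in closed form, which is the quantity CSD would instead distill into a single network using kernel similarities as regression targets.

```python
import numpy as np

# Toy stand-in for the NTK (the paper uses the neural tangent kernel;
# an RBF kernel is substituted here purely for illustration).
def kernel(a, b, lengthscale=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                # training inputs
K = kernel(X, X) + 1e-6 * np.eye(len(X))    # regularized Gram matrix

def ensemble_variance(queries):
    # Posterior variance of the kernel limit of an infinite ensemble:
    #   k(x, x) - k(x, X) K^{-1} k(X, x).
    # The subtracted term is itself a kernel regression whose targets are
    # the similarities k(x_i, x) -- the quantity a CSD-style model learns
    # to predict, so that one forward pass yields the variance estimate.
    k_q = kernel(queries, X)                                  # (m, n)
    explained = np.einsum("mn,mn->m", k_q, np.linalg.solve(K, k_q.T).T)
    return kernel(queries, queries).diagonal() - explained

v_in = ensemble_variance(X[:5])                   # near the training data
v_out = ensemble_variance(np.full((5, 2), 10.0))  # far from it
print(v_in.max() < v_out.min())
```

As expected for an epistemic-uncertainty signal, variance collapses on training points and saturates far from the data.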
Theoretical framework based on Neural Tangent Kernel

The authors develop a theoretical foundation grounded in the Neural Tangent Kernel (NTK) theory to derive an analytical expression for ensemble uncertainties. This framework characterizes deep ensembles through the NTK Gaussian Process and enables the derivation of their single-model approximation method.

10 retrieved papers (can refute)
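For background, NTK theory gives a closed-form predictive variance for the infinite-width limit of an ensemble trained by gradient descent. The analytical expression the authors derive is presumably of this general form; it is shown here as standard background, not as the paper's exact result:

```latex
% Predictive variance of the NTK Gaussian process at a query x_*,
% given training inputs X and neural tangent kernel \Theta:
\sigma^2(x_*) \;=\; \Theta(x_*, x_*)
  \;-\; \Theta(x_*, X)\,\Theta(X, X)^{-1}\,\Theta(X, x_*)
```

The second term is a kernel regression evaluated at $x_*$, which is what makes it natural to recast variance estimation as supervised regression on similarity targets.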
Contextualized regression formulation with unlabeled data

The authors formulate a contextualized regression model that extends their approach to work efficiently for arbitrary query points. This formulation enables the method to leverage unlabeled data from target domains or data augmentations to improve uncertainty estimates, a capability not easily incorporated in standard ensemble methods.

5 retrieved papers
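The role of unlabeled data can be sketched in the same toy setting (again with an RBF stand-in kernel, not the authors' implementation): because the variance expression involves only inputs, unlabeled target-domain points are admissible context, and conditioning on them shrinks the uncertainty estimate in their neighborhood.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def variance(context, queries, jitter=1e-6):
    # Kernel posterior variance given a context set. No labels appear,
    # so unlabeled target-domain points can be added to the context.
    K = rbf(context, context) + jitter * np.eye(len(context))
    k_q = rbf(queries, context)
    explained = np.einsum("mn,mn->m", k_q, np.linalg.solve(K, k_q.T).T)
    return rbf(queries, queries).diagonal() - explained

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 2))               # labeled source data
X_unlab = rng.normal(loc=5.0, size=(20, 2))      # unlabeled target-domain data
queries = rng.normal(loc=5.0, size=(5, 2))       # queries in the target domain

v_before = variance(X_train, queries)
v_after = variance(np.vstack([X_train, X_unlab]), queries)
print(v_after.mean() < v_before.mean())
```

Enlarging the context with unlabeled target-domain points lowers the variance estimate at every target-domain query, mirroring the refinement capability claimed for this contribution.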

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Contextual Similarity Distillation (CSD) method

Contribution

Theoretical framework based on Neural Tangent Kernel

Contribution

Contextualized regression formulation with unlabeled data