SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: knowledge distillation, SGD-based learning, Bayesian machine learning
Abstract:

Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: (i) when the teacher provides the exact Bayes Class Probabilities (BCPs); and (ii) when the teacher provides noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.
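The soft-label supervision the abstract describes can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function names are ours, and the only assumption is the standard KD ingredient that the student minimizes cross-entropy against the teacher's probability vector rather than a one-hot label.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_cross_entropy(student_logits, teacher_probs):
    # soft-label cross-entropy: -sum_c q_c * log p_c, averaged over the batch;
    # teacher_probs plays the role of the (possibly approximate) BCPs
    log_p = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(np.sum(teacher_probs * log_p, axis=-1))

def one_hot_cross_entropy(student_logits, labels):
    # standard hard-label cross-entropy, for comparison
    log_p = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(log_p[np.arange(len(labels)), labels])
```

When the teacher distribution degenerates to a one-hot vector, the two losses coincide, which is why one-hot training is the natural baseline in the paper's comparison.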

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Paper: 1

Research Landscape Overview

Core task: convergence analysis of knowledge distillation with probabilistic supervision. The field structure reflects a multi-faceted investigation into how student models learn from teacher distributions rather than hard labels. The taxonomy organizes work into theoretical convergence analysis and optimization dynamics, divergence measures and loss functions, student-teacher dynamics and deviations, training strategies and curriculum learning, and application-specific implementations.

Theoretical branches examine the mathematical foundations of distillation convergence, including Bayesian and probabilistic teacher models that provide uncertainty-aware supervision. Divergence measures explore how different loss functions, such as the KL divergence variants studied in Rethinking KL Divergence[1], affect learning dynamics. Student-teacher dynamics investigate mismatches and deviations, as seen in Student Teacher Deviations[7], while training strategies address curriculum design and annealing approaches like Annealing Distillation[9]. Application branches demonstrate these principles in domains such as fault diagnosis, exemplified by Lightweight Bearing Fault[2].

Particularly active lines of work contrast deterministic versus probabilistic teacher supervision, examining trade-offs between convergence guarantees and practical performance. Distributed settings, explored in Distributed Distillation[3], raise questions about how decentralized training affects convergence properties. Variance reduction techniques, such as those in Partial Variance Reduction[5] and Stochastic Polyak Distillation[4], address optimization stability. The original paper, SGD Bayesian Distillation[0], sits within the theoretical convergence branch focusing on Bayesian teacher models. It shares thematic ground with works analyzing probabilistic supervision but emphasizes rigorous SGD convergence analysis under Bayesian uncertainty, contrasting with more heuristic approaches in Probability Distillation Caveat[6] or noise-focused studies like Random Label Noises[8]. This positioning highlights a growing interest in formal guarantees for distillation with uncertain teachers.

Claimed Contributions

Convergence analysis of SGD-based knowledge distillation with Bayesian class probability supervision

The authors provide a theoretical analysis of how students trained via SGD converge when supervised with exact Bayes Class Probabilities versus noisy approximations. They show that learning from BCPs yields variance reduction and removes neighborhood terms in convergence bounds compared to one-hot supervision.

6 retrieved papers
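The variance-reduction claim in this contribution has a simple mechanical intuition that can be checked numerically. For cross-entropy, the per-sample gradient with respect to the logits is the prediction minus the target; with exact BCP targets that gradient is deterministic, whereas one-hot labels drawn from the same BCPs add label-sampling noise around the same mean. The simulation below is our illustration of that effect, not the paper's analysis; the specific probability values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# one input with fixed true Bayes class probabilities (illustrative values)
q = np.array([0.6, 0.3, 0.1])
z = np.array([0.2, 0.0, -0.2])      # current student logits
p = softmax(z)

# gradient of cross-entropy w.r.t. logits is (prediction - target)
grad_bcp = p - q                    # BCP supervision: deterministic, no label noise

# one-hot supervision: labels drawn from q inject sampling noise
n = 10_000
labels = rng.choice(3, size=n, p=q)
grads_onehot = p[None, :] - np.eye(3)[labels]

# same mean gradient, but strictly positive variance under one-hot labels
mean_onehot = grads_onehot.mean(axis=0)
var_onehot = grads_onehot.var(axis=0).sum()
```

In expectation the two supervision signals drive the student toward the same point; the difference the convergence bounds capture is the extra gradient noise, here summing to roughly `sum_c q_c (1 - q_c)`.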
Advocacy for Bayesian deep learning models as teachers in knowledge distillation

Motivated by their theoretical findings that teacher calibration affects student performance, the authors propose using Bayesian neural networks as teachers because they provide better-calibrated probability estimates that more faithfully approximate the true BCPs.

10 retrieved papers
Can Refute
Characterization of interpolation property and gradient noise under BCP supervision

The authors prove that when students are supervised with true BCPs, the optimization task satisfies the interpolation property, meaning the minimizer matches true BCPs at each sample. They also characterize how gradient noise depends on the quality of BCP estimates.

3 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Convergence analysis of SGD-based knowledge distillation with Bayesian class probability supervision

The authors provide a theoretical analysis of how students trained via SGD converge when supervised with exact Bayes Class Probabilities versus noisy approximations. They show that learning from BCPs yields variance reduction and removes neighborhood terms in convergence bounds compared to one-hot supervision.

Contribution

Advocacy for Bayesian deep learning models as teachers in knowledge distillation

Motivated by their theoretical findings that teacher calibration affects student performance, the authors propose using Bayesian neural networks as teachers because they provide better-calibrated probability estimates that more faithfully approximate the true BCPs.

Contribution

Characterization of interpolation property and gradient noise under BCP supervision

The authors prove that when students are supervised with true BCPs, the optimization task satisfies the interpolation property, meaning the minimizer matches true BCPs at each sample. They also characterize how gradient noise depends on the quality of BCP estimates.