SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors provide a theoretical analysis of how students trained via SGD converge when supervised with exact Bayes class probabilities (BCPs) versus noisy approximations of them. They show that learning from exact BCPs reduces gradient variance and removes the convergence-neighborhood terms that appear in SGD bounds under one-hot supervision.
Motivated by their theoretical finding that teacher calibration governs student performance, the authors propose Bayesian neural networks as teachers, because their predictive distributions are better calibrated and therefore approximate the true BCPs more faithfully.
The authors prove that when students are supervised with the true BCPs, the resulting optimization problem satisfies the interpolation property: a single minimizer matches the true BCPs at every sample simultaneously. They also characterize how the gradient noise scales with the quality of the teacher's BCP estimates.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Convergence analysis of SGD-based knowledge distillation with Bayesian class probability supervision
The authors provide a theoretical analysis of how students trained via SGD converge when supervised with exact Bayes class probabilities (BCPs) versus noisy approximations of them. They show that learning from exact BCPs reduces gradient variance and removes the convergence-neighborhood terms that appear in SGD bounds under one-hot supervision.
[8] Stochastic gradient descent with random label noises: doubly stochastic models and inference stabilizer
[21] Towards understanding knowledge distillation
[22] Agree to disagree: Adaptive ensemble knowledge distillation in gradient space
[23] Classification accuracy improvement of the optical diffractive deep neural network by employing a knowledge distillation and stochastic gradient descent β-Lasso joint …
[24] Probabilistic Self-supervised Learning via Scoring Rules Minimization
[25] Evidential Knowledge Distillation
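The variance-reduction claim can be illustrated with a minimal numpy sketch (this is an illustration, not the paper's actual analysis; the probabilities p and q below are hypothetical). For a fixed input, the binary cross-entropy gradient with respect to the logit is q − y under one-hot labels and q − p under BCP supervision: both have the same mean q − p, but only the one-hot version carries label-sampling variance p(1 − p).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one fixed input with true Bayes class probability p.
p = 0.7          # true P(y=1 | x), the BCP (assumed value)
q = 0.5          # current student prediction, sigmoid of its logit (assumed)
n = 100_000      # number of sampled supervision signals

# One-hot supervision: labels are Bernoulli(p) draws; the per-sample
# cross-entropy gradient w.r.t. the logit is (q - y).
y = rng.binomial(1, p, size=n)
grad_onehot = q - y

# BCP supervision: the target is p itself, so for this fixed x the
# gradient is the constant (q - p) with no label-sampling noise.
grad_bcp = np.full(n, q - p)

# Both estimators share the same mean gradient (q - p), but the one-hot
# version adds variance p * (1 - p) from label sampling.
print(grad_onehot.mean(), grad_bcp.mean())   # both ≈ q - p = -0.2
print(grad_onehot.var(), grad_bcp.var())     # ≈ p*(1-p) = 0.21 vs ≈ 0
```

This extra p(1 − p) term is the gradient noise that produces the convergence-neighborhood terms under one-hot supervision and vanishes under exact-BCP supervision.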
Advocacy for Bayesian deep learning models as teachers in knowledge distillation
Motivated by their theoretical finding that teacher calibration governs student performance, the authors propose Bayesian neural networks as teachers, because their predictive distributions are better calibrated and therefore approximate the true BCPs more faithfully.
[18] A statistical perspective on distillation
[11] Modeling rapid language learning by distilling Bayesian priors into artificial neural networks
[12] Efficient Uncertainty Estimation via Distillation of Bayesian Large Language Models
[13] Bayesian knowledge distillation for online action detection
[14] Uncertainty-based knowledge distillation for Bayesian deep neural network compression
[15] Bayesian knowledge distillation: A Bayesian perspective of distillation with uncertainty quantification
[16] Bayesian evidential deep learning for online action detection
[17] Bayes conditional distribution estimation for knowledge distillation based on conditional mutual information
[19] BayesKD: Bayesian knowledge distillation for compact LLMs in constrained fine-tuning scenarios
[20] Knowledge Distillation and Its Application to Network Traffic Classification
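How a Bayesian teacher produces its distillation targets can be sketched in a few lines of numpy (a hedged illustration under assumed values, not the paper's implementation): the teacher's posterior over weights yields multiple logit samples per input (e.g. via MC dropout or an ensemble), and the posterior-predictive average of the resulting probabilities serves as the soft target approximating the BCPs.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical posterior over teacher weights, represented here by S
# sampled logit vectors for one input (e.g. from MC dropout or an ensemble);
# the location and scale values are assumptions for illustration.
S, num_classes = 32, 3
posterior_logits = rng.normal(loc=[2.0, 0.5, -1.0], scale=1.0,
                              size=(S, num_classes))

# Bayesian model average: average the *probabilities*, not the logits.
# This posterior-predictive estimate is the soft target handed to the
# student as an approximation of the BCPs.
teacher_targets = softmax(posterior_logits).mean(axis=0)

# A single deterministic pass (one posterior sample) is typically more
# overconfident than the averaged predictive distribution.
single_pass = softmax(posterior_logits[0])
print(teacher_targets, single_pass)
```

Averaging probabilities rather than logits matters here: the posterior-predictive mean is what calibration is measured against, which is why Bayesian teachers tend to give better-calibrated soft targets than a single deterministic network.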
Characterization of interpolation property and gradient noise under BCP supervision
The authors prove that when students are supervised with the true BCPs, the resulting optimization problem satisfies the interpolation property: a single minimizer matches the true BCPs at every sample simultaneously. They also characterize how the gradient noise scales with the quality of the teacher's BCP estimates.
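The interpolation property can be made concrete with a small numpy sketch (hypothetical BCP values; a simplified illustration rather than the paper's proof): since the cross-entropy gradient with respect to the logits is q − p, a student that outputs the BCPs has zero gradient on every individual sample, not merely zero gradient on average, so the SGD noise vanishes at the minimizer.

```python
import numpy as np

# Two samples with different Bayes class probabilities (assumed values).
bcps = np.array([[0.9, 0.1],
                 [0.3, 0.7]])

def ce_grad_wrt_logits(q, p):
    # Gradient of cross-entropy H(p, softmax(z)) w.r.t. the logits z.
    return q - p

# A student whose predicted probabilities equal the BCPs has zero
# gradient on *every* sample simultaneously: the interpolation property.
student_probs = bcps.copy()
per_sample_grads = ce_grad_wrt_logits(student_probs, bcps)
print(per_sample_grads)   # all zeros: SGD noise vanishes at the optimum

# By contrast, with stochastic one-hot labels drawn from non-degenerate
# BCPs, no single prediction zeroes the gradient for every realized
# label, which is why one-hot bounds retain a neighborhood term.
```

When the teacher supplies only a noisy estimate of the BCPs, the residual q − p̂ no longer vanishes exactly at the true minimizer, which is how the gradient noise inherits the quality of the BCP estimate.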