Knowledge Distillation by On-the-Fly Native Ensemble

Xu Lan (1), Xiatian Zhu (2), Shaogang Gong (1)
(1) Queen Mary University of London, London, UK
(2) Vision Semantics Ltd

1. Introduction

Drawbacks of hard-label based cross-entropy training:
- It considers no correlation between classes (each sample carries only a one-hot target).
- It is prone to model overfitting.

Solution: knowledge distillation [1], which supervises the target model with soft class predictions from a teacher model in addition to the hard labels.

Figure 1: Knowledge distillation vs. vanilla training. (a) Vanilla strategy: the target model is trained on the training samples (TS) with ground-truth labels (L), producing predictions (P). (b) Knowledge distillation: the predictions of a separately trained teacher model provide extra supervision for training the target model.

Limitations of offline knowledge distillation [1]:
- Lengthy training time.
- Complex multi-stage training process.
- Possible teacher model overfitting.

Limitations of online knowledge distillation (e.g. deep mutual learning [2]):
- Provides limited extra supervision information.
- Still needs to train multiple models.
- Complex asynchronous model updating.

2. Methodology

Figure 2: Overview of online distillation training of ResNet-110 by the proposed On-the-Fly Native Ensemble (ONE). ONE reconfigures the network by adding m auxiliary branches: the low-level layers (Conv1, Res2X, Res3X) are shared, while each branch (Branch 0, Branch 1, ..., Branch m) has its own high-level layers (Res4X block) and classifier. Each branch together with the shared layers forms an individual model; a gate combines the branch logits into ensemble logits, and the ensemble predictions serve as the teacher for on-the-fly knowledge distillation.

Multi-branch design: m auxiliary branches with the same configuration as the original branch, each serving as an independent, efficient classification model.

Gate network: a gate learns to ensemble all (m+1) branches into a stronger teacher, forming the teacher logits as a gated sum of the branch logits:
    $z_e = \sum_{i=0}^{m} g_i \cdot z_i$

Cross-entropy with hard vs. soft class labels: soft teacher predictions carry inter-class correlation information that one-hot hard labels discard.

On-the-fly knowledge distillation: compute soft probability distributions at a temperature T for each branch i and for the ONE teacher:
    $\tilde{p}_i^{j} = \frac{\exp(z_i^{j}/T)}{\sum_{j'} \exp(z_i^{j'}/T)}$, and likewise $\tilde{p}_e$ from the ensemble logits $z_e$.

Distill knowledge from the teacher to each branch with a Kullback-Leibler divergence loss:
    $\mathcal{L}_{kl} = \sum_{i=0}^{m} \sum_{j} \tilde{p}_e^{j} \log \frac{\tilde{p}_e^{j}}{\tilde{p}_i^{j}}$

Overall loss function: hard-label cross-entropy on every branch and on the teacher, plus the distillation term scaled by $T^2$ (see the code sketch after the references):
    $\mathcal{L} = \sum_{i=0}^{m} \mathcal{L}_{ce}^{i} + \mathcal{L}_{ce}^{e} + T^2 \, \mathcal{L}_{kl}$

3. Experiments

- CIFAR and SVHN tests
- ImageNet test
- Knowledge distillation and ensemble comparisons

4. Further Analysis

ONE vs. Model Ensemble (ME), compared on two measures:
(1) Model variance: the average prediction difference between every two models/branches.
(2) Mean model generalisation capability.

Findings:
- ONE leads to higher correlations between branches due to the learning constraint from the distillation loss.
- ONE yields superior mean model generalisation capability, with a lower error rate of 26.61 vs. 31.07 by ME.

- Effect of on-the-fly knowledge distillation.

5. Reference

[1] G. Hinton, O. Vinyals, and J. Dean. "Distilling the knowledge in a neural network." arXiv:1503.02531, 2015.
[2] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. "Deep mutual learning." CVPR, 2018.
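
To make the overall loss above concrete, here is a minimal PyTorch-style sketch of the ONE training objective. It is an illustration under stated assumptions, not the authors' released implementation: the function name one_loss, the tensor layout of branch_logits and gate_scores, and the choice to detach the teacher distribution inside the KL term are assumptions; only the loss structure (cross-entropy on every branch and on the gated teacher, plus a T^2-scaled KL term distilling the teacher into each branch) follows the poster.

```python
# Minimal sketch (not the authors' released code) of the ONE training loss,
# assuming a PyTorch model that exposes per-branch logits and gate scores.
import torch
import torch.nn.functional as F

def one_loss(branch_logits, gate_scores, labels, T=3.0):
    """Compute the ONE objective for one mini-batch.

    branch_logits: list of (batch, num_classes) tensors, one per branch (m+1 branches).
    gate_scores:   (batch, m+1) gate weights over branches (assumed softmax-normalised).
    labels:        (batch,) ground-truth class indices.
    T:             distillation temperature.
    """
    # Gated ensemble (teacher) logits: z_e = sum_i g_i * z_i
    stacked = torch.stack(branch_logits, dim=1)                    # (batch, m+1, C)
    teacher_logits = (gate_scores.unsqueeze(-1) * stacked).sum(dim=1)

    # Hard-label cross-entropy on every branch and on the teacher.
    ce = sum(F.cross_entropy(z, labels) for z in branch_logits)
    ce = ce + F.cross_entropy(teacher_logits, labels)

    # Temperature-softened teacher distribution; detached here so the KL term
    # only teaches the branches (whether to also backprop into the teacher is
    # an implementation choice not specified on the poster).
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=1)

    # Distillation: KL(teacher || branch) for every branch, scaled by T^2.
    kl = sum(F.kl_div(F.log_softmax(z / T, dim=1), p_teacher, reduction="batchmean")
             for z in branch_logits)
    return ce + (T ** 2) * kl
```

In training, gate_scores would come from a small gating head (for example, a fully connected layer followed by a softmax over the m+1 branches) applied to the shared low-level features; that design detail is also an assumption of this sketch.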
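
The model-variance measure in Section 4 ("average prediction differences between every two models/branches") can be instantiated as in the hypothetical sketch below. The poster does not give the exact distance function, so the mean absolute difference between softmax outputs and the helper name model_variance are assumptions.

```python
# Hypothetical sketch of the "model variance" measure used in the ONE vs. ME
# comparison: average the prediction difference over every pair of models/branches.
import itertools
import torch
import torch.nn.functional as F

def model_variance(branch_logits):
    """branch_logits: list of (num_samples, num_classes) logits, one per model/branch."""
    probs = [F.softmax(z, dim=1) for z in branch_logits]
    pair_diffs = [
        (p - q).abs().mean()                      # mean absolute prediction difference
        for p, q in itertools.combinations(probs, 2)
    ]
    return torch.stack(pair_diffs).mean()         # average over all model pairs
```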