Experimental Study of Some Properties of Knowledge Distillation

  • A. Szijártó, Faculty of Informatics, Eötvös Loránd University, H-1117 Budapest, Pázmány P. stny 1/C, Hungary and Faculty of Mathematics and Computer Science, Babeș-Bolyai University, No. 1 Mihail Kogalniceanu St., RO-400084 Cluj-Napoca, Romania.
  • P. Lehotay-Kéry, Faculty of Informatics, Eötvös Loránd University, H-1117 Budapest, Pázmány P. stny 1/C, Hungary and Faculty of Mathematics and Computer Science, Babeș-Bolyai University, No. 1 Mihail Kogalniceanu St., RO-400084 Cluj-Napoca, Romania.
  • A. Kiss, Faculty of Informatics, Eötvös Loránd University, H-1117 Budapest, Pázmány P. stny 1/C, Hungary and Faculty of Mathematics and Computer Science, Babeș-Bolyai University, No. 1 Mihail Kogalniceanu St., RO-400084 Cluj-Napoca, Romania.

Abstract

More complex classification problems inevitably call for increasingly large and cumbersome classifier models. However, we often lack the memory or processing power to deploy such models.
Knowledge distillation is an effective way to improve the accuracy of a smaller, simpler model by training it under the guidance of a more complex teacher network or ensemble of networks. The result is a classifier whose accuracy is comparable to that of the teacher, yet which is small enough to deploy.
In this paper we evaluate several properties of this distillation method while attempting to improve its results. These experiments, and the properties they reveal, may also help to develop the technique further.
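Since the abstract only states the mechanism at a high level, the following is a minimal NumPy sketch, under our own assumptions, of the standard temperature-based distillation objective: temperature-softened teacher probabilities guide the student alongside the ordinary hard-label cross-entropy. The temperature and weighting values below are illustrative choices, not the settings used in this study.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of the soft-target and hard-target cross-entropy terms.

    alpha       -- weight on the soft (teacher) term; illustrative value
    temperature -- softening factor T; the soft term is scaled by T^2 so its
                   gradient magnitude stays comparable to the hard term
    """
    # Cross-entropy between softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_ce = -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1)

    # Ordinary cross-entropy against the ground-truth labels (T = 1).
    p_hard = softmax(student_logits, 1.0)
    n = student_logits.shape[0]
    hard_ce = -np.log(p_hard[np.arange(n), hard_labels] + 1e-12)

    return np.mean(alpha * (temperature ** 2) * soft_ce
                   + (1.0 - alpha) * hard_ce)

# Toy example: 2 samples, 3 classes.
student = np.array([[1.0, 0.5, -0.5], [0.2, 2.0, 0.1]])
teacher = np.array([[2.0, 1.0, -1.0], [0.0, 3.0, 0.5]])
labels = np.array([0, 1])
print(distillation_loss(student, teacher, labels))
```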
