An Analysis of Very Deep Convolutional Neural Networks: Problems and Solutions
Abstract
Neural networks have become a powerful tool in computer vision thanks to recent advances in computational power and model architecture. Very deep models can capture more intricate patterns hidden in the data; however, training them successfully is not trivial because of the notorious vanishing/exploding gradient problem. We illustrate this problem on VGG models with 8 and 38 hidden layers trained on the CIFAR-100 image dataset, visualizing how the gradients evolve during training. We then explore known remedies such as Batch Normalization (BatchNorm) and Residual Networks (ResNets), explaining the theory behind them. Our experiments show that the deeper model suffers from the vanishing gradient problem and that BatchNorm and residual connections resolve it. These techniques also slightly improve the performance of the shallower model, yet the repaired deeper models still outperform it.
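To make the two remedies concrete, the sketch below shows a basic residual block that combines Batch Normalization with an identity skip connection, followed by a quick check that gradients still reach the input of a deep stack of such blocks. It is written in PyTorch purely as an illustration of the general idea; the framework, block size, and channel count are our assumptions and not the exact VGG/ResNet configurations evaluated in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Two 3x3 convolutions with BatchNorm and an identity skip connection.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The identity shortcut gives gradients a direct path back to earlier
        # layers, which is what counteracts the vanishing gradient problem.
        return F.relu(out + x)

# Stack many blocks (hypothetical depth and width) and verify that the
# gradient reaching the input remains non-negligible after backprop.
model = nn.Sequential(*[ResidualBlock(64) for _ in range(30)])
x = torch.randn(4, 64, 32, 32, requires_grad=True)  # CIFAR-sized feature maps
model(x).mean().backward()
print(x.grad.abs().mean())

In a plain stack of convolutions without BatchNorm or shortcuts, the printed gradient magnitude would shrink rapidly as the depth grows; with the residual block above it stays of a usable scale, which is the effect the experiments visualize.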