An Analysis of Very Deep Convolutional Neural Networks: Problems and Solutions
Abstract
Neural networks have become a powerful tool in computer vision thanks to recent advances in computational power and model architecture. Very deep models can capture more intricate patterns hidden in the data; however, training them successfully is not trivial because of the notorious vanishing/exploding gradient problem. We illustrate this problem on VGG models with 8 and 38 hidden layers trained on the CIFAR-100 image dataset, visualizing how the gradients evolve during training. We then explore known remedies such as Batch Normalization (BatchNorm) and Residual Networks (ResNets), explaining the theory behind them. Our experiments show that the deeper model suffers from the vanishing gradient problem and that BatchNorm and residual connections resolve it. These techniques also slightly improve the performance of the shallower model, yet the repaired deeper models still outperform it.
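To make the two remedies concrete, the sketch below shows a basic residual block that combines Batch Normalization with an identity skip connection, followed by a quick check that gradients still reach the input of a deep stack of such blocks. It is written in PyTorch purely as an illustration of the general idea; the framework, block size, and channel count are our assumptions and not the exact VGG/ResNet configurations evaluated in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Two 3x3 convolutions with BatchNorm and an identity skip connection.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The identity shortcut gives gradients a direct path back to earlier
        # layers, which is what counteracts the vanishing gradient problem.
        return F.relu(out + x)

# Stack many blocks (hypothetical depth and width) and verify that the
# gradient reaching the input remains non-negligible after backprop.
model = nn.Sequential(*[ResidualBlock(64) for _ in range(30)])
x = torch.randn(4, 64, 32, 32, requires_grad=True)  # CIFAR-sized feature maps
model(x).mean().backward()
print(x.grad.abs().mean())

In a plain stack of convolutions without BatchNorm or shortcuts, the printed gradient magnitude would shrink rapidly as the depth grows; with the residual block above it stays of a usable scale, which is the effect the experiments visualize.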