Despite the formidable success of deep neural networks (DNNs) in recent years, a fundamental understanding of them is still lacking. Open questions include the origin of the slow training dynamics and the relationship between the dimensionality of parameter space and the number of training examples, since DNNs empirically generalize very well even when over-parametrized. A popular way to address these issues is to study the topology of the cost function (the loss landscape) and the properties of the algorithm used for training (usually stochastic gradient descent, SGD).
Here, we use methods and results from the physics of disordered systems, in particular glasses and sphere packings. On the one hand, we are able to understand to what extent DNNs resemble widely studied physical systems. On the other hand, we use this knowledge to identify properties of the learning dynamics and of the landscape.
In particular, through the study of time correlation functions in weight space, we argue that the slow dynamics is not due to barrier crossing, but rather to an increasingly large number of null-gradient directions, and we show that, at the end of learning, the system is diffusing at the bottom of the landscape. We also find that DNNs exhibit a phase transition between the over-parametrized regime, where perfect fitting can be achieved, and the under-parametrized regime, where it cannot. We show that in the over-parametrized phase there cannot be spurious local minima. In the vicinity of this transition, properties of the curvature of the loss function minima are critical.
This kind of knowledge can be used both as a basis for a more grounded understanding of DNNs and for hands-on requirements such as hyperparameter optimization and model selection.
Room A7 (ground floor)
Dipartimento di Ingegneria informatica automatica e gestionale A Ruberti (DIAG)
Via Ariosto 25 - Roma
Marco Baity Jesi
ETH Eawag, Zurich, Switzerland