Ruggiero Seccia

PhD

Cycle: XXXIII


Supervisor: Laura Palagi

Thesis title: Block Coordinate Incremental Methods for training Deep Neural Networks

Data volumes are exploding. In 2020, on Google alone, an average of 40,000 search queries per second were submitted, which amounts to about 1.2 trillion searches per year. On YouTube, 300 hours of new video are uploaded every minute, while on Facebook more than 100 terabytes of data are shared daily. These are just a few of the striking statistics on the growth of big data reported in \cite{Bigdatastats}, which illustrate how fast data volumes are increasing.

The availability of this immense amount of information, together with the need to analyse it, is enabling the definition and adoption of increasingly complex machine learning models to extract value from these data. Among such models, Deep Neural Networks (DNNs) play a crucial role and have led to great improvements in several fields. DNNs with millions of parameters and hundreds of layers are indeed being successfully applied to challenging tasks such as image recognition and natural language processing. For instance, in 2012 a DNN with around 60 million parameters, called AlexNet \cite{krizhevsky2012imagenet}, won the ImageNet competition \cite{deng2009imagenet}, outperforming the previous best model by a considerable margin. In 2020, GPT-3 was presented, a natural language processing model with 175 billion parameters that solves natural language tasks remarkably well compared to previous models \cite{brown2020language}.

With both the amount of data and the complexity of the models rocketing, training deep models with billions of parameters on huge data sources is becoming a very time- and resource-consuming task, with some models taking days to be trained. Thus, the need for efficient optimization algorithms able to handle large amounts of data and to optimize wide and deep models is becoming a pressing issue. As a consequence, in the last decades a lot of attention has been devoted to developing more efficient and effective algorithms to properly tackle the complexity of training DNN models. State-of-the-art optimization algorithms for training DNNs, called minibatch methods, mainly exploit the finite-sum structure of the optimization problem by defining a \textit{sample-decomposition} of the objective function (see e.g. \cite{Nocedal,Goodfellow-et-al-2016} and the references therein). However, they do not take into consideration the layered structure of the model itself, which naturally provides a \textit{variable-decomposition} of the objective function that might be leveraged to speed up the learning process. In this respect, Block Coordinate Decomposition (BCD) methods, which minimize over blocks of variables at each iteration, represent a promising approach to enhance algorithm performance (see e.g. \cite{grippof1999globally,Wright2015} and the references therein). Even though BCD methods have proved effective on problems with many variables, they have not yet been fully exploited for training DNNs. Preliminary results on shallow neural networks \cite{Buzzi2001,Grippo2016} (i.e. neural networks with a single hidden layer) have shown the effectiveness of BCD methods in defining a variable-decomposition that exploits the structure of the model.
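To fix ideas, the training of a DNN can be written as the classical finite-sum problem below; the notation here is purely illustrative (the thesis introduces its own notation in Chapter \ref{ch:DNNs}):
\[
\min_{w \in \mathbb{R}^n} \; f(w) \;=\; \sum_{p=1}^{P} f_p(w),
\]
where $f_p$ is the loss on the $p$-th training sample. Sample-decomposition (minibatch) methods update all the variables $w$ using only a subset $I_k \subseteq \{1,\dots,P\}$ of the terms at each iteration, e.g. $w^{k+1} = w^k - \eta^k \sum_{p \in I_k} \nabla f_p(w^k)$, whereas variable-decomposition (BCD) methods update only a block of variables at a time, e.g. the weights of a single layer, while keeping the remaining blocks fixed.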
Thus, motivated by the success of minibatch methods and by the preliminary results obtained with BCD methods, in this thesis we investigate to what extent the training process of DNNs can benefit from coupling algorithms that rely on a sample-decomposition approach with algorithms that apply a variable-decomposition approach, in order to define block coordinate minibatch methods for training DNNs. Among minibatch methods, we focus on \textit{incremental} methods and study their behaviour as gradient methods with errors.

\subsection*{Scientific contributions}
\noindent The main contributions of this thesis can be summarized as follows:
\begin{itemize}
\item Introduce a Block Coordinate Incremental framework, called Block Incremental Gradient (BIG). By combining the sample-decomposition approach, typical of minibatch methods, with the variable-decomposition approach, typical of BCD methods, BIG can be fruitfully applied to solve large scale finite-sum problems with many variables (a schematic iteration is sketched right after this list) (Chapter \ref{ch:BIG});
\item Define a convergence theory for BIG. Following standard results from the literature on gradient methods with errors \cite{Bertsekas2000,solodov1998incremental}, we prove convergence of BIG towards $\epsilon$-stationary points and stationary points when a stepsize bounded away from zero and a diminishing stepsize are employed, respectively (Chapter \ref{ch:BIG});
\item Investigate how the layered structure of DNNs can be exploited to speed up the learning process. We discuss how the \textit{backpropagation} procedure computationally affects a BCD approach within DNNs and define a layered-decomposition updating scheme that allows the optimization problem to be tackled in a computationally efficient way (Chapter \ref{ch:BLING});
\item Introduce DNN-specific batch and minibatch BCD frameworks, called Block Layer Decomposition (BLD) and Block Layer Incremental Gradient (BLInG) respectively, and provide evidence of their effectiveness in improving algorithms' performance through numerical results on benchmark test problems (Chapter \ref{ch:BLING});
\item Introduce a novel optimization approach, called Controlled Minibatch Algorithm (CMA), for solving large scale finite-sum problems, which combines the good numerical performance of minibatch methods with the mild theoretical assumptions needed by standard batch methods to prove convergence towards stationary points. We define and prove convergence of two CMA-based frameworks, called CMA1 and CMA2, and provide numerical evidence of the effectiveness of this novel approach (Chapter \ref{ch:CMA});
\item Extend the CMA frameworks by including a BLInG iteration within the method, so as to define Controlled Block Minibatch Algorithms (CBMAs). We prove convergence of the proposed approach and test its effectiveness on standard test problems (Chapter \ref{ch:CMA}).
\end{itemize}
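As a schematic illustration of the block-incremental iteration behind BIG, consider a partition of the variables into $B$ blocks $w = (w_1,\dots,w_B)$; one possible way to write a cycle of the method is
\[
w_b^{k+1} \;=\; w_b^{k} \;-\; \eta^k \sum_{p \in I_k} \nabla_{w_b} f_p\big(w_1^{k+1},\dots,w_{b-1}^{k+1},\, w_b^{k},\dots,w_B^{k}\big), \qquad b = 1,\dots,B,
\]
where $I_k$ is the current minibatch and $\eta^k$ the stepsize; this formula is only illustrative and does not reproduce the exact updating scheme analysed in Chapter \ref{ch:BIG}. Reading the neglected terms as an error on the full gradient is what allows the analysis as a gradient method with errors: with a diminishing stepsize (e.g. $\eta^k \to 0$ with $\sum_k \eta^k = \infty$) convergence towards stationary points can be recovered, while a stepsize bounded away from zero yields $\epsilon$-stationary points.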
\subsection*{Thesis outline}
\noindent This thesis is organized as follows. In Chapter \ref{ch:DNNs} we start by describing the optimization problem behind the training of a DNN, introduce the notation used throughout the thesis and present some of the main difficulties in training DNNs from an optimization point of view. In Chapters \ref{ch:minibatch_methods} and \ref{ch:BCD} we provide the reader with an overview of the main minibatch methods for training DNNs and with background information on Block Coordinate Decomposition (BCD) frameworks. In particular, in Chapter \ref{ch:minibatch_methods} we first introduce the difference between batch and minibatch algorithms and then focus on the latter, which are more commonly adopted when training DNNs. In Chapter \ref{ch:BCD}, instead, we recall Block Coordinate Decomposition methods, explore when and how it is fruitful to apply them, discuss the main convergence results from the literature when different block updating schemes are considered and, finally, recall some of the preliminary attempts to apply BCD methods to train neural networks.

In Chapter \ref{ch:BIG} we introduce a novel framework, called Block Incremental Gradient (BIG), which combines the sample-decomposition approach, typical of minibatch methods, with the variable-decomposition approach, typical of BCD methods, in order to obtain a minibatch decomposition approach that scales with respect to both the number of variables and the number of samples, and is thus suitable for training wide and deep neural networks on large datasets. A convergence theory for BIG is developed by analysing the method as a gradient method with errors \cite{Bertsekas2000,solodov1998incremental}, and its application to optimization problems is discussed. In Chapter \ref{ch:BLING} we investigate how BCD methods can be fruitfully applied and embedded in minibatch frameworks to train Deep Neural Networks. We first describe a batch BCD method, called the Block Layer Decomposition (BLD) method, which leverages the natural variable decomposition provided by the layered structure of a DNN; then, following the convergence theory developed in the previous chapter for BIG, we extend the algorithm by proposing a minibatch BCD scheme, called the Block Layer Incremental Gradient (BLInG) method, which scales with respect to both the number of variables and the number of samples. Numerical performance is assessed through extensive experiments on standard datasets with different DNN architectures.

Finally, in Chapter \ref{ch:CMA}, we start from the following consideration: minibatch methods, although delivering great performance on large scale finite-sum problems, usually require stringent assumptions to be satisfied in order to guarantee convergence towards stationary points. To overcome this limiting factor, we introduce a novel optimization approach, called Controlled Minibatch Algorithm (CMA), which balances the good numerical performance typical of minibatch methods with the mild assumptions required by most batch methods. We introduce two CMA-based frameworks, analyse their convergence and show their performance on several problems, with particular emphasis on the big-data setting. Then, encouraged by the good numerical results obtained in the previous chapter, we further extend the CMA framework by defining a variable-decomposition modification of CMA methods, called Controlled Block Minibatch Algorithm (CBMA), which applies BLInG within its iterations. Convergence of the CBMA methods so defined is discussed and their numerical performance is assessed.
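To make the layer-wise incremental idea behind BLInG concrete, the following Python sketch shows one epoch of a generic scheme in which each minibatch is used to update a single block (layer) of a PyTorch model; the names (bling_epoch, net, loss_fn, minibatches, lr) and the cyclic block selection are illustrative assumptions and do not reproduce the exact algorithm analysed in Chapter \ref{ch:BLING}.
\begin{verbatim}
import torch

def bling_epoch(net, loss_fn, minibatches, lr):
    """One epoch of a generic layer-wise incremental scheme:
    each minibatch is used to update one block (layer) only."""
    blocks = list(net.children())           # blocks = layers of the DNN
    for k, (x, y) in enumerate(minibatches):
        block = blocks[k % len(blocks)]     # cyclic block selection (illustrative)
        loss = loss_fn(net(x), y)           # forward pass on the minibatch
        net.zero_grad()
        loss.backward()                     # backpropagation
        with torch.no_grad():               # gradient step on the selected block only
            for p in block.parameters():
                p -= lr * p.grad
\end{verbatim}
Note that a plain backward pass computes the gradient with respect to all the blocks even though only one is updated; this interaction between backpropagation and block updates is the computational issue discussed in Chapter \ref{ch:BLING}.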

Scientific output

11573/1676841 - 2023 - A machine-learning based bio-psycho-social model for the prediction of non-obstructive and obstructive coronary artery disease
Raparelli, Valeria; Romiti, Giulio Francesco; Di Teodoro, Giulia; Seccia, Ruggiero; Tanzilli, Gaetano; Viceconte, Nicola; Marrapodi, Ramona; Flego, Davide; Corica, Bernadette; Cangemi, Roberto; Pilote, Louise; Basili, Stefania; Proietti, Marco; Palagi, Laura; Stefanini, Lucia - 01a Journal article
journal: CLINICAL RESEARCH IN CARDIOLOGY (Heidelberg, Darmstadt: Springer Medizin) pp. 1263-1277 - issn: 1861-0684 - wos: WOS:000962357500001 (1) - scopus: 2-s2.0-85168781941 (1)

11573/1645381 - 2022 - A novel optimization perspective to the problem of designing sequences of tasks in a reinforcement learning framework
Seccia, R.; Foglino, F.; Leonetti, M.; Sagratella, S. - 01a Journal article
journal: OPTIMIZATION AND ENGINEERING (Dordrecht: Springer) pp. - - issn: 1389-4420 - wos: WOS:000742249400001 (0) - scopus: 2-s2.0-85123117834 (0)

11573/1513147 - 2021 - On the convergence of a Block-Coordinate Incremental Gradient method
Palagi, Laura; Seccia, Ruggiero - 01a Journal article
journal: SOFT COMPUTING (Heidelberg; Berlin: Springer) pp. 1-12 - issn: 1432-7643 - wos: WOS:000626095600002 (0) - scopus: 2-s2.0-85102211226 (0)

11573/1644111 - 2021 - A machine-learning-based bio-psycho-social model for the prediction of non-obstructive and obstructive coronary artery disease
Raparelli, V.; Proietti, M.; Romiti, G. F.; Seccia, R.; Di Teodoro, G.; Tanzilli, G.; Marrapodi, R.; Flego, D.; Corica, B.; Cangemi, R.; Palagi, L.; Basili, S.; Stefanini, L. - 01h Journal abstract
journal: EUROPEAN HEART JOURNAL (Oxford: Oxford University Press) pp. 3064-3064 - issn: 0195-668X - wos: WOS:000720456903365 (0) - scopus: (0)

11573/1492440 - 2021 - Machine learning use for prognostic purposes in multiple sclerosis
Seccia, Ruggiero; Romano, Silvia; Salvetti, Marco; Crisanti, Andrea; Palagi, Laura; Grassi, Francesca - 01a Journal article
journal: LIFE (Basel: MDPI) pp. - - issn: 2075-1729 - wos: WOS:000622720800001 (13) - scopus: 2-s2.0-85101021093 (19)

11573/1379126 - 2020 - Block layer decomposition schemes for training deep neural networks
Palagi, L.; Seccia, R. - 01a Journal article
journal: JOURNAL OF GLOBAL OPTIMIZATION (Dordrecht: Kluwer Academic Publishers) pp. 97-124 - issn: 0925-5001 - wos: WOS:000529229900006 (2) - scopus: 2-s2.0-85075232350 (5)

11573/1385217 - 2020 - Considering patient clinical history impacts performance of machine learning models in predicting course of multiple sclerosis
Seccia, R.; Gammelli, D.; Dominici, F.; Romano, S.; Landi, A. C.; Salvetti, M.; Tacchella, A.; Zaccaria, A.; Crisanti, A.; Grassi, F.; Palagi, L. - 01a Journal article
journal: PLOS ONE (San Francisco, CA: Public Library of Science) pp. - - issn: 1932-6203 - wos: WOS:000535303100021 (22) - scopus: 2-s2.0-85082116501 (29)

11573/1375328 - 2020 - Data of patients undergoing rehabilitation programs
Seccia, Ruggiero; Boresta, Marco; Fusco, Federico; Tronci, Edoardo; Di Gemma, Emanuele; Palagi, Laura; Mangone, Massimiliano; Agostini, Francesco; Bernetti, Andrea; Santilli, Valter; Damiani, Carlo; Goffredo, Michela; Franceschini, Marco - 01a Journal article
journal: DATA IN BRIEF (New York: Elsevier Inc.) pp. - - issn: 2352-3409 - wos: WOS:000541974300001 (33) - scopus: 2-s2.0-85082623637 (35)

11573/1288082 - 2019 - A Gray-Box Approach for Curriculum Learning
Foglino, Francesco; Leonetti, Matteo; Sagratella, Simone; Seccia, Ruggiero - 02a Book chapter or article
book: Optimization of Complex Systems: Theory, Models, Algorithms and Applications - (978-3-030-21802-7; 978-3-030-21803-4)

11573/1291201 - 2019 - Online block layer decomposition schemes for training deep neural networks
Palagi, Laura; Seccia, Ruggiero - 13a Other (ministerial classification)
