Thesis title: Controlled mini-batch algorithms for large-scale optimization in Deep Learning
The volume of data processed by machine learning models, particularly deep neural networks (DNNs), is rapidly increasing alongside the dimensions of these models.
In 2023, Internet users generated an average of approximately 2.5 quintillion bytes ($2.5\times10^{18}$) of data daily\footnote{\url{https://explodingtopics.com/blog/big-data-stats}}.
The well-known large language model, ChatGPT, has been trained on a dataset containing hundreds of billions of words drawn from web pages\footnote{\url{https://openai.com/index/chatgpt/}}.
As tasks become more complex and training datasets grow larger, the number of trainable parameters in DNNs has been increasing exponentially.
State-of-the-art DNNs can have billions, or even trillions, of trainable parameters, thanks to innovations like self-attention layers \cite{khan2022transformers}, diffusion models \cite{croitoru2023diffusion}, and the use of over-parameterization to achieve perfect interpolation \cite{chang2021provable}.
This thesis focuses on the training of deep neural networks in a supervised context, encompassing tasks such as regression and image classification, where the training problem is formulated as the unconstrained minimization of a smooth, possibly non-convex objective function with respect to the network weights.
Hence, the dimension of the target problem depends both on the size of the network and on the amount of input data available for training, which makes traditional full-batch optimization methods, such as Gradient Descent, L-BFGS, and other modifications of Newton's method, computationally impractical.
This has led to the widespread adoption of \textit{mini-batch} methods, which update the network weights using, at each iteration, only a small subset of the available samples to compute an estimate of the gradient of the objective function.
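As a schematic illustration of the mini-batch principle, one epoch of an Incremental Gradient (fixed sample order) or Random Reshuffling (shuffled order) update can be sketched as follows; the function names, the fixed step size, and the interface are hypothetical, introduced only for this sketch and not taken from the thesis.

```python
import numpy as np

def minibatch_epoch(w, grad_fn, samples, batch_size=32, step_size=1e-3, shuffle=True):
    """One epoch of mini-batch gradient updates.

    grad_fn(w, batch) returns an estimate of the gradient of the
    objective, computed only on the given subset of samples.
    """
    idx = np.arange(len(samples))
    if shuffle:  # Random Reshuffling; with shuffle=False, Incremental Gradient
        np.random.shuffle(idx)
    for start in range(0, len(samples), batch_size):
        batch = [samples[i] for i in idx[start:start + batch_size]]
        w = w - step_size * grad_fn(w, batch)  # step along the gradient estimate
    return w
```

Each iteration touches only `batch_size` samples, so the per-step cost is independent of the dataset size; the price is that each step follows a noisy estimate of the true gradient.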
Despite their increased computational efficiency with respect to full-batch algorithms, mini-batch methods require strong assumptions (e.g., convexity, strong convexity, gradient-related search directions) to prove convergence to a stationary point.
Furthermore, adaptive mini-batch methods, which adjust the step size based on the gradient estimate, can suffer from instability in the final phases of training.
We seek a balance between, on the one hand, theoretical guarantees of convergence and stability and, on the other, the steadily increasing efficiency requirements aimed at minimizing both the training time and the carbon footprint of the training process itself \cite{papa2024survey}. To this end, in this thesis we propose an approach that enhances the Incremental Gradient (IG) and Random Reshuffling (RR) algorithms with sufficient decrease conditions and derivative-free extrapolation line-search procedures.
In more detail, we consider an efficient version of the Controlled Mini-batch Algorithm (CMA), first proposed in Seccia's PhD thesis and available in \cite{seccia2022convergence}. CMA is a framework encompassing both IG and RR that enforces a sufficient decrease condition on the objective function and can execute a line-search, ensuring convergence even when no assumptions are made on the search direction or the gradient.
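To give the flavor of such a condition, a generic sufficient decrease test of the kind employed by this family of methods can be written as follows; the specific form and the constant $\gamma$ are illustrative and not necessarily identical to the condition used in CMA.

```latex
% Illustrative sufficient decrease test: after an epoch produces the
% trial point $\tilde{w}_k$ from the current iterate $w_k$, accept it only if
f(\tilde{w}_k) \le f(w_k) - \gamma \, \|\tilde{w}_k - w_k\|^2, \qquad \gamma > 0,
% otherwise a (derivative-free) line-search along a suitable direction
% is triggered to restore the convergence guarantees.
```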
We then present the \textit{light} version of CMA, CMA Light, which checks the sufficient decrease condition on an estimate of the objective function: it achieves the same convergence results in the IG framework while performing far more efficiently on state-of-the-art regression and image classification tasks.
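The idea of checking the decrease on an estimate rather than on the true objective can be illustrated by aggregating the per-mini-batch losses that are computed during the epoch anyway, thereby avoiding an extra full-batch evaluation. The sketch below is a plausible illustration of this principle under that assumption, not the exact estimator of CMA Light.

```python
def epoch_with_estimate(w, loss_fn, grad_fn, batches, step_size=1e-3):
    """One epoch returning the new iterate and a surrogate objective value.

    loss_fn(w, batch) / grad_fn(w, batch) evaluate the mini-batch loss
    and its gradient; the surrogate is built from losses evaluated at
    the successive iterates, so no full-batch pass is required.
    """
    running, n = 0.0, 0
    for batch in batches:
        running += len(batch) * loss_fn(w, batch)  # loss at the current iterate
        n += len(batch)
        w = w - step_size * grad_fn(w, batch)
    f_estimate = running / n  # surrogate of the full objective over the epoch
    return w, f_estimate
```

The surrogate mixes losses computed at different iterates, which is precisely why establishing convergence while testing the decrease on such an estimate requires a dedicated analysis.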
With a further variant, Block-Decomposed CMA Light (BD-CMAL), we also investigate the possibility of coupling our mini-batch method with a block decomposition framework, extending the results proposed in \cite{palagi2021convergence}.
Finally, with the Fast-Controlled Mini-batch Algorithm (F-CMA), we extend the convergence theory of CMA Light to the case in which the samples are reshuffled at every iteration, and we develop a line-search procedure that relies on an approximation of the objective function.
We test F-CMA on networks with up to 120 million trainable parameters against the most commonly used state-of-the-art optimizers for deep learning, showing a significant advantage both in terms of stability and in terms of the generalization capability of the trained models.