Thesis title: Will it fail and why? A large case study of company default prediction with highly interpretable machine learning models
Predicting the default of a firm is a well-known problem in the finance and data science communities.
Many modern approaches seek well-performing models to forecast default; those models often act as black boxes and do not give financial institutions the fundamental explanations they need for their decisions.
This project aims to find a robust predictive model using a tree-based machine learning algorithm which, combined with a game-theoretic approach, can provide sound explanations of the model's output.
In our work we combine three large and important datasets in order to investigate both bankruptcy and bank default, a state of difficulty for companies that often precedes actual bankruptcy.
We combine one dataset from the Italian Central Credit Register of the Bank of Italy, one of balance sheet information on Italian firms, and information from the AnaCredit dataset, a novel source of credit data from the European Central Bank.
We try to go beyond academic study and show how our model, based on promising machine learning algorithms, outperforms the current default predictions made by credit institutions and, at the same time, provides insights into the reasons that lead to a particular outcome.
The default prediction problem has been studied for over fifty years, but it remains a very hard task even today.
Since it retains remarkable practical relevance, we focus our efforts on obtaining the best possible predictive performance, also in comparison with the reference literature.
Finally, we dedicate special effort to the analysis of predictions in highly imbalanced contexts.
Imbalanced classes are a common problem in machine learning classification, typically addressed by removing the imbalance in the training set.
We conjecture that this is not always the best choice and propose the use of a slightly unbalanced training set, showing that this approach helps maximize performance.
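
To make the difference between a fully balanced and a slightly unbalanced training set concrete, the following is a minimal sketch, not the procedure used in the thesis. It assumes the imbalance is reduced by random undersampling of the majority (non-default) class. The helper name `undersample`, the parameter `majority_per_minority`, the 5% default rate and the 3:1 ratio are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def undersample(labels, majority_per_minority=1.0, seed=0):
    """Return sorted row indices that keep every minority row (label 1) and a
    random subset of majority rows (label 0).

    Hypothetical helper: keeps at most `majority_per_minority` majority rows
    per minority row. A value of 1.0 gives a balanced set; a value above 1.0
    leaves the set slightly unbalanced.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), round(majority_per_minority * len(minority)))
    kept_majority = rng.sample(majority, n_keep)
    return sorted(minority + kept_majority)


# Toy labels: 50 defaults (1) and 950 non-defaults (0); the rate is illustrative only.
labels = [1] * 50 + [0] * 950

balanced_idx = undersample(labels, majority_per_minority=1.0)
slight_idx = undersample(labels, majority_per_minority=3.0)  # assumed 3:1 ratio

balanced_counts = Counter(labels[i] for i in balanced_idx)
slight_counts = Counter(labels[i] for i in slight_idx)
print("balanced:", dict(balanced_counts))
print("slightly unbalanced:", dict(slight_counts))
```

In this sketch only the training set would be resampled; a test set would normally keep the original class distribution so that reported performance reflects the real default rate.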