Thesis title: Predicting University Dropout in Italy: A Machine Learning Approach using Large-Scale Data
Context
University dropout is a critical issue with significant financial, academic, and societal consequences. In the Italian higher education system, dropout rates remain a concern, particularly at the bachelor’s level. Recent advancements in Machine Learning (ML) and Artificial Intelligence (AI) provide new opportunities for predicting student dropout, enabling early interventions. However, challenges such as scalability, explainability, and the selection of optimal features across various institutions complicate the adoption of these models.
Objectives
This thesis aims to develop and evaluate scalable ML models for predicting university dropout within the Italian context, leveraging data that is readily available to all universities through the ANS/ANVUR system. Additionally, the study explores the limitations of Explainable AI (XAI) in identifying consistent features for dropout prediction and proposes solutions that can be applied across Italian universities.
Methods
The study employs a dataset of 271,707 student records from Tor Vergata University and Pisa University, making it one of the largest datasets in dropout prediction research. Multiple machine learning models, including Ensemble methods, were developed and tested at different stages of the students’ academic careers (e.g., matriculation, mid-year, end of year one). SHAP values were used to examine model explainability and to track feature importance across thousands of model iterations. The models were evaluated on accuracy and F1 score to assess prediction performance.
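As a minimal sketch of the two evaluation metrics named above, the following computes accuracy and F1 score from true and predicted labels. The function name, the label encoding (1 = dropout, 0 = retained), and the toy labels are illustrative assumptions, not the thesis's actual code or data.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the dropout class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy labels (hypothetical): 1 = dropout, 0 = retained.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

Reporting F1 alongside accuracy is a common choice when one class is rarer than the other, since accuracy alone can look high for a model that rarely predicts the minority class.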
Results
The study demonstrates that larger training samples result in more generalized and extendable models, with Ensemble methods significantly improving dropout prediction. For example, dropout prediction for students who were not part of the training sample (e.g., pre-reform students) still achieved high accuracy (81% by end of year one). The findings also reveal that the concept of a consistent "best set of features" for dropout prediction is problematic; feature importance fluctuates with even minor changes in the model, which presents challenges for the broader application of Explainable AI in this context.
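One way to quantify how much a "best set of features" shifts between model runs is the mean pairwise Jaccard overlap of the top-k features. The sketch below assumes each run yields a per-feature importance score (for instance a mean absolute SHAP value); the feature names and numbers are made-up toy values, and this metric is an illustrative choice rather than the procedure used in the thesis.

```python
from itertools import combinations


def top_k_features(importances, k):
    """Return the set of the k highest-scoring features (ties broken by name)."""
    ranked = sorted(importances, key=lambda name: (-importances[name], name))
    return set(ranked[:k])


def mean_top_k_jaccard(runs, k):
    """Average Jaccard overlap of top-k feature sets over all pairs of runs."""
    top_sets = [top_k_features(run, k) for run in runs]
    pairs = list(combinations(top_sets, 2))
    if not pairs:
        raise ValueError("at least two runs are required")
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical importance scores from three model runs (toy values only).
runs = [
    {"credits_earned": 0.40, "exam_average": 0.30, "age_at_enrolment": 0.20, "high_school_grade": 0.10},
    {"credits_earned": 0.35, "exam_average": 0.15, "age_at_enrolment": 0.20, "high_school_grade": 0.30},
    {"credits_earned": 0.38, "exam_average": 0.32, "age_at_enrolment": 0.10, "high_school_grade": 0.20},
]
stability = mean_top_k_jaccard(runs, k=2)
```

A score of 1.0 would mean every run selects the same top-k features; lower values indicate that the selected set changes from run to run.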
Limitations
The existing literature on student dropout prediction shows that numerous studies, despite tackling the same problem, have employed vastly different sets of input features, algorithms, and methods, many of which achieved strong results. This suggests that the methods used in each case were effective for the specific dataset and student population they were trained on, and would likely continue to perform well in similar contexts. However, applying the same methodologies to other settings does not guarantee similar outcomes. In the current study, the results obtained at Tor Vergata were replicated, and in many cases surpassed, at Pisa University. This suggests that the methodology may be effective at other Italian universities, but there is no certainty that it will yield comparable results outside Italy. Furthermore, various attempts to create "explainable models" for dropout prediction suggest that Explainable AI (XAI) faces inherent limitations in this area.
Conclusions
This research contributes a scalable, "out of the box" solution for predicting dropout that can be applied across all Italian universities without the need for additional data collection. The study underscores the effectiveness of larger datasets and Ensemble models in producing reliable predictions, while also highlighting the limitations of Explainable AI for identifying consistent predictive features. Future research should focus on translating these predictions into actionable interventions and further exploring the limitations of explainability in ML-driven educational analytics.