Thesis title: Anomaly detection in panel data identification features: comparing temporal and supervised classification record linkage-based methods
With repeated observations of the same units over time, panel data enable researchers to study the dynamics of a broad set of phenomena. Units in a panel are identified by key features so that they can be followed through the observation periods; these features must be chosen so that they remain fixed and consistent over time. Losing uniqueness on the key features means losing track of the units’ history; therefore, any errors in the reported key features must be detected.
This work is motivated by a real-life problem observed in granular insurance data, reported since 2016 and used by some central banks to build statistics for the European System of Central Banks (ESCB). The reported insurance assets are uniquely identified by codes that are required to remain stable and consistent over time; nevertheless, reporting errors can cause unexpected changes in the codes, leading to inconsistencies when insurance statistics are compiled. The resulting loss in the quality of the produced statistics is limited but cannot be neglected.
Two approaches are proposed in this work to deal with the described issue: a temporal one making use of ARIMA models for time series prediction and a supervised classification one using Machine Learning models.
The two apparently very different methodologies pursue the same goal from two different perspectives: the former exploits the temporal aspect of the data, while the latter focuses on pairs of consecutive reporting periods. Both rely on a record linkage idea: records that do not share the same value for the key feature but refer to the same unit will be equal, or at least similar, on the other observed features.
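The record linkage idea can be illustrated with a minimal Python sketch. It is not the thesis's implementation: a simple exact-agreement score with a fixed threshold stands in for the supervised classifier, and the field names (`code`, `issuer`, `currency`, `type`), the toy records, and the 0.8 threshold are all hypothetical. The sketch pairs codes that disappear after one reporting period with codes that first appear in the next, and flags pairs whose other features agree.

```python
from itertools import product


def similarity(a, b, fields):
    """Share of the given non-key fields on which two records agree exactly."""
    return sum(a[f] == b[f] for f in fields) / len(fields)


def candidate_code_changes(prev, curr, key, fields, threshold=0.8):
    """Pair codes that vanish after `prev` with codes that first appear in `curr`.

    A pair is returned when its agreement score reaches `threshold`
    (an illustrative cut-off, not a value taken from the thesis).
    """
    prev_by_key = {r[key]: r for r in prev}
    curr_by_key = {r[key]: r for r in curr}
    dropped = prev_by_key.keys() - curr_by_key.keys()
    appeared = curr_by_key.keys() - prev_by_key.keys()
    matches = []
    for old, new in product(sorted(dropped), sorted(appeared)):
        score = similarity(prev_by_key[old], curr_by_key[new], fields)
        if score >= threshold:
            matches.append((old, new, score))
    return matches


# Hypothetical records for two consecutive reporting periods.
prev = [
    {"code": "A1", "issuer": "X", "currency": "EUR", "type": "bond"},
    {"code": "B2", "issuer": "Y", "currency": "USD", "type": "equity"},
]
curr = [
    {"code": "A1", "issuer": "X", "currency": "EUR", "type": "bond"},
    {"code": "B9", "issuer": "Y", "currency": "USD", "type": "equity"},
    {"code": "C3", "issuer": "Z", "currency": "EUR", "type": "fund"},
]

matches = candidate_code_changes(prev, curr, "code", ["issuer", "currency", "type"])
print(matches)  # [('B2', 'B9', 1.0)]
```

Here `B2` is flagged as a possible miscoding of `B9` because the two records agree on every other field, while `C3` is treated as a genuinely new asset.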
The two methodologies are trained and tested on Italian data from 2019 to 2022, with ad hoc procedures to ensure the robustness and reliability of the results.
Promising test results are presented to show the potential benefits of the two proposed methodologies on data quality management processes and the efficiency gains coming from automation.