Addressing dataset shift in supervised classification via data perturbation

n supervised classification, dataset shift occurs when for the units in the test set a change in the distribution of a single feature, a combination of features, or the class boundaries, is observed with respect to the training set. As a result, in real data applications, the common assumption that the training and testing data follow the same distribution is often violated. Dataset shift might be due to several reasons; the focus is on what is called “covariate shift”, namely the conditional probability p(y|x) remains unchanged, but the input distribution p(x) differs from training to test set. Random perturbation of variables or units when building the classifier can help in addressing this issue. Evidence of the performance of the proposed approach is obtained on simulated and real data.

25 Novembre 2022

Laura Anderlucci
University of Bologna