Thesis title: Energy Trees: Classification and Regression With Structured and Mixed-Type Data
As data analyses grow in complexity, so does the need for frameworks and models that can keep pace. Object Oriented Data Analysis plays a primary role in this direction because it works directly with structured data objects, i.e. with variables that have not undergone any (further) simplification. So far, however, the focus has been on a single type of variable at a time. To fill this gap, this work introduces Energy Trees, a statistically sound model for classification and regression with structured and mixed-type covariates.
Energy Trees are built by combining two successful and well-established ideas from the literature, namely Conditional Trees and Energy Statistics, so that the proposed model inherits properties from both. However, the problem of splitting with respect to structured covariates is still not well-defined. In this work, two alternative procedures, namely feature vector extraction and clustering, are proposed and compared. The choices that must be made are then outlined, both for traditional covariates, i.e. numeric and nominal, and for the structured covariates considered here, i.e. functions, graphs, and persistence diagrams. Since flexibility is one of the main advantages of Energy Trees, general guidance is also provided on how to change these choices and how to implement any other type of covariate.
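As an illustration of the kind of Energy Statistics tool that can be paired with a Conditional Trees scheme, the following is a minimal sketch of a permutation test of independence based on the sample distance covariance. It is not the implementation used in this work: the function names, the number of permutations, and the toy data are assumptions made only for the example. The test takes pairwise distance matrices as input, so in principle it applies to any covariate type for which a distance is available.

```python
import numpy as np


def double_center(d):
    """Double-center a square distance matrix (subtract row and column means, add grand mean)."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcov_perm_test(dx, dy, n_perm=199, seed=0):
    """Permutation test of independence from two n x n distance matrices.

    Returns the squared sample distance covariance (V-statistic) and a
    permutation p-value. Names and defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    a = double_center(dx)
    b = double_center(dy)
    observed = (a * b).mean()
    n = dx.shape[0]
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Relabeling the observations of the second sample permutes rows and columns jointly.
        if (a * b[np.ix_(p, p)]).mean() >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


def abs_dist(v):
    """Pairwise absolute differences for a numeric vector."""
    return np.abs(v[:, None] - v[None, :])


# Toy data (assumed for illustration): y_dep depends on x nonlinearly, y_ind does not.
rng = np.random.default_rng(1)
x = rng.normal(size=60)
y_dep = x ** 2 + 0.1 * rng.normal(size=60)
y_ind = rng.normal(size=60)

stat_dep, p_dep = dcov_perm_test(abs_dist(x), abs_dist(y_dep))
stat_ind, p_ind = dcov_perm_test(abs_dist(x), abs_dist(y_ind))
print(f"dependent pair:   stat={stat_dep:.4f}, p={p_dep:.3f}")
print(f"independent pair: stat={stat_ind:.4f}, p={p_ind:.3f}")
```

For a structured covariate, only the distance matrix would change, e.g. a distance between functions or between graphs in place of abs_dist; the test itself is unchanged.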
Extensive simulation studies are employed to show that Energy Trees are unbiased, do not suffer from overfitting, and select meaningful covariates. These studies are performed at increasing levels of complexity, starting from traditional covariates only and arriving at the case of structured and mixed-type covariates, and all of them yield positive results. Once Energy Trees are confirmed to work properly, their applicability and some extensions can be considered. As for the extensions, two ensemble models are presented: bagging of Energy Trees and Random Energy Forests. Additionally, the Unsupervised Random Energy Forest model for unlabeled learning samples is introduced and tested on simulated data.
The Energy Trees framework is implemented in the R package etree. The package is described in detail, covering all its main functions and features. A usage example is also included, followed by a discussion of current and future work.
Finally, Energy Trees are employed to conduct four empirical analyses on data from the fields of human biology and medicine. Specifically, the main prediction tasks involve knee osteoarthritis, intelligence, schizophrenia, and brain tumors. Covariates include the shape of the bones, multimodal brain connectomes, brain metabolic networks, demographic information, and various others. The analyses show that the predictive ability of the model is adequate and suggest its potential utility in these important but intricate fields.