Seminars


2019


Roberta Varriale (ISTAT) - Machine learning methods for estimating the employment status in Italy
29 November 2019 - Sala 34, 14:00
In recent decades, National Statistical Institutes have focused on producing official statistics by exploiting multiple sources of information (multi-source statistics) rather than a single source, usually a statistical survey. The growing importance of multi-source statistics in official statistics has led to increasing investment in research activities in this sector. In this context, one of the research projects addressed by the Italian National Statistical Institute (Istat) concerned the study of methodologies for producing estimates of employment rates in Italy through the use of multiple sources of information: survey data and administrative sources. The data come from the Labour Force (LF) survey conducted by Istat and from several administrative sources that Istat regularly acquires from external bodies. The amount of information is very different: the administrative sources cover about 25 million individuals, while the LF survey refers to a much smaller number of individuals (about 330,000). The two measures disagree on employment status for about 6% of the units in the LF survey. One proposed approach uses a Hidden Markov (HM) model to take into account the deficiencies in the measurement process of both the survey and the administrative sources. The model describes the measurement process as a function of a time-varying latent state (here, the employment category), whose dynamics are described by a Markov chain defined over a discrete set of states. At present, the implementation phase of the production process for employment statistics based on HM models is coming to an end at Istat. The present work describes the use of Machine Learning methods to predict individual employment status. This approach is based on decision tree and random forest models, predictive models commonly used to classify instances in large amounts of data. The talk will describe the results obtained and their usefulness in this application context. The models were applied using the software R.
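By way of illustration only (the Istat data and production code are not reproduced here), the following minimal R sketch shows the kind of tree-based classifiers mentioned above, using the rpart and randomForest packages on simulated data; all variable names and the data-generating mechanism are hypothetical assumptions.

```r
## Minimal sketch, assuming the rpart and randomForest packages are installed.
## Data, variable names and coding are hypothetical, not the Istat sources.
library(rpart)
library(randomForest)

set.seed(1)
n <- 5000
train <- data.frame(
  months_worked  = rpois(n, 6),            # months with recorded contributions (hypothetical)
  admin_employed = rbinom(n, 1, 0.55),     # employment signal from an administrative source
  age            = sample(15:64, n, TRUE),
  sector         = factor(sample(c("A", "B", "C"), n, TRUE))
)
## Hypothetical survey-based employment status used as the target variable
train$employed <- factor(ifelse(train$months_worked > 4 | train$admin_employed == 1,
                                rbinom(n, 1, 0.9), rbinom(n, 1, 0.1)))

## Decision tree
tree_fit <- rpart(employed ~ months_worked + admin_employed + age + sector,
                  data = train, method = "class")

## Random forest
rf_fit <- randomForest(employed ~ months_worked + admin_employed + age + sector,
                       data = train, ntree = 200)

print(rf_fit)          # out-of-bag error as a rough accuracy check
head(predict(rf_fit))  # predicted employment status (out-of-bag)
```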
Garyfallos Konstantinoudis - Discrete versus continuous domain models for disease mapping and applications on childhood cancers
22 November 2019 - Sala 34, 12:00
The main goals of disease mapping are to estimate disease risk and identify high-risk areas. Such analyses are hampered by the limited geographical resolution of the available data. Typically, data are counts of cases per spatial unit and the most common approach is the Besag-York-Mollié (BYM) model. Less frequently, exact geocodes are available, allowing a disease to be modelled as a point process through log-Gaussian Cox processes (LGCPs). The objective of this study is to examine, in a simulation, the performance of BYM and LGCPs for disease mapping. We simulated data in the Canton of Zurich in Switzerland, sampling cases from the true population and mimicking childhood leukaemia incidence (n=334 during 1985-2015). We considered 39 different scenarios varying in the risk-generating function (step-wise, smooth, flat risk), the size of the high-risk areas (1, 5 and 10 km radii), the risk increase within the high-risk areas (2- and 5-fold) and the number of cases (n, 5n and 10n). We used the root mean integrated squared error (RMISE) to examine the ability of the models to recover the true risk surface, and their sensitivity/specificity in identifying high-risk areas. We found that, for larger radii, LGCPs recover the true risk surface with lower error across almost all scenarios (median RMISE: 9.17-27.0) compared to the BYM (median RMISE: 9.12-35.6). For radii of 1 km and flat risk surfaces, BYM performs better. In terms of sensitivity and specificity, across almost all scenarios the median area under the curve (AUC) for LGCPs was higher (median AUC: 0.81-1) than for the BYM (median AUC: 0.65-0.93). We applied these methods to childhood leukaemia incidence in the Canton of Zurich during 1985-2015 and identified two spatially coherent high-risk areas. Our findings suggest that there are important gains to be made from the use of LGCP models in spatial epidemiology.
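For reference, a standard formulation of the two competing models (generic notation; not necessarily the exact parameterisation used in the study). For areas i = 1, ..., m with observed counts y_i and expected counts E_i, the BYM model is

\[
y_i \mid \eta_i \sim \mathrm{Poisson}\big(E_i\, e^{\eta_i}\big), \qquad \eta_i = \beta_0 + u_i + v_i,
\]

where v_i ~ N(0, \sigma_v^2) is an unstructured random effect and u = (u_1, ..., u_m) has an intrinsic CAR prior, \( u_i \mid u_{-i} \sim N\big(\tfrac{1}{n_i}\sum_{j \sim i} u_j,\; \tfrac{\sigma_u^2}{n_i}\big) \), with j ~ i denoting the neighbours of area i and n_i their number. In the LGCP formulation, case locations instead follow a Poisson point process with random intensity \( \lambda(s) = \lambda_0(s)\exp\{Z(s)\} \), where \( \lambda_0(s) \) reflects the population at risk and Z(s) is a Gaussian random field.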
Roberta De Vito (Department of Biostatistics, Brown University, Providence, Rhode Island, USA) - Multi-study factor analysis for biological data
14 November 2019 - Aula XIV (palazzina Tumminelli), 12:00
We introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies. The goal is to separately identify and estimate 1) common factors shared across multiple studies, and 2) study-specific factors. We develop a fast Expectation Conditional-Maximization algorithm for parameter estimation and provide a procedure for choosing the common and specific factors. We present simulations evaluating the performance of the method and illustrate it by applying it to gene expression data in ovarian cancer and to nutrient-based dietary patterns and the risk of head and neck cancer. In both cases, we clarify the benefits of a joint analysis compared to standard factor analysis. Moreover, we generalize the model in a Bayesian framework. We implement it using sparse modelling of the high-dimensional factor loading matrices, both common and specific, using the infinite gamma shrinkage prior. We propose a computationally efficient algorithm, based on a traditional Gibbs sampler, to produce the Bayes estimates of the parameters and to select the number of relevant common factors. We assess the operating characteristics of our method by means of simulation studies, and we present an application to the prediction of the biological signal from four gene expression studies on ovarian cancer.
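In compact form, the multi-study factor model described above can be sketched as follows (the notation here is an assumption, following the usual presentation of multi-study factor analysis). For study s = 1, ..., S and subject i,

\[
x_{si} = \Phi f_{si} + \Lambda_s l_{si} + e_{si},
\]

where \( \Phi \) contains the loadings of the common factors \( f_{si} \), shared across studies, \( \Lambda_s \) contains the loadings of the study-specific factors \( l_{si} \), and \( e_{si} \sim N(0, \Psi_s) \) is idiosyncratic noise with \( \Psi_s \) diagonal; the Bayesian extension places sparsity-inducing shrinkage priors on both \( \Phi \) and \( \Lambda_s \).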
Daniel K. Sewell (University of Iowa) - An introduction to the statistical analysis of network data
9 and 10 September 2019 - Aula VII (ex Castellano), 10:00-16:00 (with a break)

Simone Russo - Social security disability: a study of the incidence of disability in the working-age population and an analysis of its determinants through register data
5 June 2019 - Aula Master (Viale Regina Elena 295), 10:00
The ageing of the Italian population is driving a considerable increase in the number of chronically ill and disabled people. To date there are no specific studies, especially for Italy, on the effects of population ageing in terms of social security disability benefits or, more generally, on the determinants of this phenomenon. Applications granted for social security disability benefits have increased considerably since the end of the 1990s. Overall, the analyses carried out show that the evolution of granted disability benefit applications is linked to a series of individual characteristics of the beneficiary workers and to contextual factors of various kinds, in particular demographic, territorial, epidemiological and economic factors, as well as factors related to the occupational structure.
Enrico Tucci - Emigration from Italy through the integration and analysis of statistical surveys and official sources
5 June 2019 - Aula Master (Viale Regina Elena 295), 10:00
The aim of this work is to analyse international migration since the 2011 population census through an integrated use of the available sources. In Italy, official statistics are produced by direct use of the population register, which does not capture the phenomenon in its entirety, mainly because of the difficulty of counting movements abroad. The results obtained in this work show that the information gap can be reduced through a longitudinal database based on individual-level data. The new analytical perspective, given by linking over time the movements of the same individual, makes it possible to observe phenomena that are relevant for migration policy, such as return and circular migration. Finally, the international mobility of the "new Italians" is examined through a longitudinal approach, and a regression model is applied to understand which characteristics are most strongly associated with the propensity to become Italian.
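As a purely illustrative sketch of the kind of regression mentioned at the end of the abstract (the actual register variables and model specification are not reproduced here), a logistic regression in R on hypothetical individual-level covariates might look as follows.

```r
## Minimal sketch with hypothetical variables; not the actual register data or model.
set.seed(2)
n <- 10000
d <- data.frame(
  acquired_citizenship = rbinom(n, 1, 0.2),               # 1 = became Italian (hypothetical)
  years_of_residence   = sample(1:30, n, replace = TRUE),
  age_at_arrival       = sample(0:60, n, replace = TRUE),
  sex                  = factor(sample(c("F", "M"), n, replace = TRUE))
)

fit <- glm(acquired_citizenship ~ years_of_residence + age_at_arrival + sex,
           data = d, family = binomial)
summary(fit)    # associations with the propensity to acquire citizenship
exp(coef(fit))  # odds ratios
```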
Simone Padoan (Università "Luigi Bocconi" di Milano) - Statistical modelling of extreme values
16-17 April 2019 - Sala 34, 10:00-14:00

Dr. Stefano Cavastracci Strascia and Dr. Agostino Tripodi - Overdispersed-Poisson Model in Claims Reserving: Closed Tool for One-Year Volatility in GLM Framework
29 March 2019 - Aula V, 14:15
The aim of this work is to build a tool for estimating the one-year volatility of the claims reserve, computed in closed form through generalized linear models (GLMs), in particular the overdispersed Poisson model. Until now, this one-year volatility has been estimated through the well-known bootstrap methodology, which requires Monte Carlo simulation together with a re-reserving technique. Nevertheless, this method is computationally time-consuming and requires further stability conditions; therefore, approximation techniques are often used in practice. Some applications developed with the software R, whose code is reported in the paper, will also be presented.
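The paper's closed-form tool is not reproduced here; as background, the following R sketch fits the standard overdispersed Poisson reserving GLM (quasi-Poisson in R) on a small, invented incremental run-off triangle, which is the model the one-year volatility is built on.

```r
## Minimal sketch of the overdispersed Poisson reserving GLM; the triangle is made up.
tri <- data.frame(
  origin = factor(c(1, 1, 1, 2, 2, 3)),   # accident year
  dev    = factor(c(1, 2, 3, 1, 2, 1)),   # development year
  inc    = c(100, 60, 20, 110, 70, 120)   # incremental paid claims
)

fit <- glm(inc ~ origin + dev, family = quasipoisson(link = "log"), data = tri)
summary(fit)$dispersion   # estimated over-dispersion parameter

## Predict the future (lower) triangle and sum to obtain the reserve estimate
future <- expand.grid(origin = factor(2:3), dev = factor(1:3))
future <- future[as.numeric(as.character(future$origin)) +
                 as.numeric(as.character(future$dev)) > 4, ]   # cells below the diagonal
reserve <- sum(predict(fit, newdata = future, type = "response"))
reserve
```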
Giorgio Consigli (Università degli Studi di Bergamo) - Asset-liability management for occupational pension funds under market and longevity risk: a case study and alternative modelling approaches
22 March 2019 - Aula V, 15:00
The modelling of institutional ALM problems has a long history in stochastic programming, starting in the late 1980s with the first industry developments, such as the well-known Yasuda Kasai model (Ziemba, Turner, Carino et al., 1994), specifically for pension fund management (PF ALM). Due to economic and demographic pressures in most OECD countries and an increasing interest in PF ALM developments by the industry and by policy makers, we nowadays witness a growing demand for R&D projects addressed to the scientific community. Taking the view of a PF manager, the presentation will develop around the definition of a generic pension fund (PF) asset-liability management (ALM) problem and analyse the key underlying methodological implications of: (i) its evolution from an early-stage multistage stochastic programming (MSP) formulation with recourse to the most recent MSP and distributionally robust optimization (DRO) formulations; (ii) a peculiar and rich risk spectrum including market risk as well as liability risk, such as longevity risk and demographic factors, leading to (iii) valuation or pricing approaches based on incomplete market assumptions and, due to recent international regulation, (iv) a risk-based capital allocation for long-term solvency. The above represent fundamental stochastic and mathematical problems of modern financial optimisation. Two possible approaches to DRO are considered, based on a stochastic control framework or on explicitly introducing an uncertainty set for probability measures and formulating the inner DRO problem as a probability distance minimization problem over a given space of measures. Keywords: asset-liability management, multistage stochastic programming, distributional uncertainty, distributionally robust optimization, solvency ratio, liability pricing, longevity risk, capital allocation.
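As a schematic illustration of the distributionally robust formulation mentioned in (i), in generic notation rather than the speaker's specific model, the PF manager's problem can be written as

\[
\min_{x \in \mathcal{X}} \; \sup_{Q \in \mathcal{U}} \; \mathbb{E}_{Q}\big[\, C(x, \xi) \,\big],
\qquad
\mathcal{U} = \{\, Q : d(Q, \hat{P}) \le \varepsilon \,\},
\]

where x collects the asset-liability decisions, \( \xi \) the random market, longevity and demographic risk factors whose true distribution is unknown, \( C(x, \xi) \) a cost such as a penalised funding shortfall, \( \hat{P} \) a nominal (estimated) distribution and d a probability distance defining the ambiguity set; the inner problem over the space of measures corresponds to the probability-distance formulation referred to in the abstract.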
Annibale Biggeri (Università di Firenze) - Uncertainty and reproducibility in biomedical research
22 February 2019 - Sala 34, 11:00


2018


Yves Tillé (Université de Neuchâtel) - How to select a sample?
27 November 2018 - Sala 34, 14:30
The principles of sampling can be synthesized as randomization, restriction and over-representation. Defining a sampling design – stratification, equal/unequal selection probabilities, etc. – means using prior information and is equivalent to assuming a model on the population. Several well-known sampling designs are optimal with respect to models that maximize the entropy. In the Cube method, prior information is used to derive a sample that matches the totals or means of auxiliary variables; in this respect, the sample is called balanced. Furthermore, if distances between statistical units – based on geographical coordinates or defined via auxiliary variables – are available, it can be worthwhile to spread the sample in space in order to make the design more efficient. In this perspective, new spatial sampling methods, such as GRTS, the local pivotal method and the local cube method, will be covered.
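A minimal base-R sketch of the local pivotal method mentioned above (an illustrative implementation written for this page, not the speaker's code; in practice one would typically use a dedicated package such as sampling or BalancedSampling). Coordinates and inclusion probabilities are simulated.

```r
## Local pivotal method: spatially spread sample respecting inclusion probabilities `prob`.
## Own minimal implementation for illustration only.
local_pivotal <- function(prob, coords, eps = 1e-9) {
  repeat {
    alive <- which(prob > eps & prob < 1 - eps)
    if (length(alive) < 2) break
    i <- sample(alive, 1)
    others <- setdiff(alive, i)
    ## nearest still-undecided neighbour of unit i
    j <- others[which.min(rowSums((coords[others, , drop = FALSE] -
                                   matrix(coords[i, ], length(others), 2, byrow = TRUE))^2))]
    s <- prob[i] + prob[j]
    if (s < 1) {                                # one of the two units is dropped
      if (runif(1) < prob[j] / s) { prob[i] <- 0; prob[j] <- s }
      else                        { prob[i] <- s; prob[j] <- 0 }
    } else {                                    # one of the two units is selected
      if (runif(1) < (1 - prob[j]) / (2 - s)) { prob[i] <- 1; prob[j] <- s - 1 }
      else                                    { prob[j] <- 1; prob[i] <- s - 1 }
    }
  }
  which(prob > 1 - eps)   # indices of the selected units
}

set.seed(3)
N <- 200; n <- 20
coords <- cbind(runif(N), runif(N))   # e.g. geographical coordinates
prob <- rep(n / N, N)                 # equal inclusion probabilities summing to n
sample_idx <- local_pivotal(prob, coords)
length(sample_idx)                    # about n units, well spread in space
```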
