Seminars


2024


External information borrowing in clinical trial hypothesis testing with controlled type I error rate inflation
April 5, 2024, 12:00
When designing a novel clinical trial, external information about the control and/or treatment arm effect is typically available. Borrowing such external information is often desirable in order to improve the trial’s efficiency, and can be of crucial importance in situations where the sample size that can realistically be recruited is limited, as in, e.g., pediatric or rare disease trials. The Bayesian approach allows borrowing of such external information through the adoption of informative prior distributions. An issue associated with the incorporation of external information is that external and current information may systematically differ. However, such inconsistency may not be predictable or quantifiable a priori. Robust prior choices are typically proposed to avoid extreme worsening of operating characteristics in such situations. In this talk, we will focus on the frequentist type I error rate and power. We will in particular consider how the type I error rate is affected by the incorporation of external information, and present a novel approach which allows a principled and controlled inflation. Both one- and two-arm clinical trial designs will be considered.
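As a stylised illustration of the mechanism (a normal-normal sketch of my own, not necessarily the setting of the talk): let the data give \bar{x} \mid \theta \sim N(\theta, \sigma^2/n) and the informative prior be \theta \sim N(\theta_0, \sigma^2/n_0), so that

\theta \mid \bar{x} \sim N\left( \frac{n\bar{x} + n_0\theta_0}{n + n_0}, \; \frac{\sigma^2}{n + n_0} \right),

and success is declared when \Pr(\theta > 0 \mid \bar{x}) \ge 1 - \alpha, i.e. when \bar{x} \ge c with

c = \frac{\sigma z_{1-\alpha}\sqrt{n + n_0} - n_0\theta_0}{n}.

The frequentist type I error rate is then \Pr_{\theta = 0}(\bar{x} \ge c) = 1 - \Phi(c\sqrt{n}/\sigma), which exceeds the nominal \alpha whenever an optimistic prior (\theta_0 > 0) pulls c below the no-borrowing threshold \sigma z_{1-\alpha}/\sqrt{n}; a controlled inflation amounts to tuning the amount of borrowing (or the decision threshold) so that this probability stays below a prespecified level \alpha^\ast > \alpha.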
Does measurement error distort country differences in temporary employment? A study on Italy and the Netherlands using a multi-group hidden Markov model
March 22, 2024, 12:00
This paper investigates the effect of measurement error on two key labour market indicators: the distribution of temporary employment and the transition rates in and out of temporary employment over time in Italy and the Netherlands. In this way, we study whether the cross-country differences in these indicators (stemming from the different institutions in the two labour markets) persist when we correct for measurement error in socioeconomic data. The comparative analysis of the Italian and Dutch labour markets is carried out for the period 2017-2019 using linked employment data from the Labour Force Survey and the Employment Register of the two countries. For this purpose, we use a multiple-group hidden Markov model with two indicators for the employment contract type that accounts for both random and systematic measurement error. The results indicate that measurement error severely biases our view of mobility from temporary to permanent employment in the two countries and also distorts the picture of cross-country differences in the phenomenon of interest.
Finite Population Survey Sampling: An unapologetic Bayesian Perspective
March 1, 2024, 18:00
In this talk I will offer some perspectives on Bayesian inference for finite population quantities when the units in the population are assumed to exhibit complex dependencies. Beginning with an overview of Bayesian hierarchical models, including some that yield design-based Horvitz-Thompson estimators, the talk proceeds to introduce dependence in finite populations and sets out inferential frameworks for ignorable and nonignorable responses. Multivariate dependencies using graphical models and spatial processes are discussed and some salient features of two recent analyses for spatially oriented finite populations are presented.

2023


Ten years of mobile phone big data statistical analyses
December 1, 2023, 12:00
In the era of big data, monitoring and forecasting people crowding and mobility is a relevant aspect for urban policies, and smart cities use signals from mobile phone networks to support the optimization of urban systems and flows. Mobile phone data can be used for various purposes, as they come in different types: in this talk, applications are presented for social and cultural events monitoring, mobility flows and flooding risk analysis. Special attention is devoted to the statistical methods useful for these analyses, based on spatio-temporal data: briefly discussed are results obtained using the Histogram of Oriented Gradients approach for image reduction, the Functional Data Clustering of time series and the VARX model with Harmonic Dynamic Regression.
From Rome to London: Breaking the Silos of Disciplines. A Conversation around Cardiometabolic Risks Factors Modelling and the Use of Wastewater in Public Health
November 27, 2023, 10:00-12:00

Circular local likelihood regression
November 24, 2023, 12:00
In this talk, we will present a general framework for estimating regression models with circular covariates and a general response. We will start with an overview of nonparametric regression models with a circular covariate, reviewing the main ideas and motivating the need for a more general method. Our goal is to estimate (nonparametrically) a conditional characteristic by maximizing the circular local likelihood. The proposed estimator is shown to be asymptotically normal. The problem of selecting the smoothing parameter is also addressed, as well as bias and variance computation. The finite sample performance of the estimation method in practice is studied through an extensive simulation study, where we cover the cases of Gaussian, Bernoulli, Poisson and Gamma distributed responses. The generality of our approach is illustrated with several real-data examples from different fields. In particular, we will focus on an example of neural response in macaques. This is joint work with M. Alonso-Pena and I. Gijbels and corresponds to two published papers in Biometrics (2023) and the Journal of the American Statistical Association (2023).
Surveying sensitive topics with indirect questioning techniques: methods and real applications
November 17, 2023, 12:00
In many fields of applied research, notably in sociological, economic, demographic, ecological and medical studies, the investigator very often has to gather information concerning highly personal, sensitive, stigmatizing, and perhaps incriminating issues such as drug addiction, domestic violence, racial prejudice, illegal income, and noncompliance with laws and regulations. Doing research on sensitive themes through traditional direct questioning survey modes is not an easy matter, since it is likely to meet with two sources of error: nonresponse and untruthful answers. These errors can seriously flaw the quality of the data and, thus, jeopardize the usefulness of the collected information for subsequent analyses, including inference on unknown characteristics of the population under study. Although the errors cannot be totally avoided, they may be mitigated by increasing respondent cooperation through a nonstandard data-collection approach based on indirect questioning techniques (IQTs). The talk aims to introduce some issues related to privacy protection when sensitive topics are surveyed, give a general idea of the approach, and illustrate how some IQTs have been used to collect data and obtain prevalence estimates in a number of real studies about illegal immigrants, abortion, drug use, cannabis legalization, sexual behaviours and Covid-19 health behaviours.
Specification tools for time fixed effects in country panels
November 13, 2023, 15:00
This paper proposes specification tools for time fixed-effects in country panels that complement summary and graphical representations of the data. They cover the standard two-way fixed-effect model, as well as other more general d-way fixed effect specifications with d ≥ 3. The tools are based on the observable characteristics of univariate time series of contrasts implied by a given specification; they use flagging rules based on graphical or statistical analysis. Evidence on which contrasts do not contain time fixed-effects can be harvested by algorithms; this paper discusses two examples of such algorithms. Implications for the specification of Differences in Differences estimation are discussed, and results are illustrated using a country panel of prices of mobile telecommunication services.
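For reference, the standard two-way fixed-effect specification for a country panel can be written (notation mine, not necessarily that of the paper) as

y_{it} = \alpha_i + \lambda_t + \beta^\top x_{it} + \varepsilon_{it}, \qquad i = 1,\dots,N \text{ (countries)}, \; t = 1,\dots,T,

where the \lambda_t are the time fixed effects. A cross-country contrast such as y_{it} - y_{jt} eliminates \lambda_t under this specification, so its univariate time series should show no residual common time pattern; observable departures from this implication are the kind of evidence the proposed flagging rules and algorithms are designed to harvest.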
A parsimonious family of mixtures of multivariate Poisson log-normal factor analyzers for clustering count data
November 3, 2023, 15:00
Multivariate count data are commonly encountered in bioinformatics. Although the Poisson distribution seems a natural fit for these count data, its multivariate extension is computationally expensive. Recently, mixtures of multivariate Poisson lognormal (MPLN) models have been used to efficiently analyze these multivariate count measurements. In the MPLN model, the counts, conditional on the latent variable, are modelled using a Poisson distribution, and the latent variable comes from a multivariate Gaussian distribution. Due to this hierarchical structure, the MPLN model can account for over-dispersion as opposed to the traditional Poisson distribution and allows for correlation between the variables. The mixture of multivariate Poisson-log normal distributions for high dimensional data is extended by incorporating a factor analyzer structure in the latent space. A family of parsimonious mixtures of multivariate Poisson lognormal distributions are proposed by decomposing the covariance matrix and imposing constraints on these decompositions. The performance of the model is demonstrated using simulated and real datasets.
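A compact sketch of the hierarchy described above (notation mine): for a count vector Y_i with d variables,

Y_{ij} \mid \theta_{ij} \sim \text{Poisson}(e^{\theta_{ij}}), \quad j = 1,\dots,d, \qquad \boldsymbol{\theta}_i \sim N_d(\boldsymbol{\mu}, \boldsymbol{\Sigma}),

and the factor-analyzer extension constrains the latent covariance as \boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}, with a d \times q loading matrix \boldsymbol{\Lambda} (q \ll d) and diagonal \boldsymbol{\Psi}; the parsimonious family then arises from constraining these components across mixture components.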
Testing for treatment effect in multitreatment case
October 27, 2023, 12:00
In the present seminar, the problem of testing for the presence/absence of a treatment effect is discussed. A new test-statistic, essentially based on the same principles as the classical Kruskal-Wallis test, is introduced, and its theoretical properties are studied. The good behaviour of the proposed test in terms of both significance level and power, with respect to other commonly used test procedures, is shown through a simulation study. Test-statistics for stochastic dominance problems are also studied.
Hidden Markov Models for error corrected statistics
October 20, 2023, 12:00
Policy making is based on official statistics that may still include measurement error (ME). This ME is typically the result of administrative delays in registration, differences in conceptual definitions, or processing errors. ME can lead to a distorted view of the number of people in groups of interest, threatening the integrity of official statistics and therefore also the effectiveness of policies based on them. Hidden Markov models (HMMs) are statistical models that approximate categorical variables that are measured with error, using the observed data. First, HMMs use two observed measures of the same statistic from different sources to approximate the error-corrected number of individuals at every moment in time. The two observed measures help to triangulate the “true” information and reduce the effect of the ME that exists in each of them. Second, HMMs estimate the transition rate from one state (e.g., receiving social assistance benefits) to other states (e.g., employment) and vice versa using the error-corrected measure approximated before. To do so, HMMs require several (at least three) observations at different time points per individual, which are available in our case in both registers. In this way, HMMs combine a measurement model (error correction) and a structural model (representing the relations and changes over time).
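In symbols, a minimal version of the two-indicator hidden Markov model sketched above (my notation, not the exact specification of the talk) factorises the joint probability of the latent states X_t and the two observed measures Y_t^{(1)}, Y_t^{(2)} as

\Pr(X_1) \prod_{t=2}^{T} \Pr(X_t \mid X_{t-1}) \; \prod_{t=1}^{T} \Pr(Y_t^{(1)} \mid X_t)\,\Pr(Y_t^{(2)} \mid X_t),

where the transition probabilities \Pr(X_t \mid X_{t-1}) form the structural part, the conditional response probabilities form the measurement (error-correction) part, and the two indicators are assumed conditionally independent given the latent state.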
The resilience of complex networks: methods and applications
October 6, 2023
The analysis of the resilience of a network is of key relevance in many contexts of applied science, for its natural connections with the assessment of the stability of an overall system. In this talk I will present some methodological criteria for building suitable resilience measures, along with some applicative instances. I will also provide some remarks on avenues of future research, by including also a discussion on the possible connections between complex networks and reliability theory.
Time series segmentation by non-homogeneous hidden semi-Markov models
May 26, 2023
Motivated by classification issues in environmental studies, a class of hidden semi-Markov models is introduced to segment multivariate time series according to a finite number of latent regimes. The observed data are modelled by a mixture of multivariate densities, whose parameters evolve according to a latent multinomial process. The multinomial process is modelled as a semi-Markov chain where the time spent in a state and the chances of a regime-switching event are separately modelled by a battery of regression models that depend on time-varying covariates. Maximum likelihood parameter estimation is carried out by integrating an EM algorithm with a suitable data augmentation. While the proposal extends previous approaches that rely on mixture models and hidden Markov models, it keeps a parsimonious structure that facilitates the interpretation of results. It is illustrated on a case study of a bivariate time series of wind and wave directions, observed by a buoy in the Adriatic Sea.
ABCC: Approximate Bayesian Conditional Copulae
April 19, 2023
Copula models are flexible tools to represent complex structures of dependence for multivariate random variables. According to Sklar's theorem any d-dimensional absolutely continuous density can be uniquely represented as the product of the marginal distributions and a copula function that captures the dependence structure among the vector components. In real data applications, the interest of the analyses often lies on specific functionals of the dependence, which quantify aspects of it in a few numerical values. A broad literature exists on such functionals, however, extensions to include covariates are still limited. This is mainly due to the lack of unbiased estimators of the copula function, especially when one does not have enough information to select the copula model. Recent advances in computational methodologies and algorithms have allowed inference in the presence of complicated likelihood functions, especially in the Bayesian approach, whose methods, despite being computationally intensive, allow us to better evaluate the uncertainty of the estimates. In this work, we present two Bayesian methods to approximate the posterior distribution of functionals of the dependence, using nonparametric models which avoid the selection of the copula function. These methods are compared in simulation studies and in a realistic application in astrophysics. Joint work with Clara Grazian and Luciana Dalla Valle.
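For concreteness, the density form of Sklar's representation referred to above is

f(x_1,\dots,x_d) = c\{F_1(x_1),\dots,F_d(x_d)\} \prod_{j=1}^{d} f_j(x_j),

where the F_j and f_j are the marginal distribution functions and densities and c is the copula density; in the conditional setting considered here, c (and hence any functional of the dependence, such as Kendall's tau) is allowed to vary with a covariate, c(\cdot \mid z).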
Convex clustering of mixed numerical and categorical data
April 28, 2023
Clustering analysis is an unsupervised learning technique widely used for information extraction. Current clustering algorithms often face instabilities due to the non-convex nature of their objective function. The class of convex clustering methods does not suffer from such instabilities and finds a global optimum for the clustering objective. Whereas convex clustering has previously been established for single-type data, real-life data sets usually comprise both numerical and categorical, or mixed, data. Therefore, we introduce the mixed data convex clustering (MIDACC) framework. We implement this framework by developing a dedicated subgradient descent algorithm. Through numerical experiments, we show that, in contrast to baseline methods, MIDACC achieves near-perfect recovery of both spherical and non-spherical clusters, is able to capture information from mixed data while distinguishing signal from noise, and has the ability to recover the true number of clusters present in the data. Furthermore, MIDACC outperforms all baseline methods on a real-life data set.
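As background, the convex clustering objective for purely numerical observations x_1,\dots,x_n can be written (a generic formulation, not necessarily the exact MIDACC objective) as

\min_{u_1,\dots,u_n} \; \frac{1}{2}\sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^2 + \lambda \sum_{i<j} w_{ij} \lVert u_i - u_j \rVert_2,

which is convex and therefore admits a global optimum; observations whose centroids u_i fuse are assigned to the same cluster, and \lambda controls how many clusters survive. The MIDACC framework extends this idea so that the loss also accommodates categorical variables, and a dedicated subgradient descent algorithm is developed to solve it.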
The Multivariate Heterogeneous Preference Estimation for Switching Multiple Price List
April 21, 2023
Anna Conte (Dipartimento di Scienze Statistiche, Sapienza Università di Roma). The Multiple Price List (MPL) and Switching Multiple Price List (sMPL) provide a useful framework for estimating preference parameters, most usually risk aversion, from a sample of experimental subjects or survey respondents. In this paper, we consider designs in which more than one sMPL is presented to each subject, allowing more than one preference parameter to be estimated simultaneously, and we propose a consistent estimator in this setting - the Multivariate Heterogeneous Preference (MHP) estimator. Focusing on the bivariate case of two sMPLs and two preference parameters, we demonstrate that non-standard econometric techniques, namely Monte Carlo integration with importance sampling, are required to implement the MHP estimator. Via a Monte Carlo exercise, we show that our estimator has good finite-sample properties. Finally, we apply the MHP estimator to a real data set and compare the estimates to those obtained using an inconsistent estimator applied in previous studies.
Directional distribution depth function and its application to classification
April 14, 2023
Statistical depth functions are introduced as a way to provide a center-outward ordering of the sample points in multidimensional space, which can be used for outlier detection, classification, and other exploratory tools. In this work we propose a novel definition of depth function for multivariate data using random directions, which preserves the Mahalanobis distance of the points in the original space. More specifically, the proposed depth function is the expectation of all depths along the potentially infinite random directions, which, in turn, are functions of the point percentiles estimated via parametric or nonparametric models. The proposed method is evaluated through simulated experiments and real data applications, and is shown to be effective in supervised classification problems.
Functional estimation of anisotropic covariance and autocovariance operators on the sphere
March 31, 2023
In this talk we present nonparametric estimators for the second-order central moments of possibly anisotropic spherical random fields, within a functional data analysis context. We consider a measurement framework where each random field among an identically distributed collection of spherical random fields is sampled at a few random directions, possibly subject to measurement error. The collection of random fields could be i.i.d. or serially dependent. Though similar setups have already been explored for random functions defined on the unit interval, the nonparametric estimators proposed in the literature often rely on local polynomials, which do not readily extend to the (product) spherical setting. We therefore formulate our estimation procedure as a variational problem involving a generalized Tikhonov regularization term. Using the machinery of reproducing kernel Hilbert spaces, we establish representer theorems that fully characterize the form of our estimators. We determine their uniform rates of convergence as the number of random fields diverges, both for the dense (increasing number of spatial samples) and sparse (bounded number of spatial samples) regimes. A simulation study and a preliminary exploration of a real dataset of ocean temperatures will be also discussed. Joint work with Julien Fageot, Matthieu Simeoni and Victor M. Panaretos.
A long history of model based clustering based on trimming and constraints
March 24, 2023, 12:00
Model based clustering plays a major role in data analysis. Our interest focuses on approaches related to maximum likelihood estimation via EM/CEM algorithms. However, it is very common that input datasets contain observations belonging to contaminating sources, outside the assumed family of distributions in the chosen model. It is well known that this contamination in the sample is able to break likelihood based estimators. Methodology based on the joint application of trimming and constraints, under the label TCLUST, has been developed for robustifying model based clustering proposals. Trimming tries to eliminate contaminating observations; however, in order to achieve robust proposals in clustering, it is also necessary to apply constraints to control the relative size of the clusters' variability. TCLUST procedures are available for estimating mixtures in different settings: linear models, factor analyzers and functional data, among others. Statistical properties of TCLUST procedures, including consistency and a non-negligible breakdown point, are available. TCLUST's constraints have evolved in the last few years, providing improved flexibility in order to capture the patterns in the covariance matrix decomposition included in the classical parsimonious family of Celeux and Govaert. An important open issue in TCLUST procedures is related to their input parameters: the number of clusters, the level of trimming and the strength of the constraints. Exploratory tools and automated procedures for assisting users in choosing these input parameters have been developed. TCLUST procedures are available in CRAN (the 'tclust' package) and in MATLAB (the 'FSDA' toolbox).
Causal Regularization
March 17, 2023, 12:00
Causality is the holy grail of science, but humankind has struggled to operationalize it for millennia. In recent decades, a number of more successful ways of dealing with causality in practice, such as propensity score matching, the PC algorithm, and invariant causal prediction, have been introduced. However, approaches that use a graphical model formulation tend to struggle with computational complexity whenever the system gets large. Finding the causal structure typically becomes a combinatorially hard problem. In our causal inference approach, we build on ideas present in invariant causal prediction, the causal Dantzig and anchor regression, by replacing combinatorial optimization with a continuous optimization using a form of causal regularization. This makes our method applicable to large systems. Furthermore, our approach allows a precise formulation of the trade-off between in-sample and out-of-sample prediction error.
Central Quantile subspace and its applications
March 17, 2023, 16:00
Quantile regression (QR) is becoming increasingly popular due to its relevance in many scientific investigations. There is a great amount of work about linear and nonlinear QR models. Specifically, nonparametric estimation of the conditional quantiles received particular attention, due to its model flexibility. However, nonparametric QR techniques are limited in the number of covariates. Dimension reduction offers a solution to this problem by considering low-dimensional smoothing without specifying any parametric or nonparametric regression relation. The existing dimension reduction techniques focus on the entire conditional distribution. We, on the other hand, turn our attention to dimension reduction techniques for conditional quantiles and introduce a new method for reducing the dimension of the predictor X. The performance of the methodology is demonstrated through simulation examples and data applications, especially to financial data. Finally, various extensions of the method are presented, such as nonlinear dimension reduction and the use of categorical predictors.
smoothEM: a new approach for the simultaneous assessment of smooth patterns and spikes
March 10, 2023
We consider functional data where an underlying smooth curve is observed together not just with errors, but also with irregular spikes that (a) are themselves of interest, and (b) can negatively affect our ability to characterize the underlying curve. We propose an approach that, combining regularized spline smoothing and an Expectation-Maximization algorithm, allows one to both identify spikes and estimate the smooth component. Imposing some assumptions on the error distribution, we prove consistency of the EM estimates. Next, we demonstrate the performance of our proposal on finite samples, and its robustness to assumption violations, through simulations. Finally, we apply our proposal to data on the annual heatwave index in the US and on weekly electricity consumption in Ireland. In both datasets, we are able to characterize underlying smooth trends and to pinpoint irregular/extreme behaviors. Work in collaboration with Huy Dang (Penn State University) and Francesca Chiaromonte (Penn State University and Sant’Anna School of Advanced Studies).
Spatio-Temporal Semantic Partitions of the Land Surface through Deep Embeddings
March 3, 2023
Temporal sequences of satellite images constitute a highly valuable and abundant resource to analyze a given region. However, the labeled data needed to train most machine learning models are scarce and difficult to obtain. In this context, we investigate a fully unsupervised methodology that, given a sequence of images, learns a semantic embedding and then, creates a partition of the ground according to its semantic properties and its evolution over time. We illustrate the methodology by conducting the semantic analysis of a sequence of satellite images of a region of Navarre (Spain). The proposed approach reveals a novel broad perspective of the land, where potentially large areas that share both a similar semantic and a similar temporal evolution are connected in a compact and well-structured manner. The results also show a close relationship between the allocation of the clusters in the geographic space and their allocation in the embedded spaces. The semantic analysis is completed by obtaining the representative sequence of tiles corresponding to each cluster, the linear interpolation between related areas, and a graph that shows the relationships between the clusters, providing a concise semantic summary of the whole region.

2022


Wrapping onto a torus: handling multivariate circular data in the presence of outliers
December 16, 2022, 12:00
Multivariate circular data arise commonly in many different fields, including the analysis of wind directions, protein bioinformatics, animal movements, handwriting recognition, people orientation, cognitive and experimental psychology, human motor resonance, neuronal activity, robotics, astronomy, biology, physics, earth science and meteorology. Observations can be thought of as points on a p-dimensional torus, whose surface is obtained by revolving the unit circle in a p-dimensional manifold. The peculiarity of multivariate torus data is periodicity, which is reflected in the boundedness of the sample space and often of the parametric space. The problem of modeling circular data has been tackled through suitable distributions, among which two of the most popular are the von Mises and the Wrapped Normal. Here, we focus on the family of unimodal and elliptically symmetric wrapped distributions, with emphasis on the Wrapped Normal. Despite the boundedness of the support of circular variates, torus data are not immune to the occurrence of outliers, that is, unexpected values, such as angles or directions, that do not share the main pattern of the bulk of the data. A robust procedure to fit a wrapped distribution is then presented. The proposed algorithm is characterized by the computation of data-dependent weights aimed at down-weighting anomalous values. We discuss and compare different approaches to obtain the weights, with particular attention to the weighted likelihood methodology. A formal outlier detection rule is also suggested, based on classical robust distances evaluated over unwrapped data.
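For reference, the (univariate) Wrapped Normal density mentioned above is obtained by wrapping a Gaussian onto the circle,

f_{WN}(\theta; \mu, \sigma^2) = \sum_{k \in \mathbb{Z}} \phi(\theta + 2\pi k; \mu, \sigma^2), \qquad \theta \in [0, 2\pi),

with the multivariate (torus) version wrapping each component of a p-dimensional Gaussian. The vector of wrapping indices k acts as a latent variable, and the “unwrapped data” used for the robust distances are the corresponding values \theta + 2\pi k on the real line.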
Pseudo-populations resampling for finite populations under complex designs
December 2, 2022, 12:00
"Pseudo-populations resampling for finite populations under complex designs" Pier Luigi Conti Dipartimento di Scienze Statistiche Sapienza Università di Roma Abstract: The present talk is devoted to resampling for finite populations when the sampling design is not simple. As a consequence of the complex sampling design, there is dependence among sampled units. Hence, classical Efron bootstrap does not work in the case under examination. Resampling schemes based on pseudo-populations will be developed, and their main justifications and properties will be shown. The approach used is of asymptotic nature, and parallels results obtained by Bickel and Friedman for the i.i.d. case. Main applications of theoretical results are devoted to the construction of confidence intervals for finite population parameters. Finally, computational issues will be discussed. In allegato la locandina con l'abstract e i riferimenti per partecipare al seminario in presenza e a distanza.
Addressing dataset shift in supervised classification via data perturbation
November 25, 2022, 12:00
In supervised classification, dataset shift occurs when, for the units in the test set, a change in the distribution of a single feature, a combination of features, or the class boundaries is observed with respect to the training set. As a result, in real data applications, the common assumption that the training and testing data follow the same distribution is often violated. Dataset shift might be due to several reasons; the focus here is on what is called “covariate shift”, namely the conditional probability p(y|x) remains unchanged, but the input distribution p(x) differs from training to test set. Random perturbation of variables or units when building the classifier can help in addressing this issue. Evidence of the performance of the proposed approach is obtained on simulated and real data.
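In symbols, the covariate-shift assumption described above is

p_{\text{train}}(y \mid x) = p_{\text{test}}(y \mid x), \qquad p_{\text{train}}(x) \neq p_{\text{test}}(x),

so the class boundaries stay fixed while the input density moves. A classical remedy (not the one proposed in this talk) is to reweight training units by w(x) = p_{\text{test}}(x) / p_{\text{train}}(x); the approach presented here instead perturbs variables or units at random when building the classifier.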
Stein’s Method Meets Statistics: A Review of Some Recent Developments
November 18, 2022, 12:00
Stein’s method compares probability distributions through the study of a class of linear operators called Stein operators. While initially studied in the field of probability, Stein’s method has led to significant advances in theoretical statistics, computational statistics and machine learning in recent years. In this talk, I will present some of these recent developments and, in doing so, try to stimulate further research into the successful field of Stein’s method and statistics. The topics I shall discuss include (if the time permits) new insights into the finite-sample approximation of estimators (like maximum likelihood estimators), a measure of the impact of the prior choice in Bayesian statistics, tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, parameter estimation and goodness-of-fit testing. This talk is based on a large collaborative effort with many co-authors.
On mixtures of linear quantile regressions for longitudinal and clustered data
November 11, 2022, 12:00
Quantile regression represents a well established technique for modelling data when the interest is in the effect of predictors on the conditional response quantiles. When responses are repeatedly collected over time, or when they are hierarchically nested, dependence needs to be properly considered. A standard way of proceeding is based on including higher-level unit-specific random coefficients in the model. The distribution of such coefficients may be either specified parametrically or left unspecified. In the latter case, it can be estimated nonparametrically by using a discrete distribution defined on G locations. This may approximate the distribution of time-constant and/or time-varying random coefficients, leading to a static, dynamic, or mixed-type mixture of linear quantile regression equations. An EM algorithm and a block-bootstrap procedure are employed to derive parameter estimates and corresponding standard errors. Standard penalized likelihood criteria are used to identify the optimal number of mixture components. This class of models is described using a benchmark dataset and employing the functions in the newly developed lqmix R package.
Causal effects of chemotherapy regimen intensity on survival outcome through Marginal Structural Models
November 4, 2022, 12:00
As patients develop toxic side-effects, cancer treatment is adapted over time by either delaying or reducing the dosage of the next chemotherapy course. In this talk, Marginal Structural Models in combination with inverse-probability-of-treatment weighted estimators will be discussed to assess the causal effects of chemotherapy regimen modifications on survival outcome. The focus is on the use of actual treatment data and the Received Dose Intensity, in contrast with the use of the intended treatment regimen. The latter approach, known as intention to treat, is very common but also very far from everyday clinical practice. In this talk, I will discuss the confounding nature of toxic side-effects data and show the damaging effect of not including toxicity in the analysis. The method developed is applied to the osteosarcoma randomised clinical trials BO03 and BO06 (EORTC 80861 and 80931).
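As a compact reminder of the estimator family involved (standard notation, not specific to the BO03/BO06 analysis): with treatment history \bar{A}_t and time-varying confounders such as toxicity \bar{L}_t, the stabilised inverse-probability-of-treatment weight for a patient up to course t is

sw_t = \prod_{s=0}^{t} \frac{\Pr(A_s = a_s \mid \bar{A}_{s-1})}{\Pr(A_s = a_s \mid \bar{A}_{s-1}, \bar{L}_s)},

and fitting the marginal structural survival model with each patient-course weighted by sw_t removes the confounding induced by toxicity-driven dose adaptations, provided the treatment-assignment models are correctly specified.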
Density modelling with Functional Data Analysis
October 28, 2022, 12:00
Recent technological advances have eased the collection of large amounts of data in many research fields. In this scenario, a useful statistical technique is density estimation, which represents an important source of information. One-dimensional density functions represent a special case of functional data, subject to the constraints of being non-negative and having a constant integral equal to one. Because of these constraints, density functions do not form a vector space, and a naive application of functional data analysis (FDA) methods may lead to invalid estimates. To address this issue, two main strategies can be found in the literature. In the first, the probability density functions (pdfs) are mapped into a linear functional space through a suitably chosen transformation. Established methods for Hilbert space valued data can be applied to the transformed functions and the results are moved back into the density space by means of the inverse transformation. In the second strategy, probability density functions are treated as infinite-dimensional compositional data, since they are parts of some whole which only carry relative information. In this work, by means of a suitable transformation, densities are embedded in the Hilbert space of square integrable functions, where standard FDA methodologies can be applied.
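One widely used transformation of this kind (a common choice in the literature, not necessarily the one adopted in this work) is the centred log-ratio map

clr(f)(x) = \log f(x) - \frac{1}{|I|} \int_{I} \log f(t)\, dt, \qquad x \in I,

which sends a density on a bounded interval I to a square-integrable function integrating to zero; FDA tools are applied to clr(f), and estimates are mapped back to the density space via f \propto \exp\{clr(f)\}, normalised to integrate to one.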
The three-sigma rule to define antibody positivity: is it a beauty or a beast?
October 14, 2022, 12:00
Many epidemiological studies aim to estimate the proportion of individuals currently or previously infected by a given microorganism. Given that an infection inevitably leads to an immune response, this estimation exercise often requires identifying individuals who reach a minimal level of microbe-specific antibodies in their serum. This threshold is invariably defined by the three-sigma rule: the mean plus three times the standard deviation of the hypothetical antibody-negative population. Notwithstanding not being linked to a specific parametric distribution, it has the most intuitive interpretation in the context of a normal distribution. I will then discuss the problems of estimation bias and apparent control of specificity arising from applying this rule to nonnormal distributions for the seronegative population. I will use public data on antibody testing against SARS-CoV-2 to illustrate these problems. We should finally ask ourselves whether the three-sigma rule is a beautiful statistical concept or, instead, a little beast hidden in antibody data analysis.
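A small numerical illustration of the point about nonnormal seronegative distributions (entirely hypothetical data, not from the talk): under normality the mean-plus-three-SD cutoff keeps roughly 99.9% of true negatives below threshold, while a right-skewed negative population loses part of that apparent specificity.

import numpy as np

rng = np.random.default_rng(42)

def specificity_of_three_sigma(negatives):
    # Three-sigma rule: threshold = mean + 3*SD of the antibody-negative sample
    cutoff = negatives.mean() + 3 * negatives.std()
    # Fraction of true negatives correctly classified as below the cutoff
    return np.mean(negatives <= cutoff)

# Hypothetical antibody readings for two seronegative populations
normal_negatives = rng.normal(loc=1.0, scale=0.2, size=100_000)       # symmetric
skewed_negatives = rng.lognormal(mean=0.0, sigma=0.8, size=100_000)   # right-skewed

print(f"normal negatives: {specificity_of_three_sigma(normal_negatives):.4f}")  # roughly 0.999
print(f"skewed negatives: {specificity_of_three_sigma(skewed_negatives):.4f}")  # noticeably lower, i.e. more false positives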
A general framework for implementing distances for categorical variables
June 17, 2022, 13:30
In many statistical methods, distance plays an important role. For instance, data visualization, classification and clustering methods require quantification of distances among objects. How to define such distance depends on the nature of the data and/or problem at hand. For distance between numerical variables, in particular in multivariate contexts, there exist many definitions that depend on the actual observed differences between values. It is worth underlining that often it is necessary to rescale the variables before computing the distances. Many distance functions exist for numerical variables. For categorical data, defining a distance is even more complex as the nature of such data prohibits straightforward arithmetic operations. Specific measures therefore need to be introduced that can be used to describe or study structure and/or relationships in the categorical data. In this paper, we introduce a general framework that allows an efficient and transparent implementation for distance between categorical variables. We show that several existing distances (for example distance measures that incorporate association among variables) can be incorporated into the framework. Moreover, our framework quite naturally leads to the introduction of new distance formulations as well.
Model-assisted indirect small area estimation
May 27, 2022, 12:00
Generalised regression is the most common design-based model-assisted method for estimation of population means and totals in practical survey sampling. However, it is often unacceptable in the context of small area estimation, where one is interested in population means and totals for a large number of areas (or domains) and the sample sizes are either small or non-existent in many of them. In this seminar, we discuss an approach to extend generalised regression from direct estimation for the whole population to indirect estimation of all the small area populations. This requires to trade variance off with bias and enables a practical methodology for estimation at the different aggregation levels, which is coherent numerically (self-benchmarking) as well as conceptually in terms of the design-based model-assisted inference outlook. Estimation can be conducted by means of an *extended* weighting system that has as many sets of weights as the number of small areas: each set produces the estimate for a domain mean of one or more survey variables of interest and is, in this sense, multipurpose.
Extending the boundaries of a macroeconometric model for Italian economy to inequality
May 20, 2022, 12:00
According to the growing debate on the beyond-GDP approach, a strand of literature explores how the traditional system of national accounts (SNA), which is the pillar of GDP measurement, could be extended to account for some of the main themes related to well-being and sustainability. In this presentation we extend the macroeconometric model for Italy developed by Istat (MeMo-It) by introducing an inequality measure in the consumption function. The empirical analysis shows that a positive income shock that increases aggregate consumption in the current year might be completely offset by the negative effect of the increase in inequality that becomes effective in the next year. In this framework, the impact of the Italian “reddito di cittadinanza”, a policy measure aiming at reducing poverty, has been evaluated. According to the results obtained, we support the idea that a step forward on well-being and sustainability could be taken starting from a structural macroeconometric approach.
Bayesian Statistics applied to Early “Oncology Drug Development”
May 13, 2022, 17:00
Oncology dose finding studies, in general, aim at determining the maximum tolerated dose (MTD), reflecting the desire to treat patients who have limited options under the assumption that higher drug doses will have better therapeutic activity. We describe different methods (i.e., 3+3, mTPI, mTPI-2, and BLRM). This seminar will feature speakers from Pfizer Inc., who will share their insights and the recent statistical innovations to address the challenges. In addition to safety evaluation, Early Sign of Efficacy (ESOE) is a critical step in all early clinical programs to decide whether or not to extend the development of a molecule. Robust and consistent calculation of the probability of making the right decision is critical. Innovative methodologies are needed to optimize these calculations and ensure all molecules are assessed in the same way across the oncology portfolio. Case studies will be discussed for dose finding and the utilization of Bayesian statistics in ESOE evaluation.
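For readers unfamiliar with the rule-based designs named above, here is a minimal simulation of the classical 3+3 escalation rule (my own sketch under textbook assumptions; mTPI, mTPI-2 and BLRM are model-based and not shown):

import random

def run_3_plus_3(true_dlt_probs, seed=0):
    """Simulate one 3+3 trial; true_dlt_probs[k] is the assumed DLT probability at dose k.
    Returns the index of the declared MTD, or -1 if even the lowest dose is too toxic."""
    rng = random.Random(seed)
    dose = 0
    while True:
        # First cohort of 3 patients at the current dose
        dlts = sum(rng.random() < true_dlt_probs[dose] for _ in range(3))
        if dlts == 1:
            # One dose-limiting toxicity: expand the cohort to 6
            dlts += sum(rng.random() < true_dlt_probs[dose] for _ in range(3))
        if dlts <= 1:
            if dose == len(true_dlt_probs) - 1:
                return dose        # highest dose tolerated: declared MTD
            dose += 1              # escalate to the next dose
        else:
            return dose - 1        # too many DLTs: MTD is the previous dose

print(run_3_plus_3([0.05, 0.10, 0.25, 0.40]))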
Bayesian Inference In High-dimensional Spatial Statistics: Conquering New Challenges
May 6, 2022, 17:00
Geographic Information Systems (GIS) and related technologies such as remote sensors, satellite imaging and portable devices that are capable of collecting precise positioning information, even on portable hand-held devices, have spawned massive amounts of spatial-temporal databases. Spatial "data science" broadly refers to the use of technology, statistical methods, computational algorithms to extract knowledge and insights from spatially referenced data. Applications of spatial-temporal data science are pervasive in the natural and environmental sciences; economics; climate science; ecology; forestry; and public health. With the abundance of spatial BIG DATA problems in the sciences and engineering, GIS and spatial data science will likely occupy a central place in the data revolution engulfing us. This talk will discuss construction and implementation of scalable Gaussian processes and the importance of conjugate Bayesian models in carrying out Bayesian inference for spatially and temporally oriented massive data sets exhibiting complex dependencies in diverse applications. We will elucidate recent developments in Bayesian statistical science such as geosketching and predictive stacking that can harness high performance scientific computing methods for spatial-temporal BIG DATA analysis and emphasize how such methods can be implemented on modest computing architectures. The talk will include specific examples of Bayesian hierarchical modeling in Light Detection and Ranging (LiDAR) systems and other remote-sensed technologies; environmental sciences; and public health.
Spatial and functional data over non-Euclidean domains
April 29, 2022, 12:00
Recent years have seen an explosive growth in the recording of increasingly complex and high-dimensional data. Classical statistical methods are often unfit to handle such data, whose analysis calls for the definition of new methods merging ideas and approaches from statistics and applied mathematics. My talk will in particular focus on spatial and functional data defined over non-Euclidean domains, such as linear networks, two-dimensional manifolds and non-convex volumes. I will present an innovative class of methods, based on regularizing terms involving Partial Differential Equations (PDEs), defined over the complex domains being considered. These physics-informed regression methods enable the inclusion in the statistical model of the available problem specific information, suitably encoded in the regularizing PDE. The proposed methods make use of advanced numerical techniques, such as finite element analysis and isogeometric analysis. A challenging application to neuroimaging data will be illustrated.
Factor models with downside risk
April 22, 2022, 12:00
We propose a conditional model of asset returns in the presence of common factors and downside risk. Specifically, we generalize existing latent factor models in three ways: we show how to estimate the threshold which identifies the 'disappointment' event triggering the bad state of the world; we permit different factor structures for asset returns in good and bad states; we show how to recover the observable factors' risk premia from the estimated latent ones in different states. The usefulness of the model is illustrated through two applications to cross-sections of asset returns in equity markets and other major asset classes. Paper link https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3937321
Marine litter in the North-Western Ionian sea – Data features and space-time modeling
April 8, 2022, 12:00
Marine litter has recently become a recognized global ecological concern, and its distribution and impacts on deep-sea habitats are under continuous investigation. Here we focus on marine litter data collected as a by-product of trawl fishery surveys regularly conducted at a local scale in the Mediterranean. Litter data are multivariate, have space-time structure, and are semi-continuous, i.e. they combine information on occurrence and conditional-to-presence abundance. Data on potential environmental drivers obtained by remote sensing or GIS technologies are also available with different spatial support. The modeling strategy is based on a two-part model that enables handling the zero-inflation problem and the spatial correlation characterizing the data. In the spirit of multi-species distribution models, we propose to jointly infer different litter categories in a Hurdle-model framework. The effects of potential environmental drivers and shared spatial effects linking abundances and probabilities of occurrences of litter categories are implemented via the SPDE approach in the computationally efficient INLA context. Results support the possibility of better understanding the spatio-temporal dynamics of marine litter in the study area.
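Schematically, the two-part (hurdle) structure for a single litter category Y at site s and time t can be written (a simplified, single-category sketch of my own, with assumed link functions) as

\Pr(Y_{st} > 0) = \pi_{st}, \qquad Y_{st} \mid Y_{st} > 0 \sim g(\cdot; \mu_{st}),
\operatorname{logit}(\pi_{st}) = \mathbf{x}_{st}^\top \boldsymbol{\beta}_\pi + u(s,t), \qquad \log \mu_{st} = \mathbf{x}_{st}^\top \boldsymbol{\beta}_\mu + \gamma\, u(s,t),

where g is a positive-valued distribution for the conditional-to-presence abundance, \mathbf{x}_{st} collects the environmental drivers, and u(s,t) is the shared spatial(-temporal) effect, scaled by \gamma in the abundance part, that links occurrence and abundance and is represented through the SPDE approach within INLA.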
How much evidence do you need? Data Science and Bayesian Statistics to inform Environmental Policy during the COVID-19 Pandemic
April 4, 2022, 14:00
In this talk, I will provide an overview of data science methods, including methods for Bayesian analysis, causal inference, and machine learning, to inform environmental policy. This is based on my work analyzing a data platform of unprecedented size and representativeness. The platform includes more than 500 million observations on the health experience of over 95% of the US population older than 65 years old linked to air pollution exposure and several confounders. Finally, I provide an overview of studies on air pollution exposure, environmental racism, wildfires, and how they also can exacerbate the vulnerability to COVID-19.
Press Coverage:
• https://www.nytimes.com/2021/08/13/climate/wildfires-smoke-covid.html
• https://www.nytimes.com/2020/04/07/climate/air-pollution-coronavirus-covid.html
• https://www.nytimes.com/2020/12/07/climate/trump-epa-soot-covid.html?smid=tw-share
• https://science.sciencemag.org/content/360/6388/473
• https://www.npr.org/sections/health-shots/2017/06/28/534594373/u-s-air-pollution-stillkills-thousands-every-year-study-concludes
• https://www.statnews.com/2016/11/14/climate-change-agreements/
• https://news.harvard.edu/gazette/story/2016/08/smoke-waves-will-affect-millions-incoming-decades/
• https://sites.sph.harvard.edu/francesca-dominici/senator-cory-booker-talking-about-nejmstudy/
Measures of Interrater Agreement
March 25, 2022, 12:00
Agreement among ratings or measurements provided by several raters (humans or devices) is considered in education, biomedical sciences, and other disciplines. For instance, the agreement among educators who assess the language proficiency of a corpus of argumentative texts on a new rating scale is considered to test the reliability of the scale, or the agreement among clinical diagnoses provided by physicians is analysed to identify the best treatment for the patient. In all these applications, the main interest is to analyse interrater absolute agreement, that is, the extent to which raters assign the same (or very similar) values on the rating scale. Many indices of interrater agreement on a whole group of subjects (objects) have been proposed. Less frequently, agreement on single subjects has been considered, in spite of the fact that this is useful, for example, to ask the raters for a specific comparison on single cases in which agreement is poor. In the seminar, after a critical review of the most used indices of interrater agreement, new subject-specific and global measures of absolute agreement for ratings on different levels of scale are presented. Some applications will show the advantages of the proposed indices.
Unsupervised whole graph embedding methods and applications
March 18, 2022, 12:00
Networks represent a powerful model for problems in different scientific and technological fields, such as neuroscience, molecular biology, biomedicine, sociology, social network analysis, and political science. As the number of network applications increases, so does the need for novel data analysis techniques. In many applications, the analysis focuses on a single network, to cluster or classify its nodes or to predict pairs of nodes that will form a link. In this talk, we focus on problems where a network is a statistical unit, and the analysis regards whole networks rather than their parts. Methods for learning features on networks focus mainly on the neighborhood of nodes and edges. We review some of the existing methodologies and introduce Netpro2vec, an embedding framework based on representations of graphs through empirical probability distributions. The goal is to use basic node descriptions other than the degree, such as those induced by the Transition Matrix and the Node Distance Distribution, to describe the local and global characteristics of the networks. The framework is evaluated on synthetic and real biomedical network datasets and compared to well-known competitors. Finally, open problems and future research directions are highlighted.
Multimodal regression with circular data
March 4, 2022, 12:00
There is a diverse range of practical situations where one may encounter random variables which are not defined on Euclidean spaces, as is the case for circular data. Circular measurements may be accompanied by other observations, either defined on the unit circumference or on the real line, and in such cases it may be of interest to model the relationship between the variables from a regression perspective. It is not infrequent that parametric models fail to capture the underlying structure given their lack of flexibility, but it may also happen that the usual paradigm of (classical) mean regression is not informative enough, for instance when the conditional distribution is multimodal. We will present in this talk some recent advances in nonparametric multimodal regression, showing an adaptation of the mean-shift algorithm for regression scenarios involving a circular response and/or covariate. Real data illustrations will also be presented. This is a joint work with María Alonso-Pena.

2021


Challenges in emulating target trials
December 14, 2021
The framework of target trial emulation (TTE) is increasingly adopted when researchers wish to address causal questions using observational data. TTE has multiple advantages, starting from the clarity of explicitly specifying the hypothetical target experimental trial for the questions of interest. However, because the data often arise from linked administrative databases that are not created for research purposes, their handling demands extreme care if biased conclusions are to be avoided. Two main sources of bias have been broadly recognised in the epidemiological literature: immortal time bias and inappropriate selection of comparative groups. This talk will focus on other challenges to emulating target trials which are not commonly aired, using two examples.
References:
Hernán et al. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology, 2016; 79: 70-75.
Hernán and Robins. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology, 2016; 183: 758-764.
Suissa. Immortal time bias in observational studies of drug effects. Pharmacoepidemiol Drug Safety, 2007: 241-9.
Testing for the Rank of a Covariance Kernel
December 10, 2021
How can we discern whether the covariance of a stochastic process is of reduced rank, and if so, what its precise rank is? And how can we do so at a given level of confidence? This question is central to a great deal of methods for functional data, which require low-dimensional representations whether by functional PCA or other methods. The difficulty is that the determination is to be made on the basis of i.i.d. replications of the process observed discretely and with measurement error contamination. This adds a ridge to the empirical covariance, obfuscating the underlying dimension. We describe a matrix-completion inspired test statistic that circumvents this issue by measuring the best possible least square fit of the empirical covariance's off-diagonal elements, optimised over covariances of given finite rank. For a fixed grid of sufficiently large size, we determine the statistic's asymptotic null distribution as the number of replications grows. We then use it to construct a bootstrap implementation of a stepwise testing procedure controlling the family-wise error rate corresponding to the collection of hypotheses formalising the question at hand. Under minimal regularity assumptions we prove that the procedure is consistent and that its bootstrap implementation is valid. The procedure circumvents smoothing and associated smoothing parameters, is indifferent to measurement error heteroskedasticity, and does not assume a low-noise regime. Based on joint work with Anirvan Chakraborty.
Robust Statistics for (big) data analytics
December 3, 2021
Data rarely follow the simple models of mathematical statistics. Often, there will be distinct subsets of observations so that more than one model may be appropriate. Further, parameters may gradually change over time. In addition, there are often dispersed or grouped outliers which, in the context of international trade data, may correspond to fraudulent behavior. All these issues are present in the datasets that are analyzed on a daily basis by the Joint Research Centre of the European Commission and can only be tackled by using methods which are robust to deviations from model assumptions (see for example [6]). This distance between mathematical theory and data reality has led, over the last sixty years, to the development of a large body of work on robust statistics. In the seventies of the last century, it was expected that in the near future any author of an applied article who did not use the robust alternative would be asked by the referee for an explanation [9]. Now, a further forty years on, there does not seem to have been the foreseen breakthrough into the wider scientific universe. In this talk, we initially sketch what we see as some of the reasons for this failure, suggest a system of interrogating robust analyses, which we call monitoring [5], and describe a series of robust and efficient methods to detect model deviations, groups of homogeneous observations [10], multiple outliers and/or sudden level shifts in time series [8]. Particular attention will be given to robust and efficient methods (known as the forward search) which enable a flexible level of trimming and an understanding of the effect that each unit (outlier or not) exerts on the model (see for example [1], [2], [7]). Finally, we discuss the extension of the above methods to transformations and to the big data context. The Box-Cox power transformation family for non-negative responses in linear models has a long and interesting history in both statistical practice and theory. The Yeo-Johnson transformation extends the family to observations that can be positive or negative. In this talk, we describe an extended Yeo-Johnson transformation that allows positive and negative responses to have different power transformations ([4] or [3]). As an illustration of the suggested procedure, we analyse data on the performance of investment funds, 99 out of 309 of which report a loss. The problem is to use regression to predict medium term performance from two short term indicators. It is clear from scatterplots of the data that the negative responses have a lower variance than the positive ones and a different relationship with the explanatory variables. Tests and graphical methods from our robust analysis allow the detection of outliers, the testing of the values of transformation parameters and the building of a simple regression model. All the methods described in the talk are included in the FSDA Matlab toolbox, freely downloadable from the Mathworks file exchange or from GitHub at https://uniprjrc.github.io/FSDA/
References
[1] Atkinson, A. C. and Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York.
[2] Atkinson, A. C., Riani, M., and Cerioli, A. (2004). Exploring Multivariate Data with the Forward Search. Springer-Verlag, New York.
[3] Atkinson, A. C., Riani, M., and Corbellini, A. (2020). The analysis of transformations for profit-and-loss data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(2), 251–275.
[4] Atkinson, A. C., Riani, M., and Corbellini, A. (2021). The Box-Cox Transformation: Review and Extensions. Statistical Science, 36(2), 239–255.
[5] Cerioli, A., Riani, M., Atkinson, A. C., and Corbellini, A. (2018). The power of monitoring: How to make the most of a contaminated multivariate sample (with discussion). Statistical Methods and Applications, 27, 559–666. https://doi.org/10.1007/s10260-017-0409-8
[6] Perrotta, D., Torti, F., Cerasa, A., and Riani, M. (2020). The robust estimation of monthly prices of goods traded by the European Union. Technical Report EUR 30188 EN, JRC120407, European Commission, Joint Research Centre, Publications Office of the European Union, Luxembourg. ISBN 978-92-76-18351-8, doi:10.2760/635844.
[7] Riani, M., Atkinson, A. C., and Cerioli, A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society, Series B, 71, 447–466.
[8] Rousseeuw, P., Perrotta, D., Riani, M., and Hubert, M. (2019). Robust monitoring of time series with application to fraud detection. Econometrics and Statistics, 9, 108–121.
[9] Stigler, S. M. (2010). The changing history of robustness. The American Statistician, 64, 277–281.
[10] Torti, F., Perrotta, D., Riani, M., and Cerioli, A. (2018). Assessing trimming methodologies for clustering linear regression data. Advances in Data Analysis and Classification, 13, 227–257.
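For reference, the two transformation families discussed in this talk are, for a response y and power \lambda:

Box-Cox (y > 0):  z(\lambda) = (y^{\lambda} - 1)/\lambda for \lambda \neq 0, and z(0) = \log y.

Yeo-Johnson (y real):
z(\lambda) = \{(y + 1)^{\lambda} - 1\}/\lambda             for y \ge 0, \lambda \neq 0,
z(\lambda) = \log(y + 1)                                    for y \ge 0, \lambda = 0,
z(\lambda) = -\{(1 - y)^{2 - \lambda} - 1\}/(2 - \lambda)   for y < 0, \lambda \neq 2,
z(\lambda) = -\log(1 - y)                                   for y < 0, \lambda = 2.

The extended version described in the talk ([3], [4]) allows the positive and the negative responses to be fitted with different values of \lambda.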

2019


seminar
February 22, 2019 - Sala 34, 11:00

Giorgio Consigli (Università degli Studi di Bergamo) - Asset-liability management for occupational pension funds under market and longevity risk: a case study and alternative modelling approaches
March 22, 2019 - Aula V, 15:00
The modelling of institutional ALM problems has a long history in stochastic programming, starting in the late 1980s with the first industry developments such as the well-known Yasuda Kasai model (Ziemba, Turner, Carino et al., 1994), specifically for pension fund management (PF ALM). Due to economic and demographic pressures in most OECD countries and an increasing interest in PF ALM developments by the industry and by policy makers, we now witness a growing demand for R&D projects addressed to the scientific community. Taking the view of a PF manager, the presentation will develop around the definition of a generic pension fund (PF) asset-liability management (ALM) problem and analyse the key underlying methodological implications of: (i) its evolution from an early-stage multistage stochastic programming (MSP) formulation with recourse to the most recent MSP and distributionally robust optimization (DRO) formulations; (ii) a peculiar and rich risk spectrum including market risk as well as liability risk, such as longevity risk and demographic factors, leading to (iii) valuation or pricing approaches based on incomplete market assumptions and, due to recent international regulation, (iv) a risk-based capital allocation for long-term solvency. The above represent fundamental stochastic and mathematical problems of modern financial optimisation. Two possible approaches to DRO are considered, based on a stochastic control framework or on explicitly introducing an uncertainty set for probability measures and formulating the inner DRO problem as a probability distance minimization problem over a given space of measures. Keywords: asset-liability management, multistage stochastic programming, distributional uncertainty, distributionally robust optimization, solvency ratio, liability pricing, longevity risk, capital allocation.
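As a rough illustration of the distributionally robust formulation mentioned above (a generic sketch, not the specific PF ALM model of the talk), a DRO problem replaces a single estimated probability measure \hat{P} with an ambiguity set of measures:
\[
\min_{x \in X} \; \sup_{Q \in \mathcal{Q}} \; \mathbb{E}_{Q}\left[\ell(x, \xi)\right],
\qquad
\mathcal{Q} = \{\, Q : d(Q, \hat{P}) \le \varepsilon \,\},
\]
where \ell is the loss of decision x under scenario \xi, d is a probability distance (for example a Wasserstein distance or a \phi-divergence) and \varepsilon is the radius of the uncertainty set; the inner problem over \mathcal{Q} is the probability-distance problem over a space of measures referred to in the abstract.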
Gianluca Mastrantonio (Politecnico di Torino) - New formulation of the logistic-normal process to analyze trajectory tracking data
January 28, 2019 - Sala 34, 10:30
Improved communication systems, shrinking battery sizes and the price drop of tracking devices have led to an increasing availability of trajectory tracking data. These data are often analyzed to understand animal behavior using mixture-type models. In this work, we propose a new model based on the logistic-normal process. Due to a new formalization and to the way we specify the coregionalization matrix of the associated multivariate Gaussian process, we show that our model, differently from other proposals, is invariant with respect to the choice of the reference element and of the ordering of the components of the probability vectors. We estimate the model under a Bayesian framework, using an approximation of the Gaussian process needed to avoid impractical computation times. We perform a simulation study with the aim of showing the ability of the model to retrieve the parameters used to simulate the data. The model is then applied to real data where a wolf is observed before and after procreation. Results are easy to interpret, showing differences in the two phases. Joint work with: Enrico Bibbona (Politecnico di Torino), Clara Grazian (Università di Pescara), Sara Mancinelli (Università "Sapienza" di Roma).
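As a minimal illustration of the logistic-normal construction underlying the talk (a sketch of the standard transform only, not the authors' reformulation or the coregionalized process), the following Python snippet maps a Gaussian vector onto a probability vector using a fixed reference element, which is exactly the arbitrary choice whose influence the proposed model removes:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_normal_sample(mu, Sigma, rng):
    """Draw a probability vector of length len(mu) + 1 via the logistic transform.

    The appended component acts as the reference element (its log-ratio is fixed
    at 0): in the classical logistic-normal construction, changing which component
    plays this role changes the induced distribution on the simplex.
    """
    z = rng.multivariate_normal(mu, Sigma)   # latent Gaussian log-ratios
    ez = np.exp(np.append(z, 0.0))           # append the reference element
    return ez / ez.sum()                     # normalize onto the simplex

p = logistic_normal_sample(mu=np.zeros(2), Sigma=np.eye(2), rng=rng)
print(p, p.sum())  # three behaviour probabilities summing to 1
```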
Overdispersed-Poisson Model in Claims Reserving: Closed Tool for One-Year Volatility in GLM Framework
March 29, 2019 - Aula V, 14:15

seminar
April 16-17, 2019 - Sala 34, 10:00-14:00

seminar
June 5, 2019 - Aula Master (Viale Regina Elena 295), 10:00

Daniel K. Sewell (University of Iowa) - An introduction to the statistical analysis of network data
September 9-10, 2019 - Aula VII (ex Castellano), 10:00-16:00 (with a break)

Multi-study factor analysis for biological data
November 14, 2019 - Aula XIV (palazzina Tumminelli), 12:00
We introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies. The goal is to separately identify and estimate 1) common factors shared across multiple studies, and 2) study-specific factors. We develop a fast Expectation Conditional-Maximization algorithm for parameter estimation and we provide a procedure for choosing the common and specific factors. We present simulations evaluating the performance of the method and we illustrate it by applying it to gene expression data in ovarian cancer and to nutrient-based dietary patterns and the risk of head and neck cancer. In both cases, we clarify the benefits of a joint analysis compared to the standard factor analysis. Moreover, we generalize the model in a Bayesian framework. We implement it using sparse modeling of the high-dimensional factor loadings matrices, both common and specific, using the infinite gamma shrinkage prior. We propose a computationally efficient algorithm, based on a traditional Gibbs sampler, to produce the Bayes estimates of the parameters and to select the number of relevant common factors. We assess the operating characteristics of our method by means of simulation studies, and we present an application to the prediction of the biological signal from four gene expression studies on ovarian cancer.
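To fix ideas, a common formulation of the multi-study factor model (a sketch of the standard setup, with notation chosen here for illustration) writes, for subject i in study s,
\[
x_{si} = \Phi f_{si} + \Lambda_{s} l_{si} + e_{si},
\qquad
f_{si} \sim N(0, I_{K}),\quad
l_{si} \sim N(0, I_{J_{s}}),\quad
e_{si} \sim N(0, \Psi_{s}),
\]
where \Phi collects the loadings of the K common factors shared by all studies, \Lambda_{s} collects the loadings of the J_{s} study-specific factors, and \Psi_{s} is a diagonal residual covariance matrix.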
Garyfallos Konstantinoudis - Discrete versus continuous domain models for disease mapping and applications on childhood cancers
November 22, 2019 - Sala 34, 12:00
The main goals of disease mapping are to calculate disease risk and identify high-risk areas. Such analyses are hampered by the limited geographical resolution of the available data. Typically data are counts of cases per spatial unit and the most common approach is the Besag-York-Mollié model (BYM). Less frequently, exact geocodes are available, allowing a disease to be modelled as a point process through log-Gaussian Cox processes (LGCPs). The objective of this study is to examine in a simulation the performance of BYM and LGCPs for disease mapping. We simulated data in the Canton of Zurich in Switzerland, sampling cases from the true population and mimicking the childhood leukaemia incidence (n=334 during 1985-2015). We considered 39 different scenarios varying in the risk-generating function (step-wise, smooth, flat risk), the size of the high-risk areas (1, 5 and 10 km radii), the risk increase within the high-risk areas (2- and 5-fold) and the number of cases (n, 5n and 10n). We used the root mean integrated squared error (RMISE) to examine the ability of the models to recover the true risk surface and their sensitivity/specificity in identifying high-risk areas. We found that, for larger radii, LGCPs recover the true risk surface with lower error across almost all scenarios (median RMISE: 9.17-27.0) compared to the BYM (median RMISE: 9.12-35.6). For radii of 1 km and flat risk surfaces, BYM performs better. In terms of sensitivity and specificity, across almost all scenarios the median area under the curve (AUC) for LGCPs was higher (median AUC: 0.81-1) compared to the BYM (median AUC: 0.65-0.93). We applied these methods to childhood leukaemia incidence in the canton of Zurich during 1985-2015 and identified two spatially coherent high-risk areas. Our findings suggest that there are important gains to be made from the use of LGCP models in spatial epidemiology.
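For orientation, the two model classes compared in the talk can be sketched in their standard forms (notation introduced here). The BYM model works with counts y_i and expected counts E_i in areal units:
\[
y_{i} \mid \rho_{i} \sim \mathrm{Poisson}(E_{i}\,\rho_{i}),
\qquad
\log \rho_{i} = \beta_{0} + u_{i} + v_{i},
\]
with u a spatially structured (intrinsic CAR) random effect and v_i an unstructured Gaussian effect. An LGCP instead treats the exact case locations as a Poisson point process with random intensity
\[
\Lambda(s) = \lambda_{0}(s)\, \exp\{ \beta_{0} + Z(s) \},
\]
where \lambda_{0}(s) reflects the population at risk and Z(s) is a Gaussian random field.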
Roberta Varriale (ISTAT) - Machine learning methods for estimating the employment status in Italy
November 29, 2019 - Sala 34, 14:00
In recent decades, National Statistical Institutes have focused on producing official statistics by exploiting multiple sources of information (multi-source statistics) rather than a single source, usually a statistical survey. The growing importance of producing multi-source statistics in official statistics has led to increasing investments in research activities in this sector. In this context, one of the research projects addressed by the Italian National Statistical Institute (Istat) concerned the study of methodologies for producing estimates of employment rates in Italy through the use of multiple sources of information: survey data and administrative sources. The data come from the Labour Force (LF) survey conducted by Istat and from several administrative sources that Istat regularly acquires from external bodies. The "quantity" of information is very different: the administrative sources cover about 25 million individuals, while the LF survey refers to an extremely limited number (about 330,000) of individuals. The two measures do not agree on employment status for about 6% of the units from the LF survey. One proposed approach uses a hidden Markov (HM) model to take into account the deficiencies in the measurement process of both survey and administrative sources. The model describes a measurement process as a function of a time-varying latent state (in this case the employment category), whose dynamics is described by a Markov chain defined over a discrete set of states. At present, the implementation phase for the production process of statistics on employment through the use of HM models is coming to an end at Istat. The present work describes the use of machine learning methods to predict the individual employment status. This approach is based on the application of decision tree and random forest models, which are predictive models usually used to classify instances of large amounts of data. The talk will describe the results obtained, together with their usefulness in this application context. The models have been applied using the R software.
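As an illustrative sketch of the machine-learning step described above (written in Python rather than the R used in the project, with a hypothetical linked file and hypothetical variable names), a random forest can be trained to predict the LFS employment status from administrative covariates:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical linked file: one row per individual, administrative covariates
# plus the employment status observed in the Labour Force Survey.
df = pd.read_csv("linked_lfs_admin.csv")
X = pd.get_dummies(df[["age", "sex", "region", "sector", "months_employed_admin"]])
y = df["employed_lfs"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=500, random_state=1)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```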

2018


Yves Tillé (Université de Neuchatel) - How to select a sample?
November 27, 2018 - Sala 34, 14:30
The principles of sampling can be synthesized as randomization, restriction and over-representation. Defining a sampling design (stratification, equal or unequal selection probabilities, etc.) means using prior information, and it is equivalent to assuming a model on the population. Several well-known sampling designs are optimal with respect to models that maximize the entropy. In the Cube method, the prior information is used to derive a sample that matches the totals or means of auxiliary variables; in this respect, the sample is called balanced. Furthermore, if distances between statistical units (based on geographical coordinates or defined via auxiliary variables) are available, it can be interesting to spread the sample in space in order to make the design more efficient. In this perspective, new spatial sampling methods, such as GRTS, the local pivotal method and the local cube method, will be covered.
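In the abstract's terminology, a sample S drawn with inclusion probabilities \pi_k is balanced on auxiliary variables x_k when the Horvitz-Thompson estimators of their totals reproduce the known population totals over the population U,
\[
\sum_{k \in S} \frac{x_{k}}{\pi_{k}} = \sum_{k \in U} x_{k},
\]
which is the constraint the Cube method seeks to satisfy, exactly or approximately, while respecting the prescribed inclusion probabilities.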
