Seminars


2024


How to address bias and confounding when biological sex is the exposure
25 Ottobre 2024, ore 12
Background: Global child mortality rates vary by sex, with higher mortality rates generally found in males. However, previous research has shown that this ratio is reversed in infants admitted to Paediatric Intensive Care Units (PICU). I aimed to determine whether female sex is causally linked to higher mortality in PICU. Methods: I created a longitudinal linked dataset that could be used to evaluate whether sex is causally related to mortality in PICU using routine data on >100,000 children admitted to PICU. I compared a number of estimation methods, namely (i) g-computation, (ii) propensity score-based singly and doubly robust methods, and (iii) targeted learning aided by machine learning, to determine whether the observed sex-ratio reversal in PICU mortality is supported by the data. Results: Female biological sex increased the mortality rate in PICU by up to 0.26% (95% CI -0.05%, 0.57%). In the multiply imputed dataset this estimate was 0.35% (95% CI 0.09%, 0.61%). The reversal in mortality rates in PICU was not explained by collider bias; rather, collider bias was driving the naïve estimate towards the null. Conclusion: Female biological sex is linked to higher mortality in PICU. Mechanistic reasons underlying this causal relationship are still unknown.
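As a minimal sketch of the first estimation method listed above, g-computation for a binary exposure and outcome can be written in a few lines of Python; the data frame and column names below (female, age, severity, died) are hypothetical placeholders rather than the study's variables, and in practice confidence intervals would be obtained by bootstrap or influence-function methods.

    # Minimal g-computation sketch for a binary exposure (sex) and a binary
    # outcome (PICU mortality); illustrative only.
    import numpy as np
    import statsmodels.api as sm

    def g_computation(df):
        # 1. Outcome regression with exposure and confounders.
        X = sm.add_constant(df[["female", "age", "severity"]])
        fit = sm.GLM(df["died"], X, family=sm.families.Binomial()).fit()
        # 2. Predict each child's risk under both exposure levels.
        X1 = X.copy()
        X1["female"] = 1
        X0 = X.copy()
        X0["female"] = 0
        # 3. Standardise: the mean difference estimates the causal risk difference.
        return np.mean(fit.predict(X1)) - np.mean(fit.predict(X0))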
Online Multivariate Changepoint Detection: Leveraging Links With Computational Geometry
22 Ottobre 2024, ore 12
The increasing volume of data streams poses significant computational challenges for detecting changepoints online. Likelihood-based methods are effective, but a naive sequential implementation becomes impractical online due to high computational costs. We develop an online algorithm that exactly calculates the likelihood ratio test for a single changepoint in p-dimensional data streams by leveraging fascinating connections with computational geometry. This connection straightforwardly allows us to recover sparse likelihood ratio statistics exactly: that is, assuming only a subset of the dimensions are changing. Our algorithm is straightforward, fast, and apparently quasi-linear. A dyadic variant of our algorithm is provably quasi-linear, being O_p(n log(n)^{p+1}) for n data points and p less than 3, but slower in practice. These algorithms are computationally impractical when p is larger than 5, and we provide an approximate algorithm suitable for such p which is O_p(n p̃ log(n)^{p̃+1}), for some user-specified p̃ ≤ 5. We derive some statistical guarantees for the proposed procedures in the Gaussian case, and confirm the good computational and statistical performance, and usefulness, of the algorithms on both empirical data and NBA data.
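To make concrete what is being computed, a brute-force version of the Gaussian likelihood-ratio statistic for a single change in mean (maximised over candidate change locations) is sketched below; this is the naive reference computation that the talk's geometric algorithm avoids, not the proposed method itself.

    import numpy as np

    def single_change_lr(x):
        """x: (n, p) array; returns the max over tau of the LR statistic for a mean change."""
        n, _ = x.shape
        csum = np.cumsum(x, axis=0)
        total = csum[-1]
        stats = []
        for tau in range(1, n):  # candidate change immediately after time tau
            mean_left = csum[tau - 1] / tau
            mean_right = (total - csum[tau - 1]) / (n - tau)
            # 2 x log-likelihood ratio for unit-variance Gaussian data
            stats.append(tau * (n - tau) / n * np.sum((mean_left - mean_right) ** 2))
        return max(stats)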
Depths and Local Depths. A journey in statistical depths
4 Ottobre, 2024, ore 12:00
Statistical data depth is a growing area in non-parametric statistics, originally developed for the analysis of multidimensional data but useful in other frameworks such as spherical and functional data. The main applications are a center-outward ordering of the observations, location and scale estimation, classification, clustering and some graphical tools. Statistical local depth functions are a generalization of statistical depth functions and they are used for describing local geometric features and mode(s) in multivariate distributions. In this seminar, after an introduction on statistical data depths, we illustrate some analytical and statistical properties of local depths. We show how these functions are a bridge between density functions and depths, we illustrate their theoretical properties and we discuss some applications.
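As a minimal illustration of the centre-outward ordering mentioned above, the Mahalanobis depth (one classical depth; the talk covers many others, together with their local versions) can be computed as follows; the simulated sample is purely illustrative.

    import numpy as np

    def mahalanobis_depth(pt, data):
        """Depth of point pt with respect to an (n, p) sample."""
        mu = data.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
        d2 = (pt - mu) @ cov_inv @ (pt - mu)
        return 1.0 / (1.0 + d2)  # points near the centre get depth close to 1

    rng = np.random.default_rng(0)
    sample = rng.normal(size=(200, 2))
    depths = np.array([mahalanobis_depth(pt, sample) for pt in sample])
    order = np.argsort(-depths)  # centre-outward ordering of the observations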
A time-heterogeneous rectangular latent Markov model with covariates to measure dynamics and correlates of students' learning abilities data
3 Giugno 2024, ore 12
Accurate and up-to-date assessments of students' abilities are essential for personalised learning. These assessments allow instructors to adjust class content to match different skill levels, and help students gain awareness of their learning paths. While technology-based learning environments have eased data collection, arguably they have brought important challenges for analysis, mainly relating to data complexity. This work introduces a novel fully time-heterogeneous rectangular latent Markov specification, tailored to complex longitudinal data of this kind. The proposed toolkit incorporates measurement model heterogeneity, allowing for possibly as many different measurement models as the number of distinct measurement occasions. The structural model, in which we include predictors of initial and transition probabilities, is consequently specified, and informative dropout is modelled explicitly and jointly with its potential correlates. However, the resulting model is overly complex to estimate with standard simultaneous procedures. We address the estimation problem by designing a bias-adjusted three-step estimator, which separates the estimation of the measurement models from the structural model fit. Our primary empirical aim is to analyse the abilities and progression in learning statistical topics over time of a concrete cohort of students, while accounting for their individual characteristics. Results from an extensive simulation study substantiate our empirical findings.
Soil consumption and organized crime
24 Maggio 2024, ore 12
Soil is a non-renewable natural resource providing ecosystem services essential for life. Soil consumption is the increase in artificial land cover through anthropogenic activities. While standard economic variables (population and GDP growth) appear to have limited predictive power for soil consumption, the recent literature on the mutually beneficial relationship between criminal organizations (“mafias”) and local politicians/administrators suggests a role for the presence and strength of mafias at the local level. We contribute to the literature by providing direct evidence of the link between soil consumption and mafia strength in the Southern Italian region of Apulia using a rich dataset at the fine municipality level that we created by merging information from a variety of sources. We show that alternative measures of the local strength of organized crime help substantially improve our predictions of soil consumption, both total soil consumption and soil consumption in protected areas. Under a plausible instrumental variable assumption, we also provide a quantitative assessment of the causal effect of the local strength of organized crime on soil consumption.
On the complexity for functional data
17 Maggio 2024, ore 12
The complexity of a stochastic process may be associated with the notion of degrees of freedom and, in some situations, may coincide with the concept of dimensionality. In this talk, complexity is studied through the use of the small-ball probability of the process. Specifically, we assume that this probability factorizes in two terms: one dependent only on the center of the ball, the other dependent only on the radius. The second term, which includes the information about the complexity, will be studied statistically: estimation, asymptotic behavior, and practical performance in a simulated and real environment.
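For reference, a standard form of the factorization assumed in this literature (written here generically, not quoted from the talk) is

    P(\|X - x\| \le h) \approx \psi(x)\,\varphi(h), \qquad h \to 0,

where ψ depends only on the centre x and φ only on the radius h; when φ(h) behaves like C h^τ, the exponent τ plays the role of the (possibly fractal) dimension, i.e. the notion of complexity studied statistically in the talk.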
Introduction to Spatial Data Analysis
15 Maggio 2024, ore 10-14
PhD Seminars - School of Statistical Sciences - Demography Curriculum
Issues and challenges in making inference from non-probability samples
3 Maggio 2024, ore 12
In the last decade the relevance of non-probability sampling in surveys has increased considerably because of the availability of alternative data sources such as Big Data and web surveys. The major concern about non-probability samples is that the unknown selection process is frequently selective, so that they often fail to represent the target population properly and hence result in highly biased estimators. In this work two approaches for dealing with selection bias when the selection process is nonignorable are discussed. The first one, based on the empirical likelihood, does not require parametric specification of the population model, but the probability of being in the non-probability sample needs to be modelled. Auxiliary information known for the population or estimable from a probability sample can be incorporated as calibration constraints, thus enhancing the precision of the estimators. The second one introduces the concept of uncertainty about the data-generating model resulting from the lack of knowledge of the sampling design acting in the non-probability sample. First, when extra-sample information is available, the class of plausible distributions for the variable of interest is defined. Next, a plausible estimate of such a distribution is constructed and its accuracy is evaluated.
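As a simple baseline that makes the role of the selection probability concrete, one can fit a model for participation in the non-probability sample against a reference probability sample and weight by the estimated propensities; the sketch below (with hypothetical column names, ignoring the reference sample's design weights) shows only this quasi-randomisation baseline, not the empirical-likelihood or uncertainty-class approaches discussed in the talk.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def pseudo_weighted_mean(np_sample, ref_sample, covariates, y_col):
        """Estimate a population mean of y_col, observed only in the
        non-probability sample, by modelling selection into that sample."""
        combined = pd.concat([np_sample[covariates], ref_sample[covariates]])
        in_np = np.r_[np.ones(len(np_sample)), np.zeros(len(ref_sample))]
        model = LogisticRegression(max_iter=1000).fit(combined, in_np)
        p = model.predict_proba(np_sample[covariates])[:, 1]
        w = (1 - p) / p  # odds-based pseudo-weights
        return np.average(np_sample[y_col], weights=w)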
External information borrowing in clinical trial hypothesis testing with controlled type I error rate inflation
5 Aprile 2024, ore 12
When designing a novel clinical trial, external information about the control and/or treatment arm effect is typically available. Borrowing of such external information is often desired in order to improve the trial’s efficiency, and can be of crucial importance in situations where the sample size that can realistically be recruited is limited, as in, e.g., pediatric or rare disease trials. The Bayesian approach allows borrowing of such external information through the adoption of informative prior distributions. An issue associated with the incorporation of external information is that external and current information may systematically differ. However, such inconsistency may not be predictable or quantifiable a priori. Robust prior choices are typically proposed to avoid extreme worsening of operating characteristics in such situations. In this talk, we will focus on the frequentist type I error rate and power. We will in particular consider how the type I error rate is affected by the incorporation of external information, and present a novel approach which allows a principled and controlled inflation. Both one- and two-arm clinical trial designs will be considered.
Does measurement error distort country differences in temporary employment? A study on Italy and the Netherlands using a multi-group hidden Markov model
22 marzo 2024, ore 12.00
This paper investigates the effect of measurement error on two key labour market indicators: the distribution of temporary employment as well as the transition rate in and out of temporary employment over time in Italy and the Netherlands. In this way, we study whether the cross-country differences in these indicators (coming from the different institutions in the two labour markets) persist when we correct for measurement error in socioeconomic data. The comparative analysis of the Italian and Dutch labour markets is carried out for the time period 2017-2019 using linked employment data from the Labour Force Survey and the Employment Register of the two countries. For this purpose, we use a multiple-group hidden Markov model with two indicators for the employment contract type that accounts for both random and systematic measurement error. The results indicate that measurement error severely biases our view on mobility from temporary to permanent employment in the two countries but also distorts the picture of cross-country differences in the phenomenon of interest.

2023


Ten years of mobile phone big data statistical analyses
1 Dicembre 2023, ore 12
In the era of big data, monitoring and forecasting crowding and mobility of people is a relevant aspect for urban policies, and smart cities use signals from mobile phone networks to support the optimization of urban systems and flows. Mobile phone data can be used for various purposes, as they come in different types: in this talk, applications are presented for the monitoring of social and cultural events, mobility flows and flooding risk analysis. Special attention is devoted to the statistical methods useful for these analyses, based on spatio-temporal data: briefly discussed are results obtained using the Histogram of Oriented Gradients approach for image reduction, the functional data clustering of time series and the VARX model with Harmonic Dynamic Regression.
From Rome to London: Breaking the Silos of Disciplines. A Conversation around Cardiometabolic Risk Factors Modelling and the Use of Wastewater in Public Health
27 Novembre 2023, ore 10-12

Circular local likelihood regression
24 Novembre 2023, ore 12
In this talk, we will present a general framework for estimating regression models with circular covariates and a general response. We will start with an overview of nonparametric regression models with a circular covariate, reviewing the main ideas and motivating the need for a more general method. Our goal is to estimate (nonparametrically) a conditional characteristic by maximizing the circular local likelihood. The proposed estimator is shown to be asymptotically normal. The problem of selecting the smoothing parameter is also addressed, as well as bias and variance computation. The finite sample performance of the estimation method in practice is studied through an extensive simulation study, where we cover the cases of Gaussian, Bernoulli, Poisson and Gamma distributed responses. The generality of our approach is illustrated with several real-data examples from different fields. In particular, we will focus on an example of neural response in macaques. This is joint work with M. Alonso-Pena and I. Gijbels and corresponds to two published papers in Biometrics (2023) and Journal of the American Statistical Association (2023).
Surveying sensitive topics with indirect questioning techniques: methods and real applications
17 Novembre 2023, ore 12
In many fields of applied research, mostly in sociological, economic, demographic, ecological and medical studies, the investigator very often has to gather information concerning highly personal, sensitive, stigmatizing, and perhaps incriminating issues such as drug addiction, domestic violence, racial prejudice, illegal income, and noncompliance with laws and regulations. Doing research on sensitive themes by traditional direct questioning survey modes is not an easy matter since it is likely to meet with two sources of error: nonresponse and untruthful answers. These errors can seriously flaw the quality of the data and, thus, jeopardize the usefulness of the collected information for subsequent analyses, including inference on unknown characteristics of the population under study. Although the errors cannot be totally avoided, they may be mitigated by increasing respondent cooperation through a nonstandard data-collection approach based on indirect questioning techniques (IQTs). The talk aims to introduce some issues related to privacy protection when sensitive topics are surveyed, give a general idea of the approach, and illustrate how some IQTs have been used to collect data and obtain prevalence estimates in a number of real studies about illegal immigrants, abortion, drug use, cannabis legalization, sexual behaviours and Covid-19 health behaviours.
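As a concrete classical example of an IQT (not necessarily one of the techniques used in the studies above), Warner's randomised response design has each respondent answer, with probability p set by a randomising device, whether they belong to the sensitive group and, with probability 1-p, whether they belong to its complement. Writing λ for the probability of a "yes" answer and π for the sensitive prevalence,

    \lambda = p\,\pi + (1 - p)(1 - \pi), \qquad \hat{\pi} = \frac{\hat{\lambda} - (1 - p)}{2p - 1}, \quad p \neq 1/2,

so π can be estimated even though the interviewer never learns which question was actually answered.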
Specification tools for time fixed effects in country panels
13 Novembre, 2023, ore 15:00
This paper proposes specification tools for time fixed-effects in country panels that complement summary and graphical representations of the data. They cover the standard two-way fixed-effect model, as well as other more general d-way fixed effect specifications with d ≥ 3. The tools are based on the observable characteristics of univariate time series of contrasts implied by a given specification; they use flagging rules based on graphical or statistical analysis. Evidence on which contrasts do not contain time fixed-effects can be harvested by algorithms; this paper discusses two examples of such algorithms. Implications for the specification of Differences in Differences estimation are discussed, and results are illustrated using a country panel of prices of mobile telecommunication services.
A parsimonious family of mixtures of multivariate Poisson log-normal factor analyzers for clustering count data
3 Novembre, 2023, ore 15:00
Multivariate count data are commonly encountered in bioinformatics. Although the Poisson distribution seems a natural fit for these count data, its multivariate extension is computationally expensive. Recently, mixtures of multivariate Poisson log-normal (MPLN) models have been used to efficiently analyze these multivariate count measurements. In the MPLN model, the counts, conditional on the latent variable, are modelled using a Poisson distribution, and the latent variable comes from a multivariate Gaussian distribution. Due to this hierarchical structure, the MPLN model can account for over-dispersion, as opposed to the traditional Poisson distribution, and allows for correlation between the variables. The mixture of multivariate Poisson log-normal distributions for high dimensional data is extended by incorporating a factor analyzer structure in the latent space. A family of parsimonious mixtures of multivariate Poisson log-normal distributions is proposed by decomposing the covariance matrix and imposing constraints on these decompositions. The performance of the model is demonstrated using simulated and real datasets.
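Schematically (generic notation, not quoted from the paper), the hierarchy for a count vector Y_i in mixture component g can be written as

    Y_{ij} \mid \theta_{ij} \sim \mathrm{Poisson}\!\left(e^{\theta_{ij}}\right), \qquad \boldsymbol{\theta}_i \mid g \sim N_p\!\left(\boldsymbol{\mu}_g,\; \Lambda_g \Lambda_g^{\top} + \Psi_g\right),

where the factor-analyzer decomposition Λ_g Λ_g' + Ψ_g of the latent covariance, with constrained variants of Λ_g and Ψ_g, generates the parsimonious family.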
Testing for treatment effect in multitreatment case
27 Ottobre, ore 12
In the present seminar, the problem of testing for the presence/absence of a treatment effect is discussed. A new test-statistic, essentially based on the same principles as the classical Kruskal-Wallis test, is introduced, and its theoretical properties are studied. The good behaviour of the proposed test in terms of both significance level and power, with respect to other commonly used test procedures, is shown through a simulation study. Test-statistics for stochastic dominance problems are also studied.
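For readers who want the classical benchmark in hand, the standard Kruskal-Wallis comparison of k treatment groups is a one-liner in Python; the data below are simulated for illustration only.

    import numpy as np
    from scipy.stats import kruskal

    rng = np.random.default_rng(1)
    treat_a = rng.normal(0.0, 1.0, 30)
    treat_b = rng.normal(0.3, 1.0, 30)
    treat_c = rng.normal(0.6, 1.0, 30)
    stat, p_value = kruskal(treat_a, treat_b, treat_c)  # rank-based test across the k groups
    print(f"H = {stat:.2f}, p = {p_value:.3f}")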
Hidden Markov Models for error corrected statistics
20 Ottobre 2023, ore 12:00
Policy making is based on official statistics that may still include measurement error (ME). This ME is typically the result of administrative delays in registration, differences in conceptual definitions, or processing errors. ME can lead to a distorted view of the number of people in groups of interest, threatening the integrity of official statistics and therefore also the effectiveness of policies based on them. Hidden Markov models (HMMs) are statistical models that help approximate error-corrected categorical variables from observed data that are measured with error. First, HMMs use two observed measures of the same statistic from different sources to approximate the error-corrected number of individuals at every moment in time. The two observed measures help to triangulate the “true” information and reduce the effect of the ME that exists in each of them. Second, HMMs estimate the transition rate from one state (e.g., receiving social assistance benefits) to other states (e.g., employment) and vice versa using the error-corrected measure that was approximated before. To do so, HMMs require several (at least three) observations at different time points per individual, which are available in our case in both registers. In this way, HMMs comprise a measurement model (error correction) and a structural model (representing the relations and changes over time).
The resilience of complex networks: methods and applications
06 ottobre 2023
The analysis of the resilience of a network is of key relevance in many contexts of applied science, for its natural connections with the assessment of the stability of an overall system. In this talk I will present some methodological criteria for building suitable resilience measures, along with some applicative instances. I will also provide some remarks on avenues of future research, by including also a discussion on the possible connections between complex networks and reliability theory.
Time series segmentation by non-homogeneous hidden semi-Markov models
26 maggio 2023
Motivated by classification issues in environmental studies, a class of hidden semi-Markov models is introduced to segment multivariate time series according to a finite number of latent regimes. The observed data are modelled by a mixture of multivariate densities, whose parameters evolve according to a latent multinomial process. The multinomial process is modelled as a semi-Markov chain where the time spent in a state and the chances of a regime-switching event are separately modelled by a battery of regression models that depend on time-varying covariates. Maximum likelihood parameter estimation is carried out by integrating an EM algorithm with a suitable data augmentation. While the proposal extends previous approaches that rely on mixture models and hidden Markov models, it keeps a parsimonious structure that facilitates results interpretation. It is illustrated on a case study of a bivariate time series of wind and wave directions, observed by a buoy in the Adriatic sea.

2022


Wrapping onto a torus: handling multivariate circular data in the presence of outliers
16 Dicembre 2022
Multivariate circular data arise commonly in many different fields, including the analysis of wind directions, protein bioinformatics, animal movements, handwriting recognition, people orientation, cognitive and experimental psychology, human motor resonance, neuronal activity, robotics, astronomy, biology, physics, earth science and meteorology. Observations can be thought of as points on a p-dimensional torus, whose surface is obtained by revolving the unit circle in a p-dimensional manifold. The peculiarity of multivariate torus data is periodicity, which is reflected in the boundedness of the sample space and often of the parametric space. The problem of modeling circular data has been tackled through suitable distributions, among which two of the most popular are the von Mises and the Wrapped Normal. Here, we focus on the family of unimodal and elliptically symmetric wrapped distributions, with emphasis on the Wrapped Normal. Despite the boundedness of the support of circular variates, torus data are not immune to the occurrence of outliers, that is, unexpected values, such as angles or directions, that do not share the main pattern of the bulk of the data. Then, a robust procedure to fit a wrapped distribution is presented. The proposed algorithm is characterized by the computation of data-dependent weights aimed at down-weighting anomalous values. We discuss and compare different approaches to obtain weights, with particular attention to the weighted likelihood methodology. A formal outlier detection rule is also suggested, based on classical robust distances evaluated over unwrapped data.
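For reference, wrapping a normal distribution onto the circle gives, componentwise (the same construction extends to the p-dimensional torus), the density

    f_{WN}(\theta; \mu, \sigma^2) = \sum_{k \in \mathbb{Z}} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{ -\frac{(\theta - \mu + 2\pi k)^2}{2\sigma^2} \right\}, \qquad \theta \in [0, 2\pi),

which is the model whose robust, weight-based fitting is discussed in the talk.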
Pseudo-populations resampling for finite populations under complex designs
2 Dicembre 2022
"Pseudo-populations resampling for finite populations under complex designs" Pier Luigi Conti Dipartimento di Scienze Statistiche Sapienza Università di Roma Abstract: The present talk is devoted to resampling for finite populations when the sampling design is not simple. As a consequence of the complex sampling design, there is dependence among sampled units. Hence, classical Efron bootstrap does not work in the case under examination. Resampling schemes based on pseudo-populations will be developed, and their main justifications and properties will be shown. The approach used is of asymptotic nature, and parallels results obtained by Bickel and Friedman for the i.i.d. case. Main applications of theoretical results are devoted to the construction of confidence intervals for finite population parameters. Finally, computational issues will be discussed. In allegato la locandina con l'abstract e i riferimenti per partecipare al seminario in presenza e a distanza.
Addressing dataset shift in supervised classification via data perturbation
25 Novembre 2022
In supervised classification, dataset shift occurs when, for the units in the test set, a change in the distribution of a single feature, a combination of features, or the class boundaries is observed with respect to the training set. As a result, in real data applications, the common assumption that the training and testing data follow the same distribution is often violated. Dataset shift might be due to several reasons; the focus here is on what is called “covariate shift”, namely the conditional probability p(y|x) remains unchanged, but the input distribution p(x) differs from training to test set. Random perturbation of variables or units when building the classifier can help in addressing this issue. Evidence of the performance of the proposed approach is obtained on simulated and real data.
Stein’s Method Meets Statistics: A Review of Some Recent Developments
18 Novembre 2022
Stein’s method compares probability distributions through the study of a class of linear operators called Stein operators. While initially studied in the field of probability, Stein’s method has led to significant advances in theoretical statistics, computational statistics and machine learning in recent years. In this talk, I will present some of these recent developments and, in doing so, try to stimulate further research into the successful field of Stein’s method and statistics. The topics I shall discuss include (time permitting) new insights into the finite-sample approximation of estimators (like maximum likelihood estimators), a measure of the impact of the prior choice in Bayesian statistics, tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, parameter estimation and goodness-of-fit testing. This talk is based on a large collaborative effort with many co-authors.
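As a minimal reminder of what a Stein operator is (the standard normal case only; the talk concerns far more general constructions), one may take

    (\mathcal{A}f)(x) = f'(x) - x f(x), \qquad \mathbb{E}\,[(\mathcal{A}f)(X)] = 0 \ \text{for all suitable } f \iff X \sim N(0, 1),

and discrepancies built from how far these expectations are from zero underlie many of the statistical applications listed above.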
On mixtures of linear quantile regressions for longitudinal and clustered data
11 Novembre 2022
Quantile regression represents a well-established technique for modelling data when the interest is in the effect of predictors on the conditional response quantiles. When responses are repeatedly collected over time, or when they are hierarchically nested, dependence needs to be properly considered. A standard way of proceeding is based on including higher-level unit-specific random coefficients in the model. The distribution of such coefficients may be either specified parametrically or left unspecified. In the latter case, it can be estimated nonparametrically by using a discrete distribution defined on G locations. This may approximate the distribution of time-constant and/or time-varying random coefficients, leading to a static, dynamic, or mixed-type mixture of linear quantile regression equations. An EM algorithm and a block-bootstrap procedure are employed to derive parameter estimates and corresponding standard errors. Standard penalized likelihood criteria are used to identify the optimal number of mixture components. This class of models is described by using a benchmark dataset and employing the functions in the newly developed lqmix R package.
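For reference, quantile regression at level τ replaces squared error with the check (pinball) loss

    \rho_\tau(u) = u\,\bigl(\tau - \mathbb{1}\{u < 0\}\bigr),

and the models described above are mixtures of linear regressions fitted under this loss, with the discrete random coefficients on G locations accounting for the longitudinal or clustered dependence.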
Causal effects of chemotherapy regimen intensity on survival outcome through Marginal Structural Models
4 Novembre 2022
As patients develop toxic side-effects, cancer treatment is adapted over time by either delaying or reducing the dosage of the next chemotherapy course. In this talk, Marginal Structural Models in combination with Inverse-Probability-of-Treatment Weighted estimators to assess the causal effects of chemotherapy regimen modifications on survival outcome will be discussed. The focus is on the use of actual treatment data and the Received Dose Intensity, in contrast with the use of the intended treatment regimen. The latter approach, known as intention to treat, is very common but also very far from everyday clinical practice. In this talk, I will discuss the confounding nature of toxic side-effects data and show the damaging effect of not including toxicity in the analysis. The method developed is applied to the osteosarcoma randomised clinical trials BO03 and BO06 (EORTC 80861 and 80931).
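Schematically, the marginal structural model is fitted with stabilised inverse-probability-of-treatment weights of the form (generic notation for a time-varying treatment A_t with history Ā_{t-1} and time-varying confounders L̄_t such as toxicity; not taken from the trial protocols)

    sw_i = \prod_{t} \frac{P\left(A_{it} \mid \bar{A}_{i,t-1}\right)}{P\left(A_{it} \mid \bar{A}_{i,t-1}, \bar{L}_{it}\right)},

so that weighting breaks the feedback between evolving toxicity and subsequent dose reductions or delays.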
Density modelling with Functional Data Analysis
28 Ottobre 2022
Recent technological advances have eased the collection of large amounts of data in many research fields. In this scenario, a useful statistical technique is density estimation, which represents an important source of information. One-dimensional density functions represent a special case of functional data, subject to the constraints of being non-negative and having a constant integral equal to one. Because of these constraints, density functions do not form a vector space and a naive application of functional data analysis (FDA) methods may lead to invalid estimates. To address this issue, two main strategies can be found in the literature. In the first, the probability density functions (pdfs) are mapped into a linear functional space through a suitably chosen transformation. Established methods for Hilbert-space-valued data can be applied to the transformed functions and the results are moved back into the density space by means of the inverse transformation. In the second strategy, probability density functions are treated as infinite-dimensional compositional data, since they are parts of some whole which only carry relative information. In this work, by means of a suitable transformation, densities are embedded in the Hilbert space of square-integrable functions, where standard FDA methodologies can be applied.
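One common choice for the first strategy is the centred log-ratio map of Bayes spaces (stated here as a standard construction, not necessarily the transformation used in the talk), which for a density f on an interval I is

    \operatorname{clr}(f)(t) = \log f(t) - \frac{1}{|I|} \int_I \log f(s)\, ds,

after which ordinary Hilbert-space FDA tools apply and results can be mapped back to densities by the inverse transformation.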
The three-sigma rule to define antibody positivity: is it a beauty or a beast?
14 Ottobre 2022
Many epidemiological studies aim to estimate the proportion of individuals currently or previously infected by a given microorganism. Given that an infection inevitably leads to an immune response, this estimation exercise often requires identifying individuals who reach a minimal level of microbe-specific antibodies in their serum. This threshold is invariably defined by the three-sigma rule: mean plus three times the standard deviation of the hypothetical antibody-negative population. Notwithstanding not being linked to a specific parametric distribution, it has the most intuitive interpretation in the context of a normal distribution. I will then discuss the problems of estimation bias and apparent control of specificity arising from applying this rule to non-normal distributions for the seronegative population. I will use public data on antibody testing against SARS-CoV-2 to illustrate these problems. We should finally ask ourselves whether the three-sigma rule is a beautiful statistical concept or, instead, a little beast hidden in antibody data analysis.
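The rule itself is one line of arithmetic; the small simulation below (illustrative only, not the data analysed in the talk) also shows how the nominal ~99.9% specificity of the cutoff degrades when the antibody-negative distribution is right-skewed rather than normal.

    import numpy as np

    rng = np.random.default_rng(2)
    normal_neg = rng.normal(1.0, 0.2, 100_000)              # hypothetical seronegative readings
    skewed_neg = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

    for name, x in [("normal", normal_neg), ("lognormal", skewed_neg)]:
        cutoff = x.mean() + 3 * x.std()                      # the three-sigma rule
        specificity = np.mean(x <= cutoff)
        print(f"{name:9s} cutoff = {cutoff:.2f}, specificity = {specificity:.4f}")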
A general framework for implementing distances for categorical variables
17 Giugno 2022
In many statistical methods, distance plays an important role. For instance, data visualization, classification and clustering methods require quantification of distances among objects. How to define such distance depends on the nature of the data and/or problem at hand. For distance between numerical variables, in particular in multivariate contexts, there exist many definitions that depend on the actual observed differences between values. It is worth underlining that often it is necessary to rescale the variables before computing the distances. Many distance functions exist for numerical variables. For categorical data, defining a distance is even more complex as the nature of such data prohibits straightforward arithmetic operations. Specific measures therefore need to be introduced that can be used to describe or study structure and/or relationships in the categorical data. In this paper, we introduce a general framework that allows an efficient and transparent implementation for distance between categorical variables. We show that several existing distances (for example distance measures that incorporate association among variables) can be incorporated into the framework. Moreover, our framework quite naturally leads to the introduction of new distance formulations as well.
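A minimal instance of such a distance is the simple-matching (Hamming-type) dissimilarity below, which counts the proportion of categorical variables on which two observations disagree; richer members of the framework reweight each variable, for example by category frequencies or by associations among variables (illustrative code, not the paper's implementation).

    def simple_matching(row_a, row_b):
        """Proportion of categorical variables on which two observations differ."""
        assert len(row_a) == len(row_b)
        return sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)

    x = ["red", "small", "round"]
    y = ["red", "large", "round"]
    print(simple_matching(x, y))  # 0.33...: the two items differ on one variable out of three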
Model-assisted indirect small area estimation
27 Maggio 2022
Generalised regression is the most common design-based model-assisted method for estimation of population means and totals in practical survey sampling. However, it is often unacceptable in the context of small area estimation, where one is interested in population means and totals for a large number of areas (or domains) and the sample sizes are either small or non-existent in many of them. In this seminar, we discuss an approach to extend generalised regression from direct estimation for the whole population to indirect estimation of all the small area populations. This requires trading variance off against bias and enables a practical methodology for estimation at the different aggregation levels, which is coherent numerically (self-benchmarking) as well as conceptually in terms of the design-based model-assisted inference outlook. Estimation can be conducted by means of an extended weighting system that has as many sets of weights as the number of small areas: each set produces the estimate for a domain mean of one or more survey variables of interest and is, in this sense, multipurpose.
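For context, the direct generalised regression (GREG) estimator of a population total, with design weights d_k = 1/π_k and fitted values ŷ_k from an assisting model on auxiliary variables, is usually written as

    \hat{Y}_{\mathrm{GREG}} = \sum_{k \in U} \hat{y}_k + \sum_{k \in s} d_k\,\left(y_k - \hat{y}_k\right),

and the indirect small-area extension discussed in the talk replaces this single weighting system with one calibrated set of weights per area, trading some bias for variance while preserving self-benchmarking.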

2021


Challenges in emulating target trials
14 Dicembre 2021
The framework of target trial emulation (TTE) is increasingly adopted when researchers wish to address causal questions using observational data. TTE has multiple advantages, starting from the clarity of explicitly specifying the hypothetical target experimental trial for the questions of interest. However, because the data often arise from linked administrative databases that are not created for research purposes, their handling demands extreme care if biased conclusions are to be avoided. Two main sources of bias have been broadly recognised in the epidemiological literature: immortal time bias and inappropriate selection of comparative groups. This talk will focus on other challenges to emulating target trials which are not commonly aired, using two examples.
References: Hernán et al. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. Journal of Clinical Epidemiology, 2016: 79, 70-75. Hernán and Robins. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology, 2016: 183, 758-764. Suissa. Immortal time bias in observational studies of drug effects. Pharmacoepidemiology and Drug Safety, 2007: 241-249.
Testing for the Rank of a Covariance Kernel
10 Dicembre 2021
How can we discern whether the covariance of a stochastic process is of reduced rank, and if so, what its precise rank is? And how can we do so at a given level of confidence? This question is central to a great deal of methods for functional data, which require low-dimensional representations whether by functional PCA or other methods. The difficulty is that the determination is to be made on the basis of i.i.d. replications of the process observed discretely and with measurement error contamination. This adds a ridge to the empirical covariance, obfuscating the underlying dimension. We describe a matrix-completion inspired test statistic that circumvents this issue by measuring the best possible least square fit of the empirical covariance's off-diagonal elements, optimised over covariances of given finite rank. For a fixed grid of sufficiently large size, we determine the statistic's asymptotic null distribution as the number of replications grows. We then use it to construct a bootstrap implementation of a stepwise testing procedure controlling the family-wise error rate corresponding to the collection of hypotheses formalising the question at hand. Under minimal regularity assumptions we prove that the procedure is consistent and that its bootstrap implementation is valid. The procedure circumvents smoothing and associated smoothing parameters, is indifferent to measurement error heteroskedasticity, and does not assume a low-noise regime. Based on joint work with Anirvan Chakraborty.
Robust Statistics for (big) data analytics
3 Dicembre 2021
Data rarely follow the simple models of mathematical statistics. Often, there will be distinct subsets of observations so that more than one model may be appropriate. Further, parameters may gradually change over time. In addition, there are often dispersed or grouped outliers which, in the context of international trade data, may correspond to fraudulent behavior. All these issues are present in the datasets that are analyzed on a daily basis by the Joint Research Centre of the European Commission and can only be tackled by using methods which are robust to deviations from model assumptions (see for example [6]). This distance between mathematical theory and data reality has led, over the last sixty years, to the development of a large body of work on robust statistics. In the seventies of the last century, it was expected that in the near future any author of an applied article who did not use the robust alternative would be asked by the referee for an explanation [9]. Now, a further forty years on, there does not seem to have been the foreseen breakthrough into the wider scientific universe. In this talk, we initially sketch what we see as some of the reasons for this failure, suggest a system of interrogating robust analyses, which we call monitoring [5], and describe a series of robust and efficient methods to detect model deviations, groups of homogeneous observations [10], multiple outliers and/or sudden level shifts in time series [8]. Particular attention will be given to robust and efficient methods (known as the forward search) which enable a flexible level of trimming and an understanding of the effect that each unit (outlier or not) exerts on the model (see for example [1], [2], [7]). Finally, we discuss the extension of the above methods to transformations and to the big data context. The Box-Cox power transformation family for non-negative responses in linear models has a long and interesting history in both statistical practice and theory. The Yeo-Johnson transformation extends the family to observations that can be positive or negative. In this talk, we describe an extended Yeo-Johnson transformation that allows positive and negative responses to have different power transformations ([4] or [3]). As an illustration of the suggested procedure, we analyse data on the performance of investment funds, 99 out of 309 of which report a loss. The problem is to use regression to predict medium term performance from two short term indicators. It is clear from scatterplots of the data that the negative responses have a lower variance than the positive ones and a different relationship with the explanatory variables. Tests and graphical methods from our robust analysis allow the detection of outliers, the testing of the values of transformation parameters and the building of a simple regression model. All the methods described in the talk have been included in the FSDA Matlab toolbox, freely downloadable from the Mathworks file exchange or from GitHub at https://uniprjrc.github.io/FSDA/
References
[1] Atkinson, A. C. and Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York.
[2] Atkinson, A. C., Riani, M., and Cerioli, A. (2004). Exploring Multivariate Data with the Forward Search. Springer-Verlag, New York.
[3] Atkinson, A. C., Riani, M., and Corbellini, A. (2020). The analysis of transformations for profit-and-loss data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(2), 251-275.
[4] Atkinson, A. C., Riani, M., and Corbellini, A. (2021). The Box-Cox Transformation: Review and Extensions. Statistical Science, 36(2), 239-255.
[5] Cerioli, A., Riani, M., Atkinson, A. C., and Corbellini, A. (2018). The power of monitoring: How to make the most of a contaminated multivariate sample (with discussion). Statistical Methods and Applications, 27, 559-666. https://doi.org/10.1007/s10260-017-0409-8
[6] Perrotta, D., Torti, F., Cerasa, A., and Riani, M. (2020). The robust estimation of monthly prices of goods traded by the European Union. Technical Report EUR 30188 EN, JRC120407, European Commission, Joint Research Centre, Publications Office of the European Union, Luxembourg. ISBN 978-92-76-18351-8, doi:10.2760/635844.
[7] Riani, M., Atkinson, A. C., and Cerioli, A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society, Series B, 71, 447-466.
[8] Rousseeuw, P., Perrotta, D., Riani, M., and Hubert, M. (2019). Robust monitoring of time series with application to fraud detection. Econometrics and Statistics, 9, 108-121.
[9] Stigler, S. M. (2010). The changing history of robustness. The American Statistician, 64, 277-281.
[10] Torti, F., Perrotta, D., Riani, M., and Cerioli, A. (2018). Assessing trimming methodologies for clustering linear regression data. Advances in Data Analysis and Classification, 13, 227-257.
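For reference, the Yeo-Johnson transformation mentioned above is, for response y and power λ,

    \psi(y, \lambda) =
    \begin{cases}
      \{(y + 1)^{\lambda} - 1\}/\lambda, & y \ge 0,\ \lambda \neq 0,\\
      \log(y + 1), & y \ge 0,\ \lambda = 0,\\
      -\{(1 - y)^{2 - \lambda} - 1\}/(2 - \lambda), & y < 0,\ \lambda \neq 2,\\
      -\log(1 - y), & y < 0,\ \lambda = 2,
    \end{cases}

and the extension described in the talk allows the positive and negative responses to take different values of λ.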

2019


Annibale Biggeri (Università di Firenze) - Uncertainty and reproducibility in biomedical research
22 Febbraio 2019 - Sala 34 ore 11

Giorgio Consigli (Università degli Studi di Bergamo) - Asset-liability management for occupational pension funds under market and longevity risk: a case study and alternative modelling approaches
22 Marzo 2019 - Aula V ore 15
The modelling of institutional ALM problems has a long history in stochastic programming, starting in the late 1980s with the first industry developments such as the well-known Yasuda Kasai model (Ziemba, Turner, Carino et al., 1994), specifically for pension fund management (PF ALM). Due to economic and demographic pressures in most OECD countries and an increasing interest in PF ALM developments by the industry and by policy makers, we nowadays witness a growing demand for R&D projects addressed to the scientific community. Taking the view of a PF manager, the presentation will develop around the definition of a generic pension fund (PF) asset-liability management (ALM) problem and analyse the key underlying methodological implications of: (i) its evolution from an early-stage multistage stochastic programming (MSP) with recourse to the most recent MSP and distributionally robust (DRO) formulations, (ii) a peculiar and rich risk spectrum including market risk as well as liability risk, such as longevity risk and demographic factors, leading to (iii) valuation or pricing approaches based on incomplete market assumptions and, due to recent international regulation, (iv) a risk-based capital allocation for long-term solvency. The above represent fundamental stochastic and mathematical problems of modern financial optimisation. Two possible approaches to DRO are considered, based on a stochastic control framework or on explicitly introducing an uncertainty set for probability measures and formulating the inner DRO problem as a probability-distance minimization problem over a given space of measures. Keywords: asset-liability management, multistage stochastic programming, distributional uncertainty, distributionally robust optimization, solvency ratio, liability pricing, longevity risk, capital allocation.
Gianluca Mastrantonio (Politecnico di Torino) - New formulation of the logistic-normal process to analyze trajectory tracking data
28 Gennaio 2019 - Sala 34 ore 10.30
Improved communication systems, shrinking battery sizes and the price drop of tracking devices have led to an increasing availability of trajectory tracking data. These data are often analyzed to understand animal behavior using mixture-type models. In this work, we propose a new model based on the logistic-normal process. Due to a new formalization and the way we specify the coregionalization matrix of the associated multivariate Gaussian process, we show that our model, differently from other proposals, is invariant with respect to the choice of the reference element and of the ordering of the components of the probability vectors. We estimate the model under a Bayesian framework, using an approximation of the Gaussian process needed to avoid impractical computational times. We perform a simulation study with the aim of showing the ability of the model to retrieve the parameters used to simulate the data. The model is then applied to real data where a wolf is observed before and after procreation. Results are easy to interpret, showing differences in the two phases. Joint work with: Enrico Bibbona (Politecnico di Torino), Clara Grazian (Università di Pescara), Sara Mancinelli (Università "Sapienza" di Roma).
Stefano Cavastracci Strascia and Agostino Tripodi - Overdispersed-Poisson Model in Claims Reserving: Closed Tool for One-Year Volatility in GLM Framework
29 Marzo 2019 - Aula V ore 14.15
The aim of the work is to build a tool for estimating the one-year volatility of the claims reserve, computed in closed form through generalized linear models (GLM), in particular for the over-dispersed Poisson model. Until now, this one-year volatility has been estimated through the well-known bootstrap methodology, which requires the use of the Monte Carlo method together with a re-reserving technique. However, this method is computationally time-consuming and requires further stability conditions; therefore, approximation techniques are often used in practice. Some applications with the R software, whose code is reported in the paper, will also be presented.
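For reference, the over-dispersed Poisson reserving model referred to above is commonly written, for incremental claims Y_ij of accident year i and development year j, as

    \mathbb{E}[Y_{ij}] = m_{ij} = \exp(c + \alpha_i + \beta_j), \qquad \operatorname{Var}(Y_{ij}) = \phi\, m_{ij},

a log-link GLM whose fitted reserves coincide with those of the chain-ladder method; the contribution of the talk is a closed-form one-year volatility within this framework, avoiding bootstrap re-reserving.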
Simone Padoan (Università "Luigi Bocconi" di Milano) - Statistical modelling of extreme values
16-17 Aprile 2019 - Sala 34 ore 10-14

Enrico Tucci - Emigration from Italy through the integration and analysis of statistical surveys and official sources
5 Giugno 2019 - Aula Master (Viale Regina Elena 295) ore 10
The aim of this work is to analyse international migration since the 2011 population census through an integrated use of the available sources. In Italy, official statistics are produced by direct use of the population registers, which does not allow the phenomenon to be captured in its entirety, mainly because of the difficulty of counting movements abroad. The results obtained in this work show that the information gap can be reduced through a longitudinal database based on individual data. The new analytical perspective, given by linking over time the movements of the same individual, makes it possible to observe phenomena that are relevant for migration policies, such as return and circular migration. Finally, the international mobility of the "new Italians" is examined through a longitudinal approach, and a regression model is applied to understand which characteristics are most strongly associated with the propensity to become Italian.
Simone Russo - Disability insurance benefits: a study of the incidence of disability in the working-age population and an analysis of its determinants through register data
5 Giugno 2019 - Aula Master (Viale Regina Elena 295) ore 10
The ageing of the Italian population is leading to a considerable increase in the number of chronically ill and disabled people. To date there are no specific studies, especially for Italy, on the effects of population ageing in terms of disability insurance benefits, nor more generally on the determinants of this phenomenon. Claims accepted for disability insurance benefits have increased considerably since the end of the 1990s. Overall, the analyses carried out show that the evolution of accepted claims for disability insurance benefits is linked to a series of individual characteristics of the beneficiary workers and to contextual factors of various kinds, in particular demographic, territorial, epidemiological and economic factors, as well as factors linked to the occupational structure.
Daniel K. Sewell (University of Iowa) - An introduction to the statistical analysis of network data
9 e 10 Settembre 2019 - Aula VII (ex Castellano) ore 10-16 (con pausa)

Roberta De Vito (Department of Biostatistics, Brown University, Providence, Rhode Island, USA) - Multi-study factor analysis for biological data
14 Novembre 2019 - Aula XIV (palazzina Tumminelli) ore 12
We introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies. The goal is to separately identify and estimate 1) common factors shared across multiple studies, and 2) study-specific factors. We develop a fast Expectation Conditional-Maximization algorithm for parameter estimates and we provide a procedure for choosing the common and specific factors. We present simulations evaluating the performance of the method and we illustrate it by applying it to gene expression data in ovarian cancer and to nutrient-based dietary patterns and the risk of head and neck cancer. In both cases, we clarify the benefits of a joint analysis compared to the standard factor analysis. Moreover, we generalize the model in a Bayesian framework. We implement it using sparse modeling of high-dimensional factor loadings matrices, both common and specific, using the infinite gamma shrinkage prior. We propose a computationally efficient algorithm, based on a traditional Gibbs sampler, to produce the Bayes estimates of the parameters and to select the number of relevant common factors. We assess the operating characteristics of our method by means of simulation studies, and we present an application to the prediction of the biological signal from four gene expression studies on ovarian cancer.
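Schematically (generic notation, not copied from the papers), the observed vector for subject i in study s is decomposed as

    x_{si} = \Phi f_{si} + \Lambda_s l_{si} + e_{si},

with Φ carrying the factors common to all studies, Λ_s the study-specific factors, and e_{si} the idiosyncratic noise.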
Garyfallos Konstantinoudis - Discrete versus continuous domain models for disease mapping and applications on childhood cancers
22 Novembre 2019 - Sala 34 ore 12
The main goals of disease mapping are to estimate disease risk and identify high-risk areas. Such analyses are hampered by the limited geographical resolution of the available data. Typically data are counts of cases per spatial unit and the most common approach is the Besag-York-Mollié (BYM) model. Less frequently, exact geocodes are available, allowing a disease to be modelled as a point process through log-Gaussian Cox processes (LGCPs). The objective of this study is to examine in a simulation the performance of BYM and LGCPs for disease mapping. We simulated data in the Canton of Zurich in Switzerland, sampling cases from the true population mimicking the childhood leukaemia incidence (n=334 during 1985-2015). We considered 39 different scenarios varying in the risk generating function (step-wise, smooth, flat risk), the size of the high-risk areas (1, 5 and 10 km radii), the risk increase within the high-risk areas (2- and 5-fold) and the number of cases (n, 5n and 10n). We used the root mean integrated square error (RMISE) to examine the ability of the models to recover the true risk surface and their sensitivity/specificity in identifying high-risk areas. We found that, for larger radii, LGCPs recover the true risk surface with lower error across almost all scenarios (median RMISE: 9.17-27.0) compared to the BYM (median RMISE: 9.12-35.6). For radii of 1 km and flat risk surfaces, the BYM performs better. In terms of sensitivity and specificity, across almost all scenarios the median area under the curve (AUC) for LGCPs was higher (median AUC: 0.81-1) compared to the BYM (median AUC: 0.65-0.93). We applied these methods to childhood leukaemia incidence in the canton of Zurich during 1985-2015 and identified two high-risk spatially coherent areas. Our findings suggest that there are important gains to be made from the use of LGCP models in spatial epidemiology.
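For readers unfamiliar with the two model classes being compared, the BYM model for observed and expected counts y_i and E_i in area i is typically written as

    y_i \sim \mathrm{Poisson}\!\left(E_i\, e^{\eta_i}\right), \qquad \eta_i = \beta_0 + u_i + v_i,

with u_i a spatially structured (intrinsic CAR) random effect and v_i unstructured noise, whereas an LGCP models exact case locations as a Poisson point process whose log-intensity surface is a Gaussian random field.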

2018


Yves Tillé (Université de Neuchatel) - How to select a sample?
27 Novembre 2018 - Sala 34 ore 14.30
The principles of sampling can be synthesized as randomization, restriction and over-representation. Defining a sampling design – stratification, equal/unequal selection probabilities, etc. – means using prior information, and it is equivalent to assuming a model on the population. Several well-known sampling designs are optimal with respect to models that maximize the entropy. In the cube method, the prior information is used to derive a sample that matches the totals or means of auxiliary variables. In this respect, the sample is called balanced. Furthermore, if distances between statistical units – based on geographical coordinates or defined via auxiliary variables – are available, it can be interesting to spread the sample in space in order to make the design more efficient. In this perspective, new spatial sampling methods, such as the GRTS, the local pivotal method and the local cube, will be covered.
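In the cube method, "balanced" has a precise meaning: with inclusion probabilities π_k and auxiliary vectors x_k known for every unit k of the population U, the selected sample s must satisfy, exactly or as nearly as possible,

    \sum_{k \in s} \frac{\mathbf{x}_k}{\pi_k} = \sum_{k \in U} \mathbf{x}_k,

so that the Horvitz-Thompson estimates of the auxiliary totals reproduce their known population values.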
