A gentle statistical introduction to domain adaptation
December 5, 2025, 12:00
In the statistical learning framework, we assume that the learning set is an unbiased sample of the population of interest, while the test set is used to assess the fitted model. However, in real-world applications the available learning data may be a biased sample of the target population, so a classifier trained on them may prove quite inadequate. In other words, there is a “shift” between the two domains. Several reasons can explain this dissimilarity: for instance, collecting new labeled data is often time-consuming, costly, or even infeasible, especially when the statistical properties of the population evolve over time; in other cases, we are provided with a large amount of unlabeled data but only a small amount of labeled data. In the machine learning literature, the area dealing with this kind of problem is called domain adaptation. In this seminar, we present the main ideas of domain adaptation in a statistical setting. Numerical studies based on both simulated and real datasets will also be presented.
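One classical statistical device in this area, under the covariate-shift assumption, is importance weighting: the risk computed on the (biased) source sample is reweighted by the density ratio between target and source covariate distributions. A minimal sketch, in which the weights are assumed to be known or estimated separately (the function name is illustrative, not from the seminar):

```python
import numpy as np

def importance_weighted_risk(loss, w):
    """Estimate target-domain risk from source-domain losses via importance
    weights w(x) = p_target(x) / p_source(x), assumed given."""
    return np.mean(w * loss)

# With uniform weights this reduces to the ordinary empirical risk;
# non-uniform weights up-weight source points that are common in the target domain.
```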
|
Causal Inference when Intervention Units and Outcome Units Differ
November 28, 2025, 12:00
We study causal inference in settings characterized by interference with a bipartite structure. There are two distinct sets of units: intervention units to which an intervention can be applied and outcome units on which the outcome of interest can be measured. Outcome units may be affected by interventions on some, but not all, intervention units, as captured by a bipartite graph. Examples of this setting can be found in analyses of the impact of pollution abatement in plants on health outcomes for individuals, or the effect of transportation network expansions on regional economic activity. We introduce and discuss a variety of old and new causal estimands for these bipartite settings. We do not impose restrictions on the functional form of the exposure mapping and the potential outcomes, thus allowing for heterogeneity, non-linearity, non-additivity, and potential interactions in treatment effects. We propose unbiased weighting estimators for these estimands from a design-based perspective, based on the knowledge of the bipartite network under general experimental designs. We derive their variance and prove consistency for an increasing number of outcome units. Using the Chinese high-speed rail construction study, analyzed in Borusyak and Hull [2023], we discuss non-trivial positivity violations that depend on the estimands, the adopted experimental design, and the structure of the bipartite graph.
|
Bayesian network propensity score estimation for testing causal effect in binary data
November 21, 2025, 12:00
Estimating treatment effects in real-world data is challenging due to potential bias arising from non-random treatment assignment and confounding variables. The propensity score (PS) is commonly used to correct for such bias, but its accuracy depends on properly modelling the treatment–covariate relationships. This study proposes estimating PS via Bayesian Networks (BNs), offering a flexible and often superior alternative to logistic regression, especially in complex dependency settings. The BN-based PS is applied to construct two estimators of the Average Treatment Effect (ATE): the Horvitz-Thompson (HT) and Hajek (H) types. When the PS model is correctly specified, both are asymptotically equivalent; under misspecification, the H-type performs better. Extensive simulations confirm the robustness and effectiveness of the proposed approach.
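Given any estimated propensity score (from a BN or otherwise), the two ATE estimators compared in the talk take a simple weighted form. A minimal sketch with illustrative function names:

```python
import numpy as np

def ate_ht(y, t, ps):
    """Horvitz-Thompson ATE: inverse-probability-weighted difference in means."""
    n = len(y)
    return np.sum(t * y / ps) / n - np.sum((1 - t) * y / (1 - ps)) / n

def ate_hajek(y, t, ps):
    """Hajek ATE: same weights, but normalised to sum to one within each arm,
    which stabilises the estimator when the PS model is misspecified."""
    w1, w0 = t / ps, (1 - t) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```

With a correctly specified propensity score the two agree asymptotically, as the abstract notes; the normalisation in the Hajek form is what buys robustness under misspecification.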
|
Bayesian hierarchical models for daily temperatures: means, quantiles, and record-breaking events
November 11, 2025, 12:00
This seminar presents a comprehensive Bayesian hierarchical framework for modeling daily temperatures, integrating mean, quantile, and record-breaking analyses across space and time. We first introduce a spatio-temporal mean model for daily maximum temperatures, incorporating two temporal scales, explicit autoregressive dependence, fixed effects, and multiple random effects to capture spatial variability. Building on this, we develop a mixed effects quantile regression model with asymmetric Laplace errors to explore climate change across quantiles, enabling marginal quantile inference from conditional autoregressive structures and revealing pronounced spatial-quantile heterogeneity in climate signals. Next, we introduce a new framework for analyzing high-temperature events, proposing hierarchical models for both univariate and bivariate record-breaking temperatures. The univariate model employs a logistic regression formulation with an explicit long-term trend and strong daily spatial random effects to analyze the occurrence of calendar-day records across years, allowing inference on the number, spatial distribution, and temporal evolution of record-breaking events under climate change. The bivariate model uses a probit regression to jointly model maximum and minimum temperature records, capturing spatial and temporal dependence through anisotropic coregionalized Gaussian processes and revealing correlated yet distinct patterns of record-breaking behavior. All models leverage Gaussian latent representations for closed-form Gibbs sampling and provide spatial predictions at unobserved locations. Applications to long-term datasets from Spain illustrate trends in daily temperatures, quantile-specific climate change, and the increasing occurrence of record-breaking temperatures, offering new tools for the statistical analysis of environmental extremes.
|
From PoS to u-PoS: a journey through the realm of Probability of success
November 7, 2025, 12:00
The probability of success (PoS) of a trial is conventionally defined as the expected value of the power function of a test with respect to a design prior assigned to the parameter under scrutiny. Even though this quantity is widely used in clinical trials for experimental design, the definition of probability of success is not univocal. In this presentation we review and compare the main types of PoS; specifically, we focus on a unifying, decision-theoretic approach that yields a new type of PoS as the expected utility of the trial (u-PoS), that is, the expected probability of making the correct choice between two hypotheses. We base our comparisons on properties of the probability distributions of the power-related random variables associated with these definitions. Our conclusion is that u-PoS shows a conceptual advantage over previous versions of PoS; moreover, when suitable design priors are used, it may produce smaller optimal sample sizes than its competitors.
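As an illustration of the conventional definition only (a simple one-sided z-test with known variance, not the u-PoS of the talk), the PoS can be approximated by Monte Carlo: average the power function over draws from the design prior. A hedged sketch:

```python
import numpy as np
from scipy.stats import norm

def pos(n, prior_draws, alpha=0.025, sigma=1.0):
    """Monte Carlo PoS for a one-sided z-test of H0: theta <= 0.
    Power at theta is Phi(sqrt(n) * theta / sigma - z_{1-alpha});
    PoS averages it over design-prior draws of theta."""
    z = norm.ppf(1 - alpha)
    power = norm.cdf(np.sqrt(n) * prior_draws / sigma - z)
    return power.mean()

# Example: PoS under a N(0.3, 0.1^2) design prior with n = 50 per arm-free setting
rng = np.random.default_rng(0)
print(pos(50, rng.normal(0.3, 0.1, size=100_000)))
```

With a point-mass design prior the PoS collapses to ordinary frequentist power, which is a useful sanity check.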
|
Bayesian Regression Factor Model for Multivariate Causal Effect
October 31, 2025, 12:00
In the context of causal inference, the study of causal effects for multivariate potential outcomes remains relatively underexplored. This gap largely stems from the inherent difficulty of quantifying the overall causal impact of a treatment on correlated outcomes and understanding how these effects vary across different outcomes. Nevertheless, this research question is critically important in fields such as environmental epidemiology, where one may seek to assess the causal relationship between air pollution regulations and the concentrations of multiple pollutants, or between air pollution exposure and hospitalizations for various diseases.
To address this challenge, we leverage a Bayesian factor regression model to identify latent, treatment-specific factors that capture causal effects among correlated multivariate outcomes. We propose a methodology that (i) introduces novel causal estimands within a general framework for multivariate outcomes, and (ii) develops a multi-treatment Bayesian factor regression model that enables the identification and characterization of causal latent effects. The innovative use of the dependent Dirichlet process as the distribution for the factor scores further allows us to handle missing data through a principled, awareness-driven, and fair imputation mechanism.
The performance of the proposed method is demonstrated through both simulation studies and real-world applications in environmental epidemiology.
|
Statistics as a critical art
October 24, 2025, 12:00
Statistics plays a central role in understanding and managing the complexity of the contemporary world, well beyond the boundaries of the hard sciences: it increasingly takes the form of a “critical discipline”, capable not only of analyzing data but of questioning the very categories through which we interpret reality. Far from being a simple neutral technique, statistics arises and develops within specific social and cultural contexts. From an epistemological standpoint, probability progressively replaces certainty, and all knowledge takes the form of a revisable approximation. This approach contrasts with the demands of public communication, which favors simplifications and slogans. Statistics, by contrast, works with refined abstractions which, although lacking direct empirical counterparts, allow a deeper understanding of reality. As a “critical art”, it does not confirm pre-established analytical categories but rather tends to call them into question, promoting an attitude toward data that keeps the range of judgment and interpretive flexibility as open as possible. Understood in this way, statistics not only describes the world but actively participates in constructing richer, more dynamic visions of reality.
|
The Interplay between Bayesian Inference and Conformal Prediction
October 17, 2025, 12:00
Conformal prediction has emerged as a cutting-edge methodology in statistics and machine learning, providing prediction intervals with finite-sample frequentist coverage guarantees. Yet, its interplay with Bayesian statistics – often criticised for lacking frequentist guarantees – remains underexplored. Recent work has suggested that conformal prediction can serve to “calibrate” Bayesian procedures, thereby imparting frequentist validity and motivating deeper investigation into frequentist–Bayesian hybrids. We further argue that Bayesian inference has the potential to enhance conformal prediction, for instance, through more informative intervals. Thus, in a spirit of modern statistics based less on division and more on synthesis, the two paradigms may be viewed as complementary, jointly striving for a principled balance between validity and efficiency. In this talk, I will outline paths toward bridging this gap. After surveying existing ideas, a formalization of the Bayesian conformal inference framework will be provided, in both its full and split forms. Emphasis will be given to the challenging aspect of computational complexity, with potential solutions. Finally, the advantages of Bayesian conformal inference will be discussed in small-area estimation, a paradigmatic example for this hybrid perspective.
Based on joint work with Brunero Liseo.
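The split form mentioned in the abstract is easy to sketch in a model-agnostic way: any point predictor (a Bayesian posterior predictive mean would be one choice) is trained on one half of the data, and the residual quantile on the other half calibrates the interval. A minimal sketch with absolute residuals as the conformity score (an illustrative choice, not the talk's construction):

```python
import numpy as np

def split_conformal(fit, x_train, y_train, x_cal, y_cal, x_new, alpha=0.1):
    """Split conformal prediction with absolute-residual scores.
    `fit(x, y)` must return a callable point predictor."""
    model = fit(x_train, y_train)
    scores = np.abs(y_cal - model(x_cal))          # conformity scores on calibration set
    n = len(scores)
    # finite-sample-corrected quantile level (n+1)(1-alpha)/n, capped at 1
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    pred = model(x_new)
    return pred - q, pred + q                      # marginal (1-alpha) coverage interval
```

The finite-sample coverage guarantee holds for any `fit`, which is exactly why a Bayesian predictor can be "calibrated" this way without losing its informativeness.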
|
Plug-in estimation for parametric and penalised multi-state Markov models using ordinary differential equations
October 3, 2025, 12:00
Existing multi-state models are restricted either by the 'plug-in' parameters that can be estimated or by their dependence on the bootstrap for variance estimation. Our objective is to develop an efficient algorithm and implementation for 'plug-in' maximum likelihood estimation in parametric and smooth penalised Markov multi-state models. Methodologically, we restrict our attention to smooth parametric and penalised transition intensities for multi-state Markov models. We propose a new algorithm that uses a system of ordinary differential equations to calculate the parameters and their gradients, with standard errors obtained by the delta method. The algorithm supports 'plug-in' parameter estimation for state occupancy probabilities, transition probabilities, length of stay, relative survival, screening sensitivity, utilities, costs, net monetary benefit, and their linear combinations. We provide an implementation in R that allows for a wide range of parametric and penalised survival models. Using simulations, we demonstrate good coverage for a range of transition intensity models. We apply these methods to an earlier multi-state analysis of the Rotterdam Breast Cancer Data and extend the analysis to include regression standardisation. In conclusion, we have provided a broad framework for 'plug-in' parameter estimation in Markov multi-state models with smooth transition intensities. These methods have applications in a range of disciplines, including epidemiology and health economics.
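The core computational idea, obtaining transition probabilities by integrating the Kolmogorov forward equation dP/dt = P(t)Q(t), can be sketched as follows. The abstract's implementation is in R; this is a Python illustration for a time-homogeneous illness-death model with hypothetical constant intensities:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illness-death model: states 0 = healthy, 1 = ill, 2 = dead (absorbing).
# Intensities are illustrative constants; rows of Q sum to zero.
Q = np.array([[-0.3,  0.2, 0.1],
              [ 0.0, -0.4, 0.4],
              [ 0.0,  0.0, 0.0]])

def transition_probabilities(Q, t):
    """Solve the Kolmogorov forward equation dP/dt = P Q with P(0) = I."""
    k = Q.shape[0]
    def rhs(_, p):
        return (p.reshape(k, k) @ Q).ravel()
    sol = solve_ivp(rhs, (0.0, t), np.eye(k).ravel(), rtol=1e-10, atol=1e-12)
    return sol.y[:, -1].reshape(k, k)

P = transition_probabilities(Q, 2.0)   # transition probability matrix over 2 time units
```

The same ODE system can be augmented with sensitivity equations for the gradients, which is what makes delta-method standard errors for the 'plug-in' functionals tractable.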
|
An Overview of Robust Regression Mixture Models Emphasizing Symmetric α-Stable Distributions
September 23, 2025, 12:00
The typical method for estimating mixtures of regression models relies on the assumption that the error components are normally distributed. This assumption makes such models highly vulnerable to outliers and heavy-tailed errors. This lecture will review some robust alternatives for mixtures of regression models. In particular, we will focus on a new robust model introduced by Zarei, which extends the mixture of symmetric α-stable (SαS) distributions to the regression setting.
The SαS distribution is a heavy-tailed generalization of the normal distribution, in which an additional parameter, α, controls the heaviness of the tails. A distinctive characteristic of the SαS distribution is that its variance is infinite whenever α < 2. This property makes the model exceptionally robust against extreme outliers compared to other heavy-tailed distributions, such as Student's t.
The model's parameters, except for α, are estimated using a standard Expectation-Maximization (EM) algorithm. The parameter α is estimated separately via a stochastic EM algorithm that utilizes a rejection sampling method. We will illustrate and compare this new model with existing mixture regression models using both simulated and real-world datasets.
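The tail behaviour that drives this robustness is easy to inspect numerically: for α < 2 the SαS tail decays like a power law rather than a Gaussian tail. A small sketch using SciPy's stable distribution (the function name is illustrative):

```python
import numpy as np
from scipy.stats import levy_stable, norm

def sas_tail(alpha, x):
    """Upper-tail probability P(X > x) of a standard symmetric
    alpha-stable law (skewness beta = 0)."""
    return levy_stable.sf(x, alpha, 0.0)

# Heavier tails as alpha decreases; the Gaussian (alpha = 2 limit) tail is
# astronomically lighter at the same threshold.
print(sas_tail(1.5, 10.0), norm.sf(10.0))
```

This is why extreme outliers barely perturb the fitted SαS mixture components, whereas they can dominate a normal-errors fit.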
|
Bi-clustering multivariate categorical data via extended mixtures of latent trait analyzers
May 30, 2025, 12:00
Multivariate categorical outcomes are common in real-world applications but, frequently, their high dimensionality makes both the analysis and the interpretation particularly challenging. In this context, model-based clustering offers a powerful approach to data reduction and structure discovery. We build upon the Mixture of Latent Trait Analyzers (MLTA) framework to propose a model that enables simultaneous clustering of both rows and columns of the data matrix. Specifically, rows are grouped into components using a finite mixture specification. Within each component, variables are segmented based on a flexible yet parsimonious specification of the linear predictor. To capture residual dependence among observations, we retain the use of a multidimensional latent trait, consistent with the original MLTA formulation. Additionally, the model accommodates the influence of individual-specific covariates on the clustering process via a concomitant variable framework.
Parameter estimation is carried out using maximum likelihood, implemented through an extended Expectation-Maximization (EM) algorithm. Since the likelihood involves integrals without closed-form solutions, we apply a Gauss-Hermite quadrature for numerical approximation. A comprehensive simulation study evaluates the model’s ability to accurately recover both clustering structure and parameter values, demonstrating strong performance. Finally, we apply the proposed method to an original dataset on pediatric patients with suspected appendicitis, aiming to identify patient subgroups characterized by distinct patterns of clinical conditions.
|
On the Estimation of Climate Normals and Anomalies
May 23, 2025, 12:00
The quantification of the interannual component of variability in climatological time series is essential for the assessment and prediction of the El Niño-Southern Oscillation phenomenon. This is achieved by estimating the deviation of a climate variable (e.g., temperature, pressure, precipitation, or wind strength) from its normal conditions, defined by its baseline level and seasonal patterns. Climate normals are currently estimated by simple arithmetic averages calculated over the most recent 30-year period ending in a year divisible by 10. The suitability of this standard methodology has been questioned in the context of a changing climate, characterized by nonstationary conditions. The literature has focused on the choice of the bandwidth and on the ability to account for trends induced by climate change. The paper contributes to the literature by proposing a regularized real-time filter based on local trigonometric regression, optimizing the bias-variance trade-off of the estimation in the presence of climate change, and by introducing a class of seasonal kernels enhancing the localization of the estimates of climate normals. Applications to sea surface temperature series in the Niño 3.4 region and to zonal and trade wind strength in the equatorial and tropical Pacific region illustrate the relevance of our proposal. Joint work with Alessandro Giovannelli, Università dell’Aquila.
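The standard methodology the paper questions is simple to state in code: the normal is the arithmetic mean over the most recent 30-year window ending in a year divisible by 10, and the anomaly is the deviation from it. A minimal sketch (illustrative names, annual resolution for brevity):

```python
import numpy as np

def climate_normal(years, values, end_year):
    """Standard climate normal: arithmetic mean over the 30-year
    window ending at end_year (which must be divisible by 10)."""
    assert end_year % 10 == 0
    mask = (years >= end_year - 29) & (years <= end_year)
    return values[mask].mean()

def anomaly(years, values, end_year):
    """Deviation of each observation from the current normal."""
    return values - climate_normal(years, values, end_year)
```

Under a warming trend this fixed 30-year average lags the true baseline, which is the bias the proposed local trigonometric regression filter is designed to reduce.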
|
A Hierarchical Model for Comparing Spectral Patterns in Lemur Vocalization
May 16, 2025, 12:00
In this talk, I will present a hierarchical model designed to analyze the spectrograms of animal vocalizations, with a focus on grunt calls from different lemur species. The primary goal is to uncover a latent spectral shape that characterizes each species and allows us to quantify dissimilarities between them. A key challenge lies in aligning calls of varying durations and temporal dynamics. To tackle this, we incorporate a synchronization function to manage non-stationary temporal features and adopt a circular representation of time to handle artifacts caused by the discretization of analog signals. Given the high dimensionality of spectrogram data, we use a Nearest Neighbor Gaussian Process for efficient computation and sample from the posterior distribution using MCMC. The model is applied to recordings from eight lemur species. For each species, we identify a representative vocal pattern and use a simple distance metric to compare them. Predictive performance is assessed via cross-validation, and we also explore some special cases that highlight the model’s flexibility.
|
Mixture-based clustering for ordinal responses
April 4, 2025, 12:00
Existing methods can perform likelihood-based clustering on a multivariate data matrix of ordinal data, using finite mixtures to cluster the rows (observations) of the matrix. These models can incorporate the main effects of individual rows and columns, as well as cluster effects, to model the matrix of responses. However, many real-world applications also include available covariates, which can provide insights into the main characteristics of the clusters. In our research, we have extended the mixture-based models to include covariates directly, to allow the clustering structures to be determined both by the individuals' similar patterns of responses and the effects of the covariates on the individuals' responses. We focus on clustering the rows of the data matrix, using the proportional odds cumulative logit model for ordinal data. We fit the models using the Expectation-Maximization algorithm and assess performance through a comprehensive simulation study. We also illustrate an application of the models.
|
High-Dimensional Covariance Estimation via Sparse Pairwise Likelihood
March 28, 2025, 12:00
Pairwise likelihood offers a practical approximation to the full likelihood function, enabling efficient inference for high-dimensional covariance models by combining marginal bivariate likelihoods. This approach simplifies complex dependencies and retains optimal statistical efficiency in certain models, such as the multivariate normal distribution, where pairwise and full likelihoods are maximized by the same parameter values. We propose a novel method for estimating sparse high-dimensional covariance matrices by maximizing a truncated pairwise likelihood function, which includes only terms corresponding to nonzero covariance elements. Truncation is achieved by minimizing the L2-distance between pairwise and full likelihood scores, coupled with an L1-penalty to exclude uninformative terms. Unlike traditional regularization techniques, our method focuses on selecting entire pairwise likelihood objects rather than shrinking individual parameters, preserving the unbiasedness of the estimating equations. Theoretical analysis demonstrates that the resulting estimator is consistent and achieves the same efficiency as the oracle maximum likelihood estimator, which assumes knowledge of the nonzero covariance structure, even as the dimensionality grows exponentially. Numerical experiments confirm the effectiveness of the proposed approach.
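The building block of the approach, the pairwise log-likelihood, is the sum of bivariate marginal log-densities over all variable pairs. A minimal sketch for the zero-mean multivariate normal case discussed in the abstract (the function name is illustrative):

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def pairwise_loglik(X, Sigma):
    """Pairwise log-likelihood of an n x p data matrix X under a zero-mean
    normal model: sum of bivariate normal log-densities over all pairs (i, j)."""
    p = X.shape[1]
    total = 0.0
    for i, j in combinations(range(p), 2):
        sub = Sigma[np.ix_([i, j], [i, j])]
        total += multivariate_normal.logpdf(X[:, [i, j]], mean=np.zeros(2), cov=sub).sum()
    return total
```

The truncation proposed in the talk amounts to dropping the pair terms whose covariance entry is zero, selecting whole pairwise objects rather than shrinking individual parameters.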
|
Extensions of Item Response Theory models for analysing complex phenomena
March 21, 2025, 12:00
Item Response Theory (IRT) provides a robust framework for analysing phenomena that cannot be directly observed. In such cases, inference relies on observable behaviours, such as responses to questionnaire items or survey indicators related to the phenomenon of interest. This talk will introduce key IRT concepts, with a particular focus on extensions suited for analysing complex phenomena. Specifically, the benefits of modelling the latent trait with a discrete distribution will be discussed: unlike traditional IRT models based on a continuous trait, this offers a more flexible representation of heterogeneous populations. Moreover, the talk will cover the concept of multidimensionality, which allows for the representation of multiple latent traits through a multivariate latent variable approach. This is particularly useful when analysing constructs that are inherently composed of several interrelated dimensions. The presentation will conclude with two real-world applications: assessing the risk profiles of healthcare contracting authorities in public procurement management and estimating educational poverty levels in Italy by integrating small area estimation with multidimensional IRT.
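For readers new to IRT, the canonical item response function is the two-parameter logistic (2PL) model, which the extensions in the talk build upon. A minimal sketch (illustrative only; the talk's models use a discrete and possibly multidimensional latent trait):

```python
import numpy as np

def irt_2pl(theta, a, b):
    """2PL item response function: probability of a correct (positive) response
    given ability theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# At theta = b the response probability is exactly 0.5;
# larger a makes the item discriminate more sharply around b.
```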
|
Environmental risk assessment via concomitant-variable multivariate penalized hidden semi-Markov models with autoregression
March 7, 2025, 14:00
Environmental risk assessment often requires modeling complex temporal processes influenced by multiple variables and characterized by environmental condition shifts. This work introduces a novel methodological framework based on concomitant-variable multivariate penalized hidden semi-Markov models (CV-MPHSMM) with autoregression to capture such dynamics. The proposed model extends traditional hidden semi-Markov models by integrating concomitant variables to account for external environmental factors influencing state transitions and sojourns, and by incorporating penalization techniques to enhance model interpretability and prevent overfitting in high-dimensional settings. Autoregressive components are included to model temporal dependencies within and between observed multivariate time series. Analytical expressions for multivariate risk measures are obtained under the CV-MPHSMM. The framework is applied to pollution, demonstrating its capacity to identify latent states, quantify transition probabilities, and detect environmental condition shifts. Simulation studies validate the robustness and flexibility of the proposed model in handling complex scenarios, while case studies highlight its practical utility in informing risk management strategies. The findings underscore the potential of CV-MPHSMMs with autoregression as a powerful tool for advancing environmental risk assessment and decision-making under uncertainty.
|
Linkages between AIR POLLUTION and CLIMATE CHANGE: MODELS and MEASUREMENTS
January 31, 2025, 12:00
Air pollution and climate change are closely linked: the chemical species that lead to a degradation in air quality are normally co-emitted with greenhouse gases. Thus, changes in one inevitably cause changes in the other. Air pollution and climate change are both threats to global population health and require a response that involves intersectoral policy and action. What are the tools commonly used to tackle these issues?
Since the 1990s, ENEA has established long-term observatories of the main climatic parameters in “hot spot” regions such as the Mediterranean and Antarctica. The Station for Climate Observations on the island of Lampedusa, an integrated research facility in the Mediterranean, has been collecting data for the study of climate change for more than 25 years.
Over the same years, ENEA has developed a chemical transport model to study air pollution, producing simulations both for present and future scenarios (up to the years 2030-2050) and for 3-day forecasts over Italy (4 km) and Europe (10 km).
During the seminar, an overview of the models developed, the measurements collected in these remote regions, and the activities currently carried out will be presented, discussing both long time series and specific studies, and focusing on the main challenges and uncertainties the scientific community is facing.
|
Language development in pre-school children: study designs and methods of analysis (In Italian)
January 24, 2025, 12:00
|