Wrapping onto a torus: handling multivariate circular data in the presence of outliers
December 16, 2022, 12:00
Abstract: Multivariate circular data arise commonly in many different fields, including the analysis of wind directions, protein bioinformatics, animal movements, handwriting recognition, people orientation, cognitive and experimental psychology, human motor resonance, neuronal activity, robotics, astronomy, biology, physics, earth science and meteorology. Observations can be thought of as points on a p-dimensional torus, whose surface is obtained by revolving the unit circle in a p-dimensional manifold. The peculiarity of multivariate torus data is periodicity, which is reflected in the boundedness of the sample space and often of the parameter space. The problem of modeling circular data has been tackled through suitable distributions, among which two of the most popular are the von Mises and the Wrapped Normal. Here, we focus on the family of unimodal and elliptically symmetric wrapped distributions, with emphasis on the Wrapped Normal. Despite the boundedness of the support of circular variates, torus data are not immune to the occurrence of outliers, that is, unexpected values, such as angles or directions, that do not share the main pattern of the bulk of the data. A robust procedure to fit a wrapped distribution is therefore presented. The proposed algorithm is characterized by the computation of data-dependent weights aimed at down-weighting anomalous values. We discuss and compare different approaches to obtaining weights, with particular attention to the weighted likelihood methodology. A formal outlier detection rule is also suggested, based on classical robust distances evaluated over unwrapped data. The attached flyer contains the abstract and the details for attending the seminar.
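As a small illustration of what "wrapping" means here, the sketch below maps real-valued angles (assumed to be in radians) onto the torus and computes the shortest signed difference between two angles. The function names are hypothetical and are not taken from the talk; this is not the robust fitting procedure itself.

```python
import math

TWO_PI = 2 * math.pi

def wrap(theta):
    """Map a real-valued angle (radians) to [0, 2*pi), up to floating-point rounding."""
    return theta % TWO_PI

def wrap_point(x):
    """Wrap each coordinate of a p-dimensional observation onto the torus."""
    return [wrap(v) for v in x]

def circular_diff(a, b):
    """Signed shortest difference a - b on the circle, in [-pi, pi)."""
    return (a - b + math.pi) % TWO_PI - math.pi
```

Because of periodicity, two angles that look far apart on the real line (for example 0.1 and 2π − 0.1) are close on the circle, which is why distances are computed with `circular_diff` rather than plain subtraction.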
|
Pseudo-populations resampling for finite populations under complex designs
December 2, 2022, 12:00
"Pseudo-populations resampling for finite populations under complex designs" Pier Luigi Conti, Dipartimento di Scienze Statistiche, Sapienza Università di Roma. Abstract: The present talk is devoted to resampling for finite populations when the sampling design is not simple. As a consequence of the complex sampling design, there is dependence among sampled units. Hence, the classical Efron bootstrap does not work in the case under examination. Resampling schemes based on pseudo-populations will be developed, and their main justifications and properties will be shown. The approach used is of asymptotic nature, and parallels results obtained by Bickel and Freedman for the i.i.d. case. The main applications of the theoretical results are devoted to the construction of confidence intervals for finite population parameters. Finally, computational issues will be discussed. The attached flyer contains the abstract and the details for attending the seminar in person or remotely.
|
Addressing dataset shift in supervised classification via data perturbation
November 25, 2022, 12:00
In supervised classification, dataset shift occurs when, for the units in the test set, a change in the distribution of a single feature, a combination of features, or the class boundaries is observed with respect to the training set. As a result, in real data applications, the common assumption that the training and testing data follow the same distribution is often violated. Dataset shift might be due to several reasons; the focus is on what is called “covariate shift”, namely the case in which the conditional probability p(y|x) remains unchanged, but the input distribution p(x) differs from training to test set. Random perturbation of variables or units when building the classifier can help in addressing this issue. Evidence of the performance of the proposed approach is obtained on simulated and real data.
|
Stein’s Method Meets Statistics: A Review of Some Recent Developments
November 18, 2022, 12:00
Stein’s method compares probability distributions through the study of a class of linear operators called Stein operators. While initially studied in the field of probability, Stein’s method has led to significant advances in theoretical statistics, computational statistics and machine learning in recent years. In this talk, I will present some of these recent developments and, in doing so, try to stimulate further research into the successful field of Stein’s method and statistics. The topics I shall discuss include (if time permits) new insights into the finite-sample approximation of estimators (like maximum likelihood estimators), a measure of the impact of the prior choice in Bayesian statistics, tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, parameter estimation and goodness-of-fit testing. This talk is based on a large collaborative effort with many co-authors.
|
On mixtures of linear quantile regressions for longitudinal and clustered data
November 11, 2022, 12:00
Quantile regression represents a well established technique for modelling data when
the interest is on the effect of predictors on the conditional response quantiles. When
responses are repeatedly collected over time, or when they are hierarchically nested,
dependence needs to be properly considered.
A standard way of proceeding is based on including higher level unit-specific random
coefficients in the model. The distribution of such coefficients may be either specified
parametrically or left unspecified. In the latter case, it can be estimated
nonparametrically by using a discrete distribution defined on G locations. This may
approximate the distribution of time-constant and/or time-varying random
coefficients, leading to a static, dynamic, or mixed-type mixture of linear quantile
regression equations.
An EM algorithm and a block-bootstrap procedure are employed to derive parameter
estimates and corresponding standard errors. Standard penalized likelihood criteria are
used to identify the optimal number of mixture components.
This class of models is described by using a benchmark dataset and employing the
functions in the newly developed lqmix R package.
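For readers unfamiliar with quantile regression, the sketch below shows the standard check (pinball) loss on which it is built, and verifies on a toy sample that minimising it recovers a sample quantile. It is only an illustration in Python with hypothetical function names; it is not the lqmix implementation and does not cover the mixture or random-coefficient part of the model.

```python
def check_loss(u, tau):
    """Check (pinball) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_check_loss(y, q, tau):
    """Sum of check losses of the residuals y_i - q."""
    return sum(check_loss(v - q, tau) for v in y)

def quantile_by_check_loss(y, tau):
    """Return the observed value that minimises the total check loss.

    A minimiser of the total check loss can always be found among the
    observed values, so a search over the sample is sufficient here.
    """
    return min(y, key=lambda q: total_check_loss(y, q, tau))
```

With `tau = 0.5` the loss is half the absolute error, so the minimiser is the sample median; other values of `tau` weight positive and negative residuals asymmetrically and target other quantiles.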
|
Causal effects of chemotherapy regimen intensity on survival outcome through Marginal Structural Models
November 4, 2022, 12:00
As patients develop toxic side-effects, cancer treatment is adapted over time
by either delaying or reducing the dosage of the next chemotherapy course.
In this talk, the use of Marginal Structural Models in combination with Inverse-Probability-of-Treatment Weighted estimators to assess the causal effects of
chemotherapy regimen modifications on survival outcome will be discussed.
The focus is on the use of actual treatment data and Received Dose Intensity
in contrast with the use of intended treatment regimen. The latter approach,
known as Intention to treat, is very common but also very far from the
everyday clinical practice. In this talk, I will discuss the confounding nature
of toxic side-effects data and show the damaging effect of not including
toxicity in the analysis.
The method developed is applied to the osteosarcoma randomised clinical
trials BO03 and BO06 (EORTC 80861 and 80931).
|
Density modelling with Functional Data Analysis
October 28, 2022, 12:00
Recent technological advances have eased the collection of large amounts of data in many
research fields. In this scenario, a useful statistical technique is density estimation which
represents an important source of information. One dimensional density functions represent a
special case of functional data subject to the constraints to be non-negative and with a constant
integral equal to one. Because of these constraints, density functions do not form a vector
space and a naive application of functional data analysis (FDA) methods may lead to invalid estimates. To address this issue, two main strategies can be found in the literature. In the
first, the probability density functions (pdfs) are mapped into a linear functional space through
a suitably chosen transformation. Established methods for Hilbert space valued data can be
applied to the transformed functions and the results are moved back into the density space by
means of the inverse transformation. In the second strategy, probability density functions are
treated as infinite-dimensional compositional data, since they are parts of a whole that
carry only relative information. In this work, by means of a suitable transformation, densities
are embedded in the Hilbert space of square integrable functions where standard FDA
methodologies can be applied.
|
The three-sigma rule to define antibody positivity: is it a beauty or a beast?
October 14, 2022, 12:00
Many epidemiological studies aim to estimate the proportion of
individuals currently or previously infected by a given microorganism.
Given that an infection inevitably leads to an immune response, this
estimation exercise often requires identifying individuals who reach a
minimal level of microbe-specific antibodies in their serum. This
threshold is invariably defined by the three-sigma rule: mean plus three
times the standard deviation of the hypothetical antibody-negative
population. Although not linked to a specific parametric
distribution, it has the most intuitive interpretation in the context of a
normal distribution. I will then discuss the problems of estimation bias
and apparent control of specificity arising from applying this rule to non-normal distributions for the seronegative population. I will use public data
on antibody testing against SARS-CoV-2 to illustrate these problems.
We should finally ask ourselves whether the three-sigma rule is a beautiful
statistical concept or, instead, a little beast hidden in antibody data
analysis.
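The rule described above can be sketched in a few lines of Python. The example computes the threshold as mean plus three standard deviations of a seronegative sample and then checks the empirical specificity (the share of negatives at or below the threshold). The simulated data, the seed, the log-normal choice and the function names are all assumptions made for illustration; they are not the public SARS-CoV-2 data used in the talk.

```python
import random
import statistics

def three_sigma_threshold(negatives):
    """Positivity cut-off: mean + 3 * standard deviation of the negative sample."""
    return statistics.mean(negatives) + 3 * statistics.stdev(negatives)

def empirical_specificity(negatives, threshold):
    """Fraction of negative samples classified as negative (value <= threshold)."""
    return sum(1 for v in negatives if v <= threshold) / len(negatives)

rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical antibody-negative populations: one normal, one right-skewed.
normal_neg = [rng.gauss(0.0, 1.0) for _ in range(20000)]
skewed_neg = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]

spec_normal = empirical_specificity(normal_neg, three_sigma_threshold(normal_neg))
spec_skewed = empirical_specificity(skewed_neg, three_sigma_threshold(skewed_neg))
print(f"normal: {spec_normal:.4f}  skewed: {spec_skewed:.4f}")
```

Under normality the rule corresponds to a specificity of about 0.9987. For a right-skewed negative population such as the log-normal one assumed here, the same rule yields a noticeably lower specificity (roughly 0.98 in theory for these parameters), which is the kind of discrepancy the abstract refers to.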
|
A general framework for implementing distances for categorical variables
June 17, 2022, 13:30
In many statistical methods, distance plays an important role. For instance, data visualization,
classification and clustering methods require quantification of distances among objects. How to
define such distance depends on the nature of the data and/or problem at hand. For distance
between numerical variables, in particular in multivariate contexts, there exist many definitions
that depend on the actual observed differences between values. It is worth underlining that often
it is necessary to rescale the variables before computing the distances. Many distance functions
exist for numerical variables. For categorical data, defining a distance is even more complex as
the nature of such data prohibits straightforward arithmetic operations. Specific measures
therefore need to be introduced that can be used to describe or study structure and/or relationships
in the categorical data. In this paper, we introduce a general framework that allows an efficient
and transparent implementation of distances between categorical variables. We show that several
existing distances (for example distance measures that incorporate association among variables)
can be incorporated into the framework. Moreover, our framework quite naturally leads to the
introduction of new distance formulations as well.
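As a point of reference, the simplest distance for categorical data is the simple matching distance: the share of variables on which two objects take different categories. The Python sketch below is only this baseline, with a hypothetical function name; it is not the general framework introduced in the talk and it ignores association among variables.

```python
def simple_matching_distance(a, b):
    """Proportion of categorical variables on which objects a and b differ."""
    if len(a) != len(b):
        raise ValueError("objects must be described by the same variables")
    if len(a) == 0:
        raise ValueError("at least one variable is required")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

# Two objects described by three categorical variables (colour, size, shape).
d = simple_matching_distance(("red", "small", "round"), ("red", "large", "round"))
```

Only equality of categories is used, which reflects the point made above that categorical data do not allow straightforward arithmetic operations.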
|
Model-assisted indirect small area estimation
May 27, 2022, 12:00
Generalised regression is the most common design-based model-assisted method for estimation
of population means and totals in practical survey sampling. However, it is often unacceptable
in the context of small area estimation, where one is interested in population means and totals for
a large number of areas (or domains) and the sample sizes are either small or non-existent in
many of them. In this seminar, we discuss an approach to extend generalised regression from
direct estimation for the whole population to indirect estimation of all the small area populations.
This requires trading variance off against bias and enables a practical methodology for estimation
at the different aggregation levels, which is coherent numerically (self-benchmarking) as well as
conceptually in terms of the design-based model-assisted inference outlook. Estimation can be
conducted by means of an *extended* weighting system that has as many sets of weights as the
number of small areas: each set produces the estimate for a domain mean of one or more survey
variables of interest and is, in this sense, multipurpose.
|
Extending the boundaries of a macroeconometric model for Italian economy to inequality
May 20, 2022, 12:00
According to the growing debate on the beyond-GDP approach, a strand of literature
explores how the traditional System of National Accounts (SNA), which is the pillar of
the GDP measurement, could be extended to account for some of the main themes
related to well-being and sustainability. In this presentation we extend the
macroeconometric model for Italy developed by Istat (MeMo-It) introducing an
inequality measure in the consumption function. Empirical analysis shows that a
positive income shock that increases aggregate consumption in the current year might
be completely off-set by the negative effect of the increase in inequality that becomes
effective in the next year. In this framework, the impact of the Italian “reddito di
cittadinanza”, a policy measure aiming at reducing poverty, has been evaluated.
Based on the results obtained, we support the idea that a step forward on well-being and sustainability could be made starting from a structural macroeconometric
approach.
|
Bayesian Statistics applied to Early “Oncology Drug Development”
May 13, 2022, 17:00
Oncology dose finding studies, in general, aim at determining the maximum tolerated
dose (MTD) reflecting the desire to treat patients who have limited options under the
assumption that higher drug doses will have better therapeutic activity. We describe
different methods (i.e. 3+3, mTPI, mTPI-2, and BLRM). This seminar will
feature speakers from Pfizer Inc., to share their insights and the recent statistical
innovations to address the challenges.
In addition to safety evaluation, assessing Early Sign of Efficacy (ESOE) is a critical step in all
early clinical programs to decide whether or not to extend the development of a molecule. Robust and
consistent calculation of the probability of making the right decision is critical.
Innovative methodologies are needed to optimize these calculations and ensure all
molecules are assessed in the same way across the oncology portfolio.
Case studies will be discussed for dose finding and the utilization of Bayesian statistics
in ESOE evaluation.
|
Bayesian Inference In High-dimensional Spatial Statistics: Conquering New Challenges
May 6, 2022, 17:00
Geographic Information Systems (GIS) and related technologies such as remote sensors, satellite
imaging and devices capable of collecting precise positioning information, even
portable hand-held ones, have spawned massive spatial-temporal databases.
Spatial "data science" broadly refers to the use of technology, statistical methods and computational
algorithms to extract knowledge and insights from spatially referenced data. Applications of
spatial-temporal data science are pervasive in the natural and environmental sciences; economics;
climate science; ecology; forestry; and public health. With the abundance of spatial BIG DATA
problems in the sciences and engineering, GIS and spatial data science will likely occupy a
central place in the data revolution engulfing us. This talk will discuss construction and
implementation of scalable Gaussian processes and the importance of conjugate Bayesian models
in carrying out Bayesian inference for spatially and temporally oriented massive data sets
exhibiting complex dependencies in diverse applications. We will elucidate recent developments
in Bayesian statistical science such as geosketching and predictive stacking that can harness high
performance scientific computing methods for spatial-temporal BIG DATA analysis and
emphasize how such methods can be implemented on modest computing architectures. The talk
will include specific examples of Bayesian hierarchical modeling in Light Detection and Ranging
(LiDAR) systems and other remote-sensed technologies; environmental sciences; and public
health.
|
Spatial and functional data over non-Euclidean domains
April 29, 2022, 12:00
Recent years have seen an explosive growth in the recording of increasingly
complex and high-dimensional data. Classical statistical methods are often unfit
to handle such data, whose analysis calls for the definition of new methods
merging ideas and approaches from statistics and applied mathematics. My talk
will in particular focus on spatial and functional data defined over non-Euclidean
domains, such as linear networks, two-dimensional manifolds and non-convex
volumes. I will present an innovative class of methods, based on regularizing
terms involving Partial Differential Equations (PDEs), defined over the complex
domains being considered. These physics-informed regression methods enable
the inclusion in the statistical model of the available problem specific information,
suitably encoded in the regularizing PDE. The proposed methods make use of
advanced numerical techniques, such as finite element analysis and isogeometric
analysis. A challenging application to neuroimaging data will be illustrated.
|
Factor models with downside risk
April 22, 2022, 12:00
We propose a conditional model of asset returns in the presence of common
factors and downside risk. Specifically, we generalize existing latent factor
models in three ways: we show how to estimate the threshold which identifies
the 'disappointment' event triggering the bad state of the world; we permit
different factor structures for asset returns in good and bad states; we show
how to recover the observable factors' risk premia from the estimated latent
ones in different states. The usefulness of the model is illustrated through two
applications to cross-sections of asset returns in equity markets and other
major asset classes.
Paper link
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3937321
|
Marine litter in the North-Western Ionian sea – Data features and space-time modeling
April 8, 2022, 12:00
Marine litter has recently become a recognized global ecological concern, and its
distribution and impacts on deep-sea habitats are under continuous investigation.
Here we focus on marine litter data collected as a by-product of trawl fishery surveys
regularly conducted at a local scale in the Mediterranean. Litter data are multivariate,
have space-time structure, and are semi-continuous, i.e. they combine information on
occurrence and conditional-to-presence abundance. Data on potential environmental
drivers obtained by remote sensing or GIS technologies are also available with
different spatial support. The modeling strategy is based on a two-part model that
enables handling the zero-inflation problem and the spatial correlation characterizing
the data. In the spirit of multi-species distribution models, we propose to jointly infer
different litter categories in a Hurdle-model framework. The effects of potential
environmental drivers and shared spatial effects linking abundances and probabilities
of occurrences of litter categories are implemented via the SPDE approach in the
computationally efficient INLA context. Results support the possibility of better
understanding the spatio-temporal dynamics of marine litter in the study area.
|
How much evidence do you need? Data Science and Bayesian Statistics to inform Environmental Policy during the COVID-19 Pandemic
April 4, 2022, 14:00
In this talk, I will provide an overview of data science methods, including methods
for Bayesian analysis, causal inference, and machine learning, to inform
environmental policy. This is based on my work analyzing a data platform of
unprecedented size and representativeness. The platform includes more than 500
million observations on the health experience of over 95% of the US population older
than 65 years, linked to air pollution exposure and several confounders. Finally, I
provide an overview of studies on air pollution exposure, environmental racism,
wildfires, and how they also can exacerbate the vulnerability to COVID-19.
Press Coverage
• https://www.nytimes.com/2021/08/13/climate/wildfires-smoke-covid.html
• https://www.nytimes.com/2020/04/07/climate/air-pollution-coronavirus-covid.html
• https://www.nytimes.com/2020/12/07/climate/trump-epa-soot-covid.html?smid=tw-share
• https://science.sciencemag.org/content/360/6388/473
• https://www.npr.org/sections/health-shots/2017/06/28/534594373/u-s-air-pollution-still-kills-thousands-every-year-study-concludes
• https://www.statnews.com/2016/11/14/climate-change-agreements/
• https://news.harvard.edu/gazette/story/2016/08/smoke-waves-will-affect-millions-in-coming-decades/
• https://sites.sph.harvard.edu/francesca-dominici/senator-cory-booker-talking-about-nejm-study/
|
Measures of Interrater Agreement
March 25, 2022, 12:00
Agreement among ratings or measurements provided by several raters (humans or
devices) is considered in education, biomedical sciences, and other disciplines. For
instance, the agreement among ratings of educators who assess on a new rating scale
the language proficiency of a corpus of argumentative texts is considered to test
reliability of the scale, or the agreement among clinical diagnoses provided by
physicians is analysed for identifying the best treatment for the patient. In all these
applications, the main interest is to analyse interrater absolute agreement, that is, the
extent to which raters assign the same (or very similar) values on the rating scale. Many
indices of interrater agreement on a whole group of subjects (objects) have been
proposed. Less frequently agreement on single subjects has been considered, in spite
of the fact that this is useful, for example, to ask the raters for a specific
comparison on single cases in which agreement is poor. In the seminar, after a critical
review of the most used indices of interrater agreement, new subject-specific and
global measures of absolute agreement for ratings on different levels of scale are
presented. Some applications will show the advantages of the indices proposed.
|
Unsupervised whole graph embedding methods and applications
March 18, 2022, 12:00
Networks represent a powerful model for problems in different scientific and
technological fields, such as neuroscience, molecular biology, biomedicine,
sociology, social network analysis, and political science. As the number of network
applications increases, so does the need for novel data analysis techniques. In many
applications, the analysis focuses on a single network to cluster or classify its nodes
or predict pairs of nodes that will form a link. In this talk, we focus on problems where
a network is a statistical unit, and the analysis regards whole networks rather than their
parts.
Methods for learning features on networks focus mainly on the neighborhood of nodes
and edges. We review some of the existing methodologies and introduce Netpro2vec,
an embedding framework that relies on representations of graphs built from empirical
probability distributions. The goal is to use basic node descriptions other than the
degree, such as those induced by the Transition Matrix and Node Distance
Distribution, to describe the local and global characteristics of the networks. The
framework is evaluated on synthetic and real biomedical network datasets and
compared to well-known competitors. Finally, open problems and future research
directions are highlighted.
|
Multimodal regression with circular data
March 4, 2022, 12:00
There is a diverse range of practical situations where one may encounter
random variables which are not defined on Euclidean spaces, as is the case
for circular data. Circular measurements may be accompanied by other
observations, either defined on the unit circumference or on the real line, and
in such cases it may be of interest to model the relationship between the
variables from a regression perspective. It is not infrequent that parametric
models fail to capture the underlying model given their lack of flexibility,
but it may also happen that the usual paradigm of (classical) mean regression is inadequate.
We will present in this talk some recent advances in nonparametric
multimodal regression, showing an adaptation of the mean-shift algorithm
for regression scenarios involving circular response and/or covariate. Real
data illustrations will also be presented. This is joint work with María
Alonso-Pena.
|