DATA SCIENCE

Delivered study plan 2021/2022

EDUCATIONAL OFFER - ACADEMIC YEAR 2021/2022

Students must attend courses 1-4 and pass the exam for at least 12 CFU of the following:

1. March 2022 - Digital Epidemiology [3 CFU]
Organizers: Ciro Cattuto, Sebastiano Filetti, Stefano Leonardi
Detailed program at this link. The lectures will be held online on the Zoom platform.

Prof. Paolo Villari (Sapienza): Study design in epidemiology
7 March 2022 (09:00-10:30; 11:00-12:30)
Abstract: This introductory lecture intends to outline the methodology of the main epidemiological study models. The methods for quantitatively describing health phenomena, studying the associations between risk factors and diseases, and assessing the effectiveness and safety of health interventions will be illustrated starting from the classical objectives of epidemiology (descriptive, analytical and experimental). Frequency measures (prevalence and incidence) and the main measures of association will be covered. The potential and limitations of different epidemiological approaches will be analyzed by reviewing epidemiological studies published in the literature, such as the clinical trials that have documented the efficacy and safety of vaccines against COVID-19. The approach of systematic literature reviews and meta-analyses to synthesize scientific evidence on the efficacy of health interventions will also be illustrated as an introduction.
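As a minimal illustration of the frequency and association measures named in the abstract, the sketch below computes prevalence, cumulative incidence, relative risk and odds ratio from a 2x2 exposure/outcome table. The function names and all counts are hypothetical, chosen only for illustration; they are not taken from the lecture material.

```python
def prevalence(existing_cases, population):
    """Proportion of the population that has the condition at a given time."""
    return existing_cases / population


def cumulative_incidence(new_cases, population_at_risk):
    """Proportion of an initially disease-free population that develops the condition."""
    return new_cases / population_at_risk


def relative_risk(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of disease in the exposed divided by odds in the unexposed."""
    return (a * d) / (b * c)


# Hypothetical cohort: 100 exposed and 100 unexposed individuals.
a, b, c, d = 30, 70, 10, 90
print("prevalence:", prevalence(40, 200))
print("cumulative incidence (exposed):", cumulative_incidence(a, a + b))
print("relative risk:", relative_risk(a, b, c, d))
print("odds ratio:", odds_ratio(a, b, c, d))
```

With these counts the odds ratio is larger than the relative risk, which is the usual pattern when the outcome is not rare.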

Profs. Daniela Paolotti (ISI Foundation), Caterina Rizzo (OPBG): Digital Disease Detection
8 March 2022 (09:00-10:30; 11:00-12:30)
Abstract:
The pervasiveness of Web and mobile technologies as well as the growing adoption of smart wearable sensors have significantly changed the landscape of epidemic intelligence data gathering with an unprecedented impact on global public health. In this double presentation, we will show how disease surveillance in public health works and how digital technologies have changed the way we monitor diseases.

Prof. Luca Ferretti (Oxford): Digital Contact Tracing and Exposure Notification
14 March 2022 (09:00-10:30; 11:00-12:30)
Abstract:
Digital approaches have the potential to improve non-pharmaceutical interventions. One of the most innovative approaches proposed during the current pandemic has been Digital Contact Tracing, i.e. a faster and more effective approach to record proximity between individuals and notify individuals who are likely to have been exposed to a pathogen due to their proximity to an infected individual. In practice, many countries have implemented Exposure Notification apps based on privacy-preserving proximity tracing via Bluetooth Low Energy. In this lecture, I will present an overview of Digital Contact Tracing in terms of technology and public health. I will show how it can improve Test-Trace-Isolate strategies and discuss its requirements and the epidemiological evidence for its impact. I will also present some of the lessons learned from a few successful implementations of the technology and many failed ones. Finally, I will discuss its potential in terms of epidemic management and epidemic surveillance.

Prof. Ciro Cattuto (Università di Torino): Human Proximity: from measurement to models and interventions
15 March 2022 (09:00-10:30; 11:00-12:30)

Profs. Caterina Rizzo (OPBG), Daniela Paolotti (ISI Foundation): What we talk about when we talk about vaccine hesitancy
21 March 2022 (09:00-10:30; 11:00-12:30)
Abstract:
Vaccination hesitancy has been an important public health issue even before COVID-19. Studies have found that vaccine hesitancy is a complex, multi-faceted phenomenon that needs to be addressed with an interdisciplinary methodology. In particular, social media can drive this phenomenon through vocal influencers but can also help uncover the unknown determinants behind this global phenomenon. In this double presentation, we will present some studies aimed at leveraging social media platforms to address the problem of vaccine hesitancy with a European and Italian perspective.

Prof. Giorgio Guzzetta (Fondazione Bruno Kessler): Bayesian reconstruction of transmission chains from epidemiological surveillance and contact tracing data
22 March 2022 (09:00-10:30; 11:00-12:30)
Abstract:
The information on transmission chains (who infected whom and when during an epidemic outbreak) provides precious insights on transmission heterogeneities and on critical quantities of epidemic dynamics, such as the distribution of generation times and transmission distances. In practice, epidemiological investigations can determine where and from whom an individual was infected only in rare cases. Bayesian inference models can exploit the spatial and/or temporal structure in epidemiological data to probabilistically reconstruct transmission chains and infer statistical properties of the transmission dynamics. We will show the general principles of these models and some applications to geo-referenced data from mosquito-borne surveillance and to contact tracing data from Hepatitis A Virus and SARS-CoV-2 outbreaks.

2. April 2022 - Deep Learning Seminars [3 CFU]
Organizers: Simone Scardapane, Fabrizio Silvestri
Detailed program at this link. The lectures will be held online on the Zoom platform.

The three seminars will cover several advanced topics in deep learning: meta learning (i.e., “learning to learn”), continual learning (i.e., learning from a continuous stream of tasks), and data engineering for deep learning (i.e., preparing data for being used in deep learning pipelines).

Prof. Fabrizio Silvestri (Sapienza University): Meta-learning
20 April 2022 (9:00-13:00)
Abstract:
Learning is usually seen as a method to extract patterns from data, and learn associations among these patterns and dependent variables (labels, responses, etc.). The input to a Machine Learning algorithm is usually a set of data points X along with labels Y and the goal is to learn a function f: X -> Y that associates labels with each data sample in X. Meta-learning is a fundamental paradigm shift where the input to the algorithm is not a set of data points but, instead, a set of "tasks". The goal is to learn how to efficiently learn a model for a new task starting from an existing model that has been trained on previous tasks. In this lecture, we will review the basics of meta-learning, as an introduction for students who are new to the topic.

Dr. Vincenzo Lomonaco (Pisa University): Introduction to Continual learning
21 April 2022 (9:00-13:00)
Abstract:
Learning continually from non-stationary data streams is a long-standing goal and a challenging problem in machine learning research. Naively fine-tuning prediction models only on the newly available data often incurs catastrophic forgetting or interference: a sudden erasure of all previously acquired knowledge. On the other hand, re-training prediction models from scratch on the accumulated data is not only inefficient but possibly unsustainable in the long term and wherever fast, frequent model updates are necessary. In this lecture we will discuss recent progress and trends in making machines learn continually through architectural, regularization and replay approaches.
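To make the "replay" family of approaches mentioned above concrete, here is a minimal sketch of a fixed-size replay memory filled by reservoir sampling, so that every example seen so far has the same chance of being kept. The class and function names (ReservoirReplayBuffer, mixed_batch) and the toy task stream are hypothetical; this is one common way to build a replay memory, not necessarily the method presented in the lecture.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size memory holding a uniform random sample of the stream seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered to the buffer
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Reservoir sampling: keep the new example with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples, buffer, replay_size):
    """Combine new examples with replayed old ones, then store the new ones."""
    batch = list(new_examples) + buffer.sample(replay_size)
    for example in new_examples:
        buffer.add(example)
    return batch


buffer = ReservoirReplayBuffer(capacity=5)
batch_task1 = mixed_batch([("task1", i) for i in range(10)], buffer, replay_size=3)
batch_task2 = mixed_batch([("task2", i) for i in range(4)], buffer, replay_size=3)
print(len(batch_task1), len(batch_task2), len(buffer.items))
```

Training on batch_task2 would then update the model on the four new task-2 examples together with three stored task-1 examples, which is the basic mechanism replay methods use to limit forgetting.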

Dr. Andreas Damianou (Spotify): Working with data in industrial machine learning applications
27-28 April 2022 (9:00-13:00)
Abstract:
Data is a crucial aspect of today's machine learning workflows. Over the last few years machine learning (ML) methods, such as deep learning, have been made more and more efficient when it comes to using large volumes of data. This, in turn, has contributed to the numerous remarkable successes of modern ML. However, the practical application of ML in industry comes with a variety of problems and considerations which are often related to data. Indeed, real-world data are incomplete, noisy, sensitive, biased and ever changing, making it hard to train reliable ML models on them. At the same time, productionizing a ML model also means productionizing the associated data pipelines, and issues like data versioning and publishing come into play. In this lecture series I will give an overview of the important role of data in an industrial ML setting, ranging from raw data collection to practical feature transformation and engineering. Further, we will dive deep into a variety of data-related issues (and solutions) for ML, ranging from data cleaning and versioning, to scalability, bias and privacy.

3. May 2022 - Cultural Analytics [3 CFU]
Organizers: Carlos Castillo, Giorgio Barnabò
Detailed program at this link.

Prof. Diego Saez-Trumper (Wikimedia): Wikipedia beyond the encyclopedic value
9 May 2022 (10:30-13:30, room B2, DIAG), 10 May 2022 (09:00-13:00, Aula Magna, DIAG)
Abstract:
In this lecture, we are going to take Wikipedia (and sister projects) as an object of study, as well as a large data repository. We are going to learn the basics of how Wikipedia works and the data that is produced and shared as a result of Wikipedians' contributions and interactions. We will review a set of tools for consuming and processing this data, and also discuss some problems that can be solved using Wikipedia data as well as some open questions on this field.

Profs. Chris Danforth, Peter Dodds (Vermont University): Measuring the happiness, health, & stories of society through the sociotechnical dynamics of social media and fiction
24 May 2022 (15:00-17:00, online)
Abstract:
This talk will describe a suite of physically inspired instruments we've developed to enable the exploration of large-scale text data, illuminate collective behavioral patterns, and develop a science of stories. Along with our flagship efforts at http://hedonometer.org and https://storywrangling.org, we show how Instagram photos reveal markers of depression prior to formal diagnosis, and Twitter topic dynamics ranked Trump as being more popular than God throughout his presidency. Finally, we present evidence in support of a hypothesis posed by author Kurt Vonnegut, namely that there are only a few emotional arcs (or modes) exhibited by the vast majority of works of fiction.

Prof. Melanie Walsh (University of Washington): Goodreads: a computational Study
27 May 2022 (16:00-18:00, online)
Abstract:
This lecture examines how Goodreads users define, discuss, and debate "classic" literature by computationally analyzing and close reading more than 120,000 user reviews. We begin by exploring how crowdsourced tagging systems like those found on Goodreads have influenced the evolution of genre among readers and amateur critics, and we highlight the contemporary value of the "classics" in particular. We identify the most commonly tagged "classic" literary works and find that Goodreads users have curated a vision of literature that is less diverse, in terms of the race and ethnicity of authors, than many U.S. high school and college syllabi. Drawing on computational methods such as topic modeling, we point to some of the forces that influence readers’ perceptions, such as schooling and what we call the classic industry - industries that benefit from the reinforcement of works as classics in other mediums and domains like film, television, publishing, and e-commerce (e.g., Goodreads and Amazon). We also highlight themes that users commonly discuss in their reviews (e.g., boring characters) and writing styles that often stand out in them (e.g., conversational and slangy language). Throughout the essay, we make the case that computational methods and internet data, when combined, can help literary critics capture the creative explosion of reader responses and critique algorithmic culture’s effects on literary history.

Prof. Nello Cristianini (Bristol University): Media Content Analysis and Culturomics
30 May 2022 (09:30-13:00, Aula Magna, DIAG); 31 May 2022 (09:00-13:00, room B2, DIAG)
Abstract:
We will review case studies where various types of textual content have been used to reveal insights about cultural aspects of society, as well as the origins of this method. This will include various studies with social media, newspapers, and historical newspapers; it will include studies of UK, US, Italian, and Slovenian historical newspapers. Some attention will also be devoted to the analysis of Wikipedia access and product sales, and book content. The general techniques will be based on simple statistics, but some of the work will involve NLP tools, such as parsers, in order to generate network data. We will also review the cultural roots of the methodology, as it is applied today. Time permitting, we will also address studies aimed at identifying which cultural biases a machine can absorb from the data, giving a new perspective on an old problem in archival science, that is the problem of bias in archival content. Problems facing computer scientists involved in cultural analytics are not traditionally seen in other parts of computer science and are better understood in the humanities.

4. June 2022 - Theory and Practice of Deep Learning [3 CFU]
Organizers: Fabrizio Silvestri, Michael Bronstein
Detailed program at this link.

Prof. Petar Veličković (DeepMind & Cambridge University): Graph Neural Networks: Geometric, Structural and Algorithmic Perspectives
13 June 2022 (09:00, Aula Magna, DIAG)
Abstract:
Recent years have seen a surge in research on graph representation learning, including techniques for deep graph embeddings, generalizations of CNNs to graph-structured data, and neural message-passing approaches. These advances in graph neural networks (GNNs) and related techniques have led to new state-of-the-art results in numerous domains: chemical synthesis, vehicle routing, 3D-vision, recommender systems, question answering, continuous control, self-driving and social network analysis. Accordingly, GNNs regularly top the charts on fastest-growing trends and workshops at virtually all top machine learning conferences. In this series of lectures, I will attempt to provide several "bird’s eye" views on GNNs. Following a quick motivation on the utility of graph representation learning, I will derive GNNs from first principles of permutation invariance and equivariance. We will discuss how we can build GNNs that are not strictly reliant on the input graph structure, and how we can categorise their expressive power using graph isomorphism testing. Finally, we will explore an emerging connection between GNNs and classical algorithms, and demonstrate how we successfully used this connection to power mathematical discovery (a milestone which recently graced the cover of Nature).
The talk will be geared towards a generic computer science audience, though some basic knowledge of machine learning with neural networks will be a useful prerequisite. The content is inspired by my ongoing work on the categorisation of geometric deep learning, alongside Joan Bruna, Michael Bronstein and Taco Cohen.
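The permutation equivariance and invariance principles mentioned in the abstract can be checked numerically. The sketch below assumes one simple sum-aggregation message-passing layer (a hypothetical gnn_layer with random weights, not the specific architectures covered in the lectures) and verifies that relabelling the nodes permutes the node outputs in the same way, while a sum readout over nodes is unchanged.

```python
import numpy as np


def gnn_layer(A, X, W_self, W_neigh):
    """One message-passing layer: h_i = relu(x_i W_self + sum_{j in N(i)} x_j W_neigh)."""
    return np.maximum(0.0, X @ W_self + A @ X @ W_neigh)


rng = np.random.default_rng(0)
n, d_in, d_out = 5, 3, 4

# Random undirected graph without self-loops (symmetric 0/1 adjacency matrix).
upper = np.triu((rng.random((n, n)) < 0.4).astype(float), k=1)
A = upper + upper.T
X = rng.normal(size=(n, d_in))          # node features
W_self = rng.normal(size=(d_in, d_out))
W_neigh = rng.normal(size=(d_in, d_out))

# Relabel the nodes with a random permutation matrix P.
P = np.eye(n)[rng.permutation(n)]

H = gnn_layer(A, X, W_self, W_neigh)
H_perm = gnn_layer(P @ A @ P.T, P @ X, W_self, W_neigh)

# Equivariance: permuting the input permutes the node-level output identically.
equivariant = np.allclose(H_perm, P @ H)
# Invariance: a sum readout over nodes does not depend on node order.
invariant = np.allclose(H_perm.sum(axis=0), H.sum(axis=0))
print(equivariant, invariant)
```

The check works because the layer only uses the adjacency matrix through the product A @ X, and the nonlinearity is applied element-wise, so both commute with any relabelling of the nodes.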

Prof. Alfredo Canziani (NYU Courant Institute): Energy-Based Models
14 June 2022 (09:00, Aula Magna, DIAG)
Abstract:
Energy-Based Models (EBMs) provide a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, several manifold learning methods, and joint embedding methods. EBMs capture dependencies between variables by associating a scalar energy to each configuration of the input variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimise the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones.
We will start this tutorial by introducing the EBM terminology by revisiting the training of a simple classifier in terms of shaping its energy function. We will then introduce latent variables (LVs) for modelling the unpredictable component of a given phenomenon and learning one-to-many relationships. Finally, we’ll cover classical examples of architectural, regularised, and contrastive training techniques for EBMs and LV-EBMs.
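As a rough sketch of the "classifier as energy function" view described above, the example below defines the energy of a class y for input x as the negative linear score, performs inference by clamping x and picking the class of minimum energy, and trains with a negative log-likelihood loss that lowers the energy of observed pairs relative to the others. The linear model, the toy data and all function names are assumptions made for illustration, not the tutorial's own code.

```python
import numpy as np


def energy(W, x):
    """Energies of all classes for input x: E(x, y) = -(W x)_y. Lower means more compatible."""
    return -(W @ x)


def infer(W, x):
    """Inference: clamp x and return the class with minimum energy."""
    return int(np.argmin(energy(W, x)))


def nll_loss(W, x, y):
    """E(x, y) + log sum_y' exp(-E(x, y')), computed in a numerically stable way."""
    scores = -energy(W, x)
    m = scores.max()
    return energy(W, x)[y] + m + np.log(np.exp(scores - m).sum())


def nll_grad(W, x, y):
    """Gradient of nll_loss with respect to W: (softmax(scores) - onehot(y)) x^T."""
    scores = -energy(W, x)
    p = np.exp(scores - scores.max())
    p /= p.sum()
    p[y] -= 1.0
    return np.outer(p, x)


# Toy data: three 2-D points, one per class (hypothetical values).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
Y = [0, 1, 2]

W = np.zeros((3, 2))
loss_before = sum(nll_loss(W, x, y) for x, y in zip(X, Y))
for _ in range(200):
    for x, y in zip(X, Y):
        W -= 0.5 * nll_grad(W, x, y)  # plain stochastic gradient step
loss_after = sum(nll_loss(W, x, y) for x, y in zip(X, Y))

predictions = [infer(W, x) for x in X]
print(predictions, loss_before, loss_after)
```

Each gradient step pushes down the energy of the observed (x, y) pair and pulls up the energies of the other classes in proportion to their current probability.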

Alessandra Piktus (HuggingFace): Knowledge-intensive NLP and retrieval augmentation
15 June 2022 (09:00, Aula Magna, DIAG)
Abstract:
With the advent of large-scale, transformer-based language models such as BERT and GPT, we have witnessed unprecedented progress on many NLP tasks. Common benchmarks testing natural language inference, paraphrase detection or closed-book question answering saw submissions approaching or exceeding human performance. Yet, NLP is far from being a solved problem, and the ability to reliably access and utilise knowledge, be it common sense or factual, emerges as a consistently challenging problem.
In this session, we will take a closer look at knowledge-intensive NLP (KI-NLP) tasks. First, we will go over examples of KI-NLP datasets, and analyse the challenges and limitations of their positioning. We will then provide an overview of common approaches to modelling such tasks, contextualising them with respect to both classical information retrieval and state-of-the-art, billion-parameter-scale language modelling. We will then use the RAG model as an example to guide us through the process of building a typical retriever-reader architecture. Finally, we will glimpse into a new exciting path of research exploring the concept of generative retrieval for KI-NLP.

Prof. Sanjeev Arora (Princeton University): Rethinking "optimization" in deep learning
16 June 2022 (15:00, Aula Magna, DIAG)
Abstract:
The talk will focus on recent works, showing ways in which traditional optimization analyses are a bad match for deep learning phenomena for two reasons: (a) Traditional analyses of gradient descent rely upon an inequality by which the learning rate is set using the smoothness of the loss. This is violated in deep learning losses (Li et al. 2020, Cohen et al. 2021). (b) Traditional analysis treats the cost/loss function as a black box and sets the goal as finding any solution of low cost. It is increasingly clear that the cost of the solution does not capture its goodness completely because two solutions of the same cost can have very different performances on held-out data. Instead, the exact trajectory taken by gradient-based optimization has a big effect on the quality of the solution.
The talk will introduce these surprising phenomena and how new theory has been developed in the past few years to understand them.
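For readers unfamiliar with point (a), the classical link between learning rate and smoothness can be seen on a one-dimensional quadratic. The sketch below assumes the textbook condition that gradient descent on an L-smooth quadratic converges only for learning rates below 2/L; the function f(w) = 0.5 * L * w^2, the constant L = 4 and the chosen step sizes are illustrative assumptions. It shows only the classical threshold itself, not the deep-learning behaviour discussed in the talk.

```python
def gradient_descent(grad, w0, lr, steps):
    """Plain gradient descent: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


L = 4.0  # smoothness constant of f(w) = 0.5 * L * w**2


def grad(w):
    return L * w


# Each step multiplies w by (1 - lr * L), so the iterates shrink only if lr < 2 / L.
w_stable = gradient_descent(grad, w0=1.0, lr=1.5 / L, steps=50)    # factor -0.5 per step
w_unstable = gradient_descent(grad, w0=1.0, lr=2.5 / L, steps=50)  # factor -1.5 per step

print(abs(w_stable), abs(w_unstable))
```

Below the threshold the iterate contracts towards the minimum at zero; above it the same update rule oscillates with growing amplitude.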

5. February - June 2022 - The Copernicus green revolution for sustainable development - Politecnico di Milano [5 CFU]
Organizer: Prof. Maria Brovelli
Detailed program at this link.

Speakers:
Maria Brovelli (DICA-Polimi, 1 CFU)
Daniele Oxoli (DICA-Polimi, 1 CFU)
Marco Gianinetto (DABC-Polimi, 1 CFU)
Branka Cuca (DABC-Polimi, 1 CFU)
Andrea Monti-Guarnieri (DEIB-Polimi, 1 CFU)

EDUCATIONAL OFFER - ACADEMIC YEAR 2020/2021

1. Machine learning in production (May 5-7-12-14 2021) [3CFU]
https://www.sscardapane.it/teaching/reproducibledl/
(remote)
- Simone Scardapane (Sapienza)

2. Neural Information Retrieval and NLP (May 20-21-27-28 2021) [3CFU]
- Nicola Tonellotto (Univ. of Pisa), Fabrizio Silvestri (Sapienza)
21/05 from 9.30 to 13.30 - Practicum from 15.30 to 17.30
22/05 from 10.30 to 13.30 - Practicum from 15.30 to 17.30
27/05 from 9.30 to 13.30 - Practicum from 15.30 to 17.30
28/05 from 9.30 to 13.30 - Practicum from 15.30 to 17.30
Abstract:
Advances from the natural language processing community have recently sparked a renaissance in the task of ad-hoc search. In particular, large contextualized language modeling techniques, such as BERT, have equipped ranking models with a far deeper understanding of language than previous bag-of-words (BoW) models. Applying these techniques to a new task is tricky, requiring knowledge of deep learning frameworks and significant scripting and data munging. In this course, we provide background on classical (e.g., BoW), modern (e.g., Learning to Rank), and contemporary (e.g., doc2query) search ranking and re-ranking techniques. We also introduce students to the Transformer architecture, showing how it is used in foundational aspects of modern large language models (e.g., BERT). Going further, we detail and demonstrate how these techniques can be easily applied experimentally to new search tasks in a new declarative style of conducting experiments, exemplified by the PyTerrier and OpenNIR search toolkits.

3. Data Science Innovation in Diabetes (June - July 2021) [3CFU]
- Marianna Maranghi (Sapienza), Data Science Innovation in Diabetes (1 day webinar):
"Use of big data: what potential? Which methodology? The diabetic patients data model as an example"
- Ciro Cattuto (Univ. of Torino), Daniela Paolotti (ISI Torino): Digital Epidemiology on the field (3 CFU)

4. Learning in Games, Markets, and Sequential Decision Making (September 6-10 2021) [3CFU]
Speakers:
Prof. Nicolò Cesa-Bianchi (Università degli Studi di Milano)
Prof. Jose R. Correa (Universidad de Chile)
Prof. Michal Feldman (Tel-Aviv University)
Renato Paes Leme (Research Scientist at Google Research New York)
Prof. Eva Tardos (Cornell University).

EDUCATIONAL OFFER - ACADEMIC YEAR 2019/2020

1. Data Science for Humanities [3CFU]
- Recognizing emotions: a dynamic journey between artworks and memes (February 10th-11th 2020)
Prof. Davide Nadali (La Sapienza) and Antonella Sbrilli (La Sapienza)
- Visit to the Museo dell’Arte Classica of Sapienza (February 18th 2020)
- In Codice Ratio: towards Knowledge Discovery from Medieval Manuscripts (February 24th 2020)
Prof. Paolo Merialdo (Università di Roma Tre). Speakers: Donatella Firmani (Università di Roma Tre), Elena Nieddu (Università di Roma Tre), Francesco Cavina
- Deciphering ancient scripts through digitization and machine learning (February 25th 2020)
Prof. Silvia Ferrara (Università di Bologna) and Fabio Tamburini (Università di Bologna)

2. Computational and Statistical Methods of Data Reduction [3CFU]
- Robust Statistics for Data Reduction (March 9th - 10th 2020, 09:00-13:00, Aula B203) SUBSTITUTED BY "Computational methods: sampling and inferential issues", Prof. Serena Arima (Università del Salento), MAY 11th - 12th 09:00-13:00
- Dimensionality Reduction in Clustering and Streaming (March 16th - 17th 2020, 09:00-13:00, Aula B203) RESCHEDULED FOR MAY 14th - MAY 15th 09:00-13:00
Prof. Chris Schwiegelshohn (La Sapienza)

3. Learning in Games, Markets, and Sequential Decision Making RESCHEDULED FOR FALL 2021 [3CFU]
- Learning good equilibria in repeated games (May 11th - 12th 2020)
Prof. Tim Roughgarden (Columbia University)
- Online learning through multi-armed bandit models (May 13th - 14th 2020)
Prof. Nicolò Cesa-Bianchi (Università degli Studi di Milano)
- Learning prices and equilibria in markets (May 15th 2020)
Prof. Nicola Gatti (Politecnico di Milano), Prof. Stefano Leonardi (La Sapienza)

4. Data Science for Geographical Information System Applications - Improving Location of Collaborative Mapping Apps [3CFU]
- Citizen Science and VGI, (May 20th 2020, 14.00 - 17.00)
Prof. Maria Antonia Brovelli (Politecnico Milano)
- Geopaparazzi e SMASH (May 20th 2020, 17.00 - 18.00)
Prof. Maria Antonia Brovelli (Politecnico Milano), Silvia Franceschi
- OpenStreetMap: how to contribute and use its data, (May 21st 2020, 09.00 - 12.00)
Prof. Maria Antonia Brovelli (Politecnico Milano)
- Epicollect (May 21st 2020, 12.00 - 13.00)
Prof. Maria Antonia Brovelli and Daniele Oxoli (Politecnico Milano)
- QField (May 21st 2020, 14.00 - 15.00)
Prof. Maria Antonia Brovelli and Berk Anbaroglu (Politecnico Milano)
- LandslideSurvey, (May 21st 2020, 15.00 - 16.00)
Prof. Maria Antonia Brovelli and Edoardo Pessina (Politecnico Milano)
- Global Navigation Satellite Systems (GNSS) fundamentals (May 27th 2020, 14.00 - 18.00)
Prof. Augusto Mazzoni (Sapienza)
- From GNSS raw measurements to precise positions (theory and on field survey) (May 28th 2020, 09.00 - 13.00)
Prof. Augusto Mazzoni (Sapienza)
- From GNSS raw measurements to precise positions (data processing) (May 28th 2020, 14.30 - 16.30)
Prof. Augusto Mazzoni (Sapienza)

5. Network Science and Machine Learning Methods for Health and Medicine (in collaboration with the PhD program in Innovative Biomedical Technologies) [3CFU]
- Network science and network medicine (June 5th, 09.00-13.00)
Prof. Guido Caldarelli (IMT Alti Studi Lucca)
- Digital Epidemiology and Network Science for monitoring and modeling Coronavirus outbreak (June 8th - 9th, 09:00 - 13:00)
Prof. Ciro Cattuto (ISI and University of Torino)
- A guided tour on the application of AI in Health and Medical Research (June 10th 09:00 - 13:00)
Prof. Alberto Tozzi (Ospedale Pediatrico Bambin Gesù)
- Network Medicine (June 15th 15:00 - 19:00)
Prof. Kimberly Glass (Harvard Univ. and Brigham and Women’s Hospital)

EDUCATIONAL OFFER - ACADEMIC YEAR 2018/2019

1. Algorithms and computational models for large-scale data analysis (April 8th - 16th 2019) [3CFU]
Silvio Lattanzi (Google)
Fabrizio Silvestri (Facebook)

2. Cryptographic primitives for blockchain and the security of cloud-based systems (May 10th - 21st 2019) [3CFU]
Silvio Micali (MIT)
Giuseppe Persiano (Salerno)
Daniele Venturi (Sapienza)
Angelo Spognardi (Sapienza)

3. Algorithmic bias, fairness and ethics in machine learning systems (May 27th - 31st 2019) [3CFU]
Carlos Castillo (UPF)
Francesco Bonchi (ISI Torino and Eurecat Barcelona)
Vassilis Gkatzelis (Drexel University)
Fedor Sandomirskiy (St. Petersburg School of Management)
Chris Schwiegelshohn (Sapienza)

4. Italian PhD School of Data Science [6CFU]
Co-organized with the PhD program in Data Science of Pisa
3-10 September, Pisa.