CYBERSECURITY

Seminars

2024

The Bug The Better: Mining Bugs in Complex Programs
10/12/2024 10:00, Aula L1, Via del Castro Laurenziano 7a, Roma
Speaker: Flavio Toffalini (Ruhr-Universität Bochum)

Adversaries continuously exploit vulnerabilities to compromise systems, such as crafting malicious JavaScript programs to hijack Web browsers and obtain remote execution. The most effective strategy for preventing such exploitation, and enhancing system security, is identifying and patching bugs. However, discovering vulnerabilities in modern systems requires facing scalability issues, and dealing with emerging attack surfaces.

This presentation will explore cutting-edge advancements in automated software testing, focusing on techniques to maximize the detection of security-critical bugs. Additionally, we will examine new challenges, such as errors injected by compilers into secure code, logic errors in Java programs, and erroneous code optimization in JavaScript engines.

A journey into PyTorch, the ecosystem, and deep learning compilers
11/11/2024 14:00, Aula Magna, DIAG, Via Ariosto 25, Roma.
Speaker: Luca Antiga (Lightning AI)

PyTorch has become a key building block of modern AI. In this talk, we'll explore its journey from the early days, through the growth of its ecosystem and the pivotal role of open source, all the way to the recent rise of deep learning compilers. We'll dive into the technical aspects of compiler technologies and discuss how they are going to shape the future of AI infrastructure.

Retrieval Augmented Generation (RAG): Applications, Limitations, and Future Directions
22/10/2024 15:00, Aula Magna, DIAG, Via Ariosto 25, Roma.
Speaker: Fabio Petroni (Samaya AI)

Retrieval Augmented Generation (RAG) is a technique we proposed in 2020 that allows generative AI models to access external information, enhancing their responses to prompts. Since then, the popularity of this approach has skyrocketed, becoming the de facto standard for handling knowledge-intensive tasks in both academia and industry. In this talk, I will describe various applications of RAG, including improving Wikipedia verifiability and providing a glimpse into the work we’re doing at Samaya AI. I will then discuss some limitations of this architecture, such as the “lost in the middle” effect, and conclude by outlining future research directions that I find most exciting.

Parallelizing GPU-based Mini-Batch Graph Neural Network Training
3/7/2024 11:30, Aula Magna, DIAG, Via Ariosto 25, Roma.
Speaker: Marco Serafini (UMass Amherst)

Many datasets are best represented as graphs of entities connected by relationships rather than as a single uniform dataset or table. Graph Neural Networks (GNNs) have been used to achieve state-of-the-art performance in tasks such as classification and link prediction. This talk will discuss recent research on scalable GNN training.

The talk will focus on the popular mini-batch approach to GNN training, where each iteration consists of three steps: sampling the k-hop neighbors of the mini-batch, loading the samples onto the GPUs, and training. The first part of the talk will discuss NextDoor, which showed for the first time that end-to-end GNN training can be sped up significantly by using GPU-based sampling. To maximize the utilization of GPU resources and speed up sampling, NextDoor proposes a new form of parallelism, called transit parallelism. The second part of the talk focuses on a new approach called split parallelism, which runs the entire mini-batch training pipeline on GPUs. It presents a system called GSplit that avoids redundant data loads and has all GPUs perform sampling and training cooperatively on the same mini-batch. Finally, the last part of the talk will discuss results from an experimental comparison between full-graph and mini-batch training systems.
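The first of the three steps, sampling the k-hop neighbors of a mini-batch, can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the function name `sample_k_hop`, the adjacency-list format, and the per-hop `fanouts` parameter are hypothetical and are not taken from NextDoor or GSplit, which perform this step on GPUs.

```python
import random

def sample_k_hop(adj, seeds, fanouts, rng):
    """Sample up to fanouts[i] neighbors per frontier node at hop i.

    adj     : dict mapping node -> list of neighbor nodes (assumed format)
    seeds   : nodes of the mini-batch
    fanouts : one neighbor budget per hop, so k = len(fanouts)
    Returns one list of (neighbor, node) edges per hop.
    """
    frontier = list(seeds)
    layers = []
    for fanout in fanouts:
        edges = []
        next_frontier = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            # Sample without replacement, capped by the node's degree.
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            for nb in picked:
                edges.append((nb, node))
                next_frontier.add(nb)
        layers.append(edges)
        # The sampled neighbors become the frontier of the next hop.
        frontier = sorted(next_frontier)
    return layers

# Toy graph and a 2-hop sample around the single seed node 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
layers = sample_k_hop(adj, seeds=[0], fanouts=[2, 2], rng=random.Random(0))
```

In a real pipeline the sampled edges would then be used to gather node features and load them onto the GPUs before the training step.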

Fighting against cyber threats from a system perspective
11/6/2024 12:00, Aula A7, DIAG, Via Ariosto 25, Roma
Speaker: David Bromberg (Univ. of Rennes IRISA)

Cyber attacks have now invaded our daily lives. According to a report by the European police agency Europol, cybercrime threats are exploding in Europe. Not a day goes by without discovering that an institution or a company has been attacked. In this talk we will explore how research in systems and distributed systems may improve resilience to cyber attacks along three axes, targeting mobile systems, distributed systems, and operating systems: (I) The astonishingly widespread adoption of the Android operating system has been accompanied by the spread of malware across the Android ecosystem at an alarming rate, which motivates studying how to strengthen the robustness of mobile systems such as Android; (II) Peer sampling is a key component of distributed systems for overlay management and information dissemination. It is regularly challenged by Byzantine nodes, which has led to revisiting the field by introducing new algorithms and investigating how SGX hardware enclaves can improve resilience to threats; (III) A significant amount of research focuses on defending against cyber attacks such as ransomware, but little on getting systems back up and running once they have been attacked.

In this talk, we will focus specifically on the first axis.

Leveraging Textual Specifications for Automated Attack Discovery in Network Protocols
28/5/2024 12:00, Aula B2, DIAG, Via Ariosto 25, Roma
Speaker: Cristina Nita-Rotaru (Northeastern University)

Automated attack discovery techniques, such as attacker synthesis or model-based fuzzing, provide powerful ways to ensure network protocols operate correctly and securely. Such techniques, in general, require a formal representation of the protocol, often in the form of a finite state machine (FSM). Unfortunately, many protocols are only described in English prose. We show how to extract a protocol specification in the form of an FSM from RFCs. Unlike other works that rely on rule-based approaches or use off-the-shelf NLP tools directly, we suggest a data-driven approach for extracting FSMs from RFC documents. Specifically, we use a hybrid approach consisting of three key steps: (1) large-scale word-representation learning for technical language, (2) focused zero-shot learning for mapping protocol text to a protocol-independent information language, and (3) rule-based mapping from protocol-independent information to a specific protocol FSM. We show the generalizability of our FSM extraction by using the RFCs for six different protocols: BGPv4, DCCP, LTP, PPTP, SCTP and TCP. We demonstrate how automated extraction of an FSM from an RFC can be applied to the synthesis of attacks, with TCP and DCCP as case studies. This work appeared in IEEE Security and Privacy 2022 as "Automated Attack Synthesis by Extracting Finite State Machines from Protocol Specification Documents." Maria Leonor Pacheco, Max von Hippel, Ben Weintraub, Dan Goldwasser, Cristina Nita-Rotaru. IEEE S&P 2022. Code available at: https://github.com/RFCNLP

Taming the Cost of Deep Neural Models: Hybrid Models to the Rescue?
16/5/2022 14:30, Aula Magna, DIAG, Via Ariosto 25, Roma
Speaker: Laks V.S. Lakshmanan (UBC Vancouver)

Deep learning, and in particular large language models, has made great strides in many fields including vision, language, and medicine. The impressive performance of large models comes at a significant price: the models tend to be billions to trillions of parameters in size, are expensive to train, have a huge operational cost, and typically need a cloud service for deployment. Meanwhile, considerable research effort has been devoted to designing smaller and cheaper models, at the price of restricted generalizability and performance. Not all queries we may wish to pose to a model are hard. Some queries can be answered nearly as accurately with cheaper models at a fraction of the cost of the larger models. However, the performance of cheaper models may suffer on other queries. Can we combine the best of both worlds by striking a balance between cost and performance? In this talk, I will describe two settings in which our group has tackled this issue. In the first setting, we are interested in approximate answers to queries over model predictions. We show how, under some assumptions about the cheap model, queries can be answered with provably high precision or recall by judiciously combining invocations of the large model on data samples with invocations of the cheap model on data objects. In the second setting, we are interested in learning a router, which, given a query, predicts its level of hardness, based on which the query is routed either to the small model or to the large model. For both settings, results of extensive experiments show the effectiveness and efficiency of our approach.
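The second setting, routing a query by its predicted hardness, can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the hand-written `toy_hardness` score replaces the learned router described in the abstract, and `small_model` and `large_model` are placeholder functions rather than real models.

```python
def route(query, hardness_fn, small_model, large_model, threshold=0.5):
    """Send the query to the large model only if it is predicted to be hard."""
    score = hardness_fn(query)
    if score >= threshold:
        return "large", large_model(query)
    return "small", small_model(query)

def toy_hardness(query):
    # Hypothetical proxy: longer queries are treated as harder, capped at 1.0.
    return min(len(query.split()) / 20.0, 1.0)

def small_model(query):
    # Placeholder for a cheap model.
    return "cheap answer to: " + query

def large_model(query):
    # Placeholder for an expensive model.
    return "expensive answer to: " + query

easy = route("capital of Italy", toy_hardness, small_model, large_model)
hard = route(" ".join(["word"] * 30), toy_hardness, small_model, large_model)
```

The threshold is the knob that trades cost against quality: raising it sends more queries to the cheap model.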

Securing Data in the Cyberspace: Challenges and Emerging Solutions
10/5/2022 14:30, Aula B203, DIAG, Via Ariosto 25, Roma
Speaker: Ivan Visconti (Univ. Salerno)

In this talk, I'll discuss significant challenges in data protection due to current and future threats. I'll present recent research results that, leveraging advanced cryptographic tools, provide new defenses in several domains. In particular, I'll describe efficient zero-knowledge proofs and their applications to:
a) detecting deep fakes/disinformation through novel image authentication mechanisms;
b) long-term data protection via post-quantum security;
c) data sanitization in tough scenarios (blockchains/AI).

(online: https://uniroma1.zoom.us/j/84550558864?pwd=UTdtNElrWHdWVytKckxBbkJZN25uUT09)

Towards Autonomous and Adaptable Digital Twins
02/02/2024 15:00, Aula Magna, DIAG, Via Ariosto 25, Roma
Speaker: Andrea Matta (Politecnico di Milano)

With the advent of Industry 4.0, digital representations of products and manufacturing systems have been considered central for optimizing their development, production, and delivery phases. Digital twins are not conceived simply as simulation models of their physical counterparts; rather, they are developed as a means for better understanding and controlling the real system. To stay aligned with physical systems along their whole lifecycle, digital twins need automated synchronization and model updates. Different data-driven approaches will be explored for model generation of process flows and equipment from different data views. The advantages and disadvantages of these approaches will be discussed to provide a comprehensive understanding. Additionally, techniques for online validation and synchronization of digital twins will be presented, ensuring that the digital twin accurately reflects the physical system in real time. Applications in manufacturing and circular economies will be described, showcasing their potential to optimize production processes, reduce waste, and enhance sustainability.

2023

Interpretable Neural Symbolic AI
7/11/2023 15:00, Aula Magna, DIAG, Via Ariosto 25, Roma
Speaker: Pietro Barbiero (Università della Svizzera Italiana)

Interpretable and neural symbolic AI share a common goal: to enhance the currently opaque and brittle decision making process of deep learning methods. To address this issue, I will discuss the design of novel interpretable deep learning methods endowed with reasoning capabilities. I will then show how these methods could be applied in diverse real-world domains, ranging from answering queries on knowledge graphs to formulating conjectures in universal algebra.

Sharding and Blockchain: on the cross-chain smart contracts
25/5/2023 12:00, Room A4, DIAG, Via Ariosto 25, Roma
Speaker: Antonella Del Pozzo (Université Paris-Saclay CEA, France)

During this talk, we will offer a succinct overview of sharding within the context of Blockchain and examine its impact on the execution of Blockchain smart contracts. We will emphasize the primary challenges that arise when the execution of smart contracts is distributed across Blockchain shards, and provide guidance on how to manage them effectively through an adaptation of the classical 2PC protocol.
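For reference, the classical two-phase commit (2PC) protocol mentioned above can be sketched as follows. This shows only the textbook protocol in a single process, not the adaptation presented in the talk; the `Shard` class and its `can_commit` flag are hypothetical names chosen for illustration.

```python
class Shard:
    """A participant that votes in phase 1 and applies the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumed local validation outcome
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local part of the transaction is valid.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(shards):
    """Coordinator: commit everywhere only if every shard voted yes."""
    votes = [shard.prepare() for shard in shards]
    if all(votes):
        for shard in shards:
            shard.commit()
        return True
    for shard in shards:
        shard.abort()
    return False

ok_shards = [Shard("A"), Shard("B")]
ok_result = two_phase_commit(ok_shards)

bad_shards = [Shard("A"), Shard("B", can_commit=False)]
bad_result = two_phase_commit(bad_shards)
```

The sketch omits timeouts, coordinator failure, and Byzantine participants, which are exactly the aspects a blockchain setting has to handle.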

Exploring Change – A New Dimension of Data Analytics
14/3/2023 15:00, Aula Magna, DIAG, Via Ariosto 25, Roma
Data and metadata in datasets experience many different kinds of change. Values are inserted, deleted or updated; rows appear and disappear; columns are added or repurposed, etc. In such a dynamic situation, users might have many questions related to changes in the dataset, for instance which parts of the data are trustworthy and which are not? Users will wonder: How many changes have there been in the recent minutes, days or years? What kind of changes were made at which points of time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has been recently changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We present a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. Finally, we identify technical challenges that need to be addressed to make our vision a reality, show some use cases of change exploration and propose directions of future work.

Computational Intelligence for Health
20/1/2023 10:30, Aula Magna, DIAG, Via Ariosto 25, Roma
Speaker: Ophir Frieder (Georgetown University)

We are just now slowly, physically recovering from the recent pandemic; mentally we have a long journey ahead of us, and many are touting a looming mental health crisis. Thus, initially, we describe a web-intelligent, social-media monitoring approach for depression detection and continue with a presentation of a patented, licensed, and proprietary intelligent agent that identifies behavioral deviancy, an early warning for potential mental health concerns. We then turn our attention towards web-intelligent monitoring of social media to detect physical disease outbreaks and describe the implications of such surveillance schemes for healthcare planning at a major children-focused hospital. We conclude by, once again, focusing on patented, licensed, and proprietary intelligent agent technology, this time to screen for COVID via the use of surrogates. Other medically oriented mining and search applications are briefly mentioned.

2022

Artificial Intelligence and Law: Perspectives and Open Problems
19/12/2022 16:00, Aula Magna, DIAG, Via Ariosto 25, Roma
Thanks to Artificial Intelligence (AI), activities that until now were carried out exclusively by people can be entrusted to machines, which have acquired some ability to reason, learn, and act. The scientific and technological successes of AI raise fundamental social, ethical, and legal questions. We ask whether AI technologies can be controlled and directed towards the good of individuals and society, or will instead serve particular interests, to the detriment of individual rights and social values; whether they will allow us to improve our institutions or will end up overwhelming them; whether they can help us create and apply the law according to rationality and justice, or will instead contribute to making the law more rigid, opaque, and unfair. Today's meeting helps us answer these questions, both through contributions from experts in the various disciplines involved (law, computer science, engineering, philosophy, ...) and by discussing with the author the contents of the very recent book "L'intelligenza artificiale e il diritto" (Giovanni Sartor, Giappichelli, 2022).

Can We Trust Machine Learning Models?
14/12/2022 15:00, Aula 201, Palazzina D, Viale Regina Elena 295, Roma
Speaker: Vitaly Shmatikov (Cornell Tech)

Modern machine learning models achieve super-human accuracy on tasks such as image classification and natural-language generation, but accuracy does not tell the entire story of what these models are learning. In this talk, I will look at today's machine learning from a security and privacy perspective, and ask several fundamental questions. Could models trained on sensitive private data memorize and leak this data? When training involves crowd-sourced data, untrusted users, or third-party code, could models learn malicious functionality, causing them to produce incorrect or biased outputs? What damage could result from such compromised models? I will illustrate these vulnerabilities with concrete examples and discuss the benefits and tradeoffs of technologies (such as federated learning) that promise to protect the integrity and privacy of machine learning models and their training data. I will then outline practical approaches towards making trusted machine learning a reality.

The Persistent Problem of Software Insecurity
28/11/2022 15:00, Aula Magna, DIAG, Via Ariosto 25, Roma
Speaker: Elisa Bertino (Purdue University)

Software is increasingly playing a key role in all infrastructure and application domains we may think of. Unfortunately, as we all know, software systems are still often insecure, despite the fact that the “problem of software security” has been known to the industry and research communities for decades. In this talk, I'll first present results from different analyses that we have carried out on authentication vulnerabilities in mobile applications, including an extensive study to detect vulnerable implementations of pseudo-random number generators (PRNGs) in mobile apps. The study was carried out using an analysis tool, OTP-Lint, that assesses implementations of the PRNGs in an automated manner without requiring the source code. By analyzing 6,431 commercial apps downloaded from two well-known app markets, OTP-Lint identified 399 vulnerable apps that generate predictable OTP values. I'll then discuss other factors that today complicate the problem of software security - a notable factor being the software supply chain. We then discuss "what it takes" to convince all parties involved in the software ecosystem to address the problem of software insecurity and outline research directions.

Better Together: Combining Sketching and Sampling for Effective Stream Processing
17/11/2022 11:00, Aula Magna, DIAG, Via Ariosto 25, Roma
Speaker: Prof. Roy Friedman, Technion.

Monitoring large data streams and maintaining statistics about them is a challenging task, revolving around the tradeoff triangle between memory frugality, computational complexity, and accuracy. The two common approaches for addressing these problems are sketching and sampling. In this talk, I will present a couple of examples of how an effective combination of the two can yield better results than either of them.

The first example is NitroSketch, a generic framework that boosts the performance of all sketches that employ multiple counter arrays, including, e.g., the famous count-min sketch, count-sketch, and Univmon. NitroSketch systematically addresses the performance bottlenecks of sketches without sacrificing robustness and generality. Its key contribution is the careful synthesis of rigorous, yet practical solutions to reduce the number of per-packet CPU and memory operations. NitroSketch is implemented on three popular software switch platforms (Open vSwitch-DPDK, FD.io-VPP, and BESS). Our performance evaluation shows that accuracy is comparable to unmodified sketches while attaining up to two orders of magnitude speedup, and up to 45% reduction in CPU usage.
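As background, a count-min sketch keeps several counter arrays and answers a frequency query with the minimum over the counters an item hashes to. The sketch below is a minimal version with an optional sampled-update mode. The sampling step is a simplification introduced here for illustration and is not NitroSketch's actual algorithm; the class name, parameters, and hashing scheme are all assumptions.

```python
import hashlib
import random

class CountMinSketch:
    def __init__(self, depth, width, sample_prob=1.0, seed=0):
        self.depth = depth              # number of counter arrays (rows)
        self.width = width              # counters per row
        self.sample_prob = sample_prob  # probability of updating each row
        self.rng = random.Random(seed)
        self.rows = [[0.0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One independent-looking hash per row, obtained by salting with the row id.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            # Update the row only with probability sample_prob, scaling the
            # increment so that each counter stays unbiased in expectation.
            if self.rng.random() < self.sample_prob:
                self.rows[row][self._index(row, item)] += count / self.sample_prob

    def estimate(self, item):
        # With sample_prob == 1.0 this never underestimates the true count.
        return min(self.rows[row][self._index(row, item)] for row in range(self.depth))

cms = CountMinSketch(depth=4, width=64)
for _ in range(5):
    cms.add("10.0.0.1")
for _ in range(2):
    cms.add("10.0.0.2")
```

With `sample_prob` below 1.0 fewer counters are touched per item, but the minimum is no longer a guaranteed overestimate, so a different estimator would be needed to keep accuracy guarantees.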

The second example is SQUAD, a novel algorithm for tracking quantiles (e.g., tail latencies) of significant items within a stream, where an item can be the source IP + destination IP addresses in a networking application, a URI or a user ID in a web service, or an object ID in a key-value store. While quantile sketches have been studied in the past, naively applying one instance of such sketches to each item is very memory wasteful. Similarly, applying sampling alone also requires prohibitive amounts of memory. In contrast, SQUAD addresses this problem by combining sampling and sketching in a way that improves the asymptotic space complexity. Intuitively, SQUAD allocates a sketch only to items identified as likely to be significant and uses a background sampling process to capture the behavior of the quantiles of an item before it is allocated with a sketch. This allows SQUAD to use fewer samples and sketches. An empirical evaluation demonstrates SQUAD’s superiority using extensive simulations on real-world traces.
* Based on joint works with Ran Ben-Basat, Vladimir Braverman, Gil Einziger, Yaron Kassner, Zaoxing Liu, Vyas Sekar, and Rana Shahout

Secure Biometric Authentication Using Privacy-Preserving Cryptographic Protocols
17/11/2022, B222 @DIAG (Via Ariosto 25), or online
Speaker: Paolo Gasti, New York Institute of Technology (NYIT)

As an authentication method, biometrics offer unparalleled convenience and security. With very little for users to remember and do, there is also very little that they can do incorrectly, thus limiting the attack surface of an authentication system. Unfortunately, biometrics also present a challenging privacy/security tradeoff: biometric data is the ultimate personally identifiable information (PII), and is highly regulated in various jurisdictions in Europe, Asia, and the United States. As a result, practical large-scale biometric deployments must take into account strong protection of the data they process. This talk will present recent advances in the area of cryptographic protocols applied to biometric recognition for the purpose of protecting biometric data during and after authentication. We will introduce various concepts around biometric authentication, such as biometric liveness and authentication error rates, and provide a general overview of modern cryptographic techniques designed to guarantee strong biometric privacy.