FIORELLA ARTUSO

PhD graduate

cycle: XXXVI


supervisor: Leonardo Querzoni

Thesis title: Deep Learning based Binary Code Analysis

The exponential growth of software complexity, coupled with the rise of heterogeneous architectures, makes manual binary code analysis increasingly difficult. Despite this difficulty, binary code analysis is extremely valuable, particularly when the source code is unavailable, as with proprietary software, firmware images, and malware samples. To tackle these challenges, the scientific community has started studying automated binary analysis tools based on Deep Learning (DL) that reduce the workload of human reverse engineers. Unfortunately, such solutions have proliferated without much effort toward systematization. This thesis makes three main contributions.

First, we present a comprehensive literature review that spans nine years of research up to 2024. We propose a systematization of 54 research papers, identify a deep learning pipeline common to all these solutions, and provide an in-depth analysis of each of its steps. This analysis highlights key trends across the various approaches as well as gaps that need further investigation.

Second, we explore the applicability of Deep Learning solutions to a novel task: the detection of debug information bugs in optimized binaries. This is a practically important problem, as most software running in production is produced by an optimizing compiler. Current solutions rely on invariants, that is, human-defined rules that encode the desired behavior and whose violation may indicate the presence of a bug. Although this approach has proved effective in discovering several bugs, it cannot identify bugs that do not violate any invariant. We trained a set of models borrowed from the NLP community, in an unsupervised way, on a large dataset of debug traces. Our results show that DNNs are capable of discovering bugs in both synthetic and real datasets. Additionally, with our models we were able to report 12 unknown bugs in a recent version of the widely used LLVM toolchain, two of which have been confirmed.

Our last contribution is a novel assembly code model named BinBert. The model is built on a transformer pre-trained on a large dataset of both assembly instruction sequences and execution information (i.e., symbolic expressions). BinBert can be applied to assembly instruction sequences and is fine-tunable: it can be retrained as part of a neural architecture on task-specific data. Through fine-tuning, BinBert learns how to apply the general knowledge acquired during pre-training to the specific task. We evaluated BinBert on a multi-task benchmark that we designed specifically to test the understanding of assembly code. The benchmark mixes intrinsic and downstream tasks, some taken from the literature and a few novel ones that we designed. Our results show that BinBert outperforms state-of-the-art models for binary instruction embedding, raising the bar for binary code understanding. Moreover, BinBert was developed to address the gaps that we found in our systematization effort: mainly, the lack of comparison with standard architectures, the use of tokenization strategies without comparison or rationale, and the testing of models on a single task or a very limited set of tasks.

Scientific publications

11573/1713407 - 2024 - BinBert: Binary Code Understanding with a Fine-tunable and Execution-aware Transformer
Artuso, Fiorella; Mormando, Marco; Di Luna, Giuseppe Antonio; Querzoni, Leonardo - 01a Journal article
journal: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING (IEEE Computer society New York) pp. - - issn: 1545-5971 - wos: (0) - scopus: 2-s2.0-85192991806 (0)

11573/1639211 - 2022 - Debugging Debug Information with Neural Networks
Artuso, F.; Di Luna, G. A.; Querzoni, L. - 01a Journal article
journal: IEEE ACCESS (Piscataway NJ: Institute of Electrical and Electronics Engineers) pp. 54136-54148 - issn: 2169-3536 - wos: WOS:000804620000001 (2) - scopus: 2-s2.0-85130449764 (4)

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma