Luca Massarelli

PhD

cycle: XXXIII

supervisor: Roberto Baldoni / Leonardo Querzoni

Thesis title: Applications of Language Models: from Humans to Machines

Exploring new techniques for extracting information from large quantities of data is an essential research topic. The internet lets us produce and collect large quantities of data whose value strongly depends on our ability to process them, so we need new technologies that can handle all of this data. One of the most common types of data is text written in human languages, and understanding the meaning of a text is probably one of the most challenging tasks for a machine. In the last few years, several works have proposed solutions to the classical problem of language modeling (i.e., modeling the probability distribution of words in a text) that leverage neural networks, building neural language models. This technology can model the probability distribution associated with a text with high precision, and it can also generate text starting from a sentence. In our first contribution, we therefore study neural language models and, in particular, perform an in-depth analysis of automatically generated text. Our results show that generated text is not only grammatically correct but also contains verifiable facts. However, the capabilities of language models go beyond the analysis of human text: for example, they can improve the process of developing and analyzing software. For this reason, in the other two contributions of this thesis, we leverage language models to analyze source code and binary programs. Our second contribution shows how an autoencoder model can detect debug-information bugs by looking at the sequence of executed lines shown during the normal debugging of a program. Using this strategy, we found and reported five different debug-symbol bugs in the LLVM toolchain. Interestingly, a classical differential-based approach that we developed could not find these bugs.
Finally, in our last contribution, we used a self-attentive Recurrent Neural Network to compute dense representations of binary functions. Using these representations, we show that it is possible to tackle several crucial reverse engineering tasks, from identifying known vulnerabilities to malware classification.

Publications

11573/1477115 - 2022 - Function Representations for Binary Similarity

Massarelli, Luca; Di Luna, Giuseppe Antonio; Petroni, Fabio; Querzoni, Leonardo; Baldoni, Roberto - 01a Journal article

journal: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING (IEEE Computer Society, New York) pp. 2259-2273 - issn: 1545-5971 - wos: WOS:000822381000001 (10) - scopus: 2-s2.0-85099729625 (10)

11573/1555591 - 2021 - Who's debugging the debuggers? Exposing debug information bugs in optimized binaries

Di Luna, G. A.; Italiano, D.; Massarelli, L.; Osterlund, S.; Giuffrida, C.; Querzoni, L. - 04b Conference paper in proceedings

conference: Architectural Support for Programming Languages and Operating Systems (Virtual; Online)

book: ASPLOS '21: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems - (9781450383172)

11573/1413959 - 2020 - AndroDFA: Android Malware Classification Based on Resource Consumption

Massarelli, Luca; Aniello, Leonardo; Ciccotelli, Claudio; Querzoni, Leonardo; Ucci, Daniele; Baldoni, Roberto - 01a Journal article

journal: INFORMATION (Basel: Molecular Diversity Preservation International) - issn: 2078-2489 - wos: WOS:000551236800014 (9) - scopus: 2-s2.0-85087498938 (7)

11573/1481935 - 2020 - How Decoding Strategies Affect the Verifiability of Generated Text

Massarelli, Luca; Petroni, Fabio; Piktus, Aleksandra; Ott, Myle; Rocktaschel, Tim; Plachouras, Vassilis; Silvestri, Fabrizio; Riedel, Sebastian - 04b Conference paper in proceedings

conference: Findings of the Association for Computational Linguistics: EMNLP 2020 (Online)

book: Findings of the Association for Computational Linguistics: EMNLP 2020

11573/1321763 - 2019 - Triage of IoT Attacks Through Process Mining

Coltellese, Simone; Maria Maggi, Fabrizio; Marrella, Andrea; Massarelli, Luca; Querzoni, Leonardo - 04b Conference paper in proceedings

conference: On the Move to Meaningful Internet Systems: OTM 2019 Conferences (Rhodes; Greece)

book: On the Move to Meaningful Internet Systems: OTM 2019 Conferences - (978-3-030-33245-7; 978-3-030-33246-4)

11573/1285253 - 2019 - SAFE: Self-Attentive Function Embeddings for Binary Similarity

Massarelli, Luca; Di Luna, Giuseppe Antonio; Petroni, Fabio; Baldoni, Roberto; Querzoni, Leonardo - 04b Conference paper in proceedings

conference: 16th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment DIMVA 2019 (Gothenburg; Sweden)

book: Detection of Intrusions and Malware, and Vulnerability Assessment - (978-3-030-22037-2; 978-3-030-22038-9)

11573/1285230 - 2019 - Investigating Graph Embedding Neural Networks with Unsupervised Features Extraction for Binary Analysis

Massarelli, Luca; Di Luna, Giuseppe Antonio; Petroni, Fabio; Querzoni, Leonardo; Baldoni, Roberto - 04b Conference paper in proceedings

conference: 2nd Workshop on Binary Analysis Research (BAR 2019) (San Diego (CA); United States)

book: Proceedings BAR 2019 Workshop on Binary Analysis Research - (1-891562-58-4)

11573/1160258 - 2017 - Android malware family classification based on resource consumption over time

Massarelli, L.; Aniello, L.; Ciccotelli, C.; Querzoni, L.; Ucci, D.; Baldoni, R. - 04b Conference paper in proceedings

conference: 12th International Conference on Malicious and Unwanted Software, MALWARE 2017 (Fajardo, Puerto Rico, USA)

book: Proceedings of the 2017 12th International Conference on Malicious and Unwanted Software, MALWARE 2017 - (9781538614365; 978-1-5386-2592-7)