Luca Massarelli

PhD Graduate

PhD program: XXXIII


Supervisors: Roberto Baldoni / Leonardo Querzoni

Thesis title: Applications of Language Models: from Humans to Machines

Exploring new techniques for extracting information from large quantities of data is an essential research topic. The internet allows us to produce and collect vast amounts of data whose value depends strongly on our ability to process it, so we need new technologies capable of handling it. One of the most common types of data is text written in human languages, and understanding the meaning of a text is among the most challenging tasks for a machine. Hence, in the last few years, several works have proposed solutions to the classical problem of language modeling (i.e., modeling the probability distribution of words in a text) that leverage Neural Networks, building Neural Language Models. These models capture with high precision the probability distribution associated with a text and can also generate new text starting from a sentence.

In our first contribution, we study neural language models. In particular, we perform an in-depth analysis of automatically generated text. Our results show that such text is not only grammatically correct but also contains verifiable facts.

The capabilities of Language Models, however, go beyond the analysis of human text: we can, for example, use them to improve the process of developing and analyzing software. For this reason, in the other two contributions of this thesis we leverage Language Models to analyze source code and binary programs. Our second contribution shows how an autoencoder model can detect debug-information bugs by looking at the sequence of executed lines shown during a program's normal debugging. Using this strategy, we found and reported five distinct debug-symbol bugs in the LLVM toolchain. Interestingly, a classical differential-based approach that we developed could not find these bugs.
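To make the notion of language modeling above concrete, the sketch below scores a word sequence as a product of conditional probabilities, P(w1..wn) = Π P(w_i | w_1..w_{i-1}). This is only an illustrative toy: a tiny smoothed bigram model over a hypothetical corpus stands in for the neural network studied in the thesis.

```python
# Illustrative only: a bigram stand-in for a neural language model.
# The corpus and all probabilities here are toy assumptions, not thesis data.
from collections import Counter

corpus = "the model reads the text and the model scores the text".split()

# Count bigram and preceding-word frequencies from the toy corpus.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])
vocab = len(set(corpus))

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing so unseen pairs get mass."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

def sentence_prob(words):
    """Probability of a sentence as a product of bigram probabilities."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# A fluent, in-domain word order scores higher than a scrambled one.
likely = sentence_prob("the model scores the text".split())
unlikely = sentence_prob("text the scores model the".split())
assert likely > unlikely
```

Generation works the same way in reverse: repeatedly sampling the next word from these conditional distributions extends a starting sentence, which is the mechanism the thesis analyzes at neural scale.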
Finally, in our last contribution we used a self-attentive Recurrent Neural Network to compute dense representations of binary functions. Using these representations, we show that it is possible to tackle several crucial reverse-engineering tasks, from identifying known vulnerabilities to classifying malware.
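Once binary functions are mapped to dense vectors, searching a stripped binary for a known vulnerable function reduces to nearest-neighbour search in embedding space. The sketch below assumes the embeddings already exist (the function names and vectors are hypothetical; the self-attentive RNN that would produce them is not shown) and ranks candidates by cosine similarity.

```python
# Illustrative only: nearest-neighbour lookup over hypothetical
# function embeddings; the embedding model itself is not shown.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings of functions found in a stripped binary.
database = {
    "ssl_handshake": [0.9, 0.1, 0.3],
    "parse_header":  [0.2, 0.8, 0.1],
    "memcpy_like":   [0.1, 0.2, 0.9],
}

# Hypothetical embedding of a known vulnerable function we search for.
query = [0.85, 0.15, 0.35]

# The closest embedding identifies the most similar binary function.
best = max(database, key=lambda name: cosine(query, database[name]))
assert best == "ssl_handshake"
```

The same representation supports malware classification by feeding the function vectors to a downstream classifier instead of a similarity search.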


© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma