Thesis title: Applications of Language Models: from Humans to Machines
Exploring new techniques for extracting information from large quantities of data is an
essential research topic.
The internet lets us produce and collect vast amounts of data whose value strongly
depends on our ability to process it. Hence, we need new technologies that can handle
all of this data.
One of the most common types of data is text written in human languages. Understanding
the meaning of a text is probably one of the most challenging tasks for a machine.
Hence, in the last few years, several works have proposed solutions to the classical problem
of language modeling (i.e., modeling the probability distribution of words in a text) that
leverage Neural Networks, building Neural Language Models. This new technology can
model the probability distribution associated with a text with high precision. It also makes
it possible to generate text starting from a prompt sentence. Thus, in our first contribution,
we study neural language models. In particular, we perform an in-depth analysis of the
automatically generated text. Our results point out that the generated text is not only
grammatically correct but also contains verifiable facts.
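The language-modeling problem mentioned above can be made concrete with a toy example. The sketch below scores a sentence via the chain rule, using a smoothed bigram model estimated from a tiny invented corpus; this stand-in is for illustration only, since the thesis studies neural language models, which learn these conditional probabilities instead of counting them.

```python
import math
from collections import Counter

# Toy corpus (an invented placeholder, not data from the thesis).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sentence_log_prob(words):
    """log P(w_1..w_n) ~= sum_i log P(w_i | w_{i-1}), add-one smoothed."""
    vocab = len(unigrams)
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return logp

# A fluent sentence receives a higher probability than a scrambled one.
likely = sentence_log_prob("the cat sat on the mat".split())
unlikely = sentence_log_prob("mat the on sat cat the".split())
```

Generation follows the same distribution in reverse: starting from a prompt, a model repeatedly samples the next word from its conditional distribution.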
However, the capabilities of Language Models go beyond the analysis of human-written
text. For example, we can use them to improve the process of developing and analyzing
software. For this reason, in the other two contributions of this thesis, we leverage Language
Models to analyze source code and binary programs. Our second contribution shows how
an autoencoder model can detect debug-information bugs by looking at the sequence of
executed lines shown while stepping through a program under a debugger. Using this
strategy, we found and reported five different debug-symbols bugs in the LLVM toolchain.
Interestingly, a classical differential-based approach that we developed could not find these bugs.
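The underlying idea can be sketched in miniature: learn what "normal" stepping behaviour looks like from traces of executed source-line numbers, then flag traces the model explains poorly. As a stand-in for the autoencoder's reconstruction error, the sketch below scores traces by the surprisal of their line-to-line transitions; the traces themselves are invented for illustration.

```python
from collections import Counter

# Invented "normal" debugger traces: sequences of executed line numbers.
normal_traces = [
    [10, 11, 12, 13, 14],
    [10, 11, 12, 14],
    [10, 11, 13, 14],
]

# Count observed line-to-line transitions and outgoing totals per line.
transitions = Counter(
    (a, b) for trace in normal_traces for a, b in zip(trace, trace[1:])
)
outgoing = Counter(a for a, _ in transitions.elements())

def anomaly_score(trace):
    """Mean 'surprise' over a trace's transitions: jumps never seen during
    normal debugging (a symptom of broken debug information) score near 1."""
    scores = []
    for a, b in zip(trace, trace[1:]):
        seen = transitions[(a, b)] / outgoing[a] if outgoing[a] else 0.0
        scores.append(1.0 - seen)
    return sum(scores) / len(scores)
```

A trace full of unseen jumps, such as the debugger appearing to step backwards, scores much higher than any normal trace, which is the signal the detector thresholds on.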
Finally, in our last contribution, we use a self-attentive Recurrent Neural Network to
compute dense representations of binary functions. Using this representation, we show that
it is possible to tackle several crucial reverse engineering tasks, from identifying known
vulnerabilities to malware classification.
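The value of dense representations is that, once each binary function maps to a fixed-size vector, tasks such as known-vulnerability identification reduce to nearest-neighbour search in embedding space. The sketch below shows that reduction with cosine similarity; the embeddings and CVE labels are invented placeholders, not outputs of the thesis's self-attentive network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings of functions known to contain vulnerabilities.
known_vulnerable = {
    "CVE_A_heap_overflow": [0.9, 0.1, 0.3],
    "CVE_B_use_after_free": [0.2, 0.8, 0.5],
}

def closest_known(query_embedding):
    """Return the known vulnerable function most similar to the query,
    i.e. a one-nearest-neighbour lookup in embedding space."""
    return max(
        known_vulnerable,
        key=lambda name: cosine(known_vulnerable[name], query_embedding),
    )
```

A stripped binary's function whose embedding lands close to a known vulnerable one becomes a candidate for the same flaw; the same nearest-neighbour machinery, with class centroids instead of single functions, supports malware classification.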