Thesis title: VerbAtlas: A multilingual semantic resource for Semantic Role Labeling
Understanding language is a faculty that humans acquire in childhood and develop throughout their lives, not only by memorizing a list of lexical items and grammar rules, but also through other faculties proper to human cognition, such as the ability to generalize and the ability to understand creative and novel ways of expressing an unlimited number of new concepts.
In contrast, machines rely on formal languages; hence, they must be supplied with the full lexicon and grammar rules before they can perform Natural Language Processing (NLP) tasks.
The main objective of NLP is in fact to find new ways to make a machine manipulate human language, which means making it able to produce proper generalizations across linguistic phenomena.
One of these phenomena is the argument structure of verbs, that is, the linguistic elements that usually co-occur with a verb, called "arguments". The NLP task that addresses it is "Semantic Role Labeling" (SRL), which consists in finding those arguments and labeling each of them with the proper semantic role (namely, a tag representing the semantic relation between a verb and an argument).
In the past, there have been attempts to build repositories of predicates and their respective argument structures, such as PropBank, FrameNet and VerbNet, but each of them suffers from different issues, ranging from low coverage to the scarce informativeness of its representation framework. Moreover, these verbal repositories are monolingual; therefore, they cannot be used for languages other than English without creating a new resource from scratch, which is costly in both time and money.
Another critical issue in SRL is the domain adaptation problem, that is, the ability of a framework to adapt to any kind of textual domain. In the past, it has frequently happened that a resource achieved good performance when applied, for example, to the CoNLL-2009 dataset, which belongs to the financial domain, but failed when the domain of application changed to the miscellaneous domain of the Brown Corpus.
This thesis presents a new, hand-crafted, multilingual lexical-semantic resource: VerbAtlas. Its goal is to solve the main problems that affected previous resources by bringing together all verbal synsets from BabelNet into semantically coherent frames. These frames come with a common, prototypical argument structure and also provide concept-specific information.
In contrast to PropBank, which implements enumerative semantic roles (i.e. semantic roles defined only by progressive numbering, such as ARG0, ARG1, ARG2, etc.), VerbAtlas implements an explicit, cross-frame set of semantic roles linked to selectional preferences expressed in terms of BabelNet synsets. Moreover, it is the first resource enriched with semantic information about implicit, shadow, and default arguments.
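The contrast between enumerative and explicit roles can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the example sentence, the `Argument` class, the `to_explicit` helper, and the role mapping are hypothetical and are not taken from the VerbAtlas or PropBank data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    """One labeled argument of a predicate (hypothetical structure)."""
    text: str
    role: str


# Illustrative sentence: "The cat eats the fish." (predicate: "eats")
# Enumerative labels: the meaning of ARG0/ARG1 depends on the predicate.
enumerative = [Argument("The cat", "ARG0"), Argument("the fish", "ARG1")]

# Assumed mapping for this single predicate only; a real mapping would
# have to be defined per predicate sense.
explicit_role_for = {"ARG0": "Agent", "ARG1": "Patient"}


def to_explicit(arguments, mapping):
    """Relabel arguments with human-readable role names."""
    return [Argument(a.text, mapping[a.role]) for a in arguments]


explicit = to_explicit(enumerative, explicit_role_for)
for arg in explicit:
    print(f"{arg.text}: {arg.role}")
```

In this sketch the explicit labels can be read without consulting the predicate's entry, whereas the numbered labels cannot.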
The efficacy of VerbAtlas is demonstrated on a dependency-based Semantic Role Labeling task. Integrating this new resource into a state-of-the-art neural system leads to a performance gain on both the in-domain and out-of-domain test sets of the CoNLL-2009 task, showing its adaptability to different domains.
VerbAtlas is available at http://verbatlas.org.