Marco Miccheli

PhD Graduate

PhD program: XXXIV

Thesis title: Two essays concerning complexity, language and machine learning

Abstract 1: In this first essay, we present our research on extracting information from the Translation Quality Assessment (TQA) process, in which the quality of a translation produced by a human translator from one language to another is evaluated by another human translator. We take advantage of a dataset provided by the professional translation service provider Translated SRL, consisting of thousands of translations produced by human translators and edited (with error annotations) by human reviewers. We deal with the subjectivity that arises from the linguists involved in the process, and we aim to understand which features are able to capture translators' behaviour. We apply Bayesian network methods to build a probabilistic framework that helps us understand the patterns of the translation process, assessing the difficulty of the source texts, the skill of the translators, and the strictness of the reviewers, together with the consistency of the latter two. We run three validation methods to test and compare the two Bayesian models we created, showing that they fit the data reasonably well and retrieve significant patterns in the behaviour of the linguists involved. We also designed an experiment that adds new data to our dataset in order to check the predictability of the quality of individual translated texts. Although the quality of a specific translation proved to be poorly predictable, the best of our models was able to predict the mutual relations between reviewers, which suggests it could be a useful tool for assessing linguists' behaviour and therefore for establishing their reliability.

Abstract 2: Relatedness is a measure of similarity between two human activities, in terms of the inputs and contexts needed for their development. The estimate of Relatedness has become of great interest as a tool to inform policies and development strategies in governments, international organisations, and firms, under the idea that it is easier to move between related activities than towards unrelated ones. In our paper "Relatedness in the Era of Machine Learning", we focus on countries and show that the standard, widespread approach of estimating Relatedness through the co-location of activities (e.g. the Product Space) generates measures of Relatedness that perform worse than trivial auto-correlation prediction strategies. We argue that this is due to the poor signal-to-noise ratio of international trade data. In the paper we build new methods to measure Relatedness, both by moving from a two-product correlation approach to a many-product one, and by finding correlations through textual similarities between product descriptions in the Harmonized System. Here we focus on this last technique and show its performance on a prediction task.