STEFANO PERRELLA

PhD (Dottore di ricerca)

Cycle: XXXVIII



Thesis title: Towards Accurate and Interpretable Machine Translation Evaluation

Machine Translation (MT) automatically maps text across languages, while MT evaluation assesses the quality of the translated text. The MT field relies largely on human evaluation to track progress, and on automatic evaluation for rapid experimentation and iterative system development. Although MT evaluation methods have improved dramatically in recent years, they now face new challenges: as automatic translation quality approaches human performance, subtle translation errors become increasingly difficult to detect. Moreover, modern automatic MT evaluation techniques rely on black-box neural models, which are opaque and hard to interpret, and most metrics reduce translation quality to a single scalar score, offering limited insight into the reasons behind that score. If these models provide unreliable assessments, they may steer system development in the wrong direction.

In this dissertation, we advance the ability to measure progress in the MT field, with a particular focus on making automatic evaluation more interpretable. Our contributions also extend to MT meta-evaluation, that is, the evaluation of MT metrics. First, we address the lack of interpretability of neural metrics by introducing the MaTESe metrics, the first neural metrics capable of identifying error spans within translations and assigning severity levels to them, thus providing finer-grained, more interpretable feedback.

Next, we introduce a meta-evaluation framework that measures metric performance using Precision, Recall, F-score, and Re-Ranking Precision in evaluation scenarios designed to proxy common applications. In this way, we provide insights into evaluation accuracy that go beyond simple correlation with human judgment, which, while useful for comparing metrics, offers limited information about their true evaluation accuracy.

We further advance MT meta-evaluation by uncovering flaws in standard meta-evaluation strategies. To this end, we introduce sentinel metrics, that is, intentionally incomplete metrics designed to expose weaknesses in meta-evaluation, and use them to demonstrate that certain meta-evaluation strategies can inadvertently reward metrics that base their evaluation on spurious correlations between text features and human judgments of translation quality, or that simply produce continuous rather than discrete outputs (a minimal sketch of such a sentinel is given after this abstract).

Finally, we shift focus from improving evaluation techniques to facilitating better test data selection. As MT systems reach ever higher performance, current benchmarks have become too easy, making it difficult to distinguish among top systems or to identify areas for improvement. We address this issue by introducing Translation Difficulty Estimation, the task of identifying difficult-to-translate texts, and show how difficulty estimators can be used to construct more challenging MT benchmarks. We then train a state-of-the-art difficulty estimator and use it to build the test set for the General Machine Translation Shared Task at the 2025 edition of the Conference on Machine Translation (WMT). Our approach reduced the proportion of perfect translation outputs on English-source translation directions from 18.74% (WMT24) to 3.60% (WMT25), thereby better exposing the limitations of top systems.
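To illustrate how a difficulty estimator can drive test-set construction, the sketch below ranks a pool of candidate source texts by a predicted difficulty score and keeps only the hardest ones. This is a minimal sketch under stated assumptions: the estimator interface and all names are hypothetical, not the dissertation's actual implementation.

    from typing import Callable, List

    def build_hard_test_set(
        candidates: List[str],
        estimate_difficulty: Callable[[str], float],  # hypothetical: higher score = harder to translate
        budget: int,
    ) -> List[str]:
        """Keep the `budget` source texts predicted to be hardest to translate."""
        ranked = sorted(candidates, key=estimate_difficulty, reverse=True)
        return ranked[:budget]

    # Toy usage with a stand-in estimator (word count as a crude difficulty proxy):
    pool = ["The cat sat.",
            "The amendment's ratification hinged on a quorum of abstentions."]
    hard_subset = build_hard_test_set(pool, lambda s: len(s.split()), budget=1)

In practice, the stand-in lambda would be replaced by a trained difficulty estimator; the selection logic itself stays the same.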
Together, these contributions advance the field of Machine Translation by providing interpretable metrics and meta-metrics, more reliable meta-evaluation, and more informative benchmarks -- ultimately improving how progress in MT is measured and understood.
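To make the sentinel-metric idea above concrete, here is a minimal sketch of one intentionally incomplete metric: it ignores the candidate translation entirely and scores on a surface feature of the source text. By construction it cannot measure translation quality, so a sound meta-evaluation should rank it poorly; if it instead ranks highly, the meta-evaluation is rewarding spurious correlations. All names are illustrative assumptions, not the dissertation's implementation.

    def sentinel_source_only(source: str, translation: str) -> float:
        """Return a 'quality' score computed from the source text alone.
        The translation argument is deliberately unused: by construction,
        this sentinel cannot measure translation quality."""
        return float(len(source.split()))

    # A good and a bad translation of the same source receive identical scores:
    good = sentinel_source_only("Il gatto dorme.", "The cat is sleeping.")
    bad = sentinel_source_only("Il gatto dorme.", "Banana banana banana.")
    assert good == bad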

Scientific publications

11573/1722083 - 2025 - DIBIMT: A Gold Evaluation Benchmark for Studying Lexical Ambiguity in Machine Translation
Martelli, Federico; Perrella, Stefano; Campolungo, Niccolò; Munda, Tina; Koeva, Svetla; Tiberius, Carole; Navigli, Roberto - 01a Journal article
journal: COMPUTATIONAL LINGUISTICS (Cambridge, MA: MIT Press Journals) pp. 343-413 - issn: 1530-9312 - wos: WOS:001515402300010 (2) - scopus: (0)

11573/1743271 - 2024 - Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics
Perrella, Stefano; Proietti, Lorenzo; Huguet Cabot, Pere-Lluis; Barba, Edoardo; Navigli, Roberto - 04b Conference proceedings paper
conference: Conference on Empirical Methods in Natural Language Processing (Miami; Florida)
book: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing - (9798891761643)

11573/1720213 - 2024 - Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!
Perrella, Stefano; Proietti, Lorenzo; Scirè, Alessandro; Barba, Edoardo; Navigli, Roberto - 04b Conference proceedings paper
conference: Annual Meeting of the Association for Computational Linguistics (Bangkok; Thailand)
book: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) - (9798891760943)

11573/1711963 - 2024 - Analyzing Homonymy Disambiguation Capabilities of Pretrained Language Models
Proietti, Lorenzo; Perrella, Stefano; Tedeschi, Simone; Vulpis, Giulia; Lavalle, Leonardo; Sanchietti, Andrea; Ferrari, Andrea; Navigli, Roberto - 04b Conference proceedings paper
conference: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (Torino; Italy)
book: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) - (978-2-493814-10-4)

11573/1670755 - 2022 - MaTESe: Machine Translation Evaluation as a Sequence Tagging Problem
Perrella, Stefano; Proietti, Lorenzo; Scirè, Alessandro; Campolungo, Niccolò; Navigli, Roberto - 04b Conference proceedings paper
conference: Conference on Machine Translation (Abu Dhabi; United Arab Emirates)
book: Proceedings of the Seventh Conference on Machine Translation (WMT) - (9781959429296)
