ALESSANDRO SCIRÈ

PhD Graduate

PhD program: XXXVII


Supervisor: Roberto Navigli

Thesis title: Towards Interpretable and Factual Natural Language Generation

This thesis addresses two major challenges in the field of Natural Language Generation (NLG): interpretability and factuality. Despite the significant advances made by Large Language Models (LLMs) in NLG tasks such as text summarization and machine translation, studies have consistently shown that their outputs often remain opaque and contain factual inaccuracies. These limitations raise concerns about the reliability and trustworthiness of current NLG systems, in both academia and industry. Our journey begins with long-form text summarization, with an emphasis on interpretability. In this setting, given the length and complexity of texts such as books, users must place considerable trust in the summarization system’s ability to distill key information accurately. To address this need, we propose an extractive-then-abstractive summarization approach that highlights the portions of the original text used to generate the final summary. This method helps build trust with end users, as they can check the sentences the system deemed relevant. However, our investigation into book summarization systems revealed two critical findings: first, many outputs contain factual errors, and second, standard automatic metrics are not sufficient to detect them. To tackle these issues, we introduce FENICE, a summarization factuality metric that leverages Natural Language Inference (NLI) and claim extraction. By aligning claims extracted from the summary with the corresponding sections of the source document, FENICE indicates which specific parts of the summary are accurate or hallucinated, thus addressing factuality and interpretability simultaneously. Having addressed summarization consistency with FENICE, we recognized that verifying the factual accuracy of summaries against a source document is only one part of a broader quest.
In real-world applications, texts often need to be validated against multiple external sources, where the relevant information is not known beforehand. This leads to the broader task of end-to-end factuality evaluation, in which verification extends beyond a predefined document to any potential evidence retrieved from various knowledge bases. To tackle this, we introduce LLM-OASIS, the first large-scale resource designed for training and evaluating models on this more complex verification task. Our resource is created by extracting and falsifying claims from Wikipedia pages, and subsequently generating factual and unfactual versions of the original text. We then train and evaluate language models on their ability to distinguish factual texts from their falsified counterparts. Our experiments show that this benchmark is challenging for current LLMs, even in the Retrieval-Augmented Generation (RAG) setting, while smaller, specialized models fine-tuned on our resource achieve competitive performance.
In the last chapter of the thesis, we show that the lack of transparency and interpretability also extends to other areas of NLG, such as machine translation (MT). In this context, the leading MT evaluation methods share similar limitations, offering only an overall quality score without revealing the precise nature or location of translation errors. As an initial step toward more interpretable evaluation, we propose MaTESe, a novel metric that frames MT evaluation as a sequence tagging task, identifying mistranslated spans and categorizing errors by type and severity. This thesis contributes to the ongoing effort to make NLG systems both interpretable and factually reliable, demonstrating the feasibility and importance of these qualities in practical applications.
Our hope is that the methodologies, resources, and insights outlined in this research will inspire future work and lay a solid foundation for more transparent and trustworthy NLG systems, ultimately fostering greater confidence in AI-driven text generation.

Research products

11573/1726952 - 2024 - NounAtlas: Filling the Gap in Nominal Semantic Role Labeling
Navigli, Roberto; Lo Pinto, Marco; Silvestri, Pasquale; Rotondi, Dennis; Ciciliano, Simone; Scirè, Alessandro - 04b Atto di convegno in volume
conference: Association for Computational Linguistics (Bangkok; Thailand)
book: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) - (9798891760943)

11573/1720213 - 2024 - Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!
Perrella, Stefano; Proietti, Lorenzo; Scirè, Alessandro; Barba, Edoardo; Navigli, Roberto - 04b Atto di convegno in volume
conference: Association for Computational Linguistics (Bangkok; Thailand)
book: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) - (9798891760943)

11573/1726537 - 2024 - FENICE: Factuality Evaluation of Summarization Based on Natural Language Inference and Claim Extraction
Scirè, Alessandro; Ghonim, K.; Navigli, Roberto - 04b Atto di convegno in volume
conference: 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (Bangkok; Thailand)
book: Findings of the Association for Computational Linguistics: ACL 2024 - (9798891760998)

11573/1685072 - 2023 - Echoes from Alexandria: A Large Resource for Multilingual Book Summarization
Scirè, Alessandro; Conia, Simone; Ciciliano, Simone; Navigli, Roberto - 04b Atto di convegno in volume
conference: Association for Computational Linguistics (Toronto; Canada)
book: Findings of the Association for Computational Linguistics: ACL 2023

11573/1672381 - 2022 - Semantic Role Labeling meets definition modeling: Using natural language to describe predicate-argument structures
Conia, Simone; Barba, Edoardo; Scirè, Alessandro; Navigli, Roberto - 04b Atto di convegno in volume
conference: Empirical Methods in Natural Language Processing (Abu Dhabi; United Arab Emirates)
book: Findings of the Association for Computational Linguistics: EMNLP 2022

11573/1670755 - 2022 - MaTESe: Machine Translation Evaluation as a Sequence Tagging Problem
Perrella, Stefano; Proietti, Lorenzo; Scirè, Alessandro; Campolungo, Niccolò; Navigli, Roberto - 04b Atto di convegno in volume
conference: Conference on Machine Translation (Abu Dhabi; United Arab Emirates)
book: Proceedings of the Seventh Conference on Machine Translation (WMT) - (9781959429296)

11573/1271809 - 2019 - Fog-computing-based heartbeat detection and arrhythmia classification using machine learning
Scirè, Alessandro; Tropeano, Fabrizio; Anagnostopoulos, Aris; Chatzigiannakis, Ioannis - 01a Articolo in rivista
paper: Algorithms (Molecular Diversity Preservation International, Basel, Switzerland) - issn: 1999-4893 - wos: WOS:000460694900005 (21) - scopus: 2-s2.0-85061991327 (38)

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma