FRANCESCO MARIA MOLFESE

PhD Graduate

PhD program: XXXVIII

Supervisor: Roberto Navigli

Thesis title: Enabling Robust and Reliable Commonsense Reasoning in Large Language Models

In recent years, Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), demonstrating remarkable performance across a wide range of tasks. Despite these advances, fundamental questions remain about their reasoning capabilities, particularly in commonsense reasoning: the ability to draw inferences from implicit, everyday knowledge that humans typically take for granted. Current approaches often fail to provide models with knowledge in a form they can effectively use, while evaluation methodologies focus exclusively on final answers and ignore the reasoning processes that lead to them. Moreover, the dominant multiple-choice evaluation paradigm can systematically misrepresent model capabilities, especially as models generate increasingly complex reasoning traces. These limitations hinder progress toward language models with robust, human-like commonsense reasoning and understanding.
This thesis aims to advance both the performance and the reliable evaluation of commonsense reasoning in language models, addressing key challenges in knowledge augmentation and assessment methodology. First, we introduce a retrieval augmentation framework that provides models with complete reasoning examples rather than isolated facts, yielding consistent improvements across multiple benchmarks without requiring model retraining. Second, through comprehensive human annotation studies, we expose systematic inconsistencies in multiple-choice evaluation, revealing substantial disagreement between automated answer extraction strategies and human judgment, and we introduce a dataset that uncovers critical vulnerabilities in state-of-the-art LLM-based answer extractors. Third, we present the first comprehensive benchmark for evaluating reasoning traces in commonsense domains, showing that a significant proportion of correct answers contain flawed reasoning and that reasoning-aware evaluation yields substantially lower performance than answer-only assessment.
Together, these findings show that progress in commonsense reasoning depends both on how language models incorporate contextual knowledge and on how their reasoning is evaluated. By enhancing reasoning through example-based retrieval and exposing the limitations of answer-only evaluation, this work provides methods for developing language models with more reliable commonsense reasoning capabilities.
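To illustrate why answer-only multiple-choice scoring depends on the extraction step, here is a minimal Python sketch. Both functions, `first_letter_extractor` and `final_answer_extractor`, are hypothetical strategies written only for this example (they are not the extractors studied in the thesis), and the sketch assumes a four-option question labelled A-D. Given the same reasoning trace, the naive extractor records option A while the stricter one records option C, so one response can be scored as right or wrong depending on the strategy.

```python
import re
from typing import Optional

# Hypothetical extractors for a four-option question labelled A-D.
# Neither is the extraction strategy used in the thesis; they only show
# how the choice of strategy changes the answer that gets scored.


def first_letter_extractor(response: str) -> Optional[str]:
    """Naive strategy: return the first standalone capital letter A-D."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def final_answer_extractor(response: str) -> Optional[str]:
    """Stricter strategy: return the option letter that follows 'answer is'."""
    match = re.search(r"[Aa]nswer is\s*\(?([A-D])\b", response)
    return match.group(1) if match else None


# A reasoning trace that mentions a rejected option before the final choice.
response = (
    "Option A looks plausible at first, but it ignores the context, "
    "so the answer is (C)."
)

print(first_letter_extractor(response))  # A
print(final_answer_extractor(response))  # C
```

A stricter pattern is not a complete fix either: it returns `None` for a response phrased differently, such as "I would pick C", which is one way automated extraction can end up disagreeing with human judgment.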

Research products

11573/1744349 - 2025 - Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
Molfese, Francesco Maria; Moroni, Luca; Gioffrè, Luca; Scirè, Alessandro; Conia, Simone; Navigli, Roberto - 04b Conference paper in proceedings volume
conference: Association for Computational Linguistics (Vienna; Austria)
book: Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, July 27–August 1, 2025 - (979-8-89176-256-5)

11573/1717749 - 2024 - CNER: Concept and Named Entity Recognition
Martinelli, Giuliano; Molfese, Francesco; Tedeschi, Simone; Fernández-Castro, Alberte; Navigli, Roberto - 04b Conference paper in proceedings volume
conference: North American Chapter of the Association for Computational Linguistics (Mexico City; Mexico)
book: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - (979-8-89176-114-8)

11573/1713460 - 2024 - CroCoAlign: A Cross-Lingual, Context-Aware and Fully-Neural Sentence Alignment System for Long Texts
Molfese, Francesco Maria; Bejgu, Andrei Stefan; Tedeschi, Simone; Conia, Simone; Navigli, Roberto - 04b Conference paper in proceedings volume
conference: European Chapter of the Association for Computational Linguistics (St. Julian's; Malta)
book: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) - (979-8-89176-088-2)

11573/1727951 - 2024 - ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering
Molfese, Francesco Maria; Conia, Simone; Orlando, Riccardo; Navigli, Roberto - 04b Conference paper in proceedings volume
conference: Empirical Methods in Natural Language Processing (Miami; United States)
book: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing - (979-8-89176-164-3)

11573/1689178 - 2023 - Secret underwater acoustic key generation challenged by Eve's simulator
Yıldırım, S.; Pelekanakis, K.; Sklivanitis, G.; Pados, D. A.; Paglierani, P.; Petroccia, R.; Alves, J.; Molfese, F.; Cuomo, F. - 01a Journal article
paper: IEEE Journal of Oceanic Engineering (IEEE) pp. 1-18 - issn: 0364-9059 - wos: WOS:001047527600001 (7) - scopus: 2-s2.0-85167808669 (3)

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma