ANDREA BACCIU

PhD Graduate

PhD program: XXXVII


Supervisor: Fabrizio Silvestri
Co-supervisor: Nicola Tonellotto

Thesis title: Beyond Traditional Search: Bridging Retrieval, Reasoning, and Language Barriers in Intelligent Search Systems

This thesis investigates approaches to address fundamental limitations of modern search systems, with a particular focus on strengthening the synergy between retrieval and reasoning components and on easing access to information. A primary research thread explores how to optimize the interaction between retrievers and reasoning components. We specifically address the challenge of "hard false relevant" documents, which appear superficially relevant but lack true semantic alignment with the query. To tackle this issue, we introduce Reinforced Retrieval Augmented Machine Learning (RRAML), a visionary framework that enables retrieval systems to be fine-tuned within a Retrieval-Augmented Generation (RAG) architecture. This approach lets the retriever adapt through continued training when paired with a reasoner in a RAG scenario, reducing the retrieval of false relevant documents.
Within the same research direction, we explore Neural Semantic Parsing (NSP), which uses Large Language Models (LLMs) to translate natural language queries into a machine-readable format that can be used to retrieve information from knowledge graphs. This approach is an alternative to traditional RAG systems: instead of retrieving from unstructured text, the LLM facilitates access to structured, verified information stored in knowledge graphs, providing greater control and transparency in the retrieval process. To enhance this system's reliability, we developed the Hallucination Simulation Framework, which deliberately induces hallucinations in semantic parsers during training. Complementing this, we created the Hallucination Detection Model (HDM), which identifies and mitigates hallucinations stemming from knowledge gaps, improving answer reliability by 20%. This framework enables semantic parsers to recognize their knowledge boundaries and uncertainty levels, resulting in more transparent and trustworthy responses.
To further support the democratization of information, this thesis introduces approaches to bridge linguistic and cultural barriers, enabling users worldwide to access accurate, relevant information regardless of language or technical proficiency. This goal led to the development of x-NDB, a new architecture for cross-lingual neural databases, and to the creation of X-WikiNLDB, a dataset of unstructured text in multiple languages that simulates data retrieved online and supports robust cross-lingual information retrieval. Our cross-lingual performance is on par with previous work in the English-only scenario. Furthermore, we demonstrate significant zero-shot performance improvements of 2-5× over the multilingual counterpart across several low-resource languages, namely Catalan, Tagalog, Yoruba, Japanese, and Korean. This success suggests that cross-lingual training encourages models to capture deeper semantic understanding rather than surface-level patterns, enabling better generalization across linguistic boundaries.
To advance language accessibility further, we developed Fauno, DanteLLM, and OpenDanteLLM, a pioneering series of open-source Italian language models. With Fauno, we created the first open-source conversational Italian LLM along with a novel conversational dataset. Building on this foundation, DanteLLM achieved a 10% improvement over Fauno and a 7% improvement over the best competitor across comprehensive Italian benchmarks. OpenDanteLLM, trained exclusively on open-source data, showed a 5% improvement over existing methods while ensuring unrestricted access through its commercial-friendly Apache 2.0 license. This work, which gained particular relevance during ChatGPT's temporary ban in Italy, has established a new direction in language-specific AI development, inspiring other researchers to create additional Italian language models. Our approach demonstrates the feasibility of building high-performance, privacy-preserving language models that operate entirely offline, providing Italian-speaking users with reliable alternatives to centralized systems.
The thesis also addresses the challenge of ambiguous queries and the dependence on behavioral data in search systems. We introduce Generative Query Recommendation (GQR), a zero-shot approach that reimagines query expansion without relying on user query logs. By using LLMs as the sole component, GQR eliminates the complex pipelines and query log dependencies associated with traditional methods. Our system significantly outperformed industry standards, demonstrating a 10-point improvement in NDCG@10. The generated query reformulations showed reduced ambiguity with respect to the document collection, as measured by a 7-point increase in the Simplified Clarity Score. A blind user study with 12 annotators further validated GQR's effectiveness, with users preferring our recommendations approximately 60% of the time over leading industry alternatives. This approach establishes a new paradigm for query recommendation that achieves superior retrieval accuracy and user satisfaction without requiring behavioral data.
Finally, we present the Multi-Relevant Future Items Evaluation (MRFI) protocol for Sequential Recommender Systems (SRS), which improves evaluation by considering multiple relevant items. MRFI, alongside a novel loss function that integrates relevance feedback, enhances recommendation accuracy and reliability, achieving improvements of 2.82 points in NDCG@10 and 0.64% in Hit Rate across several benchmark datasets. This methodological innovation provides a robust basis for evaluating and training sequential recommendation systems across diverse applications.
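For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how NDCG@10 and Hit Rate can be computed when more than one future item counts as relevant. It uses the standard binary-relevance definitions and is not the MRFI protocol as defined in the thesis; the function names and the toy item identifiers are hypothetical and chosen only for illustration.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG@k: any item in relevant_items has gain 1, all others gain 0."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    # The ideal ranking places every relevant item (up to k of them) at the top.
    ideal_gains = [1.0] * min(len(relevant_items), k)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def hit_rate_at_k(ranked_items, relevant_items, k=10):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0


# Hypothetical example: a recommender's ranked list and two relevant future items.
ranking = ["a", "b", "c", "d"]
future_items = {"a", "c"}
example_ndcg = ndcg_at_k(ranking, future_items, k=10)
example_hit = hit_rate_at_k(ranking, future_items, k=10)
```

With a single relevant item this reduces to the usual next-item evaluation; passing a larger set of future items is one simple way to credit a model for recommending any of them.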
Together, these contributions address the core challenges of relevance, reliability, and accessibility in information retrieval and language technology, supporting the vision of universally accessible, high-quality information systems.

Research products

11573/1733922 - 2025 - A Reproducible Analysis of Sequential Recommender Systems
Betello, Filippo; Purificato, Antonio; Siciliano, Federico; Trappolini, Giovanni; Bacciu, Andrea; Tonellotto, Nicola; Silvestri, Fabrizio - 01a Journal article
paper: IEEE ACCESS (Piscataway NJ: Institute of Electrical and Electronics Engineers) pp. 5762-5772 - issn: 2169-3536 - wos: WOS:001398321900029 (0) - scopus: 2-s2.0-85213480745 (2)

11573/1716988 - 2024 - DanteLLM: Let’s Push Italian LLM Research Forward!
Bacciu, Andrea; Campagnano, Cesare; Trappolini, Giovanni; Silvestri, Fabrizio - 04b Conference paper in volume
conference: LREC-COLING (Turin; Italy)
book: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) - (9782493814104)

11573/1716987 - 2024 - Handling Ontology Gaps in Semantic Parsing
Bacciu, Andrea; Damonte, Marco; Basaldella, Marco; Monti, Emilio - 04b Conference paper in volume
conference: 13th Joint Conference on Lexical and Computational Semantics (Mexico City)
book: Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

11573/1701860 - 2023 - RRAML: Reinforced Retrieval Augmented Machine Learning
Bacciu, A.; Cuconasu, F.; Siciliano, F.; Silvestri, F.; Tonellotto, N.; Trappolini, G. - 04b Conference paper in volume
conference: 22nd International Conference of the Italian Association for Artificial Intelligence (AIxIA 2023 DP) co-located with 22nd International Conference of the Italian Association for Artificial Intelligence (AIxIA 2023) (Rome; Italy)
book: Proceedings of the Discussion Papers - 22nd International Conference of the Italian Association for Artificial Intelligence (AIxIA 2023 DP) co-located with 22nd International Conference of the Italian Association for Artificial Intelligence (AIxIA 2023)

11573/1689394 - 2023 - Integrating Item Relevance in Training Loss for Sequential Recommender Systems
Bacciu, Andrea; Siciliano, Federico; Tonellotto, Nicola; Silvestri, Fabrizio - 04b Conference paper in volume
conference: RecSys '23: 17th ACM Conference on Recommender Systems (Singapore)
book: RecSys '23: Proceedings of the 17th ACM Conference on Recommender Systems - (9798400702419)

11573/1698182 - 2023 - Fauno: The Italian Large Language Model that will leave you senza parole!
Bacciu, Andrea; Trappolini, Giovanni; Santilli, Andrea; Rodolà, Emanuele; Silvestri, Fabrizio - 04b Conference paper in volume
conference: IIR2023: 13th Italian Information Retrieval Workshop (Pisa; Italy)
book: Proceedings of the 13th Italian Information Retrieval Workshop (IIR 2023) Pisa, Italy, June 8-9, 2023

11573/1637620 - 2022 - Study on Transfer Learning Capabilities for Pneumonia Classification in Chest-X-Rays Images
Avola, D.; Bacciu, A.; Cinque, L.; Fagioli, A.; Marini, M. R.; Taiello, R. - 01a Journal article
paper: COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE (Elsevier Science Ireland Limited) pp. 1-12 - issn: 0169-2607 - wos: WOS:000869041700005 (23) - scopus: 2-s2.0-85129701097 (39)

11573/1569652 - 2021 - Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
Conia, Simone; Bacciu, Andrea; Navigli, Roberto - 04b Conference paper in volume
conference: 2021 Conference of the North American Chapter of the Association for Computational Linguistics (Online)
book: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - (978-1-954085-46-6)

11573/1446340 - 2019 - Cross-domain authorship attribution combining instance-based and profile-based features notebook for PAN at CLEF 2019
Bacciu, A.; La Morgia, M.; Mei, A.; Nemmi, E. N.; Neri, V.; Stefa, J. - 04b Conference paper in volume
conference: 20th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2019 (Lugano; Switzerland)
book: CEUR Workshop Proceedings

11573/1446354 - 2019 - Bot and gender detection of twitter accounts using distortion and LSA notebook for PAN at CLEF 2019
Bacciu, A.; La Morgia, M.; Mei, A.; Nemmi, E. N.; Neri, V.; Stefa, J. - 04b Conference paper in volume
conference: 20th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2019 (Lugano; Switzerland)
book: CEUR Workshop Proceedings

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma