ALEKSANDRA PIKTUS

PhD graduate

cycle: XXXVIII



Thesis title: Unstructured Data for Large Language Models

In recent years, we have witnessed an impressive rise in the ubiquity of large language models (LLMs). Although their fundamental objective, predicting the most probable next word in a sequence, has remained unchanged, the models themselves have expanded dramatically in scale and capability, becoming the dominant paradigm in Natural Language Processing (NLP). Progress has been marked by the development of increasingly sophisticated evaluation benchmarks on the one hand and by a growing demand for vast amounts of training data on the other. In this thesis, we examine how unstructured, primarily web-based data is used in LLM pre-training and fine-tuning. We investigate two principal roles that large textual corpora play for these models: first, as a source of world knowledge through retrieval augmentation, and second, as pre-training data. We begin by demonstrating how retrieval from large, unstructured web corpora can enhance performance on open-domain tasks, paving the way towards assistants capable of supporting humans in solving complex, knowledge-intensive problems. Next, we address the challenge of improving the robustness of pre-training data by developing tools that enable qualitative analysis of massive text collections. Finally, we explore potential avenues for model scaling under data-constrained conditions, anticipating a future in which the entirety of publicly available web text may no longer suffice to meet the demands of ever-larger language models.

Scientific output

11573/1717588 - 2023 - Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face
Akiki, Christopher; Ogundepo, Odunayo; Piktus, Aleksandra; Zhang, Xinyu; Oladipo, Akintunde; Lin, Jimmy; Potthast, Martin - 04b Conference paper in proceedings volume
conference: Empirical Methods in Natural Language Processing (EMNLP) (Singapore)
book: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations - ()

11573/1717583 - 2023 - FinGPT: Large Generative Models for a Small Language
Luukkonen, Risto; Komulainen, Ville; Luoma, Jouni; Eskelinen, Anni; Kanerva, Jenna; Kupari, Hanna-Mari; Ginter, Filip; Laippala, Veronika; Muennighoff, Niklas; Piktus, Aleksandra; Wang, Thomas; Tazi, Nouamane; Scao, Teven; Wolf, Thomas; Suominen, Osma; Sairanen, Samuli; Merioksa, Mikko; Heinonen, Jyrki; Vahtola, Aija; Antao, Samuel; Pyysalo, Sampo - 04b Conference paper in proceedings volume
conference: EMNLP (Singapore)
book: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing - ()

11573/1717594 - 2023 - Scaling Data-Constrained Language Models
Muennighoff, Niklas; Rush, Alexander M.; Barak, Boaz; Le Scao, Teven; Piktus, Aleksandra; Tazi, Nouamane; Pyysalo, Sampo; Wolf, Thomas; Raffel, Colin - 04b Conference paper in proceedings volume
conference: Advances in Neural Information Processing Systems (was NIPS) NeurIPS (New Orleans; USA)
book: Advances in Neural Information Processing Systems 36 (NeurIPS 2023) - (9781713899921)

11573/1717586 - 2023 - The ROOTS Search Tool: Data Transparency for LLMs
Piktus, Aleksandra; Akiki, Christopher; Villegas, Paulo; Laurençon, Hugo; Dupont, Gérard; Luccioni, Sasha; Jernite, Yacine; Rogers, Anna - 04b Conference paper in proceedings volume
conference: ACL (Toronto; Canada)
book: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) - (9781959429708)

11573/1717590 - 2023 - GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Piktus, Aleksandra; Ogundepo, Odunayo; Akiki, Christopher; Oladipo, Akintunde; Zhang, Xinyu; Schoelkopf, Hailey; Biderman, Stella; Potthast, Martin; Lin, Jimmy - 04b Conference paper in proceedings volume
conference: ACL (Toronto; Canada)
book: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) - (9781959429708)

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma