LORENZO DI ROCCO

PhD graduate

cycle: XXXVII


supervisor: Umberto Ferraro Petrillo

Thesis title: Scalable Solutions for Large-scale Bioinformatics Analysis: A Critical Study of Apache Spark Application in High-Performance Computational Genomics

Over time, the evolution of sequencing platforms has revolutionized the ability to unravel DNA complexity, enabling an increasing understanding of the genetic structure of organisms. However, these technological advancements have resulted in the generation of vast amounts of data that must be processed, stored, and interpreted. The growing volume of sequencing output has motivated a successful integration of computational genomics with supercomputing and artificial intelligence techniques to tackle computational challenges efficiently and extract meaningful insights from raw data, ultimately improving the speed and accuracy of genomic analysis. Nevertheless, the potential of distributed computing in genomics has yet to be fully unlocked. While theoretically advantageous, the distribution of complex bioinformatics tasks is challenging, as it requires a deep understanding of distributed systems and advanced programming skills. This thesis leverages Apache Spark to propose distributed pipelines designed to address critical challenges in computational genomics that involve processing large datasets. Apache Spark is a high-level framework that simplifies and accelerates the development of distributed solutions by managing technical issues internally. This feature makes Apache Spark particularly appealing for developing user-friendly, cloud-compatible libraries that promote the adoption of distributed computing in computational genomics. However, while this abstraction has proven successful in various real-world applications, it may face limitations in addressing the highly complex, sequentially structured problems often encountered in computational genomics, where processing across non-shared memory systems poses unique challenges. Through extensive experimental evaluations, this thesis aims to assess the strengths and limitations of applying Apache Spark to large-scale problems in computational genomics.
Each chapter focuses on a specific genomics-related application that is known for its data-intensive nature and where distributed computing could serve as a strategic resource. For each case, a pipeline is proposed and thoroughly analyzed through experiments aimed at evaluating scalability and identifying potential bottlenecks.


© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma