LORENZO DI ROCCO

PhD graduate

cycle: XXXVII


supervisor: Umberto Ferraro Petrillo

Thesis title: Scalable Solutions for Large-scale Bioinformatics Analysis: A Critical Study of Apache Spark Application in High-Performance Computational Genomics

Over time, the evolution of sequencing platforms has revolutionized the ability to unravel DNA complexity, enabling an increasingly detailed understanding of the genetic structure of organisms. However, these technological advancements have resulted in the generation of vast amounts of data that must be processed, stored, and interpreted. The growing volume of sequencing output has motivated a successful integration of computational genomics with supercomputing and artificial intelligence techniques to efficiently face computational challenges and extract meaningful insights from raw data, ultimately improving the speed and accuracy of genomic analysis. Nevertheless, the potential of distributed computing in genomics has yet to be fully unlocked. While theoretically advantageous, the distribution of complex bioinformatics tasks is challenging, as it requires a deep understanding of distributed systems and advanced programming skills. This thesis leverages Apache Spark to propose distributed pipelines designed to address critical challenges in computational genomics that involve processing large datasets. Apache Spark is a high-level framework that simplifies and accelerates the development of distributed solutions by managing technical issues internally. This feature makes Apache Spark particularly appealing for developing user-friendly, cloud-compatible libraries that promote the adoption of distributed computing in computational genomics. However, while this abstraction has proven successful in various real-world applications, it may face limitations in addressing the highly complex, sequentially structured problems often encountered in computational genomics, where processing across non-shared memory systems poses unique challenges. Through extensive experimental evaluations, this thesis aims to assess the strengths and limitations of applying Apache Spark to large-scale problems in computational genomics.
Each chapter focuses on a specific genomics-related application that is known for its data-intensive nature and where distributed computing could serve as a strategic resource. For each case, a pipeline is proposed and thoroughly analyzed through experiments aimed at evaluating scalability and identifying potential bottlenecks.

Scientific output

11573/1733746 - 2025 - A Flexible Parametric Approach to Synthetic Patients Generation in Clinical Trials
Cipriani, Marta; Di Rocco, Lorenzo; Alfò, Marco - 04b Conference paper in volume
conference: SIS 2024 (Bari)
book: Methodological and Applied Statistics and Demography III - (9783031644306; 9783031644313)

11573/1717606 - 2024 - A distributed approach for persistent homology computation on a large scale
Ceccaroni, Riccardo; Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Brutti, Pierpaolo - 01a Journal article
journal: THE JOURNAL OF SUPERCOMPUTING (Kluwer Academic Publishers / Massachusetts:PO Box 358, Accord Station:Hingham, MA 02018:(617)871-6600) pp. - - issn: 0920-8542 - wos: WOS:001289394700004 (0) - scopus: 2-s2.0-85200951050 (0)

11573/1729124 - 2024 - Exploiting mechanistic models toward personalised strategies in Breast Cancer
Lombardi, Veronica; Venafra, Veronica; Di Rocco, Lorenzo; Nicolaeasa, Lorentz; Ferraro Petrillo, Umberto; Alfò, Marco; Sacco, Francesca; Perfetto, Livia - 04f Poster
conference: 20th BITS Annual Meeting (Trento, Italy)
book: 20th BITS 2024 - ()

11573/1692306 - 2023 - A Distributed Alignment-free Pipeline for Human SNPs Genotyping
Di Rocco, L.; Ferraro Petrillo, U. - 04b Conference paper in volume
conference: 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (Houston)
book: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - (9798400701269)

11573/1671078 - 2023 - Using software visualization to support the teaching of distributed programming
Di Rocco, L.; Ferraro Petrillo, U.; Palini, F. - 01a Journal article
journal: THE JOURNAL OF SUPERCOMPUTING (Kluwer Academic Publishers / Massachusetts:PO Box 358, Accord Station:Hingham, MA 02018:(617)871-6600) pp. 3974-3998 - issn: 0920-8542 - wos: WOS:000854426900002 (2) - scopus: 2-s2.0-85138168998 (2)

11573/1644217 - 2022 - Scheduling K-mers Counting in a Distributed Environment
Amorosi, L.; Di Rocco, L.; Ferraro Petrillo, U. - 04b Conference paper in volume
conference: International Conference on Optimization and Decision Sciences, ODS 2021 (Rome, Italy)
book: Optimization in Artificial Intelligence and Data Sciences - ()

11573/1657394 - 2022 - Community detection in networks: a heuristic version of Girvan Newman algorithm
Bombelli, Ilaria; Di Rocco, Lorenzo - 04b Conference paper in volume
conference: SIS2022 - 51ST SCIENTIFIC MEETING OF THE ITALIAN STATISTICAL SOCIETY (Caserta; Italy)
book: Book of the short papers - (9788891932310)

11573/1671074 - 2022 - DIAMIN: a software library for the distributed analysis of large-scale molecular interaction networks
Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Rombo, Simona E - 01a Journal article
journal: BMC BIOINFORMATICS ([London]: BioMed Central, [2000]-) pp. 1-18 - issn: 1471-2105 - wos: WOS:000881990300001 (2) - scopus: 2-s2.0-85141688064 (2)

11573/1556563 - 2021 - Large Scale Graph Based Network Forensics Analysis
Di Rocco, L.; Ferraro Petrillo, U.; Palini, F. - 04b Conference paper in volume
conference: 25th International Conference on Pattern Recognition Workshops, ICPR 2020 (Milan; Italy)
book: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) - ()

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma