Thesis title: Assessment of genetic determinants in Escherichia coli uropathogenic lifestyle and intracellular persistence via optimized k-mer matching of million-genome collections on laptops
Urinary tract infections (UTIs) are among the most common bacterial infections in humans and are primarily caused by uropathogenic Escherichia coli (UPEC). A key challenge in UTI management is UPEC’s ability to persist intracellularly, which enables it to evade host defenses and antibiotic treatments. Although recent studies have revealed that mobile genetic elements are crucial to UPEC persistence, their limited scale underlines the need for more comprehensive genomic analyses to detect potential determinants linked to UTI pathogenesis and UPEC persistence. Meanwhile, recent advances in sequencing technologies have produced vast bacterial genome collections, such as the AllTheBacteria collection (n = 2,440,377), which hold great potential for applications such as rapid diagnostics and epidemiological surveillance at point-of-care (POC). However, the exponential growth of data has outpaced computational performance, limiting our ability to perform real-time searches across million-genome collections, especially on portable devices.
Such searches on portable devices have been made possible by Phylign, a tool combining phylogenetic compression with k-mer matching and alignment. Yet Phylign remains unsuitable for time-sensitive analyses with long and divergent queries, because no established methodology exists to guide the selection, application, parametrization, and calibration of low-level k-mer indexes under phylogenetic compression for a specific biological question.
This study therefore has two primary aims: 1) to develop an end-to-end methodology for rapid k-mer searches across million-genome collections on portable devices using phylogenetic compression; and 2) to investigate the genetic determinants of UPEC lifestyle and intracellular persistence by characterizing the content and the dissemination of the plasmid carried by the persistent prostatic UPEC strain EC73.
Here, we develop and implement a comprehensive methodology for rapid k-mer searches across million-genome collections on portable devices and apply it to elucidate the genetic determinants of UPEC lifestyle and persistence. The methodology proceeds in three steps: 1) the biological question of interest is translated into a k-mer-based problem; 2) k-mer matching is formalized through a combination of three elements, defined as the matching strategy; and 3) the matching strategy is used to guide the selection of the most suitable k-mer indexes for the given application. We apply this framework to the plasmid search problem across million-genome collections using Phylign and evaluate four state-of-the-art k-mer indexes (COBS, Fulgor, Themisto, and MetaGraph), identifying Fulgor as offering the best trade-off between space efficiency and search speed.
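As an illustration of what a k-mer-based formulation of a search can look like, the following minimal Python sketch scores each genome by the fraction of query k-mers it contains and reports genomes above a threshold. It is a generic toy example and not the matching strategy defined in this work: the function names, the value of k, the threshold, and the in-memory dictionary standing in for an index are all hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_match_fraction(query, reference, k):
    """Fraction of distinct query k-mers that also occur in reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: no k-mers to match.
        return 0.0
    shared = query_kmers & kmers(reference, k)
    return len(shared) / len(query_kmers)


def search(query, collection, k=31, threshold=0.4):
    """Return {genome name: fraction} for genomes meeting the threshold.

    `collection` is a plain dict of name -> sequence, a hypothetical
    stand-in for a real k-mer index over a genome collection.
    """
    hits = {}
    for name, genome in collection.items():
        fraction = kmer_match_fraction(query, genome, k)
        if fraction >= threshold:
            hits[name] = fraction
    return hits


if __name__ == "__main__":
    toy_collection = {
        "genome_a": "TTACGTACGTTT",
        "genome_b": "ACGTTTTT",
    }
    print(search("ACGTACGT", toy_collection, k=4, threshold=0.5))
```

In such a formulation, the k-mer length and the threshold are the parameters that would need calibration against the biological question, for example to tolerate divergence between a query plasmid and its relatives in the collection.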
Finally, we characterize the plasmid of the UPEC strain EC73, identifying multiple functionally distinct genes and only marginal dissemination among the 3,776 UPEC genomes in the AllTheBacteria collection. Three EC73 plasmid genes were also identified as potential determinants of UPEC intracellular persistence.
Overall, this work provides the first systematic framework for large-scale k-mer search on portable devices leveraging phylogenetic compression. The enhanced Phylign achieves up to ~4x faster searches and detects matches in 15x more genomes. Compared to LexicMap, the Phylign-Fulgor combination maintains scalability to portable devices while detecting matches across a comparable number of genomes. Using only a standard laptop, the developed method demonstrates at scale that UPEC strains do not universally share common genetic elements from the EC73 plasmid gene pool that determine their lifestyle and intracellular persistence.