Thesis title: COVID-19 cohort studies data harmonization, standardization, and processing to analyze data-driven evidence of the impact of SARS-CoV-2 variants
During the peak of the COVID-19 pandemic, when lockdown was the only means of containing the spread of the virus, there was a pressing need for data collection, sharing, harmonization, and standardization to generate accurate evidence. The objectives of the European Cohorts of Patients and Schools to Advance Response to Epidemics (EuCARE) project included studying viral variants, mutation patterns, hospitalization patterns, long-term effects on patients, impacts on healthcare workers, and implications for schools. However, the Real-World Data (RWD) collected from the partner sites showed inconsistencies and a lack of harmonization and standardization. As part of the EuCARE project, this study presents a comprehensive cohort data analysis of the EuCARE database using state-of-the-art data curation, semantic mapping, and data harmonization techniques. It aims to ensure that standard coding systems, such as SNOMED CT, LOINC, and NCIt, are applied consistently, so that the EuCARE dataset is consistent and interoperable enough to support meaningful cross-site comparisons and analyses. The EuCARE COVID-19 cohort study, coordinated by EuResist, uses a MariaDB relational database comprising 125 tables of complex structure with a total of 1015 variables. Of these, 476 are identified as unique entities, with 516 standardized by SNOMED CT concepts, 564 by LOINC, and 602 by NCIt terms and definitions. As a result, this study created a data harmonization architecture framework that supports careful semantic mapping, harmonization, and standardization to enhance data integrity and quality for a comprehensive analysis of the phenomena surrounding COVID-19.
Furthermore, this study aligns with the FAIR principles (Findable, Accessible, Interoperable, Reusable) and focuses on wider data accessibility beyond the consortium. The collected EuCARE data is therefore discoverable and accessible, and provides valuable insights to any potential scientific user worldwide. The objective is to maximize the impact and relevance of the data produced within the EuCARE project. To overcome the challenges posed by non-standardized and non-harmonized data and to promote data integrity, this thesis adopts a comprehensive Data Management Plan (DMP), semantic data mapping using Usagi, and intra- and inter-project data harmonization. Usagi is an annotation tool developed by OHDSI to support the automatic mapping of data element names to standardized identifiers. This mapping fosters consistent interpretation of data and meaningful comparisons across study sites, giving researchers a valuable source of insight into how the pandemic has affected people across populations and contexts. This thesis examines the extensive cohort analysis, the extent to which data collected from different sources are comparable and interoperable, and the data platforms used to perform harmonization and standardization, which are among the main objectives of the EuCARE project. The EuCARE study partners have developed a consolidated data network that makes harmonized and standardized data globally accessible and usable. It also promotes ontology-based approaches to data harmonization and standardization in clinical data management.
Keywords: Data Harmonization, Standardization, Semantic Data Mapping, Data Matching, FAIR Principles, Data Metadata, Data Integration, Interoperability, Ontology, SNOMED CT, LOINC, NCIt.