Thesis title: Development of modern databases for the National Checklist of Italian Fauna
The term ‘Anthropocene’ is increasingly used to define the current geological epoch, characterised by large-scale human-induced environmental changes that are driving biodiversity loss at an unprecedented rate, comparable to that of the Earth’s five previous mass extinctions. Among the ecosystems affected, oceans are particularly significant because of their vast extent, the exceptionally high biodiversity they harbour, and the crucial ecosystem services they provide to humanity. However, much of the marine realm remains unexplored, raising concerns that species are disappearing before they can even be discovered.
Marine molluscs are among the most diverse groups of invertebrates: they play key roles in marine ecosystems and provide important services to humans. Despite this, their biodiversity is increasingly threatened by multiple human-induced pressures. Coastal development, bottom trawling and dredging disrupt benthic habitats that are critical for molluscs, while pollution from industrial waste, oil spills, agricultural runoff and microplastics introduces harmful substances that affect their survival and reproduction. Unsustainable harvesting further depletes wild populations of commercially relevant species, while invasive species add competition and predation pressures. Climate change poses additional challenges, particularly through ocean warming and acidification. Rising sea temperatures alter mollusc metabolism, phenology and reproductive cycles, and force some species to shift their ranges in search of suitable conditions. Ocean acidification, which results from the increased absorption of atmospheric CO₂, reduces carbonate ion availability, making it difficult for molluscs to build and maintain their calcium carbonate shells, especially during larval development, and thus lowering survival rates. Extreme weather events, such as prolonged drought, can further disrupt the persistence of molluscan assemblages. Moreover, marine molluscs remain underrepresented in conservation efforts. Many species suffer from data deficiency, with little information available on their population trends, ecological requirements or threats. Conservation funding and policy initiatives often prioritise charismatic or economically important species, leaving many molluscs overlooked.
Italy hosts a particularly rich marine malacofauna: 1,777 species have been recorded, 133 of them endemic, representing approximately 71.5% of the Mediterranean Sea’s mollusc diversity (Renda et al., 2022). Only 0.2% of these species are currently protected by the Habitats Directive (92/43/EEC), and a further 0.6% are protected exclusively by the Bern and Barcelona Conventions, which paints a concerning picture for the conservation of this taxonomic group. More generally, the entire Mediterranean marine malacofauna is neglected: only 2% of its species have been evaluated against the IUCN Red List criteria, and 43% of those evaluated are still considered Data Deficient. This situation underscores the urgent need to improve both the quality and the quantity of knowledge about species distributions over space and time, which is essential for effective conservation decisions at a national level.
Italy has a long tradition of malacological research. Over time, a significant amount of data has been collected and stored in public and private Natural History Collections (NHCs) and published in the literature, most notably in the Bollettino Malacologico of the Italian Society of Malacology, which has accumulated a large amount of valuable taxonomic, ecological and biodiversity data since 1979. However, much of this information is ‘frozen’ in formats that cannot be readily reused: only a small fraction of the Italian malacological NHCs is digitised, while published data are often shared in formats that are human-readable but not machine-readable. To be usable for biodiversity conservation and research, this information must go through a mobilisation process that transforms it into a standardised, georeferenced and integrable format. To date, the Checklist of the Italian Fauna is the only comprehensive effort to mobilise geographic and taxonomic knowledge about marine molluscs in Italy by integrating data from various sources (e.g. literature, NHCs, expert knowledge) at a national scale. Its taxonomic information is highly reliable, having been carefully examined and refined by specialists, which makes it a crucial taxonomic reference for Italy. Its geographical information, although of high quality for the same reasons, is generalised at the level of biogeographical sectors.
The present thesis aims to develop a pipeline to mobilise data from ‘frozen’ sources, making them accessible in a standardised and integrable format, and to demonstrate their application in biodiversity research and conservation. This overarching goal is divided into three specific objectives, which form the three core chapters of this study:
1. To produce a standardised and georeferenced dataset by creating a databasing pipeline that mobilises pre-existing data on Italian marine malacofauna, integrating information from NHCs and literature.
2. To analyse biases and knowledge gaps affecting currently available opportunistic biodiversity data of marine gastropods in Italy, integrating the mobilised dataset with existing open-access repositories.
3. To model past, present, and future species distributions under different climate scenarios for selected species of conservation interest occurring in Italy, leveraging mobilised data to project potential biodiversity shifts.
To compile a comprehensive dataset on Italian marine molluscs, data were gathered from two main sources: NHCs and literature. Data from NHCs were obtained through direct requests to private collectors and institutions, while literature data were retrieved through a systematic search of the databases Scopus and Web of Science, ensuring broad coverage of relevant scientific publications. Additionally, targeted searches were performed in the specialised journals Iberus, Bollettino Malacologico and Alleryana. Once collected, data from both sources were merged and formatted following the Darwin Core schema (Darwin Core Task Group, 2009), a widely adopted standard for biodiversity data exchange. The Biodiversity Data Cleaning R package was used to check formatting. For taxonomic consistency across sources, species names were aligned with the authoritative classification of the World Register of Marine Species, using the LifeWatch taxon-match web service. Georeferencing followed the point-radius method and was carried out with the GEOLocate web-based collaborative client. Dates were harmonised with the R package ‘lubridate’, which was used to parse them and convert them to the ISO 8601 format, giving a consistent temporal representation across the dataset.
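As a minimal illustration of the date harmonisation step, the sketch below converts heterogeneous label dates to ISO 8601 strings. It is written in Python with the standard library only, not with the ‘lubridate’ package used in the thesis, and the list of input formats is a hypothetical example rather than the set of formats actually encountered. Dates with reduced precision are kept at that precision (year or year-month) instead of being padded with invented days.

```python
from datetime import datetime

# Hypothetical input formats, tried in order. Each is paired with the
# ISO 8601 precision it supports: full date, year-month, or year only.
# "%B" matches month names in the active locale (English by default).
FORMATS = [
    ("%d/%m/%Y", "%Y-%m-%d"),
    ("%d.%m.%Y", "%Y-%m-%d"),
    ("%d %B %Y", "%Y-%m-%d"),
    ("%Y-%m-%d", "%Y-%m-%d"),
    ("%m/%Y", "%Y-%m"),
    ("%B %Y", "%Y-%m"),
    ("%Y", "%Y"),
]


def to_iso8601(raw):
    """Return an ISO 8601 string for a raw date, or None if no format matches."""
    text = raw.strip()
    for fmt_in, fmt_out in FORMATS:
        try:
            return datetime.strptime(text, fmt_in).strftime(fmt_out)
        except ValueError:
            continue  # this format does not fit; try the next one
    return None  # left for manual checking rather than guessed


if __name__ == "__main__":
    for value in ["12/05/1987", "05/1987", "1987", "summer 1987"]:
        print(repr(value), "->", to_iso8601(value))
```

Returning None for unparseable values keeps ambiguous entries visible for manual review instead of silently coercing them.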
Subsequently, we used the occurrences of the class Gastropoda in the mobilised dataset, integrated with those from the Global Biodiversity Information Facility (GBIF), to assess: 1) the cumulative contribution of new areas documented by data from NHCs, literature and citizen science over time; 2) how well the data represent Italian environments, habitat types and the network of Marine Protected Areas (MPAs); 3) the main drivers of spatial biases; and 4) data usability in long-term biodiversity research and monitoring. We split the dataset into three main data sources (i.e. NHCs, literature and citizen science) based on the dwc:basisOfRecord field. We divided the study area into a grid of 10 km × 10 km cells and computed how many cells each source covered and the cumulative number of cells covered over time. Since the data mobilisation effort could be unequally distributed among sources, we produced two cumulative curves: one using the data as they are, and the other weighting each data type by its mobilisation effort. To assess whether the occurrences are representative of Italy and to identify knowledge gaps, an equal number of random background points was generated in QGIS for comparison with the actual occurrences. To test whether random background points and real occurrences are located in significantly different environments, we performed chi-square tests and t-tests. We used an ensemble modeling approach to identify which environmental variables predict where opportunistic occurrences of marine gastropods are located in Italy, considering bathymetry, density of coastal roads, proximity to harbour areas, proximity to diving spots and whether each point was inside or outside an MPA. In this framework, the people who make observations or collect specimens are the ‘species’ whose probability of occurrence we model.
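The cumulative coverage curve described above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the procedure actually used: coordinates are assumed to be already projected to metres, the record layout (year, x, y) is hypothetical, and the effort-weighted variant of the curve is not reproduced because its weighting is not specified here.

```python
import math

CELL_SIZE_M = 10_000  # 10 km x 10 km grid cells


def cell_id(x_m, y_m, size=CELL_SIZE_M):
    """Map projected coordinates (metres) to the index of their grid cell."""
    return (math.floor(x_m / size), math.floor(y_m / size))


def cumulative_cells(records):
    """Cumulative number of distinct cells covered up to each year.

    records: iterable of (year, x_m, y_m) tuples (hypothetical layout).
    Returns a list of (year, n_cells_covered_so_far), sorted by year.
    """
    seen = set()
    curve = {}
    for year, x_m, y_m in sorted(records):  # tuples sort by year first
        seen.add(cell_id(x_m, y_m))
        curve[year] = len(seen)  # last value per year is the cumulative total
    return sorted(curve.items())


if __name__ == "__main__":
    demo = [
        (1950, 500, 500),
        (1950, 9_000, 100),      # same cell as the first record
        (1980, 15_000, 500),     # new cell
        (2010, 15_500, 900),     # already covered in 1980
        (2010, 25_000, 25_000),  # new cell
    ]
    print(cumulative_cells(demo))
```

Running the function separately on the records of each source gives one curve per source, which can then be compared.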
Lastly, to identify locations with a consistent baseline of sampling over time, all presence coordinates were clustered using a density-based clustering algorithm (Ester et al., 1996) with the R package ‘dbscan’ (Hahsler et al., 2019). Outliers were removed from the results and, in the remaining clusters, temporal coverage was assessed by calculating the minimum mean time between each record and the temporally closest one.
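One possible reading of this temporal coverage metric is sketched below in Python. It assumes that cluster membership has already been obtained (for example from a DBSCAN run) and that outliers have been dropped. For each cluster it takes, for every record, the gap in years to the temporally closest other record, and averages those gaps. This interpretation and the input layout are assumptions made for illustration only.

```python
def mean_nearest_gap(years):
    """Mean gap (in years) between each record and its temporally closest neighbour.

    Returns None for clusters with fewer than two records.
    """
    if len(years) < 2:
        return None
    ordered = sorted(years)
    gaps = []
    for i, year in enumerate(ordered):
        # In a sorted list the closest record in time is an adjacent one.
        left = year - ordered[i - 1] if i > 0 else float("inf")
        right = ordered[i + 1] - year if i < len(ordered) - 1 else float("inf")
        gaps.append(min(left, right))
    return sum(gaps) / len(gaps)


def temporal_coverage(clusters):
    """clusters: dict mapping a cluster label to the list of record years in it."""
    return {label: mean_nearest_gap(years) for label, years in clusters.items()}


if __name__ == "__main__":
    demo = {"A": [1950, 1952, 1990], "B": [2001, 2001, 2003], "C": [1975]}
    print(temporal_coverage(demo))
```

Under this reading, a cluster with a large mean gap has records that are isolated in time, which is what limits its use as a long-term baseline.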
Using the mobilised data, we modeled the past, present and future potential distribution of four species of conservation interest occurring in Italy that differ in their degree of thermophily (Luria lurida, Zonaria pyrum, Naria spurca and Talisman scrobilator), producing two sets of ensemble Species Distribution Models (SDMs). In the first set we used a classic modeling approach, pooling all available occurrences and relating them to four environmental variables (i.e. acidity, primary productivity, salinity and temperature) under current conditions, using environmental layers from the Bio-ORACLE online database. Layers for the decade 2010-2020 were used as present-day conditions, while layers for the decades 2050-2060 and 2090-2100 were used as future conditions, under both the most optimistic and the most pessimistic scenarios.
We then built a second set of SDMs using a multitemporal approach, selecting only occurrences with temporal information and associating each of them with the environmental conditions of the decade in which it was recorded. In various studies this method has proved better at capturing the full environmental niche of a species, but it is often impossible to apply because access to historical marine environmental data is limited. Since the classic models identified temperature as the most important variable for all species, the multitemporal models used temperature (interpolated from the historical coastal land temperatures of the CHELSAcruts database) in combination with bathymetry, which was selected as the only potentially important time-stable variable for molluscs in order to reduce the models' dependence on temperature. This method rests on the premise that sea surface temperatures are correlated with air temperatures, and on the rough assumption that the thermocline has remained constant from 1900 until today. With this approach, it was also possible to model the past distribution of the species using temperature conditions simulated for the decades 1900-1910, 1930-1940 and 1970-1980. All variables were tested for collinearity using the Variance Inflation Factor, and all modeling analyses were performed with the ‘biomod2’ R package, using an ensemble of three non-parametric algorithms: Maxent, Random Forest and Generalized Boosted Model.
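For the two-variable case (such as temperature and bathymetry), the Variance Inflation Factor reduces to 1 / (1 − r²), where r is the Pearson correlation between the two predictors. The Python sketch below illustrates only this special case with made-up values. With more than two predictors, each variable has to be regressed on all the others, which is not shown here, and this is not the code used in the thesis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF shared by both predictors when the model has exactly two of them."""
    r_squared = pearson_r(xs, ys) ** 2
    if r_squared >= 1.0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


if __name__ == "__main__":
    # Hypothetical values sampled at the same points.
    temperature = [14.2, 15.1, 16.8, 17.5, 18.9]
    bathymetry = [-120.0, -35.0, -80.0, -10.0, -55.0]
    print(round(vif_two_predictors(temperature, bathymetry), 3))
```

A VIF close to 1 indicates that the two predictors carry largely independent information; the threshold above which a variable is discarded is a modelling choice.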
The two sets of models were then compared in two ways: by fitting a linear model for each pair of probabilistic projections and mapping the residuals to show where one prediction deviates from the other, and by computing similarity indices.
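The comparison can be sketched as follows, in Python and on flattened lists of per-cell suitability values. The residuals come from a simple least-squares line relating one projection to the other. Schoener's D is used here only as an example of a similarity index, since the specific indices are not named in this summary, and the function names and values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Per-cell residuals of y against the fitted line; these can be mapped."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def schoener_d(p, q):
    """Schoener's D between two suitability surfaces (1 = identical, 0 = no overlap)."""
    sum_p, sum_q = sum(p), sum(q)
    # Each surface is rescaled to sum to 1 before the comparison.
    return 1.0 - 0.5 * sum(abs(a / sum_p - b / sum_q) for a, b in zip(p, q))


if __name__ == "__main__":
    # Hypothetical suitability values for the same five grid cells.
    classic = [0.10, 0.35, 0.60, 0.80, 0.20]
    multitemporal = [0.15, 0.30, 0.70, 0.75, 0.40]
    print([round(r, 3) for r in residuals(classic, multitemporal)])
    print(round(schoener_d(classic, multitemporal), 3))
```

Cells with large positive or negative residuals are those where the second projection is higher or lower than expected from the first, which is the information the residual maps convey.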
In total, we mobilised 44,096 records of 1,513 marine mollusc species in Italy, producing a quality-checked, standardised dataset that is currently accessible on Zenodo and through the Global Biodiversity Information Facility. The databasing process highlighted the main problems and mistakes that affect taxonomic, temporal and geographical information in frozen data, and allowed us to produce guidelines for institutions and individuals who want to start databasing their data, in order to limit management problems and create datasets that can be effectively shared and reused. Problems common to all three dimensions are the presence of multiple pieces of information within the same cell (e.g. location + date of occurrence; scientific name + author + uncertainty of identification), the use of special characters and punctuation, and inconsistent formatting within the same column. We therefore recommend entering only one piece of information per cell, avoiding special characters and punctuation as much as possible, and choosing a format for each column and following it consistently, preferably one based on an established standard such as Darwin Core. At least for personnel who work on databasing on behalf of institutions (such as museums and universities), training is needed on standards for sharing biodiversity data, on standardisation and georeferencing tools, methods and protocols and, where needed, on the correct use of Excel.
NHCs and literature are the sources that contribute the most to documenting Italian seas, while citizen science data play a marginal role. The available open-access distributional data on marine gastropods are still not representative of Italian marine environments, with deep-sea habitats and unprotected sites being underrepresented. Bathymetry appears to be the main driver of spatial bias for marine gastropod data in Italy, but proximity to diving spots and the presence of MPAs also play an important role, with occurrences biased towards areas of diving interest and towards MPAs. Moreover, the data appear to be clustered near research centres, universities and natural history museums. Most of these high-density locations show very fragmented coverage over time: in 79% of the locations, a given record is at least one decade away from the temporally nearest one, which limits their use in long-term ecological studies.
Lastly, we identified a continuous but slight expansion of the potential range of L. lurida, Z. pyrum, N. spurca and T. scrobilator from 1900 to 1970, and a sharper expansion into higher latitudes between 1970 and today. This pattern aligns with historical records of sea warming and is consistent with a certain degree of thermophily in all four species. The potential range centroids of all species tended to shift northward as a result of local expansions, contractions or fragmentation of potentially suitable areas. All projections under fossil-fueled scenarios revealed a drastic reduction in suitable areas, with the complete disappearance of Z. pyrum in the 2090-2100 scenarios. This result indicates that, although thermophilous species can thrive under mild temperature increases (like those that have taken place from the beginning of the 1900s to the present day), they may undergo drastic range contractions under severe climate change (like that predicted in future scenarios).
In conclusion, although the opportunistic data on Italian marine malacological biodiversity are affected by various spatio-temporal knowledge gaps, understanding bias patterns and drivers helps us to better direct future sampling efforts. In addition to new data collections, a greater effort in mobilising data from NHCs and literature following standard pipelines and guidelines can fill in the gaps, providing insights for the conservation of neglected and data-deficient species.