fossilbrush: An R package for automated detection and resolution of anomalies in palaeontological occurrence data

Abstract Fossil occurrence databases are indispensable resources to the palaeontological community, yet present unique data cleaning challenges. Many studies devote significant attention to cleaning fossil occurrence data prior to analysis, but such efforts are typically bespoke and difficult to rep...

Bibliographic Details
Main Authors: Joseph T. Flannery‐Sutherland, Nussaïbah B. Raja, Ádám T. Kocsis, Wolfgang Kiessling
Format: Article
Language: English
Published: Wiley 2022-11-01
Series: Methods in Ecology and Evolution
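Subjects: chronostratigraphy; data cleaning; fossil occurrence; palaeobiology database; Sepkoski Compendium; stratigraphic density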
Online Access: https://doi.org/10.1111/2041-210X.13966
author Joseph T. Flannery‐Sutherland
Nussaïbah B. Raja
Ádám T. Kocsis
Wolfgang Kiessling
collection DOAJ
description Abstract Fossil occurrence databases are indispensable resources to the palaeontological community, yet present unique data cleaning challenges. Many studies devote significant attention to cleaning fossil occurrence data prior to analysis, but such efforts are typically bespoke and difficult to reproduce. There are also no standardised methods to detect and resolve errors despite the development of an ecosystem of cleaning tools fuelled by the concurrent growth of neontological occurrence databases. As fossil occurrence databases continue to increase in size, the demand for standardised, automated and reproducible methods to improve data quality will only grow. Here, we present semi‐automated cleaning solutions to address these issues with a new R package fossilbrush. We apply our cleaning protocols to the Paleobiology Database to assess the prevalence of anomalous entries and the efficacy and impact of our methods. We find that anomalies may be effectively resolved by comparison against a published compendium of stratigraphic ranges, improving the stratigraphic quality of the data, and through methods which detect outliers in taxon‐wise occurrence stratigraphic distributions. Despite this, anomalous entries remain prevalent throughout major clades, with often more than 30% of genera in major fossil groups (e.g. bivalves, echinoderms) displaying stratigraphically suspect occurrence records. Our methods provide a way to flag and resolve anomalous taxonomic data before downstream palaeobiological analysis and may also aid in the automation and targeting of future cleaning efforts. We stress, however, that our methods are semi‐automated and are primarily for the detection of potential anomalies for further scrutiny, as full automation should not be a substitute for expert vetting. We note that some of our methods do not rely on external databases for anomaly resolution and so are also applicable to occurrences in neontological databases, expanding the utility of the fossilbrush R package.
format Article
id doaj-art-9523261dd4864d5c81a1b5ee80f6ef5d
institution DOAJ
issn 2041-210X
language English
publishDate 2022-11-01
publisher Wiley
record_format Article
series Methods in Ecology and Evolution
spelling doaj-art-9523261dd4864d5c81a1b5ee80f6ef5d
2025-08-20T03:02:28Z
eng
Wiley
Methods in Ecology and Evolution
2041-210X
2022-11-01
Volume 13, Issue 11, Pages 2404-2418
10.1111/2041-210X.13966
fossilbrush: An R package for automated detection and resolution of anomalies in palaeontological occurrence data
Joseph T. Flannery‐Sutherland (School of Earth Sciences, University of Bristol, Bristol, UK)
Nussaïbah B. Raja (GeoZentrum Nordbayern, Department of Geography and Geosciences, Friedrich‐Alexander University Erlangen‐Nürnberg, Erlangen, Germany)
Ádám T. Kocsis (GeoZentrum Nordbayern, Department of Geography and Geosciences, Friedrich‐Alexander University Erlangen‐Nürnberg, Erlangen, Germany)
Wolfgang Kiessling (GeoZentrum Nordbayern, Department of Geography and Geosciences, Friedrich‐Alexander University Erlangen‐Nürnberg, Erlangen, Germany)
https://doi.org/10.1111/2041-210X.13966
chronostratigraphy
data cleaning
fossil occurrence
palaeobiology database
Sepkoski Compendium
stratigraphic density
title fossilbrush: An R package for automated detection and resolution of anomalies in palaeontological occurrence data
topic chronostratigraphy
data cleaning
fossil occurrence
palaeobiology database
Sepkoski Compendium
stratigraphic density
url https://doi.org/10.1111/2041-210X.13966