Automating the Curation of DNA Barcode Databases for Vascular Plants

Bibliographic Details
Main Authors: Andreas Kolter, Paul Hebert
Format: Article
Language: English
Published: Wiley 2025-05-01
Series: Environmental DNA
Online Access: https://doi.org/10.1002/edn3.70125
Collection: DOAJ
Abstract: Comprehensive, curated, and current DNA barcode reference databases are essential for both the identification of single specimens and for the interpretation of metabarcoding data. In the case of plants, nuclear (ITS) and plastid (rbcL, matK) markers are commonly used together. Because the plastid regions are segments of protein‐coding genes, their alignment and analysis are usually straightforward. By contrast, the assembly and validation of ITS records is considerably more difficult for two reasons: the prevalence of indels and intraindividual sequence variation. This complexity has provoked the development of several workflows to support the curation of reference databases for the internal transcribed spacer (ITS) region for plant barcoding. However, the pipelines used to create these databases lack functionalities which are essential to ensure a solid post‐analytical validation. This paper presents a new workflow to address these shortcomings, with the goal of enhancing the reliability and accuracy of plant barcoding studies. We furthermore demonstrate that clustering of reference databases results in a substantial drop in the fraction of queries that gain a correct species‐level assignment. By contrast, setting an acceptance threshold for identifications, based on the distance between query and match, leads to a meaningful reduction of error rates in incomplete reference databases.
Record ID: doaj-art-e29f8bd5002049a2be5e6b5647e78ed0
Institution: Kabale University
ISSN: 2637-4943
Volume: 7
Issue: 3
Author Affiliations: Andreas Kolter and Paul Hebert, Centre for Biodiversity Genomics, University of Guelph, Guelph, ON, Canada