From literature to biodiversity data: mining arthropod organismal traits with machine learning

Bibliographic Details
Main Authors: Joseph Cornelius, Harald Detering, Oscar Lithgow-Serrano, Donat Agosti, Fabio Rinaldi, Robert Waterhouse
Format: Article
Language: English
Published: Pensoft Publishers 2025-08-01
Series: Biodiversity Data Journal
Online Access: https://bdj.pensoft.net/article/153070/download/pdf/
Collection: DOAJ
Description: The fields of taxonomy and biodiversity research have witnessed an exponential growth in published literature. This vast corpus of articles holds information on the diverse biological traits of organisms and their ecologies. However, access to and extraction of relevant data from this extensive resource remain challenging. Advances in text and data mining (TDM) and Natural Language Processing (NLP) techniques offer new opportunities for liberating such information from literature. Testing and using such approaches to annotate articles in machine-actionable formats is, therefore, necessary to enable the exploitation of existing knowledge in new biology, ecology and evolution research. Here, we explore the potential of these methods to annotate and extract organismal trait data for the most diverse animal group on Earth, the arthropods. The article processing workflow uses manually curated trait dictionaries with trained NLP models to perform labelling of entities and relationships of thousands of articles. A subset of manually annotated documents facilitated the formal evaluation of the performance of the workflow in terms of entity recognition and normalisation and relationship extraction, highlighting several important technical challenges. The results are made available to the scientific community through an interactive web tool and queryable resource, the ArTraDB Arthropod Trait Database. These methodological explorations provide a framework that could be extended beyond the arthropods, where TDM and NLP approaches applied to the taxonomy and biodiversity literature will greatly facilitate data synthesis studies and literature reviews, the identification of knowledge gaps and biases, as well as the data-informed investigation of ecological and evolutionary trends and patterns.
Record ID: doaj-art-3f5e466919604239b3ad3cac8b8d2066
Institution: Kabale University
ISSN: 1314-2828
DOI: 10.3897/BDJ.13.e153070
Volume: 13
Author Affiliations: Joseph Cornelius (SIB Swiss Institute of Bioinformatics); Harald Detering (SIB Swiss Institute of Bioinformatics); Oscar Lithgow-Serrano (SIB Swiss Institute of Bioinformatics); Donat Agosti (Plazi); Fabio Rinaldi (SIB Swiss Institute of Bioinformatics); Robert Waterhouse (SIB Swiss Institute of Bioinformatics)
Subjects: arthropods, biodiversity, natural language process