From literature to biodiversity data: mining arthropod organismal traits with machine learning
The fields of taxonomy and biodiversity research have witnessed an exponential growth in published literature. This vast corpus of articles holds information on the diverse biological traits of organisms and their ecologies. However, access to and extraction of relevant data from this extensive reso...
| Main Authors: | Joseph Cornelius, Harald Detering, Oscar Lithgow-Serrano, Donat Agosti, Fabio Rinaldi, Robert Waterhouse |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Pensoft Publishers, 2025-08-01 |
| Series: | Biodiversity Data Journal |
| Subjects: | arthropods; biodiversity; natural language process |
| Online Access: | https://bdj.pensoft.net/article/153070/download/pdf/ |
| _version_ | 1849335509949612032 |
|---|---|
| author | Joseph Cornelius; Harald Detering; Oscar Lithgow-Serrano; Donat Agosti; Fabio Rinaldi; Robert Waterhouse |
| collection | DOAJ |
| description | The fields of taxonomy and biodiversity research have witnessed an exponential growth in published literature. This vast corpus of articles holds information on the diverse biological traits of organisms and their ecologies. However, access to and extraction of relevant data from this extensive resource remain challenging. Advances in text and data mining (TDM) and Natural Language Processing (NLP) techniques offer new opportunities for liberating such information from literature. Testing and using such approaches to annotate articles in machine-actionable formats is, therefore, necessary to enable the exploitation of existing knowledge in new biology, ecology and evolution research. Here, we explore the potential of these methods to annotate and extract organismal trait data for the most diverse animal group on Earth, the arthropods. The article processing workflow uses manually curated trait dictionaries with trained NLP models to perform labelling of entities and relationships of thousands of articles. A subset of manually annotated documents facilitated the formal evaluation of the performance of the workflow in terms of entity recognition and normalisation and relationship extraction, highlighting several important technical challenges. The results are made available to the scientific community through an interactive web tool and queryable resource, the ArTraDB Arthropod Trait Database. These methodological explorations provide a framework that could be extended beyond the arthropods, where TDM and NLP approaches applied to the taxonomy and biodiversity literature will greatly facilitate data synthesis studies and literature reviews, the identification of knowledge gaps and biases, as well as the data-informed investigation of ecological and evolutionary trends and patterns. |
| format | Article |
| id | doaj-art-3f5e466919604239b3ad3cac8b8d2066 |
| institution | Kabale University |
| issn | 1314-2828 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Pensoft Publishers |
| record_format | Article |
| series | Biodiversity Data Journal |
| spelling | doaj-art-3f5e466919604239b3ad3cac8b8d2066; 2025-08-20T03:45:14Z; eng; Pensoft Publishers; Biodiversity Data Journal; 1314-2828; 2025-08-01; 13; 1; 30; 10.3897/BDJ.13.e153070; 153070; From literature to biodiversity data: mining arthropod organismal traits with machine learning; Joseph Cornelius (SIB Swiss Institute of Bioinformatics); Harald Detering (SIB Swiss Institute of Bioinformatics); Oscar Lithgow-Serrano (SIB Swiss Institute of Bioinformatics); Donat Agosti (Plazi); Fabio Rinaldi (SIB Swiss Institute of Bioinformatics); Robert Waterhouse (SIB Swiss Institute of Bioinformatics); https://bdj.pensoft.net/article/153070/download/pdf/; arthropods; biodiversity; natural language process |
| title | From literature to biodiversity data: mining arthropod organismal traits with machine learning |
| topic | arthropods; biodiversity; natural language process |
| url | https://bdj.pensoft.net/article/153070/download/pdf/ |