MilkOligoCorpus: A semantically annotated resource for knowledge extraction on mammalian milk oligosaccharides.

Bibliographic Details
Main Authors: Mathilde Rumeau, Marine Courtin, Robert Bossy, Clara Sauvion, Valentin Loux, Mouhamadou Ba, Christelle Knudsen, Sylvie Combes, Claire Nédellec, Louise Deléger
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0319729
collection DOAJ
description Milk oligosaccharides are bioactive components that regulate the composition of the neonatal microbiota and exert immunomodulatory functions. Their beneficial effects depend on their structure. Numerous studies have shown intra- and inter-species variation in the structural composition and concentration of these compounds in mammalian milk, yet the biological significance of such variation remains poorly understood. Automated natural language processing methods are promising tools for extracting and gathering structured data from unstructured texts to get insight into the biological significance of milk oligosaccharide variation across mammals. These methods require training and evaluation on manually annotated text corpora. While annotated corpora exist for chemical substances, none are specifically designed for training natural language processing models to extract information on milk oligosaccharides. To this end, we propose MilkOligoCorpus, a new gold standard for milk oligosaccharide composition in mammalian species. MilkOligoCorpus' annotation scheme is a rich entity/relation model designed to describe the diversity pattern of milk oligosaccharides according to female factor variability and to help better understand the structure-related function of milk oligosaccharides. MilkOligoCorpus consists of abstracts (15) and extracts (15) from 20 full text articles indexed by PubMed annotated with entities related to individuals, samples, oligosaccharides and oligosaccharide quantification linked by binary and n-ary relationships. To address data interoperability across disparate publications and databases, four terminological resources were also developed to assign unique identifiers to the entities, supported by external ontologies. This paper presents the creation of the MilkOligoCorpus and its associated schema, along with the development of annotation guidelines and terminological resources. We also present experimental results obtained by baseline information extraction models on the corpus.
id doaj-art-8e71ccfa575c42f29edd5036338ec32f
institution Kabale University
issn 1932-6203
citation PLoS ONE 2025; 20(8): e0319729. https://doi.org/10.1371/journal.pone.0319729