Annotation of biological samples data to standard ontologies with support from large language models

The semantic integration of biological data is hindered by the vast heterogeneity of data sources and their limited semantic formalization. A crucial step in this process is mapping data elements to ontological concepts, which typically involves substantial manual effort. Large Language Models (LLMs...

Bibliographic Details
Main Authors:	Andrea Riquelme-García, Juan Mulero-Hernández, Jesualdo Tomás Fernández-Breis
Format:	Article
Language:	English
Published:	Elsevier 2025-01-01
Series:	Computational and Structural Biotechnology Journal
Subjects:	Bioinformatics; Generative AI; Large language models; Data interoperability; Biological samples
Online Access:	http://www.sciencedirect.com/science/article/pii/S2001037025001837

author	Andrea Riquelme-García; Juan Mulero-Hernández; Jesualdo Tomás Fernández-Breis
collection	DOAJ
description	The semantic integration of biological data is hindered by the vast heterogeneity of data sources and their limited semantic formalization. A crucial step in this process is mapping data elements to ontological concepts, which typically involves substantial manual effort. Large Language Models (LLMs) have demonstrated potential in automating complex language-related tasks and may offer a solution to streamline biological data annotation. This study investigates the utility of LLMs—specifically various base and fine-tuned GPT models—for the automatic assignment of ontological identifiers to biological sample labels. We evaluated model performance in annotating labels to four widely used ontologies: the Cell Line Ontology (CLO), Cell Ontology (CL), Uber-anatomy Ontology (UBERON), and BRENDA Tissue Ontology (BTO). Our dataset was compiled from publicly available, high-quality databases containing biologically relevant sequence information, which suffers from inconsistent annotation practices, complicating integrative analyses. Model outputs were compared against annotations generated by text2term, a state-of-the-art annotation tool. The fine-tuned GPT model outperformed both the base models and text2term in annotating cell lines and cell types, particularly for the CL and UBERON ontologies, achieving a precision of 47–64% and a recall of 88–97%. In contrast, base models exhibited significantly lower performance. These results suggest that fine-tuned LLMs can accelerate and improve the accuracy of biological data annotation. Nonetheless, our evaluation highlights persistent challenges, including variable precision across ontology categories and the continued need for expert curation to ensure annotation validity.
format	Article
id	doaj-art-bd00bad2e2564212a0980ab7e45c19dd
institution	DOAJ
issn	2001-0370
language	English
publishDate	2025-01-01
publisher	Elsevier
record_format	Article
series	Computational and Structural Biotechnology Journal
spelling	doaj-art-bd00bad2e2564212a0980ab7e45c19dd; 2025-08-20T03:21:51Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2025-01-01; volume 27, pages 2155-2167; 10.1016/j.csbj.2025.05.020; Annotation of biological samples data to standard ontologies with support from large language models; Andrea Riquelme-García, Juan Mulero-Hernández, Jesualdo Tomás Fernández-Breis (corresponding author), all at Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, IMIB-Pascual Parrilla, Murcia, 30100, Spain; http://www.sciencedirect.com/science/article/pii/S2001037025001837; Bioinformatics; Generative AI; Large language models; Data interoperability; Biological samples
title	Annotation of biological samples data to standard ontologies with support from large language models
topic	Bioinformatics; Generative AI; Large language models; Data interoperability; Biological samples
url	http://www.sciencedirect.com/science/article/pii/S2001037025001837

Similar Items