Enhancing the Accuracy of Human Phenotype Ontology Identification: Comparative Evaluation of Multimodal Large Language Models

Abstract Background: Identifying Human Phenotype Ontology (HPO) terms is crucial for diagnosing and managing rare diseases. However, clinicians, especially junior physicians, often face challenges due to the complexity of describing patient phenotypes accurately. Traditional man...

Bibliographic Details
Main Authors:	Wei Zhong, Mingyue Sun, Shun Yao, YiFan Liu, Dingchuan Peng, Yan Liu, Kai Yang, HuiMin Gao, HuiHui Yan, WenJing Hao, YouSheng Yan, ChengHong Yin
Format:	Article
Language:	English
Published:	JMIR Publications 2025-06-01
Series:	Journal of Medical Internet Research
Online Access:	https://www.jmir.org/2025/1/e73233

_version_	1850213837599408128
author	Wei Zhong, Mingyue Sun, Shun Yao, YiFan Liu, Dingchuan Peng, Yan Liu, Kai Yang, HuiMin Gao, HuiHui Yan, WenJing Hao, YouSheng Yan, ChengHong Yin
author_sort	Wei Zhong
collection	DOAJ
description	Abstract Background: Identifying Human Phenotype Ontology (HPO) terms is crucial for diagnosing and managing rare diseases. However, clinicians, especially junior physicians, often face challenges due to the complexity of describing patient phenotypes accurately. Traditional manual search methods using HPO databases are time-consuming and prone to errors. Objective: The aim of the study is to investigate whether the use of multimodal large language models (MLLMs) can improve the accuracy of junior physicians in identifying HPO terms from patient images related to rare diseases. Methods: In total, 20 junior physicians from 10 specialties participated. Each physician evaluated 27 patient images sourced from publicly available literature, with phenotypes relevant to rare diseases listed in the Chinese Rare Disease Catalogue. The study was divided into 2 groups: the manual search group relied on the Chinese Human Phenotype Ontology website, while the MLLM-assisted group used an electronic questionnaire that included HPO terms preidentified by ChatGPT-4o as prompts, followed by a search using the Chinese Human Phenotype Ontology. The primary outcome was the accuracy of HPO identification, defined as the proportion of correctly identified HPO terms compared to a standard set determined by an expert panel. Additionally, the accuracy of outputs from ChatGPT-4o and 2 open-source MLLMs (Llama3.2:11b and Llama3.2:90b) was evaluated using the same criteria, with hallucinations for each model documented separately. Furthermore, participating physicians completed an additional electronic questionnaire regarding their rare disease background to identify factors affecting their ability to accurately describe patient images using standardized HPO terms. Results: A total of 270 descriptions were evaluated per group. The MLLM-assisted group achieved a significantly higher accuracy rate of 67.4% (182/270) compared to 20.4% (55/270) in the manual group (relative risk 3.31, 95% CI 2.58‐4.25; P [...] Conclusions: The integration of MLLMs into clinical workflows significantly enhances the accuracy of HPO identification by junior physicians, offering promising potential to improve the diagnosis of rare diseases and standardize phenotype descriptions in medical research. However, the notable hallucination rate observed in MLLMs underscores the necessity for further refinement and rigorous validation before widespread adoption in clinical practice.
format	Article
id	doaj-art-b3bb60bd36d14eb7a422798531d8bafa
institution	OA Journals
issn	1438-8871
language	English
publishDate	2025-06-01
publisher	JMIR Publications
record_format	Article
series	Journal of Medical Internet Research
spelling	doaj-art-b3bb60bd36d14eb7a422798531d8bafa; 2025-08-20T02:09:03Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-06-01; 27; e73233; e73233; 10.2196/73233; Wei Zhong http://orcid.org/0000-0001-9823-9500; Mingyue Sun http://orcid.org/0009-0003-5012-4538; Shun Yao http://orcid.org/0009-0004-6235-2201; YiFan Liu http://orcid.org/0009-0008-7339-4756; Dingchuan Peng http://orcid.org/0009-0000-9588-6809; Yan Liu http://orcid.org/0000-0003-1698-5783; Kai Yang http://orcid.org/0000-0002-7457-3106; HuiMin Gao http://orcid.org/0009-0004-8874-6022; HuiHui Yan http://orcid.org/0009-0008-2979-9895; WenJing Hao http://orcid.org/0009-0006-8537-0036; YouSheng Yan http://orcid.org/0000-0002-0405-1302; ChengHong Yin http://orcid.org/0000-0002-2503-3285; https://www.jmir.org/2025/1/e73233
title	Enhancing the Accuracy of Human Phenotype Ontology Identification: Comparative Evaluation of Multimodal Large Language Models
url	https://www.jmir.org/2025/1/e73233
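
The headline figures in the abstract (67.4% vs 20.4% accuracy, relative risk 3.31, 95% CI 2.58-4.25) can be checked from the reported counts. The sketch below is only an illustration: the record does not say how the authors computed the interval, so the log-transform (Katz) method used here is an assumption, as is treating the 270 descriptions in each group as independent observations. The function name `relative_risk` is hypothetical.

```python
import math
from statistics import NormalDist


def relative_risk(events_a, n_a, events_b, n_b, confidence=0.95):
    """Return (RR, lower, upper) for group A versus group B.

    Assumption: the interval uses the log-transform (Katz) method,
    which the catalog record does not confirm.
    """
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of ln(RR) for two independent binomial proportions.
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Counts taken from the abstract: 182/270 (MLLM-assisted) vs 55/270 (manual).
rr, lower, upper = relative_risk(182, 270, 55, 270)
print(f"accuracy: {182 / 270:.1%} vs {55 / 270:.1%}")
print(f"RR = {rr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
# prints: accuracy: 67.4% vs 20.4%
# prints: RR = 3.31 (95% CI 2.58-4.25)
```

Under these assumptions the computed values agree with the abstract to two decimal places. Because each physician contributed several descriptions, an analysis that accounts for clustering could yield a different interval.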


Similar Items