Using a Diverse Test Suite to Assess Large Language Models on Fast Health Care Interoperability Resources Knowledge: Comparative Analysis

Abstract Background: Recent natural language processing breakthroughs, particularly with the emergence of large language models (LLMs), have demonstrated remarkable capabilities on general knowledge benchmarks. However, there is limited data on the performance and understanding...

Bibliographic Details
Main Authors:	Ahmad Idrissi-Yaghir, Kamyar Arzideh, Henning Schäfer, Bahadir Eryilmaz, Mikel Bahn, Yutong Wen, Katarzyna Borys, Eva Hartmann, Cynthia Schmidt, Obioma Pelka, Johannes Haubold, Christoph M Friedrich, Felix Nensa, René Hosch
Format:	Article
Language:	English
Published:	JMIR Publications 2025-08-01
Series:	Journal of Medical Internet Research
Online Access:	https://www.jmir.org/2025/1/e73540

collection	DOAJ
description	Abstract. Background: Recent natural language processing breakthroughs, particularly with the emergence of large language models (LLMs), have demonstrated remarkable capabilities on general knowledge benchmarks. However, there is limited data on the performance and understanding of these models in relation to the Fast Healthcare Interoperability Resources (FHIR) standard. The complexity and specialized nature of FHIR present challenges for LLMs, which are typically trained on broad datasets and may have a limited understanding of the nuances required for domain-specific tasks. Improving health data interoperability can greatly benefit the use of clinical data and interaction with electronic health records. Objective: This study presents the Fast Healthcare Interoperability Resources (FHIR) Workbench, a comprehensive suite of datasets designed to evaluate the ability of LLMs to understand and apply the FHIR standard. Methods: In total, 4 evaluation datasets were created to assess the FHIR knowledge and capabilities of LLMs. These tasks include multiple-choice questions on general FHIR concepts and the FHIR Representational State Transfer (REST) application programming interface, as well as correctly identifying the resource type and generating FHIR resources from unstructured clinical patient notes. In addition, we evaluate open-source LLMs, such as Qwen 2.5 Coder and DeepSeek-V3, and commercial LLMs, including GPT-4o and Gemini 2, on these tasks in a zero-shot setting. To provide context for interpreting LLM performance, a subset of the datasets was human-evaluated by recruiting 6 participants with varying levels of FHIR expertise. Results: Our evaluation across multiple FHIR tasks revealed nuanced performance metrics. Commercial models demonstrated exceptional capabilities, with GPT-4o achieving a 0.9990 F1... Conclusions: This study highlights the competitive performance of both open-source models, such as Qwen and DeepSeek, and commercial models, such as GPT-4o and Gemini, in FHIR-related tasks. While open-source models are advancing rapidly, commercial models still have an advantage for specific, complex tasks. The FHIR Workbench offers a valuable platform for evaluating the capabilities of these models and promoting improvements in health data interoperability.
id	doaj-art-579ffc0b53eb4f8a97d85c1b8982e055
institution	Kabale University
issn	1438-8871
spelling	doaj-art-579ffc0b53eb4f8a97d85c1b8982e055; 2025-08-25T20:54:01Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-08-01; 27; e73540; 10.2196/73540; Ahmad Idrissi-Yaghir (http://orcid.org/0000-0003-1507-9690); Kamyar Arzideh (http://orcid.org/0009-0005-6074-804X); Henning Schäfer (http://orcid.org/0000-0002-4123-0406); Bahadir Eryilmaz (http://orcid.org/0009-0002-8743-4751); Mikel Bahn (http://orcid.org/0009-0002-0866-4023); Yutong Wen (http://orcid.org/0009-0003-8557-6665); Katarzyna Borys (http://orcid.org/0000-0001-6987-6041); Eva Hartmann (http://orcid.org/0009-0000-2600-7217); Cynthia Schmidt (http://orcid.org/0000-0003-1994-0687); Obioma Pelka (http://orcid.org/0000-0001-5156-4429); Johannes Haubold (http://orcid.org/0000-0003-4843-5911); Christoph M Friedrich (http://orcid.org/0000-0001-7906-0038); Felix Nensa (http://orcid.org/0000-0002-5811-7100); René Hosch (http://orcid.org/0000-0003-1760-2342)

Similar Items