Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation


Bibliographic Details
Main Authors: Vasileios Ntinopoulos, Hector Rodriguez Cetina Biefer, Igor Tudorache, Nestoras Papadopoulos, Dragan Odavic, Petar Risteski, Achim Haeussler, Omer Dzemali
Format: Article
Language: English
Published: BMJ Publishing Group 2025-02-01
Series: BMJ Health & Care Informatics
Online Access: https://informatics.bmj.com/content/32/1/e101139.full
collection DOAJ
description
Objectives: We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.
Methods: 50 synthetic medical notes in English, each containing a structured and an unstructured part, were drafted and evaluated by domain experts and subsequently used for LLM prompting. 18 LLMs were evaluated against a baseline transformer-based model. Performance assessment comprised four entity extraction and five binary classification tasks, with a total of 450 predictions for each LLM. LLM-response consistency was assessed over three same-prompt iterations.
Results: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b exhibited an excellent overall accuracy >0.98 (0.995, 0.988, 0.988, 0.988, 0.986, 0.982, 0.982, and 0.982, respectively), significantly higher than the baseline RoBERTa model (0.742). Claude 2.0, Claude 2.1, Claude 3.0 Opus, PaLM 2 chat-bison, GPT 4, Claude 3.0 Sonnet and Llama 3-70b showed a marginally higher, and Gemini Advanced a marginally lower, multiple-run consistency than the baseline RoBERTa model (Krippendorff’s alpha values 1, 0.998, 0.996, 0.996, 0.992, 0.991, 0.989, 0.988, and 0.985, respectively).
Discussion: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b performed best, exhibiting outstanding performance in both entity extraction and binary classification, with highly consistent responses over multiple same-prompt iterations. Their use could make such data available for research and unburden healthcare professionals. Real-data analyses are warranted to confirm their performance in a real-world setting.
Conclusion: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b seem able to reliably extract data from unstructured and semi-structured electronic health records. Further analyses using real data are warranted to confirm their performance in a real-world setting.
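The consistency assessment described in the abstract scores agreement across three same-prompt runs with Krippendorff's alpha. As a minimal sketch of how such a score can be computed (this is not the authors' code; it assumes nominal, categorical responses with no missing values, and the function name and example data are illustrative):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data with no missing values.

    units: list of lists; each inner list holds the values produced for one
    item (e.g. one extraction field of one note) by each run of the model.
    """
    # Build the coincidence matrix as a Counter over ordered value pairs:
    # each unit with m responses contributes every ordered pair, weighted 1/(m-1).
    o = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a unit answered only once contributes no pairable values
        for a, b in permutations(range(m), 2):
            o[(values[a], values[b])] += 1.0 / (m - 1)

    n = sum(o.values())  # total number of pairable values
    marg = Counter()     # marginal totals per category
    for (c, _k), w in o.items():
        marg[c] += w

    # Observed disagreement: coincidence mass off the diagonal.
    d_o = sum(w for (c, k), w in o.items() if c != k)
    # Expected disagreement under chance (nominal metric).
    d_e = sum(marg[c] * marg[k] for c in marg for k in marg if c != k) / (n - 1)
    if d_e == 0:
        return 1.0  # no variation at all counts as perfect agreement
    return 1.0 - d_o / d_e
```

With three identical runs per item the score is 1 (as reported for Claude 2.0 in the Results); any within-item variation pulls it below 1 toward chance-level agreement at 0.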
id doaj-art-bf81e8e5e86f405eb054c1833bee6caa
institution Kabale University
issn 2632-1009
doi 10.1136/bmjhci-2024-101139
volume 32
issue 1
affiliation Vasileios Ntinopoulos: Department of Cardiac Surgery, Municipal Hospital of Zurich – Triemli, Zurich, Switzerland
affiliation Hector Rodriguez Cetina Biefer: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Igor Tudorache: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Nestoras Papadopoulos: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Dragan Odavic: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Petar Risteski: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Achim Haeussler: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Omer Dzemali: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland