Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation


Bibliographic Details
Main Authors: Vasileios Ntinopoulos, Hector Rodriguez Cetina Biefer, Igor Tudorache, Nestoras Papadopoulos, Dragan Odavic, Petar Risteski, Achim Haeussler, Omer Dzemali
Format: Article
Language: English
Published: BMJ Publishing Group 2025-02-01
Series: BMJ Health & Care Informatics
Online Access: https://informatics.bmj.com/content/32/1/e101139.full
collection DOAJ
description
Objectives: We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.
Methods: 50 synthetic medical notes in English, each containing a structured and an unstructured part, were drafted and evaluated by domain experts and subsequently used for LLM prompting. 18 LLMs were evaluated against a baseline transformer-based model. Performance assessment comprised four entity extraction and five binary classification tasks, with a total of 450 predictions for each LLM. LLM-response consistency was assessed over three same-prompt iterations.
Results: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b exhibited an excellent overall accuracy >0.98 (0.995, 0.988, 0.988, 0.988, 0.986, 0.982, 0.982, and 0.982, respectively), significantly higher than the baseline RoBERTa model (0.742). Claude 2.0, Claude 2.1, Claude 3.0 Opus, PaLM 2 chat-bison, GPT 4, Claude 3.0 Sonnet and Llama 3-70b showed a marginally higher, and Gemini Advanced a marginally lower, multiple-run consistency than the baseline RoBERTa model (Krippendorff’s alpha values 1, 0.998, 0.996, 0.996, 0.992, 0.991, 0.989, 0.988, and 0.985, respectively).
Discussion: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b performed best, exhibiting outstanding performance in both entity extraction and binary classification, with highly consistent responses over multiple same-prompt iterations. Their use could make such data available for research and unburden healthcare professionals. Real-data analyses are warranted to confirm their performance in a real-world setting.
Conclusion: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b seem able to reliably extract data from unstructured and semi-structured electronic health records. Further analyses using real data are warranted to confirm their performance in a real-world setting.
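The consistency assessment described in the abstract scores agreement across three same-prompt runs with Krippendorff's alpha. As a minimal sketch of how such a score can be computed (this is not the authors' code; it assumes nominal, categorical responses with no missing values, and the function name and example data are illustrative):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data with no missing values.

    units: list of lists; each inner list holds the values produced for one
    item (e.g. one extraction field of one note) by each run of the model.
    """
    # Build the coincidence matrix as a Counter over ordered value pairs:
    # each unit with m responses contributes every ordered pair, weighted 1/(m-1).
    o = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a unit answered only once contributes no pairable values
        for a, b in permutations(range(m), 2):
            o[(values[a], values[b])] += 1.0 / (m - 1)

    n = sum(o.values())  # total number of pairable values
    marg = Counter()     # marginal totals per category
    for (c, _k), w in o.items():
        marg[c] += w

    # Observed disagreement: coincidence mass off the diagonal.
    d_o = sum(w for (c, k), w in o.items() if c != k)
    # Expected disagreement under chance (nominal metric).
    d_e = sum(marg[c] * marg[k] for c in marg for k in marg if c != k) / (n - 1)
    if d_e == 0:
        return 1.0  # no variation at all counts as perfect agreement
    return 1.0 - d_o / d_e
```

With three identical runs per item the score is 1 (as reported for Claude 2.0 in the Results); any within-item variation pulls it below 1 toward chance-level agreement at 0.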
id doaj-art-bf81e8e5e86f405eb054c1833bee6caa
institution Kabale University
issn 2632-1009
doi 10.1136/bmjhci-2024-101139
volume 32
issue 1
affiliation Vasileios Ntinopoulos: Department of Cardiac Surgery, Municipal Hospital of Zurich – Triemli, Zurich, Switzerland
affiliation Hector Rodriguez Cetina Biefer: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Igor Tudorache: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Nestoras Papadopoulos: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Dragan Odavic: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Petar Risteski: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Achim Haeussler: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
affiliation Omer Dzemali: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland