Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study

Abstract

Background: Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, especially for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI’s ChatGPT.

Objective: This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a text classification task related to hospital staff surveys.

Methods: The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. A 2-step process of data augmentation and text classification was conducted. Four generative LLMs were used to produce synthetic data similar to the survey reports, and a separate set of 3 classifier LLMs was then used to classify the augmented text into thematic categories. Performance was evaluated on the classification task.

Results: The best-performing combination of generative LLM, temperature, number of synthetic cases, and classifier was augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using the Robustly Optimized BERT Pretraining Approach (RoBERTa) as the classifier, which achieved an average area under the receiver operating characteristic curve (AUC) of 0.87 (SD 0.02). These results demonstrate that open-source LLMs can enhance text classifiers’ performance on small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices.

Conclusions: The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field.
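As a concrete illustration of the augmentation step described above, the following is a minimal Python sketch. The checkpoint name (huggyllama/llama-7b), the few-shot prompt wording, and the generation length are illustrative assumptions rather than the paper's actual setup; only the temperature of 0.7 and the target of 100 synthetic cases come from the abstract.

# Sketch of step 1: generating synthetic survey narratives with an open-source LLM.
# Assumptions (not from the paper): the "huggyllama/llama-7b" checkpoint, the prompt
# wording, and the few-shot format. Temperature 0.7 and 100 augments follow the abstract.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # assumed checkpoint; any local LLaMA/Alpaca weights would do

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")

def generate_augments(seed_narratives, n_augments=100, temperature=0.7, max_new_tokens=200):
    """Generate synthetic narratives that resemble the seed survey responses."""
    prompt = (
        "The following are reports from radiology staff describing everyday "
        "adaptations during the early COVID-19 pandemic.\n\n"
        + "\n\n".join(f"Report: {s}" for s in seed_narratives)
        + "\n\nReport:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    synthetic = []
    for _ in range(n_augments):
        output = model.generate(
            **inputs,
            do_sample=True,            # sampling so repeated calls give different narratives
            temperature=temperature,   # 0.7, per the best-performing configuration
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        synthetic.append(text.split("\n\n")[0].strip())  # keep only the first generated report
    return synthetic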

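The second step, thematic classification with RoBERTa, could then look roughly like the sketch below. The binary label scheme, the roberta-base checkpoint, the placeholder texts, and the training hyperparameters are assumptions for illustration; AUC is computed with scikit-learn's roc_auc_score to mirror the abstract's evaluation metric.

# Sketch of step 2: fine-tuning a RoBERTa classifier on original plus synthetic
# narratives and scoring with AUC. The label scheme, placeholder texts, checkpoint,
# and hyperparameters below are assumptions for illustration only.
import numpy as np
from datasets import Dataset
from sklearn.metrics import roc_auc_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical stand-ins for the real survey narratives and the synthetic augments.
real_texts = ["We rearranged the reading room so staff could keep their distance.",
              "We switched to phone handoffs between shifts."]
real_labels = [1, 0]  # 1 = narrative mentions the thematic category of interest
synthetic_texts = ["Technologists staggered their shifts to reduce contact."]  # from step 1
synthetic_labels = [1]
held_out_texts = ["We set up a separate scanning workflow for suspected COVID-19 patients.",
                  "Supply orders were consolidated into a single weekly request."]
held_out_labels = [1, 0]

def make_dataset(texts, labels):
    """Tokenize texts into a Hugging Face Dataset ready for the Trainer."""
    ds = Dataset.from_dict({"text": texts, "label": labels})
    return ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                          padding="max_length", max_length=256),
                  batched=True)

def compute_auc(eval_pred):
    logits, labels = eval_pred
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over 2 classes
    return {"auc": roc_auc_score(labels, probs[:, 1])}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-theme-classifier",
                           num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=make_dataset(real_texts + synthetic_texts, real_labels + synthetic_labels),
    eval_dataset=make_dataset(held_out_texts, held_out_labels),
    compute_metrics=compute_auc,
)
trainer.train()
print(trainer.evaluate())  # reports "eval_auc"; the paper's best configuration averaged 0.87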
Bibliographic Details
Main Authors: Carl Ehrett (http://orcid.org/0000-0002-0711-0347), Sudeep Hegde (http://orcid.org/0000-0001-9711-8459), Kwame Andre (http://orcid.org/0009-0002-9037-7428), Dixizi Liu (http://orcid.org/0009-0004-2422-5463), Timothy Wilson (http://orcid.org/0009-0007-1711-6367)
Format: Article
Language: English
Published: JMIR Publications, 2024-11-01
Series: JMIR Medical Education
ISSN: 2369-3762
DOI: 10.2196/51433
Online Access: https://mededu.jmir.org/2024/1/e51433