Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting

Abstract The study assessed the feasibility of using synthetic data to fine-tune various open-source LLMs for free-text to structured-data conversion in radiology, comparing their performance with GPT models. A training set of 3000 synthetic thyroid nodule dictations was generated to train six ope...

Bibliographic Details
Main Authors:	Aakriti Pandita, Angela Keniston, Nikhil Madhuripan
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-07-01
Series:	npj Digital Medicine
Online Access:	https://doi.org/10.1038/s41746-025-01658-3

_version_	1849332481284636672
author	Aakriti Pandita Angela Keniston Nikhil Madhuripan
author_facet	Aakriti Pandita Angela Keniston Nikhil Madhuripan
author_sort	Aakriti Pandita
collection	DOAJ
description	Abstract The study assessed the feasibility of using synthetic data to fine-tune various open-source LLMs for free-text to structured-data conversion in radiology, comparing their performance with GPT models. A training set of 3000 synthetic thyroid nodule dictations was generated to train six open-source models (Starcoderbase-1B, Starcoderbase-3B, Mistral-7B, Llama-3-8B, Llama-2-13B, and Yi-34B). The ACR TI-RADS template was the target model output. Model performance was tested on 50 thyroid nodule dictations from the MIMIC-III patient dataset and compared against the 0-shot, 1-shot, and 5-shot performance of GPT-3.5 and GPT-4. GPT-4 5-shot and Yi-34B showed the highest performance, with no statistically significant difference between the two. Various open models outperformed GPT models with statistical significance. Overall, models trained with synthetic data showed performance comparable to GPT models in structured text conversion in our study. Given their privacy-preserving advantages, open LLMs can be utilized as a viable alternative to proprietary GPT models.
format	Article
id	doaj-art-5d659fa9fc6b47ccb0816c1381a39d03
institution	Kabale University
issn	2398-6352
language	English
publishDate	2025-07-01
publisher	Nature Portfolio
record_format	Article
series	npj Digital Medicine
spelling	doaj-art-5d659fa9fc6b47ccb0816c1381a39d03 | 2025-08-20T03:46:12Z | eng | Nature Portfolio | npj Digital Medicine | 2398-6352 | 2025-07-01 | 8118 | 10.1038/s41746-025-01658-3 | Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting | Aakriti Pandita (Department of Medicine, University of Colorado Anschutz Medical Campus); Angela Keniston (Department of Medicine, University of Colorado Anschutz Medical Campus); Nikhil Madhuripan (Department of Radiology, University of Colorado Anschutz Medical Campus) | https://doi.org/10.1038/s41746-025-01658-3
spellingShingle	Aakriti Pandita Angela Keniston Nikhil Madhuripan Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting npj Digital Medicine
title	Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting
title_sort	synthetic data trained open source language models are feasible alternatives to proprietary models for radiology reporting
url	https://doi.org/10.1038/s41746-025-01658-3
work_keys_str_mv	AT aakritipandita syntheticdatatrainedopensourcelanguagemodelsarefeasiblealternativestoproprietarymodelsforradiologyreporting AT angelakeniston syntheticdatatrainedopensourcelanguagemodelsarefeasiblealternativestoproprietarymodelsforradiologyreporting AT nikhilmadhuripan syntheticdatatrainedopensourcelanguagemodelsarefeasiblealternativestoproprietarymodelsforradiologyreporting

Similar Items