Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting

Abstract The study assessed the feasibility of using synthetic data to fine-tune various open-source LLMs for free text to structured data conversion in radiology, comparing their performance with GPT models. A training set of 3000 synthetic thyroid nodule dictations was generated to train six ope...

Bibliographic Details
Main Authors: Aakriti Pandita, Angela Keniston, Nikhil Madhuripan
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: npj Digital Medicine
Online Access: https://doi.org/10.1038/s41746-025-01658-3
author Aakriti Pandita
Angela Keniston
Nikhil Madhuripan
collection DOAJ
description Abstract The study assessed the feasibility of using synthetic data to fine-tune various open-source LLMs for free text to structured data conversion in radiology, comparing their performance with GPT models. A training set of 3000 synthetic thyroid nodule dictations was generated to train six open-source models (Starcoderbase-1B, Starcoderbase-3B, Mistral-7B, Llama-3-8B, Llama-2-13B, and Yi-34B). The ACR TI-RADS template was the target model output. Model performance was tested on 50 thyroid nodule dictations from the MIMIC-III patient dataset and compared against the 0-shot, 1-shot, and 5-shot performance of GPT-3.5 and GPT-4. GPT-4 5-shot and Yi-34B showed the highest performance, with no statistically significant difference between the two. Various open models outperformed GPT models with statistical significance. Overall, models trained with synthetic data showed performance comparable to GPT models in structured text conversion in our study. Given their privacy-preserving advantages, open LLMs can be used as a viable alternative to proprietary GPT models.
format Article
id doaj-art-5d659fa9fc6b47ccb0816c1381a39d03
institution Kabale University
issn 2398-6352
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series npj Digital Medicine
spelling doaj-art-5d659fa9fc6b47ccb0816c1381a39d03; 2025-08-20T03:46:12Z; eng; Nature Portfolio; npj Digital Medicine; 2398-6352; 2025-07-01; 8118; 10.1038/s41746-025-01658-3; Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting; Aakriti Pandita (Department of Medicine, University of Colorado Anschutz Medical Campus); Angela Keniston (Department of Medicine, University of Colorado Anschutz Medical Campus); Nikhil Madhuripan (Department of Radiology, University of Colorado Anschutz Medical Campus); https://doi.org/10.1038/s41746-025-01658-3
title Synthetic data trained open-source language models are feasible alternatives to proprietary models for radiology reporting
url https://doi.org/10.1038/s41746-025-01658-3