Targeted generative data augmentation for automatic metastases detection from free-text radiology reports

Automatic identification of metastatic sites in cancer patients from electronic health records is a challenging yet crucial task with significant implications for diagnosis and treatment. In this study, we demonstrate how advancements in natural language processing, namely the instruction-following...

Bibliographic Details
Main Authors: Maede Ashofteh Barabadi, Xiaodan Zhu, Wai Yip Chan, Amber L. Simpson, Richard K. G. Do
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-02-01
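Series: Frontiers in Artificial Intelligence
Subjects: synthetic data generation; targeted data augmentation; metastases detection; natural language processing; large language models; free-text radiology report
Online Access: https://www.frontiersin.org/articles/10.3389/frai.2025.1513674/full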
_version_ 1832087187710017536
author Maede Ashofteh Barabadi
Xiaodan Zhu
Wai Yip Chan
Amber L. Simpson
Richard K. G. Do
author_facet Maede Ashofteh Barabadi
Xiaodan Zhu
Wai Yip Chan
Amber L. Simpson
Richard K. G. Do
author_sort Maede Ashofteh Barabadi
collection DOAJ
description Automatic identification of metastatic sites in cancer patients from electronic health records is a challenging yet crucial task with significant implications for diagnosis and treatment. In this study, we demonstrate how advancements in natural language processing, namely the instruction-following capability of recent large language models and extensive model pretraining, make it possible to automate metastases detection from radiology report texts with a limited amount of gold-labeled data. Specifically, we prompt Llama3, an open-source instruction-tuned large language model, to generate synthetic training data to expand our limited labeled data and adapt BERT, a small pretrained language model, to the task. We further investigate three targeted data augmentation techniques that selectively expand the original training samples, leading in most cases to performance comparable or superior to vanilla data augmentation while being substantially more computationally efficient. In our experiments, data augmentation improved the average F1-score by 2.3, 3.5, and 3.9 points for lung, liver, and adrenal glands, respectively, the organs for which we had access to expert-annotated data. This observation suggests that Llama3, which has not been specifically tailored to this task or clinical data in general, can generate high-quality synthetic data through paraphrasing in the clinical context. We also compare metastasis identification accuracy between models utilizing institutionally standardized reports vs. non-structured reports, which complicate the extraction of relevant information, and show how including patient history with a customized model architecture narrows the gap between those two setups from 7.3 to 4.5 points on F1-score under LoRA tuning. Our work delivers a broadly applicable solution with remarkable performance that does not require model customization for each institution, making large-scale, low-cost spatio-temporal cancer progression pattern extraction possible.
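The targeted augmentation idea in the description above can be illustrated with a minimal Python sketch. This record does not say which three targeting techniques the authors used, so the two selection rules below (`select_minority_class`, `select_misclassified`) are hypothetical stand-ins, as are the prompt wording and every function name. The `paraphrase` argument is a placeholder for a call to an instruction-tuned model such as Llama3; no real model API is assumed here.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (report text, label) where 1 = metastasis present, 0 = absent
Sample = Tuple[str, int]


def select_minority_class(samples: List[Sample]) -> List[Sample]:
    """Hypothetical targeting rule: keep only samples with the rarer label."""
    counts = Counter(label for _, label in samples)
    minority = min(counts, key=counts.get)
    return [s for s in samples if s[1] == minority]


def select_misclassified(samples: List[Sample],
                         predict: Callable[[str], int]) -> List[Sample]:
    """Hypothetical targeting rule: keep samples a baseline model gets wrong."""
    return [(text, label) for text, label in samples if predict(text) != label]


def build_paraphrase_prompt(report: str) -> str:
    """Illustrative instruction prompt; the wording is not from the paper."""
    return (
        "Paraphrase the following radiology report without changing "
        "any clinical finding.\n\n"
        f"Report: {report}"
    )


def augment(samples: List[Sample],
            selector: Callable[[List[Sample]], List[Sample]],
            paraphrase: Callable[[str], str]) -> List[Sample]:
    """Paraphrase only the selected samples; each copy keeps its original label."""
    targets = selector(samples)
    synthetic = [(paraphrase(build_paraphrase_prompt(text)), label)
                 for text, label in targets]
    return samples + synthetic


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Vanilla augmentation corresponds to a selector that returns every sample. A targeted selector sends fewer reports to the generator, which is where the computational saving mentioned in the description would come from under these assumptions.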
format Article
id doaj-art-e68979adf89f44ecb30b0d464faffb48
institution Kabale University
issn 2624-8212
language English
publishDate 2025-02-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Artificial Intelligence
spelling doaj-art-e68979adf89f44ecb30b0d464faffb48
2025-02-06T07:10:07Z
eng
Frontiers Media S.A.
Frontiers in Artificial Intelligence
2624-8212
2025-02-01
8
10.3389/frai.2025.1513674
1513674
Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
Maede Ashofteh Barabadi (Ingenuity Labs Research Institute, Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada)
Xiaodan Zhu (Ingenuity Labs Research Institute, Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada)
Wai Yip Chan (Ingenuity Labs Research Institute, Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada)
Amber L. Simpson (School of Computing and Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada)
Richard K. G. Do (Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY, United States)
Automatic identification of metastatic sites in cancer patients from electronic health records is a challenging yet crucial task with significant implications for diagnosis and treatment. In this study, we demonstrate how advancements in natural language processing, namely the instruction-following capability of recent large language models and extensive model pretraining, make it possible to automate metastases detection from radiology report texts with a limited amount of gold-labeled data. Specifically, we prompt Llama3, an open-source instruction-tuned large language model, to generate synthetic training data to expand our limited labeled data and adapt BERT, a small pretrained language model, to the task. We further investigate three targeted data augmentation techniques that selectively expand the original training samples, leading in most cases to performance comparable or superior to vanilla data augmentation while being substantially more computationally efficient. In our experiments, data augmentation improved the average F1-score by 2.3, 3.5, and 3.9 points for lung, liver, and adrenal glands, respectively, the organs for which we had access to expert-annotated data. This observation suggests that Llama3, which has not been specifically tailored to this task or clinical data in general, can generate high-quality synthetic data through paraphrasing in the clinical context. We also compare metastasis identification accuracy between models utilizing institutionally standardized reports vs. non-structured reports, which complicate the extraction of relevant information, and show how including patient history with a customized model architecture narrows the gap between those two setups from 7.3 to 4.5 points on F1-score under LoRA tuning. Our work delivers a broadly applicable solution with remarkable performance that does not require model customization for each institution, making large-scale, low-cost spatio-temporal cancer progression pattern extraction possible.
https://www.frontiersin.org/articles/10.3389/frai.2025.1513674/full
synthetic data generation
targeted data augmentation
metastases detection
natural language processing
large language models
free-text radiology report
spellingShingle Maede Ashofteh Barabadi
Xiaodan Zhu
Wai Yip Chan
Amber L. Simpson
Richard K. G. Do
Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
Frontiers in Artificial Intelligence
synthetic data generation
targeted data augmentation
metastases detection
natural language processing
large language models
free-text radiology report
title Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
title_full Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
title_fullStr Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
title_full_unstemmed Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
title_short Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
title_sort targeted generative data augmentation for automatic metastases detection from free text radiology reports
topic synthetic data generation
targeted data augmentation
metastases detection
natural language processing
large language models
free-text radiology report
url https://www.frontiersin.org/articles/10.3389/frai.2025.1513674/full
work_keys_str_mv AT maedeashoftehbarabadi targetedgenerativedataaugmentationforautomaticmetastasesdetectionfromfreetextradiologyreports
AT xiaodanzhu targetedgenerativedataaugmentationforautomaticmetastasesdetectionfromfreetextradiologyreports
AT waiyipchan targetedgenerativedataaugmentationforautomaticmetastasesdetectionfromfreetextradiologyreports
AT amberlsimpson targetedgenerativedataaugmentationforautomaticmetastasesdetectionfromfreetextradiologyreports
AT richardkgdo targetedgenerativedataaugmentationforautomaticmetastasesdetectionfromfreetextradiologyreports