Data augmented lung cancer prediction framework using the nested case control NLST cohort

Purpose: In the context of lung cancer screening, the scarcity of well-labeled medical images poses a significant challenge to implementing supervised deep learning methods. While data augmentation is an effective technique for countering the difficulties caused by insufficient data, it ha...

Bibliographic Details
Main Authors: Yifan Jiang, Venkata S. K. Manem
Format: Article
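Language: English
Published: Frontiers Media S.A. 2025-02-01
Series: Frontiers in Oncology
Subjects: data augmentation; lung cancer; cancer risk prediction; computed tomography; artificial intelligence; machine learning
Online Access: https://www.frontiersin.org/articles/10.3389/fonc.2025.1492758/full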
author Yifan Jiang
Venkata S. K. Manem
collection DOAJ
description Purpose: In the context of lung cancer screening, the scarcity of well-labeled medical images poses a significant challenge to implementing supervised deep learning methods. While data augmentation is an effective technique for countering the difficulties caused by insufficient data, it has not been fully explored in the context of lung cancer screening. In this study, we analyzed state-of-the-art (SOTA) data augmentation techniques for binary lung cancer prediction.
Methods: To comprehensively evaluate the efficiency of data augmentation approaches, we considered the nested case-control National Lung Screening Trial (NLST) cohort, comprising 253 individuals who had the commonly used non-contrast CT scans. The CT scans were pre-processed into three-dimensional volumes based on the lung nodule annotations. Subsequently, we evaluated five basic (online) and two generative model-based (offline) data augmentation methods with ten SOTA 3D deep learning-based lung cancer prediction models.
Results: Our results demonstrated that the performance improvement from data augmentation was highly dependent on the approach used. The CutMix method resulted in the highest average performance improvement across all three metrics: 1.07%, 3.29%, and 1.19% for accuracy, F1 score, and AUC, respectively. MobileNetV2 with a simple data augmentation approach achieved the best AUC of 0.8719 among all lung cancer predictors, a 7.62% improvement compared to baseline. Furthermore, the MED-DDPM data augmentation approach was able to improve prediction performance by rebalancing the training set and adding a moderate amount of synthetic data.
Conclusions: The effectiveness of online and offline data augmentation methods was highly sensitive to the prediction model, highlighting the importance of carefully selecting the data augmentation method. Our findings suggest that certain traditional methods can provide more stable and higher performance than SOTA online data augmentation approaches. Overall, these results offer meaningful insights for the development and clinical integration of data-augmented deep learning tools for lung cancer screening.
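The abstract names CutMix as the online augmentation with the highest average gain but does not describe how it was applied to the 3D CT volumes. As a rough illustration only, the following is a minimal NumPy sketch of the general CutMix idea extended to three dimensions: a random box from one volume is pasted into another, and the labels are mixed in proportion to the volume each scan contributes. The function name `cutmix_3d`, the `alpha` parameter, and the box-sampling details are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def cutmix_3d(vol_a, label_a, vol_b, label_b, alpha=1.0, rng=None):
    """CutMix-style mixing of two 3D volumes with identical shape (D, H, W).

    Hypothetical helper for illustration; not taken from the article.
    Returns the mixed volume, the mixed (soft) label, and the fraction
    of the result that still comes from vol_a.
    """
    rng = np.random.default_rng() if rng is None else rng
    if vol_a.shape != vol_b.shape:
        raise ValueError("both volumes must have the same shape")

    shape = np.array(vol_a.shape)
    # Sample the target mixing ratio from a Beta(alpha, alpha) distribution.
    lam = rng.beta(alpha, alpha)
    # Scale each box side by the cube root of (1 - lam) so that the box
    # covers roughly a (1 - lam) fraction of the whole volume.
    size = (shape * (1.0 - lam) ** (1.0 / 3.0)).astype(int)
    # Pick a random box centre and clip the box to the volume bounds.
    center = rng.integers(0, shape)
    lo = np.clip(center - size // 2, 0, shape)
    hi = np.clip(center + size // 2, 0, shape)

    # Paste the box from vol_b into a copy of vol_a.
    mixed = vol_a.copy()
    mixed[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = \
        vol_b[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

    # Recompute the ratio from the clipped box so the label matches the pixels.
    lam_adj = 1.0 - np.prod(hi - lo) / np.prod(shape)
    mixed_label = lam_adj * label_a + (1.0 - lam_adj) * label_b
    return mixed, mixed_label, lam_adj
```

In a training loop this would typically be applied per batch to randomly paired samples, with the soft label fed to a loss that accepts non-integer targets.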
format Article
id doaj-art-c168ffcbb29b401c9466cfed0a606e1e
institution OA Journals
issn 2234-943X
language English
publishDate 2025-02-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Oncology
spelling doaj-art-c168ffcbb29b401c9466cfed0a606e1e; 2025-08-20T02:15:37Z; eng; Frontiers Media S.A.; Frontiers in Oncology; ISSN 2234-943X; 2025-02-01; volume 15; DOI 10.3389/fonc.2025.1492758; article 1492758
Yifan Jiang: Centre de Recherche du CHU de Québec, Université Laval, Québec, QC, Canada; Département de Biologie Moléculaire, Biochimie Médicale et Pathologie, Université Laval, Québec, QC, Canada
Venkata S. K. Manem: Centre de Recherche du CHU de Québec, Université Laval, Québec, QC, Canada; Département de Biologie Moléculaire, Biochimie Médicale et Pathologie, Université Laval, Québec, QC, Canada; Institut Universitaire de Cardiologie et de Pneumologie de Québec, Québec, QC, Canada
title Data augmented lung cancer prediction framework using the nested case control NLST cohort
topic data augmentation
lung cancer
cancer risk prediction
computed tomography
Artificial Intelligence
machine learning
url https://www.frontiersin.org/articles/10.3389/fonc.2025.1492758/full