Automatic sequence identification in multicentric prostate multiparametric MRI datasets for clinical machine-learning

Bibliographic Details
Main Authors: José Guilherme de Almeida, Ana Sofia Castro Verde, Carlos Bilreiro, Inês Santiago, Joana Ip, Manolis Tsiknakis, Kostas Marias, Daniele Regge, Celso Matos, Nickolas Papanikolaou, ProCAncer-I
Format: Article
Language: English
Published: SpringerOpen 2025-03-01
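Series: Insights into Imaging
Subjects: Prostate; Prostatic Neoplasms; Multiparametric magnetic resonance imaging; Data curation; Supervised machine learning
Online Access: https://doi.org/10.1186/s13244-025-01938-2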
collection DOAJ
description Abstract. Objectives: To present an accurate machine-learning (ML) method and knowledge-based heuristics for automatic sequence-type identification in multi-centric multiparametric MRI (mpMRI) datasets for prostate cancer (PCa) ML. Methods: Retrospective prostate mpMRI studies were classified into 5 series types: T2-weighted (T2W), diffusion-weighted images (DWI), apparent diffusion coefficients (ADC), dynamic contrast-enhanced (DCE) and other series types (others). Metadata were processed for all series, and two models were trained (XGBoost after custom categorical tokenization and CatBoost with raw categorical data) using 5-fold cross-validation (CV) with different data fractions for learning curve analyses. For validation, two test sets were used: a hold-out test set and a temporal split. A leave-one-group-out (LOGO) CV analysis was performed with centres as groups to understand the effect of dataset-specific data. Results: 4045 studies (31,053 series) and 1004 studies (7891 series) from 11 centres were used to train and test series identification models, respectively. Test F1-scores were consistently above 0.95 (CatBoost) and 0.97 (XGBoost). Learning curves demonstrate learning saturation, while temporal validation shows that the models remain capable of correctly identifying all T2W/DWI/ADC triplets. However, optimal performance requires centre-specific data: when comparing CV with LOGO CV while controlling for model and feature set, F1-scores dropped for T2W, DCE and others (−0.146, −0.181 and −0.179, respectively), with larger performance decreases for CatBoost (−0.265). Finally, we delineate heuristics to assist researchers in series classification for PCa mpMRI datasets. Conclusions: Automatic series-type identification is feasible and can enable automated data curation. However, dataset-specific data should be included to achieve optimal performance.
Critical relevance statement: Organising large collections of data is time-consuming but necessary to train clinical machine-learning models. To address this, we outline and validate an automatic series identification method that can facilitate this process. Finally, we outline a set of metadata-based heuristics that can be used to further automate series-type identification. Key Points: Multi-centric prostate MRI studies were used for sequence annotation model training. Automatic sequence annotation requires few instances and generalises temporally. Sequence annotation, necessary for clinical AI model training, can be performed automatically.
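The leave-one-group-out comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `leave_one_group_out` and `f1_per_class`, the centre labels and the toy predictions are all hypothetical, and only the Python standard library is assumed. It shows how each centre is held out in turn and how a one-vs-rest F1-score is computed per series type.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) per distinct group.

    `groups` holds one group label (here: a centre) per series.
    """
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that occurs."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Hypothetical centre labels, one per series.
    centres = ["centre_A", "centre_A", "centre_B", "centre_C"]
    for held_out, train_idx, test_idx in leave_one_group_out(centres):
        print(held_out, "train:", train_idx, "test:", test_idx)

    # Hypothetical true and predicted series types for four series.
    y_true = ["T2W", "DWI", "ADC", "T2W"]
    y_pred = ["T2W", "DWI", "DWI", "others"]
    print(f1_per_class(y_true, y_pred))
```

In a real analysis the model would be refitted on `train_idx` for every held-out centre, and the resulting per-class F1-scores compared with those from ordinary 5-fold CV, where series from every centre appear in the training folds.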
id doaj-art-4b5bb6bcc3914426945aa91bcdfa5874
issn 1869-4101
spelling doaj-art-4b5bb6bcc3914426945aa91bcdfa5874 | 2025-08-20T03:14:46Z | eng | SpringerOpen | Insights into Imaging | 1869-4101 | 2025-03-01 | 16 | 1 | 1 | 13 | 10.1186/s13244-025-01938-2
Automatic sequence identification in multicentric prostate multiparametric MRI datasets for clinical machine-learning
José Guilherme de Almeida (0); Ana Sofia Castro Verde (1); Carlos Bilreiro (2); Inês Santiago (3); Joana Ip (4); Manolis Tsiknakis (5); Kostas Marias (6); Daniele Regge (7); Celso Matos (8); Nickolas Papanikolaou (9); ProCAncer-I
Champalimaud Foundation; Champalimaud Foundation; Champalimaud Clinical Center; Champalimaud Clinical Center; Champalimaud Clinical Center; FORTH; Department of Electrical and Computer Engineering, Hellenic Mediterranean University; Department of Radiology, Candiolo Cancer Institute, FPO-IRCCS; Champalimaud Foundation; Champalimaud Foundation
https://doi.org/10.1186/s13244-025-01938-2
Prostate; Prostatic Neoplasms; Multiparametric magnetic resonance imaging; Data curation; Supervised machine learning
topic Prostate
Prostatic Neoplasms
Multiparametric magnetic resonance imaging
Data curation
Supervised machine learning