Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome

Abstract Background and objective This study has two main objectives. First, to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. Second, to validate machine learning models that predict the risk of complications in patients with acute myeloid...

Bibliographic Details
Main Authors: Pedro Pons-Suñer, François Signol, Noemi Alvarez, Claudia Sargas, Sara Dorado, Jose Vicente Gil Ortí, Juan A. Delgado Sanchis, Marta Llop, Laura Arnal, Rafael Llobet, Juan-Carlos Perez-Cortes, Rosa Ayala, Eva Barragán
Format: Article
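Language: English
Published: BMC 2025-05-01
Series: BMC Medical Informatics and Decision Making
Subjects: Acute myeloid leukemia; Machine learning; Patient evolution; Therapy outcome; Recurrence; Mortality
Online Access: https://doi.org/10.1186/s12911-025-03001-y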
collection DOAJ
description Abstract
Background and objective: This study has two main objectives. First, to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. Second, to validate machine learning models that predict the risk of complications in patients with acute myeloid leukemia (AML) using data available at diagnosis. Predictions are made at three time points: 90 days, six months, and one year post-diagnosis. These objectives represent fundamental steps toward the development of a tool to assist clinicians in therapeutic decision-making and provide insights into the risk factors associated with AML complications.
Methods: A dataset of 568 patients, including demographic, clinical, genetic (VAF), and cytogenetic information, was created by combining data from Hospital 12 de Octubre (Madrid, Spain) and Instituto de Investigación Sanitaria La Fe (Valencia, Spain). Feature selection based on an enhanced version of SEQENS was conducted for each time point, followed by the comparison of four classifiers (XGBoost, Multi-Layer Perceptron, Logistic Regression and Decision Tree) to assess the impact of feature selection on model performance.
Results: SEQENS identified different relevant features for each prediction horizon, with Age, TP53, −7/7Q, and EZH2 consistently relevant across all time points. The models were evaluated using 5-fold cross-validation, with XGBoost achieving the highest average ROC-AUC scores of 0.81, 0.84, and 0.82 for 90-day, 6-month, and 1-year predictions, respectively. Generally, performance remained stable or improved after applying SEQENS-based feature selection. Evaluation on an external test set of 54 patients yielded ROC-AUC scores of 0.72 (90-day), 0.75 (6-month), and 0.68 (1-year).
Conclusions: The models achieved performance levels that suggest they could serve as therapeutic decision support tools at different times after diagnosis. The selected variables align with the European LeukemiaNet (ELN) 2022 risk classification, and the SEQENS-based feature selection effectively reduced the feature set while maintaining prediction accuracy.
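The abstract reports ROC-AUC scores averaged over 5-fold cross-validation. As a rough illustration of how such a figure can be computed, the following is a minimal standard-library sketch. It is not the authors' code: the function names `roc_auc`, `stratified_folds`, and `cross_validated_auc`, and the `fit_predict` callback standing in for any of the four classifiers, are hypothetical, and the round-robin fold assignment is a simplification of a shuffled stratified split.

```python
from statistics import mean


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5):
    """Split sample indices into k folds, dealing each class out round-robin
    so every fold keeps roughly the overall class balance."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for rank, i in enumerate(members):
            folds[rank % k].append(i)
    return folds


def cross_validated_auc(labels, fit_predict, k=5):
    """Average the held-out ROC-AUC over k folds.

    fit_predict(train_idx, test_idx) is assumed to train a model on the
    training indices and return one score per test index, in order.
    """
    all_idx = set(range(len(labels)))
    per_fold = []
    for test_idx in stratified_folds(labels, k):
        train_idx = sorted(all_idx - set(test_idx))
        scores = fit_predict(train_idx, test_idx)
        per_fold.append(roc_auc([labels[i] for i in test_idx], scores))
    return mean(per_fold)
```

Under this sketch, comparing a model with and without feature selection amounts to calling `cross_validated_auc` twice with two different `fit_predict` callbacks on the same labels and comparing the two averages.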
format Article
id doaj-art-41678823ef6a4a46abdeefcd862ddd8b
institution OA Journals
issn 1472-6947
language English
publishDate 2025-05-01
publisher BMC
record_format Article
series BMC Medical Informatics and Decision Making
spelling doaj-art-41678823ef6a4a46abdeefcd862ddd8b
2025-08-20T01:47:29Z
eng
BMC
BMC Medical Informatics and Decision Making
1472-6947
2025-05-01
volume 25, issue 1, pages 1-22
10.1186/s12911-025-03001-y
Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome
Pedro Pons-Suñer (ITI, Universitat Politècnica de València)
François Signol (ITI, Universitat Politècnica de València)
Noemi Alvarez (Hospital Universitario 12 de Octubre, Imas12, Department of Medicine, Complutense University)
Claudia Sargas (Instituto de Investigación Sanitaria La Fe)
Sara Dorado (Altum Sequencing, s.l., Computer Science and Engineering Department, Carlos III University)
Jose Vicente Gil Ortí (Instituto de Investigación Sanitaria La Fe)
Juan A. Delgado Sanchis (ITI, Universitat Politècnica de València)
Marta Llop (Instituto de Investigación Sanitaria La Fe)
Laura Arnal (ITI, Universitat Politècnica de València)
Rafael Llobet (ITI, Universitat Politècnica de València)
Juan-Carlos Perez-Cortes (ITI, Universitat Politècnica de València)
Rosa Ayala (Hospital Universitario 12 de Octubre, Imas12, Department of Medicine, Complutense University)
Eva Barragán (Instituto de Investigación Sanitaria La Fe)
https://doi.org/10.1186/s12911-025-03001-y
Acute myeloid leukemia; Machine learning; Patient evolution; Therapy outcome; Recurrence; Mortality
title Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome
topic Acute myeloid leukemia
Machine learning
Patient evolution
Therapy outcome
Recurrence
Mortality
url https://doi.org/10.1186/s12911-025-03001-y