Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome
Abstract Background and objective This study has two main objectives. First, to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. Second, to validate machine learning models that predict the risk of complications in patients with acute myeloid...
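| Main Authors: | Pedro Pons-Suñer, François Signol, Noemi Alvarez, Claudia Sargas, Sara Dorado, Jose Vicente Gil Ortí, Juan A. Delgado Sanchis, Marta Llop, Laura Arnal, Rafael Llobet, Juan-Carlos Perez-Cortes, Rosa Ayala, Eva Barragán |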
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-05-01 |
| Series: | BMC Medical Informatics and Decision Making |
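| Subjects: | Acute myeloid leukemia; Machine learning; Patient evolution; Therapy outcome; Recurrence; Mortality |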
| Online Access: | https://doi.org/10.1186/s12911-025-03001-y |
| _version_ | 1850284755056066560 |
|---|---|
| author | Pedro Pons-Suñer François Signol Noemi Alvarez Claudia Sargas Sara Dorado Jose Vicente Gil Ortí Juan A. Delgado Sanchis Marta Llop Laura Arnal Rafael Llobet Juan-Carlos Perez-Cortes Rosa Ayala Eva Barragán |
| author_sort | Pedro Pons-Suñer |
| collection | DOAJ |
| description | Background and objective: This study has two main objectives. The first is to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. The second is to validate machine learning models that predict the risk of complications in patients with acute myeloid leukemia (AML) using data available at diagnosis. Predictions are made at three time points: 90 days, six months, and one year post-diagnosis. These objectives represent fundamental steps toward the development of a tool to assist clinicians in therapeutic decision-making and provide insights into the risk factors associated with AML complications. Methods: A dataset of 568 patients, including demographic, clinical, genetic (VAF), and cytogenetic information, was created by combining data from Hospital 12 de Octubre (Madrid, Spain) and Instituto de Investigación Sanitaria La Fe (Valencia, Spain). Feature selection based on an enhanced version of SEQENS was conducted for each time point, followed by a comparison of four classifiers (XGBoost, Multi-Layer Perceptron, Logistic Regression, and Decision Tree) to assess the impact of feature selection on model performance. Results: SEQENS identified different relevant features for each prediction horizon, with Age, TP53, −7/7Q, and EZH2 consistently relevant across all time points. The models were evaluated using 5-fold cross-validation, with XGBoost achieving the highest average ROC-AUC scores: 0.81, 0.84, and 0.82 for the 90-day, 6-month, and 1-year predictions, respectively. Generally, performance remained stable or improved after applying SEQENS-based feature selection. Evaluation on an external test set of 54 patients yielded ROC-AUC scores of 0.72 (90-day), 0.75 (6-month), and 0.68 (1-year). Conclusions: The models achieved performance levels that suggest they could serve as therapeutic decision support tools at different times after diagnosis. The selected variables align with the European LeukemiaNet (ELN) 2022 risk classification, and the SEQENS-based feature selection effectively reduced the feature set while maintaining prediction accuracy. |
| format | Article |
| id | doaj-art-41678823ef6a4a46abdeefcd862ddd8b |
| institution | OA Journals |
| issn | 1472-6947 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Informatics and Decision Making |
| spelling | doaj-art-41678823ef6a4a46abdeefcd862ddd8b; 2025-08-20T01:47:29Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2025-05-01; 251122; 10.1186/s12911-025-03001-y; Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome; Pedro Pons-Suñer [0]; François Signol [1]; Noemi Alvarez [2]; Claudia Sargas [3]; Sara Dorado [4]; Jose Vicente Gil Ortí [5]; Juan A. Delgado Sanchis [6]; Marta Llop [7]; Laura Arnal [8]; Rafael Llobet [9]; Juan-Carlos Perez-Cortes [10]; Rosa Ayala [11]; Eva Barragán [12]; ITI, Universitat Politècnica de València; ITI, Universitat Politècnica de València; Hospital Universitario 12 de Octubre, Imas12, Department of Medicine, Complutense University; Instituto de Investigación Sanitaria La Fe; Altum Sequencing, s.l., Computer Science and Engineering Department, Carlos III University; Instituto de Investigación Sanitaria La Fe; ITI, Universitat Politècnica de València; Instituto de Investigación Sanitaria La Fe; ITI, Universitat Politècnica de València; ITI, Universitat Politècnica de València; ITI, Universitat Politècnica de València; Hospital Universitario 12 de Octubre, Imas12, Department of Medicine, Complutense University; Instituto de Investigación Sanitaria La Fe; Background and objective: This study has two main objectives. The first is to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. The second is to validate machine learning models that predict the risk of complications in patients with acute myeloid leukemia (AML) using data available at diagnosis. Predictions are made at three time points: 90 days, six months, and one year post-diagnosis. These objectives represent fundamental steps toward the development of a tool to assist clinicians in therapeutic decision-making and provide insights into the risk factors associated with AML complications. Methods: A dataset of 568 patients, including demographic, clinical, genetic (VAF), and cytogenetic information, was created by combining data from Hospital 12 de Octubre (Madrid, Spain) and Instituto de Investigación Sanitaria La Fe (Valencia, Spain). Feature selection based on an enhanced version of SEQENS was conducted for each time point, followed by a comparison of four classifiers (XGBoost, Multi-Layer Perceptron, Logistic Regression, and Decision Tree) to assess the impact of feature selection on model performance. Results: SEQENS identified different relevant features for each prediction horizon, with Age, TP53, −7/7Q, and EZH2 consistently relevant across all time points. The models were evaluated using 5-fold cross-validation, with XGBoost achieving the highest average ROC-AUC scores: 0.81, 0.84, and 0.82 for the 90-day, 6-month, and 1-year predictions, respectively. Generally, performance remained stable or improved after applying SEQENS-based feature selection. Evaluation on an external test set of 54 patients yielded ROC-AUC scores of 0.72 (90-day), 0.75 (6-month), and 0.68 (1-year). Conclusions: The models achieved performance levels that suggest they could serve as therapeutic decision support tools at different times after diagnosis. The selected variables align with the European LeukemiaNet (ELN) 2022 risk classification, and the SEQENS-based feature selection effectively reduced the feature set while maintaining prediction accuracy.; https://doi.org/10.1186/s12911-025-03001-y; Acute myeloid leukemia; Machine learning; Patient evolution; Therapy outcome; Recurrence; Mortality |
| title | Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome |
| topic | Acute myeloid leukemia Machine learning Patient evolution Therapy outcome Recurrence Mortality |
| url | https://doi.org/10.1186/s12911-025-03001-y |
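The description reports average ROC-AUC scores over 5-fold cross-validation. As a rough illustration of what that metric means, the sketch below computes ROC-AUC from its pairwise-ranking definition and averages it across five stratified folds. It is not the authors' code: the function names (`roc_auc`, `stratified_folds`, `mean_fold_auc`) are hypothetical, the labels and scores are synthetic, and the scores are assumed to be out-of-fold predictions from some classifier whose training is not shown.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds, dealing each class out round-robin
    so every fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def mean_fold_auc(labels, scores, k=5, seed=0):
    """Average the per-fold ROC-AUC, assuming `scores` are out-of-fold
    predictions (each sample scored by a model that never saw it)."""
    aucs = []
    for fold in stratified_folds(labels, k, seed):
        aucs.append(roc_auc([labels[i] for i in fold], [scores[i] for i in fold]))
    return sum(aucs) / len(aucs)


# Synthetic stand-in data (an assumption, not the study's dataset):
# 100 samples, balanced classes, scores loosely correlated with the label.
rng = random.Random(42)
labels = [i % 2 for i in range(100)]
scores = [y + rng.gauss(0, 1) for y in labels]
print(round(mean_fold_auc(labels, scores), 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives some context for the cross-validated and external-test scores quoted in the description.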