DNA sequence classification for diabetes mellitus using NuSVC and XGBoost: A comparative.

Diabetes Mellitus is a global health concern, characterized by high blood sugar levels over a prolonged period, leading to severe complications if left unmanaged. The early identification of individuals at risk is critical for effective intervention and treatment. Traditional diagnostic methods rely...

Bibliographic Details
Main Authors: Said A Salloum, Khaled Mohammad Alomari, Ayham Salloum
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0328253
collection DOAJ
description Diabetes Mellitus is a global health concern, characterized by high blood sugar levels over a prolonged period, leading to severe complications if left unmanaged. The early identification of individuals at risk is critical for effective intervention and treatment. Traditional diagnostic methods rely heavily on clinical symptoms and biochemical tests, which may not capture the underlying genetic predispositions. With the advent of genomics, DNA sequence analysis has emerged as a promising approach to uncover the genetic markers associated with Diabetes Mellitus. However, the challenge lies in accurately classifying DNA sequences to predict susceptibility to the disease, given the complex nature of genetic data. This study addresses this challenge by employing two advanced machine learning models, NuSVC (Nu-Support Vector Classification) and XGBoost (Extreme Gradient Boosting), to classify DNA sequences related to Diabetes Mellitus. The dataset, obtained from reputable sources like NCBI, was preprocessed using Natural Language Processing (NLP) techniques, where DNA sequences were treated as textual data and transformed into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). To handle the class imbalance in the dataset, SMOTE (Synthetic Minority Over-sampling Technique) was applied. The models were trained and validated using 10-fold cross-validation. XGBoost was trained with up to 300 boosting rounds, and performance was evaluated using accuracy, precision, recall, F1-score, ROC-AUC, and log loss. The results demonstrate that XGBoost outperformed NuSVC across all metrics, achieving an accuracy of 98%, a log loss of 0.0650, and an AUC of 1.00, compared to NuSVC's accuracy of 87%, log loss of 0.2649, and AUC of 0.95. The superior performance of XGBoost indicates its robustness in handling complex genetic data and its potential utility in clinical applications for early diagnosis of Diabetes Mellitus. The findings of this study underscore the importance of advanced machine learning techniques in genomics and suggest that integrating such models into healthcare systems could significantly enhance predictive diagnostics.
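The description says DNA sequences were treated as text and converted to TF-IDF features, but it does not say how the sequences were split into "words". The sketch below is illustrative only: it assumes overlapping k-mers (here k = 3) as tokens and uses one common smoothed TF-IDF variant. The function names `kmers` and `tfidf` and the toy sequences are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a DNA string into overlapping k-mers (assumed tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = ln((1 + N) / (1 + df)) + 1   (a smoothed variant, chosen for this sketch)
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Toy sequences, invented for illustration (not data from the study).
sequences = ["ATGCGATACG", "ATGCCCTTAG", "GGGATACGTT"]
docs = [kmers(s) for s in sequences]
vectors = tfidf(docs)
```

Each resulting dictionary is a sparse feature vector for one sequence. In a pipeline like the one described, such vectors would then be rebalanced (the record names SMOTE) and passed to the classifiers under cross-validation.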
id doaj-art-4cb3327b7b3649bba04c78c513a9fbd3
institution Kabale University
issn 1932-6203
spelling doaj-art-4cb3327b7b3649bba04c78c513a9fbd3 | 2025-08-20T03:51:35Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2025-01-01 | 20 | 7 | e0328253 | 10.1371/journal.pone.0328253 | DNA sequence classification for diabetes mellitus using NuSVC and XGBoost: A comparative. | Said A Salloum | Khaled Mohammad Alomari | Ayham Salloum | https://doi.org/10.1371/journal.pone.0328253