Exploring genomic feature selection: A comparative analysis of GWAS and machine learning algorithms in a large‐scale soybean dataset

Abstract The surge in high‐throughput technologies has empowered the acquisition of vast genomic datasets, prompting the search for genetic markers and biomarkers relevant to complex traits. However, grappling with the inherent complexities of high dimensionality and sparsity within these datasets p...

Bibliographic Details
Main Authors:	Hawlader A. Al‐Mamun, Monica F. Danilevicz, Jacob I. Marsh, Cedric Gondro, David Edwards
Format:	Article
Language:	English
Published:	Wiley 2025-03-01
Series:	The Plant Genome
Online Access:	https://doi.org/10.1002/tpg2.20503

author	Hawlader A. Al‐Mamun, Monica F. Danilevicz, Jacob I. Marsh, Cedric Gondro, David Edwards
collection	DOAJ
description	Abstract The surge in high‐throughput technologies has empowered the acquisition of vast genomic datasets, prompting the search for genetic markers and biomarkers relevant to complex traits. However, grappling with the inherent complexities of high dimensionality and sparsity within these datasets poses formidable hurdles. The immense number of features and their potential redundancy demand efficient strategies for extracting pertinent information and identifying significant markers. Feature selection is important in large genomic data as it helps in enhancing interpretability and computational efficiency. This study focuses on addressing these challenges through a comprehensive investigation into genomic feature selection methodologies, employing a rich soybean (Glycine max L. Merr.) dataset comprising 966 lines with over 5.5 million single nucleotide polymorphisms. Emphasizing the “small n large p” dilemma prevalent in contemporary genomic studies, we compared the efficacy of traditional genome‐wide association studies (GWAS) with two prominent machine learning tools, random forest and extreme gradient boosting, in pinpointing predictive features. Utilizing the expansive soybean dataset, we assessed the performance of these methodologies in selecting features that optimize predictive modeling for various phenotypes. By constructing predictive models based on the selected features, we ascertain the comparative prediction accuracies, thereby illuminating the strengths and limitations of these feature selection methodologies in the realm of genomic data analysis.
id	doaj-art-a8d44b03c3ed4b8bbadd22d129596bdd
institution	OA Journals
issn	1940-3372
spelling	doaj-art-a8d44b03c3ed4b8bbadd22d129596bdd; 2025-08-20T01:50:06Z; eng; Wiley; The Plant Genome; 1940-3372; 2025-03-01; volume 18, issue 1; pages n/a; 10.1002/tpg2.20503; authors and affiliations: Hawlader A. Al‐Mamun (Centre for Applied Bioinformatics and School of Biological Sciences, University of Western Australia, Perth, Western Australia, Australia); Monica F. Danilevicz (Centre for Applied Bioinformatics and School of Biological Sciences, University of Western Australia, Perth, Western Australia, Australia); Jacob I. Marsh (Department of Biology, University of North Carolina, Chapel Hill, North Carolina, USA); Cedric Gondro (Department of Animal Science, Michigan State University, East Lansing, Michigan, USA); David Edwards (Centre for Applied Bioinformatics and School of Biological Sciences, University of Western Australia, Perth, Western Australia, Australia)
title	Exploring genomic feature selection: A comparative analysis of GWAS and machine learning algorithms in a large‐scale soybean dataset

Similar Items