Double weighted k nearest neighbours for binary classification of high dimensional genomic data

Abstract High dimensional gene expression datasets consist of a large number of genes, many of which do not play a significant role in classifying tissue samples. The high dimensional nature of this type of data, characterized by a large number of gene features substantially exceeding its sample siz...

Bibliographic Details
Main Authors: Amjad Ali, Zardad Khan, Hailiang Du, Saeed Aldahmani
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-97505-2
author Amjad Ali
Zardad Khan
Hailiang Du
Saeed Aldahmani
collection DOAJ
description Abstract High dimensional gene expression datasets consist of a large number of genes, many of which do not play a significant role in classifying tissue samples. The high dimensional nature of this type of data, in which the number of gene features substantially exceeds the sample size, makes it challenging for existing methods to work efficiently in terms of prediction accuracy and execution time. To address this issue, a new classification procedure called double weighted k nearest neighbours (DWkNN) is proposed. DWkNN is specifically designed for gene expression data and incorporates feature weights derived from each gene’s ability to be differentially expressed between classes. Feature weights are derived in a manner that automatically increases the impact of informative features while decreasing it for features that are less informative or non-informative. To achieve this goal, the estimated weighted distances from the observations in the k nearest neighbourhood to the test point are used in an exponential function. The outputs of the function are summed separately for each of the two classes, and the test point is assigned the class label with the largest sum. By utilizing the proposed weighting method based on the differential capability of genes, the DWkNN method aims to achieve robust and efficient classification by allowing only the most informative features/genes to contribute to the classification task. Experimental evaluations against several methods, namely standard kNN, weighted k nearest neighbours classifier (WkNN), random k nearest neighbour (RkNN), extended neighbourhood rule ensemble (ExNRule), k conditional nearest neighbour (kCNN), the EkCNN ensemble and support vector machines (SVM), demonstrate the effectiveness of DWkNN in accurately classifying gene expression datasets. Overall, DWkNN, with its two-fold weighted distance calculation strategy, presents a promising approach for gene expression data analysis, as assessed by classification accuracy, Cohen’s kappa, sensitivity and F1-score.
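The two weighting steps summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the weight formula or the exact exponential function, so the feature score used here (absolute difference of class means divided by a pooled standard deviation), the kernel exp(-d), the function names `feature_weights` and `dwknn_predict`, and the simulated data are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, pvariance


def feature_weights(X, y):
    """One weight per feature, scaled to sum to 1.

    ASSUMPTION: score = |difference of class means| / pooled std.
    The abstract does not give the actual formula.
    """
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        pooled_sd = math.sqrt(0.5 * (pvariance(a) + pvariance(b))) + 1e-12
        scores.append(abs(mean(a) - mean(b)) / pooled_sd)
    total = sum(scores)
    return [s / total for s in scores]


def dwknn_predict(X, y, x_new, k=5):
    """Predict the binary label (0 or 1) of x_new from training data X, y."""
    w = feature_weights(X, y)
    # First weighting: feature-weighted Euclidean distance to the test point.
    dists = [
        (math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, row, x_new))), label)
        for row, label in zip(X, y)
    ]
    neighbours = sorted(dists)[:k]
    # Second weighting: exp(-distance), summed separately per class
    # (ASSUMED form of the exponential function).
    sums = {0: 0.0, 1: 0.0}
    for d, label in neighbours:
        sums[label] += math.exp(-d)
    # Assign the class with the largest sum.
    return max(sums, key=sums.get)


# Simulated data (hypothetical): 40 samples, 100 "genes", of which only the
# first 10 differ between the classes (mean shifted by 3 in class 1).
random.seed(0)
n_per_class, p, informative = 20, 100, 10
y = [0] * n_per_class + [1] * n_per_class
X = [
    [random.gauss(3.0 if (label == 1 and j < informative) else 0.0, 1.0) for j in range(p)]
    for label in y
]
class0_like = [0.0] * p
class1_like = [3.0] * informative + [0.0] * (p - informative)

print(dwknn_predict(X, y, class0_like), dwknn_predict(X, y, class1_like))
```

In this sketch the informative features receive larger weights than the noise features, so they dominate the distance, and closer neighbours contribute more to the class sums than distant ones.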
format Article
id doaj-art-0a546fa06b01414bbf1ae12d4e6d7293
institution DOAJ
issn 2045-2322
language English
publishDate 2025-04-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-0a546fa06b01414bbf1ae12d4e6d7293
2025-08-20T03:10:07Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-04-01
vol. 15, no. 1, pp. 1-17
10.1038/s41598-025-97505-2
Double weighted k nearest neighbours for binary classification of high dimensional genomic data
Amjad Ali (Department of Statistics and Business Analytics, United Arab Emirates University)
Zardad Khan (Department of Statistics and Business Analytics, United Arab Emirates University)
Hailiang Du (Department of Mathematical Sciences, Durham University)
Saeed Aldahmani (Department of Statistics and Business Analytics, United Arab Emirates University)
https://doi.org/10.1038/s41598-025-97505-2
title Double weighted k nearest neighbours for binary classification of high dimensional genomic data
url https://doi.org/10.1038/s41598-025-97505-2