Double weighted k nearest neighbours for binary classification of high dimensional genomic data

Abstract High dimensional gene expression datasets consist of a large number of genes, many of which do not play a significant role in classifying tissue samples. The high dimensional nature of this type of data, characterized by a large number of gene features substantially exceeding its sample siz...


Bibliographic Details
Main Authors:	Amjad Ali, Zardad Khan, Hailiang Du, Saeed Aldahmani
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-04-01
Series:	Scientific Reports
Online Access:	https://doi.org/10.1038/s41598-025-97505-2

_version_	1849726659022815232
author	Amjad Ali, Zardad Khan, Hailiang Du, Saeed Aldahmani
author_facet	Amjad Ali, Zardad Khan, Hailiang Du, Saeed Aldahmani
author_sort	Amjad Ali
collection	DOAJ
description	Abstract High dimensional gene expression datasets consist of a large number of genes, many of which do not play a significant role in classifying tissue samples. The high dimensional nature of this type of data, in which the number of gene features substantially exceeds the sample size, makes it challenging for existing methods to work efficiently in terms of prediction accuracy and execution time. To address this issue, a new classification procedure called double weighted k nearest neighbours (DWkNN) is proposed. DWkNN is specifically designed for gene expression data and incorporates feature weights derived from genes’ ability to express differentially between classes. Feature weights are derived in a manner that automatically increases the impact of informative features while decreasing it for less informative or non-informative features. To achieve this goal, the estimated weighted distances from the observations in the k nearest neighbourhood to the test point are used in an exponential function. The outputs of the function are summed separately for the two classes, and the test point is assigned the class label with the larger sum. By utilizing the proposed weighting method based on the differential capability of genes, DWkNN aims to achieve robust and efficient classification by allowing only the most informative features/genes to contribute to the classification task. Experimental evaluations against several methods, namely standard kNN, the weighted k nearest neighbours classifier (WkNN), random k nearest neighbour (RkNN), the extended neighbourhood rule ensemble (ExNRule), k conditional nearest neighbour (kCNN), the EkCNN ensemble and support vector machines (SVM), demonstrate the effectiveness of DWkNN in accurately classifying gene expression datasets. Overall, with its two-fold weighted distance calculation strategy, DWkNN presents a promising approach for gene expression data analysis, with classification accuracy, Cohen’s kappa, sensitivity and F1-score used as performance metrics.
format	Article
id	doaj-art-0a546fa06b01414bbf1ae12d4e6d7293
institution	DOAJ
issn	2045-2322
language	English
publishDate	2025-04-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj-art-0a546fa06b01414bbf1ae12d4e6d7293; 2025-08-20T03:10:07Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-04-01; 15; 1; 1; 17; 10.1038/s41598-025-97505-2; Double weighted k nearest neighbours for binary classification of high dimensional genomic data; Amjad Ali (Department of Statistics and Business Analytics, United Arab Emirates University); Zardad Khan (Department of Statistics and Business Analytics, United Arab Emirates University); Hailiang Du (Department of Mathematical Sciences, Durham University); Saeed Aldahmani (Department of Statistics and Business Analytics, United Arab Emirates University); https://doi.org/10.1038/s41598-025-97505-2
title	Double weighted k nearest neighbours for binary classification of high dimensional genomic data
url	https://doi.org/10.1038/s41598-025-97505-2
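
The abstract in this record describes the classification rule only in words: feature weights reflecting differential expression, weighted distances to the k nearest neighbours, an exponential function of those distances, and a per-class sum. The following is a minimal sketch of that general idea, not the authors' implementation. The weight formula (class-mean gap over summed class standard deviations), the weighted Euclidean distance, the `exp(-d)` vote and the function names `feature_weights` and `dwknn_predict` are all illustrative assumptions; the record does not give the exact definitions.

```python
import numpy as np


def feature_weights(X, y):
    """Assumed weighting: per-feature gap between class means, scaled by
    the summed class standard deviations, normalised to sum to 1."""
    X0, X1 = X[y == 0], X[y == 1]
    gap = np.abs(X0.mean(axis=0) - X1.mean(axis=0))
    spread = X0.std(axis=0) + X1.std(axis=0) + 1e-12  # avoid division by zero
    w = gap / spread
    total = w.sum()
    if total == 0:
        # No feature separates the classes: fall back to equal weights.
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return w / total


def dwknn_predict(X_train, y_train, x_test, k=5):
    """Assumed decision rule for binary labels 0/1."""
    w = feature_weights(X_train, y_train)
    # Feature-weighted Euclidean distance from every training row to x_test.
    d = np.sqrt((((X_train - x_test) ** 2) * w).sum(axis=1))
    nearest = np.argsort(d)[:k]
    # Closer neighbours receive larger votes through exp(-distance).
    votes = np.exp(-d[nearest])
    labels = y_train[nearest]
    score0 = votes[labels == 0].sum()
    score1 = votes[labels == 1].sum()
    return int(score1 > score0)
```

In this sketch, a feature whose class means are far apart relative to its spread dominates the distance, so uninformative features have little effect on which neighbours are selected or how strongly they vote.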


Similar Items