Double weighted k nearest neighbours for binary classification of high dimensional genomic data
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Scientific Reports |
| Online Access: | https://doi.org/10.1038/s41598-025-97505-2 |
| Summary: | High-dimensional gene expression datasets consist of a large number of genes, many of which do not play a significant role in classifying tissue samples. Because the number of gene features substantially exceeds the sample size, existing methods struggle to work efficiently in terms of prediction accuracy and execution time. To address this issue, a new classification procedure called double weighted k nearest neighbours (DWkNN) is proposed. DWkNN is specifically designed for gene expression data and incorporates feature weights derived from genes' ability to express differentially between classes. The feature weights are derived in a manner that automatically increases the impact of informative features while decreasing it for features that are less informative or non-informative. To achieve this goal, the estimated weighted distances from the observations in the k nearest neighbourhood to the test point are used in an exponential function. The outputs of the function are summed separately for each of the two classes, and the test point is assigned the class label with the largest sum. By utilizing the proposed weighting method based on the differential capability of genes, the DWkNN method aims to achieve robust and efficient classification by allowing only the most informative features (genes) to contribute to the classification task. Experimental evaluations against several methods, namely standard kNN, the weighted k nearest neighbours classifier (WkNN), random k nearest neighbour (RkNN), the extended neighbourhood rule ensemble (ExNRule), k conditional nearest neighbour (kCNN), the EkCNN ensemble and support vector machines (SVM), demonstrate the effectiveness of DWkNN in accurately classifying gene expression datasets. Overall, DWkNN presents a promising approach for gene expression data analysis through its two-fold weighted distance calculation strategy, as assessed by classification accuracy, Cohen's kappa, sensitivity and F1-score. |
|---|---|
| ISSN: | 2045-2322 |
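
The procedure outlined in the summary (feature-weighted distances, an exponential function of those distances, per-class sums) can be illustrated with a rough sketch. The record does not give the actual formulas, so everything specific here is an assumption for illustration only: the weight definition (absolute class-mean difference scaled by the class standard deviations), the exponential form `exp(-d)`, and the function names `feature_weights` and `dwknn_predict` are hypothetical and should not be read as the authors' method.

```python
import numpy as np

def feature_weights(X, y):
    """Hypothetical feature weights (an assumption, not the paper's formula):
    absolute difference of class means divided by the summed class standard
    deviations, normalised so the weights sum to 1."""
    X0, X1 = X[y == 0], X[y == 1]
    diff = np.abs(X0.mean(axis=0) - X1.mean(axis=0))
    spread = X0.std(axis=0) + X1.std(axis=0) + 1e-12  # avoid division by zero
    w = diff / spread
    return w / w.sum()

def dwknn_predict(X_train, y_train, x_test, k=5):
    """Classify one test point with a doubly weighted k-NN style rule."""
    # First weighting: features contribute to the distance in proportion
    # to how well they separate the two classes.
    w = feature_weights(X_train, y_train)
    d = np.sqrt((((X_train - x_test) ** 2) * w).sum(axis=1))

    # Indices of the k training points closest to the test point.
    nn = np.argsort(d)[:k]

    # Second weighting: closer neighbours get larger votes through an
    # exponential function of the weighted distance (assumed form exp(-d)).
    votes = np.exp(-d[nn])
    score0 = votes[y_train[nn] == 0].sum()
    score1 = votes[y_train[nn] == 1].sum()

    # Assign the class with the larger summed vote.
    return int(score1 > score0)
```

Labels are assumed to be a NumPy integer array coded 0 and 1, with samples in rows and genes in columns. Under this sketch, uninformative genes receive weights near zero and therefore barely affect which neighbours are selected.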