Improving Gaussian Naive Bayes classification on imbalanced data through coordinate-based minority feature mining

Bibliographic Details
Main Authors: Wei Wang, Li Yan, Fen Liu, Yanxi Li
Format: Article
Language: English
Published: PeerJ Inc. 2025-07-01
Series: PeerJ Computer Science
Online Access: https://peerj.com/articles/cs-3003.pdf
Description
Summary: As a widely used classification model, the Gaussian Naive Bayes (GNB) classifier experiences a significant decline in performance when handling imbalanced data. Most traditional approaches rely on sampling techniques; however, these methods alter the quantity and distribution of the original data and are prone to issues such as class overlap and overfitting, thus presenting clear limitations. This article proposes a coordinate transformation algorithm based on radial local relative density changes (RLDC). A key feature of this algorithm is that it preserves the original dataset’s quantity and distribution. Instead of modifying the data, it enhances classification performance by generating new features that more prominently represent minority classes. The algorithm transforms the dataset from absolute coordinates to RLDC-relative coordinates, revealing latent local relative density change features. Due to the imbalanced distribution, sparse feature space, and class overlap, minority class samples can exhibit distinct patterns in these transformed features. Based on these new features, the GNB classifier can increase the conditional probability of the minority class, thereby improving its classification performance on imbalanced datasets. To validate the effectiveness of the proposed algorithm, this study conducts comprehensive comparative experiments using the GNB classifier on 20 imbalanced datasets of varying scales, dimensions, and characteristics. The evaluation includes 10 oversampling algorithms, two undersampling algorithms, and two hybrid sampling algorithms. Experimental results show that the RLDC-based coordinate transformation algorithm ranks first in the average performance across three classification evaluation metrics. Compared to the average values of the comparison algorithms, it achieves improvements of 21.84%, 33.45%, and 54.63% across the three metrics, respectively. This algorithm offers a novel approach to addressing the imbalanced data problem in GNB classification and holds significant theoretical and practical value.
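The general idea in the summary, adding a density-derived feature instead of resampling so that GNB assigns a higher posterior to the minority class, can be illustrated with a small sketch. This is not the RLDC transformation, whose definition is not given in this record. The `sparsity` feature (mean distance to the k nearest training points), the toy one-dimensional dataset, and all function names below are assumptions chosen only for illustration.

```python
import math
from collections import defaultdict

def fit_gnb(X, y, var_floor=1e-9):
    """Estimate per-class prior, feature means, and feature variances."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [sum((v - m) ** 2 for v in c) / len(c) + var_floor
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def posterior(model, x):
    """Return P(class | x) under the Gaussian naive Bayes assumption."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        log_joint[label] = total
    top = max(log_joint.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in log_joint.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

def sparsity(point, data, k=2, skip=None):
    """Hypothetical stand-in feature (NOT RLDC): mean distance to the
    k nearest training points, optionally skipping the point itself."""
    dists = sorted(math.dist(point, other)
                   for i, other in enumerate(data) if i != skip)
    return sum(dists[:k]) / k

# Toy data (assumed): 16 densely packed majority points, 4 sparse minority points.
X = [[0.1 * i] for i in range(16)] + [[2.0], [3.0], [4.0], [5.0]]
y = [0] * 16 + [1] * 4

# Append the new feature; no sample is added, removed, or moved.
X_aug = [row + [sparsity(row, X, skip=i)] for i, row in enumerate(X)]

query = [1.8]  # lies in the sparse region just outside the majority cluster
query_aug = query + [sparsity(query, X)]

p_before = posterior(fit_gnb(X, y), query)[1]
p_after = posterior(fit_gnb(X_aug, y), query_aug)[1]
print(f"minority posterior: {p_before:.3f} -> {p_after:.3f}")
```

On this toy data the minority posterior for the query moves from below 0.5 with the original coordinate alone to above 0.5 once the extra feature is appended, while the number of samples stays the same. How much benefit a real density feature gives depends on the dataset.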
ISSN: 2376-5992