GDSMOTE: A Novel Synthetic Oversampling Method for High-Dimensional Imbalanced Financial Data

Synthetic oversampling methods for dealing with imbalanced classification problems have been widely studied. However, the current synthetic oversampling methods still cannot perform well when facing high-dimensional imbalanced financial data. The failure of distance measurement in high-dimensional s...

Bibliographic Details
Main Authors: Libin Hu, Yunfeng Zhang
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: Mathematics
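Subjects: synthetic oversampling; high-dimensional imbalanced financial data; gradient distribution; gradient right nearest neighbor; safety gradient distribution approximation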
Online Access: https://www.mdpi.com/2227-7390/12/24/4036
_version_ 1850241479177404416
author Libin Hu
Yunfeng Zhang
author_facet Libin Hu
Yunfeng Zhang
author_sort Libin Hu
collection DOAJ
description Synthetic oversampling methods for imbalanced classification have been widely studied, but current methods still perform poorly on high-dimensional imbalanced financial data. Three factors mainly limit their performance: distance measures break down in high-dimensional space, noise samples cause errors to accumulate, and the distribution of synthetic samples reduces the recognition accuracy of majority-class samples. To address these factors, the authors propose a novel synthetic oversampling method, the gradient distribution-based synthetic minority oversampling technique (GDSMOTE). First, gradient contribution, rather than spatial distance, is used to assign minority-class samples to different gradient intervals. Second, the root sample selection strategy of GDSMOTE avoids the error accumulation caused by noise samples, and a new concept of nearest neighbor is proposed to determine the auxiliary samples. Finally, a safety gradient distribution approximation strategy based on cosine similarity determines the number of samples to synthesize in each safety gradient interval. Experiments on high-dimensional imbalanced financial datasets show that GDSMOTE can achieve higher F1-score and MCC than baseline methods while also achieving higher recall. The method thus improves the recognition accuracy of minority-class samples without sacrificing that of majority-class samples, and it adapts well to data decision-making tasks in the financial field.
format Article
id doaj-art-a839d269cfc347628f1602a4f11c8a7e
institution OA Journals
issn 2227-7390
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj-art-a839d269cfc347628f1602a4f11c8a7e
2025-08-20T02:00:35Z
eng
MDPI AG
Mathematics
2227-7390
2024-12-01
12(24):4036
10.3390/math12244036
GDSMOTE: A Novel Synthetic Oversampling Method for High-Dimensional Imbalanced Financial Data
Libin Hu (School of Management Science and Engineering, Shandong University of Finance and Economics, Jinan 250014, China)
Yunfeng Zhang (School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan 250014, China)
https://www.mdpi.com/2227-7390/12/24/4036
synthetic oversampling
high-dimensional imbalanced financial data
gradient distribution
gradient right nearest neighbor
safety gradient distribution approximation
spellingShingle Libin Hu
Yunfeng Zhang
GDSMOTE: A Novel Synthetic Oversampling Method for High-Dimensional Imbalanced Financial Data
Mathematics
synthetic oversampling
high-dimensional imbalanced financial data
gradient distribution
gradient right nearest neighbor
safety gradient distribution approximation
title GDSMOTE: A Novel Synthetic Oversampling Method for High-Dimensional Imbalanced Financial Data
title_sort gdsmote a novel synthetic oversampling method for high dimensional imbalanced financial data
topic synthetic oversampling
high-dimensional imbalanced financial data
gradient distribution
gradient right nearest neighbor
safety gradient distribution approximation
url https://www.mdpi.com/2227-7390/12/24/4036
work_keys_str_mv AT libinhu gdsmoteanovelsyntheticoversamplingmethodforhighdimensionalimbalancedfinancialdata
AT yunfengzhang gdsmoteanovelsyntheticoversamplingmethodforhighdimensionalimbalancedfinancialdata
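
The description field reports recall, F1-score, and MCC together. The following Python sketch may help show why that combination matters: recall on the minority class can rise simply by labeling more samples as minority, whereas F1 and MCC also penalize the resulting false positives. The function name `minority_metrics` and all confusion-matrix counts are hypothetical illustrations, not code or results from the paper.

```python
import math


def minority_metrics(tp, fp, fn, tn):
    """Return (recall, precision, f1, mcc), treating the minority class as positive."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient uses all four cells of the confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return recall, precision, f1, mcc


# Hypothetical counts for two classifiers on the same 1000 samples
# (100 minority, 900 majority). These are NOT results from the paper.
aggressive = minority_metrics(tp=90, fp=200, fn=10, tn=700)  # many majority samples misclassified
balanced = minority_metrics(tp=85, fp=40, fn=15, tn=860)     # few majority samples misclassified

for name, (rec, prec, f1, mcc) in (("aggressive", aggressive), ("balanced", balanced)):
    print(f"{name}: recall={rec:.3f} precision={prec:.3f} F1={f1:.3f} MCC={mcc:.3f}")
```

In this made-up comparison the first classifier has the higher recall but the lower F1 and MCC, because it misclassifies many majority-class samples. A method that raises recall, F1, and MCC at the same time, as the description claims for GDSMOTE, is therefore not buying minority-class recall at the expense of the majority class.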