Unsupervised feature selection and class labeling for credit card fraud
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen 2025-05-01 |
| Series: | Journal of Big Data |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s40537-025-01154-1 |
| Summary: | Large datasets frequently lack class labels, and obtaining labeled data often involves substantial financial and time costs, along with risks of label noise and inaccuracies due to manual annotation. In fraud detection settings such as credit card fraud, these challenges are compounded by privacy concerns and high class imbalance, which severely degrade the classification performance of machine learning models. In this paper, we present a fully unsupervised approach that combines SHapley Additive exPlanations (SHAP) for feature selection with an autoencoder-based method for generating class labels for a widely used, publicly available credit card fraud detection dataset. From this dataset, we construct datasets of different sizes using feature selection, generate class labels, and measure the quality and efficacy of the labels. We evaluate the labels by training different types of supervised classifiers on the newly generated labels and measuring their Area Under the Precision-Recall Curve (AUPRC). Empirical results show that using SHAP feature selection consistently and significantly improves the quality and usability of the generated class labels, as measured by the AUPRC performance of classifiers trained on them. Results also show that the generated labels, both with and without a feature selection preprocessing step, outperform Isolation Forest (IF), an unsupervised anomaly detection method used as a baseline. This demonstrates that SHAP-based feature ranking and selection significantly improves generated class label quality for credit card fraud detection and is a promising strategy for handling large, imbalanced, and unlabeled fraud detection datasets. |
|---|---|
| ISSN: | 2196-1115 |
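
The summary reports results as AUPRC, a metric suited to highly imbalanced data because a random ranking scores roughly the fraud prevalence rather than 0.5. As a rough illustration only, the sketch below computes average precision, one common step-wise estimate of the area under the precision-recall curve. It is not the authors' evaluation code; the function name `average_precision` and the toy labels and scores are hypothetical, and the sketch assumes all scores are distinct (tied scores would need to be grouped).

```python
def average_precision(labels, scores):
    """Average precision: a step-wise estimate of the area under the PR curve.

    labels: sequence of 0/1 values (1 = fraud).
    scores: sequence of floats, higher = more suspicious (assumed distinct).
    """
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("need at least one positive label")

    true_positives = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this rank, weighted by the recall step 1/total_positives.
            ap += (true_positives / rank) / total_positives
    return ap


# Hypothetical toy data: two fraud cases among five transactions.
toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80]
toy_ap = average_precision(toy_labels, toy_scores)
```

In the toy data the two fraud cases land at ranks 1 and 3, so the precision values averaged are 1/1 and 2/3.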