Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10982066/ |
| Summary: | Machine learning (ML) practitioners are always in pursuit of refined data for developing robust, generalizable ML models that solve real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, so these problems must be addressed before the datasets are processed by ML models. Among other factors, redundant samples are becoming a greater challenge in ML development due to advances in data-capturing tools (sensors, wearable devices, etc.). Redundant samples increase computing and storage requirements while contributing minimally to predictive performance, necessitating their removal before the training phase. However, removing redundant samples without degrading the predictive performance of ML models is challenging because it requires deep analysis of all the data and of the correlations among features. In this paper, we propose a dominant-pattern and Hamming distance-based sampling scheme that prunes redundant samples from the data without degrading predictive performance. Specifically, we reduce the data size by a reasonable margin while maintaining predictive performance similar to or better than that of the original data, with reduced training time. Our sampling scheme has five key steps: data pre-processing, dominant-pattern extraction by exploiting correlations between features, Hamming distance-based classification of the data into diverse and less diverse parts, data clustering to prune redundant samples from the less diverse parts, and fine-tuning/synthesizing the final data (a sketch of the Hamming-distance and clustering steps appears after this record). The key objective is to curate compact, diverse, and high-fidelity data that accurately preserves the characteristics of the original data while resolving the accuracy-versus-time trade-off. Experiments are performed on binary benchmark datasets using diverse ML classifiers, and predictive performance, data properties, and computing time are compared against the original data and prior data-reduction schemes. In our experiments and analysis, the proposed scheme yielded much better results than its counterparts and the original data without compromising predictive performance or data fidelity. Our results and analysis provide a new perspective on building ML models with compact data (i.e., a mini-version of the original data) while achieving better predictive performance. |
| ISSN: | 2169-3536 |
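
The summary above describes a pipeline that splits samples by their Hamming distance to a dominant pattern and then prunes near-duplicates from the less diverse part via clustering. The following is a minimal sketch of those two steps, not the authors' implementation: the majority-vote dominant pattern, the 0.3 distance threshold, the use of scikit-learn's KMeans, and the per-cluster keep count are all illustrative assumptions.

```python
# Sketch of Hamming distance-based splitting and clustering-based pruning.
# Assumptions (not from the paper): majority-vote dominant pattern,
# threshold=0.3, KMeans clustering, 5 representatives kept per cluster.
import numpy as np
from sklearn.cluster import KMeans

def hamming_split(X_bin, dominant_pattern, threshold=0.3):
    """Partition binary samples into 'diverse' and 'less diverse' parts
    by normalized Hamming distance to a dominant pattern."""
    # Fraction of positions where each sample differs from the pattern.
    dist = np.mean(X_bin != dominant_pattern, axis=1)
    return X_bin[dist >= threshold], X_bin[dist < threshold]

def prune_redundant(less_diverse, n_clusters=10, keep_per_cluster=5, seed=0):
    """Cluster the less diverse part and keep only a few representatives
    per cluster, discarding near-duplicate (redundant) samples."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(less_diverse)
    kept = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Retain the samples closest to the cluster centroid.
        d = np.linalg.norm(
            less_diverse[members] - km.cluster_centers_[c], axis=1)
        kept.extend(members[np.argsort(d)[:keep_per_cluster]])
    return less_diverse[np.sort(np.asarray(kept, dtype=int))]

# Toy usage on random binary data. The dominant pattern is taken as the
# per-feature majority value -- one plausible reading of "dominant pattern".
rng = np.random.default_rng(0)
X = (rng.random((1000, 32)) < 0.7).astype(int)
pattern = (X.mean(axis=0) >= 0.5).astype(int)
diverse, less_diverse = hamming_split(X, pattern)
compact = np.vstack([diverse, prune_redundant(less_diverse)])
print(f"original: {len(X)} samples, compact: {len(compact)} samples")
```

Keeping all of the diverse part while thinning only the low-distance (near-pattern) part is one way to read the abstract's goal of a compact set that preserves the original data's characteristics; the threshold and keep count would need tuning per dataset.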