Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme
Machine learning (ML) practitioners are always in pursuit of refined data to develop robust and generalizable ML models to solve real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, prompting the need to address these problems before the datase...
Saved in:
| Main Authors: | Abdul Majeed, Seong Oun Hwang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Data diversity, machine learning, sample pruning, training data, predictive performance |
| Online Access: | https://ieeexplore.ieee.org/document/10982066/ |
| _version_ | 1849328578314895360 |
|---|---|
| author | Abdul Majeed, Seong Oun Hwang |
| author_facet | Abdul Majeed, Seong Oun Hwang |
| author_sort | Abdul Majeed |
| collection | DOAJ |
| description | Machine learning (ML) practitioners are always in pursuit of refined data to develop robust and generalizable ML models to solve real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, prompting the need to address these problems before the datasets are processed by ML models. Among other factors, redundant samples are becoming a greater challenge in ML development due to advancements in data-capturing tools (sensors, wearable devices, etc.). Redundant samples can increase computing and storage requirements while contributing minimally to predictive performance, necessitating their removal before the training phase. However, removing redundant samples without degrading the predictive performance of ML models is challenging because it requires deep analysis of all the data and the correlations among features. In this paper, we propose a dominant-pattern and Hamming-distance-based sampling scheme to prune redundant samples from the data without degrading predictive performance. Specifically, we reduce the data size by a reasonable margin while maintaining predictive performance similar to or better than that of the original data, with reduced training time. Our sampling scheme has five key steps: data pre-processing, dominant pattern extraction by exploiting correlations between features, Hamming-distance-based classification of the data into diverse and less diverse parts, data clustering for redundant-sample pruning from the less diverse parts, and fine-tuning/synthesizing the final data. The key objective is to curate compact, diverse, and high-fidelity data that accurately preserves the characteristics of the original data while addressing the accuracy-versus-time trade-off. Experiments are performed on binary benchmark datasets using diverse ML classifiers, and predictive performance, data properties, and computing time are compared with the original data and prior data reduction schemes. From the experiments and analysis, our scheme yielded much better results than its counterparts and the original data, without compromising predictive performance or data fidelity. Our results and analysis provide a new perspective on building ML models with compact data (i.e., a mini-version of the original data) while providing better predictive performance. |
| format | Article |
| id | doaj-art-3895fdf40dd744939d8329035bb3fd85 |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-3895fdf40dd744939d8329035bb3fd852025-08-20T03:47:33ZengIEEEIEEE Access2169-35362025-01-0113796557967710.1109/ACCESS.2025.356643010982066Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling SchemeAbdul Majeed0https://orcid.org/0000-0002-3030-5054Seong Oun Hwang1https://orcid.org/0000-0003-4240-6255Department of Computer Engineering, Gachon University, Seongnam, South KoreaDepartment of Computer Engineering, Gachon University, Seongnam, South KoreaMachine learning (ML) practitioners are always in pursuit of refined data to develop robust and generalizable ML models to solve real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, prompting the need to address these problems before the datasets are processed by ML models. Among other factors, redundant samples are becoming a greater challenge in ML development due to advancements in data-capturing tools (sensors, wearable devices, etc.). Redundant samples can increase computing and storage requirements while minimally contributing to predictive performance, necessitating their removal before the training phase. However, removing redundant samples without degrading predictive performance from ML models is challenging because it requires deep analysis of all the data and the correlations among features. In this paper, we propose a dominant patterns and Hamming distance-based sampling scheme to prune redundant samples from the data without degrading predictive performance. Specifically, we reduce the data size by a reasonable margin while maintaining predictive performance similar to or better than the original data with reduced training time. 
Our sampling scheme has five key steps: data pre-processing, dominant pattern extraction by exploiting correlations between features, Hamming distance-based data classification into diverse and less diverse parts, data clustering for redundant-sample pruning from less diverse parts, and fine-tuning/synthesizing the final data. The key objective is to curate compact, diverse, and high-fidelity data that accurately preserves the characteristics of original data while solving accuracy-versus-time trade-off. Experiments are performed on benchmark datasets of binary nature using diverse ML classifiers, and predictive performance, data properties, and computing time are compared with original data and prior data reduction schemes. From the experiments and analysis, our scheme yielded much better results without compromising predictive performance and data fidelity than its counterparts and original data. Our results and analysis provide a new perspective on building ML models with compact data (i.e., a mini-version of the original data) while providing better predictive performance.https://ieeexplore.ieee.org/document/10982066/Data diversitymachine learningsample pruningtraining datapredictive performance |
| spellingShingle | Abdul Majeed Seong Oun Hwang Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme IEEE Access Data diversity machine learning sample pruning training data predictive performance |
| title | Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme |
| title_full | Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme |
| title_fullStr | Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme |
| title_full_unstemmed | Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme |
| title_short | Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme |
| title_sort | data compactness versus prediction performance achieving both by pruning redundant samples with dominant patterns and hamming distance based sampling scheme |
| topic | Data diversity machine learning sample pruning training data predictive performance |
| url | https://ieeexplore.ieee.org/document/10982066/ |
| work_keys_str_mv | AT abdulmajeed datacompactnessversuspredictionperformanceachievingbothbypruningredundantsampleswithdominantpatternsandhammingdistancebasedsamplingscheme AT seongounhwang datacompactnessversuspredictionperformanceachievingbothbypruningredundantsampleswithdominantpatternsandhammingdistancebasedsamplingscheme |
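The five-step scheme summarized in the abstract (dominant-pattern extraction, a Hamming-distance split into diverse and less diverse parts, and pruning only from the less diverse part) can be sketched roughly as follows. This is a minimal illustration on binary data; the function names, the distance threshold, and the deduplication rule are assumptions for exposition, not the authors' actual algorithm.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))

def dominant_pattern(rows):
    """Per-feature majority value across a binary dataset."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

def prune(rows, threshold=2):
    """Keep every diverse row; keep one copy of each less diverse row."""
    dom = dominant_pattern(rows)
    kept, seen = [], set()
    for row in rows:
        if hamming(row, dom) >= threshold:
            kept.append(row)               # diverse part: keep as-is
        elif tuple(row) not in seen:       # less diverse part: deduplicate
            seen.add(tuple(row))
            kept.append(row)
    return kept

rows = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],  # duplicate near the dominant pattern -> pruned
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],  # duplicate, but in the diverse part -> kept
]
compact = prune(rows)  # 4 of the 5 rows survive
```

Pruning only the part close to the dominant pattern mirrors the paper's stated goal: duplicates of common samples add little predictive signal, while rare (diverse) samples are preserved to keep the compact data high-fidelity.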