Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme

Machine learning (ML) practitioners are always in pursuit of refined data to develop robust and generalizable ML models for solving real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, prompting the need to address these problems before the datasets are processed by ML models. Among other factors, redundant samples are becoming a greater challenge in ML development due to advancements in data-capturing tools (sensors, wearable devices, etc.). Redundant samples increase computing and storage requirements while contributing minimally to predictive performance, necessitating their removal before the training phase. However, removing redundant samples without degrading the predictive performance of ML models is challenging because it requires deep analysis of all the data and of the correlations among features. In this paper, we propose a dominant-pattern and Hamming-distance-based sampling scheme that prunes redundant samples from data without degrading predictive performance. Specifically, we reduce the data size by a reasonable margin while maintaining predictive performance similar to or better than that of the original data, with reduced training time. Our sampling scheme has five key steps: data pre-processing; dominant-pattern extraction by exploiting correlations between features; Hamming-distance-based classification of the data into diverse and less diverse parts; data clustering to prune redundant samples from the less diverse parts; and fine-tuning/synthesizing the final data. The key objective is to curate compact, diverse, and high-fidelity data that accurately preserves the characteristics of the original data while resolving the accuracy-versus-time trade-off. Experiments are performed on benchmark binary datasets using diverse ML classifiers, and predictive performance, data properties, and computing time are compared against the original data and prior data-reduction schemes. From the experiments and analysis, our scheme yielded much better results than its counterparts and the original data without compromising predictive performance or data fidelity. Our results and analysis provide a new perspective on building ML models with compact data (i.e., a mini-version of the original data) while providing better predictive performance.
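The Hamming-distance step described in the abstract — separating data into diverse and less diverse parts and discarding near-duplicate samples — can be illustrated with a minimal sketch on binary data. This is not the authors' implementation: the greedy single pass, the `threshold` value, and the `hamming_prune` helper are assumptions made purely for illustration.

```python
import numpy as np

def hamming_prune(X, threshold=0.2):
    """Illustrative sketch: keep a sample only if its normalized Hamming
    distance to every already-kept sample is at least `threshold`.
    X is a binary (0/1) sample-by-feature matrix; returns kept row indices."""
    kept = []
    for i, row in enumerate(X):
        # A sample is redundant if it is too close to any kept sample.
        redundant = any(np.mean(row != X[j]) < threshold for j in kept)
        if not redundant:
            kept.append(i)
    return np.array(kept)

# Toy binary dataset: row 1 duplicates row 0 and should be pruned.
X = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],   # exact duplicate of row 0 -> pruned
    [0, 1, 0, 0, 1],   # far from row 0 -> kept
    [0, 1, 0, 1, 1],   # distance 0.2 to row 2, not below threshold -> kept
])
print(hamming_prune(X, threshold=0.2))  # → [0 2 3]
```

The greedy pass keeps the first representative of each near-duplicate group; the paper's scheme additionally clusters the less diverse part and fine-tunes the result, which this sketch does not attempt.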

Saved in:
Bibliographic Details
Main Authors: Abdul Majeed, Seong Oun Hwang
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Data diversity; machine learning; sample pruning; training data; predictive performance
Online Access:https://ieeexplore.ieee.org/document/10982066/
collection DOAJ
description Machine learning (ML) practitioners are always in pursuit of refined data to develop robust and generalizable ML models for solving real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, prompting the need to address these problems before the datasets are processed by ML models. Among other factors, redundant samples are becoming a greater challenge in ML development due to advancements in data-capturing tools (sensors, wearable devices, etc.). Redundant samples increase computing and storage requirements while contributing minimally to predictive performance, necessitating their removal before the training phase. However, removing redundant samples without degrading the predictive performance of ML models is challenging because it requires deep analysis of all the data and of the correlations among features. In this paper, we propose a dominant-pattern and Hamming-distance-based sampling scheme that prunes redundant samples from data without degrading predictive performance. Specifically, we reduce the data size by a reasonable margin while maintaining predictive performance similar to or better than that of the original data, with reduced training time. Our sampling scheme has five key steps: data pre-processing; dominant-pattern extraction by exploiting correlations between features; Hamming-distance-based classification of the data into diverse and less diverse parts; data clustering to prune redundant samples from the less diverse parts; and fine-tuning/synthesizing the final data. The key objective is to curate compact, diverse, and high-fidelity data that accurately preserves the characteristics of the original data while resolving the accuracy-versus-time trade-off. Experiments are performed on benchmark binary datasets using diverse ML classifiers, and predictive performance, data properties, and computing time are compared against the original data and prior data-reduction schemes. From the experiments and analysis, our scheme yielded much better results than its counterparts and the original data without compromising predictive performance or data fidelity. Our results and analysis provide a new perspective on building ML models with compact data (i.e., a mini-version of the original data) while providing better predictive performance.
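The dominant-pattern step can also be sketched in miniature. This is again not the paper's correlation-driven method — the `dominant_patterns` helper and its simple frequency-ranking criterion are illustrative assumptions only:

```python
from collections import Counter

def dominant_patterns(X, k=2):
    """Illustrative: treat each binary row as a pattern and return the
    k most frequent patterns with their counts (a frequency-based
    stand-in for the paper's dominant-pattern extraction)."""
    counts = Counter(tuple(row) for row in X)
    return counts.most_common(k)

# Toy binary dataset: (1, 0, 1) is the dominant pattern.
X = [
    (1, 0, 1), (1, 0, 1), (1, 0, 1),
    (0, 1, 0), (0, 1, 0),
    (1, 1, 1),
]
print(dominant_patterns(X, k=2))  # → [((1, 0, 1), 3), ((0, 1, 0), 2)]
```

In the scheme described by the abstract, such patterns would guide which samples count as diverse before the Hamming-distance split; this sketch only shows the pattern-ranking idea.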
id doaj-art-3895fdf40dd744939d8329035bb3fd85
institution Kabale University
issn 2169-3536
spelling doaj-art-3895fdf40dd744939d8329035bb3fd85 (updated 2025-08-20T03:47:33Z)
DOI: 10.1109/ACCESS.2025.3566430 — IEEE Access, vol. 13, 2025, pp. 79655–79677, article 10982066
Abdul Majeed (https://orcid.org/0000-0002-3030-5054), Department of Computer Engineering, Gachon University, Seongnam, South Korea
Seong Oun Hwang (https://orcid.org/0000-0003-4240-6255), Department of Computer Engineering, Gachon University, Seongnam, South Korea
topic Data diversity
machine learning
sample pruning
training data
predictive performance