Data Compactness Versus Prediction Performance: Achieving Both by Pruning Redundant Samples With Dominant Patterns and Hamming Distance Based Sampling Scheme

Machine learning (ML) practitioners are always in pursuit of refined data to develop robust and generalizable ML models for solving real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, prompting the need to address these problems before the datasets are processed by ML models. Among other factors, redundant samples are becoming a greater challenge in ML development due to advancements in data-capturing tools (sensors, wearable devices, etc.). Redundant samples increase computing and storage requirements while contributing minimally to predictive performance, necessitating their removal before the training phase. However, removing redundant samples without degrading the predictive performance of ML models is challenging because it requires deep analysis of all the data and of the correlations among features. In this paper, we propose a dominant-pattern and Hamming-distance-based sampling scheme that prunes redundant samples from data without degrading predictive performance. Specifically, we reduce the data size by a reasonable margin while maintaining predictive performance similar to or better than that of the original data, with reduced training time. Our sampling scheme has five key steps: data pre-processing; dominant-pattern extraction by exploiting correlations between features; Hamming-distance-based classification of the data into diverse and less diverse parts; data clustering to prune redundant samples from the less diverse parts; and fine-tuning/synthesizing the final data. The key objective is to curate compact, diverse, and high-fidelity data that accurately preserves the characteristics of the original data while resolving the accuracy-versus-time trade-off. Experiments are performed on benchmark binary datasets using diverse ML classifiers, and predictive performance, data properties, and computing time are compared against the original data and prior data-reduction schemes. From the experiments and analysis, our scheme yielded much better results than its counterparts and the original data without compromising predictive performance or data fidelity. Our results and analysis provide a new perspective on building ML models with compact data (i.e., a mini-version of the original data) while providing better predictive performance.
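The Hamming-distance step described in the abstract — separating data into diverse and less diverse parts and discarding near-duplicate samples — can be illustrated with a minimal sketch on binary data. This is not the authors' implementation: the greedy single pass, the `threshold` value, and the `hamming_prune` helper are assumptions made purely for illustration.

```python
import numpy as np

def hamming_prune(X, threshold=0.2):
    """Illustrative sketch: keep a sample only if its normalized Hamming
    distance to every already-kept sample is at least `threshold`.
    X is a binary (0/1) sample-by-feature matrix; returns kept row indices."""
    kept = []
    for i, row in enumerate(X):
        # A sample is redundant if it is too close to any kept sample.
        redundant = any(np.mean(row != X[j]) < threshold for j in kept)
        if not redundant:
            kept.append(i)
    return np.array(kept)

# Toy binary dataset: row 1 duplicates row 0 and should be pruned.
X = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],   # exact duplicate of row 0 -> pruned
    [0, 1, 0, 0, 1],   # far from row 0 -> kept
    [0, 1, 0, 1, 1],   # distance 0.2 to row 2, not below threshold -> kept
])
print(hamming_prune(X, threshold=0.2))  # → [0 2 3]
```

The greedy pass keeps the first representative of each near-duplicate group; the paper's scheme additionally clusters the less diverse part and fine-tunes the result, which this sketch does not attempt.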

Saved in:
Bibliographic Details
Main Authors: Abdul Majeed, Seong Oun Hwang
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Data diversity; machine learning; sample pruning; training data; predictive performance
Online Access:https://ieeexplore.ieee.org/document/10982066/
collection DOAJ
description Machine learning (ML) practitioners are always in pursuit of refined data to develop robust and generalizable ML models for solving real-world problems. However, most real-world datasets are noisy, imbalanced, and contain redundant samples, prompting the need to address these problems before the datasets are processed by ML models. Among other factors, redundant samples are becoming a greater challenge in ML development due to advancements in data-capturing tools (sensors, wearable devices, etc.). Redundant samples increase computing and storage requirements while contributing minimally to predictive performance, necessitating their removal before the training phase. However, removing redundant samples without degrading the predictive performance of ML models is challenging because it requires deep analysis of all the data and of the correlations among features. In this paper, we propose a dominant-pattern and Hamming-distance-based sampling scheme that prunes redundant samples from data without degrading predictive performance. Specifically, we reduce the data size by a reasonable margin while maintaining predictive performance similar to or better than that of the original data, with reduced training time. Our sampling scheme has five key steps: data pre-processing; dominant-pattern extraction by exploiting correlations between features; Hamming-distance-based classification of the data into diverse and less diverse parts; data clustering to prune redundant samples from the less diverse parts; and fine-tuning/synthesizing the final data. The key objective is to curate compact, diverse, and high-fidelity data that accurately preserves the characteristics of the original data while resolving the accuracy-versus-time trade-off. Experiments are performed on benchmark binary datasets using diverse ML classifiers, and predictive performance, data properties, and computing time are compared against the original data and prior data-reduction schemes. From the experiments and analysis, our scheme yielded much better results than its counterparts and the original data without compromising predictive performance or data fidelity. Our results and analysis provide a new perspective on building ML models with compact data (i.e., a mini-version of the original data) while providing better predictive performance.
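The dominant-pattern step can also be sketched in miniature. This is again not the paper's correlation-driven method — the `dominant_patterns` helper and its simple frequency-ranking criterion are illustrative assumptions only:

```python
from collections import Counter

def dominant_patterns(X, k=2):
    """Illustrative: treat each binary row as a pattern and return the
    k most frequent patterns with their counts (a frequency-based
    stand-in for the paper's dominant-pattern extraction)."""
    counts = Counter(tuple(row) for row in X)
    return counts.most_common(k)

# Toy binary dataset: (1, 0, 1) is the dominant pattern.
X = [
    (1, 0, 1), (1, 0, 1), (1, 0, 1),
    (0, 1, 0), (0, 1, 0),
    (1, 1, 1),
]
print(dominant_patterns(X, k=2))  # → [((1, 0, 1), 3), ((0, 1, 0), 2)]
```

In the scheme described by the abstract, such patterns would guide which samples count as diverse before the Hamming-distance split; this sketch only shows the pattern-ranking idea.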
id doaj-art-3895fdf40dd744939d8329035bb3fd85
institution Kabale University
issn 2169-3536
spelling doaj-art-3895fdf40dd744939d8329035bb3fd85 (updated 2025-08-20T03:47:33Z)
DOI: 10.1109/ACCESS.2025.3566430 — IEEE Access, vol. 13, 2025, pp. 79655–79677, article 10982066
Abdul Majeed (https://orcid.org/0000-0002-3030-5054), Department of Computer Engineering, Gachon University, Seongnam, South Korea
Seong Oun Hwang (https://orcid.org/0000-0003-4240-6255), Department of Computer Engineering, Gachon University, Seongnam, South Korea
topic Data diversity
machine learning
sample pruning
training data
predictive performance