Malware Detection Using a Random Forest Method Trained on a Balanced Synthetic Dataset

The accuracy of malware detection depends heavily on the available datasets, which are often small and imbalanced. To overcome these challenges, this study proposes a method that creates synthetic malware data, increasing dataset size and balance by generating several datasets with a flow-base...

Bibliographic Details
Main Authors:	Neo Onica Matsobane, Sello Mokwena
Format:	Article
Language:	English
Published:	IMS Vogosca 2025-03-01
Series:	Science, Engineering and Technology
Subjects:	malware detection, accuracy, random forest, flow-based model, balanced dataset, synthetic dataset
Online Access:	https://setjournal.com/SET/article/view/167

_version_	1849735959054123008
author	Neo Onica Matsobane, Sello Mokwena
author_facet	Neo Onica Matsobane, Sello Mokwena
author_sort	Neo Onica Matsobane
collection	DOAJ
description	The accuracy of malware detection depends heavily on the available datasets, which are often small and imbalanced. To overcome these challenges, this study proposes a method that creates synthetic malware data, increasing dataset size and balance by generating several datasets with a flow-based model. A random forest classifier is then fitted on the augmented dataset. The study aims to analyze synthetic data generation with flow-based models and its impact on the performance of a random forest for malware detection. A flow-based model was used to generate a balanced synthetic dataset from the CICMalDroid2020 dataset. The generated data was used for feature selection and engineering to optimize the Random Forest model. The experimental results demonstrate the effectiveness of the proposed approach. The flow-based model generated an additional 13,402 samples, more than doubling the original dataset of 11,598 entries. After training on the synthetically augmented dataset, the Random Forest model outperformed the model evaluated on the original dataset, achieving precision of 93%, recall of 100%, balanced precision of 96%, and an F1 score of 91%. The results show that synthetic data generated by a flow-based model can significantly enhance malware detection capabilities.
format	Article
id	doaj-art-56253d52cc6f47a7bd0345c579513dfe
institution	DOAJ
issn	2831-1043 2744-2527
language	English
publishDate	2025-03-01
publisher	IMS Vogosca
record_format	Article
series	Science, Engineering and Technology
spelling	Record doaj-art-56253d52cc6f47a7bd0345c579513dfe (2025-08-20T03:07:24Z); language: eng; publisher: IMS Vogosca; series: Science, Engineering and Technology; ISSN: 2831-1043, 2744-2527; published: 2025-03-01; volume 5, issue 1; DOI: 10.54327/set2025/v5.i1.167; authors: Neo Onica Matsobane (https://orcid.org/0000-0001-7912-7853), Sello Mokwena (https://orcid.org/0000-0002-6160-863X); affiliation (both authors): Department of Computer Sciences, Faculty of Science and Agriculture, University of Limpopo, Polokwane, South Africa.
title	Malware Detection Using a Random Forest Method Trained on a Balanced Synthetic Dataset
topic	malware detection, accuracy, random forest, flow-based model, balanced dataset, synthetic dataset
url	https://setjournal.com/SET/article/view/167
work_keys_str_mv	AT neoonicamatsobane malwaredetectionusingarandomforestmethodtrainedonabalancedsyntheticdataset AT sellomokwena malwaredetectionusingarandomforestmethodtrainedonabalancedsyntheticdataset
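
The description above reports a dataset size increase and several classification metrics. As a reading aid, the following minimal Python sketch shows how the augmented dataset size follows from the two reported counts and how precision, recall, and the F1 score are conventionally computed from binary confusion-matrix counts. The `ConfusionCounts` type, the function names, and the example counts are hypothetical and chosen only for illustration; they are not taken from the article and do not reproduce its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, treating malware as the positive class."""
    tp: int  # malware correctly flagged
    fp: int  # benign samples wrongly flagged
    fn: int  # malware missed
    tn: int  # benign samples correctly passed


def precision(c: ConfusionCounts) -> float:
    """Fraction of flagged samples that really are malware."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Fraction of actual malware samples that were flagged."""
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Counts reported in the record's description.
ORIGINAL_ENTRIES = 11_598
SYNTHETIC_SAMPLES = 13_402
augmented_total = ORIGINAL_ENTRIES + SYNTHETIC_SAMPLES
growth_factor = augmented_total / ORIGINAL_ENTRIES

# Hypothetical confusion counts, NOT from the article.
example = ConfusionCounts(tp=90, fp=10, fn=0, tn=100)

if __name__ == "__main__":
    print(f"augmented dataset size: {augmented_total}")
    print(f"growth factor: {growth_factor:.2f}x")
    print(f"precision={precision(example):.3f} "
          f"recall={recall(example):.3f} "
          f"f1={f1_score(example):.3f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two values; the helper functions return 0.0 instead of dividing by zero when a denominator is empty.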


Similar Items