Non-Negative Matrix Factorization and Latent Semantic Analysis for Hybrid Feature Selection: A Proposed Machine Learning System for the Detection of Malicious Executable Files

During a typical cyber-attack lifecycle, several key phases are involved, including footprinting and reconnaissance, scanning, exploitation, and covering tracks. The successful delivery of a payload lies at the heart of ensuring the effectiveness of cyberattacks, which is typically executed followin...

Bibliographic Details
Main Authors: Moemedi Lefoane, Ibrahim Ghafir, Sohag Kabir, Irfan-Ullah Awan, Khalil El Hindi, Anand Mahendran
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11114946/
collection DOAJ
description A typical cyber-attack lifecycle involves several key phases, including footprinting and reconnaissance, scanning, exploitation, and covering tracks. Successful payload delivery, which typically follows the exploitation of vulnerabilities, is central to an effective cyberattack: it allows adversaries to gain backdoor access to their target and accomplish their objectives. With the increasing use of generative Artificial Intelligence (AI), adversaries are leveraging AI to enhance their attack strategies, from crafting more credible phishing and social engineering attacks to developing advanced viruses delivered through channels such as phishing. Efforts to devise AI techniques for detecting malicious executable files have attracted significant attention in the research community. Numerous Machine Learning (ML) techniques have been proposed for this purpose, but storing the extracted features demands substantial memory. These features, which resemble unstructured vocabulary features in natural language processing, must be converted into a rectangular matrix before they can be fed to a classification model for training. The resulting matrix is sparse, and its size grows with the number of unique features extracted, which substantially increases memory requirements. This research proposes a novel ML-based intrusion detection system for detecting malicious executable files. The proposed system uses Non-Negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA) individually as feature selection techniques, and also introduces a hybrid feature selection approach that combines the two. The system was assessed on a dataset containing benign and malicious executable files, yielding an accuracy of over 96% and a False Positive Rate (FPR) of less than 2.2% across several ML models.
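The description above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the extracted features are token-like strings, uses a dense matrix for brevity where the description refers to a sparse one, and covers only the NMF step (via standard multiplicative updates). The LSA step and the way the hybrid approach combines the two are not specified in this record and are not shown. The sample feature lists, the function names, and the confusion-matrix counts are all hypothetical.

```python
import random

def build_matrix(samples):
    """Turn per-file feature lists into a count matrix (rows = files, columns = unique features)."""
    vocab = sorted({tok for toks in samples for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for toks in samples:
        row = [0.0] * len(vocab)
        for tok in toks:
            row[index[tok]] += 1.0
        rows.append(row)
    return rows, vocab

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: V (n x m) ~ W (n x k) @ H (k x m), with W, H >= 0."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def accuracy_and_fpr(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / total; FPR = FP / (FP + TN)."""
    return (tp + tn) / (tp + tn + fp + fn), fp / (fp + tn)

# Hypothetical toy data: feature tokens extracted from four executables.
samples = [
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["CreateFileW", "ReadFile", "GetLastError"],
    ["VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"],
    ["VirtualAllocEx", "WriteProcessMemory", "CloseHandle"],
]
matrix, vocab = build_matrix(samples)   # 4 files x 7 unique features
w, h = nmf(matrix, k=2)                 # w holds 2 reduced features per file
error = frob_error(matrix, w, h)

# Hypothetical confusion-matrix counts, not results from the paper.
acc, fpr = accuracy_and_fpr(tp=48, tn=49, fp=1, fn=2)
```

In this sketch the rows of `w` would be the reduced inputs passed to a classifier in place of the full feature matrix; the column count drops from the vocabulary size to `k`, which is where the memory saving comes from.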
id doaj-art-ffee7b98b1c14a4197f8cfec8fe9a6bb
institution DOAJ
issn 2169-3536
record_timestamp 2025-08-20T03:02:53Z
volume 13
pages 138867-138882
doi 10.1109/ACCESS.2025.3596483
author_details Moemedi Lefoane (https://orcid.org/0000-0002-1057-1726), Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.
Ibrahim Ghafir (https://orcid.org/0000-0003-3702-3866), Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.
Sohag Kabir (https://orcid.org/0000-0001-7483-9974), Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.
Irfan-Ullah Awan, Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.
Khalil El Hindi (https://orcid.org/0000-0003-2457-9961), Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Anand Mahendran, School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India
topic Feature selection
latent semantic analysis
non-negative matrix factorization
malicious executable files
intrusion detection system
machine learning