Non-Negative Matrix Factorization and Latent Semantic Analysis for Hybrid Feature Selection: A Proposed Machine Learning System for the Detection of Malicious Executable Files
During a typical cyber-attack lifecycle, several key phases are involved, including footprinting and reconnaissance, scanning, exploitation, and covering tracks. The successful delivery of a payload lies at the heart of ensuring the effectiveness of cyberattacks, which is typically executed followin...
| Main Authors: | Moemedi Lefoane, Ibrahim Ghafir, Sohag Kabir, Irfan-Ullah Awan, Khalil El Hindi, Anand Mahendran |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE 2025-01-01 |
| Series: | IEEE Access |
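| Subjects: | Feature selection; latent semantic analysis; non-negative matrix factorization; malicious executable files; intrusion detection system; machine learning |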
| Online Access: | https://ieeexplore.ieee.org/document/11114946/ |
| _version_ | 1849770776487526400 |
|---|---|
| author | Moemedi Lefoane; Ibrahim Ghafir; Sohag Kabir; Irfan-Ullah Awan; Khalil El Hindi; Anand Mahendran |
| author_sort | Moemedi Lefoane |
| collection | DOAJ |
| description | During a typical cyber-attack lifecycle, several key phases are involved, including footprinting and reconnaissance, scanning, exploitation, and covering tracks. The successful delivery of a payload lies at the heart of ensuring the effectiveness of cyberattacks, which is typically executed following the exploitation of vulnerabilities. This allows adversaries to gain backdoor access to their target and accomplish their objectives. With the increasing use of generative Artificial Intelligence (AI), adversaries are leveraging AI to enhance their attack strategies, ranging from creating more credible phishing attacks and social engineering tactics to developing advanced viruses that are delivered through various means such as phishing attacks. Efforts to devise AI techniques for the detection of malicious executable files have garnered significant attention in the research community. While numerous Machine Learning (ML) techniques have been proposed for this purpose, a significant challenge arises due to the memory requirements for storing the extracted features. These features, resembling unstructured vocabulary features in natural language processing, need to be converted into a rectangular matrix for input into a classification model during training. The resulting matrix is sparse and its size depends on the unique features extracted, leading to a substantial increase in memory requirements, posing a significant challenge. This research proposes a novel ML-based intrusion detection system designed for the detection of malicious executable files. The proposed system utilises each of Non-Negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA) as an individual technique for feature selection. In addition to these two individual techniques, this system introduces a hybrid feature selection approach that combines both NMF and LSA. The proposed system was assessed using a dataset containing benign and malicious executable files, yielding a performance accuracy of over 96% and False Positive Rate (FPR) score of less than 2.2% across several ML models. |
| format | Article |
| id | doaj-art-ffee7b98b1c14a4197f8cfec8fe9a6bb |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-ffee7b98b1c14a4197f8cfec8fe9a6bb; 2025-08-20T03:02:53Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13; pp. 138867-138882; DOI 10.1109/ACCESS.2025.3596483; document 11114946; Non-Negative Matrix Factorization and Latent Semantic Analysis for Hybrid Feature Selection: A Proposed Machine Learning System for the Detection of Malicious Executable Files; Moemedi Lefoane (https://orcid.org/0000-0002-1057-1726), Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.; Ibrahim Ghafir (https://orcid.org/0000-0003-3702-3866), Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.; Sohag Kabir (https://orcid.org/0000-0001-7483-9974), Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.; Irfan-Ullah Awan, Faculty of Engineering and Digital Technologies, University of Bradford, Bradford, U.K.; Khalil El Hindi (https://orcid.org/0000-0003-2457-9961), Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia; Anand Mahendran, School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India; https://ieeexplore.ieee.org/document/11114946/; Feature selection; latent semantic analysis; non-negative matrix factorization; malicious executable files; intrusion detection system; machine learning |
| title | Non-Negative Matrix Factorization and Latent Semantic Analysis for Hybrid Feature Selection: A Proposed Machine Learning System for the Detection of Malicious Executable Files |
| topic | Feature selection; latent semantic analysis; non-negative matrix factorization; malicious executable files; intrusion detection system; machine learning |
| url | https://ieeexplore.ieee.org/document/11114946/ |
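
The abstract reports accuracy and False Positive Rate (FPR) figures. As a reading aid, the sketch below shows how these two metrics are conventionally computed from confusion-matrix counts, assuming "malicious" is the positive class. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the article or its dataset.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all files classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign files wrongly flagged as malicious."""
    negatives = fp + tn
    if negatives == 0:
        raise ValueError("no benign (negative) samples")
    return fp / negatives


# Hypothetical counts, NOT results from the article:
# tp = malicious files flagged, tn = benign files passed,
# fp = benign files flagged,    fn = malicious files missed.
tp, tn, fp, fn = 480, 490, 10, 20

acc = accuracy(tp, tn, fp, fn)
fpr = false_positive_rate(fp, tn)
print(f"accuracy = {acc:.3f}, FPR = {fpr:.3f}")
```

Note that the two metrics use different denominators: accuracy is taken over all files, while FPR is taken over benign files only, so a low FPR does not by itself imply high accuracy.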