OPTISTACK: A Hybrid Ensemble Learning and XAI-Based Approach for Malware Detection in Compressed Files

Bibliographic Details
Main Authors: Khaled Mahmud Sujon, Rohayanti Binti Hassan, M. Abdullah-Al-Wadud, Jia Uddin
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11036813/
author Khaled Mahmud Sujon
Rohayanti Binti Hassan
M. Abdullah-Al-Wadud
Jia Uddin
collection DOAJ
description The increasing reliance on compressed file formats for data storage and transmission has made them attractive vectors for malware propagation, as their structural complexity enables evasion of conventional detection mechanisms. Although entropy-based analysis has been widely applied in executable malware detection, its application to compressed file formats remains underexplored. Moreover, existing approaches are predominantly limited to Shannon entropy, failing to exploit the discriminative power of higher-order statistical metrics. Additionally, standalone machine learning models often suffer from limited generalizability and lack interpretability, hindering their real-world deployment in security-critical systems. To address these challenges, we propose OPTISTACK, a novel stacking ensemble framework that integrates Random Forest (RF), Decision Tree (DT), and XGBoost (XGB) as base learners with a Logistic Regression (LR) meta-classifier. Our model leverages an advanced entropy-based feature space, including Rényi entropy (with $\alpha = 2, 4, 6$), mean entropy, and quartile-based entropy (25th and 75th percentiles), to capture fine-grained statistical variations in compressed data. To the best of our knowledge, this is the first study to integrate higher-order entropy metrics and distributional entropy features into a stacking ensemble model for malware detection in compressed files. Extensive evaluation on the NapierOne dataset spans six prevalent compression formats: ZIP, 7ZIP, GZIP (GNU Zip), RAR (Roshal Archive), TAR (Tape Archive), and ZLIB. In this evaluation, OPTISTACK significantly outperforms traditional models, achieving 99.45% accuracy, 99.62% F1-score, 98.80% MCC, and 94.11% AUC-ROC.
Our PDP-ICE analysis reveals that minor variations in the 25th- and 75th-percentile entropy values lead to substantial shifts in classification probabilities, underscoring their critical role in model sensitivity and robustness. SHAP-based interpretability analysis further identifies the 25th-percentile entropy as the most influential feature across all models. Additionally, we introduce an entropy network graph-based vulnerability analysis that reveals ZIP and RAR as the most malware-prone formats. By combining stacking ensemble learning, advanced entropy metrics, and Explainable AI (XAI) techniques, OPTISTACK delivers a robust, interpretable, and generalizable framework for detecting malware in compressed file environments, addressing key limitations in existing cybersecurity methodologies.
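The entropy features named in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say how the features are computed, so the sketch assumes a byte-value probability distribution, Rényi entropy in bits, and mean and quartile entropy taken over per-block Shannon entropies. The function names and the `block_size` of 256 are hypothetical choices made for illustration only.

```python
import math
import statistics
from collections import Counter


def renyi_entropy(data: bytes, alpha: float) -> float:
    """Renyi entropy of order alpha, in bits, of the byte-value distribution.

    alpha == 1 is handled as the Shannon limit. Assumption: probabilities are
    byte-value frequencies; the catalog record does not specify this.
    """
    if not data:
        return 0.0
    n = len(data)
    probs = [count / n for count in Counter(data).values()]
    if alpha == 1:
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)


def entropy_features(data: bytes, block_size: int = 256) -> dict:
    """Build a feature dict: Renyi entropy (alpha = 2, 4, 6) over the whole
    input, plus mean, 25th- and 75th-percentile Shannon entropy over
    fixed-size blocks (hypothetical block_size, chosen for illustration)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    block_entropy = [renyi_entropy(block, 1) for block in blocks]

    if len(block_entropy) >= 2:
        q25, _, q75 = statistics.quantiles(block_entropy, n=4, method="inclusive")
    elif block_entropy:
        # A single block has no spread; reuse its value for both quartiles.
        q25 = q75 = block_entropy[0]
    else:
        q25 = q75 = 0.0

    return {
        "renyi_2": renyi_entropy(data, 2),
        "renyi_4": renyi_entropy(data, 4),
        "renyi_6": renyi_entropy(data, 6),
        "mean_entropy": statistics.fmean(block_entropy) if block_entropy else 0.0,
        "q25_entropy": q25,
        "q75_entropy": q75,
    }
```

Under these assumptions, a feature dict of this kind could serve as one input row for a stacking classifier; higher Rényi orders weight the most frequent byte values more heavily, which is one plausible reason to include several orders alongside Shannon-based statistics.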
format Article
id doaj-art-9916fa786f5f43c2a7f26dd4eaedb2a8
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-9916fa786f5f43c2a7f26dd4eaedb2a8
2025-08-20T03:23:57Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pages 104992-105026
DOI 10.1109/ACCESS.2025.3579880
Article number 11036813
Khaled Mahmud Sujon (https://orcid.org/0009-0009-4065-9874), Department of Software Engineering, Faculty of Computing, Universiti Teknologi Malaysia (UTM), Johor Bahru, Johor, Malaysia
Rohayanti Binti Hassan (https://orcid.org/0000-0003-1062-1719), Faculty of Computing, UTM, Johor Bahru, Johor, Malaysia
M. Abdullah-Al-Wadud (https://orcid.org/0000-0001-6767-3574), Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Jia Uddin (https://orcid.org/0000-0002-3403-4095), Artificial Intelligence and Big Data Department, Woosong University, Daejeon, Republic of Korea
title OPTISTACK: A Hybrid Ensemble Learning and XAI-Based Approach for Malware Detection in Compressed Files
topic Malware detection
compressed files
entropy-based analysis
explainable AI
ensemble learning
url https://ieeexplore.ieee.org/document/11036813/