Improving Generalization of ML-Based IDS With Lifecycle-Based Dataset, Auto-Learning Features, and Deep Learning

During the past 10 years, researchers have extensively explored the use of machine learning (ML) in enhancing network intrusion detection systems (IDS). While many studies focused on improving accuracy of ML-based IDS, true effectiveness lies in robust generalization: the ability to classify unseen...

Bibliographic Details
Main Authors: Didik Sudyana, Ying-Dar Lin, Miel Verkerken, Ren-Hung Hwang, Yuan-Cheng Lai, Laurens D'Hooge, Tim Wauters, Bruno Volckaert, Filip De Turck
Format: Article
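Language: English
Published: IEEE 2024-01-01
Series: IEEE Transactions on Machine Learning in Communications and Networking
Subjects: Intrusion detection; ML-based IDS; model generalization; lifecycle-based dataset; auto-learning features
Online Access: https://ieeexplore.ieee.org/document/10531223/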
author Didik Sudyana
Ying-Dar Lin
Miel Verkerken
Ren-Hung Hwang
Yuan-Cheng Lai
Laurens D'Hooge
Tim Wauters
Bruno Volckaert
Filip De Turck
author_sort Didik Sudyana
collection DOAJ
description Over the past 10 years, researchers have extensively explored the use of machine learning (ML) to enhance network intrusion detection systems (IDS). While many studies have focused on improving the accuracy of ML-based IDS, true effectiveness lies in robust generalization: the ability to classify unseen data accurately. Many existing models are trained and tested on the same dataset, which fails to represent real unseen scenarios. Others that are trained and tested on different datasets often struggle to generalize effectively. This study improves generalization through a novel composite approach that combines a lifecycle-based dataset (characterizing an attack as a sequence of techniques), automatic feature learning (auto-learning), and a CNN-based deep learning model. The resulting model is tested on five public datasets to assess its generalization performance. The proposed approach demonstrates strong generalization performance, achieving an average F1 score of 0.85 and a recall of 0.94. This significantly outperforms the average recalls of 0.56 and 0.42 achieved by models trained on the attack-based datasets CIC-IDS-2017 and CIC-IDS-2018, respectively. Furthermore, auto-learning features boost the F1 score by 0.2 compared to traditional statistical features. Overall, these efforts yield significant advances in model generalization, offering a more robust strategy for addressing intrusion detection challenges.
format Article
id doaj-art-07c0e8189c7b4cabac224e4af94c4acc
institution DOAJ
issn 2831-316X
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Transactions on Machine Learning in Communications and Networking
spelling doaj-art-07c0e8189c7b4cabac224e4af94c4acc
2025-08-20T02:57:19Z
eng
IEEE
IEEE Transactions on Machine Learning in Communications and Networking
2831-316X
2024-01-01
Volume 2, pages 645-662
DOI: 10.1109/TMLCN.2024.3402158
IEEE Xplore document: 10531223
Improving Generalization of ML-Based IDS With Lifecycle-Based Dataset, Auto-Learning Features, and Deep Learning
Didik Sudyana (https://orcid.org/0000-0001-5378-2622), Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Ying-Dar Lin (https://orcid.org/0000-0002-5226-4396), Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Miel Verkerken (https://orcid.org/0000-0002-1781-900X), Department of Information Technology, IDLab-imec, Ghent University, Ghent, Belgium
Ren-Hung Hwang (https://orcid.org/0000-0001-7996-4184), College of Artificial Intelligence, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Yuan-Cheng Lai (https://orcid.org/0000-0003-3695-5784), Department of Information Management, National Taiwan University of Science and Technology, Taipei, Taiwan
Laurens D'Hooge (https://orcid.org/0000-0001-5086-6361), Department of Information Technology, IDLab-imec, Ghent University, Ghent, Belgium
Tim Wauters (https://orcid.org/0000-0003-2618-3311), Department of Information Technology, IDLab-imec, Ghent University, Ghent, Belgium
Bruno Volckaert (https://orcid.org/0000-0003-0575-5894), Department of Information Technology, IDLab-imec, Ghent University, Ghent, Belgium
Filip De Turck (https://orcid.org/0000-0003-4824-1199), Department of Information Technology, IDLab-imec, Ghent University, Ghent, Belgium
title Improving Generalization of ML-Based IDS With Lifecycle-Based Dataset, Auto-Learning Features, and Deep Learning
topic Intrusion detection
ML-based IDS
model generalization
lifecycle-based dataset
auto-learning features
url https://ieeexplore.ieee.org/document/10531223/