The Choice of Training Data and the Generalizability of Machine Learning Models for Network Intrusion Detection Systems
Network Intrusion Detection Systems (NIDS) driven by Machine Learning (ML) algorithms are usually trained using publicly available datasets consisting of labeled traffic samples, where labels refer to traffic classes, usually one benign and multiple harmful. This paper studies the generalizability o...
| Main Authors: | Marcin Iwanowski, Dominik Olszewski, Waldemar Graniszewski, Jacek Krupski, Franciszek Pelc |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-07-01 |
| Series: | Applied Sciences |
| Subjects: | intrusion detection system; internet traffic classification; machine learning |
| Online Access: | https://www.mdpi.com/2076-3417/15/15/8466 |
| _version_ | 1849407555538780160 |
|---|---|
| author | Marcin Iwanowski; Dominik Olszewski; Waldemar Graniszewski; Jacek Krupski; Franciszek Pelc |
| author_sort | Marcin Iwanowski |
| collection | DOAJ |
| description | Network Intrusion Detection Systems (NIDS) driven by Machine Learning (ML) algorithms are usually trained on publicly available datasets of labeled traffic samples, where the labels denote traffic classes, usually one benign and several harmful. This paper studies the generalizability of models trained on such datasets. The issue is crucial when such a model is applied to actual internet traffic, because high performance measures obtained on datasets do not necessarily imply similar performance on real traffic. We propose a procedure that combines cross-validation across several datasets sharing some common traffic classes with t-SNE visualization. We apply it to four well-known and widely used datasets: UNSW-NB15, CIC-CSE-IDS2018, BoT-IoT, and ToN-IoT. Our investigation reveals that the high accuracy a model achieves on its training dataset is reproducible on the other datasets only to a limited extent. Moreover, the generalizability of benign traffic classes differs from that of harmful ones. For deployment in an actual network environment, this implies that the training data must be selected carefully, by determining to what extent the classes present in the training dataset resemble those in the real target traffic environment. On the other hand, merging datasets may yield a more exhaustive data collection with a more diverse spectrum of training samples. |
| format | Article |
| id | doaj-art-09c9f77608d34c8fa595b9714fa60ea0 |
| institution | Kabale University |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| title | The Choice of Training Data and the Generalizability of Machine Learning Models for Network Intrusion Detection Systems |
| topic | intrusion detection system; internet traffic classification; machine learning |
| url | https://www.mdpi.com/2076-3417/15/15/8466 |
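
The cross-dataset evaluation mentioned in the description can be illustrated with a toy sketch. This is not the authors' code: the two synthetic "datasets" `A` and `B`, the two-dimensional features, the class centers, and the nearest-centroid classifier are all assumptions chosen to keep the example self-contained, and the t-SNE visualization step is omitted. The sketch trains on one dataset and tests on every dataset that shares the same class labels, which is the general idea of checking whether accuracy on the training source carries over to another source.

```python
import random
from statistics import mean


def make_dataset(centers, n_per_class, rng, noise=0.5):
    """Build a hypothetical dataset: points drawn uniformly around each class center."""
    X, y = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            X.append((cx + rng.uniform(-noise, noise),
                      cy + rng.uniform(-noise, noise)))
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Train a nearest-centroid model: one mean point per class."""
    model = {}
    for label in set(y):
        pts = [p for p, lab in zip(X, y) if lab == label]
        model[label] = (mean(p[0] for p in pts), mean(p[1] for p in pts))
    return model


def predict(model, point):
    """Return the label whose centroid is closest to the point."""
    return min(model, key=lambda lab: (point[0] - model[lab][0]) ** 2
                                      + (point[1] - model[lab][1]) ** 2)


def accuracy(model, X, y):
    """Fraction of samples classified correctly."""
    return sum(predict(model, p) == lab for p, lab in zip(X, y)) / len(y)


def recall_per_class(model, X, y):
    """Fraction of correctly classified samples, computed separately per class."""
    out = {}
    for label in set(y):
        pts = [p for p, lab in zip(X, y) if lab == label]
        out[label] = sum(predict(model, p) == label for p in pts) / len(pts)
    return out


rng = random.Random(0)
# Assumption: both sources share the labels "benign" and "attack", but the
# attack traffic in B looks different from (and closer to benign than) in A.
datasets = {
    "A": make_dataset({"benign": (0.0, 0.0), "attack": (4.0, 4.0)}, 50, rng),
    "B": make_dataset({"benign": (0.0, 0.0), "attack": (0.8, 0.8)}, 50, rng),
}

# Train on each dataset, test on every dataset (including itself).
results = {
    (train, test): accuracy(fit_centroids(*datasets[train]), *datasets[test])
    for train in datasets for test in datasets
}

for (train, test), acc in sorted(results.items()):
    print(f"train={train} test={test} accuracy={acc:.2f}")

model_a = fit_centroids(*datasets["A"])
print("A -> B recall per class:", recall_per_class(model_a, *datasets["B"]))
```

In this constructed example, the model trained on `A` classifies `A` perfectly but misses the shifted attack class in `B` while still recognizing benign traffic there, which mirrors in miniature the kind of gap the description reports. With real data, each source would first have to be mapped to a common feature set and a common label vocabulary before such a comparison is meaningful.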