Transformer-based tokenization for IoT traffic classification across diverse network environments
| Main Authors: | Firdaus Afifi, Faiz Zaki, Hazim Hanif, Nik Aqil, Nor Badrul Anuar |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc., 2025-08-01 |
| Series: | PeerJ Computer Science |
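| Subjects: | Transformer; IoT; Network traffic classification; Network traffic analysis; Model fine-tuning; Pretraining |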
| Online Access: | https://peerj.com/articles/cs-3126.pdf |
| _version_ | 1849388716709117952 |
|---|---|
| author | Firdaus Afifi; Faiz Zaki; Hazim Hanif; Nik Aqil; Nor Badrul Anuar |
| author_facet | Firdaus Afifi; Faiz Zaki; Hazim Hanif; Nik Aqil; Nor Badrul Anuar |
| author_sort | Firdaus Afifi |
| collection | DOAJ |
| description | The rapid expansion of the Internet of Things (IoT) has significantly increased the volume and diversity of network traffic, making accurate IoT traffic classification crucial for maintaining network security and efficiency. However, existing traffic classification methods, including traditional machine learning and deep learning approaches, often exhibit critical limitations, such as insufficient generalization across diverse IoT environments, dependency on extensive labelled datasets, and susceptibility to overfitting in dynamic scenarios. While recent transformer-based models show promise in capturing contextual information, they typically rely on standard tokenization, which is ill-suited for the irregular nature of IoT traffic and often remains confined to single-purpose tasks. To address these challenges, this study introduces MIND-IoT, a novel and scalable framework for classifying generalized IoT traffic. MIND-IoT employs a hybrid architecture that combines Transformer-based models for capturing long-range dependencies and convolutional neural networks (CNNs) for efficient local feature extraction. A key innovation is IoT-Tokenize, a custom tokenization pipeline designed to preserve the structural semantics of network flows by converting statistical traffic features into semantically meaningful feature-value pairs. The framework operates in two phases: a pre-training phase utilizing masked language modeling (MLM) on large-scale IoT data (UNSW IoT Traces and MonIoTr) to learn robust representations and a fine-tuning phase that adapts the model to specific classification tasks, including binary IoT vs. non-IoT classification, IoT category classification, and device identification. Comprehensive evaluation across multiple diverse datasets (IoT Sentinel, YourThings, and IoT-FCSIT, in addition to the pre-training datasets) demonstrates MIND-IoT’s superior performance, robustness, and adaptability compared to traditional methods. The model achieves an accuracy of up to 98.14% and a 97.85% F1-score, demonstrating its ability to classify new datasets and adapt to emerging tasks with minimal fine-tuning and remarkable efficiency. This research positions MIND-IoT as a highly effective and scalable solution for real-world IoT traffic classification challenges. |
| format | Article |
| id | doaj-art-5f1eaf9475fc472a97fabf44b8e5f9b9 |
| institution | Kabale University |
| issn | 2376-5992 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | PeerJ Inc. |
| record_format | Article |
| series | PeerJ Computer Science |
| spelling | doaj-art-5f1eaf9475fc472a97fabf44b8e5f9b9; 2025-08-20T03:42:11Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2025-08-01; 11; e3126; 10.7717/peerj-cs.3126; Transformer-based tokenization for IoT traffic classification across diverse network environments; Firdaus Afifi (0); Faiz Zaki (1); Hazim Hanif (2); Nik Aqil (3); Nor Badrul Anuar (4); (0) Faculty of Computer Science and Mathematics, Universiti Malaysia Terengganu, Kuala Nerus, Terengganu, Malaysia; (1-4) Centre of Research for Cyber Security and Network (CSNET), Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia; https://peerj.com/articles/cs-3126.pdf; Transformer; IoT; Network traffic classification; Network traffic analysis; Model fine-tuning; Pretraining |
| spellingShingle | Firdaus Afifi; Faiz Zaki; Hazim Hanif; Nik Aqil; Nor Badrul Anuar; Transformer-based tokenization for IoT traffic classification across diverse network environments; PeerJ Computer Science; Transformer; IoT; Network traffic classification; Network traffic analysis; Model fine-tuning; Pretraining |
| title | Transformer-based tokenization for IoT traffic classification across diverse network environments |
| topic | Transformer; IoT; Network traffic classification; Network traffic analysis; Model fine-tuning; Pretraining |
| url | https://peerj.com/articles/cs-3126.pdf |
| work_keys_str_mv | AT firdausafifi transformerbasedtokenizationforiottrafficclassificationacrossdiversenetworkenvironments AT faizzaki transformerbasedtokenizationforiottrafficclassificationacrossdiversenetworkenvironments AT hazimhanif transformerbasedtokenizationforiottrafficclassificationacrossdiversenetworkenvironments AT nikaqil transformerbasedtokenizationforiottrafficclassificationacrossdiversenetworkenvironments AT norbadrulanuar transformerbasedtokenizationforiottrafficclassificationacrossdiversenetworkenvironments |
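
The description above mentions converting statistical flow features into feature-value pairs and pre-training with masked language modeling. The sketch below is only an illustration of that general idea, not the MIND-IoT or IoT-Tokenize implementation: the `name=value` token format, the feature names, the `[MASK]` symbol, and the helper functions `tokenize_flow` and `mask_tokens` are all assumptions made for this example.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; the record does not specify one


def tokenize_flow(features):
    """Turn a dict of flow statistics into 'name=value' tokens.

    The token format is hypothetical; it only illustrates the idea of
    feature-value pairs described in the abstract.
    """
    tokens = []
    for name in sorted(features):  # fixed order so output is deterministic
        value = features[name]
        if isinstance(value, float):
            value = round(value, 2)  # coarse rounding to limit vocabulary size
        tokens.append(f"{name}={value}")
    return tokens


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK_TOKEN.

    Returns the masked sequence and a label list that holds the original
    token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical flow record; the feature names are invented for this sketch.
flow = {"pkt_count": 12, "mean_pkt_len": 84.3333, "duration_s": 1.5, "protocol": "udp"}
tokens = tokenize_flow(flow)
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(tokens)
print(masked)
```

In a real pipeline the masked sequence would be fed to a model trained to predict the tokens stored in `labels`; the masking probability used here is an arbitrary choice for the demonstration.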