Transformer-based tokenization for IoT traffic classification across diverse network environments

The rapid expansion of the Internet of Things (IoT) has significantly increased the volume and diversity of network traffic, making accurate IoT traffic classification crucial for maintaining network security and efficiency. However, existing traffic classification methods, including traditional mac...


Bibliographic Details
Main Authors: Firdaus Afifi, Faiz Zaki, Hazim Hanif, Nik Aqil, Nor Badrul Anuar
Format: Article
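Language: English
Published: PeerJ Inc. 2025-08-01
Series: PeerJ Computer Science
Subjects: Transformer; IoT; Network traffic classification; Network traffic analysis; Model fine-tuning; Pretraining
Online Access: https://peerj.com/articles/cs-3126.pdf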
author Firdaus Afifi
Faiz Zaki
Hazim Hanif
Nik Aqil
Nor Badrul Anuar
collection DOAJ
description The rapid expansion of the Internet of Things (IoT) has significantly increased the volume and diversity of network traffic, making accurate IoT traffic classification crucial for maintaining network security and efficiency. However, existing traffic classification methods, including traditional machine learning and deep learning approaches, often exhibit critical limitations, such as insufficient generalization across diverse IoT environments, dependency on extensive labelled datasets, and susceptibility to overfitting in dynamic scenarios. While recent transformer-based models show promise in capturing contextual information, they typically rely on standard tokenization, which is ill-suited for the irregular nature of IoT traffic and often remains confined to single-purpose tasks. To address these challenges, this study introduces MIND-IoT, a novel and scalable framework for classifying generalized IoT traffic. MIND-IoT employs a hybrid architecture that combines Transformer-based models for capturing long-range dependencies and convolutional neural networks (CNNs) for efficient local feature extraction. A key innovation is IoT-Tokenize, a custom tokenization pipeline designed to preserve the structural semantics of network flows by converting statistical traffic features into semantically meaningful feature-value pairs. The framework operates in two phases: a pre-training phase utilizing masked language modeling (MLM) on large-scale IoT data (UNSW IoT Traces and MonIoTr) to learn robust representations and a fine-tuning phase that adapts the model to specific classification tasks, including binary IoT vs. non-IoT classification, IoT category classification, and device identification. Comprehensive evaluation across multiple diverse datasets (IoT Sentinel, YourThings, and IoT-FCSIT, in addition to the pre-training datasets) demonstrates MIND-IoT’s superior performance, robustness, and adaptability compared to traditional methods. 
The model achieves an accuracy of up to 98.14% and a 97.85% F1-score, demonstrating its ability to classify new datasets and adapt to emerging tasks with minimal fine-tuning and remarkable efficiency. This research positions MIND-IoT as a highly effective and scalable solution for real-world IoT traffic classification challenges.
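The abstract describes two ideas that a short sketch can make concrete: turning statistical flow features into feature-value pair tokens, and masking some tokens for masked language modeling. The Python below is only an illustrative sketch under stated assumptions, not the authors' IoT-Tokenize pipeline. The function names (`tokenize_flow`, `mask_tokens`), the feature names (`pkt_count`, `mean_iat`), the `name=value` token format, the rounding step, and the 15% masking rate are all hypothetical choices made for the example.

```python
import random


def tokenize_flow(features, decimals=2):
    """Turn a mapping of flow statistics into 'name=value' tokens.

    Hypothetical format: one token per feature, sorted by feature name,
    with float values rounded so similar flows share tokens.
    """
    tokens = []
    for name in sorted(features):
        value = features[name]
        if isinstance(value, float):
            value = round(value, decimals)
        tokens.append(f"{name}={value}")
    return tokens


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Replace each token with mask_token with probability mask_prob.

    Returns (masked, labels): labels holds the original token at masked
    positions and None elsewhere, as an MLM objective would need.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    # Made-up flow statistics, for illustration only.
    flow = {"pkt_count": 12, "mean_iat": 0.03456}
    tokens = tokenize_flow(flow)
    print(tokens)
    print(mask_tokens(tokens, seed=0))
```

In a real pipeline the tokens would be mapped to vocabulary IDs and fed to the Transformer encoder; that step is omitted here because the record does not describe it.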
format Article
id doaj-art-5f1eaf9475fc472a97fabf44b8e5f9b9
institution Kabale University
issn 2376-5992
language English
publishDate 2025-08-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj-art-5f1eaf9475fc472a97fabf44b8e5f9b9
indexed 2025-08-20T03:42:11Z
language eng
publisher PeerJ Inc.
series PeerJ Computer Science
issn 2376-5992
published 2025-08-01
volume 11
article e3126
doi 10.7717/peerj-cs.3126
title Transformer-based tokenization for IoT traffic classification across diverse network environments
affiliation Firdaus Afifi: Faculty of Computer Science and Mathematics, Universiti Malaysia Terengganu, Kuala Nerus, Terengganu, Malaysia
Faiz Zaki: Centre of Research for Cyber Security and Network (CSNET), Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia
Hazim Hanif: Centre of Research for Cyber Security and Network (CSNET), Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia
Nik Aqil: Centre of Research for Cyber Security and Network (CSNET), Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia
Nor Badrul Anuar: Centre of Research for Cyber Security and Network (CSNET), Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia
title Transformer-based tokenization for IoT traffic classification across diverse network environments
topic Transformer
IoT
Network traffic classification
Network traffic analysis
Model fine-tuning
Pretraining
url https://peerj.com/articles/cs-3126.pdf