An Adaptive Scalable Data Pipeline for Multiclass Attack Classification in Large-Scale IoT Networks

Bibliographic Details
Main Authors: Selvam Saravanan, Uma Maheswari Balasubramanian
Format: Article
Language: English
Published: Tsinghua University Press 2024-06-01
Series: Big Data Mining and Analytics
Subjects: internet of things (iot); apache spark; apache kafka; mongodb; streaming; concept drift
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2023.9020027
author Selvam Saravanan
Uma Maheswari Balasubramanian
collection DOAJ
description Current large-scale Internet of Things (IoT) networks typically generate high-velocity network traffic streams. Attackers use IoT devices to create botnets and launch attacks such as DDoS, spamming, cryptocurrency mining, and phishing. The service providers of large-scale IoT networks need to set up a data pipeline to collect the vast network traffic data from the IoT devices, store it, analyze it, and report the malicious IoT devices and the types of attacks. Further, the attacks originating from IoT devices are dynamic: attackers launch one kind of attack at one time and another kind at another time. The numbers of attack and benign instances also vary over time. This change in attack patterns is called concept drift. Hence, the attack detection system must learn continuously from the ever-changing real-time attack patterns in large-scale IoT network traffic. To meet this requirement, we propose a data pipeline built with Apache Kafka, Apache Spark Structured Streaming, and MongoDB that adapts to the ever-changing attack patterns in real time and classifies attacks in large-scale IoT networks. When concept drift is detected, the proposed system retrains the classifier with the instances that cause the drift and a representative subsample of instances from the previous training of the model. The proposed approach is evaluated on the IoT23 dataset, which consists of benign and several attack instances from various IoT devices. The proposed system improves attack classification accuracy from 97.8% to 99.46%. The training time of the distributed random forest algorithm is also studied by varying the number of cores in the Apache Spark environment.
format Article
id doaj-art-2a238a0ca8ac41738cf1f1c0977240e5
institution Kabale University
issn 2096-0654
language English
publishDate 2024-06-01
publisher Tsinghua University Press
series Big Data Mining and Analytics
spelling Record: doaj-art-2a238a0ca8ac41738cf1f1c0977240e5, updated 2025-02-02T22:18:05Z
Language: eng
Publisher: Tsinghua University Press
Journal: Big Data Mining and Analytics (ISSN 2096-0654)
Issue: 2024-06-01, vol. 7, no. 2, pp. 500-511
DOI: 10.26599/BDMA.2023.9020027
Title: An Adaptive Scalable Data Pipeline for Multiclass Attack Classification in Large-Scale IoT Networks
Authors: Selvam Saravanan, Uma Maheswari Balasubramanian
Affiliation (both authors): Department of Computer Science and Engineering, Amrita School of Computing, Amrita Vishwa Vidyapeetham, Bengaluru 560035, India
URL: https://www.sciopen.com/article/10.26599/BDMA.2023.9020027
Keywords: internet of things (iot); apache spark; apache kafka; mongodb; streaming; concept drift
title An Adaptive Scalable Data Pipeline for Multiclass Attack Classification in Large-Scale IoT Networks
topic internet of things (iot)
apache spark
apache kafka
mongodb
streaming
concept drift
url https://www.sciopen.com/article/10.26599/BDMA.2023.9020027