Mixed-Embeddings and Deep Learning Ensemble for DGA Classification With Limited Training Data

Recent papers on the detection of Domain Generation Algorithms (DGAs) report performance gains when unsupervised neural vector representations of domain names are introduced into the supervised classification process. In this paper we explore th...

Bibliographic Details
Main Authors: Christian Morbidoni, Alessandro Cucchiarelli, Luca Spalazzi
Format: Article
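Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Domain generation algorithms (DGA); botnet; deep learning; LSTM; n-grams; pre-trained embeddings
Online Access: https://ieeexplore.ieee.org/document/10979335/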
collection DOAJ
description Recent papers on the detection of Domain Generation Algorithms (DGAs) report performance gains when unsupervised neural vector representations of domain names are introduced into the supervised classification process. In this paper we explore the effectiveness of this approach by proposing a novel mixed pre-trained neural embedding model that integrates different vectorized representations of domain names: n-gram streams and words. We used the embeddings with two different classifiers, both based on ensemble architectures: a stacking model and an end-to-end multi-input neural architecture. We trained and tested the classifiers on two datasets that differ both in the proportion of legitimate and DGA-generated domain names and in the number and type of DGAs. The results show that our solution offers considerable advantages over state-of-the-art single classifiers, both in classification accuracy and in the detection of challenging DGAs, such as those based on word dictionaries. The improvement is significant in a particularly relevant operating condition, known as few-shot learning, where only a few examples of DGA-generated domain names are available for training the classifier.
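The description mentions "n-grams streams" as one of the vectorized views of a domain name. The sketch below is only an illustration of what such streams might look like before any embedding is applied; it is not the authors' implementation. The function names, the choice of n values, and the decision to tokenize only the leftmost label are all assumptions made for the example.

```python
def char_ngrams(label: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a string, left to right."""
    # A string shorter than n yields an empty list.
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def ngram_streams(domain: str, sizes=(1, 2, 3)) -> dict[int, list[str]]:
    """Build one n-gram token stream per value of n for a domain name."""
    # Assumption: only the leftmost label is tokenized; the rest is dropped.
    label = domain.lower().split(".")[0]
    return {n: char_ngrams(label, n) for n in sizes}


# Hypothetical usage: two parallel streams for one domain name.
streams = ngram_streams("example.com", sizes=(2, 3))
print(streams[2])
print(streams[3])
```

Each stream could then be mapped to its own pre-trained embedding table and fed to a separate model input, which is one plausible reading of the multi-input architecture the description refers to.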
id doaj-art-0ad126d490cf4a6489df808400eeb1b2
institution Kabale University
issn 2169-3536
spelling doaj-art-0ad126d490cf4a6489df808400eeb1b2 | 2025-08-20T03:49:22Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | vol. 13 | pp. 81167-81187 | doi 10.1109/ACCESS.2025.3565022 | 10979335
Christian Morbidoni (https://orcid.org/0000-0003-0244-9322), Department of Business Administration, Università degli Studi G. d’Annunzio Chieti-Pescara, Pescara, Italy
Alessandro Cucchiarelli (https://orcid.org/0000-0003-0173-9862), Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
Luca Spalazzi (https://orcid.org/0000-0002-4807-6632), Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
topic Domain generation algorithms (DGA)
botnet
deep learning
LSTM
n-grams
pre-trained embeddings