BERT-GraphSAGE: hybrid approach to spam detection

Abstract: Even after the advent of various communication networks, email has retained its importance and its serious, professional character. As the number of Internet users grows, so does the volume of spam. Spam is any unsolicited and unwanted communication, which leads...

Bibliographic Details
Main Authors: F. Zouak, O. El Beqqali, J. Riffi
Format: Article
Language: English
Published: SpringerOpen 2025-05-01
Series: Journal of Big Data
Subjects: BERT; Cosine similarity; GraphSAGE; Node classification; Spam detection
Online Access: https://doi.org/10.1186/s40537-025-01176-9
author F. Zouak
O. El Beqqali
J. Riffi
collection DOAJ
description Abstract: Even after the advent of various communication networks, email has retained its importance and its serious, professional character. As the number of Internet users grows, so does the volume of spam. Spam is any unsolicited and unwanted communication; it wastes significant resources and overloads networks. Most spam comes from advertisers promoting their products, but some has more malicious intent, such as phishing emails that trick recipients into revealing confidential information like website credentials or credit card details. In this research, we aim to improve spam detection by combining BERT (Bidirectional Encoder Representations from Transformers) and GraphSAGE. The embedding vectors generated by BERT represent the nodes of a graph, and nodes are linked according to their cosine similarity. GraphSAGE then exploits this graph structure: rather than merely storing a fixed embedding for each node, it learns an inductive method for generating embeddings. This enables GraphSAGE to generalize to unseen emails by sampling and aggregating the features of neighboring emails to produce robust node representations. Our model was evaluated on three benchmark datasets: ENRON, SpamAssassin, and LingSpam. It achieved 98.87% accuracy, 99.81% precision, and 99.98% AUC on ENRON; 96.44% accuracy, 94.43% precision, and 98.86% AUC on SpamAssassin; and 99.20% accuracy, 96.98% precision, and 99.55% AUC on LingSpam, outperforming several state-of-the-art baselines. These results confirm the robustness of our approach in accurately distinguishing between spam and legitimate emails.
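The pipeline outlined in the abstract (embedding vectors as nodes, edges from cosine similarity, then neighbor sampling and aggregation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.8 similarity threshold, the sample size, and the toy 3-dimensional vectors standing in for BERT embeddings are all assumptions made for illustration, and the learned weight matrix and nonlinearity of a real GraphSAGE layer are omitted.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_graph(embeddings, threshold=0.8):
    """Link each pair of nodes whose cosine similarity reaches the threshold.
    The thresholding rule itself is an assumption, not taken from the abstract."""
    n = len(embeddings)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

def sage_mean_step(embeddings, neighbors, num_samples=2, seed=0):
    """One GraphSAGE-style step without learned weights: sample up to
    num_samples neighbors, average their vectors, and concatenate the
    result with the node's own vector."""
    rng = random.Random(seed)
    out = []
    for i, h in enumerate(embeddings):
        nbrs = neighbors[i]
        sampled = rng.sample(nbrs, min(num_samples, len(nbrs)))
        if sampled:
            agg = [sum(embeddings[j][d] for j in sampled) / len(sampled)
                   for d in range(len(h))]
        else:
            agg = [0.0] * len(h)  # isolated node: nothing to aggregate
        out.append(list(h) + agg)
    return out

# Toy 3-dimensional stand-ins for BERT embeddings (real ones are much larger).
emails = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
graph = build_graph(emails, threshold=0.8)
features = sage_mean_step(emails, graph)
```

Because the aggregation step depends only on a node's own vector and those of its sampled neighbors, the same function can be applied to an email that was not in the graph when it was built, which is the inductive property the abstract refers to.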
format Article
id doaj-art-5751dc205f4c460faf2d01c67ac7b418
institution OA Journals
issn 2196-1115
language English
publishDate 2025-05-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj-art-5751dc205f4c460faf2d01c67ac7b418; 2025-08-20T01:53:19Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2025-05-01; 12; 1; 1; 14; 10.1186/s40537-025-01176-9; BERT-GraphSAGE: hybrid approach to spam detection; F. Zouak (0), O. El Beqqali (1), J. Riffi (2); (0) LISAC Laboratory, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah; (1) LISAC Laboratory, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah; (2) LISAC Laboratory, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah; https://doi.org/10.1186/s40537-025-01176-9; BERT; Cosine similarity; GraphSAGE; Node classification; Spam detection
title BERT-GraphSAGE: hybrid approach to spam detection
topic BERT
Cosine similarity
GraphSAGE
Node classification
Spam detection
url https://doi.org/10.1186/s40537-025-01176-9