Surprisal-based algorithm for detecting anomalies in categorical data

Anomaly detection is an important research area in a diverse range of real-world applications. Although many algorithms have been proposed to address anomaly detection for numerical datasets, categorical and mixed datasets remain a significant challenge, primarily because a natural distance metric i...

Bibliographic Details
Main Authors: Ossama Cherkaoui, Houda Anoun, Abderrahim Maizate
Format: Article
Language: English
Published: KeAi Communications Co. Ltd. 2025-06-01
Series: Data Science and Management
Subjects: Unsupervised learning; Anomaly detection; Categorical data; Surprisal anomaly score
Online Access: http://www.sciencedirect.com/science/article/pii/S2666764925000050
collection DOAJ
description Anomaly detection is an important research area in a diverse range of real-world applications. Although many algorithms have been proposed to address anomaly detection for numerical datasets, categorical and mixed datasets remain a significant challenge, primarily because a natural distance metric is lacking. Consequently, the methods proposed in the literature rest on entirely different assumptions about what constitutes a categorical anomaly. This paper presents a novel categorical anomaly detection approach that makes two key contributions over existing methods. First, a novel surprisal-based anomaly score is introduced, which provides a more accurate assessment of anomalies by considering the full distribution of categorical values. Second, the proposed method considers complex correlations in the data beyond the pairwise interactions of features. The resulting categorical surprisal anomaly detection algorithm (CSAD) is evaluated against six competitors. The experimental results indicate that CSAD produced the best overall performance, achieving the highest average ROC-AUC and PR-AUC values of 0.8 and 0.443, respectively. Furthermore, CSAD's execution time is satisfactory even when processing large, high-dimensional datasets.
id doaj-art-d512e14dc7cd4119a11098fa9dc485a7
institution Kabale University
issn 2666-7649
volume 8
issue 2
pages 185-195
doi 10.1016/j.dsm.2025.01.005
affiliation CED Engineering Sciences, Hassan II University of Casablanca, Casablanca, 20000, Morocco (all three authors)
corresponding_author Ossama Cherkaoui
timestamp 2025-08-20T03:28:37Z
topic Unsupervised learning
Anomaly detection
Categorical data
Surprisal anomaly score