Surprisal-based algorithm for detecting anomalies in categorical data
Anomaly detection is an important research area in a diverse range of real-world applications. Although many algorithms have been proposed to address anomaly detection for numerical datasets, categorical and mixed datasets remain a significant challenge, primarily because a natural distance metric i...
| Main Authors: | Ossama Cherkaoui, Houda Anoun, Abderrahim Maizate |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | KeAi Communications Co. Ltd., 2025-06-01 |
| Series: | Data Science and Management |
| Subjects: | Unsupervised learning; Anomaly detection; Categorical data; Surprisal anomaly score |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2666764925000050 |
| _version_ | 1849428659479248896 |
|---|---|
| author | Ossama Cherkaoui; Houda Anoun; Abderrahim Maizate |
| collection | DOAJ |
| description | Anomaly detection is an important research area in a diverse range of real-world applications. Although many algorithms have been proposed to address anomaly detection for numerical datasets, categorical and mixed datasets remain a significant challenge, primarily because a natural distance metric is lacking. Consequently, the methods proposed in the literature implement entirely different assumptions regarding the definition of categorical anomalies. This paper presents a novel categorical anomaly detection approach, offering two key contributions to existing methods. First, a novel surprisal-based anomaly score is introduced, which provides a more accurate assessment of anomalies by considering the full distribution of categorical values. Second, the proposed method considers complex correlations in the data beyond the pairwise interactions of features. This study proposed and tested the novel categorical surprisal anomaly detection algorithm (CSAD) by comparing and evaluating it against six competitors. The experimental results indicate that CSAD produced the best overall performance, achieving the highest average ROC-AUC and PR-AUC values of 0.8 and 0.443, respectively. Furthermore, CSAD's execution time is satisfactory even when processing large, high-dimensional datasets. |
| format | Article |
| id | doaj-art-d512e14dc7cd4119a11098fa9dc485a7 |
| institution | Kabale University |
| issn | 2666-7649 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | KeAi Communications Co. Ltd. |
| record_format | Article |
| series | Data Science and Management |
| spelling | doaj-art-d512e14dc7cd4119a11098fa9dc485a7; 2025-08-20T03:28:37Z; eng; KeAi Communications Co. Ltd.; Data Science and Management; 2666-7649; 2025-06-01; 8; 2; 185; 195; 10.1016/j.dsm.2025.01.005; Surprisal-based algorithm for detecting anomalies in categorical data; Ossama Cherkaoui (0), Houda Anoun (1), Abderrahim Maizate (2); (0) Corresponding author, CED Engineering Sciences, Hassan II University of Casablanca, Casablanca, 20000, Morocco; (1) CED Engineering Sciences, Hassan II University of Casablanca, Casablanca, 20000, Morocco; (2) CED Engineering Sciences, Hassan II University of Casablanca, Casablanca, 20000, Morocco; http://www.sciencedirect.com/science/article/pii/S2666764925000050; Unsupervised learning; Anomaly detection; Categorical data; Surprisal anomaly score |
| title | Surprisal-based algorithm for detecting anomalies in categorical data |
| topic | Unsupervised learning; Anomaly detection; Categorical data; Surprisal anomaly score |
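
The description refers to a surprisal-based anomaly score without giving its formula. As a rough illustration of the general idea only, and not the CSAD algorithm itself, the sketch below scores each record by summing the surprisal, -log2 p(value), of its categorical values, where p is the empirical frequency of the value within its feature. It assumes the features are independent, which is exactly the simplification the abstract says CSAD goes beyond. The function name `surprisal_scores` and the toy data are hypothetical.

```python
import math
from collections import Counter


def surprisal_scores(rows):
    """Sum of per-feature surprisal, -log2 p(value), for each row.

    Assumption: features are treated as independent. This is a generic
    baseline for illustration, not the CSAD method from the article.
    """
    n = len(rows)
    n_features = len(rows[0])
    # Empirical value counts, computed separately for each feature.
    counts = [Counter(row[j] for row in rows) for j in range(n_features)]
    scores = []
    for row in rows:
        total = 0.0
        for j, value in enumerate(row):
            p = counts[j][value] / n  # empirical probability of this value
            total += -math.log2(p)    # rarer values contribute more surprisal
        scores.append(total)
    return scores


# Hypothetical toy dataset: three similar records and one rare one.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
]
scores = surprisal_scores(rows)
```

In this toy data the last record gets the highest score, because each of its values appears in only one of the four rows. A score of this kind ignores value combinations that are rare only jointly, which is why methods that model correlations between features can behave differently.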