Assessing BERT-based models for Arabic and low-resource languages in crime text classification

Bidirectional Encoder Representations from Transformers (BERT) has recently attracted considerable attention from researchers and practitioners, demonstrating notable effectiveness in various natural language processing (NLP) tasks, including text classification. This efficacy can be attributed to its architectural features, particularly its ability to process text using both left and right context, and to its pre-training on extensive datasets. In the criminal domain, classifying textual data is a crucial activity, and Transformers are increasingly recognized for their potential to support law enforcement efforts. BERT has been released in English and Chinese versions, as well as in a multilingual version that covers over 100 languages. However, there is a pressing need to analyze the availability and performance of BERT in Arabic and other low-resource languages. This study primarily focuses on analyzing BERT-based models tailored for the Arabic language; due to the limited number of existing studies in this area, the research extends to other low-resource languages. The study evaluates these models' performance against machine learning (ML), deep learning (DL), and other Transformer models. Furthermore, it assesses the availability of relevant data and examines the effectiveness of BERT-based models in low-resource linguistic contexts. The study concludes with recommendations for future research directions, supported by empirical statistical evidence.
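
For orientation, the sketch below (not taken from the paper) illustrates the kind of BERT-based pipeline the abstract describes: loading a pre-trained multilingual BERT encoder with a classification head for crime text classification, using the Hugging Face transformers library. The model name, label set, and example sentence are illustrative assumptions; in practice the classification head would first be fine-tuned on labeled crime reports, and the paper surveys Arabic-specific BERT variants as alternatives to the multilingual model used here.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Multilingual BERT covers 100+ languages, including Arabic.
MODEL_NAME = "bert-base-multilingual-cased"
# Hypothetical crime categories, for illustration only.
LABELS = ["theft", "fraud", "assault"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The classification head added here is randomly initialized; it must be
# fine-tuned on labeled crime texts before predictions are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

# Classify a single (hypothetical) Arabic crime report:
# "The suspect stole a car from in front of the house."
text = "سرق المتهم سيارة من أمام المنزل"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[logits.argmax(dim=-1).item()])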

Bibliographic Details
Main Authors: Njood K. Al-harbi, Manal Alghieth
Format: Article
Language: English
Published: PeerJ Inc., 2025-07-01
Series: PeerJ Computer Science
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.3017
Subjects: Artificial intelligence; Deep learning; Transformer; BERT; Text classification; Crime classification
Online Access: https://peerj.com/articles/cs-3017.pdf
Collection: DOAJ
Institution: Kabale University