Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set
This research investigates the impacts of pre-processing techniques on the effectiveness of topic modeling algorithms for Arabic texts, focusing on a comparison between BERTopic, Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorization (NMF). Using the Single-label Arabic News Articl...
| Main Authors: | Haya Alangari, Nahlah Algethami |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-12-01 |
| Series: | Applied Sciences |
| Subjects: | BERTopic; topic modeling; pre-processing techniques; LDA; NMF; NPMI |
| Online Access: | https://www.mdpi.com/2076-3417/14/23/11350 |
| _version_ | 1850107190679961600 |
|---|---|
| author | Haya Alangari Nahlah Algethami |
| author_facet | Haya Alangari Nahlah Algethami |
| author_sort | Haya Alangari |
| collection | DOAJ |
| description | This research investigates the impacts of pre-processing techniques on the effectiveness of topic modeling algorithms for Arabic texts, focusing on a comparison between BERTopic, Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorization (NMF). Using the Single-label Arabic News Article Data set (SANAD), which includes 195,174 Arabic news articles, this study explores pre-processing methods such as cleaning, stemming, normalization, and stop word removal, which are crucial processes given the complex morphology of Arabic. Additionally, the influence of six different embedding models on the topic modeling performance was assessed. The originality of this work lies in addressing the lack of previous studies that optimize BERTopic through adjusting the n-gram range parameter and combining it with different embedding models for effective Arabic topic modeling. Pre-processing techniques were fine-tuned to improve data quality before applying BERTopic, LDA, and NMF, and the performance was assessed using metrics such as topic coherence and diversity. Coherence was measured using Normalized Pointwise Mutual Information (NPMI). The results show that the Tashaphyne stemmer significantly enhanced the performance of LDA and NMF. BERTopic, optimized with pre-processing and bi-grams, outperformed LDA and NMF in both coherence and diversity. The CAMeL-Lab/bert-base-arabic-camelbert-da embedding yielded the best results, emphasizing the importance of pre-processing in Arabic topic modeling. |
| format | Article |
| id | doaj-art-a9a1c2007c9847a4b689ce4a0bd3e212 |
| institution | OA Journals |
| issn | 2076-3417 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| spelling | doaj-art-a9a1c2007c9847a4b689ce4a0bd3e212; 2025-08-20T02:38:38Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2024-12-01; vol. 14, no. 23, art. 11350; 10.3390/app142311350; Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set; Haya Alangari (0); Nahlah Algethami (1); (0) Computer Science Department, College of Computing and Informatics, Saudi Electronic University, Riyadh 1167, Saudi Arabia; (1) Computer Science Department, College of Computing and Informatics, Saudi Electronic University, Riyadh 1167, Saudi Arabia; https://www.mdpi.com/2076-3417/14/23/11350; BERTopic; topic modeling; pre-processing techniques; LDA; NMF; NPMI |
| spellingShingle | Haya Alangari Nahlah Algethami Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set Applied Sciences BERTopic topic modeling pre-processing techniques LDA NMF NPMI |
| title | Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set |
| title_full | Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set |
| title_fullStr | Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set |
| title_full_unstemmed | Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set |
| title_short | Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set |
| title_sort | exploring the effects of pre processing techniques on topic modeling of an arabic news article data set |
| topic | BERTopic topic modeling pre-processing techniques LDA NMF NPMI |
| url | https://www.mdpi.com/2076-3417/14/23/11350 |
| work_keys_str_mv | AT hayaalangari exploringtheeffectsofpreprocessingtechniquesontopicmodelingofanarabicnewsarticledataset AT nahlahalgethami exploringtheeffectsofpreprocessingtechniquesontopicmodelingofanarabicnewsarticledataset |
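
The description above names two evaluation metrics, NPMI-based topic coherence and topic diversity, without showing how they are computed. The following is a minimal sketch of one common formulation, not the authors' implementation: it assumes document-level co-occurrence counts (sliding-window variants also exist), and the function names `npmi`, `topic_coherence`, and `topic_diversity` are hypothetical. For a word pair, NPMI is log(p(w1,w2) / (p(w1) p(w2))) divided by -log p(w1,w2), which ranges from -1 to 1.

```python
import math
from itertools import combinations


def npmi(w1, w2, docs):
    """NPMI of two words, using document-level co-occurrence (an assumption).

    docs is a list of sets of tokens. The result lies in [-1, 1].
    """
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # convention: the words never co-occur
    if p12 == 1:
        return 1.0  # convention: both words in every document (avoids 0/0)
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_coherence(top_words, docs):
    """Mean NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)


if __name__ == "__main__":
    # Toy corpus of already pre-processed documents (illustrative only).
    corpus = [{"a", "b"}, {"a", "b"}, {"c"}, {"d"}]
    print(topic_coherence(["a", "b"], corpus))
    print(topic_diversity([["a", "b"], ["b", "c"]]))
```

Because both metrics depend on the token vocabulary, pre-processing choices such as stemming or stop word removal change the counts that feed into them, which is why the study evaluates them after each pre-processing configuration.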