AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text
Topic modeling is an unsupervised learning technique that is extensively used for discovering latent topics in large text corpora. However, existing models often fall short in cross-lingual scenarios, particularly for morphologically rich and low-resource languages such as Arabic. Cross-lingual topi...
| Main Authors: | Sania Aftar, Abdul Rehman, Sonia Bergamaschi, Luca Gagliardelli |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Low resource languages; MSA and classical Arabic; multilingual embeddings; zero-shot cross-lingual topic modeling |
| Online Access: | https://ieeexplore.ieee.org/document/11058925/ |
| _version_ | 1849320349101981696 |
|---|---|
| author | Sania Aftar; Abdul Rehman; Sonia Bergamaschi; Luca Gagliardelli |
| author_facet | Sania Aftar; Abdul Rehman; Sonia Bergamaschi; Luca Gagliardelli |
| author_sort | Sania Aftar |
| collection | DOAJ |
| description | Topic modeling is an unsupervised learning technique that is extensively used for discovering latent topics in large text corpora. However, existing models often fall short in cross-lingual scenarios, particularly for morphologically rich and low-resource languages such as Arabic. Cross-lingual topic analysis extracts shared topics across languages but often relies on resource-intensive datasets or limited translation dictionaries, restricting its diversity and effectiveness. Transfer learning provides a promising solution to these challenges. This paper presents AZIM, an Arabic-centric extension of ZeroShotTM, adapted to use Arabic as the training language for zero-shot multilingual topic modeling. The model’s performance is evaluated across diverse Latin-script and non-Latin-script languages, focusing on its adaptability to Modern Standard Arabic (MSA) and Classical Arabic (CA). Additionally, the study explores the impact of summarized versus general text. The results show that the summarized versions of the datasets consistently outperform their baselines in terms of interpretability and coherence. Furthermore, the model demonstrates robust cross-lingual generalization, with non-Latin scripts such as Persian and Urdu outperforming certain Latin-based languages. However, variations in performance across languages reflect the complex nature of multilingual embeddings. The performance gap between Modern Standard Arabic and Classical Arabic reveals a limitation of the pre-trained embeddings, namely their bias towards modern corpora. These findings underscore the importance of adapting techniques for morphologically rich and low-resource languages to enhance cross-lingual topic modeling. |
| format | Article |
| id | doaj-art-7a570fcd2fc74cb3b09214e0c895f670 |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-7a570fcd2fc74cb3b09214e0c895f6702025-08-20T03:50:07ZengIEEEIEEE Access2169-35362025-01-011311437011438310.1109/ACCESS.2025.358430911058925AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized TextSania Aftar0https://orcid.org/0000-0001-8151-8941Abdul Rehman1Sonia Bergamaschi2https://orcid.org/0000-0001-8087-6587Luca Gagliardelli3Department of Engineering “Enzo Ferrari,”, University of Modena and Reggio Emilia, Modena, ItalyDepartment of Engineering “Enzo Ferrari,”, University of Modena and Reggio Emilia, Modena, ItalyDepartment of Engineering “Enzo Ferrari,”, University of Modena and Reggio Emilia, Modena, ItalyDepartment of Engineering “Enzo Ferrari,”, University of Modena and Reggio Emilia, Modena, ItalyTopic modeling is an unsupervised learning technique that is extensively used for discovering latent topics in large text corpora. However, existing models often fall short in cross-lingual scenarios, particularly for morphologically rich and low-resource languages such as Arabic. Cross-lingual topic analysis extracts shared topics across languages but often relies on resource-intensive datasets or limited translation dictionaries, restricting its diversity and effectiveness. Transfer learning provides a promising solution to these challenges. This paper presents AZIM, an Arabic-centric extension of ZeroShotTM, adapted to use Arabic as the training language for zero-shot multilingual topic modeling. The model’s performance is evaluated across diverse Latin-script and non-Latin-script languages, focusing on its adaptability to Modern Standard Arabic (MSA) and Classical Arabic (CA). Additionally, the study explores the impact of summarized versus general text. The results show that the summarized versions of the datasets consistently outperform their baselines in terms of interpretability and coherence.
Furthermore, the model demonstrates robust cross-lingual generalization, with non-Latin scripts such as Persian and Urdu outperforming certain Latin-based languages. However, variations in performance across languages reflect the complex nature of multilingual embeddings. The performance gap between Modern Standard Arabic and Classical Arabic reveals a limitation of the pre-trained embeddings, namely their bias towards modern corpora. These findings underscore the importance of adapting techniques for morphologically rich and low-resource languages to enhance cross-lingual topic modeling.https://ieeexplore.ieee.org/document/11058925/Low resource languagesMSA and classical Arabicmultilingual embeddingszero-shot cross-lingual topic modeling |
| spellingShingle | Sania Aftar Abdul Rehman Sonia Bergamaschi Luca Gagliardelli AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text IEEE Access Low resource languages MSA and classical Arabic multilingual embeddings zero-shot cross-lingual topic modeling |
| title | AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text |
| title_full | AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text |
| title_fullStr | AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text |
| title_full_unstemmed | AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text |
| title_short | AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text |
| title_sort | azim arabic centric zero shot inference for multilingual topic modeling with enhanced performance on summarized text |
| topic | Low resource languages MSA and classical Arabic multilingual embeddings zero-shot cross-lingual topic modeling |
| url | https://ieeexplore.ieee.org/document/11058925/ |
| work_keys_str_mv | AT saniaaftar azimarabiccentriczeroshotinferenceformultilingualtopicmodelingwithenhancedperformanceonsummarizedtext AT abdulrehman azimarabiccentriczeroshotinferenceformultilingualtopicmodelingwithenhancedperformanceonsummarizedtext AT soniabergamaschi azimarabiccentriczeroshotinferenceformultilingualtopicmodelingwithenhancedperformanceonsummarizedtext AT lucagagliardelli azimarabiccentriczeroshotinferenceformultilingualtopicmodelingwithenhancedperformanceonsummarizedtext |
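The abstract describes training a topic model on Arabic alone and applying it zero-shot to other languages via a shared multilingual embedding space. The toy sketch below illustrates only that core idea, not the AZIM model itself: all names, dimensions, and the similarity-softmax "topic head" are hypothetical simplifications standing in for a neural topic model (ZeroShotTM-style) and a real multilingual sentence encoder.

```python
# Toy sketch (assumed, simplified) of zero-shot cross-lingual topic inference:
# a topic head fit only on Arabic document embeddings is applied unchanged to
# documents in other languages, relying on the multilingual encoder mapping
# all languages into one shared vector space.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOPICS = 8, 3

# Stand-in for topic representations learned from Arabic training documents
# (in practice these would come from a trained neural topic model over
# multilingual sentence embeddings).
topic_dirs = rng.normal(size=(N_TOPICS, DIM))
topic_dirs /= np.linalg.norm(topic_dirs, axis=1, keepdims=True)

def embed(topic_id: int, noise: float = 0.1) -> np.ndarray:
    """Simulate embedding a document about `topic_id` in ANY language:
    a shared topic direction plus language/document noise."""
    return topic_dirs[topic_id] + noise * rng.normal(size=DIM)

def topic_distribution(doc_emb: np.ndarray) -> np.ndarray:
    """Softmax over similarities to the topic directions fit on Arabic."""
    scores = topic_dirs @ doc_emb
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# Zero-shot inference: an "Urdu" document about topic 2 never seen in
# training still receives a valid topic distribution from the Arabic head.
urdu_doc = embed(topic_id=2)
dist = topic_distribution(urdu_doc)
```

The design point mirrored here is that nothing in `topic_distribution` depends on the input language; the cross-lingual transfer quality therefore hinges entirely on how well the encoder aligns languages, which is consistent with the abstract's observation that embedding bias (e.g. towards modern corpora) directly limits performance on Classical Arabic.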