ArabSis: Arabic Corpus Sentiment Analysis
Despite rapid progress in natural language processing (NLP), the development of specialized resources for niche domains, which is critical for applications such as affective computing and emotionally intelligent AI, remains a persistent challenge. While benchmark datasets abound...
| Main Authors: | Ziad Doughan, Sari Itani, Samir Itani |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
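| Subjects: | Arabic NLP, artificial intelligence, ensemble methods, machine learning, natural language processing, sentiment analysis |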
| Online Access: | https://ieeexplore.ieee.org/document/10990213/ |
| _version_ | 1850136620034949120 |
|---|---|
| author | Ziad Doughan, Sari Itani, Samir Itani |
| author_sort | Ziad Doughan |
| collection | DOAJ |
| description | Despite rapid progress in natural language processing (NLP), the development of specialized resources for niche domains, which is critical for applications such as affective computing and emotionally intelligent AI, remains a persistent challenge. While benchmark datasets abound for general tasks, languages like Arabic and fields like multi-dimensional sentiment analysis, which goes beyond binary positive/negative classification, suffer from resource scarcity, limiting progress in human-centric applications. To address this gap, we present ArabSis: a novel Arabic corpus for multi-dimensional sentiment analysis across five categorical emotions (Joy, Sadness, Fear, Liking, Hatred). Our work introduces a reproducible framework for creating specialized corpora in low-resource languages, enabling future research in regressive dimensional sentiment analysis and other specialized NLP applications. The ArabSis corpus, developed through systematic data augmentation and human labelling, facilitates advanced analysis using traditional NLP techniques (TF-IDF, Bag of Words) and modern deep learning approaches. It also targets the universal Arabic language, whereas previous research focuses on Arabic regardless of dialect, which makes small nuances and inconsistencies among dialects unnoticeable and unfixable. We evaluate machine learning (ML) and deep learning (DL) models in one-vs-all classification tasks, demonstrating that ML models (e.g., SVMs, Random Forests) outperform their DL counterparts on smaller datasets. An ensemble method combining top-performing models achieves 98.6% accuracy through score averaging and majority voting, while also revealing inherent biases in ensemble voting mechanisms. The study provides a comprehensive pipeline encompassing data preprocessing, exploratory analysis, and model training, validated through 5-fold cross-validation, establishing a blueprint for developing specialized NLP resources, particularly for under-resourced languages. |
| format | Article |
| id | doaj-art-27c46c49241e49ee95a56ef769d82042 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-27c46c49241e49ee95a56ef769d82042; 2025-08-20T02:31:04Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 81083-81095; DOI 10.1109/ACCESS.2025.3567755; document 10990213; ArabSis: Arabic Corpus Sentiment Analysis; Ziad Doughan (https://orcid.org/0000-0002-7566-7710), Department of Electrical and Computer Engineering, Faculty of Engineering, Beirut Arab University, Beirut, Lebanon; Sari Itani (https://orcid.org/0009-0007-7886-510X), Department of Electrical and Computer Engineering, Faculty of Engineering, Beirut Arab University, Beirut, Lebanon; Samir Itani (https://orcid.org/0000-0002-2053-5817), Department of Arabic Language and Literature, Faculty of Human Sciences, Beirut Arab University, Beirut, Lebanon; https://ieeexplore.ieee.org/document/10990213/; Arabic NLP, artificial intelligence, ensemble methods, machine learning, natural language processing, sentiment analysis |
| title | ArabSis: Arabic Corpus Sentiment Analysis |
| title_sort | arabsis arabic corpus sentiment analysis |
| topic | Arabic NLP, artificial intelligence, ensemble methods, machine learning, natural language processing, sentiment analysis |
| url | https://ieeexplore.ieee.org/document/10990213/ |
| work_keys_str_mv | AT ziaddoughan arabsisarabiccorpussentimentanalysis AT sariitani arabsisarabiccorpussentimentanalysis AT samiritani arabsisarabiccorpussentimentanalysis |
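
The description mentions two ways of combining models in an ensemble, score averaging and majority voting, and notes that voting mechanisms can carry biases. The following is a minimal sketch, not the authors' implementation, of how the two rules can disagree on a single one-vs-all decision. The model names, the scores, and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-model scores that one sentence expresses "Joy"
# (one-vs-all: Joy vs. everything else). Values are made up for illustration.
scores = {"svm": 0.95, "random_forest": 0.45, "logistic_regression": 0.40}

THRESHOLD = 0.5  # assumed decision threshold, not taken from the paper


def score_averaging(model_scores, threshold=THRESHOLD):
    """Average the model scores first, then threshold the mean."""
    return mean(model_scores.values()) >= threshold


def majority_voting(model_scores, threshold=THRESHOLD):
    """Threshold each model's score first, then take the most common vote.

    With an odd number of models and a binary decision there are no ties.
    """
    votes = [s >= threshold for s in model_scores.values()]
    return Counter(votes).most_common(1)[0][0]


avg_decision = score_averaging(scores)
vote_decision = majority_voting(scores)
print("score averaging:", avg_decision)
print("majority voting:", vote_decision)
```

In this toy case one very confident model pulls the average above the threshold, while the two less confident models outvote it. Averaging is sensitive to how confident each model is; voting discards that information. This is only one way such a disagreement can arise and is not necessarily the bias the article reports.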