ArabSis: Arabic Corpus Sentiment Analysis

Despite rapid progress in natural language processing (NLP), the development of specialized resources for niche domains, which are critical for applications such as affective computing and emotionally intelligent AI, remains a persistent challenge. While benchmark datasets abound...

Bibliographic Details
Main Authors: Ziad Doughan, Sari Itani, Samir Itani
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
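Subjects: Arabic NLP; artificial intelligence; ensemble methods; machine learning; natural language processing; sentiment analysis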
Online Access: https://ieeexplore.ieee.org/document/10990213/
author Ziad Doughan
Sari Itani
Samir Itani
collection DOAJ
description Despite rapid progress in natural language processing (NLP), the development of specialized resources for niche domains, which are critical for applications such as affective computing and emotionally intelligent AI, remains a persistent challenge. While benchmark datasets abound for general tasks, languages like Arabic and fields like multi-dimensional sentiment analysis, which goes beyond binary positive/negative classification, suffer from resource scarcity, limiting progress in human-centric applications. To address this gap, we present ArabSis: a novel Arabic corpus for multi-dimensional sentiment analysis across five categorical emotions (Joy, Sadness, Fear, Liking, Hatred). Our work introduces a reproducible framework for creating specialized corpora in low-resource languages, enabling future research in regressive dimensional sentiment analysis and other specialized NLP applications. The ArabSis corpus, developed through systematic data augmentation and human labelling, facilitates advanced analysis using traditional NLP techniques (TF-IDF, Bag of Words) and modern deep learning approaches. It also targets the universal Arabic language, whereas previous research focuses on Arabic regardless of dialect, which makes small nuances and inconsistencies among dialects unnoticeable and unfixable. We evaluate machine learning (ML) and deep learning (DL) models in one-vs-all classification tasks, demonstrating that ML models (e.g., SVMs, Random Forests) outperform their DL counterparts on smaller datasets. An ensemble method combining the top-performing models achieves 98.6% accuracy through score averaging and majority voting, while also revealing inherent biases in ensemble voting mechanisms. The study provides a comprehensive pipeline encompassing data preprocessing, exploratory analysis, and model training, validated through 5-fold cross-validation, and establishes a blueprint for developing specialized NLP resources, particularly for under-resourced languages.
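The two ensemble rules named in the description, score averaging and majority voting, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the 0.5 threshold, the tie-breaking rule, and the example scores are all hypothetical. It shows one way the two rules can disagree on the same one-vs-all decision. The description does not say whether this is the bias the authors report.

```python
from collections import Counter
from statistics import mean


def score_average(scores, threshold=0.5):
    """Soft vote: average the per-model positive-class scores, then threshold."""
    return mean(scores) >= threshold


def majority_vote(scores, threshold=0.5):
    """Hard vote: threshold each model's score first, then count the votes."""
    votes = Counter(s >= threshold for s in scores)
    # Assumption: a tie (possible with an even number of models)
    # falls back to the negative class.
    return votes[True] > votes[False]


# Hypothetical scores from three one-vs-all "Joy" classifiers for one sentence.
scores = [0.95, 0.45, 0.40]
print(score_average(scores))  # True: the mean is about 0.60
print(majority_vote(scores))  # False: only one of three models votes positive
```

Here one very confident model outweighs two mildly negative ones under score averaging, while majority voting ignores confidence entirely.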
format Article
id doaj-art-27c46c49241e49ee95a56ef769d82042
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-27c46c49241e49ee95a56ef769d82042
2025-08-20T02:31:04Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pages 81083-81095
DOI 10.1109/ACCESS.2025.3567755
10990213
ArabSis: Arabic Corpus Sentiment Analysis
Ziad Doughan (https://orcid.org/0000-0002-7566-7710), Department of Electrical and Computer Engineering, Faculty of Engineering, Beirut Arab University, Beirut, Lebanon
Sari Itani (https://orcid.org/0009-0007-7886-510X), Department of Electrical and Computer Engineering, Faculty of Engineering, Beirut Arab University, Beirut, Lebanon
Samir Itani (https://orcid.org/0000-0002-2053-5817), Department of Arabic Language and Literature, Faculty of Human Sciences, Beirut Arab University, Beirut, Lebanon
https://ieeexplore.ieee.org/document/10990213/
Arabic NLP
artificial intelligence
ensemble methods
machine learning
natural language processing
sentiment analysis
title ArabSis: Arabic Corpus Sentiment Analysis
topic Arabic NLP
artificial intelligence
ensemble methods
machine learning
natural language processing
sentiment analysis
url https://ieeexplore.ieee.org/document/10990213/