ArabSis: Arabic Corpus Sentiment Analysis

Despite rapid progress in natural language processing (NLP), the development of specialized resources for niche domains—critical for specialized applications like affective computing and emotionally intelligent AI—remains a persistent challenge. While benchmark datasets abound...

Bibliographic Details
Main Authors:	Ziad Doughan, Sari Itani, Samir Itani
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Arabic NLP; artificial intelligence; ensemble methods; machine learning; natural language processing; sentiment analysis
Online Access:	https://ieeexplore.ieee.org/document/10990213/

_version_	1850136620034949120
author	Ziad Doughan; Sari Itani; Samir Itani
author_facet	Ziad Doughan; Sari Itani; Samir Itani
author_sort	Ziad Doughan
collection	DOAJ
description	Despite rapid progress in natural language processing (NLP), developing specialized resources for niche domains, which are critical for applications such as affective computing and emotionally intelligent AI, remains a persistent challenge. While benchmark datasets abound for general tasks, languages like Arabic and fields like multi-dimensional sentiment analysis, which goes beyond binary positive/negative classification, suffer from resource scarcity, limiting progress in human-centric applications. To address this gap, we present ArabSis: a novel Arabic corpus for multi-dimensional sentiment analysis across five categorical emotions (Joy, Sadness, Fear, Liking, Hatred). Our work introduces a reproducible framework for creating specialized corpora in low-resource languages, enabling future research in regressive dimensional sentiment analysis and other specialized NLP applications. The ArabSis corpus, developed through systematic data augmentation and human labelling, facilitates advanced analysis using traditional NLP techniques (TF-IDF, Bag of Words) and modern deep learning approaches. It also targets the universal Arabic language, whereas previous research focuses on Arabic regardless of dialect, which makes small nuances and inconsistencies among dialects unnoticeable and unfixable. We evaluate machine learning (ML) and deep learning (DL) models in one-vs-all classification tasks, demonstrating that ML models (e.g., SVMs, Random Forests) outperform DL counterparts on smaller datasets. An ensemble method combining top-performing models achieves 98.6% accuracy through score averaging and majority voting systems, though it also reveals inherent biases in ensemble voting mechanisms. The study provides a comprehensive pipeline encompassing data preprocessing, exploratory analysis, and model training, validated through 5-fold cross-validation, establishing a blueprint for developing specialized NLP resources, particularly for under-resourced languages.
format	Article
id	doaj-art-27c46c49241e49ee95a56ef769d82042
institution	OA Journals
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-27c46c49241e49ee95a56ef769d82042 | 2025-08-20T02:31:04Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | vol. 13 | pp. 81083-81095 | doi:10.1109/ACCESS.2025.3567755 | 10990213 | ArabSis: Arabic Corpus Sentiment Analysis | Ziad Doughan (https://orcid.org/0000-0002-7566-7710), Department of Electrical and Computer Engineering, Faculty of Engineering, Beirut Arab University, Beirut, Lebanon | Sari Itani (https://orcid.org/0009-0007-7886-510X), Department of Electrical and Computer Engineering, Faculty of Engineering, Beirut Arab University, Beirut, Lebanon | Samir Itani (https://orcid.org/0000-0002-2053-5817), Department of Arabic Language and Literature, Faculty of Human Sciences, Beirut Arab University, Beirut, Lebanon | https://ieeexplore.ieee.org/document/10990213/ | Arabic NLP; artificial intelligence; ensemble methods; machine learning; natural language processing; sentiment analysis
spellingShingle	Ziad Doughan Sari Itani Samir Itani ArabSis: Arabic Corpus Sentiment Analysis IEEE Access Arabic NLP artificial intelligence ensemble methods machine learning natural language processing sentiment analysis
title	ArabSis: Arabic Corpus Sentiment Analysis
title_full	ArabSis: Arabic Corpus Sentiment Analysis
title_fullStr	ArabSis: Arabic Corpus Sentiment Analysis
title_full_unstemmed	ArabSis: Arabic Corpus Sentiment Analysis
title_short	ArabSis: Arabic Corpus Sentiment Analysis
title_sort	arabsis arabic corpus sentiment analysis
topic	Arabic NLP; artificial intelligence; ensemble methods; machine learning; natural language processing; sentiment analysis
url	https://ieeexplore.ieee.org/document/10990213/
work_keys_str_mv	AT ziaddoughan arabsisarabiccorpussentimentanalysis AT sariitani arabsisarabiccorpussentimentanalysis AT samiritani arabsisarabiccorpussentimentanalysis
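
The description field above mentions an ensemble that combines top-performing models through score averaging and majority voting in one-vs-all tasks. The following is a minimal sketch of those two combination rules only, not the authors' implementation. The function names, the 0.5 decision threshold, and the toy scores are all assumptions made for illustration.

```python
from statistics import mean


def average_scores(scores, threshold=0.5):
    """Soft combination: average the per-model positive-class scores,
    then compare the average against the threshold (assumed 0.5)."""
    return mean(scores) >= threshold


def majority_vote(scores, threshold=0.5):
    """Hard combination: threshold each model's score into a vote,
    then return True only if more than half of the votes are positive."""
    votes = [s >= threshold for s in scores]
    return sum(votes) > len(votes) / 2


# Hypothetical positive-class scores from three models for a single
# sentence in a one-vs-all task (e.g. "Joy" vs. the rest).
toy_scores = [0.95, 0.40, 0.45]

print(average_scores(toy_scores))  # one confident model lifts the average
print(majority_vote(toy_scores))   # only one of the three votes is positive
```

On these toy scores the two rules disagree: averaging lets a single confident model carry the decision, while majority voting ignores how confident each model is. This only illustrates that the choice of combination rule can change the outcome; it makes no claim about the specific biases reported in the article.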

Similar Items