Leveraging Multilingual Transformer for Multiclass Sentiment Analysis in Code-Mixed Data of Low-Resource Languages

Bibliographic Details
Main Authors: Muhammad Kashif Nazir, C. M. Nadeem Faisal, Muhammad Asif Habib, Haseeb Ahmad
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10835765/
Description
Summary: The widespread use of online social media has enabled users to express their thoughts, feelings, opinions, and sentiments in their preferred languages. These diverse perspectives offer valuable insights for data-driven decision-making. While extensive sentiment analysis approaches have been developed for resource-rich languages like English and Chinese, low-resource languages such as Roman Urdu and Roman Punjabi, especially in code-mixed contexts, have been largely neglected due to the lack of datasets and limited research on their unique morphological structures and grammatical complexities. This study aims to present a novel approach for multiclass sentiment analysis of low-resource, code-mixed datasets using multilingual transformers. Specifically, a dataset comprising Roman Urdu, Roman Punjabi, and English comments was collected. After applying traditional natural language preprocessing techniques, transformer-based libraries were used for tokenization and embedding. Subsequently, the Multilingual Bidirectional Encoder Representations from Transformers (mBERT) model was optimized and trained for multiclass sentiment analysis on the code-mixed data. The evaluation results showed a significant improvement in accuracy (+22.55%), precision (+21.06%), recall (+22.55%), and F-measure (+25.50%) compared to benchmark algorithms. Additionally, the proposed model outperformed other transformer-based models, as well as deep learning and machine learning algorithms, in sentiment extraction from code-mixed data. These findings highlight the potential of the proposed approach for sentiment analysis in low-resource, code-mixed languages.
ISSN: 2169-3536
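The pipeline described in the summary begins with traditional natural language preprocessing before transformer tokenization and mBERT fine-tuning. A minimal sketch of that preprocessing stage for code-mixed Roman-script comments is shown below; the specific cleaning steps, the function name, and the example sentence are illustrative assumptions, not details taken from the paper:

```python
import re

def preprocess(comment: str) -> str:
    """Common cleaning steps for code-mixed social-media text:
    lowercasing, URL/mention/hashtag removal, punctuation stripping.
    (The paper's exact preprocessing recipe is not specified here.)"""
    text = comment.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop @mentions and #hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

# Hypothetical Roman Urdu/English code-mixed comment
print(preprocess("Yeh movie BOHAT achi thi!! @user https://t.co/x"))
# → yeh movie bohat achi thi
```

Downstream, such cleaned comments would typically be tokenized with a multilingual subword tokenizer and passed to mBERT with a classification head sized to the sentiment label set (e.g., positive/negative/neutral for a three-class setup); those specifics are likewise assumptions about a standard fine-tuning workflow rather than the authors' exact configuration.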