Leveraging Multilingual Transformer for Multiclass Sentiment Analysis in Code-Mixed Data of Low-Resource Languages

The widespread use of online social media has enabled users to express their thoughts, feelings, opinions, and sentiments in their preferred languages. These diverse perspectives offer valuable insights for data-driven decision-making. While extensive sentiment analysis approaches have been developed for resource-rich languages like English and Chinese, low-resource languages such as Roman Urdu and Roman Punjabi, especially in code-mixed contexts, have been largely neglected due to the lack of datasets and limited research on their unique morphological structures and grammatical complexities. This study aims to present a novel approach for multiclass sentiment analysis of low-resource, code-mixed datasets using multilingual transformers. Specifically, a dataset comprising Roman Urdu, Roman Punjabi, and English comments was collected. After applying traditional natural language preprocessing techniques, transformer-based libraries were used for tokenization and embedding. Subsequently, the Multilingual Bidirectional Encoder Representations from Transformers (mBERT) model was optimized and trained for multiclass sentiment analysis on the code-mixed data. The evaluation results showed a significant improvement in accuracy (+22.55%), precision (+21.06%), recall (+22.55%), and F-measure (+25.50%) compared to benchmark algorithms. Additionally, the proposed model outperformed other transformer-based models, as well as deep learning and machine learning algorithms, in sentiment extraction from code-mixed data. These findings highlight the potential of the proposed approach for sentiment analysis in low-resource, code-mixed languages.


Bibliographic Details
Main Authors: Muhammad Kashif Nazir, Cm Nadeem Faisal, Muhammad Asif Habib, Haseeb Ahmad
Author Affiliations: Department of Computer Science, National Textile University, Faisalabad, Pakistan (Nazir, Faisal, Ahmad); College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia (Habib)
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 7538-7554
DOI: 10.1109/ACCESS.2025.3527710
ISSN: 2169-3536
Subjects: Code-mixed dataset; classification; low resource languages; mBERT; sentiment analysis; transformer
Online Access: https://ieeexplore.ieee.org/document/10835765/
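
The abstract describes a pipeline of text preprocessing, transformer-based tokenization and embedding, and fine-tuning mBERT for multiclass sentiment classification of code-mixed comments. The following is a minimal sketch of such a pipeline using the Hugging Face transformers and datasets libraries; the CSV file names, the three-class label set, and the hyperparameters are illustrative assumptions and are not taken from the paper.

# Minimal sketch (not the authors' code): fine-tuning multilingual BERT (mBERT)
# for multiclass sentiment classification on code-mixed comments, using the
# Hugging Face "transformers" and "datasets" libraries.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # pretrained mBERT checkpoint
NUM_LABELS = 3                               # e.g. negative / neutral / positive (assumed)

# Hypothetical CSV files with a "text" column (the code-mixed comment) and an
# integer "label" column in the range 0..NUM_LABELS-1.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Subword tokenization with truncation/padding to a fixed sequence length.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

args = TrainingArguments(
    output_dir="mbert-codemixed-sentiment",
    num_train_epochs=3,                  # illustrative; the paper's settings are not listed in this record
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"],
                  eval_dataset=data["test"])
trainer.train()
print(trainer.evaluate())  # reports evaluation loss; add a compute_metrics function for accuracy/F1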