Leveraging Multilingual Transformer for Multiclass Sentiment Analysis in Code-Mixed Data of Low-Resource Languages
The widespread use of online social media has enabled users to express their thoughts, feelings, opinions, and sentiments in their preferred languages. These diverse perspectives offer valuable insights for data-driven decision-making. While extensive sentiment analysis approaches have been develope...
Main Authors: | Muhammad Kashif Nazir, Cm Nadeem Faisal, Muhammad Asif Habib, Haseeb Ahmad |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Code-mixed dataset; classification; low resource languages; mBERT; sentiment analysis; transformer |
Online Access: | https://ieeexplore.ieee.org/document/10835765/ |
_version_ | 1841536221466066944 |
---|---|
author | Muhammad Kashif Nazir; Cm Nadeem Faisal; Muhammad Asif Habib; Haseeb Ahmad |
author_sort | Muhammad Kashif Nazir |
collection | DOAJ |
description | The widespread use of online social media has enabled users to express their thoughts, feelings, opinions, and sentiments in their preferred languages. These diverse perspectives offer valuable insights for data-driven decision-making. While extensive sentiment analysis approaches have been developed for resource-rich languages like English and Chinese, low-resource languages such as Roman Urdu and Roman Punjabi, especially in code-mixed contexts, have been largely neglected due to the lack of datasets and limited research on their unique morphological structures and grammatical complexities. This study aims to present a novel approach for multiclass sentiment analysis of low-resource, code-mixed datasets using multilingual transformers. Specifically, a dataset comprising Roman Urdu, Roman Punjabi, and English comments was collected. After applying traditional natural language preprocessing techniques, transformer-based libraries were used for tokenization and embedding. Subsequently, the Multilingual Bidirectional Encoder Representations from Transformers (mBERT) model was optimized and trained for multiclass sentiment analysis on the code-mixed data. The evaluation results showed a significant improvement in accuracy (+22.55%), precision (+21.06%), recall (+22.55%), and F-measure (+25.50%) compared to benchmark algorithms. Additionally, the proposed model outperformed other transformer-based models, as well as deep learning and machine learning algorithms in sentiment extraction from code-mixed data. These findings highlight the potential of the proposed approach for sentiment analysis in low-resource, code-mixed languages. |
format | Article |
id | doaj-art-fbca1f06d25043338f651940cc50f4af |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-fbca1f06d25043338f651940cc50f4af; 2025-01-15T00:03:06Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 7538-7554; DOI 10.1109/ACCESS.2025.3527710; document 10835765. Authors and affiliations: Muhammad Kashif Nazir (https://orcid.org/0000-0003-4094-4412), Department of Computer Science, National Textile University, Faisalabad, Pakistan; Cm Nadeem Faisal (https://orcid.org/0000-0001-8781-4143), Department of Computer Science, National Textile University, Faisalabad, Pakistan; Muhammad Asif Habib, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia; Haseeb Ahmad (https://orcid.org/0000-0002-6359-7452), Department of Computer Science, National Textile University, Faisalabad, Pakistan. Keywords: Code-mixed dataset; classification; low resource languages; mBERT; sentiment analysis; transformer. URL: https://ieeexplore.ieee.org/document/10835765/ |
title | Leveraging Multilingual Transformer for Multiclass Sentiment Analysis in Code-Mixed Data of Low-Resource Languages |
topic | Code-mixed dataset; classification; low resource languages; mBERT; sentiment analysis; transformer |
url | https://ieeexplore.ieee.org/document/10835765/ |
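
The description reports gains in accuracy, precision, recall, and F-measure for multiclass sentiment classification, with identical gains for accuracy and recall (+22.55%). The record does not say how the per-class scores were averaged. The sketch below assumes support-weighted averaging, under which multiclass recall always equals accuracy, so it would be consistent with the matching figures; that reading is an inference, not something the record confirms. The function name `weighted_metrics`, the three class labels, and the toy predictions are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1.

    Assumption: per-class scores are averaged by class support
    (weighted averaging); the catalog record does not state this.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    support = Counter(y_true)      # gold examples per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        # Per-class scores; a class that is never predicted gets precision 0.
        p = true_pos[label] / predicted[label] if predicted[label] else 0.0
        r = true_pos[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(true_pos.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical three-class toy data, for illustration only.
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

With weighted averaging, each class contributes (support / n) × (true positives / support) to recall, which sums to the total number of correct predictions divided by n, that is, accuracy. Macro averaging would not have this property.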