Multi-Modal Emotion Detection and Sentiment Analysis
| Main Authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Online Access: | https://ieeexplore.ieee.org/document/10935793/ |
| Summary: | In the digital era, the proliferation of online reviews through videos has been meteoric, driven by recent technological advancements. The sentiments expressed in these videos drive consumer reliance and decision-making and may shape the perceptions of the general public. The sentiment classification of videos is emerging as a vital area of research in Natural Language Processing (NLP) and is challenging due to the complexity of emotions and the temporal aspects of modalities. These challenges become more pronounced for low-resourced languages such as Urdu. Addressing the need for a comprehensive study in this realm, we introduce a generic framework, Urdu Multi-modal Sentiment Analysis (UMSA), for emotion detection and sentiment classification of videos. UMSA highlights that incorporating additional modalities and cross-modal interactions significantly enhances the analysis. Our research culminates in the creation of the Urdu Sentiments Dataset (USD), a comprehensive collection of Urdu video reviews. In this study, we classify videos using a two-phase approach that incorporates early fusion and ensembling. After fusion, we perform ensembling of two models for each modality: audio, text, and frames. We utilize Long Short-Term Memory (LSTM) networks and a Random Forest classifier for audio. Text-based analysis is conducted using Logistic Regression and the Bidirectional Encoder Representations from Transformers (BERT) model. For frames, we employ Random Forest and Convolutional Neural Networks (CNN). Afterwards, we implement model ensembling across the three modalities. This multi-modal integration proves essential in providing a clearer and more comprehensive understanding of the sentiments conveyed and achieves more than 80% accuracy. The validation of UMSA is reinforced through a comprehensive case study; this independent validation highlights its robustness and adaptability to real-world scenarios. |
| ISSN: | 2169-3536 |
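The summary above describes a two-phase pipeline: within each modality, the outputs of two models are ensembled (LSTM and Random Forest for audio, Logistic Regression and BERT for text, Random Forest and CNN for frames), and the three per-modality results are then combined across modalities. The Python sketch below illustrates one plausible reading of that scheme. The record does not state the combination rule, so simple probability averaging is assumed throughout, and the `fake_probs` helper and the three-class label set are hypothetical stand-ins for the trained models' outputs.

```python
# Minimal sketch of the two-phase ensembling described in the summary.
# Phase 1 combines two models per modality; phase 2 combines the three
# modalities. Probability averaging is an assumption, not the paper's
# confirmed rule; all probability arrays are hypothetical stand-ins.
import numpy as np

N_CLASSES = 3  # e.g. negative / neutral / positive (assumed label set)


def ensemble_pair(p_a: np.ndarray, p_b: np.ndarray) -> np.ndarray:
    """Phase 1: average the class probabilities of a modality's two models."""
    return (p_a + p_b) / 2.0


def ensemble_modalities(p_audio: np.ndarray,
                        p_text: np.ndarray,
                        p_frames: np.ndarray) -> np.ndarray:
    """Phase 2: average the per-modality distributions across modalities."""
    return (p_audio + p_text + p_frames) / 3.0


# Hypothetical per-model probability outputs for a batch of 2 video reviews.
rng = np.random.default_rng(0)


def fake_probs() -> np.ndarray:
    p = rng.random((2, N_CLASSES))
    return p / p.sum(axis=1, keepdims=True)  # normalize rows to sum to 1


p_audio = ensemble_pair(fake_probs(), fake_probs())   # LSTM + Random Forest
p_text = ensemble_pair(fake_probs(), fake_probs())    # LogReg + BERT
p_frames = ensemble_pair(fake_probs(), fake_probs())  # Random Forest + CNN

final = ensemble_modalities(p_audio, p_text, p_frames)
print(final.argmax(axis=1))  # predicted sentiment class per video
```

In practice, the averaging step could equally be weighted voting or a stacked meta-classifier; the record gives no detail on which variant the authors used.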