Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer Architecture

The development of emotionally intelligent computers depends on emotion recognition based on richer multimodal inputs, such as text, speech, and visual cues, as multiple modalities complement one another. The effectiveness of complex relationships between modalities for emotion recognition has been...

Bibliographic Details
Main Authors: Hajar Filali, Chafik Boulealam, Khalid El Fazazy, Adnane Mohamed Mahraz, Hamid Tairi, Jamal Riffi
Format: Article
Language:English
Published: MDPI AG 2025-01-01
Series:Information
Subjects:
Online Access:https://www.mdpi.com/2078-2489/16/1/40
_version_ 1832588353677033472
author Hajar Filali
Chafik Boulealam
Khalid El Fazazy
Adnane Mohamed Mahraz
Hamid Tairi
Jamal Riffi
author_facet Hajar Filali
Chafik Boulealam
Khalid El Fazazy
Adnane Mohamed Mahraz
Hamid Tairi
Jamal Riffi
author_sort Hajar Filali
collection DOAJ
description The development of emotionally intelligent computers depends on emotion recognition based on richer multimodal inputs, such as text, speech, and visual cues, as multiple modalities complement one another. The effectiveness of complex relationships between modalities for emotion recognition has been demonstrated, but these relationships are still largely unexplored. Previous research on learning multimodal representations for emotion classification has relied mainly on fusion mechanisms that simply concatenate information, rather than fully exploiting the benefits of deep learning. In this paper, a unique deep multimodal emotion model is proposed, which uses a meaningful neural network (MNN) to learn meaningful multimodal representations while classifying data. Specifically, the proposed model combines multimodal inputs by using a graph convolutional network to extract the acoustic modality, a capsule network to encode the textual modality, and a vision transformer to capture the visual modality. Building on the effectiveness of the MNN, we employ it as a methodological innovation: it is fed the previously generated feature vectors to produce better predictive results. Extensive experiments demonstrate that the proposed approach yields more accurate multimodal emotion recognition, achieving state-of-the-art results with accuracies of 69% and 56% on two public datasets, MELD and MOSEI, respectively.
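The description above outlines a late-fusion design: three modality-specific encoders (a graph convolutional network for audio, a capsule network for text, a vision transformer for images) whose output vectors are concatenated and passed to the MNN classifier. As a minimal structural sketch only, the encoders are stubbed here as fixed linear projections and the MNN head as a single softmax layer; the dimensions, weights, and seven-class output (the MELD emotion set) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Stub encoder: stands in for the GCN / capsule net / ViT branch."""
    return np.tanh(x @ w)

d_embed = 8  # assumed shared per-modality embedding size

# Hypothetical raw feature sizes per modality (acoustic, textual, visual).
w_acoustic = rng.standard_normal((40, d_embed))
w_text = rng.standard_normal((300, d_embed))
w_visual = rng.standard_normal((512, d_embed))

acoustic = encode(rng.standard_normal(40), w_acoustic)
text = encode(rng.standard_normal(300), w_text)
visual = encode(rng.standard_normal(512), w_visual)

# Late fusion by concatenation of the three branch outputs, then a
# softmax classifier head standing in for the MNN (7 MELD emotion classes).
fused = np.concatenate([acoustic, text, visual])  # shape (24,)
w_head = rng.standard_normal((3 * d_embed, 7))
logits = fused @ w_head
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

The sketch shows only the data flow (per-modality encoding, concatenation, classification); the paper's contribution lies in the specific encoders and the MNN, not in the concatenation itself.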
format Article
id doaj-art-80e6ae9781ae45fc8b2787461092b46c
institution Kabale University
issn 2078-2489
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Information
spelling doaj-art-80e6ae9781ae45fc8b2787461092b46c2025-01-24T13:35:15ZengMDPI AGInformation2078-24892025-01-011614010.3390/info16010040Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer ArchitectureHajar Filali0Chafik Boulealam1Khalid El Fazazy2Adnane Mohamed Mahraz3Hamid Tairi4Jamal Riffi5LISAC, Department of Computer Science, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, MoroccoLISAC, Department of Computer Science, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, MoroccoLISAC, Department of Computer Science, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, MoroccoLISAC, Department of Computer Science, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, MoroccoLISAC, Department of Computer Science, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, MoroccoLISAC, Department of Computer Science, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, MoroccoThe development of emotionally intelligent computers depends on emotion recognition based on richer multimodal inputs, such as text, speech, and visual cues, as multiple modalities complement one another. The effectiveness of complex relationships between modalities for emotion recognition has been demonstrated, but these relationships are still largely unexplored. Previous research on learning multimodal representations for emotion classification has relied mainly on fusion mechanisms that simply concatenate information, rather than fully exploiting the benefits of deep learning. In this paper, a unique deep multimodal emotion model is proposed, which uses a meaningful neural network (MNN) to learn meaningful multimodal representations while classifying data. 
Specifically, the proposed model combines multimodal inputs by using a graph convolutional network to extract the acoustic modality, a capsule network to encode the textual modality, and a vision transformer to capture the visual modality. Building on the effectiveness of the MNN, we employ it as a methodological innovation: it is fed the previously generated feature vectors to produce better predictive results. Extensive experiments demonstrate that the proposed approach yields more accurate multimodal emotion recognition, achieving state-of-the-art results with accuracies of 69% and 56% on two public datasets, MELD and MOSEI, respectively.https://www.mdpi.com/2078-2489/16/1/40emotion recognitiondeep learninggraph convolutional networkcapsule networkvision transformermeaningful neural network (MNN)
spellingShingle Hajar Filali
Chafik Boulealam
Khalid El Fazazy
Adnane Mohamed Mahraz
Hamid Tairi
Jamal Riffi
Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer Architecture
Information
emotion recognition
deep learning
graph convolutional network
capsule network
vision transformer
meaningful neural network (MNN)
title Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer Architecture
title_full Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer Architecture
title_fullStr Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer Architecture
title_full_unstemmed Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer Architecture
title_short Meaningful Multimodal Emotion Recognition Based on Capsule Graph Transformer Architecture
title_sort meaningful multimodal emotion recognition based on capsule graph transformer architecture
topic emotion recognition
deep learning
graph convolutional network
capsule network
vision transformer
meaningful neural network (MNN)
url https://www.mdpi.com/2078-2489/16/1/40
work_keys_str_mv AT hajarfilali meaningfulmultimodalemotionrecognitionbasedoncapsulegraphtransformerarchitecture
AT chafikboulealam meaningfulmultimodalemotionrecognitionbasedoncapsulegraphtransformerarchitecture
AT khalidelfazazy meaningfulmultimodalemotionrecognitionbasedoncapsulegraphtransformerarchitecture
AT adnanemohamedmahraz meaningfulmultimodalemotionrecognitionbasedoncapsulegraphtransformerarchitecture
AT hamidtairi meaningfulmultimodalemotionrecognitionbasedoncapsulegraphtransformerarchitecture
AT jamalriffi meaningfulmultimodalemotionrecognitionbasedoncapsulegraphtransformerarchitecture