Enhanced Emotion Recognition Through Dynamic Restrained Adaptive Loss and Extended Multimodal Bottleneck Transformer

Bibliographic Details
Main Authors: Dang-Khanh Nguyen, Eunchae Lim, Soo-Hyung Kim, Hyung-Jeong Yang, Seungwon Kim
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/5/2862
Description
Summary: Emotion recognition in video aims to estimate human emotions from acoustic, visual, and linguistic information. The problem is inherently multimodal, requiring models to learn jointly from several modalities, such as visual, verbal, and vocal cues. Whereas previous studies have focused on developing sophisticated deep learning models, this work takes a different approach: a dynamic restrained adaptive loss, inspired by multitask learning, that trains the model to understand the multimodal inputs jointly. Under this training strategy, predictions from one modality can improve the accuracy of predictions from the other modalities, mirroring multitask learning, where progress on one task can improve performance on related tasks. In addition, this work introduces the extended multimodal bottleneck transformer, an efficient and effective mid-fusion method designed for problems involving more than two modalities, to further improve emotion recognition performance. The proposed method significantly outperforms other end-to-end multimodal fusion techniques on three multimodal benchmarks: Interactive Emotional Dyadic Motion Capture (IEMOCAP), Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), and the Chinese Multimodal Sentiment Analysis dataset with independent unimodal annotations (CH-SIMS).
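The record gives no implementation details, but the two named components follow recognizable patterns: bottleneck-token mid-fusion extended to three modality streams, and a jointly weighted per-modality loss. The following is a minimal PyTorch sketch of those general ideas, not the authors' code; all names (`BottleneckFusion`, `joint_loss`, `n_bottleneck`) and the softmax-over-losses weighting rule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckFusion(nn.Module):
    """Mid-fusion through a small set of shared bottleneck tokens.

    Each modality attends only to its own tokens plus the bottleneck
    tokens, so cross-modal information must pass through the narrow
    bottleneck (the idea behind attention-bottleneck fusion).
    """

    def __init__(self, dim=256, heads=4, depth=2, n_bottleneck=4, n_modalities=3):
        super().__init__()
        self.n_bottleneck = n_bottleneck
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        # One transformer block per modality per fusion layer.
        self.blocks = nn.ModuleList(
            nn.ModuleList(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True)
                for _ in range(n_modalities)
            )
            for _ in range(depth)
        )

    def forward(self, streams):
        # streams: list of (batch, tokens_m, dim) tensors, one per modality.
        batch = streams[0].size(0)
        fsn = self.bottleneck.expand(batch, -1, -1)
        for layer in self.blocks:
            fsn_updates, new_streams = [], []
            for stream, block in zip(streams, layer):
                x = block(torch.cat([fsn, stream], dim=1))
                fsn_updates.append(x[:, : self.n_bottleneck])
                new_streams.append(x[:, self.n_bottleneck :])
            # Average the per-modality bottleneck updates so every
            # modality reads the same fused summary in the next layer.
            fsn = torch.stack(fsn_updates).mean(dim=0)
            streams = new_streams
        return streams, fsn

def joint_loss(modality_logits, fused_logits, target):
    """Fused loss plus adaptively weighted per-modality losses.

    Weighting by a softmax over the detached per-modality losses
    (harder modalities get more weight) is only one plausible
    "dynamic restrained adaptive" scheme, assumed for illustration.
    """
    losses = torch.stack([F.cross_entropy(l, target) for l in modality_logits])
    weights = torch.softmax(losses.detach(), dim=0)
    return F.cross_entropy(fused_logits, target) + (weights * losses).sum()
```

A typical forward pass would feed the per-modality encoder outputs into BottleneckFusion, pool each returned stream and the bottleneck tokens (e.g., mean over tokens) into per-modality and fused classification heads, and train with joint_loss, so that the per-modality predictions regularize one another as the abstract describes.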
ISSN:2076-3417