EMVAS: end-to-end multimodal emotion visualization analysis system

Bibliographic Details
Main Authors: Xianxun Zhu, Heyang Feng, Erik Cambria, Yao Huang, Ming Ju, Haochen Yuan, Rui Wang
Format: Article
Language: English
Published: Springer 2025-07-01
Series: Complex & Intelligent Systems
Subjects:
Online Access: https://doi.org/10.1007/s40747-025-01931-8
Description
Summary: Accurately interpreting human emotions is crucial for enhancing human–machine interactions in applications such as driver monitoring, adaptive learning, and smart environments. Conventional unimodal systems fail to capture the complex interplay of emotional cues in dynamic settings. To address these limitations, we propose EMVAS, an end-to-end multimodal emotion visualization analysis system that seamlessly integrates visual, auditory, and textual modalities. The preprocessing architecture uses silence-based audio segmentation alongside end-to-end DeepSpeech2 audio-to-text conversion to generate a synchronized and semantically consistent data stream. For feature extraction, facial landmark detection and action unit analysis capture fine-grained visual cues; Mel-frequency cepstral coefficients, log-scaled fundamental frequency, and the constant-Q transform extract detailed audio features; and a Transformer-based encoder processes textual data for contextual emotion analysis. These heterogeneous features are projected into a unified latent space and fused using a self-supervised multitask learning framework that leverages both shared and modality-specific representations to achieve robust emotion classification. An intuitive front end provides real-time visualization of temporal trends and emotion-frequency distributions. Extensive experiments on benchmark datasets and in real-world scenarios demonstrate that EMVAS outperforms state-of-the-art baselines, achieving higher classification accuracy, improved F1 scores, lower mean absolute error, and stronger correlations.
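The audio branch the abstract describes (silence-based segmentation followed by MFCC, log-F0, and constant-Q features) can be illustrated with a short sketch. This is a hypothetical, minimal example assuming the librosa toolkit; the record does not name an implementation, and the sample rate, silence threshold, and feature settings below are illustrative defaults rather than the authors' configuration.

    # Hypothetical sketch of the audio pipeline; librosa is an assumed toolkit.
    import numpy as np
    import librosa

    def extract_audio_features(path, sr=16000, top_db=30):
        """Silence-based segmentation, then MFCC, log-F0, and CQT
        features per non-silent segment (illustrative settings)."""
        y, sr = librosa.load(path, sr=sr)

        # Silence-based segmentation: keep intervals within top_db
        # of the signal's peak level.
        intervals = librosa.effects.split(y, top_db=top_db)

        features = []
        for start, end in intervals:
            seg = y[start:end]

            # Mel-frequency cepstral coefficients (13 per frame).
            mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)

            # Fundamental frequency via pYIN; pyin returns NaN on
            # unvoiced frames, so substitute 1 Hz there before the log
            # (log(1) = 0 marks unvoiced frames).
            f0, voiced, _ = librosa.pyin(
                seg, fmin=librosa.note_to_hz("C2"),
                fmax=librosa.note_to_hz("C7"), sr=sr)
            log_f0 = np.log(np.where(voiced, f0, 1.0))

            # Constant-Q transform magnitude in decibels.
            cqt = librosa.amplitude_to_db(
                np.abs(librosa.cqt(y=seg, sr=sr)), ref=np.max)

            features.append({"mfcc": mfcc, "log_f0": log_f0, "cqt": cqt})
        return features

The fusion stage can likewise be sketched. The PyTorch module below illustrates, under stated assumptions, the general idea of projecting heterogeneous modality features into one latent space and combining shared with modality-specific representations. The layer sizes, modality dimensions, and emotion count are hypothetical, and the paper's self-supervised multitask objectives are not reproduced here.

    # Hypothetical sketch of shared/modality-specific fusion in PyTorch.
    import torch
    import torch.nn as nn

    class MultimodalFusion(nn.Module):
        def __init__(self, dims, latent_dim=128, num_emotions=7):
            super().__init__()
            # One projection per modality into the common latent space.
            self.project = nn.ModuleDict(
                {m: nn.Linear(d, latent_dim) for m, d in dims.items()})
            # Shared encoder applied to every projected modality.
            self.shared = nn.Sequential(
                nn.Linear(latent_dim, latent_dim), nn.ReLU())
            # Modality-specific (private) encoders.
            self.private = nn.ModuleDict(
                {m: nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU())
                 for m in dims})
            self.classifier = nn.Linear(
                2 * latent_dim * len(dims), num_emotions)

        def forward(self, inputs):
            # inputs: dict of modality name -> (batch, dim) features;
            # must cover exactly the modalities given at construction.
            parts = []
            for m, x in inputs.items():
                z = self.project[m](x)
                parts.extend([self.shared(z), self.private[m](z)])
            return self.classifier(torch.cat(parts, dim=-1))

    # Usage with illustrative feature sizes:
    model = MultimodalFusion({"vision": 64, "audio": 128, "text": 768})
    logits = model({
        "vision": torch.randn(8, 64),
        "audio": torch.randn(8, 128),
        "text": torch.randn(8, 768),
    })  # -> (8, 7) emotion logits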
ISSN: 2199-4536, 2198-6053