DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM

Bibliographic Details
Main Authors: Yuanyuan Sun, Ting Zhou
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11088104/
Description
Summary: Multimodal Emotion Recognition in Conversation (MERC) is an advanced research area that integrates cross-modal understanding and contextual reasoning through text-speech-visual fusion, with applications spanning diverse scenarios including student emotion monitoring in high school classroom interactions. Although existing research has made progress in multimodal alignment and dialogue relationship modeling through architectures such as graph neural networks and pre-trained language models, challenges persist, including dataset overfitting and the underexploration of generative approaches. In this study, a generative MERC framework based on Multimodal Large Language Models (MLLMs) is proposed, employing Video-LLaMA, an advanced open-source tri-modal foundation model, for end-to-end multimodal emotion reasoning. Carefully crafted structured prompts align emotion semantics with dataset annotations, and Low-Rank Adaptation (LoRA) is applied for parameter-efficient optimization. The method achieves a state-of-the-art weighted F1-score of 68.57% on the MELD benchmark. Further, exploratory experiments on dynamic modality combinations and fine-tuning strategies offer actionable insights for MLLM-based MERC research. This work not only advances emotion understanding in dialogues but also highlights MLLMs' potential in complex multimodal reasoning tasks.
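The abstract names three concrete ingredients: LoRA adapters for parameter-efficient tuning, structured prompts that tie generation to the dataset's emotion labels, and weighted F1 evaluation on MELD. The minimal sketch below illustrates how such pieces are commonly wired together using the Hugging Face PEFT/Transformers and scikit-learn libraries; the checkpoint name, LoRA hyperparameters, and prompt wording are illustrative assumptions, not details reported in the paper, and loading an actual Video-LLaMA checkpoint would follow that project's own loaders.

```python
# Minimal sketch (not the authors' code): LoRA fine-tuning setup, a structured
# emotion prompt, and the weighted F1 metric used on MELD. All hyperparameters
# and the model name are assumptions for illustration only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from sklearn.metrics import f1_score

# --- LoRA: train small low-rank adapters instead of the full model ---
lora_cfg = LoraConfig(
    r=8,                                   # assumed adapter rank
    lora_alpha=16,                         # assumed scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
base = AutoModelForCausalLM.from_pretrained("base-llm-checkpoint")  # hypothetical name
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# --- Structured prompt aligning generation with MELD's 7 emotion labels ---
MELD_LABELS = ["neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear"]

def build_prompt(context: str, utterance: str) -> str:
    return (
        "You are given a conversation clip (video, audio, transcript).\n"
        f"Dialogue context: {context}\n"
        f"Target utterance: {utterance}\n"
        f"Answer with exactly one label from: {', '.join(MELD_LABELS)}."
    )

# --- Evaluation: weighted F1, the metric reported on MELD ---
def weighted_f1(y_true, y_pred):
    return f1_score(y_true, y_pred, average="weighted")
```

Constraining the model to emit exactly one label from a fixed set is what makes the generative output directly comparable against MELD's annotations with a standard classification metric such as weighted F1.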
ISSN: 2169-3536