DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11088104/ |
| Summary: | Multimodal Emotion Recognition in Conversation (MERC) is an advanced research area that integrates cross-modal understanding and contextual reasoning through text-speech-visual fusion, with applications spanning diverse scenarios including student emotion monitoring in high school classroom interactions. Although existing research has made progress in multimodal alignment and dialogue relationship modeling through architectures such as graph neural networks and pre-trained language models, challenges persist in dataset overfitting and underexplored generative approaches. In this study, a generative MERC framework based on Multimodal Large Language Models (MLLMs) is proposed, employing Video-LLaMA, an open-source and advanced tri-modal foundation model, for end-to-end multimodal emotion reasoning. Carefully crafted structured prompts are used to align emotion semantics with dataset annotations, combined with Low-Rank Adaptation (LoRA) for parameter-efficient optimization. The method achieves a state-of-the-art weighted F1-score of 68.57% on the MELD benchmark. Further, exploratory experiments on dynamic modality combinations and fine-tuning strategies offer actionable insights for MLLM-based MERC research. This work not only advances emotion understanding in dialogues but also highlights MLLMs’ potential in complex multimodal reasoning tasks. |
| ISSN: | 2169-3536 |
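
The summary describes parameter-efficient LoRA fine-tuning of Video-LLaMA with structured prompts that align emotion labels to the MELD annotation scheme, evaluated by weighted F1-score. The snippet below is a minimal Python sketch of that general recipe, assuming the Hugging Face `transformers` and `peft` libraries and scikit-learn; the backbone name, prompt wording, target modules, and LoRA hyperparameters are illustrative assumptions, not the authors' published configuration.

```python
# Hedged sketch: LoRA adapters on a language backbone for generative emotion
# recognition, plus the weighted-F1 metric used on MELD. Names and hyperparameters
# marked "assumed" are placeholders, not the paper's exact setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from sklearn.metrics import f1_score

# The seven MELD emotion classes used to constrain the generated answer.
MELD_EMOTIONS = ["neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear"]

# Structured prompt aligning emotion semantics with dataset annotations (illustrative).
PROMPT_TEMPLATE = (
    "Dialogue context:\n{context}\n"
    "Target utterance: {utterance}\n"
    "Choose the speaker's emotion from: " + ", ".join(MELD_EMOTIONS) + ".\nAnswer:"
)

def build_lora_model(base_model_name: str = "llm-backbone-placeholder"):
    """Wrap the backbone with LoRA adapters so only the low-rank weights train."""
    model = AutoModelForCausalLM.from_pretrained(base_model_name)
    lora_cfg = LoraConfig(
        r=16,                                  # low-rank dimension (assumed)
        lora_alpha=32,                         # scaling factor (assumed)
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_cfg)

def weighted_f1(y_true, y_pred):
    """Weighted F1 over the MELD emotion classes, the benchmark's reported metric."""
    return f1_score(y_true, y_pred, labels=MELD_EMOTIONS, average="weighted")
```

For a tri-modal model such as Video-LLaMA, the audio and visual encoders would feed the same backbone, with the prompt above carrying the textual dialogue context; the LoRA adapters keep the fine-tuned parameter count small relative to full fine-tuning.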