DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
Multimodal Emotion Recognition in Conversation (MERC) is an advanced research area that integrates cross-modal understanding and contextual reasoning through text-speech-visual fusion, with applications spanning diverse scenarios including student emotion monitoring in high school classroom interactions. Although existing research has made progress in multimodal alignment and dialogue relationship modeling through architectures such as graph neural networks and pre-trained language models, challenges persist in dataset overfitting and underexplored generative approaches. In this study, a generative MERC framework based on Multimodal Large Language Models (MLLMs) is proposed, employing Video-LLaMA, an open-source and advanced tri-modal foundation model, for end-to-end multimodal emotion reasoning. Carefully crafted structured prompts are used to align emotion semantics with dataset annotations, combined with Low-Rank Adaptation (LoRA) for parameter-efficient optimization. The method achieves a state-of-the-art weighted F1-score of 68.57% on the MELD benchmark. Further, exploratory experiments on dynamic modality combinations and fine-tuning strategies offer actionable insights for MLLM-based MERC research. This work not only advances emotion understanding in dialogues but also highlights MLLMs’ potential in complex multimodal reasoning tasks.
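The abstract's method, in outline, is instruction tuning of a tri-modal MLLM with structured prompts and LoRA adapters. The snippet below is a minimal, hypothetical sketch of that recipe, assuming the Hugging Face transformers and peft libraries, a LLaMA-family checkpoint standing in for Video-LLaMA's language branch, and an illustrative prompt template and MELD label list; it is not the authors' released code.

```python
# Hypothetical sketch: structured emotion prompt + LoRA adapter (not the authors' code).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# MELD's seven emotion labels, used to constrain the generative output space.
MELD_LABELS = ["neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear"]

def build_prompt(context_utterances, target_utterance):
    """Compose a structured instruction so the model answers with a dataset label."""
    history = "\n".join(f"- {u}" for u in context_utterances)
    return (
        "You are watching a conversation clip and analysing the speaker's emotion.\n"
        f"Dialogue history:\n{history}\n"
        f"Target utterance: \"{target_utterance}\"\n"
        f"Answer with exactly one label from: {', '.join(MELD_LABELS)}."
    )

# Parameter-efficient fine-tuning: only low-rank adapters on the attention
# projections are trained; the base weights stay frozen.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf"  # assumed stand-in for Video-LLaMA's LLM branch
)
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # reports the small trainable fraction
```

In the paper's setting, video and audio would enter through Video-LLaMA's own visual and audio encoders rather than through the text prompt alone; the LoRA rank, target modules, and prompt wording here are illustrative defaults, not reported hyperparameters.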
| Main Authors: | Yuanyuan Sun, Ting Zhou |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Multimodal emotion recognition in conversation; multimodal large language models; structured prompt engineering; downstream task fine-tuning |
| Online Access: | https://ieeexplore.ieee.org/document/11088104/ |
| _version_ | 1849771568488513536 |
|---|---|
| author | Yuanyuan Sun; Ting Zhou |
| author_facet | Yuanyuan Sun; Ting Zhou |
| author_sort | Yuanyuan Sun |
| collection | DOAJ |
| description | Multimodal Emotion Recognition in Conversation (MERC) is an advanced research area that integrates cross-modal understanding and contextual reasoning through text-speech-visual fusion, with applications spanning diverse scenarios including student emotion monitoring in high school classroom interactions. Although existing research has made progress in multimodal alignment and dialogue relationship modeling through architectures such as graph neural networks and pre-trained language models, challenges persist in dataset overfitting and underexplored generative approaches. In this study, a generative MERC framework based on Multimodal Large Language Models (MLLMs) is proposed, employing Video-LLaMA, an open-source and advanced tri-modal foundation model, for end-to-end multimodal emotion reasoning. Carefully crafted structured prompts are used to align emotion semantics with dataset annotations, combined with Low-Rank Adaptation (LoRA) for parameter-efficient optimization. The method achieves a state-of-the-art weighted F1-score of 68.57% on the MELD benchmark. Further, exploratory experiments on dynamic modality combinations and fine-tuning strategies offer actionable insights for MLLM-based MERC research. This work not only advances emotion understanding in dialogues but also highlights MLLMs’ potential in complex multimodal reasoning tasks. |
| format | Article |
| id | doaj-art-7d2b13c0e41440948aa69c51e6b0d8cb |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | Yuanyuan Sun (Central China Normal University, Wuhan, China) and Ting Zhou (Sun Yat-sen University, Shenzhen, China; ORCID https://orcid.org/0009-0006-5876-788X). "DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM." IEEE Access, vol. 13, pp. 121048-121060, 2025-01-01. ISSN 2169-3536. DOI 10.1109/ACCESS.2025.3591447. IEEE Xplore document 11088104: https://ieeexplore.ieee.org/document/11088104/. DOAJ record doaj-art-7d2b13c0e41440948aa69c51e6b0d8cb, indexed 2025-08-20T03:02:35Z. Subjects: multimodal emotion recognition in conversation; multimodal large language models; structured prompt engineering; downstream task fine-tuning. |
| title | DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM |
| topic | Multimodal emotion recognition in conversation; multimodal large language models; structured prompt engineering; downstream task fine-tuning |
| url | https://ieeexplore.ieee.org/document/11088104/ |