DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM

Bibliographic Details
Main Authors: Yuanyuan Sun, Ting Zhou
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects: Multimodal emotion recognition in conversation; multimodal large language models; structured prompt engineering; downstream task fine-tuning
Online Access: https://ieeexplore.ieee.org/document/11088104/
_version_ 1849771568488513536
author Yuanyuan Sun
Ting Zhou
author_facet Yuanyuan Sun
Ting Zhou
author_sort Yuanyuan Sun
collection DOAJ
description Multimodal Emotion Recognition in Conversation (MERC) is an advanced research area that integrates cross-modal understanding and contextual reasoning through text-speech-visual fusion, with applications spanning diverse scenarios including student emotion monitoring in high school classroom interactions. Although existing research has made progress in multimodal alignment and dialogue relationship modeling through architectures such as graph neural networks and pre-trained language models, challenges persist in dataset overfitting and underexplored generative approaches. In this study, a generative MERC framework based on Multimodal Large Language Models (MLLMs) is proposed, employing Video-LLaMA, an open-source and advanced tri-modal foundation model, for end-to-end multimodal emotion reasoning. Carefully crafted structured prompts are used to align emotion semantics with dataset annotations, combined with Low-Rank Adaptation (LoRA) for parameter-efficient optimization. The method achieves a state-of-the-art weighted F1-score of 68.57% on the MELD benchmark. Further, exploratory experiments on dynamic modality combinations and fine-tuning strategies offer actionable insights for MLLM-based MERC research. This work not only advances emotion understanding in dialogues but also highlights MLLMs’ potential in complex multimodal reasoning tasks.
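
The description above names three concrete ingredients: a structured prompt that constrains generation to the dataset's emotion labels, LoRA adapters for parameter-efficient tuning of Video-LLaMA's language backbone, and evaluation by weighted F1 on MELD. The following minimal Python sketch illustrates those pieces with Hugging Face transformers, peft, and scikit-learn; the prompt wording, the stand-in checkpoint name, and the LoRA hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the paper's reported configuration.

# Hedged sketch of the ingredients named in the abstract; the prompt text,
# checkpoint, and LoRA hyperparameters below are assumptions for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from sklearn.metrics import f1_score

# MELD's seven emotion categories; the prompt asks the model to emit one of them.
MELD_EMOTIONS = ["neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear"]

def build_prompt(history, speaker, utterance):
    """Structured prompt aligning the generation target with MELD's annotation scheme.

    history: list of (speaker, utterance) pairs preceding the target turn.
    """
    context = "\n".join(f"{s}: {u}" for s, u in history)
    return (
        "You are given a conversation clip with its transcript, audio, and video.\n"
        f"Dialogue history:\n{context}\n"
        f"Target utterance by {speaker}: {utterance}\n"
        f"Answer with exactly one word from {MELD_EMOTIONS}.\nEmotion:"
    )

# Wrap the language backbone (a stand-in checkpoint here, not Video-LLaMA itself)
# with LoRA so only the low-rank adapter matrices train; base weights stay frozen.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,   # assumed values, not the paper's
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights update

def weighted_f1(gold_labels, predicted_labels):
    """Weighted F1 over the seven MELD emotions, the metric reported in the abstract."""
    return f1_score(gold_labels, predicted_labels,
                    labels=MELD_EMOTIONS, average="weighted")

In use, each MELD utterance would be rendered through build_prompt (alongside the audio/video features contributed by Video-LLaMA's encoders), the adapted model fine-tuned to generate the gold label, and the test-split predictions scored with weighted_f1.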
format Article
id doaj-art-7d2b13c0e41440948aa69c51e6b0d8cb
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-7d2b13c0e41440948aa69c51e6b0d8cb
2025-08-20T03:02:35Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Vol. 13, pp. 121048-121060
doi: 10.1109/ACCESS.2025.3591447
IEEE document 11088104
DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
Yuanyuan Sun (Central China Normal University, Wuhan, China)
Ting Zhou (Sun Yat-sen University, Shenzhen, China; https://orcid.org/0009-0006-5876-788X)
https://ieeexplore.ieee.org/document/11088104/
Multimodal emotion recognition in conversation
multimodal large language models
structured prompt engineering
downstream task fine-tuning
spellingShingle Yuanyuan Sun
Ting Zhou
DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
IEEE Access
Multimodal emotion recognition in conversation
multimodal large language models
structured prompt engineering
downstream task fine-tuning
title DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
title_full DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
title_fullStr DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
title_full_unstemmed DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
title_short DialogueMLLM: Transforming Multimodal Emotion Recognition in Conversation Through Instruction-Tuned MLLM
title_sort dialoguemllm transforming multimodal emotion recognition in conversation through instruction tuned mllm
topic Multimodal emotion recognition in conversation
multimodal large language models
structured prompt engineering
downstream task fine-tuning
url https://ieeexplore.ieee.org/document/11088104/
work_keys_str_mv AT yuanyuansun dialoguemllmtransformingmultimodalemotionrecognitioninconversationthroughinstructiontunedmllm
AT tingzhou dialoguemllmtransformingmultimodalemotionrecognitioninconversationthroughinstructiontunedmllm