R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis

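Multimodal sentiment analysis (MSA) aims to assess emotional states by integrating information from text, audio, and video. However, the heterogeneous nature of these modalities presents substantial challenges for accurate sentiment prediction. Existing approaches typically align pairs of modalities using attention mechanisms or contrastive learning, which are computationally expensive. Additionally, they often rely on a single granularity of alignment, either by averaging features over all time steps or aligning features at each individual time step. These approaches overlook the fact that emotional expression can vary across individuals and contexts, requiring multiple granularities to capture emotion effectively. To address these challenges, we propose a novel framework, Retrieve, Rank, and Reconstruction with Different Granularities (R3DG). R3DG segments the audio and video modalities into multiple representations at varying granularities based on their temporal durations. It then selects the most relevant representations that align closely with the text modality. To preserve the original information, R3DG reconstructs the audio and video data using the selected representations. Finally, the fused audio, video, and text features are aligned and combined for sentiment prediction, reducing the need for multiple alignment steps. Extensive experiments on 5 benchmark MSA datasets demonstrate that R3DG outperforms existing methods and achieves substantial reductions in computational time. Code is available at https://github.com/YetZzzzzz/R3DG.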
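The abstract of this record describes R3DG as segmenting audio and video into representations at several temporal granularities, selecting those closest to the text modality, and reconstructing the original signal from the selection. The toy sketch below illustrates that general idea only; it is not the authors' implementation (see the linked repository for that). The non-overlapping mean-pooled windows, the cosine-similarity ranking, the `top_k` cutoff, the softmax-weighted reconstruction, and all function names are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Average a non-empty run of frame vectors into a single vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def segment(frames: Sequence[Vector], window_sizes: Sequence[int]) -> List[Vector]:
    """Assumed segmentation: one mean-pooled vector per non-overlapping
    window, repeated for every window size (granularity)."""
    candidates: List[Vector] = []
    for w in window_sizes:
        for start in range(0, len(frames), w):
            candidates.append(mean_pool(frames[start:start + w]))
    return candidates


def retrieve_and_rank(candidates: Sequence[Vector], text_vec: Vector,
                      top_k: int) -> List[Vector]:
    """Assumed retrieval: keep the top_k candidates most similar to the text."""
    ranked = sorted(candidates, key=lambda c: cosine(c, text_vec), reverse=True)
    return ranked[:top_k]


def reconstruct(frames: Sequence[Vector],
                selected: Sequence[Vector]) -> Tuple[List[Vector], float]:
    """Assumed reconstruction: rebuild each frame as a softmax-weighted mix
    of the selected vectors and report the mean squared error."""
    rebuilt: List[Vector] = []
    sq_err = 0.0
    for f in frames:
        scores = [sum(x * y for x, y in zip(f, s)) for s in selected]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        r = [sum(w * s[i] for w, s in zip(weights, selected))
             for i in range(len(f))]
        rebuilt.append(r)
        sq_err += sum((x - y) ** 2 for x, y in zip(f, r))
    return rebuilt, sq_err / (len(frames) * len(frames[0]))


# Toy data: four 2-dimensional "audio" frames and one "text" vector.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text = [1.0, 0.0]

candidates = segment(audio, window_sizes=[1, 2, 4])
selected = retrieve_and_rank(candidates, text, top_k=2)
rebuilt, mse = reconstruct(audio, selected)
print(len(candidates), selected, mse)
```

In this sketch the reconstruction error indicates how much of the original sequence the selected representations retain; a trained model would presumably learn the features and use such an error as a loss term, which the record itself does not detail.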

Bibliographic Details
Main Authors:	Yan Zhuang, Yanru Zhang, Jiawen Deng, Fuji Ren
Format:	Article
Language:	English
Published:	American Association for the Advancement of Science (AAAS) 2025-01-01
Series:	Research
Online Access:	https://spj.science.org/doi/10.34133/research.0729

author	Yan Zhuang Yanru Zhang Jiawen Deng Fuji Ren
collection	DOAJ
description	Multimodal sentiment analysis (MSA) aims to assess emotional states by integrating information from text, audio, and video. However, the heterogeneous nature of these modalities presents substantial challenges for accurate sentiment prediction. Existing approaches typically align pairs of modalities using attention mechanisms or contrastive learning, which are computationally expensive. Additionally, they often rely on a single granularity of alignment, either by averaging features over all time steps or aligning features at each individual time step. These approaches overlook the fact that emotional expression can vary across individuals and contexts, requiring multiple granularities to capture emotion effectively. To address these challenges, we propose a novel framework, Retrieve, Rank, and Reconstruction with Different Granularities (R3DG). R3DG segments the audio and video modalities into multiple representations at varying granularities based on their temporal durations. It then selects the most relevant representations that align closely with the text modality. To preserve the original information, R3DG reconstructs the audio and video data using the selected representations. Finally, the fused audio, video, and text features are aligned and combined for sentiment prediction, reducing the need for multiple alignment steps. Extensive experiments on 5 benchmark MSA datasets demonstrate that R3DG outperforms existing methods and achieves substantial reductions in computational time. Code is available at https://github.com/YetZzzzzz/R3DG.
format	Article
id	doaj-art-c246a022978e42eba1715e08d3e3271a
institution	OA Journals
issn	2639-5274
language	English
publishDate	2025-01-01
publisher	American Association for the Advancement of Science (AAAS)
record_format	Article
series	Research
spelling	doaj-art-c246a022978e42eba1715e08d3e3271a; 2025-08-20T02:37:38Z; eng; American Association for the Advancement of Science (AAAS); Research; 2639-5274; 2025-01-01; 8; 10.34133/research.0729; R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis; Yan Zhuang, Yanru Zhang, Jiawen Deng, Fuji Ren (all: College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China); https://spj.science.org/doi/10.34133/research.0729
title	R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis
url	https://spj.science.org/doi/10.34133/research.0729
