R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis

Bibliographic Details
Main Authors: Yan Zhuang, Yanru Zhang, Jiawen Deng, Fuji Ren
Format: Article
Language: English
Published: American Association for the Advancement of Science (AAAS) 2025-01-01
Series: Research
Online Access: https://spj.science.org/doi/10.34133/research.0729
Collection: DOAJ
Description: Multimodal sentiment analysis (MSA) aims to assess emotional states by integrating information from text, audio, and video. However, the heterogeneous nature of these modalities presents substantial challenges for accurate sentiment prediction. Existing approaches typically align pairs of modalities using attention mechanisms or contrastive learning, which are computationally expensive. Additionally, they often rely on a single granularity of alignment, either by averaging features over all time steps or aligning features at each individual time step. These approaches overlook the fact that emotional expression can vary across individuals and contexts, requiring multiple granularities to capture emotion effectively. To address these challenges, we propose a novel framework, Retrieve, Rank, and Reconstruction with Different Granularities (R3DG). R3DG segments the audio and video modalities into multiple representations at varying granularities based on their temporal durations. It then selects the most relevant representations that align closely with the text modality. To preserve the original information, R3DG reconstructs the audio and video data using the selected representations. Finally, the fused audio, video, and text features are aligned and combined for sentiment prediction, reducing the need for multiple alignment steps. Extensive experiments on 5 benchmark MSA datasets demonstrate that R3DG outperforms existing methods and achieves substantial reductions in computational time. Code is available at https://github.com/YetZzzzzz/R3DG.
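Illustrative note: the retrieve-and-rank idea in the abstract can be sketched roughly as follows. This is not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that segments are formed by mean pooling over fixed-size windows, that relevance to the text is measured by cosine similarity, and that the top-k segments are kept. The function names, window sizes, and toy vectors are all hypothetical, and the reconstruction and fusion steps are omitted.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def segment(frames, window):
    """Split a frame sequence into consecutive windows and pool each window."""
    return [mean_pool(frames[i:i + window]) for i in range(0, len(frames), window)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_and_rank(frames, text_vec, windows=(1, 2, 4), top_k=2):
    """Build segments at several granularities (assumed window sizes),
    score each against the text vector, and keep the top_k best."""
    candidates = []
    for w in windows:
        for idx, seg in enumerate(segment(frames, w)):
            candidates.append((cosine(seg, text_vec), w, idx, seg))
    # Stable sort by score only, highest first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_k]

# Toy data: four audio frames with 2-d features and one 2-d text vector.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
text = [0.0, 1.0]
top = retrieve_and_rank(audio, text)
for score, window, index, _ in top:
    print(f"window={window} index={index} score={score:.3f}")
```

In this toy case the best-scoring segments come from the finest granularity because two single frames match the text vector exactly; with real features, coarser windows could rank higher, which is the motivation the abstract gives for considering several granularities.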
ID: doaj-art-c246a022978e42eba1715e08d3e3271a
Institution: OA Journals
ISSN: 2639-5274
Author Affiliations: Yan Zhuang, Yanru Zhang, Jiawen Deng, Fuji Ren: College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China.
Volume: 8
DOI: 10.34133/research.0729