R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) aims to assess emotional states by integrating information from text, audio, and video. However, the heterogeneous nature of these modalities presents substantial challenges for accurate sentiment prediction. Existing approaches typically align pairs of modalities...
| Main Authors: | Yan Zhuang, Yanru Zhang, Jiawen Deng, Fuji Ren |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | American Association for the Advancement of Science (AAAS), 2025-01-01 |
| Series: | Research |
| Online Access: | https://spj.science.org/doi/10.34133/research.0729 |
| _version_ | 1850111390186995712 |
|---|---|
| author | Yan Zhuang Yanru Zhang Jiawen Deng Fuji Ren |
| author_sort | Yan Zhuang |
| collection | DOAJ |
| description | Multimodal sentiment analysis (MSA) aims to assess emotional states by integrating information from text, audio, and video. However, the heterogeneous nature of these modalities presents substantial challenges for accurate sentiment prediction. Existing approaches typically align pairs of modalities using attention mechanisms or contrastive learning, which are computationally expensive. Additionally, they often rely on a single granularity of alignment, either by averaging features over all time steps or aligning features at each individual time step. These approaches overlook the fact that emotional expression can vary across individuals and contexts, requiring multiple granularities to capture emotion effectively. To address these challenges, we propose a novel framework, Retrieve, Rank, and Reconstruction with Different Granularities (R3DG). R3DG segments the audio and video modalities into multiple representations at varying granularities based on their temporal durations. It then selects the most relevant representations that align closely with the text modality. To preserve the original information, R3DG reconstructs the audio and video data using the selected representations. Finally, the fused audio, video, and text features are aligned and combined for sentiment prediction, reducing the need for multiple alignment steps. Extensive experiments on 5 benchmark MSA datasets demonstrate that R3DG outperforms existing methods and achieves substantial reductions in computational time. Code is available at https://github.com/YetZzzzzz/R3DG. |
| format | Article |
| id | doaj-art-c246a022978e42eba1715e08d3e3271a |
| institution | OA Journals |
| issn | 2639-5274 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | American Association for the Advancement of Science (AAAS) |
| record_format | Article |
| series | Research |
| spelling | doaj-art-c246a022978e42eba1715e08d3e3271a; 2025-08-20T02:37:38Z; eng; American Association for the Advancement of Science (AAAS); Research; 2639-5274; 2025-01-01; 8; 10.34133/research.0729; R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis; Yan Zhuang, Yanru Zhang, Jiawen Deng, Fuji Ren (all four authors: College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China); https://spj.science.org/doi/10.34133/research.0729 |
| title | R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis |
| url | https://spj.science.org/doi/10.34133/research.0729 |
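
The description above outlines a pipeline that segments audio and video at several temporal granularities and then keeps the segments that align most closely with the text. The following is a minimal sketch of that retrieve-and-rank idea only, not the authors' implementation (their code is at the GitHub URL in the description). Mean pooling over fixed windows, cosine similarity as the relevance score, and every function and parameter name here (`segment`, `retrieve_and_rank`, `windows`, `top_k`) are assumptions made for illustration; the reconstruction and fusion steps are not covered.

```python
import math


def mean_pool(frames):
    """Average a list of equal-length frame vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def segment(frames, window):
    """Split frames into non-overlapping windows and mean-pool each one.

    A larger window gives a coarser granularity (assumed pooling scheme).
    """
    return [mean_pool(frames[i:i + window]) for i in range(0, len(frames), window)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_and_rank(frames, text_vec, windows=(1, 2, 4), top_k=2):
    """Build segments at each granularity, score them against the text
    vector, and return the top_k as (score, window, segment) tuples."""
    candidates = []
    for w in windows:
        for seg in segment(frames, w):
            candidates.append((cosine(seg, text_vec), w, seg))
    # Stable sort on the score only, highest first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_k]


if __name__ == "__main__":
    # Toy audio features: four frames of dimension 2 (hypothetical data).
    audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    text = [1.0, 0.0]
    for score, window, seg in retrieve_and_rank(audio, text, top_k=3):
        print(f"window={window} score={score:.3f} segment={seg}")
```

In this toy setup the same sequence is scored at three granularities at once, so a short segment and a longer one can both be selected if each matches the text well. This is the behaviour the description contrasts with single-granularity alignment.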