TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis
Abstract Multimodal sentiment analysis (MSA) is crucial in human-computer interaction. Current methods use simple sub-models for feature extraction, neglecting multi-scale features and the complexity of emotions. Text, visual, and audio modalities each have unique characteristics in MSA, with text often providing more emotional cues due to its rich semantics; however, current approaches treat the modalities equally and do not exploit text's advantages. To solve these problems, we propose a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning (TMFN). First, we propose a pyramid-structured multi-scale feature extraction method that captures multi-scale features of the modal data through convolution kernels of different sizes and strengthens key features through a channel attention mechanism. Second, we design a text-based multimodal feature fusion module consisting of a text gating unit (TGU) and a text-based channel-wise attention transformer (TCAT): TGU guides and regulates the fusion of information from the other modalities, while TCAT improves the model's ability to capture relationships between features of different modalities and achieves effective feature interaction. Finally, to further optimize the representation of the fused features, we introduce unsupervised contrastive learning to explore the intrinsic connection between the multi-scale features and the fused features. Experimental results show that the proposed model outperforms state-of-the-art models on two MSA benchmark datasets.
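The pyramid-structured multi-scale extraction step can be made concrete with a short sketch. The PyTorch code below is a minimal illustration based only on the abstract, not the authors' released implementation: each modality's feature sequence passes through parallel 1-D convolutions with different kernel sizes, and a squeeze-and-excitation style channel attention re-weights the concatenated scales. The class name `MultiScaleExtractor`, the kernel sizes (1, 3, 5), and all dimensions are assumptions.

```python
# Hypothetical sketch of pyramid multi-scale feature extraction with channel attention.
# Not the paper's code; layer sizes and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn


class MultiScaleExtractor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, kernel_sizes=(1, 3, 5), reduction: int = 4):
        super().__init__()
        # One convolution branch per kernel size; padding keeps the sequence length fixed.
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in kernel_sizes]
        )
        fused = out_dim * len(kernel_sizes)
        # Channel attention: squeeze (global average pool) then excite (two-layer MLP).
        self.attn = nn.Sequential(
            nn.Linear(fused, fused // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused // reduction, fused),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim); Conv1d expects (batch, channels, seq_len).
        x = x.transpose(1, 2)
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)  # (B, fused, L)
        weights = self.attn(multi.mean(dim=2))       # one weight per channel, (B, fused)
        multi = multi * weights.unsqueeze(-1)        # emphasize informative channels
        return multi.transpose(1, 2)                 # back to (B, L, fused)


# Usage with made-up shapes: 300-d text features over a 50-step sequence.
feats = MultiScaleExtractor(in_dim=300, out_dim=64)(torch.randn(8, 50, 300))
print(feats.shape)  # torch.Size([8, 50, 192])
```

The text-guided fusion and the unsupervised contrastive objective can be sketched in the same spirit. The snippet below assumes a sigmoid gate computed from concatenated text and non-text features (one plausible reading of the text gating unit) and an InfoNCE-style loss that pulls each sample's fused representation toward its own multi-scale representation while pushing it away from other samples in the batch. `TextGatingUnit`, `info_nce`, the temperature, and all dimensions are assumptions rather than details taken from the paper, and the text-based channel-wise attention transformer (TCAT) is not reproduced here.

```python
# Hypothetical sketch of a text gating unit and an InfoNCE-style contrastive loss.
# Assumed formulation for illustration only; not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGatingUnit(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # text, other: (batch, dim); the gate decides how much of `other` passes through.
        g = torch.sigmoid(self.gate(torch.cat([text, other], dim=-1)))
        return g * other + (1.0 - g) * text


def info_nce(fused: torch.Tensor, multiscale: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # fused, multiscale: (batch, dim); matching rows are positives, all others negatives.
    fused = F.normalize(fused, dim=-1)
    multiscale = F.normalize(multiscale, dim=-1)
    logits = fused @ multiscale.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(fused.size(0), device=fused.device)
    return F.cross_entropy(logits, targets)


# Usage with made-up tensors.
text, audio = torch.randn(8, 128), torch.randn(8, 128)
gated_audio = TextGatingUnit(128)(text, audio)
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(gated_audio.shape, loss.item())
```

The gate here operates at the utterance level for brevity; a per-time-step gate over aligned sequences would be an equally plausible reading of the abstract.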
Main Authors: | Junsong Fu, Youjia Fu, Huixia Xue, Zihao Xu |
Format: | Article |
Language: | English |
Published: | Springer, 2025-01-01 |
Series: | Complex & Intelligent Systems |
Subjects: | Multimodal sentiment analysis; Multi-scale feature extraction; Multimodal data fusion; Transformer; Unsupervised contrastive learning |
Online Access: | https://doi.org/10.1007/s40747-024-01724-5 |
author | Junsong Fu; Youjia Fu; Huixia Xue; Zihao Xu |
author_facet | Junsong Fu; Youjia Fu; Huixia Xue; Zihao Xu |
author_sort | Junsong Fu |
collection | DOAJ |
description | Abstract Multimodal sentiment analysis (MSA) is crucial in human-computer interaction. Current methods use simple sub-models for feature extraction, neglecting multi-scale features and the complexity of emotions. Text, visual, and audio modalities each have unique characteristics in MSA, with text often providing more emotional cues due to its rich semantics; however, current approaches treat the modalities equally and do not exploit text's advantages. To solve these problems, we propose a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning (TMFN). First, we propose a pyramid-structured multi-scale feature extraction method that captures multi-scale features of the modal data through convolution kernels of different sizes and strengthens key features through a channel attention mechanism. Second, we design a text-based multimodal feature fusion module consisting of a text gating unit (TGU) and a text-based channel-wise attention transformer (TCAT): TGU guides and regulates the fusion of information from the other modalities, while TCAT improves the model's ability to capture relationships between features of different modalities and achieves effective feature interaction. Finally, to further optimize the representation of the fused features, we introduce unsupervised contrastive learning to explore the intrinsic connection between the multi-scale features and the fused features. Experimental results show that the proposed model outperforms state-of-the-art models on two MSA benchmark datasets. |
format | Article |
id | doaj-art-3de9660b26e645629b4defd9034bc458 |
institution | Kabale University |
issn | 2199-4536 2198-6053 |
language | English |
publishDate | 2025-01-01 |
publisher | Springer |
record_format | Article |
series | Complex & Intelligent Systems |
spelling | Author affiliations: Junsong Fu, Youjia Fu, Huixia Xue (College of Computer Science and Engineering, Chongqing University of Technology); Zihao Xu (College of Artificial Intelligence, Chongqing University of Technology) |
title | TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis |
title_full | TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis |
title_fullStr | TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis |
title_full_unstemmed | TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis |
title_short | TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis |
title_sort | tmfn a text based multimodal fusion network with multi scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis |
topic | Multimodal sentiment analysis; Multi-scale feature extraction; Multimodal data fusion; Transformer; Unsupervised contrastive learning |
url | https://doi.org/10.1007/s40747-024-01724-5 |