TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis

Abstract: Multimodal sentiment analysis (MSA) is crucial in human-computer interaction. Current methods use simple sub-models for feature extraction, neglecting multi-scale features and the complexity of emotions. Text, visual, and audio each have unique characteristics in MSA, with text often providing more emotional cues due to its rich semantics; however, current approaches treat the modalities equally and do not exploit text's advantages. To solve these problems, we propose a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning (TMFN). First, we propose a pyramid-structured multi-scale feature extraction method that captures multi-scale features of modal data through convolution kernels of different sizes and strengthens key features through a channel attention mechanism. Second, we design a text-based multimodal feature fusion module consisting of a text gating unit (TGU) and a text-based channel-wise attention transformer (TCAT): TGU guides and regulates the fusion of information from the other modalities, while TCAT improves the model's ability to capture relationships between the features of different modalities and achieves effective feature interaction. Finally, to further optimize the representation of the fused features, we introduce unsupervised contrastive learning to explore the intrinsic connection between the multi-scale features and the fused features. Experimental results show that the proposed model outperforms state-of-the-art models on two benchmark MSA datasets.
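The record describes the architecture only at this level of detail. As a rough, non-authoritative sketch of the ideas the abstract names (multi-scale convolution, channel attention, and a contrastive objective between multi-scale and fused features), the PyTorch snippet below is one possible reading: the kernel sizes, dimensions, SE-style attention, InfoNCE-style loss, and all names are assumptions for illustration, not the authors' implementation of TMFN.

```python
# Illustrative sketch only: parallel 1D convolutions with different kernel sizes
# (assumed 3/5/7), SE-style channel attention, and an InfoNCE-style contrastive
# loss between multi-scale and fused features. Not the authors' TMFN code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """SE-style channel attention: reweight channels with a learned sigmoid gate."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len); squeeze over time, then rescale channels
        gate = self.fc(x.mean(dim=-1))
        return x * gate.unsqueeze(-1)


class MultiScaleExtractor(nn.Module):
    """Pyramid-like extraction: one conv branch per kernel size, concatenated."""

    def __init__(self, in_dim: int, out_dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.attn = ChannelAttention(out_dim * len(kernel_sizes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim) -> Conv1d expects (batch, in_dim, seq_len)
        x = x.transpose(1, 2)
        feats = torch.cat([F.relu(branch(x)) for branch in self.branches], dim=1)
        return self.attn(feats).transpose(1, 2)   # (batch, seq_len, out_dim * num_branches)


def contrastive_loss(multi_scale: torch.Tensor, fused: torch.Tensor, tau: float = 0.07):
    """InfoNCE-style loss: each sample's multi-scale and fused vectors form a
    positive pair; the other samples in the batch act as negatives."""
    a = F.normalize(multi_scale, dim=-1)
    b = F.normalize(fused, dim=-1)
    logits = a @ b.t() / tau                       # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    text = torch.randn(8, 50, 768)                 # e.g. token-level text features (assumed shape)
    extractor = MultiScaleExtractor(in_dim=768, out_dim=128)
    ms = extractor(text)                           # (8, 50, 384)
    fused = torch.randn(8, 384)                    # stand-in for fused multimodal features
    print(ms.shape, contrastive_loss(ms.mean(dim=1), fused).item())
```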

Bibliographic Details
Main Authors: Junsong Fu, Youjia Fu, Huixia Xue, Zihao Xu
Format: Article
Language: English
Published: Springer, 2025-01-01
Series: Complex & Intelligent Systems
ISSN: 2199-4536, 2198-6053
Author affiliations: Junsong Fu, Youjia Fu, Huixia Xue (College of Computer Science and Engineering, Chongqing University of Technology); Zihao Xu (College of Artificial Intelligence, Chongqing University of Technology)
Subjects: Multimodal sentiment analysis; Multi-scale feature extraction; Multimodal data fusion; Transformer; Unsupervised contrastive learning
Online Access: https://doi.org/10.1007/s40747-024-01724-5