CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation

Advanced applications such as autonomous vehicles and intelligent surveillance systems require strong environmental awareness, and RGB-T semantic segmentation, which fuses visible-spectrum detail with thermal-imaging contour features, can effectively improve perception capabilities. However, current methods typically use basic fusion strategies such as channel concatenation or weighted summation, neglecting the feature distribution differences that arise from the distinct imaging mechanisms. We propose a cross-modal adaptive fusion network (CAFNet) for multimodal feature extraction and fusion that uses the ConvNeXt V2 architecture for multistage feature learning and integrates a channel-spatial attention module (CSAM) and an adaptive weighted fusion module (AWFM) in a dual-stream encoder. The CSAM enhances thermal radiation contour recognition under low-light conditions through cross-modal spatial enhancement and channel selection. The AWFM adaptively balances the weights of RGB detail and thermal contour features through dynamic gating, enabling complementary fusion of contextual features. Finally, the decoder combines multistage features to predict semantics. Experimental results show that CAFNet achieves 60.1% mIoU on the MFNet dataset, 1.2% higher than EAEFNet (58.9% mIoU), while its computational cost (110.61 GFLOPs) and parameter count (68.13 M) are reduced by 25% and 66.1%, respectively. The proposed method also achieves strong results on both daytime and nighttime images, with mIoU scores of 51.3% and 60.5%, respectively.

Bibliographic Details
Main Authors: Meili Fu, Huanliang Sun, Zhihan Chen, Lulin Wei
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11112589/
author Meili Fu
Huanliang Sun
Zhihan Chen
Lulin Wei
author_facet Meili Fu
Huanliang Sun
Zhihan Chen
Lulin Wei
author_sort Meili Fu
collection DOAJ
description Advanced applications such as autonomous vehicles and intelligent surveillance systems require strong environmental awareness, and RGB-T semantic segmentation, which fuses visible-spectrum detail with thermal-imaging contour features, can effectively improve perception capabilities. However, current methods typically use basic fusion strategies such as channel concatenation or weighted summation, neglecting the feature distribution differences that arise from the distinct imaging mechanisms. We propose a cross-modal adaptive fusion network (CAFNet) for multimodal feature extraction and fusion that uses the ConvNeXt V2 architecture for multistage feature learning and integrates a channel-spatial attention module (CSAM) and an adaptive weighted fusion module (AWFM) in a dual-stream encoder. The CSAM enhances thermal radiation contour recognition under low-light conditions through cross-modal spatial enhancement and channel selection. The AWFM adaptively balances the weights of RGB detail and thermal contour features through dynamic gating, enabling complementary fusion of contextual features. Finally, the decoder combines multistage features to predict semantics. Experimental results show that CAFNet achieves 60.1% mIoU on the MFNet dataset, 1.2% higher than EAEFNet (58.9% mIoU), while its computational cost (110.61 GFLOPs) and parameter count (68.13 M) are reduced by 25% and 66.1%, respectively. The proposed method also achieves strong results on both daytime and nighttime images, with mIoU scores of 51.3% and 60.5%, respectively.
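The gated weighting the abstract describes can be illustrated with a minimal sketch: a sigmoid gate computed from both modality features produces a per-element convex combination of the RGB and thermal streams. This is an illustrative simplification under assumed names (`gated_fusion`, `gate_weights`), not the paper's actual AWFM implementation, which the record does not specify.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, thermal, gate_weights, gate_bias=0.0):
    """Fuse two feature vectors with a learned scalar gate (illustrative).

    The gate g is computed from the concatenation of both inputs;
    g near 1 favours the RGB detail stream, g near 0 favours the
    thermal contour stream, so the mix adapts to the inputs.
    """
    concat = list(rgb) + list(thermal)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    # Convex combination: g weights RGB, (1 - g) weights thermal.
    return [g * r + (1.0 - g) * t for r, t in zip(rgb, thermal)]
```

With zero weights and bias the gate is exactly 0.5, so `gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0] * 4)` returns an even blend `[0.5, 0.5]`; a strongly positive gate input drives the output toward the RGB stream. In the real network such gates would act per pixel on encoder feature maps rather than on flat vectors.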
format Article
id doaj-art-d4f045babfcb4694b79048d7c84c99ad
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-d4f045babfcb4694b79048d7c84c99ad
Harvested: 2025-08-20T03:41:44Z
Language: eng
Publisher: IEEE
Series: IEEE Access (ISSN 2169-3536)
Published: 2025-01-01, vol. 13, pp. 137384-137395
DOI: 10.1109/ACCESS.2025.3595811
IEEE document: 11112589
Title: CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
Authors: Meili Fu (https://orcid.org/0009-0001-5610-8059), Huanliang Sun (https://orcid.org/0009-0009-7073-4716), Zhihan Chen (https://orcid.org/0009-0002-1725-5459), Lulin Wei (https://orcid.org/0009-0007-9074-9814), all with the School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang, China
Abstract: Advanced applications such as autonomous vehicles and intelligent surveillance systems require strong environmental awareness, and RGB-T semantic segmentation, which fuses visible-spectrum detail with thermal-imaging contour features, can effectively improve perception capabilities. However, current methods typically use basic fusion strategies such as channel concatenation or weighted summation, neglecting the feature distribution differences that arise from the distinct imaging mechanisms. We propose a cross-modal adaptive fusion network (CAFNet) for multimodal feature extraction and fusion that uses the ConvNeXt V2 architecture for multistage feature learning and integrates a channel-spatial attention module (CSAM) and an adaptive weighted fusion module (AWFM) in a dual-stream encoder. The CSAM enhances thermal radiation contour recognition under low-light conditions through cross-modal spatial enhancement and channel selection. The AWFM adaptively balances the weights of RGB detail and thermal contour features through dynamic gating, enabling complementary fusion of contextual features. Finally, the decoder combines multistage features to predict semantics. Experimental results show that CAFNet achieves 60.1% mIoU on the MFNet dataset, 1.2% higher than EAEFNet (58.9% mIoU), while its computational cost (110.61 GFLOPs) and parameter count (68.13 M) are reduced by 25% and 66.1%, respectively. The proposed method also achieves strong results on both daytime and nighttime images, with mIoU scores of 51.3% and 60.5%, respectively.
Online Access: https://ieeexplore.ieee.org/document/11112589/
Keywords: Adaptive weighted fusion; channel-spatial attention; dynamic gating mechanism; RGB-T semantic segmentation
spellingShingle Meili Fu
Huanliang Sun
Zhihan Chen
Lulin Wei
CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
IEEE Access
Adaptive weighted fusion
channel-spatial attention
dynamic gating mechanism
RGB-T semantic segmentation
title CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
title_full CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
title_fullStr CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
title_full_unstemmed CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
title_short CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
title_sort cafnet cross modal adaptive fusion network with attention and gated weighting for rgb t semantic segmentation
topic Adaptive weighted fusion
channel-spatial attention
dynamic gating mechanism
RGB-T semantic segmentation
url https://ieeexplore.ieee.org/document/11112589/
work_keys_str_mv AT meilifu cafnetcrossmodaladaptivefusionnetworkwithattentionandgatedweightingforrgbtsemanticsegmentation
AT huanliangsun cafnetcrossmodaladaptivefusionnetworkwithattentionandgatedweightingforrgbtsemanticsegmentation
AT zhihanchen cafnetcrossmodaladaptivefusionnetworkwithattentionandgatedweightingforrgbtsemanticsegmentation
AT lulinwei cafnetcrossmodaladaptivefusionnetworkwithattentionandgatedweightingforrgbtsemanticsegmentation