CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation


Bibliographic Details
Main Authors: Meili Fu, Huanliang Sun, Zhihan Chen, Lulin Wei
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11112589/
Description
Summary: Advanced applications such as autonomous vehicles and intelligent surveillance systems require strong environmental awareness, and RGB-T semantic segmentation, which fuses visible-light detail with thermal-imaging contour features, can effectively improve perception. However, current methods typically rely on basic fusion strategies such as channel concatenation or weighted summation, neglecting the differences in feature distributions that arise from the two distinct imaging mechanisms. We propose a cross-modal adaptive fusion network (CAFNet) for multimodal feature extraction and fusion that uses the ConvNeXt V2 architecture for multistage feature learning and integrates a channel-spatial attention module (CSAM) and an adaptive weighted fusion module (AWFM) in a dual-stream encoder. The CSAM enhances thermal radiation contour recognition under low-light conditions through cross-modal spatial enhancement and channel selection. The AWFM adaptively balances the weights of RGB detail and thermal contour features through dynamic gating, enabling complementary fusion of contextual features. Finally, the decoder combines multistage features to predict semantics. Experimental results show that CAFNet achieves 60.1% mIoU on the MFNet dataset, 1.2% higher than EAEFNet (58.9% mIoU), while the computational cost (110.61 GFLOPs) and parameter count (68.13M) are reduced by 25% and 66.1%, respectively. The proposed method achieves strong results on both daytime and nighttime images, with mIoU scores of 51.3% and 60.5%, respectively.
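The attention-and-gating fusion described in the summary can be illustrated with a minimal PyTorch sketch. The module names, layer sizes, and gating formulation below are assumptions for illustration only, loosely following the CSAM (channel-spatial attention) and AWFM (gated adaptive weighted fusion) ideas; they are not the authors' implementation.

```python
# Illustrative sketch of attention-guided, gated cross-modal fusion.
# All names and hyperparameters here are hypothetical, not taken from CAFNet.
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """Toy channel + spatial attention that re-weights one modality's features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, then excite channels.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                       # channel selection
        avg_map = x.mean(dim=1, keepdim=True)
        max_map, _ = x.max(dim=1, keepdim=True)
        spatial = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        return x * spatial                                # spatial enhancement


class GatedWeightedFusion(nn.Module):
    """Toy adaptive weighted fusion: a learned gate balances RGB vs. thermal features."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([rgb, thermal], dim=1))   # per-pixel, per-channel weight
        return g * rgb + (1.0 - g) * thermal              # complementary fusion


if __name__ == "__main__":
    rgb_feat = torch.randn(1, 64, 60, 80)       # hypothetical encoder-stage features
    thermal_feat = torch.randn(1, 64, 60, 80)
    attn_rgb = ChannelSpatialAttention(64)
    attn_thermal = ChannelSpatialAttention(64)
    fuse = GatedWeightedFusion(64)
    fused = fuse(attn_rgb(rgb_feat), attn_thermal(thermal_feat))
    print(fused.shape)                           # torch.Size([1, 64, 60, 80])
```

In this toy version the gate g is a per-pixel, per-channel sigmoid weight, so the fused features interpolate between the RGB and thermal streams rather than simply concatenating or summing them, which is the kind of fixed fusion the summary argues against.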
ISSN: 2169-3536