CAFNet: Cross-Modal Adaptive Fusion Network With Attention and Gated Weighting for RGB-T Semantic Segmentation
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11112589/ |
| Summary: | Advanced applications such as autonomous vehicles and intelligent surveillance systems require strong environmental awareness, and RGB-T semantic segmentation, which fuses visible-light detail with thermal-imaging contour features, can effectively improve perception. However, current methods typically rely on basic fusion strategies such as channel concatenation or weighted summation, neglecting the differences in feature distribution that arise from the two modalities' distinct imaging mechanisms. We propose a cross-modal adaptive fusion network (CAFNet) for multimodal feature extraction and fusion that uses the ConvNeXt V2 architecture for multistage feature learning and integrates a channel-spatial attention module (CSAM) and an adaptive weighted fusion module (AWFM) into a dual-stream encoder. The CSAM enhances thermal-radiation contour recognition under low-light conditions through cross-modal spatial enhancement and channel selection. The AWFM adaptively balances the weights of RGB detail and thermal contour features through dynamic gating, enabling complementary fusion of contextual features. Finally, the decoder combines the multistage features to predict the semantic map. Experimental results show that CAFNet achieves 60.1% mIoU on the MFNet dataset, 1.2 percentage points higher than EAEFNet (58.9% mIoU), while the computational cost (110.61 GFLOPs) and parameter count (68.13 M) are reduced by 25% and 66.1%, respectively. The method also performs well on both daytime and nighttime images, with mIoU scores of 51.3% and 60.5%, respectively. (A minimal sketch of the two fusion modules follows the record below.) |
| ISSN: | 2169-3536 |
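
The summary describes two fusion components: a channel-spatial attention module (CSAM) that performs channel selection and cross-modal spatial enhancement, and an adaptive weighted fusion module (AWFM) that balances RGB and thermal features through dynamic gating. The following PyTorch sketch illustrates these two ideas only; the module names `ChannelSpatialAttention` and `GatedFusion`, the layer sizes, and the exact gating formulation are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of CSAM-like attention and AWFM-like gated fusion.
# All module names and hyperparameters here are assumptions, not taken
# from the CAFNet paper.
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """Channel selection followed by spatial enhancement (CSAM-like, assumed)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single conv over pooled channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * self.spatial_gate(pooled)


class GatedFusion(nn.Module):
    """Learned gate balancing RGB and thermal features (AWFM-like, assumed)."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, thermal_feat):
        w = self.gate(torch.cat([rgb_feat, thermal_feat], dim=1))
        # Per-pixel, per-channel convex combination of the two modalities.
        return w * rgb_feat + (1.0 - w) * thermal_feat


if __name__ == "__main__":
    rgb = torch.randn(1, 96, 120, 160)      # one encoder-stage RGB feature map
    thermal = torch.randn(1, 96, 120, 160)  # matching thermal feature map
    attn = ChannelSpatialAttention(96)
    fuse = GatedFusion(96)
    fused = fuse(attn(rgb), attn(thermal))
    print(fused.shape)  # torch.Size([1, 96, 120, 160])
```

In this sketch the gate outputs a weight in [0, 1] for every channel and pixel, so the fused feature is a convex combination of the two modalities; the paper's AWFM may use a different gating structure and operate at each stage of the ConvNeXt V2 dual-stream encoder.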