CFANet: The Cross-Modal Fusion Attention Network for Indoor RGB-D Semantic Segmentation
Indoor image semantic segmentation technology is applied in fields such as smart homes and indoor security. Semantic segmentation techniques that use RGB images and depth maps as data sources face two main challenges: the semantic gap between RGB images and depth maps, and the loss of detailed information. To address these issues, a multi-head self-attention mechanism is adopted to adaptively align the features of the two modalities and to perform feature fusion in both the spatial and channel dimensions. Appropriate feature extraction methods are designed according to the different characteristics of RGB images and depth maps. For RGB images, asymmetric convolution is introduced to capture features in the horizontal and vertical directions, enhance short-range information dependence, and mitigate the gridding effect of dilated convolution, and criss-cross attention is introduced to obtain contextual information from global dependency relationships. For the depth map, a strategy of extracting significant unimodal features along the channel and spatial dimensions is used. A lightweight skip connection module is designed to fuse low-level and high-level features. In addition, since the first layer contains the richest detailed information and the last layer contains rich semantic information, a feature refinement head is designed to fuse the two. The method achieves an mIoU of 53.86% on the NYUDv2 dataset and 51.85% on the SUN-RGBD dataset, outperforming mainstream methods.
| Main Authors: | Long-Fei Wu, Dan Wei, Chang-An Xu |
|---|---|
| Author Affiliations: | School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China (Long-Fei Wu, Dan Wei); College of Materials and Energy, South China Agricultural University, Guangzhou 510642, China (Chang-An Xu) |
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Journal of Imaging, Volume 11, Issue 6, Article 177 |
| ISSN: | 2313-433X |
| DOI: | 10.3390/jimaging11060177 |
| Subjects: | cross-modal fusion; RGB-D; feature extraction; feature interaction |
| Online Access: | https://www.mdpi.com/2313-433X/11/6/177 |
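
The abstract describes the fusion design only at a high level. As a concrete illustration, the following is a minimal PyTorch-style sketch of a cross-modal fusion attention block of that general kind: multi-head attention aligns depth features to RGB features, and the fused result is then re-weighted along the channel and spatial dimensions. All names (`CrossModalFusionBlock`, `channel_gate`, `spatial_gate`) and design details here are assumptions for illustration, not the authors' released implementation or exact architecture.

```python
# Illustrative sketch only: a cross-modal fusion attention block of the kind the
# abstract describes. Module names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Align depth features to RGB features with multi-head attention, then
    re-weight the fused result along the channel and spatial dimensions."""

    def __init__(self, channels: int, num_heads: int = 4, reduction: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels,
                                          num_heads=num_heads,
                                          batch_first=True)
        # Channel attention: global pooling -> bottleneck MLP -> sigmoid gate.
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over channel-pooled maps -> sigmoid gate.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        rgb_tok = rgb.flatten(2).transpose(1, 2)      # (B, H*W, C)
        depth_tok = depth.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Cross-modal attention: RGB tokens query the depth tokens, adaptively
        # aligning depth information to the RGB feature layout.
        aligned, _ = self.attn(query=rgb_tok, key=depth_tok, value=depth_tok)
        fused = rgb_tok + aligned                     # (B, H*W, C)

        # Channel-wise re-weighting of the fused tokens.
        ch = self.channel_gate(fused.mean(dim=1))     # (B, C)
        fused = fused * ch.unsqueeze(1)

        # Back to a feature map, then spatial re-weighting.
        fmap = fused.transpose(1, 2).reshape(b, c, h, w)
        pooled = torch.cat([fmap.mean(dim=1, keepdim=True),
                            fmap.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return fmap * self.spatial_gate(pooled)


if __name__ == "__main__":
    # Usage example with dummy same-resolution RGB and depth feature maps.
    block = CrossModalFusionBlock(channels=64)
    rgb_feat = torch.randn(2, 64, 30, 40)
    depth_feat = torch.randn(2, 64, 30, 40)
    print(block(rgb_feat, depth_feat).shape)  # torch.Size([2, 64, 30, 40])
```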