CFANet: The Cross-Modal Fusion Attention Network for Indoor RGB-D Semantic Segmentation

Indoor image semantic segmentation is applied in fields such as smart homes and indoor security. Semantic segmentation techniques that use RGB images and depth maps as data sources face two main challenges: the semantic gap between the two modalities and the loss of detailed information. To address these issues, a multi-head self-attention mechanism is adopted to adaptively align the features of the two modalities and fuse them in both the spatial and channel dimensions. Feature extraction methods are designed to suit the different characteristics of RGB images and depth maps. For RGB images, asymmetric convolution is introduced to capture features in the horizontal and vertical directions, strengthen short-range dependencies, and mitigate the gridding effect of dilated convolution, while criss-cross attention provides contextual information from global dependency relationships. For depth maps, salient unimodal features are extracted along the channel and spatial dimensions. A lightweight skip-connection module fuses low-level and high-level features. In addition, because the first layer contains the richest detailed information and the last layer contains rich semantic information, a feature refinement head is designed to fuse the two. The method achieves mIoU scores of 53.86% on NYUDv2 and 51.85% on SUN RGB-D, outperforming mainstream methods.
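As an illustration of the core fusion idea only (not the authors' implementation, which this record does not include), the sketch below shows how RGB and depth feature maps can be flattened into tokens, jointly processed with multi-head self-attention so each modality attends to the other, and folded back into a fused feature map. It is written in PyTorch; the module name CrossModalFusionSketch and the channels/num_heads parameters are placeholders chosen for the example.

```python
import torch
import torch.nn as nn


class CrossModalFusionSketch(nn.Module):
    """Illustrative RGB-D fusion via multi-head self-attention (not CFANet's exact module)."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, C, H, W) feature maps from the RGB and depth branches
        b, c, h, w = rgb.shape
        r = rgb.flatten(2).transpose(1, 2)    # (B, H*W, C) RGB tokens
        d = depth.flatten(2).transpose(1, 2)  # (B, H*W, C) depth tokens
        x = self.norm(torch.cat([r, d], dim=1))   # stack tokens of both modalities
        x, _ = self.attn(x, x, x)             # every token attends across both modalities
        r2, d2 = x[:, : h * w], x[:, h * w:]  # split aligned RGB / depth tokens
        fused = torch.cat([r2, d2], dim=2)    # (B, H*W, 2C)
        fused = fused.transpose(1, 2).reshape(b, 2 * c, h, w)
        return self.proj(fused)               # (B, C, H, W) fused feature map


if __name__ == "__main__":
    block = CrossModalFusionSketch(channels=64, num_heads=4)
    out = block(torch.randn(2, 64, 30, 40), torch.randn(2, 64, 30, 40))
    print(out.shape)  # torch.Size([2, 64, 30, 40])
```

Whether CFANet applies this at a single stage or, as the abstract suggests, in separate spatial- and channel-wise fusion steps is not specified in this record; the sketch only demonstrates attention-based alignment of two token sets from registered feature maps.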

Bibliographic Details
Main Authors: Long-Fei Wu, Dan Wei (School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China); Chang-An Xu (College of Materials and Energy, South China Agricultural University, Guangzhou 510642, China)
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: Journal of Imaging
ISSN: 2313-433X
DOI: 10.3390/jimaging11060177
Subjects: cross-modal fusion; RGB-D; feature extraction; feature interaction
Online Access: https://www.mdpi.com/2313-433X/11/6/177