A Cross-Modal Attention-Driven Multi-Sensor Fusion Method for Semantic Segmentation of Point Clouds


Bibliographic Details
Main Authors: Huisheng Shi, Xin Wang, Jianghong Zhao, Xinnan Hua
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Sensors
Online Access: https://www.mdpi.com/1424-8220/25/8/2474
Description
Summary: To bridge the modality gap between camera images and LiDAR point clouds in autonomous driving systems, a critical challenge exacerbated by current fusion methods' inability to integrate cross-modal features effectively, we propose the Cross-Modal Fusion (CMF) framework, an attention-driven architecture that enables hierarchical multi-sensor data fusion and achieves state-of-the-art performance in semantic segmentation tasks. The CMF framework first projects the point cloud into the camera coordinate frame via perspective projection, providing spatio-depth information for the RGB images. A two-stream feature extraction network then extracts features from the two modalities separately, and a residual cross-modal fusion (RCF) module with cross-modal attention fuses the two modalities at multiple levels. Finally, we design a perceptual alignment loss that integrates cross-entropy with feature-matching terms, effectively minimizing the semantic discrepancy between the camera and LiDAR representations during fusion. Experimental results on the SemanticKITTI and nuScenes benchmark datasets demonstrate that CMF achieves mean intersection over union (mIoU) scores of 64.2% and 79.3%, respectively, outperforming existing state-of-the-art methods in accuracy and exhibiting greater robustness in complex scenarios. Ablation studies further validate that strengthening feature interaction and fusion through cross-modal attention and the perceptually guided cross-entropy loss (Pgce) improves segmentation accuracy and robustness.
ISSN:1424-8220
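
The summary above outlines three technical stages. The first projects the LiDAR point cloud into the camera image plane to build spatio-depth channels for the RGB input. Below is a minimal sketch of that projection step, assuming a 4x4 LiDAR-to-camera extrinsic matrix and a 3x3 camera intrinsic matrix (KITTI-style calibration); the function name and the nearest-depth rasterization are illustrative choices, not details taken from the paper.

    import numpy as np

    def project_lidar_to_depth(points, T_cam_lidar, K, img_h, img_w):
        # points: (N, 3) LiDAR coordinates. Returns a sparse (img_h, img_w)
        # depth map, zero where no point projects.
        pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
        pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]              # into the camera frame
        pts_cam = pts_cam[pts_cam[:, 2] > 0.1]                  # keep points in front
        uv = (K @ pts_cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]                             # perspective divide
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        z = pts_cam[:, 2]
        ok = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)    # inside the image
        depth = np.full((img_h, img_w), np.inf, dtype=np.float32)
        np.minimum.at(depth, (v[ok], u[ok]), z[ok])             # nearest point wins
        depth[np.isinf(depth)] = 0.0                            # empty pixels -> 0
        return depth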
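
The second stage is a two-stream encoder whose levels are fused by the RCF module. The PyTorch module below is a hypothetical sketch of such a block, in which each stream is gated by squeeze-and-excitation-style channel attention computed from the other modality and added back through a residual connection; the class name, the attention form, and the 1x1 fusion convolution are assumptions, since the abstract does not specify the internals of the RCF module.

    import torch
    import torch.nn as nn

    class ResidualCrossModalFusion(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Channel attention derived from the opposite modality
            self.attn_from_lidar = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            self.attn_from_cam = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, f_cam, f_lidar):
            # f_cam, f_lidar: (B, C, H, W) feature maps from the two streams.
            # Each stream is re-weighted by attention from the other modality,
            # and the residual path preserves modality-specific detail.
            cam_out = f_cam + f_cam * self.attn_from_lidar(f_lidar)
            lidar_out = f_lidar + f_lidar * self.attn_from_cam(f_cam)
            return self.fuse(torch.cat([cam_out, lidar_out], dim=1))

Gating each stream with attention from the other modality lets complementary evidence flow in both directions, while the residual additions keep the block easy to train when it is applied at multiple encoder levels.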
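
The third stage is the perceptual alignment loss, described as cross-entropy combined with feature-matching terms. A minimal sketch follows, assuming an L2 feature-matching penalty, a scalar weight lam, and an ignore label of 255; all three are our assumptions, as the abstract does not give the exact formulation.

    import torch.nn.functional as F

    def perceptual_alignment_loss(logits, labels, feat_cam, feat_lidar, lam=0.1):
        # Per-pixel cross-entropy on the segmentation prediction
        ce = F.cross_entropy(logits, labels, ignore_index=255)
        # Feature matching: penalize the camera/LiDAR representation gap
        match = F.mse_loss(feat_cam, feat_lidar)
        return ce + lam * match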