CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10960445/ |
| Summary: | Accurate 3D object detection is vital in autonomous driving. Single-modal detectors, using either camera or LiDAR, struggle with limited depth perception or with distinguishing semantically similar objects. Multimodal approaches aim to address these limitations by combining LiDAR and camera data, but they often struggle to integrate sparse, unevenly distributed point clouds, resulting in inefficient feature fusion. To tackle these challenges, we propose CLS-3D (Content-wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection), a novel framework that fuses LiDAR and camera features in a single multi-modal backbone and augments them with semantic probabilities obtained from the image stream. The semantically augmented features are embedded by a content-based transformer and processed through a slot-wise auto-encoder structure with channel-wise positional embeddings and a feed-forward MLP network. A slot reweighting mechanism captures local and global spatial relationships and dynamically adjusts feature relevance, improving temporal consistency and detection accuracy; it allows the model to focus on significant temporal information, enhancing its ability to learn complex sequences, especially in challenging scenarios such as occlusion and long-range detection. We further define an I3C-IoU loss that considers centre, overlap, and scale for more accurate box regression. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that CLS-3D achieves state-of-the-art performance, with 89.52% 3D mAP and 94.08% BEV mAP, outperforming existing methods. |
| ISSN: | 2169-3536 |
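The abstract describes the I3C-IoU loss only at a high level: a box-regression objective combining centre distance, overlap, and scale. The paper's exact formulation is not reproduced in this record, but a loss with those three terms can be sketched in the style of CIoU, using axis-aligned 2D boxes for simplicity (the function name and box layout below are illustrative assumptions, not the authors' code):

```python
import math

def i3c_like_iou_loss(pred, gt, eps=1e-9):
    """Illustrative sketch of an IoU loss with overlap, centre, and
    scale terms, as the abstract attributes to I3C-IoU. This follows
    the CIoU construction; the paper's actual definition may differ.
    Boxes are axis-aligned 2D tuples: (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Overlap term: standard intersection-over-union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # Centre term: squared distance between box centres, normalised by
    # the diagonal of the smallest enclosing box.
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    gcx, gcy = (gx1 + gx2) / 2, (gy1 + gy2) / 2
    diag2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
          + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps
    centre = ((pcx - gcx) ** 2 + (pcy - gcy) ** 2) / diag2

    # Scale term: aspect-ratio consistency penalty, as in CIoU.
    v = (4 / math.pi ** 2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1))
        - math.atan((px2 - px1) / (py2 - py1))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + centre + alpha * v
```

A perfectly regressed box yields a loss near zero, while disjoint boxes are penalised above 1 through the centre term, so gradients remain informative even without overlap.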