CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
Accurate 3D object detection is vital in autonomous driving. Single-modal detectors, using either camera or LiDAR, struggle with issues such as limited depth perception or difficulty in distinguishing semantically similar objects. While multimodal approaches aim to address these limitations by combining LiDAR and camera data, they often face complexities in integrating sparse and uneven point cloud distributions, resulting in inefficient feature fusion. To tackle these challenges, we propose CLS-3D (Content-wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection). This novel framework fuses LiDAR and camera features using a single multi-modal backbone and augments them with semantic probabilities obtained from the image stream. Our method captures local and global spatial relationships through a slot reweighting mechanism and incorporates an I3C-IoU loss for precise box regression. The semantically augmented features are embedded using a content-based transformer and processed through a slot-wise auto-encoder structure with channel-wise positional embeddings and a feed-forward MLP network. Our model improves temporal consistency and detection accuracy by dynamically adjusting feature relevance through slot-wise reweighting. We further define an I3C-IoU metric that considers centre, overlap, and scale for enhanced box regression accuracy. This mechanism allows the model to focus on significant temporal information, enhancing its ability to learn complex sequences and improving the overall performance of 3D object detection, especially in challenging scenarios such as occlusion and long-range detection. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that CLS-3D achieves state-of-the-art performance, with 89.52% 3D mAP and 94.08% BEV mAP, outperforming existing methods.
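The record stops at the abstract, so the two sketches below are only illustrative readings of it, not the authors' implementation. The first is a minimal PyTorch-style slot reweighting module, assuming slot features of shape (batch, num_slots, dim) and a learned scalar gate per slot; the class name `SlotReweighting`, the hidden width, and the residual form are hypothetical.

```python
# Hypothetical sketch of slot-wise reweighting; not the paper's exact layer.
import torch
import torch.nn as nn


class SlotReweighting(nn.Module):
    """Reweights feature slots with content-dependent scalar gates."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Small MLP mapping each slot's content to a scalar relevance score.
        self.score = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, slots: torch.Tensor) -> torch.Tensor:
        # slots: (batch, num_slots, dim)
        scores = self.score(self.norm(slots))    # (batch, num_slots, 1)
        weights = torch.softmax(scores, dim=1)   # normalise across slots
        # Residual keeps the original signal; gates emphasise relevant slots.
        return slots + slots * weights


if __name__ == "__main__":
    x = torch.randn(2, 16, 128)                  # 2 samples, 16 slots, 128-dim
    print(SlotReweighting(128)(x).shape)         # torch.Size([2, 16, 128])
```

The exact I3C-IoU formula is likewise not given in this record; the sketch below assumes a DIoU/CIoU-style composite on axis-aligned 2D boxes that combines the overlap, centre, and scale terms the abstract names. The function name and the 0.1 weight on the scale term are placeholders.

```python
def composite_iou_loss(pred: torch.Tensor, target: torch.Tensor,
                       eps: float = 1e-7) -> torch.Tensor:
    """Illustrative overlap + centre + scale box loss (not the paper's I3C-IoU)."""
    # Overlap term: standard IoU on (x1, y1, x2, y2) boxes.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Centre term: squared centre distance over the enclosing-box diagonal.
    c_p = (pred[:, :2] + pred[:, 2:]) / 2
    c_t = (target[:, :2] + target[:, 2:]) / 2
    enc = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    centre = (c_p - c_t).pow(2).sum(dim=1) / (enc.pow(2).sum(dim=1) + eps)

    # Scale term: absolute log-ratio of predicted vs. target width and height.
    wh_p = pred[:, 2:] - pred[:, :2]
    wh_t = target[:, 2:] - target[:, :2]
    scale = torch.log((wh_p + eps) / (wh_t + eps)).abs().sum(dim=1)

    # The 0.1 weight is a placeholder, not taken from the paper.
    return (1.0 - iou + centre + 0.1 * scale).mean()
```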
Saved in:
| Main Authors: | Husnain Mushtaq, Sohaib Latif, Muhammad Saad Bin Ilyas, Syed Muhammad Mohsin, Mohammed Ali |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Deep learning; 3D object detection; LiDAR; ViT; channel attention; autonomous vehicles |
| Online Access: | https://ieeexplore.ieee.org/document/10960445/ |
| _version_ | 1850170853055004672 |
|---|---|
| author | Husnain Mushtaq; Sohaib Latif; Muhammad Saad Bin Ilyas; Syed Muhammad Mohsin; Mohammed Ali |
| author_sort | Husnain Mushtaq |
| collection | DOAJ |
| description | Accurate 3D object detection is vital in autonomous driving. Single-modal detectors, using either camera or LiDAR, struggle with issues such as limited depth perception or difficulty in distinguishing semantically similar objects. While multimodal approaches aim to address these limitations by combining LiDAR and camera data, they often face complexities in integrating sparse and uneven point cloud distributions, resulting in inefficient feature fusion. To tackle these challenges, we propose CLS-3D (Content-wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection). This novel framework fuses LiDAR and camera features using a single multi-modal backbone and augments them with semantic probabilities obtained from the image stream. Our method captures local and global spatial relationships through a slot reweighting mechanism and incorporates an I3C-IoU loss for precise box regression. The semantically augmented features are embedded using a content-based transformer and processed through a slot-wise auto-encoder structure with channel-wise positional embeddings and a feed-forward MLP network. Our model improves temporal consistency and detection accuracy by dynamically adjusting feature relevance through slot-wise reweighting. We further define an I3C-IoU metric that considers centre, overlap, and scale for enhanced box regression accuracy. This mechanism allows the model to focus on significant temporal information, enhancing its ability to learn complex sequences and improving the overall performance of 3D object detection, especially in challenging scenarios such as occlusion and long-range detection. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that CLS-3D achieves state-of-the-art performance, with 89.52% 3D mAP and 94.08% BEV mAP, outperforming existing methods. |
| format | Article |
| id | doaj-art-e0d788c1a1b54ab189420df87eb4a9d7 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-e0d788c1a1b54ab189420df87eb4a9d7; 2025-08-20T02:20:23Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; vol. 13, pp. 69840-69856; DOI 10.1109/ACCESS.2025.3558780; IEEE document 10960445; CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles; Husnain Mushtaq (https://orcid.org/0009-0002-3532-5510), School of Computer Science and Engineering, Central South University, Changsha, China; Sohaib Latif, Department of Computer Science and Software Engineering, Grand Asian University, Sialkot, Pakistan; Muhammad Saad Bin Ilyas (https://orcid.org/0000-0002-9537-3461), Department of Computer Science, The University of Chenab, Gujranwala, Pakistan; Syed Muhammad Mohsin (https://orcid.org/0000-0003-0886-9061), Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan; Mohammed Ali (https://orcid.org/0000-0002-5908-4013), Department of Computer Science, King Khalid University, Abha, Saudi Arabia; https://ieeexplore.ieee.org/document/10960445/; Deep learning; 3D object detection; LiDAR; ViT; channel attention; autonomous vehicles |
| title | CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles |
| topic | Deep learning; 3D object detection; LiDAR; ViT; channel attention; autonomous vehicles |
| url | https://ieeexplore.ieee.org/document/10960445/ |