CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles

Bibliographic Details
Main Authors: Husnain Mushtaq, Sohaib Latif, Muhammad Saad Bin Ilyas, Syed Muhammad Mohsin, Mohammed Ali
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10960445/
_version_ 1850170853055004672
author Husnain Mushtaq
Sohaib Latif
Muhammad Saad Bin Ilyas
Syed Muhammad Mohsin
Mohammed Ali
author_facet Husnain Mushtaq
Sohaib Latif
Muhammad Saad Bin Ilyas
Syed Muhammad Mohsin
Mohammed Ali
author_sort Husnain Mushtaq
collection DOAJ
description Accurate 3D object detection is vital in autonomous driving. Single-modal detectors, using either camera or LiDAR alone, struggle with limited depth perception or with distinguishing semantically similar objects. While multimodal approaches aim to address these limitations by combining LiDAR and camera data, they often struggle to integrate sparse and uneven point cloud distributions, resulting in inefficient feature fusion. To tackle these challenges, we propose CLS-3D (Content-wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection). This novel framework fuses LiDAR and camera features with a single multi-modal backbone and augments them with semantic probabilities obtained from the image stream. Our method captures local and global spatial relationships through a slot reweighting mechanism and incorporates an I3C-IoU loss for precise box regression. The semantically augmented features from the multi-modal backbone are embedded by a content-based transformer and processed through a slot-wise auto-encoder with channel-wise positional embeddings and a feed-forward MLP network. Our model improves temporal consistency and detection accuracy by dynamically adjusting feature relevance through slot-wise reweighting. We further define the I3C-IoU metric, which jointly considers centre distance, overlap, and scale for more accurate box regression. This mechanism allows the model to focus on significant temporal information, improving its ability to learn complex sequences and boosting overall 3D detection performance, especially in challenging scenarios such as occlusion and long-range detection. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that CLS-3D achieves state-of-the-art performance, with 89.52% 3D mAP and 94.08% BEV mAP, outperforming existing methods.
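The abstract names two concrete components without giving their formulations: a slot-wise reweighting step that rescales fused LiDAR-camera features by learned relevance, and an I3C-IoU box-regression term built from centre, overlap, and scale. The two PyTorch sketches below are illustrative reconstructions from that description only, not the authors' code; every layer size, box parameterisation, and weighting factor in them is an assumption.

import torch
import torch.nn as nn

class SlotReweighting(nn.Module):
    """Hypothetical slot-wise reweighting: score each slot's content, then rescale it."""
    def __init__(self, slot_dim: int):
        super().__init__()
        # Small MLP mapping each slot's features to a scalar relevance score (assumed design).
        self.score_mlp = nn.Sequential(
            nn.Linear(slot_dim, slot_dim // 2),
            nn.ReLU(),
            nn.Linear(slot_dim // 2, 1),
        )

    def forward(self, slots: torch.Tensor) -> torch.Tensor:
        # slots: (batch, num_slots, slot_dim) fused LiDAR-camera slot embeddings.
        scores = self.score_mlp(slots)          # (batch, num_slots, 1)
        weights = torch.softmax(scores, dim=1)  # normalise relevance across slots
        return slots * weights                  # emphasise the informative slots

# Example with made-up sizes: 2 samples, 16 slots, 128-dim features.
reweighted = SlotReweighting(slot_dim=128)(torch.randn(2, 16, 128))

Likewise, a minimal axis-aligned stand-in for an I3C-IoU-style loss, combining an overlap (IoU) term, a centre-distance term normalised by the enclosing box, and a log-scale mismatch term; the 0.1 scale weight and the no-yaw simplification are assumptions, not the published definition.

def i3c_iou_like_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (N, 6) axis-aligned 3D boxes as (cx, cy, cz, w, l, h); yaw is ignored here.
    pc, ps = pred[:, :3], pred[:, 3:].clamp(min=1e-6)
    tc, ts = target[:, :3], target[:, 3:].clamp(min=1e-6)

    # Overlap term: plain axis-aligned 3D IoU.
    lo = torch.maximum(pc - ps / 2, tc - ts / 2)
    hi = torch.minimum(pc + ps / 2, tc + ts / 2)
    inter = (hi - lo).clamp(min=0).prod(dim=1)
    union = ps.prod(dim=1) + ts.prod(dim=1) - inter
    iou = inter / union.clamp(min=1e-6)

    # Centre term: squared centre distance normalised by the enclosing box diagonal.
    enc = torch.maximum(pc + ps / 2, tc + ts / 2) - torch.minimum(pc - ps / 2, tc - ts / 2)
    centre = ((pc - tc) ** 2).sum(dim=1) / (enc ** 2).sum(dim=1).clamp(min=1e-6)

    # Scale term: log-ratio mismatch between predicted and target dimensions.
    scale = (torch.log(ps / ts) ** 2).sum(dim=1)

    return (1.0 - iou + centre + 0.1 * scale).mean()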
format Article
id doaj-art-e0d788c1a1b54ab189420df87eb4a9d7
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-e0d788c1a1b54ab189420df87eb4a9d7 (2025-08-20T02:20:23Z)
CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
IEEE Access, vol. 13, pp. 69840-69856, 2025-01-01. Publisher: IEEE. ISSN 2169-3536. Language: English.
DOI: 10.1109/ACCESS.2025.3558780 (IEEE document 10960445)
Husnain Mushtaq (https://orcid.org/0009-0002-3532-5510), School of Computer Science and Engineering, Central South University, Changsha, China
Sohaib Latif, Department of Computer Science and Software Engineering, Grand Asian University, Sialkot, Pakistan
Muhammad Saad Bin Ilyas (https://orcid.org/0000-0002-9537-3461), Department of Computer Science, The University of Chenab, Gujranwala, Pakistan
Syed Muhammad Mohsin (https://orcid.org/0000-0003-0886-9061), Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
Mohammed Ali (https://orcid.org/0000-0002-5908-4013), Department of Computer Science, King Khalid University, Abha, Saudi Arabia
Online access: https://ieeexplore.ieee.org/document/10960445/
Keywords: Deep learning; 3D object detection; LiDAR; ViT; channel attention; autonomous vehicles
spellingShingle Husnain Mushtaq
Sohaib Latif
Muhammad Saad Bin Ilyas
Syed Muhammad Mohsin
Mohammed Ali
CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
IEEE Access
Deep learning
3D object detection
LiDAR
ViT
channel attention
autonomous vehicles
title CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
title_full CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
title_fullStr CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
title_full_unstemmed CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
title_short CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
title_sort cls 3d content wise lidar camera fusion and slot reweighting transformer for 3d object detection in autonomous vehicles
topic Deep learning
3D object detection
LiDAR
ViT
channel attention
autonomous vehicles
url https://ieeexplore.ieee.org/document/10960445/
work_keys_str_mv AT husnainmushtaq cls3dcontentwiselidarcamerafusionandslotreweightingtransformerfor3dobjectdetectioninautonomousvehicles
AT sohaiblatif cls3dcontentwiselidarcamerafusionandslotreweightingtransformerfor3dobjectdetectioninautonomousvehicles
AT muhammadsaadbinilyas cls3dcontentwiselidarcamerafusionandslotreweightingtransformerfor3dobjectdetectioninautonomousvehicles
AT syedmuhammadmohsin cls3dcontentwiselidarcamerafusionandslotreweightingtransformerfor3dobjectdetectioninautonomousvehicles
AT mohammedali cls3dcontentwiselidarcamerafusionandslotreweightingtransformerfor3dobjectdetectioninautonomousvehicles