BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection

False negatives (FN) in 3D object detection, which occur when small, distant, or hidden objects are missed, pose significant safety risks in autonomous driving systems. Recent multi-modal fusion methods have been proposed to enhance 3D object detection by combining the geometric accuracy of LiDAR point clouds with the rich semantic features of camera images. However, few methods explicitly address false negatives, and many fail to effectively align multimodal features and model their interaction during the fusion process. To address these challenges, we propose BEVFusion with Dual Hard Instance Probing (BEVFusion-DHIP), a novel 3D object detection framework designed to systematically reduce false negatives. BEVFusion-DHIP incorporates Hard Instance Probing (HIP) into both LiDAR BEV features and 3D position-aware image features, progressively refining the detection of challenging objects across multiple stages. Furthermore, we introduce a Deformable Attention Fusion Network (DAFusionNet) to dynamically align and fuse LiDAR and camera BEV features during the fusion process, effectively mitigating spatial misalignment and enhancing inter-modal feature interaction. Experimental results on the nuScenes dataset show that the proposed BEVFusion-DHIP outperforms state-of-the-art LiDAR-based and camera+LiDAR-based 3D object detection models. For example, BEVFusion-DHIP achieves improvements of 3.0 mAP and 3.2 NDS, respectively, compared to the baseline model BEVFusion.


Bibliographic Details
Main Authors: Taeho Kim, Joohee Kim
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10872908/
_version_ 1823857161883615232
author Taeho Kim
Joohee Kim
author_facet Taeho Kim
Joohee Kim
author_sort Taeho Kim
collection DOAJ
description False negatives (FN) in 3D object detection, which occur when small, distant, or hidden objects are missed, pose significant safety risks in autonomous driving systems. Recent multi-modal fusion methods have been proposed to enhance 3D object detection by combining the geometric accuracy of LiDAR point clouds with the rich semantic features of camera images. However, few methods explicitly address false negatives, and many fail to effectively align multimodal features and model their interaction during the fusion process. To address these challenges, we propose BEVFusion with Dual Hard Instance Probing (BEVFusion-DHIP), a novel 3D object detection framework designed to systematically reduce false negatives. BEVFusion-DHIP incorporates Hard Instance Probing (HIP) into both LiDAR BEV features and 3D position-aware image features, progressively refining the detection of challenging objects across multiple stages. Furthermore, we introduce a Deformable Attention Fusion Network (DAFusionNet) to dynamically align and fuse LiDAR and camera BEV features during the fusion process, effectively mitigating spatial misalignment and enhancing inter-modal feature interaction. Experimental results on the nuScenes dataset show that the proposed BEVFusion-DHIP outperforms state-of-the-art LiDAR-based and camera+LiDAR-based 3D object detection models. For example, BEVFusion-DHIP achieves improvements of 3.0 mAP and 3.2 NDS, respectively, compared to the baseline model BEVFusion.
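The record describes DAFusionNet only at a high level: deformable attention is used to align and fuse LiDAR and camera BEV features. As a rough illustration of that mechanism (not the authors' implementation; the class name, number of sampling points, and layer choices below are assumptions), a single-head deformable-attention fusion step can be sketched in PyTorch, where LiDAR BEV queries predict per-location sampling offsets and attention weights over the camera BEV map:

```python
# Hedged sketch of deformable-attention BEV fusion, assuming the general
# Deformable DETR-style mechanism: each LiDAR BEV location predicts K offset
# points and weights, samples the camera BEV map there, and fuses the result.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableBEVFusion(nn.Module):
    def __init__(self, channels: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        # 2 offset coords (x, y) and 1 attention weight per sampling point,
        # all predicted from the LiDAR BEV query features.
        self.offset_pred = nn.Conv2d(channels, 2 * n_points, 3, padding=1)
        self.weight_pred = nn.Conv2d(channels, n_points, 3, padding=1)
        self.out_proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, lidar_bev: torch.Tensor, camera_bev: torch.Tensor):
        B, C, H, W = lidar_bev.shape
        offsets = self.offset_pred(lidar_bev)             # (B, 2K, H, W)
        weights = self.weight_pred(lidar_bev).softmax(1)  # (B, K, H, W)

        # Base sampling grid in the normalized [-1, 1] coords grid_sample uses.
        ys = torch.linspace(-1, 1, H, device=lidar_bev.device)
        xs = torch.linspace(-1, 1, W, device=lidar_bev.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1)              # (H, W, 2), x first

        sampled = torch.zeros_like(lidar_bev)
        for k in range(self.n_points):
            # Learned shift of the sampling location, per BEV cell.
            off = offsets[:, 2 * k:2 * k + 2].permute(0, 2, 3, 1)  # (B, H, W, 2)
            grid = base.unsqueeze(0) + off
            feat = F.grid_sample(camera_bev, grid, align_corners=True)
            sampled = sampled + weights[:, k:k + 1] * feat

        # Concatenate aligned camera features with LiDAR features and project.
        return self.out_proj(torch.cat((lidar_bev, sampled), dim=1))

fusion = DeformableBEVFusion(channels=64)
fused = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

Because the offsets are learned rather than fixed, each BEV cell can fetch camera features from wherever they actually landed after view transformation, which is how deformable attention mitigates the spatial misalignment the abstract mentions.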
format Article
id doaj-art-cf79d65217a84873961c9aa6a258df28
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-cf79d65217a84873961c9aa6a258df28
2025-02-12T00:02:34Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pp. 25546-25556
10.1109/ACCESS.2025.3538866
10872908
BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection
Taeho Kim (https://orcid.org/0009-0000-8880-1299)
Joohee Kim (https://orcid.org/0000-0001-8833-0319)
Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
https://ieeexplore.ieee.org/document/10872908/
3D object detection
multi-modal
transformer
deformable attention
deep learning
spellingShingle Taeho Kim
Joohee Kim
BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection
IEEE Access
3D object detection
multi-modal
transformer
deformable attention
deep learning
title BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection
title_full BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection
title_fullStr BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection
title_full_unstemmed BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection
title_short BEVFusion With Dual Hard Instance Probing for Multimodal 3D Object Detection
title_sort bevfusion with dual hard instance probing for multimodal 3d object detection
topic 3D object detection
multi-modal
transformer
deformable attention
deep learning
url https://ieeexplore.ieee.org/document/10872908/
work_keys_str_mv AT taehokim bevfusionwithdualhardinstanceprobingformultimodal3dobjectdetection
AT jooheekim bevfusionwithdualhardinstanceprobingformultimodal3dobjectdetection