TF-CMFA: Robust Multimodal 3D Object Detection for Dynamic Environments Using Temporal Fusion and Cross-Modal Alignment

In recent years, multimodal 3D object detection methods have garnered significant attention in autonomous driving systems due to their impressive detection performance. However, most existing research seldom addresses the issues of robustness and performance degradation in dynamic environments due t...

Bibliographic Details
Main Authors: Yujing Wang, Abdul Hadi Abd Rahman, Fadilla 'Atyka Nor Rashid
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: 3D object detection; feature alignment; multimodal; robustness
Online Access: https://ieeexplore.ieee.org/document/10975058/
author Yujing Wang
Abdul Hadi Abd Rahman
Fadilla 'Atyka Nor Rashid
collection DOAJ
description In recent years, multimodal 3D object detection methods have garnered significant attention in autonomous driving systems due to their impressive detection performance. However, existing research seldom addresses the robustness problems and performance degradation that arise in dynamic environments, which are caused by the difficulty of aligning modal features. In this paper, we introduce an efficient fusion method that integrates time-series features to improve the accuracy of 3D object detection through multi-sensor fusion, making it better suited to dynamic, realistic scenarios such as automated driving, and we verify its robustness. The proposed framework incorporates a Temporal Local Self-Fusion Module (TLSFM) in the LiDAR stream to enrich the representation of LiDAR BEV features. To better align the BEV features of the image stream and the point cloud stream, a Cross-Modal Fusion Alignment (CMFA) module is introduced. The Temporal Fusion-CMFA (TF-CMFA) framework, which contains the TLSFM and CMFA modules, demonstrates state-of-the-art performance, achieving a mean average precision (mAP) of 74.4% and a NuScenes detection score (NDS) of 75.7% on the NuScenes benchmark dataset. On the Waymo dataset, it improves the ALL-L1 and ALL-L2 metrics by +2.1 and +2.3, respectively, compared to VoxelMamba. Finally, robustness experiments under sensor failure conditions show that the proposed approach remains exceptionally robust.
format Article
id doaj-art-443de30e20b448b691dd0d67bcd82db6
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-443de30e20b448b691dd0d67bcd82db6
2025-08-20T02:14:38Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pages 74821-74832
DOI: 10.1109/ACCESS.2025.3563483
IEEE Xplore document: 10975058
TF-CMFA: Robust Multimodal 3D Object Detection for Dynamic Environments Using Temporal Fusion and Cross-Modal Alignment
Yujing Wang (https://orcid.org/0009-0002-6414-8197), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
Abdul Hadi Abd Rahman (https://orcid.org/0000-0002-0261-073X), Center for Artificial Intelligence Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
Fadilla 'Atyka Nor Rashid, Center for Artificial Intelligence Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
title TF-CMFA: Robust Multimodal 3D Object Detection for Dynamic Environments Using Temporal Fusion and Cross-Modal Alignment
topic 3D object detection
feature alignment
multimodal
robustness
url https://ieeexplore.ieee.org/document/10975058/