DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception

Abstract Accurate and robust perception in the Bird’s Eye View (BEV) is essential for effective environmental understanding in autonomous driving systems. This study introduces DMFormer, an innovative multi-modal BEV perception framework that employs a Transformer architecture and a diffusion denoising model to tackle key challenges, including sensor noise, efficient fusion of multi-modal data, and modeling of dynamic scenes. DMFormer integrates a diffusion-based image denoising module to enhance camera feature quality and reduce noise stemming from lighting fluctuations, adverse weather, and occlusions. Furthermore, a LiDAR-camera feature alignment mechanism combines LiDAR’s spatial geometric information with the camera’s semantic information. By employing a multi-scale self-attention strategy in the Transformer encoder and a query-driven decoder, DMFormer effectively captures both global and local context, enabling precise 3D object detection and segmentation. Experiments on the nuScenes dataset show that DMFormer achieves a nuScenes Detection Score (NDS) of 73.6% and a mean average precision (mAP) of 71.8%, outperforming current state-of-the-art approaches. Its superior detection of dynamic objects in complex environments further highlights its efficacy in BEV perception tasks.
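For readers who want a concrete picture of the pipeline the abstract outlines (denoise camera features, align them with LiDAR features in BEV, fuse with transformer attention, decode with learned queries), the following is a minimal PyTorch-style sketch. It is not the authors' implementation: every module name, shape, and hyper-parameter is an illustrative assumption, the diffusion denoiser is reduced to a single residual refinement step, and the multi-scale attention is simplified to one scale.

# Hypothetical sketch of the pipeline the abstract describes; all names,
# shapes, and hyper-parameters are illustrative assumptions, not the
# authors' code.
import torch
import torch.nn as nn

class DenoiseBlock(nn.Module):
    """Stand-in for the diffusion-based denoising module: predicts a
    residual noise estimate and subtracts it from the camera features."""
    def __init__(self, channels: int):
        super().__init__()
        self.noise_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats - self.noise_head(feats)  # one refinement step

class FusionBEV(nn.Module):
    """Aligns camera and LiDAR BEV features, then runs a transformer
    encoder over BEV tokens and a query-driven (DETR-style) decoder."""
    def __init__(self, dim: int = 128, num_queries: int = 100):
        super().__init__()
        self.denoise = DenoiseBlock(dim)
        self.align = nn.Conv2d(2 * dim, dim, 1)  # naive concat-and-project alignment
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.queries = nn.Embedding(num_queries, dim)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, lidar_bev: (B, C, H, W) features already projected to BEV
        cam_bev = self.denoise(cam_bev)
        fused = self.align(torch.cat([cam_bev, lidar_bev], dim=1))
        tokens = fused.flatten(2).transpose(1, 2)  # (B, H*W, C) BEV tokens
        memory = self.encoder(tokens)
        q = self.queries.weight.unsqueeze(0).expand(cam_bev.size(0), -1, -1)
        return self.decoder(q, memory)  # (B, num_queries, C) object embeddings

if __name__ == "__main__":
    model = FusionBEV()
    cam = torch.randn(2, 128, 32, 32)
    lidar = torch.randn(2, 128, 32, 32)
    print(model(cam, lidar).shape)  # torch.Size([2, 100, 128])

In a sketch of this DETR-style form, each of the learned query slots attaches to one object hypothesis in the fused BEV map, which is the role the abstract assigns to the query-driven decoder.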

Bibliographic Details
Main Authors: Xuefeng Bao, Feng Liu, Yunli Chen, Yong Li, Rui Tian
Format: Article
Language: English
Published: Springer 2025-07-01
Series: Complex & Intelligent Systems
Subjects: Autonomous driving, BEV perception, Multi-modal fusion
Online Access: https://doi.org/10.1007/s40747-025-01984-9
author Xuefeng Bao
Feng Liu
Yunli Chen
Yong Li
Rui Tian
collection DOAJ
format Article
id doaj-art-df441dad6d3640f88453af1ff0e03683
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-07-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-df441dad6d3640f88453af1ff0e03683 (indexed 2025-08-20T03:41:57Z)
English | Springer | Complex & Intelligent Systems | ISSN 2199-4536, 2198-6053 | 2025-07-01 | https://doi.org/10.1007/s40747-025-01984-9
DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception
Xuefeng Bao (College of Computer Science, Beijing University of Technology); Feng Liu (School of Software Engineering, Beijing Jiaotong University); Yunli Chen, Yong Li, Rui Tian (College of Computer Science, Beijing University of Technology)
Keywords: Autonomous driving; BEV perception; Multi-modal fusion
title DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception
topic Autonomous driving
BEV perception
Multi-modal fusion
url https://doi.org/10.1007/s40747-025-01984-9