DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception
Abstract Accurate and robust perception in the Bird’s Eye View (BEV) is essential for effective environmental understanding in autonomous driving systems. This study introduces DMFormer, an innovative multi-modal BEV perception framework that employs Transformer architecture and a diffusion denoising model to tackle key challenges, including sensor noise, efficient fusion of multi-modal data, and modeling dynamic scenes. DMFormer integrates a diffusion-based image denoising module to enhance camera feature quality and reduce noise stemming from lighting fluctuations, adverse weather, and occlusions. Furthermore, a LiDAR-camera feature alignment mechanism is implemented to combine LiDAR’s spatial geometric insights with the camera’s semantic information. By employing a multi-scale self-attention strategy in the Transformer encoder and a query-driven decoder, DMFormer effectively captures both global and local contextual details, enabling precise 3D object detection and segmentation. Experiments conducted on the nuScenes dataset reveal that DMFormer achieves outstanding performance, with a comprehensive performance metric (NDS) of 73.6% and a mean average precision (mAP) of 71.8%, outperforming current state-of-the-art approaches. Additionally, its superior detection capabilities in complex environments and for dynamic objects highlight its efficacy in BEV perception tasks.
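The pipeline the abstract describes (denoise the camera features, then align and fuse them with LiDAR on a common BEV grid) can be sketched as a toy example. All names, shapes, and the mean-blend stand-in for the diffusion denoiser are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def denoise(camera_feat: np.ndarray, strength: float = 0.5) -> np.ndarray:
    """Stand-in for the diffusion denoising module: blend each cell
    toward the global channel mean to suppress per-cell noise."""
    mean = camera_feat.mean(axis=(0, 1), keepdims=True)
    return (1 - strength) * camera_feat + strength * mean

def fuse_bev(camera_feat: np.ndarray, lidar_feat: np.ndarray) -> np.ndarray:
    """Fuse the two modalities on a shared BEV grid by concatenating
    channels: LiDAR contributes geometry, the camera contributes semantics."""
    assert camera_feat.shape[:2] == lidar_feat.shape[:2], "BEV grids must match"
    return np.concatenate([camera_feat, lidar_feat], axis=-1)

# Hypothetical 200x200 BEV grid: 64 camera channels + 32 LiDAR channels.
cam = np.random.rand(200, 200, 64)
lidar = np.random.rand(200, 200, 32)
bev = fuse_bev(denoise(cam), lidar)
print(bev.shape)  # (200, 200, 96)
```

In the actual framework, the fused BEV features would then feed the multi-scale self-attention encoder and query-driven decoder for detection and segmentation; the sketch only illustrates the denoise-then-fuse data flow.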
| Main Authors: | Xuefeng Bao, Feng Liu, Yunli Chen, Yong Li, Rui Tian |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-07-01 |
| Series: | Complex & Intelligent Systems |
| Subjects: | Autonomous driving; BEV perception; Multi-modal fusion |
| Online Access: | https://doi.org/10.1007/s40747-025-01984-9 |
| _version_ | 1849389464788402176 |
|---|---|
| author | Xuefeng Bao Feng Liu Yunli Chen Yong Li Rui Tian |
| author_facet | Xuefeng Bao Feng Liu Yunli Chen Yong Li Rui Tian |
| author_sort | Xuefeng Bao |
| collection | DOAJ |
| description | Abstract Accurate and robust perception in the Bird’s Eye View (BEV) is essential for effective environmental understanding in autonomous driving systems. This study introduces DMFormer, an innovative multi-modal BEV perception framework that employs Transformer architecture and a diffusion denoising model to tackle key challenges, including sensor noise, efficient fusion of multi-modal data, and modeling dynamic scenes. DMFormer integrates a diffusion-based image denoising module to enhance camera feature quality and reduce noise stemming from lighting fluctuations, adverse weather, and occlusions. Furthermore, a LiDAR-camera feature alignment mechanism is implemented to combine LiDAR’s spatial geometric insights with the camera’s semantic information. By employing a multi-scale self-attention strategy in the Transformer encoder and a query-driven decoder, DMFormer effectively captures both global and local contextual details, enabling precise 3D object detection and segmentation. Experiments conducted on the nuScenes dataset reveal that DMFormer achieves outstanding performance, with a comprehensive performance metric (NDS) of 73.6% and a mean average precision (mAP) of 71.8%, outperforming current state-of-the-art approaches. Additionally, its superior detection capabilities in complex environments and for dynamic objects highlight its efficacy in BEV perception tasks. |
| format | Article |
| id | doaj-art-df441dad6d3640f88453af1ff0e03683 |
| institution | Kabale University |
| issn | 2199-4536 2198-6053 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Springer |
| record_format | Article |
| series | Complex & Intelligent Systems |
| spelling | doaj-art-df441dad6d3640f88453af1ff0e036832025-08-20T03:41:57ZengSpringerComplex & Intelligent Systems2199-45362198-60532025-07-0111812010.1007/s40747-025-01984-9DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perceptionXuefeng Bao0Feng Liu1Yunli Chen2Yong Li3Rui Tian4College of Computer Science, Beijing University of TechnologySchool of Software Engineering, Beijing Jiaotong UniversityCollege of Computer Science, Beijing University of TechnologyCollege of Computer Science, Beijing University of TechnologyCollege of Computer Science, Beijing University of TechnologyAbstract Accurate and robust perception in the Bird’s Eye View (BEV) is essential for effective environmental understanding in autonomous driving systems. This study introduces DMFormer, an innovative multi-modal BEV perception framework that employs Transformer architecture and a diffusion denoising model to tackle key challenges, including sensor noise, efficient fusion of multi-modal data, and modeling dynamic scenes. DMFormer integrates a diffusion-based image denoising module to enhance camera feature quality and reduce noise stemming from lighting fluctuations, adverse weather, and occlusions. Furthermore, a LiDAR-camera feature alignment mechanism is implemented to combine LiDAR’s spatial geometric insights with the camera’s semantic information. By employing a multi-scale self-attention strategy in the Transformer encoder and a query-driven decoder, DMFormer effectively captures both global and local contextual details, enabling precise 3D object detection and segmentation. Experiments conducted on the nuScenes dataset reveal that DMFormer achieves outstanding performance, with a comprehensive performance metric (NDS) of 73.6% and a mean average precision (mAP) of 71.8%, outperforming current state-of-the-art approaches. Additionally, its superior detection capabilities in complex environments and for dynamic objects highlight its efficacy in BEV perception tasks.https://doi.org/10.1007/s40747-025-01984-9Autonomous drivingBEV perceptionMulti-modal fusion |
| spellingShingle | Xuefeng Bao Feng Liu Yunli Chen Yong Li Rui Tian DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception Complex & Intelligent Systems Autonomous driving BEV perception Multi-modal fusion |
| title | DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception |
| title_full | DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception |
| title_fullStr | DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception |
| title_full_unstemmed | DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception |
| title_short | DMformer: a transformer with denoising and multi-modal data fusion for enhancing BEV perception |
| title_sort | dmformer a transformer with denoising and multi modal data fusion for enhancing bev perception |
| topic | Autonomous driving BEV perception Multi-modal fusion |
| url | https://doi.org/10.1007/s40747-025-01984-9 |
| work_keys_str_mv | AT xuefengbao dmformeratransformerwithdenoisingandmultimodaldatafusionforenhancingbevperception AT fengliu dmformeratransformerwithdenoisingandmultimodaldatafusionforenhancingbevperception AT yunlichen dmformeratransformerwithdenoisingandmultimodaldatafusionforenhancingbevperception AT yongli dmformeratransformerwithdenoisingandmultimodaldatafusionforenhancingbevperception AT ruitian dmformeratransformerwithdenoisingandmultimodaldatafusionforenhancingbevperception |