Leveraging Frozen Foundation Models and Multimodal Fusion for BEV Segmentation and Occupancy Prediction

Bibliographic Details
Main Authors: Seamie Hayes, Ganesh Sistu, Ciaran Eising
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Open Journal of Vehicular Technology
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10974666/
author Seamie Hayes
Ganesh Sistu
Ciaran Eising
collection DOAJ
description In Bird's Eye View perception, significant emphasis is placed on deploying well-performing, complex model architectures and leveraging as many sensor modalities as possible to reach maximal performance. This paper investigates whether foundation models and multi-sensor deployments are essential for enhancing BEV perception. We examine the relative importance of advanced feature extraction versus the number of sensor modalities and assess whether foundation models can address feature extraction limitations and reduce the need for extensive training data. Specifically, incorporating the self-supervised DINOv2 for feature extraction and Metric3Dv2 for depth estimation into the Lift-Splat-Shoot framework results in a 7.4 IoU point increase in vehicle segmentation, representing a relative improvement of 22.4%, while requiring only half the training data and iterations compared to the original model. Furthermore, using Metric3Dv2's depth maps as a pseudo-LiDAR point cloud within the Simple-BEV model improves IoU by 2.9 points, marking a 6.1% relative increase compared to the Camera-only setup. Finally, we extend the well-known Gaussian Splatting BEV perception models, GaussianFormer and GaussianOcc, through multimodal deployment. The addition of LiDAR information in GaussianFormer results in a 9.4-point increase in mIoU, a 48.7% improvement over the Camera-only model, nearing state-of-the-art multimodal performance even with limited LiDAR scans. In the self-supervised GaussianOcc model, incorporating LiDAR leads to a 0.36-point increase in mIoU, representing a 3.6% improvement over the Camera-only model. This limited gain can be attributed to the absence of LiDAR encoding and the self-supervised nature of the model. Overall, our findings highlight the critical role of foundation models and multi-sensor integration in advancing BEV perception. By leveraging sophisticated foundation models and multi-sensor deployment, we can further improve model performance and reduce data requirements, addressing key challenges in BEV perception.
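The "pseudo-LiDAR" idea mentioned in the abstract is the standard pinhole back-projection: each pixel of a metric depth map is lifted to a 3D point in the camera frame using the camera intrinsics. The sketch below illustrates that step only; the intrinsics and the tiny depth map are illustrative assumptions, not values from the paper, and the paper's Simple-BEV integration involves further processing not shown here.

```python
# Minimal sketch: lifting a dense metric depth map (as produced by a model
# like Metric3Dv2) into a pseudo-LiDAR point cloud in the camera frame.
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth

# Toy example: a 2x2 depth map with one invalid (zero-depth) pixel and
# made-up intrinsics; a real camera's fx, fy, cx, cy come from calibration.
depth = np.array([[2.0, 2.0],
                  [0.0, 4.0]])
cloud = depth_to_pseudo_lidar(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud.shape)  # → (3, 3): one (x, y, z) point per valid-depth pixel
```

The resulting cloud can then be voxelized or splatted into the BEV grid exactly as a real LiDAR sweep would be, which is what makes the substitution attractive when no LiDAR sensor is available.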
format Article
id doaj-art-2f0afcbdf5ab462e908a6e49fc3082c7
institution OA Journals
issn 2644-1330
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Open Journal of Vehicular Technology
spelling doaj-art-2f0afcbdf5ab462e908a6e49fc3082c7
last_indexed 2025-08-20T02:25:40Z
citation IEEE Open Journal of Vehicular Technology, vol. 6, pp. 1241-1261, 2025
doi 10.1109/OJVT.2025.3563677
authors Seamie Hayes (https://orcid.org/0009-0008-0587-1872), Ganesh Sistu, Ciaran Eising (https://orcid.org/0000-0001-8383-2635)
affiliation Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland (all authors)
url https://ieeexplore.ieee.org/document/10974666/
title Leveraging Frozen Foundation Models and Multimodal Fusion for BEV Segmentation and Occupancy Prediction
topic Bird's eye view
foundation model
LiDAR
multimodal
semantic occupancy
url https://ieeexplore.ieee.org/document/10974666/