A Comparative Study and Optimization of Camera-Based BEV Segmentation for Real-Time Autonomous Driving

Bibliographic Details
Main Authors: Woomin Jun, Sungjin Lee
Format: Article
Language: English
Published: MDPI AG, 2025-04-01
Series: Sensors
Online Access: https://www.mdpi.com/1424-8220/25/7/2300
Summary: This study addresses the optimization of a camera-based bird's eye view (BEV) segmentation technique that operates in real time within an embedded system environment while maintaining high accuracy despite limited computational resources. Specifically, it examines three technical approaches to BEV segmentation in autonomous driving: depth-based, MLP-based, and transformer-based methods, focusing on key techniques such as lift–splat–shoot, HDMapNet, and BEVFormer. A mathematical analysis of these methods is conducted, followed by a comparative performance evaluation on the nuScenes dataset.

The optimization process was carried out in three stages: accuracy improvement, latency reduction, and model size optimization. In the first stage, the three modules for BEV segmentation (encoder, view transformation, and decoder) were selected with the goal of maximizing mIoU performance. In the second stage, environmental variables were optimized through input resolution adaptation and data augmentation to further improve accuracy. In the third stage, model compression was applied to minimize model size and latency for efficient deployment on embedded systems.

Experimental results from the first stage show that the lift–splat–shoot view transformation model, based on the InternImage-B encoder and EfficientNet-B0 decoder, achieved the highest performance with 54.9 mIoU at an input image size of 448 × 800. Notably, the lift–splat–shoot model with the InternImage-T encoder and EfficientNet-B0 decoder reached 53.1 mIoU while remaining highly efficient (a latency of 51.7 ms and a model size of 159.5 MB). The second stage revealed that increasing the input resolution does not always improve accuracy; there is an optimal resolution for each model, and in this study the best performance was achieved at an input image size of 448 × 800. In the third stage, FP16 quantization enabled a 50% reduction in memory size and decreased latency while maintaining similar or identical mIoU performance. When deployed on the NVIDIA AGX Orin device, which operates under power constraints, energy efficiency improved, although latency increased under certain power supply conditions. Overall, the InternImage encoder-based lift–splat–shoot technique achieved the highest accuracy relative to latency and model size, outperforming the original method with a 29.2% higher mIoU at similar latency while reducing memory size by 32.2%.
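As background for the depth-based approach named above, the "lift" step of lift–splat–shoot turns each camera's feature map into a frustum of depth-weighted features, which are then splatted onto the BEV grid. Below is a minimal PyTorch sketch of that lift step; the single-camera setup and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the lift-splat-shoot "lift" step, assuming a single
# camera and hypothetical tensor shapes; not the authors' implementation.
import torch

B, C, H, W, D = 1, 64, 28, 50, 41   # batch, channels, feature map size, depth bins

feats = torch.randn(B, C, H, W)          # per-camera image features from the encoder
depth_logits = torch.randn(B, D, H, W)   # predicted categorical depth per pixel

# Lift: weight each pixel's feature by its depth distribution, producing a
# frustum of shape (B, C, D, H, W) -- one feature vector per (pixel, depth bin).
depth_prob = depth_logits.softmax(dim=1)                 # (B, D, H, W)
frustum = depth_prob.unsqueeze(1) * feats.unsqueeze(2)   # (B, C, D, H, W)

# Splat (simplified): the full method projects each frustum cell into a BEV
# grid using camera intrinsics/extrinsics and sum-pools; here we only
# illustrate the pooling by collapsing the depth axis.
bev = frustum.sum(dim=2)   # (B, C, H, W) stand-in for the BEV feature map
print(bev.shape)
```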
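The mIoU figures quoted in the abstract follow the standard per-class intersection-over-union computation over the BEV segmentation classes. A minimal sketch, with a hypothetical class count and grid size:

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU over classes, skipping classes absent from both pred and gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

pred = np.random.randint(0, 4, (200, 200))  # hypothetical BEV prediction
gt = np.random.randint(0, 4, (200, 200))    # hypothetical BEV ground truth
print(miou(pred, gt, num_classes=4))
```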
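The third-stage result (roughly 50% memory reduction at similar mIoU) corresponds to casting 32-bit weights and activations to 16-bit floats. The following is a minimal PyTorch sketch of post-training FP16 conversion, assuming a CUDA device and using EfficientNet-B0 purely as a stand-in model; the actual AGX Orin deployment presumably used an inference engine such as TensorRT, which this does not reproduce.

```python
import torch
import torchvision

model = torchvision.models.efficientnet_b0(weights=None).eval().cuda()

def param_bytes(m):
    # Total parameter memory: element count times bytes per element.
    return sum(p.numel() * p.element_size() for p in m.parameters())

fp32_bytes = param_bytes(model)
model_fp16 = model.half()          # cast all weights to FP16 in place
fp16_bytes = param_bytes(model_fp16)
print(f"FP32: {fp32_bytes / 1e6:.1f} MB, FP16: {fp16_bytes / 1e6:.1f} MB")  # ~2x smaller

x = torch.randn(1, 3, 448, 800, device="cuda").half()  # input must match weight dtype
with torch.no_grad():
    out = model_fp16(x)
print(out.dtype, out.shape)
```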
ISSN: 1424-8220