A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation

Abstract A Bird's‐Eye‐View (BEV) map is a powerful and detailed scene representation for intelligent vehicles that provides both location and semantic information about nearby objects from a top‐down perspective. BEV map generation is a complex multi‐stage task, and existing methods typically perform poorly for distant scenes. Thus, the authors introduce a novel multi‐stage model that infers a more accurate BEV map. First, the authors propose the Adaptive Aggregation with Stereo Mixture Density (AA‐SMD) model, an improved stereo‐matching model that eliminates bleeding artefacts and provides more accurate depth estimation. Next, the authors employ an RGB‐Depth (RGB‐D) semantic segmentation model to improve the semantic segmentation performance and connectivity of their model. The depth map and semantic segmentation map are then combined to create an incomplete BEV map. Finally, the authors propose a Multi Strip Pooling Unet (MSP‐Unet) model with hierarchical multi‐scale (HMS) attention and a strip pooling (SP) module to refine the incomplete BEV map into the final prediction. The authors evaluate their model on a Car Learning to Act (CARLA)‐generated synthetic dataset. The experimental results demonstrate that the model generates a highly accurate representation of the surrounding environment, achieving a state‐of‐the‐art result of 61.50% Mean Intersection‐over‐Union (MIoU) across eight classes.


Saved in:
Bibliographic Details
Main Authors: Zhongyu Rao, Yingfeng Cai, Hai Wang, Long Chen, Yicheng Li
Format: Article
Language:English
Published: Wiley 2024-12-01
Series:IET Intelligent Transport Systems
Subjects:
Online Access:https://doi.org/10.1049/itr2.12367
_version_ 1850174915347480576
author Zhongyu Rao
Yingfeng Cai
Hai Wang
Long Chen
Yicheng Li
author_facet Zhongyu Rao
Yingfeng Cai
Hai Wang
Long Chen
Yicheng Li
author_sort Zhongyu Rao
collection DOAJ
description Abstract A Bird's‐Eye‐View (BEV) map is a powerful and detailed scene representation for intelligent vehicles that provides both location and semantic information about nearby objects from a top‐down perspective. BEV map generation is a complex multi‐stage task, and existing methods typically perform poorly for distant scenes. Thus, the authors introduce a novel multi‐stage model that infers a more accurate BEV map. First, the authors propose the Adaptive Aggregation with Stereo Mixture Density (AA‐SMD) model, an improved stereo‐matching model that eliminates bleeding artefacts and provides more accurate depth estimation. Next, the authors employ an RGB‐Depth (RGB‐D) semantic segmentation model to improve the semantic segmentation performance and connectivity of their model. The depth map and semantic segmentation map are then combined to create an incomplete BEV map. Finally, the authors propose a Multi Strip Pooling Unet (MSP‐Unet) model with hierarchical multi‐scale (HMS) attention and a strip pooling (SP) module to refine the incomplete BEV map into the final prediction. The authors evaluate their model on a Car Learning to Act (CARLA)‐generated synthetic dataset. The experimental results demonstrate that the model generates a highly accurate representation of the surrounding environment, achieving a state‐of‐the‐art result of 61.50% Mean Intersection‐over‐Union (MIoU) across eight classes.
format Article
id doaj-art-31a2922ea7f1497c8ecdf97478aa953d
institution OA Journals
issn 1751-956X
1751-9578
language English
publishDate 2024-12-01
publisher Wiley
record_format Article
series IET Intelligent Transport Systems
spelling doaj-art-31a2922ea7f1497c8ecdf97478aa953d
2025-08-20T02:19:34Z
eng
Wiley
IET Intelligent Transport Systems
1751-956X
1751-9578
2024-12-01
Volume 18, Issue 12, pp. 2552–2564
10.1049/itr2.12367
A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation
Zhongyu Rao (Automotive Engineering Research Institute of Jiangsu University, Zhenjiang, People's Republic of China)
Yingfeng Cai (Automotive Engineering Research Institute of Jiangsu University, Zhenjiang, People's Republic of China)
Hai Wang (School of Automotive and Traffic Engineering of Jiangsu University, Zhenjiang, People's Republic of China)
Long Chen (Automotive Engineering Research Institute of Jiangsu University, Zhenjiang, People's Republic of China)
Yicheng Li (Automotive Engineering Research Institute of Jiangsu University, Zhenjiang, People's Republic of China)
Abstract A Bird's‐Eye‐View (BEV) map is a powerful and detailed scene representation for intelligent vehicles that provides both location and semantic information about nearby objects from a top‐down perspective. BEV map generation is a complex multi‐stage task, and existing methods typically perform poorly for distant scenes. Thus, the authors introduce a novel multi‐stage model that infers a more accurate BEV map. First, the authors propose the Adaptive Aggregation with Stereo Mixture Density (AA‐SMD) model, an improved stereo‐matching model that eliminates bleeding artefacts and provides more accurate depth estimation. Next, the authors employ an RGB‐Depth (RGB‐D) semantic segmentation model to improve the semantic segmentation performance and connectivity of their model. The depth map and semantic segmentation map are then combined to create an incomplete BEV map. Finally, the authors propose a Multi Strip Pooling Unet (MSP‐Unet) model with hierarchical multi‐scale (HMS) attention and a strip pooling (SP) module to refine the incomplete BEV map into the final prediction. The authors evaluate their model on a Car Learning to Act (CARLA)‐generated synthetic dataset. The experimental results demonstrate that the model generates a highly accurate representation of the surrounding environment, achieving a state‐of‐the‐art result of 61.50% Mean Intersection‐over‐Union (MIoU) across eight classes.
https://doi.org/10.1049/itr2.12367
autonomous driving
image recognition
learning (artificial intelligence)
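The reported 61.50% MIoU across eight classes follows the standard mean Intersection-over-Union definition. A minimal reference implementation of that metric (generic, not the authors' evaluation code; the handling of ignored pixels and absent classes is an assumed convention):

```python
import numpy as np

def mean_iou(pred, target, num_classes=8, ignore_label=255):
    """Mean Intersection-over-Union over semantic classes.

    Pixels whose ground-truth label equals `ignore_label` are excluded,
    and classes absent from both prediction and ground truth are skipped
    rather than counted as IoU 0 (a common, but not universal, convention).
    """
    mask = target != ignore_label
    pred, target = pred[mask], target[mask]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                 # class occurs in pred or target
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Per-class IoU is intersection over union of the predicted and ground-truth masks for that class; MIoU is the unweighted mean, so rare classes count as much as road or vegetation.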
spellingShingle Zhongyu Rao
Yingfeng Cai
Hai Wang
Long Chen
Yicheng Li
A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation
IET Intelligent Transport Systems
autonomous driving
image recognition
learning (artificial intelligence)
title A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation
title_full A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation
title_fullStr A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation
title_full_unstemmed A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation
title_short A multi‐stage model for bird's eye view prediction based on stereo‐matching model and RGB‐D semantic segmentation
title_sort multi stage model for bird s eye view prediction based on stereo matching model and rgb d semantic segmentation
topic autonomous driving
image recognition
learning (artificial intelligence)
url https://doi.org/10.1049/itr2.12367
work_keys_str_mv AT zhongyurao amultistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT yingfengcai amultistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT haiwang amultistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT longchen amultistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT yichengli amultistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT zhongyurao multistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT yingfengcai multistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT haiwang multistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT longchen multistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation
AT yichengli multistagemodelforbirdseyeviewpredictionbasedonstereomatchingmodelandrgbdsemanticsegmentation