Lightweight Stereo Matching for Real-Time Applications With 2D Cost Volume Aggregation


Bibliographic Details
Main Authors: Thai La, Linh Tao, Dai Watanabe
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11005575/
Description
Summary: Despite significant advances in learning-based stereo matching, a key challenge remains: the high computational cost and memory demands of 3D convolutions hinder real-time deployment on resource-constrained platforms such as edge devices. This paper presents a novel approach that avoids 3D convolutions entirely, aiming for faster inference while maintaining accuracy comparable to existing state-of-the-art methods. The proposed solution centers on a 2D cost aggregation technique that serves as a viable alternative to traditional 3D convolutions, delivering similar accuracy while significantly reducing computational overhead, which enables more efficient resource utilization and paves the way for real-time applications. Complementing the 2D cost aggregation module, the study introduces a multi-stage feature extractor designed to enhance feature representation while remaining straightforward and lightweight. Integrating 2D cost aggregation with multi-stage feature extraction yields an efficient cost aggregation architecture that simplifies the model and ensures computational efficiency without sacrificing accuracy, delivering high-performance stereo matching suitable for devices with limited computational capability. On benchmark datasets, the proposed network achieves 1.37 px end-point error (EPE) on Scene Flow and 4.22% and 4.09% D1-all on KITTI 2012 and KITTI 2015, respectively, while running in 13 ms with only 27 GFLOPs and 0.41 M parameters.
ISSN:2169-3536
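
The abstract's central idea — filtering each disparity slice of the cost volume with cheap 2D operations instead of applying 3D convolutions to the whole volume — can be illustrated with a minimal sketch. This is not the paper's architecture: the absolute-difference matching cost, the box-filter aggregation, the `box_filter2d`/`match` names, and the winner-takes-all disparity selection are illustrative assumptions only, standing in for the learned 2D aggregation network the paper proposes.

```python
import numpy as np

def box_filter2d(x, k=3):
    """Aggregate one 2D cost slice with a k x k mean filter (2D aggregation)."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=np.float64)
    for dy in range(k):          # sum the k*k shifted copies, then normalize
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def match(left, right, max_disp):
    """Build a (D, H, W) cost volume via absolute difference, aggregate each
    disparity slice independently in 2D, then pick the best disparity."""
    H, W = left.shape
    volume = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        # left pixel x is compared against right pixel x - d
        diff = np.abs(left[:, d:] - right[:, :W - d if d else None])
        volume[d, :, d:] = box_filter2d(diff)   # per-slice 2D aggregation
    return volume.argmin(axis=0)                # winner-takes-all disparity
```

Because each slice is filtered independently, the aggregation cost scales with D separate 2D filters rather than a 3D convolution over the full volume, which is the efficiency argument the abstract makes. For a synthetic pair where the right view is the left view shifted by 2 pixels, `match` recovers a disparity of 2 over the valid region.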