A simple monocular depth estimation network for balancing complexity and accuracy

Bibliographic Details
Main Authors:	Xuanxuan Liu, Shuai Tang, Mengdie Feng, Xueqi Guo, Yanru Zhang, Yan Wang
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-04-01
Series:	Scientific Reports
Subjects:	Monocular depth estimation; Deformable cross-attention; Transformer; Adaptive bins
Online Access:	https://doi.org/10.1038/s41598-025-97568-1

Description
Summary:	Monocular depth estimation plays a crucial role in many downstream visual tasks. Although research on monocular depth estimation is relatively mature, it commonly relies on strategies that increase both computational complexity and the number of parameters to achieve superior performance. In practical applications in particular, improving the accuracy of depth prediction while preserving computational efficiency remains challenging. To tackle this challenge, we propose a novel and simple depth estimation model called SimMDE, which treats monocular depth estimation as an ordinal regression problem. Beginning with a baseline encoder, our model is equipped with a Deformable Cross-Attention Feature Fusion (DCF) decoder with sparse attention. This decoder efficiently integrates multi-scale feature maps, markedly reducing the quadratic complexity of the Transformer model. To extract finer local features, we propose a Local Multi-dimensional Convolutional Attention (LMC) module. We also propose a Wavelet Attention Transformer (WAT) module to achieve precise pixel-level classification of images. We conduct extensive experiments on two widely recognized depth estimation benchmark datasets, NYU and KITTI. The experimental findings demonstrate that our model attains high accuracy in depth estimation while maintaining high computational efficiency. Notably, our framework SimMDE, which extends AdaBins, improves the absolute relative error (AbsRel) by 11.7% and 10.3% on the NYU and KITTI datasets, respectively, with fewer parameters.
ISSN:	2045-2322
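
The summary reports its gains in absolute relative error (AbsRel). As a reading aid, the sketch below shows the standard definition of that metric, the mean of |predicted depth - true depth| / true depth over valid pixels, together with a helper for a relative reduction between two scores. It is not taken from the article: the function names, the "ground truth > 0 means valid" masking convention, and all numbers are illustrative assumptions, and the summary does not spell out whether its 11.7% and 10.3% figures are relative reductions of this kind.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid ground truth.

    Assumption: a ground-truth depth <= 0 marks a missing measurement
    and is skipped (a common convention, not confirmed by the record).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


# Made-up depths in metres; the third pixel has no ground truth.
gt = [2.0, 4.0, 0.0, 5.0]
pred = [2.2, 3.6, 1.0, 5.0]

score = abs_rel(pred, gt)
print(f"AbsRel = {score:.4f}")

# Hypothetical scores, NOT the values reported in the article.
print(f"relative reduction = {relative_reduction(0.100, 0.090):.1%}")
```

Lower AbsRel is better, so a positive relative reduction means the second score improves on the baseline.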

Similar Items