A simple monocular depth estimation network for balancing complexity and accuracy

Bibliographic Details
Main Authors:	Xuanxuan Liu, Shuai Tang, Mengdie Feng, Xueqi Guo, Yanru Zhang, Yan Wang
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-04-01
Series:	Scientific Reports
Subjects:	Monocular depth estimation, Deformable cross-attention, Transformer, Adaptive bins
Online Access:	https://doi.org/10.1038/s41598-025-97568-1

author	Xuanxuan Liu, Shuai Tang, Mengdie Feng, Xueqi Guo, Yanru Zhang, Yan Wang
author_sort	Xuanxuan Liu
collection	DOAJ
description	Abstract Monocular depth estimation plays a crucial role in many downstream visual tasks. Although research on monocular depth estimation is relatively mature, it commonly involves strategies that entail increasing both the computational complexity and the number of parameters to achieve superior performance. Particularly in practical applications, enhancing the accuracy of depth prediction while ensuring computational efficiency remains a challenging issue. To tackle this challenge, we propose a novel and simple depth estimation model called SimMDE, which treats monocular depth estimation as an ordinal regression problem. Beginning with a baseline encoder, our model is equipped with a Deformable Cross-Attention Feature Fusion (DCF) decoder with sparse attention. This decoder efficiently integrates multi-scale feature maps, markedly reducing the quadratic complexity of the Transformer model. For the extraction of finer local features, we propose a Local Multi-dimensional Convolutional Attention (LMC) module. Meanwhile, we propose a Wavelet Attention Transformer (WAT) module to achieve pixel-level precise classification of images. Furthermore, we also conduct extensive experiments on two widely recognized depth estimation benchmark datasets: NYU and KITTI. The experimental findings unequivocally demonstrate that our model attains exceptional accuracy in depth estimation while upholding high computational efficiency. Remarkably, our framework SimMDE, extending from AdaBins, demonstrates enhancements, resulting in substantial improvements of 11.7% and 10.3% in the absolute relative error (AbsRel) on the NYU and KITTI datasets, respectively, with fewer parameters.
format	Article
id	doaj-art-8d671d5346da4415a1c287810aaae2ef
institution	OA Journals
issn	2045-2322
language	English
publishDate	2025-04-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj-art-8d671d5346da4415a1c287810aaae2ef; 2025-08-20T02:27:53Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-04-01; 15; 1; 1; 15; 10.1038/s41598-025-97568-1; A simple monocular depth estimation network for balancing complexity and accuracy; Xuanxuan Liu (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Shuai Tang (School of Future Technology, South China University of Technology); Mengdie Feng (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Xueqi Guo (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Yanru Zhang (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Yan Wang (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); https://doi.org/10.1038/s41598-025-97568-1; Monocular depth estimation; Deformable cross-attention; Transformer; Adaptive bins
title	A simple monocular depth estimation network for balancing complexity and accuracy
topic	Monocular depth estimation, Deformable cross-attention, Transformer, Adaptive bins
url	https://doi.org/10.1038/s41598-025-97568-1
