A simple monocular depth estimation network for balancing complexity and accuracy

Abstract Monocular depth estimation plays a crucial role in many downstream visual tasks. Although research on monocular depth estimation is relatively mature, it commonly involves strategies that entail increasing both the computational complexity and the number of parameters to achieve superior pe...

Bibliographic Details
Main Authors: Xuanxuan Liu, Shuai Tang, Mengdie Feng, Xueqi Guo, Yanru Zhang, Yan Wang
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Subjects: Monocular depth estimation; Deformable cross-attention; Transformer; Adaptive bins
Online Access: https://doi.org/10.1038/s41598-025-97568-1
author Xuanxuan Liu
Shuai Tang
Mengdie Feng
Xueqi Guo
Yanru Zhang
Yan Wang
collection DOAJ
description Monocular depth estimation plays a crucial role in many downstream visual tasks. Although research on the problem is relatively mature, existing methods commonly achieve superior performance by increasing both computational complexity and the number of parameters. In practical applications, improving the accuracy of depth prediction while preserving computational efficiency remains challenging. To tackle this challenge, we propose a novel and simple depth estimation model called SimMDE, which treats monocular depth estimation as an ordinal regression problem. Starting from a baseline encoder, the model is equipped with a Deformable Cross-Attention Feature Fusion (DCF) decoder with sparse attention. This decoder efficiently integrates multi-scale feature maps, markedly reducing the quadratic complexity of the Transformer model. To extract finer local features, we propose a Local Multi-dimensional Convolutional Attention (LMC) module. We also propose a Wavelet Attention Transformer (WAT) module to achieve precise pixel-level classification of images. We conduct extensive experiments on two widely recognized depth estimation benchmark datasets, NYU and KITTI. The results show that our model attains high accuracy in depth estimation while maintaining high computational efficiency. Notably, SimMDE, which extends AdaBins, improves the absolute relative error (AbsRel) by 11.7% and 10.3% on the NYU and KITTI datasets, respectively, with fewer parameters.
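As a reading aid for the AbsRel figures quoted in the description, the following is a minimal sketch of how the metric is conventionally computed: the mean of |prediction - ground truth| / ground truth over pixels with valid ground truth. The function names, the sample depths, and the reading of "improvement" as a relative reduction in error are assumptions made for illustration; none of the numbers below come from the article.

```python
def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels.

    Pixels whose ground-truth depth is not positive are treated as
    invalid and skipped (a common convention, assumed here).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline_err, new_err):
    """Fractional reduction of an error metric relative to a baseline.

    Assumption: a reported "improvement of X%" is read as
    (baseline - new) / baseline.
    """
    return (baseline_err - new_err) / baseline_err


# Hypothetical depths in metres; the last pixel has no valid ground truth.
pred = [2.2, 3.6, 5.0, 1.0]
gt = [2.0, 4.0, 5.0, 0.0]
error = abs_rel(pred, gt)

# Hypothetical baseline and new AbsRel values, not taken from the article.
gain = relative_improvement(0.100, 0.090)

print(f"AbsRel = {error:.4f}, relative improvement = {gain:.1%}")
```

Lower AbsRel is better, so a positive relative improvement means the new model's error is smaller than the baseline's.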
format Article
id doaj-art-8d671d5346da4415a1c287810aaae2ef
institution OA Journals
issn 2045-2322
language English
publishDate 2025-04-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-8d671d5346da4415a1c287810aaae2ef
2025-08-20T02:27:53Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-04-01
Volume 15, issue 1, pages 1-15
10.1038/s41598-025-97568-1
Xuanxuan Liu (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China)
Shuai Tang (School of Future Technology, South China University of Technology)
Mengdie Feng (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China)
Xueqi Guo (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China)
Yanru Zhang (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China)
Yan Wang (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China)
title A simple monocular depth estimation network for balancing complexity and accuracy
topic Monocular depth estimation
Deformable cross-attention
Transformer
Adaptive bins
url https://doi.org/10.1038/s41598-025-97568-1