A simple monocular depth estimation network for balancing complexity and accuracy
Abstract Monocular depth estimation plays a crucial role in many downstream visual tasks. Although research on monocular depth estimation is relatively mature, it commonly involves strategies that entail increasing both the computational complexity and the number of parameters to achieve superior pe...
| Main Authors: | Xuanxuan Liu, Shuai Tang, Mengdie Feng, Xueqi Guo, Yanru Zhang, Yan Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Scientific Reports |
| Subjects: | Monocular depth estimation; Deformable cross-attention; Transformer; Adaptive bins |
| Online Access: | https://doi.org/10.1038/s41598-025-97568-1 |
| _version_ | 1850146302171545600 |
|---|---|
| author | Xuanxuan Liu; Shuai Tang; Mengdie Feng; Xueqi Guo; Yanru Zhang; Yan Wang |
| author_facet | Xuanxuan Liu; Shuai Tang; Mengdie Feng; Xueqi Guo; Yanru Zhang; Yan Wang |
| author_sort | Xuanxuan Liu |
| collection | DOAJ |
| description | Abstract Monocular depth estimation plays a crucial role in many downstream visual tasks. Although research on monocular depth estimation is relatively mature, it commonly involves strategies that entail increasing both the computational complexity and the number of parameters to achieve superior performance. Particularly in practical applications, enhancing the accuracy of depth prediction while ensuring computational efficiency remains a challenging issue. To tackle this challenge, we propose a novel and simple depth estimation model called SimMDE, which treats monocular depth estimation as an ordinal regression problem. Beginning with a baseline encoder, our model is equipped with a Deformable Cross-Attention Feature Fusion (DCF) decoder with sparse attention. This decoder efficiently integrates multi-scale feature maps, markedly reducing the quadratic complexity of the Transformer model. For the extraction of finer local features, we propose a Local Multi-dimensional Convolutional Attention (LMC) module. Meanwhile, we propose a Wavelet Attention Transformer (WAT) module to achieve pixel-level precise classification of images. Furthermore, we also conduct extensive experiments on two widely recognized depth estimation benchmark datasets: NYU and KITTI. The experimental findings unequivocally demonstrate that our model attains exceptional accuracy in depth estimation while upholding high computational efficiency. Remarkably, our framework SimMDE, extending from AdaBins, demonstrates enhancements, resulting in substantial improvements of 11.7% and 10.3% in the absolute relative error (AbsRel) on the NYU and KITTI datasets, respectively, with fewer parameters. |
| format | Article |
| id | doaj-art-8d671d5346da4415a1c287810aaae2ef |
| institution | OA Journals |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-8d671d5346da4415a1c287810aaae2ef; 2025-08-20T02:27:53Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-04-01; vol. 15, issue 1, pp. 1-15; 10.1038/s41598-025-97568-1; A simple monocular depth estimation network for balancing complexity and accuracy; Xuanxuan Liu (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Shuai Tang (School of Future Technology, South China University of Technology); Mengdie Feng (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Xueqi Guo (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Yanru Zhang (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); Yan Wang (Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China); https://doi.org/10.1038/s41598-025-97568-1; Monocular depth estimation; Deformable cross-attention; Transformer; Adaptive bins |
| title | A simple monocular depth estimation network for balancing complexity and accuracy |
| topic | Monocular depth estimation; Deformable cross-attention; Transformer; Adaptive bins |
| url | https://doi.org/10.1038/s41598-025-97568-1 |