DBF‐Net: A Deep Bidirectional Fusion Network for 6D Object Pose Estimation with Sparse Linear Transformer
6D object pose estimation, a critical component in computer vision and robotics domains, involves determining the 3D location and orientation of an object relative to a canonical reference frame. Recently, the widespread proliferation of RGB‐D sensors has precipitated a marked increase in interest t...
| Main Authors: | Xuan Fan, Tao An, Hongbo Gao, Tao Xie, Lijun Zhao, Ruifeng Li |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wiley, 2025-08-01 |
| Series: | Advanced Intelligent Systems |
| Subjects: | 6D object pose estimations; deep learning; feature representations; RGB‐D; transformers |
| Online Access: | https://doi.org/10.1002/aisy.202401001 |
| _version_ | 1849230110385766400 |
|---|---|
| author | Xuan Fan, Tao An, Hongbo Gao, Tao Xie, Lijun Zhao, Ruifeng Li |
| author_sort | Xuan Fan |
| collection | DOAJ |
| description | 6D object pose estimation, a critical component of computer vision and robotics, involves determining the 3D location and orientation of an object relative to a canonical reference frame. The proliferation of RGB‐D sensors has markedly increased interest in 6D pose estimation from RGB‐D data. A deep bidirectional fusion network, DBF‐Net, is developed to achieve efficient yet accurate 6D object pose estimation. Specifically, a sparse linear Transformer (SLT) with linear computational complexity is introduced to exploit cross‐modal semantic similarity during the feature extraction stage, so that it fully models semantic associations between the modalities and efficiently aggregates the globally enhanced features of each modality. Once the feature representations of the two modalities have been obtained, a feature balancer (FB) based on SLT is proposed to adaptively reconcile their importance. Leveraging the global receptive field of SLT, FB effectively eliminates the ambiguity caused by visual similarity in the appearance representation or by missing depth on reflective surfaces in the geometry representation, thereby enhancing the generalization ability and robustness of the network. Experimental results demonstrate that DBF‐Net surpasses current state‐of‐the‐art methods by nontrivial margins across multiple benchmarks. The code is available at https://github.com/Mrfanxuan/dbf_net. |
| format | Article |
| id | doaj-art-b6e14efd5326488aa1e03d3ca164d187 |
| institution | Kabale University |
| issn | 2640-4567 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Wiley |
| record_format | Article |
| series | Advanced Intelligent Systems |
| spelling | doaj-art-b6e14efd5326488aa1e03d3ca164d187; 2025-08-21T11:05:47Z; eng; Wiley; Advanced Intelligent Systems; 2640-4567; 2025-08-01; 7; 8; n/a; n/a; 10.1002/aisy.202401001; DBF‐Net: A Deep Bidirectional Fusion Network for 6D Object Pose Estimation with Sparse Linear Transformer; Xuan Fan (0), Tao An (1), Hongbo Gao (2), Tao Xie (3), Lijun Zhao (4), Ruifeng Li (5); State Key Laboratory of Robotics and Systems, Harbin Institute of Technology, Harbin 150001, China (all six authors); https://doi.org/10.1002/aisy.202401001; 6D object pose estimations; deep learning; feature representations; RGB‐D; transformers |
| title | DBF‐Net: A Deep Bidirectional Fusion Network for 6D Object Pose Estimation with Sparse Linear Transformer |
| topic | 6D object pose estimations; deep learning; feature representations; RGB‐D; transformers |
| url | https://doi.org/10.1002/aisy.202401001 |