DBF‐Net: A Deep Bidirectional Fusion Network for 6D Object Pose Estimation with Sparse Linear Transformer

6D object pose estimation, a critical component in computer vision and robotics domains, involves determining the 3D location and orientation of an object relative to a canonical reference frame. Recently, the widespread proliferation of RGB‐D sensors has precipitated a marked increase in interest t...

Bibliographic Details
Main Authors: Xuan Fan, Tao An, Hongbo Gao, Tao Xie, Lijun Zhao, Ruifeng Li
Format: Article
Language: English
Published: Wiley 2025-08-01
Series: Advanced Intelligent Systems
Subjects:
Online Access: https://doi.org/10.1002/aisy.202401001
collection DOAJ
description 6D object pose estimation, a critical component in computer vision and robotics domains, involves determining the 3D location and orientation of an object relative to a canonical reference frame. Recently, the widespread proliferation of RGB‐D sensors has precipitated a marked increase in interest towards 6D pose estimation leveraging RGB‐D data. A deep bidirectional fusion network is developed, DBF‐Net, achieving efficient yet accurate 6D object pose estimation. Specifically, a sparse linear Transformer (SLT) with linear computation complexity is introduced to effectively leverage cross‐modal semantic resemblance during the feature extraction stage, such that it fully models semantic associations between various modalities and efficiently aggregates the globally enhanced features of each modality. Once acquiring two feature representations from two modalities, a feature balancer (FB) based on SLT is proposed to adaptively reconcile the importance of these feature representations. Leveraging the global receptive field of SLT, FB effectively eliminates the ambiguity induced by visual similarity in appearance representation or depth missing of reflective surfaces in geometry representations, thereby enhancing the generalization ability and robustness of the network. Experimental results demonstrate that DBF‐Net surpasses current state‐of‐the‐art works by nontrivial margins across multiple benchmarks. The code is available at https://github.com/Mrfanxuan/dbf_net.
id doaj-art-b6e14efd5326488aa1e03d3ca164d187
institution Kabale University
issn 2640-4567
affiliation State Key Laboratory of Robotics and Systems, Harbin Institute of Technology, Harbin 150001, China (all six authors)
volume 7
issue 8
topic 6D object pose estimations
deep learning
feature representations
RGB‐D
transformers