Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing

Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information.

Bibliographic Details
Main Authors:	Zibo Hu, Kun Gao, Jingyi Wang, Zhijia Yang, Zefeng Zhang, Haobo Cheng, Wei Li
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects:	Efficient cross-modality block; inverse pyramid feature refinement (IPFR); multiscale visual-cross-text fusion module (MSVCTFM); open-set object detection
Online Access:	https://ieeexplore.ieee.org/document/11021309/

author	Zibo Hu, Kun Gao, Jingyi Wang, Zhijia Yang, Zefeng Zhang, Haobo Cheng, Wei Li
collection	DOAJ
description	Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.
format	Article
id	doaj-art-e556da91e44a4107b3f450ef64648b73
institution	Kabale University
issn	1939-1404 2151-1535
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling	doaj-art-e556da91e44a4107b3f450ef64648b73; 2025-08-20T03:28:05Z; eng; IEEE; IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; ISSN 1939-1404, 2151-1535; 2025-01-01; vol. 18, pp. 15291-15303; DOI 10.1109/JSTARS.2025.3575770; document 11021309; Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing; Zibo Hu (https://orcid.org/0000-0002-8315-2215), Kun Gao (https://orcid.org/0000-0001-6666-8036), Jingyi Wang (https://orcid.org/0009-0006-1123-4971), Zhijia Yang (https://orcid.org/0000-0001-8970-663X), Zefeng Zhang, Haobo Cheng, Wei Li (https://orcid.org/0000-0001-7015-7335); Zibo Hu, Kun Gao, Jingyi Wang, Zhijia Yang, Zefeng Zhang, Haobo Cheng: School of Optics and Photonics, Beijing Institute of Technology, Beijing, China; Wei Li: School of Information and Electronics, Beijing Institute of Technology, Beijing, China; https://ieeexplore.ieee.org/document/11021309/; Efficient cross-modality block; inverse pyramid feature refinement (IPFR); multiscale visual-cross-text fusion module (MSVCTFM); open-set object detection
title	Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing
topic	Efficient cross-modality block; inverse pyramid feature refinement (IPFR); multiscale visual-cross-text fusion module (MSVCTFM); open-set object detection
url	https://ieeexplore.ieee.org/document/11021309/


Similar Items