Frequency Spectrum Adaptor for Remote Sensing Image–Text Retrieval

Bibliographic Details
Main Authors: Ziyi Wan, Enyuan Zhao, Jie Nie, Ze Zhang, Zhiqiang Wei, Nan Zheng, Yuting Zhao
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Online Access: https://ieeexplore.ieee.org/document/11081463/
Description
Summary: Remote sensing image–text retrieval (RSITR) is a critical task that involves parsing the content of remote sensing (RS) images to match semantically relevant text. Existing RSITR methods primarily adopt pretrained models directly and perform transfer learning through fine-tuning, neglecting the complex high-dimensional structured information present in RS images, where texture, color, scale, and semantics are tightly coupled. Consequently, these methods are limited in handling specific receptive fields and in preserving the structural information within RS images, which leads to inaccurate retrieval matches. To mitigate this issue, this article proposes a frequency spectrum adapter for RSITR, which aims to perceive the unique structured information of RS images and thereby facilitate the transfer of visual–linguistic knowledge from the natural domain to the RS domain. The main contributions of this article are as follows. First, a frequency-domain-based RS image–text retrieval adapter (FRS-Adapter) is developed. By expanding the spectral receptive field, it extracts the unique structured information of RS images, enhancing the effect of fine-tuning for transfer from the natural scene domain to the RS domain. Second, a unimodal filter bank is designed, which performs unimodal spectral compression across multiple bands. Within each band, the spectral features of the amplitude and phase structures are used to further enhance the representation of structured information. Third, a cross-modal spectrum mutual aggregation module is introduced to promote the deep integration of linguistic, spatial, and spectral information. This module guides the retention of relevant frequency components and reduces the impact of irrelevant ones. Fourth, quantitative and qualitative experiments on three large RS cross-modal retrieval datasets validate the performance of the FRS-Adapter in RSITR.
ISSN: 1939-1404, 2151-1535
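
Illustrative note: the summary mentions splitting a spectrum into multiple bands and using amplitude and phase within each band. The sketch below shows only that general idea on a single 2-D array with NumPy. It is not the authors' FRS-Adapter: the function names (band_amplitude_phase, reconstruct), the fixed radial band masks, and the choice of three bands are all assumptions made for illustration, whereas the article describes learned modules operating on image and text features.

```python
import numpy as np


def band_amplitude_phase(feature_map, num_bands=3):
    """Split a 2-D map's spectrum into radial bands (hypothetical helper).

    Returns a list of (amplitude, phase) pairs, one per band.
    """
    h, w = feature_map.shape
    # Centre the zero-frequency (DC) component of the 2-D spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))

    # Radial distance of every frequency bin from the DC component.
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

    # Equal-width radial bands; this fixed partition is an assumption.
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    bands = []
    for i in range(num_bands):
        lo, hi = edges[i], edges[i + 1]
        if i == num_bands - 1:
            mask = (radius >= lo) & (radius <= hi)  # last band keeps the outer edge
        else:
            mask = (radius >= lo) & (radius < hi)
        band = spectrum * mask
        bands.append((np.abs(band), np.angle(band)))
    return bands


def reconstruct(bands):
    """Recombine per-band amplitude and phase into the spatial-domain map."""
    spectrum = sum(amp * np.exp(1j * phase) for amp, phase in bands)
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real
```

Because the band masks partition the spectrum, recombining all bands recovers the input. Dropping or reweighting a band before reconstruction is a crude, non-learned stand-in for retaining relevant frequency components and suppressing irrelevant ones.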