Frequency Spectrum Adaptor for Remote Sensing Image–Text Retrieval
| Main Authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11081463/ |
| Summary: | Remote sensing image–text retrieval (RSITR) is a critical task that involves parsing the content of remote sensing (RS) images to match semantically relevant text. Existing RSITR methods primarily adopt pretrained models directly and perform transfer learning through fine-tuning, neglecting the complex high-dimensional structured information present in RS images, where texture, color, scale, and semantics are tightly coupled. Consequently, these methods are limited in handling specific receptive fields and in preserving the structural information within RS images, which leads to inaccurate retrieval matches. To mitigate this issue, this article proposes a frequency spectrum adapter for RSITR, which aims to perceive the unique structured information of RS images and thereby facilitate the transfer of visual–linguistic knowledge from the natural domain to the RS domain. The main contributions of this article are as follows. First, a frequency-domain-based RS image–text retrieval adapter (FRS-Adapter) is developed. By expanding the spectral receptive field, it extracts the unique structured information of RS images and improves fine-tuning for transfer from the natural-scene domain to the RS domain. Second, a unimodal filter bank is designed to perform unimodal spectral compression across multiple bands. Within each band, the spectral features of the amplitude and phase structures are used to further enhance the representation of structured information. Third, a cross-modal spectrum mutual aggregation module is introduced to promote the deep integration of linguistic, spatial, and spectral information. This module guides the retention of relevant frequency components and reduces the impact of irrelevant ones. Fourth, quantitative and qualitative experiments are conducted on three large RS cross-modal retrieval datasets, demonstrating the effectiveness of the FRS-Adapter for RSITR. |
| ISSN: | 1939-1404 2151-1535 |
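
The summary refers to amplitude and phase structures within multiple frequency bands. As a rough illustration of that general idea only, the sketch below computes a 2-D spectrum of a tiny grid, separates it into amplitude and phase, and sums the energy in a few radial frequency bands. It is not the FRS-Adapter or its filter bank: the radial band partition, the naive DFT, and the function names `dft2`, `amplitude_phase`, and `band_energy` are all assumptions made for this example.

```python
import cmath
import math


def dft2(img):
    """Naive 2-D discrete Fourier transform of a small real-valued grid.

    Illustrative only: the cost is O(N^4), so keep inputs tiny.
    """
    h, w = len(img), len(img[0])
    spec = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(1j * angle)
            spec[u][v] = total
    return spec


def amplitude_phase(spec):
    """Split a complex spectrum into amplitude and phase grids."""
    amp = [[abs(c) for c in row] for row in spec]
    pha = [[cmath.phase(c) for c in row] for row in spec]
    return amp, pha


def band_energy(amp, num_bands=3):
    """Sum squared amplitude in radial bands (an assumed partition).

    Band 0 holds the lowest frequencies, the last band the highest.
    """
    h, w = len(amp), len(amp[0])
    r_max = math.sqrt(0.5)  # largest normalized radius, at (0.5, 0.5)
    energy = [0.0] * num_bands
    for u in range(h):
        # Map the index to a signed, normalized frequency in [-0.5, 0.5].
        fu = (u if u <= h // 2 else u - h) / h
        for v in range(w):
            fv = (v if v <= w // 2 else v - w) / w
            r = math.hypot(fu, fv)
            band = min(int(r / r_max * num_bands), num_bands - 1)
            energy[band] += amp[u][v] ** 2
    return energy


# Usage: a flat patch and a checkerboard patch of the same size.
flat = [[1.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (x + y) for x in range(4)] for y in range(4)]
flat_amp, flat_phase = amplitude_phase(dft2(flat))
checker_amp, checker_phase = amplitude_phase(dft2(checker))
flat_bands = band_energy(flat_amp)
checker_bands = band_energy(checker_amp)
```

In this toy setting the flat patch concentrates its energy in the lowest band and the checkerboard in the highest, which is the kind of structural difference a band-wise spectral representation can expose.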