Cross-Modality Consistency Network for Remote Sensing Text-Image Retrieval
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11075559/ |
| Summary: | Remote sensing cross-modality text-image retrieval aims to retrieve a specific object from a large image gallery given a natural language description, and vice versa. Existing methods mainly capture local and global context information within each modality for cross-modality matching. However, these methods are prone to interference from redundant information, such as background noise and irrelevant words, and neglect the co-occurrence semantic relations between modalities (i.e., the probability that a piece of semantic information co-occurs with other information). To filter out intramodality redundant information and capture intermodality co-occurrence relations, we propose a cross-modality consistency network comprising a text-image attention-conditioned module (TAM) and a co-occurrent features module (CFM). First, TAM fuses visual and textual feature representations through a cross-modality attention mechanism, focusing on semantically similar fine-grained image features to generate aggregated visual representations. Second, CFM estimates co-occurrence probability by measuring fine-grained feature similarity, thereby reinforcing the relations of target-consistent features across modalities. In addition, we propose a cross-modality distinction loss that learns semantic consistency between modalities by compacting intraclass samples and separating interclass samples. Extensive experiments on three benchmarks demonstrate that our approach outperforms state-of-the-art methods. |
| ISSN: | 1939-1404, 2151-1535 |
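The summary describes TAM as attending over fine-grained image features conditioned on the text to produce aggregated visual representations. A minimal NumPy sketch of that kind of cross-modality attention follows; the scaled dot-product form, the shapes, and the function name are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_attention(words, regions):
    """Attend each word over image regions and aggregate visual features.

    words:   (n_words, d) textual token features
    regions: (n_regions, d) fine-grained image region features
    returns: (n_words, d) attention-aggregated visual representation per word
    """
    d = words.shape[1]
    # word-to-region similarity, scaled as in dot-product attention (assumed form)
    scores = words @ regions.T / np.sqrt(d)
    weights = softmax(scores, axis=1)  # each word distributes attention over regions
    return weights @ regions           # attention-weighted visual aggregation

# toy usage with random features
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))   # 4 words, 8-dim features
r = rng.standard_normal((6, 8))   # 6 image regions
agg = text_conditioned_attention(w, r)
print(agg.shape)  # (4, 8)
```

Because the weights are a softmax over regions, each output row is a convex combination of region features, which is how semantically similar regions come to dominate the aggregated representation.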
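The cross-modality distinction loss is described as compacting intraclass samples and separating interclass samples. A hedged margin-based sketch of such a loss, in NumPy with cosine similarity and a margin hyperparameter that are our assumptions rather than the paper's definition:

```python
import numpy as np

def distinction_loss(img, txt, labels, margin=0.2):
    """Margin-based sketch of a cross-modality distinction loss.

    Pulls image/text embeddings of the same class together and penalizes
    different-class pairs whose similarity exceeds `margin`.
    img, txt: (n, d) L2-normalized embeddings; labels: (n,) class ids.
    """
    sim = img @ txt.T                                   # cosine similarities (n, n)
    same = labels[:, None] == labels[None, :]           # intraclass pair mask
    intra = (1.0 - sim[same]).mean() if same.any() else 0.0
    inter = np.maximum(0.0, sim[~same] - margin).mean() if (~same).any() else 0.0
    return intra + inter

# toy usage: 4 paired embeddings from 2 classes
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.standard_normal((4, 8))
y /= np.linalg.norm(y, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1])
print(float(distinction_loss(x, y, labels)))
```

Minimizing the first term drives same-class image/text pairs toward similarity 1 (compacting), while the hinge term pushes cross-class similarities below the margin (separating), matching the compact-intraclass / separate-interclass behavior the summary describes.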