Cross-Modality Consistency Network for Remote Sensing Text-Image Retrieval

Saved in:
Bibliographic Details
Main Authors: Yuchen Sha, Yujian Feng, Miao He, Yichi Jin, Shuai You, Yimu Ji, Fei Wu, Shangdong Liu, Shaoshuai Che
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11075559/
Description
Summary: Remote sensing cross-modality text-image retrieval aims to retrieve a specific object from a large image gallery based on a natural language description, and vice versa. Existing methods mainly capture local and global context information within each modality for cross-modality matching. However, these methods are prone to interference from redundant information, such as background noise and irrelevant words, and fail to capture the co-occurrence semantic relations between modalities (i.e., the probability that a piece of semantic information co-occurs with other information). To filter out intramodality redundant information and capture intermodality co-occurrence relations, we propose a cross-modality consistency network consisting of a text-image attention-conditioned module (TAM) and a co-occurrent features module (CFM). First, TAM lets visual and textual feature representations interact through a cross-modality attention mechanism, focusing on semantically similar fine-grained image features and then generating aggregated visual representations. Second, CFM estimates the co-occurrence probability by measuring fine-grained feature similarity, thereby reinforcing the relations of target-consistent features across modalities. In addition, we propose a cross-modality distinction loss function that learns semantic consistency between modalities by compacting intraclass samples and separating interclass samples. Extensive experiments on three benchmarks demonstrate that our approach outperforms state-of-the-art methods.
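The abstract only outlines the attention-conditioned aggregation and the distinction loss; the snippet below is a minimal PyTorch sketch of those two general ideas, assuming region-level image features and word-level text features. The names (CrossModalAttentionPooling, distinction_loss), the temperature and margin values, and the hard-negative triplet formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionPooling(nn.Module):
    """Sketch of text-conditioned attention over image region features.

    Each word attends to the image regions; regions that are semantically
    close to the text receive higher weights, so the aggregated visual
    vector down-weights background regions (the TAM idea, roughly).
    """
    def __init__(self, dim, temperature=0.1):
        super().__init__()
        self.temperature = temperature
        self.proj_v = nn.Linear(dim, dim)
        self.proj_t = nn.Linear(dim, dim)

    def forward(self, regions, words):
        # regions: (B, R, D) image region features; words: (B, W, D) word features
        v = F.normalize(self.proj_v(regions), dim=-1)                            # (B, R, D)
        t = F.normalize(self.proj_t(words), dim=-1)                              # (B, W, D)
        attn = torch.softmax(t @ v.transpose(1, 2) / self.temperature, dim=-1)   # (B, W, R)
        attended = attn @ regions                                                # text-conditioned visual features
        return attended.mean(dim=1)                                              # aggregated visual representation (B, D)


def distinction_loss(img_emb, txt_emb, margin=0.2):
    """Triplet-style loss with hardest negatives: pulls matched image-text
    pairs together and pushes the most confusing mismatched pairs apart.
    A common stand-in; the paper's exact loss is not given in the abstract."""
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()  # (B, B) similarity matrix
    pos = sim.diag()                                                        # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_i2t = sim.masked_fill(mask, -1.0).max(dim=1).values  # hardest text for each image
    neg_t2i = sim.masked_fill(mask, -1.0).max(dim=0).values  # hardest image for each text
    loss = F.relu(margin + neg_i2t - pos) + F.relu(margin + neg_t2i - pos)
    return loss.mean()
```

In such a setup the attention weights also provide a natural, rough proxy for co-occurrence: fine-grained similarity between word and region features decides how strongly each pair contributes to the aggregated representation, which is close in spirit to what CFM is described as estimating.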
ISSN:1939-1404
2151-1535