Cross-Modality Consistency Network for Remote Sensing Text-Image Retrieval
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11075559/ |
| Summary: | Remote sensing cross-modality text-image retrieval aims to retrieve a specific object from a large image gallery given a natural language description, and vice versa. Existing methods mainly capture local and global context information within each modality for cross-modality matching. However, these methods are prone to interference from redundant information, such as background noise and irrelevant words, and neglect the co-occurrence semantic relations between modalities (i.e., the probability that a piece of semantic information co-occurs with other information). To filter out intramodality redundant information and capture intermodality co-occurrence relations, we propose a cross-modality consistency network comprising a text-image attention-conditioned module (TAM) and a co-occurrent features module (CFM). First, TAM fuses visual and textual feature representations through a cross-modality attention mechanism, focusing on semantically similar fine-grained image features to generate aggregated visual representations. Second, CFM estimates co-occurrence probability by measuring fine-grained feature similarity, thereby reinforcing the relations of target-consistent features across modalities. In addition, we propose a cross-modality distinction loss that learns semantic consistency between modalities by compacting intraclass samples and separating interclass samples. Extensive experiments on three benchmarks demonstrate that our approach outperforms state-of-the-art methods. |
| ISSN: | 1939-1404, 2151-1535 |
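The summary describes TAM as attending over fine-grained image features conditioned on the text to produce aggregated visual representations. A minimal NumPy sketch of that kind of cross-modality attention follows; the scaled dot-product form, the shapes, and the function name are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_attention(words, regions):
    """Attend each word over image regions and aggregate visual features.

    words:   (n_words, d) textual token features
    regions: (n_regions, d) fine-grained image region features
    returns: (n_words, d) attention-aggregated visual representation per word
    """
    d = words.shape[1]
    # word-to-region similarity, scaled as in dot-product attention (assumed form)
    scores = words @ regions.T / np.sqrt(d)
    weights = softmax(scores, axis=1)  # each word distributes attention over regions
    return weights @ regions           # attention-weighted visual aggregation

# toy usage with random features
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))   # 4 words, 8-dim features
r = rng.standard_normal((6, 8))   # 6 image regions
agg = text_conditioned_attention(w, r)
print(agg.shape)  # (4, 8)
```

Because the weights are a softmax over regions, each output row is a convex combination of region features, which is how semantically similar regions come to dominate the aggregated representation.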
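The cross-modality distinction loss is described as compacting intraclass samples and separating interclass samples. A hedged margin-based sketch of such a loss, in NumPy with cosine similarity and a margin hyperparameter that are our assumptions rather than the paper's definition:

```python
import numpy as np

def distinction_loss(img, txt, labels, margin=0.2):
    """Margin-based sketch of a cross-modality distinction loss.

    Pulls image/text embeddings of the same class together and penalizes
    different-class pairs whose similarity exceeds `margin`.
    img, txt: (n, d) L2-normalized embeddings; labels: (n,) class ids.
    """
    sim = img @ txt.T                                   # cosine similarities (n, n)
    same = labels[:, None] == labels[None, :]           # intraclass pair mask
    intra = (1.0 - sim[same]).mean() if same.any() else 0.0
    inter = np.maximum(0.0, sim[~same] - margin).mean() if (~same).any() else 0.0
    return intra + inter

# toy usage: 4 paired embeddings from 2 classes
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.standard_normal((4, 8))
y /= np.linalg.norm(y, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1])
print(float(distinction_loss(x, y, labels)))
```

Minimizing the first term drives same-class image/text pairs toward similarity 1 (compacting), while the hinge term pushes cross-class similarities below the margin (separating), matching the compact-intraclass / separate-interclass behavior the summary describes.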