Uncertainty-aware coarse-to-fine alignment for text-image person retrieval
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-04-01 |
| Series: | Visual Intelligence |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s44267-025-00078-x |
| Summary: | Abstract Text-to-image person retrieval, a fine-grained cross-modal retrieval problem, aims to search an image library for person images that match a given textual caption. Existing text-to-image person retrieval methods usually use fixed-point embeddings to express the semantics of the two modalities and perform multi-granularity alignment between modalities in the embedding space. However, owing to the inherent mutual one-to-many correspondence between images and texts, fixed-point embedding methods often struggle to capture this relationship adequately, leading to erroneous retrieval results. To address this problem, we propose a novel uncertainty-aware coarse-to-fine alignment method for accurate text-to-image person retrieval, which first maps fixed-point embeddings to probability distributions and then aligns the two modalities at coarse-to-fine granularity in terms of distributions and sampling points. Specifically, we first introduce two contrastive learning tasks, distribution contrastive learning and point contrastive learning, to achieve uncertainty-aware coarse-grained inter-modal alignment. The distribution contrastive learning task ensures that distributions sharing the same identity are as similar as possible across modalities. The point contrastive learning task performs contrastive learning over inter-modal and intra-modal sampling points, which not only models rich and diverse cross-modal associations but also refines the learned distributions. To meet the fine-grained association requirements of text-to-image person retrieval, we design an uncertainty-aware attribute-masking language reconstruction task, which achieves fine-grained alignment by randomly masking attribute words in the text and reconstructing them through inter-modal sampling-point interactions. Extensive experiments on two public datasets demonstrate the superior performance of our method. |
| ISSN: | 2097-3330, 2731-9008 |
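
The summary describes mapping each fixed-point embedding to a probability distribution and drawing sampling points from it. Below is a minimal PyTorch sketch of one plausible realization, assuming diagonal Gaussian distributions and reparameterized sampling; the module name, head layout, and sample count are hypothetical, since the record does not specify the parameterization.

```python
import torch
import torch.nn as nn

class ProbabilisticEmbedding(nn.Module):
    """Hypothetical sketch: map a fixed-point embedding to a diagonal
    Gaussian and draw K sampling points via the reparameterization trick."""

    def __init__(self, dim: int, num_samples: int = 8):
        super().__init__()
        self.mu_head = nn.Linear(dim, dim)      # distribution mean
        self.logvar_head = nn.Linear(dim, dim)  # log diagonal variance (uncertainty)
        self.num_samples = num_samples

    def forward(self, x: torch.Tensor):
        mu = self.mu_head(x)                    # (B, D)
        logvar = self.logvar_head(x)            # (B, D)
        std = torch.exp(0.5 * logvar)
        # K sampling points per embedding, differentiable w.r.t. mu and std.
        eps = torch.randn(self.num_samples, *mu.shape, device=x.device)
        samples = mu.unsqueeze(0) + eps * std.unsqueeze(0)  # (K, B, D)
        return mu, logvar, samples
```

The same head would presumably be applied to both the image and text encoder outputs, so that each modality yields a distribution plus a set of sampling points.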
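The two coarse-grained tasks, distribution contrastive learning and point contrastive learning, could then be expressed as InfoNCE-style losses. The sketch below assumes a closed-form squared 2-Wasserstein distance between diagonal Gaussians for the distribution term and cosine similarity of averaged sampling points for the point term; the intra-modal point contrast mentioned in the summary is omitted for brevity, and neither distance choice is confirmed by the record.

```python
import torch
import torch.nn.functional as F

def w2_squared(mu_a, logvar_a, mu_b, logvar_b):
    """Pairwise squared 2-Wasserstein distance between diagonal Gaussians:
    ||mu_a - mu_b||^2 + ||std_a - std_b||^2 (closed form for the diagonal case)."""
    std_a, std_b = torch.exp(0.5 * logvar_a), torch.exp(0.5 * logvar_b)
    return torch.cdist(mu_a, mu_b, p=2) ** 2 + torch.cdist(std_a, std_b, p=2) ** 2

def distribution_contrastive_loss(mu_i, logvar_i, mu_t, logvar_t, tau=0.07):
    """InfoNCE over distributions: matched image/text pairs lie on the diagonal."""
    sim = -w2_squared(mu_i, logvar_i, mu_t, logvar_t) / tau
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

def point_contrastive_loss(samples_i, samples_t, tau=0.07):
    """InfoNCE over sampling points, using the mean of the K normalized
    samples as each item's representation. samples_*: (K, B, D)."""
    z_i = F.normalize(samples_i, dim=-1).mean(dim=0)  # (B, D)
    z_t = F.normalize(samples_t, dim=-1).mean(dim=0)
    sim = z_i @ z_t.t() / tau
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
```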
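For the fine-grained task, the summary says attribute words are randomly masked in the caption and reconstructed through inter-modal sampling-point interactions. The sketch below assumes attribute positions come from an external attribute vocabulary and uses a single cross-attention layer from text tokens to image sampling points; both are illustrative assumptions rather than the paper's actual architecture.

```python
import random
import torch
import torch.nn as nn

def mask_attribute_tokens(token_ids, attribute_positions, mask_id, p=0.5):
    """Randomly mask attribute words (e.g., color or clothing terms).
    attribute_positions is assumed to come from an attribute vocabulary lookup."""
    masked = token_ids.clone()
    targets = torch.full_like(token_ids, -100)  # -100 = ignored by the loss
    for pos in attribute_positions:
        if random.random() < p:
            targets[pos] = token_ids[pos]
            masked[pos] = mask_id
    return masked, targets

class AttributeReconstructionHead(nn.Module):
    """Hypothetical head: masked text tokens attend to image-side sampling
    points, then a classifier predicts the original attribute word."""

    def __init__(self, dim: int, vocab_size: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, text_tokens, image_samples, targets):
        # text_tokens: (B, T, D); image_samples: (B, K, D); targets: (B, T)
        attended, _ = self.cross_attn(text_tokens, image_samples, image_samples)
        logits = self.classifier(attended)      # (B, T, vocab)
        return nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100
        )
```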