Uncertainty-aware coarse-to-fine alignment for text-image person retrieval

Bibliographic Details
Main Authors: Yifei Deng, Zhengyu Chen, Chenglong Li, Jin Tang
Format: Article
Language: English
Published: Springer, 2025-04-01
Series: Visual Intelligence
Online Access: https://doi.org/10.1007/s44267-025-00078-x
Description
Summary: Text-to-image person retrieval, a fine-grained cross-modal retrieval problem, aims to search an image library for person images that match a given textual caption. Existing text-to-image person retrieval methods usually use fixed-point embeddings to express the semantics of the two modalities and perform multi-granularity alignment between modalities in the embedding space. However, owing to the inherent one-to-many correspondences between images and texts, fixed-point embedding methods often fail to adequately capture this relationship, leading to erroneous retrieval results. To address this problem, we propose a novel uncertainty-aware coarse-to-fine alignment method that first maps fixed-point embeddings to probability distributions and then aligns the two modalities at both the distribution and sampling-point levels, from coarse to fine granularity, for accurate text-to-image person retrieval. Specifically, we introduce two contrastive learning tasks, distribution contrastive learning and point contrastive learning, to achieve uncertainty-aware coarse-grained inter-modal alignment. The distribution contrastive learning task ensures that distributions with the same identity are as similar as possible across modalities. The point contrastive learning task performs contrastive learning over inter-modal and intra-modal sampling points, which not only models rich and diverse cross-modal associations but also refines the learned distributions. To meet the fine-grained association requirements of text-to-image person retrieval, we design an uncertainty-aware attribute-masked language reconstruction task, which achieves fine-grained alignment by randomly masking attribute words in the text and reconstructing them via inter-modal sample-point interactions. Extensive experiments on two public datasets demonstrate the superior performance of our method.
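The probabilistic-embedding idea sketched in the abstract can be illustrated with a minimal numpy example. Everything below is an assumption for illustration only: the linear projection heads in `to_gaussian`, the number of samples `k`, and the symmetric InfoNCE-style loss are not the paper's exact formulation, and the paper's distribution-level contrast would additionally compare the Gaussians themselves (e.g. via a distribution distance) rather than only their samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_gaussian(x, w_mu, w_logvar):
    # Project a fixed-point embedding to Gaussian parameters (mean, log-variance).
    # w_mu and w_logvar are hypothetical linear heads, not the paper's exact design.
    return x @ w_mu, x @ w_logvar

def sample_points(mu, logvar, k, rng):
    # Reparameterization trick: k samples per embedding, shape (k, B, d).
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal((k,) + mu.shape)
    return mu + eps * std

def info_nce(a, b, tau=0.1):
    # Symmetric InfoNCE-style loss: matched rows of a and b are positives,
    # all other rows in the batch are negatives.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau
    idx = np.arange(len(a))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

# Toy batch: B matched image/text embeddings of dimension d.
B, d, k = 4, 16, 5
img_feat = rng.standard_normal((B, d))
txt_feat = rng.standard_normal((B, d))
w_mu = rng.standard_normal((d, d)) * 0.1
w_logvar = rng.standard_normal((d, d)) * 0.1

img_mu, img_lv = to_gaussian(img_feat, w_mu, w_logvar)
txt_mu, txt_lv = to_gaussian(txt_feat, w_mu, w_logvar)
img_pts = sample_points(img_mu, img_lv, k, rng)  # (k, B, d)
txt_pts = sample_points(txt_mu, txt_lv, k, rng)

# Cross-modal point contrastive loss, averaged over the k sample draws.
point_loss = np.mean([info_nce(img_pts[i], txt_pts[i]) for i in range(k)])
```

Sampling several points per distribution is what lets a single caption match visually diverse images of the same identity: each draw represents one plausible realization of the uncertain embedding, and the contrastive loss over draws aligns the whole family of samples rather than a single fixed point.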
ISSN: 2097-3330, 2731-9008