Progressive multi-subspace fusion for text-image matching

Abstract Text-image cross-modal matching is a core challenge in multimodal machine learning, aiming to enable efficient retrieval of images and texts across different modalities. The difficulty of this task stems from the inherent gap between text and image representations, which can lead to suboptimal retrieval performance. Traditional approaches attempt to learn a shared representation space in which image and text can be compared directly. However, they often fail to account for the varying levels of semantic information captured in different layers of the encoders, resulting in inadequate alignment between the modalities. To address these limitations, we propose a novel approach, Progressive Multi-Subspace Fusion (PMSF), for text-image matching. Our model reduces the model gap through a progressive learning process, starting with shallow representations and moving to deeper layers. We use a dual-tower structure to encode multi-level features for both image and text, which are then mapped to corresponding auxiliary subspaces. These subspaces are fused through an adaptive GPO pooling strategy, enabling joint learning of a shared representation space. Experimental results on benchmark datasets, including Flickr30K and MSCOCO, show that PMSF significantly improves retrieval performance, achieving Rsum scores of 516.9 and 510.7 and outperforming 23 state-of-the-art methods.
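The abstract describes a pipeline: multi-level features from dual-tower encoders are projected into auxiliary subspaces, which are then fused by an adaptive pooling strategy into one shared embedding. A minimal NumPy sketch of that general idea follows; the layer widths, the subspace dimension, the softmax-weighted pooling used as a stand-in for GPO, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def project_to_subspaces(level_feats, proj_mats):
    # Map each encoder layer's feature vector into its auxiliary subspace.
    # level_feats: list of (d_l,) vectors; proj_mats: list of (d_sub, d_l) matrices.
    return [W @ f for W, f in zip(proj_mats, level_feats)]

def fuse_subspaces(sub_embs, logits):
    # Softmax-weighted pooling over subspace embeddings (an assumed
    # simplification of adaptive GPO pooling), then L2-normalization
    # so embeddings can be compared by cosine similarity.
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    fused = sum(wi * e for wi, e in zip(w, sub_embs))
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
dims = [256, 512, 768]        # shallow-to-deep layer widths (assumed)
d_sub = 128                   # shared subspace dimension (assumed)
feats = [rng.standard_normal(d) for d in dims]
projs = [rng.standard_normal((d_sub, d)) / np.sqrt(d) for d in dims]
alpha = np.zeros(len(dims))   # pooling logits; learned during training

emb = fuse_subspaces(project_to_subspaces(feats, projs), alpha)
```

In this toy form, an image tower and a text tower would each run the same projection-and-fusion step with their own learned matrices, and matching would score cosine similarity between the two fused embeddings.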

Bibliographic Details
Main Authors: Haoming Wang, Li Zhu, Wentao Ma, Qian’ge Guo
Format: Article
Language: English
Published: Springer 2025-06-01
Series: Complex & Intelligent Systems
Subjects:
Online Access: https://doi.org/10.1007/s40747-025-01946-1
_version_ 1849331373924417536
author Haoming Wang
Li Zhu
Wentao Ma
Qian’ge Guo
author_facet Haoming Wang
Li Zhu
Wentao Ma
Qian’ge Guo
author_sort Haoming Wang
collection DOAJ
description Abstract Text-image cross-modal matching is a core challenge in multimodal machine learning, aiming to enable efficient retrieval of images and texts across different modalities. The difficulty of this task stems from the inherent gap between text and image representations, which can lead to suboptimal retrieval performance. Traditional approaches attempt to learn a shared representation space in which image and text can be compared directly. However, they often fail to account for the varying levels of semantic information captured in different layers of the encoders, resulting in inadequate alignment between the modalities. To address these limitations, we propose a novel approach, Progressive Multi-Subspace Fusion (PMSF), for text-image matching. Our model reduces the model gap through a progressive learning process, starting with shallow representations and moving to deeper layers. We use a dual-tower structure to encode multi-level features for both image and text, which are then mapped to corresponding auxiliary subspaces. These subspaces are fused through an adaptive GPO pooling strategy, enabling joint learning of a shared representation space. Experimental results on benchmark datasets, including Flickr30K and MSCOCO, show that PMSF significantly improves retrieval performance, achieving Rsum scores of 516.9 and 510.7 and outperforming 23 state-of-the-art methods.
format Article
id doaj-art-f2a130956e4c43659d7551a72a939fd5
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-06-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-f2a130956e4c43659d7551a72a939fd52025-08-20T03:46:37ZengSpringerComplex & Intelligent Systems2199-45362198-60532025-06-0111811610.1007/s40747-025-01946-1Progressive multi-subspace fusion for text-image matchingHaoming Wang0Li Zhu1Wentao Ma2Qian’ge Guo3School of Electronic & Information Engineering, Xi’an Jiaotong UniversityEngineering University of People’s Armed PoliceSchool of Information and Artificial Intelligence, Anhui Agricultural UniversityEngineering University of People’s Armed PoliceAbstract Text-image cross-modal matching is a core challenge in multimodal machine learning, aiming to enable efficient retrieval of images and texts across different modalities. The difficulty in this task stems from the inherent gap between text and image representations, which can lead to suboptimal retrieval performance. Traditional approaches attempt to learn a shared representation space where both image and text can be directly compared. However, they often fail to account for the varying levels of semantic information captured in different layers of the encoders, resulting in inadequate alignment between the modalities. To address these limitations, we propose a novel approach called Progressive Multi-Subspace Fusion, dubbed PMSF for text-image matching. Our model reduces the model gap by using a progressive learning process, starting with shallow representations and moving to deeper layers. We use a dual-tower structure to encode multi-level features for both image and text, which are then mapped to corresponding auxiliary subspaces. These subspaces are fused through an adaptive GPO pooling strategy, enabling joint learning of a shared representation space. Experimental results on benchmark datasets, including Flickr30K and MSCOCO, show that PMSF significantly improves retrieval performance, achieving a Rsum score of 516.9 and 510.7, outperforming 23 state-of-the-art methods.https://doi.org/10.1007/s40747-025-01946-1Representation spaceModel gapText-image matching
spellingShingle Haoming Wang
Li Zhu
Wentao Ma
Qian’ge Guo
Progressive multi-subspace fusion for text-image matching
Complex & Intelligent Systems
Representation space
Model gap
Text-image matching
title Progressive multi-subspace fusion for text-image matching
title_full Progressive multi-subspace fusion for text-image matching
title_fullStr Progressive multi-subspace fusion for text-image matching
title_full_unstemmed Progressive multi-subspace fusion for text-image matching
title_short Progressive multi-subspace fusion for text-image matching
title_sort progressive multi subspace fusion for text image matching
topic Representation space
Model gap
Text-image matching
url https://doi.org/10.1007/s40747-025-01946-1
work_keys_str_mv AT haomingwang progressivemultisubspacefusionfortextimagematching
AT lizhu progressivemultisubspacefusionfortextimagematching
AT wentaoma progressivemultisubspacefusionfortextimagematching
AT qiangeguo progressivemultisubspacefusionfortextimagematching