Progressive multi-subspace fusion for text-image matching
Abstract: Text-image cross-modal matching is a core challenge in multimodal machine learning, aiming to enable efficient retrieval of images and texts across different modalities. The difficulty stems from the inherent gap between text and image representations, which can lead to suboptimal retrieval performance. Traditional approaches attempt to learn a shared representation space in which images and texts can be compared directly, but they often fail to account for the varying levels of semantic information captured in different layers of the encoders, resulting in inadequate alignment between the modalities. To address these limitations, we propose a novel approach called Progressive Multi-Subspace Fusion (PMSF) for text-image matching. Our model reduces the model gap through a progressive learning process, starting with shallow representations and moving to deeper layers. A dual-tower structure encodes multi-level features for both image and text, which are then mapped to corresponding auxiliary subspaces. These subspaces are fused through an adaptive GPO pooling strategy, enabling joint learning of a shared representation space. Experimental results on the Flickr30K and MSCOCO benchmarks show that PMSF significantly improves retrieval performance, achieving Rsum scores of 516.9 and 510.7 respectively and outperforming 23 state-of-the-art methods.
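The fusion pipeline sketched in the abstract (per-level subspace projection followed by adaptive pooling into one shared space) can be made concrete with a short PyTorch sketch. This is a minimal illustration under assumptions, not the authors' implementation: the class names `SubspaceHead`, `SimpleGPO`, and `PMSFTower` and all dimensions are hypothetical, the learned scalar-weight pooling is a simplified stand-in for the paper's GPO operator, and the progressive shallow-to-deep training schedule is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubspaceHead(nn.Module):
    """Projects one encoder level into its auxiliary subspace (hypothetical name)."""

    def __init__(self, in_dim: int, sub_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, sub_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so subspace embeddings are comparable across levels.
        return F.normalize(self.proj(x), dim=-1)


class SimpleGPO(nn.Module):
    """Simplified stand-in for adaptive GPO pooling: learns the weights that
    fuse the per-level subspace embeddings instead of fixing mean/max-pooling."""

    def __init__(self, num_levels: int):
        super().__init__()
        self.weight_logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_levels, sub_dim)
        w = torch.softmax(self.weight_logits, dim=0)  # adaptive pooling weights
        return (w.view(1, -1, 1) * feats).sum(dim=1)  # fused shared embedding


class PMSFTower(nn.Module):
    """One tower of the dual-tower design: multi-level features -> subspaces -> fusion."""

    def __init__(self, level_dims, sub_dim: int = 256):
        super().__init__()
        self.heads = nn.ModuleList([SubspaceHead(d, sub_dim) for d in level_dims])
        self.pool = SimpleGPO(num_levels=len(level_dims))

    def forward(self, level_feats):
        # level_feats: list of (batch, dim_l) tensors, ordered shallow -> deep.
        subs = torch.stack([h(f) for h, f in zip(self.heads, level_feats)], dim=1)
        return F.normalize(self.pool(subs), dim=-1)


# Usage: fuse three encoder levels per tower, then score pairs by cosine similarity.
img_tower = PMSFTower(level_dims=[512, 768, 1024])
txt_tower = PMSFTower(level_dims=[512, 768, 768])
img_emb = img_tower([torch.randn(4, d) for d in (512, 768, 1024)])
txt_emb = txt_tower([torch.randn(4, d) for d in (512, 768, 768)])
sim = img_emb @ txt_emb.t()  # (4, 4) matrix of text-image matching scores
```

The design point the abstract emphasizes is that the fusion weights are learned rather than fixed, so each modality can decide how much each encoder depth contributes to the shared representation space.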
| Main Authors: | Haoming Wang, Li Zhu, Wentao Ma, Qian’ge Guo |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-06-01 |
| Series: | Complex & Intelligent Systems |
| ISSN: | 2199-4536, 2198-6053 |
| Subjects: | Representation space; Model gap; Text-image matching |
| Online Access: | https://doi.org/10.1007/s40747-025-01946-1 |
Author affiliations:

| Author | Affiliation |
|---|---|
| Haoming Wang | School of Electronic & Information Engineering, Xi’an Jiaotong University |
| Li Zhu | Engineering University of People’s Armed Police |
| Wentao Ma | School of Information and Artificial Intelligence, Anhui Agricultural University |
| Qian’ge Guo | Engineering University of People’s Armed Police |