Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook

Large-scale image and text representation learning is critical in determining the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-to-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training. The focus is on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook of large-scale image and text representation learning.
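The pre-train-then-fine-tune paradigm the abstract describes rests on aligning image and text embeddings learned from large batches of paired data. As a purely illustrative sketch (a common symmetric contrastive objective in this literature, not the specific formulation of any model surveyed in the article), a CLIP-style loss over toy NumPy embeddings looks like this:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))           # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        # numerically stable log-softmax cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
print(contrastive_loss(aligned, aligned))                  # low: every pair matches
print(contrastive_loss(aligned, rng.normal(size=(4, 8))))  # higher: random pairing
```

The loss pulls each image embedding toward its paired caption and pushes it away from the other captions in the batch; the resulting generic representations are what downstream tasks then fine-tune.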


Bibliographic Details
Main Authors: Yang Qin, Shuxue Ding, Huiming Xie
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10883956/
author Yang Qin
Shuxue Ding
Huiming Xie
collection DOAJ
description Large-scale image and text representation learning is critical in determining the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-to-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training. The focus is on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook of large-scale image and text representation learning.
format Article
id doaj-art-1ea68f4a974d499ca5df9c006e5285c1
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling IEEE Access, vol. 13, pp. 49922-49933, 2025. doi:10.1109/ACCESS.2025.3541194 (IEEE document 10883956)
author_orcid Yang Qin (https://orcid.org/0000-0001-7510-6596); Shuxue Ding (https://orcid.org/0000-0002-4963-3883); Huiming Xie (https://orcid.org/0009-0008-9542-6907)
author_affiliation Yang Qin and Shuxue Ding: Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, China; Huiming Xie: Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, Guilin, China
title Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook
topic Image-and-text
large-scale representation learning
pre-training
transformer
self-supervised learning
url https://ieeexplore.ieee.org/document/10883956/