Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook

Large-scale image and text representation learning is critical in determining the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-to-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training. The focus is on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook of large-scale image and text representation learning.
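The pre-train-then-fine-tune paradigm the abstract describes rests on aligning image and text embeddings learned from large batches of paired data. As a purely illustrative sketch (a common symmetric contrastive objective in this literature, not the specific formulation of any model surveyed in the article), a CLIP-style loss over toy NumPy embeddings looks like this:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))           # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        # numerically stable log-softmax cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
print(contrastive_loss(aligned, aligned))                  # low: every pair matches
print(contrastive_loss(aligned, rng.normal(size=(4, 8))))  # higher: random pairing
```

The loss pulls each image embedding toward its paired caption and pushes it away from the other captions in the batch; the resulting generic representations are what downstream tasks then fine-tune.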


Bibliographic Details
Main Authors: Yang Qin, Shuxue Ding, Huiming Xie
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10883956/
author Yang Qin
Shuxue Ding
Huiming Xie
collection DOAJ
description Large-scale image and text representation learning is critical in determining the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-to-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training. The focus is on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook of large-scale image and text representation learning.
format Article
id doaj-art-1ea68f4a974d499ca5df9c006e5285c1
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling IEEE Access, vol. 13, pp. 49922-49933, 2025. doi:10.1109/ACCESS.2025.3541194 (IEEE document 10883956)
author_orcid Yang Qin (https://orcid.org/0000-0001-7510-6596); Shuxue Ding (https://orcid.org/0000-0002-4963-3883); Huiming Xie (https://orcid.org/0009-0008-9542-6907)
author_affiliation Yang Qin and Shuxue Ding: Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, China; Huiming Xie: Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, Guilin, China
title Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook
topic Image-and-text
large-scale representation learning
pre-training
transformer
self-supervised learning
url https://ieeexplore.ieee.org/document/10883956/