Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook
Large-scale image and text representation learning is critical in determining the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-to-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training. The focus is on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook of large-scale image and text representation learning.
| Main Authors: | Yang Qin, Shuxue Ding, Huiming Xie |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Image-and-text; large-scale representation learning; pre-training; transformer; self-supervised learning |
| Online Access: | https://ieeexplore.ieee.org/document/10883956/ |
| _version_ | 1850092995593895936 |
|---|---|
| author | Yang Qin Shuxue Ding Huiming Xie |
| author_facet | Yang Qin Shuxue Ding Huiming Xie |
| author_sort | Yang Qin |
| collection | DOAJ |
| description | Large-scale image and text representation learning is critical in determining the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-to-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training. The focus is on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook of large-scale image and text representation learning. |
| format | Article |
| id | doaj-art-1ea68f4a974d499ca5df9c006e5285c1 |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | Record ID: doaj-art-1ea68f4a974d499ca5df9c006e5285c1; indexed 2025-08-20T02:42:01Z; language: eng; publisher: IEEE; series: IEEE Access; ISSN: 2169-3536; published 2025-01-01, vol. 13, pp. 49922–49933; DOI: 10.1109/ACCESS.2025.3541194; IEEE document 10883956. Title: Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook. Authors: Yang Qin (https://orcid.org/0000-0001-7510-6596), Shuxue Ding (https://orcid.org/0000-0002-4963-3883), Huiming Xie (https://orcid.org/0009-0008-9542-6907). Affiliations: Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, China (Qin; Ding); Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, Guilin, China (Xie). Abstract: Large-scale image and text representation learning is critical in determining the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-to-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training. The focus is on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook of large-scale image and text representation learning. Online access: https://ieeexplore.ieee.org/document/10883956/. Keywords: Image-and-text; large-scale representation learning; pre-training; transformer; self-supervised learning |
| spellingShingle | Yang Qin Shuxue Ding Huiming Xie Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook IEEE Access Image-and-text large-scale representation learning pre-training transformer self-supervised learning |
| title | Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook |
| title_full | Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook |
| title_fullStr | Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook |
| title_full_unstemmed | Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook |
| title_short | Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook |
| title_sort | advancements in large scale image and text representation learning a comprehensive review and outlook |
| topic | Image-and-text large-scale representation learning pre-training transformer self-supervised learning |
| url | https://ieeexplore.ieee.org/document/10883956/ |
| work_keys_str_mv | AT yangqin advancementsinlargescaleimageandtextrepresentationlearningacomprehensivereviewandoutlook AT shuxueding advancementsinlargescaleimageandtextrepresentationlearningacomprehensivereviewandoutlook AT huimingxie advancementsinlargescaleimageandtextrepresentationlearningacomprehensivereviewandoutlook |