The CLIP-GPT Image Captioning Model Integrated with Global Semantics

Image captioning is a method for automatically generating language descriptions for images. Cross-modal semantic consistency is the core issue in shared-subspace embedding when bridging pre-trained models from computer vision and natural language processing to construct image captioning models. This paper introduces a new method that breaks through the limitation of visual feature classification by dividing images into patches that serve as visual semantic units for open-vocabulary cross-modal association with language features. The method combines two loss functions, masked language modeling and image-text matching, and selects hard negative samples to train the cross-modal hop network to extract consistent global semantics, improving the accuracy with which highly similar image and text feature points are distinguished within a neighborhood of the shared subspace. Experimental results on the MS COCO and Flickr30k datasets show that the model outperforms other models that also use CLIP + GPT to generate image captions as well as other mainstream methods, demonstrating the effectiveness of the proposed approach.

Bibliographic Details
Main Authors: TAO Rui, REN Honge, CAO Haiyan
Format: Article
Language: zho
Published: Harbin University of Science and Technology Publications 2024-04-01
Series: Journal of Harbin University of Science and Technology
Subjects: cross-modal, image captioning, pre-training model, shared subspace, semantic alignment
Online Access: https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2307
author TAO Rui
REN Honge
CAO Haiyan
author_facet TAO Rui
REN Honge
CAO Haiyan
author_sort TAO Rui
collection DOAJ
description Image captioning is a method for automatically generating language descriptions for images. Cross-modal semantic consistency is the core issue in shared-subspace embedding when bridging pre-trained models from computer vision and natural language processing to construct image captioning models. This paper introduces a new method that breaks through the limitation of visual feature classification by dividing images into patches that serve as visual semantic units for open-vocabulary cross-modal association with language features. The method combines two loss functions, masked language modeling and image-text matching, and selects hard negative samples to train the cross-modal hop network to extract consistent global semantics, improving the accuracy with which highly similar image and text feature points are distinguished within a neighborhood of the shared subspace. Experimental results on the MS COCO and Flickr30k datasets show that the model outperforms other models that also use CLIP + GPT to generate image captions as well as other mainstream methods, demonstrating the effectiveness of the proposed approach.
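The training objective sketched in the abstract, an image-text matching loss over hard in-batch negatives combined with masked language modeling, can be illustrated as follows. This is a minimal, hypothetical PyTorch sketch under assumed tensor shapes and a placeholder fusion_head; it is not the authors' implementation and omits the patch-level CLIP encoding and GPT decoding stages.

```python
# Hypothetical sketch: ITM + MLM objective with in-batch hard-negative selection.
# All names and shapes here are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def itm_mlm_loss(image_emb, text_emb, fusion_head, mlm_logits, mlm_labels):
    """image_emb, text_emb: (B, D) pooled features of matched image-text pairs;
    mlm_logits: (B, T, V) token predictions; mlm_labels: (B, T), -100 = unmasked."""
    B = image_emb.size(0)

    # Hard-negative mining: for each image, pick the most similar non-matching
    # caption in the batch (no gradient is needed for the selection itself).
    with torch.no_grad():
        sim = image_emb @ text_emb.t()        # (B, B) similarity matrix
        sim.fill_diagonal_(float("-inf"))     # exclude the true pairs
        hard_neg_idx = sim.argmax(dim=1)      # hardest negative caption per image

    # Image-text matching: classify matched vs. hard-negative pairs.
    pos_pairs = torch.cat([image_emb, text_emb], dim=-1)
    neg_pairs = torch.cat([image_emb, text_emb[hard_neg_idx]], dim=-1)
    itm_logits = fusion_head(torch.cat([pos_pairs, neg_pairs], dim=0))   # (2B, 2)
    itm_labels = torch.cat([torch.ones(B, dtype=torch.long),
                            torch.zeros(B, dtype=torch.long)]).to(itm_logits.device)
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    # Masked language modeling over the caption tokens.
    loss_mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                               mlm_labels.reshape(-1), ignore_index=-100)
    return loss_itm + loss_mlm

# Toy usage with random tensors and a linear matching head (sizes are arbitrary):
B, D, T, V = 8, 512, 20, 30522
labels = torch.randint(0, V, (B, T))
labels[:, T // 2:] = -100                     # pretend only half the tokens are masked
head = torch.nn.Linear(2 * D, 2)
loss = itm_mlm_loss(torch.randn(B, D), torch.randn(B, D), head,
                    torch.randn(B, T, V), labels)
```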
format Article
id doaj-art-25e486eb20964d4e8d31d70ee776e5c7
institution DOAJ
issn 1007-2683
language zho
publishDate 2024-04-01
publisher Harbin University of Science and Technology Publications
record_format Article
series Journal of Harbin University of Science and Technology
spelling Journal of Harbin University of Science and Technology, ISSN 1007-2683, Vol. 29, No. 2 (2024-04-01), pp. 16-24. DOI: 10.15938/j.jhust.2024.02.003
The CLIP-GPT Image Captioning Model Integrated with Global Semantics
TAO Rui (College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; College of Computer Science, Hulunbuir University, Hulunbuir 021008, China)
REN Honge (College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; Heilongjiang Forestry Intelligent Equipment Engineering Research Center, Harbin 150040, China)
CAO Haiyan (College of Computer Science, Hulunbuir University, Hulunbuir 021008, China)
Published by Harbin University of Science and Technology Publications. Language: zho.
Online access: https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2307
Keywords: cross-modal; image captioning; pre-training model; shared subspace; semantic alignment
spellingShingle TAO Rui
REN Honge
CAO Haiyan
The CLIP-GPT Image Captioning Model Integrated with Global Semantics
Journal of Harbin University of Science and Technology
cross-modal
image captioning
pre-training model
shared subspace
semantic alignment
title The CLIP-GPT Image Captioning Model Integrated with Global Semantics
title_full The CLIP-GPT Image Captioning Model Integrated with Global Semantics
title_fullStr The CLIP-GPT Image Captioning Model Integrated with Global Semantics
title_full_unstemmed The CLIP-GPT Image Captioning Model Integrated with Global Semantics
title_short The CLIP-GPT Image Captioning Model Integrated with Global Semantics
title_sort clip gpt image captioning model integrated with global semantics
topic cross-modal
image captioning
pre-training model
shared subspace
semantic alignment
url https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2307
work_keys_str_mv AT taorui theclipgptimagecaptioningmodelintegratedwithglobalsemantics
AT renhonge theclipgptimagecaptioningmodelintegratedwithglobalsemantics
AT caohaiyan theclipgptimagecaptioningmodelintegratedwithglobalsemantics
AT taorui clipgptimagecaptioningmodelintegratedwithglobalsemantics
AT renhonge clipgptimagecaptioningmodelintegratedwithglobalsemantics
AT caohaiyan clipgptimagecaptioningmodelintegratedwithglobalsemantics