The CLIP-GPT Image Captioning Model Integrated with Global Semantics

Bibliographic Details
Main Authors: TAO Rui, REN Honge, CAO Haiyan
Format: Article
Language: Chinese (zho)
Published: Harbin University of Science and Technology Publications 2024-04-01
Series: Journal of Harbin University of Science and Technology
Online Access: https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2307
Description
Summary: Image captioning is the task of automatically generating natural-language descriptions for images. When pre-trained computer vision and natural language processing models are bridged to build an image captioning model, the core issue is cross-modal semantic consistency in the shared embedding subspace. This paper introduces a new method that moves beyond closed-set visual feature classification by dividing images into patches that serve as visual semantic units for open-vocabulary cross-modal association with language features. It combines a masked language modeling loss with an image-text matching loss, selecting hard negative samples to train the cross-modal hop network to extract consistent global semantics, which improves its ability to distinguish highly similar image and text feature points that lie close together in the subspace. Experimental results on the MS COCO and Flickr30k datasets show that the model outperforms other CLIP + GPT caption generators as well as other mainstream methods, demonstrating the effectiveness of the proposed approach.
ISSN:1007-2683
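
The record above only summarizes the method; the paper's implementation details are not included here. As a rough illustration only, the following PyTorch sketch shows one generic way that patch-level features from a frozen CLIP image encoder can be bridged to a GPT-style decoder for caption generation. The class name PrefixCaptioner and all dimensions and layer choices are hypothetical assumptions for this sketch, not taken from the paper.

```python
# Illustrative sketch only (PyTorch). Names and hyperparameters are hypothetical;
# the point is the general pattern: CLIP patch features are projected into the
# language model's embedding space and used as a visual prefix for decoding.
import torch
import torch.nn as nn

class PrefixCaptioner(nn.Module):
    def __init__(self, clip_dim=768, gpt_dim=768, vocab_size=50257):
        super().__init__()
        # Project patch-level visual semantic units into the decoder's
        # embedding space (open-vocabulary association, no fixed label set).
        self.visual_proj = nn.Linear(clip_dim, gpt_dim)
        self.token_emb = nn.Embedding(vocab_size, gpt_dim)
        # Stand-in for a pretrained GPT decoder; a real system would load GPT-2.
        layer = nn.TransformerDecoderLayer(d_model=gpt_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(gpt_dim, vocab_size)

    def forward(self, patch_feats, caption_ids):
        # patch_feats: (B, num_patches, clip_dim) from a frozen CLIP image encoder
        # caption_ids: (B, T) token ids of the caption so far
        prefix = self.visual_proj(patch_feats)             # (B, num_patches, gpt_dim)
        tokens = self.token_emb(caption_ids)               # (B, T, gpt_dim)
        hidden = self.decoder(tgt=tokens, memory=prefix)   # cross-attend to the visual prefix
        return self.lm_head(hidden)                        # (B, T, vocab_size) next-token logits
```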
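Likewise, the combination of masked language modeling and image-text matching with hard negative selection described in the summary could, under very generic assumptions, look like the sketch below. The function names, the itm_head classifier (e.g. a small nn.Linear(2 * D, 2) head), and the alpha weight are hypothetical and not drawn from the paper.

```python
# Illustrative sketch only (PyTorch). This combines an MLM loss with an
# image-text matching (ITM) loss that picks the hardest in-batch negative
# text for each image, so the model learns to separate nearby feature points.
import torch
import torch.nn.functional as F

def itm_hard_negative_loss(img_emb, txt_emb, itm_head):
    # img_emb, txt_emb: (B, D) global embeddings of matched image-text pairs
    B = img_emb.size(0)
    with torch.no_grad():
        sim = img_emb @ txt_emb.t()            # (B, B) pairwise similarities
        sim.fill_diagonal_(float("-inf"))      # exclude the true (matched) pairs
        hard_idx = sim.argmax(dim=1)           # hardest non-matching text per image

    pos = torch.cat([img_emb, txt_emb], dim=-1)            # matched pairs -> label 1
    neg = torch.cat([img_emb, txt_emb[hard_idx]], dim=-1)  # hard negatives -> label 0
    logits = itm_head(torch.cat([pos, neg], dim=0))        # (2B, 2) match/no-match logits
    labels = torch.cat([torch.ones(B), torch.zeros(B)]).long().to(logits.device)
    return F.cross_entropy(logits, labels)

def combined_loss(mlm_logits, mlm_labels, img_emb, txt_emb, itm_head, alpha=1.0):
    # Masked language modeling: positions labeled -100 (unmasked tokens) are ignored.
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    return mlm + alpha * itm_hard_negative_loss(img_emb, txt_emb, itm_head)
```

The weighting between the two terms and the exact negative-sampling strategy are design choices the abstract does not specify; this sketch simply uses an equal-weight sum and in-batch argmax selection.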