The CLIP-GPT Image Captioning Model Integrated with Global Semantics
| Main Authors: | |
|---|---|
| Format: | Article |
| Language: | Chinese (zho) |
| Published: | Harbin University of Science and Technology Publications, 2024-04-01 |
| Series: | Journal of Harbin University of Science and Technology |
| Subjects: | |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2307 |
| Summary: | Image captioning is a method for automatically generating language descriptions of images. Cross-modal semantic consistency is the core issue in shared-subspace embedding when bridging pre-trained computer vision and natural language processing models to build image captioning models. This paper introduces a new method that moves beyond the limitation of visual feature classification by dividing images into patches that serve as visual semantic units for open-vocabulary cross-modal association with language features. The method combines the masked language modeling and image-text matching losses and selects hard negative samples to train the cross-modal hop network to extract consistent global semantics, improving the accuracy of distinguishing highly similar image and text feature points within a subspace neighborhood. Experimental results on the MS COCO and Flickr30k datasets show that the proposed model outperforms other models that also use CLIP + GPT to generate image descriptions, as well as other mainstream methods, demonstrating the effectiveness of the proposed approach. |
|---|---|
| ISSN: | 1007-2683 |
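
As a rough illustration of the training objective described in the summary, the sketch below combines an image-text matching (ITM) loss over in-batch hard negatives with a masked language modeling (MLM) loss over patch-level visual tokens. This is a minimal PyTorch sketch under assumed module names, dimensions, and a generic transformer fusion layer; it is not the authors' implementation.

```python
# Minimal sketch: ITM loss with in-batch hard negatives + MLM loss.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMatcher(nn.Module):
    def __init__(self, dim=512, vocab_size=30522):
        super().__init__()
        # Stand-in fusion layer over concatenated patch tokens and text tokens.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.itm_head = nn.Linear(dim, 2)           # matched vs. mismatched pair
        self.mlm_head = nn.Linear(dim, vocab_size)  # predict masked caption tokens

    def forward(self, patch_feats, text_feats):
        # Fuse patch-level visual tokens with text tokens in one sequence.
        return self.fusion(torch.cat([patch_feats, text_feats], dim=1))

def hard_negative_indices(img_emb, txt_emb):
    """For each image, pick the most similar non-matching caption in the batch."""
    sim = img_emb @ txt_emb.t()            # (B, B) similarity matrix
    sim.fill_diagonal_(float('-inf'))      # exclude the true positive
    return sim.argmax(dim=1)               # hardest negative index per image

def itm_mlm_loss(model, patch_feats, text_feats, img_emb, txt_emb,
                 masked_text_feats, mlm_labels):
    B = patch_feats.size(0)
    neg_idx = hard_negative_indices(img_emb, txt_emb)

    # ITM: positive pairs and hard-negative pairs share the fusion module.
    pos = model(patch_feats, text_feats)[:, 0]           # pooled token, matched pairs
    neg = model(patch_feats, text_feats[neg_idx])[:, 0]  # pooled token, hard negatives
    itm_logits = model.itm_head(torch.cat([pos, neg], dim=0))
    itm_labels = torch.cat([
        torch.ones(B, dtype=torch.long, device=itm_logits.device),
        torch.zeros(B, dtype=torch.long, device=itm_logits.device)])
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    # MLM: predict masked caption tokens conditioned on the visual patches.
    fused = model(patch_feats, masked_text_feats)
    text_part = fused[:, patch_feats.size(1):]           # keep text positions only
    loss_mlm = F.cross_entropy(model.mlm_head(text_part).transpose(1, 2),
                               mlm_labels, ignore_index=-100)
    return loss_itm + loss_mlm

# Example call with random tensors (batch of 4, 50 patch tokens, 20 text tokens):
B, P, T, D = 4, 50, 20, 512
model = CrossModalMatcher(dim=D)
loss = itm_mlm_loss(model,
                    torch.randn(B, P, D), torch.randn(B, T, D),
                    F.normalize(torch.randn(B, D), dim=-1),
                    F.normalize(torch.randn(B, D), dim=-1),
                    torch.randn(B, T, D),
                    torch.randint(0, 30522, (B, T)))
loss.backward()
```

The hard-negative step is where the abstract's claim lands in practice: forcing the matching head to separate the most similar non-matching image-text pairs is what sharpens discrimination between nearby feature points in the shared subspace.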