The CLIP-GPT Image Captioning Model Integrated with Global Semantics
Image captioning is a method for automatically generating natural-language descriptions of images. Cross-modal semantic consistency is the core issue of shared-subspace embedding when bridging pre-trained computer vision and natural language processing models to construct image captioning models. This paper introduces a new method that breaks through the limitation of visual feature classification by dividing images into patches that serve as visual semantic units for open-vocabulary cross-modal association with language features. The method combines two loss functions, masked language modeling and image-text matching, and selects hard negative samples to train the cross-modal hop network to extract consistent global semantics, improving the accuracy of distinguishing highly similar image and text feature points within a neighborhood of the subspace. Experimental results on the MS COCO and Flickr30k datasets show that the model outperforms other models that also use CLIP + GPT to generate image descriptions, as well as other mainstream methods, demonstrating the effectiveness of the proposed approach.
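The training objective described in the abstract, masked language modeling combined with image-text matching over hard negatives, can be illustrated with a short sketch. The snippet below is a minimal PyTorch-style illustration under stated assumptions, not the authors' implementation: it assumes pooled global image and text features from the two encoders, a hypothetical `itm_head` classifier over fused pairs, and an arbitrary loss weighting.

```python
# Illustrative sketch (not the paper's code): an image-text matching (ITM) loss
# with hard in-batch negatives, combined with a masked language modeling (MLM) loss.
import torch
import torch.nn.functional as F

def hard_negative_itm_loss(img_feats, txt_feats, itm_head):
    """ITM loss where each image is paired with its most similar *wrong* caption
    in the batch (and vice versa), so the classifier trains on hard negatives.

    img_feats, txt_feats: (B, D) pooled global features from the two encoders.
    itm_head: a small classifier mapping a fused (image, text) pair of size 2D
              to a 2-way matched / not-matched logit (hypothetical module).
    """
    B = img_feats.size(0)
    with torch.no_grad():
        sim = img_feats @ txt_feats.t()            # (B, B) similarity matrix
        sim.fill_diagonal_(float("-inf"))          # exclude the true pairs
        hard_txt_idx = sim.argmax(dim=1)           # hardest caption per image
        hard_img_idx = sim.argmax(dim=0)           # hardest image per caption

    pos     = torch.cat([img_feats, txt_feats], dim=-1)                 # matched pairs
    neg_txt = torch.cat([img_feats, txt_feats[hard_txt_idx]], dim=-1)   # image + hard caption
    neg_img = torch.cat([img_feats[hard_img_idx], txt_feats], dim=-1)   # hard image + caption

    logits = itm_head(torch.cat([pos, neg_txt, neg_img], dim=0))        # (3B, 2)
    labels = torch.cat([torch.ones(B), torch.zeros(2 * B)]).long().to(logits.device)
    return F.cross_entropy(logits, labels)

def total_pretraining_loss(mlm_logits, mlm_labels, img_feats, txt_feats, itm_head,
                           itm_weight=1.0):
    """Combined objective: MLM over masked caption tokens plus ITM with hard negatives.
    mlm_labels uses -100 for unmasked positions, as in standard MLM setups."""
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)
    itm_loss = hard_negative_itm_loss(img_feats, txt_feats, itm_head)
    return mlm_loss + itm_weight * itm_loss
```

Choosing the most similar non-matching caption (or image) in the batch as the negative forces the matching head to separate exactly the highly similar, nearby feature points in the shared subspace that the abstract targets.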
| Main Authors: | TAO Rui, REN Honge, CAO Haiyan |
|---|---|
| Author Affiliations: | TAO Rui: College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; College of Computer Science, Hulunbuir University, Hulunbuir 021008, China. REN Honge: College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; Heilongjiang Forestry Intelligent Equipment Engineering Research Center, Harbin 150040, China. CAO Haiyan: College of Computer Science, Hulunbuir University, Hulunbuir 021008, China |
| Format: | Article |
| Language: | Chinese (zho) |
| Published: | Harbin University of Science and Technology Publications, 2024-04-01 |
| Series: | Journal of Harbin University of Science and Technology, Vol. 29, No. 2, pp. 16-24 |
| ISSN: | 1007-2683 |
| DOI: | 10.15938/j.jhust.2024.02.003 |
| Subjects: | cross-modal; image captioning; pre-training model; shared subspace; semantic alignment |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2307 |