Image Captioning Based on Semantic Scenes

Bibliographic Details
Main Authors: Fengzhi Zhao, Zhezhou Yu, Tao Wang, Yi Lv
Format: Article
Language: English
Published: MDPI AG, 2024-10-01
Series: Entropy, Vol. 26, No. 10, Article 876
ISSN: 1099-4300
DOI: 10.3390/e26100876
Affiliation: College of Computer Science and Technology, Jilin University, Changchun 130012, China
Collection: DOAJ
Subjects: image captioning; semantic scenes encoder; attention mechanism; graph
Online Access: https://www.mdpi.com/1099-4300/26/10/876

Abstract
With the development of artificial intelligence and deep learning technologies, image captioning has become an important research direction at the intersection of computer vision and natural language processing. Image captioning aims to generate a natural language description of an image by understanding its content, and it has broad application prospects in fields such as image retrieval, autonomous driving, and visual question answering. Many researchers have proposed region-based image captioning methods, which generate captions from features extracted from different regions of an image. However, these methods often rely on local features and overlook the overall scene, producing captions that lack coherence and accuracy in complex scenes. In addition, they often fail to extract complete semantic information from the visual data, which can lead to biased or incomplete captions. For these reasons, existing methods struggle to generate comprehensive and accurate captions. To fill this gap, we propose the Semantic Scenes Encoder (SSE) for image captioning. The SSE first extracts a scene graph from the image and integrates it into the image encoding. It then extracts a semantic graph from the captions and preserves its semantic information through a learnable attention mechanism, referred to as the dictionary. During caption generation, the model combines the encoded image information with the learned semantic information to produce complete and accurate captions. To verify the effectiveness of the SSE, we evaluated the model on the MSCOCO dataset. The experimental results show that the SSE improves the overall quality of the captions, and gains across multiple evaluation metrics further demonstrate its advantages when processing the same images.
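
The abstract describes a two-stream design: scene-graph-augmented image encoding on one side, and a learnable attention "dictionary" that stores semantic information distilled from caption semantic graphs on the other, with the decoder drawing on both during generation. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' implementation; the module layout, dimensions, and the use of standard Transformer layers and a learnable embedding bank for the dictionary are assumptions made here purely for illustration:

# Minimal, illustrative sketch (not the authors' code) of the pipeline the
# abstract describes: region and scene-graph features are fused by an encoder,
# a learnable "dictionary" of semantic vectors is queried with attention, and
# the caption decoder attends to both streams. Dimensions and layer choices
# are assumptions for illustration only.
import torch
import torch.nn as nn


class SemanticScenesEncoderSketch(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, dict_size=1024, vocab=10000):
        super().__init__()
        # Project region features and scene-graph relation features into a
        # shared space, then fuse them with a Transformer encoder layer.
        self.region_proj = nn.Linear(feat_dim, hidden)
        self.relation_proj = nn.Linear(feat_dim, hidden)
        self.encoder = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        # Learnable "dictionary": a bank of semantic vectors (intended to be
        # trained against semantic graphs extracted from the captions).
        self.dictionary = nn.Parameter(torch.randn(dict_size, hidden) * 0.02)
        self.dict_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # Caption decoder conditioned on the fused visual/semantic memory.
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.TransformerDecoderLayer(hidden, nhead=8, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, region_feats, relation_feats, caption_tokens):
        # Encode image regions together with scene-graph relation features.
        visual = torch.cat([self.region_proj(region_feats),
                            self.relation_proj(relation_feats)], dim=1)
        visual = self.encoder(visual)
        # Retrieve stored semantic information from the learnable dictionary.
        dict_mem = self.dictionary.unsqueeze(0).expand(visual.size(0), -1, -1)
        semantic, _ = self.dict_attn(visual, dict_mem, dict_mem)
        # Decode the caption from both visual and semantic memory (a causal
        # target mask would be added for real autoregressive training).
        memory = torch.cat([visual, semantic], dim=1)
        hidden_states = self.decoder(self.embed(caption_tokens), memory)
        return self.out(hidden_states)  # per-token vocabulary logits


# Toy usage: 2 images, 36 regions, 20 scene-graph relations, 12-token captions.
model = SemanticScenesEncoderSketch()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 20, 2048),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])

Training details (the causal decoder mask, how the dictionary is supervised from the caption semantic graphs, and the captioning loss) are omitted; the sketch only shows how scene-graph features and a learnable semantic memory could be combined in the decoder's cross-attention, as described in the abstract.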