Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block

Bibliographic Details
Main Authors: Xi Yang, Xingguo Jiang, Jinfeng Liu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10978853/
author Xi Yang
Xingguo Jiang
Jinfeng Liu
author_facet Xi Yang
Xingguo Jiang
Jinfeng Liu
author_sort Xi Yang
collection DOAJ
description Cross-modal research has long been a critical pillar for the future development of human-computer interaction. With deep learning achieving remarkable results in computer vision and natural language processing, image captioning has emerged as a key focus area in artificial intelligence research. Most image captioning studies have focused on English; however, this cross-disciplinary effort should not be confined to a single language, and it is essential to extend it to others, given that Chinese is one of the world's most widely used logographic languages. Studying Chinese image captioning is therefore highly valuable, but the complexity of Chinese semantic features makes it challenging. To address these difficulties, we propose a Deep Fusion Feature Encoder, which enables the model to extract more detailed visual features from images. Additionally, we introduce Swi-Gumbel Attention and build a Feature Filtering Block on top of it, helping the model accurately capture core semantic elements during caption generation. Experimental results demonstrate that our method achieves superior performance across multiple Chinese datasets. In the experimental section, we compare our proposed model with models based on recurrent neural networks and Transformers, discussing both its advantages and limitations, and we offer insights into future research directions for Chinese image captioning. Through ablation experiments, we validate the effectiveness of the Deep Fusion Feature Encoder, Swi-Gumbel Attention, and the Triple-Layer Feature Filtering Block, and we explore how different architectural configurations within the Multi-Layer Feature Filtering Block affect caption accuracy.
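The record does not define Swi-Gumbel Attention; a plausible reading, given the name, is attention whose scores pass through a Swish activation and whose weights are drawn via Gumbel-Softmax sampling, which approximates a hard selection of the most relevant feature. The following is a minimal NumPy sketch of that general technique under those assumptions; the function names `swish`, `gumbel_softmax`, and `gumbel_attention` and the temperature parameter `tau` are illustrative, not the paper's actual API.

```python
import numpy as np

def swish(x):
    # Swish activation: x * sigmoid(x). Assumed, based on the "Swi-" prefix.
    return x / (1.0 + np.exp(-x))

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Gumbel-Softmax: softmax((logits + g) / tau) with g ~ Gumbel(0, 1).
    # Low tau pushes the distribution toward a near-one-hot (hard) selection.
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, logits.shape)))
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gumbel_attention(query, keys, values, tau=0.5, rng=None):
    # Scaled dot-product scores, Swish-activated, then Gumbel-Softmax weights.
    scores = swish(keys @ query / np.sqrt(query.shape[-1]))
    weights = gumbel_softmax(scores, tau=tau, rng=rng)
    return weights @ values, weights
```

In a filtering role, the near-one-hot weights would let each block pass through only the feature judged most relevant, which matches the abstract's stated goal of isolating core semantic elements.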
format Article
id doaj-art-6a3b090ff75243c39fe8ea1626717216
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-6a3b090ff75243c39fe8ea1626717216 (2025-08-20T03:48:46Z)
IEEE Access, ISSN 2169-3536, IEEE, 2025-01-01, Vol. 13, pp. 75935-75952
DOI: 10.1109/ACCESS.2025.3564873 (IEEE document 10978853)
Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
Xi Yang (https://orcid.org/0009-0007-8893-5247), School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin, Sichuan, China
Xingguo Jiang (https://orcid.org/0009-0006-9623-8479), School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin, Sichuan, China
Jinfeng Liu (https://orcid.org/0000-0002-9174-9142), School of Information Engineering, Ningxia University, Yinchuan, Ningxia, China
https://ieeexplore.ieee.org/document/10978853/
Image captioning; RegNet; attention mechanism; Bi-LSTM
spellingShingle Xi Yang
Xingguo Jiang
Jinfeng Liu
Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
IEEE Access
Image captioning
RegNet
attention mechanism
Bi-LSTM
title Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
title_full Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
title_fullStr Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
title_full_unstemmed Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
title_short Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
title_sort chinese image captioning based on deep fusion feature and multi layer feature filtering block
topic Image captioning
RegNet
attention mechanism
Bi-LSTM
url https://ieeexplore.ieee.org/document/10978853/
work_keys_str_mv AT xiyang chineseimagecaptioningbasedondeepfusionfeatureandmultilayerfeaturefilteringblock
AT xingguojiang chineseimagecaptioningbasedondeepfusionfeatureandmultilayerfeaturefilteringblock
AT jinfengliu chineseimagecaptioningbasedondeepfusionfeatureandmultilayerfeaturefilteringblock