Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
Cross-modal research has long been a critical pillar for the future development of human-computer interaction. With deep learning achieving remarkable results in computer vision and natural language processing, image captioning has emerged as a key focus area in artificial intelligence research.
| Main Authors: | Xi Yang, Xingguo Jiang, Jinfeng Liu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Image captioning; RegNet; attention mechanism; Bi-LSTM |
| Online Access: | https://ieeexplore.ieee.org/document/10978853/ |
| _version_ | 1849324327407714304 |
|---|---|
| author | Xi Yang; Xingguo Jiang; Jinfeng Liu |
| author_facet | Xi Yang; Xingguo Jiang; Jinfeng Liu |
| author_sort | Xi Yang |
| collection | DOAJ |
| description | Cross-modal research has long been a critical pillar for the future development of human-computer interaction. With deep learning achieving remarkable results in computer vision and natural language processing, image captioning has emerged as a key focus area in artificial intelligence research. Traditionally, most image captioning studies have focused on the English context; however, interdisciplinary efforts should not be confined to monolingual environments. Instead, it is essential to expand into multiple languages, given that Chinese is one of the world’s most widely used logographic languages. The study of Chinese image captioning holds immense value but presents significant challenges due to the complexity of Chinese semantic features. To address these difficulties, we propose a Deep Fusion Feature Encoder, which enables the model to extract more detailed visual features from images. Additionally, we introduce Swi-Gumbel Attention and develop a Feature Filtering Block based on it, aiding the model in accurately capturing core semantic elements during caption generation. Experimental results demonstrate that our method achieves superior performance across multiple Chinese datasets. Specifically, in the experimental section of this paper, we compared our proposed model with models based on recurrent neural networks and Transformers, demonstrating both its advantages and limitations. Additionally, we provided insights into future research directions for Chinese image captioning. Through ablation experiments, we validated the effectiveness of the Deep Fusion Feature Encoder, Swi-Gumbel Attention, and Triple-Layer Feature Filtering Block. We also explored the impact of different architectural configurations within the Multi-Layer Feature Filtering Block on caption accuracy. |
| format | Article |
| id | doaj-art-6a3b090ff75243c39fe8ea1626717216 |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-6a3b090ff75243c39fe8ea1626717216; 2025-08-20T03:48:46Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; Vol. 13, pp. 75935-75952; 10.1109/ACCESS.2025.3564873; 10978853; Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block; Xi Yang (https://orcid.org/0009-0007-8893-5247), School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin, Sichuan, China; Xingguo Jiang (https://orcid.org/0009-0006-9623-8479), School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin, Sichuan, China; Jinfeng Liu (https://orcid.org/0000-0002-9174-9142), School of Information Engineering, Ningxia University, Yinchuan, Ningxia, China; Cross-modal research has long been a critical pillar for the future development of human-computer interaction. With deep learning achieving remarkable results in computer vision and natural language processing, image captioning has emerged as a key focus area in artificial intelligence research. Traditionally, most image captioning studies have focused on the English context; however, interdisciplinary efforts should not be confined to monolingual environments. Instead, it is essential to expand into multiple languages, given that Chinese is one of the world’s most widely used logographic languages. The study of Chinese image captioning holds immense value but presents significant challenges due to the complexity of Chinese semantic features. To address these difficulties, we propose a Deep Fusion Feature Encoder, which enables the model to extract more detailed visual features from images. Additionally, we introduce Swi-Gumbel Attention and develop a Feature Filtering Block based on it, aiding the model in accurately capturing core semantic elements during caption generation. Experimental results demonstrate that our method achieves superior performance across multiple Chinese datasets. Specifically, in the experimental section of this paper, we compared our proposed model with models based on recurrent neural networks and Transformers, demonstrating both its advantages and limitations. Additionally, we provided insights into future research directions for Chinese image captioning. Through ablation experiments, we validated the effectiveness of the Deep Fusion Feature Encoder, Swi-Gumbel Attention, and Triple-Layer Feature Filtering Block. We also explored the impact of different architectural configurations within the Multi-Layer Feature Filtering Block on caption accuracy. https://ieeexplore.ieee.org/document/10978853/; Image captioning; RegNet; attention mechanism; Bi-LSTM |
| spellingShingle | Xi Yang Xingguo Jiang Jinfeng Liu Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block IEEE Access Image captioning RegNet attention mechanism Bi-LSTM |
| title | Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block |
| title_full | Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block |
| title_fullStr | Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block |
| title_full_unstemmed | Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block |
| title_short | Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block |
| title_sort | chinese image captioning based on deep fusion feature and multi layer feature filtering block |
| topic | Image captioning; RegNet; attention mechanism; Bi-LSTM |
| url | https://ieeexplore.ieee.org/document/10978853/ |
| work_keys_str_mv | AT xiyang chineseimagecaptioningbasedondeepfusionfeatureandmultilayerfeaturefilteringblock AT xingguojiang chineseimagecaptioningbasedondeepfusionfeatureandmultilayerfeaturefilteringblock AT jinfengliu chineseimagecaptioningbasedondeepfusionfeatureandmultilayerfeaturefilteringblock |
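The abstract names a Swi-Gumbel Attention used inside the Feature Filtering Block, but this record does not give its formulation. The following is only a minimal sketch under the assumption that the name refers to combining the Swish (SiLU) activation with Gumbel-Softmax sampling to produce near-discrete attention weights over visual feature vectors; the function names, shapes, and scaling here are illustrative, not the authors' implementation.

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Add Gumbel(0, 1) noise to the logits, then apply a temperature
    # softmax; small tau pushes the weights toward a one-hot selection
    # while keeping the operation differentiable in a trained model.
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) samples
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def swi_gumbel_attention(query, features, tau=0.5, rng=None):
    # query: (d,); features: (n, d) region feature vectors.
    # Scaled dot-product scores are passed through Swish, then the
    # Gumbel-Softmax turns them into near-discrete weights, i.e. a
    # soft filter that keeps only the most relevant regions.
    scores = swish(features @ query / np.sqrt(query.shape[0]))
    weights = gumbel_softmax(scores, tau=tau, rng=rng)
    context = weights @ features                 # filtered context vector
    return context, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))               # 5 regions, 8-dim each
query = rng.normal(size=8)
context, weights = swi_gumbel_attention(query, features, rng=rng)
```

A low temperature `tau` is what gives the "filtering" behavior: the attention distribution concentrates on one or two regions rather than averaging over all of them, which matches the abstract's goal of capturing core semantic elements during caption generation.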