Feature refinement and rethinking attention for remote sensing image captioning

Abstract: Effectively recognizing different regions of interest with attention mechanisms plays an important role in the remote sensing image captioning task. However, attention-driven models implicitly hypothesize that the information from the focused region is correct, which is too restrictive. Furthermore, visual feature extractors fail when the correlation between objects is weak. To address these issues, we propose a feature refinement and rethinking attention framework. Specifically, we first construct a feature refinement module in which grid-level features interact through a refinement gate, so that irrelevant visual features from remote sensing images are weakened. Moreover, instead of inferring each word from a single attentive vector, a rethinking attention mechanism with a rethinking LSTM layer is developed to spontaneously focus on different regions according to a rethinking confidence, so that more than one region can contribute to predicting a single word. In addition, a confidence rectification strategy is adopted when modeling rethinking attention to learn strongly discriminative contextual representations. We validate the designed framework on four datasets (NWPU-Captions, RSICD, UCM-Captions and Sydney-Captions). Extensive experiments show that our approach achieves superior performance and yields significant improvements on the NWPU-Captions dataset.
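The abstract describes two components that lend themselves to a short sketch: a refinement gate that makes grid-level features interact and weakens irrelevant ones, and a rethinking attention step that can attend more than once for a single word, governed by a confidence measure. The snippet below is a minimal sketch of how such components could be wired, assuming PyTorch and grid features of shape (batch, cells, dim); it is not the paper's implementation, and every module name, layer choice, and threshold in it is an assumption made for illustration.

```python
# Minimal, illustrative PyTorch sketch of the two ideas named in the abstract.
# NOT the authors' code: module names, layer sizes, the use of multi-head
# attention for the grid-feature interaction, and the confidence threshold
# are all assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureRefinementGate(nn.Module):
    """Lets grid-level features interact, then gates out weakly relevant cells."""

    def __init__(self, dim: int):
        super().__init__()
        self.interact = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (batch, num_cells, dim); every cell attends to all others.
        context, _ = self.interact(grid_feats, grid_feats, grid_feats)
        # Gate scores each cell from its own feature and the interacted context;
        # a near-zero gate suppresses (weakens) that cell in the refined output.
        g = self.gate(torch.cat([grid_feats, context], dim=-1))
        return g * (grid_feats + context)


class RethinkingAttentionDecoder(nn.Module):
    """One decoding step that may attend to the image more than once per word."""

    def __init__(self, dim: int, vocab_size: int, conf_threshold: float = 0.5):
        super().__init__()
        self.att_lstm = nn.LSTMCell(2 * dim, dim)      # first attention pass
        self.rethink_lstm = nn.LSTMCell(2 * dim, dim)  # rethinking pass
        self.attend = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(dim, vocab_size)
        self.conf_threshold = conf_threshold

    def step(self, word_emb, feats, state):
        # First pass: attend once and estimate word confidence.
        h, c = self.att_lstm(torch.cat([word_emb, feats.mean(1)], dim=-1), state)
        ctx, _ = self.attend(h.unsqueeze(1), feats, feats)
        logits = self.classifier(h + ctx.squeeze(1))
        conf = F.softmax(logits, dim=-1).max(dim=-1).values
        # Rethinking pass: if confidence is low, a rethinking LSTM updates the
        # query and the decoder attends again, so the word can be grounded in
        # more than one region. (Whole batch rethinks here for simplicity.)
        if (conf < self.conf_threshold).any():
            h, c = self.rethink_lstm(torch.cat([h, ctx.squeeze(1)], dim=-1), (h, c))
            ctx, _ = self.attend(h.unsqueeze(1), feats, feats)
            logits = self.classifier(h + ctx.squeeze(1))
        return logits, (h, c)


# Illustrative usage with assumed shapes: a 7x7 grid of 512-d features.
feats = FeatureRefinementGate(512)(torch.randn(2, 49, 512))
decoder = RethinkingAttentionDecoder(dim=512, vocab_size=10000)
h = c = torch.zeros(2, 512)
logits, (h, c) = decoder.step(torch.randn(2, 512), feats, (h, c))
```

The confidence rectification strategy mentioned in the abstract would additionally adjust these confidence estimates during training; that part is omitted from the sketch.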

Bibliographic Details
Main Authors: Yunpeng Li, Chengjin Tao, Meng Liu, Xiangrong Zhang, Guanchun Wang, Tianyang Zhang, Dong Zhao, Dabao Wang
Format: Article
Language: English
Published: Nature Portfolio, 2025-03-01
Series: Scientific Reports
Subjects: Remote sensing image captioning; Visual perception; Feature refinement; Rethinking attention mechanism; Vision-language
Online Access: https://doi.org/10.1038/s41598-025-93125-y
Citation: Scientific Reports, vol. 15, no. 1, pp. 1–16, Nature Portfolio, 2025-03-01. ISSN 2045-2322. DOI: 10.1038/s41598-025-93125-y

Author affiliations:
Yunpeng Li, Chengjin Tao, Meng Liu, Dong Zhao: The Jiangsu Province Engineering Research Center of Integrated Circuit Reliability Technology and Testing System, Wuxi University
Xiangrong Zhang, Guanchun Wang, Tianyang Zhang: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University
Dabao Wang: Remote Sensing Satellite Department, China Academy of Space Technology