Feature refinement and rethinking attention for remote sensing image captioning

Abstract: Effectively recognizing different regions of interest with attention mechanisms plays an important role in the remote sensing image captioning task. However, attention-driven models implicitly hypothesize that the information from the focused region is correct, which is too restrictive. Furthermore, visual feature extractors fail when the correlation between objects is weak. To address these issues, we propose a feature refinement and rethinking attention framework. Specifically, we first construct a feature refinement module in which grid-level features interact through a refinement gate, so that irrelevant visual features from remote sensing images are weakened. Moreover, instead of inferring each word from a single attentive vector, a rethinking attention mechanism with a rethinking LSTM layer is developed to spontaneously focus on different regions according to a rethinking confidence, so that more than one region can contribute to predicting a single word. In addition, a confidence rectification strategy is adopted when modeling rethinking attention to learn strongly discriminative contextual representations. We validate the designed framework on four datasets (NWPU-Captions, RSICD, UCM-Captions and Sydney-Captions). Extensive experiments show that our approach achieves superior performance and yields significant improvements on the NWPU-Captions dataset.
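The abstract describes two components that lend themselves to a short sketch: a refinement gate that makes grid-level features interact and weakens irrelevant ones, and a rethinking attention step that can attend more than once for a single word, governed by a confidence measure. The snippet below is a minimal sketch of how such components could be wired, assuming PyTorch and grid features of shape (batch, cells, dim); it is not the paper's implementation, and every module name, layer choice, and threshold in it is an assumption made for illustration.

```python
# Minimal, illustrative PyTorch sketch of the two ideas named in the abstract.
# NOT the authors' code: module names, layer sizes, the use of multi-head
# attention for the grid-feature interaction, and the confidence threshold
# are all assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureRefinementGate(nn.Module):
    """Lets grid-level features interact, then gates out weakly relevant cells."""

    def __init__(self, dim: int):
        super().__init__()
        self.interact = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (batch, num_cells, dim); every cell attends to all others.
        context, _ = self.interact(grid_feats, grid_feats, grid_feats)
        # Gate scores each cell from its own feature and the interacted context;
        # a near-zero gate suppresses (weakens) that cell in the refined output.
        g = self.gate(torch.cat([grid_feats, context], dim=-1))
        return g * (grid_feats + context)


class RethinkingAttentionDecoder(nn.Module):
    """One decoding step that may attend to the image more than once per word."""

    def __init__(self, dim: int, vocab_size: int, conf_threshold: float = 0.5):
        super().__init__()
        self.att_lstm = nn.LSTMCell(2 * dim, dim)      # first attention pass
        self.rethink_lstm = nn.LSTMCell(2 * dim, dim)  # rethinking pass
        self.attend = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(dim, vocab_size)
        self.conf_threshold = conf_threshold

    def step(self, word_emb, feats, state):
        # First pass: attend once and estimate word confidence.
        h, c = self.att_lstm(torch.cat([word_emb, feats.mean(1)], dim=-1), state)
        ctx, _ = self.attend(h.unsqueeze(1), feats, feats)
        logits = self.classifier(h + ctx.squeeze(1))
        conf = F.softmax(logits, dim=-1).max(dim=-1).values
        # Rethinking pass: if confidence is low, a rethinking LSTM updates the
        # query and the decoder attends again, so the word can be grounded in
        # more than one region. (Whole batch rethinks here for simplicity.)
        if (conf < self.conf_threshold).any():
            h, c = self.rethink_lstm(torch.cat([h, ctx.squeeze(1)], dim=-1), (h, c))
            ctx, _ = self.attend(h.unsqueeze(1), feats, feats)
            logits = self.classifier(h + ctx.squeeze(1))
        return logits, (h, c)


# Illustrative usage with assumed shapes: a 7x7 grid of 512-d features.
feats = FeatureRefinementGate(512)(torch.randn(2, 49, 512))
decoder = RethinkingAttentionDecoder(dim=512, vocab_size=10000)
h = c = torch.zeros(2, 512)
logits, (h, c) = decoder.step(torch.randn(2, 512), feats, (h, c))
```

The confidence rectification strategy mentioned in the abstract would additionally adjust these confidence estimates during training; that part is omitted from the sketch.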

Bibliographic Details
Main Authors: Yunpeng Li, Chengjin Tao, Meng Liu, Xiangrong Zhang, Guanchun Wang, Tianyang Zhang, Dong Zhao, Dabao Wang
Format: Article
Language: English
Published: Nature Portfolio, 2025-03-01
Series: Scientific Reports
Subjects: Remote sensing image captioning; Visual perception; Feature refinement; Rethinking attention mechanism; Vision-language
Online Access: https://doi.org/10.1038/s41598-025-93125-y
Citation: Scientific Reports, vol. 15, no. 1, pp. 1–16, Nature Portfolio, 2025-03-01. ISSN 2045-2322. DOI: 10.1038/s41598-025-93125-y

Author affiliations:
Yunpeng Li, Chengjin Tao, Meng Liu, Dong Zhao: The Jiangsu Province Engineering Research Center of Integrated Circuit Reliability Technology and Testing System, Wuxi University
Xiangrong Zhang, Guanchun Wang, Tianyang Zhang: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University
Dabao Wang: Remote Sensing Satellite Department, China Academy of Space Technology