Feature refinement and rethinking attention for remote sensing image captioning
| Main Authors: | Yunpeng Li, Chengjin Tao, Meng Liu, Xiangrong Zhang, Guanchun Wang, Tianyang Zhang, Dong Zhao, Dabao Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-03-01 |
| Series: | Scientific Reports |
| Subjects: | Remote sensing image captioning, Visual perception, Feature refinement, Rethinking attention mechanism, Vision-language |
| Online Access: | https://doi.org/10.1038/s41598-025-93125-y |
| _version_ | 1850039937613692928 |
|---|---|
| author | Yunpeng Li; Chengjin Tao; Meng Liu; Xiangrong Zhang; Guanchun Wang; Tianyang Zhang; Dong Zhao; Dabao Wang |
| author_facet | Yunpeng Li; Chengjin Tao; Meng Liu; Xiangrong Zhang; Guanchun Wang; Tianyang Zhang; Dong Zhao; Dabao Wang |
| author_sort | Yunpeng Li |
| collection | DOAJ |
| description | Abstract Effectively recognizing different regions of interest with attention mechanisms plays an important role in the remote sensing image captioning task. However, these attention-driven models implicitly hypothesize that the focused region information is correct, which is too restrictive. Furthermore, visual feature extractors fail when the correlation between objects is weak. To address these issues, we propose a feature refinement and rethinking attention framework. Specifically, we first construct a feature refinement module that lets grid-level features interact through a refinement gate, so that irrelevant visual features from remote sensing images are weakened. Moreover, unlike approaches that use one attentive vector to infer one word, the rethinking attention with a rethinking LSTM layer is developed to spontaneously focus on different regions until the rethinking confidence is satisfactory; thus, more than one region can contribute to predicting a word. Besides, a confidence rectification strategy is adopted to guide rethinking attention toward learning strongly discriminative contextual representations. We validate the designed framework on four datasets (NWPU-Captions, RSICD, UCM-Captions and Sydney-Captions). Extensive experiments show that our approach achieves superior performance, with significant improvements on the NWPU-Captions dataset. |
| format | Article |
| id | doaj-art-b4933fb64c9b40328010a40b7038b16e |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-b4933fb64c9b40328010a40b7038b16e; 2025-08-20T02:56:12Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-03-01; vol. 15, no. 1, pp. 1-16; 10.1038/s41598-025-93125-y; Feature refinement and rethinking attention for remote sensing image captioning. Authors: Yunpeng Li, Chengjin Tao, Meng Liu, Dong Zhao (The Jiangsu Province Engineering Research Center of Integrated Circuit Reliability Technology and Testing System, Wuxi University); Xiangrong Zhang, Guanchun Wang, Tianyang Zhang (Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University); Dabao Wang (Remote Sensing Satellite Department, China Academy of Space Technology). Abstract: Effectively recognizing different regions of interest with attention mechanisms plays an important role in the remote sensing image captioning task. However, these attention-driven models implicitly hypothesize that the focused region information is correct, which is too restrictive. Furthermore, visual feature extractors fail when the correlation between objects is weak. To address these issues, we propose a feature refinement and rethinking attention framework. Specifically, we first construct a feature refinement module that lets grid-level features interact through a refinement gate, so that irrelevant visual features from remote sensing images are weakened. Moreover, unlike approaches that use one attentive vector to infer one word, the rethinking attention with a rethinking LSTM layer is developed to spontaneously focus on different regions until the rethinking confidence is satisfactory; thus, more than one region can contribute to predicting a word. Besides, a confidence rectification strategy is adopted to guide rethinking attention toward learning strongly discriminative contextual representations. We validate the designed framework on four datasets (NWPU-Captions, RSICD, UCM-Captions and Sydney-Captions). Extensive experiments show that our approach achieves superior performance, with significant improvements on the NWPU-Captions dataset. https://doi.org/10.1038/s41598-025-93125-y Subjects: Remote sensing image captioning; Visual perception; Feature refinement; Rethinking attention mechanism; Vision-language |
| spellingShingle | Yunpeng Li; Chengjin Tao; Meng Liu; Xiangrong Zhang; Guanchun Wang; Tianyang Zhang; Dong Zhao; Dabao Wang; Feature refinement and rethinking attention for remote sensing image captioning; Scientific Reports; Remote sensing image captioning; Visual perception; Feature refinement; Rethinking attention mechanism; Vision-language |
| title | Feature refinement and rethinking attention for remote sensing image captioning |
| title_full | Feature refinement and rethinking attention for remote sensing image captioning |
| title_fullStr | Feature refinement and rethinking attention for remote sensing image captioning |
| title_full_unstemmed | Feature refinement and rethinking attention for remote sensing image captioning |
| title_short | Feature refinement and rethinking attention for remote sensing image captioning |
| title_sort | feature refinement and rethinking attention for remote sensing image captioning |
| topic | Remote sensing image captioning; Visual perception; Feature refinement; Rethinking attention mechanism; Vision-language |
| url | https://doi.org/10.1038/s41598-025-93125-y |
| work_keys_str_mv | AT yunpengli featurerefinementandrethinkingattentionforremotesensingimagecaptioning AT chengjintao featurerefinementandrethinkingattentionforremotesensingimagecaptioning AT mengliu featurerefinementandrethinkingattentionforremotesensingimagecaptioning AT xiangrongzhang featurerefinementandrethinkingattentionforremotesensingimagecaptioning AT guanchunwang featurerefinementandrethinkingattentionforremotesensingimagecaptioning AT tianyangzhang featurerefinementandrethinkingattentionforremotesensingimagecaptioning AT dongzhao featurerefinementandrethinkingattentionforremotesensingimagecaptioning AT dabaowang featurerefinementandrethinkingattentionforremotesensingimagecaptioning |
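The abstract describes two mechanisms: a refinement gate that lets grid-level features interact so irrelevant responses are weakened, and a rethinking step that re-attends to regions until a confidence measure is acceptable. As a rough illustration only (a minimal sketch; the function names, the mean-pooled context, the similarity-driven gate, and the peak-softmax confidence proxy are all assumptions, not the paper's actual formulation, which also involves a rethinking LSTM and confidence rectification not shown here):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def refine(grid_feats):
    """Gated feature refinement (illustrative sketch, not the paper's exact module).

    grid_feats: list of N feature vectors (lists of floats).
    Each grid feature is compared against a mean-pooled global context; a
    sigmoid gate in (0, 1) scales the feature, so responses that disagree
    with the context are suppressed.
    """
    n, dim = len(grid_feats), len(grid_feats[0])
    context = [sum(f[d] for f in grid_feats) / n for d in range(dim)]
    refined = []
    for f in grid_feats:
        # dot-product similarity to the global context drives the gate (assumed design)
        sim = sum(fd * cd for fd, cd in zip(f, context))
        g = sigmoid(sim)
        refined.append([g * fd for fd in f])
    return refined

def rethink_attention(grid_feats, query, conf_thresh=0.6, max_steps=3):
    """Re-attend until the peak softmax weight (a confidence proxy) is high enough.

    Unlike single-pass attention, the attended context is folded back into the
    query and attention is recomputed, so several regions can inform one word.
    """
    dim = len(query)
    weights = []
    for _ in range(max_steps):
        scores = [sum(fd * qd for fd, qd in zip(f, query)) for f in grid_feats]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        if max(weights) >= conf_thresh:
            break  # confident enough; stop rethinking
        # fold the attended context back into the query and rethink
        query = [sum(w * f[d] for w, f in zip(weights, grid_feats)) for d in range(dim)]
    context = [sum(w * f[d] for w, f in zip(weights, grid_feats)) for d in range(dim)]
    return context, weights

# toy usage on three 2-d grid features
feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
refined = refine(feats)
ctx, w = rethink_attention(refined, [1.0, 0.0])
```

The gate never amplifies a feature (it lies strictly in (0, 1)), and the attention weights form a distribution over regions at every rethinking step, which is what makes the peak weight usable as a crude confidence signal.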