Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
Remote sensing image captioning (RSIC) aims to describe ground objects and scenes within remote sensing images in natural language form. As the complexity and diversity of scenes in remote sensing images increase, existing methods, although effective in specific tasks, are largely trained on particular scene images and corpora. …
Saved in:

| Main Authors: | Zhang Guo, Haomin Liu, Zihao Ren, Licheng Jiao, Shuiping Gou, Ruimin Li |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Remote Sensing |
| Subjects: | image captioning; remote sensing; unseen scenes; transformer network; global semantic information |
| Online Access: | https://www.mdpi.com/2072-4292/17/7/1237 |
| author | Zhang Guo, Haomin Liu, Zihao Ren, Licheng Jiao, Shuiping Gou, Ruimin Li |
|---|---|
| collection | DOAJ |
| description | Remote sensing image captioning (RSIC) aims to describe ground objects and scenes within remote sensing images in natural language form. As the complexity and diversity of scenes in remote sensing images increase, existing methods, although effective in specific tasks, are largely trained on particular scene images and corpora. This limits their ability to generate descriptions for scenes not encountered during training. Given the finite resources for data annotation and the expanding range of application scenarios, training data typically cover only a subset of common scenes, leaving many potential scene types unrepresented. Consequently, developing models capable of effectively handling unseen scenes with limited training data is imperative. This study introduces an innovative remote sensing image captioning model based on scene attribute learning—SALCap. The proposed model defines scene attributes and employs a specifically designed global object scene attribute extractor to capture these attributes. It then uses an attribute inference module to predict scene information through scene attributes, ensuring that this part of the scene’s information is reused in sentence generation through additional attribute loss. Experiments show that the method not only improves the accuracy of the description but also significantly enhances the model’s adaptability and generalizability relative to unseen scenes. This advancement expands the practical utility of remote sensing image captioning across diverse scenarios, particularly under the constraints of limited annotations. |
| format | Article |
| id | doaj-art-332e5c838c6e475bb928db75b42286cd |
| institution | OA Journals |
| issn | 2072-4292 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Remote Sensing |
| doi | 10.3390/rs17071237 |
| volume/issue | 17 (7), article 1237 |
| affiliation | Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China (all six authors) |
| title | Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes |
| topic | image captioning; remote sensing; unseen scenes; transformer network; global semantic information |
| url | https://www.mdpi.com/2072-4292/17/7/1237 |