Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes

Remote sensing image captioning (RSIC) aims to describe ground objects and scenes within remote sensing images in natural language form. As the complexity and diversity of scenes in remote sensing images increase, existing methods, although effective in specific tasks, are largely trained on particular scene images and corpora.


Bibliographic Details
Main Authors: Zhang Guo, Haomin Liu, Zihao Ren, Licheng Jiao, Shuiping Gou, Ruimin Li
Format: Article
Language:English
Published: MDPI AG 2025-03-01
Series:Remote Sensing
Subjects:
Online Access:https://www.mdpi.com/2072-4292/17/7/1237
_version_ 1850188332160516096
author Zhang Guo
Haomin Liu
Zihao Ren
Licheng Jiao
Shuiping Gou
Ruimin Li
author_facet Zhang Guo
Haomin Liu
Zihao Ren
Licheng Jiao
Shuiping Gou
Ruimin Li
author_sort Zhang Guo
collection DOAJ
description Remote sensing image captioning (RSIC) aims to describe ground objects and scenes within remote sensing images in natural language form. As the complexity and diversity of scenes in remote sensing images increase, existing methods, although effective in specific tasks, are largely trained on particular scene images and corpora. This limits their ability to generate descriptions for scenes not encountered during training. Given the finite resources for data annotation and the expanding range of application scenarios, training data typically cover only a subset of common scenes, leaving many potential scene types unrepresented. Consequently, developing models capable of effectively handling unseen scenes with limited training data is imperative. This study introduces an innovative remote sensing image captioning model based on scene attribute learning (SALCap). The proposed model defines scene attributes and employs a specifically designed global object scene attribute extractor to capture them. It then uses an attribute inference module to predict scene information from these attributes, ensuring that this scene information is reused during sentence generation through an additional attribute loss. Experiments show that the method not only improves the accuracy of the generated descriptions but also significantly enhances the model's adaptability and generalizability to unseen scenes. This advancement expands the practical utility of remote sensing image captioning across diverse scenarios, particularly under the constraints of limited annotations.
format Article
id doaj-art-332e5c838c6e475bb928db75b42286cd
institution OA Journals
issn 2072-4292
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Remote Sensing
spelling doaj-art-332e5c838c6e475bb928db75b42286cd 2025-08-20T02:15:54Z eng MDPI AG Remote Sensing 2072-4292 2025-03-01 vol. 17, iss. 7, art. 1237 doi:10.3390/rs17071237
Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
Zhang Guo, Haomin Liu, Zihao Ren, Licheng Jiao, Shuiping Gou, Ruimin Li (all: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China)
https://www.mdpi.com/2072-4292/17/7/1237
image captioning; remote sensing; unseen scenes; transformer network; global semantic information
spellingShingle Zhang Guo
Haomin Liu
Zihao Ren
Licheng Jiao
Shuiping Gou
Ruimin Li
Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
Remote Sensing
image captioning
remote sensing
unseen scenes
transformer network
global semantic information
title Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_full Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_fullStr Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_full_unstemmed Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_short Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_sort attribute based learning for remote sensing image captioning in unseen scenes
topic image captioning
remote sensing
unseen scenes
transformer network
global semantic information
url https://www.mdpi.com/2072-4292/17/7/1237
work_keys_str_mv AT zhangguo attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT haominliu attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT zihaoren attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT lichengjiao attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT shuipinggou attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT ruiminli attributebasedlearningforremotesensingimagecaptioninginunseenscenes