Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes

Remote sensing image captioning (RSIC) aims to describe ground objects and scenes within remote sensing images in natural language form. As the complexity and diversity of scenes in remote sensing images increase, existing methods, although effective in specific tasks, are largely trained on particular scene images and corpora.


Bibliographic Details
Main Authors: Zhang Guo, Haomin Liu, Zihao Ren, Licheng Jiao, Shuiping Gou, Ruimin Li
Format: Article
Language:English
Published: MDPI AG 2025-03-01
Series:Remote Sensing
Subjects:
Online Access:https://www.mdpi.com/2072-4292/17/7/1237
_version_ 1850188332160516096
author Zhang Guo
Haomin Liu
Zihao Ren
Licheng Jiao
Shuiping Gou
Ruimin Li
author_facet Zhang Guo
Haomin Liu
Zihao Ren
Licheng Jiao
Shuiping Gou
Ruimin Li
author_sort Zhang Guo
collection DOAJ
description Remote sensing image captioning (RSIC) aims to describe ground objects and scenes within remote sensing images in natural language form. As the complexity and diversity of scenes in remote sensing images increase, existing methods, although effective in specific tasks, are largely trained on particular scene images and corpora. This limits their ability to generate descriptions for scenes not encountered during training. Given the finite resources for data annotation and the expanding range of application scenarios, training data typically cover only a subset of common scenes, leaving many potential scene types unrepresented. Consequently, developing models capable of effectively handling unseen scenes with limited training data is imperative. This study introduces an innovative remote sensing image captioning model based on scene attribute learning (SALCap). The proposed model defines scene attributes and employs a specifically designed global object scene attribute extractor to capture them. It then uses an attribute inference module to predict scene information from these attributes, ensuring that this scene information is reused during sentence generation through an additional attribute loss. Experiments show that the method not only improves the accuracy of the generated descriptions but also significantly enhances the model's adaptability and generalizability to unseen scenes. This advancement expands the practical utility of remote sensing image captioning across diverse scenarios, particularly under the constraints of limited annotations.
format Article
id doaj-art-332e5c838c6e475bb928db75b42286cd
institution OA Journals
issn 2072-4292
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Remote Sensing
spelling doaj-art-332e5c838c6e475bb928db75b42286cd 2025-08-20T02:15:54Z eng MDPI AG Remote Sensing 2072-4292 2025-03-01 vol. 17, iss. 7, art. 1237 doi:10.3390/rs17071237
Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
Zhang Guo, Haomin Liu, Zihao Ren, Licheng Jiao, Shuiping Gou, Ruimin Li (all: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China)
https://www.mdpi.com/2072-4292/17/7/1237
image captioning; remote sensing; unseen scenes; transformer network; global semantic information
spellingShingle Zhang Guo
Haomin Liu
Zihao Ren
Licheng Jiao
Shuiping Gou
Ruimin Li
Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
Remote Sensing
image captioning
remote sensing
unseen scenes
transformer network
global semantic information
title Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_full Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_fullStr Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_full_unstemmed Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_short Attribute-Based Learning for Remote Sensing Image Captioning in Unseen Scenes
title_sort attribute based learning for remote sensing image captioning in unseen scenes
topic image captioning
remote sensing
unseen scenes
transformer network
global semantic information
url https://www.mdpi.com/2072-4292/17/7/1237
work_keys_str_mv AT zhangguo attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT haominliu attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT zihaoren attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT lichengjiao attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT shuipinggou attributebasedlearningforremotesensingimagecaptioninginunseenscenes
AT ruiminli attributebasedlearningforremotesensingimagecaptioninginunseenscenes