Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning

Remote sensing image captioning (RSIC) aims to generate semantically rich and syntactically accurate descriptions for remote sensing images (RSIs). However, due to the complex spatial layouts, occlusions, and overlapping objects in such images, caption generation is often challenged by semantic ambiguity. To address these issues, we propose a novel dual-stream spatially aware transformer (DSAT), which explicitly models both global and local spatial relationships to enhance spatial understanding. Specifically, DSAT introduces a dual-stream feature interaction module that extracts grid-level global features and region-level object features, and further enhances their respective spatial dependencies through multibranch convolution and a graph attention network. In addition, we design a spatially aware attention mechanism that encodes relative spatial relationships into the Transformer, allowing the model to better capture object distribution patterns and geometric relationships. Extensive experiments conducted on three benchmark datasets, namely Sydney-Captions, UCM-Captions, and remote sensing image description, highlight the superior performance of DSAT. The proposed method achieves impressive CIDEr scores of 338.59%, 450.93%, and 275.36% on these datasets, respectively, demonstrating its effectiveness in generating high-quality captions for RSIs.

Bibliographic Details
Main Authors: Haifeng Sima, Xiangtao Ding, JianLong Wang, Mingliang Xu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects: Image captioning; remote sensing image captioning (RSIC); spatial-aware information (SAI); transformer
Online Access: https://ieeexplore.ieee.org/document/11104798/
_version_ 1849404453632868352
author Haifeng Sima
Xiangtao Ding
JianLong Wang
Mingliang Xu
author_facet Haifeng Sima
Xiangtao Ding
JianLong Wang
Mingliang Xu
author_sort Haifeng Sima
collection DOAJ
description Remote sensing image captioning (RSIC) aims to generate semantically rich and syntactically accurate descriptions for remote sensing images (RSIs). However, due to the complex spatial layouts, occlusions, and overlapping objects in such images, caption generation is often challenged by semantic ambiguity. To address these issues, we propose a novel dual-stream spatially aware transformer (DSAT), which explicitly models both global and local spatial relationships to enhance spatial understanding. Specifically, DSAT introduces a dual-stream feature interaction module that extracts grid-level global features and region-level object features, and further enhances their respective spatial dependencies through multibranch convolution and a graph attention network. In addition, we design a spatially aware attention mechanism that encodes relative spatial relationships into the Transformer, allowing the model to better capture object distribution patterns and geometric relationships. Extensive experiments conducted on three benchmark datasets, namely Sydney-Captions, UCM-Captions, and remote sensing image description, highlight the superior performance of DSAT. The proposed method achieves impressive CIDEr scores of 338.59%, 450.93%, and 275.36% on these datasets, respectively, demonstrating its effectiveness in generating high-quality captions for RSIs.
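The description above says the spatially aware attention mechanism "encodes relative spatial relationships into the Transformer". The record does not give the paper's formulation, but the general idea can be sketched as attention whose logits are augmented by a bias computed from pairwise relative geometry between regions. Everything below (the log-ratio geometry encoding, the stand-in projection weights, the box format) is an illustrative assumption, not the authors' implementation:

```python
# Hedged sketch of "spatially aware" attention: content-based attention logits
# plus a geometry-derived bias. NOT the paper's DSAT implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_bias(boxes):
    """Map pairwise relative geometry to a scalar bias per (i, j) pair.
    `boxes` is assumed (N, 4) as (cx, cy, w, h); the log-ratio encoding is a
    common choice in relation-aware attention, used here for illustration."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    dx = np.log(np.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    geom = np.stack([dx, dy, dw, dh], axis=-1)   # (N, N, 4) relative features
    w_g = np.full(4, 0.25)                       # stand-in for a learned projection
    return geom @ w_g                            # (N, N) scalar bias

def spatially_aware_attention(Q, K, V, boxes):
    d = Q.shape[-1]
    logits = (Q @ K.T) / np.sqrt(d) + spatial_bias(boxes)  # content + geometry
    return softmax(logits) @ V

rng = np.random.default_rng(0)
N, d = 5, 16
Q, K, V = rng.normal(size=(3, N, d))
boxes = np.abs(rng.normal(size=(N, 4))) + 0.5   # keep widths/heights positive
out = spatially_aware_attention(Q, K, V, boxes)
print(out.shape)  # (5, 16)
```

In a trained model the bias projection (`w_g` here) would be learned jointly with the attention weights, so that geometrically plausible object pairs (e.g. adjacent, aligned, or containing regions) receive higher attention regardless of feature similarity.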
format Article
id doaj-art-b8361c0fa2b54cd69471e3176c0f74da
institution Kabale University
issn 1939-1404
2151-1535
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling doaj-art-b8361c0fa2b54cd69471e3176c0f74da
indexed 2025-08-20T03:36:58Z
language eng
publisher IEEE
series IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
issn 1939-1404, 2151-1535
publishDate 2025-01-01
volume 18
pages 19546-19562
doi 10.1109/JSTARS.2025.3593887
ieee document 11104798
title Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning
authors Haifeng Sima (https://orcid.org/0000-0002-2049-3637), School of Computer Science and Technology & School of Software, Henan Polytechnic University, Jiaozuo, China
Xiangtao Ding (https://orcid.org/0009-0001-6680-5608), School of Computer Science and Technology & School of Software, Henan Polytechnic University, Jiaozuo, China
JianLong Wang (https://orcid.org/0000-0001-8117-0631), School of Computer Science and Technology & School of Software, Henan Polytechnic University, Jiaozuo, China
Mingliang Xu (https://orcid.org/0000-0002-6885-3451), School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
url https://ieeexplore.ieee.org/document/11104798/
topic Image captioning; remote sensing image captioning (RSIC); spatial-aware information (SAI); transformer
spellingShingle Haifeng Sima
Xiangtao Ding
JianLong Wang
Mingliang Xu
Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Image captioning
remote sensing image captioning (RSIC)
spatial-aware information (SAI)
transformer
title Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning
title_full Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning
title_fullStr Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning
title_full_unstemmed Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning
title_short Dual-Stream Spatially Aware Transformer for Remote Sensing Image Captioning
title_sort dual stream spatially aware transformer for remote sensing image captioning
topic Image captioning
remote sensing image captioning (RSIC)
spatial-aware information (SAI)
transformer
url https://ieeexplore.ieee.org/document/11104798/
work_keys_str_mv AT haifengsima dualstreamspatiallyawaretransformerforremotesensingimagecaptioning
AT xiangtaoding dualstreamspatiallyawaretransformerforremotesensingimagecaptioning
AT jianlongwang dualstreamspatiallyawaretransformerforremotesensingimagecaptioning
AT mingliangxu dualstreamspatiallyawaretransformerforremotesensingimagecaptioning