Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning


Bibliographic Details
Main Authors: Shiwei Zou, Yingmei Wei, Yuxiang Xie, Xidao Luan
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Remote Sensing
Subjects: image change captioning; remote sensing; frequency; state space model; change detection
Online Access: https://www.mdpi.com/2072-4292/17/8/1463
_version_ 1849714052922605568
author Shiwei Zou
Yingmei Wei
Yuxiang Xie
Xidao Luan
author_facet Shiwei Zou
Yingmei Wei
Yuxiang Xie
Xidao Luan
author_sort Shiwei Zou
collection DOAJ
description Remote Sensing Image Change Captioning (RSICC) has emerged as a cross-disciplinary technology that automatically generates sentences describing the changes in bi-temporal remote sensing images. While demonstrating significant potential for urban planning, agricultural surveillance, and disaster management, current RSICC methods exhibit two fundamental limitations: (1) vulnerability to pseudo-changes induced by illumination fluctuations and seasonal transitions and (2) an overemphasis on spatial variations with insufficient modeling of temporal dependencies in multi-temporal contexts. To address these challenges, we present the Frequency–Spatial–Temporal Fusion Network (FST-Net), a novel framework that integrates frequency, spatial, and temporal information for RSICC. Specifically, our Frequency–Spatial Fusion module implements adaptive spectral decomposition to disentangle structural changes from high-frequency noise artifacts, effectively suppressing environmental interference. The Spatio–Temporal Modeling module then employs state-space guided sequential scanning to capture evolutionary patterns of geospatial changes across temporal dimensions. Additionally, a unified dual-task decoder architecture bridges pixel-level change detection with semantic-level change captioning, achieving joint optimization of localization precision and description accuracy. Experiments on the LEVIR-MCI dataset demonstrate that FST-Net outperforms previous methods by 3.65% on BLEU-4 and 4.08% on CIDEr-D, establishing new performance standards for RSICC.
format Article
id doaj-art-d137e421fc744c8092507b92f0dbd5a5
institution DOAJ
issn 2072-4292
language English
publishDate 2025-04-01
publisher MDPI AG
record_format Article
series Remote Sensing
spelling doaj-art-d137e421fc744c8092507b92f0dbd5a5
2025-08-20T03:13:48Z
eng
MDPI AG
Remote Sensing
2072-4292
2025-04-01
Volume 17, Issue 8, Article 1463
10.3390/rs17081463
Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
Shiwei Zou (Department of System Engineering, National University of Defense Technology, Changsha 410000, China)
Yingmei Wei (Department of System Engineering, National University of Defense Technology, Changsha 410000, China)
Yuxiang Xie (Department of System Engineering, National University of Defense Technology, Changsha 410000, China)
Xidao Luan (Department of Computer Science, Changsha University, Changsha 410000, China)
spellingShingle Shiwei Zou
Yingmei Wei
Yuxiang Xie
Xidao Luan
Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
Remote Sensing
image change captioning
remote sensing
frequency
state space model
change detection
title Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
title_full Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
title_fullStr Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
title_full_unstemmed Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
title_short Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
title_sort frequency spatial temporal domain fusion network for remote sensing image change captioning
topic image change captioning
remote sensing
frequency
state space model
change detection
url https://www.mdpi.com/2072-4292/17/8/1463
work_keys_str_mv AT shiweizou frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning
AT yingmeiwei frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning
AT yuxiangxie frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning
AT xidaoluan frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning