Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning
Remote Sensing Image Change Captioning (RSICC) has emerged as a cross-disciplinary technology that automatically generates sentences describing the changes in bi-temporal remote sensing images. While demonstrating significant potential for urban planning, agricultural surveillance, and disaster mana...
| Main Authors: | Shiwei Zou, Yingmei Wei, Yuxiang Xie, Xidao Luan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Remote Sensing |
| Subjects: | image change captioning; remote sensing; frequency; state space model; change detection |
| Online Access: | https://www.mdpi.com/2072-4292/17/8/1463 |
| _version_ | 1849714052922605568 |
|---|---|
| author | Shiwei Zou Yingmei Wei Yuxiang Xie Xidao Luan |
| author_facet | Shiwei Zou Yingmei Wei Yuxiang Xie Xidao Luan |
| author_sort | Shiwei Zou |
| collection | DOAJ |
| description | Remote Sensing Image Change Captioning (RSICC) has emerged as a cross-disciplinary technology that automatically generates sentences describing the changes in bi-temporal remote sensing images. While demonstrating significant potential for urban planning, agricultural surveillance, and disaster management, current RSICC methods exhibit two fundamental limitations: (1) vulnerability to pseudo-changes induced by illumination fluctuations and seasonal transitions and (2) an overemphasis on spatial variations with insufficient modeling of temporal dependencies in multi-temporal contexts. To address these challenges, we present the Frequency–Spatial–Temporal Fusion Network (FST-Net), a novel framework that integrates frequency, spatial, and temporal information for RSICC. Specifically, our Frequency–Spatial Fusion module implements adaptive spectral decomposition to disentangle structural changes from high-frequency noise artifacts, effectively suppressing environmental interference. The Spatio–Temporal Modeling module is further developed to employ state-space guided sequential scanning to capture evolutionary patterns of geospatial changes across temporal dimensions. Additionally, a unified dual-task decoder architecture bridges pixel-level change detection with semantic-level change captioning, achieving joint optimization of localization precision and description accuracy. Experiments on the LEVIR-MCI dataset demonstrate that our FST-Net outperforms previous methods by 3.65% on BLEU-4 and 4.08% on CIDEr-D, establishing new performance standards for RSICC. |
| format | Article |
| id | doaj-art-d137e421fc744c8092507b92f0dbd5a5 |
| institution | DOAJ |
| issn | 2072-4292 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Remote Sensing |
| spelling | doaj-art-d137e421fc744c8092507b92f0dbd5a5; 2025-08-20T03:13:48Z; eng; MDPI AG; Remote Sensing; 2072-4292; 2025-04-01; vol. 17, no. 8, art. 1463; 10.3390/rs17081463; Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning; Shiwei Zou (Department of System Engineering, National University of Defense Technology, Changsha 410000, China); Yingmei Wei (Department of System Engineering, National University of Defense Technology, Changsha 410000, China); Yuxiang Xie (Department of System Engineering, National University of Defense Technology, Changsha 410000, China); Xidao Luan (Department of Computer Science, Changsha University, Changsha 410000, China); Remote Sensing Image Change Captioning (RSICC) has emerged as a cross-disciplinary technology that automatically generates sentences describing the changes in bi-temporal remote sensing images. While demonstrating significant potential for urban planning, agricultural surveillance, and disaster management, current RSICC methods exhibit two fundamental limitations: (1) vulnerability to pseudo-changes induced by illumination fluctuations and seasonal transitions and (2) an overemphasis on spatial variations with insufficient modeling of temporal dependencies in multi-temporal contexts. To address these challenges, we present the Frequency–Spatial–Temporal Fusion Network (FST-Net), a novel framework that integrates frequency, spatial, and temporal information for RSICC. Specifically, our Frequency–Spatial Fusion module implements adaptive spectral decomposition to disentangle structural changes from high-frequency noise artifacts, effectively suppressing environmental interference. The Spatio–Temporal Modeling module is further developed to employ state-space guided sequential scanning to capture evolutionary patterns of geospatial changes across temporal dimensions. Additionally, a unified dual-task decoder architecture bridges pixel-level change detection with semantic-level change captioning, achieving joint optimization of localization precision and description accuracy. Experiments on the LEVIR-MCI dataset demonstrate that our FST-Net outperforms previous methods by 3.65% on BLEU-4 and 4.08% on CIDEr-D, establishing new performance standards for RSICC.; https://www.mdpi.com/2072-4292/17/8/1463; image change captioning; remote sensing; frequency; state space model; change detection |
| spellingShingle | Shiwei Zou Yingmei Wei Yuxiang Xie Xidao Luan Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning Remote Sensing image change captioning remote sensing frequency state space model change detection |
| title | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning |
| title_full | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning |
| title_fullStr | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning |
| title_full_unstemmed | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning |
| title_short | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning |
| title_sort | frequency spatial temporal domain fusion network for remote sensing image change captioning |
| topic | image change captioning remote sensing frequency state space model change detection |
| url | https://www.mdpi.com/2072-4292/17/8/1463 |
| work_keys_str_mv | AT shiweizou frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning AT yingmeiwei frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning AT yuxiangxie frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning AT xidaoluan frequencyspatialtemporaldomainfusionnetworkforremotesensingimagechangecaptioning |