Transformer-Based Dual-Branch Spatial–Temporal–Spectral Feature Fusion Network for Paddy Rice Mapping

Bibliographic Details
Main Authors: Xinxin Zhang, Hongwei Wei, Yuzhou Shao, Haijun Luan, Da-Han Wang
Format: Article
Language: English
Published: MDPI AG, 2025-06-01
Series: Remote Sensing
Online Access: https://www.mdpi.com/2072-4292/17/12/1999
Description
Summary: Deep neural network fusion approaches utilizing multimodal remote sensing are essential for crop mapping. However, challenges such as insufficient spatiotemporal feature extraction and ineffective fusion strategies persist, reducing mapping accuracy and robustness when these approaches are applied across regions and time periods. In this study, we propose a novel rice mapping approach based on a dual-branch transformer fusion network, named RDTFNet. Specifically, we implemented a dual-branch encoder based on two improved transformer architectures: a multiscale transformer block that extracts spatial–spectral features from a single-phase optical image, and a Restormer block that extracts spatial–temporal features from time-series synthetic aperture radar (SAR) images. Both sets of extracted features were then passed to a feature fusion module (FFM) to generate fully fused spatial–temporal–spectral (STS) features, which were finally fed into a U-Net decoder for rice mapping. The model's performance was evaluated through experiments on Sentinel-1 and Sentinel-2 datasets from the United States. Compared with conventional models, the RDTFNet model achieved the best performance, with an overall accuracy (OA), intersection over union (IoU), precision, recall and F1-score of 96.95%, 88.12%, 95.14%, 92.27% and 93.68%, respectively. The comparative results show that the OA, IoU, precision, recall and F1-score improved by 1.61%, 5.37%, 5.16%, 1.12% and 2.53%, respectively, over those of the baseline model, demonstrating its superior performance for rice mapping. Furthermore, in subsequent cross-regional and cross-temporal tests, RDTFNet outperformed other classical models, achieving improvements of 7.11% and 12.10% in F1-score and of 11.55% and 18.18% in IoU, respectively. These results further confirm the robustness of the proposed model. The proposed RDTFNet can therefore effectively fuse STS features from multimodal images, exhibits strong generalization capability, and can provide valuable information for government agricultural management.
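
To make the described architecture concrete, below is a minimal PyTorch sketch of the dual-branch design. It is an illustrative reconstruction from the abstract alone, not the authors' published code: the internals of MultiScaleTransformerBlock and RestormerBlock are simplified stand-ins, the FFM is reduced to concatenation plus a 1x1 convolution, the U-Net decoder is collapsed to a single head, and all channel counts (10 optical bands, 12 SAR dates x 2 polarizations) are assumptions.

```python
# Hypothetical sketch of the RDTFNet dual-branch design; shapes and block
# internals are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class MultiScaleTransformerBlock(nn.Module):
    """Stand-in for the optical branch: pixel-token self-attention over the
    spectral feature map (the paper's multiscale design is simplified here)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        n = self.norm(t)
        t = t + self.attn(n, n, n)[0]           # spatial-spectral attention
        t = t + self.mlp(self.norm(t))
        return t.transpose(1, 2).reshape(b, c, h, w)

class RestormerBlock(nn.Module):
    """Stand-in for the SAR branch: channel-wise (transposed) attention in the
    spirit of Restormer, applied over stacked time-series channels."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        q, k, v = self.qkv(x).chunk(3, dim=1)
        b, c, h, w = q.shape
        q, k, v = (t.flatten(2) for t in (q, k, v))            # (B, C, H*W)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        return x + self.proj((attn @ v).reshape(b, c, h, w))   # temporal mixing

class RDTFNetSketch(nn.Module):
    def __init__(self, opt_bands=10, sar_channels=24, dim=64, n_classes=2):
        super().__init__()
        self.opt_stem = nn.Conv2d(opt_bands, dim, 3, padding=1)
        self.sar_stem = nn.Conv2d(sar_channels, dim, 3, padding=1)
        self.opt_branch = MultiScaleTransformerBlock(dim)
        self.sar_branch = RestormerBlock(dim)
        # FFM stand-in: concatenate both branches, then project back to dim.
        self.ffm = nn.Conv2d(dim * 2, dim, 1)
        # U-Net decoder collapsed to a single classification head for brevity.
        self.head = nn.Conv2d(dim, n_classes, 1)

    def forward(self, optical, sar):
        f_opt = self.opt_branch(self.opt_stem(optical))    # spatial-spectral
        f_sar = self.sar_branch(self.sar_stem(sar))        # spatial-temporal
        fused = self.ffm(torch.cat([f_opt, f_sar], dim=1)) # fused STS features
        return self.head(fused)                            # per-pixel logits

if __name__ == "__main__":
    net = RDTFNetSketch()
    opt = torch.randn(1, 10, 64, 64)   # single-date Sentinel-2 patch (assumed bands)
    sar = torch.randn(1, 24, 64, 64)   # e.g., 12 Sentinel-1 dates x 2 polarizations
    print(net(opt, sar).shape)         # torch.Size([1, 2, 64, 64])
```

The point the sketch preserves is that each modality is encoded by an attention mechanism matched to its dominant axis of variation: token (pixel) attention for the spectrally rich optical image, and channel attention over the stacked acquisition dates for the SAR time series, with fusion deferred until both feature maps exist.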
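
For reference when comparing the reported numbers, the standard binary-segmentation definitions of these metrics (assumed here; the record does not state the paper's exact evaluation protocol) are:

```latex
\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{IoU} = \frac{TP}{TP + FP + FN},
\qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN},
\qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}.
```

Since IoU = F1/(2 − F1), IoU is always the stricter of the two and amplifies differences at high accuracy levels, which is consistent with the IoU gains reported above being larger than the corresponding F1 gains.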
ISSN:2072-4292