CTSeg: CNN and ViT collaborated segmentation framework for efficient land-use/land-cover mapping with high-resolution remote sensing images
| Main Authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-05-01 |
| Series: | International Journal of Applied Earth Observations and Geoinformation |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1569843225001931 |
| Summary: | Semantic segmentation models play a significant role in land-use/land-cover (LULC) mapping. Even though vision transformers (ViT) with long-sequence interactions have recently emerged as popular solutions alongside convolutional neural networks (CNN), they remain less effective for high-resolution remote sensing data characterized by small volumes and rich heterogeneities. In this paper, we propose a novel CNN and ViT collaborated segmentation framework (CTSeg) to address these weaknesses. Following the encoder-decoder architecture, we first introduce an encoding backbone with multifarious attention mechanisms to capture global and local contexts respectively. It is designed with parallel dual branches, where position-relation aggregation (PRA) blocks and channel-relation aggregation (CRA) blocks form the CNN-based encoding module, whereas the ViT-based module comprises multi-stage window-shifted transformer (WST) blocks with cross-window interactions. We further explore online knowledge distillation, implemented with pixel-wise and channel-wise feature distillation modules, to facilitate bidirectional learning between the CNN and ViT backbones, supported by a well-designed loss decay strategy. In addition, we develop a multiscale feature decoding module to produce higher-quality segmentation predictions, in which correlation-weighted fusions emphasize the heterogeneous feature representations. Extensive comparison and ablation studies on two benchmark datasets demonstrate its competitive performance in efficient LULC mapping. |
| ISSN: | 1569-8432 |
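The bidirectional distillation objective described in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function names, the KL-based channel-wise formulation, and the linear loss-decay schedule are all assumptions made for illustration; the paper's actual distillation modules and decay strategy may differ.

```python
import numpy as np

def _softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixelwise_distill(f_a, f_b):
    # Pixel-wise feature distillation: mean squared error between
    # two aligned feature maps of shape (C, H, W).
    return float(np.mean((f_a - f_b) ** 2))

def channelwise_distill(f_a, f_b, tau=1.0):
    # Channel-wise feature distillation: KL divergence between
    # per-channel spatial attention distributions (a common choice;
    # the temperature tau is a hypothetical hyperparameter).
    c = f_a.shape[0]
    p = _softmax(f_a.reshape(c, -1) / tau, axis=1)
    q = _softmax(f_b.reshape(c, -1) / tau, axis=1)
    eps = 1e-8
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))) / c)

def mutual_distill_loss(f_cnn, f_vit, epoch, total_epochs, w0=1.0):
    # Bidirectional (online) distillation between the CNN and ViT
    # branch features, weighted by an assumed linear decay schedule
    # so the distillation signal fades as training progresses.
    decay = w0 * (1.0 - epoch / total_epochs)
    return decay * (pixelwise_distill(f_cnn, f_vit)
                    + channelwise_distill(f_cnn, f_vit))
```

Because both terms are symmetric in their arguments here, a single loss value can be backpropagated into both branches, which is the essence of online (mutual) distillation as opposed to a frozen-teacher setup.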