CTSeg: CNN and ViT collaborated segmentation framework for efficient land-use/land-cover mapping with high-resolution remote sensing images

Bibliographic Details
Main Authors: Jifa Chen, Gang Chen, Pin Zhou, Yufeng He, Lianzhe Yue, Mingjun Ding, Hui Lin
Format: Article
Language: English
Published: Elsevier 2025-05-01
Series: International Journal of Applied Earth Observation and Geoinformation
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S1569843225001931
Description
Summary: Semantic segmentation models play a significant role in land-use/land-cover (LULC) mapping. Although vision transformers (ViT) with long-sequence interactions have recently emerged as popular solutions alongside convolutional neural networks (CNN), they remain less effective for high-resolution remote sensing data characterized by small volumes and rich heterogeneities. In this paper, we propose a novel CNN and ViT collaborated segmentation framework (CTSeg) to address these weaknesses. Following the encoder-decoder architecture, we first introduce an encoding backbone with multifarious attention mechanisms to capture global and local contexts respectively. It is designed with parallel dual branches: position-relation aggregation (PRA) blocks, together with blocks modeling channel relations (CRA), form the CNN-based encoding module, whereas the ViT-based module comprises multi-stage window-shifted transformer (WST) blocks with cross-window interactions. We further explore online knowledge distillation, implemented with pixel-wise and channel-wise feature distillation modules, to facilitate bidirectional learning between the CNN and ViT backbones, supported by a well-designed loss decay strategy. In addition, we develop a multiscale feature decoding module in which correlation-weighted fusions emphasize the heterogeneous feature representations, producing higher-quality segmentation predictions. Extensive comparison and ablation studies on two benchmark datasets demonstrate its competitive performance in efficient LULC mapping.
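The bidirectional distillation described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the temperature parameter, and the exact KL formulation are assumptions based on common pixel-wise (softmax over channels per pixel) and channel-wise (softmax over spatial positions per channel) feature-distillation losses; the `decay` argument only hints at the loss decay strategy mentioned in the abstract.

```python
import numpy as np

def _softmax(x, axis):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixelwise_kd(student, teacher, tau=1.0):
    """KL(teacher || student) over the channel axis at each pixel.

    student, teacher: feature maps of shape (C, H, W).
    """
    p = _softmax(teacher / tau, axis=0)
    q = _softmax(student / tau, axis=0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=0)) * tau ** 2)

def channelwise_kd(student, teacher, tau=1.0):
    """KL(teacher || student) over flattened spatial positions, per channel."""
    c = student.shape[0]
    p = _softmax(teacher.reshape(c, -1) / tau, axis=1)
    q = _softmax(student.reshape(c, -1) / tau, axis=1)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)) * tau ** 2)

def bidirectional_kd(cnn_feat, vit_feat, decay=1.0):
    """Online, bidirectional distillation: each branch learns from the other.

    `decay` stands in for a schedule that down-weights this term over training.
    """
    loss_cnn = pixelwise_kd(cnn_feat, vit_feat) + channelwise_kd(cnn_feat, vit_feat)
    loss_vit = pixelwise_kd(vit_feat, cnn_feat) + channelwise_kd(vit_feat, cnn_feat)
    return decay * (loss_cnn + loss_vit)
```

When the two branches' feature maps agree, the loss is zero; as their softmax-normalized responses diverge, it grows, pushing each backbone toward the other's representation.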
ISSN: 1569-8432