Toward unsupervised building extraction from very high-resolution remote sensing images using SAM and CLIP
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2025-12-01 |
| Series: | GIScience & Remote Sensing |
| Subjects: | |
| Online Access: | https://www.tandfonline.com/doi/10.1080/15481603.2025.2543102 |
| Summary: | Building extraction has become a cornerstone for accurately assessing climate change, urban development, and human activities. The substantial variability in imaging conditions and building appearances poses a significant challenge to precise building extraction. While recent work has attempted to integrate foundation models into remote sensing tasks, most approaches focus on supervised methods or fine-tuning pipelines; unsupervised pipelines remain largely unexplored, particularly for large-scale building extraction from very high-resolution remote sensing images. In this work, we propose a two-stage unsupervised building extraction method driven by multi-modality foundation models. First, we introduce a zero-shot pseudo-label generation method guided by the integration of the Segment Anything Model (SAM) and the CLIP model. To address the misclassification of fragmented objects, we design a zoom-out strategy that restores broken segments. Next, we present a hybrid feature fusion network that combines CLIP patch tokens with task-specific features, achieving high data adaptability while preserving text-related visual features. Extensive experiments demonstrate that our proposed method achieves F1 scores of 56.00% and 62.53% on the Manhattan and WHU Building datasets, respectively, outperforming or matching supervised methods that require 700 training samples. Notably, when tested on two small-scale datasets, our method exhibits superior robustness compared to existing unsupervised domain adaptation approaches, showing 15%–25% less performance variation and demonstrating high adaptability to remote sensing datasets of varying scales. (A minimal code sketch of the SAM-and-CLIP pseudo-labeling idea follows this record.) |
| ISSN: | 1548-1603, 1943-7226 |
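To make the summary's first stage concrete, the following is a minimal, hypothetical Python sketch of zero-shot pseudo-label generation: SAM proposes class-agnostic segments, and CLIP scores each segment crop against text prompts to decide whether it is a building. The checkpoint path, prompt wording, model sizes, and the 0.5 acceptance threshold are illustrative assumptions, not settings from the paper, and the zoom-out strategy and the hybrid feature fusion network are not shown.

```python
# Sketch of the zero-shot pseudo-label stage: SAM proposes segments,
# CLIP classifies each segment crop as "building" vs. "background".
# NOTE: checkpoint path, prompt wording, model sizes, and the 0.5 threshold
# are illustrative assumptions, not settings taken from the paper.
import numpy as np
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# Class-agnostic region proposals from SAM (hypothetical checkpoint path).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)

# CLIP text anchors for the two classes (prompts are illustrative).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
prompts = clip.tokenize([
    "a satellite photo of a building",
    "a satellite photo of ground, vegetation, or a road",
]).to(device)

def pseudo_label(image: np.ndarray) -> np.ndarray:
    """Score each SAM segment with CLIP and return a binary building mask.

    `image` is an H x W x 3 uint8 RGB array.
    """
    masks = mask_generator.generate(image)
    label = np.zeros(image.shape[:2], dtype=np.uint8)
    with torch.no_grad():
        text_feat = clip_model.encode_text(prompts)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        for m in masks:
            x, y, w, h = (int(v) for v in m["bbox"])  # XYWH box of the segment
            if w == 0 or h == 0:
                continue
            crop = Image.fromarray(image[y:y + h, x:x + w])
            img_feat = clip_model.encode_image(preprocess(crop).unsqueeze(0).to(device))
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
            if probs[0, 0] > 0.5:  # crop reads as "building"
                label[m["segmentation"]] = 1
    return label
```

In the paper's pipeline, this kind of per-segment scoring is additionally combined with the zoom-out strategy to restore fragmented objects before the pseudo-labels supervise the second-stage hybrid feature fusion network; those steps are beyond what this sketch covers.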