Toward unsupervised building extraction from very high-resolution remote sensing images using SAM and CLIP

Bibliographic Details
Main Authors: Chenxiao Zhang, Peng Yue
Format: Article
Language: English
Published: Taylor & Francis Group, 2025-12-01
Series: GIScience & Remote Sensing
Online Access: https://www.tandfonline.com/doi/10.1080/15481603.2025.2543102
Description
Summary: Building extraction has become a cornerstone for accurately assessing climate change, urban development, and human activities. The substantial variability in imaging conditions and building appearances poses a significant challenge to precise building extraction. While recent work has attempted to integrate foundation models into remote sensing tasks, most approaches focus on supervised methods or fine-tuning pipelines; unsupervised pipelines remain largely unexplored, particularly for large-scale building extraction from very high-resolution remote sensing images. In this work, we propose a two-stage unsupervised building extraction method driven by multi-modality foundation models. First, we introduce a zero-shot pseudo-label generation method guided by the integration of the Segment Anything Model (SAM) and the CLIP model. To address the misclassification of fragmented objects, we design a zoom-out strategy that restores broken segments. Second, we present a hybrid feature fusion network that combines CLIP patch tokens with task-specific features, achieving high data adaptability while preserving text-aligned visual features. Extensive experiments show that the proposed method achieves F1 scores of 56.00% and 62.53% on the Manhattan and WHU Building datasets, respectively, outperforming or matching supervised methods that require 700 training samples. Notably, when tested on two small-scale datasets, our method exhibits superior robustness compared to existing unsupervised domain adaptation approaches, showing 15%–25% less performance variation and demonstrating high adaptability to remote sensing datasets of varying scales.
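
The record carries only the abstract, so the first stage can be illustrated with a minimal Python sketch of zero-shot pseudo-label generation with SAM and CLIP: SAM proposes class-agnostic masks, and CLIP scores each mask crop against building/background text prompts. The checkpoint filename, the prompts, the 0.5 acceptance threshold, and the crop padding (a crude stand-in for the paper's zoom-out strategy) are assumptions for illustration, not the authors' released code.

    import numpy as np
    import torch
    import clip  # https://github.com/openai/CLIP
    from PIL import Image
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # SAM proposes class-agnostic masks; CLIP then labels each mask crop by
    # similarity to text prompts.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
    mask_generator = SamAutomaticMaskGenerator(sam)
    clip_model, preprocess = clip.load("ViT-B/32", device=device)

    prompts = ["a satellite photo of a building", "a satellite photo of background"]
    with torch.no_grad():
        text_feat = clip_model.encode_text(clip.tokenize(prompts).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    def building_pseudo_label(image: np.ndarray, pad_ratio: float = 0.3) -> np.ndarray:
        """Binary building pseudo-label for an RGB uint8 image of shape (H, W, 3)."""
        label = np.zeros(image.shape[:2], dtype=np.uint8)
        for m in mask_generator.generate(image):
            x, y, w, h = (int(v) for v in m["bbox"])
            # Pad the crop so fragmented segments are judged with surrounding
            # context -- a crude stand-in for the paper's zoom-out strategy.
            px, py = int(w * pad_ratio), int(h * pad_ratio)
            y0, y1 = max(y - py, 0), min(y + h + py, image.shape[0])
            x0, x1 = max(x - px, 0), min(x + w + px, image.shape[1])
            crop = preprocess(Image.fromarray(image[y0:y1, x0:x1])).unsqueeze(0).to(device)
            with torch.no_grad():
                img_feat = clip_model.encode_image(crop)
                img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
                probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)[0]
            if probs[0] > 0.5:  # the crop reads as "building"
                label[m["segmentation"]] = 1
        return label

Any mask whose padded crop reads as "building" is burned into the pseudo-label; in the paper's pipeline, such labels then supervise the second-stage network.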
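The second stage is likewise described only at a high level. The schematic PyTorch module below shows one plausible reading of the hybrid fusion: CLIP patch tokens are reshaped to a spatial grid, upsampled, and concatenated with task-specific CNN features before a segmentation head. The channel widths, the concatenation-based fusion, and the head itself are illustrative assumptions, since the record gives no architectural details.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HybridFusionHead(nn.Module):
        def __init__(self, clip_dim: int = 768, task_dim: int = 256, num_classes: int = 2):
            super().__init__()
            self.fuse = nn.Conv2d(clip_dim + task_dim, task_dim, kernel_size=1)
            self.head = nn.Sequential(
                nn.Conv2d(task_dim, task_dim, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(task_dim, num_classes, 1),
            )

        def forward(self, clip_tokens: torch.Tensor, task_feats: torch.Tensor) -> torch.Tensor:
            # clip_tokens: (B, N, C) patch tokens, CLS removed; task_feats: (B, C2, H, W).
            b, n, c = clip_tokens.shape
            g = int(n ** 0.5)  # ViT patch grid is square
            grid = clip_tokens.transpose(1, 2).reshape(b, c, g, g)
            grid = F.interpolate(grid, size=task_feats.shape[-2:],
                                 mode="bilinear", align_corners=False)
            fused = self.fuse(torch.cat([grid, task_feats], dim=1))
            return self.head(fused)  # per-pixel building logits

    # Shape check with dummy tensors (ViT-B/16 on 224 px input -> 14x14 = 196 tokens).
    logits = HybridFusionHead()(torch.randn(2, 196, 768), torch.randn(2, 256, 56, 56))
    print(logits.shape)  # torch.Size([2, 2, 56, 56])
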
ISSN: 1548-1603 (print), 1943-7226 (online)