Toward unsupervised building extraction from very high-resolution remote sensing images using SAM and CLIP

Building extraction has become a cornerstone for accurately assessing climate change, urban development, and human activities. The substantial variability in imaging conditions and building appearances poses a significant challenge to precise building extraction. While recent work has attempted to integrate foundational models into remote sensing tasks, most approaches focus on supervised methods or fine-tuning streams. There has been limited exploration of unsupervised pipelines, particularly for large-scale building extraction from very high-resolution remote sensing images. In this work, we propose a two-stage unsupervised building extraction method driven by multi-modality foundation models. First, we introduce a zero-shot pseudo-label generation method, guided by the integration of the Segment Anything Model (SAM) and the CLIP model. To address the misclassification of fragmented objects, we design a zoom-out strategy to restore broken segments. Next, we present a hybrid feature fusion network that combines CLIP patch tokens with task-specific features, achieving high data adaptability while maintaining text-related visual features. Extensive experiments demonstrate that our proposed method achieves F1 scores of 56.00% and 62.53% on the Manhattan and WHU Building datasets, respectively, outperforming or matching supervised methods that require 700 training samples. Notably, when tested on two small-scale datasets, our method exhibits superior robustness compared to existing unsupervised domain adaptation approaches, showing 15%–25% less performance variation and demonstrating high adaptability to remote sensing datasets of varying scales.
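
The abstract's first stage can be illustrated concretely. Below is a minimal sketch of SAM-plus-CLIP zero-shot pseudo-label generation, assuming the official `segment-anything` and OpenAI `clip` packages; the checkpoint file, prompt wording, input tile name, and 0.5 decision threshold are illustrative assumptions, not the authors' reported configuration, and the paper's zoom-out strategy for fragmented segments is omitted here.

```python
# Minimal sketch: SAM proposes class-agnostic masks, CLIP scores each masked
# crop against text prompts, and building-labeled masks form the pseudo-label.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"

# SAM checkpoint name and CLIP backbone are illustrative choices.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompt pair; the paper's exact text prompts are not given here.
prompts = ["an aerial photo of a building", "an aerial photo of the ground"]
text_tokens = clip.tokenize(prompts).to(device)

image = np.array(Image.open("tile.png").convert("RGB"))  # hypothetical tile
pseudo_label = np.zeros(image.shape[:2], dtype=np.uint8)

for mask in mask_generator.generate(image):
    x, y, w, h = (int(v) for v in mask["bbox"])  # SAM bbox is XYWH
    crop = Image.fromarray(image[y:y + h, x:x + w])
    crop_input = clip_preprocess(crop).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = clip_model(crop_input, text_tokens)
        probs = logits_per_image.softmax(dim=-1)[0]
    if probs[0] > 0.5:  # "building" prompt wins: keep this segment
        pseudo_label[mask["segmentation"]] = 1
```

The design point this illustrates is that SAM supplies segmentation without labels, while CLIP supplies labels without segmentation; comparing each crop against natural-language prompts is what makes the pseudo-label stage zero-shot.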

Bibliographic Details
Main Authors: Chenxiao Zhang, Peng Yue (School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, Hubei, China)
Format: Article
Language: English
Published: Taylor & Francis Group, 2025-12-01
Series: GIScience & Remote Sensing
ISSN: 1548-1603, 1943-7226
Subjects: Unsupervised building extraction; Segment Anything Model; CLIP; Multi-modality; Foundation models
Online Access: https://www.tandfonline.com/doi/10.1080/15481603.2025.2543102