Self-Supervised Learning with Trilateral Redundancy Reduction for Urban Functional Zone Identification Using Street-View Imagery
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-02-01 |
| Series: | Sensors |
| Subjects: | |
| Online Access: | https://www.mdpi.com/1424-8220/25/5/1504 |
| Summary: | In recent years, the use of street-view images for urban analysis has attracted considerable attention. Despite the abundance of raw data, existing supervised learning methods rely heavily on large-scale, high-quality labels. To address label scarcity in urban scene classification, an innovative self-supervised learning framework, Trilateral Redundancy Reduction (Tri-ReD), is proposed. Within this framework, a more restrictive loss, the "trilateral loss", is introduced: by compelling the embeddings of positive samples to be highly correlated, it guides the pre-trained model to learn more essential representations without semantic labels. Furthermore, a novel data augmentation strategy, tri-branch mutually exclusive augmentation (Tri-MExA), is proposed to reduce the uncertainty introduced by conventional random augmentation. As a model pre-training method, the Tri-ReD framework is architecture-agnostic, performing effectively on both CNNs and ViTs, which makes it adaptable to a wide variety of downstream tasks. In this paper, 116,491 unlabeled street-view images were used to pre-train models with Tri-ReD, yielding a general ground-level representation of urban scenes. These pre-trained models were then fine-tuned on supervised data with semantic labels (17,600 images from BIC_GSV and 12,871 from BEAUTY) for the final classification task. Experimental results demonstrate that the proposed self-supervised pre-training outperformed direct supervised learning for urban functional zone identification by 19% on average. It also surpassed models pre-trained on ImageNet by around 11%, achieving state-of-the-art (SOTA) results in self-supervised pre-training. |
|---|---|
| ISSN: | 1424-8220 |
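The record does not give the exact form of the trilateral loss. As a rough illustration of the idea described in the summary (embeddings of positive samples compelled to be highly correlated across three augmented branches), the sketch below assumes a Barlow-Twins-style redundancy-reduction objective summed over the three branch pairs; the function names, the standardization step, and the weighting factor `lam` are all assumptions, not the paper's actual implementation.

```python
import numpy as np

def cross_correlation(za, zb):
    # Standardize each embedding dimension across the batch, then
    # form the D x D cross-correlation matrix between two branches.
    za = (za - za.mean(axis=0)) / (za.std(axis=0) + 1e-8)
    zb = (zb - zb.mean(axis=0)) / (zb.std(axis=0) + 1e-8)
    return (za.T @ zb) / za.shape[0]

def trilateral_loss(z1, z2, z3, lam=5e-3):
    # Hypothetical "trilateral" objective: for each of the three
    # branch pairs, pull the diagonal of the cross-correlation
    # matrix toward 1 (invariance across augmentations) and push
    # the off-diagonal entries toward 0 (redundancy reduction).
    loss = 0.0
    for za, zb in ((z1, z2), (z2, z3), (z1, z3)):
        c = cross_correlation(za, zb)
        on_diag = ((np.diag(c) - 1.0) ** 2).sum()
        off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
        loss += on_diag + lam * off_diag
    return loss
```

Under this reading, three mutually exclusive augmented views of the same street-view image would each pass through the encoder, and the loss above would be minimized over the resulting embedding batches; perfectly correlated branches drive the loss toward zero, while uncorrelated branches are heavily penalized.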