An Improved MAE-Based Pretraining Method for Urban Public Space Monitoring With Optical Remote Sensing Imagery

Bibliographic Details
Main Authors: Wentao Wei, Huan Chen, Yu Jiang, Li Fu, Ping Yao
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Online Access: https://ieeexplore.ieee.org/document/11059859/
Description
Summary: Monitoring urban public spaces is a vital component of scientific urban planning. This article proposes an improved masked autoencoder (MAE)-based pretraining method for automatic monitoring of urban public spaces through semantic segmentation of optical satellite remote sensing imagery. Most existing vision transformer image encoders are pretrained with MAE on natural images; because of the large domain gap between natural images and remote sensing imagery, they transfer poorly to small-scale remote sensing datasets, which limits fine-tuning performance on downstream tasks. In contrast, our method performs pretraining directly on the target remote sensing imagery to better capture its feature distribution. Specifically, label information is filled into masked regions to enhance the model’s image reconstruction capabilities. In addition, we design a momentum branch to ensure the stability and consistency of feature updates, and we adopt a dynamic masking strategy that reduces reconstruction difficulty during the initial stages of pretraining and ensures a smooth transition to later stages, significantly improving the accuracy and efficiency of image generation. Experimental results demonstrate that the proposed method significantly outperforms competing MAE-based vision transformer approaches as well as state-of-the-art CNN-based methods on the test dataset. Furthermore, an analysis of the per capita public space in Haikou city validates the effectiveness of the proposed method for monitoring urban public spaces.
ISSN: 1939-1404, 2151-1535
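
Illustrative sketch: The summary names a dynamic masking strategy and a momentum branch but gives no formulas for either. The Python sketch below shows one common way such components are built, purely as an illustration and not as the authors' implementation. Everything in it is an assumption: the cosine ramp, the start and end ratios (0.3 and 0.75), the momentum coefficient (0.99), and the function names `mask_ratio`, `random_mask`, and `ema_update` are all hypothetical. The label-filling step is left out because the summary does not say what form it takes.

```python
import math
import random


def mask_ratio(step, total_steps, start=0.3, end=0.75):
    """Assumed schedule: cosine ramp of the mask ratio from `start` to `end`.

    A low ratio early on makes reconstruction easier; the ratio then rises
    smoothly toward `end`. The specific values are illustrative only.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end - (end - start) * 0.5 * (1.0 + math.cos(math.pi * progress))


def random_mask(num_patches, ratio, rng):
    """Pick round(num_patches * ratio) patches at random; True means masked."""
    num_masked = int(round(num_patches * ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def ema_update(momentum_params, online_params, m=0.99):
    """Assumed momentum-branch rule: exponential moving average, in place.

    momentum <- m * momentum + (1 - m) * online
    """
    for i, (t, o) in enumerate(zip(momentum_params, online_params)):
        momentum_params[i] = m * t + (1.0 - m) * o


if __name__ == "__main__":
    rng = random.Random(0)
    online = [1.0, 2.0]       # stand-in for encoder weights
    momentum = [0.0, 0.0]     # stand-in for momentum-branch weights
    for step in (0, 50, 100):
        ratio = mask_ratio(step, total_steps=100)
        mask = random_mask(196, ratio, rng)  # 196 = 14 x 14 patches
        ema_update(momentum, online)
        print(step, round(ratio, 3), sum(mask))
```

Under these assumptions the mask ratio starts at 0.3 and reaches 0.75 at the final step, while the momentum parameters drift slowly toward the online ones, which is the usual motivation for this kind of branch: a slowly changing, more stable target.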