SAM2Former: Segment Anything Model 2 Assisting UNet-Like Transformer for Remote Sensing Image Semantic Segmentation

Bibliographic Details
Main Authors: Xuewen Li, Xiaomin Tian, Zihong Wang, Feng Zhang, Yanting Zhang, Na Yang, Chuanzhao Tian
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11052296/
Description
Summary: Remote sensing semantic segmentation plays a crucial role in land cover classification, disaster monitoring, and urban planning. However, due to the high complexity and category imbalance of remote sensing datasets, traditional image segmentation networks tend to be slow to train and struggle to capture deep semantic information, so their accuracy fails to meet the requirements of practical applications. Inspired by the powerful transfer learning capability of visual foundation models, we introduce a semantic segmentation network for remote sensing images called SAM2Former, which uses the Segment Anything Model 2 (SAM2) to assist a UNet-like Transformer. SAM2Former adopts a dual-encoder, single-decoder structure, with an efficient ResNet18 and SAM2's image encoder serving as the dual encoders and a Transformer serving as the decoder. First, we incorporate the lightweight Adapter to perform parameter-efficient fine-tuning of SAM2 and design a multi-scale information aggregation module (MIAM) to connect the dual encoders; it weights the multi-scale features of the SAM2 blocks and the CNN layer by layer, preserving key information in the image. Second, we devise a decoder based on a global-local Transformer module (GLTM) that effectively extracts global context and local detail, improving the segmentation of edge textures. Finally, we construct a feature enhancement module (FEM) that combines channel and spatial attention to improve the recognition accuracy of similar categories. Comprehensive comparative and ablation experiments with SAM2Former on the ISPRS Vaihingen and Potsdam datasets show notable improvements in segmentation performance.
ISSN:2169-3536
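
The summary above describes the architecture only in prose; the following is a minimal, self-contained PyTorch sketch of the dual-encoder idea under our own assumptions. A frozen patch-embedding layer stands in for SAM2's image encoder, `SimpleAdapter` is an assumed bottleneck adapter, and `FusionBlock` and `ChannelSpatialAttention` are simplified stand-ins for the paper's MIAM and FEM; none of the module names, dimensions, or hyperparameters come from the released SAM2Former implementation.

```python
# Hedged sketch of a SAM2Former-style dual-encoder segmentation network.
# All module names and hyperparameters are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class SimpleAdapter(nn.Module):
    """Assumed lightweight bottleneck adapter for parameter-efficient
    fine-tuning of a frozen foundation-model feature map."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.down = nn.Conv2d(dim, dim // reduction, 1)
        self.up = nn.Conv2d(dim // reduction, dim, 1)

    def forward(self, x):
        return x + self.up(F.gelu(self.down(x)))


class FusionBlock(nn.Module):
    """Simplified stand-in for MIAM: learns a weight to blend CNN and
    foundation-model features at the CNN branch's spatial resolution."""
    def __init__(self, cnn_dim, fm_dim, out_dim):
        super().__init__()
        self.proj_cnn = nn.Conv2d(cnn_dim, out_dim, 1)
        self.proj_fm = nn.Conv2d(fm_dim, out_dim, 1)
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_cnn, f_fm):
        f_fm = F.interpolate(self.proj_fm(f_fm), size=f_cnn.shape[-2:],
                             mode="bilinear", align_corners=False)
        a = torch.sigmoid(self.alpha)
        return a * self.proj_cnn(f_cnn) + (1 - a) * f_fm


class ChannelSpatialAttention(nn.Module):
    """Simplified stand-in for FEM: channel then spatial attention."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(dim, dim // reduction, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(dim // reduction, dim, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        x = x * torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)))
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(s))


class DualEncoderSegNet(nn.Module):
    def __init__(self, num_classes=6, fm_dim=256):
        super().__init__()
        # CNN branch: first two ResNet18 stages (trainable).
        r = resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        # Foundation-model branch: a frozen patch embedding stands in for
        # SAM2's image encoder; only the adapter on top is trained.
        self.fm_encoder = nn.Conv2d(3, fm_dim, kernel_size=16, stride=16)
        for p in self.fm_encoder.parameters():
            p.requires_grad = False
        self.adapter = SimpleAdapter(fm_dim)
        self.fuse = FusionBlock(cnn_dim=128, fm_dim=fm_dim, out_dim=128)
        self.fem = ChannelSpatialAttention(128)
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        f_cnn = self.layer2(self.layer1(self.stem(x)))  # stride-8 features
        f_fm = self.adapter(self.fm_encoder(x))          # stride-16 features
        f = self.fem(self.fuse(f_cnn, f_fm))
        return F.interpolate(self.head(f), size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    logits = DualEncoderSegNet()(torch.randn(1, 3, 256, 256))
    print(logits.shape)  # torch.Size([1, 6, 256, 256])
```

In the actual model, the frozen branch would be SAM2's image encoder producing multi-scale features, fusion would occur at every encoder stage rather than once, and the decoder would be the global-local Transformer described in the abstract rather than a single 1x1 classification head.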