RLita: A Region-Level Image–Text Alignment Method for Remote Sensing Foundation Model


Saved in:
Bibliographic Details
Main Authors: Qiang Zhang, Decheng Wang, Xiao Yu
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Remote Sensing
Subjects:
Online Access: https://www.mdpi.com/2072-4292/17/10/1661
Description
Summary: Fine-tuning optimization for foundation models has gradually become a research hotspot with the development of generative pretrained transformers. However, compared to natural scene images, remote sensing images have a wide range of spatial scales, complex objects, and limited labelled samples, which introduces great challenges to image interpretation. To reduce the gap between natural scene images and remote sensing images, this paper proposes a novel RLita optimization method for foundation models. Specifically, a region-level image–text alignment optimization method is proposed to represent the features of images and texts as visual and semantic representation vectors in one embedding space for better model generalization, and a parameter-efficient tuning strategy is designed to reduce computational resources. Experiments on five remote sensing datasets covering object detection, semantic segmentation, and change detection show the effectiveness of the RLita method.
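The summary describes mapping image features and text features into a single shared embedding space. A minimal sketch of such an alignment objective, using a CLIP-style symmetric contrastive loss; this is not the paper's actual region-level implementation, and all function names, batch shapes, and the temperature value are illustrative assumptions:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project vectors onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    v = l2_normalize(image_emb)   # visual representation vectors
    t = l2_normalize(text_emb)    # semantic representation vectors
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(logits))             # matched pairs lie on the diagonal

    def xent(l):
        # Cross-entropy of the softmax over each row against the diagonal target.
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))    # near-aligned pairs -> small loss
loss_aligned = contrastive_alignment_loss(img, txt)
loss_random = contrastive_alignment_loss(img, rng.normal(size=(4, 8)))
```

In practice such an objective is combined with the parameter-efficient tuning the summary mentions (updating only a small subset of model weights) rather than full fine-tuning.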
ISSN:2072-4292