RLita: A Region-Level Image–Text Alignment Method for Remote Sensing Foundation Model

Fine-tuning optimization of foundation models has gradually become a research hotspot owing to the development of generative pretrained transformers. However, compared with natural scene images, remote sensing images have a wide range of spatial scales, complex objects, and limited labelled sample...

Bibliographic Details
Main Authors: Qiang Zhang, Decheng Wang, Xiao Yu
Format: Article
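Language: English
Published: MDPI AG 2025-05-01
Series: Remote Sensing
Subjects: remote sensing; foundation model optimization; object detection; semantic segmentation; change detection
Online Access: https://www.mdpi.com/2072-4292/17/10/1661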
_version_ 1849326818081898496
author Qiang Zhang
Decheng Wang
Xiao Yu
collection DOAJ
description Fine-tuning optimization of foundation models has gradually become a research hotspot owing to the development of generative pretrained transformers. However, compared with natural scene images, remote sensing images cover a wide range of spatial scales, contain complex objects, and have limited labelled samples, which poses great challenges for image interpretation. To reduce the gap between natural scene images and remote sensing images, this paper proposes a novel RLita optimization method for foundation models. Specifically, a region-level image–text alignment optimization method is proposed to represent the features of images and texts as visual and semantic representation vectors in one embedding space for better model generalization, and a parameter-efficient tuning strategy is designed to reduce the computational resources required. Experiments on five remote sensing datasets covering object detection, semantic segmentation, and change detection show the effectiveness of the RLita method.
format Article
id doaj-art-fed79724f79a4191b8720c3c8106be82
institution Kabale University
issn 2072-4292
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Remote Sensing
spelling doaj-art-fed79724f79a4191b8720c3c8106be82
2025-08-20T03:48:02Z
eng
MDPI AG
Remote Sensing, ISSN 2072-4292, 2025-05-01, vol. 17, no. 10, art. 1661
doi 10.3390/rs17101661
RLita: A Region-Level Image–Text Alignment Method for Remote Sensing Foundation Model
Qiang Zhang, Decheng Wang, Xiao Yu (all: Beijing Institute of Tracking and Telecommunication Technology, Beijing 100094, China)
https://www.mdpi.com/2072-4292/17/10/1661
remote sensing; foundation model optimization; object detection; semantic segmentation; change detection
title RLita: A Region-Level Image–Text Alignment Method for Remote Sensing Foundation Model
topic remote sensing
foundation model optimization
object detection
semantic segmentation
change detection
url https://www.mdpi.com/2072-4292/17/10/1661