Training strategies for semi-supervised remote sensing image captioning

Bibliographic Details
Main Authors: Qimin Cheng, Haojun Cheng, Linfeng Yuan, Yingjie Du, Saifei Tu, Yuqi Xu
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Subjects:
Online Access: https://doi.org/10.1038/s41598-025-09853-8
Description
Summary: Remote sensing image captioning is a rapidly growing application with significant roles in environmental monitoring, disaster response, and military intelligence. Despite advancements in handling scene diversity, object scale variability, and improving caption quality, existing models usually rely on large labeled datasets and computationally expensive methods, which limits their effectiveness in scenarios with scarce annotations. To overcome these challenges, this paper introduces semi-supervised training strategies that reduce the dependence on labeled data while enhancing both the quality and diversity of captions. Specifically, we propose a Weakly Supervised Enhanced Noisy Teacher-Student Network (WENTS) to prevent over-reliance on specific features and improve generalization. Additionally, we develop a Two-Stage Training Network (TSTN) designed to stabilize learning, reduce gradient variance, and promote the generation of diverse captions. Our methods demonstrate strong performance even with low sampling rates and simple network architectures, highlighting their scalability. Experimental evaluations on benchmark datasets demonstrate the promising results achieved by our strategies, particularly the state-of-the-art performance attained on the challenging NWPU-Captions dataset, with improvements of 17.71% in CIDEr and 11.23% in Sm over previous state-of-the-art methods.
ISSN: 2045-2322
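
The abstract describes a noisy teacher-student scheme for semi-supervised captioning but does not spell out its training loop. The sketch below is not the paper's WENTS; it is a minimal, generic teacher-student pseudo-labeling loop in PyTorch built under assumptions: the `ToyCaptioner` model, the Gaussian input noise, the EMA decay of 0.99, and the 0.5 unlabeled-loss weight are all illustrative placeholders, not values from the paper.

```python
# Generic noisy teacher-student loop for semi-supervised captioning (illustrative only;
# NOT the paper's WENTS). Model sizes, noise, and loss weights are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, FEAT_DIM, HID, MAX_LEN = 100, 64, 128, 12  # toy sizes (assumptions)

class ToyCaptioner(nn.Module):
    """Tiny image-feature-to-token captioner used only to show the training flow."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(FEAT_DIM, HID)
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, feats, tokens):
        h0 = torch.tanh(self.enc(feats)).unsqueeze(0)        # (1, B, HID)
        x, _ = self.rnn(self.emb(tokens), h0)                 # teacher forcing
        return self.out(x)                                     # (B, T, VOCAB)

    @torch.no_grad()
    def greedy_decode(self, feats):
        h = torch.tanh(self.enc(feats)).unsqueeze(0)
        tok = torch.zeros(feats.size(0), 1, dtype=torch.long)  # <bos> = 0 (assumption)
        caps = []
        for _ in range(MAX_LEN):
            x, h = self.rnn(self.emb(tok), h)
            tok = self.out(x).argmax(-1)
            caps.append(tok)
        return torch.cat(caps, dim=1)                           # pseudo-captions

def caption_loss(model, feats, caps):
    """Teacher-forced next-token loss: predict caps[t] from <bos> + caps[:t]."""
    bos = torch.zeros(caps.size(0), 1, dtype=torch.long)
    logits = model(feats, torch.cat([bos, caps[:, :-1]], dim=1))
    return F.cross_entropy(logits.flatten(0, 1), caps.flatten())

def ema_update(teacher, student, decay=0.99):
    """Teacher tracks an exponential moving average of student weights."""
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(decay).add_(sp.detach(), alpha=1 - decay)

student = ToyCaptioner()
teacher = copy.deepcopy(student)            # teacher starts as a frozen copy
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(3):  # random tensors stand in for labeled / unlabeled batches
    feats_l = torch.randn(4, FEAT_DIM)                    # labeled image features
    caps_l = torch.randint(0, VOCAB, (4, MAX_LEN))        # ground-truth captions
    feats_u = torch.randn(8, FEAT_DIM)                    # unlabeled image features

    pseudo = teacher.greedy_decode(feats_u)               # 1) teacher pseudo-labels
    noisy_u = feats_u + 0.1 * torch.randn_like(feats_u)   # 2) student sees noised inputs

    loss = caption_loss(student, feats_l, caps_l) \
         + 0.5 * caption_loss(student, noisy_u, pseudo)   # weighting is an assumption

    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)                          # 3) teacher refresh via EMA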