KOMPSAT-3/3A Image-text Dataset for Training Large Multimodal Models

This study aims to improve the accuracy and interpretability of large multimodal models (LMMs) specialized in satellite image analysis by constructing an image-text dataset based on KOMPSAT-3/3A imagery and presenting the results of training using this dataset. Conventional LMMs are primarily traine...

Bibliographic Details
Main Authors: Han Oh, Dong-Bin Shin, Dae-Won Chung
Format: Article
Language: English
Published: GeoAI Data Society 2025-03-01
Series: Geo Data
Subjects: large multimodal model; satellite imagery; kompsat; image-text dataset; finetuning
Online Access: http://geodata.kr/upload/pdf/GD-2025-0003.pdf
collection DOAJ
description This study aims to improve the accuracy and interpretability of large multimodal models (LMMs) specialized in satellite image analysis by constructing an image-text dataset based on KOMPSAT-3/3A imagery and presenting the results of training using this dataset. Conventional LMMs are primarily trained on general images, limiting their ability to effectively interpret the specific characteristics of satellite imagery, such as spectral bands, spatial resolution, and viewing angles. To address this limitation, we developed an image-text dataset, divided into pretraining and finetuning stages, based on the existing KOMPSAT object detection dataset. The pretraining dataset consists of captions summarizing the overall theme and key information of each image. The finetuning dataset integrates metadata (including acquisition time, sensor type, and coordinates) with detailed object detection labels to generate six types of question-answer pairs: detailed descriptions, conversations with varying answer lengths, bounding box identification, multiple-choice questions, and complex reasoning. This structured dataset enables the model to learn not only the general context of satellite images but also fine-grained details such as object quantity, location, and geographic attributes. Training with the new KOMPSAT-based dataset significantly improved the model’s accuracy in recognizing regional information and object characteristics in satellite imagery. Finetuned models achieved substantially higher accuracy than previous models, surpassing even the GPT-4o model and demonstrating the effectiveness of a domain-specific dataset. The findings of this study are expected to contribute to various remote sensing applications, including automated satellite image analysis, change detection, and object detection.
id doaj-art-e8a5c0e608f84aee9cc2f328ce7b2975
institution OA Journals
issn 2713-5004
spelling doaj-art-e8a5c0e608f84aee9cc2f328ce7b2975
2025-08-20T01:55:09Z
eng
GeoAI Data Society
Geo Data
2713-5004
2025-03-01
7
1
27
35
10.22761/GD.2025.0003
181
KOMPSAT-3/3A Image-text Dataset for Training Large Multimodal Models
Han Oh (0): Principal Researcher, National Satellite Operation & Application Center, Korea Aerospace Research Institute (KARI), 169-84 Gwahak-ro, Yuseong-gu, 34133 Daejeon, South Korea
Dong-Bin Shin (1): Master Student, Major in Aerospace System Engineering, University of Science and Technology (UST), 169-84 Gwahak-ro, Yuseong-gu, 34133 Daejeon, South Korea
Dae-Won Chung (2): Principal Researcher, National Satellite Operation & Application Center, Korea Aerospace Research Institute (KARI), 169-84 Gwahak-ro, Yuseong-gu, 34133 Daejeon, South Korea
topic large multimodal model
satellite imagery
kompsat
image-text dataset
finetuning