Comparative analysis of the performance of selected machine learning algorithms depending on the size of the training sample

The article presents an analysis of the effectiveness of selected machine learning methods: Random Forest (RF), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM) in the classification of land use and cover in satellite images. Several variants of each algorithm were tested, adopting...

Bibliographic Details
Main Authors:	Kupidura Przemysław, Kępa Agnieszka, Krawczyk Piotr
Format:	Article
Language:	English
Published:	Sciendo 2024-12-01
Series:	Reports on Geodesy and Geoinformatics
Subjects:	efficiency; classification; machine learning; remote sensing; satellite imagery; training sample size
Online Access:	https://doi.org/10.2478/rgg-2024-0015

collection	DOAJ
description	The article analyses the effectiveness of selected machine learning methods, Random Forest (RF), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM), in classifying land use and land cover in satellite images. Several variants of each algorithm were tested, with different values of the parameters typical for each. Each variant was run multiple (20) times, using training samples of different sizes: from 100 to 200,000 pixels. The tests were conducted independently on 3 Sentinel-2 satellite images, identifying 5 basic land cover classes: built-up areas, soil, forest, water, and low vegetation. Standard metrics were used for the accuracy assessment: Cohen’s kappa coefficient and overall accuracy (for whole images), and F1 score, precision, and recall (for individual classes). The results for the different images were consistent and clearly showed classification accuracy increasing with the size of the training sample. They also showed that, among the tested algorithms, XGB is the most sensitive to training sample size and SVM the least; SVM achieved relatively good results even with the smallest training samples. At the same time, while the differences between the tested variants of RF and XGB were slight, the effectiveness of SVM depended strongly on the gamma parameter: with too high a value, the model tended to overfit and did not produce satisfactory results.
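The accuracy metrics named in the description can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names (`overall_accuracy`, `cohen_kappa`, `per_class_scores`) and the six toy pixel labels are assumptions made for illustration only; the class names are taken from the description.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = overall_accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of (reference share * predicted share).
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (one class only): kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy reference and predicted labels (hypothetical data, six pixels).
y_true = ["water", "water", "forest", "forest", "soil", "soil"]
y_pred = ["water", "water", "forest", "soil", "soil", "soil"]

print(overall_accuracy(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
print(per_class_scores(y_true, y_pred, "forest"))
```

In this toy case five of six pixels agree, and kappa comes out lower than overall accuracy because part of that agreement is expected by chance. In a sample-size experiment of the kind described, such functions would be evaluated once per trained model and per training sample size.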
id	doaj-art-dd9ec86503c1451fa7d29253b7ad9f73
institution	OA Journals
issn	2391-8152
affiliations	Kupidura Przemysław, Kępa Agnieszka: Faculty of Geodesy and Cartography, Warsaw University of Technology, Pl. Politechniki 1, 00-661, Warsaw, Poland; Krawczyk Piotr: Orbitile Ltd., Potułkały 6B/4, 02-791, Warsaw, Poland
volume	118, issue 1


Similar Items