A dissimilarity-adaptive cross-validation method for evaluating geospatial machine learning predictions with clustered samples


Bibliographic Details
Main Authors: Yanwen Wang, Mahdi Khodadadzadeh, Raúl Zurita-Milla
Format: Article
Language: English
Published: Elsevier 2025-12-01
Series: Ecological Informatics
Online Access: http://www.sciencedirect.com/science/article/pii/S1574954125002961
Description
Summary: Spatially clustered samples are prevalent in geospatial machine learning (ML) predictions, especially in ecological mapping. Because densely sampled regions of the prediction area are overrepresented, the data distribution of the samples differs from that of the predictions, which poses a noticeable challenge for the evaluation of geospatial ML predictions. Neither random nor spatial cross-validation (CV) methods can consistently yield accurate evaluations: random CV overestimates prediction performance when clustering is high, while spatial CV underestimates it when clustering is low. To tackle this challenge, we propose a novel “adaptive” evaluation method called dissimilarity-adaptive cross-validation (DA-CV), which is based on the data feature space. DA-CV categorizes the prediction locations into “similar” and “different” groups according to the dissimilarity between their covariates and those of the sampled locations. DA-CV applies random CV to evaluate “similar” locations and spatial CV to evaluate “different” ones. The final evaluation metric is obtained as a weighted average of the two. To test DA-CV, we conducted a series of experiments on synthetic species abundance and real aboveground biomass datasets in which the degree of clustering was gradually changed, and we compared DA-CV with current CV methods (RDM-CV, SP-CV, and kNNDM). Results showed that DA-CV provided the most accurate evaluations in 85% of scenarios. DA-CV effectively overcomes the common limitations of random and spatial CV methods, such as considering only part of the predictions in the evaluation. This means that DA-CV can provide accurate evaluations for most situations involving clustered samples. The success of DA-CV confirms that considering feature space information is an effective way to improve the evaluation of geospatial ML predictions.
ISSN: 1574-9541
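
The grouping and weighting steps described in the summary can be illustrated with a minimal Python sketch. The summary does not state how dissimilarity is measured, how the "similar"/"different" cutoff is chosen, or how the weights are defined, so everything specific here is an assumption made for illustration: nearest-neighbor Euclidean distance in covariate space as the dissimilarity, a user-supplied `threshold`, and weights equal to the share of prediction locations in each group. The function names are hypothetical and are not taken from the article.

```python
import math


def nearest_sample_distance(point, samples):
    """Distance in covariate space from one prediction location to its
    nearest sampled location (assumed dissimilarity measure)."""
    return min(math.dist(point, s) for s in samples)


def split_similar_different(pred_covariates, sample_covariates, threshold):
    """Return index lists of 'similar' and 'different' prediction locations.

    A location counts as 'similar' when its nearest-sample distance is at
    most `threshold` (an assumed, user-chosen cutoff).
    """
    similar, different = [], []
    for i, point in enumerate(pred_covariates):
        if nearest_sample_distance(point, sample_covariates) <= threshold:
            similar.append(i)
        else:
            different.append(i)
    return similar, different


def da_cv_metric(random_cv_metric, spatial_cv_metric, n_similar, n_different):
    """Weighted average of a random-CV metric and a spatial-CV metric.

    Weights are the proportions of 'similar' and 'different' prediction
    locations (an assumption; the summary only says 'weighted average').
    """
    total = n_similar + n_different
    if total == 0:
        raise ValueError("no prediction locations to weight")
    w_similar = n_similar / total
    return w_similar * random_cv_metric + (1.0 - w_similar) * spatial_cv_metric


# Toy covariates: two sampled locations and four prediction locations.
samples = [(0.0, 0.0), (1.0, 1.0)]
predictions = [(0.1, 0.0), (1.0, 0.9), (5.0, 5.0), (0.0, 4.0)]

similar, different = split_similar_different(predictions, samples, threshold=0.5)

# Placeholder error values standing in for metrics from random and spatial CV.
combined = da_cv_metric(2.0, 4.0, len(similar), len(different))
```

In this toy case the first two prediction locations lie close to a sample in covariate space and the last two do not, so the two placeholder metrics are weighted equally. In practice the two input metrics would come from actual random and spatial CV runs on the sampled data.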