A dissimilarity-adaptive cross-validation method for evaluating geospatial machine learning predictions with clustered samples
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-12-01 |
| Series: | Ecological Informatics |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1574954125002961 |
| Summary: | Spatially clustered samples are prevalent in geospatial machine learning (ML) predictions, especially in ecological mapping. Because densely sampled regions in the prediction area are overrepresented, the data distribution of the samples differs from that of the prediction locations, which poses a noticeable challenge for evaluating geospatial ML predictions. Neither random nor spatial cross-validation (CV) methods consistently yield accurate evaluations: random CV overestimates prediction performance when clustering is high, while spatial CV underestimates it when clustering is low. To tackle this challenge, we propose a novel “adaptive” evaluation method called dissimilarity-adaptive cross-validation (DA-CV), which is based on the data feature space. DA-CV divides the prediction locations into “similar” and “different” groups according to the dissimilarity between their covariates and those of the sampled locations. It then applies random CV to evaluate the “similar” locations and spatial CV to evaluate the “different” ones, and obtains the final evaluation metric as a weighted average of the two. To test DA-CV, we conducted a series of experiments on synthetic species abundance and real above-ground biomass datasets in which the degree of clustering was gradually varied, and compared DA-CV with current CV methods (RDM-CV, SP-CV, and kNNDM). The results showed that DA-CV provided the most accurate evaluations in 85% of scenarios. DA-CV effectively overcomes common limitations of random and spatial CV methods, such as considering only a part of the predictions in the evaluation. This means that DA-CV can provide accurate evaluations for most situations involving clustered samples. The success of DA-CV confirms that considering feature space information is an effective way to improve the evaluation of geospatial ML predictions. |
|---|---|
| ISSN: | 1574-9541 |
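
The grouping and weighting steps described in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation. The summary does not specify the dissimilarity measure, the cut-off, or the weights, so the sketch assumes nearest-neighbour Euclidean distance in covariate space, a hypothetical `threshold`, and weights equal to each group's share of the prediction locations. The two CV scores are placeholder inputs that would come from separate random and spatial CV runs.

```python
import math


def nearest_sample_distance(point, sample_covariates):
    """Distance in covariate space from one prediction location to the
    closest sampled location (assumed dissimilarity measure)."""
    return min(math.dist(point, s) for s in sample_covariates)


def split_by_dissimilarity(pred_covariates, sample_covariates, threshold):
    """Label each prediction location 'similar' or 'different'.

    `threshold` is a hypothetical cut-off, not a value from the article.
    """
    labels = []
    for p in pred_covariates:
        d = nearest_sample_distance(p, sample_covariates)
        labels.append("similar" if d <= threshold else "different")
    return labels


def da_cv_score(labels, random_cv_score, spatial_cv_score):
    """Weighted average of the two CV scores.

    Assumption: each score is weighted by the share of prediction
    locations in the group it is meant to evaluate.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    w_similar = labels.count("similar") / len(labels)
    return w_similar * random_cv_score + (1.0 - w_similar) * spatial_cv_score


# Toy covariates (two features per location), purely illustrative.
samples = [(0.0, 0.0), (1.0, 1.0)]
predictions = [(0.1, 0.0), (0.9, 1.0), (5.0, 5.0), (0.0, 0.2)]

labels = split_by_dissimilarity(predictions, samples, threshold=0.5)

# Placeholder error estimates, e.g. RMSE from random CV and spatial CV.
score = da_cv_score(labels, random_cv_score=2.0, spatial_cv_score=4.0)
print(labels, score)
```

In this toy example three of the four prediction locations lie close to a sample in covariate space, so the random CV estimate receives most of the weight. With more clustered samples, more locations would fall into the "different" group and the spatial CV estimate would dominate.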