The problematic case of data leakage: A case for leave-profile-out cross-validation in 3-dimensional digital soil mapping

Bibliographic Details
Main Authors: Kingsley John, Daniel D. Saurette, Brandon Heung
Format: Article
Language: English
Published: Elsevier 2025-03-01
Series: Geoderma
Subjects: Accuracy; Machine learning; Pedometrics; Soil properties; Vertical autocorrelation
Online Access: http://www.sciencedirect.com/science/article/pii/S0016706125000618
collection DOAJ
description Data leakage occurs when there is an overlap between the data used for model fitting and hyperparameter tuning, and those used for testing. This overlap biases the model performance, making it uninformative regarding the model’s ability to generalize. This is a significant issue in machine learning and predictive soil mapping, compromising model reliability. To demonstrate this issue, the 3-dimensional (3D) digital soil mapping (DSM) approach, whereby depth is used as a predictor of soil properties, was investigated. We compare two common approaches from the literature: leave-sample-out cross-validation (LSOCV) versus leave-profile-out cross-validation (LPOCV). Here, we argue that LSOCV results in contamination of the test dataset due to the potential vertical autocorrelation of soil properties from different samples within the same profile, and a more appropriate approach for testing 3D DSM models should be to fully partition all soil samples from the same profile to either the training or test dataset (i.e., LPOCV). Using the Ottawa region of Ontario, Canada, as a case study, cation exchange capacity (CEC), clay content, pH, and total organic carbon (TOC) were predicted using machine learning, and the discrepancy in accuracy metrics was reported. Furthermore, we evaluated the effects of data augmentation (i.e., the creation of additional synthetic data points from the original data) on accuracy metrics, a common practice in 3D DSM. Here, it was shown that with the augmented dataset, LSOCV generated overly optimistic accuracy metrics (e.g., CCC) that were 29–62% higher than LPOCV, while for the non-augmented data, the accuracy metrics were 8–18% higher, suggesting that vertical autocorrelation had a strong influence on inflating model accuracy through data leakage. As such, we strongly urge DSM practitioners to provide greater clarity when describing how model accuracy metrics were ascertained and to consider the use of LPOCV when applied to 3D DSM. 
This brings about broader concerns that policymakers and stakeholders may use map products with the false impression that the maps are more accurate than they are. Future research should focus on refining DSM methods and considering data structure to prevent data leakage in modelling soil properties.
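The difference between the two partitioning schemes named in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy profile IDs, and the choice of five folds are assumptions made only for illustration, and the code uses nothing beyond the Python standard library. It shows that sample-wise folds can place samples from one profile on both sides of the split, whereas profile-wise folds cannot.

```python
import random
from collections import defaultdict


def leave_sample_out_folds(n_samples, k, seed=0):
    """Sample-wise k-fold: every depth sample is shuffled on its own (LSOCV-style)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield sorted(train), sorted(test)


def leave_profile_out_folds(profile_ids, k, seed=0):
    """Profile-wise k-fold: all samples of a profile share one fold (LPOCV-style)."""
    by_profile = defaultdict(list)
    for i, pid in enumerate(profile_ids):
        by_profile[pid].append(i)
    profiles = sorted(by_profile)
    random.Random(seed).shuffle(profiles)
    for f in range(k):
        test_profiles = set(profiles[f::k])
        test = [i for p in test_profiles for i in by_profile[p]]
        train = [i for p in profiles if p not in test_profiles
                 for i in by_profile[p]]
        yield sorted(train), sorted(test)


def leaked_profiles(profile_ids, train, test):
    """Profiles that contribute samples to both the training and the test set."""
    return {profile_ids[i] for i in train} & {profile_ids[i] for i in test}


# Toy data (hypothetical): 6 profiles with 4 depth samples each.
profile_ids = [f"P{p}" for p in range(6) for _ in range(4)]

lso_leaks = [leaked_profiles(profile_ids, tr, te)
             for tr, te in leave_sample_out_folds(len(profile_ids), k=5)]
lpo_leaks = [leaked_profiles(profile_ids, tr, te)
             for tr, te in leave_profile_out_folds(profile_ids, k=5)]

print("profiles shared per fold, sample-wise:", [len(s) for s in lso_leaks])
print("profiles shared per fold, profile-wise:", [len(s) for s in lpo_leaks])
```

In practice a group-aware splitter from an existing library (for example scikit-learn's GroupKFold, with the profile ID passed as the group) would likely serve the same purpose. Whatever the tool, the point is that the profile, not the individual sample, is the unit assigned to a fold.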
id doaj-art-3694dfca08d949f984a58dec9e037eb1
institution OA Journals
issn 1872-6259
spelling doaj-art-3694dfca08d949f984a58dec9e037eb1 | 2025-08-20T02:04:38Z | eng | Elsevier | Geoderma | ISSN 1872-6259 | 2025-03-01 | vol. 455 | article 117223 | doi 10.1016/j.geoderma.2025.117223
affiliations Kingsley John: Department of Plant, Food, and Environmental Sciences, Faculty of Agriculture, Dalhousie University, PO Box 550, 21 Cox Rd., Truro, NS B2N 5E3, Canada
Daniel D. Saurette: Ontario Ministry of Agriculture, Food and Agribusiness, 1 Stone Rd W, Guelph, ON N1G 4Y2, Canada
Brandon Heung (corresponding author): Department of Plant, Food, and Environmental Sciences, Faculty of Agriculture, Dalhousie University, PO Box 550, 21 Cox Rd., Truro, NS B2N 5E3, Canada
topic Accuracy
Machine learning
Pedometrics
Soil properties
Vertical autocorrelation