Understanding overfitting in random forest for probability estimation: a visualization and simulation study

Bibliographic Details
Main Authors: Lasai Barreñada, Paula Dhiman, Dirk Timmerman, Anne-Laure Boulesteix, Ben Van Calster
Format: Article
Language: English
Published: BMC 2024-09-01
Series: Diagnostic and Prognostic Research
Subjects: Random Forest; Prediction modeling; Risk estimation
Online Access: https://doi.org/10.1186/s41512-024-00177-1
author Lasai Barreñada
Paula Dhiman
Dirk Timmerman
Anne-Laure Boulesteix
Ben Van Calster
collection DOAJ
description Abstract
Background: Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) a simulation study.
Methods: For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000).
Results: The visualizations suggested that the model learned “spikes of probability” around events in the training set. A cluster of events created a bigger peak or plateau (signal), isolated events local peaks (noise). In the simulation study, median training AUCs were between 0.97 and 1 unless there were 4 binary predictors or 16 binary predictors with a minimum node size of 20. The median discrimination loss, i.e., the difference between the median test AUC and the true AUC, was 0.025 (range 0.00 to 0.13). Median training AUCs had Spearman correlations of around 0.70 with discrimination loss. Median test AUCs were higher with higher events per variable, higher minimum node size, and binary predictors. Median training calibration slopes were always above 1 and were not correlated with median test slopes across scenarios (Spearman correlation −0.11). Median test slopes were higher with higher true AUC, higher minimum node size, and higher sample size.
Conclusions: Random forests learn local probability peaks that often yield near perfect training AUCs without strongly affecting AUCs on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.
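The scenario count and the discrimination measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' code (the study used the ranger R package); the function names `auc` and `discrimination_loss`, the toy scores, and the sign convention of the loss (true AUC minus median test AUC) are assumptions made here for illustration only.

```python
from itertools import product


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC: the share of event/non-event pairs
    in which the event has the higher score, counting ties as one half."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in non_events)
    return wins / (len(events) * len(non_events))


def discrimination_loss(true_auc, median_test_auc):
    # Assumed sign convention: positive means the model falls short of the true AUC.
    return true_auc - median_test_auc


# Scenario grid as stated in the abstract:
# 48 DGMs x 2 training set sizes x 2 minimum node sizes.
n_dgm = 48
train_sizes = (200, 4000)
min_node_sizes = (2, 20)
scenarios = list(product(range(n_dgm), train_sizes, min_node_sizes))

# Hypothetical toy data: four predicted risks with their observed outcomes.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)
```

Here `len(scenarios)` reproduces the 192 scenarios reported in the abstract, and `auc` can be applied separately to training and test predictions to compare the two, which is the contrast the study examines.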
format Article
id doaj-art-e0a3450ab9244d92851909742ecf816b
institution DOAJ
issn 2397-7523
language English
publishDate 2024-09-01
publisher BMC
record_format Article
series Diagnostic and Prognostic Research
spelling doaj-art-e0a3450ab9244d92851909742ecf816b; 2025-08-20T03:08:09Z; eng; BMC; Diagnostic and Prognostic Research; ISSN 2397-7523; 2024-09-01; volume 8, issue 1, pages 1-14; 10.1186/s41512-024-00177-1
Understanding overfitting in random forest for probability estimation: a visualization and simulation study
Lasai Barreñada (Department of Development and Regeneration)
Paula Dhiman (Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford)
Dirk Timmerman (Department of Development and Regeneration)
Anne-Laure Boulesteix (Biometry in Molecular Medicine, LMU)
Ben Van Calster (Department of Development and Regeneration)
https://doi.org/10.1186/s41512-024-00177-1
Random Forest; Prediction modeling; Risk estimation
title Understanding overfitting in random forest for probability estimation: a visualization and simulation study
topic Random Forest
Prediction modeling
Risk estimation
url https://doi.org/10.1186/s41512-024-00177-1