Understanding overfitting in random forest for probability estimation: a visualization and simulation study
Abstract Background Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of r...
| Main Authors: | Lasai Barreñada, Paula Dhiman, Dirk Timmerman, Anne-Laure Boulesteix, Ben Van Calster |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2024-09-01 |
| Series: | Diagnostic and Prognostic Research |
| Subjects: | Random Forest; Prediction modeling; Risk estimation |
| Online Access: | https://doi.org/10.1186/s41512-024-00177-1 |
| _version_ | 1849733010286444544 |
|---|---|
| author | Lasai Barreñada; Paula Dhiman; Dirk Timmerman; Anne-Laure Boulesteix; Ben Van Calster |
| author_sort | Lasai Barreñada |
| collection | DOAJ |
| description | Abstract. Background: Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) a simulation study. Methods: For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000). Results: The visualizations suggested that the model learned “spikes of probability” around events in the training set. A cluster of events created a bigger peak or plateau (signal), isolated events local peaks (noise). In the simulation study, median training AUCs were between 0.97 and 1 unless there were 4 binary predictors or 16 binary predictors with a minimum node size of 20. The median discrimination loss, i.e., the difference between the median test AUC and the true AUC, was 0.025 (range 0.00 to 0.13). Median training AUCs had Spearman correlations of around 0.70 with discrimination loss. Median test AUCs were higher with higher events per variable, higher minimum node size, and binary predictors. Median training calibration slopes were always above 1 and were not correlated with median test slopes across scenarios (Spearman correlation −0.11). Median test slopes were higher with higher true AUC, higher minimum node size, and higher sample size. Conclusions: Random forests learn local probability peaks that often yield near perfect training AUCs without strongly affecting AUCs on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models. |
| format | Article |
| id | doaj-art-e0a3450ab9244d92851909742ecf816b |
| institution | DOAJ |
| issn | 2397-7523 |
| language | English |
| publishDate | 2024-09-01 |
| publisher | BMC |
| record_format | Article |
| series | Diagnostic and Prognostic Research |
| spelling | doaj-art-e0a3450ab9244d92851909742ecf816b; 2025-08-20T03:08:09Z; eng; BMC; Diagnostic and Prognostic Research; 2397-7523; 2024-09-01; 81114; 10.1186/s41512-024-00177-1; Understanding overfitting in random forest for probability estimation: a visualization and simulation study; Lasai Barreñada (0), Paula Dhiman (1), Dirk Timmerman (2), Anne-Laure Boulesteix (3), Ben Van Calster (4); (0) Department of Development and Regeneration; (1) Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford; (2) Department of Development and Regeneration; (3) Biometry in Molecular Medicine, LMU; (4) Department of Development and Regeneration; https://doi.org/10.1186/s41512-024-00177-1; Random Forest; Prediction modeling; Risk estimation |
| title | Understanding overfitting in random forest for probability estimation: a visualization and simulation study |
| topic | Random Forest; Prediction modeling; Risk estimation |
| url | https://doi.org/10.1186/s41512-024-00177-1 |
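The simulation design and the discrimination-loss metric summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the study itself used the ranger R package, and this record does not give the factor levels of the 48 DGMs, so they are only counted here. The names `scenarios`, `auc`, and `discrimination_loss` are hypothetical, and the sign convention (true AUC minus median test AUC, so that a loss is positive) is an assumption chosen to match the positive values reported in the abstract.

```python
from itertools import product
from statistics import median

# Design factors as stated in the abstract. The 48 data-generating
# mechanisms are only enumerated by index (their exact factor levels
# are not given in this record).
N_DGMS = 48
TRAIN_SIZES = (200, 4000)   # training set sizes
MIN_NODE_SIZES = (2, 20)    # random forest minimum node sizes

# Every combination of DGM, training size, and minimum node size.
scenarios = list(product(range(N_DGMS), TRAIN_SIZES, MIN_NODE_SIZES))


def auc(y_true, y_score):
    """Rank-based AUC: the probability that a randomly chosen event has a
    higher score than a randomly chosen non-event (ties count as 1/2)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def discrimination_loss(true_auc, test_aucs):
    """True AUC minus the median test AUC across simulation repetitions.
    The sign convention is an assumption, not taken from the article."""
    return true_auc - median(test_aucs)
```

With these definitions, `len(scenarios)` reproduces the 192 scenarios stated in the abstract (48 × 2 × 2). A call such as `discrimination_loss(0.9, [0.86, 0.875, 0.88])` uses made-up test AUCs purely to show the calculation; the pairwise `auc` is quadratic in sample size and would need a rank-based implementation for test sets as large as the study's N = 100,000.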