optRF: Optimising random forest stability by determining the optimal number of trees

Abstract: Machine learning is frequently used to make decisions based on big data. Among these techniques, random forest is particularly prominent. Although random forest is known to have many advantages, one aspect that is often overlooked is that it is a non-deterministic method that can produce diff...

Bibliographic Details
Main Authors:	Thomas M. Lange, Mehmet Gültas, Armin O. Schmitt, Felix Heinrich
Format:	Article
Language:	English
Published:	BMC 2025-03-01
Series:	BMC Bioinformatics
Subjects:	Parameter optimisation; Random forest; Machine learning; Non-determinism; Decision-making; Genomic selection
Online Access:	https://doi.org/10.1186/s12859-025-06097-1

_version_	1850042762681909248
author	Thomas M. Lange, Mehmet Gültas, Armin O. Schmitt, Felix Heinrich
collection	DOAJ
description	Abstract: Machine learning is frequently used to make decisions based on big data. Among these techniques, random forest is particularly prominent. Although random forest is known to have many advantages, one aspect that is often overlooked is that it is a non-deterministic method that can produce different models using the same input data. This can have severe consequences for decision-making processes. In this study, we introduce a method to quantify the impact of non-determinism on predictions, variable importance estimates, and decisions based on the predictions or variable importance estimates. Our findings demonstrate that increasing the number of trees in random forests enhances the stability in a non-linear way while computation time increases linearly. Consequently, we conclude that there exists an optimal number of trees for any given data set that maximises the stability without unnecessarily increasing the computation time. Based on these findings, we have developed the R package optRF which models the relationship between the number of trees and the stability of random forest, providing recommendations for the optimal number of trees for any given data set.
format	Article
id	doaj-art-0ca83e483eca459ca46526ee451ef7bf
institution	DOAJ
issn	1471-2105
language	English
publishDate	2025-03-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj-art-0ca83e483eca459ca46526ee451ef7bf; 2025-08-20T02:55:28Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2025-03-01; 261121; 10.1186/s12859-025-06097-1; optRF: Optimising random forest stability by determining the optimal number of trees; Thomas M. Lange (Breeding Informatics Group, Georg-August University); Mehmet Gültas (Faculty of Agriculture, South Westphalia University of Applied Sciences); Armin O. Schmitt (Breeding Informatics Group, Georg-August University); Felix Heinrich (Breeding Informatics Group, Georg-August University); Abstract: Machine learning is frequently used to make decisions based on big data. Among these techniques, random forest is particularly prominent. Although random forest is known to have many advantages, one aspect that is often overlooked is that it is a non-deterministic method that can produce different models using the same input data. This can have severe consequences for decision-making processes. In this study, we introduce a method to quantify the impact of non-determinism on predictions, variable importance estimates, and decisions based on the predictions or variable importance estimates. Our findings demonstrate that increasing the number of trees in random forests enhances the stability in a non-linear way while computation time increases linearly. Consequently, we conclude that there exists an optimal number of trees for any given data set that maximises the stability without unnecessarily increasing the computation time. Based on these findings, we have developed the R package optRF which models the relationship between the number of trees and the stability of random forest, providing recommendations for the optimal number of trees for any given data set.; https://doi.org/10.1186/s12859-025-06097-1; Parameter optimisation; Random forest; Machine learning; Non-determinism; Decision-making; Genomic selection
title	optRF: Optimising random forest stability by determining the optimal number of trees
topic	Parameter optimisation; Random forest; Machine learning; Non-determinism; Decision-making; Genomic selection
url	https://doi.org/10.1186/s12859-025-06097-1

Similar Items