Copula Approximate Bayesian Computation Using Distribution Random Forests

Ongoing modern computational advancements continue to make it easier to collect increasingly large and complex datasets, which can often only be realistically analyzed using models defined by intractable likelihood functions. This Stats invited feature article introduces and provi...

Bibliographic Details
Main Author: George Karabatsos
Format: Article
Language: English
Published: MDPI AG 2024-09-01
Series: Stats
Subjects: Bayesian analysis; maximum likelihood; intractable likelihood
Online Access: https://www.mdpi.com/2571-905X/7/3/61
author George Karabatsos
collection DOAJ
description Ongoing modern computational advancements continue to make it easier to collect increasingly large and complex datasets, which can often only be realistically analyzed using models defined by intractable likelihood functions. This Stats invited feature article introduces and provides an extensive simulation study of a new approximate Bayesian computation (ABC) framework for estimating the posterior distribution and the maximum likelihood estimate (MLE) of the parameters of models defined by intractable likelihoods, that unifies and extends previous ABC methods proposed separately. This framework, copulaABCdrf, aims to accurately estimate and describe the possibly skewed and high-dimensional posterior distribution by a novel multivariate copula-based meta-t distribution based on univariate marginal posterior distributions that can be accurately estimated by distribution random forests (drf), while performing automatic summary statistics (covariates) selection, based on robustly estimated copula dependence parameters. The copulaABCdrf framework also provides a novel multivariate mode estimator to perform MLE and posterior mode estimation and an optional step to perform model selection from a given set of models using posterior probabilities estimated by drf. The posterior distribution estimation accuracy of the ABC framework is illustrated and compared with previous standard ABC methods through several simulation studies involving low- and high-dimensional models with computable posterior distributions, which are either unimodal, skewed, or multimodal; and exponential random graph and mechanistic network models, each defined by an intractable likelihood from which it is costly to simulate large network datasets.
This paper also proposes and studies a new solution to the simulation cost problem in ABC involving the posterior estimation of parameters from datasets simulated from the given model that are smaller compared to the potentially large size of the dataset being analyzed. This proposal is motivated by the fact that, for many models defined by intractable likelihoods, such as the network models when they are applied to analyze massive networks, the repeated simulation of large datasets (networks) for posterior-based parameter estimation can be too computationally costly and vastly slow down or prohibit the use of standard ABC methods. The copulaABCdrf framework and standard ABC methods are further illustrated through analyses of large real-life networks of sizes ranging between 28,000 and 65.6 million nodes (between 3 million and 1.8 billion edges), including a large multilayer network with weighted directed edges. The results of the simulation studies show that, in settings where the true posterior distribution is not highly multimodal, copulaABCdrf usually produced similar point estimates from the posterior distribution for low-dimensional parametric models as previous ABC methods, but the copula-based method can produce more accurate estimates from the posterior distribution for high-dimensional models, and, in both dimensionality cases, usually produced more accurate estimates of univariate marginal posterior distributions of parameters. Also, posterior estimation accuracy was usually improved when pre-selecting the important summary statistics using drf compared to ABC employing no pre-selection of the subset of important summaries. For all ABC methods studied, accurate estimation of a highly multimodal posterior distribution was challenging. In light of the results of all the simulation studies, this article concludes by discussing how the copulaABCdrf framework can be improved for future research.
format Article
id doaj-art-7b49f693ec2543828a02b046e0ce9f8a
institution OA Journals
issn 2571-905X
language English
publishDate 2024-09-01
publisher MDPI AG
record_format Article
series Stats
spelling doaj-art-7b49f693ec2543828a02b046e0ce9f8a; 2025-08-20T01:55:52Z; eng; MDPI AG; Stats; 2571-905X; 2024-09-01; vol. 7, no. 3, pp. 1002-1050; 10.3390/stats7030061; Copula Approximate Bayesian Computation Using Distribution Random Forests; George Karabatsos, Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, 1040 W. Harrison St. (MC 147), Chicago, IL 60607, USA; https://www.mdpi.com/2571-905X/7/3/61; Bayesian analysis; maximum likelihood; intractable likelihood
title Copula Approximate Bayesian Computation Using Distribution Random Forests
topic Bayesian analysis
maximum likelihood
intractable likelihood
url https://www.mdpi.com/2571-905X/7/3/61