Model Selection from Multiple Model Families in Species Distribution Modeling Using Minimum Message Length

Species distribution modeling is fundamental to biodiversity, evolution, conservation science, and the study of invasive species. Given environmental data and species distribution data, model selection techniques are frequently used to help identify relevant features. Existing studies aim to find th...

Bibliographic Details
Main Authors: Zihao Wen, David L. Dowe
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: Entropy
Subjects: minimum message length; model selection; species distribution modeling
Online Access: https://www.mdpi.com/1099-4300/27/1/6
author Zihao Wen
David L. Dowe
collection DOAJ
description Species distribution modeling is fundamental to biodiversity, evolution, conservation science, and the study of invasive species. Given environmental data and species distribution data, model selection techniques are frequently used to help identify relevant features. Existing studies aim to find the relevant features by selecting the best models using different criteria, and they deem the predictors in the best models as the relevant features. However, they mostly consider only a given model family, making them vulnerable to model family misspecification. To address this issue, this paper introduces the Bayesian information-theoretic minimum message length (MML) principle to species distribution model selection. In particular, we provide a framework that allows the message length of models from multiple model families to be calculated and compared, and by doing so, the model selection is both accurate and robust against model family misspecification and data aggregation. To find the relevant features efficiently, we further develop a novel search algorithm that does not require calculating the message length for all possible subsets of features. Experimental results demonstrate that our proposed method outperforms competing methods by selecting the best models on both artificial and real-world datasets. More specifically, there was one test on artificial data that all methods got wrong. On the other 10 tests on artificial data, the MML method got everything correct, but the alternative methods all failed on a variety of tests. Our real-world data pertained to two plant species from Barro Colorado Island, Panama. Compared to the alternative methods, for both the plant species, the MML method selects the simplest model while also having the overall best predictions.
format Article
id doaj-art-bdf6e5fc17f84a9bbe09af94fbbbee5b
institution Kabale University
issn 1099-4300
language English
publishDate 2024-12-01
publisher MDPI AG
series Entropy
spelling doaj-art-bdf6e5fc17f84a9bbe09af94fbbbee5b; 2025-01-24T13:31:38Z; eng; MDPI AG; Entropy; ISSN 1099-4300; 2024-12-01; vol. 27, no. 1, art. 6; DOI 10.3390/e27010006; Model Selection from Multiple Model Families in Species Distribution Modeling Using Minimum Message Length; Zihao Wen (College of Mathematics and Informatics, South China Agricultural University, No. 483, Wushan Road, Tianhe District, Guangzhou 510642, China); David L. Dowe (Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia); https://www.mdpi.com/1099-4300/27/1/6; minimum message length; model selection; species distribution modeling
title Model Selection from Multiple Model Families in Species Distribution Modeling Using Minimum Message Length
topic minimum message length
model selection
species distribution modeling
url https://www.mdpi.com/1099-4300/27/1/6