A population spatialization method based on the integration of feature selection and an improved random forest model.

Ascertaining the precise and accurate spatial distribution of population is essential in conducting effective urban planning, resource allocation, and emergency rescue planning. The random forest (RF) model is widely used in population spatialization studies. However, the complexity of population di...

Bibliographic Details
Main Authors: Zhen Zhao, Hongmei Guo, Xueli Jiang, Ying Zhang, Changjiang Lu, Can Zhang, Zonghang He
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0321263
collection DOAJ
description Accurately determining the spatial distribution of population is essential for effective urban planning, resource allocation, and emergency rescue planning. The random forest (RF) model is widely used in population spatialization studies. However, the complexity of population distribution characteristics and the limitations of the RF model in handling imbalanced datasets reduce population prediction accuracy. To address these issues, a population spatialization model that integrates feature selection with an improved random forest is proposed. First, recursive feature elimination with cross-validation (RFECV), the maximum information coefficient (MIC), and mean decrease accuracy (MDA) were used to select population distribution feature factors. A random forest was built on the feature subset chosen by each method, giving the MIC-RF, RFECV-RF, and MDA-RF models. The feature factors of the most accurate model were then taken as the optimal feature subset and used as input data for model construction. Additionally, to account for the imbalance in the spatial distribution of population, the K-means++ clustering algorithm was used to cluster the optimal feature subset, and bootstrap sampling was used to draw the same amount of data from each cluster and fuse it with the training subset to build an improved random forest model. Based on this model, a spatial population distribution dataset of the Southern Sichuan Economic Zone at a 500 m resolution was generated. Finally, the generated population dataset was compared with and validated against the WorldPop dataset. The results showed that feature selection improved model accuracy to varying degrees compared with an RF built on all factors; among the three models, MDA-RF had the lowest MAPE (0.174) and the highest R² (0.913). The feature factors selected by the MDA method were therefore taken as the optimal feature subset. Compared with MDA-RF, the prediction accuracy of the improved RF built on the same subset increased by 1.7%, indicating that modifying the bootstrap sampling of the random forest with the K-means++ clustering algorithm can enhance model accuracy to some extent. The proposed method was also more accurate than the WorldPop dataset: the MRE and RMSE of the WorldPop dataset were 57.24 and 23174.98, respectively, whereas those of the proposed method were 25.00 and 15776.50. This suggests that the proposed method can simulate the spatial distribution of population more accurately.
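The cluster-balanced sampling step named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic "rural" and "urban" cells, the cluster count, and the way the balanced draw is appended to an ordinary bootstrap sample are all assumptions made for illustration. It uses only the Python standard library; in practice a library clusterer (for example scikit-learn's `KMeans` with `init="k-means++"`) would replace the hand-written one.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centres):
    """Label every point with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
            for p in points]


def kmeans_pp(points, k, rng, n_iter=20):
    """Cluster points with k-means++ seeding followed by Lloyd iterations.

    Assumes at least k distinct points; otherwise the seeding weights
    can all be zero and rng.choices raises ValueError.
    """
    # Seeding: each new centre is drawn with probability proportional to
    # its squared distance from the nearest centre chosen so far.
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    for _ in range(n_iter):
        labels = assign(points, centres)
        # Move each centre to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign(points, centres)


def balanced_bootstrap(labels, per_cluster, rng):
    """Draw the same number of row indices, with replacement, from each cluster."""
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)
    sample = []
    for lab in sorted(by_cluster):
        sample.extend(rng.choices(by_cluster[lab], k=per_cluster))
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical imbalanced data: many sparse cells, few dense cells.
    rural = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
    urban = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(10)]
    points = rural + urban

    labels = kmeans_pp(points, k=2, rng=rng)
    # Ordinary bootstrap sample for one tree (indices drawn with replacement).
    plain = rng.choices(range(len(points)), k=len(points))
    # Assumed reading of "fuse": append the cluster-balanced draw to it.
    fused = plain + balanced_bootstrap(labels, per_cluster=20, rng=rng)

    for name, idx in (("plain", plain), ("fused", fused)):
        counts = defaultdict(int)
        for i in idx:
            counts[labels[i]] += 1
        print(name, dict(counts))
```

Comparing the per-cluster counts of the plain and fused samples shows how the balanced draw raises the share of the minority cluster in each tree's training data, which is the effect the description attributes to the improved sampling.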
id doaj-art-aa256f2ebb774d0d99576b80ebdef596
institution OA Journals
issn 1932-6203
spelling doaj-art-aa256f2ebb774d0d99576b80ebdef596; 2025-08-20T02:26:16Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2025-01-01; volume 20, issue 4, e0321263; 10.1371/journal.pone.0321263