A population spatialization method based on the integration of feature selection and an improved random forest model.
Ascertaining the precise and accurate spatial distribution of population is essential in conducting effective urban planning, resource allocation, and emergency rescue planning. The random forest (RF) model is widely used in population spatialization studies. However, the complexity of population di...
| Main Authors: | Zhen Zhao, Hongmei Guo, Xueli Jiang, Ying Zhang, Changjiang Lu, Can Zhang, Zonghang He |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS) 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0321263 |
| _version_ | 1850151390578475008 |
|---|---|
| author | Zhen Zhao, Hongmei Guo, Xueli Jiang, Ying Zhang, Changjiang Lu, Can Zhang, Zonghang He |
| author_facet | Zhen Zhao, Hongmei Guo, Xueli Jiang, Ying Zhang, Changjiang Lu, Can Zhang, Zonghang He |
| author_sort | Zhen Zhao |
| collection | DOAJ |
| description | An accurate spatial distribution of population is essential for effective urban planning, resource allocation, and emergency rescue planning. The random forest (RF) model is widely used in population spatialization studies. However, the complexity of population distribution characteristics and the limitations of the RF model in processing imbalanced datasets affect population prediction accuracy. To address these issues, a population spatialization model that integrates feature selection with an improved random forest is proposed herein. First, recursive feature elimination with cross-validation (RFECV), the maximum information coefficient (MIC), and mean decrease accuracy (MDA) were used to select population distribution feature factors. Random forest models were constructed from the feature subsets chosen by each method, yielding RFECV-RF, MIC-RF, and MDA-RF. The feature factors of the most accurate model were then taken as the optimal feature subset and used as input data for model construction. Additionally, considering the imbalance in population spatial distribution, the K-means++ clustering algorithm was used to cluster the optimal feature subset, and bootstrap sampling was used to extract the same amount of data from each cluster and fuse it with the training subset to build an improved random forest model. Based on this model, a spatial population distribution dataset of the Southern Sichuan Economic Zone at a 500 m resolution was generated. Finally, the population dataset generated in this study was compared and validated against the WorldPop dataset. The results showed that feature selection improved model accuracy to varying degrees compared with an RF based on all factors, and MDA-RF had the lowest MAPE (0.174) and the highest R² (0.913) among them. Therefore, the feature factors selected by the MDA method were considered the optimal feature subset. Compared with MDA-RF, the prediction accuracy of the improved RF built on the same subset increased by 1.7%, indicating that improving the bootstrap sampling of the random forest with the K-means++ clustering algorithm can enhance model accuracy to some extent. The proposed method was also more accurate than the WorldPop dataset: the MRE and RMSE of the WorldPop dataset were 57.24 and 23174.98, respectively, while those of the proposed method were 25.00 and 15776.50. This implies that the proposed method could simulate population spatial distribution more accurately. |
| format | Article |
| id | doaj-art-aa256f2ebb774d0d99576b80ebdef596 |
| institution | OA Journals |
| issn | 1932-6203 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| spelling | doaj-art-aa256f2ebb774d0d99576b80ebdef596; 2025-08-20T02:26:16Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2025-01-01; 20; 4; e0321263; 10.1371/journal.pone.0321263; A population spatialization method based on the integration of feature selection and an improved random forest model.; Zhen Zhao, Hongmei Guo, Xueli Jiang, Ying Zhang, Changjiang Lu, Can Zhang, Zonghang He; https://doi.org/10.1371/journal.pone.0321263 |
| title | A population spatialization method based on the integration of feature selection and an improved random forest model. |
| title_sort | population spatialization method based on the integration of feature selection and an improved random forest model |
| url | https://doi.org/10.1371/journal.pone.0321263 |
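
The cluster-balanced sampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the toy data, the feature names (`night_light`, `road_density`), the helper names (`kmeans_pp`, `balanced_bootstrap`), and the choice of `k=2` and `per_cluster=30` are all assumptions made for the example. The abstract does not say how the equal-size draws are wired into each tree's bootstrap, so the sketch simply appends them to the training set; fitting the random forest itself is left out.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
            for p in points]


def kmeans_pp(points, k, rng, n_iter=20):
    """Lloyd's k-means with k-means++ seeding; returns one label per point."""
    # Seeding: first centre uniformly at random, each later centre with
    # probability proportional to squared distance from the nearest centre.
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    for _ in range(n_iter):
        labels = assign(points, centres)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign(points, centres)


def balanced_bootstrap(rows, labels, per_cluster, rng):
    """Draw the same number of rows, with replacement, from every cluster."""
    extra = []
    for j in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == j]
        extra.extend(rng.choices(members, k=per_cluster))
    return extra


rng = random.Random(0)
# Hypothetical imbalanced toy data: 90 sparse cells and 10 dense cells.
# Each row is (night_light, road_density, population).
rural = [(rng.uniform(0, 1), rng.uniform(0, 1), rng.uniform(10, 100))
         for _ in range(90)]
urban = [(rng.uniform(8, 9), rng.uniform(8, 9), rng.uniform(5000, 9000))
         for _ in range(10)]
train = rural + urban

# Cluster on the feature columns only, then fuse the equal-size draws
# with the original training subset.
features = [row[:2] for row in train]
labels = kmeans_pp(features, k=2, rng=rng)
extra = balanced_bootstrap(train, labels, per_cluster=30, rng=rng)
augmented = train + extra
```

In this toy set the dense cells make up 10 of 100 training rows before augmentation and 40 of 160 afterwards, which is the kind of rebalancing the abstract credits for the accuracy gain. The `augmented` rows would then be passed to whichever random forest regressor is in use.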