Evaluation of predictive performance of modeling hyperuricemia using medical big data: comparison of data preprocessing methods
Abstract Background Using medical big data from two large-scale populations, a prediction model for continuous variables of raw data and a prediction model for categorical variables after assignment were constructed to evaluate the performance of the two forms of data preprocessing models. Method Pa...
| Main Authors: | Luwei Li, Xian Huang, Cijin Yan, Shuzhan He, Sishuai Cheng, WenJie Yang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-04-01 |
| Series: | Journal of Big Data |
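| Subjects: | Medical big data; Hyperuricemia; Data preprocessing; Continuous variables; Categorized variables; Assignment |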
| Online Access: | https://doi.org/10.1186/s40537-025-01142-5 |
| _version_ | 1849765521701994496 |
|---|---|
| author | Luwei Li Xian Huang Cijin Yan Shuzhan He Sishuai Cheng WenJie Yang |
| author_sort | Luwei Li |
| collection | DOAJ |
| description | Abstract Background: Using medical big data from two large-scale populations, prediction models were built on the continuous variables of the raw data and on the categorical variables obtained after value assignment, in order to compare the performance of the two forms of data preprocessing. Methods: A subset of the population data from the physical examination center of Guilin Medical University Affiliated Hospital, collected from 2017 to 2019, was selected as the modeling group, comprising 22,124 records. Population data from the NHANES database from 1998 to 2018 were selected as the control group, comprising 28,021 records. Logistic regression, LightGBM, and a deep neural network were used to predict hyperuricemia from the continuous variables in the raw data. The continuous variables were then assigned values to convert them into categorical variables, and the same algorithms were applied to obtain the predicted values of the two models. ROC curve, calibration curve, decision curve analysis (DCA), and clinical impact curve (CIC) analyses were performed to comprehensively evaluate the accuracy, discriminatory ability, and clinical practicality of the two models. Results: In the modeling group, logistic regression on the continuous variables, after controlling for confounding factors, found 11 variables significantly associated with the incidence of hyperuricemia; after value assignment, logistic regression on the categorical variables found 9. In the validation set, logistic regression found 8 significant variables in continuous form and 10 in categorical form after assignment. For the logistic regression, LightGBM, and deep neural network models, the AUC values of the ROC curves were higher with continuous variables than with categorical variables. In both the modeling and validation groups, the average deviation between the calibration curve and the standard curve was generally lower for the continuous-variable models than for the categorical-variable models, and the DCA and CIC curves both indicated higher clinical practicality for the continuous-variable models. Conclusion: In the statistical analysis of hyperuricemia medical big data, using the raw continuous variables directly may yield better model performance than using the categorical variables obtained by assignment. However, parameters such as the odds ratio (OR) obtained after assignment may have greater statistical and clinical guidance value. |
| format | Article |
| id | doaj-art-cbf24dd159a44d43b0174c78ff7a0b48 |
| institution | DOAJ |
| issn | 2196-1115 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | SpringerOpen |
| record_format | Article |
| series | Journal of Big Data |
| spelling | doaj-art-cbf24dd159a44d43b0174c78ff7a0b48; 2025-08-20T03:04:50Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2025-04-01; 12; 1; 1; 17; 10.1186/s40537-025-01142-5; Evaluation of predictive performance of modeling hyperuricemia using medical big data: comparison of data preprocessing methods; Luwei Li (Department of Rheumatology and Immunology, Guangxi Hospital Division of The First Affiliated Hospital, Sun Yat-sen University); Xian Huang (Department of Rheumatology and Immunology, Guangxi Hospital Division of The First Affiliated Hospital, Sun Yat-sen University); Cijin Yan (Department of Endocrinology, Guangxi Hospital Division of The First Affiliated Hospital, Sun Yat-sen University); Shuzhan He (Department of Endocrinology, Guangxi Hospital Division of The First Affiliated Hospital, Sun Yat-sen University); Sishuai Cheng (Guilin Medical University); WenJie Yang (Department of Hematology, Guangxi Hospital Division of The First Affiliated Hospital, Sun Yat-sen University); https://doi.org/10.1186/s40537-025-01142-5; Medical big data; Hyperuricemia; Data preprocessing; Continuous variables; Categorized variables; Assignment |
| title | Evaluation of predictive performance of modeling hyperuricemia using medical big data: comparison of data preprocessing methods |
| topic | Medical big data; Hyperuricemia; Data preprocessing; Continuous variables; Categorized variables; Assignment |
| url | https://doi.org/10.1186/s40537-025-01142-5 |
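
The abstract's central comparison, a predictor kept in continuous form versus the same predictor after assignment to categories, can be illustrated with a toy sketch. This is not the authors' pipeline: it fits no logistic regression, LightGBM, or neural network, and instead scores cases directly by the predictor value. The eight measurements, the labels, and the single cutoff of 4 are invented for illustration, and the helper names `auc` and `assign_categories` are hypothetical.

```python
from bisect import bisect_left


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive case outscores
    a random negative case, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def assign_categories(values, cutoffs):
    """Replace each continuous value by the index of its bin.
    `cutoffs` must be sorted; each cutoff is an inclusive upper bound."""
    return [bisect_left(cutoffs, v) for v in values]


# Invented toy data (assumption): one continuous predictor and a binary outcome.
continuous = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

# Hypothetical assignment rule: values <= 4 become category 0, the rest 1.
categorical = assign_categories(continuous, cutoffs=[4])

auc_continuous = auc(continuous, labels)
auc_categorical = auc(categorical, labels)
print(f"continuous AUC:  {auc_continuous}")
print(f"categorical AUC: {auc_categorical}")
```

In this toy case the assignment step collapses distinct values into ties, so cases that the continuous predictor ranks correctly can no longer be separated and the AUC falls. That is one mechanism consistent with the reported result; the sketch does not reproduce the study's data or models.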