A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses
Abstract Background Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study aims to introduce a clustering technique specifically designed for datasets containing both continuous and c...
| Main Authors: | Pinyan Liu, Han Yuan, Yilin Ning, Bibhas Chakraborty, Nan Liu, Marco Aurélio Peres |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC 2024-12-01 |
| Series: | BMC Medical Research Methodology |
| Subjects: | Clustering, Distance measure, Feature importance, Mixed type data |
| Online Access: | https://doi.org/10.1186/s12874-024-02427-8 |
| _version_ | 1850253829333843968 |
|---|---|
| author | Pinyan Liu; Han Yuan; Yilin Ning; Bibhas Chakraborty; Nan Liu; Marco Aurélio Peres |
| author_facet | Pinyan Liu; Han Yuan; Yilin Ning; Bibhas Chakraborty; Nan Liu; Marco Aurélio Peres |
| author_sort | Pinyan Liu |
| collection | DOAJ |
| description | Background: Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study aims to introduce a clustering technique specifically designed for datasets containing both continuous and categorical variables to offer better clustering compatibility, adaptability, and interpretability than other mixed type techniques. Methods: This paper proposed a modified Gower distance incorporating feature importance as weights to maintain equal contributions between continuous and categorical features. The algorithm (DAFI) was evaluated using five simulated datasets with varying proportions of important features and real-world datasets from the 2011–2014 National Health and Nutrition Examination Survey (NHANES). Effectiveness was demonstrated through comparisons with 13 clustering techniques. Clustering performance was assessed using the adjusted Rand index (ARI) for accuracy in simulation studies and the silhouette score for cohesion and separation in NHANES. Additionally, multivariable logistic regression estimated the association between periodontitis (PD) and cardiovascular diseases (CVDs), adjusting for clusters in NHANES. Results: In simulation studies, the DAFI-Gower algorithm consistently performs better than baseline methods according to the adjusted Rand index in settings investigated, especially on datasets with more redundant features. In NHANES, 3,760 people were analyzed. DAFI-Gower achieves the highest silhouette score (0.79). Four distinct clusters with diverse health profiles were identified. By incorporating feature importance, we found that cluster formations were more strongly influenced by CVD-related factors. The association between periodontitis and cardiovascular diseases, after adjusting for clusters, reveals significant insights (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012), highlighting severe periodontitis as a potential risk factor for cardiovascular diseases. Conclusions: DAFI performed better than classic clustering baselines on both simulated and real-world datasets. It effectively captures cluster characteristics by considering feature importance, which is crucial in clinical settings where many variables may be similar or irrelevant. We envisage that DAFI offers an effective solution for mixed type clustering. |
| format | Article |
| id | doaj-art-6589c04fed2447699aa9feba854cedc6 |
| institution | OA Journals |
| issn | 1471-2288 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Research Methodology |
| spelling | doaj-art-6589c04fed2447699aa9feba854cedc6; 2025-08-20T01:57:16Z; eng; BMC; BMC Medical Research Methodology; 1471-2288; 2024-12-01; volume 24, issue 1, pages 1–15; 10.1186/s12874-024-02427-8; A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses; Pinyan Liu, Han Yuan, Yilin Ning, Bibhas Chakraborty, Nan Liu, Marco Aurélio Peres (each listed with Centre for Quantitative Medicine, Duke-NUS Medical School); https://doi.org/10.1186/s12874-024-02427-8; Clustering; Distance measure; Feature importance; Mixed type data |
| spellingShingle | Pinyan Liu; Han Yuan; Yilin Ning; Bibhas Chakraborty; Nan Liu; Marco Aurélio Peres; A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses; BMC Medical Research Methodology; Clustering; Distance measure; Feature importance; Mixed type data |
| title | A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses |
| title_sort | modified and weighted gower distance based clustering analysis for mixed type data a simulation and empirical analyses |
| topic | Clustering; Distance measure; Feature importance; Mixed type data |
| url | https://doi.org/10.1186/s12874-024-02427-8 |
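
The description field in this record mentions a Gower distance that uses feature importance as weights. As a rough illustration only, the sketch below computes a generic weighted Gower dissimilarity between two mixed type records: a simple mismatch indicator for categorical features, a range-scaled absolute difference for continuous ones, and a weighted average across features. It is not the DAFI algorithm from the article. The function name `weighted_gower`, its parameters, and the example values are all hypothetical, and the record does not say how the article derives its weights.

```python
from typing import Sequence, Union

Value = Union[float, str]


def weighted_gower(
    a: Sequence[Value],
    b: Sequence[Value],
    is_categorical: Sequence[bool],
    ranges: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Generic weighted Gower dissimilarity between two records.

    Assumption: `weights` stands in for per-feature importance scores;
    `ranges` holds the observed range of each continuous feature and is
    ignored for categorical features.
    """
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    total = 0.0
    for x, y, cat, rng, w in zip(a, b, is_categorical, ranges, weights):
        if cat:
            # Categorical feature: 0 if the levels match, 1 otherwise.
            d = 0.0 if x == y else 1.0
        else:
            # Continuous feature: absolute difference scaled by the range,
            # so that each per-feature dissimilarity lies in [0, 1].
            d = abs(x - y) / rng if rng > 0 else 0.0
        total += w * d

    # Weighted average of the per-feature dissimilarities.
    return total / weight_sum


# Hypothetical two-feature records: (age in years, smoking status).
record_a = (50.0, "yes")
record_b = (60.0, "no")
is_cat = (False, True)
feature_ranges = (40.0, 0.0)

equal = weighted_gower(record_a, record_b, is_cat, feature_ranges, (1.0, 1.0))
weighted = weighted_gower(record_a, record_b, is_cat, feature_ranges, (3.0, 1.0))
print(equal, weighted)
```

With equal weights the categorical mismatch dominates the result. Giving the continuous feature three times the weight pulls the dissimilarity toward its smaller scaled difference, which is the general effect that importance weighting is meant to have.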