A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses

Abstract Background Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study aims to introduce a clustering technique specifically designed for datasets containing both continuous and c...

Bibliographic Details
Main Authors: Pinyan Liu, Han Yuan, Yilin Ning, Bibhas Chakraborty, Nan Liu, Marco Aurélio Peres
Format: Article
Language: English
Published: BMC 2024-12-01
Series: BMC Medical Research Methodology
Subjects: Clustering; Distance measure; Feature importance; Mixed type data
Online Access: https://doi.org/10.1186/s12874-024-02427-8
collection DOAJ
description Abstract
Background: Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study aims to introduce a clustering technique specifically designed for datasets containing both continuous and categorical variables to offer better clustering compatibility, adaptability, and interpretability than other mixed type techniques.
Methods: This paper proposed a modified Gower distance incorporating feature importance as weights to maintain equal contributions between continuous and categorical features. The algorithm (DAFI) was evaluated using five simulated datasets with varying proportions of important features and real-world datasets from the 2011–2014 National Health and Nutrition Examination Survey (NHANES). Effectiveness was demonstrated through comparisons with 13 clustering techniques. Clustering performance was assessed using the adjusted Rand index (ARI) for accuracy in simulation studies and the silhouette score for cohesion and separation in NHANES. Additionally, multivariable logistic regression estimated the association between periodontitis (PD) and cardiovascular diseases (CVDs), adjusting for clusters in NHANES.
Results: In simulation studies, the DAFI-Gower algorithm consistently performs better than baseline methods according to the adjusted Rand index in settings investigated, especially on datasets with more redundant features. In NHANES, 3,760 people were analyzed. DAFI-Gower achieves the highest silhouette score (0.79). Four distinct clusters with diverse health profiles were identified. By incorporating feature importance, we found that cluster formations were more strongly influenced by CVD-related factors. The association between periodontitis and cardiovascular diseases, after adjusting for clusters, reveals significant insights (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012), highlighting severe periodontitis as a potential risk factor for cardiovascular diseases.
Conclusions: DAFI performed better than classic clustering baselines on both simulated and real-world datasets. It effectively captures cluster characteristics by considering feature importance, which is crucial in clinical settings where many variables may be similar or irrelevant. We envisage that DAFI offers an effective solution for mixed type clustering.
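The abstract describes a Gower distance in which feature importance acts as per-feature weights. As a rough illustration only, the sketch below computes a generic weighted Gower distance between two mixed-type records. It is not the DAFI algorithm itself: the abstract does not say how the importance weights are derived or how the distance is modified, so the function name `weighted_gower`, the weights, the ranges, and the example records are all hypothetical.

```python
def weighted_gower(a, b, is_categorical, ranges, weights):
    """Generic weighted Gower distance between two mixed-type records.

    Assumption: weights stand in for feature importance scores; how DAFI
    actually obtains them is not stated in the abstract.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, cat, rng, w in zip(a, b, is_categorical, ranges, weights):
        if cat:
            # Categorical feature: simple mismatch (0 if equal, 1 otherwise).
            d = 0.0 if x == y else 1.0
        else:
            # Continuous feature: absolute difference scaled by the feature range.
            d = abs(x - y) / rng if rng and rng > 0 else 0.0
        weighted_sum += w * d
        weight_total += w
    # Normalise by the total weight so the result stays in [0, 1].
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Hypothetical records: (age, BMI, smoker)
person_a = (50, 25.0, "yes")
person_b = (60, 30.0, "no")
distance = weighted_gower(
    person_a,
    person_b,
    is_categorical=(False, False, True),
    ranges=(40.0, 20.0, None),   # assumed observed ranges for age and BMI
    weights=(0.5, 0.3, 0.2),     # assumed importance weights
)
print(round(distance, 3))
```

A pairwise matrix of such distances could then be passed to any clustering method that accepts precomputed dissimilarities; which method the paper pairs with it is not stated in this record.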
id doaj-art-6589c04fed2447699aa9feba854cedc6
institution OA Journals
issn 1471-2288
publisher BMC
spelling doaj-art-6589c04fed2447699aa9feba854cedc6, 2025-08-20T01:57:16Z, eng
BMC Medical Research Methodology (BMC), ISSN 1471-2288, 2024-12-01, volume 24, issue 1, pages 1–15, doi 10.1186/s12874-024-02427-8
Author affiliations: Pinyan Liu, Han Yuan, Yilin Ning, Bibhas Chakraborty, Nan Liu, Marco Aurélio Peres: Centre for Quantitative Medicine, Duke-NUS Medical School
topic Clustering
Distance measure
Feature importance
Mixed type data