Ground truth clustering is not the optimum clustering
Abstract Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard,...
Saved in:
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: |
Nature Portfolio
2025-03-01
|
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-90865-9 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1849763893411315712 |
|---|---|
| author | Lucia Absalom Bautista Timotej Hrga Janez Povh Shudian Zhao |
| author_facet | Lucia Absalom Bautista Timotej Hrga Janez Povh Shudian Zhao |
| author_sort | Lucia Absalom Bautista |
| collection | DOAJ |
| description | Abstract Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard, solvers exist that can compute optimal solutions for small to medium-sized datasets. One such solver is SOS-SDP, a branch-and-bound algorithm based on semidefinite programming. We used it to obtain optimal MSSC solutions (optimum clusterings) for various k across multiple datasets with known ground truth clusterings. We evaluated the alignment between the optimum and ground truth clusterings using six extrinsic measures and assessed their quality using three intrinsic measures. The results reveal that the optimum clusterings often differ significantly from the ground truth clusterings. Additionally, the optimum clusterings frequently outperform the ground truth clusterings, according to the intrinsic measures that we used. However, when ground truth clusters are well-separated convex shapes, such as ellipsoids, the optimum and ground truth clusterings closely align. |
| format | Article |
| id | doaj-art-7a4c602db017460fb9e7733efcd80dfd |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-7a4c602db017460fb9e7733efcd80dfd2025-08-20T03:05:17ZengNature PortfolioScientific Reports2045-23222025-03-0115111710.1038/s41598-025-90865-9Ground truth clustering is not the optimum clusteringLucia Absalom Bautista0Timotej Hrga1Janez Povh2Shudian Zhao3University of SevillaInstitut für Mathematik, Alpen-Adria-Universität KlagenfurtFaculty of Mechanical Engineering, University of LjubljanaDepartment of Mathematics, KTH - Royal Institute of TechnologyAbstract Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard, solvers exist that can compute optimal solutions for small to medium-sized datasets. One such solver is SOS-SDP, a branch-and-bound algorithm based on semidefinite programming. We used it to obtain optimal MSSC solutions (optimum clusterings) for various k across multiple datasets with known ground truth clusterings. We evaluated the alignment between the optimum and ground truth clusterings using six extrinsic measures and assessed their quality using three intrinsic measures. The results reveal that the optimum clusterings often differ significantly from the ground truth clusterings. Additionally, the optimum clusterings frequently outperform the ground truth clusterings, according to the intrinsic measures that we used. However, when ground truth clusters are well-separated convex shapes, such as ellipsoids, the optimum and ground truth clusterings closely align.https://doi.org/10.1038/s41598-025-90865-9Minimum sum-of-squares clusteringGround truth clusteringExtrinsic measuresIntrinsic measures |
| spellingShingle | Lucia Absalom Bautista Timotej Hrga Janez Povh Shudian Zhao Ground truth clustering is not the optimum clustering Scientific Reports Minimum sum-of-squares clustering Ground truth clustering Extrinsic measures Intrinsic measures |
| title | Ground truth clustering is not the optimum clustering |
| title_full | Ground truth clustering is not the optimum clustering |
| title_fullStr | Ground truth clustering is not the optimum clustering |
| title_full_unstemmed | Ground truth clustering is not the optimum clustering |
| title_short | Ground truth clustering is not the optimum clustering |
| title_sort | ground truth clustering is not the optimum clustering |
| topic | Minimum sum-of-squares clustering Ground truth clustering Extrinsic measures Intrinsic measures |
| url | https://doi.org/10.1038/s41598-025-90865-9 |
| work_keys_str_mv | AT luciaabsalombautista groundtruthclusteringisnottheoptimumclustering AT timotejhrga groundtruthclusteringisnottheoptimumclustering AT janezpovh groundtruthclusteringisnottheoptimumclustering AT shudianzhao groundtruthclusteringisnottheoptimumclustering |