Ground truth clustering is not the optimum clustering

Abstract Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard,...

Full description

Saved in:
Bibliographic Details
Main Authors: Lucia Absalom Bautista, Timotej Hrga, Janez Povh, Shudian Zhao
Format: Article
Language:English
Published: Nature Portfolio 2025-03-01
Series:Scientific Reports
Subjects:
Online Access:https://doi.org/10.1038/s41598-025-90865-9
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1849763893411315712
author Lucia Absalom Bautista
Timotej Hrga
Janez Povh
Shudian Zhao
author_facet Lucia Absalom Bautista
Timotej Hrga
Janez Povh
Shudian Zhao
author_sort Lucia Absalom Bautista
collection DOAJ
description Abstract Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard, solvers exist that can compute optimal solutions for small to medium-sized datasets. One such solver is SOS-SDP, a branch-and-bound algorithm based on semidefinite programming. We used it to obtain optimal MSSC solutions (optimum clusterings) for various k across multiple datasets with known ground truth clusterings. We evaluated the alignment between the optimum and ground truth clusterings using six extrinsic measures and assessed their quality using three intrinsic measures. The results reveal that the optimum clusterings often differ significantly from the ground truth clusterings. Additionally, the optimum clusterings frequently outperform the ground truth clusterings, according to the intrinsic measures that we used. However, when ground truth clusters are well-separated convex shapes, such as ellipsoids, the optimum and ground truth clusterings closely align.
format Article
id doaj-art-7a4c602db017460fb9e7733efcd80dfd
institution DOAJ
issn 2045-2322
language English
publishDate 2025-03-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-7a4c602db017460fb9e7733efcd80dfd2025-08-20T03:05:17ZengNature PortfolioScientific Reports2045-23222025-03-0115111710.1038/s41598-025-90865-9Ground truth clustering is not the optimum clusteringLucia Absalom Bautista0Timotej Hrga1Janez Povh2Shudian Zhao3University of SevillaInstitut für Mathematik, Alpen-Adria-Universität KlagenfurtFaculty of Mechanical Engineering, University of LjubljanaDepartment of Mathematics, KTH - Royal Institute of TechnologyAbstract Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard, solvers exist that can compute optimal solutions for small to medium-sized datasets. One such solver is SOS-SDP, a branch-and-bound algorithm based on semidefinite programming. We used it to obtain optimal MSSC solutions (optimum clusterings) for various k across multiple datasets with known ground truth clusterings. We evaluated the alignment between the optimum and ground truth clusterings using six extrinsic measures and assessed their quality using three intrinsic measures. The results reveal that the optimum clusterings often differ significantly from the ground truth clusterings. Additionally, the optimum clusterings frequently outperform the ground truth clusterings, according to the intrinsic measures that we used. However, when ground truth clusters are well-separated convex shapes, such as ellipsoids, the optimum and ground truth clusterings closely align.https://doi.org/10.1038/s41598-025-90865-9Minimum sum-of-squares clusteringGround truth clusteringExtrinsic measuresIntrinsic measures
spellingShingle Lucia Absalom Bautista
Timotej Hrga
Janez Povh
Shudian Zhao
Ground truth clustering is not the optimum clustering
Scientific Reports
Minimum sum-of-squares clustering
Ground truth clustering
Extrinsic measures
Intrinsic measures
title Ground truth clustering is not the optimum clustering
title_full Ground truth clustering is not the optimum clustering
title_fullStr Ground truth clustering is not the optimum clustering
title_full_unstemmed Ground truth clustering is not the optimum clustering
title_short Ground truth clustering is not the optimum clustering
title_sort ground truth clustering is not the optimum clustering
topic Minimum sum-of-squares clustering
Ground truth clustering
Extrinsic measures
Intrinsic measures
url https://doi.org/10.1038/s41598-025-90865-9
work_keys_str_mv AT luciaabsalombautista groundtruthclusteringisnottheoptimumclustering
AT timotejhrga groundtruthclusteringisnottheoptimumclustering
AT janezpovh groundtruthclusteringisnottheoptimumclustering
AT shudianzhao groundtruthclusteringisnottheoptimumclustering