Automated Exploratory Clustering to Democratize Clustering Analysis

Bibliographic Details
Main Authors: Georg Stefan Schlake, Max Pernklau, Christian Beecks
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/12/6876
Description
Summary: AutoML enables many practitioners to use sophisticated machine learning pipelines without being experienced in building application-specific solutions. Adapting AutoML to unsupervised learning, and to clustering in particular, is challenging because clustering is highly subjective and application-specific: the goal is not to find the best way to group data objects based on previously seen examples, but to find interesting new structures within potentially unknown data objects that provide actionable insights. How interesting a clustering is depends on the user and on a variety of characteristics, so different clusterings of the same dataset can each be valid (e.g., grouping people by age, gender, or special interests). In this paper, we propose an Automated Exploratory Clustering framework that automatically determines multiple clusterings satisfying different notions of interestingness. To this end, we generate multiple clusterings via AutoML processes and return a selection from which the user can explore the most preferred ones. We use methods such as the skyline operator to prune clusterings that are not Pareto-optimal with respect to different dimensions of interestingness and deliver a small set of valuable clusterings. In this way, our approach enables practitioners and domain experts to identify valuable clusterings without also becoming experts in clustering, thus reducing the human effort and resources needed to find application-specific solutions. Our empirical investigation with current state-of-the-art methods is carried out on a number of benchmark datasets, where a well-established ground truth can proxy for the wishes of a domain expert and multiple interestingness properties of the clusterings.
ISSN: 2076-3417
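
The summary mentions pruning non-Pareto-optimal clusterings with the skyline operator. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes each candidate clustering has already been scored on several interestingness dimensions where higher is better; the candidate names and score values are hypothetical and made up purely for illustration.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if score vector a Pareto-dominates b.

    a dominates b when a is at least as good on every dimension
    and strictly better on at least one (higher is better).
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def skyline(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
    ]


# Hypothetical candidates, each scored on three unnamed
# interestingness dimensions (illustrative values only).
candidates = {
    "kmeans_k3": (0.80, 0.60, 0.70),
    "kmeans_k5": (0.70, 0.75, 0.65),
    "dbscan_a": (0.60, 0.55, 0.60),  # worse than kmeans_k3 on every dimension
    "agglo_k4": (0.75, 0.60, 0.90),
}

print(skyline(candidates))
```

In this sketch, `dbscan_a` is dropped because `kmeans_k3` is better on every dimension, while the other three remain because each is best on at least one dimension. This pairwise comparison is quadratic in the number of candidates, which is usually acceptable for the small candidate sets an exploratory workflow presents to a user.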