Fast agglomerative clustering using approximate traveling salesman solutions
Abstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this...
Saved in:
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: |
SpringerOpen
2025-01-01
|
Series: | Journal of Big Data |
Subjects: | |
Online Access: | https://doi.org/10.1186/s40537-024-01053-x |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Abstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this work, we propose a fast variant of Ward’s method that reduces the number of distance calculations and only needs O(N) memory. It works by constraining the merges to neighboring nodes on a novel fully connected graph that consists of multiple approximate traveling salesman problem (TSP) solutions. This avoids the problems caused by disconnected graph components that can occur with a k-nearest neighbors graph. The method is general and works for all types of data for which a distance function can be defined. For a dataset of size 1.28 million, the proposed method achieves a speedup factor of 25:1 compared with the best exact implementation of Ward’s method and obtains identical results. The algorithm allows a flexible compromise between clustering quality and running time. For a moderate degradation in quality (NMI 0.90 to 0.80), the speedup factor improves further to 625:1. The proposed TSP-graph can also be used in other applications that require a connected neighborhood graph. |
---|---|
ISSN: | 2196-1115 |