Fast agglomerative clustering using approximate traveling salesman solutions

Abstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this...

Full description

Saved in:

Bibliographic Details
Main Authors:	Sami Sieranoja, Pasi Fränti
Format:	Article
Language:	English
Published:	SpringerOpen 2025-01-01
Series:	Journal of Big Data
Subjects:	Data clustering Large-scale data Travelling salesman problem
Online Access:	https://doi.org/10.1186/s40537-024-01053-x
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1832571576422236160
author	Sami Sieranoja Pasi Fränti
author_facet	Sami Sieranoja Pasi Fränti
author_sort	Sami Sieranoja
collection	DOAJ
description	Abstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this work, we propose a fast variant of Ward’s method that reduces the number of distance calculations and only needs O(N) memory. It works by constraining the merges to neighboring nodes on a novel fully connected graph that consists of multiple approximate traveling salesman problem (TSP) solutions. This avoids the problems caused by disconnected graph components that can occur with a k-nearest neighbors graph. The method is general and works for all types of data for which a distance function can be defined. For a dataset of size 1.28 million, the proposed method achieves a speedup factor of 25:1 compared with the best exact implementation of Ward’s method and obtains identical results. The algorithm allows a flexible compromise between clustering quality and running time. For a moderate degradation in quality (NMI 0.90 to 0.80), the speedup factor improves further to 625:1. The proposed TSP-graph can also be used in other applications that require a connected neighborhood graph.
format	Article
id	doaj-art-3cc7e404b8f04c0b81d61102f4bcf078
institution	Kabale University
issn	2196-1115
language	English
publishDate	2025-01-01
publisher	SpringerOpen
record_format	Article
series	Journal of Big Data
spelling	doaj-art-3cc7e404b8f04c0b81d61102f4bcf0782025-02-02T12:28:28ZengSpringerOpenJournal of Big Data2196-11152025-01-0112111810.1186/s40537-024-01053-xFast agglomerative clustering using approximate traveling salesman solutionsSami Sieranoja0Pasi Fränti1School of Computing, University of Eastern FinlandSchool of Computing, University of Eastern FinlandAbstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this work, we propose a fast variant of Ward’s method that reduces the number of distance calculations and only needs O(N) memory. It works by constraining the merges to neighboring nodes on a novel fully connected graph that consists of multiple approximate traveling salesman problem (TSP) solutions. This avoids the problems caused by disconnected graph components that can occur with a k-nearest neighbors graph. The method is general and works for all types of data for which a distance function can be defined. For a dataset of size 1.28 million, the proposed method achieves a speedup factor of 25:1 compared with the best exact implementation of Ward’s method and obtains identical results. The algorithm allows a flexible compromise between clustering quality and running time. For a moderate degradation in quality (NMI 0.90 to 0.80), the speedup factor improves further to 625:1. The proposed TSP-graph can also be used in other applications that require a connected neighborhood graph.https://doi.org/10.1186/s40537-024-01053-xData clusteringLarge-scale dataTravelling salesman problem
spellingShingle	Sami Sieranoja Pasi Fränti Fast agglomerative clustering using approximate traveling salesman solutions Journal of Big Data Data clustering Large-scale data Travelling salesman problem
title	Fast agglomerative clustering using approximate traveling salesman solutions
title_full	Fast agglomerative clustering using approximate traveling salesman solutions
title_fullStr	Fast agglomerative clustering using approximate traveling salesman solutions
title_full_unstemmed	Fast agglomerative clustering using approximate traveling salesman solutions
title_short	Fast agglomerative clustering using approximate traveling salesman solutions
title_sort	fast agglomerative clustering using approximate traveling salesman solutions
topic	Data clustering Large-scale data Travelling salesman problem
url	https://doi.org/10.1186/s40537-024-01053-x
work_keys_str_mv	AT samisieranoja fastagglomerativeclusteringusingapproximatetravelingsalesmansolutions AT pasifranti fastagglomerativeclusteringusingapproximatetravelingsalesmansolutions

Fast agglomerative clustering using approximate traveling salesman solutions

Similar Items