Fast agglomerative clustering using approximate traveling salesman solutions

Abstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this...

Full description

Saved in:
Bibliographic Details
Main Authors: Sami Sieranoja, Pasi Fränti
Format: Article
Language:English
Published: SpringerOpen 2025-01-01
Series:Journal of Big Data
Subjects:
Online Access:https://doi.org/10.1186/s40537-024-01053-x
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1832571576422236160
author Sami Sieranoja
Pasi Fränti
author_facet Sami Sieranoja
Pasi Fränti
author_sort Sami Sieranoja
collection DOAJ
description Abstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this work, we propose a fast variant of Ward’s method that reduces the number of distance calculations and only needs O(N) memory. It works by constraining the merges to neighboring nodes on a novel fully connected graph that consists of multiple approximate traveling salesman problem (TSP) solutions. This avoids the problems caused by disconnected graph components that can occur with a k-nearest neighbors graph. The method is general and works for all types of data for which a distance function can be defined. For a dataset of size 1.28 million, the proposed method achieves a speedup factor of 25:1 compared with the best exact implementation of Ward’s method and obtains identical results. The algorithm allows a flexible compromise between clustering quality and running time. For a moderate degradation in quality (NMI 0.90 to 0.80), the speedup factor improves further to 625:1. The proposed TSP-graph can also be used in other applications that require a connected neighborhood graph.
format Article
id doaj-art-3cc7e404b8f04c0b81d61102f4bcf078
institution Kabale University
issn 2196-1115
language English
publishDate 2025-01-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj-art-3cc7e404b8f04c0b81d61102f4bcf0782025-02-02T12:28:28ZengSpringerOpenJournal of Big Data2196-11152025-01-0112111810.1186/s40537-024-01053-xFast agglomerative clustering using approximate traveling salesman solutionsSami Sieranoja0Pasi Fränti1School of Computing, University of Eastern FinlandSchool of Computing, University of Eastern FinlandAbstract Agglomerative clustering, and Ward’s method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N 2) memory for non-vectorial data. In this work, we propose a fast variant of Ward’s method that reduces the number of distance calculations and only needs O(N) memory. It works by constraining the merges to neighboring nodes on a novel fully connected graph that consists of multiple approximate traveling salesman problem (TSP) solutions. This avoids the problems caused by disconnected graph components that can occur with a k-nearest neighbors graph. The method is general and works for all types of data for which a distance function can be defined. For a dataset of size 1.28 million, the proposed method achieves a speedup factor of 25:1 compared with the best exact implementation of Ward’s method and obtains identical results. The algorithm allows a flexible compromise between clustering quality and running time. For a moderate degradation in quality (NMI 0.90 to 0.80), the speedup factor improves further to 625:1. The proposed TSP-graph can also be used in other applications that require a connected neighborhood graph.https://doi.org/10.1186/s40537-024-01053-xData clusteringLarge-scale dataTravelling salesman problem
spellingShingle Sami Sieranoja
Pasi Fränti
Fast agglomerative clustering using approximate traveling salesman solutions
Journal of Big Data
Data clustering
Large-scale data
Travelling salesman problem
title Fast agglomerative clustering using approximate traveling salesman solutions
title_full Fast agglomerative clustering using approximate traveling salesman solutions
title_fullStr Fast agglomerative clustering using approximate traveling salesman solutions
title_full_unstemmed Fast agglomerative clustering using approximate traveling salesman solutions
title_short Fast agglomerative clustering using approximate traveling salesman solutions
title_sort fast agglomerative clustering using approximate traveling salesman solutions
topic Data clustering
Large-scale data
Travelling salesman problem
url https://doi.org/10.1186/s40537-024-01053-x
work_keys_str_mv AT samisieranoja fastagglomerativeclusteringusingapproximatetravelingsalesmansolutions
AT pasifranti fastagglomerativeclusteringusingapproximatetravelingsalesmansolutions