Non-hierarchic document clustering

Bibliographic Details
Main Authors: Gareth Jones, Alexander M. Robertson, Chawchat Santimetvirul, Peter Willett
Format: Article
Language: English
Published: University of Borås 1995-01-01
Series: Information Research: An International Electronic Journal
Online Access: http://informationr.net/ir/1-1/paper1.html
Description
Summary: Cluster analysis, or automatic classification, is a multivariate statistical technique that seeks to identify groups, or clusters, of similar objects in a multi-dimensional space. There have been many attempts over the years to use such procedures for the organisation of document databases, so that documents with large numbers of index terms in common are grouped together. In this paper, we consider the use of a genetic algorithm, henceforth a GA, for document clustering. GAs are a class of non-deterministic algorithms that derive from Darwinian theories of evolution. They provide good, though not necessarily optimal, solutions to combinatorial optimisation problems, where the number of possible solutions is far too great for all of the possibilities to be explored in a reasonable time by a deterministic algorithm. One such problem is that of non-hierarchic clustering, where the clustering method seeks to partition a set of objects into a set of non-overlapping groups so as to maximise some external criterion of goodness of clustering, typically the extent to which the within-cluster inter-object similarities are maximised and the between-cluster similarities minimised.
ISSN: 1368-1613
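
Illustration: the summary describes a GA searching over partitions of a document set for one that scores well on a within-cluster versus between-cluster similarity criterion. The following is a minimal sketch of that general idea, not the authors' implementation, which this record does not describe. Everything concrete here is an assumption made for illustration: documents are sets of index terms, similarity is the Jaccard coefficient, a chromosome is a list holding one cluster label per document, fitness is mean within-cluster similarity minus mean between-cluster similarity, and the operators are tournament selection, one-point crossover, random-reset mutation and two-individual elitism. The names `jaccard`, `fitness` and `ga_cluster` are hypothetical.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Similarity of two documents given as sets of index terms (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fitness(labels, sims):
    """Mean within-cluster similarity minus mean between-cluster similarity."""
    within, between = [], []
    for (i, j), s in sims.items():
        (within if labels[i] == labels[j] else between).append(s)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(within) - mean(between)


def ga_cluster(docs, k, pop_size=30, generations=60, mutation_rate=0.05, seed=0):
    """Search for a good k-way partition of docs; returns (labels, score)."""
    rng = random.Random(seed)
    n = len(docs)  # needs at least two documents
    # Pairwise similarities are computed once and reused by every evaluation.
    sims = {(i, j): jaccard(docs[i], docs[j])
            for i, j in combinations(range(n), 2)}

    def score(labels):
        return fitness(labels, sims)

    # Chromosome: one cluster label per document, chosen at random to start.
    population = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = population[:2]  # elitism: the two fittest survive unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: the fitter of three random individuals.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            cut = rng.randrange(1, n)  # one-point crossover position
            child = p1[:cut] + p2[cut:]
            # Mutation: occasionally reassign a document to a random cluster.
            child = [rng.randrange(k) if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    return best, score(best)


if __name__ == "__main__":
    # Toy collection: two documents on each of two unrelated topics.
    toy_docs = [{"gene", "dna"}, {"gene", "dna", "cell"},
                {"index", "query"}, {"index", "query", "term"}]
    print(ga_cluster(toy_docs, k=2))
```

Because the search is stochastic, different seeds may return different partitions, which matches the summary's point that a GA gives good but not necessarily optimal solutions. Note also that this label encoding is redundant (swapping the names of two clusters gives a different chromosome for the same partition), which a more careful design would likely address.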