Minimal Spanning Tree
All agglomerative clustering algorithms based on the Lance-Williams
equation have the drawback that the full distance matrix has to be calculated
during the cluster analysis. This can take considerable computing power
and requires high amounts of memory. A remedy to this can be found in a
graph-theoretical approach, which is called the minimal spanning tree.
Unfortunately, the minimal spanning tree algorithm can be applied only
to single linkage clustering. However, in the case of large data sets,
this algorithm can be orders of magnitude faster.
A minimal spanning tree is a graph which meets the following requirements:
-
tree (no cycles)
-
spanning (all vertices included)
-
the sum of the weights of the edges is a minimum
If edges reflect the distance between objects, then clusters can be detected
by removing all edges which have a distance greater than dmax.
There is a simple algorithm described by Prim how to determine the
minimal spanning tree. The resulting minimal spanning tree is equal to
the results of a single linkage cluster analysis:
1. select an arbitrary object as the starting vertex of a partial
graph M
2. for all objects which are not yet part of the graph M, find the
nearest neighbor to any vertex of graph M
3. join this nearest neighbor to M
4. repeat with step 2 until all vertices are part of graph M
|