Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and here for more.

Minimal Spanning Tree

All agglomerative clustering algorithms based on the Lance-Williams equation have the drawback that the full distance matrix has to be calculated during the cluster analysis. This can take considerable computing power and requires high amounts of memory. A remedy to this can be found in a graph-theoretical approach, which is called the minimal spanning tree. Unfortunately, the minimal spanning tree algorithm can be applied only to single linkage clustering. However, in the case of large data sets, this algorithm can be orders of magnitude faster.

A minimal spanning tree is a graph which meets the following requirements:

  • tree (no cycles)
  • spanning (all vertices included)
  • the sum of the weights of the edges is a minimum
If edges reflect the distance between objects, then clusters can be detected by removing all edges which have a distance greater than dmax.

There is a simple algorithm described by Prim how to determine the minimal spanning tree. The resulting minimal spanning tree is equal to the results of a single linkage cluster analysis:

1. select an arbitrary object as the starting vertex of a partial graph M
2. for all objects which are not yet part of the graph M, find the nearest neighbor to any vertex of graph M
3. join this nearest neighbor to M
4. repeat with step 2 until all vertices are part of graph M