Interpreting Cluster Analysis Outputs – The Data Story Guide

Cluster size and outliers

When a cluster analysis has been successful, the sizes of the clusters in the sample indicate the sizes of the segments in the population. However, where a cluster is small (e.g., containing only a few percent of observations), the correct interpretation is often that it contains outliers. Generally, the appropriate course of action is to examine the cluster means and check that there are no data integrity problems (e.g., whoever set up the data file may have used 99 for Don’t Know rather than coding it as missing data). If there are problems, fix them and re-run the cluster analysis.
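The check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 5% threshold for a "small" cluster, and the toy data are assumptions made for the example, not part of any particular clustering package.

```python
from collections import Counter

def cluster_shares(labels):
    """Return each cluster's share of all observations."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n / total for cluster, n in counts.items()}

def small_clusters(labels, min_share=0.05):
    """List clusters whose share falls below min_share (an assumed threshold)."""
    return sorted(c for c, s in cluster_shares(labels).items() if s < min_share)

def recode_missing(values, sentinel=99):
    """Replace a sentinel code (assumed here to be 99) with None."""
    return [None if v == sentinel else v for v in values]

def cluster_means(values, labels):
    """Mean of one variable per cluster, skipping missing (None) values."""
    sums, counts = {}, {}
    for v, c in zip(values, labels):
        if v is None:
            continue
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

# Toy data: 100 respondents on a 1-5 rating scale, three of whom
# answered Don't Know and were stored as 99.
labels = [0] * 48 + [1] * 49 + [2] * 3
ratings = [2] * 48 + [4] * 49 + [99] * 3

flagged = small_clusters(labels)                  # clusters to inspect
means_raw = cluster_means(ratings, labels)        # cluster 2 has an impossible mean
means_clean = cluster_means(recode_missing(ratings), labels)
```

In this toy case the small cluster has a mean far outside the 1 to 5 scale, which points to a coding problem rather than a genuine segment. Once the sentinel is recoded, the cluster has no valid values left on this variable, and the cluster analysis should be re-run on the cleaned data.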

If there is no obvious data integrity problem, the best course of action is usually one of the following: filter the data to remove the small clusters and then re-run the cluster analysis; manually combine the small cluster with a similar larger cluster; or, if there are multiple small clusters, combine them into an “other” group.
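Two of these options, filtering out small clusters and pooling them into an “other” group, can be sketched as follows. The helper names, the 5% cut-off, and the example labels are hypothetical. Combining a small cluster with a similar larger one is left out because it depends on a judgment about which clusters are similar.

```python
from collections import Counter

def merge_small_clusters(labels, min_share=0.05, other_label="other"):
    """Relabel every cluster below min_share as a single 'other' group."""
    counts = Counter(labels)
    total = len(labels)
    small = {c for c, n in counts.items() if n / total < min_share}
    return [other_label if c in small else c for c in labels]

def drop_small_clusters(rows, labels, min_share=0.05):
    """Keep only the rows that belong to clusters at or above min_share."""
    counts = Counter(labels)
    total = len(labels)
    return [r for r, c in zip(rows, labels) if counts[c] / total >= min_share]

# Toy data: two large clusters and two very small ones.
labels = ["A"] * 60 + ["B"] * 36 + ["C"] * 2 + ["D"] * 2
rows = list(range(len(labels)))  # stand-ins for the observations

merged = merge_small_clusters(labels)
kept_rows = drop_small_clusters(rows, labels)
```

After filtering, the cluster analysis would be re-run on `kept_rows`. After merging, the existing solution is kept and only the labels change.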

Are the clusters statistically significant?

Statistical tests between clusters and clustering variables are, in general, not valid.^[1] This is because the clusters are formed directly from the clustering variables, so a relationship between the clusters and the clustering variables exists by construction, and tests will tend to report differences even when the data contain no real segments. Nevertheless, statistical tests can be useful for highlighting the relative sizes of the differences between the clusters (i.e., they should assist you in determining how to interpret the clusters, but you should not conclude that the differences are “statistically significant”).
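A small simulation illustrates why such tests are not valid. The sketch below draws a single variable from one normal distribution, so there are no real segments, splits it with a deliberately simple one-dimensional two-cluster k-means, and then computes a Welch t-statistic between the two clusters on that same variable. The k-means routine, the sample size, and the random seed are assumptions chosen for the illustration.

```python
import random
import statistics

def two_means_1d(values, iterations=20):
    """Minimal 1-D k-means with k=2; returns a 0/1 label for each value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearer of the two centers.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Move each center to the mean of its current members.
        for k in (0, 1):
            members = [v for v, l in zip(values, labels) if l == k]
            if members:
                centers[k] = statistics.fmean(members)
    return labels

def welch_t(a, b):
    """Welch's t-statistic for the difference between two sample means."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / std_err

rng = random.Random(0)
values = [rng.gauss(0, 1) for _ in range(500)]  # one population, no segments

labels = two_means_1d(values)
low = [v for v, l in zip(values, labels) if l == 0]
high = [v for v, l in zip(values, labels) if l == 1]
t_stat = welch_t(low, high)
```

Because the algorithm places the low values in one cluster and the high values in the other, the t-statistic is very large in absolute value even though the data come from a single population. The apparent significance reflects how the clusters were built, not a genuine difference between segments.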

References

[1] Bock, H-H. (1996): "Probability Models and Hypotheses Testing in Partitioning Cluster Analysis," in Clustering and Classification, ed. by P. Arabie, L. J. Hubert, and G. De Soete. River Edge, New Jersey: World Scientific Publishers, 377-453.