Data Preparation for Cluster-Based Segmentation – The Data Story Guide

When using cluster-based segmentation to form segments, there are a variety of forms of data preparation that can (in some situations) assist in the forming of segments.

Variable Transformations

There are two basic types of transformation that are relevant:

Changing the range of a variable, which is known as variable standardization, and is discussed in the next section; and
Changing the shape of the distribution, which is discussed in this section.

Transforming variables to modify the shape of the distribution prior to cluster analysis is motivated by the same basic concerns as in other areas of statistics: extreme departures from normality can cause the resulting analyses to be misleading. When transforming variables to modify the shape of the distribution, the key is to identify and remove long tails from variables (i.e., small numbers of respondents with values substantially higher or lower than the average). Techniques for achieving this include:

Taking the log of the variables.
Taking the square root of the variables.
Winsorizing the variables.
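
The three techniques listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended implementation: the function names, the example data and the percentile cut-offs are all assumptions made for the example, and the percentile rule used for Winsorizing is a simple floor-index rule (statistical packages use a variety of definitions).

```python
import math

def log_transform(values):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(v) for v in values]

def sqrt_transform(values):
    """Square root of each value; assumes every value is non-negative."""
    return [math.sqrt(v) for v in values]

def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values outside the chosen percentiles to those percentiles.

    The percentiles are found with a simple floor-index rule on the
    sorted data (an assumption; other definitions exist).
    """
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]

# Hypothetical spend data with one respondent in a long upper tail.
spend = [10, 12, 15, 11, 14, 13, 12, 16, 11, 400]
trimmed = winsorize(spend, 0.0, 0.9)
print(trimmed)  # -> [10, 12, 15, 11, 14, 13, 12, 16, 11, 16]
```

Note that Winsorizing keeps the extreme respondent in the data but pulls their value in to the cut-off, whereas the log and square root change every value and compress the upper tail most.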

Variable Standardization

The scale of variables determines their ‘importance’ in the cluster analysis. The larger the scale of a variable (that is, the larger its standard deviation or the greater the range between its smallest and largest values), the greater the discrimination between the clusters on this variable and the smaller the difference between the clusters on all other variables in the analysis.

To rectify these differences across variables, you can rescale or standardize the data by changing the range or standard deviations of each of the input variables. Most academic studies recommend the use of a unit range, which is a fancy way of saying that each variable should have the same minimum and the same maximum. Generally, most market research data used in cluster analysis (e.g., attitude scales) are automatically set up this way. Practitioners often favour scaling the variables so that they all have a standard deviation of 1 (this is sometimes referred to as normalizing).

This involves multiplying each variable by a constant, such that its range (and variance) changes. The logic of this is only applicable to cluster analysis and self-organizing maps, as these algorithms implicitly weight variables according to their range (and variance). See Variable Standardization for more information.
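
As a rough sketch of the two rescalings described above, the following Python uses only the standard library. The function names and the example data are assumptions made for illustration. Unit-range scaling also subtracts the minimum, which shifts the variable but has no effect on distances between respondents; both functions assume the variable is not constant.

```python
import statistics

def unit_range(values):
    """Rescale so that the minimum is 0 and the maximum is 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def unit_sd(values):
    """Multiply by the constant 1/sd so the sample standard deviation is 1."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]

# Hypothetical variables on very different scales.
income = [20000, 40000, 60000]
rating = [1, 3, 5]

# After unit-range scaling the two variables span the same interval,
# so neither dominates a distance-based cluster analysis.
income_scaled = unit_range(income)
rating_scaled = unit_range(rating)

# After unit-sd scaling the variable has a standard deviation of 1.
rating_sd_scaled = unit_sd(rating)
```

Without rescaling, differences in income (in the tens of thousands) would swamp differences in the rating (at most 4) in any Euclidean distance calculation.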

Variable Selection and Weighting

When attempting to discover natural clusters (that is, clusters which are broadly homogeneous with large gaps between them), it is beneficial to weight variables, as inevitably some variables will contain more information about the clusters than others. The most extreme form of variable weighting is to exclude certain variables. Various algorithms have been developed for automatically weighting data, but none are widely used in market segmentation (presumably in part because in market research the interest is generally in finding segments with useful strategic implications rather than finding "natural" but uninteresting segments).

In addition to searching for natural clusters, the weighting of variables is often a useful way of avoiding the problem of a segmentation identifying unhelpful segments (e.g., if using a combination of behavioral and attitudinal data, sometimes the segments are formed entirely using the attitudinal data, with the behavioral data being ignored). Increasing the relative weighting of the behavioral data can increase the extent to which they differ between the segments.

There are a number of alternative approaches to weighting:

Including variables in the analysis multiple times.
Changing the range of a variable. The greater a variable's range, the greater its potential impact in the segmentation. (Note that some latent class algorithms will automatically model differences in the variance of the variables which will cause this form of weighting to have no impact.)
Tandem Clustering.
Modifying the algorithm used to form the segments to explicitly take into account the desired weight of different variables. This approach is available in Q and Displayr.
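
The first two approaches in the list above are closely related when the segmentation is based on Euclidean distance (as in k-means). The sketch below, which uses hypothetical data and function names, illustrates that including a variable twice gives the same distances as multiplying that variable by the square root of 2. It assumes squared differences are simply summed across variables; algorithms that rescale or model the variance of each variable will behave differently.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two respondents."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def apply_weights(row, weights):
    """Multiply each variable by sqrt(weight), so that its squared
    difference is multiplied by the weight."""
    return [x * math.sqrt(w) for x, w in zip(row, weights)]

# Two hypothetical respondents measured on [attitude, behavior].
r1 = [2.0, 1.0]
r2 = [5.0, 3.0]

# Approach 1: include the behavioral variable a second time.
dup1 = r1 + [r1[1]]
dup2 = r2 + [r2[1]]
d_duplicated = euclidean(dup1, dup2)

# Approach 2: stretch the behavioral variable (a weight of 2).
w1 = apply_weights(r1, [1, 2])
w2 = apply_weights(r2, [1, 2])
d_weighted = euclidean(w1, w2)

# Both distances equal sqrt(9 + 4 + 4) = sqrt(17), up to rounding.
print(d_duplicated, d_weighted)
```

Changing the range is the more flexible of the two, as it allows fractional weights, whereas duplication only allows whole-number multiples.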

Dimension Reduction

Principal components analysis and correspondence analysis can be used to reduce the number of dimensions prior to running cluster analysis. Multiple correspondence analysis has the added benefit that it can be used to turn categorical variables into numeric variables (which are thus consistent with the assumptions of cluster analysis).

Although this approach, which is known as tandem clustering, is popular in industry, it is by no means guaranteed to result in a superior cluster analysis, as:

The dimension reduction creates variables that reflect the strongest patterns in the data. In doing this, some of the variance is removed from the data. This variance may be important and a better segmentation may result if it is left in the data prior to the cluster analysis.
The dimension reduction increases the focus of the cluster analysis on variables that are not highly correlated with the other variables. That is, it down-weights the strongest pattern in the data.

Within-Respondent Scaling

Within-respondent scaling is done when it is believed that respondents are differently biased in the way they answer questions. For example, commonly there are some respondents who give systematically higher answers than others. If wanting to correct for this (and it is not always the case that one should), the usual practice is to rescale each respondent’s data so that it has a mean of 0 and a standard deviation of 1. The logic of this is to remove response bias from the data.
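
A minimal sketch of this row-wise rescaling is shown below, using only the Python standard library. The function name and the ratings are hypothetical. Returning zeros for a respondent who gives the same answer to every question is an assumption made for the example, as the standard deviation is then 0 and the rescaling is undefined.

```python
import statistics

def standardize_within_respondent(row):
    """Rescale one respondent's answers to mean 0 and standard deviation 1."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)
    if sd == 0:
        # Assumption: treat a respondent with identical answers as all zeros.
        return [0.0 for _ in row]
    return [(v - mean) / sd for v in row]

# Two hypothetical respondents with the same pattern of answers,
# where the first answers 3 points higher on every question.
ratings = [
    [5.0, 4.0, 5.0, 3.0],
    [2.0, 1.0, 2.0, 0.0],
]

scaled = [standardize_within_respondent(row) for row in ratings]

# After scaling, the two respondents have the same values (up to
# rounding), so a cluster analysis would treat them as alike.
for row in scaled:
    print([round(v, 3) for v in row])
```

The example also shows why the correction is not always appropriate: if the first respondent genuinely holds more favorable views, that difference is removed along with any response bias.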

Outlier Removal

Outlier removal involves removing extreme observations which are considered likely to distort any segmentation.

A variety of methods have been developed, including using hierarchical cluster analysis to identify “singleton” clusters (i.e., clusters that contain only a single observation), variants of k-means which automatically remove outliers, and robust variants of cluster analysis (e.g., k-medoids). It is not clear that any of these approaches is helpful with cluster analysis and self-organizing maps, as both techniques are extremely good at automatically identifying outliers (i.e., where outliers exist they are quickly identified in the form of small segments, which can be filtered out and the cluster analysis re-run).

With latent class models that involve mixtures of generalized linear models, such as latent class logit, outliers are less likely to be automatically discovered, as mixtures of generalized linear models suffer from the same sensitivity to outliers as regression. In theory, all the traditional tools for identifying outliers in regression can be applied to mixtures of generalized linear models. In practice, this is rarely done (presumably because the tools are not easy to implement due to the complexity of the models).