Merging Categories – The Data Story Guide

A key part of data tidying is to merge together categories. Common approaches to merging categories are:

Combining the smallest categories together.
Combining adjacent categories.
Applying standard merges (e.g., top 2 boxes).
Combining categories that are similar with respect to other data.

The way that data is merged varies by software.

Combining the smallest categories together

The table below on the left shows providers of different cell phones. The table on the right shows the smaller providers merged into an Other category.

Sometimes such merging of small categories is performed automatically, with brands that have a share of less than, say, 5% or 2% merged together.
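A minimal sketch of this kind of automatic merging, assuming the data is available as a simple mapping from category to count. The function name, the provider names, and the counts are all invented for illustration; the 5% threshold is one of the example values mentioned above.

```python
from collections import Counter

def merge_small_categories(counts, threshold=0.05, other_label="Other"):
    """Fold every category whose share of the total is below `threshold`
    into a single `other_label` category."""
    total = sum(counts.values())
    merged = Counter()
    for category, n in counts.items():
        share = n / total
        key = category if share >= threshold else other_label
        merged[key] += n
    return dict(merged)

# Hypothetical counts, not taken from the tables in this guide
provider_counts = {
    "Provider A": 450,
    "Provider B": 300,
    "Provider C": 200,
    "Provider D": 30,
    "Provider E": 20,
}

merged_counts = merge_small_categories(provider_counts, threshold=0.05)
print(merged_counts)
```

Providers D and E each hold less than 5% of the 1,000 responses, so they end up together in Other, and the total count is unchanged.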

Combining adjacent categories

Respondents to a survey were shown the description above and asked to rate how well it fits with the Apple brand. The results are shown below on the left. There are so many small categories that the data is a bit overwhelming. A simple solution is to merge adjacent categories together, as shown on the right.
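As a rough sketch, merging adjacent categories amounts to summing counts over contiguous bands of an ordered scale. The 10-point scale, the counts, and the band boundaries below are assumptions made for illustration; they are not the survey data described above.

```python
def merge_adjacent(counts_by_score, bands):
    """Sum counts over each (label, low, high) band of an ordered scale.
    `low` and `high` are inclusive."""
    merged = {}
    for label, low, high in bands:
        merged[label] = sum(
            n for score, n in counts_by_score.items() if low <= score <= high
        )
    return merged

# Hypothetical counts for a 1-10 rating scale
ratings = {1: 2, 2: 3, 3: 5, 4: 10, 5: 20, 6: 25, 7: 15, 8: 10, 9: 6, 10: 4}

# Hypothetical bands; where to place the cut points is a judgment call
bands = [("1-4", 1, 4), ("5-6", 5, 6), ("7-10", 7, 10)]

merged_ratings = merge_adjacent(ratings, bands)
print(merged_ratings)
```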

Applying standard merges (e.g., top 2 boxes)

In many fields, there are standard ways of combining categories of categorical variables. For example, in the United Kingdom, statisticians combine occupations into the following socio-economic grades.

Social Grade	Description
AB	Higher managerial; intermediate managerial; administrative; professions
C1	Supervisory; clerical; junior managerial, administrative, professional occupations
C2	Skilled manual occupations
DE	Semi-skilled & unskilled manual occupations; unemployed and lowest grade occupations

In customer feedback, it is routine to ask how likely people are to recommend a product on an 11-point scale and merge them into the following three categories.

Category	On a scale of 0 to 10, how likely are you to recommend INSERT NAME OF BRAND to your friends and colleagues?
Detractors	0 Extremely unlikely 1 2 3 4 5 6
Neutrals/Passives	7 8
Promoters	9 10 Extremely likely
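The three-way merge in the table above can be sketched as a small lookup function. The function name and the list of responses are hypothetical; only the cut points (0 to 6, 7 to 8, 9 to 10) come from the table.

```python
from collections import Counter

def recommend_group(score):
    """Map a 0-10 likelihood-to-recommend score to its merged category."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score <= 6:
        return "Detractors"
    if score <= 8:
        return "Neutrals/Passives"
    return "Promoters"

# Hypothetical responses, invented for illustration
responses = [10, 9, 7, 3, 8, 6, 10, 0, 9, 5]

groups = Counter(recommend_group(score) for score in responses)
print(dict(groups))
```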

Most commonly, 5-point scales are merged into two categories, with the top two categories combined to give the top 2 box score. Similarly, 7-point scales are often summarized using top 3 boxes, and some people also like to analyze bottom 2 boxes.
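A box score can be sketched as the share of responses falling in the top (or bottom) few categories. The example below assumes the scale is coded from 1 up to the number of points, with higher values being more favorable; the function names and the answers are invented for illustration.

```python
def top_box_share(responses, scale_max, boxes=2):
    """Share of responses in the top `boxes` categories of a 1..scale_max scale."""
    cutoff = scale_max - boxes + 1
    return sum(1 for r in responses if r >= cutoff) / len(responses)

def bottom_box_share(responses, boxes=2):
    """Share of responses in the bottom `boxes` categories of a scale starting at 1."""
    return sum(1 for r in responses if r <= boxes) / len(responses)

# Hypothetical answers on a 5-point scale
answers = [5, 4, 4, 3, 2, 5, 1, 4, 3, 5]

top_2 = top_box_share(answers, scale_max=5, boxes=2)
bottom_2 = bottom_box_share(answers, boxes=2)
print(f"Top 2 box: {top_2:.0%}, bottom 2 box: {bottom_2:.0%}")
```

For a 7-point scale, the same function gives a top 3 box score with scale_max=7 and boxes=3.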

Combining categories that are similar with respect to other data

Data can also be merged so that it best shows how the data relates to some other data (e.g., using techniques like CHAID).
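To give a feel for the idea, the sketch below merges the two categories that behave most alike on a second variable. This is a deliberately simplified stand-in and not CHAID itself, which relies on statistical tests to decide which categories to merge. The age bands, the purchase counts, and the function names are all hypothetical.

```python
from itertools import combinations

def most_similar_pair(outcome_counts):
    """Return the two categories whose outcome rates are closest.

    `outcome_counts` maps category -> (number with the outcome, total respondents).
    """
    def rate(category):
        with_outcome, total = outcome_counts[category]
        return with_outcome / total

    return min(
        combinations(outcome_counts, 2),
        key=lambda pair: abs(rate(pair[0]) - rate(pair[1])),
    )

def merge_pair(outcome_counts, pair):
    """Combine the two categories in `pair` into one, summing their counts."""
    a, b = pair
    merged = {c: v for c, v in outcome_counts.items() if c not in pair}
    merged[f"{a} + {b}"] = (
        outcome_counts[a][0] + outcome_counts[b][0],
        outcome_counts[a][1] + outcome_counts[b][1],
    )
    return merged

# Hypothetical data: (number who purchased, number of respondents) per age band
purchase_by_age = {
    "18-24": (30, 100),
    "25-34": (32, 100),
    "35-54": (55, 100),
    "55+": (70, 100),
}

pair = most_similar_pair(purchase_by_age)
merged_by_age = merge_pair(purchase_by_age, pair)
print(pair, merged_by_age)
```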

How merging varies by software

In some software, such as SPSS Statistics and R, categories are merged by recoding the values in variables. For example, if you want to merge together categories 0, 1, 2, 3, 4, and 5, this is done by recoding all the values (e.g., replacing values of 1, 2, 3, 4, and 5 with a value of 0).
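The recoding described above can be sketched in a few lines. This is plain Python rather than SPSS or R syntax, and the function name and sample values are assumptions made for illustration.

```python
def recode(values, mapping):
    """Replace each value found in `mapping`; leave every other value unchanged."""
    return [mapping.get(v, v) for v in values]

# Merge categories 0-5 by recoding 1, 2, 3, 4, and 5 to 0
merge_0_to_5 = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}

# Hypothetical raw values
raw = [0, 3, 5, 7, 10, 2]

recoded = recode(raw, merge_0_to_5)
print(recoded)
```

After recoding, the original distinction between the values 0 through 5 is no longer present in the recoded variable.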

Other software, such as Q or Displayr, permits categories to be merged without the need for the data to be recoded. This means that categories can be merged without changing numeric summaries (e.g., means).
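The difference can be illustrated by keeping the original values and applying the merge only when the table is built. This sketch is not how Q or Displayr work internally; it only shows, with invented data and hypothetical names, why the mean survives one approach and not the other.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw values on a 0-10 scale
raw = [0, 3, 5, 7, 10, 2]

# Display-only grouping: values 0-5 share one label, the data is untouched
display_group = {v: "0-5" for v in range(0, 6)}

def tabulate(values, groups):
    """Count values by display label, falling back to the value itself."""
    return dict(Counter(groups.get(v, str(v)) for v in values))

table = tabulate(raw, display_group)

# The mean is computed from the untouched values
mean_without_recoding = mean(raw)

# For comparison: the mean after recoding 1-5 to 0
mean_after_recoding = mean(0 if v <= 5 else v for v in raw)

print(table, mean_without_recoding, round(mean_after_recoding, 2))
```

Both approaches give the same merged table, but only the display-only grouping leaves the mean of the underlying values as it was.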