Converting Categorical Variables to Binary Variables – The Data Story Guide

Categorical variables containing three or more categories can be converted to binary variables. This can greatly improve the efficiency of analysis by reducing the amount of data to be examined. The way that this is done differs by software.

Improving the efficiency of analysis

Example 1: A variable set of brand attitude

The table below shows attitudes towards six cola brands. Ignoring the NET column, it contains 30 percentages. (For a different approach to analyzing this data, see Converting Categorical Variables to Numeric Variables.)

The table below shows the same data, but instead of the individual percentages it shows the top 2 box score: the sum of the percentages of people who chose Like and Love in the table above.

As the table contains only seven numbers, it is much easier to interpret. We can easily see that Coca-Cola is the most liked and Diet Pepsi the least liked. We can also see from the NET that 3% of people didn't like or love any of the brands.
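The top 2 box calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: the percentages are made up rather than taken from the table, the category labels other than Like and Love are assumed, and the names `attitudes` and `top_2_box` are hypothetical.

```python
# Hypothetical percentages for illustration only; they are NOT the article's data.
# Category labels other than "Like" and "Love" are assumed.
attitudes = {
    "Coca-Cola": {"Hate": 2, "Dislike": 8, "Neither": 20, "Like": 40, "Love": 30},
    "Diet Pepsi": {"Hate": 15, "Dislike": 30, "Neither": 30, "Like": 20, "Love": 5},
}

TOP_2 = ("Like", "Love")


def top_2_box(row, top=TOP_2):
    """Sum the percentages of the categories that count towards the top 2 box."""
    return sum(row[category] for category in top)


# One number per brand instead of five.
scores = {brand: top_2_box(row) for brand, row in attitudes.items()}

for brand, score in scores.items():
    print(f"{brand}: {score}%")
```

Each brand's five percentages collapse into a single score, which is why the summary table is so much shorter than the original.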

When we want to understand how the data relates to other data, the benefit of converting the data to binary grows. Consider the case where we want to understand how attitudes to the cola brands differ by age, where age is in nine categories. A table showing all attitude categories by age would have 30 * 9 = 270 numbers to look at, and it would take a long time to examine. The table below presents the information much more efficiently, allowing us to quickly spot the patterns.
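As a rough sketch of how a binary variable simplifies a comparison by age, the Python below computes one top 2 box percentage per age group from respondent-level data. The respondents, the age labels, and the function name `top_2_box_by_group` are all hypothetical, and only two age groups are shown to keep the example short.

```python
from collections import defaultdict

# Hypothetical respondent-level data: (age group, attitude to one brand).
respondents = [
    ("18 to 24", "Love"),
    ("18 to 24", "Neither"),
    ("18 to 24", "Like"),
    ("18 to 24", "Like"),
    ("25 to 29", "Dislike"),
    ("25 to 29", "Like"),
]


def top_2_box_by_group(records, top=("Like", "Love")):
    """Return the percentage of top 2 box answers within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [top 2 box count, total]
    for group, answer in records:
        counts[group][0] += answer in top  # True adds 1, False adds 0
        counts[group][1] += 1
    return {group: 100 * hits / total for group, (hits, total) in counts.items()}


by_age = top_2_box_by_group(respondents)
for group, score in by_age.items():
    print(f"{group}: {score:.0f}%")
```

Because each respondent contributes a single 0 or 1, every age group is summarized by one number per brand rather than one number per attitude category.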

The efficiency that is achieved by using binary rather than categorical variables occurs on two levels:

It is easier for the analyst, as there are fewer numbers to look at.
The resulting analysis has more statistical power. If there is a pattern in the data, we are more likely to find it when we treat the data as binary (provided that it is not a nonlinear pattern).

Example 2: Customer effort scores

The table below shows customer effort scores for cell phone providers.

At first glance, this seems to be essentially the same problem as in the previous example. However, there is a trap for beginners hidden in it. The top 2 box score for Understand your bill, for example, isn't computed by summing 44% and 3%, because the 3% is the Don't know category. Instead, the typical way to analyze such data is to:

Rebase the data, removing the Don't know category.
Calculate the percentage of Easy + Very Easy.

That is, the top 2 box score for Understand your bill is (34% + 44%) / (100% - 3%) ≈ 80%.
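The arithmetic above can be checked with a short Python sketch. The three percentages come from the example; the function name `rebased_top_2_box` is hypothetical.

```python
def rebased_top_2_box(easy, very_easy, dont_know):
    """Top 2 box as a percentage of respondents who gave an opinion.

    All arguments are percentages of the full sample. Removing the
    Don't know share from the base is the "rebase" step.
    """
    return 100 * (easy + very_easy) / (100 - dont_know)


# Figures for Understand your bill: Easy 34%, Very Easy 44%, Don't know 3%.
score = rebased_top_2_box(easy=34, very_easy=44, dont_know=3)
print(f"{score:.1f}%")  # 78 / 97, a little over 80%
```

When nobody answers Don't know, the base stays at 100% and the result is simply the sum of the two percentages.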

Different software packages approach this problem in very different ways

How to do this depends on the software being used to analyze the data. In older software, such as SPSS Statistics, such analysis is done by recoding to create new binary variables and then calculating the average of these variables.
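The recode-then-average approach is not specific to any one package. The sketch below illustrates the idea in plain Python and is not SPSS syntax. The answers are hypothetical, the labels other than Easy, Very Easy, and Don't know are assumed, and treating Don't know as missing is what rebases the result.

```python
# Recoding scheme: 1 = counts towards the top 2 box, 0 = does not,
# None = missing (excluded from the base).
# Labels other than "Easy", "Very Easy" and "Don't know" are assumed.
RECODE = {
    "Very difficult": 0,
    "Difficult": 0,
    "Neither": 0,
    "Easy": 1,
    "Very Easy": 1,
    "Don't know": None,
}

# Hypothetical answers from ten respondents.
answers = [
    "Easy", "Very Easy", "Don't know", "Difficult", "Easy",
    "Neither", "Very Easy", "Don't know", "Easy", "Very difficult",
]

# Step 1: recode each answer into a new binary variable.
binary = [RECODE[answer] for answer in answers]

# Step 2: average the binary variable over the non-missing values.
valid = [value for value in binary if value is not None]
score = 100 * sum(valid) / len(valid)

print(f"Top 2 box: {score:.1f}% of {len(valid)} respondents with an opinion")
```

The average of a 0/1 variable is the proportion of 1s, so averaging the recoded variable gives the top 2 box score directly.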

In more modern software, such as Q, Displayr, and R, the conversion is performed by changing the structure of the data. For example, in Displayr, you could use a shortcut to entirely automate Example 1. For Example 2, you would change the Structure of the data to Binary - Multi and then use the Count this Value and Missing Values settings to tell the software how to calculate the percentages.