Converting Numeric Variables to Categorical Variables – The Data Story Guide

Numeric variables can be converted to categorical variables, a process variously known as categorizing, banding, binning, and quantizing. Doing so can make patterns in the data much easier to see. There are multiple methods for categorizing data, and the way it is done differs by software.

Improving the efficiency of analysis

Example 1: A single numeric variable

The histogram on the left, and the table below it, show the distribution of a numeric variable. The table on the right shows the distribution of a categorical variable, created from the numeric variable.

Example 2: Fast food consumption

The table below shows how many times people purchased from different fast-food chains over a three-month period. We can see that most of the values are 0 or 1.

When the data is converted to categorical, with the higher values merged, we get a much richer analysis. We can see, for example, that:

Brands differ in their penetration (i.e., the proportion of people who buy at least once in a time period).
Brands with higher penetration are, in general, consumed more frequently. (This pattern is so common it has a name: double jeopardy.)
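
As a rough illustration of the conversion described in this example, the Python sketch below merges the higher purchase counts into a single top category and computes penetration for each brand. The brand names, the purchase counts, the cut-off of 3, and the function names are all made up for illustration; they are not taken from the table above.

```python
from collections import Counter

def categorize_purchases(count, top=3):
    """Label a purchase count, merging every value >= top into one category."""
    return str(count) if count < top else f"{top} or more"

def penetration(counts):
    """Proportion of people who bought at least once in the period."""
    return sum(1 for c in counts if c >= 1) / len(counts)

# Hypothetical purchase counts per person over the period (not real data).
purchases = {
    "Brand A": [0, 0, 1, 2, 5, 0, 1, 9],
    "Brand B": [0, 0, 0, 1, 0, 0, 0, 2],
}

for brand, counts in purchases.items():
    distribution = Counter(categorize_purchases(c) for c in counts)
    print(brand, dict(distribution), f"penetration={penetration(counts):.1%}")
```

With the long tail merged into "3 or more", each brand is summarized by a handful of categories rather than a sparse list of every count observed.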

Methods for categorizing numeric variables

There are four basic methods for categorizing numeric variables:

Manual
Percentiles
Equal width bands
Pretty intervals

Manual

As the name suggests, manual categorization involves a user specifying bands. The three basic ways of doing this in software are:

Drag and drop. For example, if the numeric variables are changed to categorical in Q and Displayr, users can then merge the categories.
Recoding. For example, in SPSS, banding is done by recoding into a new variable, where you may convert all values between, say, 0 and 5 to a 1, and set the label of 1 to "0 to 5".
Specifying the bands or cut points.
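
The third approach, specifying cut points, can be sketched in a few lines of Python. This is a minimal sketch rather than how any particular package does it: the function name, the cut points, and the labels are assumptions chosen for illustration, and each cut point is treated as the lower bound of the next band.

```python
from bisect import bisect_right

def band_by_cut_points(values, cut_points, labels):
    """Assign each value to a band; cut_points are ascending lower bounds."""
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    # bisect_right counts the cut points <= value, which is the band index.
    return [labels[bisect_right(cut_points, v)] for v in values]

values = [3, 0, 5, 6, 12, 40]
cut_points = [6, 11]  # bands: below 6, 6 up to (not including) 11, 11 and above
labels = ["0 to 5", "6 to 10", "11 or more"]

banded = band_by_cut_points(values, cut_points, labels)
print(list(zip(values, banded)))
```

Recoding works the same way in spirit: each band receives a code and a label, and every value is mapped to the code of the band it falls in.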

Percentiles

The new variable is created so that approximately the same proportion of cases is in each category. For example, if deciles are created, then there are 10 groups, each with approximately 10% of the sample. "Approximately" because if the data is discrete, as most data is, it is usually impossible to have exactly 10% in each category.
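
The sketch below shows one way to form percentile-based categories in Python, and why the groups are only approximately equal in size. It assumes cut points computed by statistics.quantiles with the "inclusive" method; other software uses other rules, so results can differ. The function name and the data are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

def percentile_bands(values, n_groups):
    """Return a group number (1..n_groups) for each value, using quantile cut points."""
    cuts = quantiles(values, n=n_groups, method="inclusive")
    # Values equal to a cut point go to the lower group, so tied values stay together.
    return [bisect_left(cuts, v) + 1 for v in values]

# Hypothetical discrete data: purchase counts dominated by zeros.
counts = [0, 0, 0, 0, 0, 1, 1, 2, 3, 8]
groups = percentile_bands(counts, 4)
group_sizes = Counter(groups)
print(sorted(group_sizes.items()))
```

Here four groups were requested, but the five zeros cannot be split, so they all land in group 1 and group 2 ends up empty. With continuous data and no ties, the groups would be close to equal in size.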

Equal width bands

Categories are created so that they have the same range. For example, 0 to 9, 10 to 19, 20 to 29, etc.
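
For whole-number data, equal width bands reduce to integer division. The following Python sketch assumes integer values, a band width of 10, and bands starting at 0; the function name is illustrative.

```python
def equal_width_band(value, width=10, start=0):
    """Return the label of the equal-width band containing an integer value."""
    # Floor division finds how many whole bands lie below the value.
    lower = start + ((value - start) // width) * width
    return f"{lower} to {lower + width - 1}"

for v in [0, 9, 10, 19, 25]:
    print(v, "->", equal_width_band(v))
```

Unlike percentiles, this approach fixes the range of each band and lets the number of cases per band vary.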

Pretty intervals

This approach is a hybrid of percentiles and equal width bands, where the focus is on ensuring that the start and end-points of the bands are "pretty" numbers (e.g., multiples of 10 if possible, and otherwise 5).
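
There is no single standard algorithm for pretty intervals, and software packages differ. The Python sketch below is one possible interpretation, not the method used by any particular product: it computes percentile cut points, rounds them to the nearest multiple of 10, and falls back to multiples of 5 when rounding to 10 would make two cut points collide. The function name and data are hypothetical.

```python
from statistics import quantiles

def pretty_cut_points(values, n_groups):
    """Percentile cut points rounded to multiples of 10, or of 5 if 10 causes collisions."""
    raw = quantiles(values, n=n_groups, method="inclusive")
    for step in (10, 5):
        rounded = [step * round(c / step) for c in raw]
        if len(set(rounded)) == len(rounded):
            return rounded
    # Assumption: if even multiples of 5 collide, keep the distinct ones (fewer bands).
    return sorted(set(rounded))

wide = [3, 8, 12, 18, 21, 27, 33, 41, 47, 52, 58, 64]
narrow = list(range(1, 21))

print(pretty_cut_points(wide, 3))    # multiples of 10 are distinct here
print(pretty_cut_points(narrow, 4))  # multiples of 10 collide, so 5 is used
```

The resulting cut points can then be used like manually specified ones, which trades exactly equal group sizes for band boundaries that are easier to read.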