Numeric variables can be converted to categorical variables. This is variously known as categorizing, banding, binning, or quantizing. Doing so can make patterns in the data much easier to see. There are several methods for categorizing data, and the way each is carried out differs by software.
Improving the efficiency of analysis
Example 1: A single numeric variable
The histogram on the left, and the table below it, show the distribution of a numeric variable. The table on the right shows the distribution of a categorical variable, created from the numeric variable.
Example 2: Fast food consumption
The table below shows how many times people purchased from different fast-food chains over a three-month period. Most of the values are 0 or 1.
When the data is converted to categorical, with the higher values merged, we get a much richer analysis. We can see, for example, that:
- Brands differ in their penetration (i.e., the proportion of people who buy once or more in a time period).
- Brands with higher penetration are, in general, consumed more frequently. (This pattern is so common it has a name: double jeopardy.)
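The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular software: the brand names and purchase counts are hypothetical, and the choice to merge everything from 3 upwards into a single "3+" category is an assumption made for the example.

```python
from collections import Counter

def categorize(count, top=3):
    """Keep small counts as their own category and merge the rest into 'top+'."""
    return f"{top}+" if count >= top else str(count)

def penetration(counts):
    """Proportion of people who bought once or more."""
    return sum(c >= 1 for c in counts) / len(counts)

# Hypothetical purchase counts per person for two made-up brands
purchases = {
    "Brand A": [0, 0, 1, 1, 2, 4, 7, 0, 1, 3],
    "Brand B": [0, 0, 0, 0, 1, 0, 2, 0, 0, 1],
}

# Categorical distribution for each brand
tables = {
    brand: Counter(categorize(c) for c in counts)
    for brand, counts in purchases.items()
}

for brand, counts in purchases.items():
    print(brand, dict(tables[brand]), f"penetration={penetration(counts):.0%}")
```

With the higher values merged, each brand is summarized by a handful of categories, and penetration is simply the share of people outside the "0" category.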
Methods for categorizing numeric variables
There are four basic methods for categorizing numeric variables:
- Manual
- Percentiles
- Equal width bands
- Pretty intervals
Manual
As the name suggests, manual categorization involves a user specifying bands. The three basic ways of doing this in software are:
- Drag and drop. For example, once a numeric variable has been changed to categorical in Q or Displayr, users can merge its categories.
- Recoding. For example, in SPSS, banding is done by recoding into a new variable, where you may convert all values between, say, 0 and 5 to a 1, and set the label of 1 to "0 to 5".
- Specifying the bands or cut points.
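As a rough sketch of the third approach, specifying cut points, the Python below assigns each value to a band by looking up where it falls among the cut points. The function name `band_by_cut_points`, the cut points, and the labels are all hypothetical choices for illustration.

```python
from bisect import bisect_right

def band_by_cut_points(value, cut_points, labels):
    """Return the label of the band that `value` falls into.

    `cut_points` holds the sorted lower bounds of every band except the
    first, so there must be exactly one more label than cut points.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]

# Hypothetical bands: 0 to 5, 6 to 10, and 11 or more
cut_points = [6, 11]
labels = ["0 to 5", "6 to 10", "11 or more"]

values = [0, 5, 6, 10, 11, 40]
banded = [band_by_cut_points(v, cut_points, labels) for v in values]
print(banded)
```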
Percentiles
The new variable is created so that approximately the same proportion of cases is in each category. For example, if deciles are created, there are 10 groups, each with approximately 10% of the sample. The proportions are only approximate because, if the data is discrete, as most data is, it is usually impossible to have exactly 10% in each category.
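The following is a minimal sketch of percentile-based banding using Python's standard library. The data is hypothetical, quartiles are used rather than deciles to keep the example small, and the `"inclusive"` quantile method is just one of several reasonable conventions.

```python
from bisect import bisect_left
from statistics import quantiles

def percentile_bands(values, n_groups):
    """Assign each value a group index from 0 to n_groups - 1.

    The cut points are the quantiles of the data; a value equal to a
    cut point is placed in the lower group.
    """
    cuts = quantiles(values, n=n_groups, method="inclusive")
    return [bisect_left(cuts, v) for v in values]

# Hypothetical discrete data with many tied values
purchases = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 5, 8]

groups = percentile_bands(purchases, 4)
group_sizes = [groups.count(g) for g in range(4)]
print(group_sizes)
```

Because tied values cannot be split across groups, the four groups here do not each contain exactly a quarter of the cases.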
Equal width bands
Categories are created so that they have the same range. For example, 0 to 9, 10 to 19, 20 to 29, etc.
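For integer data, equal width bands can be computed with integer division. This is a small sketch: the function name `equal_width_band` and the sample values are hypothetical, and it assumes whole-number values so that a band of width 10 can be labeled "0 to 9".

```python
def equal_width_band(value, width, start=0):
    """Label the band of the given width that an integer value falls into."""
    lower = start + ((value - start) // width) * width
    return f"{lower} to {lower + width - 1}"

# Hypothetical integer values banded into widths of 10
values = [0, 9, 10, 19, 20, 27]
bands = [equal_width_band(v, 10) for v in values]
print(bands)
```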
Pretty intervals
This approach is a hybrid of percentiles and equal width bands, where the focus is on ensuring that the start and end points of the bands are "pretty" numbers (e.g., multiples of 10 if possible, and otherwise 5).
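One possible heuristic for choosing pretty end points is sketched below. It is an assumption for illustration, not the algorithm used by any particular software: it picks a step of 1, 2, 5, or 10 times a power of ten that is at least as large as the range divided by a target number of bands, then rounds the end points outward to multiples of that step. The function name `pretty_breaks` and the `target_bands` parameter are hypothetical.

```python
import math

def pretty_breaks(lo, hi, target_bands=5):
    """Return band end points that are multiples of a 'nice' step size."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    raw_step = (hi - lo) / target_bands
    magnitude = 10 ** math.floor(math.log10(raw_step))
    # Smallest nice step (1, 2, 5 or 10 times a power of ten) covering raw_step
    candidates = (m * magnitude for m in (1, 2, 5, 10))
    step = next((c for c in candidates if c >= raw_step), 10 * magnitude)
    start = math.floor(lo / step) * step   # round the minimum down
    end = math.ceil(hi / step) * step      # round the maximum up
    n_bands = round((end - start) / step)
    return [start + i * step for i in range(n_bands + 1)]

# Hypothetical data range of 3 to 47
breaks = pretty_breaks(3, 47)
print(breaks)
```

Note that the number of bands produced can differ from `target_bands`, because the step is rounded up to a nice value.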