Midpoint Recoding – The Data Story Guide

Many survey questions present ranges (e.g., 18 to 24) as answer options. Midpoint recoding is a technique for converting the resulting categorical variables into numeric variables by replacing each bracket (i.e., a range defined by a minimum and a maximum value) with its midpoint. This article presents an example and discusses the four key issues that need to be solved: selecting a midpoint for a bracket, selecting a midpoint for the bottom bracket, selecting a midpoint for the top bracket, and selecting a value for non-numeric categories such as Don't Know.

Example

As an example, consider the table below, which shows household income brackets (also known as bands). The number of categories makes it impractical to use this table in a crosstab. One fix is to recode the data as a numeric variable so that we can compare average income.

Selecting a midpoint for a range

Consider the range of $150,000 to $199,999. The midpoint is ($150,000 + $199,999) / 2 = $174,999.50.
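The arithmetic above can be sketched in a few lines of Python. The function name bracket_midpoint is a hypothetical helper introduced for illustration only.

```python
def bracket_midpoint(lower: float, upper: float) -> float:
    """Return the midpoint of a closed bracket [lower, upper]."""
    if upper < lower:
        raise ValueError("upper bound must not be below lower bound")
    return (lower + upper) / 2


# The $150,000 to $199,999 bracket from the text.
midpoint_150k = bracket_midpoint(150_000, 199_999)
print(midpoint_150k)  # 174999.5
```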

Although this midpoint is arithmetically correct, it is not necessarily the most accurate estimate of the average income of people in the $150,000 to $199,999 income bracket. To see this, look at the chart below, where the area of each column is proportional to the percentage of data in each income bracket. The column for the $150,000 to $199,999 bracket is much shorter than the columns to its left, which shows that the frequency of incomes is falling as income rises. This suggests that people in this income band are more likely to have incomes closer to $150,000 than to $199,999. Consequently, a better value to use to represent the category may be, say, $160,000.

Although one can spend many hours coming up with better estimates than the midpoint, the good news is that it virtually never makes any difference. Provided that you choose a value within the bracket, the differences between this value and the values assigned to the other brackets remain relatively unchanged, and it is these relative differences that drive most key conclusions reached when analyzing the recoded data. If this argument does not persuade you, experiment with alternative values and you will observe that the relativities virtually never change.
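The experiment suggested above might look like the following sketch. The bracket counts for the two groups are invented purely for illustration (they are not from any real survey), and the alternative values are arbitrary choices within each bracket.

```python
# Hypothetical respondent counts per bracket for two groups (illustrative only).
# Brackets: $50,000-$99,999, $100,000-$149,999, $150,000-$199,999
counts_a = [50, 30, 20]
counts_b = [20, 30, 50]

midpoints = [74_999.5, 124_999.5, 174_999.5]     # exact bracket midpoints
alternatives = [70_000.0, 120_000.0, 160_000.0]  # other values inside each bracket


def weighted_mean(values, counts):
    """Average of the recoded values, weighted by respondents per bracket."""
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)


for label, values in [("midpoints", midpoints), ("alternatives", alternatives)]:
    mean_a = weighted_mean(values, counts_a)
    mean_b = weighted_mean(values, counts_b)
    print(f"{label}: A={mean_a:,.2f}  B={mean_b:,.2f}  B higher: {mean_b > mean_a}")
```

With these made-up counts the absolute averages shift when the recoded values change, but group B remains higher than group A under both codings, which is the kind of relativity the text refers to.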

Selecting a midpoint for the bottom bracket

The bracket of $0 to $1,000 is a special case when assigning a midpoint. There is a good chance that people in this bracket actually have an income of $0. Unless the category is particularly large (e.g., 30% or more of the sample), we can likely just use the midpoint for the reason described in the previous section. If the category is large, some guesstimation is required.

Selecting a midpoint for the top bracket

With top categories, such as $200,000 or over, calculating the midpoint is more challenging. There are people who earn billions of dollars in a year, and using this as the upper bound is guaranteed to cause problems (see Model-Based Outlier Detection). Furthermore, following the reasoning in the earlier discussion of $150,000 to $199,999, it is likely that most people in the $200,000 or over category are closer to $200,000 than to a billion.

A simple approach to this problem is to take the category immediately below the top category and add the distance between its midpoint and its upper bound to the lower bound of the top bracket. In this example, the bracket below the top is $150,000 to $199,999, its midpoint is $174,999.50, and so the difference between the midpoint and the upper bound is $24,999.50. Therefore, the value for $200,000 or over becomes $200,000 + $24,999.50 = $224,999.50.
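As a minimal sketch of this rule, the calculation can be written as a small Python function. The name top_bracket_value and its parameter names are hypothetical, chosen only for this illustration.

```python
def top_bracket_value(prev_lower: float, prev_upper: float, top_lower: float) -> float:
    """Value for an open-ended top bracket.

    Adds the distance between the previous bracket's midpoint and its
    upper bound to the lower bound of the top bracket.
    """
    prev_midpoint = (prev_lower + prev_upper) / 2
    half_width = prev_upper - prev_midpoint
    return top_lower + half_width


# Previous bracket: $150,000 to $199,999; top bracket: $200,000 or over.
value_top = top_bracket_value(150_000, 199_999, 200_000)
print(value_top)  # 224999.5
```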

If the ultimate goal is to estimate the average income, this simple approach will likely lead to an underestimate, as it ignores the billionaires. However, a survey is unlikely to be an appropriate methodology for that goal in any case, as the chance that a billionaire is both selected and chooses to participate in the study is small.

Values for non-numeric categories

Often questions that use brackets also contain non-numeric categories. The most common ones and the way to deal with them are:

Don't know: It is usually appropriate to code the response as a missing value.
Refuse to respond: It is usually appropriate to code the response as a missing value.
Not applicable: It can be appropriate to code the response as 0. For example, if the person has no income, then $0 is more appropriate than a missing value.
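The rules above can be combined with the numeric values from earlier sections into a single lookup. The following is a sketch only: the category labels are hypothetical stand-ins for whatever labels a given survey uses, None stands for a missing value, and coding Not applicable as 0 assumes it means the person has no income.

```python
from statistics import mean

# Hypothetical category labels mapped to recoded values.
RECODE = {
    "$150,000 to $199,999": 174_999.50,  # bracket midpoint
    "$200,000 or over": 224_999.50,      # top bracket value from the simple rule
    "Don't know": None,                  # missing value
    "Refused": None,                     # missing value
    "Not applicable": 0.0,               # assumes this means no income
}


def recode(responses):
    """Replace each category label with its numeric value (None = missing)."""
    return [RECODE[r] for r in responses]


def mean_ignoring_missing(values):
    """Average the observed values, skipping missing ones."""
    observed = [v for v in values if v is not None]
    return mean(observed) if observed else None


responses = ["$150,000 to $199,999", "Don't know", "Not applicable", "$200,000 or over"]
values = recode(responses)
average = mean_ignoring_missing(values)
print(values)
print(average)
```

Note that the missing response drops out of the average entirely, whereas the response coded as 0 pulls the average down, which is why the choice between the two matters.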