Imputing Missing Data – The Data Story Guide

A popular approach to dealing with missing data is to use a technique called imputation, which seeks to guess the values of the missing data.

When to use imputation

Imputation is useful when you want to estimate averages and percentages from data sets that contain missing values, and the missing values are either Missing At Random or Missing Completely At Random.

It is typically counterproductive to use imputation when conducting multivariate analyses or computing correlations, or when the missing data is Nonignorable. See How Missing Values are Addressed in Statistical Analysis for more information.

Imputation methods

A variety of imputation methods have been developed and are in widespread use.

Replacing missing values with the mean (average)

A popular approach to imputation is to replace each missing value with the average of the observed values. The only situation where this is a "good" idea is if you want to confuse people about your sample size. Some simple analyses are OK with this approach to imputation (e.g., computing the average), but the vast majority of analyses become invalid when missing values are replaced with the average. Consider the following data table, where beer consumption is missing for three people. Among the people with no missing data, the average is 1.43. If we use this value for those with missing data, we get the new column of data shown as Replaced by average.

ID	Gender	Beer consumption in last week	Replaced by average	Replaced by average for genders
1	Male	5	5	5
2	Male	4	4	4
3	Male	0	0	0
4	Male	0	0	0
5	Male	MISSING	1.43	2.25
6	Female	0	0	0
7	Female	1	1	1
8	Female	MISSING	1.43	0.33
9	Female	MISSING	1.43	0.33
10	Female	0	0	0

While it may at first seem sensible to assume that, in the absence of any response, each respondent is ‘average’, doing so has some dire consequences. In this example, we see two such consequences. First, the standard deviation of beer consumption falls substantially, from 2.15 to 1.75. This is particularly problematic in multivariate techniques, such as regression, cluster analysis, and principal components analysis, which are based on analyzing the variances in data.

Statistic	Group	Beer consumption in last week	Replaced by average	Replaced by average for genders
Average	Total	1.43	1.43	1.29
Average	Males	2.25	2.09	2.25
Average	Females	0.33	0.77	0.33
Ratio of averages	Male / Female	6.75	2.70	6.75
Standard deviation	Total	2.15	1.75	1.84
Standard deviation	Males	2.63	2.31	2.28
Standard deviation	Females	0.58	0.73	0.41

A second problem is that we have changed the relationship between our variables. Prior to replacing the missing values with the mean, the data suggested that males consumed, on average, 6.75 times as much beer as females. After replacing the missing values, a comparison of means leads to the conclusion that males consume only 2.70 times as much as females!
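Both consequences can be checked directly from the table. The following is a minimal sketch using only the Python standard library; the variable names (gender, beer, imputed) and the helper group_mean are ours, chosen for illustration.

```python
from statistics import mean, stdev

# Data from the table above; None marks a missing value.
gender = ["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

observed = [b for b in beer if b is not None]
overall_mean = mean(observed)  # about 1.43

# Mean imputation: every missing value becomes the overall mean.
imputed = [overall_mean if b is None else b for b in beer]

def group_mean(values, group):
    """Mean of the non-missing values for one gender (illustrative helper)."""
    return mean(v for v, g in zip(values, gender) if g == group and v is not None)

sd_before = stdev(observed)
sd_after = stdev(imputed)
ratio_before = group_mean(beer, "M") / group_mean(beer, "F")
ratio_after = group_mean(imputed, "M") / group_mean(imputed, "F")

print(f"Standard deviation: {sd_before:.2f} -> {sd_after:.2f}")
print(f"Male / female ratio: {ratio_before:.2f} -> {ratio_after:.2f}")
```

The standard deviation shrinks because the three imputed values sit exactly at the mean and add nothing to the sum of squared deviations, while the sample size used in the denominator grows from 7 to 10.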

Standard implementations of regression and other predictive models

An alternative to using the average for imputation is to create a predictive model of some type. For example, we can see in the example above that gender is a predictor of beer consumption, so we can assign each missing value the average for that person's gender, as shown in the final column of the table above.

While less problematic than simply replacing values with the overall mean, this approach is still far from perfect. The analysis now preserves the average differences between the genders, but we still have a substantial reduction in the overall variation in the data (the standard deviation falls from 2.15 to 1.84), and the standard deviations within each gender fall as well (from 2.63 to 2.28 for males and from 0.58 to 0.41 for females). The consequence is that we inadvertently exaggerate the true relationship between the predictor variables and the variable where the missing data is being replaced. In this example, the correlation between gender and beer consumption increases from .48 to .55.
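The inflated correlation can be reproduced with a short sketch. Gender is coded here as 1 for male and 0 for female, which is our assumption about how the correlation was computed; the pearson function is a plain standard-library implementation written for this illustration.

```python
from statistics import mean, stdev

male = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = male, 0 = female (assumed coding)
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Average of the observed values within each gender.
gender_means = {
    g: mean(b for m, b in zip(male, beer) if m == g and b is not None)
    for g in (0, 1)
}

# Replace each missing value with the average for that person's gender.
imputed = [gender_means[m] if b is None else b for m, b in zip(male, beer)]

complete = [(m, b) for m, b in zip(male, beer) if b is not None]
r_before = pearson([m for m, _ in complete], [b for _, b in complete])
r_after = pearson(male, imputed)
sd_after = stdev(imputed)

print(f"Correlation: {r_before:.2f} -> {r_after:.2f}")
print(f"Overall standard deviation after imputation: {sd_after:.2f}")
```

Every imputed value lies exactly on the prediction, so the imputed cases add to the explained variation but not to the unexplained variation, which pushes the correlation up.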

Stochastic predictive models

The problem with the standard predictive models is that they ignore the uncertainty in the predictions. A sounder approach is to use stochastic models which, rather than producing a single prediction for each person, calculate the probability of every possible result. Imputation is then performed by randomly selecting one of these possible results, with the probability of selection being proportional to the probability of the result occurring.

This can be done with any predictive technique. For example, if using linear regression, a draw from the normal distribution is used, with the predicted value as the mean and the standard error of the estimate as the standard deviation.
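As a rough sketch of the linear regression case, the code below reuses the beer consumption data with gender as the only predictor. With a single binary predictor the least-squares prediction is simply the gender average, so no regression library is needed. The function name stochastic_impute and the seed argument are ours, and this is an illustration under those assumptions rather than a production method. Note that a normal draw can produce negative or fractional values, which are impossible for a count of beers.

```python
import random
from statistics import mean

male = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = male, 0 = female (assumed coding)
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

observed = [(m, b) for m, b in zip(male, beer) if b is not None]

# With one binary predictor, the least-squares prediction is the group mean.
pred = {g: mean(b for m, b in observed if m == g) for g in (0, 1)}

# Standard error of the estimate: sqrt(SSE / (n - 2)), two fitted parameters.
sse = sum((b - pred[m]) ** 2 for m, b in observed)
see = (sse / (len(observed) - 2)) ** 0.5

def stochastic_impute(seed):
    """Fill each missing value with a draw from Normal(prediction, see)."""
    rng = random.Random(seed)
    return [rng.gauss(pred[m], see) if b is None else b
            for m, b in zip(male, beer)]

imputed = stochastic_impute(seed=1)
print(f"Standard error of the estimate: {see:.2f}")
print([round(v, 2) for v in imputed])
```

Calling stochastic_impute with a different seed gives different imputed values, while the observed values are never changed.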

In practice, analysts rarely use traditional predictive models for this purpose. Instead, they use specialty predictive models that have been tuned just for imputation. At the time of writing, MICE (Multiple Imputation by Chained Equations) is probably the most popular method for doing this. Prior to this, the most popular methods used mixture models.

A problem with stochastic predictive models is that the process of taking a random draw to generate the imputed value introduces randomness into the data. A different random draw will cause the results to change.

Hot deck

An alternative to using stochastic predictive models is to impute missing values using a hot deck. The basic idea is to:

Create some definition of similarity in a data set. For example, you may conclude that two cases are similar if they are the same age, gender, and geographic location.
Where a case has missing values, assign it a randomly selected value from a similar case.

This approach is relatively easy, but the imputations tend to be grossly inferior to those of stochastic predictive models, as the definitions of similarity are based on hunches, whereas predictive models are calibrated against the data. This approach also introduces randomness into the data.
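The two hot deck steps can be sketched as follows. Here two cases are treated as similar if they have the same gender, which is only one possible definition of similarity; the function name hot_deck and its arguments are ours.

```python
import random

gender = ["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

def hot_deck(values, groups, seed):
    """Fill each missing value with a random observed value from the same group."""
    rng = random.Random(seed)

    # Step 1: collect the donor values for each group of similar cases.
    donors = {}
    for v, g in zip(values, groups):
        if v is not None:
            donors.setdefault(g, []).append(v)

    # Step 2: each missing case borrows a randomly selected donor value.
    return [rng.choice(donors[g]) if v is None else v
            for v, g in zip(values, groups)]

print(hot_deck(beer, gender, seed=0))
```

Unlike the regression draw, every imputed value is one that was actually observed, so impossible values such as negative consumption cannot occur. This sketch raises a KeyError if a group has no observed donors, which a fuller implementation would need to handle.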

Multiple imputation

Multiple imputation attempts to solve the problem of randomness being introduced via imputation. It does this by repeating the imputation multiple times (e.g., 10 times), each time using a different random number seed. This results in multiple data sets. Each of these sets is then analyzed and the results are consolidated. Most commonly this occurs with predictive models, such as regression, where:

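The parameter estimates are obtained by calculating the average of the parameter estimates from the imputed data sets.
The standard errors are obtained by combining the average of the squared standard errors from the imputed data sets with the variance of the parameter estimates across the imputed data sets, with an adjustment for the number of imputations.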
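The consolidation step can be sketched as below, assuming the adjustment is the one given by Rubin's rules: total variance = within-imputation variance + (1 + 1/m) x between-imputation variance, where m is the number of imputations. The estimated quantity here is average beer consumption, and the imputation step is the simple regression draw on gender. All function names are ours. A proper multiple imputation would also draw the model parameters at random for each data set, which this simplified sketch does not do.

```python
import random
from statistics import mean, stdev, variance

male = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = male, 0 = female (assumed coding)
beer = [5, 4, 0, 0, None, 0, 1, None, None, 0]

observed = [(m, b) for m, b in zip(male, beer) if b is not None]
pred = {g: mean(b for m, b in observed if m == g) for g in (0, 1)}
sse = sum((b - pred[m]) ** 2 for m, b in observed)
see = (sse / (len(observed) - 2)) ** 0.5

def impute_once(seed):
    """One stochastic imputation: normal draw around the gender average."""
    rng = random.Random(seed)
    return [rng.gauss(pred[m], see) if b is None else b
            for m, b in zip(male, beer)]

def pool(estimates, std_errors):
    """Combine per-data-set results using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(se ** 2 for se in std_errors)  # average squared standard error
    between = variance(estimates)                # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5

estimates, std_errors = [], []
for seed in range(10):  # 10 imputed data sets, each with a different seed
    data = impute_once(seed)
    estimates.append(mean(data))
    std_errors.append(stdev(data) / len(data) ** 0.5)

pooled_mean, pooled_se = pool(estimates, std_errors)
print(f"Pooled mean: {pooled_mean:.2f}, pooled standard error: {pooled_se:.2f}")
```

The between-imputation term is what carries the uncertainty caused by the missing data: if every imputed data set gave the same estimate, the pooled standard error would reduce to the ordinary one.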

Multiple imputation is available in Displayr by setting Missing data to Multiple imputation.