By default, most statistical analysis programs make the convenient but rarely plausible assumption that data is Missing Completely At Random.[1] This assumption is routinely made because it is the simplest one to work with. In many areas of statistics, assumptions can be broken without dramatic consequences. The treatment of missing data is not such an area: incorrectly assuming data is Missing Completely At Random can lead to massively misleading results (e.g., in the case of regression, it can cause the conclusions of a model to be reversed).
When we have missing values in data, we need to go through the following process:
- Try to fix the data (e.g., re-contact respondents and get their answers). If possible, it is better to work out the correct value of the missing data. Often categories are missing because they are inapplicable: somebody who is listed as a home maker and has income listed as Missing Data probably has no income. Often common sense tells us what the true answers must be. If a respondent has indicated that they never purchase ice cream, they may not have been asked about their frequency of buying Magnums, and we will be safe in replacing the missing value with a value of No. Similarly, if someone has indicated that they Don't Know whether it is important to have a king-sized bed in a hotel room, we can be reasonably confident in assuming that it cannot be of great importance to them. And, if someone cannot remember the last time they went to the cinema, we can be reasonably confident it was not in the last week. Where common sense is not enough, we need to look for clues in answers to other questions in order to replace the missing value with a meaningful response. It is not unknown for non-commercial research institutes to have junior researchers and students read through questionnaires to determine what the likely response may have been. This practice may seem suspect, but it is probably less dangerous than ignoring the problem. (A minimal sketch of this kind of rule-based recoding appears after this list.)
- Determine whether the missing values are best characterized as being Missing Completely At Random, Missing At Random, or Nonignorable.
- (Optionally) Data imputation, which involves replacing the missing values with predictions of their likely values. Imputation is always something of a last resort, and this step should only be conducted if the final step (using statistical methods that make appropriate assumptions) cannot be conducted appropriately. Most automated imputation methods implicitly assume that the data is Missing At Random.
- (Optionally) Weighting, whereby the data is weighted to correct for the missing value pattern. Theoretically this is equivalent to imputation but in practice it is a different process.
- Using statistical methods that make appropriate assumptions regarding the type of missing data. Where the statistical methods available make assumptions that are known to be incorrect, it is sometimes advisable to use imputation instead. However, it is always theoretically preferable to use statistical methods that make appropriate assumptions, as the process of imputation is inevitably inaccurate and these inaccuracies infect any subsequent analysis.
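To make the first step above concrete, the sketch below shows what rule-based recoding might look like in Python with pandas. The data frame, variable names, and the rule itself (ice_cream, magnum_freq) are hypothetical illustrations of the ice cream example, not part of any standard library.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: magnum_freq was skipped for people who
# never purchase ice cream, so its missing values are not really unknown.
df = pd.DataFrame({
    "ice_cream": ["Never", "Weekly", "Never", "Monthly"],
    "magnum_freq": [np.nan, "Sometimes", np.nan, "Rarely"],
})

# Common-sense rule: someone who never buys ice cream never buys Magnums,
# so replace the skipped answers with an explicit "Never".
skipped = df["magnum_freq"].isna() & (df["ice_cream"] == "Never")
df.loc[skipped, "magnum_freq"] = "Never"

print(df)
```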
The rest of this section reviews the most common types of analyses that are conducted in market research and how they can be implemented depending upon the type of missing data.
Averages and percentages
When averages and percentages are computed in standard statistical software, the missing values are excluded from the analysis, which implicitly assumes that the data is Missing Completely At Random. If this assumption is incorrect but the data is Missing At Random, imputation is generally the best solution.
When the missing values are Nonignorable there is little that can be done to compute meaningful averages and percentages.
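As an illustration of the two cases above, the sketch below (Python, with hypothetical data) contrasts the default behaviour, where the mean is computed after dropping missing values (an implicit Missing Completely At Random assumption), with a simple group-based imputation that is only defensible when the data is Missing At Random given the observed grouping variable.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is more often missing in one segment,
# so simply deleting missing values biases the overall average.
df = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B", "B"],
    "income":  [30, 35, np.nan, 80, np.nan, np.nan],
})

# Default behaviour: pandas (like most packages) drops missing values,
# which is only unbiased if the data is Missing Completely At Random.
print(df["income"].mean())  # mean of the 3 observed values only

# If missingness depends only on the observed segment (Missing At Random),
# filling in each segment's observed mean removes that source of bias.
filled = df["income"].fillna(df.groupby("segment")["income"].transform("mean"))
print(filled.mean())
```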
Correlations
Correlations implicitly assume that the data is either Missing Completely At Random or Missing At Random. It is generally not appropriate to compute correlations with data that is imputed, for two reasons:
- Good imputations use the observed correlations in the data to infer the missing values, so using imputed data to compute correlations involves circular logic.
- Most imputations are not very good, and the correlations computed using the imputed values will be biased, whereas without the imputation they may not be biased at all, even when the data is Missing At Random.
A simple example helps in understanding this problem. Imagine that the true values of 10 respondents are as follows. These variables clearly have a perfect correlation of 1.
x | y |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 6 |
7 | 7 |
8 | 8 |
9 | 9 |
10 | 10 |
Now consider a situation where the y variable is missing for respondents who have values of x of 6 or more, which is an example of data that is Missing At Random. Using only the data that is available, we still observe a perfect correlation, and thus our analysis is not ruined by the data being Missing At Random.
x | y |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | Missing Data |
7 | Missing Data |
8 | Missing Data |
9 | Missing Data |
10 | Missing Data |
The next table shows the results computed using the SPSS Missing Value Analysis module with the EM algorithm. Note that SPSS has done a pretty good job (and, if we had played around with the options in SPSS, we could have got it to do a better job). However, the correlation is now estimated as 0.994. At first glance this may seem almost the same as the correct value of 1, but consider it for a moment: the missing data pattern was a really obvious one, the algorithm has still got it wrong, and the consequence is that we have underestimated the true relationship. With weaker correlations and more variables the problem becomes much greater, and it is thus, in general, best not to use imputed values when computing correlations. If you read the Imputation page you will see another example of correlations, where the imputation causes the correlation to be exaggerated.
x | y |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 5.50 |
7 | 6.35 |
8 | 7.20 |
9 | 8.05 |
10 | 8.90 |
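Both tables can be checked directly. The short Python sketch below computes the correlation on the complete cases (still a perfect 1) and on the filled-in values copied from the table above (no longer 1). Note that the 0.994 quoted above is SPSS's EM estimate, which need not exactly match a Pearson correlation computed directly on the filled-in values.

```python
import numpy as np

x = np.arange(1, 11)
y_true = np.arange(1, 11)

# Complete-case analysis: y is only observed where x <= 5, yet the
# correlation computed on the observed pairs is still a perfect 1.
observed = x <= 5
print(np.corrcoef(x[observed], y_true[observed])[0, 1])  # 1.0

# Correlation using the EM-imputed values from the table above:
# the filled-in data no longer yields a perfect correlation.
y_imputed = np.array([1, 2, 3, 4, 5, 5.50, 6.35, 7.20, 8.05, 8.90])
print(np.corrcoef(x, y_imputed)[0, 1])  # slightly below 1
```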
When the missing values are Nonignorable there is little that can be done to compute meaningful correlations.
Principal Components Analysis
Different statistical programs make different assumptions about missing values when conducting principal components analysis. To understand the differences between these implementations it is important to understand that principal components analysis is computed from the correlation matrix (i.e., the correlations between each of the pairs of variables).[2]
SPSS by default uses the setting Exclude cases pairwise, which means that it computes each correlation using all of the respondents who have data for that particular pair of variables. This involves an implicit assumption that the data is Missing Completely At Random.
An alternative is to compute correlations using only those respondents who have no missing values at all. This is the default in R (where it is referred to as na.exclude) and is the only option in Q. This approach to missing data is consistent with the assumption that the data is Missing Completely At Random and sometimes with Missing At Random.[3] SPSS can also be set to use this assumption (Options : Missing Values : Exclude cases listwise). In terms of its assumptions about the nature of the missing data, this approach is generally preferable to pairwise deletion. However, with large amounts of missing values it is often impossible to use this method.
As principal components analysis is based on correlations, and correlations are typically invalid when data imputation is involved, imputation is also not typically appropriate prior to principal components analysis. Various versions of principal components analysis have been developed which can accommodate missing values by making either Missing At Random or Missing Completely At Random assumptions, but they are not available as standard options in commonly used statistical software.
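A minimal sketch of the two options, using hypothetical data: pandas computes correlations pairwise by default, while dropping incomplete rows first gives the listwise (complete-case) correlation matrix, and principal components analysis is then an eigendecomposition of whichever matrix you choose to trust.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with scattered missing values.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, np.nan],
    "b": [2, 1, 4, np.nan, 5, 6],
    "c": [1, np.nan, 3, 3, 5, 5],
})

# Pairwise deletion (SPSS's "Exclude cases pairwise"; pandas default):
# each correlation uses every respondent observed on that pair.
pairwise = df.corr()

# Listwise deletion (SPSS's "Exclude cases listwise"):
# only respondents with no missing values at all are used.
listwise = df.dropna().corr()

# Principal components analysis is an eigendecomposition
# of the chosen correlation matrix.
eigenvalues, components = np.linalg.eigh(listwise.to_numpy())
print(pairwise, listwise, eigenvalues, sep="\n\n")
```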
Cluster Analysis and Latent Class Analysis
See Missing Values in Cluster Analysis and Latent Class Analysis.
Regression
By default, most regression models exclude all respondents for which there is any missing data. This is consistent with Missing Completely At Random and can sometimes be consistent with Missing At Random as well.
The intuition for understanding the Missing Completely At Random case follows from the earlier discussion of correlation: a regression model predicts a straight line through the points, and in that example you get the same correct line whether it is fitted to the complete data or only to the rows without missing values.
The Missing At Random case is more difficult. Best practice is to use Multiple Imputation.
Simpler imputation methods, such as mean replacement, are generally inappropriate (albeit popular due to their simplicity). The problem with these approaches is they change the variances of the variables, and these variances are inputs into both the parameter estimates and the statistical inference for a regression model.
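A minimal sketch of multiple imputation for regression, using scikit-learn's IterativeImputer with sample_posterior=True to generate several completed data sets and then pooling the coefficients (the point-estimate part of Rubin's rules). The data is hypothetical, and a production analysis would typically use a dedicated package such as R's mice.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y depends on x1 and x2; x2 is Missing At Random
# because its missingness depends only on the fully observed x1.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)
x2[x1 > 0.5] = np.nan  # MAR: driven by the observed x1

data = np.column_stack([x1, x2, y])

# Multiple imputation: create several completed data sets by drawing
# from the imputation model's posterior, fit the regression to each,
# and average the coefficients across imputations.
coefs = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(data)
    model = LinearRegression().fit(completed[:, :2], completed[:, 2])
    coefs.append(model.coef_)

print(np.mean(coefs, axis=0))  # pooled estimates, approximately (2, 3)
```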
As with principal components analysis, if regression is conducted using an option such as SPSS's Exclude cases pairwise, which essentially works by computing correlations between all the variables based on all the available data, this involves assuming that the data is Missing Completely At Random, a much stronger and less plausible assumption than is involved when all the observations containing any missing values are deleted. It is important to appreciate that, except when randomization explains the missing values, the use of the Exclude cases pairwise option is extremely difficult to justify, and it is impossible to justify in situations where the missing data is caused by skips in the questionnaire or Don't Know options.
When the missing data is Nonignorable, the simplest solution for regression is the same as for latent class analysis: treat the missing values as additional categories. Where the data is numeric, this can be achieved by creating additional variables. For example, if you have a predictor with values of 1, 2, NaN, 1, 3, NaN, you can replace the missing values with 0 and include a separate dummy variable in the regression to model the missing data. That is, the one variable is replaced by two in the analysis:
x (missing replaced by 0) | missing dummy |
---|---|
1 | 0 |
2 | 0 |
0 | 1 |
1 | 0 |
3 | 0 |
0 | 1 |
This is the best approach in situations where data is missing because people have not experienced a particular aspect. For example, if you have ratings of satisfaction with various aspects of a bank, and a person has no data on branch service because they have not been to a branch, the missing values are Nonignorable and it is appropriate to use this dummy variable approach.[4]
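The transformation in the table above takes only a few lines. The sketch below (Python; the outcome values are hypothetical, added only so a model can be fitted) builds the zero-filled predictor and the missing-data dummy and regresses on both.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The predictor from the example above, with NaN marking missing values.
x = np.array([1, 2, np.nan, 1, 3, np.nan])
y = np.array([5, 7, 4, 5, 9, 4])  # hypothetical outcome

# Replace the missing values with 0 and add a dummy marking where
# the data was missing, exactly as in the table above.
x_filled = np.where(np.isnan(x), 0, x)
missing = np.isnan(x).astype(float)
X = np.column_stack([x_filled, missing])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```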
Notes
- ↑ Little, R. J. A. and D. B. Rubin (1987), Statistical Analysis with Missing Data. Brisbane: John Wiley & Sons.
- ↑ Little, R. J. A. and D. B. Rubin (1987), Statistical Analysis with Missing Data. Brisbane: John Wiley & Sons.
- ↑ Manski, Charles F. (1995), Identification Problems in the Social Sciences. Harvard University Press.
- ↑ Allison, Paul D. (2001), Missing Data (Quantitative Applications in the Social Sciences). SAGE Publications.