Checking and Understanding Missing Data

When a variable has no value recorded for one or more cases, this is referred to as missing data. When checking a data set, it is important to look for missing data and to understand why it is missing, because the interpretation of the data depends fundamentally on the reason. Missing data falls into the following categories, each of which has different implications:

Structurally missing data
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (nonignorable missing data)

A worked example is presented at the end of the article.

How to check for missing data

When creating summary tables, most software will automatically indicate if there is any missing data. The table below shows age data. The table's footer indicates a sample size of 574, the total sample of 575, and that there is 1 missing. This means that:

The data file itself contains 575 rows of data.
There is only age data for 574 of these rows.
One of the rows is missing the age data. That is, there is some missing data.
The percentages are calculated using only the complete data. For example, 15% for 18 to 24 years is interpreted as 15% of the 574 cases (respondents) without missing data.
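The footer arithmetic described above can be reproduced in a few lines. The following is a minimal sketch rather than how any particular package computes its tables: the function name summarize and the five-row ages list are hypothetical, and None is assumed to mark a missing value.

```python
from collections import Counter

def summarize(values):
    """Return total rows, valid rows, missing rows, and percentages
    computed on the valid (non-missing) rows only."""
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    # Percentages are re-based on the valid rows, as in the table footer.
    percentages = (
        {k: 100 * n / len(valid) for k, n in counts.items()} if valid else {}
    )
    return {
        "total": len(values),
        "valid": len(valid),
        "missing": len(values) - len(valid),
        "percentages": percentages,
    }

# Hypothetical data: five rows, one of which has no age recorded.
ages = ["18 to 24", "25 to 34", None, "25 to 34", "35 to 44"]
summary = summarize(ages)
print(summary)
```

Reporting the total, valid, and missing counts alongside the percentages makes the silent re-basing visible to the reader.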

When data is missing, most survey analysis software will automatically filter any analyses to exclude the cases with missing data. For example, the number of missing respondents in the table below is 154. Note that the table is automatically filtered, and there is no warning. When analyzing a survey, you need to actively look for missing data, as it can indicate serious data integrity issues.

When looking at raw data, missing data, also known as missing values, are usually presented as one of the following:

A blank cell.
NA, which usually means Not Available (and sometimes Not Applicable).
NaN, which stands for Not a Number.
A period (full stop: .).
A nonsensical number such as -9998 or 99.
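Because the same data file can mix several of these conventions, it often helps to convert them all to a single missing-value marker before analysis. The sketch below is one possible approach in Python. The function name to_missing and the two sets of codes are assumptions for illustration; the real codes should come from the data dictionary, and a code such as 99 should only be treated as missing for variables where it cannot be a genuine value.

```python
import math

# Assumed codes, taken from the list above. Check the data dictionary
# for the codes actually used in a given file.
MISSING_STRINGS = {"", "NA", "NaN", "."}
MISSING_NUMBERS = {-9998, 99}

def to_missing(value):
    """Return None if value looks like a missing-data code,
    otherwise return the value unchanged."""
    if value is None:
        return None
    if isinstance(value, str):
        return None if value.strip() in MISSING_STRINGS else value
    if isinstance(value, float) and math.isnan(value):
        return None
    if value in MISSING_NUMBERS:
        return None
    return value

raw = ["34", "", "NA", ".", float("nan"), -9998, 99, 27]
cleaned = [to_missing(v) for v in raw]
print(cleaned)
```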

In addition to automatically showing information about missing data, most software has additional options for displaying information. For example, the table below (discussed in more detail at the end of this article) shows missing data information in the footer caption. It also shows the missing data for each row of the table. Note that the NET has a particularly low sample size, which is often an indication of missing data problems of some kind (see Checking NETs are 100%).

Reasons for missing data

Data can be missing for many reasons.

For example, in a data set exported from a customer database, data may be missing due to the data not being collected or technical errors when constructing the data set file.

In a survey, there are many more possible reasons. For example:

Not everyone who should have completed the survey did so. This can be detected by looking at the total sample size for the survey. If you were expecting a survey of 1,000 respondents, but the sample size is 575, then 425 are missing.
Respondents may not have finished the questionnaire, leaving the questions towards the end unanswered, with missing data.
Some questions may not have been asked of all respondents. For example, if somebody says they are unemployed, they are not asked about their profession and, as a result, have missing data for that question.
A specific option may not have been shown to a respondent in a question (e.g., they may not have been asked to rate satisfaction with a product they had not used).
The respondent may have chosen not to answer the question. This will only occur when the data collection software is set to make questions optional. For this reason, it is usually a bad idea to make questions in a survey optional.
The data may have been recoded as missing data during data cleaning.
Respondents may not have been shown a specific option to a question, so they have missing data for the corresponding variable (this is illustrated in the article on data cleaning).
The collected data was considered uninformative. For example, a question may have asked for somebody's attitude and the respondent may have said "Don't know". Some researchers recode such data as missing values.
The database was corrupted.
The data was determined to be invalid during data cleaning and was recoded as missing values.

Interpreting data when some of the data is missing

To correctly interpret data based on a variable with missing data, we need to know the cause of the missing data.

Consider a variable that indicates whether or not people in a database have children, where the summary table shows:

20% have children
60% have no children
20% have missing data

What proportion of people do we believe have children? The answer depends on what assumptions we make about the missing data. For example:

Suppose we know that the missing data was due to a database administration error, which deleted the data from a random selection of 20% of the database. In that case, we can conclude that 25% of people have children (i.e., 20% / (20% + 60%)).
Suppose we know that data on children was only collected for adults and that the data was collected two months ago. In that case, we can conclude that only 20% of people have children, as (almost) all the 20% with missing data are children and likely have no children themselves.
Suppose we know that the data on children was only collected for adults, and the data was collected more than five years ago. In that case, we may conclude that a little over 20% of people have children, as some of the children will have become adults and had children since the data was collected.
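The first two scenarios differ only in what happens to the 20% with missing data: either it is dropped and the percentages are re-based, or it is counted as having no children. A short Python sketch of that arithmetic, using exact fractions (the variable names are illustrative only):

```python
from fractions import Fraction

has_children = Fraction(20, 100)
no_children = Fraction(60, 100)
missing = Fraction(20, 100)

# Scenario 1: data deleted at random, so drop the missing cases
# and re-base on the cases that remain.
random_deletion = has_children / (has_children + no_children)

# Scenario 2: the missing cases are children, who have no children
# themselves, so they are counted in the base as "no children".
missing_are_childless = has_children / (has_children + no_children + missing)

print(float(random_deletion))        # 0.25
print(float(missing_are_childless))  # 0.2
```

The third scenario cannot be computed from the table alone, because it depends on how many of the children have since become parents.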

Work out what type of missing data you have

There are four qualitatively distinct types of missing data. Missing data is either structurally missing, missing completely at random (MCAR), missing at random (MAR), or missing not at random (also known as nonignorable). Different types of missing data need to be treated differently for any analysis to be meaningful.

Structurally missing data

Structurally missing data is data that is missing for a logical reason. In other words, it is data that is missing because it should not exist. For example, in the table below, the first and third observations have missing values for Age of youngest child. This is because these respondents have no children. This situation is typically best addressed by excluding people with such missing data from any analysis of the variables with structurally missing values.

The How many colas did you drink in the past 24 hours column also has structurally missing values, for the respondents who said they did not drink cola. Here we can logically deduce that the correct value is 0, so the missing values can be replaced with 0 in the analysis.

ID	Children	Age of youngest child	Did you drink Coca-Cola in the last 24 hours?	How many colas did you drink in the past 24 hours?
1	No		No
2	Yes	18	Yes	2
3	No		No
4	Yes	13	No
5	Yes	8	Yes	1

Recoding and/or filtering should be used when dealing with structurally missing data.
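Both treatments can be illustrated with the five rows of the table above. This is a minimal Python sketch, not a prescribed procedure: the dictionary keys are shortened, hypothetical names for the table's columns, and None stands for a blank cell.

```python
rows = [
    {"id": 1, "children": "No",  "age_youngest": None, "drank_cola": "No",  "colas": None},
    {"id": 2, "children": "Yes", "age_youngest": 18,   "drank_cola": "Yes", "colas": 2},
    {"id": 3, "children": "No",  "age_youngest": None, "drank_cola": "No",  "colas": None},
    {"id": 4, "children": "Yes", "age_youngest": 13,   "drank_cola": "No",  "colas": None},
    {"id": 5, "children": "Yes", "age_youngest": 8,    "drank_cola": "Yes", "colas": 1},
]

# Filtering: analyze age of youngest child only among people with children.
ages = [r["age_youngest"] for r in rows if r["children"] == "Yes"]
mean_age = sum(ages) / len(ages)

# Recoding: someone who drank no cola drank 0 colas.
colas = [
    0 if r["drank_cola"] == "No" and r["colas"] is None else r["colas"]
    for r in rows
]
mean_colas = sum(colas) / len(colas)

print(mean_age)    # 13.0
print(mean_colas)  # 0.6
```

Without the recoding step, the average number of colas would be computed over the two cola drinkers only and would come out as 1.5 rather than 0.6.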

Worked example of structurally missing data

The table below shows aided awareness data (i.e., answers to "Have you heard of this brand before?"). The percentages in the table tell us that 15% of people have heard of the brand Optus. This is computed as the number of people who said they had heard of Optus (13) divided by the number of people who were shown the option (84).

The 15% aided awareness for Optus has been calculated under the assumption that the data is missing completely at random (MCAR), which is the default assumption in most data analysis software.

A total of 641 people have missing data for Optus. Why? In the study, some awareness information had been collected in an earlier question, so not all respondents were shown all of the brands. People were only asked about Optus if they had not already revealed that they had heard of it. Consequently, all 641 people with missing data are people who have heard of Optus. As a result, the correct aided awareness for Optus is (13 + 641) / (84 + 641) = 90%.

This example shows that the default way of analyzing missing data leads to a result of 15% when the correct answer is 90%.
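The two figures can be checked directly from the counts quoted above. The variable names in this small Python sketch are illustrative only:

```python
aware_when_asked = 13  # said they had heard of Optus in this question
asked = 84             # were shown Optus in this question
not_asked = 641        # not shown Optus because they were already aware of it

# Default analysis: missing cases are silently dropped from the base.
default_estimate = aware_when_asked / asked

# Corrected analysis: the people who were not asked are counted as aware.
corrected_estimate = (aware_when_asked + not_asked) / (asked + not_asked)

print(f"{default_estimate:.0%}")    # 15%
print(f"{corrected_estimate:.0%}")  # 90%
```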

Missing completely at random (MCAR)

Looking at the table below, we need to ask ourselves: what is the likely income of the fourth observation? The most straightforward approach is to note that 50% of the other respondents have high incomes and 50% have low incomes. Therefore, we could assume a 50% chance that observation four has a high income and a 50% chance that she has a low income. This is known as assuming that the missing value is missing completely at random (MCAR). When we make this assumption, we assume that whether or not an observation has missing data is entirely unrelated to the other information in the data.

ID	Gender	Age	Income
1	Male	Under 30	Low
2	Female	Under 30	Low
3	Female	30 or more	High
4	Female	30 or more
5	Female	30 or more	High

It is relatively easy to check the assumption that data is missing completely at random. If you can predict which units have missing data (e.g., using common sense, regression, or some other method), then the data is not MCAR. A more formal way of testing is to use Little’s MCAR test (Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198-1202.)
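An informal version of this check is to compare the rate of missing data across groups defined by another variable. The Python sketch below does this for the table above; the helper name missing_rate_by is hypothetical, and None stands for the blank cell. It is not Little's test, and with only five rows the result is purely illustrative: on a real data set, a clear difference in rates between groups would suggest the data is not MCAR.

```python
from collections import defaultdict

rows = [
    {"gender": "Male",   "age": "Under 30",   "income": "Low"},
    {"gender": "Female", "age": "Under 30",   "income": "Low"},
    {"gender": "Female", "age": "30 or more", "income": "High"},
    {"gender": "Female", "age": "30 or more", "income": None},
    {"gender": "Female", "age": "30 or more", "income": "High"},
]

def missing_rate_by(rows, group_key, target_key):
    """Share of rows with a missing target value, within each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[target_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}

rates = missing_rate_by(rows, "age", "income")
print(rates)
```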

When data is missing completely at random, it means that we can undertake analyses using only observations that have complete data (provided we have enough of such observations).

The MCAR assumption is rarely a good assumption. It is only likely to be true in situations where the data is missing due to some truly random phenomenon (e.g., if people were randomly asked 10 of 15 questions in a questionnaire).

Nevertheless, almost all data analysis algorithms and software assume that data is missing completely at random.

Missing at random (MAR)

In the case of missing completely at random, the assumption was that there was no pattern. An alternative assumption, known somewhat confusingly as missing at random (MAR; Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. Brisbane, John Wiley & Sons.) instead assumes that we can predict the value that is missing based on the other data.

We can use this assumption to return to the problem of working out the income of the fourth observation. A simple predictive model is that income can be predicted from gender and age. Looking at the table below, which is the same as the one above, we note that the missing value is for a female aged 30 or more, and the other females aged 30 or more have a High income. As a result, we can predict that the missing value should be High. Note that prediction does not need to be perfect. All that is required is a probabilistic relationship (i.e., that we have a better-than-random probability of predicting the true value of the missing data).

ID	Gender	Age	Income
1	Male	Under 30	Low
2	Female	Under 30	Low
3	Female	30 or more	High
4	Female	30 or more
5	Female	30 or more	High

When data is missing at random, we need to either impute the missing values or use an analysis method explicitly designed for missing at random data (see Analysis Methods That Automatically Address Missing Values).
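The prediction made for the fourth observation amounts to a very simple form of imputation: fill the missing value with the most common observed value among cases with the same gender and age. The Python sketch below implements just that rule. The function name impute_by_group_mode is hypothetical, and this is a deliberately crude stand-in for the model-based imputation methods used in practice, which also account for the uncertainty in the predicted values.

```python
from collections import Counter

rows = [
    {"id": 1, "gender": "Male",   "age": "Under 30",   "income": "Low"},
    {"id": 2, "gender": "Female", "age": "Under 30",   "income": "Low"},
    {"id": 3, "gender": "Female", "age": "30 or more", "income": "High"},
    {"id": 4, "gender": "Female", "age": "30 or more", "income": None},
    {"id": 5, "gender": "Female", "age": "30 or more", "income": "High"},
]

def impute_by_group_mode(rows, group_keys, target_key):
    """Fill missing target values with the most common observed value
    among rows that share the same values on group_keys. Rows whose
    group has no observed value are left as missing."""
    observed = {}
    for row in rows:
        if row[target_key] is not None:
            key = tuple(row[k] for k in group_keys)
            observed.setdefault(key, Counter())[row[target_key]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows are not modified
        key = tuple(row[k] for k in group_keys)
        if new_row[target_key] is None and key in observed:
            new_row[target_key] = observed[key].most_common(1)[0][0]
        filled.append(new_row)
    return filled

imputed = impute_by_group_mode(rows, ["gender", "age"], "income")
print(imputed[3]["income"])  # High
```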

Missing at random is always a safer assumption than missing completely at random. This is because MCAR is a special case of MAR: any analysis that is valid under the assumption that the data is missing at random is also valid when the data is missing completely at random. However, the opposite is not the case: an analysis that relies on MCAR, such as using only the complete observations, can be biased when the data is only MAR.

Missing not at random (nonignorable)

It may be the case that we cannot confidently make any conclusions about the likely value of missing data. For example, people with very low incomes and very high incomes may tend to refuse to answer the question. Or there could be some other reason we do not know. This is known as missing not at random data and also as nonignorable missing data.

It is common to include structurally missing data as a special case of data that is missing not at random. However, this misses an important distinction. Structurally missing data is easy to analyze, whereas other forms of missing not at random data are highly problematic.

When data is missing not at random, the standard methods for dealing with missing data (e.g., imputation or algorithms designed explicitly for missing values) cannot be relied on, because they assume that the missing values can be predicted from the observed data. If that assumption fails, standard calculations will generally give biased answers.

Consider the following study of homelessness (Manski, Charles F. (1995). Identification Problems in the Social Sciences. Harvard University Press.). Data were obtained from 31 women, of whom 14 were located six months later. Of these, three had exited from homelessness, so the estimated proportion exiting homelessness is 3/14 = 21%. As there is no data for the remaining 17 women who could not be located, it is possible that none, some, or all of them exited from homelessness. This means that the proportion of the sample who exited from homelessness could be anywhere between 3/31 = 10% and 20/31 = 65%. As a result, reporting 21% as the correct result is misleading.
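The bounds quoted above come from assuming, in turn, that none and then all of the women who could not be located exited from homelessness. A short Python sketch of that calculation (the function name exit_rate_bounds is hypothetical):

```python
def exit_rate_bounds(exited, located, total):
    """Return the naive estimate and the worst-case bounds on the
    proportion who exited, when outcomes are unknown for everyone
    who was not located."""
    not_located = total - located
    naive = exited / located                # ignores the women not located
    lower = exited / total                  # none of those not located exited
    upper = (exited + not_located) / total  # all of those not located exited
    return naive, lower, upper

naive, lower, upper = exit_rate_bounds(exited=3, located=14, total=31)
print(f"naive {naive:.0%}, bounds {lower:.0%} to {upper:.0%}")
# naive 21%, bounds 10% to 65%
```

The width of the interval reflects how little the data can say without an assumption about why the 17 women could not be located.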

In this example, the missing data is nonignorable. Treating it as missing at random would also be inappropriate. This is especially true given that the inability to contact the women is likely to be causally related to whether or not they have exited from homelessness. Thus, strategies designed for data that is missing at random, such as imputation, will not work.

For information on how to handle missing data in analyses, see the Dealing with Missing Values section.