Heatmaps showing missing values by case can provide insight into problems in a data set.
There are two steps:
- Create the heatmap
- Identify patterns in the heatmap
Create the heatmap
It is usually a good idea to:
- Fix any issues identified in the earlier data cleaning stages (e.g., set poor quality data to missing values).
- Only select variables that show survey responses (i.e., don’t include administrative variables such as sample source or completion status). Where the file contains a very large amount of variables, only choosing a subset of variables can be important.
- Use a heatmap visualization that allows you to zoom in and display observation numbers so that the specific problems can be identified, investigated, and remedied.
- Filter the visualization to remove patterns that you have already identified.
The resulting visualization is always extremely messy, but don’t be too concerned about that. The messiness is a product of the sheer amount of information rather than a formatting problem.
Patterns that can occur include:
- Vertical columns. These represent variables affected by routing and filtering and, provided you have followed the more basic data cleaning steps described above, these can be ignored.
- Short horizontal lines. These represent multiple variables in a row with missing values. If you look hard, you will see hundreds of such problems in this example. Some are highlighted on the left.
- Long horizontal lines. These can be due to routing and filtering but can also indicate the presence of observations (cases) with severe missing data problems. In the visualization above we can see quite a bit of this.
- Horizontal lines that continue to the end (right) of the data set. These are symptomatic of incomplete data. If these vary in length, it suggests that your data set hasn't been filtered to only include complete cases, and usually this can be corrected by deleting these cases.
- Clustering of horizontal lines. There are three likely causes of such clustering. First, coincidence is always a possibility. Second, they can indicate some fundamental problems in the data collection process. For example, the clustering highlighted on the right side of the visualization above is caused by one interviewer doing a particularly bad job. Third, these can be caused by intentional aspects of the research design (e.g., perhaps the study had a quota for some of its questions, and people were not asked those questions once the quota was met).