A data set file is checked as follows:
- Choose a data checking, cleaning, and analysis workflow
- Check the number of cases in the data set
- Check that the file has been created properly
- Check screening, routing, and filtering
- Check every variable
- Look for patterns in missing data
- Assess case quality
Choose a data checking, cleaning, and analysis workflow
Data checking and the related tasks of cleaning and tidying are undertaken to increase the quality of insights and reduce the time required to find insights.
Data checking, cleaning, and analysis can be done iteratively, sequentially, or automatically.
Iterative workflow
Modern software (e.g., Displayr) can be used for the entire data analysis process, from checking through to reporting. This supports a more iterative workflow, where errors are identified during the analysis and reporting stages, and the data can be cleaned with all the analysis and reporting automatically updated.
Sequential workflow
The sequential workflow involves first doing a very thorough job checking and cleaning the data. The analysis commences only once the data is known to be clean.
The sequential workflow can be substantially less efficient than the iterative workflow, because it requires reviewing the data multiple times: when checking it, when performing analysis, when creating reporting, and when interpreting results. This takes time and can cause scheduling delays, whereas the iterative workflow checks the data at the same time as it is interpreted.
However, the cost of reviewing the data multiple times can be offset by efficiencies from increased specialization (e.g., a dedicated data checking and cleaning team, or the use of older, cheaper specialist software).
Automatic data checking and cleaning
The data checking and cleaning process can be automated. See Automated Data Checking and Cleaning.
Check the number of cases in the data set
Step one when checking a data set is to check that the number of cases in the data set is correct. If the result is different from what you expect, it usually indicates a serious data integrity issue, such as:
- Key groups of data have been omitted.
- Data that should have been omitted has been included (e.g., for a survey, data from incomplete interviews).
- The data has the wrong shape.
The easiest way to verify the size of the data is in the data editor, which shows the number of cases (rows); alternatively, scroll to the bottom and check the last row number. In Displayr, for example, the number of rows (e.g., 1010) is shown in the top-right corner of the data editor.
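The same check can be scripted. Below is a minimal sketch using pandas, assuming the data set has already been loaded into a DataFrame; the function and column names are illustrative, not part of any particular product.

```python
import pandas as pd

def check_case_count(df: pd.DataFrame, expected: int) -> None:
    """Raise if the number of cases (rows) differs from the expected count."""
    n_cases = len(df)
    if n_cases != expected:
        raise ValueError(
            f"Expected {expected} cases but found {n_cases}; check for "
            "omitted groups, included incompletes, or a wrongly shaped file."
        )

# Toy data set with 3 cases.
df = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})
check_case_count(df, expected=3)  # passes silently
```

Running the check immediately after loading the file catches shape problems before any analysis is built on top of them.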
Check that the file has been created properly
A data set file is like any other document, in that it can be created well or created poorly. A detailed list of requirements for a high-quality data file is listed in Characteristics of a Good Data File.
Where there are problems with the file, the solution is to either:
- Get the person who provided it to create a better file.
- Manually correct it.
Check screening, routing, and filtering
Screening, routing, and filtering instructions determine who answered what in a questionnaire. Although in theory these should be checked before data collection, it is good practice to check them again as part of the data checking and cleaning process. See: Checking Screening, Routing, and Filtering Instructions.
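A routing rule can be checked mechanically by comparing who should have answered a question with who actually did. The sketch below uses pandas with hypothetical column names (a `smoker` screener routing respondents to `q_brand`); it flags both cases that answered without being routed and cases that were routed but have no answer (the latter may still be legitimate, e.g., refusals).

```python
import numpy as np
import pandas as pd

def check_routing(df, route_col, routed_value, target_col):
    """Return cases that violate the routing rule."""
    should_answer = df[route_col] == routed_value
    answered = df[target_col].notna()
    # Violations: answered without being routed, or routed but missing.
    return df[(answered & ~should_answer) | (should_answer & ~answered)]

df = pd.DataFrame({
    "smoker": ["Yes", "No", "Yes"],
    "q_brand": ["Brand A", np.nan, np.nan],
})
violations = check_routing(df, "smoker", "Yes", "q_brand")
# The third case was routed to q_brand but has no answer.
```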
Check every variable
Summary tables are created for all the variables in the study. Each summary table is reviewed to:
- Check and explain missing data.
- Check that NETs are 100%.
- Check consistency with known data.
- Check that no cases have impossible values.
- Perform any checks specific to the variables being examined.
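Several of these checks can be scripted. The sketch below, assuming a pandas DataFrame with illustrative column names, tabulates each variable including missing values, so the total (the NET) always equals the number of cases:

```python
import numpy as np
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Build a count table per variable, including missing values."""
    tables = {}
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)
        # The NET (total of all categories plus missing) must equal
        # the number of cases; anything else indicates a tabulation error.
        assert counts.sum() == len(df)
        tables[col] = counts
    return tables

df = pd.DataFrame({
    "gender": ["F", "M", "F", np.nan],
    "age_band": ["18-34", "35-54", "18-34", "55+"],
})
tables = summarize(df)
```

Each resulting table can then be eyeballed for impossible values and checked against known benchmarks.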
Look for patterns in missing data
Two ways of looking for such patterns are heatmaps showing missing values by case, and tables showing the frequency of each distinct missing data pattern.
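Both views can be derived from a missing-value indicator matrix. A minimal pandas sketch (toy data, hypothetical question names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "q1": [1, np.nan, 3, np.nan],
    "q2": [np.nan, np.nan, 1, np.nan],
    "q3": [1, 2, 3, 4],
})

# Case-by-variable indicator matrix: True where a value is missing.
# Plotting this matrix (e.g., with seaborn.heatmap) gives the heatmap view.
missing = df.isna()

# Frequency of each distinct missingness pattern across cases.
patterns = missing.value_counts()
```

Here the most common pattern (missing on both q1 and q2) occurs twice, which is the kind of regularity worth investigating.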
Assess case quality
The most serious issue with data is when cases (respondents) are believed to be unacceptably poor, as often this requires the deletion of cases. See: How to Assess Case Quality.
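Two widely used case-quality signals are speeding (an interview duration well below the median) and straight-lining (identical answers across a grid of rating questions). The sketch below flags both, with illustrative column names and a hypothetical 50%-of-median speeding threshold:

```python
import pandas as pd

def flag_low_quality(df, duration_col, grid_cols):
    """Flag cases that sped through the survey or straight-lined a grid."""
    speeding = df[duration_col] < 0.5 * df[duration_col].median()
    straight = df[grid_cols].nunique(axis=1) == 1  # all grid answers identical
    return speeding | straight

df = pd.DataFrame({
    "duration_min": [12, 3, 10, 11],
    "r1": [5, 4, 3, 3],
    "r2": [5, 2, 3, 4],
    "r3": [5, 5, 3, 2],
})
flags = flag_low_quality(df, "duration_min", ["r1", "r2", "r3"])
```

Flagged cases should be reviewed individually before deletion, as a single metric is rarely conclusive on its own.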