Automated Data Checking and Cleaning – The Data Story Guide

Automated data checking and cleaning can save considerable time on projects. Automated data checking provides alerts that highlight particular problems with the data. Automated cleaning, on the other hand, fixes the problems rather than just raising an alert.

Saving time with automated data checking and cleaning

Setting up automated data checking and cleaning takes some time, so it is only worthwhile where the time saved later outweighs the setup effort. Common situations where it is appropriate:

Tracking studies, where data is collected over time and there is a need to regularly check the data.
Cookie-cutter studies, where a very similar questionnaire is reused with some minor tweaks. For example, concept test surveys are often identical except for the product descriptions, and employee engagement studies often vary little between projects.
Where companies have standardized banks of questions and standardized cleaning instructions.
Studies with short analysis time frames, where it is possible to set up the automated checking and cleaning prior to completion of data collection.

Automated alerts

An alert is a notification that something is broken. The key to an alert is that it is hard to miss. Two common mechanisms are emails and errors displayed where they cannot be overlooked. For example, an error in a calculation in Displayr appears as an orange/red exclamation mark on the page itself, and in the folder in which the page is located.

The common types of alerts are:

Out-of-range errors indicate situations where values appear in a data file that are bigger or smaller than the expected range of values (e.g., the values of 3 and 5 for q1).
Flow errors, where the values in one variable are inconsistent with what we would expect based on other data (these are also known as skip errors, filtering errors, and piping errors). For example, people without employment have an occupation recorded, or people with employment have none.
Variable non-response, where observations that should contain data do not contain data (e.g., a missing id). This is also known as item non-response.
Checking if data has been coded.
Variable consistency errors, such as a variable that indicates the number of people in a household having a smaller value than another variable that indicates the number of children in that household.
Lie tests, where data has been collected in such a way as to permit the identification of people providing deliberately incorrect data. Common ways of doing this include asking questions where you know the answer (e.g., anybody who says No when asked “Do you ever lie?” is typically defined as a liar), or comparing alternative ways of asking the same question (e.g., comparing age derived from a birth year with claimed age).
Sum constraint errors, where we specify equality or inequalities regarding the sum of variables. For example:
  - Variables measuring percentages adding up to 100%.
  - Variables measuring time spent doing activities adding up to less than 24 hours in a day.
Checking that old results don’t change.
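
Several of the checks listed above can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not the approach of any particular tool. The records, the field names (id, q1, employed, occupation, household, children, shares), and the assumed valid codes for q1 are all hypothetical.

```python
# Hypothetical survey records. All field names and values are illustrative.
records = [
    {"id": 1, "q1": 1, "employed": True, "occupation": "Nurse",
     "household": 4, "children": 2, "shares": [60, 40]},
    {"id": 2, "q1": 5, "employed": False, "occupation": "Teacher",
     "household": 2, "children": 3, "shares": [50, 30]},
    {"id": None, "q1": 2, "employed": True, "occupation": None,
     "household": 1, "children": 0, "shares": [100]},
]

VALID_Q1 = {1, 2}  # assumed set of valid codes for q1


def check_records(records):
    """Return a list of (row_index, message) alerts for the given records."""
    alerts = []
    for row, r in enumerate(records):
        # Variable non-response: id should never be missing.
        if r["id"] is None:
            alerts.append((row, "non-response: id is missing"))
        # Out-of-range: q1 must be one of the expected codes.
        if r["q1"] not in VALID_Q1:
            alerts.append((row, "out of range: q1"))
        # Flow: an occupation should be present only for employed people.
        if r["employed"] != (r["occupation"] is not None):
            alerts.append((row, "flow: employment and occupation disagree"))
        # Consistency: a household cannot have more children than people.
        if r["household"] < r["children"]:
            alerts.append((row, "consistency: children exceed household"))
        # Sum constraint: percentage shares must add up to 100.
        if sum(r["shares"]) != 100:
            alerts.append((row, "sum constraint: shares do not total 100"))
    return alerts


alerts = check_records(records)
for row, message in alerts:
    print(f"row {row}: {message}")
```

To make such alerts hard to miss, the list returned by check_records could be emailed or used to stop a scheduled job with an error whenever it is not empty.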

Automated cleaning

Automated cleaning occurs when data problems are found and automatically fixed. How easy or hard this process is depends largely on how modern your software is. If using, say, SPSS, you can write code (syntax) to do the cleaning, but you must manually run the code. In more modern tools, such as Q and Displayr, the data cleaning code will automatically run if the input data changes.

Alternatively, scripts can be written to perform standard cleaning operations.
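
As a rough illustration of such a script, the sketch below applies two cleaning rules to each record in plain Python. The field names, the valid codes for q1, and the rules themselves (set out-of-range values to missing, remove occupations from people without employment) are assumptions made for the example. The right fix for any given problem is a project decision.

```python
def clean_record(record, valid_q1=(1, 2)):
    """Return a cleaned copy of one record; the input is left unchanged."""
    cleaned = dict(record)
    # Assumed rule: out-of-range q1 values are set to missing.
    if cleaned.get("q1") not in valid_q1:
        cleaned["q1"] = None
    # Assumed rule: people without employment should have no occupation.
    if not cleaned.get("employed"):
        cleaned["occupation"] = None
    return cleaned


def clean_all(records):
    """Apply the cleaning rules to every record."""
    return [clean_record(r) for r in records]


# Hypothetical raw data: the second record breaks both rules.
raw = [
    {"id": 1, "q1": 1, "employed": True, "occupation": "Nurse"},
    {"id": 2, "q1": 5, "employed": False, "occupation": "Teacher"},
]

cleaned = clean_all(raw)
for before, after in zip(raw, cleaned):
    print(before, "->", after)
```

Returning cleaned copies rather than editing the records in place keeps the raw data available, so the effect of each rule can be audited.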