How to Clean and Tidy Data – The Data Story Guide

Once data has been checked, it needs to be cleaned and tidied. Usually, this is done in an iterative workflow, with data being cleaned and tidied as soon as problems are spotted.

Cleaning is also known as data cleaning or scrubbing.

Data cleaning

Data cleaning involves fixing all the problems identified during the data checking process. The tools available for cleaning are:

Delegation. For example, ask whoever provided the data to fix the problems.
If the data is of the wrong shape, it should be reshaped.
If errors exist in the data values, they should be fixed by either:
- Editing values in the data editor.
- Recoding.
Labels should be corrected.
Any poor-quality cases should be deleted.
If the results are inconsistent with known facts about the data, then the data should be weighted.
If there is missing data, and it is potentially problematic, either:
- Use imputation to replace the missing values.
- Analyze the data using techniques designed for missing data.
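
Several of the cleaning steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library. The records, the field names (gender, age, seconds_taken), the 60-second quality threshold, and the 50/50 population targets are all hypothetical. Mean imputation and simple post-stratification weighting stand in here for the imputation and weighting steps; other methods are often more appropriate.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records; None marks a missing value.
raw = [
    {"id": 1, "gender": "M", "age": 34, "seconds_taken": 410},
    {"id": 2, "gender": "male", "age": None, "seconds_taken": 385},
    {"id": 3, "gender": "F", "age": 29, "seconds_taken": 12},
    {"id": 4, "gender": "female", "age": 51, "seconds_taken": 502},
]

# Recoding table: map inconsistent codes onto one label per category.
GENDER_RECODE = {"M": "Male", "male": "Male", "F": "Female", "female": "Female"}


def recode(rows):
    """Fix erroneous values by recoding them to consistent labels."""
    return [{**r, "gender": GENDER_RECODE.get(r["gender"], r["gender"])} for r in rows]


def drop_poor_quality(rows, min_seconds=60):
    """Delete cases completed implausibly fast (the threshold is an assumption)."""
    return [r for r in rows if r["seconds_taken"] >= min_seconds]


def impute_mean(rows, field):
    """Replace missing values of `field` with the mean of the observed values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def poststratify(rows, field, targets):
    """Add a weight so the weighted shares of `field` match known proportions."""
    n = len(rows)
    counts = Counter(r[field] for r in rows)
    return [{**r, "weight": targets[r[field]] * n / counts[r[field]]} for r in rows]


cleaned = impute_mean(drop_poor_quality(recode(raw)), "age")

# Assumed known fact: the population is half male, half female.
weighted = poststratify(cleaned, "gender", {"Male": 0.5, "Female": 0.5})

for row in weighted:
    print(row)
```

In this sketch the case with id 3 is deleted as poor quality, the missing age is replaced by the mean of the remaining observed ages, and the weights sum to the number of remaining cases.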

Data tidying

Tidying the data makes it more amenable to analysis. It involves the following:

Uninteresting categories should be removed so that the data is rebased.
Small categories should be merged.
Any filters that are likely to be required should be created.
New variables that will aid in the analysis should be created.
- This typically involves creating new variables, from most common to least common:
- Creating new variables involves transforming the data. See:
  - Transforming Numeric Variables
  - Transforming Categorical Variables
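
The tidying steps listed above can be illustrated with a short Python sketch. It is not a definitive implementation. The brand and age data, the "Don't know" category treated as uninteresting, the minimum count of 2 for a category to stand alone, and the age bands are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical responses: preferred brand and respondent age.
rows = [
    {"brand": "A", "age": 23}, {"brand": "B", "age": 37},
    {"brand": "A", "age": 45}, {"brand": "Don't know", "age": 31},
    {"brand": "C", "age": 19}, {"brand": "A", "age": 62},
    {"brand": "B", "age": 28}, {"brand": "D", "age": 54},
    {"brand": "Don't know", "age": 40}, {"brand": "A", "age": 33},
]


def rebase(rows, field, exclude):
    """Remove uninteresting categories so percentages use a smaller base."""
    return [r for r in rows if r[field] not in exclude]


def merge_small(rows, field, min_count, other="Other"):
    """Merge categories with fewer than `min_count` cases into one category."""
    counts = Counter(r[field] for r in rows)
    return [{**r, field: r[field] if counts[r[field]] >= min_count else other}
            for r in rows]


def age_band(age):
    """Transform a numeric variable into a categorical one (bands are assumed)."""
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    return "55+"


def percentages(rows, field):
    """Share of cases in each category, as percentages of the current base."""
    counts = Counter(r[field] for r in rows)
    return {category: 100 * n / len(rows) for category, n in counts.items()}


tidy = merge_small(rebase(rows, "brand", {"Don't know"}), "brand", min_count=2)

# New variable (age_band) and a reusable filter (under_35) for later analysis.
tidy = [{**r, "age_band": age_band(r["age"]), "under_35": r["age"] < 35}
        for r in tidy]

shares = percentages(tidy, "brand")
print(shares)
```

Removing "Don't know" changes the base from 10 cases to 8, so brand A's share rises from 40% to 50%. This is what rebasing means in practice.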