Delete Poor-Quality Cases – The Data Story Guide

Sometimes data quality problems are so severe that they cast doubt on the veracity of an entire case. For example, a respondent may complete a questionnaire in one minute when it is not possible to even read the questions in that time. The remedy is to delete such cases from the data so that none of their data is used in any analyses.

Data can be deleted from either the:

Data file.
Data set.

It is advisable to create filters to use when deleting, rather than manually deleting cases.

Deleting cases from the data file

Where cases are known to be of poor quality, they can be deleted from the data file itself.

Although this is a common approach, it is not always a good one, because:

It makes it impossible to review the data cleaning. Once a case has been deleted from the file, it can no longer be inspected. A workaround is to keep multiple versions of the data file.
If the data file is updated in the future, somebody will need to remember to delete the case again and, if they forget, future analyses will be invalid.

Deleting cases from the data set

The data set is how the data is represented in the data analysis software after the file has been imported.

If using Excel or SPSS, this distinction does not exist (i.e., the data file and the data set are one and the same thing).

In each of R, Q, and Displayr, you can delete cases from the data set once the data file has been imported, and this leaves the data file unchanged.
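The distinction between the data file and the data set can be illustrated with a minimal sketch in plain Python (not R, Q, or Displayr). The file contents, the column names `id` and `duration_seconds`, and the helper functions are all hypothetical and exist only for illustration. The sketch imports a file into an in-memory data set, deletes a case from the data set, and then checks that the file on disk is unchanged.

```python
import csv
import os
import tempfile

# Hypothetical survey data file: one row per respondent.
FILE_CONTENTS = "id,duration_seconds,q1\n101,412,4\n102,58,3\n103,530,5\n"


def load_data_set(path):
    """Import the data file into an in-memory data set (a list of dicts)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def delete_cases(data_set, ids_to_delete):
    """Return a new data set without the listed IDs; the file is never touched."""
    ids = set(ids_to_delete)
    return [row for row in data_set if row["id"] not in ids]


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "survey.csv")
    with open(path, "w", newline="") as handle:
        handle.write(FILE_CONTENTS)

    data_set = load_data_set(path)            # the data set (in memory)
    cleaned = delete_cases(data_set, ["102"])  # delete from the data set only

    # Re-read the raw file to confirm that the deletion did not alter it.
    with open(path, newline="") as handle:
        file_after = handle.read()

remaining_ids = [row["id"] for row in cleaned]
file_unchanged = file_after == FILE_CONTENTS
```

Because the deletion happens only in memory, the original file remains available for reviewing the cleaning later.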

Q and Displayr additionally:

Allow you to undelete cases.
Allow you to import a revised data file and automatically re-delete any previously deleted cases (based on the ID variable).
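The idea of re-deleting cases by ID can be sketched in plain Python as follows. This is not how Q or Displayr implement the feature; it is only an illustration under the assumption that deletions are recorded in a log of IDs, kept separately from the data. The function name `reapply_deletions`, the variable name `id`, and the example IDs are hypothetical.

```python
def reapply_deletions(revised_rows, deleted_ids, id_variable="id"):
    """Split a freshly imported data set into kept rows and re-deleted IDs."""
    deleted = set(deleted_ids)
    kept = [row for row in revised_rows if row[id_variable] not in deleted]
    redeleted = [
        row[id_variable] for row in revised_rows if row[id_variable] in deleted
    ]
    return kept, redeleted


# Deletion log built during the first round of cleaning (hypothetical IDs).
deletion_log = {102, 107}

# Revised data file after import: case 104 is new, case 107 is no longer present.
revised_rows = [
    {"id": 101, "q1": 4},
    {"id": 102, "q1": 3},
    {"id": 103, "q1": 5},
    {"id": 104, "q1": 2},
]

kept, redeleted = reapply_deletions(revised_rows, deletion_log)
kept_ids = [row["id"] for row in kept]

# Undeleting a case only requires removing its ID from the log,
# because the imported rows themselves were never modified.
deletion_log.discard(102)
restored, _ = reapply_deletions(revised_rows, deletion_log)
restored_ids = [row["id"] for row in restored]
```

Keeping the log separate from the data is the design choice that makes both undeleting and re-deleting possible without anyone having to remember which cases were removed.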

It is advisable to create filters to use when deleting, rather than manually deleting cases.

Most software allows you to manually delete cases from the data editor. However, it's better to first create filters indicating poor quality data, and then use these filters to select data prior to deleting it, as:

The most efficient way to delete respondents is usually to do so via filters.
The process of creating filters effectively documents the data deletion process, so that you can both understand what has been done and reproduce it in the future (e.g., if updating with a revised data file).
Filter variables are particularly useful when automating data checking and cleaning.
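As a rough illustration of filter-based deletion, the plain-Python sketch below defines named filters, stores the result of each as a filter variable on every case, and then deletes the flagged cases. The 60-second threshold echoes the one-minute example given earlier but is an assumption, as are the `no_answers` filter, the column names, and the example data.

```python
# Assumed cut-off for "too fast to have read the questions"; set per study.
MIN_SECONDS = 60

# Each named filter returns True for a poor-quality case (hypothetical rules).
FILTERS = {
    "speeder": lambda row: row["duration_seconds"] < MIN_SECONDS,
    "no_answers": lambda row: all(row[q] is None for q in ("q1", "q2")),
}


def add_filter_variables(rows):
    """Attach one boolean filter variable per rule, plus an overall flag."""
    flagged = []
    for row in rows:
        flags = {name: test(row) for name, test in FILTERS.items()}
        flagged.append({**row, **flags, "poor_quality": any(flags.values())})
    return flagged


rows = [
    {"id": 101, "duration_seconds": 412, "q1": 4, "q2": 2},
    {"id": 102, "duration_seconds": 58, "q1": 3, "q2": 3},
    {"id": 103, "duration_seconds": 530, "q1": None, "q2": None},
    {"id": 104, "duration_seconds": 388, "q1": 5, "q2": 1},
]

flagged = add_filter_variables(rows)

# The counts per filter document why cases were removed.
counts = {name: sum(row[name] for row in flagged) for name in FILTERS}

# Delete via the filter rather than by picking cases manually.
cleaned = [row for row in flagged if not row["poor_quality"]]
cleaned_ids = [row["id"] for row in cleaned]
```

Rerunning the same filters on a revised data file reproduces the deletions, and the per-filter counts serve as a simple record of the cleaning.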