Sometimes data for a case in a data set can be poor enough to justify further investigation, exclusion, or deletion of the case from the analysis. There are a variety of mechanisms for assessing case quality, including:
- Not meeting correct screening criteria
- Missing data issues
- Inconsistent data
- Implausibility (e.g., failed "lie tests")
- Duplicate cases
- Model-based outlier detection
Investigation, exclusion, or deletion
When a case is identified as potentially problematic, a more detailed investigation should be undertaken to understand the causes. If it is clear that the data is of poor quality, the solutions are to:
- Recode the offending data as missing.
- Filter the case out from the data set.
- Delete the case from the data set.
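The three remedies can be sketched in pandas. This is a minimal illustration with made-up data; the column names and the `age > 120` implausibility rule are purely hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: respondent 102 reported an implausible age.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [34, 222, 41],
    "q1": [3, 5, 2],
})

# Option 1: recode the offending value as missing.
recoded = df.copy()
recoded.loc[recoded["age"] > 120, "age"] = np.nan

# Option 2: filter the case out (it stays in the file but is excluded from analysis).
filtered = df[df["age"] <= 120]

# Option 3: delete the case from the data set.
deleted = df.drop(df.index[df["age"] > 120])
```

Filtering is usually preferable to deletion, as it keeps an audit trail of what was excluded.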
Not meeting correct screening criteria
Sometimes the screening criteria for a survey are incorrectly programmed, so that respondents who should have been screened out end up completing the survey. In other cases, it becomes clear after fieldwork that additional screening questions were required. The main remedy for this problem is to delete the data for respondents who did not meet all the screening criteria.
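Applying screening criteria after the fact can be sketched as follows. The specific criteria here (aged 18+ and drinks cola) are hypothetical examples:

```python
import pandas as pd

# Hypothetical data: the survey should only have admitted people aged 18+
# who drink cola, but a programming error let others through.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25, 16, 40, 33],
    "drinks_cola": [True, True, False, True],
})

meets_criteria = (df["age"] >= 18) & df["drinks_cola"]
cleaned = df[meets_criteria]           # keep only valid respondents
removed_ids = df.loc[~meets_criteria, "id"]  # record what was deleted
```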
Missing data issues
Cases in a data file may have:
- Too many missing values (see Use Heatmaps to Explore Missing Values by Case).
- Missing values on a key variable. For example:
- Many surveys weight data based on age and gender. If there is no data for age or gender, and the data needs to be weighted, then such cases may need to be deleted.
- If a case is missing data for a variable central to the analysis (e.g., the outcome variable in a predictive model), it often makes sense to delete the case from the data set.
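Both checks can be sketched in pandas. The 50% missing-data threshold below is an arbitrary assumption, and the column names are hypothetical:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, np.nan, 29, 51],
    "gender": ["F", "M", np.nan, "F"],
    "q1": [np.nan, 4, np.nan, 5],
    "q2": [np.nan, 3, np.nan, 2],
})

# Proportion of missing values per case (excluding the ID).
missing_share = df.drop(columns="id").isna().mean(axis=1)

# Flag cases with too many missing values (the 50% cutoff is a judgment call).
too_sparse = df[missing_share > 0.5]

# Delete cases missing a key weighting variable (age or gender).
weightable = df.dropna(subset=["age", "gender"])
```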
Inconsistent data
Sometimes a few variables contain inconsistent data. For example, at one question a respondent may say they drink Coke, but at another question say they never drink any type of cola.
While the natural instinct is to delete data from people who provide inconsistent answers, some caution is required. There is always a bit of error in questionnaire responses. For example, the difference between Strongly Agree and Agree can be quite hard for a respondent to judge, so many will change their answer if asked the same question twice.
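A hard contradiction like the Coke example can be flagged programmatically. This sketch assumes two hypothetical binary variables, one from a brand question and one from a category question:

```python
import pandas as pd

# Hypothetical columns: drinks_coke (brand question) vs. drinks_cola (category question).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "drinks_coke": [True, False, True],
    "drinks_cola": [True, True, False],
})

# Claiming to drink Coke while never drinking any cola is a contradiction.
inconsistent = df["drinks_coke"] & ~df["drinks_cola"]
flagged = df[inconsistent]
```

Flagging for investigation, rather than deleting outright, is consistent with the caution described above.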
Implausibility (e.g., failed "lie tests")
Where there is strong evidence that a case contains impossible data, it may be appropriate to delete the case from the data set. Sometimes studies are designed to make this process easy.
Lie tests are questions in questionnaires that are intended to catch out people providing inaccurate or dishonest answers. For example:
- Psychologists ask people how strongly they agree with the statement "I never lie" and conclude that anybody who says "Strongly agree" is a liar.
- Market research studies ask which brands people know about, and include some fake brands.
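The fake-brand test can be implemented directly. The brand names below, including the fake brand "Zorvex", are invented for illustration:

```python
import pandas as pd

# Hypothetical brand-awareness grid; "Zorvex" is a fake brand planted as a lie test.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "aware_coke": [True, True, True],
    "aware_pepsi": [True, False, True],
    "aware_zorvex": [False, False, True],  # nobody can genuinely know this brand
})

# Anyone claiming awareness of the fake brand is flagged for investigation.
suspect_ids = df.loc[df["aware_zorvex"], "id"].tolist()
```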
Speeding occurs when a person has answered the questions in a questionnaire, or a part of it, in an implausibly short time. Two approaches to working out how short is too short are:
- Timing yourself skim reading and answering the questions.
- Creating a histogram of the time taken to complete the questionnaire and seeing if there is a "bump" of speeders. See: Use Histograms to Understand Numeric Variables.
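The two approaches above can be sketched together. The completion times and the 120-second cutoff are hypothetical (in practice the cutoff might come from timing yourself skim-reading the questionnaire):

```python
import pandas as pd

# Hypothetical completion times in seconds.
times = pd.Series([95, 110, 480, 510, 530, 560, 600, 615], name="seconds")

cutoff = 120  # assumed minimum plausible completion time
speeders = times[times < cutoff]

# Binned counts stand in for a histogram; a separate "bump" below the
# cutoff suggests a group of speeders.
counts = pd.cut(times, bins=[0, 120, 300, 600, 900]).value_counts().sort_index()
```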
Flatlining, also known as flat-lining and straight-lining, occurs when a respondent consistently chooses the same response in a grid question (e.g., chooses all the middle options). It is usually interpreted as a sign that the respondent is not providing data of sufficient quality.
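Flat-lining in a grid question can be detected by counting distinct answers per case. The grid columns below are hypothetical:

```python
import pandas as pd

# Hypothetical 5-point grid question with four statements (g1-g4).
grid = pd.DataFrame({
    "id": [1, 2, 3],
    "g1": [3, 1, 4],
    "g2": [3, 5, 2],
    "g3": [3, 2, 5],
    "g4": [3, 4, 1],
})

# A respondent giving the same answer to every statement is flat-lining.
answers = grid[["g1", "g2", "g3", "g4"]]
flatliners = grid.loc[answers.nunique(axis=1) == 1, "id"].tolist()
```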
Duplicate cases
The appearance of duplicate cases in a data set is a particularly serious problem. If the same person is counted two or more times, the results of a study can become highly misleading.
Sometimes duplicates can be detected by just examining the ID variable for duplicates (if a data set does not contain an ID variable, this is also an indication of a quality problem). In other instances, sets of variables need to be jointly analyzed (e.g., first name + family name + phone number) to identify duplicated cases.
When dealing with duplicates, it is important to understand why they have come into existence before trying to resolve the issue. For example, duplicate entries sometimes arise in surveys because people make multiple attempts at completing the survey. In this case, their first attempt should be treated as representative.
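Both detection strategies, and the keep-the-first-attempt resolution, can be sketched in pandas. The identifying fields and timestamps are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "first_name": ["Ann", "Bob", "Bob", "Cal"],
    "phone": ["555-01", "555-02", "555-02", "555-03"],
    "attempt_time": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
})

# Detect duplicates on the ID variable alone...
dup_ids = df[df["id"].duplicated(keep=False)]

# ...or on a combination of identifying fields.
dup_combo = df[df.duplicated(subset=["first_name", "phone"], keep=False)]

# Resolve by keeping each respondent's first attempt.
deduped = df.sort_values("attempt_time").drop_duplicates(subset="id", keep="first")
```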
Model-based outlier detection
Model-based outlier detection involves fitting a model of some kind to the data (e.g., a normal distribution, or a linear regression), and identifying cases that are in some way problematic (e.g., inconsistent with the rest of the data). See Model-Based Outlier Detection.
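A minimal sketch of the normal-distribution variant: fit a mean and standard deviation to the data, then flag cases far from the mean. The data and the 2-standard-deviation threshold are assumptions for illustration; the threshold is a judgment call, not a rule:

```python
import numpy as np

# Hypothetical measurements; the last value is clearly unlike the rest.
values = np.array([5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 12.0])

# "Fit" a normal distribution via its mean and standard deviation,
# then compute each case's z-score.
z = (values - values.mean()) / values.std()

# Flag cases more than 2 standard deviations from the mean.
outlier_indices = np.where(np.abs(z) > 2)[0]
```

Note that an extreme outlier inflates the standard deviation it is measured against, which is one reason more robust model-based methods are often preferred.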