A data set file is checked as follows:
- Choose a data checking, cleaning, and analysis workflow
- Check the number of cases in the data set
- Check that the file has been created properly
- Check screening, routing, and filtering
- Check every variable
- Look for patterns in missing data
- Assess case quality
Choose a data checking, cleaning, and analysis workflow
Data checking and the related tasks of cleaning and tidying are undertaken to increase the quality of insights and reduce the time required to find insights.
Data checking, cleaning, and analysis can be done iteratively, sequentially, or automatically.
Iterative workflow
Modern software (e.g., Displayr) can be used for the entire data analysis process, from checking through to reporting. This supports a more iterative workflow, where errors are identified during the analysis and reporting stages, and the data can be cleaned with all the analysis and reporting automatically updated.
Sequential workflow
The sequential workflow involves first doing a very thorough job checking and cleaning the data. The analysis commences only once the data is known to be clean.
The sequential workflow can be substantially less efficient than the iterative workflow, because it requires reviewing the data multiple times: when checking it, when performing analysis, when creating reporting, and when interpreting results. This takes time and can cause scheduling delays, whereas the iterative workflow checks the data at the same time as it is interpreted.
However, the cost of reviewing the data multiple times can be offset by efficiencies from increased specialization (e.g., a dedicated data checking and cleaning team, or the use of older, cheaper specialist software).
Automatic data checking and cleaning
The data checking and cleaning process can be automated. See Automated Data Checking and Cleaning.
Check the number of cases in the data set
Step one when checking a data set is to check that the number of cases in the data set is correct. If the result is different from what you expect, it usually indicates a serious data integrity issue, such as:
- Key groups of data have been omitted.
- Data that should have been omitted has been included (e.g., for a survey, data from incomplete interviews).
- The data has the wrong shape.
The easiest way to verify the size of the data is in the data editor, which shows the number of cases (rows); alternatively, scroll to the bottom and check the last row number. In Displayr, for example, the number of rows (e.g., 1010) is shown in the top-right corner of the data editor.
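The same check can be scripted. Below is a minimal sketch using pandas, assuming the data set has already been loaded into a DataFrame; the function and column names are illustrative, not part of any particular product.

```python
import pandas as pd

def check_case_count(df: pd.DataFrame, expected: int) -> None:
    """Raise if the number of cases (rows) differs from the expected count."""
    n_cases = len(df)
    if n_cases != expected:
        raise ValueError(
            f"Expected {expected} cases but found {n_cases}; check for "
            "omitted groups, included incompletes, or a wrongly shaped file."
        )

# Toy data set with 3 cases.
df = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})
check_case_count(df, expected=3)  # passes silently
```

Running the check immediately after loading the file catches shape problems before any analysis is built on top of them.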
Check that the file has been created properly
A data set file is like any other document, in that it can be created well or created poorly. A detailed list of requirements for a high-quality data file is listed in Characteristics of a Good Data File.
Where there are problems with the file, the solution is to either:
- Get the person who provided it to create a better file.
- Manually correct it.
Check screening, routing, and filtering
Screening, routing, and filtering instructions determine who answered what in a questionnaire. Although in theory these should be checked before data collection, it is good practice to check them again as part of the data checking and cleaning process. See: Checking Screening, Routing, and Filtering Instructions.
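A routing rule can be checked mechanically by comparing who should have answered a question with who actually did. The sketch below uses pandas with hypothetical column names (a `smoker` screener routing respondents to `q_brand`); it flags both cases that answered without being routed and cases that were routed but have no answer (the latter may still be legitimate, e.g., refusals).

```python
import numpy as np
import pandas as pd

def check_routing(df, route_col, routed_value, target_col):
    """Return cases that violate the routing rule."""
    should_answer = df[route_col] == routed_value
    answered = df[target_col].notna()
    # Violations: answered without being routed, or routed but missing.
    return df[(answered & ~should_answer) | (should_answer & ~answered)]

df = pd.DataFrame({
    "smoker": ["Yes", "No", "Yes"],
    "q_brand": ["Brand A", np.nan, np.nan],
})
violations = check_routing(df, "smoker", "Yes", "q_brand")
# The third case was routed to q_brand but has no answer.
```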
Check every variable
Summary tables are created for all the variables in the study. Each summary table is reviewed to:
- Check and explain missing data.
- Check that NETs are 100%.
- Check consistency with known data.
- Check that no cases have impossible values.
- Perform any checks specific to the variables being examined.
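Several of these checks can be scripted. The sketch below, assuming a pandas DataFrame with illustrative column names, tabulates each variable including missing values, so the total (the NET) always equals the number of cases:

```python
import numpy as np
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Build a count table per variable, including missing values."""
    tables = {}
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)
        # The NET (total of all categories plus missing) must equal
        # the number of cases; anything else indicates a tabulation error.
        assert counts.sum() == len(df)
        tables[col] = counts
    return tables

df = pd.DataFrame({
    "gender": ["F", "M", "F", np.nan],
    "age_band": ["18-34", "35-54", "18-34", "55+"],
})
tables = summarize(df)
```

Each resulting table can then be eyeballed for impossible values and checked against known benchmarks.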
Look for patterns in missing data
Two ways of looking for such patterns are heatmaps showing missing values by case, and tables showing the frequency of each distinct missing data pattern.
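Both views can be derived from a missing-value indicator matrix. A minimal pandas sketch (toy data, hypothetical question names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "q1": [1, np.nan, 3, np.nan],
    "q2": [np.nan, np.nan, 1, np.nan],
    "q3": [1, 2, 3, 4],
})

# Case-by-variable indicator matrix: True where a value is missing.
# Plotting this matrix (e.g., with seaborn.heatmap) gives the heatmap view.
missing = df.isna()

# Frequency of each distinct missingness pattern across cases.
patterns = missing.value_counts()
```

Here the most common pattern (missing on both q1 and q2) occurs twice, which is the kind of regularity worth investigating.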
Assess case quality
The most serious issue with data is when cases (respondents) are believed to be unacceptably poor, as often this requires the deletion of cases. See: How to Assess Case Quality.
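Two widely used case-quality signals are speeding (an interview duration well below the median) and straight-lining (identical answers across a grid of rating questions). The sketch below flags both, with illustrative column names and a hypothetical 50%-of-median speeding threshold:

```python
import pandas as pd

def flag_low_quality(df, duration_col, grid_cols):
    """Flag cases that sped through the survey or straight-lined a grid."""
    speeding = df[duration_col] < 0.5 * df[duration_col].median()
    straight = df[grid_cols].nunique(axis=1) == 1  # all grid answers identical
    return speeding | straight

df = pd.DataFrame({
    "duration_min": [12, 3, 10, 11],
    "r1": [5, 4, 3, 3],
    "r2": [5, 2, 3, 4],
    "r3": [5, 5, 3, 2],
})
flags = flag_low_quality(df, "duration_min", ["r1", "r2", "r3"])
```

Flagged cases should be reviewed individually before deletion, as a single metric is rarely conclusive on its own.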