Storing data correctly is very important because it determines both the speed and the quality of data analysis. There are two distinct phases in ensuring that data is stored correctly:
- Ensuring that the data is stored correctly in the data file.
- Ensuring that the data is stored correctly in the data set.
The data file is the original file that is created/provided containing the data. The data set is how the data appears once it has been read into the software that is used to analyze the data.
Why it is important that data is correctly stored
Modern data analysis software automates many aspects of:
- Data checking
- Data cleaning
- Data analysis
- Data visualization
- Statistical testing
It can only do this correctly if the data is stored correctly. If the data is stored incorrectly, it will be checked, cleaned, analyzed, visualized, and tested incorrectly. Consequently, when data is stored incorrectly in a data file, substantial work is created for the person who needs to analyze the data: they must either clean and tidy the file so that the data is stored correctly, or manually do things that could otherwise be automated.
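As a small illustration of the point above, the sketch below shows how the same values behave when stored as text rather than as numbers. The quantities are made up for the example, and plain Python stands in for the analysis software.

```python
# Hypothetical quantities exported as text (a storage mistake) and the same
# values stored as numbers.
quantities_as_text = ["10", "9", "100", "2"]
quantities_as_numbers = [int(q) for q in quantities_as_text]

# Text is ordered character by character, so "100" sorts before "2".
sorted_text = sorted(quantities_as_text)        # ['10', '100', '2', '9']
sorted_numbers = sorted(quantities_as_numbers)  # [2, 9, 10, 100]

# The "largest" text value is the one starting with the highest character.
largest_text = max(quantities_as_text)        # '9'
largest_number = max(quantities_as_numbers)   # 100
```

Automated sorting, summarizing, and charting all inherit this kind of error when the stored type is wrong.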
Ensuring that the data is stored correctly in the data file
Variables can be categorized according to their type (e.g., nominal, ordinal, interval). There are two stages to setting up data properly according to its type. How a variable should be stored in a data file is determined by:
- The file format:
  - Flat Data File (e.g., Excel, CSV)
  - SPSS Data File (.SAV)
- The data type
The table below describes how each type of data should be stored. However, some mistakes are more common and problematic than others, and the biggest mistakes that occur are:
- Storing single response data in multiple binary variables. For example, if the survey asks people their sex, with options of Male, Female, Non-Binary, and Refused, there should be a single Nominal variable, rather than four binary variables.
- Exporting numeric data (e.g., quantity purchased) as text.
- Inconsistent and ambiguous formats for dates. The gold standard is to store dates as dates in the data file. If this is not possible (e.g., if using a CSV file, which doesn't support dates), it's important that the dates are stored in an unambiguous format: YYYY-MM-DD (e.g., 2018-11-29). If you instead go with MM-DD-YYYY or DD-MM-YYYY, you are potentially creating a world of pain, as in many situations these formats will confuse a machine (e.g., is 01-08-2018 the eighth of January or the first of August?).
- Verbatim responses to open-ended questions and “other specify” options should be stored as text variables if the data is text and as numeric variables if the data is numeric.
- Where open-ended questions have been coded, these are then included in the data file as if they are standard single or multiple response variables (in particular, the binary format is appropriate for multiple response questions). An additional string variable should store the raw responses.
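The first three mistakes in the list above can be illustrated with a minimal Python sketch. The rows, column names, and the helper `to_single_nominal` are hypothetical and exist only for this example; real data files will differ.

```python
from datetime import date, datetime

# Hypothetical rows in which a single response question (sex) was wrongly
# exported as four binary columns.
rows = [
    {"Male": 1, "Female": 0, "Non-Binary": 0, "Refused": 0},
    {"Male": 0, "Female": 0, "Non-Binary": 1, "Refused": 0},
]
categories = ["Male", "Female", "Non-Binary", "Refused"]


def to_single_nominal(row, categories):
    """Return the one selected category; fail if the row is not single response."""
    selected = [c for c in categories if row[c] == 1]
    if len(selected) != 1:
        raise ValueError(f"expected exactly one selected category, got {selected}")
    return selected[0]


# One nominal variable instead of four binary variables.
sex = [to_single_nominal(row, categories) for row in rows]

# Numeric data exported as text has to be converted before analysis.
quantity_purchased = [int(text) for text in ["3", "12", "0"]]

# The same text is two different dates depending on the assumed format.
ambiguous = "01-08-2018"
as_month_first = datetime.strptime(ambiguous, "%m-%d-%Y").date()  # 8 January 2018
as_day_first = datetime.strptime(ambiguous, "%d-%m-%Y").date()    # 1 August 2018

# YYYY-MM-DD has only one reading.
unambiguous = date.fromisoformat("2018-11-29")
```

Each of these repairs is work that the analyst has to do by hand when the data file is not stored correctly in the first place.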
The more detailed set of rules is:
| Data type | Flat data file | SPSS Data File (.SAV) |
| --- | --- | --- |
| Binary Data | Values stored as: | Variable Type: Numeric; Measure: Nominal; Values stored as: |
| Nominal Data | Values stored as: | Variable Type: Numeric; Measure: Nominal; Values stored as: |
| Nominal-Ordinal Data | If the nominal categories will likely be set to missing (e.g., Don't know), store as with Nominal Data. Otherwise, store as with Ordinal Data. | If the nominal categories will likely be set to missing (e.g., Don't know), store as with Nominal Data. Otherwise, store as with Ordinal Data. |
| Ordinal Data | Values stored as: | Variable Type: Numeric; Measure: Ordinal; Values stored as: |
| Numeric Data | Values stored as: | Variable Type: Numeric; Measure: Scale; Values stored as: |
| Text Data | Values stored as: | Variable Type: String; Values stored as: |
| Currency Data | Values stored as: | Variable Type: Dollar or Custom Currency; Measure: Scale |
| Date/Time Data | The date, written in a consistent format (e.g., 2020-12-31) | Variable Type: Date. See also Date/Time Data |
Ensuring that the data is stored correctly in the data set
When a data file is read into data analysis software, additional decisions are made about how to store the data. These decisions are first made automatically by the software. The better the data file is created (see the previous section of this article and the related articles), the more likely the data will be automatically set up correctly. If not automatically set up correctly, it can usually be manually corrected in the data analysis software.
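The sketch below illustrates the idea of correcting types after a file has been read. It uses Python's standard `csv` module rather than any of the packages covered in this article, and the CSV content, column names, and `converters` mapping are assumptions made for the example.

```python
import csv
import io
from datetime import date

# Hypothetical CSV content; in practice this would be read from a file.
csv_text = """respondent,quantity,interview_date,comment
1,3,2020-12-31,Fast delivery
2,12,2021-01-15,
"""

# csv.DictReader reads every value as text, so no type is set automatically.
raw_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Manual correction: declare the intended type of each column.
converters = {
    "respondent": int,
    "quantity": int,
    "interview_date": date.fromisoformat,
    "comment": str,
}
data_set = [
    {column: converters[column](value) for column, value in row.items()}
    for row in raw_rows
]

# With the types corrected, numeric operations behave as expected.
total_quantity = sum(row["quantity"] for row in data_set)
```

Because the dates in the file use YYYY-MM-DD, they convert without any guessing about day and month order.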
| Data type | R (assuming the data is stored in a data frame) | SPSS Statistics | Displayr | Q |
| --- | --- | --- | --- | --- |
| Binary Data | Type: logical | Variable Type: Numeric; Measure: Nominal | Structure: Nominal or Binary - Multi | Variable Type: Categorical and/or Question Type: Pick Any |
| Nominal Data | Class: factor | Variable Type: Numeric; Measure: Nominal | Structure: Nominal | Question Type: Pick One |
| Nominal-Ordinal Data, where the nominal categories will be set to missing (e.g., Don't know) | Class: ordered | Variable Type: Numeric; Measure: Ordinal | Structure: Ordinal | Variable Type: Ordered Categorical; Question Type: Pick One |
| Other Nominal-Ordinal Data | Class: factor | Variable Type: Numeric; Measure: Nominal | Structure: Nominal | Question Type: Pick One |
| Ordinal Data | Class: ordered | Variable Type: Numeric; Measure: Ordinal | Structure: Ordinal | Variable Type: Ordered Categorical; Question Type: Pick One |
| Numeric Data | Type: numeric | Variable Type: Numeric; Measure: Scale | Structure: Numeric | Question Type: Number |
| Numeric Data (integer) | Type: integer | Variable Type: Numeric; Measure: Scale | Structure: Numeric | Question Type: Number |
| Text Data | Type: character | Variable Type: String | Structure: Text | Question Type: Text |
| Currency Data | Type: numeric | Variable Type: Dollar or Custom Currency; Measure: Scale | Structure: Numeric | Variable Type: Money; Question Type: Number |
| Date/Time Data | The date, written in a consistent format (e.g., 2020-12-31) | Variable Type: Date. See also Date/Time Data | | |