Sometimes there is a need to revise a data file. It's essential that the metadata of files remains consistent over time. The two key aspects to this are ensuring that:
- Categories are added to variables and are not removed or replaced.
- New variables are added to data files, rather than the contents of existing variables being replaced over time.
Situations where data files need to be revised
The most common situations where data files need to be revised are:
- When newer data is required. For example, sales analyses need to be refreshed to show the latest data.
- When an initial data file was provided prior to data collection being completed, and a final file is required upon the completion of data collection.
- When conducting longitudinal studies (i.e., tracking studies), where the goal is to understand how results change over time (e.g., surveys tracking voting intentions).
The paramount need for consistency of metadata
When data is updated/revised, efficient data analysis requires that the new data file has metadata that is consistent with the earlier data files. For example, if sex was stored in a variable called Var004, it's important that in the new data file, sex remains in Var004.
The reason is that almost all analysis of updated or revised data files relies on automation, and that automation breaks when the metadata changes. For example, a visualization in a report may be based only on females, with the filtering achieved by an expression like Var004 == 2. If Var004 no longer represents sex, the wrong result will be shown, and there's a good chance that the analyst will fail to spot it.
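One way to make such a filter fail loudly rather than silently is to check the metadata before applying it. The following is a minimal sketch in Python, assuming the metadata is available as a dictionary of variable labels. The names `labels`, `records`, and `filter_females` are hypothetical, and the coding of 1 for males is an assumption (the text above only states that 2 represents females).

```python
# Hypothetical metadata: variable name -> description of what it stores.
labels = {"Var004": "Sex"}

# Hypothetical data: one dict per respondent.
# Assumed coding: 1 = Male, 2 = Female.
records = [
    {"Var004": 1},
    {"Var004": 2},
    {"Var004": 2},
]

def filter_females(records, labels):
    """Return records where Var004 == 2, but only if Var004 still stores sex."""
    if labels.get("Var004") != "Sex":
        raise ValueError("Var004 no longer stores sex; refusing to filter")
    return [r for r in records if r["Var004"] == 2]

females = filter_females(records, labels)
print(len(females))  # → 2
```

If a revised file arrived with Var004 relabeled, this sketch would raise an error instead of producing a plausible-looking but wrong chart.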
Add categories, do not change or remove categories
Consider a nominal variable that records which model of iPhone somebody purchased. When Apple initially launched the iPhone, it had two models, the iPhone 4GB and 8GB. Assuming the data is stored as a numeric variable with labels, the initial values and labels would be:
iPhone 4GB 1
iPhone 8GB 2
Apple then discontinued the 4GB iPhone and introduced a 16GB iPhone. Then, it discontinued these phones and introduced the iPhone 3G 8GB and 16GB. The correct way to set up the data to deal with these changes is shown below. That is, even if nobody in the data has purchased the iPhone 4GB, the value of 1 should not be reused for any newer model.
iPhone 4GB 1
iPhone 8GB 2
iPhone 16GB 3
iPhone 3G 8GB 4
iPhone 3G 16GB 5
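The rule can be checked mechanically when a revised file arrives. The sketch below, in Python, assumes each version of the variable's value labels is held as a dictionary mapping value to label. The function `category_problems` and the example dictionaries are hypothetical illustrations, not part of any particular tool.

```python
def category_problems(old, new):
    """List values from the old labels that were removed or relabeled in the new ones."""
    problems = []
    for value, label in old.items():
        if value not in new:
            problems.append(f"value {value} ({label}) was removed")
        elif new[value] != label:
            problems.append(f"value {value} changed from {label!r} to {new[value]!r}")
    return problems

# Value labels at launch.
launch = {1: "iPhone 4GB", 2: "iPhone 8GB"}

# Correct revision: the new model gets a new value.
extended = {1: "iPhone 4GB", 2: "iPhone 8GB", 3: "iPhone 16GB"}

# Incorrect revision: value 1 is reused for a newer model.
reused = {1: "iPhone 16GB", 2: "iPhone 8GB"}

print(category_problems(launch, extended))  # → []
print(category_problems(launch, reused))
```

The second call reports that value 1 changed its label, which is exactly the kind of reuse to avoid. Added values are deliberately not reported, since adding categories is allowed.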
Add variables, do not change variables
The same principle applies to variables. If you have a variable that stores customer satisfaction with the iPhone 4GB, this variable should never be reused to store satisfaction with any other model, even long after the iPhone 4GB has disappeared.
Depending on the context, it may be appropriate to exclude old variables when creating new data files.
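A comparison of the variables in two versions of a file can separate acceptable differences (variables added or dropped) from the one to avoid (an existing variable whose meaning has changed). This is a minimal sketch in Python, assuming each file's variables are described by a dictionary mapping variable name to label. The function `compare_variables` and the variable names `Var010` and `Var011` are hypothetical.

```python
def compare_variables(old, new):
    """Classify differences between two files' variables as changed, dropped, or added."""
    changed = sorted(name for name in old if name in new and new[name] != old[name])
    dropped = sorted(name for name in old if name not in new)
    added = sorted(name for name in new if name not in old)
    return {"changed": changed, "dropped": dropped, "added": added}

old_vars = {"Var004": "Sex", "Var010": "Satisfaction: iPhone 4GB"}

# The old satisfaction variable is excluded and a new one is added,
# rather than Var010 being reused for the newer model.
new_vars = {"Var004": "Sex", "Var011": "Satisfaction: iPhone 16GB"}

result = compare_variables(old_vars, new_vars)
print(result)
```

In this sketch only a non-empty `changed` list signals a problem; `dropped` and `added` are reported so that the analyst can confirm they are intentional.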