A data file should contain a variable that uniquely identifies each case (typically, each respondent). That is, each case should have a value that differs from those of all other cases, even if the same person has provided multiple cases of data. If respondents do provide multiple cases, a respondent identifier should be included as an additional variable. The primary role of the ID variable is to assist in managing data quality issues (e.g., if a particular response to a survey is considered to be of dubious quality, knowing its value on the ID variable makes it easier to ensure that the respondent is deleted from future data exports).
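The checks described above can be sketched in a few lines of Python. This is only an illustration under assumed names: the fields `case_id` and `respondent_id`, the sample rows, and the helper functions `duplicate_ids` and `drop_respondent` are all hypothetical and not part of any particular tool.

```python
from collections import Counter

# Hypothetical survey data: one dict per case. "case_id" plays the role of the
# ID variable; "respondent_id" is the additional variable needed when one
# respondent provides several cases.
cases = [
    {"case_id": 1, "respondent_id": 101, "rating": 4},
    {"case_id": 2, "respondent_id": 102, "rating": 5},
    {"case_id": 3, "respondent_id": 101, "rating": 2},
]


def duplicate_ids(rows, id_field):
    """Return the values of id_field that occur more than once, sorted."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def drop_respondent(rows, respondent_id):
    """Return only the rows that do not belong to the given respondent."""
    return [row for row in rows if row["respondent_id"] != respondent_id]


# A valid ID variable has no duplicates; a respondent identifier may repeat.
case_id_duplicates = duplicate_ids(cases, "case_id")
respondent_duplicates = duplicate_ids(cases, "respondent_id")

# Remove every case from a respondent whose data is considered dubious.
cleaned = drop_respondent(cases, 101)
```

Here the case identifier is unique across all rows, while the respondent identifier repeats for the person who supplied two cases, which is what allows all of that person's data to be removed in one step.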
Considerations when creating the ID variable, which often conflict, are that:
- It should be a relatively short number. Values often need to be used in code (e.g., when creating filters), and very long IDs containing a mixture of letters and numbers can make this difficult.
- It should not be personal information. That is, it is generally inadvisable to use, for example, an email address, as the resulting data file may then contain personal information, which brings various international laws about privacy and data protection into play.
- It should be able to be used as a key to link to other data. For example, a customer number can be a good ID, as it makes it easy to check data and append additional data.
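One possible way to balance these considerations is sketched below in Python: each case gets a short sequential number as its ID, the email address is kept out of the data file, and a customer number is used as the key for appending other data. All names here (`raw`, `customers`, `build_data_file`, the `segment` field) are hypothetical, and whether a customer number itself counts as personal information depends on the context.

```python
# Hypothetical raw responses that arrive with an email address attached.
raw = [
    {"email": "a@example.com", "customer_number": 5001, "rating": 4},
    {"email": "b@example.com", "customer_number": 5002, "rating": 5},
]

# Hypothetical customer data from another source, keyed by customer number.
customers = {5001: {"segment": "retail"}, 5002: {"segment": "business"}}


def build_data_file(rows, extra):
    """Assign short numeric IDs, drop the email, and append linked fields."""
    data_file, id_lookup = [], {}
    for case_id, row in enumerate(rows, start=1):
        # The lookup is meant to be stored separately from the data file.
        id_lookup[case_id] = row["email"]
        # Fields from the other source, matched on the customer number.
        linked = extra.get(row["customer_number"], {})
        record = {k: v for k, v in row.items() if k != "email"}
        data_file.append({"id": case_id, **record, **linked})
    return data_file, id_lookup


data_file, id_lookup = build_data_file(raw, customers)
```

The short `id` values are easy to type into a filter, and the separate lookup still lets a dubious case be traced back to its respondent without the email address appearing in the exported data.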
Also known as
- Analysis subject identifier.
- Unique key.
- Unique identifier.
- Identification variable.