Variable Labels are Short, Unique, Informative, and Indicate Variable Set Structure – The Data Story Guide

Variable labels should be short and should clearly communicate the underlying structure of the data.

Short

Variable labels appear in most reporting, so it's ideal to have short, clear descriptions. For example, Q4. Age is better than Q4. Which of the following age groups do you fall into?

Unique

Variable labels such as How important is this on a scale of 1 to 10, provided for each of a set of variables, are of no use as it is impossible to determine what is being rated without referring to the questionnaire. A better variable label is Importance: Price.

Where practical, the variable labels should correspond to the actual wording used in the questions, provided it is not too lengthy (see the previous point). Many programs that write data files automatically truncate variable labels to 120 characters, which can cause automatically generated labels from looped questions to be uninformative (e.g., the first 120 characters may not include all of the information about the loop).
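Both problems described above can be checked mechanically before a data file is shared. The following is a minimal sketch, assuming the labels are available as a Python dictionary mapping variable names to labels. The variable names (Q5_1 and so on) and the helper functions are hypothetical, and the limit of 120 is simply the figure mentioned above, so it may need adjusting for the software in use.

```python
from collections import Counter

# Assumed limit, taken from the 120-character figure mentioned above.
TRUNCATION_LIMIT = 120

def duplicated_labels(labels):
    """Return labels that appear more than once, in first-seen order."""
    counts = Counter(labels.values())
    seen = []
    for label in labels.values():
        if counts[label] > 1 and label not in seen:
            seen.append(label)
    return seen

def possibly_truncated(labels, limit=TRUNCATION_LIMIT):
    """Return variable names whose label length reaches the limit."""
    return [name for name, label in labels.items() if len(label) >= limit]

# Hypothetical variable names; the labels come from the examples above.
labels = {
    "Q5_1": "How important is this on a scale of 1 to 10",
    "Q5_2": "How important is this on a scale of 1 to 10",
    "Q5_3": "Importance: Price",
}

print(duplicated_labels(labels))
print(possibly_truncated(labels))
```

A label that reaches the limit is not necessarily truncated, so the second check only flags candidates for manual review.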

Informative

Quick analysis requires that you can look at a table and know what it means, without having to refer to some other documentation. Consequently, it is extremely useful to have informative labels for all variables. A variable label of VAR045 is much less useful than Reasons for buying Coca-Cola.

Any strange text, such as HTML tags, should be removed.
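As an illustration of removing such text, the sketch below strips anything that looks like an HTML tag, decodes HTML entities, and collapses repeated whitespace. It uses only the Python standard library. The function name clean_label is hypothetical, and the simple pattern assumes that labels do not contain a literal < followed later by >, which it would wrongly treat as a tag.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <b>, </b>, <br/>.
TAG_PATTERN = re.compile(r"<[^>]+>")

def clean_label(label):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", label)
    decoded = html.unescape(without_tags)
    # split() with no arguments also treats non-breaking spaces as whitespace.
    return " ".join(decoded.split())

print(clean_label("<b>Reasons for buying</b>&nbsp;Coca-Cola<br/>"))
```

Tags are replaced with a space rather than removed outright so that words separated only by a tag such as a line break are not joined together.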

Variable sets with multiple variables (e.g., multiple response questions)

If there are four variables that indicate which of four products a person owns, it is useful if the labels share a common structure, with the shared text at the beginning of each label, as this makes it easy for both people and computers to recognize variable sets. For example:

Products owned: Savings account
Products owned: Checking account
Products owned: Loan account
Products owned: Credit account
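To show why a shared beginning helps a computer, here is a minimal sketch that groups variables by the text before the first colon. The variable names (Q7_1 and so on) and the function group_by_prefix are hypothetical, and the sketch assumes that a colon followed by a space separates the shared text from the specific option. It is not a description of how any particular analysis program works.

```python
from collections import defaultdict

def group_by_prefix(labels, separator=": "):
    """Group variable names by the text before the first separator."""
    groups = defaultdict(list)
    for name, label in labels.items():
        prefix, found, _ = label.partition(separator)
        if found:  # labels without the separator are not part of a set
            groups[prefix].append(name)
    return dict(groups)

# Hypothetical variable names; the labels come from the example above.
labels = {
    "Q7_1": "Products owned: Savings account",
    "Q7_2": "Products owned: Checking account",
    "Q7_3": "Products owned: Loan account",
    "Q7_4": "Products owned: Credit account",
    "Q4": "Age",
}

print(group_by_prefix(labels))
```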

Grids

Grid questions should contain labels that describe both the specific option being evaluated and also some common aspects of the wording. For example, the following labels are poor:

Live a long life
Be rich
Have lots of friends

whereas these are much better:

How strongly do you agree that it is important to... Live a long life
How strongly do you agree that it is important to... Be rich
How strongly do you agree that it is important to... Have lots of friends

Care needs to be taken with the creation of labels for looped questions and some grid questions. Consider a study containing the following three questions:

Q1a	When you think of soft drinks that are sexy, which ones come to mind?
	MULTIPLE RESPONSE
	Coke
	Pepsi
	Fanta
	Other

Q1b	When you think of soft drinks that are masculine, which ones come to mind?
	MULTIPLE RESPONSE
	Coke
	Pepsi
	Fanta
	Other

Q1c	When you think of soft drinks that are powerful, which ones come to mind?
	MULTIPLE RESPONSE
	Coke
	Pepsi
	Fanta
	Other

If the variable labels set up for such questions follow identical structures, this will make the use of the file considerably more straightforward. Some programs, such as Displayr and Q, will automatically detect the structure in the data and present it as a grid. For example, the following labels make the interpretation of the grid straightforward.

Variable Name		Variable Label
Q1a1			Q42. Brand attitude Sexy brands: Coke
Q1a2			Q42. Brand attitude Sexy brands: Pepsi
Q1a3			Q42. Brand attitude Sexy brands: Fanta
Q1a4			Q42. Brand attitude Sexy brands: Other
Q1b1			Q42. Brand attitude Masculine brands: Coke
Q1b2			Q42. Brand attitude Masculine brands: Pepsi
Q1b3			Q42. Brand attitude Masculine brands: Fanta
Q1b4			Q42. Brand attitude Masculine brands: Other
Q1c1			Q42. Brand attitude Powerful brands: Coke
Q1c2			Q42. Brand attitude Powerful brands: Pepsi
Q1c3			Q42. Brand attitude Powerful brands: Fanta
Q1c4			Q42. Brand attitude Powerful brands: Other
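The value of the identical structure can be seen in a small sketch that recovers the grid from the labels alone. It assumes that the text before the last colon identifies the row of the grid and the text after it identifies the column. The functions parse_grid and is_consistent_grid are hypothetical helpers written for this illustration, not the detection logic used by Displayr, Q, or any other program.

```python
def parse_grid(labels, separator=": "):
    """Map each shared stem (row) to its ordered list of options (columns)."""
    grid = {}
    for label in labels:
        stem, _, option = label.rpartition(separator)
        grid.setdefault(stem, []).append(option)
    return grid

def is_consistent_grid(grid):
    """True when every row lists the same options in the same order."""
    rows = list(grid.values())
    return all(row == rows[0] for row in rows)

# Rebuild the twelve labels from the table above.
attributes = ["Sexy", "Masculine", "Powerful"]
brands = ["Coke", "Pepsi", "Fanta", "Other"]
labels = [
    f"Q42. Brand attitude {attribute} brands: {brand}"
    for attribute in attributes
    for brand in brands
]

grid = parse_grid(labels)
for stem, options in grid.items():
    print(stem, options)
print(is_consistent_grid(grid))
```

With these labels the sketch finds three rows, each with the same four brands in the same order, which is the property that makes the grid easy to interpret.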

Common problems with the setup of grid questions

For example, the following labels contain inconsistencies that prevent automatic detection of the underlying structure:

Q1a1	Q42. Brand attitude Sexy brands: Coke
Q1a2	Q42. Brand attitude Sexy brands: Pepsi
Q1a3	Q42. Brand attitude Sexy brands: Fanta
Q1a4	Q42. Brand attitude Sexy brand: Other
Q1b1	Q42. Brand attitude - Masculine brands: Coke
Q1b2	Q42. Brand attitude - Masculine brands:  Pepsi
Q1b3	Q42. Brand attitude - Masculine brands: Fanta
Q1b4	Q42. Brand attitude - Masculine brands: Other
Q1c1	Brand attitude - Powerful brands: Coke
Q1c2	Brand attitude - Powerful brands: Pepsi
Q1c3	Brand attitude - Powerful brands: Fanta
Q1c4	Brand attitude - Powerful brands: Others

Common problems with the setup of grid questions include:

- Any of the problems with multiple response questions discussed earlier in the article.
- The Label field has been set up with contradictory or inconsistent information. Two common causes of this are:
  - Typographical errors. While these may seem like minor issues, they prevent data analysis programs from automatically identifying the looped structures in the data. In the example above:
    - An additional space precedes Pepsi for Q1b2.
    - The s is missing from brands in Q1a4.
    - An s has been added to Other in Q1c4, giving Others.
    - Q42. is absent from the labels for Q1c.
  - Truncation of the Label field by the software used to create the data file. For example, the label may read Which of the following brands do you typically consume on a hot day? with the specific brands not listed, so there is no way to deduce the correct labeling of the rows and/or columns of the grid (other than by assuming they are consistently ordered, which will result in incorrect analyses if the assumption is wrong).
- Repeated labels. For example, if there are two Other/Specify options in the questionnaire, they should be given distinct labels such as Other 1 and Other 2. Duplicated labels can prevent the automatic detection of grids, as there is no way to tell the difference between the two options. Each label in the set must be unique.
- Inconsistencies in the number of alternatives (brands) or attributes in the grid (e.g., some brands may not be shown with some attributes). The solution to this problem is to create new variables with no data for the missing combinations, so that the grid is complete.
- Inconsistent ordering of the variables. In the example above, the four brands are shown in the same order for each attitude statement, and this is required for successful automatic identification of the grid layout.
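Several of these problems can be flagged automatically before a file is delivered. The sketch below is one possible check, applied to the inconsistent labels shown earlier. The function find_label_problems is a hypothetical helper, and it assumes that the text before the last colon identifies the row of the grid. It looks for stray whitespace, repeated labels, and rows whose options differ from the longest row, and it is not a complete validator.

```python
import re
from collections import Counter

def find_label_problems(labels, separator=": "):
    """Return human-readable warnings for a dict of name -> label."""
    problems = []
    # Leading, trailing, or doubled whitespace inside a label.
    for name, label in labels.items():
        if label != label.strip() or re.search(r"\s{2,}", label):
            problems.append(f"{name}: stray whitespace")
    # The same label used for more than one variable.
    for label, count in Counter(labels.values()).items():
        if count > 1:
            problems.append(f"label used {count} times: {label!r}")
    # Group options by stem to compare the rows of the intended grid.
    rows = {}
    for label in labels.values():
        stem, _, option = label.rpartition(separator)
        rows.setdefault(stem, []).append(option.strip())
    expected = max(rows.values(), key=len) if rows else []
    for stem, options in rows.items():
        if options != expected:
            problems.append(f"row {stem!r} has {options}, expected {expected}")
    return problems

# The inconsistent labels from the example earlier in this section.
bad_labels = {
    "Q1a1": "Q42. Brand attitude Sexy brands: Coke",
    "Q1a2": "Q42. Brand attitude Sexy brands: Pepsi",
    "Q1a3": "Q42. Brand attitude Sexy brands: Fanta",
    "Q1a4": "Q42. Brand attitude Sexy brand: Other",
    "Q1b1": "Q42. Brand attitude - Masculine brands: Coke",
    "Q1b2": "Q42. Brand attitude - Masculine brands:  Pepsi",
    "Q1b3": "Q42. Brand attitude - Masculine brands: Fanta",
    "Q1b4": "Q42. Brand attitude - Masculine brands: Other",
    "Q1c1": "Brand attitude - Powerful brands: Coke",
    "Q1c2": "Brand attitude - Powerful brands: Pepsi",
    "Q1c3": "Brand attitude - Powerful brands: Fanta",
    "Q1c4": "Brand attitude - Powerful brands: Others",
}

for problem in find_label_problems(bad_labels):
    print(problem)
```

On these labels the check reports the extra space in Q1b2, the split of the Sexy row into two rows caused by the missing s, and the Others option in the Powerful row. It does not notice that Q42. is missing from the Q1c labels, because those labels are still consistent with each other, so a check like this supplements a manual review rather than replacing it.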

Where multiple questions are asked in a loop, it is usually best if all the data appears question-by-question (i.e., all the looped variables for one question, then all the variables for the next, etc.). However, if the intent is to create stacked data, it is instead usually better to structure the data by loop iteration (i.e., first show all the data from the first iteration of the loop, then from the second, etc.).
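The two orderings can be illustrated with a short sketch. The question names, the loop items, and the naming scheme that joins them with an underscore are all hypothetical and serve only to show the difference in variable order.

```python
from itertools import product

def order_by_question(questions, iterations):
    """All iterations of the first question, then all of the next, etc."""
    return [f"{q}_{i}" for q, i in product(questions, iterations)]

def order_by_iteration(questions, iterations):
    """All questions for the first iteration, then for the next, etc."""
    return [f"{q}_{i}" for i, q in product(iterations, questions)]

questions = ["Q2", "Q3"]        # hypothetical looped questions
iterations = ["Coke", "Pepsi"]  # hypothetical loop items

print(order_by_question(questions, iterations))
print(order_by_iteration(questions, iterations))
```

The first ordering keeps each question's variables together, which suits tables and grids. The second keeps each iteration's variables together, which suits stacking, because each block of variables becomes one set of rows in the stacked file.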