Variables are Correctly Grouped Into Variable Sets – The Data Story Guide

Efficient data analysis requires that variables belonging to the same variable set are grouped together. This means that variables in the same set are located next to each other in the data file. And, that those variables have been grouped into a variable set (if possible).

Variables in the same variable set are located next to each other

Consider a data file containing data on race, stored in a variable set with five binary variables. The five variables should appear next to each other in the data file

Variables grouped into variable sets (if possible)

Not all data analysis software supports all the variable set structures. For example, SPSS Statistics has multiple response sets, which are used to store data from multiple response questions, but it does not support any other of the variable set structures containing two or more variables. R also doesn't support the concept of a variable set in data frames, although often analysts will get around this by instead using matrixes and arrays.

Software that extensively supports the concept of variable sets (Displayr and Q), automatically groups variables into variable sets by examining the metadata of variables. For example, if there are five nominal variables all next to each other, and each has the same set of value labels and a common prefix in their variable label, then both Q and Displayr will group the variables into a nominal - multi variable set (referred to as a Pick Any question in Q).

However, sometimes these automatic algorithms will fail, due to inconsistencies in the metadata (e.g., spelling mistakes, or some variables having more labels than others).

The screenshot below shows eight variables as they have been imported into Displayr. The first and last variables are appearing on their own, while B through G have automatically been grouped into a variable set (we know that from the icons next to the label).

The fix for this problem is to group the variables into the variable set, although often at this stage problems with inconsistent metadata will be discovered.

The reverse problem is the one where variables that should not have been grouped are inadvertently grouped. For example, if one question asks about brand awareness, and the next asks about brand usage, the variables from the two questions may inadvertently be grouped into a grid.