Variable sets are groups of related variables. They can save considerable time when analyzing data (especially survey data). They can be created as a consequence of how data is collected, or created manually during the analysis process. The key properties of a variable set are the variables in it, its name, its structure, and the properties of its variable(s).
A variable set is a grouping of one or more related variables. It may consist of a single variable (e.g., a variable measuring the age of people in a data set), or it may consist of multiple related variables (e.g., the age of people within a household, where there is one variable for each person).
Most variable sets contain only a single variable. The most common variable set involving more than one variable is a set of binary variables, but many other variable set structures exist.
The term "variable set" is not standard across software. For example, the concept is referred to as multiple response sets in SPSS and Supercolumns in Stata.
Why variable sets are useful
Variable sets perform two roles in data analysis: organization and automation.
Grouping similar variables together into sets makes large data sets easier to organize. For example, a binary - grid can contain hundreds or even thousands of variables. Grouping them all into a single variable set makes the data set much more straightforward to manage (e.g., a data file that contains 1,000 variables may be represented by 40 variable sets).
Referring to variable sets, rather than having to select individual variables, saves a lot of time in a variety of situations. In particular:
- If creating a summary table of a variable set, time is saved by selecting the variable set in one step, rather than selecting the variables one by one.
- Data preparation is considerably faster when the structure of a variable set is changed in one step, rather than changing the structure of each variable one by one. See Creating New Variables by Duplicating and Modifying Variable Sets.
- If modifying the value attributes or other metadata of a set of related variables, it is much faster to modify the value attributes of the variable set than to perform the work for each variable.
- If creating crosstabs, this can be done by selecting two variable sets, rather than having to specify which variable goes where in the crosstab.
- If writing custom automation via code, the code is much shorter and easier to read.
- Statistical tests can be automated based on the variable set structure.
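The time savings listed above can be illustrated with a minimal sketch in Python using pandas. The data, the column names, and the `variable_sets` dictionary are all hypothetical, and this is not how any particular analysis package implements variable sets. It only shows how referring to a set by name replaces selecting variables one by one.

```python
import pandas as pd

# Hypothetical survey data: three binary "brands drunk" variables and one age variable.
df = pd.DataFrame({
    "drunk_coke":  [1, 0, 1, 1],
    "drunk_pepsi": [0, 0, 1, 0],
    "drunk_none":  [0, 1, 0, 0],
    "age":         [25, 41, 33, 58],
})

# Assumed representation: a variable set is a name mapped to its column names.
variable_sets = {
    "Brands drunk": ["drunk_coke", "drunk_pepsi", "drunk_none"],
    "Age": ["age"],
}

def summarize(data, set_name):
    """Return the mean of every variable in the named variable set."""
    return data[variable_sets[set_name]].mean()

# One call summarizes the whole set; for binary variables the mean is a proportion.
brands_summary = summarize(df, "Brands drunk")
print(brands_summary)
```

The same lookup could feed a crosstab or a statistical test, so code that works on one variable set works unchanged on any other set with the same structure.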
If the variable set is specified correctly, interesting summary tables can be generated instantly, rather than being created through a lot of manual work. As an example, the table below is created from the means of 12 numeric variables from a numeric - grid variable set:
Furthermore, these can be easily manipulated if the software understands the data structure. For example:
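A minimal sketch in Python with pandas of how a known grid structure makes such tables cheap to build and rearrange. The brands, attributes, and ratings are invented for illustration, and storing the grid as a two-level column index is an assumption of this sketch rather than a description of any specific tool.

```python
import pandas as pd

brands = ["Coke", "Pepsi", "Dr Pepper", "Fanta"]
attributes = ["Taste", "Price", "Image"]

# Hypothetical ratings: 12 numeric variables, one per brand/attribute pair,
# stored as columns with a two-level (brand, attribute) index.
columns = pd.MultiIndex.from_product([brands, attributes], names=["brand", "attribute"])
ratings = pd.DataFrame(
    [[5, 3, 4, 4, 3, 3, 2, 4, 1, 3, 5, 2],
     [3, 3, 2, 4, 5, 3, 4, 2, 3, 1, 3, 4]],
    columns=columns,
)

# Because the grid structure is known, the 12 means reshape into a table in one step.
means_grid = ratings.mean().unstack("attribute")
print(means_grid)

# The same structure makes manipulation easy, e.g. swapping rows and columns.
transposed = means_grid.T
print(transposed)
```

Without the structure, the 12 means would be a flat list, and arranging them into rows and columns would be manual work.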
How variable sets are created
Variable sets are created either in the data collection process or in the analysis process.
In surveys, variable sets correspond to the questions. For example, a question such as "Which of these brands have you drunk in the past 7 days? Coke, Pepsi, Dr Pepper, None of these" can be represented by four binary variables.
Some data analysis tools, such as Q and Displayr, automatically scan the data when reading it and group variables into variable sets. In most apps, however, the analyst has to define variable sets manually.
During data analysis, variable sets can be created in a number of ways. For example:
- A categorical variable is routinely turned into a set of binary variables. This is referred to as one-hot encoding in machine learning and as creating dummy variables in econometrics.
- A set of variables can be turned into another set of numeric variables using factor analysis.
- Users can manually group together variables that make sense to be analyzed jointly.
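The first approach in the list above, turning a categorical variable into a set of binary variables, can be sketched in Python with pandas. The data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical categorical variable: each respondent's preferred brand.
preferred = pd.Series(["Coke", "Pepsi", "Coke", "Dr Pepper"], name="preferred_brand")

# One binary variable per category; together they form a binary variable set.
binary_set = pd.get_dummies(preferred, prefix="preferred", dtype=int)
print(binary_set)
```

Each respondent has a 1 in exactly one of the new variables, because the original variable allowed only one category per respondent.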
Properties of variable sets
A variable set has the following properties:
- The variables included in the set.
- The structure of the variable set. The structure of a variable set dictates the properties of the data in the variable set. It consists of two aspects: the set type and the measurement scale of the data.
- The properties of each variable in the set (name, label, values, etc.).
- How the data should be analyzed when creating tables (e.g., which categories should be merged).
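The properties listed above could be held in a small data structure. The following is a minimal sketch in Python; the class names, field names, and the "multi" and "binary" labels are assumptions made for illustration, not the schema of any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class Variable:
    name: str
    label: str
    values: dict = field(default_factory=dict)  # value code -> value label

@dataclass
class VariableSet:
    name: str
    set_type: str           # assumed labels, e.g. "multi" or "grid"
    measurement_scale: str  # assumed labels, e.g. "binary" or "numeric"
    variables: list = field(default_factory=list)
    # Table setting (assumed form): merged category name -> categories it combines.
    merged_categories: dict = field(default_factory=dict)

    @property
    def structure(self):
        """Combine the two aspects of structure into one description."""
        return f"{self.measurement_scale} - {self.set_type}"

# Hypothetical variable set for a "brands drunk" survey question.
yes_no = {0: "No", 1: "Yes"}
brands_drunk = VariableSet(
    name="Brands drunk in the past 7 days",
    set_type="multi",
    measurement_scale="binary",
    variables=[
        Variable("q1_coke", "Coke", yes_no),
        Variable("q1_pepsi", "Pepsi", yes_no),
        Variable("q1_drpepper", "Dr Pepper", yes_no),
        Variable("q1_none", "None of these", yes_no),
    ],
    merged_categories={"Any cola": ["Coke", "Pepsi"]},
)
print(brands_drunk.structure, len(brands_drunk.variables))
```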