Data analysis becomes far more efficient when new variables are created by duplicating and modifying existing variable sets. This article first gives an overview of how this is done in different software packages. It then discusses the main operations performed using this approach.
Great efficiency is obtained by using variable sets
Often data files contain large numbers of variables that need to be manipulated in the same way. For example, a file may contain 15 numeric variables that all need to be categorized (e.g., see the example in Converting Numeric Variables to Categorical Variables).
Due to the frequency with which such work is required, all major data analysis software contains tools to make this approach efficient.
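As a minimal sketch of the idea, the Python below categorizes several numeric variables in one step using only the standard library. It is not taken from any of the packages discussed in this article; the variable names, cut points, and labels are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical data: three numeric variables (the example in the text has 15).
data = {
    "spend_q1": [12, 55, 130, 8],
    "spend_q2": [40, 99, 101, 0],
    "spend_q3": [75, 20, 250, 60],
}

# Hypothetical cut points and labels, shared by every variable in the set.
CUTS = [50, 100]
LABELS = ["Under 50", "50 to 99", "100 or more"]


def categorize(values, cuts=CUTS, labels=LABELS):
    """Map each number to the label of the band it falls into."""
    return [labels[bisect_right(cuts, v)] for v in values]


# Duplicate the whole set and modify the copies in one step.
# The original numeric variables in `data` are left untouched.
categorized = {f"{name}_cat": categorize(values) for name, values in data.items()}

print(categorized["spend_q1_cat"])
```

Because the rule is written once and applied to every variable, adding a sixteenth variable or changing a cut point requires a change in only one place.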
How different software packages allow you to duplicate and modify data
In Displayr, any variable set can be duplicated (e.g., right-click and select Duplicate). The duplicate can then be modified, e.g., by changing its Structure or by clicking Values and modifying the values and labels.
In Q, variable sets are duplicated on the Variables and Questions tab by right-clicking and choosing the various Copy and Paste options. The duplicates can then be modified, e.g., by changing their Structure or by clicking Values and modifying the values and labels.
In R, variable sets first need to be created, which is typically best done by storing them in a matrix or a data frame. They can then be easily manipulated by coercing them to different types and classes, and by using base R functions such as sweep, apply, and sapply, functions such as recode in dplyr, and any number of other functions in the various tidyverse packages.
Although SPSS has minimal direct support for variable sets, its RECODE command makes it very efficient to recode multiple variables at the same time.
The main operations for modifying variable sets
There are four key operations used to efficiently manipulate variable sets:
- Changing structure.
- Changing values (i.e., recoding).
- Splitting variable sets.
- Combining variables into variable sets.
Changing structure
In older analysis software, such as SPSS Statistics, almost all variants of categorical and quantitative data are stored as numeric variables. If the user wants to calculate percentages, they choose analysis methods that assume the data is categorical (e.g., in SPSS, Frequencies, Crosstabs, Custom Tables, etc.). If the user wants to calculate means, they choose analysis methods that assume the data is numeric (e.g., Descriptives, Means).
In more modern software, such as R, Q, and Displayr, the workflow is:
- The user selects the data to analyze, e.g., the user may select the variable containing age data.
- The software automatically works out the best way to analyze it. For example, if the user selects age data, the software will likely choose to show the percentage of people in different age categories. (The software does this by working out the meaning of the data: see Variables are Stored in the Data Set Consistently with their Variable Type).
With modern analysis software, when the user wants to change how the data is analyzed, they then change the settings that indicate how the data is stored. For example:
- In R, if Age is stored as a factor, you change the type to numeric.
- In Q, if Age is stored as Pick One, you change it to Number.
- In Displayr, if Age is stored as Nominal, you change it to Numeric.
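The workflow above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how R, Q, or Displayr are implemented: the `summarize` function and the midpoints used to convert age categories to numbers are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Choose an analysis from how the data is stored (a toy version of the idea)."""
    if all(isinstance(v, (int, float)) for v in values):
        # Numeric storage: report the average.
        return {"mean": mean(values)}
    # Categorical storage: report the percentage in each category.
    counts = Counter(values)
    return {label: 100 * n / len(values) for label, n in counts.items()}


# Age stored as categories, so it is summarized as percentages.
age_categorical = ["18 to 34", "35 to 54", "18 to 34", "55 or more"]

# Hypothetical midpoints used to change the storage to numeric.
MIDPOINTS = {"18 to 34": 26, "35 to 54": 44.5, "55 or more": 65}
age_numeric = [MIDPOINTS[label] for label in age_categorical]

print(summarize(age_categorical))
print(summarize(age_numeric))
```

The analysis call is identical in both cases. Only the way the data is stored changes, and that alone determines whether percentages or a mean is shown.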
Changing values (recoding)
It is often useful to have multiple versions of the same data. For example, you may want to have attitudinal data stored as:
- Categorical variables, to compute frequencies.
- Numeric variables, to compute averages.
- Binary variables, to compute top two box scores.
How this is done also differs by software. In particular:
- In SPSS and R, a user creates new variables by recoding old variables into the new variables.
- In Q and Displayr, a user duplicates variables and changes their structure (see the previous section).
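To make the three versions concrete, here is a minimal Python sketch using only the standard library. The ratings are made-up data on an assumed 5-point scale, and treating a rating of 4 or 5 as the top two boxes is an assumption of the example.

```python
from collections import Counter
from statistics import mean

# Hypothetical 5-point agreement ratings (1 = strongly disagree, 5 = strongly agree).
ratings = [5, 4, 2, 3, 5, 1, 4, 4]

# Version 1: treated as categorical, to compute frequencies.
frequencies = Counter(ratings)

# Version 2: treated as numeric, to compute the average.
average = mean(ratings)

# Version 3: recoded to binary (1 if the rating is 4 or 5), to compute the top two box score.
top_two = [1 if r >= 4 else 0 for r in ratings]
top_two_box_pct = 100 * sum(top_two) / len(top_two)

print(frequencies, average, top_two_box_pct)
```

All three versions are derived from the same original variable, which stays unchanged, so each can be used in whichever analysis it suits.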
Splitting variable sets
Consider the summary table shown below. This is from a variable set of 6 nominal variables. If you crosstab this with other data (e.g., demographics), the resulting table becomes huge and hard to use. The simple fix is to split the variable set into separate variables and then crosstab these separate variables with the demographics.
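The effect of splitting can be sketched in Python as follows. This is an illustration with made-up data rather than a reproduction of the table discussed above: the brand variables, the gender variable, and the `crosstab` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical variable set: nominal variables answered by the same respondents.
variable_set = {
    "brand_a": ["Like", "Dislike", "Like", "Like"],
    "brand_b": ["Dislike", "Dislike", "Like", "Like"],
}

# Hypothetical demographic variable for the same four respondents.
gender = ["F", "M", "F", "M"]


def crosstab(rows, cols):
    """Count respondents in each (row category, column category) cell."""
    return Counter(zip(rows, cols))


# Splitting: each variable is crosstabbed on its own, giving one small table
# per variable instead of a single large table for the whole set.
small_tables = {name: crosstab(values, gender) for name, values in variable_set.items()}

print(small_tables["brand_a"])
```

Each entry in `small_tables` can be read and reported separately, which is the point of splitting the set before crosstabbing.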
Combining variables into variable sets
Often it is useful to form variable sets by combining individual variables or sets of variables to make comparisons between variables more straightforward, even if the data was not originally collected as a variable set.
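As a rough sketch of combining, the Python below gathers two separately stored variables into one set and summarizes them side by side. The variable names, categories, and data are hypothetical, and the example assumes the variables share the same response categories.

```python
from collections import Counter

# Hypothetical variables that were collected separately but share the same categories.
satisfaction_price = ["Low", "High", "Medium", "High"]
satisfaction_service = ["High", "High", "Low", "Medium"]

CATEGORIES = ["Low", "Medium", "High"]

# Combining: gather the individual variables into a single variable set.
combined = {
    "Price": satisfaction_price,
    "Service": satisfaction_service,
}

# One summary table for the whole set: a row per variable, a column per category.
summary = {
    name: [Counter(values)[category] for category in CATEGORIES]
    for name, values in combined.items()
}

for name, counts in summary.items():
    print(name, dict(zip(CATEGORIES, counts)))
```

With the variables in one set, the counts for each category line up in the same columns, which makes comparisons between the variables straightforward.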