Data stacking involves splitting a data set up into smaller data sets, and stacking the values for each of the variables into a single column. It is a type of data wrangling, which is used when preparing data for further analysis. Common applications of stacking are: to unloop data, to allow multiple outcome variables to be used in regression, and to simplify reporting.
Example of stacking
In the image below, each row shows the data for one of four respondents in a survey. The data file contains a looped structure, where three sets of information appear for three different brands. In total, there are four observations and 10 variables.
The same data is shown below, in stacked form. It has been reshaped to contain 12 observations and five variables. The last three variables (columns) show data that has been stacked. The first column contains the ID variable, which has been stretched. The second column contains the unique variable names from the original data and is also stretched to line up with the other data.
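The reshaping described above can be sketched in plain Python. The column names (`aware_A`, `used_A`, `recommend_A`, and so on), the brand labels, and every value are hypothetical stand-ins, because the actual survey data appears only in the image. The `stack` helper is illustrative and is not a function from any particular package.

```python
BRANDS = ["A", "B", "C"]
MEASURES = ["aware", "used", "recommend"]

# Hypothetical wide data: one row per respondent, holding the ID followed by
# (aware, used, recommend) for brand A, then brand B, then brand C.
raw = [
    (1, 1, 1, 9, 1, 0, 6, 0, 0, 3),
    (2, 1, 0, 5, 1, 1, 8, 1, 0, 4),
    (3, 0, 0, 2, 1, 1, 10, 1, 1, 7),
    (4, 1, 1, 8, 0, 0, 5, 1, 0, 6),
]
columns = ["id"] + [f"{m}_{b}" for b in BRANDS for m in MEASURES]
wide = [dict(zip(columns, row)) for row in raw]


def stack(wide_rows, id_col, group_col, groups, measures):
    """Return one row per (respondent, group), with one column per measure."""
    stacked_rows = []
    for row in wide_rows:
        for group in groups:
            # The ID is repeated ("stretched") for every group.
            new_row = {id_col: row[id_col], group_col: group}
            for measure in measures:
                new_row[measure] = row[f"{measure}_{group}"]
            stacked_rows.append(new_row)
    return stacked_rows


stacked = stack(wide, "id", "brand", BRANDS, MEASURES)

print(len(wide), len(wide[0]))        # observations and variables before stacking
print(len(stacked), len(stacked[0]))  # observations and variables after stacking
```

With four respondents and three brands, the wide data has 4 rows and 10 columns, and the stacked result has 12 rows and 5 columns, matching the counts given above.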
Stacking can occur multiple times
You can perform data stacking multiple times. For example, you could stack the stacked data set above a second time to produce just two variables: the first containing all the values in the table, stacked on top of each other, and the second containing the variable names, stretched to line up with the appropriate values.
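A second round of stacking might look like the following sketch. The three-row starting table and its column names are made up for illustration. The result has only a `value` column and a `variable` column.

```python
# Hypothetical, already-stacked data (ID and brand columns omitted for brevity).
stacked = [
    {"aware": 1, "used": 1, "recommend": 9},
    {"aware": 1, "used": 0, "recommend": 6},
    {"aware": 0, "used": 0, "recommend": 3},
]
variables = ["aware", "used", "recommend"]

# Stack one variable at a time, so all "aware" values come first, then all
# "used" values, and so on. The variable name is repeated beside each value.
restacked = [
    {"value": row[variable], "variable": variable}
    for variable in variables
    for row in stacked
]

for row in restacked:
    print(row["value"], row["variable"])
```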
Stacking to unloop data
Often, people create data files where each row reflects how the data has been collected, rather than how it should be analyzed. For example, surveys often have data on a whole household of people in a single row, but analysis may require each person in the household to be treated as a separate analysis unit (and thus to have their own row in the data file). Unlooping the data makes calculating summary statistics, such as averages and percentages, more straightforward, because the data to be included is in a single column rather than spread over multiple columns. In the example above, most statistical programs would not readily be able to compute an average answer for "Likelihood to Recommend" in the original data file, but can easily do so using the stacked data file.
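As a rough illustration of the difference, the sketch below computes an overall average for "Likelihood to Recommend" both ways. The column names and ratings are hypothetical. In the wide form the calculation has to list every looped column, whereas in the stacked form it reads a single column.

```python
from statistics import mean

# Hypothetical wide data: one "recommend" column per brand.
wide = [
    {"id": 1, "recommend_A": 9, "recommend_B": 6, "recommend_C": 3},
    {"id": 2, "recommend_A": 5, "recommend_B": 8, "recommend_C": 4},
]

# Wide form: the analysis must name every looped column explicitly.
wide_cols = ["recommend_A", "recommend_B", "recommend_C"]
mean_from_wide = mean(row[col] for row in wide for col in wide_cols)

# Stacked form: one row per respondent-brand pair, one "recommend" column.
stacked = [
    {"id": row["id"], "brand": col.split("_")[1], "recommend": row[col]}
    for row in wide
    for col in wide_cols
]
mean_from_stacked = mean(row["recommend"] for row in stacked)

print(mean_from_wide, mean_from_stacked)
```

Both calculations give the same number. The stacked version simply does not need to know how many brands were looped.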
Stacking for regression analysis
Most software for regression assumes that there is a single outcome variable. However, this is commonly not the case. For example, in the data set above, there are three potential outcome variables (the three variables measuring the likelihood to recommend). Stacking the data means you can analyze it using standard regression analysis.
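The sketch below shows the idea with a hand-written simple least-squares fit. The `satisfaction` predictor, the column names, and all values are assumptions introduced for illustration, and `ols_slope_intercept` is a hypothetical helper rather than part of any regression package. After stacking, the three "recommend" variables become one outcome column that an ordinary regression can use.

```python
BRANDS = ["A", "B", "C"]

# Hypothetical wide data: a predictor and an outcome for each of three brands.
wide = [
    {"id": 1,
     "satisfaction_A": 4, "recommend_A": 8,
     "satisfaction_B": 2, "recommend_B": 5,
     "satisfaction_C": 3, "recommend_C": 6},
    {"id": 2,
     "satisfaction_A": 5, "recommend_A": 9,
     "satisfaction_B": 1, "recommend_B": 2,
     "satisfaction_C": 3, "recommend_C": 6},
]

# Stack so that there is a single outcome column and a single predictor column.
stacked = [
    {"id": row["id"], "brand": b,
     "satisfaction": row[f"satisfaction_{b}"],
     "recommend": row[f"recommend_{b}"]}
    for row in wide
    for b in BRANDS
]


def ols_slope_intercept(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


x = [row["satisfaction"] for row in stacked]
y = [row["recommend"] for row in stacked]
slope, intercept = ols_slope_intercept(x, y)
print(slope, intercept)
```

This fit treats all six stacked rows as independent observations, which matters for significance testing but not for the point estimates shown here.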
Simplifying reporting
When you stack data, it becomes possible to update calculations by applying a filter. If the data is not stacked, you'll need to update your analysis by changing variables or recreating the analysis from scratch. This takes more time and increases the risk of error.
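A minimal sketch of this, using made-up ratings and a hypothetical `mean_recommend` helper: the calculation is written once, and switching from all brands to a single brand only changes the filter.

```python
from statistics import mean

# Hypothetical stacked data: one row per respondent-brand pair.
stacked = [
    {"id": 1, "brand": "A", "recommend": 9},
    {"id": 1, "brand": "B", "recommend": 6},
    {"id": 2, "brand": "A", "recommend": 5},
    {"id": 2, "brand": "B", "recommend": 4},
]


def mean_recommend(rows, keep=lambda row: True):
    """Average of `recommend` over the rows that pass the `keep` filter."""
    return mean(row["recommend"] for row in rows if keep(row))


all_brands = mean_recommend(stacked)
brand_a = mean_recommend(stacked, keep=lambda row: row["brand"] == "A")
brand_b = mean_recommend(stacked, keep=lambda row: row["brand"] == "B")
print(all_brands, brand_a, brand_b)
```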
Data stacking in software
For small data files, stacking is often performed using spreadsheets. With larger files, specialist software is required. For example:
- R has various functions that perform stacking (e.g., reshape).
- SPSS has: Data > Restructure > Restructure selected variables into cases
- Q has: Tools > Stack SPSS .sav Data File
- Displayr has: Anything > Data > Data Sets > Stack
Stacked data and statistical significance
Stacking the data from variables in a data set inflates the sample size (e.g., 100 respondents with 10 rows of data each become 1,000 observations). This can cause problems with statistical tests, because the stacked rows that come from the same respondent are not independent observations. The problem can be partially ameliorated by using a weight (e.g., in this example, assigning each observation a weight of 0.1), although this is a hack. It is better either to treat the data as hierarchical in a modeling sense (e.g., by fitting some kind of Bayesian model) or to treat the data as coming from a cluster sample.
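The arithmetic behind the weighting fix can be sketched as follows. The proportion `p = 0.5` and the textbook standard-error formula for a proportion are assumptions chosen only to show the size of the effect. Using the weighted sample size amounts to treating each respondent's 10 rows as carrying no more information than a single row, which is one reason the fix is only approximate.

```python
import math

n_respondents = 100
rows_per_respondent = 10

n_stacked = n_respondents * rows_per_respondent  # rows after stacking
weight = 1 / rows_per_respondent                 # 0.1 per stacked row
weighted_n = n_stacked * weight                  # back to the respondent count

# Standard error of a hypothetical proportion under each sample size.
p = 0.5
se_naive = math.sqrt(p * (1 - p) / n_stacked)
se_weighted = math.sqrt(p * (1 - p) / weighted_n)

print(n_stacked, weighted_n)
print(se_naive, se_weighted, se_weighted / se_naive)
```

The naive standard error is smaller than the weighted one by a factor of the square root of 10, so tests run on the unweighted stacked data will tend to report significance too readily.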