Each column in a data file should represent a single variable. There are two common variants of overly wide data:
- Where columns represent unique values of a variable
- Where a loop has been used to create multiple columns of a single variable
Columns represent unique values of a variable
The table below shows some sales data. Data in this format is widespread. When the tables are small, like the one below, this format is not a big problem, and the data is easy to analyze in a spreadsheet. But as the data gets larger, or when it needs to be analyzed in specialist data analysis software, the format becomes increasingly unwieldy.
It's not immediately obvious, but this data is inconsistent with the idea of each column containing a single variable. We have date, which is a variable, stored as multiple column headings. And we have sales, which is another variable, distributed across the 12 cells of the table rather than stored in a single column.
The correct format for data like this is to structure it as shown below (this example also glues together the separated tables from There are Multiple Tables, Rather Than a Single Table).
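The reshaping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region names, date headings, and sales figures are hypothetical and are not taken from the table, and `wide_to_long` is a helper written for this example rather than a function from any particular package.

```python
# Hypothetical wide data: one row per region, one column per date.
wide_rows = [
    {"Region": "North", "2023-Q1": 10, "2023-Q2": 12, "2023-Q3": 9},
    {"Region": "South", "2023-Q1": 7, "2023-Q2": 8, "2023-Q3": 11},
]


def wide_to_long(rows, id_column, variable_name, value_name):
    """Turn every non-identifier column heading into a value of one variable."""
    long_rows = []
    for row in rows:
        for heading, value in row.items():
            if heading == id_column:
                continue  # the identifier is repeated on each output row instead
            long_rows.append({
                id_column: row[id_column],
                variable_name: heading,  # the old column heading becomes data
                value_name: value,       # the old cell becomes a single value
            })
    return long_rows


long_rows = wide_to_long(wide_rows, "Region", "Date", "Sales")
for long_row in long_rows:
    print(long_row)
```

Each of the six cells in the hypothetical wide table becomes one row with three columns (Region, Date, Sales), so every column now holds a single variable.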
Where a loop has been used to create multiple columns of a single variable
In the example below, note that every fifth column contains the same style of information. This data file should have contained five variables, but instead each of these variables has been distributed across multiple columns.
This problem is usually caused by the use of loops in data collection. For example, the survey may ask people about their last five purchases and include information about each of the purchases in a separate set of variables.
Efficient analysis typically requires that such data be stacked. For example, the data above has been stacked below.
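As a rough sketch of what stacking involves, the Python below turns one row per respondent into one row per purchase. Everything in it is assumed for illustration: the column naming scheme (`Item_1`, `Price_1`, and so on), the two looped variables, the three iterations, and the `stack` helper are hypothetical, and real data files will often name their loop columns differently.

```python
import re

# Hypothetical survey data: one row per respondent, with the same
# questions repeated for each purchase (suffixes _1, _2, _3).
wide_rows = [
    {"ID": 1, "Item_1": "Milk", "Price_1": 2.0,
     "Item_2": "Bread", "Price_2": 3.5,
     "Item_3": "Eggs", "Price_3": 4.0},
    {"ID": 2, "Item_1": "Tea", "Price_1": 5.0,
     "Item_2": "Jam", "Price_2": 2.5,
     "Item_3": None, "Price_3": None},
]

# Assumed naming convention: <variable>_<iteration number>.
LOOP_COLUMN = re.compile(r"^(?P<variable>.+)_(?P<iteration>\d+)$")


def stack(rows, id_column):
    """Return one row per respondent per loop iteration."""
    stacked = []
    for row in rows:
        by_iteration = {}
        for column, value in row.items():
            match = LOOP_COLUMN.match(column)
            if match is None:
                continue  # not a loop column (for example, the ID)
            iteration = int(match["iteration"])
            by_iteration.setdefault(iteration, {})[match["variable"]] = value
        for iteration in sorted(by_iteration):
            values = by_iteration[iteration]
            if all(v is None for v in values.values()):
                continue  # this respondent has no data for this iteration
            stacked.append({id_column: row[id_column],
                            "Purchase": iteration, **values})
    return stacked


stacked_rows = stack(wide_rows, "ID")
for stacked_row in stacked_rows:
    print(stacked_row)
```

Dropping iterations where every looped variable is missing is a design choice made for this sketch; some analyses may need those empty rows kept.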