Summary tables typically describe the responses to each of the questions in a survey. They are the first step in understanding what the data means and in checking and cleaning data.
The simplest summary table shows the number of responses (cases) for each possible answer to a question in a survey. Such tables of counts are known as frequency tables, and an example is shown below.
These tables can occasionally be useful - we return to this in our more detailed exploration of data cleaning. However, they are rarely the best place to start.
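As a minimal sketch of how a frequency table can be computed, the following uses pandas. The responses are made up for illustration and are not the survey data discussed here; the variable names are likewise assumptions.

```python
import pandas as pd

# Hypothetical responses to a single-choice question (illustrative only).
responses = pd.Series(
    ["Male", "Female", "Female", "Male", "Female", None],
    name="sex",
)

# Count the cases for each answer. dropna=False keeps missing responses
# in the table, which is what makes counts handy when cleaning data.
frequency_table = responses.value_counts(dropna=False)
print(frequency_table)
```

Keeping the missing responses visible is a deliberate choice here: it is one of the few situations where raw counts tell you something that percentages can hide.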
Summary tables showing percentages and means
The most useful summary tables show the percentage of people who chose each option or, if the data is numeric rather than categorical, the average. Examples of each are below.
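The two statistics can be sketched in pandas as follows. The data frame and its column names (`sex`, `age`) are hypothetical and only stand in for a categorical and a numeric survey variable.

```python
import pandas as pd

# Hypothetical respondents (illustrative values, not the survey's data).
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Female"],
    "age": [34, 51, 27, 45, 38],
})

# Categorical variable: percentage of respondents choosing each option.
sex_pct = df["sex"].value_counts(normalize=True) * 100

# Numeric variable: the average.
age_mean = df["age"].mean()

print(sex_pct.round(1))
print(f"Average age: {age_mean:.1f}")
```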
Summaries of sets of variables
The left-most and right-most tables above look to have similar structures. However, if you carefully check the numbers, you will see that the different categories of race add up to 105%. Why is this? It is because this survey permitted people to choose multiple races.
In the raw data, sex is represented as a single variable (the first column below). By contrast, the race data is represented by five binary variables, one for each category (this is required because people were permitted to choose multiple categories). The percentages shown in the table are the proportion of respondents with a 1 in each of the variables.
The sex and age summary tables above are summaries of single variables. The race table shows the data from a variable set, that is, a group of related variables. The race example is the simplest type of variable set that commonly appears in surveys.
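To see why the percentages for a multiple-response question can sum to more than 100%, consider this small sketch. The category names and the 0/1 values are invented for illustration; the only point is that each binary variable is summarized separately.

```python
import pandas as pd

# Hypothetical multiple-response data: one binary variable per category,
# with 1 meaning the respondent selected that category.
race = pd.DataFrame({
    "White":    [1, 1, 0, 1, 0],
    "Black":    [0, 0, 1, 0, 0],
    "Asian":    [0, 1, 0, 0, 1],
    "Hispanic": [0, 0, 0, 0, 0],
    "Other":    [0, 0, 0, 0, 0],
})

# The mean of a 0/1 variable is the proportion of 1s.
race_pct = race.mean() * 100

# The second respondent chose two categories, so the total exceeds 100%.
total_pct = race_pct.sum()

print(race_pct.round(1))
print(f"Sum of percentages: {total_pct:.0f}%")
```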
Summaries of multiple categorical variables
In the table of raw data above, the two right-most columns represent how likely people said they would be to buy a product when the description contained no price, and how likely they were to buy it after being shown the price.
This data is summarized in the table below. Here, each row summarizes the data of one of the columns. The percentages in the table are labeled as Row %, to make it easier for the reader to ascertain how to read the table.
Summary tables of variable sets that have multiple rows and columns are often called grids. The term is also used for the type of question used to collect such data.
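A grid of row percentages can be sketched by summarizing each categorical variable separately and stacking the results. The column names, the three-point scale, and the responses below are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical purchase-likelihood answers, before and after seeing the price.
scale = ["Unlikely", "Neutral", "Likely"]
df = pd.DataFrame({
    "No price shown": ["Likely", "Likely", "Neutral", "Unlikely", "Likely"],
    "Price shown":    ["Neutral", "Likely", "Unlikely", "Unlikely", "Neutral"],
})

# One row per variable, one column per answer; each row sums to 100 (Row %).
grid = (
    pd.DataFrame({col: df[col].value_counts(normalize=True) for col in df.columns})
    .reindex(scale)   # put the answers in scale order
    .fillna(0)        # answers nobody chose count as 0%
    .T * 100
)

print(grid.round(1))
```

Reindexing against the full scale keeps the columns in a meaningful order and ensures that an answer nobody chose still appears in the table.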
Grids of binary variables
The table below shows brand imagery for different cola brands. For example, we can see that 6% of people regard Coke as Feminine, 2% regard Coke as Health-conscious, etc. At first glance, the table below looks similar to the one above. However, the underlying data is completely different. Whereas the table above summarized six categorical variables, the table below summarizes 63 binary variables.
The topic is discussed in more detail in Interpreting Grids of Binary Variables.
Grids of numeric variables
Summary tables of numeric variables can also be constructed as grids. The table below shows consumption per week of different brands in two locations, 'out and about' and 'at home'.
The basic structure and interpretation of these tables are the same as with binary variables, except that the comparison is of averages rather than percentages and the nets are sums rather than averages.
With grids of binary variables, the default statistic, the percentage, is usually the most interesting. With numeric grids, other statistics can be more useful. When there are outliers, more robust statistics such as the median can be better. Sometimes it is more relevant to convert the data into percentages. The table below shows percentages rather than averages, which makes it more readily interpretable. For example, we can see that:
- 68% of the consumption is 'at home'.
- Coca-Cola and Coke Zero make up more than half the market (i.e., 31% + 29%).
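Converting a numeric grid of averages into percentages amounts to dividing each cell by the grand total. The sketch below uses invented brands and consumption figures, not the numbers from the table, and adds NET row and column totals as sums.

```python
import pandas as pd

# Hypothetical average weekly consumption per person (illustrative values).
averages = pd.DataFrame(
    {"Out and about": [1.0, 0.5, 0.5], "At home": [2.0, 3.0, 1.0]},
    index=["Brand A", "Brand B", "Brand C"],
)

# Each cell as a share of all consumption.
grand_total = averages.to_numpy().sum()
share = averages / grand_total * 100

# With numeric grids the nets are sums: add a NET column and a NET row.
share["NET"] = share.sum(axis=1)
share.loc["NET"] = share.sum(axis=0)

print(share.round(1))
```

In this made-up example the NET row shows how consumption splits between the two locations, and the NET column shows each brand's share of the total.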
Summary tables and data structures
In the summary tables above, several different statistics have been shown (e.g., counts, row %, and averages). In older data analysis programs, such as SPSS Statistics, the user chooses which statistic to display for each table. For example, counts are produced using Analyze > Descriptive Statistics > Frequencies, means via Analyze > Descriptive Statistics > Descriptives, and row percentages like the ones above using Analyze > Custom Tables.
More modern analysis programs, such as Q, Displayr, R, and Tableau, instead automatically select the best statistic to show based on the structure of the data. For example, in Displayr, if the user creates a table of a nominal variable set, percentages are shown, and if the table contains numeric data it automatically shows means. In such programs, the user changes the setup of the data to generate the desired summary table.
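The idea of letting the data structure pick the statistic can be illustrated with a small sketch. The `summarize` function below is hypothetical and is not how Displayr or any other product is implemented; it simply shows means for numeric data and percentages otherwise, so that changing how a variable is set up changes the table.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize(variable: pd.Series):
    """Hypothetical helper: choose the statistic from the variable's type."""
    if is_numeric_dtype(variable):
        return variable.mean()                          # numeric: the average
    return variable.value_counts(normalize=True) * 100  # categorical: percentages


# Hypothetical ratings stored as numbers.
rating = pd.Series([1, 2, 2, 5], name="rating")

# Treated as numeric, the summary is a mean.
as_numeric = summarize(rating)

# Changing the setup of the data to categorical yields percentages instead.
as_categories = summarize(rating.astype("category"))

print(as_numeric)
print(as_categories)
```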
Generating lots of summary tables
Typically, the analysis starts by generating lots of summary tables automatically and then reading through them.
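One way to generate a batch of summary tables is to loop over every variable in the data file. The sketch below assumes a small, made-up data frame and chooses, purely as an illustration, percentages for categorical variables and a few descriptive statistics for numeric ones.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical survey data (illustrative values and column names).
survey = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "age": [34, 51, 27, 48],
    "likelihood": ["Likely", "Neutral", "Likely", "Unlikely"],
})

# Build one summary table per variable.
summary_tables = {}
for name, column in survey.items():
    if is_numeric_dtype(column):
        summary_tables[name] = column.agg(["mean", "median", "min", "max"])
    else:
        summary_tables[name] = column.value_counts(normalize=True) * 100

# Read through the tables one at a time.
for name, table in summary_tables.items():
    print(f"\n{name}\n{table}")
```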