The purpose of a summary statistic is, as the name suggests, to summarize data. However, summary statistics can also misrepresent data. This article presents two examples, in which an average and a correlation each misrepresent the data they summarize.
Example 1: Average investment returns
Silicon Valley venture capital firm Kleiner Perkins became one of the world's legendary investors by making 17 investments totaling $7.5M between 1973 and 1980. By 1986, these investments were worth $346M. On average, each investment returned a staggering $58 for every $1 invested. (Source: Enterprise and Venture Capital: A Business Builder's and Investor's Handbook, Christopher C. Golis, 2002).
The result of $58 is mathematically correct, but is, nevertheless, misleading. The histogram below shows the data. The salient points are:
- 5 investments lost all of the money invested
- 6 returned less than 3 times the amount invested (as a rule of thumb, a venture capital firm is losing money with returns below 3 times)
- Of the 6 "good" investments, two provided almost all of the upside, returning $115 and $802 for every dollar invested.
In this example, the average clearly does not summarize the data in a useful way. Fifteen of the 17 investments returned less than the average.
By contrast, if we look at the amounts invested by Kleiner Perkins, the average is a more meaningful summary. The average amount invested was a little over $400K. As the histogram shows, the range was from $41K to $1.5M.
Examples like this suggest some rules that can be used for working out whether a summary statistic is useful or not. For example:
- If removing a single case (e.g., the $802 investment return) changes the average a lot, it suggests that the average is not useful.
- If the average is substantially different from the mode or median, it also suggests a problem.
- If the distribution of the data follows a power law (as is true for investment returns), it suggests that the mean may not be useful.
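The first two rules are easy to check numerically. The following is a minimal sketch, not the article's own analysis. The article does not list the 17 individual returns, so the `returns` list is a hypothetical placeholder: only the values 115 and 802 come from the text, and the rest were invented to match the published counts (5 total losses, 6 below 3 times, 6 "good") and the $58 average. The function names are also assumptions. The power-law rule is not covered here.

```python
from statistics import mean, median

# HYPOTHETICAL return multiples (dollars returned per dollar invested).
# Only 115 and 802 are taken from the article; every other value is an
# invented placeholder chosen to be consistent with the published summary.
returns = [0, 0, 0, 0, 0,              # 5 investments that lost everything
           0.5, 1, 1.5, 2, 2, 2.5,     # 6 that returned less than 3 times
           5, 10, 14.5, 30, 115, 802]  # 6 "good" investments


def leave_one_out_shift(values):
    """Largest relative change in the mean caused by dropping one value."""
    full = mean(values)
    shifts = []
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]
        shifts.append(abs(mean(rest) - full) / abs(full))
    return max(shifts)


def mean_median_gap(values):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(values), median(values)


if __name__ == "__main__":
    avg, mid = mean_median_gap(returns)
    print(f"mean = {avg:.1f}, median = {mid}")
    print(f"largest leave-one-out shift = {leave_one_out_shift(returns):.0%}")
    below = sum(r < avg for r in returns)
    print(f"{below} of {len(returns)} values are below the mean")
```

With these placeholder values, dropping the single largest return moves the mean by roughly 80%, and the mean is many times larger than the median. Both are the warning signs described in the rules above.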
While such rules sound appealing, in practice the best approach is to visualize the data and use common sense to check whether the summary is adequate for its purpose.
Example 2: Correlations
A correlation summarizes the relationship between two variables. As with the average, a correlation can be a poor summary of the data.
Consider a correlation of 0.82. In most areas of data analysis, such a correlation would indicate that there was a relatively strong relationship between two variables. However, the four scatterplots below each show a correlation of 0.82, and in only one of the four can we say that the correlation is a reasonable summary of the data.
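The same effect can be reproduced numerically. The sketch below assumes the four scatterplots are Anscombe's quartet (1973), a well-known set of four small data sets that share nearly identical correlations. The article does not name its data, so this is an assumption. The `pearson` helper and the `quartet` dictionary are illustrative names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


# Anscombe's quartet (assumed to be the data behind the four scatterplots).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

if __name__ == "__main__":
    for name, (xs, ys) in quartet.items():
        print(f"data set {name}: r = {pearson(xs, ys):.2f}")
```

All four data sets give a correlation that rounds to 0.82. Data set II is curved, and III and IV are each dominated by a single outlying point, which only a plot makes obvious.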