Correlation

Many analyses involve understanding the relationship between two variables. Correlation is a statistic designed to summarize such relationships. This article discusses the idea of strong and weak correlations and introduces the correlation coefficient.

Strong and weak correlations

The scatterplot below shows the relationship between two numeric variables. Each dot on the chart shows clothing manufacturer Benetton’s advertising and sales in a particular year, with advertising expenditure on the horizontal (x) axis and sales on the vertical (y) axis. For example, the point at the top right shows that when advertising expenditure was about 110 billion lire (Italy's pre-euro currency), Benetton's sales were above 2.8 trillion lire.

There is a very clear pattern evident on this scatterplot: higher amounts of advertising expenditure correspond to higher amounts of sales. In the jargon of statistics, this pattern is usually referred to as a correlation. That is, advertising expenditure is correlated with sales.

In the chart above it is very clear that higher levels of advertising expenditure occurred at the same time as higher levels of sales. By contrast, in the chart below, there is still evidence of correlation, but it is much weaker. All we can say from the chart below is that higher levels of exposure to advertising correspond to higher levels of purchase intention on average, but that there are many exceptions to this relationship. To use the jargon, the chart above shows a strong correlation, whereas the one below is moderate or perhaps even weak.

The two scatterplots above show a positive correlation, where the word "positive" indicates that, on average, higher values of one variable correspond to higher values of the other. By contrast, the visualization below shows a negative correlation, with purchase intention decreasing as the time since the last consumption increases. The chart below also shows an even stronger correlation than the earlier Benetton data. When the points fall exactly on a straight line that slopes up or down, the correlation is referred to as a perfect correlation. Putting all the jargon together, the data below show a perfect negative correlation.

The correlation coefficient

It is often useful to have a more precise description of the strength of a correlation than the words "weak" and "strong", so various statistics have been developed that quantify it. By far the most popular of these statistics is Pearson's Product-Moment Correlation. This statistic is so widely used that most people simply call it "correlation" or refer to it by its standard symbol, the letter r.

Pearson's Product-Moment Correlation takes a value in the range of -1 to 1, where:

A correlation of 1 indicates a perfect positive correlation.
A correlation of -1 indicates a perfect negative correlation.
A correlation of 0 indicates that there is no linear (straight-line) relationship between the two variables.
The closer a correlation is to 0, the weaker it is.
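
The values listed above can be checked on a small scale. The following is a minimal sketch, not a reference implementation: the function name pearson_r and all of the data are made up for illustration. It computes r directly from its textbook definition (the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations). The third case is a U-shaped pattern in which y depends entirely on x yet r comes out as 0, a reminder that r captures only straight-line relationships.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation for two equal-length sequences.

    Illustrative helper (the name is an assumption, not from the article).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Made-up data, chosen so the results are easy to verify by hand.
x = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(x, [2, 4, 6, 8, 10])  # points on a rising line
perfect_negative = pearson_r(x, [10, 8, 6, 4, 2])  # points on a falling line
u_shaped = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared

print(perfect_positive, perfect_negative, u_shaped)
```

In practice a statistics package would normally be used instead of hand-rolled code; the point here is only to show where the values 1, -1 and 0 come from.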

The charts below illustrate four examples of the correlation statistic. The p refers to the p-value indicating statistical significance.

While correlation is a very useful summary statistic, it can be misleading. See Summary Statistics Can Misrepresent Data.

Categorical data

Where the data is categorical, the basic idea of correlation is still applicable (although some researchers prefer to use terms like association when describing categorical data). In the table below, as an example, the age distribution (i.e., pattern) for people without iPhones is different to that for people with iPhones. That is, the crosstab clearly reveals a relationship between the two. Thus, we can also say that the variables used to create this table are correlated. If there were no relationship between these two categorical variables, we would say that they were not correlated.
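
The comparison described above can be sketched in a few lines. This is an illustration under stated assumptions: the respondents, the age bands and the function name age_distribution are all invented, and they are not the data behind the table. The code computes the age distribution separately for people with and without iPhones; if the two distributions differ, the variables are related.

```python
from collections import Counter

# Hypothetical respondents as (age_group, owns_iphone) pairs; made-up data.
respondents = [
    ("18-34", True), ("18-34", True), ("18-34", True), ("18-34", False),
    ("35-54", True), ("35-54", True), ("35-54", False), ("35-54", False),
    ("55+", True), ("55+", False), ("55+", False), ("55+", False),
]


def age_distribution(rows, owns_iphone):
    """Share of each age group among respondents with the given ownership status."""
    ages = [age for age, owns in rows if owns == owns_iphone]
    counts = Counter(ages)
    return {age: counts[age] / len(ages) for age in sorted(counts)}


with_iphone = age_distribution(respondents, True)
without_iphone = age_distribution(respondents, False)

# Different distributions in the two groups indicate a relationship.
print("With iPhone:   ", with_iphone)
print("Without iPhone:", without_iphone)
```

In this invented sample, half of the iPhone owners are in the youngest group while half of the non-owners are in the oldest group, so the two distributions clearly differ.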
