Formal Hypothesis Testing – The Data Story Guide

Formal hypothesis testing is an approach for drawing conclusions about how the world works.

Example

A study comparing the extent to which Chinese and Mexican respondents feel politically disenfranchised reported the following results:

	Chinese	Mexicans
Believe they have minimal say in "getting the government to address issues that interest" them	56%	25%
Sample size	499	287

Source: Wand, Jonathan. (2013), American Journal of Political Science, Vol. 57, No. 1 (January 2013), pp. 249-262 (14 pages), Table 2.

The obvious interpretation of such a result is that the Chinese are much more likely to consider themselves to be disenfranchised than are Mexicans. However, the study only spoke to 499 Chinese, yet there are more than a billion Chinese people living in China. Similarly, the 287 Mexican respondents are only a minuscule proportion of the Mexican population. Perhaps if a different 499 Chinese had been interviewed, a substantially different result would have been obtained, possibly even a result lower than the 25% figure reported for the Mexicans. Similarly, the Mexican figure could turn out to be completely different in a different study.

When analyzing a survey it is important to keep a clear distinction between what has been recorded and what is true.

The recorded results are 56% for China and 25% for Mexico. Thus, the study has found that the recorded extent of disenfranchisement is thirty-one percentage points higher in China than in Mexico (i.e., 56% - 25% = 31%).

In this case, the truth is the actual proportion of Chinese people and Mexicans that feel disenfranchised. We have no way of knowing these true proportions. It is possible that the difference between China and Mexico is 0%. Or, Mexicans may feel even more disenfranchised than the Chinese.

The difference between what we have estimated (the difference of 31%) and the truth (the actual difference in the disenfranchisement of Chinese and Mexicans) is known as the total survey error. As we generally have no way of knowing the truth we also generally have no way of knowing the extent of the total survey error. However, a variety of tools have been developed that allow us to approximate the total survey error.

If we make some assumptions we can quantify the degree of survey error that can occur as a result of only talking to a fraction of the people in a population. Most commonly, when faced with data such as this, researchers assume that people have been randomly selected to participate in the study. The technical term for this assumption is simple random sampling. In the case of the study we are discussing, this assumption requires that:

All the people in China had an equal chance of being selected for the study and the 499 that did participate were randomly selected from the entire Chinese population.
All the people in Mexico had an equal chance of being selected for the study and the 287 that did participate were randomly selected from the entire Mexican population.
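If simple random sampling is assumed, the amount by which a recorded percentage is expected to vary from one sample to the next can be approximated with the standard error of a proportion. The following is a minimal sketch rather than a calculation taken from the study: it treats the recorded percentages as if they were the true proportions, and the function name `standard_error` is hypothetical.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

# Recorded proportions and sample sizes from the table above.
# Assumption: the recorded values stand in for the unknown true proportions.
se_china = standard_error(0.56, 499)
se_mexico = standard_error(0.25, 287)

# About 95% of random samples are expected to land within 1.96 standard errors.
print(f"China:  +/- {1.96 * se_china:.1%}")
print(f"Mexico: +/- {1.96 * se_mexico:.1%}")
```

Under these assumptions each recorded percentage would typically move by only a few percentage points from sample to sample, which is small relative to the recorded gap between the two countries.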

By making this somewhat heroic assumption we can use probability theory to draw some conclusions about the extent of survey error that can be anticipated to occur as a result of the random selection of people to participate in a survey. The basic process is as follows:

Stipulate something that we are trying to disprove. This is usually called the null hypothesis.
Compute the probability that we would have recorded the result that was obtained, or a more extreme result, if the null hypothesis were indeed true. This probability is referred to as the p-value, where p is for probability.
Conclude that the null hypothesis is false if the p-value is very small. Most commonly, small is defined as less than or equal to 0.05 (or, to use the jargon, a 0.05 significance level is used).

In our example:

Our null hypothesis is that the true difference between the perceived disenfranchisement of the Chinese and Mexicans is 0% (i.e., they are the same).
The probability that we would observe a difference of 31% or more if the true difference is 0% is, given our sample sizes of 499 Chinese and 287 Mexicans, essentially 0. (How such computations are performed is discussed in Testing Differences Between Proportions.)
As 0 is a very small p-value, we conclude that it is wrong to believe there is no difference between the Chinese and the Mexicans. Thus, it seems that the difference is not a fluke and can be relied upon. That is, we can conclude that the difference between the countries is statistically significant.
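The three steps in this example can be sketched with a two-proportion z-test. This is one common way to test a difference between proportions and is an assumption here; the page does not say which test produced its figure. The function name `two_proportion_z_test` is hypothetical, the recorded percentages are used in place of the raw counts, and the p-value is two-sided (a difference this large in either direction).

```python
import math

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test of the null hypothesis that two proportions are equal."""
    # Pooled proportion: the estimate of the common value under the null.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# China: 56% of 499 respondents; Mexico: 25% of 287 respondents.
z, p_value = two_proportion_z_test(0.56, 499, 0.25, 287)
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

With these inputs the p-value is many orders of magnitude below 0.05, which is consistent with the description of it as essentially 0.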

Now consider a different example, comparing preference for Coca-Cola based on age:

	18 to 24 years old	25 to 29 years old
Prefer Coca-Cola	65%	41%
Sample size	43	39

The formal hypothesis test proceeds as follows:

Our null hypothesis is that the true difference between the preference for Coca-Cola amongst the age groups is 0% (i.e., they are the same).
We have observed a difference of 65% - 41% = 24%. Given our sample sizes of 43 and 39 people in the two age bands respectively, the probability that we would observe a difference of 24% or more if the true difference is 0% is 0.026 (i.e., a little under 3 in 100).
It is hard to say if 0.026 is truly small. Ultimately, what is small will depend upon context. However, it is smaller than the most common significance level of 0.05 and thus we conclude that preference for Coca-Cola appears to differ by age.

However, with the following table, we compute a p-value of 0.0793 and conclude that we cannot reject the null hypothesis (i.e., there is insufficient evidence to conclude that age is a determinant of Pepsi preference).

	18 to 24 years old	25 to 29 years old
Prefer Pepsi	2%	10%
Sample size	43	39
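The meaning of the p-value in the two soft-drink tests above can also be illustrated by simulation: repeatedly draw two samples from a world in which the null hypothesis is true and count how often the simulated difference is at least as large as the recorded one. This is only a sketch under stated assumptions. The page does not say which test produced 0.026 and 0.0793, so the simulated values should not be expected to reproduce those figures exactly. The function name `simulated_p_value`, the number of trials, and the seed are all hypothetical choices, and the rounded percentages from the tables are used as the observed proportions.

```python
import random

def simulated_p_value(p1, n1, p2, n2, trials=20_000, seed=1):
    """Estimate a two-sided p-value by simulating samples under the null."""
    rng = random.Random(seed)
    observed = abs(p1 - p2)
    # Under the null both groups share one true proportion; the pooled
    # sample proportion is used as its estimate (an assumption).
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    extreme = 0
    for _ in range(trials):
        a = sum(rng.random() < pooled for _ in range(n1)) / n1
        b = sum(rng.random() < pooled for _ in range(n2)) / n2
        # Small tolerance guards against floating-point near-ties.
        if abs(a - b) >= observed - 1e-9:
            extreme += 1
    return extreme / trials

coke = simulated_p_value(0.65, 43, 0.41, 39)
pepsi = simulated_p_value(0.02, 43, 0.10, 39)
print(f"Coca-Cola: {coke:.3f}")
print(f"Pepsi:     {pepsi:.3f}")
```

The simulated Coca-Cola value should fall below 0.05 and the Pepsi value above it, so the sketch points to the same two conclusions as the tests above even though the exact p-values differ.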

Some important conceptual points

The significance testing examples presented above are contradictory. That is, one test concludes that preference for Coca-Cola is related to age while another test of the same sample concludes that preference for Pepsi is not related to age. It is impossible for both of these conclusions to be correct. There is no neat resolution to this problem. The formal terminology is that statistical tests are not transitive (i.e., the maths that is used to compute p-values does from time to time lead to contradictory results, even when it is done correctly). While in some situations more complicated statistical theories can provide some assistance, most of the time common sense is the only way to reconcile such contradictions (i.e., taking into account other evidence).
Significance tests are designed to take into account the extent of sampling error in the data. Sampling error is only one component of total survey error. Total survey error is also determined by measurement error (e.g., ambiguous wording of questions). (What is described as sampling error on this page is broader than the typical definition. The typical definition says that total survey error is the sum of sampling error, nonresponse error, coverage error, and measurement error; this page and the associated pages employ a simpler model in which total survey error is the sum of sampling error and measurement error.)
Lots of different statistical tests have been developed for computing p-values, such as t-tests, z-tests, and chi-square tests; please refer to a statistical textbook or website for more information. In general, testing is left to computers and only a tiny proportion of commercial researchers know or understand the formulas that are used. This is not written as a criticism; the mechanics of how to perform significance testing are pretty low on the list of things that a commercial researcher needs to understand.
The p-values computed using the standard formulas all assume that only a single test is conducted. That is, they implicitly assume that the user is not conducting multiple tests in a study. This assumption is rarely true. When it is not true, the computed p-value is underestimated and the real p-value can be much higher. As a result, results that are concluded to be statistically significant often should not be. Fortunately, tools have been developed so that we do not need to make such a heroic and implausible assumption; see Multiple Comparisons (Post Hoc Testing) for more information on this topic.
Statistical tests make a host of other assumptions (see Technical Assumptions of Tests of Statistical Significance).
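The point about multiple tests can be made concrete with a little arithmetic. The sketch below assumes the tests are independent and uses the Bonferroni adjustment, which is one well-known correction and not necessarily the one described in Multiple Comparisons (Post Hoc Testing). Both function names are hypothetical.

```python
def familywise_error_rate(alpha, tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** tests

def bonferroni_threshold(alpha, tests):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / tests

# How the chance of a spurious "significant" result grows with the number of tests.
for k in (1, 2, 10, 20):
    rate = familywise_error_rate(0.05, k)
    threshold = bonferroni_threshold(0.05, k)
    print(f"{k:>2} tests: error rate {rate:.3f}, per-test threshold {threshold:.4f}")
```

For instance, treating the Coca-Cola and Pepsi comparisons as two tests gives a per-test threshold of 0.025 under this adjustment, so the Coca-Cola p-value of 0.026 would narrowly fail to count as significant.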