Formal hypothesis testing is an approach for drawing conclusions about how the world works.
Example
A study comparing the extent to which Chinese versus Mexicans feel politically disenfranchised reported the following results:
                                                      Chinese   Mexicans
Believe they have minimal say in "getting the
government to address issues that interest" them        56%       25%
Sample size                                             499       287

Source: Wand, Jonathan (2013), American Journal of Political Science, Vol. 57, No. 1 (January 2013), pp. 249-262, Table 2.
The obvious interpretation of such a result is that the Chinese are much more likely to consider themselves disenfranchised than are the Mexicans. However, the study interviewed only 499 Chinese people, yet there are more than a billion people living in China. Similarly, the 287 Mexican respondents are only a minuscule proportion of the Mexican population. Perhaps if a different 499 Chinese people had been interviewed, a substantially different result would have been obtained; possibly even a result lower than the 25% figure reported for the Mexicans. Similarly, a different study could find a completely different figure for Mexico.
When analyzing a survey it is important to keep a clear distinction between what has been recorded versus what is true.
The recorded results are 56% for China and 25% for Mexico. Thus, the study has found that the recorded extent of disenfranchisement is thirty-one percentage points higher in China than in Mexico (i.e., 56% - 25% = 31%).
In this case, the truth is the actual proportions of Chinese and Mexican people who feel disenfranchised. We have no way of knowing these true proportions. It is possible that the difference between China and Mexico is 0%. Or, Mexicans may feel even more disenfranchised than the Chinese.
The difference between what we have estimated (the difference of 31%) and the truth (the actual difference in the disenfranchisement of Chinese and Mexicans) is known as the total survey error. As we generally have no way of knowing the truth, we also generally have no way of knowing the extent of the total survey error. However, a variety of tools have been developed that allow us to approximate the total survey error.
If we make some assumptions, we can quantify the degree of survey error that can occur as a result of only talking to a fraction of the people in a population. Most commonly, when faced with data such as this, researchers assume that people have been randomly selected to participate in the study. The technical term for this assumption is simple random sampling. In the case of the study we are discussing, for this assumption to be true requires that:
- All the people in China had an equal chance of being selected for the study, and the 499 who did participate were randomly selected from the entire Chinese population.
- All the people in Mexico had an equal chance of being selected for the study, and the 287 who did participate were randomly selected from the entire Mexican population.
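As an illustration of what simple random sampling means, here is a minimal sketch. The population list is purely hypothetical (a stand-in for a national population); the point is that every member has an equal chance of selection and no one is selected twice.

```python
import random

# Hypothetical population of one million people, identified by ID.
population = list(range(1_000_000))

# Simple random sampling: draw 499 IDs without replacement, each
# member of the population being equally likely to be chosen.
sample = random.sample(population, 499)

print(len(sample))       # 499 respondents
print(len(set(sample)))  # 499: no one is interviewed twice
```

Real surveys almost never achieve this ideal, which is why the simple random sampling assumption is described below as somewhat heroic.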
By making this somewhat heroic assumption, we can use probability theory to draw some conclusions about the extent of total survey error that can be anticipated to occur as a result of the random selection of people to participate in a survey. The basic process is as follows:
1. Stipulate something that we are trying to disprove. This is usually called the null hypothesis.
2. Compute the probability that we would have recorded the result that was obtained, or a more extreme result, if the null hypothesis were indeed true. This probability is referred to as the p-value, where p is for probability.
3. Conclude that the null hypothesis is false if the p-value is very small. Most commonly, small is defined as less than or equal to 0.05 (or, to use the jargon, a 0.05 significance level is used).
In our example:
1. Our null hypothesis is that the true difference between the perceived disenfranchisement of the Chinese and the Mexicans is 0% (i.e., they are the same).
2. The probability that we would observe a difference of 31% or more if the true difference is 0% is, given our sample sizes of 499 Chinese and 287 Mexicans, essentially 0. (How such computations are performed is discussed in Testing Differences Between Proportions.)
3. As 0 is a very small p-value, we conclude that it is wrong to believe there is no difference between the Chinese and the Mexicans; thus, it seems that the difference is not a fluke and can be relied upon. That is, we can conclude that the difference between the countries is statistically significant.
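The computation in step 2 can be sketched with a two-proportion z-test. This is an illustrative sketch, not necessarily the exact test used in the study; it uses the unpooled standard error described in many introductory texts.

```python
from math import sqrt, erfc

def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value for the null hypothesis that the true
    difference between two proportions is 0 (unpooled z-test)."""
    # Standard error of the difference between the two sample proportions.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # How many standard errors the observed difference is from 0.
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    return erfc(abs(z) / sqrt(2))

# 56% of 499 Chinese respondents vs 25% of 287 Mexican respondents.
p = two_proportion_p_value(0.56, 499, 0.25, 287)
print(p <= 0.05)  # True: the p-value is essentially 0, so the null hypothesis is rejected
```

With samples this large and a difference this big, the z statistic is around 9, so the p-value is vanishingly small, matching the "essentially 0" figure above.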
Now consider a different example, comparing preference for Coca-Cola based on age:

                    18 to 24 years old   25 to 29 years old
Prefer Coca-Cola           65%                  41%
Sample size                 43                   39
The formal hypothesis test proceeds as follows:
1. Our null hypothesis is that the true difference between the preference for Coca-Cola amongst the age groups is 0% (i.e., they are the same).
2. We have observed a difference of 65% - 41% = 24%. The probability that we would observe a difference of 24% or more if the true difference is 0% is, given our sample sizes of 43 and 39 people in the two age bands respectively, 0.026 (i.e., a little under 3 in 100).
3. It is hard to say whether 0.026 is truly small; ultimately, what counts as small depends upon context. However, it is smaller than the most common significance level of 0.05, and thus we conclude that preference for Coca-Cola does seem to differ by age.
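The 0.026 figure can be approximated with a two-proportion z-test. This is a sketch, assuming an unpooled standard error; the exact value depends on which variant of the test (pooled or unpooled variance, with or without a continuity correction) is used.

```python
from math import sqrt, erfc

# Coca-Cola preference: 65% of 43 people aged 18-24 vs 41% of 39 people aged 25-29.
p1, n1 = 0.65, 43
p2, n2 = 0.41, 39

# Unpooled standard error of the difference in proportions.
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se
p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value

print(round(p_value, 3))  # roughly 0.025, close to the 0.026 reported
print(p_value <= 0.05)    # True: significant at the 0.05 level
```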
However, with the following table we compute a p-value of 0.0793 and conclude that we cannot reject the null hypothesis (i.e., there is insufficient evidence to conclude that age is a determinant of Pepsi preference):
                    18 to 24 years old   25 to 29 years old
Prefer Pepsi                2%                  10%
Sample size                 43                   39
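The same sketch can be applied to the Pepsi table. Note that a simple unpooled z-test does not reproduce the reported 0.0793 exactly: with counts this small (2% of 43 is about one person), the normal approximation is rough, and different test variants give noticeably different p-values. The conclusion, however, is the same.

```python
from math import sqrt, erfc

# Pepsi preference: 2% of 43 people aged 18-24 vs 10% of 39 people aged 25-29.
p1, n1 = 0.02, 43
p2, n2 = 0.10, 39

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
z = (p1 - p2) / se
p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value

print(p_value > 0.05)  # True: cannot reject the null hypothesis at the 0.05 level
```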
Some important conceptual points
- The significance testing examples presented above are contradictory. That is, one test concludes that preference for Coca-Cola is related to age, while another test on the same sample concludes that preference for Pepsi is not related to age. It is impossible for both of these conclusions to be correct. There is no neat resolution to this problem. The formal terminology is that statistical tests are not transitive (i.e., the maths used to compute p-values does from time to time lead to contradictory results, even when it is done correctly). While in some situations more complicated statistical theories can provide some assistance, most of the time common sense, taking into account other evidence, is the only way to reconcile such contradictions.
- Significance tests are designed to take into account the extent of sampling error in the data. Sampling error is only one component of total survey error, which is also determined by measurement error (e.g., ambiguous wording of questions). (What is described as sampling error on this page is broader than the typical definition: the typical treatment decomposes total survey error into sampling error, non-response error, coverage error, and measurement error, whereas this page and the associated pages employ a simpler model in which total survey error is the sum of sampling error and measurement error.)
- Lots of different statistical tests have been developed for computing p-values, such as t-tests, z-tests, and chi-square tests; please refer to a statistics textbook or website for more information. In general, testing is left to computers, and only a tiny proportion of commercial researchers know or understand the formulas that are used (this is not written as a criticism; the mechanics of how to perform significance testing is pretty low on the list of things that a commercial researcher needs to understand).
- The p-values computed using the standard formulas all assume that only a single test is conducted. That is, they implicitly assume that the user is not conducting multiple tests in a study. This assumption is rarely true, and when it is not true the computed p-value understates the real one; as a result, results that are concluded to be statistically significant often should not be. Fortunately, tools have been developed so that we do not need to make such a heroic and implausible assumption; see Multiple Comparisons (Post Hoc Testing) for more information on this topic.
- Statistical tests make a host of other assumptions (see Technical Assumptions of Tests of Statistical Significance).
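The multiple-testing point above can be illustrated with the Bonferroni adjustment, one widely used (and deliberately conservative) correction: each p-value is simply multiplied by the number of tests performed, capped at 1.

```python
# The two p-values from the examples above (Coca-Cola and Pepsi tests).
p_values = [0.026, 0.0793]

# Bonferroni adjustment: multiply each p-value by the number of tests,
# capping the result at 1.
n_tests = len(p_values)
adjusted = [min(p * n_tests, 1.0) for p in p_values]

print(adjusted)                       # [0.052, 0.1586]
print([p <= 0.05 for p in adjusted])  # [False, False]
```

Once both tests are accounted for, even the Coca-Cola result is no longer significant at the 0.05 level, which illustrates how uncorrected p-values overstate the evidence when several tests are run on the same study.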