Data Collection Process – The Data Story Guide

Most statistical tests assume that the data comes from a simple random sample (see Formal Hypothesis Testing for an example and definition).

Simple random sampling is a type of probability sampling. Probability sampling is a catch-all term for samples where everybody in the population has some chance of being included in the study, the probability of each person being included is known, and the mechanism by which people are selected is well understood. Simple random sampling is the simplest type of probability sample. There are many others, such as cluster random sampling and stratified random sampling. When these other forms of sampling are used, different formulas are needed to compute statistical significance; the standard formulas taught in introductory statistics courses all assume that the data is from a simple random sample (Cochran, W. G. (1977). Sampling Techniques, Third Edition. New York: Wiley).

In general, if a test assumes simple random sampling but one of these other probability sampling methods better describes the sampling mechanism, the computed p-value will be smaller than the correctly computed p-value. More results will be declared significant than should be, and the rate of false discoveries will increase. (There are some situations where alternatives to simple random sampling make p-values wrong in the other direction, but the nature of commercial research makes this possibility rare enough to be ignored.)
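A small simulation can make the effect on p-values concrete. The sketch below is illustrative only and is not taken from Cochran or from this guide. All function names are hypothetical, and the design (20 clusters of 20 respondents, a cluster effect and individual noise that are both standard normal, a true mean of zero, and a normal-approximation z-test standing in for the "standard formulas") is an assumption chosen for illustration. It counts how often a test that assumes simple random sampling rejects a null hypothesis that is actually true.

```python
import math
import random


def naive_p_value(sample, null_mean=0.0):
    """Two-sided z-test p-value, computed as if `sample` were a simple random sample."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    z = (mean - null_mean) / math.sqrt(variance / n)
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def draw_simple_random_sample(rng, n=400):
    # Assumption: every respondent comes from a different cluster,
    # so the observations are independent of each other.
    return [rng.gauss(0, 1) + rng.gauss(0, 1) for _ in range(n)]


def draw_cluster_sample(rng, n_clusters=20, cluster_size=20):
    # Respondents in the same cluster share one cluster effect,
    # so they resemble each other more than independent respondents would.
    sample = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0, 1)
        sample.extend(cluster_effect + rng.gauss(0, 1) for _ in range(cluster_size))
    return sample


def false_positive_rate(draw, n_sims=1000, alpha=0.05, seed=1):
    """Share of simulated studies where the naive test rejects a true null (mean = 0)."""
    rng = random.Random(seed)
    rejections = sum(naive_p_value(draw(rng)) < alpha for _ in range(n_sims))
    return rejections / n_sims


srs_rate = false_positive_rate(draw_simple_random_sample)
cluster_rate = false_positive_rate(draw_cluster_sample)
print(f"simple random sample: {srs_rate:.3f}")
print(f"cluster sample:       {cluster_rate:.3f}")
```

Under these assumptions, the rejection rate for the simple random sample should sit near the nominal 0.05, while the rate for the cluster sample should be several times higher, because the naive formula understates the standard error when respondents within a cluster are alike. The size of the gap depends entirely on the assumed cluster size and within-cluster similarity.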

Probability samples never occur in the real world. Only a tiny fraction of people are ever really available to participate in surveys. The rest are illiterate, in prison, unwilling to participate, too busy, not contactable, and so on. Consequently, it is important to keep in mind that computed p-values are always rough approximations based upon implausible assumptions. However, without making these implausible assumptions there is no way of drawing any conclusion at all, so the orthodoxy is to make such untestable assumptions but to proceed with a degree of caution. Nevertheless, it does not follow that, because no sample is ever really a probability sample, all samples are equally useful. The further a sample is from being a probability sample, the more dangerous it is to treat it as one.