Most statistical tests make explicit assumptions about the distribution of the data (where the data is typically assumed to be independently and identically distributed, or i.i.d., draws from that distribution). Statisticians broadly divide tests into two groups: parametric tests, which make relatively strong assumptions, and nonparametric tests, which make weaker assumptions about the distribution.
Parametric tests
Parametric tests are derived from assumptions about the distribution of the data. The most commonly used parametric tests assume that the data is normal or binomial, although dozens of other distributions are assumed by more exotic tests. Where the data is assumed to be binomial, as in many tests of proportions, the assumption is almost unbreakable provided the i.i.d. assumption is met: the count of i.i.d. binary outcomes is binomial by definition. The assumption of normality, however, can play a material role.
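To illustrate the binomial case (with invented numbers), an exact binomial test needs no distributional assumption beyond i.i.d. responses, because the count of "yes" answers is binomial by construction. A minimal sketch using scipy (assuming scipy >= 1.7 for `binomtest`):

```python
from scipy import stats

# Invented example: 130 "yes" responses out of 400 i.i.d. respondents,
# testing H0: the true proportion of "yes" is 0.25. The number of "yes"
# answers is binomial by definition, so no normality assumption is needed.
result = stats.binomtest(k=130, n=400, p=0.25)
print(f"observed proportion = {result.statistic:.3f}")
print(f"two-sided p-value   = {result.pvalue:.4f}")
```

Because the test is exact, its p-value is correct at any sample size, which is why the binomial assumption is described above as almost unbreakable.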
The most widely used tests in survey research - t-tests - assume that the data, or in some cases the residuals, are drawn from a normal distribution. Survey data rarely complies with this assumption; it is more common for it to follow other distributions (e.g., the NBD) and/or contain outliers. Where sample sizes are 'small', the failure of the data to meet the assumption can make the computed p-values misleading. Fortunately, due to a very helpful result known as the central limit theorem, with large sample sizes this assumption is usually not a problem. That is, many tests that assume normality - such as most z-tests and t-tests - compute accurate p-values even when the data is not remotely similar to a normal distribution, provided that the sample size is large. Unfortunately, there is no good guidance as to how large a sample needs to be before the assumption of normality can be ignored. It is common to read guidelines suggesting that samples as small as 10, 20, 25, or 30 can be sufficiently large, but it is not so simple: even in samples in the thousands, departures from normality can still make a difference (e.g., with heavily skewed data). Perhaps the key thing to keep in mind is that with samples larger than 30 the difference between the computed p-value and the correct p-value is unlikely to be large. (That is, it is routine to treat samples of 30 and above as being sufficiently large to make the assumption of normality irrelevant.)
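This behaviour is easy to check by simulation. The sketch below (all numbers invented) draws heavily skewed exponential data for which the null hypothesis is actually true, and records how often a one-sample t-test at the 5% level falsely rejects; by the central limit theorem, the realised rejection rate should approach the nominal 5% as the sample size grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(n, sims=2000, alpha=0.05):
    """Fraction of simulations in which a one-sample t-test of the true
    null H0: mean = 1 rejects at level alpha, for exponential data."""
    rejections = 0
    for _ in range(sims):
        sample = rng.exponential(scale=1.0, size=n)  # heavily skewed, mean 1
        _, p = stats.ttest_1samp(sample, popmean=1.0)
        if p < alpha:
            rejections += 1
    return rejections / sims

# With skewed data, the realised error rate is off at small n and
# close to the nominal 5% at large n.
for n in (10, 30, 1000):
    print(f"n = {n:4d}: false rejection rate = {rejection_rate(n):.3f}")
```

Swapping in a more extreme distribution (e.g., a lognormal with high variance) shows why fixed cut-offs like n = 30 are only rules of thumb: the more skewed the data, the larger the sample needed.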
Nonparametric tests
Nonparametric tests make milder distributional assumptions. The most common assumption is that the data has finite expectations (i.e., finite means). This is a very technical assumption which is, in practice, essentially always satisfied in survey research, as bounded data such as rating scales automatically has finite expectations.
Other assumptions tend to depend upon the specific test. For example, simpler nonparametric tests used for testing differences in means and medians often assume that the data contains no ties (e.g., Kruskal-Wallis). Generally, where such assumptions are not met there are alternative variants of the tests that do not make these assumptions (but instead assume that the sample is large). Most software programs either default to these safer tests or use them when the assumption of no ties is violated.
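For instance, survey ratings on a 1-to-5 scale are full of ties. scipy's Kruskal-Wallis implementation handles this in the way described above: it applies a tie correction and uses a chi-squared approximation rather than the exact no-ties distribution. A sketch with invented ratings:

```python
from scipy import stats

# Invented 1-5 ratings for two groups; many tied values.
group_a = [3, 4, 4, 5, 5, 5, 2, 3, 4, 5]
group_b = [1, 2, 2, 3, 3, 4, 2, 1, 3, 2]

# scipy.stats.kruskal corrects the statistic for ties and computes the
# p-value from a chi-squared approximation.
h_stat, p = stats.kruskal(group_a, group_b)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p:.4f}")
```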
Some nonparametric tests make even more exotic assumptions. The Wilcoxon Signed-Rank Test, for example, assumes that the paired differences are distributed symmetrically about their median.
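A sketch of such a test with invented paired differences, where the symmetry assumption applies to the distribution the differences are drawn from:

```python
from scipy import stats

# Invented paired differences (e.g., "after" minus "before" scores for
# the same respondents). The signed-rank test assumes these differences
# come from a distribution that is symmetric about its median.
deltas = [0.8, 1.5, -0.3, 2.1, 0.9, 1.2, -0.5, 1.7, 0.4, 1.1]

# With no zeros and no tied absolute values, scipy computes an exact
# p-value for this sample size.
w_stat, p = stats.wilcoxon(deltas)
print(f"Wilcoxon signed-rank W = {w_stat}, p = {p:.4f}")
```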
Other than the question of ties, it is routine for researchers to proceed as if nonparametric tests make no assumptions (other than i.i.d.). There is no evidence to suggest this practice is routinely dangerous.