Statistical inference with sampling weights, including statistical testing, differs from traditional statistical analysis in several ways:
- The default settings in most data analysis software are not designed for sampling weights
- There are correct approaches to statistical testing with weighted samples
- If you use the wrong method, you get the wrong result
- Hacks have been created, but none work well
The default settings in most data analysis software are not designed for sampling weights
It has long been known that the default settings in most data analysis software are not designed for sampling weights. This limitation is clearly stated in the documentation of all the main statistical packages. For example, the SPSS Statistics help states that:
The WEIGHT command in the SPSS Base is a frequency or replication weight. Even though it allows the specification of noninteger values, such values are not treated as sampling weights. In order to properly analyze data from complex samples, you need to use the SPSS Complex Samples module.
There are correct approaches to statistical testing with weighted samples
Where there are sampling weights, there are correct formulas for most standard problems. For more information, see Kirk Wolter, Introduction to Variance Estimation (2007).
As a simple example, consider this small data set of 10 cases. The sample contains only one male and nine females, so weighting can be used to correct this imbalance. A sampling weight, which sums to 100, has been computed.
The summary table of the approval data from above is shown below, where:
- The weighted proportion of people approving (Yes) is 27.78%.
- The standard error is 0.18. (The standard error is the standard measure used for quantifying sampling error.)
- The confidence interval for approval is -6.97% to 62.53% (i.e., its width is approximately four times the standard error).
Provided that the correct settings are chosen, you get these results whether using SPSS Complex Samples, R with the survey package, Displayr, or Q.
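As a cross-check, the correct standard error can be reproduced with the Taylor-series linearization estimator that these packages use. The 0/1 approval data below is an assumption consistent with the numbers above (the man and four women answering No, five women answering Yes), with the man carrying half the total weight:

```python
import math

# Hypothetical data consistent with the example:
# 1 man (answers No) and 9 women (5 Yes, 4 No); weights sum to 100.
y = [0] + [1] * 5 + [0] * 4       # 0 = No, 1 = Yes
w = [50.0] + [100.0 / 18] * 9     # the one man gets half the total weight

n = len(y)
total_w = sum(w)
p = sum(wi * yi for wi, yi in zip(w, y)) / total_w   # weighted proportion

# Taylor-series linearization variance for a weighted mean under
# with-replacement simple random sampling:
z = [wi * (yi - p) / total_w for wi, yi in zip(w, y)]
var = n / (n - 1) * sum(zi ** 2 for zi in z)
se = math.sqrt(var)

print(round(p, 4))   # 0.2778
print(round(se, 2))  # 0.18
```

This reproduces both the weighted proportion of 27.78% and the standard error of 0.18 quoted above.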
If you use the wrong method you get the wrong result
SPSS Statistics does not have a menu-based approach for calculating the standard error of a proportion, but the formula for the standard error of the mean is almost identical, so it is used instead (the difference between the two is discussed later in this article). The weighted analysis performed with SPSS's default settings concludes that:
- Approval is 27.78%. This is the same as the correct result.
- The standard error of the mean is calculated at 0.045 (4.5%). This is labeled as Std. Error Mean in the first table of the output below. This is approximately 1/4 of the correct result. That is, with this example, using the default settings in SPSS gives a massively wrong result, understating the true uncertainty by a factor of four.
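The mechanics of the error can be sketched as follows. Treating the sampling weights as frequency weights makes the software behave as if there were sum-of-weights = 100 observations, so the standard error is divided by √100 rather than reflecting the much smaller effective sample size. The 0/1 data is the same assumption as before, consistent with the example:

```python
import math

# Hypothetical data consistent with the example: 1 man (No), 9 women (5 Yes, 4 No)
y = [0] + [1] * 5 + [0] * 4
w = [50.0] + [100.0 / 18] * 9   # sampling weights, summing to 100

total_w = sum(w)                # 100: treated as the "sample size" by default
p = sum(wi * yi for wi, yi in zip(w, y)) / total_w

# Frequency-weighted variance (divisor is sum-of-weights minus 1),
# which is how a default WEIGHT-style analysis computes it:
var = sum(wi * (yi - p) ** 2 for wi, yi in zip(w, y)) / (total_w - 1)
sd = math.sqrt(var)

se_wrong = sd / math.sqrt(total_w)
print(round(se_wrong, 3))  # 0.045 -- about a quarter of the correct 0.18
```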
Hacks have been created, but none work well
Hacks have been developed to trick traditional algorithms into giving better results than their default settings produce. None of the hacks works well; all give the wrong answers. However, all are preferable to the default treatment of sampling weights. In declining order of quality, the most common hacks are:
- Resampling
- Weight calibration
- Modifying the weight to have an average weight of 1
- Using the observed sample size
- Using the weight as is
Resampling
Resampling creates a new synthetic data set by randomly selecting cases, with replacement, from the existing data set. Cases are selected with probability proportional to the weight. That is, a weighted bootstrap is used to create the data set.
This approach is always inferior to variance estimation because:
- Random selection adds noise.
- It ignores correlations between the weight and the variables in the analysis.
- It is easy to make mistakes when using the approach.
Weight calibration
Weight calibration involves modifying the weight so that it sums to the effective sample size. The resulting weight is sometimes known as the calibrated weight. This approach is used in Survey Reporter, Quantum, and many traditional crosstab packages.
In the example above, the effective sample size is 3.6. Intuitively, this makes sense, in that:
- As there is only 1 man in the sample, clearly the effective sample size has to be less than the actual sample size of 10.
- As there are 9 women in the sample, clearly the effective sample size must be more than 2 (i.e., we clearly gain some information by having so many women in the sample).
Thus, if we employ weight calibration, we end up with a weight of 1.8 for the man and 0.2 for each woman.
The table below shows SPSS's standard error calculation with weight calibration. As discussed above, the correct standard error for this problem is 0.18. However, weight calibration shows a much higher value of 0.28.
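Under the assumed 0/1 data consistent with the example, the effective sample size, the calibrated weights, and the resulting frequency-weighted standard error can be sketched as:

```python
import math

# Hypothetical data consistent with the example: 1 man (No), 9 women (5 Yes, 4 No)
y = [0] + [1] * 5 + [0] * 4
w = [50.0] + [100.0 / 18] * 9

# Kish effective sample size: (sum of weights)^2 / (sum of squared weights)
ess = sum(w) ** 2 / sum(wi ** 2 for wi in w)
print(round(ess, 1))  # 3.6

# Calibrated weights: rescale so the weights sum to the effective sample size
cal = [wi * ess / sum(w) for wi in w]
print(round(cal[0], 1), round(cal[1], 1))  # 1.8 0.2

# Frequency-weighted standard error, as computed with the calibrated weights
p = sum(c * yi for c, yi in zip(cal, y)) / ess
var = sum(c * (yi - p) ** 2 for c, yi in zip(cal, y)) / (ess - 1)
se_cal = math.sqrt(var) / math.sqrt(ess)
print(round(se_cal, 2))  # 0.28 -- above the correct 0.18
```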
Modifying the weight to have an average weight of 1
With this approach, the weight is modified so that it has an average of 1. Equivalently, the weights are forced to sum to the actual sample size, which in this example is 10.
The SPSS output is shown below. It is also wrong, showing a smaller standard error than is correct. In this specific example, it is closer to the correct value, but this is just a fluke (if forced to choose, calibrating to the effective sample size is likely superior).
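The same calculation with the weights rescaled to average 1 (summing to n = 10), again under the assumed 0/1 data, looks like:

```python
import math

# Hypothetical data consistent with the example: 1 man (No), 9 women (5 Yes, 4 No)
y = [0] + [1] * 5 + [0] * 4
w = [50.0] + [100.0 / 18] * 9
n = len(y)

# Rescale the weights to sum to the actual sample size (average weight of 1)
w1 = [wi * n / sum(w) for wi in w]

p = sum(wi * yi for wi, yi in zip(w1, y)) / n
var = sum(wi * (yi - p) ** 2 for wi, yi in zip(w1, y)) / (n - 1)
se_avg1 = math.sqrt(var) / math.sqrt(n)
print(round(se_avg1, 4))  # 0.1493 -- below the correct 0.18
```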
Using the actual sample size in formulas (e.g., using Excel)
This approach proceeds by using the standard formulas taught in introductory statistics texts, where the weighted statistics are used in the formulas (e.g., mean, proportion, standard deviation), except for the sample size, which is used unweighted. This approach is traditionally implemented using Excel spreadsheets.
From the above SPSS output, we know the weighted standard deviation and the sample size, so we can use the standard formula:
s / √n = 0.47213 / √10 = 0.14930
Note that this is the same (wrong) result as in the previous section.
Similarly, if we compute the standard error for the proportion, we have the same problem:
√[p (1 − p) / n] = √[0.2778 × (1 − 0.2778) / 10] = 0.14164
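Both textbook formulas above can be checked directly, taking the weighted proportion p, the weighted standard deviation s, and the unweighted sample size n from the example:

```python
import math

n = 10          # unweighted sample size
p = 0.2778      # weighted proportion approving
s = 0.47213     # weighted standard deviation, from the SPSS output

se_mean = s / math.sqrt(n)                 # standard error of the mean
se_prop = math.sqrt(p * (1 - p) / n)       # standard error of the proportion

print(round(se_mean, 5))  # 0.1493
print(round(se_prop, 5))  # 0.14164
```

Both values understate the correct standard error of 0.18, because the unweighted sample size n = 10 overstates the information in this heavily weighted sample.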
Use the weight as is
The final approach is to use the weight “as is” with the default settings. This is clearly much worse than any of the other hacks.