Experimental designs are evaluated and compared using the following criteria:
- Questions are sensible and answerable
- Balance
- Pairwise balance
- Overlap
- Standard errors
- Using simulations to compare designs
The most important test of any design is that the questions it generates are sensible and answerable. See Choice of Experimental Design Algorithm for an illustration of this approach.
A design is said to be balanced when, for each attribute, all of its levels appear the same number of times. When comparing designs, an experimental design is more balanced if its level frequencies are closer to this ideal.
When comparing designs it is useful to summarize all the balance information as a single metric. One measure is the Mean Version Balance, which is calculated as follows:
- The balance of an attribute is defined as the sum across levels of the absolute differences between the counts and the average count.
- This value is then scaled to give a value of 1 if all the absolute differences are zero (i.e., the count of every level equals the mean count, which implies perfect balance).
- The worst design, which repeats the same level for all alternatives, has a balance of zero.
- This is calculated for each version of the design and then averaged.
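As a sketch, the per-attribute balance described above can be computed as follows. The function name and the exact normalization are illustrative assumptions; the scaling used by any given package may differ:

```python
from collections import Counter

def attribute_balance(levels_shown, n_levels):
    """Balance of one attribute within one version: 1 means every level
    appears equally often; 0 means a single level is repeated throughout.
    `levels_shown` lists the level (0..n_levels-1) shown in each cell."""
    counts = Counter(levels_shown)
    n = len(levels_shown)
    mean = n / n_levels
    total_dev = sum(abs(counts.get(lv, 0) - mean) for lv in range(n_levels))
    # Worst case: a single level accounts for all n appearances.
    worst_dev = (n - mean) + (n_levels - 1) * mean
    return 1 - total_dev / worst_dev

# A perfectly balanced attribute scores 1; a constant one scores 0.
print(attribute_balance([0, 1, 2, 0, 1, 2], 3))  # 1.0
print(attribute_balance([0, 0, 0, 0, 0, 0], 3))  # 0.0
```

The Mean Version Balance would then be the average of this value over all attributes within each version, averaged over versions.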
The Across Version Balance is the balance calculated for the whole design across all versions, then averaged across all attributes.
All else being equal - and it isn't always - the more balanced a design, the better. For examples of this, see Choice of Experimental Design Algorithm.
Pairwise balance is analogous to balance, except that it relates to the balance of pairs of attribute levels (i.e., how often each combination of levels from two different attributes appears).
The Mean Version Pairwise Balance is the same as the Mean Version Balance, except that it is calculated across all pairs of attributes. The Across Version Pairwise Balance is the pairwise version of the Across Version Balance. These metrics are not meaningful for designs that do not seek balance (e.g., alternative-specific and partial-profile designs).
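A minimal sketch of the pairwise calculation, using the same (assumed) scaling as for ordinary balance applied to the joint counts of level pairs:

```python
from collections import Counter

def pairwise_balance(col_a, col_b, n_levels_a, n_levels_b):
    """Balance of the joint (level_a, level_b) counts for one pair of
    attributes: 1 is perfect balance, 0 is the worst case. Hypothetical
    helper; actual software may normalize differently."""
    counts = Counter(zip(col_a, col_b))
    n = len(col_a)
    n_cells = n_levels_a * n_levels_b
    mean = n / n_cells
    total_dev = sum(abs(counts.get((a, b), 0) - mean)
                    for a in range(n_levels_a) for b in range(n_levels_b))
    # Worst case: one level combination accounts for all n appearances.
    worst_dev = (n - mean) + (n_cells - 1) * mean
    return 1 - total_dev / worst_dev

# Every combination of two binary attributes appears once: perfect balance.
print(pairwise_balance([0, 0, 1, 1], [0, 1, 0, 1], 2, 2))  # 1.0
```

The Mean Version Pairwise Balance averages this over all pairs of attributes within each version.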
Overlap refers to the extent to which an attribute level appears more than once in the same question. For example, if there are three alternatives and none of the alternatives have any attribute levels in common, then there is no overlap. Where there are more alternatives than an attribute has levels, there will always be some overlap.
The less overlap, the more information is obtained from the data collected using the design. However, designs with no overlap are cognitively harder for respondents, and many researchers believe that a degree of overlap is beneficial.
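Counting overlap is straightforward. The sketch below assumes a question is represented as a list of alternatives, each a list of levels (an illustrative representation, not any particular tool's format):

```python
def attribute_overlap(question, attribute_index):
    """Number of repeated levels of one attribute within a single question.
    `question` is a list of alternatives; each alternative is a list of
    levels. Returns 0 when all alternatives show distinct levels."""
    levels = [alternative[attribute_index] for alternative in question]
    return len(levels) - len(set(levels))

# Three alternatives but a two-level attribute: some overlap is unavoidable.
print(attribute_overlap([[0], [1], [0]], 0))  # 1
print(attribute_overlap([[0], [1], [2]], 0))  # 0
```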
Designs can also be compared by calculating the expected standard errors of the parameter estimates. The lower, the better. A simple summary measure that can be used for comparing designs is the average of each design's standard errors.
A better, but more complicated summary of standard errors is the d-error. A lower d-error indicates a better design. It is usual to compare d-errors for designs created using different algorithms, rather than consider the number by itself.
The d-error is a similar idea to the effective sample size. A design with a smaller d-error requires fewer respondents to provide the same level of precision.
Although the d-error is appealing in principle, the d-errors that can practically be calculated assume a very simple model (either linear regression or multinomial logit), and such models are known to be inappropriate for choice experiments. So, while there is value in comparing designs based on d-error, it is by no means a definitive metric. This is also true of standard errors.
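Under the linear-regression simplification mentioned above, the d-error of a coded design matrix X can be sketched as det((X'X)⁻¹)^(1/k), where k is the number of parameters. This is only an illustration of the simplest case; the multinomial-logit d-error additionally depends on the assumed utilities:

```python
import numpy as np

def d_error(X):
    """D-error of a coded design matrix X under the linear-model
    simplification: det((X'X)^-1) ** (1/k). Lower is better."""
    k = X.shape[1]
    information = X.T @ X
    return np.linalg.det(np.linalg.inv(information)) ** (1 / k)

# An orthogonal (balanced, uncorrelated) design beats a partly
# confounded one on d-error.
orthogonal = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
confounded = np.array([[1, 1], [1, 1], [-1, -1], [-1, 1]], dtype=float)
print(d_error(orthogonal))  # ~0.25
print(d_error(confounded))  # ~0.289, i.e. worse
```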
Using simulations to compare designs
A more complicated approach to comparing designs is to:
- Hypothesize the likely results of the experiment. A lazy but useful hypothesis is that all the utilities will be 0.
- Simulate data consistent with the hypothesized results.
- Fit models to the data (e.g., hierarchical Bayes), and evaluate:
- The difference between key results (e.g., the estimated utilities and the hypothesized utilities).
- Prediction accuracy.
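The simulation step above can be sketched as follows, assuming multinomial-logit choices and a simple representation of the design (the function name, data structures, and the omission of respondent heterogeneity are all simplifying assumptions; a real study would fit a model such as hierarchical Bayes to the simulated data and compare the estimates with the hypothesized utilities):

```python
import math
import random

def simulate_mnl_choices(design, utilities, n_respondents=200, seed=1):
    """Simulate multinomial-logit choices for a design under hypothesized
    part-worths. `design` is a list of questions; each question is a list
    of alternatives; each alternative is a tuple of levels. `utilities`
    maps (attribute_index, level) to a part-worth; missing keys are 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_respondents):
        for q_index, question in enumerate(design):
            v = [sum(utilities.get((a, level), 0.0)
                     for a, level in enumerate(alternative))
                 for alternative in question]
            weights = [math.exp(x) for x in v]  # MNL choice probabilities
            choice = rng.choices(range(len(question)), weights=weights)[0]
            data.append((q_index, choice))
    return data

# Under the lazy hypothesis that all utilities are 0, every alternative
# in a question is equally likely to be chosen.
design = [[(0, 0), (0, 1), (1, 0)]]
choices = [c for _, c in simulate_mnl_choices(design, {}, n_respondents=3000)]
print([round(choices.count(i) / len(choices), 2) for i in range(3)])
```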
For instructions on how to do this using Displayr, see: