The best way to compare choice models is based on their predictive accuracy with holdout samples (i.e., cross-validation accuracy). This article describes:
- What makes the comparison of choice models difficult
- Standard practice is to compare using cross-validation (holdout samples)
- Number of questions to use
- Percentage accuracy versus RLH
- Case study 1: Numeric versus categorical attribute
- Case study 2: Segmented model
What makes the comparison of choice models difficult
Comparing two or more choice models to work out which best describes the data is a surprisingly difficult problem. Most of the traditional approaches used for model comparison are not applicable. For example:
- Traditional hypothesis tests are not applicable. That is, none of the traditional chi-square or F-tests are applicable. This is because all modern choice models are mixture models, and the mixing distribution is not identifiable for the null hypothesis.
- R-squared and pseudo-r-squareds cannot be used as there is no way to adjust them to deal with differing numbers of parameters in the models.
- Informational criteria, such as the AIC and BIC, cannot be used as there is no theoretical basis for calculating the penalty based on the number of parameters (and, with Hierarchical Bayesian models, it is not even clear how many parameters they have).
- Choice-based conjoint models use repeated measures (i.e., multiple questions per respondent), and standard cross-validation methods, which assume independent observations, are not directly applicable to repeated measures data.
- Hypothesis tests assume that the noise in data is due to sampling, whereas with conjoint models there are inevitably also response biases and cognitive errors as sources of noise.
Standard practice is to compare using cross-validation (holdout samples)
The standard solution to this problem has been to use a form of cross-validation known as holdout tasks. The basic idea is that you ask each person, say, 12 questions, and then use 10 of them to fit the model, and see how well it predicts the other two questions (i.e., tasks).
There are two different ways of generating the holdout questions:
- Scenarios of interest. For example, one question may show alternatives that represent the current state of the market and the second may show a key scenario that the stakeholders are interested in (e.g., with one of the products at a different price, or, a new product introduced).
- Random selection. For example, for each respondent, randomly select two of the 12 questions to use as holdouts.
Each of these approaches has its strengths.
If you can identify the scenarios of interest ahead of time with confidence, then this is the best approach: ultimately, you can just report the observed results for those scenarios, and these will typically be more reliable than the predictions of a choice model (fitting a choice model involves making many assumptions).
However, if there's no real basis for confidently identifying scenarios of interest, and the only goal is to assess predictive accuracy, then random selection is preferable. This is because:
- When scenarios of interest are used, they will inevitably contain minimal variation, which greatly reduces their usefulness.
- Inevitably the level of error in a conjoint study varies throughout the questionnaire (e.g., the errors on the first question will be different from those on the 10th). By randomly selecting we ensure that we account for a representative level of such errors.
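As a concrete illustration, the random-selection approach can be sketched as follows. This is a minimal sketch, not Displayr's implementation; the task counts (12 questions, 2 holdouts) are taken from the example above, and the function name is illustrative.

```python
import random

def split_tasks(n_tasks=12, n_holdout=2, seed=None):
    """Randomly choose holdout tasks for one respondent; the remaining
    tasks are used to fit the model."""
    rng = random.Random(seed)
    holdout = sorted(rng.sample(range(1, n_tasks + 1), n_holdout))
    training = [t for t in range(1, n_tasks + 1) if t not in holdout]
    return training, holdout

# Each respondent gets an independent random draw, so holdout tasks are
# spread across all positions in the questionnaire.
splits = {respondent: split_tasks(seed=respondent) for respondent in range(1, 6)}
```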
Number of questions to use
The more questions that are used for cross-validation, the more robust the cross-validation will be. However, the more questions that are held out, the fewer remain for fitting, so the less robust the model itself will be. Consequently, it is recommended to:
- Initially use a single randomly selected question per respondent for cross-validation.
- Explore a higher number of questions, provided that:
- The in-sample predictive accuracy remains relatively constant
- The distribution of the coefficients of the model remains consistent.
Percentage accuracy versus RLH
In general, the easiest way to compare models is to evaluate the percentage of choices predicted accurately (the hit rate). However, this statistic is subject to a high level of noise, so the comparison can be augmented by examining differences in the RLH (root likelihood). See Cleaning Choice-Based Conjoint Data with the RLH.
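Both statistics can be computed directly from holdout predictions. The sketch below assumes the model outputs a predicted probability for each alternative in each holdout task; the function names are illustrative, not part of any particular package.

```python
import math

def hit_rate(predicted, actual):
    """Share of holdout tasks where the highest-probability alternative
    matches the alternative the respondent actually chose."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

def rlh(chosen_probabilities):
    """Root likelihood: the geometric mean of the predicted probabilities
    assigned to the alternatives that were actually chosen."""
    log_mean = sum(math.log(p) for p in chosen_probabilities) / len(chosen_probabilities)
    return math.exp(log_mean)

# Toy example: 4 holdout tasks with 3 alternatives each (chance RLH is about 1/3).
accuracy = hit_rate([0, 2, 1, 0], [0, 2, 2, 0])  # 0.75
fit = rlh([0.6, 0.5, 0.3, 0.7])
```

Because RLH uses the full predicted probabilities rather than just the top pick, it is less noisy than the hit rate, which is why it is a useful supplementary comparison.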
Case study 1: Numeric versus categorical attribute
The top part of a Displayr Hierarchical Bayes output is shown below. One question has been used for cross-validation, with a resulting predictive accuracy of 49.4%.
The output below replaces the salary attribute, previously modeled as five categorical levels, with a single, numeric attribute. Note that:
- For this data set, the model with the numeric attribute is better, with a predictive accuracy of 57.5% versus 49.4% (and a higher RLH).
- The in-sample predictive accuracy is worse for the numeric attribute (76.9% versus 83.7%). This is as expected: when salary is modeled with five levels, the model has three more parameters, so it should provide a better in-sample fit, but in this case the improvement is due to overfitting.
- The result is specific to the data set. In this case study, when a larger sample was used the result changed, with the numeric attribute becoming inferior.
Case study 2: Segmented model
Another modeling strategy is to use a segmented model, where separate models are fit to different segments. Three models were estimated:
- The model for the total sample, introduced in the previous case study, with a predictive accuracy of 49.4%.
- One segment, comprising 57% of the sample, consisted of people who can do their work from home. The cross-validation accuracy in this segment was 49.4%.
- The second segment of 43% of the sample had a holdout cross-validation accuracy of 44.0%.
We need to combine the results from the two segments to calculate an overall accuracy for the segmented model: 57% * 49.4% + 43% * 44.0% = 47%. As 47% is less than the 49.4% cross-validation accuracy of the total-sample model, we conclude that we are better off using a single model rather than the segmented model.
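The weighting step above is simple arithmetic: each segment's holdout accuracy is weighted by its share of the sample.

```python
# Weight each segment's holdout accuracy by its share of the sample to get
# the segmented model's overall cross-validation accuracy.
segments = [(0.57, 0.494),   # work-from-home segment: (share, accuracy)
            (0.43, 0.440)]   # remaining segment: (share, accuracy)
combined = sum(share * accuracy for share, accuracy in segments)
print(f"{combined:.1%}")  # 47.1%
```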
As with the previous case study, this conclusion depends on the sample. This test was performed using a soft launch (a partial initial sample). When the entire sample was used, the result flipped around, favoring the segmented model.