Introduction to the Multinomial Logit Model – The Data Story Guide

The multinomial logit model is central to analyzing choice-based conjoint data. The basic mechanics of the model are described with three examples:

Worked example with one attribute
Worked example with two attributes
Worked example with a categorical and a numeric attribute

The article then discusses some practicalities and theory:

The multinomial logit model is a building block, not a model that you should actually use
Multinomial logit and random utility theory
The two types of multinomial logit models

Worked example with one attribute

The multinomial logit model is easiest to understand by jumping straight into a worked example. Let’s say we have a very simple experiment, where we have just asked people to choose between Coke, Pepsi, and Fanta, and we observed that 20% choose Coke, 70% Pepsi, and 10% Fanta.

In the language of choice-based conjoint, we have a single choice question, and a single attribute, brand, with three levels. As discussed in The Assumptions of Choice-Based Conjoint, we want to measure the utility of each of these brands, in such a way that we can predict the observed data.

The multinomial logit model assumes that the proportion of people who chose Coke is given by the following formula:

Proportion that chose Coke = Appeal of Coke / (Appeal of Coke + Appeal of Pepsi + Appeal of Fanta)

For this formula to be practical we need to introduce an additional assumption. In particular, we need to require that the appeal of each of the brands is positive, because if any are allowed to be negative, the formula can predict negative proportions.

A simple way of achieving this is to define a new concept, called Utility, which can take any value, and then define appeal as exp(Utility). If you've forgotten what exp does, it ensures that all negative numbers become positive (among other things). For example, exp(-2) = 0.14, exp(-1) = 0.37, exp(0) = 1, exp(1) = 2.7, exp(2) = 7.4.

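The formula for the proportion of people that chose Coke then becomes:

Proportion that chose Coke = exp(Utility of Coke) / (exp(Utility of Coke) + exp(Utility of Pepsi) + exp(Utility of Fanta))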

The trick then is to compute the three utility values which explain our observed data. There are lots of ways of doing this, but the most common approach is maximum likelihood estimation. Unless you are curious, there is no need to understand how it works, because for this problem it just about always does a great job, so the trick is just to press a button and feel confident.
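In this one-question, one-attribute case the maximum likelihood utilities reproduce the observed proportions exactly, so they can be written down directly as the log of each proportion relative to a reference brand. The following is a minimal Python sketch under that assumption; the variable names are illustrative, and a real study with several attributes needs an estimation routine instead.

```python
import math

# Observed choice proportions from the single choice question.
observed = {"Coke": 0.20, "Pepsi": 0.70, "Fanta": 0.10}

# Fix the reference brand's utility at 0 and express the others relative to it.
reference = "Coke"
utilities = {
    brand: math.log(proportion / observed[reference])
    for brand, proportion in observed.items()
}

for brand, utility in utilities.items():
    print(f"{brand}: {utility:.2f}")
```

Choosing a different reference brand shifts every utility by the same constant and leaves the predicted proportions unchanged.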

For example, with proportions of 20% for Coke, 70% for Pepsi, and 10% for Fanta, the following utilities correctly predict our shares: Coke 0, Pepsi 1.25, and Fanta -0.69. You can verify this in pretty much any program (e.g., Displayr, Excel, R), by typing:

exp(0) / (exp(0) + exp(1.25) + exp(-0.69))

And verifying that it gives the answer of 0.2003238. It is not precisely 20% (i.e., 0.2), because I have only shown the first two decimal places of the utility estimates.
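The same check can be extended to all three brands at once. This is a small Python sketch; the function name `logit_shares` is an illustrative choice, not something from a particular package.

```python
import math

def logit_shares(utilities):
    """Return the multinomial logit share of each alternative."""
    # Appeal is exp(utility), which is always positive.
    appeal = {name: math.exp(u) for name, u in utilities.items()}
    total = sum(appeal.values())
    return {name: a / total for name, a in appeal.items()}

shares = logit_shares({"Coke": 0.0, "Pepsi": 1.25, "Fanta": -0.69})
for brand, share in shares.items():
    print(f"{brand}: {share:.4f}")
```

The three shares add up to 1, and each lands close to (but, because of the rounded utilities, not exactly on) the observed 20%, 70%, and 10%.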

You may think it is a bit odd that Coke has a utility of exactly 0. The reason is that the formula for computing shares only ever compares relative values. We would still get the same estimates of shares if we had assumed the utilities were 100, 101.25, and 99.31. Consequently, we simplify things by setting the first attribute level's utility to 0. Occasionally people follow different conventions. For example:

Effects coding instead creates the utilities so that they have an average of 0. In this example, the utilities for Coke, Pepsi, and Fanta become respectively -0.19, 1.07, and -0.88. Note that the differences between these utilities remain the same, and it is these differences that determine the predictions of the multinomial logit model rather than the actual values of the utilities.
Set the utilities so that the lowest is 0, which gives us utilities of Coke 0.69, Pepsi 1.95, and Fanta 0.00.

It is also common practice to use software that computes the utilities in one way and then adjusts the utilities later to change to whichever of these systems (0 for the first, mean of 0, or lowest is 0) the person presenting the results prefers.
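The equivalence of these conventions can be checked numerically. The sketch below, in Python with illustrative variable names, re-expresses the same utilities in each system and confirms the predicted shares do not move. Because it starts from the rounded two-decimal utilities, the second decimal of a converted value can differ slightly from the figures quoted above.

```python
import math

def logit_shares(utilities):
    """Multinomial logit shares for a list of utilities."""
    appeal = [math.exp(u) for u in utilities]
    total = sum(appeal)
    return [a / total for a in appeal]

# Order: Coke, Pepsi, Fanta.
first_is_zero = [0.0, 1.25, -0.69]

# Effects-coding style: subtract the mean so the utilities average 0.
mean = sum(first_is_zero) / len(first_is_zero)
mean_is_zero = [u - mean for u in first_is_zero]

# Shift so that the least preferred level sits at 0.
lowest_is_zero = [u - min(first_is_zero) for u in first_is_zero]

# Adding any constant to every utility changes nothing either.
shifted_by_100 = [u + 100 for u in first_is_zero]

for label, utilities in [
    ("first is 0", first_is_zero),
    ("mean is 0", mean_is_zero),
    ("lowest is 0", lowest_is_zero),
    ("shifted by 100", shifted_by_100),
]:
    print(label, [round(s, 4) for s in logit_shares(utilities)])
```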

Worked example with two attributes

Now let’s consider a study with two attributes, brand and price. Assuming we have created an appropriate experimental design, we run the multinomial logit software and get estimates for the utilities of both of our attributes. For brand we get Coke: 0, Pepsi: 1.25, and Fanta: -0.69, and for the four price levels we get $1: 0, $2: -1.30, $3: -2.90, and $4: -3.85. From the utilities, we know that Pepsi is most preferred and that people prefer to pay less.

As discussed in Applications of Choice-Based Conjoint, we assume in choice-based conjoint that preference for alternatives is determined by the sum of the utilities for the attributes. So, if we wanted to predict share for Coke at $1, Pepsi at $2, and Fanta at $1, we first calculate each of their utilities: Coke: 0 + 0 = 0, Pepsi: 1.25 - 1.30 = -0.05, and Fanta: -0.69 + 0 = -0.69. This tells us that if Pepsi is $1 more expensive than Coke, Coke becomes more preferred (i.e., Coke's utility of 0 is above Pepsi's utility at $2 of -0.05). We can work out how much more preferred using the multinomial logit formula:

Market share of Coke = e^0 / (e^0 + e^(-0.05) + e^(-0.69))

This is the same formula as before, but:

We are showing the utility of each alternative in the formula, rather than just for the single attribute as shown earlier.
Before, the formula was Proportion that chose Coke, and now it is labeled as market share. These two things aren't always the same of course.
The formula uses e rather than exp(), as it means the same thing, but is more space-efficient.

In the previous section, we were able to deduce the utilities for the brand attribute from a single question. With two attributes we need to ask multiple questions and use software to find the utilities that best predict the answers to all these questions.
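The share calculation for Coke at $1, Pepsi at $2, and Fanta at $1 can be sketched in Python as follows. The dictionary and variable names are illustrative; the utilities are the estimates quoted in this section.

```python
import math

# Estimated utilities for each attribute level.
brand_utility = {"Coke": 0.0, "Pepsi": 1.25, "Fanta": -0.69}
price_utility = {1: 0.0, 2: -1.30, 3: -2.90, 4: -3.85}

# The scenario to simulate: the price (in dollars) of each brand.
scenario = {"Coke": 1, "Pepsi": 2, "Fanta": 1}

# The utility of an alternative is the sum of its attribute utilities.
total_utility = {
    brand: brand_utility[brand] + price_utility[price]
    for brand, price in scenario.items()
}

# Apply the multinomial logit formula to the summed utilities.
appeal = {brand: math.exp(u) for brand, u in total_utility.items()}
market_share = {brand: a / sum(appeal.values()) for brand, a in appeal.items()}

for brand in scenario:
    print(f"{brand}: utility {total_utility[brand]:.2f}, share {market_share[brand]:.3f}")
```

Changing the prices in `scenario` is all that is needed to simulate a different market.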

Worked example with a categorical and a numeric attribute

In the example in the previous section we looked at price as an attribute with estimated coefficients of $1: 0, $2: -1.30, $3: -2.90, and $4: -3.85. How can we work out the utility of a price that we did not test? The chart below, which plots the utilities, provides an answer to this question. If we fit a line of best fit to the utilities, we work out that the slope of this line is -1.32. That is, on average, every extra dollar in price leads to a reduction in the utility of 1.32.
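The line of best fit can be reproduced with ordinary least squares. The Python sketch below uses only the four estimated price utilities; `price_utility` is an illustrative helper, and the $2.50 price is just an example of a level that was not tested.

```python
prices = [1, 2, 3, 4]
utilities = [0.0, -1.30, -2.90, -3.85]

mean_price = sum(prices) / len(prices)
mean_utility = sum(utilities) / len(utilities)

# Ordinary least squares slope and intercept for a single predictor.
slope = sum(
    (p - mean_price) * (u - mean_utility) for p, u in zip(prices, utilities)
) / sum((p - mean_price) ** 2 for p in prices)
intercept = mean_utility - slope * mean_price

def price_utility(price):
    """Utility read off the fitted line for any price, tested or not."""
    return intercept + slope * price

print(f"slope: {slope:.3f}")
print(f"utility at $2.50: {price_utility(2.5):.2f}")
```

Note that the fitted line does not pass exactly through the estimated utilities, so its value at $1 is not exactly 0.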

The slope is more commonly known as the coefficient of price. There are two different ways that we can compute the coefficient. One is to estimate the utilities for each level and compute the line of best fit, as done here. The more orthodox approach is to tell the software to treat price as numeric and have it estimate the coefficient directly. In practice, this second approach is superior. The more important decision to make is whether it is appropriate to treat price as numeric or not.

For more information about this topic, see Numeric versus Categorical Attributes in Choice Models.

The multinomial logit model is a building block, not a model that you should actually use

From the mid-1970s through to the mid-1990s the multinomial logit model was widely used in choice-based conjoint and market share modeling, even in situations where the assumption of everybody having the same preference is clearly inappropriate (e.g., see David A. Hensher and Lester W. Johnson (1980), Applied Discrete Choice Modelling, Croom Helm Limited).

Nevertheless, it is not appropriate to use it today, as it has long since been supplanted by vastly superior models. See Hierarchical Bayes for Choice-Based Conjoint.

Multinomial logit and random utility theory

You might be thinking, why do we use the multinomial logit formula rather than some other formula? There are two different explanations. The elegant explanation is that the formula can be derived (e.g., see Daniel McFadden (1973), Conditional logit analysis of qualitative choice behavior in Frontiers in Econometrics (P. Zarembka, ed.), Academic Press New York) by making a few simple assumptions:

People make decisions in accordance with the assumptions described in The Assumptions of Choice-Based Conjoint.
All people have the same underlying preferences (e.g., they all prefer Pepsi to Coke and Coke to Fanta).
People make mistakes in making choices, and the nature of their mistakes is consistent with a statistical distribution (i.e., pattern).

A shorthand way of saying the above is that the multinomial logit model is consistent with random utility theory.
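The link between these assumptions and the formula can be illustrated by simulation. The Python sketch below assumes the "mistakes" are independent draws from a standard Gumbel (type I extreme value) distribution, which is the distributional assumption under which the logit formula is derived. Every simulated person has the same utilities, adds their own random error to each brand, and picks the brand with the highest total. The sample size and seed are arbitrary choices.

```python
import math
import random

random.seed(0)

utilities = {"Coke": 0.0, "Pepsi": 1.25, "Fanta": -0.69}
n_people = 100_000

def gumbel_draw():
    """Draw from a standard Gumbel distribution via the inverse CDF."""
    u = random.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = random.random()
    return -math.log(-math.log(u))

counts = dict.fromkeys(utilities, 0)
for _ in range(n_people):
    # Shared utility plus a person-specific random error for each brand.
    perceived = {brand: u + gumbel_draw() for brand, u in utilities.items()}
    counts[max(perceived, key=perceived.get)] += 1

simulated = {brand: c / n_people for brand, c in counts.items()}

# Shares predicted by the multinomial logit formula, for comparison.
total = sum(math.exp(u) for u in utilities.values())
predicted = {brand: math.exp(u) / total for brand, u in utilities.items()}

for brand in utilities:
    print(f"{brand}: simulated {simulated[brand]:.3f}, formula {predicted[brand]:.3f}")
```

With a large enough sample the simulated choice frequencies settle close to the shares given by the formula.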

Some people read the above and think “ah, the multinomial logit model assumes everybody is the same; that’s dumb”. Fortunately, this is an error in logic. While the model can be derived from these assumptions, it can be derived in other ways that do not make this assumption.

The two types of multinomial logit models

The model described above is structured around the properties of the alternatives being compared. In our example, we assumed that preference for a brand at a price is determined by the utility of the brand and the utility of the price. Although this seems quite a natural thing to do, it is quite unusual in data analysis. Most predictive analysis techniques instead assume that predictions are a function of the characteristics of the sample (e.g., people). For example, we might predict the proportion of people choosing Coke based on their age, gender, or some other characteristic.

It's possible to create a predictive model which uses the exact same math as described above, except that it's structured around the characteristics of the sample rather than the characteristics of the alternatives. Confusingly, this model is also called the multinomial logit model. Sometimes, to avoid confusion, the multinomial logit model described above is called the conditional logit model.
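To make the contrast concrete, here is a rough Python sketch of the second type of model, in which each brand's utility depends on a characteristic of the person (age) rather than on attributes of the brand. The intercepts and age coefficients are hypothetical numbers invented purely for illustration; they are not estimates from any data.

```python
import math

# Hypothetical coefficients, invented for illustration only.
# Coke is the reference alternative, so its coefficients are fixed at 0.
intercept = {"Coke": 0.0, "Pepsi": 2.0, "Fanta": 0.5}
age_coefficient = {"Coke": 0.0, "Pepsi": -0.02, "Fanta": -0.04}

def choice_probabilities(age):
    """Predicted probability of choosing each brand for a person of a given age."""
    utility = {
        brand: intercept[brand] + age_coefficient[brand] * age
        for brand in intercept
    }
    total = sum(math.exp(u) for u in utility.values())
    return {brand: math.exp(u) / total for brand, u in utility.items()}

for age in (20, 60):
    probabilities = choice_probabilities(age)
    print(age, {brand: round(p, 3) for brand, p in probabilities.items()})
```

The share formula is unchanged; only the source of the utilities differs.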