1 Multinomial Logit (MNL)

Lehigh University
Department of Economics
Muzhe Yang
Spring 2013
Multinomial Models:
Multinomial Logit, Conditional Logit, Nested Logit, Multinomial Probit, and Random Coefficients/Parameters ("Mixed" [1]) Logit
Discrete (qualitative) response models deal with discrete dependent variables. It is strongly recommended that you read and study Train (2009): http://elsa.berkeley.edu/books/choice2.html. [2]
\[
\text{discrete responses}
\begin{cases}
\text{binary: yes/no, participation/non-participation} \;\Rightarrow\; \text{linear probability model (LPM), probit or logit models} \\[4pt]
\text{multinomial}
\begin{cases}
\text{ordered: self-rated health status (excellent/good/bad)} \;\Rightarrow\; \text{latent index models} \\
\text{categorical (mutually exclusive): transportation mode} \;\Rightarrow\; \text{random utility models}
\end{cases}
\end{cases}
\]
1 Multinomial Logit (MNL)
These are the models for discrete choice with more than two alternatives which are not ordered. Suppose that the discrete response variable $y \in \{0, 1, 2, \ldots, J\}$. Examples are travel modes (bus/train/car), employment status (employed/unemployed/out-of-labor-force), marital status (single/married/divorced/widowed) and many others.
We wish to model the distribution of $y$ in terms of covariates. In some cases we need to distinguish between covariates that vary by units $i$ (individuals or firms), denoted by $x_i$ ($i = 1, 2, \ldots, N$), and covariates that vary by alternative $j$ (and possibly individual $i$), denoted by $x_{ij}$ ($i = 1, 2, \ldots, N$; $j = 0, 1, \ldots, J$). Examples of the first type (case-specific covariates) include individual characteristics such as age, gender, and income. An example of the second type (alternative-specific covariates) is the cost (or price) associated with each alternative, such as the cost of commuting by bus or train or car. McFadden developed the interpretation of these models through utility maximizing choice behavior. In that case we may be willing to impose restrictions on how covariates affect alternatives: costs of a particular alternative affect the utility of that alternative, but not the utility of other alternatives.
The modelling strategy is to write out a likelihood function based on each individual $i$'s conditional probability of choosing $j$ given the covariates $x_{ij}$, $\Pr(y_i = j \mid x_{ij}) = p_{ij}(x_{ij}; \theta)$, parameterized by $\theta$:
\[
\ln L(\theta) = \ln \left( \prod_{i=1}^{N} \prod_{j=0}^{J} \Pr(y_i = j \mid x_{ij})^{1\{y_i = j\}} \right) = \sum_{i=1}^{N} \sum_{j=0}^{J} 1\{y_i = j\} \ln p_{ij}(x_{ij}; \theta).
\]
If regressors do not vary across alternatives ($x_{ij} \equiv x_i$), then the multinomial logit (MNL) model is used. MNL specifies the individual response probability as:
\[
\Pr(y_i = j \mid x_i) = \frac{\exp(x_i'\beta_j)}{\sum_{k=0}^{J} \exp(x_i'\beta_k)}, \quad \text{where } i = 1, \ldots, N;\; j = 0, \ldots, J; \text{ and } \beta_0 = 0.
\]
[1] This "mixed" logit model refers to MMA Section 15.7 and http://elsa.berkeley.edu/books/choice2nd/Ch06_p134-150.pdf, not MMA Sections 15.2.3 and 15.4.1.
[2] This handout is supplemental to MMA Chapter 15. Part of this handout is based on my previous notes taken from Guido Imbens's courses given at UC-Berkeley.
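So, the log likelihood function will be:
\[
\ln L(\underbrace{\beta_1, \ldots, \beta_J}_{\text{Note: } \beta_0 = 0})
= \ln \left( \prod_{i=1}^{N} \prod_{j=0}^{J} \Pr(y_i = j \mid x_i)^{1\{y_i = j\}} \right)
= \sum_{i=1}^{N} \sum_{j=0}^{J} 1\{y_i = j\} \ln \Pr(y_i = j \mid x_i)
= \sum_{i=1}^{N} \sum_{j=0}^{J} 1\{y_i = j\} \ln \frac{\exp(x_i'\beta_j)}{\sum_{k=0}^{J} \exp(x_i'\beta_k)} \quad (\text{where } \beta_0 = 0).
\]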
Be aware that there are $J+1$ conditional probabilities given covariates to estimate, and these $J+1$ probabilities must sum to one. So there are only $J$ degrees of freedom: of the $J+1$ parameter vectors $\beta_j$ ($j = 0, 1, 2, \ldots, J$), only $J$ are free and one must be normalized. Here we choose $\beta_0 = 0$ without loss of generality.
This is a direct extension of the binary response logit model where $j \in \{0, 1\}$. Compared with the probit model, it leads to a well-behaved likelihood function and is easy to estimate. More interestingly, it can be viewed as a special case of the following conditional logit.
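As a minimal numerical sketch of the MNL probabilities and log likelihood above, the following uses NumPy. The function names (`mnl_probs`, `mnl_loglik`) and the toy data are illustrative assumptions, not part of the handout; the normalization $\beta_0 = 0$ is imposed by fixing the index of alternative 0 at zero.

```python
import numpy as np

def mnl_probs(X, B):
    """MNL response probabilities.

    X: (N, K) case-specific regressors x_i.
    B: (J, K) array whose rows are beta_1, ..., beta_J (beta_0 = 0 is imposed).
    Returns an (N, J+1) array of Pr(y_i = j | x_i).
    """
    N = X.shape[0]
    # Alternative 0 has index x_i' beta_0 = 0 by the normalization.
    V = np.column_stack([np.zeros(N), X @ B.T])
    V = V - V.max(axis=1, keepdims=True)  # guards against overflow; ratios unchanged
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

def mnl_loglik(X, y, B):
    """Sum over i of ln Pr(y_i = observed choice | x_i)."""
    P = mnl_probs(X, B)
    return np.log(P[np.arange(len(y)), y]).sum()

# Toy data (hypothetical): N = 5 individuals, K = 2 regressors, J + 1 = 3 alternatives.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=5)])
B = np.array([[0.5, -1.0], [-0.2, 0.3]])
y = np.array([0, 2, 1, 1, 0])

P = mnl_probs(X, B)
ll = mnl_loglik(X, y, B)
```

In practice the log likelihood would be passed (with a minus sign) to a numerical optimizer over the $J \times K$ free parameters.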
2 Conditional Logit (CL)
In this case, all covariates vary by alternative (and possibly by individual as well, but that is not essential here). This means that these alternative-specific regressors take different values for different alternatives. Let $x_{ij}$ denote the value of the regressors for individual $i$ and alternative $j$ and $x_i = (x_{i0}', x_{i1}', \ldots, x_{iJ}')'$, then we have the following general representation of the probability of a particular alternative $j$ of individual $i$:
\[
\Pr(y_i = j \mid x_i) = F_j(x_i; \beta) = F_j(x_{i0}', x_{i1}', \ldots, x_{iJ}'; \beta),
\]
where the parameters $\beta$ are constant across alternatives. An example is the conditional logit (CL) model.
In contrast, we may have regressors that are all alternative-invariant, meaning that $x_i$ does not vary across alternatives. Then a general representation of the probability of a particular alternative is:
\[
\Pr(y_i = j \mid x_i) = F_j(x_i; \beta) = F_j(x_i'\beta_0, x_i'\beta_1, \ldots, x_i'\beta_J),
\]
where the parameters $\beta_j$ differ across alternatives and $\beta = (\beta_0', \beta_1', \ldots, \beta_J')'$. [3] An example is the MNL model discussed above.
The distinction between alternative-varying (also called alternative-specific) and alternative-invariant (also called case-specific) regressors is of practical importance, as standard notation and computer programs for multinomial models work exclusively with one or the other. In practice, of course, some regressors may be alternative-varying and others alternative-invariant. In such cases, it is best to use a program written for alternative-varying regressors, because it is possible to go from alternative-invariant regressors to the alternative-varying format. We will show how to do this using McFadden's example, where alternative-invariant regressors are essentially included as interactions with alternative-specific dummy variables.
McFadden proposed the CL model:
\[
\Pr(y_i = j \mid x_{i0}, \ldots, x_{iJ}) = \frac{\exp(x_{ij}'\beta)}{\sum_{k=0}^{J} \exp(x_{ik}'\beta)}, \quad \text{where } i = 1, \ldots, N;\; j = 0, \ldots, J.
\]
[3] Parameter identification requires a normalization such as $\beta_0 = 0$.
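The log likelihood function will be:
\[
\ln L(\beta) = \ln \left( \prod_{i=1}^{N} \prod_{j=0}^{J} \Pr(y_i = j \mid x_{i0}, \ldots, x_{iJ})^{1\{y_i = j\}} \right)
= \sum_{i=1}^{N} \sum_{j=0}^{J} 1\{y_i = j\} \ln \Pr(y_i = j \mid x_{i0}, \ldots, x_{iJ})
= \sum_{i=1}^{N} \sum_{j=0}^{J} 1\{y_i = j\} \ln \frac{\exp(x_{ij}'\beta)}{\sum_{k=0}^{J} \exp(x_{ik}'\beta)}.
\]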
The MNL model can be viewed as a special case of the CL model. Suppose that we have a vector of individual characteristics $x_i$ with dimension $K$. Then define for each alternative $j$ the vector of covariates $x_{ij}$ as a $K(J+1) \times 1$ column vector, with all zeros other than the elements from $Kj + 1$ to $Kj + K$, which equal $x_i$:
\[
\underset{K(J+1) \times 1}{x_{i0}} = \begin{bmatrix} x_i \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad
\underset{K(J+1) \times 1}{x_{ij}} = \begin{bmatrix} 0 \\ \vdots \\ x_i \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad
\underset{K(J+1) \times 1}{x_{iJ}} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ x_i \end{bmatrix},
\]
where each block ($x_i$ or $0$) is $K \times 1$. Similarly, define $\beta = (\beta_0', \beta_1', \beta_2', \ldots, \beta_J')'$, where $\beta_0 = 0$ is a normalization. Thus we have $x_{ij}'\beta = x_i'\beta_j$.
There is some confusion about the so-called "mixed" logit model. In one usage, it refers to a combination of an MNL and a CL model; in another, it means the random parameters/coefficients logit model. Most often a mixed logit model refers to the latter, because the former can be reexpressed as a CL model.
2.1 Regression Parameter Interpretation (MMA 15.4.3, p.502)

2.1.1 Marginal Effects

CL
Based on MMA 15.12.1 (p.524), we have
\[
\frac{\partial p_{ij}}{\partial x_{ik}} = p_{ij} (\delta_{ijk} - p_{ik}) \beta, \quad \text{where } \delta_{ijk} \equiv 1\{j = k\}.
\]
If $\beta > 0$, then an increase in the corresponding component of the regressor value for the $k$-th alternative increases the probability of choosing the $k$-th alternative and decreases the probability of choosing the other alternatives.

MNL
Based on MMA 15.12.1 (p.524), we have
\[
\frac{\partial p_{ij}}{\partial x_i} = p_{ij} (\beta_j - \bar{\beta}_i),
\]
where $\bar{\beta}_i = \sum_{k=0}^{J} p_{ik} \beta_k$ is a probability weighted average of the $\beta_k$.
It follows that the sign of the response is not necessarily given by the sign of $\beta_j$, unless $\beta_j > \beta_k$ for all $k \neq j$, and it does not necessarily make any sense to test whether a particular coefficient is zero. As in other nonlinear models, we may conduct inference on the average marginal effects
\[
\frac{1}{N} \sum_{i=1}^{N} \frac{\partial p_{ij}}{\partial x_i} = \frac{1}{N} \sum_{i=1}^{N} p_{ij} (\beta_j - \bar{\beta}_i).
\]
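A small numerical sketch of the MNL marginal effect formula may help. The function name `mnl_marginal_effects` and the coefficient values are assumptions chosen for illustration; the example is built so that a positive coefficient goes together with a negative marginal effect.

```python
import numpy as np

def mnl_marginal_effects(x_i, betas):
    """betas: (J+1, K) with row 0 equal to zero (beta_0 = 0).
    Returns (p, dp) where p[j] = p_ij and dp[j] = d p_ij / d x_i = p_ij (beta_j - beta_bar_i)."""
    v = betas @ x_i
    p = np.exp(v - v.max())
    p = p / p.sum()
    beta_bar = p @ betas                      # probability weighted average of the beta_k
    dp = p[:, None] * (betas - beta_bar)
    return p, dp

betas = np.array([[0.0, 0.0],
                  [0.5, 0.4],
                  [1.0, 0.9]])
x_i = np.array([1.0, 0.2])
p, dp = mnl_marginal_effects(x_i, betas)

# For alternative 1 the coefficient on the second regressor is positive (0.4),
# yet it lies below the weighted average beta_bar, so the marginal effect is negative.
effect_alt1 = dp[1, 1]
```

Since the probabilities sum to one, the marginal effects of any regressor sum to zero across alternatives.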
2.1.2 Comparison to Base Category
The coefficients in the CL and MNL models can also be given a more direct logit-like interpretation in terms of relative risk. This is because the models can be reexpressed as binary logit models.
For the MNL model, comparison is to a base category, which is the alternative normalized to have coefficients equal to zero ($\beta_0 = 0$). To see this, note that the multinomial logit probabilities imply that the conditional probability of observing alternative $j$ given that either alternative $j$ or alternative $k$ is observed is
\[
\Pr(y = j \mid y = j \text{ or } k) = \frac{p_j}{p_j + p_k} = \frac{\exp(x'\beta_j)}{\exp(x'\beta_j) + \exp(x'\beta_k)} = \frac{\exp(x'(\beta_j - \beta_k))}{1 + \exp(x'(\beta_j - \beta_k))},
\]
which is a logit model with coefficient $(\beta_j - \beta_k)$.
Suppose that normalization is on alternative $k$, so that $\beta_k = 0$, then
\[
\Pr(y_i = j \mid y_i = j \text{ or } k) = \frac{\exp(x_i'\beta_j)}{1 + \exp(x_i'\beta_j)},
\]
and $\beta_j$ can be interpreted in the same way as the logit model coefficient for binary choice between alternatives $j$ and $k$. As in the binary logit model, the relative risk of choosing $j$ rather than $k$ is
\[
\frac{\Pr(y_i = j \mid x_i)}{\Pr(y_i = k \mid x_i)} = \exp(x_i'\beta_j),
\]
and hence $\exp(\beta_{jl})$ is the factor by which this relative risk is multiplied when the $l$-th regressor in $x_i$ (say $x_{il}$) changes by one unit. Such interpretations will vary according to which alternative is normalized to have zero coefficient, and for this interpretation to be really useful one needs to have a natural base category. For example, if interest lies in various alternative commute modes to travelling by car, then normalize the coefficients for the car alternative to be zero.
A similar approach can also be applied to the CL model, with:
\[
\Pr(y_i = j \mid y_i = j \text{ or } k) = \frac{\exp(x_{ij}'\beta)}{\exp(x_{ij}'\beta) + \exp(x_{ik}'\beta)} = \frac{\exp((x_{ij} - x_{ik})'\beta)}{1 + \exp((x_{ij} - x_{ik})'\beta)},
\]
which is a logit model with coefficient $\beta$. Here the normalization is with respect to regressor values for a base category.
2.2 Link with Utility Maximization
McFadden motivates the CL model by extending the latent index model to multiple alternatives. Suppose that the utility for individual $i$ associated with alternative $j$ is:
\[
U_{ij} = x_{ij}'\beta + \varepsilon_{ij}.
\]
Furthermore, let individual $i$ choose alternative $j$ (that is, $y_i = j$) if it provides the highest level of utility,
\[
y_i = j \text{ if } U_{ij} = \max\{U_{i0}, U_{i1}, \ldots, U_{iJ}\}.
\]
We assume that ties have probability zero because of the continuity of the distribution for $\varepsilon$.
Now suppose that the $\varepsilon_{ij}$'s are independent across both alternatives and individuals and that the $\varepsilon_{ij}$'s have type I extreme value distributions. Then $\Pr(y_i = j \mid x_{i0}, \ldots, x_{iJ})$ follows the CL model. The type I extreme value distribution has the following c.d.f. and p.d.f.:
\[
F(\varepsilon) = \exp(-\exp(-\varepsilon)), \qquad
f(\varepsilon) = \exp(-\varepsilon) \exp(-\exp(-\varepsilon)), \qquad -\infty < \varepsilon < \infty.
\]
See http://mathworld.wolfram.com/ExtremeValueDistribution.html for details about this distribution.
This distribution is asymmetric and has a unique mode at zero, a mean equal to 0.58, a second moment of 1.98 and a variance of 1.65. The extreme value distribution is obtained as a limiting distribution as $N \to \infty$ of the maximum of $N$ random variables drawn from the same distribution. The type I extreme value distribution is a special case that is right-skewed with most of the mass between $-2$ and 5.
As a side note, the binary logit model arises in the same way: if $\varepsilon_1$ and $\varepsilon_2$ are independent and type I extreme value distributed, then the difference $(\varepsilon_1 - \varepsilon_2)$ has a logistic distribution.
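These properties can be checked by simulation. The sketch below draws type I extreme value errors with NumPy's `Generator.gumbel` (whose c.d.f. at location 0 and scale 1 is $\exp(-\exp(-\varepsilon))$), compares sample moments with the theoretical values, and compares the frequency with which each alternative has the highest utility against the CL formula. The index values in `V` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000
eps = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, 3))

# Sample moments of one error term; theory gives mean ~0.577 and variance pi^2/6 ~1.645.
mean_eps = eps[:, 0].mean()
var_eps = eps[:, 0].var()

# The difference of two independent draws should be logistic, with variance pi^2/3.
var_diff = (eps[:, 0] - eps[:, 1]).var()

# Random utility model: U_j = V_j + eps_j, with hypothetical indices V_j = x_j' beta.
V = np.array([0.2, -0.5, 1.0])
choice = (V + eps).argmax(axis=1)
freq = np.bincount(choice, minlength=3) / n_draws

# Conditional logit probabilities implied by the same indices.
cl = np.exp(V) / np.exp(V).sum()
```

With this many draws the simulated choice frequencies should agree with the CL probabilities to about two decimal places.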
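Given the extreme value distribution, the probability $\Pr(y_i = 0 \mid x_{i0}, \ldots, x_{iJ})$ is:
\begin{align*}
& \Pr(U_{i0} > U_{i1}, U_{i0} > U_{i2}, \ldots, U_{i0} > U_{iJ}) \\
&= \Pr(x_{i0}'\beta + \varepsilon_{i0} > x_{i1}'\beta + \varepsilon_{i1},\; x_{i0}'\beta + \varepsilon_{i0} > x_{i2}'\beta + \varepsilon_{i2},\; \ldots,\; x_{i0}'\beta + \varepsilon_{i0} > x_{iJ}'\beta + \varepsilon_{iJ}) \\
&= \Pr(\varepsilon_{i1} < (x_{i0} - x_{i1})'\beta + \varepsilon_{i0},\; \varepsilon_{i2} < (x_{i0} - x_{i2})'\beta + \varepsilon_{i0},\; \ldots,\; \varepsilon_{iJ} < (x_{i0} - x_{iJ})'\beta + \varepsilon_{i0}) \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{(x_{i0} - x_{i1})'\beta + \varepsilon_{i0}} \int_{-\infty}^{(x_{i0} - x_{i2})'\beta + \varepsilon_{i0}} \cdots \int_{-\infty}^{(x_{i0} - x_{iJ})'\beta + \varepsilon_{i0}} f(\varepsilon_{i0}) f(\varepsilon_{i1}) \cdots f(\varepsilon_{iJ}) \, d\varepsilon_{iJ} \cdots d\varepsilon_{i2} \, d\varepsilon_{i1} \, d\varepsilon_{i0} \\
&= \int_{-\infty}^{\infty} \left[ \exp(-\varepsilon_{i0}) \exp(-\exp(-\varepsilon_{i0})) \right] \exp(-\exp(-(x_{i0} - x_{i1})'\beta - \varepsilon_{i0})) \cdots \exp(-\exp(-(x_{i0} - x_{iJ})'\beta - \varepsilon_{i0})) \, d\varepsilon_{i0} \\
&= \frac{\exp(x_{i0}'\beta)}{\sum_{k=0}^{J} \exp(x_{ik}'\beta)}.
\end{align*}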
Similarly, we have:
\[
\Pr(y_i = j \mid x_{i0}, \ldots, x_{iJ}) = \frac{\exp(x_{ij}'\beta)}{\sum_{k=0}^{J} \exp(x_{ik}'\beta)}, \quad \text{where } i = 1, \ldots, N;\; j = 0, \ldots, J.
\]

2.3 McFadden's Conditional Logit Model For Gas/Electric Dryer Purchases
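The starting point is an indirect utility function that depends on the operating and capital costs of the device and interactions of the indicators for the alternatives with some individual characteristics.
The indirect utility for an electric dryer for household $i$ ($i = 1, \ldots, N$) is:
\[
U_i^{\text{elec}} = \beta_0^{\text{elec}} + \beta_1^{\text{elec}} \text{owner}_i + \beta_2^{\text{elec}} \text{hholdsize}_i + \beta_3^{\text{elec}} \text{gas-available}_i + \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i + \varepsilon_i^{\text{elec}}.
\]
The indirect utility for a gas dryer for household $i$ ($i = 1, \ldots, N$) is:
\[
U_i^{\text{gas}} = \beta_0^{\text{gas}} + \beta_1^{\text{gas}} \text{owner}_i + \beta_2^{\text{gas}} \text{hholdsize}_i + \beta_3^{\text{gas}} \text{gas-available}_i + \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i + \varepsilon_i^{\text{gas}}.
\]
The indirect utility for no-dryer for household $i$ ($i = 1, \ldots, N$) is:
\[
U_i^{\text{no}} = \beta_0^{\text{no}} + \beta_1^{\text{no}} \text{owner}_i + \beta_2^{\text{no}} \text{hholdsize}_i + \beta_3^{\text{no}} \text{gas-available}_i + \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i + \varepsilon_i^{\text{no}}.
\]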
The covariate vector includes both individual-varying (also called alternative-invariant or case-specific) and alternative-varying (also called case-invariant or alternative-specific) observables:
\[
x = \big( \underbrace{\text{owner},\ \text{hholdsize},\ \text{gas-available}}_{\text{alternative-invariant}},\ \overbrace{\text{oper-cost},\ \text{cap-cost}}^{\text{alternative-varying}} \big)'.
\]
The operating and capital costs of no-dryer are assumed to be zero by McFadden. "He probably has not done much hand-washing" (Imbens). So the indirect utility for no-dryer for household $i$ ($i = 1, \ldots, N$) can be simplified to:
\[
U_i^{\text{no}} = \beta_0^{\text{no}} + \beta_1^{\text{no}} \text{owner}_i + \beta_2^{\text{no}} \text{hholdsize}_i + \beta_3^{\text{no}} \text{gas-available}_i + \varepsilon_i^{\text{no}}.
\]
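McFadden assumes that the three disturbances $(\varepsilon_i^{\text{elec}}, \varepsilon_i^{\text{gas}}, \varepsilon_i^{\text{no}})$ are independent and identically distributed with type I extreme value distribution. Then household $i$'s choice of dryer can be modelled by:
\begin{align*}
y_i = \text{elec-dryer} &\iff U_i^{\text{elec}} = \max\{U_i^{\text{elec}}, U_i^{\text{gas}}, U_i^{\text{no}}\} \\
y_i = \text{gas-dryer} &\iff U_i^{\text{gas}} = \max\{U_i^{\text{elec}}, U_i^{\text{gas}}, U_i^{\text{no}}\} \\
y_i = \text{no-dryer} &\iff U_i^{\text{no}} = \max\{U_i^{\text{elec}}, U_i^{\text{gas}}, U_i^{\text{no}}\}.
\end{align*}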
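Next define the "representative utilities" ($V$):
\begin{align*}
V_i^{\text{elec}} &= U_i^{\text{elec}} - \varepsilon_i^{\text{elec}}
= \beta_0^{\text{elec}} + \beta_1^{\text{elec}} \text{owner}_i + \beta_2^{\text{elec}} \text{hholdsize}_i + \beta_3^{\text{elec}} \text{gas-available}_i + \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i \\
V_i^{\text{gas}} &= U_i^{\text{gas}} - \varepsilon_i^{\text{gas}}
= \beta_0^{\text{gas}} + \beta_1^{\text{gas}} \text{owner}_i + \beta_2^{\text{gas}} \text{hholdsize}_i + \beta_3^{\text{gas}} \text{gas-available}_i + \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i \\
V_i^{\text{no}} &= U_i^{\text{no}} - \varepsilon_i^{\text{no}}
= \beta_0^{\text{no}} + \beta_1^{\text{no}} \text{owner}_i + \beta_2^{\text{no}} \text{hholdsize}_i + \beta_3^{\text{no}} \text{gas-available}_i
\end{align*}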
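Given the distributional assumption, the probabilities of dryer purchase are:
\begin{align*}
\Pr(y = \text{elec-dryer}) &= \frac{\exp(V_i^{\text{elec}})}{\exp(V_i^{\text{elec}}) + \exp(V_i^{\text{gas}}) + \exp(V_i^{\text{no}})}; \\
\Pr(y = \text{gas-dryer}) &= \frac{\exp(V_i^{\text{gas}})}{\exp(V_i^{\text{elec}}) + \exp(V_i^{\text{gas}}) + \exp(V_i^{\text{no}})}; \\
\Pr(y = \text{no-dryer}) &= \frac{\exp(V_i^{\text{no}})}{\exp(V_i^{\text{elec}}) + \exp(V_i^{\text{gas}}) + \exp(V_i^{\text{no}})}.
\end{align*}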
From data on operating and capital costs and individual characteristics we cannot identify all parameters. Suppose that we subtract an individual-specific (alternative-invariant) constant $c_i$ from $U_i^{\text{elec}}$, $U_i^{\text{gas}}$ and $U_i^{\text{no}}$. That would not change the ranking, so we cannot tell the result apart from the original model. So we choose:
\[
c_i = \beta_0^{\text{gas}} + \beta_1^{\text{gas}} \text{owner}_i + \beta_2^{\text{gas}} \text{hholdsize}_i + \beta_3^{\text{gas}} \text{gas-available}_i.
\]
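Now the original model is normalized to:
\begin{align*}
U_i^{\text{elec}} &= (\beta_0^{\text{elec}} - \beta_0^{\text{gas}}) + (\beta_1^{\text{elec}} - \beta_1^{\text{gas}}) \text{owner}_i + (\beta_2^{\text{elec}} - \beta_2^{\text{gas}}) \text{hholdsize}_i \\
&\quad + (\beta_3^{\text{elec}} - \beta_3^{\text{gas}}) \text{gas-available}_i + \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i + \varepsilon_i^{\text{elec}}; \\
U_i^{\text{gas}} &= \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i + \varepsilon_i^{\text{gas}}; \\
U_i^{\text{no}} &= (\beta_0^{\text{no}} - \beta_0^{\text{gas}}) + (\beta_1^{\text{no}} - \beta_1^{\text{gas}}) \text{owner}_i + (\beta_2^{\text{no}} - \beta_2^{\text{gas}}) \text{hholdsize}_i \\
&\quad + (\beta_3^{\text{no}} - \beta_3^{\text{gas}}) \text{gas-available}_i + \varepsilon_i^{\text{no}}.
\end{align*}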
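We normalize $\beta_0^{\text{gas}} = \beta_1^{\text{gas}} = \beta_2^{\text{gas}} = \beta_3^{\text{gas}} = 0$, then we have:
\begin{align*}
U_i^{\text{elec}} &= \beta_0^{\text{elec}} + \beta_1^{\text{elec}} \text{owner}_i + \beta_2^{\text{elec}} \text{hholdsize}_i + \beta_3^{\text{elec}} \text{gas-available}_i + \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i + \varepsilon_i^{\text{elec}} \\
U_i^{\text{gas}} &= \beta_4 \text{oper-cost}_i + \beta_5 \text{cap-cost}_i + \varepsilon_i^{\text{gas}} \\
U_i^{\text{no}} &= \beta_0^{\text{no}} + \beta_1^{\text{no}} \text{owner}_i + \beta_2^{\text{no}} \text{hholdsize}_i + \beta_3^{\text{no}} \text{gas-available}_i + \varepsilon_i^{\text{no}}
\end{align*}
McFadden's estimates are given in Table 1 with $\beta_0^{\text{gas}} = \beta_1^{\text{gas}} = \beta_2^{\text{gas}} = \beta_3^{\text{gas}} = 0$.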
Table 1: Conditional Logit Estimates

Variable                                                   Coefficient
Alternative-varying
  operating cost for dryer ($\beta_4$)                     0.014
  capital cost for dryer ($\beta_5$)                       0.016
Alternative-invariant
  house owner, electric ($\beta_1^{\text{elec}}$)          0.600
  house owner, no-dryer ($\beta_1^{\text{no}}$)            1.590
  household size, electric ($\beta_2^{\text{elec}}$)       0.075
  household size, no-dryer ($\beta_2^{\text{no}}$)         0.400
  gas-available, electric ($\beta_3^{\text{elec}}$)        1.270
  gas-available, no-dryer ($\beta_3^{\text{no}}$)          1.560
  intercept, electric ($\beta_0^{\text{elec}}$)            2.100
  intercept, no-dryer ($\beta_0^{\text{no}}$)              0.020
Even more than in the binary logit and probit models these coefficients are difficult to interpret directly. So instead McFadden reports some elasticities. For example, consider the elasticity of the probability of buying an electric dryer with respect to the operating cost of an electric dryer:
\[
E_{\text{elec},\,\text{elec-oper-cost}} = \frac{\partial \Pr(y = \text{elec-dryer})}{\partial\, \text{elec-oper-cost}} \cdot \frac{\text{elec-oper-cost}}{\Pr(y = \text{elec-dryer})}.
\]
This elasticity will depend on the values of the covariates. Here, we evaluate the elasticities at the means of the variables ($\bar{x}$):
\[
E_{\text{elec},\,\text{elec-oper-cost}} = \left. \frac{\partial \Pr(y = \text{elec-dryer})}{\partial\, \text{elec-oper-cost}} \cdot \frac{\text{elec-oper-cost}}{\Pr(y = \text{elec-dryer})} \right|_{x = \bar{x}}.
\]
The means are given in Table 2.
Table 2: Means

Variable             Electric-dryer   Gas-dryer   No-dryer
dryer alternative    0.447            0.235       0.318
operating cost       31.17            7.560       0
capital cost         233.2            258.8       0
gas-available        0.719            0.719       0.719
owner                0.873            0.873       0.873
household size       3.310            3.310       3.310
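The derivative of the probability of buying an electric dryer with respect to the operating cost of an electric dryer is:
\[
\frac{\partial \Pr(y = \text{elec-dryer})}{\partial\, \text{elec-oper-cost}} = \beta_4 \frac{\exp(V_i^{\text{elec}})}{\exp(V_i^{\text{elec}}) + \exp(V_i^{\text{gas}}) + \exp(V_i^{\text{no}})} - \beta_4 \left( \frac{\exp(V_i^{\text{elec}})}{\exp(V_i^{\text{elec}}) + \exp(V_i^{\text{gas}}) + \exp(V_i^{\text{no}})} \right)^2.
\]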
Evaluate this using the parameter estimates in Table 1, and at the means of the covariates given in Table 2:
\[
E_{\text{elec},\,\text{elec-oper-cost}} = \frac{\partial \widehat{\Pr}(y = \text{elec-dryer})}{\partial\, \text{elec-oper-cost}} \cdot \frac{\text{elec-oper-cost}}{\widehat{\Pr}(y = \text{elec-dryer})} = 0.0036 \times \frac{31.17}{0.5516} = 0.203.
\]
Thus, the elasticity for the probability of buying an electric dryer with respect to its own operating cost is 0.203. Similarly, the elasticity for the probability of buying a gas dryer with respect to the electric dryer's operating cost is 0.23. Note that we could also consider the average elasticities, by averaging the elasticities evaluated at each set of values of the covariates.
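The elasticity arithmetic can be reproduced in a few lines. The sketch below uses the magnitudes exactly as printed in the text (0.0036, 31.17, 0.5516); the helper names, and the probabilities and cost coefficient passed to `cl_cost_derivatives`, are hypothetical and serve only to exercise the general CL derivative $\partial p_j / \partial c_k = \beta\, p_j (1\{j = k\} - p_k)$.

```python
import numpy as np

def cl_cost_derivatives(p, k, beta_cost):
    """Derivatives of all CL choice probabilities with respect to the cost of alternative k:
    d p_j / d cost_k = beta_cost * p_j * (1{j = k} - p_k)."""
    p = np.asarray(p, dtype=float)
    delta = np.zeros_like(p)
    delta[k] = 1.0
    return beta_cost * p * (delta - p[k])

def elasticity(dp_dcost, cost, prob):
    """Elasticity = derivative times (cost / probability)."""
    return dp_dcost * cost / prob

# Own-cost elasticity from the numbers reported in the text.
e_own = elasticity(0.0036, 31.17, 0.5516)

# Hypothetical three-alternative example (not McFadden's data).
p = np.array([0.5, 0.3, 0.2])
d = cl_cost_derivatives(p, k=0, beta_cost=-0.01)
```

The own derivative `d[0]` equals $\beta (p_0 - p_0^2)$, the same form as the derivative shown above, and the derivatives sum to zero across alternatives.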
3 Independence of Irrelevant Alternatives (IIA)
In the context of a multinomial discrete choice model, each individual chooses $j$ out of $J + 1$ alternatives ($j = 0, 1, \ldots, J$). For individual $i$, alternative $j$ has characteristics $x_{ij}$ (some of which may be alternative-invariant). Associated with this alternative is utility $U_{ij}$. Individual $i$ then chooses $j$ if $U_{ij} = \max\{U_{i0}, U_{i1}, \ldots, U_{iJ}\}$ (ignoring ties). For the CL model, we specify:
\[
U_{ij} = x_{ij}'\beta + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \text{i.i.d. across } i \text{ and } j \text{ with type I extreme value distribution.}
\]
An implication of the CL model is the independence of irrelevant alternatives (IIA) property. This implies that:
The conditional probability of choosing $j$ rather than $j'$ given that one chooses $j$ or $j'$ does not depend on characteristics of alternatives other than $j$ or $j'$.
This limitation of the CL and MNL (which can be regarded as a special case of CL) models means that discrimination among the $J + 1$ alternatives reduces to a series of pairwise comparisons that are unaffected by the characteristics of alternatives other than the pair under consideration. Recall the following:
\begin{align*}
\text{MNL} &\Rightarrow \Pr(y_i = j \mid y_i = j \text{ or } k) = \frac{\exp(x_i'(\beta_j - \beta_k))}{1 + \exp(x_i'(\beta_j - \beta_k))} \\
\text{CL} &\Rightarrow \Pr(y_i = j \mid y_i = j \text{ or } k) = \frac{\exp((x_{ij} - x_{ik})'\beta)}{1 + \exp((x_{ij} - x_{ik})'\beta)}
\end{align*}
Both MNL and CL models can be reduced to a binary choice logit model between any pair of alternatives. These conditional probabilities do not depend on other alternatives. Put another way, consider the conditional probability of choosing $j$ given that an individual may choose either $j$ or $k$:
\[
\Pr(y_i = j \mid y_i \in \{j, k\}) = \frac{\exp(x_{ij}'\beta)}{\exp(x_{ij}'\beta) + \exp(x_{ik}'\beta)}.
\]
This probability does not depend on any characteristics of alternatives other than $j$ and $k$. This is sometimes unattractive. McFadden's famous "blue-bus/red-bus" example illustrates this. Suppose there are three alternatives: commuting by car, by red-bus, or by blue-bus. A sensible model would take into account that people may prefer cars to buses, but they may be indifferent between a red bus and a blue bus. So, we may assume that $U_{i,\text{red-bus}} = U_{i,\text{blue-bus}}$, with the choice between them being random. To be explicit, suppose that $x_{i,\text{blue-bus}} = x_{i,\text{red-bus}} = x_{i,\text{bus}}$ and suppose that the probability of commuting by bus is:
\[
\Pr(y_i = \text{bus}) = \frac{\exp(x_{i,\text{bus}}'\beta)}{\exp(x_{i,\text{bus}}'\beta) + \exp(x_{i,\text{car}}'\beta)}
\]
and the probability of choosing a red bus or a blue bus conditional on choosing a bus is: $\Pr(y_i = \text{red-bus} \mid y_i = \text{bus}) = 1/2$. This would imply that the conditional probability of commuting by car given that one commutes by car or red bus would differ from the same conditional probability if there is no blue bus. Presumably taking away the blue bus alternative would lead all the current blue bus users to shift to the red bus, and not to cars. However, the conditional logit model does not allow for this type of substitution pattern. Therefore, another way of stating the problems with the conditional logit model is to say that it generates unrealistic substitution patterns.
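The red-bus/blue-bus problem is easy to reproduce numerically. In the sketch below the representative utilities of car and bus are set equal (a hypothetical choice, $V = 0$ for both), and the helper `cl_probs` is an illustrative name. Under the CL model, adding a blue bus that is identical to the red bus pulls probability away from the car, while the car/red-bus odds stay fixed.

```python
import numpy as np

def cl_probs(v):
    """Conditional logit probabilities from a vector of representative utilities."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

v_car, v_bus = 0.0, 0.0   # hypothetical: car and bus equally attractive

# Choice set {car, red bus}.
two = cl_probs(np.array([v_car, v_bus]))

# Choice set {car, red bus, blue bus}, with the blue bus identical to the red bus.
three = cl_probs(np.array([v_car, v_bus, v_bus]))

odds_car_vs_red_two = two[0] / two[1]
odds_car_vs_red_three = three[0] / three[1]
```

The CL model predicts a car share of 1/3 once the blue bus is added, whereas the sensible model described above would keep the car share at 1/2 and split the bus share between the two colors.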
There are three models to address this issue: 1) nested logit, imposing a particular correlation structure on $\varepsilon_{ij}$; 2) random coefficients/parameters (also called mixed) logit, allowing $\beta$ to differ by individual; and 3) multinomial probit, allowing for relatively unrestricted covariance matrices for $\varepsilon_{ij}$. One point to stress here is that although formally the IIA property is tied to the extreme value distribution of $\varepsilon_{ij}$, the deviations from IIA if we instead use other distributions (such as normal) while still maintaining the independence (of $\varepsilon_{ij}$ across $i$ and $j$) are so minor that they do not really address the issue.
4 Nested Logit (NL)
The nested logit (NL) is the most analytically tractable generalization of the multinomial models. It is the ideal model to use when there is a clear (and unambiguous) nesting structure, but not all multinomial choice applications have an obvious nesting structure. The NL model breaks decision making into groups. A simple example is to consider choice of college: people first decide whether to go to a two-year or four-year college, and then within each of these paths whether to go to a public or private college. The errors in a random utility model are permitted to be correlated for each alternative within the two-year and four-year groups, but they are uncorrelated across these two groups.
More generally, suppose the set of alternatives $\{0, 1, 2, \ldots, J\}$ can be partitioned into $S$ subsets $B_1, B_2, \ldots, B_S$, so that the full set of alternatives can be written as $\{0, 1, 2, \ldots, J\} = \bigcup_{s=1}^{S} B_s$. Let $z_s$ be set-specific characteristics, which vary across subsets $B_s$ ($s = 1, 2, \ldots, S$) only. It may be that the set of set-specific variables is empty or just a vector of indicators, with $z_s$ being an $S$-vector of zeros with a one for the $s$-th element.
For any model with this nesting, $\Pr(y_i = j \mid x_{ij})$, the probability of individual $i$'s choosing $j$, can be factored as $\Pr(y_i \in B_s \mid x_{ij})$, the probability of individual $i$'s choosing subset $B_s$, multiplied by $\Pr(y_i = j \mid x_{ij}, y_i \in B_s)$, the probability of individual $i$'s choosing $j$ conditional on having chosen subset $B_s$. Thus:
\[
\Pr(y_i = j \mid x_{ij}) = \Pr(y_i = j \mid x_{ij}, y_i \in B_s) \times \Pr(y_i \in B_s \mid x_{ij}).
\]
The NL model of McFadden arises when the errors $\varepsilon_{js}$'s have the generalized extreme value (GEV) joint distribution. With GEV, the probability of individual $i$'s choosing alternative $j$ conditional on the alternative $j$ being in the subset $B_s$, or $y_i \in B_s$, equals:
\[
\Pr(y_i = j \mid x_{ij}, y_i \in B_s) = \frac{\exp(x_{ij}'\beta / \rho_s)}{\sum_{l \in B_s} \exp(x_{il}'\beta / \rho_s)}.
\]
The marginal probability of choosing the subset $B_s$ is:
\[
\Pr(y_i \in B_s \mid x_{ij}) = \frac{\exp(z_s'\alpha) \left( \sum_{l \in B_s} \exp(x_{il}'\beta / \rho_s) \right)^{\rho_s}}{\sum_{t=1}^{S} \exp(z_t'\alpha) \left( \sum_{l \in B_t} \exp(x_{il}'\beta / \rho_t) \right)^{\rho_t}}.
\]
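Therefore, using the product rule of a joint probability, we have:
\begin{align*}
\Pr(y_i = j \mid x_{ij}) &= \Pr(y_i = j \mid x_{ij}, y_i \in B_s) \times \Pr(y_i \in B_s \mid x_{ij}) \\
&= \frac{\exp(x_{ij}'\beta / \rho_s)}{\sum_{l \in B_s} \exp(x_{il}'\beta / \rho_s)} \times \frac{\exp(z_s'\alpha) \left( \sum_{l \in B_s} \exp(x_{il}'\beta / \rho_s) \right)^{\rho_s}}{\sum_{t=1}^{S} \exp(z_t'\alpha) \left( \sum_{l \in B_t} \exp(x_{il}'\beta / \rho_t) \right)^{\rho_t}}.
\end{align*}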
The parameter $\rho_s$ is a function of the correlation between $\varepsilon_{js}$ and $\varepsilon_{j's}$ for $j, j' \in B_s$ but does not exactly equal the correlation parameter. In fact $\rho_s$ can be shown to equal $\sqrt{1 - \text{Corr}(\varepsilon_{js}, \varepsilon_{j's})}$, so $\rho_s$ is inversely related to the correlation and we expect $0 \le \rho_s \le 1$. The case $\rho_s = 1$ corresponds to independence of $\varepsilon_{js}$ and $\varepsilon_{j's}$ for $j, j' \in B_s$. If we fix $\rho_s = 1$ for all $s$, then we have:
\[
\Pr(y_i = j \mid x_{ij}) = \frac{\exp(x_{ij}'\beta + z_s'\alpha)}{\sum_{t=1}^{S} \sum_{l \in B_t} \exp(x_{il}'\beta + z_t'\alpha)},
\]
and we are back in the conditional logit model. We call the parameters $\rho_s$ the scale parameters, as they scale regression parameters (such as $\beta$ and $\alpha$) in the models shown above. Note that the $\rho_s$ are also called the dissimilarity parameters. [4]
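The nested logit probabilities can be sketched directly from the two factors above. In the code below, `v` holds the indices $x_{il}'\beta$, `za` holds $z_s'\alpha$, and `rho` holds the scale parameters; the function name and all numerical values are hypothetical. The last lines check the special case $\rho_s = 1$, which should return the conditional logit probabilities.

```python
import numpy as np

def nested_logit_probs(v, nests, za, rho):
    """Nested logit choice probabilities.

    v:     (J+1,) indices x_il' beta.
    nests: list of integer index arrays, one per subset B_s (a partition of 0..J).
    za:    (S,) values z_s' alpha.
    rho:   (S,) scale (dissimilarity) parameters.
    """
    p = np.zeros(len(v))
    within = []
    log_nest = np.empty(len(nests))
    for s, B in enumerate(nests):
        e = np.exp(v[B] / rho[s])
        within.append(e / e.sum())                       # Pr(y = j | y in B_s)
        log_nest[s] = za[s] + rho[s] * np.log(e.sum())   # log numerator of Pr(y in B_s)
    nest_p = np.exp(log_nest - log_nest.max())
    nest_p = nest_p / nest_p.sum()                       # Pr(y in B_s)
    for s, B in enumerate(nests):
        p[B] = within[s] * nest_p[s]
    return p

v = np.array([0.3, -0.2, 0.8, 0.1])
nests = [np.array([0, 1]), np.array([2, 3])]
za = np.array([0.0, 0.4])

p_nl = nested_logit_probs(v, nests, za, rho=np.array([0.5, 0.7]))

# Special case rho_s = 1 for all s: conditional logit with index x'beta + z_s'alpha.
p_rho1 = nested_logit_probs(v, nests, za, rho=np.array([1.0, 1.0]))
index = v + np.array([za[0], za[0], za[1], za[1]])
p_cl = np.exp(index) / np.exp(index).sum()
```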
In general this model corresponds to individuals choosing the alternative with the highest utility, where the utility of alternative $j$ in subset $B_s$ for individual $i$ is:
\[
U_{ij} = x_{ij}'\beta + z_s'\alpha + \varepsilon_{ij},
\]
where the joint distribution of the $\varepsilon_{ij}$ is GEV:
\[
F(\varepsilon_{i0}, \varepsilon_{i1}, \ldots, \varepsilon_{iJ}) = \exp\left( - \sum_{s=1}^{S} \left( \sum_{j \in B_s} \exp(-\varepsilon_{ij} / \rho_s) \right)^{\rho_s} \right).
\]
Remember that in the subset $B_s$ the correlation coefficient for the $\varepsilon_{ij}$ is equal to $1 - \rho_s^2$. Between the subsets the $\varepsilon_{ij}$'s are independent.
To estimate a nested logit model, one approach is to construct a log likelihood and directly maximize it. That is complicated, especially when the log likelihood function is not globally concave, but it is not impossible in principle. An easier way (the sequential estimator) is to directly use the nesting structure in the following procedure:
(1) Within a nest, say $B_s$, we have a conditional logit model with coefficient $(\beta / \rho_s)$:
\[
\Pr(y_i = j \mid x_{ij}, y_i \in B_s) = \frac{\exp(x_{ij}'(\beta / \rho_s))}{\sum_{l \in B_s} \exp(x_{il}'(\beta / \rho_s))}.
\]
Hence we can directly estimate $(\beta / \rho_s)$, exploiting the concavity of the conditional logit log likelihood. Denote the estimates of $(\beta / \rho_s)$ by $\widehat{(\beta / \rho_s)}$.
[4] Unlike regression models with continuous outcomes, notations vary considerably across authors studying discrete response models.
(2) Then the probability of a particular subset $B_s$ can be used to estimate $\rho_s$ and $\alpha$ through
\begin{align*}
\Pr(y_i \in B_s \mid x_{ij}) &= \frac{\exp(z_s'\alpha) \left( \sum_{l \in B_s} \exp(x_{il}'\widehat{(\beta / \rho_s)}) \right)^{\rho_s}}{\sum_{t=1}^{S} \left[ \exp(z_t'\alpha) \left( \sum_{l \in B_t} \exp(x_{il}'\widehat{(\beta / \rho_t)}) \right)^{\rho_t} \right]} \\
&= \frac{\exp(z_s'\alpha) \exp\left[ \rho_s \ln \sum_{l \in B_s} \exp(x_{il}'\widehat{(\beta / \rho_s)}) \right]}{\sum_{t=1}^{S} \left[ \exp(z_t'\alpha) \exp\left[ \rho_t \ln \sum_{l \in B_t} \exp(x_{il}'\widehat{(\beta / \rho_t)}) \right] \right]} \\
&= \frac{\exp(z_s'\alpha + \rho_s \widehat{W}_s)}{\sum_{t=1}^{S} \exp(z_t'\alpha + \rho_t \widehat{W}_t)},
\end{align*}
where $\widehat{W}_s = \ln \sum_{l \in B_s} \exp(x_{il}'\widehat{(\beta / \rho_s)})$ is known as the inclusive value or the log-sum. Hence we have another CL model which is easily estimable. Note that this two-step (sequential) estimator is less efficient than a full information maximum likelihood (FIML) estimator, and at the second stage the usual CL standard errors understate the true standard errors of the sequential estimator because they do not allow for the estimation error in computing the inclusive value. To correct for this bias, the bootstrap can be used.
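As a sketch of the second step, the inclusive value is a log-sum-exp of the first-stage indices within each nest, and the nest probabilities are then a CL model in $z_s'\alpha + \rho_s \widehat{W}_s$. The first-stage indices, `za`, and `rho` below are hypothetical numbers standing in for estimates; the function name `inclusive_value` is an assumption.

```python
import numpy as np

def inclusive_value(u):
    """Log-sum (inclusive value): ln of the sum over l in B_s of exp(u_l),
    where u_l stands for the first-stage index x_il' (beta/rho_s)-hat."""
    m = u.max()
    return m + np.log(np.exp(u - m).sum())   # shifted by the max for numerical stability

# Hypothetical first-stage indices for two nests.
u_by_nest = [np.array([0.4, -0.1]), np.array([1.2, 0.3, 0.0])]
W = np.array([inclusive_value(u) for u in u_by_nest])

# Second stage: CL over nests with index z_s' alpha + rho_s * W_s (values are placeholders).
za = np.array([0.0, 0.5])
rho = np.array([0.6, 0.8])
idx = za + rho * W
nest_prob = np.exp(idx - idx.max())
nest_prob = nest_prob / nest_prob.sum()
```

In an actual application $\alpha$ and $\rho_s$ would be estimated by maximizing the second-stage CL likelihood with $\widehat{W}_s$ treated as a regressor.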
These models can be extended to many layers of nests. It should be noted that both the order of the nests and the elements of each nest are very important. In addition, the main limitation of the nested logit model is that not all choice problems lend themselves to an obvious nesting structure. One can try selecting an optimal nesting scheme using likelihood ratio tests, where appropriate, or Akaike's information criterion. However, the resulting scheme does not always accord with a priori expectations.
5 Multinomial Probit (MNP)
A second possibility is to directly free up the covariance matrix of the error terms. This is a little easier in the probit case. The difficulty in general is that this frees up a large number of parameters and maximum likelihood estimation is difficult, as in the most general case a $J$-fold (for a total of $J + 1$ alternatives) integral needs to be calculated.
We specify the following multinomial probit (MNP) model with a random utility for individual $i$ given by:
\[
U_i = \begin{bmatrix} U_{i0} \\ U_{i1} \\ \vdots \\ U_{iJ} \end{bmatrix}
= \begin{bmatrix} x_{i0}'\beta + \varepsilon_{i0} \\ x_{i1}'\beta + \varepsilon_{i1} \\ \vdots \\ x_{iJ}'\beta + \varepsilon_{iJ} \end{bmatrix},
\qquad
\varepsilon_i = \begin{bmatrix} \varepsilon_{i0} \\ \varepsilon_{i1} \\ \vdots \\ \varepsilon_{iJ} \end{bmatrix} \Bigg| \, x_i \sim N(0, \Omega),
\]
for some relatively unrestricted $\Omega$. We do need some normalization on $\Omega$ beyond symmetry. Recall that in the binary choice case (which corresponds to $J = 1$) there were no free parameters in the distribution of $\varepsilon$, which implies three restrictions on the symmetric matrix $\Omega$.
Different MNP models arise from different specifications of the covariance matrix $\Omega$. Note that even if the errors are uncorrelated, the MNP yields no closed-form solution for the probabilities, and it is easier to assume instead that the errors are extreme value and use the CL or MNL models.
In principle we can derive the probability for each alternative given the covariates, construct the likelihood function based on that, and maximize it using numerical methods. In practice this is very difficult when $J \ge 3$. Evaluating the probabilities then involves calculating an integral of at least third order involving normal densities. This is difficult to do using standard integration methods, and we often use simulation methods.
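To illustrate the simulation idea, the sketch below uses a crude frequency simulator: draw the error vector from $N(0, \Omega)$ many times and record how often each alternative has the highest utility. This is a minimal illustration under assumed values of the indices and of $\Omega$, not the method used in any particular package; the resulting frequencies are not smooth in the parameters, so applied work typically relies on smoother simulators such as GHK.

```python
import numpy as np

def mnp_probs_sim(v, omega, n_draws=100_000, seed=0):
    """Frequency simulator for multinomial probit choice probabilities.

    v:     (J+1,) indices x_ij' beta.
    omega: (J+1, J+1) covariance matrix of the errors.
    Returns the share of draws in which each alternative has the highest utility.
    """
    rng = np.random.default_rng(seed)
    eps = rng.multivariate_normal(np.zeros(len(v)), omega, size=n_draws)
    choice = (v + eps).argmax(axis=1)
    return np.bincount(choice, minlength=len(v)) / n_draws

# Hypothetical example: three alternatives, errors of alternatives 1 and 2 highly correlated.
v = np.array([0.0, 0.0, 0.0])
omega = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.9],
                  [0.0, 0.9, 1.0]])
p_corr = mnp_probs_sim(v, omega)

# Benchmark with independent errors and equal indices: each probability is close to 1/3.
p_indep = mnp_probs_sim(v, np.eye(3))
```

With strongly correlated errors for alternatives 1 and 2, alternative 0 receives more than one third of the probability, which is the kind of substitution pattern that IIA rules out.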
6 Random Coefficients/Parameters ("Mixed") Logit (RPL)
A third possibility to get around the IIA property is to allow for unobserved heterogeneity in the slope coefficients. This "mixed" logit model refers to MMA Section 15.7 and http://elsa.berkeley.edu/books/choice2nd/Ch06_p134-150.pdf, not MMA Sections 15.2.3 and 15.4.1. The mixed logit model allows the marginal utility ($\beta_i$) to vary at the individual level:
\[
U_{ij} = x_{ij}'\beta_i + \varepsilon_{ij},
\]
where the $\varepsilon_{ij}$'s are again independent of everything else, and of each other, with either type I extreme value or normal distribution. We can also write this as
\[
U_{ij} = x_{ij}'\beta + v_{ij},
\]
where
\[
v_{ij} = x_{ij}'u_i + \varepsilon_{ij}, \quad u_i = \beta_i - \beta, \quad u_i \sim N(0, \Sigma),
\]
which is no longer independent across alternatives because
\[
\text{Cov}(v_{ij}, v_{ik}) = x_{ij}'\Sigma x_{ik}, \quad \forall j \neq k.
\]
The key ingredient is the vector of individual-specific taste parameters $\beta_i$. Introducing random parameters has the attractive property of inducing correlation across alternatives. Let us also allow for individual-specific characteristics ($z_i$). One possibility is to model $\beta_i$ with a discrete probability distribution:
\[
\beta_i \in \{b_0, b_1, \ldots, b_K\} \quad \text{with} \quad \Pr(\beta_i = b_k \mid z_i) = \frac{\exp(z_i'\gamma_k)}{1 + \sum_{l=1}^{K} \exp(z_i'\gamma_l)}.
\]
Here the taste parameters take on a finite number of values, and we have a finite mixture. We can use either the Gibbs sampler, with the indicator of which mixture component an observation belongs to treated as an unobserved random variable, or the EM (Expectation-Maximization) algorithm.
Alternatively (and most commonly) we could specify:
\[
\beta_i \mid z_i \sim N(z_i'\gamma, \Sigma),
\]
where we use a normal (continuous) mixture. Again evaluating the likelihood function would be difficult in this setting. This would involve integrating out the random coefficients, which would be computationally intensive.
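The integral over the random coefficients is usually approximated by simulation: draw $\beta_i$ from its distribution, compute the CL probabilities for each draw, and average. The sketch below does this for a normal mixture with mean `beta` and covariance `Sigma`; the function name, the regressor values, and the parameter values are all hypothetical, and the mean is taken as a fixed vector rather than $z_i'\gamma$ to keep the example short.

```python
import numpy as np

def mixed_logit_probs_sim(X, beta, Sigma, n_draws=5_000, seed=0):
    """Simulated mixed logit choice probabilities for one individual.

    X:     (J+1, K) array whose rows are x_ij'.
    beta:  (K,) mean of the random coefficients.
    Sigma: (K, K) covariance of the random coefficients.
    Averages conditional logit probabilities over draws beta_i ~ N(beta, Sigma).
    """
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(beta, Sigma, size=n_draws)   # (R, K)
    V = draws @ X.T                                              # (R, J+1)
    V = V - V.max(axis=1, keepdims=True)
    P = np.exp(V)
    P = P / P.sum(axis=1, keepdims=True)
    return P.mean(axis=0)

# Hypothetical regressors for three alternatives (K = 2).
X = np.array([[1.0, 0.5],
              [0.2, 1.5],
              [0.0, 0.0]])
beta = np.array([0.8, -0.4])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

p_mixed = mixed_logit_probs_sim(X, beta, Sigma)

# With Sigma = 0 there is no taste heterogeneity and the result is the CL probability.
p_degenerate = mixed_logit_probs_sim(X, beta, np.zeros((2, 2)))
p_cl = np.exp(X @ beta) / np.exp(X @ beta).sum()
```

A simulated log likelihood is then obtained by summing the log of the simulated probability of each individual's observed choice.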
Berry-Levinsohn-Pakes (BLP) (optional reading)
BLP extended these mixed logit models to allow for unobserved product characteristics and endogeneity of alternative characteristics, and developed methods that allow for consistent estimation without individual level alternative data. Their approach has been widely used in the empirical IO (industrial organization) literature to model demand for differentiated products, often in settings with a large number of products.
Compared to the earlier examples we have looked at there is an emphasis in this study, and those that followed it, on the large number of goods and the potential endogeneity of some of the product characteristics. (Typically one of the regressors is the price of the good.) In addition the procedure only requires market level data. We do not need individual level purchase data, just market shares and estimates of the distribution of individual characteristics by market. In practice we need a fair amount of variation in these settings to estimate the parameters well, but in principle this is less demanding in terms of data required. On the other hand, we do need data by market, where before we just needed individual purchases in a single market (although to identify price effects we would need variation in prices by individuals in that case).
6.1 Set-up (optional reading)
Suppose the data have three dimensions: products, indexed by $j = 0, 1, \dots, J$; markets, $t = 1, 2, \dots, T$; and individuals, $i = 1, 2, \dots, N_t$, with $\sum_{t=1}^{T} N_t = N$. We only observe one purchase per individual, $y_i = j$. The large sample approximations are based on large $N$ and $T$ and fixed $J$.
Let's go back to the random coefficients model with each utility indexed by individual ($i$), product ($j$) and market ($t$):
$$U_{ijt} = x_{jt}'\beta_i + \xi_{jt} + \varepsilon_{ijt}$$
The $\xi_{jt}$ is an unobserved product characteristic. This component is allowed to vary by market and product. This can include product and market dummies. For example, we can have $\xi_{jt} = \xi_j + \xi_t$. Unlike the observed product characteristics ($x_{jt}$), this unobserved characteristic ($\xi_{jt}$) does not have an individual-specific coefficient. The inclusion of this component allows the model to rationalize any pattern of market shares. The observed product characteristics ($x_{jt}$) may include endogenous characteristics like the price. The unobserved components $\varepsilon_{ijt}$ have extreme value distributions, independent across all individuals $i$, products $j$ and markets $t$.
The random coefficients $\beta_i$, with dimension equal to that of the observable market-product-level characteristics $x_{jt}$, say $K$, are assumed to be related to individual observable characteristics. We therefore postulate the following linear form:
$$\underset{K \times 1}{\beta_i} = \underset{K \times 1}{\beta} + \underset{L \times K}{\Gamma}'\,\underset{L \times 1}{z_i} + \underset{K \times 1}{\eta_i} \quad \text{with} \quad \eta_i \mid z_i \sim N(0, \Sigma)$$
So if the dimension of $z_i$ is $L$, then $\Gamma$ is an $L \times K$ matrix. The $z_i$'s are normalized to have mean zero, so that the $\beta$'s are the average marginal utilities. The normality assumption is not necessary, and unlikely to be important. Other distributional assumptions can be substituted.
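To make the dimensions concrete, the short Python sketch below draws individual coefficients from this linear form and checks that, once the individual characteristics are demeaned, the average of the draws is close to the vector of average marginal utilities. All numerical values are made up for illustration, and the variable names (`beta`, `Gamma`, `Sigma`, `eta`) are assumptions chosen to mirror the notation, not code from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, L = 50_000, 2, 3                       # individuals, dim of x_jt, dim of z_i

beta = np.array([1.0, -2.0])                 # average marginal utilities, (K,)
Gamma = rng.normal(size=(L, K))              # L x K matrix, as in the text
Sigma = np.array([[0.5, 0.1], [0.1, 0.3]])   # covariance of the unobserved taste shock

z = rng.normal(size=(n, L))
z -= z.mean(axis=0)                          # normalize z_i to have sample mean zero
eta = rng.multivariate_normal(np.zeros(K), Sigma, size=n)

# Row i holds the transpose of (beta + Gamma' z_i + eta_i).
beta_i = beta + z @ Gamma + eta
mean_beta_i = beta_i.mean(axis=0)
```

Because `z` is demeaned, `mean_beta_i` differs from `beta` only through the sampling noise in `eta`.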
BLP developed an approach to estimate models of this type that does not require individual level data. Instead it exploits aggregate (market level) data in combination with estimates of the distribution of $z_i$. Specifically, the data consist of estimated market shares $\hat{s}_{jt}$ for each product (alternative) $j$ in each market $t$, combined with observations from the marginal distribution of individual characteristics, the $z_i$'s, for each market, often from representative data such as the Current Population Survey (CPS).
6.2 Estimation (optional reading)
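(1) Rewrite the utility as:
$$\begin{aligned}
U_{ijt} &= \delta_{jt} + v_{ijt} + \varepsilon_{ijt} \\
\delta_{jt} &= x_{jt}'\beta + \xi_{jt} && \text{(``average return'')} \\
v_{ijt} &= x_{jt}'(\Gamma' z_i + \eta_i) && \text{(``idiosyncratic return'')} \\
\Rightarrow\; U_{ijt} &= x_{jt}'\beta + \xi_{jt} + x_{jt}'(\Gamma' z_i + \eta_i) + \varepsilon_{ijt} \\
&= x_{jt}'\underbrace{(\beta + \Gamma' z_i + \eta_i)}_{\beta_i} + \xi_{jt} + \varepsilon_{ijt}
\end{aligned}$$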
Now consider the implied market share for product $j$ in market $t$, $s_{jt}$, for fixed parameter matrices $\Gamma$ and $\Sigma$ and given $\delta_{jt}$. It can be calculated analytically in simple cases. For example, with parameters $\Gamma = 0$ and $\Sigma = 0$, the market share is:
$$s_{jt}(\delta_{jt}; \Gamma = 0, \Sigma = 0) = \frac{\exp(\delta_{jt})}{\sum_{l=0}^{J} \exp(\delta_{lt})}$$
More generally, this is a more complex relationship. We can always calculate the implied market share
by the following simulation:
Draw from the distribution of $z_i$ in market $t$ and draw from the distribution of $\eta_i$, and calculate
the implied purchase probability (or even simulate the implied purchase by also drawing from
the distribution of $\varepsilon_{ijt}$). Do that repeatedly and we will be able to calculate the market share
for this product/market. Call the vector function obtained by stacking these functions for all
products and markets $s(\delta, \Gamma, \Sigma)$, a $JT \times 1$ vector.
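The simulation just described can be sketched in Python for a single market as follows. This is an illustration under stated assumptions rather than the BLP implementation: the function name and all numbers are hypothetical, the outside good is given utility zero for every simulated consumer, and the draws of individual characteristics come from a standard normal purely as a stand-in for draws from survey data.

```python
import numpy as np

def simulate_shares(delta, x, z_draws, eta_draws, Gamma):
    """Simulated shares for one market; index 0 is the outside good (utility 0).

    delta: (J,) mean utilities of inside goods; x: (J, K) characteristics;
    z_draws: (R, L) and eta_draws: (R, K) simulated consumers; Gamma: (L, K).
    """
    taste_dev = z_draws @ Gamma + eta_draws              # (R, K): Gamma' z_i + eta_i
    v = taste_dev @ x.T                                  # (R, J) idiosyncratic returns
    u = np.column_stack([np.zeros(len(v)), delta + v])   # prepend the outside good
    u -= u.max(axis=1, keepdims=True)                    # guard against overflow
    e = np.exp(u)
    probs = e / e.sum(axis=1, keepdims=True)             # purchase probs per consumer
    return probs.mean(axis=0)                            # (J+1,) simulated shares

rng = np.random.default_rng(0)
J, K, L, R = 3, 2, 2, 10_000
delta = np.array([0.5, -0.2, 0.1])
x = rng.normal(size=(J, K))
Gamma = np.array([[0.4, 0.0], [0.0, -0.3]])
Sigma = np.array([[0.5, 0.1], [0.1, 0.3]])
z_draws = rng.normal(size=(R, L))                        # stand-in for survey draws
eta_draws = rng.multivariate_normal(np.zeros(K), Sigma, size=R)
shares = simulate_shares(delta, x, z_draws, eta_draws, Gamma)
```

Averaging purchase probabilities, rather than simulating actual purchases by also drawing the extreme value errors, gives a smoother function of the mean utilities for a given number of draws.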
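(2) Fix only $\Gamma$ and $\Sigma$. For each value of $\delta_{jt}$ we can find the implied market share. Now find the vector of $\delta_{jt}$ such that the implied market shares are equal to the observed market shares $\hat{s}_{jt}$ for all $j$ and $t$. BLP suggest using the following algorithm. Given a starting value $\delta_{jt}^{0}$, use the updating formula:
$$\delta_{jt}^{k+1} = \delta_{jt}^{k} + \ln s_{jt} - \ln s_{jt}(\delta_{jt}^{k}; \Gamma, \Sigma)$$
where $\ln s_{jt}$ is the log of the observed share and $s_{jt}(\cdot\,; \Gamma, \Sigma)$ is the model-implied share.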
BLP show this is a contraction mapping, so the iteration converges to a unique fixed point. This defines a function $\delta(s; \Gamma, \Sigma)$ expressing $\delta$ as a function of observed market shares with parameters $\Gamma$ and $\Sigma$. Calculating this function can be quite demanding. For each iteration in the contraction mapping we need to approximate the implied market shares accurately, and then we will need to do this repeatedly to get the contraction mapping to converge.
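A minimal Python sketch of this fixed-point iteration is given below. The function names, tolerance, and example shares are assumptions made for illustration. To keep the example checkable, the share function used is the plain logit one (no random coefficients, outside good with mean utility zero), for which the fixed point is known in closed form; in the general case the share function would be a simulator.

```python
import numpy as np

def invert_shares(observed, share_fn, delta0=None, tol=1e-12, max_iter=10_000):
    """Iterate delta <- delta + ln(observed) - ln(share_fn(delta)) to a fixed point."""
    delta = np.zeros_like(observed) if delta0 is None else delta0.astype(float)
    for _ in range(max_iter):
        new = delta + np.log(observed) - np.log(share_fn(delta))
        if np.max(np.abs(new - delta)) < tol:
            return new
        delta = new
    raise RuntimeError("contraction mapping did not converge")

def logit_inside_shares(delta):
    """Plain logit shares of the inside goods; the outside good has mean utility 0."""
    e = np.exp(delta)
    return e / (1.0 + e.sum())

observed = np.array([0.2, 0.3, 0.1])       # hypothetical inside-good shares
delta_hat = invert_shares(observed, logit_inside_shares)

# Closed form in the plain logit case: ln(s_j) - ln(s_0).
closed_form = np.log(observed) - np.log(1.0 - observed.sum())
```

In this example the error shrinks by roughly the combined inside-good share (0.6) per iteration, which illustrates why convergence can be slow when the outside good has a small share.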
Note that this does require that each market share is accurately estimated. If all we have is an estimated
market share, then even if this is unbiased, the procedures will not necessarily work. In that case, the log of
the estimated share is not unbiased for the log of the true share. In practice the precision of the estimated
market share is so much higher than that of the other parameters that this is unlikely to matter.
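The source of the bias is Jensen's inequality, since the logarithm is strictly concave. As a rough illustration, assume (hypothetically) that $\hat{s}_{jt}$ is the sample fraction among $n$ independently sampled consumers, and ignore the event $\hat{s}_{jt} = 0$. A second-order expansion around the true share then gives:

$$E[\ln \hat{s}_{jt}] < \ln E[\hat{s}_{jt}] = \ln s_{jt}, \qquad E[\ln \hat{s}_{jt}] \approx \ln s_{jt} - \frac{\operatorname{Var}(\hat{s}_{jt})}{2 s_{jt}^{2}} = \ln s_{jt} - \frac{1 - s_{jt}}{2 n s_{jt}}.$$

Under this assumption the bias is of order $1/n$, so it is small whenever shares are computed from many consumers and not too close to zero.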
Given this function $\delta(s; \Gamma, \Sigma)$, define the residuals:
$$\hat{\xi}_{jt} = \delta_{jt}(s_{jt}; \Gamma, \Sigma) - x_{jt}'\beta$$
At the true values of the parameters and the true market shares this is equal to the unobserved product/market characteristics $\xi_{jt}$:
$$\xi_{jt} = \delta_{jt} - x_{jt}'\beta$$
(3) Now we can use GMM methods. We assume that the unobserved product/market characteristics ($\xi_{jt}$) are uncorrelated with the observed product/market characteristics, typically with the exception of price. This is not sufficient: the observed product/market characteristics ($x_{jt}$) enter the model directly, and price, which is included in $x_{jt}$, is endogenous. We therefore need more instruments, and typically use things like characteristics of other products by the same firm, or average characteristics of competing products. The general GMM machinery will also give us the standard errors for this procedure. This is where the method is most challenging: finding values of the parameters that set the average moments closest to zero can be difficult.
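In generic GMM notation, step (3) can be sketched as follows. The symbols $\theta$, $\bar{g}$, $w_{jt}$ (the vector of instruments, including the exogenous characteristics) and $A$ (a positive definite weighting matrix) are introduced here only for illustration and are assumptions about notation:

$$E[w_{jt}\,\xi_{jt}] = 0, \qquad \bar{g}(\theta) = \frac{1}{JT} \sum_{t=1}^{T} \sum_{j=1}^{J} w_{jt}\,\hat{\xi}_{jt}(\theta), \qquad \hat{\theta} = \arg\min_{\theta}\; \bar{g}(\theta)' A\, \bar{g}(\theta), \quad \theta = (\beta, \Gamma, \Sigma).$$

Each evaluation of $\bar{g}(\theta)$ requires solving the contraction mapping of step (2) for the trial values of $\Gamma$ and $\Sigma$, which is why the minimization is computationally demanding.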
It is instructive to see what this approach does if we in fact have a conditional logit model with fixed coefficients. In that case $\Gamma = 0$ and $\Sigma = 0$. Then we can invert the market share equation to get the market-specific unobserved alternative characteristics:
$$\delta_{jt} = \ln s_{jt} - \ln s_{0t}$$
where we set $\delta_{0t} = 0$. This is typically the outside good, whose average utility is normalized to zero. The residual is:
$$\hat{\xi}_{jt} = \delta_{jt} - x_{jt}'\beta = \ln s_{jt} - \ln s_{0t} - x_{jt}'\beta$$
With a set of instruments $w_{jt}$ for price included in $x_{jt}$, we run the regression
$$\ln s_{jt} - \ln s_{0t} = x_{jt}'\beta + \xi_{jt}$$
(using $w_{jt}$ as instruments for the endogenous regressor, usually the price, in $x_{jt}$) using as the observational unit the market share for product $j$ in market $t$.
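The fixed-coefficient case can be illustrated with a small simulation in Python: generate shares from a logit model in which price is correlated with the unobserved characteristic, invert the shares, and compare OLS with two-stage least squares. Everything here is a hypothetical sketch: the data-generating values, the excluded cost shifter used as the instrument, and the function name are assumptions for illustration, not part of the notes.

```python
import numpy as np

def two_stage_least_squares(y, X, W):
    """2SLS: project X on the instruments W, then regress y on the fitted values."""
    X_hat = W @ np.linalg.lstsq(W, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(0)
T, J = 500, 10                                  # markets, inside goods per market
beta_true = np.array([-1.0, 1.0, -2.0])         # constant, quality, price

quality = rng.normal(size=(T, J))               # exogenous characteristic
cost = rng.normal(size=(T, J))                  # hypothetical excluded cost shifter
xi = rng.normal(scale=0.5, size=(T, J))         # unobserved characteristic
price = 1.0 + 0.8 * cost + 0.5 * xi + rng.normal(scale=0.3, size=(T, J))

delta = beta_true[0] + beta_true[1] * quality + beta_true[2] * price + xi
denom = 1.0 + np.exp(delta).sum(axis=1, keepdims=True)
s_inside = np.exp(delta) / denom                # s_jt for j = 1..J
s_outside = 1.0 / denom                         # s_0t

y = (np.log(s_inside) - np.log(s_outside)).ravel()   # inversion recovers delta_jt
X = np.column_stack([np.ones(T * J), quality.ravel(), price.ravel()])
W = np.column_stack([np.ones(T * J), quality.ravel(), cost.ravel()])

beta_iv = two_stage_least_squares(y, X, W)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

In this design price is positively correlated with the unobserved characteristic, so OLS understates the magnitude of the (negative) price coefficient, while the instrumented estimate should be close to the value used to generate the data.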
So here the technique is very transparent. It amounts to transforming the market shares to something linear in the coefficients so we can use GMM for a linear model. More generally the transformation is going to be much more difficult, with the random coefficients implying that there is no analytic solution. Computationally these things can get very complicated. Note however that we can estimate these models now without having individual level data, and that at the same time we can get a fairly flexible model for the substitution patterns. Meanwhile we would expect to need a lot of structure to get the parameters precisely estimated, just as in the other models. Of course, if we compare the current model to the nested logit model, we can impose such structure by imposing restrictions on the covariance matrix.
Comparisons of the models are difficult. Obviously if the structure imposed is correct it helps, but we typically do not know what the truth is, so we cannot conclude which one is better on the basis of the data typically available.
References
Train, K. E. (2009). Discrete Choice Methods with Simulation (Second edition). Cambridge and New York,
Cambridge University Press.