DISCRETE CHOICE MODELS FOR POLITICAL ANALYSIS

Advanced Political Methodology
Lecture Notes/Skript

Marco R. Steenbergen
Institut für Politikwissenschaft
Universität Bern

© Spring 2008
Contents

1 Introduction

2 Binary Choice Models
  2.1 The Linear Probability Model
      2.1.1 Derivation and Estimation
      2.1.2 Example
  2.2 Logit and Probit Analysis
      2.2.1 Influences on the CDF
      2.2.2 Derivation
      2.2.3 Estimation
      2.2.4 Interpretation
      2.2.5 Hypothesis Testing
      2.2.6 Model Fit
      2.2.7 Example
      2.2.8 Interpretation
      2.2.9 Heteroskedastic Logit and Probit
  2.3 Alternatives to Logit and Probit
      2.3.1 The Gompit Model
      2.3.2 The Scobit Model

3 Modeling Ordinal Variables
  3.1 Ordered Logit and Probit Analysis
      3.1.1 A Motivating Example
      3.1.2 General Model
      3.1.3 Estimation
      3.1.4 Interpretation
      3.1.5 Model Fit
      3.1.6 Example
      3.1.7 Model Fit
      3.1.8 Interpretation
      3.1.9 Heteroskedastic Ordered Logit and Probit
      3.1.10 The Parallel Regression Assumption
  3.2 Alternative Models
      3.2.1 The Generalized Ordered Logit Model
      3.2.2 Stereotype Regression

4 Probabilistic Choice Models: An Introduction
  4.1 The Basics of PCMs
      4.1.1 Key Characteristics
      4.1.2 Basic Notation
  4.2 Traditions of Probabilistic Choice Modeling
      4.2.1 The Legacy of Duncan Luce
      4.2.2 The Legacy of L.L. Thurstone
      4.2.3 Sequential Choice Models

5 Multinomial and Conditional Logit
  5.1 The Model
      5.1.1 Distributional Assumptions
      5.1.2 Choice Probabilities
      5.1.3 Determinants of Choice
  5.2 Identification
      5.2.1 Multinomial Logit
      5.2.2 Conditional Logit
  5.3 Estimation
  5.4 Interpretation
      5.4.1 Multinomial Logit
      5.4.2 Conditional Logit
  5.5 Hypothesis Testing
  5.6 Model Fit
      5.6.1 Pseudo-R2
      5.6.2 Correct Predictions
  5.7 Example
      5.7.1 Multinomial Logit Analysis
      5.7.2 Conditional Logit Analysis
      5.7.3 Adding Attributes of the Decision Maker to the Conditional Logit Model
  5.8 Independence of Irrelevant Alternatives
      5.8.1 Meaning and Consequences
      5.8.2 Testing for IIA

6 Multinomial Probit
  6.1 The Model
      6.1.1 Distributional Assumptions
      6.1.2 Choice Probabilities
      6.1.3 Determinants of Choice
      6.1.4 Identification
  6.2 Estimation
      6.2.1 Maximum Likelihood Estimation
      6.2.2 Maximum Simulated Likelihood Estimation
      6.2.3 Alternative Estimators
  6.3 Interpretation
      6.3.1 Marginal Effects and Elasticities
      6.3.2 Discrete Change in Predicted Probabilities
  6.4 Model Fit
  6.5 Example
      6.5.1 Estimation Results
      6.5.2 Interpretation
      6.5.3 Model Fit
  6.6 Appendices
      6.6.1 The Method of Maximum Simulated Likelihood
      6.6.2 Predicted Probabilities through Quadrature

7 Nested Logit
  7.1 Nesting and Sequential Choice
  7.2 The Model
      7.2.1 Choice Probabilities
      7.2.2 Decomposition of Choice Probabilities
      7.2.3 Determinants of Choice
  7.3 Identification
  7.4 Estimation
      7.4.1 Sequential MLE
      7.4.2 Full Information MLE
  7.5 Interpretation
      7.5.1 Marginal Effects and Elasticities
      7.5.2 Discrete Change in Predicted Probabilities
  7.6 Model Fit
  7.7 Hypothesis Testing
  7.8 Deciding on a Nesting Structure
  7.9 Example
      7.9.1 Estimation Results
      7.9.2 Interpretation
      7.9.3 Model Fit
      7.9.4 Hypothesis Testing

8 Choice Set Models
  8.1 The Model
      8.1.1 Choice Probabilities
      8.1.2 Determinants of Choice
      8.1.3 Special Cases
  8.2 Identification
  8.3 Estimation
      8.3.1 The Log-Likelihood Function
      8.3.2 A Memetic Approach
  8.4 Interpretation
      8.4.1 Marginal Effects and Elasticities
      8.4.2 Discrete Change in Predicted Probabilities
  8.5 Example
      8.5.1 Estimation Results
      8.5.2 Interpretation
      8.5.3 Predicted Probabilities

9 The Theory of Maximum Likelihood Estimation
  9.1 The Intuition behind MLE
  9.2 Estimating Single Parameters
      9.2.1 The Likelihood and Log-Likelihood Functions
      9.2.2 Optimizing the Log-Likelihood
      9.2.3 Standard Errors
  9.3 Estimating Multiple Parameters
      9.3.1 Likelihoods, Gradients, and Hessians
      9.3.2 Standard Errors
      9.3.3 MLE of the Linear Regression Model
  9.4 Properties of MLE
  9.5 A Primer on Numerical Optimization
      9.5.1 Hill-Climbing Algorithms
      9.5.2 Computation of Standard Errors
      9.5.3 Algorithmic Convergence
  9.6 MLE and Hypothesis Testing
      9.6.1 Testing Simple Hypotheses
      9.6.2 Testing Joint Hypotheses
List of Figures

2.1 The Standard Logistic and Standard Normal CDFs
2.2 Role of the Constant in Logit and Probit
2.3 Impact of the Sign of β1 in Logit and Probit
2.4 Impact of the Size of β1 in Logit and Probit
2.5 Logit Log-Likelihoods for a Constant-Only Model
2.6 Discrete Change versus Marginal Effects
2.7 Predicted Probability of Voting for Gore by Partisanship and Religious Beliefs
2.8 Standard Gumbel Distribution
2.9 Burr-II Distribution
3.1 Choice Mechanism in Ordinal Regression Models
3.2 Economic Retrospection as a Function of Income
3.3 Parallel Regression Assumption
4.1 R. Duncan Luce
4.2 Louis Leon Thurstone
4.3 Amos Tversky, Daniel McFadden, and Charles Manski
7.1 Example of a Nested Decision Problem
7.2 1994 Dutch Elections—Nesting by Government or Opposition Status
8.1 Examples of the PCMNL Log-Likelihood
8.2 Vote Choice and Class in Britain
9.1 Binomial Likelihood Function
9.2 Poisson Likelihood Function
9.3 Poisson Log-Likelihood Function
9.4 Normal Likelihood Function
9.5 Normal Likelihood Contour
9.6 The Operation of Hill-Climbing Algorithms
List of Tables

2.1 LPM of Vote Choice in the U.S. in 2000
2.2 Correct Prediction in Logit and Probit Models
2.3 Logit and Probit Models of Vote Choice in the U.S. in 2000
2.4 Marginal Effect of the Trait Differential on Vote Choice
2.5 Predicted Probability of Voting for Gore by Partisanship and Religious Beliefs
2.6 Discrete Change in the Predicted Probability of a Gore Vote
2.7 Odds Ratios for the Gore Vote
2.8 Heteroskedastic Probit Model of Vote Choice in 2000
2.9 Marginal Effects for Heteroskedastic Probit
2.10 Gompit Model of Military Coups
2.11 Scobit and Rare Events Logit Models of Military Coups
3.1 Correct Prediction in Ordered Regression Models
3.2 Ordinal Regression Models of Economic Perceptions in the U.S. in 2004
3.3 Marginal Effect of Income on Economic Perceptions
3.4 Race and Economic Perceptions
3.5 Partisanship and Economic Perceptions
3.6 Discrete Change in Predicted Probabilities
3.7 Cumulative Odds Ratios
3.8 Heteroskedastic Ordered Probit Model of Economic Perceptions
3.9 Brant Test of Economic Perceptions Model
3.10 Generalized Ordered Logit Model of Economic Perceptions
3.11 Stereotype Regression Model of Economic Perceptions
5.1 Conditional versus Multinomial Logit
5.2 Multinomial Logit Model of Dutch Vote Choice in 1994
5.3 Marginal Effects for the Multinomial Logit Vote Choice Model
5.4 Discrete Change for the Multinomial Logit Vote Choice Model
5.5 Odds Ratios for the Multinomial Logit Vote Choice Model
5.6 Person-Choice Matrix
5.7 Conditional Logit Model of Dutch Vote Choice in 1994
5.8 Marginal Effects for the Conditional Logit Vote Choice Model
5.9 Elasticities for the Conditional Logit Vote Choice Model
5.10 Conditional Logit Model with Voter Attributes
6.1 Multinomial Probit Model of Dutch Vote Choice in 1994
6.2 MNP Error Covariances and Correlations
6.3 Marginal Effects for the Multinomial Probit Vote Choice Model
6.4 Discrete Changes in Predicted Probabilities
7.1 Nested Logit Model of Dutch Vote Choice in 1994
8.1 PCMNL Estimates for the 2005 British House Elections
8.2 Predicted Choice Sets of British Voters
8.3 Elasticities for Ideological Distance in Britain
9.1 Different MLE Algorithms
9.2 Numerical Estimates of the VCE
Chapter 1
Introduction
Choice is one of the most central topics in politics. Whether we model voter
turnout, decisions by members of parliament, or the presence or absence of
war between two countries, choices are involved. Some decision maker—a
voter, MP, or country—looks at the available alternatives and then chooses
one of them. The consequences of these choices can be momentous and thus
political scientists are interested in how they get made.
As important as political choice is, however, statistical choice models
have received scant attention in political science. While it is now common
to teach students binary choice models—i.e. models for choices that involve
two alternatives—more complex choice models often do not appear in the
curriculum. The number of articles by political scientists utilizing sophisticated choice models remains rather small (for exceptions see Alvarez and
Nagler 1998; Born 1992; Quinn, Martin, and Whitford 1999; Whitten and
Palmer 1996).
Such an important topic as choice deserves more attention than this.
This course and these notes, then, are dedicated to providing a thorough
introduction to choice modeling. They start by considering binary choice
models, then move to models for polychotomous ordered variables, and then
conclude by considering a range of models for polychotomous unordered
variables. By the end, you should have a pretty good overview of the state
of the art of choice modeling and you should feel comfortable estimating and
interpreting these models yourself.
The emphasis in these notes is not just on the statistical aspect of choice
models. It is also important that you understand the behavioral assumptions
that underlie the various models. And it is important that you know how
to interpret the results. These aspects receive a great deal of emphasis
throughout the course.
The models covered in these notes can be estimated in a variety of statistical packages. These notes focus on Stata because most of you are already
familiar with this software and because it happens to have strong capabilities in discrete choice modeling. Accordingly, the notes provide the relevant
Stata syntax and show examples of Stata output.
Estimation of discrete choice models typically occurs using maximum
likelihood estimation (MLE). Chapter 9 contains a detailed discussion of
MLE and numerical optimization. You should consult this material if you
feel uncertain about MLE or if you want to obtain more detailed insight in
how model estimation is accomplished.
Chapter 2
Binary Choice Models
Binary response variables are extremely common in political science. Students of voting behavior want to know if a citizen participated in an election
or stayed home. Experts on legislatures are interested in modeling whether a
representative voted yea or nay on a bill. In comparative politics, scholars
may want to know if a country has a presidential or parliamentary system.
In international relations, students of inter-state conflict are interested in
whether two countries went to war against each other or stayed at peace.
Contrary to what your introductory econometrics text may have told
you, it is possible to analyze these kinds of binary responses using linear
regression analysis. The resulting model is known as the linear probability model (LPM). As we shall see, however, this model has a number of
drawbacks, which have caused econometricians and statisticians to look for
alternatives. The logit and probit models are two particularly well-known
alternatives, but there are others. In this chapter, we discuss how one can
derive models for binary response variables, estimate them via MLE, and
interpret the results.
2.1 The Linear Probability Model

2.1.1 Derivation and Estimation
The LPM and other models for binary responses can be motivated in a
number of different ways. Here I focus on decision theoretic rationales,
which are common in econometrics.1 Consider an individual i who faces a
choice between two alternatives, θ or θ̄ (i.e. not θ). For example, θ could be
voting in an election, a yea vote for a bill, or going to war against another
country. We indicate the individual's choice with a binary indicator

    yi = 1 if θ is chosen
         0 if θ is not chosen

Footnote 1: In biostatistics, the motivation for these models is quite different. One
common rationale in this field is the dose-response rationale, which considers the
probability of a particular response (e.g. being cured from an illness or death) from a
specified dose of a medication or pathogen. Since binary response variables arise mostly
in the context of choice in political science, the econometric rationale seems more helpful.
Imagine that we ignored the admonition that the linear regression model
should be used only for continuous response variables. In this case, we could
model the binary response variable via the population regression function
    yi = x′i β + εi

where x′i is a set of attributes of the decision maker.2 Under the standard
assumption that E[εi] = 0, we know that

    E[yi] = x′i β

Since yi is binary, however, we also know that3

    E[yi] = Pr[yi = 1] = πi

Substituting this result into the regression model, we get

    πi = x′i β                                                   (2.1)

Here πi is the probability that i chooses θ (and 1 − πi is the probability
that he/she does not choose θ). In equation (2.1), this is a linear function
of the covariates in xi . Thus, the model in (2.1) is known as the linear
probability model.
The LPM has a number of advantages. First, it can be estimated using
least squares estimation, a method that is known to have desirable small
sample properties. By contrast, the logit and probit alternatives that will be
discussed in the next section can be estimated only through MLE, a method
whose desirable properties are asymptotic only. Second, since the LPM is
a linear model, interpretation is straightforward. Let βk be the coefficient
associated with a predictor xik. In the LPM, the interpretation of this
coefficient is literally that, for a unit increase in xik, πi is expected to increase
by βk units, holding all else equal. Since logit and probit are inherently
non-linear models, their interpretation is considerably more complex, as we shall
see.

Footnote 2: In these notes, it is assumed that all vectors are column vectors.

Footnote 3: A 0-1 binary variable y follows the Bernoulli distribution,
f(y) = π^y (1 − π)^(1−y), where π is the probability of success (i.e. choosing θ).
The expectation of this distribution is π.
Limitations of the LPM
Despite these advantages, the LPM is only used infrequently these days.
The reason is that the model has a number of well-known drawbacks. In increasing order of severity, these drawbacks include: (1) the error distribution
is not normal; (2) the LPM has built-in heteroskedasticity; (3) the model
produces out-of-bounds predictions; and (4) the linear functional form is
frequently invalid. Let us address each of these concerns in turn.
First, since yi follows the Bernoulli distribution, it follows immediately
that εi is not normally distributed. Instead, εi also follows a Bernoulli
distribution: it takes on the value 1 − x′i β with probability πi and the value
−x′i β with probability 1 − πi.4 The reason why we would like the disturbances
to be normally distributed is that it then follows immediately that β̂k follows
a normal distribution (see e.g. Gujarati 2003). We take advantage of this fact
for purposes of hypothesis testing. The non-normality of εi, however, is usually
no great concern. If the sample size is sufficiently large, then we can rely on
the central limit theorem to prove that the sampling distribution of β̂k is
asymptotically normal. Hypothesis testing can then proceed as usual. Thus,
non-normality becomes a concern only in small samples.
Second, the LPM has built-in heteroskedasticity. Specifically, we can
demonstrate that:5

    V[εi] = x′i β(1 − x′i β)

We see that the variance of εi is a function of the predictors, which is a
classical form of heteroskedasticity.
The presence of heteroskedasticity in the LPM makes it an unwise choice
to estimate this model using OLS. However, this is one of the rare instances
where the nature of the heteroskedasticity is known precisely, so that we can
use weighted least squares (WLS) without any difficulty. Goldberger (1964)
outlined a two-step procedure to deal with the problem.

Footnote 4: This follows immediately from the mathematical definition of the disturbance
term, namely εi = yi − x′i β. Substituting the different values of yi then yields the two
distinct values that the disturbance can take on in the LPM.

Footnote 5: Proof: The variance of a Bernoulli-distributed variable is πi(1 − πi). In the
LPM, πi = x′i β. Substitution of this result produces the desired equation.
1. First, estimate the LPM using OLS. Obtain the predicted values ŷi = x′i β̂.
   Then compute weights

       wi = ŷi(1 − ŷi)

2. Next, perform WLS using √wi as the weight variable. This is tantamount
   to performing OLS on

       yi/√wi = (1/√wi) x′i β + εi/√wi

   Notice that the new error term δi = wi^(−.5) εi is homoskedastic:
   V[δi] = (wi^(−.5))² V[εi] = wi^(−1) V[εi] = [x′i β(1 − x′i β)]^(−1) x′i β(1 − x′i β) = 1.
   A minimal Stata sketch of these two steps follows below.
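The sketch below carries out Goldberger's two steps in Stata. It is only an illustration:
the response variable choice and the covariates x1 and x2 are hypothetical placeholders,
and observations with out-of-bounds predictions are simply dropped rather than adjusted.

    * Step 1: OLS estimates of the LPM and the implied Bernoulli variances
    regress choice x1 x2
    predict yhat, xb
    generate w = yhat*(1 - yhat)

    * Step 2: WLS, weighting each observation by the inverse of its error variance;
    * observations with w <= 0 (out-of-bounds predictions) are excluded
    regress choice x1 x2 [aweight = 1/w] if w > 0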
Since we can perform WLS, the built-in heteroskedasticity in the LPM is
but a minor nuisance. That is to say, it is a minor nuisance as long as the
weights can be calculated, which requires that there can be no out-of-bounds
predictions.
Since the LPM essentially predicts a probability, namely πi , it has to
be the case that 0 ≤ x′i β̂ ≤ 1. Instances where x′i β̂ < 0 or x′i β̂ > 1
make no sense conceptually. We say that these are out-of-bounds predictions. Unfortunately, there is no guarantee that the least squares estimators
will produce predictions within the probability bounds. The least squares
estimation procedures treat yi just as any response variable, namely something that is theoretically bounded between negative and positive infinity.
As such, there is a more than negligible chance to find that ŷi takes on values
less than zero or greater than one.
The problem here is not just a conceptual one. With out-of-bounds
predictions, WLS will also fail. The reason is that wi < 0 whenever ŷi < 0
or ŷi > 1, and the square root of a negative number is undefined. Indeed,
WLS already becomes problematic when the predictions are right at the
bounds, since wi = 0 in these instances and dividing by the square root of
zero leads to undefined results.
What to do in this case? For purposes of WLS, Achen (1986) proposed
replacing the out of bounds predictions with ones that are just inside the
bounds. However, this seems highly arbitrary. Only slightly less arbitrary
is to prevent out-of-bounds predictions in the first place by running a constrained LPM model. Here, the coefficients are estimated in such a way
that all predicted values lie between 0 and 1, in recognition of the fact that
we are predicting a probability. Unfortunately, the restricted LPM model
is frequently beset with multicollinearity problems. It is not difficult to see
why: once one has chosen one parameter estimate, the other estimates often
follow from it and the restrictions that are being imposed. It is also common
for this model to produce negative estimates of the variance. Thus, on the
problem of out-of-bounds predictions there are really no good solutions.
Finally, the linear functional form of the LPM is in many cases questionable. This functional form implies that πi changes at a constant rate,
regardless of the starting point on a predictor. A simple example can illustrate why this implication is often implausible. Consider a consumer
deciding whether to purchase an article that costs $2. Among the factors
that influence the decision is wealth. Imagine a person whose current wealth
is $0. If we increase this person’s wealth by one unit, for example by giving
him/her $1, then the probability of purchasing the article will not increase
by a noticeable amount because the person is still short $1. Next consider
a millionaire. If that person has not yet acquired the article it is surely not
because of financial reasons. Hence, increasing this person’s wealth by one
unit should also not have much of an effect on acquisition of the article.
Finally, consider someone whose current wealth is $1. Here a unit increase
in wealth could have a tremendous effect because the individual goes from
being unable to afford the article to being able to purchase it. The point
here is that changes in the probability are not constant across the predictor,
as the LPM assumes. On the contrary, these changes are larger or smaller
depending on what the initial value of the predictor is.
Indeed, the example suggests that an S-shaped functional form may be
more reasonable. Here, initial increases in a covariate have relatively little
impact on πi (e.g. giving someone with nothing one dollar). However, as
one moves along the scale of a covariate, the effect of unit increases become
ever larger, at least up to a certain point (e.g. giving a person with one
dollar another dollar so that they can acquire an article). Then, the changes
become smaller again (e.g. giving a dollar to a millionaire). The LPM is
ill-suited to accommodate these S-shaped changes in the effects of covariates
on πi . This is one of the most important reasons why this model has been
replaced by other models, most notably logit and probit.
Note that if an S-shaped functional form captures reality, then imposing
a linear relationship is tantamount to a specification error. Such an error
will result in bias in the OLS/WLS estimator. Thus, the problem that the
LPM uses an inappropriate functional form should be taken very seriously.
2.1.2 Example
I illustrate the LPM model with an analysis of vote choice in the 2000
U.S. Presidential elections. The response variable is whether a respondent
voted for Al Gore (y = 1) or George W. Bush (y = 0).6 The covariates
include age, gender (1 = male), race (1 = white), religion (1 = born again
Christian), education, household income, partisanship, a trait differential,
and an issue differential. The trait differential is measured as the difference
between trait evaluations of Gore and Bush. Higher values mean that the
trait evaluations of Gore are more positive than those of Bush. The issue
differential is measured as the difference in the issue distances between the
respondent and Bush and Gore, respectively. Higher values mean that the
respondent is further away on the issues from Bush than from Gore.
We begin by computing the OLS estimates of the LPM of vote choice.
These estimates are obtained by using the following syntax:7
    regress choicevar [varlist] [, level(#)]
The resulting estimates are shown in the columns marked OLS in Table 2.1.
These estimates are straightforward to interpret. For example, the coefficient on partisanship suggests that for a unit increase, which means moving
from the Democratic to the Republican end of the party identification scale,
the probability of voting for Gore decreases, on the average, by .1 points.
The OLS results are marred by the problem of built-in heteroskedasticity that plagues the LPM. Such heteroskedasticity means that the standard
errors for the coefficients cannot be trusted, so that claims about the significance (or lack of significance) of predictors are suspect. To address this
issue, we can implement Goldberger’s (1964) method. Tamás Bartus has
written the linprob package that accomplishes this task in Stata. The
generic syntax for this package is as follows:
    linprob choicevar [varlist] [, gen(varname)]
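For the vote choice analysis described above, the two commands might be invoked as
follows. This is only a sketch: the variable names are hypothetical stand-ins for the
actual variables in the data set.

    regress gore partisan male white bornagain age educ income traitdiff issuediff
    linprob gore partisan male white bornagain age educ income traitdiff issuediff, gen(oob)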
Here, the option gen(varname) generates an indicator variable that flags
all out of bound predictions. The WLS results can be found in the last
two columns of Table 2.1.

Footnote 6: Non-voters or respondents who voted for a candidate from a different party
have been dropped from the sample.

Footnote 7: The level option controls the size of the confidence intervals of the parameter
estimates.
Table 2.1: LPM of Vote Choice in the U.S. in 2000

                              OLS                      WLS
Predictor               Estimate    S.E.         Estimate    S.E.
Partisanship            −.100**     .015         −.095**     .012
Male                    −.016       .046          .022       .040
White                   −.110       .082         −.148*      .069
Born Again Christian    −.141**     .049         −.168**     .039
Age                     −.003+      .002         −.004**     .001
Education               −.008       .017         −.010       .011
Household Income        −.012+      .006         −.019**     .005
Trait Differential       .195**     .041          .281**     .037
Issue Differential       .005       .042          .001       .033
Constant                1.216**     .153         1.340**     .124

Notes: N = 142. ** p < .01, * p < .05 (two-tailed). 39 observations produced
out-of-range predictions.
We see that, relative to the OLS results, few things have changed. In the
WLS solution, race is now significant and the significance levels of age
and household income have improved. Other than
this, the results are comparable. Note, however, the very large number
of observations (61) that produced out-of-bound predictions. This sheds
considerable doubt on the appropriateness of the LPM for this data.
2.2 Logit and Probit Analysis
As the discussion of the LPM makes clear, the ideal mapping of predictors to
probabilities is one that avoids heteroskedasticity and out-of-bounds predictions while being accommodating to a sigmoid shaped relationship between
x′i and πi. Two mappings that comply with these requirements are

    πi = Φ(x′i β)                                                (2.2)

and

    πi = Λ(x′i β) = exp(x′i β) / [1 + exp(x′i β)]                (2.3)

where Φ(.) denotes the standard normal cumulative distribution function
(CDF) and Λ(.) the standard logistic CDF. Equation (2.2) describes the
Figure 2.1: The Standard Logistic and Standard Normal CDFs
well-known probit model, while equation (2.3) describes the equally well-known logit model.8
The probit and logit specifications are ideal for a number of reasons.
First, both specifications are, by definition, homoskedastic. The variance of
the probit error terms is fixed at 1, while the logit error term has a variance
of π 2 /3, where π is the mathematical constant pi. Second, both specifications avoid out-of-bounds predictions because they are based on CDFs,
which are bounded between 0 and 1.9 Finally, as Figure 2.1 illustrates, both
specifications allow for an S-shaped relationship between the predictors and
πi . For these reasons, the logit and probit specifications are a useful and
widely used alternative to the LPM.
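As a quick numerical check of equations (2.2) and (2.3), Stata's built-in normal() and
invlogit() functions return Φ(.) and Λ(.) directly; the index value of 0.5 used below is
purely illustrative.

    * Probit and logit probabilities for an arbitrary index value x'b = 0.5
    display normal(0.5)       // Phi(0.5), approximately .69
    display invlogit(0.5)     // Lambda(0.5) = exp(.5)/(1+exp(.5)), approximately .62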
2.2.1 Influences on the CDF

Let us write the generic model as

    πi = F(x′i β)                                                (2.4)

where F(.) is a CDF, which is either the standard normal or standard logistic CDF.

Footnote 8: The logit model has been credited to the American physicist and statistician
Joseph Berkson, who developed it in 1944. Because the logit model is based on the logistic
distribution, it is also known as the logistic model. However, Berkson (1944) preferred the
term logit due to its similarity to the term probit, which had been coined by the
entomologist Chester Bliss (1934) ten years earlier. The term logit refers to the
transformation ln[π/(1 − π)] that, as we will see, linearizes the logit model. The probit
model has been credited to Bliss, although its roots go back to Gustav Fechner, who
experimented with the model in the 19th century.

Footnote 9: Technically, the values of 0 and 1 are the lower and upper asymptotes,
respectively, of the logit/probit CDFs.
Figure 2.2: Role of the Constant in Logit and Probit
How do the parameters influence the shape of the probability curve?
To answer this question, let us consider the simple model πi = F (β0 + β1 xi ).
This is a logit or probit model with a single predictor, as well as constant.
In this model, the constant β0 is a shift parameter. It shifts the entire
probability curve to the left or to the right, as is illustrated in Figure 2.2.
If β0 = 0, then we attain πi = .5 when yi∗ = 0. When β0 > 0, then the
curve shifts to the left, which means that we attain πi = .5 when yi∗ < 0.
This captures a world in which the net utility for an alternative can be low,
yet decision makers are still inclined to choose that alternative. Finally,
when β0 < 0, then the curve shifts to the right, which means that we attain
πi = .5 when yi∗ > 0. This captures a situation in which decision makers
are inclined to choose an alternative only if it gives them considerable net
utility.
Turning our attention to β1 , the sign of this coefficient influences the
direction of the relationship between πi and xi . As is illustrated in Figure
2.3, if β1 > 0, then the probability of choosing an alternative increases
with values of the predictor. On the other hand, if β1 < 0, then choosing
the alternative becomes less likely when the values of xi increase. (β1 = 0
captures the situation where the probability of choosing an alternative does
Figure 2.3: Impact of the Sign of β1 in Logit and Probit
not depend on xi .)
In terms of its size, β1 can be interpreted as a stretch parameter. As
indicated in Figure 2.4, β1 either stretches out or shrinks the probability
curve, which influences how quickly changes in πi occur. Specifically, the
smaller β1 , the more stretched out the curve is, which means that changes
in πi occur more gradually. By contrast, the greater β1 , the more condensed
the curve is, which means that changes in πi occur rapidly. We also see that
higher values of β1 amplify the sigmoidal nature of the probability curve.
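The shift and stretch behavior shown in Figures 2.2 through 2.4 is easy to reproduce.
A minimal Stata sketch is given below; the particular values of β0 and β1 are arbitrary.

    * Overlay two logit curves that differ only in the stretch parameter b1
    twoway (function y = invlogit(0 + 1*x),  range(-4 4)) ///
           (function y = invlogit(0 + 10*x), range(-4 4)), ///
           ytitle("Pr(y=1)") xtitle("x")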
2.2.2 Derivation
The discussion so far motivates logit and probit modeling in terms of practical advantages (homoskedasticity, proper predictions, and an appealing
functional form). But what is the theoretical rationale for these models?
That is, how can they be derived from a theoretical model of choice behavior? Here, I outline two theoretical frameworks: (1) latent regression models
and (2) stochastic utility theory.10
Footnote 10: Much of the work on the micro-level foundations of the logit and probit
models is due to the economist Daniel McFadden (1973).
Figure 2.4: Impact of the Size of β1 in Logit and Probit
Latent Regression Models
One way in which the logit and probit models can be derived is to think of the
choice variable, yi , as a manifestation of an unobserved or latent variable,
yi∗ , which is continuous. This latent variable can be seen as an infinitely
fine-grained measure of the net benefit/pleasure/utility that someone derives
from choosing alternative θ. We do not observe this utility; all we observe
is the final choice that it produces. The mechanism linking yi to yi∗ is:
    yi = 1 if yi∗ > 0
         0 if yi∗ ≤ 0                                            (2.5)
This mechanism means that a person chooses θ, so that yi = 1, only if she
derives some positive net utility from the alternative.11
Footnote 11: A more general formulation is to say that yi = 1 if yi∗ > k, where k is some
arbitrary constant. The restriction k = 0 that we impose here may seem arbitrary, but it
is of little practical relevance, since the model for yi∗ includes a constant that can
accommodate deviations from k = 0 if the data suggest so.

We can now model yi∗ as a linear function of attributes of the decision
maker plus a constant, contained in x′i, and a stochastic component (εi).
The stochastic component captures unobserved aspects of the net utility
of θ, which in this case is mostly measurement error or attributes of the
decision maker that are unobserved (from the perspective of the modeler—
see Manski 1997). Thus,
    yi∗ = x′i β + εi                                             (2.6)

Econometricians sometimes refer to x′i β as the index function. Notice
that the index function contains a constant, β0.
We can now develop the latent regression model a little further. Combining
equations (2.5) and (2.6), we have12

    yi = 1 if εi > −x′i β
         0 if εi ≤ −x′i β                                        (2.7)

Moreover, the stochastic nature of εi means that choice is probabilistic. That
is, the outcome yi = 1 is stochastic because one of the parts driving it, εi, is
stochastic. Accordingly, choice is probabilistic and the choice probabilities
are given by

    Pr[yi = 1] = πi = Pr[εi > −x′i β]                            (2.8)

Similarly, Pr[yi = 0] = 1 − πi = Pr[εi ≤ −x′i β].
Equation (2.8) is unidentified unless we are willing to make some distributional
assumptions. In the probit model, we assume that εi ∼ N(0, 1),
i.e. the stochastic component follows a standard normal distribution. Since
the standard normal distribution is a symmetric distribution, it follows that
Pr[εi > −x′i β] = Pr[εi < x′i β] = F(x′i β), where F(.) is the CDF, in this
case the standard normal CDF, Φ(.). Thus,

    πi = Φ(x′i β)

which is equation (2.2) that we discussed before. In the logit model, we
assume that εi ∼ L(0, π²/3), i.e. the stochastic component follows a standard
logistic distribution. This, too, is a symmetrical distribution, so that
Pr[εi > −x′i β] = Pr[εi < x′i β] = F(x′i β). Here F(.) = Λ(.), which is the
standard logistic CDF, so that

    πi = Λ(x′i β)

Footnote 12: Proof: From (2.5) we know that yi = 1 if yi∗ > 0. Substituting x′i β + εi for
yi∗, this can be formulated as yi = 1 if x′i β + εi > 0. Subtracting the index function from
both sides of the inequality, we get εi > −x′i β.
This is equation (2.3) that we discussed before.
The latent regression approach shows that the derivation of the logit and
probit models follows a clear logic. Even the distributional assumptions are
not as arbitrary as they may seem at first glance. Specifically, assuming
a symmetrical distribution for εi with mean 0 makes a great deal of sense
if one believes that the stochastic component of the latent variable has no
systematic tendency and if positive and negative displacements from x′i β
are distributed evenly.13

Footnote 13: There are other distributional assumptions that one can make, as we shall
see in Chapter 2.3. Whether those assumptions are more attractive depends on the nature
of the problem that one is analyzing.

Stochastic Utility Theory

The logit and probit models can also be derived from the stochastic utility
model (a.k.a. the random utility model), which finds its roots in both economics
(e.g. von Neumann and Morgenstern) and psychology (e.g. Thurstone).14
Let yi indicate the choice between two alternatives, θ and θ̄, such that yi = 1
if θ is chosen and yi = 0 if θ̄ is chosen. Further, let Uiθ and Uiθ̄ be the
utilities of both alternatives. In the stochastic utility model, these utilities
have both fixed and stochastic components:

    Uiθ = x′i βθ + εi,θ
    Uiθ̄ = x′i βθ̄ + εi,θ̄                                          (2.9)

Here, x′i is a vector of observed attributes of the decision maker, βθ and
βθ̄ are the effects of those attributes on the utilities of alternatives θ and
θ̄, respectively, and εi,θ and εi,θ̄ are random and unobserved components,
including attributes of the decision maker that are hidden from the modeler
(e.g. Manski 1997).

Footnote 14: Also see Nakosteen and Zimmer (1980) for a derivation of logit and probit
analysis from stochastic utility theory.

The stochastic utility model assumes that decision makers are utility
maximizers. This means that yi = 1 if Uiθ > Uiθ̄. Since the utilities contain
a stochastic component, choice is again probabilistic. Thus,
    Pr[yi = 1] = πi
               = Pr[Uiθ > Uiθ̄]
               = Pr[x′i βθ + εi,θ > x′i βθ̄ + εi,θ̄]
               = Pr[x′i βθ + εi,θ − x′i βθ̄ − εi,θ̄ > 0]
               = Pr[x′i (βθ − βθ̄) + εi,θ − εi,θ̄ > 0]              (2.10)
               = Pr[x′i β + εi > 0]
               = Pr[εi > −x′i β]

where β = βθ − βθ̄ and εi = εi,θ − εi,θ̄. By imposing the appropriate
distributional assumption on εi we obtain, once again, the logit and probit
models.
Stochastic utility theory provides a powerful framework for understanding binary and other choices. We shall encounter this framework again in
Chapters 4-8, where we shall consider choices between more than two alternatives.
The Problem of Model Identification
You may wonder why we need to fix the variance of the stochastic components. In the LPM, as in regression analysis more generally, the variance is
an estimated parameter, so why is it not here? The reason for fixing the
variance is that the underlying latent variable, yi∗ , has no scale. This is
different from regression analysis, since the response variable is measured in
that approach. With the scale of yi∗ undefined, the scale of β is also undefined. Thus, the parameters are under-identified, meaning that they could
be changed by multiplying them with an arbitrary constant δ.15 By fixing
the variance of εi, we standardize the scale of yi∗ which, in turn, allows us to
fix the scale of β. Thus, a variance assumption concerning εi is a necessary
identifying constraint without which estimation would fail.

Footnote 15: To see this, imagine that we replace yi∗ with wi∗ = δyi∗. Since yi∗ = x′i β + εi,
it follows that wi∗ = δ(x′i β + εi) = x′i (δβ) + δεi. Hence, βw∗ = δβy∗. Note that
V[w∗] = δ²V[y∗].

There is nothing sacred about setting the variance of εi to 1, as is done in
the probit model, or to π²/3, as is done in the logit model. These particular
constraints are arbitrary. This fact has two important implications. First,
because the scale of β is tied to the particular variance assumption, one can
expect logit and probit estimates to come out differently, a point we shall
discuss in greater detail in Chapter 2.2.3. Second, because the scale of β
is fixed arbitrarily, it makes no sense to give a direct interpretation to the
coefficients contained in this vector.
One might wonder whether the arbitrariness of the identifying constraint
does not completely jeopardize the validity of the logit and probit models.
The answer is negative, because the identifying assumption has no impact
on πi, which is the quantity that interests us ultimately.16 It is easy to
illustrate this for the logit model. The standard logistic distribution may
also be written as

    F(ε) = 1 / [1 + exp(−ε)]

In this distribution, V[ε] = π²/3. Imagine that we want to transform ε so
that it has a variance of 1, just as in the probit model. We can do this by
dividing ε by σ, which is the square root of V[ε]. Of course, if we transform
ε this way, then we should transform the other elements of (2.6) as well.
Thus,

    yi∗/σ = x′i β/σ + εi/σ
          = x′i β/σ + νi

Here ν follows the standardized logistic distribution, which is given by

    F(ν) = 1 / [1 + exp(−(π/√3) ν)]

If we now substitute ε/σ for ν, we get

    F(ν) = 1 / [1 + exp(−(π/√3)(√3/π) ε)]
         = 1 / [1 + exp(−ε)]

Thus, changing the variance (and thus the identifying restriction) has no
impact on the CDF. As a result, the probabilities remain unaffected, even
if the scale of the parameters is changed by changing the variance of the
stochastic component.

Footnote 16: That is, πi is an estimable function, which means that it is invariant to any
identifying constraints imposed on the parameters (e.g. Searle 1971).
2.2.3 Estimation
Likelihood and Log-Likelihood Functions
Unlike the LPM, estimation of the logit and probit models requires maximum likelihood estimation (MLE). Assume that we have drawn a sample of
n independent observations. On the response variable, these observations
are a collection of 0s and 1s, where the 1s occur with probability πi and the
0s with probability 1 − πi . The likelihood function is then given by
    L = Π_{yi=1} f(yi = 1) Π_{yi=0} f(yi = 0)
      = Π_{yi=1} πi Π_{yi=0} (1 − πi)
      = Π_{yi=1} F(x′i β) Π_{yi=0} [1 − F(x′i β)]

This can be written more compactly as

    L = Π_{i=1}^{n} [F(x′i β)]^{yi} [1 − F(x′i β)]^{1−yi}

where F(.) was defined in (2.4). Keeping in mind that the standard normal
and standard logistic distributions are symmetrical, we know that
1 − F(x′i β) = F(−x′i β). This suggests yet another formulation of the
likelihood function:

    L = Π_{i=1}^{n} F(qi x′i β)

where qi = 2yi − 1. The corresponding log-likelihood function can be formulated
as:

    ℓ = Σ_{i=1}^{n} { yi ln[F(x′i β)] + (1 − yi) ln[1 − F(x′i β)] }
      = Σ_{i=1}^{n} ln[F(qi x′i β)]
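In Stata this log-likelihood is maximized automatically by the logit and probit commands.
A minimal sketch with hypothetical variable names is:

    * ML estimation of the binary logit and probit models
    logit  choice x1 x2
    probit choice x1 x2

    * The maximized log-likelihood is available afterwards in e(ll)
    display e(ll)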
Optimization
Gradient In order to optimize the log-likelihood function we follow the
usual procedure of evaluating the first-order condition. The gradient is given
Figure 2.5: Logit Log-Likelihoods for a Constant-Only Model
by

    ∂ℓ/∂β = Σ_{i=1}^{n} [ yi f(x′i β)/F(x′i β) − (1 − yi) f(x′i β)/(1 − F(x′i β)) ] xi
          = Σ_{i=1}^{n} qi [ f(qi x′i β)/F(qi x′i β) ] xi
where f (.) is the probability density function or PDF, i.e. φ(.) in the case
of probit and λ(.) in the case of logit.
The likelihood equation sets these derivatives equal to zero. Since the
derivatives are nonlinear functions, the solution to the likelihood equation
has to be found by way of numerical methods. Generally, this does not
pose complications since the logit and probit log-likelihood functions are
generally well-behaved. Specifically, the log-likelihood functions are single-peaked, which means that a unique maximum exists. Figure 2.5 illustrates
the logit log-likelihood functions for different sample sizes. The figure clearly
shows the single-peakedness of the log-likelihood function, as well as the
relationship between curvature and sample size.
Hessian The generic Hessian is given by17
    H = ∂²ℓ/∂β∂β′ = Σ_{i=1}^{n} { f′(qi x′i β) F(qi x′i β) − [f(qi x′i β)]² } / [F(qi x′i β)]² · xi x′i

Footnote 17: The derivative also contains a multiplier qi². It can be omitted, however,
since it is equal to 1 regardless of the value of yi.

In the case of logit, this formula reduces to

    H = − Σ_{i=1}^{n} Λ(.) [1 − Λ(.)] xi x′i

where Λ(.) = Λ(qi x′i β).18 In the case of probit, the Hessian is equal to

    H = − Σ_{i=1}^{n} [φ(.)/Φ(.)] [ qi x′i β + φ(.)/Φ(.) ] xi x′i

where φ(.) = φ(qi x′i β) and Φ(.) = Φ(qi x′i β).19
The Problem of Perfect Separation While it is generally easy to optimize the logit and probit log-likelihood functions, problems can occur when
there is perfect classification or separation. This happens when we can
perfectly predict a score of one or zero from one of the covariates (e.g. every
Republican voted for Bush in 2000). If we can perfectly predict a score of
one, then πi = 1. Likewise, if we can perfectly predict a score of zero, then
πi = 0. Such probabilities can be accommodated by the standard logistic
and the standard normal distribution functions only by letting |x!i β| → ∞,
which, for finitely bounded covariates, means that |β| → ∞ (the standard
errors will also tend to infinity). This will generally create numerical difficulties (e.g. potential singularity of the Hessian) in the algorithms that are
discussed in Chapter 9. Statistical packages typically resolve the problem
by dropping the offending predictor and observations. They then optimize
the log-likelihood for the remaining observations and predictors.20

Footnote 18: In the logit model, f(.) = Λ(.)[1 − Λ(.)], F(.) = Λ(.), and
f′(.) = Λ(.)[1 − Λ(.)][1 − 2Λ(.)]. Substitution into the formula for the generic Hessian
and algebraic manipulation then yields the logit Hessian.

Footnote 19: Here we have taken advantage of the result that dφ(z)/dz = −zφ(z). The
probit Hessian can be written more elegantly by introducing the following notation (see
Greene 2003):

    λi = −φ(x′i β)/[1 − Φ(x′i β)]  if yi = 0
          φ(x′i β)/Φ(x′i β)         if yi = 1

It is now easily verified that

    H = − Σ_i λi (λi + β′xi) xi x′i

Moreover, the gradient can now be written as Σ_i λi xi.
Logit versus Probit Estimates
Whether one optimizes a logit or probit log-likelihood does not matter much.
As Figure 2.1 shows, the standard normal and standard logistic distribution
functions look very similar, with the logistic distribution displaying slightly
heavier tails. Although the theoretical development of the probit model
preceded that of the logit model, for a while applied researchers preferred
logit because of its better computational tractability. However, the development of powerful algorithms for computing normal probabilities and the
availability of powerful computers have eliminated the historical advantage
of logit. Nowadays, it is just as easy to run a probit as a logit analysis and
the choice between them is much like flipping a coin.
This does not mean that logit and probit will give you exactly the
same parameter estimates. Because these two models fix the variance at
different values, the scale of the parameters is also different. Equating
the variances of the standard logistic and normal distributions, we find
βL ≈ (π/√3)βP ⇔ βL ≈ 1.8βP, where βL indicates the logit parameter
and βP the probit parameter.21 Amemiya (1981) suggests that βL ≈ 1.6βP
if other features of the standard logistic and normal distributions are also
synchronized. Following the same logic, Long (1997) finds that βL ≈ 1.7βP .
Thus, the scale of the logit and probit parameter estimates is different, although it is possible to approximate one set of estimates by applying a
proper transformation of the other set.
2.2.4 Interpretation
The sigmoidal relationship between the predictors and πi means that the
interpretation of logit and probit estimates is considerably more difficult
than that of LPM estimates. Unlike the LPM model, the effect of predictors
is not constant, which means that we have to rely on different interpretative
devices than the standard phrase “for a unit increase in xk , π is expected to
change by βk units.” Several interpretation methods exist that allow one to
make sense of the coefficients beyond their general meaning in terms of shift,
Footnote 20: This is similar to concentrating the perfectly predicted observations out of
the likelihood function (see Ruud 2000). For an alternative solution, which relies on a
penalized log-likelihood function, see Zorn (2005).

Footnote 21: As Long (1997) suggests, this is probably the best approximation to use if
you want to translate the coefficients reported in published research.
direction, and stretch. These methods do not just work for binary logit and
probit models but can be adapted for most of the models discussed in these
notes. It is thus useful to discuss these methods in some detail. While the
discussion gets a bit mathematical at times, the good news is that most of
the procedures described herein have been automated in Stata.
Interpretation in terms of the Latent Variable
One way of interpreting logit and probit coefficients is to focus on the underlying latent variable, yi∗ . This interpretation has the advantage of being
simple. Since yi∗ is a linear function of the predictors, we can say that yi∗ is
expected to change by βk units for a unit change in xik . On the downside, as
an unobserved variable, yi∗ does not have an intrinsically meaningful scale.
In fact, as we have seen, the scale units of yi∗ are dictated by our choice of
error variance on εi. Because of the arbitrariness of the scale, interpretations
in terms of the latent variable are rare.
Predicted Probabilities
Computation Most of the interpretation of logit and probit coefficients
occurs in terms of predicted probabilities or changes in those predicted probabilities, which will be discussed in the next two sub-sections. Predicted
probabilities are computed as
    π̂i = F(x′i β̂)                                               (2.11)
where F (.) was defined earlier. Equation (2.11) yields predicted probabilities
for each of the sample units, based on that unit’s values on the covariates. It
is also possible to obtain a predicted probability by simulating a set of values
on the covariates. For example, one might be interested in the probability
of turnout for someone who scores average on all of the covariates. When
one simulates covariates in this manner, one should always bear in mind
that there may be no sample unit that fits the simulated profile. There
are some risks associated with such out-of-sample predictions, namely that
the prediction may not be realistic and that the parameter estimates might
have come out differently if we really had included the unit that is average
on all covariates. Nevertheless, predicted probabilities are an invaluable
interpretative device, especially since they can be plotted to graphically
present the effects of covariates (see the example below).
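After fitting a logit (or probit) model in Stata, both kinds of predicted probabilities
can be obtained as sketched below; the covariate values in the last line are arbitrary
and the variable names hypothetical.

    * In-sample predicted probabilities for every observation
    predict phat, pr

    * Predicted probability for a simulated profile of covariate values
    display invlogit(_b[_cons] + _b[x1]*0.5 + _b[x2]*1)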
Confidence Intervals A drawback of (2.11) is that it produces a point estimate of the predicted probability and, as such, does not consider sampling
variability. Ideally, then, one would like to compute a confidence interval for
the predicted probabilities.22 Xu and Long (2005) describe a number of procedures for computing confidence intervals, including endpoint transformations and the delta method.23 The method of endpoint transformations
works well when we are computing a confidence interval for a monotonic
function of the linear prediction x′β, as is the case with the distribution
functions of the probit and logit models (see Figure 2.1). In this case, one
starts by computing a confidence interval for x′β̂, namely the symmetric
interval LB_{x′β} ≤ x′β ≤ UB_{x′β}, where LB and UB denote the lower and
upper bounds, respectively. One then transforms the endpoints or bounds
of this confidence interval to obtain a confidence interval for F(x′β). Thus,
F(LB_{x′β}) ≤ F(x′β) ≤ F(UB_{x′β}).
Some work is required in obtaining the lower and upper bounds for the
linear prediction. First, from the properties of ML estimators we know
that β̂ ∼ N(β, V[β̂]). Since x′β̂ is a linear function of β̂, we know that it,
too, is asymptotically normally distributed: x′β̂ ∼ N(x′β, V[x′β̂]).24 Here,
V[x′β̂] = x′V[β̂]x. Thus, the 100(1 − α)% confidence interval for x′β is
given by

    x′β̂ − z_{α/2} √(x′V[β̂]x) ≤ x′β ≤ x′β̂ + z_{α/2} √(x′V[β̂]x)

where 1 − α is the confidence level and −z_{α/2} is the value of a standard
normal variate such that Pr(z < −z_{α/2}) = α/2. Using the method of endpoint
transformations, the 100(1 − α)% confidence interval for F(x′β) is then given
by

    F(x′β̂ − z_{α/2} √(x′V[β̂]x)) ≤ F(x′β) ≤ F(x′β̂ + z_{α/2} √(x′V[β̂]x))
Computation of this interval requires that we specify a series of values for
the covariates.
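A minimal Stata sketch of the endpoint transformation after a logit fit is given below;
stdp requests the standard error of the linear prediction, and 1.96 corresponds to a
95% interval.

    * Linear prediction and its standard error for each observation
    predict xbhat, xb
    predict sehat, stdp

    * Transform the endpoints of the interval for x'b into probability bounds
    generate lb = invlogit(xbhat - 1.96*sehat)
    generate ub = invlogit(xbhat + 1.96*sehat)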
Footnote 22: See Herron (1999) on the need to take into consideration uncertainty in
post-estimation statistics in political analysis.

Footnote 23: A third procedure is bootstrapping (see Davison and Hinkley 1997). A
fourth one is to simulate posterior distributions.

Footnote 24: Here we rely on the Slutsky theorem, which states that if plim θ̂ = θ, then
plim g(θ̂) = g(θ). From the asymptotic properties of ML estimators, we know that
plim β̂ = β. Letting g(β) = x′β, it then follows that plim x′β̂ = x′β.

A different approach to obtaining confidence intervals, which is more
generally applicable, is to rely on the delta method, which uses a Taylor
expansion to linearize the function π̂ = F(x′β̂). Frequently, a first-order
expansion is used, which yields F(x′β) + (β̂ − β)f(x′β). For this expansion,
the asymptotic variance of π̂ is given by25

    V[F(x′β̂)] = [f(x′β̂)]² x′V[β̂]x                              (2.12)

(see Greene 2003).26 Relying again on the property that F(x′β̂) is normally
distributed, the 100(1 − α)% confidence interval for π = F(x′β) is then
given by

    F(x′β̂) − z_{α/2} s.e._{F(x′β̂)} ≤ F(x′β) ≤ F(x′β̂) + z_{α/2} s.e._{F(x′β̂)}

where

    s.e._{F(x′β̂)} = √( [f(x′β̂)]² x′V[β̂]x )
As usual, this interval depends on the specification of the values of the covariates. You should also keep in mind that this is an asymptotic confidence
interval. As such, it may not be wise to report the interval when the sample
size is small.
[Footnote 25] This result is based on the following general formulation of the delta method. If √n(zn − µ) →d N(0, σ²) and if g(zn) is a continuous function not involving n, then √n[g(zn) − g(µ)] →d N(0, [g′(µ)]²σ²).
[Footnote 26] Proof: Following the general result, the asymptotic variance of F(x′β̂) is given by [∂F(x′β̂)/∂β̂]′ V[β̂] [∂F(x′β̂)/∂β̂]. Using the chain rule, ∂F(x′β̂)/∂β̂ = [dF(ẑ)/dẑ][∂ẑ/∂β̂], where ẑ = x′β̂. The first term on the right-hand side of this derivative is equal to f(x′β̂), while the second term is equal to x. Hence [∂F(x′β̂)/∂β̂]′ V[β̂] [∂F(x′β̂)/∂β̂] = [f(x′β̂)x]′ V[β̂] [f(x′β̂)x] = [f(x′β̂)]² x′V[β̂]x.
Marginal Effects and Elasticities
Marginal Effects without Interactions Econometricians often interpret logit and probit results in terms of marginal effects, which are also
known as partial effects. In general terms, the marginal effect of a predictor is defined as the partial derivative of a prediction function with respect
to that predictor. As such, the marginal effect is equal to the slope of the
prediction function at a particular value of the predictor. In the case of
logit and probit, the prediction function is the predicted probability, πi , so
that the marginal effect is defined as ∂πi /∂xik .27 Using the chain rule of
differentiation, this can be computed as
∂πi
∂xik
∂F (x!i β)
=
∂xik
dF (x!i β) ∂x!i β
=
dx!i β ∂xik
= f (x!i β)βk
γk =
(2.13)
Here f (.) is the appropriate PDF. In mathematical terms, the marginal effect
is the slope of the probability curve relating πi to xik , while holding all other
predictors constant. This may be interpreted as the instantaneous change
in the predicted probability, i.e. the change in πi due to an infinitesimal
change in the value of xik .28 Note that the maximum marginal effects in
logit and probit models occur at πi = .5.29
[Footnote 27] It should be noted that marginal effects assume that xk is continuous. When it is a dummy variable, then taking the derivative of π with respect to the predictor will not work. Many statistical packages will in this case resort to reporting discrete change in predicted probabilities.
[Footnote 28] Econometricians sometimes treat f(x′iβ) as a scale parameter, so that the marginal effect is equal to a scale parameter times a predictor’s coefficient. Viewed in this light, marginal effects simply re-scale the original coefficients.
[Footnote 29] Proof: Let wi = x′iβ. The marginal effect with respect to wi is given by ∂πi/∂wi = f(wi). If we seek to maximize this effect, then we should take its first derivative, set it to 0, and solve for wi: ∂²πi/∂wi² = f′(wi) = 0. For the probit model, f′(wi) = −wi φ(wi). This is equal to 0 when wi = 0, which corresponds to πi = Φ(0) = .5. For the logit model, f′(wi) = [1 − 2Λ(wi)]Λ(wi)[1 − Λ(wi)], which is equal to −exp(wi)[exp(wi) − 1]/[1 + exp(wi)]³. This is equal to 0 when exp(wi) − 1 = 0, which happens when wi = 0. Again, this corresponds to πi = Λ(0) = .5.
The sign of the marginal effect is driven entirely by βk, since f(x′iβ) > 0
by virtue of the fact that f(.) is a PDF. The magnitude of the marginal effect
is driven by both βk and x′iβ, which includes xik.30 There are two main
approaches to setting the values of x:
1. Marginal Effects at the Mean: One approach is to set all of the
   predictors, including xk, at their mean values, producing so-called
   marginal effects at the mean or MEMs:

      γk^MEM = f(x̄′β)βk

   where x̄ is the vector of predictor means. This can be simplified
   considerably if the predictors have been centered about their means.
   In this case, the means of the centered predictors are all zero and
   f(x̄′β) = f(β0) (see Anderson and Newell 2003).
2. Average Marginal Effects: Another approach is to use the actual
   values of the sample units on the predictors, which produces separate
   marginal effect estimates for each unit. These estimates are then
   averaged to produce an average marginal effect or AME:

      γk^AME = (1/n) Σ(i=1 to n) f(x′iβ)βk
Although MEMs are commonly computed by statistical software packages,
they have an important drawback: there may not be a single sample unit
that is average on all of the predictors. Thus MEMs may represent imaginary
units, a problem that the AME avoids by relying on extant values on the
predictors.31

[Footnote 30] However, ratios of marginal effects do not depend on x′iβ. It is easy to see this. The marginal effect of xk is defined as f(x′iβ)βk; likewise, the marginal effect of xm is defined as f(x′iβ)βm. If we now take the ratio of these marginal effects, we are left with the ratio of the parameters:

   (∂πi/∂xik) / (∂πi/∂xim) = f(x′iβ)βk / [f(x′iβ)βm] = βk/βm

The ratio of the marginal effects provides a nice way to compare the magnitude of the effects of covariates.
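For the logit model the AME can also be computed by hand, because f(x′iβ) = Λ(x′iβ)[1 − Λ(x′iβ)]. The following sketch assumes hypothetical variables y, x1, and x2; the mean of the generated variable is the AME of x1:

   logit y x1 x2
   predict xbhat, xb
   generate me_x1 = invlogit(xbhat)*(1 - invlogit(xbhat))*_b[x1]   // unit-specific marginal effects
   summarize me_x1                                                 // the mean is the AME of x1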
Marginal Effects with Interactions Special care should be taken when
computing the marginal effects for models including polynomial and interaction terms. Consider, for example, the latent variable model yi* = β0 + β1 xi + β2 xi² + εi. When computing the marginal effect with respect to xi it should not be forgotten that this covariate appears in both the linear and quadratic terms. The correct marginal effect in this case is

   ∂πi/∂xi = f(.)(β1 + 2β2 xi)

where f(.) = f(β0 + β1 xi + β2 xi²). As another example, consider the latent
variable model yi* = β0 + β1 xi + β2 zi + β12 xi zi + εi. Proper computation of the
marginal effect of the interaction term xi zi now requires that we differentiate
with respect to both xi and zi. This yields

   ∂²πi/(∂xi ∂zi) = f(.)β12 + f′(.)(β1 + β12 zi)(β2 + β12 xi)

where f(.) = f(β0 + β1 xi + β2 zi + β12 xi zi) and f′(.) = f′(β0 + β1 xi +
β2 zi + β12 xi zi) is the first derivative of the PDF.32 Notice that this result
implies that there can be a marginal effect of the interaction even if β12 = 0.
This stands in stark contrast with linear regression analysis, where β12 = 0
implies the absence of an interaction. Moreover, the sign of the interaction
effect cannot be read directly from the sign of β12, as the other terms of the
marginal effect may run in an opposite direction.
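For the logit model these quantities can be computed by hand, since f(z) = Λ(z)[1 − Λ(z)] and f′(z) = Λ(z)[1 − Λ(z)][1 − 2Λ(z)]. The sketch below is illustrative only; y, x, and z are hypothetical variables and the interaction xz is generated before estimation:

   generate xz = x*z
   logit y x z xz
   predict xbhat, xb
   generate lam = invlogit(xbhat)             // Lambda(x'b)
   generate fz  = lam*(1 - lam)               // logistic PDF f(x'b)
   generate fpz = fz*(1 - 2*lam)              // derivative f'(x'b)
   generate me_xz = fz*_b[xz] + fpz*(_b[x] + _b[xz]*z)*(_b[z] + _b[xz]*x)
   summarize me_xz                            // unit-specific interaction effects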
Elasticities Elasticities are closely associated with marginal effects, but
offer a different interpretation of a predictor’s effect. Specifically, we can
define the elasticity, κ, of the kth predictor as

   κk = (∂πi/∂xik)(xik/πi) = ∂ln(πi)/∂ln(xik)    (2.14)

The elasticity is thus the product of the marginal effect of xk and the ratio of this predictor over the predicted probability.33 Elasticities have a
straightforward and useful interpretation: they give the percentage change
in π in response to a one percent change in the predictor. In light of this
interpretation, it makes sense to compute elasticities only for continuous
predictors.

[Footnote 31] The choice between MEM and AME is not without consequence, as these procedures may produce different estimates of the effects of predictors. Bartus (2005) approximates their relative difference as

   (γk^MEM − γk^AME)/γk^MEM ≈ .5 [f″(x̄′β)/f(x̄′β)] V[x′β]

[Footnote 32] For an extensive discussion of marginal effects of interaction terms in logit and probit models see Ai and Norton (2003) and Norton, Wang, and Ai (2004).
In computing elasticities, it is again essential to set the predictors to a
particular value. As with marginal effects, elasticities can be computed by
setting all of the predictors to their means or by using the actual values
of the predictors for each sample unit and then averaging the unit-specific
elasticities. A variation on the latter procedure is to weight each individual
elasticity by the predicted probability before aggregation (see Hensher and
Johnson 1981).
Confidence Intervals In practice, marginal effects and elasticities are
estimated quantities. As such it is useful to compute confidence intervals,
which can be done using the delta method. For the estimated marginal
effect γ̂k , the estimated asymptotic variance is given by
   V[γ̂k] = (∂γ̂k/∂β̂k)² V[β̂k]

where

   ∂γ̂k/∂β̂k = f′(x′iβ̂)β̂k xk + f(x′iβ̂)

with f′(x′iβ̂) = Λ(x′iβ̂)[1 − Λ(x′iβ̂)][1 − 2Λ(x′iβ̂)] in the case of logit and
f′(x′iβ̂) = −(x′iβ̂)φ(x′iβ̂) in the case of probit. Taking advantage of the fact that
a function of a maximum likelihood estimator is itself asymptotically normally distributed,
the 100(1 − α)% confidence interval for the marginal effect of a predictor xk
is given by

   γ̂k − zα/2 s.e.(γ̂k) ≤ γk ≤ γ̂k + zα/2 s.e.(γ̂k)

where s.e.(γ̂k) is the square root of V[γ̂k]. Application of this formula requires
that we substitute the appropriate PDFs and their derivatives depending on
whether we are computing marginal effects for a logit or a probit model.

For elasticities, the estimated asymptotic variance is

   V[κ̂k] = (∂κ̂k/∂β̂k)² V[β̂k]

where

   ∂κ̂k/∂β̂k = { [f′(.)xk²β̂k + f(.)xk] F(.) − [f(.)]²xk²β̂k } / [F(.)]²

and f′(.), f(.), and F(.) are all defined in terms of the parameter estimates.
Confidence intervals can be obtained analogously to the marginal effects.

[Footnote 33] More generally, elasticities are defined as (∂y/∂x)(x/y) = (∂y/y)/(∂x/x), where y is some prediction function.
Discrete Change in Predicted Probabilities
Discrete Change versus Marginal Effects A third method of interpretation is to compute the discrete change in the predicted probabilities.
Discrete change in the predicted probabilities is similar but not identical
to marginal effects. Marginal effects give the instantaneous change in the
predicted probability, i.e. the change in π for a very small change in a predictor. Discrete change considers the change in π as a function of a larger
change in a predictor, i.e. as the predictor changes from one discrete value
to another. Another way to put this is that a discrete change measure considers the change in π when a predictor changes by δ units. The
marginal effect does the same but now δ → 0. The difference between the
two measures is illustrated in Figure 2.6. The difference between the two
approaches makes marginal effects more suitable for continuous predictors,
while discrete change is better for categorical predictors.
Discrete Change without Interactions The discrete change in predicted probabilities is defined as the change in π in response to a change
of δ units in the predictor of interest, while holding all other covariates, x,
constant:

   Δπ/Δxk = Pr(y = 1|x, xk + δ) − Pr(y = 1|x, xk)    (2.15)
[Figure 2.6: Discrete Change versus Marginal Effects. The figure plots the predicted probability against a predictor x and contrasts the discrete change over an interval of x with the marginal effect at a single point.]
It is conventional to set all of the remaining covariates to their means, so that
the discrete change is computed as Pr(y = 1|x̄, xk + δ) − Pr(y = 1|x̄, xk ).
However, other choices are surely possible and may be more appropriate.
For example, when some of the covariates are dummy variables, it may be
useful to set their values equal to the mode, while ordinal scales can be
set equal to their medians. You are free to set the reference values of the
remaining covariates as long as you use the same values in both terms on the
left-hand side of the discrete change formula and as long as you document
your choice of reference values.
Just as you have a choice in selecting the values of the other covariates,
you have a choice in defining δ. Common choices for the change in xk include
the following.
1. A unit change relative to x̄k , so that δ = 1 and ∆π/∆xk = Pr(y =
1|x̄, x̄k + 1) − Pr(y = 1|x̄, x̄k ). This is tantamount to setting all covariates to their mean values initially and then increasing only xk by
one unit.
2. A centered unit change relative to x̄k , so that xk moves from x̄k −.5
to x̄k + .5, as proposed by Kaufman (1996). The discrete change in
predicted probability is then ∆π/∆xk = Pr(y = 1|x̄, x̄k + .5) − Pr(y =
1|x̄, x̄k − .5).
3. A standard deviation change relative to x̄k , so that xk moves from
x̄k − .5sk to x̄k + .5sk , where sk is the sample standard deviation
of xk . The discrete change is then defined as ∆π/∆xk = Pr(y =
1|x̄, x̄k + .5sk ) − Pr(y = 1|x̄, x̄k − .5sk ).
4. It is also possible to compute the maximum possible change. Here
we let xk move from the minimum to the maximum in the sample,
so that δ is equal to the range and the discrete change is defined as
∆π/∆xk = Pr[y = 1|x̄, max(xk )] − Pr[y = 1|x̄, min(xk )]. This manner
of computing discrete change is quite common in political analysis,
although it is sometimes criticized for creating an exaggerated view
of the effects of predictors because very few cases may fall at the
extremes.
5. For dummy predictors, computing the maximum possible change is
tantamount to computing a change from 0 to 1. Thus, ∆π/∆xk =
Pr(y = 1|x̄, xk = 1) − Pr(y = 1|x̄, xk = 0). This computation is not
subject to the same criticism as maximum possible change. On the
contrary, this is actually one of the best ways to characterize the effect
of dummy predictors.
6. Percentile change provides another method of operationalizing discrete change. Here, we let xk move from the pth percentile to the
100 − pth percentile, e.g. from the 10th to the 90th percentile. Letting xk,p and xk,100−p denote the values of xk that correspond to the
pth and 100 − pth percentiles, respectively, discrete change is defined
as ∆π/∆xk = Pr(y = 1|x̄, xk,100−p ) − Pr(y = 1|x̄, xk,p ). Especially
when p is not chosen too small, the percentile change method can help
to overcome the criticism of maximum possible change. After all, we
would know that p% of the cases would score lower and p% would
score higher than the values that are used to demarcate the discrete
change in xk .
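Each of these choices can be computed directly from the estimates. As an illustration of the maximum possible change (choice 4 above), the sketch below assumes a logit model with hypothetical predictors x1 and x2 and holds x2 at its mean:

   logit y x1 x2
   quietly summarize x1
   scalar x1min = r(min)
   scalar x1max = r(max)
   quietly summarize x2
   scalar x2bar = r(mean)
   scalar p_hi = invlogit(_b[_cons] + _b[x1]*x1max + _b[x2]*x2bar)
   scalar p_lo = invlogit(_b[_cons] + _b[x1]*x1min + _b[x2]*x2bar)
   display "discrete change (min to max) = " p_hi - p_lo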
Discrete Change with Interactions As was the case with marginal effects, special care should be taken in correctly specifying the discrete change
of polynomial and interaction terms. Consider again the latent variable
model yi* = β0 + β1 xi + β2 xi² + εi. If we let x move from x̄ − .5 to x̄ + .5,
then the discrete change is given by F[β0 + β1(x̄ + .5) + β2(x̄ + .5)²] − F[β0 +
β1(x̄ − .5) + β2(x̄ − .5)²]. The change thus has to be introduced in both the
linear and quadratic terms.
To see what happens with the discrete change on interactions, let us
consider two dummy predictors, x and z, in the latent variable model yi* =
β0 + β1 xi + β2 zi + β12 xi zi + εi. The discrete change term for the interaction
is

   Δ²π/(Δx Δz) = F(β0 + β1 + β2 + β12) − F(β0 + β1) − F(β0 + β2) + F(β0)
where the key is again to consider changes in x and z simultaneously.34 We
see that the discrete change can be non-zero even if β12 = 0 and that the
overall direction of the change depends on more than the sign of β12 alone.
This is reminiscent of the discussion of the marginal effects of interactions.
Confidence Intervals Since discrete changes depend on parameter estimates they are subject to sampling fluctuation. It is useful to take this into
consideration by computing confidence intervals for the discrete changes.
This requires that we compute the asymptotic variance of ∆π̂/∆xk , which
can be done using the delta method (see Xu and Long 2005).
Let xa be one set of values for the predictors and let xb be another
set of values. For example, xa could set xk to the maximum and all other
predictors to the mean, while xb could set xk to the minimum while still
keeping all other predictors at the mean. We are interested in the variance of
F(xa β̂) − F(xb β̂). To this effect, we begin by defining the partial derivative
of this quantity with respect to β̂′:35

   ∂[F(xa β̂) − F(xb β̂)]/∂β̂′ = f(xa β̂)xa − f(xb β̂)xb = q
[Footnote 34] Proof: Let us begin by defining Δπ/Δx. This is equal to

   Δπ/Δx = Pr(y = 1|x = 1, z) − Pr(y = 1|x = 0, z)
         = F(β0 + β1×1 + β2 z + β12×1×z) − F(β0 + β1×0 + β2 z + β12×0×z)
         = F(β0 + β1 + β2 z + β12 z) − F(β0 + β2 z)

Now we take the discrete change in Δπ/Δx with respect to Δz:

   Δ²π/(Δx Δz) = Δ[F(β0 + β1 + β2 z + β12 z) − F(β0 + β2 z)]/Δz
               = F(β0 + β1 + β2×1 + β12×1) − F(β0 + β2×1) − [F(β0 + β1 + β2×0 + β12×0) − F(β0 + β2×0)]
               = F(β0 + β1 + β2 + β12) − F(β0 + β2) − [F(β0 + β1) − F(β0)]

By rearranging terms we obtain the desired formula.
[Footnote 35] Proof: The proof relies on the property that the derivative of a difference of two functions is equal to the difference between the derivatives. Thus,

   ∂[F(xa β̂) − F(xb β̂)]/∂β̂′ = ∂F(xa β̂)/∂β̂′ − ∂F(xb β̂)/∂β̂′
                             = [dF(ẑa)/dẑa][∂ẑa/∂β̂′] − [dF(ẑb)/dẑb][∂ẑb/∂β̂′]
                             = f(ẑa)xa − f(ẑb)xb

where ẑa = xa β̂ and ẑb = xb β̂.
The variance of F(xa β̂) − F(xb β̂) is given by

   V[F(xa β̂) − F(xb β̂)] = q V[β̂] q′
                        = f(xa β̂)² xa V[β̂]xa′ + f(xb β̂)² xb V[β̂]xb′ − 2 f(xa β̂)f(xb β̂) xa V[β̂]xb′

Since F(xa β̂) − F(xb β̂) is asymptotically normally distributed, a confidence interval for the
discrete change is given by

   Δ̂ − zα/2 s.e.(Δ̂)  ≤  Δπ/Δxk  ≤  Δ̂ + zα/2 s.e.(Δ̂)

where Δ̂ = F(xa β̂) − F(xb β̂) and s.e.(Δ̂) = √(V[F(xa β̂) − F(xb β̂)]). Obviously, this confidence interval depends on the specific values placed in xa
and xb.
The Odds Ratio
For the logit model, one final method of interpretation is available: the odds
ratio. In general, the odds are defined as the ratio of the probabilities of
scoring 1 and 0 on the response variable, conditional on a set of covariates.
Thus,

   Ω(x) = Pr(y = 1|x)/Pr(y = 0|x) = Pr(y = 1|x)/[1 − Pr(y = 1|x)]
In the logit model, it is easily demonstrated that36

   Ω(x) = exp(x′β)
[Footnote 36] Proof: In the logit model, Pr(y = 1|x) = exp(x′β)/[1 + exp(x′β)]. Moreover,

   1 − Pr(y = 1|x) = 1 − exp(x′β)/[1 + exp(x′β)] = [1 + exp(x′β) − exp(x′β)]/[1 + exp(x′β)] = 1/[1 + exp(x′β)]

Consequently,

   Ω(x) = {exp(x′β)/[1 + exp(x′β)]} / {1/[1 + exp(x′β)]} = exp(x′β)
As the name suggests, the odds ratio is the ratio of two odds. These odds
come about by changing the value of one of the covariates. To fix ideas,
imagine that the value of the kth covariate is changed from
xk to xk + δ. Writing the latent regression model as yi* = βk xk + x′iβ + εi,
where xi contains all of the remaining predictors plus a constant term, the
odds at xk are

   Ω(xk, xi) = exp(βk xk + x′iβ) = exp(βk xk) exp(x′iβ)

The odds at xk + δ are

   Ω(xk + δ, xi) = exp[βk(xk + δ) + x′iβ] = exp(βk xk) exp(βk δ) exp(x′iβ)

The odds ratio is then defined as

   Ω(xk + δ, xi)/Ω(xk, xi) = [exp(βk xk) exp(βk δ) exp(x′iβ)]/[exp(βk xk) exp(x′iβ)] = exp(βk δ)
The interpretation of the odds ratio is that, for a change of δ in xk , the odds
are expected to change by a factor of exp(βk δ).
Compared to the other interpretative devices that we have discussed, the
nice aspect of the odds ratio is that the other covariates are irrelevant. No
matter how we set those covariates, the odds ratio for xk is always equal
to exp(βk δ). Moreover, the odds ratio can be easily computed from the
coefficients, especially when δ = 1.37 On the downside, some people have a
difficult time wrapping their brain around the concept of odds.
Two transformations of the odds ratio are useful. First, the log-odds ratio or logit is simply the natural logarithm of the odds ratio: ln[exp(βk δ)] =
βk δ. As this equation shows, the log-odds ratio is a linear function and, as
such, interpretation is straightforward: for an increase of δ units in xk the
log-odds are expected to increase by βk δ, where the log-odds are defined as
ln[π/(1−π)], where π is the probability of scoring 1 on the response variable.
A second useful transformation is to compute the percentage change in
the odds:

   100 × [Ω(xk + δ, xi) − Ω(xk, xi)]/Ω(xk, xi) = 100 × [exp(βk δ) − 1]

This can be interpreted as the percentage change in the odds for a δ unit
change in xk.38
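Both transformations are easily recovered from a fitted logit coefficient. As a quick illustration with δ = 1, using the partisanship coefficient reported later in Table 2.3 (β̂ ≈ −.997):

   display exp(-.997)             // factor change in the odds, approximately .37
   display 100*(exp(-.997) - 1)   // percentage change in the odds, approximately -63 percent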
[Footnote 37] In this case, the odds ratio becomes exp(βk). Long (1997) refers to this as the factor change.
[Footnote 38] We obtain this formula by recognizing that Ω(xk + δ, xi) − Ω(xk, xi) = exp(βk δ)exp(βk xk)exp(x′iβ) − exp(βk xk)exp(x′iβ) = exp(βk xk)exp(x′iβ)[exp(βk δ) − 1]. Division of this expression by Ω(xk, xi) = exp(βk xk)exp(x′iβ) yields exp(βk δ) − 1. All that is left to do then is to multiply by 100.
2.2.5 Hypothesis Testing
Testing Individual Parameters
In addition to interpreting the coefficients, it is useful to ascertain if they
are significantly different from zero. This can be done by considering the
ratio of the parameter estimates over their standard errors and referring the
resulting test statistic to a standard normal distribution to obtain a p-value
(see Chapter 9.6). If the p-value falls below a pre-determined Type-I error
rate (e.g. .05) then the predictor is statistically significant; otherwise it is
not.
Testing the Entire Set of Predictors
When we posit a particular set of predictors, xi, it is because we believe
them to be important. We thus want to rule out the null hypothesis that
all of the coefficients associated with the predictors are simultaneously zero.
Thus, the relevant null hypothesis is

   H0: β1 = β2 = · · · = βK = 0
where K is the total number of predictors in the model. Most statistical
software packages, including Stata, test this hypothesis by using a likelihood
ratio test (see Chapter 9.6). The relevant constrained or empty model contains a constant only, i.e. πi = F (β0 ). The log-likelihood for this model is
given by

   ℓ0 = N0 ln(N0/N) + N1 ln(N1/N)    (2.16)
where N0 is the number of cases that score 0 on the dependent variable
and N1 is the number of cases that score 1.39 The likelihood ratio test
compares this log-likelihood to that of the estimated model (ℓ1), computing
the following test statistic:

   LR = 2(ℓ1 − ℓ0) ~ χ²(K)

If the p-value associated with this test statistic falls below a preset Type-I
error rate, then H0 is rejected and we conclude that at least a subset of the
predictors have non-zero effects.

[Footnote 39] Proof: In the constant-only model, β0 is adjusted so as to accommodate the proportions of 0s and 1s in the data. The proportion of 1s is the estimate of π, while the proportion of 0s is the estimate of 1 − π. Letting N1 and N0 denote the numbers of cases that score 1 and 0, respectively, the estimate of π is equal to N1/N, while the estimate of 1 − π is equal to N0/N. Since N1/N occurs N1 times in the data, the contribution to the log-likelihood function is N1 × ln(N1/N). Similarly, since N0/N occurs N0 times in the data, the contribution to ℓ is equal to N0 × ln(N0/N). Adding these two components gives the formula for ℓ0.
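Stata reports this likelihood ratio test with the standard model output, but it can also be reproduced by comparing the fitted model with the constant-only model on the same estimation sample. A sketch with hypothetical variables:

   logit y x1 x2 x3
   estimates store full
   logit y if e(sample)       // constant-only (empty) model on the same sample
   estimates store empty
   lrtest full empty          // LR = 2(l1 - l0) with K degrees of freedom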
Testing Other Hypotheses
Sometimes we may be interested in testing whether a subset of the predictors
is inconsequential. This is, for example, useful when the subset embodies
the core of our theoretical model. We can test such hypotheses using a
likelihood ratio test, but this requires that we estimate both a model that
includes and a model that excludes the subset of predictors. A less involved
test procedure is the Wald test (see Chapter 9.6); it will be illustrated in
the example below.
2.2.6 Model Fit
When we posit a binary choice model we want to know how well it performs.
How well does it account for the observed choices in the data? This question
concerns model fit and a number of different approaches are available for its
assessment.
Correct Classification
One measure of how well logit and probit models fit the data is to consider
the number of correct predictions or classifications that they generate. The
approach here is to predict an individual’s score on y based on his or her
predicted probability. A common cutoff for the predicted probability is .5
so that
ŷ = 1 if π̂ > .5
ŷ = 0 if π̂ ≤ .5
where ŷ is the predicted score on the response variable. We can then compare
the predictions with the actual scores on the response variable. Whenever
there is agreement between ŷ and y, there is a correct prediction. The higher
the percentage of correctly classified observations, the better the model fit
is.
Table 2.2: Correct Prediction in Logit and Probit Models

             y = 0    y = 1    Total
   ŷ = 0     N00      N01      N0.
   ŷ = 1     N10      N11      N1.
   Total     N.0      N.1      N

   Notes: N0. and N1. are the numbers of cases that are predicted to score 0 and 1, respectively. N.0 and N.1 are the numbers of cases that actually score 0 and 1, respectively. Nij denotes cases that are predicted to score i = 0, 1 and that actually score j = 0, 1.
The mechanics of computing the percentage of correct predictions can
be illustrated using Table 2.2. In this table, the column variable indicates
the actual value of the response variable, while the row variable indicates
the predicted value. The quantities N00 and N11 represent the number of
cases where the predictions correspond to the actual value of the response
variable. These are, then, the correctly classified cases. The percentage of
correctly classified (CC) cases is given by
   CC = 100 × (N00 + N11)/N
In judging CC, it is important to consider how good our predictions
would have been in the absence of the model. If we were to ignore all
knowledge about the covariates, then the best prediction of the response
variable would be the mode. (This is the best prediction in the sense of
minimizing the number of misclassified cases.) If CC is much higher than
the percentage of cases in the modal category, then our model clearly seems
to add to our ability to predict the response variable. On the other hand,
if CC is in the vicinity of the percentage of cases in the modal category,
then we should question how much the model adds, even when CC seems
reasonably high.
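The classification table (Table 2.2) and the CC statistic can also be produced by hand after fitting the model. A sketch, assuming a fitted logit or probit of a hypothetical response y:

   predict phat, pr
   generate yhat = (phat > .5) if !missing(phat)   // predicted score using the .5 cutoff
   tabulate yhat y                                 // cross-classify predictions and actual scores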
Pseudo-R2
The popularity of the coefficient of determination, or R², in linear regression
analysis has caused statisticians to define similar measures in the context
of logit and probit models. These pseudo-R2 statistics mimic the R2 but,
as their name suggests, they do not exactly behave like the coefficient of
determination. A large number of pseudo-R2 statistics exist, but here we
shall focus only on the most popular and/or best-performing statistics.40
McFadden McFadden’s (1973) pseudo-R², also known as the likelihood
ratio index, is one of the most widely used measures of fit and is computed
by default in Stata. The measure evaluates the ratio of the log-likelihood
functions for the substantive model (ℓ1) and the empty model (ℓ0). The
log-likelihood of the empty model can be seen as a general equivalent of the
total sum of squares in regression analysis, while the log-likelihood of the
substantive model is a general equivalent of the residual sum of squares.
In light of these interpretations, McFadden’s R² follows directly from the
formula for the coefficient of determination in regression analysis:41

   R²McF = 1 − ℓ1/ℓ0

In theory, R²McF ranges between 0 and 1, just like the regression R². The
lower bound arises if the substantive model fits the data no better than the
empty model (ℓ1 = ℓ0). The upper bound arises if the substantive model fits
the data perfectly, so that ℓ1 = 0. In the case of logit and probit analysis,
this upper bound will never be reached.42 Thus, the practical upper bound
of R²McF can be quite far removed from 1. McFadden (1979) has proposed
that .2 ≤ R²McF ≤ .4 signifies an excellent fit of the model.
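After logit or probit, Stata stores both log-likelihoods, so the measure can be reproduced by hand; in the sketch below e(ll) corresponds to ℓ1 and e(ll_0) to ℓ0, and the covariates are hypothetical:

   logit y x1 x2
   display "McFadden R-squared = " 1 - e(ll)/e(ll_0)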
[Footnote 40] Aside from the measures discussed here, pseudo-R² statistics have been proposed by Efron (1978), Goldberger (1973), Lave (1970), Morrison (1972), and Neter and Maynes (1970). A review of these measures can be found in Veall and Zimmermann (1996). Evaluation studies of the measures include Hagle and Mitchell (1992), Langer (2000), Veall and Zimmermann (1992, 1993, 1994), and Windmeijer (1995).
[Footnote 41] One way to derive this formula is to define ℓmax as the maximum attainable log-likelihood. We can then decompose the quantity ℓmax − ℓ0 as (ℓmax − ℓ1) + (ℓ1 − ℓ0), where ℓ1 − ℓ0 is the explained portion and ℓmax − ℓ1 is the unexplained portion. Taking the explained portion over the total portion, ℓmax − ℓ0, we get (ℓ1 − ℓ0)/(ℓmax − ℓ0). If we assume that the maximum feasible value of the log-likelihood is 0, which is theoretically true in the discrete case, then this expression reduces to R²McF.
[Footnote 42] The only time that ℓ1 = 0 is when the model is saturated, i.e. we have as many parameters as observations so that we can perfectly account for each data point. This is never the case with logit and probit models.
McKelvey and Zavoina The pseudo-R² measure proposed by McKelvey
and Zavoina (1975) is based on the latent response variable y*. The measure
takes as the explained variance the variance in the predicted values ŷ*. It
then compares this to the total variation in y*, just as one would do in a
linear regression R². Thus,

   R²M&Z = V̂[ŷ*]/V̂[y*] = V̂[ŷ*]/(V̂[ŷ*] + V[ε])

Here V̂[ŷ*] = β̂′V[x]β̂. By assumption, V[ε] = 1 in the probit model and
V[ε] = π²/3 in the logit model. The theoretical range is from 0 to 1. The
upper limit of 1 is reached when the predictions of y* are perfect. In this
case, ŷ* = y* and V[ŷ*] = V[y*]. The lower limit of 0 arises when all of
the regression coefficients are zero with the exception of β̂0. In this case
V[ŷ*] = β̂0² × σ²00 = β̂0² × 0 = 0, where σ²00 is the variation in the constant,
which is zero.
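A rough way to obtain this measure by hand is to use the sample variance of the linear predictions as V̂[ŷ*]. A sketch for the logit model (V[ε] = π²/3), with hypothetical covariates:

   logit y x1 x2
   predict ystar, xb                                     // predictions of the latent variable
   quietly summarize ystar
   display "M&Z R-squared = " r(Var)/(r(Var) + _pi^2/3)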
Aldrich and Nelson Aldrich and Nelson (1984) proposed a pseudo-R²
statistic that is based on the likelihood ratio test statistic. The measure is
given by

   R²A&N = LR/(LR + n) = 2(ℓ1 − ℓ0)/[2(ℓ1 − ℓ0) + n]

where LR is the likelihood ratio test statistic that we discussed in the section
on hypothesis testing. The lower bound on R²A&N is 0, which occurs when
ℓ1 = ℓ0, i.e. when the covariates do not contribute to the fit of the model.
The upper bound of R²A&N is equal to −2ℓ0/(−2ℓ0 + n), which arises when
ℓ1 = 0. As we discussed in the context of R²McF, this is a theoretical upper
bound only, because in practice no logit or probit model will fit the data
perfectly.
Due to the nature of its upper bound, which is not equal to 1, R²A&N
does not perfectly mimic the regression R². Veall and Zimmermann
(1992) have proposed a simple correction to rectify this problem:

   R²V&Z = [2(ℓ1 − ℓ0)/(2(ℓ1 − ℓ0) + n)] / [−2ℓ0/(−2ℓ0 + n)]
         = [2(ℓ1 − ℓ0)/(2(ℓ1 − ℓ0) + n)] × [(−2ℓ0 + n)/(−2ℓ0)]

This measure takes on values between 0 and 1, although this is again just
the theoretical range.
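Both measures can be computed from results stored after estimation; in the sketch below e(chi2) is the model LR statistic, e(N) the sample size, and e(ll_0) the empty-model log-likelihood, with hypothetical covariates:

   logit y x1 x2
   scalar r2_an = e(chi2)/(e(chi2) + e(N))
   scalar r2_vz = r2_an*(-2*e(ll_0) + e(N))/(-2*e(ll_0))
   display "Aldrich-Nelson = " r2_an "   Veall-Zimmermann = " r2_vz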
Table 2.3: Logit and Probit Models of Vote Choice in the U.S. in 2000

                                   Logit                  Probit
   Predictor                  Estimate    S.E.       Estimate    S.E.
   Partisanship                −.997**    .350        −.581**    .203
   Male                         .229     1.101         .153      .638
   White                       −.236     1.462        −.117      .856
   Born Again Christian       −2.835*    1.286       −1.594*     .720
   Age                         −.045      .035        −.025      .020
   Education                    .197      .483         .116      .279
   Household Income            −.232      .148        −.136      .088
   Trait Differential          5.730**   1.967        3.273**   1.093
   Issue Differential          1.667     1.809         .980     1.060
   Constant                    6.446*    3.279        3.665*    1.859
   ℓ                         −15.067                −14.801
   LR χ²                     166.470                167.000
   p                            .000                   .000
   R²McF                        .847                   .849
   R²M&Z                        .964                   .967
   R²A&N                        .540                   .540
   R²V&Z                        .930                   .930
   CC                         93.660                 93.660

   Notes: N = 142. ** p < .01, * p < .05 (two-tailed). In the logit model, 1 failure and 2 successes are completely determined. In the probit model, 26 failures and 20 successes are completely determined.
2.2.7 Example
Estimation Results
Let us return to the voting behavior example that was discussed in the
context of the LPM. Table 2.3 shows the logit and probit results from this
example. These results were obtained by running the following commands:
   logit choicevar [varlist] [if] [, level(#)]
   probit choicevar [varlist] [if] [, level(#)]
On the average, the logit coefficients are about 1.74 times larger than
the probit coefficients, as should be expected given the different scaling of
the parameters. Other than this, the models produce similar results, as
should also be expected. Thus, all predictors have effects of the same sign
in the logit and probit specifications and the same predictors are statistically
significant.
Hypothesis Testing
Individual Parameters Only three predictors attain statistical significance, namely partisanship, being a born-again Christian, and the trait
differential. It should be noted, however, that the trait and issue differential
measures are highly correlated (at .795 in the estimation sample), creating
a collinearity problem. Logit and probit models are not immune to such
problems and this can result in inflated standard errors.
Testing the Entire Set of Predictors An important aspect of model
evaluation is determining if one can rule out the null hypothesis that none
of the predictors exert a non-zero effect. After all, failure to rule out this
hypothesis means that one’s substantive model does not contribute to explaining the empirical phenomenon at hand. The usual approach is to test
the null hypothesis using a likelihood ratio test. In both the logit and probit
analysis, this test results in a sound rejection of the null hypothesis. In the
logit model, LR = 166.47; when referred to a χ2 -distribution with 9 degrees
of freedom we obtain p = .000, which is well below any commonly accepted
Type-I error rate. In the probit model, LR = 167, which yields the same
conclusion.
Testing Other Hypotheses Let us consider the issue and trait differentials, which are the only predictors in the model that pick up on the
candidates and issues of the 2000 campaign. Imagine that we want to test
the null hypothesis that these predictors do not matter. To do so we can
perform a Wald test on the restriction that the coefficients for the issue and
trait differentials are zero in the population. In Stata, this is done via the
test command:
test issue trait
where issue is the name of the issue differential and trait that of the
trait differential. When we perform this test for the logit model we obtain
W = 9.72, p = .008. Thus, we can reject the null hypothesis that neither
issues nor candidate traits mattered in the 2000 U.S. presidential elections.
As another example of a test, consider the null hypothesis that gender
and race exerted the same effect in the 2000 elections. We can test this
using the following syntax:
test male=white
For the logit model, this yields W = .07. With one degree of freedom—we
are imposing one restriction—the p-value is .787, meaning that we fail to
reject the null hypothesis.
Model Fit
Correct Classification The percentage of correctly classified cases is very
high. Considering that one would predict about 50 percent of the cases
correctly without covariate information, a CC value of nearly 94 percent can
be considered outstanding. In this sense, the model fits the data extremely
well. In Stata, the percentage of correctly classified cases can be obtained by
running the following command after estimating the logit or probit model:
estat classification
Pseudo-R² All of the pseudo-R² measures also indicate excellent fit. By
default, Stata reports R²McF. R²M&Z can be obtained by running the

   fitstat

command, which is part of the SPost package of Freese and Long (2006).43
R²A&N and R²V&Z have to be computed by hand. This can be done using
the likelihood ratio test statistic, which is part of the standard output, and
ℓ0, which is part of the output produced by fitstat.
2.2.8 Interpretation
Considering the results, we can say impressionistically that the probability
of a Gore vote decreases as partisanship moves toward the Republican end
of the scale and as people indicate that they are born-again Christians. The
likelihood of a Gore vote increases as trait judgments of him are more favorable than those for Bush. However, beyond these rather vague statements,
nothing can be said about the effects based on the parameter estimates
[Footnote 43] This package is not part of the standard Stata installation and will have to be downloaded.
Table 2.4: Marginal Effect of the Trait Differential on Vote Choice

   (a) Marginal Effect at the Mean
   Method    Effect    S.E.    z       p      Elasticity
   Logit     1.234     .438    2.820   .000   .083
   Probit    1.186     .392    3.030   .000   .076

   (b) Average Marginal Effect
   Method    Effect    S.E.    z       p      Elasticity
   Logit      .189     .041    4.650   .000   NA
   Probit     .186     .039    4.800   .000   NA

   Notes: Marginal effects at the mean were computed using Stata’s mfx command. Average marginal effects were computed using the margeff command.
alone. To provide further interpretation of the results we have to rely on
the interpretation methods described earlier.
Marginal Effects and Elasticities To obtain marginal effects at the
mean in Stata, one should issue the
   mfx [compute] [, eyex]
command. Issuing mfx by itself will compute marginal effects at the mean
for continuous predictors and discrete changes in the predicted probability
for dummy predictors. Adding the eyex option produces elasticities. The
top panel of Table 2.4 shows the marginal effect at the mean of the trait
differential, including standard errors (computed via the delta method), test
statistics, and elasticities. We observe a statistically significant marginal
positive effect for the trait differential. The size of this effect is considerable:
the instantaneous change in predicted probability is 1.234 in the logit model
and 1.186 in the probit model. Looking at the elasticities, we observe that a
1% increase in the trait differential, which corresponds to .06 units, produces
a .083% increase in the probability of voting for Gore in the logit model and
a .076% increase in the probit model.
Average marginal effects can be obtained by running Tamás Bartus’s
(2005) margeff8 command. The syntax for this command is

   margeff8 [compute] [, dummies(varlist)]
The second panel in Table 2.4 shows these marginal effects, which are quite
a bit different from the marginal effects at the mean but still statistically
significant and positive.
Predicted Probabilities and Discrete Change Another way of interpreting the results is by considering predicted probabilities. These can be
obtained using a number of the commands in the Stata add-on SPost (see
Long and Freese 2006). First, it is possible to obtain a table of predicted
probabilities using the prtab command:
   prtab rowvar [colvar [supercolvar]] [, by(superrowvar) rest(stat)]
The table can display predicted probabilities based on up to four covariates; the minimum is to specify one covariate (the rowvar ). The option
rest(stat) allows one to specify values for the remaining covariates. For
example, in our analysis we might be interested in predicted probabilities
by partisanship and born-again Christianity, while holding everything else
constant. The resulting predicted probabilities are shown in Table 2.5 for
the probit model. The table reveals a much faster drop-off in the likelihood
of voting for Gore among born-again Christians than among other respondents. While strong Democrats have a better than .5 predicted probability
of voting for Gore, regardless of their religious beliefs, weak Democrats who
are born-again Christians are already more likely to vote for Bush than for
Gore. This suggests a rather powerful effect of born-again Christianity.
Another useful command in the SPost suite is the prgen command, which
allows one to compute and graph predicted probabilities.44 The relationship
between partisanship and the probability of voting for Gore, stratified by
religious beliefs, can be plotted using the following command sequence:
prgen pid, from(0) to(6) gen(pidno) x(bornagain=0)
rest(mean) n(7)
prgen pid, from(0) to(6) gen(pidba) x(bornagain=1)
rest(mean) n(7)
label var pidnop1 "Not Born-Again"
label var pidbap1 "Born-Again"
[Footnote 44] In logit analysis, predicted probabilities can also be graphed using Mitchell and Chen’s (2005) tools for visualizing binary logit models (vibl).
Table 2.5: Predicted Probability of Voting for Gore by Partisanship and Religious Beliefs

                             Born-Again Christian?
   PID                         No        Yes
   Strong Democrat            .970      .615
   Weak Democrat              .904      .386
   Leaning Democrat           .765      .192
   Independent                .557      .073
   Leaning Republican         .331      .021
   Weak Republican            .154      .005
   Strong Republican          .055      .001

   Notes: Predicted probabilities were computed using the prtab command while holding all other predictors at their mean. Predicted probabilities are based on probit estimation results.
graph twoway connected pidnop1 pidbap1 pidbax, ytitle("Pr of Gore Vote") xtitle("pid")
Here pidnop1 contains the predicted probabilities of a Gore vote for the different levels of partisanship for respondents who are not born-again Christians; pidbap1 contains similar predicted probabilities but now for born-again Christians. Figure 2.7 shows the graph of the predicted probability,
clearly revealing the differential drop-offs in the likelihood of a Gore vote
among born-again and other respondents.
A third useful command in the SPost suite is the prvalue command,
which allows one to compute predicted probabilities given certain values of
the covariates. The general syntax for the command is
   prvalue [, x(variable1=value1 ...) rest(stat) delta save dif lev(#)]
The save option preserves the predicted probability for subsequent comparisons, while the dif option draws the comparison between the current value
of the predicted probability and that stored using the save option. The
delta option causes Stata to compute confidence intervals on the predicted
probabilities using the delta method. The confidence level of these intervals
is controlled by lev(#). For example, one can issue the following sequence
after running a probit analysis:
[Figure 2.7: Predicted Probability of Voting for Gore by Partisanship and Religious Beliefs. Predicted probability of a Gore vote plotted against pid, with separate curves for born-again and other respondents.]
prvalue, x(bornagain=0) rest(mean) delta save lev(99)
prvalue, x(bornagain=1) rest(mean) delta dif lev(99)
The first command causes Stata to compute the predicted probability of
a Gore vote and the 99% confidence interval for respondents who are not
born-again Christians. This predicted probability is .571, although it has
a rather wide confidence interval that runs from .202 to .941. The second
command causes Stata to calculate the predicted probability of a Gore vote
for born-again Christians. It also computes the difference between this predicted probability and that for the previous group, as well as the 99% confidence interval on the difference. For born-again Christians, the predicted
probability of a Gore vote is only .079. Compared to the earlier predicted
probability, this is a change of -.493 points. The confidence interval on this
difference runs from -.928 to -.057. Since this interval does not include 0,
we can say that the difference in predicted probabilities due to born-again
Christianity is statistically significant at the .01 level.
Finally, Long and Freese’s (2006) prchange command computes discrete
change in the predicted probabilities. By just typing
prchange
Stata will produce a number of the discrete change measures that we have
discussed, including maximum possible change, a change from 0 to 1, a centered unit change, and a standard deviation change. In computing these
measures, the default is to keep the remaining predictors at their mean values. Table 2.6 reports these measures for partisanship, the trait differential,
Table 2.6: Discrete Change in the Predicted Probability of a Gore Vote

   Predictor               Min → Max    0 → 1     ±.5      ±.5 SD
   PID                       −.884      −.144    −.208     −.464
   Trait Differential        1.000       .691     .866      .860
   Born-Again Christian      −.493      −.493      NA        NA

   Notes: Discrete changes in predicted probabilities were computed using the prchange command while holding all other predictors at their mean. These changes are based on probit estimation results.
and born-again Christianity. One should keep in mind that only maximum
possible change and 0-1 change are reasonable for dummy predictors; these
measures produce identical estimates of the discrete change. Looking at
the discrete changes, we see sizable effects for all predictors. For example,
moving PID from the minimum to the maximum produces a drop in the
predicted probability of a Gore vote of .884 points. Moving the trait differential from half a standard deviation below to half a standard deviation above
the mean produces a boost in the predicted probability of a Gore vote of
.860 points. And compared to others, born-again Christians are almost half
a point less likely to vote for Gore, holding all else equal.
The Odds Ratio For the logit model, the odds ratio forms an alternative
basis for interpretation. Long and Freese’s (2006) suite contains a useful
command that will compute odds ratios and factor/percentage changes in
those ratios.45 The syntax of this command is as follows:
   listcoef [varlist] [, factor | percent help]
The factor option shows the factor changes in the odds for a unit increase
in the predictor, whereas the percent option shows the percentage changes
in the odds. The help option adds annotation to the output so that it can
be read more easily.
Table 2.7 shows the factor and percentage changes in the odds for PID,
the trait differential, and born-again Christianity. The factor changes for
PID and born-again Christian are less than 1 because the corresponding
logit coefficients are negative. The factor change for the trait differential
is enormous because of the very large coefficient on this predictor. The
percentage changes indicate that for each increase in PID, the odds of voting
for Gore drop by 63.1%. Compared to others, the odds of voting for Gore
are 94.1% lower for born-again Christians. An increase by one unit in the
trait differential, which is an increase in favor of Gore, boosts the odds of a
Gore vote by over 30,000%.

[Footnote 45] Stata’s logistic command also reports logit results as odds ratios. Further, adding the or option to the logit command will produce odds ratios as well.

Table 2.7: Odds Ratios for the Gore Vote

   Predictor                Factor        Percent
   PID                        .369        −63.100
   Trait Differential      307.845      30684.500
   Born-Again Christian       .059        −94.100

   Notes: Factor and percentage changes in the odds were computed using the listcoef command. These changes are based on logit estimation results.
2.2.9 Heteroskedastic Logit and Probit
Derivation, Estimation, and Interpretation
By assumption, logit and probit models have homoskedastic stochastic components. But what happens when this assumption is wrong? What if the
stochastic elements of choice are heteroskedastic? In this case, we have a
problem that is far more serious than would be the case in linear regression
analysis. In linear regression, the OLS estimator is still unbiased in the
presence of heteroskedastic errors. It is no longer the best linear unbiased
estimator and the estimated standard errors will be biased, but the regression coefficients themselves can be trusted. This is not true in logit and
probit models. As we have seen, the variance assumption in these models
plays a critical role in setting the scale of the coefficients. We can transform
that scale by setting the variance differently, as is the case when we compare
logit and probit. As long as we apply the same transformation to all sample
units, then we can still compare predicted probabilities and marginal effects
across those units. But if the appropriate variance is different for some units
than other units, then imposing a common variance and setting the scale of
the coefficients accordingly will introduce bias. This bias does not disappear
asymptotically, so that the ML estimators will be inconsistent, as well as
asymptotically inefficient. In addition, one also has the problem that the
estimated standard errors will be inconsistent.
There is a way around the imposition of homoskedasticity and that is to
run a heteroskedastic logit/probit analysis. Heteroskedastic logit and probit
are patterned after Harvey’s (1976) heteroskedastic regression model. Here,
I focus on the heteroskedastic probit model, since it is implemented through
Stata’s hetprob command.
Derivation Derivation of the heteroskedastic probit model begins by specifying the following latent variable model:

   yi* = x′iβ + εi,    εi ~ N(0, σi²)

Compared to the standard probit model, the difference here lies in the variance assumption for εi: in probit, this variance is fixed to 1 for all units, but
here it is left free to vary across units. We model the variance as a function
of a set of predictors, analogous to Harvey (1976):

   σi² = [exp(z′iγ)]²

or, equivalently, σi = exp(z′iγ) or ln σi = z′iγ. Here zi may or may
not contain the same predictors as xi. Importantly, zi does not contain a
constant. This means that if none of the predictors in the variance model
have an effect, then σi² = exp(0) = 1, the same as in the standard probit
model. This suggests a test of whether a heteroskedastic model is necessary,
namely a test of H0: γ = 0.
It is possible to transform the heteroskedastic probit specification into a
homoskedastic model by dividing all terms of the latent variable model by
σi. This produces

   yi*/σi = x′iβ/σi + εi/σi
   yi*/exp(z′iγ) = x′iβ/exp(z′iγ) + εi/exp(z′iγ)
   yi*/exp(z′iγ) = x′iβ/exp(z′iγ) + δi

In this model, δi ~ N(0, 1), so that the errors are homoskedastic again.46

[Footnote 46] Proof: V[δi] = E[δi²] = [exp(z′iγ)]⁻² V[εi] = [exp(z′iγ)]⁻² [exp(z′iγ)]² = 1.
We can now state the mechanism linking the observed response to the
underlying latent variable. In the ordinary probit model, this mechanism is
πi = Pr(yi = 1) = Pr(yi* > 0) = Pr(εi < x′iβ). In terms of the transformed
latent variable this becomes

   πi = Pr(yi*/exp(z′iγ) > 0)
      = Pr(δi < x′iβ/exp(z′iγ))
      = Φ(x′iβ/exp(z′iγ))    (2.17)
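Equation (2.17) shows how the variance term rescales the probit index. A purely illustrative calculation with made-up values x′iβ = 1 and two values of z′iγ:

   display normal(1/exp(0))     // homoskedastic case (z'gamma = 0): about .84
   display normal(1/exp(.5))    // larger error variance (z'gamma = .5): about .73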
Estimation The log-likelihood function for the heteroskedastic probit model
is given by

   ℓ = Σ(i=1 to n) { yi ln Φ[x′iβ/exp(z′iγ)] + (1 − yi) ln(1 − Φ[x′iβ/exp(z′iγ)]) }

This is a difficult log-likelihood function to optimize and convergence problems are not uncommon. The first-order conditions are given by

   ∂ℓ/∂β = Σ(i=1 to n) { φi(yi − Φi)/[Φi(1 − Φi)] } [exp(z′iγ)]⁻¹ xi = 0
   ∂ℓ/∂γ = Σ(i=1 to n) { φi(yi − Φi)/[Φi(1 − Φi)] } [exp(z′iγ)]⁻¹ (−x′iβ) zi = 0

where φi = φ[x′iβ/exp(z′iγ)] and Φi = Φ[x′iβ/exp(z′iγ)].
Interpretation Interpretation is also more complicated than in the standard probit model, at least when zi and xi share elements in common. Let
wik be a predictor that could be included in xi, in zi, or in both; then the
marginal effect of this predictor is given by

   ∂πi/∂wik = φ[x′iβ/exp(z′iγ)] × [βk − (x′iβ)γk]/exp(z′iγ)

When wik appears in only xi, then γk = 0 and the formula for the marginal
effect is adjusted accordingly. When wik appears in only zi, then βk = 0
and the formula for the marginal effect is adjusted accordingly. When wik
appears in both sets of covariates, then the marginal effect may deviate dramatically from what is suggested by either β̂k or γ̂k. In general, the presence
of exp(z′iγ) can create important differences between the estimated coefficients and the marginal effects, for example in terms of significance levels.
Thus, extreme care should be taken with the interpretation.
Table 2.8: Heteroskedastic Probit Model of Vote Choice in 2000

   Predictor                      Estimate     S.E.
   Model for π
     Partisanship                  −.060+      .035
     Male                          −.000       .053
     White                         −.185       .140
     Born Again Christian          −.123       .082
     Age                           −.002       .002
     Education                      .013       .022
     Household Income              −.011       .009
     Trait Differential             .396+      .210
     Constant                       .430       .292
   Variance Model
     Strength of Partisanship      −.591**     .230
     Education                     −.185+      .112
   ℓ                             −58.825
   LR test/p                      17.790/.000

   Notes: n = 348. ** p < .01, + p < .10 (two-tailed). Estimates obtained using Stata’s hetprob command. The likelihood ratio test pertains to H0: γ = 0.
Example
It is possible to add an uncertainty component to the 2000 presidential vote
model that we explored earlier. This component incorporates the idea that
vote choice may be more predictable for certain individuals than others. For
instance, the vote choice of strong partisans should be more predictable than
that of weak partisans or independents. Similarly, if we treat education as
a proxy for information, then it stands to reason that more highly educated
citizens are more predictable in their vote choices than less educated citizens.
To add the uncertainty component, we run Stata’s hetprob command,
whose syntax is:
   hetprob depvar [indepvars] [, het(varlist) options]
Table 2.9: Marginal Effects for Heteroskedastic Probit

   Predictor                    Effect       S.E.
   Partisanship                 −.183**      .041
   Male                         −.001        .163
   White                        −.370**      .149
   Born Again Christian         −.359*       .166
   Age                          −.005        .005
   Education                     .036        .066
   Household Income             −.033+       .018
   Trait Differential           1.215**      .280
   Strength of Partisanship     −.012        .046

   Notes: ** p < .01, + p < .10 (two-tailed). Marginal effects are computed at the mean using the mfx command. Standard errors are computed using the delta method.
In our case, the dependent variable is vote choice (1=Gore, 0=Bush). Our
vector xi consists of partisanship, male, white, born-again Christian, age,
education, household income, and the trait differential described earlier.47
Our vector z i consists of partisan strength (the absolute value of the difference between partisanship and the midpoint on the party identification
scale) and education. The estimates are shown in Table 2.8 and the marginal
effects at the mean, obtained by running mfx, are shown in Table 2.9. Note
that Stata estimates ln σi2 rather than σi2 .48 Negative values for the effects
of the predictors in z i mean that the variance (uncertainty) is reduced as
the values of the predictors increase, whereas positive effects imply that the
variance is increased.
Table 2.8 includes the result of a LR test of the hypothesis H0 : γ = 0.
The value of the LR test statistic is 17.79, leading to a sound rejection of
the hypothesis. This implies that the standard probit specification with
σ 2 = 1 is inappropriate for the data at hand. Considering the marginal
effects, we observe significant marginal effects for partisanship, white, bornagain Christian, income, and the trait differential. Note that we obtain
[Footnote 47] The issue differential was dropped from the model because inclusion of this predictor reduced the sample size dramatically, leading to convergence problems for the heteroskedastic probit model.
[Footnote 48] From a computational perspective, it is better to estimate ln σi² since it is not bounded, whereas σi² is bounded to be non-negative. Thus issues of out-of-bound estimates or estimation at the boundary of the parameter space do not arise with ln σi².
these effects, even though several of these predictors were not statistically
significant in Table 2.8.
Some Cautionary Remarks
Heteroskedastic logit and probit models have been popularized in political
science through the work of Alvarez and Brehm (1995, 2002). While interesting things may be learnt from these models, it is important to remember
that they tinker with a fundamental identifying assumption of logit and
probit analysis. There is a certain risk involved in this, as Keele and Park
(2005) have demonstrated in a recent Monte Carlo simulation analysis of
heteroskedastic probit and logit. Among other things, Keele and Park find
that the estimate of β tends to be severely biased unless the sample size is
large (n ≥ 500). In our example, the sample size was considerably smaller
than 500 and one may therefore question the estimates presented in Table
2.8. Keele and Park also found that small-sample bias in the estimate of γ
is less severe, although still problematic. They also find that the estimated
standard errors tend to be too large and the coverage probabilities too low,
even in reasonably large samples. All of these problems are compounded in
the face of measurement error and/or model mis-specification. The lesson
here is that heteroskedastic logit and probit models should be used with
extreme care. One should be reasonably convinced that one has a correctly
specified model and one should have a reasonably large sample before one
can place great confidence in the results from these models.
2.3 Alternatives to Logit and Probit
Logit and probit are the best-known of the binary regression models. However, useful alternatives are available. These may be better capable of fitting
skewed distributions of y, that is, of fitting models where the maximum marginal effect
occurs at π ≠ .5. The gompit, scobit, and power logit models fall into this
category.
2.3.1 The Gompit Model
Model and Estimation
Model The gompit model is one of the alternatives to logit and probit. The model can be motivated using the latent variable framework that was used to motivate the logit and probit models. Let y be a binary indicator of whether an alternative θ was chosen. We postulate that a decision maker chooses θ when he or she derives positive utility from this alternative, i.e. when y_i* > 0. We further postulate that y_i* = x_i′β + ε_i, where x_i is a vector of attributes of the decision maker and ε_i is a stochastic component. Thus, Pr(y_i* > 0) = Pr(x_i′β + ε_i > 0) = Pr(ε_i > −x_i′β). This is precisely the same derivation that we used for the logit and probit models.

Where things become different is in the error distribution that we postulate for ε_i. The gompit model assumes the standard Type-I extreme value distribution, which is also known as the standard Gumbel distribution:49

G(ε) = exp{−exp(−ε)}

[Figure 2.8: Standard Gumbel Distribution. The standard Gumbel CDF compared with the standard logistic CDF.]

Figure 2.8 shows the standard extreme value distribution and compares it to the standard logistic distribution. Compared to the standard logistic distribution, the standard extreme value distribution has a slightly higher mean (.57722 versus 0) and a positive skew (the skewness coefficient is 1.139547). The presence of a degree of skewness allows the gompit model to better accommodate a skewed distribution in y.50
With the distributional assumption in place, the probability of choosing θ can now be computed:

π_i = Pr(ε_i > −x_i′β)
    = 1 − Pr(ε_i < −x_i′β)
    = 1 − G(−x_i′β)
    = 1 − exp{−exp(x_i′β)}     (2.18)

This is the gompit model.

49 The distribution is also referred to as the Gompertz distribution, which explains the name gompit model. Note that there are three types of extreme value distributions. In addition to the Gumbel distribution, there are Fréchet (Type-II extreme value) and Weibull (Type-III extreme value) distributions.
50 If it is a negative skew that needs to be accommodated, then the distribution may be defined as exp{−exp(ε)}; this distribution has a mean of −.57722 and a skewness coefficient of −1.139547.
Estimation Estimation of the gompit model is relatively straightforward. The data consist of 1s that occur with probability π_i and 0s that occur with probability 1 − π_i. Substituting 1 − G(−x_i′β) for π_i and G(−x_i′β) for 1 − π_i, the likelihood function can be written as

L = ∏_{i=1}^{n} [1 − G(−x_i′β)]^{y_i} [G(−x_i′β)]^{1−y_i}

The log-likelihood function is

ℓ = Σ_{i=1}^{n} { y_i ln[1 − G(−x_i′β)] + (1 − y_i) ln[G(−x_i′β)] }
This is not a complex log-likelihood function so that estimation generally
poses few problems.
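As a rough illustration of how undemanding this log-likelihood is, the following sketch fits a gompit model by direct maximum likelihood on simulated data. All data and names (X, y, beta_true) are invented for the example, and numpy/scipy are assumed available; this is not part of the Stata workflow used elsewhere in these notes.

    # Sketch: ML estimation of a gompit model on simulated data (illustrative only).
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n = 2000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant plus one predictor
    beta_true = np.array([-1.0, 0.8])
    pi = 1.0 - np.exp(-np.exp(X @ beta_true))                # gompit probabilities, eq. (2.18)
    y = rng.binomial(1, pi)

    def gompit_loglik(beta):
        p = 1.0 - np.exp(-np.exp(X @ beta))
        p = np.clip(p, 1e-12, 1 - 1e-12)                     # guard against log(0)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(lambda b: -gompit_loglik(b), x0=np.zeros(2), method="BFGS")
    print(res.x)                                             # should be close to beta_true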
Properties The maximum marginal effect of the gompit model arises at π_i = .632121.51 Thus, the gompit model is quite useful if the goal is to fit a model where the maximum marginal effect does not arise at π_i = .5.

51 Proof: Let w_i = x_i′β. The marginal effect with respect to w_i is given by

∂π_i/∂w_i = exp[−exp(w_i) + w_i]

Optimization of this marginal effect requires that we take the first derivative and set it equal to zero:

∂²π_i/∂w_i² = exp[−exp(w_i) + w_i][1 − exp(w_i)] = 0

This equation holds true when exp(w_i) = 1, which occurs when w_i = 0. Substitution of w_i = 0 into the formula for π_i yields .632121.
Table 2.10: Gompit Model of Military Coups

                         Logit                   Gompit
Predictor                B          S.E.         B          S.E.
Civilian Regime          −2.475**   .207         −2.366**   .202
Size of the Military     −.001      .001         −.001      .001
Log Per Capita GDP       −1.004**   .239         −.915**    .223
Constant                 1.040+     .575         .705       .526

Notes: N = 3493. ** p < .01, + p < .10 (two-tailed). Table entries are maximum likelihood logit and gompit estimates with cluster-corrected estimated standard errors (clustering by country).
Example
As an example of a gompit model, we analyze the incidence of military
coups in the post-World War II period, using Banks’ Handbook of Social
and Political Indicators. Military coups are extremely rare: out of 3,493
country-year cases that we analyze here, only 148 (less than 5%) experienced
military coups. The severe skewness in this distribution may make logit and
probit analysis a poor choice. Instead, we fit a gompit model with three
predictors: regime type (1 = civilian regime; 0 = otherwise), per capita
GDP (logged), and size of the military.
The gompit model can be estimated in Stata by issuing the cloglog
command, which stands for complementary log-log regression. The syntax
is

    cloglog depvar [indepvars] [, options]
When applied to the Banks data, we obtain the estimates shown in Table
2.10. For the sake of comparison, I have also included the logit estimates
of the coefficients, which, in this case, tell a very similar story. To shed
further light on the gompit estimates, they can be converted into predicted
probabilities or marginal effects using the same methods that were discussed
earlier.
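To illustrate the last point, the following sketch plugs the gompit coefficients from Table 2.10 into equation (2.18) for a single, entirely hypothetical covariate profile; the profile values are chosen for illustration and are not taken from the Banks data.

    # Sketch: predicted coup probability from the Table 2.10 gompit estimates
    # for a hypothetical covariate profile (illustrative values only).
    import numpy as np

    b_const, b_civilian, b_military, b_loggdp = 0.705, -2.366, -0.001, -0.915
    civilian, military_size, log_gdp = 1.0, 50.0, 7.0        # hypothetical profile

    xb = b_const + b_civilian * civilian + b_military * military_size + b_loggdp * log_gdp
    pi = 1.0 - np.exp(-np.exp(xb))                            # equation (2.18)
    print(round(pi, 4))                                       # a very small probability, as expected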
2.3.2 The Scobit Model
Model and Estimation
Model Nagler (1994) proposed the Burr-II distribution for the stochastic component in the latent variable model.52 This distribution contains
an ancillary parameter that controls the degree of skewness in the error distribution, giving a great deal of flexibility.53 The resulting model is
known as the scobit (skewed logit) model and is closely related to methods
developed by Aranda-Ordaz (1981) and Prentice (1976).
Derivation of the scobit model follows the by now familiar pattern. Let y_i denote a binary choice variable that takes on the value 1 if alternative θ is chosen. This variable is related to an underlying latent utility, which is modeled as y_i* = x_i′β + ε_i. The link between y_i and y_i* is probabilistic, due to the stochastic nature of ε_i. Correspondingly, π_i = Pr(y_i = 1) = Pr(y_i* > 0) = Pr(x_i′β + ε_i > 0) = Pr(ε_i > −x_i′β).
This is where the distributional assumption comes into play. The scobit model assumes

B(ε) = [1 + exp(−ε)]^{−α}

where α > 0 is an ancillary parameter controlling the degree of skewness. Figure 2.9 shows different examples of the Burr-II distribution, clearly illustrating the different degrees of skewness. In general, 0 < α < 1 implies negative skewness, α = 1 implies no skewness, and α > 1 implies positive skewness.
With the distributional assumption in place, it is now easy to work out the choice probability:

π_i = Pr(ε_i > −x_i′β)
    = 1 − Pr(ε_i < −x_i′β)
    = 1 − B(−x_i′β)
    = 1 − [1 + exp(x_i′β)]^{−α}     (2.19)

52 Nagler (1994) calls this the Burr-10 distribution because it is the 10th equation in the list of a dozen distributions that Burr (1942) proposed but, as Achen (2002) notes, it is conventionally referred to as the Burr-II or Burr Type-II distribution.
53 In general, an ancillary parameter is a feature of a probability function that is constant across observation units i.
[Figure 2.9: Burr-II Distribution. CDFs for α = .1, .5, 1, 2, and 10.]
Note that when α = 1, this expression is identical to the logit model.54
Estimation The likelihood function for the scobit model comes about by recognizing that the data consist of 1s that occur with probability π_i and 0s that occur with probability 1 − π_i. Thus,

L = ∏_{i=1}^{n} [1 − B(−x_i′β)]^{y_i} [B(−x_i′β)]^{1−y_i}

The log-likelihood function is

ℓ = Σ_{i=1}^{n} { y_i ln[1 − B(−x_i′β)] + (1 − y_i) ln[B(−x_i′β)] }
This is a complex log-likelihood function and convergence problems are quite
common. Another problem is that α can be highly collinear with the elements of β, in particular the constant. This may produce inflated standard
errors. For this reason, Nagler (1994) suggests testing the elements of β
using a likelihood ratio approach rather than a Wald test approach.
54 Proof: When α = 1 then π_i = 1 − [1 + exp(x_i′β)]^{−1}. This may also be written as

π_i = 1 − 1/[1 + exp(x_i′β)]
    = [1 + exp(x_i′β) − 1]/[1 + exp(x_i′β)]
    = exp(x_i′β)/[1 + exp(x_i′β)]
    = Λ(x_i′β)
Properties Despite the fact that the scobit model is much more difficult to estimate than a logit, probit, or gompit model, it stands out through its remarkable flexibility. Since α is an estimated parameter, one does not have to make any a priori assumptions about the degree of skewness. Correspondingly, there is flexibility in where the maximum marginal effect occurs. It can be demonstrated that this maximum occurs at55

π_i = 1 − [α/(α + 1)]^α

If α < 1, then the maximum marginal effect occurs at 0 < π_i < .5. If α = 1, then the maximum occurs at π_i = .5, the same value as for the logit model. Finally, if α > 1, then the maximum occurs for .5 < π_i < 1 − e^{−1}. Here 1 − e^{−1} ≈ .632 is the upper asymptote on π_i.
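The location of the maximum marginal effect is easy to check numerically. The sketch below compares a brute-force search over the scobit response curve with the closed-form expression above, for a few illustrative values of α (Python/numpy assumed; not part of the original Stata workflow).

    # Sketch: where the scobit marginal effect peaks, numerically vs. closed form.
    import numpy as np

    def scobit_prob(w, alpha):
        return 1.0 - (1.0 + np.exp(w)) ** (-alpha)

    for alpha in (0.5, 1.0, 2.0):
        w = np.linspace(-10, 10, 200001)
        pi = scobit_prob(w, alpha)
        slope = np.gradient(pi, w)                           # numerical marginal effect
        numerical = pi[np.argmax(slope)]
        closed_form = 1.0 - (alpha / (alpha + 1.0)) ** alpha
        print(alpha, round(numerical, 4), round(closed_form, 4))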
Power Logit The presence of this upper asymptote is one limitation of the scobit model. If one expects the point of maximal impact to be greater than .632, then one should fit a power logit model instead. In analogy to the logit model, Achen (2002) specifies the power logit model as follows:

π_i = [1 + exp(−x_i′β)]^{−α}

and shows that the marginal effect is optimized at

π_i = [α/(α + 1)]^α

where e^{−1} < π_i < 1. Estimation of this model creates problems similar to the scobit model.
55 Proof: As usual, let w_i = x_i′β. Then the marginal effect with respect to w_i is given by

dπ_i/dw_i = α exp(w_i) / [1 + exp(w_i)]^{1+α}

We seek to optimize this marginal effect. Thus we take the derivative of the marginal effect with respect to w_i and set the result to 0. This produces

d²π_i/dw_i² = −α exp(w_i)[α exp(w_i) − 1] / [1 + exp(w_i)]^{2+α} = 0

The solution to this equation is w_i = −ln α. Substitution of this result into the formula for π_i yields the formula for the maximum marginal effect.
Table 2.11: Scobit and Rare Events Logit Models of Military Coups

                         Scobit                  RE Logit
Predictor                B          S.E.         B          S.E.
Civilian Regime          −3.543**   1.157        −2.390**   .267
Size of the Military     −.001      .001         −.001      .001
Log Per Capita GDP       −1.605**   .598         −.880**    .332
Constant                 5.933+     3.359        .326       .824
ln(α)                    −2.293**   .888
α                        .101

Notes: n = 3493 for scobit and n = 477 for the rare events logit. ** p < .01, + p < .10 (two-tailed). Table entries are maximum likelihood scobit and rare events (RE) logit estimates with cluster-corrected estimated standard errors (clustering by country). The rare events logit was based on a sample of 10% of 0s and 100% of 1s. The weight is .0314.
Example
Returning to the military coup data, we can fit a scobit model by using
Stata’s scobit command:
    scobit depvar [indepvars] [, options]
Given the difficult scobit log-likelihood, I routinely include dif (difficult)
as one of the options. Table 2.11 shows the scobit estimates, which tell
a similar story to the logit and gompit estimates reported earlier. You
should note, however, that the standard errors are quite a bit higher than
those for the gompit model, a consequence of the high degree of collinearity
between the scobit estimates (the average absolute correlation between the
estimates is .762). We also see that ln(α) is statistically significant at the
.01 level, resulting in an estimate of α of .101.56 The nature of this estimate
is such that we apparently need to introduce considerable negative skewness
to accommodate the distribution of yi , which should not come as a surprise
in light of the preponderance of 0s on this variable.
56 For estimation purposes it is better to parameterize the scobit model in terms of ln(α)
since it is not bounded, whereas α is bounded to be strictly positive. Remember that the
scobit model reduces to a logit model when α = 1. In terms of ln(α), this means that
a test of H0 : ln(α) = 0 provides the mechanism for telling whether the scobit model
deviates significantly from the logit model.
Chapter 3
Modeling Ordinal Variables
Ordinal response variables are extremely common in political analysis. In
survey research, many response scales are ordinal in nature. This is true,
for example, of the agree-disagree items that are widely used to measure
political attitudes, beliefs, and values. In the field of international relations,
many a research paper has been based on the Polity democracy and autocracy scores. While some have treated these scores as interval scales, they
truly are ordinal scales. The same can be said of Freedom House scores and
of most measures of corruption and the rule of law that have found their
way into comparative politics. Thus, it is probably safe to say that ordinal
measures dominate in political research.
How should one analyze these measures? There remains a strong current
in political science that treats ordinal scales as if they were continuous.
Advocates of this approach rely on the linear regression model to model
ordinal response variables, thus imposing a linear effect on the covariates.
As widespread as this approach is, it is also extremely risky. Unless the
ordinal scale has many categories, it is easy to obtain misleading results
(McKelvey and Zavoina 1975; Winship and Mare 1984).
It is not difficult to find the reasons why the linear regression model is
inappropriate for ordinal outcomes. First, it is inherently suspect to assume
normality, as the classical normal linear regression model does, when the dependent variable is ordinal in nature. Second, it is well-known that the association between an ordinal outcome and a predictor may be underestimated
by traditional measures of association, such as Pearson’s product-moment
correlation, that underlie the regression model. This is especially true if
the response scale has few categories and the distribution over those categories is skewed (e.g. Jöreskog 1990). Finally, the linear regression model
may impose the incorrect functional form on the relationship between the
response variable and the covariates. In the linear regression model, the
effect of a unit increase in a covariate is constant along the y scale. This
makes sense when the differences on this scale may be interpreted as identical, as is the case with continuous dependent variables. However, on an
ordinal response scale, the differences between adjacent scale values need not represent identical distances. As such, there is no reason to expect
that a unit increase in a covariate has a constant effect on the response variable. Imposing this as a constraint could easily produce biased parameter
estimates.
There is, then, considerable peril in analyzing ordinal response variables,
especially those with few categories, as if they are continuous. Instead of
analyzing such variables via linear regression analysis, we should consider
alternative models. The ordered logit and probit models are useful starting
points for this purpose.
3.1
Ordered Logit and Probit Analysis
3.1.1
A Motivating Example
Imagine a survey respondent who is faced with the following well-known
survey item:
Generally speaking, do you consider yourself a Democrat, a Republican, or an independent?
We can think of the three response options as being rank-ordered with respect to the property “Republican-ness,” i.e. D < I < R. It is the respondent’s task to choose a particular response to the survey question. We
indicate his/her response by

y_i = 1  Democrat
      2  Independent
      3  Republican
We can think of the 3-point response scale as a crude reflection of an
underlying continuous latent variable that taps an individual’s “Republicanness” with infinite precision. We shall label this latent variable as yi∗ , just
as we did for the binary regression models. The latent response scale ranges
from −∞ to ∞. Low values on yi∗ indicate a low level of Republican affiliation, while high values indicate a high level of this property. As before, we
model y_i* as a function of systematic and stochastic components, i.e.

y_i* = x_i′β + ε_i

where x_i is a vector of respondent attributes (e.g. income, race, gender). Importantly, this vector does not include a constant. The stochastic term, ε_i, captures unobserved factors that make an individual more or less Republican. These may be things of which the respondent is him- or herself unaware, or things that remain hidden from the modeler.
We now need a mechanism that connects yi∗ to yi . Imagine a set of two
cut points or thresholds on yi∗ . The first cut point is τ1 and it indicates
the level of yi∗ that is required to embrace at least the second response option
(“independent”). That is, respondents whose level of Republican affiliation
falls short of τ1 are expected to embrace the first response option, while
those respondents whose level is at least τ1 will, at a minimum, indicate
that they are independents. The second cut point, τ2 > τ1 , indicates the
level of yi∗ that is required to embrace the third response option. That is,
only respondents whose level of Republican identification is at least τ2 will
respond that they are Republicans; everyone else is expected to indicate
that they are independents or, if y_i* falls short even of τ_1, Democrats. Thus,

y_i = 1  if −∞ < y_i* < τ_1
      2  if τ_1 ≤ y_i* < τ_2
      3  if τ_2 ≤ y_i* < ∞

Substituting the model for y_i*, this may also be written as

y_i = 1  if −∞ < x_i′β + ε_i < τ_1
      2  if τ_1 ≤ x_i′β + ε_i < τ_2
      3  if τ_2 ≤ x_i′β + ε_i < ∞

or, subtracting x_i′β from both sides of the inequality signs,

y_i = 1  if −∞ < ε_i < τ_1 − x_i′β
      2  if τ_1 − x_i′β ≤ ε_i < τ_2 − x_i′β
      3  if τ_2 − x_i′β ≤ ε_i < ∞
The presence of the stochastic term ε_i in the mechanism linking y_i to y_i* means that the mechanism of necessity works probabilistically. Accordingly, Pr(y_i = 1) = Pr(−∞ < ε_i < τ_1 − x_i′β), Pr(y_i = 2) = Pr(τ_1 − x_i′β ≤ ε_i < τ_2 − x_i′β), and Pr(y_i = 3) = Pr(τ_2 − x_i′β ≤ ε_i < ∞). To work out these probabilities, we need to make a distributional assumption for ε_i. A convenient starting point is to impose a symmetric probability distribution such as the standard logistic or standard normal distribution. Doing so produces the following probability for responding “Democrat:”

Pr(y_i = 1) = Pr(−∞ < ε_i < τ_1 − x_i′β)
            = ∫_{−∞}^{τ_1 − x_i′β} f(ε_i) dε_i
            = F(τ_1 − x_i′β)
Here f(.) is the PDF, while F(.) is the CDF. Similarly, the probabilities of responding “independent” and “Republican” are given by

Pr(y_i = 2) = Pr(τ_1 − x_i′β ≤ ε_i < τ_2 − x_i′β)
            = ∫_{−∞}^{τ_2 − x_i′β} f(ε_i) dε_i − ∫_{−∞}^{τ_1 − x_i′β} f(ε_i) dε_i
            = F(τ_2 − x_i′β) − F(τ_1 − x_i′β)

Pr(y_i = 3) = Pr(τ_2 − x_i′β ≤ ε_i < ∞)
            = 1 − Pr(ε_i < τ_2 − x_i′β)
            = 1 − ∫_{−∞}^{τ_2 − x_i′β} f(ε_i) dε_i
            = 1 − F(τ_2 − x_i′β)
The probabilistic choice mechanism is illustrated in Figure 3.1.
When F (.) is chosen to be the standard logistic distribution, Λ(.), then
the equations for Pr(yi = 1) · · · Pr(yi = 3) specify the choice probabilities of
the ordered logit model. If, instead, F (.) is chosen to be the standard normal distribution, Φ(.), then these equations specify the choice probabilities
of the ordered probit model. The differences between these two models
are fairly minimal, although they make different variance assumptions, so
that the scale of the parameters is different as well (cf. binary logit/probit).
Note that one could have chosen an asymmetric error distribution as well,
but this is not commonly done for ordinal regression models.
3.1.2
General Model
In general, imagine that an individual i chooses from among M alternatives, 1, 2, ..., M, that are rank-ordered with respect to the underlying latent trait y_i*. The latent trait is continuous and can be modeled via

y_i* = x_i′β + ε_i     (3.1)
[Figure 3.1: Choice Mechanism in Ordinal Regression Models. The density f(y*), partitioned by the cut points τ_1 and τ_2 into the regions y = 1, y = 2, and y = 3.]
Here x_i is a vector of attributes of the individual and does not include a constant. Further, ε_i is a stochastic component. Alternative m is chosen if

τ_{m−1} ≤ y_i* < τ_m, i.e.  τ_{m−1} − x_i′β ≤ ε_i < τ_m − x_i′β

where the τ's are ancillary parameters and the following conditions hold: τ_{m−1} < τ_m, τ_0 = −∞, and τ_M = ∞. In probabilistic terms,

Pr(y_i = m) = F(τ_m − x_i′β) − F(τ_{m−1} − x_i′β)     (3.2)

where F(.) is specified either as the standard logistic or the standard normal CDF. Specifically, the choice probabilities of the ordered logit model are given by

Pr(y_i = m) = Λ(τ_m − x_i′β) − Λ(τ_{m−1} − x_i′β)

while those of the ordered probit model are given by

Pr(y_i = m) = Φ(τ_m − x_i′β) − Φ(τ_{m−1} − x_i′β)
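As a minimal sketch of these formulas, the function below returns the M ordered logit choice probabilities for a single case; the coefficients and cut points are made up purely for illustration.

    # Sketch: ordered logit choice probabilities, equation (3.2) with F = Lambda.
    import numpy as np

    def ordered_logit_probs(xb, taus):
        """Pr(y = 1), ..., Pr(y = M) for a case with index x'beta = xb."""
        logistic_cdf = lambda z: 1.0 / (1.0 + np.exp(-z))
        cuts = np.concatenate(([-np.inf], taus, [np.inf]))
        return np.diff(logistic_cdf(cuts - xb))              # successive differences of the CDF

    probs = ordered_logit_probs(xb=0.3, taus=np.array([-0.5, 1.0]))
    print(probs, probs.sum())                                # three probabilities summing to 1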
3.1.3
Estimation
Identification
As was the case with the binary logit and probit models, the latent variable
y_i* is unobserved and does not have a fixed measurement scale. This means that the scale of the parameters in β is also unidentified, making it impossible to estimate the ordered logit and probit models without making some identifying assumptions. We identify the scale of y_i* and hence of β by fixing the variance of ε_i. In the ordered probit model, V[ε_i] = 1; in the ordered logit model, V[ε_i] = π²/3. Because the variances (and other features) of the standard normal and standard logistic distributions are different, the scale of the ordered probit and logit coefficients is as well. In general, the ordered logit estimates are about 1.7 times the size of the ordered probit coefficients.
Another identifying constraint is that we need to exclude the constant
from xi . It is easy to see why. In an empty model, in which there are
no predictors, the only information that we have are estimates of M − 1
probabilities (the remaining probability is not unique, since it follows from
the axiom that all probabilities must sum to 1). In this empty model, there
are M − 1 parameters in the form of cut points. This means that the model
is just-identified. But if we add a constant, then the number of estimated
parameters increases to M . With M − 1 pieces of information, this means
that the model is no longer identified. Thus, if one adds a constant, as
Greene (2003) proposes, then one should drop one of the cut points.1
Likelihood and Log-Likelihood Functions
The ordered logit and probit models are estimated using MLE. The likelihood function is derived by recognizing that we draw a sample of n independent observations that yields observed choices 1, 2 · · · M . Focusing on a
particular alternative m, the joint density is given by

∏_{y_i = m} Pr(y_i = m) = ∏_{y_i = m} [F(τ_m − x_i′β) − F(τ_{m−1} − x_i′β)]

Aggregation across all of the alternatives then yields the following likelihood:

L = ∏_{m=1}^{M} ∏_{y_i = m} [F(τ_m − x_i′β) − F(τ_{m−1} − x_i′β)]

1 Whether one estimates M − 2 cut points and a constant or M − 1 cut points and no constant is immaterial for the estimated effects of the substantive predictors in x_i.
This can be written more elegantly when we create an indicator for the alternative that was selected:

z_im = 1 if y_i = m
       0 if y_i ≠ m

Now,

L = ∏_i ∏_m [F(τ_m − x_i′β) − F(τ_{m−1} − x_i′β)]^{z_im}     (3.3)

The log-likelihood function is

ℓ = Σ_i Σ_m z_im ln[F(τ_m − x_i′β) − F(τ_{m−1} − x_i′β)]     (3.4)
where we substitute Λ(.) for F (.) in the case of ordered logit analysis and
Φ(.) for F (.) in the case of ordered probit analysis.
Optimization The ordered logit/probit log-likelihood is relatively straightforward and optimization of this function generally poses few problems. The first-order conditions are

∂ℓ/∂β = Σ_i Σ_m z_im { [f(τ_{m−1} − x_i′β) − f(τ_m − x_i′β)] / [F(τ_m − x_i′β) − F(τ_{m−1} − x_i′β)] } x_i = 0

∂ℓ/∂τ_k = Σ_i Σ_m z_im { [δ_{m,k} f(τ_m − x_i′β) − δ_{m−1,k} f(τ_{m−1} − x_i′β)] / [F(τ_m − x_i′β) − F(τ_{m−1} − x_i′β)] } = 0

where δ_{m,k} = 1 if m = k and δ_{m,k} = 0 if m ≠ k is the Kronecker delta (see Maddala 1983). While these first-order conditions will have to be evaluated numerically, for example via the Newton-Raphson algorithm, they are not complex and convergence should be speedy.
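For concreteness, here is a rough sketch of maximizing the ordered logit log-likelihood in (3.4) directly on simulated data; in practice one would rely on a canned routine such as ologit, and all names and values below are invented for the illustration.

    # Sketch: direct maximization of the ordered logit log-likelihood (3.4).
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit                          # logistic CDF

    rng = np.random.default_rng(1)
    n = 3000
    X = rng.normal(size=(n, 2))
    beta_true, taus_true = np.array([0.8, -0.4]), np.array([-0.5, 1.0])
    ystar = X @ beta_true + rng.logistic(size=n)
    y = 1 + (ystar >= taus_true[0]).astype(int) + (ystar >= taus_true[1]).astype(int)

    def negll(params):
        beta, taus = params[:2], np.sort(params[2:])         # enforce tau_1 < tau_2
        cuts = np.concatenate(([-np.inf], taus, [np.inf]))
        xb = X @ beta
        p = expit(cuts[y] - xb) - expit(cuts[y - 1] - xb)    # Pr(y_i = m) at the observed m
        return -np.sum(np.log(np.clip(p, 1e-12, None)))

    res = minimize(negll, x0=np.array([0.0, 0.0, -1.0, 1.0]), method="BFGS")
    print(res.x)                                             # estimates of beta and the cut points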
3.1.4
Interpretation
As with the binary logit and probit models, the ordered logit and probit
coefficients cannot be directly interpreted since the effect of a unit increase
in a predictor depends on that predictor’s initial value. Various interpretation methods exist, including marginal effects, predicted probabilities and
discrete changes in predicted probabilities and, for the ordered logit model,
odds ratios.
Marginal Effects and Elasticities
Marginal Effects without Interactions One useful way of interpreting
the coefficients from an ordered logit or probit model, especially those of
continuous predictors, is to compute the marginal effects of predictors. The
marginal effect of a predictor xk is defined as
γ_mk = ∂π_m/∂x_k = [f(τ_{m−1} − x′β) − f(τ_m − x′β)] β_k = [f_{m−1} − f_m] β_k

where π_m = Pr(y = m), f_{m−1} = f(τ_{m−1} − x′β), and f_m = f(τ_m − x′β).
This expression is the slope of the curve that relates Pr(y = m|x) to xk . It
can be interpreted as the instantaneous change in the probability of choosing
alternative m for a very small change in xk .
In order to evaluate the marginal effects, we need to set the covariates to a particular value. There are two ways of doing this. When computing marginal effects at the mean (MEM), we set all of the covariates equal to their sample means. Thus,

γ_mk^MEM = [f(τ_{m−1} − x̄′β) − f(τ_m − x̄′β)] β_k

When computing average marginal effects (AME), we use the sample units' actual values on the covariates. Thus, a unique marginal effect is computed for each unit. The AME is then the average of those idiosyncratic effects:

γ_mk^AME = (1/n) Σ_{i=1}^{n} [f(τ_{m−1} − x_i′β) − f(τ_m − x_i′β)] β_k

One of the advantages of AME is that we do not rely upon unrealistic values for the covariates, since AME is based on values that actually occurred in the sample.
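The difference between the two quantities is easy to see in code. The sketch below computes MEM and AME for one predictor and one outcome category of an ordered logit model, using made-up coefficients, cut points, and data.

    # Sketch: MEM versus AME for predictor k and category m (ordered logit).
    import numpy as np

    def logistic_pdf(z):
        e = np.exp(-np.abs(z))                               # numerically stable logistic density
        return e / (1.0 + e) ** 2

    def marginal_effect(X, beta, taus, m, k):
        """gamma_mk = [f(tau_{m-1} - x'b) - f(tau_m - x'b)] * b_k, row by row."""
        cuts = np.concatenate(([-np.inf], taus, [np.inf]))
        xb = X @ beta
        return (logistic_pdf(cuts[m - 1] - xb) - logistic_pdf(cuts[m] - xb)) * beta[k]

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 2))
    beta, taus = np.array([0.8, -0.4]), np.array([-0.5, 1.0])

    mem = marginal_effect(X.mean(axis=0, keepdims=True), beta, taus, m=2, k=0)[0]
    ame = marginal_effect(X, beta, taus, m=2, k=0).mean()
    print(round(mem, 4), round(ame, 4))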
Marginal Effects with Interactions One should be careful when computing marginal effects for nonlinear models or models with interaction terms. Imagine, for example, that we are estimating an ordered logit/probit model where the underlying latent variable is modeled as y_i* = β_1 x_i + β_2 x_i² + x_i′β + ε_i. With some calculus, it can then be shown that the marginal effect is given by

∂π_m/∂x = [f_{m−1} − f_m](β_1 + 2β_2 x)
where f_m = f(τ_m − β_1 x − β_2 x² − x′β) and f_{m−1} = f(τ_{m−1} − β_1 x − β_2 x² − x′β).
Next consider an ordered logit/probit model where y_i* = β_1 x_i + β_2 z_i + β_12 x_i z_i + x_i′β. Here, the marginal effect is given by

∂²π_m/∂x∂z = (f_{m−1} − f_m) β_12 + (f_m′ − f_{m−1}′)(β_1 + β_12 z)(β_2 + β_12 x)

where f_m = f(τ_m − β_1 x − β_2 z − β_12 xz − x′β), f_{m−1} = f(τ_{m−1} − β_1 x − β_2 z − β_12 xz − x′β), f_m′ = f′(τ_m − β_1 x − β_2 z − β_12 xz − x′β), and f_{m−1}′ = f′(τ_{m−1} − β_1 x − β_2 z − β_12 xz − x′β). This equation implies that a non-zero interaction effect can exist even when β_12 = 0. Moreover, the sign of the interaction effect depends on more than the sign of β_12 alone.
Elasticities Marginal effects can be transformed into elasticities by multiplying them through by a factor x_k/π_m:

κ_mk = γ_mk (x_k / π_m)

The advantage of elasticities is that they have a straightforward interpretation, namely that a one percent change in the predictor changes the probability of choosing alternative m by κ_mk percent.
Confidence Intervals As always, marginal effects and elasticities have to be estimated using the parameter estimates. This means that they are subject to sampling fluctuation and that it is useful to report them in the form of interval rather than point estimates. This requires that we compute the sampling variability, for which the delta method is again useful.
For estimated marginal effects, the estimated asymptotic variance is given by

V[γ̂_mk] = (∂γ̂_mk/∂β̂_k)² V[β̂_k]

where

∂γ̂_mk/∂β̂_k = (f̂_m′ − f̂_{m−1}′) β̂_k x_k + (f̂_{m−1} − f̂_m)

and f̂_m, f̂_{m−1}, f̂_m′, and f̂_{m−1}′ are evaluated at β̂. The 100(1 − α)% asymptotic confidence interval for the marginal effect of a predictor x_k is given by

γ̂_mk − z_{α/2} s.e.(γ̂_mk) ≤ γ_mk ≤ γ̂_mk + z_{α/2} s.e.(γ̂_mk)

where s.e.(γ̂_mk) is the square root of V[γ̂_mk].
For elasticities, the estimated asymptotic variance is

V[κ̂_mk] = (∂κ̂_mk/∂β̂_k)² V[β̂_k]

where

∂κ̂_mk/∂β̂_k = [(f̂_{m−1} − f̂_m)/(F̂_m − F̂_{m−1})] x_k + [(f̂_m′ − f̂_{m−1}′)/(F̂_m − F̂_{m−1})] β̂_k x_k² − [(f̂_{m−1} − f̂_m)/(F̂_m − F̂_{m−1})]² β̂_k x_k²

and F̂_m − F̂_{m−1} = π̂_m.
Discrete Change in Predicted Probabilities
Predicted Probabilities Marginal effects are not the only way to interpret the results from an ordered logit or probit model. Predicted probabilities provide a useful alternative to marginal effects, one that has the advantage that it can be applied to both continuous and dummy predictors. For alternative m, the estimated predicted probability is equal to

π̂_im = F(τ̂_m − x_i′β̂) − F(τ̂_{m−1} − x_i′β̂)

Typically, we set the predictors to their means to compute the predicted probabilities, although other simulation scenarios are possible. A measure of the asymptotic sampling variability in π̂_im can be calculated via the delta method:

V[π̂_im] = [f(τ̂_{m−1} − x_i′β̂) − f(τ̂_m − x_i′β̂)]² x_i′ V[β̂] x_i

Knowledge of this sampling variance allows one to compute test statistics and confidence intervals on the predicted probabilities.
Discrete Change without Interactions While predicted probabilities give a sense of the probability of a particular choice given a certain profile of covariate values, they do not directly speak to the impact of a change in one of the covariates. For this, one can compute a discrete change measure:

Δπ_m/Δx_k = Pr(y = m | x, x_k + δ) − Pr(y = m | x, x_k)

It is conventional to set all of the remaining covariates to their means, so that the discrete change is computed as Pr(y = m | x̄, x_k + δ) − Pr(y = m | x̄, x_k).
However, other covariate values may be chosen as well, as long as they
remain constant across the values of y and the simulated change in xk .
The change in xk can be set at different values. Common choices include
the following.
1. A unit change relative to x̄k , so that δ = 1 and ∆πm /∆xk = Pr(y =
m|x̄, x̄k + 1) − Pr(y = m|x̄, x̄k ). This is tantamount to setting all
covariates to their mean values initially and then increasing only xk
by one unit.
2. A centered unit change relative to x̄k , so that xk moves from x̄k −
.5 to x̄k + .5. The discrete change in predicted probability is then
∆πm /∆xk = Pr(y = m|x̄, x̄k + .5) − Pr(y = m|x̄, x̄k − .5).
3. A standard deviation change relative to x̄k , so that xk moves from
x̄k − .5sk to x̄k + .5sk , where sk is the sample standard deviation
of xk . The discrete change is then defined as ∆πm /∆xk = Pr(y =
m|x̄, x̄k + .5sk ) − Pr(y = m|x̄, x̄k − .5sk ).
4. It is also possible to compute the maximum possible change. Here
we let xk move from the minimum to the maximum in the sample,
so that δ is equal to the range and the discrete change is defined as
∆πm /∆xk = Pr[y = m|x̄, max(xk )] − Pr[y = m|x̄, min(xk )]. This
manner of computing discrete change is quite common in political
analysis, although it may produce an exaggerated sense of the impact
of predictors.
5. For dummy predictors, computing the maximum possible change is tantamount to computing a change from 0 to 1. Thus, Δπ_m/Δx_k = Pr(y = m | x̄, x_k = 1) − Pr(y = m | x̄, x_k = 0).
6. Percentile change provides another method of operationalizing discrete change. Here, we let xk move from the pth percentile to the
100 − pth percentile, e.g. from the 10th to the 90th percentile. Letting
xk,p and xk,100−p denote the values of xk that correspond to the pth
and 100 − pth percentiles, respectively, discrete change is defined as
∆πm /∆xk = Pr(y = m|x̄, xk,100−p ) − Pr(y = m|x̄, xk,p ).
One should pick the change in xk that best captures a change that is substantively interesting. Once this change has been selected, it should be applied
to all values of y.
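As a small illustration of these options, the sketch below computes a centered unit change and a min-to-max change for one predictor of an ordered logit model, holding the other covariate at its mean; all coefficients and data are fabricated for the example.

    # Sketch: discrete changes in Pr(y = m) for predictor k (ordered logit).
    import numpy as np
    from scipy.special import expit

    def probs(x, beta, taus):
        cuts = np.concatenate(([-np.inf], taus, [np.inf]))
        return np.diff(expit(cuts - x @ beta))               # Pr(y = 1), ..., Pr(y = M)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 2))
    beta, taus = np.array([0.8, -0.4]), np.array([-0.5, 1.0])
    k, m = 0, 2                                              # predictor and outcome category

    def prob_at(xk_value):
        x = X.mean(axis=0).copy()
        x[k] = xk_value
        return probs(x, beta, taus)[m - 1]

    centered_unit = prob_at(X[:, k].mean() + .5) - prob_at(X[:, k].mean() - .5)
    min_to_max = prob_at(X[:, k].max()) - prob_at(X[:, k].min())
    print(round(centered_unit, 4), round(min_to_max, 4))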
One issue that arises with the discrete change measure is that there are
as many discrete changes as there are values of y. Long (1997) proposes to
summarize all those changes in what he refers to as the average absolute discrete change, i.e. the average of the absolute values of the discrete changes in predicted probabilities. The formula is given by

Δ̄ = (1/M) Σ_{m=1}^{M} | Δπ_m/Δx_k |

The reason for taking the absolute values is that Σ_m Δπ_m/Δx_k = 0, i.e. the discrete changes cancel each other out. This is not true of the absolute values of the discrete changes.
Discrete Change with Interactions Special care should be taken with the interpretation of ordered probit/logit models that contain interactions. Consider the latent variable model y* = β_1 x + β_2 z + β_12 xz + x′β, where x and z are both dummy variables. The discrete change in predicted probabilities, holding all else at the mean, is given by

Δπ_m/(Δx Δz) = [F(τ_m − β_1 − β_2 − β_12 − x̄′β) − F(τ_{m−1} − β_1 − β_2 − β_12 − x̄′β)]
             − [F(τ_m − β_2 − x̄′β) − F(τ_{m−1} − β_2 − x̄′β)]
             − [F(τ_m − β_1 − x̄′β) − F(τ_{m−1} − β_1 − x̄′β)]
             + [F(τ_m − x̄′β) − F(τ_{m−1} − x̄′β)]

Similar discrete change measures can be obtained for other models with interactions or nonlinear effects.
Confidence Intervals In order to compute confidence intervals for the discrete change measures one needs an estimate of the standard errors. Let x_a and x_b denote two sets of values of the covariates, e.g. x_a = (x̄, max(x_k)) and x_b = (x̄, min(x_k)). Then the discrete change is given by Δπ_m/Δx = Pr(y = m | x_a) − Pr(y = m | x_b). The asymptotic variance of this change can be computed via the delta method, which yields

V[Δπ̂_m/Δx] = [∂(Δπ̂_m/Δx)/∂β̂]′ V[β̂] [∂(Δπ̂_m/Δx)/∂β̂]

where

∂(Δπ̂_m/Δx)/∂β̂′ = [f(τ̂_{m−1} − x_a′β̂) − f(τ̂_m − x_a′β̂)] x_a − [f(τ̂_{m−1} − x_b′β̂) − f(τ̂_m − x_b′β̂)] x_b
Computing these variances can be computationally intensive but fortunately
software is available that will perform the necessary operations.
Cumulative Odds Ratio
The cumulative odds ratio is a particularly useful tool for interpreting ordered logit models. It is based on the notion of cumulative odds, which are defined as2

Ω_m = Pr(y_i ≤ m | x_i) / Pr(y_i > m | x_i) = exp(τ_m − x_i′β)

Thus, the cumulative odds show the probability of choosing alternative m, or a lower ranked alternative, versus the probability of choosing a higher ranked alternative.
We can evaluate the cumulative odds at value x_k of a predictor and then again at value x_k + δ, while holding all other predictors constant. This produces the cumulative odds ratio:3

Ω_m(x_k + δ, x) / Ω_m(x_k, x) = exp(−β_k δ)

where β_k is the effect associated with the predictor and x contains all of the remaining predictors (which are held constant). We can interpret this as follows: for a δ unit increase in x_k, the odds of an outcome being less than or equal to m change by a factor exp(−β_k δ), holding all other predictors constant. For interpretation purposes, δ is frequently set equal to 1.
It is sometimes useful to transform the factor change measure into a percentage change measure. This can be done via

100 × [Ω_m(x_k + δ, x) − Ω_m(x_k, x)] / Ω_m(x_k, x) = 100 × [exp(−β_k δ) − 1]

This measure shows the percentage by which the cumulative odds are increased or decreased.

2 Proof: In the ordered logit model,

Pr(y_i ≤ m) = exp(τ_m − x_i′β) / [1 + exp(τ_m − x_i′β)]

Further,

Pr(y_i > m) = 1 − Pr(y_i ≤ m) = 1 / [1 + exp(τ_m − x_i′β)]

Thus, the cumulative odds are

Ω_m = Pr(y_i ≤ m) / [1 − Pr(y_i ≤ m)] = exp(τ_m − x_i′β)
An important property of the cumulative odds ratio should be noted. If
one considers the formula, it is clear that the right-hand side does not contain
a subscript m. This means that the factor change in the cumulative odds
is the same, regardless of which category receives the focus. For example, if
βk = .5 and there are 4 categories in y, then a unit increase in xk changes
the cumulative odds by a factor of .61, regardless of whether one contrasts
category 1 with 2, 3, and 4, or 1 and 2 with 3 and 4, or 1, 2, and 3 with
4. This is known as the proportional odds assumption. We shall revisit
this assumption later in this chapter.
3 Proof: At x_k, the cumulative odds can be written as

Ω_m(x_k, x) = exp(τ_m − β_k x_k − x′β) = exp(τ_m) exp(−β_k x_k) exp(−x′β)

At x_k + δ, the cumulative odds can be written as

Ω_m(x_k + δ, x) = exp(τ_m − β_k (x_k + δ) − x′β) = exp(τ_m) exp(−β_k x_k) exp(−β_k δ) exp(−x′β)

Thus,

Ω_m(x_k + δ, x) / Ω_m(x_k, x) = [exp(τ_m) exp(−β_k x_k) exp(−β_k δ) exp(−x′β)] / [exp(τ_m) exp(−β_k x_k) exp(−x′β)] = exp(−β_k δ)
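The factor- and percentage-change calculations above amount to a single line of arithmetic. The sketch below reproduces them for an illustrative coefficient (β_k = .5, δ = 1), the same hypothetical value used in the proportional odds discussion.

    # Sketch: factor and percentage change in the cumulative odds for beta_k = .5.
    import numpy as np

    beta_k, delta = 0.5, 1.0
    factor = np.exp(-beta_k * delta)                         # Omega_m(x_k + delta, x) / Omega_m(x_k, x)
    percent = 100 * (factor - 1)
    print(round(factor, 2), round(percent, 1))               # about .61 and -39.3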
3.1.5 Model Fit

Table 3.1: Correct Prediction in Ordered Regression Models

                            ŷ
y          1        2        ···      M        Total
1          N_11     N_12     ···      N_1M     N_1.
2          N_21     N_22     ···      N_2M     N_2.
···        ···      ···      ···      ···      ···
M          N_M1     N_M2     ···      N_MM     N_M.
Total      N_.1     N_.2     ···      N_.M     N
As always, an important aspect of modeling is assessing the model fit. How
well does the model do in terms of accounting for the response variable?
In the context of ordered logit and probit models, there are two useful approaches toward assessing model fit: correct predictions and pseudo-R2 .
Correct predictions
Based on the predicted probabilities, one can make a prediction concerning
the response variable y. Specifically, one can label as the predicted response
that value of y that has the largest associated predicted probability. Thus,
ŷ_i = m  if π̂_im > π̂_ij ∀ j ≠ m

We can then tabulate the predicted responses against the actual responses, using a table such as Table 3.1. The frequency of correct predictions is given by Σ_{m=1}^{M} N_mm and the percentage of correct classifications (CC) is equal to

CC = [Σ_{m=1}^{M} N_mm / N] × 100
When judging CC it is important to keep in mind how well one would have
predicted in the absence of any covariate information. In this case, the best
prediction of y would be the mode and we can simply compare CC to the
percentage of cases that fall into the modal category.
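A rough sketch of the CC calculation and its naive modal-category benchmark is given below; the predicted probabilities and observed responses are randomly generated stand-ins rather than actual model output.

    # Sketch: percentage of correct classifications (CC) from predicted probabilities.
    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.integers(1, 4, size=200)                         # observed responses, categories 1..3
    p = rng.dirichlet(np.ones(3), size=200)                  # stand-in predicted probabilities

    yhat = p.argmax(axis=1) + 1                              # category with the largest probability
    cc = 100 * np.mean(yhat == y)
    modal = 100 * np.bincount(y).max() / len(y)              # predicting the modal category for everyone
    print(round(cc, 1), round(modal, 1))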
Pseudo-R²
The same pseudo-R² measures as were discussed for the binary logit and probit models can be estimated for the ordered logit and probit models. A simulation study by Veall and Zimmermann (1993) demonstrates that R²_{M&Z} and R²_{A&N} are among the best-performing measures, in the sense that they best mimic the regression R². However, the theoretical range of R²_{A&N} is from 0 to −2ℓ_0/(−2ℓ_0 + n), which falls short of 1. To obtain a measure with a theoretical range from 0 to 1, Veall and Zimmermann propose R²_{V&Z}, which divides R²_{A&N} by its upper bound. Here,

ℓ_0 = Σ_{m=1}^{M} N_m ln(N_m / N)

where N_m is the number of cases who selected response m.
3.1.6
Example
Estimation Results
As an example of ordered logit and probit we consider the determinants
of retrospective economic evaluations in the U.S. in 2004. Since 1980,
the American National Election Studies have asked respondents to judge
whether, during the preceding year, the economy had worsened (1), improved (3), or stayed the same (2). We are treating these responses as
an ordinal scale that depends on a person’s employment status (1 = unemployed; 0 = other), education, income, age, race (1 = black; 0 = white),
gender (1 = female; 0 = male), and partisanship (as captured by dummies for
Republicans and Democrats; independents are the baseline category). Table 3.2 shows the ordered logit and probit results from this analysis. These
results were obtained by running the following Stata commands:
    ologit choicevar [varlist] [if] [, level(#)]
    oprobit choicevar [varlist] [if] [, level(#)]
The results show a significant effect of partisanship such that Republicans were more optimistic about the economy than independents and Democrats
more pessimistic. More affluent individuals tended to give a more positive
assessment of the economy, whereas women and blacks tended to give a more
pessimistic assessment. Education has a marginally significant positive effect. The remaining predictors are not statistically significant.
Table 3.2: Ordinal Regression Models of Economic Perceptions in the U.S. in 2004

                     Ordered Logit            Ordered Probit
Predictor            Estimate     S.E.        Estimate     S.E.
Democrat             −.580*       .250        −.345*       .148
Republican           1.162**      .255        .695**       .151
Age                  −.001        .004        −.000        .002
Education            .154+        .085        .087+        .051
Income               .219**       .061        .128**       .036
Black                −.416*       .199        −.264*       .117
Female               −.369**      .134        −.225**      .080
Unemployed           −.393        .391        −.221        .218
1st Cut Point        .670+        .391        .387+        .234
2nd Cut Point        2.345**      .400        1.378**      .237
ℓ                    −833.799                 −833.111
LR χ²                241.160                  242.540
p                    .000                     .000
R²_{M&Z}             .254                     .286
R²_{V&Z}             .311                     .312

Notes: n = 895. ** p < .01, * p < .05, + p < .10 (two-tailed).
3.1.7
Model Fit
The model does not fit the data particularly well. A first indication of
this can be obtained via pseudo-R2 . Neither McKelvey and Zavoina’s measure (obtained in Stata via fitstat) nor Veall and Zimmermann’s measure
(computed by hand) is particularly high.
A second indication of the mediocre fit can be obtained by considering
the percentage of correct classifications.4 Only about 55% of the cases are
correctly classified in the ordered logit model and about 56% are correctly
classified in the ordered probit model. These are disappointing numbers,
considering that we would have predicted almost 45% of the cases correctly
had we just considered the modal category (worse) of the retrospective evaluations.
3.1.8
Interpretation
Marginal Effects and Elasticities Stata’s mfx command yields marginal
effects at the mean.5 For ordered logit and probit models, the syntax is given
by

    mfx, predict(p outcome(#)) [eyex]
Here outcome(#) is the outcome for which one seeks marginal effects. To
obtain all possible marginal effects, one needs to issue as many commands
as there are outcomes, varying the outcome number in each run. Note that
mfx computes the discrete change in the predicted probability for dummy
4 The estat clas command does not work with ordered logit and probit. To compute the predicted outcomes I used the following syntax:

    predict p1 p2 p3 if e(sample)
    gen predout=.
    replace predout=1 if p1>p2 & p1>p3
    replace predout=2 if p2>p1 & p2>p3
    replace predout=3 if p3>p1 & p3>p2

Here, p1 is the predicted probability of saying that the economy has gotten worse, p2 is the predicted probability of saying that the economy has stayed the same, and p3 is the predicted probability of saying that the economy has gotten better. Tabulating these predictions against the actual economic perceptions and counting the percentage of cases where the prediction corresponds to the actual observation then yields CC.
5 The margeff8 command can be used to compute average marginal effects.
Table 3.3: Marginal Effect of Income on Economic Perceptions

Category     Effect     S.E.     z         p       Elasticity
Worse        −.054      .015     −3.610    .000    −.366
Same         .019       .006     3.250     .001    .147
Better       .035       .010     3.600     .000    .513

Notes: Marginal effects at the mean were computed using Stata's mfx command. Marginal effects are based on ordered logit results.
predictors such as race. It will only compute marginal effects proper for
continuous predictors. Adding the option eyex causes Stata to compute
elasticities instead of marginal effects.
Table 3.3 displays the marginal effects at the mean for income. We observe statistically significant effects for all of the response categories of the
economic perception variable. The marginal effect is negative for the category “worse”, which means that a very small increase in income reduces the
likelihood of this response. The marginal effects for “same” and “better”
are positive, implying an increased probability of those responses for a very
small change in income. When transformed into elasticities, the effects are
substantively sizable. A one percent increase in income is associated with a .37% decline in the probability of responding “worse”, a .15%
increase in the probability of responding “same”, and a .51% increase in the
probability of responding “better.”
Predicted Probabilities and Discrete Change Predicted probabilities
can be computed in Stata in a variety of ways. One option is the prtab
command that we discussed in the previous chapter. For example,
prtab black
yields three tables that distinguish between the predicted probabilities for
blacks and whites, one for each outcome. The results from this command
are summarized in Table 3.4. This table clearly shows the tendency toward
greater economic pessimism among black respondents.
Another way of obtaining predicted probabilities is the prgen command,
which is particularly useful if the goal is to graph the predicted probabilities.
The following command sequence produces the graph shown in Figure 3.2:
prgen income, from(1) to(5) gen(inc) rest(mean) ncases(5)
Table 3.4: Race and Economic Perceptions

                   Race
Response      White     Black
Worse         .411      .514
Same          .377      .336
Better        .212      .151

Notes: Predicted probabilities were computed using the prtab command while holding all other predictors at their mean. Predicted probabilities are based on ordered logit estimation results.
label variable incp1 "Worse"
label variable incp2 "Same"
label variable incp3 "Better"
graph twoway connected incp1 incp2 incp3 incx,
xtitle("Income") ytitle("Prob")
This graph clearly shows the declining probability of a “worse” response as
income increases and the increasing probabilities of “same” and “better”
responses.
A third approach toward computing predicted probabilities is the prvalue
command. For ordered logit and probit models, the syntax is essentially the
same as that for binary logit and probit models. For example, the following command sequence allows us to compare the predicted probabilities for
Democrats and Republicans:
prvalue, x(dem=1 rep=0) rest(mean) delta save lev(99)
prvalue, x(dem=0 rep=1) rest(mean) delta dif lev(99)
The results are shown in Table 3.5, which reveals stark differences in the
economic perceptions of Democrats and Republicans in 2004. Specifically,
whereas the most likely response for Democrats was that the economy had
worsened, for Republicans the most likely response was that it had improved.
[Figure 3.2: Economic Retrospection as a Function of Income. Predicted probabilities of the “worse,” “same,” and “better” responses across income levels 1 through 5.]
Table 3.5: Partisanship and Economic Perceptions

Response     Democrats     Republicans     Δ
Worse        .615          .219            −.396**
Same         .280          .380            .100**
Better       .105          .401            .296**

Notes: ** p < .01 (two-tailed). Predicted probabilities were computed using the prvalue command while holding all other predictors at their mean. Predicted probabilities are based on ordered logit estimation results.
Table 3.6: Discrete Change in Predicted Probabilities

Predictor      Worse      Same       Better     Δ̄
Democrat       .141       −.048      −.093      .094
Republican     −.273      .075       .198       .182
Black          .103       −.042      −.061      .069
Income         −.211      .069       .141       .140

Notes: Discrete change measures were computed using the prchange command while holding all other predictors at their mean. The table entries are based on ordered logit estimation results.
Finally, discrete change measures can be obtained via the prchange
command. For dummy predictors, this command shows the maximum discrete change in predicted probabilities. For continuous covariates, the command shows maximum change, centered unit change, and standard deviation
change. In addition to reporting the changes in the predicted probabilities
for each of the response categories, the prchange command also shows the
average absolute change. Table 3.6 shows the maximum discrete change
measures for partisanship, race, and income, while holding all other predictors constant at the mean. The table reveals powerful effects especially
for Republican and income, revealing the by now familiar pattern that Republicans and wealthy respondents are more likely to see the economy in a
positive light than everyone else.
Cumulative Odds Ratio In Stata, the cumulative odds ratio can be
computed, inter alia, via the listcoef command:
    listcoef [, factor] [percent] [reverse]
Two options may be specified: factor reports the factor change in the
cumulative odds, while percent computes the percentage change. Note that
the listcoef command computes the odds ratio as the ratio of Pr(y >
m) over Pr(y ≤ m) instead of Pr(y ≤ m) over Pr(y > m). This means
that the cumulative odds is computed as exp(βk δ) instead of exp(−βk δ). Of
course, one can always move from one definition of the cumulative odds to
the next by inverting the result. This can also be done by requesting the
reverse option with the listcoef command.
Table 3.7: Cumulative Odds Ratios

Predictor      Factor     Percent
Democrat       .560       −44.000
Republican     3.198      219.800
Black          .660       −34.000
Income         1.245      24.400

Notes: Cumulative odds ratios reflect the ratio of Pr(y > m) over Pr(y ≤ m) and were computed using the listcoef command.
Table 3.7 shows the factor and percentage changes in the cumulative odds. For example, for Democrats the odds of choosing a more optimistic response category (e.g. “better”) over a less optimistic response category (e.g. “same” or “worse”) are 44% lower than for independents, the baseline category. For Republicans, the pattern is reversed: their odds of choosing a more optimistic response (e.g. “same” or “better”) over a less optimistic response (e.g. “worse”) are almost 220% greater.
3.1.9
Heteroskedastic Ordered Logit and Probit
Derivation, Estimation, and Interpretation
The ordered logit and probit models have a built-in homoskedasticity assumption that is necessary to fix the scale of the parameters. But what
happens when the homoskedasticity assumption is false? In this case, we
incorrectly fix the scale of y ∗ and, by extension, β to a common metric.
Instead, we should be fixing the scales of y ∗ and β differently for different
sub-groups. By not doing so, the parameter estimates will be inconsistent,
as will be the estimated standard errors.
Derivation It is possible to do away with the homoskedasticity assumption by specifying a so-called heteroskedastic ordered logit or probit model
(see Alvarez and Brehm 1995, 1998, 2002). Here, we show the derivation
of the heteroskedastic ordered probit model, although the derivation of the
heteroskedastic ordered logit model is analogous. Consider the following
variance model for the disturbances:

σ_i² = [exp(z_i′γ)]²
As we saw in Chapter 2, this is equivalent to the variance model that Harvey (1976) proposed. The vector z_i does not contain a constant. Hence, σ_i² = 1 when γ = 0.
Given the variance model, we can create weighted disturbances

δ_i = ε_i / σ_i = ε_i / exp(z_i′γ)

that have unit variances. If we divide ε_i by exp(z_i′γ), then the other terms in the choice model should be divided in a similar way. Thus,

Pr(y_i = m) = Φ[(τ_m − x_i′β) / exp(z_i′γ)] − Φ[(τ_{m−1} − x_i′β) / exp(z_i′γ)]     (3.5)
This is the core of the heteroskedastic ordered probit model.
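As a minimal sketch of equation (3.5), the function below rescales the ordered probit index and cut points case by case by exp(z_i′γ); the parameter values are invented for illustration.

    # Sketch: heteroskedastic ordered probit choice probabilities, equation (3.5).
    import numpy as np
    from scipy.stats import norm

    def het_oprobit_probs(x, z, beta, gamma, taus):
        sigma = np.exp(z @ gamma)                            # disturbance standard deviation
        cuts = np.concatenate(([-np.inf], taus, [np.inf]))
        return np.diff(norm.cdf((cuts - x @ beta) / sigma))  # Pr(y = 1), ..., Pr(y = M)

    x, z = np.array([1.2, -0.3]), np.array([0.5])
    beta, gamma, taus = np.array([0.6, -0.2]), np.array([-0.4]), np.array([-0.5, 1.0])
    print(het_oprobit_probs(x, z, beta, gamma, taus))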
Estimation The log-likelihood function for the model is given by

ℓ = Σ_i Σ_m z_im ln{ Φ[(τ_m − x_i′β) / exp(z_i′γ)] − Φ[(τ_{m−1} − x_i′β) / exp(z_i′γ)] }

(see Alvarez and Brehm 1998). This is a complex log-likelihood function to optimize and convergence problems are not uncommon.
Interpretation Interpretation of the heteroskedastic ordered probit model is also complex. Let w_k denote a covariate that is included in x, z, or in both. Then the marginal effect is defined as

∂π_m/∂w_k = φ[(τ_m − x′β)/exp(z′γ)] [(x′β γ_k − β_k − τ_m γ_k)/exp(z′γ)]
          − φ[(τ_{m−1} − x′β)/exp(z′γ)] [(x′β γ_k − β_k − τ_{m−1} γ_k)/exp(z′γ)]

This simplifies only when w_k occurs in just one set of the predictors. For example, if w_k is included in x but not z, then all terms involving γ_k will drop out. And if w_k appears in z but not in x, then the terms involving β_k will drop out. However, if w_k occurs in both x and z, then the effect can deviate dramatically from what is suggested by β_k or γ_k.
Example
Stata currently has no built-in capability to estimate the heteroskedastic ordinal regression models. However, Richard Williams has written a program,
oglm, that, as one of its options, will estimate a heteroskedastic ordered
logit or probit model. The syntax is as follows:
    oglm depvar [indepvars], hetero(varlist) link(logit/probit)
Specifying link(logit) will cause Stata to estimate a heteroskedastic ordered logit model, whereas link(probit) results in estimation of a heteroskedastic ordered probit model.
Table 3.8 shows the results of a heteroskedastic ordered probit model
for retrospective economic perceptions. The variance model includes political knowledge as its single predictor. This specification reflects the common hypothesis that greater knowledge produces more predictability of survey responses (Alvarez and Brehm 1995, 1998, 2002). The results reveal a
marginally significant effect for knowledge in the expected negative direction.6 Note that the inclusion of knowledge in the variance model causes
the significance levels of the other variables to decline, although this may
also be a consequence of the reduced sample size.
We can test whether a heteroskedastic specification is necessary by performing a Wald or LR test on the null hypothesis H0 : γ = 0. In our
example, this test is redundant because there is only one predictor in the
variance model and we can simply ascertain whether it is significant. Since
knowledge is only significant at the .10 level, there is serious doubt about
the necessity for a heteroskedastic model, at least in its current specification.
Some Cautionary Remarks
When performing a heteroskedastic ordinal regression analysis it should be
kept in mind that one is tinkering with a fundamental identifying assumption. This can be quite risky. Monte Carlo simulations by Keele and Park
(2005) demonstrate that serious biases can creep into the estimates of β
and, to a lesser extent, γ, especially in the presence of model misspecification. The standard errors may also be off by a considerable amount if
the variance model is misspecified. However, these problems appear to be
6 Stata estimates ln(σ_i) in lieu of σ_i. This is desirable because ln(σ_i) is not bounded, whereas σ_i is bounded to be non-negative. Note that we can always move to σ_i by exponentiating the coefficients.
Table 3.8: Heteroskedastic Ordered Probit Model of Economic Perceptions

Predictor              Estimate      S.E.
Model for π
  Democrat             −.218         .167
  Republican           −.838**       .170
  Age                  .002          .003
  Education            .084          .053
  Income               .152**        .038
  Black                −.295*        .128
  Female               −.203*        .084
  Unemployed           −.202         .251
  1st Cut Point        .688**        .258
  2nd Cut Point        1.648**       .262
Variance Model
  Knowledge            −.135+        .082
ℓ                      −732.754
Wald test/p            3.020/.078

Notes: n = 348. ** p < .01, * p < .05, + p < .10 (two-tailed). Estimates obtained using the oglm command. The Wald test assesses H_0: γ = 0.
[Figure 3.3: Parallel Regression Assumption. The cumulative probability curves Pr(y ≤ 1), Pr(y ≤ 2), and Pr(y ≤ 3), plotted against x, are parallel.]
less severe than they are for heteroskedastic binary logit and probit models,
a result that may well have to do with the increase in information that is
available from ordinal scales.
3.1.10
The Parallel Regression Assumption
Meaning
In addition to homoskedastic disturbances, ordered logit and probit assume
that the effect of predictors is constant across the categories of the response
variable. This is known as the parallel regression assumption or parallel lines assumption. The assumption is particularly obvious when we
consider the cumulative choice probabilities:

Pr(y_i ≤ m) = F(τ_m − x_i′β)

Key here is the absence of an m-subscript on β, which implies that the cumulative choice probabilities depend only on the change in cut points, not on a change in the effect of the predictors. When we graph the cumulative choice curves we obtain parallel lines, as is shown in Figure 3.3. This explains why the assumption is referred to as the parallel regression assumption.
In the ordered logit model, an alternative way of thinking about the parallel regression assumption is in terms of proportional odds. Consider the following cumulative log-odds ratio:

ln{ [Pr(y_i ≤ m | x_i1)/Pr(y_i > m | x_i1)] / [Pr(y_i ≤ m | x_i2)/Pr(y_i > m | x_i2)] } = ln[ exp(τ_m − x_i1′β) / exp(τ_m − x_i2′β) ] = (x_i2 − x_i1)′β

where x_i1 and x_i2 are two different sets of values of the covariates. We see that the log of the cumulative odds ratio is proportional to the distance between the values of the predictors. Critically, it does not depend on changing effects of those predictors.
The parallel regression assumption is a rather stringent assumption that
is frequently violated (Long and Freese 2003). When it is, the parameter
estimates that come out of an ordered logit or probit analysis may well be
inconsistent. It is easy to see why. Imagine that a predictor has a positive
effect for some categories of y and a negative effect for other categories.
When we constrain the effect to be constant across all of the categories of y,
as we have to do under the parallel regression assumption, then the net effect
might well come out to zero. This would give the false impression that the
predictor is inconsequential. Given the potential for inconsistent parameter
estimates, it is essential that one checks the validity of the parallel regression
assumption. Fortunately, there are several procedures that will allow you to
do this.
Testing the Parallel Regression Assumption
An Approximate Likelihood Ratio Test Wolfe and Gould (1998) have
proposed an approximate LR test of the parallel regression assumption. The
test is based on the cumulative probability curves described in Figure 3.3. If
parallelism holds, then we can think of the ordinal logit and probit models
in terms of M − 1 binary regressions7 of the variety
Pr(yi ≤ m) = F (τm − x!i β)
Central here is the constraint that the effect of the predictors is the same
across the M −1 regressions. We call this model 1, which requires estimation
only of the p elements contained in β and τ . If parallelism does not hold,
7
There are only M − 1 regressions because the cumulative probability of the last response category is fixed at 1.
88
then we should be estimating M − 1 equations of the form
Pr(yi ≤ m) = F (τm − x!i β m )
This is model 2 and it requires estimation of p(M − 1) parameters.
An exact LR test would recognize that model 1 is nested into model
2—model 1 is a special case of model 2 that comes about by constraining
the elements of β m to be identical across the categories of y. The exact LR
test statistic is then defined by
-2\left[\ell_1 - \ell_2\right] \sim \chi^2_{p(M-2)} \qquad (3.6)
where the degrees of freedom are given by the difference in the numbers of
estimated parameters between the two models.
To compute the exact LR test statistic, we need to take into consideration the correlations that exist between the binary responses. For example,
Pr(y ≤ 2) and Pr(y ≤ 1) contain a common response option (1) and are
therefore not independent. The correlation between successive responses is
given by
\rho_{y_{ij},y_{ik}} = \sqrt{\frac{\pi_{ij}(1-\pi_{ik})}{\pi_{ik}(1-\pi_{ij})}}
for j < k. It is difficult to estimate the second model while considering these
correlations. Thus, in practice, we treat the M − 1 regressions as independent, as would be the case if one were to run a multinomial logit analysis
(see Chapter 5). Let \ell_2^{ib} denote the log-likelihood of model 2 conceived of as independent binaries. In general, \ell_2^{ib} < \ell_2, but Wolfe and Gould (1998) maintain that the difference is small. Thus, (3.6) can be approximated via -2\left[\ell_1 - \ell_2^{ib}\right].
Wolfe and Gould (1998) provide the Stata program omodel to perform
the approximate LR test. The syntax is as follows:
omodel logit choicevar varlist
omodel probit choicevar varlist
The omodel logit command performs an ordered logit analysis as well as
a test of the parallel regression assumption; omodel probit also tests the
parallel regression assumption and additionally estimates an ordered probit
model.
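By way of illustration, for the economic perceptions model the test might be invoked as follows; the variable names are placeholders and will depend on how the data set codes the response and the predictors:

omodel logit econperc democrat republican age education income black female unemployed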
The Brant Test The LR test is a so-called omnibus test, which means
that it informs us if the parallel regression assumption is violated but does
not tell us what the offending predictor(s) is/are. For this purpose, a Wald
test procedure developed by Brant (1990) is more useful, although it can
only be used with ordered logit.
As a Wald test procedure, the Brant test is based on the unrestricted
model, i.e. M − 1 binary regressions. The test starts by estimating β m and
its variance. Let
z_m = \begin{cases} 1 & y > m \\ 0 & y \le m \end{cases}
Note that these binaries are reversed from what we used in the LR test,
which focused on Pr(yi ≤ m) rather than Pr(yi > m). Then we estimate
M − 1 separate binary regressions yielding β̂ m and V[β̂ m ]. We can also
compute predicted probabilities:
\hat{\pi}_m = F(-\hat{\tau}_m + x'\hat{\beta}_m)
Next, we obtain the covariance between β̂ m and β̂ l . This covariance is
given by
\hat{C}[\hat{\beta}_m, \hat{\beta}_l] = (X_+' W_{mm} X_+)^{-1} X_+' W_{ml} X_+ (X_+' W_{ll} X_+)^{-1}
where X_+ is an n × (p + 1) matrix that is formed by augmenting the matrix of predictors, X, with a leading column of 1s. Further, W_{ml} is an n × n diagonal matrix with entries \hat{\pi}_m - \hat{\pi}_m \hat{\pi}_l for m ≤ l.
We now combine the estimates. Let \hat{\beta}^* = (\hat{\beta}_1'\;\hat{\beta}_2'\;\cdots\;\hat{\beta}_{M-1}')'. It can be shown that \hat{\beta}^* is asymptotically multivariate normal with E[\hat{\beta}_m] \approx \beta_m and
\hat{V}[\hat{\beta}^*] =
\begin{pmatrix}
\hat{V}[\hat{\beta}_1] & \hat{C}[\hat{\beta}_1,\hat{\beta}_2] & \cdots & \hat{C}[\hat{\beta}_1,\hat{\beta}_{M-1}] \\
\hat{C}[\hat{\beta}_2,\hat{\beta}_1] & \hat{V}[\hat{\beta}_2] & \cdots & \hat{C}[\hat{\beta}_2,\hat{\beta}_{M-1}] \\
\vdots & \vdots & \ddots & \vdots \\
\hat{C}[\hat{\beta}_{M-1},\hat{\beta}_1] & \hat{C}[\hat{\beta}_{M-1},\hat{\beta}_2] & \cdots & \hat{V}[\hat{\beta}_{M-1}]
\end{pmatrix}
The penultimate step is to create the pairwise contrasts with β̂ 1 . The
parallel regression assumption implies
Dβ ∗ = 0
where
D =
\begin{pmatrix}
I & -I & 0 & \cdots & 0 \\
I & 0 & -I & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
I & 0 & 0 & \cdots & -I
\end{pmatrix}
is a p(M − 2) × p(M − 1) contrast matrix. This matrix always selects β 1 and
subtracts β 2 · · · β M −1 , setting the results equal to zero, as is implied by the
parallel regression assumption. Thus, the pairwise contrasts are the essence
of the Brant test.8
The Brant test statistic is now given by
W = (D\hat{\beta}^*)'\,[D\,\hat{V}(\hat{\beta}^*)\,D']^{-1}\,(D\hat{\beta}^*) \;\overset{\text{asy}}{\sim}\; \chi^2_{p(M-2)} \qquad (3.7)
By selecting the proper elements of β m and the proper elements of the
variance-covariance matrix we can also identify covariates for which the
parallel regression assumption fails. The procedure is implemented in Stata
through the brant command, whose syntax is
brant [, detail]
(where the detail option causes Stata to show the estimates from the binary regressions).
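In practice one would first fit the ordered logit model and then issue the test. With placeholder variable names (again, the actual names depend on the data set), the sequence might look like this:

ologit econperc democrat republican age education income black female unemployed
brant, detail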
Example
Does the parallel regression assumption hold for the ordered logit analysis of
economic perceptions that is reported in Table 3.2? Performing the approximate likelihood ratio test yields LR = 12.70. When referred to a chi-squared
distribution with 8 degrees of freedom, we obtain p = .123. Thus, we fail to
reject the null hypothesis that the parallel regression assumption holds.
The Brant test leads to the same overall conclusion: W = 11.87, p =
.157. However, when we peruse the results for specific predictors, two problem cases emerge: the test statistics for gender and race are both significant
at the .05 level, as is shown in Table 3.9. This would be reason to consider
estimating one of the alternative models that we shall discuss next.
8
It is arbitrary that we form contrasts with β 1 . We could have formed contrasts with
any of the coefficient vectors.
Table 3.9: Brant Test of Economic Perceptions Model

Predictor        W        p
Democrat        .40     .528
Republican      .12     .728
Age            1.38     .239
Education      1.74     .188
Income          .66     .416
Black          4.10     .043
Female         3.98     .046
Unemployed      .02     .897
Overall       11.87     .157

Notes: Results obtained via the brant command.
3.2 Alternative Models
When the parallel regression assumption fails, then it is necessary to consider estimating a different model. Here we consider two alternatives that
are available in Stata: the generalized ordered logit model and the stereotype (ordered) regression model. Neither model requires a universal parallel
regression assumption, although the manner in which this assumption is
bypassed is quite different across the two models.
3.2.1 The Generalized Ordered Logit Model
Derivation, Estimation, and Interpretation
Derivation The generalized ordered logit model was first proposed by
McCullagh and Nelder (1989) and subsequently elaborated by Clogg and
Shihadeh (1994), Fahrmeir and Tutz (1994), Fu (1998), Peterson and Harrell (1990), and Williams (2006). The basic model considers the probability
of a response greater than m and is given by
\Pr(y_i > m) = \frac{\exp(\alpha_m + x_i'\beta_m)}{1 + \exp(\alpha_m + x_i'\beta_m)} \qquad (3.8)
for m = 1, 2, · · · M − 1. The m-subscript on β implies that the effects of the
predictors can vary across the categories of y. The model in (3.8) contains
several sub-models:
1. If M = 2, then the model reduces to a binary logit model.
2. If M > 2, then the model is similar to the Brant (1990) specification.
3. If β m = β for all m and for all predictors, then the model reduces to the ordered
logit model. (Note that the ordered logit cut points are equal to −αm .)
If β m = β for all m but only for a subset of the predictors, then we obtain the
partial proportional odds model.
The partial proportional odds model is often a useful compromise between
the parsimony of the regular ordered logit model and the potentially greater
veracity of the Brant (1990) specification. It allows researchers to retain the
parallel regression assumption where this seems warranted and to drop it
where it seems inconsistent with the data.
Estimation The log-likelihood function for the generalized ordered logit
model is given by
\ell = \sum_i \sum_m z_{im} \ln\left[ F(\alpha_{m-1} + x_i'\beta_{m-1}) - F(\alpha_m + x_i'\beta_m) \right]
where F(·) is the standard logistic distribution, α_0 = ∞, and α_M = −∞. This is a fairly straightforward log-likelihood function and convergence problems are rare.
Interpretation Interpretation is done most straightforwardly in terms of
predicted probabilities. For yi = 1, the predicted probability is given by
\Pr(y_i = 1) = 1 - \frac{\exp(\alpha_1 + x_i'\beta_1)}{1 + \exp(\alpha_1 + x_i'\beta_1)}
For y_i = 2, \cdots, M − 1, the predicted probabilities are given by
\Pr(y_i = m) = \frac{\exp(\alpha_{m-1} + x_i'\beta_{m-1})}{1 + \exp(\alpha_{m-1} + x_i'\beta_{m-1})} - \frac{\exp(\alpha_m + x_i'\beta_m)}{1 + \exp(\alpha_m + x_i'\beta_m)}
Finally, for y_i = M the predicted probability is
\Pr(y_i = M) = \frac{\exp(\alpha_{M-1} + x_i'\beta_{M-1})}{1 + \exp(\alpha_{M-1} + x_i'\beta_{M-1})}
These predicted probabilities can be parlayed into discrete change measures
in the usual manner.
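To fix ideas, the following fragment evaluates these formulas for a hypothetical three-category outcome; the intercepts and linear predictors are invented:

* illustrative only: invented values for M = 3 categories
scalar a1  = 0.4       // alpha_1
scalar a2  = -1.0      // alpha_2
scalar xb1 = 0.6       // x'beta_1
scalar xb2 = 0.3       // x'beta_2
scalar p_gt1 = exp(a1 + xb1)/(1 + exp(a1 + xb1))
scalar p_gt2 = exp(a2 + xb2)/(1 + exp(a2 + xb2))
display "Pr(y=1) = " 1 - p_gt1
display "Pr(y=2) = " p_gt1 - p_gt2
display "Pr(y=3) = " p_gt2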
Alternatively, it is possible to compute a marginal effect. This can be
done in the following manner:
∂ Pr(y > m)
= f (αm + x!β m )βkm
∂xk
γmk =
As usual, this effect can be computed either at the mean or as an average.
An elasticity can be obtained by multiplying γmk by xk / Pr(y = m).
Example
Fu (1998) and Williams (2006) have developed Stata add-on programs that
will estimate the generalized ordered logit model. Of these programs, the
gologit2 routine of Williams is the most flexible as it allows users to specify
partial proportional odds models. The syntax for this program is as follows:
gologit2 depvar [indepvars] [, pl | pl(varlist) | npl | npl(varlist) | autofit]
Here the pl option causes the program to assume that the parallel regression
assumption holds for all predictors (“pl” stands for parallel lines). The option pl(varlist ) applies the parallel regression assumption only to those
predictors included in varlist. The option npl lifts the parallel regression
assumption for all of the predictors, whereas npl(varlist ) lifts it for the
predictors included in varlist (“npl” stands for non-parallel lines). Finally,
autofit causes the program to find a model that relaxes the parallel regression assumption where this is necessary in terms of model fit (for details see
Williams 2006).
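For the analysis reported in Table 3.10 the call would take roughly the following form (the variable names are placeholders); the npl() option frees the coefficients for race and gender while retaining the parallel regression assumption for the other predictors:

gologit2 econperc democrat republican age education income black female unemployed, npl(black female)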
Table 3.10 shows the results from a generalized ordered logit analysis
where we have relaxed the parallel regression assumption for gender and
race, as suggested by the Brant test. As the table shows, gender and race
do little to discriminate “same” and “better” responses from “worse” responses. However, they have sizable effects on discriminating “better” responses from “same” and “worse”. Thus, gender and race matter mostly
in that women and blacks appear to be less optimistic about the development of the economy. This is a different story than the regular ordered logit
results tell.
Table 3.10: Generalized Ordered Logit Model of Economic Perceptions

                        > Worse                 > Same
Predictor         Estimate    S.E.       Estimate    S.E.
Democrat           −.581∗     .249        −.581∗     .249
Republican        1.135∗∗     .254       1.135∗∗     .254
Age                −.001      .004        −.001      .004
Education           .153+     .085         .153+     .085
Income              .219∗∗    .060         .219∗∗    .060
Black              −.328      .203      −1.031∗∗     .387
Female             −.227      .150       −.582∗∗     .172
Unemployed         −.377      .390        −.377      .390
Constant           −.756+     .392      −2.186∗∗     .402

Notes: n = 895. ∗∗ p < .01, ∗ p < .05, + p < .10 (two-tailed). Models relax the parallel regression assumption for race and gender.
3.2.2 Stereotype Regression
Derivation and Estimation
Derivation The stereotype regression model provides an alternative framework for avoiding the parallel regression assumption. This model was first
proposed by Anderson (1984) and subsequently developed by DiPrete (1990)
and Lunt (2001). The model estimates a common coefficient vector β but
in addition estimates a set of parameters that act as multipliers on the coefficients. These parameters are specific to the values of y. The model is
given by the following equations:
\Pr(y_i = m) = \frac{\exp(\theta_m - \phi_m x_i'\beta)}{1 + \sum_{j=1}^{M-1}\exp(\theta_j - \phi_j x_i'\beta)}
for y_i < M and
\Pr(y_i = M) = 1 - \frac{\sum_{j=1}^{M-1}\exp(\theta_j - \phi_j x_i'\beta)}{1 + \sum_{j=1}^{M-1}\exp(\theta_j - \phi_j x_i'\beta)} = \frac{1}{1 + \sum_{j=1}^{M-1}\exp(\theta_j - \phi_j x_i'\beta)}
This can be written more elegantly in log-odds form:
\ln\!\left[\frac{\Pr(y_i = m)}{\Pr(y_i = l)}\right] = (\theta_m - \theta_l) - (\phi_m - \phi_l)\,x_i'\beta \qquad (3.9)
This formulation clearly shows that the effect of a predictor is allowed to
vary by a scalar factor, φm − φl , that depends on the pair of alternatives
under consideration. (The term θm − θl is a function of the different cut
points for different values of y.)
The model in (3.9) is identified only after we impose some constraints.
Specifically, we impose three identifying constraints: (1) θM = 0, (2) φ1 = 1,
and (3) φM = 0. We also impose the ordinality constraint φ1 = 1 > φ2 >
· · · > φM −1 > φM = 0; this constraint ensures that the ordinal nature of y
is preserved in the estimation.
Note that the stereotype regression model offers a rather different solution for the parallel regression assumption than the generalized ordered logit
model. In the stereotype regression model the coefficients on all predictors
are scaled by the same factor φm − φl . This is not the case in the generalized
ordered logit model. Using a constant scaling factor may be more parsimonious, although this depends on the nature of the generalized ordered logit
model that would be estimated instead.
Estimation With the constraints in place, estimation is relatively straightforward. The log-likelihood function is given by
\ell = \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im} \ln \pi_{im}
where zim = 1 if alternative m was chosen and πim = Pr(yi = m). In my
experience, convergence problems are rare, although estimation tends to be
slower than for the ordered logit model. Interpretation is best done in terms
of predicted probabilities or odds, using the formulas presented above.
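As a small numerical sketch of the probability formulas (the values of θ, φ, and x′β below are invented, for a three-category outcome):

* illustrative only: invented parameter values, M = 3
scalar th1  = 2.0
scalar th2  = 1.3
scalar phi1 = 1
scalar phi2 = 0.55
scalar xb   = 0.9
scalar den  = 1 + exp(th1 - phi1*xb) + exp(th2 - phi2*xb)
display "Pr(y=1) = " exp(th1 - phi1*xb)/den
display "Pr(y=2) = " exp(th2 - phi2*xb)/den
display "Pr(y=3) = " 1/den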
Example
Stata will estimate the stereotype regression model using the mclest command (Hendrickx 2000) or the slogit command. Here we focus on the
latter command, whose syntax is
slogit depvar varlist
Table 3.11: Stereotype Regression Model of Economic Perceptions

Predictor       Estimate    S.E.
Democrat         −.874∗     .369
Republican      1.544∗∗     .366
Age              −.001      .006
Education         .228+     .124
Income            .305∗∗    .089
Black            −.837∗     .337
Female           −.570∗∗    .198
Unemployed       −.663      .594
φ1               1.000       —
φ2                .550∗∗    .056
φ3                .000       —
θ1               2.068∗∗    .574
θ2               1.275∗∗    .339
θ3                .000       —

Notes: n = 895. ∗∗ p < .01, ∗ p < .05, + p < .10 (two-tailed).
Note that Stata’s implementation of the stereotype regression model does
not impose an ordinality constraint; if necessary, the user will have to specify
this.9
Table 3.11 shows the results from the stereotype regression model of economic perceptions. The estimates satisfy the ordinality constraint without
further intervention. Considering the log-odds of the “same” and “worse”
responses, we see that the effect of the predictors is changed by a factor
φ2 − φ1 = .550 − 1 = −.450. Considering the log-odds for “better” and
“same” responses, the effect of the predictors is changed by a slightly larger
factor of φ3 − φ2 = 0 − .550 = −.550. This suggests that the impact of the
predictors is not constant across the categories of the perception variable.
9
The reason for the lack of the ordinality constraint is that Stata’s command handles
more than the ordinal case alone. For details see the Stata documentation.
Chapter 4
Probabilistic Choice Models:
An Introduction
In the previous two chapters, we considered choice models for binary outcomes and polychotomous ordered outcomes. Many choice problems, however, do not fall into these categories. Consider, for example, vote choice
in a multi-party electoral system. Here there are more than two alternatives, so binary choice models are inadequate. But the alternatives cannot
be rank-ordered. It makes no sense, for example, to say that in Switzerland
the FDP is more than the SVP, which in turn is more than the SP. Thus,
ordinal choice models are also inappropriate.
How then is one to analyze choice among multiple, unordered alternatives? Here is where probabilistic choice models (PCMs) come into play.
This class of models, which is rooted in economics and psychology, is perfectly suited for any kind of decision problem that involves multiple alternatives. In Chapters 5-8, we shall discuss specific instances of PCMs. In
this chapter, the focus is on the generic model and its assumptions. We also
discuss different traditions in probabilistic choice modeling, traditions that
will be reflected in subsequent chapters.
4.1 The Basics of PCMs
4.1.1 Key Characteristics
PCMs have several important key characteristics, some of which we have
already seen in Chapter 2. The characteristics can be summarized as follows.
1. Decision makers are assumed to be utility maximizers, either over the
entire set of available alternatives or a subset thereof.
2. Utility contains a random component, so that utility maximization—
and ultimately choice—is probabilistic. The random component captures: (a) unobserved attributes of the alternative, (b) unobserved attributes of the decision maker, (c) measurement error, and (d) proxies
or instrumental variables (Manski 1997).
3. Utility contains two systematic components to wit: (a) attributes of
the decision maker and (b) attributes of the alternatives.
4.1.2 Basic Notation
Let Yq = 1, 2, · · · , M denote the observed choice of decision maker q, where
M is the number of alternatives under consideration. Let Uqi be the (unobserved) utility that is associated with alternative i = 1, 2, · · · , M for decision
maker q. Under the assumption of utility maximization,
Y_q = i \quad \text{if } U_{qi} > U_{qj} \;\; \forall j \neq i \qquad (4.1)
The utilities for all alternatives can be broken down into a systematic and
a stochastic component. Let Vqi denote the systematic utility component for
alternative i, which is a function of attributes of the decision maker and the
alternative. Further, let ε_qi denote the stochastic utility component. Then
U_{qi} = V_{qi} + \varepsilon_{qi} \qquad (4.2)
We can now restate the utility maximization postulate. Specifically,
U_{qi} > U_{qj} \iff V_{qi} + \varepsilon_{qi} > V_{qj} + \varepsilon_{qj}
∀j ≠ i. This can be rearranged in a number of different ways. Sometimes it
is useful to write the postulate as
\varepsilon_{qj} < V_{qi} - V_{qj} + \varepsilon_{qi}
This is what we shall use in the next chapter. At other times, it is useful to
write the postulate in terms of
\varepsilon_{qj} - \varepsilon_{qi} < V_{qi} - V_{qj}
This is what we shall use in the derivation of the multinomial probit model
in Chapter 6. In either case, a comparison of the systematic utilities of
alternatives i and j is an essential part of utility maximization.
Figure 4.1: R. Duncan Luce
Since ε_qi and ε_qj are stochastic components, the entire choice process has
to be stated in probabilistic terms. Specifically, let π_qi = Pr(Y_q = i) be the
probability that decision maker q chooses alternative i. Then
\pi_{qi} = \Pr(U_{qi} > U_{qj}) = \Pr(\varepsilon_{qj} < V_{qi} - V_{qj} + \varepsilon_{qi}) = \Pr(\varepsilon_{qj} - \varepsilon_{qi} < V_{qi} - V_{qj}) \qquad (4.3)
∀j ≠ i. This probability cannot be computed without distributional assumptions for the stochastic utility components. This is where the different
traditions of probabilistic choice modeling start to deviate. That is, each
tradition makes its own specific distributional assumptions and, depending
on the nature of those assumptions, different properties of the PCM emerge.
4.2 Traditions of Probabilistic Choice Modeling
4.2.1 The Legacy of Duncan Luce
There are three main traditions of probabilistic choice modeling. One of
these can be attributed to the work of the decision scientist Duncan Luce in
the 1950s and 1960s. Lucean-type models make relatively simple assumptions about the stochastic components. Specifically, in addition to assuming
that these components are identically distributed it is also assumed that they
are independent and homoskedastic. The great advantage of this assumption is that it leads to computationally tractable choice models. The great
disadvantage is that these models result in a rather stringent assumption
about choice behavior that is known as independence of irrelevant alternatives or IIA (see Arrow 1951). Put simply, IIA means that the odds of
choosing alternative i relative to alternative j do not depend on any other
alternatives. This assumption is often unrealistic. The models from Chapter
5 are of the Lucean variety.
Figure 4.2: Louis Leon Thurstone
4.2.2 The Legacy of L.L. Thurstone
L.L. Thurstone’s work in psychometrics in the 1920s has laid the basis for a
different class of PCMs. In these models, the stochastic components are not
treated as independent, nor do they have to be homoskedastic. As a result,
the IIA assumption is avoided. On the down side, however, these models
are computationally complex. To date, the best-known example of a choice
model in the Thurstonian tradition is the multinomial probit model, which
will be discussed in Chapter 6. Until recently, this model was of theoretical
interest only because researchers lacked the processing power and algorithms
to estimate it. Even today, when specialized algorithms are available, the
model remains one of the most difficult choice models to estimate.
4.2.3 Sequential Choice Models
The types of PCM discussed so far assume that decision makers consider the
available alternatives all at once. It is often reasonable to assume, however,
that choice is a sequential process whereby decision makers pare down the
alternatives in successive steps. One of the oldest models of sequential choice
is Amos Tversky’s (1974) elimination-by-aspects model, which assumes that
decision makers work through successive attributes and eliminate alternatives that are unsatisfactory; those alternatives will not be considered in the
next step. Daniel McFaden’s (1978) nested logit model is another variant
of sequential choice models that we shall consider in Chapter 7. Charles
Manski’s (1977) probabilistic choice set multinomial logit model, which will
be discussed in Chapter 8, is another variant of sequential choice models.
Common to all of these models is that they, to some degree, bypass the IIA
assumption without being computationally intractable.
Figure 4.3: Amos Tversky, Daniel McFadden, and Charles Manski
Chapter 5
Multinomial and Conditional
Logit
In the previous chapter, we saw that the probability of choosing an alternative i from among M alternatives is given by (4.3). We also saw that
the solution for this choice probability depends on the specific distributional
assumptions that one makes for the stochastic utility components. In the
present chapter, we consider two models—the multinomial and conditional
logit models—that assume these components to be homoskedastic and independent across alternatives. These models are of the Lucean variety and
they are widely used in political analysis (see, for example, Whitten and
Palmer 1996). Because of their prominence it is important to dwell on the
derivation, estimation, and interpretation of these models, as well as on their
chief limitation—the IIA assumption.
5.1 The Model
5.1.1 Distributional Assumptions
The multinomial and conditional logit models can be derived from the standard Type-I maximum extreme value distribution that we saw in the discussion of the gompit model in Chapter 2. As applied to the stochastic utility
components, the PDF and CDF are given by:
f(\varepsilon) = \exp(-\varepsilon)\exp[-\exp(-\varepsilon)] = \exp[-\varepsilon - \exp(-\varepsilon)]
F(\varepsilon) = \exp[-\exp(-\varepsilon)]
In this distribution, the variance of ε is fixed at π²/6. In addition, we assume
that Cov(ε_i, ε_j) = 0 for j ≠ i, i.e. the errors are independent.
5.1.2 Choice Probabilities
Three Alternatives
With these distributional assumptions in place, it is now possible to evaluate
(4.3) and to derive the choice probabilities. This gets quite mathematical,
and you could easily skip to the end of this section to see the final result if
the math becomes too eye-blinding. To make the derivation more concrete,
we start by considering a choice problem involving three alternatives (e.g.
choosing among three different political parties in an election). We focus on
the probability that the first alternative will be chosen. Thus, we want to
know π1 . Using (4.3) and the distributional assumptions we have
\pi_1 = \Pr(\varepsilon_2 < \varepsilon_1 + V_1 - V_2 \;\cap\; \varepsilon_3 < \varepsilon_1 + V_1 - V_3) = \Pr(\varepsilon_2 < \varepsilon_1 + V_1 - V_2)\,\Pr(\varepsilon_3 < \varepsilon_1 + V_1 - V_3)
where the second equation follows from the assumed independence of the
errors, which turns the joint probability into a product of two marginal
probabilities.
Evaluating this equation is a bit complicated because the right-hand
side of the inequalities is not fixed (as it was in binary and ordered response
models). On the contrary, it is stochastic since ε1 is random. This means
that we actually will have to integrate over ε1 as well as ε2 or ε3. For
example,
\Pr(\varepsilon_2 < \varepsilon_1 + V_1 - V_2) = \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\varepsilon_1 + V_1 - V_2} f(\varepsilon_2)\, d\varepsilon_2 \right) f(\varepsilon_1)\, d\varepsilon_1
where the part in parentheses captures the notion that ε2 < ε1 + V1 − V2
and the remainder of the expression addresses the stochastic nature of ε1.
We can simplify this expression considerably when we realize that the part
in parentheses is a cumulative probability:
\int_{-\infty}^{\infty} \left( \int_{-\infty}^{\varepsilon_1 + V_1 - V_2} f(\varepsilon_2)\, d\varepsilon_2 \right) f(\varepsilon_1)\, d\varepsilon_1 = \int_{-\infty}^{\infty} F(\varepsilon_1 + V_1 - V_2) f(\varepsilon_1)\, d\varepsilon_1
In terms of the Type-I extreme value distribution the function over which we integrate is given by
F(\varepsilon_1 + V_1 - V_2) f(\varepsilon_1) = \exp(-e^{-\varepsilon_1 - V_1 + V_2}) \exp(-\varepsilon_1 - e^{-\varepsilon_1})
= \exp(-e^{-\varepsilon_1 - V_1 + V_2} - e^{-\varepsilon_1} - \varepsilon_1)
= \exp\!\left[ -\varepsilon_1 - e^{-\varepsilon_1}\left(1 + e^{V_2 - V_1}\right) \right]
= \exp\!\left[ -\varepsilon_1 - e^{-\varepsilon_1}\left(1 + \frac{e^{V_2}}{e^{V_1}}\right) \right]
so that
\Pr(\varepsilon_2 < \varepsilon_1 + V_1 - V_2) = \int_{-\infty}^{\infty} \exp\!\left[ -\varepsilon_1 - e^{-\varepsilon_1}\left(1 + \frac{e^{V_2}}{e^{V_1}}\right) \right] d\varepsilon_1
Analogously,
\Pr(\varepsilon_3 < \varepsilon_1 + V_1 - V_3) = \int_{-\infty}^{\infty} \exp\!\left[ -\varepsilon_1 - e^{-\varepsilon_1}\left(1 + \frac{e^{V_3}}{e^{V_1}}\right) \right] d\varepsilon_1
Hence, combining the two pieces—both inequalities involve the same ε1, which therefore has to be integrated over only once—we obtain
\pi_1 = \int_{-\infty}^{\infty} \exp\!\left[ -\varepsilon_1 - e^{-\varepsilon_1}\left(1 + \sum_{k \neq 1} \frac{e^{V_k}}{e^{V_1}}\right) \right] d\varepsilon_1
This looks quite ugly but fortunately considerable simplification is possible. Let \lambda_1 = \ln[1 + \sum_{k \neq 1} \exp(V_k)/\exp(V_1)] = \ln[\sum_k \exp(V_k)/\exp(V_1)],¹
¹ To get the second form for λ we write 1 + \sum_{k \neq 1} \exp(V_k)/\exp(V_1) as (\exp(V_1) + \sum_{k \neq 1} \exp(V_k))/\exp(V_1). We now realize that the numerator sums over all alternatives, so that it immediately follows that we have \sum_k \exp(V_k)/\exp(V_1). All that is left to do then is to take the natural logarithm in order to obtain λ.
then
\int_{-\infty}^{\infty} \exp\!\left[ -\varepsilon_1 - e^{-\varepsilon_1}\left(1 + \sum_{k\neq 1}\frac{e^{V_k}}{e^{V_1}}\right) \right] d\varepsilon_1 = \int_{-\infty}^{\infty} \exp\{ -\varepsilon_1 - \exp[-(\varepsilon_1 - \lambda_1)] \}\, d\varepsilon_1
= \left[ \frac{1}{\exp\!\left( \frac{\exp(\lambda_1)}{\exp(\varepsilon_1)} \right) \exp(\lambda_1)} \right]_{-\infty}^{\infty}
For ε1 → ∞, exp(ε1) → ∞, and 1/exp(ε1) → 0. Hence, the denominator of the expression above may be written as exp(0 × exp(λ1)) exp(λ1) = exp(0) exp(λ1) = exp(λ1). The full expression is just the reciprocal of this.
For ε1 → −∞, exp(ε1) → 0, and 1/exp(ε1) → ∞. Hence, the denominator of the expression above may be written as exp(∞ × exp(λ1)) exp(λ1) = exp(∞) exp(λ1) = ∞. The full expression is just the reciprocal of this, i.e., 0. Thus,
\left[ \frac{1}{\exp\!\left( \frac{\exp(\lambda_1)}{\exp(\varepsilon_1)} \right) \exp(\lambda_1)} \right]_{-\infty}^{\infty} = \frac{1}{\exp(\lambda_1)} - 0 = \exp(-\lambda_1)
Substitution gives
\exp(-\lambda_1) = \exp\!\left\{ -\ln\!\left[ \sum_k \frac{e^{V_k}}{e^{V_1}} \right] \right\} = \frac{e^{V_1}}{\sum_k e^{V_k}}
Thus, we see that the probability of choosing alternative 1 is equal to the
exponent of the systematic utility component of that alternative divided by
the sum of the exponents of the systematic utility components of all of the
alternatives.
M Alternatives
The previous discussion suggests the following generic formula for the choice
probabilities (see McFadden 1974, 1981):
\pi_i = \frac{e^{V_i}}{\sum_k e^{V_k}} \qquad (5.1)
Table 5.1: Conditional versus Multinomial Logit

Model               Predictors                  Parameters
Conditional Logit   Variable across choices     Fixed across choices
Multinomial Logit   Fixed across choices        Variable across choices
This is the basic Lucean PCM. It is the foundation for the conditional and
multinomial logit models.
5.1.3 Determinants of Choice
The model in (5.1) does not account for variation in the systematic utility
components Vi . There are two main approaches to modeling these components (see Table 5.1). In the first approach, Vi is a function of the attributes of the alternative, which are assumed to have identical effects across
all choice options. This produces the so-called conditional logit model,
which was developed by Daniel McFadden (1974, 1981). In the second approach, Vi is a function of attributes of the decision maker whose effect
varies across the alternatives. This results in the so-called multinomial
logit model, which has a long tradition in the social sciences (see Long
1997). As we shall see, it is also possible to enter both attributes of the
decision maker and the alternatives into a single choice model.
Conditional Logit
Let πqi be the choice probability for alternative i in individual q.2 As we
have seen, πqi depends on Vqi . In the conditional logit model,
V_{qi} = x_{qi}'\beta
where x_{qi} is a vector of attributes of the ith alternative as perceived by the qth decision maker. Thus
\pi_{qi} = \frac{\exp(x_{qi}'\beta)}{\sum_{k=1}^{M} \exp(x_{qk}'\beta)} \qquad (5.2)
We clearly see that the predictors vary across alternatives (whence the i
subscript on x). We also clearly see that the parameters are fixed across
alternatives (whence the lack of an i subscript on β).
2
Since we will be introducing individual characteristics, it is now useful to introduce
individual-level subscripts. This will also help in discussing the data structure.
Multinomial Logit
In the multinomial logit model, the Vqi are understood in terms of individual
characteristics whose effects may vary across alternatives. Thus
V_{qi} = z_q'\gamma_i
where z_q is a vector of attributes of the decision maker, whose effect depends on the alternative that is under consideration (whence the i-subscript on γ). Hence,
\pi_{qi} = \frac{\exp(z_q'\gamma_i)}{\sum_{k=1}^{M} \exp(z_q'\gamma_k)} \qquad (5.3)

Combined Model
Most statistical software packages that estimate the conditional logit model
will actually allow you to include individual characteristics of the decision
maker. This produces the following model
V_{qi} = x_{qi}'\beta + z_q'\gamma_i
and
\pi_{qi} = \frac{\exp(x_{qi}'\beta + z_q'\gamma_i)}{\sum_{k=1}^{M} \exp(x_{qk}'\beta + z_q'\gamma_k)} \qquad (5.4)

5.2 Identification
The models presented above are not identified. That is, we can arbitrarily
change the vector of parameters without altering the implied choice probabilities. Thus, the same choice probability can map onto an infinitely large
number of values for the parameters. Unique parameter estimates can be
obtained only after certain identifying restrictions are imposed.
5.2.1 Multinomial Logit
The identification problem is easily demonstrated for the multinomial logit
model. Imagine that, instead of specifying a vector of parameters γ i , we add
an arbitrary constant τ. How would this affect the choice probabilities? We
now have
\pi_{qi} = \frac{\exp[z_q'(\gamma_i + \tau)]}{\sum_{k=1}^{M} \exp[z_q'(\gamma_k + \tau)]}
= \frac{\exp(z_q'\gamma_i + z_q'\tau)}{\sum_{k=1}^{M} \exp(z_q'\gamma_k + z_q'\tau)}
= \frac{\exp(z_q'\gamma_i)}{\sum_{k=1}^{M} \exp(z_q'\gamma_k)} \times \frac{\exp(z_q'\tau)}{\exp(z_q'\tau)}
It is clear that the second term on the right-hand side is equal to one and
hence does not change the probabilities on the left-hand side. Thus, we see
that it is possible to arbitrarily change the parameters without changing
the implied choice probabilities. This means that starting with a set of
choice probability estimates (e.g., proportions of individuals who choose a
particular alternative), it is impossible to derive a unique set of parameter
estimates. The model is unidentified, at least, if we do not impose some
restrictions on the parameters.
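The point is easily verified numerically. In the fragment below (the linear predictors z′γ_k are invented), adding the same arbitrary constant to every alternative's index leaves the choice probability for alternative 1 untouched:

* illustrative only: invented linear predictors and an arbitrary shift
scalar v1 = 0.2
scalar v2 = -0.4
scalar v3 = 0.7
scalar shift = 5
display exp(v1)/(exp(v1) + exp(v2) + exp(v3))
display exp(v1 + shift)/(exp(v1 + shift) + exp(v2 + shift) + exp(v3 + shift))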
In practice, two kinds of identification constraint are imposed on the
parameters. First, we can impose a normalizing constraint, such that the
parameters sum to 0 across all of the alternatives. Second, one can constrain
to zero the parameters for one of the alternatives. It is arbitrary which
alternative is constrained in this manner. This alternative becomes the
baseline category to which we can compare the other alternatives. The
second type of constraint is more common.
To see the implications of the second type of constraint, imagine that we
set γ 1 = 0, i.e. we constrain the parameters for the first alternative. Then
\pi_{q1} = \frac{\exp(z_q'0)}{\exp(z_q'0) + \sum_{k=2}^{M} \exp(z_q'\gamma_k)} = \frac{1}{1 + \sum_{k=2}^{M} \exp(z_q'\gamma_k)}
For the remaining alternatives,
\pi_{qi} = \frac{\exp(z_q'\gamma_i)}{1 + \sum_{k=2}^{M} \exp(z_q'\gamma_k)}
When we now add an arbitrary constant to γ i , then the choice probabilities
will be affected. Thus, the parameters are identified.
5.2.2 Conditional Logit
The conditional logit model also has an identification problem but this is
more localized, pertaining to the constant term only. To see this problem,
let us partition xqi into (1 x∗qi ), where x∗qi contains information about the
covariates and 1 is the constant. The parameter vector is partitioned in a
similar manner: \beta' = (\beta_0 \;\; \beta^{*\prime}). We can then write
\pi_{qi} = \frac{\exp(x_{qi}'\beta)}{\sum_k \exp(x_{qk}'\beta)}
= \frac{\exp(\beta_0 + x_{qi}^{*\prime}\beta^*)}{\sum_k \exp(\beta_0 + x_{qk}^{*\prime}\beta^*)}
= \frac{\exp(\beta_0)\exp(x_{qi}^{*\prime}\beta^*)}{\sum_k \exp(\beta_0)\exp(x_{qk}^{*\prime}\beta^*)}
= \frac{\exp(x_{qi}^{*\prime}\beta^*)}{\sum_k \exp(x_{qk}^{*\prime}\beta^*)} \times \frac{\exp(\beta_0)}{\exp(\beta_0)}
The second term on the right-hand side of the last equation is always one,
no matter what value of β0 is selected. As such, this term does not affect
the choice probabilities. Thus, we can pick any value for β0 and will still
be able to recover the choice probabilities for the alternatives. This means
that no unique estimate of β0 can be derived from the choice probabilities.
The solution to this identification problem is that β0 is typically normalized
at 0 and is not estimated.
5.3 Estimation
Estimation of the multinomial/conditional logit models is straightforward.
Let
d_{qi} = \begin{cases} 1 & \text{if } i \text{ is chosen} \\ 0 & \text{otherwise} \end{cases}
Then
L = \prod_{d_{q1}=1} \pi_{q1} \prod_{d_{q2}=1} \pi_{q2} \cdots \prod_{d_{qM}=1} \pi_{qM} = \prod_{q=1}^{N} \prod_{i=1}^{M} \pi_{qi}^{d_{qi}}
Here πqi is given by one of the models in (5.2)-(5.4). The log-likelihood
function is given by
\ell = \sum_{q=1}^{N} \sum_{i=1}^{M} d_{qi} \ln \pi_{qi}
For the multinomial logit model, the derivatives have the following simple
form:
\frac{\partial \ell}{\partial \gamma_i} = \sum_q (d_{qi} - \pi_{qi})\, z_q
for i = 1, · · · , M. For the conditional logit model, the derivatives are:
\frac{\partial \ell}{\partial \beta} = \sum_q \sum_i d_{qi}\,(x_{qi} - \bar{x}_q)
where \bar{x}_q = \sum_k \pi_{qk} x_{qk}. The first-order condition requires that both sets of
derivatives are set equal to zero. Solving the resulting equations requires
numerical methods, but due to the straightforward nature of the derivatives
convergence is generally fast and unproblematic.
5.4 Interpretation
5.4.1 Multinomial Logit
Marginal Effects and Elasticities
Marginal Effects One way to interpret multinomial logit results is to
compute marginal effects. With a little calculus it can be shown that
\frac{\partial \pi_i}{\partial z_l} = \pi_i \left( \gamma_{li} - \sum_k \gamma_{lk} \pi_k \right)
Here zl is a particular attribute of the decision maker and γlk (for k =
1, · · · , M ) is the effect of this attribute on πk . It is assumed that zl is
continuous. We can interpret the marginal effect as the change in the probability of choosing alternative i for a very small change in zl . Note that
the marginal effect depends on many factors, including the probabilities of
choosing other alternatives and the effect of zl on those choice probabilities.
Thus, it is not necessarily the case that the marginal effect will be of the
same sign as γ̂li .
As usual, the computation of marginal effects requires that we fix the
values of the predictors. It is common to set the predictors to their means,
which results in marginal effects at the mean. It is also possible to compute
average marginal effects, which requires that we use the covariate values
for each observation in the marginal effects formula and then average the
results. By using the delta method, the approximate asymptotic variance of
the marginal effects can be computed (see Greene 2003). This can be used
to calculate confidence intervals.
Elasticities For the predictor zl , the elasticity with respect to πi is given
by
\kappa_{z_l} = z_l \left( \gamma_{li} - \sum_k \gamma_{lk} \pi_k \right)
This can be interpreted as the percentage change in πi that arises from a
one percent increase in zl .
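A small numerical sketch may help; the coefficients, probabilities, and covariate value below are invented. The marginal effect and the elasticity of z_l on the probability of choosing alternative 2 follow directly from the two formulas above:

* illustrative only: invented coefficients for z_l, choice probabilities, and z value
scalar g1 = 0          // baseline alternative
scalar g2 = 0.75
scalar g3 = 0.25
scalar p1 = 0.5
scalar p2 = 0.3
scalar p3 = 0.2
scalar zval = 4
scalar weighted = g1*p1 + g2*p2 + g3*p3
display "marginal effect for alternative 2: " p2*(g2 - weighted)
display "elasticity for alternative 2:      " zval*(g2 - weighted)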
Predicted Probabilities and Discrete Change
Predicted probabilities for the multinomial logit model can be obtained by
substituting the estimates of γ i into (5.3). It is also possible to compute approximate confidence intervals for these predictions using the delta method.
It is common to parlay the predicted probabilities into discrete change
measures. Here we focus on a single attribute of the decision maker, zl , and
let it vary between the values a and b, while holding everything else constant
(usually at the mean). The discrete change is then given by
\frac{\Delta \pi_i}{\Delta z_l} = \Pr(Y = i \mid \bar{z}, z_l = b) - \Pr(Y = i \mid \bar{z}, z_l = a)
Common choices for the limits of zl include the following.
1. a = 0 and b = 1; this is particularly useful when ascertaining the effect
of dummy predictors.
2. a = z̄l and b = z̄l + 1; this gives a unit change from the mean of zl .
3. a = z̄l − .5 and b = z̄l + .5; this is the centered unit change.
4. a = z̄l − .5szl and b = z̄l + .5szl ; this represents a standard deviation
change.
5. a = min(zl ) and b = max(zl ); this represents the maximum change
In each case, it is possible to compute the variance of the discrete change via
the delta method. This allows one to test whether the change is significantly
different from zero.
To summarize the changes in predicted probabilities across all of the
alternatives, it is possible to compute an average absolute discrete change
measure (see Long 1997; Long and Freese 2003):
\bar{\Delta} = \frac{1}{M} \sum_{k=1}^{M} \left| \frac{\Delta \pi_k}{\Delta z_l} \right|
This measure gives an overall impression of how a predictor affects the probabilities of choosing different alternatives.
The Odds Ratio
Finally, the odds ratio is a method of interpretation that can be advantageous. The odds give the probability of choosing one alternative rather than
another given a particular set of values of the covariates. One type of odds
involves the baseline category. If we assume, without loss of generality, that
the first alternative serves as the baseline, then the odds relative to this
alternative are given by
\Omega_{i|1} = \frac{\pi_i}{\pi_1} = \exp(z_q'\gamma_i)
A second type of odds compares two non-baseline categories:
\Omega_{i|k} = \frac{\pi_i}{\pi_k} = \exp[z_q'(\gamma_i - \gamma_k)]
Now imagine that we compute the odds under two different scenarios. In
the first scenario, we consider a value of zl for one of the predictors while in
the second scenario that predictor is set to zl + δ. The remaining predictors
are kept constant across the two scenarios. The odds ratio involving the
baseline category is now given by
\frac{\Omega_{i|1}(z, z_l + \delta)}{\Omega_{i|1}(z, z_l)} = \exp(\gamma_{li}\,\delta)
For non-baseline categories, the odds ratio is
\frac{\Omega_{i|k}(z, z_l + \delta)}{\Omega_{i|k}(z, z_l)} = \exp[(\gamma_{li} - \gamma_{lk})\,\delta]
This expression shows a number of nice features of the odds ratio. First, it
does not depend on the level of zl , just on the change in this variable. Second,
it does not depend on any of the other predictors (unlike marginal effects
and discrete changes). Also note that the ratio is driven by the differences in
the effects of a predictor across two alternatives. This forms an interesting
counter-point to the odds ratio for the conditional logit model.
5.4.2 Conditional Logit
Marginal Effects and Elasticities
Marginal Effects For the conditional logit model, a bit of calculus shows
that the marginal effect of xlk with respect to πi is given by
\frac{\partial \pi_i}{\partial x_{lk}} = \pi_i (r - \pi_k)\, \beta_l
where r = 1 if i = k and r = 0 otherwise. This can be interpreted as the
impact on the choice probability for alternative i of a very small change in
attribute l of alternative k (which may be the same alternative as i). We see
that the marginal effect depends on all attribute sets (through πi and πk ).
Depending on how we select the values of the covariates, either marginal
effects at the mean or average marginal effects can be computed.
Elasticities It is often helpful to convert the marginal effect into an elasticity, which can be done by multiplying it by xlk /πi . This yields
\kappa_{x_{lk}} = x_{lk} (r - \pi_k)\, \beta_l
This can be interpreted as the percentage change in πi for a one percent
increase in xlk . When k = i the expression for κ gives a self-elasticity: the percentage change in the probability of choosing
alternative i when attribute l of this alternative changes by 1 percent. When
k ≠ i, then κ gives the cross-elasticity, i.e. the percentage change in
the probability of choosing alternative i when attribute l of alternative k ≠ i
changes by 1 percent. We say that two alternatives are complements if their
cross-elasticity is negative. We say that they are substitutes when their
cross-elasticity is positive. Finally, the alternatives are independent when
the cross-elasticity is zero.3
It is important to note that the cross-elasticities in the conditional logit
model are all of the form −xlk πk βl . This means that they depend entirely
on attributes of the alternative k, not on attributes of alternative i, which
is the alternative for which we are computing the elasticity. An important
consequence is that the cross-elasticity is identical regardless of what i is.
This is an important feature of the conditional logit model, although not
necessarily a desirable one. The models considered in later chapters do not
display this behavior.
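To see this numerically, the fragment below (coefficient, attribute value, and choice probability are all invented) computes the self-elasticity for alternative 2 and the cross-elasticities for the other alternatives; the latter are identical no matter which alternative i we consider:

* illustrative only: invented values
scalar betal = -0.5     // coefficient on attribute l (e.g., issue distance)
scalar x_l2  = 3        // attribute l of alternative 2
scalar p2    = 0.4      // probability of choosing alternative 2
display "self-elasticity  (i = 2):      " x_l2*(1 - p2)*betal
display "cross-elasticity (i = 1 or 3): " -x_l2*p2*betal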
Predicted Probabilities and Discrete Change
Predicted probabilities for the conditional logit model can be computed by
substituting β̂ into (5.2). Using the delta method, it is also possible to
compute asymptotic confidence intervals on the predicted probabilities.
A common way of interpreting the conditional logit model is to compute
discrete change in the predicted probabilities:
\frac{\Delta \pi_i}{\Delta x_{lk}} = \Pr(Y = i \mid x, x_{lk} = b) - \Pr(Y = i \mid x, x_{lk} = a)
It is conventional to set the predictors in x to their means. These predictors
include attributes of alternative i as well as of the other alternatives. In
terms of the values a and b, these can be selected in the same way as in the
multinomial logit model.
The Odds Ratio
The odds is again defined as the ratio of the choice probabilities for two
different alternatives. It is computed via
\Omega_{i|k} = \exp[(x_i - x_k)'\beta]
Note that this reflects an important difference with the multinomial logit
model. That is, in the conditional logit model the odds are driven by the
differences in the attributes. By contrast, in the multinomial logit model
3
In economics, cross-elasticities often refer to changes in the demand for one good as
a result of changes in the price of another good. With complementary goods, a price
increase in one leads to reduced demand for the other good. With substitutes, a price
increase in one good leads to increased demand for the other good.
the odds are driven by the differences in parameters associated with the
attributes.
The odds ratio reflects what happens when we change the value of an
attribute of alternative i from xli to xli + δ. In this case,
\frac{\Omega_{i|k}(x_i, x_k, x_{li} + \delta)}{\Omega_{i|k}(x_i, x_k, x_{li})} = \exp(\delta\,\beta_l)
is the factor by which the odds of choosing i compared to k change. Apart
from the ease with which it can be computed, a major advantage of the
odds ratio (as usual) is that we can ignore all of the remaining alternatives,
something that cannot be done with marginal effects and discrete change in
predicted probabilities.
5.5 Hypothesis Testing
A variety of hypotheses can be tested in the multinomial and conditional
logit models. One hypothesis is that a particular covariate has a zero effect
in the population. It is straightforward to test this hypothesis in the conditional logit model. If the null hypothesis is that xli has no effect, then we
can simply test the simple hypothesis H0 : βl = 0 using a LR or Wald test
procedure. In the multinomial logit model, the test is a bit more complex.
Since each covariate has M − 1 parameters associated with it, the hypothesis that a particular attribute of the decision maker, zl , does not influence
choice amounts to: H0 : γl2 = · · · = γlM = 0 (assuming that the first alternative is the baseline). This joint hypothesis is most easily tested using a
LR or Wald test approach.
A second type of hypothesis that can be useful to test involves an equality
constraint whereby two parameters are restricted to be equal. In the multinomial logit model, this test can be used to ascertain the distinguishability
of two alternatives. Two alternatives i and k are non-distinct if the effect of
all of the covariates is the same for both alternatives, i.e. H0 : γli − γlk = 0
∀l. In this case, no information is lost by combining the alternatives. Long
(1997) suggests a LR test for evaluating the null hypothesis that proceeds
as follows:
1. Limit the choice set to i and k.
2. Perform a binary logit analysis on this choice set, using the same
predictors as in the multinomial logit model.
3. Perform an LR test on the hypothesis that all of the predictors have
null effects. Failure to reject this hypothesis constitutes evidence that
alternatives i and k are not distinct.
An alternative approach is to perform a Wald test; this will be illustrated
in the example below.
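In Stata, Long's LR procedure can be carried out along the following lines. The data set and variable names are placeholders (they anticipate the Dutch vote choice example discussed below), so this is only a sketch:

* illustrative sketch: are alternatives 1 (PvdA) and 4 (D66) distinct?
keep if vote == 1 | vote == 4        // step 1: limit the choice set
gen d66 = (vote == 4)                // binary outcome within the reduced choice set
logit d66 age education income male religious lrplace
estimates store full
logit d66                            // constant-only model
estimates store null
lrtest full null                     // H0: all predictor effects are zero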
5.6 Model Fit
5.6.1 Pseudo-R2
Likelihood Ratio Index
Stata by default computes McFadden’s pseudo-R2 , which is given by 1 minus
the ratio of the log-likelihood of the fitted model and the log-likelihood of
an empty or null model (see Chapter 2). In the case of multinomial logit
analysis, the most common null model is one that omits all of the predictors
and includes constants only. The log-likelihood of this model is given by
\ell_0 = \sum_{i=1}^{M} N_i \ln\!\left( \frac{N_i}{N} \right)
We can think of this as the log-likelihood evaluated at actual “market
shares,” i.e. taking into consideration how many decision makers actually
chose a particular alternative. In conditional logit analysis, the most widely
used null model is one that contains no parameters. Here, πqi = 1/M and
\ell_0 = -N \ln M
The theoretical range of this measure is from 0 to 1, although the upper bound is in practice never realized in conditional and multinomial logit models.
Başar and Bhat (2004) propose computing an adjusted likelihood ratio index, as proposed by Ben-Akiva and Lerman (1985). This measure is
defined as
\bar{R}^2_{McF} = 1 - \frac{\ell_1 - p}{\ell_0}
where p is the number of estimated parameters beyond those included in the
null model. The advantage of this measure is that it penalizes against overfitting, i.e. adding parameters that do not lead to a substantial improvement
in the model fit.
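Both indices are easily computed by hand. Using the log-likelihood values that will be reported for the multinomial vote choice model in Table 5.2, and taking p = 18 (six predictors in each of the three non-baseline equations, over and above the constants-only null model), the calculation below reproduces the values of .231 and .218 discussed in the example:

scalar ll1 = -1052.138
scalar ll0 = -1368.480
scalar p   = 18
display "McFadden pseudo-R2: " 1 - ll1/ll0
display "adjusted index:     " 1 - (ll1 - p)/ll0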
Aldrich and Nelson pseudo-R2
It is also possible to compute the pseudo-R2 measure that Aldrich and Nelson (1984) proposed for binary choice models. As shown in Chapter 2,
this is the ratio of the likelihood ratio test statistic and the sum of this
statistic and the sample size. The theoretical upper-bound on this measure
is −2ℓ0 /(n − 2ℓ0 ). To obtain a pseudo-R2 with a theoretical range from 0
to 1, one can divide the measure proposed by Aldrich and Nelson by this
upper-bound (see Veall and Zimmermann 1992).
5.6.2 Correct Predictions
Percentage of Correct Classifications
Apart from computing pseudo-R2 measures, it is also possible to assess
model fit via correct predictions. One approach here is to calculate the percentage of correctly classified cases. For each decision maker in the sample,
it is possible to compute a vector of predicted probabilities for all of the
alternatives. One can then make a prediction for the choice variable by
assuming that the decision maker chooses the alternative with the highest
predicted probability. That is,
Ŷq = i if π̂qi > π̂qk ∀k ≠ i
We can then count the percentage of cases where Ŷ and Y correspond; this
is CC. As always, CC should be compared to our predictive ability in the
absence of the model. To do this, simply consider a scenario where all we
know is which alternative was chosen most frequently. The percentage of
cases choosing this alternative serves as the baseline to which CC can be
compared.
Average Probability of Correct Prediction
An obvious drawback of CC is that it may be impossible to come up with a
unique prediction because the predicted probabilities for two alternatives are
identical. But even when the probabilities are not identical, sharp decisions
about choice behavior seem risky. Imagine, for example, a decision maker
with predicted probabilities of .49 for alternative 1, .48 for alternative 2,
and .03 for alternative 3. We are on pretty safe ground to assume that this
person will not choose alternative 3. But when it comes to the remaining
alternatives, it seems perilous to assume that the decision maker will choose
alternative 1 simply because its predicted probability is .01 points higher
than that of alternative 2.
To overcome the problem of having to make sharp predictions of choice
behavior, Başar and Bhat (2004) propose computing the average probability
of correct choice. This is given by4
APCC = \frac{1}{N} \sum_{q=1}^{N} \sum_{i=1}^{M} d_{qi}\, \hat{\pi}_{qi}
The AP CC takes into consideration the inherent uncertainty in our predictions by explicitly taking into account π̂qi in addition to the actual choice
behavior, which is captured by dqi .
To consider the differences between AP CC and CC let us consider a sample of two decision makers who have the following vectors of predicted probabilities over three alternatives: π̂ 1 = (.49, .48, .03) and π̂ 2 = (.51, .20, .29).
To compute CC we turn these vectors into choice predictions in the form of
d̂1 = (1, 0, 0) and d̂2 = (1, 0, 0). Now imagine that the actual choice vectors
are given by d1 = (1, 0, 0) and d2 = (1, 0, 0). Since there is a perfect match
between the predictions and the actual choices, CC = 1. On the other hand,
AP CC = .5 [(1 · .49 + 0 · .48 + 0 · .03) + (1 · .51 + 0 · .20 + 0 · .29)] = .50.
The AP CC comes out lower because it takes into consideration that nonzero predicted probabilities were associated with choices other than the first
alternative. Indeed, AP CC would have attained the value of 1 only if the
predicted probabilities associated with the winning alternative were all 1,
i.e. there would have been no chance of decision makers choosing any other
alternatives.
Now consider a situation in which decision makers actually chose the
alternative with the lowest predicted value, i.e. d1 = (0, 0, 1) and d2 =
(0, 1, 0). Since there is a mismatch between d and d̂ for both decision makers, CC = 0. On the other hand, APCC = .115. The APCC is not equal
to zero because non-zero probabilities are associated with the alternatives
that were chosen. Another feature of APCC relative to CC is also of interest. For CC it does not matter whether the decision makers actually
selected the least likely alternatives, or the second, or third least likely alternatives. Regardless, CC = 0. But for APCC this makes a difference. For
instance, if decision maker 1 had selected the second alternative and decision maker 2 the third alternative, then APCC = .385.
⁴ When the sampling design calls for it, we can add a weight to each decision maker. In this case, the formula becomes N^{-1} \sum_q w_q \sum_i d_{qi} \hat{\pi}_{qi}, where w_q is the weight.
In general, we would recommend computing both CC and AP CC. Both
provide useful information and the cost of computing them is so low as to
pose no great burden.
5.7 Example
To illustrate the conditional and multinomial logit models, we consider voting behavior in the Netherlands in the 1994 parliamentary elections for the
2nd Chamber—the main chamber of the legislature (and the only chamber
that is directly elected). In this example, we model the vote for the four
largest parties at that time, to wit CDA, D66, PvdA, and VVD.5 We consider one party attribute: issue distance to the voter.6 We consider six voter
characteristics, which generalize across the alternatives: (1) age, (2) gender,
(3) education, (4) income, (5) religiosity, and (6) left-right self placement.
The analysis is based on the Dutch Parliamentary Election Study (DPES),
1994. After eliminating cases with missing values on the predictors or cases
with a vote choice different from CDA, D66, PvdA, or VVD, the sample size
is 993 observations for the multinomial logit analysis. For the conditional
logit model, the sample size is discussed below.
5.7.1 Multinomial Logit Analysis
Estimation Results
We begin by specifying a multinomial logit analysis in which vote choice is
predicted from the individual level characteristics (age, gender, education,
income, religiosity, and left-right self-placement). The baseline category is
voting for the PvdA—i.e. the parameters for this choice have been set equal
to 0.7 The model was estimated using the mlogit command:
mlogit varname [varlist] [, baseoutcome(#) level(#)]
In our case, we specified b(1) which identifies the first alternative (PvdA)
as the baseline category. The estimation results are shown in Table 5.2.
5
CDA = Christian Democratic Appeal (Christian Democrats), D66 = Democrats 66
(left liberal), PvdA = Labor Party (Social Democrats), and VVD = People’s Party for
Freedom and Democracy (right liberal).
6
This is measured in terms of five issues: (a) euthanasia, (b) crime, (c) reduction of
income differences, (d) nuclear power, and (e) integration of minorities.
7
By default, Stata selects the modal choice as the baseline alternative.
Table 5.2: Multinomial Logit Model of Dutch Vote Choice in 1994

Predictor              CDA          VVD          D66
Age                    .011        −.017∗       −.044∗∗
Education              .197+        .281∗∗       .057
Income                 .088∗∗       .134∗∗       .043
Male                   .056         .082        −.120
Religious             2.357∗∗       .282         .387+
L-R Placement          .750∗∗       .893∗∗       .250∗∗
Constant             −7.652∗∗     −6.300∗∗      −.186

Notes: n = 993. Table entries are ML multinomial logit coefficients. PvdA is the baseline category. At convergence, ℓ1 = −1052.138; ℓ0 = −1368.480. LR = 632.68, p = .000. CC = 55.2%, APCC = .419. R2_McF = .231, R2_V&Z = .530. ∗∗ p < .01, ∗ p < .05, + p < .10 (two-tailed).
The results show a consistently strong effect of left-right self-placement
such that voters further to the right were more likely to vote CDA, VVD, or
D66 compared to voting for the PvdA. Religiosity makes a CDA vote more
likely than a PvdA vote. Higher incomes dispose people toward voting for
the CDA or VVD instead of the PvdA. Older voters are less likely to vote
for the VVD or D66 relative to PvdA. More highly educated voters are more
likely to vote for the VVD than for the PvdA.
Interpretation
Marginal Effects and Elasticities Marginal effects at the mean can be
computed via the by now familiar mfx command. For the multinomial logit
model, the syntax for this command is similar to that for ordered logit and
probit. Table 5.3 shows the marginal effects of age, income, and left-right
self-placement on voting for the CDA. For example, a very small increase in
L-R self-placement, which corresponds to a movement to the political right,
increases the probability of voting for the CDA by about .06 points. The
corresponding elasticity is 1.7, which means that a one percent increase in
left-right self-placement corresponds to a 1.7% increase in the probability of
voting for the CDA.
Discrete Change Maximum discrete change measures were obtained using the prchange command and are reported in Table 5.4. We observe
Table 5.3: Marginal Effects for the Multinomial Logit Vote Choice Model

Predictor     m.e.       s.e.        κ
Age           .005∗∗     .001     1.123
Income        .005       .004      .185
L-R           .058∗∗     .008     1.722

Notes: Marginal effects (m.e.) and elasticities at the mean computed on the probability of voting for CDA. ∗∗ p < .01 (two-tailed).
Table 5.4: Discrete Change for the Multinomial Logit Vote Choice Model

Predictor      PvdA      CDA       VVD       D66       Δ̄
Age            .213      .316     −.076     −.453     .264
Education     −.146      .051      .165     −.069     .108
Income        −.203      .053      .197     −.047     .125
Male          −.001      .010      .021     −.031     .016
Religious     −.175      .315     −.084     −.055     .157
L-R           −.749      .242      .678     −.171     .460

Notes: Maximum discrete changes reported. All other predictors are set to their means.
particularly sizable effects for age, left-right self-placement and religion. For
example, the oldest respondent is predicted to be .45 less likely to vote
for D66 than the youngest respondent, suggesting the appeal of this party
among young but not older voters. Moving from the most left-wing to the
most right-wing ideological position reduces the likelihood of voting for the
PvdA by .75 points on the average, while increasing the likelihood of voting
for the VVD by .68 points. Religious respondents on the average have a .32
greater probability of voting for the CDA than non-religious respondents.
This is testimony to the explicitly Christian foundation of the CDA, whereas
the remaining parties are secular in nature.
The Odds Ratio Finally, consider the odds ratios reported in Table 5.5,
which were obtained using the listcoef command. These ratios draw com-
Table 5.5: Odds Ratios for the Multinomial Logit Vote Choice Model

Predictor         CDA       VVD       D66
Age              1.012      .983      .957
Education        1.218     1.325     1.059
Income           1.092     1.143     1.044
Male             1.058     1.086      .887
Religious       10.559     1.325     1.473
L-R Placement    2.117     2.443     1.284

Notes: Comparisons with the baseline category (PvdA).
parisons of the remaining alternatives with the PvdA. For example, the
odds of choosing the CDA in lieu of the PvdA are over 10 times greater for
a religious as compared to a non-religious person.
The listcoef command also allows one to draw comparisons between
non-baseline categories. For example, one of the comparisons that is reported concerns the probability of voting for the CDA as compared to the
VVD. These odds are almost 8 times greater for religious than for nonreligious people. We could also have computed this result directly from
Table 5.5: Ω2|1 /Ω3|1 = Ω2|3 = 10.559/1.325 ≈ 8.
Hypothesis Testing
Looking through the results in Table 5.2, we see that gender is never statistically significant. We can test more formally that gender is inconsequential
for choice among the four parties. Let γ4k be the coefficient for male. Then
we can perform a Wald test of H0 : γ4k = 0, where k = 2, 3, 4. We do this
in Stata by typing
test male
This produces W = 1.02, p = .797. Thus, we fail to reject the null hypothesis
that gender had no impact on party vote choice in the Netherlands in 1994.
To illustrate a Wald test approach for distinguishability let us consider if
the alternatives PvdA and D66 can be collapsed. The relevant null hypothesis is H0: γ_l1 = γ_l4 for l = 1, 2, · · ·, 6. To test this hypothesis in Stata, we begin by estimating the multinomial logit model with a baseline other than alternative 1 (PvdA) or 4 (D66). For example, we could make the CDA the baseline. We then execute the following command

test [1=4]
The resulting Wald test statistic is 64.56 with p < .01. Thus, we reject the
hypothesis that the alternatives of PvdA and D66 are indistinguishable and
can be collapsed into a single alternative.
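A minimal sketch of this procedure (again with illustrative predictor names; baseoutcome(2) makes the CDA the reference category):

mlogit vote age educ income male religious leftright, baseoutcome(2)
test [1=4]

The bracketed syntax asks Stata to test the equality of all coefficients in the equations for alternatives 1 (PvdA) and 4 (D66).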
Model Fit
Pseudo-R² The Stata output for the multinomial logit model contains McFadden's pseudo-R². In this example, R²_McF = .231, which indicates a solid fit (see Table 5.2).8 R²_A&N and R²_V&Z have to be computed by hand. Useful here is that the fitstat command shows ℓ_0. For the present example, R²_V&Z = .530, which can be considered quite good.
Correct Predictions At present, Stata does not compute the CC and APCC measures. Hand calculations show that CC = 55.2%.9 This may
seem on the low side, but the relevant comparison here is how well we would
predict if we ignored all covariate information and only knew which party
attracted the most votes in the sample. In our case, this is the PvdA, which
receives the vote of 296 out of 993 sample units. Thus, ignoring all covariate
information we would be correctly classifying 100 · (296/993) = 29.8% of
8 The adjusted likelihood ratio is .218. This is hardly worse than the unadjusted likelihood ratio, suggesting that we are not over-fitting the choice data.
9 The process for obtaining CC is quite simple. Assuming that the dependent variable for the multinomial logit model is labeled vote, the following syntax will give you the information needed to compute CC:

predict p1 p2 p3 p4 if e(sample), pr
gen predchoice=.
replace predchoice=1 if p1>p2 & p1>p3 & p1>p4
replace predchoice=2 if p2>p1 & p2>p3 & p2>p4
replace predchoice=3 if p3>p1 & p3>p2 & p3>p4
replace predchoice=4 if p4>p1 & p4>p2 & p4>p3
tab predchoice vote

All that is left to do is count the number of cases on the diagonal of the table between predchoice and vote and divide by N.
the cases. Compared to this, the model does a much better job at correctly
predicting vote choice (the improvement is (55.2 − 29.8)/29.8 or better than
85%).
Computation of the APCC also requires some simple hand calculations.10 In our case, APCC = .419, which means that we have a less than .5 probability of assigning the correct vote decision to a sample unit. Again, we may wish to compare this to a situation where there is no covariate information. In that case, APCC = .254, meaning that the model improves by around 65% over this baseline value.11
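As a minimal sketch (assuming the predicted probabilities p1–p4 from fn. 9 are available and the choice variable is labeled vote), the APCC of fn. 10 can be computed as follows:

* average probability assigned to the alternative that was actually chosen
gen pchosen = p1*(vote==1) + p2*(vote==2) + p3*(vote==3) + p4*(vote==4)
summarize pchosen

The mean of pchosen is the APCC; repeating the computation with the sample vote shares in place of p1–p4 gives the baseline value described in fn. 11.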
Overall, then, the fit of the multinomial logit model is quite satisfactory.
Of course, we have not yet considered any attributes of the parties. It is to
such an analysis that we turn next.
5.7.2 Conditional Logit Analysis
Estimation Results
Let us now consider a conditional logit analysis in which vote choice is
predicted from issue distance and the party’s seat share from the previous
election (measured as a percentage of the total seats, which is 150). We
can estimate the conditional logit model using Stata’s clogit command.
However, before we can do this we need to restructure the data into a person-choice matrix.
As opposed to a normal data matrix, which contains one row per individual, a person-choice matrix contains M rows for each individual, each
row corresponding to an alternative in the choice set. Table 5.6 shows a
fragment of a person-choice matrix for the DPES data. This example illustrates several points very clearly. First, each individual appears four
times, i.e. as often as there are alternatives in the choice set. Second, a
dichotomous variable (“Choice”) keeps track of whether a respondent chose
a particular alternative. This corresponds to the dqi variable described before. Third, party attributes vary across the rows belonging to the same
individual. Fourth, individual characteristics are constant across the rows
belonging to the same individual.
10 These computations consist of creating a set of dummies indicating which party was chosen. These are then multiplied into the predicted probabilities and summed over the alternatives, thus creating a new variable. The mean of this variable is the APCC.
11 To obtain the baseline APCC, consider the proportions of votes for each party in the estimation sample. Assign these proportions as the choice probabilities to each sample unit and repeat the computations from fn. 10.
Table 5.6: Person-Choice Matrix

Id    Party    Choice    Issues    L-R    Age
1     PvdA     0         6.00      7      29
1     CDA      1         2.00      7      29
1     VVD      0         6.00      7      29
1     D66      0         6.00      7      29
2     PvdA     1         2.00      3      56
2     CDA      0         7.00      3      56
2     VVD      0         11.00     3      56
2     D66      0         2.00      3      56
Creating a person-choice matrix is greatly simplified through the mclgen program of John Hendrickx. In this case, we can simply run

mclgen varname

where varname is the choice variable in the original data set. Since the mclgen command transforms the data, you should make sure that the original data is saved before you run the command.12 The mclgen command adds two new variables to the data set: (1) _strata is a person identifier (Id in Table 5.6) and (2) _didep is a choice variable (Choice in Table 5.6). The new data set has 4N observations or, in general, MN observations. Due to missing data on the issue distance variables, 4N = 1676 for our analysis. We can now run the conditional logit analysis using Stata's clogit command. The syntax for our model is
12 After running the command, you may have to collapse some variables in the original data. Specifically, if you created separate variables for each party's attributes, you would want to stack them. For instance, imagine that we had created four different issue distance variables, one for each of the parties. We now want to combine them into a single issue distance variable. This can be done as follows:

gen dist=.
replace dist=pvdadist if vote==1
replace dist=cdadist if vote==2
replace dist=vvddist if vote==3
replace dist=d66dist if vote==4

where pvdadist, cdadist, vvddist, and d66dist are the party-specific issue distance variables.
Table 5.7: Conditional Logit Model of Dutch Vote Choice in 1994

Predictor          Estimate    SE
Issue Distance     −.390**     .027
Seat Share 1989    .021**      .005

Notes: n = 419, 4n = 1676. ℓ_1 = −364.331 at convergence; ℓ_0 = −580.857. LR = 433.05, p = .000. R²_McF = .373; R²_V&Z = .691. CC = 61.1%, APCC = .520. ** p < .01 (two-tailed).
clogit _didep varlist, group(_strata) [level(#) or]
where or shows the results in terms of odds ratios. In our model, we introduce issue distance and prior seat share as the predictors.13 The results
from this model are reported in Table 5.7. The results show a positive effect of prior seat share, suggesting that larger parties tend to attract more
votes. They show a negative effect for issue distance, which means that the
probability of voting for a party declines the further that party is removed
from the voter.
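As a minimal sketch (dist and seat89 are illustrative names for the issue distance and 1989 seat share variables; _didep and _strata are the variables created by mclgen), the model in Table 5.7 corresponds to:

clogit _didep dist seat89, group(_strata)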
Interpretation
Discrete Change We begin by interpreting the results using discrete
change in the predicted probabilities. Unfortunately, the prchange command does not work for conditional logit analysis, so we will have to create
the predictions ourselves. Setting 1989 seat share and issue distances to
the mean values for each party and varying the issue distance for the PvdA
from minimum to maximum, we obtain the following changes in the predicted probabilities: -.932 for PvdA, .240 for CDA, .227 for VVD, and .465
for D66.14 Thus, holding all else constant, there is a sizable decline in the
probability of voting for the PvdA, which benefits D66 the most.
13 Prior seat share is treated as an objective fact, not a perception by the voters. Thus, it varies only across parties and not across sample units.
14 The mean issue distance in the estimation sample is 10.4 for CDA, 9.4 for VVD, and 7.2 for D66. The 1989 seat share is 32.7 for PvdA, 36 for CDA, 14.7 for VVD, and 8 for D66. To compute the predicted vote probability for the PvdA when the issue distance to this party is zero (minimum), one computes

exp(−.390×0 + .021×32.7) / [exp(−.390×0 + .021×32.7) + exp(−.390×10.4 + .021×36) + exp(−.390×9.4 + .021×14.7) + exp(−.390×7.2 + .021×8)]

At the maximum issue distance (24) this becomes

exp(−.390×24 + .021×32.7) / [exp(−.390×24 + .021×32.7) + exp(−.390×10.4 + .021×36) + exp(−.390×9.4 + .021×14.7) + exp(−.390×7.2 + .021×8)]

The implications of changing the distance to the PvdA for the vote probabilities for other parties can be computed in a similar manner.
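For readers who wish to verify this hand calculation, Stata's display command can evaluate the expression directly (a sketch using the estimates and party means reported in this footnote):

display exp(-.390*0+.021*32.7)/(exp(-.390*0+.021*32.7) + exp(-.390*10.4+.021*36) + exp(-.390*9.4+.021*14.7) + exp(-.390*7.2+.021*8))

which returns a predicted PvdA vote probability of roughly .93 at the minimum distance.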
Marginal Effects and Elasticities The results can also be interpreted
using marginal effects and elasticities. While it is possible to use the mfx
command to compute marginal effects, these are not of the same variety
as those described in the formula earlier.15 But it is easy to compute the
marginal effects and elasticities by hand.16 The resulting marginal effects
are shown in Table 5.8. These clearly show the pattern of declining vote
probabilities for a party when the issue distance to that party increases by a
small amount. The table also shows that other parties benefit differentially
from such increases in the issue distance.
Table 5.9 converts the marginal effects into elasticities. Here we see that a one percent increase in the issue distance to the PvdA reduces the probability of voting for this party by roughly 2.5%. Such an increase boosts the probability of voting for the CDA by about 1%. The signs of the cross-elasticities imply that the four political parties serve as substitutes, as one might expect in an election. The elasticities also reveal
15 The mfx command does not compute marginal effects for individual observations and, what is more, assumes that the fixed effect is zero. See the Stata manuals for details.
16 To compute marginal effects at the mean, one starts by computing the predicted probabilities for all of the alternatives while setting all of the predictors equal to their alternative-specific means. The mean issue distances are 9.1, 10.4, 9.4, and 7.2 for PvdA, CDA, VVD, and D66, respectively. The seat shares for these parties are 32.7, 36, 14.7, and 8, respectively. The resulting predicted probabilities are .285 for PvdA, .184 for CDA, .174 for VVD, and .356 for D66. Once the predicted probabilities have been computed, the marginal effects follow in a straightforward manner. For example, the effect of a change in the issue distance with respect to the PvdA on the probability of voting for that party is

∂π_1/∂x_11 = .285 × (1 − .285) × −.390

where π_1 is the probability of a PvdA vote and x_11 is the issue distance for the PvdA. Similarly, the effect of a change in the issue distance with respect to the PvdA on the probability of voting for the CDA is

∂π_2/∂x_11 = .184 × (0 − .285) × −.390

The elasticity in the PvdA vote probability with respect to PvdA issue distance is 9.1 × (1 − .285) × −.390. The elasticity in the CDA vote probability with respect to PvdA issue distance is 9.1 × (0 − .285) × −.390.
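The same hand calculations can be scripted (a sketch using the coefficient −.390 and the predicted probabilities reported in fn. 16):

display .285*(1-.285)*(-.390)   // marginal effect of PvdA distance on the PvdA probability
display .184*(0-.285)*(-.390)   // marginal effect of PvdA distance on the CDA probability
display 9.1*(1-.285)*(-.390)    // self-elasticity for the PvdA
display 9.1*(0-.285)*(-.390)    // cross-elasticity for the CDA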
Table 5.8: Marginal Effects for the Conditional Logit Vote Choice Model

                              Change in Probability for
Change in Issue Distance    PvdA     CDA      VVD      D66
PvdA                        −.079    .020     .019     .040
CDA                         .020     −.059    .012     .026
VVD                         .019     .012     −.056    .024
D66                         .040     .026     .024     −.089

Notes: Table entries are marginal effects at the mean.
Table 5.9: Elasticities for the Conditional Logit Vote Choice Model

                              Change in Probability for
Change in Issue Distance    PvdA      CDA       VVD       D66
PvdA                        −2.538    1.011     1.011     1.011
CDA                         .746      −3.310    .746      .746
VVD                         .638      .638      −3.028    .638
D66                         1.000     1.000     1.000     −1.808

Notes: Table entries are elasticities computed at the mean. The diagonal elements are self-elasticities, while the off-diagonal elements are cross-elasticities.
the regularity that was discussed earlier, namely that the cross-elasticities
are identical for a change in the predictor. For example, a change in the issue
distance with respect to the PvdA results in identical cross-elasticities for
CDA, VVD, and D66. This is a direct consequence of the IIA assumption,
as we shall see below.
The Odds Ratio Finally, the results can be interpreted using the odds
ratio. This ratio can be obtained via the listcoef command. This yields
a factor change in the odds of .675 for issue distance. Thus the odds of
choosing, for example, PvdA compared to CDA decrease by a factor .675
when issue distance to the PvdA increases by one unit. This corresponds to
a 32.5% decrease in the odds, which can be considered sizable.
Model Fit
The conditional logit model fits the data quite nicely, as revealed by the
McFadden and Veall and Zimmermann pseudo-R2 measures (see Table 5.7).
The APCC is equal to .520.17 This is quite good, especially when compared to the null model. In that model, the choice probability is equal to .25 for all alternatives, so that the APCC under the null is only .25. CC = 61.1%, which is very good since a model without covariates would only classify 25% of the cases correctly.18
5.7.3 Adding Attributes of the Decision Maker to the Conditional Logit Model
Let us now combine the party and individual attributes into a single conditional logit model. We now need to make sure that the individual attributes
are interacted with a set of dummies for the alternatives, so that the effects
of those attributes can vary across choice options. Interaction expansion is
the easiest approach here. Specifically, we can issue the command
xi: clogit _didep dist i.vote i.vote|age i.vote|educ i.vote|income i.vote|male i.vote|religious i.vote|leftright, group(_strata)
17 The APCC can be computed in the following fashion:

predict p, pc1
gen dp=p*_didep

When one now obtains the total for dp and divides by the sample size, then one has the APCC.
18 The following syntax will yield the information for CC:

predict p, pc1
sort _strata
by _strata: egen prank=rank(p)
by _strata: gen cpred=0
by _strata: replace cpred=1 if prank==4
gen av=_didep*vote
gen pv=cpred*vote
tab av pv
where vote is the 4-category choice variable. This produces the results in
Table 5.10. Note that PvdA once again serves as the baseline category:
the parameters for this alternative are all fixed at zero.19 Interpretation
proceeds as before for the attributes of the voters and the parties.20
5.8 Independence of Irrelevant Alternatives
5.8.1 Meaning and Consequences
Independence of irrelevant alternatives (IIA) means that the ratio of the choice probabilities for two alternatives, i and j, is independent from all other alternatives, k, in the choice set (Arrow 1951). That is, regardless of whether those alternatives are in or out of the choice set, the probability of choosing i compared to the probability of choosing j remains the same. This property is built into Luce-type PCMs:

π_qi/π_qj = [exp(V_qi)/Σ_k exp(V_qk)] / [exp(V_qj)/Σ_k exp(V_qk)] = exp(V_qi)/exp(V_qj)
We see that the ratio of choice probabilities for i and j is driven entirely
by the systematic utility components of those alternatives, not by anything
else. It does not matter if we model the utilities in terms of the multinomial
logit model or in terms of the conditional logit model.
The IIA assumption is attractive from a computational point of view.
It means that we can assume independent and homoskedastic errors across
alternatives, which greatly simplifies the log-likelihood function. However,
from a substantive perspective the assumption is often quite unattractive. It
implies certain empirical regularities that may seem unreasonable, especially
when it comes to political choice behavior.
Consider, for example, a political system with three political parties.
Party R operates on the right of the political spectrum. Parties L1 and
19 I have dropped the seat share variable since it causes perfect collinearity and has a small effect to begin with.
20 Since we are estimating a conditional logit model, the null model is a model that predicts equal vote shares for the four parties. An alternative null model would be one based on estimating three party-specific constants. This model accommodates the empirical vote shares in the data. The model fit statistics shown in Table 5.10 are based on the equal vote share null model.
Table 5.10: Conditional Logit Model with Voter Attributes

Predictor               Estimate     SE
Issue Distance          −.337**      .029
Age × CDA               −.002        .017
Age × VVD               −.011        .016
Age × D66               −.053**      .014
Education × CDA         −.100        .197
Education × VVD         .183         .186
Education × D66         −.061        .150
Income × CDA            .106         .070
Income × VVD            .051         .064
Income × D66            −.006        .049
Male × CDA              −.084        .453
Male × VVD              −.253        .439
Male × D66              .094         .341
Religiosity × CDA       2.153**      .496
Religiosity × VVD       .812+        .431
Religiosity × D66       .723*        .342
Left-Right × CDA        .638**       .138
Left-Right × VVD        .681**       .133
Left-Right × D66        .302**       .102
CDA                     −5.153**     1.456
VVD                     −4.432**     1.339
D66                     .154         .974

Notes: n = 419, 4n = 1676. ℓ_1 = −317.161 at convergence; ℓ_0 = −580.857. LR = 527.39, p = .000. R²_McF = .454; R²_V&Z = .758. CC = 68.7%, APCC = .578. ** p < .01, * p < .05, + p < .10 (two-tailed).
L2 operate on the left of the political spectrum, roughly in the same spot.
Imagine that a voter treats L1 and L2 equivalently because they operate in
the same ideological location. Imagine further that the voter is indifferent
between the right-wing party and the parties on the left. We then would
expect choice probabilities πR = .5, πL1 = .25, and πL2 = .25—the voter
is equally likely to vote for the left and right, but two equally attractive
parties on the left cause the vote to be split there. The odds of voting R versus L1 thus depend on the presence of L2. If L2 is present, π_R/π_L1 = 2. However, if L2 should cease to exist, then π_R/π_L1 = 1, since the voter now simply faces a choice between a party from the right and one from the left, both of which are equally attractive. This seems a reasonable adjustment in the voter's behavior, but it violates the IIA assumption. Under IIA, the voter should be twice as likely to vote for R as for L1, even if L2 ceases to exist.
This kind of “foolish consistency” in behavior seems maladaptive, but that
is what IIA implies.
The reason we are concerned about the IIA assumption is that it affects
the consistency of the estimators of the multinomial and conditional logit
models. The estimators for these models are consistent only if the IIA
assumption holds. If the IIA assumption is incorrect, as will often be the
case, then consistency of the estimators is not guaranteed. This is sufficient
reason to test whether the IIA assumption holds. Several test procedures
are available.
5.8.2 Testing for IIA
Hausman-McFadden Test
The Hausman-McFadden (1984) test is based on a simple intuition: if IIA
holds, then the estimates obtained from considering two alternatives in isolation should be roughly comparable to those obtained from considering the
full choice set. After all, under IIA, the ratio πqi /πqj is constant across
alternative specifications of the choice set, which suggests that the parameters influencing the probabilities in this ratio are constant as well. Thus,
Hausman and McFadden suggest estimating the parameters in a subset of
the alternatives and again in the full set. A comparison of the estimates,
taking into consideration sampling variability, can then shed light on the
fate of the IIA assumption.
More formally, let M_f be a conditional or multinomial logit model that is estimated on the complete choice set. We assume that θ̂_f is consistent; that is, asymptotically θ̂_f converges to β in conditional logit and to γ_m in multinomial logit. Let M_s be a model estimated on the choice subset consisting of alternatives i and j. Under IIA, θ̂_s is also consistent. Under IIA, then, we would expect θ̂_s and θ̂_f both to converge to θ, and consequently θ̂_s − θ̂_f → 0 asymptotically. If we also take sampling fluctuation into consideration, we can construct the following test statistic:

T_HM = (θ̂_s − θ̂_f)′ [V̂_s − V̂_f]⁻¹ (θ̂_s − θ̂_f) ∼ χ²_p    (5.5)

where p is the number of predictors. Under IIA, T_HM should be close to zero. As the IIA assumption becomes less reasonable, we will obtain larger values of the test statistic; when T_HM exceeds the critical value, we reject the null hypothesis of IIA.
It is important to point out that the test is valid only if θ̂ f is consistent
and asymptotically efficient. It should also be stressed that all test results
are asymptotic. It is somewhat dangerous and, as such, not recommended to rely on this test in small samples. If the sample size is small and/or θ̂_f is not consistent, then T_HM may come out negative. The same can happen when the difference V̂_s − V̂_f is not positive definite, which occurs frequently because V̂_s − V̂_f → 0. In that case, inversion of the variance-covariance matrix is also problematic, further adding to computational difficulties.
As an illustration of the Hausman-McFadden test, we consider the impact of excluding the CDA on the estimates of a conditional logit model.
The model includes issue distance and party-specific constants only. The
latter are included to capture the electoral fortunes of the parties in as far
as these are not explained by issues. Considering the impact of removing
the CDA is substantively interesting: as a center party, it has been argued
that the CDA fundamentally alters the nature of the competition among the
remaining parties. We begin by estimating a model that includes the CDA
as an alternative and then repeat estimation with a choice set that excludes
this party. We then perform the Hausman-McFadden test. The syntax is as
follows:
xi: clogit _didep dist i.vote, group(_strata)
est store F
xi: clogit _didep dist i.vote if vote!=2, group(_strata)
est store S
hausman F S, alleq cons
where dist is issue distance and vote indicates the parties in the choice set.
The last command performs the hausman test on the estimates stored in F
and S. The alleq option indicates that all parameters are hypothesized to be
the same across the two models. The cons option tells Stata to include the
constant in the test as well.21 Stata shows the differences in the estimates
between the reduced and complete choice sets as -.027 for issue distance, as
-.062 for the party-specific constant for the VVD, and as -.050 for the partyspecific constant for D66 (PvdA serves as the baseline). These changes in
parameter estimates yield THM = 14.52 with 3 degrees of freedom, p = .002.
Thus, we reject the null hypothesis that the IIA assumption holds.
There is one important caveat with this conclusion. Although the test
statistic looks reasonable, Stata issues a clear warning that the variancecovariance is not positive-definite. It is not clear, then, that we can trust
the Hausman-McFadden test.
Seemingly Unrelated Estimations
Seemingly unrelated estimation or SUE is a methodology for combining
estimates from different estimations that may or may not be partially overlapping and that may or may not be based on the same estimation method.
The SUE method creates a proper variance-covariance matrix of the sandwich type (White 1982, 1994). This is one of the distinctive advantages
when compared to the Hausman-McFadden test because it means that tests
based on SUE are less likely to return inadmissible results.
The statistical rationale of SUE can be found in Weesie (1999) and White
(1982, 1994). Apart from the computation of the variance-covariance matrix,
its logic in the context of testing the IIA assumption is rather similar to
that of the Hausman-McFadden test. After running the conditional logit
commands on the full and reduced choice sets and storing the results, the
test proceeds as follows in Stata:
suest F S
test [F_didep=S_didep], cons common
This produces a test statistic of 4.33 with p = .228, suggesting that the estimates for the complete and reduced choice sets are not statistically distinct. Thus, the seemingly unrelated regression procedure does not provide evidence against the IIA assumption.

21 This is not relevant for the conditional logit model because we are not estimating a constant. However, if the command were used for the multinomial logit model, then it is important to test the constants as well.
Small-Hsiao Test
For the multinomial logit model, the IIA assumption may also be tested
using the Small-Hsiao (1982) exact test. This is a correction of a LR test
that compares the log-likelihood implied by the parameters of the full model
to the log-likelihood of a restricted model that comes about by dropping one
of the alternatives. This test was used by McFadden, Tye, and Train (1977)
and is given by
LR = −2(ℓ_f − ℓ_s)

where ℓ_s is the log-likelihood for the restricted choice set and ℓ_f is the log-likelihood for the complete choice set.
Small and Hsiao (1982) pointed out that this test is problematic because
it ignores the potential correlation between θ̂ s and θ̂ f . This tends to bias the
test in favor of the IIA assumption, i.e. it is less likely that the assumption
is rejected. To overcome this problem, Small and Hsiao propose dividing
the sample into two independent segments, A and B. Segment A is used
to estimate θ̂ f . Segment B is also used to estimate θ̂ f but, in addition,
produces θ̂ s . The estimates for the unrestricted choice set are combined
using
θ̂_f^AB = (1/√2) θ̂_f^A + (1 − 1/√2) θ̂_f^B
This combined estimate is then used to compute the log-likelihood on sample
B. This log-likelihood can then be compared to the restricted choice set log-likelihood to form

T_SH = −2(ℓ_f^B − ℓ_s^B)    (5.6)
In Stata, the Small-Hsiao test can be performed using the smhsiao command that was written by Nick Winter. The command has the following basic syntax:
smhsiao depvar varlist [, elim(yvalue)]
where yvalue is the alternative that we wish to remove from the choice set. When we run the command on the vote model from Table 5.2 and eliminate the CDA from the choice set, we obtain T_SH = 17.766, p = .013.
Thus, assuming a Type-I error rate of .05, we reject the null hypothesis of
IIA for the multinomial logit model.22
22 The test divides the sample randomly into two roughly equally sized groups. Since the groups can vary across iterations, so can the values of T_SH. Thus, I recommend running the test a couple of times and comparing the results. Alternatively, one can add the s(varname) option, where varname specifies the name of a variable that you have created that splits the sample into two groups.
Chapter 6
Multinomial Probit
In the previous chapter, we discovered that one of the liabilities of the multinomial and conditional logit models is that they make an IIA assumption
that does not always appear reasonable. Because of this problem, some
scholars (e.g. Alvarez and Nagler 1998) advocate the use of multinomial
probit analysis. The multinomial probit model avoids the IIA assumption.
Unfortunately, the model is much more difficult to estimate than the models discussed in the previous chapter. Indeed, until recently estimation of
the model was so complicated that it was of theoretical interest only. The
development of maximum simulated likelihood estimation (and other algorithms) has changed that situation, although multinomial probit models
remain highly complex.
6.1 The Model
6.1.1 Distributional Assumptions
It is possible to remove the IIA assumption by allowing the stochastic elements of a decision maker's utilities over the alternatives to be correlated and heteroskedastic. Specifically, in the expression U_i = V_i + ε_i, we can assume that the εs follow a multivariate normal distribution. More precisely, let ε′ = (ε_1 ε_2 · · · ε_M) be a vector of random utility components over the alternatives. Then, it is assumed that ε ∼ N(0, Σ), where

Σ = [ σ_1²    σ_12    · · ·    σ_1M
      σ_12    σ_2²    · · ·    σ_2M
       ⋮       ⋮       ⋱        ⋮
      σ_1M    σ_2M    · · ·    σ_M² ]
Note that there are two key differences with Lucean choice models: (1) the
diagonal elements of Σ are not constrained to be identical and fixed and (2)
the off-diagonal elements are not constrained to be zero.
6.1.2 Choice Probabilities
Three Alternatives
We can now relate the distributional assumption to a choice mechanism.
To focus our thoughts, I will again start by presenting the results for a
choice problem containing three alternatives (e.g. voting for Labour, the
Conservatives, or the Liberal Democrats in the U.K.). Furthermore, I shall
focus on the probability that the first alternative (e.g. Labour) is chosen.1
From the earlier discussion, Y = 1 if U_1 > U_2 and U_1 > U_3. Substituting U_i = V_i + ε_i yields the condition that V_1 + ε_1 > V_2 + ε_2 and V_1 + ε_1 > V_3 + ε_3. We can reformulate this condition by rearranging the terms (see Chapter 4): ε_2 − ε_1 < V_1 − V_2 and ε_3 − ε_1 < V_1 − V_3. In probabilistic terms,

Pr(Y = 1) = Pr(ε_2 − ε_1 < V_1 − V_2 ∩ ε_3 − ε_1 < V_1 − V_3)

To simplify notation, let η_ji = ε_j − ε_i and let V_ij = V_i − V_j; then

Pr(Y = 1) = Pr(η_21 < V_12 ∩ η_31 < V_13)

Evaluation of this last expression requires the introduction of the normality assumption. In addition, we need to invoke some basic statistical results about linear composites, since η_ji is a linear function of two stochastic variables. Specifically, E[η_ji] = 0, V[η_ji] = σ_i² + σ_j² − 2σ_ij, and E[η_ji, η_ki] = σ_i² + σ_jk − σ_ij − σ_ik.2 Further, when the εs follow a multivariate normal distribution, then the ηs also follow a multivariate normal distribution.
With these results in place, we now know that η21 and η31 follow a
bivariate normal distribution with covariance matrix
Σ_1 = [ σ_1² + σ_2² − 2σ_12           σ_1² + σ_23 − σ_12 − σ_13
        σ_1² + σ_23 − σ_12 − σ_13     σ_1² + σ_3² − 2σ_13 ]
1 As before, individual subscripts are suppressed in the formulas for the sake of simplicity.
2 These results should look familiar, but they are also easily derived. First, E[η_ji] = E[ε_j − ε_i] = E[ε_j] − E[ε_i] = 0 − 0 = 0. Second, V[η_ji] = E[(ε_j − ε_i)²] = E[ε_j² − 2ε_jε_i + ε_i²] = E[ε_j²] − 2E[ε_jε_i] + E[ε_i²] = σ_j² − 2σ_ij + σ_i². Finally, E[η_ji, η_ki] = E[(ε_j − ε_i)(ε_k − ε_i)] = E[ε_jε_k − ε_jε_i − ε_kε_i + ε_i²] = E[ε_jε_k] − E[ε_jε_i] − E[ε_kε_i] + E[ε_i²] = σ_jk − σ_ij − σ_ik + σ_i².
Let us refer to this distribution as Φ_2(η_21, η_31), where Φ_2(·) stands for the bivariate normal distribution. Then it follows that the probability of choosing alternative 1 is given by

π_1 = ∫_{−∞}^{V_12} ∫_{−∞}^{V_13} Φ_2(η_21, η_31) dη_21 dη_31
M Alternatives
The derivations above imply the following general formula for the choice probabilities of the multinomial probit model:

π_i = ∫_{−∞}^{V_i1} ∫_{−∞}^{V_i2} · · · ∫_{−∞}^{V_iM} φ_{M−1}(η_1i, η_2i, · · ·, η_Mi) dη_1i dη_2i · · · dη_Mi    (6.1)

Note that if there are M alternatives, this is an (M − 1)-variate integral. The function φ_{M−1} is an (M − 1)-variate normal distribution. The covariance matrix of this distribution has diagonal elements of the form σ_i² + σ_j² − 2σ_ij (for j = 1, 2, · · ·, M and j ≠ i) and off-diagonal elements of the form σ_i² + σ_jk − σ_ij − σ_ik (for j, k = 1, 2, · · ·, M and j ≠ i, k ≠ i, and j ≠ k). Thus,
Σ_i = [ σ_i² + σ_1² − 2σ_i1               σ_i² + σ_12 − σ_i1 − σ_i2      · · ·   σ_i² + σ_1M − σ_i1 − σ_iM
        σ_i² + σ_12 − σ_i1 − σ_i2         σ_i² + σ_2² − 2σ_i2            · · ·   σ_i² + σ_2M − σ_i2 − σ_iM
         ⋮                                 ⋮                              ⋱       ⋮
        σ_i² + σ_1M − σ_i1 − σ_iM         σ_i² + σ_2M − σ_i2 − σ_iM      · · ·   σ_i² + σ_M² − 2σ_iM ]    (6.2)
The multinomial probit model avoids the IIA assumption because the
ratio of two choice probabilities depends on all of the alternatives in the
choice set. Specifically,
π_i/π_j = [∫_{−∞}^{V_i1} ∫_{−∞}^{V_i2} · · · ∫_{−∞}^{V_iM} φ_{M−1}(η_1i, η_2i, · · ·, η_Mi) dη_1i dη_2i · · · dη_Mi] / [∫_{−∞}^{V_j1} ∫_{−∞}^{V_j2} · · · ∫_{−∞}^{V_jM} φ_{M−1}(η_1j, η_2j, · · ·, η_Mj) dη_1j dη_2j · · · dη_Mj]
We see that both the numerator and the denominator involve the remaining
alternatives. Thus, removing one of these alternatives will impact the ratio
πi /πj .
6.1.3 Determinants of Choice
The development of the multinomial probit model so far does not consider
any covariates. The manner in which covariates are handled is to specify Vi
in terms of a linear function of the predictors:

V_qi = x′_qi β + z′_q γ_i

where x_qi is a set of attributes of alternative i as perceived by decision maker q and z_q is a set of attributes of q. In terms of V_ij this implies

V_qij = V_qi − V_qj = (x′_qi − x′_qj)β + z′_q(γ_i − γ_j)    (6.3)
Thus, the impact of attributes of the decision maker flows through the different effects that they have on Vij , whereas the effect of attributes of the
alternatives flows through differences in attribute values between i and j.
6.1.4 Identification
As described by (6.1)-(6.3), the multinomial probit model is unidentified.
That is, multiple values of β and γ m are consistent with the same choice
probabilities πqi . The identification problem has two roots. First, for the
choice probabilities all that matters are utility comparisons (i.e. is Ui greater
than Uj ?) If we try to say something about the actual utility level, as in
(6.2) then there is simply not enough information in the choice probabilities
to do this. Second, as we saw with binary choice models, the scale of utility
is arbitrary. This means that the scale can be changed arbitrarily, resulting
in a different set of parameter estimates.
The identification problem is resolved by imposing a number of identifying constraints. First, we focus on utility differences. This means that
we draw comparisons to a baseline category, which is usually the first alternative. Thus, we set γ_1 = 0 for the baseline category. In addition, ε_1 is assumed to be uncorrelated with the other errors and to have a variance of 1. Second, we fix the utility scale by setting V[ε_j] = 1 for one of the non-baseline alternatives (typically j = 2).3 Thus, in total, M − 2 variances are estimated and M(M − 3)/2 + 1 correlations. Another way to think of this is that Σ contains at most M(M − 1)/2 − 1 identifiable parameters.
Keane (1992) warns that these are the absolute minimum identification
requirements. They are necessary but not always sufficient to ensure identification. In practice, it happens frequently that one has to impose additional
restrictions. One can do so either by constraining the effects of predictors
or by imposing more structure onto Σ. In terms of the latter approach, a
3 Alternatively, we can set V[η_ij] = 1 (typically j = 2, i = 1). This is the approach that Train (2003) uses.
couple of possibilities suggest themselves. One is to impose a homoskedastic
error structure, whereby σi2 = σ 2 for all i. Another approach is to assume an
exchangeable correlation structure, which means that all of the correlation
parameters are assumed to be equal (ρ_ij = ρ ∀ i ≠ j). It is also possible to
impose other restrictions. For example, one can assume partial or complete
independence among the alternatives. With partial independence, the errors
of some alternatives are correlated while the remaining errors are treated as
independent. With complete independence, all of the errors are treated as
independent. Of course, if we combine independence with homoskedasticity, then the multinomial probit model is similar to the conditional logit
model described in the previous section, the only difference being a different
distributional assumption.
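In Stata's asmprobit command (introduced in Section 6.5), such restrictions can be requested through options. A minimal sketch, assuming the person-choice data and the variable names used later in this chapter:

asmprobit _didep dist, case(_strata) alternatives(vote) correlation(exchangeable) stddev(homoskedastic)

Here correlation(exchangeable) forces all error correlations to be equal and stddev(homoskedastic) forces the error standard deviations to be equal; replacing these with correlation(independent) would impose complete independence among the alternatives.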
Example 6.1. To illustrate the minimum identifying assumptions, let us consider once more a choice problem with three alternatives. We treat the 1st alternative as the baseline. Thus, γ_1 = 0, σ_1² = 1, and σ_1j = 0 (for j ≠ 1). Further, we set σ_2² = 1. With these constraints in place,

π_q1 = ∫_{−∞}^{V_q12} ∫_{−∞}^{V_q13} φ_2(η_q21, η_q31) dη_q21 dη_q31

where V_q12 = (x′_q1 − x′_q2)β − z′_q γ_2, V_q13 = (x′_q1 − x′_q3)β − z′_q γ_3, and φ_2 is the bivariate normal PDF with covariance matrix

Σ_1 = [ 2           1 + σ_23
        1 + σ_23    1 + σ_3² ]

For the second alternative, we have

π_q2 = ∫_{−∞}^{V_q21} ∫_{−∞}^{V_q23} φ_2(η_q12, η_q32) dη_q12 dη_q32

where V_q21 = (x′_q2 − x′_q1)β + z′_q γ_2, V_q23 = (x′_q2 − x′_q3)β + z′_q(γ_2 − γ_3), and φ_2 is the bivariate normal PDF with covariance matrix

Σ_2 = [ 2           1 − σ_23
        1 − σ_23    1 + σ_3² − 2σ_23 ]

Finally,

π_q3 = ∫_{−∞}^{V_q31} ∫_{−∞}^{V_q32} φ_2(η_q13, η_q23) dη_q13 dη_q23

where V_q31 = (x′_q3 − x′_q1)β + z′_q γ_3, V_q32 = (x′_q3 − x′_q2)β + z′_q(γ_3 − γ_2), and φ_2 is the bivariate normal PDF with covariance matrix

Σ_3 = [ 1 + σ_3²       σ_3² − σ_23
        σ_3² − σ_23    1 + σ_3² − 2σ_23 ]
Since identification is so problematic, it is particularly important that
one pays close attention to any signs of under-identification. These include
convergence problems (concavity warnings and backups—see Chapter 9),
but also problematic estimates from seemingly successful estimations. In this
regard, it is particularly important to pay attention to corner solutions,
i.e. correlations between error terms that are pushed toward 0 or 1. These
frequently signal identification problems and should not be taken at face
value.
6.2 Estimation
Estimation of the multinomial probit model can be quite complicated. The
reason is that the choice probabilities, and therefore the log-likelihood function, involve multiple integrals. The evaluation of such integrals remains a
computational challenge and until recently many believed that the multinomial probit model was merely of theoretical interest (see e.g. Maddala
1983). Computational advances in recent years, however, have made estimation feasible, although it remains complex.
6.2.1 Maximum Likelihood Estimation
The likelihood function for the multinomial probit model is given by

L = ∏_q ∏_i π_qi^{d_qi}

where d_qi is a binary indicator of whether alternative i was chosen or not (cf. the conditional and multinomial logit models). The log-likelihood function is

ℓ = Σ_q Σ_i d_qi ln(π_qi)
This looks innocent enough; indeed, this is exactly the same as what we had
for the conditional/multinomial logit models. Once we substitute for πqi ,
however, things look considerably nastier:
ℓ = Σ_q Σ_i d_qi × ln[ ∫_{−∞}^{V_i1} ∫_{−∞}^{V_i2} · · · ∫_{−∞}^{V_iM} φ_{M−1}(η_1i, η_2i, · · ·, η_Mi) dη_1i dη_2i · · · dη_Mi ]
If the number of alternatives is three, then it turns out that optimization
of this log-likelihood is not too complicated. In this case, we need only to
evaluate a double integral. Moreover, the gradient involves only a single
integral. Since optimization occurs in terms of the gradient, MLE thus
requires that we evaluate a single integral and this generally does not pose
too many difficulties if the model is properly identified.
Things become considerably hairier, however, when the number of alternatives exceeds three. In this case, the gradients involve multiple integrals.
These remain difficult to evaluate even with the computational power that
is available nowadays. In general, the log-likelihood can be evaluated only
if the integrals can be approximated or bypassed altogether. There are several ways to accomplish this but the most common approach is maximum
simulated likelihood estimation.
6.2.2 Maximum Simulated Likelihood Estimation
The intuition behind maximum simulated likelihood estimation (MSLE) is
that estimation of the multinomial probit model would be straightforward
if we could use simulated values of the πqi in the log-likelihood function
(instead of having to derive those values through the evaluation of complex
integrals). Lerman and Manski (1981) were the first to propose this estimation procedure. The general idea is to perform model estimation in two
steps: (1) simulate πqi and then (2) maximize
ℓ = Σ_q Σ_i d_qi ln(π̃_qi)
where π̃qi denotes the simulated choice probability for alternative i in individual q.
The trick is, of course, to obtain values of π̃qi that perform well in terms
of approximating the true πqi s. This problem was solved in the 1990s when
Geweke (1991), Keane (1990, 1994), and Hajivassiliou and McFadden (1998;
Hajivassiliou, McFadden, and Ruud 1996) independently developed similar
methods for simulating multivariate normal probabilities. These methods
are now commonly referred to under the name of the Geweke-Hajivassiliou-Keane or GHK estimator. The GHK estimator is the most widely used
MSLE method because it is more reliable and more accurate than alternative approaches (see Hajivassiliou, McFadden, and Ruud 1996). The details
of the GHK estimator are complex—they are described in the Appendix to
this chapter.
6.2.3 Alternative Estimators
GHK is not the only approach toward estimating the multinomial probit
model. An older approach approximates multiple integrals by Gaussian
quadrature (see Greene 2003; Maddala 1983). While a lot of progress has
been made on numerical integration through adaptive Gaussian quadrature,
the approach is computationally intensive, resulting in typically slow convergence. There also exist alternative simulation methods. These include the
method of simulated moments (MSM; McFadden 1989) and Markov chain Monte Carlo (MCMC) methods (e.g. Gibbs sampling; Dorfman 1996, Imai and van Dyk 2004). The performance of GHK relative to MSM and Gibbs
is generally good, especially in the context of cross-sectional data analysis.
6.3 Interpretation
6.3.1 Marginal Effects and Elasticities
Marginal effects provide a useful method of interpretation as they show the
change in predicted probabilities for an infinitesimally small change in the
value of a predictor. Marginal effects formulas for the multinomial probit
model are difficult to derive and rarely shown in textbooks and articles. An
exception is Dorfman (1996), who shows the marginal effects for a model
with four alternatives. It is perhaps useful to show the key results from his
paper.
Let Vib be the systematic utility difference between alternative i and the
baseline category. Without loss of generality, let the first alternative be the
baseline alternative so that
V_ib = V_i1 = (x′_i − x′_1)β + z′γ_i = w′_i λ_i

Thus,

U_ib = w′_i λ_i + (ε_i − ε_b) = w′_i λ_i + ν_i

It is further assumed that V[ε_2] has been fixed to 1 for identification purposes. Because the first alternative serves as the baseline, V[ε_1] = 1 and all covariances involving this alternative are zero. Then the marginal effect for the second alternative is given by:
∂π_2/∂w_l = φ(−w′_2λ_2)[1 − Φ((w′_3λ_3 − w′_2λ_2)/ψ_23)][1 − Φ((w′_4λ_4 − w′_2λ_2)/ψ_24)] λ_2l
          + [1 − Φ(−w′_2λ_2)] φ((w′_3λ_3 − w′_2λ_2)/ψ_23)[1 − Φ((w′_4λ_4 − w′_2λ_2)/ψ_24)] (λ_2l − λ_3l)
          + [1 − Φ(−w′_2λ_2)][1 − Φ((w′_3λ_3 − w′_2λ_2)/ψ_23)] φ((w′_4λ_4 − w′_2λ_2)/ψ_24) (λ_2l − λ_4l)
Here ψ_jk denotes the standard deviation of the difference in the error terms, which is used to normalize the random variables to have unit variances. Specifically, ψ_23² = V[ν_2 − ν_3] = V[ε_2] + V[ε_3] − 2cov[ε_2, ε_3] = 1 + V[ε_3] − 2cov[ε_2, ε_3] and ψ_24² = V[ν_2 − ν_4] = V[ε_2] + V[ε_4] − 2cov[ε_2, ε_4] = 1 + V[ε_4] − 2cov[ε_2, ε_4]. Terms involving only w′_2λ_2 already have a unit variance.4 Marginal effects for the remaining non-baseline categories can be obtained in a similar manner. For instance,
∂π_3/∂w_l = φ(−w′_3λ_3/ψ_33)[1 − Φ((w′_2λ_2 − w′_3λ_3)/ψ_32)][1 − Φ((w′_4λ_4 − w′_3λ_3)/ψ_34)] λ_3l
          + [1 − Φ(−w′_3λ_3/ψ_33)] φ((w′_2λ_2 − w′_3λ_3)/ψ_32)[1 − Φ((w′_4λ_4 − w′_3λ_3)/ψ_34)] (λ_3l − λ_2l)
          + [1 − Φ(−w′_3λ_3/ψ_33)][1 − Φ((w′_2λ_2 − w′_3λ_3)/ψ_32)] φ((w′_4λ_4 − w′_3λ_3)/ψ_34) (λ_3l − λ_4l)
Finally, for the baseline alternative we obtain

∂π_1/∂w_l = −φ(−w′_2λ_2) Φ(−w′_3λ_3/ψ_33) Φ(−w′_4λ_4/ψ_44) λ_2l
           − Φ(−w′_2λ_2) φ(−w′_3λ_3/ψ_33) Φ(−w′_4λ_4/ψ_44) λ_3l
           − Φ(−w′_2λ_2) Φ(−w′_3λ_3/ψ_33) φ(−w′_4λ_4/ψ_44) λ_4l
The marginal effects can be turned into elasticities by multiplying by the
ratio of the covariate value and the predicted probabilities. The resulting
quantities then show the percentage change in the predicted probability due
to a one percent change in the predictor. Unlike the conditional and multinomial logit models, the cross-elasticities are not assumed to be equivalent.
6.3.2 Discrete Change in Predicted Probabilities
Predicted probabilities provide an alternative method of interpretation. Here
we rely on the equations (6.1)-(6.3). The computations involve the evaluation of multiple integrals and, for this reason, a variety of algorithms have
been designed to approximate the predicted probabilities (see Hajivassiliou,
McFadden, and Ruud 1996; Lai 2006; Train 2003; also see Tong 1990). The
appendix describes a procedure based on Gauss-Hermite quadrature, which
is relatively easily implemented in Stata (albeit tedious).
4 These terms come about by taking the difference U_bb − U_2b = −U_2b = −w′_2λ_2 + ν_1 − ν_2. Keeping in mind that ν_1 = ε_1 − ε_1 = 0 and that ε_1 = 0, the only relevant part of the error term is −ν_2 = −ε_2 + ε_1 = −ε_2. The variance of this term has been constrained to 1, so that unit variance has already been achieved.
One can use this procedure to evaluate predicted probabilities under
different scenarios and in this way compute discrete changes. For example,
while holding all else at the mean, one could move one of the covariates from
its minimum to its maximum, in order to compute the maximum discrete
change. It is also possible to compute the average absolute discrete change
(Long 1997).
6.4 Model Fit
The best approach to assessing model fit is to compute the percentage of
correct predictions or the average probability of correct predictions. These
can be computed in a similar way to the conditional logit model. In general,
pseudo-R2 measures are not computed for the multinomial probit model.
Such measures rely on the value of the log-likelihood function but, using
the estimation procedure outlined earlier, we have information only about
the simulated likelihood. For this same reason, hypotheses are generally not
evaluated using a likelihood ratio test but using a Wald test.
6.5 Example
6.5.1 Estimation Results
To illustrate the multinomial probit model let us consider again vote choice
in the Netherlands with the alternatives specified as PvdA, CDA, VVD,
and D66. We estimate a somewhat simpler model than we have done so
far, namely one that includes issue distance, left-right self-placement, and
religiosity as predictors. The model is estimated using Stata’s asmprobit
command, which is available as of version 9.0. The syntax is as follows:

asmprobit _didep varlist1, casevars(varlist2) case(_strata) alternatives(varname) [structural intmethod(halton|hammersley|random) difficult]
Here, it is assumed that the data have been formatted as a person-choice
matrix. The variable didep is a binary indicator of the alternative that was
chosen. The variable listing these alternatives is specified using the keyword
alternatives(). Further, varlist1 contains attributes of the alternatives,
whereas varlist2 contains attributes of the decision makers. If the second
variable list is omitted, then Stata will still compute a series of alternative-specific constants for the non-baseline alternatives. The structural option
Table 6.1: Multinomial Probit Model of Dutch Vote Choice in 1994

Predictor         CDA         VVD         D66
Issue Distance    −.233**     −.233**     −.233**
Religious         1.453**     .523+       .429
L-R Placement     .437**      .475**      .234**
Constant          −3.148**    −2.730**    −1.636**

Notes: 4n = 1676, n = 419. Table entries are maximum simulated likelihood multinomial probit coefficients. PvdA is the baseline category. CC = 67.1%, APCC = .551. + p < .10, ** p < .01 (two-tailed).
will display the covariance structure of the original error terms, i.e. Σ̂,
instead of the covariances between the differences in error terms, i.e. Σ̂_i.
The intmethod() option specifies the manner in which random numbers
should be generated for the draws from the truncated normal distributions.
Finally, difficult helps to alleviate convergence problems due to nearly flat
regions of the log-likelihood function. Since estimation of the multinomial
probit model is complex, I recommend adding this option.5 Table 6.1 shows
the resulting estimates, while Table 6.2 shows the covariance and correlation
matrices of the stochastic components. These estimates assume that PvdA
is the baseline category.
In terms of the estimates, we see again that the probability of voting for a
party decreases the further this party is removed from the voter on the issues.
Religious people are more likely to vote for the CDA compared to the PvdA,
but religion does not much influence the probabilities of voting for VVD and
D66. Left-right ideological placement exerts a significant positive effect
for all three of the non-baseline parties, meaning that people to the right of
the political spectrum are more likely to vote for those parties compared to
5 Even with this option convergence is not guaranteed. Convergence problems may be alleviated by increasing the number of points used in the Halton and Hammersley sequences (the default is 50, but by specifying intpoints(#) this can be increased). It may also be useful to specify antithetics as an option, which means that antithetic draws are used in the sequence. This is tantamount to doubling the number of points in the sequence because for every point x in the sequence now its antithetic, 1 − x, is used as well. It is occasionally also helpful to switch to the Newton-Raphson algorithm with the difficult option specified. As a last resort, one could increase the tolerance, but this approach is risky because it can cut short iterations too early.
Table 6.2: MNP Error Covariances and Correlations

           PvdA        CDA        VVD        D66
PvdA       1.000
           (1.000)
CDA        .000        1.000
           (.000)      (1.000)
VVD        .000        .182       1.043
           (.000)      (.179)     (1.000)
D66        .000        .083       .408       1.275
           (.000)      (.074)     (.354)     (1.000)

Notes: Table entries are estimated error variances and covariances with the corresponding correlations in parentheses. Note that the covariances for PvdA have been constrained to 0 for identification purposes. Identification needs also resulted in a restriction on the variances for PvdA and CDA (both error variances were constrained to 1).
PvdA. In terms of the error variances, it should be noted that these do not
significantly deviate from 1. None of the covariances is significantly different
from zero. This calls into question whether it is necessary to perform a
multinomial probit analysis.
6.5.2 Interpretation
Marginal Effects
Stata makes it relatively easy to obtain marginal effects for the multinomial
probit model. The necessary command sequence is
predict p if e(sample)
estat mfx
The first command computes predicted probabilities, while the second command parlays those probabilities into marginal effects.6
Table 6.3 shows the marginal effects for the issue distance variable. Each
column reflects the implications for the choice probabilities of increasing
the issue distance to one of the parties by a small amount. For example,
6 At present, the estat mfx command has no option for computing elasticities.
Table 6.3: Marginal Effects for the Multinomial Probit Vote Choice Model

                Change in Issue Distance to
Effect on     PvdA     CDA      VVD      D66
PvdA          −.059    .014     .019     .026
CDA           .014     −.043    .013     .015
VVD           .019     .013     −.063    .030
D66           .026     .015     .030     −.072

Notes: Table entries are marginal effects for issue distance. All marginal effects are significant at the .01 level or better.
increasing the distance to the CDA by a very small amount increases the
vote probability for the PvdA by .014 points, while decreasing the vote
probability for the CDA by .043 points.
Discrete Change
At present, Stata does not have a routine to compute discrete changes in predicted probabilities. However, these can be computed fairly easily through
Stata’s new matrix language, Mata, which is available as of version 9.0. The
Appendix shows the procedure for doing this.
Table 6.4 shows maximum discrete changes in the probabilities of voting
for PvdA, CDA, VVD, and D66 as a function of moving a person from the
extreme left to the extreme right on the ideological scale while holding issue
distance and religiosity at the mean. We see that moving from the left to
the right reduces the predicted vote probabilities for PvdA and D66 by .643
and .117 points, respectively. It increases the predicted vote probabilities
for CDA and VVD by .248 and .512 points, respectively. It is clear that
choice behavior for PvdA and VVD is particularly responsive to ideological
change; the vote probabilities for CDA and D66 seem less sensitive to this
variable.
6.5.3 Model Fit
Looking at Table 6.1, the model has a reasonably good fit. It classifies 67.1%
of the cases correctly, which is a considerable improvement over performance
when all covariate information is ignored (30.5% correct classifications). The
Table 6.4: Discrete Changes in Predicted Probabilities

L-R          PvdA     CDA     VVD     D66
Minimum      .667     .020    .022    .290
Maximum      .024     .268    .534    .173
Change       −.643    .248    .512    −.117

Notes: Table entries are maximum discrete changes in predicted probabilities as a function of changing left-right self-placement. The remaining predictors are kept constant at the mean.
APCC value is also respectable at .551.
6.6 Appendices
6.6.1 The Method of Maximum Simulated Likelihood
How does the GHK-estimator simulate the probabilities πqi ?7 Unfortunately, to answer this question we need a fairly technical discussion. As
always, feel free to skip this material if the math is too overwhelming. In
this case, all you need to remember is that it is possible to accurately simulate the choice probabilities of the multinomial probit model so that MSLE
estimation of that model is feasible.
Simulation with Three Alternatives
Let us first consider a choice model with three alternatives, where we focus on the probability of choosing the 1st alternative. As per our earlier discussion, alternative 1 is chosen if U_q1 > U_q2 and if U_q1 > U_q3. Here, U_qi = V_qi + ε_qi. Taking utility differences with respect to the 1st alternative we get
U_qj − U_q1 = (V_qj − V_q1) + (ε_qj − ε_q1) = −V_q1j + η_qj1
for j = 2, 3. Note that, in terms of these utility differences, the condition
Uq1 > Uqj becomes Uqj − Uq1 < 0 or −Vq1j + ηqj1 < 0 for j = 2, 3. As
7 For a slightly different discussion of the GHK-approach, as well as a political science application, see Lawrence (1997). The discussion here draws heavily from Train (2003).
discussed earlier, the vector η′_q1 = (η_q21 η_q31) has a variance-covariance matrix of

Σ_1 = [ σ_1² + σ_2² − 2σ_12           σ_1² + σ_23 − σ_12 − σ_13
        σ_1² + σ_23 − σ_12 − σ_13     σ_1² + σ_3² − 2σ_13 ]
We begin by performing a Cholesky decomposition of Σ_1.8 This produces a lower triangular matrix, L_1, of the variety

L_1 = [ c_aa    0
        c_ab    c_bb ]

where c_aa = √(σ_1² + σ_2² − 2σ_12), c_ab = (σ_1² + σ_23 − σ_12 − σ_13)/√(σ_1² + σ_2² − 2σ_12), and c_bb² = σ_1² + σ_3² − 2σ_13 − (σ_1² + σ_23 − σ_12 − σ_13)²/(σ_1² + σ_2² − 2σ_12).
Using the Cholesky decomposition, the original correlated error terms η can be expressed in terms of a series of uncorrelated standard normal deviates, e. That is,9

η_q1 = L_1 e_q1

or, written out,

[ η_q21 ]   [ c_aa    0    ] [ e_q21 ]
[ η_q31 ] = [ c_ab    c_bb ] [ e_q31 ]

This means that

η_q21 = c_aa e_q21
η_q31 = c_ab e_q21 + c_bb e_q31
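The Cholesky step is easy to inspect in Mata. A small sketch with a hypothetical covariance matrix (the numbers are made up purely for illustration):

mata:
Sigma1 = (2, .5 \ .5, 1.5)      // hypothetical variance-covariance matrix of the etas
L1 = cholesky(Sigma1)           // lower-triangular factor: caa=L1[1,1], cab=L1[2,1], cbb=L1[2,2]
L1*L1'                          // reproduces Sigma1
end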
8 Cholesky decomposition takes a symmetric (positive-definite) matrix, A, and decomposes it into a lower-triangular matrix, L, such that A = LL′.
9 It is easily verified that L_1 e_q1 has the same statistical properties as η_q1. First, E[η_q1] = 0 and the same is true for E[L_1 e_q1]:

E[L_1 e_q1] = L_1 E[e_q1] = L_1 0 = 0

where E[e_q1] = 0 because we are dealing with standard normal deviates. Second,

E[L_1 e_q1 (L_1 e_q1)′] = E[L_1 e_q1 e′_q1 L′_1] = L_1 E[e_q1 e′_q1] L′_1 = L_1 I L′_1 = L_1 L′_1 = Σ_1

where E[e_q1 e′_q1] = I because we are dealing with uncorrelated standard normal deviates.
Moreover,
Uq2 − Uq1 = −Vq12 + caa eq21
Uq3 − Uq1 = −Vq13 + cab eq21 + cbb eq31
We can now turn to the choice probabilities. The probability of choosing the first alternative is given by

π_q1 = Pr(−V_q12 + η_q21 < 0 ∩ −V_q13 + η_q31 < 0)

In terms of the Cholesky decomposition this can be written as

π_q1 = Pr(−V_q12 + c_aa e_q21 < 0 ∩ −V_q13 + c_ab e_q21 + c_bb e_q31 < 0)

Using the well-known result from probability theory that Pr(A ∩ B) = Pr(A)Pr(B|A), we can also write this as

π_q1 = Pr(−V_q12 + c_aa e_q21 < 0) × Pr(−V_q13 + c_ab e_q21 + c_bb e_q31 < 0 | −V_q12 + c_aa e_q21 < 0)
     = Pr(e_q21 < V_q12/c_aa) × Pr(e_q31 < (V_q13 − c_ab e_q21)/c_bb | e_q21 < V_q12/c_aa)
     = Φ(V_q12/c_aa) ∫_{−∞}^{V_q12/c_aa} Φ((V_q13 − c_ab e_q21)/c_bb) φ(e_q21) de_q21

where Φ(·) is the standard normal CDF and φ(·) is the standard normal PDF. The first term of this expression poses no computational difficulties, as statistical packages nowadays all include fast routines for evaluating the cumulative standard normal distribution. The second term is computationally complex and it is here that the simulation takes place. Note that this term integrates over a truncated standard normal variate, where the truncation is from above at V_q12/c_aa.10
The simulation now proceeds as follows. We begin by drawing a value e_q21 from a standard normal distribution that is truncated from above at V_q12/c_aa. For this draw we then compute Φ[(V_q13 − c_ab e_q21)/c_bb]. We repeat the process R times and average the results. This average is an approximation of ∫ Φ[(V_q13 − c_ab e_q21)/c_bb] φ(e_q21) de_q21. By multiplying this average by
10 A truncated distribution arises when we are able to observe only part of the distribution. With truncation from above at some point t, we are only able to observe the distribution for values y < t.
Φ(Vq12 /caa ) we obtain π̃q1 , the simulated value of πq1 . The probabilities for
other alternatives can be derived in a similar manner.
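To make these steps concrete, here is a minimal Mata sketch of the three-alternative simulator (the function name ghk3 and its arguments V12, V13, caa, cab, cbb, and R are illustrative assumptions rather than part of any Stata command):

mata:
real scalar ghk3(real scalar V12, real scalar V13, real scalar caa,
                 real scalar cab, real scalar cbb, real scalar R)
{
    real scalar r, PrA, acc, e21
    PrA = normal(V12/caa)                    // Pr(e_q21 < V12/caa)
    acc = 0
    for (r=1; r<=R; r++) {
        e21 = invnormal(runiform(1,1)*PrA)   // draw from a normal truncated from above at V12/caa
        acc = acc + normal((V13 - cab*e21)/cbb)
    }
    return(PrA*(acc/R))                      // simulated probability of choosing alternative 1
}
end

For example, ghk3(.5, .2, 1.4, .7, 1.1, 1000) would return a simulated value of π_q1 for those utility differences and Cholesky elements.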
One detail remains to be worked out: how does one perform draws from a truncated normal distribution? The answer is to compute
$$e_{q21} = \Phi^{-1}\!\left[\mu\,\Phi\!\left(\frac{V_{q12}}{c_{aa}}\right)\right]$$
where $\mu$ can be generated in a number of different ways. Most commonly, $\mu$ is drawn from a uniform distribution defined over the interval [0, 1]. Some programs, including Stata, rely on Halton or Hammersley sequences, which tend to spread the values of $\mu$ out more evenly (see footnote 11).
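The following minimal Mata sketch illustrates one such draw via the inverse-CDF trick; the values of V12 and caa are placeholders, not estimates from the text:

mata
V12 = .8                               // placeholder systematic utility difference
caa = 1.2                              // placeholder Cholesky element
mu  = runiform(1, 1)                   // a uniform draw on (0,1)
e21 = invnormal(mu*normal(V12/caa))    // standard normal draw truncated from above at V12/caa
e21
end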
Simulation with M Alternatives
The setup with M alternatives is a relatively straightforward extension of
GHK estimation with 3 alternatives, although it is computationally much
more intensive (see footnote 12). When considering the probability of choosing the $i$th alternative, the following quantity is of interest
$$U_{qi} - U_{qj} = -V_{qij} + \eta_{qji}$$
where $V_{qij} = V_{qi} - V_{qj}$, $\eta_{qji} = \epsilon_{qj} - \epsilon_{qi}$, and $j \neq i$. Stacking the errors, we obtain the $(M-1) \times 1$ vector $\boldsymbol\eta'_{qi} = (\eta_{q1i} \cdots \eta_{qMi})$ with variance-covariance matrix (6.2).
We can reexpress the errors through the Cholesky transformation of
a set of independent standard normal deviates. Specifically, we use the
[11] A Halton sequence is based on a set of primes. Given a particular prime, the unit interval is divided into a number of segments that is equal to the prime. Thus, a prime of three produces three segments with break points at 1/3 and 2/3. The three segments are then divided into nine segments, etc. In general, one long sequence is generated and then different parts of that sequence are used for different observations. This tends to induce negative correlation between the observations, which means that an error in one direction for one observation tends to be offset by an error in another direction for another observation (see Cappellari and Jenkins 2006, Drukker and Gates 2006, Gates 2006, and Train 2003 for more extensive discussions of Halton sequences). Hammersley sequences use evenly spaced points by considering the first $M-2$ components of a Halton sequence.

[12] In the GHK approach, computational costs increase in a roughly linear fashion with the number of alternatives. If we obtain choice probabilities by evaluating integrals, each added alternative increases the computational burden by an order of magnitude.
decomposition $\Sigma_i = L_iL'_i$ where
$$L_i = \begin{pmatrix} c_{11} & 0 & 0 & 0 & \cdots & 0 \\ c_{21} & c_{22} & 0 & 0 & \cdots & 0 \\ c_{31} & c_{32} & c_{33} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ c_{M-1,1} & c_{M-1,2} & c_{M-1,3} & c_{M-1,4} & \cdots & c_{M-1,M-1} \end{pmatrix}$$
Then
$$\boldsymbol\eta_{qi} = L_i\mathbf{e}_{qi}$$
where $\mathbf{e}_{qi}$ is a $(M-1) \times 1$ vector of independent standard normal deviates. With this transformation in place, the utility differences can be written as
$$\boldsymbol\xi_{qi} = -\mathbf{V}_{qi} + L_i\mathbf{e}_{qi}$$
where $\boldsymbol\xi'_{qi} = [(U_{q1} - U_{qi}) \cdots (U_{qM} - U_{qi})]$, $\mathbf{V}'_{qi} = [(V_{qi} - V_{q1}) \cdots (V_{qi} - V_{qM})]$, and $\mathbf{e}'_{qi} = [e_{q1i} \cdots e_{qMi}]$. This amounts to the following system of equations
$$U_{q1} - U_{qi} = -V_{qi1} + c_{11}e_{q1i}$$
$$U_{q2} - U_{qi} = -V_{qi2} + c_{21}e_{q1i} + c_{22}e_{q2i}$$
$$U_{q3} - U_{qi} = -V_{qi3} + c_{31}e_{q1i} + c_{32}e_{q2i} + c_{33}e_{q3i}$$
etc.
The choice probability for alternative $i$ can now be written as
$$\pi_{qi} = \Pr\!\left(e_{q1i} < \frac{V_{qi1}}{c_{11}}\right) \times \Pr\!\left(e_{q2i} < \frac{V_{qi2} - c_{21}e_{q1i}}{c_{22}} \,\Big|\, e_{q1i} < \frac{V_{qi1}}{c_{11}}\right) \times \Pr\!\left(e_{q3i} < \frac{V_{qi3} - c_{31}e_{q1i} - c_{32}e_{q2i}}{c_{33}} \,\Big|\, e_{q1i} < \frac{V_{qi1}}{c_{11}} \,\cap\, e_{q2i} < \frac{V_{qi2} - c_{21}e_{q1i}}{c_{22}}\right) \times \cdots$$
This probability can be simulated via the following process. Here the superscript r denotes the rth draw in the simulation.
1. Compute
$$\Pr\!\left(e_{q1i} < \frac{V_{qi1}}{c_{11}}\right) = \Phi\!\left(\frac{V_{qi1}}{c_{11}}\right)$$
2. Draw a value $e^r_{q1i}$ from a standard normal distribution that is truncated from above at $V_{qi1}/c_{11}$. The draw is constructed by drawing $\mu^r_1$ from the uniform (0,1) distribution and computing
$$e^r_{q1i} = \Phi^{-1}\!\left[\mu^r_1\,\Phi\!\left(\frac{V_{qi1}}{c_{11}}\right)\right]$$
(Halton or Hammersley sequences may also be used in this computation.)
3. Compute
$$\Pr\!\left(e_{q2i} < \frac{V_{qi2} - c_{21}e_{q1i}}{c_{22}} \,\Big|\, e_{q1i} = e^r_{q1i}\right) = \Phi\!\left(\frac{V_{qi2} - c_{21}e^r_{q1i}}{c_{22}}\right)$$
4. Draw a value $e^r_{q2i}$ from a standard normal distribution that is truncated from above at $(V_{qi2} - c_{21}e^r_{q1i})/c_{22}$. The draw is constructed by drawing $\mu^r_2$ from the uniform (0,1) distribution and computing
$$e^r_{q2i} = \Phi^{-1}\!\left[\mu^r_2\,\Phi\!\left(\frac{V_{qi2} - c_{21}e^r_{q1i}}{c_{22}}\right)\right]$$
(Again, Halton or Hammersley sequences may be used instead.)
5. Compute
$$\Pr\!\left(e_{q3i} < \frac{V_{qi3} - c_{31}e_{q1i} - c_{32}e_{q2i}}{c_{33}} \,\Big|\, e_{q1i} = e^r_{q1i},\ e_{q2i} = e^r_{q2i}\right) = \Phi\!\left(\frac{V_{qi3} - c_{31}e^r_{q1i} - c_{32}e^r_{q2i}}{c_{33}}\right)$$
6. Continue the computations for eq3i · · · eqM i .
7. Construct $\tilde\pi^r_{qi}$ by multiplying the various standard normal expressions with one another.
8. Repeat steps (1)-(7) R times.
9. Construct the simulated choice probability as
$$\tilde\pi_{qi} = \frac{1}{R}\sum_r \tilde\pi^r_{qi}$$
This simulated choice probability is then entered into the log-likelihood function, which can then be optimized via one of the algorithms discussed in
Chapter 9.
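The nine steps above can be collected into a small Mata function. The sketch below is only an illustration under simplifying assumptions: it uses plain uniform draws rather than Halton or Hammersley sequences, and the function name and inputs (the vector of systematic utility differences and the covariance matrix of the error differences) are hypothetical, not part of any existing Stata command:

mata
real scalar ghk_prob(real colvector V, real matrix Sigma, real scalar R)
{
    real matrix L
    real colvector e
    real scalar M1, m, k, r, pr, p, lincomb, bound

    M1 = rows(V)                       // number of utility differences (M - 1)
    L  = cholesky(Sigma)               // Sigma = L*L', L lower triangular
    p  = 0
    for (r = 1; r <= R; r++) {
        e  = J(M1, 1, 0)
        pr = 1
        for (m = 1; m <= M1; m++) {
            lincomb = 0
            for (k = 1; k < m; k++) lincomb = lincomb + L[m, k]*e[k]
            bound = (V[m] - lincomb)/L[m, m]
            pr    = pr*normal(bound)                        // step-wise probability
            e[m]  = invnormal(runiform(1, 1)*normal(bound)) // truncated draw
        }
        p = p + pr                      // product of the step-wise probabilities
    }
    return(p/R)                         // average over the R replications
}
end

Under these assumptions, a call such as ghk_prob(V, Sigma, 100), with V the stacked $V_{qi1}, V_{qi2}, \ldots$ and Sigma the matrix in (6.2), would return the simulated $\tilde\pi_{qi}$.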
Statistical Properties of the GHK Estimator
Before we accept MSLE via the GHK approach, we should consider its
properties. We know that normal ML estimators have desirable asymptotic
properties. Is this also true for the GHK estimator? It can be demonstrated
that
• $\hat\theta_{GHK}$ is consistent and asymptotically efficient as $N \to \infty$ and $R \to \infty$.

• In finite samples, $\hat\theta_{GHK}$ is biased (see footnote 13). This bias disappears as $R/\sqrt{N} \to \infty$.
The bias in the GHK approach may seem damning for the method. Fortunately, Monte Carlo studies show that the bias is generally not severe
and that 30 or more replications are sufficient to make it a minor concern
(Börsch-Supan and Hajivassiliou 1993; Geweke, Keane, and Runkle 1994).
Note that the Monte Carlo studies also show that the GHK estimator performs
better with cross-sectional data than with panel data (Geweke, Keane, and
Runkle 1997).
6.6.2
Predicted Probabilities through Quadrature
Although there is presently no built-in routine in Stata to compute discrete
changes in predicted probabilities, such changes can be computed relatively
easily using the Stata matrix programming language Mata. In particular, the
mvnlattice command developed by Richard Gates allows for the computation of multivariate normal probabilities using a rank-1 lattice rule, which
is a form of constant weight quadrature that is reasonably precise and quite
fast.
The discussion below shows how to use this procedure to compute discrete changes in predicted probabilities. It is illustrated here for a choice
problem with four alternatives since this is relevant to the example we have
described. However, the procedure is easily generalized to problems involving larger or smaller choice sets.
[13] GHK estimates of $\pi_{qi}$ are unbiased. However, the log-likelihood function is stated in terms of $\ln\pi_{qi}$ and this is biased. It is easy to see this. Suppose an estimator $\hat\pi$ of $\pi$ such that $\hat\pi = \pi + \nu$ with probability .5 and $\hat\pi = \pi - \nu$ with probability .5. Then $E[\hat\pi] = \pi$. But $E[\ln\hat\pi] < \ln E[\hat\pi]$ due to the concavity of the logarithmic function. Thus, there is a downward bias.
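A quick numerical illustration of this downward bias (with made-up numbers): if $\pi = .5$ and $\nu = .3$, then $E[\hat\pi] = .5(.8) + .5(.2) = .5 = \pi$, but $E[\ln\hat\pi] = .5\ln(.8) + .5\ln(.2) \approx -.916$, which is smaller than $\ln(.5) \approx -.693$.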
Baseline Alternative
Imagine that the 1st alternative has been designated the baseline alternative.
From the previous discussion, we know that the following conditions have
to hold true for the first alternative to be selected:
$$\eta_{q21} < (\mathbf{x}'_{q1} - \mathbf{x}'_{q2})\boldsymbol\beta - \mathbf{z}'_q\boldsymbol\gamma_2$$
$$\eta_{q31} < (\mathbf{x}'_{q1} - \mathbf{x}'_{q3})\boldsymbol\beta - \mathbf{z}'_q\boldsymbol\gamma_3$$
$$\eta_{q41} < (\mathbf{x}'_{q1} - \mathbf{x}'_{q4})\boldsymbol\beta - \mathbf{z}'_q\boldsymbol\gamma_4$$
We can stack $\eta_{q21}$, $\eta_{q31}$ and $\eta_{q41}$ into the vector $\boldsymbol\eta_{q1}$, which is related to $\boldsymbol\epsilon'_q = (\epsilon_{q1}\ \epsilon_{q2}\ \epsilon_{q3}\ \epsilon_{q4})$ through the equation
$$\boldsymbol\eta_{q1} = M_1\boldsymbol\epsilon_q$$
where
$$M_1 = \begin{pmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{pmatrix}$$
The variance-covariance matrix of $\boldsymbol\eta_{q1}$ can then be computed via
$$\Sigma_1 = M_1\Sigma M'_1$$
and the predicted probability is given by
$$\pi_{q1} = \int_{-\infty}^{V_{q12}}\int_{-\infty}^{V_{q13}}\int_{-\infty}^{V_{q14}} \phi_3(\eta_{q21}, \eta_{q31}, \eta_{q41})\,d\eta_{q21}\,d\eta_{q31}\,d\eta_{q41}$$
where $V_{q12} = (\mathbf{x}'_{q1} - \mathbf{x}'_{q2})\boldsymbol\beta - \mathbf{z}'_q\boldsymbol\gamma_2$, $V_{q13} = (\mathbf{x}'_{q1} - \mathbf{x}'_{q3})\boldsymbol\beta - \mathbf{z}'_q\boldsymbol\gamma_3$, $V_{q14} = (\mathbf{x}'_{q1} - \mathbf{x}'_{q4})\boldsymbol\beta - \mathbf{z}'_q\boldsymbol\gamma_4$, and $\phi_3(\cdot)$ is the trivariate normal distribution with variance-covariance matrix $\Sigma_1$.
We can now simulate the values for Vq12 , Vq13 , and Vq14 . For example,
in Table 6.4 we changed the value of left-right self-placement while holding
all else at the mean. This corresponded to setting religiosity to .482 (this
is the proportion of respondents who consider themselves religious) and by
setting the issue distances to PvdA, CDA, VVD, and D66 to 9.055, 10.370,
9.418, and 7.160, respectively. To do this in Mata, we proceed as follows.
mata
S=(1,0,0,0\0,1,.182,.083\0,.182,1.043,.408\0,.083,.408,1.278)
M1=(-1,1,0,0\-1,0,1,0\-1,0,0,1)
S1=M1*S*M1’
R=(S1+S1’)/2
V12=(9.055-10.370)*-.233+1*3.148+1*-.437+.482*-1.453
V13=(9.055-9.418)*-.233+1*2.730+1*-.475+.482*-.523
V14=(9.055-7.160)*-.233+1*1.636+1*-.234+.482*-.429
mvnlattice((-100000,-100000,-100000),(V12,V13,V14),R)
V12=(9.055-10.370)*-.233+1*3.148+10*-.437+.482*-1.453
V13=(9.055-9.418)*-.233+1*2.730+10*-.475+.482*-.523
V14=(9.055-7.160)*-.233+1*1.636+10*-.234+.482*-.429
mvnlattice((-100000,-100000,-100000),(V12,V13,V14),R)
end
The first line starts Mata in interactive mode. The second line enters the
variance-covariance matrix, Σ, shown in Table 6.2. The third line creates
the matrix M1 described above. In the fourth line, the matrix Σ1 is created. The fifth line is necessary only if, for some computational reason, Σ1
does not come out as symmetric. By averaging Σ1 and its transpose it is
frequently possible to render the matrix symmetric. Lines (6)-(8) generate
Vq12 , Vq13 , and Vq14 by setting left-right self-placement to the minimum (1)
and everything else to the mean. The ninth line produces the predicted
probability of voting for the baseline alternative. Lines (10)-(14) repeat the
computations, now setting left-right self-placement to the maximum (10).
The last line exits Mata.
Note that the mvnlattice command produces three pieces of information. The first piece of information is the predicted multivariate normal
probability that we are interested in. The second piece is an indication of
the absolute approximation error of the multivariate integral. Finally, the
command also indicates how many points were included in the quadrature.
Non-Baseline Alternatives
For non-baseline alternatives, the computations proceed much the same. To fix ideas, let us consider the probability of choosing the second alternative. This alternative is selected when the following holds true:
$$\eta_{q12} < (\mathbf{x}'_{q2} - \mathbf{x}'_{q1})\boldsymbol\beta + \mathbf{z}'_q\boldsymbol\gamma_2$$
$$\eta_{q32} < (\mathbf{x}'_{q2} - \mathbf{x}'_{q3})\boldsymbol\beta + \mathbf{z}'_q(\boldsymbol\gamma_2 - \boldsymbol\gamma_3)$$
$$\eta_{q42} < (\mathbf{x}'_{q2} - \mathbf{x}'_{q4})\boldsymbol\beta + \mathbf{z}'_q(\boldsymbol\gamma_2 - \boldsymbol\gamma_4)$$
Further, $\boldsymbol\eta_{q2} = M_2\boldsymbol\epsilon_q$, where
$$M_2 = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & -1 & 0 & 1 \end{pmatrix}$$
The probability of choosing the second alternative is then given by
$$\pi_{q2} = \int_{-\infty}^{V_{q21}}\int_{-\infty}^{V_{q23}}\int_{-\infty}^{V_{q24}} \phi_3(\eta_{q12}, \eta_{q32}, \eta_{q42})\,d\eta_{q12}\,d\eta_{q32}\,d\eta_{q42}$$
where $V_{q21} = (\mathbf{x}'_{q2} - \mathbf{x}'_{q1})\boldsymbol\beta + \mathbf{z}'_q\boldsymbol\gamma_2$, $V_{q23} = (\mathbf{x}'_{q2} - \mathbf{x}'_{q3})\boldsymbol\beta + \mathbf{z}'_q(\boldsymbol\gamma_2 - \boldsymbol\gamma_3)$, $V_{q24} = (\mathbf{x}'_{q2} - \mathbf{x}'_{q4})\boldsymbol\beta + \mathbf{z}'_q(\boldsymbol\gamma_2 - \boldsymbol\gamma_4)$, and $\phi_3(\cdot)$ is the trivariate normal distribution with variance-covariance matrix $\Sigma_2$.
Simulating the same covariate values as before, we can obtain the discrete
change in π̂q2 through the following syntax:
mata
S=(1,0,0,0\0,1,.182,.083\0,.182,1.043,.408\0,.083,.408,1.278)
M2=(1,-1,0,0\0,-1,1,0\0,-1,0,1)
S2=M2*S*M2'
R=(S2+S2')/2
V21=(10.370-9.055)*-.233+1*-3.148+1*.437+.482*1.453
V23=(10.370-9.418)*-.233+1*(-3.148 -2.730)+1*(.437-.475)+.482*(1.453-.523)
V24=(10.370-7.160)*-.233+1*(-3.148 -1.636)+1*(.437-.234)+.482*(1.453-.429)
mvnlattice((-100000,-100000,-100000),(V21,V23,V24),R)
V21=(10.370-9.055)*-.233+1*-3.148+10*.437+.482*1.453
V23=(10.370-9.418)*-.233+1*(-3.148 -2.730)+10*(.437-.475)+.482*(1.453-.523)
V24=(10.370-7.160)*-.233+1*(-3.148 -1.636)+10*(.437-.234)+.482*(1.453-.429)
mvnlattice((-100000,-100000,-100000),(V21,V23,V24),R)
end
Chapter 7
Nested Logit
So far, our emphasis has been on choice models that assume that the decision maker considers all of the alternatives simultaneously. But decision scientists have long considered the possibility that decisions are made sequentially
(e.g. Tversky 1974). That is, at any given point in time only a subset of the
alternatives is being considered. Sequential choice models allow researchers
to model these decision making processes.
One of the best-known sequential choice models is McFadden’s (1978)
nested logit model. This model is quite versatile and relatively straightforward to estimate. It assumes that the alternatives can be grouped into
subsets or nests of similar alternatives. IIA holds locally, i.e. within nests,
but not globally. The model was applied successfully in political science by
Born (1992) who used it to model turnout and vote choice simultaneously
in midterm elections.
7.1
Nesting and Sequential Choice
Consider the decision tree in Figure 7.1, which is based on Born (1992). This
tree groups the alternatives into two nests: {abstention} and {Democrat,
Republican}. Here a nest refers to a subset of alternatives that belong
together. The property of “belonging” can be based on temporal considerations or it can derive from other similarities of the alternatives. For example,
the nesting structure can be interpreted as a sequence: first, a citizen decides whether he or she will participate in the election; only when he or she
participates does the decision on a Democratic or Republican vote become
relevant. Thus, in a temporal sense, the decision whom to vote for follows
the decision whether to vote.

[Figure 7.1: Example of a Nested Decision Problem. A two-level tree in which the first branching is Abstain (1) versus Participate, and within Participate the choice is Democrat (2) versus Republican (3).]

However, the nesting structure can also be interpreted to mean that the choices Democrat and Republican are similar,
and that both choices are different from abstention. This could be argued,
for example, on grounds that choosing whether to vote depends on different
factors than choosing who to vote for. Later in this chapter, we discuss some
approaches that can be used to choose a nesting structure.
In the nested logit model, the nesting structure has to be specified ahead
of time. That is, the researcher decides how many nests there are and which
alternatives belong to which nest. In the next chapter, we shall consider
a different kind of sequential choice model that makes the composition of
subsets of alternatives a feature of the estimation process.
7.2
The Model
7.2.1
Choice Probabilities
Consider again the nesting structure of Figure 7.1. Embedded in this structure are three alternatives: 1 = abstain, 2 = vote Democratic, and 3 = vote
Republican. Let Yq = 1, 2, 3 be the observed choice for decision maker q.
Under a utility maximization assumption, $Y_q = i$ if $U_{qi} > U_{qj}\ \forall j \neq i$. For instance, $Y_q = 1$ if $U_{q1} > U_{q2}$ and $U_{q1} > U_{q3}$. As before, we assume that the utility of an alternative has a systematic and a stochastic component: $U_{qi} = V_{qi} + \epsilon_{qi}$.
The key to developing the nested logit model is to impose a distributional assumption on $\epsilon$. First, we establish a correlational structure of the errors. Since alternatives 2 and 3 are similar, as suggested by their grouping in the nesting structure, we allow $\epsilon_{q2}$ and $\epsilon_{q3}$ to be correlated. However, alternatives 2 and 3 are dissimilar from alternative 1, so that $\epsilon_{q1}$ is uncorrelated with $\epsilon_{q2}$ and $\epsilon_{q3}$. In general, the errors of similar alternatives are correlated, while the error correlations of dissimilar alternatives are constrained to zero. Thus, the nested logit model is situated between the multinomial and conditional logit models, on the one hand, and multinomial probit, on the other. The former models do not allow any of the errors to be correlated, while the latter correlates all errors. Nested logit correlates some, but not all, of the errors.
Second, we assume that the errors follow a generalized extreme value (GEV) distribution: $\epsilon_{qi} \sim GEV$ (see footnote 1). Under this error distribution, McFadden (1978) demonstrated that
$$\pi_i = \frac{e^{V_i}\,H_i(e^{V_1}, e^{V_2}, \cdots, e^{V_M})}{H(e^{V_1}, e^{V_2}, \cdots, e^{V_M})} \tag{7.1}$$
where
$$H_i(e^{V_1}, e^{V_2}, \cdots, e^{V_M}) = \left[\sum_{j\in C_k} e^{V_j/\lambda_k}\right]^{\lambda_k - 1} e^{V_i[(1/\lambda_k) - 1]}$$
$$H(e^{V_1}, e^{V_2}, \cdots, e^{V_M}) = \sum_{l=1}^{L}\left[\sum_{j\in C_l} e^{V_j/\lambda_l}\right]^{\lambda_l}$$
Here $C_k$ and $C_l$ denote different subsets of choices, which correspond to the nesting structure, which consists of $L$ nests in total. Further, $\rho = 1 - \lambda$ is a measure of the correlation of the errors within a choice subset (see footnote 2). Note that $\lambda = 1$ if the subset contains only one alternative. We can consider $\lambda$ a measure of similarity of the alternatives in a choice subset.
For the nesting structure in Figure 7.1, there are two choice subsets or nests: $C_1 = \{1\}$ and $C_2 = \{2, 3\}$. It is then easily verified that the probability of choosing alternative 1 (abstention) is given by
$$\pi_1 = \frac{e^{V_1}}{e^{V_1} + \left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2}}$$
[1] The GEV is a generalization of the Type-I extreme value distribution. A distribution $F$ is a GEV distribution if the following conditions are satisfied: (1) $F(\epsilon_1, \epsilon_2, \cdots, \epsilon_M) = \exp[-H(e^{-\epsilon_1}, e^{-\epsilon_2}, \cdots, e^{-\epsilon_M})]$; (2) $H(e^{-\epsilon_1}, e^{-\epsilon_2}, \cdots, e^{-\epsilon_M})$ is a non-negative function, homogeneous of degree 1; (3) $H(e^{-\epsilon_1}, e^{-\epsilon_2}, \cdots, e^{-\epsilon_M}) \to \infty$ as $e^{-\epsilon_m} \to \infty$ for any $m$; and (4) the $k$th partial derivative of $H(e^{-\epsilon_1}, e^{-\epsilon_2}, \cdots, e^{-\epsilon_M})$ is non-negative if $k$ is odd and non-positive if $k$ is even. Notice that one specification that meets these requirements is $H(e^{-\epsilon_1}, e^{-\epsilon_2}, \cdots, e^{-\epsilon_M}) = \sum_m \exp(-\epsilon_m)$. This produces the multinomial logit model.
[2] Technically speaking, $\lambda = 1 - \sigma$, where $\rho \le \sigma \le \rho + .045$ (see Maddala 1983). This is such a small modification that the interpretation $\lambda = 1 - \rho$ is widely used in the econometric literature. Since $-1 \le \rho \le 1$, note that $0 \le \lambda \le 2$. Also note that $\lambda = 1$ if $\rho = 0$.
where $\lambda_2$ is the parameter for the choices in $C_2$ (see footnote 3). In a similar manner, the probabilities for the remaining alternatives can be found. Thus, we have
$$\pi_2 = \frac{e^{V_2/\lambda_2}\left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2 - 1}}{e^{V_1} + \left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2}}$$
$$\pi_3 = \frac{e^{V_3/\lambda_2}\left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2 - 1}}{e^{V_1} + \left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2}}$$
as the probabilities of voting Democratic and Republican, respectively (see footnote 4).
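A small Mata sketch may help make these formulas concrete; the utilities and the similarity parameter below are made-up values for illustration, not estimates:

mata
V   = (0, .5, .2)                          // illustrative V1, V2, V3
l2  = .6                                   // hypothetical lambda_2
inc = exp(V[2]/l2) + exp(V[3]/l2)          // sum over the alternatives in C2
p1  = exp(V[1])/(exp(V[1]) + inc^l2)
p2  = exp(V[2]/l2)*inc^(l2-1)/(exp(V[1]) + inc^l2)
p3  = exp(V[3]/l2)*inc^(l2-1)/(exp(V[1]) + inc^l2)
(p1, p2, p3), p1 + p2 + p3                 // the three probabilities sum to one
end

Setting l2 equal to 1 in this sketch reproduces the multinomial/conditional logit probabilities, in line with the discussion that follows.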
A couple of things can be noted about these choice probabilities. First,
imagine that λ2 = 1. This corresponds to the situation that the alternatives
[3] The proof is as follows. First, within $C_1$, $\lambda_1 = 1$. Second,
$$H_1(e^{V_1}, e^{V_2}, e^{V_3}) = \left[\sum_{j\in C_1} e^{V_j/\lambda_1}\right]^{\lambda_1 - 1} e^{V_1[(1/\lambda_1)-1]} = \left(e^{V_1/\lambda_1}\right)^{\lambda_1 - 1} e^{V_1[(1/\lambda_1)-1]} = \left(e^{V_1}\right)^{1-1} e^{V_1[1-1]} = 1$$
Third,
$$H(e^{V_1}, e^{V_2}, e^{V_3}) = \left(e^{V_1/\lambda_1}\right)^{\lambda_1} + \left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2} = e^{V_1} + \left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2}$$
Fourth, substituting these results in the generic formula for $\pi_1$ yields:
$$\pi_1 = \frac{e^{V_1}H_1}{H} = \frac{e^{V_1}}{e^{V_1} + \left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2}}$$

[4] For $\pi_2$ we have
$$H_2(e^{V_1}, e^{V_2}, e^{V_3}) = \left[\sum_{j=2}^{3} e^{V_j/\lambda_2}\right]^{\lambda_2 - 1} e^{V_2[(1/\lambda_2)-1]} = \left[\sum_{j=2}^{3} e^{V_j/\lambda_2}\right]^{\lambda_2 - 1} e^{V_2/\lambda_2} e^{-V_2}$$
Further, $H(e^{V_1}, e^{V_2}, e^{V_3})$ is the same as in the computation for $\pi_1$. Substitution then yields the proper probability. We can find $\pi_3$ in a similar manner.
in $C_2$ are uncorrelated, which means that none of the alternatives are correlated. If we substitute $\lambda_2 = 1$ into the choice probability equations above, we get $\pi_i = e^{V_i}/\sum_j e^{V_j}$. This, of course, is the expression for the choice probabilities in the Luce model, so it follows that the multinomial and conditional logit models are a special case of the nested logit model.
Second, within a nest the IIA assumption holds. This is easily verified if we look at $\pi_2/\pi_3$:
$$\frac{\pi_2}{\pi_3} = \frac{e^{V_2/\lambda_2}\left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2 - 1}}{e^{V_3/\lambda_2}\left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2 - 1}} = \frac{e^{V_2/\lambda_2}}{e^{V_3/\lambda_2}}$$
We see that the relative choice probabilities between alternatives 2 and 3
are determined entirely by the utilities of those two alternatives. No other
alternative influences the relative likelihood of voting Democratic versus
Republican.5
Third, the IIA assumption does not extend to alternatives from different nests. For example, when considering the probability of abstention relative to voting Democratic we get
$$\frac{\pi_1}{\pi_2} = \frac{e^{V_1}}{e^{V_2/\lambda_2}\left(\sum_{j=2}^{3} e^{V_j/\lambda_2}\right)^{\lambda_2 - 1}}$$
Thus, we see that this ratio of probabilities is driven by all alternatives in the two choice subsets.
7.2.2
Decomposition of Choice Probabilities
Oftentimes, the choice probabilities described in the previous section are
decomposed by nesting structure. There are three main benefits of doing
this. First, such a decomposition allows us to express the choice model in
terms of straightforward logits. Second, it aids in the interpretation. Finally,
it suggests a simple estimation strategy for the nested logit model.
To develop the decomposition, let us introduce a slightly different notation that brings the nesting structure into greater relief. In our example,
[5] This is true even if we had added other alternatives to $C_2$, such as voting for a third-party candidate. The presence of those alternatives, too, would have been inconsequential for the likelihood of voting Democratic compared to voting Republican.
the nesting structure has two levels—vote or abstain and, within vote, vote
Democratic or vote Republican. We index the first level by i and the second level by j. We then define the choice probabilities as πij . This is the
probability of choosing the jth alternative in the ith nest.
It is also helpful to decompose the systematic portion of the utilities into
a portion that varies across nests only and a portion that varies across nests
and alternatives. Specifically,
Vij = Pi + λi Qij
where Pi varies across nests only, Qij varies across nests and alternatives,
and λi is a normalizing constant.
The $\pi_{ij}$s are joint probabilities. From probability theory, we know that
$$\pi_{ij} = \pi_{j|i}\,\pi_i$$
where $\pi_{j|i}$ is the conditional probability of choosing $j$ given that nest $i$ was selected first, and $\pi_i$ is the probability of selecting nest $i$. It can be demonstrated that
$$\pi_{j|i} = \frac{\exp(Q_{ij})}{\sum_{k\in C_i}\exp(Q_{ik})}, \qquad \pi_i = \frac{\exp(P_i + \lambda_i I_i)}{\sum_{l=1}^{L}\exp(P_l + \lambda_l I_l)} \tag{7.2}$$
where $I_i = \ln\sum_{k\in C_i}\exp(Q_{ik})$ is the log-sum or inclusive value (see footnote 6).
[6] It is demonstrated fairly easily that $\pi_{ij} = \pi_{j|i}\pi_i$ yields the expression in (7.1). First,
$$\pi_{ij} = \pi_{j|i}\pi_i = \frac{\exp(Q_{ij})}{\sum_{k\in C_i}\exp(Q_{ik})}\,\frac{\exp(P_i + \lambda_i I_i)}{\sum_{l=1}^{L}\exp(P_l + \lambda_l I_l)} = \frac{\exp(Q_{ij})}{\sum_{k\in C_i}\exp(Q_{ik})}\,\frac{\exp(P_i)\exp(I_i)^{\lambda_i}}{\sum_{l=1}^{L}\exp(P_l)\exp(I_l)^{\lambda_l}}$$
Substituting $\ln\sum_{k\in C_i}\exp(Q_{ik})$ for $I_i$ and $\ln\sum_{k\in C_l}\exp(Q_{lk})$ for $I_l$ yields:
$$\pi_{ij} = \frac{\exp(Q_{ij})}{\sum_{k\in C_i}\exp(Q_{ik})}\,\frac{\exp(P_i)\left[\sum_{k\in C_i}\exp(Q_{ik})\right]^{\lambda_i}}{\sum_{l=1}^{L}\exp(P_l)\left[\sum_{k\in C_l}\exp(Q_{lk})\right]^{\lambda_l}} = \frac{\exp(Q_{ij})\exp(P_i)\left[\sum_{k\in C_i}\exp(Q_{ik})\right]^{\lambda_i - 1}}{\sum_{l=1}^{L}\exp(P_l)\left[\sum_{k\in C_l}\exp(Q_{lk})\right]^{\lambda_l}}$$
Recognizing that $\exp(P_i) = [\exp(P_i/\lambda_i)]^{\lambda_i} = \exp(P_i/\lambda_i)[\exp(P_i/\lambda_i)]^{\lambda_i - 1}$, we can reexpress this as
$$\pi_{ij} = \frac{\exp(Q_{ij} + P_i/\lambda_i)\left[\sum_{k\in C_i}\exp(Q_{ik} + P_i/\lambda_i)\right]^{\lambda_i - 1}}{\sum_{l=1}^{L}\left[\sum_{k\in C_l}\exp(Q_{lk} + P_l/\lambda_l)\right]^{\lambda_l}}$$
Since $V_{ij} = P_i + \lambda_i Q_{ij}$, it follows that $V_{ij}/\lambda_i = P_i/\lambda_i + Q_{ij}$. Substitution then yields
$$\pi_{ij} = \frac{\exp(V_{ij}/\lambda_i)\left[\sum_{k\in C_i}\exp(V_{ik}/\lambda_i)\right]^{\lambda_i - 1}}{\sum_{l=1}^{L}\left[\sum_{k\in C_l}\exp(V_{lk}/\lambda_l)\right]^{\lambda_l}}$$
This is expression (7.1).
7.2.3
Determinants of Choice
With the decomposition in place, it is now possible to impose a model on $P_i$ and $Q_{ij}$. Specifically,
$$P_i = \mathbf{z}'_i\boldsymbol\alpha, \qquad Q_{ij} = \mathbf{x}'_{ij}\boldsymbol\beta \tag{7.3}$$
Here xij is a vector of choice attributes that vary both within and across
nests, while z i is a vector of choice attributes that varies only across nests.
Further, the vectors β and α contain the parameters that are associated
with the choice attributes.
Substitution of this equation into the formulae for the decomposed choice probabilities yields
$$\pi_{j|i} = \frac{\exp(\mathbf{x}'_{ij}\boldsymbol\beta)}{\sum_{k\in C_i}\exp(\mathbf{x}'_{ik}\boldsymbol\beta)}, \qquad \pi_i = \frac{\exp(\mathbf{z}'_i\boldsymbol\alpha + \lambda_i I_i)}{\sum_{l=1}^{L}\exp(\mathbf{z}'_l\boldsymbol\alpha + \lambda_l I_l)} \tag{7.4}$$
where
$$I_i = \ln\sum_{k\in C_i}\exp(\mathbf{x}'_{ik}\boldsymbol\beta)$$
The system of equations in (7.4) is what is estimated in the nested logit
model.7 Attributes of the decision maker may be added to this system,
provided that those attributes are allowed to vary with i.
[7] The system of equations can be expanded to multiple levels of nesting. For example, Maddala (1983) shows the three equations that are required when there are three levels of nesting. Note that adding layers of nests greatly increases the computational burden, especially under full information maximum likelihood estimation. Consequently, many computer programs limit how many layers can be specified, a common cutoff being three layers.
7.3
Identification
The identification conditions of the nested logit model are similar to those
of the conditional logit model. This means that the vectors β and α are
identified up to a constant term. Thus, the constant has to be omitted
when the nested logit model is estimated.8 Note that when attributes of
the decision maker are added to the model, one of the alternatives has to
be designated as the baseline, meaning that no effects are estimated for this
alternative.
7.4
Estimation
There are two different approaches to estimating nested logit models: sequential MLE and full information MLE. Sequential MLE has the advantage of computational simplicity. Full information MLE (FIML) has the
advantage of being more efficient and of producing asymptotically correct
estimates of the standard errors at higher nesting levels.
7.4.1
Sequential MLE
Let us return to Figure 7.1 and its implied choice probabilities. How would
we sequentially estimate the impact of predictors on the choices? The general approach is bottom-up, which means that we start with the lowest level
of the nesting structure and work our way upwards. Here is the template
for how we proceed.
1. Obtain estimates of β by estimating the model for πj|2 .9
2. Compute the inclusive value, Ii , based on the estimate of β.
[8] It is easy to see why the constant has to be omitted. Let $\mathbf{x}_{ij} = (1\ \mathbf{x}^{*\prime}_{ij})'$ be a partitioned vector such that $\mathbf{x}^*_{ij}$ contains information about all of the predictors and 1 is the constant. Then
$$\pi_{j|i} = \frac{\exp(\beta_0 + \mathbf{x}^{*\prime}_{ij}\boldsymbol\beta^*)}{\sum_{k\in C_i}\exp(\beta_0 + \mathbf{x}^{*\prime}_{ik}\boldsymbol\beta^*)} = \frac{\exp(\beta_0)\exp(\mathbf{x}^{*\prime}_{ij}\boldsymbol\beta^*)}{\exp(\beta_0)\sum_{k\in C_i}\exp(\mathbf{x}^{*\prime}_{ik}\boldsymbol\beta^*)}$$
We see that the terms involving $\beta_0$ cancel. This means that $\beta_0$ can be set to any value without changing the choice probability, which is tantamount to saying that this parameter is unidentified. A similar proof can be constructed for $\pi_i$.
[9] Ordinarily, we would also estimate the model for $\pi_{j|1}$. However, that is unnecessary in this case since $\pi_{j|1} = 1$ due to the presence of only one alternative in this choice subset.
3. Estimate α and λi by estimating the model for πi . Note that λi will
be the coefficient associated with Ii .
A major benefit of sequential maximum likelihood estimation is computational simplicity. Stages (2) and (3) of the estimation process simply require that you estimate a set of multinomial/conditional logit models, which
is an easy computational task, as we have seen. The resulting estimators
are consistent and, in this sense, meet the usual requirements. However, sequential MLE has several drawbacks. First, the estimators are not efficient
because they do not take into consideration information about all of the alternatives. Second, and particularly troublesome, it has been demonstrated
that the standard errors at higher levels of the nesting structure are biased
in a downward direction.10 While a correction factor has been proposed, this
is quite cumbersome (see Amemiya 1978). Also troublesome is that several
simulation studies have demonstrated that in finite samples the sequential
and FIML estimates can be quite different. In light of these problems, many
econometricians now recommend using FIML.
7.4.2
Full Information MLE
FIML takes into consideration the entire choice set in estimating the nested logit model. Let $d_{ij} = 1$ if alternative $j$ in nest $i$ is chosen, and let $d_{ij} = 0$ otherwise. Then the likelihood function can be written as
$$L = \prod_i\prod_j \pi_{ij}^{d_{ij}}$$
and the log-likelihood function can be written as
$$\ell = \sum_i\sum_j d_{ij}\ln\pi_{ij}$$
Using the decomposition of the choice probabilities, this can also be written as
$$\ell = \sum_i\sum_j d_{ij}\ln(\pi_{j|i}\pi_i) = \sum_i\sum_j d_{ij}\left(\ln\pi_{j|i} + \ln\pi_i\right)$$
[10] The reason is that the inclusive values are estimated quantities and, as such, subject to sampling fluctuation. However, the second stage of the sequential estimation process does not recognize this and treats the inclusive values like any other predictor, i.e. as fixed.
This log-likelihood function can be maximized by way of standard methods
such as Newton-Raphson or quasi-Newton algorithms.
A key advantage of FIML is that it is consistent and efficient. Moreover,
this estimator produces consistent estimates of the standard errors. An
important drawback of FIML is that it can become computationally costly
if the number of alternatives is large and the model contains a large number
of parameters.
7.5
Interpretation
7.5.1
Marginal Effects and Elasticities
For a nesting structure with two layers, the marginal effect of attribute $r$ in alternative $k$ in nest $l$ on the logged choice probability of alternative $j$ in nest $i$ is given by
$$\frac{\partial\ln\pi_{ij}}{\partial x_{lkr}} = \left\{d_1\left[d_2 - \pi_{k|l}\right] + \lambda_l\left[d_2 - \pi_l\right]\pi_{k|l}\right\}\beta_r$$
where
$$d_1 = \begin{cases} 1 & \text{if } i = l \\ 0 & \text{if } i \neq l \end{cases} \qquad\qquad d_2 = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases}$$
(see Greene 2003). The marginal effect is then given by
$$\frac{\partial\pi_{ij}}{\partial x_{lkr}} = \pi_{ij}\left\{d_1\left[d_2 - \pi_{k|l}\right] + \lambda_l\left[d_2 - \pi_l\right]\pi_{k|l}\right\}\beta_r$$
We typically evaluate this expression by setting all of the predictors to their means to obtain marginal effects at the mean. To obtain the elasticity, we compute
$$\frac{\partial\pi_{ij}}{\partial x_{lkr}}\frac{x_{lkr}}{\pi_{ij}} = x_{lkr}\left\{d_1\left[d_2 - \pi_{k|l}\right] + \lambda_l\left[d_2 - \pi_l\right]\pi_{k|l}\right\}\beta_r$$
7.5.2
Discrete Change in Predicted Probabilities
Predicted probabilities can be obtained using the equations in (7.4). These
predicted probabilities can then be parlayed into discrete change measures
to aid in the interpretation of the coefficients. The example below illustrates
the process.
7.6
Model Fit
Evaluation of the model fit can be based on pseudo-R2 or correct predictions.
In practice, McFadden’s pseudo-R2 is most frequently reported. Both the
percentage of correct predictions and the average probability of correct classification can be used to evaluate the success with which the model recovers
the observed choices.
7.7
Hypothesis Testing
A key test in nested logit models concerns the similarity of alternatives that
have been grouped together in the same nest. By grouping alternatives in
nests, we assume that they are correlated. If this is not the case, i.e. the
choices are uncorrelated, then the similarity parameter will tend to one.
This casts doubt on the particular nesting structure that was imposed.
The importance of a non-unit similarity parameter is readily seen in a
two-level nested model such as depicted in Figure 7.1. As we have seen,
$$\pi_{j|i} = \frac{\exp(\mathbf{x}'_{ij}\boldsymbol\beta)}{\sum_{k\in C_i}\exp(\mathbf{x}'_{ik}\boldsymbol\beta)}, \qquad \pi_i = \frac{\exp(\mathbf{z}'_i\boldsymbol\alpha + \lambda_i I_i)}{\sum_{l=1}^{L}\exp(\mathbf{z}'_l\boldsymbol\alpha + \lambda_l I_l)}$$
If $\lambda_i = 1\ \forall i$, then the model reduces to
$$\pi_{ij} = \frac{\exp(\mathbf{x}'_{ij}\boldsymbol\beta + \mathbf{z}'_i\boldsymbol\alpha)}{\sum_l\sum_k\exp(\mathbf{x}'_{lk}\boldsymbol\beta + \mathbf{z}'_l\boldsymbol\alpha)}$$
and this is simply the conditional logit model.
How would one test the hypothesis that λi = 1 ∀i? The easiest approach
is to perform a likelihood ratio test. This requires the following steps:
1. Run an unconstrained nested logit analysis. Obtain the value of the log-likelihood function and call this $\ell_U$.

2. Run a constrained nested logit analysis where the constraint takes the form of $\lambda_i = 1$ for all $i$. (In practice, this can be done by constraining the effect of the inclusive values to be equal to one.) Obtain the value of the log-likelihood function and call this $\ell_R$.

3. Compute $LR = 2(\ell_U - \ell_R)$. Under the null hypothesis, $\lambda_i = 1$ for all $i$, this statistic follows an asymptotic $\chi^2$ distribution with degrees of freedom equal to the number of nests.
Many statistical packages that perform nested logit analysis perform this test
by default. The test is sometimes referred to as a test of homoskedasticity.
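The arithmetic of the test in the steps above is easy to reproduce by hand. In the Mata sketch below the two log-likelihood values are placeholders, not results from any model reported here:

mata
llu = -1520.3                 // log-likelihood of the unconstrained model (placeholder)
llr = -1522.7                 // log-likelihood of the constrained model (placeholder)
lr  = 2*(llu - llr)           // likelihood ratio statistic
lr, chi2tail(2, lr)           // statistic and p-value with two nests
end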
7.8
Deciding on a Nesting Structure
One of the most important decisions that you will face in nested logit analysis is to decide on a nesting structure. As with all modeling decisions, this
decision should be theoretically guided. This means that you should ask
two theoretical questions. First, which choices are correlated in the sense of
sharing common unobserved factors? Second, for which choices are the relative odds independent from other alternatives? Choices that are correlated
and whose relative odds are independent from other alternatives should be
nested together.
There are also some statistical tools that can help you decide on the
nesting structure. These include the following:
1. Perform a conditional logit analysis, followed by a test of the IIA
assumption. If the IIA assumption holds for two alternatives, j and
k, this is a priori evidence that j and k can be nested.
2. Perform a multinomial probit analysis. A statistically significant correlation between the errors of two alternatives, j and k, suggests nesting
of these alternatives.
3. Run alternative nested models and examine the coefficients associated
with the inclusive values. If the coefficients are not statistically different from one, then this casts doubt on the nesting structure (see
above). If the coefficients are implausible, you can also opt for a different structure.
7.9
Example
Let us return to the Dutch Parliamentary Election Study Data from 1994
to illustrate nested logit analysis. The multinomial probit analysis that was
reported earlier suggested that the errors for VVD and D66 are correlated.
Since these parties formed the opposition prior to the elections, it makes
sense to form nests based on the parties’ role in terms of government or opposition. This produces the nesting structure of Figure 7.2. Here the choice
set is first broken down by government versus opposition. The government
nest contains PvdA and CDA, while the opposition nest contains VVD and
D66.

[Figure 7.2: 1994 Dutch Elections—Nesting by Government or Opposition Status. A two-level tree in which the parties are first grouped into Government (PvdA, CDA) versus Opposition (VVD, D66).]

One way to interpret this structure is to say that Dutch voters first
decided whether to vote for or against the government and then decided on
a particular party.
7.9.1
Estimation Results
How would we model this nesting structure? Stata contains a nlogit command that allows for FIML estimation of nested logit models. The starting
point for this command is a person-choice matrix like the one that we discussed for the conditional logit model. However, this matrix needs to be
modified in that we have to declare the nesting structure. This is done via
the nlogitgen command. For the nesting structure in Figure 7.2 we would
issue the following syntax:
nlogitgen role=vote(govt: 1|2, opp: 3|4)
This generates a new variable, “role,” with two values, “govt” and “opp,”
that identify if a party was in the government or in the opposition at the
time of the election. Here alternatives 1 and 2 are grouped together in the
“govt” category, since 1 corresponds to PvdA and 2 corresponds to CDA.
Alternatives 3 and 4 are grouped together in the “opp” category, since 3
corresponds to VVD and 4 corresponds to D66.
We estimate a model in which xij consists of issue distance between the
voter and a party, since this attribute varies across alternatives and nests. z i
consists of attributes of the decision maker, to wit left-right self placement
and religiosity. Consistent with the results from the multinomial logit analysis, we allow the effects from these variables to vary across nests, setting
the “govt” nest as the baseline category. Estimation in Stata proceeds by
issuing the following syntax:
gen odum=.
replace odum=0 if vote==1|vote==2
replace odum=1 if vote==3|vote==4
gen olr=(role==2)*leftright
gen orel=(role==2)*religious
nlogit choice (vote=dist) (role=olr orel), group(person)

Table 7.1: Nested Logit Model of Dutch Vote Choice in 1994

Predictor                     B           SE
Issue Distance               −.437**     .043
L-R × Opposition              .199**     .062
Religious × Opposition       −.249       .239
Constant                    −1.428**     .383

Notes: 4n = 1676, n = 419. Table entries are full information maximum likelihood nested logit coefficients. **p < .01 (two-tailed). λ̂₁ = .740, λ̂₂ = .682. Test of homoskedasticity: LR = 4.82, p = .090. CC = 62.5%, APCC = .524.
Here “choice” and “person” are the names given to “_didep” and “_strata,”
respectively, the latter variables having been generated by the mclgen command. Further “odum” is a constant. The results are displayed in Table
7.1.
We observe statistically significant effects for two predictors. Issue distance is a significant predictor of vote choice, with greater distances to a
party reducing the probability of voting for that party. At the nest level we
observe a statistically significant positive effect of left-right self-placement,
which means that voters to the right of the ideological spectrum are more
likely to cast a vote for the opposition. This makes sense, since the government could be characterized as center-left, while the two opposition parties
could be characterized as center-right. Religiosity is not statistically significant.
7.9.2
Interpretation
Marginal Effects and Elasticities
At present, Stata does not have facilities to compute marginal effects for
the nested logit model. However, these effects are easily computed with a
decent calculator or a spreadsheet program. We begin by setting all of the
predictors to their mean values. For the data at hand, the mean distances to
PvdA, CDA, VVD, and D66 are 9.055, 10.370, 9.418, and 7.160, respectively.
The means for left-right self-placement and religiosity are 5.411 and .482,
respectively. We use this last set of values as the elements of $\mathbf{z}_2$, so that
$$\mathbf{z}'_2\hat{\boldsymbol\alpha} = \begin{pmatrix} 1 & 5.411 & .482 \end{pmatrix}\begin{pmatrix} -1.428 \\ .199 \\ -.249 \end{pmatrix} = -.471$$
The inclusive values are computed as
$$\hat{I}_1 = \ln\left[\exp(\hat\beta x_{11}) + \exp(\hat\beta x_{21})\right] = \ln\left[\exp(-.437 \times 9.055) + \exp(-.437 \times 10.370)\right] = -3.510$$
$$\hat{I}_2 = \ln\left[\exp(\hat\beta x_{12}) + \exp(\hat\beta x_{22})\right] = \ln\left[\exp(-.437 \times 9.418) + \exp(-.437 \times 7.160)\right] = -2.812$$
Now we can compute the various probabilities. First, the probability of voting for the PvdA given that someone votes for the government is
$$\hat\pi_{1|1} = \frac{\exp(\hat\beta x_{11})}{\exp(\hat\beta x_{11}) + \exp(\hat\beta x_{21})} = \frac{\exp(-.437 \times 9.055)}{\exp(-.437 \times 9.055) + \exp(-.437 \times 10.370)} = .640$$
This means that the probability of voting for the CDA given that one votes for the government is given by
$$\hat\pi_{2|1} = 1 - \hat\pi_{1|1} = .360$$
Further, the probability of voting for the government is given by
$$\hat\pi_1 = \frac{\exp(\hat\lambda_1\hat{I}_1)}{\exp(\hat\lambda_1\hat{I}_1) + \exp(\mathbf{z}'_2\hat{\boldsymbol\alpha} + \hat\lambda_2\hat{I}_2)} = \frac{\exp(.740 \times -3.510)}{\exp(.740 \times -3.510) + \exp(-.471 + .682 \times -2.812)} = .448$$
Similarly, the probability of voting for the VVD given an opposition vote is
$$\hat\pi_{1|2} = \frac{\exp(\hat\beta x_{12})}{\exp(\hat\beta x_{12}) + \exp(\hat\beta x_{22})} = \frac{\exp(-.437 \times 9.418)}{\exp(-.437 \times 9.418) + \exp(-.437 \times 7.160)} = .272$$
This means that the probability of voting for D66 given an opposition vote is
$$\hat\pi_{2|2} = 1 - \hat\pi_{1|2} = .728$$
Further, the probability of voting for the opposition is
$$\hat\pi_2 = 1 - \hat\pi_1 = .552$$
The joint probabilities are then
π̂11 = π̂1|1 π̂1 = .287
π̂21 = π̂2|1 π̂1 = .161
π̂12 = π̂1|2 π̂2 = .150
π̂22 = π̂2|2 π̂2 = .402
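These calculations are easily reproduced in Mata from the Table 7.1 estimates; the sketch below simply recomputes the conditional, nest, and joint probabilities reported above:

mata
b   = -.437                                          // issue distance coefficient
l1  = .740                                           // lambda-hat, government nest
l2  = .682                                           // lambda-hat, opposition nest
za  = -1.428 + 5.411*.199 + .482*(-.249)             // z'_2 alpha-hat (about -.471)
I1  = ln(exp(b*9.055) + exp(b*10.370))               // inclusive value, government nest
I2  = ln(exp(b*9.418) + exp(b*7.160))                // inclusive value, opposition nest
p11 = exp(b*9.055)/(exp(b*9.055) + exp(b*10.370))    // P(PvdA | government), about .640
p12 = exp(b*9.418)/(exp(b*9.418) + exp(b*7.160))     // P(VVD | opposition), about .272
p1  = exp(l1*I1)/(exp(l1*I1) + exp(za + l2*I2))      // P(government), about .448
(p11*p1, (1-p11)*p1, p12*(1-p1), (1-p12)*(1-p1))     // the four joint probabilities
end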
We can now illustrate the computation of marginal effects for the nested
logit model. We do so for the probability of voting for the PvdA as a function
of very small changes in the issue distances. The first marginal effect is the
change in the probability of voting for PvdA as a consequence of a very
small increase in the issue distance to that party:
$$\frac{\partial\pi_{11}}{\partial x_{11}} = \pi_{11}\left\{1 \times \left[1 - \pi_{1|1}\right] + \lambda_1\left[1 - \pi_1\right]\pi_{1|1}\right\}\beta = .287 \times \{1 \times [1 - .640] + .740 \times [1 - .448] \times .640\} \times -.437 = -.078$$
As one would expect, the probability of voting for the PvdA declines as the
distance to this party increases. Next, consider the probability of voting for
the PvdA when the distance to the CDA increases by a very small amount.
Now we compute
$$\frac{\partial\pi_{11}}{\partial x_{21}} = \pi_{11}\left\{1 \times \left[0 - \pi_{2|1}\right] + \lambda_1\left[0 - \pi_1\right]\pi_{2|1}\right\}\beta = .287 \times \{1 \times [0 - .360] + .740 \times [0 - .448] \times .360\} \times -.437 = .060$$
Thus increasing the distance to the other government party benefits the
PvdA. The marginal effect for the PvdA vote probability relative to a change
in the issue distance to VVD is given by
$$\frac{\partial\pi_{11}}{\partial x_{12}} = \pi_{11}\left\{0 \times \left[1 - \pi_{1|2}\right] + \lambda_2\left[1 - \pi_2\right]\pi_{1|2}\right\}\beta = .287 \times \{0 \times [1 - .272] + .682 \times [1 - .552] \times .272\} \times -.437 = -.010$$
This is a minuscule effect. Finally, changing the distance to D66 by a small
amount changes the vote probability for the PvdA by
$$\frac{\partial\pi_{11}}{\partial x_{22}} = \pi_{11}\left\{0 \times \left[0 - \pi_{2|2}\right] + \lambda_2\left[0 - \pi_2\right]\pi_{2|2}\right\}\beta = .287 \times \{0 \times [0 - .728] + .682 \times [0 - .552] \times .728\} \times -.437 = .034$$
The corresponding elasticities are given by the following expressions:
$$\frac{\partial\pi_{11}}{\partial x_{11}}\frac{x_{11}}{\pi_{11}} = -2.461, \qquad \frac{\partial\pi_{11}}{\partial x_{21}}\frac{x_{21}}{\pi_{11}} = 2.168, \qquad \frac{\partial\pi_{11}}{\partial x_{12}}\frac{x_{12}}{\pi_{11}} = -.328, \qquad \frac{\partial\pi_{11}}{\partial x_{22}}\frac{x_{22}}{\pi_{11}} = .848$$
Thus a one percent increase in the issue distance to the PvdA is expected
to decrease the probability of voting for that party by roughly 2.5 percent.
On the other hand, a one percent increase in the distance to the CDA tends
to increase the PvdA vote probability by about 2.2 percent. The effects for
the issue distances to VVD and D66 can be interpreted in a similar way,
although these are considerably smaller.
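These elasticities follow directly from the marginal effects computed above; for instance, $-.078 \times 9.055/.287 \approx -2.461$ for the distance to the PvdA and $.060 \times 10.370/.287 \approx 2.168$ for the distance to the CDA.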
Discrete Change
To illustrate the use of predicted probabilities, let us consider the effect of changing left-right self-placement from the minimum to the maximum, while holding all else equal. Since the conditional probabilities and inclusive values are not functions of the attributes of the decision makers, they remain unaffected by left-right self-placement. Thus, regardless of left-right self-placement,
$$\hat{I}_1 = -3.510, \quad \hat{I}_2 = -2.812, \quad \hat\pi_{1|1} = .640, \quad \hat\pi_{2|1} = .360, \quad \hat\pi_{1|2} = .272, \quad \hat\pi_{2|2} = .728$$
At the minimum level of left-right self-placement, which indicates the extreme left of the ideological spectrum, $\mathbf{z}'_2\hat{\boldsymbol\alpha} = -1.349$. Thus,
$$\hat\pi_1 = \frac{\exp(.740 \times -3.510)}{\exp(.740 \times -3.510) + \exp(-1.349 + .682 \times -2.812)} = .661, \qquad \hat\pi_2 = 1 - \hat\pi_1 = .339$$
Consequently,
$$\hat\pi_{11} = \hat\pi_{1|1}\hat\pi_1 = .423, \quad \hat\pi_{21} = \hat\pi_{2|1}\hat\pi_1 = .238, \quad \hat\pi_{12} = \hat\pi_{1|2}\hat\pi_2 = .092, \quad \hat\pi_{22} = \hat\pi_{2|2}\hat\pi_2 = .247$$
At the maximum level of left-right self-placement, which corresponds to the extreme right of the ideological spectrum, $\mathbf{z}'_2\hat{\boldsymbol\alpha} = .442$. Therefore,
$$\hat\pi_1 = \frac{\exp(.740 \times -3.510)}{\exp(.740 \times -3.510) + \exp(.442 + .682 \times -2.812)} = .246, \qquad \hat\pi_2 = 1 - \hat\pi_1 = .754$$
and
π̂11 = π̂1|1 π̂1 = .157
π̂21 = π̂2|1 π̂1 = .088
π̂12 = π̂1|2 π̂2 = .205
π̂22 = π̂2|2 π̂2 = .550
Thus the maximum discrete changes are -.266 for PvdA, -.150 for CDA, .113
for VVD, and .303 for D66.
7.9.3
Model Fit
The percentage of correct predictions is quite good at 62.5 percent. The
average probability of a correct prediction is also respectable at .524. These
statistics can be computed analogously to the conditional logit model.
7.9.4
Hypothesis Testing
An important hypothesis to test is whether the effects of the inclusive values
are statistically distinguishable from 1. If they are not, then this suggests
that the stochastic components of the alternatives are uncorrelated in each
nest and that we could have sufficed by estimating a multinomial/conditional
logit model. As Table 7.1 reveals, the LR test statistic for H0 : λ1 = λ2 = 1
comes out to 4.82. When referred to a χ2 -distribution with 2 degrees of
freedom, we obtain a p-value of .09, which means that we can reject the
hypothesis that a Lucean model would have sufficed but only if we adopt a
rather liberal Type-I error rate.
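As a check, with two degrees of freedom the $\chi^2$ survival function has a simple closed form, so the p-value can be verified by hand: $\Pr(\chi^2_2 > 4.82) = \exp(-4.82/2) \approx .090$.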
Chapter 8
Choice Set Models
The nested logit model requires that the researcher specifies the nesting
structure ahead of time. That is, he or she has to indicate how many nests
there are, how many members each nest has, and who those members are.
Sometimes we have good prior theory, or at least hunches, to provide this
information. But often we are uncertain about the precise nesting structure
and in those cases we need an alternative to nested logit.
In a 1977 article, Manski provided such an alternative, which has since
then been dubbed the probabilistic choice set multinomial logit model or
PCMNL (see Başar and Bhat 2004). Here choice is assumed to occur in
two stages. In the first stage, decision makers narrow down the field of
alternatives to a smaller choice set. In a second stage, they then select
one alternative from this choice set. Both stages may be driven by different
logics, so that the predictors need not be the same across the two stages. The
model does not globally assume IIA and it makes the size and composition
of the choice sets an empirical matter rather than something that must be
specified a priori. Here we shall look at logistic models, but probit variants
are also possible (Paap et al. 2005).
8.1
The Model
8.1.1
Choice Probabilities
In an important paper on random utility theory, Charles Manski (1977)
proposed that choice can be conceived of in terms of the selection of alternatives from choice sets that are probabilistic in nature. That is, alternatives are grouped together in choice sets that vary in size and composition.
Each choice set has a particular probability of being selected by the decision maker. Then an alternative is, again probabilistically, selected from
within each choice subset. The probability that decision maker q chooses
alternative i ∈ M is thus
$$\pi_{qi} = \sum_{C\in G}\pi_q(i|C)\,\pi_q(C) \tag{8.1}$$
Here M is the universal choice set, i.e. the choice set of all alternatives, and
G is the set of all non-empty subsets (C) of M , i.e. G is the powerset of
M excluding the null set. Further, πq (C) is the probability that individual
q’s choice set is C and πq (i|C) is the probability that alternative i is chosen
conditional on the choice set C.
8.1.2
Determinants of Choice
Consideration Stage
Parameterizations of this model were first developed in transportation economics (e.g. Ben-Akiva and Boccara 1995; Başar and Bhat 2004) and applied in consumer behavior (e.g. Moe 2006). Following these, we let the
inclusion of $i$ in voter $q$'s choice set, $C$, be a function of a vector $\mathbf{x}'_{qi}$ of attributes of the decision maker and alternatives (including an alternative-specific constant). If this function exceeds a latent threshold, then $i$ is included in $C$; otherwise, it is not. Thus, we assume that alternative $i$ needs to produce a minimum level of utility for it to be included in the choice set. We further assume that the latent threshold follows a standard logistic distribution, so that the probability of inclusion of $i$ in $C$ is given by
$$\omega_{qi} = \frac{1}{1 + \exp(-\mathbf{x}'_{qi}\boldsymbol\beta)} \tag{8.2}$$
where $\boldsymbol\beta$ is a vector of coefficients on $\mathbf{x}'_{qi}$.
Assuming that the thresholds are independent across alternatives, the probability of choice set $C$ is given by
$$\pi_q(C) = \frac{\prod_{i\in C}\omega_{qi}\prod_{j\notin C}(1 - \omega_{qj})}{1 - \prod_i(1 - \omega_{qi})} \tag{8.3}$$
where the denominator is a normalizing constant that eliminates the empty
choice set.1 This is important for the next stage of the model, which pertains
to the final vote choice of a decision maker.
[1] Thus, the model assumes that the choice set contains at least one alternative. This is a mild assumption because abstention—i.e. the decision not to choose—can be included among the alternatives.
Choice Stage
In the choice stage, we use a multinomial/conditional logit specification
(McFadden 1973, 1978) to capture the conditional probability of selecting i
from $C$. Let $\mathbf{z}'_{qi}$ denote a set of attributes of the alternatives and decision maker (excluding a constant) that influence choice in the second stage. Then
$$\pi_q(i|C) = \frac{\exp(\mathbf{z}'_{qi}\boldsymbol\gamma)}{\sum_{k\in C}\exp(\mathbf{z}'_{qk}\boldsymbol\gamma)} \tag{8.4}$$
for $i \in C$ and $\pi_q(i|C) = 0$ for $i \notin C$. Here $\boldsymbol\gamma$ is a vector of coefficients on $\mathbf{z}'_{qi}$. Note that the predictors entered into the second stage may be different from those entered into the first stage.
The unconditional probability of choosing alternative i can be derived
from (8.1). Note that $G$ includes $2^m - 1$ elements, where $m$ is the number of alternatives in the universal choice set, $M$. Thus, the unconditional
probability is based on an iterated application of logit models, whence the
alternative name of probabilistic choice set multinomial logit model.
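As an illustration of this iterated logic, the Mata sketch below enumerates the non-empty subsets for a small universal choice set. The function name and its inputs—the inclusion probabilities from (8.2) and the choice-stage indices $\mathbf{z}'_{qi}\boldsymbol\gamma$ from (8.4)—are hypothetical conveniences, not part of any existing command:

mata
real colvector pcmnl_probs(real colvector w, real colvector v)
{
    real scalar M, s, i, norm, pC, denom
    real colvector p, inC

    M = rows(w)
    p = J(M, 1, 0)
    norm = 1
    for (i = 1; i <= M; i++) norm = norm*(1 - w[i])
    norm = 1 - norm                                    // denominator of (8.3)
    for (s = 1; s <= 2^M - 1; s++) {                   // every non-empty subset C
        inC = J(M, 1, 0)
        for (i = 1; i <= M; i++) inC[i] = mod(floor(s/2^(i-1)), 2)
        pC = 1
        for (i = 1; i <= M; i++) pC = pC*(inC[i] ? w[i] : 1 - w[i])
        pC = pC/norm                                   // pi_q(C), equation (8.3)
        denom = 0
        for (i = 1; i <= M; i++) denom = denom + inC[i]*exp(v[i])
        for (i = 1; i <= M; i++) {
            if (inC[i]) p[i] = p[i] + (exp(v[i])/denom)*pC   // (8.4) weighted by (8.3)
        }
    }
    return(p)                                          // equation (8.1)
}
end

For example, pcmnl_probs((.9 \ .6 \ .3), (0 \ .5 \ -.2)) would return the three unconditional choice probabilities for made-up inputs; setting all inclusion probabilities to one reproduces conditional logit, in line with the special case discussed below.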
8.1.3
Special Cases
The Manski model contains the conditional logit model as a special case.
This case arises when ωqi = 1, ∀q, i. In this case, all alternatives are included
in the choice set with probability 1, which is tantamount to saying that each
individual is considering the universal choice set in his or her decision making
process. The Manski model does not preclude that there exist individuals
who consider all alternatives, but this is only one of the possible outcomes
and its prevalence should be judged based on the data.
8.2
Identification
Model identification can in theory be accomplished by imposing the standard constraints of multinomial/conditional logit. That is, the impact of
attributes of the decision maker is constrained to zero for one of the alternatives, and a model containing only attributes of the alternatives cannot
contain a constant. But identification problems may arise even with these
constraints in place. First, note that the parameters of the second stage
cannot be identified by choice sets consisting of one element only since the
conditional probability of choosing that element is by definition one, regardless of the parameter values. Such “degenerate” choice sets are likely to
arise where there is one dominant alternative.
[Figure 8.1: Examples of the PCMNL Log-Likelihood. Two panels plot the log-likelihood against individual parameters (one for β_lab, one for β_dist); large segments of each curve are nearly flat away from the maximum.]
Second, identification can become problematic when the inclusion or selection probability approaches one or zero. Additional predictors that push
the probability in the same direction will have unidentifiable parameters,
since pretty much any estimate would be consistent with the choice probabilities. This property also leads to estimation problems, which we consider
next.
8.3
Estimation
8.3.1
The Log-Likelihood Function
The PCMNL model can be estimated using maximum likelihood. Let dqi be
an indicator that takes on the value 1 if an individual chooses alternative i
and 0 otherwise. Then the log-likelihood function can be written as
$$\ell = \sum_q\sum_i d_{qi}\ln\pi_{qi} = \sum_q\sum_i d_{qi}\ln\left[\sum_{C\in G}\pi_q(i|C)\,\pi_q(C)\right] \tag{8.5}$$
Başar and Bhat (2004) optimize this log-likelihood using standard methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. Wilson (forthcoming) used this approach to estimate a choice set model for the three major contenders in Mexican elections.
Standard optimization methods, however, can easily run into serious
convergence problems because the log-likelihood function can be irregular
in the sense that large segments are nearly flat. This is true, for example,
for the 1st stage constant and it is easy to see why. If the constant reaches
a certain value, the probability that an alternative is included in the choice
set approximates 1. Further increasing this constant will not change the inclusion probability, which results in no further changes in the log-likelihood
function (see footnote 2). As long as one selects starting values in the strictly concave region near the true ML estimate, the BFGS algorithm should operate smoothly. But if one selects starting values corresponding to the flat region of the log-likelihood, convergence problems are bound to emerge. Figure 8.1 illustrates this problem, using some of the estimates from the PCMNL
for the British elections example that we shall consider later in this chapter.
This figure also demonstrates that the problem is not confined to the 1st
stage constant.
8.3.2
A Memetic Approach
Wilson, Hangartner, and Steenbergen (2008) propose a memetic approach
to solve this problem. This approach combines the BFGS algorithm with a
genetic algorithm (GA) in a setup that is similar to the one proposed by Liu
and Mahmassani (2000) for multinomial probit models. First, they create a
population of trial solutions and let the GA perform stochastic operations
on these code-strings until something in the vicinity of the maximum of the
log-likelihood function is found. Next, they use the BFGS algorithm starting
from the best-fitting values to discover the ML estimate.
Unlike more commonly used optimization algorithms such as Newton-Raphson and quasi-Newton methods, GAs are not derivative-based. Therefore, GAs are perfectly suited for the PCMNL because they can handle
flat regions (and even discontinuities) in the likelihood function very well.
However, GAs perform poorly at local hill-climbing. Because the likelihood function of the PCMNL is—as long as the PCMNL cannot be reduced
to the MNL—strictly concave in the neighborhood of the optimal solution,
derivative-based algorithms such as BFGS can be used to find the maximum
once the GA selected this region of the objective function.
The optimization problem we face can be represented using a fixed-length
parameter vector of trial values. In GA lingo, this vector of trial values is
called an individual. A population consists of many such individuals (about
5000 in our implementation). The evaluation of a trial solution using the
[2] The limitations of IEEE floating-point arithmetic, which is not exact, further aggravate this problem.
likelihood function returns a likelihood value. This likelihood value is a measure of the fitness of the individual. The GA that Wilson, Hangartner, and
Steenbergen (2008) use is taken from the R implementation of GENOUD
(Mebane and Sekhon 1998). The GA performs a set of stochastic operators to evolve the population of trial solutions over a series of generations.
It consists of the following basic operators (for details, see Mebane and
Sekhon, 1998): (i) reproduction, which entails selecting a code-string with a
probability that increases with the fitness; (ii) crossover, which uses pairs of code-strings to create new code-strings; and (iii) mutation, which randomly changes the values of elements of a single code-string. Mutation prevents the algorithm from being trapped in a local maximum and ensures ergodicity (see footnote 3). This
ability to leave local maxima is a main advantage compared to deterministic
derivative-based methods such as BFGS, which get stuck there forever.
Because we use the GA only as a device to get starting values in the
concave region of the likelihood function for the following runs of BFGS, our
results may depend to a lesser degree on the success of the GA in finding the optimum than in GA-only applications. However, it is still important to
mention the properties of a GA to judge the stability of our results. The
long-run properties of a GA may best be understood by thinking of the
GA as a Markov chain. A state of the chain is the population of the size
used in the GA. Given this finite size and the finite length of the codestrings, the state space is also finite. Such a finite GA has been proven to be
approximately a finite and irreducible Markov chain. As long as the best trial
solution is maintained from one generation to the next, the usual asymptotic
proofs for convergence of a Markov chain to its stationary distribution apply
(for a critical discussion, see e.g. Rudolph 1994).
ML estimators gain their desirable properties (consistency, invariance,
asymptotic normality) because they can be derived via the likelihood function. This properties are independent from the optimization algorithm used.
Therefore, our ML estimates have the same interpretation as any other parameter derived by maximum likelihood.
[3] Very informally, a search space is said to be ergodic if there is a non-zero probability of generating any solution from any population state.
8.4
Interpretation
8.4.1
Marginal Effects and Elasticities
Let us start by considering the marginal effect of changes in attributes of
the alternatives. Let wim be the mth attribute for alternative i. We assume
that the attribute is relevant for both decision stages. Then we can derive
the following marginal effects:4
$$\frac{\partial\pi_i}{\partial w_{im}} = (1 - B_i)\pi_i\gamma_m + \left\{\sum_{C\in G}\pi(i|C)\left[1 - \pi(i|C)\right]\pi(C)\right\}\beta_m$$
$$\frac{\partial\pi_j}{\partial w_{im}} = \left\{\sum_{C\in G}\pi(j|C)\pi(C)\delta_{ij} - \pi_j B_i\right\}\gamma_m + \left\{\sum_{C\in G} -\pi(i|C)\pi(j|C)\pi(C)\right\}\beta_m$$
where
$$B_i = \frac{\omega_i}{1 - \prod_i(1 - \omega_i)}$$
This term is equal to the sum of all of the choice set probabilities that include
alternative i. The corresponding elasticities are (see Başar and Bhat 2004):
$$\frac{\partial\pi_i}{\partial w_{im}}\frac{w_{im}}{\pi_i} = \left\{(1 - B_i)\gamma_m + \frac{1}{\pi_i}\left[\sum_{C\in G}\pi(i|C)\left[1 - \pi(i|C)\right]\pi(C)\right]\beta_m\right\}w_{im} \tag{8.6}$$
$$\frac{\partial\pi_j}{\partial w_{im}}\frac{w_{im}}{\pi_j} = \left\{\left[\frac{1}{\pi_j}\sum_{C\in G}\pi(j|C)\pi(C)\delta_{ij} - B_i\right]\gamma_m + \frac{1}{\pi_j}\left[\sum_{C\in G} -\pi(i|C)\pi(j|C)\pi(C)\right]\beta_m\right\}w_{im} \tag{8.7}$$
Equation (8.6) gives the self-elasticity, i.e. the percentage change in the
probability of voting for alternative i due to a one percent change in the
attribute of that alternative. Equation (8.7) gives the cross-elasticity, i.e.
the percentage change in the probability of voting for alternative j due to a
one percent change in the attribute of another alternative, i.
[4] For simplicity, the subscript for the decision maker has been omitted from the following expressions.
Equations (8.6) and (8.7) are easily adjusted when an attribute enters
only one stage of the decision-making process. If wim enters only the first
stage, then only the terms in γm are retained. These terms pertain to
consideration elasticity. If wim appears only in the second stage, then only
terms in βm are retained. These terms reflect substitution elasticity at the
choice stage.
The cross-elasticities show an important property, namely that they depend not just on i but also on probability information for j. As such, the
cross-elasticities will not be identical across all alternatives other than i, as
they are in the conditional logit model. That property of conditional logit is
a direct outgrowth of the IIA assumption that is embedded in that model.
That the PCMNL cross-elasticities depend on the nature of the other alternatives implies that this model does not make an IIA assumption. Of
course, when all decision makers consider all alternatives, then the formulas
for the self- and cross-elasticities collapse to those of the conditional logit
model, which means that they again display the IIA assumption.
Next consider the effect of changes in attributes of the decision maker.
Let wm be the mth attribute of the decision maker and let this attribute be
relevant for both stages of the decision making process. Then the marginal
effect is
∂π_i/∂w_m = π_i (1 − B_i) γ_im + Σ_{j≠i} Σ_{C∈G} π(i|C) π(C) δ_ij γ_jm − Σ_{j≠i} π_i B_j γ_jm + Σ_{C∈G} π(i|C) π(C) β_im − Σ_{j≠i} Σ_{C∈G} π(i|C) π(j|C) π(C) δ_ij β_jm
The corresponding elasticity is given by

(∂π_i/∂w_m)(w_m/π_i) = { (1 − B_i) γ_im + (1/π_i) Σ_{j≠i} Σ_{C∈G} π(i|C) π(C) δ_ij γ_jm − Σ_{j≠i} B_j γ_jm + (1/π_i) Σ_{C∈G} π(i|C) π(C) β_im − (1/π_i) Σ_{j≠i} Σ_{C∈G} π(i|C) π(j|C) π(C) δ_ij β_jm } w_m        (8.8)
If the attribute of the decision maker enters only into the first stage, then
only terms in γ are retained, resulting in consideration elasticities. If the
attribute of the decision maker enters only into the second stage, then only
terms in β are retained, resulting in substitution elasticities at the choice
stage.
8.4.2 Discrete Change in Predicted Probabilities
Discrete changes in predicted probabilities can be calculated based on (8.1).
The usual recipe is used. One sets all of the predictors in (8.2) and (8.4)
equal to their means, except for the predictor of interest. For that predictor,
one lets the values vary between a and b (e.g., minimum and maximum).
The resulting difference in probabilities is then the discrete change.
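To make the recipe concrete, the sketch below illustrates the discrete-change calculation. The function predict_prob is a hypothetical stand-in for the PCMNL choice probability in (8.1); a simple logistic form and simulated data are used here purely so the illustration is self-contained.

    import numpy as np

    def predict_prob(x, beta):
        """Hypothetical stand-in for the model's predicted probability.
        A simple logistic form is used purely for illustration."""
        return 1.0 / (1.0 + np.exp(-x @ beta))

    rng = np.random.default_rng(0)
    # Illustrative design matrix: constant plus two covariates
    X = np.column_stack([np.ones(500), rng.normal(size=500), rng.normal(size=500)])
    beta = np.array([0.2, 0.8, -0.5])          # assumed coefficients

    # The usual recipe: hold all predictors at their means ...
    x_bar = X.mean(axis=0)

    # ... and let the predictor of interest (column 1) vary from a to b
    a, b = X[:, 1].min(), X[:, 1].max()
    x_low, x_high = x_bar.copy(), x_bar.copy()
    x_low[1], x_high[1] = a, b

    # The discrete change is the difference in predicted probabilities
    discrete_change = predict_prob(x_high, beta) - predict_prob(x_low, beta)
    print(discrete_change)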
8.5 Example
We illustrate the PCMNL for the 2005 elections for the British House of
Commons (see Wilson, Hangartner, and Steenbergen 2008). At present
there are no canned routines for estimating the model in Stata or any other
package. The results shown here were obtained through R and required up
to 30 minutes for each estimation.
Using the British Election Study (Sanders et al. 2005), we have complete data on three parties: the Conservatives, Labour, and the Liberal
Democrats. The incumbent Labour administration under Tony Blair was
seeking a third term in office. The Conservatives were trying to regain
power and the Liberal-Democrats were hoping to strengthen their position
in the House. These three parties gathered 89.7 percent of the vote. Labour succeeded in staying in power, even though it saw its margins decrease.
In the consideration stage, we assume that British voters used ideological proximity to decide which parties to consider. The more distant the party, the less likely its inclusion in the choice set. Thus, we postulate
a simple spatial logic in the consideration stage (Downs 1957; Enelow and
Hinich 1984). We measure ideological distance as the absolute difference
between a respondent's own placement on an 11-point left-right scale and
his/her placement of the parties on that scale. Thus, ideological distance
ranges from 0 (the voter and party are positioned in the same ideological
location) to 10 (the voter and party are positioned at opposite ends of the
ideological spectrum). In addition, we include party specific constants for
the three parties, which capture the baseline probability of their consideration.
In the choice stage, we include two attributes of the alternatives. First,
we include again ideological distance in order to allow voters to use this to
select their preferred alternative from their choice set. Second, we include
sympathy scores for the leaders of the three parties (Tony Blair, Michael
Howard, and Charles Kennedy). This attribute captures the possibility that
the election was, at least in part, about the likability of the leaders of the
three parties. The sympathy scores range from 0 (strongly dislike) to 10
(strongly like).
We include three attributes of voters in the choice stage. First, we include
two social class dummies, one for middle class and one for no class (identification with the working class serves as the baseline). Traditionally, voting in
Britain has had a strong class component (Butler and Stokes 1973). While
it has been argued that class-based voting has declined in Britain (Franklin
1985), this view is not shared by all (Evans 2000). We split the difference
between these views by including class in the second stage of the PCMNL
rather than the first stage.
The two remaining voter attributes reflect evaluations of the performance
of the Blair government. First, in keeping with the economic voting literature (e.g. Lewis-Beck 1990), we believe that voters consider the state of the
economy in choosing whom to vote for. We measure this using a question
about the movement in the economy over the previous year; responses could
range from 1 (the economy “got a lot worse”) to 5 (it “got a lot better”).
Second, we include evaluations of the Iraq war since it was one of the defining acts of the Blair government and a highly salient issue in the campaign.
Evaluations of the war range from 1 (strong approval) to 4 (strong disapproval). The effect of the voter attributes is allowed to vary across
the alternatives. For identification purposes we constrain the effects to 0 for
Labour. We also include party-specific constants for the Conservatives and
Liberal-Democrats (Labour serves as the baseline).
8.5.1 Estimation Results
Table 8.1 shows the estimation results from the model. The estimates reveal
that ideological distance is not a statistically significant predictor in the consideration stage, leaving only the party-specific constants, all of which are
statistically significant. The lack of significance of ideological distance may
have to do with the fact that the three political parties are not miles apart in
ideological terms. The fact that only the party-specific constants are statistically distinguishable from zero is annoying from a substantive viewpoint:
these constants simply do not tell us very much about the reasons why voters consider certain parties. However, from a statistical viewpoint this does
not damage the PCMNL. The constants capture the baseline inclusion probabilities for the three parties, reflecting their electoral tides. To the extent
Table 8.1: PCMNL Estimates for the 2005 British House Elections

                              B        SE
Vote Stage
  Distance                 −.610*    .077
  Leader                    .870*    .079
  Middle Class × Cons      1.949*    .351
  Middle Class × Lib       1.396*    .343
  No Class × Cons          1.126*    .275
  No Class × Lib            .795*    .262
  Economy × Cons           −.677*    .143
  Economy × Lib            −.267*    .135
  Iraq × Cons              −.079     .123
  Iraq × Lib                .064     .128
  Cons                     1.040     .566
  Lib                      −.788     .599
Consideration Stage
  Distance                 −.150     .102
  Lab                      3.962*    .619
  Cons                     4.277*    .739
  Lib                      1.222*    .231

Note: Table entries are ML estimates with estimated standard errors. Two-tailed tests; * p < 0.05. APCC = .690. N = 2110.
that these probabilities vary across parties, this is important information in
its own right.
In the vote stage, most of the predictors are statistically significant. The exception is evaluations of the Iraq war, which simply did not matter in choosing a particular party.5 Ideological distance mattered in the second stage, with more proximate parties being more likely to be selected. Sympathy for the party leaders also influenced vote choice, with parties with more sympathetic leaders having a better probability of being selected. Being middle class or having no class identification had a positive effect on choosing the Conservatives or Liberal Democrats rather than Labour; more optimistic retrospective evaluations of the economy had a negative effect. A detailed interpretation of these results will follow. Before going there, however,
it is useful to consider the nature of the choice sets.
5 There is no evidence that the Iraq war played into the first stage of the model. When we tried a model with Iraq as the only first-stage predictor, it failed to achieve statistical significance.
Table 8.2: Predicted Choice Sets of British Voters

Choice Set                                       Percentage
Labour                                              0.55
Conservatives                                       0.70
Liberal Democrats                                   0.04
Labour & Conservatives                             26.20
Labour & Liberal Democrats                          1.42
Conservatives & Liberal Democrats                   1.79
Labour & Conservatives & Liberal Democrats         69.30

Note: Distribution of estimated choice sets from PCMNL. N = 2110.
Table 8.2 shows the predicted choice sets for this analysis. We might have expected British voters to consider all three parties before casting their vote, since doing so imposes no great cognitive burden and since the three parties are not miles apart ideologically. Based on our set of predictors, this turns out to be true for well
over two-thirds of the sample. For these voters, the PCMNL is indistinguishable from a multinomial logit model. The next most common consideration
set consists of Labour and the Conservatives. Between considering all parties or the two major parties we cover about 95.5 percent of the sample.
Only a small percentage of the sample had consideration sets consisting of
only one party (1.29 percent).
Despite the fact that the PCMNL reduces to a multinomial logit model
for nearly 70 percent of the sample, the model still fits the data better.
One indicator of this is the BIC or Bayesian Information Criterion, which
takes into consideration the fit of a model but also its level of parsimony,
penalizing rather severely against over-specification.6 In general, we prefer
models with a lower BIC value. For the PCMNL, we have BIC ≈ 2368;
for the multinomial logit, we have BIC ≈ 2382. Thus, the PCMNL is the
preferred model despite the fact that it is distinguishable from a multinomial
logit model for less than one-third of the sample.
6 The statistical definition of the BIC is BIC = −2ℓ + p ln N, where ℓ is the maximized log-likelihood, p is the number of estimated parameters, and N is the sample size.
Table 8.3: Elasticities for Ideological Distance in Britain

                              Change in Prob of Voting for
Change in Distance to          Lab       Cons      Lib
Labour                        −.572      .523      .471
Conservatives                  .454     −.908      .416
Liberal Democrats              .193      .197     −.757

Note: All elasticities were computed by setting distance to the party means, leader sympathy scores to the global mean, class to the modal category, and economic perceptions and evaluations of the Iraq War to their medians.
8.5.2 Interpretation

Elasticities
Table 8.3 shows the elasticities for ideological distance. As one would expect,
if the distance to a party increases then this reduces the probability of voting
for that party while increasing the probability of the other parties. But the
electoral costs of increased ideological distance are not equally distributed.
Because the Conservatives are already further removed from voters, on the
average, a percentage increase generates close to a percentage point decrease
in the probability of voting for this party. Both Labour and the Liberal
Democrats stand to gain from this loss, albeit to a different degree. A
percentage increase in the distance to the Liberal Democrats—a party that
is, on the average, closest to the voters in the sample—leads to a decrease
in the probability of voting for this party of about .75 percent, but neither
Labour nor the Conservatives gain much from this. By far the least sensitive
to shifts in ideological distance is Labour, which is on average situated
between the Liberal Democrats and Conservatives in terms of ideological
distance.
8.5.3 Predicted Probabilities
Figure 8.2 shows the predicted probabilities of voting for the three parties
across working and middle class voters, as well as people who do not identify
with a particular social class.7 It is clear that there remains a strong class
7 Ideological distance and leader sympathy are kept at the mean in these simulations, while economic perceptions and evaluations of the war in Iraq are kept at their median.
[Figure 8.2: Vote Choice and Class in Britain. Predicted probability of voting for Labour, the Conservatives, and the Liberal Democrats among working class, middle class, and no class identifiers.]
basis to voting in Britain. Working class identifiers are overwhelmingly
predicted to vote for Labour, whereas middle class identifiers clearly favor
the Conservatives. Among no class identifiers, the split between Labour and
the Conservatives is more even. The Liberal Democrats also perform better
among middle class and no class identifiers.
Chapter 9
The Theory of Maximum Likelihood Estimation
Maximum likelihood estimation or MLE is one of a number of different
estimation methods that are used in statistics and applied data analysis.
The general character of estimation problems is that we have one or more
parameters that describe the distribution of one or more variables in the
population; these parameters cannot be directly observed. Using data from
a sample drawn from this population, our goal is to obtain estimates of the
parameters. Estimation methods outline principles that allow us to generate
formulae for assembling data into estimates.
MLE stands out from several other estimation procedures such as least
squares and method of moments because it is versatile and is known to have
desirable asymptotic properties such as consistency, efficiency, and normality. King (1989) has also made the case that MLE provides a unified framework for both estimation and hypothesis testing. A potential downside of
MLE is that, unlike least squares estimation, it requires distributional assumptions. When false, these assumptions can negatively affect the quality
of estimates. It is thus imperative to carefully think through the distributional assumptions that one makes when applying MLE.
9.1 The Intuition behind MLE
Imagine a probability density function (PDF—continuous case) or probability mass function (discrete case) with a parameter θ. The function describes
the distribution of a random variable in the population; θ is some theoretically interesting attribute of that distribution. Now imagine that θ is given;
we know it with certainty. Then we have some ability to predict the data
that we would observe in a sample of size n. Put differently, given a particular value of θ, certain values are more likely to be sampled from the
distribution than others. Example 9.1 illustrates this intuition.
Example 9.1. Consider a binomial distribution with parameter π = .5.
We draw a sample of n = 2 observations. The probability of observing
y = 0 successes in this sample is .25, while the probability of observing
y = 1 success is .50, and the probability of observing y = 2 successes is .25.
Clearly, it is more likely to obtain a sample with a single success than one
where none or all of the sample units succeeded. Thus, y = 1 is the most
likely outcome given π = .5.
The logic so far is a straightforward application of the basic laws of
probability. Here we have treated the parameter as given and the data
as the thing we seek to predict and about which there is uncertainty. Of
course, the statistical inference problem is precisely the opposite: we know
our data but we do not know the parameter. But the underlying intuition
can be easily reversed. That is, taking our data as given, certain values
of the parameter are more likely to have given rise to this data than other
values. The parameter value that is most likely to have produced the data is
accepted as the estimate of the parameter. Example 9.2 illustrates how this
logic works.
Example 9.2. Consider a sample of size n = 2. We know that this sample
has been drawn from a binomial distribution. We also know that it includes
one success (y = 1). Two researchers debate the value of π that gave rise to
this data. According to researcher A, π = .5 is the parameter value that gave
rise to the sample. Researcher B believes that π = 1 is the right estimate of
the parameter value. Who would you believe?
The answer is that A is more likely to be correct. Under the binomial
distribution, the expected number of successes is nπ. Under A’s estimate
of π this produces one success in a sample of two. Under B’s estimate of
π we should expect to see two successes. The reality is that we observe
one success, which is consistent with A’s estimate of π but less so with B’s
estimate. Thus π = .5 is the parameter value that is most likely to have
given rise to the data.
The language used here is instructive. We are not saying that the binomial parameter is .5 with certainty. Even when π is different from .5,
it might be that we draw a sample where y = 1. But of all the possible
guesses/estimates of π, .5 is most likely to have given rise to the data at
hand. And an estimate of .5 is certainly a great deal more likely to have
produced one success in a sample of two observations than an estimate of 1.
This process of working backwards from the data to an estimator is called
MLE. More formally:
Definition: A ML estimator, θ̂, of a parameter θ produces an
estimate such that the probability density function or probability
mass function corresponding to this estimate is most likely to
have generated the data.
Note the “hat” on top of the parameter. It is conventional to denote estimators and estimates in this manner, so as to distinguish them clearly from
the parameter. So in the previous example, we could have denoted the ML
estimate as π̂ = .5.
With the definition in place, the question is now how one finds ML
estimates other than by way of trial and error, as was done in Example
9.2. It is to this question that we turn next. In the process of answering
this question, things will get a lot more mathematical. Underlying this
mathematical complexity, however, the same simple intuition remains.
9.2 Estimating Single Parameters
The intuitions of MLE should be formalized if we want the estimation procedure to be of practical use. To do so, we start by making two assumptions:
1. The population is characterized by a probability mass or density function that is known, up to the parameter(s) that has/have to be estimated.
2. The sample consists of n independent draws from the probability mass
or density function. In other words, the data are generated through
random sampling so that the observations in the sample are independent. More precisely, any dependencies between the observations are
captured through one or more parameters. Thus, the observations are
conditionally independent—conditional, that is, on the parameters of
the probability mass or density function.
Taken together, these two assumptions imply that the observations in the
sample are i.i.d.—identically and independently distributed.
9.2.1 The Likelihood and Log-Likelihood Functions
Based on these assumptions, we can now describe the likelihood of obtaining the data. We can then work out the estimator that is most likely to
have produced this data. Let f (y|θ) denote the probability density or mass
function, which contains a single parameter θ. Further, let D denote the
data, consisting of n independent observations. Then the following holds.
Definition: The likelihood function is given by:

Pr(D) = f(y_1, y_2, ..., y_n | θ)
      = f(y_1|θ) f(y_2|θ) ··· f(y_n|θ)
      = ∏_{i=1}^n f(y_i|θ)                          (9.1)
      = L(θ|D)

(The second equation in the definition follows from the independence of the observations.) Here L(θ|D), or L for short, stands for the likelihood function.
Equation (9.1) expresses the so-called likelihood principle, which states
that the information in any sample can be found, if at all, from the likelihood
function. This principle captures the intuition that I started out with and
provides the central foundation of MLE.
Example 9.3. Consider the binomial problem from Example 9.2. Imagine
that we observe y = 1, n = 2. For the binomial distribution, the likelihood
function is the same as the distribution, i.e.
L = [ n!/(y!(n − y)!) ] π^y (1 − π)^(n−y)
Figure 9.1 plots this likelihood for different values of π (and setting n = 2 and
y = 1). We see that the likelihood function reaches its apex at π̂ = .5, which
means this is the MLE. This supports the intuition outlined in Example 9.2.
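The same result is easy to verify numerically. The sketch below (a minimal illustration, not part of the original analysis) evaluates the binomial likelihood over a grid of candidate values for π and confirms that it peaks at π = .5 for n = 2 and y = 1.

    import numpy as np

    n, y = 2, 1
    pi_grid = np.linspace(0.01, 0.99, 99)

    # Binomial likelihood: (n choose y) * pi^y * (1 - pi)^(n - y); here (2 choose 1) = 2
    likelihood = 2 * pi_grid**y * (1 - pi_grid)**(n - y)

    # The grid value with the highest likelihood is the (approximate) MLE
    print(pi_grid[np.argmax(likelihood)])   # 0.5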
In practice, the likelihood function is transformed into a log-likelihood
function, which is equal to the natural logarithm of L.
Definition: The log-likelihood function is given by

ln L(θ|D) = ln[ ∏_{i=1}^n f(y_i|θ) ]
          = Σ_{i=1}^n ln f(y_i|θ)                   (9.2)
          = ℓ(θ|D)
[Figure 9.1: Binomial Likelihood Function. The likelihood L plotted against π for n = 2 and y = 1.]
Here ℓ(θ|D), or ℓ for short, is the log-likelihood, which is sometimes also abbreviated as LL or lnL. This function is used in lieu of L because working with sums is easier than working with products (e.g. computers handle summation much better than multiplication). Moreover, since the likelihood function is essentially a product of probabilities in the discrete case, its numeric value can be extremely small, which can cause problems with numeric precision in computers. The log-likelihood function bypasses this problem because the logarithm turns a product of very small fractions into a sum of moderately sized (negative) numbers.
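The numerical advantage is easy to demonstrate. In the sketch below, a product of a few thousand small probabilities underflows to zero in double-precision arithmetic, while the corresponding sum of logs remains perfectly representable.

    import numpy as np

    probs = np.full(2000, 0.01)          # 2000 probabilities of .01 each

    print(np.prod(probs))                # 0.0  -- the product underflows
    print(np.sum(np.log(probs)))         # about -9210.34 -- the log-likelihood is fine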
9.2.2 Optimizing the Log-Likelihood
The likelihood function gives the likelihood of the data. To find the MLE
of θ, we should maximize this function.
Definition: Let θ̂ be the maximum likelihood estimator, then

L(θ̂) = sup_{θ∈Θ} L(θ)                              (9.3)

where sup stands for supremum (the maximal element in a set of real numbers) and Θ is the parameter space, i.e. a non-empty set of feasible values of the parameter.
For the reasons described above, in practice we optimize the log-likelihood
function rather than the likelihood function; both will produce identical
estimates.
Finding the maximum of the log-likelihood function involves three steps common to all optimization procedures:

1. Take the first derivative of ℓ with respect to θ:

   ℓ' = dℓ/dθ

2. Set this derivative equal to 0:

   ℓ' = 0

   This is called the log-likelihood equation. It captures the first-order condition for finding a maximum. When we solve for θ we obtain the ML estimator, θ̂.

3. Verify that θ̂ is a maximum by evaluating the second derivative of ℓ at θ̂:

   ℓ'' = d²ℓ/dθ²

   If ℓ'' < 0 at θ̂, then θ̂ is the ML estimator. This second derivative test is sometimes known as the second-order condition for finding a maximum.
It is important to dwell a little on the first and second derivatives. The first derivative captures the gradient of the log-likelihood function. As stated by the log-likelihood equation, a (local) maximum occurs where this gradient is zero (assuming the second-order condition is also met). For simple problems, such as those considered in this and the next section, the log-likelihood equation can be solved algebraically. In reality, most log-likelihood equations are too complex to be solved in this way. In those cases, numerical optimization methods are used to obtain solutions, a topic that will be discussed in greater detail later in this chapter.

The second derivative captures the curvature of the log-likelihood function. More specifically, −ℓ'' measures the curvature: the larger −ℓ'' is, the more sharply peaked the log-likelihood function is around its maximum. In general this is desirable, as it allows for more precise estimation. The quantity −ℓ'' is known as the observed Fisher information.
Definition: The observed Fisher information is equal to negative one times the second derivative of the log-likelihood, i.e. −ℓ''.
Example 9.4. Consider the Poisson distribution

f(y|µ) = µ^y exp(−µ) / y!

where µ > 0 is the parameter that we want to estimate. This distribution is frequently used to model count data (e.g. the number of protests against the government in each month). Imagine that we have drawn n independent observations from this distribution, then the likelihood function is given by

L = [ µ^{y_1} exp(−µ)/y_1! ] × [ µ^{y_2} exp(−µ)/y_2! ] × ··· × [ µ^{y_n} exp(−µ)/y_n! ]
  = µ^{Σ_i y_i} exp(−nµ) / ∏_i y_i!

The log-likelihood function is

ℓ = Σ_i y_i ln(µ) − nµ − ln( ∏_i y_i! )
  = Σ_i y_i ln(µ) − nµ − Σ_i ln(y_i!)

The optimization of ℓ proceeds as follows. First, we take the first derivative

ℓ' = dℓ/dµ = (Σ_i y_i)/µ − n

Next we set this derivative equal to 0 and solve for µ:

(Σ_i y_i)/µ − n = 0   ⟺
(Σ_i y_i)/µ = n       ⟺
µ̂ = (Σ_i y_i)/n       ⟺
µ̂ = ȳ

This means that the sample mean is the MLE of µ. Before we accept this, however, we need to make sure that the sample mean produces a maximum for ℓ. Taking the second derivative gives

ℓ'' = −(Σ_i y_i)/µ²
We know that µ² is positive for any feasible estimate of µ. Since y can take on only non-negative values in the Poisson distribution, the numerator of the expression is positive as long as there is at least one non-zero value of y, which is almost always the case. Thus, ℓ'' is negative, which means that the second-order condition for a maximum is satisfied.

[Figure 9.2: Poisson Likelihood Function. The likelihood L plotted against µ.]

[Figure 9.3: Poisson Log-Likelihood Function. The log-likelihood ℓ plotted against µ.]
An example can illustrate that ȳ is the MLE. Consider a sample of the following 5 observations: 0, 0, 1, 1, 2. The likelihood function is given by

L = µ⁴ exp(−5µ) / (0! 0! 1! 1! 2!)

Further, the log-likelihood function is given by

ℓ = 4 ln µ − 5µ − ln(0! 0! 1! 1! 2!)

We can plot L and ℓ for different values of µ, as is done in Figures 9.2 and 9.3. As we can see, both figures have a global maximum at µ = .8, which makes this the MLE. Of course, this is also the value that we obtain by computing ȳ. Hence, the figures illustrate that ȳ is the MLE.
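The same conclusion can be reached numerically. The sketch below evaluates the Poisson log-likelihood for the sample 0, 0, 1, 1, 2 over a grid of candidate values of µ and confirms that it peaks at the sample mean, ȳ = .8.

    import numpy as np
    from scipy.special import gammaln   # ln(y!) = gammaln(y + 1)

    y = np.array([0, 0, 1, 1, 2])

    def loglik(mu):
        # Poisson log-likelihood: sum_i [ y_i ln(mu) - mu - ln(y_i!) ]
        return np.sum(y * np.log(mu) - mu - gammaln(y + 1))

    mu_grid = np.linspace(0.1, 2.5, 241)
    ll = np.array([loglik(m) for m in mu_grid])

    print(mu_grid[np.argmax(ll)])   # 0.8
    print(y.mean())                 # 0.8, the sample mean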
9.2.3 Standard Errors
For purposes of statistical inference, including the computation of confidence
intervals, it is necessary to compute the standard error of the ML estimator
of a parameter. This task requires that we compute the expected Fisher
information, invert it, and then take the square root. The result of these
operations is a formula for the standard error of the estimator.
Earlier, we saw that the observed Fisher information is simply the negative of the second derivative of the log-likelihood function, i.e. −ℓ''. The expected Fisher information is defined as follows:

Definition: The expected Fisher information is equal to negative one times the expected value of the second derivative of the log-likelihood function, i.e. −E[ℓ''].
The following examples illustrate the derivation of the expected Fisher information.
Example 9.5. For the Poisson log-likelihood, Example 9.4 demonstrated that ℓ'' = −(1/µ²) Σ_i y_i. To obtain the expected Fisher information we
begin by taking the expectation of ℓ'':

E[ℓ''] = E[ −(Σ_i y_i)/µ² ]
       = −(1/µ²) E[ Σ_i y_i ]
       = −(1/µ²) Σ_i E[y_i]
       = −(1/µ²) Σ_i µ
       = −nµ/µ²
       = −n/µ

The second line reflects the property that the expectation of a constant times a variable is equal to the constant times the expectation of the variable. In this case, the constant is −(1/µ²) and the variable is y_i. The third line reflects the property that the expectation of a sum is equal to the sum of the expectations. The fourth line shows that the expected value of a Poisson-distributed variable is µ. The fifth line carries out the summation over this expected value, and the last line simplifies the result. This result is not yet the expected Fisher information. To obtain this we need to take negative one times the expected value, i.e.

−E[ℓ''] = n/µ
The variance of a ML estimator is obtained by inverting the expected
Fisher information:
Definition: The variance of a ML estimator is equal to the
reciprocal of the expected Fisher information for that estimator,
i.e.
V[θ̂] = 1 / (−E[ℓ''])                               (9.4)
The standard error is equal to the square root of (9.4). Thus, once the
expected Fisher information has been obtained, computing standard errors
for ML estimators is quite easy.
Example 9.6. In Example 9.5, we obtained the expected Fisher information for the Poisson distribution. Using this result we obtain

V[µ̂] = (n/µ)⁻¹ = µ/n

The standard error is then √(µ/n) = √µ/√n.
If one takes a close look at the formulas derived in example 9.6, one
observes that the right-hand side is a function of the very parameters that
we seek to estimate. We can overcome this problem by substituting the
ML estimators of the parameters. Doing this, there are no more unknown
quantities on the right-hand side of the variance formulas. The resulting
variances are so-called estimated variances and the corresponding standard errors are estimated standard errors. For instance, the estimated
variance for the Poisson ML estimator is V̂ [µ̂] = µ̂/n. The “hat” over the
V-symbol highlights that we are dealing with an estimated quantity. In
practice, almost all of the standard errors that we report will be estimated
standard errors since it is rare to find that the right-hand side of a variance
formula contains known quantities only.
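For the Poisson example, the estimated standard error is a single line of arithmetic. The sketch below, assuming the same five observations used earlier, computes µ̂ and its estimated standard error √(µ̂/n).

    import numpy as np

    y = np.array([0, 0, 1, 1, 2])
    n = len(y)

    mu_hat = y.mean()                    # ML estimate of mu
    se_hat = np.sqrt(mu_hat / n)         # estimated standard error, sqrt(mu_hat / n)

    print(mu_hat, se_hat)                # 0.8 and 0.4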
9.3 Estimating Multiple Parameters
In many cases, the distributions in which we are interested contain multiple
parameters. This is true, for example, of the normal distribution, which
is characterized by both a mean and a variance. Even if the distribution
contains only one parameter, we may wish to model it as a function of
covariates, which effectively means that we will have to consider a set of
parameters. The theory of MLE extends quite easily to the multi-parameter
case. Indeed, the logic of estimating multiple parameters is similar to the
single-parameter case. However, the mechanics of MLE become a bit more
complicated in that we will have to rely on multivariate calculus. It will
also prove useful to redefine a number of constructs in terms of vectors and
matrices.
9.3.1 Likelihoods, Gradients, and Hessians
Consider a probability density or mass function f (y|θ), where θ is a vector
of p parameters. Drawing a sample from this distribution we can define the
likelihood function in the following way:
Definition: For a sample of n (conditionally) independent observations, the likelihood function is given by

L = ∏_{i=1}^n f(y_i|θ)                              (9.5)

Further,

Definition: The log-likelihood function is given by

ℓ = Σ_{i=1}^n ln f(y_i|θ)                            (9.6)

The maximum likelihood estimator can be defined in the following manner.

Definition: Let θ̂ be the maximum likelihood estimator, then

ℓ(θ̂) = sup_{θ∈Θ} ℓ(θ)                               (9.7)

where Θ is the p-dimensional parameter space.

The ML estimator can be obtained by taking the first partial derivatives of ℓ, setting them to zero, and solving for the elements of θ. An example can illustrate how this process works.
Example 9.7. Consider the normal PDF:

f(y|µ, σ²) = (1/√(2πσ²)) exp[ −(1/(2σ²)) (y − µ)² ]

where µ and σ² are the parameters of interest, corresponding to the population mean and variance, respectively. Thus, θ' = (µ σ²). Assume that a sample of n independent observations has been drawn from this distribution, then the likelihood function is given by

L = (1/√(2πσ²)) exp[ −(1/(2σ²)) (y_1 − µ)² ] × (1/√(2πσ²)) exp[ −(1/(2σ²)) (y_2 − µ)² ] × ··· × (1/√(2πσ²)) exp[ −(1/(2σ²)) (y_n − µ)² ]
  = (2πσ²)^{−.5n} exp[ −(1/(2σ²)) Σ_i (y_i − µ)² ]

The log-likelihood function is

ℓ = −.5n ln(2π) − .5n ln(σ²) − (1/(2σ²)) Σ_i (y_i − µ)²
To find the ML estimator of µ we take the first partial derivative of ℓ with respect to this parameter:

∂ℓ/∂µ = (1/σ²) Σ_i (y_i − µ)

By setting this derivative to 0 we can derive µ̂:

(1/σ²) Σ_i (y_i − µ) = 0   ⟺
Σ_i (y_i − µ) = 0          ⟺
Σ_i y_i − nµ = 0           ⟺
nµ = Σ_i y_i               ⟺
µ̂ = (Σ_i y_i)/n            ⟺
µ̂ = ȳ

Thus the sample mean is the ML estimator of the mean of a normal distribution.
To find the MLE of σ² we take the first partial derivative of ℓ with respect to this parameter:

∂ℓ/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_i (y_i − µ)²

To obtain σ̂² we set this derivative to 0 and solve:

−n/(2σ²) + (1/(2σ⁴)) Σ_i (y_i − µ)² = 0            ⟺
(1/(2σ²)) [ −n + (1/σ²) Σ_i (y_i − µ)² ] = 0        ⟺
−n + (1/σ²) Σ_i (y_i − µ)² = 0                      ⟺
(1/σ²) Σ_i (y_i − µ)² = n                           ⟺
σ̂² = Σ_i (y_i − µ̂)² / n                             (9.8)

Here we have substituted µ̂ for µ in order to ensure that the right-hand side consists of known quantities only. Notice that the ML estimator of σ² is not the same as the sample variance, s², which divides by n − 1 instead of n.1

1 The ML estimator of σ² is biased in small samples, although not asymptotically. The bias can be corrected by dividing by n − 1 instead of n.
[Figure 9.4: Normal Likelihood Function. Surface plot of the normalized likelihood against µ and σ.]
An example can illustrate these results. Imagine that we have drawn a sample consisting of the following 3 independent observations: 0, 3, and 6. In this case, µ̂ = 3, σ̂² = 6, and σ̂ = 2.45. It should be the case that the log-likelihood function reaches its apex for these values of the parameters. The likelihood function in this case is

L = (2πσ²)^{−1.5} exp[ −(1/(2σ²)) (3µ² − 18µ + 45) ]

We can plot L for different values of µ and σ, as is done in the surface plot in Figure 9.4 and the contour plot in Figure 9.5.2 The maximum in these figures is at µ = 3 and σ = 2.45, illustrating that these are the ML estimates.
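As a numerical check on this example, the sketch below maximizes the normal log-likelihood for the observations 0, 3, and 6 directly with a general-purpose optimizer rather than the closed-form solutions. The variance is parameterized on the log scale so that the optimizer cannot wander into inadmissible (negative) values.

    import numpy as np
    from scipy.optimize import minimize

    y = np.array([0.0, 3.0, 6.0])
    n = len(y)

    def negloglik(params):
        mu, log_s2 = params
        s2 = np.exp(log_s2)              # ensures sigma^2 > 0
        return (0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(s2)
                + np.sum((y - mu) ** 2) / (2 * s2))

    res = minimize(negloglik, x0=[0.0, 0.0], method="Nelder-Mead")
    mu_hat, s2_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, s2_hat)                # approximately 3 and 6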
2 In these figures L has been re-scaled so that its apex is 1. We call such a likelihood the normalized likelihood.

[Figure 9.5: Normal Likelihood Contour. Contour plot of the normalized likelihood, with µ on the horizontal axis and σ on the vertical axis.]

If you paid close attention to Example 9.7, you would have noticed that one important step was omitted, namely the derivation of the second partial derivative of ℓ. This step is actually a bit complicated in a multi-parameter estimation problem because we can define multiple second partial derivatives. Specifically, in Example 9.7 there are four second partial derivatives:
∂²ℓ/∂µ∂µ   = −n/σ²
∂²ℓ/∂µ∂σ²  = −Σ_i (y_i − µ)/σ⁴
∂²ℓ/∂σ²∂µ  = −Σ_i (y_i − µ)/σ⁴
∂²ℓ/∂σ²∂σ² = n/(2σ⁴) − Σ_i (y_i − µ)²/σ⁶
The first equation means that ℓ was first differentiated with respect to µ; the resulting first partial derivative was then again differentiated with respect to µ. The second equation means that ℓ was first differentiated with respect to µ; the resulting first partial derivative was then differentiated with respect to σ². The third equation means that ℓ was first differentiated with respect to σ²; the resulting first partial derivative was then differentiated with respect to µ. Finally, the fourth equation means that ℓ was first differentiated with respect to σ²; the resulting first partial derivative was then again differentiated with respect to σ².
It is conventional to group these second partial derivatives in a matrix, called the Hessian and denoted by the symbol H. For the normal PDF, the Hessian is:

H = [ ∂²ℓ/∂µ∂µ     ∂²ℓ/∂µ∂σ²  ]
    [ ∂²ℓ/∂σ²∂µ    ∂²ℓ/∂σ²∂σ² ]

  = [ −n/σ²                 −Σ_i (y_i − µ)/σ⁴                ]
    [ −Σ_i (y_i − µ)/σ⁴      n/(2σ⁴) − Σ_i (y_i − µ)²/σ⁶     ]
In general, the Hessian is defined in the following manner.

Definition: The Hessian is the p × p matrix of second partial derivatives of the log-likelihood function with respect to the parameters:

H = ∂²ℓ/∂θ∂θ'

  = [ ∂²ℓ/∂θ_1²       ∂²ℓ/∂θ_1∂θ_2    ···   ∂²ℓ/∂θ_1∂θ_p ]
    [ ∂²ℓ/∂θ_2∂θ_1    ∂²ℓ/∂θ_2²       ···   ∂²ℓ/∂θ_2∂θ_p ]
    [     ...             ...         ···        ...      ]
    [ ∂²ℓ/∂θ_p∂θ_1    ∂²ℓ/∂θ_p∂θ_2    ···   ∂²ℓ/∂θ_p²    ]        (9.9)

This may also be written as

H = Σ_{i=1}^n ∂²ℓ_i/∂θ∂θ' = Σ_{i=1}^n H_i                          (9.10)

where H_i is the Hessian matrix evaluated for a particular observation in the sample.
Just as the second partial derivatives of the log-likelihood function can be assembled into a matrix, the first partial derivatives can be assembled into a vector, which is known as the gradient and denoted by ∇ (or sometimes g). For the normal PDF of Example 9.7, the gradient is given by:

∇ = [ ∂ℓ/∂µ  ]   =   [ (1/σ²) Σ_i (y_i − µ)                   ]
    [ ∂ℓ/∂σ² ]       [ −n/(2σ²) + (1/(2σ⁴)) Σ_i (y_i − µ)²    ]

In general, the gradient is defined in the following manner.

Definition: The gradient is the p-vector of first derivatives of the log-likelihood function with respect to the parameters:

∇ = ∂ℓ/∂θ = [ ∂ℓ/∂θ_1   ∂ℓ/∂θ_2   ···   ∂ℓ/∂θ_p ]'                 (9.11)

This may also be written as

∇ = Σ_{i=1}^n ∂ℓ_i/∂θ = Σ_{i=1}^n ∇_i                              (9.12)

Here, ∇_i is known as the score, which is simply a sample unit's values on the first derivative of the log-likelihood function. The quantity Σ_i ∇_i is known as the score function (or score vector), which is just another name for the gradient.
With the definitions of gradient and Hessian in place, it is now possible
to define the first and second order conditions for ML estimators.
Definition: The first-order condition for finding an ML estimator is
∇ = 0                                               (9.13)
This is the likelihood equation.
Definition: The second-order condition for finding an ML
estimator is that H is negative definite.3
3 A matrix is negative definite if all of its eigenvalues are negative. Alternatively, H is negative definite if x'Hx < 0 for any conformable column vector x.
The combination of the first and second-order conditions allows one to identify ML estimators.
Example 9.8. For the data and estimates presented in Example 9.7, the Hessian is

H = [ −.500    .000 ]
    [  .000   −.042 ]

It can be easily ascertained that this matrix has only negative eigenvalues (namely −.5 and −.042), which means that H is negative definite and that the estimates reported earlier are ML estimates.
More generally, it is easily demonstrated that the Hessian for the normal log-likelihood is negative definite. Substituting the estimators into the formula derived earlier, we have

H = [ −n/σ̂²                 −Σ_i (y_i − µ̂)/σ̂⁴               ]
    [ −Σ_i (y_i − µ̂)/σ̂⁴      n/(2σ̂⁴) − Σ_i (y_i − µ̂)²/σ̂⁶    ]

In this expression, Σ_i (y_i − µ̂) = 0,4 so that the off-diagonal elements are 0. We also know that Σ_i (y_i − µ̂)² = nσ̂². Hence,

n/(2σ̂⁴) − Σ_i (y_i − µ̂)²/σ̂⁶ = n/(2σ̂⁴) − nσ̂²/σ̂⁶ = n/(2σ̂⁴) − n/σ̂⁴ = −n/(2σ̂⁴)

The Hessian may thus be written as

H = [ −n/σ̂²      0         ]
    [  0        −n/(2σ̂⁴)   ]

For a matrix where the off-diagonal elements are zero, the eigenvalues are simply the diagonal values. Both of the diagonal elements in the Hessian are negative, which means that all of the eigenvalues are negative and that H is negative definite.
As in the univariate case, curvature of the (log-) likelihood function is
given by −H. This is known as the observed Fisher information (matrix).
Definition: The observed Fisher information matrix is given by

I_o = −H                                            (9.14)
We prefer curvature to be as large as possible in order to obtain maximum
precision in our estimates.
4 Proof: Σ_i (y_i − µ̂) = Σ_i (y_i − ȳ) = Σ_i y_i − Σ_i ȳ = nȳ − nȳ = 0.
9.3.2 Standard Errors
In the multi-parameter case, standard errors of ML estimators are calculated by inverting the expected Fisher information matrix and by taking
the square root of the diagonal elements of the resulting matrix. Let us
begin by defining the expected Fisher information (matrix).
Definition: The expected Fisher information matrix is given by negative one times the expectation of the Hessian:

I_e = −E[H]                                         (9.15)
We now illustrate computation of the expected Fisher information for the
normal PDF.
Example 9.9. For the normal PDF, we have already seen that

H = [ −n/σ²                −Σ_i (y_i − µ)/σ⁴               ]
    [ −Σ_i (y_i − µ)/σ⁴     n/(2σ⁴) − Σ_i (y_i − µ)²/σ⁶    ]

To obtain the expected Fisher information matrix we start by taking the expected values of the elements in the Hessian. It is easy to demonstrate that E[−n/σ²] = −n/σ² and that E[−Σ_i (y_i − µ)/σ⁴] = 0.5 Obtaining the expectation of the last element of the Hessian is only slightly more complex:

E[ n/(2σ⁴) − Σ_i (y_i − µ)²/σ⁶ ] = E[ n/(2σ⁴) ] − E[ Σ_i (y_i − µ)²/σ⁶ ]
                                 = n/(2σ⁴) − (1/σ⁶) Σ_i E[(y_i − µ)²]
                                 = n/(2σ⁴) − (1/σ⁶) Σ_i σ²
                                 = n/(2σ⁴) − nσ²/σ⁶
                                 = n/(2σ⁴) − n/σ⁴
                                 = −n/(2σ⁴)

Here, the third line reflects the fact that E[(y_i − µ)²] is the mathematical definition of the population variance (i.e. it is the second moment about the mean, which equals σ²). Thus, the expected Fisher information is given by

−E[H] = −[ −n/σ²     0         ]   =   [ n/σ²     0        ]
          [  0      −n/(2σ⁴)   ]       [ 0        n/(2σ⁴)  ]

5 The second result should be familiar. The proof centers on demonstrating that E[Σ_i (y_i − µ)] = 0. We start by rewriting this expression: E[Σ_i y_i] − E[Σ_i µ]. Expansion gives Σ_i E[y_i] − nµ. Since E[y_i] = µ this may also be written as Σ_i µ − nµ. Since Σ_i µ = nµ, it follows immediately that Σ_i µ − nµ = 0.
The variance-covariance matrix of the estimators (VCE) is defined as
the inverse of the expected Fisher information matrix.
Definition: The variance-covariance matrix of θ̂ is given by

V[θ̂] = I_e⁻¹                                        (9.16)
The standard errors are equal to the square roots of the diagonal elements
of this matrix.
Example 9.10. For the normal PDF, the variance-covariance matrix is given by

I_e⁻¹ = [ n/σ²     0        ]⁻¹   =   [ σ²/n     0       ]
        [ 0        n/(2σ⁴)  ]         [ 0        2σ⁴/n   ]

Thus, the standard error of µ̂ is equal to σ/√n, while the standard error of σ̂² is equal to (√2 σ²)/√n.
Notice that the formulas for the standard errors of µ̂ and σ̂ 2 contain
the unknown parameter σ. In practice, then, we obtain estimated standard errors by substituting estimates for the parameters that appear in
the formulas of the standard errors.
9.3.3 MLE of the Linear Regression Model
The linear regression model is one of the most important models encountered
in political analysis. As such, it is worthwhile dwelling on the estimation of
this model via MLE. Consider a continuous response variable yi that is a
linear function of a set of fixed (i.e. non-stochastic) covariates (including a
constant), x_i, and a stochastic error term, ε_i:

y_i = x_i'β + ε_i

Under the classical normal linear regression model (CNLRM), the errors are normally distributed with a mean of zero and constant variance:

ε_i ∼ N(0, σ²)

Moreover, E[ε_i ε_j] = 0 for i ≠ j, i.e. there is no autocorrelation. The parameter vector consists of the regression coefficients in β, as well as σ²: θ' = (β' σ²). Thus, θ' consists of p = K + 2 parameters: K partial regression coefficients, β_1 ··· β_K; a constant, β_0; and an error variance, σ².

Based on the model and the assumptions about the errors, the conditional distribution of the response variable is given by

f(y_i | x_i, β, σ²) ∼ N(x_i'β, σ²)
If we draw n independent observations from this distribution, then the likelihood function is given by6

L = (2πσ²)^{−.5n} exp[ −(1/(2σ²)) Σ_i (y_i − x_i'β)² ]

The log-likelihood function is given by

ℓ = −.5n ln(2π) − .5n ln(σ²) − (1/(2σ²)) Σ_i (y_i − x_i'β)²
  = −.5n ln(2π) − .5n ln(σ²) − (1/(2σ²)) (y − Xβ)'(y − Xβ)
  = −.5n ln(2π) − .5n ln(σ²) − (1/(2σ²)) [ y'y − 2β'X'y + β'X'Xβ ]

Here y is a n × 1 vector stacking all of the observations on the response variable and X is a n × (K + 1) matrix stacking all of the observations on the covariates.

Estimation of the constant and partial regression coefficients requires that we take the first partial derivative with respect to β:

∂ℓ/∂β = −(1/(2σ²)) [ −2X'y + 2X'Xβ ]

Setting this derivative to zero yields:

−(1/(2σ²)) [ −2X'y + 2X'Xβ ] = 0   ⟺
−2X'y + 2X'Xβ = 0                  ⟺
X'Xβ = X'y

6 To obtain this result, simply substitute x_i'β for µ in the likelihood function derived in Example 9.7.
This expression is known as the normal equations. In the absence of perfect multicollinearity, X'X can be inverted and the ML estimator for β can be found by pre-multiplying both sides of the normal equations by this inverse:

(X'X)⁻¹ X'Xβ = (X'X)⁻¹ X'y   ⟺
β̂ = (X'X)⁻¹ X'y

This is the same estimator that one would get using ordinary least squares (OLS).
Estimation of the error variance requires that we take the first partial derivative of the log-likelihood function with respect to σ²:

∂ℓ/∂σ² = −n/(2σ²) + (1/(2σ⁴)) (y − Xβ)'(y − Xβ)

Setting this derivative to zero yields:

−n/(2σ²) + (1/(2σ⁴)) (y − Xβ)'(y − Xβ) = 0            ⟺
(1/(2σ²)) [ −n + (1/σ²) (y − Xβ)'(y − Xβ) ] = 0        ⟺
(1/σ²) (y − Xβ)'(y − Xβ) = n                           ⟺
σ̂² = (y − Xβ̂)'(y − Xβ̂) / n

Recognizing that e = y − Xβ̂ is the residual, this estimator can be written more compactly as

σ̂² = e'e / n

This is not the conventional estimator of the error variance, which divides e'e by n − K − 1 instead of n.7

Based on the computations above, the gradient for the classical linear regression model is given by

∇ = [ −(1/(2σ²)) { −2X'y + 2X'Xβ }                       ]
    [ −(1/(2σ²)) { n − (1/σ²) (y − Xβ)'(y − Xβ) }        ]

7 The ML estimator of σ² is biased in small samples, although the bias disappears when K is finite and n goes to infinity. The division by n − K − 1 eliminates the small-sample bias of the ML estimator.
The Hessian requires that we compute the second partial derivatives of the log-likelihood function. Computing these produces

H = [ ∂²ℓ/∂β∂β'      ∂²ℓ/∂β∂σ²   ]
    [ ∂²ℓ/∂σ²∂β'     ∂²ℓ/∂σ²∂σ²  ]

  = [ −(1/σ²) X'X                  −(1/σ⁴) { X'y − X'Xβ }                         ]
    [ −(1/σ⁴) { y'X − β'X'X }      −(1/σ⁴) { −.5n + (1/σ²) (y − Xβ)'(y − Xβ) }    ]

This expression can be simplified if we recognize that y = Xβ + ε or, equivalently, ε = y − Xβ. Using this information, X'y − X'Xβ = X'Xβ + X'ε − X'Xβ = X'ε and (y − Xβ)'(y − Xβ) = ε'ε. Thus, the Hessian may also be written as:

H = [ −(1/σ²) X'X      −(1/σ⁴) X'ε                        ]
    [ −(1/σ⁴) ε'X      −(1/σ⁴) { −.5n + (1/σ²) ε'ε }      ]

This means that the observed Fisher information is

I_o = [ (1/σ²) X'X      (1/σ⁴) X'ε                        ]
      [ (1/σ⁴) ε'X      (1/σ⁴) { −.5n + (1/σ²) ε'ε }      ]

We obtain the variance-covariance matrix of the estimators by inverting the expected Fisher information. Under standard regression assumptions, E[X'ε] = 0, E[ε'X] = 0', and E[ε'ε] = nσ².8 Thus, the expected Fisher information is

I_e = [ (1/σ²) X'X      0                                 ]
      [ 0'              (1/σ⁴) { −.5n + (1/σ²) nσ² }      ]

    = [ (1/σ²) X'X      0         ]
      [ 0'              n/(2σ⁴)   ]

Inverting this matrix yields

V = [ σ² (X'X)⁻¹       0         ]
    [ 0'               2σ⁴/n     ]
Here σ²(X'X)⁻¹ is the variance-covariance matrix of β̂ and (2σ⁴)/n is the variance of σ̂². In practice, we compute estimated variances and covariances by substituting σ̂² in the formula for V.
8 Notice that ε'ε = ε_1² + ε_2² + ··· + ε_n² = Σ_i ε_i². Taking expectations we have E[ε'ε] = Σ_i E[ε_i²]. Under homoskedasticity, E[ε_i²] = σ_i² = σ². Hence, Σ_i E[ε_i²] = Σ_i σ² = nσ².
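The sketch below illustrates these formulas on simulated data: it computes β̂ = (X'X)⁻¹X'y, the ML estimate σ̂² = e'e/n, and the estimated variance-covariance matrix σ̂²(X'X)⁻¹ from which the standard errors of the coefficients are taken.

    import numpy as np

    rng = np.random.default_rng(42)

    # Simulated data: n observations, a constant and two covariates
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    beta_true = np.array([1.0, 2.0, -1.5])
    y = X @ beta_true + rng.normal(scale=1.0, size=n)

    # ML (= OLS) estimator of beta
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # ML estimator of the error variance (divides by n, not n - K - 1)
    e = y - X @ beta_hat
    sigma2_hat = (e @ e) / n

    # Estimated variance-covariance matrix of beta-hat and standard errors
    vce = sigma2_hat * XtX_inv
    se = np.sqrt(np.diag(vce))

    print(beta_hat)
    print(sigma2_hat)
    print(se)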
9.4 Properties of MLE
ML estimators have desirable asymptotic properties, i.e. properties that
hold as the sample size goes to infinity or becomes very large.9 These properties include the following.
1. ML estimators are invariant, meaning that functions of such estimators are themselves ML estimators.
2. ML estimators are consistent, i.e. they converge to the true parameter values in probability limit:
plim θ̂ = θ
3. ML estimators are efficient, i.e. they have the smallest possible variance as defined by the Cramér-Rao lower-bound (see Ruud 2000).
4. ML estimators are asymptotically normally distributed. That is,
the sampling distribution of ML estimators converges to a normal distribution.
9.5 A Primer on Numerical Optimization
In the previous sections, the likelihood equations were solved through algebraic methods. However, this method can be used only for gradients that
are linear in the parameters. Most of the time, the gradients are not linear
in the parameters, as is the case for all of the models considered in these
notes. In this case, closed-form solutions to the likelihood equations cannot
be found and we need to rely on numerical optimization to maximize the
log-likelihood function. Numerical optimization is the process by which a
computer algorithm continuously refines an initial numerical guess of the parameter(s), until the maximum of the objective function (() has been found
(within tolerable error). In the context of MLE, the most commonly used
optimization methods are the so-called hill-climbing algorithms.
9.5.1 Hill-Climbing Algorithms
To get a better sense of the logic of numerical methods, let us start by
considering the problem of estimating a single parameter, θ. We could start
9 Several other, rather technical conditions—so-called regularity conditions—have to be satisfied as well.
by making an initial guess, θ0 . We call this the starting value or the
initial value. More likely than not, this value does not maximize the log-likelihood function. Thus, we need to make an adjustment, ξ0, so that we
obtain a new estimate, θ1 = θ0 + ξ0 . We keep repeating this process until we
have maximized the log-likelihood function (within tolerable error) at which
point we stop. The production of each new estimate is called an iteration
of the algorithm. For the tth iteration we have
θt+1 = θt + ξt
where θt+1 is the parameter value that will be used in the next iteration. The
situation for a multi-parameter estimation problem is much the same, except
that we now have a vector of estimates and corresponding adjustments:
θ t+1 = θ t + ξ t
(9.17)
The trick, of course, is how to define the adjustment. Hill-climbing
algorithms define adjustments in terms of the gradient, i.e. ξ = f (∇). It
makes a great deal of sense to consider the gradient while updating estimates.
As the name implies, the gradient indicates something about the direction
in which the log-likelihood function is changing (like the grade of a road). If
the gradient is positive, this means that the log-likelihood function increases
with values of the parameter. Put differently, the slope is positive, so an
increase in the parameter causes an increase in the log-likelihood function.
Since our goal is to maximize this function, a logical adjustment would be to
increase the estimate—we would be walking/stepping up the log-likelihood
function, so to speak. We would continue to do so until the gradient would
turn negative. After all, a negative gradient means that the log-likelihood
function decreases with values of the parameter. The slope is negative,
which means that an increase in the parameter causes the log-likelihood
to decrease. We want to avoid this and therefore we would decrease our
estimate. Thus, hill-climbing algorithms work with variations on this simple
rule: (1) if ∇ > 0 then ξ > 0; (2) if ∇ < 0 then ξ < 0 and (3) if ∇ = 0
then ξ = 0. The algorithm is illustrated in Figure 9.6.
We obtain greater insight into the different variants of the hill-climbing
algorithm by dissecting the adjustment factor ξ. In general, this factor has
three components: (1) a step size, λ, (2) a direction matrix, D, and (3)
the gradient, ∇. Thus, the generic hill-climbing algorithm can be stated as
follows.
[Figure 9.6: The Operation of Hill-Climbing Algorithms. The log-likelihood plotted against the parameter, with the adjustment moving the estimate up the slope toward the maximum.]

Definition: In the generic hill-climbing algorithm parameter estimates are updated via the following rule

θ_t+1 = θ_t + λ_t D_t ∇_t                           (9.18)
Before embarking on a detailed discussion of the different variants of the
algorithm, it is important to dwell a little on the different components of
the adjustment factor, ξ. As we have seen, the gradient, ∇, influences in
what direction the parameter estimates should be updated. Should they be
increased or decreased? The direction matrix, D, controls by how much
the parameter estimates should change. Should they be adjusted by just a
little or by a lot? The product d = D∇ is known as the direction vector
and plays a major role in the iterative process.
This leaves us with the step size, λ. This parameter plays a dual role
in the iterative process. First and most importantly, it ensures that the
log-likelihood function increases at successive iterations. Second, it helps to
direct the iterative process without having to recompute the direction vector. Computation of the direction vector requires computer cycles and helps
to slow down iterations. This is so because the direction vector requires, at
a minimum, computation of the first derivatives at the current estimate of
θ. By contrast, manipulating the step size involves a simple scalar multiplication, which does not require much in the way of computational resources.
Manipulation of the step size typically adheres to the following pattern:
1. Start by setting λ = 1.
2. If ℓ(θ_t+1) > ℓ(θ_t), then increase the step size to λ = 2.

3. Continue to increase the step size by one unit as long as ℓ(θ_t+1) > ℓ(θ_t).

4. If ℓ(θ_t+1) ≤ ℓ(θ_t), then back up and decrease the step size to λ = .5, λ = .25, or an even smaller value.

Table 9.1: Different MLE Algorithms

Algorithm            Direction Matrix
Steepest Ascent      Identity
Newton-Raphson       (−H)⁻¹ = I_o⁻¹
Fisher Scoring       I_e⁻¹
BHHH                 (Σ_i ∇_i ∇_i')⁻¹
DFP + BFGS           Arc Hessian
By manipulating the step size in this way, it may not be necessary to recompute the direction vector all that often, which will significantly reduce
computational costs. Note that a step size parameter is not always included
in the algorithm.
We obtain different hill-climbing algorithm variants by changing the way
in which the direction matrix is specified. Table 9.1 displays the main specifications of this matrix and names the resulting algorithms (see Thisted
1988).
Method of Steepest Ascent This is one of the oldest available algorithms. It sets the direction matrix equal to an identity matrix, which
means that both the direction and the size of the adjustments are driven in
large part by the gradient. This means that the size of the adjustments is
not reduced as the algorithm approaches the maximum, a property that can
cause the algorithm to overshoot its target and converge poorly. Because of
this problem, the method of steepest ascent is hardly ever used any more,
except as a backup to Newton-Raphson when that algorithm encounters flat
regions.
Newton-Raphson This algorithm uses the inverse of the observed Fisher
information matrix to control the size of the adjustments. As a result, the
adjustments become smaller as one comes closer to the maximum, causing
the algorithm to converge much better than the method of steepest ascent.
This better convergence makes the algorithm a popular choice; for example, Stata estimates many of the models discussed in these notes using the
Newton-Raphson algorithm. A downside of the algorithm is that the observed Fisher information may not be invertible in relatively flat regions of
the log-likelihood function. Moreover, the Fisher information matrix may
not be positive definite (or the Hessian not negative definite) in which case
the log-likelihood may fail to increase in successive iterations.
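To fix ideas, here is a minimal sketch of Newton-Raphson applied to a binary logit log-likelihood (the model of Chapter 2), using the relative convergence criterion discussed below as the stopping rule. The data are simulated and the step size is held at one throughout, so this illustrates the updating logic rather than production code.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated logit data: constant plus one covariate
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([-0.5, 1.0])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    def loglik(b):
        return np.sum(y * (X @ b) - np.log(1 + np.exp(X @ b)))

    beta = np.zeros(2)                          # starting values
    ll_old = loglik(beta)
    for t in range(25):
        p = 1 / (1 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)                # score vector
        hessian = -(X * (p * (1 - p))[:, None]).T @ X
        beta = beta - np.linalg.solve(hessian, gradient)   # update by the inverse observed information
        ll_new = loglik(beta)
        if abs((ll_new - ll_old) / ll_old) < 1e-7:          # relative convergence criterion
            break
        ll_old = ll_new

    print(beta)                                 # close to the true values (-0.5, 1.0)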
Fisher Scoring This algorithm is also known as the method of scoring
and, in certain contexts, as iteratively re-weighted least squares (IRLS). It
uses the inverse of the expected Fisher information matrix as the direction
matrix. This has two main advantages: (1) the expected Fisher information
is frequently simpler than the observed Fisher information; (2) since the
inverse of the expected Fisher information is the variance-covariance matrix
of the estimates, it is guaranteed to be positive definite. This algorithm has
found its way into a number of applications, although not so much the ones
discussed here.
BHHH Algorithm This algorithm is named after Berndt, Hall, Hall,
and Hausman and is one of the so-called quasi-Newton algorithms. Quasi-Newton algorithms find an approximation of the Hessian that is simpler
to compute and guaranteed to produce a positive-definite direction matrix.
The BHHH algorithm does this by computing a so-called outer-product
of gradients, which can be defined as follows:
B = Σ_{i=1}^n ∇_i ∇_i'
where ∇i is known as sample unit i’s score. The inverse of the matrix B
serves as the direction matrix. Asymptotically, this matrix converges to the
observed Fisher information. Apart from the fact that B is always positive
definite, a chief advantage of the BHHH algorithm is that it requires only
first-derivative information.
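As a small illustration of the outer product of gradients, the sketch below computes B for the Poisson example of Section 9.2, where the score of observation i with respect to µ is y_i/µ − 1. In this tiny sample the resulting variance estimate 1/B need not agree closely with µ̂/n, which anticipates the point about small-sample differences made in the next subsection.

    import numpy as np

    y = np.array([0, 0, 1, 1, 2])
    mu_hat = y.mean()                       # ML estimate, 0.8

    scores = y / mu_hat - 1                 # per-observation scores at the ML estimate
    B = np.sum(scores ** 2)                 # outer product of gradients (a scalar here)

    print(1 / B)                            # BHHH variance estimate of mu-hat
    print(mu_hat / len(y))                  # expected-information variance estimate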
DFP and BFGS Algorithms These are two other quasi-Newton algorithms. They are based on arc-Hessians that are formed by computing the
change in the slope from one iteration to the next. Arc-Hessians are generally more precise than Hessians that are based on the information of one
iteration only. The DFP algorithm is named after Davidon, Fletcher, and
Powell; the BFGS after Broyden, Fletcher, Goldfarb, and Shanno. The
latter algorithm improves on the former and is widely used in computer
software because it is fast and has excellent convergence.
Table 9.2: Numerical Estimates of the VCE

Algorithm            V
Fisher Scoring       V1 = I_e⁻¹
Newton-Raphson       V2 = I_o⁻¹
BHHH                 V3 = (Σ_{i=1}^n ∇_it ∇_it')⁻¹
DFP & BFGS           V4 = f(Arc Hessian)

9.5.2 Computation of Standard Errors
Just like the parameter estimates are derived through numerical methods
for all but the simplest estimation problems, so are the standard errors.
Depending on the algorithm, the approach to computing the standard errors
will take on a specific form, as is shown in Table 9.2.
First, if parameter estimates are computed using the Fisher scoring algorithm, then it makes sense to base the standard errors on the inverse of the
expected Fisher information matrix since the algorithm computes it anyway. This produces V1 as the numerical estimate of the variance-covariance
matrix. Second, if one uses the Newton-Raphson algorithm for computing
parameter estimates, then it makes sense to compute the standard errors
from the observed Fisher information matrix since this also is the direction
matrix that the algorithm uses. This suggests V2 as the numerical estimate
of the variance-covariance matrix. Third, the BHHH algorithm suggest using V3 as the numerical estimate of the variance covariance matrix, since
the inverse of the outer-product of gradients is anyway required to compute
the parameter estimates. Finally, in the DFP and BFGS algorithms, the
variance-covariance matrix is based on a function of the arc-Hessian that is
computed to implement these algorithms.
The four approaches all produce consistent estimates of the variance-covariance matrix of the estimators. Thus, they are asymptotically equivalent. In small samples, however, these procedures may yield differences in
the computed standard errors.
9.5.3 Algorithmic Convergence
How Algorithms Decide When to Stop
Earlier, it was said that the iterative process continues until ℓ has been maximized within tolerable error. What does this mean? A commonly used
stopping rule for algorithms is to declare convergence if the absolute relative
change in the log-likelihood function falls below a predefined tolerance, q:
| (ℓ_t+1 − ℓ_t) / ℓ_t | < q
This is known as the relative convergence criterion. When the tolerance is set sufficiently small (e.g. to 10⁻⁷), improvements between successive iterations become trivial, affecting digits that we typically do not care about. Thus one rationale for the relative convergence criterion is pragmatic: small relative changes in the value of the log-likelihood function are simply not very important from a practical standpoint (e.g. they will never be reported in a table). The theoretical rationale is that small changes in the log-likelihood function should occur around the maximum, since this is where ℓ′ → 0.
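A minimal sketch of this stopping rule, assuming ll_old and ll_new hold the log-likelihood values at two successive iterations:

```python
def converged(ll_old, ll_new, tol=1e-7):
    """Relative convergence criterion: |(l_{t+1} - l_t) / l_t| < q."""
    return abs((ll_new - ll_old) / ll_old) < tol
```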
Convergence Problems
Convergence problems refer to any instance where an algorithm fails to
produce estimates or produces estimates that cannot be trusted. There are
several warning signs of such problems. Sometimes, programs will actually
tell you that they have failed to converge within the allotted number of
iterations. Sometimes convergence occurs only very slowly, taking many
more iterations than the estimation problem should take. At other times,
the warnings are more cryptic. For example, Stata issues non-concavity
and backup warnings. Non-concavity means that the log-likelihood is flat,
which causes the Hessian to be non-invertible. When such a problem is
signaled on the last iteration, it usually means that the estimates cannot
be trusted. Backup means that the algorithm has had to decrease the step
size, something that happens if the new log-likelihood is smaller than the
previous one. A multitude of backup messages signals that the algorithm is
having a difficult time optimizing the log-likelihood.
Convergence problems can arise for any number of reasons. However, statisticians have identified several causes that frequently produce convergence issues. It is useful to describe these causes in some detail so that you can be aware of them.
A first cause of convergence problems may be poor starting values. A
poor choice of starting values can cause problems with convergence and,
as we have just seen, local maxima. Most statistical packages will choose
reasonable starting values for their built-in MLE routines, but sometimes
starting values are far off the mark. If the log-likelihood function contains both local and global maxima, then which of these extrema is found may depend heavily on the starting values.
A small sample size is a second common cause of convergence problems. Convergence is generally faster the larger the sample size is relative to the number of estimated parameters. If the sample size is too small, the log-likelihood function may become relatively flat, and this could cause non-concavity. Of course, using MLE in small samples is precarious anyway, because all of the desirable properties manifest themselves only asymptotically.
Third, complex likelihood functions can create convergence problems.
Some likelihood functions are so complex that they challenge even the most
powerful of computers, leading to slow convergence or frequent instances
of non-convergence. This is particularly true of likelihood functions that
involve multiple integrals such as those associated with the multivariate
normal distribution. Discrete choice models are frequently characterized by
complex likelihoods and special procedures may need to be used to produce
estimates for these models.
Fourth, estimation at the boundary of the parameter space can cause convergence problems. When the true parameter value lies at the boundary of the parameter space, algorithms may be pushed into considering inadmissible estimates, which may in turn make it impossible to compute the log-likelihood function and its derivatives. This problem arises frequently when we are estimating probabilities, which are bounded between 0 and 1, or correlations, which are bounded between -1 and 1. If the true probability is close to 0 or 1, or the true correlation is close to -1 or 1, then the algorithm may end up exploring estimates on the wrong side of these bounds. This may cause the algorithm to halt completely or to produce nonsensical results such as negative variance estimates. The solution in these cases is to find an appropriate transformation of the parameter that is itself not bounded. One can then rely on the invariance principle, which ensures that the back-transform of the ML estimate of the unbounded parameter is itself an ML estimate.
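As a sketch of this reparameterization strategy (the Bernoulli example and the logit transform are my own illustration, not the notes'): estimate an unbounded parameter γ = logit(π), then recover π̂ afterwards by the invariance principle.

```python
import numpy as np
from scipy.optimize import minimize

y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])  # toy data with a probability near 1

def neg_loglik(gamma):
    """Bernoulli negative log-likelihood written in terms of gamma = logit(pi),
    so the optimizer never has to respect the 0-1 bounds on pi."""
    return -np.sum(y * gamma - np.log1p(np.exp(gamma)))

res = minimize(neg_loglik, x0=np.array([0.0]), method="BFGS")
pi_hat = 1 / (1 + np.exp(-res.x[0]))  # invariance: the MLE of pi is the back-transform
```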
Fifth, the scaling of variables may cause convergence problems. The
larger the ratio between the largest and smallest standard deviations, the
more likely convergence problems are. Long (1997) suggests that this ratio
should not exceed 10. The solution in this case is as simple as changing the
units on a variable (e.g. income in $1000 units instead of $1 units).
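A quick sketch of Long's rule of thumb and the rescaling fix (the column names and the threshold check are illustrative assumptions):

```python
import numpy as np

def sd_ratio(X):
    """Ratio of the largest to the smallest column standard deviation."""
    sds = X.std(axis=0)
    return sds.max() / sds.min()

# if sd_ratio(X) > 10, rescale the offending variable before estimating, e.g.
# X[:, income_col] = X[:, income_col] / 1000   # income in $1000s instead of $1s
```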
Sixth, convergence problems can arise when the model is inappropriate
for the data. This can happen when the wrong PDF is specified or, more
generally, when the model makes assumptions that do not match the data
generation process. In this case, the algorithm will try hard to find the best
estimates, but this may simply not be possible. This is one reason why
considerable thought should go into model specification, i.e. the selection of predictors as well as distributional assumptions.
Finally, data cleaning problems may be a cause of convergence difficulties. If the data contain errors (e.g. negative values where positive ones are
expected), then convergence can become a problem. As always, you should inspect your data through descriptive statistics and graphical displays before turning to complex statistical techniques like MLE. This will also help
you to assess whether the distributional assumptions that you make seem
reasonable for the data at hand.
9.6 MLE and Hypothesis Testing

9.6.1 Testing Simple Hypotheses
A simple hypothesis test is a test pertaining to a single parameter. In the
usual two-sided setup, the null hypothesis stipulates a particular value, k,
for the parameter, θ, while the alternative hypothesis stipulates some other
value:
H0 : θ = k
H1 : θ ≠ k
A common variant of this setup is to set k = 0; this is how we would test if
regression coefficients are statistically distinct from zero.
We test simple hypotheses by considering the difference between the
parameter estimate, θ̂, and the hypothesized value relative to the sampling
fluctuation. Thus, the test statistic is

$$T = \frac{\hat{\theta} - k}{\text{s.e.}_{\hat{\theta}}} \qquad (9.19)$$
When θ̂ is an ML estimator, we know that its asymptotic distribution is normal. Specifically, under the null hypothesis θ̂ ∼ N(k, σ²_θ̂). It follows immediately that T ∼ N(0, 1), i.e. the test statistic follows the standard normal distribution. We can use this property to compute a p-value, which will then allow us to decide whether the null hypothesis should be rejected.
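A minimal sketch of this test, assuming theta_hat and se are the ML estimate and its standard error obtained from some fit:

```python
from scipy.stats import norm

def z_test(theta_hat, se, k=0.0):
    """Two-sided test of H0: theta = k, using the asymptotic normality of the MLE."""
    z = (theta_hat - k) / se
    p_value = 2 * norm.sf(abs(z))   # two-sided p-value
    return z, p_value
```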
9.6.2 Testing Joint Hypotheses
When we speak of a joint hypothesis test we mean that several parameters
are tested simultaneously. A common variant of such a test is when we set a
subset of the parameters equal to zero. For instance, if θ′ = (θ1, θ2, θ3, θ4), then we might wish to test the hypothesis:
H0 : θ1 = θ2 = 0
Another common variant of joint hypothesis testing involves linear restrictions. Here a parameter is modeled as a linear function of a subset of the
other parameters. For instance,
H0 : θ1 = θ2
is an equality constraint, which may also be formulated as H0 : θ1 − θ2 = 0.
In the context of MLE, there are two main testing procedures for joint
hypotheses: (1) the likelihood ratio test and (2) the Wald test.10 The differences between these tests concern the models that need to be estimated
in order to perform the test. A key distinction here is between restricted
and unrestricted models. The restricted model incorporates the hypothesized values of the parameters under H0 , while the unrestricted model does
not and leaves these parameters free for estimation. We can also say that
the restricted model is nested inside the unrestricted model, in that the restricted model is a structurally simpler form of the unrestricted model. To
carry out a likelihood ratio test, one has to estimate both the restricted and
unrestricted models. To carry out a Wald test, however, one needs only to
estimate the unrestricted model.11
Likelihood Ratio Test
The LR test is based on the evaluation of both the unrestricted and restricted
models. Each of these models yields a value of the likelihood function. As
its name suggests, the likelihood ratio test considers the ratio of these two
values.12 Specifically, the likelihood ratio is defined as
$$\lambda = \frac{L_R}{L_U} \qquad (9.20)$$
where L_R and L_U are the values of the likelihood functions of the restricted and unrestricted models, respectively. The fit of the unrestricted model can be no worse than that of the restricted model: L_U ≥ L_R. Consequently, 0 < λ ≤ 1. If the restrictions imposed by H0 are correct, then θ̂_R = θ̂_U, where θ̂_U denotes the estimates of the unrestricted model and θ̂_R the estimates of the restricted model.
10. A third test procedure that is sometimes used is the Lagrange multiplier test. Since we do not need it for the models discussed in these notes, we refrain from a discussion of this test.
11. In a Lagrange multiplier test, one estimates only the restricted model.
12. The test is justified by the Neyman-Pearson lemma, named after the statisticians Jerzy Neyman and Egon Pearson. For a discussion of this lemma see e.g. Lehmann (1986).
Put simply, the ML estimates coincide with the hypothesized values of the parameters under H0. In this case, L_R = L_U and λ = 1. However, if the restrictions implied by H0 are incorrect, then θ̂_R ≠ θ̂_U. Moreover, since the restricted estimates are no longer the true ML estimates, L_R < L_U and λ will be small. Thus, smaller values of the
likelihood ratio indicate evidence against the restricted model and therefore
H0 .
In order to compute a p-value for the null hypothesis, we need to know
the sampling distribution of λ. This turns out to be quite complicated, but
the asymptotic sampling distribution of negative two times the log of the
likelihood ratio is relatively straightforward and yields the likelihood ratio
test statistic:
$$LR = -2 \ln \lambda \sim \chi^2_q \qquad (9.21)$$
where q is the number of restrictions imposed under H0 . Through some
algebra it is possible to write this in terms of the log-likelihood functions of
the restricted and unrestricted models:
$$\begin{aligned}
LR &= -2 \ln \lambda \\
   &= -2 \ln \left( \frac{L_R}{L_U} \right) \\
   &= -2 \left[ \ln(L_R) - \ln(L_U) \right] \\
   &= 2 \left( \ell_U - \ell_R \right)
\end{aligned} \qquad (9.22)$$
It should be noted that there are actually two different types of LR test.
The previous example showed an instance of the so-called LR chi-squared
test, which is also designated as G². Here the unrestricted model contains
all of the parameters that are relevant for the distribution. A second variant
is the (scaled) deviance test. In this test, a given model is compared to
a saturated model, i.e. a model that has one parameter per observation.
The saturated model now serves as the unrestricted model; with one parameter per observation, this model fits the data perfectly, so that L_U = 1 and ℓ_U = 0. Thus, the deviance is given by

$$D = 2 (0 - \ell_R) = -2 \ell_R \qquad (9.23)$$

where ℓ_R is now the log-likelihood of the model at hand. Deviances play a
role in some applications of MLE, although the LR chi-square test is more
common.
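A sketch of the LR chi-squared test, assuming ll_u and ll_r are the maximized log-likelihoods of the unrestricted and restricted models and q is the number of restrictions:

```python
from scipy.stats import chi2

def lr_test(ll_u, ll_r, q):
    """Likelihood ratio statistic LR = 2(l_U - l_R) and its chi-square(q) p-value."""
    lr_stat = 2 * (ll_u - ll_r)
    return lr_stat, chi2.sf(lr_stat, df=q)
```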
Wald Test
For the Wald test procedure, only the unrestricted model is estimated.13
The test takes into consideration two pieces of evidence: (1) the distance
between the unconstrained parameter estimates and the constrained values
of those estimates and (2) sampling variance. Together, these pieces of
information determine the fate of the null hypothesis.
To motivate the Wald test, let me start by considering the concept of a restricted model in somewhat greater detail. Consider the parameter vector θ.
A restricted or constrained model imposes restrictions on the elements of
this vector. For the sake of simplicity, let us limit ourselves to linear restrictions, since they are the most common.14 These restrictions can be captured
through the following system of equations:
$$\mathbf{R}\boldsymbol{\theta} = \mathbf{r} \qquad (9.24)$$
where r is a q × 1 vector of values (q is the number of constraints) and R
is a q × p design matrix that connects parameters (of which there are p in
total) to the hypothesized values.
Example 9.11. Consider the parameter vector θ′ = (θ1 θ2 θ3). Imagine that we want to test the hypothesis H0 : θ2 = θ3 = 0. That is, we seek to impose two constraints. In this case, we can define r as the following 2 × 1 vector

$$\mathbf{r} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Further, R is the 2 × 3 matrix

$$\mathbf{R} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

If our hypothesis is H0 : θ1 + θ2 − θ3 = k, where k is a constant, then we impose one constraint (once two parameters have been estimated, the value of the third one follows from H0). In this case,

$$\mathbf{r} = \begin{pmatrix} k \end{pmatrix} \quad \text{and} \quad \mathbf{R} = \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}$$
To evaluate the validity of the linear constraints, one piece of evidence suggests itself immediately: the distance of the ML estimates from the constrained values. If the constraints are valid, then θ̂ should satisfy them, at least approximately. However, if the constraints are invalid, then they should be far removed from θ̂.
13. The Wald test is named after the Hungarian mathematician Abraham Wald (1902-1950).
14. Greene (2003) discusses applications of the Wald test to non-linear restrictions.
To capture the difference between the constraints
and the estimates we evaluate
Rθ̂ − r
(We multiply θ̂ by R to select the estimates of those parameters that are
restricted under H0 .) If the null hypothesis is true, then we would expect
Rθ̂ − r ≈ 0
To obtain a proper test statistic, however, we would also want to include
knowledge of the sampling variance of Rθ̂ − r. By a well-known result on
linear composites, we know that
V[Rθ̂ − r] = R V[θ̂] R′
This is the VCE based on the restrictions.
The Wald test statistic is now given by

$$W = \left( \mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r} \right)' \left( \mathbf{R} V[\hat{\boldsymbol{\theta}}] \mathbf{R}' \right)^{-1} \left( \mathbf{R}\hat{\boldsymbol{\theta}} - \mathbf{r} \right) \sim \chi^2_q \qquad (9.25)$$
where it should be emphasized that the sampling distribution of W holds
only asymptotically. If the constraints are correct and Rθ̂ − r ≈ 0, then W
tends to 0, providing support for H0 . If the constraints are invalid, then W
will tend to be large and will exceed the critical value, providing a foundation
for rejecting H0 .
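A sketch of the Wald test in matrix form, reusing R and r from Example 9.11; theta_hat and its estimated variance-covariance matrix V are assumed to come from the unrestricted ML fit:

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, V, R, r):
    """Wald statistic W = (R theta - r)' (R V R')^{-1} (R theta - r), chi-square with q df."""
    diff = R @ theta_hat - r
    W = diff @ np.linalg.inv(R @ V @ R.T) @ diff
    return W, chi2.sf(W, df=len(r))

# H0: theta_2 = theta_3 = 0, as in Example 9.11
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)
# W, p = wald_test(theta_hat, V, R, r)
```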
The test statistic for testing simple hypotheses is actually a special case
of the Wald test procedure. Let H0 : θ = k, then the Wald test statistic is
given by
$$\begin{aligned}
W &= (\hat{\theta} - k) \, V[\hat{\theta}]^{-1} \, (\hat{\theta} - k)' \\
  &= \frac{(\hat{\theta} - k)^2}{V[\hat{\theta}]} \\
  &= \left( \frac{\hat{\theta} - k}{\text{se}_{\hat{\theta}}} \right)^2 \\
  &= z^2
\end{aligned}$$
The last step comes about because we recognize (θ̂ − k)/se_θ̂ as the z-transformation under H0. Under the null hypothesis, we know that W is
asymptotically distributed as χ²_1, i.e. as a χ²-variate with one degree of freedom. From mathematical statistics, we know that the square of a standard normal variate also follows a χ² distribution with one degree of freedom. Thus, taking the square root of the Wald test statistic, we obtain the test statistic that we derived earlier in our discussion of simple hypotheses.