Semiparametric Estimation of the Random Utility Model with
Rank-Ordered Choice Data∗

Jin Yan†    Hong Il Yoo‡

September, 2015
Abstract
We propose semiparametric methods for estimating the random utility model exploiting rank-ordered choice data. The term semiparametric refers to the fact that the preference parameters of interest are finite dimensional but the error term in the random utility function has an unspecified distribution. We allow for a flexible form of heteroskedasticity across individuals. The case in which random coefficients can be allowed is also discussed. We show the strong consistency of the proposed generalized maximum score (GMS) estimator. The asymptotic distribution of the GMS estimator is nonstandard. For inference purposes, we propose the smoothed GMS (SGMS) estimator. The SGMS estimator is strongly consistent and asymptotically normal, making inference straightforward. Monte Carlo experiments provide evidence that the proposed estimators outperform the rank-ordered logit model in the presence of heteroskedasticity. An illustrative empirical application using a bank account choice data set suggests that the proposed estimators are capable of finding plausible and robust solutions.
∗ We thank Xu Cheng, Liran Einav, Bruce Hansen, Han Hong, Arthur Lewbel, Taisuke Otsu, Joris Pinkse, Jack Porter and
seminar participants at the 2015 Tsinghua Econometric Conference and Academia Sinica for valuable comments and discussions.
We acknowledge funding support provided by the Hong Kong Research Grants Council General Research fund 2014/2015
(Project No.14413214) and six anonymous referee reports for the project proposal. All errors are ours.
† Department of Economics, The Chinese University of Hong Kong. Email: [email protected].
‡ Durham University Business School, Durham University. Email: [email protected].
Keywords: Rank-ordered; Random utility; Semiparametric estimation; Smoothing; Laplace; Markov
Chain Monte Carlo
JEL Classification: C14, C35
1 Introduction
This paper develops semiparametric methods for the estimation of random utility models for rank-ordered choice data. The random utility function of interest has a typical structure, comprising a deterministic component or utility index that depends on finite-dimensional preference parameters, and an additive stochastic component or error term. The methods are semiparametric in that they allow one to estimate the preference parameters by combining mild nonparametric restrictions on the stochastic component with the usual parametric specification of the deterministic component. Rank-ordered choice data are available in similar empirical contexts as multinomial choice data, wherein the economic agent faces a finite set of mutually exclusive alternatives. Multinomial choice data reveal the agent's choice or most preferred alternative from the set. Rank-ordered choice data reveal more about the agent's preference ordering over the set, such as her second and third preferences, thereby revealing what the agent's counterfactual choices would have been if her most preferred alternative were not available. Having a rank-ordered choice observation on an agent thus offers a similar advantage as having several repeated multinomial choice observations on that agent, in that it allows one to reduce sampling variation in the estimated preference parameters.
The parametric methods for rank-ordered choices are as well established as the parametric methods for multinomial choices. Every well-known parametric multinomial choice model has its rank-ordered choice counterpart, both of which permit maximum likelihood estimation. One can derive the likelihood of an observed preference ordering by assuming a parametric form of the stochastic component's distribution. Beggs et al. (1981) follow this approach to specify the rank-ordered logit (ROL) model that postulates the same distribution as McFadden's (1974) multinomial logit (MNL) model. Assuming a flexible distribution that allows for correlated errors over alternatives could make deriving the rank-ordered choice likelihood algebraically difficult. In such cases, as McFadden (1986) suggests on the basis of Falmagne (1978) and Barberá and Pattanaik (1986), one can exploit logical links between a rank-ordered choice likelihood and several multinomial choice likelihoods to express the former as a combination of the latter.1 Using this approach, Layton and Levine (2003) specify the rank-ordered probit model and Dagsvik and Liu (2009) specify the nested rank-ordered logit model. These two models postulate the same stochastic distribution as the multinomial probit model and the nested logit model, respectively.
The semiparametric methods for rank-ordered choices are much less developed than the semiparametric methods for multinomial choices. To the best of our knowledge, the study of Hausman and Ruud (1987) is both the earliest and the only precedent of developing a semiparametric estimator for rank-ordered choice data. In the multinomial choice context, the prior works of Ruud (1983, 1986) have established conditions under which a weighted M-estimator (WME) can be applied to the MNL model to estimate the ratios of slope coefficients consistently despite stochastic misspecification. Hausman and Ruud construct an extension of this WME to the rank-ordered choice context and the ROL model. Much as Ruud's multinomial choice WME, however, their rank-ordered choice WME is subject to two limitations that affect its applicability in many empirical situations: its consistency is limited to coefficients on continuous independent variables, and its asymptotic distribution is unknown outside a special case where the population moments of the independent variables are known.
A random utility model for rank-ordered choices can be represented as a multivariate system of latent dependent variables, much as that for multinomial choices involving three or more alternatives. Most of the existing semiparametric estimators for multinomial choices focus on the special case of binomial choices between two alternatives (Manski, 1975; Han, 1987; Horowitz, 1992; Klein and Spady, 1993; Lewbel, 2000) that allows one to work with a univariate latent dependent variable instead. As Thompson (1993) points out, however, the results for this special case do not easily carry over to more general cases involving three or more alternatives. Lee (1995) and Fox (2007) count among the few that develop semiparametric multinomial choice estimators for these general cases.
This paper proposes two types of semiparametric estimators for rank-ordered choices. We call them the generalized maximum score (GMS) estimator and the smoothed generalized maximum score (SGMS) estimator respectively, and prove the main asymptotic properties of each. Roughly put, each estimator allows the consistent estimation of suitably normalized coefficients, be they on continuous or discrete independent variables, so long as there is one continuous independent variable. In addition, the SGMS estimator can be shown to be asymptotically normal, though it relies on somewhat stronger assumptions than the GMS estimator. We also provide preliminary estimation results from Monte Carlo simulations and an illustrative empirical application involving bank account products. These preliminary results suggest that our estimators have promising finite sample properties, and can be very useful additions to the empirical practitioner's toolbox.
1 As a simple example, consider a random preference ordering $B \succ A \succ C$ over three alternatives $A$, $B$, $C$. Note that $\Pr(A \succ C) = [\Pr(A \succ C \succ B) + \Pr(A \succ B \succ C)] + \Pr(B \succ A \succ C)$ and the two terms in $[\cdot]$ add up to the trinomial choice probability of $A$. The functional form of $\Pr(B \succ A \succ C)$ is thus known whenever the functional forms of the binomial choice probability $\Pr(A \succ C)$ and the trinomial choice probability $\Pr(A \succ B \text{ and } A \succ C)$ are known.
The GMS estimator generalizes Manski's (1975) MS estimator for binomial choices, and more directly, Fox's (2007) pairwise MS estimator for multinomial choices. Like those two estimators, the GMS estimator is motivated by the identifying assumption that when considering a pair of alternatives, the agent's more likely choice is one that yields the higher deterministic utility. A multinomial choice observation involving $J$ alternatives allows one to learn the implied outcomes of $J - 1$ pairwise comparisons, where each pair comprises the agent's actual choice and one other alternative. A rank-ordered choice observation allows one to learn the implied outcomes of additional pairwise comparisons. GMS intuitively extends Fox's MS by incorporating that additional information. We show that this intuitive extension allows the GMS estimator to inherit the attractive interpersonal heterogeneity features of Fox's MS estimator: it can accommodate arbitrary forms of heteroskedasticity across agents, and also different error distributions across agents. The former feature makes the estimator particularly appealing since the empirical evidence from parametric studies (Hensher et al., 1999; Fiebig et al., 2010) suggests that heteroskedasticity is the most important aspect of interpersonal heterogeneity in choice behavior, and yet it is often, if not always, unclear how heteroskedasticity is to be modeled.
The SGMS estimator complements the GMS estimator by addressing the latter's two practical limitations, in return for making some additional assumptions. Specifically, the GMS estimator is $N^{1/3}$-consistent, where $N$ is the number of agents, thereby exhibiting a slow rate of convergence, and it also follows a non-standard asymptotic distribution of Kim and Pollard (1990). These limitations are shared by the MS estimators of Manski (1975) and Fox (2007), and arise because the MS criterion function involves step functions. To address these issues in the context of Manski's binomial choice estimator, Horowitz (1992) develops a smoothed maximum score (SMS) estimator that replaces the step functions with smooth kernels. Yan (2013) applies the same technique to derive a smoothed version of Fox's multinomial choice estimator. Our SGMS estimator is also based on the same approach, and offers similar benefits as its precedents: we show that the SGMS estimator's convergence rate can be made arbitrarily close to the usual $N^{-1/2}$, and also that its asymptotic distribution is normal.
A further novel aspect of our results is that the benefit of using rank-ordered choice data over multinomial choice data in the semiparametric framework is not limited to efficiency gains, whereas it is in the parametric framework. In the special case of fully rank-ordered choice data that have all alternatives in a choice set ranked from most to least preferred, one can replace the exchangeability assumption of Fox (2007, p.1006) with a much weaker zero conditional median assumption, without affecting the strong consistency of the GMS estimator. In this case, the rank-ordered choice estimator can be consistent even when the multinomial choice estimator is not, since the former is robust to a wider class of stochastic distributions. Most importantly, this expanded class includes many popular parametric models that the multinomial choice estimator rules out, such as the nested logit model, the probit model with an unrestricted covariance matrix, and the mixed logit model. Together with the fact that the relevant special case commonly holds in empirical applications (Calfee et al., 2001; Capparos et al., 2008; Scarpa et al., 2011; Yoo and Doiron, 2013), this makes the use of semiparametric methods more attractive in the rank-ordered choice data environment than in the multinomial choice data environment.
The remainder of this paper is organized as follows. Section 2 develops the GMS estimator and discusses its asymptotic properties. Section 3 develops the SGMS estimator. Section 4 presents our preliminary Monte Carlo evidence on the finite sample properties of those two estimators. Section 5 presents an illustrative empirical application using bank account choice data. Section 6 concludes.
2 The Model and the Estimation
2.1 A Random Utility Framework and Rank-Ordered Choice Data
Consider a standard random utility model. Each individual in the population of interest faces a finite collection of alternatives. Let $\mathcal{J} = \{1, \ldots, J\}$ denote the common choice set of alternatives and $J \geq 2$ be the number of alternatives contained in $\mathcal{J}$. The utility obtained by individual $n$ from choosing alternative $j$, $u_{nj}$, is assumed as follows:

$$u_{nj} = x_{nj}'\beta + \varepsilon_{nj} \quad \forall\, j \in \mathcal{J}, \tag{1}$$

where $x_{nj}$ is an observed $q$-vector containing the characteristics of individual $n$ and alternative $j$, $\beta \in \mathbb{R}^q$ is the preference parameter vector of interest, and $\varepsilon_{nj}$ is the unobserved component of utility. The utility index, $x_{nj}'\beta$, is often referred to as systematic (or deterministic) utility.
Based on the utilities associated with all the alternatives, each individual reveals her best $M$ ($1 \leq M \leq J - 1$) alternatives and ranks those $M$ alternatives. When $M = 1$, we have multinomial choice data; when $M = J - 1$, we have a complete ranking of all the alternatives. In case any utility ties occur, let $\mathcal{T}(n, j)$ be the set of alternatives with the same utility as alternative $j$. $A(\cdot, \mathcal{T}(n, j))$ maps the elements of $\mathcal{T}(n, j)$ one-to-one onto the integers $\{0, \ldots, |\mathcal{T}(n, j)| - 1\}$, where $|\mathcal{T}|$ is the number of alternatives in the set $\mathcal{T}$. For any two alternatives $k, l \in \mathcal{T}(n, j)$, $A(k, \mathcal{T}(n, j)) < A(l, \mathcal{T}(n, j))$ if and only if $k < l$.2 Let $\mathcal{M}_n \subset \mathcal{J}$ denote the set of the best $M$ alternatives of individual $n$ and $r_{nj}$ denote the ranking of alternative $j$, that is,

$$r_{nj} = \begin{cases} L(n, j) + 1 + A(j, \mathcal{T}(n, j)) & \text{if } j \in \mathcal{M}_n, \\ M + 1 & \text{if } j \in \mathcal{J} \setminus \mathcal{M}_n, \end{cases} \tag{2}$$

where $L(n, j)$ denotes the number of alternatives with strictly larger utility than alternative $j$ for individual $n$. Let the vector $r_n \equiv (r_{n1}, \ldots, r_{nJ}) \in \mathbb{N}^J$ denote the ranking of all alternatives in the choice set $\mathcal{J}$. For example, if individual $n$ ranks four alternatives as $u_{n3} > u_{n4} > \max\{u_{n1}, u_{n2}\}$, then $r_n \equiv (3, 3, 1, 2)$. Denote $X_n \equiv (x_{n1}, \ldots, x_{nJ})' \in \mathbb{R}^{J \times q}$ and $\varepsilon_n \equiv (\varepsilon_{n1}, \ldots, \varepsilon_{nJ})' \in \mathbb{R}^J$. Although utilities are not observable, for each individual we can observe the ranking and the explanatory vectors for all the alternatives, $(r_n, X_n)$.
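The ranking rule in (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_vector` and the use of NumPy are not part of the paper, and utilities are passed in directly even though they are unobserved in practice. Ties are broken in favour of the lower alternative index, which is how the function $A$ is described above.

```python
import numpy as np

def rank_vector(u, M):
    """Ranking vector of equation (2) from utilities u when the best M alternatives are ranked.

    `rank_vector` is a hypothetical helper name used for illustration only.
    """
    u = np.asarray(u, dtype=float)
    J = u.size
    r = np.empty(J, dtype=int)
    for j in range(J):
        L = int(np.sum(u > u[j]))                           # alternatives strictly better than j
        A = int(np.sum((u == u[j]) & (np.arange(J) < j)))   # tie-breaker: lower index ranks first
        r[j] = L + 1 + A
    r[r > M] = M + 1        # alternatives outside the best M all receive rank M + 1
    return r

# The four-alternative example from the text: u3 > u4 > max{u1, u2} with M = 2.
print(rank_vector([0.2, -1.0, 3.1, 1.4], M=2))   # [3 3 1 2]
```

With `M = 3` (a complete ranking of four alternatives) the same utilities give the ranking (3, 4, 1, 2), since the last alternative is then also distinguished.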
Next, we make the sampling assumption.
Assumption 1. $\{(r_n, X_n) : n = 1, \ldots, N\}$ is a random sample of $(r, X)$, where $r \equiv (r_1, \ldots, r_J)' \in \mathbb{N}^J$, $X \equiv (x_1, \ldots, x_J)' \in \mathbb{R}^{J \times q}$, and $r_j$ is the utility ranking of alternative $j$ defined in (2).
Assumption 1 states that $(r_n, X_n)$ are independently and identically distributed (i.i.d.). For this reason, we sometimes drop subscript $n$ for simplicity in the following analysis.
2 The function $A$ has the effect of breaking utility ties and is introduced purely for technical convenience.
2.2 Relation to Literature
Identification of $\beta$ depends on the restrictions placed on the conditional probability of observing an arbitrary ranking $r$ given the explanatory vectors $X$. Parametric methods assume particular functional forms for this conditional probability, among which the most popular models are the rank-ordered logit (ROL) model and the rank-ordered probit (ROP) model. Both models assume that the error terms in $\varepsilon$ are independent of the explanatory vectors $X$, ruling out heteroskedasticity across individuals, i.e., the possibility that the conditional distribution of $\varepsilon$ varies across individuals.
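The ROP model assumes that the error terms $\varepsilon$ are jointly normal and allows correlation in error terms across alternatives. The computation cost of the ROP model increases rapidly as the number of alternatives rises due to the calculation of multivariate integrals and the estimation of the covariance matrix of the error terms. For example, an individual ranks four alternatives as $r = (3, 3, 1, 2)$ given $X = (x_1, \ldots, x_4)'$. The ROP probability is

$$\begin{aligned}
P_{ROP}(r \mid X) &= P(u_3 > u_4 > \max\{u_1, u_2\}) \\
&= P(u_3 > u_4 > u_1 > u_2) + P(u_3 > u_4 > u_2 > u_1) \\
&= \int_{-\infty}^{\infty} d\varepsilon_2 \int_{x_2'\beta - x_1'\beta + \varepsilon_2}^{\infty} d\varepsilon_1 \int_{x_1'\beta - x_4'\beta + \varepsilon_1}^{\infty} d\varepsilon_4 \int_{x_4'\beta - x_3'\beta + \varepsilon_4}^{\infty} \phi(\varepsilon)\, d\varepsilon_3 \\
&\quad + \int_{-\infty}^{\infty} d\varepsilon_1 \int_{x_1'\beta - x_2'\beta + \varepsilon_1}^{\infty} d\varepsilon_2 \int_{x_2'\beta - x_4'\beta + \varepsilon_2}^{\infty} d\varepsilon_4 \int_{x_4'\beta - x_3'\beta + \varepsilon_4}^{\infty} \phi(\varepsilon)\, d\varepsilon_3,
\end{aligned} \tag{3}$$

where $\phi(\varepsilon)$ is a multivariate normal density function. The complexity of computing $P_{ROP}(r \mid X)$ is especially high when the number of alternatives is large and alternatives are partially ranked.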
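To make the computational point concrete, the probability in (3) can be approximated by brute-force simulation rather than by evaluating the four-dimensional integrals. The sketch below is a crude frequency simulator written for illustration; it is not the estimation method used in the paper or in the ROP literature, and the function name `rop_prob_sim`, the number of draws, and the identity covariance in the usage line are all assumptions.

```python
import numpy as np

def rop_prob_sim(X, beta, cov, n_draws=200_000, seed=0):
    """Simulated P(u3 > u4 > max{u1, u2} | X) with jointly normal errors (illustrative only)."""
    rng = np.random.default_rng(seed)
    v = X @ beta                                               # deterministic utilities, shape (4,)
    eps = rng.multivariate_normal(np.zeros(4), cov, size=n_draws)
    u = v + eps                                                # simulated utilities, (n_draws, 4)
    # Alternatives are indexed from zero here: columns 2 and 3 are alternatives 3 and 4.
    hit = (u[:, 2] > u[:, 3]) & (u[:, 3] > np.maximum(u[:, 0], u[:, 1]))
    return hit.mean()

# With identical deterministic utilities and i.i.d. errors, symmetry implies a
# probability of (1/4) * (1/3) = 1/12, which the simulation should approximate.
p = rop_prob_sim(np.zeros((4, 2)), np.ones(2), np.eye(4))
print(p)
```

Each evaluation of the likelihood needs one such simulation per individual and per trial value of the parameters, which gives a sense of why the cost grows quickly with the number of alternatives.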
The ROL model, first introduced by Beggs, Cardell and Hausman (1981), assumes the elements of $\varepsilon$ are i.i.d. extreme value type I distributed. The ROL model does not allow for any form of heteroskedasticity nor any correlation across alternatives, but it has the ease of computation. The conditional probability of observing the ranking $r = (3, 3, 1, 2)$ given $X = (x_1, \ldots, x_4)'$ has a closed-form solution:

$$\begin{aligned}
P_{ROL}(r \mid X) &= P(u_3 > u_4 > \max\{u_1, u_2\}) \\
&= P(r_3 = 1 \mid X) \cdot P(r_4 = 2 \mid r_3 = 1, X) \\
&= \frac{e^{x_3'\beta}}{e^{x_1'\beta} + e^{x_2'\beta} + e^{x_3'\beta} + e^{x_4'\beta}} \cdot \frac{e^{x_4'\beta}}{e^{x_1'\beta} + e^{x_2'\beta} + e^{x_4'\beta}},
\end{aligned} \tag{4}$$

eliminating the necessity of calculating multivariate integrals. The log-likelihood function is concave, so optimization is standard. The low computation cost explains the popularity of the ROL model over the ROP model in empirical work, especially when the choice set is large. However, the closed-form expression (4) arises from the independence of irrelevant alternatives (IIA) assumption, which is a very strong hypothesis as many authors point out.3 When the IIA assumption is violated, both the probability that alternative 3 is the best alternative ($P(r_3 = 1 \mid X)$) and the conditional probability that alternative 4 is the second best alternative ($P(r_4 = 2 \mid r_3 = 1, X)$) are misspecified. By contrast, only the probability that alternative 3 is the best alternative ($P(r_3 = 1 \mid X)$) is misspecified in the MNL model when the IIA assumption is problematic. The ROL model thus gains efficiency over the MNL model by incorporating information on rankings, but at the cost of suffering from more misspecification than its MNL counterpart. Yan and Yoo (2014) provide analytical examples and Monte Carlo evidence to illustrate the inconsistency of the ROL estimator under misspecification.
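The closed form in (4) is a product of successive logit probabilities over shrinking choice sets, which is easy to check numerically. The following is a minimal sketch, assuming NumPy and a hypothetical helper name `rol_prob`; alternatives are indexed from zero, so the ranking $r = (3, 3, 1, 2)$ corresponds to the ordered list `[2, 3]` of ranked alternatives.

```python
import numpy as np

def rol_prob(X, beta, ranked):
    """ROL probability that the alternatives in `ranked` (best first) are the top choices.

    `rol_prob` is a hypothetical name; `ranked` holds zero-based alternative indexes.
    """
    v = X @ beta
    ev = np.exp(v - v.max())            # subtracting the maximum cancels in every ratio
    remaining = np.ones(v.size, dtype=bool)
    p = 1.0
    for j in ranked:
        p *= ev[j] / ev[remaining].sum()   # logit probability of j among the remaining set
        remaining[j] = False               # drop j before moving to the next rank
    return p

# Example: one covariate, alternative 3 has index ln(2) and the others have index 0.
X = np.array([[0.0], [0.0], [np.log(2.0)], [0.0]])
print(rol_prob(X, np.array([1.0]), ranked=[2, 3]))   # (2/5) * (1/3) = 2/15
```

The first factor uses all four alternatives in the denominator and the second drops alternative 3, matching the two fractions in (4).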
The preference parameters are not estimated consistently by the parametric models under misspecification. Moreover, since the scale of utilities is not identified, we cannot identify the scale of the preference parameters either.4 The meaningful information in utility functions is the ranking of alternatives by preference and the ratio of preference parameters, also known as the marginal rates of substitution, that underlie such rankings.5 It is the ratio of preference parameters that is the focus of our structural estimation.6 Many semiparametric estimators have been proposed to estimate the ratio of preference parameters using multinomial choice data. Rank-ordered data allow considerably more information to be gathered from a given observation than multinomial choice data, in which only the best alternative is revealed. Since sampling costs are the major portion of a project to evaluate the preference for new products and non-market goods, it is important to develop robust estimation methods that utilize ranking information.
3 See Hausman and McFadden (1984).
4 The ranking still holds if $\beta$ and $\varepsilon$ are multiplied by any positive constant.
5 The marginal rates of substitution are sometimes called equivalent or implicit prices when they are calculated as ratios with the coefficient that has a monetary interpretation.
6 Some research focuses on forecasting the choice probabilities instead of the structural preference parameters. In that case, the ROL (or ROP) model may not be a good replacement for the MNL (or MNP) model. This is because both the ROP and ROL estimators are quasi-maximum likelihood estimators (QMLE) in the presence of misspecification of the conditional distribution of the error terms. The QMLE converges to the parameter vector that minimizes the Kullback-Leibler Information Criterion (KLIC). The limiting parameter vector of the ROL (or ROP) estimator is distorted in a way that mimics the probability of rankings instead of the probability of choosing the best alternative. Therefore, the ROL (or ROP) model may not be a good replacement for the MNL (or MNP) model when the research focus is to approximate the choice probabilities of each alternative.
To the best of our knowledge, the only semiparametric estimator that exploits rank-ordered choice data is developed by Hausman and Ruud (1987). Ruud (1983) finds that under misspecification the MNL model provides a consistent estimator for the ratio of slope coefficients if the regressors have linear conditional expectations. Taking advantage of this result, Ruud (1986) constructs a weighted M-estimator (WME) that achieves the same consistency in situations where the conditional expectations are not linear. Hausman and Ruud (1987) generalize the WME for multinomial choice data to rank-ordered choice data. However, consistent estimation is limited to coefficients of continuous explanatory variables and the asymptotic distribution of the WME is still unknown, which limits the usefulness of the WME in applications where some (if not most) variables are categorical.
2.3 The Generalized Maximum Score (GMS) Estimator
In this section, we develop a method to estimate the preference parameter $\beta$ without specifying the functional form of the distribution of the error terms. As the scale of $\beta$ is not identified, we need to impose some normalization first. Subject to the prior knowledge that at least one parameter is non-zero, we can normalize the magnitude of that entry of $\beta$ to be one. Define $\beta \equiv (\beta_1, \tilde{\beta}')' \in \mathbb{R}^q$. Without loss of generality, we assume that $|\beta_1| = 1$ as stated in Assumption 2.
Assumption 2. $\beta \in \mathcal{B}$ where $\mathcal{B} \equiv \{-1, 1\} \times \tilde{\mathcal{B}}$ and $\tilde{\mathcal{B}}$ is a compact subset of $\mathbb{R}^{q-1}$ where $q \geq 2$.
The normalization given by Assumption 2 is widely used in semiparametric estimation for the preference parameter vector in the random utility functions. The vector $\tilde{\beta} \in \tilde{\mathcal{B}}$ is our estimation focus because the estimator of the sign of $\beta_1$ converges at a much faster rate.
Without knowing the joint density of $\varepsilon$ given $X$, we rely on the relationship between the ranking of utilities and the ranking of systematic utilities stated in Assumption 3 to identify the preference parameters. For any alternative $j \in \mathcal{J}$, denote the deterministic utility of alternative $j$ as $v_j \equiv x_j'\beta$.
Assumption 3. For any pair of alternatives $j, k \in \mathcal{J}$, $v_j > v_k$ if and only if

$$P(r_j < r_k \mid X) > P(r_k < r_j \mid X), \tag{5}$$

i.e., if alternative $j$ has higher deterministic utility than alternative $k$, then it is more likely that alternative $j$ is preferred to alternative $k$ ($r_j < r_k$) than that alternative $k$ is preferred to alternative $j$ ($r_k < r_j$), conditioning on the explanatory variables.
Assumption 3 immediately implies that $P(r_j < r_k \mid X) = P(r_j > r_k \mid X)$ when $v_j = v_k$, i.e., if alternative $j$ and alternative $k$ have the same deterministic utility, then the chance that alternative $j$ is ranked better than alternative $k$ is the same as the chance that alternative $k$ is ranked better than alternative $j$.
Two special cases are worth mentioning here. First, $M = J - 1$, i.e., individuals reveal their complete ranking of all the alternatives in the choice set. With the complete ranking of alternatives, we can compare the utilities of any two alternatives. Therefore, alternative $j$ is ranked better than alternative $k$ if and only if the utility from choosing alternative $j$ is higher than the utility from choosing alternative $k$. We have

$$\begin{aligned}
P(r_j < r_k \mid X) &= P(u_j > u_k \mid X) \\
&= P(\varepsilon_k - \varepsilon_j < x_j'\beta - x_k'\beta \mid X).
\end{aligned} \tag{6}$$

The first line of (6) does not hold if we only observe a partial ranking. For example, when neither alternative $j$ nor alternative $k$ belongs to the set of the best $M$ alternatives, both of them have rank $M + 1$ no matter whether they yield the same utility level for the individual or not. Assume that the conditional distribution function of $\varepsilon_j - \varepsilon_k$ is strictly increasing. Then the well-known conditional median zero restriction, $\mathrm{median}(\varepsilon_j - \varepsilon_k \mid X) = 0$, is a sufficient and necessary condition for Assumption 3.
The second special case is $M = 1$, i.e., only the best alternative is revealed, meaning that we have multinomial discrete choice data. For multinomial choice data, alternative $j$ is ranked better than alternative $k$ if and only if $j$ is ranked as the best alternative. So we have

$$P(r_j < r_k \mid X) = P(r_j = 1 \mid X). \tag{7}$$

In this case, Assumption 3 is known as the monotonicity of choice probabilities property, i.e., the ranking of the choice probabilities of alternatives is the same as the ranking of the deterministic utilities of alternatives.
Next, we describe the intuition of applying Assumption 3 to identify the parameter vector $\beta$.7 Let $1(\cdot)$ be the indicator function that equals one if the event in the parenthesis is true and zero otherwise, and let $b \equiv (b_1, \tilde{b}')'$ be any vector in the parameter space $\mathcal{B}$, where $\tilde{b} \in \tilde{\mathcal{B}}$. Under Assumption 3, if $x_j'\beta > x_k'\beta$ is true, then the event $r_j < r_k$ is more likely to be true than the event $r_k < r_j$; if $x_k'\beta > x_j'\beta$ is true, then the event $r_k < r_j$ is more likely to be true than the event $r_j < r_k$; if $x_j'\beta = x_k'\beta$ is true, then the event $r_j < r_k$ has the same chance to be true as the event $r_k < r_j$. Therefore, the expected value of the following match

$$m_{jk}(b) \equiv 1(r_j < r_k) \cdot 1(x_j'b \geq x_k'b) + 1(r_k < r_j) \cdot 1(x_k'b > x_j'b), \tag{8}$$

where $b \in \mathcal{B}$, should be maximized at the true preference parameter vector $\beta$. Define $x_{nj}'b$ as the $b$-utility index for individual $n$ and alternative $j$. Applying the analogy principle, we propose a semiparametric estimator, $b_N \equiv (b_{N,1}, \tilde{b}_N')' \in \mathcal{B}$, for the preference parameter vector $\beta$ defined by (9):

$$b_N \in \arg\max_{b \in \mathcal{B}} Q_N(b), \tag{9}$$

where

$$Q_N(b) = N^{-1} \sum_{n=1}^{N} \sum_{1 \leq j < k \leq J} \left[ 1(r_{nj} < r_{nk}) \cdot 1(x_{nj}'b \geq x_{nk}'b) + 1(r_{nk} < r_{nj}) \cdot 1(x_{nk}'b > x_{nj}'b) \right]. \tag{10}$$

In the special case that $M = 1$, i.e., we have multinomial choice data, the estimator $b_N$ defined by (9) becomes the well-known maximum score (MS) estimator. For this reason, we name the estimator $b_N$ a Generalized Maximum Score (GMS) estimator.8
7 See Manski (1975), Fox (2007) and Yan (2013).
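The sample score in (10) and the maximization in (9) can be illustrated with a short NumPy sketch. Everything beyond the formula itself is an assumption made for the example: the function names, the restriction to $q = 2$, the grid for the free coefficient, and the simulated design. A grid search is used only because it is the simplest global search for a step-function objective; it is not claimed to be the optimizer used in the paper.

```python
import numpy as np

def gms_objective(r, X, b):
    """Q_N(b) of equation (10). r: (N, J) ranks; X: (N, J, q) covariates; b: (q,) candidate."""
    idx = X @ b                                   # b-utility indexes x_nj'b, shape (N, J)
    j, k = np.triu_indices(r.shape[1], k=1)       # every pair with j < k
    # The two events are mutually exclusive, so their union equals the sum in (10).
    match = ((r[:, j] < r[:, k]) & (idx[:, j] >= idx[:, k])) | \
            ((r[:, k] < r[:, j]) & (idx[:, k] > idx[:, j]))
    return match.sum() / r.shape[0]

def gms_grid_search(r, X, grid):
    """Crude maximizer for q = 2: b1 in {-1, +1}, b2 on a user-supplied grid (an assumption)."""
    best_b, best_q = None, -np.inf
    for b1 in (-1.0, 1.0):
        for b2 in grid:
            b = np.array([b1, b2])
            q_val = gms_objective(r, X, b)
            if q_val > best_q:
                best_b, best_q = b, q_val
    return best_b, best_q

# Hypothetical design: J = 3, complete rankings, error scale varying across individuals.
rng = np.random.default_rng(0)
N, J = 500, 3
X = rng.normal(size=(N, J, 2))
beta = np.array([1.0, 0.5])
scale = np.exp(0.5 * X[:, :, 0].mean(axis=1, keepdims=True))   # heteroskedasticity across n
u = X @ beta + scale * rng.normal(size=(N, J))
r = (-u).argsort(axis=1).argsort(axis=1) + 1                   # rank 1 = highest utility
b_hat, q_hat = gms_grid_search(r, X, np.arange(-20, 21) / 10.0)
print(b_hat, q_hat)
```

Because the objective is piecewise constant in `b`, the maximizer is generally a set rather than a point; the sketch simply returns the first grid point that attains the maximum.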
When all the explanatory variables are discrete, we can always find another parameter vector in the neighborhood of $\beta$ that generates the same ranking of $b$-utility indexes as the true parameter vector. To get point identification, we need to impose an extra assumption on the explanatory variables, namely, we need an explanatory variable that is continuous conditional on the other explanatory variables. We introduce some notation and state the restrictions on explanatory variables formally in Assumption 4.
In random utility models, only the differences in utilities matter. Normalize $x_J = 0$.9 Let $x_{jk}$ denote the difference in the explanatory vectors between alternatives $j$ and $k$, that is, $x_{jk} = x_j - x_k$. In Assumption 2, we assumed that the first parameter has a nonzero value. For each alternative $j \in \mathcal{J}$, factor the vector $x_j$ into $(x_{j,1}, \tilde{x}_j')'$, where $x_{j,1}$ is the first element of $x_j$, and $\tilde{x}_j$ refers to the remaining elements of $x_j$. So the first element of $x_{jk}$ is $x_{jk,1} = x_{j,1} - x_{k,1}$ and its remaining elements are $\tilde{x}_{jk} = \tilde{x}_j - \tilde{x}_k$. Denote $\tilde{X} \equiv (\tilde{x}_1, \ldots, \tilde{x}_J)' \in \mathbb{R}^{J \times (q-1)}$. Vectors $x_{nj}$, $x_{njk}$, and $\tilde{x}_{njk}$ are the $n$th observation of vectors $x_j$, $x_{jk}$, and $\tilde{x}_{jk}$, respectively. Matrices $X_n$ and $\tilde{X}_n$ are the $n$th observation of matrices $X$ and $\tilde{X}$, respectively.
Assumption 4. The following statements are true.
(a) For any pair of alternatives $j, k \in \mathcal{J}$, $g_{jk}(x_{jk,1} \mid \tilde{x}_{jk})$ denotes the density function of $x_{jk,1}$ conditional on $\tilde{x}_{jk}$, and $g_{jk}(x_{jk,1} \mid \tilde{x}_{jk})$ is nonzero everywhere on $\mathbb{R}$ for almost every $\tilde{x}_{jk}$.
(b) For any constant vector $c \equiv (c_1, \ldots, c_q)' \in \mathbb{R}^q$, $Xc = 0$ with probability one if and only if $c = 0$.
Assumption 4 is sufficient to show that other vectors $b \in \mathcal{B}$ would yield different values for the probability limit of the objective function $Q_N(b)$ from the true parameter vector $\beta$. Assumption 4(a) avoids a local failure of identification, which is important in a semiparametric setting. Assumption 4(b) is analogous to the full-rank condition for the binary choice model, which prevents a global failure of identification. The following theorem establishes the strong consistency of the GMS estimator.
Theorem 1. Let Assumptions 1-4 hold. The GMS estimator $b_N$ that solves the following problem

$$\max_{b \in \mathcal{B}} Q_N(b) \tag{11}$$

converges almost surely to $\beta$, the true parameter in the data generating process.
8 The GMS estimator is different from the maximum rank estimator that Han and Sherman study. Though the objective function (10) includes comparison of rankings, it is the comparison within each individual, allowing for heteroskedasticity across individuals. The maximum rank estimator compares ranks across individuals, ruling out heteroskedasticity across individuals.
9 If $x_J \neq 0$, we can subtract $x_J$ from each $x_j$.
Similar to the MS estimator, the complexity of the GMS estimator $b_N$ and its slow rate of convergence are due to the discontinuity of the indicator function in (10). Kim and Pollard (1990) have shown that $N^{1/3}$ times the centered MS estimator converges in distribution to the random variable that maximizes a certain Gaussian process for the binary choice model. The estimator that Kim and Pollard study is a special case of the GMS estimator. This asymptotic distribution result is too complicated to be used for inference in applications. Abrevaya and Huang (2005) prove that the standard bootstrap is not consistent for the MS estimator. Delgado et al. (2001) show that subsampling consistently estimates the asymptotic distribution of the test statistic of the MS estimator for the binary choice model. Subsampling entails an efficiency loss and its computational cost is high for our estimator because a global search method is needed to solve the maximization problem for each subsample. In Section 3, we introduce two methods of smoothing over the objective function $Q_N(b)$ to (at least partly) overcome the theoretical and computational difficulty.
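As a rough illustration of what smoothing the step functions in (10) can look like, the sketch below replaces each indicator of an index comparison with a smooth distribution function evaluated at the scaled index difference, in the spirit of Horowitz (1992). It is not the paper's smoothed estimator, whose formal definition belongs to Section 3: the logistic kernel, the bandwidth argument `h`, and the function names are all assumptions made for this example.

```python
import numpy as np

def step_objective(r, X, b):
    """Unsmoothed score Q_N(b) of equation (10), kept here for comparison."""
    idx = X @ b
    j, k = np.triu_indices(r.shape[1], k=1)
    match = ((r[:, j] < r[:, k]) & (idx[:, j] >= idx[:, k])) | \
            ((r[:, k] < r[:, j]) & (idx[:, k] > idx[:, j]))
    return match.sum() / r.shape[0]

def smooth_step(z):
    """Logistic distribution function, written with tanh for numerical stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def smoothed_objective(r, X, b, h):
    """Score with each index-comparison indicator replaced by smooth_step(difference / h)."""
    idx = X @ b
    j, k = np.triu_indices(r.shape[1], k=1)
    d = (idx[:, j] - idx[:, k]) / h               # scaled index differences
    s = (r[:, j] < r[:, k]) * smooth_step(d) + (r[:, k] < r[:, j]) * smooth_step(-d)
    return s.sum() / r.shape[0]

# Hypothetical data: as the bandwidth shrinks, the smoothed score approaches the step score.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3, 2))
b = np.array([1.0, 0.5])
u = X @ b + rng.normal(size=(50, 3))
r = (-u).argsort(axis=1).argsort(axis=1) + 1
for h in (1.0, 0.1, 1e-8):
    print(h, smoothed_objective(r, X, b, h), step_objective(r, X, b))
```

The smoothed score is differentiable in `b`, which is what makes gradient-based optimization and standard asymptotic arguments available, at the price of having to choose a bandwidth.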
2.4 A Special Case: Random Coefficients
There is a special case in which we can incorporate random coefficients into the random utility model. Consider a random utility model with random coefficients as follows:
$$u_{nj} = x_{nj}'\beta_n + e_{nj} = x_{nj}'\beta + \varepsilon_{nj}, \quad \text{where} \quad \varepsilon_{nj} = x_{nj}'(\beta_n - \beta) + e_{nj}. \tag{12}$$
The random coefficient $\beta_n$ reflects unobserved heterogeneity in tastes for observable characteristics. The parameter vector of interest, $\beta$, now represents a certain central tendency measure (mean or median) of the random coefficients for the population of interest. The parameter vector $\beta$ can be estimated using parametric methods under several restrictive assumptions. For example, the popular mixed-logit model assumes that $e_{nj}$ are i.i.d. extreme value type I distributed, $\beta_n$ is independent of $X_n$ and $e_n$, and $\beta_n$ follows a known distribution subject to a few parameters. The mixed-logit model has many good properties such as allowing for heteroskedasticity and correlation across alternatives, and unrestricted substitution patterns. It is straightforward to generalize the mixed-logit model for multinomial choice data to the one for rank-ordered choice data by replacing the choice probabilities in the likelihood function with the probability of rankings. However, like other quasi-maximum likelihood estimators, the mixed-logit estimator may be inconsistent when the distribution function forms of $e_{nj}$ and $\beta_n$ are misspecified.
Estimating $\beta$ in the random coefficient model (12) semiparametrically, i.e., leaving the distribution of $\beta_n$ and $e_{nj}$ unspecified, is a challenging task. When we only have multinomial choice data, the MS estimator does not guarantee consistent estimation for $\beta$ in (12) because the ranking of deterministic utilities of some alternatives may not be the same as the ranking of their choice probabilities. Given the same level of deterministic utility, the alternative with higher variance has more frequent draws from the right tail of the error term distribution.$^{10}$ When we have fully rank-ordered choice data, i.e., we know the ranking for all the alternatives, semiparametric estimation of $\beta$ in the random coefficient model can be achieved using the GMS estimator.
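The point about unequal variances can be illustrated with a small simulation. The sketch below is not from the paper: it assumes three alternatives with identical deterministic utility (zero) and independent normal errors whose standard deviations (1, 1, 3) are chosen purely for illustration. The function name `choice_shares` is hypothetical.

```python
import random

def choice_shares(sds, n_draws=20000, seed=0):
    """Share of draws in which each alternative has the highest utility.

    Every alternative has deterministic utility 0; alternative j receives an
    independent N(0, sds[j]^2) error (an illustrative assumption).
    """
    rng = random.Random(seed)
    wins = [0] * len(sds)
    for _ in range(n_draws):
        utilities = [rng.gauss(0.0, sd) for sd in sds]
        wins[utilities.index(max(utilities))] += 1
    return [w / n_draws for w in wins]

# Equal deterministic utilities, but the third alternative has a larger error variance.
shares = choice_shares([1.0, 1.0, 3.0])
print(shares)
```

In this setup the high-variance alternative is chosen clearly more often than one third of the time, even though the deterministic utilities are tied, so the ordering of choice probabilities alone need not reveal the ordering of deterministic utilities.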
Next, we describe the intuition of consistency of the GMS estimator for $\beta$ in the random coefficient model (12). When the alternatives in the choice set are fully ranked, $r_{nj} < r_{nk}$ if and only if $u_{nj} > u_{nk}$. A sufficient and necessary condition for Assumption 3 is
$$\mathrm{median}(\varepsilon_{nj} - \varepsilon_{nk} \mid X) \equiv \mathrm{median}[(x_{nj} - x_{nk})'(\beta_n - \beta) + e_{nj} - e_{nk} \mid X] = 0. \tag{13}$$
Condition (13) is much weaker than the assumptions imposed on the mixed-logit model. For example, condition (14),
$$\begin{cases} \mathrm{median}(e_{nj} - e_{nk} \mid X) = 0 \ \text{for any } j, k, \\ \beta_n \perp e_n \mid X, \end{cases} \tag{14}$$
which is satisfied by the mixed-logit model, is sufficient for (13). By Theorem 1, the GMS estimator is consistent for $\beta$ in the random coefficient model (12) under Assumptions 1, 2, 4 together with a mild location restriction (13) when we have complete ranking data.
The GMS estimator is consistent for estimating the random coefficient model (12) using fully rank-ordered data. This property is important because in many applications (Calfee et al., 2001; Capparos et al., 2008; Scarpa et al., 2011; Yoo and Doiron, 2013) with a small choice set the alternatives are fully ranked. The GMS estimator does not require parametric assumptions on the distributional form of the individual heterogeneity $\beta_n$ or the error term $e_n$, making it more robust than the mixed-logit model. Meanwhile, the GMS estimator keeps the good properties of the mixed-logit model, allowing for unobserved heterogeneity in tastes, flexible substitution patterns across alternatives, and a general form of heteroskedasticity across both individuals and alternatives.
10 See Fox (2007) and Yan (2013).
3 The Smoothed GMS Estimator
The maximum score type estimator is $N^{1/3}$-consistent and follows the non-standard asymptotic distribution of Kim and Pollard (1990). Kim and Pollard have shown that $N^{1/3}$ times the centered MS estimator converges in distribution to the random variable that maximizes a certain Gaussian process for the binary choice data. This asymptotic distribution result is too complicated to be used for inference in applications. Abrevaya and Huang (2005) prove that the standard bootstrap is not consistent for the MS estimator. Delgado et al. (2001) show that subsampling consistently estimates the asymptotic distribution of the test statistic of the MS estimator for the binary choice data. Subsampling entails an efficiency loss, and its computational cost is high for the MS or GMS estimator because a global search method is needed to solve the maximization problem for each subsample. In this section, we propose an estimator that complements the GMS estimator by addressing these practical limitations, in return for making some additional assumptions. In the context of Manski's binary choice MS estimator, Horowitz (1992) develops a smoothed maximum score (SMS) estimator that replaces the step functions with smooth functions. Yan (2012) applies this technique to derive a smoothed version of Fox's multinomial choice MS estimator. We use the same approach to derive a smoothed GMS (SGMS) estimator that offers similar benefits to its predecessors: we show that the SGMS estimator's convergence rate is faster than $N^{-1/3}$ under extra smoothness conditions, and also that it has an asymptotically normal distribution.
3.1 The Smoothed GMS Estimator and its Asymptotic Properties
The objective function in (10) can be rewritten as
$$Q_N(b) = N^{-1} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left\{ [1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj})] \cdot 1(x_{nj}'b \ge x_{nk}'b) + 1(r_{nk} < r_{nj}) \right\}. \tag{15}$$
The indicator function of $b$ in (15) can be replaced with a sufficiently smooth function $K(\cdot)$, where $K(\cdot)$ is analogous to a cumulative distribution function. Let $h_N$ be a positive bandwidth that goes to zero when the sample size $N$ goes to infinity. Application of the smoothing idea in Horowitz (1992) to the right-hand side of (15) yields a smoothed GMS (SGMS) estimator
$$b_N^S \in \arg\max_{b \in B} Q_N^S(b, h_N), \tag{16}$$
where
$$Q_N^S(b, h_N) = N^{-1} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left\{ [1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj})] \cdot K\big((x_{nj}'b - x_{nk}'b)/h_N\big) + 1(r_{nk} < r_{nj}) \right\}. \tag{17}$$
The next condition states the requirements on the smoothing function $K(\cdot)$ for the SGMS estimator, $b_N^S$, to be a consistent estimator for $\beta$.
Condition 1. Let $\{h_N : N = 1, 2, \ldots\}$ be a sequence of strictly positive real numbers satisfying $\lim_{N \to \infty} h_N = 0$ and let $K(x)$ be a function on $\mathbb{R}$ such that:
(a) $|K(x)| < C$ for some finite $C$ and all $x \in (-\infty, \infty)$; and
(b) $\lim_{x \to -\infty} K(x) = 0$ and $\lim_{x \to \infty} K(x) = 1$.
Theorem 2. Let Assumptions 1-4 and Condition 1 hold. The SGMS estimator $b_N^S \in B$ defined in (16) converges almost surely to the true preference parameter $\beta$.
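The relationship between the step objective (15) and the smoothed objective (17) can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the toy data, the function names `gms_objective` and `sgms_objective`, and the choice of the standard normal distribution function for $K(\cdot)$ (which satisfies Condition 1) are assumptions made here.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, one admissible smoothing function K under Condition 1."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _pairs(x_n, r_n, b):
    """Yield (sign, lower_ranked, index) for all pairs j < k of one agent."""
    J = len(x_n)
    for j in range(J):
        for k in range(j + 1, J):
            sign = (r_n[j] < r_n[k]) - (r_n[k] < r_n[j])
            index = sum(bi * (xj - xk) for bi, xj, xk in zip(b, x_n[j], x_n[k]))
            yield sign, float(r_n[k] < r_n[j]), index

def gms_objective(b, X, R):
    """Step objective Q_N(b) as in (15)."""
    total = 0.0
    for x_n, r_n in zip(X, R):
        for sign, lower, index in _pairs(x_n, r_n, b):
            total += sign * (1.0 if index >= 0 else 0.0) + lower
    return total / len(X)

def sgms_objective(b, X, R, h, K=normal_cdf):
    """Smoothed objective Q^S_N(b, h) as in (17)."""
    total = 0.0
    for x_n, r_n in zip(X, R):
        for sign, lower, index in _pairs(x_n, r_n, b):
            total += sign * K(index / h) + lower
    return total / len(X)

# Hypothetical toy data: 2 agents, 3 alternatives, 2 attributes; rank 1 is the best.
X = [[[1.0, 0.5], [0.2, 0.1], [-0.3, 0.4]],
     [[0.0, 1.0], [0.8, -0.2], [0.5, 0.3]]]
R = [[1, 2, 3], [3, 1, 2]]
b = (1.0, 1.0)

print(gms_objective(b, X, R))
print(sgms_objective(b, X, R, h=0.5))
print(sgms_objective(b, X, R, h=1e-6))
```

As the bandwidth shrinks, $K(\cdot/h)$ approaches the indicator function, so the smoothed objective approaches the step objective whenever no index difference is exactly zero.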
Since the objective function (17) of the SGMS estimator $b_N^S$ is a smooth function, the derivation of the asymptotic distribution of the SGMS estimator for the binary choice model in Horowitz (1992) is applicable to it. Denote the first element of $b_N^S$ as $b_{N,1}^S$ and the remaining elements as $\tilde{b}_N^S$. Next, we define a few terms and give a sketch of deriving the asymptotic distribution of the SGMS estimator. Define the first order derivative of the smoothed objective function $Q_N^S(b, h_N)$ with respect to $\tilde{b}$ as $t_N(b, h_N) \equiv \partial Q_N^S(b, h_N)/\partial \tilde{b}$ and the second order derivative of $Q_N^S(b, h_N)$ with respect to $\tilde{b}$ as $H_N(b, h_N) \equiv \partial^2 Q_N^S(b, h_N)/\partial \tilde{b}\,\partial \tilde{b}'$. Assume that the function $K(\cdot)$ is twice differentiable, and we have the vector
$$t_N(b, h_N) = (N h_N)^{-1} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left\{ [1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj})] \cdot K'(x_{njk}'b/h_N)\, \tilde{x}_{njk} \right\} \tag{18}$$
and the matrix
$$H_N(b, h_N) = (N h_N^2)^{-1} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left\{ [1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj})] \cdot K''(x_{njk}'b/h_N)\, \tilde{x}_{njk} \tilde{x}_{njk}' \right\}. \tag{19}$$
Assumption 5. $\tilde{\beta}$ is an interior point of $\tilde{B}$.
If Assumption 5 is true, then with probability approaching 1 as $N \to \infty$, $b_{N,1}^S = \beta_1$, and $t_N(b_N^S, h_N) = 0$ by the first order condition. A Taylor series expansion of $t_N(b_N^S, h_N)$ around the true parameter $\beta$ yields
$$t_N(b_N^S, h_N) = t_N(\beta, h_N) + H_N(b_N^*, h_N)(\tilde{b}_N^S - \tilde{\beta}) = 0, \tag{20}$$
where $b_N^*$ is a vector between $b_N^S$ and $\beta$. Suppose there is a function $\rho(N)$ such that $\rho(N) t_N(\beta, h_N)$ converges in distribution and also that $H_N(b_N^*, h_N)$ converges in probability to a nonsingular, nonstochastic matrix $H$. Then,
$$\rho(N)(\tilde{b}_N^S - \tilde{\beta}) = -H^{-1} \rho(N) t_N(\beta, h_N) + o_p(1). \tag{21}$$
Equation (21) implies that it is essential to derive the limiting distribution of $\rho(N) t_N(\beta, h_N)$ to establish the asymptotic distribution of the estimator $b_N^S$. The mean and variance of $\rho(N) t_N(\beta, h_N)$ depend on the probabilities of different rankings. Unlike the binary choice setting where the choice probabilities depend on a single difference between explanatory vectors, the probabilities of rankings rely on multiple differences between explanatory vectors. How the difference in explanatory vectors of any two alternatives affects the difference in their rankings is key to the derivation of the asymptotic distribution of $\rho(N) t_N(\beta, h_N)$.
To formalize the idea of deriving the asymptotic distribution of the estimator $b_N^S$, we introduce some notation. Let $v_j \equiv x_j'\beta$ be the systematic utility of choosing alternative $j$. Denote $v \equiv (v_1, \ldots, v_{J-1}, v_J)'$. $v_J$ is normalized to be zero. Define $v_{-J} \equiv (v_1, \ldots, v_{J-1})'$. There is a one-to-one correspondence between $X$ and $(v_{-J}, \tilde{X})$ for fixed $\beta$. Denote $\iota_J \equiv (1, \ldots, 1) \in \mathbb{R}^J$. For any alternative $j \in J$, let $v_{-j}$ be the vector $v - \iota_J v_j$ excluding its $j$th element. For example, when $1 < j < J$,
$$v_{-j} = (v_1 - v_j, \ldots, v_{j-1} - v_j, v_{j+1} - v_j, \ldots, v_J - v_j)'.$$
In other words, $v_{-j}$ is the systematic utility vector normalized by the systematic utility of alternative $j$. For any pair of alternatives $j, k \in J$, define $v_{-j,k} = v_k - v_j$ and $\tilde{v}_{-j,k}$ as the vector that consists of all of the elements of $v_{-j}$ excluding $v_{-j,k}$. For example, when $1 < j < k < J$,
$$\tilde{v}_{-j,k} \equiv (v_1 - v_j, \ldots, v_{j-1} - v_j, v_{j+1} - v_j, \ldots, v_{k-1} - v_j, v_{k+1} - v_j, \ldots, v_J - v_j)'.$$
If $J > 2$, for any three different alternatives $j, k, l \in J$, define $\tilde{v}_{-j,kl}$ as the vector that consists of all of the elements of $v_{-j}$ excluding $v_{-j,k}$ and $v_{-j,l}$. For example, when $1 < j < k < l < J$,
$$\tilde{v}_{-j,kl} \equiv (v_1 - v_j, \ldots, v_{j-1} - v_j, v_{j+1} - v_j, \ldots, v_{k-1} - v_j, v_{k+1} - v_j, \ldots, v_{l-1} - v_j, v_{l+1} - v_j, \ldots, v_J - v_j)'.$$
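Let $p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})$ denote the conditional density of $v_{-j,k}$ given $(\tilde{v}_{-j,k}, \tilde{X})$. Define the derivatives
$$p_{jk}^{(i)}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) = \partial^i p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})/\partial (v_{-j,k})^i$$
and $p_{jk}^{(0)}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) \equiv p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})$. Let $p_{jkl}(v_{-j,k}, v_{-j,l} \mid \tilde{v}_{-j,kl}, \tilde{X})$ denote the joint density of $(v_{-j,k}, v_{-j,l})$ conditional on $(\tilde{v}_{-j,kl}, \tilde{X})$.
Given any pair of alternatives $j, k \in J$, there is a one-to-one correspondence between $X$ and $(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})$ for fixed $\beta \in B$. The conditional probability of alternative $j$ being ranked better than alternative $k$ depends on the explanatory matrix $X$, or equivalently, $(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})$. Next, for any alternatives $j, k \in J$, define
$$P(r_j < r_k \mid v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \equiv F_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \tag{22}$$
and
$$P(r_j < r_k \mid v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) - P(r_k < r_j \mid v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \equiv \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}). \tag{23}$$
For any integer $i > 0$, define the following derivatives:
$$\bar{F}_{jk}^{(i)}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \equiv \partial^i \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})/\partial (v_{-j,k})^i$$
whenever the derivatives exist. Define the scalar constants $k_d$ and $k_\Omega$ by
$$k_d = \int_{-\infty}^{\infty} x^d K'(x)\,dx \quad \text{and} \quad k_\Omega = \int_{-\infty}^{\infty} [K'(x)]^2\,dx,$$
whenever these quantities exist. Define the $(q-1)$ vector $a$ and the $(q-1) \times (q-1)$ matrices $\Omega$ and $H$ as follows:
$$a = \sum_{1 \le j < k \le J} k_d \sum_{i=1}^{d} \frac{1}{i!(d-i)!} E\left[ \bar{F}_{jk}^{(i)}(0, \tilde{v}_{-j,k}, \tilde{X})\, p_{jk}^{(d-i)}(0 \mid \tilde{v}_{-j,k}, \tilde{X})\, \tilde{x}_{jk} \right],$$
$$\Omega = \sum_{1 \le j < k \le J} 2 k_\Omega E\left[ F_{jk}(0, \tilde{v}_{-j,k}, \tilde{X})\, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X})\, \tilde{x}_{jk} \tilde{x}_{jk}' \right],$$
and
$$H = \sum_{1 \le j < k \le J} E\left[ \bar{F}_{jk}^{(1)}(0, \tilde{v}_{-j,k}, \tilde{X})\, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X})\, \tilde{x}_{jk} \tilde{x}_{jk}' \right]$$
whenever these quantities exist.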
In addition to Condition 1, we impose the following restrictions on the smoothing function $K(\cdot)$.
Condition 2. The following statements are true.
(a) $K(x)$ is twice differentiable for $x \in \mathbb{R}$, $|K'(x)|$ and $|K''(x)|$ are uniformly bounded, and the integrals $\int_{-\infty}^{\infty} [K'(x)]^4\,dx$, $\int_{-\infty}^{\infty} x^2 |K''(x)|\,dx$, and $\int_{-\infty}^{\infty} [K''(x)]^2\,dx$ are finite.
(b) For some integer $d \ge 2$, $\int_{-\infty}^{\infty} |x^d K'(x)|\,dx < \infty$, $k_d \in (0, \infty)$, and $k_\Omega \in (0, \infty)$. For any integer $i$ $(1 \le i < d)$, $\int_{-\infty}^{\infty} |x^i K'(x)|\,dx < \infty$ and $\int_{-\infty}^{\infty} x^i K'(x)\,dx = 0$.
(c) For any integer $i$ $(0 \le i \le d)$, any $\eta > 0$, and any sequence $\{h_N\}$ converging to $0$,
$$\lim_{N \to \infty} h_N^{i-d} \int_{|h_N x| > \eta} |x^i K'(x)|\,dx = 0 \quad \text{and} \quad \lim_{N \to \infty} h_N^{-1} \int_{|h_N x| > \eta} |K''(x)|\,dx = 0.$$
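As a numerical illustration of the constants in Condition 2(b), the sketch below evaluates $k_1$, $k_2$ and $k_\Omega$ for the standard normal distribution function, the smoothing function used later in the Monte Carlo experiments. The trapezoid-rule helper `integrate` and the truncation of the integration range to $[-12, 12]$ are assumptions made here for illustration.

```python
import math

def phi(x):
    """K'(x) when K is the standard normal distribution function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(f, lo=-12.0, hi=12.0, n=24000):
    """Composite trapezoid rule on [lo, hi]; the tails beyond are negligible here."""
    step = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * step)
    return total * step

k1 = integrate(lambda x: x * phi(x))           # first moment of K'
k2 = integrate(lambda x: x ** 2 * phi(x))      # k_d with d = 2
k_omega = integrate(lambda x: phi(x) ** 2)     # integral of [K'(x)]^2

print(k1, k2, k_omega)
```

For this choice of $K(\cdot)$ the first moment of $K'$ vanishes while the second does not, so the moment conditions in part (b) are met with $d = 2$; a larger $d$ would require a higher order kernel.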
We need additional assumptions to derive the asymptotic distribution of the estimator $b_N^S$.
Assumption 6. For any pair of alternatives $j, k \in J$, and for $v_{-j,k}$ in a neighborhood of zero, $\bar{F}_{jk}^{(i)}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})$ exists and is a continuous function of $v_{-j,k}$ and is bounded by $C$ for almost every $(\tilde{v}_{-j,k}, \tilde{X})$, where $C < \infty$ and $i$ is an integer $(1 \le i \le d)$.
Assumption 6 states the differentiability of the conditional distribution function of the error term $\varepsilon$.
Assumption 7. The following statements are true.
(a) For any pair of alternatives $j, k \in J$, $p_{jk}^{(i)}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})$ exists and is a continuous function of $v_{-j,k}$ satisfying $|p_{jk}^{(i)}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})| < C$ for $v_{-j,k}$ in a neighborhood of zero, almost every $(\tilde{v}_{-j,k}, \tilde{X})$, some $C < \infty$, and any integer $i$ $(1 \le i \le d-1)$. In addition, $|p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})| < C$ for all $v_{-j,k}$ and almost every $(\tilde{v}_{-j,k}, \tilde{X})$.
(b) For any three different alternatives $j, k, l \in J$, $p_{jkl}(v_{-j,k}, v_{-j,l} \mid \tilde{v}_{-j,kl}, \tilde{X}) < C$ for all $(v_{-j,k}, v_{-j,l})$ and almost every $(\tilde{v}_{-j,kl}, \tilde{X})$.
(c) The components of matrices $\tilde{X}$, $\mathrm{vec}(\tilde{X})\mathrm{vec}(\tilde{X})'$, and $\mathrm{vec}(\tilde{X})\mathrm{vec}(\tilde{X})'\mathrm{vec}(\tilde{X})\mathrm{vec}(\tilde{X})'$ have finite first absolute moments.
Assumption 7 imposes regularity conditions on the explanatory variables. Assumptions 6-7 guarantee the existence of the parameters in the limiting distribution of the estimator.
Assumption 8. $(\log N)/(N h_N^4) \to 0$ as $N \to \infty$.
Assumptions 6-8 together with Condition 2 are analogous to assumptions in kernel density estimation. A higher convergence rate of the SGMS estimator can be achieved using a higher order kernel ($K'(\cdot)$) when the required derivatives of $\bar{F}$ and $p$ exist.
Assumption 9. The matrix $H$ is negative definite.
The matrix $H$ is analogous to the Hessian information matrix in the quasi-MLE.
The main results concerning the asymptotic distribution of the SGMS estimator are given by the following theorem.
Theorem 3. Let Assumptions 1-9 and Conditions 1-2 hold for some integer $d \ge 2$ and let $\{b_N^S\}$ be a sequence of solutions to problem (16).
(a) If $N h_N^{2d+1} \to \infty$ as $N \to \infty$, then $h_N^{-d}(\tilde{b}_N^S - \tilde{\beta}) \xrightarrow{p} -H^{-1}a$.
(b) If $N h_N^{2d+1}$ has a finite limit $\lambda$ as $N \to \infty$, then
$$(N h_N)^{1/2}(\tilde{b}_N^S - \tilde{\beta}) \xrightarrow{d} MVN\left(-\lambda^{1/2} H^{-1} a,\; H^{-1} \Omega H^{-1}\right).$$
(c) Let $h_N = (\lambda/N)^{1/(2d+1)}$ with $0 < \lambda < \infty$; $W$ be any nonstochastic, positive semidefinite matrix such that $a' H^{-1} W H^{-1} a \ne 0$; $E_A$ denote the expectation with respect to the asymptotic distribution of $N^{d/(2d+1)}(\tilde{b}_N^S - \tilde{\beta})$; and $MSE \equiv E_A[(\tilde{b}_N^S - \tilde{\beta})' W (\tilde{b}_N^S - \tilde{\beta})]$. The $MSE$ is minimized by setting
$$\lambda = \lambda^* \equiv \mathrm{trace}\left(\Omega H^{-1} W H^{-1}\right) / \left(2d\, a' H^{-1} W H^{-1} a\right),$$
in which case $N^{d/(2d+1)}(\tilde{b}_N^S - \tilde{\beta})$ converges in distribution to
$$MVN\left(-(\lambda^*)^{d/(2d+1)} H^{-1} a,\; (\lambda^*)^{-1/(2d+1)} H^{-1} \Omega H^{-1}\right).$$
Theorem 3 implies that the fastest rate of convergence of the estimator is $N^{-d/(2d+1)}$ because $h_N^{-d}/N^{d/(2d+1)} = (N h_N^{2d+1})^{-d/(2d+1)} \to 0$ if $N h_N^{2d+1} \to \infty$ as $N \to \infty$, and $(N h_N)^{1/2}/N^{d/(2d+1)} = (N h_N^{2d+1})^{1/(4d+2)} \to 0$ if $N h_N^{2d+1} \to 0$ as $N \to \infty$. Choosing bandwidth $h_N = (\lambda/N)^{1/(2d+1)}$, where $\lambda \in (0, \infty)$, is sufficient to achieve the fastest rate of convergence. Theorem 3(c) shows that $\lambda^*$ minimizes $MSE$ of the asymptotic distribution of $N^{d/(2d+1)}(\tilde{b}_N^S - \tilde{\beta})$.
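As a worked illustration of these rates (simple arithmetic from Theorem 3, added here for concreteness rather than taken from the paper), consider the smallest admissible order $d = 2$ and compare it with a higher order $d = 4$:

```latex
\[
d = 2:\quad h_N = (\lambda/N)^{1/5}, \qquad
\tilde{b}_N^S - \tilde{\beta} = O_p\!\left(N^{-2/5}\right);
\qquad
d = 4:\quad h_N = (\lambda/N)^{1/9}, \qquad
\tilde{b}_N^S - \tilde{\beta} = O_p\!\left(N^{-4/9}\right).
\]
\[
\frac{1}{3} \;<\; \frac{2}{5} \;<\; \frac{4}{9} \;<\; \frac{1}{2},
\qquad
\lim_{d \to \infty} \frac{d}{2d+1} = \frac{1}{2}.
\]
```

Under the stated assumptions, even $d = 2$ improves on the $N^{-1/3}$ rate of the unsmoothed estimator, and the rate approaches but never reaches the parametric rate $N^{-1/2}$ as $d$ grows.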
To make the result of Theorem 3 useful in applications, it is necessary to be able to estimate the parameters in the limiting distribution, $a$, $\Omega$, and $H$, consistently from observations of $(r, X)$. The next theorem shows how this can be done.
Theorem 4. Let Assumptions 1-9 and Conditions 1-2 hold for some integer $d \ge 2$ and let the vector $b_N^S$ be a consistent estimator based on $h_N \propto N^{-1/(2d+1)}$. Let $h_N^* \propto N^{-\delta/(2d+1)}$, where $\delta \in (0, 1)$. Then
(a) $\hat{a}_N \equiv (h_N^*)^{-d}\, t_N(b_N^S, h_N^*)$ converges in probability to $a$.
(b) For $b \in B$ and $n = 1, \ldots, N$, define
$$t_{Nn}(b, h_N) = \sum_{1 \le j < k \le J} [1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj})]\, K'(x_{njk}'b/h_N)\, \tilde{x}_{njk}\, h_N^{-1},$$
the matrix
$$\hat{\Omega}_N \equiv (h_N/N) \sum_{n=1}^{N} t_{Nn}(b_N^S, h_N)\, t_{Nn}(b_N^S, h_N)'$$
converges in probability to $\Omega$.
(c) $H_N(b_N^S, h_N)$ converges in probability to $H$.
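The estimators in Theorem 4 can be sketched for the simplest case used later in the Monte Carlo experiments, where the first coefficient is normalized to one and only a scalar $\tilde{b} = b_2$ is estimated. Everything below is an illustrative sketch under stated assumptions: $K(\cdot)$ is taken to be the standard normal distribution function (so $K' = \phi$ and $K''(z) = -z\phi(z)$), the toy data generator `toy_data` is hypothetical, and the function names are not from the paper.

```python
import math
import random

def phi(z):
    """Standard normal density: K'(z) when K is the normal CDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF, used as the smoothing function K."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pairs(x_n, r_n):
    """Yield (sign, dx1, dx2) for j < k, with sign = 1(r_j<r_k) - 1(r_k<r_j)."""
    J = len(x_n)
    for j in range(J):
        for k in range(j + 1, J):
            sign = (r_n[j] < r_n[k]) - (r_n[k] < r_n[j])
            yield sign, x_n[j][0] - x_n[k][0], x_n[j][1] - x_n[k][1]

def smoothed_objective(b2, X, R, h):
    """Q^S_N((1, b2), h) without the term 1(r_k<r_j), which does not depend on b."""
    return sum(s * Phi((d1 + b2 * d2) / h)
               for x_n, r_n in zip(X, R) for s, d1, d2 in pairs(x_n, r_n)) / len(X)

def score_terms(b2, X, R, h):
    """Per-agent terms t_Nn(b, h) from Theorem 4(b)."""
    return [sum(s * phi((d1 + b2 * d2) / h) * d2 / h for s, d1, d2 in pairs(x_n, r_n))
            for x_n, r_n in zip(X, R)]

def t_N(b2, X, R, h):
    """Score t_N(b, h) as in (18): the average of the per-agent terms."""
    terms = score_terms(b2, X, R, h)
    return sum(terms) / len(terms)

def a_hat(b2, X, R, h_star, d=2):
    """Theorem 4(a): (h*)^(-d) t_N(b, h*)."""
    return h_star ** (-d) * t_N(b2, X, R, h_star)

def omega_hat(b2, X, R, h):
    """Theorem 4(b): (h/N) times the sum of squared per-agent terms (scalar case)."""
    terms = score_terms(b2, X, R, h)
    return h / len(terms) * sum(t * t for t in terms)

def hessian(b2, X, R, h):
    """H_N(b, h) as in (19), using K''(z) = -z * phi(z)."""
    total = 0.0
    for x_n, r_n in zip(X, R):
        for s, d1, d2 in pairs(x_n, r_n):
            z = (d1 + b2 * d2) / h
            total += s * (-z * phi(z)) * d2 * d2
    return total / (len(X) * h * h)

def toy_data(N=50, J=3, seed=1):
    """Hypothetical fully ranked data with beta = (1, 1) and normal errors."""
    rng = random.Random(seed)
    X, R = [], []
    for _ in range(N):
        x_n = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(J)]
        u = [x1 + x2 + rng.gauss(0, 1) for x1, x2 in x_n]
        order = sorted(range(J), key=lambda j: -u[j])
        r_n = [0] * J
        for pos, j in enumerate(order):
            r_n[j] = pos + 1  # rank 1 = highest utility
        X.append(x_n)
        R.append(r_n)
    return X, R

X, R = toy_data()
h = 0.5
print(t_N(1.0, X, R, h), omega_hat(1.0, X, R, h), hessian(1.0, X, R, h))
```

A useful sanity check on such code is that a finite-difference derivative of the smoothed objective reproduces `t_N`, and a finite-difference derivative of `t_N` reproduces `hessian`.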
By Theorem 3(c), the asymptotic bias of $N^{d/(2d+1)}(\tilde{b}_N^S - \tilde{\beta})$ is $-\lambda^{d/(2d+1)} H^{-1} a$ when the bandwidth $h_N = (\lambda/N)^{1/(2d+1)}$. It follows from Theorem 4 that the bias term $-\lambda^{d/(2d+1)} H^{-1} a$ can be estimated consistently by $-\lambda^{d/(2d+1)} H_N(b_N^S, h_N)^{-1} \hat{a}_N$. Therefore, define
$$\tilde{b}_N^u = \tilde{b}_N^S + (\lambda/N)^{d/(2d+1)} H_N(b_N^S, h_N)^{-1} \hat{a}_N \tag{24}$$
as the bias-corrected SGMS estimator.
3.2 Bandwidth Selection
Theorem 3(c) provides a way to choose the bandwidth for the SGMS estimator. To achieve the minimum $MSE$, an optimal $\lambda^*$ can be consistently estimated by the conclusion of Theorem 4. Therefore, one possible way of choosing bandwidth is to set $h_N = (\hat{\lambda}/N)^{1/(2d+1)}$, where $\hat{\lambda}$ is a consistent estimator for $\lambda^*$.
In Monte Carlo experiments and empirical application, the choice of bandwidth can be implemented by taking the following steps.
Step 1. Given $d$, choose $h_N \propto N^{-1/(2d+1)}$ and $h_N^* \propto N^{-\delta/(2d+1)}$ for $\delta \in (0, 1)$.
Step 2. Compute the SGMS estimator $b_N^S$ using $h_N$. Use $b_N^S$ and $h_N^*$ to compute $\hat{a}_N$. Use $b_N^S$ and $h_N$ to compute $\hat{\Omega}_N$ and $H_N(b_N^S, h_N)$.
Step 3. Estimate $\lambda^*$ by
$$\hat{\lambda}_N = \mathrm{trace}\left[\hat{\Omega}_N H_N(b_N^S, h_N)^{-1} H_N(b_N^S, h_N)^{-1}\right] \cdot \left[2d\, \hat{a}_N' H_N(b_N^S, h_N)^{-1} H_N(b_N^S, h_N)^{-1} \hat{a}_N\right]^{-1}.$$
Step 4. Calculate the estimated bandwidth $h_N^e = (\hat{\lambda}_N/N)^{1/(2d+1)}$.
After deriving the bandwidth $h_N^e$, an SGMS estimator can be calculated based on it. This method of choosing the bandwidth is analogous to the plug-in method of kernel density estimation.
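Steps 3 and 4 above can be sketched as follows, taking the estimates $\hat{\Omega}_N$, $H_N(b_N^S, h_N)$ and $\hat{a}_N$ from Step 2 as given inputs. This is a minimal sketch under assumptions: the weighting matrix is the identity (as the formula in Step 3 implies), matrices are plain nested lists, and the helper names `invert`, `matmul` and `plug_in_bandwidth` as well as the numerical inputs are hypothetical.

```python
def invert(A):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(A)
    M = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def plug_in_bandwidth(omega_hat, H_hat, a_hat, N, d=2):
    """Return (lambda_hat, h_e) following Steps 3 and 4 with W = identity."""
    H_inv = invert(H_hat)
    H_inv2 = matmul(H_inv, H_inv)
    product = matmul(omega_hat, H_inv2)
    numerator = sum(product[i][i] for i in range(len(a_hat)))      # trace term
    v = [sum(hij * aj for hij, aj in zip(row, a_hat)) for row in H_inv2]
    denominator = 2 * d * sum(ai * vi for ai, vi in zip(a_hat, v))
    lam = numerator / denominator
    return lam, (lam / N) ** (1.0 / (2 * d + 1))

# Hypothetical scalar inputs (one free coefficient), for illustration only.
lam, h_e = plug_in_bandwidth([[0.5]], [[-2.0]], [0.25], N=64, d=2)
print(lam, h_e)
```

In the scalar case the Hessian cancels out of the ratio, so $\hat{\lambda}_N$ reduces to $\hat{\Omega}_N/(2d\,\hat{a}_N^2)$; with more than one free coefficient the Hessian matters through the matrix products.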
4 Monte Carlo Experiments
In this section, we provide small-scale Monte Carlo simulation results to demonstrate some finite-sample properties of the proposed estimator $b_N^S$. In each experiment, we compare the estimation results of $b_N^S$ with $b_N$ and the ROL estimator. We consider three data generating processes (DGPs). In the first process, the ROL model is correctly specified, i.e., the error terms are i.i.d. type-I extreme value distributed and are independent of explanatory variables. In the second process, the error terms are i.i.d. draws from a mixed normal distribution and are independent of explanatory variables. In the last process, we allow for heteroskedasticity: the conditional distribution of the error terms varies across agents.
For each agent $n$, data are generated from the random utility model:
$$u_{nj} = x_{nj,1}\beta_1 + x_{nj,2}\beta_2 + \varepsilon_{nj} \quad \text{for } j \in J,$$
where $\beta_1 = \beta_2 = 1$.
A gradient-based algorithm is used to search for the ROL estimator due to the concavity of the log-likelihood function. The challenge in finding $b_N$ is that its objective function is a step function, and hence every point is a local extremum as shown in Figure 1. Smoothing over the step function yields the objective function of $b_N^S$. It is differentiable, but the difficulty in finding the maxima persists. In all Monte Carlo experiments, we use the differential evolution random search method (Storn and Price, 1995) to search for $b_N$ and a gradient-based algorithm with different initial searching points to search for $b_N^S$.
We consider two sample sizes ($N = 100$ and $500$) and three ranking depths ($M = 1$, $2$, and $4$) for each DGP. The number of alternatives is fixed at $J = 5$. There are 1000 replications for each experiment. The bandwidth is chosen by $h_N = N^{-1/5}$. The smoothing function $K(\cdot)$ is the standard normal distribution function.
Following are the three DGPs.
• DGP 1: $x_{nj,1}$ and $x_{nj,2}$ are i.i.d. N(0, 2). $\varepsilon_{nj}$ are i.i.d. EV(0, 1, 0).
• DGP 2: $x_{nj,1}$ and $x_{nj,2}$ are i.i.d. N(0, 2). $\varepsilon_{nj}$ are i.i.d. with density function given by
$$f(\varepsilon) = \frac{0.369}{\sqrt{2\pi \times 0.184}} \exp\left[-\frac{(\varepsilon + 1)^2}{2 \times 0.184}\right] + \frac{1 - 0.369}{\sqrt{2\pi \times 0.193}} \exp\left[-\frac{(\varepsilon - 1.5)^2}{2 \times 0.193}\right]. \tag{25}$$
• DGP 3: $x_{nj,1}$ are i.i.d. N(0, 2). $x_{nj,2} = w_{nj}/c_n$, where $w_{nj}$ are i.i.d. N(0, 2) and $c_n$ are i.i.d. unif(1/5, 5). $\varepsilon_{nj} = 0.004(1 + 2c_n^2 + c_n^4)e_{nj}$, where $e_{nj}$ are i.i.d. EV(0, 1, 0).
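The three DGPs can be simulated with the standard library alone. The sketch below rests on two interpretive assumptions that the text does not settle: N(0, 2) is read as a normal distribution with variance 2, and EV(0, 1, 0) as the standard type-I extreme value (Gumbel) distribution with location 0 and scale 1. The function names are hypothetical, and the code returns full rankings; truncating to the top $M$ ranks is left out.

```python
import math
import random

SD_X = math.sqrt(2.0)  # assumption: N(0, 2) denotes variance 2

def gumbel(rng):
    """Type-I extreme value draw (location 0, scale 1) by inverse transform."""
    u = rng.random()
    while u == 0.0:          # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def mixed_normal(rng):
    """Draw from the two-component normal mixture in (25)."""
    if rng.random() < 0.369:
        return rng.gauss(-1.0, math.sqrt(0.184))
    return rng.gauss(1.5, math.sqrt(0.193))

def simulate_agent(dgp, rng, J=5, beta=(1.0, 1.0)):
    """Return (attributes, ranks) for one agent; rank 1 is the most preferred."""
    x1 = [rng.gauss(0.0, SD_X) for _ in range(J)]
    if dgp == 3:
        c = rng.uniform(0.2, 5.0)
        x2 = [rng.gauss(0.0, SD_X) / c for _ in range(J)]
        scale = 0.004 * (1.0 + 2.0 * c ** 2 + c ** 4)
        eps = [scale * gumbel(rng) for _ in range(J)]
    else:
        x2 = [rng.gauss(0.0, SD_X) for _ in range(J)]
        eps = [gumbel(rng) if dgp == 1 else mixed_normal(rng) for _ in range(J)]
    u = [beta[0] * a + beta[1] * b + e for a, b, e in zip(x1, x2, eps)]
    order = sorted(range(J), key=lambda j: -u[j])
    ranks = [0] * J
    for pos, j in enumerate(order):
        ranks[j] = pos + 1
    return list(zip(x1, x2)), ranks

def simulate(dgp, N, seed=0):
    """Simulate N agents from DGP 1, 2 or 3."""
    rng = random.Random(seed)
    draws = [simulate_agent(dgp, rng) for _ in range(N)]
    return [x for x, _ in draws], [r for _, r in draws]

X, R = simulate(dgp=3, N=100)
print(R[0])
```

In DGP 3 the same draw $c_n$ scales both the second attribute and the error term, which is what makes the error variance depend on the explanatory variables across agents.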
The results of the experiments using the three DGPs are summarized in Tables 1-3, respectively. The normalization of the ROL model is imposed on the variance of the error terms. Therefore, we estimate both parameters $\beta_1$ and $\beta_2$ for the ROL model. For $b_N$ and $b_N^S$, the normalization is imposed on the first parameter: $|\beta_1| = 1$. Therefore, we only estimate $\beta_2$.$^{11}$
For each table, columns 1-2 show the number of agents and rankings; columns 3-4 show the bias of $\hat{\beta}_1$ and $\hat{\beta}_2$ obtained from the ROL model; columns 5-6 show the bias and RMSE of the ratio $\hat{\beta}_2/\hat{\beta}_1$ of the ROL estimators; columns 7-8 show the bias and RMSE of $b_N$; columns 9-10 show the bias and RMSE of $b_N^S$.
11 The estimate of the sign converges at a faster rate, so there is no need to analyze its finite-sample property.
Table 1 illustrates the efficiency loss of using semiparametric methods when the ROL model is correctly specified. The RMSE of the semiparametric estimators is larger than that of the parametric estimator, as expected. However, $b_N^S$ always has a smaller RMSE than $b_N$.
Table 2 illustrates the performance of the three estimators when the error terms are i.i.d. from a mixed normal distribution and homoskedasticity still holds. Because the ROL model is misspecified, the ROL estimator may be inconsistent. Columns 3-4 show the bias of the ROL estimators. This bias does not vanish when the sample size increases, which suggests that the ROL estimators are inconsistent. However, if our interest is in comparing the relative importance of the explanatory variables $x_{nj,1}$ and $x_{nj,2}$ in determining the utility of choosing alternative $j$, then it seems that the ROL performs well in terms of achieving very small bias and RMSE as shown in columns 5-6.
Table 3 illustrates the performance of the three estimators in the presence of heteroskedasticity. Because the ROL model is misspecified, the ROL estimators may be inconsistent. Columns 3-4 show the bias of the ROL estimators. This bias does not vanish when the sample size increases, which suggests inconsistency of the ROL estimators. Now, the ROL estimators cannot recover the relative importance of the factors that affect an agent's utility either, as shown in columns 5-6. The bias of the ratio of the parameters does not vanish when the sample size increases. However, the semiparametric estimators still perform well under heteroskedasticity. The RMSE of $b_N^S$ decreases faster than that of $b_N$ as the sample size increases.
In empirical work, heteroskedasticity might be relevant (e.g., the variance of the error term of students with high family income may be different from the variance of the error term of students with low family income in a college choice problem). As shown in Monte Carlo experiments, the MNL estimators might be inconsistent, and therefore analysis based on them could be misleading in the presence of heteroskedasticity. Results of Monte Carlo experiments suggest considering $b_N$ and $b_N^S$, which allow for a flexible form of heteroskedasticity, as alternatives in estimating the random utility model with rank-ordered data.
5 Illustrative application: preferences for bank services
This section provides an illustrative empirical application of the generalized maximum score (GMS) estimator
and the smoothed generalized maximum score (SGMS) estimator that we have proposed earlier. We apply

Table 1: Monte Carlo results of DGP 1

              ROL                                             b_N                   b_N^S
  N    M   Bias(β̂1)  Bias(β̂2)  Bias(β̂2/β̂1)  RMSE(β̂2/β̂1)   Bias(β̂2)  RMSE(β̂2)   Bias(β̂2)  RMSE(β̂2)   time
100    1    0.021     0.018      0.008         0.152        0.053     0.297      0.074     0.258     6.35s
100    2    0.012     0.014      0.007         0.107        0.033     0.222      0.055     0.170     11.15s
100    4    0.009     0.012      0.005         0.078        0.023     0.178      0.047     0.127     24.84s
500    1    0.006     0.003     -0.001         0.064        0.010     0.155      0.026     0.117     25.4s
500    2    0.001     0.001      0.001         0.047        0.010     0.125      0.025     0.085     61.9s
500    4    0         0.002      0.002         0.036        0.007     0.102      0.025     0.066     142.7s


Table 2: Monte Carlo results of DGP 2

              ROL                                             b_N                   b_N^S
  N    M   Bias(β̂1)  Bias(β̂2)  Bias(β̂2/β̂1)  RMSE(β̂2/β̂1)   Bias(β̂2)  RMSE(β̂2)   Bias(β̂2)  RMSE(β̂2)   time
100    1    0.200     0.201      0.010         0.138        0.034     0.223      0.057     0.180     8.5s
100    2    0.014     0.016      0.008         0.106        0.022     0.174      0.051     0.133     11s
100    4   -0.135    -0.133      0.007         0.092        0.017     0.152      0.050     0.115     25.5s
500    1    0.176     0.178      0.003         0.061        0.008     0.111      0.026     0.078     32s
500    2    0.004     0.006      0.003         0.048        0         0.093      0.022     0.060     83.9s
500    4   -0.141    -0.141      0.001         0.042        0.004     0.082      0.023     0.053     108.3s


Table 3: Monte Carlo results of DGP 3

              ROL                                             b_N                   b_N^S
  N    M   Bias(β̂1)  Bias(β̂2)  Bias(β̂2/β̂1)  RMSE(β̂2/β̂1)   Bias(β̂2)  RMSE(β̂2)   Bias(β̂2)  RMSE(β̂2)   time
100    1    0.490     0.710      0.159         0.227        0.013     0.160      0.046     0.134     8.19s
100    2    0.512     0.718      0.143         0.175        0.007     0.089      0.042     0.085     14.5s
100    4    0.539     0.723      0.124         0.143        0.002     0.050      0.040     0.064     25.2s
500    1    0.447     0.664      0.152         0.166        0.003     0.057      0.021     0.047     41.1s
500    2    0.486     0.685      0.135         0.142       -0.001     0.032      0.021     0.034     82.6s
500    4    0.512     0.701      0.120         0.124        0         0.019      0.021     0.028     137.6s

Size of t-test*: test of hypothesis β2 = 1 at the nominal 0.05 level.
Size of t-test**: test of hypothesis β2 = β1 at the nominal 0.05 level.

those methods to estimate a semiparametric rank-ordered choice model using the bank account choice data set that is distributed on the website of the software package Latent GOLD 5.0.$^{12}$ As in our Monte Carlo experiment, we compare the results with the maximum likelihood estimates of a popular parametric model, rank-ordered logit (ROL).
The data set includes authentic empirical observations that originate from the stated preference study of Kamakura et al. (1994). The sample includes $N = 256$ customers of a large bank in the United States. Every customer has faced a choice set of $J = 9$ hypothetical alternatives, each of which describes a checking account product with different characteristics. There are four alternative-specific characteristics in total as follows:
• minbal: minimum balance (in $'00s) required to exempt the customer from a monthly service fee (0, 5, or 10, i.e., $0, $500 or $1000)
• costpch: amount charged per check issue in $s (0, 0.15, or 0.35)
• fee: monthly service fee in $s (0, 3, or 6)
• atm: availability and cost of automatic teller machines (N/A; available at $0.75 per transaction; available for free). In our model specification, we include two alternative-specific constants atm(N/A) and atm($0.75), keeping available for free as the base category.
In addition to the alternative-specific characteristics, we incorporate one customer-specific characteristic into our illustrative specification.
• bal:$^{13}$ average monthly balance (in $'000s) kept in the customer's account during the past 6 months.
Each customer has ranked all 9 alternatives from most to least preferred. In line with our notation introduced in section 2, the depth of available rankings is $M = 8$. We apply each estimation method at four different depth levels. The starting level $M = 1$ provides the same amount of information as multinomial choice data. The next levels $M = 3$ and $M = 6$ mimic partial ranking data, where one observes up to the third and sixth preferences of each customer respectively. The final level $M = 8$ lets us analyze the data as fully rank-ordered choices, as they actually are. In a rank-ordered choice analysis, one may consider $N \times M$ as the effective sample size that takes into account the additional information from the use of deeper preferences. The data set thus allows us to investigate in an empirical setting how the performance of the semiparametric estimators varies across relatively small ($256 \times 1 = 256$) to large ($256 \times 8 = 2048$) samples.
12 The web address is: http://statisticalinnovations.com/technicalsupport/choice_datasets.html#bank
13 The average balance varies across 256 customers from $7 to $9999, with the mean of $1158.
The random utility model of interest is specified as:
$$u_{nj} = \alpha \times minbal_{nj} \times bal_n + x_{nj}'\beta + \varepsilon_{nj}, \qquad n = 1, 2, \cdots, 256; \; j = 1, 2, \cdots, 9$$
where $n$ and $j$ index customers and alternatives, respectively; $x_{nj} = [minbal_{nj} \;\; costpch_{nj} \;\; fee_{nj} \;\; atm(N/A)_{nj} \;\; atm(\$0.75)_{nj}]'$; $\alpha$ and $\beta$ are preference parameters; and $\varepsilon_{nj}$ is the stochastic component of utility. We center the variable $bal_n$ around its sample mean so that the estimated coefficient on $minbal_{nj}$ can be interpreted as the taste coefficient for someone with the sample mean average balance ($1158).
One may expect all coefficients in $\beta$ to be negative, as attributes $x_{nj}$ capture an increase in cost or inconvenience. The coefficient $\alpha$ on the other hand can be expected to be positive, since someone who usually keeps a larger balance is less likely to be sensitive to an increase in the minimum balance requirement. The ROL estimates at all depth levels conform to these expectations. In our GMS and SGMS applications, we normalize $\alpha$ to $1$ and estimate $\beta$.$^{14}$ As in the Monte Carlo experiment, the differential evolution algorithm is used to compute the GMS estimates, and a gradient-based algorithm with several initial starting points is used to compute the SGMS estimates.$^{15}$
Table 4 reports the estimates of coefficients $\alpha$ and $\beta$. All estimates have expected signs. The results suggest that there are potentially large gains from using deeper ranking information, in terms of statistical precision. All sets of point estimates change visibly between the $M = 1$ and $M = 3$ cases, or when the effective sample size increases from 256 to 768, but much less between the $M = 6$ and $M = 8$ cases, or when the effective sample size increases from 1536 to 2048 observations. The faster convergence rate of SGMS over GMS has also played out vividly. The SGMS estimates at $M = 3$ differ much less from the corresponding estimates at $M = 8$, in comparison with the GMS estimates at those two depth levels.
14 The interaction term $minbal_{nj} \times bal_n$ takes 477 distinct values across alternative-customer pairs.
15 The differential evolution algorithm configuration uses an amplification factor of 0.4, a cross-over probability of 0.8, and a population size of 200. The algorithm runs up to 2000 generations. At each depth level, the reported set of GMS estimates corresponds to the maximum that has been found from restarting the algorithm 200 times. At each depth level, the reported set of SGMS estimates corresponds to the maximum that has been found by experimenting with 201 starting points that include the ROL estimates and 200 sets of GMS estimates.
Table 5 transforms the results in Table 4 into equivalent prices, to facilitate the discussion of how substantive economic results vary across different sets of estimates. These equivalent prices have been computed by using the coefficient on $fee_{nj}$ to divide other coefficients, and measure the monthly fee equivalents of unit increases in relevant attributes. For instance, the SGMS equivalent price of costpch at $M = 8$ is 16.41, and indicates that raising the amount charged per check issue by $1 is equivalent to a $16.41 increase in the monthly fee. Such coefficient ratios feature as the primary parameters of interest in several empirical studies: see for example Layton (2000), Calfee et al. (2001), and Small et al. (2005).
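The equivalent-price calculation is a simple ratio of coefficients. The sketch below applies it to the GMS coefficients at $M = 8$ as read from Table 4; the dictionary layout and the function name `equivalent_prices` are illustrative choices, and the printed values are computed here rather than copied from Table 5.

```python
# GMS utility coefficients at M = 8, as read from Table 4.
gms_m8 = {
    "minbal": -5.69,
    "costpch": -56.45,
    "fee": -3.37,
    "atm($0.75)": -12.93,
    "atm(N/A)": -21.39,
}

def equivalent_prices(coefs, price_key="fee"):
    """Divide every other coefficient by the coefficient on the monthly fee."""
    return {name: value / coefs[price_key]
            for name, value in coefs.items() if name != price_key}

prices = equivalent_prices(gms_m8)
for name, value in prices.items():
    print(f"{name}: {value:.2f}")
```

Because both numerator and denominator are multiplied by the same unknown scale, these ratios do not depend on the scale normalization, which is why they can be compared across the ROL, GMS and SGMS estimates.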
The equivalent prices illustrate the advantage of the semiparametric methods over the popular parametric method. In an empirical application, one may expect the ROL estimates to be more sensitive across ranks than the semiparametric estimates. The former relies on a stringent distributional assumption, while the latter are consistent for a wide class of stochastic distributions. As the earlier Monte Carlo results suggest, the misspecification bias of the ROL coefficient ratios can vary with the depth of rankings used in the estimation. One may also expect the GMS estimates and SGMS estimates to be rather similar in a large sample, since their primary practical difference is in the convergence rates even though we need a few more technical assumptions to derive the asymptotic properties of the SGMS estimator. Indeed, Table 5 shows that the ROL estimates imply different equivalent prices at $M = 6$ and $M = 8$, despite the relatively large sample sizes. By contrast, the GMS estimates imply identical equivalent prices at $M = 6$ and $M = 8$, and show only a minimal difference from the SGMS estimates at those two depth levels. The faster convergence of the SGMS estimator becomes even more evident when the estimates are expressed as equivalent prices: apart from atm(N/A), the SGMS equivalent prices vary by less than $1 between the $M = 3$ and $M = 6$ cases, whereas the GMS equivalent prices vary by as much as $4.13.
Outside Monte Carlo settings, it is admittedly difficult to confirm whether the semiparametric estimates are indeed closer to the true parameters than the ROL estimates. In this particular application, however, we are able to provide further circumstantial evidence that it is likely to be the case. Specifically, note that the equivalent price of atm($0.75) measures the monthly fee equivalent of raising the ATM transaction cost from $0 to $0.75 per transaction. The data set includes a variable NATM that measures the number of ATM transactions that each customer makes per month, even though we have not incorporated it into our model specification. Since the sample mean of this variable is 5.25 transactions, $0.75 × 5.25 = $3.94 would be a reasonable guess about the equivalent price of atm($0.75). Both GMS and SGMS estimates are close to this amount at $M = 6$ and $M = 8$, whereas the ROL estimates differ by more than $2 at $M = 6$ and $1 at $M = 8$.
We also note that in the fully ranked case of $M = 8$, the GMS estimator and the SGMS estimator admit several random coefficient models that allow their equivalent prices to be interpreted as the population average equivalent prices.
6 Conclusions
The parametric methods for rank-ordered choices are as well established as the parametric methods for multinomial choices. The semiparametric methods for rank-ordered choices are much less developed than the semiparametric methods for multinomial choices. To the best of our knowledge, the study of Hausman and Ruud (1987) is the earliest and the only semiparametric analysis of rank-ordered choices.
We propose two types of semiparametric estimators for rank-ordered choices. We call them the generalized maximum score (GMS) estimator and the smoothed generalized maximum score (SGMS) estimator, respectively. The term semiparametric refers to the fact that the preference parameters of interest are finite dimensional but the error term in the random utility function has an unspecified distribution. Both types of estimators admit very flexible forms of interpersonal heteroskedasticity, and a wide class of stochastic distributions including those that motivate most popular parametric models. We show the strong consistency of the proposed GMS estimator. The asymptotic distribution of the GMS estimator is nonstandard. To facilitate inference, we propose the SGMS estimator. We show that the SGMS estimator is strongly consistent and asymptotically normal, making inference straightforward. The preliminary results from Monte Carlo experiments and an illustrative empirical application show that the estimators have promising finite-sample properties, and can be very useful additions to the empirical practitioner's toolbox.
Rank-ordered choices allow one to gather considerably more preference information from a given sample
of agents than multinomial choices. The practical gains from this extra information can be quite substantial.
Train and Winston (2007), for example, exploit data from a market survey that asks for the consumer's

Table 4: Utility coefficients: the rank-ordered logit (ROL), generalized maximum score (GMS) and smoothed GMS (SGMS) estimators

ROL                 M = 1      M = 3      M = 6      M = 8
minbal × bal         0.04       0.07       0.04       0.04
minbal              -0.33      -0.28      -0.22      -0.22
costpch            -15.74      -4.61      -2.22      -2.49
fee                 -0.53      -0.20      -0.16      -0.18
atm($0.75)          -0.75      -0.60      -0.98      -0.88
atm(N/A)            -0.73      -1.22      -1.31      -1.17
log-likelihood     -303.3      -1051      -2170      -2566

GMS                 M = 1      M = 3      M = 6      M = 8
minbal × bal         1.00       1.00       1.00       1.00
minbal              -7.11     -16.76      -5.69      -5.69
costpch           -345.53    -295.46     -56.45     -56.45
fee                -11.93     -21.24      -3.37      -3.37
atm($0.75)         -72.29    -169.23     -12.93     -12.93
atm(N/A)           -69.26    -207.16     -21.39     -21.39
score                7.25      18.22       26.7      28.82

SGMS                M = 1      M = 3      M = 6      M = 8
minbal × bal         1.00       1.00       1.00       1.00
minbal              -7.13      -5.25      -6.17      -6.13
costpch           -350.72     -99.09     -58.73     -58.89
fee                -11.97      -5.77      -3.59      -3.59
atm($0.75)         -71.32     -18.61     -14.30     -14.04
atm(N/A)           -68.50     -27.92     -23.22     -22.96
score                7.25      18.07      26.65      28.77

Table 5: Equivalent prices: the rank-ordered logit (ROL), generalized maximum score (GMS) and smoothed GMS (SGMS) estimators

ROL                 M = 1      M = 3      M = 6      M = 8
minbal × bal        -0.08      -0.34      -0.27      -0.24
minbal               0.62       1.45       1.40       1.26
costpch             29.48      23.59      14.02      14.23
atm($0.75)           1.40       3.05       6.18       5.00
atm(N/A)             1.38       6.24       8.27       6.70
log-likelihood     -303.3      -1051      -2170      -2566

GMS                 M = 1      M = 3      M = 6      M = 8
minbal × bal        -0.08      -0.05      -0.30      -0.30
minbal               0.60       0.79       1.69       1.69
costpch             28.97      13.91      16.77      16.77
atm($0.75)           6.06       7.97       3.84       3.84
atm(N/A)             5.81       9.75       6.36       6.35
score                7.25      18.22       26.7      28.82

SGMS                M = 1      M = 3      M = 6      M = 8
minbal × bal        -0.08      -0.17      -0.28      -0.28
minbal               0.60       0.91       1.72       1.71
costpch             29.30      17.17      16.36      16.41
atm($0.75)           5.96       3.22       3.98       3.91
atm(N/A)             5.72       4.84       6.47       6.40
score                7.25      18.07      26.65      28.77

most recent vehicle purchase and preference ordering of other vehicles that the consumer has considered for buying. They find that the use of rank-ordered choices is essential for estimating flexible parametric models. Scarpa et al. (2011) administer a non-market valuation survey eliciting mountain visitors' preferences for land management plans. In the context where obtaining a large on-site sample of the target population is difficult, they find that the use of rank-ordered choices reduces their sample size requirements up to a third of comparable multinomial choice data. At the same time, however, the parametric analysis of rank-ordered choices can be expected to be more sensitive to stochastic misspecification than that of multinomial choices, since one needs to rely on maintained stochastic assumptions more heavily to derive likelihood functions. The simulation study of Yan and Yoo (2014), for example, finds that an empirical regularity taken as the classic symptom of poor-quality rank-ordered responses (Chapman and Staelin, 1982; Ben-Akiva et al., 1992) may simply result from estimating the rank-ordered logit model under misspecification. Since the sampling cost, in terms of both time and money, is a major consideration for any project aiming at estimating individual preferences using choice data, it is important to develop robust estimation methods that utilize ranking information, as we propose.

A Proof of Theorem 1

In Appendix A, we provide the proofs of Theorem 1 and of Lemmas 1-3. Lemma 1 establishes the identification condition. Lemma 2 verifies the continuity property of the probability limit of the GMS estimator's objective function $Q_N(b)$. Lemma 3 shows the uniform convergence of $Q_N(b)$ to its probability limit. Throughout, for $b \in \mathbb{R}^q$, let
\[
Q^*(b) \equiv E\left\{ \sum_{1 \le j < k \le J} \left[ 1(r_j < r_k) \cdot 1(x_j'b \ge x_k'b) + 1(r_k < r_j) \cdot 1(x_k'b > x_j'b) \right] \right\} \tag{A1}
\]
denote the probability limit of $Q_N(b)$.
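As an illustration of the object whose limit is given in (A1), the following sketch evaluates a sample analogue of the pairwise score on simulated rankings. It is a minimal sketch under stated assumptions, not the estimation routine used in the paper: the function name `gms_objective`, the array layout, the data-generating process (normal attributes, normal errors with an individual-specific scale) and the value of `beta` are all hypothetical choices made for the example.

```python
import numpy as np

def gms_objective(b, X, ranks):
    """Sample analogue of (A1), averaged over individuals.

    X     : array (N, J, q) of alternative attributes
    ranks : array (N, J); a smaller value means a more preferred alternative
    """
    v = X @ b  # systematic utilities x_j'b, shape (N, J)
    n_obs, n_alt = ranks.shape
    total = 0
    for j in range(n_alt):
        for k in range(j + 1, n_alt):
            j_first = ranks[:, j] < ranks[:, k]
            k_first = ranks[:, k] < ranks[:, j]
            # Count pairs whose observed order agrees with the predicted order.
            total += np.sum(j_first & (v[:, j] >= v[:, k]))
            total += np.sum(k_first & (v[:, k] > v[:, j]))
    return total / n_obs

# Hypothetical simulated data with heteroskedasticity across individuals.
rng = np.random.default_rng(0)
n_obs, n_alt = 2000, 4
beta = np.array([1.0, -0.5])  # first coefficient normalized to 1
X = rng.normal(size=(n_obs, n_alt, beta.size))
scale = np.exp(0.5 * rng.normal(size=(n_obs, 1)))  # individual error scale
utility = X @ beta + scale * rng.normal(size=(n_obs, n_alt))
ranks = np.argsort(np.argsort(-utility, axis=1), axis=1)  # 0 = most preferred

q_true = gms_objective(beta, X, ranks)
q_flip = gms_objective(-beta, X, ranks)
print(q_true, q_flip)
```

With continuous attributes there are no ties, so every pair is counted either at `beta` or at `-beta`; the two scores therefore sum to the number of pairs, J(J-1)/2, and the score at the data-generating `beta` should be the larger of the two.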
Lemma 1. Under Assumptions 2-4, the true preference parameter vector $\beta$ uniquely maximizes $Q^*(b)$ for $b \in B$.

Proof. Applying the law of iterated expectations to the right-hand side of (A1) yields
\[
\begin{aligned}
Q^*(b) &= E\left\{ \sum_{1 \le j < k \le J} \left[ P(r_j < r_k \mid X) \cdot 1(x_{jk}'b \ge 0) + P(r_k < r_j \mid X) \cdot 1(x_{kj}'b > 0) \right] \right\} \\
&= \sum_{1 \le j < k \le J} E\left\{ \left[ P(r_j < r_k \mid X) - P(r_k < r_j \mid X) \right] \cdot 1(x_{jk}'b \ge 0) + P(r_k < r_j \mid X) \right\}.
\end{aligned}
\]
It follows from Assumption 3 that $\beta$ globally maximizes $Q^*(b)$ over $b \in B$, because the sign of $P(r_j < r_k \mid X) - P(r_k < r_j \mid X)$ is the same as the sign of $x_{jk}'\beta$. Next, we show that $\beta$ is a unique global maximizer of $Q^*(b)$. Consider a different parameter vector $\beta^- \in B$. If, for values of $X$ with positive probability, $\beta$ and $\beta^-$ yield different rankings of systematic utilities, then $\beta^-$ will not maximize $Q^*(b)$. In other words, if we observe that $x_{jk}'\beta$ and $x_{jk}'\beta^-$ have opposite signs for some pair of alternatives $j, k \in J$ with positive probability, then we can conclude $Q^*(\beta) > Q^*(\beta^-)$. We will show this argument for $\beta_1 = 1$; the argument for $\beta_1 = -1$ is similar. If $\beta_1^- = 1$, the set of points where $\beta$ and $\beta^-$ yield different rankings of systematic utilities is
\[
\begin{aligned}
D(\beta, \beta^-) &= \{ X \mid x_{jk}'\beta < 0 < x_{jk}'\beta^- \text{ for some } j, k \in J \} \\
&= \{ X \mid \tilde{x}_{jk}'\tilde{\beta} < -x_{jk,1} < \tilde{x}_{jk}'\tilde{\beta}^- \text{ for some } j, k \in J \}.
\end{aligned}
\]
By Assumption 4(a), the probability of $D(\beta, \beta^-)$ equals zero if and only if $\tilde{x}_{jk}'\tilde{\beta} = \tilde{x}_{jk}'\tilde{\beta}^-$ with probability one for any pair of alternatives $j, k \in J$, which implies that $X\beta = X\beta^-$ with probability one. This contradicts Assumption 4(b). If $\beta_1^- = -1$, the set of points where $\beta$ and $\beta^-$ give different predictions is
\[
D(\beta, \beta^-) = \{ X \mid x_{jk,1} < \min(\tilde{x}_{jk}'\tilde{\beta}^-, -\tilde{x}_{jk}'\tilde{\beta}) \text{ for some } j, k \in J \}.
\]
The probability of $D(\beta, \beta^-)$ is positive by Assumption 4(a). Thus, we have proved that the true preference parameter vector $\beta$ uniquely maximizes $Q^*(b)$ for $b \in B$.

Lemma 2. Under Assumptions 2 and 4, $Q^*(b)$ is continuous in $b \in B$.

Proof. For any pair of alternatives $j, k \in J$, define
\[
Q^*_{jk}(b) \equiv E\left\{ \left[ P(r_j < r_k \mid X) - P(r_k < r_j \mid X) \right] \cdot 1(x_{jk}'b \ge 0) + P(r_k < r_j \mid X) \right\}.
\]
Assume that $b_1 = 1$. The argument for $b_1 = -1$ is symmetric. Calculate
\[
Q^*_{jk}(b) = \int \left\{ \int_{-\tilde{x}_{jk}'\tilde{b}}^{\infty} \left[ P(r_j < r_k \mid X) - P(r_k < r_j \mid X) \right] g_{jk}(x_{jk,1} \mid \tilde{x}_{jk}) \, dx_{jk,1} \right\} dP(\tilde{x}_{jk}) + P(r_k < r_j), \tag{A2}
\]
where $P(\tilde{x}_{jk})$ denotes the cumulative distribution function of $\tilde{x}_{jk}$. The inner integral in curly brackets on the right-hand side of (A2) is a function of $\tilde{x}_{jk}$ and $b$ that is continuous in $b \in B$. Therefore,
\[
Q^*(b) = \sum_{1 \le j < k \le J} Q^*_{jk}(b)
\]
is also continuous in $b \in B$.

Lemma 3. Under Assumptions 1-2 and 4, $Q_N(b)$ converges almost surely to $Q^*(b)$ uniformly over $b \in B$.

Proof. For any pair of alternatives $j, k \in J$, define
\[
Q_{Njk}(b) \equiv N^{-1} \sum_{n=1}^{N} \left\{ \left[ 1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj}) \right] \cdot 1(x_{njk}'b \ge 0) + 1(r_{nk} < r_{nj}) \right\},
\]
and $Q^*_{jk}(b) \equiv E[Q_{Njk}(b)]$. Thus, we have
\[
Q_N(b) = \sum_{1 \le j < k \le J} Q_{Njk}(b) \quad \text{and} \quad Q^*(b) = \sum_{1 \le j < k \le J} Q^*_{jk}(b).
\]
By Lemma 4 of Manski (1985), with probability one,
\[
\lim_{N \to \infty} \sup_{b \in \mathbb{R}^q} \left| Q_{Njk}(b) - Q^*_{jk}(b) \right| = 0.
\]
Because $Q_N(b)$ is the sum of a finite number of terms $Q_{Njk}(b)$, $Q_N(b)$ converges almost surely to $Q^*(b)$ uniformly over $b \in B$.

Proof. (THEOREM 1) The proof of strong consistency involves verifying the conditions of Theorem 2.1 in Newey and McFadden (1994):
(1) $Q^*(b)$ is uniquely maximized at $\beta$;
(2) The parameter space $B$ is compact;
(3) $Q^*(b)$ is continuous in $b$; and
(4) $Q_N(b)$ converges almost surely to its probability limit $Q^*(b)$ uniformly over $b \in B$.
Conditions (1), (3), and (4) are verified by Lemmas 1, 2, and 3, respectively. Condition (2) is guaranteed by Assumption 2. Therefore, the GMS estimator that maximizes $Q_N(b)$ converges to $\beta$ almost surely under Assumptions 1-4.

B Proof of Theorems 2-4

In Appendix B, we provide the proofs of Theorems 2-4 and of Lemmas 4-9. Lemma 4 establishes the uniform convergence of the SGMS objective function to its probability limit. Lemmas 5-6 establish the asymptotic distribution of the normalized forms of $t_N(\beta, h_N)$. Lemmas 7-9 justify that $H_N(b^*_N, h_N)$ converges to a nonstochastic matrix in probability. By applying a Taylor series expansion, Lemmas 5-7 can be used to derive the asymptotic distribution of the centered, properly normalized SGMS estimator for the random utility model.

Lemma 4. Under Assumptions 1-4 and Condition 1, $Q^S_N(b, h_N)$ converges almost surely to $Q^*(b)$ uniformly over $b \in B$.

Proof. First, we show that $Q^S_N(b, h_N)$ converges almost surely to $Q_N(b)$ uniformly over $b \in B$, following the method in Lemma 4 of Horowitz (1992). We calculate
\[
\left| Q^S_N(b, h_N) - Q_N(b) \right| \le \frac{1}{N} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left| 1(x_{njk}'b > 0) - K(x_{njk}'b / h_N) \right|. \tag{B1}
\]
The right-hand side of (B1) is the sum of $c_{N1}(\eta)$ and $c_{N2}(\eta)$, where
\[
c_{N1}(\eta) \equiv \frac{1}{N} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left| 1(x_{njk}'b > 0) - K(x_{njk}'b / h_N) \right| \cdot 1(|x_{njk}'b| \ge \eta),
\]
\[
c_{N2}(\eta) \equiv \frac{1}{N} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left| 1(x_{njk}'b > 0) - K(x_{njk}'b / h_N) \right| \cdot 1(|x_{njk}'b| < \eta),
\]
and $\eta \in \mathbb{R}$ is a positive number. Condition 1(b) implies that for any $\delta > 0$, there exists $c > 0$ such that $|K(x) - 1| < \delta \cdot J^{-2}$ and $|K(-x)| < \delta \cdot J^{-2}$ for any $x > c$. As $h_N \to 0$, there exists $N_0 \in \mathbb{N}$ such that $\eta / h_N > c$ for any $N > N_0$. Therefore, $c_{N1}(\eta) < \delta$ for any $N > N_0$. We have shown that for each $\eta > 0$, $c_{N1}(\eta) \to 0$ uniformly over $b \in B$ as $N \to \infty$. Next consider $c_{N2}(\eta)$. By Condition 1(a), there is a finite $C$ such that
\[
c_{N2}(\eta) \le \sum_{1 \le j < k \le J} \left[ \frac{C}{N} \sum_{n=1}^{N} 1(|x_{njk}'b| < \eta) \right]. \tag{B2}
\]
Horowitz (1992) shows that the bracketed part of the right-hand side of (B2) converges to 0 uniformly over $b \in B$. Because $J$ is finite, $c_{N2}(\eta)$ also converges to 0 uniformly over $b \in B$ as $N \to \infty$. The right-hand side of (B1) therefore converges to 0 uniformly over $b \in B$. Finally, $Q^S_N(b, h_N)$ converges almost surely to $Q^*(b)$ uniformly over $b \in B$ because
\[
\begin{aligned}
\sup_{b \in B} \left| Q^S_N(b, h_N) - Q^*(b) \right| &\le \sup_{b \in B} \left\{ \left| Q^S_N(b, h_N) - Q_N(b) \right| + \left| Q_N(b) - Q^*(b) \right| \right\} \\
&\le \sup_{b \in B} \left| Q^S_N(b, h_N) - Q_N(b) \right| + \sup_{b \in B} \left| Q_N(b) - Q^*(b) \right|,
\end{aligned} \tag{B3}
\]
and we have proved that the right-hand side of (B3) converges to 0 almost surely.

Proof. (THEOREM 2) The proof of strong consistency involves verifying the conditions of Theorem 2.1 in Newey and McFadden (1994):
(1) $Q^*(b)$ is uniquely maximized at $\beta$;
(2) The parameter space $B$ is compact;
(3) $Q^*(b)$ is continuous in $b$; and
(4) $Q^S_N(b, h_N)$ converges uniformly almost surely to its probability limit $Q^*(b)$.
Conditions (1), (3), and (4) are verified by Lemmas 1, 2, and 4, respectively. Condition (2) is guaranteed by Assumption 2. Therefore, the SGMS estimator that maximizes $Q^S_N(b, h_N)$ converges to $\beta$ almost surely under Assumptions 1-4 and Condition 1.
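The smoothing argument in the proof of Lemma 4 rests on the gap between the indicator $1(z > 0)$ and the smooth function $K(z / h_N)$ shrinking as the bandwidth goes to zero. The sketch below illustrates this for a single index difference. It is illustrative only: the logistic CDF is an assumed stand-in for $K$ (any function satisfying Condition 1 could be substituted), and the simulated index values `z`, the function names and the bandwidth grid are hypothetical.

```python
import numpy as np

def logistic_cdf(u):
    # Numerically stable logistic CDF, used here as an assumed smooth K(.).
    return 0.5 * (1.0 + np.tanh(0.5 * u))

def smoothing_gap(z, h):
    """Average |1(z > 0) - K(z / h)|: one pair's contribution to the bound in (B1)."""
    return float(np.mean(np.abs((z > 0).astype(float) - logistic_cdf(z / h))))

rng = np.random.default_rng(1)
z = rng.normal(size=5000)            # hypothetical index differences x_jk'b
bandwidths = [1.0, 0.3, 0.1, 0.03]
gaps = [smoothing_gap(z, h) for h in bandwidths]

for h, g in zip(bandwidths, gaps):
    print(f"h = {h:<5} average gap = {g:.4f}")
```

The gap is driven by observations with index values close to zero, which is why the proof treats the regions $|x_{njk}'b| \ge \eta$ and $|x_{njk}'b| < \eta$ separately.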
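
Lemma 5. Let Assumptions 1-4, 6-7 and Conditions 1-2 hold. Then
(a) $\lim_{N \to \infty} E\left[ h_N^{-d} t_N(\beta, h_N) \right] = a$; and
(b) $\lim_{N \to \infty} Var\left[ (N h_N)^{1/2} t_N(\beta, h_N) \right] = \Omega$.

Proof. First, under Assumption 1 we calculate that
\[
E\left[ h_N^{-d} t_N(\beta, h_N) \right] = \sum_{1 \le j < k \le J} E\left\{ \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] K'(x_{jk}'\beta / h_N) \, \tilde{x}_{jk} \, h_N^{-d-1} \right\} = \sum_{1 \le j < k \le J} d_{jk}, \tag{B4}
\]
where
\[
d_{jk} \equiv E\left\{ \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] K'(x_{jk}'\beta / h_N) \, \tilde{x}_{jk} \, h_N^{-d-1} \right\}. \tag{B5}
\]
By the law of iterated expectations,
\[
\begin{aligned}
d_{jk} &= E\left\{ \left[ P(r_j < r_k \mid v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) - P(r_k < r_j \mid v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \right] K'(-v_{-j,k} / h_N) \, \tilde{x}_{jk} \, h_N^{-d-1} \right\} \\
&= E\left\{ \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, K'(-v_{-j,k} / h_N) \, \tilde{x}_{jk} \, h_N^{-d-1} \right\}.
\end{aligned} \tag{B6}
\]
Application of the Taylor series expansion for the right-hand side of (B6) around $v_{-j,k} = 0$ yields
\[
\bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) = \bar{F}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) + \sum_{i=1}^{d-1} \frac{1}{i!} \bar{F}^{(i)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^i + \frac{1}{d!} \bar{F}^{(d)}_{jk}(\xi, \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^d, \tag{B7}
\]
where $\xi$ is between 0 and $v_{-j,k}$. By Assumption 3, $\bar{F}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) = 0$. Taylor expansion of $p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})$ around $v_{-j,k} = 0$ yields
\[
p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) = \sum_{c=0}^{d-i-1} \frac{1}{c!} p^{(c)}_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^c + \frac{1}{(d-i)!} p^{(d-i)}_{jk}(\xi_i \mid \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^{d-i}, \tag{B8}
\]
where $\xi_i$ is between 0 and $v_{-j,k}$. Combining (B7) and (B8) yields
\[
\begin{aligned}
\bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})
&= \sum_{i=1}^{d-1} \frac{1}{i!(d-i)!} \bar{F}^{(i)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p^{(d-i)}_{jk}(\xi_i \mid \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^d \\
&\quad + \frac{1}{d!} \bar{F}^{(d)}_{jk}(\xi, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^d \\
&\quad + \sum_{i=1}^{d-1} \sum_{c=0}^{d-i-1} \frac{1}{i! \, c!} \bar{F}^{(i)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p^{(c)}_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^{i+c}
\end{aligned} \tag{B9}
\]
whenever the derivatives exist. Assumptions 6-7 imply that all of the derivatives in the right-hand side of (B9) exist and are uniformly bounded for almost every $(\tilde{v}_{-j,k}, \tilde{X})$ if $|v_{-j,k}| \le \eta$ for some $\eta > 0$. Let $\zeta_{jk} = -v_{-j,k} / h_N$. Decompose $d_{jk}$ into two parts: $d_{jk} \equiv d_{jk1} + d_{jk2}$, where
\[
d_{jk1} = h_N^{-d} \int_{|h_N \zeta_{jk}| > \eta} \bar{F}_{jk}(-\zeta_{jk} h_N, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(-\zeta_{jk} h_N \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} K'(\zeta_{jk}) \, d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}), \tag{B10}
\]
and
\[
d_{jk2} = h_N^{-d} \int_{|h_N \zeta_{jk}| \le \eta} \bar{F}_{jk}(-\zeta_{jk} h_N, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(-\zeta_{jk} h_N \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} K'(\zeta_{jk}) \, d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}). \tag{B11}
\]
Under Assumption 7 and Condition 2,
\[
|d_{jk1}| \le C h_N^{-d} \int_{|h_N \zeta_{jk}| > \eta} |\tilde{x}_{jk}| \cdot |K'(\zeta_{jk})| \, d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}) \to 0,
\]
where $|d_{jk1}|$ denotes the vector of the absolute values of $d_{jk1}$. Plugging (B9) into (B11) and using the assumption that $K'(\cdot)$ is a $d$th order kernel yield the result that, as $N \to \infty$,
\[
d_{jk2} \to k_d \sum_{i=1}^{d} \frac{1}{i!(d-i)!} E\left[ \bar{F}^{(i)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p^{(d-i)}_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \right] \tag{B12}
\]
by Lebesgue's dominated convergence theorem. Therefore, by (B4) we have proved part (a).

Next consider $Var[(N h_N)^{1/2} t_N(\beta, h_N)]$. By Assumption 1,
\[
Var\left[ (N h_N)^{1/2} t_N(\beta, h_N) \right] = h_N Var\left\{ \sum_{1 \le j < k \le J} \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] K'(x_{jk}'\beta / h_N) \, \tilde{x}_{jk} \, h_N^{-1} \right\}.
\]
Denote
\[
e_N \equiv \sum_{1 \le j < k \le J} \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] K'(x_{jk}'\beta / h_N) \, \tilde{x}_{jk} \, h_N^{-1}, \tag{B13}
\]
then
\[
Var\left[ (N h_N)^{1/2} t_N(\beta, h_N) \right] = h_N E(e_N e_N') - h_N E(e_N) E(e_N'). \tag{B14}
\]
In part (a), we show that $E[h_N^{-d} e_N] = O(1)$, so $h_N E(e_N) E(e_N') = o(1)$. Since the binary choice setting $J = 2$ has been discussed in Horowitz (1992), the following discussion focuses on the case where $J \ge 3$. Define
\[
h_N E(e_N e_N') \equiv L_{N1} + L_{N2}, \tag{B15}
\]
where
\[
\begin{aligned}
L_{N1} = \sum_{1 \le j < k < l \le J} 2 h_N^{-1} E \Big\{ & \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] \left[ 1(r_j < r_l) - 1(r_l < r_j) \right] K'(x_{jk}'\beta / h_N) K'(x_{jl}'\beta / h_N) \, \tilde{x}_{jk} \tilde{x}_{jl}' \\
& + \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] \left[ 1(r_k < r_l) - 1(r_l < r_k) \right] K'(x_{jk}'\beta / h_N) K'(x_{kl}'\beta / h_N) \, \tilde{x}_{jk} \tilde{x}_{kl}' \\
& + \left[ 1(r_j < r_l) - 1(r_l < r_j) \right] \left[ 1(r_k < r_l) - 1(r_l < r_k) \right] K'(x_{jl}'\beta / h_N) K'(x_{kl}'\beta / h_N) \, \tilde{x}_{jl} \tilde{x}_{kl}' \Big\},
\end{aligned} \tag{B16}
\]
and
\[
L_{N2} = \sum_{1 \le j < k \le J} h_N^{-1} E\left\{ \left[ 1(r_j < r_k) - 1(r_k < r_j) \right]^2 \left[ K'(x_{jk}'\beta / h_N) \right]^2 \tilde{x}_{jk} \tilde{x}_{jk}' \right\}. \tag{B17}
\]
Let $\zeta_{jk} = -v_{-j,k} / h_N$ for any pair of alternatives $j, k \in J$. By the law of iterated expectations,
\[
L_{N2} = \sum_{1 \le j < k \le J} \int \left[ F_{jk}(-h_N \zeta_{jk}, \tilde{v}_{-j,k}, \tilde{X}) + F_{kj}(h_N \zeta_{jk}, \tilde{v}_{-j,k} + h_N \zeta_{jk} \iota_{J-2}, \tilde{X}) \right] \tilde{x}_{jk} \tilde{x}_{jk}' \, p_{jk}(-h_N \zeta_{jk} \mid \tilde{v}_{-j,k}, \tilde{X}) \left[ K'(\zeta_{jk}) \right]^2 d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}). \tag{B18}
\]
By Assumptions 3, 6, 7 and Lebesgue's dominated convergence theorem, $L_{N2} \to \Omega$ when $N \to \infty$. By Assumption 7,
\[
\begin{aligned}
|L_{N1}| \le \sum_{1 \le j < k < l \le J} 2 C h_N \Big[ & \int |K'(\zeta_{jk}) K'(\zeta_{jl}) \, \tilde{x}_{jk} \tilde{x}_{jl}'| \, d\zeta_{jk} \, d\zeta_{jl} \, dP(\tilde{v}_{-j,kl}, \tilde{X}) \\
& + \int |K'(\zeta_{jk}) K'(\zeta_{kl}) \, \tilde{x}_{jk} \tilde{x}_{kl}'| \, d\zeta_{kj} \, d\zeta_{kl} \, dP(\tilde{v}_{-k,jl}, \tilde{X}) \\
& + \int |K'(\zeta_{jl}) K'(\zeta_{kl}) \, \tilde{x}_{jl} \tilde{x}_{kl}'| \, d\zeta_{lj} \, d\zeta_{lk} \, dP(\tilde{v}_{-l,jk}, \tilde{X}) \Big].
\end{aligned} \tag{B19}
\]
Thus, by Assumptions 6-7, $L_{N1} \to 0$ when $N \to \infty$. We have proved part (b).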

Lemma 6. Let Assumptions 1-4, 6-7 and Conditions 1-2 hold. Then
(a) If $N h_N^{2d+1} \to \infty$ as $N \to \infty$, then $h_N^{-d} t_N(\beta, h_N) \overset{p}{\to} a$.
(b) If $N h_N^{2d+1} \to \lambda$, where $\lambda \in (0, \infty)$, as $N \to \infty$, then $(N h_N)^{1/2} t_N(\beta, h_N) \overset{d}{\to} MVN(\lambda^{1/2} a, \Omega)$.

Proof. If $N h_N^{2d+1} \to \infty$ as $N \to \infty$, then
\[
Var\left[ h_N^{-d} t_N(\beta, h_N) \right] = \frac{1}{N h_N^{2d+1}} Var\left[ (N h_N)^{1/2} t_N(\beta, h_N) \right] \to 0
\]
by Lemma 5(b). Lemma 5 together with Chebyshev's Theorem implies Lemma 6(a).

Next consider part (b). Define
\[
w_N = (N h_N)^{1/2} \left\{ t_N(\beta, h_N) - E[t_N(\beta, h_N)] \right\}.
\]
Lemma 5(a) implies that
\[
(N h_N)^{1/2} E[t_N(\beta, h_N)] = (N h_N^{2d+1})^{1/2} E\left[ h_N^{-d} t_N(\beta, h_N) \right] \to \lambda^{1/2} a,
\]
so it suffices to prove that $\gamma' w_N$ is asymptotically distributed as $N(0, \gamma'\Omega\gamma)$ for any nonstochastic $(q - 1) \times 1$ vector $\gamma$ such that $\gamma'\gamma = 1$. Write
\[
t_N(\beta, h_N) \equiv N^{-1} \sum_{n=1}^{N} t_{Nn},
\]
where
\[
t_{Nn} \equiv t_{Nn}(\beta, h_N) = \sum_{1 \le j < k \le J} \left[ 1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj}) \right] K'(x_{njk}'\beta / h_N) \, \tilde{x}_{njk} \, h_N^{-1}.
\]
So we have
\[
\gamma' w_N = (h_N / N)^{1/2} \gamma' \sum_{n=1}^{N} \left[ t_{Nn} - E(t_{Nn}) \right].
\]
Let $CF_N(\cdot)$ denote the characteristic function of $\gamma' w_N$. Then by Assumption 1, $CF_N(\tau) = [\Psi_N(\tau)]^N$, where
\[
\Psi_N(\tau) = E \exp\left\{ i \tau (h_N / N)^{1/2} \gamma' \left[ t_{Nn} - E(t_{Nn}) \right] \right\},
\]
and $n \le N$ is arbitrary. $\Psi_N(0) = 1$ and $\Psi_N'(0) = 0$ by construction. We can calculate the second order derivative of $\Psi_N(\tau)$ as
\[
\Psi_N''(\tau) = E\left\{ -h_N N^{-1} \Psi_N(\tau) \, \gamma' \left[ t_{Nn} - E(t_{Nn}) \right] \left[ t_{Nn} - E(t_{Nn}) \right]' \gamma \right\}.
\]
For each $N$,
\[
\lim_{\tau \to 0} \Psi_N''(\tau) = -N^{-1} \gamma' \left[ h_N Var(t_{Nn}) \right] \gamma = -N^{-1} \gamma' Var\left[ (N h_N)^{1/2} t_N(\beta, h_N) \right] \gamma.
\]
Lemma 5(b) implies that
\[
\lim_{\tau \to 0} \Psi_N''(\tau) = -N^{-1} \left[ \gamma'\Omega\gamma + o(1) \right].
\]
Therefore, a Taylor series expansion of $\Psi_N(\tau)$ around $\tau = 0$ yields
\[
\Psi_N(\tau) = 1 - \gamma'\Omega\gamma (\tau^2 / 2N) + o(\tau^2 / N)
\]
and
\[
CF_N(\tau) = \left[ 1 - \gamma'\Omega\gamma (\tau^2 / 2N) + o(\tau^2 / N) \right]^N.
\]
Applying the methods used in the proof of the Lindeberg-Levy central limit theorem yields the result that
\[
\lim_{N \to \infty} CF_N(\tau) = \exp(-\gamma'\Omega\gamma \tau^2 / 2),
\]
which is the same as the characteristic function of $N(0, \gamma'\Omega\gamma)$.
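The last step of the proof relies on the elementary limit $(1 - c\tau^2 / 2N)^N \to \exp(-c\tau^2 / 2)$. The short numerical sketch below illustrates it, ignoring the $o(\tau^2 / N)$ remainder; the scalar `c` stands in for $\gamma'\Omega\gamma$ and both it and `tau` are arbitrary illustrative values.

```python
import numpy as np

c = 2.0      # hypothetical value standing in for gamma' Omega gamma
tau = 1.5    # point at which the characteristic function is evaluated
target = np.exp(-c * tau**2 / 2)   # characteristic function of N(0, c) at tau

sample_sizes = (10, 100, 1000, 10000)
errors = [abs((1 - c * tau**2 / (2 * n)) ** n - target) for n in sample_sizes]

for n, err in zip(sample_sizes, errors):
    print(f"N = {n:<6} |CF_N(tau) - limit| = {err:.2e}")
```

The approximation error shrinks roughly in proportion to 1/N for a fixed `tau`.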
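
Lemma 7. Let Assumptions 1-4, 6-8 and Conditions 1-2 hold. For any pair of alternatives $j, k \in J$, assume that $\|\tilde{x}_{jk}\| \le c$ for some $c > 0$. Let $\eta$ be some positive real number such that $p^{(1)}_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})$, $\bar{F}^{(1)}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})$, and $\bar{F}^{(2)}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})$ exist and are uniformly bounded for almost every $(\tilde{v}_{-j,k}, \tilde{X})$ if $|v_{-j,k}| \le \eta$. For $\theta \in \mathbb{R}^{q-1}$, define
\[
t^*_N(\theta) = (N h_N^2)^{-1} \sum_{n=1}^{N} \sum_{1 \le j < k \le J} \left[ 1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj}) \right] K'(x_{njk}'\beta / h_N + \tilde{x}_{njk}'\theta) \, \tilde{x}_{njk}.
\]
Define the sets $\Theta_N$, $N = 1, 2, \ldots$, by $\Theta_N = \{ \theta : \theta \in \mathbb{R}^{q-1}, h_N \|\theta\| \le \eta / 2c \}$.
(a) Then
\[
\operatorname*{plim}_{N \to \infty} \sup_{\theta \in \Theta_N} \left\| t^*_N(\theta) - E[t^*_N(\theta)] \right\| = 0. \tag{B20}
\]
(b) There are finite numbers $\alpha_1$ and $\alpha_2$ such that for all $\theta \in \Theta_N$
\[
\left\| E[t^*_N(\theta)] - H\theta \right\| \le o(1) + \alpha_1 h_N \|\theta\| + \alpha_2 h_N \|\theta\|^2 \tag{B21}
\]
uniformly over $\theta \in \Theta_N$.

Proof. Define
\[
g_{Nn}(\theta) = \sum_{1 \le j < k \le J} \Big\{ \left[ 1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj}) \right] K'(x_{njk}'\beta / h_N + \tilde{x}_{njk}'\theta) \, \tilde{x}_{njk} - E\left[ \left[ 1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj}) \right] K'(x_{njk}'\beta / h_N + \tilde{x}_{njk}'\theta) \, \tilde{x}_{njk} \right] \Big\}. \tag{B22}
\]
The remaining part of the proof of (B20) follows the proof of (A15) in Lemma 7 of Horowitz (1992). Next, we prove (B21). Define
\[
E[t^*_N(\theta)] \equiv \sum_{1 \le j < k \le J} t^*_{Njk}(\theta),
\]
where
\[
t^*_{Njk}(\theta) = h_N^{-2} E\left[ \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, K'(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta) \, \tilde{x}_{jk} \right].
\]
Decompose $t^*_{Njk}(\theta)$ into two parts: $t^*_{Njk}(\theta) \equiv t^*_{Njk1} + t^*_{Njk2}$, where
\[
t^*_{Njk1} = h_N^{-2} \int_{|v_{-j,k}| > \eta} \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, K'(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta) \, \tilde{x}_{jk} \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) \, dv_{-j,k} \, dP(\tilde{v}_{-j,k}, \tilde{X})
\]
and
\[
t^*_{Njk2} = h_N^{-2} \int_{|v_{-j,k}| \le \eta} \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, K'(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta) \, \tilde{x}_{jk} \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) \, dv_{-j,k} \, dP(\tilde{v}_{-j,k}, \tilde{X}).
\]
For some finite $C > 0$, by Assumption 7,
\[
\left| t^*_{Njk1} \right| \le C h_N^{-2} \int_{|v_{-j,k}| > \eta} \left| K'(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta) \right| dv_{-j,k} \, dP(\tilde{v}_{-j,k}, \tilde{X}).
\]
Let $\zeta_{jk} = -v_{-j,k} / h_N + \tilde{x}_{jk}'\theta$. Since $h_N \|\theta\| \le \eta / 2c$ and $\|\tilde{x}_{jk}\| \le c$, $|v_{-j,k}| > \eta$ implies that $|\zeta_{jk}| > |-v_{-j,k} / h_N| - |\tilde{x}_{jk}'\theta| > \eta / 2h_N$ and
\[
\left| t^*_{Njk1} \right| \le C h_N^{-1} \int_{|\zeta_{jk}| > \eta / 2h_N} |K'(\zeta_{jk})| \, d\zeta_{jk}. \tag{B23}
\]
We have
\[
\lim_{N \to \infty} \sup_{\theta \in \Theta_N} \left| t^*_{Njk1} \right| = 0, \tag{B24}
\]
because the term on the right-hand side of (B23) converges to 0 by Condition 2. Next, we consider $t^*_{Njk2}$. If $|v_{-j,k}| \le \eta$, substitution of $d = 2$ into the right-hand side of (B9) yields
\[
\begin{aligned}
\bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})
&= \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, v_{-j,k} \\
&\quad + \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p^{(1)}_{jk}(\xi_1 \mid \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^2 \\
&\quad + (1/2) \bar{F}^{(2)}_{jk}(\xi, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) (v_{-j,k})^2,
\end{aligned} \tag{B25}
\]
where $\xi$ and $\xi_1$ are between 0 and $v_{-j,k}$. Decompose $t^*_{Njk2}$ into two parts, $t^*_{Njk2} \equiv s_{Njk1} + s_{Njk2}$, where
\[
s_{Njk1} = h_N^{-2} \int_{|v_{-j,k}| \le \eta} \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \, v_{-j,k} \, K'(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta) \, dv_{-j,k} \, dP(\tilde{v}_{-j,k}, \tilde{X}),
\]
and
\[
s_{Njk2} = h_N^{-2} \int_{|v_{-j,k}| \le \eta} \left[ \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p^{(1)}_{jk}(\xi_1 \mid \tilde{v}_{-j,k}, \tilde{X}) + (1/2) \bar{F}^{(2)}_{jk}(\xi, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) \right] \tilde{x}_{jk} (v_{-j,k})^2 K'(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta) \, dv_{-j,k} \, dP(\tilde{v}_{-j,k}, \tilde{X}).
\]
Let $\zeta_{jk} = -v_{-j,k} / h_N + \tilde{x}_{jk}'\theta$, then
\[
s_{Njk1} = \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| \le \eta / h_N} \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} (\zeta_{jk} - \tilde{x}_{jk}'\theta) K'(\zeta_{jk}) \, d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}).
\]
Because $\int \zeta K'(\zeta) \, d\zeta = 0$ and $|\tilde{x}_{jk}'\theta h_N| \le \eta / 2$,
\[
\left| \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| \le \eta / h_N} \zeta_{jk} K'(\zeta_{jk}) \, d\zeta_{jk} \right| = \left| \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| > \eta / h_N} \zeta_{jk} K'(\zeta_{jk}) \, d\zeta_{jk} \right| \le \int_{|\zeta_{jk}| > \eta / 2h_N} |\zeta_{jk} K'(\zeta_{jk})| \, d\zeta_{jk}. \tag{B26}
\]
By Condition 2, the right-hand term of (B26) is bounded uniformly over $\theta \in \Theta_N$ and it converges to 0. Therefore, by Lebesgue's dominated convergence theorem,
\[
\lim_{N \to \infty} \sup_{\theta \in \Theta_N} \left| \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| \le \eta / h_N} \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \, \zeta_{jk} K'(\zeta_{jk}) \, d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}) \right| = 0. \tag{B27}
\]
In addition,
\[
\begin{aligned}
\left\| \tilde{x}_{jk}'\theta \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| \le \eta / h_N} K'(\zeta_{jk}) \, d\zeta_{jk} - \tilde{x}_{jk}'\theta \right\|
&\le |\tilde{x}_{jk}'\theta h_N| \, h_N^{-1} \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| > \eta / h_N} |K'(\zeta_{jk})| \, d\zeta_{jk} \\
&\le (\eta / 2) \, h_N^{-1} \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| > \eta / h_N} |K'(\zeta_{jk})| \, d\zeta_{jk}.
\end{aligned} \tag{B28}
\]
By Condition 2, the right-hand side of (B28) is bounded uniformly over $\theta \in \Theta_N$ and it converges to 0. Therefore, by Lebesgue's dominated convergence theorem and the definition of $H$,
\[
\lim_{N \to \infty} \sup_{\theta \in \Theta_N} \left\| \sum_{1 \le j < k \le J} \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| \le \eta / h_N} \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \tilde{x}_{jk}'\theta \, K'(\zeta_{jk}) \, d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}) - H\theta \right\| = 0. \tag{B29}
\]
For some finite $C > 0$,
\[
\| s_{Njk2} \| \le C h_N \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta| \le \eta / h_N} (\zeta_{jk} - \tilde{x}_{jk}'\theta)^2 |K'(\zeta_{jk})| \, d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}) \le o(1) + \alpha_{jk1} h_N \|\theta\| + \alpha_{jk2} h_N \|\theta\|^2 \tag{B30}
\]
for some finite $\alpha_{jk1}$ and $\alpha_{jk2}$. So part (b) is established by combining (B24), (B27), (B29), and (B30).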

Lemma 8. Let Assumptions 1-9 and Conditions 1-2 hold and define $\theta_N = (\tilde{b}^S_N - \tilde{\beta}) / h_N$, where $b^S_N$ is a SGMS estimator. Then the probability limit of $\theta_N$ is zero.

Proof. Given any $\delta > 0$, choose $\gamma$ to be a finite number such that $\Pr(\|\tilde{x}_{jk}\| \le \gamma \text{ for any } 1 \le j < k \le J) \ge 1 - \delta$. Let $P_\delta$ be the probability distribution of $X$ conditional on the event $C_\gamma \equiv \{ X : \|\tilde{x}_{jk}\| \le \gamma \text{ for any } 1 \le j < k \le J \}$. Let $C_\gamma'$ denote the complement of $C_\gamma$. The remaining part of the proof of Lemma 8 follows the proof of Lemma 8 of Horowitz (1992).
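
Lemma 9. Let Assumptions 1-8 and Conditions 1-2 hold. Let $\{ \beta_N = (\beta_{N1}, \tilde{\beta}_N')' \}$ be any sequence in $B$ such that $(\beta_N - \beta) / h_N \to 0$ as $N \to \infty$. Then the probability limit of $H_N(\beta_N, h_N)$ is $H$.

Proof. Assume that $\beta_{N1} = \beta_1$, because $\beta_{N1} = \beta_1 \in \{-1, 1\}$ for sufficiently large $N$. Denote $\theta_N = (\tilde{\beta}_N - \tilde{\beta}) / h_N$. Let $\{a_N\}$ be a sequence such that $a_N \to \infty$ and $a_N \theta_N \to 0$ as $N \to \infty$. Define $\tilde{X}_N = \{ \tilde{X} : \|\tilde{x}_{jk}\| \le a_N \text{ for any } 1 \le j < k \le J \}$. For any $\epsilon > 0$,
\[
\lim_{N \to \infty} P\left[ |H_N(\beta_N, h_N) - H| > \epsilon \right] = \lim_{N \to \infty} P\left[ |H_N(\beta_N, h_N) - H| > \epsilon \mid \tilde{X}_N \right].
\]
Therefore, by Chebyshev's Theorem, it suffices to show that $E[H_N(\beta_N, h_N) \mid \tilde{X}_N] \to H$ and $Var[H_N(\beta_N, h_N) \mid \tilde{X}_N] \to 0$. Consider $E[H_N(\beta_N, h_N) \mid \tilde{X}_N]$ first. Define $E_N \equiv E[H_N(\beta_N, h_N) \mid \tilde{X}_N]$, then $E_N = \sum_{1 \le j < k \le J} E_{Njk}$, where
\[
E_{Njk} = h_N^{-2} \int \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \tilde{x}_{jk}' \, K''(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N) \, dv_{-j,k} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}), \tag{B31}
\]
and $P_{Njk}$ denotes the distribution of $(\tilde{v}_{-j,k}, \tilde{X})$ conditional on $\tilde{X}_N$. By Assumptions 7-8, there exists an $\eta$ such that $\bar{F}^{(1)}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})$, $\bar{F}^{(2)}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X})$, and $p^{(1)}_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X})$ exist and are almost surely uniformly bounded if $|v_{-j,k}| \le \eta$. Therefore, substitution of (B25) into (B31) yields
\[
E_{Njk} = I_{Njk1} + I_{Njk2} + I_{Njk3},
\]
where
\[
I_{Njk1} = h_N^{-2} \int_{|v_{-j,k}| \le \eta} \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \tilde{x}_{jk}' \, v_{-j,k} \, K''(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N) \, dv_{-j,k} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}),
\]
\[
I_{Njk2} = h_N^{-2} \int_{|v_{-j,k}| \le \eta} \left[ \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p^{(1)}_{jk}(\xi_1 \mid \tilde{v}_{-j,k}, \tilde{X}) + (1/2) \bar{F}^{(2)}_{jk}(\xi, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) \right] \tilde{x}_{jk} \tilde{x}_{jk}' (v_{-j,k})^2 K''(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N) \, dv_{-j,k} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}),
\]
and
\[
I_{Njk3} = h_N^{-2} \int_{|v_{-j,k}| > \eta} \bar{F}_{jk}(v_{-j,k}, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(v_{-j,k} \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \tilde{x}_{jk}' \, K''(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N) \, dv_{-j,k} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}). \tag{B32}
\]
Define $\zeta_{jk} = -v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N$. Then
\[
I_{Njk1} = \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta_N| \le \eta / h_N} \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \tilde{x}_{jk}' (\zeta_{jk} - \tilde{x}_{jk}'\theta_N) K''(\zeta_{jk}) \, d\zeta_{jk} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}).
\]
Because $|\tilde{x}_{jk}'\theta_N| \le a_N \|\theta_N\| \to 0$, by Assumptions 6-7 and Conditions 1-2,
\[
I_{Njk1} \to E\left[ \bar{F}^{(1)}_{jk}(0, \tilde{v}_{-j,k}, \tilde{X}) \, p_{jk}(0 \mid \tilde{v}_{-j,k}, \tilde{X}) \, \tilde{x}_{jk} \tilde{x}_{jk}' \right]. \tag{B33}
\]
For some finite $C > 0$,
\[
|I_{Njk2}| \le C h_N \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta_N| \le \eta / h_N} |\tilde{x}_{jk} \tilde{x}_{jk}'| \, (\zeta_{jk} - \tilde{x}_{jk}'\theta_N)^2 |K''(\zeta_{jk})| \, d\zeta_{jk} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}) \to 0. \tag{B34}
\]
Next we calculate (B32). By Condition 2,
\[
|I_{Njk3}| \le C h_N^{-1} \int_{|\zeta_{jk} - \tilde{x}_{jk}'\theta_N| > \eta / h_N} |\tilde{x}_{jk} \tilde{x}_{jk}'| \cdot |K''(\zeta_{jk})| \, d\zeta_{jk} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}). \tag{B35}
\]
Under Assumption 7, the right-hand side of (B35) converges to 0 as $N$ goes to infinity. Combination of (B33), (B34), and (B35) establishes that
\[
E_N = \sum_{1 \le j < k \le J} E_{Njk} = \sum_{1 \le j < k \le J} (I_{Njk1} + I_{Njk2} + I_{Njk3}) \to H.
\]
Now consider $Var[H_N(\beta_N, h_N) \mid \tilde{X}_N]$:
\[
\begin{aligned}
Var[H_N(\beta_N, h_N) \mid \tilde{X}_N] &= N^{-1} Var\Big\{ \sum_{1 \le j < k \le J} \left[ 1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj}) \right] K''(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N) \, \tilde{x}_{jk} \tilde{x}_{jk}' \, h_N^{-2} \,\Big|\, \tilde{X}_N \Big\} \\
&= N^{-1} E[r_N(\theta_N) r_N(\theta_N)' \mid \tilde{X}_N] + O(N^{-1}),
\end{aligned} \tag{B36}
\]
where
\[
r_N(\theta_N) = \sum_{1 \le j < k \le J} \left[ 1(r_{nj} < r_{nk}) - 1(r_{nk} < r_{nj}) \right] K''(-v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N) \, vec(\tilde{x}_{jk} \tilde{x}_{jk}') \, h_N^{-2}.
\]
Let $\zeta_{jk} = -v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N$. For some finite $C$,
\[
\begin{aligned}
N^{-1} E[r_N(\theta_N) r_N(\theta_N)' \mid \tilde{X}_N]
&\le C h_N (N h_N^4)^{-1} \sum_{1 \le j < k \le J} \int |vec(\tilde{x}_{jk} \tilde{x}_{jk}') vec(\tilde{x}_{jk} \tilde{x}_{jk}')'| \left[ K''(\zeta_{jk}) \right]^2 d\zeta_{jk} \, dP_{Njk}(\tilde{v}_{-j,k}, \tilde{X}) \\
&\quad + C h_N^2 (N h_N^4)^{-1} \sum_{k \ne l; \, k, l \in J \setminus \{j\}} \int |vec(\tilde{x}_{jk} \tilde{x}_{jk}') vec(\tilde{x}_{jl} \tilde{x}_{jl}')'| \, K''(\zeta_{jk}) K''(\zeta_{jl}) \, d\zeta_{jk} \, d\zeta_{jl} \, dP_{Njkl}(\tilde{v}_{-j,kl}, \tilde{X})
\end{aligned} \tag{B37}
\]
by Assumption 7, where $P_{Njkl}$ is the distribution of $(\tilde{v}_{-j,kl}, \tilde{X})$ conditional on $\tilde{X}_N$. The right-hand side of (B37) converges to zero by Assumptions 7-8 and Condition 2. Therefore, it follows from (B36) that $Var[H_N(\beta_N, h_N) \mid \tilde{X}_N]$ converges to zero.
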
Proof. (THEOREM 3) By Theorem 2, $b^S_{N,1} = \beta_1$ and $\tilde{b}^S_N$ is an interior point of $\tilde{B}$ with probability approaching 1 as $N \to \infty$. So $t_N(b^S_N, h_N) = 0$ and a Taylor series expansion yields
\[
t_N(\beta, h_N) + H_N(b^*_N, h_N)(\tilde{b}^S_N - \tilde{\beta}) = 0 \tag{B38}
\]
with probability approaching 1, where $b^*_N$ is between $\beta$ and $b^S_N$.

Part (a): By (B38),
\[
h_N^{-d} t_N(\beta, h_N) + H_N(b^*_N, h_N) \, h_N^{-d} (\tilde{b}^S_N - \tilde{\beta}) = 0
\]
with probability approaching 1 as $N \to \infty$. Lemmas 8-9 imply that $\operatorname{plim} H_N(b^*_N, h_N) = H$. Because $H$ is nonsingular by Assumption 9, we have
\[
h_N^{-d} (\tilde{b}^S_N - \tilde{\beta}) = -H^{-1} h_N^{-d} t_N(\beta, h_N) + o_p(1).
\]
Part (a) now follows from Lemma 6(a).

Part (b): By (B38),
\[
(N h_N)^{1/2} t_N(\beta, h_N) + H_N(b^*_N, h_N)(N h_N)^{1/2} (\tilde{b}^S_N - \tilde{\beta}) = 0
\]
with probability approaching 1 as $N \to \infty$. So, by Lemmas 8-9 and Assumption 9,
\[
(N h_N)^{1/2} (\tilde{b}^S_N - \tilde{\beta}) = -H^{-1} (N h_N)^{1/2} t_N(\beta, h_N) + o_p(1).
\]
Part (b) now follows from Lemma 6(b).

Part (c): By the property of the matrix trace,
\[
E_A\left[ (\tilde{b}^S_N - \tilde{\beta})' W (\tilde{b}^S_N - \tilde{\beta}) \right] = Tr\left\{ W E_A\left[ (\tilde{b}^S_N - \tilde{\beta})(\tilde{b}^S_N - \tilde{\beta})' \right] \right\}.
\]
Part (b) implies that
\[
E_A\left[ (\tilde{b}^S_N - \tilde{\beta})(\tilde{b}^S_N - \tilde{\beta})' \right] = N^{-2d/(2d+1)} \left[ \lambda^{-1/(2d+1)} H^{-1} \Omega H^{-1} + \lambda^{2d/(2d+1)} H^{-1} a a' H^{-1} \right].
\]
Therefore, by the definition of MSE,
\[
MSE = N^{-2d/(2d+1)} \, Tr\left\{ W H^{-1} \left[ \lambda^{-1/(2d+1)} \Omega + \lambda^{2d/(2d+1)} a a' \right] H^{-1} \right\}. \tag{B39}
\]
To minimize MSE, differentiate the right-hand side of (B39) with respect to $\lambda$. From the first order condition, MSE is minimized by setting $\lambda = \lambda^*$, where
\[
\lambda^* = \left[ trace(W H^{-1} \Omega H^{-1}) \right] / \left[ trace(2d \, W H^{-1} a a' H^{-1}) \right].
\]
It follows from Part (b) that
\[
N^{d/(2d+1)} (\tilde{b}^S_N - \tilde{\beta}) \tag{B40}
\]
has the asymptotic distribution $MVN\left( -(\lambda^*)^{d/(2d+1)} H^{-1} a, \, (\lambda^*)^{-1/(2d+1)} H^{-1} \Omega H^{-1} \right)$.
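The first order condition behind $\lambda^*$ can be checked numerically. The sketch below evaluates the $\lambda$-dependent factor of (B39) on a grid and compares its minimizer with the closed form; it also converts $\lambda^*$ into a bandwidth through the relation $N h_N^{2d+1} = \lambda$ from Lemma 6(b). All numerical inputs (`d`, `W`, `H`, `Omega`, `a`, `n_obs`) are hypothetical toy values, and the function names are illustrative rather than part of any published implementation.

```python
import numpy as np

def mse_factor(lam, d, W, H_inv, Omega, a):
    """Lambda-dependent factor of (B39), i.e. MSE without N^(-2d/(2d+1))."""
    var_part = np.trace(W @ H_inv @ Omega @ H_inv)
    bias_part = np.trace(W @ H_inv @ np.outer(a, a) @ H_inv)
    return lam ** (-1.0 / (2 * d + 1)) * var_part + lam ** (2.0 * d / (2 * d + 1)) * bias_part

def lambda_star(d, W, H_inv, Omega, a):
    """Closed-form minimizer stated in the proof of Theorem 3(c)."""
    numerator = np.trace(W @ H_inv @ Omega @ H_inv)
    denominator = 2 * d * np.trace(W @ H_inv @ np.outer(a, a) @ H_inv)
    return numerator / denominator

# Hypothetical toy inputs.
d = 2
W = np.eye(2)
H = np.array([[2.0, 0.5], [0.5, 1.0]])
H_inv = np.linalg.inv(H)
Omega = np.array([[1.0, 0.2], [0.2, 0.5]])
a = np.array([0.3, -0.1])

lam_opt = lambda_star(d, W, H_inv, Omega, a)
grid = np.linspace(0.2 * lam_opt, 5.0 * lam_opt, 2001)
grid_values = np.array([mse_factor(lam, d, W, H_inv, Omega, a) for lam in grid])
value_at_opt = mse_factor(lam_opt, d, W, H_inv, Omega, a)

n_obs = 1000
h_opt = (lam_opt / n_obs) ** (1.0 / (2 * d + 1))  # from N * h^(2d+1) = lambda
print(lam_opt, grid[np.argmin(grid_values)], h_opt)
```

In practice $H$, $\Omega$ and $a$ are unknown and would have to be replaced by consistent estimates, which is the subject of Theorem 4.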
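
Proof. (THEOREM 4)
Part (a): Applying a Taylor expansion to $t_N(b^S_N, h^*_N)$ around $\beta$ yields
\[
(h^*_N)^{-d} t_N(b^S_N, h^*_N) = (h^*_N)^{-d} t_N(\beta, h^*_N) + \left[ \partial t_N(b^*_N, h^*_N) / \partial \tilde{b}' \right] (h^*_N)^{-d} (\tilde{b}^S_N - \tilde{\beta}) \tag{B41}
\]
with probability approaching one as $N \to \infty$, where $b^*_N$ is between $b^S_N$ and $\beta$. By Lemma 8, $(\tilde{b}^S_N - \tilde{\beta}) / h_N = o_p(1)$. Therefore, $(\tilde{b}^S_N - \tilde{\beta}) / h^*_N = o_p(1)$. Lemma 9 implies that $\operatorname{plim} [\partial t_N(b^*_N, h^*_N) / \partial \tilde{b}'] = H$. In addition, $(h^*_N)^{-d} (\tilde{b}^S_N - \tilde{\beta}) = o_p(1)$ because $(h_N)^{-d} (\tilde{b}^S_N - \tilde{\beta}) = O_p(1)$ by Theorem 3(b) and $(h^*_N / h_N)^{-d} \to 0$. Finally, $\operatorname{plim} [(h^*_N)^{-d} t_N(\beta, h^*_N)] = a$ by Lemma 5(a). Part (a) now follows by taking probability limits as $N \to \infty$ of each side of (B41).

Part (b): By Chebyshev's Theorem, it suffices to show that $E(\hat{\Omega}_N) \to \Omega$ and $Var(\hat{\Omega}_N) \to 0$. First consider $E(\hat{\Omega}_N)$:
\[
E(\hat{\Omega}_N) = h_N E\left[ t_{Nn}(b^S_N, h_N) \, t_{Nn}(b^S_N, h_N)' \right] \equiv L^*_{N1} + L^*_{N2}, \tag{B42}
\]
where
\[
L^*_{N1} = h_N^{-1} \sum_{1 \le j < k \le J} E\left\{ \left[ \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] K'(x_{jk}' b^S_N / h_N) \right]^2 \tilde{x}_{jk} \tilde{x}_{jk}' \right\},
\]
and
\[
\begin{aligned}
L^*_{N2} = \sum_{1 \le j < k < l \le J} 2 h_N^{-1} E \Big\{ & \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] \left[ 1(r_j < r_l) - 1(r_l < r_j) \right] K'(x_{jk}' b^S_N / h_N) K'(x_{jl}' b^S_N / h_N) \, \tilde{x}_{jk} \tilde{x}_{jl}' \\
& + \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] \left[ 1(r_k < r_l) - 1(r_l < r_k) \right] K'(x_{jk}' b^S_N / h_N) K'(x_{kl}' b^S_N / h_N) \, \tilde{x}_{jk} \tilde{x}_{kl}' \\
& + \left[ 1(r_j < r_l) - 1(r_l < r_j) \right] \left[ 1(r_k < r_l) - 1(r_l < r_k) \right] K'(x_{jl}' b^S_N / h_N) K'(x_{kl}' b^S_N / h_N) \, \tilde{x}_{jl} \tilde{x}_{kl}' \Big\}.
\end{aligned}
\]
Let $\theta_N = (\tilde{b}^S_N - \tilde{\beta}) / h_N$ and $\zeta_{jk} = -v_{-j,k} / h_N + \tilde{x}_{jk}'\theta_N$. Then
\[
\begin{aligned}
L^*_{N1} = \sum_{1 \le j < k \le J} \int \Big\{ & F_{jk}\left[ h_N(-\zeta_{jk} + \tilde{x}_{jk}'\theta_N), \tilde{v}_{-j,k}, \tilde{X} \right] \\
& + F_{kj}\left[ h_N(-\zeta_{jk} + \tilde{x}_{jk}'\theta_N), \tilde{v}_{-j,k} + h_N(-\zeta_{jk} + \tilde{x}_{jk}'\theta_N) \iota_{J-2}', \tilde{X} \right] \Big\} \\
& \cdot p_{jk}\left[ h_N(-\zeta_{jk} + \tilde{x}_{jk}'\theta_N) \mid \tilde{v}_{-j,k}, \tilde{X} \right] \tilde{x}_{jk} \tilde{x}_{jk}' \left[ K'(\zeta_{jk}) \right]^2 d\zeta_{jk} \, dP(\tilde{v}_{-j,k}, \tilde{X}).
\end{aligned} \tag{B43}
\]
By Assumptions 3, 6-7 and Lebesgue's dominated convergence theorem, the right-hand side of (B43) converges to $\Omega$ when $N \to \infty$. Under Assumption 7,
\[
\begin{aligned}
|L^*_{N2}| \le \sum_{1 \le j < k < l \le J} 2 M h_N \Big[ & \int |K'(\zeta_{jk}) K'(\zeta_{jl}) \, \tilde{x}_{jk} \tilde{x}_{jl}'| \, d\zeta_{jk} \, d\zeta_{jl} \, dP(\tilde{v}_{-j,kl}, \tilde{X}) \\
& + \int |K'(\zeta_{jk}) K'(\zeta_{kl}) \, \tilde{x}_{jk} \tilde{x}_{kl}'| \, d\zeta_{kj} \, d\zeta_{kl} \, dP(\tilde{v}_{-k,jl}, \tilde{X}) \\
& + \int |K'(\zeta_{jl}) K'(\zeta_{kl}) \, \tilde{x}_{jl} \tilde{x}_{kl}'| \, d\zeta_{lj} \, d\zeta_{lk} \, dP(\tilde{v}_{-l,jk}, \tilde{X}) \Big].
\end{aligned} \tag{B44}
\]
Therefore, the right-hand side of (B44) converges to 0 when $N \to \infty$ by Assumption 7 and Condition 2. So $E(\hat{\Omega}_N) \to \Omega$ by (B42).

Next consider $Var(\hat{\Omega}_N)$. By Assumption 1, we can calculate
\[
\begin{aligned}
Var(\hat{\Omega}_N) &= (h_N^2 / N) \, Var\left[ t_{Nn}(b^S_N, h_N) \, t_{Nn}(b^S_N, h_N)' \right] \\
&= (h_N^2 / N) \, E\left\{ vec\left[ t_{Nn}(b^S_N, h_N) \, t_{Nn}(b^S_N, h_N)' \right] vec\left[ t_{Nn}(b^S_N, h_N) \, t_{Nn}(b^S_N, h_N)' \right]' \right\} + o(1) \\
&= (N h_N^2)^{-1} E[c c'] + o(1),
\end{aligned} \tag{B45}
\]
where
\[
c \equiv \sum_{1 \le j < k \le J} \sum_{1 \le l < m \le J} c_{jklm},
\]
and
\[
c_{jklm} \equiv \left[ 1(r_j < r_k) - 1(r_k < r_j) \right] \left[ 1(r_l < r_m) - 1(r_m < r_l) \right] K'(x_{jk}' b^S_N / h_N) K'(x_{lm}' b^S_N / h_N) \, vec(\tilde{x}_{jk} \tilde{x}_{lm}').
\]
The right-hand side of (B45) converges to 0 under Assumptions 8-10.

Part (c): This is implied by Lemma 9.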
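For readers who want to see the object in part (b) concretely, the sketch below computes an outer-product estimate of the form $\hat{\Omega}_N = (h_N / N) \sum_n t_{Nn} t_{Nn}'$, with $t_{Nn}$ as defined in the proof of Lemma 6. This form is an assumption inferred from (B42), not a restatement of the theorem. The standard normal density is used as a stand-in for $K'(\cdot)$ purely for illustration (it is not a higher-order kernel), and the simulated data, the evaluation point `beta`, the bandwidth `h` and the function names are hypothetical.

```python
import numpy as np

def score_terms(b, X, ranks, h):
    """Per-individual terms t_Nn(b, h): array of shape (N, q - 1)."""
    n_obs, n_alt, q = X.shape
    t = np.zeros((n_obs, q - 1))
    for j in range(n_alt):
        for k in range(j + 1, n_alt):
            x_jk = X[:, j, :] - X[:, k, :]
            sign = (ranks[:, j] < ranks[:, k]).astype(float) \
                 - (ranks[:, k] < ranks[:, j]).astype(float)
            u = (x_jk @ b) / h
            k_prime = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # assumed K'
            # x_jk[:, 1:] drops the first attribute, whose coefficient is normalized.
            t += (sign * k_prime / h)[:, None] * x_jk[:, 1:]
    return t

def omega_hat(b, X, ranks, h):
    """Outer-product estimate (h / N) * sum_n t_Nn t_Nn' (assumed form)."""
    t = score_terms(b, X, ranks, h)
    return (h / t.shape[0]) * (t.T @ t)

# Hypothetical simulated rankings.
rng = np.random.default_rng(2)
n_obs, n_alt = 1500, 4
beta = np.array([1.0, -0.5, 0.8])
X = rng.normal(size=(n_obs, n_alt, beta.size))
utility = X @ beta + rng.normal(size=(n_obs, n_alt))
ranks = np.argsort(np.argsort(-utility, axis=1), axis=1)  # 0 = most preferred

omega = omega_hat(beta, X, ranks, h=0.3)
print(omega)
```

In an application the estimate would be evaluated at the SGMS estimate rather than at a known parameter vector, with the bandwidth chosen as discussed in Theorem 3.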
57
References
[1] Abrevaya, J., and Huang, J. 2005. On the Bootstrap of the Maximum Score Estimator. Econometrica
73: 1175-1204.
[2] Barbera S, Pattanaik PK. 1986. Falmagne and the rationalizability of stochastic choices in terms of
random orderings. Econometrica 54: 707-715
[3] Basile R, Castellani D, Zanfei A. 2008. Location choices of multinational rms in Europe: the role of
EU cohesion policy. Journal of International Economics 74: 328-340.
[4] Beggs S, Cardell S, Hausman J. 1981. Assessing the potential demand for electric cars. Journal of
Econometrics 17: 1-19
[5] Ben-Akiva, M., T. Morikawa, and F. Shiroish. 1992. Analysis of the reliability of preference ranking
data. Journal of Business Research 24: 149-164.
[6] Calfee, J., C. Winston, and R. Stempski. 2001. Econometric issues in estimating consumer preferences
from stated preference data: a case study of the value of automobile travel time. Review of Economics
and Statistics 83(4): 699-707.
[7] Caparrós A, Oviedo JL, Campos P. 2008. Would you choose your preferred option? Comparing choice
and recoded ranking experiments. American Journal of Agricultural Economics 90: 843-855
[8] Chapman RG, Staelin R. 1982. Exploiting rank ordered choice set data within the stochastic utility
model. Journal of Marketing Research 19: 288-301
[9] Dagsvik JK, Liu G. 2009. A framework for analyzing rank-ordered data with application to automobile
demand. Transportation Research Part A 43: 1-12
[10] Falmagne JC. 1978. A representation theorem for nite scale systems. Journal of Mathematical Psychology 18: 52-72
[11] Fiebig D, Keane M, Louviere J, Wasi N. 2010. The generalized multinomial logit model: Accounting for
scale and coecient heterogeneity. Marketing Science 29: 393-421
58
[12] Fox J. 2007. Semiparametric estimation of multinomial discrete-choice models using a subset of choices.
RAND Journal of Economics 38: 1002-1019.
[13] Goeree, J.K., Holt, C.A., and Palfrey, T.R. 2005. Regular Quantal Response Equilibrium. Experimental
Economics 8: 347-367.
[14] Han, A.K. 1987. Non-parametric Analysis of a Generalized Regression Model. Journal of Econometrics
35: 303-316.
[15] Hausman J, McFadden D. 1984. Specification tests for the multinomial logit model. Econometrica 52:
1219-1240.
[16] Hausman J, Ruud P. 1987. Specifying and testing econometric models for rank-ordered data. Journal of
Econometrics 34: 83-104.
[17] Hausman, J, and Wise, D. 1978. A Conditional Probit Model for Qualitative Choice: Discrete Decisions
Recognizing Interdependence and Heterogeneous Preferences. Econometrica 46: 403-426.
[18] Horowitz JL. 1992. A smoothed maximum score estimator for the binary response model. Econometrica
60: 505-531.
[19] Kim, J., and Pollard, D. 1990. Cube Root Asymptotics. Annals of Statistics 18: 191-219.
[20] Layton D. 2000. Random coefficient models for stated preference surveys. Journal of Environmental
Economics and Management 40: 21-36.
[21] Layton D., Levine, R. 2003. How Much Does the Far Future Matter? A Hierarchical Bayesian Analysis
of the Public's Willingness to Mitigate Ecological Impacts of Climate Change. Journal of the American
Statistical Association 98: 533-544.
[22] Luce R.D. and Suppes P. 1965. Preference, Utility and Subjective Probability. In: Handbook of Mathematical Psychology, Vol 3. New York: John Wiley & Sons, Inc., 249-410.
[23] Lee L-F. 1995. Semiparametric maximum likelihood estimation of polychotomous and sequential choice
models. Journal of Econometrics 65: 381-428.
[24] Lewbel A. 2000. Semiparametric qualitative response model estimation with unknown heteroskedasticity
or instrumental variables. Journal of Econometrics 97: 145-177.
[25] Manski CF. 1975. Maximum score estimation of the stochastic utility models of choice. Journal of
Econometrics 3: 205-228.
[26] McFadden D. 1974. Conditional logit analysis of qualitative choice behavior. In: Zarembka P (ed)
Frontiers in Econometrics. Academic Press, New York
[27] McFadden D. 1986. The choice theory approach to market research. Marketing Science 5: 275-297
[28] Newey, W.K., and McFadden, D. 1994. Large Sample Estimation and Hypothesis Testing. Handbook of
Econometrics, Vol 4: 2111-2245.
[29] Rao R.R. 1962. Relations between weak and uniform convergence of measures with applications. Annals
of Mathematical Statistics 33: 659-680.
[30] Ruud PA. 1983. Sufficient conditions for the consistency of maximum likelihood estimation despite
misspecification of distribution in multinomial discrete choice models. Econometrica 51: 225-228
[31] Ruud PA. 1986. Consistent estimation of limited dependent variable models despite misspecification of
distribution. Journal of Econometrics 32: 157-187.
[32] Scarpa R, Notaro S, Louviere J, Raffaelli R. 2011. Exploring scale effects of best/worst rank ordered
choice data to estimate benefits of tourism in Alpine Grazing Commons. American Journal of Agricultural
Economics 93: 813-828
[33] Sherman, R.P. 1993. The Limiting Distribution of the Maximum Rank Correlation Estimator. Econometrica 61: 123-137.
[34] Small KA, Winston C, Yan J. 2005. Uncovering the distribution of motorists' preferences for travel time
and reliability. Econometrica 73: 1367-1382.
[35] Storn, R., and Price, K. 1997. Differential Evolution - A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization 11: 341-359.
[36] Thompson S. 1993. Some efficiency bounds for semiparametric discrete choice models. Journal of Econometrics 58: 257-274.
[37] Train KE, Winston C. 2007. Vehicle choice behavior and the declining market share of U.S. automakers.
International Economic Review 48: 1469-1496
[38] Yan J. 2012. A smoothed maximum score estimator for multinomial discrete choice models. Working
paper.
[39] Yan J, Yoo H. 2014. The seeming unreliability of rank-ordered data as a consequence of model misspecification. MPRA Paper No. 56285.
[40] Yoo HI, Doiron D. 2013. The use of alternative preference elicitation methods in complex discrete choice
experiments. Journal of Health Economics 32: 1166-1179