The Pure Characteristics Demand Model[1]

Steven Berry
Dept. of Economics, Yale, and NBER

Ariel Pakes
Dept. of Economics, Harvard, and NBER

December 28, 2005

[1] We are very pleased to have presented this paper at the Festschrift in honor of Dan McFadden. Dan opened up the world of discrete choice modeling in empirical research, with a lasting impact on our field. We also appreciate helpful comments from Don Brown, Paris Cleanthous, Gautam Gowrisankaran, Minjae Song, and from many seminar participants over the many years that we worked on this project.
Abstract
In this paper we consider a class of discrete choice models in which consumers care about a finite set of product characteristics. These models have been used extensively in the theoretical literature on product differentiation, and the goal of this paper is to translate them into a form that is useful for empirical work. Most recent econometric applications of discrete choice models, in contrast, implicitly let the dimension of the characteristic space increase with the number of products. The two models have different theoretical properties, and these, in turn, have quite pronounced implications for the likely impacts of increases in the number of products. After developing those properties, we provide an algorithm for estimating the parameters of the pure characteristics model. We then show that the computational properties of the estimation algorithm for the pure characteristics model also differ markedly from those of the model with tastes for products. We conclude with a set of Monte Carlo results which illustrate both the theoretical and the computational properties of the two classes of models.
1 Introduction.
The theory and econometrics of demand models which treat products as bundles of characteristics dates back at least to Lancaster (1971) and McFadden
(1974). Early applications of these models used restrictive assumptions and a
primary concern of the literature that followed (including many contributions
by McFadden himself) was that the structure of the discrete choice model
used would in some way restrict the range of possible outcomes, thereby
providing misleading empirical results. This paper is a continuation of that
tradition. In contrast to the empirical models currently available, we consider estimation of a class of discrete choice models in which consumers care
about a finite set of product characteristics. These models have been used
extensively in the theoretical literature on product differentiation, but prior
to this paper had not been translated into a form that is rich enough for
detailed empirical work.
Typical discrete choice empirical models (including those we have worked
on) implicitly assume that the dimension of the product space increases with
the number of products. This assumption is often embedded in an otherwise unexplained i.i.d. additive random term in the utility function. One
interpretation for these terms is that they represent variance in the “taste
for the product”, as opposed to a taste for the characteristics of the products. Though indications are that such models can do quite a good job in
approximating some aspects of demand, they have some counter-intuitive implications as the number of products increases. As a result we worry about
their ability to answer questions that are specifically about changes in the
number of goods. The two leading examples are the prospective analysis
of demand for goods not yet introduced, and the retrospective analysis of
benefits from new goods that were introduced.
Recent advances in econometric technique, computing power, and data
availability have significantly increased the use of characteristic based models
with “tastes for products” in analyzing demand in differentiated products
markets.[1] Provided the researcher can condition on the major characteristics
that consumers value, the characteristic based demand models have a number
of advantages over the more traditional demand models in “product space”.
One of those advantages concerns the analysis of demand for products not
yet marketed.[2]
[1] Most of the advances made since McFadden’s seminal work were designed to insure that the characteristic based models do not unduly restrict own and cross price and characteristic elasticities; see e.g. Berry, Levinsohn, and Pakes (1995).
[2] The other major advantage of estimating demand systems in characteristic space is
Demand systems estimated in product space do not allow us to analyze
potential demand. Provided we are willing to specify the characteristics of
the new product and the form of the equilibrium in the product market, the
characteristic based models do (e.g. Berry, Levinsohn, and Pakes (2004), henceforth “MicroBLP”).[3] Of course the quality of these predictions will depend on the information in the current data on demand for tuples of characteristics similar to those of the new product.[4] Still, characteristics based
models provide one of the few sources of information on demand for new
products available, and the analysis of such demand is central to the analysis
of many issues in both the Industrial Organization and Marketing literatures.
An important issue closely related to the demand for new products is the
use of demand systems to estimate the consumer surplus that was generated by the demand for products that were already introduced. This type of
retrospective analysis dates back at least to Trajtenberg (1989), and its usefulness is illustrated clearly in Petrin’s (2002) investigation of the consumer
surplus gains from the (privately funded) research expenditures that lead to
the development of a major innovation (the Minivan); and Hausman’s (1997)
investigation of the consumer surplus losses caused by regulatory delay in the
introduction of the cell phone. In addition, measuring the consumer surplus gains from new products is an integral part of constructing ideal price indices (see, e.g., Pakes, Berry, and Levinsohn (1993) and Nevo (2000)). Moreover, at least according to the Boskin commission report (2000), failure to adequately measure the gains from new products is the major source of bias in the Consumer Price Index (see also Pakes (2003)).
[2] (cont.) that they typically constrain own and cross price (and characteristic) elasticities to be functions of a small number of parameters (these describe the distribution of tastes for characteristics). In particular, the number of parameters needed is independent of the number of products. In contrast, if we worked in product space, a (log) linear demand system for J products would require the estimation of a number of parameters which grows like J².
[3] As noted there, this requires the researcher to either specify values for any unobservable characteristics, or at least a distribution from which those values are drawn. This is in addition to specifying values for the characteristics that were observed.
[4] To see the need for this qualification, consider Pakes’s (1999) example of the analysis of the potential demand for laptop computers. Prior data had been on demand for desktops. These had more speed, memory, and reliability than the early laptops and sold for a lower price. Since these characteristics were the primary ones used in the analysis of desktop demand, the characteristic based model available when laptops were introduced would have predicted little demand for laptops. The fact that laptops were portable, a characteristic which one could not analyze with the data before the laptop’s introduction, made reality quite different from this prediction.
The consumer surplus generated by products already introduced can, at least in principle, be analyzed using either product based or characteristic
based demand systems. However in either case the results of the analysis
are likely to be particularly sensitive to a priori modeling assumptions. This
is because we typically do not have data on the demand for new products
at prices that are high enough to enable us to nonparametrically estimate
the reservation prices of a large fraction of consumers. When using product based demand systems, the utility gains for the infra-marginal consumers who purchased the good at all observed prices are obtained by extrapolating the demand curve estimated at lower prices to a higher price range, and these extrapolations can be very sensitive to the assumptions built into the model.[5] The characteristic based demand model uses slightly more information in its estimation of consumer surplus gains, since it uses the price variance for products with similar characteristics, but analogous problems can and do arise.[6] As a result the measurement of gains from product introductions
is likely to be particularly sensitive to assumptions like those we discuss
extensively below.
The next section introduces our pure characteristics model and provides
a more extended discussion of the reason we think it might be useful. In
particular, when it comes to measuring consumer surplus gains from product
introductions, there is a sense in which our pure characteristics model makes
assumptions at the opposite extreme of the assumptions used in the models
with tastes for products per se. So there is some reason to believe that the
true gains to product introduction lie between the two extremes.
We then develop some of the properties of our model. These properties
enable us to build an algorithm for estimating the pure characteristics model
from market level data on prices, quantities, and characteristics (section 3).
This section provides an analog to the algorithm developed for the model with “tastes for products” in Berry, Levinsohn, and Pakes (1995) (henceforth BLP). The paper concludes with some Monte Carlo evidence. The Monte Carlo studies are designed to give the reader: (i) some indication of
the performance of our estimators, (ii) an indication of the computational
burden of the pure characteristics model relative to the model with a taste for
products, and (iii) an indication of the performance of the pure characteristic
model relative to the performance of a model with tastes for products.
[5] Hausman (1997), for example, reports infinite consumer surplus gains from some of his specifications.
[6] See the discussion below, or Petrin (2002), who reports large differences in consumer surplus gains from differences in specifications and data sources.
2 Discrete Choice Models Used In Empirical Work.
We consider models in which each consumer chooses to buy at most one
product from some set of differentiated products. Consumer i’s (indirect)
utility from the purchase of product j is
Uij = U (Xj , Vi , θ),   (1)
where Xj is a vector of product characteristics (including the price of the
good), Vi is a vector of consumer tastes and θ is a vector of parameters to be
estimated.
Probably the earliest model of this sort in the economic literature is the
Hotelling (1929) model of product differentiation on the line. In that model
X is the location of the product and V is the location of the consumer.
Subsequent applications were concerned both with demand conditional on
product locations and the determination of those locations.
To obtain the market share of good j (our sj ) implied by the model we
simply add up the number of consumers who prefer good j over all other
goods. That is
sj = Pr{Vi : U (Xj , Vi , θ) ≥ U (Xk , Vi , θ), ∀k ≠ j}.   (2)
To ease the transition to empirical work we follow the notation in Berry,
Levinsohn, and Pakes (2004) and
• partition the vector of consumer attributes, Vi , into zi , which an econometrician with a micro data set might observe, and νi , which the econometrician does not observe
• and partition product characteristics, Xj , into xj , which is observed by the econometrician, and ξj , which is unobserved to the econometrician (though observed to the agents[7]).
The models taken to data impose more structure. They typically assume
that the utility function is additively separable in a deterministic function of
[7] Often, especially in the study of consumer (in contrast to producer) goods, the ξ refer to the aggregate impact of a number of relatively unimportant characteristics, some or all of which may in fact be observed. The researcher chooses not to include them in the specification taken to data because of the worry that the data cannot support an investigation of preferences on such a detailed space.
the product attributes and the observed consumer data, and a disturbance
term, i.e.
Uij = f (Xj , zi ; θ) + µij ,   (3)
where θ is a parameter to be estimated. The model is then completed by
making alternative assumptions on the joint distribution of the {µi,j , Xj , zi }
tuples.
We focus here on the assumed distribution of the µij conditional on (Xj , zi ).[8] The simplest specification is obtained by assuming the µi,j are
generated by a set of linear interactions between the unobserved determinants of consumer tastes (the ν) and the product characteristics X (both
observed and unobserved). If there are K product characteristics then this
“random coefficients” model sets
µij ≡ Σ_{k=1}^{K} νik Xjk ,   (4)
and a parametric assumption is made on the distribution of ν conditional on (X, z).
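To make the mapping from (4) to aggregate shares concrete, here is a small simulation sketch in the spirit of the Monte Carlo approach referenced later in the paper. The function name, the normal distribution for the ν, and all parameter values are our own illustration, not the paper's; the outside option is given utility zero.

```python
import random

def simulate_shares(X, nu_mean, nu_sd, n_draws=5000, seed=0):
    """Monte Carlo market shares for the pure characteristics specification
    mu_ij = sum_k nu_ik * X_jk of equation (4), with nu_ik ~ N(nu_mean[k], nu_sd[k]).
    X is a list of J products, each a list of K characteristics.
    Returns shares indexed 0..J, where index 0 is the outside option (utility 0)."""
    rng = random.Random(seed)
    J, K = len(X), len(X[0])
    counts = [0] * (J + 1)
    for _ in range(n_draws):
        # Draw one simulated consumer's tastes for the K characteristics.
        nu = [rng.gauss(nu_mean[k], nu_sd[k]) for k in range(K)]
        best, best_u = 0, 0.0  # start at the outside option, u_i0 = 0
        for j in range(J):
            u = sum(nu[k] * X[j][k] for k in range(K))
            if u > best_u:
                best, best_u = j + 1, u
        counts[best] += 1
    return [c / n_draws for c in counts]
```

With a single characteristic, a single product at X = [[1.0]], and ν ~ N(1, 1), the product's share approximates P(ν > 0) ≈ 0.84.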
Models with Tastes for Products.
Empirical specifications typically assume that each µij has a conditional distribution, conditional on (Xj , zi ) and all other µi,j′ (∀j′ ≠ j), that has support on the entire real line. A familiar case is where the µi,j contain an additive component which is i.i.d. across both products and consumers. Note
that this implies that for every possible set of products, there will always be
some consumers who like any new product “infinitely” more than the other
products available. Consequently the empirical results obtained when using
these models must have properties that we have a priori reason to doubt
(see below). Examples of specifications that have additive components with
full support include the random coefficient logit model used in BLP as well
as the random coefficient probit discussed in Hausman and Wise (1978) and
McFadden (1981).
To obtain an additive component with full support from a model generated by a distribution of preferences on a set of characteristics we would
have to make the dimension of the characteristic space, K, be a function of
the number of products. The suggestion in Caplin and Nalebuff (1991) is
[8] To complete the specification we will also need either an assumption on the distribution of the unobserved product characteristic (on ξ) given (x, z), or a procedure which estimates the ξ pointwise. The choice made typically depends on the type of data available (see the discussion in Berry, Levinsohn, and Pakes (2004)). We return to these points below.
to think of the additive component as being formed from the interaction of
a set of product-specific dummy variables and a set of i.i.d. tastes for each
product. Consequently we will refer to this class of models as having “tastes
for products,” as opposed to just tastes for product characteristics.
Though this assumption does justify empirical work, it contradicts the
spirit of the literature on characteristic based demand models (which focus
on demand and product location in a given characteristic space). Perhaps
more important is that in applications it is often difficult, if not impossible,
to think of a source of variance in tastes that is both independent across
households for a given product, and independent across products for a given
household.
On the other hand, models that include a taste for products have a number of important practical advantages. In particular those models
• define all probabilities by integrals with simple limits of integration (see McFadden (1981)),
• insure that all the purchase probabilities are nonzero (at every value of the parameter vector), and (provided certain other regularity conditions are satisfied) have smooth derivatives (McFadden, 1981), and
• aggregate into market shares which can be easily inverted to solve for the unobservable characteristics (the {ξj }) as a linear function of the parameters and the data, thereby enabling the use of familiar techniques to account for the simultaneity problem induced by any correlation between these unobserved characteristics and price (see BLP).
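The third advantage is easiest to see in the logit special case, where the inversion of Berry (1994) is available in closed form: δj = ln sj − ln s0. A minimal sketch (our own illustrative code, not from the paper) of the share map and its analytic inverse:

```python
import math

def logit_shares(delta):
    """Logit market shares s_j = exp(delta_j) / (1 + sum_k exp(delta_k)),
    with an outside good of mean utility 0. Returns (inside shares, outside share)."""
    denom = 1.0 + sum(math.exp(d) for d in delta)
    return [math.exp(d) / denom for d in delta], 1.0 / denom

def invert_logit_shares(shares_inside, share_outside):
    """Berry (1994) inversion: delta_j = ln s_j - ln s_0. The recovered delta_j
    are linear in the data, so xi_j = delta_j - x_j * beta is linear in beta."""
    return [math.log(sj) - math.log(share_outside) for sj in shares_inside]
```

A round trip through `logit_shares` and `invert_logit_shares` recovers the mean utilities exactly, which is what makes the linear-in-parameters treatment of the {ξj} possible.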
The “cost” of these practical advantages is that the model with tastes for
products imposes structure that can be counterintuitive. In particular that
model implies
• that there is a limit on substitution possibilities between products, and
• that each individual’s utility increases to infinity as the number of new products grows, regardless of the observed characteristics of the products which enter.
Perhaps the earliest statement of the applied problems that can result
from the model’s limit on substitution possibilities was McFadden’s “red bus
blue bus” problem. That is, when we introduce a new product that is virtually identical to an existing product (say product “A”) we expect the combined market shares of the new product and product “A” to be approximately the same as was the market share of product “A” before we introduced the
new product. The fact that we introduce a “new characteristic” with the
new product insures that the model will predict a larger combined share
than that (consumers who value the new characteristic enough will switch
from products other than “A” to the new product).
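The point can be seen numerically in the logit model (an illustrative calculation of ours, with an arbitrary mean utility of 1): duplicating a product raises the combined share well above the original share, because the duplicate arrives with its own i.i.d. taste draw.

```python
import math

def logit_shares(delta):
    """Logit market shares with an outside good of mean utility 0."""
    denom = 1.0 + sum(math.exp(d) for d in delta)
    return [math.exp(d) / denom for d in delta]

# Product "A" alone, with mean utility 1: share e / (1 + e) ~ 0.73.
share_A_before = logit_shares([1.0])[0]

# Add an exact duplicate of "A". In the model with i.i.d. taste shocks the
# combined share rises to 2e / (1 + 2e) ~ 0.84, contrary to the red bus /
# blue bus intuition that it should stay near 0.73.
s = logit_shares([1.0, 1.0])
combined_after = s[0] + s[1]
```

The combined share grows because consumers who draw a large shock for the duplicate switch away from the outside good and from other products, exactly the mechanism described in the text.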
A related implication of importance to industrial organization is that as
goods whose characteristics closely resemble those of good A are added to
the system their markups will stay bounded away from zero (a similar point
is made by Anderson, DePalma, and Thisse (1992), in the context of the
logit model). This fact might lead us to worry about the implications of
the model with tastes for products on the incentives for product development, at least in markets with large numbers of products. In contrast as the
number of products in our “pure characteristic” model increases the products themselves will become increasingly good substitutes for one another
and oligopolistic competition will approach the competitive case, with prices
driven toward marginal cost.
The model’s implication that each consumer’s utility must grow without
bound as we increase the number of products marketed, no matter the characteristics of the products, is a particular concern for using models with tastes
for products to structure the analysis of both prospective and retrospective
gains from product introductions. That is, since we know the global implications of the model with tastes for products are a priori unreasonable in this context, we worry that when we use the model to extrapolate outside of the range of the data, as we must do to measure welfare gains from new products (see the introduction), we will obtain results which are not meaningful.
The finite-dimensional pure characteristics model is very different in this
respect. In the pure characteristics model the agent’s increment in utility
from a new product introduction is limited by a continuous (usually linear)
function of the distance (in characteristic space) between the new and previously existing products. Moreover, since under reasonable regularity conditions, if the environment does not change, the characteristic space itself “fills up” as the number of products grows large[9], the model implies that the
incremental benefits that a consumer is likely to gain from new product introductions will decline in the number of products (usually at an exponential
rate). In particular these gains to “variety” will be bounded in the pure characteristics model whereas, as noted above, they must grow without bound in
models with tastes for products.
We should note that recent empirical work indicates that the impact of
[9] The caveat on the environment is to rule out either technological changes, or changes in competing and complementary products, which alter the relative benefits of producing in different parts of the characteristic space.
the additive component with full support on the substitution and welfare
implications of estimators from actual data sets is mitigated, often markedly,
when household level data is available (e.g. Petrin (2002), and MicroBLP). So
one possible response to the problems induced by using models with tastes
for products is to keep the additive component with full support but use
data which enables a larger role for variation in household attributes that
load directly on the observed product characteristics.
This paper considers an alternative: we do away with the additive component entirely. The model we obtain by dropping the additive component is a limiting case of the model with tastes for products, where limits are taken letting the relative importance of variation in preferences for non-product-specific characteristics to variation in preferences for the products per se grow without
bound. This helps explain both the empirical results discussed above, and
why we expect our alternative will, at the very least, give some indication of
the robustness of the results of models with tastes for characteristics to the
presence of the additive component with full support.
3 Finite Dimensional Models.
We begin with a brief review of the literature on pure characteristic models. The theoretical literature on these models includes the Hotelling model
of competition on a line, the “vertical” model of product differentiation of
Mussa and Rosen (1978) (see also Gabszewicz and Thisse (1979) and Shaked
and Sutton (1982)) and Salop’s (1979) model of competition on a circle. In all these models, demand is determined by the location or “address”
of the products in the characteristics space, and an exogenous distribution of
consumer preferences over this space. As the number of products increases,
the product space fills up, with products becoming very good substitutes for
one another.
The simplest finite dimensional model is the vertical model and it was
the finite dimensional model first brought to data (in Bresnahan (1987)); later applications include Greenstein (1996). Since we will want to explicitly allow
for unobserved product characteristics we use a specification of the vertical
model which appears in Berry (1994)
uij = Xj β − αi pj ,   (5)
where the unobserved and observed characteristics both enter through Xj =
(xj , ξj ). In this model the quality of the good is the single index
δj ≡ Xj β   (6)
and its value increases over the entire real line. All consumers agree on
this quality ranking. The reason consumers differ in their choices is that
different consumers have different marginal utilities of income (this generates
the differences in their coefficients on price).
Other examples of pure characteristics models are given in Caplin and
Nalebuff (1991), and Anderson, DePalma, and Thisse (1992). These include
the “ideal point” models in which consumers care about the distance between
their location (νi ) and the products’ location (Xj ) in R^k :
uij = ‖Xj − νi ‖ − αi pj ,   (7)
where ‖ · ‖ is some distance metric. A special case of this is Hotelling’s model of competition on the line.
If we interpret ‖ · ‖ as Euclidean distance, expand (7), and eliminate individual specific constant terms that have the same effect on all choices (since these do not affect preference orderings), this last specification becomes
uij = Xj βi − αi pj .   (8)
Equation (8) is a pure characteristics random coefficients model. It allows
consumers to differ in their tastes for different product characteristics, in
addition to differences in their marginal utility of income. Note that unlike the standard random coefficients model used in the econometric literature, equation (8) does not have “tastes for the products”.[10]
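The contrast with the model with tastes for products is easy to see in a small simulation of (8) (our own illustrative code and parameter values): because a duplicate product carries no new taste draw, it can only split demand with its twin, leaving every other product's share unchanged.

```python
import random

def pc_shares(products, consumers):
    """Market shares in the pure characteristics model of equation (8),
    u_ij = beta_i * x_j - alpha_i * p_j, with outside utility 0.
    `products` is a list of (x, p) pairs; `consumers` a list of (beta, alpha).
    Index 0 of the result is the outside option."""
    counts = [0] * (len(products) + 1)
    for beta, alpha in consumers:
        best, best_u = 0, 0.0
        for j, (x, p) in enumerate(products):
            u = beta * x - alpha * p
            if u > best_u:  # strict inequality: ties stay with the incumbent
                best, best_u = j + 1, u
        counts[best] += 1
    n = len(consumers)
    return [c / n for c in counts]

rng = random.Random(0)
consumers = [(rng.gauss(1.0, 0.5), abs(rng.gauss(1.0, 0.3))) for _ in range(2000)]

before = pc_shares([(1.0, 1.0), (2.0, 2.0)], consumers)
# Add an exact clone of product A. With no product-specific taste shock the
# clone gives every consumer exactly the utility of A, so the pair's combined
# share equals A's old share and product B's share is untouched.
after = pc_shares([(1.0, 1.0), (2.0, 2.0), (1.0, 1.0)], consumers)
```

This is the red bus / blue bus prediction the logit model fails to deliver: the duplicate and its twin jointly hold exactly the twin's original share.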
One final point. Typical differentiated product models of demand have consumers choosing from J competing products and an outside option. The outside option represents the utility the consumer would obtain were it not to spend any of its income on the goods in this market. It is this outside option which enables us to study aggregate demand for the goods in the market. We label the outside option as the 0 option and its utility as ui,0 .[11] The choice
model in equation (8) is, then, a choice between J + 1 mutually exclusive
and exhaustive alternatives; J inside products and the outside option.
3.1 Constraints and Normalizations.
We introduce two constraints on the model in equation (8), then consider the required normalizations, and finally come back to a discussion of the assumptions underlying the form of the model we take to data.
[10] We note here that empirical work often finds it helpful to include in the xj a set of company or brand dummies. These dummies are used as proxies for unobserved aspects of quality which are common to goods marketed within a group.
[11] The specification actually used for ui,0 depends on the data available, but it is often allowed to depend on a slightly different set of characteristics than the inside options. This is easy to do by an appropriate choice of X values for the inside and outside options.
We assume that
• A1. there is only one unobserved product characteristic; i.e. X = (x, ξ) ∈ R^k × R^1 , and
• A2. that ξ is a “vertical” characteristic in the sense that every individual would prefer more of it.
A1 implies that our model can be written as
uij = xj βi − αi pj + λi ξj ,   (9)
which would be identical to the model used in Das, Olley, and Pakes (1995)
were we to omit their i.i.d. disturbances with full support. A2 implies that
λi > 0 for all i.
Since the utility functions of consumers can only be identified up to a
monotone (in our case affine) transformation, theory implies that we can
take each consumer’s utility from every choice and
• multiply them by a consumer specific positive constant, and
• add to them a consumer specific constant.
We add −ui,0 to the utility of each choice so that the utility of the outside
option is zero and the utility of the inside options should be interpreted as
their utility relative to that of the outside option. Our second normalization
is to divide each ui,j by λi (so that the coefficient of ξ is one for all consumers).
Imposing these normalizations and (for notational simplicity) reinterpreting
the characteristic values for option j to be the value of the j th option for that
characteristic minus the value of the outside option for that characteristic,
we have
uij = xj βi − αi pj + ξj ,   (10)
and
ui,0 = 0.
This is identical to the model in BLP without their additive component with
full support. Keep in mind that our change of notation implies that our new
ξ variables represent the difference in the unobserved quality of the inside
options from that of the outside option.[12]
[12] This becomes important when we try to measure welfare gains over time, as we are implicitly doing when we construct a price index. This is because the difference in the
Though these normalizations would suffice if the {ξ} were observed, with unobserved {ξ} they leave an econometric model that is under-identified by a scale factor; i.e. we need to choose units for the unobservable.[13] In the Monte Carlo work at the end of the paper we set the mean of the αi to 1, so the unobserved quality’s units are in terms of the mean price coefficient.
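The scale invariance behind footnote 13 is easy to verify directly (an illustrative check of ours, with arbitrary parameter values): multiplying ξj , αi , and βi by the same positive constant scales every utility in (10), including the normalized outside utility of zero, by that constant, so no choice changes.

```python
def choice(x, p, xi, beta, alpha):
    """Chosen option in the normalized model u_ij = x_j*beta - alpha*p_j + xi_j
    of equation (10), with outside utility 0 (returned as index -1)."""
    best, best_u = -1, 0.0
    for j in range(len(x)):
        u = x[j] * beta - alpha * p[j] + xi[j]
        if u > best_u:
            best, best_u = j, u
    return best

x, p, xi = [1.0, 2.0], [1.0, 3.0], [0.5, -0.2]
c = 7.0  # any positive constant; scaling (xi, beta, alpha) by c scales u by c
same_choices = all(
    choice(x, p, xi, b, a) == choice(x, p, [c * v for v in xi], c * b, c * a)
    for b, a in [(1.0, 1.0), (0.3, 2.0), (-0.5, 0.1)]
)
```

Because only the units of the unobservable change, some normalization such as fixing the mean of the αi is required before estimation.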
Our previous work emphasized the reasons for (and the empirical importance of) allowing for unobserved product characteristics in estimating discrete choice demand systems (see Berry (1994), BLP and Berry, Levinsohn,
and Pakes (2004)). The unobserved product characteristics are the analogs
of the disturbances in the standard demand model, i.e. they aggregate up
the impact of all the factors that are unobserved to the econometrician but
affect demand. Without them it will typically be impossible to find parameter values that make the implications of the model consistent with the data. Moreover, if we incorporate them and consider any realistic model of market equilibrium, we are led to worry about a simultaneous equations bias resulting from price being a function of the ξ.
Of course in reality there may be more than one unobserved characteristic that is omitted from the empirical specification and, provided consumers vary in their relative preferences for the different unobserved characteristics, the model in (10) is, at least strictly speaking, misspecified. Though allowing for one unobserved factor seemed to be particularly important in enabling us to obtain a demand model with realistic implications for own and cross price elasticities with the type of data typically used to estimate demand systems, our ability to estimate “multi-unobserved factor” discrete choice models with such data seems extremely limited.[14]
[12] (cont.) average estimated level of ξ across periods can be due either to a change in the average unobserved quality of the products in the market being analyzed, or to a change in the average value of the outside alternative. For more details, and an attempt to decompose movements in ξ into its components, see Pakes, Berry, and Levinsohn (1993) and Song (2004).
[13] Multiply all ξj , αi , and βi by the same positive constant and the implications of the model are not changed.
[14] We note here that this is likely to change when richer data is available. For example, ?, and a related literature in the field of marketing, successfully estimate multiple unobserved product characteristics from data that observes the same consumers making a repeated set of choices, while Heckman and Snyder (1997) have used multi-factor discrete choice models to analyze political choice.
As noted above, the fact that the model with a single unobserved factor might provide good fits “in sample” does not imply that it delivers meaningful welfare measures. On the other hand, if we allow for as many unobserved factors as there are products, then the pure characteristics model with multiple unobserved characteristics has the traditional models with tastes for
products as a special case. So there is a sense in which the pure characteristics model with one unobserved characteristic is an opposite extreme to the model with tastes for products. The two models, then, might be thought of as providing bounds for the impacts of unobserved product heterogeneity.
4 Estimating the Model.
The estimation issues that arise when estimating the pure characteristics
model are similar to the issues that arise when estimating more traditional
discrete choice models. As a result we use the techniques in BLP and Berry,
Levinsohn, and Pakes (2004), and the vast literature cited therein, as starting
points. There are, however, four modifications of those techniques we will need in order to develop an estimator for the pure characteristics model. The modifications provide
1. a method of calculating the aggregate market share function conditional
on the vectors of characteristics and parameters (θ),
2. an argument that proves existence of a unique ξ vector conditional on
any vector of model parameters and observed market shares,
3. an algorithm for computing that ξ vector, and
4. a limiting distribution of the estimated parameter vector.
It will work out that the modifications we use to accomplish these tasks
imply that the pure characteristics model encounters different computational tradeoffs than the model with tastes for products, and the difference will play
out differently on different types of data.
4.1 Computing Market Shares.
In the model with product-specific tastes, market shares can be calculated by
a two step method. The first step conditions on preferences for the product
characteristics (the βi ) and integrates out the product-specific tastes. This
provides market shares conditional on the βi . When the additive product-specific tastes have a “logit” form, the market shares conditional on the βi have an analytic form, so that there is no approximation error in calculating them, and they are a smooth function of the βi . The second step follows Pakes (1986) and uses simulation to provide an approximation to the integral defining the expectation (over βi ) of the conditional market shares (i.e. to the aggregate market share).
When there are no additive product-specific tastes we must compute market
shares in a different way. A simple two-step replacement is to use the
structure of the vertical model to integrate out one of the dimensions of
heterogeneity in the pure characteristics model, thereby producing market
shares conditional on the rest of the βi, and then use the suggestion in Pakes
(1986) again to compute the aggregate share. Given an appropriate distribution
for the dimension of heterogeneity integrated out in the first step (see
below), this procedure produces a smooth objective function and ensures that
one of the components of heterogeneity does not generate an approximation
error.
To see how to do this we first consider the simple vertical model

uij = δj − αi pj ,     (11)

for j = 1, . . . , J, where δj is product "quality"

δj = xj β + ξj ,     (12)

and ui,0 = 0.
Order the goods in terms of increasing price. Then good j is purchased
iff uij > uik , ∀k ≠ j, or equivalently

δj − αi pj > δk − αi pk ⇒ αi (pj − pk ) < δj − δk , ∀k ≠ j.     (13)

Recall that (pj − pk) is positive if j > k and negative otherwise. So a consumer
endowed with αi will buy product j iff

αi < min_{k<j} (δj − δk )/(pj − pk ) ≡ ∆̄j (δ, p), and

αi > max_{k>j} (δk − δj )/(pk − pj ) ≡ ∆j (δ, p),     (14)

where ∆̄j denotes the upper and ∆j the lower cutoff.
These formulas assume that 0 < j < J. However if we set

∆̄0 = ∞, and ∆J = 0,     (15)

they extend to the j = 0 (the outside good) and j = J cases.
If the cdf of α is F (·), then the market share of product j is

sj (x, p, ξ; θ, F ) ≡ [F (∆̄j (x, p, ξ)) − F (∆j (x, p, ξ))] · 1[∆̄j > ∆j ],     (16)

where here and below 1[·] is the indicator function for the condition in the
brackets and θ is a vector containing all unknown parameters.
If ∆̄j ≤ ∆j , then sj (·) = 0. Since the data have positive market shares,
the model should predict positive market shares at the true value of the
parameters. Note that the vertical model behaves differently than does the model
with tastes for products; the latter never predicts zero market shares for any
parameter value. In the vertical model any parameter vector which generates
an ordering that leaves one product with a higher price but lower quality
than some other product predicts a zero market share for that product.
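As a concrete illustration, the cutoffs in (14) and the share formula (16) can be computed directly. The sketch below is ours, not the authors' code; the function name, interface, and the uniform cdf used in the example are illustrative assumptions.

```python
import numpy as np

def vertical_shares(delta, p, F):
    """Vertical-model shares via the cutoffs of eq. (14) and formula (16).

    delta and p include the outside good at index 0 (delta[0] = 0, p[0] = 0),
    and goods are ordered by increasing price.  F is the cdf of alpha_i.
    Illustrative sketch; names and interface are not from the paper.
    """
    J = len(delta) - 1
    shares = np.zeros(J)
    for j in range(1, J + 1):
        # upper cutoff: min over cheaper goods k < j of (delta_j - delta_k)/(p_j - p_k)
        upper = np.min((delta[j] - delta[:j]) / (p[j] - p[:j]))
        # lower cutoff: max over dearer goods k > j of (delta_k - delta_j)/(p_k - p_j)
        lower = (np.max((delta[j + 1:] - delta[j]) / (p[j + 1:] - p[j]))
                 if j < J else 0.0)
        if upper > lower:                  # otherwise the predicted share is zero
            shares[j - 1] = F(upper) - F(lower)
    return shares
```

For example, with α uniform on (0, 2), two inside goods with δ = (1, 2) and p = (0.5, 1.5) split the market evenly between them.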
4.1.1 The Extension to K Dimensions.
Recall that the difference between the vertical model and the pure characteristics model is that in the pure characteristics model characteristics other
than price can have coefficients which vary over consumers (the βi ). However
if ui,j = xj βi − αi pj + ξj then, conditional on βi , the model is once again a
vertical model with cut-off points in the space of αi (but now the quality
levels in those cut-offs depend on the βi ). So to obtain market shares in this
case we do the calculation in (16) conditional on the βi , and then integrate
over the βi distribution.
More precisely we begin as before by ordering the goods by their price.
Then for a fixed β we can compute the cut-off points ∆̄j (x, p, ξ, β) and
∆j (x, p, ξ, β), so the market share function is

sj (x, p, ξ; θ, F, G) ≡ ∫ [F (∆̄j (x, p, ξ, β)|β) − F (∆j (x, p, ξ, β)|β)] · 1[∆̄j (·, β) > ∆j (·, β)] dG(β),     (17)

where F (·|β) is the cdf of α given β and G(·) is the cdf of β.
That is, the pure characteristics model's market share function can be
expressed as a mixture of the market share functions of pure vertical
models. The conditioning argument used here avoids the difficult problem
of solving for the exact region of the βi space on which a consumer prefers
product j.15 It does, however, produce an integral which is typically not
analytic. So we use a simulation estimator to approximate it. That is, we
obtain an unbiased estimator of the integral by taking ns draws from the
distribution G of the random coefficients β and then calculating
sj (x, p, ξ; θ, F, Gns ) ≡ (1/ns) Σi [F (∆̄j (x, p, ξ, βi )|βi ) − F (∆j (x, p, ξ, βi )|βi )] · 1[∆̄j (·, βi ) > ∆j (·, βi )],     (18)

15 Feenstra and Levinsohn directly calculate the region of integration, Aj ⊂ RK , such
that if (β, α) ∈ Aj then good j is purchased, but this becomes quite complicated.
where Gns (·) is notation for the empirical distribution of the simulated βi .
A number of computational points are worth noting here. First, if F (·)
has a density (with respect to Lebesgue measure) which is a differentiable
function of the parameter vector, then the market share function is a continuously
differentiable function of the parameter vector (regardless of the form
of the simulator used for the empirical distribution Gns (·)).16
Second, the calculation of market shares is simplified by noting that a necessary
and sufficient condition for the indicator function's condition for product j to be one,
conditional on a particular value for β, is that maxq<j ∆q (·, βi ) <
∆j (·, βi ) (recall that the j-ordering is the price ordering). As a result, for a
given value of β our program first computes the {∆j (·, β)}, then drops those
goods for which the ∆(·, β) are "out of order", and then computes the shares.
Finally, note that unless the distribution of β is discrete, in which case
a simple mean similar to the mean in (18) will provide an exact calculation
of the share, the generalization from the vertical model to the pure characteristics
model introduces simulation error into the estimates (just as the
generalization from the simple logit to the random coefficient logit in BLP
adds simulation error).
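The two-step computation in (17)-(18), condition on a draw of β, apply the vertical-model cutoffs, then average over draws, can be sketched as follows. This is our illustration rather than the authors' code; for simplicity it assumes a single non-price characteristic and that α is distributed independently of β, so F (·|β) = F (·).

```python
import numpy as np

def pure_char_shares(x, p, xi, beta_draws, F):
    """Simulated pure-characteristics shares, eq. (18).

    x, p, xi include the outside good at index 0 (all zeros there), and
    goods are ordered by increasing price.  For each draw beta_i we form
    the conditional qualities delta_j = x_j * beta_i + xi_j, compute the
    vertical-model cutoffs of eq. (14), and average the conditional
    shares over the ns draws.  Illustrative sketch only.
    """
    J = len(p) - 1
    shares = np.zeros(J)
    for beta in beta_draws:
        delta = x * beta + xi              # delta[0] = 0 for the outside good
        for j in range(1, J + 1):
            upper = np.min((delta[j] - delta[:j]) / (p[j] - p[:j]))
            lower = (np.max((delta[j + 1:] - delta[j]) / (p[j + 1:] - p[j]))
                     if j < J else 0.0)
            if upper > lower:
                shares[j - 1] += F(upper) - F(lower)
    return shares / len(beta_draws)
```

Because each conditional share is a smooth function of the cutoffs, the simulated share is smooth in the parameters even though the empirical measure of the β draws is discrete.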
4.2 Existence and Uniqueness of the ξ(·) Function.
Recall that BLP proceed in three steps. First they show that, provided mild
regularity conditions are met, their model associates a unique ξ(·) with any
triple consisting of a vector of parameters, a vector of observed shares, and a
distribution over individual characteristics. They then provide a contraction
mapping which computes the ξ(·). Finally they make an identifying assumption
on the joint distribution of (ξ, x), and estimate by finding that value of θ
that makes the theoretical restrictions implied by the identifying assumption
as "close as possible" to being satisfied.17
We will mimic those steps. Thus our first task is to show that for every
θ and distribution of consumer characteristics there is a unique value of ξ
which makes the model's predicted shares just equal to the shares observed

16 This assumes that the random draws underlying the simulator are not changed as we
change the parameter vector, as is required for the statistical properties of the estimator;
see below.
17 BLP assume that E[ξ|x, θ0 ] = 0, and look for a θ that makes the covariance of the
computed ξ(·)'s and various functions of x as close as possible to zero. Clearly if other
assumptions on the joint distribution of (ξ, x) were appropriate, they could be used instead.
in the data, our so . To accomplish this task we shall assume (as do
BLP) that so , the (J + 1)-dimensional vector of observed market shares, is in
the interior of the J-dimensional unit simplex (all market shares are strictly
between zero and one).18 To simplify notation, let s(θ, ξ) ≡ s(x, p, ξ; θ, F, G)
for any fixed (F, G, x, p); s(θ, ξ) is the vector of shares predicted by the pure
characteristics model.
Consider the system of J + 1 equations
s(θ, ξ) = so .
(19)
Our goal is to provide conditions under which, given the normalization ξ0 = 0,
this system has exactly one solution, ξ(θ, so ).
Let the discrete choice market share, as a function of all unobserved
product characteristics (including that of the outside alternative), be
sj (ξj , ξ−j , 0),
(20)
where ξj is the own-product characteristic, ξ−j is the vector of rival-product
characteristics and 0 is the (normalized) value of the characteristic of the
outside good.
Now define the "element-by-element" inverse for product j, rj (sj , ξ−j ), as

sj (rj , ξ−j , 0) = sj .     (21)
The vector of element-by-element inverses, say r(s, ξ), when viewed as a
function of ξ, maps RJ into RJ . It turns out to be more convenient to work
with a fixed point defined by the element-by-element inverse than to work
directly with the system of equations defined by (19). In particular, the
inverse of the market share function (i.e. ξ(·)) exists and is unique if there
is a unique solution to the fixed point

ξ = r(s, ξ).     (22)
Theorem 1 Suppose the discrete choice market share function has the following properties:
1. Monotonicity. sj is weakly increasing and continuous in ξj and weakly
decreasing in ξ−j . Also, for all ξ−j there must be values of ξj that set
sj arbitrarily close to zero and values of ξj that set sj arbitrarily close
to one.

18 If this were not the case, that is if the qth product had a zero market share, and if we
found one ξ vector that fit the observed market shares, then any other ξ vector with the
same components for ξ−q but a lower ξq would also fit the observed shares.
2. Linearity of utility in ξ. If the ξ for every good (including the outside
good) is increased by an equal amount, then no market share changes.
3. Substitutes with Some Other Good. Whenever s is strictly between 0
and 1, every product must be a strict substitute with some other good.
In particular, if ξ′ ≤ ξ, with strict inequality holding for at least one
component, then there is a product (j) such that

sj (ξj , ξ−j , 0) < sj (ξj , ξ′−j , 0).     (23)

Similarly, if ξ′ ≥ ξ, with strict inequality holding for at least one component,
then there is a product (j) such that

sj (ξj , ξ−j , 0) > sj (ξj , ξ′−j , 0).     (24)
Then, for any market share vector s that is strictly interior to the unit simplex:
(i) an inverse exists, and (ii) this inverse is unique.
♠
Comments.
1. The theorem is true independent of the values of (θ, F, G, x, p) that go
into the calculation of s(·), provided those values imply an s(·) function which
satisfies conditions (1) and (3). It is easy to verify that those conditions will
be satisfied for any finite θ as long as F has a density which is positive on the
real line (a.e. β). In particular G(·) need not have a density (w.r.t. Lebesgue
measure), and indeed the simulated Gns (·) we typically use in computation
will not have a density.
2. The Linearity assumption is redundant if we stick with the model in
(10). We include it here just to make it clear that, provided the utility from
a particular choice is additive in ξ, we do not need additivity in the rest of
the variables that appear in the utility function. That is, a non-linear model
that was linear only in ξ would have exactly the same properties as those
developed here, provided it satisfied the usual identification conditions.
Proof. Existence follows from the argument in Berry (1994). Our first step
in proving uniqueness is to show that the map r(s, ξ) is a weak contraction
(a contraction with modulus ≤ 1), a fact which we use later in computation.
Take any ξ and ξ′ ∈ RJ and let ‖ξ − ξ′‖sup = d > 0. From (21) and
Linearity

sj (rj + d, ξ−j + d, d) = sj .     (25)

By Monotonicity

sj (rj + d, ξ′−j , 0) ≥ sj ,     (26)
and by (3) there is at least one good, say good q, for which this
inequality is strict (any good that substitutes with the outside good). By
Monotonicity, this implies that for all j,

r′j ≤ rj + d     (27)

with strict inequality for good q. A symmetric argument shows that the
condition

sj (rj − d, ξ−j − d, −d) = sj     (28)

implies that for all j,

r′j ≥ rj − d     (29)

with strict inequality for at least one good. Clearly then ‖r(s, ξ′) − r(s, ξ)‖sup ≤
d, which proves that the inverse function is a weak contraction.
Now assume that both ξ and ξ′ satisfy (22), and that ‖ξ − ξ′‖sup = κ > 0,
i.e. that there are two distinct solutions to the fixed point. In particular
let ξ′q − ξq = κ (exchanging the labels of ξ and ξ′ if necessary). Without loss
of generality assume that q substitutes with the outside good (if this were not
the case then renormalize in terms of the good that substitutes with q and
repeat the argument that follows). From above, sq (rq + κ, ξ′−q , 0) > sq . But
this last expression equals sq (ξ′q , ξ′−q , 0), which, since ξ′ is a fixed point,
equals sq , a contradiction.
♠
4.3 Computation of ξ(·).
We need a procedure which inverts the market-share function from the pure
characteristics model to find ξ(·). BLP provide a contraction mapping which is
guaranteed to find the unique ξ(·) that solves the system of equations in (19)
generated by their model. We begin by explaining the sense in which BLP's
methodology can be applied to our model (this also clarifies the relationship
between the two models). This requires the use of a limiting argument. Unfortunately,
when we try to use an approximation to this limiting argument
to find the ξ(·) of the pure characteristics model, we incur numerical problems.
So we provide two alternative techniques for finding ξ(·). Each of these
also has its problems. Indeed, as is explained below, we have come to the
conclusion that the three techniques are best used in combination.
The model in BLP is obtained from the model in (10) by adding εi,j to
each alternative choice and assuming that they distribute as independent
(over both i and j) type II extreme value deviates. We generalize that model
slightly and assume that the extreme value deviates in BLP are scaled by a
factor σε , i.e.

uij = xj βi − αi pj + ξj + σε εi,j ,     (30)

while for the outside alternative, ui,0 = σε εi,0 . Note that if σε = 0 we are
back to the pure characteristics model, while if σε = 1 we have the model in
BLP.
If we assume σε > 0, we can multiply equation (30) by

µε ≡ 1/σε ,

and BLP's arguments imply that if (αi , βi ) are draws from (F (α|β), G(β)),
a consistent (as the number of simulation draws grows large) simulator for
this model's shares is

sj = (1/ns) Σi exp [(xj βi − αi pj + ξj )µε ] / (1 + Σq exp [(xq βi − αi pq + ξq )µε ]).     (31)
If µε is very large the shares generated by this model are similar to the
shares generated by the pure characteristics model. Indeed a dominated
convergence argument implies that as µε → ∞ the shares in (31) converge
to the shares from our model (as calculated in (18)).19 This is intuitive. As
µε → ∞ (σε → 0) the variance in the product-specific tastes, relative
to the variance in the preferences for the x, goes to zero, so the model in
(30) converges to a model without product-specific tastes.
So one technique for computing ξ is to choose an increasing sequence of
{µε }, use the contraction mapping in BLP to compute the ξ(·; µε ) which
solves equation (31) for each one, and stop the procedure when the shares
are "close enough" to the pure characteristics model's shares. Unfortunately,
numerical experimentation with this procedure indicates that we sometimes
require a very large µε before we obtain an adequate approximation to the
pure characteristics model's shares (see below). Indeed the µε required can,
and sometimes does, generate values for exp[µε (·)] that are too large for the
computer to calculate. This makes it impossible to use this technique as the
sole inversion technique in our estimation algorithm.
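To make the µε-sequence concrete, here is a sketch of the scaled-logit simulator in (31) together with a scaled version of BLP's contraction, δ ← δ + µε⁻¹(ln so − ln s(δ)). Both the code and the (1/µε)-scaled step are our illustration of the idea, not the authors' implementation; the row-max subtraction only partially mitigates the exp[µε(·)] overflow discussed above.

```python
import numpy as np

def scaled_logit_shares(delta, nu, alpha, x, p, mu):
    """Simulated shares from eq. (31): extreme-value errors scaled by 1/mu."""
    # v[i, j] = mu * (delta_j + nu_i * x_j - alpha_i * p_j), where delta_j
    # carries the mean utility x_j*beta + xi_j and nu_i the taste deviation
    v = mu * (delta[None, :] + nu[:, None] * x[None, :]
              - alpha[:, None] * p[None, :])
    m = np.maximum(v.max(axis=1, keepdims=True), 0.0)    # overflow guard
    num = np.exp(v - m)
    denom = np.exp(-m) + num.sum(axis=1, keepdims=True)  # outside good adds exp(0)
    return (num / denom).mean(axis=0)

def blp_contraction(s_obs, share_fn, mu, delta0, tol=1e-13, max_iter=10000):
    """delta <- delta + (1/mu) * (ln s_obs - ln s(delta)); fixed point at s = s_obs."""
    delta = delta0.copy()
    for _ in range(max_iter):
        step = (np.log(s_obs) - np.log(share_fn(delta))) / mu
        delta = delta + step
        if np.max(np.abs(step)) < tol:
            break
    return delta
```

With µε = 1 and no taste heterogeneity this reduces to the familiar BLP logit contraction; as µε grows, the utilities inside the exponentials scale up and the iteration approximates the pure characteristics inversion.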
So we look for an alternative method of computing the ξ(·). One obvious
alternative is to use the element-by-element inverse, r(s, ξ), shown to lead
to a weak contraction in the proof of the theorem in section 4.2. If the
weak contraction had modulus that was strictly less than one, this would be
guaranteed to converge to the fixed point at a geometric rate. Unfortunately,
we have been unable to prove that the modulus is strictly less than one, and
in Monte Carlo exercises we find that it sometimes contracts very slowly.20

19 The difference between the integrand in this model and the indicator for choice j can
be made smaller than any ε > 0 for almost every (αi , βi ) and is bounded by one for all (αi , βi ).
If the parameter space is compact a covering argument proves that the convergence is
uniform in θ.
We therefore move to solution techniques that are likely to be more computationally
intensive but will converge; in particular, techniques based on
a homotopy [cites ??]. Starting with the standard fixed-point homotopy we
have

h(ξ, t, ξ0 ) = (1 − t) ∗ (ξ − ξ0 ) + t ∗ (ξ − r(s, ξ)),

where
• t is a parameter that takes values between zero and one,
• ξ0 is an initial guess for ξ, and
• the function r(·) returns the element-by-element inverse of the market
share function (see equation (21)).
For each value of t, consider the value of ξ that sets h(ξ, t, ξ0 ) to zero. Call
this ξ(t, ξ0 ). For t = 0, this solution is trivially the starting guess ξ0 . For
t = 1, this solution is the fixed point that we are looking for. Homotopy
methods suggest starting at t = 0, where the solution is trivial, and slowly
moving t toward one. The series of solutions ξ(t) should then move toward
the fixed-point solution ξ(1). If t is moved slowly enough, then by continuity
the new solution should be close to the old solution and therefore "easy" to
find (say by a Newton method starting at the prior solution).
In our problem a version of the fixed-point homotopy is a strong contraction
when t < 1, making it easy to compute the ξ for successive t. That is, the
homotopy implies that

ξ(t, ξ0 ) = (1 − t) ∗ ξ0 + t ∗ r (s, ξ(t, ξ0 )) ,     (32)

which, when viewed as an operator which takes RJ into itself, is a contraction
mapping with modulus less than t. This suggests a recursive solution method,
taking an initial guess, ξ, for the solution ξ(t, ξ0 ) and then placing this guess
on the RHS of (32) to create a new guess, ξ′:

ξ′ = (1 − t) ∗ ξ0 + t ∗ r(s, ξ).     (33)
For t < 1 this is a strict contraction mapping (the proof relies on the
weak-contraction property of the function r), and use of the recursion provides a
fast, reliable way of finding ξ. As t approaches one, the modulus of contraction
for the fixed point in (32) also approaches one. Thus when t is very close to one
we may not be able to rely on the contraction property, but must instead use a
Newton method to zero the equation. In practice, we find that it is sometimes
necessary to move t very slowly as it approaches one, and this can increase
the computational burden of the technique greatly.
The solution method that has worked best for us is to begin with the
contraction in equation (32), using a value for ξ0 that is output from the
model in equation (31), the contraction in BLP, and some large value of
our µε . We then switch to a Newton method if successive iterations do not
improve our fit and our equality criteria are not satisfied.

20 This contrasts markedly with the contraction used in the random coefficients logit
specification given in BLP; in their work and in related work by many others, there was
never any problem in using their contraction to find the inverse.
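The damped iteration on (33) over a grid of t values can be sketched as follows. This is our reconstruction of the idea; the grid, the tolerance, and the stopping rule are illustrative choices, not the authors', and the Newton fallback for t near one is noted but omitted for brevity.

```python
import numpy as np

def homotopy_invert(r, xi0, t_grid, tol=1e-12, max_iter=5000):
    """Follow the fixed-point homotopy of eqs. (32)-(33).

    For each t in t_grid, iterate xi <- (1 - t) * xi0 + t * r(xi); when r is
    a weak contraction this map has modulus at most t, so each subproblem
    converges quickly.  The solution path xi(t) moves from the initial guess
    xi0 (t = 0) to the desired fixed point of r (t = 1).  Here r is the
    element-by-element inverse r(s, .) with the observed shares held fixed.
    Illustrative sketch; in practice a Newton step on h(xi, t, xi0) = 0
    replaces the recursion as t approaches one.
    """
    xi = xi0.copy()
    for t in t_grid:                      # e.g. np.linspace(0.5, 1.0, 20)
        for _ in range(max_iter):
            xi_new = (1.0 - t) * xi0 + t * r(xi)
            done = np.max(np.abs(xi_new - xi)) < tol
            xi = xi_new
            if done:
                break
    return xi
```

Each intermediate solution serves as the starting value for the next, slightly larger, t, which is what keeps the later (nearly non-contractive) subproblems tractable.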
4.4 Limit Theorems.
This section summarizes results from Berry, Linton, and Pakes (2004). They
provide limit theorems for the parameter estimates from differentiated product
models both with and without tastes for products.21 The limit theorems
for the different models differ, and the differences affect their computational
properties. We focus on the resulting computational tradeoffs.
The differences in the limit properties of the estimators from the two models
stem from differences in the behavior of the inverse of the market share
function (of ξ(s, ·)) as J, the number of products, grows large. Consequently,
to explain our results we focus on the case where we have only one market
and one time period, and the appropriate limiting dimension is the number
of products.
For notational convenience redefine Pns (·) to equal the empirical measure
of the simulated distribution function of interest,22 and assume that our
estimate of so is obtained from the choices of a sample of consumers of size
n and therefore is denoted sn .
In both models the objective function minimized in the estimation algorithm
is a norm of

GJ (θ, sn , Pns ) = (1/J) Σj zj ξj (sn , x, p, θ, Pns ),     (34)

21 The actual form of the limit distributions for both models depends on the type of data
available. We will focus on estimation from product level data (defined as data on market
shares, prices, and characteristics, possibly augmented with the distribution of consumer
attributes, as in BLP), though similar types of issues arise when micro data is also available
(see Berry, Levinsohn, and Pakes (2004)).
22 Pns = F × Gns in our model and is the empirical measure of the simulation draws
on the random coefficients of the characteristics in BLP. This also corresponds to the
notation in BLP and in Berry, Linton, and Pakes (2004).
where
• the ξj are defined implicitly as the solution to the system

snj = sj (ξ, x, p; θ, Pns ),     (35)

• and the true ξ are mean independent of the z (which are typically
functions of x and cost shifters).
In both models the objective function, ‖GJ (θ, sn , Pns )‖, has a distribution
determined by three independent sources of randomness: randomness
generated from the draws on the vectors {ξj , x1j }, randomness generated
from the sampling distribution of sn , and that generated from the simulated
distribution Pns . Analogously there are three dimensions in which our sample
can grow: as n, as ns, and as J grow large. The limit theorems allow different
rates of growth for each dimension. Throughout we take pathwise limits, i.e.,
we write n(J) and ns(J), and let J → ∞. The rates at which n(J) and ns(J)
grow as J grows are chosen to ensure that: (i) the estimator has a limit
distribution with finite variance, and (ii) there is a non-zero contribution of all
sources of randomness to the variance of that estimator.
To provide the intuition underlying the major results write

ξ(θ, sn , Pns ) = ξ(θ, s0 , P0 ) + {ξ(θ, sn , Pns ) − ξ(θ, s0 , Pns )} + {ξ(θ, s0 , Pns ) − ξ(θ, s0 , P0 )}.     (36)

The impact of the sampling and simulation errors on the estimator can be
derived by using a Taylor series expansion to approximate the bracketed
terms in (36), and substituting the result into the objective function in (34).
The expansion is obtained by inverting the system in (35). Note that the
system being inverted has J equations, and J is the dimension that we let
grow in the limiting arguments. This is the reason that the proofs in Berry,
Linton, and Pakes (2004) are non-standard.
We require some additional notation. Let
• εn = sn − s0 , the sampling error, and
• εns (θ) = s[ξ(θ, s0 , Pns ), θ, Pns ] − s[ξ(θ, s0 , Pns ), θ, P0 ], the simulation error,
and assume that for every fixed J the function s(ξ, θ, P ) is continuously
differentiable in ξ, and its derivative has an inverse, say

H −1 (ξ, θ, P ) = {∂s(ξ, θ, P )/∂ξ′ }−1 .     (37)

H −1 (·) is a J × J matrix, so its dimension increases as we increase J in our
limiting arguments.
Then
ξ(θ, sn , P ns ) = ξ(θ, s0 , P 0 ) + Ho−1 (θ, s0 , P 0 ) {εn − εns (θ)} + r(θ, sn , P ns ),
(38)
where
• s0 (θ, s, P ) ≡ s(ξ(s, θ, P ), θ, P ) and Ho (θ, s, P ) ≡ H(ξ(s, θ, P ), θ, P ),
and
• r(θ, sn , Pns ) is the remainder term resulting from our application of
the mean value theorem.
Consequently our objective function can be rewritten as

GJ (θ, sn , Pns ) = GJ (θ, s0 , P0 ) + (1/J) z′ Ho−1 (θ, s0 , P0 ) {εn − εns (θ)} + (1/J) z′ r(θ, sn , Pns ),     (39)

where GJ (θ, s0 , P0 ) ≡ (1/J) Σj zj ξj (θ, s0 , P0 ). The limit theorems in Berry,
Linton, and Pakes (2004) work from this representation of GJ (θ, sn , Pns ).
To prove consistency they provide conditions which ensure that: (i) the
second and third terms in this equation converge to zero in probability
uniformly in θ, and (ii) an estimator which minimized ‖GJ (θ, s0 , P0 )‖ over
θ ∈ Θ would lead to a consistent estimator of θ0 . Asymptotic normality
requires, in addition, local regularity conditions and a limiting distribution
for Ho−1 (θ, s0 , P0 ) {εn − εns (θ)}.
The rate needed for this limit distribution depends on how the elements
of the J × J matrix Ho−1 (θ, s0 , P 0 ) grow, as J gets large. This differs for the
two classes of models. It is easiest to see the difference when we compare the
simple logit model to the simple vertical model.
In the simple logit model, ui,j = δj + εi,j , with the {εi,j } distributed
i.i.d. type II extreme value, and δj = xj β − αpj + ξj . Familiar arguments
show that sj = exp[δj ]/(1 + Σq exp[δq ]), while s0 = 1/(1 + Σq exp[δq ]). So
δj = ln[sj ] − ln[s0 ] and the required inverse is

ξj (θ) = (ln[sj ] − ln[s0 ]) − xj β + αpj .

Thus in this simple case

∂ξj /∂sj = 1/sj .
Now consider how randomness affects the estimate of ξj (θ). In the simple
logit model the only source of randomness is in the sampling distribution
of so . That is, we observe the purchases of only a finite random sample of
consumers. Letting their shares be sn we have sn − so = εn . The first order
impact of this randomness on the value of our objective function at any θ
will be given by

Ho−1 (θ, s0 ) × εn = (∂ξ/∂s)|s=s0 × εn ,

which from above contains expressions like εnj (1/sj ). In the logit model as J →
∞, sj → 0 (for all but possibly a finite number of products). So if n were to
grow proportional to J and we multiplied the disturbance by √J, the impact
of any given sampling error would grow without bound as J grows large.
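The 1/sj amplification is easy to see numerically. In this illustrative calculation of ours (symmetric goods with δj = 0, so each logit share is 1/(J + 1)), the sensitivity of the inverse to sampling error grows linearly in J:

```python
import numpy as np

def logit_inverse(s, s0):
    """Logit inversion: delta_j = ln s_j - ln s_0, so d(delta_j)/d(s_j) = 1/s_j."""
    return np.log(s) - np.log(s0)

for J in (10, 100, 1000):
    s = np.full(J, 1.0 / (J + 1))   # symmetric logit shares when all delta_j = 0
    print(J, 1.0 / s[0])            # sensitivity 1/s_j, approximately J + 1
```

Any sampling noise in an observed share is thus scaled up by roughly J + 1 when mapped back into the unobservable, which is the mechanism behind the n, ns ∝ J² rate requirement discussed next.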
A similar argument holds for the estimator of BLP's model, only in this
more complicated model there are two sources of randomness whose impacts
increase as J grows large: sampling error and simulation error. Consequently
Berry, Linton, and Pakes (2004) require that both n and ns grow at rate J2
to ensure an asymptotically normal estimator of the parameter vector from
BLP's model. The computational implication is that for data sets with large
J one will have to use many simulation draws and large samples of purchasers
before one can expect to obtain an accurate estimator whose distribution is
approximated well by a normal with finite variance. However their Monte
Carlo results do show that with large enough n and ns the asymptotic
distribution is indeed very close to normal.
Now go back to the simplest pure characteristics model, the vertical
model, i.e., uij = δj − αi pj , with δj defined as above and with ui0 = 0. Order
the products so 0 = p0 < p1 < p2 < . . . , assume that 0 < δ1 < δ2 < . . . ,
and assume that ∆j ≡ (δj − δj−1 )/(pj − pj−1 ) is decreasing in j. These two
latter conditions are necessary and sufficient for all products to be purchased
(a fact we use below). If F (·) is the distribution of αi , then the market share
of good j, j = 1, . . . , J − 1 is

sj = F (∆j ) − F (∆j+1 ),   and   sJ = F (∆J ).
Thus the derivative matrix has entries of the form

∂sj /∂ξp = f (∆j ) ∂∆j /∂ξp − f (∆j+1 ) ∂∆j+1 /∂ξp ,

where f (·) is the density of F (·). Note that these derivatives contain terms
like 1/(pj − pj−1 ), which grow as products crowd the price spectrum, so, in
contrast to BLP's model, the elements of the inverse derivative matrix are
likely to get smaller as J increases. That is, in the pure characteristics model
the impact of sampling and simulation errors is diminished as J grows large.
In fact the precise behavior of these derivatives as J grows large depends
on the nature of the pricing equilibrium and just which products are introduced.
What Berry, Linton, and Pakes (2004) do is provide sufficient
conditions for all elements of the inverse matrix Ho−1 (·) to stay bounded as
J increases. Consequently, to obtain an asymptotically normal estimate of
the parameter vector in the vertical model, both n and ns need only grow at
rate J. That is, we should not need either as large a consumer sample, or as
many simulation draws, to obtain reasonable parameter estimates from the
vertical model.
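The adjacent-cutoff derivative matrix above can be assembled directly, making the 1/(pj − pj−1) terms explicit. The sketch below is ours, not the authors'; since ∂δp/∂ξp = 1, differentiating with respect to ξp is the same as differentiating with respect to δp.

```python
import numpy as np

def vertical_share_jacobian(delta, p, f):
    """Entries ds_j/d(delta_p) for the adjacent-cutoff vertical model above.

    Delta_j = (delta_j - delta_{j-1})/(p_j - p_{j-1}) with delta_0 = p_0 = 0,
    s_j = F(Delta_j) - F(Delta_{j+1}) for j < J and s_J = F(Delta_J);
    f is the density of F.  Goods are indexed 0..J-1 in the arrays.
    """
    J = len(delta)
    dp = np.diff(np.concatenate(([0.0], p)))        # p_j - p_{j-1}
    dd = np.diff(np.concatenate(([0.0], delta)))    # delta_j - delta_{j-1}
    Delta = dd / dp                                 # cutoffs Delta_1..Delta_J
    H = np.zeros((J, J))
    for j in range(J):
        H[j, j] += f(Delta[j]) / dp[j]              # via d Delta_j / d delta_j
        if j > 0:
            H[j, j - 1] -= f(Delta[j]) / dp[j]      # via d Delta_j / d delta_{j-1}
        if j < J - 1:
            H[j, j] += f(Delta[j + 1]) / dp[j + 1]  # via the -F(Delta_{j+1}) term
            H[j, j + 1] -= f(Delta[j + 1]) / dp[j + 1]
    return H
```

Shrinking price gaps make the entries of this matrix large and, correspondingly, the elements of its inverse small, which is the boundedness property exploited by Berry, Linton, and Pakes (2004).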
4.5 Computational Comparison to the Model with Tastes for Products.
Gathering the results of prior sections, we have two theoretical reasons for
expecting the computational burden of the pure characteristics model to
differ from the computational burden of the model with tastes for products,
but they have opposite implications.
• First, the number of simulation draws needed to get accurate estimates
of the inverse, and hence of the moment conditions, must grow at rate
J2 in the model with a taste for products, while it need only grow at
rate J in the pure characteristics model.
• Second, the contraction mapping used to compute the inverse is expected
to converge at a geometric rate for the model with tastes for
products, but we do not have such a rate for the pure characteristics
model.
The first argument implies that computation should be easier in the pure
characteristics model, the second that computation should be easier in the
model with tastes for products. Of course, which of the two effects turns
out to dominate may well depend on features of the data being analyzed:
the number of products, the number of times the inverse must be evaluated
in the estimation algorithm (which typically is related to the number of
parameters), and so on.
Berry, Linton, and Pakes (2004) compare the effects of simulation error
in the vertical model to those in the logit model. The differences increase
in the number of products marketed, but it is clear that they can be large
for numbers of products that are relevant for empirical work. For example,
for the limiting distribution to provide an accurate description of the Monte
Carlo distribution of the vertical model with up to two hundred products,
fifty simulation draws seem to suffice, whereas the logit model with one hundred
products requires over two thousand simulation draws.
There is a great deal of evidence on the speed and accuracy of BLP's algorithm
for estimating the model with tastes for products (and we will provide
a bit more below). As a result the next section focuses on the properties of
the algorithms available for estimating the pure characteristics model.
5 Evidence from Simulated Data
Our goal in this section is to investigate the properties of alternative algorithms
for estimating the pure characteristics model. As noted, the computational
difficulties that arise in the pure characteristics model are a result
of the need to compute the δ which solve the fixed point in equation (22).
We compare algorithms based on three different methods for computing this
fixed point. One is the "homotopy" method outlined in the text. The other
two are modifications of the contraction mapping in BLP. The first simply
sets the µε in equation (31) to some fixed number, and proceeds assuming
that BLP's contraction, when applied to this share equation, will produce the
correct estimates of δ. The second differs from this only in that it estimates
µε along with the other parameters of the model.
The comparison will be done in terms of both compute times and the
precision of various estimates. We consider precision in three steps; first of
the estimates of the δ themselves, then of the estimates of the parameters
of the underlying model, and finally of the estimates of the implications of
interest to applied studies (own and cross price elasticities, and welfare).
5.1 A Model for Simulation
For most of the results we report, data is drawn from a model with utility
function

uij = δj + σx νix xj − αi pj ,     (40)

where

ln(αi ) = σp νip ,     (41)

and

δj = β0 + βx xj + ξj .     (42)

The consumer-specific random terms (νix , νip ) are distributed standard normal
(so that αi is log normal, with a normalized mean).
The x variable is drawn as twice a uniform (0,1) draw that is firm-specific, plus one half of a uniform (0,1) draw that is common to all firms in a market. This allows for within-market correlation across observables. ξ is drawn as a uniform on (−0.5, 0.5). Note that the contribution of x to utility is greater than the contribution of ξ, which is likely to help the Monte Carlo find good parameter estimates with limited data (we come back to this below). Price, p, is set equal to a convex function of δ, pj = eδj /20, and this ensures positive shares for all goods (though some of the shares get very small when we consider markets with a large number of products). Note also that p is a function of δ and δ is a function of ξ, so that p and ξ are correlated in the simulated datasets. Finally, the actual consumer choice data is generated from 5000 independent draws on νi , each of whom chooses optimally given the true values of the parameters.
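As a concrete illustration, the data-generating process just described can be sketched in a few lines. This is our reconstruction, not the authors' code; the function name `simulate_market` and the seeded generator are our choices, while the parameter values are the "true" values used in the tables below.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, our choice

def simulate_market(J=5, n_consumers=5000):
    """Sketch of the data-generating process described in the text."""
    beta0, beta_x, sigma_x, sigma_p = 2.0, 1.0, 1.0, 1.0  # "true" values
    # x: twice a firm-specific U(0,1) plus half a market-common U(0,1)
    common = rng.uniform(0, 1)
    x = 2.0 * rng.uniform(0, 1, J) + 0.5 * common
    xi = rng.uniform(-0.5, 0.5, J)           # unobservable
    delta = beta0 + beta_x * x + xi          # equation (42)
    p = np.exp(delta) / 20.0                 # convex in delta, so p and xi correlate
    # consumer-specific taste and price-sensitivity draws
    nu_x = rng.standard_normal(n_consumers)
    nu_p = rng.standard_normal(n_consumers)
    alpha = np.exp(sigma_p * nu_p)           # lognormal price coefficient, eq. (41)
    # utility of each inside good, eq. (40); the outside good gives utility 0
    u = delta[None, :] + sigma_x * nu_x[:, None] * x[None, :] \
        - alpha[:, None] * p[None, :]
    choice = np.where(u.max(axis=1) > 0, u.argmax(axis=1), -1)  # -1 = outside
    shares = np.array([(choice == j).mean() for j in range(J)])
    return delta, p, shares
```

Market shares here are simple frequencies over the simulated consumers, so, as in the text, they can be very small (or even zero with finite draws) for dominated products.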
5.2 Calculating δ
We begin with the calculation of δ. For computational speed and ease of presentation we consider an example with only five products. The example uses randomly drawn data from the base model of the last subsection. The first column of Table 1 shows the “true” values of the randomly drawn δ.
The “Homotopy” method is the combination of weak-contraction, homotopy and Newton methods described in the text. With 5 products and our data-creation model, this method can almost always find a δ vector that exactly reproduces the “true” market shares. The first Homotopy column uses the same 5000 draws on ν used to create the data, so it has no simulation error, and it indeed recovers the exact values of the true δ’s. The second Homotopy column uses only 500 simulation draws, so some error is introduced. While the homotopy method with simulation error fits the shares exactly, it does so with δ’s that vary from the originals.
The columns labeled “Modified BLP Contraction: Fixed µ” use the contraction in BLP with the shares modified as in equation (31). Recall that this multiplies the variance of the Type II extreme value errors by 1/µ. In Table 1 we first look at columns that use the same 5000 draws on ν with µ set respectively at 1, 10 and 50. In the last column, simulation error is again introduced by using only 500 draws.
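A minimal sketch of this modified contraction, under our reading of equation (31): scaling the extreme value error variance by 1/µ is equivalent to multiplying all systematic utilities by µ inside the logit shares, and the BLP step is then divided by µ. The function names and the max-subtraction overflow guard are our choices, not the paper's code.

```python
import numpy as np

def logit_shares(delta, mu, x, p, nu_x, nu_p, sigma_x=1.0, sigma_p=1.0):
    """Simulated logit shares with the extreme value error variance scaled
    by 1/mu; large mu approximates the pure characteristics model."""
    alpha = np.exp(sigma_p * nu_p)
    u = delta[None, :] + sigma_x * nu_x[:, None] * x[None, :] \
        - alpha[:, None] * p[None, :]
    v = mu * u
    v = np.concatenate([np.zeros((v.shape[0], 1)), v], axis=1)  # outside good
    v -= v.max(axis=1, keepdims=True)      # guard against overflow at large mu
    ev = np.exp(v)
    probs = ev / ev.sum(axis=1, keepdims=True)
    return probs[:, 1:].mean(axis=0)

def modified_blp_contraction(s_obs, mu, x, p, nu_x, nu_p,
                             tol=1e-12, max_iter=5000):
    """Fixed point of delta -> delta + (1/mu)(ln s_obs - ln s(delta))."""
    delta = np.log(s_obs) - np.log(1.0 - s_obs.sum())  # logit starting value
    for _ in range(max_iter):
        s = logit_shares(delta, mu, x, p, nu_x, nu_p)
        step = (np.log(s_obs) - np.log(s)) / mu
        delta = delta + step
        if np.abs(step).max() < tol:
            break
    return delta
```

Even with the overflow guard, very large µ makes the exponentiated utilities numerically extreme, which is consistent with the numeric problems reported for high values of µ in the experiments that follow.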
The last row of the table gives the computational time relative to using the full homotopy method with 5000 draws. Decreasing the number of draws on ν obviously saves time, while increasing µ costs time.
With µ = 1 (which is the original BLP contraction) there is little if any correlation between the true δ and those obtained from the contraction. With µ = 10 one of the δ obtained from the contraction is “out of order”. When
we get to µ = 50 the order of the δ obtained from the contraction is correct,
though they still have a one or two percent error in their values. There is
also some indication that even at µ = 50 the modified BLP contraction is
more sensitive to simulation error.
On the other hand, the compute time at µ = 1 is only one fiftieth of that of the homotopy. When µ = 50 the compute time with the full 5000 draws is about 60% of that of the homotopy, but with the 500 draws it is only about 20%. We conclude that, from the point of view of estimating the δ, it might be efficient to go to the modified BLP contraction, but only if µ is kept very high.
Table 1: An Example of Calculating δ Using Different Methods

                      Homotopy        Modified BLP Contraction with Fixed µ
             True   5000    500     5000     5000     5000     500
nsim:        5000                   µ = 1    µ = 10   µ = 50   µ = 50
δ1           2.99   2.99    3.14    3.08     3.02     3.04     3.25
δ2           3.21   3.21    3.36    2.09     3.19     3.26     3.51
δ3           3.56   3.56    3.71    3.13     3.60     3.62     3.94
δ4           4.04   4.04    4.27    4.13     4.10     4.10     4.61
δ5           4.10   4.10    4.32    2.68     4.06     4.16     4.65
Rel. Time:          1       0.15    0.02     0.12     0.58     0.03

5.3 Sample Designs
The results for the remaining issues we investigated depended somewhat on the sample design. The major relevant feature of the design was the number of products marketed. So we focused on two sample designs: one where there is a small number of products marketed but a reasonable number of markets, and one where there is a large number of products marketed and a small number of markets.
The sample with a small number of products marketed consists of 20 markets, and for each market the number of products was chosen randomly from a distribution which puts equal weight on {2, . . . , 10}. With this design estimation is very fast, and we have experimented with a number of alternative assumptions, some of which are illustrated below. The sample with a large number of products has one hundred products per market, but only five markets, and we have done less experimentation under this design. All samples used five thousand simulation draws to construct the model’s predictions for market shares (i.e. this is the size of the consumer sample).
5.3.1 Parameter Estimates: Small Number of Products
First we look at the estimates of parameter values in Tables 2 and 3. The instruments used are a constant, x, x², the mean x in the market, and the minimum distance to the nearest x.[23]
Starting with Table 2, we see that with a small number of products per market and a small variance on the unobservable, the modified BLP contraction mapping with µ set exogenously to thirty does quite well; not noticeably worse than our full homotopy. However, to get the performance of the two algorithms to be comparable when the distribution of ξ had more variance, we needed to push µ up to fifty. As noted, we really could not push µ much higher than this.
Note that the numbers reported underneath the coefficient estimates are the standard errors of the estimates across different Monte Carlo samples. Since the estimates are a mean across these one hundred samples, the standard error of this mean should be about one tenth of the reported standard error. So the asymptotic approximation to the distribution of the estimator is underestimating the estimator’s variance by quite a bit at these sample sizes. Also, all algorithms do worse when we increase the variance of ξ; i.e. intuitively, we need more data to obtain a given level of precision as the variance of the unobservable increases.
Of course, we would not know how high we needed to set µ to get reasonable approximations if we were working with a real data set. Moreover, we experimented somewhat to find the best values for these runs. We rejected any value for the fixed scale that resulted in numeric errors, and in the first experiment µ = 30 worked a bit better than µ = 50 even though there were no obvious numeric errors.[24]
An alternative which avoids the problem of choosing the scale is to let the data estimate µ. Table 3 presents the results from this exercise. We note that in several cases the scale parameter, which is now set by the search algorithm, increased to the point where numeric errors occurred. In those cases, we fixed µ at 50. In one of those cases, even µ = 50 caused
[23] We also tried an approximation to “optimal instruments” as in Berry, Levinsohn, and Pakes (1999), but this had little effect on the results.
[24] This could be because of approximation errors in the computer routines that calculate the exponents of large numbers.
problems, and so we fixed µ at 25.[25]
Table 3 presents both the estimates of the parameters of interest and of the “auxiliary” parameter µ. In particular, it provides both the mean and median of the estimates of µ across runs. The first column of results uses the base specification for the unobservable, while the second column increases the variance of the unobservable. In both cases this gives us results which are worse than those obtained in Table 2, though probably still acceptable, especially for the sample design with less variance in ξ.
We conclude that with a small number of products the modified BLP contraction may do well, especially if the variance of the unobservable is small. However, at high values of µ numeric errors are quite common, and when we estimate µ instead of fixing it exogenously, we do seem to do worse.
Table 2: Monte Carlo Results: Small # of Products Using Modified BLP Contraction

                 (1)            (2)            (3)            (4)
Method       Mod. BLP       Homotopy       Mod. BLP       Homotopy
Scale (µ):      30             n.r.           50             n.r.
ξj =         U(-1/2, 1/2)   U(-1/2, 1/2)   U(-1.5, 1.5)   U(-1.5, 1.5)
σx (= 1)       1.04           1.03           1.24           1.26
              (0.04)         (0.03)         (0.06)         (0.06)
σp (= 1)       1.00           0.98           1.02           1.02
              (0.01)         (0.01)         (0.03)         (0.03)
β0 (= 2)       2.06           2.00           2.34           2.33
              (0.05)         (0.05)         (0.10)         (0.09)
βx (= 1)       0.99           1.00           1.04           1.05
              (0.01)         (0.01)         (0.02)         (0.03)
[25] We also tried experiments where we imposed the traditional logit scale normalization of one, while dropping our current normalization on the price coefficient. Those runs were much less likely to converge without numeric errors.
Table 3: Estimated µ

                            (1)            (2)
ξ distribution:         U(-1/2, 1/2)   U(-1.5, 1.5)
nsim:                       500            500
              True
σx              1           1.14           1.64
                           (0.04)         (0.08)
σp              1           1.03           1.09
                           (0.01)         (0.03)
β0              2           2.19           2.79
                           (0.06)         (0.12)
βx              1           1.00           1.03
                           (0.01)         (0.03)
Scale, µ        ∞          34.08          15.50
                           (3.31)         (1.98)
µ (Median)      ∞          17.81           4.67
All estimates are means across 100 simulated datasets. Estimated standard
deviations of the mean estimates are given in parentheses. The homotopy
estimates took on the order of 10 times as long to compute.
5.3.2 Substitution Effects: Small Number of Products
Tables 2 and 3 show that, for the example data-generating process, all of the methods do at least a reasonable job of estimating the model parameters. Since, as illustrated in Table 1, the homotopy method is much slower, one might conclude that it is the least desirable method for this data-generating process. Of course, parameter estimates are not usually the objects of interest; rather, it is their implications we are usually concerned with.
Implications that are frequently the focus of industrial organization studies are own- and cross-price elasticities. If we use the modified BLP contraction to estimate parameters, we still have two options for computing these elasticities. We could use the parameter estimates obtained from the modified contraction together with the functional forms of the pure characteristics model, or we could use those parameter estimates with the functional form from the model with tastes for products, i.e. from the modified BLP model in equation (31). If we use the functional form from the pure characteristics model, the difference between the implications will be the same as the difference in implications of the different parameter estimates. However, if we use the functional form from the model with tastes for products, the difference in functional forms will cause differences in the implications of interest even conditional on the parameter estimates.
Table 4 provides an example of how well the methods reproduce the true substitution patterns. The example considers one market from the data-generating process that has more spread in the unobservables and therefore less precise estimates (i.e. the last columns of Tables 2 and 3). The column labeled “true” gives the true derivatives of the market share of the first (lowest-priced) product in that market with respect to the price of the row product. The first entry is therefore an own-price derivative, the second entry is the cross-price derivative of product one with respect to product two, and so forth.
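The derivatives in Table 4 can be approximated by finite differences of the simulated pure characteristics shares. The sketch below is our construction (the helper names are ours); with a finite consumer sample the simulated shares are step functions, so in practice many draws, or a smoothed share simulator, are needed for stable derivatives.

```python
import numpy as np

def pure_char_shares(delta, p, x, nu_x, nu_p, sigma_x=1.0, sigma_p=1.0):
    """Simulated shares in the pure characteristics model: each consumer
    buys the single utility-maximizing good; the outside good gives 0."""
    alpha = np.exp(sigma_p * nu_p)
    u = delta[None, :] + sigma_x * nu_x[:, None] * x[None, :] \
        - alpha[:, None] * p[None, :]
    best = u.argmax(axis=1)
    inside = u.max(axis=1) > 0
    return np.array([np.mean(inside & (best == j)) for j in range(len(delta))])

def share_price_derivatives(j, delta, p, x, nu_x, nu_p, h=1e-3):
    """d s_j / d p_k for every k, by a one-sided finite difference of size h."""
    base = pure_char_shares(delta, p, x, nu_x, nu_p)
    out = np.empty(len(p))
    for k in range(len(p)):
        p2 = p.copy()
        p2[k] += h
        out[k] = (pure_char_shares(delta, p2, x, nu_x, nu_p)[j] - base[j]) / h
    return out
```

Holding the consumer draws fixed across the base and perturbed prices, the own derivative is weakly negative and every cross derivative weakly positive by construction, matching the sign pattern in Table 4.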
The last three columns of Table 4 recompute those derivatives using the
three different methods of Tables 2 and 3: homotopy and modified BLP contraction with fixed and estimated µ . We do the computation for each of the
parameters generated by a single Monte Carlo run in the earlier tables, and
then average across the Monte Carlo sample of parameters. This approximates the answer we would get by averaging over the asymptotic distribution
of the parameters as estimated by the appropriate method. We also provide
the standard deviation of the calculated sample mean.
Table 4 indicates that the homotopy method does a much better job of reproducing the true substitution pattern. This is especially true relative to the model with an estimated µ. We are currently exploring this in more detail. No matter the result, however, this example has the following implications:
• “good” parameter estimates from a particular model need not lead the model to generate good predictions for implications of interest (i.e. the functional form approximation affects the predicted implications), and
• even if we use the modified BLP contraction for estimation, we may
not want to use the functional form in equation (31) to compute our
estimates of the implications of the parameter estimates.
Table 4: Examples of Substitution Patterns

                                       Modified BLP
Product     True      Homotopy    Fix µ = 50   Estimate µ
1         -0.6027     -0.6198      -0.7313      -0.4130
                      (0.0337)     (0.0220)     (0.0232)
2          0.1981      0.2093       0.2531       0.1656
                      (0.0233)     (0.0140)     (0.0139)
3          0.1621      0.1624       0.1183       0.0331
                      (0.0106)     (0.0022)     (0.0059)
4          0.0000      0.0002       0.0278       0.0348
                      (0.0001)     (0.0038)     (0.0042)
5          0.0204      0.0252       0.0292       0.0235
                      (0.0026)     (0.0024)     (0.0023)

5.4 A Large Number of Products
In this section we consider a sample with a larger number of products (an
average of 100 per market) and a smaller number of markets (3) structured
as a time series on a market with improving characteristics. In particular
we use the data on the evolution of megahertz in the computer market from
1995 to 1997, taken from Pakes (2003), to pattern the x’s in our sample.
Year:        1     2     3
Min Mhz:    25    25    33
Max Mhz:   133   200   240
The Monte Carlo sample lets the number of products increase across the three years, from 75 to 100 and then to 125, and has x’s drawn uniformly on the min-max range of Mhz (divided by 10). So that the ξ’s scale with x, the ξ’s are uniform on [0, 1] plus the draw on Mhz divided by 10.
The δ’s are determined as above (δj = β0 + βx xj + ξj ), as are the parameters and the distribution of αi . To mimic a high tech market we let prices fall over time, t, according to

ln(pj ) = (t − 1) ln(0.7) + 1.1 ln(δj ).    (43)
Note that again price is convex in δ, thus assuring positive shares. However, the shares we generate are sometimes very small, smaller than we typically observe in actual data sets[26] (see below). Price for the same set of characteristics declines at roughly 30% per year. This, together with the increase in the number of products and the improvement in the product characteristics over time, is consistent with a market which generates “large” increases in welfare over time.
The instruments used here are a constant and x, both interacted with
a set of dummies for each market. That these seem to suffice is probably
a result of the fact that we are requiring the “same utility model” to fit
in each time period and letting the products in the market change rather
dramatically over time.
The fact that market shares are so small makes computation of the δ more difficult in this example. Partly as a result, we estimate on only one example dataset, and use asymptotic standard errors. We obtain estimates from all three algorithms discussed above. In the homotopy method we begin with the modified BLP contraction with fixed µ, say δ0 , and then iterate only 25 times on the “homotopy” equation:

δ′ = 0.05 δ0 + 0.95 r(δ),    (44)

where r(δ) is the element-by-element inversion.[27]
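A sketch of this damped update, under stated assumptions: we read r(δ) as solving, coordinate by coordinate, for the δj that matches the observed share of good j while holding the other coordinates fixed, implemented here by bisection against a share simulator that is weakly increasing in δj . The helper names and the bisection bounds are our choices, not the paper's code.

```python
import numpy as np

def elementwise_inverse(delta, s_obs, share_fn, lo=-10.0, hi=10.0, n_bisect=60):
    """One reading of r(delta): for each j, bisect for the delta_j that
    matches s_obs[j], holding the other coordinates at their current values
    (share_fn must be weakly increasing in delta_j)."""
    out = delta.copy()
    for j in range(len(delta)):
        a, b = lo, hi
        for _ in range(n_bisect):
            mid = 0.5 * (a + b)
            trial = out.copy()
            trial[j] = mid
            if share_fn(trial)[j] < s_obs[j]:
                a = mid          # share too low: raise delta_j
            else:
                b = mid
        out[j] = 0.5 * (a + b)
    return out

def damped_homotopy(delta0, s_obs, share_fn, n_iter=25, w=0.05):
    """Equation (44): delta' = w*delta0 + (1-w)*r(delta), iterated n_iter times."""
    delta = delta0.copy()
    for _ in range(n_iter):
        delta = w * delta0 + (1.0 - w) * elementwise_inverse(delta, s_obs, share_fn)
    return delta
```

Because the update keeps 5% weight on the starting value δ0 , its fixed point does not match the observed shares exactly; it is an improvement step on the modified BLP solution, as the text describes, not a full solver.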
We note that even this limited search greatly improves the calculated δ’s. In particular, the mean of the calculated δ’s is much too small if we stop at the solution to the modified BLP contraction. This contrasts with the case with a small number of products, where the modified BLP contraction did rather well. The difference is that the current sample design generates products with small market shares; the pure characteristics model will do that if the characteristics of products are close enough to one another, but the model with tastes for products can only generate small shares if the δ’s are very small.
In Table 5 we see much bigger differences between the parameter estimates from the different estimation algorithms than we did when we had a small number of products. The estimates from the modified BLP algorithm with estimated µ are clearly the worst of the three. The estimated value of the
[26] This may well be more a result of the data not reflecting reality than of our simulation not reflecting reality. That is, goods which truly have very small market shares might exist and simply not be included in traditional data sets.
[27] Note that in the sample with a large number of products, we are using only one iteration of the homotopy method to improve the “modified BLP” estimate of δ; i.e. we never start a new iteration with a new δ0 in equation (44). This keeps the computational burden manageable while still providing what turns out to be a reasonably large improvement in the computed δ. As computational speed increases, further improvements in δ will become possible.
Table 5: Parameter Estimates: Dataset with a Large Number of Products

                                      Modified BLP
Parameter   True   Homotopy      Fix µ       Estimate µ
σx            1     0.833        0.862         0.832
                   (0.194)      (0.380)       (6.956)
σp            1     1.192        1.188         1.207
                   (0.556)      (0.621)       (1.783)
β0            2     1.956        1.354        -6.455
                   (2.013)      (2.066)      (66.864)
βx            1     0.984        0.986         0.879
                   (0.209)      (0.198)       (1.102)
Scale, µ             ∞            10∗          0.934
                                              (6.680)

∗ The scale was initially set to 10, but for some combinations of parameter values and markets this caused numeric problems, and the scale in those cases was halved until the numeric problems went away. In a few cases, a scale as low as 2.5 was necessary. Asymptotic standard errors are in parentheses.
scale parameter µ is relatively small, which implies that the logit error is being assigned a relatively important role. To counteract the effects of the logit error and still match the small shares, the constant in δ is driven down to less than -6. The modified BLP contraction with a fixed µ does better, but still suffers from a β0 which is too low.
In this example all the models do a good job of capturing the general pattern of substitution across products, although the BLP model is a bit more diffuse, as expected. However, only the homotopy method does a good job of capturing the overall level of elasticities. This is shown in Table 6, which gives, for the first five products in the year 1 data (the lowest priced products), actual share and price data and then price elasticities (calculated from a discrete 1% change in price). The products with a very small share have excellent substitutes in the pure characteristics model, but the modified BLP contraction with estimated µ does not capture this effect, and even the model with a fixed µ has trouble reproducing this result. Note that for the 4th product the share is truly tiny, and a 1% increase in price would wipe out all of the sales of that product.[28]
[28] Such a product would likely only survive if produced by a multi-product firm, so that perhaps some markup could be sustained and some fixed costs shared across products.
Table 6: Predicted Elasticities with a Large Number of Products

                      % Change in Share from a 1% Price Chg.
                                              Modified BLP
% Share    Price     True     Homotopy     Fixed µ    Estimate µ
5.2109     1.26     -11.0      -14.6        -6.8        -1.8
4.0180     1.58     -26.8      -24.5        -9.2        -2.3
0.1078     2.93     -30.8      -46.3       -11.0        -2.9
0.0038     3.85    -100.0     -100.0       -14.8        -3.7
0.8855     4.01     -63.4      -58.5       -19.3        -4.9

5.5 Welfare Effects
We now calculate the average per-person welfare gain (in dollars) of moving from the year one choice set to the year three choice set. Recall that we greatly increase both the number of goods and their quality at the same time as lowering prices. As a result there is both a large influx of consumers from the outside good over time, and a “large” true welfare increase (much of it going to the new consumers).
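In the pure characteristics model utility is linear in price with coefficient αi , so a utility difference can be converted into dollars by dividing by αi . The per-person gain can then be sketched as below; this is our hedged reconstruction (the function name and the money-metric conversion are our choices), not the paper's code.

```python
import numpy as np

def avg_welfare_gain(delta1, p1, x1, delta3, p3, x3, nu_x, nu_p,
                     sigma_x=1.0, sigma_p=1.0):
    """Average per-person dollar gain from replacing choice set 1 with
    choice set 3 in the pure characteristics model."""
    alpha = np.exp(sigma_p * nu_p)

    def best_utility(delta, p, x):
        u = delta[None, :] + sigma_x * nu_x[:, None] * x[None, :] \
            - alpha[:, None] * p[None, :]
        return np.maximum(u.max(axis=1), 0.0)   # outside good gives utility 0

    gain_in_utils = best_utility(delta3, p3, x3) - best_utility(delta1, p1, x1)
    return np.mean(gain_in_utils / alpha)       # dollars per person
```

Consumers who stay with the outside good in both years contribute zero, so much of the measured gain comes from the influx of new buyers, as noted above.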
The surprising result here is that all methods do very well. The homotopy is within 0.5% of the true result, but even the modified BLP algorithm with an estimated µ is within three percent of the truth. The modified BLP methods do not do as well on the parameters, or on the elasticities, but the fact that the contraction fits the shares exactly means that the extra gain from the logit errors is offset by lower δ’s, and this counteracts the problems generated for welfare measurement by the model with tastes for products.
Table 7: Welfare Effects

Method                       Gain
Homotopy                    265.26
True                        266.14
Modified BLP µ = 10         269.97
Modified BLP µ estimated    259.00
References
Anderson, S., A. DePalma, and F. Thisse (1992): Discrete Choice
Theory of Product Differentiation. MIT Press, Cambridge MA.
Berry, S. (1994): “Estimating Discrete Choice Models of Product Differentiation,” RAND Journal of Economics, 25(2), 242–262.
Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in Market Equilibrium,” Econometrica, 63(4), 841–890.
(1999): “Voluntary Export Restraints on Automobiles: Evaluating
a Strategic Trade Policy,” American Economic Review, 89(3), 189–211.
(2004): “Differentiated Products Demand Systems from a Combination of Micro and Macro Data: The New Vehicle Market,” Journal of
Political Economy, 112(1), 68–105.
Berry, S., O. Linton, and A. Pakes (2004): “Limit Theorems for Differentiated Product Demand Systems,” Review of Economic Studies, 71(3), 613–654.
Bresnahan, T. (1987): “Competition and Collusion in the American Automobile Oligopoly: The 1955 Price War,” Journal of Industrial Economics,
35, 457–482.
Caplin, A., and B. Nalebuff (1991): “Aggregation and Imperfect Competition: On the Existence of Equilibrium,” Econometrica, 59(1), 1–23.
Das, S., G. S. Olley, and A. Pakes (1995): “The Market for TVs,” Discussion paper, Yale University.
Gabszewicz, J., and J.-F. Thisse (1979): “Price Competition, Quality
and Income Disparity,” Journal of Economic Theory, 20, 340–359.
Greenstein, S. M. (1996): “From Superminis to Supercomputers: Estimating the Surplus in the Computing Market,” in The Economics of New
Goods, ed. by T. F. Bresnahan, and R. J. Gordon. University of Chicago
Press, Chicago.
Hausman, J. (1997): “Valuing the Effect of Regulation on New Services in
Telecommunications,” The Brookings Papers on Economic Activity: Microeconomics, pp. 1–38.
Hausman, J., and D. Wise (1978): “A Conditional Probit Model for
Qualitative Choice: Discrete Decisions Recognizing Interdependence and
Heterogeneous Preferences,” Econometrica, 46, 403–426.
Heckman, J. J., and J. M. Snyder (1997): “Linear Probability Models
of the Demand for Attributes with an Empirical Application to Estimating
the Preferences of Legislators,” RAND.
Hotelling, H. (1929): “Stability in Competition,” Economic Journal, 39,
41–57.
Lancaster, K. (1971): Consumer Demand: A New Approach. Columbia
University Press, New York.
McFadden, D. (1974): “Conditional Logit Analysis of Qualitative Choice
Behavior,” in Frontiers of Econometrics, ed. by P. Zarembka. Academic
Press, New York.
(1981): “Econometric Models of Probabilistic Choice,” in Structural
Analysis of Discrete Data with Econometric Applications, ed. by C. Manski, and D. McFadden. MIT Press, Cambridge, MA.
Mussa, M., and S. Rosen (1978): “Monopoly and Product Quality,” Journal of Economic Theory, 18, 301–317.
Nevo, A. (2000): “Mergers with Differentiated Products: The Case of the Ready-to-Eat Cereal Industry,” RAND Journal of Economics, 31(3), 395–421.
Pakes, A. (1986): “Patents as Options: Some Estimates of the Value of
Holding European Patent Stocks,” Econometrica, 54, 755–784.
(2003): “A Reconsideration of Hedonic Price Indices with an Application to PC’s,” American Economic Review, 93(5), 1578–1596.
Pakes, A., S. Berry, and J. Levinsohn (1993): “Some Applications and Limitations of Recent Advances in Empirical Industrial Organization: Price Indexes and the Analysis of Environmental Change,” American Economic Review, Papers and Proceedings, 83, 240–246.
Petrin, A. (2002): “Quantifying the Benefits of New Products: The Case of the Minivan,” Journal of Political Economy, 110(4), 705–729.
Salop, S. (1979): “Monopolistic Competition with Outside Goods,” RAND
Journal of Economics, 10, 141–156.
Shaked, A., and J. Sutton (1982): “Relaxing Price Competition Through Product Differentiation,” Review of Economic Studies, 49(1), 3–13.
Song, M. (2004): “Measuring Consumer Welfare in the CPU Market: An
Application of the Pure Characteristics Demand Model,” Working paper,
Georgia Tech.
Trajtenberg, M. (1989): “The Welfare Analysis of Product Innovations:
with an Application to Computed Tomography Scanners,” Journal of Political Economy, 97, 444–479.