Alternative sampling methods

SEVENTH FRAMEWORK PROGRAMME
THEME [SST.2010.1.3-1.]
[Transport modelling for policy impact assessments]
Grant agreement for: Coordination and support action
Acronym: Transtools 3
Full title: “Research and development of the European Transport Network Model – Transtools Version 3”
Proposal/Contract no.: MOVE/FP7/266182/TRANSTOOLS 3
Start date: 1st March 2011
Duration: 36 months
Milestone 79 - “Sampling alternatives for choice
model estimation in practice”
Document number: TT3_WP8_COM_MS79_Sampling alternatives for choice model
estimation_0b
Workpackage: WP8
Deliverable nature: Report
Dissemination level: N/A
Lead beneficiary: ITS Leeds, beneficiary 2, Andrew Daly.
Due date of deliverable: March, 2012
Date of preparation of deliverable: 30.12.2011
Date of last change: 17.03.2012
Date of approval by Commission: N/A
Abstract:
The note discusses practical approaches for sampling alternatives in the estimation of GEV
models, which include the tree-nested models which will be used for the passenger and freight
demand modelling. It covers both procedures for actual sampling of subsets of alternatives and
the process of estimation based on those subsets, pointing out the deficiencies in the academic
literature, which is still being developed. It concludes that, given the moderate numbers of
alternatives to be used in this study, the efficiency of modern software and hardware and the
difficulties and limitations of the methods, sampling of alternatives should not be used for
model estimation in this work.
Keywords:
logit models, sampling alternatives, model estimation
Author(s):
Daly, Andrew
Disclaimer:
The contents of this report reflect the views of the author(s) and do not necessarily reflect the
official views or policy of the European Union. The European Union is not liable for any use that
may be made of the information contained in the report.
The report is not an official deliverable under the TT3 project and has not been reviewed or
approved by the Commission. The report is a working document of the Consortium.
Sampling alternatives for choice model estimation in practice.
Report version 0c
2012
By
Andrew Daly
Copyright:
Reproduction of this publication in whole or in part must include the customary
bibliographic citation, including author attribution, report title, etc.
Published by: Department of Transport, Bygningstorvet 116 Vest, DK-2800 Kgs. Lyngby,
Denmark
Request report from: www.transport.dtu.dk
Content
1. Introduction
2. McFadden (1978): Sampling alternatives for MNL
3. Sampling in GEV models
4. Conclusion and Recommendation
Summary
The note begins by pointing out that sampling alternatives introduces a number of new issues
involving the specification of sampling procedures and the calculation of errors. With moderate
numbers of alternatives it is possible to make estimations efficiently without sampling.
For multinomial logit, sampling procedures due to McFadden (1978) are known to be unbiased when
used in estimation, though the increased error introduced by these procedures is unknown. Simple
sampling procedures are consistent with the McFadden requirements, though the sampling criteria
would have to be specified for each study.
For more general models, as will be needed for this study, the best sampling procedures are
unknown, though work by Guevara and Ben-Akiva (2010) is promising.
Because of the theoretical and practical complications, and the possibility of efficient work without
sampling, it is recommended that no sampling be undertaken for model estimation in this study. For
model application, the methods are even less well developed and we similarly recommend that no
sampling be undertaken.
1. Introduction
1.1 Context of the milestone
For the TransTools3 (TT3) project it is necessary to estimate choice models with substantial numbers
of spatial alternatives. Such models can make heavy demands on computer resources, particularly run
time, but also potentially the storage requirement. An option to reduce these demands is to look for
some way of reducing the number of alternatives actually used in model estimation by sampling. The
intention of the present note is to set out the practical procedures available for sampling and
recommend a procedure for this project.
For the purposes of the current note, we shall assume that the observations can be treated as
independent of each other. Since TT3 is working with RP observations, this assumption seems
reasonable.
1.2 Preliminary remark
Sampling of alternatives should only be undertaken when it is clearly necessary. Alternative sampling
always leads to increased error in parameter estimations which is not completely straightforward to
calculate. Below, we begin to analyse the nature of the error specifically introduced by alternative
sampling in a limited range of cases. This additional error is usually ignored in practice, but this is
incorrect.
Moreover, in many cases the use of sampling imposes a duty on the analyst to make calculations of
correction terms, sometimes simple, sometimes complicated, that need checking. This additional
burden is an important practical reason to avoid sampling unless it is necessary.
Finally, in complex models the guarantee of consistency of estimation is not always present. The main
aim of this note is to discuss the practical application of methods that achieve consistency in a wider
range of cases.
Nerella and Bhat (2004) give some indications of the magnitude of the error, both for MNL and for
more complicated models. For MNL, they give some guidelines on minimum sample size to achieve
stability, but that is with a simple sampling strategy in simulated data and so not likely to be
transferable to real data and more sophisticated sampling. In particular, the efficiency of sampling can
vary substantially between contexts and sampling procedures. For mixed logit models, the results are
less accurate, but in this case the estimates are not necessarily consistent and would not usually be
used in practice. The current situation appears to be that sampling alternatives in MNL causes noise,
but we would not be able to state in advance what that noise would be in a specific situation; sampling
alternatives in mixed logit causes more noise and bias may also be present, but we are not able to say
how much of either.
With efficient software it is possible to estimate MNL and tree-nested models with quite large numbers
of alternatives. RAND Europe routinely uses ALOGIT to estimate models with tens of thousands of
alternatives, dozens of parameters and tens of thousands of observations 1. Of course runs of this
magnitude are time-consuming but they remain entirely feasible for practical analysis work. ALOGIT is
currently not able to estimate cross-nested models or models based on repeated observations for
individuals (panel data); but these features are not necessary for the current TT3 work.
In each study, therefore, a decision needs to be taken about whether or not sampling is to be
undertaken. When sampling is undertaken, we should investigate how much error this is likely to
introduce. Procedures that may introduce bias should not be used without careful testing.
1.3 Reading guidance
In this note, we discuss first the sampling procedures appropriate for MNL models and some of the
basic properties of sampling relating to destination choice. We then turn to sampling for GEV models.
1 The Home-other model in the Sydney system had 29,000 observations, 46 parameters and about 26,000 alternatives.
2. McFadden (1978): Sampling alternatives for MNL
In his 1978 paper, McFadden set out the Positive Conditioning (PC) property under which consistent
estimates of a MNL can be obtained with alternative sampling. Specifically, he showed that
asymptotically consistent estimates of model parameters can be obtained if we maximise a modified
likelihood function, with a contribution for each individual of
\[
L = \log \frac{\exp\bigl(V_c + \log \pi(D|c)\bigr)}{\sum_{j \in D} \exp\bigl(V_j + \log \pi(D|j)\bigr)} \tag{1}
\]
where 𝑉𝑗 is the systematic part of utility for alternative 𝑗;
𝑐 is the chosen alternative;
𝐷 is the sampled set of alternatives for this individual, which is a subset of the set of all
available alternatives 𝐶; and
𝜋(𝐷|𝑗) is the probability of sampling 𝐷, if 𝑗 is chosen;
we must have 𝜋(𝐷|𝑗) > 0 ∀ 𝑗 ∈ 𝐷 (which is the PC property).
Clearly, if 𝜋(𝐷|𝑗) is the same for all 𝑗 ∈ 𝐷, then it will cancel out in the equation above. This is the
Uniform Conditioning property. The simplicity of Uniform Conditioning is attractive, but in many
practical cases some alternatives are much more important than others, in the sense of being much
more likely to be chosen, so that the general PC approach with unequal 𝜋 values is more efficient.
Some intuition on how ‘important’ alternatives should be identified is given below.
Note that it is essential that the chosen alternative 𝑐 is included in 𝐷.
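The contribution in equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any estimation package: the function name and the dictionary layout (utilities and log sampling probabilities keyed by the alternatives in 𝐷) are assumptions made here.

```python
import math

def pc_loglik_contribution(V, log_pi, chosen):
    """Equation (1): log of exp(V_c + log pi(D|c)) over the sum across the sampled set D.

    V and log_pi are dicts keyed by the alternatives in D (hypothetical layout).
    """
    if chosen not in V:
        raise ValueError("the chosen alternative must be in the sampled set D")
    corrected = {j: V[j] + log_pi[j] for j in V}
    top = max(corrected.values())  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(u - top) for u in corrected.values()))
    return corrected[chosen] - log_denominator
```

When `log_pi` takes the same value for every alternative in 𝐷, the correction cancels and the function returns the ordinary MNL log-probability on 𝐷, which is the Uniform Conditioning case described above.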
An important point in practice is that McFadden’s PC theorem requires the assumption that “the choice
model is multinomial logit”. In practice, this will often not be the case and the consequence is then that
estimation using the amended likelihood function may not give the same result as when using the full
model.2 However, in this case, neither the base MNL nor the sampled version is correct and sampling
does not necessarily make matters worse.
2 That is, the validity of the theorem depends on behaviour truly according with the MNL formula. I am grateful to Prof. McFadden for patiently pointing this out to me, as well as that the theorem applies only to large samples. Issues arising from other model specification errors, e.g. that the error variance is not uniform across the alternatives (a very likely issue in TT3), would also impact on the proofs of the theorems and in an unknown way on the model outcomes.
2.1 Practical sampling strategies for PC
A simple practical approach for PC is to use independent sampling, where each unchosen alternative 𝑗
is sampled with probability 𝑞𝑗 . Alternatively, we can sample a fixed number of times from 𝐶, with
replacement, giving each alternative a probability 𝑞𝑗 of being sampled at each draw, and then delete
the duplicate alternatives. In each case (Ben-Akiva and Lerman, 1985, equations 9.22 and 9.23) these
strategies yield
\[
\pi(D|j) = \frac{1}{q_j} K(D) \tag{2}
\]
with 𝐾(𝐷) independent of 𝑗. Examining the likelihood equation (1), we see that 𝐾(𝐷) cancels out and
we are left with
\[
L = \log \left( \frac{\exp(V_c - \log q_c)}{\sum_{j \in D} \exp(V_j - \log q_j)} \right) \tag{3}
\]
With independent sampling the expected set size is 𝑄𝐶 = ∑𝑗∈𝐶 𝑞𝑗 but there is quite likely to be
considerable variation around this number. When sampling with replacement the probabilities 𝑞𝑗 must
sum to 1, of course, but the advantage claimed for this method is that the size of the set 𝐷 varies less
than with independent sampling; the expected set size is more complicated to determine. In each case
we can adjust the sample rate to obtain a suitable balance between sampling error and computational
cost.
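The two strategies and the resulting likelihood contribution (3) can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure of any particular package: the function names and the dictionary layout are hypothetical, and adding the chosen alternative to the with-replacement draw is assumed here so that 𝑐 is always in 𝐷.

```python
import math
import random

def sample_independent(q, chosen, rng):
    """Independent sampling: each unchosen j enters D with probability q[j];
    the chosen alternative is always included."""
    return {j for j in q if j == chosen or rng.random() < q[j]}

def sample_with_replacement(q, chosen, draws, rng):
    """Draw `draws` times from C with per-draw probabilities q[j] (summing to 1),
    delete duplicates, and add the chosen alternative (an assumption here)."""
    alternatives = list(q)
    weights = [q[j] for j in alternatives]
    drawn = rng.choices(alternatives, weights=weights, k=draws)
    return set(drawn) | {chosen}

def loglik_contribution(V, q, D, chosen):
    """Equation (3): the correction reduces to subtracting log q_j from each utility."""
    corrected = {j: V[j] - math.log(q[j]) for j in D}
    top = max(corrected.values())  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(u - top) for u in corrected.values()))
    return corrected[chosen] - log_denominator

# Illustrative values only (not taken from the simulations in this note).
rng = random.Random(1)
V = {1: 0.5, 2: 0.0, 3: -0.5, 4: -1.0}
q = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
D = sample_with_replacement(q, chosen=2, draws=3, rng=rng)
print(sorted(D), loglik_contribution(V, q, D, chosen=2))
```

Both samplers return a set, so the size of 𝐷 can be inspected directly when tuning the sampling rate.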
A further approach, which has been used in some practical studies, is to sample without replacement.
This gives no variation in the sampled set size, but the calculation of 𝜋(𝐷|𝑗) is complicated and for this
reason the approach is not generally recommended.
Ben-Akiva and Lerman imply that replacement sampling is likely to be preferable to independent
sampling, because of the anticipated reduced variation of the set size. Variation of the set size is,
however, not obviously directly connected with the accuracy of estimation; some aspects of these
issues can be tested quite readily by simulation, as illustrated in Figure 1. Details of the simulation are
given in the Appendix.
[Chart: Efficiency and set size variation. Horizontal axis: deterrence parameter, from -0.03 to -0.11. Vertical axis: percentage, from 0% to 45%. Series: Independent Cost, With-Rep. Cost, Without-Rep. Cost, Independent Variation, With-Rep. Variation, Without-Rep. Variation.]
Figure 1: Results of simulated sampling of choice sets
The upper lines in the graph show the relative cost of the three approaches, specifically giving the
ratio of selected destinations to the weight of those destinations, for a selected reasonable sampling
rate (about 16% of the destinations). We see that the differences between the procedures are very
small and that all the procedures become more efficient as the parameter becomes more strongly
negative (equivalently, the study area size increases relative to expected trip lengths). The lower lines
show the coefficient of variation in the sample sizes. We see that for small study areas replacement
sampling gives less variation, as expected, than independent sampling, but that the difference
decreases substantially as the study area size increases and for the largest area size independent
sampling gives a less variable set size.
It appears that independent sampling is a simple and efficient approach. It is also quicker to apply than
replacement sampling, though neither of these approaches is time-consuming.
It is also relevant to note that PC sampling is more important in the context of destination choice, as
needed for these projects, than for other contexts such as residential choice, where the link to travel
accessibility from a specific origin is less strong. The strength of the link also affects the balance of
choice between independent and replacement sampling, i.e. results from residential choice studies
cannot necessarily be transferred straightforwardly to destination choice studies.3
In practical studies, a stratified sampling approach has also been used. This approach would sample a
fixed number of alternatives from each of several strata, adding the chosen alternative if it is not
otherwise sampled. The efficiency of such an approach has not been tested in the current work.
3 It is also possible that results for combined mode-destination models will be different from those for ‘pure’ destination choice models as simulated here. The variance of cost across modes will be of a different nature than variation across destinations and this will affect the results.
2.2 Sampling error in PC sampling
A simple approach to assessing sampling error is to calculate the ‘fraction of choice’ that is captured
by 𝐷:
\[
F(D|C) = \frac{\sum_{j \in D} \exp(V_j)}{\sum_{j \in C} \exp(V_j)} \tag{4}
\]
This approach is taken by Miller et al. (2007) in the context of sampling alternatives for application and
is implicitly the measure used to indicate efficiency in the graph above. It is intuitively clear that as the
fraction increases then the approximation will generally improve. After all, when the fraction reaches 1
there is no approximation. But it is a rough measure.
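As a minimal sketch, the fraction in equation (4) can be computed as below. The function name and the dictionary layout (a mapping from every alternative in 𝐶 to its utility) are assumptions made for illustration.

```python
import math

def fraction_of_choice(V, D):
    """Equation (4): share of the total exp(V) weight in C that falls in the sampled set D.

    V maps every alternative in the full set C to its utility (hypothetical layout);
    D is an iterable of sampled alternatives, each of which must be a key of V.
    """
    total = sum(math.exp(v) for v in V.values())
    captured = sum(math.exp(V[j]) for j in set(D))
    return captured / total
```

With four equally attractive alternatives, a sampled set containing two of them captures half of the weight; a set equal to 𝐶 gives a fraction of 1.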
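Suppose we want to estimate $W = \sum_{j \in C} w_j$ by $\tilde{W} = \sum_{j \in D} w_j / q_j$, where $D$ is the set of $j$ that have been sampled. If $q_j$ is the sampling probability, the variance of $\tilde{W}$ is given by
\[
\operatorname{var}(\tilde{W}) = \sum_{j \in D} \left( \frac{w_j}{q_j} \right)^2 q_j (1 - q_j) = \sum_{j \in D} w_j^2 \left( \frac{1}{q_j} - 1 \right) \tag{5}
\]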
It is quite easy to see that this is minimised for a given 𝑄𝐷 = ∑𝑗∈𝐷 𝑞𝑗 , which is related to calculation
cost by the efficiency of the sampling procedure, when 𝑞𝑗 = 𝑘. 𝑤𝑗 (see also Hammersley and
Handscomb, 1964). This result gives a strong indication that the intuitive attribution of sampling
probability as approximately proportional to ‘importance’, measured by exp 𝑉𝑗 = 𝑤𝑗 , is reasonable4. But
the formula for variance is not a simple function of 𝐹(𝐷|𝐶).
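The minimisation claim can be checked with a Lagrange multiplier. The following is a sketch in the notation of equation (5): the multiplier $\lambda$ and the symbol $Q$ for the given total of the sampling probabilities are introduced here for illustration, the sums run over the same index set as in (5), and the bound $q_j \le 1$ is assumed not to bind.

```latex
\begin{aligned}
\Lambda(q, \lambda) &= \sum_j w_j^2 \left( \frac{1}{q_j} - 1 \right)
  + \lambda \Bigl( \sum_j q_j - Q \Bigr), \\
\frac{\partial \Lambda}{\partial q_j} &= -\frac{w_j^2}{q_j^2} + \lambda = 0
  \;\Longrightarrow\; q_j = \frac{w_j}{\sqrt{\lambda}} = k\, w_j,
  \qquad k = \frac{Q}{\sum_j w_j}, \\
\operatorname{var}(\tilde{W}) \Big|_{q_j = k w_j} &= \frac{1}{k} \sum_j w_j - \sum_j w_j^2 .
\end{aligned}
```

Since each term $w_j^2 / q_j$ is convex in $q_j$, the stationary point is a minimum.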
The calculation above applies for independent sampling. Examining the variation in 𝑄𝐷 in the
simulation runs of the previous section, we see that the results for replacement sampling are similar to
those for independent sampling, with lower variation for small area sizes and larger variation for larger
areas. We might expect that a function like (5) could be developed, though it would not necessarily be
simple.
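The error in the likelihood-contribution calculation (3), to which $\tilde{W}$ contributes the sampling error, can be estimated as
\[
\operatorname{var}(L) \cong \left( \frac{\partial L}{\partial \tilde{W}} \right)^2 \operatorname{var}(\tilde{W}) = \frac{\operatorname{var}(\tilde{W})}{\tilde{W}^2} \tag{6}
\]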
4 Of course, we cannot calculate exp 𝑉𝑗 in advance of estimating the model, so importance sampling has to be performed with an approximate proxy used for 𝑤𝑗 .
In estimation of these models, we maximise the likelihood over all the observations, i.e. summing each
individual’s contribution (3) to obtain the overall likelihood. Then the first-order conditions for
optimality are, for each parameter 𝑘 and assuming that the utility functions 𝑉 are linear in parameters
𝛽 and observed data 𝑥:
\[
0 = \frac{\partial L}{\partial \beta_k}
  = \sum_n \Bigl( x_{kcn} - \sum_j p_{jn} x_{kjn} \Bigr)
  = \sum_n \Bigl( x_{kcn} - \sum_j x_{kjn} \frac{\exp(V_{jn} - \log q_{jn})}{\sum_{j' \in D} \exp(V_{j'n} - \log q_{j'n})} \Bigr) \tag{7}
\]
where 𝑛 indexes the individuals.
Examining this result, we see that the sampling error occurs in the denominator, which has variance
given by (5). The resulting estimation error is difficult to calculate, as it depends on the observed data
and the way in which $\tilde{W}$ varies over the observations.
Since the model is not properly specified and (as noted by Guevara and Ben-Akiva, 2010) we are not
working with the true likelihood function, the usual error calculations do not apply. Guevara and
Ben-Akiva show that the estimation error is given by the sandwich matrix and this should therefore be
applied for models estimated using sampling.
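For reference, the usual form of the sandwich (robust) covariance estimator is sketched below. The symbols $H$, $B$ and $L_n$ are notation assumed here rather than taken from Guevara and Ben-Akiva: $L_n$ denotes the contribution (3) of individual $n$, and all derivatives are evaluated at the estimated parameters $\hat{\beta}$.

```latex
\widehat{\operatorname{var}}(\hat{\beta}) = H^{-1} B\, H^{-1}, \qquad
H = \sum_n \frac{\partial^2 L_n}{\partial \beta\, \partial \beta^{\top}}, \qquad
B = \sum_n \frac{\partial L_n}{\partial \beta} \frac{\partial L_n}{\partial \beta^{\top}}
```

When the likelihood is correctly specified, $B$ and $-H$ coincide asymptotically and the expression collapses to the usual inverse-Hessian estimate; with sampling of alternatives that equality cannot be assumed.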
3. Sampling in GEV models
The key references for procedures in GEV models (introduced by McFadden, 1978) are Lee and
Waddell (2010), Guevara and Ben-Akiva (2010, also Guevara, 2010) and Bierlaire et al. (2008). Note
that Frejinger et al. (2009) do not deal with models beyond MNL except through the ‘path size’
correction and that the PC correction is therefore sufficient for their work. Among these, we give
priority to the work of Guevara and Ben-Akiva, but first we note briefly the other work that has been
done.
3.1 Other literature
Koppelman and Garrow (2005) discuss the (important) issue of the choice-based sampling of
observations and do not discuss sampling alternatives at all. One is at a loss to know why this paper is
cited so frequently in the literature discussing sampling of alternatives.
Bierlaire et al. (2008) is also aimed chiefly at the issue of sampling observations, which is not directly
related to the present problem. However, “for the sake of completeness”, they give some attention to
sampling alternatives, deriving results that foreshadow somewhat the work of Guevara. However,
Guevara’s work is more directly focussed on our topic of interest and we shall base our discussion on
those publications.
Lee and Waddell (2010) claim to provide the first consistent estimator for tree logit with sampling of
alternatives. The formula (their equation 5) is simple: the logsum used in the higher (unsampled) level
is5
\[
V_m = \frac{1}{\mu} \log \Bigl( \sum_{i \in m} \frac{1}{R} \exp(\mu V_i) \Bigr) \tag{8}
\]
where 𝑅 is the sampling rate “which only applies to the sampled non-chosen alternatives”, so they
apply a rate of 1 to the chosen alternative. The estimate of the logsum is therefore a function of the
chosen alternative. When 𝜇 = 1, i.e. the model is MNL, this is different from McFadden’s PC, so that it
appears that the Lee and Waddell procedure is incorrect. Simple simulations confirm that a bias is
introduced.
3.2 Guevara and Ben-Akiva work
Guevara and Ben-Akiva (2010, abbreviated as GBA)6, which relates closely to a chapter of Guevara’s
thesis (2010), gives the theorem that consistent estimation can be achieved by a correction of the logit
utility function
\[
V_i^* = V_i + \log G_i(D) + \log \pi(D|i) \tag{9}
\]
where 𝜋(𝐷|𝑖) is the probability of selecting the reduced choice set 𝐷, given that 𝑖 is the chosen
alternative; we note that this is reassuringly the standard McFadden PC correction;
𝐺𝑖 is the derivative with respect to its 𝑖 th argument of the GEV7 generating function 𝐺; here we
note that it is calculated for the restricted choice set 𝐷.
In an MNL model, 𝐺𝑖 = 1 for all the alternatives, so that this term disappears from the function and we
return to the standard McFadden MNL formulation. However, in more general GEV, such as nested
logit, this term does not disappear. Ben-Akiva and Lerman (1985) show that equation (9) can be used
(without sampling, i.e. without the 𝜋 term) to represent any GEV model, so that the GBA theorem
using (9) represents an intuitive extension of both McFadden sampling and the Ben-Akiva/Lerman
finding.
For nested logit (later work deals with cross-nesting), GBA obtain the formula, simplified here slightly
by removing the arbitrary scale at the root and rearranging:
\[
\log G_i = (\mu - 1) \Bigl( V_i - \frac{1}{\mu}\, \mathrm{logsum}(m) \Bigr) \tag{10}
\]
where 𝜇 is the nesting coefficient.
5 This is clearly a top-normalised RU2 formulation, but they switch to a bottom-normalised RU1 formulation in the empirical work in equations (7) and (8). Normalisation cannot affect the theory of sampling, but care is needed in the formulae.
6 I am aware this paper has been improved for potential publication.
7 GBA use the term MEV to describe the models introduced by McFadden as GEV.
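Equations (9) and (10) can be sketched in Python as follows. This is an illustration under stated assumptions: equation (10) is read as log 𝐺𝑖 = (𝜇 − 1)(𝑉𝑖 − 𝑙𝑜𝑔𝑠𝑢𝑚(𝑚)/𝜇), the logsum is taken as the log of the sum of exp(𝜇𝑉𝑗) over the alternatives supplied for the nest, and the function names are hypothetical.

```python
import math

def log_G_i(V_i, V_nest, mu):
    """Equation (10) for nested logit, with the root scale normalised to 1.

    V_nest holds the utilities of the alternatives used for nest m
    (the restricted set, when sampling); mu is the nesting coefficient.
    """
    logsum = math.log(sum(math.exp(mu * v) for v in V_nest))
    return (mu - 1.0) * (V_i - logsum / mu)

def corrected_utility(V_i, V_nest, mu, log_pi):
    """Equation (9): V_i* = V_i + log G_i(D) + log pi(D|i)."""
    return V_i + log_G_i(V_i, V_nest, mu) + log_pi
```

With `mu` equal to 1 the `log_G_i` term is zero and `corrected_utility` reduces to the McFadden correction for MNL, as noted in the text.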
The estimator they propose (their equations 14 and 15) is for the logsum for nest 𝑚:
\[
\mathrm{logsum}(m) \approx \log \sum_{j \in D(m)} \frac{\tilde{n}_j}{E(j)} \exp(\mu_m V_j) \tag{11}
\]
where $\tilde{n}_j$ is the number of times alternative 𝑗 is actually sampled and
𝐸(𝑗) is the expectation of this number.
It is shown by GBA that the term 𝑛̃𝑗 ⁄𝐸(𝑗) is the expansion factor required to obtain an unbiased
estimate of the logsum. It is important to note that the sampling procedure used to estimate the
logsum need not be the same as the procedure used to sample the set 𝐷. In the GBA paper, the use
of separate procedures is called re-sampling, but this naming is confusing. From an array of
possibilities, GBA recommend two procedures 8.
• Using separate sampling procedures, such that the sampling for the logsum approximation
does not depend on the chosen alternative, works well in the GBA simulations.
• “When re-sampling is not possible” the same sampling can be used but in this case, because
of the dependence of 𝐷 on the chosen alternative, the expansion factors depend on the model
parameters and the model must be estimated iteratively. This iterative procedure works well in
the GBA simulations.
We note that the GBA simulations relate to residential choice and these results would have to be
checked before using their approaches for other applications, such as destination choice modelling. In
particular, the benefit gained by sampling, reduced in the second approach by the iterative procedure,
would have to be tested.
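The logsum estimator in equation (11) can be sketched as below. This is a minimal illustration, not GBA's implementation: the function name and the dictionary layout (utilities, sampled counts and expected counts keyed by alternative) are assumptions, and the expected counts are taken as given rather than derived from a particular sampling protocol.

```python
import math

def expanded_logsum(V, n_sampled, expected, mu):
    """Equation (11): log of the sum over sampled alternatives in nest m of
    (n_j / E(j)) * exp(mu * V_j), where n_j is the number of times j was sampled
    and E(j) is the expectation of that number."""
    total = 0.0
    for j, count in n_sampled.items():
        if count > 0:  # alternatives that were never drawn contribute nothing
            total += (count / expected[j]) * math.exp(mu * V[j])
    return math.log(total)
```

If every alternative in the nest is included exactly once with an expected count of 1, the expansion factors are all 1 and the function returns the exact logsum, consistent with the remark below that no correction is needed when no logsum is approximated.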
For practical implementation it seems that we could run the iterative procedure using an unamended
version of ALOGIT. Calculation of the logsums and 𝐺𝑖 , using equations (10) and (11) can be made in
ALOGIT code and these can be added as constants to the utility functions.9
An interesting feature of the GBA result is that, if sampling is such that no logsums require
approximation, then no correction is required. For example, if we have a mode-destination choice
model with destinations above modes à l’Américaine, and we sample destinations but not modes, then
there is no approximation of logsums. However, this structure could not be applied in Europe without
at least testing the other structure (more common in Europe), so that estimation procedures that
sample destinations when they are not at the top of the tree are needed in any case.
8 I am grateful to Angelo Guevara for clarifying these points for me.
9 Note that in ALOGIT it is quite efficient to use the AVAIL array to implement the sampling of alternatives. Non-available alternatives are not considered during the estimation (not written to F11) so that the set 𝐷 can be constructed simply by marking as unavailable all of the alternatives in 𝐶\𝐷.
4. Conclusion and Recommendation
Sampling alternatives introduces a number of issues into the modelling.
• A sampling procedure has to be specified, implemented and tested.
• Errors are introduced into the modelling as a result of the sampling and these errors are not
easily quantified.
• For the TT3 project, it is essential that we use tree-nested logit models.
• Sampling procedures for models more complex than MNL have not been resolved entirely and
the literature is still being extended.
• Efficient software can work with quite large numbers of alternatives.
Therefore it is recommended that the TT3 project should not sample alternatives for model estimation.
For model application, the literature is even less developed, though the methods used by Miller et al.
(2007) appear efficient but have not been shown to be unbiased. For similar reasons, it is
recommended that no sampling be undertaken for model application.
References
Ben-Akiva, M. and Lerman, S. (1985), Discrete Choice Analysis: theory and application to travel
demand, MIT Press, see pp. 261-269 (Estimation of choice models with a sample of alternatives).
Bierlaire, M., Bolduc, D. and McFadden, D. (2008), The estimation of generalized extreme value
models from choice-based samples, Trans. Res. B, 42, pp. 381-394.
Frejinger, E., Bierlaire, M. and Ben-Akiva, M. (2009), Sampling of alternatives for route choice
modelling, Trans. Res. B, 43, pp. 984-994.
Guevara, C. and Ben-Akiva, M. (2010), Sampling of alternatives in multivariate extreme value (MEV)
models, WCTR, Lisbon.
Guevara (2010), Ph.D., Massachusetts Institute of Technology.
Hammersley, J. and Handscomb, D. (1964), Monte Carlo Methods, Chapman and Hall, pp. 57-59
(Importance Sampling).
Koppelman & Garrow (2005), Efficiently estimating nested logit models with choice-based samples:
Example applications, Transportation Research Record 1921: 63-69.
Lee, B.H. and Waddell, P (2010), Residential mobility and location choice: a nested logit model with
sampling of alternatives, Transportation, 37, pp. 587-601.
McFadden, D.L. (1978), Modelling the choice of residential location, in Karlqvist, A., Lundqvist, L.,
Snickars, F. and Weibull, J., Spatial interaction theory and residential location, North-Holland, pp. 75-96.
Miller, S., Daly, A., Fox, J. and Kohli, S. (2007), Destination sampling in forecasting: application in the
PRISM model for the UK West Midlands Region, presented to European Transport Conference,
Noordwijkerhout.
Nerella, S. and Bhat, C. (2004), A numerical analysis of the effect of sampling of alternatives in
discrete choice models, TRB.
Appendix A Details of simulation
Independent and Replacement Sampling
These simulations are based on 10000 draws of sets of destinations from a total set of 100.
Destinations are located at a ‘distance’ from the origin of 𝑑𝑗 = 10√𝑗, with 𝑗 the destination number, the
square root function giving a representation of roughly uniform distribution of destinations in space.
Each destination is assigned a utility function of
𝑉𝑗 = 𝛽. 𝑑𝑗 + 𝛾. 𝛿1
where 𝛿1 indicates an ‘intrazonal’ trip, i.e. with destination in zone 1;
𝛽, 𝛾 are the assumed parameters of the model.
In these models we set 𝛾 = 2 and 𝛽 is set to scale the impact of distance. In effect the variation of 𝛽
models the variation of the size of the study area relative to trip length, with a smaller 𝛽 indicating a
larger study area.
Independent sampling is then undertaken with 𝑞𝑗 = 𝑓. exp 𝑉𝑗 ⁄∑ exp 𝑉𝑘 and 𝑓 set to achieve a roughly
uniform sample size.
Replacement sampling is then undertaken 𝐽̃ times, with 𝑞𝑗 = exp 𝑉𝑗 ⁄∑ exp 𝑉𝑘 and 𝐽̃ set to achieve a
roughly uniform sample size.
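The set-up described above can be sketched in Python as follows. This is a hedged reconstruction, not the code used for the reported runs: the function names are hypothetical, and clipping 𝑞𝑗 at 1 for independent sampling is an assumption made here because 𝑓 times the choice share can exceed 1 for the intrazonal destination.

```python
import math
import random

def utilities(beta, gamma=2.0, n_dest=100):
    """V_j = beta * d_j + gamma * delta_1, with d_j = 10 * sqrt(j) for j = 1..n_dest."""
    return [beta * 10.0 * math.sqrt(j) + (gamma if j == 1 else 0.0)
            for j in range(1, n_dest + 1)]

def choice_shares(V):
    """exp(V_j) divided by the sum of exp(V_k) over all destinations."""
    total = sum(math.exp(v) for v in V)
    return [math.exp(v) / total for v in V]

def expected_independent_size(beta, f):
    """Expected set size with q_j = f * share_j, clipped at 1 (clipping is assumed)."""
    return sum(min(1.0, f * p) for p in choice_shares(utilities(beta)))

def expected_replacement_size(beta, draws):
    """Expected number of distinct destinations after `draws` draws with replacement."""
    return sum(1.0 - (1.0 - p) ** draws for p in choice_shares(utilities(beta)))

def simulate_independent_sizes(beta, f, n_draws=10000, seed=0):
    """Simulated set sizes under independent sampling, one per drawn set."""
    rng = random.Random(seed)
    q = [min(1.0, f * p) for p in choice_shares(utilities(beta))]
    return [sum(1 for qj in q if rng.random() < qj) for _ in range(n_draws)]
```

Under these assumptions, `expected_independent_size(-0.03, 20)` comes out close to the average sample of 16.05 reported for 𝛽 = –0.03 in Table 1, which suggests the clipping assumption is at least consistent with the reported figures.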
The settings of 𝛽, 𝑓 and 𝐽̃ and the sample sizes achieved are shown in the table below.
Table 1: Specification of sampling parameters
         Independent sampling      Replacement sampling
𝛽        𝑓     average sample      𝐽̃     average sample
–0.03    20    16.05               23    16.25
–0.05    26    16.38               32    16.09
–0.07    37    16.15               50    16.17
–0.09    60    16.18               85    16.36
The key results, i.e. the proportion of ‘weight’ included in the sample, relative to the proportion of
destinations, and the variation of the sample size, are given in the graph in the main text.
Sampling without replacement was done with 16 samples each time.