SEVENTH FRAMEWORK PROGRAMME
THEME [SST.2010.1.3-1.] [Transport modelling for policy impact assessments]

Grant agreement for: Coordination and support action
Acronym: Transtools 3
Full title: "Research and development of the European Transport Network Model – Transtools Version 3"
Proposal/Contract no.: MOVE/FP7/266182/TRANSTOOLS 3
Start date: 1st March 2011
Duration: 36 months

Milestone 79 - "Sampling alternatives for choice model estimation in practice"

Document number: TT3_WP8_COM_MS79_Sampling alternatives for choice model estimation_0b
Workpackage: WP8
Deliverable nature: Report
Dissemination level: N/A
Lead beneficiary: ITS Leeds, beneficiary 2, Andrew Daly.
Due date of deliverable: March, 2012
Date of preparation of deliverable: 30.12.2011
Date of last change: 17.03.2012
Date of approval by Commission: N/A

Abstract: The note discusses practical approaches for sampling alternatives in the estimation of GEV models, which include the tree-nested models that will be used for the passenger and freight demand modelling. It covers both procedures for actual sampling of subsets of alternatives and the process of estimation based on those subsets, pointing out the deficiencies in the academic literature, which is still being developed. It concludes that, given the moderate numbers of alternatives to be used in this study, the efficiency of modern software and hardware, and the difficulties and limitations of the methods, sampling of alternatives should not be used for model estimation in this work.

Keywords: logit models, sampling alternatives, model estimation

Author(s): Daly, Andrew

Disclaimer: The contents of this report reflect the views of the author(s) and do not necessarily reflect the official views or policy of the European Union. The European Union is not liable for any use that may be made of the information contained in the report. The report is not an official deliverable under the TT3 project and has not been reviewed or approved by the Commission.
The report is a working document of the Consortium.

Sampling alternatives for choice model estimation in practice.
Report version 0c
2012
By Andrew Daly

Copyright: Reproduction of this publication in whole or in part must include the customary bibliographic citation, including author attribution, report title, etc.
Published by: Department of Transport, Bygningstorvet 116 Vest, DK-2800 Kgs. Lyngby, Denmark
Request report from: www.transport.dtu.dk

Content
1. Introduction
2. McFadden (1978): Sampling alternatives for MNL
3. Sampling in GEV models
4. Conclusion and Recommendation

Summary

The note begins by pointing out that sampling alternatives introduces a number of new issues involving the specification of sampling procedures and the calculation of errors. With moderate numbers of alternatives it is possible to make estimations efficiently without sampling.

For multinomial logit, sampling procedures due to McFadden (1978) are known to be unbiased when used in estimation, though the increased error introduced by these procedures is unknown. Simple sampling procedures are consistent with the McFadden requirements, though the sampling criteria would have to be specified for each study.

For more general models, as will be needed for this study, the best sampling procedures are unknown, though work by Guevara and Ben-Akiva (2010) is promising.

Because of the theoretical and practical complications, and the possibility of efficient work without sampling, it is recommended that no sampling be undertaken for model estimation in this study.
For model application, the methods are even less well developed and we similarly recommend that no sampling be undertaken.

1. Introduction

1.1 Context of the milestone

For the TransTools3 (TT3) project it is necessary to estimate choice models with substantial numbers of spatial alternatives. Such models can make heavy demands on computer resources, particularly run time, but also potentially the storage requirement. An option to reduce these demands is to look for some way of reducing the number of alternatives actually used in model estimation by sampling. The intention of the present note is to set out the practical procedures available for sampling and recommend a procedure for this project.

For the purposes of the current note, we shall assume that the observations can be treated as independent of each other. Since TT3 is working with RP observations, this assumption seems reasonable.

1.2 Preliminary remark

Sampling of alternatives should only be undertaken when it is clearly necessary. Alternative sampling always leads to increased error in the parameter estimates, and this error is not straightforward to calculate. Below, we begin to analyse the nature of the error specifically introduced by alternative sampling in a limited range of cases. This additional error is usually ignored in practice, but ignoring it is incorrect.

Moreover, in many cases the use of sampling imposes a duty on the analyst to make calculations of correction terms, sometimes simple, sometimes complicated, that need checking. This additional burden is an important practical reason to avoid sampling unless it is necessary.

Finally, in complex models the guarantee of consistency of estimation is not always present. The main aim of this note is to discuss the practical application of methods that achieve consistency in a wider range of cases.

Nerella and Bhat (2004) give some indications of the magnitude of the error, both for MNL and for more complicated models.
For MNL, they give some guidelines on minimum sample size to achieve stability, but these are based on a simple sampling strategy in simulated data and so are not likely to be transferable to real data and more sophisticated sampling. In particular, the efficiency of sampling can vary substantially between contexts and sampling procedures. For mixed logit models, the results are less accurate, but in this case the estimates are not necessarily consistent and would not usually be used in practice. The current situation appears to be that sampling alternatives in MNL causes noise, but we would not be able to state in advance what that noise would be in a specific situation; sampling alternatives in mixed logit causes more noise and bias may also be present, but we are not able to say how much of either.

With efficient software it is possible to estimate MNL and tree-nested models with quite large numbers of alternatives. RAND Europe routinely uses ALOGIT to estimate models with tens of thousands of alternatives, dozens of parameters and tens of thousands of observations.1 Of course runs of this magnitude are time-consuming but they remain entirely feasible for practical analysis work. ALOGIT is currently not able to estimate cross-nested models or models based on repeated observations for individuals (panel data), but these features are not necessary for the current TT3 work.

In each study, therefore, a decision needs to be taken about whether or not sampling is to be undertaken. When sampling is undertaken, we should investigate how much error this is likely to introduce. Procedures that may introduce bias should not be used without careful testing.

1.3 Reading guidance

In this note, we discuss first the sampling procedures appropriate for MNL models and some of the basic properties of sampling relating to destination choice. We then turn to sampling for GEV models.
1 The Home-other model in the Sydney system had 29,000 observations, 46 parameters and about 26,000 alternatives.

2. McFadden (1978): Sampling alternatives for MNL

In his 1978 paper, McFadden set out the Positive Conditioning (PC) property under which consistent estimates of an MNL model can be obtained with alternative sampling. Specifically, he showed that asymptotically consistent estimates of model parameters can be obtained if we maximise a modified likelihood function, with a contribution for each individual of

$$L = \log \frac{\exp\left(V_c + \log \pi(D|c)\right)}{\sum_{j \in D} \exp\left(V_j + \log \pi(D|j)\right)} \tag{1}$$

where
$V_j$ is the systematic part of utility for alternative $j$;
$c$ is the chosen alternative;
$D$ is the sampled set of alternatives for this individual, which is a subset of the set of all available alternatives $C$; and
$\pi(D|j)$ is the probability of sampling $D$ if $j$ is chosen; we must have $\pi(D|j) > 0 \;\forall\, j \in D$ (which is the PC property).

Clearly, if $\pi(D|j)$ is the same for all $j \in D$, then it will cancel out in the equation above. This is the Uniform Conditioning property. The simplicity of Uniform Conditioning is attractive, but in many practical cases some alternatives are much more important than others, in the sense of being much more likely to be chosen, so that the general PC approach with unequal $\pi$ values is more efficient. Some intuition on how 'important' alternatives should be identified is given below.

Note that it is essential that the chosen alternative $c$ is included in $D$.

An important point in practice is that McFadden's PC theorem requires the assumption that "the choice model is multinomial logit". In practice, this will often not be the case and the consequence is then that estimation using the amended likelihood function may not give the same result as when using the full model.2 However, in this case, neither the base MNL nor the sampled version is correct and sampling does not necessarily make matters worse.
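The correction in equation (1) can be illustrated numerically. The following is a minimal Python sketch and not part of the original note; the function name `sampled_mnl_loglik`, the dictionary-based inputs and all numerical values are assumptions made purely for illustration. It evaluates one individual's contribution and shows that a constant $\log \pi(D|j)$ (Uniform Conditioning) leaves the value unchanged, whereas unequal values do not.

```python
import math


def sampled_mnl_loglik(utilities, log_pi, chosen):
    """Log-likelihood contribution of equation (1) for one individual.

    utilities: dict mapping each alternative j in the sampled set D to V_j
    log_pi:    dict mapping each j in D to log pi(D|j); requires pi(D|j) > 0
    chosen:    the chosen alternative c, which must belong to D
    """
    if chosen not in utilities:
        raise ValueError("the chosen alternative must be included in D")
    corrected = {j: utilities[j] + log_pi[j] for j in utilities}
    # Log-sum-exp with the maximum subtracted, for numerical stability.
    top = max(corrected.values())
    log_denominator = top + math.log(
        sum(math.exp(v - top) for v in corrected.values())
    )
    return corrected[chosen] - log_denominator


# Hypothetical utilities for a sampled set D = {a, b, c}; "a" is chosen.
V = {"a": -1.0, "b": -1.5, "c": -2.5}

# Uniform Conditioning: the same log pi for every j in D cancels out.
uniform = sampled_mnl_loglik(V, {j: math.log(0.2) for j in V}, "a")
plain = sampled_mnl_loglik(V, {j: 0.0 for j in V}, "a")

# Unequal pi values (illustrative numbers only) change the contribution.
unequal = sampled_mnl_loglik(
    V, {"a": math.log(0.5), "b": math.log(0.2), "c": math.log(0.1)}, "a"
)
```

In this sketch `uniform` and `plain` agree to rounding error, while `unequal` differs because the correction terms no longer cancel.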
2.1 Practical sampling strategies for PC

A simple practical approach for PC is to use independent sampling, where each unchosen alternative $j$ is sampled with probability $q_j$. Alternatively, we can sample a fixed number of times from $C$, with replacement, giving each alternative a probability $q_j$ of being sampled at each draw and then delete the duplicate alternatives. In each case (Ben-Akiva and Lerman, 1985, equations 9.22 and 9.23) these strategies yield

$$\pi(D|j) = \frac{1}{q_j} K(D) \tag{2}$$

with $K(D)$ independent of $j$. Examining the likelihood equation (1), we see that $K(D)$ cancels out and we are left with

$$L = \log \left( \frac{\exp\left(V_c - \log q_c\right)}{\sum_{j \in D} \exp\left(V_j - \log q_j\right)} \right) \tag{3}$$

2 That is, the validity of the theorem depends on behaviour truly according with the MNL formula. I am grateful to Prof. McFadden for patiently pointing this out to me, as well as that the theorem applies only to large samples. Issues arising from other model specification errors, e.g. that the error variance is not uniform across the alternatives (a very likely issue in TT3), would also impact on the proofs of the theorems and in an unknown way on the model outcomes.

With independent sampling the expected set size is $Q_C = \sum_{j \in C} q_j$ but there is quite likely to be considerable variation around this number. When sampling with replacement the probabilities $q_j$ must sum to 1, of course, but the advantage claimed for this method is that the size of the set $D$ varies less than with independent sampling; the expected set size is more complicated to determine. In each case we can adjust the sample rate to obtain a suitable balance between sampling error and computational cost.

A further approach, which has been used in some practical studies, is to sample without replacement. This gives no variation in the sampled set size, but the calculation of $\pi(D|j)$ is complicated and for this reason the approach is not generally recommended.
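The two simple strategies described in this section can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the procedure used in any particular study: the function names, the six-alternative example and the probabilities are all hypothetical, and the chosen alternative is simply forced into the set in both cases. The correction terms returned are the $-\log q_j$ terms of equation (3).

```python
import math
import random


def independent_sample(q, chosen, rng):
    """Include each unchosen alternative j independently with probability q[j].

    The chosen alternative is always included, as the PC approach requires.
    """
    return {j for j, qj in q.items() if j == chosen or rng.random() < qj}


def replacement_sample(q, chosen, n_draws, rng):
    """Draw n_draws times with replacement, then drop duplicates.

    Here the q[j] are per-draw probabilities and must sum to 1.
    """
    alternatives = list(q)
    weights = [q[j] for j in alternatives]
    draws = rng.choices(alternatives, weights=weights, k=n_draws)
    return set(draws) | {chosen}


def correction_terms(q, sampled_set):
    """Terms -log q_j that are added to V_j in equation (3) for each j in D."""
    return {j: -math.log(q[j]) for j in sampled_set}


rng = random.Random(2012)  # fixed seed so the sketch is reproducible

# Hypothetical sampling probabilities for a full set C of six alternatives.
q_independent = {1: 0.9, 2: 0.6, 3: 0.4, 4: 0.2, 5: 0.1, 6: 0.1}
expected_size = sum(q_independent.values())  # Q_C, before forcing in the chosen one

D_ind = independent_sample(q_independent, chosen=4, rng=rng)

# For replacement sampling the same weights are rescaled to sum to 1.
total = sum(q_independent.values())
q_draw = {j: qj / total for j, qj in q_independent.items()}
D_rep = replacement_sample(q_draw, chosen=4, n_draws=3, rng=rng)

corrections = correction_terms(q_independent, D_ind)
```

Independent sampling gives a set whose size varies from draw to draw, while the replacement version can never return more than `n_draws` distinct alternatives plus the chosen one.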
Ben-Akiva and Lerman imply that replacement sampling is likely to be preferable to independent sampling, because of the anticipated reduced variation of the set size. Variation of the set size is, however, not obviously directly connected with the accuracy of estimation; some aspects of these issues can be tested quite readily by simulation, as illustrated in Figure 1. Details of the simulation are given in the Appendix.

[Figure 1: Results of simulated sampling of choice sets. Chart title: "Efficiency and set size variation". Series: Independent Cost, With-Rep. Cost, Without-Rep. Cost, Independent Variation, With-Rep. Variation, Without-Rep. Variation. Horizontal axis: deterrence parameter, from -0.03 to -0.11. Vertical axis: 0% to 45%.]

The upper lines in the graph show the relative cost of the three approaches, specifically giving the ratio of selected destinations to the weight of those destinations, for a selected reasonable sampling rate (about 16% of the destinations). We see that the differences between the procedures are very small and that all the procedures become more efficient as the parameter becomes more strongly negative (equivalently, the study area size increases relative to expected trip lengths).

The lower lines show the coefficient of variation in the sample sizes. We see that for small study areas replacement sampling gives less variation, as expected, than independent sampling, but that the difference decreases substantially as the study area size increases and for the largest area size independent sampling gives a less variable set size.

It appears that independent sampling is a simple and efficient approach. It is also quicker to apply than replacement sampling, though neither of these approaches is time-consuming.
It is also relevant to note that PC sampling is more important in the context of destination choice, as needed for these projects, than for other contexts such as residential choice, where the link to travel accessibility from a specific origin is less strong. The strength of the link also affects the balance of choice between independent and replacement sampling, i.e. results from residential choice studies cannot necessarily be transferred straightforwardly to destination choice studies.3

3 It is also possible that results for combined mode-destination models will be different from those for 'pure' destination choice models as simulated here. The variance of cost across modes will be of a different nature from the variation across destinations and this will affect the results.

In practical studies, a stratified sampling approach has also been used. This approach would sample a fixed number of alternatives from each of several strata, adding the chosen alternative if it is not otherwise sampled. The efficiency of such an approach has not been tested in the current work.

2.2 Sampling error in PC sampling

A simple approach to assessing sampling error is to calculate the 'fraction of choice' that is captured by $D$:

$$F(D|C) = \frac{\sum_{j \in D} \exp(V_j)}{\sum_{j \in C} \exp(V_j)} \tag{4}$$

This approach is taken by Miller et al. (2007) in the context of sampling alternatives for application and is implicitly the measure used to indicate efficiency in the graph above. It is intuitively clear that as the fraction increases then the approximation will generally improve. After all, when the fraction reaches 1 there is no approximation. But it is a rough measure.

Suppose we want to estimate $W = \sum_{j \in C} w_j$ by $\tilde{W} = \sum_{j \in D} \frac{w_j}{q_j}$, where $D$ is the set of $j$ that have been sampled. If $q_j$ is the sampling probability, the variance of $\tilde{W}$ is given by

$$\mathrm{var}(\tilde{W}) = \sum_{j \in D} \left(\frac{w_j}{q_j}\right)^2 q_j (1 - q_j) = \sum_{j \in D} w_j^2 \left(\frac{1}{q_j} - 1\right) \tag{5}$$

It is quite easy to see that this is minimised for a given $Q_D = \sum_{j \in D} q_j$, which is related to calculation cost by the efficiency of the sampling procedure, when $q_j = k\,w_j$ (see also Hammersley and Handscomb, 1964). This result gives a strong indication that the intuitive attribution of sampling probability as approximately proportional to 'importance', measured by $\exp V_j = w_j$, is reasonable.4 But the formula for variance is not a simple function of $F(D|C)$.

4 Of course, we cannot calculate $\exp V_j$ in advance of estimating the model, so importance sampling has to be performed with an approximate proxy used for $w_j$.

The calculation above applies for independent sampling. Examining the variation in $Q_D$ in the simulation runs of the previous section, we see that the results for replacement sampling are similar to those for independent sampling, with lower variation for small area sizes and larger variation for larger areas. We might expect that a function like (5) could be developed, though it would not necessarily be simple.

The error in the likelihood-contribution calculation (3), to which $\tilde{W}$ contributes the sampling error, can be estimated as

$$\mathrm{var}(L) \cong \left(\frac{\partial L}{\partial \tilde{W}}\right)^2 \mathrm{var}(\tilde{W}) = \frac{\mathrm{var}(\tilde{W})}{\tilde{W}^2} \tag{6}$$

In estimation of these models, we maximise the likelihood over all the observations, i.e. summing each individual's contribution (3) to obtain the overall likelihood. Then the first-order conditions for optimality are, for each parameter $k$ and assuming that the utility functions $V$ are linear in parameters $\beta$ and observed data $x$:

$$0 = \frac{\partial L}{\partial \beta_k} = \sum_n \Bigl( x_{kcn} - \sum_j p_{jn}\, x_{kjn} \Bigr) = \sum_n \left( x_{kcn} - \sum_j x_{kjn} \frac{\exp\left(V_{jn} - \log q_{jn}\right)}{\sum_{j' \in D} \exp\left(V_{j'n} - \log q_{j'n}\right)} \right) \tag{7}$$

where $n$ indexes the individuals. Examining this result, we see that the sampling error occurs in the denominator, which has variance given by (5).
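The statement that (5) is minimised, for a fixed total sampling probability, when $q_j$ is proportional to $w_j$ can be checked on a small example. The sketch below is illustrative only: the weights, the total sampling probability and the function names are hypothetical, and the lists are simply taken to cover the alternatives entering the sum in (5). It compares a proportional allocation with a flat allocation of the same total.

```python
def weight_variance(w, q):
    """Right-hand side of equation (5): sum over j of w_j^2 * (1/q_j - 1)."""
    return sum(wj * wj * (1.0 / qj - 1.0) for wj, qj in zip(w, q))


def proportional_probabilities(w, total_q):
    """q_j = k * w_j, with k chosen so that the q_j sum to total_q.

    Assumes k * w_j does not exceed 1 for any j; otherwise the
    probabilities would have to be capped and the remainder reallocated.
    """
    k = total_q / sum(w)
    q = [k * wj for wj in w]
    if max(q) > 1.0 + 1e-12:
        raise ValueError("proportional allocation gives a probability above 1")
    return q


# Hypothetical weights w_j = exp(V_j) and a total sampling probability of 2.
w = [5.0, 3.0, 1.0, 1.0]
total_q = 2.0

q_prop = proportional_probabilities(w, total_q)  # roughly 1.0, 0.6, 0.2, 0.2
q_flat = [total_q / len(w)] * len(w)             # 0.5 for every alternative

var_prop = weight_variance(w, q_prop)
var_flat = weight_variance(w, q_flat)
```

With these illustrative numbers the proportional allocation gives a variance of about 14 against 36 for the flat allocation, consistent with allocating sampling probability according to 'importance'.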
The resulting estimation error is difficult to calculate, as it depends on the observed data and the way in which $\tilde{W}$ varies over the observations.

Since the model is not properly specified and (as noted by Guevara and Ben-Akiva, 2010) we are not working with the true likelihood function, the usual error calculations do not apply. Guevara and Ben-Akiva show that the estimation error is given by the sandwich matrix and this should therefore be applied for models estimated using sampling.

3. Sampling in GEV models

The key references for procedures in GEV models (introduced by McFadden, 1978) are Lee and Waddell (2010), Guevara and Ben-Akiva (2010, also Guevara, 2010) and Bierlaire et al. (2008). Note that Frejinger et al. (2009) do not deal with models beyond MNL except through the 'path size' correction and that the PC correction is therefore sufficient for their work.

Among these, we give priority to the work of Guevara and Ben-Akiva, but first we note briefly the other work that has been done.

3.1 Other literature

Koppelman and Garrow (2005) discuss the (important) issue of the choice-based sampling of observations and do not discuss sampling alternatives at all. One is at a loss to know why this paper is cited so frequently in the literature discussing sampling of alternatives.

Bierlaire et al. (2008) is also aimed chiefly at the issue of sampling observations, which is not directly related to the present problem. However, "for the sake of completeness", they give some attention to sampling alternatives, deriving results that foreshadow somewhat the work of Guevara. Guevara's work is, however, more directly focussed on our topic of interest and we shall base our discussion on those publications.

Lee and Waddell (2010) claim to provide the first consistent estimator for tree logit with sampling of alternatives.
The formula (their equation 5) is simple: the logsum used in the higher (unsampled) level is5

$$V_m = \left(\frac{1}{\mu}\right) \log \left( \sum_{i \in m} \left(\frac{1}{R}\right) \exp(\mu V_i) \right) \tag{8}$$

where $R$ is the sampling rate "which only applies to the sampled non-chosen alternatives", so they apply a rate of 1 to the chosen alternative. The estimate of the logsum is therefore a function of the chosen alternative. When $\mu = 1$, i.e. the model is MNL, this is different from McFadden's PC, so that it appears that the Lee and Waddell procedure is incorrect. Simple simulations confirm that a bias is introduced.

3.2 Guevara and Ben-Akiva work

Guevara and Ben-Akiva (2010, abbreviated as GBA)6, which relates closely to a chapter of Guevara's thesis (2010), gives the theorem that consistent estimation can be achieved by a correction of the logit utility function

$$V_i^* = V_i + \log G_i(D) + \log \pi(D|i) \tag{9}$$

where
$\pi(D|i)$ is the probability of selecting the reduced choice set $D$, given that $i$ is the chosen alternative; we note that this is reassuringly the standard McFadden PC correction;
$G_i$ is the derivative with respect to its $i$th argument of the GEV7 generating function $G$; here we note that it is calculated for the restricted choice set $D$.

In an MNL model, $G_i = 1$ for all the alternatives, so that this term disappears from the function and we return to the standard McFadden MNL formulation. However, in more general GEV, such as nested logit, this term does not disappear. Ben-Akiva and Lerman (1985) show that equation (9) can be used (without sampling, i.e. without the $\pi$ term) to represent any GEV model, so that the GBA theorem using (9) represents an intuitive extension of both McFadden sampling and the Ben-Akiva/Lerman finding.
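As a worked check on the $G_i$ term in (9), the derivation below assumes the usual two-level nested logit generating function with the scale at the root set to 1 and a nesting coefficient $\mu_m$ for nest $m$. That functional form and the shorthand $y_j = \exp(V_j)$ are assumptions of this sketch rather than notation taken from the note.

```latex
\begin{align*}
G(y) &= \sum_{m} \Bigl( \sum_{j \in m} y_j^{\mu_m} \Bigr)^{1/\mu_m},
        \qquad y_j = \exp(V_j), \\
G_i(y) &= \frac{\partial G}{\partial y_i}
        = y_i^{\mu_m - 1} \Bigl( \sum_{j \in m} y_j^{\mu_m} \Bigr)^{1/\mu_m - 1}
        \qquad (i \in m), \\
\log G_i &= (\mu_m - 1)\, V_i
        - \frac{\mu_m - 1}{\mu_m} \log \sum_{j \in m} \exp(\mu_m V_j).
\end{align*}
```

Setting $\mu_m = 1$ gives $\log G_i = 0$, which is the MNL case. For $\mu_m \neq 1$ the term depends on the sum over the whole nest, which is why sampling within a nest means that this sum has to be approximated.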
For nested logit (later work deals with cross-nesting), GBA obtain the formula, simplified here slightly by removing the arbitrary scale at the root and rearranging:

$$\log G_i = (\mu - 1) \left( V_i - \frac{1}{\mu}\, \mathrm{logsum}(m) \right) \tag{10}$$

where $\mu$ is the nesting coefficient. The estimator they propose (their equations 14 and 15) is for the logsum for nest $m$:

$$\mathrm{logsum}(m) \approx \log \sum_{j \in D(m)} \frac{\tilde{n}_j}{E(j)} \exp(\mu_m V_j) \tag{11}$$

where $\tilde{n}_j$ is the number of times alternative $j$ is actually sampled and $E(j)$ is the expectation of this number. It is shown by GBA that the term $\tilde{n}_j / E(j)$ is the expansion factor required to obtain an unbiased estimate of the logsum.

5 This is clearly a top-normalised RU2 formulation, but they switch to a bottom-normalised RU1 formulation in the empirical work in equations (7) and (8). Normalisation cannot affect the theory of sampling, but care is needed in the formulae.
6 I am aware this paper has been improved for potential publication.
7 GBA use the term MEV to describe the models introduced by McFadden as GEV.

It is important to note that the sampling procedure used to estimate the logsum need not be the same as the procedure used to sample the set $D$. In the GBA paper, the use of separate procedures is called re-sampling, but this naming is confusing.

From an array of possibilities, GBA recommend two procedures.8 Using separate sampling procedures, such that the sampling for the logsum approximation does not depend on the chosen alternative, works well in the GBA simulations. "When re-sampling is not possible" the same sampling can be used but in this case, because of the dependence of $D$ on the chosen alternative, the expansion factors depend on the model parameters and the model must be estimated iteratively. This iterative procedure works well in the GBA simulations.

We note that the GBA simulations relate to residential choice and these results would have to be checked before using their approaches for other applications, such as destination choice modelling.
In particular, the benefit gained by sampling, reduced in the second approach by the iterative procedure, would have to be tested.

For practical implementation it seems that we could run the iterative procedure using an unamended version of ALOGIT. Calculation of the logsums and $G_i$, using equations (10) and (11), can be made in ALOGIT code and these can be added as constants to the utility functions.9

An interesting feature of the GBA result is that, if sampling is such that no logsums require approximation, then no correction is required. For example, if we have a mode-destination choice model with destinations above modes à l'Américaine, and we sample destinations but not modes, then there is no approximation of logsums. However, this structure could not be applied in Europe without at least testing the other structure (more common in Europe), so that estimation procedures that sample destinations when they are not at the top of the tree are needed in any case.

8 I am grateful to Angelo Guevara for clarifying these points for me.
9 Note that in ALOGIT it is quite efficient to use the AVAIL array to implement the sampling of alternatives. Non-available alternatives are not considered during the estimation (not written to F11) so that the set $D$ can be constructed simply by marking as unavailable all of the alternatives in $C \setminus D$.

4. Conclusion and Recommendation

Sampling alternatives introduces a number of issues into the modelling. A sampling procedure has to be specified, implemented and tested. Errors are introduced into the modelling as a result of the sampling and these errors are not easily quantified.

For the TT3 project, it is essential that we use tree-nested logit models. Sampling procedures for models more complex than MNL have not been resolved entirely and the literature is still being extended.

Efficient software can work with quite large numbers of alternatives.
Therefore it is recommended that the TT3 project should not sample alternatives for model estimation.

For model application, the literature is even less developed: the methods used by Miller et al. (2007) appear efficient but have not been shown to be unbiased. For similar reasons, it is recommended that no sampling be undertaken for model application.

References

Ben-Akiva, M. and Lerman, S. (1985), Discrete Choice Analysis: theory and application to travel demand, MIT Press, see pp. 261-269 (Estimation of choice models with a sample of alternatives).

Bierlaire, M., Bolduc, D. and McFadden, D. (2008), The estimation of generalized extreme value models from choice-based samples, Trans. Res. B, 42, pp. 381-394.

Frejinger, E., Bierlaire, M. and Ben-Akiva, M. (2009), Sampling of alternatives for route choice modelling, Trans. Res. B, 43, pp. 984-994.

Guevara, C. and Ben-Akiva, M. (2010), Sampling of alternatives in multivariate extreme value (MEV) models, WCTR, Lisbon.

Guevara (2010), Ph.D., Massachusetts Institute of Technology.

Hammersley, J. and Handscomb, D. (1964), Monte Carlo Methods, Chapman and Hall, pp. 57-59 (Importance Sampling).

Koppelman and Garrow (2005), Efficiently estimating nested logit models with choice-based samples: Example applications, Transportation Research Record 1921, pp. 63-69.

Lee, B.H. and Waddell, P. (2010), Residential mobility and location choice: a nested logit model with sampling of alternatives, Transportation, 37, pp. 587-601.

McFadden, D.L. (1978), Modelling the choice of residential location, in Karlqvist, A., Lundqvist, L., Snickars, F. and Weibull, J., Spatial interaction theory and residential location, North-Holland, pp. 75-96.

Miller, S., Daly, A., Fox, J. and Kohli, S. (2007), Destination sampling in forecasting: application in the PRISM model for the UK West Midlands Region, presented to European Transport Conference, Noordwijkerhout.

Nerella, S. and Bhat, C.
(2004), A numerical analysis of the effect of sampling of alternatives in discrete choice models, TRB.

Appendix A Details of simulation

Independent and Replacement Sampling

These simulations are based on 10000 draws of sets of destinations from a total set of 100. Destinations are located at a 'distance' from the origin of $d_j = 10\sqrt{j}$, with $j$ the destination number, the square root function giving a representation of roughly uniform distribution of destinations in space.

Each destination is assigned a utility function of $V_j = \beta d_j + \gamma \delta_1$, where $\delta_1$ indicates an 'intrazonal' trip, i.e. with destination in zone 1; $\beta$, $\gamma$ are the assumed parameters of the model. In these models we set $\gamma = 2$ and $\beta$ is set to scale the impact of distance. In effect the variation of $\beta$ models the variation of the size of the study area relative to trip length, with a smaller (more negative) $\beta$ indicating a larger study area.

Independent sampling is then undertaken with $q_j = f \exp V_j / \sum_k \exp V_k$ and $f$ set to achieve a roughly uniform sample size.

Replacement sampling is then undertaken with $\tilde{J}$ draws, with $q_j = \exp V_j / \sum_k \exp V_k$ and $\tilde{J}$ set to achieve a roughly uniform sample size.

The settings of $\beta$, $f$ and $\tilde{J}$ and the sample sizes achieved are shown in the table below.

Table 1: Specification of sampling parameters

              Independent sampling         Replacement sampling
  $\beta$     $f$     average sample       $\tilde{J}$   average sample
  -0.03       20      16.05                23            16.25
  -0.05       26      16.38                32            16.09
  -0.07       37      16.15                50            16.17
  -0.09       60      16.18                85            16.36

The key results, i.e. the proportion of 'weight' included in the sample, relative to the proportion of destinations, and the variation of the sample size, are given in the graph in the main text.

Sampling without replacement was done with 16 samples each time.
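The set-up described in this appendix can be reproduced approximately with a short script. The following Python sketch is an illustration only and is not the code used for the note: the function names and the random seed are hypothetical, and capping $q_j$ at 1 for independent sampling is an assumption of the sketch (the appendix does not say how values of $f \exp V_j / \sum_k \exp V_k$ above 1 were handled).

```python
import math
import random


def utilities(beta, gamma=2.0, n_dest=100):
    """V_j = beta * d_j + gamma * delta_1, with d_j = 10 * sqrt(j), j = 1..n_dest."""
    return [
        beta * 10.0 * math.sqrt(j) + (gamma if j == 1 else 0.0)
        for j in range(1, n_dest + 1)
    ]


def choice_shares(V):
    """exp(V_j) / sum_k exp(V_k) for every destination."""
    weights = [math.exp(v) for v in V]
    total = sum(weights)
    return [wt / total for wt in weights]


def independent_probabilities(V, f):
    """q_j = f * share_j, capped at 1 (the cap is an assumption of this sketch)."""
    return [min(1.0, f * s) for s in choice_shares(V)]


def mean_and_cv(sizes):
    """Mean and coefficient of variation of a list of set sizes."""
    mean = sum(sizes) / len(sizes)
    variance = sum((s - mean) ** 2 for s in sizes) / (len(sizes) - 1)
    return mean, math.sqrt(variance) / mean


def simulate_independent(q, n_sets, rng):
    """Set sizes when each destination is kept independently with probability q_j."""
    return [sum(1 for qj in q if rng.random() < qj) for _ in range(n_sets)]


def simulate_replacement(shares, n_tries, n_sets, rng):
    """Set sizes for n_tries draws with replacement, duplicates removed."""
    destinations = range(len(shares))
    return [
        len(set(rng.choices(destinations, weights=shares, k=n_tries)))
        for _ in range(n_sets)
    ]


rng = random.Random(79)  # arbitrary fixed seed for reproducibility
V = utilities(beta=-0.03)

q = independent_probabilities(V, f=20)
expected_size = sum(q)  # analytic expected set size for independent sampling
mean_ind, cv_ind = mean_and_cv(simulate_independent(q, 10000, rng))

mean_rep, cv_rep = mean_and_cv(
    simulate_replacement(choice_shares(V), n_tries=23, n_sets=10000, rng=rng)
)
```

Under these assumptions the analytic expected set size for $\beta = -0.03$ and $f = 20$ comes out close to the 16.05 reported in Table 1; the simulated means and coefficients of variation depend on the random seed.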