Bias and Variance Reduction in Assessing Solution Quality for Stochastic Programs

by Rebecca Stockbridge

A Dissertation Submitted to the Faculty of the Graduate Interdisciplinary Program in Applied Mathematics
In Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy
In the Graduate College
The University of Arizona
2013

THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Rebecca Stockbridge entitled Bias and Variance Reduction in Assessing Solution Quality for Stochastic Programs and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Date: July 2, 2013    Güzin Bayraksan
Date: July 2, 2013    Young-Jun Son
Date: July 2, 2013    Joseph Watkins

Final approval and acceptance of this dissertation is contingent upon the candidate's submission of the final copies of the dissertation to the Graduate College. I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Date: July 2, 2013    Güzin Bayraksan

Statement by Author

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

Signed: Rebecca Carnegie Neal Stockbridge

Acknowledgments

First, I would like to thank my teachers at Lexington Montessori School and the Montessori Middle School of Kentucky for creating wonderful communities focused on both academic and personal growth. My experiences as an independent learner from the beginning provided the foundation for all my subsequent academic achievements.

Second, I would like to acknowledge Dr. Michael Tabor of the Interdisciplinary Program in Applied Mathematics at the University of Arizona. As a result of Dr. Tabor's unceasing efforts, the Program offers wide-ranging opportunities to develop in every aspect of research, teaching, and service. In addition, Stacey LaBorde and Anne Keyl have graciously assisted with all administrative matters.

Third, an allocation of computer time from the UA Research Computing High Performance Computing (HPC) and High Throughput Computing (HTC) at the University of Arizona is gratefully acknowledged.

Finally, and most importantly, I would like to thank my advisor, Dr. Güzin Bayraksan, for her support during the last four years. She has taught me a great deal about the art of successfully applying mathematical techniques to interdisciplinary problems, while mentoring and encouraging me every step of the way.

Dedication

This dissertation is dedicated to my parents, Richard and Judith, for cultivating a love of inquiry from my earliest years; my brother, David, for his support and sense of humor; and my husband, Stuart, for his companionship throughout our mathematical adventures spanning nine years and two continents.
Table of Contents

List of Figures ... 9
List of Tables ... 10
Abstract ... 11

Chapter 1. Introduction ... 12
  1.1. Motivation ... 14
  1.2. Contributions ... 16
  1.3. Definitions and Abbreviations ... 18
  1.4. Dissertation Organization ... 19

Chapter 2. Overview of Algorithms for Assessing Solution Quality ... 21
  2.1. Problem Class ... 22
  2.2. Optimality Gap Estimation ... 23
  2.3. Multiple Replications Procedure ... 25
  2.4. Single Replication Procedure ... 26
  2.5. Averaged Two-Replication Procedure ... 28
  2.6. Summary and Concluding Remarks ... 29

Chapter 3. Averaged Two-Replication Procedure with Bias Reduction ... 30
  3.1. Literature Review ... 31
  3.2. Problem Class ... 32
  3.3. Background ... 34
    3.3.1. A Redefinition of the Averaged Two-Replication Procedure ... 34
    3.3.2. A Stability Result ... 35
  3.4. Bias Reduction via Probability Metrics ... 37
    3.4.1. Motivation for Bias Reduction Technique ... 37
    3.4.2. Averaged Two-Replication Procedure with Bias Reduction ... 40
  3.5. Illustration: Newsvendor Problem ... 42
  3.6. Theoretical Properties ... 45
    3.6.1. Weak Convergence of Empirical Measures ... 45
    3.6.2. Consistency of Point Estimators ... 47
    3.6.3. Asymptotic Validity of the Interval Estimator ... 50
  3.7. Computational Experiments ... 50
    3.7.1. Test Problems ... 51
    3.7.2. Experimental Setup ... 53
    3.7.3. Results of Experiments on NV, PGP2, APL1P, and GBD ... 55
    3.7.4. Computation Time of Bias Reduction ... 64
    3.7.5. Effect of Multiple Optimal Solutions ... 65
    3.7.6. Discussion ... 66
  3.8. Summary and Concluding Remarks ... 68

Chapter 4. Assessing Solution Quality with Variance Reduction ... 70
  4.1. Overview of Antithetic Variates and Latin Hypercube Sampling ... 71
    4.1.1. Antithetic Variates ... 72
    4.1.2. Latin Hypercube Sampling ... 74
  4.2. Multiple Replications Procedure with Variance Reduction ... 77
    4.2.1. Antithetic Variates ... 77
    4.2.2. Latin Hypercube Sampling ... 79
  4.3. Single Replication Procedure with Variance Reduction ... 80
    4.3.1. Antithetic Variates ... 80
    4.3.2. Latin Hypercube Sampling ... 82
  4.4. Averaged Two-Replication Procedure with Variance Reduction ... 85
    4.4.1. Antithetic Variates ... 86
    4.4.2. Latin Hypercube Sampling ... 88
    4.4.3. Antithetic Variates with Bias Reduction ... 91
    4.4.4. Latin Hypercube Sampling with Bias Reduction ... 94
  4.5. Summary of Key Differences in Algorithms ... 98
  4.6. Computational Experiments ... 99
    4.6.1. Test Problems ... 100
    4.6.2. Experimental Setup ... 101
    4.6.3. Results of Experiments ... 102
    4.6.4. Discussion ... 118
  4.7. Summary and Concluding Remarks ... 119

Chapter 5. Sequential Sampling with Variance Reduction ... 121
  5.1. Literature Review ... 122
  5.2. Overview of a Sequential Sampling Procedure ... 124
  5.3. Sequential Sampling Procedure with Variance Reduction ... 130
    5.3.1. Antithetic Variates ... 130
    5.3.2. Latin Hypercube Sampling ... 132
  5.4. Computational Experiments ... 137
    5.4.1. Experimental Setup ... 138
    5.4.2. Results of Experiments ... 141
  5.5. Summary and Concluding Remarks ... 144

Chapter 6. Conclusions ... 145
  6.1. Summary of Contributions ... 145
  6.2. Future Research ... 146

References ... 149

List of Figures

Figure 3.1. Percentage reductions between A2RP and A2RP-B for optimal candidate solutions ... 56
Figure 3.2. Percentage reductions between A2RP and A2RP-B for suboptimal candidate solutions ... 61
Figure 4.1. Percentage reductions in bias between MRP and MRP AV and MRP and MRP LHS for suboptimal candidate solutions ... 106
Figure 4.2. Percentage reductions in variance between MRP and MRP AV and MRP and MRP LHS ... 107
Figure 4.3. Percentage reductions in MSE between MRP and MRP AV and MRP and MRP LHS ... 108
Figure 4.4. Percentage reductions in CI width between MRP and MRP AV and MRP and MRP LHS ... 109
Figure 4.5. Percentage reductions in SV between SRP and SRP AV and SRP and SRP LHS ... 111
Figure 4.6. Percentage reductions in CI width between SRP and SRP AV and SRP and SRP LHS ... 112
Figure 4.7. Percentage reductions between A2RP and A2RP-B for large problems at suboptimal candidate solutions ... 113
Figure 4.8. Percentage reductions in MSE between A2RP and A2RP AV-B and A2RP and A2RP LHS-B ... 115
Figure 4.9. Percentage reductions in SV between A2RP and A2RP AV-B and A2RP and A2RP LHS-B ... 116
Figure 4.10. Percentage reductions in CI width between A2RP and A2RP AV-B and A2RP and A2RP LHS-B ... 117

List of Tables

Table 3.1. Small test problem characteristics ... 51
Table 3.2. Small test problem candidate solutions ... 52
Table 3.3. Bias for optimal candidate solutions ... 58
Table 3.4. MSE for optimal candidate solutions ... 59
Table 3.5. CI estimator for optimal candidate solutions ... 59
Table 3.6. Bias for suboptimal candidate solutions ... 62
Table 3.7. MSE for suboptimal candidate solutions ... 63
Table 3.8. CI estimator for suboptimal candidate solutions ... 63
Table 3.9. A2RP-B computational time for SSN ... 64
Table 3.10. Multiple optimal solutions for an optimal candidate solution ... 65
Table 3.11. Multiple optimal solutions for a suboptimal candidate solution ... 66
Table 4.1. Key differences in algorithms ... 99
Table 4.2. Large test problem characteristics ... 100
Table 4.3. Large test problem candidate solutions ... 101
Table 4.4. Percentage reduction in variance between MRP and MRP AV for PGP2 ... 105
Table 4.5. MRP coverage for suboptimal candidate solutions ... 105
Table 4.6. SRP coverage for suboptimal candidate solutions ... 110
Table 4.7. A2RP coverage for suboptimal candidate solutions ... 114
Table 4.8. Computational time for IID, AV, and LHS ... 118
Table 5.1. Parameters for sequential sampling with IID and LHS ... 139
Table 5.2. Parameters for sequential sampling with AV ... 140
Table 5.3. Results for sequential procedures using IID, AV, and LHS ... 143

Abstract

Stochastic programming combines ideas from deterministic optimization with probability and statistics to produce more accurate models of optimization problems involving uncertainty. However, due to their size, stochastic programming problems can be extremely difficult to solve and instead approximate solutions are used. Therefore, there is a need for methods that can accurately identify optimal or near optimal solutions. In this dissertation, we focus on improving Monte Carlo sampling-based methods that assess the quality of potential solutions to stochastic programs by estimating optimality gaps. In particular, we aim to reduce the bias and/or variance of these estimators. We first propose a technique to reduce the bias of optimality gap estimators which is based on probability metrics and stability results in stochastic programming. This method, which requires the solution of a minimum-weight perfect matching problem, can be run in polynomial time in sample size. We establish asymptotic properties and present computational results.
We then investigate the use of sampling schemes to reduce the variance of optimality gap estimators, and in particular focus on antithetic variates and Latin hypercube sampling. We also combine these methods with the bias reduction technique discussed above. Asymptotic properties of the resultant estimators are presented, and computational results on a range of test problems are discussed. Finally, we apply methods of assessing solution quality using antithetic variates and Latin hypercube sampling to a sequential sampling procedure to solve stochastic programs. In this setting, we use Latin hypercube sampling when generating a sequence of candidate solutions that is input to the procedure. We prove that these procedures produce a high-quality solution with high probability, asymptotically, and terminate in a finite number of iterations. Computational results are presented.

Chapter 1. Introduction

Deterministic mathematical programming aims to optimize functions under a set of constraints with known parameters. However, the real world is not entirely deterministic; in many cases, the parameters that go into the objective function and constraints, such as costs, demands, etc., may not be completely known. Stochastic programs take this uncertainty into account by including random vectors and other probabilistic statements in the problem descriptions.

A standard way to incorporate uncertainty into optimization problems is to represent unknown parameters by random variables. It is then natural to include probabilistic quantities such as expectations and probabilities in the model. It is usually assumed that the probability distributions of the random variables are known. In the case of real-world situations, such distributions can be constructed, for example, via statistical analysis. If a probability distribution is assumed present but unknown, one can consider a range of possible probability distributions in the analysis of the problem.

In this dissertation, we consider a stochastic optimization problem of the form
$$\min_{x \in X} Ef(x,\tilde{\xi}) = \min_{x \in X} \int_{\Xi} f(x,\xi)\, P(d\xi), \tag{SP}$$
where $X \subseteq \mathbb{R}^{d_x}$ represents the set of constraints the decision vector $x$ of dimension $d_x$ must satisfy. The random vector $\tilde{\xi}$ on $(\Xi, \mathcal{B}, P)$ is of dimension $d_\xi$ and has support $\Xi \subseteq \mathbb{R}^{d_\xi}$ and distribution $P$ that does not depend on $x$, where $\mathcal{B}$ is the Borel $\sigma$-algebra on $\Xi$. The function $f: X \times \Xi \to \mathbb{R}$ is assumed to be a Borel measurable, real-valued function, with inputs being the decision vector $x$ and a realization $\xi$ of the random vector $\tilde{\xi}$. Throughout the dissertation, we use $\tilde{\xi}$ to denote the random vector and $\xi$ to denote its realization. The expectation operator $E$ is taken with respect to $P$. We use $z^*$ to denote the optimal objective function value of (SP) and $x^*$ to denote an optimal solution to (SP). The set of optimal solutions is given by $X^* = \arg\min_{x \in X} Ef(x,\tilde{\xi})$.

We will primarily focus on two-stage stochastic linear programs with recourse, a class of (SP) first introduced by Beale (1955) and Dantzig (1955), where $f(x,\tilde{\xi}) = cx + h(x,\tilde{\xi})$ and $h(x,\tilde{\xi})$ is the optimal value of the minimization problem
$$\min_{y}\{\tilde{q}y : \tilde{W}y = \tilde{R} - \tilde{T}x,\; y \geq 0\}.$$
In this case, $X = \{x \in \mathbb{R}^{d_x} : Ax = b,\; x \geq 0\}$ and $\tilde{\xi} = (\tilde{q}, \tilde{W}, \tilde{R}, \tilde{T})$. The following terminology is often used. If $\tilde{W} = W$, i.e., $W$ is non-random, then the problem is said to have fixed recourse. Otherwise, the problem has random recourse. If $\tilde{q} = q$, then the problem has stochasticity only on the right-hand side.
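To make the second-stage (recourse) problem concrete, the following minimal Python sketch evaluates $h(x,\xi)$ and $f(x,\xi) = cx + h(x,\xi)$ for a tiny fixed-recourse instance by solving the second-stage linear program with scipy.optimize.linprog. The data ($c$, $q$, $W$, and the affine maps $R(\xi)$ and $T(\xi)$) are illustrative assumptions made for this sketch only, not one of the test problems used in this dissertation.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative fixed-recourse data (assumed for this sketch only).
q = np.array([1.0, 2.0])             # second-stage cost vector
W = np.array([[1.0, -1.0]])          # fixed recourse matrix (1 constraint, 2 variables)

def R(xi):
    # Right-hand side depending affinely on the realization xi.
    return np.array([3.0 + xi])

def T(xi):
    # Technology matrix depending affinely on xi (1 constraint, 1 first-stage variable).
    return np.array([[0.5 * xi]])

def h(x, xi):
    """Recourse value h(x, xi) = min { q y : W y = R(xi) - T(xi) x, y >= 0 }."""
    res = linprog(c=q, A_eq=W, b_eq=R(xi) - T(xi) @ x,
                  bounds=[(0, None)] * len(q), method="highs")
    if not res.success:
        raise ValueError("second-stage problem infeasible or unbounded")
    return res.fun

# Evaluate f(x, xi) = c x + h(x, xi) for one first-stage decision and one realization.
c = np.array([0.8])
x = np.array([2.0])
xi = 1.3
print("f(x, xi) =", float(c @ x + h(x, xi)))
```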
We will list assumptions on this class of problems required by our theoretical results in subsequent chapters.

Two-stage stochastic programs can be understood in the following way. In the first stage of the problem, a decision $x$ is made knowing only the distribution of the random vector $\tilde{\xi}$. Then, a realization $\xi$ of $\tilde{\xi}$ occurs, and a corrective recourse decision $y$ is made in the second stage of the problem. Whereas $x$ cannot depend on $\xi$, $y$ explicitly depends on the random outcome, but this dependence is suppressed above. This class of problems can be extended to multi-stage decision problems over a finite number of time periods using conditional expectations. It can also be reformulated to include integer constraints and nonlinear terms in the objective function and constraints. Two-stage stochastic programs have been used in a wide variety of applications, including fleet management, production planning, risk management, energy generation, and telecommunications; see (Wallace & Ziemba, 2005) for example applications. We now present two examples of two-stage stochastic linear programs with fixed recourse which appear in our computational experiments.

Example 1.1 (Aircraft Allocation). An airline wishes to allocate several types of aircraft to different routes in order to maximize expected profit. The customer demands for each route are modeled as independent random variables with known distributions, and the fixed operating costs vary according to the aircraft type and route. First, the airline determines the number of each aircraft type to assign to each route. Once the customer demand has been realized, the airline can choose the number of bumped passengers (incurring a cost per passenger) and the number of empty seats on each flight. This model, referred to as GBD, was first introduced by Ferguson & Dantzig (1956). We will consider a modification of this problem in later chapters.

Example 1.2 (Telecommunications Network Planning). A telecommunications network consists of a set of nodes connected by links. The number of calls between each point-to-point pair of nodes is treated as a random variable. When service is requested between two nodes, a route of links between the nodes with sufficient capacity is identified. If no route has enough capacity, the request for service goes unmet. In the first stage, the amount of capacity to add to each link is chosen. In the second stage, the number of calls that use each possible route and the number of unserved calls are determined for each point-to-point pair of nodes. The goal is to minimize the expected number of unserved calls. A mathematical formulation of this problem, SSN, can be found in (Sen et al., 1994).

1.1 Motivation

Even though stochastic programs can yield more realistic models compared to deterministic mathematical programs, they can also be extremely difficult, and perhaps impossible, to solve. For instance, if the random vector $\tilde{\xi}$ is discrete, then the size of the stochastic program will grow exponentially with the dimension of the random vector, and so the stochastic program suffers from a curse of dimensionality. If instead $\tilde{\xi}$ is continuous, the stochastic program will in general be intractable unless $f$ has a special structure. Note that in many cases, the expectation in (SP) is a multidimensional integral of a complicated function and cannot be calculated exactly even for a fixed $x \in X$. Optimizing $Ef(x,\tilde{\xi})$ over $X$ brings additional challenges.
Assessing the quality of a potential solution is critically important since many real-world problems cast as (SP)—such as two-stage stochastic linear programs with recourse, which are the primary focus of this dissertation—cannot be solved exactly, and one is often left with an approximate solution $\hat{x} \in X$ without verification of its quality. Quality assessment is also fundamental within optimization algorithms, which use it iteratively; e.g., every time a new candidate solution is generated, these algorithms need to identify an optimal or nearly optimal solution in order to stop.

Specifically, given a candidate (feasible) solution $\hat{x} \in X$ to (SP), we would like to determine whether it is optimal or nearly optimal. This can be done by calculating the optimality gap, denoted $G_{\hat{x}}$, where $G_{\hat{x}} = Ef(\hat{x},\tilde{\xi}) - z^*$. A high-quality candidate solution will have a small optimality gap, and an optimal solution will have an optimality gap of zero. Unfortunately, since the optimal value $z^*$ is unknown, the optimality gap cannot be computed explicitly. In addition, as mentioned earlier, it may not be possible to evaluate $Ef(\hat{x},\tilde{\xi})$ exactly. Given a candidate solution $\hat{x}$ and sample size $n$, Monte Carlo sampling can be used to create statistical estimators of the optimality gap $G_{\hat{x}}$ (Bayraksan & Morton, 2006; Mak et al., 1999; Norkin et al., 1998), which we will revisit in Chapter 2.

When a statistical estimator of $G_{\hat{x}}$ turns out to be large, this can be because (i) the bias is large, (ii) the variance or sampling error is large, or (iii) $G_{\hat{x}}$ itself is large. The third case simply means that $\hat{x}$ is a low-quality solution. However, even if $\hat{x}$ is a high-quality solution, (i) and (ii) could result in an estimator that is significantly misleading. It is well known that Monte Carlo statistical estimators of optimality gaps are biased (Mak et al., 1999; Norkin et al., 1998); that is, on average, they tend to over-estimate $G_{\hat{x}}$ for a finite sample size $n$. Therefore, for some problems, bias could be a major issue, whereas for other problems, variance could be the dominant factor. In either situation, estimates of $G_{\hat{x}}$ can be large even if we have a high-quality solution. Additionally, high variance or sampling error can lead to estimators that under-report the optimality gap and thus indicate that a candidate solution is of higher quality than warranted. Each of these cases significantly diminishes our ability to identify an optimal or nearly optimal solution.

While the current literature provides Monte Carlo sampling-based methods to estimate $G_{\hat{x}}$, these methods could yield unreliable estimators when bias or variance is large. This dissertation aims to improve the current procedures by reducing bias and variance, yielding estimators that are more reliable than the current state-of-the-art in assessing solution quality in stochastic programs. In particular, we present a method to reduce bias via strategic partitioning of samples. We then investigate the use of variance reduction schemes in optimality gap estimation. Finally, we study these methods in a sequential sampling context.

1.2 Contributions

The primary goal of this dissertation is to identify high-quality solutions to (SP) by improving the reliability of sampling-based estimators of optimality gaps.
The methods developed in this dissertation to achieve this goal could be used to assess the quality of a given solution (found in any way) with a fixed sample size, or within a sequential Monte Carlo sampling-based method to identify high-quality solutions to (SP) with increasing sample sizes.

We note that extensive work has been done in the statistics and simulation literature regarding bias and variance reduction methods. We review some of these methods in the context of stochastic programming in Chapters 3 and 4 (see Sections 3.3 and 4.1). Thus far, attention has been focused on the estimator $z_n^*$ of the true optimal value of (SP) or on $Ef(x,\tilde{\xi})$ for a fixed $x$. The work presented in this dissertation addresses the need to improve estimators of optimality gaps of proposed solutions to (SP). The main contributions of this dissertation are as follows:

C1. We develop a technique to reduce the bias of optimality gap estimators motivated by stability results in stochastic programming. Stability results quantify changes in optimal values and optimal solution sets under distributional perturbations. The random sample is partitioned via a minimum-weight perfect matching problem in an effort to reduce bias. The observations within each group are no longer independent and identically distributed, complicating further analysis. This technical difficulty is overcome with a weak convergence argument, and additional asymptotic properties are presented.

C2. We embed two well-known variance reduction schemes, antithetic variates and Latin hypercube sampling, in algorithms that produce optimality gap estimators. We also blend these sampling schemes with the bias reduction technique outlined in contribution C1. Asymptotic properties of the resulting estimators are discussed and extensive computational experiments for a range of test problems are given. Based on our theory and computational experiments, we provide guidelines on effective and efficient means of assessing solution quality.

C3. Finally, we apply a selected subset of the methods developed above to a sequential sampling procedure that aims to approximately solve (SP) via a sequence of generated candidate solutions with increasing sample size. We present extensions to the theory and provide computational experiments.

1.3 Definitions and Abbreviations

In this section, we introduce commonly used statistical terms and list general assumptions on (SP) required in this dissertation. We also provide a list of abbreviations commonly used throughout this dissertation.

We start with a framework to model Monte Carlo sampling. The results presented in this dissertation require precise probabilistic modeling of the Monte Carlo sampling performed. In particular, the expectations and the almost sure statements are made with respect to the underlying product measure. An overview of this framework is as follows. Let $(\Omega, \mathcal{A}, \hat{P})$ be the space formed by the product of a countable sequence of identical probability spaces $(\Xi_i, \mathcal{B}_i, P_i)$, where $\Xi_i = \Xi$, $\mathcal{B}_i = \mathcal{B}$, and $P_i = P$, for $i = 1, 2, \ldots$, and let $\xi^i$ denote an outcome in the sample space $\Xi_i$. An outcome $\omega \in \Omega$ then has the form $\omega = (\xi^1, \xi^2, \ldots)$. Now, define the countable sequence of projection random variables $\{\tilde{\xi}^i : \Omega \to \Xi,\ i = 1, 2, \ldots\}$ by $\tilde{\xi}^i(\omega) = \xi^i$. Then, the collection $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ is a random sample from $(\Xi, \mathcal{B}, P)$, and $\tilde{\xi} := \tilde{\xi}^1$ is a random variable with distribution $P$.
We will study both point and interval estimators, which are functions of a random sample that calculate an estimate of an unknown parameter. A point estimator computes a single value, whereas an interval estimator provides a range of values. A point estimator is strongly consistent if it converges to the true value almost surely, as opposed to in probability. We refer to such an estimator simply as consistent from now on. To understand the behavior of an interval estimator, we consider the probability that it contains the parameter of interest, referred to as its coverage probability, or simply coverage. A high-quality interval estimator will have a narrow width but also a high coverage probability. Note, however, that a high coverage probability can be obtained by increasing the interval estimator's width. An interval estimator which is constructed to have a coverage probability that (asymptotically) equals or exceeds a certain value (the confidence level) is called a confidence interval estimator. Such an estimator is said to be (asymptotically) valid.

We now provide a list of abbreviations used in this dissertation. Three algorithms considered in this dissertation, the Multiple Replications Procedure, the Single Replication Procedure, and the Averaged Two-Replication Procedure, are abbreviated as MRP, SRP, and A2RP, respectively. We use AV to denote antithetic variates and LHS to denote Latin hypercube sampling. "Strong law of large numbers" is shortened to SLLN and "central limit theorem" to CLT. "Confidence interval" is abbreviated as CI and "sample variance" is abbreviated as SV. Finally, the phrases "independent and identically distributed" and "almost surely" are shortened to i.i.d. and a.s., respectively. The phrase "i.i.d. sampling" is abbreviated to IID in some tables.

1.4 Dissertation Organization

The rest of this dissertation is organized in the following way: In Chapter 2, we discuss the use of Monte Carlo simulation for assessing solution quality via optimality gap estimation. We review the three algorithms from the literature, MRP, SRP, and A2RP, that are used to estimate optimality gaps. We review additional relevant literature on bias reduction in Chapter 3, variance reduction in Chapter 4, and a sequential sampling method in Chapter 5.

Chapter 3 presents a technique for reducing the bias of the A2RP optimality gap estimators for two-stage stochastic linear programs with recourse via a probability metrics approach, motivated by stability results in stochastic programming. We start with a review of relevant literature and a discussion of the problem class. We then discuss the motivation behind the bias reduction technique and formally define the resulting algorithm. We provide conditions under which asymptotic results on optimality gap estimators hold and present computational experiments to provide insights into the effectiveness of the bias reduction technique. The material presented in this chapter can be found in (Stockbridge & Bayraksan, 2012). This is the main contribution C1 in this dissertation.

In Chapter 4, we embed sampling-based variance reduction techniques from the literature in the pre-existing algorithms MRP, SRP, and A2RP. We particularly focus on AV and LHS, and an overview of these techniques and their use in the stochastic programming literature begins the chapter. We then discuss the theoretical implications of each combination of algorithm for optimality gap estimation and variance reduction scheme.
This includes the use of AV and LHS within MRP, SRP, and A2RP. In addition, we also blend our bias reduction procedure detailed in Chapter 3 with LHS and AV. Computational results are provided and discussed for each case. This is the main contribution C2 in this dissertation.

Chapter 5 applies the ideas of the previous chapters to a sequential sampling setting, where the aim is to obtain high-quality solutions to (SP) with a desired probability. First, a sequential sampling procedure from the literature is reviewed. Then, a subset of variance reduction techniques are applied both when generating candidate solutions and when assessing their quality via SRP. Asymptotic properties of the resultant procedures are established and their empirical performance is analyzed. This is the main contribution C3 in this dissertation.

We conclude the dissertation in Chapter 6 with a summary of contributions and a discussion of future research directions.

Chapter 2. Overview of Algorithms for Assessing Solution Quality

In this chapter, we give an overview of Monte Carlo sampling-based techniques from the literature for assessing the quality of a candidate solution. In particular, we review MRP, developed by Mak et al. (1999), and SRP and A2RP of Bayraksan & Morton (2006). These procedures, while differing in the details of their implementation, each produce point and interval estimators measuring solution quality. A discussion weighing the costs and benefits of each algorithm along with guidelines for use can be found in (Bayraksan & Morton, 2006).

Recall that we define the quality of a solution $\hat{x} \in X$ to be its optimality gap, denoted $G_{\hat{x}}$, where $G_{\hat{x}} = Ef(\hat{x},\tilde{\xi}) - z^*$. The optimality gap $G_{\hat{x}}$ cannot be evaluated explicitly, particularly as the optimal value $z^*$ is not known. Furthermore, exact evaluation of $Ef(\hat{x},\tilde{\xi})$ may not be possible, as the expectation is typically a high-dimensional integral of a complicated function. Monte Carlo sampling-based methods bypass these difficulties by allowing us to form statistical estimators of optimality gaps. These work in the following way. They take as

Input: A candidate solution $\hat{x} \in X$, a desired value of $\alpha \in (0,1)$, a sample size $n$, a method to generate the sample, and a method to solve a sampling approximation of (SP),

and they produce

Output: A point estimator (e.g., $G_n(\hat{x})$), its associated variance estimator (e.g., $s_n^2$), and a $(1-\alpha)$-level approximate confidence interval estimator of $G_{\hat{x}}$ (e.g., $[0, G_n(\hat{x}) + \epsilon_{n,\alpha}]$, where $\epsilon_{n,\alpha}$ is the sampling error that typically uses $s_n^2$).

They are easy to implement, provided a sampling approximation of (SP) with moderate sample sizes can be solved, and they can be used in conjunction with any specialized solution procedure to solve the sampling approximations of the underlying problem. For example, Bayraksan & Morton (2011) and Bayraksan & Pierre-Louis (2012) develop sequential sampling stopping rules that make use of optimality gap estimators to obtain high-quality solutions to (SP). These methods to estimate optimality gaps have been successfully applied to problems in finance (Bertocchi et al., 2000), stochastic vehicle routing (Kenyon & Morton, 2003; Verweij et al., 2003), supply chain network design (Santoso et al., 2005), and scheduling under uncertainty (Keller & Bayraksan, 2010).

This chapter is organized as follows. In Section 2.1, we list and discuss necessary assumptions. We provide background on optimality gap estimation in Section 2.2.
We then present the procedures MRP, SRP, and A2RP in Sections 2.3, 2.4, and 2.5, respectively, and close with a summary in Section 2.6.

2.1 Problem Class

The main assumptions we impose on (SP) in this chapter are as follows:

(A1) $f(\cdot,\tilde{\xi})$ is continuous in $x$, a.s.,
(A2) $E \sup_{x \in X} f^2(x,\tilde{\xi}) < \infty$,
(A3) $X \neq \emptyset$ and is compact.

Assumption (A1) holds for two-stage stochastic linear programs with relatively complete recourse; i.e., a feasible second-stage decision exists for every feasible first-stage decision, a.s. Assumption (A2) ensures the existence of first and second moments and provides a uniform integrability condition. Assumption (A3) requires that the problem be feasible and the set of feasible solutions be closed and bounded. This last condition is not usually overly restrictive for practical problems.

2.2 Optimality Gap Estimation

Before defining MRP, SRP, and A2RP, we first discuss how to compute point estimators of the optimality gap. Since $G_{\hat{x}}$ usually cannot be evaluated explicitly, we use Monte Carlo sampling to provide an approximation of (SP) and exploit the properties of this approximation to estimate the optimality gap. We first approximate $P$, using the observations from a random sample $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ described in Section 1.3, by the empirical distribution $P_n(\cdot) = \frac{1}{n}\sum_{i=1}^n \delta_{\tilde{\xi}^i}(\cdot)$, where $\delta_{\tilde{\xi}^i}$ denotes the point mass at $\tilde{\xi}^i$. The use of $(\cdot)$ indicates that $P_n$ is a probability measure on $\Xi$. We then approximate (SP) by
$$\min_{x \in X} \frac{1}{n}\sum_{i=1}^n f(x,\tilde{\xi}^i) = \min_{x \in X} \int_{\Xi} f(x,\xi)\, P_n(d\xi). \tag{SP$_n$}$$
Let $x_n^*$ denote an optimal solution to (SP$_n$) and $z_n^*$ denote the corresponding optimal value. Asymptotic properties of $x_n^*$ and $z_n^*$ have been comprehensively studied in the literature (Attouch & Wets, 1981; Dupačová & Wets, 1988; King & Rockafellar, 1993; Shapiro, 1989). As mentioned in Section 1.3, it is most convenient throughout the dissertation to interpret expectations and almost sure statements relating to $z_n^*$ with respect to the underlying probability measure $\hat{P}$. For instance, $Ez_n^* = \int_{\Omega} z_n^*(\omega)\,\hat{P}(d\omega)$. By interchanging minimization and expectation, we have
$$Ez_n^* = E\left[\min_{x \in X} \frac{1}{n}\sum_{i=1}^n f(x,\tilde{\xi}^i)\right] \leq \min_{x \in X} E\left[\frac{1}{n}\sum_{i=1}^n f(x,\tilde{\xi}^i)\right] = \min_{x \in X} Ef(x,\tilde{\xi}) = z^*. \tag{2.1}$$
In other words, $Ez_n^*$ provides us with a lower bound on $z^*$. An upper bound on $G_{\hat{x}} = Ef(\hat{x},\tilde{\xi}) - z^*$ is then given by $Ef(\hat{x},\tilde{\xi}) - Ez_n^*$. We estimate $Ef(\hat{x},\tilde{\xi}) - Ez_n^*$ by
$$G_n(\hat{x}) = \frac{1}{n}\sum_{i=1}^n f(\hat{x},\tilde{\xi}^i) - \min_{x \in X} \frac{1}{n}\sum_{i=1}^n f(x,\tilde{\xi}^i) = \frac{1}{n}\sum_{i=1}^n f(\hat{x},\tilde{\xi}^i) - z_n^*. \tag{2.2}$$
With fixed $\hat{x} \in X$, $\frac{1}{n}\sum_{i=1}^n f(\hat{x},\tilde{\xi}^i)$ is an unbiased estimator of $Ef(\hat{x},\tilde{\xi})$ due to i.i.d. sampling. However, since $Ez_n^* - z^* \leq 0$,
$$EG_n(\hat{x}) \geq Ef(\hat{x},\tilde{\xi}) - z^*,$$
and hence $G_n(\hat{x})$ is a biased estimator of the optimality gap. We assume the same observations are used in both terms on the right-hand side in (2.2), so $G_n(\hat{x}) \geq 0$. This results in variance reduction through the use of common random variates. Consequently, compared to $z_n^*$, which has the same bias, bias can be a more prominent factor in $G_n(\hat{x})$.

It is well known that the bias decreases as the size of the random sample increases (Mak et al., 1999; Norkin et al., 1998). That is, $Ez_n^* \leq Ez_{n+1}^*$ for all $n$. However, the rate at which the bias shrinks to zero can be slow, e.g., of order $O(n^{-1/2})$; see for instance Example 4 in (Bayraksan & Morton, 2009). One way to reduce bias is to simply increase the sample size. However, significant increases in sample size are required to obtain a modest reduction in bias, and increasing the sample size is not computationally desirable since obtaining statistical estimators of optimality gaps requires solving a sampling approximation problem. The bias reduction technique presented in Chapter 3 provides a way to lessen bias without increasing the sample size. We also observe in Chapter 4 that variance reduction techniques can result in bias reduction.

We note that there are other approaches to assessing solution quality. Some of these approaches are motivated by the Karush-Kuhn-Tucker conditions; see, e.g., (Higle & Sen, 1991b; Shapiro & Homem-de-Mello, 1998). There is also work on assessing solution quality with respect to a particular sampling-based algorithm, typically utilizing the bounds obtained through the course of the algorithm; see, e.g., (Dantzig & Infanger, 1995; Higle & Sen, 1991a, 1999; Lan et al., 2012). In the rest of this chapter, we provide an overview of three procedures to estimate $G_{\hat{x}}$, MRP, SRP, and A2RP, which we then enhance to reduce bias and/or variance in later chapters.
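Before turning to the three procedures, the following short Python sketch illustrates the point estimator (2.2) on a small newsvendor instance (this problem class reappears in Section 3.5). Because the empirical objective is piecewise linear in $x$ with breakpoints at the sample points, the sampling approximation can be minimized by enumerating those breakpoints. The cost parameters, demand distribution, and candidate solution are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative newsvendor: f(x, xi) = c*x - r*min(x, xi), cost c and revenue r assumed.
c, r = 5.0, 9.0

def f(x, xi):
    return c * x - r * np.minimum(x, xi)

n = 200
xi = rng.exponential(scale=10.0, size=n)    # sampled demands (assumed distribution)
x_hat = 8.0                                 # candidate solution to be assessed

# Solve the sampling approximation (SP_n): the empirical objective is piecewise
# linear in x, so an optimal x_n* lies at 0 or at one of the sample points.
candidates = np.concatenate(([0.0], xi))
z_n = min(f(x, xi).mean() for x in candidates)

# Optimality gap point estimator G_n(x_hat) from (2.2), using the same observations
# in both terms (common random variates), so G_n(x_hat) >= 0.
G_n = f(x_hat, xi).mean() - z_n
print(f"z_n* = {z_n:.3f},  G_n(x_hat) = {G_n:.3f}")
```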
2.3 Multiple Replications Procedure

The use of minimization in the definition of the optimality gap point estimator (2.2) results in a random variable that is generally not normally distributed. Therefore the CLT cannot be applied directly to $G_n(\hat{x})$ to produce an approximate CI estimator of the optimality gap. The Multiple Replications Procedure of Mak et al. (1999) overcomes this difficulty by computing multiple instances of $G_n(\hat{x})$ using independent batches of observations and forming a CI on the mean of these instances. Let $t_{n,\alpha}$ be the $1-\alpha$ quantile of the Student's $t$ distribution with $n$ degrees of freedom. We assume a method to solve (SP$_n$) is known and, in this chapter, the sample is generated in an i.i.d. fashion. Therefore, we remove this from the input list of the procedures. MRP is as follows:

MRP

Input: A candidate solution $\hat{x} \in X$, a desired value of $\alpha \in (0,1)$, a sample size per replication $n$, and a replication size $m$.
Output: A point estimator, its associated variance estimator, and a $(1-\alpha)$-level confidence interval on $G_{\hat{x}}$.

1. For $l = 1, \ldots, m$,
   1.1. Sample i.i.d. observations $\{\tilde{\xi}_l^1, \ldots, \tilde{\xi}_l^n\}$ from $P$.
   1.2. Solve (SP$_{n,l}$) using $\{\tilde{\xi}_l^1, \ldots, \tilde{\xi}_l^n\}$ to obtain $x_{n,l}^*$ and $z_{n,l}^*$.
   1.3. Calculate
   $$G_{n,l}(\hat{x}) = \frac{1}{n}\sum_{i=1}^n \left[ f(\hat{x},\tilde{\xi}_l^i) - f(x_{n,l}^*,\tilde{\xi}_l^i) \right].$$
2. Calculate the optimality gap and sample variance estimators by:
   $$\bar{G}(m) = \frac{1}{m}\sum_{l=1}^m G_{n,l}(\hat{x}) \quad \text{and} \quad s^2(m) = \frac{1}{m-1}\sum_{l=1}^m \left( G_{n,l}(\hat{x}) - \bar{G}(m) \right)^2.$$
3. Output a one-sided confidence interval on $G_{\hat{x}}$:
   $$\left[ 0,\; \bar{G}(m) + \frac{t_{m-1,\alpha}\, s(m)}{\sqrt{m}} \right]. \tag{2.3}$$
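A minimal Python sketch of the MRP steps and the interval (2.3), reusing the assumed newsvendor instance from the previous sketch; the $t$ quantile is taken from scipy.stats. It is meant only to mirror the procedure above, not the experimental code used later in the dissertation.

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(1)
c, r = 5.0, 9.0                                   # assumed newsvendor parameters
f = lambda x, xi: c * x - r * np.minimum(x, xi)

def gap_estimate(x_hat, xi):
    """G_n(x_hat) for one batch: solve (SP_n) by checking the piecewise-linear breakpoints."""
    candidates = np.concatenate(([0.0], xi))
    z_n = min(f(x, xi).mean() for x in candidates)
    return f(x_hat, xi).mean() - z_n

def mrp(x_hat, n=100, m=30, alpha=0.10):
    demands = [rng.exponential(10.0, size=n) for _ in range(m)]
    gaps = np.array([gap_estimate(x_hat, xi) for xi in demands])       # Step 1
    G_bar, s = gaps.mean(), gaps.std(ddof=1)                           # Step 2
    half_width = student_t.ppf(1 - alpha, df=m - 1) * s / np.sqrt(m)   # Step 3
    return G_bar, s**2, (0.0, G_bar + half_width)

print(mrp(x_hat=8.0))
```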
In Chapter 4, we explore the e↵ect of alternative sampling techniques on MRP in an e↵ort to reduce variance (and bias) without increasing the number of replications. 2.4 Single Replication Procedure A general rule of thumb to ensure that the sample mean of i.i.d. random variables is approximately normal is to use a sample size of at least 30. Translating to the MRP setting, the number of replications m is usually set to least 30, which therefore requires solving at least 30 optimization problems in Step 1.2. This can be computationally burdensome. An alternate approach, referred to as the Single Replication Procedure, is presented in (Bayraksan & Morton, 2006). SRP uses a single optimality gap estimator Gn , and computes the SV of the individual observations {f (x̂, ⇠˜1 ) f (x⇤n , ⇠˜1 )}, . . . , f (x̂, ⇠˜n ) f (x⇤n , ⇠˜n )}. These two point estimators are then P combined to form a CI estimator of the optimality gap. Define f¯n (x) = n1 ni=1 f (x, ⇠˜i ) and let z↵ be the 1 is as follows: ↵ quantile of the standard normal distribution. The procedure 27 SRP Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and a sample size n. Output: A point estimator, its associated variance estimator, and a (1 ↵)-level confidence interval on Gx̂ . 1. Sample i.i.d. observations {⇠˜1 , . . . , ⇠˜n } from P . 2. Solve (SPn ) to obtain x⇤n and zn⇤ . 3. Calculate Gn (x̂) as in (2.2) and s2n = 1 n 1 n h⇣ X f (x̂, ⇠˜i ) f (x⇤n , ⇠˜i ) i=1 ⌘ 4. Output a one-sided confidence interval on Gx̂ : z ↵ sn 0, Gn (x̂) + p . n f¯n (x̂) f¯n (x⇤n ) i2 . (2.4) (2.5) Note that unlike the m independent optimality gap estimators of MRP used to calculate the SV, the observations f (x̂, ⇠˜i ) f (x⇤n , ⇠˜i ), i = 1, . . . , n, used in the SRP SV calculation are not independent. Instead, they each depend on x⇤n 2 arg minx2X Pn 1 ˜i i=1 f (x, ⇠ ). However, if (A1)–(A3) are satisfied, then the point estimators Gn (x̂) n and s2n of SRP are consistent and the interval estimator of SRP in (2.5) is asymptot- ically valid. The goal of Chapters 3 and 4 is to address situations when the bias and variance of Gn (x̂) are large or the SV estimator s2n is large, which can lead to unduly large (or small) CI estimators. SRP can significantly reduce the computation time required to estimate the optimality gap. However, for small sample sizes it can happen that x⇤n coincides with the candidate solution x̂. The optimality gap and SV estimators are then zero (e.g., when x̂ = x⇤n , Gn (x̂) in (2.2) is zero). This also results in a CI estimator of width zero, even though the candidate solution may be significantly suboptimal. We repeat Example 1 in (Bayraksan & Morton, 2006) to illustrate the concept of coinciding solutions: 28 ˜ : Example 2.1. Consider the problem {min E[⇠x] 1 x 1}, where ⇠˜ ⇠ N (µ, 1) and µ > 0. The optimal pair is (x⇤ , z ⇤ ) = ( 1, µ). We examine the candidate solution x̂ = 1, which has the largest optimality gap of 2µ. If the random sample P satisfies ⇠¯ = n1 ni=1 ⇠˜i < 0, then x⇤n = 1 coincides with x̂, and so the point and interval estimators of SRP are zero. Setting µ = 0.1, ↵ = 0.10 and n = 50, and using normal quantiles, we obtain an upper bound on the coverage of SRP as 1 P (⇠¯ < 0) ⇡ 0.760, which is considerably below the desired coverage of 0.90. The Averaged Two-Replication Procedure addresses this undesirable coverage behavior by using two replications of observations. 2.5 Averaged Two-Replication Procedure A2RP can be defined as the following modification of MRP. 
The Averaged Two-Replication Procedure addresses this undesirable coverage behavior by using two replications of observations.

2.5 Averaged Two-Replication Procedure

A2RP can be defined as the following modification of MRP. Note that rather than using the standard SV as in MRP, the A2RP SV estimator is based on the SRP SV estimator.

A2RP

Input: A candidate solution $\hat{x} \in X$, a desired value of $\alpha \in (0,1)$, and a sample size per replication $n$.
Output: A point estimator, its associated variance estimator, and a $(1-\alpha)$-level confidence interval on $G_{\hat{x}}$.

Fix $m = 2$ and replace Steps 1.3, 2, and 3 of MRP by:

1.3. Calculate $G_{n,l}(\hat{x})$ and $s_{n,l}^2$.
2. Calculate the optimality gap and sample variance estimators by taking the average:
   $$G_n'(\hat{x}) = \frac{1}{2}\left( G_{n,1}(\hat{x}) + G_{n,2}(\hat{x}) \right) \quad \text{and} \quad s_n'^2 = \frac{1}{2}\left( s_{n,1}^2 + s_{n,2}^2 \right).$$
3. Output a one-sided confidence interval on $G_{\hat{x}}$:
   $$\left[ 0,\; G_n'(\hat{x}) + \frac{z_\alpha s_n'}{\sqrt{2n}} \right]. \tag{2.6}$$

The same conditions as for SRP guarantee that the point estimator $G_n'(\hat{x})$ of A2RP is consistent and the interval estimator of A2RP in (2.6) is asymptotically valid. We now examine how A2RP can improve coverage by returning to Example 2.1:

Example 2.2. Consider the problem instance in Example 2.1. Now consider A2RP with each replication of size $n = 50$. Let $\bar{\xi}_1 = \frac{1}{n}\sum_{i=1}^n \tilde{\xi}_1^i$ be the sample mean of the first group of observations, and similarly let $\bar{\xi}_2$ be the sample mean of the second group. In this case, the probability of obtaining a CI estimator of non-zero width is given by $1 - P(\bar{\xi}_1 < 0)P(\bar{\xi}_2 < 0) \approx 1 - (0.240)^2 \approx 0.943$. Alternatively, note that if the sample of 50 observations is divided into two groups of 25 observations, the probability of obtaining a CI estimator of non-zero width is approximately $1 - (0.308)^2 \approx 0.905$.
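The probabilities quoted in Examples 2.1 and 2.2 can be recomputed directly from the normal distribution of the sample means; the short sketch below does so and reproduces the values 0.760, 0.943, and 0.905.

```python
from math import sqrt
from statistics import NormalDist

mu = 0.1
phi = NormalDist()  # standard normal

# P(sample mean < 0) when xi_i ~ N(mu, 1) and the mean uses n observations.
p = lambda n: phi.cdf(-mu * sqrt(n))

print(f"Example 2.1: 1 - P(xi_bar < 0), n = 50:          {1 - p(50):.3f}")    # ~0.760
print(f"Example 2.2: two replications of size 50:         {1 - p(50)**2:.3f}")  # ~0.943
print(f"Example 2.2: two replications of size 25:         {1 - p(25)**2:.3f}")  # ~0.905
```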
2.6 Summary and Concluding Remarks

In this chapter, we reviewed three Monte Carlo sampling-based algorithms, MRP, SRP, and A2RP, to assess the quality of a candidate solution to (SP). Observe that the bias of the optimality gap estimator is the same for MRP, SRP, and A2RP if the sample size per replication $n$ is equal for each procedure. If instead we fix the same total sample size for each algorithm, MRP will have significantly increased bias compared to SRP and A2RP due to a larger number of replications and therefore a smaller sample size per replication. Due to the heavy computational burden of MRP and the difficulties that can arise when using SRP, in Chapter 3 we focus on A2RP and present a technique that aims to reduce the bias of the optimality gap estimator. We will return to the above example in Section 3.4.2 to see how implementing bias reduction affects coinciding solutions. Chapter 4 implements variance reduction schemes in all three algorithms.

Chapter 3. Averaged Two-Replication Procedure with Bias Reduction

In this chapter, we combine a Monte Carlo sampling-based approach to optimality gap estimation with stability results from (Römisch, 2003) for a class of two-stage stochastic linear programs, with the intention of reducing the bias of the A2RP optimality gap estimators. We note that in Section 2.5, we defined the two replications of A2RP as each using a sample size of $n$, which allows a fair comparison of MRP, SRP, and A2RP. In this chapter, we focus on A2RP only and assume a fixed computational budget of $n$ observations, which are then divided in two to form two replications of size $n/2$. Our goal is to partition the observations in such a way as to reduce the bias of the optimality gap estimator as compared to dividing the observations randomly. The bias reduction technique presented partitions the observations by solving a minimum-weight perfect matching problem, which can be done in polynomial time in sample size.

We are concerned with a fixed candidate solution $\hat{x} \in X$ in this chapter and the next, so we suppress the dependence on $\hat{x}$ in our notation. We also suppress the dependence on the sample size $n$ for notational simplicity.

This chapter is organized as follows. In the next section, we give an overview of the relevant literature. In Section 3.2 we formally define our problem setup and list necessary assumptions, which are more restrictive than those in Chapter 2. Section 3.3 updates A2RP to reflect our focus on partitioning and presents the stability result. We then introduce our bias reduction technique in Section 3.4 and illustrate the technique on an instance of a newsvendor problem in Section 3.5. Asymptotic properties of the resulting estimators are provided in Section 3.6. In Section 3.7, we present our computational experiments on a number of test problems. Finally, in Section 3.8, we close with a summary.

3.1 Literature Review

Bias reduction in statistics and simulation is a well-established topic, and resampling methods such as the jackknife and bootstrap are commonly used for this purpose (Efron & Tibshirani, 1993; Shao & Tu, 1995). In stochastic programming, while there has been a lot of focus on variance reduction techniques (see Section 4.1 for an overview), bias reduction has received relatively little attention. Only a few studies exist for this purpose. Freimer et al. (2012) study the effect on bias of different sampling schemes mainly used for variance reduction, such as AV and LHS. These schemes can successfully reduce the bias of the estimator of $z^*$ with minimal computational effort; however, the entire optimality gap estimators are not considered. Partani (2007) and Partani et al. (2006), on the other hand, develop a generalized jackknife technique for bias reduction in MRP optimality gap estimators.

In this chapter, bias reduction is motivated by stability results in stochastic programming rather than adaptation of well-established sampling or bias reduction techniques. Stability results use probability metrics to provide continuity properties of optimal values and optimal solution sets with respect to perturbations of the original probability distribution of the random vector; see, e.g., the survey by Römisch (2003). Stability results have been successfully used for scenario reduction in stochastic programs; see, e.g., (Dupačová et al., 2003; Heitsch & Römisch, 2003; Henrion et al., 2009). We specifically apply the bias reduction approach to A2RP, described in Section 2.5, and use a particular stability result from (Römisch, 2003) involving the Kantorovich metric. Utilizing the Kantorovich metric to calculate distances between probability measures results in a significant computational advantage (see Section 3.4.1). The specific stability result we use, however, restricts (SP) to a class of two-stage stochastic linear programs with recourse (see Section 3.2).
It can be solved in polynomial time in sample size n and the computational burden is likely to be minimal compared to solving (approximations of) real-world stochastic programs with hundreds of stochastic parameters. Partitioning a random sample in an e↵ort to reduce bias as we do in this chapter results in observations that are no longer independent nor identically distributed, and hence the consistency and asymptotic validity results for A2RP in (Bayraksan & Morton, 2006) cannot be immediately stated. We overcome this difficulty by showing that the resulting distributions on the partitioned subsets converge weakly to P , the ˜ a.s. This result allows us to provide conditions under which original distribution of ⇠, the point estimators are consistent and the interval estimator is asymptotically valid. 3.2 Problem Class While (SP) encompasses many classes of problems, in this chapter, we focus on a particular class dictated by the specific stability result we use to motivate the proposed bias reduction technique (see Section 3.3.2). As stated before, we consider two-stage ˜ = cx+h(x, ⇠), ˜ X = {x : Ax = stochastic linear programs with recourse, where f (x, ⇠) b, x 0}, and h(x, ⇠) is the optimal value of the minimization problem min{qy : W y = R(⇠) y T (⇠)x, y 0}. 33 The above problem has fixed recourse (W is non-random) and stochasticity only on the right-hand side (R(⇠) and T (⇠) are random). We assume that X and ⌅ are convex polyhedral. We also assume that T and R depend affine linearly on ⇠, which allows for modeling first-order dependencies between them, such as those that arise in commonly-used linear factor models. Furthermore, we assume that our model has relatively complete recourse; i.e., for each (x, ⇠) 2 X ⇥ ⌅, there exists y W y = R(⇠) 0 such that T (⇠)x, and dual feasibility; i.e., {⇡ : ⇡W q} = 6 ;. These assumptions are needed to ensure the stability result presented in Section 3.3.2. We also require that X 6= ; and is compact, which is assumption (A3) of Chapter 2. We make the following additional assumption: (A4) ⌅ is compact. Let P(⌅) be the set of probability measures on ⌅ with finite first order moments, i.e., R P(⌅) = Q : ⌅ k⇠kQ(d⇠) < 1 . It follows immediately from assumption (A4) that P 2 P(⌅), a condition required by our theoretical results. For the class of problems we consider, f (·, ⇠) is convex in x for all fixed ⇠ 2 ⌅, and f (x, ⇠) satisfies the following Lipschitz continuity condition for all x, x0 2 X and ⇠ 2 ⌅, for some L > 0: |f (x, ⇠) f (x0 , ⇠)| L max {1, k⇠k} kx x0 k, where k · k is some norm (see Proposition 22 in (Römisch, 2003)). This result leads directly to the continuity of f (·, ⇠) in x for fixed ⇠. We also note that f (x, ·) is Lipschitz continuous and thus continuous in ⇠; see, e.g., Corollary 25 in (Römisch, 2003). Assumptions (A3) and (A4) along with continuity in both variables imply that f (x, ⇠) is uniformly bounded; i.e., 9 C < 1 such that |f (x, ⇠)| < C for each x 2 X, ⇠ 2 ⌅, a condition necessary for establishing consistency of our point estimators (see Section 3.6.2). Therefore, assumption (A2) is automatically satisfied. Also, per the 34 above discussion, the continuity assumption (A1) is also satisfied. Uniform boundedness also ensures that f (x, ⇠) is a real-valued function and enforces the relatively complete recourse and dual feasibility assumptions. In addition, convexity and con˜ < 1 is convex and continuous in x. Hence, (SP) tinuity in x implies that Ef (x, ⇠) has a finite optimal solution on X, and so X ⇤ is non-empty. 
3.3 Background

In this section, we update the notation of A2RP as defined in Section 2.5 to reflect our emphasis on the partitioning of observations. We also provide a stability result from (Römisch, 2003) that is fundamental to the bias reduction technique.

3.3.1 A Redefinition of the Averaged Two-Replication Procedure

As mentioned at the beginning of this chapter, we currently view A2RP as sampling a budget of $n$ observations, where $n$ is even, and partitioning them into two replications of size $n/2$. To facilitate comparison with the bias reduction technique discussed in Section 3.4, we present A2RP again in this section with slightly altered notation and steps.

Let $I^1$ be a uniformly distributed random variable independent of $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ taking values in the set of all subsets of $\{1, \ldots, n\}$ of size $n/2$, and let $I_1$ be an instance of $I^1$. Let $I_2 = (I_1)^C$; that is, $I_2$ contains all $n/2$ elements of $\{1, \ldots, n\}$ that are not in $I_1$. This is essentially equivalent to generating two independent random samples of size $n/2$ as in Section 2.5. However, we prefer to use the notation $I_1$ and $I_2$ to emphasize the random partitioning. Later, the proposed bias reduction technique will alter this partitioning mechanism.

Let $P_{I_l}(\cdot) = \sum_{i \in I_l} \frac{2}{n}\, \delta_{\tilde{\xi}^i}(\cdot)$ be the empirical probability measure formed on the $l$th set of observations, $l = 1, 2$. Similar to the definition of (SP$_n$), let (SP$_{I_l}$) denote the problem
$$\min_{x \in X} \frac{2}{n}\sum_{i \in I_l} f(x,\tilde{\xi}^i) = \min_{x \in X} \int_{\Xi} f(x,\xi)\, P_{I_l}(d\xi), \tag{SP$_{I_l}$}$$
let $x_{I_l}^*$ denote an optimal solution to (SP$_{I_l}$), and let $z_{I_l}^*$ be the optimal value, for $l = 1, 2$. A2RP is as follows:

A2RP

Input: A candidate solution $\hat{x} \in X$, a desired value of $\alpha \in (0,1)$, and an even sample size $n$.
Output: A point estimator, its associated variance estimator, and a $(1-\alpha)$-level confidence interval (CI) on $G_{\hat{x}}$.

1. Sample i.i.d. observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ from $P$.
2. Generate a random partition of $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ via $I_1$ and $I_2$, and produce $P_{I_1}$ and $P_{I_2}$.
3. For $l = 1, 2$:
   3.1. Solve (SP$_{I_l}$) to obtain $x_{I_l}^*$ and $z_{I_l}^*$.
   3.2. Calculate:
   $$G_{I_l} = \frac{2}{n}\sum_{i \in I_l} f(\hat{x},\tilde{\xi}^i) - z_{I_l}^* \quad \text{and} \quad s_{I_l}^2 = \frac{1}{n/2 - 1}\sum_{i \in I_l} \left[ \left( f(\hat{x},\tilde{\xi}^i) - f(x_{I_l}^*,\tilde{\xi}^i) \right) - G_{I_l} \right]^2.$$
4. Calculate the optimality gap and sample variance estimators by taking the average: $G_I = \frac{1}{2}(G_{I_1} + G_{I_2})$ and $s_I^2 = \frac{1}{2}\left( s_{I_1}^2 + s_{I_2}^2 \right)$.
5. Output a one-sided confidence interval on $G_{\hat{x}}$:
   $$\left[ 0,\; G_I + \frac{z_\alpha s_I}{\sqrt{n}} \right]. \tag{3.1}$$

3.3.2 A Stability Result

In this section, we review a stability result from (Römisch, 2003) that is fundamental to our bias reduction technique. Stability results in stochastic programming quantify the behavior of (SP) when $P$, the original distribution of $\tilde{\xi}$, is perturbed. In this chapter, we are particularly interested in changes in the optimal value $z^*$ under perturbations of $P$; however, stability results also examine the changes in the solution sets $X^*$. In this section, we will use $z^*(P)$ to denote the optimal value of (SP) when the distribution of $\tilde{\xi}$ is $P$. Similarly, $z^*(Q)$ denotes the optimal value under the distribution $Q$, a perturbation of $P$.

Probability metrics, which calculate distances between probability measures, can provide upper bounds on $|z^*(P) - z^*(Q)|$, the change in the optimal value. One such probability metric relevant for the class of problems we consider is $\hat{\mu}_d(P,Q)$, the Kantorovich metric with cost function $d(\xi^1,\xi^2) = \|\xi^1 - \xi^2\|$, where $\|\cdot\|$ is a norm.
The following result—a restatement of Corollary 25 to Theorem 23 in (Römisch, 2003), written to match this dissertation’s notation—provides continuity properties of optimal values of (SP) with respect to perturbations of P . Theorem 3.2. Let only T (⇠) and R(⇠) be random, and assume that relatively complete recourse and dual feasibility hold. Let P 2 P(⌅) and X ⇤ be non-empty. Then, there exist constants L > 0, > 0 such that |z ⇤ (P ) z ⇤ (Q)| Lµ̂d (P, Q) whenever Q 2 P(⌅) and µ̂d (P, Q) < . All conditions necessary to apply Theorem 3.2 for the class of problems we consider are satisfied, as specified in Section 3.2; see (Römisch, 2003) for details. The above stability result implies that if P and Q are sufficiently close with respect to the Kantorovich metric, then the optimal value of (SP) behaves Lipschitz continuously with respect to changes in the probability distribution. Suppose that P is a discrete probability measure placing masses p1 , . . . , pNP on the points {⇠ 1 , . . . , ⇠ NP } in ⌅, respectively, and Q is a discrete measure with masses q1 , . . . , qNQ on the points {⌫ 1 , . . . , ⌫ NQ } in ⌅, respectively. Then the Kantorovich metric can be written in the form of the Monge-Kantorovich transportation problem, which formulates the 37 transfer of mass from P to Q: µ̂d (P, Q) = min ⌘ s.t. NQ NP X X i=1 j=1 NP X i=1 NQ X j=1 ⌘ij k⇠ i ⌫ j k⌘ij (MKP) ⌘ij = qj , 8j, ⌘ij = pi , 8i, 0, 8i, j. This is the well-known transportation problem, where P can be viewed to have NP supply nodes, each with supply pi , i = 1, . . . , NP ; similarly, Q has NQ demand nodes, each with demand qj , j = 1, . . . , NQ ; and total supply and demand match, i.e., P NP P NQ j=1 qj = 1. Thus, µ̂d (P, Q) is the minimum cost of transferring mass from i=1 pi = P to Q. Representing the Kantorovich metric as the optimal value of a well-known, efficiently solvable optimization problem is vital in allowing us to implement the bias reduction technique described in the next section. 3.4 Bias Reduction via Probability Metrics In this section, we present a technique to reduce the bias in sampling-based estimates of z ⇤ in stochastic programs and apply it to the A2RP optimality gap estimators. We begin by discussing the motivation behind the technique and explaining the connection with Theorem 3.2. We then formally state the resulting procedure to obtain variants of the A2RP optimality gap estimators after bias reduction. 3.4.1 Motivation for Bias Reduction Technique Consider a partition of n observations {⇠˜1 , . . . , ⇠˜n } given by index sets S1 and S2 , where (i) S1 , S2 ⇢ {1, . . . , n} and S2 = (S1 )C , 38 (ii) |S1 | = |S2 | = n/2, and (iii) each ⇠˜i , i 2 S1 [ S2 , receives probability mass 2/n. Note that S1 and S2 are functions of the random sample {⇠˜1 , . . . , ⇠˜n }. This is a generalization of the partitioning performed via I1 and I2 , where we now allow dependencies between S1 and S2 . For any given {⇠˜1 , . . . , ⇠˜n }, we have ! X 2X 2 min f (x, ⇠˜i ) + min f (x, ⇠˜i ) x2X n x2X n i2S1 i2S2 ! n X 1 X 1X i i ˜ ˜ min f (x, ⇠ ) + f (x, ⇠ ) = min f (x, ⇠˜i ) = zn⇤ . x2X n x2X n i=1 i2S i2S 1 ⇤ 1 (zS1 + zS⇤ 2 ) = 2 2 1 2 Therefore, by the monotonicity of expectation, the following inequality holds: 1 (EzS⇤ 1 + EzS⇤ 2 ) Ezn⇤ z ⇤ . 2 (3.3) Inequality (3.3) indicates that when n observations are divided in two, the expected gap between 12 (zS⇤ 1 +zS⇤ 2 ) and z ⇤ grows. This motivates us to partition the observations via index sets S1 and S2 that maximize 12 (EzS⇤ 1 + EzS⇤ 2 ). 
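For reference, the Kantorovich metric between discrete measures, stated as (MKP) in Section 3.3.2, can be computed directly as a small transportation linear program. The Python sketch below, using hypothetical two-dimensional scenarios, is illustration only and is not the implementation used in the experiments of Section 3.7.

import numpy as np
from scipy.optimize import linprog

def kantorovich(points_p, p, points_q, q):
    """Kantorovich metric between two discrete measures via the transportation LP (MKP)."""
    NP, NQ = len(p), len(q)
    # Cost c_ij = ||xi_i - nu_j||; the mass variable eta is flattened so that eta[i * NQ + j] = eta_ij.
    cost = np.linalg.norm(points_p[:, None, :] - points_q[None, :, :], axis=2).ravel()
    A_eq, b_eq = [], []
    for j in range(NQ):                      # sum over i of eta_ij = q_j
        row = np.zeros(NP * NQ)
        row[j::NQ] = 1.0
        A_eq.append(row)
        b_eq.append(q[j])
    for i in range(NP):                      # sum over j of eta_ij = p_i
        row = np.zeros(NP * NQ)
        row[i * NQ:(i + 1) * NQ] = 1.0
        A_eq.append(row)
        b_eq.append(p[i])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (NP * NQ), method="highs")
    return res.fun

rng = np.random.default_rng(1)
pts = rng.normal(size=(6, 2))                               # six hypothetical 2-d scenarios
mu = kantorovich(pts, np.full(6, 1 / 6), pts[:3], np.full(3, 1 / 3))
print(mu)   # distance from the full empirical measure to the measure on the first half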
This approach will help to alleviate the increase in bias that results from using two subsets of n/2 observations rather than one set of n observations. Since 12 (EzS⇤ 1 + EzS⇤ 2 ) is always bounded above by Ezn⇤ , we equivalently aim to minimize Ezn⇤ 1 (EzS⇤ 1 2 + EzS⇤ 2 ). This problem can be hard to solve, but an approximation is obtained by: 1 (EzS⇤ 1 + EzS⇤ 2 ) 2 ⇥ We would thus like to minimize E |zn⇤ Ezn⇤ 1 ⇥ ⇤ E zn 2 zS⇤ 1 | + |zn⇤ zS⇤ 1 + zn⇤ zS⇤ 2 ⇤ . ⇤ zS⇤ 2 | , but again this is a hard problem. In an e↵ort to achieve this, we focus on |zn⇤ zS⇤ 1 |+|zn⇤ zS⇤ 2 | for a given sam- ple of size n. By viewing the empirical measure Pn of the random sample {⇠˜1 , . . . , ⇠˜n } P as the original measure and the measures PSl (·) = n2 i2Sl {⇠˜i } (·), l = 1, 2, as per- turbations of Pn , we appeal to Theorem 3.2 to obtain an upper bound containing probability metrics: |zn⇤ zS⇤ 1 | + |zn⇤ zS⇤ 2 | Lµ̂d (Pn , PS1 ) + Lµ̂d (Pn , PS2 ). (3.4) 39 As a result, we aim to reduce the bias of the optimality gap estimator by partitioning the observations according to sets S1 and S2 that minimize µ̂d (Pn , PS1 ) + µ̂d (Pn , PS2 ). By minimizing these metrics, we would like PS1 and PS2 to mimic Pn as much as possible. This way, we may expect the resulting optimal values of the partitions to be closer to zn⇤ , reducing the bias induced by partitioning. We note that several approximations were used above and (3.4) is valid only when Pn and PSl are sufficiently close in terms of the Kantorovich metric, for l = 1, 2. However, it is natural to think of PSl as a perturbation of Pn even though Theorem 3.2 does not specify how close they should be. The advantage of using these approximations is that it results in an easily solvable optimization problem (see the discussion below). Even though it is approximate, we present strong evidence that the proposed bias reduction technique can be successful via analytical results on a newsvendor problem in Section 3.5 and numerical results for several stochastic programs from the literature in Section 3.7. Let us now examine µ̂d (Pn , PSl ), l = 1, 2 when we have a known partition via S1 and S2 . Suppose given a realization of the random sample {⇠ 1 , . . . , ⇠ n } and the corresponding empirical measure Pn , we have identified S1 and S2 that satisfy (i)– (iii) above. Because d(⇠ i , ⇠ j ) = ||⇠ i ⇠ j || = 0 whenever ⇠ i = ⇠ j , for i 2 {1, . . . , n} and j 2 Sk , the Monge-Kantorovich problem (MKP) in this setting turns into an assignment problem (for a given index set Sk ): µ̂d (Pn , PSl ) = min ⌘ s.t. XX i2S2 j2S1 X k⇠ i ⌘ij = 1 , 8j, n ⌘ij = 1 , 8i, n i2S2 X j2S1 ⌘ij ⇠ j k⌘ij 0, 8i, j. Furthermore, since the cost function d(⇠ i , ⇠ j ) = ||⇠ i ⇠ j || is symmetric, µ̂d (Pn , PS1 ) 40 = µ̂d (Pn , PS2 ). It follows that if we minimize µ̂d (Pn , PS1 ), we automatically minimize µ̂d (Pn , PS2 ). Therefore, identifying sets S1 and S2 that minimize the sum of the Kantorovich metrics is equivalent to finding a set S1 that minimizes µ̂d (Pn , PS1 ). Thus, to attempt to reduce the bias of the optimality gap estimator, we wish to find an index set of size n/2 that solves the problem: min µ̂d (Pn , PS1 ) (PM) s.t. S1 ⇢ {1, . . . , n}, |S1 | = n/2. Note that this is the well-known minimum-weight perfect matching problem. Given a graph with n nodes and m edges, it can be solved in polynomial time of O(mn log n) (Mehlhorn & Schäfer, 2002). The running time for our problem is O(n3 log n) since we have a fully connected graph. 
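As a sketch of how (PM) might be solved in practice, the following uses NetworkX's blossom-based min_weight_matching on the complete graph of observations and then splits each matched pair between S1 and S2 (as noted in Section 3.7.2, any such split of the matching output is acceptable). The ten three-dimensional observations are made up for illustration; the experiments in Section 3.7 use the Blossom V code instead.

import numpy as np
import networkx as nx

def matched_partition(sample):
    """Partition an even-size sample by solving (PM) as a minimum-weight perfect matching."""
    n = len(sample)
    assert n % 2 == 0
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=float(np.linalg.norm(sample[i] - sample[j])))
    # Blossom-type minimum-weight perfect matching on the fully connected graph of observations
    matching = nx.min_weight_matching(G, weight="weight")
    S1 = sorted(i for i, _ in matching)      # one endpoint of each matched pair
    S2 = sorted(j for _, j in matching)      # its partner
    return S1, S2

rng = np.random.default_rng(2)
xi = rng.normal(size=(10, 3))                # ten hypothetical 3-dimensional observations
S1, S2 = matched_partition(xi)
print(S1, S2)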
A special case of (PM) when ⇠ is univariate is solvable in O(n log n) via a sorting algorithm, as the optimal solution is to place the odd order statistics in one subset and the even order statistics in the other. For large-scale stochastic programs, solving instances of (SPn ) can be expected to be the computational bottleneck compared to solving (PM). 3.4.2 Averaged Two-Replication Procedure with Bias Reduction In this section, we present the Averaged Two-Replication Procedure with Bias Reduction (A2RP-B) that results from adapting A2RP to include the bias reduction technique described in Section 3.4.1. To distinguish from the uniformly chosen subsets I1 and I2 defined Section 3.3.1, we denote an optimal solution to (PM) by J1 and let J2 = (J1 )C . The resulting probability measures are denoted PJl , l = 1, 2, where P PJl = n2 i2Jl ⇠˜i . 41 A2RP-B Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and an even sample size n. Output: A point estimator, its associated variance estimator, and a (1 ↵)-level confidence interval on G. 1. Sample i.i.d. observations {⇠˜1 , . . . , ⇠˜n } from P . 2. Generate J1 and J2 by solving (PM), and produce PJ1 and PJ2 . 3. For l = 1, 2: 3.1. Solve (SPJl ) to obtain x⇤Jl and zJ⇤l . 3.2. Calculate: G Jl = 2X f (x̂, ⇠˜i ) n i2J zJ⇤l and s2Jl = l X h⇣ 1 f (x̂, ⇠˜i ) n/2 1 i2J l f (x⇤Jl , ⇠˜i ) ⌘ GJl i2 . 4. Calculate the optimality gap and sample variance estimators by taking the average; GJ = 12 (GJ1 + GJ2 ) and s2J = 1 2 s2J1 + s2J2 . 5. Output a one-sided confidence interval on G: z ↵ sJ 0, GJ + p . n (3.5) A2RP-B di↵ers from A2RP in Step 2. Here, a minimum-weight perfect matching problem (PM) is solved to obtain an optimal partition of the observations via the index sets J1 and J2 . Note that the elements in J1 and J2 depend on the observations {⇠˜1 , . . . , ⇠˜n }, and so J1 and J2 are random variables acting on ⌦. Hence, J1 and J2 are not independent of {⇠˜1 , . . . , ⇠˜n }, distinguishing PJ1 and PJ2 from PI1 and PI2 . The random partitioning mechanism of I1 and I2 results in i.i.d. observations in PI1 and PI2 . Unfortunately, this property is lost in PJ1 and PJ2 . Nevertheless, we prove in Section 3.6 that the point estimators GJ and s2J are consistent and the interval estimator given by (3.5) is asymptotically valid. 42 We conclude this section by updating Example 2.2 to include the e↵ects of bias reduction. We will illustrate A2RP-B in more detail on an instance of a newsvendor problem in the next section. Example 3.1. Consider the problem described in Example 2.1. As before, ↵ = 0.10 P and n = 50. Let ⇠¯J1 = n2 i2J1 ⇠˜i be the sample mean of the first subset of 25 observations (all odd order statistics), and similarly let ⇠¯J2 be the sample mean of the second subset (all even order statistics) after solving (PM). To estimate an upper bound on the coverage of A2RP-B, we ran 1,000,000 independent runs in MATLAB and computed the proportions of runs that ⇠¯J1 and ⇠¯J2 were negative. This resulted in the estimate 1 P (⇠¯J1 < 0)P (⇠¯J2 < 0) ⇡ 1 (0.363)(0.145) ⇡ 0.947. Compared to A2RP with P (⇠¯l < 0) ⇡ 0.308 for each subset l = 1, 2, after solving (PM), the sample mean of the first subset shifted slightly downward, increasing this probability, whereas the sample mean of the second subset shifted slightly upward, decreasing this probability. As a result, the probability of obtaining coinciding solutions is decreased. Hence the upper bound on the coverage of A2RP-B is greater than A2RP for this problem. 
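In the univariate case just described, (PM) reduces to sorting, so the partition used by A2RP-B can be sketched in a few lines. The sample below is generic illustration data, not the instance of Example 3.1; as in that example, one subset mean lies slightly below the full-sample mean and the other slightly above it.

import numpy as np

def sorted_partition(sample):
    """Univariate special case of (PM): odd order statistics to one subset, even to the other."""
    order = np.argsort(sample)               # an O(n log n) sort replaces the matching problem
    return order[0::2], order[1::2]          # indices of the 1st, 3rd, 5th, ... and 2nd, 4th, 6th, ... order statistics

rng = np.random.default_rng(3)
xi = rng.normal(size=50)                     # even sample size n = 50
J1, J2 = sorted_partition(xi)
print(xi[J1].mean(), xi[J2].mean(), xi.mean())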
In general, A2RP-B may be viewed as in between SRP and A2RP. Like A2RP, it can lower the occurrence of coinciding solutions while at the same time having a lower bias like SRP. Computational results in Section 3.7 seem to support this hypothesis. 3.5 Illustration: Newsvendor Problem Before presenting theoretical results, we illustrate the above bias reduction technique on an instance of a newsvendor problem. For this problem, we are able to derive analytical results, and therefore can compare the optimality gap estimators produced by A2RP, A2RP-B, and SRP to examine the efficacy of the bias reduction technique. The specific newsvendor problem we consider is as follows: a newsvendor would like to determine the number of newspapers to order daily, x, in order to maximize 43 expected daily profit. Each copy sells at a price r and costs the newsvendor c, where ˜ is assumed to be random with a U (0, b) distribution. 0 < c < r. The daily demand, ⇠, The problem can be expressed as h min E cx x 0 The optimal solution is x⇤ = b(r i ˜ . r min{x, ⇠} (3.6) c)/r and the optimal value is z ⇤ = b(r c)2 /(2r). Note that (3.6) can be rewritten as a two-stage stochastic linear program in the form presented in Section 3.2. Prior to finding expressions for the biases of the optimality gap estimators, we note two results that are used in the subsequent derivations. First, let {⇠˜1 , . . . , ⇠˜n } be a random sample of size n from a U (0, b) distribution, and let {⇠˜(1) , . . . , ⇠˜(n) } denote the ordering of the random sample, i.e., ⇠˜(1) ⇠˜(2) . . . ⇠˜(n) . The optimal solution ⇤ to the approximated problem (SPn ) using this random sample is x⇤n = ⇠˜(l ) , where l⇤ = d(r c)n/re. The optimal value of (SPn ) is thus l⇤ 1 n zn⇤ = rX ⇤ min{x⇤n , ⇠˜i } = c⇠˜(l ) n i=1 cx⇤n r X ˜(i) ⇠ n i=1 r (n n ⇤ l⇤ + 1) ⇠˜(l ) . Second, recall that the ith order statistic from a U (0, b) random sample of size n satisfies ⇠˜(i) /b ⇠ (i, n + 1 i), where (↵1 , ↵2 ) denotes a random variable having a Beta distribution with parameters ↵1 and ↵2 . We now determine the bias of GI , the optimality gap estimator produced by A2RP. In this case, the n observations are randomly partitioned into two subsets of size n/2, generating the corresponding sampled problems (SPIl ), l = 1, 2. Relabel the observations ⇠˜i , i 2 I1 , as ⇠˜Ii1 , and similarly for I2 . The optimal solution to (SPIl ) is (i⇤ ) x⇤Il = ⇠˜Il , where i⇤ = d(r c)n/2re = (r c)n/2r + , for some 2 [0, 1). The ith (i) order statistic of each subset satisfies ⇠˜Il /b ⇠ i, n2 + 1 i . After some algebra, the bias of GI is z⇤ 1 (EzI⇤1 + EzI⇤2 ) = 2 ⇥ b 2( n(n + 2)r 1)r2 ⇤ cnr + c2 n . (3.7) 44 The analysis changes somewhat under A2RP-B. The newsvendor problem is univariate in ⇠, and so (PM) places the odd order statistics in one subset and the even order statistics in the other. Since the order statistics are computed from the original sample size of n, the ith order statistic follows a (i, n + 1 i) distribution. Note that after solving (PM), the observations in each subset are no longer i.i.d., since order statistics are neither identically distributed nor independent. Solving the sampling problem using the first subset of observations leads to the optimal solution ⇤ x⇤J1 = ⇠˜(2i 1) and using the second set of observations produces the optimal solution ⇤ x⇤J2 = ⇠˜(2i ) . Following the same steps, we calculate the bias of GJ as z⇤ 1 (EzJ⇤1 + EzJ⇤2 ) = 2 ⇥ b 4( 2n(n + 1)r 1)r2 ⇤ cnr + c2 n . 
(3.8) We now consider the limiting behavior of the percentage reduction in the bias of the optimality gap estimator going from A2RP to A2RP-B, which is given by subtracting expression (3.8) from expression (3.7) and normalizing by (3.7). We get % Red. in Bias = 1 b [4( 2n(n+1)r b [2( n(n+2)r 1)r2 cnr + c2 n] 1)r2 c2 n] cnr + !1 1 as n ! 1. 2 Therefore, the percentage reduction in the bias converges to 50% as n ! 1. So, simply partitioning the random sample into odd and even order statistics (the result of solving (PM)) gives an optimality gap estimator with asymptotically half the bias compared to using a random partition. This result holds regardless of the values of the parameters r, c, and b for this specific newsvendor problem, so parameter choices that change the bias of the optimality gap estimator will not alter the largesample behavior of the bias reduction technique. For small sample size behavior of this newsvendor problem, see Section 3.7.3. Our numerical results indicate that convergence of the percentage reduction in bias is achieved very quickly, e.g., around a sample size of n = 100. Finally, we compare A2RP-B and A2RP to SRP, where we assume that SRP uses a sample size of n. Observe that replacing n with 2n in (3.7) gives the bias of the 45 optimality gap estimator produced by SRP. Consequently, the ratio of the bias of the A2RP optimality gap estimator to the bias of the SRP estimator converges to 2 as n ! 1, indicating that partitioning the observations into two random subsets doubles the bias for larger sample sizes. In contrast, the ratio of the biases of the A2RP-B and SRP optimality gap estimators converges to 1 as n ! 1. In essence, the bias reduction technique performs “anti-partitioning” for this problem by eliminating the additional bias introduced from partitioning. 3.6 Theoretical Properties We now prove that the estimators GJ and s2J of A2RP-B are strongly consistent and that A2RP-B provides an asymptotically valid CI on the optimality gap. This is important because applying a bias reduction technique can sometimes result in overcorrection of the bias and lead to undesirable behavior. In this section, we show that, asymptotically, such unwanted behavior does not happen for our method. The technical difficulty in the consistency proofs for the A2RP-B estimators comes from the fact that the proposed bias reduction technique destroys the i.i.d. nature of the observations in the partitioned subsets of observations. Recall that in A2RP, the uniform partitioning of the observations preserves the i.i.d. property, but this is not the case for A2RP-B; see Section 3.5 for an illustration from the newsvendor problem. Hence, it is necessary to generalize the consistency proofs in (Bayraksan & Morton, 2006) to cover the non-i.i.d. case arising from solving (PM). 3.6.1 Weak Convergence of Empirical Measures We first establish the weak convergence of the empirical probability measures PJ1 and ˜ a.s. This provides the structure necessary to PJ2 to P , the original distribution of ⇠, obtain consistent estimators. 46 Theorem 3.9. Assume that {⇠˜1 , . . . , ⇠˜n } is an i.i.d. sample from distribution P and (A4) holds. Then the probability measures on the partitioned sets obtained by solving ˜ a.s. (PM), PJ1 and PJ2 , converge weakly to P , the original distribution of ⇠, Proof. Since µ̂d is a metric, by the triangle inequality we have that µ̂d (P, PJ1 ) µ̂d (P, Pn ) + µ̂d (Pn , PJ1 ). 
Also, µ̂d (Pn , PJ1 ) µ̂d (Pn , PI1 ), since the partitioning of the observations via J1 minimizes the Kantorovich metric; hence, the random partition provides an upper bound. Therefore, µ̂d (P, PJ1 ) µ̂d (P, Pn ) + µ̂d (Pn , PI1 ), and by applying the triangle inequality again, we obtain µ̂d (P, PJ1 ) µ̂d (P, Pn ) + µ̂d (P, Pn ) + µ̂d (P, PI1 ) = 2µ̂d (P, Pn ) + µ̂d (P, PI1 ). We would like to show that µ̂d (P, PJ1 ) ! 0 as n ! 1, a.s. First, applying R the SLLN to a fixed bounded, continuous function f on ⌅ gives that ⌅ f (⇠)Pn (d⇠) R ! ⌅ f (⇠)P (d⇠), a.s. Theorem 11.4.1 in (Dudley, 2002) extends this result to all bounded, continuous f ; i.e., the random empirical measure Pn converges weakly to R the non-random measure P , a.s. This combined with (A4) yields ⌅ k⇠kPn (d⇠) ! R k⇠kP (d⇠), a.s. Hence, applying Theorem 6.3.1 in (Rachev, 1991), we obtain ⌅ µ̂d (P, Pn ) ! 0 as n ! 1, a.s. and similarly, µ̂d (P, PI1 ) ! 0 as n ! 1, a.s. The sec- ond statement follows from the fact that PI1 is essentially the same as Pn/2 . Combining these, we obtain that 2µ̂d (P, Pn ) + µ̂d (P, PI1 ) ! 0, a.s. Therefore, µ̂d (P, PJ1 ) ! 0, a.s., and another application of Theorem 6.3.1 in (Rachev, 1991) implies that PJ1 converges weakly to P , a.s. The same argument holds for PJ2 . Even though we lose the i.i.d. property of the observations in the partitioned subsets after minimizing the Kantorovich metrics, Theorem 3.9 shows the weak convergence of the resulting probability measures to the original measure. 47 3.6.2 Consistency of Point Estimators We now show the consistency of the estimators GJ and s2J in the almost sure sense. ⇣ ⌘ ˜ f (x, ⇠) ˜ , and denote the optimal For a fixed x̂ 2 X, define x̂2 (x) = var f (x̂, ⇠) solutions that minimize and maximize arg maxx2X ⇤ 2 x̂ (x), 2 x̂ (x) by x⇤min 2 arg minx2X ⇤ 2 x̂ (x) and x⇤max 2 ˜ is continuous in x, Ef (x, ⇠) ˜ respectively. Note that since f (x, ⇠) is continuous, and hence X ⇤ is closed (and therefore compact). In addition, continuous, and thus arg minx2X ⇤ 2 x̂ (x) and arg maxx2X ⇤ 2 x̂ (x) 2 x̂ (x) is are nonempty. Theorem 3.10. Assume x̂ 2 X, {⇠˜1 , . . . , ⇠˜n } is an i.i.d. sample from distribution P , and (A3) and (A4) hold. Fix 0 < ↵ < 1. Let n be even and consider A2RP-B. Then, (i) all limit points of x⇤Jl lie in X ⇤ , a.s., for l = 1, 2; (ii) zJ⇤l ! z ⇤ , a.s., as n ! 1, for l = 1, 2; (iii) GJ ! G, a.s., as n ! 1; (iv) 2 ⇤ x̂ (xmin ) lim inf s2J lim sup s2J n!1 n!1 2 ⇤ x̂ (xmax ), a.s. Proof. (i) First, note from Theorem 3.9 that the probability measures on the partitioned subsets converge weakly to the original distribution of ⇠˜ as n ! 1, a.s. As R R a result, for l = 1, 2, ⌅ f (x, ⇠)PJl (d⇠) epi-converges to ⌅ f (x, ⇠)P (d⇠) as n ! 1, a.s., by Theorem 3.9 in (Wets, 1983). Thus by Theorem 3.9 in (Wets, 1983), all limit points of x⇤Jl lie in X ⇤ , a.s., for l = 1, 2. (ii) Using epi-convergence, Theorem 7.33 in (Rockafellar & Wets, 1998) along with assumptions (A3) and (A4) give that zJ⇤l converges to z ⇤ , a.s., as n ! 1. P (iii) By definition, GJ = 12 [GJ1 + GJ2 ] where GJl = n2 i2Jl f (x̂, ⇠˜i ) zJ⇤l , for l = 1, 2. P For a feasible x 2 X, define f¯n (x) = n1 ni=1 f (x, ⇠˜i ). Then GJ = f¯n (x̂) 12 (zJ⇤1 + zJ⇤2 ). Since the original sample is formed using n i.i.d. observations, f¯n (x̂) converges to 48 ˜ a.s., by the SLLN. Furthermore, by part (ii), 1 (z ⇤ + z ⇤ ) converges to z ⇤ , Ef (x̂, ⇠), J2 2 J1 ˜ a.s., as n ! 1. We conclude that GJ ! Ef (x̂, ⇠) (iv) We begin by letting w(x, ⇠) = 2C (f (x̂, ⇠) z ⇤ , a.s., as n ! 1. 
f (x, ⇠)) and w̄Jl = 2 n P i2Jl w(x, ⇠˜i ), for a given x 2 X, l = 1, 2. Recall that the constant C gives a uniform bound on f (x, ⇠). Note that w(x, ⇠) is defined in this fashion to enforce non-negativity. Altering our notation slightly and fixing x̂ 2 X, we define s2Jl (x) = X⇣ 1 w(x, ⇠˜i ) n/2 1 i2J w̄Jl (x) l ⌘2 . Note that s2Jl (x⇤Jl ) is equivalent to s2Jl defined in Section 3.4.2. Rewriting, we obtain ! # " X n/2 2 s2Jl (x) = w2 (x, ⇠˜i ) (w̄Jl (x))2 . (3.11) n/2 1 n i2J l First, we show that the sequence of functions {s2Jl (x)} converges uniformly to 2 x̂ (x), a.s., as n ! 1, for l = 1, 2. To this end, we first examine the two terms inside the brackets in (3.11). By the uniform boundedness of f (x, ⇠), |f (x̂, ⇠) f (x, ⇠)| 2C; hence, w(x, ⇠) 0 for all x 2 X, ⇠ 2 ⌅. It also immediately follows that w(x, ⇠) is bounded in ⇠ since for all x 2 X, |w(x, ⇠)| 4C, and w(x, ⇠) is continuous in ⇠ for the class of problems we consider. Therefore, for each x 2 X, by the definition of weak convergence and ˜ a.s., as n ! 1, i.e., the SLLN holds using Theorem 3.9, we have w̄Jl (x) ! Ew(x, ⇠), pointwise, a.s. Since f (·, ⇠) is convex in x, w(·, ⇠) is convex in x (note that x̂ is fixed). Hence, we apply Corollary 3 from (Shapiro, 2003) to obtain sup w̄Jl (x) x2X ˜ ! 0, a.s., as n ! 1, Ew(x, ⇠) (3.12) ˜ a.s., as n ! 1. i.e., w̄Jl (x) converges uniformly to Ew(x, ⇠), Note that w2 (x, ⇠) is bounded and continuous in ⇠, and because w(·, ⇠) 0, w2 (·, ⇠) is also convex in x. Hence, following the same steps as above, we conclude 49 that 2 n these, P i2Jl ˜ a.s., as n ! 1. Combining w2 (x, ⇠˜i ) converges uniformly to Ew2 (x, ⇠), " ! # 2X 2 2 aJl (x) := w (x, ⇠˜i ) (w̄Jl (x)) n i2J l ⇣ ⌘2 ˜ ˜ ˜ a.s., as n ! 1. Note converges uniformly to Ew2 (x, ⇠) Ew(x, ⇠) = var(w(x, ⇠)), ˜ = that var(w(x, ⇠)) 2 x̂ (x). n/2 a (x), n/2 1 Jl To show uniform convergence of sup aJl (x) + x2X aJl (x) n/2 1 consider the following inequality: ˜ sup aJ (x) var(w(x, ⇠)) l ˜ var(w(x, ⇠)) aJl (x) ˜ var(w(x, ⇠)) n/2 1 x2X + sup x2X ˜ var(w(x, ⇠)) . n/2 1 + sup x2X From above, the first two terms on the right-hand side converge to 0, a.s. By (A3), ˜ < 1 for all x, a.s., and so the last term also converges to 0 as n ! 1. var(w(x, ⇠)) n/2 a (x) n/2 1 Jl This establishes uniform convergence of Hence, s2Jl (x) converges uniformly to 2 x̂ (x), ˜ = to var(w(x, ⇠)) 2 x̂ (x), a.s. a.s., for l = 1, 2. Since X is compact, for any fixed ! there exists a subsequence of N, Nk , along which {x⇤Jl }n2Nk converges to a point in X, for l = 1, 2. This point, denoted ẋk , is in X ⇤ , a.s., by (i). Then, because s2Jl (x) is a sequence of continuous functions that converges uniformly to 2 x̂ (x), a.s., 2 ⇤ lim s (x ) n ! 1 J l Jl n 2 Nk Therefore, min x2X ⇤ 2 x̂ (x) = 2 x̂ (ẋk ), a.s. lim s2Jl (x⇤Jl ) max⇤ n!1 n 2 Nk x2X 2 x̂ (x), a.s., for l = 1, 2. Since Nk is just one subsequence of N, and by the definition of x⇤min and x⇤max , we have 2 ⇤ x̂ (xmin ) lim inf s2Jl (x⇤Jl ) lim sup s2Jl (x⇤Jl ) n!1 n!1 2 ⇤ x̂ (xmax ), a.s., 50 for l = 1, 2. Since s2J = 1 2 s2J1 (x⇤J1 ) + s2J2 (x⇤J2 ) , the desired result follows. Parts (i) and (ii) of Theorem 3.10 establish the consistency of x⇤Jl and zJ⇤l , an optimal solution and the optimal value of (SPJl ). Similarly, parts (iii) and (iv) establish the consistency of GJ and s2J , the point estimators produced by A2RP-B. Note that if (SP) has a unique optimal solution; that is, X ⇤ = {x⇤ }, then part (i) implies that x⇤Jl ! x⇤ , for l = 1, 2, and part (iv) implies that limn!1 s2J = 3.6.3 2 ⇤ x̂ (x ), a.s., as n ! 1. 
Asymptotic Validity of the Interval Estimator In our final result, we show the asymptotic validity of the CI estimator produced by A2RP-B, given in (3.5). This justifies the construction of an approximate CI after bias reduction. Theorem 3.13. Assume x̂ 2 X, {⇠˜1 , . . . , ⇠˜n } is an i.i.d. sample from distribution P , and (A3) and (A4) hold. Fix 0 < ↵ < 1. Let n be even and consider A2RP-B. Then, ◆ ✓ z ↵ sJ lim inf P G GJ + p 1 ↵. n!1 n Proof. First, note that if x̂ 2 X ⇤ , then the inequality is satisfied automatically. Suppose now that x̂ 2 / X ⇤ . As in the proof of part (iii) of Theorem 3.10, we P 1 express GJ as GJ = f¯n (x̂) zJ⇤1 + zJ⇤2 , where f¯n (x) = n1 ni=1 f (x, ⇠˜i ). Since 2 P zJ⇤l = minx2X i2Jl f (x, ⇠˜i ) for l = 1, 2, GJ f¯n (x̂) f¯n (x), for all x 2 X. Noting that {⇠˜1 , . . . , ⇠˜n } is an i.i.d. sample, the rest of the proof proceeds as in the proof of Theorem 1 in (Bayraksan & Morton, 2006). 3.7 Computational Experiments In Section 3.6, we proved asymptotic results regarding the consistency and validity of estimators produced using A2RP-B. In this section, we apply A2RP-B to several test problems in order to examine its small-sample behavior. We begin our discussion 51 by introducing the test problems used for evaluating the bias reduction technique, followed by the experimental setup in Sections 3.7.1 and 3.7.2. Then, in Section 3.7.3, we present the results of our experiments and discuss computational e↵ort. We end our discussion by providing insights gained from our experiments in Section 3.7.6. 3.7.1 Test Problems To fully evaluate the efficacy of the proposed bias reduction technique, we consider four test problems from the literature; namely the newsvendor problem (denoted NV), APL1P, PGP2, and GBD. All four problems are two-stage stochastic linear programs with fixed recourse and stochasticity on the right-hand side, and can be solved exactly, allowing us to compute exact optimality gaps. Characteristics of these problems are summarized in Table 3.1. NV is defined as in Section 3.5 and can be solved analytically. We set the cost of one newspaper, c, to be 5, and its selling price, r, to 15. The demand ⇠˜ is assumed to have a U (0, 10) distribution. The electric power generation model PGP2 of Higle & Sen (1996b) has 3 stochastic parameters and 576 scenarios. APL1P is a power expansion problem with 5 independent stochastic parameters and 1,280 scenarios (Infanger, 1992). GBD, described in Example 1.1, is an aircraft allocation model. The version we use has 646,425 scenarios generated by 5 independent stochastic parameters. The standard formulations of these three problems di↵er slightly from the formulation presented in Section 3.2, in that ⇠ := (R, T ) rather than R and T being Problem # of 1st Stage Variables # of 2nd Stage Variables # of Stochastic Parameters # of Scenarios NV PGP2 APL1P GBD 1 4 2 17 1 16 9 10 1 3 5 5 1 576 1,280 646,425 Table 3.1: Test problem characteristics 52 Problem x⇤ Suboptimal x̂ z⇤ NV PGP2 APL1P 6 23 (1.5, 5.5, 5, 5.5) (1800, 1571.43) (10, 0, 0, 0, 0, 12.48, 1.19, 5.33, 0, 4.24, 0, 20.76, 7.81, 0, 7.20, 0, 0) 8.775 (1.5, 5.5, 5, 4.5) (1111.11, 2300) (10, 0, 0, 0, 0, 12.43, 1.22, 5.33, 0, 4.32, 0, 20.68, 8.05, 0, 6.95, 0, 0) 33 13 447.32 24,642.32 GBD 1,655.63 G 3 13 1.14 164.84 1.15 Table 3.2: Optimal and suboptimal candidate solutions functions of ⇠. 
This discrepancy can be easily remedied by defining the functions R(⇠) and T (⇠) in our formulation to be the coordinate projections of ⇠, so with a slight abuse of notation, R(⇠) = R and T (⇠) = T . Then R(⇠) and T (⇠) satisfy the affine linearity assumption in Section 3.2 and we can express the problems in the form assumed in this chapter. All test problems satisfy the required assumptions. We selected two candidate solutions, x̂, for each test problem listed in Table 3.1. The first candidate solution is the optimal solution, i.e., x̂ = x⇤ . Note that all these problems have a unique optimal solution. The second candidate solution is a suboptimal solution. For NV, APL1P, and PGP2, the suboptimal solution is the solution used in the computational experiments in (Bayraksan & Morton, 2006). We selected a suboptimal solution for GBD by solving an independent sampling problem and setting its solution as the candidate solution. Table 3.2 summarizes the optimal and suboptimal solutions used in our computational experiments, along with the optimal value and the optimality gap of the suboptimal candidate solution. For the problems summarized in Tables 3.1 and 3.2, we performed tests based on the setup detailed in Section 3.7.2 and present the results in Section 3.7.3. In addition, we looked at how much computational e↵ort bias reduction needs on a practical-scale problem and also studied the e↵ects of multiple optimal solutions. These are discussed in Section 3.7.4 and 3.7.5, respectively. 53 3.7.2 Experimental Setup The primary objective of our computational experiments is to determine the reduction in the bias of the point estimator GJ of A2RP-B compared to the estimator GI of A2RP for finite sample sizes n. It is well known that bias reduction techniques in statistics can increase the variance of an estimator; therefore, we use the mean-squared error (MSE) to capture both e↵ects. Recall that if ✓ˆ is an estimator of ✓, the MSE of ✓ˆ is given by E(✓ˆ ✓)2 = (E ✓ˆ ✓)2 + E(✓ˆ ˆ 2 , where the first term is the E ✓) ˆ We also examine the CI square of the bias and the second term is the variance of ✓. estimator after bias reduction. At each stage of our experiments, we perform relevant hypothesis tests to determine if there is a statistically significant reduction in bias, variance, MSE, etc. Our experiments were conducted as follows. First, for each test problem, we fixed a candidate solution (optimal or suboptimal) and set ↵ = 0.10. Then, we applied A2RP and A2RP-B for a variety of sample sizes {n = 50, 100, 200, . . . , 1000} to test the small-sample behavior and to observe any potential trends as n increases. For each independent run, we used the same random number stream for both A2RP and A2RP-B to enable a direct comparison of the two procedures. We used a batch size of m to estimate the biases of GI and GJ by averaging across m independent runs. We also obtained single estimates of var(GI ), var(GJ ), MSE(GI ), and MSE(GJ ) using the m runs. In order to produce CIs on the biases of GI and GJ , as well as obtain better estimates of the variance and MSE, we repeated this procedure M times, resulting in a total of m ⇥ M independent runs. The means of the M estimates of the bias, variance, and MSE and the m ⇥ M CI widths were used to compute percentage reductions. Since the stochastic parameters of APL1P take values that vary by several orders of magnitude, we used a weighted Euclidean norm to better calculate the distance between scenarios when defining (PM). 
We used the standard Euclidean norm for 54 the other test problems. For NV, we used the quicksort algorithm (in C++) to solve the sampling approximations (SPIl ) and (SPJl ), l = 1, 2, as the optimal solution is a sample quantile of demand. We also used the quicksort algorithm to perform the minimum-weight perfect matching. For all other test problems, we used the regularized decomposition (RD) code (in C++) by Świetanowski and Ruszczyński (Ruszczyński, 1986; Ruszczyński & Świetanowski, 1997) to solve the sampling approximations. We modified this code to use the Mersenne Twister algorithm to generate random samples (Wagner, 2009). To solve (PM), we used the Blossom V code discussed in (Kolmogorov, 2009). We note that there are multiple ways to partition the observations given a solution to (PM); we simply chose our partition based on the output from Blossom V. Given that NV and its corresponding matching problem can be solved efficiently, we set m = 1,000 and M = 1,000 for a total of 1,000,000 independent runs for each sample size n for this problem. For the other problems, we used m = 10 and M = 1,000 for a total of 10,000 independent runs for each n. For PGP2, APL1P, and GBD, we used the UA Research Computing High Performance Computing at the University of Arizona and for NV, we used the MORE Institute facilities. The statistical tests we performed are as follows. The null hypothesis states that the di↵erence between the corresponding estimators of A2RP and A2RP-B is zero, and the alternative hypothesis states that the di↵erence is positive (indicating that A2RP-B decreases the quantity being studied). We performed this test on the M di↵erences in the estimates of the bias, bias squared, variance, and MSE of the optimality gap point estimator, and the m ⇥ M di↵erences in the width of the CI estimator, using a one-sided dependent t-test with a 10% level of significance. This test is valid due to the large number of di↵erences calculated. Finally, we know from Theorem 3.13 that the CIs will attain the desired coverage of 0.90 for large sample sizes. However, given that bias reduction may reduce the width of the CI estimator, it is important to consider the change in coverage for 55 small sample sizes when applying bias reduction. We estimated the coverage for each algorithm and sample size. This was done by computing p̂, the proportion of the m ⇥ M independent runs in which the CI contained the optimality gap. Note that when the candidate solution is optimal, the optimality gap is 0, and so the coverage is always trivially 1. The estimator p̂ is a scaled binomial random variable, and hence for the suboptimal candidate solution we formed a 90% CI on the coverage via p p p̂ ± 1.645 p̂(1 p̂)/106 for NV and p̂ ± 1.645 p̂(1 p̂)/(5 ⇥ 104 ) for the other test problems. We also performed a one-sided two-proportion t-test with a 10% level of significance to test the null hypothesis that the coverage for A2RP and A2RP-B are equal versus the alternative hypothesis that A2RP-B has a lower coverage. 3.7.3 Results of Experiments on NV, PGP2, APL1P, and GBD We now present the computational results for each candidate solution, beginning with the optimal candidate solution. We note that for all CIs provided, a margin of error smaller than 0.01 or 0.001 is reported as 0.00 or 0.000, respectively. Optimal Candidate Solution This section highlights our comparison of A2RP and A2RP-B using the optimal solution as the candidate solution for each test problem presented in Tables 3.1 and 3.2. 
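Before turning to the results, the following sketch illustrates the bookkeeping described in Section 3.7.2: per-batch estimates of the bias, variance, and MSE of an optimality gap estimator, the one-sided dependent t-test on batch differences, and the normal-approximation CI on coverage. The arrays G_I and G_J are placeholders filled with synthetic numbers, not output from A2RP or A2RP-B, and the t-test call assumes a SciPy version that supports the alternative keyword.

import numpy as np
from scipy import stats

def bias_var_mse(gaps, true_gap):
    """Per-batch estimates from an (M, m) array of gap estimates: M batches of m runs each."""
    bias_hat = gaps.mean(axis=1) - true_gap
    var_hat = gaps.var(axis=1, ddof=1)
    mse_hat = ((gaps - true_gap) ** 2).mean(axis=1)
    return bias_hat, var_hat, mse_hat

def paired_reduction_pvalue(est_a2rp, est_a2rpb):
    """One-sided dependent t-test; H1: A2RP-B reduces the quantity (difference is positive)."""
    return stats.ttest_rel(est_a2rp, est_a2rpb, alternative="greater").pvalue

def coverage_ci(covered, z=1.645):
    """90% normal-approximation CI on coverage from per-run indicator variables."""
    p_hat = covered.mean()
    half = z * np.sqrt(p_hat * (1 - p_hat) / len(covered))
    return p_hat - half, p_hat + half

# Synthetic placeholder runs (m = 10 runs per batch, M = 1,000 batches):
rng = np.random.default_rng(4)
m, M, true_gap = 10, 1000, 1.0
G_I = true_gap + rng.normal(0.3, 1.0, size=(M, m))      # stand-in for A2RP gap estimates
G_J = true_gap + rng.normal(0.2, 1.0, size=(M, m))      # stand-in for A2RP-B gap estimates

bias_I, var_I, mse_I = bias_var_mse(G_I, true_gap)
bias_J, var_J, mse_J = bias_var_mse(G_J, true_gap)
print("p-value for bias reduction:", paired_reduction_pvalue(bias_I, bias_J))
print("p-value for MSE reduction:", paired_reduction_pvalue(mse_I, mse_J))
covered = rng.random(m * M) < 0.9                        # stand-in coverage indicators
print("coverage CI:", coverage_ci(covered))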
Figure 3.1 depicts a summary for all sample sizes of this comparison in terms of the percentage reduction in the bias and MSE of the optimality gap estimator and the width of the CI estimator. Tables 3.3, 3.4, and 3.5 detail the results for select sample sizes. In these tables, the appearance of ⇤ symbol in the columns that consider percentage reductions indicates that the hypothesis test on the di↵erence between the A2RP and A2RP-B estimators did not result in a statistically significant positive di↵erence. Conversely, the lack of a ⇤ symbol represents a statistically significant positive di↵erence. 56 80 80 80 70 70 70 60 60 60 50 50 50 40 40 40 30 30 30 20 20 20 10 10 10 0 0 0 −10 −10 −10 100 NV 300 PGP2 500 700 APL1P (a) % Red. in Bias 900 GBD 100 NV 300 PGP2 500 700 APL1P (b) % Red. in MSE 900 GBD 100 NV 300 PGP2 500 700 APL1P 900 GBD (c) % Red. in CI Width Figure 3.1: Percentage reductions between A2RP and A2RP-B in (a) bias of optimality gap estimator, (b) MSE of optimality gap estimator, and (c) CI width with respect to sample size n for optimal candidate solutions Figure 3.1(a) shows the percentage reduction between the biases of GI and GJ . The corresponding Table 3.3 provides the details. Columns ‘CI on Bias’ provide the 90% CI on the estimated biases of GI and GJ , respectively. The column ‘CI on Di↵. in Bias’ gives the 90% one-sided CI on the di↵erence in the biases formed via A2RP and A2RP-B. We use a one-sided CI here because our aim is to test if there is a significant reduction in bias. Finally, the last column shows the percentage reduction in the bias, which is also shown in Figure 3.1(a). The results illustrate that all the test problems have a statistically significant positive di↵erence in the bias of the optimality gap estimator, which is strong evidence that A2RP-B achieves its goal of reducing bias. The results for NV match the theory presented in Section 3.5, and we note the very fast convergence of the percentage reduction in the bias to 50%. For the other test problems, we observe a monotonic increase in the percentage reduction in the bias with sample size. A comparison between the biases of the A2RP or A2RP-B and SRP optimality gap estimators can be made for n 2 {50, 100, . . . , 500} by looking at the A2RP bias for sample size 2n (the SRP bias) and the A2RP (or A2RP-B) bias for sample size n. Although the bias reduction technique does not eliminate all of the additional 57 bias due to partitioning for APL1P, PGP2, and GBD, it does significantly reduce it. Focusing on GBD at n = 100, for example, indicates that the A2RP bias is about 2.4 times the bias of SRP, whereas A2RP-B reduces that ratio to about 1.4. The summary of results on the MSE of the optimality gap estimator is depicted in Figure 3.1(b), with details presented in Table 3.4. The first two columns of Table 3.4 give an indication of the relative contributions of the square of the bias and variance of the MSE. For instance, for NV at n = 100, we estimate that approximately 54% of the MSE of GI is the square of the bias of GI and the remaining 46% is the contribution from the variance of GI . Table 3.4 shows that the square of the bias is a significant proportion of the MSE for all test problems. This proportion is reduced under A2RP-B. Furthermore, we observe that the proposed bias reduction technique not only reduces the bias but also the variance of the optimality gap estimator. Like the bias, we observe increases in variance reduction as n increases. 
As a result, the percentage reduction in the MSE is notable for all test problems, and is roughly monotonically increasing. Finally, Figure 3.1(c) and Table 3.5 show the percentage reduction in the CI width at an optimal candidate solution. Because the optimality gap of an optimal solution is zero, reduction in the interval widths in this case is desirable. Our results indicate that for all test problems, there is a statistically significant reduction in the width of the CI. We again observe an increasing trend with sample size. 58 Problem n A2RP A2RP-B CI on Bias CI on Bias CI on Di↵. in Bias % Red. in Bias NV 100 200 400 600 800 0.33 0.17 0.08 0.06 0.04 ± ± ± ± ± 0.00 0.00 0.00 0.00 0.00 0.17 0.08 0.04 0.03 0.02 ± ± ± ± ± 0.00 0.00 0.00 0.00 0.00 0.16 0.08 0.04 0.03 0.02 0.00 0.00 0.00 0.00 0.00 48.55 49.29 49.53 49.91 49.78 PGP2 100 200 400 600 800 6.19 3.56 1.85 1.39 1.16 ± ± ± ± ± 0.06 0.04 0.03 0.02 0.02 5.45 3.11 1.47 1.05 0.86 ± ± ± ± ± 0.06 0.04 0.03 0.02 0.02 0.74 0.45 0.38 0.34 0.30 0.02 0.02 0.01 0.01 0.01 11.96 12.74 20.44 24.27 26.05 APL1P 100 200 400 600 800 86.46 44.79 22.88 15.09 11.09 ± ± ± ± ± 1.33 0.76 0.42 0.28 0.21 64.75 31.82 15.18 9.25 6.38 ± ± ± ± ± 1.11 0.61 0.33 0.22 0.15 21.71 12.97 7.70 5.84 4.71 0.76 0.43 0.23 0.15 0.12 25.11 28.95 33.65 38.71 42.49 GBD 100 200 400 600 800 4.38 1.84 0.74 0.43 0.29 ± ± ± ± ± 0.06 0.03 0.01 0.01 0.01 2.57 1.03 0.40 0.22 0.14 ± ± ± ± ± 0.04 0.02 0.01 0.01 0.00 1.80 0.82 0.34 0.20 0.15 0.03 0.02 0.01 0.00 0.00 41.20 44.33 45.70 47.19 51.04 Table 3.3: Bias of optimality gap estimator for optimal candidate solutions 59 A2RP A2RP-B % of MSE: (Bias)2 :Var. % of MSE (Bias)2 :Var. % Red. in (Bias)2 % Red. in Var. NV 100 200 400 600 800 54:46 53:47 52:48 52:48 51:49 38:62 36:64 35:65 34:66 34:66 73.50 74.26 74.51 74.89 74.75 47.44 48.39 48.60 48.85 48.98 61.59 62.09 62.09 62.25 62.24 PGP2 100 200 400 600 800 76:24 69:31 54:46 50:50 49:51 73:27 65:35 45:55 38:62 38:62 22.06 23.20 34.04 38.95 41.93 9.68 8.35 4.17 4.08 8.92 18.95 18.80 20.92 22.10 25.64 APL1P 100 200 400 600 800 57:43 54:46 51:49 49:51 49:51 53:47 50:50 47:53 45:55 44:56 42.81 48.20 54.03 59.98 64.86 29.17 32.62 38.47 45.54 50.76 36.54 40.58 45.82 51.96 56.95 GBD 100 200 400 600 800 65:35 60:40 52:48 45:55 41:59 60:40 50:50 39:61 32:68 29:71 64.79 67.98 68.31 69.07 72.08 54.47 53.49 46.38 44.17 46.72 60.85 61.77 57.54 55.32 57.17 Problem n % Red. in MSE Table 3.4: MSE of optimality gap estimator for optimal candidate solutions Problem n % Red. in CI Width NV 100 200 400 600 800 39.37 40.34 40.81 41.05 41.14 PGP2 100 200 400 600 800 10.06 9.55 6.84 17.39 14.53 APL1P 100 200 400 600 800 17.27 20.09 24.69 29.58 33.44 GBD 100 200 400 600 800 33.29 36.22 36.27 37.29 40.11 Table 3.5: CI estimator for optimal candidate solutions 60 Suboptimal Candidate Solution We now turn to the results of our experiments using the suboptimal candidate solutions outlined in Table 3.2. Tables 3.6, 3.7, and 3.8 summarize the same information as Tables 3.3, 3.4, and 3.5, with the addition of the coverage estimates for A2RP and A2RP-B in Table 3.8 for select sample sizes. Figure 3.2 shows plots of the percentage reductions in the bias and the MSE of the optimality gap estimator and in the CI width. Bias reduction does not depend on the candidate solution but Figures 3.1 and 3.2 indicate that reductions in the MSE and CI width di↵er from optimal to suboptimal candidate solutions. We now examine each in detail. 
Because the bias of the optimality gap estimator is independent of the candidate solution (only the zn⇤ term contributes to the bias), Figures 3.1(a) and 3.2(a), and Tables 3.3 and 3.6 should be identical. However, we note that there are slight di↵erences, although the overall trends are very similar. In particular, the columns ‘CI on Di↵. in Bias’ are the same in both Tables 3.3 and 3.6, but the percentage reductions in the bias can be slightly di↵erent. The suboptimal candidate solution appears to induce more variability in the optimality gap estimator, as illustrated by the wider CIs on the bias of the optimality gap estimator. Slight di↵erences—typically at higher sample sizes when the absolute value of the bias is small—result in slightly di↵erent values of the percentage reduction for the suboptimal candidate solution. Even though the bias of the optimality gap estimator is independent of the candidate solution, its variance, and hence MSE, depends on the candidate solution. Compared to the optimal candidate solution case, the square of the bias is much less a proportion of the MSE for all test problems. Under A2RP-B for NV, the square of the bias is a negligible proportion of the MSE. (Note that we round the percentages in the first two columns in Tables 3.4 and 3.7.) Another di↵erence between the optimal and suboptimal candidate solutions is that the variance reducing e↵ect of the bias reduction technique is typically only statistically significant at small sample 61 80 80 80 70 70 70 60 60 60 50 50 50 40 40 40 30 30 30 20 20 20 10 10 10 0 0 0 −10 −10 −10 100 NV 300 PGP2 500 700 APL1P (a) % Red. in Bias 900 GBD 100 NV 300 PGP2 500 700 APL1P (b) % Red. in MSE 900 GBD 100 NV 300 PGP2 500 700 APL1P 900 GBD (c) % Red. in CI Width Figure 3.2: Percentage reductions between A2RP and A2RP-B in (a) bias of optimality gap estimator, (b) MSE of optimality gap estimator, and (c) CI width with respect to sample size n for suboptimal candidate solutions sizes, and at a reduced rate. At higher sample sizes, the variance actually increases, although these increases were not found to be statistically significant. As a result, the MSE of the optimality gap estimator is reduced mainly at smaller sample sizes. One exception is PGP2, which exhibits MSE reduction across all values of n, and slight but statistically significant variance reduction at higher sample sizes. Table 3.8 indicates that the di↵erence between the width of the CI estimator for A2RP and A2RP-B is significant; however, the percentage reduction in the width is fairly small, with the exception of GBD for small sample sizes. The lowering of the coverage under A2RP-B is statistically significant for all test problems and sample sizes. However, the coverage under A2RP-B remain close to 90%, with the exception of PGP2. Note that PGP2 is known to have low coverage when A2RP is used (Bayraksan & Morton, 2006). A2RP-B reduces coverage for PGP2 but is still much higher than SRP, which yields coverage of 0.50-0.60 at the same candidate solution (Bayraksan & Morton, 2006). 62 Problem n A2RP A2RP-B CI on Bias CI on Bias CI on Di↵. in Bias % Red. 
in Bias NV 100 200 400 600 800 0.33 0.17 0.08 0.06 0.04 ± ± ± ± ± 0.00 0.00 0.00 0.00 0.00 0.17 0.08 0.04 0.03 0.02 ± ± ± ± ± 0.00 0.00 0.00 0.00 0.00 0.16 0.08 0.04 0.03 0.02 0.00 0.00 0.00 0.00 0.00 48.55 49.19 49.41 50.01 49.75 PGP2 100 200 400 600 800 6.30 3.59 1.85 1.41 1.16 ± ± ± ± ± 0.13 0.08 0.05 0.04 0.04 5.56 3.14 1.48 1.07 0.86 ± ± ± ± ± 0.12 0.07 0.05 0.04 0.04 0.74 0.45 0.38 0.34 0.30 0.02 0.02 0.01 0.01 0.01 11.75 12.63 20.39 23.90 25.95 APL1P 100 200 400 600 800 86.68 44.59 22.81 15.06 10.91 ± ± ± ± ± 2.75 1.99 1.43 1.19 1.02 64.97 31.62 15.11 9.22 6.20 ± ± ± ± ± 2.65 1.94 1.43 1.19 1.03 21.71 12.97 7.70 5.84 4.71 0.76 0.43 0.23 0.15 0.12 25.05 29.09 33.75 38.79 43.18 GBD 100 200 400 600 800 4.39 1.83 0.73 0.43 0.29 ± ± ± ± ± 0.06 0.03 0.02 0.02 0.01 2.58 1.01 0.40 0.23 0.14 ± ± ± ± ± 0.05 0.03 0.02 0.02 0.01 1.80 0.82 0.34 0.20 0.15 0.03 0.02 0.01 0.00 0.00 41.09 44.67 46.02 47.07 51.40 Table 3.6: Bias of optimality gap estimator for suboptimal candidate solutions 63 A2RP A2RP-B % of MSE: (Bias)2 :Var. % of MSE (Bias)2 :Var. % Red. in (Bias)2 % Red. in Var. % Red. in MSE NV 100 200 400 600 800 7:93 4:96 2:98 1:99 1:99 2:98 1:99 1:99 0:100 0:100 72.53 72.37 71.05 69.97 68.05 1.14 0.08 0.12* 0.08* 0.09* 6.38 2.91 1.33 0.88 0.65 PGP2 100 200 400 600 800 42:58 38:62 27:73 24:76 24:76 38:62 34:66 21:79 17:83 16:84 20.08 21.79 30.13 32.70 34.13 5.48 9.98 6.66 4.09 2.26 11.67 14.69 13.95 12.37 11.13 APL1P 100 200 400 600 800 23:77 17:83 12:88 11:89 10:90 16:84 12:88 10:90 10:90 09:91 33.84 30.62 23.21 18.55 13.87 4.01 1.06* 2.03* 3.06* 2.99* 12.00 5.20 1.59 0.34* 1.11* GBD 100 200 400 600 800 59:41 46:54 28:72 20:80 16:84 46:54 27:73 15:85 12:88 11:89 63.70 64.69 58.33 50.57 45.13 Problem n 40.01 27.39 12.89 5.68 3.18 53.77 44.93 27.04 16.15 11.13 Table 3.7: MSE of optimality gap estimator for suboptimal candidate solutions Problem n A2RP A2RP-B CI on Coverage CI on Coverage % Red. in CI Width NV 100 200 400 600 800 0.911 0.912 0.907 0.908 0.906 ± ± ± ± ± 0.000 0.000 0.000 0.000 0.000 0.885 0.894 0.894 0.897 0.896 ± ± ± ± ± 0.001 0.001 0.001 0.000 0.001 3.61 1.99 1.07 0.73 0.56 PGP2 100 200 400 600 800 0.772 0.821 0.707 0.886 0.826 ± ± ± ± ± 0.007 0.006 0.007 0.005 0.006 0.676 0.792 0.641 0.861 0.775 ± ± ± ± ± 0.008 0.007 0.008 0.006 0.007 4.56 3.67 10.14 6.85 8.46 APL1P 100 200 400 600 800 0.896 0.899 0.892 0.895 0.893 ± ± ± ± ± 0.005 0.005 0.005 0.005 0.005 0.868 0.867 0.862 0.871 0.872 ± ± ± ± ± 0.006 0.006 0.006 0.006 0.005 6.06 4.10 2.52 2.01 1.64 GBD 100 200 400 600 800 0.991 0.979 0.958 0.940 0.926 ± ± ± ± ± 0.002 0.002 0.003 0.004 0.004 0.974 0.939 0.908 0.895 0.891 ± ± ± ± ± 0.003 0.004 0.005 0.005 0.005 27.07 24.08 15.38 10.31 8.01 Table 3.8: CI estimator for suboptimal candidate solutions 64 3.7.4 Computation Time of Bias Reduction In this section, we investigate whether solving (PM) could be computationally prohibitive for large sample sizes as required by large-scale problems. To test the performance of A2RP-B on a problem of practical size, we conducted an experiment on the test problem SSN, which was described in Example 1.2. SSN has 86 independent stochastic parameters and 1070 scenarios. Fixing the candidate solution to a (suboptimal) solution obtained by solving a separate sampling problem and setting ↵ = 0.10, we ran A2RP-B for sample sizes ranging from n = 1, 000 to n = 5, 000. The tests were performed on a 1.66 GHz LINUX computer with 4GB memory. Table 3.9 shows the breakdown of A2RP-B running times in minutes. 
For the sampling approximations, we measured the time needed to compute the edge weights, allocate the scenarios into two subsets, solve the sampling problems, and produce the A2RP-B estimators. We also calculated the time Blossom V spent solving (PM). As can be seen in Table 3.9, solving (PM) is a small percentage of the total computational e↵ort required to implement A2RP-B. Since Blossom V’s running time depends only on the sample sized used in A2RP-B and not on the dimension of the random vector, we expect similar results for other large-scale problems. Note that the worst-case running time of Blossom V is believed to be O(n3 m) for a graph with n nodes and m edges (Kolmogorov, 2009). A more efficient implementation, O(mn log n), for solving (PM) for dense graphs like ours is given in (Mehlhorn & Schäfer, 2002); however, we do not pursue it here. n 1000 2000 3000 Sampling Approx. (PM) n 7.10 18.52 28.65 0.01 0.04 0.07 4000 5000 Sampling Approx. 44.29 56.18 (PM) 0.16 0.24 Table 3.9: Breakdown of A2RP-B computational time (in minutes) with respect to sample size n for SSN 65 3.7.5 E↵ect of Multiple Optimal Solutions In our final test, we applied A2RP-B to problems with multiple optimal solutions. It is known that such problems may behave di↵erently than those with a single optimal solution. For example, under appropriate conditions, Ezn⇤ z ⇤ = n o(n 1/2 1/2 E[inf x2X ⇤ Y (x)]+ ), where Y (x) is a normal random variable with mean zero and variance given ˜ see, e.g., the discussion after Theorem 10 in (Shapiro, 2003). This by var(f (x, ⇠)); indicates that as the set of optimal solutions X ⇤ gets larger, the bias might increase. To test the performance of A2RP-B on problems with multiple optimal solutions, we generated instances of NV with a discrete demand distribution and increasing set of optimal solutions. We set the cost of one newspaper to c = 5 and its selling price to r = 50. The demand ⇠˜ is set to take values in {0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, b}, each with probability 1/10, resulting in a piecewise linear objective function. The optimal solution set is given by X ⇤ = [4, b] and the optimal value is z ⇤ = 90. We considered three cases: (i) b = 4.1, giving a “narrow solution set of width 0.1, (ii) b = 5, leading to a “medium” solution set of width 1, and (iii) b = 10, resulting in a “wide” solution set of width 6. Computations were run using an optimal candidate solution, x̂ = 4, and a suboptimal candidate solution, x̂ = 3, with an optimality gap of 7.5 regardless of the size of X ⇤ . Tables 3.10 and 3.11, which follow the format of the tables in the previous two sections, highlight the main results of the experiments for the optimal and suboptimal candidate solutions, respectively. All reported results are for sample size of n = 500. X⇤ Narrow Medium Wide A2RP A2RP-B CI on Bias CI on Bias % Red. in Bias % Red. in MSE 0.04 ± 0.00 0.38 ± 0.00 2.26 ± 0.00 0.03 ± 0.00 0.27 ± 0.00 1.60 ± 0.00 29.16 29.16 29.16 24.17 24.17 24.17 % Red. in CI Width 24.12 24.11 24.11 Table 3.10: Summary of multiple optimal solutions results for an optimal candidate solution (n = 500) 66 X⇤ Narrow Medium Wide A2RP A2RP-B CI on Bias CI on Bias % Red. in Bias 0.04 ± 0.00 0.38 ± 0.00 2.26 ± 0.00 0.03 ± 0.00 0.27 ± 0.00 1.60 ± 0.00 29.26 29.17 29.16 % Red. in MSE 0.17 5.36 19.05 % Red. 
in CI Width 0.14 1.58 8.30 Table 3.11: Summary of multiple optimal solutions results for a suboptimal candidate solution (n = 500) As expected, the absolute value of the bias of the optimality gap estimator increases as the width of the optimal solution set increases. Nevertheless, the percentage reduction in the bias remains roughly constant at just below 30%. In the case of the optimal candidate solution, the percentage reductions in the MSE of the optimality gap estimator and the width of the CI estimator also do not appear to be a↵ected by solution set width. When using the suboptimal candidate solution, the MSE of the optimality gap estimator and the CI width are only slightly reduced when applying bias reduction; however, they do increase as the solution set width increases. This indicates that the bias reduction technique may be of particular use for problems with a large optimal solution set. 3.7.6 Discussion In this section, we summarize insights gained from our computational experiments and discuss our findings. • The percentage reduction in the bias of the optimality gap estimator tends to increase as n increases. We hypothesize that this is due to the stability result that motivates the bias reduction technique. Recall that Theorem 3.2 requires P and Q to be sufficiently close, and at larger sample sizes, we expect Pn and PJl , l = 1, 2, to be closer. The absolute value of the reduction in the bias, on the other hand, decreases as n increases. Please see the columns labeled ‘CI on Di↵. in Bias’ in Tables 3.3 and 3.6. 67 • The bias reduction technique works well when an optimal candidate solution is used. In this case, both the bias and the variance are reduced, resulting in a significant reduction in the MSE of the optimality gap point estimator. The width of the CI estimator formed via A2RP-B is also considerably reduced at an optimal solution. • At a suboptimal candidate solution, we observe that the variance of the optimality gap estimator becomes more significant relative to the bias. Bias reduction is not a↵ected, but the bias reduction technique reduces the variance, and hence the MSE, at smaller sample sizes. However, it can sometimes increase the variance at higher sample sizes, weakening the MSE reduction. The CI width and coverage are slightly reduced. • We also observed that in problems with multiple optima, the procedure at a suboptimal solution can be more e↵ective in MSE and CI width reduction as the set of optimal solutions gets larger. • The computational e↵ort required by the bias reduction technique, while increasing with sample size, is insignificant compared to the total computational e↵ort of a two-replication procedure for optimality gap estimation. Suppose we solve an independent sampling problem (or use any other method) to obtain a candidate solution. Fixing this solution, we apply A2RP to obtain an estimate of its optimality gap. If this estimate is large, then we do not know if it is a good solution or not. Note that even when an optimal solution is obtained, this estimate can be large due to bias or variance. Alternatively, the candidate solution itself may have a large optimality gap. Suppose the candidate solution obtained is indeed an optimal solution. 
Note that if an independent sampling problem is used to obtain the candidate solution, for the class of problems considered, the probability of obtaining an optimal candidate solution increases exponentially as the sample size 68 increases under appropriate conditions (Shapiro & Homem-de-Mello, 2000; Shapiro et al., 2002), and this probability can be quite high for small sample sizes for some problems. Then, the use of A2RP-B can significantly increase our ability to detect that this is an optimal solution. Our results indicate that A2RP-B reduces the bias, variance, and MSE of the optimality gap point estimator, and the width of the interval estimator, at an optimal solution. The risk in doing so is a decrease in the coverage at suboptimal solutions. The reduction in the bias remains same but the variance and MSE are mainly reduced at smaller sample sizes at suboptimal solutions, indicating that A2RP-B provides a more reliable point estimator at suboptimal solutions at smaller sample sizes. Among the procedures presented thus far in the dissertation, if identifying optimal solutions is of primary importance, then, we recommend A2RP-B. If, on the other hand, conservative coverage is the primary concern, then we recommend the Multiple Replications Procedure (MRP) of Mak et al. (Mak et al., 1999), which is known to be more conservative; see the computational results and also the preliminary guidelines in (Bayraksan & Morton, 2006). 3.8 Summary and Concluding Remarks In this chapter, we presented a bias reduction technique for a class of stochastic programs that is rooted in a stability result. The proposed technique partitions the observations by minimizing the Kantorovich metrics between the empirical measure of the original sample and the probability measures on the resulting partitioned observations. This amounts to solving a minimum-weight perfect matching problem, which is polynomially solvable in the sample size. The bias reduction technique is applied to the A2RP optimality gap estimators for a given candidate solution. Analytical results on an instance of a newsvendor problem and computations indicate that bias reduction technique can reduce the bias introduced by partitioning while main- 69 taining appropriate coverage. We showed that the optimality gap and SV estimators of A2RP-B are consistent and the CI estimator is asymptotically valid. Preliminary computational results suggest that the technique works well for optimal candidate solutions, decreasing both the bias and the variance of the optimality gap estimator, and hence the MSE. For suboptimal solutions, bias reduction is una↵ected but variance and MSE reduction are weakened. Coverage is slightly lowered after bias reduction. We conclude that the method presented in this chapter has the potential to produce more reliable estimators of the optimality gap and increase our ability to recognize optimal or nearly optimal solutions. The next chapter studies the use of some variance reduction schemes to improve optimality gap estimators, with and without the bias reduction technique presented in this chapter. 70 Chapter 4 Assessing Solution Quality with Variance Reduction In this chapter, we investigate the e↵ects of embedding alternative sampling techniques aimed at reducing variance in algorithms that assess solution quality via optimality gap estimators. In particular, we focus on AV and LHS and implement these schemes into MRP, SRP, and A2RP (outlined in Chapter 2). 
We also consider the combination of these variance reduction schemes with A2RP-B from Chapter 3. Since we now consider MRP and SRP in addition to A2RP, we fix the sample size per replication n for each algorithm and allow the total sample size to vary, so that the bias of the optimality gap estimator is not significantly increased for MRP compared to SRP and A2RP.

In addition to assumptions (A1)–(A3) described in Chapter 2, when considering LHS we assume

(A5) f(x, ξ) is uniformly bounded.

Assumption (A5), which implies (A2), is required in order to apply a CLT result for LHS. Other assumptions on f(x, ·), such as non-additivity, are needed for convergence to a non-degenerate normal; see, e.g., (Homem-de-Mello, 2008). We will discuss these in our setting later. We also assume that the components of the random vector ξ̃ are independent and each have an invertible cumulative distribution function (cdf) F_j, j = 1, ..., d_ξ. The test problems studied in Section 4.6 satisfy these conditions. As in Chapter 3, we suppress the dependence on x̂ and n in our notation.

This chapter is organized as follows. In Section 4.1, we formally define AV and LHS and provide an overview of the relevant literature. We implement these variance reduction schemes in MRP in Section 4.2. Section 4.3 updates SRP to include AV and LHS, and discusses the asymptotic properties of the resulting estimators. In addition to presenting A2RP with AV and LHS, Section 4.4 investigates the combination of AV and LHS with the bias reduction technique from Chapter 3. In Section 4.6, we present our computational experiments on a number of test problems. We conclude with a summary in Section 4.7.

4.1 Overview of Antithetic Variates and Latin Hypercube Sampling

Monte Carlo simulation is widely used to estimate expectations of random variables. The technique, which produces unbiased estimators, has a convergence rate of 1/√n, independent of dimension, and so a very large sample size may be required to achieve a small error. Variance reduction schemes typically aim to create an estimator of the expectation that is also unbiased but has a smaller variance than a standard Monte Carlo estimator.

The implementation of a variety of variance reduction schemes in a stochastic programming framework has been studied extensively. The goal is usually to reduce the variance of estimates of Ef(x, ξ̃) for a given x ∈ X or of z*_n, the estimator of the true optimal value of (SP). Shortly, we will focus on AV and LHS to reduce variance, but other variance reduction techniques have been studied as well. For example, Dantzig & Glynn (1990), Higle (1998), and Infanger (1992) consider importance sampling. Higle (1998), Pierre-Louis et al. (2011), and Shapiro & Homem-de-Mello (1998) apply control variates techniques to stochastic programs. Quasi-Monte Carlo sampling has also been considered (Drew & Homem-de-Mello, 2006; Homem-de-Mello, 2008; Koivu, 2005; Pennanen & Koivu, 2005). Drew (2007) develops padded sampling schemes that sample the most important variables with randomized quasi-Monte Carlo sampling and use Monte Carlo or LHS for the remaining variables. We will discuss additional literature on the use of AV and LHS in (SP) below. Our work differs from most of the literature in that we apply these techniques to estimators of the optimality gap rather than Ef(x, ξ̃) or estimates of the optimal value. The implications of this change of focus are discussed in later sections.
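Before detailing AV and LHS, it may help to fix the Monte Carlo baseline they are compared against. The short Python sketch below estimates an expectation by standard Monte Carlo and reports the usual standard-error estimate; the cost function and distribution are toy placeholders (not one of the test problems of Section 4.6), and the only point is to exhibit the 1/√n behavior referred to above.

import numpy as np

def mc_estimate(f, sampler, n, rng):
    """Standard Monte Carlo estimate of E[f(xi)] from n i.i.d. observations."""
    xi = sampler(n, rng)                       # n i.i.d. draws from P
    vals = f(xi)
    est = vals.mean()                          # unbiased point estimate
    std_err = vals.std(ddof=1) / np.sqrt(n)    # error decays like 1/sqrt(n), independent of dimension
    return est, std_err

# Toy illustration with a placeholder cost and distribution:
rng = np.random.default_rng(0)
f = lambda xi: np.maximum(xi.sum(axis=1) - 2.0, 0.0)   # a simple recourse-like cost at a fixed x
sampler = lambda n, rng: rng.uniform(size=(n, 5))       # d_xi = 5 independent U(0,1) components
for n in (100, 400, 1600):
    est, se = mc_estimate(f, sampler, n, rng)
    print(n, round(est, 4), round(se, 4))               # standard error roughly halves as n quadruples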
First, we give the details of these two sampling schemes and review the relevant literature in the remainder of this section. 4.1.1 Antithetic Variates The motivating idea behind AV is use to pairs of negatively correlated random varin n0 0 ables to reduce variance. Let n be even. To sample n observations {⇠˜1 , ⇠˜1 , . . . , ⇠˜2 , ⇠˜2 } from P using AV, perform the following procedure: AV 1. Sample n 2 n i.i.d. observations {ũ1 , . . . , ũ 2 } from a U (0, 1)d⇠ distribution. 0 2. For each dimension j = 1, . . . , d⇠ , create antithetic pairs by setting ũij = 1 ũij . 0 3. To transform to sampling from P , invert ũ by setting ⇠˜ji = Fj 1 (ũij ) and ⇠˜ji = 0 Fj 1 (ũij ), for i = 1, . . . , n/2 and j = 1, . . . , d⇠ . AV replaces the Monte Carlo estimator 1 n Pn i=1 f (x̂, ⇠˜i ) with ⌘ 2X1⇣ 0 f (x̂, ⇠˜i ) + f (x̂, ⇠˜i ) . n i=1 2 n/2 0 0 Since both ũi and ũi follow a U (0, 1)d⇠ distribution, ⇠˜i and ⇠˜i have distribution P . Therefore, AV produces an unbiased estimator of the mean, i.e., 2 3 n/2 ⇣ ⌘ X 2 1 0 ˜ E4 f (x, ⇠˜i ) + f (x, ⇠˜i ) 5 = Ef (x, ⇠). n i=1 2 The standard Monte Carlo estimator has variance 2 x̂ /n, where (4.1) 2 x̂ ⇣ ⌘ ˜ . := var f (x̂, ⇠) 73 We can express the variance of the AV estimator in terms of the Monte Carlo variance: 0 1 n/2 ⇣ ⌘ X 2 1 0 var @ f (x̂, ⇠˜i ) + f (x̂, ⇠˜ i ) A n i=1 2 ⌘ ⇣ ⌘ ⇣ ⌘⌘ 2 1⇣ ⇣ 0 0 ˜ ˜ ˜ ˜ = · var f (x̂, ⇠) + var f (x̂, ⇠ ) + 2 Cov f (x̂, ⇠), f (x̂, ⇠ n 4 ⇣ ⌘ 2 1 ˜ f (x̂, ⇠˜0 ) . = x̂ + Cov f (x̂, ⇠), n n Hence, AV reduces variance compared to Monte Carlo if ⇣ ⌘ 0 ˜ ˜ Cov f (x̂, ⇠), f (x̂, ⇠ ) < 0, (4.2) ˜ and f (x̂, ⇠˜0 ) are negatively correlated. It is well known that (4.2) holds i.e., if f (x̂, ⇠) when f (x̂, ·) is bounded and monotone in each component of ⇠ and f (x̂, ·) is not constant in the interior of ⌅; see, e.g., Theorem 4.3 in (Lemieux, 2009). The amount of variance reduction depends on how much negative correlation between ũ and ũ0 is preserved when (i) transforming to ⇠˜ and ⇠˜0 and (ii) after applying f (x̂, ·). It will be necessary to use the SLLN in later sections. The following theorem extends this result to AV: ˜ < 1. The SLLN holds for an AV random Theorem 4.3. Assume that E|f (x, ⇠)| 0 0 sample {⇠˜1 , ⇠˜1 , . . . , ⇠˜n/2 , ⇠˜n/2 }, i.e. i 2X1h i i0 ˜ ˜ ˜ a.s., as n ! 1. f (x, ⇠ ) + f (x, ⇠ ) ! Ef (x, ⇠), n i=1 2 n/2 Proof. The collections of random variables {⇠˜1 , . . . , ⇠˜n/2 } and {⇠˜1 , . . . , ⇠˜n/2 } contain 0 0 ˜ < 1, we have independent variables with identical distributions P . Since E|f (x, ⇠)| that E|f (x, ⇠˜0 )| < 1. Furthermore, AV produces unbiased estimators, and so we can apply the SLLN to each collection of random variables: 0 1 n/2 X 2 1 1 ˜A=1 P @ lim f (x, ⇠˜i ) = Ef (x, ⇠) n!1 n 2 2 i=1 74 and 0 1 n/2 X 2 1 1 0 ˜ A = 1. P @ lim f (x, ⇠˜i ) = Ef (x, ⇠) n!1 n 2 2 i=1 Therefore, 0 1 h i 2X1 0 ˜A P @ lim f (x, ⇠˜i ) + f (x, ⇠˜i ) = Ef (x, ⇠) n!1 n 2 i=1 0 1 n/2 n/2 X X 2 1 2 1 0 ˜ A = 1, = P @ lim f (x, ⇠˜i ) + lim f (x, ⇠˜i ) = Ef (x, ⇠) n!1 n n!1 n 2 2 i=1 i=1 n/2 and the result holds. ˜ x̂ being optimal Higle (1998) considers the use of AV when estimating Ef (x̂, ⇠), or near optimal, for two-stage stochastic linear programs with recourse. This class of stochastic programs satisfies the monotonicity requirement on f . This paper also provides an empirical study of AV (alongside other variance reduction techniques) and finds that AV can reduce variance. Freimer et al. (2012) also consider AV in the context of two-stage stochastic linear programs with recourse. 
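As a concrete illustration of steps AV 1–3 and the AV estimator above, the following Python sketch draws an antithetic sample for independent components with invertible marginal cdfs and averages the pair means. The exponential marginals and the cost function are placeholders used only for this example; they are not part of any test problem considered later.

import numpy as np

def av_sample(n, inv_cdfs, rng):
    """Antithetic sample of total size n (n even) for independent components.

    inv_cdfs is a list of the d_xi inverse marginal cdfs F_j^{-1}."""
    assert n % 2 == 0
    d = len(inv_cdfs)
    u = rng.uniform(size=(n // 2, d))            # step AV 1: n/2 i.i.d. U(0,1)^{d_xi} points
    u_anti = 1.0 - u                             # step AV 2: antithetic partners u' = 1 - u
    xi = np.column_stack([inv_cdfs[j](u[:, j]) for j in range(d)])          # step AV 3
    xi_anti = np.column_stack([inv_cdfs[j](u_anti[:, j]) for j in range(d)])
    return xi, xi_anti

def av_estimate(f, xi, xi_anti):
    """AV estimator: the average of the n/2 pair means (f(xi^i) + f(xi^{i'}))/2."""
    return (0.5 * (f(xi) + f(xi_anti))).mean()

# Placeholder example: three Exp(1) components and a cost that is monotone in each
# component, a case in which AV typically reduces variance.
rng = np.random.default_rng(1)
inv_cdfs = [lambda u: -np.log1p(-u)] * 3          # Exp(1) inverse cdf, F^{-1}(u) = -ln(1 - u)
f = lambda xi: np.maximum(xi.sum(axis=1) - 1.0, 0.0)
xi, xi_anti = av_sample(2000, inv_cdfs, rng)
print(av_estimate(f, xi, xi_anti))                # unbiased for Ef; pair averaging reduces variance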
Analytical results for the newsvendor problem show that AV can reduce the bias of zn⇤ , while variance may be increased or decreased depending on problem parameters. Computational experiments on large-scale problems indicate that AV can be e↵ective at reducing variance with minimal e↵ort. Koivu (2005) applies AV to zn⇤ and observes empirically that it can increase variance when the objective function is not monotone. However, when AV is e↵ective, combining it with randomized quasi-Monte Carlo sampling can lead to significant variance reduction. 4.1.2 Latin Hypercube Sampling While Monte Carlo sampling can produce high-quality estimators, particularly for large sample sizes, observations may be clustered together in such a way that the distribution is not sampled from evenly. Stratified sampling addresses this issue by 75 partitioning the distribution into strata and selecting a specified number of observations from each strata. We focus on a particular type of stratified sampling referred to as Latin hypercube sampling. To sample n observations {⇠˜1 , . . . , ⇠˜n } from P using LHS, perform the following procedure: LHS For each dimension j = 1, . . . , d⇠ , 1. Let ũij ⇠ U ( i n1 , ni ) for i = 1, . . . , n. 2. Define {⇡j1 , . . . , ⇡jn } to be a random permutation of {1, . . . , n}, where each of the n! permutations is equally likely. 3. To transform to sampling from P , invert by setting ⇠˜ji = F 1 ⇡i (ũj j ), for i = 1, . . . , n. We now provide an overview of properties of LHS that will be required in later sections. First, McKay et al. (1979) show that LHS gives an unbiased estimator, h i ˜ for i = 1, . . . , n, and so E 1 Pn f (x, ⇠˜i ) = Ef (x, ⇠). ˜ In i.e., Ef (x, ⇠˜i ) = Ef (x, ⇠) i=1 n addition, similarly to AV, the variance of an LHS estimator is no larger than that of a Monte Carlo estimator if f (x̂, ·) is monotone in each argument. For all measurable functions f (x̂, ·), Owen (1997) proves that " n # " n # X 1X n 1 1 i i var f (x̂, ⇠˜ ) var f (x̂, ⇠˜M C ) = n i=1 n 1 n i=1 n 1 2 x̂ , (4.4) 1 ˜n where {⇠˜M C , . . . , ⇠M C } is a standard Monte Carlo random sample. Therefore, LHS produces an estimator with a variance at most that of a Monte Carlo estimator with sample size n 1. Theorem 3 of Loh (1996) gives the following SLLN result for LHS: ˜ < 1. The SLLN holds for an LHS sample Theorem 4.5. Assume that E[f 2 (x, ⇠)] {⇠˜1 , . . . , ⇠˜n }, i.e. n 1X ˜ a.s., as n ! 1. f (x, ⇠˜i ) ! Ef (x, ⇠), n i=1 76 Additionally, a version of the CLT for LHS is presented in (Homem-de-Mello, 2008), which is based on Theorem 1 of Owen (1992): Theorem 4.6. Assume that f (x, ⇠) is uniformly bounded. A CLT holds for an LHS sample {⇠˜1 , . . . , ⇠˜n }, i.e. 1 n assuming var ⇣ P n 1 n Pn ˜ f (x, ⇠˜i ) Ef (x, ⇠) r i=1 ⇣ ⌘ ) N (0, 1), Pn 1 i ˜ var n i=1 f (x, ⇠ ) ˜i i=1 f (x, ⇠ ) ⌘ > 0, where N (0, 1) is a standard normal random variable and “)” denotes convergence in distribution. We note that if f (x, ·) is an additive function, i.e., f (x, ·) can be written as P d⇠ f (x, ⇠1 , . . . , ⇠d⇠ ) = C + j=1 fj (⇠j ), where f1 , . . . , fd⇠ are univariate functions and ⇣ P ⌘ C a constant, then var n1 ni=1 f (x, ⇠˜i ) ! 0 as n ! 1 and the CLT result holds in a degenerate form. We use this CLT result for asymptotic validity of the CI variants that use LHS and note that in the degenerate case, our main results remain una↵ected (see, e.g., proofs of Theorems 4.12 and 4.24 below). Both numerical and theoretical properties of LHS have been examined in the stochastic programming literature. Bailey et al. 
(1999) combine LHS with a response surface methodology to solve two-stage stochastic linear programs with recourse. Linderoth et al. (2006) use LHS to improve the calculation of zn⇤ , its bias, and an upper bound of z ⇤ for a set of large-scale test problems. Building on this work, Freimer et al. (2012) find that LHS is very e↵ective at reducing the bias, variance, and MSE of zn⇤ for a variety of two-stage stochastic linear programs. On the theoretical side, Homem-de-Mello (2008) studies the rates of convergence of estimators of optimal solutions and optimal values under LHS, and Drew & Homem-de-Mello (2012) examine large deviations properties of estimators obtained with LHS. Drew (2007) develops padded sampling schemes using randomized quasi-Monte Carlo sampling and either Monte Carlo sampling or LHS, and provides a central limit theorem in the case of 77 LHS. These schemes are used to approximately solve (SP), and their application when assessing solution quality via SRP is also studied. 4.2 Multiple Replications Procedure with Variance Reduction In this section, we present MRP with AV and MRP with LHS and discuss asymptotic properties of the resultant CI estimators. 4.2.1 Antithetic Variates 0 0 Let the random sample {⇠˜1 , ⇠˜1 , . . . , ⇠˜n/2 , ⇠˜n/2 } be an AV sample from P of total size n. The approximated stochastic program using this AV sample is ⌘ 2X1⇣ ⌘ 2X1⇣ 0 0 = min f (x, ⇠˜i ) + f (x, ⇠˜i ) = f (x⇤A , ⇠˜i ) + f (x⇤A ⇠˜i ) . x2X n 2 n i=1 2 i=1 n/2 zA⇤ n/2 (SPA ) The formal statement of the algorithm is as follows: MRP AV Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), an even sample size per replication n, and a replication size m. Output: A point estimator, its associated variance estimator, and a (1 ↵)-level confidence interval on G. Replace Steps 1.1 and 1.3 of MRP by 0 0 1.1. Sample observations {⇠˜l1 , ⇠˜l1 , . . . , ⇠˜ln/2 , ⇠˜ln/2 } from P using AV, indepen- dently of other replications. 1.3. Calculate 2X1h = f (x̂, ⇠˜li ) n i=1 2 n/2 GA,l 0 f (x⇤A,l , ⇠˜li ) + f (x̂, ⇠˜li ) i 0 f (x⇤A,l , ⇠˜li ) . 78 The optimality gap and SV estimators are now denoted ḠA (m) and s2A (m), respectively, and the CI estimator is updated accordingly. h P ⇣ ⌘i 1 ˜li ) + f (x, ⇠˜li0 ) = Ef (x, ⇠), ˜ and so we can apply the By (4.1), E n2 n/2 f (x, ⇠ i=1 2 same argument as in (2.1) to get EzA⇤ z ⇤ and hence EGA CLT implies P ✓ µx̂ ḠA (m) + Note that var ḠA (m) = 1 m tm 1,↵ sA (m) p m ◆ ⇡1 µx̂ . Therefore, the ↵. var(GA ). Our aim is to reduce the variance of each replication GA,l , l = 1, . . . , m, and hence reduce the variance of ḠA (m). In our case, removing subscript l for ease of notation, we have 0 n/2 X 2 1h 0 var(GA ) = var @ f (x̂, ⇠˜i ) f (x⇤A , ⇠˜i ) + f (x̂, ⇠˜i ) n i=1 2 i 1 0 f (x⇤A , ⇠˜i ) A n/2 ⇣ ⌘ 1 X i ⇤ ˜i i0 ⇤ ˜i0 ˜ ˜ = 2 var f (x̂, ⇠ ) f (xA , ⇠ ) + f (x̂, ⇠ ) f (xA , ⇠ ) + n i=1 ⇣h i 2 X ˜i ) f (x⇤ , ⇠˜i ) + f (x̂, ⇠˜i0 ) f (x⇤ , ⇠˜i0 ) , Cov f (x̂, ⇠ A A n2 i<j h i⌘ 0 0 f (x̂, ⇠˜j ) f (x⇤A , ⇠˜j ) + f (x̂, ⇠˜j ) f (x⇤A , ⇠˜j ) . It is not straightforward to further analyze this expression, due to the dependence n0 n 0 of x⇤A on all samples {⇠˜1 , ⇠˜1 , . . . , ⇠˜2 , ⇠˜2 }. In addition, as discussed in Section 4.1.1, f (x̂, ·) is monotone in each component of ⇠ for two-stage stochastic linear programs ˜ with recourse. However, monotonicity does not necessarily hold for f (x̂, ⇠) ˜ f (x⇤A , ⇠), and therefore we cannot immediately conclude that AV will reduce the bias of the MRP optimality gap estimator for this class of problems. 
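For comparison with the AV sketch in Section 4.1.1, the following Python sketch illustrates the LHS scheme defined in Section 4.1.2 for independent components with invertible marginal cdfs. Again, the exponential marginals are placeholders standing in for whatever marginals a particular problem uses.

import numpy as np

def lhs_sample(n, inv_cdfs, rng):
    """Latin hypercube sample of size n for independent components with
    inverse marginal cdfs F_j^{-1} (steps 1-3 of the LHS scheme in Section 4.1.2)."""
    d = len(inv_cdfs)
    xi = np.empty((n, d))
    for j in range(d):
        u = (np.arange(n) + rng.uniform(size=n)) / n   # step 1: one draw in each stratum ((i-1)/n, i/n)
        pi = rng.permutation(n)                        # step 2: uniformly random permutation of the strata
        xi[:, j] = inv_cdfs[j](u[pi])                  # step 3: transform to the j-th marginal
    return xi

# Placeholder example with three Exp(1) marginals:
rng = np.random.default_rng(2)
inv_cdfs = [lambda u: -np.log1p(-u)] * 3
xi = lhs_sample(8, inv_cdfs, rng)
print(np.sort(np.floor(8 * (1 - np.exp(-xi[:, 0])))))  # each of the 8 strata of the first marginal is hit exactly once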
Section 4.6 investigates the e↵ectiveness of MRP AV empirically for a number of two-stage programs. 79 4.2.2 Latin Hypercube Sampling Let the random sample {⇠˜1 , . . . , ⇠˜n } be an LHS sample from P . The approximated stochastic program using this LHS sample is n zL⇤ n 1X 1X = min f (x, ⇠˜i ) = f (x⇤L , ⇠˜i ). x2X n n i=1 i=1 (SPL ) The formal statement of the algorithm is as follows: MRP LHS Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), a sample size per replication n, and a replication size m. Output: A point estimator, its associated variance estimator, and a (1 ↵)-level confidence interval on G. Replace Step 1.1 of MRP by 1.1. Sample observations {⇠˜l1 , . . . , ⇠˜ln } from P using LHS, independently of other replications. The optimality gap and SV estimators are now denoted ḠL (m) and s2L (m), respectively, and the CI estimator is updated accordingly. As mentioned above, LHS produces unbiased estimators, i.e., E h P n 1 n ˜i i=1 f (x, ⇠ ) ˜ and so Ez ⇤ z ⇤ and EGL µx̂ . Therefore, by the CLT, = Ef (x, ⇠) L ✓ ◆ tm 1,↵ sL (m) p P µx̂ ḠL (m) + ⇡ 1 ↵. m Since var ḠL (m) = 1 m i var(GL ), reducing the variance of each replication GL,l , l = 1, . . . , m will reduce the variance of ḠL (m). The variance of a single replication can be expressed as n ⇣ ⌘ 1 X var(GL ) = 2 var f (x̂, ⇠˜i ) f (x⇤L , ⇠˜i ) + n i=1 ⇣h i h 2 X i ⇤ ˜i ˜ Cov f (x̂, ⇠ ) f (xL , ⇠ ) , f (x̂, ⇠˜j ) 2 n i<j f (x⇤L , ⇠˜j ) i⌘ . 80 As with AV, we cannot easily further analyze this expression, and the lack of monotonicity does not allow us to easily make statements regarding variance reduction at this point. Regardless, Section 4.6 observes significant benefits when using MRP LHS. 4.3 Single Replication Procedure with Variance Reduction This section defines SRP with AV and SRP with LHS and discusses the consistency and asymptotic validity of the optimality gap estimators. 4.3.1 Antithetic Variates The algorithmic statement for SRP with AV is as follows: SRP AV Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and an even sample size n. Output: A point estimator, its associated variance estimator, and a (1 confidence interval on G. Replace Steps 1, 3, and 4 of SRP by: 1. Sample observations {⇠˜1 , ⇠˜1 , . . . , ⇠˜n/2 , ⇠˜n/2 } from P using AV. 0 0 3. Calculate n/2 2X 1⇣ 0 GA = f (x̂, ⇠˜i ) + f (x̂, ⇠˜i ) n i=1 2 f (x⇤A , ⇠˜i ) 0 f (x⇤A , ⇠˜i ) ⌘ and s2A n/2 ⇣ X 1 1 0 = f (x̂, ⇠˜i ) + f (x̂, ⇠˜i ) n/2 1 i=1 2 1 2 (ḡA (x̂) ḡA (x⇤A )) f (x⇤A , ⇠˜i ) 2 , 0 f (x⇤A , ⇠˜i ) ⌘ ↵)-level 81 where ḡA (x) = 2 n Pn/2 1 ⇣ i=1 2 ⌘ i i0 ˜ ˜ f (x, ⇠ ) + f (x, ⇠ ) . 4. Output a one-sided confidence interval on G: " # z ↵ sA 0, GA + p . n/2 Note that the CI estimator (4.7) divides by (4.7) p p n/2 instead of n. This is because the AV sample used produces n/2 i.i.d. pairs of observations for each replication. As with MRP AV, the structure of the point estimator GA makes it difficult to analytically determine the e↵ect of AV on its variance. We now present theoretical results on the consistency and asymptotic validity of the SRP AV optimality gap estimators. First, some definitions: For a fixed x̂ 2 X, ⇣ ⌘ 2 ˜ + f (x̂, ⇠˜0 ) f (x, ⇠) ˜ f (x, ⇠˜0 )] , and denote the optidefine x̂,A (x) = var 12 [f (x̂, ⇠) mal solutions that minimize and maximize and x⇤max,A 2 arg maxx2X ⇤ 2 x̂,A (x), 2 x̂,A (x) by x⇤min,A 2 arg minx2X ⇤ 2 x̂,A (x) respectively. 0 0 Theorem 4.8. Assume x̂ 2 X, {⇠˜1 , ⇠˜1 , . . . , ⇠˜n/2 , ⇠˜n/2 } is an AV sample from distri- bution P , and (A1)–(A3) hold. Fix 0 < ↵ < 1 and consider SRP AV. 
Then, (i) zA⇤ ! z ⇤ , a.s., as n ! 1; (ii) GA ! G, a.s., as n ! 1; (iii) all limit points of x⇤A lie in X ⇤ , a.s.; (iv) 2 ⇤ x̂,A (xmin,A ) lim inf s2A lim sup s2A n!1 n!1 2 ⇤ x̂,A (xmax,A ), a.s. ˜ < 1. Therefore, noting that the SLLN Proof. (i) (A2) implies that E supx2X f (x, ⇠) holds for AV (Theorem 4.3), the result follows immediately from Theorem A1 of Rubinstein & Shapiro (1993). (ii) We have that GA = ḡA (x̂) ˜ a.s., zA⇤ . Additionally, ḡA (x̂) converges to Ef (x̂, ⇠), by Theorem 4.3. Furthermore, by part (i), zA⇤ converges to z ⇤ , a.s., as n ! 1. We ˜ conclude that GA ! Ef (x̂, ⇠) z ⇤ , a.s., as n ! 1. 82 ˜ a.s., on X, by (iii) (A1)–(A3) imply that ḡA (x) converges uniformly to Ef (x, ⇠), Lemma A1 of Rubinstein & Shapiro (1993). This result along with (i) implies (iii). (iv) With appropriate adjustments of notation, the proof is as in the proof of Proposition 1 in (Bayraksan & Morton, 2006). The next result demonstrates that the SRP AV CI estimator of the optimality gap is asymptotically valid. 0 0 Theorem 4.9. Assume x̂ 2 X, {⇠˜1 , ⇠˜1 , . . . , ⇠˜n/2 , ⇠˜n/2 } is an AV sample from distri- bution P , and (A1)–(A3) hold. Fix 0 < ↵ < 1 and consider SRP AV. Then, ! z ↵ sA lim inf P G GA + p 1 ↵. n!1 n/2 (4.10) Proof. When x̂ 2 X ⇤ , inequality (4.10) is trivial. Suppose x̂ 2 / X ⇤ . Then GA = ḡA (x̂) Observe that ḡA (x̂) ḡA (x⇤A ) ḡA (x̂) ḡA (x), 8x 2 X. ḡA (x⇤min,A ) is a sample mean of n/2 i.i.d. random variables and thus the CLT can be applied. The remainder of the proof follows the proof of Theorem 1 in (Bayraksan & Morton, 2006) with some minor adjustments of notation. 4.3.2 Latin Hypercube Sampling SRP with LHS is as follows: SRP LHS Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and a sample size n. Output: A point estimator, its associated variance estimator, and a (1 confidence interval on G. Replace Step 1 of SRP by: ↵)-level 83 1. Sample observations {⇠˜1 , . . . , ⇠˜n } from P using LHS. The optimality gap and sample estimators are denoted GL and s2L , respectively, P and the CI estimator is adjusted accordingly. Define f¯L (x) = n1 ni=1 f (x, ⇠˜i ). As described above, we are unable to easily quantify the e↵ect of LHS on variance via analytical methods. We now discuss the asymptotic behavior of the SRP LHS optimality gap estimators. These results were first presented in (Drew, 2007). Theorem 4.11. Assume x̂ 2 X, {⇠˜1 , . . . , ⇠˜n } is an LHS sample from distribution P , and (A1)–(A3) hold. Fix 0 < ↵ < 1 and consider SRP LHS. Then, (i) zL⇤ ! z ⇤ , a.s., as n ! 1; (ii) GL ! G, a.s., as n ! 1; (iii) all limit points of x⇤L lie in X ⇤ , a.s.; (iv) 2 ⇤ x̂ (xmin ) lim inf s2L lim sup s2L n!1 n!1 2 ⇤ x̂ (xmax ), a.s. ˜ < 1. Therefore, noting that the SLLN Proof. (i) (A2) implies that E supx2X f (x, ⇠) holds for LHS (Theorem 4.5), the result follows immediately from Theorem A1 of Rubinstein & Shapiro (1993). (ii) We have that GL = f¯L (x̂) ˜ a.s., zL⇤ . Additionally, f¯L (x̂) converges to Ef (x̂, ⇠), by Theorem 4.5. Furthermore, by part (i), zL⇤ converges to z ⇤ , a.s., as n ! 1. We ˜ conclude that GL ! Ef (x̂, ⇠) z ⇤ , a.s., as n ! 1. ˜ a.s., on X, by (iii) (A1)–(A3) imply that f¯L (x) converges uniformly to Ef (x, ⇠), Lemma A1 of Rubinstein & Shapiro (1993). This result along with (i) implies (iii). (iv) The remainder of the proof is as in the proof of Proposition 1 in (Bayraksan & Morton, 2006) with slight di↵erences in notation. The second result for SRP LHS proves the asymptotic validity of the CI estimator. 
We present a detailed proof of this result to highlight the changes required in this 84 case compared to the proof for SRP in (Bayraksan & Morton, 2006). Specifically, the proof must be updated to reflect the altered form of the CLT. We note that we adapt the proof in (Drew, 2007) to reflect our notation. Theorem 4.12. Assume x̂ 2 X, {⇠˜1 , . . . , ⇠˜n } is an LHS sample from distribution P , and (A1),(A3), and (A5) hold. Fix 0 < ↵ < 1 and consider SRP LHS. Then, ✓ ◆ z ↵ sL lim inf P G GL + p 1 ↵. (4.13) n!1 n ⇣ P ⌘ n 1 2 i i ˜ ˜ Proof. Let x̂,L (x) = var n i=1 [f (x̂, ⇠ ) f (x, ⇠ )] , where the samples are generated using LHS. When x̂ 2 X ⇤ , inequality (4.13) is trivial. Suppose x̂ 2 / X ⇤ , and recall that zL⇤ = minx2X f¯L (x). Thus, GL = f¯L (x̂) f¯L (x⇤L ) f¯L (x̂) f¯L (x), 8x 2 X. Replacing x by x⇤min 2 arg minx2X ⇤ x̂2 (x), we obtain ✓ ◆ z ↵ sL P GL + p G n ✓ ◆ z ↵ sL ⇤ ¯ ¯ P fL (x̂) fL (xmin ) + p G n ✓ ¯ ◆ (fL (x̂) f¯L (x⇤min )) G z↵ sL p =P , ⇤ n x̂,L (x⇤min ) x̂,L (xmin ) ! r (f¯L (x̂) f¯L (x⇤min )) G n 1 sL P z↵ ⇤ ⇤ n x̂,L (xmin ) x̂ (xmin ) where in (4.15) we assume var[f¯L (x̂) 2 ⇤ x̂,L (xmin ) > 0. f¯L (x⇤min )] = 0 and hence f¯L (x̂) Note that if 2 ⇤ x̂,L (xmin ) f¯L (x⇤min ) = E[f¯L (x̂) (4.14) (4.15) (4.16) = 0, then f¯L (x⇤min )] = G. It follows from (4.14) that (4.13) is again trivial. Inequality (4.16) holds because q 2 ⇤ (f¯L (x̂) f¯L (x⇤min )) G 2 ⇤ x̂ (xmin ) (see (4.4)). Let DL = , aL = nn 1 x̂ (xsL⇤ ) , and ⇤ ) x̂,L (xmin ) n 1 x̂,L (x min 0 < " < 1, and for the moment assume that ↵ 1/2 so that z↵ min 0. Then (4.15) can 85 be rewritten as P (DL z ↵ aL ) P (DL (1 ")z↵ , aL = P (DL (1 ")z↵ ) + P (aL P ({DL 1 ") 1 ")z↵ } [ {aL (1 ") 1 "}). (4.17) Taking limits, we obtain lim inf P n!1 where ✓ z ↵ sL G GL + p n ◆ ((1 ")z↵ ), denotes the distribution function of the standard normal. By part (iv) of Theorem 4.11, the last two terms in (4.16) both converge to 1 and cancel out. Since f¯L (x̂) f¯L (x⇤min ) is a sample mean of Latin hypercube random variables, by Theorem 4.6 the first term in (4.17) converges to ((1 ")z↵ ). Letting " shrink to zero gives the desired result, provided that ↵ 1/2. When ↵ x⇤max 2 arg maxx2X ⇤ 2 x̂ (x) 1/2, we replace x⇤min with in (4.15) and then use a straightforward variation of the above argument. We now update the algorithms and results of this section for the case of two replications of observations. 4.4 Averaged Two-Replication Procedure with Variance Reduction Our final set of algorithms in this chapter consider A2RP with variance reduction schemes. We also combine variance reduction with the bias reduction technique introduced in Chapter 3, in the hopes of further improving the quality of the optimality gap estimators. Section 4.6 examines the e↵ectiveness of this approach. 86 4.4.1 Antithetic Variates We state A2RP AV as the following modification of MRP AV. Note that the SV estimator in each replication is calculated like that of SRP AV; therefore, the SV estimator of A2RP AV is di↵erent to that of MRP AV with two replications. A2RP AV Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and an even sample size per replication n. Output: A point estimator, its associated variance estimator, and a (1 ↵)-level confidence interval on G. Fix m = 2 and replace Steps 1.3, 2, and 3 of MRP AV by: 1.3. Calculate GA,l and s2A,l . 2. Calculate the optimality gap and sample variance estimators by taking the average: G0A = 1 (GA,1 + GA,2 ) 2 and 0 s2A = 1 2 s + s2A,2 . 2 A,1 3. 
Output a one-sided confidence interval on Gx̂ : z ↵ s0 0, G0A + p A . n Similarly to SRP AV, the CI estimator (4.18) uses (4.18) p n instead of p 2n. State- ments about the consistency and asymptotic validity of the A2RP AV optimality gap estimators are as follows. 0 0 Theorem 4.19. Assume x̂ 2 X, {⇠˜l1 , ⇠˜l1 , . . . , ⇠˜ln/2 , ⇠˜ln/2 }, l = 1, 2, are two inde- pendent AV samples from distribution P , and (A1)–(A3) hold. Fix 0 < ↵ < 1 and consider A2RP AV. Then, (i) zA⇤ ! z ⇤ , a.s., as n ! 1; (ii) G0A ! G, a.s., as n ! 1; 87 (iii) all limit points of x⇤A lie in X ⇤ , a.s.; (iv) 2 ⇤ x̂,A (xmin,A ) 0 0 lim inf s2A lim sup s2A n!1 n!1 2 ⇤ x̂,A (xmax,A ), a.s. Proof. (i) The proof is as for SRP AV. (ii) Note that G0A can be represented as Theorem 4.3, 1 2 1 2 1 ⇤ (z 2 A,1 (ḡA,1 (x̂) + ḡA,2 (x̂)) ⇤ + zA,2 ). By ˜ a.s. Furthermore, by part (ḡA,1 (x̂) + ḡA,2 (x̂)) converges to Ef (x̂, ⇠), ⇤ ⇤ ˜ (i), 12 (zA,1 +zA,2 ) converges to z ⇤ , a.s., as n ! 1. We conclude that G0A ! Ef (x̂, ⇠) z ⇤ , a.s., as n ! 1. (iii) The proof is as for SRP AV. (iv) The proof follows that of SRP AV with minor modifications. Next, we consider the asymptotic validity of the A2RP AV CI estimator. 0 0 Theorem 4.20. Assume x̂ 2 X, {⇠˜l1 , ⇠˜l1 , . . . , ⇠˜ln/2 , ⇠˜ln/2 }, l = 1, 2, are two inde- pendent AV samples from distribution P , and (A1)–(A3) hold. Fix 0 < ↵ < 1 and consider A2RP AV. Then, lim inf P n!1 ✓ G G0A z ↵ s0 + pA n ◆ 1 ↵. (4.21) Proof. First, note that if x̂ 2 X ⇤ , then the inequality is satisfied automatically. Suppose now that x̂ 2 / X ⇤ . Then G0A for all x 2 X. By the CLT, ✓ p 1 n (ḡA,1 (x̂) + ḡA,2 (x̂)) 2 1 2 (ḡA,1 (x̂) + ḡA,2 (x̂)) 1 2 (ḡA,1 (x) + ḡA,2 (x)), 1 (ḡA,1 (x⇤min ) + ḡA,2 (x⇤min )) 2 G ◆ converges in distribution to a normal random variable with mean zero and variance 2 ⇤ x̂,A (xmin,A ). Also, lim inf n!1 s0A ⇤ x̂,A (xmin,A ) 1, a.s., by part (iv) of Theorem 4.19. The rest of the proof is analogous to that for SRP AV. 88 4.4.2 Latin Hypercube Sampling A2RP with LHS is as follows: A2RP LHS Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and a sample size per replication n. Output: A point estimator, its associated variance estimator, and a (1 ↵)2 -level confidence interval on G. Fix m = 2 and replace Steps 1.3, 2, and 3 of MRP LHS by: 1.3. Calculate GL,l and s2L,l . 2. Calculate the optimality gap and sample variance estimators by taking the average: G0L = 1 (GL,1 + GL,2 ) 2 and 0 s2L = 3. Output a one-sided confidence interval on Gx̂ : z ↵ s0 0, G0L + p L . n 1 2 sL,1 + s2L,2 . 2 (4.22) Due to the need to separately consider the two independent LHS replications in p the upcoming theoretical results, the CI estimator (4.22) is defined using n instead p of 2n. We now present results showing the consistency and asymptotic validity of the A2RP LHS optimality gap estimators. Theorem 4.23. Assume x̂ 2 X, {⇠˜l1 , . . . , ⇠˜ln }, l = 1, 2, are LHS samples from distribution P , and (A1)–(A3) hold. Fix 0 < ↵ < 1 and consider A2RP LHS. Then, (i) zL⇤ ! z ⇤ , a.s., as n ! 1; (ii) G0L ! G, a.s., as n ! 1; (iii) all limit points of x⇤L lie in X ⇤ , a.s.; 89 (iv) 2 ⇤ x̂ (xmin ) 0 0 2 ⇤ x̂ (xmax ), lim inf s2L lim sup s2L n!1 n!1 a.s. Proof. (i) The proof is as for SRP LHS. (ii) Note that G0L can be represented as Theorem 4.5, 1 2 f¯L,1 (x̂) + f¯L,2 (x̂) 1 2 1 ⇤ (z 2 L,1 ⇤ + zL,2 ). By ˜ a.s. Furthermore, by part f¯L,1 (x̂) + f¯L,2 (x̂) converges to Ef (x̂, ⇠), ⇤ ⇤ ˜ (i), 12 (zL,1 + zL,2 ) converges to z ⇤ , a.s., as n ! 1. We conclude that G0L ! Ef (x̂, ⇠) z ⇤ , a.s., as n ! 1. 
(iii) The proof is as for SRP LHS. (iv) The proof follows that of SRP LHS with minor modifications. The proof that the A2RP LHS CI estimator is asymptotically valid requires some significant changes compared to SRP LHS. While two replications of i.i.d. or AV observations of size n can be equivalently represented as one replication of i.i.d. or AV observations of size 2n, the same is not true for LHS. Observe that P (A/2+B/2 c) P (A/2 c/2)P (B/2 c/2) = P (A c)P (B c), for two independent random variables A and B and constant c. We make use of this observation to divide the group of all observations into two LHS samples so that we can apply Theorem 4.6, the CLT for LHS, on each sample. Note that the confidence level becomes (1 rather than 1 ↵)2 ↵. Theorem 4.24. Assume x̂ 2 X, {⇠˜l1 , . . . , ⇠˜ln }, l = 1, 2, are two independent LHS samples from distribution P , and (A1),(A3), and (A5) hold. Fix 0 < ↵ < 1 and consider A2RP LHS. Then, lim inf P n!1 Proof. Let 2 x̂,L,l (x) = var ✓ G G0L z ↵ s0 + pL n ⇣ P n 1 n ˜li i=1 [f (x̂, ⇠ ) ◆ (1 ↵)2 . (4.25) ⌘ f (x, ⇠˜li )] for l = 1, 2, where the sam- ples are generated using LHS. When x̂ 2 X ⇤ , inequality (4.25) is trivial. Suppose that x̂ 2 / X ⇤ . Then G0L 1 2 f¯L,1 (x̂) + f¯L,2 (x̂) 1 2 f¯L,1 (x) + f¯L,2 (x) , for all x 2 X. 90 Replacing x by x⇤min 2 arg minx2X ⇤ x̂2 (x), we obtain ✓ ◆ z↵ s0L 0 P GL + p G n ✓ 1 ¯ z ↵ s0 1 ¯ P fL,1 (x̂) f¯L,1 (x⇤min ) + fL,2 (x̂) f¯L,2 (x⇤min ) + p L 2 2 n ✓ ◆ 0 1 ¯ z↵ s 1 P fL,1 (x̂) f¯L,1 (x⇤min ) + p L G · 2 2 2 n ✓ ◆ 1 ¯ z↵ s0L 1 ⇤ ¯ P fL,2 (x̂) fL,2 (xmin ) + p G 2 2 2 n ✓ ¯ ◆ (fL,1 (x̂) f¯L,1 (x⇤min )) G z↵ s0L p =P · ⇤ n x̂,L,1 (x⇤min ) x̂,L,1 (xmin ) ✓ ¯ ◆ (fL,2 (x̂) f¯L,2 (x⇤min )) G z↵ s0L p P , ⇤ n x̂,L,2 (x⇤min ) x̂,L,2 (xmin ) 2 ⇤ x̂,L,l (xmin ) where in (4.27) we assume G ◆ (4.26) (4.27) > 0 for l = 1, 2. We will return to the cases where this does not hold below. By (4.4), P ✓ ¯ (fL,l (x̂) P ◆ f¯L,l (x⇤min )) G z↵ s0L p ⇤ n x̂,L,l (x⇤min ) x̂,L,l (xmin ) ! r (f¯L,l (x̂) f¯L,l (x⇤min )) G n 1 s0L z↵ , ⇤ ⇤ n x̂,L,l (xmin ) x̂ (xmin ) for l = 1, 2. The remainder of the proof follows along the same lines as for SRP LHS, except that we consider each independent LHS sample separately and multiply the individual probabilities at the end. Note that if 2 ⇤ x̂,L,l (xmin ) f¯L,l (x⇤min ) = E[f¯L,l (x̂) = 0, then var[f¯L,l (x̂) f¯L,l (x⇤min )] = G. If 2 ⇤ x̂,L,l (xmin ) from (4.26) that (4.25) is trivial. If just one of lim inf P n!1 ✓ G G0L f¯L,l (x⇤min )] = 0 and hence f¯L,l (x̂) z↵ s0L + p n ◆ = 0 for both l = 1, 2, it follows 2 ⇤ x̂,L,l (xmin ) 1 ↵ = 0, then (1 ↵)2 . 91 4.4.3 Antithetic Variates with Bias Reduction Reverting to the notation used in Chapter 3 and restricting our analysis to the problem class described in Section 3.2, which is a subset of the class of problems considered thus far in this chapter, we now embed the bias reduction technique in A2RP AV. However, to facilitate comparison with the other algorithms described in this chapter, 0 0 we consider an AV sample {⇠˜1 , ⇠˜1 , . . . , ⇠˜n , ⇠˜n } of size of 2n (rather than n), and define i Pn h 1 0 the empirical measure on all observations to be P2n (·) = 2n i=1 {⇠˜i } (·) + {⇠˜i } (·) . In order to produce two replications using AV sampling, we partition pairs of observations {⇠˜i , ⇠˜i } instead of individual points. Consider a partition of these n 0 pairs given by index sets S1 and S2 , where (i) S1 , S2 ⇢ {1, . . . , n} and S2 = (S1 )C , (ii) |S1 | = |S2 | = n/2, and 0 (iii) each ⇠˜i and ⇠˜i , i 2 S1 [ S2 , receives probability mass 1/n. 
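Before formalizing this partition as problem (PM AV) below, we sketch one way it can be computed in practice, following the minimum-weight perfect matching reduction from Chapter 3 applied to the concatenated pair vectors (ξ̃^i, ξ̃^{i'}): the n pair vectors are coupled by a minimum-weight perfect matching and the two endpoints of each matched couple are sent to different groups. The Python sketch below is illustrative only; the NetworkX matching call is just one possible exact solver, and the toy data are placeholders rather than an actual AV sample from a test problem.

import numpy as np
import networkx as nx

def partition_av_pairs(xi, xi_anti):
    """Split n antithetic pairs into index sets AJ1, AJ2 of size n/2 each.

    Each antithetic pair is treated as one concatenated vector; a minimum-weight
    perfect matching couples the n vectors, and the endpoints of every matched
    couple are placed in different groups (sketch of the Chapter 3 reduction)."""
    pairs = np.hstack([xi, xi_anti])               # row i is (xi^i, xi^{i'})
    n = pairs.shape[0]
    assert n % 2 == 0
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            dist = float(np.linalg.norm(pairs[i] - pairs[j]))
            G.add_edge(i, j, weight=-dist)         # negate: max-weight perfect matching <=> min total distance
    matching = nx.max_weight_matching(G, maxcardinality=True)
    aj1 = sorted(i for i, _ in matching)
    aj2 = sorted(j for _, j in matching)
    return aj1, aj2

# Toy usage (placeholder data standing in for an AV sample of n = 20 pairs in dimension 3):
rng = np.random.default_rng(3)
u = rng.uniform(size=(20, 3))
xi, xi_anti = u, 1.0 - u
AJ1, AJ2 = partition_av_pairs(xi, xi_anti)
print(len(AJ1), len(AJ2))                          # 10 10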
h i P Define PSl (·) = n1 i2Sl {⇠˜i } (·) + {⇠˜i0 } (·) for l = 1, 2. The Kantorovich metric is therefore µ̂d (P2n , PSl ) = min ⌘ ( XX i2S2 j2S1 X i2S2 0 k(⇠ i , ⇠ i ) 0 (⇠ j , ⇠ j )k⌘ij : X 2 2 ⌘ij = , 8j; ⌘ij = , 8i; ⌘ij n n j2S 1 0, 8i, j ) . for l = 1, 2. Observe that we now look at the distance between the AV pairs of observations by considering a random vector of both ⇠˜ and ⇠˜0 . We note that this might not be the only method of partitioning the observations, but it allows us to use independent pairs of observations, which then allows us to push through the theoretical results below. Thus, we wish to find an index set of size n/2 that solves the problem: min {µ̂d (P2n , PS1 ) : S1 ⇢ {1, . . . , n}, |S1 | = n/2.} (PM AV) 92 We denote an optimal solution to (PM AV) by AJ1 and let AJ2 = (AJ1 )C . The result⇥ ⇤ P ing probability measures are denoted PAJl , l = 1, 2, where PAJl = n1 i2AJl ⇠˜i + ⇠˜i0 . The algorithm is as follows: A2RP AV-B Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and an even sample size per replication n. Output: A point estimator, its associated variance estimator, and a (1 ↵)-level confidence interval on G. 0 0 1. Sample observations {⇠˜1 , ⇠˜1 , . . . , ⇠˜n , ⇠˜n } from P using AV. 2. Generate AJ1 and AJ2 by solving (PM AV), and produce PAJ1 and PAJ2 . 3. For l = 1, 2: ⇤ 3.1. Solve (SPAJl ) to obtain x⇤AJl and zAJ . l 3.2. Calculate: GAJl 2 X 1⇣ 0 = f (x̂, ⇠˜i ) + f (x̂, ⇠˜i ) n i2AJ 2 f (x⇤AJl , ⇠˜i ) 0 f (x⇤AJl , ⇠˜i ) l ⌘ and s2AJl X 1 ⇣ 1 0 = f (x̂, ⇠˜i ) + f (x̂, ⇠˜i ) n/2 1 i2AJ 2 f (x⇤AJl , ⇠˜i ) l 1 n where ḡAJl (x) = 2 n P 1 i2AJl 2 ḡAJl (x̂) ⇣ ḡAJl (x⇤AJl ) 2 0 f (x⇤AJl , ⇠˜i ) ⌘ , ⌘ i i0 ˜ ˜ f (x, ⇠ ) + f (x, ⇠ ) . 4. Calculate the optimality gap and sample variance estimators by taking the average; GAJ = 12 (GAJ1 + GAJ2 ) and s2AJ = 1 2 s2AJ1 + s2AJ2 . 5. Output a one-sided confidence interval on G: z↵ sAJ 0, GAJ + p . n (4.28) We now show that the estimators GAJ and s2AJ of A2RP AV-B are strongly consistent and that A2RP AV-B provides an asymptotically valid CI on the optimality gap. 93 These results require some minor adjustments to the proofs in Section 3.6. We first update Theorem 3.9 to demonstrate the weak convergence of the empirical probability measures PAJ1 and PAJ2 to P , a.s. 0 0 Theorem 4.29. Assume that {⇠˜1 , ⇠˜1 , . . . , ⇠˜n , ⇠˜n } is an AV sample from distribution P and (A4) holds. Then the probability measures on the partitioned sets obtained by solving (PM AV), PAJ1 and PAJ2 , converge weakly to P , the original distribution of ˜ a.s. ⇠, Proof. Since µ̂d is a metric, by the triangle inequality we have that µ̂d (P, PAJ1 ) µ̂d (P, P2n ) + µ̂d (P2n , PAJ1 ). Also, µ̂d (P2n , PAJ1 ) µ̂d (P2n , PA1 ), where PA1 is the empirical measure on the first n/2 antithetic pairs. This is because the partitioning of the observations via AJ1 minimizes the Kantorovich metric; hence, partitioning into two groups, each composed of antithetic pairs, provides an upper bound. Therefore, µ̂d (P, PAJ1 ) µ̂d (P, P2n ) + µ̂d (P2n , PA1 ), and by applying the triangle inequality again, we obtain µ̂d (P, PAJ1 ) µ̂d (P, P2n ) + µ̂d (P, P2n ) + µ̂d (P, PA1 ) = 2µ̂d (P, P2n ) + µ̂d (P, PA1 ). Noting that we can apply the SLLN to an AV sample (see Theorem 4.3), the rest of the proof proceeds as in Theorem 3.9, with minor adjustments of notation. We now show the consistency of the estimators GAJ and s2AJ . Theorem 4.30. Assume {⇠˜1 , ⇠˜1 , . . . , ⇠˜n , ⇠˜n } is an AV sample from distribution P , 0 0 and (A3) and (A4) hold. Fix 0 < ↵ < 1. 
Let n be even and consider A2RP AV-B. Then, (i) all limit points of x⇤AJl lie in X ⇤ , a.s., for l = 1, 2; 94 ⇤ (ii) zAJ ! z ⇤ , a.s., as n ! 1, for l = 1, 2; l (iii) GAJ ! G, a.s., as n ! 1; (iv) 2 ⇤ x̂,A (xmin,A ) lim inf s2AJ lim sup s2AJ n!1 n!1 2 ⇤ x̂,A (xmax,A ), a.s. Proof. The proofs for parts (i)–(iv) are as for A2RP-B, with minor modifications to notation. Next, we show the asymptotic validity of the CI estimator produced by A2RP AV-B, given in (4.28). 0 0 Theorem 4.31. Assume {⇠˜1 , ⇠˜1 , . . . , ⇠˜n , ⇠˜n } is an AV sample from distribution P , and (A3) and (A4) hold. Fix 0 < ↵ < 1. Let n be even and consider A2RP AV-B. Then, lim inf P n!1 ✓ G GAJ z↵ sAJ + p n ◆ 1 ↵. Proof. The proof is analogous to that for A2RP AV. 4.4.4 Latin Hypercube Sampling with Bias Reduction We conclude this section with our final algorithm, A2RP LHS-B. As with A2RP AVB, we assume the setup of Section 3.2. Note that in this setting, assumption (A5) is automatically satisfied (see the discussion in Section 3.2), so we can apply the CLT result for LHS. Some care is required when choosing how to produce the LHS sample. In Chapter 3, for A2RP-B, we sampled one large group of i.i.d. observations which was then split into two groups using the minimum weight perfect matching problem (PM). However, this approach is not applicable to our current setting. Note that a random partition of the original set of i.i.d. observations results in two i.i.d. samples of half the size; however, a random subset of an LHS sample is not itself an LHS sample, which is required by our theoretical result. See, for instance, the proof of 95 weak convergence of the empirical measures on each group of partitioned observations in Theorem 4.33 below. Instead, we sample two independent LHS replications of observations {⇠˜11 , . . . , ⇠˜1n } and {⇠˜21 , . . . , ⇠˜2n } and combine them into a larger set of 2n observations. For ease of notation, we relabel these observations as {⇠˜1 , . . . , ⇠˜2n }. We then partition the observations via the perfect matching problem (PM), as in Chapter 3, and refer to the associated index sets as LJ1 and LJ2 . The algorithm is as follows: 96 A2RP LHS-B Input: A candidate solution x̂ 2 X, a desired value of ↵ 2 (0, 1), and a sample size per replication n. Output: A point estimator, its associated variance estimator, and a (1 ↵)-level confidence interval on G. 1. Sample two replications of Latin hypercube observations of size n from P : n o {⇠˜11 , . . . , ⇠˜1n }, {⇠˜21 , . . . , ⇠˜2n } . Combine the observations and relabel as {⇠˜1 , . . . , ⇠˜2n }. 2. Generate LJ1 and LJ2 by solving (PM) using all the observations {⇠˜1 , . . . , ⇠˜2n }, and produce PLJ1 and PLJ2 . 3. For l = 1, 2: 3.1. Solve (SPLJl ) to obtain x⇤LJl . 3.2. Calculate: GLJl = 1 Xh f (x̂, ⇠˜i ) n i2LJ f (x⇤LJl , ⇠˜i ) l and s2LJl = 1 n X h⇣ 1 i2LJ where f¯LJl (x) = 1 n P f (x̂, ⇠˜i ) f (x⇤LJl , ⇠˜i ) l i2LJl ⌘ i f¯LJl (x̂) f¯LJl (x⇤LJl ) i2 , f (x, ⇠˜i ). 4. Calculate the optimality gap and sample variance estimators by taking the average; GLJ = 12 (GLJ1 + GLJ2 ) and s2LJ = 1 2 s2LJ1 + s2LJ2 . 5. Output a one-sided confidence interval on G: tn 1,↵ sLJ p 0, GLJ + . n (4.32) Theorem 4.33. Assume that {{⇠˜11 , . . . , ⇠˜1n }, {⇠˜21 , . . . , ⇠˜2n }} are two independent LHS samples from distribution P and (A4) holds. Then the probability measures 97 on the partitioned sets obtained by solving (PM), PLJ1 and PLJ2 , converge weakly to ˜ a.s. P , the original distribution of ⇠, Proof. 
Since µ̂d is a metric, by the triangle inequality we have that µ̂d (P, PLJ1 ) µ̂d (P, P2n ) + µ̂d (P2n , PLJ1 ). Also, µ̂d (P2n , PLJ1 ) µ̂d (P2n , PL,1 ), where PL,1 is the empirical measure on the first Latin hypercube sample. This is because the partitioning of the observations via LJ1 minimizes the Kantorovich metric; hence, partitioning into the two original Latin hypercube samples provides an upper bound. Therefore, µ̂d (P, PLJ1 ) µ̂d (P, P2n ) + µ̂d (P2n , PL,1 ), and by applying the triangle inequality again, we obtain µ̂d (P, PLJ1 ) µ̂d (P, P2n ) + µ̂d (P, P2n ) + µ̂d (P, PL,1 ) = 2µ̂d (P, P2n ) + µ̂d (P, PL,1 ). Noting that we can apply the SLLN to an LHS sample (see Theorem 4.5), the rest of the proof proceeds as in Theorem 3.9, with minor adjustments of notation. The following consistency results hold for the A2RP LHS-B estimators. Theorem 4.34. Assume {{⇠˜11 , . . . , ⇠˜1n }, {⇠˜21 , . . . , ⇠˜2n }} are two independent LHS samples from distribution P , and (A3) and (A4) hold. Fix 0 < ↵ < 1. Let n be even and consider A2RP LHS-B. Then, (i) all limit points of x⇤LJl lie in X ⇤ , a.s., for l = 1, 2; ⇤ (ii) zLJ ! z ⇤ , a.s., as n ! 1, for l = 1, 2; l (iii) GLJ ! G, a.s., as n ! 1; (iv) 2 ⇤ x̂ (xmin ) lim inf s2LJ lim sup s2LJ n!1 n!1 2 ⇤ x̂ (xmax ), a.s. 98 Proof. Noting that the SLLN holds for LHS (see Theorem 4.5), the proofs of all parts are as in A2RP-B, with appropriate adjustments of notation. Next, we show the asymptotic validity of the CI estimator produced by A2RP LHS-B, given in (4.28). As in Section 4.4.2, some modifications are required to the proof compared to Chapter 3 to be able to apply the CLT for LHS (Theorem 4.6). Theorem 4.35. Assume {{⇠˜11 , . . . , ⇠˜1n }, {⇠˜21 , . . . , ⇠˜2n }} are two independent LHS samples from distribution P , and (A3)–(A5) hold. Fix 0 < ↵ < 1. Let n be even and consider A2RP LHS-B. Then, lim inf P n!1 ✓ G GLJ z↵ sLJ + p n ◆ (1 ↵)2 . (4.36) Proof. When x̂ 2 X ⇤ , inequality (4.36) is trivial. Suppose now that x̂ 2 / X ⇤ . Then 1 ¯ 1 ¯ fLJ1 (x̂) + f¯LJ2 (x̂) fLJ1 (x⇤LJ1 ) + f¯LJ2 (x⇤LJ2 ) 2 2 1 ¯ 1 ¯ fLJ1 (x̂) + f¯LJ2 (x̂) fLJ1 (x⇤min ) + f¯LJ2 (x⇤min ) , 2 2 1 ¯ 1 ¯ = fL,1 (x̂) + f¯L,2 (x̂) fL,1 (x⇤min ) + f¯L,2 (x⇤min ) 2 2 P for all x 2 X, where f¯L,l (x) = n1 ni=1 f (x, ⇠˜li ), for l = 1, 2. The optimality gap GLJ = estimator GLJ is now expressed in terms of the two original LHS samples prior to bias reduction. The rest of the proof is analogous to that for A2RP LHS. 4.5 Summary of Key Di↵erences in Algorithms Table 4.1 summarizes the main di↵erences in the algorithms for assessing solution presented thus far. Column ‘Algorithm’ lists the procedure and column ‘Sampling’ specifies the sampling scheme used, where “BR” indicates that the bias reduction technique is applied after the initial sampling. Column ‘n’ indicates if the sample size per replication must be even and the number of replications is specified in column ‘m’. The di↵erences in the sample variance estimator are highlighted in column ‘SV’. 
The 99 Algorithm Sampling n m SV MRP MRP AV MRP LHS IID AV LHS even 30 30 30 standard standard standard SRP SRP AV SRP LHS IID AV LHS even 1 1 1 SRP SRP SRP A2RP A2RP AV A2RP LHS IID AV LHS even 2 2 2 average SRPs average SRPs average SRPs A2RP-B A2RP AV-B A2RP LHS-B IID, BR AV, BR LHS, BR even 2 2 2 average SRPs average SRPs average SRPs SE p pm pm m p pn p n/2 n p p2n pn n p p2n pn n Table 4.1: Test problem characteristics term “standard” refers to the usual sample variance of a number of optimality gap estimators, “SRP” denotes the SRP sample variance defined in (2.4), and “average SRPs” refers to the A2RP sample variance estimator, which is obtained by averaging two SRP sample variance estimators. The final column, ‘SE’ indicates the term used in the denominator when calculating the sample error. 4.6 Computational Experiments In previous sections of this chapter, we proved asymptotic results regarding the consistency and validity of estimators produced using variance reduction. We now compare the small-sample behavior of MRP, SRP, and A2RP using AV and LHS to the same algorithms using i.i.d. sampling. We also analyze the bias reduction technique of Chapter 3 in conjunction with AV and LHS in the case of A2RP. We first describe the large-scale test problems used in addition to NV, APL1P, GBD, and PGP2, which appeared in Section 3.7. The experimental setup, which is largely the same as in Section 3.7.2, is briefly discussed. We then present the results of our experiments and conclude by highlighting insights gained and providing guidelines for the use of 100 algorithms for optimality gap estimation with variance reduction. 4.6.1 Test Problems To compare the e↵ects of the variance reduction schemes described in the previous sections, we consider four large-scale test problems from the literature, DB1, SSN, 20TERM, and STORM, in addition to the four smaller test problems outlined in Section 3.7.1. Characteristics of these problems are summarized in Table 4.2. 20TERM is a vehicle allocation problem for a motor freight carrier with 40 independent stochastic parameters and 1.059 ⇥ 1012 scenarios (Hackney & Infanger, 1994; Mak et al., 1999). DB1, the vehicle allocation model in a single-commodity network of Donohue & Birge (1995) and Mak et al. (1999), has 46 stochastic parameters and 4.5 ⇥ 1025 scenarios. SSN, as described in Example 1.2 and Section 3.7.4, has 86 independent stochastic parameters and 1070 scenarios. STORM, described in (Mulvey & Ruszczyński, 1995), is an air freight scheduling model with 5118 scenarios generated by 118 independent stochastic parameters. All four test problems listed in Table 4.2 satisfy the required assumptions, but have not yet been solved exactly due to their size. Following the method in (Freimer et al., 2012), we estimated the optimal value of each problem by solving (SPn ) using LHS and a sample size of n = 50, 000. A suboptimal candidate solution x̂ was obtained for each test problem by solving a separate sampling problem of size n = 500 using ˜ was estimated by calculating the sample mean of f (x̂, ⇠) ˜ i.i.d. 
sampling, and Ef (x̂, ⇠) Problem # of 1st Stage Variables # of 2nd Stage Variables # of Stochastic Parameters # of Scenarios 20TERM DB1 SSN STORM 63 5 89 121 764 102 706 1259 40 46 86 118 1.059 ⇥ 1012 4.5 ⇥ 1025 1070 5118 Table 4.2: Large test problem characteristics 101 Problem Suboptimal x̂ Estimate of z ⇤ Estimate of G 20TERM (245.20, 19.37, 49.78, 3.89, 14.91, 37.53, 20.70, 1.24, 0.95, 10.27, 9.45, 33.53, 18.29, 13.65, 16, 10.87, 18, 14.25, 22.15, 12.04, 27.95, 12.70, 9.37, 23.53, 0, 0, 18.03, 9.20, 285.70, 0, 0.02, 1.45, 18.53, 0, 0, 0, 1.37, 0, 0.25, 0.15, 0, 19.70, 225.80, 5.63, 5.22, 9.11, 20.09, 5.47, 5.30, 17.76, 16.05, 16.73, 24.55, 6.47, 21.71, 17.35, 21, 19.13, 28, 18.75, 26.85, 16.96, 8.05) 254,311.30 9.24 DB1 (11, 14, 8, 11, 7) -17,717.08 1.08 SSN (0, 0.15, 36.84, 0, 17.96, 0.20, 0, 0, 9.42, 0, 1.81, 25.81, 65.96, 0, 0, 29.79, 2.72, 0, 17.89, 2.55, 14.13, 4.25, 21.96, 19.76, 0, 0, 2.58, 41.67, 15.80, 34.41, 28.76, 0, 19.50, 90.99, 0, 84.84, 0.98, 0, 18.81, 7.68, 19.89, 0, 34.06, 42.21, 5.42, 0, 4.25, 10.65, 0, 0, 1.99, 0, 9.42, 24.12, 13.70, 3.68, 0, 0.99, 3.74, 13.26, 0, 0, 7.91, 32.86, 0, 127.13, 0, 0, 0, 0.12, 0, 0, 0, 0, 2.68, 0.82, 2.12, 2.54, 3.42, 0, 0.74, 3.42, 1.11, 0.74, 2.70, 1.12, 1.23, 2.68, 0) 9.88 1.19 STORM (0, 0, 0, 0, 0, 0, 0, 0, 0.66, 3.34, 0, 0, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 8.27, 8, 0, 0, 0, 0, 0, 4.08, 7.92, 0, 0.99, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 28, 0, 4, 8, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.47, 0, 0, 0, 0, 0, 0, 0, 0, 3 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 4, 2.01, 0, 0, 0, 1.46, 0, 0, 0, 0, 0, 0, 0, 0, 0) 15,498,634.92 29.48 Table 4.3: Large test problem suboptimal candidate solutions with the same 50,000 realizations used to estimate the optimal value. The estimates ˜ were combined to give an estimate of the optimality gap G for of z ⇤ and Ef (x̂, ⇠) each candidate solution. Table 4.3 summarizes the candidate solutions used in our computational experiments, along with the estimates of the optimal value and the optimality gap of the candidate solution. 4.6.2 Experimental Setup We would primarily like to determine the reduction in the variance of the optimality gap estimator as well as the reduction in the SV estimator for finite sample sizes n. Since LHS, and to a lesser extent, AV, samples more evenly from P than i.i.d. sampling, bias may also be reduced, and so we consider the percentage reduction in the bias of the optimality gap estimator. The combined e↵ects on bias and variance are measured by the percentage reduction in the MSE of the optimality gap estimator. We also examine the CI estimator and associated coverage probability. Our experiments 102 were conducted following the same computational setup as that used in Chapter 3, so we highlight only the di↵erences in the setup. We set ↵ = 0.10 for all algorithms except for A2RP LHS and A2RP LHS-B. Recall that in those cases, the sample error is calculated using n rather than 2n, and the asymptotic lower bound on coverage is (1 that (1 ↵)2 . Setting ↵ = 0.051, so ↵)2 = 0.90 and the asymptotic lower bound on coverage is as for the other sampling schemes, will lead to a significant increase in the sample error and CI width due to the increase in the value of z↵⇤ . Instead, we set ↵ = 0.182 so that the sample error and CI width are scaled the same way as for SRP LHS, since p p ⇤ ⇤ z0.182 / n = z0.10 / 2n. 
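The stated value α = 0.182 is easy to verify numerically: the requirement z_α/√n = z_0.10/√(2n) amounts to z_α = z_0.10/√2. A short check in Python, using only the standard library and intended purely as an illustration:

import math
from statistics import NormalDist

nd = NormalDist()                      # standard normal
z_010 = nd.inv_cdf(1 - 0.10)           # z_{0.10}, about 1.2816
z_alpha = z_010 / math.sqrt(2)         # critical value needed so that z_alpha/sqrt(n) = z_{0.10}/sqrt(2n)
alpha = 1 - nd.cdf(z_alpha)
print(round(alpha, 3))                 # 0.182
print(round((1 - alpha) ** 2, 2))      # about 0.67, the asymptotic lower bound on coverage discussed below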
Note that the asymptotic lower bound on coverage for A2RP LHS and A2RP LHS-B is then (1 0.182)2 = 0.670; however the empirical results in the next section indicate the coverage remains above 0.90 and is often higher. For small-scale problems, we used the sample sizes n 2 {50, 100, 200, . . . , 1000} and set m and M to 10 and 100, respectively. For consistency, the computations for A2RP and A2RP-B were rerun with this smaller number of independent runs. Note also that the sample size per replication has been doubled compared to Chapter 3. Due to the increase in computation time, we set m = 4 and M = 25 for the largescale problems and considered the sample sizes n 2 {100, 500, 1000}. In this case, our estimates of the coverage probability of the CI estimator were obtained using the estimates of the optimality gap described in the previous section. 4.6.3 Results of Experiments We now examine the experimental results for all algorithms and sampling schemes. We note that when presenting CIs on coverage probabilities for each algorithm, a margin of error smaller than 0.001 is reported as 0.000. In this chapter, we consider equal values of n, the sample size per replication, for MRP, A2RP, and SRP, and so the bias, variance, and MSE of the optimality gap estimator are identical in all 103 three cases. Since our estimates are more accurate with a larger total sample size, we present the percentage reductions in bias, variance, and MSE when using AV and LHS only for MRP. Multiple Replications Procedure We begin by comparing MRP, MRP AV, and MRP LHS. Figure 4.1 presents the percentage reductions in the bias of the optimality gap estimator for MRP AV and MRP LHS compared to MRP for all test problems at a suboptimal solution (recall that bias is independent of the candidate solution used). For small-scale problems with the exception of PGP2, AV provides a moderate reduction in the bias, while LHS eliminates almost all bias. AV actually slightly increases bias for PGP2 for some sample sizes, whereas the reduction in bias is slight for LHS. Both sampling schemes reduce bias for all large-scale problems, although the e↵ect is lessened. Figure 4.2 summarizes the e↵ect of AV and LHS on the variance of the optimality gap estimator, for all test problems and candidate solutions. AV reduces variance substantially for NV, APL1P, and GBD at the optimal solution, and has a moderate e↵ect at the suboptimal solution. PGP2 demonstrates little reduction and sometimes an increase. The large-scale problems show varied reduction that lessens with increasing sample size. With the exception of PGP2, LHS removes almost all variance for the small-scale problems both at optimal and suboptimal solutions, and has a moderate to large e↵ect on the large-scale problems. Note that in the case of MRP, a decrease in variance of optimality gap estimator corresponds to a decrease in the SV estimator. Figure 4.3 shows the percentage reductions in the MSE of the optimality gap estimator for all problems and candidate solutions. The results, not surprisingly, are very similar to those in Figure 4.2. We note that the structure of the optimality gap estimator a↵ects the variance results, and therefore the MSE results, in a way that would not be observed if we 104 were solely considering the approximate optimal values zn⇤ , zA⇤ , and zL⇤ . The main ˜ and the issue is that the optimality gap estimator is the di↵erence between Ef (x̂, ⇠) estimator of z ⇤ (via zn⇤ , zA⇤ , and zL⇤ ). 
Note that the candidate solution x̂ also a↵ects the overall variance. We illustrate this e↵ect with AV, and note that the same trends can be present in LHS occasionally as well. Recall that i 2X1h ˜ + f (x̂, ⇠˜i0 ) f (x̂, ⇠) GA = n i=1 2 n/2 and so zA⇤ , 1 n/2 h i X 2 1 ˜ + f (x̂, ⇠˜i0 ) var (GA ) = var @ f (x̂, ⇠) zA⇤ A n i=1 2 0 1 n/2 h i X 2 1 ˜ + f (x̂, ⇠˜i0 ) A + var (z ⇤ ) = var @ f (x̂, ⇠) A n i=1 2 0 1 n/2 h i X 2 1 ˜ + f (x̂, ⇠˜i0 ) , z ⇤ A . 2 Cov @ f (x̂, ⇠) A n i=1 2 (4.37) 0 (4.38) It often happens that the absolute value of each term on the right-hand side of (4.38) is reduced significantly compared to using i.i.d. sampling, but the subsequent percentage reduction in the variance of the optimality gap estimator can be much reduced or even negative due to the subtraction of the covariance term. Table 4.4 gives an example for PGP2 with the optimal candidate solution and sample size n = 500. Terms ⇣ P h i⌘ n/2 1 2 i0 ˜ ˜ 1, 2, and 3 denote the quantities var n i=1 2 f (x̂, ⇠) + f (x̂, ⇠ ) , var (zA⇤ ), and ⇣ P h i ⌘ 1 ˜ + f (x̂, ⇠˜i0 ) , z ⇤ , respectively. 2 Cov n2 n/2 f (x̂, ⇠) A i=1 2 The results for the CI width are given in Figure 4.4. For the small-scale prob- lems, AV moderately reduces interval width while LHS produces intervals of near-zero width, apart from little to no impact for PGP2. Considering small-scale problems and suboptimal candidate solutions, both LHS and AV can reduce CI width, although the impact lessens with sample size. AV and LHS have a moderate impact for large-scale problems, again with decreasing e↵ect with increasing sample size. 105 Table 4.5 shows the coverage results for MRP, MRP AV, and MRP LHS at n = 500. We observe that coverage can both be increased and decreased slightly by AV and LHS; however, it never falls below the target threshold of 0.9, and is typically higher. MRP MRP AV Term 1 Term 2 Term 3 Var. 0.42 ± 0.03 0.12 ± 0.01 0.58 ± 0.04 0.25 ± 0.02 0.46 ± 0.03 0.14 ± 0.01 0.08 ± 0.01 0.08 ± 0.01 72.58 57.38 69.42 -4.00 % Red. Table 4.4: Breakdown of the percentage reduction in variance between MRP and MRP AV of the optimality gap estimator for optimal candidate solution, for PGP2 (n = 500) Problem NV PGP2 APL1P GBD 20TERM DB1 SSN STORM MRP 0.939 1.000 0.960 0.995 1.000 1.000 1.000 1.000 ± ± ± ± ± ± ± ± 0.012 0.000 0.010 0.010 0.000 0.000 0.000 0.000 MRP AV 0.928 0.997 1.000 0.954 1.000 1.000 1.000 1.000 ± ± ± ± ± ± ± ± 0.013 0.003 0.000 0.011 0.000 0.000 0.000 0.000 MRP LHS 0.969 0.999 0.949 1.000 1.000 0.980 1.000 1.000 ± ± ± ± ± ± ± ± 0.009 0.002 0.011 0.000 0.000 0.023 0.000 0.000 Table 4.5: MRP coverage for suboptimal candidate solutions (n = 500) 106 100 100 80 80 60 60 40 40 20 20 0 0 100 NV 300 500 PGP2 700 APL1P 900 100 GBD (a) AV: % Red. in Bias, Small 500 DB1 700 SSN 900 STORM (b) AV: % Red. in Bias, Large 100 100 80 80 60 60 40 40 20 20 0 300 20TERM 0 100 NV 300 PGP2 500 700 APL1P 900 GBD (c) LHS: % Red. in Bias, Small 100 20TERM 300 500 DB1 700 SSN 900 STORM (d) LHS: % Red. in Bias, Large Figure 4.1: Percentage reductions in bias of optimality gap estimator between MRP and (a) MRP AV for small problems, (b) MRP AV for large problems, (c) MRP LHS for small problems, and (d) MRP LHS for large problems, with respect to sample size n for suboptimal candidate solutions 107 100 100 100 80 80 80 60 60 60 40 40 40 20 20 20 0 0 0 −20 100 NV 300 500 PGP2 (a) AV: % Small Opt 700 APL1P Red. 900 −20 GBD in 100 NV 300 500 PGP2 Var., (b) AV: % Small Sub 700 APL1P Red. 
900 −20 GBD in 100 80 80 80 60 60 60 40 40 40 20 20 20 0 0 0 NV 300 PGP2 500 700 APL1P 900 GBD −20 100 NV 300 PGP2 500 700 APL1P 900 GBD −20 100 20TERM 500 DB1 Var., (c) AV: % Large Sub 100 100 300 20TERM 100 −20 100 300 700 SSN Red. 500 DB1 in 700 SSN 900 STORM Var., 900 STORM (d) LHS: % Red. in Var., (e) LHS: % Red. in Var., (f) LHS: % Red. in Var., Small Opt Small Sub Large Sub Figure 4.2: Percentage reductions in variance of optimality gap estimator between MRP and (a) MRP AV for small problems at optimal candidate solutions, (b) MRP AV for small problems at suboptimal candidate solutions, (c) MRP AV for large problems at suboptimal candidate solutions, (c) MRP LHS for small problems at optimal candidate solutions, (d) MRP LHS for small problems at suboptimal candidate solutions, (e) MRP LHS for large problems at suboptimal candidate solutions, with respect to sample size n 108 100 100 100 80 80 80 60 60 60 40 40 40 20 20 20 0 0 100 NV 300 500 PGP2 700 APL1P 900 0 100 GBD NV 300 500 PGP2 700 APL1P 900 100 GBD 300 20TERM 500 DB1 700 SSN 900 STORM (a) AV: % Red. in MSE, (b) AV: % Red. in MSE, (c) AV: % Red. in MSE, Small Opt Small Sub Large Sub 100 100 100 80 80 80 60 60 60 40 40 40 20 20 20 0 0 100 NV 300 PGP2 500 700 APL1P 900 GBD 0 100 NV 300 PGP2 500 700 APL1P 900 GBD 100 20TERM 300 500 DB1 700 SSN 900 STORM (d) LHS: % Red. in MSE, (e) LHS: % Red. in MSE, (f) LHS: % Red. in MSE, Small Opt Small Sub Large Sub Figure 4.3: Percentage reductions in MSE of optimality gap estimator between MRP and (a) MRP AV for small problems at optimal candidate solutions, (b) MRP AV for small problems at suboptimal candidate solutions, (c) MRP AV for large problems at suboptimal candidate solutions, (c) MRP LHS for small problems at optimal candidate solutions, (d) MRP LHS for small problems at suboptimal candidate solutions, (e) MRP LHS for large problems at suboptimal candidate solutions, with respect to sample size n 109 100 100 100 80 80 80 60 60 60 40 40 40 20 20 20 0 0 100 NV 300 500 PGP2 700 APL1P 900 0 100 GBD NV 300 500 PGP2 700 APL1P 900 100 GBD 300 20TERM 500 DB1 700 SSN 900 STORM (a) AV: % Red. in CI Width, (b) AV: % Red. in CI Width, (c) AV: % Red. in CI Width, Small Opt Small Sub Large Sub 100 100 100 80 80 80 60 60 60 40 40 40 20 20 20 0 0 100 NV 300 PGP2 500 700 APL1P 900 GBD 0 100 NV 300 PGP2 500 700 APL1P 900 GBD 100 20TERM 300 500 DB1 700 SSN 900 STORM (d) LHS: % Red. in CI Width, (e) LHS: % Red. in CI Width, (f) LHS: % Red. in CI Width, Small Opt Small Sub Large Sub Figure 4.4: Percentage reductions in CI width between MRP and (a) MRP AV for small problems at optimal candidate solutions, (b) MRP AV for small problems at suboptimal candidate solutions, (c) MRP AV for large problems at suboptimal candidate solutions, (c) MRP LHS for small problems at optimal candidate solutions, (d) MRP LHS for small problems at suboptimal candidate solutions, (e) MRP LHS for large problems at suboptimal candidate solutions, with respect to sample size n Single Replication Procedure We now discuss the computational results for SRP, SRP AV, and SRP LHS. Recall that the results on the bias, variance, and MSE of the optimality gap estimator are the same as those for MRP, so we do not present them here. Note as well that the SV estimator is no longer a multiple of the variance of the optimality gap estimator, and so we present the SV results separately. Figure 4.5 gives the percentage reductions in the SV estimator for all test problems and candidate solutions. 
We observe that AV has a moderate to significant effect in all cases. For small-scale problems at the optimal solutions, LHS nearly eliminates SV, with the exception of PGP2, where SV is increased for some sample sizes. At suboptimal solutions, LHS has a decreasing effect with sample size for GBD, little effect for NV, increases SV for APL1P and PGP2, and has a slight effect for the large-scale problems.

The percentage reductions in the CI width for all test problems and candidate solutions are presented in Figure 4.6. The results are very similar to those in Figure 4.5, although AV is less effective overall and can even increase CI width for small-scale problems at optimal solutions. This is not surprising because the SRP AV sampling error is calculated using √(n/2) rather than √n, increasing the sampling error and therefore the CI width relative to SRP with i.i.d. sampling.

The coverage results for SRP, SRP AV, and SRP LHS at n = 500 are given in Table 4.6. As expected per the discussion in Section 2.4, SRP exhibits some low coverage, particularly for PGP2 and DB1. In some cases AV can increase the coverage slightly, but at other times coverage is decreased as well. On the other hand, LHS mostly increases coverage. While this may seem surprising since LHS can decrease CI width, coverage still improves because the point estimators have significantly less variability under LHS.

Problem   SRP             SRP AV          SRP LHS
NV        0.907 ± 0.015   0.984 ± 0.007   1.000 ± 0.000
PGP2      0.501 ± 0.026   0.534 ± 0.026   0.550 ± 0.026
APL1P     0.873 ± 0.017   0.883 ± 0.017   1.000 ± 0.000
GBD       0.874 ± 0.017   0.889 ± 0.016   1.000 ± 0.000
20TERM    0.940 ± 0.039   0.930 ± 0.042   0.890 ± 0.051
DB1       0.690 ± 0.076   0.690 ± 0.076   0.660 ± 0.078
SSN       1.000 ± 0.000   1.000 ± 0.000   1.000 ± 0.000
STORM     0.920 ± 0.045   0.820 ± 0.063   1.000 ± 0.000

Table 4.6: SRP coverage for suboptimal candidate solutions (n = 500)

Figure 4.5: Percentage reductions in the SV estimator between SRP and (a) SRP AV, small problems, optimal candidate solutions; (b) SRP AV, small, suboptimal; (c) SRP AV, large, suboptimal; (d) SRP LHS, small, optimal; (e) SRP LHS, small, suboptimal; (f) SRP LHS, large, suboptimal; with respect to sample size n.
Figure 4.6: Percentage reductions in CI width between SRP and (a) SRP AV, small problems, optimal candidate solutions; (b) SRP AV, small, suboptimal; (c) SRP AV, large, suboptimal; (d) SRP LHS, small, optimal; (e) SRP LHS, small, suboptimal; (f) SRP LHS, large, suboptimal; with respect to sample size n.

Averaged Two-Replication Procedure

Finally, we discuss the results for A2RP with variance reduction. We first note that the values for the small test problems under A2RP and A2RP-B differ from those in Section 3.7.3. This is due to the change in the total sample size and the decreased number of independent runs; however, the trends remain the same, so we do not include those results here. For completeness, in Figure 4.7 we compare A2RP and A2RP-B for the large-scale problems, which were not considered in Chapter 3. A2RP-B does not markedly improve the point and interval estimators in this case.

Figure 4.8 gives the results for the MSE of the optimality gap estimator, which summarizes the bias and variance together in one measure, for A2RP AV-B and A2RP LHS-B. A2RP AV-B increases the percentage reduction in the MSE somewhat compared to A2RP AV, whereas A2RP LHS-B has a positive impact compared to A2RP LHS only in the case of PGP2. Both A2RP AV-B and A2RP LHS-B show significant improvement in the MSE compared to A2RP-B.

Although the values of the SV estimators for A2RP, A2RP AV, and A2RP LHS differ from those of SRP, the percentage reductions are the same since we are using the same sample size per replication. In addition, the percentage reduction in the CI width is the same in both cases; note that this is true for A2RP LHS because of our choice of α. Therefore, we only present the SV and CI estimator results for A2RP AV-B and A2RP LHS-B, in Figures 4.9 and 4.10, respectively.

Figure 4.7: Percentage reductions between A2RP and A2RP-B in (a) MSE of the optimality gap estimator, (b) the SV estimator, and (c) CI width, with respect to sample size n, for large problems at suboptimal candidate solutions.

A2RP AV-B has a noticeable effect on the estimators compared to A2RP AV only when the optimal candidate solution is used. A2RP LHS-B has no discernible impact in comparison to A2RP LHS for all test problems except PGP2 at an optimal solution; in that case, A2RP LHS-B offers moderate improvement over A2RP LHS. Compared to A2RP-B, A2RP AV and A2RP AV-B further reduce the SV estimator and the CI width, with a particularly positive impact for PGP2. A2RP LHS and A2RP LHS-B offer improvement compared to A2RP-B in some cases, mainly for the large-scale problems.

Coverage results are given in Table 4.7. As expected, the use of two replications results in higher coverage compared to SRP. AV further increases the coverage the majority of the time, and A2RP AV-B produces very similar coverage results to A2RP AV. The asymptotic validity results for A2RP LHS and A2RP LHS-B give an asymptotic lower bound of only 0.670.
However, our empirical results for small sample sizes indicate that LHS significantly exceeds this bound and in fact always improves coverage compared to i.i.d. sampling. As with SRP, this is due to the lessening of the variability of the point estimators under LHS. The coverage results of A2RP LHS-B are very similar to those for A2RP LHS, except for PGP2, where coverage drops somewhat.

Problem   A2RP            A2RP-B          A2RP AV         A2RP AV-B       A2RP LHS        A2RP LHS-B
NV        0.905 ± 0.015   0.897 ± 0.015   0.985 ± 0.006   0.986 ± 0.006   1.000 ± 0.000   1.000 ± 0.000
PGP2      0.745 ± 0.023   0.679 ± 0.024   0.784 ± 0.021   0.735 ± 0.023   0.807 ± 0.021   0.725 ± 0.023
APL1P     0.894 ± 0.016   0.877 ± 0.017   0.901 ± 0.016   0.893 ± 0.016   1.000 ± 0.000   1.000 ± 0.000
GBD       0.929 ± 0.013   0.891 ± 0.016   0.884 ± 0.017   0.883 ± 0.017   1.000 ± 0.000   1.000 ± 0.000
20TERM    0.950 ± 0.036   0.960 ± 0.032   0.970 ± 0.028   0.960 ± 0.032   0.980 ± 0.023   0.990 ± 0.016
DB1       0.860 ± 0.057   0.830 ± 0.062   0.930 ± 0.042   0.930 ± 0.042   0.920 ± 0.045   0.950 ± 0.036
SSN       1.000 ± 0.000   1.000 ± 0.000   1.000 ± 0.000   1.000 ± 0.000   1.000 ± 0.000   1.000 ± 0.000
STORM     0.960 ± 0.032   0.970 ± 0.028   0.970 ± 0.028   0.980 ± 0.023   1.000 ± 0.000   1.000 ± 0.000

Table 4.7: A2RP coverage for suboptimal candidate solutions (n = 500)

Figure 4.8: Percentage reductions in MSE of the optimality gap estimator between A2RP and (a) A2RP AV-B, small problems, optimal candidate solutions; (b) A2RP AV-B, small, suboptimal; (c) A2RP AV-B, large, suboptimal; (d) A2RP LHS-B, small, optimal; (e) A2RP LHS-B, small, suboptimal; (f) A2RP LHS-B, large, suboptimal; with respect to sample size n.
Figure 4.9: Percentage reductions in the SV estimator between A2RP and (a) A2RP AV-B, small problems, optimal candidate solutions; (b) A2RP AV-B, small, suboptimal; (c) A2RP AV-B, large, suboptimal; (d) A2RP LHS-B, small, optimal; (e) A2RP LHS-B, small, suboptimal; (f) A2RP LHS-B, large, suboptimal; with respect to sample size n.

Figure 4.10: Percentage reductions in CI width between A2RP and (a) A2RP AV-B, small problems, optimal candidate solutions; (b) A2RP AV-B, small, suboptimal; (c) A2RP AV-B, large, suboptimal; (d) A2RP LHS-B, small, optimal; (e) A2RP LHS-B, small, suboptimal; (f) A2RP LHS-B, large, suboptimal; with respect to sample size n.

Timings

In this section, we compare the computational effort required by i.i.d. sampling, AV, and LHS. For each sampling method, we generated samples with sizes ranging from n = 1,000 to n = 5,000 to facilitate comparison with the timing results in Section 3.7.4. The tests were performed on a 1.66 GHz Linux computer with 4 GB of memory. Table 4.8 presents the results; note that the table lists only the time required to generate the samples. As expected, the time taken by both i.i.d. sampling and AV increases roughly linearly with sample size. AV slightly decreases the computational burden compared to i.i.d. sampling because half as many uniform random numbers need to be generated; this is well known, see, e.g., Chapter 4 of (Lemieux, 2009). In our implementation, due to the shuffling across each component of the observations, LHS can increase the sampling time by several orders of magnitude, but the time required is still insignificant compared to the time required to solve the sampling problem (SP_n) for large-scale problems.

   n      IID       AV        LHS
 1000    0.0012    0.0010    0.0090
 2000    0.0023    0.0018    0.0247
 3000    0.0034    0.0027    0.0515
 4000    0.0044    0.0035    0.0824
 5000    0.0056    0.0043    0.1269

Table 4.8: Computational time (in seconds) to generate samples under IID, AV, and LHS, with respect to sample size n
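For reference, the following minimal numpy sketch shows one way to generate the three kinds of samples on the unit hypercube and time them, in the spirit of Table 4.8; the per-component permutation in the LHS generator is what drives the extra cost noted above. The uniforms would still need to be mapped through the appropriate inverse distribution functions, and this is an illustrative sketch rather than the code used for the reported timings.

```python
import time
import numpy as np

def iid_uniforms(n, d, rng):
    return rng.random((n, d))

def av_uniforms(n, d, rng):
    u = rng.random((n // 2, d))
    return np.vstack([u, 1.0 - u])   # antithetic partner of row i is row i + n/2

def lhs_uniforms(n, d, rng):
    # one stratum per observation in each coordinate, permuted independently per coordinate
    cells = np.argsort(rng.random((n, d)), axis=0)   # random permutation of 0..n-1 in each column
    return (cells + rng.random((n, d))) / n

rng = np.random.default_rng(0)
for gen in (iid_uniforms, av_uniforms, lhs_uniforms):
    start = time.perf_counter()
    gen(5000, 20, rng)
    print(f"{gen.__name__}: {time.perf_counter() - start:.4f} s")
```

Even with the extra permutation work, generation time remains negligible next to solving the sampling problems for the large-scale instances.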
4.6.4 Discussion

In this section, we highlight insights gained from our computational experiments and provide guidelines for the use of the various sampling schemes when estimating optimality gaps.

• Both AV and LHS can be effective at reducing the variance of the optimality gap estimator and the SV estimator, as well as lessening the bias of the optimality gap estimator. LHS outperforms AV in almost all situations.

• The structure of the optimality gap estimator can lessen the amount of variance reduction achieved, particularly for AV (recall the discussion on page 103).

• As expected, the coverage associated with the CI estimator increases with the number of replications. LHS can further improve coverage, despite reducing the width of the CI estimator, because of the reduced variability of the optimality gap and SV estimators.

• The combination of the bias reduction technique from Chapter 3 with A2RP AV and A2RP LHS does not appear to offer significant additional benefits beyond the variance reduction schemes on their own.

• LHS can increase the time required to generate a random sample by several orders of magnitude compared to AV or i.i.d. sampling; however, the LHS sampling time is still trivial compared to that of solving the sampling problems.

Given the observations above, we recommend the use of LHS to improve the reliability of estimators produced by algorithms that assess solution quality. As per the discussion in Section 3.7.6, we generally recommend using A2RP over SRP for improved coverage. However, MRP provides the most conservative coverage results, and so the use of MRP is recommended if coverage is of highest importance.

4.7 Summary and Concluding Remarks

In this chapter, we combined algorithms for assessing solution quality with sampling schemes used for variance reduction and with the bias reduction scheme of Chapter 3. In each case, we showed that the optimality gap and SV estimators are consistent and the CI estimator is asymptotically valid. Computational experiments indicate that both AV and LHS can improve the reliability of the estimators, with LHS in particular eliminating or nearly eliminating bias and variance in some cases. We note, however, that the structure of the optimality gap estimator can result in less variance reduction than if the sampled optimal value z*_n were considered on its own. Neither sampling scheme requires undue computational effort. We conclude that implementing variance reduction schemes in MRP, A2RP, and SRP can improve the performance of the algorithms without a loss of computational efficiency.

The next chapter studies the use of SRP with variance reduction schemes within a sequential sampling framework. Note that until now, we used a fixed candidate solution x̂ ∈ X and a fixed sample size n to run the procedures. In the next chapter, both of these will change iteratively until a stopping condition that depends on the optimality gap estimator is satisfied.

Chapter 5
Sequential Sampling with Variance Reduction

The procedures for assessing solution quality discussed so far in this dissertation are static, in that the candidate solution, x̂ ∈ X, and the sample size, n, are fixed and given as input at the beginning of the algorithms. In this chapter, we allow both x̂ and n to be adaptive and aim to solve (SP) in an iterative fashion. The sequential sampling procedure we consider, which appears in (Bayraksan & Morton, 2011), is based on a sequence of candidate solutions. At each step, the quality of the current candidate solution is assessed using a method that generates an optimality gap estimate and its associated variance estimate, like the ones investigated earlier in this dissertation. The procedure terminates when the value of the optimality gap estimator of the candidate solution falls below a threshold dictated by the value of the SV estimator.
Given the success of AV and LHS in improving estimator reliability, as demonstrated in Chapter 4 and in the literature, in this chapter we adapt the sequential sampling methodology to use these variance reduction schemes when estimating the optimality gaps of the candidate solutions. We also generate the sequence of candidate solutions using LHS with the aim of producing higher-quality candidate solutions. We note that MRP is computationally burdensome in a sequential setting if the same sample size per replication n as for SRP is used, and too biased if n is divided into 25-30 batches. We also note that the LHS variant of A2RP requires significant changes to the existing theory, so we leave it for future work and instead focus on SRP. To establish the desired theoretical properties, minor changes to the theory are required when using SRP AV to assess solution quality, and more significant adaptations are required when using SRP LHS.

This chapter is organized as follows. An overview of the literature related to sequential sampling is given in the next section. Section 5.2 summarizes the particular sequential sampling procedure being considered and specifies the necessary assumptions. Section 5.3 applies the variance reduction schemes to the sequential sampling procedure and discusses theoretical results. Section 5.4 presents computational results, and Section 5.5 concludes with a summary of our findings.

5.1 Literature Review

In general, Monte Carlo sampling-based sequential procedures operate in the following way. Rather than using a fixed sample size, the algorithms proceed iteratively: at each iteration, the observations generated so far are analyzed, and either the procedure is terminated or the sample size is increased and the procedure continues. The use of sequential procedures has been studied extensively in statistics (Chow & Robbins, 1965; Ghosh et al., 1997; Nadas, 1969) and in simulating stochastic systems (Glynn & Whitt, 1992; Kim & Nelson, 2001, 2006; Law, 2007).

In the context of stochastic programming, sequential sampling procedures iteratively increase the sample size used when solving Monte Carlo sampling-based approximations (SP_n) of (SP). These methods rely on stopping rules to determine when to terminate with a high-quality solution. When using such algorithms to (approximately) solve stochastic programs, it is important to determine (i) what sample size to use at a given iteration and (ii) when to terminate the algorithm in order to obtain high-quality solutions. We highlight work from the literature that addresses these goals.

First, Shapiro (2003) summarizes methods to estimate the sample size required to solve a single approximate stochastic program (SP_n) with a desired accuracy. These techniques are based on rates of convergence of optimal solutions and large deviations theory. However, they can provide overly conservative estimates compared to the sample sizes required in practice (Verweij et al., 2003).

Returning to a sequential approach, Homem-de-Mello (2003) examines the rate at which sample sizes must grow to guarantee the consistency of sampling-based estimators of z*. A natural question then is how best to allocate the sample sizes at each iteration. One aim is to minimize expected total computational cost (Byrd et al., 2012; Polak & Royset, 2008; Royset, 2013).
In another approach, Royset & Szechtman (2013) study the relationship between sample size selection and the effort required to solve (SP_n) as the computational budget increases, in order to maximize the rate of convergence to the optimal value. Some papers study sample size allocation with the goal of achieving convergence of optimal solutions. For example, in the context of stochastic root finding and simulation optimization, Pasupathy (2010) provides guidance on how to select sample sizes in order to guarantee the convergence of solutions of retrospective-approximation algorithms.

As mentioned above, sequential sampling procedures require stopping rules in addition to methods for determining sample sizes. Stopping rules have been described in the literature for certain sampling-based methods for stochastic programs (Dantzig & Infanger, 1995; Higle & Sen, 1991a, 1996a; Norkin et al., 1998; Shapiro & Nemirovski, 2005). Furthermore, Homem-de-Mello et al. (2011) discuss variance reduction schemes and stopping criteria for a stochastic dual dynamic programming algorithm to solve multistage stochastic programs. With the exception of stochastic quasi-gradient methods (see the survey by Pflug (1988)), however, these stopping rules do not necessarily control the quality of the solution obtained and are not analyzed in a sequential fashion. Morton (1998) studies stopping rules that control solution quality for algorithms based on asymptotically normal optimality gap estimators. The asymptotic normality assumption can be restrictive for stochastic programs, and the procedure of Bayraksan & Morton (2011), discussed in detail in the next section, removes this requirement. Bayraksan & Pierre-Louis (2012) consider fixed-width sequential stopping rules for interval estimators, and Pierre-Louis (2012) develops a sequential approximation method using both sampling approximation and deterministic bounding with similar stopping rules.

5.2 Overview of a Sequential Sampling Procedure

We now review the sequential sampling procedure of Bayraksan & Morton (2011) on which our work is based. The material presented in this section is taken largely from (Bayraksan & Morton, 2009). We consider the following basic framework to solve (SP):

Step 1: Generate a candidate solution.
Step 2: Assess the quality of the candidate solution.
Step 3: Check the stopping criterion. If satisfied, stop. Else, go to Step 1.

In the context of stochastic programming with Monte Carlo sampling-based approximations, improving the candidate solution typically involves generating one or more additional samples of ξ̃. In Step 1, a candidate solution can be generated by a number of methods that solve approximations of (SP). This includes, but is not limited to, solving a sequence of sampling problems (SP_{m_k}) at iteration k with increasing sample sizes, m_k → ∞ as k → ∞, and setting the solutions x*_{m_k} as the candidate solutions, i.e., x̂_k = x*_{m_k} at iteration k. Additionally, any method that satisfies the assumptions detailed below can be used to assess solution quality; we focus on SRP in this chapter.

Before we formally state the sequential sampling procedure, we first make the following definitions and assumptions. Let {ξ̃¹, ξ̃², ..., ξ̃ⁿ} be a random sample of size n. Suppose we have at hand an optimality gap estimator G_n(x) and an SV estimator s²_n(x) ≥ 0. We define

D_n(x) = \frac{1}{n}\sum_{i=1}^{n}\bigl[f(x,\tilde{\xi}^{i}) - f(x^{*}_{\min},\tilde{\xi}^{i})\bigr].    (5.1)
Note that the expected value of D_n(x) is G_x, and its variance under i.i.d. sampling is σ²(x)/n, where σ²(x) = var[f(x,ξ̃) − f(x*_min, ξ̃)] and x*_min ∈ arg min_{y∈X*} var[f(x,ξ̃) − f(y,ξ̃)]. Observe that the definitions of σ(x) and x*_min differ slightly from those used in Chapters 3 and 4.

Throughout this chapter, we focus on problems that satisfy assumptions (A1) and (A3) of Chapter 2 and (A5) of Chapter 4 (recall that (A5) implies (A2)). In particular, our computational results are on a subset of problems from Chapters 3 and 4. That said, in this chapter we work with more detailed assumptions and point to when these are satisfied under (A1)-(A3) and (A5):

(A6) The sequence of feasible candidate solutions {x̂_k} has at least one limit point in X*, a.s.

(A7) Let {x_k} be a feasible sequence with x as one of its limit points, and let the sample size n_k satisfy n_k → ∞ as k → ∞. Then, lim inf_{k→∞} P(|G_{n_k}(x_k) − G_x| > δ) = 0 for any δ > 0.

(A8) G_n(x) ≥ D_n(x), a.s., for all x ∈ X and n ≥ 1.

(A9) lim inf_{n→∞} s²_n(x) ≥ σ²(x), a.s., for all x ∈ X.

(A10) √n (D_n(x) − G_x) ⇒ N(0, σ²(x)) as n → ∞ for all x ∈ X, where N(0, σ²(x)) is a normal random variable with mean zero and variance σ²(x).

Briefly, the assumptions are: (i) the algorithm used in Step 1 eventually generates at least one optimal solution (assumption (A6)), (ii) the statistical estimators of the optimality gap and its variance have desired properties such as convergence (assumptions (A7)-(A9)), and (iii) the sampling is done in such a way that a form of the CLT holds (assumption (A10)). More specifically, assumption (A7) ensures that the optimality gap estimator G_n(x) converges in probability, uniformly in x, to G_x, the true optimality gap. Similarly, (A9) requires that a form of convergence holds for the sample variance estimator s²_n(x). Assumption (A8) ensures the correct direction of the bias in the estimation of the optimality gap. Assumption (A6) is satisfied for optimal solutions to (SP_{m_k}) with m_k → ∞ under i.i.d. sampling; see, e.g., (Shapiro, 2003). The same is also true for the class of problems we consider when the samples are generated using LHS, as we do in our computational results in this chapter; see (Homem-de-Mello, 2008). Assumption (A7) is satisfied under (A1)-(A3) (Bayraksan & Morton, 2006). Assumption (A8) is satisfied as a direct consequence of the optimization that occurs in SRP. Uniform convergence in x of the relevant sample means to their expectations is a sufficient condition for (A9), and, as discussed in Chapter 4, this requirement is satisfied within our framework. Furthermore, the sampling schemes considered in this chapter, namely i.i.d. sampling, AV, and LHS, satisfy assumption (A10) (see Section 4.1).

At iteration k of sequential sampling, we are given a candidate solution x̂_k ∈ X (from Step 1). First, we select a sample size n_k. To achieve this sample size, we can either draw a newly generated sample {ξ̃¹, ..., ξ̃^{n_k}}, or we can augment the previously obtained observations with {ξ̃^{n_{k−1}+1}, ..., ξ̃^{n_k}}. The resampling frequency k_f controls at which iterations new observations are generated. We then estimate the optimality gap of x̂_k through G_{n_k}(x̂_k) and s_{n_k}(x̂_k) (Step 2). To simplify notation, from this point on we usually suppress dependence on x̂_k and n_k and simply write G_k = G_{x̂_k} for the true gap, σ²_k = σ²(x̂_k), D_k = D_{n_k}(x̂_k), and G_k = G_{n_k}(x̂_k) and s_k = s_{n_k}(x̂_k) for the estimators. Let h′ > 0 and ε′ > 0 be two scalars.
In Step 3, we check the following stopping criterion and terminate the sequential sampling at iteration

T = \inf\{k \ge 1 : G_k \le h'\, s_k + \varepsilon'\},    (5.2)

i.e., we stop the first time the ratio of G_k to s_k falls below h′ > 0 plus a small positive number. Let h > h′. We select the sample size at iteration k according to

n_k \ \ge\ \Bigl(\frac{1}{h-h'}\Bigr)^{2}\bigl(c_p + 2p\ln^{2}k\bigr),    (5.3)

where c_p = \max\Bigl\{2\ln\Bigl(\sum_{j=1}^{\infty} j^{-p\ln j}\Big/\sqrt{2\pi}\,\alpha\Bigr),\ 1\Bigr\}. Here, p > 0 is a parameter we can choose, which affects the number of samples we generate. Under this formula, the sample size grows of order O(ln² k) in the iteration count k.

When we stop at iteration T according to (5.2), the sequential sampling procedure provides an approximate solution, x̂_T, and a CI on its optimality gap, G_T, as [0, h s_T + ε], where ε > ε′. Note that ε and ε′ are very small, e.g., 10⁻⁷, and h and h′ are the more important parameters in selecting initial sample sizes and determining when to stop. We will discuss how to select these parameters in more detail in our computational experiments.

A formal statement of the sequential sampling procedure using SRP under i.i.d. sampling is as follows. Observe that the sequence of steps is more detailed than at the beginning of the section.

Sequential Sampling Procedure

Input: Values for h > h′ > 0, ε > ε′ > 0, and p > 0; a method that generates candidate solutions {x̂_k} with at least one limit point in X*; a resampling frequency k_f (a positive integer); and a desired value of α ∈ (0, 1).

Output: A candidate solution x̂_T and a (1 − α)-level confidence interval on its optimality gap, G_T.

0. (Initialization) Set k = 1, calculate n_k according to (5.3), and sample i.i.d. observations {ξ̃¹, ..., ξ̃^{n_k}} from P.

1a. Solve (SP_{n_k}) using {ξ̃¹, ..., ξ̃^{n_k}} to obtain x*_{n_k}.

1b. Calculate

G_k = \frac{1}{n_k}\sum_{i=1}^{n_k}\bigl[f(\hat{x}_k,\tilde{\xi}^{i}) - f(x^{*}_{n_k},\tilde{\xi}^{i})\bigr]

and

s_k^{2} = \frac{1}{n_k-1}\sum_{i=1}^{n_k}\Bigl[\bigl(f(\hat{x}_k,\tilde{\xi}^{i}) - f(x^{*}_{n_k},\tilde{\xi}^{i})\bigr) - G_k\Bigr]^{2}.

2. If G_k ≤ h′ s_k + ε′, then set T = k and go to 4.

3. Set k = k + 1 and calculate n_k according to (5.3). If k_f divides k, then sample observations {ξ̃¹, ..., ξ̃^{n_k}} independently of the samples generated in previous iterations. Else, sample n_k − n_{k−1} observations {ξ̃^{n_{k−1}+1}, ..., ξ̃^{n_k}} from P. Go to 1.

4. Output candidate solution x̂_T and a one-sided confidence interval on G_T:

[0,\ h\,s_T + \varepsilon].    (5.4)

Note that a resampling frequency of k_f = 1 generates n_k new observations independent of the observations from the previous iteration. Secondly, it is important to note that the quality statement (5.4) provides a larger bound (h s_T + ε) on G_T than the bound used as a stopping criterion (h′ s_T + ε′) in (5.2). Such inflation of the CI statement, relative to the stopping criterion, is fairly standard when using sampling methods with a sequential nature (Chow & Robbins, 1965; Glynn & Whitt, 1992). A third remark is that the stopping rule (5.2) and the corresponding quality statement (5.4) are written relative to the standard deviation. This allows us to stop with a larger optimality gap estimate when the variability of the problem is large, compared to a tighter stopping rule when the variability is low. Consequently, the quality statement regarding x̂_T is tighter when variability is low and looser when variability is high.
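A minimal sketch of the outer loop of the procedure above is given below, assuming user-supplied routines generate_candidate (Step 1) and assess_srp (Step 2, returning G_k and s_k); these names are hypothetical, the constant c_p is taken as an input computed from its definition in (5.3), and the resampling-frequency logic of Step 3 is omitted for brevity.

```python
import math

def sample_size(k, h, h_prime, p, c_p):
    # smallest n_k satisfying (5.3): n_k >= (c_p + 2 p ln^2 k) / (h - h')^2
    return math.ceil((c_p + 2.0 * p * math.log(k) ** 2) / (h - h_prime) ** 2)

def sequential_srp(generate_candidate, assess_srp, h, h_prime, p, c_p,
                   eps=2e-7, eps_prime=1e-7, max_iter=1000):
    """Skeleton of the sequential sampling procedure (k_f logic omitted).

    generate_candidate(k)  -> candidate solution x_hat_k          (user-supplied)
    assess_srp(x_hat, n_k) -> (G_k, s_k) from SRP with n_k obs.   (user-supplied)
    """
    for k in range(1, max_iter + 1):
        n_k = sample_size(k, h, h_prime, p, c_p)
        x_hat = generate_candidate(k)
        G_k, s_k = assess_srp(x_hat, n_k)
        if G_k <= h_prime * s_k + eps_prime:      # stopping rule (5.2)
            return x_hat, (0.0, h * s_k + eps)    # one-sided CI (5.4) on the gap
    raise RuntimeError("no stopping within max_iter iterations")
```

The same loop accommodates the AV and LHS variants described later by swapping in the corresponding sample-size formula and assessment routine.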
When the sample sizes are increased according to (5.3) and the procedure stops at iteration T according to (5.2), the CI (5.4) is asymptotically valid provided D_n(x) satisfies a finite moment generating function (MGF) assumption. Next, we state the MGF assumption, and then we summarize this result along with the fact that sequential sampling stops in a finite number of iterations, a.s.:

(A11) \sup_{n\ge 1}\ \sup_{x\in X}\ E\Bigl[\exp\Bigl(\gamma\,\frac{D_n(x)-G_x}{\sigma(x)/\sqrt{n}}\Bigr)\Bigr] < \infty for all |\gamma| \le \gamma_0, for some \gamma_0 > 0.

Assumption (A11) holds, for instance, when X is compact, f(x, ξ) is uniformly bounded (assumption (A5)), and {ξ̃¹, ξ̃², ..., ξ̃ⁿ} is an i.i.d. sample from P. More generally, (A11) holds for the problem class described in Chapter 3; see (Bayraksan & Morton, 2011) for further details. The theorem below summarizes properties of the sequential procedure.

Theorem 5.5. Let ε > ε′ > 0, h > h′ > 0, and 0 < α < 1. Consider the above sequential sampling procedure, where the sample size is increased according to (5.3) and the procedure stops at iteration T according to (5.2).
(i) Assume (A6) and (A7) are satisfied. Then, P(T < ∞) = 1.
(ii) Assume (A8)-(A11) are satisfied. Then,

\liminf_{h \downarrow h'} P\bigl(G_T \le h\,s_T + \varepsilon\bigr) \ \ge\ 1-\alpha.    (5.6)

Part (i) of Theorem 5.5 implies that if the algorithm used in Step 1 eventually produces an optimal solution (assumption (A6)) and the optimality gap estimator consistently estimates solution quality (assumption (A7)), then the sequential sampling procedure stops in a finite number of iterations, a.s. Part (ii) of Theorem 5.5 shows that for values of h close enough to h′, or, equivalently, when the sample sizes n_k are large enough, the optimality gap of the solution at termination lies within [0, h s_T + ε] with at least the desired probability 1 − α.

The results in Theorem 5.5 can also be proven under i.i.d. sampling and the following weaker second moment condition:

(A12) \sup_{x\in X} E f^{2}(x,\tilde{\xi}) < \infty.

The sample size at each iteration must then be calculated as follows:

n_k \ \ge\ \Bigl(\frac{1}{h-h'}\Bigr)^{2}\bigl(c_{p,q} + 2p\,k^{q}\bigr),    (5.7)

where q > 1, p > 0, and c_{p,q} = \max\Bigl\{2\ln\Bigl(\sum_{j=1}^{\infty}\exp[-p\,j^{q}]\Big/\sqrt{2\pi}\,\alpha\Bigr),\ 1\Bigr\}. This formula results in larger sample sizes than (5.3). Choosing q just larger than 1 results in sample sizes that grow roughly linearly.

5.3 Sequential Sampling Procedure with Variance Reduction

In this section, we update the sequential sampling procedure to include variance reduction when assessing solution quality via SRP.

5.3.1 Antithetic Variates

The sequential sampling procedure using AV when estimating optimality gaps is as follows:

Sequential Sampling Procedure with AV

Input: Values for h > h′ > 0, ε > ε′ > 0, and p > 0; a method that generates candidate solutions {x̂_k} with at least one limit point in X*; a resampling frequency k_f (a positive integer); and a desired value of α ∈ (0, 1).

Output: A candidate solution x̂_T and a (1 − α)-level confidence interval on its optimality gap, G_T.

0. (Initialization) Set k = 1, calculate n_k so that

\frac{n_k}{2} \ \ge\ \Bigl(\frac{1}{h-h'}\Bigr)^{2}\bigl(c_p + 2p\ln^{2}k\bigr),    (5.8)

where c_p = \max\Bigl\{2\ln\Bigl(\sum_{j=1}^{\infty} j^{-p\ln j}\Big/\sqrt{2\pi}\,\alpha\Bigr),\ 1\Bigr\}, and sample {ξ̃¹, ξ̃¹′, ..., ξ̃^{n_k/2}, ξ̃^{n_k/2 ′}} from P using AV sampling.

1a. Solve (SP_{n_k,A}) using {ξ̃¹, ξ̃¹′, ..., ξ̃^{n_k/2}, ξ̃^{n_k/2 ′}} to obtain x*_{n_k,A}.

1b. Calculate

G_{k,A} = \frac{2}{n_k}\sum_{i=1}^{n_k/2}\Bigl[\tfrac{1}{2}\bigl(f(\hat{x}_k,\tilde{\xi}^{i})+f(\hat{x}_k,\tilde{\xi}^{i'})\bigr) - \tfrac{1}{2}\bigl(f(x^{*}_{n_k,A},\tilde{\xi}^{i})+f(x^{*}_{n_k,A},\tilde{\xi}^{i'})\bigr)\Bigr]

and

s_{k,A}^{2} = \frac{1}{n_k/2-1}\sum_{i=1}^{n_k/2}\Bigl[\tfrac{1}{2}\bigl(f(\hat{x}_k,\tilde{\xi}^{i})+f(\hat{x}_k,\tilde{\xi}^{i'}) - f(x^{*}_{n_k,A},\tilde{\xi}^{i}) - f(x^{*}_{n_k,A},\tilde{\xi}^{i'})\bigr) - G_{k,A}\Bigr]^{2}.

2. If G_{k,A} ≤ h′ s_{k,A} + ε′, then set T = k and go to 4.

3. Set k = k + 1 and calculate n_k according to (5.8). If k_f divides k, then sample observations {ξ̃¹, ..., ξ̃^{n_k}} independently of the samples generated in previous iterations. Else, sample n_k − n_{k−1} observations {ξ̃^{n_{k−1}+1}, ..., ξ̃^{n_k}} from P. Go to 1.
4. Output candidate solution x̂_T and a one-sided confidence interval on G_T: [0, h s_{T,A} + ε].

Therefore, the procedure is terminated when the following stopping criterion is satisfied:

T = \inf\{k \ge 1 : G_{k,A} \le h'\, s_{k,A} + \varepsilon'\}.    (5.9)

Observe that formula (5.8) bounds n_k/2 rather than n_k, and that G_{k,A} and s²_{k,A} are updated accordingly. These changes are a consequence of the structure of AV. For a fixed x ∈ X, we redefine D_n(x) and σ²(x) to accommodate AV as follows:

D_{n,A}(x) = \frac{2}{n}\sum_{i=1}^{n/2}\Bigl[\tfrac{1}{2}\bigl(f(x,\tilde{\xi}^{i})+f(x,\tilde{\xi}^{i'})\bigr) - \tfrac{1}{2}\bigl(f(x^{*}_{\min,A},\tilde{\xi}^{i})+f(x^{*}_{\min,A},\tilde{\xi}^{i'})\bigr)\Bigr]

and

\sigma^{2}_{A}(x) = \operatorname{var}\Bigl(\tfrac{1}{2}\bigl[f(x,\tilde{\xi})+f(x,\tilde{\xi}') - f(x^{*}_{\min,A},\tilde{\xi}) - f(x^{*}_{\min,A},\tilde{\xi}')\bigr]\Bigr),

where x^{*}_{\min,A} \in \arg\min_{y\in X^{*}} \operatorname{var}\Bigl(\tfrac{1}{2}\bigl[f(x,\tilde{\xi})+f(x,\tilde{\xi}') - f(y,\tilde{\xi}) - f(y,\tilde{\xi}')\bigr]\Bigr).

We now examine assumptions (A7)-(A11) in the context of AV:

(A7) Let {x_k} be a feasible sequence with x as one of its limit points, and let n_k → ∞ as k → ∞. Then, lim inf_{k→∞} P(|G_{n_k,A}(x_k) − G_x| > δ) = 0 for any δ > 0.

(A8) G_{n,A}(x) ≥ D_{n,A}(x), a.s., for all x ∈ X and n ≥ 1.

(A9) lim inf_{n→∞} s²_{n,A}(x) ≥ σ²_A(x), a.s., for all x ∈ X.

(A10) √(n/2) (D_{n,A}(x) − G_x) ⇒ N(0, σ²_A(x)) as n → ∞ for all x ∈ X, where N(0, σ²_A(x)) is a normal random variable with mean zero and variance σ²_A(x).

(A11) \sup_{n\ge 1}\ \sup_{x\in X}\ E\Bigl[\exp\Bigl(\gamma\,\frac{D_{n,A}(x)-G_x}{\sigma_A(x)/\sqrt{n/2}}\Bigr)\Bigr] < \infty for all |\gamma| \le \gamma_0, for some \gamma_0 > 0.

Note that, in addition to updates in notation, assumptions (A10) and (A11) use √(n/2) rather than √n. This is because D_{n,A}(x) is a sample mean of n/2 i.i.d. observations. As in Section 5.2, the modified AV assumptions (A7)-(A10) hold via the results in Section 4.3.1 when (A1)-(A3) hold. Assumption (A11) holds when (A5) is satisfied, but it can also hold in less restrictive cases when (A1)-(A3) hold, as discussed in (Bayraksan & Morton, 2011). With the appropriate redefinitions for AV, everything follows as in the i.i.d. case, but with a focus on antithetic pairs. Therefore, Theorem 5.5 holds under these circumstances with minor adjustments:

Theorem 5.10. Let ε > ε′ > 0, h > h′ > 0, and 0 < α < 1. Consider the above sequential sampling procedure with AV, where the sample size is increased according to (5.8) and the procedure stops at iteration T according to (5.9).
(i) Assume (A6) and (A7) are satisfied. Then, P(T < ∞) = 1.
(ii) Assume (A8)-(A11) are satisfied. Then,

\liminf_{h \downarrow h'} P\bigl(G_T \le h\,s_{T,A} + \varepsilon\bigr) \ \ge\ 1-\alpha.

5.3.2 Latin Hypercube Sampling

Finally, we present the sequential sampling procedure with LHS:

Sequential Sampling Procedure with LHS

Input: Values for h > h′ > 0, ε > ε′ > 0, and p > 0; a method that generates candidate solutions {x̂_k} with at least one limit point in X*; and a desired value of α ∈ (0, 1).

Output: A candidate solution x̂_T and a (1 − α)-level confidence interval on its optimality gap, G_T.

0. (Initialization) Set k = 1, calculate n_k according to (5.7), and sample observations {ξ̃¹, ..., ξ̃^{n_k}} from P using LHS.

1a. Solve (SP_{n_k,L}) using {ξ̃¹, ..., ξ̃^{n_k}} to obtain x*_{n_k,L}.

1b. Calculate

G_{k,L} = \frac{1}{n_k}\sum_{i=1}^{n_k}\bigl[f(\hat{x}_k,\tilde{\xi}^{i}) - f(x^{*}_{n_k,L},\tilde{\xi}^{i})\bigr]

and

s_{k,L}^{2} = \frac{1}{n_k-1}\sum_{i=1}^{n_k}\Bigl[\bigl(f(\hat{x}_k,\tilde{\xi}^{i}) - f(x^{*}_{n_k,L},\tilde{\xi}^{i})\bigr) - G_{k,L}\Bigr]^{2}.

2. If G_{k,L} ≤ h′ s_{k,L} + ε′, then set T = k and go to 4.

3. Set k = k + 1 and calculate n_k according to (5.7). Sample observations {ξ̃¹, ..., ξ̃^{n_k}} independently of the samples generated in previous iterations. Go to 1.

4. Output candidate solution x̂_T and a one-sided confidence interval on G_T: [0, h s_{T,L} + ε].
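As an illustration of Step 1b in the i.i.d./LHS and AV variants, the following numpy sketch forms the gap and sample-variance estimators from the function evaluations at x̂_k and at the sampling-problem solution; the antithetic option assumes the evaluations are ordered so that consecutive entries form an antithetic pair. This is a sketch of the estimator algebra only; drawing the sample and solving the sampling problem are assumed to be done elsewhere.

```python
import numpy as np

def srp_estimators(f_hat, f_star, antithetic=False):
    """Gap estimator and sample standard deviation from Step 1b.

    f_hat, f_star: arrays of f(x_hat, xi_i) and f(x*, xi_i) on the same scenarios.
    With antithetic=True, entries (2i, 2i+1) are taken to be an antithetic pair and
    the estimators are formed from the n/2 pair averages, as in the AV variant.
    """
    diffs = np.asarray(f_hat, dtype=float) - np.asarray(f_star, dtype=float)
    if antithetic:
        diffs = 0.5 * (diffs[0::2] + diffs[1::2])   # average each antithetic pair
    gap = diffs.mean()                              # G_k (or G_{k,A})
    s = np.sqrt(diffs.var(ddof=1))                  # s_k (or s_{k,A}); divides by #terms - 1
    return gap, s
```

The stopping test in Step 2 then simply compares gap against h' * s + eps_prime.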
It is important to note the differences compared to using i.i.d. sampling. First, it is not possible to obtain a larger LHS sample by augmenting an existing one with additional observations, so the procedure regenerates new observations at each iteration (this is equivalent to setting k_f = 1). In addition, the procedure uses the sample size formula (5.7), which prescribes larger sample sizes than the formula used under the finite moment generating function assumption. The procedure is terminated when the following stopping criterion is satisfied:

T = \inf\{k \ge 1 : G_{k,L} \le h'\, s_{k,L} + \varepsilon'\}.    (5.11)

We make the following additional definitions:

D_{n,L}(x) = \frac{1}{n}\sum_{i=1}^{n}\bigl[f(x,\tilde{\xi}^{i}) - f(x^{*}_{\min},\tilde{\xi}^{i})\bigr]

and

\sigma^{2}_{L}(x) = \operatorname{var}\Bigl(\frac{1}{n}\sum_{i=1}^{n}\bigl[f(x,\tilde{\xi}^{i}) - f(x^{*}_{\min},\tilde{\xi}^{i})\bigr]\Bigr),

where the observations {ξ̃¹, ..., ξ̃ⁿ} are generated via LHS.

We now discuss the requirements on the procedure. Assumptions (A7)-(A9) simply require an update of notation and are satisfied via the results in Section 4.3.2 when (A1)-(A3) hold. Assumption (A10) can be expressed as:

(A10) (D_{n,L}(x) − G_x)/σ_L(x) ⇒ N(0, 1) as n → ∞ for all x ∈ X.

If σ²_L(x) > 0, then the result holds by Theorem 4.6, the CLT for LHS. If σ²_L(x) = 0, then (1/n) Σ_{i=1}^{n} (f(x, ξ̃ⁱ) − f(x*_min, ξ̃ⁱ)) = G_x for almost all ξ̃, and the result holds in a degenerate form. It is not straightforward to establish the finite moment generating function assumption (A11) in the case of LHS, so instead we rely on the second moment condition (A12) and use the sample size formula (5.7). Assumption (A12) holds by assumption (A2) or by the stronger uniform boundedness assumption (A5). However, we need (A5) to be able to invoke the CLT for LHS.

The proof of asymptotic validity of the sequential procedure that uses LHS for assessing solution quality requires a few adjustments to the proof for the same procedure under i.i.d. sampling; see, e.g., Theorem 2 in (Bayraksan & Morton, 2011). Specifically, we apply inequality (4.4), relating the variance of an LHS estimator to that of a standard Monte Carlo estimator, as illustrated in the detailed proof of Theorem 5.12 below. First, we require the following lemmas:

Lemma 5.1. Let X¹, ..., Xⁿ be i.i.d. random variables with mean µ and X̄_n = (1/n) Σ_{i=1}^{n} Xⁱ. If E(X¹ − µ)² < ∞, then

E(\bar{X}_n - \mu)^{2} \ \le\ \frac{1}{n}\, E(X^{1} - \mu)^{2}.

Lemma 5.2 (Fatou's Lemma). Suppose f_n is a sequence of measurable functions on E. Then
(i) If f_n ≥ 0 for all n, then \int_E \liminf_{n\to\infty} f_n \le \liminf_{n\to\infty} \int_E f_n.
(ii) If L ≤ f_n ≤ U for all n, where \int_E L and \int_E U are finite, then
\int_E \liminf_{n\to\infty} f_n \ \le\ \liminf_{n\to\infty} \int_E f_n \ \le\ \limsup_{n\to\infty} \int_E f_n \ \le\ \int_E \limsup_{n\to\infty} f_n.

Lemma 5.3 (Bound on the Tail of a Standard Normal). Let Z be a standard normal random variable and t > 0. Then,

P(Z \ge t) \ \le\ \frac{\exp[-t^{2}/2]}{t\sqrt{2\pi}}.

The following theorem expresses the asymptotic validity and almost sure finite stopping of the sequential sampling procedure with LHS.

Theorem 5.12. Let ε > ε′ > 0, h > h′ > 0, and 0 < α < 1. Consider the above sequential sampling procedure with LHS, where the sample size is increased according to (5.7) and the procedure stops at iteration T according to (5.11).
(i) Assume (A6) and (A7) are satisfied. Then, P(T < ∞) = 1.
(ii) Assume (A5) and (A8)-(A10) are satisfied and n_k ≥ 2 for all k. Then,

\liminf_{h \downarrow h'} P\bigl(G_T \le h\,s_{T,L} + \varepsilon\bigr) \ \ge\ 1-\alpha.

Proof. The proof of finite stopping is as in the proof of Proposition 1 in (Bayraksan & Morton, 2011). We now provide a proof of part (ii). Let Δh = h − h′ and Δε = ε − ε′. As in the proof of Theorem 1 in (Bayraksan & Morton, 2011), it suffices to show that

\limsup_{\Delta h \downarrow 0}\ \sum_{k=1}^{\infty} P\bigl(D_{k,L} - G_k \ \ge\ \Delta h\, s_{k,L} + \Delta\varepsilon\bigr) \ \le\ \alpha.
To apply part (ii) of Fatou's lemma, we first show that \sum_{k=1}^{\infty} P(D_{k,L} - G_k \ge \Delta h\, s_{k,L} + \Delta\varepsilon) is bounded above. Let {ξ̃¹_MC, ..., ξ̃^{n_k}_MC} be a standard Monte Carlo random sample. Then,

\sum_{k=1}^{\infty} P\bigl(D_{k,L} - G_k \ge \Delta h\, s_{k,L} + \Delta\varepsilon\bigr)
 \le \sum_{k=1}^{\infty} P\bigl(D_{k,L} - G_k \ge \Delta\varepsilon\bigr)
 = \sum_{k=1}^{\infty} \int P\bigl((D_{k,L} - G_k)^{2} \ge (\Delta\varepsilon)^{2} \,\big|\, \hat{x}_k\bigr)\, dP_{\hat{x}_k}    (5.13)
 \le \sum_{k=1}^{\infty} \int E\bigl[(D_{k,L} - G_k)^{2} \,\big|\, \hat{x}_k\bigr]\,(\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}    (5.14)
 = \sum_{k=1}^{\infty} \int \operatorname{var}\Bigl[\frac{1}{n_k}\sum_{i=1}^{n_k}\bigl(f(\hat{x}_k,\tilde{\xi}^{i}) - f(x^{*}_{\min},\tilde{\xi}^{i})\bigr) \,\Big|\, \hat{x}_k\Bigr]\,(\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}    (5.15)
 \le \sum_{k=1}^{\infty} \int \frac{n_k}{n_k-1}\,\operatorname{var}\Bigl[\frac{1}{n_k}\sum_{i=1}^{n_k}\bigl(f(\hat{x}_k,\tilde{\xi}^{i}_{MC}) - f(x^{*}_{\min},\tilde{\xi}^{i}_{MC})\bigr) \,\Big|\, \hat{x}_k\Bigr]\,(\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}    (5.16)
 \le \sum_{k=1}^{\infty} \int \frac{1}{n_k-1}\, E\Bigl[\bigl(f(\hat{x}_k,\tilde{\xi}_{MC}) - f(x^{*}_{\min},\tilde{\xi}_{MC}) - G_k\bigr)^{2} \,\Big|\, \hat{x}_k\Bigr]\,(\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}    (5.17)
 \le \sup_{x\in X} E\bigl(f(x,\tilde{\xi}_{MC}) - f(x^{*}_{\min},\tilde{\xi}_{MC}) - G_x\bigr)^{2}\;(\Delta\varepsilon)^{-2} \sum_{k=1}^{\infty} \frac{1}{n_k-1},    (5.18)

where P_{x̂_k} denotes the distribution function of x̂_k, and (5.14) follows from an application of Markov's inequality to the conditional probability in (5.13). Equality (5.15) holds because D_{k,L} − G_k has mean zero for a fixed x̂_k, inequality (5.16) holds by the bound on the LHS variance by the Monte Carlo variance given in (4.4), and (5.17) follows from Lemma 5.1. The multiplier of the infinite sum in (5.18) is bounded by (A12), and the sum itself is bounded since n_k ≥ 2 and, under (5.7), n_k grows at least like k^{1+δ} for some δ > 0. Therefore, \sum_{k=1}^{\infty} P(D_{k,L} - G_k \ge \Delta h\, s_{k,L} + \Delta\varepsilon) is bounded and Fatou's lemma can be used.

Taking limits, we obtain

\limsup_{\Delta h\downarrow 0} \sum_{k=1}^{\infty} P\bigl(D_{k,L} - G_k \ge \Delta h\, s_{k,L} + \Delta\varepsilon\bigr)
 \le \sum_{k=1}^{\infty} \limsup_{\Delta h\downarrow 0} P\bigl(D_{k,L} - G_k \ge \Delta h\, s_{k,L} + \Delta\varepsilon\bigr)
 \le \sum_{k=1}^{\infty} \limsup_{\Delta h\downarrow 0} P\bigl(D_{k,L} - G_k \ge \Delta h\, s_{k,L}\bigr)    (5.19)
 = \sum_{k=1}^{\infty} \limsup_{\Delta h\downarrow 0} \int P\Bigl(\frac{D_{k,L} - G_k}{\sigma_L(\hat{x}_k)} \ge \Delta h\sqrt{n_k}\,\frac{s_{k,L}}{\sqrt{n_k}\,\sigma_L(\hat{x}_k)} \,\Big|\, \hat{x}_k\Bigr)\, dP_{\hat{x}_k}    (5.20)
 \le \sum_{k=1}^{\infty} \limsup_{\Delta h\downarrow 0} \int P\Bigl(\frac{D_{k,L} - G_k}{\sigma_L(\hat{x}_k)} \ge \Delta h\sqrt{n_k}\,\sqrt{\frac{n_k-1}{n_k}}\,\frac{s_{k,L}}{\sigma_k} \,\Big|\, \hat{x}_k\Bigr)\, dP_{\hat{x}_k}    (5.21)
 \le \sum_{k=1}^{\infty} \int \limsup_{\Delta h\downarrow 0} P\Bigl(\frac{D_{k,L} - G_k}{\sigma_L(\hat{x}_k)} \ge \bigl(c_{p,q} + 2p\,k^{q}\bigr)^{1/2}\sqrt{\frac{n_k-1}{n_k}}\,\frac{s_{k,L}}{\sigma_k} \,\Big|\, \hat{x}_k\Bigr)\, dP_{\hat{x}_k}
 \le \alpha.

The first and final inequalities follow from an application of Fatou's lemma. In (5.20), we assume that σ²_L(x̂_k) > 0; otherwise, the probability in (5.19) is zero. Inequality (5.21) holds because σ²_L(x̂_k) ≤ σ²_k/(n_k − 1) (see (4.4)). With k and x̂_k fixed, (D_{k,L} − G_k)/σ_L(x̂_k) converges to a standard normal by (A10), because Δh ↓ 0 ensures n_k → ∞. Similarly, lim inf_{Δh↓0} √((n_k − 1)/n_k) (s_{k,L}/σ_k) ≥ 1 by (A9). The final inequality then follows by applying Lemma 5.3 and the definition of c_{p,q}.

We now examine the sequential sampling procedure with variance reduction for a variety of test problems.

5.4 Computational Experiments

In this section, we apply the sequential sampling procedure with variance reduction to a number of two-stage stochastic linear programs with recourse from the literature and compare against using i.i.d. sampling when assessing solution quality. We first outline the experimental setup in Section 5.4.1. We then present the results of the experiments and discuss our findings in Section 5.4.2.

5.4.1 Experimental Setup

We now provide the details of our experimental design, including the test problems considered, how we generated the candidate solutions, and our methodology for choosing parameter values.

Test Problems

We consider the small-scale test problems PGP2, APL1P, and GBD, also studied in Chapters 3 and 4, and the large-scale test problems DB1, 20TERM, SSN, and STORM outlined in Chapter 4. In prior chapters, the candidate solution x̂ was fixed, and so Ef(x̂, ξ̃) only needed to be estimated once to allow the estimation of the optimality gap of x̂.
In this case, x̂_T, the candidate solution at the final iteration, may vary from run to run. However, for the small-scale problems it is not computationally burdensome to calculate Ef(x̂_T, ξ̃) for each x̂_T output by the procedure. For the large-scale problems, we estimated Ef(x̂_T, ξ̃) for each solution x̂_T using 50,000 scenarios generated by LHS. We observed very slightly negative values of the optimality gap estimates for some solutions to 20TERM, most likely due to numerical imprecision in the optimization algorithm solving (SP_n). These discrepancies are negligible compared to the widths of the CIs output by the sequential sampling procedures, so we treated the optimality gap as zero in these cases when calculating coverage probabilities.

Candidate Solution

Given the success of LHS in the previous chapter and in the literature, we used the following method to generate the sequence of candidate solutions x̂_k:

1. Set m_k = m_1. Sample observations (independent of those used in the evaluation procedure) {ξ̃¹, ..., ξ̃^{m_k}} from P using LHS.
2. Solve (SP_{m_k,L}) using {ξ̃¹, ..., ξ̃^{m_k}} to obtain x*_{m_k}.
3. Set x̂_k = x*_{m_k}. Calculate m_{k+1} and generate a fresh sample of observations {ξ̃¹, ..., ξ̃^{m_{k+1}}} from P using LHS. Set k = k + 1 and go to 2.

Assumption (A6) is satisfied by part (iii) of Theorem 4.11 of Section 4.3.2. We set m_k = 2n_k so that more computational effort is spent on generating high-quality candidate solutions than on assessing them.

Parameter Settings

As in previous chapters, we set α = 0.10. Following the guidelines in (Bayraksan & Morton, 2011), we chose p = 4.67 × 10⁻³ and q = 1.5. We also set ε = 2 × 10⁻⁷ and ε′ = 1 × 10⁻⁷. The sequential sampling procedure with each type of sampling was run a total of 300 times for the small-scale test problems and 100 times for the large-scale test problems, using a common stream of random numbers for each run. In order to directly compare sequential sampling with i.i.d. sampling, AV, and LHS, we chose the remaining parameters based on the requirements for LHS. Specifically, we set the resampling frequency k_f = 1, so that new observations are generated at each iteration, and use the sample size formula (5.7). AV requires the following slight adjustment:

\frac{n_k}{2} \ \ge\ \Bigl(\frac{1}{h-h'}\Bigr)^{2}\bigl(c_{p,q} + 2p\,k^{q}\bigr).    (5.22)

Problem   n1    h − h′    IID: (h, h′)       LHS: (h, h′)
PGP2      100   0.3116    (0.4166, 0.105)    (0.4060, 0.090)
APL1P     200   0.2204    (0.2284, 0.008)    (0.2304, 0.010)
GBD       500   0.1394    (0.1494, 0.010)    (0.1404, 0.001)
20TERM    500   0.1394    (0.2394, 0.100)    (0.2044, 0.065)
DB1       500   0.1394    (0.1694, 0.030)    (0.1694, 0.030)
SSN       500   0.1394    (0.1694, 0.030)    (0.1594, 0.020)
STORM     500   0.1394    (0.1694, 0.030)    (0.1494, 0.010)

Table 5.1: Parameters for the sequential sampling procedure with IID and LHS

Problem   n1    h − h′    AV1: (h, h′)       n1     h − h′    AV2: (h, h′)
PGP2      100   0.4407    (0.5457, 0.105)    200    0.3116    (0.4166, 0.105)
APL1P     200   0.3116    (0.3196, 0.008)    400    0.2204    (0.2904, 0.070)
GBD       500   0.1971    (0.2071, 0.010)    1000   0.1394    (0.1494, 0.010)
20TERM    500   0.1971    (0.2971, 0.100)    1000   0.1394    (0.2394, 0.100)
DB1       500   0.1971    (0.2271, 0.030)    1000   0.1394    (0.1694, 0.030)
SSN       500   0.1971    (0.2221, 0.025)    1000   0.1394    (0.1644, 0.025)
STORM     500   0.1971    (0.2121, 0.015)    1000   0.1394    (0.1544, 0.015)

Table 5.2: Parameters for the sequential sampling procedure with AV1 and AV2

Therefore, we make our comparisons in the following two ways. First, we adjusted h for AV so that the procedure used the same sample sizes at each iteration for each sampling scheme. We refer to this case as AV1.
Second, we doubled the sample size when AV is used compared to i.i.d. sampling and LHS. This case is referred to as AV2. In the cases of i.i.d. sampling and LHS, we rounded up each sample size n_k to be even in order to facilitate better comparison with AV. Formula (5.22) produces larger sample sizes than (5.8), but Theorem 5.10 remains valid since (5.8) simply provides a lower bound on the sample size.

We chose Δh = h − h′ so that the sample size at the first iteration equals a specified initial sample size n_1. For example, setting Δh = 0.3116 leads to an initial sample size n_1 = 100 for LHS and n_1 = 200 for AV. We selected the parameters h and h′ using the method outlined in (Pierre-Louis, 2012). Specifically, for each test problem, we ran a pilot study examining the average values of G_k/s_k, G_{k,A}/s_{k,A}, and G_{k,L}/s_{k,L} over the first 10 iterations of 25 replications. The values of h′ for each test problem and sampling scheme were chosen to be slightly smaller than the average ratios observed in the pilot study; slight adjustment was required for some cases of the small-scale problems after examining the main sequential runs. This approach aims to prevent the procedure from stopping too soon (h′ too large), while also preventing it from running for an excessive number of iterations (h′ too small). Tables 5.1 and 5.2 show our parameter choices in each situation.

5.4.2 Results of Experiments

We now present the results of our experiments for the sequential sampling procedure with i.i.d. sampling, both experimental setups for AV, and LHS. Table 5.3 provides confidence intervals on the number of iterations required, the CI widths on the optimality gap (h s_T + ε, h s_{T,A} + ε, and h s_{T,L} + ε), and an estimate of the coverage probability of the CI estimator. The confidence interval on the coverage probability is given by p̂ ± 1.645 √(p̂(1 − p̂)/R), where p̂ is the fraction of the CIs that contained the true optimality gap at termination and R is the number of runs (300 for the small-scale problems and 100 for the large-scale problems). For the CIs on coverage, a margin of error smaller than 0.001 is reported as 0.000. Table 5.3 also reports the average total computational time required (in seconds) over all runs.

First, we observe that in all cases a relatively small number of iterations is required, particularly in comparison to similar experimental results in (Bayraksan & Morton, 2011). We believe this is due to the use of LHS when generating the sequence of candidate solutions, as the results presented in Chapter 4 as well as studies in the literature (Bailey et al., 1999; Freimer et al., 2012; Homem-de-Mello, 2008) indicate that LHS can perform very well in a stochastic programming context, generating high-quality solutions.

The sequential sampling procedure is particularly effective for APL1P and GBD, where the CI widths are within 1% and 0.04% of optimality, respectively (see the values of z* in Table 3.2), and the coverage probabilities are at least 0.90 for APL1P and always 1 for GBD. The sequential procedure performs less well for PGP2, where the CI widths are within 7% of optimality and coverage drops as low as 0.68 with the use of LHS. As in the non-sequential setting, we again observe low coverage for PGP2 when using SRP. The results are similar for the large-scale problems. The CI widths for 20TERM, DB1, SSN, and STORM are estimated to be within 0.02%, 0.04%, 28%, and 0.002% of optimality, respectively (see the values of z* in Table 4.3).
The coverage probabilities for 20TERM, SSN, and STORM are greater than 0.90 and often close to 1. Coverage drops to about 0.84 for DB1; however, this is higher than the coverage observed in the non-sequential setting in Chapter 4. The relatively wide CIs and very high coverage for SSN indicate that slightly lower values of h and h′ might tighten the CI widths without compromising coverage.

Using LHS when assessing solution quality leads to the tightest CI widths for all test problems except DB1. LHS requires the fewest iterations for APL1P and GBD but the most iterations for the large-scale problems; however, total computation time increases appreciably relative to i.i.d. sampling only for SSN and STORM (see Table 5.3). Doubling the sample size when AV is used, relative to i.i.d. sampling and LHS, as in AV2, produces better results than fixing the initial sample size across all sampling schemes and adjusting h, as in AV1, but at the cost of increased computational time for every problem except SSN. In particular, AV2 performs better than i.i.d. sampling, but it is not as effective as LHS (with the exception of DB1). Further investigation indicates that the time spent generating the sequence of candidate solutions is the major contributor to the total computation time. This is not surprising since we use LHS to generate the candidate solutions, for which computation time grows faster with sample size than for i.i.d. sampling or AV (see Table 4.8 in Section 4.6.3), and additionally the observations must be generated from scratch at each iteration.

Based on these results, we recommend the use of LHS both to quickly obtain high-quality solutions when generating candidate solutions and when assessing solution quality with SRP in the sequential sampling procedure, albeit at the risk of undercoverage for some problems.

Problem   Method   CI on T        CI Width          CI on Coverage    Total Time
PGP2      IID      1.55 ± 0.10    24.45 ±  2.70     0.740 ± 0.042        1.08
          AV1      1.42 ± 0.05    31.94 ±  3.41     0.763 ± 0.040        1.03
          AV2      1.13 ± 0.04    17.99 ±  2.01     0.780 ± 0.039        1.97
          LHS      1.52 ± 0.09    16.99 ±  2.44     0.680 ± 0.028        1.02
APL1P     IID      6.62 ± 0.63    19.08 ±  2.72     0.903 ± 0.028        3.60
          AV1      5.33 ± 0.44    27.08 ±  3.66     0.907 ± 0.028        2.85
          AV2      3.07 ± 0.24    18.18 ±  2.65     0.890 ± 0.030        4.30
          LHS      2.04 ± 0.16    10.77 ±  1.75     0.907 ± 0.028        1.40
GBD       IID      1.86 ± 0.11     0.60 ±  0.13     1.000 ± 0.000        4.50
          AV1      1.86 ± 0.13     0.06 ±  0.05     1.000 ± 0.000        4.45
          AV2      1.53 ± 0.09     0.01 ±  0.00     1.000 ± 0.000        9.96
          LHS      1.00 ± 0.00     0.00 ±  0.00     1.000 ± 0.000        3.97
20TERM    IID      1.96 ± 0.25    53.59 ±  4.41     1.000 ± 0.000     3664.73
          AV1      2.19 ± 0.28    61.38 ±  5.23     1.000 ± 0.000     3883.49
          AV2      1.54 ± 0.14    42.75 ±  3.35     0.990 ± 0.016     8770.98
          LHS      2.48 ± 0.29    36.83 ±  3.15     0.970 ± 0.028     4104.24
DB1       IID      1.41 ± 0.12     5.57 ±  1.11     0.840 ± 0.060       52.48
          AV1      1.40 ± 0.12     7.53 ±  1.50     0.840 ± 0.060       52.66
          AV2      1.27 ± 0.09     5.40 ±  1.11     0.840 ± 0.060      108.58
          LHS      1.45 ± 0.14     5.88 ±  1.11     0.830 ± 0.062       52.06
SSN       IID      2.22 ± 0.24     2.15 ±  0.08     1.000 ± 0.000     1221.72
          AV1      2.73 ± 0.35     2.75 ±  0.11     0.980 ± 0.023     1483.09
          AV2      1.08 ± 0.05     1.93 ±  0.07     1.000 ± 0.000     1641.71
          LHS      4.33 ± 0.33     1.39 ±  0.05     1.000 ± 0.000     2222.17
STORM     IID      1.82 ± 0.19   303.66 ± 47.18     0.980 ± 0.023      488.73
          AV1      2.54 ± 0.31   309.89 ± 59.73     0.980 ± 0.023      584.80
          AV2      1.75 ± 0.20   177.78 ± 41.06     0.980 ± 0.023     1044.73
          LHS      2.84 ± 0.34   144.75 ± 41.94     0.900 ± 0.049      621.89

Table 5.3: Summary of results for sequential procedures using IID, AV1, AV2, and LHS
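As a quick check of the reported margins of error, the half-width formula for the coverage estimates given above can be evaluated directly; the minimal sketch below reproduces, for example, the ±0.028 reported in Table 5.3 for APL1P under i.i.d. sampling (coverage 0.903 over 300 runs). The specific values are taken from the table for illustration only.

```python
import math

def coverage_ci(p_hat, runs, z=1.645):
    # 90% CI for an estimated coverage probability p_hat over `runs` independent runs
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / runs)
    return p_hat - half, p_hat + half

print(coverage_ci(0.903, 300))   # roughly (0.875, 0.931), i.e., 0.903 +/- 0.028
```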
5.5 Summary and Concluding Remarks

In this chapter, we studied the use of SRP with variance reduction, specifically SRP AV and SRP LHS, to assess solution quality in a sequential sampling procedure to approximately solve (SP) with a desired probability. We also applied LHS when generating the sequence of candidate solutions that is input to the sequential procedure. We showed that the CI on the optimality gap of the solution at termination is asymptotically valid and that the procedure stops in a finite number of iterations, a.s., when using both SRP AV and SRP LHS. Our computational results indicate that the inclusion of variance reduction, particularly LHS, can result in a procedure that takes a small number of iterations to produce a high-quality approximate solution to (SP); however, coverage probability can be low when using SRP LHS to assess solution quality for some problems. The results in (Bayraksan & Morton, 2011) indicate that SRP in the i.i.d. setting improves coverage probability as the initial sample size is increased, with only modest increases in computation time.

Chapter 6
Conclusions

In this dissertation, we have developed a bias reduction technique and implemented variance reduction methods to improve our ability to identify high-quality solutions to stochastic programs. Stochastic programs can be very difficult to solve, particularly as the size of the problem increases, and so there is a need for methods that accurately assess the quality of solutions obtained by means such as approximation. We have focused on increasing the reliability of Monte Carlo sampling-based estimators of optimality gaps, with a particular emphasis on two-stage stochastic linear programs with recourse, and have applied these ideas to a sequential procedure to solve stochastic programs. The following section summarizes the contributions of this dissertation, and Section 6.2 concludes with future directions based on this work.

6.1 Summary of Contributions

An overview of the contributions of the work presented in this dissertation is as follows:

• We developed a technique to reduce the bias of the Averaged Two-Replication Procedure optimality gap estimators. This method is motivated by stability results in stochastic programming. Rather than sampling two independent replications of observations, we partition a larger random sample by solving a minimum-weight perfect matching problem, which can be done in time polynomial in the sample size. We showed that the resultant point estimators are consistent and the interval estimator is asymptotically valid. The empirical behavior of the bias reduction technique was examined in computational experiments. The results indicate that the technique can be effective at reducing bias and also variance, particularly when the optimal solution is considered.

• We studied the use of antithetic variates and Latin hypercube sampling for variance reduction in several algorithms that estimate optimality gaps. These include algorithms both with and without the bias reduction technique developed in this dissertation. As with the bias reduction technique, we established the consistency and asymptotic validity of the estimators. We conducted computational experiments for a range of small-scale and large-scale test problems to evaluate the effectiveness of the variance reduction schemes compared to i.i.d. sampling. Both techniques can improve the reliability of the optimality gap estimators, and Latin hypercube sampling performs particularly well for the class of problems considered. We provided guidelines for the use of variance reduction schemes in algorithms that assess solution quality.
• We applied the above ideas to a sequential sampling procedure that (approximately) solves (SP) iteratively by generating and assessing a sequence of candidate solutions. Specifically, we examined the use of variance reduction schemes when assessing the quality of the current candidate solution using the Single Replication Procedure. We proved that the resulting sequential procedure stops in a finite number of iterations and (asymptotically) produces, with high probability, a solution within a desired quality threshold. Computational results indicate that variance reduction techniques, particularly Latin hypercube sampling, can improve the performance of the sequential procedure, but can reduce the coverage probability for some problems.

6.2 Future Research

We conclude this dissertation by discussing several avenues for future work:

• Other classes of problems: A primary goal is to adapt the methods for assessing solution quality with bias and variance reduction to other classes of stochastic programs, such as multi-stage stochastic programs, risk models, etc. The bias reduction technique in particular will require significant adaptations, which we now briefly discuss:

– Stability results for models such as chance-constrained and mixed-integer stochastic programs use different probability metrics (Römisch, 2003). For example, the Kolmogorov metric can be used for chance-constrained and mixed-integer stochastic programs (Henrion et al., 2009). Minimizing these metrics could potentially lead to more complicated problems than the minimum-weight perfect matching problem that arose in the current context; see, for instance, the scenario reduction techniques for different classes of stochastic programs (Heitsch & Römisch, 2003; Henrion et al., 2009). However, quick solution methods could still provide significant reductions in bias.

– Multi-stage stochastic programs also offer several interesting avenues to explore. The first step is to focus on problems where the stages are independent of one another and to determine how to generate the scenario trees while applying the bias reduction technique. The next step is to generalize the approach to dependent stages. Dupačová et al. (2003) discuss backward and forward heuristics for multi-stage programs that may be applicable in this case.

– A modified version of the bias reduction technique may also be beneficial in risk models such as Conditional Value-at-Risk.

• Comparison with other bias reduction methods: It would be instructive to compare the bias reduction technique to other bias reduction methods, such as the jackknife estimators considered in (Partani et al., 2006; Partani, 2007), and to other sampling techniques.

• Additional variance reduction techniques: The use of other variance reduction techniques, such as quasi-Monte Carlo sampling, in algorithms that assess solution quality can be studied. The use of quasi-Monte Carlo in the context of stochastic programming has been explored in (Drew & Homem-de-Mello, 2006; Homem-de-Mello, 2008; Koivu, 2005; Pennanen & Koivu, 2005). Challenges in adapting the methodology presented in this dissertation include determining appropriate quasi-Monte Carlo sequences and identifying a CLT-type result for the problems under consideration. Furthermore, dimension reduction techniques need to be explored further for large-scale problems. For some risk models, tail behavior is important; variance reduction geared toward capturing the tails of distributions needs to be explored for such models.
• Fixed-width sequential sampling: Bayraksan & Pierre-Louis (2012) introduce a fixed-width sequential sampling scheme to approximately solve (SP). In this case, the procedure is terminated when the width of a confidence interval estimator of the optimality gap falls below a pre-specified level. Variance reduction techniques may improve the performance of this procedure.

• Connections with stochastic control theory: There appear to be natural connections between assessing solution quality in stochastic programming and assessing policy quality in stochastic optimal control theory. In the latter case, suboptimal policies are adopted in the absence of explicit solutions to stochastic control problems, and the question of how good these policies are then arises. The techniques described in this dissertation may be helpful in answering this question.

References

Attouch, H., & Wets, R. (1981). Approximation and convergence in nonlinear optimization. In O. Mangasarian, R. Meyer, & S. Robinson (Eds.) Nonlinear Programming 4, (pp. 367–394). Academic Press, New York.
Bailey, T., Jensen, P., & Morton, D. (1999). Response surface analysis of two-stage stochastic linear programming with recourse. Naval Research Logistics, 46, 753–778.
Bayraksan, G., & Morton, D. (2006). Assessing solution quality in stochastic programs. Mathematical Programming, 108, 495–514.
Bayraksan, G., & Morton, D. (2009). Assessing solution quality in stochastic programs via sampling. In M. R. Oskoorouchi (Ed.) INFORMS TutORials in Operations Research, vol. 6, (pp. 102–122). INFORMS, Hanover, MD.
Bayraksan, G., & Morton, D. (2011). A sequential sampling procedure for stochastic programming. Operations Research, 59, 898–913.
Bayraksan, G., & Pierre-Louis, P. (2012). Fixed-width sequential stopping rules for a class of stochastic programs. SIAM Journal on Optimization, 22, 1518–1548.
Beale, E. (1955). On minimizing a convex function subject to linear inequalities. Journal of the Royal Statistical Society, Series B, 17, 173–184.
Bertocchi, M., Dupačová, J., & Moriggia, V. (2000). Sensitivity of bond portfolio's behavior with respect to random movements in yield curve: a simulation study. Annals of Operations Research, 99, 267–286.
Byrd, R., Chin, G., Nocedal, J., & Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134, 127–155.
Chow, Y. S., & Robbins, H. (1965). On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Annals of Mathematical Statistics, 36, 457–462.
Dantzig, G. (1955). Linear programming under uncertainty. Management Science, 1, 197–206.
Dantzig, G., & Glynn, P. (1990). Parallel processors for planning under uncertainty. Annals of Operations Research, 22, 1–21.
Dantzig, G., & Infanger, G. (1995). A probabilistic lower bound for two-stage stochastic programs. Technical Report SOL 95-6, Department of Operations Research, Stanford University.
Donohue, C., & Birge, J. (1995). An upper bound on the network recourse function. Working Paper, Department of Industrial and Operations Engineering, University of Michigan.
Drew, S. (2007). Quasi-Monte Carlo Methods for Stochastic Programming. Ph.D. thesis, Northwestern University.
Drew, S., & Homem-de-Mello, T. (2006). Quasi-Monte Carlo strategies for stochastic optimization. In Proceedings of the 2006 Winter Simulation Conference, (pp. 774–782).
Drew, S., & Homem-de-Mello, T. (2012). Some large deviations results for Latin hypercube sampling. Methodology and Computing in Applied Probability, 14, 203–232.
Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge, 2nd ed.
Dupačová, J., Gröwe-Kuska, N., & Römisch, W. (2003). Scenario reduction in stochastic programming: an approach using probability metrics. Mathematical Programming, 95, 493–511.
Dupačová, J., & Wets, R.-B. (1988). Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. The Annals of Statistics, 16, 1517–1549.
Edmonds, J. (1965). Paths, trees, and flowers. Canadian Journal of Mathematics, 17, 449–467.
Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall, New York.
Ferguson, A., & Dantzig, G. (1956). The allocation of aircraft to routes: an example of linear programming under uncertain demands. Management Science, 3, 45–73.
Freimer, M. B., Thomas, D. J., & Linderoth, J. T. (2012). The impact of sampling methods on bias and variance in stochastic linear programs. Computational Optimization and Applications, 51, 51–75.
Ghosh, M., Mukhopadhyay, N., & Sen, P. K. (1997). Sequential Estimation. Wiley, New York.
Glynn, P., & Whitt, W. (1992). The asymptotic validity of sequential stopping rules for stochastic simulations. The Annals of Applied Probability, 2, 180–198.
Hackney, B., & Infanger, G. (1994). Private communication.
Heitsch, H., & Römisch, W. (2003). Scenario reduction algorithms in stochastic programming. Computational Optimization and Applications, 24, 187–206.
Henrion, R., Küchler, C., & Römisch, W. (2009). Scenario reduction in stochastic programming with respect to discrepancy distances. Computational Optimization and Applications, 43, 67–93.
Higle, J. (1998). Variance reduction and objective function evaluation in stochastic linear programs. INFORMS Journal on Computing, 10, 236–247.
Higle, J., & Sen, S. (1991a). Statistical verification of optimality conditions for stochastic programs with recourse. Annals of Operations Research, 30, 215–240.
Higle, J., & Sen, S. (1991b). Stochastic decomposition: an algorithm for two-stage linear programs with recourse. Mathematics of Operations Research, 16, 650–669.
Higle, J., & Sen, S. (1996a). Duality and statistical tests of optimality for two stage stochastic programs. Mathematical Programming, 75, 257–272.
Higle, J., & Sen, S. (1996b). Stochastic Decomposition: A Statistical Method for Large Scale Stochastic Linear Programming. Kluwer Academic Publishers, Dordrecht.
Higle, J., & Sen, S. (1999). Statistical approximations for stochastic linear programming problems. Annals of Operations Research, 85, 173–192.
Homem-de-Mello, T. (2003). Variable-sample methods for stochastic optimization. ACM Transactions on Modeling and Computer Simulation, 13, 108–133.
Homem-de-Mello, T. (2008). On rates of convergence for stochastic optimization problems under non-i.i.d. sampling. SIAM Journal on Optimization, 19, 524–551.
Homem-de-Mello, T., de Matos, V. L., & Finardi, E. C. (2011). Sampling strategies and stopping criteria for stochastic dual dynamic programming: a case study in long-term hydrothermal scheduling. Energy Systems, 2, 1–31.
Infanger, G. (1992). Monte Carlo (importance) sampling within a Benders decomposition algorithm for stochastic linear programs. Annals of Operations Research, 39, 69–95.
Keller, B., & Bayraksan, G. (2010). Scheduling jobs sharing multiple resources under uncertainty: A stochastic programming approach. IIE Transactions, 42, 16–30.
Kenyon, A., & Morton, D. (2003). Stochastic vehicle routing with random travel times. Transportation Science, 37, 69–82.
Kim, S.-H., & Nelson, B. L. (2001). A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation, 11, 251–273.
Kim, S.-H., & Nelson, B. L. (2006). On the asymptotic validity of fully sequential selection procedures for steady-state simulation. Operations Research, 54, 475–488.
King, A., & Rockafellar, R. (1993). Asymptotic theory for solutions in statistical estimation and stochastic programming. Mathematics of Operations Research, 18, 148–162.
Koivu, M. (2005). Variance reduction in sample approximations of stochastic programs. Mathematical Programming, 103, 463–485.
Kolmogorov, V. (2009). Blossom V: a new implementation of a minimum cost perfect matching algorithm. Mathematical Programming Computation, 1, 43–67.
Lan, G., Nemirovski, A., & Shapiro, A. (2012). Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134, 425–458.
Law, A. (2007). Simulation Modeling and Analysis, 4th ed. McGraw-Hill, New York.
Lemieux, C. (2009). Monte Carlo and Quasi-Monte Carlo Sampling. Springer, New York.
Linderoth, J., Shapiro, A., & Wright, S. (2006). The empirical behavior of sampling methods for stochastic programming. Annals of Operations Research, 142, 215–241.
Loh, W. L. (1996). On Latin hypercube sampling. The Annals of Statistics, 24, 2058–2080.
Mak, W., Morton, D., & Wood, R. (1999). Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24, 47–56.
McKay, M., Beckman, R., & Conover, W. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21, 239–245.
Mehlhorn, K., & Schäfer, G. (2002). Implementation of O(nm log n) weighted matchings in general graphs: the power of data structures. Journal on Experimental Algorithmics, 7.
Morton, D. (1998). Stopping rules for a class of sampling-based stochastic programming algorithms. Operations Research, 24, 47–56.
Mulvey, J., & Ruszczyński, A. (1995). A new scenario decomposition method for large scale stochastic optimization. Operations Research, 43, 477–490.
Nadas, A. (1969). An extension of a theorem of Chow and Robbins on sequential confidence intervals for the mean. Annals of Mathematical Statistics, 40, 667–671.
Norkin, V., Pflug, G., & Ruszczyński, A. (1998). A branch and bound method for stochastic global optimization. Mathematical Programming, 83, 425–450.
Owen, A. B. (1992). A central limit theorem for Latin hypercube sampling. Journal of the Royal Statistical Society, Series B, 54, 541–551.
Owen, A. B. (1997). Monte Carlo variance of scrambled net quadrature. SIAM Journal on Numerical Analysis, 34, 1884–1910.
Partani, A. (2007). Adaptive Jackknife Estimators for Stochastic Programming. Ph.D. thesis, The University of Texas at Austin.
Partani, A., Morton, D., & Popova, I. (2006). Jackknife estimators for reducing bias in asset allocation. In Proceedings of the 2006 Winter Simulation Conference, (pp. 783–791).
Pasupathy, R. (2010). On choosing parameters in retrospective-approximation algorithms for stochastic root finding and simulation optimization. Operations Research, 48, 889–901.
Pennanen, T., & Koivu, M. (2005). Epi-convergent discretizations of stochastic programs via integration quadratures. Numerische Mathematik, 100, 141–163.
Pflug, G. C. (1988). Stepsize rules, stopping times and their implementations in stochastic quasigradient algorithms. In Y. Ermoliev, & R. Wets (Eds.) Numerical Techniques for Stochastic Optimization, (pp. 353–372). Springer-Verlag, Berlin.
Pierre-Louis, P. (2012). Algorithmic Developments in Monte Carlo Sampling-Based Methods for Stochastic Programming. Ph.D. thesis, The University of Arizona.
Pierre-Louis, P., Morton, D., & Bayraksan, G. (2011). A combined deterministic and sampling-based sequential bounding method for stochastic programming. In Proceedings of the 2011 Winter Simulation Conference, (pp. 4172–4183). Piscataway, New Jersey.
Polak, E., & Royset, J. (2008). Efficient sample sizes in stochastic nonlinear programming. Journal of Computational and Applied Mathematics, 217, 301–310.
Rachev, S. T. (1991). Probability Metrics and the Stability of Stochastic Models. Wiley, New York.
Rockafellar, R., & Wets, R.-B. (1998). Variational Analysis. Springer-Verlag, Berlin.
Römisch, W. (2003). Stability of stochastic programming problems. In A. Ruszczyński, & A. Shapiro (Eds.) Handbooks in Operations Research and Management Science, Vol. 10: Stochastic Programming, (pp. 483–554). Elsevier, Amsterdam.
Royset, J. (2013). On sample size control in sample average approximations for solving smooth stochastic programs. Computational Optimization and Applications, 55, 265–309.
Royset, J., & Szechtman, R. (2013). Optimal budget allocation for sample average approximation. Operations Research, 61, 762–776.
Rubinstein, R. Y., & Shapiro, A. (1993). Discrete Event Systems: Sensitivity and Stochastic Optimization by the Score Function Method. John Wiley & Sons, Chichester.
Ruszczyński, A. (1986). A regularized decomposition method for minimizing a sum of polyhedral functions. Mathematical Programming, 35, 309–333.
Ruszczyński, A., & Świetanowski, A. (1997). Accelerating the regularized decomposition method for two stage stochastic linear problems. European Journal of Operational Research, 101, 328–342.
Santoso, T., Ahmed, S., Goetschalckx, M., & Shapiro, A. (2005). A stochastic programming approach for supply chain network design under uncertainty. European Journal of Operational Research, 167, 96–115.
Sen, S., Doverspike, R., & Cosares, S. (1994). Network planning with random demand. Telecommunication Systems, 3, 11–30.
Shao, J., & Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, New York.
Shapiro, A. (1989). Asymptotic properties of statistical estimators in stochastic programming. The Annals of Statistics, 17, 841–858.
Shapiro, A. (2003). Monte Carlo sampling methods. In A. Ruszczyński, & A. Shapiro (Eds.) Handbooks in Operations Research and Management Science, Vol. 10: Stochastic Programming, (pp. 353–425). Elsevier, Amsterdam.
Shapiro, A., & Homem-de-Mello, T. (1998). A simulation-based approach to two-stage stochastic programming with recourse. Mathematical Programming, 81, 301–325.
Shapiro, A., & Homem-de-Mello, T. (2000). On rate of convergence of Monte Carlo approximations of stochastic programs. SIAM Journal on Optimization, 11, 70–86.
Shapiro, A., Homem-de-Mello, T., & Kim, J. (2002). Conditioning of convex piecewise linear stochastic programs. Mathematical Programming, 94, 1–19.
Shapiro, A., & Nemirovski, A. (2005). On complexity of stochastic programming problems. In V. Jeyakumar, & A. Rubinov (Eds.) Continuous Optimization: Current Trends and Applications, (pp. 111–144). Springer, Berlin.
Stockbridge, R., & Bayraksan, G. (2012). A probability metrics approach for reducing the bias of optimality gap estimators in two-stage stochastic linear programming. Mathematical Programming. doi:10.1007/s10107-012-0563-6.
Verweij, B., Ahmed, S., Kleywegt, A., Nemhauser, G., & Shapiro, A. (2003). The sample average approximation method applied to stochastic vehicle routing problems: a computational study. Computational Optimization and Applications, 24, 289–333.
Wagner, R. (2009). Mersenne twister random number generator. http://svn.seqan.de/seqan/trunk/core/include/seqan/random/ext_MersenneTwister.h. Last accessed June 3, 2013.
Wallace, S., & Ziemba, W. (Eds.) (2005). Applications of Stochastic Programming. MPS-SIAM Series in Optimization, Philadelphia.
Wets, R. (1983). Stochastic programming: solution techniques and approximation schemes. In A. Bachem, M. Grötschel, & B. Korte (Eds.) Mathematical Programming: The State of the Art (Bonn 1982), (pp. 560–603). Springer-Verlag, Berlin.