Bias and Variance Reduction in Assessing
Solution Quality for Stochastic Programs
by
Rebecca Stockbridge
A Dissertation Submitted to the Faculty of the
Graduate Interdisciplinary Program in Applied
Mathematics
In Partial Fulfillment of the Requirements
For the Degree of
Doctor of Philosophy
In the Graduate College
The University of Arizona
2013
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Rebecca Stockbridge entitled Bias and Variance Reduction in Assessing Solution Quality for Stochastic Programs and recommend that it be accepted
as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.
Date: July 2, 2013
Güzin Bayraksan
Date: July 2, 2013
Young-Jun Son
Date: July 2, 2013
Joseph Watkins
Final approval and acceptance of this dissertation is contingent
upon the candidate’s submission of the final copies of the dissertation
to the Graduate College.
I hereby certify that I have read this dissertation prepared under
my direction and recommend that it be accepted as fulfilling the
dissertation requirement.
Date: July 2, 2013
Güzin Bayraksan
Statement by Author
This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and
is deposited in the University Library to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is
made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the
head of the major department or the Dean of the Graduate College
when in his or her judgment the proposed use of the material is in the
interests of scholarship. In all other instances, however, permission
must be obtained from the author.
Signed:
Rebecca Carnegie Neal Stockbridge
Acknowledgments
First, I would like to thank my teachers at Lexington Montessori School and the
Montessori Middle School of Kentucky for creating wonderful communities focused on
both academic and personal growth. My experiences as an independent learner from
the beginning provided the foundation for all my subsequent academic achievements.
Second, I would like to acknowledge Dr. Michael Tabor of the Interdisciplinary
Program in Applied Mathematics at the University of Arizona. As a result of Dr.
Tabor’s unceasing efforts, the Program offers wide-ranging opportunities to develop
in every aspect of research, teaching, and service. In addition, Stacey LaBorde and
Anne Keyl have graciously assisted with all administrative matters.
Third, an allocation of computer time from the UA Research Computing High
Performance Computing (HPC) and High Throughput Computing (HTC) at the University of Arizona is gratefully acknowledged.
Finally, and most importantly, I would like to thank my advisor, Dr. Güzin
Bayraksan, for her support during the last four years. She has taught me a great deal
about the art of successfully applying mathematical techniques to interdisciplinary
problems, while mentoring and encouraging me every step of the way.
Dedication
This dissertation is dedicated to my parents, Richard and Judith, for cultivating a
love of inquiry from my earliest years; my brother, David, for his support and sense of
humor; and my husband, Stuart, for his companionship throughout our mathematical
adventures spanning nine years and two continents.
Table of Contents
List of Figures
List of Tables
Abstract
Chapter 1. Introduction
1.1. Motivation
1.2. Contributions
1.3. Definitions and Abbreviations
1.4. Dissertation Organization
Chapter 2. Overview of Algorithms for Assessing Solution Quality
2.1. Problem Class
2.2. Optimality Gap Estimation
2.3. Multiple Replications Procedure
2.4. Single Replication Procedure
2.5. Averaged Two-Replication Procedure
2.6. Summary and Concluding Remarks
Chapter 3. Averaged Two-Replication Procedure with Bias Reduction
3.1. Literature Review
3.2. Problem Class
3.3. Background
3.3.1. A Redefinition of the Averaged Two-Replication Procedure
3.3.2. A Stability Result
3.4. Bias Reduction via Probability Metrics
3.4.1. Motivation for Bias Reduction Technique
3.4.2. Averaged Two-Replication Procedure with Bias Reduction
3.5. Illustration: Newsvendor Problem
3.6. Theoretical Properties
3.6.1. Weak Convergence of Empirical Measures
3.6.2. Consistency of Point Estimators
3.6.3. Asymptotic Validity of the Interval Estimator
3.7. Computational Experiments
3.7.1. Test Problems
3.7.2. Experimental Setup
3.7.3. Results of Experiments on NV, PGP2, APL1P, and GBD
3.7.4. Computation Time of Bias Reduction
3.7.5. Effect of Multiple Optimal Solutions
3.7.6. Discussion
3.8. Summary and Concluding Remarks
Chapter 4. Assessing Solution Quality with Variance Reduction
4.1. Overview of Antithetic Variates and Latin Hypercube Sampling
4.1.1. Antithetic Variates
4.1.2. Latin Hypercube Sampling
4.2. Multiple Replications Procedure with Variance Reduction
4.2.1. Antithetic Variates
4.2.2. Latin Hypercube Sampling
4.3. Single Replication Procedure with Variance Reduction
4.3.1. Antithetic Variates
4.3.2. Latin Hypercube Sampling
4.4. Averaged Two-Replication Procedure with Variance Reduction
4.4.1. Antithetic Variates
4.4.2. Latin Hypercube Sampling
4.4.3. Antithetic Variates with Bias Reduction
4.4.4. Latin Hypercube Sampling with Bias Reduction
4.5. Summary of Key Differences in Algorithms
4.6. Computational Experiments
4.6.1. Test Problems
4.6.2. Experimental Setup
4.6.3. Results of Experiments
4.6.4. Discussion
4.7. Summary and Concluding Remarks
Chapter 5. Sequential Sampling with Variance Reduction
5.1. Literature Review
5.2. Overview of a Sequential Sampling Procedure
5.3. Sequential Sampling Procedure with Variance Reduction
5.3.1. Antithetic Variates
5.3.2. Latin Hypercube Sampling
5.4. Computational Experiments
5.4.1. Experimental Setup
5.4.2. Results of Experiments
5.5. Summary and Concluding Remarks
Chapter 6. Conclusions
6.1. Summary of Contributions
6.2. Future Research
References
List of Figures
Figure 3.1. Percentage reductions between A2RP and A2RP-B for optimal candidate solutions
Figure 3.2. Percentage reductions between A2RP and A2RP-B for suboptimal candidate solutions
Figure 4.1. Percentage reductions in bias between MRP and MRP AV and MRP and MRP LHS for suboptimal candidate solutions
Figure 4.2. Percentage reductions in variance between MRP and MRP AV and MRP and MRP LHS
Figure 4.3. Percentage reductions in MSE between MRP and MRP AV and MRP and MRP LHS
Figure 4.4. Percentage reductions in CI width between MRP and MRP AV and MRP and MRP LHS
Figure 4.5. Percentage reductions in SV between SRP and SRP AV and SRP and SRP LHS
Figure 4.6. Percentage reductions in CI width between SRP and SRP AV and SRP and SRP LHS
Figure 4.7. Percentage reductions between A2RP and A2RP-B for large problems at suboptimal candidate solutions
Figure 4.8. Percentage reductions in MSE between A2RP and A2RP AV-B and A2RP and A2RP LHS-B
Figure 4.9. Percentage reductions in SV between A2RP and A2RP AV-B and A2RP and A2RP LHS-B
Figure 4.10. Percentage reductions in CI width between A2RP and A2RP AV-B and A2RP and A2RP LHS-B
List of Tables
Table 3.1. Small test problem characteristics
Table 3.2. Small test problem candidate solutions
Table 3.3. Bias for optimal candidate solutions
Table 3.4. MSE for optimal candidate solutions
Table 3.5. CI estimator for optimal candidate solutions
Table 3.6. Bias for suboptimal candidate solutions
Table 3.7. MSE for suboptimal candidate solutions
Table 3.8. CI estimator for suboptimal candidate solutions
Table 3.9. A2RP-B computational time for SSN
Table 3.10. Multiple optimal solutions for an optimal candidate solution
Table 3.11. Multiple optimal solutions for a suboptimal candidate solution
Table 4.1. Key differences in algorithms
Table 4.2. Large test problem characteristics
Table 4.3. Large test problem candidate solutions
Table 4.4. Percentage reduction in variance between MRP and MRP AV for PGP2
Table 4.5. MRP coverage for suboptimal candidate solutions
Table 4.6. SRP coverage for suboptimal candidate solutions
Table 4.7. A2RP coverage for suboptimal candidate solutions
Table 4.8. Computational time for IID, AV, and LHS
Table 5.1. Parameters for sequential sampling with IID and LHS
Table 5.2. Parameters for sequential sampling with AV
Table 5.3. Results for sequential procedures using IID, AV, and LHS
Abstract
Stochastic programming combines ideas from deterministic optimization with probability and statistics to produce more accurate models of optimization problems involving uncertainty. However, due to their size, stochastic programming problems can be extremely difficult to solve, and approximate solutions are used instead. Therefore, there is a need for methods that can accurately identify optimal or near-optimal solutions. In this dissertation, we focus on improving Monte Carlo sampling-based
methods that assess the quality of potential solutions to stochastic programs by estimating optimality gaps. In particular, we aim to reduce the bias and/or variance of
these estimators.
We first propose a technique to reduce the bias of optimality gap estimators which
is based on probability metrics and stability results in stochastic programming. This
method, which requires the solution of a minimum-weight perfect matching problem,
can be run in polynomial time in sample size. We establish asymptotic properties
and present computational results.
We then investigate the use of sampling schemes to reduce the variance of optimality gap estimators, and in particular focus on antithetic variates and Latin hypercube
sampling. We also combine these methods with the bias reduction technique discussed above. Asymptotic properties of the resultant estimators are presented, and
computational results on a range of test problems are discussed.
Finally, we apply methods of assessing solution quality using antithetic variates
and Latin hypercube sampling to a sequential sampling procedure to solve stochastic programs. In this setting, we use Latin hypercube sampling when generating a
sequence of candidate solutions that is input to the procedure. We prove that these
procedures produce a high-quality solution with high probability, asymptotically, and
terminate in a finite number of iterations. Computational results are presented.
Chapter 1
Introduction
Deterministic mathematical programming aims to optimize functions under a set
of constraints with known parameters. However, the real world is not entirely deterministic; in many cases, the parameters that go into the objective function and
constraints, such as costs, demands, etc., may not be completely known. Stochastic
programs take this uncertainty into account by including random vectors and other
probabilistic statements in the problem descriptions. A standard way to incorporate uncertainty into optimization problems is to represent unknown parameters by
random variables. It is then natural to include probabilistic quantities such as expectations and probabilities in the model. It is usually assumed that the probability
distributions of the random variables are known. In real-world situations, such distributions can be constructed, for example, via statistical analysis. If a probability distribution is assumed to exist but is unknown, one can consider a range of possible probability distributions in the analysis of the problem.
In this dissertation, we consider a stochastic optimization problem of the form

    min_{x ∈ X} Ef(x, ξ̃) = min_{x ∈ X} ∫_Ξ f(x, ξ) P(dξ),        (SP)

where X ⊆ R^{d_x} represents the set of constraints the decision vector x of dimension d_x must satisfy. The random vector ξ̃ on (Ξ, B, P) is of dimension d_ξ and has support Ξ ⊆ R^{d_ξ} and distribution P that does not depend on x, where B is the Borel σ-algebra on Ξ. The function f : X × Ξ → R is assumed to be a Borel measurable, real-valued function, with inputs being the decision vector x and a realization ξ of the random vector ξ̃. Throughout the dissertation, we use ξ̃ to denote the random vector and ξ to denote its realization. The expectation operator E is taken with respect to P. We use z* to denote the optimal objective function value of (SP) and x* to denote an optimal solution to (SP). The set of optimal solutions is given by X* = arg min_{x ∈ X} Ef(x, ξ̃).
We will primarily focus on two-stage stochastic linear programs with recourse, a class of (SP) first introduced by Beale (1955) and Dantzig (1955), where f(x, ξ̃) = cx + h(x, ξ̃) and h(x, ξ̃) is the optimal value of the minimization problem

    min_y { q̃y : W̃y = R̃ − T̃x, y ≥ 0 }.

In this case, X = {x ∈ R^{d_x} : Ax = b, x ≥ 0} and ξ̃ = (q̃, W̃, R̃, T̃). The following terminology is often used. If W̃ = W, i.e., W is non-random, then the problem is said to have fixed recourse. Otherwise, the problem has random recourse. If q̃ = q, then the problem has stochasticity only on the right-hand side. We will list assumptions on this class of problems required by our theoretical results in subsequent chapters.
Two-stage stochastic programs can be understood in the following way. In the first stage of the problem, a decision x is made knowing only the distribution of the random vector ξ̃. Then, a realization ξ of ξ̃ occurs, and a corrective recourse decision y is made in the second stage of the problem. Whereas x cannot depend on ξ, y explicitly depends on the random outcome, but this dependence is suppressed above. This class of problems can be extended to multi-stage decision problems over a finite number of time periods using conditional expectations. It can also be reformulated to include nonlinear or integer constraints and nonlinear objective function terms.
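To make the two-stage structure concrete, the following minimal Python sketch (not taken from the dissertation; the cost figures, the demand distribution, and the function names are hypothetical) evaluates f(x, ξ) = cx + h(x, ξ) for a newsvendor-style instance whose second-stage linear program has fixed recourse, randomness only on the right-hand side, and a simple closed-form solution. The same toy instance is reused in the illustrative sketches accompanying Chapter 2.

```python
import numpy as np

# Hypothetical newsvendor-style instance: order x units at unit cost c in the first
# stage; after demand xi is realized, pay a shortage penalty p per unmet unit and a
# holding cost s per leftover unit in the second stage.
c, p, s = 1.0, 4.0, 0.5

def h(x, xi):
    # Optimal value of the second-stage LP
    #   min { p*u + s*v : u - v = xi - x, u >= 0, v >= 0 },
    # i.e., fixed recourse with randomness only on the right-hand side; its closed
    # form is the shortage/holding cost below.
    return p * max(xi - x, 0.0) + s * max(x - xi, 0.0)

def f(x, xi):
    # Per-scenario cost f(x, xi) = c*x + h(x, xi) appearing in (SP).
    return c * x + h(x, xi)

# Estimate Ef(x, xi~) for a fixed x by averaging over sampled demands.
rng = np.random.default_rng(0)
demands = rng.exponential(scale=10.0, size=10_000)
print(np.mean([f(8.0, xi) for xi in demands]))
```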
Two-stage stochastic programs have been used in a wide variety of applications, including fleet management, production planning, risk management, energy generation,
and telecommunications; see (Wallace & Ziemba, 2005) for example applications. We
now present two examples of two-stage stochastic linear programs with fixed recourse
which appear in our computational experiments.
Example 1.1 (Aircraft Allocation). An airline wishes to allocate several types of aircraft to different routes in order to maximize expected profit. The customer demands
for each route are modeled as independent random variables with known distributions, and the fixed operating costs vary according to the aircraft type and route.
First, the airline determines the number of each aircraft type to assign to each route.
Once the customer demand has been realized, the airline can choose the number of
bumped passengers (incurring a cost per passenger) and the number of empty seats
on each flight. This model, referred to as GBD, was first introduced by Ferguson &
Dantzig (1956). We will consider a modification of this problem in later chapters.
Example 1.2 (Telecommunications Network Planning). A telecommunications network consists of a set of nodes connected by links. The number of calls between
each point-to-point pair of nodes is treated as a random variable. When service is requested between two nodes, a route of links between the nodes with sufficient capacity
is identified. If no route has enough capacity, the request for service goes unmet. In
the first stage, the amount of capacity to add to each link is chosen. In the second
stage, the number of calls that use each possible route and the number of unserved
calls are determined for each point-to-point pair of nodes. The goal is to minimize
the expected number of unserved calls. A mathematical formulation of this problem,
SSN, can be found in (Sen et al., 1994).
1.1 Motivation
Even though stochastic programs can yield more realistic models compared to deterministic mathematical programs, they can also be extremely difficult, and perhaps impossible, to solve. For instance, if the random vector ξ̃ is discrete, then the size of the stochastic program will grow exponentially with the dimension of the random vector, and so the stochastic program suffers from a curse of dimensionality. If instead ξ̃ is continuous, the stochastic program will in general be intractable unless f has a special structure. Note that in many cases, the expectation in (SP) is a multidimensional integral of a complicated function and cannot be calculated exactly even for a fixed x ∈ X. Optimizing Ef(x, ξ̃) over X brings additional challenges.
Assessing the quality of a potential solution is critically important since many real-world problems cast as (SP), such as the two-stage stochastic linear programs with recourse that are the primary focus of this dissertation, cannot be solved exactly, and one is often left with an approximate solution x̂ ∈ X without verification of its quality. Quality assessment is also fundamental in optimization algorithms, which use it iteratively, e.g., every time a new candidate solution is generated; these algorithms need to identify an optimal or nearly optimal solution in order to stop.
Specifically, given a candidate (feasible) solution x̂ ∈ X to (SP), we would like to determine whether it is optimal or nearly optimal. This can be done by calculating the optimality gap, denoted Gx̂, where Gx̂ = Ef(x̂, ξ̃) − z*. A high-quality candidate solution will have a small optimality gap, and an optimal solution will have an optimality gap of zero. Unfortunately, since the optimal value z* is unknown, the optimality gap cannot be computed explicitly. In addition, as mentioned earlier, it may not be possible to evaluate Ef(x̂, ξ̃) exactly. Given a candidate solution x̂ and sample size n, Monte Carlo sampling can be used to create statistical estimators of the optimality gap Gx̂ (Bayraksan & Morton, 2006; Mak et al., 1999; Norkin et al., 1998), which we will revisit in Chapter 2.
When a statistical estimator of Gx̂ turns out to be large, this can be due to the
fact that:
(i) bias is large;
(ii) variance or sampling error is large; or
(iii) Gx̂ is large.
The third case arises simply when x̂ is a low-quality solution. However, even if x̂ is a high-quality solution, (i) and (ii) could result in an estimator that is significantly misleading. It is well known that Monte Carlo statistical estimators of optimality gaps are biased (Mak et al., 1999; Norkin et al., 1998). That is, on average, they tend to over-estimate Gx̂ for a finite sample size n. Therefore, for some problems, bias could be a major issue, whereas for other problems, variance could be the dominant factor. In either situation, estimates of Gx̂ can be large even if we have a high-quality solution. Additionally, high variance or sampling error can lead to estimators that under-report the optimality gap and thus indicate that a candidate solution is of higher quality than warranted. Each case mentioned significantly diminishes our ability to identify an optimal or nearly optimal solution.
While the current literature provides Monte Carlo sampling-based methods to
estimate Gx̂ , these methods could yield unreliable estimators when bias or variance
is large. This dissertation aims to improve the current procedures by reducing bias
and variance, yielding estimators that are more reliable than the current state of the art in assessing solution quality in stochastic programs. In particular, we present a
method to reduce bias via strategic partitioning of samples. We then investigate the
use of variance reduction schemes in optimality gap estimation. Finally, we study
these methods in a sequential sampling context.
1.2 Contributions
The primary goal of this dissertation is to identify high-quality solutions to (SP)
by improving the reliability of sampling-based estimators of optimality gaps. The
methods developed in this dissertation to achieve this goal could be used to assess
the quality of a given solution (found in any way) with a fixed sample size, or within
a sequential Monte Carlo sampling-based method to identify high-quality solutions
to (SP) with increasing sample sizes.
We note that extensive work has been done in the statistics and simulation literature regarding bias and variance reduction methods. We review some of these methods in the context of stochastic programming in Chapters 3 and 4 (see Sections 3.3 and 4.1). Thus far, attention has been focused on the estimator zn* of the true optimal value of (SP) or on Ef(x, ξ̃) for a fixed x. The work presented in this dissertation addresses the need to improve estimators of optimality gaps of proposed solutions to (SP).
The main contributions of this dissertation are as follows:
C1. We develop a technique to reduce the bias of optimality gap estimators motivated by stability results in stochastic programming. Stability results quantify
changes in optimal values and optimal solution sets under distributional perturbations. The random sample is partitioned via a minimum-weight perfect
matching problem in an e↵ort to reduce bias. The observations within each
group are no longer independent and identically distributed, complicating further analysis. This technical difficulty is overcome with a weak convergence
argument and additional asymptotic properties are presented.
C2. We embed two well-known variance reduction schemes, antithetic variates and
Latin hypercube sampling, in algorithms that produce optimality gap estimators. We also blend these sampling schemes with the bias reduction technique
outlined in contribution C1. Asymptotic properties of the resulting estimators are discussed and extensive computational experiments for a range of test
problems are given. Based on our theory and computational experiments, we
provide guidelines on effective and efficient means of assessing solution quality.
C3. Finally, we apply a selected subset of the methods developed above to a sequential sampling procedure that aims to approximately solve (SP) via a sequence
of generated candidate solutions with increasing sample size. We present extensions to the theory and provide computational experiments.
1.3 Definitions and Abbreviations
In this section, we introduce commonly used statistical terms and list general assumptions on (SP) required in this dissertation. We also provide a list of abbreviations
commonly used throughout this dissertation. We start with a framework to model
Monte Carlo sampling.
The results presented in this dissertation require precise probabilistic modeling
of the Monte Carlo sampling performed. In particular, the expectations and the
almost sure statements are made with respect to the underlying product measure.
An overview of this framework is as follows. Let (Ω, A, P̂) be the space formed by the product of a countable sequence of identical probability spaces (Ξ_i, B_i, P_i), where Ξ_i = Ξ, B_i = B, and P_i = P, for i = 1, 2, ..., and let ξ^i denote an outcome in the sample space Ξ_i. An outcome ω ∈ Ω then has the form ω = (ξ^1, ξ^2, ...). Now, define the countable sequence of projection random variables {ξ̃^i : Ω → Ξ, i = 1, 2, ...} by ξ̃^i(ω) = ξ^i. Then, the collection {ξ̃^1, ..., ξ̃^n} is a random sample from (Ξ, B, P), and ξ̃ := ξ̃^1 is a random variable with distribution P.
We will study both point and interval estimators, which are functions of a random
sample that calculate an estimate of an unknown parameter. A point estimator
computes a single value, whereas an interval estimator provides a range of values. A
point estimator is strongly consistent if it converges to the true value, almost surely, as
opposed to in probability. We refer to such an estimator simply as consistent from now
on. To understand the behavior of an interval estimator, we consider the probability
that it contains the parameter of interest, referred to as its coverage probability, or
simply coverage. A high-quality interval estimator will have a narrow width but also
a high coverage probability. Note, however, that a high coverage probability can be
obtained by increasing the interval estimator’s width. An interval estimator which is
constructed to have a coverage probability that (asymptotically) equals or exceeds a
certain value (the confidence level) is called a confidence interval estimator. Such an
estimator is said to be (asymptotically) valid.
We now provide a list of abbreviations used in this dissertation. Three algorithms
considered in this dissertation, the Multiple Replications Procedure, the Single Replication Procedure, and the Averaged Two-Replication Procedure, are abbreviated as
MRP, SRP, and A2RP, respectively. We use AV to denote antithetic variates and LHS
to denote Latin hypercube sampling. “Strong law of large numbers” is shortened to
SLLN and “central limit theorem” to CLT. “Confidence interval” is abbreviated as
CI and “sample variance” is abbreviated as SV. Finally, the phrases “independent
and identically distributed” and “almost surely” are shortened to i.i.d. and a.s., respectively. The phrase “i.i.d. sampling” is abbreviated to IID in some tables.
1.4 Dissertation Organization
The rest of this dissertation is organized in the following way:
In Chapter 2, we discuss the use of Monte Carlo simulation for assessing solution
quality via optimality gap estimation. We review the three algorithms from the literature, MRP, SRP, and A2RP, that are used to estimate optimality gaps. We review
additional relevant literature on bias reduction in Chapter 3, variance reduction in
Chapter 4, and a sequential sampling method in Chapter 5.
Chapter 3 presents a technique for reducing the bias of the A2RP optimality
gap estimators for two-stage stochastic linear programs with recourse via a probability metrics approach, motivated by stability results in stochastic programming. We
start with a review of relevant literature and a discussion of the problem class. We
then discuss the motivation behind the bias reduction technique and formally define
the resulting algorithm. We provide conditions under which asymptotic results on
optimality gap estimators hold and present computational experiments to provide
insights into the effectiveness of the bias reduction technique. The material presented
in this chapter can be found in (Stockbridge & Bayraksan, 2012). This is the main
contribution C1 in this dissertation.
In Chapter 4, we embed sampling-based variance reduction techniques from the
literature in the pre-existing algorithms MRP, SRP, and A2RP. We particularly focus
on AV and LHS, and an overview of these techniques and their use in the stochastic
programming literature begins the chapter. We then discuss the theoretical implications of each combination of algorithm for optimality gap estimation and variance
reduction scheme. This includes the use of AV and LHS within MRP, SRP, and
A2RP. In addition, we also blend our bias reduction procedure detailed in Chapter 3
with LHS and AV. Computational results are provided and discussed for each case.
This is the main contribution C2 in this dissertation.
Chapter 5 applies the ideas of the previous chapters to a sequential sampling
setting, where the aim is to obtain high-quality solutions to (SP) with a desired probability. First, a sequential sampling procedure from the literature is reviewed. Then,
a subset of variance reduction techniques are applied both when generating candidate
solutions and when assessing their quality via SRP. Asymptotic properties of the resultant procedures are established and their empirical performance is analyzed. This
is the main contribution C3 in this dissertation.
We conclude the dissertation in Chapter 6 with a summary of contributions and
a discussion of future research directions.
Chapter 2
Overview of Algorithms for Assessing Solution Quality
In this chapter, we give an overview of Monte Carlo sampling-based techniques from
the literature for assessing the quality of a candidate solution. In particular, we review
MRP, developed by Mak et al. (1999), and SRP and A2RP of Bayraksan & Morton
(2006). These procedures, while differing in the details of their implementation,
each produce point and interval estimators measuring solution quality. A discussion
weighing the costs and benefits of each algorithm along with guidelines for use can
be found in (Bayraksan & Morton, 2006).
Recall that we define the quality of a solution x̂ ∈ X to be its optimality gap, denoted Gx̂, where Gx̂ = Ef(x̂, ξ̃) − z*. The optimality gap Gx̂ cannot be evaluated explicitly, particularly as the optimal value z* is not known. Furthermore, exact evaluation of Ef(x̂, ξ̃) may not be possible, as the expectation is typically a high-dimensional integral of a complicated function. Monte Carlo sampling-based methods bypass these difficulties by allowing us to form statistical estimators of optimality gaps. These work in the following way. They take as

Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), a sample size n, a method to generate the sample, and a method to solve a sampling approximation of (SP),

and they produce

Output: A point estimator (e.g., Gn(x̂)), its associated variance estimator (e.g., sn²), and a (1 − α)-level approximate confidence interval estimator of Gx̂ (e.g., [0, Gn(x̂) + ε_{n,α}], where ε_{n,α} is the sampling error that typically uses sn²).
They are easy to implement, provided a sampling approximation of (SP) with
moderate sample sizes can be solved, and they can be used in conjunction with any
specialized solution procedure to solve the sampling approximations of the underlying
problem. For example, Bayraksan & Morton (2011) and Bayraksan & Pierre-Louis
(2012) develop sequential sampling stopping rules that make use of optimality gap
estimators to obtain high-quality solutions to (SP). These methods to estimate optimality gaps have been successfully applied to problems in finance (Bertocchi et al.,
2000), stochastic vehicle routing (Kenyon & Morton, 2003; Verweij et al., 2003), supply chain network design (Santoso et al., 2005), and scheduling under uncertainty
(Keller & Bayraksan, 2010).
This chapter is organized as follows. In Section 2.1, we list and discuss necessary
assumptions. We provide background on optimality gap estimation in Section 2.2.
We then present the procedures MRP, SRP, and A2RP in Sections 2.3, 2.4, and 2.5,
respectively, and close with a summary in Section 2.6.
2.1 Problem Class
The main assumptions we impose on (SP) in this chapter are as follows:
(A1) f(·, ξ̃) is continuous in x, a.s.,
(A2) E sup_{x∈X} f²(x, ξ̃) < ∞,
(A3) X ≠ ∅ and is compact.
Assumption (A1) holds for two-stage stochastic linear programs with relatively
complete recourse; i.e., a feasible second-stage decision exists for every feasible first-stage decision, a.s. Assumption (A2) ensures the existence of first and second moments and provides a uniform integrability condition. Assumption (A3) requires that
the problem be feasible and the set of feasible solutions be closed and bounded. This
last condition is not usually overly restrictive for practical problems.
2.2 Optimality Gap Estimation
Before defining MRP, SRP, and A2RP, we first discuss how to compute point estimators of the optimality gap. Since Gx̂ usually cannot be evaluated explicitly, we use Monte Carlo sampling to provide an approximation of (SP) and exploit the properties of this approximation to estimate the optimality gap. We first approximate P, using the observations from a random sample {ξ̃^1, ..., ξ̃^n} described in Section 1.3, by the empirical distribution Pn(·) = (1/n) Σ_{i=1}^n δ_{ξ̃^i}(·), where δ_{ξ̃^i} denotes the point mass at ξ̃^i. The use of (·) indicates that Pn is a probability measure on Ξ. We then approximate (SP) by

    min_{x ∈ X} (1/n) Σ_{i=1}^n f(x, ξ̃^i) = min_{x ∈ X} ∫_Ξ f(x, ξ) Pn(dξ).        (SPn)
Let xn* denote an optimal solution to (SPn) and zn* denote the corresponding optimal value. Asymptotic properties of xn* and zn* have been comprehensively studied in the literature (Attouch & Wets, 1981; Dupačová & Wets, 1988; King & Rockafellar, 1993; Shapiro, 1989). As mentioned in Section 1.3, it is most convenient throughout the dissertation to interpret expectations and almost sure statements relating to zn* with respect to the underlying probability measure P̂. For instance, Ezn* = ∫_Ω zn*(ω) P̂(dω).
By interchanging minimization and expectation, we have

    Ezn* = E[ min_{x ∈ X} (1/n) Σ_{i=1}^n f(x, ξ̃^i) ] ≤ min_{x ∈ X} E[ (1/n) Σ_{i=1}^n f(x, ξ̃^i) ] = min_{x ∈ X} Ef(x, ξ̃) = z*.        (2.1)
In other words, Ezn* provides us with a lower bound on z*. An upper bound on Gx̂ = Ef(x̂, ξ̃) − z* is then given by Ef(x̂, ξ̃) − Ezn*. We estimate Ef(x̂, ξ̃) − Ezn* by

    Gn(x̂) = (1/n) Σ_{i=1}^n f(x̂, ξ̃^i) − min_{x ∈ X} (1/n) Σ_{i=1}^n f(x, ξ̃^i) = (1/n) Σ_{i=1}^n f(x̂, ξ̃^i) − zn*.        (2.2)

With fixed x̂ ∈ X, (1/n) Σ_{i=1}^n f(x̂, ξ̃^i) is an unbiased estimator of Ef(x̂, ξ̃) due to i.i.d. sampling. However, since Ezn* − z* ≤ 0,

    EGn(x̂) ≥ Ef(x̂, ξ̃) − z*,

and hence Gn(x̂) is a biased estimator of the optimality gap. We assume the same observations are used in both terms on the right-hand side of (2.2), so Gn(x̂) ≥ 0. This results in variance reduction through the use of common random variates. Consequently, compared to zn*, which has the same bias, bias can be a more prominent factor in Gn(x̂).
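As a rough illustration (a sketch using the toy newsvendor instance introduced in Chapter 1, not the dissertation's code), the snippet below forms Gn(x̂) exactly as in (2.2): the same sample appears in both terms, (SPn) is solved by a simple search over a discretized feasible set standing in for a linear programming solver, and the resulting estimate is therefore nonnegative.

```python
import numpy as np

c, p, s = 1.0, 4.0, 0.5          # hypothetical newsvendor costs
def f(x, xi):                    # vectorized per-scenario cost f(x, xi)
    return c * x + p * np.maximum(xi - x, 0.0) + s * np.maximum(x - xi, 0.0)
X = np.linspace(0.0, 50.0, 501)  # discretized stand-in for the feasible set X

def gap_estimate(x_hat, n, rng):
    """Point estimator Gn(x_hat) of (2.2) from one i.i.d. sample of size n."""
    xi = rng.exponential(scale=10.0, size=n)         # sample from P
    z_n = min(f(x, xi).mean() for x in X)            # optimal value of (SP_n)
    return f(x_hat, xi).mean() - z_n                 # common sample in both terms

rng = np.random.default_rng(1)
print(gap_estimate(x_hat=8.0, n=200, rng=rng))
```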
It is well known that the bias decreases as the size of the random sample increases (Mak et al., 1999; Norkin et al., 1998). That is, Ezn* ≤ Ez(n+1)* for all n. However, the rate at which the bias shrinks to zero can be slow, e.g., of order O(n^{−1/2}); see, for instance, Example 4 in (Bayraksan & Morton, 2009). One way to reduce bias is to simply increase the sample size. However, significant increases in sample size are required to obtain a modest reduction in bias, and increasing the sample size is not computationally desirable since obtaining statistical estimators of optimality gaps requires solving a sampling approximation problem. The bias reduction technique presented in Chapter 3 provides a way to lessen bias without increasing the sample size. We also observe in Chapter 4 that variance reduction techniques can result in bias reduction.
We note that there are other approaches to assessing solution quality. Some
of these approaches are motivated by the Karush-Kuhn-Tucker conditions; see, e.g.,
(Higle & Sen, 1991b; Shapiro & Homem-de-Mello, 1998). There is also work on assessing solution quality with respect to a particular sampling-based algorithm, typically
utilizing the bounds obtained through the course of the algorithm; see, e.g., (Dantzig
& Infanger, 1995; Higle & Sen, 1991a, 1999; Lan et al., 2012).
In the rest of this chapter, we provide an overview of three procedures to estimate
Gx̂ , MRP, SRP, and A2RP, which we then enhance to reduce bias and/or variance in
later chapters.
2.3 Multiple Replications Procedure
The use of minimization in the definition of the optimality gap point estimator (2.2)
results in a random variable that is generally not normally distributed. Therefore the
CLT cannot be applied directly to Gn (x̂) to produce an approximate CI estimator
of the optimality gap. The Multiple Replications Procedure of Mak et al. (1999)
overcomes this difficulty by computing multiple instances of Gn (x̂) using independent
batches of observations and forming a CI on the mean of these instances. Let t_{n,α} be the 1 − α quantile of the Student’s t distribution with n degrees of freedom. We
assume a method to solve (SPn ) is known and, in this chapter, the sample is generated
in an i.i.d. fashion. Therefore, we remove this from the input list of the procedures.
MRP is as follows:
MRP

Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), a sample size per replication n, and a replication size m.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on Gx̂.

1. For l = 1, ..., m,
   1.1. Sample observations i.i.d. {ξ̃_l^1, ..., ξ̃_l^n} from P.
   1.2. Solve (SP_{n,l}) using {ξ̃_l^1, ..., ξ̃_l^n} to obtain x*_{n,l} and z*_{n,l}.
   1.3. Calculate

        Gn,l(x̂) = (1/n) Σ_{i=1}^n [ f(x̂, ξ̃_l^i) − f(x*_{n,l}, ξ̃_l^i) ].

2. Calculate the optimality gap and sample variance estimators by:

        Ḡ(m) = (1/m) Σ_{l=1}^m Gn,l(x̂)    and    s²(m) = (1/(m − 1)) Σ_{l=1}^m [ Gn,l(x̂) − Ḡ(m) ]².

3. Output a one-sided confidence interval on Gx̂:

        [ 0, Ḡ(m) + t_{m−1,α} s(m)/√m ].        (2.3)
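The following Python sketch mirrors the three steps of MRP on the toy newsvendor instance (illustrative only; a grid search over a discretized feasible set replaces the LP solver assumed by the procedure) and returns Ḡ(m), s²(m), and the one-sided interval (2.3).

```python
import numpy as np
from scipy.stats import t

c, p, s = 1.0, 4.0, 0.5          # hypothetical newsvendor costs
def f(x, xi):
    return c * x + p * np.maximum(xi - x, 0.0) + s * np.maximum(x - xi, 0.0)
X = np.linspace(0.0, 50.0, 501)  # discretized stand-in for the feasible set X

def mrp(x_hat, n, m, alpha, rng):
    gaps = []
    for _ in range(m):                                   # Step 1: m replications
        xi = rng.exponential(scale=10.0, size=n)
        z_nl = min(f(x, xi).mean() for x in X)           # solve (SP_{n,l})
        gaps.append(f(x_hat, xi).mean() - z_nl)          # G_{n,l}(x_hat)
    g_bar = float(np.mean(gaps))                         # Step 2: point estimator
    s2 = float(np.var(gaps, ddof=1))                     # sample variance s^2(m)
    half_width = t.ppf(1 - alpha, m - 1) * np.sqrt(s2 / m)
    return g_bar, s2, (0.0, g_bar + half_width)          # Step 3: CI (2.3)

rng = np.random.default_rng(2)
print(mrp(x_hat=8.0, n=100, m=30, alpha=0.10, rng=rng))
```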
Since the replications {Gn,1(x̂), ..., Gn,m(x̂)} are i.i.d., we can apply the CLT to the sample mean, Ḡ(m), to get

    P( EGn(x̂) ≤ Ḡ(m) + t_{m−1,α} s(m)/√m ) ≈ 1 − α

for sufficiently large m. As mentioned above, Monte Carlo sampling produces unbiased estimators, i.e., E[ (1/n) Σ_{i=1}^n f(x, ξ̃^i) ] = Ef(x, ξ̃), and so Ezn* ≤ z* and thus EGn(x̂) ≥ Gx̂. Therefore, for large enough m, (2.3) is an approximate (1 − α)-level CI for Gx̂, i.e.,

    P( Gx̂ ≤ Ḡ(m) + t_{m−1,α} s(m)/√m ) ≈ 1 − α.

Note that var(Ḡ(m)) = (1/m) var(Gn(x̂)), and so we can expect the SV estimator s²(m) to be large if the variance of Ḡ(m) is large. Of course, this can be reduced by increasing the number of replications, but this also means solving more optimization problems (one per replication) to assess the quality of one solution. In Chapter 4, we explore the effect of alternative sampling techniques on MRP in an effort to reduce variance (and bias) without increasing the number of replications.
2.4 Single Replication Procedure
A general rule of thumb to ensure that the sample mean of i.i.d. random variables is approximately normal is to use a sample size of at least 30. Translating to the MRP setting, the number of replications m is usually set to at least 30, which therefore requires solving at least 30 optimization problems in Step 1.2. This can be computationally burdensome. An alternate approach, referred to as the Single Replication Procedure, is presented in (Bayraksan & Morton, 2006). SRP uses a single optimality gap estimator Gn, and computes the SV of the individual observations {f(x̂, ξ̃^1) − f(xn*, ξ̃^1), ..., f(x̂, ξ̃^n) − f(xn*, ξ̃^n)}. These two point estimators are then combined to form a CI estimator of the optimality gap. Define f̄n(x) = (1/n) Σ_{i=1}^n f(x, ξ̃^i) and let z_α be the 1 − α quantile of the standard normal distribution. The procedure is as follows:
SRP

Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and a sample size n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on Gx̂.

1. Sample i.i.d. observations {ξ̃^1, ..., ξ̃^n} from P.
2. Solve (SPn) to obtain xn* and zn*.
3. Calculate Gn(x̂) as in (2.2) and

        sn² = (1/(n − 1)) Σ_{i=1}^n [ ( f(x̂, ξ̃^i) − f(xn*, ξ̃^i) ) − ( f̄n(x̂) − f̄n(xn*) ) ]².        (2.4)

4. Output a one-sided confidence interval on Gx̂:

        [ 0, Gn(x̂) + z_α sn/√n ].        (2.5)
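A matching sketch of SRP on the same toy newsvendor instance (again with a grid search standing in for an LP solver; not the dissertation's implementation) computes the point estimator of (2.2), the sample variance (2.4) of the individual differences, and the interval (2.5).

```python
import numpy as np
from scipy.stats import norm

c, p, s = 1.0, 4.0, 0.5          # hypothetical newsvendor costs
def f(x, xi):
    return c * x + p * np.maximum(xi - x, 0.0) + s * np.maximum(x - xi, 0.0)
X = np.linspace(0.0, 50.0, 501)  # discretized stand-in for the feasible set X

def srp(x_hat, n, alpha, rng):
    xi = rng.exponential(scale=10.0, size=n)          # Step 1: one i.i.d. sample
    x_star = min(X, key=lambda x: f(x, xi).mean())    # Step 2: solve (SP_n)
    diffs = f(x_hat, xi) - f(x_star, xi)              # per-observation differences
    gap = diffs.mean()                                # Step 3: Gn(x_hat) as in (2.2)
    s2 = diffs.var(ddof=1)                            # sample variance (2.4)
    half_width = norm.ppf(1 - alpha) * np.sqrt(s2 / n)
    return gap, s2, (0.0, gap + half_width)           # Step 4: CI (2.5)

rng = np.random.default_rng(3)
print(srp(x_hat=8.0, n=500, alpha=0.10, rng=rng))
```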
Note that unlike the m independent optimality gap estimators of MRP used to calculate the SV, the observations f(x̂, ξ̃^i) − f(xn*, ξ̃^i), i = 1, ..., n, used in the SRP SV calculation are not independent. Instead, they each depend on xn* ∈ arg min_{x ∈ X} (1/n) Σ_{i=1}^n f(x, ξ̃^i). However, if (A1)–(A3) are satisfied, then the point estimators Gn(x̂) and sn² of SRP are consistent and the interval estimator of SRP in (2.5) is asymptotically valid. The goal of Chapters 3 and 4 is to address situations when the bias and variance of Gn(x̂) are large or the SV estimator sn² is large, which can lead to unduly large (or small) CI estimators.
SRP can significantly reduce the computation time required to estimate the optimality gap. However, for small sample sizes it can happen that xn* coincides with the candidate solution x̂. The optimality gap and SV estimators are then zero (e.g., when x̂ = xn*, Gn(x̂) in (2.2) is zero). This also results in a CI estimator of width zero, even though the candidate solution may be significantly suboptimal. We repeat Example 1 in (Bayraksan & Morton, 2006) to illustrate the concept of coinciding solutions:
Example 2.1. Consider the problem {min E[ξ̃x] : −1 ≤ x ≤ 1}, where ξ̃ ~ N(μ, 1) and μ > 0. The optimal pair is (x*, z*) = (−1, −μ). We examine the candidate solution x̂ = 1, which has the largest optimality gap of 2μ. If the random sample satisfies ξ̄ = (1/n) Σ_{i=1}^n ξ̃^i < 0, then xn* = 1 coincides with x̂, and so the point and interval estimators of SRP are zero. Setting μ = 0.1, α = 0.10, and n = 50, and using normal quantiles, we obtain an upper bound on the coverage of SRP as 1 − P(ξ̄ < 0) ≈ 0.760, which is considerably below the desired coverage of 0.90.
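The probabilities quoted in this example, and those in Example 2.2 below, can be checked with a few lines of Python using the fact that ξ̄ ~ N(μ, 1/n):

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu = 0.1
p50 = Phi(-mu * sqrt(50))          # P(xi_bar < 0) with n = 50
p25 = Phi(-mu * sqrt(25))          # P(xi_bar < 0) with n = 25

print(1 - p50)                     # SRP coverage bound, about 0.760
print(1 - p50 ** 2)                # A2RP, two replications of 50: about 0.943
print(1 - p25 ** 2)                # A2RP, two replications of 25: about 0.905
```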
The Averaged Two-Replication Procedure addresses this undesirable coverage behavior by using two replications of observations.
2.5 Averaged Two-Replication Procedure
A2RP can be defined as the following modification of MRP. Note that rather than
using the standard SV as in MRP, the A2RP SV estimator is based on the SRP SV
estimator.
A2RP

Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and a sample size per replication n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on Gx̂.

Fix m = 2 and replace Steps 1.3, 2, and 3 of MRP by:

1.3. Calculate Gn,l(x̂) and s²n,l.
2. Calculate the optimality gap and sample variance estimators by taking the average:

        G′n(x̂) = (1/2)( Gn,1(x̂) + Gn,2(x̂) )    and    s′n² = (1/2)( s²n,1 + s²n,2 ).

3. Output a one-sided confidence interval on Gx̂:

        [ 0, G′n(x̂) + z_α s′n/√(2n) ].        (2.6)
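For completeness, a sketch of A2RP on the same toy instance (illustrative; the grid search again stands in for an LP solver) averages two independent replications and combines the SRP-style variances in the interval (2.6).

```python
import numpy as np
from scipy.stats import norm

c, p, s = 1.0, 4.0, 0.5          # hypothetical newsvendor costs
def f(x, xi):
    return c * x + p * np.maximum(xi - x, 0.0) + s * np.maximum(x - xi, 0.0)
X = np.linspace(0.0, 50.0, 501)  # discretized stand-in for the feasible set X

def one_replication(x_hat, n, rng):
    """Return (G_{n,l}(x_hat), s^2_{n,l}) from one i.i.d. sample of size n."""
    xi = rng.exponential(scale=10.0, size=n)
    x_star = min(X, key=lambda x: f(x, xi).mean())    # solve (SP_{n,l})
    diffs = f(x_hat, xi) - f(x_star, xi)
    return diffs.mean(), diffs.var(ddof=1)

def a2rp(x_hat, n, alpha, rng):
    g1, v1 = one_replication(x_hat, n, rng)
    g2, v2 = one_replication(x_hat, n, rng)
    gap, s2 = 0.5 * (g1 + g2), 0.5 * (v1 + v2)        # averaged estimators
    half_width = norm.ppf(1 - alpha) * np.sqrt(s2 / (2 * n))
    return gap, s2, (0.0, gap + half_width)           # CI (2.6)

rng = np.random.default_rng(4)
print(a2rp(x_hat=8.0, n=250, alpha=0.10, rng=rng))
```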
The same conditions as for SRP guarantee that the point estimator G′n(x̂) of A2RP is consistent and the interval estimator of A2RP in (2.6) is asymptotically valid. We now examine how A2RP can improve coverage by returning to Example 2.1:
Example 2.2. Consider the problem instance in Example 2.1. Now consider A2RP with each replication of size n = 50. Let ξ̄1 = (1/n) Σ_{i=1}^n ξ̃1^i be the sample mean of the first group of observations, and similarly let ξ̄2 be the sample mean of the second group. In this case, the probability of obtaining a CI estimator of non-zero width is given by 1 − P(ξ̄1 < 0)P(ξ̄2 < 0) ≈ 1 − (0.240)² ≈ 0.943. Alternatively, note that if the sample of 50 observations is divided into two groups of 25 observations, the probability of obtaining a CI estimator of non-zero width is approximately 1 − (0.308)² ≈ 0.905.

2.6 Summary and Concluding Remarks
In this chapter, we reviewed three Monte Carlo sampling-based algorithms, MRP,
SRP, and A2RP, to assess the quality of a candidate solution to (SP). Observe that
the bias of the optimality gap estimator is the same for MRP, SRP, and A2RP if
the sample size per replication n is equal for each procedure. If instead we fix the
same total sample size for each algorithm, MRP will have significantly increased bias
compared to SRP and A2RP due to a larger number of replications and therefore a
smaller sample size per replication. Due to the heavy computational burden of MRP
and the difficulties that can arise when using SRP, in Chapter 3 we focus on A2RP
and present a technique that aims to reduce the bias of the optimality gap estimator. We will return to the above example in Section 3.4.2 to see how implementing
bias reduction affects coinciding solutions. Chapter 4 implements variance reduction
schemes in all three algorithms.
Chapter 3
Averaged Two-Replication Procedure with Bias Reduction
In this chapter, we combine a Monte Carlo sampling-based approach to optimality
gap estimation with stability results from (Römisch, 2003) for a class of two-stage
stochastic linear programs, with the intention of reducing the bias of the A2RP optimality gap estimators. We note that in Section 2.5, we defined the two replications of
A2RP as each using a sample size of n, which allows a fair comparison of MRP, SRP,
and A2RP. In this chapter, we focus on A2RP only and assume a fixed computational
budget of n observations, which are then divided in two to form two replications of
size n/2. Our goal is to partition the observations in such as way as to reduce the
bias of the optimality gap estimator as compared to dividing the observations randomly. The bias reduction technique presented partitions the observations by solving
a minimum-weight perfect matching problem, which can be done in polynomial time
in sample size. We are concerned with a fixed candidate solution x̂ 2 X in this
chapter and the next, so we suppress the dependence on x̂ in our notation. We also
suppress the dependence on the sample size n for notational simplicity.
This chapter is organized as follows. In the next section, we give an overview of
the relevant literature. In Section 3.2 we formally define our problem setup and list
necessary assumptions, which are more restrictive than those in Chapter 2. Section 3.3
updates A2RP to reflect our focus on partitioning and presents the stability result. We
then introduce our bias reduction technique in Section 3.4 and illustrate the technique
on an instance of a newsvendor problem in Section 3.5. Asymptotic properties of
the resulting estimators are provided in Section 3.6. In Section 3.7, we present our
computational experiments on a number of test problems. Finally, in Section 3.8, we
close with a summary.
3.1 Literature Review
Bias reduction in statistics and simulation is a well-established topic and resampling
methods such as jackknife and bootstrap are commonly used for this purpose (Efron
& Tibshirani, 1993; Shao & Tu, 1995). In stochastic programming, while there has
been a lot of focus on variance reduction techniques (see Section 4.1 for an overview),
bias reduction has received relatively little attention. Only a few studies exist for
this purpose. Freimer et al. (2012) study the effect on bias of different sampling
schemes mainly used for variance reduction, such as AV and LHS. These schemes can
successfully reduce the bias of the estimator of z* with minimal computational effort;
however, the entire optimality gap estimators are not considered. Partani (2007) and
Partani et al. (2006), on the other hand, develop a generalized jackknife technique
for bias reduction in MRP optimality gap estimators.
In this chapter, bias reduction is motivated by stability results in stochastic programming rather than adaptation of well-established sampling or bias reduction techniques. Stability results use probability metrics to provide continuity properties of
optimal values and optimal solution sets with respect to perturbations of the original
probability distribution of the random vector; see, e.g., the survey by Römisch (2003).
Stability results have been successfully used for scenario reduction in stochastic programs; see, e.g., (Dupačová et al., 2003; Heitsch & Römisch, 2003; Henrion et al.,
2009).
We specifically apply the bias reduction approach to A2RP, described in Section
2.5, and use a particular stability result from (Römisch, 2003) involving the Kantorovich metric. Utilizing the Kantorovich metric to calculate distances between probability measures results in a significant computational advantage (see Section 3.4.1).
The specific stability result we use, however, restricts (SP) to a class of two-stage stochastic linear programs with recourse (see Section 3.2). The bias reduction approach presented in this chapter does not require resampling, unlike the bootstrap and jackknife methods commonly used in statistics, and thus no additional sampling approximation problems need to be solved. The cost of bias reduction, however, comes from solving a minimum-weight matching problem, which is used to partition a random sample so as to reduce bias by minimizing the Kantorovich metric. Minimum-weight matching is a well-known combinatorial optimization problem for which efficient algorithms exist (Edmonds, 1965; Kolmogorov, 2009; Mehlhorn & Schäfer, 2002). It can be solved in polynomial time in sample size n and the computational
burden is likely to be minimal compared to solving (approximations of) real-world
stochastic programs with hundreds of stochastic parameters.
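As a rough sketch of the combinatorial step involved (my reading of the approach; the precise partitioning rule and its justification are given in Section 3.4, and this is not the dissertation's implementation), the code below pairs the n observations by a minimum-weight perfect matching on their pairwise distances, using the networkx package's max_weight_matching routine on complemented edge weights, and then places the two endpoints of each pair into different groups.

```python
import numpy as np
import networkx as nx

def matched_partition(sample):
    """Pair observations via a minimum-weight perfect matching on pairwise
    distances and split the pairs across two index sets (n assumed even)."""
    n = len(sample)
    dist = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
    big = dist.max() + 1.0
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            # Maximizing the total of (big - dist) over a perfect matching is
            # equivalent to minimizing the total distance of the matching.
            G.add_edge(i, j, weight=big - dist[i, j])
    matching = nx.max_weight_matching(G, maxcardinality=True)
    I1 = sorted(i for i, j in matching)
    I2 = sorted(j for i, j in matching)
    return I1, I2

rng = np.random.default_rng(5)
sample = rng.normal(size=(20, 3))     # 20 observations of a 3-dimensional xi
I1, I2 = matched_partition(sample)
print(I1, I2)
```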
Partitioning a random sample in an effort to reduce bias as we do in this chapter results in observations that are neither independent nor identically distributed, and hence the consistency and asymptotic validity results for A2RP in (Bayraksan & Morton, 2006) cannot be immediately stated. We overcome this difficulty by showing that the resulting distributions on the partitioned subsets converge weakly to P, the original distribution of ξ̃, a.s. This result allows us to provide conditions under which the point estimators are consistent and the interval estimator is asymptotically valid.
3.2 Problem Class
While (SP) encompasses many classes of problems, in this chapter, we focus on a particular class dictated by the specific stability result we use to motivate the proposed bias reduction technique (see Section 3.3.2). As stated before, we consider two-stage stochastic linear programs with recourse, where f(x, ξ̃) = cx + h(x, ξ̃), X = {x : Ax = b, x ≥ 0}, and h(x, ξ) is the optimal value of the minimization problem

    min_y { qy : Wy = R(ξ) − T(ξ)x, y ≥ 0 }.

The above problem has fixed recourse (W is non-random) and stochasticity only on the right-hand side (R(ξ) and T(ξ) are random). We assume that X and Ξ are convex polyhedral. We also assume that T and R depend affine linearly on ξ, which allows for modeling first-order dependencies between them, such as those that arise in commonly-used linear factor models. Furthermore, we assume that our model has relatively complete recourse; i.e., for each (x, ξ) ∈ X × Ξ, there exists y ≥ 0 such that Wy = R(ξ) − T(ξ)x, and dual feasibility; i.e., {π : πW ≤ q} ≠ ∅. These assumptions are needed to ensure the stability result presented in Section 3.3.2. We also require that X ≠ ∅ and is compact, which is assumption (A3) of Chapter 2. We make the following additional assumption:

(A4) Ξ is compact.

Let P(Ξ) be the set of probability measures on Ξ with finite first order moments, i.e., P(Ξ) = { Q : ∫_Ξ ||ξ|| Q(dξ) < ∞ }. It follows immediately from assumption (A4) that P ∈ P(Ξ), a condition required by our theoretical results.
For the class of problems we consider, f(·, ξ) is convex in x for all fixed ξ ∈ Ξ, and f(x, ξ) satisfies the following Lipschitz continuity condition for all x, x′ ∈ X and ξ ∈ Ξ, for some L > 0:

    |f(x, ξ) − f(x′, ξ)| ≤ L max{1, ||ξ||} ||x − x′||,

where || · || is some norm (see Proposition 22 in (Römisch, 2003)). This result leads directly to the continuity of f(·, ξ) in x for fixed ξ. We also note that f(x, ·) is Lipschitz continuous and thus continuous in ξ; see, e.g., Corollary 25 in (Römisch, 2003). Assumptions (A3) and (A4) along with continuity in both variables imply that f(x, ξ) is uniformly bounded; i.e.,

    ∃ C < ∞ such that |f(x, ξ)| < C for each x ∈ X, ξ ∈ Ξ,
a condition necessary for establishing consistency of our point estimators (see Section 3.6.2). Therefore, assumption (A2) is automatically satisfied. Also, per the
above discussion, the continuity assumption (A1) is also satisfied. Uniform boundedness also ensures that f(x, ξ) is a real-valued function and enforces the relatively complete recourse and dual feasibility assumptions. In addition, convexity and continuity in x imply that Ef(x, ξ̃) < ∞ is convex and continuous in x. Hence, (SP) has a finite optimal solution on X, and so X* is non-empty.
3.3 Background
In this section, we update the notation of A2RP as defined in Section 2.5 to reflect
our emphasis on the partitioning of observations. We also provide a stability result
from (Römisch, 2003) that is fundamental to the bias reduction technique.
3.3.1 A Redefinition of the Averaged Two-Replication Procedure
As mentioned at the beginning of this chapter, we currently view A2RP as sampling a
budget of n observations, where n is even, and partitioning them into two replications
of size n/2. To facilitate comparison with the bias reduction technique discussed in
Section 3.4, we present A2RP again in this section with slightly altered notation and
steps.
Let Ĩ1 be a uniformly distributed random variable, independent of {ξ̃^1, ..., ξ̃^n}, taking values in the set of all subsets of {1, ..., n} of size n/2, and let I1 be an instance of Ĩ1. Let I2 = (I1)^C; that is, I2 contains all n/2 elements of {1, ..., n} that are not in I1. This is essentially equivalent to generating two independent random samples of size n/2 as in Section 2.5. However, we prefer to use the notation I1 and I2 to emphasize the random partitioning. Later, the proposed bias reduction technique will alter this partitioning mechanism.

Let P_{I_l}(·) = Σ_{i∈I_l} (2/n) δ_{ξ̃^i}(·) be the empirical probability measure formed on the l-th set of observations, l = 1, 2. Similar to the definition of (SPn), let (SP_{I_l}) denote
35
the problem
1X
min
f (x, ⇠˜i ) = min
x2X n
x2X
i2I
l
x⇤Il
Z
f (x, ⇠)PIl (d⇠),
(SPIl )
⌅
denote an optimal solution to (SPIl ), and let zI⇤l be the optimal value, for l = 1, 2.
A2RP is as follows:
A2RP
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and an even sample size n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval (CI) on G_x̂.

1. Sample i.i.d. observations {ξ̃^1, ..., ξ̃^n} from P.
2. Generate a random partition of {ξ̃^1, ..., ξ̃^n} via I_1 and I_2, and produce P_{I_1} and P_{I_2}.
3. For l = 1, 2:
   3.1. Solve (SP_{I_l}) to obtain x*_{I_l} and z*_{I_l}.
   3.2. Calculate:
        G_{I_l} = (2/n) Σ_{i∈I_l} f(x̂, ξ̃^i) − z*_{I_l}   and
        s²_{I_l} = (1/(n/2 − 1)) Σ_{i∈I_l} [ ( f(x̂, ξ̃^i) − f(x*_{I_l}, ξ̃^i) ) − G_{I_l} ]².
4. Calculate the optimality gap and sample variance estimators by taking the average: G_I = ½ (G_{I_1} + G_{I_2}) and s²_I = ½ (s²_{I_1} + s²_{I_2}).
5. Output a one-sided confidence interval on G_x̂:

    [ 0, G_I + z_α s_I / √n ].        (3.1)
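The sketch below illustrates Steps 2–5 in Python. The routines solve_SP (returning an optimal solution and value of a sampled problem) and f (evaluating the objective at a given observation) are hypothetical placeholders, not part of the procedure's formal statement; the sketch also assumes solve_SP returns the sample-average optimal value, so that G_{I_l} reduces to the mean of the differences below.

import numpy as np
from scipy.stats import norm

def a2rp(xis, x_hat, f, solve_SP, alpha=0.10, rng=np.random.default_rng()):
    """A2RP, Steps 2-5; xis is an (n, d) array with n even."""
    n = len(xis)
    perm = rng.permutation(n)
    I1, I2 = perm[: n // 2], perm[n // 2 :]          # Step 2: random partition
    gaps, svars = [], []
    for I in (I1, I2):
        x_star, z_star = solve_SP(xis[I])            # Step 3.1: solve (SP_{I_l})
        diffs = np.array([f(x_hat, xi) - f(x_star, xi) for xi in xis[I]])
        gaps.append(diffs.mean())                    # G_{I_l}
        svars.append(diffs.var(ddof=1))              # s^2_{I_l}, divisor n/2 - 1
    G_I, s2_I = np.mean(gaps), np.mean(svars)        # Step 4: average
    return G_I, s2_I, (0.0, G_I + norm.ppf(1 - alpha) * np.sqrt(s2_I / n))  # CI (3.1)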
3.3.2 A Stability Result
In this section, we review a stability result from (Römisch, 2003) that is fundamental to our bias reduction technique. Stability results in stochastic programming quantify the behavior of (SP) when P, the original distribution of ξ̃, is perturbed. In this chapter, we are particularly interested in changes in the optimal value z* under perturbations of P; however, stability results also examine the changes in the solution sets X*. In this section, we will use z*(P) to denote the optimal value of (SP) when the distribution of ξ̃ is P. Similarly, z*(Q) denotes the optimal value under the distribution Q, a perturbation of P.
Probability metrics, which calculate distances between probability measures, can provide upper bounds on |z*(P) − z*(Q)|, the change in the optimal value. One such probability metric relevant for the class of problems we consider is µ̂_d(P, Q), the Kantorovich metric with cost function d(ξ¹, ξ²) = ‖ξ¹ − ξ²‖, where ‖·‖ is a norm. The following result—a restatement of Corollary 25 to Theorem 23 in (Römisch, 2003), written to match this dissertation's notation—provides continuity properties of optimal values of (SP) with respect to perturbations of P.
Theorem 3.2. Let only T(ξ) and R(ξ) be random, and assume that relatively complete recourse and dual feasibility hold. Let P ∈ P(Ξ) and X* be non-empty. Then, there exist constants L > 0, δ > 0 such that

    |z*(P) − z*(Q)| ≤ L µ̂_d(P, Q)

whenever Q ∈ P(Ξ) and µ̂_d(P, Q) < δ.
All conditions necessary to apply Theorem 3.2 for the class of problems we consider are satisfied, as specified in Section 3.2; see (Römisch, 2003) for details. The above stability result implies that if P and Q are sufficiently close with respect to the Kantorovich metric, then the optimal value of (SP) behaves Lipschitz continuously with respect to changes in the probability distribution. Suppose that P is a discrete probability measure placing masses p_1, ..., p_{N_P} on the points {ξ¹, ..., ξ^{N_P}} in Ξ, respectively, and Q is a discrete measure with masses q_1, ..., q_{N_Q} on the points {ν¹, ..., ν^{N_Q}} in Ξ, respectively. Then the Kantorovich metric can be written in the form of the Monge–Kantorovich transportation problem, which formulates the transfer of mass from P to Q:
    µ̂_d(P, Q) = min_η   Σ_{i=1}^{N_P} Σ_{j=1}^{N_Q} ‖ξ^i − ν^j‖ η_ij        (MKP)
                 s.t.   Σ_{i=1}^{N_P} η_ij = q_j,  ∀j,
                        Σ_{j=1}^{N_Q} η_ij = p_i,  ∀i,
                        η_ij ≥ 0,  ∀i, j.
This is the well-known transportation problem, where P can be viewed to have N_P supply nodes, each with supply p_i, i = 1, ..., N_P; similarly, Q has N_Q demand nodes, each with demand q_j, j = 1, ..., N_Q; and total supply and demand match, i.e., Σ_{i=1}^{N_P} p_i = Σ_{j=1}^{N_Q} q_j = 1. Thus, µ̂_d(P, Q) is the minimum cost of transferring mass from P to Q. Representing the Kantorovich metric as the optimal value of a well-known, efficiently solvable optimization problem is vital in allowing us to implement the bias reduction technique described in the next section.
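As a concrete check, (MKP) for two discrete measures is just a small transportation LP. The sketch below, a minimal illustration assuming SciPy's linprog and a Euclidean cost (an illustrative choice, not tied to any test problem in this chapter), computes µ̂_d(P, Q) from the supports and mass vectors.

import numpy as np
from scipy.optimize import linprog

def kantorovich(xi, p, nu, q):
    """Solve (MKP): min sum_ij ||xi_i - nu_j|| eta_ij with row/column sums q, p."""
    NP, NQ = len(p), len(q)
    cost = np.linalg.norm(xi[:, None, :] - nu[None, :, :], axis=2).ravel()
    A_eq, b_eq = [], []
    for j in range(NQ):                      # sum_i eta_ij = q_j
        row = np.zeros((NP, NQ)); row[:, j] = 1.0
        A_eq.append(row.ravel()); b_eq.append(q[j])
    for i in range(NP):                      # sum_j eta_ij = p_i
        row = np.zeros((NP, NQ)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(p[i])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (NP * NQ), method="highs")
    return res.fun

For instance, with P taken as the empirical measure on a sample and Q as the measure placing mass 2/n on half the sample, this returns exactly the quantity minimized by the partitioning step in Section 3.4.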
3.4 Bias Reduction via Probability Metrics

In this section, we present a technique to reduce the bias in sampling-based estimates of z* in stochastic programs and apply it to the A2RP optimality gap estimators. We begin by discussing the motivation behind the technique and explaining the connection with Theorem 3.2. We then formally state the resulting procedure to obtain variants of the A2RP optimality gap estimators after bias reduction.

3.4.1 Motivation for Bias Reduction Technique
Consider a partition of n observations {ξ̃^1, ..., ξ̃^n} given by index sets S_1 and S_2, where

(i) S_1, S_2 ⊂ {1, ..., n} and S_2 = (S_1)^C,
(ii) |S_1| = |S_2| = n/2, and
(iii) each ξ̃^i, i ∈ S_1 ∪ S_2, receives probability mass 2/n.

Note that S_1 and S_2 are functions of the random sample {ξ̃^1, ..., ξ̃^n}. This is a generalization of the partitioning performed via I_1 and I_2, where we now allow dependencies between S_1 and S_2. For any given {ξ̃^1, ..., ξ̃^n}, we have

    ½ (z*_{S_1} + z*_{S_2}) = ½ [ min_{x∈X} (2/n) Σ_{i∈S_1} f(x, ξ̃^i) + min_{x∈X} (2/n) Σ_{i∈S_2} f(x, ξ̃^i) ]
                            ≤ min_{x∈X} [ (1/n) Σ_{i∈S_1} f(x, ξ̃^i) + (1/n) Σ_{i∈S_2} f(x, ξ̃^i) ]
                            = min_{x∈X} (1/n) Σ_{i=1}^{n} f(x, ξ̃^i) = z*_n.

Therefore, by the monotonicity of expectation, the following inequality holds:

    ½ (E z*_{S_1} + E z*_{S_2}) ≤ E z*_n ≤ z*.        (3.3)
Inequality (3.3) indicates that when n observations are divided in two, the expected gap between ½(z*_{S_1} + z*_{S_2}) and z* grows. This motivates us to partition the observations via index sets S_1 and S_2 that maximize ½(E z*_{S_1} + E z*_{S_2}). This approach will help to alleviate the increase in bias that results from using two subsets of n/2 observations rather than one set of n observations. Since ½(E z*_{S_1} + E z*_{S_2}) is always bounded above by E z*_n, we equivalently aim to minimize E z*_n − ½(E z*_{S_1} + E z*_{S_2}). This problem can be hard to solve, but an approximation is obtained by:

    E z*_n − ½ (E z*_{S_1} + E z*_{S_2}) ≤ ½ E[ |z*_n − z*_{S_1}| + |z*_n − z*_{S_2}| ].

We would thus like to minimize E[ |z*_n − z*_{S_1}| + |z*_n − z*_{S_2}| ], but again this is a hard problem. In an effort to achieve this, we focus on |z*_n − z*_{S_1}| + |z*_n − z*_{S_2}| for a given sample of size n. By viewing the empirical measure P_n of the random sample {ξ̃^1, ..., ξ̃^n} as the original measure and the measures P_{S_l}(·) = (2/n) Σ_{i∈S_l} δ_{ξ̃^i}(·), l = 1, 2, as perturbations of P_n, we appeal to Theorem 3.2 to obtain an upper bound containing probability metrics:

    |z*_n − z*_{S_1}| + |z*_n − z*_{S_2}| ≤ L µ̂_d(P_n, P_{S_1}) + L µ̂_d(P_n, P_{S_2}).        (3.4)

As a result, we aim to reduce the bias of the optimality gap estimator by partitioning the observations according to sets S_1 and S_2 that minimize µ̂_d(P_n, P_{S_1}) + µ̂_d(P_n, P_{S_2}). By minimizing these metrics, we would like P_{S_1} and P_{S_2} to mimic P_n as much as possible. This way, we may expect the resulting optimal values of the partitions to be closer to z*_n, reducing the bias induced by partitioning.
We note that several approximations were used above, and (3.4) is valid only when P_n and P_{S_l} are sufficiently close in terms of the Kantorovich metric, for l = 1, 2. However, it is natural to think of P_{S_l} as a perturbation of P_n even though Theorem 3.2 does not specify how close they should be. The advantage of using these approximations is that they result in an easily solvable optimization problem (see the discussion below). Even though the technique is approximate, we present strong evidence that it can be successful via analytical results on a newsvendor problem in Section 3.5 and numerical results for several stochastic programs from the literature in Section 3.7.
Let us now examine µ̂_d(P_n, P_{S_l}), l = 1, 2, when we have a known partition via S_1 and S_2. Suppose, given a realization of the random sample {ξ^1, ..., ξ^n} and the corresponding empirical measure P_n, we have identified S_1 and S_2 that satisfy (i)–(iii) above. Because d(ξ^i, ξ^j) = ‖ξ^i − ξ^j‖ = 0 whenever ξ^i = ξ^j, for i ∈ {1, ..., n} and j ∈ S_l, the Monge–Kantorovich problem (MKP) in this setting turns into an assignment problem (written here for the index set S_1):

    µ̂_d(P_n, P_{S_1}) = min_η   Σ_{i∈S_2} Σ_{j∈S_1} ‖ξ^i − ξ^j‖ η_ij
                        s.t.   Σ_{i∈S_2} η_ij = 1/n,  ∀j,
                               Σ_{j∈S_1} η_ij = 1/n,  ∀i,
                               η_ij ≥ 0,  ∀i, j.
Furthermore, since the cost function d(ξ^i, ξ^j) = ‖ξ^i − ξ^j‖ is symmetric, µ̂_d(P_n, P_{S_1}) = µ̂_d(P_n, P_{S_2}). It follows that if we minimize µ̂_d(P_n, P_{S_1}), we automatically minimize µ̂_d(P_n, P_{S_2}). Therefore, identifying sets S_1 and S_2 that minimize the sum of the Kantorovich metrics is equivalent to finding a set S_1 that minimizes µ̂_d(P_n, P_{S_1}). Thus, to attempt to reduce the bias of the optimality gap estimator, we wish to find an index set of size n/2 that solves the problem:

    min  µ̂_d(P_n, P_{S_1})
    s.t. S_1 ⊂ {1, ..., n},        (PM)
         |S_1| = n/2.

Note that this is the well-known minimum-weight perfect matching problem. Given a graph with n nodes and m edges, it can be solved in polynomial time, O(mn log n) (Mehlhorn & Schäfer, 2002). The running time for our problem is O(n³ log n) since we have a fully connected graph. A special case of (PM) when ξ is univariate is solvable in O(n log n) via a sorting algorithm, as the optimal solution is to place the odd order statistics in one subset and the even order statistics in the other. For large-scale stochastic programs, solving instances of (SP_n) can be expected to be the computational bottleneck compared to solving (PM).
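For illustration, the following sketch builds the partition from a sample: it pairs the observations by minimum-weight perfect matching (here via NetworkX's min_weight_matching, an implementation choice made only for this sketch, not the Blossom V code used later in this chapter) and then splits each pair between the two subsets; in the univariate case it falls back to the odd/even order-statistic rule.

import numpy as np
import networkx as nx

def pm_partition(xis):
    """Index sets solving (PM): match observations in pairs of minimum total
    distance and place one member of each pair in each subset."""
    X = np.asarray(xis, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = len(X)
    assert n % 2 == 0, "sample size must be even"
    if X.shape[1] == 1:                       # univariate: odd/even order statistics
        order = np.argsort(X[:, 0])
        return order[0::2], order[1::2]
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=float(np.linalg.norm(X[i] - X[j])))
    matching = nx.min_weight_matching(G)      # n/2 disjoint pairs
    J1 = np.array(sorted(i for i, _ in matching))
    J2 = np.array(sorted(j for _, j in matching))
    return J1, J2

The univariate branch reproduces the sorting rule mentioned above; the general branch builds the O(n²) edge weights and leaves the matching to the solver, which Section 3.7.4 shows is negligible relative to solving the sampled problems.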
3.4.2 Averaged Two-Replication Procedure with Bias Reduction

In this section, we present the Averaged Two-Replication Procedure with Bias Reduction (A2RP-B) that results from adapting A2RP to include the bias reduction technique described in Section 3.4.1. To distinguish from the uniformly chosen subsets I_1 and I_2 defined in Section 3.3.1, we denote an optimal solution to (PM) by J_1 and let J_2 = (J_1)^C. The resulting probability measures are denoted P_{J_l}, l = 1, 2, where P_{J_l}(·) = (2/n) Σ_{i∈J_l} δ_{ξ̃^i}(·).
A2RP-B
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and an even sample size n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on G.

1. Sample i.i.d. observations {ξ̃^1, ..., ξ̃^n} from P.
2. Generate J_1 and J_2 by solving (PM), and produce P_{J_1} and P_{J_2}.
3. For l = 1, 2:
   3.1. Solve (SP_{J_l}) to obtain x*_{J_l} and z*_{J_l}.
   3.2. Calculate:
        G_{J_l} = (2/n) Σ_{i∈J_l} f(x̂, ξ̃^i) − z*_{J_l}   and
        s²_{J_l} = (1/(n/2 − 1)) Σ_{i∈J_l} [ ( f(x̂, ξ̃^i) − f(x*_{J_l}, ξ̃^i) ) − G_{J_l} ]².
4. Calculate the optimality gap and sample variance estimators by taking the average: G_J = ½ (G_{J_1} + G_{J_2}) and s²_J = ½ (s²_{J_1} + s²_{J_2}).
5. Output a one-sided confidence interval on G:

    [ 0, G_J + z_α s_J / √n ].        (3.5)
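Combining the two earlier hypothetical sketches (the a2rp routine of Section 3.3.1 and pm_partition of Section 3.4.1), a driver for A2RP-B differs only in how the index sets are produced; all names below are reused from those illustrative sketches.

import numpy as np
from scipy.stats import norm

def a2rp_b(xis, x_hat, f, solve_SP, alpha=0.10):
    """A2RP-B: identical to a2rp except that Step 2 partitions via (PM)."""
    xis = np.asarray(xis)
    n = len(xis)
    J1, J2 = pm_partition(xis)                       # Step 2: solve (PM)
    gaps, svars = [], []
    for J in (J1, J2):
        x_star, _ = solve_SP(xis[J])                 # Step 3.1: solve (SP_{J_l})
        diffs = np.array([f(x_hat, xi) - f(x_star, xi) for xi in xis[J]])
        gaps.append(diffs.mean())                    # G_{J_l}
        svars.append(diffs.var(ddof=1))              # s^2_{J_l}
    G_J, s2_J = np.mean(gaps), np.mean(svars)
    return G_J, s2_J, (0.0, G_J + norm.ppf(1 - alpha) * np.sqrt(s2_J / n))  # CI (3.5)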
A2RP-B differs from A2RP in Step 2. Here, a minimum-weight perfect matching problem (PM) is solved to obtain an optimal partition of the observations via the index sets J_1 and J_2. Note that the elements in J_1 and J_2 depend on the observations {ξ̃^1, ..., ξ̃^n}, and so J_1 and J_2 are random variables acting on Ω. Hence, J_1 and J_2 are not independent of {ξ̃^1, ..., ξ̃^n}, distinguishing P_{J_1} and P_{J_2} from P_{I_1} and P_{I_2}. The random partitioning mechanism of I_1 and I_2 results in i.i.d. observations in P_{I_1} and P_{I_2}. Unfortunately, this property is lost in P_{J_1} and P_{J_2}. Nevertheless, we prove in Section 3.6 that the point estimators G_J and s²_J are consistent and the interval estimator given by (3.5) is asymptotically valid.

We conclude this section by updating Example 2.2 to include the effects of bias reduction. We will illustrate A2RP-B in more detail on an instance of a newsvendor problem in the next section.
Example 3.1. Consider the problem described in Example 2.1. As before, α = 0.10 and n = 50. Let ξ̄_{J_1} = (2/n) Σ_{i∈J_1} ξ̃^i be the sample mean of the first subset of 25 observations (all odd order statistics), and similarly let ξ̄_{J_2} be the sample mean of the second subset (all even order statistics) after solving (PM). To estimate an upper bound on the coverage of A2RP-B, we ran 1,000,000 independent runs in MATLAB and computed the proportions of runs in which ξ̄_{J_1} and ξ̄_{J_2} were negative. This resulted in the estimate 1 − P(ξ̄_{J_1} < 0) P(ξ̄_{J_2} < 0) ≈ 1 − (0.363)(0.145) ≈ 0.947. Compared to A2RP with P(ξ̄_l < 0) ≈ 0.308 for each subset l = 1, 2, after solving (PM) the sample mean of the first subset shifted slightly downward, increasing this probability, whereas the sample mean of the second subset shifted slightly upward, decreasing this probability. As a result, the probability of obtaining coinciding solutions is decreased. Hence the upper bound on the coverage of A2RP-B is greater than that of A2RP for this problem.
In general, A2RP-B may be viewed as in between SRP and A2RP. Like A2RP, it
can lower the occurrence of coinciding solutions while at the same time having a lower
bias like SRP. Computational results in Section 3.7 seem to support this hypothesis.
3.5 Illustration: Newsvendor Problem
Before presenting theoretical results, we illustrate the above bias reduction technique
on an instance of a newsvendor problem. For this problem, we are able to derive
analytical results, and therefore can compare the optimality gap estimators produced
by A2RP, A2RP-B, and SRP to examine the efficacy of the bias reduction technique.
The specific newsvendor problem we consider is as follows: a newsvendor would
like to determine the number of newspapers to order daily, x, in order to maximize
expected daily profit. Each copy sells at a price r and costs the newsvendor c, where 0 < c < r. The daily demand, ξ̃, is assumed to be random with a U(0, b) distribution. The problem can be expressed as

    min_{x≥0} E[ cx − r min{x, ξ̃} ].        (3.6)

The optimal solution is x* = b(r − c)/r and the optimal value is z* = −b(r − c)²/(2r). Note that (3.6) can be rewritten as a two-stage stochastic linear program in the form presented in Section 3.2.
Prior to finding expressions for the biases of the optimality gap estimators, we note two results that are used in the subsequent derivations. First, let {ξ̃^1, ..., ξ̃^n} be a random sample of size n from a U(0, b) distribution, and let {ξ̃^(1), ..., ξ̃^(n)} denote the ordering of the random sample, i.e., ξ̃^(1) ≤ ξ̃^(2) ≤ ... ≤ ξ̃^(n). The optimal solution to the approximated problem (SP_n) using this random sample is x*_n = ξ̃^(l*), where l* = ⌈(r − c)n/r⌉. The optimal value of (SP_n) is thus

    z*_n = c x*_n − (r/n) Σ_{i=1}^{n} min{x*_n, ξ̃^i} = c ξ̃^(l*) − (r/n) Σ_{i=1}^{l*−1} ξ̃^(i) − (r/n)(n − l* + 1) ξ̃^(l*).

Second, recall that the ith order statistic from a U(0, b) random sample of size n satisfies ξ̃^(i)/b ∼ β(i, n + 1 − i), where β(α_1, α_2) denotes a random variable having a Beta distribution with parameters α_1 and α_2.
We now determine the bias of G_I, the optimality gap estimator produced by A2RP. In this case, the n observations are randomly partitioned into two subsets of size n/2, generating the corresponding sampled problems (SP_{I_l}), l = 1, 2. Relabel the observations ξ̃^i, i ∈ I_1, as ξ̃^i_{I_1}, and similarly for I_2. The optimal solution to (SP_{I_l}) is x*_{I_l} = ξ̃^{(i*)}_{I_l}, where i* = ⌈(r − c)n/(2r)⌉ = (r − c)n/(2r) + κ, for some κ ∈ [0, 1). The ith order statistic of each subset satisfies ξ̃^{(i)}_{I_l}/b ∼ β(i, n/2 + 1 − i). After some algebra, the bias of G_I is

    z* − ½ (E z*_{I_1} + E z*_{I_2}) = b [ 2κ(1 − κ)r² + cnr − c²n ] / ( n(n + 2) r ).        (3.7)
The analysis changes somewhat under A2RP-B. The newsvendor problem is univariate in ξ, and so (PM) places the odd order statistics in one subset and the even order statistics in the other. Since the order statistics are computed from the original sample of size n, the ith order statistic (scaled by b) follows a β(i, n + 1 − i) distribution. Note that after solving (PM), the observations in each subset are no longer i.i.d., since order statistics are neither identically distributed nor independent. Solving the sampling problem using the first subset of observations leads to the optimal solution x*_{J_1} = ξ̃^(2i*−1), and using the second subset produces the optimal solution x*_{J_2} = ξ̃^(2i*). Following the same steps, we calculate the bias of G_J as

    z* − ½ (E z*_{J_1} + E z*_{J_2}) = b [ 4κ(1 − κ)r² + cnr − c²n ] / ( 2n(n + 1) r ).        (3.8)
We now consider the limiting behavior of the percentage reduction in the bias of the optimality gap estimator in going from A2RP to A2RP-B, which is given by subtracting expression (3.8) from expression (3.7) and normalizing by (3.7). We get

    % Red. in Bias = 1 − { b[4κ(1 − κ)r² + cnr − c²n] / (2n(n + 1)r) } / { b[2κ(1 − κ)r² + cnr − c²n] / (n(n + 2)r) } → 1 − ½  as n → ∞.

Therefore, the percentage reduction in the bias converges to 50% as n → ∞. So, simply partitioning the random sample into odd and even order statistics (the result of solving (PM)) gives an optimality gap estimator with asymptotically half the bias compared to using a random partition. This result holds regardless of the values of the parameters r, c, and b for this specific newsvendor problem, so parameter choices that change the bias of the optimality gap estimator will not alter the large-sample behavior of the bias reduction technique. For small-sample-size behavior of this newsvendor problem, see Section 3.7.3. Our numerical results indicate that convergence of the percentage reduction in bias is achieved very quickly, e.g., around a sample size of n = 100.
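A quick Monte Carlo check of this limit is straightforward, since z*_n for the uniform newsvendor is available in closed form from the order statistics. The sketch below (with illustrative parameter values, not the experiments of Section 3.7) estimates the averaged two-replication bias under a random partition and under the odd/even partition and returns the estimated fractional reduction.

import numpy as np

def z_star_sample(xi, r, c):
    """Optimal value of (SP_n) for the U(0,b) newsvendor on sample xi."""
    xi = np.sort(xi)
    n = len(xi)
    l = int(np.ceil((r - c) * n / r))          # optimal order-statistic index
    x = xi[l - 1]
    return c * x - (r / n) * np.minimum(x, xi).sum()

def avg_value(xi, idx1, idx2, r, c):
    return 0.5 * (z_star_sample(xi[idx1], r, c) + z_star_sample(xi[idx2], r, c))

def estimate_bias_reduction(n=100, b=10.0, r=15.0, c=5.0, reps=20000, seed=0):
    rng = np.random.default_rng(seed)
    z_true = -b * (r - c) ** 2 / (2 * r)
    bias_rand = bias_sorted = 0.0
    for _ in range(reps):
        xi = rng.uniform(0.0, b, size=n)
        perm = rng.permutation(n)              # random partition (A2RP)
        bias_rand += z_true - avg_value(xi, perm[: n // 2], perm[n // 2 :], r, c)
        order = np.argsort(xi)                 # odd/even order statistics (A2RP-B)
        bias_sorted += z_true - avg_value(xi, order[0::2], order[1::2], r, c)
    return 1.0 - bias_sorted / bias_rand       # estimated reduction, as a fraction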
Finally, we compare A2RP-B and A2RP to SRP, where we assume that SRP uses a sample size of n. Observe that replacing n with 2n in (3.7) gives the bias of the optimality gap estimator produced by SRP. Consequently, the ratio of the bias of the A2RP optimality gap estimator to the bias of the SRP estimator converges to 2 as n → ∞, indicating that partitioning the observations into two random subsets doubles the bias for larger sample sizes. In contrast, the ratio of the biases of the A2RP-B and SRP optimality gap estimators converges to 1 as n → ∞. In essence, the bias reduction technique performs “anti-partitioning” for this problem by eliminating the additional bias introduced from partitioning.
3.6 Theoretical Properties
We now prove that the estimators GJ and s2J of A2RP-B are strongly consistent
and that A2RP-B provides an asymptotically valid CI on the optimality gap. This
is important because applying a bias reduction technique can sometimes result in
overcorrection of the bias and lead to undesirable behavior. In this section, we show
that, asymptotically, such unwanted behavior does not happen for our method. The
technical difficulty in the consistency proofs for the A2RP-B estimators comes from
the fact that the proposed bias reduction technique destroys the i.i.d. nature of the
observations in the partitioned subsets of observations. Recall that in A2RP, the
uniform partitioning of the observations preserves the i.i.d. property, but this is not
the case for A2RP-B; see Section 3.5 for an illustration from the newsvendor problem.
Hence, it is necessary to generalize the consistency proofs in (Bayraksan & Morton,
2006) to cover the non-i.i.d. case arising from solving (PM).
3.6.1 Weak Convergence of Empirical Measures

We first establish the weak convergence of the empirical probability measures P_{J_1} and P_{J_2} to P, the original distribution of ξ̃, a.s. This provides the structure necessary to obtain consistent estimators.

Theorem 3.9. Assume that {ξ̃^1, ..., ξ̃^n} is an i.i.d. sample from distribution P and (A4) holds. Then the probability measures on the partitioned sets obtained by solving (PM), P_{J_1} and P_{J_2}, converge weakly to P, the original distribution of ξ̃, a.s.
Proof. Since µ̂_d is a metric, by the triangle inequality we have that

    µ̂_d(P, P_{J_1}) ≤ µ̂_d(P, P_n) + µ̂_d(P_n, P_{J_1}).

Also, µ̂_d(P_n, P_{J_1}) ≤ µ̂_d(P_n, P_{I_1}), since the partitioning of the observations via J_1 minimizes the Kantorovich metric; hence, the random partition provides an upper bound. Therefore,

    µ̂_d(P, P_{J_1}) ≤ µ̂_d(P, P_n) + µ̂_d(P_n, P_{I_1}),

and by applying the triangle inequality again, we obtain

    µ̂_d(P, P_{J_1}) ≤ µ̂_d(P, P_n) + µ̂_d(P, P_n) + µ̂_d(P, P_{I_1}) = 2 µ̂_d(P, P_n) + µ̂_d(P, P_{I_1}).

We would like to show that µ̂_d(P, P_{J_1}) → 0 as n → ∞, a.s. First, applying the SLLN to a fixed bounded, continuous function f on Ξ gives that ∫_Ξ f(ξ) P_n(dξ) → ∫_Ξ f(ξ) P(dξ), a.s. Theorem 11.4.1 in (Dudley, 2002) extends this result to all bounded, continuous f; i.e., the random empirical measure P_n converges weakly to the non-random measure P, a.s. This combined with (A4) yields ∫_Ξ ‖ξ‖ P_n(dξ) → ∫_Ξ ‖ξ‖ P(dξ), a.s. Hence, applying Theorem 6.3.1 in (Rachev, 1991), we obtain µ̂_d(P, P_n) → 0 as n → ∞, a.s., and similarly, µ̂_d(P, P_{I_1}) → 0 as n → ∞, a.s. The second statement follows from the fact that P_{I_1} is essentially the same as P_{n/2}. Combining these, we obtain that 2µ̂_d(P, P_n) + µ̂_d(P, P_{I_1}) → 0, a.s. Therefore, µ̂_d(P, P_{J_1}) → 0, a.s., and another application of Theorem 6.3.1 in (Rachev, 1991) implies that P_{J_1} converges weakly to P, a.s. The same argument holds for P_{J_2}.
Even though we lose the i.i.d. property of the observations in the partitioned
subsets after minimizing the Kantorovich metrics, Theorem 3.9 shows the weak convergence of the resulting probability measures to the original measure.
3.6.2 Consistency of Point Estimators

We now show the consistency of the estimators G_J and s²_J in the almost sure sense. For a fixed x̂ ∈ X, define σ²_x̂(x) = var( f(x̂, ξ̃) − f(x, ξ̃) ), and denote the optimal solutions that minimize and maximize σ²_x̂(x) by x*_min ∈ arg min_{x∈X*} σ²_x̂(x) and x*_max ∈ arg max_{x∈X*} σ²_x̂(x), respectively. Note that since f(x, ξ̃) is continuous in x, Ef(x, ξ̃) is continuous, and hence X* is closed (and therefore compact). In addition, σ²_x̂(x) is continuous, and thus arg min_{x∈X*} σ²_x̂(x) and arg max_{x∈X*} σ²_x̂(x) are nonempty.
Theorem 3.10. Assume x̂ ∈ X, {ξ̃^1, ..., ξ̃^n} is an i.i.d. sample from distribution P, and (A3) and (A4) hold. Fix 0 < α < 1. Let n be even and consider A2RP-B. Then,

(i) all limit points of x*_{J_l} lie in X*, a.s., for l = 1, 2;
(ii) z*_{J_l} → z*, a.s., as n → ∞, for l = 1, 2;
(iii) G_J → G, a.s., as n → ∞;
(iv) σ²_x̂(x*_min) ≤ lim inf_{n→∞} s²_J ≤ lim sup_{n→∞} s²_J ≤ σ²_x̂(x*_max), a.s.
Proof. (i) First, note from Theorem 3.9 that the probability measures on the partitioned subsets converge weakly to the original distribution of ξ̃ as n → ∞, a.s. As a result, for l = 1, 2, ∫_Ξ f(x, ξ) P_{J_l}(dξ) epi-converges to ∫_Ξ f(x, ξ) P(dξ) as n → ∞, a.s., by Theorem 3.9 in (Wets, 1983). Thus, by Theorem 3.9 in (Wets, 1983), all limit points of x*_{J_l} lie in X*, a.s., for l = 1, 2.

(ii) Using epi-convergence, Theorem 7.33 in (Rockafellar & Wets, 1998) along with assumptions (A3) and (A4) gives that z*_{J_l} converges to z*, a.s., as n → ∞.

(iii) By definition, G_J = ½[G_{J_1} + G_{J_2}], where G_{J_l} = (2/n) Σ_{i∈J_l} f(x̂, ξ̃^i) − z*_{J_l}, for l = 1, 2. For a feasible x ∈ X, define f̄_n(x) = (1/n) Σ_{i=1}^{n} f(x, ξ̃^i). Then G_J = f̄_n(x̂) − ½(z*_{J_1} + z*_{J_2}). Since the original sample is formed using n i.i.d. observations, f̄_n(x̂) converges to Ef(x̂, ξ̃), a.s., by the SLLN. Furthermore, by part (ii), ½(z*_{J_1} + z*_{J_2}) converges to z*, a.s., as n → ∞. We conclude that G_J → Ef(x̂, ξ̃) − z* = G, a.s., as n → ∞.

(iv) We begin by letting w(x, ξ) = 2C − (f(x̂, ξ) − f(x, ξ)) and w̄_{J_l} = (2/n) Σ_{i∈J_l} w(x, ξ̃^i), for a given x ∈ X, l = 1, 2. Recall that the constant C gives a uniform bound on f(x, ξ). Note that w(x, ξ) is defined in this fashion to enforce non-negativity. Altering our notation slightly and fixing x̂ ∈ X, we define

    s²_{J_l}(x) = (1/(n/2 − 1)) Σ_{i∈J_l} ( w(x, ξ̃^i) − w̄_{J_l}(x) )².

Note that s²_{J_l}(x*_{J_l}) is equivalent to s²_{J_l} defined in Section 3.4.2. Rewriting, we obtain

    s²_{J_l}(x) = [ (n/2)/(n/2 − 1) ] [ (2/n) Σ_{i∈J_l} w²(x, ξ̃^i) − (w̄_{J_l}(x))² ].        (3.11)

First, we show that the sequence of functions {s²_{J_l}(x)} converges uniformly to σ²_x̂(x), a.s., as n → ∞, for l = 1, 2. To this end, we first examine the two terms inside the brackets in (3.11).
By the uniform boundedness of f(x, ξ), |f(x̂, ξ) − f(x, ξ)| ≤ 2C; hence, w(x, ξ) ≥ 0 for all x ∈ X, ξ ∈ Ξ. It also immediately follows that w(x, ξ) is bounded in ξ, since for all x ∈ X, |w(x, ξ)| ≤ 4C, and w(x, ξ) is continuous in ξ for the class of problems we consider. Therefore, for each x ∈ X, by the definition of weak convergence and using Theorem 3.9, we have w̄_{J_l}(x) → Ew(x, ξ̃), a.s., as n → ∞, i.e., the SLLN holds pointwise, a.s. Since f(·, ξ) is convex in x, w(·, ξ) is convex in x (note that x̂ is fixed). Hence, we apply Corollary 3 from (Shapiro, 2003) to obtain

    sup_{x∈X} | w̄_{J_l}(x) − Ew(x, ξ̃) | → 0, a.s., as n → ∞,        (3.12)

i.e., w̄_{J_l}(x) converges uniformly to Ew(x, ξ̃), a.s., as n → ∞.

Note that w²(x, ξ) is bounded and continuous in ξ, and because w(·, ξ) ≥ 0, w²(·, ξ) is also convex in x. Hence, following the same steps as above, we conclude that (2/n) Σ_{i∈J_l} w²(x, ξ̃^i) converges uniformly to Ew²(x, ξ̃), a.s., as n → ∞. Combining these,

    a_{J_l}(x) := (2/n) Σ_{i∈J_l} w²(x, ξ̃^i) − (w̄_{J_l}(x))²

converges uniformly to Ew²(x, ξ̃) − (Ew(x, ξ̃))² = var(w(x, ξ̃)), a.s., as n → ∞. Note that var(w(x, ξ̃)) = σ²_x̂(x). To show uniform convergence of [(n/2)/(n/2 − 1)] a_{J_l}(x), consider the following inequality:

    sup_{x∈X} | [(n/2)/(n/2 − 1)] a_{J_l}(x) − var(w(x, ξ̃)) |
        ≤ sup_{x∈X} | a_{J_l}(x) − var(w(x, ξ̃)) | + sup_{x∈X} | a_{J_l}(x) − var(w(x, ξ̃)) | / (n/2 − 1) + sup_{x∈X} | var(w(x, ξ̃)) | / (n/2 − 1).

From above, the first two terms on the right-hand side converge to 0, a.s. By (A3), var(w(x, ξ̃)) < ∞ for all x, a.s., and so the last term also converges to 0 as n → ∞. This establishes uniform convergence of [(n/2)/(n/2 − 1)] a_{J_l}(x) to var(w(x, ξ̃)) = σ²_x̂(x), a.s. Hence, s²_{J_l}(x) converges uniformly to σ²_x̂(x), a.s., for l = 1, 2.
Since X is compact, for any fixed ω there exists a subsequence of N, N_k, along which {x*_{J_l}}_{n∈N_k} converges to a point in X, for l = 1, 2. This point, denoted ẋ_k, is in X*, a.s., by (i). Then, because s²_{J_l}(x) is a sequence of continuous functions that converges uniformly to σ²_x̂(x), a.s.,

    lim_{n→∞, n∈N_k} s²_{J_l}(x*_{J_l}) = σ²_x̂(ẋ_k), a.s.

Therefore,

    min_{x∈X*} σ²_x̂(x) ≤ lim_{n→∞, n∈N_k} s²_{J_l}(x*_{J_l}) ≤ max_{x∈X*} σ²_x̂(x), a.s.,

for l = 1, 2. Since N_k is just one subsequence of N, and by the definition of x*_min and x*_max, we have

    σ²_x̂(x*_min) ≤ lim inf_{n→∞} s²_{J_l}(x*_{J_l}) ≤ lim sup_{n→∞} s²_{J_l}(x*_{J_l}) ≤ σ²_x̂(x*_max), a.s.,

for l = 1, 2. Since s²_J = ½ ( s²_{J_1}(x*_{J_1}) + s²_{J_2}(x*_{J_2}) ), the desired result follows.
Parts (i) and (ii) of Theorem 3.10 establish the consistency of x*_{J_l} and z*_{J_l}, an optimal solution and the optimal value of (SP_{J_l}). Similarly, parts (iii) and (iv) establish the consistency of G_J and s²_J, the point estimators produced by A2RP-B. Note that if (SP) has a unique optimal solution, that is, X* = {x*}, then part (i) implies that x*_{J_l} → x*, for l = 1, 2, and part (iv) implies that lim_{n→∞} s²_J = σ²_x̂(x*), a.s.

3.6.3 Asymptotic Validity of the Interval Estimator
In our final result, we show the asymptotic validity of the CI estimator produced by A2RP-B, given in (3.5). This justifies the construction of an approximate CI after bias reduction.

Theorem 3.13. Assume x̂ ∈ X, {ξ̃^1, ..., ξ̃^n} is an i.i.d. sample from distribution P, and (A3) and (A4) hold. Fix 0 < α < 1. Let n be even and consider A2RP-B. Then,

    lim inf_{n→∞} P( G ≤ G_J + z_α s_J / √n ) ≥ 1 − α.

Proof. First, note that if x̂ ∈ X*, then the inequality is satisfied automatically. Suppose now that x̂ ∉ X*. As in the proof of part (iii) of Theorem 3.10, we express G_J as G_J = f̄_n(x̂) − ½(z*_{J_1} + z*_{J_2}), where f̄_n(x) = (1/n) Σ_{i=1}^{n} f(x, ξ̃^i). Since z*_{J_l} = min_{x∈X} (2/n) Σ_{i∈J_l} f(x, ξ̃^i) for l = 1, 2, G_J ≥ f̄_n(x̂) − f̄_n(x), for all x ∈ X. Noting that {ξ̃^1, ..., ξ̃^n} is an i.i.d. sample, the rest of the proof proceeds as in the proof of Theorem 1 in (Bayraksan & Morton, 2006).
3.7 Computational Experiments

In Section 3.6, we proved asymptotic results regarding the consistency and validity of estimators produced using A2RP-B. In this section, we apply A2RP-B to several test problems in order to examine its small-sample behavior. We begin our discussion by introducing the test problems used for evaluating the bias reduction technique, followed by the experimental setup in Sections 3.7.1 and 3.7.2. Then, in Section 3.7.3, we present the results of our experiments and discuss computational effort. We end our discussion by providing insights gained from our experiments in Section 3.7.6.
3.7.1 Test Problems

To fully evaluate the efficacy of the proposed bias reduction technique, we consider four test problems from the literature, namely the newsvendor problem (denoted NV), APL1P, PGP2, and GBD. All four problems are two-stage stochastic linear programs with fixed recourse and stochasticity on the right-hand side, and can be solved exactly, allowing us to compute exact optimality gaps. Characteristics of these problems are summarized in Table 3.1. NV is defined as in Section 3.5 and can be solved analytically. We set the cost of one newspaper, c, to be 5, and its selling price, r, to 15. The demand ξ̃ is assumed to have a U(0, 10) distribution. The electric power generation model PGP2 of Higle & Sen (1996b) has 3 stochastic parameters and 576 scenarios. APL1P is a power expansion problem with 5 independent stochastic parameters and 1,280 scenarios (Infanger, 1992). GBD, described in Example 1.1, is an aircraft allocation model. The version we use has 646,425 scenarios generated by 5 independent stochastic parameters.
Problem   # of 1st Stage Variables   # of 2nd Stage Variables   # of Stochastic Parameters   # of Scenarios
NV        1                          1                          1                            ∞
PGP2      4                          16                         3                            576
APL1P     2                          9                          5                            1,280
GBD       17                         10                         5                            646,425

Table 3.1: Test problem characteristics
Problem   x*                     Suboptimal x̂          z*          G
NV        6 2/3                  8.775                  −33 1/3     3 1/3
PGP2      (1.5, 5.5, 5, 5.5)     (1.5, 5.5, 5, 4.5)     447.32      1.14
APL1P     (1800, 1571.43)        (1111.11, 2300)        24,642.32   164.84
GBD       (10, 0, 0, 0, 0, 12.48, 1.19, 5.33, 0, 4.24, 0, 20.76, 7.81, 0, 7.20, 0, 0)
                                 (10, 0, 0, 0, 0, 12.43, 1.22, 5.33, 0, 4.32, 0, 20.68, 8.05, 0, 6.95, 0, 0)
                                                        1,655.63    1.15

Table 3.2: Optimal and suboptimal candidate solutions
The standard formulations of the latter three problems differ slightly from the formulation presented in Section 3.2, in that ξ := (R, T) rather than R and T being functions of ξ. This discrepancy can be easily remedied by defining the functions R(ξ) and T(ξ) in our formulation to be the coordinate projections of ξ, so, with a slight abuse of notation, R(ξ) = R and T(ξ) = T. Then R(ξ) and T(ξ) satisfy the affine linearity assumption in Section 3.2 and we can express the problems in the form assumed in this chapter. All test problems satisfy the required assumptions.
We selected two candidate solutions, x̂, for each test problem listed in Table 3.1. The first candidate solution is the optimal solution, i.e., x̂ = x*. Note that all these problems have a unique optimal solution. The second candidate solution is a suboptimal solution. For NV, APL1P, and PGP2, the suboptimal solution is the solution used in the computational experiments in (Bayraksan & Morton, 2006). We selected a suboptimal solution for GBD by solving an independent sampling problem and setting its solution as the candidate solution. Table 3.2 summarizes the optimal and suboptimal solutions used in our computational experiments, along with the optimal value and the optimality gap of the suboptimal candidate solution.

For the problems summarized in Tables 3.1 and 3.2, we performed tests based on the setup detailed in Section 3.7.2 and present the results in Section 3.7.3. In addition, we looked at how much computational effort bias reduction needs on a practical-scale problem and also studied the effects of multiple optimal solutions. These are discussed in Sections 3.7.4 and 3.7.5, respectively.

3.7.2 Experimental Setup
The primary objective of our computational experiments is to determine the reduction in the bias of the point estimator G_J of A2RP-B compared to the estimator G_I of A2RP for finite sample sizes n. It is well known that bias reduction techniques in statistics can increase the variance of an estimator; therefore, we use the mean-squared error (MSE) to capture both effects. Recall that if θ̂ is an estimator of θ, the MSE of θ̂ is given by E(θ̂ − θ)² = (Eθ̂ − θ)² + E(θ̂ − Eθ̂)², where the first term is the square of the bias and the second term is the variance of θ̂. We also examine the CI estimator after bias reduction. At each stage of our experiments, we perform relevant hypothesis tests to determine if there is a statistically significant reduction in bias, variance, MSE, etc.
Our experiments were conducted as follows. First, for each test problem, we fixed a candidate solution (optimal or suboptimal) and set α = 0.10. Then, we applied A2RP and A2RP-B for a variety of sample sizes {n = 50, 100, 200, ..., 1000} to test the small-sample behavior and to observe any potential trends as n increases. For each independent run, we used the same random number stream for both A2RP and A2RP-B to enable a direct comparison of the two procedures. We used a batch size of m to estimate the biases of G_I and G_J by averaging across m independent runs. We also obtained single estimates of var(G_I), var(G_J), MSE(G_I), and MSE(G_J) using the m runs. In order to produce CIs on the biases of G_I and G_J, as well as obtain better estimates of the variance and MSE, we repeated this procedure M times, resulting in a total of m × M independent runs. The means of the M estimates of the bias, variance, and MSE and the m × M CI widths were used to compute percentage reductions.
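The nested estimation and the paired test of the next paragraph can be summarized in a few lines. The sketch below is illustrative only: run_a2rp and run_a2rpb are placeholder routines returning G_I and G_J from a common random number stream, the true gap is assumed known (as it is for the test problems in Table 3.2), and only the bias difference is shown—variance and MSE differences are handled the same way.

import numpy as np
from scipy import stats

def batch_bias_test(run_a2rp, run_a2rpb, true_gap, m=10, M=1000, seed=0):
    """Estimate the bias of G_I and G_J from M batches of m paired runs and
    test H0: no difference vs. H1: A2RP-B reduces the bias (paired, one-sided)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(M)
    for k in range(M):
        seeds = rng.integers(0, 2**31 - 1, size=m)
        GI = np.array([run_a2rp(s) for s in seeds])      # same stream per paired run
        GJ = np.array([run_a2rpb(s) for s in seeds])
        diffs[k] = (GI.mean() - true_gap) - (GJ.mean() - true_gap)
    t_stat, p_value = stats.ttest_1samp(diffs, 0.0, alternative="greater")
    return diffs.mean(), p_value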
Since the stochastic parameters of APL1P take values that vary by several orders of magnitude, we used a weighted Euclidean norm to better calculate the distance between scenarios when defining (PM). We used the standard Euclidean norm for the other test problems. For NV, we used the quicksort algorithm (in C++) to solve the sampling approximations (SP_{I_l}) and (SP_{J_l}), l = 1, 2, as the optimal solution is a sample quantile of demand. We also used the quicksort algorithm to perform the minimum-weight perfect matching. For all other test problems, we used the regularized decomposition (RD) code (in C++) by Świetanowski and Ruszczyński (Ruszczyński, 1986; Ruszczyński & Świetanowski, 1997) to solve the sampling approximations. We modified this code to use the Mersenne Twister algorithm to generate random samples (Wagner, 2009). To solve (PM), we used the Blossom V code discussed in (Kolmogorov, 2009). We note that there are multiple ways to partition the observations given a solution to (PM); we simply chose our partition based on the output from Blossom V. Given that NV and its corresponding matching problem can be solved efficiently, we set m = 1,000 and M = 1,000 for a total of 1,000,000 independent runs for each sample size n for this problem. For the other problems, we used m = 10 and M = 1,000 for a total of 10,000 independent runs for each n. For PGP2, APL1P, and GBD, we used the UA Research Computing High Performance Computing at the University of Arizona, and for NV, we used the MORE Institute facilities.
The statistical tests we performed are as follows. The null hypothesis states that the difference between the corresponding estimators of A2RP and A2RP-B is zero, and the alternative hypothesis states that the difference is positive (indicating that A2RP-B decreases the quantity being studied). We performed this test on the M differences in the estimates of the bias, bias squared, variance, and MSE of the optimality gap point estimator, and the m × M differences in the width of the CI estimator, using a one-sided dependent t-test with a 10% level of significance. This test is valid due to the large number of differences calculated.
Finally, we know from Theorem 3.13 that the CIs will attain the desired coverage of 0.90 for large sample sizes. However, given that bias reduction may reduce the width of the CI estimator, it is important to consider the change in coverage for small sample sizes when applying bias reduction. We estimated the coverage for each algorithm and sample size. This was done by computing p̂, the proportion of the m × M independent runs in which the CI contained the optimality gap. Note that when the candidate solution is optimal, the optimality gap is 0, and so the coverage is always trivially 1. The estimator p̂ is a scaled binomial random variable, and hence for the suboptimal candidate solution we formed a 90% CI on the coverage via p̂ ± 1.645 √(p̂(1 − p̂)/10⁶) for NV and p̂ ± 1.645 √(p̂(1 − p̂)/(5 × 10⁴)) for the other test problems. We also performed a one-sided two-proportion t-test with a 10% level of significance to test the null hypothesis that the coverages for A2RP and A2RP-B are equal against the alternative hypothesis that A2RP-B has a lower coverage.
3.7.3 Results of Experiments on NV, PGP2, APL1P, and GBD

We now present the computational results for each candidate solution, beginning with the optimal candidate solution. We note that for all CIs provided, a margin of error smaller than 0.01 or 0.001 is reported as 0.00 or 0.000, respectively.

Optimal Candidate Solution

This section highlights our comparison of A2RP and A2RP-B using the optimal solution as the candidate solution for each test problem presented in Tables 3.1 and 3.2. Figure 3.1 depicts a summary, for all sample sizes, of this comparison in terms of the percentage reduction in the bias and MSE of the optimality gap estimator and the width of the CI estimator. Tables 3.3, 3.4, and 3.5 detail the results for select sample sizes. In these tables, the appearance of a * symbol in the columns that consider percentage reductions indicates that the hypothesis test on the difference between the A2RP and A2RP-B estimators did not result in a statistically significant positive difference. Conversely, the lack of a * symbol represents a statistically significant positive difference.
[Figure 3.1 appears here; panels (a) % Red. in Bias, (b) % Red. in MSE, (c) % Red. in CI Width, plotted against sample size n for NV, PGP2, APL1P, and GBD.]

Figure 3.1: Percentage reductions between A2RP and A2RP-B in (a) bias of optimality gap estimator, (b) MSE of optimality gap estimator, and (c) CI width with respect to sample size n for optimal candidate solutions
Figure 3.1(a) shows the percentage reduction between the biases of G_I and G_J. The corresponding Table 3.3 provides the details. The columns 'CI on Bias' provide the 90% CIs on the estimated biases of G_I and G_J, respectively. The column 'CI on Diff. in Bias' gives the 90% one-sided CI on the difference in the biases formed via A2RP and A2RP-B. We use a one-sided CI here because our aim is to test if there is a significant reduction in bias. Finally, the last column shows the percentage reduction in the bias, which is also shown in Figure 3.1(a). The results illustrate that all the test problems have a statistically significant positive difference in the bias of the optimality gap estimator, which is strong evidence that A2RP-B achieves its goal of reducing bias. The results for NV match the theory presented in Section 3.5, and we note the very fast convergence of the percentage reduction in the bias to 50%. For the other test problems, we observe a monotonic increase in the percentage reduction in the bias with sample size.

A comparison between the biases of the A2RP or A2RP-B and SRP optimality gap estimators can be made for n ∈ {50, 100, ..., 500} by looking at the A2RP bias for sample size 2n (the SRP bias) and the A2RP (or A2RP-B) bias for sample size n. Although the bias reduction technique does not eliminate all of the additional bias due to partitioning for APL1P, PGP2, and GBD, it does significantly reduce it. Focusing on GBD at n = 100, for example, indicates that the A2RP bias is about 2.4 times the bias of SRP, whereas A2RP-B reduces that ratio to about 1.4.

The summary of results on the MSE of the optimality gap estimator is depicted in Figure 3.1(b), with details presented in Table 3.4. The first two columns of Table 3.4 give an indication of the relative contributions of the square of the bias and the variance to the MSE. For instance, for NV at n = 100, we estimate that approximately 54% of the MSE of G_I is the square of the bias of G_I and the remaining 46% is the contribution from the variance of G_I. Table 3.4 shows that the square of the bias is a significant proportion of the MSE for all test problems. This proportion is reduced under A2RP-B. Furthermore, we observe that the proposed bias reduction technique reduces not only the bias but also the variance of the optimality gap estimator. Like the bias, we observe increases in variance reduction as n increases. As a result, the percentage reduction in the MSE is notable for all test problems, and is roughly monotonically increasing.

Finally, Figure 3.1(c) and Table 3.5 show the percentage reduction in the CI width at an optimal candidate solution. Because the optimality gap of an optimal solution is zero, reduction in the interval widths in this case is desirable. Our results indicate that for all test problems, there is a statistically significant reduction in the width of the CI. We again observe an increasing trend with sample size.
Problem   n     A2RP: CI on Bias   A2RP-B: CI on Bias   CI on Diff. in Bias   % Red. in Bias
NV        100   0.33 ± 0.00        0.17 ± 0.00          0.16 ± 0.00           48.55
NV        200   0.17 ± 0.00        0.08 ± 0.00          0.08 ± 0.00           49.29
NV        400   0.08 ± 0.00        0.04 ± 0.00          0.04 ± 0.00           49.53
NV        600   0.06 ± 0.00        0.03 ± 0.00          0.03 ± 0.00           49.91
NV        800   0.04 ± 0.00        0.02 ± 0.00          0.02 ± 0.00           49.78
PGP2      100   6.19 ± 0.06        5.45 ± 0.06          0.74 ± 0.02           11.96
PGP2      200   3.56 ± 0.04        3.11 ± 0.04          0.45 ± 0.02           12.74
PGP2      400   1.85 ± 0.03        1.47 ± 0.03          0.38 ± 0.01           20.44
PGP2      600   1.39 ± 0.02        1.05 ± 0.02          0.34 ± 0.01           24.27
PGP2      800   1.16 ± 0.02        0.86 ± 0.02          0.30 ± 0.01           26.05
APL1P     100   86.46 ± 1.33       64.75 ± 1.11         21.71 ± 0.76          25.11
APL1P     200   44.79 ± 0.76       31.82 ± 0.61         12.97 ± 0.43          28.95
APL1P     400   22.88 ± 0.42       15.18 ± 0.33         7.70 ± 0.23           33.65
APL1P     600   15.09 ± 0.28       9.25 ± 0.22          5.84 ± 0.15           38.71
APL1P     800   11.09 ± 0.21       6.38 ± 0.15          4.71 ± 0.12           42.49
GBD       100   4.38 ± 0.06        2.57 ± 0.04          1.80 ± 0.03           41.20
GBD       200   1.84 ± 0.03        1.03 ± 0.02          0.82 ± 0.02           44.33
GBD       400   0.74 ± 0.01        0.40 ± 0.01          0.34 ± 0.01           45.70
GBD       600   0.43 ± 0.01        0.22 ± 0.01          0.20 ± 0.00           47.19
GBD       800   0.29 ± 0.01        0.14 ± 0.00          0.15 ± 0.00           51.04

Table 3.3: Bias of optimality gap estimator for optimal candidate solutions
Problem   n     % of MSE, (Bias)²:Var. (A2RP)   % of MSE, (Bias)²:Var. (A2RP-B)   % Red. in (Bias)²   % Red. in Var.   % Red. in MSE
NV        100   54:46                           38:62                             73.50               47.44            61.59
NV        200   53:47                           36:64                             74.26               48.39            62.09
NV        400   52:48                           35:65                             74.51               48.60            62.09
NV        600   52:48                           34:66                             74.89               48.85            62.25
NV        800   51:49                           34:66                             74.75               48.98            62.24
PGP2      100   76:24                           73:27                             22.06               9.68             18.95
PGP2      200   69:31                           65:35                             23.20               8.35             18.80
PGP2      400   54:46                           45:55                             34.04               4.17             20.92
PGP2      600   50:50                           38:62                             38.95               4.08             22.10
PGP2      800   49:51                           38:62                             41.93               8.92             25.64
APL1P     100   57:43                           53:47                             42.81               29.17            36.54
APL1P     200   54:46                           50:50                             48.20               32.62            40.58
APL1P     400   51:49                           47:53                             54.03               38.47            45.82
APL1P     600   49:51                           45:55                             59.98               45.54            51.96
APL1P     800   49:51                           44:56                             64.86               50.76            56.95
GBD       100   65:35                           60:40                             64.79               54.47            60.85
GBD       200   60:40                           50:50                             67.98               53.49            61.77
GBD       400   52:48                           39:61                             68.31               46.38            57.54
GBD       600   45:55                           32:68                             69.07               44.17            55.32
GBD       800   41:59                           29:71                             72.08               46.72            57.17

Table 3.4: MSE of optimality gap estimator for optimal candidate solutions
Problem   n     % Red. in CI Width
NV        100   39.37
NV        200   40.34
NV        400   40.81
NV        600   41.05
NV        800   41.14
PGP2      100   10.06
PGP2      200   9.55
PGP2      400   6.84
PGP2      600   17.39
PGP2      800   14.53
APL1P     100   17.27
APL1P     200   20.09
APL1P     400   24.69
APL1P     600   29.58
APL1P     800   33.44
GBD       100   33.29
GBD       200   36.22
GBD       400   36.27
GBD       600   37.29
GBD       800   40.11

Table 3.5: CI estimator for optimal candidate solutions
Suboptimal Candidate Solution

We now turn to the results of our experiments using the suboptimal candidate solutions outlined in Table 3.2. Tables 3.6, 3.7, and 3.8 summarize the same information as Tables 3.3, 3.4, and 3.5, with the addition of the coverage estimates for A2RP and A2RP-B in Table 3.8 for select sample sizes. Figure 3.2 shows plots of the percentage reductions in the bias and the MSE of the optimality gap estimator and in the CI width. Bias reduction does not depend on the candidate solution, but Figures 3.1 and 3.2 indicate that reductions in the MSE and CI width differ from optimal to suboptimal candidate solutions. We now examine each in detail.

Because the bias of the optimality gap estimator is independent of the candidate solution (only the z*_n term contributes to the bias), Figures 3.1(a) and 3.2(a), and Tables 3.3 and 3.6, should be identical. However, we note that there are slight differences, although the overall trends are very similar. In particular, the columns 'CI on Diff. in Bias' are the same in both Tables 3.3 and 3.6, but the percentage reductions in the bias can be slightly different. The suboptimal candidate solution appears to induce more variability in the optimality gap estimator, as illustrated by the wider CIs on the bias of the optimality gap estimator. Slight differences—typically at higher sample sizes, when the absolute value of the bias is small—result in slightly different values of the percentage reduction for the suboptimal candidate solution.

Even though the bias of the optimality gap estimator is independent of the candidate solution, its variance, and hence MSE, depend on the candidate solution. Compared to the optimal candidate solution case, the square of the bias makes up a much smaller proportion of the MSE for all test problems. Under A2RP-B for NV, the square of the bias is a negligible proportion of the MSE. (Note that we round the percentages in the first two columns in Tables 3.4 and 3.7.) Another difference between the optimal and suboptimal candidate solutions is that the variance-reducing effect of the bias reduction technique is typically only statistically significant at small sample sizes, and at a reduced rate.
[Figure 3.2 appears here; panels (a) % Red. in Bias, (b) % Red. in MSE, (c) % Red. in CI Width, plotted against sample size n for NV, PGP2, APL1P, and GBD.]

Figure 3.2: Percentage reductions between A2RP and A2RP-B in (a) bias of optimality gap estimator, (b) MSE of optimality gap estimator, and (c) CI width with respect to sample size n for suboptimal candidate solutions
At higher sample sizes, the variance actually increases, although these increases were not found to be statistically significant. As a result, the MSE of the optimality gap estimator is reduced mainly at smaller sample sizes. One exception is PGP2, which exhibits MSE reduction across all values of n, and slight but statistically significant variance reduction at higher sample sizes.

Table 3.8 indicates that the difference between the width of the CI estimator for A2RP and A2RP-B is significant; however, the percentage reduction in the width is fairly small, with the exception of GBD for small sample sizes. The lowering of the coverage under A2RP-B is statistically significant for all test problems and sample sizes. However, the coverage under A2RP-B remains close to 90%, with the exception of PGP2. Note that PGP2 is known to have low coverage when A2RP is used (Bayraksan & Morton, 2006). A2RP-B reduces coverage for PGP2, but it is still much higher than that of SRP, which yields coverage of 0.50–0.60 at the same candidate solution (Bayraksan & Morton, 2006).
Problem   n     A2RP: CI on Bias   A2RP-B: CI on Bias   CI on Diff. in Bias   % Red. in Bias
NV        100   0.33 ± 0.00        0.17 ± 0.00          0.16 ± 0.00           48.55
NV        200   0.17 ± 0.00        0.08 ± 0.00          0.08 ± 0.00           49.19
NV        400   0.08 ± 0.00        0.04 ± 0.00          0.04 ± 0.00           49.41
NV        600   0.06 ± 0.00        0.03 ± 0.00          0.03 ± 0.00           50.01
NV        800   0.04 ± 0.00        0.02 ± 0.00          0.02 ± 0.00           49.75
PGP2      100   6.30 ± 0.13        5.56 ± 0.12          0.74 ± 0.02           11.75
PGP2      200   3.59 ± 0.08        3.14 ± 0.07          0.45 ± 0.02           12.63
PGP2      400   1.85 ± 0.05        1.48 ± 0.05          0.38 ± 0.01           20.39
PGP2      600   1.41 ± 0.04        1.07 ± 0.04          0.34 ± 0.01           23.90
PGP2      800   1.16 ± 0.04        0.86 ± 0.04          0.30 ± 0.01           25.95
APL1P     100   86.68 ± 2.75       64.97 ± 2.65         21.71 ± 0.76          25.05
APL1P     200   44.59 ± 1.99       31.62 ± 1.94         12.97 ± 0.43          29.09
APL1P     400   22.81 ± 1.43       15.11 ± 1.43         7.70 ± 0.23           33.75
APL1P     600   15.06 ± 1.19       9.22 ± 1.19          5.84 ± 0.15           38.79
APL1P     800   10.91 ± 1.02       6.20 ± 1.03          4.71 ± 0.12           43.18
GBD       100   4.39 ± 0.06        2.58 ± 0.05          1.80 ± 0.03           41.09
GBD       200   1.83 ± 0.03        1.01 ± 0.03          0.82 ± 0.02           44.67
GBD       400   0.73 ± 0.02        0.40 ± 0.02          0.34 ± 0.01           46.02
GBD       600   0.43 ± 0.02        0.23 ± 0.02          0.20 ± 0.00           47.07
GBD       800   0.29 ± 0.01        0.14 ± 0.01          0.15 ± 0.00           51.40

Table 3.6: Bias of optimality gap estimator for suboptimal candidate solutions
Problem   n     % of MSE, (Bias)²:Var. (A2RP)   % of MSE, (Bias)²:Var. (A2RP-B)   % Red. in (Bias)²   % Red. in Var.   % Red. in MSE
NV        100   7:93                            2:98                              72.53               1.14             6.38
NV        200   4:96                            1:99                              72.37               0.08             2.91
NV        400   2:98                            1:99                              71.05               0.12*            1.33
NV        600   1:99                            0:100                             69.97               0.08*            0.88
NV        800   1:99                            0:100                             68.05               0.09*            0.65
PGP2      100   42:58                           38:62                             20.08               5.48             11.67
PGP2      200   38:62                           34:66                             21.79               9.98             14.69
PGP2      400   27:73                           21:79                             30.13               6.66             13.95
PGP2      600   24:76                           17:83                             32.70               4.09             12.37
PGP2      800   24:76                           16:84                             34.13               2.26             11.13
APL1P     100   23:77                           16:84                             33.84               4.01             12.00
APL1P     200   17:83                           12:88                             30.62               1.06*            5.20
APL1P     400   12:88                           10:90                             23.21               2.03*            1.59
APL1P     600   11:89                           10:90                             18.55               3.06*            0.34*
APL1P     800   10:90                           9:91                              13.87               2.99*            1.11*
GBD       100   59:41                           46:54                             63.70               40.01            53.77
GBD       200   46:54                           27:73                             64.69               27.39            44.93
GBD       400   28:72                           15:85                             58.33               12.89            27.04
GBD       600   20:80                           12:88                             50.57               5.68             16.15
GBD       800   16:84                           11:89                             45.13               3.18             11.13

Table 3.7: MSE of optimality gap estimator for suboptimal candidate solutions
Problem   n     A2RP: CI on Coverage   A2RP-B: CI on Coverage   % Red. in CI Width
NV        100   0.911 ± 0.000          0.885 ± 0.001            3.61
NV        200   0.912 ± 0.000          0.894 ± 0.001            1.99
NV        400   0.907 ± 0.000          0.894 ± 0.001            1.07
NV        600   0.908 ± 0.000          0.897 ± 0.000            0.73
NV        800   0.906 ± 0.000          0.896 ± 0.001            0.56
PGP2      100   0.772 ± 0.007          0.676 ± 0.008            4.56
PGP2      200   0.821 ± 0.006          0.792 ± 0.007            3.67
PGP2      400   0.707 ± 0.007          0.641 ± 0.008            10.14
PGP2      600   0.886 ± 0.005          0.861 ± 0.006            6.85
PGP2      800   0.826 ± 0.006          0.775 ± 0.007            8.46
APL1P     100   0.896 ± 0.005          0.868 ± 0.006            6.06
APL1P     200   0.899 ± 0.005          0.867 ± 0.006            4.10
APL1P     400   0.892 ± 0.005          0.862 ± 0.006            2.52
APL1P     600   0.895 ± 0.005          0.871 ± 0.006            2.01
APL1P     800   0.893 ± 0.005          0.872 ± 0.005            1.64
GBD       100   0.991 ± 0.002          0.974 ± 0.003            27.07
GBD       200   0.979 ± 0.002          0.939 ± 0.004            24.08
GBD       400   0.958 ± 0.003          0.908 ± 0.005            15.38
GBD       600   0.940 ± 0.004          0.895 ± 0.005            10.31
GBD       800   0.926 ± 0.004          0.891 ± 0.005            8.01

Table 3.8: CI estimator for suboptimal candidate solutions
3.7.4 Computation Time of Bias Reduction

In this section, we investigate whether solving (PM) could be computationally prohibitive at the large sample sizes required by large-scale problems. To test the performance of A2RP-B on a problem of practical size, we conducted an experiment on the test problem SSN, which was described in Example 1.2. SSN has 86 independent stochastic parameters and about 10^70 scenarios. Fixing the candidate solution to a (suboptimal) solution obtained by solving a separate sampling problem and setting α = 0.10, we ran A2RP-B for sample sizes ranging from n = 1,000 to n = 5,000. The tests were performed on a 1.66 GHz LINUX computer with 4 GB memory.

Table 3.9 shows the breakdown of A2RP-B running times in minutes. For the sampling approximations, we measured the time needed to compute the edge weights, allocate the scenarios into two subsets, solve the sampling problems, and produce the A2RP-B estimators. We also calculated the time Blossom V spent solving (PM). As can be seen in Table 3.9, solving (PM) accounts for a small percentage of the total computational effort required to implement A2RP-B. Since Blossom V's running time depends only on the sample size used in A2RP-B and not on the dimension of the random vector, we expect similar results for other large-scale problems. Note that the worst-case running time of Blossom V is believed to be O(n³m) for a graph with n nodes and m edges (Kolmogorov, 2009). A more efficient implementation, O(mn log n), for solving (PM) for dense graphs like ours is given in (Mehlhorn & Schäfer, 2002); however, we do not pursue it here.
n       Sampling Approx.   (PM)
1000    7.10               0.01
2000    18.52              0.04
3000    28.65              0.07
4000    44.29              0.16
5000    56.18              0.24

Table 3.9: Breakdown of A2RP-B computational time (in minutes) with respect to sample size n for SSN
3.7.5 Effect of Multiple Optimal Solutions

In our final test, we applied A2RP-B to problems with multiple optimal solutions. It is known that such problems may behave differently than those with a single optimal solution. For example, under appropriate conditions, E z*_n − z* = n^{−1/2} E[ inf_{x∈X*} Y(x) ] + o(n^{−1/2}), where Y(x) is a normal random variable with mean zero and variance given by var(f(x, ξ̃)); see, e.g., the discussion after Theorem 10 in (Shapiro, 2003). This indicates that as the set of optimal solutions X* gets larger, the bias might increase.

To test the performance of A2RP-B on problems with multiple optimal solutions, we generated instances of NV with a discrete demand distribution and an increasing set of optimal solutions. We set the cost of one newspaper to c = 5 and its selling price to r = 50. The demand ξ̃ is set to take values in {0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, b}, each with probability 1/10, resulting in a piecewise linear objective function. The optimal solution set is given by X* = [4, b] and the optimal value is z* = −90. We considered three cases: (i) b = 4.1, giving a “narrow” solution set of width 0.1, (ii) b = 5, leading to a “medium” solution set of width 1, and (iii) b = 10, resulting in a “wide” solution set of width 6. Computations were run using an optimal candidate solution, x̂ = 4, and a suboptimal candidate solution, x̂ = 3, with an optimality gap of 7.5 regardless of the size of X*.
Tables 3.10 and 3.11, which follow the format of the tables in the previous two sections, highlight the main results of the experiments for the optimal and suboptimal candidate solutions, respectively. All reported results are for a sample size of n = 500.

X*       A2RP: CI on Bias   A2RP-B: CI on Bias   % Red. in Bias   % Red. in MSE   % Red. in CI Width
Narrow   0.04 ± 0.00        0.03 ± 0.00          29.16            24.17           24.12
Medium   0.38 ± 0.00        0.27 ± 0.00          29.16            24.17           24.11
Wide     2.26 ± 0.00        1.60 ± 0.00          29.16            24.17           24.11

Table 3.10: Summary of multiple optimal solutions results for an optimal candidate solution (n = 500)
X*       A2RP: CI on Bias   A2RP-B: CI on Bias   % Red. in Bias   % Red. in MSE   % Red. in CI Width
Narrow   0.04 ± 0.00        0.03 ± 0.00          29.26            0.17            0.14
Medium   0.38 ± 0.00        0.27 ± 0.00          29.17            5.36            1.58
Wide     2.26 ± 0.00        1.60 ± 0.00          29.16            19.05           8.30

Table 3.11: Summary of multiple optimal solutions results for a suboptimal candidate solution (n = 500)
As expected, the absolute value of the bias of the optimality gap estimator increases as the width of the optimal solution set increases. Nevertheless, the percentage reduction in the bias remains roughly constant at just below 30%. In the case of the optimal candidate solution, the percentage reductions in the MSE of the optimality gap estimator and the width of the CI estimator also do not appear to be affected by the solution set width. When using the suboptimal candidate solution, the MSE of the optimality gap estimator and the CI width are only slightly reduced when applying bias reduction; however, these reductions do increase as the solution set width increases. This indicates that the bias reduction technique may be of particular use for problems with a large optimal solution set.
3.7.6 Discussion

In this section, we summarize insights gained from our computational experiments and discuss our findings.

• The percentage reduction in the bias of the optimality gap estimator tends to increase as n increases. We hypothesize that this is due to the stability result that motivates the bias reduction technique. Recall that Theorem 3.2 requires P and Q to be sufficiently close, and at larger sample sizes, we expect P_n and P_{J_l}, l = 1, 2, to be closer. The absolute value of the reduction in the bias, on the other hand, decreases as n increases; see the columns labeled 'CI on Diff. in Bias' in Tables 3.3 and 3.6.

• The bias reduction technique works well when an optimal candidate solution is used. In this case, both the bias and the variance are reduced, resulting in a significant reduction in the MSE of the optimality gap point estimator. The width of the CI estimator formed via A2RP-B is also considerably reduced at an optimal solution.

• At a suboptimal candidate solution, we observe that the variance of the optimality gap estimator becomes more significant relative to the bias. Bias reduction is not affected, but the bias reduction technique reduces the variance, and hence the MSE, only at smaller sample sizes. However, it can sometimes increase the variance at higher sample sizes, weakening the MSE reduction. The CI width and coverage are slightly reduced.

• We also observed that in problems with multiple optima, the procedure at a suboptimal solution can be more effective in MSE and CI width reduction as the set of optimal solutions gets larger.

• The computational effort required by the bias reduction technique, while increasing with sample size, is insignificant compared to the total computational effort of a two-replication procedure for optimality gap estimation.
Suppose we solve an independent sampling problem (or use any other method) to obtain a candidate solution. Fixing this solution, we apply A2RP to obtain an estimate of its optimality gap. If this estimate is large, then we do not know whether the solution is good or not. Note that even when an optimal solution is obtained, this estimate can be large due to bias or variance. Alternatively, the candidate solution itself may have a large optimality gap. Suppose the candidate solution obtained is indeed an optimal solution. Note that if an independent sampling problem is used to obtain the candidate solution, for the class of problems considered, the probability of obtaining an optimal candidate solution increases exponentially in the sample size under appropriate conditions (Shapiro & Homem-de-Mello, 2000; Shapiro et al., 2002), and this probability can be quite high even at small sample sizes for some problems. Then, the use of A2RP-B can significantly increase our ability to detect that this is an optimal solution. Our results indicate that A2RP-B reduces the bias, variance, and MSE of the optimality gap point estimator, and the width of the interval estimator, at an optimal solution. The risk in doing so is a decrease in the coverage at suboptimal solutions. At suboptimal solutions, the reduction in the bias remains the same, but the variance and MSE are mainly reduced at smaller sample sizes, indicating that A2RP-B provides a more reliable point estimator at suboptimal solutions at smaller sample sizes.

Among the procedures presented thus far in the dissertation, if identifying optimal solutions is of primary importance, then we recommend A2RP-B. If, on the other hand, conservative coverage is the primary concern, then we recommend the Multiple Replications Procedure (MRP) of Mak et al. (1999), which is known to be more conservative; see the computational results and also the preliminary guidelines in (Bayraksan & Morton, 2006).
3.8 Summary and Concluding Remarks
In this chapter, we presented a bias reduction technique for a class of stochastic programs that is rooted in a stability result. The proposed technique partitions the observations by minimizing the Kantorovich metrics between the empirical measure of the original sample and the probability measures on the resulting partitioned observations. This amounts to solving a minimum-weight perfect matching problem, which is polynomially solvable in the sample size. The bias reduction technique is applied to the A2RP optimality gap estimators for a given candidate solution. Analytical results on an instance of a newsvendor problem and computational results indicate that the bias reduction technique can reduce the bias introduced by partitioning while maintaining appropriate coverage. We showed that the optimality gap and SV estimators of A2RP-B are consistent and that the CI estimator is asymptotically valid. Preliminary computational results suggest that the technique works well for optimal candidate solutions, decreasing both the bias and the variance of the optimality gap estimator, and hence the MSE. For suboptimal solutions, bias reduction is unaffected but variance and MSE reduction are weakened. Coverage is slightly lowered after bias reduction.

We conclude that the method presented in this chapter has the potential to produce more reliable estimators of the optimality gap and to increase our ability to recognize optimal or nearly optimal solutions. The next chapter studies the use of variance reduction schemes to improve optimality gap estimators, with and without the bias reduction technique presented in this chapter.
Chapter 4
Assessing Solution Quality with Variance Reduction
In this chapter, we investigate the effects of embedding alternative sampling techniques aimed at reducing variance in algorithms that assess solution quality via optimality gap estimators. In particular, we focus on AV and LHS and implement these schemes in MRP, SRP, and A2RP (outlined in Chapter 2). We also consider the combination of these variance reduction schemes with A2RP-B from Chapter 3. Since we now consider MRP and SRP in addition to A2RP, we fix the sample size per replication n for each algorithm and allow the total sample size to vary, so that the bias of the optimality gap estimator is not significantly increased for MRP compared to SRP and A2RP.

In addition to assumptions (A1)–(A3) described in Chapter 2, when considering LHS we assume

(A5) f(x, ξ) is uniformly bounded.

Assumption (A5), which implies (A2), is required in order to apply a CLT result for LHS. Other assumptions on f(x, ·), such as non-additivity, are needed for convergence to a non-degenerate normal; see, e.g., (Homem-de-Mello, 2008). We will discuss these in our setting later. We also assume that the components of the random vector ξ̃ are independent and that each has an invertible cumulative distribution function (cdf) Fj, j = 1, ..., d_ξ. The test problems studied in Section 4.6 satisfy these conditions. As in Chapter 3, we suppress the dependence on x̂ and n in our notation.
This chapter is organized as follows. In Section 4.1, we formally define AV and LHS and provide an overview of the relevant literature. We implement these variance reduction schemes in MRP in Section 4.2. Section 4.3 updates SRP to include AV and LHS, and discusses the asymptotic properties of the resulting estimators. In addition to presenting A2RP with AV and LHS, Section 4.4 investigates the combination of AV and LHS with the bias reduction technique from Chapter 3. In Section 4.6, we present our computational experiments on a number of test problems. We conclude with a summary in Section 4.7.
4.1 Overview of Antithetic Variates and Latin Hypercube Sampling
Monte Carlo simulation is widely used to estimate expectations of random variables. The technique, which produces unbiased estimators, has a convergence rate of $1/\sqrt{n}$, independent of dimension, and so a very large sample size may be required to achieve a small error. Variance reduction schemes typically aim to create an estimator of the expectation that is also unbiased but has a smaller variance than a standard Monte Carlo estimator. The implementation of a variety of variance reduction schemes in a stochastic programming framework has been studied extensively. The goal is usually to reduce the variance of estimates of Ef(x, ξ̃) for a given x ∈ X or of $z_n^*$, the estimator of the true optimal value of (SP). Shortly, we will focus on AV and LHS to reduce variance, but other variance reduction techniques have been studied as well. For example, Dantzig & Glynn (1990), Higle (1998), and Infanger (1992) consider importance sampling. Higle (1998), Pierre-Louis et al. (2011), and Shapiro & Homem-de-Mello (1998) apply control variate techniques to stochastic programs. Quasi-Monte Carlo sampling has also been considered (Drew & Homem-de-Mello, 2006; Homem-de-Mello, 2008; Koivu, 2005; Pennanen & Koivu, 2005). Drew (2007) develops padded sampling schemes that sample the most important variables with randomized quasi-Monte Carlo sampling and use Monte Carlo or LHS for the remaining variables. We will discuss additional literature on the use of AV and LHS in (SP) below. Our work differs from most of the literature in that we apply these techniques to estimators of the optimality gap rather than to Ef(x, ξ̃) or to estimates of the optimal value. The implications of this change of focus are discussed in later sections. First, we give the details of these two sampling schemes and review the relevant literature in the remainder of this section.
4.1.1 Antithetic Variates
The motivating idea behind AV is to use pairs of negatively correlated random variables to reduce variance. Let n be even. To sample n observations $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2}, \tilde{\xi}^{n/2'}\}$ from P using AV, perform the following procedure:
AV
1. Sample n/2 i.i.d. observations $\{\tilde{u}^1, \ldots, \tilde{u}^{n/2}\}$ from a $U(0,1)^{d_\xi}$ distribution.
2. For each dimension $j = 1, \ldots, d_\xi$, create antithetic pairs by setting $\tilde{u}_j^{i'} = 1 - \tilde{u}_j^i$.
3. To transform to sampling from P, invert $\tilde{u}$ by setting $\tilde{\xi}_j^i = F_j^{-1}(\tilde{u}_j^i)$ and $\tilde{\xi}_j^{i'} = F_j^{-1}(\tilde{u}_j^{i'})$, for $i = 1, \ldots, n/2$ and $j = 1, \ldots, d_\xi$.
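To make the mechanics concrete, the following is a minimal Python sketch of the AV procedure, assuming independent exponential marginals (chosen only because their inverse cdf has a closed form); the function names and the test integrand are illustrative assumptions, not part of the procedure above.

```python
import numpy as np

def av_sample(n, rates, rng):
    """Generate n observations (n/2 antithetic pairs) with independent
    exponential(rate_j) marginals via inversion of U(0,1)^d variates."""
    assert n % 2 == 0
    d = len(rates)
    u = rng.random((n // 2, d))          # Step 1: n/2 i.i.d. uniform vectors
    u_anti = 1.0 - u                     # Step 2: antithetic counterparts
    # Step 3: invert the marginal cdfs, F_j^{-1}(u) = -log(1 - u) / rate_j
    xi = -np.log(1.0 - u) / rates
    xi_anti = -np.log(1.0 - u_anti) / rates
    return xi, xi_anti

def av_estimate(f, xi, xi_anti):
    """AV estimator: (2/n) * sum over pairs of (f + f')/2."""
    return 0.5 * (f(xi).mean() + f(xi_anti).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rates = np.array([1.0, 2.0])
    f = lambda xi: xi.sum(axis=1)        # monotone integrand, so (4.2) below holds
    xi, xi_anti = av_sample(1000, rates, rng)
    print("AV estimate:", av_estimate(f, xi, xi_anti))   # true mean is 1.5
```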
AV replaces the Monte Carlo estimator $\frac{1}{n}\sum_{i=1}^{n} f(\hat{x}, \tilde{\xi}^i)$ with
$$\frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) \right).$$
Since both $\tilde{u}^i$ and $\tilde{u}^{i'}$ follow a $U(0,1)^{d_\xi}$ distribution, $\tilde{\xi}^i$ and $\tilde{\xi}^{i'}$ have distribution P. Therefore, AV produces an unbiased estimator of the mean, i.e.,
$$E\left[ \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(x, \tilde{\xi}^i) + f(x, \tilde{\xi}^{i'}) \right) \right] = Ef(x, \tilde{\xi}). \qquad (4.1)$$
The standard Monte Carlo estimator has variance $\sigma_{\hat{x}}^2 / n$, where $\sigma_{\hat{x}}^2 := \mathrm{var}\, f(\hat{x}, \tilde{\xi})$.
We can express the variance of the AV estimator in terms of the Monte Carlo variance:
$$\mathrm{var}\left( \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) \right) \right) = \frac{2}{n} \cdot \frac{1}{4}\left( \mathrm{var}\, f(\hat{x}, \tilde{\xi}) + \mathrm{var}\, f(\hat{x}, \tilde{\xi}') + 2\, \mathrm{Cov}\left( f(\hat{x}, \tilde{\xi}), f(\hat{x}, \tilde{\xi}') \right) \right) = \frac{\sigma_{\hat{x}}^2}{n} + \frac{1}{n}\, \mathrm{Cov}\left( f(\hat{x}, \tilde{\xi}), f(\hat{x}, \tilde{\xi}') \right).$$
Hence, AV reduces variance compared to Monte Carlo if
$$\mathrm{Cov}\left( f(\hat{x}, \tilde{\xi}), f(\hat{x}, \tilde{\xi}') \right) < 0, \qquad (4.2)$$
i.e., if $f(\hat{x}, \tilde{\xi})$ and $f(\hat{x}, \tilde{\xi}')$ are negatively correlated. It is well known that (4.2) holds when $f(\hat{x}, \cdot)$ is bounded and monotone in each component of ξ and $f(\hat{x}, \cdot)$ is not constant in the interior of Ξ; see, e.g., Theorem 4.3 in (Lemieux, 2009). The amount of variance reduction depends on how much of the negative correlation between $\tilde{u}$ and $\tilde{u}'$ is preserved (i) when transforming to $\tilde{\xi}$ and $\tilde{\xi}'$ and (ii) after applying $f(\hat{x}, \cdot)$.
It will be necessary to use the SLLN in later sections. The following theorem extends this result to AV:

Theorem 4.3. Assume that $E|f(x, \tilde{\xi})| < \infty$. The SLLN holds for an AV random sample $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2}, \tilde{\xi}^{n/2'}\}$, i.e.,
$$\frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left[ f(x, \tilde{\xi}^i) + f(x, \tilde{\xi}^{i'}) \right] \to Ef(x, \tilde{\xi}), \quad \text{a.s., as } n \to \infty.$$
Proof. The collections of random variables $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n/2}\}$ and $\{\tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2'}\}$ each contain independent variables with identical distribution P. Since $E|f(x, \tilde{\xi})| < \infty$, we have that $E|f(x, \tilde{\xi}')| < \infty$. Furthermore, AV produces unbiased estimators, and so we can apply the SLLN to each collection of random variables:
$$P\left( \lim_{n \to \infty} \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2} f(x, \tilde{\xi}^i) = \frac{1}{2} Ef(x, \tilde{\xi}) \right) = 1$$
and
$$P\left( \lim_{n \to \infty} \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2} f(x, \tilde{\xi}^{i'}) = \frac{1}{2} Ef(x, \tilde{\xi}) \right) = 1.$$
Therefore,
$$P\left( \lim_{n \to \infty} \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left[ f(x, \tilde{\xi}^i) + f(x, \tilde{\xi}^{i'}) \right] = Ef(x, \tilde{\xi}) \right) = P\left( \lim_{n \to \infty} \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2} f(x, \tilde{\xi}^i) + \lim_{n \to \infty} \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2} f(x, \tilde{\xi}^{i'}) = Ef(x, \tilde{\xi}) \right) = 1,$$
and the result holds.
Higle (1998) considers the use of AV when estimating $Ef(\hat{x}, \tilde{\xi})$, with x̂ being optimal or near optimal, for two-stage stochastic linear programs with recourse. This class of stochastic programs satisfies the monotonicity requirement on f. That paper also provides an empirical study of AV (alongside other variance reduction techniques) and finds that AV can reduce variance. Freimer et al. (2012) also consider AV in the context of two-stage stochastic linear programs with recourse. Analytical results for the newsvendor problem show that AV can reduce the bias of $z_n^*$, while variance may be increased or decreased depending on problem parameters. Computational experiments on large-scale problems indicate that AV can be effective at reducing variance with minimal effort. Koivu (2005) applies AV to $z_n^*$ and observes empirically that it can increase variance when the objective function is not monotone. However, when AV is effective, combining it with randomized quasi-Monte Carlo sampling can lead to significant variance reduction.
4.1.2 Latin Hypercube Sampling
While Monte Carlo sampling can produce high-quality estimators, particularly for large sample sizes, observations may be clustered together in such a way that the distribution is not sampled evenly. Stratified sampling addresses this issue by partitioning the distribution into strata and selecting a specified number of observations from each stratum. We focus on a particular type of stratified sampling referred to as Latin hypercube sampling. To sample n observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ from P using LHS, perform the following procedure:

LHS
For each dimension $j = 1, \ldots, d_\xi$,
1. Let $\tilde{u}_j^i \sim U\!\left( \frac{i-1}{n}, \frac{i}{n} \right)$ for $i = 1, \ldots, n$.
2. Define $\{\pi_j^1, \ldots, \pi_j^n\}$ to be a random permutation of $\{1, \ldots, n\}$, where each of the n! permutations is equally likely.
3. To transform to sampling from P, invert by setting $\tilde{\xi}_j^i = F_j^{-1}(\tilde{u}_j^{\pi_j^i})$, for $i = 1, \ldots, n$.
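As a concrete illustration, below is a minimal Python sketch of the LHS procedure for independent U(0, 1) marginals (so that inversion is the identity), together with a small empirical comparison against plain Monte Carlo for an additive integrand; the sample sizes and the integrand are illustrative assumptions only.

```python
import numpy as np

def lhs_sample(n, d, rng):
    """Latin hypercube sample of size n in d dimensions with U(0,1) marginals."""
    u = np.empty((n, d))
    for j in range(d):
        # Step 1: one point in each stratum ((i-1)/n, i/n)
        strata = (np.arange(n) + rng.random(n)) / n
        # Step 2: randomly permute the strata within dimension j
        u[:, j] = rng.permutation(strata)
    # Step 3 (inversion) is the identity for U(0,1) marginals
    return u

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    f = lambda xi: xi.sum(axis=1)          # additive, so LHS variance shrinks quickly
    n, d, runs = 200, 5, 500
    mc = [f(rng.random((n, d))).mean() for _ in range(runs)]
    lhs = [f(lhs_sample(n, d, rng)).mean() for _ in range(runs)]
    print("MC  estimator variance:", np.var(mc))
    print("LHS estimator variance:", np.var(lhs))
```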
We now provide an overview of properties of LHS that will be required in later sections. First, McKay et al. (1979) show that LHS gives an unbiased estimator, i.e., $Ef(x, \tilde{\xi}^i) = Ef(x, \tilde{\xi})$ for $i = 1, \ldots, n$, and so $E\left[ \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) \right] = Ef(x, \tilde{\xi})$. In addition, similarly to AV, the variance of an LHS estimator is no larger than that of a Monte Carlo estimator if $f(\hat{x}, \cdot)$ is monotone in each argument. For all measurable functions $f(\hat{x}, \cdot)$, Owen (1997) proves that
$$\mathrm{var}\left[ \frac{1}{n}\sum_{i=1}^{n} f(\hat{x}, \tilde{\xi}^i) \right] \le \frac{n}{n-1}\, \mathrm{var}\left[ \frac{1}{n}\sum_{i=1}^{n} f(\hat{x}, \tilde{\xi}_{MC}^i) \right] = \frac{1}{n-1}\, \sigma_{\hat{x}}^2, \qquad (4.4)$$
where $\{\tilde{\xi}_{MC}^1, \ldots, \tilde{\xi}_{MC}^n\}$ is a standard Monte Carlo random sample. Therefore, LHS produces an estimator with a variance at most that of a Monte Carlo estimator with sample size n − 1.
Theorem 3 of Loh (1996) gives the following SLLN result for LHS:

Theorem 4.5. Assume that $E[f^2(x, \tilde{\xi})] < \infty$. The SLLN holds for an LHS sample $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$, i.e.,
$$\frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) \to Ef(x, \tilde{\xi}), \quad \text{a.s., as } n \to \infty.$$
Additionally, a version of the CLT for LHS is presented in (Homem-de-Mello, 2008), which is based on Theorem 1 of Owen (1992):

Theorem 4.6. Assume that f(x, ξ) is uniformly bounded. A CLT holds for an LHS sample $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$, i.e.,
$$\frac{\frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) - Ef(x, \tilde{\xi})}{\sqrt{\mathrm{var}\left( \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) \right)}} \Rightarrow N(0, 1),$$
assuming $\mathrm{var}\left( \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) \right) > 0$, where N(0, 1) is a standard normal random variable and "⇒" denotes convergence in distribution.

We note that if f(x, ·) is an additive function, i.e., f(x, ·) can be written as $f(x, \xi_1, \ldots, \xi_{d_\xi}) = C + \sum_{j=1}^{d_\xi} f_j(\xi_j)$, where $f_1, \ldots, f_{d_\xi}$ are univariate functions and C is a constant, then $\mathrm{var}\left( \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) \right) \to 0$ as n → ∞ and the CLT result holds in a degenerate form. We use this CLT result for the asymptotic validity of the CI variants that use LHS and note that in the degenerate case, our main results remain unaffected (see, e.g., the proofs of Theorems 4.12 and 4.24 below).
Both numerical and theoretical properties of LHS have been examined in the stochastic programming literature. Bailey et al. (1999) combine LHS with a response surface methodology to solve two-stage stochastic linear programs with recourse. Linderoth et al. (2006) use LHS to improve the calculation of $z_n^*$, its bias, and an upper bound on z* for a set of large-scale test problems. Building on this work, Freimer et al. (2012) find that LHS is very effective at reducing the bias, variance, and MSE of $z_n^*$ for a variety of two-stage stochastic linear programs. On the theoretical side, Homem-de-Mello (2008) studies the rates of convergence of estimators of optimal solutions and optimal values under LHS, and Drew & Homem-de-Mello (2012) examine large deviations properties of estimators obtained with LHS. Drew (2007) develops padded sampling schemes using randomized quasi-Monte Carlo sampling and either Monte Carlo sampling or LHS, and provides a central limit theorem in the case of LHS. These schemes are used to approximately solve (SP), and their application when assessing solution quality via SRP is also studied.
4.2 Multiple Replications Procedure with Variance Reduction
In this section, we present MRP with AV and MRP with LHS and discuss asymptotic
properties of the resultant CI estimators.
4.2.1 Antithetic Variates
Let the random sample $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2}, \tilde{\xi}^{n/2'}\}$ be an AV sample from P of total size n. The approximated stochastic program using this AV sample is
$$z_A^* = \min_{x \in X} \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(x, \tilde{\xi}^i) + f(x, \tilde{\xi}^{i'}) \right) = \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(x_A^*, \tilde{\xi}^i) + f(x_A^*, \tilde{\xi}^{i'}) \right). \qquad (\mathrm{SP}_A)$$
The formal statement of the algorithm is as follows:

MRP AV
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), an even sample size per replication n, and a replication size m.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on G.
Replace Steps 1.1 and 1.3 of MRP by
1.1. Sample observations $\{\tilde{\xi}_l^1, \tilde{\xi}_l^{1'}, \ldots, \tilde{\xi}_l^{n/2}, \tilde{\xi}_l^{n/2'}\}$ from P using AV, independently of other replications.
1.3. Calculate
$$G_{A,l} = \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left[ f(\hat{x}, \tilde{\xi}_l^i) - f(x_{A,l}^*, \tilde{\xi}_l^i) + f(\hat{x}, \tilde{\xi}_l^{i'}) - f(x_{A,l}^*, \tilde{\xi}_l^{i'}) \right].$$
The optimality gap and SV estimators are now denoted $\bar{G}_A(m)$ and $s_A^2(m)$, respectively, and the CI estimator is updated accordingly.

By (4.1), $E\left[ \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(x, \tilde{\xi}_l^i) + f(x, \tilde{\xi}_l^{i'}) \right) \right] = Ef(x, \tilde{\xi})$, and so we can apply the same argument as in (2.1) to get $Ez_A^* \le z^*$ and hence $EG_A \ge \mu_{\hat{x}}$. Therefore, the CLT implies
$$P\left( \mu_{\hat{x}} \le \bar{G}_A(m) + \frac{t_{m-1,\alpha}\, s_A(m)}{\sqrt{m}} \right) \approx 1 - \alpha.$$
Note that $\mathrm{var}\, \bar{G}_A(m) = \frac{1}{m}\mathrm{var}(G_A)$. Our aim is to reduce the variance of each replication $G_{A,l}$, $l = 1, \ldots, m$, and hence reduce the variance of $\bar{G}_A(m)$. In our case, removing the subscript l for ease of notation, we have
$$\mathrm{var}(G_A) = \mathrm{var}\left( \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left[ f(\hat{x}, \tilde{\xi}^i) - f(x_A^*, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) - f(x_A^*, \tilde{\xi}^{i'}) \right] \right)$$
$$= \frac{1}{n^2}\sum_{i=1}^{n/2} \mathrm{var}\left( f(\hat{x}, \tilde{\xi}^i) - f(x_A^*, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) - f(x_A^*, \tilde{\xi}^{i'}) \right) + \frac{2}{n^2}\sum_{i<j} \mathrm{Cov}\left( \left[ f(\hat{x}, \tilde{\xi}^i) - f(x_A^*, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) - f(x_A^*, \tilde{\xi}^{i'}) \right],\ \left[ f(\hat{x}, \tilde{\xi}^j) - f(x_A^*, \tilde{\xi}^j) + f(\hat{x}, \tilde{\xi}^{j'}) - f(x_A^*, \tilde{\xi}^{j'}) \right] \right).$$
It is not straightforward to further analyze this expression, due to the dependence of $x_A^*$ on all of the samples $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2}, \tilde{\xi}^{n/2'}\}$. In addition, as discussed in Section 4.1.1, $f(\hat{x}, \cdot)$ is monotone in each component of ξ for two-stage stochastic linear programs with recourse. However, monotonicity does not necessarily hold for $f(\hat{x}, \tilde{\xi}) - f(x_A^*, \tilde{\xi})$, and therefore we cannot immediately conclude that AV will reduce the variance of the MRP optimality gap estimator for this class of problems. Section 4.6 investigates the effectiveness of MRP AV empirically for a number of two-stage programs.
4.2.2 Latin Hypercube Sampling
Let the random sample $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ be an LHS sample from P. The approximated stochastic program using this LHS sample is
$$z_L^* = \min_{x \in X} \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) = \frac{1}{n}\sum_{i=1}^{n} f(x_L^*, \tilde{\xi}^i). \qquad (\mathrm{SP}_L)$$
The formal statement of the algorithm is as follows:

MRP LHS
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), a sample size per replication n, and a replication size m.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on G.
Replace Step 1.1 of MRP by
1.1. Sample observations $\{\tilde{\xi}_l^1, \ldots, \tilde{\xi}_l^n\}$ from P using LHS, independently of other replications.

The optimality gap and SV estimators are now denoted $\bar{G}_L(m)$ and $s_L^2(m)$, respectively, and the CI estimator is updated accordingly.
As mentioned above, LHS produces unbiased estimators, i.e., $E\left[ \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i) \right] = Ef(x, \tilde{\xi})$, and so $Ez_L^* \le z^*$ and $EG_L \ge \mu_{\hat{x}}$. Therefore, by the CLT,
$$P\left( \mu_{\hat{x}} \le \bar{G}_L(m) + \frac{t_{m-1,\alpha}\, s_L(m)}{\sqrt{m}} \right) \approx 1 - \alpha.$$
Since $\mathrm{var}\, \bar{G}_L(m) = \frac{1}{m}\mathrm{var}(G_L)$, reducing the variance of each replication $G_{L,l}$, $l = 1, \ldots, m$, will reduce the variance of $\bar{G}_L(m)$. The variance of a single replication can be expressed as
$$\mathrm{var}(G_L) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}\left( f(\hat{x}, \tilde{\xi}^i) - f(x_L^*, \tilde{\xi}^i) \right) + \frac{2}{n^2}\sum_{i<j} \mathrm{Cov}\left( \left[ f(\hat{x}, \tilde{\xi}^i) - f(x_L^*, \tilde{\xi}^i) \right],\ \left[ f(\hat{x}, \tilde{\xi}^j) - f(x_L^*, \tilde{\xi}^j) \right] \right).$$
As with AV, we cannot easily analyze this expression further, and the lack of monotonicity does not allow us to easily make statements regarding variance reduction at this point. Regardless, Section 4.6 observes significant benefits when using MRP LHS.
4.3 Single Replication Procedure with Variance Reduction
This section defines SRP with AV and SRP with LHS and discusses the consistency
and asymptotic validity of the optimality gap estimators.
4.3.1 Antithetic Variates
The algorithmic statement for SRP with AV is as follows:

SRP AV
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and an even sample size n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on G.
Replace Steps 1, 3, and 4 of SRP by:
1. Sample observations $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2}, \tilde{\xi}^{n/2'}\}$ from P using AV.
3. Calculate
$$G_A = \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) - f(x_A^*, \tilde{\xi}^i) - f(x_A^*, \tilde{\xi}^{i'}) \right)$$
and
$$s_A^2 = \frac{1}{n/2 - 1}\sum_{i=1}^{n/2} \left[ \frac{1}{2}\left( f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) - f(x_A^*, \tilde{\xi}^i) - f(x_A^*, \tilde{\xi}^{i'}) \right) - \left( \bar{g}_A(\hat{x}) - \bar{g}_A(x_A^*) \right) \right]^2,$$
where $\bar{g}_A(x) = \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{2}\left( f(x, \tilde{\xi}^i) + f(x, \tilde{\xi}^{i'}) \right)$.
4. Output a one-sided confidence interval on G:
$$\left[ 0,\ G_A + \frac{z_\alpha s_A}{\sqrt{n/2}} \right]. \qquad (4.7)$$

Note that the CI estimator (4.7) divides by $\sqrt{n/2}$ instead of $\sqrt{n}$. This is because the AV sample used produces n/2 i.i.d. pairs of observations for each replication. As with MRP AV, the structure of the point estimator $G_A$ makes it difficult to analytically determine the effect of AV on its variance.
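To illustrate the computation in Steps 3 and 4, here is a small Python sketch that forms $G_A$, $s_A^2$, and the CI upper bound from pre-computed function evaluations at x̂ and at $x_A^*$ on each antithetic pair; the evaluation arrays are assumed inputs (in practice they come from solving the recourse problems), so this is only an outline of the arithmetic.

```python
import numpy as np
from scipy import stats

def srp_av_estimators(f_hat, f_hat_anti, f_star, f_star_anti, alpha=0.10):
    """SRP AV point estimator, sample variance, and one-sided CI upper bound.
    Each input is a length-(n/2) array of f evaluated on one half of an AV pair."""
    # per-pair gap contributions: average of the pair at x_hat minus at x*_A
    pair_gaps = 0.5 * (f_hat + f_hat_anti - f_star - f_star_anti)
    n_half = pair_gaps.size
    G_A = pair_gaps.mean()
    s2_A = pair_gaps.var(ddof=1)          # matches the (n/2 - 1) denominator
    ub = G_A + stats.norm.ppf(1 - alpha) * np.sqrt(s2_A / n_half)
    return G_A, s2_A, ub

# Illustrative use with random stand-in evaluations (n/2 = 250 pairs):
rng = np.random.default_rng(2)
fh, fha = rng.normal(10, 1, 250), rng.normal(10, 1, 250)
fs, fsa = fh - rng.uniform(0, 0.5, 250), fha - rng.uniform(0, 0.5, 250)
print(srp_av_estimators(fh, fha, fs, fsa))
```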
We now present theoretical results on the consistency and asymptotic validity of the SRP AV optimality gap estimators. First, some definitions: for a fixed x̂ ∈ X, define $\sigma_{\hat{x},A}^2(x) = \mathrm{var}\left( \frac{1}{2}\left[ f(\hat{x}, \tilde{\xi}) + f(\hat{x}, \tilde{\xi}') - f(x, \tilde{\xi}) - f(x, \tilde{\xi}') \right] \right)$, and denote the optimal solutions that minimize and maximize $\sigma_{\hat{x},A}^2(x)$ by $x_{\min,A}^* \in \arg\min_{x \in X^*} \sigma_{\hat{x},A}^2(x)$ and $x_{\max,A}^* \in \arg\max_{x \in X^*} \sigma_{\hat{x},A}^2(x)$, respectively.
Theorem 4.8. Assume x̂ ∈ X, $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2}, \tilde{\xi}^{n/2'}\}$ is an AV sample from distribution P, and (A1)–(A3) hold. Fix 0 < α < 1 and consider SRP AV. Then,
(i) $z_A^* \to z^*$, a.s., as n → ∞;
(ii) $G_A \to G$, a.s., as n → ∞;
(iii) all limit points of $x_A^*$ lie in X*, a.s.;
(iv) $\sigma_{\hat{x},A}^2(x_{\min,A}^*) \le \liminf_{n \to \infty} s_A^2 \le \limsup_{n \to \infty} s_A^2 \le \sigma_{\hat{x},A}^2(x_{\max,A}^*)$, a.s.

Proof. (i) (A2) implies that $E \sup_{x \in X} f(x, \tilde{\xi}) < \infty$. Therefore, noting that the SLLN holds for AV (Theorem 4.3), the result follows immediately from Theorem A1 of Rubinstein & Shapiro (1993).
(ii) We have that $G_A = \bar{g}_A(\hat{x}) - z_A^*$. Additionally, $\bar{g}_A(\hat{x})$ converges to $Ef(\hat{x}, \tilde{\xi})$, a.s., by Theorem 4.3. Furthermore, by part (i), $z_A^*$ converges to z*, a.s., as n → ∞. We conclude that $G_A \to Ef(\hat{x}, \tilde{\xi}) - z^* = G$, a.s., as n → ∞.
(iii) (A1)–(A3) imply that $\bar{g}_A(x)$ converges uniformly to $Ef(x, \tilde{\xi})$, a.s., on X, by Lemma A1 of Rubinstein & Shapiro (1993). This result along with (i) implies (iii).
(iv) With appropriate adjustments of notation, the proof is as in the proof of Proposition 1 in (Bayraksan & Morton, 2006).
The next result demonstrates that the SRP AV CI estimator of the optimality gap is asymptotically valid.

Theorem 4.9. Assume x̂ ∈ X, $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n/2}, \tilde{\xi}^{n/2'}\}$ is an AV sample from distribution P, and (A1)–(A3) hold. Fix 0 < α < 1 and consider SRP AV. Then,
$$\liminf_{n \to \infty} P\left( G \le G_A + \frac{z_\alpha s_A}{\sqrt{n/2}} \right) \ge 1 - \alpha. \qquad (4.10)$$

Proof. When x̂ ∈ X*, inequality (4.10) is trivial. Suppose x̂ ∉ X*. Then
$$G_A = \bar{g}_A(\hat{x}) - \bar{g}_A(x_A^*) \ge \bar{g}_A(\hat{x}) - \bar{g}_A(x), \quad \forall x \in X.$$
Observe that $\bar{g}_A(\hat{x}) - \bar{g}_A(x_{\min,A}^*)$ is a sample mean of n/2 i.i.d. random variables and thus the CLT can be applied. The remainder of the proof follows the proof of Theorem 1 in (Bayraksan & Morton, 2006) with some minor adjustments of notation.
4.3.2 Latin Hypercube Sampling
SRP with LHS is as follows:

SRP LHS
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and a sample size n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on G.
Replace Step 1 of SRP by:
1. Sample observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ from P using LHS.

The optimality gap and SV estimators are denoted $G_L$ and $s_L^2$, respectively, and the CI estimator is adjusted accordingly. Define $\bar{f}_L(x) = \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}^i)$.

As described above, we are unable to easily quantify the effect of LHS on variance via analytical methods. We now discuss the asymptotic behavior of the SRP LHS optimality gap estimators. These results were first presented in (Drew, 2007).
Theorem 4.11. Assume x̂ ∈ X, $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ is an LHS sample from distribution P, and (A1)–(A3) hold. Fix 0 < α < 1 and consider SRP LHS. Then,
(i) $z_L^* \to z^*$, a.s., as n → ∞;
(ii) $G_L \to G$, a.s., as n → ∞;
(iii) all limit points of $x_L^*$ lie in X*, a.s.;
(iv) $\sigma_{\hat{x}}^2(x_{\min}^*) \le \liminf_{n \to \infty} s_L^2 \le \limsup_{n \to \infty} s_L^2 \le \sigma_{\hat{x}}^2(x_{\max}^*)$, a.s.

Proof. (i) (A2) implies that $E \sup_{x \in X} f(x, \tilde{\xi}) < \infty$. Therefore, noting that the SLLN holds for LHS (Theorem 4.5), the result follows immediately from Theorem A1 of Rubinstein & Shapiro (1993).
(ii) We have that $G_L = \bar{f}_L(\hat{x}) - z_L^*$. Additionally, $\bar{f}_L(\hat{x})$ converges to $Ef(\hat{x}, \tilde{\xi})$, a.s., by Theorem 4.5. Furthermore, by part (i), $z_L^*$ converges to z*, a.s., as n → ∞. We conclude that $G_L \to Ef(\hat{x}, \tilde{\xi}) - z^* = G$, a.s., as n → ∞.
(iii) (A1)–(A3) imply that $\bar{f}_L(x)$ converges uniformly to $Ef(x, \tilde{\xi})$, a.s., on X, by Lemma A1 of Rubinstein & Shapiro (1993). This result along with (i) implies (iii).
(iv) The remainder of the proof is as in the proof of Proposition 1 in (Bayraksan & Morton, 2006), with slight differences in notation.
The second result for SRP LHS proves the asymptotic validity of the CI estimator. We present a detailed proof of this result to highlight the changes required in this case compared to the proof for SRP in (Bayraksan & Morton, 2006). Specifically, the proof must be updated to reflect the altered form of the CLT. We note that we adapt the proof in (Drew, 2007) to reflect our notation.
Theorem 4.12. Assume x̂ ∈ X, $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ is an LHS sample from distribution P, and (A1), (A3), and (A5) hold. Fix 0 < α < 1 and consider SRP LHS. Then,
$$\liminf_{n \to \infty} P\left( G \le G_L + \frac{z_\alpha s_L}{\sqrt{n}} \right) \ge 1 - \alpha. \qquad (4.13)$$

Proof. Let $\sigma_{\hat{x},L}^2(x) = \mathrm{var}\left( \frac{1}{n}\sum_{i=1}^{n} \left[ f(\hat{x}, \tilde{\xi}^i) - f(x, \tilde{\xi}^i) \right] \right)$, where the samples are generated using LHS. When x̂ ∈ X*, inequality (4.13) is trivial. Suppose x̂ ∉ X*, and recall that $z_L^* = \min_{x \in X} \bar{f}_L(x)$. Thus,
$$G_L = \bar{f}_L(\hat{x}) - \bar{f}_L(x_L^*) \ge \bar{f}_L(\hat{x}) - \bar{f}_L(x), \quad \forall x \in X.$$
Replacing x by $x_{\min}^* \in \arg\min_{x \in X^*} \sigma_{\hat{x}}^2(x)$, we obtain
$$P\left( G_L + \frac{z_\alpha s_L}{\sqrt{n}} \ge G \right) \ge P\left( \bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*) + \frac{z_\alpha s_L}{\sqrt{n}} \ge G \right) \qquad (4.14)$$
$$= P\left( \frac{\left( \bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*) \right) - G}{\sigma_{\hat{x},L}(x_{\min}^*)} \ge -\frac{z_\alpha}{\sqrt{n}} \cdot \frac{s_L}{\sigma_{\hat{x},L}(x_{\min}^*)} \right) \qquad (4.15)$$
$$\ge P\left( \frac{\left( \bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*) \right) - G}{\sigma_{\hat{x},L}(x_{\min}^*)} \ge -z_\alpha \sqrt{\frac{n-1}{n}} \cdot \frac{s_L}{\sigma_{\hat{x}}(x_{\min}^*)} \right), \qquad (4.16)$$
where in (4.15) we assume $\sigma_{\hat{x},L}^2(x_{\min}^*) > 0$. Note that if $\sigma_{\hat{x},L}^2(x_{\min}^*) = 0$, then $\mathrm{var}\left[ \bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*) \right] = 0$ and hence $\bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*) = E\left[ \bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*) \right] = G$. It follows from (4.14) that (4.13) is again trivial. Inequality (4.16) holds because $\sigma_{\hat{x},L}(x_{\min}^*) \le \sqrt{\frac{n}{n-1}}\, \sigma_{\hat{x}}(x_{\min}^*)$ (see (4.4)). Let
$$D_L = \frac{\left( \bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*) \right) - G}{\sigma_{\hat{x},L}(x_{\min}^*)}, \qquad a_L = \sqrt{\frac{n-1}{n}} \cdot \frac{s_L}{\sigma_{\hat{x}}(x_{\min}^*)},$$
let 0 < ε < 1, and for the moment assume that α ≤ 1/2 so that $z_\alpha \ge 0$. Then (4.15) can be rewritten as
$$P(D_L \ge -z_\alpha a_L) \ge P\left( D_L \ge -(1-\varepsilon) z_\alpha,\ a_L \ge 1-\varepsilon \right) = P\left( D_L \ge -(1-\varepsilon) z_\alpha \right) + P\left( a_L \ge 1-\varepsilon \right) - P\left( \{D_L \ge -(1-\varepsilon) z_\alpha\} \cup \{a_L \ge 1-\varepsilon\} \right). \qquad (4.17)$$
Taking limits, we obtain
$$\liminf_{n \to \infty} P\left( G \le G_L + \frac{z_\alpha s_L}{\sqrt{n}} \right) \ge \Phi\left( (1-\varepsilon) z_\alpha \right),$$
where Φ denotes the distribution function of the standard normal. By part (iv) of Theorem 4.11, the last two terms in (4.17) both converge to 1 and cancel out. Since $\bar{f}_L(\hat{x}) - \bar{f}_L(x_{\min}^*)$ is a sample mean of Latin hypercube random variables, by Theorem 4.6 the first term in (4.17) converges to $\Phi\left( (1-\varepsilon) z_\alpha \right)$. Letting ε shrink to zero gives the desired result, provided that α ≤ 1/2. When α ≥ 1/2, we replace $x_{\min}^*$ with $x_{\max}^* \in \arg\max_{x \in X^*} \sigma_{\hat{x}}^2(x)$ in (4.15) and then use a straightforward variation of the above argument.
We now update the algorithms and results of this section for the case of two
replications of observations.
4.4 Averaged Two-Replication Procedure with Variance Reduction
Our final set of algorithms in this chapter considers A2RP with variance reduction schemes. We also combine variance reduction with the bias reduction technique introduced in Chapter 3, in the hope of further improving the quality of the optimality gap estimators. Section 4.6 examines the effectiveness of this approach.
4.4.1 Antithetic Variates
We state A2RP AV as the following modification of MRP AV. Note that the SV estimator in each replication is calculated as in SRP AV; therefore, the SV estimator of A2RP AV differs from that of MRP AV with two replications.

A2RP AV
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and an even sample size per replication n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on G.
Fix m = 2 and replace Steps 1.3, 2, and 3 of MRP AV by:
1.3. Calculate $G_{A,l}$ and $s_{A,l}^2$.
2. Calculate the optimality gap and sample variance estimators by taking the average:
$$G_A' = \frac{1}{2}\left( G_{A,1} + G_{A,2} \right) \quad \text{and} \quad s_A'^2 = \frac{1}{2}\left( s_{A,1}^2 + s_{A,2}^2 \right).$$
3. Output a one-sided confidence interval on $G_{\hat{x}}$:
$$\left[ 0,\ G_A' + \frac{z_\alpha s_A'}{\sqrt{n}} \right]. \qquad (4.18)$$

Similarly to SRP AV, the CI estimator (4.18) uses $\sqrt{n}$ instead of $\sqrt{2n}$. Statements about the consistency and asymptotic validity of the A2RP AV optimality gap estimators are as follows.
Theorem 4.19. Assume x̂ ∈ X, $\{\tilde{\xi}_l^1, \tilde{\xi}_l^{1'}, \ldots, \tilde{\xi}_l^{n/2}, \tilde{\xi}_l^{n/2'}\}$, l = 1, 2, are two independent AV samples from distribution P, and (A1)–(A3) hold. Fix 0 < α < 1 and consider A2RP AV. Then,
(i) $z_A^* \to z^*$, a.s., as n → ∞;
(ii) $G_A' \to G$, a.s., as n → ∞;
(iii) all limit points of $x_A^*$ lie in X*, a.s.;
(iv) $\sigma_{\hat{x},A}^2(x_{\min,A}^*) \le \liminf_{n \to \infty} s_A'^2 \le \limsup_{n \to \infty} s_A'^2 \le \sigma_{\hat{x},A}^2(x_{\max,A}^*)$, a.s.

Proof. (i) The proof is as for SRP AV.
(ii) Note that $G_A'$ can be represented as $\frac{1}{2}\left( \bar{g}_{A,1}(\hat{x}) + \bar{g}_{A,2}(\hat{x}) \right) - \frac{1}{2}\left( z_{A,1}^* + z_{A,2}^* \right)$. By Theorem 4.3, $\frac{1}{2}\left( \bar{g}_{A,1}(\hat{x}) + \bar{g}_{A,2}(\hat{x}) \right)$ converges to $Ef(\hat{x}, \tilde{\xi})$, a.s. Furthermore, by part (i), $\frac{1}{2}\left( z_{A,1}^* + z_{A,2}^* \right)$ converges to z*, a.s., as n → ∞. We conclude that $G_A' \to Ef(\hat{x}, \tilde{\xi}) - z^* = G$, a.s., as n → ∞.
(iii) The proof is as for SRP AV.
(iv) The proof follows that of SRP AV with minor modifications.
Next, we consider the asymptotic validity of the A2RP AV CI estimator.

Theorem 4.20. Assume x̂ ∈ X, $\{\tilde{\xi}_l^1, \tilde{\xi}_l^{1'}, \ldots, \tilde{\xi}_l^{n/2}, \tilde{\xi}_l^{n/2'}\}$, l = 1, 2, are two independent AV samples from distribution P, and (A1)–(A3) hold. Fix 0 < α < 1 and consider A2RP AV. Then,
$$\liminf_{n \to \infty} P\left( G \le G_A' + \frac{z_\alpha s_A'}{\sqrt{n}} \right) \ge 1 - \alpha. \qquad (4.21)$$

Proof. First, note that if x̂ ∈ X*, then the inequality is satisfied automatically. Suppose now that x̂ ∉ X*. Then $G_A' \ge \frac{1}{2}\left( \bar{g}_{A,1}(\hat{x}) + \bar{g}_{A,2}(\hat{x}) \right) - \frac{1}{2}\left( \bar{g}_{A,1}(x) + \bar{g}_{A,2}(x) \right)$ for all x ∈ X. By the CLT,
$$\sqrt{n}\left( \frac{1}{2}\left( \bar{g}_{A,1}(\hat{x}) + \bar{g}_{A,2}(\hat{x}) \right) - \frac{1}{2}\left( \bar{g}_{A,1}(x_{\min,A}^*) + \bar{g}_{A,2}(x_{\min,A}^*) \right) - G \right)$$
converges in distribution to a normal random variable with mean zero and variance $\sigma_{\hat{x},A}^2(x_{\min,A}^*)$. Also, $\liminf_{n \to \infty} \frac{s_A'}{\sigma_{\hat{x},A}(x_{\min,A}^*)} \ge 1$, a.s., by part (iv) of Theorem 4.19. The rest of the proof is analogous to that for SRP AV.
4.4.2 Latin Hypercube Sampling
A2RP with LHS is as follows:

A2RP LHS
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and a sample size per replication n.
Output: A point estimator, its associated variance estimator, and a (1 − α)²-level confidence interval on G.
Fix m = 2 and replace Steps 1.3, 2, and 3 of MRP LHS by:
1.3. Calculate $G_{L,l}$ and $s_{L,l}^2$.
2. Calculate the optimality gap and sample variance estimators by taking the average:
$$G_L' = \frac{1}{2}\left( G_{L,1} + G_{L,2} \right) \quad \text{and} \quad s_L'^2 = \frac{1}{2}\left( s_{L,1}^2 + s_{L,2}^2 \right).$$
3. Output a one-sided confidence interval on $G_{\hat{x}}$:
$$\left[ 0,\ G_L' + \frac{z_\alpha s_L'}{\sqrt{n}} \right]. \qquad (4.22)$$

Due to the need to separately consider the two independent LHS replications in the upcoming theoretical results, the CI estimator (4.22) is defined using $\sqrt{n}$ instead of $\sqrt{2n}$. We now present results showing the consistency and asymptotic validity of the A2RP LHS optimality gap estimators.
Theorem 4.23. Assume x̂ ∈ X, $\{\tilde{\xi}_l^1, \ldots, \tilde{\xi}_l^n\}$, l = 1, 2, are LHS samples from distribution P, and (A1)–(A3) hold. Fix 0 < α < 1 and consider A2RP LHS. Then,
(i) $z_L^* \to z^*$, a.s., as n → ∞;
(ii) $G_L' \to G$, a.s., as n → ∞;
(iii) all limit points of $x_L^*$ lie in X*, a.s.;
(iv) $\sigma_{\hat{x}}^2(x_{\min}^*) \le \liminf_{n \to \infty} s_L'^2 \le \limsup_{n \to \infty} s_L'^2 \le \sigma_{\hat{x}}^2(x_{\max}^*)$, a.s.

Proof. (i) The proof is as for SRP LHS.
(ii) Note that $G_L'$ can be represented as $\frac{1}{2}\left( \bar{f}_{L,1}(\hat{x}) + \bar{f}_{L,2}(\hat{x}) \right) - \frac{1}{2}\left( z_{L,1}^* + z_{L,2}^* \right)$. By Theorem 4.5, $\frac{1}{2}\left( \bar{f}_{L,1}(\hat{x}) + \bar{f}_{L,2}(\hat{x}) \right)$ converges to $Ef(\hat{x}, \tilde{\xi})$, a.s. Furthermore, by part (i), $\frac{1}{2}\left( z_{L,1}^* + z_{L,2}^* \right)$ converges to z*, a.s., as n → ∞. We conclude that $G_L' \to Ef(\hat{x}, \tilde{\xi}) - z^* = G$, a.s., as n → ∞.
(iii) The proof is as for SRP LHS.
(iv) The proof follows that of SRP LHS with minor modifications.
The proof that the A2RP LHS CI estimator is asymptotically valid requires some significant changes compared to SRP LHS. While two replications of i.i.d. or AV observations of size n can be equivalently represented as one replication of i.i.d. or AV observations of size 2n, the same is not true for LHS. Observe that
$$P(A/2 + B/2 \ge c) \ge P(A/2 \ge c/2)\, P(B/2 \ge c/2) = P(A \ge c)\, P(B \ge c)$$
for two independent random variables A and B and a constant c. We make use of this observation to divide the group of all observations into two LHS samples so that we can apply Theorem 4.6, the CLT for LHS, to each sample. Note that the confidence level becomes (1 − α)² rather than 1 − α.
Theorem 4.24. Assume x̂ ∈ X, $\{\tilde{\xi}_l^1, \ldots, \tilde{\xi}_l^n\}$, l = 1, 2, are two independent LHS samples from distribution P, and (A1), (A3), and (A5) hold. Fix 0 < α < 1 and consider A2RP LHS. Then,
$$\liminf_{n \to \infty} P\left( G \le G_L' + \frac{z_\alpha s_L'}{\sqrt{n}} \right) \ge (1 - \alpha)^2. \qquad (4.25)$$

Proof. Let $\sigma_{\hat{x},L,l}^2(x) = \mathrm{var}\left( \frac{1}{n}\sum_{i=1}^{n} \left[ f(\hat{x}, \tilde{\xi}_l^i) - f(x, \tilde{\xi}_l^i) \right] \right)$ for l = 1, 2, where the samples are generated using LHS. When x̂ ∈ X*, inequality (4.25) is trivial. Suppose that x̂ ∉ X*. Then $G_L' \ge \frac{1}{2}\left( \bar{f}_{L,1}(\hat{x}) + \bar{f}_{L,2}(\hat{x}) \right) - \frac{1}{2}\left( \bar{f}_{L,1}(x) + \bar{f}_{L,2}(x) \right)$, for all x ∈ X. Replacing x by $x_{\min}^* \in \arg\min_{x \in X^*} \sigma_{\hat{x}}^2(x)$, we obtain
$$P\left( G_L' + \frac{z_\alpha s_L'}{\sqrt{n}} \ge G \right) \ge P\left( \frac{1}{2}\left( \bar{f}_{L,1}(\hat{x}) - \bar{f}_{L,1}(x_{\min}^*) \right) + \frac{1}{2}\left( \bar{f}_{L,2}(\hat{x}) - \bar{f}_{L,2}(x_{\min}^*) \right) + \frac{z_\alpha s_L'}{\sqrt{n}} \ge G \right) \qquad (4.26)$$
$$\ge P\left( \frac{1}{2}\left( \bar{f}_{L,1}(\hat{x}) - \bar{f}_{L,1}(x_{\min}^*) \right) + \frac{z_\alpha s_L'}{2\sqrt{n}} \ge \frac{G}{2} \right) \cdot P\left( \frac{1}{2}\left( \bar{f}_{L,2}(\hat{x}) - \bar{f}_{L,2}(x_{\min}^*) \right) + \frac{z_\alpha s_L'}{2\sqrt{n}} \ge \frac{G}{2} \right)$$
$$= P\left( \frac{\left( \bar{f}_{L,1}(\hat{x}) - \bar{f}_{L,1}(x_{\min}^*) \right) - G}{\sigma_{\hat{x},L,1}(x_{\min}^*)} \ge -\frac{z_\alpha}{\sqrt{n}} \cdot \frac{s_L'}{\sigma_{\hat{x},L,1}(x_{\min}^*)} \right) \cdot P\left( \frac{\left( \bar{f}_{L,2}(\hat{x}) - \bar{f}_{L,2}(x_{\min}^*) \right) - G}{\sigma_{\hat{x},L,2}(x_{\min}^*)} \ge -\frac{z_\alpha}{\sqrt{n}} \cdot \frac{s_L'}{\sigma_{\hat{x},L,2}(x_{\min}^*)} \right), \qquad (4.27)$$
where in (4.27) we assume $\sigma_{\hat{x},L,l}^2(x_{\min}^*) > 0$ for l = 1, 2. We will return to the cases where this does not hold below.

By (4.4),
$$P\left( \frac{\left( \bar{f}_{L,l}(\hat{x}) - \bar{f}_{L,l}(x_{\min}^*) \right) - G}{\sigma_{\hat{x},L,l}(x_{\min}^*)} \ge -\frac{z_\alpha}{\sqrt{n}} \cdot \frac{s_L'}{\sigma_{\hat{x},L,l}(x_{\min}^*)} \right) \ge P\left( \frac{\left( \bar{f}_{L,l}(\hat{x}) - \bar{f}_{L,l}(x_{\min}^*) \right) - G}{\sigma_{\hat{x},L,l}(x_{\min}^*)} \ge -z_\alpha \sqrt{\frac{n-1}{n}} \cdot \frac{s_L'}{\sigma_{\hat{x}}(x_{\min}^*)} \right),$$
for l = 1, 2. The remainder of the proof follows along the same lines as for SRP LHS, except that we consider each independent LHS sample separately and multiply the individual probabilities at the end.

Note that if $\sigma_{\hat{x},L,l}^2(x_{\min}^*) = 0$, then $\mathrm{var}\left[ \bar{f}_{L,l}(\hat{x}) - \bar{f}_{L,l}(x_{\min}^*) \right] = 0$ and hence $\bar{f}_{L,l}(\hat{x}) - \bar{f}_{L,l}(x_{\min}^*) = E\left[ \bar{f}_{L,l}(\hat{x}) - \bar{f}_{L,l}(x_{\min}^*) \right] = G$. If $\sigma_{\hat{x},L,l}^2(x_{\min}^*) = 0$ for both l = 1, 2, it follows from (4.26) that (4.25) is trivial. If just one of the $\sigma_{\hat{x},L,l}^2(x_{\min}^*)$ equals 0, then
$$\liminf_{n \to \infty} P\left( G \le G_L' + \frac{z_\alpha s_L'}{\sqrt{n}} \right) \ge 1 - \alpha \ge (1 - \alpha)^2.$$
4.4.3 Antithetic Variates with Bias Reduction
Reverting to the notation used in Chapter 3 and restricting our analysis to the problem class described in Section 3.2, which is a subset of the class of problems considered thus far in this chapter, we now embed the bias reduction technique in A2RP AV. However, to facilitate comparison with the other algorithms described in this chapter, we consider an AV sample $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^n, \tilde{\xi}^{n'}\}$ of size 2n (rather than n), and define the empirical measure on all observations to be $P_{2n}(\cdot) = \frac{1}{2n}\sum_{i=1}^{n}\left[ \delta_{\tilde{\xi}^i}(\cdot) + \delta_{\tilde{\xi}^{i'}}(\cdot) \right]$.

In order to produce two replications using AV sampling, we partition pairs of observations $\{\tilde{\xi}^i, \tilde{\xi}^{i'}\}$ instead of individual points. Consider a partition of these n pairs given by index sets S1 and S2, where
(i) S1, S2 ⊂ {1, ..., n} and S2 = (S1)^C,
(ii) |S1| = |S2| = n/2, and
(iii) each $\tilde{\xi}^i$ and $\tilde{\xi}^{i'}$, i ∈ S1 ∪ S2, receives probability mass 1/n.
Define $P_{S_l}(\cdot) = \frac{1}{n}\sum_{i \in S_l}\left[ \delta_{\tilde{\xi}^i}(\cdot) + \delta_{\tilde{\xi}^{i'}}(\cdot) \right]$ for l = 1, 2. The Kantorovich metric is therefore
$$\hat{\mu}_d(P_{2n}, P_{S_l}) = \min_{\eta} \left\{ \sum_{i \in S_2}\sum_{j \in S_1} \left\| (\xi^i, \xi^{i'}) - (\xi^j, \xi^{j'}) \right\| \eta_{ij} \;:\; \sum_{i \in S_2} \eta_{ij} = \frac{2}{n},\ \forall j;\ \sum_{j \in S_1} \eta_{ij} = \frac{2}{n},\ \forall i;\ \eta_{ij} \ge 0,\ \forall i, j \right\}$$
for l = 1, 2. Observe that we now look at the distance between the AV pairs of observations by considering a random vector composed of both ξ̃ and ξ̃'. We note that this might not be the only method of partitioning the observations, but it allows us to use independent pairs of observations, which in turn allows us to push through the theoretical results below. Thus, we wish to find an index set of size n/2 that solves the problem:
$$\min \left\{ \hat{\mu}_d(P_{2n}, P_{S_1}) \;:\; S_1 \subset \{1, \ldots, n\},\ |S_1| = n/2 \right\}. \qquad (\mathrm{PM\ AV})$$
We denote an optimal solution to (PM AV) by $AJ_1$ and let $AJ_2 = (AJ_1)^C$. The resulting probability measures are denoted $P_{AJ_l}$, l = 1, 2, where $P_{AJ_l} = \frac{1}{n}\sum_{i \in AJ_l}\left[ \delta_{\tilde{\xi}^i} + \delta_{\tilde{\xi}^{i'}} \right]$.
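As an illustration of how such a partition of AV pairs might be computed, the sketch below couples the n antithetic pairs via a minimum-weight perfect matching on their concatenated vectors and then sends one pair of each matched couple to each group, mirroring the matching idea used for (PM) in Chapter 3. This is only one plausible heuristic reading of the construction, not the exact procedure used in the experiments; the Euclidean distance choice and the networkx-based matching are assumptions.

```python
import numpy as np
import networkx as nx

def partition_av_pairs(xi, xi_anti):
    """Split n antithetic pairs into two index sets of size n/2.

    xi, xi_anti: arrays of shape (n, d) holding each AV pair (xi^i, xi^i').
    Returns (S1, S2) as lists of pair indices."""
    pairs = np.hstack([xi, xi_anti])             # treat (xi^i, xi^i') as one vector
    n = pairs.shape[0]
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(pairs[i] - pairs[j])
            # negate so that a *max*-weight matching yields a *min*-weight one
            G.add_edge(i, j, weight=-dist)
    matching = nx.max_weight_matching(G, maxcardinality=True)
    S1, S2 = [], []
    for i, j in matching:                        # split each matched couple
        S1.append(i)
        S2.append(j)
    return S1, S2

# Tiny illustrative run with 6 two-dimensional AV pairs:
rng = np.random.default_rng(3)
u = rng.random((6, 2))
print(partition_av_pairs(u, 1.0 - u))
```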
The algorithm is as follows:

A2RP AV-B
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and an even sample size per replication n.
Output: A point estimator, its associated variance estimator, and a (1 − α)-level confidence interval on G.
1. Sample observations $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^n, \tilde{\xi}^{n'}\}$ from P using AV.
2. Generate $AJ_1$ and $AJ_2$ by solving (PM AV), and produce $P_{AJ_1}$ and $P_{AJ_2}$.
3. For l = 1, 2:
3.1. Solve $(\mathrm{SP}_{AJ_l})$ to obtain $x_{AJ_l}^*$ and $z_{AJ_l}^*$.
3.2. Calculate:
$$G_{AJ_l} = \frac{2}{n}\sum_{i \in AJ_l} \frac{1}{2}\left( f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) - f(x_{AJ_l}^*, \tilde{\xi}^i) - f(x_{AJ_l}^*, \tilde{\xi}^{i'}) \right)$$
and
$$s_{AJ_l}^2 = \frac{1}{n/2 - 1}\sum_{i \in AJ_l} \left[ \frac{1}{2}\left( f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) - f(x_{AJ_l}^*, \tilde{\xi}^i) - f(x_{AJ_l}^*, \tilde{\xi}^{i'}) \right) - \left( \bar{g}_{AJ_l}(\hat{x}) - \bar{g}_{AJ_l}(x_{AJ_l}^*) \right) \right]^2,$$
where $\bar{g}_{AJ_l}(x) = \frac{2}{n}\sum_{i \in AJ_l} \frac{1}{2}\left( f(x, \tilde{\xi}^i) + f(x, \tilde{\xi}^{i'}) \right)$.
4. Calculate the optimality gap and sample variance estimators by taking the average: $G_{AJ} = \frac{1}{2}\left( G_{AJ_1} + G_{AJ_2} \right)$ and $s_{AJ}^2 = \frac{1}{2}\left( s_{AJ_1}^2 + s_{AJ_2}^2 \right)$.
5. Output a one-sided confidence interval on G:
$$\left[ 0,\ G_{AJ} + \frac{z_\alpha s_{AJ}}{\sqrt{n}} \right]. \qquad (4.28)$$
We now show that the estimators $G_{AJ}$ and $s_{AJ}^2$ of A2RP AV-B are strongly consistent and that A2RP AV-B provides an asymptotically valid CI on the optimality gap. These results require some minor adjustments to the proofs in Section 3.6. We first update Theorem 3.9 to demonstrate the weak convergence of the empirical probability measures $P_{AJ_1}$ and $P_{AJ_2}$ to P, a.s.

Theorem 4.29. Assume that $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^n, \tilde{\xi}^{n'}\}$ is an AV sample from distribution P and (A4) holds. Then the probability measures on the partitioned sets obtained by solving (PM AV), $P_{AJ_1}$ and $P_{AJ_2}$, converge weakly to P, the original distribution of ξ̃, a.s.

Proof. Since $\hat{\mu}_d$ is a metric, by the triangle inequality we have that
$$\hat{\mu}_d(P, P_{AJ_1}) \le \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P_{2n}, P_{AJ_1}).$$
Also, $\hat{\mu}_d(P_{2n}, P_{AJ_1}) \le \hat{\mu}_d(P_{2n}, P_{A_1})$, where $P_{A_1}$ is the empirical measure on the first n/2 antithetic pairs. This is because the partitioning of the observations via $AJ_1$ minimizes the Kantorovich metric; hence, partitioning into two groups, each composed of antithetic pairs, provides an upper bound. Therefore,
$$\hat{\mu}_d(P, P_{AJ_1}) \le \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P_{2n}, P_{A_1}),$$
and by applying the triangle inequality again, we obtain
$$\hat{\mu}_d(P, P_{AJ_1}) \le \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P, P_{A_1}) = 2\hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P, P_{A_1}).$$
Noting that we can apply the SLLN to an AV sample (see Theorem 4.3), the rest of the proof proceeds as in Theorem 3.9, with minor adjustments of notation.
We now show the consistency of the estimators $G_{AJ}$ and $s_{AJ}^2$.

Theorem 4.30. Assume $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^n, \tilde{\xi}^{n'}\}$ is an AV sample from distribution P, and (A3) and (A4) hold. Fix 0 < α < 1. Let n be even and consider A2RP AV-B. Then,
(i) all limit points of $x_{AJ_l}^*$ lie in X*, a.s., for l = 1, 2;
(ii) $z_{AJ_l}^* \to z^*$, a.s., as n → ∞, for l = 1, 2;
(iii) $G_{AJ} \to G$, a.s., as n → ∞;
(iv) $\sigma_{\hat{x},A}^2(x_{\min,A}^*) \le \liminf_{n \to \infty} s_{AJ}^2 \le \limsup_{n \to \infty} s_{AJ}^2 \le \sigma_{\hat{x},A}^2(x_{\max,A}^*)$, a.s.

Proof. The proofs of parts (i)–(iv) are as for A2RP-B, with minor modifications to notation.
Next, we show the asymptotic validity of the CI estimator produced by A2RP AV-B, given in (4.28).

Theorem 4.31. Assume $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^n, \tilde{\xi}^{n'}\}$ is an AV sample from distribution P, and (A3) and (A4) hold. Fix 0 < α < 1. Let n be even and consider A2RP AV-B. Then,
$$\liminf_{n \to \infty} P\left( G \le G_{AJ} + \frac{z_\alpha s_{AJ}}{\sqrt{n}} \right) \ge 1 - \alpha.$$

Proof. The proof is analogous to that for A2RP AV.
4.4.4 Latin Hypercube Sampling with Bias Reduction
We conclude this section with our final algorithm, A2RP LHS-B. As with A2RP AV-B, we assume the setup of Section 3.2. Note that in this setting, assumption (A5) is automatically satisfied (see the discussion in Section 3.2), so we can apply the CLT result for LHS.

Some care is required when choosing how to produce the LHS sample. In Chapter 3, for A2RP-B, we sampled one large group of i.i.d. observations, which was then split into two groups using the minimum-weight perfect matching problem (PM). However, this approach is not applicable to our current setting. Note that a random partition of the original set of i.i.d. observations results in two i.i.d. samples of half the size; however, a random subset of an LHS sample is not itself an LHS sample, which is required by our theoretical results. See, for instance, the proof of weak convergence of the empirical measures on each group of partitioned observations in Theorem 4.33 below. Instead, we sample two independent LHS replications of observations $\{\tilde{\xi}_1^1, \ldots, \tilde{\xi}_1^n\}$ and $\{\tilde{\xi}_2^1, \ldots, \tilde{\xi}_2^n\}$ and combine them into a larger set of 2n observations. For ease of notation, we relabel these observations as $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{2n}\}$. We then partition the observations via the perfect matching problem (PM), as in Chapter 3, and refer to the associated index sets as $LJ_1$ and $LJ_2$.

The algorithm is as follows:
A2RP LHS-B
Input: A candidate solution x̂ ∈ X, a desired value of α ∈ (0, 1), and a sample size per replication n.
Output: A point estimator, its associated variance estimator, and a (1 − α)²-level confidence interval on G.
1. Sample two replications of Latin hypercube observations of size n from P: $\{\tilde{\xi}_1^1, \ldots, \tilde{\xi}_1^n\}$ and $\{\tilde{\xi}_2^1, \ldots, \tilde{\xi}_2^n\}$. Combine the observations and relabel them as $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{2n}\}$.
2. Generate $LJ_1$ and $LJ_2$ by solving (PM) using all the observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{2n}\}$, and produce $P_{LJ_1}$ and $P_{LJ_2}$.
3. For l = 1, 2:
3.1. Solve $(\mathrm{SP}_{LJ_l})$ to obtain $x_{LJ_l}^*$.
3.2. Calculate:
$$G_{LJ_l} = \frac{1}{n}\sum_{i \in LJ_l}\left[ f(\hat{x}, \tilde{\xi}^i) - f(x_{LJ_l}^*, \tilde{\xi}^i) \right]$$
and
$$s_{LJ_l}^2 = \frac{1}{n - 1}\sum_{i \in LJ_l}\left[ \left( f(\hat{x}, \tilde{\xi}^i) - f(x_{LJ_l}^*, \tilde{\xi}^i) \right) - \left( \bar{f}_{LJ_l}(\hat{x}) - \bar{f}_{LJ_l}(x_{LJ_l}^*) \right) \right]^2,$$
where $\bar{f}_{LJ_l}(x) = \frac{1}{n}\sum_{i \in LJ_l} f(x, \tilde{\xi}^i)$.
4. Calculate the optimality gap and sample variance estimators by taking the average: $G_{LJ} = \frac{1}{2}\left( G_{LJ_1} + G_{LJ_2} \right)$ and $s_{LJ}^2 = \frac{1}{2}\left( s_{LJ_1}^2 + s_{LJ_2}^2 \right)$.
5. Output a one-sided confidence interval on G:
$$\left[ 0,\ G_{LJ} + \frac{t_{n-1,\alpha}\, s_{LJ}}{\sqrt{n}} \right]. \qquad (4.32)$$
Theorem 4.33. Assume that $\{\tilde{\xi}_1^1, \ldots, \tilde{\xi}_1^n\}$ and $\{\tilde{\xi}_2^1, \ldots, \tilde{\xi}_2^n\}$ are two independent LHS samples from distribution P and (A4) holds. Then the probability measures on the partitioned sets obtained by solving (PM), $P_{LJ_1}$ and $P_{LJ_2}$, converge weakly to P, the original distribution of ξ̃, a.s.

Proof. Since $\hat{\mu}_d$ is a metric, by the triangle inequality we have that
$$\hat{\mu}_d(P, P_{LJ_1}) \le \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P_{2n}, P_{LJ_1}).$$
Also, $\hat{\mu}_d(P_{2n}, P_{LJ_1}) \le \hat{\mu}_d(P_{2n}, P_{L,1})$, where $P_{L,1}$ is the empirical measure on the first Latin hypercube sample. This is because the partitioning of the observations via $LJ_1$ minimizes the Kantorovich metric; hence, partitioning into the two original Latin hypercube samples provides an upper bound. Therefore,
$$\hat{\mu}_d(P, P_{LJ_1}) \le \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P_{2n}, P_{L,1}),$$
and by applying the triangle inequality again, we obtain
$$\hat{\mu}_d(P, P_{LJ_1}) \le \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P, P_{L,1}) = 2\hat{\mu}_d(P, P_{2n}) + \hat{\mu}_d(P, P_{L,1}).$$
Noting that we can apply the SLLN to an LHS sample (see Theorem 4.5), the rest of the proof proceeds as in Theorem 3.9, with minor adjustments of notation.
The following consistency results hold for the A2RP LHS-B estimators.

Theorem 4.34. Assume $\{\tilde{\xi}_1^1, \ldots, \tilde{\xi}_1^n\}$ and $\{\tilde{\xi}_2^1, \ldots, \tilde{\xi}_2^n\}$ are two independent LHS samples from distribution P, and (A3) and (A4) hold. Fix 0 < α < 1. Let n be even and consider A2RP LHS-B. Then,
(i) all limit points of $x_{LJ_l}^*$ lie in X*, a.s., for l = 1, 2;
(ii) $z_{LJ_l}^* \to z^*$, a.s., as n → ∞, for l = 1, 2;
(iii) $G_{LJ} \to G$, a.s., as n → ∞;
(iv) $\sigma_{\hat{x}}^2(x_{\min}^*) \le \liminf_{n \to \infty} s_{LJ}^2 \le \limsup_{n \to \infty} s_{LJ}^2 \le \sigma_{\hat{x}}^2(x_{\max}^*)$, a.s.

Proof. Noting that the SLLN holds for LHS (see Theorem 4.5), the proofs of all parts are as in A2RP-B, with appropriate adjustments of notation.
Next, we show the asymptotic validity of the CI estimator produced by A2RP LHS-B, given in (4.32). As in Section 4.4.2, some modifications to the proof are required compared to Chapter 3 in order to apply the CLT for LHS (Theorem 4.6).

Theorem 4.35. Assume $\{\tilde{\xi}_1^1, \ldots, \tilde{\xi}_1^n\}$ and $\{\tilde{\xi}_2^1, \ldots, \tilde{\xi}_2^n\}$ are two independent LHS samples from distribution P, and (A3)–(A5) hold. Fix 0 < α < 1. Let n be even and consider A2RP LHS-B. Then,
$$\liminf_{n \to \infty} P\left( G \le G_{LJ} + \frac{z_\alpha s_{LJ}}{\sqrt{n}} \right) \ge (1 - \alpha)^2. \qquad (4.36)$$

Proof. When x̂ ∈ X*, inequality (4.36) is trivial. Suppose now that x̂ ∉ X*. Then
$$G_{LJ} = \frac{1}{2}\left( \bar{f}_{LJ_1}(\hat{x}) + \bar{f}_{LJ_2}(\hat{x}) \right) - \frac{1}{2}\left( \bar{f}_{LJ_1}(x_{LJ_1}^*) + \bar{f}_{LJ_2}(x_{LJ_2}^*) \right) \ge \frac{1}{2}\left( \bar{f}_{LJ_1}(\hat{x}) + \bar{f}_{LJ_2}(\hat{x}) \right) - \frac{1}{2}\left( \bar{f}_{LJ_1}(x) + \bar{f}_{LJ_2}(x) \right) = \frac{1}{2}\left( \bar{f}_{L,1}(\hat{x}) + \bar{f}_{L,2}(\hat{x}) \right) - \frac{1}{2}\left( \bar{f}_{L,1}(x) + \bar{f}_{L,2}(x) \right)$$
for all x ∈ X, where $\bar{f}_{L,l}(x) = \frac{1}{n}\sum_{i=1}^{n} f(x, \tilde{\xi}_l^i)$, for l = 1, 2. The optimality gap estimator $G_{LJ}$ is now bounded below in terms of the two original LHS samples prior to bias reduction. The rest of the proof is analogous to that for A2RP LHS.
4.5 Summary of Key Differences in Algorithms
Table 4.1 summarizes the main differences in the algorithms for assessing solution quality presented thus far. Column 'Algorithm' lists the procedure and column 'Sampling' specifies the sampling scheme used, where "BR" indicates that the bias reduction technique is applied after the initial sampling. Column 'n' indicates whether the sample size per replication must be even, and the number of replications is specified in column 'm'. The differences in the sample variance estimator are highlighted in column 'SV'.

Algorithm     Sampling   n      m    SV             SE
MRP           IID               30   standard       √m
MRP AV        AV         even   30   standard       √m
MRP LHS       LHS               30   standard       √m
SRP           IID               1    SRP            √n
SRP AV        AV         even   1    SRP            √(n/2)
SRP LHS       LHS               1    SRP            √n
A2RP          IID               2    average SRPs   √(2n)
A2RP AV       AV         even   2    average SRPs   √n
A2RP LHS      LHS               2    average SRPs   √n
A2RP-B        IID, BR           2    average SRPs   √(2n)
A2RP AV-B     AV, BR     even   2    average SRPs   √n
A2RP LHS-B    LHS, BR           2    average SRPs   √n

Table 4.1: Summary of the key differences in the algorithms for assessing solution quality

The term "standard" refers to the usual sample variance of a number of optimality gap estimators, "SRP" denotes the SRP sample variance defined in (2.4), and "average SRPs" refers to the A2RP sample variance estimator, which is obtained by averaging two SRP sample variance estimators. The final column, 'SE', indicates the term used in the denominator when calculating the sample error.
4.6 Computational Experiments
In previous sections of this chapter, we proved asymptotic results regarding the consistency and validity of estimators produced using variance reduction. We now compare the small-sample behavior of MRP, SRP, and A2RP using AV and LHS to that of the same algorithms using i.i.d. sampling. We also analyze the bias reduction technique of Chapter 3 in conjunction with AV and LHS in the case of A2RP. We first describe the large-scale test problems used in addition to NV, APL1P, GBD, and PGP2, which appeared in Section 3.7. The experimental setup, which is largely the same as in Section 3.7.2, is briefly discussed. We then present the results of our experiments and conclude by highlighting insights gained and providing guidelines for the use of algorithms for optimality gap estimation with variance reduction.
4.6.1 Test Problems
To compare the effects of the variance reduction schemes described in the previous sections, we consider four large-scale test problems from the literature, DB1, SSN, 20TERM, and STORM, in addition to the four smaller test problems outlined in Section 3.7.1. Characteristics of these problems are summarized in Table 4.2. 20TERM is a vehicle allocation problem for a motor freight carrier with 40 independent stochastic parameters and 1.059 × 10^12 scenarios (Hackney & Infanger, 1994; Mak et al., 1999). DB1, the vehicle allocation model in a single-commodity network of Donohue & Birge (1995) and Mak et al. (1999), has 46 stochastic parameters and 4.5 × 10^25 scenarios. SSN, as described in Example 1.2 and Section 3.7.4, has 86 independent stochastic parameters and 10^70 scenarios. STORM, described in (Mulvey & Ruszczyński, 1995), is an air freight scheduling model with 5^118 scenarios generated by 118 independent stochastic parameters.

Problem    # of 1st Stage Variables    # of 2nd Stage Variables    # of Stochastic Parameters    # of Scenarios
20TERM     63                          764                         40                            1.059 × 10^12
DB1        5                           102                         46                            4.5 × 10^25
SSN        89                          706                         86                            10^70
STORM      121                         1259                        118                           5^118

Table 4.2: Large test problem characteristics

All four test problems listed in Table 4.2 satisfy the required assumptions, but they have not yet been solved exactly due to their size. Following the method in (Freimer et al., 2012), we estimated the optimal value of each problem by solving (SPn) using LHS and a sample size of n = 50,000. A suboptimal candidate solution x̂ was obtained for each test problem by solving a separate sampling problem of size n = 500 using i.i.d. sampling, and Ef(x̂, ξ̃) was estimated by calculating the sample mean of f(x̂, ξ̃)
with the same 50,000 realizations used to estimate the optimal value. The estimates of z* and Ef(x̂, ξ̃) were combined to give an estimate of the optimality gap G for each candidate solution. Table 4.3 summarizes the candidate solutions used in our computational experiments, along with the estimates of the optimal value and the optimality gap of each candidate solution.

Problem    Suboptimal x̂    Estimate of z*    Estimate of G
20TERM     (245.20, 19.37, 49.78, 3.89, 14.91, 37.53, 20.70, 1.24, 0.95, 10.27, 9.45, 33.53, 18.29, 13.65, 16, 10.87, 18, 14.25, 22.15, 12.04, 27.95, 12.70, 9.37, 23.53, 0, 0, 18.03, 9.20, 285.70, 0, 0.02, 1.45, 18.53, 0, 0, 0, 1.37, 0, 0.25, 0.15, 0, 19.70, 225.80, 5.63, 5.22, 9.11, 20.09, 5.47, 5.30, 17.76, 16.05, 16.73, 24.55, 6.47, 21.71, 17.35, 21, 19.13, 28, 18.75, 26.85, 16.96, 8.05)    254,311.30    9.24
DB1        (11, 14, 8, 11, 7)    -17,717.08    1.08
SSN        (0, 0.15, 36.84, 0, 17.96, 0.20, 0, 0, 9.42, 0, 1.81, 25.81, 65.96, 0, 0, 29.79, 2.72, 0, 17.89, 2.55, 14.13, 4.25, 21.96, 19.76, 0, 0, 2.58, 41.67, 15.80, 34.41, 28.76, 0, 19.50, 90.99, 0, 84.84, 0.98, 0, 18.81, 7.68, 19.89, 0, 34.06, 42.21, 5.42, 0, 4.25, 10.65, 0, 0, 1.99, 0, 9.42, 24.12, 13.70, 3.68, 0, 0.99, 3.74, 13.26, 0, 0, 7.91, 32.86, 0, 127.13, 0, 0, 0, 0.12, 0, 0, 0, 0, 2.68, 0.82, 2.12, 2.54, 3.42, 0, 0.74, 3.42, 1.11, 0.74, 2.70, 1.12, 1.23, 2.68, 0)    9.88    1.19
STORM      (0, 0, 0, 0, 0, 0, 0, 0, 0.66, 3.34, 0, 0, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 8.27, 8, 0, 0, 0, 0, 0, 4.08, 7.92, 0, 0.99, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 28, 0, 4, 8, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.47, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 4, 2.01, 0, 0, 0, 1.46, 0, 0, 0, 0, 0, 0, 0, 0, 0)    15,498,634.92    29.48

Table 4.3: Large test problem suboptimal candidate solutions
4.6.2 Experimental Setup
We would primarily like to determine the reduction in the variance of the optimality gap estimator as well as the reduction in the SV estimator for finite sample sizes n. Since LHS, and to a lesser extent AV, samples more evenly from P than i.i.d. sampling, bias may also be reduced, and so we consider the percentage reduction in the bias of the optimality gap estimator. The combined effects on bias and variance are measured by the percentage reduction in the MSE of the optimality gap estimator. We also examine the CI estimator and the associated coverage probability. Our experiments were conducted following the same computational setup as that used in Chapter 3, so we highlight only the differences in the setup.
We set α = 0.10 for all algorithms except A2RP LHS and A2RP LHS-B. Recall that in those cases, the sample error is calculated using √n rather than √(2n), and the asymptotic lower bound on coverage is (1 − α)². Setting α = 0.051, so that (1 − α)² = 0.90 and the asymptotic lower bound on coverage matches that of the other sampling schemes, would lead to a significant increase in the sample error and CI width due to the increase in the value of $z_\alpha$. Instead, we set α = 0.182 so that the sample error and CI width are scaled the same way as for SRP LHS, since $z_{0.182}/\sqrt{n} = z_{0.10}/\sqrt{2n}$. Note that the asymptotic lower bound on coverage for A2RP LHS and A2RP LHS-B is then (1 − 0.182)² = 0.670; however, the empirical results in the next section indicate that the coverage remains above 0.90 and is often higher.
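The choice α = 0.182 can be checked numerically: the upper standard normal quantile at α = 0.182 approximately equals the α = 0.10 quantile divided by √2, so dividing by √n instead of √(2n) leaves the sample error essentially unchanged. A quick sketch of this check (the tolerance is an arbitrary choice):

```python
from math import sqrt
from scipy.stats import norm

z_010 = norm.ppf(1 - 0.10)    # upper 0.10 quantile, about 1.2816
z_0182 = norm.ppf(1 - 0.182)  # upper 0.182 quantile, about 0.908
print(z_0182, z_010 / sqrt(2))
assert abs(z_0182 - z_010 / sqrt(2)) < 5e-3   # agreement to roughly two decimals
```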
For the small-scale problems, we used the sample sizes n ∈ {50, 100, 200, ..., 1000} and set m and M to 10 and 100, respectively. For consistency, the computations for A2RP and A2RP-B were rerun with this smaller number of independent runs. Note also that the sample size per replication has been doubled compared to Chapter 3. Due to the increase in computation time, we set m = 4 and M = 25 for the large-scale problems and considered the sample sizes n ∈ {100, 500, 1000}. In this case, our estimates of the coverage probability of the CI estimator were obtained using the estimates of the optimality gap described in the previous section.
4.6.3 Results of Experiments
We now examine the experimental results for all algorithms and sampling schemes. We note that when presenting CIs on coverage probabilities for each algorithm, a margin of error smaller than 0.001 is reported as 0.000. In this chapter, we consider equal values of n, the sample size per replication, for MRP, A2RP, and SRP, and so the bias, variance, and MSE of the optimality gap estimator are identical in all three cases. Since our estimates are more accurate with a larger total sample size, we present the percentage reductions in bias, variance, and MSE when using AV and LHS only for MRP.
Multiple Replications Procedure

We begin by comparing MRP, MRP AV, and MRP LHS. Figure 4.1 presents the percentage reductions in the bias of the optimality gap estimator for MRP AV and MRP LHS compared to MRP for all test problems at a suboptimal solution (recall that the bias is independent of the candidate solution used). For the small-scale problems, with the exception of PGP2, AV provides a moderate reduction in the bias, while LHS eliminates almost all bias. AV actually slightly increases the bias for PGP2 at some sample sizes, whereas the reduction in bias under LHS is slight. Both sampling schemes reduce bias for all large-scale problems, although the effect is lessened.
Figure 4.2 summarizes the effect of AV and LHS on the variance of the optimality gap estimator, for all test problems and candidate solutions. AV reduces variance substantially for NV, APL1P, and GBD at the optimal solution, and has a moderate effect at the suboptimal solution. PGP2 demonstrates little reduction and sometimes an increase. The large-scale problems show varied reductions that lessen with increasing sample size. With the exception of PGP2, LHS removes almost all variance for the small-scale problems at both optimal and suboptimal solutions, and has a moderate to large effect on the large-scale problems. Note that in the case of MRP, a decrease in the variance of the optimality gap estimator corresponds to a decrease in the SV estimator.

Figure 4.3 shows the percentage reductions in the MSE of the optimality gap estimator for all problems and candidate solutions. The results, not surprisingly, are very similar to those in Figure 4.2.
We note that the structure of the optimality gap estimator affects the variance
results, and therefore the MSE results, in a way that would not be observed if we
were solely considering the approximate optimal values $z^*_n$, $z^*_A$, and $z^*_L$. The main
issue is that the optimality gap estimator is the difference between $Ef(\hat{x}, \tilde{\xi})$ and the
estimator of $z^*$ (via $z^*_n$, $z^*_A$, and $z^*_L$). Note that the candidate solution $\hat{x}$ also affects
the overall variance. We illustrate this effect with AV, and note that the same trends
can be present in LHS occasionally as well. Recall that

G_A = \frac{2}{n} \sum_{i=1}^{n/2} \frac{1}{2}\left[ f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) \right] - z^*_A,   (4.37)

and so

\mathrm{var}(G_A) = \mathrm{var}\left( \frac{2}{n} \sum_{i=1}^{n/2} \frac{1}{2}\left[ f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) \right] - z^*_A \right)
= \mathrm{var}\left( \frac{2}{n} \sum_{i=1}^{n/2} \frac{1}{2}\left[ f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) \right] \right) + \mathrm{var}(z^*_A) - 2\,\mathrm{Cov}\left( \frac{2}{n} \sum_{i=1}^{n/2} \frac{1}{2}\left[ f(\hat{x}, \tilde{\xi}^i) + f(\hat{x}, \tilde{\xi}^{i'}) \right],\, z^*_A \right).   (4.38)

It often happens that the absolute value of each term on the right-hand side of (4.38) is
reduced significantly compared to using i.i.d. sampling, but the subsequent percentage
reduction in the variance of the optimality gap estimator can be much reduced or even
negative due to the subtraction of the covariance term. Table 4.4 gives an example
for PGP2 with the optimal candidate solution and sample size n = 500. Terms
1, 2, and 3 denote the quantities $\mathrm{var}\left(\frac{2}{n}\sum_{i=1}^{n/2}\frac{1}{2}[f(\hat{x},\tilde{\xi}^i)+f(\hat{x},\tilde{\xi}^{i'})]\right)$, $\mathrm{var}(z^*_A)$, and the covariance term
$-2\,\mathrm{Cov}\left(\frac{2}{n}\sum_{i=1}^{n/2}\frac{1}{2}[f(\hat{x},\tilde{\xi}^i)+f(\hat{x},\tilde{\xi}^{i'})],\, z^*_A\right)$, respectively.
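To make the decomposition in (4.38) concrete, the following small sketch estimates the three terms by simulation for a toy one-dimensional problem. It is only an illustration: the function toy_f, the candidate solution, and the grid search used as a stand-in for $z^*_A$ are assumptions for this sketch and are not part of our experimental setup.

    import numpy as np

    rng = np.random.default_rng(0)

    def toy_f(x, xi):
        # Illustrative objective; not one of the test problems in this chapter.
        return (x - xi) ** 2

    def one_replication(x_hat, n=500):
        # Antithetic pairs share the same uniforms: xi and 1 - xi.
        u = rng.random(n // 2)
        xi, xi_anti = u, 1.0 - u
        pair_mean = lambda x: 0.5 * (toy_f(x, xi) + toy_f(x, xi_anti))
        sample_obj = pair_mean(x_hat).mean()               # (2/n) sum of pair averages at x_hat
        grid = np.linspace(0.0, 1.0, 21)
        z_star_a = min(pair_mean(x).mean() for x in grid)  # crude stand-in for z*_A
        return sample_obj, z_star_a

    reps = np.array([one_replication(x_hat=0.4) for _ in range(2000)])
    a, z = reps[:, 0], reps[:, 1]
    gap = a - z                                            # G_A as in (4.37)
    term1, term2 = a.var(ddof=1), z.var(ddof=1)
    term3 = -2.0 * np.cov(a, z)[0, 1]
    print("Term 1:", term1, "Term 2:", term2, "Term 3:", term3)
    print("var(G_A):", gap.var(ddof=1), "sum of terms:", term1 + term2 + term3)

Because the covariance term is subtracted, large reductions in Terms 1 and 2 under AV need not translate into a comparable reduction in the variance of $G_A$.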
The results for the CI width are given in Figure 4.4. For the small-scale prob-
lems, AV moderately reduces interval width while LHS produces intervals of near-zero
width, apart from little to no impact for PGP2. Considering small-scale problems and
suboptimal candidate solutions, both LHS and AV can reduce CI width, although the
impact lessens with sample size. AV and LHS have a moderate impact for large-scale
problems, again with decreasing effect with increasing sample size.
Table 4.5 shows the coverage results for MRP, MRP AV, and MRP LHS at n =
500. We observe that coverage can both be increased and decreased slightly by AV
and LHS; however, it never falls below the target threshold of 0.9, and is typically
higher.
          Term 1          Term 2          Term 3          Var.
MRP       0.42 ± 0.03     0.58 ± 0.04     0.46 ± 0.03     0.08 ± 0.01
MRP AV    0.12 ± 0.01     0.25 ± 0.02     0.14 ± 0.01     0.08 ± 0.01
% Red.    72.58           57.38           69.42           -4.00

Table 4.4: Breakdown of the percentage reduction in variance between MRP and
MRP AV of the optimality gap estimator for optimal candidate solution, for PGP2
(n = 500)
Problem   MRP               MRP AV            MRP LHS
NV        0.939 ± 0.012     0.928 ± 0.013     0.969 ± 0.009
PGP2      1.000 ± 0.000     0.997 ± 0.003     0.999 ± 0.002
APL1P     0.960 ± 0.010     1.000 ± 0.000     0.949 ± 0.011
GBD       0.995 ± 0.010     0.954 ± 0.011     1.000 ± 0.000
20TERM    1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000
DB1       1.000 ± 0.000     1.000 ± 0.000     0.980 ± 0.023
SSN       1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000
STORM     1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000

Table 4.5: MRP coverage for suboptimal candidate solutions (n = 500)
Figure 4.1: Percentage reductions in bias of optimality gap estimator between MRP
and (a) MRP AV for small problems, (b) MRP AV for large problems, (c) MRP LHS
for small problems, and (d) MRP LHS for large problems, with respect to sample size
n for suboptimal candidate solutions
Figure 4.2: Percentage reductions in variance of optimality gap estimator between
MRP and (a) MRP AV for small problems at optimal candidate solutions, (b) MRP
AV for small problems at suboptimal candidate solutions, (c) MRP AV for large problems at suboptimal candidate solutions, (d) MRP LHS for small problems at optimal
candidate solutions, (e) MRP LHS for small problems at suboptimal candidate solutions, and (f) MRP LHS for large problems at suboptimal candidate solutions, with
respect to sample size n
Figure 4.3: Percentage reductions in MSE of optimality gap estimator between MRP
and (a) MRP AV for small problems at optimal candidate solutions, (b) MRP AV for
small problems at suboptimal candidate solutions, (c) MRP AV for large problems at
suboptimal candidate solutions, (d) MRP LHS for small problems at optimal candidate solutions, (e) MRP LHS for small problems at suboptimal candidate solutions,
and (f) MRP LHS for large problems at suboptimal candidate solutions, with respect to
sample size n
Figure 4.4: Percentage reductions in CI width between MRP and (a) MRP AV for
small problems at optimal candidate solutions, (b) MRP AV for small problems at
suboptimal candidate solutions, (c) MRP AV for large problems at suboptimal candidate solutions, (d) MRP LHS for small problems at optimal candidate solutions,
(e) MRP LHS for small problems at suboptimal candidate solutions, and (f) MRP LHS
for large problems at suboptimal candidate solutions, with respect to sample size n
Single Replication Procedure
We now discuss the computational results for SRP, SRP AV, and SRP LHS. Recall
that the results on the bias, variance, and MSE of the optimality gap estimator are
the same as those for MRP, so we do not present them here. Note as well that the SV
estimator is no longer a multiple of the variance of the optimality gap estimator, and
so we present the SV results separately. Figure 4.5 gives the percentage reductions
in the SV estimator for all test problems and candidate solutions. We observe that
AV has a moderate to significant effect in all cases. For small-scale problems at the
optimal solutions, LHS nearly eliminates SV with the exception of PGP2, where SV is
increased for some sample sizes. At suboptimal solutions, LHS has a decreasing effect
with sample size for GBD, little effect for NV, increases SV for APL1P and PGP2,
and has a slight effect for large-scale problems.

The percentage reductions in the CI width for all test problems and candidate solutions are presented in Figure 4.6. The results are very similar to those in Figure 4.5,
although AV is less effective overall and can even increase CI width for small-scale
problems at optimal solutions. This is not surprising because the SRP AV sample
error is calculated using $\sqrt{n/2}$ rather than $\sqrt{n}$, increasing the sample error and
therefore the CI width relative to SRP with i.i.d. sampling.
The coverage results for SRP, SRP AV, and SRP LHS at n = 500 are given in
Table 4.6. As expected per the discussion in Section 2.4, SRP has some low coverage,
particularly for PGP2 and DB1. In some cases, AV can increase the coverage slightly,
but at other times coverage is decreased as well. On the other hand, LHS mostly
increases coverage. While this may seem surprising since LHS can decrease CI width,
coverage still improves because the point estimators have significantly less variability
under LHS.
Problem   SRP               SRP AV            SRP LHS
NV        0.907 ± 0.015     0.984 ± 0.007     1.000 ± 0.000
PGP2      0.501 ± 0.026     0.534 ± 0.026     0.550 ± 0.026
APL1P     0.873 ± 0.017     0.883 ± 0.017     1.000 ± 0.000
GBD       0.874 ± 0.017     0.889 ± 0.016     1.000 ± 0.000
20TERM    0.940 ± 0.039     0.930 ± 0.042     0.890 ± 0.051
DB1       0.690 ± 0.076     0.690 ± 0.076     0.660 ± 0.078
SSN       1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000
STORM     0.920 ± 0.045     0.820 ± 0.063     1.000 ± 0.000

Table 4.6: SRP coverage for suboptimal candidate solutions (n = 500)
Figure 4.5: Percentage reductions in SV estimator between SRP and (a) SRP AV
for small problems at optimal candidate solutions, (b) SRP AV for small problems
at suboptimal candidate solutions, (c) SRP AV for large problems at suboptimal
candidate solutions, (d) SRP LHS for small problems at optimal candidate solutions,
(e) SRP LHS for small problems at suboptimal candidate solutions, and (f) SRP LHS for
large problems at suboptimal candidate solutions, with respect to sample size n
Figure 4.6: Percentage reductions in CI width between SRP and (a) SRP AV for
small problems at optimal candidate solutions, (b) SRP AV for small problems at
suboptimal candidate solutions, (c) SRP AV for large problems at suboptimal candidate solutions, (d) SRP LHS for small problems at optimal candidate solutions, (e)
SRP LHS for small problems at suboptimal candidate solutions, and (f) SRP LHS for
large problems at suboptimal candidate solutions, with respect to sample size n
Averaged Two-Replications Procedure
Finally, we discuss the results for A2RP with variance reduction. We first note that
the values for the small test problems under A2RP and A2RP-B differ from those in
Section 3.7.3. This is due to the change in the total sample size and the decreased
number of independent runs; however, the trends remain the same, so we do not
include these results here. For completeness, in Figure 4.7 we compare A2RP and
A2RP-B for the large-scale problems, which were not considered in Chapter 3. A2RP-B does not markedly improve the point and interval estimators in this case.
Figure 4.8 gives the results for the MSE of the optimality gap estimator, which
summarizes the bias and variance together in one estimator, for A2RP AV-B and
A2RP LHS-B. A2RP AV-B increases the percentage reduction in the MSE somewhat
compared to A2RP AV, whereas LHS-B has a positive impact compared to A2RP
LHS only in the case of PGP2. Both A2RP AV-B and A2RP LHS-B show significant
improvement in the MSE compared to A2RP-B.
Figure 4.7: Percentage reductions between A2RP and A2RP-B in (a) MSE of optimality gap estimator, (b) SV estimator, and (c) CI width, with respect to sample
size n for large problems at suboptimal candidate solutions

Although the values of the SV estimators for A2RP, A2RP AV, and A2RP LHS
differ from those of SRP, the percentage reductions are the same since we are using
the same sample size per replication. In addition, the percentage reduction in the CI
width will be the same in both cases. Note that this is true for A2RP LHS because
of our choice of $\alpha$. Therefore, we only present the SV and CI estimator results for
A2RP AV-B and A2RP LHS-B in Figures 4.9 and 4.10, respectively. A2RP AV-B has
a noticeable effect on the estimators compared to A2RP AV only when the optimal
candidate solution is used. A2RP LHS-B has no discernible impact in comparison
to A2RP LHS for all test problems except for PGP2 at an optimal solution. In that
case, A2RP LHS-B offers moderate improvement over LHS. Compared to A2RP-B,
A2RP AV and A2RP AV-B further reduce the SV estimator and the CI width, with a
particularly positive impact for PGP2. A2RP LHS and A2RP LHS-B offer improvement
compared to A2RP-B in some cases, mainly for the large-scale problems.
Coverage results are given in Table 4.7. As expected, the use of two replications
results in higher coverage compared to SRP. AV further increases the coverage the
majority of the time. A2RP AV-B produces very similar coverage results to A2RP
AV. The asymptotic validity results for A2RP LHS and A2RP LHS-B give an asymptotic lower bound of only 0.670. However, our empirical results for small sample sizes
indicate that LHS significantly exceeds this bound, and in fact always improves coverage compared to i.i.d. sampling. As with SRP, this is due to the lessening of the
variability of the point estimators under LHS. The coverage results of A2RP LHS-B are very similar to those for A2RP LHS, except for PGP2 where coverage drops
somewhat.
Problem   A2RP              A2RP-B            A2RP AV           A2RP AV-B         A2RP LHS          A2RP LHS-B
NV        0.905 ± 0.015     0.897 ± 0.015     0.985 ± 0.006     0.986 ± 0.006     1.000 ± 0.000     1.000 ± 0.000
PGP2      0.745 ± 0.023     0.679 ± 0.024     0.784 ± 0.021     0.735 ± 0.023     0.807 ± 0.021     0.725 ± 0.023
APL1P     0.894 ± 0.016     0.877 ± 0.017     0.901 ± 0.016     0.893 ± 0.016     1.000 ± 0.000     1.000 ± 0.000
GBD       0.929 ± 0.013     0.891 ± 0.016     0.884 ± 0.017     0.883 ± 0.017     1.000 ± 0.000     1.000 ± 0.000
20TERM    0.950 ± 0.036     0.960 ± 0.032     0.970 ± 0.028     0.960 ± 0.032     0.980 ± 0.023     0.990 ± 0.016
DB1       0.860 ± 0.057     0.830 ± 0.062     0.930 ± 0.042     0.930 ± 0.042     0.920 ± 0.045     0.950 ± 0.036
SSN       1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000     1.000 ± 0.000
STORM     0.960 ± 0.032     0.970 ± 0.028     0.970 ± 0.028     0.980 ± 0.023     1.000 ± 0.000     1.000 ± 0.000

Table 4.7: A2RP coverage for suboptimal candidate solutions (n = 500)
Figure 4.8: Percentage reductions in MSE of optimality gap estimator between A2RP
and (a) A2RP AV-B for small problems at optimal candidate solutions, (b) A2RP
AV-B for small problems at suboptimal candidate solutions, (c) A2RP AV-B for large
problems at suboptimal candidate solutions, (d) A2RP LHS-B for small problems
at optimal candidate solutions, (e) A2RP LHS-B for small problems at suboptimal
candidate solutions, and (f) A2RP LHS-B for large problems at suboptimal candidate
solutions, with respect to sample size n
Figure 4.9: Percentage reductions in SV estimator between A2RP and (a) A2RP AV-B
for small problems at optimal candidate solutions, (b) A2RP AV-B for small problems
at suboptimal candidate solutions, (c) A2RP AV-B for large problems at suboptimal
candidate solutions, (d) A2RP LHS-B for small problems at optimal candidate solutions, (e) A2RP LHS-B for small problems at suboptimal candidate solutions, and (f)
A2RP LHS-B for large problems at suboptimal candidate solutions, with respect to
sample size n
Figure 4.10: Percentage reductions in CI width between A2RP and (a) A2RP AV-B
for small problems at optimal candidate solutions, (b) A2RP AV-B for small problems
at suboptimal candidate solutions, (c) A2RP AV-B for large problems at suboptimal
candidate solutions, (d) A2RP LHS-B for small problems at optimal candidate solutions, (e) A2RP LHS-B for small problems at suboptimal candidate solutions, and (f)
A2RP LHS-B for large problems at suboptimal candidate solutions, with respect to
sample size n
n       IID       AV        LHS
1000    0.0012    0.0010    0.0090
2000    0.0023    0.0018    0.0247
3000    0.0034    0.0027    0.0515
4000    0.0044    0.0035    0.0824
5000    0.0056    0.0043    0.1269

Table 4.8: Computational time for IID, AV, and LHS (in seconds) with respect to
sample size n
Timings
In this section, we compare the computational effort required by i.i.d. sampling, AV,
and LHS. For each sampling method, we generated samples with sizes ranging from
n = 1, 000 to n = 5, 000 to facilitate comparison with the timings results in Section 3.7.4. The tests were performed on a 1.66 GHz LINUX computer with 4GB
memory.
Table 4.8 presents the results. Note that this table lists only the time required to
generate the samples. As expected, the time taken by both i.i.d. sampling and AV
increases roughly linearly with sample size. AV slightly decreases the computational
burden compared to i.i.d. sampling because half as many uniform random numbers
need to be generated. This is well known; see e.g., Chapter 4 of (Lemieux, 2009). In
our implementation, due to the shuffling across each component of the observations,
LHS can increase the sampling time by more than an order of magnitude, but the time
required is insignificant compared to the time required to solve the sampling problem
(SPn ) for large-scale problems.
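The gap in generation time is easy to reproduce. The following sketch, which is not our implementation, times plain i.i.d. sampling, AV (half the uniforms plus their mirrored counterparts), and a basic LHS routine that stratifies and then independently permutes each component; the dimension and sample size are arbitrary choices.

    import time
    import numpy as np

    rng = np.random.default_rng(1)

    def iid_sample(n, d):
        return rng.random((n, d))

    def av_sample(n, d):
        # Generate n/2 uniforms and append their antithetic counterparts 1 - U.
        u = rng.random((n // 2, d))
        return np.vstack([u, 1.0 - u])

    def lhs_sample(n, d):
        # One point per stratum in each component, then an independent permutation per component.
        u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
        for j in range(d):
            u[:, j] = rng.permutation(u[:, j])
        return u

    for gen in (iid_sample, av_sample, lhs_sample):
        start = time.perf_counter()
        gen(5000, 20)
        print(gen.__name__, time.perf_counter() - start, "seconds")

The per-component permutation is what drives the higher LHS generation cost, but, as noted above, this cost remains negligible next to the time needed to solve the sampling problems.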
4.6.4 Discussion
In this section, we highlight insights gained from our computational experiments
and provide guidelines for the use of the various sampling schemes when estimating
optimality gaps.
• Both AV and LHS can be effective at reducing the variance of the optimality gap
estimator and the SV estimator, as well as lessening the bias of the optimality
gap estimator. LHS outperforms AV in almost all situations.

• The structure of the optimality gap estimator can lessen the amount of variance
reduction achieved, particularly for AV (recall the discussion earlier in this section).

• As expected, the coverage associated with the CI estimator increases with the
number of replications. LHS can further improve coverage, despite reducing the
width of the CI estimator, because of the reduced variability of the optimality
gap and SV estimators.

• The combination of the bias reduction technique from Chapter 3 with A2RP AV
and A2RP LHS does not appear to offer significant additional benefits beyond
the variance reduction schemes on their own.

• LHS can increase the time required to generate a random sample by more than an
order of magnitude compared to using AV or i.i.d. sampling; however, the LHS
sampling time is still trivial compared to that of solving sampling problems.
Given the observations above, we recommend the use of LHS to improve the
reliability of estimators produced by algorithms that assess solution quality. As per
the discussion in Section 3.7.6, we generally recommend using A2RP over SRP for
improved coverage. However, MRP provides the most conservative coverage results,
and so the use of MRP is recommended if coverage is of highest importance.
4.7 Summary and Concluding Remarks
In this chapter, we combined algorithms for assessing solution quality with sampling
schemes used for variance reduction and with the bias reduction scheme of Chapter 3.
In each case, we showed that the optimality gap and SV estimators are consistent
and the CI estimator is asymptotically valid. Computational experiments indicate
that both AV and LHS can improve the reliability of the estimators, with LHS in
particular eliminating or nearly eliminating bias and variance in some cases. We
note, however, that the structure of the optimality gap estimator can result in less
variance reduction than if the sampled optimal value $z^*_n$ were considered on its own.
Neither sampling scheme requires undue computational effort. We conclude that
implementing variance reduction schemes in MRP, A2RP, and SRP can improve the
performance of the algorithms without a loss of computational efficiency. The next
chapter studies the use of SRP with variance reduction schemes within a sequential
sampling framework. Note that until now, we used a fixed candidate solution $\hat{x} \in X$
and a fixed sample size $n$ to run the procedures. In the next chapter, both of these
will change iteratively until a stopping condition that depends on the optimality gap
estimator is satisfied.
Chapter 5
Sequential Sampling with Variance Reduction
The procedures for assessing solution quality discussed so far in this dissertation are
static, in that the candidate solution, $\hat{x} \in X$, and the sample size, $n$, are fixed and
given as input at the beginning of the algorithms. In this chapter, we allow both x̂
and n to be adaptive and aim to solve (SP) in an iterative fashion. The sequential
sampling procedure we consider, which appears in (Bayraksan & Morton, 2011), is
based on a sequence of candidate solutions. At each step, the quality of the current
candidate solution is assessed using a method that generates an optimality gap and
its associated variance estimate, like the ones investigated earlier in this dissertation.
The procedure terminates when the value of the optimality gap estimator of the
candidate solution falls below a threshold dictated by the value of the SV estimator.
Given the success of AV and LHS in improving estimator reliability, as demonstrated in Chapter 4 and in the literature, in this chapter we adapt the sequential
sampling methodology to use these variance reduction schemes when estimating the
optimality gaps of the candidate solutions. We also generate the sequence of candidate
solutions using LHS with the aim of producing higher quality candidate solutions. We
note that MRP is computationally burdensome if the same sample size per replication
n as for SRP is used in a sequential setting, or too biased if n is divided into 25-30
batches. We also note that the LHS variant of A2RP requires significant changes to
the existing theory and so we leave this for future work and instead focus on SRP.
In order to establish desired theoretical properties, minor changes to the theory are
required when using SRP AV to assess solution quality, and more significant adaptations are required when using SRP LHS.
This chapter is organized as follows. An overview of the literature related to sequential sampling is given in the next section. Section 5.2 summarizes the particular
sequential sampling procedure being considered and specifies the necessary assumptions. Section 5.3 applies the variance reduction schemes to the sequential sampling
procedure and discusses theoretical results. Section 5.4 presents computational results
and Section 5.5 concludes with a summary of our findings.
5.1 Literature Review
In general, Monte Carlo sampling-based sequential procedures operate in the following
way. Rather than using a fixed sample size, the algorithms proceed iteratively: at each
iteration, the observations generated so far are analyzed, and either the procedure
is terminated or the sample size is increased and the procedure continues. The use
of sequential procedures has been studied extensively in statistics (Chow & Robbins,
1965; Ghosh et al., 1997; Nadas, 1969) and in simulating stochastic systems (Glynn
& W.Whitt, 1992; Kim & Nelson, 2001, 2006; Law, 2007).
In the context of stochastic programming, sequential sampling procedures iteratively increase the sample size used when solving Monte Carlo-sampling based approximations (SPn ) of (SP). These methods rely on stopping rules to determine when
to terminate with a high-quality solution. When using such algorithms to (approximately) solve stochastic programs, it is important to determine (i) what sample size
to use at a given iteration and (ii) when to terminate the algorithm in order to obtain
high-quality solutions. We will highlight work from the literature that addresses these
goals.
First, Shapiro (2003) summarizes methods to estimate the sample size required to
solve a single approximate stochastic program (SPn ) with a desired accuracy. These
techniques are based on rates of convergence of optimal solutions and large deviations
theory. However, these methods can provide overly conservative estimates compared
to the sample sizes required in practice (Verweij et al., 2003).
Returning to a sequential approach, Homem-de-Mello (2003) examines the rate
at which sample sizes must grow to guarantee the consistency of sampling-based
estimators of z ⇤ . A natural question then is how to best allocate the sample sizes at
each iteration. One aim is to minimize expected total computational cost (Byrd et al.,
2012; Polak & Royset, 2008; Royset, 2013). In another approach, Royset & Szechtman
(2013) study the relationship between sample size selection and e↵ort required to
solve (SPn ) as the computational budget increases in order to maximize the rate of
convergence to the optimal value. Some papers study sample size allocation with
the goal of achieving convergence of optimal solutions. For example, in the context
of stochastic root finding and simulation optimization, Pasupathy (2010) provides
guidance on how to select sample sizes in order to guarantee the convergence of
solutions of retrospective-approximation algorithms.
As mentioned above, sequential sampling procedures require stopping rules in
addition to methods for determining sample sizes. Stopping rules have been described in the literature for certain sampling-based methods for stochastic programs
(Dantzig & Infanger, 1995; Higle & Sen, 1991a, 1996a; Norkin et al., 1998; Shapiro
& Nemirovski, 2005). Furthermore, Homem-de-Mello et al. (2011) discuss variance
reduction schemes and stopping criteria for a stochastic dual dynamic programming
algorithm to solve multistage stochastic programs. With the exception of stochastic
quasi-gradient methods (see the survey by Pflug (1988)), however, these stopping rules
do not necessarily control the quality of the solution obtained and are not analyzed
in a sequential fashion.
Morton (1998) studies stopping rules that control solution quality for algorithms
based on asymptotically normal optimality gap estimators. The asymptotic normality assumption can be restrictive for stochastic programs, and the procedure of
Bayraksan & Morton (2011), discussed in detail in the next section, removes this requirement. Bayraksan & Pierre-Louis (2012) consider fixed-width sequential stopping
rules for interval estimators, and Pierre-Louis (2012) develops a sequential approxi-
mation method using both sampling approximation and deterministic bounding with
similar stopping rules.
5.2 Overview of a Sequential Sampling Procedure
We now review the sequential sampling procedure of Bayraksan & Morton (2011) on
which our work is based. The material presented in this section is taken largely from
(Bayraksan & Morton, 2009). We consider the following basic framework to solve
(SP):
Step 1: Generate a candidate solution,
Step 2: Assess the quality of the candidate solution,
Step 3: Check stopping criterion. If satisfied, stop. Else, go to Step 1.
In the context of stochastic programming with Monte Carlo sampling-based approximations, improving the candidate solution typically involves generating one or more
additional samples of $\tilde{\xi}$. In Step 1, a candidate solution can be generated by a number
of methods that solve approximations of (SP). This includes, but is not limited to,
solving a sequence of sampling problems (SP$_{m_k}$) at iteration $k$ with increasing sample
sizes, $m_k \to \infty$ as $k \to \infty$, and setting the solutions $x^*_{m_k}$ as the candidate solution,
i.e., $\hat{x}_k = x^*_{m_k}$, at iteration $k$. Additionally, any method that satisfies the assumptions
detailed below can be used to assess solution quality. We will focus on SRP in this
chapter.
Before we formally state the sequential sampling procedure, we first make the
following definitions and assumptions. Let $\{\tilde{\xi}^1, \tilde{\xi}^2, \ldots, \tilde{\xi}^n\}$ be a random sample of
size $n$. Suppose we have at hand an optimality gap estimator $\mathcal{G}_n(x)$ and an SV
estimator $s^2_n(x) \ge 0$. We define

D_n(x) = \frac{1}{n} \sum_{i=1}^{n} \left[ f(x, \tilde{\xi}^i) - f(x^*_{\min}, \tilde{\xi}^i) \right].   (5.1)

Note that the expected value of $D_n(x)$ is $G_x$, and its variance under i.i.d. sampling is
$\sigma^2(x)/n$, where $\sigma^2(x) = \mathrm{var}[f(x, \tilde{\xi}) - f(x^*_{\min}, \tilde{\xi})]$ and $x^*_{\min} \in \arg\min_{y \in X^*} \mathrm{var}[f(x, \tilde{\xi}) - f(y, \tilde{\xi})]$.
Observe that the definitions of $\sigma(x)$ and $x^*_{\min}$ vary slightly from those used
in Chapters 3 and 4. Throughout this chapter, we focus on problems that satisfy
assumptions (A1) and (A3) of Chapter 2 and (A5) of Chapter 4 (recall that (A5)
implies (A2)). In particular, our computational results are on a subset of problems
from Chapters 3 and 4. That said, in this chapter we work with more detailed
assumptions and point to when these are satisfied under (A1)–(A3) and (A5):

(A6) The sequence of feasible candidate solutions $\{\hat{x}_k\}$ has at least one limit
point in $X^*$, a.s.

(A7) Let $\{x_k\}$ be a feasible sequence with $x$ as one of its limit points. Let
sample size $n_k$ satisfy $n_k \to \infty$ as $k \to \infty$. Then, $\liminf_{k \to \infty} P(|\mathcal{G}_{n_k}(x_k) - G_x| > \delta) = 0$
for any $\delta > 0$.

(A8) $\mathcal{G}_n(x) \ge D_n(x)$, a.s., for all $x \in X$ and $n \ge 1$.

(A9) $\liminf_{n \to \infty} s^2_n(x) \ge \sigma^2(x)$, a.s., for all $x \in X$.

(A10) $\sqrt{n}\,(D_n(x) - G_x) \Rightarrow N(0, \sigma^2(x))$ as $n \to \infty$ for all $x \in X$, where
$N(0, \sigma^2(x))$ is a normal random variable with mean zero and variance $\sigma^2(x)$.
Briefly, the assumptions are: (i) the algorithm used in Step 1 eventually generates
at least one optimal solution (assumption (A6)), (ii) the statistical estimators of the
optimality gap and its variance have desired properties such as convergence (assumptions (A7)–(A9)), and (iii) the sampling is done in such a way that a form of the
CLT holds (assumption (A10)). More specifically, assumption (A7) ensures that the
optimality gap estimator Gn (x) converges in probability, uniformly in x, to Gx , the
true optimality gap. Similarly, (A9) requires that a form of convergence holds for the
sample variance estimator s2n (x). Assumption (A8) ensures the correct direction of
the bias in the estimation of the optimality gap.
Assumption (A6) is satisfied for optimal solutions to (SP$_{m_k}$) with $m_k \to \infty$ under i.i.d. sampling; see, e.g., (Shapiro, 2003). The same is also true for the class
of problems we consider when the samples are generated using LHS, as we do in
our computational results in this chapter; see (Homem-de-Mello, 2008). Assumption
(A7) is satisfied under (A1)–(A3) (Bayraksan & Morton, 2006). Assumption (A8)
is satisfied as a direct consequence of the optimization that occurs in SRP. Uniform
convergence in x of the relevant sample means to their expectations is a sufficient
condition for (A9) to be satisfied—as discussed earlier in Chapter 4, this requirement
is satisfied within our framework. Furthermore, the sampling schemes considered in
this chapter, namely i.i.d. sampling, AV, and LHS, satisfy Assumption (A10) (see
Section 4.1).
At iteration $k$ of sequential sampling, we are given a candidate solution $\hat{x}_k \in X$
(from Step 1). First, we select a sample size $n_k$. To achieve this sample size, we
can either choose a newly generated sample $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$, or we can augment the
previously obtained observations with $\{\tilde{\xi}^{n_{k-1}+1}, \ldots, \tilde{\xi}^{n_k}\}$. The resampling frequency
$k_f$ controls at which iterations to generate new observations. We now estimate the
optimality gap of $\hat{x}_k$ through $\mathcal{G}_{n_k}(\hat{x}_k)$ and $s_{n_k}(\hat{x}_k)$ (Step 2). To simplify notation,
from this point on, we will usually suppress dependence on $\hat{x}_k$ and $n_k$, and simply
denote $G_k = G_{\hat{x}_k}$, $\sigma^2_k = \sigma^2(\hat{x}_k)$, $D_k = D_{n_k}(\hat{x}_k)$, $\mathcal{G}_k = \mathcal{G}_{n_k}(\hat{x}_k)$, and $s_k = s_{n_k}(\hat{x}_k)$. Let
$h' > 0$ and $\varepsilon' > 0$ be two scalars. In Step 3, we check the following stopping criterion
and terminate the sequential sampling at iteration

T = \inf_{k \ge 1} \{k : \mathcal{G}_k \le h' s_k + \varepsilon'\},   (5.2)

i.e., we stop the first time the ratio of $\mathcal{G}_k$ to $s_k$ falls below $h'$ plus a small positive
number. Let $h > h'$. We select the sample size at iteration $k$ according to

n_k \ge \left(\frac{1}{h - h'}\right)^2 \left(c_p + 2p \ln^2 k\right),   (5.3)

where $c_p = \max\left\{2 \ln\left(\sum_{j=1}^{\infty} j^{-p \ln j} \big/ \sqrt{2\pi}\,\alpha\right), 1\right\}$. Here, $p > 0$ is a parameter we
can choose, which affects the number of samples we generate. Under this formula,
the sample size grows of order $O(\ln^2 k)$ in iterations $k$. When we stop at iteration
$T$ according to (5.2), the sequential sampling procedure provides an approximate
solution, $\hat{x}_T$, and a CI on its optimality gap, $G_T$, as $[0, h s_T + \varepsilon]$, where $\varepsilon > \varepsilon'$. Note that
$\varepsilon$ and $\varepsilon'$ are very small, e.g., $10^{-7}$, and $h$ and $h'$ are the more important parameters
in selecting initial sample sizes and determining when to stop. We will discuss how
to select these parameters in more detail in our computational experiments.
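Treating $c_p$ as a constant that is computed once from its definition, the sample size rule (5.3) is straightforward to evaluate; the following sketch shows the $O(\ln^2 k)$ growth for placeholder parameter values (the value assumed for $c_p$ here is illustrative only).

    import math

    def n_k(k, delta_h, p, c_p):
        # Smallest integer sample size satisfying (5.3) for a precomputed constant c_p.
        return math.ceil((1.0 / delta_h) ** 2 * (c_p + 2.0 * p * math.log(k) ** 2))

    # Placeholder values: delta_h = h - h', p, and an assumed value of c_p.
    for k in (1, 2, 5, 10, 50, 100):
        print(k, n_k(k, delta_h=0.25, p=0.01, c_p=10.0))

Because the growth is only logarithmic squared in $k$, the per-iteration sample size increases slowly once the initial size has been set by the choice of $h - h'$.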
A formal statement of the sequential sampling procedure using SRP under i.i.d.
sampling is as follows. Observe that the sequence of steps is more detailed than at
the beginning of the section.

Sequential Sampling Procedure

Input: Values for $h > h' > 0$, $\varepsilon > \varepsilon' > 0$, and $p > 0$, a method that generates
candidate solutions $\{\hat{x}_k\}$ with at least one limit point in $X^*$, a resampling frequency
$k_f$ (a positive integer), and a desired value of $\alpha \in (0, 1)$.

Output: A candidate solution $\hat{x}_T$ and a $(1-\alpha)$-level confidence interval on its optimality gap, $G_T$.

0. (Initialization) Set $k = 1$, calculate $n_k$ according to (5.3), and sample i.i.d. observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$ from $P$.

1a. Solve (SP$_{n_k}$) using $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$ to obtain $x^*_{n_k}$.

1b. Calculate

\mathcal{G}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} \left[ f(\hat{x}_k, \tilde{\xi}^i) - f(x^*_{n_k}, \tilde{\xi}^i) \right]

and

s^2_k = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} \left[ \left( f(\hat{x}_k, \tilde{\xi}^i) - f(x^*_{n_k}, \tilde{\xi}^i) \right) - \mathcal{G}_k \right]^2.

2. If $\{\mathcal{G}_k \le h' s_k + \varepsilon'\}$, then set $T = k$, and go to 4.

3. Set $k = k + 1$ and calculate $n_k$ according to (5.3). If $k_f$ divides $k$, then sample
observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$ independently of samples generated in previous iterations.
Else, sample $n_k - n_{k-1}$ observations $\{\tilde{\xi}^{n_{k-1}+1}, \ldots, \tilde{\xi}^{n_k}\}$ from $P$. Go to 1.

4. Output candidate solution $\hat{x}_T$ and a one-sided confidence interval on $G_T$:

[0, h s_T + \varepsilon].   (5.4)
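The control flow of the procedure can be sketched compactly as follows. The functions sample_xi, second_stage_cost, solve_sampling_problem, generate_candidate, and n_of_k are placeholders for the sampling routine, the objective evaluation $f(x, \xi)$, a solver for (SP$_n$), a candidate-generation method, and the rule (5.3); only the loop structure is meant to mirror the steps above.

    import numpy as np

    def sequential_srp(sample_xi, second_stage_cost, solve_sampling_problem,
                       generate_candidate, n_of_k, h, h_prime, eps, eps_prime,
                       k_f=1, max_iter=100, seed=0):
        # Sketch of the sequential sampling procedure with SRP under i.i.d. sampling.
        rng = np.random.default_rng(seed)
        k = 1
        xi = sample_xi(rng, n_of_k(k))                     # step 0: initial sample
        while True:
            x_hat = generate_candidate(k, rng)             # step 1: candidate solution
            x_star = solve_sampling_problem(xi)            # step 1a: solve (SP_n)
            diffs = second_stage_cost(x_hat, xi) - second_stage_cost(x_star, xi)
            gap, sv = diffs.mean(), diffs.std(ddof=1)      # step 1b: G_k and s_k
            if gap <= h_prime * sv + eps_prime or k >= max_iter:   # step 2: stopping rule (5.2)
                return x_hat, (0.0, h * sv + eps)          # step 4: one-sided CI (5.4)
            k += 1
            n_new = n_of_k(k)                              # step 3: grow the sample per (5.3)
            if k % k_f == 0:
                xi = sample_xi(rng, n_new)                 # resample from scratch
            else:
                xi = np.concatenate([xi, sample_xi(rng, n_new - len(xi))])   # augment

In practice the dominant cost lies in solve_sampling_problem and in generating the candidate solutions, not in the bookkeeping shown here.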
Note that a resampling frequency of $k_f = 1$ will generate $n_k$ new observations
independent of the observations from the previous iteration. Secondly, it is important
to note that the quality statement (5.4) provides a larger bound ($h s_T + \varepsilon$) on $G_T$ than
the bound used as a stopping criterion ($h' s_T + \varepsilon'$) for $\mathcal{G}_T$ in (5.2). Such inflation of the
CI statement, relative to the stopping criterion, is fairly standard when using sampling
methods with a sequential nature (Chow & Robbins, 1965; Glynn & W.Whitt, 1992).
A third remark is that the stopping rule (5.2) and the corresponding quality statement
(5.4) are written relative to the standard deviation. This allows us to stop with a
larger optimality gap estimate when the variability of the problem is large compared
to a tighter stopping rule when the variability is low. Consequently, the quality
statement regarding $\hat{x}_T$ is tighter when variability is low and larger when variability
is high.
When the sample sizes are increased according to (5.3) and the procedure stops
at iteration $T$ according to (5.2), the CI (5.4) is asymptotically valid provided $D_n(x)$
satisfies a finite moment generating function (MGF) assumption. Next, we state
the MGF assumption, and then we summarize this result along with the fact that
sequential sampling stops in a finite number of iterations, a.s.:

(A11) $\sup_{n \ge 1} \sup_{x \in X} E\left[\exp\left(t\, \dfrac{D_n(x) - G_x}{\sigma(x)/\sqrt{n}}\right)\right] < \infty$, for all $|t| \le t_0$, for some $t_0 > 0$.

Assumption (A11) holds, for instance, when $X$ is compact, $f(x, \xi)$ is uniformly
bounded (assumption (A5)), and $\{\tilde{\xi}^1, \tilde{\xi}^2, \ldots, \tilde{\xi}^n\}$ are an i.i.d. sample from $P$. More
generally, (A11) holds for the problem class described in Chapter 3; see (Bayraksan
& Morton, 2011) for further details. The theorem below summarizes properties of
the sequential procedure.
Theorem 5.5. Let $\varepsilon > \varepsilon' > 0$ and $h > h' > 0$ and $0 < \alpha < 1$. Consider the above
sequential sampling procedure where the sample size is increased according to (5.3),
and the procedure stops at iteration $T$ according to (5.2).

(i) Assume (A6) and (A7) are satisfied. Then, $P(T < \infty) = 1$.

(ii) Assume (A8)–(A11) are satisfied. Then,

\liminf_{h \downarrow h'} P\left(G_T \le h s_T + \varepsilon\right) \ge 1 - \alpha.   (5.6)

Part (i) of Theorem 5.5 implies that if the algorithm used in Step 1 eventually
produces an optimal solution (assumption (A6)) and the optimality gap estimator can
consistently estimate solution quality (assumption (A7)), then the sequential sampling
procedure stops in a finite number of iterations, a.s. Part (ii) of Theorem 5.5 shows
that for values of $h$ close enough to $h'$, or, when the sample sizes $n_k$ are large enough,
the optimality gap of the solution when we stop lies within $[0, h s_T + \varepsilon]$ with at least
the desired probability of $1 - \alpha$.
The results in Theorem 5.5 can also be proven under i.i.d. sampling and the
following weaker second moment condition:

(A12) $\sup_{x \in X} E f^2(x, \tilde{\xi}) < \infty$.

The sample size at each iteration must then be calculated as follows:

n_k \ge \left(\frac{1}{h - h'}\right)^2 \left(c_{p,q} + 2p k^q\right),   (5.7)

where $q > 1$, $p > 0$, and where $c_{p,q} = \max\left\{2 \ln\left(\sum_{j=1}^{\infty} \exp[-p j^q] \big/ \sqrt{2\pi}\,\alpha\right), 1\right\}$. This
formula results in larger sample sizes than (5.3). Choosing $q$ to be just larger than 1
will result in sample sizes growing roughly linearly.
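The constant $c_{p,q}$ is easy to compute numerically because the series in its definition converges quickly for $q > 1$. The sketch below evaluates (5.7) for placeholder parameter values in the spirit of the experiments reported later in this chapter.

    import math

    def c_pq(p, q, alpha, terms=10_000):
        # The series sum_j exp(-p * j**q) converges quickly for q > 1, so truncation is safe.
        s = sum(math.exp(-p * j ** q) for j in range(1, terms + 1))
        return max(2.0 * math.log(s / (math.sqrt(2.0 * math.pi) * alpha)), 1.0)

    def n_k(k, delta_h, p, q, alpha):
        # Smallest integer sample size satisfying (5.7).
        return math.ceil((1.0 / delta_h) ** 2 * (c_pq(p, q, alpha) + 2.0 * p * k ** q))

    for k in (1, 5, 10, 20, 40):
        print(k, n_k(k, delta_h=0.3116, p=4.67e-3, q=1.5, alpha=0.10))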
5.3 Sequential Sampling Procedure with Variance Reduction
In this section, we update the sequential sampling procedure to include variance
reduction when assessing solution quality via SRP.
5.3.1 Antithetic Variates
The sequential sampling procedure using AV when estimating optimality gaps is as
follows:
Sequential Sampling Procedure with AV

Input: Values for $h > h' > 0$, $\varepsilon > \varepsilon' > 0$, and $p > 0$, a method that generates
candidate solutions $\{\hat{x}_k\}$ with at least one limit point in $X^*$, a resampling frequency
$k_f$ (a positive integer), and a desired value of $\alpha \in (0, 1)$.

Output: A candidate solution $\hat{x}_T$ and a $(1-\alpha)$-level confidence interval on its optimality gap, $G_T$.

0. (Initialization) Set $k = 1$, calculate $n_k$ so that

\frac{n_k}{2} \ge \left(\frac{1}{h - h'}\right)^2 \left(c_p + 2p \ln^2 k\right),   (5.8)

where $c_p = \max\left\{2 \ln\left(\sum_{j=1}^{\infty} j^{-p \ln j} \big/ \sqrt{2\pi}\,\alpha\right), 1\right\}$, and sample $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n_k/2}, \tilde{\xi}^{n_k/2\,'}\}$
from $P$ using AV sampling.

1a. Solve (SP$_{n_k,A}$) using $\{\tilde{\xi}^1, \tilde{\xi}^{1'}, \ldots, \tilde{\xi}^{n_k/2}, \tilde{\xi}^{n_k/2\,'}\}$ to obtain $x^*_{n_k,A}$.

1b. Calculate

\mathcal{G}_{k,A} = \frac{2}{n_k} \sum_{i=1}^{n_k/2} \frac{1}{2}\left[ f(\hat{x}_k, \tilde{\xi}^i) + f(\hat{x}_k, \tilde{\xi}^{i'}) - f(x^*_{n_k,A}, \tilde{\xi}^i) - f(x^*_{n_k,A}, \tilde{\xi}^{i'}) \right]

and

s^2_{k,A} = \frac{1}{n_k/2 - 1} \sum_{i=1}^{n_k/2} \left[ \frac{1}{2}\left( f(\hat{x}_k, \tilde{\xi}^i) + f(\hat{x}_k, \tilde{\xi}^{i'}) - f(x^*_{n_k,A}, \tilde{\xi}^i) - f(x^*_{n_k,A}, \tilde{\xi}^{i'}) \right) - \mathcal{G}_{k,A} \right]^2.

2. If $\{\mathcal{G}_{k,A} \le h' s_{k,A} + \varepsilon'\}$, then set $T = k$, and go to 4.

3. Set $k = k + 1$ and calculate $n_k$ according to (5.8). If $k_f$ divides $k$, then sample
observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$ independently of samples generated in previous iterations.
Else, sample $n_k - n_{k-1}$ observations $\{\tilde{\xi}^{n_{k-1}+1}, \ldots, \tilde{\xi}^{n_k}\}$ from $P$. Go to 1.

4. Output candidate solution $\hat{x}_T$ and a one-sided confidence interval on $G_T$:

[0, h s_{T,A} + \varepsilon].

Therefore, the procedure is terminated when the following stopping criterion is
satisfied:

T = \inf_{k \ge 1} \{k : \mathcal{G}_{k,A} \le h' s_{k,A} + \varepsilon'\}.   (5.9)
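Step 1b of the AV procedure reduces to averaging over the $n_k/2$ antithetic pairs, as in the short sketch below; the function f and the array layout of the pairs are assumptions made for illustration.

    import numpy as np

    def av_gap_and_sv(f, x_hat, x_star, xi, xi_anti):
        # G_{k,A} and s^2_{k,A} computed from n/2 antithetic pairs (xi[i], xi_anti[i]).
        pair_diffs = 0.5 * ((f(x_hat, xi) + f(x_hat, xi_anti))
                            - (f(x_star, xi) + f(x_star, xi_anti)))
        gap_av = pair_diffs.mean()          # G_{k,A}
        sv_av = pair_diffs.var(ddof=1)      # s^2_{k,A}, a sample variance over pair averages
        return gap_av, sv_av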
Observe that formula (5.8) calculates $n_k/2$ rather than $n_k$, and that $\mathcal{G}_{k,A}$ and $s^2_{k,A}$ are
updated accordingly. These changes are a consequence of the structure of AV. For a
fixed $x \in X$, we redefine $D_n(x)$ and $\sigma^2(x)$ to accommodate AV as follows:

D_{n,A}(x) = \frac{2}{n} \sum_{i=1}^{n/2} \frac{1}{2}\left[ f(x, \tilde{\xi}^i) + f(x, \tilde{\xi}^{i'}) - f(x^*_{\min,A}, \tilde{\xi}^i) - f(x^*_{\min,A}, \tilde{\xi}^{i'}) \right]

and

\sigma^2_A(x) = \mathrm{var}\left( \frac{1}{2}\left[ f(x, \tilde{\xi}) + f(x, \tilde{\xi}') - f(x^*_{\min,A}, \tilde{\xi}) - f(x^*_{\min,A}, \tilde{\xi}') \right] \right),

where $x^*_{\min,A} \in \arg\min_{y \in X^*} \mathrm{var}\left( \frac{1}{2}\left[ f(x, \tilde{\xi}) + f(x, \tilde{\xi}') - f(y, \tilde{\xi}) - f(y, \tilde{\xi}') \right] \right)$.
We now examine assumptions (A7)–(A11) in the context of AV:
(A7) Let $\{x_k\}$ be a feasible sequence with $x$ as one of its limit points. Let $n_k$
satisfy $n_k \to \infty$ as $k \to \infty$. Then, $\liminf_{k \to \infty} P(|\mathcal{G}_{n_k,A}(x_k) - G_x| > \delta) = 0$ for any $\delta > 0$.

(A8) $\mathcal{G}_{n,A}(x) \ge D_{n,A}(x)$, a.s., for all $x \in X$ and $n \ge 1$.

(A9) $\liminf_{n \to \infty} s^2_{n,A}(x) \ge \sigma^2_A(x)$, a.s., for all $x \in X$.

(A10) $\sqrt{n/2}\,(D_{n,A}(x) - G_x) \Rightarrow N(0, \sigma^2_A(x))$ as $n \to \infty$ for all $x \in X$, where
$N(0, \sigma^2_A(x))$ is a normal random variable with mean zero and variance $\sigma^2_A(x)$.

(A11) $\sup_{n \ge 1} \sup_{x \in X} E\left[\exp\left(t\, \dfrac{D_{n,A}(x) - G_x}{\sigma_A(x)/\sqrt{n/2}}\right)\right] < \infty$, for all $|t| \le t_0$, for some $t_0 > 0$.

Note that, in addition to updates in notation, assumptions (A10) and (A11) use $\sqrt{n/2}$
rather than $\sqrt{n}$. This is because $D_{n,A}(x)$ is a sample mean of $n/2$ i.i.d. observations.
As in Section 5.2, the modified AV assumptions (A7)–(A10) hold via the results in
Section 4.3.1 when (A1)–(A3) hold. Assumption (A11) holds when (A5) is satisfied,
but it can also hold under less restrictive cases when (A1)–(A3) hold, as discussed in
(Bayraksan & Morton, 2011). Note that with the appropriate redefinitions for AV,
everything follows as in the i.i.d. case, but with a focus on antithetic pairs. Therefore,
Theorem 5.5 holds under these circumstances with minor adjustments:
Theorem 5.10. Let $\varepsilon > \varepsilon' > 0$ and $h > h' > 0$ and $0 < \alpha < 1$. Consider the above
sequential sampling procedure with AV where the sample size is increased according
to (5.8), and the procedure stops at iteration $T$ according to (5.9).

(i) Assume (A6) and (A7) are satisfied. Then, $P(T < \infty) = 1$.

(ii) Assume (A8)–(A11) are satisfied. Then,

\liminf_{h \downarrow h'} P\left(G_T \le h s_{T,A} + \varepsilon\right) \ge 1 - \alpha.

5.3.2 Latin Hypercube Sampling
Finally, we present the sequential sampling procedure with LHS:

Sequential Sampling Procedure with LHS

Input: Values for $h > h' > 0$, $\varepsilon > \varepsilon' > 0$, and $p > 0$, a method that generates
candidate solutions $\{\hat{x}_k\}$ with at least one limit point in $X^*$, and a desired value of
$\alpha \in (0, 1)$.

Output: A candidate solution $\hat{x}_T$ and a $(1-\alpha)$-level confidence interval on its optimality gap, $G_T$.

0. (Initialization) Set $k = 1$, calculate $n_k$ according to (5.7), and sample observations
$\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$ from $P$ using LHS.

1a. Solve (SP$_{n_k,L}$) using $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$ to obtain $x^*_{n_k,L}$.

1b. Calculate

\mathcal{G}_{k,L} = \frac{1}{n_k} \sum_{i=1}^{n_k} \left[ f(\hat{x}_k, \tilde{\xi}^i) - f(x^*_{n_k,L}, \tilde{\xi}^i) \right]

and

s^2_{k,L} = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} \left[ \left( f(\hat{x}_k, \tilde{\xi}^i) - f(x^*_{n_k,L}, \tilde{\xi}^i) \right) - \mathcal{G}_{k,L} \right]^2.

2. If $\{\mathcal{G}_{k,L} \le h' s_{k,L} + \varepsilon'\}$, then set $T = k$, and go to 4.

3. Set $k = k + 1$ and calculate $n_k$ according to (5.7). Sample observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{n_k}\}$
independently of samples generated in previous iterations. Go to 1.

4. Output candidate solution $\hat{x}_T$ and a one-sided confidence interval on $G_T$:

[0, h s_{T,L} + \varepsilon].
It is important to note the differences compared to using i.i.d. sampling. First,
it is not possible to obtain a larger LHS sample by augmenting with additional observations, so the procedure regenerates new observations at each iteration (this is
equivalent to setting $k_f = 1$). In addition, the procedure uses the sample size formula (5.7), which uses larger sample sizes than the sample size formula under the
assumption of a finite moment generating function. The procedure is terminated when
the following stopping criterion is satisfied:

T = \inf_{k \ge 1} \{k : \mathcal{G}_{k,L} \le h' s_{k,L} + \varepsilon'\}.   (5.11)
We make the following additional definitions:

D_{n,L}(x) = \frac{1}{n} \sum_{i=1}^{n} \left[ f(x, \tilde{\xi}^i) - f(x^*_{\min}, \tilde{\xi}^i) \right]

and

\sigma^2_L(x) = \mathrm{var}\left( \frac{1}{n} \sum_{i=1}^{n} \left[ f(x, \tilde{\xi}^i) - f(x^*_{\min}, \tilde{\xi}^i) \right] \right),

where the observations $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^n\}$ are generated via LHS. We now discuss the requirements on the procedure. Assumptions (A7)–(A9) simply require an update of
notation and are satisfied via the results in Section 4.3.2 when (A1)–(A3) hold. Assumption (A10) can be expressed as:

(A10) We assume that $(D_{n,L}(x) - G_x)/\sigma_L(x) \Rightarrow N(0, 1)$ as $n \to \infty$ for all
$x \in X$.

If $\sigma^2_L(x) > 0$, then the result holds by Theorem 4.6, the CLT for LHS. If $\sigma^2_L(x) = 0$,
then $\frac{1}{n}\sum_{i=1}^{n}\left( f(x, \tilde{\xi}^i) - f(x^*_{\min}, \tilde{\xi}^i) \right) = G_x$ for almost all $\tilde{\xi}$, and the result holds in a
degenerate form. It is not straightforward to establish the finite moment generating
function assumption (A11) in the case of LHS, so instead we rely on the second moment
condition (A12) and use the sample size formula (5.7). Assumption (A12) holds by
assumption (A2) or by the stronger uniformly bounded assumption of (A5). However,
we need (A5) to be able to invoke the CLT for LHS.
The proof of asymptotic validity of the sequential procedure that uses LHS for
assessment of solution quality requires a few adjustments to the proof for the same
procedure under i.i.d. sampling; see, e.g., Theorem 2 in (Bayraksan & Morton, 2011).
Specifically, we apply inequality (4.4) relating the variance of an LHS estimator to
that of a standard Monte Carlo estimator, as illustrated in the detailed proof of
Theorem 5.12 below. First, we require the following lemmas:
Lemma 5.1. Let $X^1, \ldots, X^n$ be i.i.d. random variables with mean $\mu$ and $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X^i$. If $E(X^1 - \mu)^2 < \infty$, then

E(\bar{X}_n - \mu)^2 \le \frac{1}{n}\, E(X^1 - \mu)^2.
Lemma 5.2. (Fatou's Lemma) Suppose $f_n$ is a sequence of measurable functions on
$E$. Then

(i) If $f_n \ge 0$ for all $n$, then

\int_E \liminf_{n \to \infty} f_n \le \liminf_{n \to \infty} \int_E f_n.

(ii) If $L \le f_n \le U$ for all $n$, such that $\int_E L < \infty$ and $\int_E U < \infty$, then

\int_E \liminf_{n \to \infty} f_n \le \liminf_{n \to \infty} \int_E f_n \le \limsup_{n \to \infty} \int_E f_n \le \int_E \limsup_{n \to \infty} f_n.
Lemma 5.3. (Bound on Tail of a Standard Normal) Let $Z$ be a standard normal
random variable and $t > 0$. Then,

P(Z \ge t) \le \frac{1}{\sqrt{2\pi}}\, \frac{\exp[-t^2/2]}{t}.
The following theorem expresses the asymptotic validity and almost sure finite
stopping of the sequential sampling procedure with LHS.
Theorem 5.12. Let $\varepsilon > \varepsilon' > 0$ and $h > h' > 0$ and $0 < \alpha < 1$. Consider the above
sequential sampling procedure with LHS where the sample size is increased according
to (5.7), and the procedure stops at iteration $T$ according to (5.11).

(i) Assume (A6) and (A7) are satisfied. Then, $P(T < \infty) = 1$.

(ii) Assume (A5) and (A8)–(A10) are satisfied and $n_k \ge 2$ for all $k$. Then,

\liminf_{h \downarrow h'} P\left(G_T \le h s_{T,L} + \varepsilon\right) \ge 1 - \alpha.
Proof. The proof of finite stopping is as in the proof of Proposition 1 in (Bayraksan &
Morton, 2011). We now provide a proof of part (ii). Let $\Delta h = h - h'$ and $\Delta\varepsilon = \varepsilon - \varepsilon'$.
As in the proof of Theorem 1 in (Bayraksan & Morton, 2011), it suffices to show that

\limsup_{\Delta h \downarrow 0} \sum_{k=1}^{\infty} P\left(D_{k,L} - G_k \le -\Delta h\, s_{k,L} - \Delta\varepsilon\right) \le \alpha.

To apply part (ii) of Fatou's lemma, we first show that $\sum_{k=1}^{\infty} P(D_{k,L} - G_k \le -\Delta h\, s_{k,L} - \Delta\varepsilon)$ is bounded above. Let $\{\tilde{\xi}^1_{MC}, \ldots, \tilde{\xi}^{n_k}_{MC}\}$ be a standard Monte Carlo random sample. Then,

\sum_{k=1}^{\infty} P\left(D_{k,L} - G_k \le -\Delta h\, s_{k,L} - \Delta\varepsilon\right)
\le \sum_{k=1}^{\infty} P\left((D_{k,L} - G_k)^2 \ge (\Delta\varepsilon)^2\right)
= \sum_{k=1}^{\infty} \int_{\hat{x}_k} P\left((D_{k,L} - G_k)^2 \ge (\Delta\varepsilon)^2 \,\Big|\, \hat{x}_k\right) dP_{\hat{x}_k}   (5.13)
\le \sum_{k=1}^{\infty} \int_{\hat{x}_k} E\left[(D_{k,L} - G_k)^2 \,\Big|\, \hat{x}_k\right] (\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}   (5.14)
= \sum_{k=1}^{\infty} \int_{\hat{x}_k} \mathrm{var}\left(\frac{1}{n_k}\sum_{i=1}^{n_k}\left[f(\hat{x}_k, \tilde{\xi}^i) - f(x^*_{\min}, \tilde{\xi}^i)\right] - G_k \,\Big|\, \hat{x}_k\right) (\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}   (5.15)
\le \sum_{k=1}^{\infty} \int_{\hat{x}_k} \frac{n_k}{n_k - 1}\, \mathrm{var}\left(\frac{1}{n_k}\sum_{i=1}^{n_k}\left[f(\hat{x}_k, \tilde{\xi}^i_{MC}) - f(x^*_{\min}, \tilde{\xi}^i_{MC})\right] - G_k \,\Big|\, \hat{x}_k\right) (\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}   (5.16)
\le \sum_{k=1}^{\infty} \int_{\hat{x}_k} \frac{1}{n_k - 1}\, E\left[\left(f(\hat{x}_k, \tilde{\xi}_{MC}) - f(x^*_{\min}, \tilde{\xi}_{MC}) - G_k\right)^2 \,\Big|\, \hat{x}_k\right] (\Delta\varepsilon)^{-2}\, dP_{\hat{x}_k}   (5.17)
\le \sup_{x \in X} E\left(f(x, \tilde{\xi}_{MC}) - f(x^*_{\min}, \tilde{\xi}_{MC}) - G_x\right)^2 (\Delta\varepsilon)^{-2} \sum_{k=1}^{\infty} \frac{1}{n_k - 1},   (5.18)

where $P_{\hat{x}_k}$ denotes the distribution function of $\hat{x}_k$, and (5.14) follows from an application of Markov's inequality to the conditional probability in (5.13). Equality (5.15)
holds because $D_{k,L} - G_k$ has mean zero for a fixed $\hat{x}_k$, inequality (5.16) holds by the
bound on the LHS variance by the Monte Carlo variance given in (4.4), and (5.17)
follows from Lemma 5.1. The multiplier of the infinite sum in (5.18) is bounded by
(A12), and the sum itself is bounded since $n_k \ge 2$ and $n_k$ grows at least like $k^{1+\delta}$ for some $\delta > 0$.
Therefore, $\sum_{k=1}^{\infty} P(D_{k,L} - G_k \le -\Delta h\, s_{k,L} - \Delta\varepsilon)$ is bounded and Fatou's lemma can
be used.
Taking limits, we obtain

\limsup_{\Delta h \downarrow 0} \sum_{k=1}^{\infty} P\left(D_{k,L} - G_k \le -\Delta h\, s_{k,L} - \Delta\varepsilon\right)
\le \sum_{k=1}^{\infty} \limsup_{\Delta h \downarrow 0} P\left(D_{k,L} - G_k \le -\Delta h\, s_{k,L} - \Delta\varepsilon\right)
\le \sum_{k=1}^{\infty} \limsup_{\Delta h \downarrow 0} P\left(D_{k,L} - G_k \le -\Delta h\, s_{k,L}\right)   (5.19)
= \sum_{k=1}^{\infty} \limsup_{\Delta h \downarrow 0} \int_{\hat{x}_k} P\left(\frac{D_{k,L} - G_k}{\sigma_L(\hat{x}_k)} \le -\Delta h \sqrt{n_k}\, \frac{s_{k,L}}{\sqrt{n_k}\,\sigma_L(\hat{x}_k)} \,\Big|\, \hat{x}_k\right) dP_{\hat{x}_k}   (5.20)
\le \sum_{k=1}^{\infty} \limsup_{\Delta h \downarrow 0} \int_{\hat{x}_k} P\left(\frac{D_{k,L} - G_k}{\sigma_L(\hat{x}_k)} \le -\Delta h \sqrt{n_k}\, \sqrt{\frac{n_k - 1}{n_k}}\, \frac{s_{k,L}}{\sigma_k} \,\Big|\, \hat{x}_k\right) dP_{\hat{x}_k}   (5.21)
\le \sum_{k=1}^{\infty} \int_{\hat{x}_k} \limsup_{\Delta h \downarrow 0} P\left(\frac{D_{k,L} - G_k}{\sigma_L(\hat{x}_k)} \le -\left(c_{p,q} + 2p k^q\right)^{1/2} \sqrt{\frac{n_k - 1}{n_k}}\, \frac{s_{k,L}}{\sigma_k} \,\Big|\, \hat{x}_k\right) dP_{\hat{x}_k}
\le \alpha.

The first and final inequalities follow from an application of Fatou's lemma. In (5.20),
we assume that $\sigma^2_L(\hat{x}_k) > 0$; otherwise, the probability in (5.19) is zero. Inequality (5.21) holds because $\sigma^2_L(\hat{x}_k) \le \sigma^2_k/(n_k - 1)$ (see (4.4)). With $k$ and $\hat{x}_k$ fixed, $(D_{k,L} - G_k)/\sigma_L(\hat{x}_k)$
converges to a standard normal by (A10) because $\Delta h \downarrow 0$ ensures $n_k \to \infty$.
Similarly, $\liminf_{\Delta h \downarrow 0} \sqrt{\frac{n_k - 1}{n_k}}\,(s_{k,L}/\sigma_k) \ge 1$ by (A9). The final inequality then follows by applying Lemma 5.3 and the definition of
$c_{p,q}$.
We now examine the sequential sampling procedure with variance reduction for a
variety of test problems.
5.4 Computational Experiments
In this section, we apply the sequential sampling procedure with variance reduction to
a number of two-stage stochastic linear programs with recourse from the literature and
compare to using i.i.d. sampling when assessing solution quality. We first outline the
experimental setup in Section 5.4.1. We then present the results of the experiments
and discuss our findings in Section 5.4.2.
5.4.1 Experimental Setup
We now provide the details of our experimental design, including the test problems
considered, how we generated the candidate solutions, and our methodology for choosing parameter values.
Test Problems
We consider the small-scale test problems PGP2, APL1P, and GBD also studied in
Chapters 3 and 4 and the large-scale test problems DB1, 20TERM, SSN, and STORM
outlined in Chapter 4. In prior chapters, the candidate solution $\hat{x}$ was fixed, and so
$Ef(\hat{x}, \tilde{\xi})$ only needed to be estimated once to allow the estimation of the optimality
gap of $\hat{x}$. In this case, $\hat{x}_T$, the candidate solution at the final iteration, may vary from
run to run. However, in the case of the small-scale problems, it is not computationally burdensome to calculate $Ef(\hat{x}_T, \tilde{\xi})$ for each $\hat{x}_T$ output by the procedure. For the
large-scale problems, we estimated $Ef(\hat{x}_T, \tilde{\xi})$ for each solution $\hat{x}_T$ using 50,000 scenarios generated by LHS. We observed very slightly negative values for the optimality
gap estimates for some solutions to 20TERM, most likely due to numerical imprecision in the optimization algorithm solving (SP$_n$). These discrepancies are negligible
compared to the width of the CIs output by the sequential sampling procedures, and
so we treated the optimality gap as zero in these cases when calculating the coverage
probabilities.
Candidate Solution
Given the success of LHS in the previous chapter and in the literature, we used the
following method to generate the sequence of candidate solutions x̂k :
1. Set $m_k = m_1$. Sample observations (independent of those used in the evaluation
procedure) $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{m_k}\}$ from $P$ using LHS.

2. Solve (SP$_{m_k,L}$) using $\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{m_k}\}$ to obtain $x^*_{m_k}$.

3. Set $\hat{x}_k = x^*_{m_k}$. Calculate $m_{k+1}$ and generate a fresh sample of observations
$\{\tilde{\xi}^1, \ldots, \tilde{\xi}^{m_{k+1}}\}$ from $P$ using LHS. Set $k = k + 1$ and go to 2.
Assumption (A6) is satisfied by part (iii) of Theorem 4.11 of Section 4.3.2. We
set $m_k = 2n_k$ so that more computational effort is spent on generating high-quality
candidate solutions than on assessing them.
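A sketch of this candidate-generation loop is given below; latin_hypercube, solve_sampling_problem, and n_of_k are placeholder names for the LHS sampler, a solver for (SP$_{m_k,L}$), and the sample size rule, while the relation $m_k = 2n_k$ follows the description above.

    def candidate_stream(latin_hypercube, solve_sampling_problem, n_of_k, rng):
        # Yields x_hat_k = x*_{m_k} with m_k = 2 * n_k, each solved on a fresh LHS sample.
        k = 1
        while True:
            m_k = 2 * n_of_k(k)
            xi = latin_hypercube(rng, m_k)       # independent of the evaluation samples
            yield solve_sampling_problem(xi)
            k += 1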
Parameter Settings
As in previous chapters, we set $\alpha = 0.10$. Following the guidelines in (Bayraksan &
Morton, 2011), we chose $p = 4.67 \times 10^{-3}$ and $q = 1.5$. We also set $\varepsilon = 2 \times 10^{-7}$ and
$\varepsilon' = 1 \times 10^{-7}$. The sequential sampling procedure with each type of sampling was run
a total of 300 times for the small-scale test problems and 100 times for the large-scale
test problems using a common stream of random numbers for each run.
In order to directly compare sequential sampling with i.i.d. sampling, AV, and
LHS, we chose the remaining parameters based on the requirements for LHS. Specifically, we set the refresh rate $k_f = 1$, so that new observations are generated at each
iteration, and use the sample size formula (5.7). AV requires the following slight
adjustment:

\frac{n_k}{2} \ge \left(\frac{1}{h - h'}\right)^2 \left(c_{p,q} + 2p k^q\right).   (5.22)

Problem   n1     Δh        IID: (h, h')         LHS: (h, h')
PGP2      100    0.3116    (0.4166, 0.105)      (0.4060, 0.090)
APL1P     200    0.2204    (0.2284, 0.008)      (0.2304, 0.010)
GBD       500    0.1394    (0.1494, 0.010)      (0.1404, 0.001)
20TERM    500    0.1394    (0.2394, 0.100)      (0.2044, 0.065)
DB1       500    0.1394    (0.1694, 0.030)      (0.1694, 0.030)
SSN       500    0.1394    (0.1694, 0.030)      (0.1594, 0.020)
STORM     500    0.1394    (0.1694, 0.030)      (0.1494, 0.010)

Table 5.1: Parameters for the sequential sampling procedure with IID and LHS

Problem   n1     Δh        AV1: (h, h')         n1      Δh        AV2: (h, h')
PGP2      100    0.4407    (0.5457, 0.105)      200     0.3116    (0.4166, 0.105)
APL1P     200    0.3116    (0.3196, 0.008)      400     0.2204    (0.2904, 0.070)
GBD       500    0.1971    (0.2071, 0.010)      1000    0.1394    (0.1494, 0.010)
20TERM    500    0.1971    (0.2971, 0.100)      1000    0.1394    (0.2394, 0.100)
DB1       500    0.1971    (0.2271, 0.030)      1000    0.1394    (0.1694, 0.030)
SSN       500    0.1971    (0.2221, 0.025)      1000    0.1394    (0.1644, 0.025)
STORM     500    0.1971    (0.2121, 0.015)      1000    0.1394    (0.1544, 0.015)

Table 5.2: Parameters for the sequential sampling procedure with AV1 and AV2
Therefore, we make our comparisons in the following two ways. First, we adjusted
$\Delta h$ for AV so that the procedure used the same sample sizes at each iteration for each
sampling scheme. We refer to this case as AV1. Second, we doubled the sample size
when AV is used compared to i.i.d. sampling and LHS. This case is referred to as AV2.
In the cases of i.i.d. sampling and LHS, we rounded up each sample size $n_k$ to be even
in order to facilitate better comparison with AV. Formula (5.22) produces a larger
sample size than (5.8), but Theorem 5.10 is still valid since (5.8) simply provides a
lower bound on the sample size.

We chose $\Delta h = h - h'$ so that the sample size at the first iteration is equal to a
specified initial sample size $n_1$. For example, setting $\Delta h = 0.3116$ leads to an initial
sample size $n_1 = 100$ for LHS and $n_1 = 200$ for AV. We selected the parameters
$h$ and $h'$ using the method outlined in (Pierre-Louis, 2012). Specifically, for each
test problem, we ran a pilot study examining the average values of $\mathcal{G}_k/s_k$, $\mathcal{G}_{k,A}/s_{k,A}$,
and $\mathcal{G}_{k,L}/s_{k,L}$ for the first 10 iterations over 25 replications. The associated values
of $h'$ for each test problem and sampling scheme were chosen to be slightly smaller
than the average ratios observed in the pilot study. Slight adjustment was required
for some cases of the small-scale problems after examining the main sequential runs.
This approach aims to prevent the procedure from stopping too soon ($h'$ too large),
while also preventing it from running for an excessive number of iterations ($h'$ too
small). Tables 5.1 and 5.2 show our parameter choices in each situation.
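The pilot rule for selecting $h'$ can be summarized in a few lines; run_pilot_iteration is a placeholder returning the pair $(\mathcal{G}_k, s_k)$ for one pilot iteration, and the safety factor used here is an illustrative choice rather than a value from our experiments.

    import numpy as np

    def choose_h_prime(run_pilot_iteration, n_reps=25, n_iters=10, safety=0.9):
        # Set h' slightly below the average gap-to-deviation ratio seen in the pilot study.
        ratios = []
        for rep in range(n_reps):
            for k in range(1, n_iters + 1):
                gap, sv = run_pilot_iteration(rep, k)
                if sv > 0:
                    ratios.append(gap / sv)
        return safety * float(np.mean(ratios))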
5.4.2 Results of Experiments
We now present the results of our experiments for the sequential sampling procedure
with i.i.d. sampling, both experimental setups for AV, and LHS. Table 5.3 provides
confidence intervals on the number of iterations required, the CI widths on the optimality gap $h s_T + \varepsilon$, $h s_{T,A} + \varepsilon$, and $h s_{T,L} + \varepsilon$, and an estimate of the coverage probability
of the CI estimator. The confidence interval on the coverage probability is given by
$\hat{p} \pm 1.645\sqrt{\hat{p}(1-\hat{p})/\nu}$, where $\hat{p}$ is the fraction of the CIs that contained the true
optimality gap at termination and $\nu = 300$ for the small-scale problems and 100 for
the large-scale problems. For the CIs on coverage, a margin of error smaller than
0.001 is reported as 0.000. Table 5.3 also reports the average total computational
time required (in seconds) over all runs.
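For reference, the coverage estimate and its margin of error can be computed as in the sketch below, where covered is a 0/1 array recording, for each run, whether the reported interval contained the true optimality gap; the example counts are hypothetical.

    import numpy as np

    def coverage_ci(covered, z=1.645):
        # Normal-approximation CI: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / nu).
        covered = np.asarray(covered, dtype=float)
        p_hat, nu = covered.mean(), covered.size
        return p_hat, z * np.sqrt(p_hat * (1.0 - p_hat) / nu)

    p_hat, margin = coverage_ci(np.r_[np.ones(271), np.zeros(29)])
    print(round(p_hat, 3), "+/-", round(margin, 3))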
First, we observe that in all cases a relatively small number of iterations are
required, particularly in comparison to similar experimental results in (Bayraksan
& Morton, 2011). We believe that this is due to the use of LHS when generating
the sequence of candidate solutions, as the results presented in Chapter 4 as well as
studies in the literature (Bailey et al., 1999; Freimer et al., 2012; Homem-de-Mello,
2008) indicate that LHS can perform very well in a stochastic programming context,
generating high-quality solutions.
The sequential sampling procedure is particularly effective for APL1P and GBD, where the CI widths are within 1% and 0.04% of optimality, respectively (see the values of z* in Table 3.2), and coverage probabilities are at least 0.90 for APL1P and always 1 for GBD. The sequential procedure performs less well for PGP2, where the CI widths are within 7% of optimality, and coverage drops as low as 0.68 with the use of LHS. As in the non-sequential setting, we again observe low coverage in PGP2 when using SRP.
The results are similar for the large-scale problems. The CI widths for 20TERM,
DB1, SSN, and STORM are estimated to be within 0.02%, 0.04%, 28%, and 0.002% of
optimality, respectively (see the values of z* in Table 4.3). The coverage probabilities
for 20TERM, SSN, and STORM are greater than 0.90 and often close to 1. Coverage
drops to about 0.84 for DB1; however, this is higher than the coverage observed in the
non-sequential setting in Chapter 4. The relatively wide CIs and very high coverage
for SSN indicate that slightly lower values of h and h0 may tighten the CI widths
without compromising coverage.
Using LHS when assessing solution quality leads to the tightest CI widths for all test problems except DB1. LHS requires the fewest iterations for APL1P and GBD but the most iterations for the large-scale problems; however, computation time is only significantly affected, relative to i.i.d. sampling, for SSN (see Table 5.3). Doubling the sample size when AV is used, relative to i.i.d. sampling and LHS, as in AV2, produces better results than fixing the initial sample size across all sampling schemes and adjusting ∆h, as in AV1, but at a cost of increased computational time for every problem except SSN. In particular, AV2 performs better than i.i.d. sampling, but it is not as effective as LHS (with the exception of DB1).
Further investigations indicate that time spent generating the sequence of candidate solutions is the major contributor to the total computation time. This is not
surprising since we use LHS to generate the candidate solutions, where computation
time grows faster with sample size than for i.i.d. sampling or AV (see Table 4.8 in
Section 4.6.3), and additionally observations must be generated from scratch at each
iteration.
Based on these results, we recommend the use of LHS both to quickly obtain
high-quality solutions when generating candidate solutions and when assessing solution quality with SRP in the sequential sampling procedure, albeit at the risk of
undercoverage for some problems.
Problem   Method   CI on T        CI Width          CI on Coverage    Total Time (s)
PGP2      IID      1.55 ± 0.10     24.45 ±  2.70    0.740 ± 0.042         1.08
          AV1      1.42 ± 0.05     31.94 ±  3.41    0.763 ± 0.040         1.03
          AV2      1.13 ± 0.04     17.99 ±  2.01    0.780 ± 0.039         1.97
          LHS      1.52 ± 0.09     16.99 ±  2.44    0.680 ± 0.028         1.02
APL1P     IID      6.62 ± 0.63     19.08 ±  2.72    0.903 ± 0.028         3.60
          AV1      5.33 ± 0.44     27.08 ±  3.66    0.907 ± 0.028         2.85
          AV2      3.07 ± 0.24     18.18 ±  2.65    0.890 ± 0.030         4.30
          LHS      2.04 ± 0.16     10.77 ±  1.75    0.907 ± 0.028         1.40
GBD       IID      1.86 ± 0.11      0.60 ±  0.13    1.000 ± 0.000         4.50
          AV1      1.86 ± 0.13      0.06 ±  0.05    1.000 ± 0.000         4.45
          AV2      1.53 ± 0.09      0.01 ±  0.00    1.000 ± 0.000         9.96
          LHS      1.00 ± 0.00      0.00 ±  0.00    1.000 ± 0.000         3.97
20TERM    IID      1.96 ± 0.25     53.59 ±  4.41    1.000 ± 0.000      3664.73
          AV1      2.19 ± 0.28     61.38 ±  5.23    1.000 ± 0.000      3883.49
          AV2      1.54 ± 0.14     42.75 ±  3.35    0.990 ± 0.016      8770.98
          LHS      2.48 ± 0.29     36.83 ±  3.15    0.970 ± 0.028      4104.24
DB1       IID      1.41 ± 0.12      5.57 ±  1.11    0.840 ± 0.060        52.48
          AV1      1.40 ± 0.12      7.53 ±  1.50    0.840 ± 0.060        52.66
          AV2      1.27 ± 0.09      5.40 ±  1.11    0.840 ± 0.060       108.58
          LHS      1.45 ± 0.14      5.88 ±  1.11    0.830 ± 0.062        52.06
SSN       IID      2.22 ± 0.24      2.15 ±  0.08    1.000 ± 0.000      1221.72
          AV1      2.73 ± 0.35      2.75 ±  0.11    0.980 ± 0.023      1483.09
          AV2      1.08 ± 0.05      1.93 ±  0.07    1.000 ± 0.000      1641.71
          LHS      4.33 ± 0.33      1.39 ±  0.05    1.000 ± 0.000      2222.17
STORM     IID      1.82 ± 0.19    303.66 ± 47.18    0.980 ± 0.023       488.73
          AV1      2.54 ± 0.31    309.89 ± 59.73    0.980 ± 0.023       584.80
          AV2      1.75 ± 0.20    177.78 ± 41.06    0.980 ± 0.023      1044.73
          LHS      2.84 ± 0.34    144.75 ± 41.94    0.900 ± 0.049       621.89

Table 5.3: Summary of results for sequential procedures using IID, AV1, AV2, and LHS
5.5 Summary and Concluding Remarks
In this chapter, we studied the use of SRP with variance reduction, specifically SRP AV and SRP LHS, to assess solution quality in a sequential sampling procedure that approximately solves (SP) with a desired probability. We also applied LHS when generating the sequence of candidate solutions that is input to the sequential procedure. We showed that the CI on the optimality gap of the solution at termination is asymptotically valid and that the procedure stops in a finite number of iterations, a.s., when using either SRP AV or SRP LHS. Our computational results indicate that the inclusion of variance reduction, particularly LHS, can result in a procedure that takes a small number of iterations to produce a high-quality approximate solution to (SP); however, coverage probability can be low for some problems when SRP LHS is used to assess solution quality. The results in (Bayraksan & Morton, 2011) indicate that SRP in the i.i.d. setting improves coverage probability as the initial sample size is increased, with only modest increases in computation time.
Chapter 6
Conclusions
In this dissertation, we have developed a bias reduction technique and implemented
variance reduction methods to improve our ability to identify high-quality solutions
to stochastic programs. Stochastic programs can be very difficult to solve, particularly as the size of the problem increases, and so there is a need for methods to
accurately assess the quality of solutions obtained by means such as approximation.
We have focused on increasing the reliability of Monte Carlo sampling-based estimators of optimality gaps, with a particular emphasis on two-stage stochastic linear
programs with recourse. We have applied these ideas to a sequential procedure to
solve stochastic programs. The following section summarizes the contributions of this
dissertation, and Section 6.2 concludes with future directions based on this work.
6.1 Summary of Contributions
An overview of the contributions of the work presented in this dissertation is as follows:
• We developed a technique to reduce the bias of the Averaged Two-Replication Procedure optimality gap estimators. This method is motivated by stability results in stochastic programming. Rather than sampling two independent replications of observations, we partition a larger random sample by solving a minimum-weight perfect matching problem, which can be done in polynomial time in the sample size (a small illustrative sketch of this matching-based partition is given after this list). We showed that the resultant point estimators are consistent and the interval estimator is asymptotically valid. The empirical behavior of the bias reduction technique was examined in computational experiments. The results indicate that the technique can be effective at reducing bias and also variance, particularly when the optimal solution is considered.
• We studied the use of antithetic variates and Latin hypercube sampling for
variance reduction in several algorithms that estimate optimality gaps. These
include algorithms both with and without the bias reduction technique developed in this dissertation. As with the bias reduction technique, we established
the consistency and asymptotic validity of the estimators. We conducted computational experiments for a range of small-scale and large-scale test problems
to evaluate the effectiveness of the variance reduction schemes compared to i.i.d. sampling. Both techniques can improve the reliability of the optimality gap estimators, and Latin hypercube sampling performs particularly well for the
class of problems considered. We provided guidelines for the use of variance
reduction schemes in algorithms that assess solution quality.
• We applied the above ideas to a sequential sampling procedure that (approximately) solves (SP) iteratively by generating and assessing a sequence of candidate solutions. Specifically, we examined the use of variance reduction schemes
when assessing the quality of the current candidate solution using the Single
Replication Procedure. We proved that the subsequent sequential procedure
stops in a finite number of iterations and (asymptotically) with high probability produces a solution within a desired quality threshold. Computational
results indicate that variance reduction techniques, particularly Latin hypercube sampling, can improve the performance of the sequential procedure, but
can reduce the coverage probability for some problems.
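The following minimal Python sketch illustrates the matching-based partition mentioned in the first bullet above: a sample of 2n observations is paired via a minimum-weight perfect matching and each pair is split across the two replications. It assumes Euclidean distance between observations and uses networkx's general-purpose matching routine; it is an illustration of the idea only, not the implementation used in our experiments, and it omits the optimality gap estimation itself.

```python
import itertools

import networkx as nx
import numpy as np

def matching_partition(sample, rng):
    """Split 2n observations into two groups of n by pairing nearby points
    with a minimum-weight perfect matching and sending one point of each
    matched pair to each group."""
    m = len(sample)
    assert m % 2 == 0, "need an even number of observations"
    G = nx.Graph()
    for i, j in itertools.combinations(range(m), 2):
        G.add_edge(i, j, weight=float(np.linalg.norm(sample[i] - sample[j])))
    matching = nx.min_weight_matching(G)  # perfect matching on a complete graph with an even number of nodes
    group_a, group_b = [], []
    for i, j in matching:
        if rng.random() < 0.5:  # assign each pair's endpoints to the two groups at random
            i, j = j, i
        group_a.append(sample[i])
        group_b.append(sample[j])
    return np.array(group_a), np.array(group_b)

# Illustrative use on 2n = 20 bivariate observations.
rng = np.random.default_rng(0)
observations = rng.normal(size=(20, 2))
A, B = matching_partition(observations, rng)
```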
6.2 Future Research
We conclude this dissertation by discussing several avenues for future work:
• Other classes of problems: A primary goal is to adapt methods to assess solution
quality with bias and variance reduction to other classes of stochastic programs,
such as multi-stage stochastic programs, risk models, etc. The bias reduction
technique in particular will require significant adaptations, which we now briefly
discuss:
– Stability results for models such as chance-constrained and mixed-integer stochastic programs use different probability metrics (Römisch, 2003). For example, the Kolmogorov metric can be used for chance-constrained and mixed-integer stochastic programs (Henrion et al., 2009). Minimizing these metrics could potentially lead to more complicated problems than the minimum-weight perfect matching problem that arose in the current context; see, for instance, the scenario reduction techniques for different classes of stochastic programs (Heitsch & Römisch, 2003; Henrion et al., 2009).
However, quick solution methods could still provide significant reduction
in bias.
– Multi-stage stochastic programs also offer several interesting avenues to explore. The first step is to focus on problems where the stages are independent from one another and determine how to generate the scenario
trees while applying the bias reduction technique. The next step is to
generalize the approach to dependent stages. Dupačová et al. (2003) discuss backward and forward heuristics for multi-stage programs that may
be applicable in this case.
– The use of a modified version of the bias reduction technique may also be
beneficial in risk models such as Conditional Value at Risk.
• Comparison with other bias reduction methods: It would be instructive to compare the bias reduction technique to other bias reduction methods such as the
jackknife estimators considered in (Partani et al., 2006), (Partani, 2007) and
other sampling techniques.
• Additional variance reduction techniques: The use of other variance reduction
techniques, such as quasi-Monte Carlo sampling, in algorithms that assess solution quality can be studied. The use of quasi-Monte Carlo in the context of
stochastic programming has been explored in (Drew & Homem-de-Mello, 2006;
Homem-de-Mello, 2008; Koivu, 2005; Pennanen & Koivu, 2005). Challenges in
adapting the methodology presented in this dissertation include determining appropriate quasi-Monte Carlo sequences and identifying a CLT-type result for the problems under consideration. Furthermore, dimension reduction
techniques need to be further explored for large-scale problems. For some risk
models, tail behavior is important. Variance reduction geared toward capturing
tails of distributions needs to be explored for such models.
• Fixed-width sequential sampling: Bayraksan & Pierre-Louis (2012) introduce
a fixed-width sequential sampling scheme to approximately solve (SP). In this
case, the procedure is terminated when the width of a confidence interval estimator of the optimality gap falls below a pre-specified level. Variance reduction
techniques may improve the performance of this procedure.
• Connections with stochastic control theory: There appear to be natural connections between the notion of assessing solution quality in stochastic programming
and assessing policy quality in stochastic optimal control theory. In the latter
case, suboptimal policies are adopted in the absence of explicit solutions to
stochastic control problems. The question of how good these policies are then
arises. The techniques described in this dissertation may be helpful in answering
this question.
References
Attouch, H., & Wets, R. (1981). Approximation and convergence in nonlinear optimization. In O. Mangasarian, R. Meyer, & S. Robinson (Eds.) Nonlinear Programming 4 , (pp. 367–394). Academic Press, New York.
Bailey, T., Jensen, P., & Morton, D. (1999). Response surface analysis of two-stage
stochastic linear programming with recourse. Naval Research Logistics, 46 , 753–
778.
Bayraksan, G., & Morton, D. (2006). Assessing solution quality in stochastic programs. Mathematical Programming, 108 , 495–514.
Bayraksan, G., & Morton, D. (2009). Assessing solution quality in stochastic programs via sampling. In M. R. Oskoorouchi (Ed.) INFORMS TutORials in Operations Research, vol. 6, (pp. 102–122). INFORMS, Hanover, MD.
Bayraksan, G., & Morton, D. (2011). A sequential sampling procedure for stochastic
programming. Operations Research, 59 , 898–913.
Bayraksan, G., & Pierre-Louis, P. (2012). Fixed-width sequential stopping rules for
a class of stochastic programs. SIAM Journal on Optimization, 22 , 1518–1548.
Beale, E. (1955). On minimizing a convex function subject to linear inequalities.
Journal of the Royal Statistical Society, Series B , 17 , 173–184.
Bertocchi, M., Dupačová, J., & Moriggia, V. (2000). Sensitivity of bond portfolio’s
behavior with respect to random movements in yield curve: a simulation study.
Annals of Operations Research, 99 , 267–286.
Byrd, R., Chin, G., Nocedal, J., & Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134 , 127–155.
Chow, Y. S., & Robbins, H. (1965). On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Annals of Mathematical Statistics, 36 ,
457–462.
Dantzig, G. (1955). Linear programming under uncertainty. Management Science,
1, 197–206.
Dantzig, G., & Glynn, P. (1990). Parallel processors for planning under uncertainty.
Annals of Operations Research, 22 , 1–21.
Dantzig, G., & Infanger, G. (1995). A probabilistic lower bound for two-stage stochastic programs. Technical Report SOL 95-6, Department of Operations Research,
Stanford University.
Donohue, C., & Birge, J. (1995). An upper bound on the network recourse function.
Working Paper, Department of Industrial and Operations Engineering, University
of Michigan.
Drew, S. (2007). Quasi-Monte Carlo Methods for Stochastic Programming. Ph.D.
thesis, Northwestern University.
Drew, S., & Homem-de-Mello, T. (2006). Quasi-Monte Carlo strategies for stochastic
optimization. In Proceedings of the 2006 Winter Simulation Conference, (pp. 774–
782).
Drew, S., & Homem-de-Mello, T. (2012). Some large deviations results for Latin
hypercube sampling. Methodology and Computing in Applied Probability, 14 , 203–
232.
Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press,
Cambridge, 2nd ed.
Dupačová, J., Gröwe-Kuska, N., & Römisch, W. (2003). Scenario reduction in
stochastic programming: an approach using probability metrics. Mathematical
Programming, 95 , 493–511.
Dupačová, J., & Wets, R.-B. (1988). Asymptotic behavior of statistical estimators and
of optimal solutions of stochastic optimization problems. The Annals of Statistics,
16 , 1517–1549.
Edmonds, J. (1965). Paths, trees, and flowers. Canadian Journal of Mathematics,
17, 449–467.
Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman &
Hall, New York.
Ferguson, A., & Dantzig, G. (1956). The allocation of aircraft to routes: an example
of linear programming under uncertain demands. Management Science, 3 , 45–73.
Freimer, M. B., Thomas, D. J., & Linderoth, J. T. (2012). The impact of sampling methods on bias and variance in stochastic linear programs. Computational
Optimization and Applications, 51 , 51–75.
Ghosh, M., Mukhopadhyay, N., & Sen, P. K. (1997). Sequential Estimation. Wiley,
New York.
Glynn, P., & Whitt, W. (1992). The asymptotic validity of sequential stopping rules
for stochastic simulations. The Annals of Applied Probability, 2 , 180–198.
Hackney, B., & Infanger, G. (1994). Private Communication.
Heitsch, H., & Römisch, W. (2003). Scenario reduction algorithms in stochastic
programming. Computational Optimization and Applications, 24 , 187–206.
Henrion, R., Küchler, C., & Römisch, W. (2009). Scenario reduction in stochastic
programming with respect to discrepancy distances. Computational Optimization
and Applications, 43 , 67–93.
Higle, J. (1998). Variance reduction and objective function evaluation in stochastic
linear programs. INFORMS Journal on Computing, 10 , 236–247.
Higle, J., & Sen, S. (1991a). Statistical verification of optimality conditions for
stochastic programs with recourse. Annals of Operations Research, 30 , 215–240.
Higle, J., & Sen, S. (1991b). Stochastic decomposition: an algorithm for two-stage
linear programs with recourse. Mathematics of Operations Research, 16 , 650–669.
Higle, J., & Sen, S. (1996a). Duality and statistical tests of optimality for two stage
stochastic programs. Mathematical Programming, 75 , 257–272.
Higle, J., & Sen, S. (1996b). Stochastic Decomposition: A Statistical Method for Large
Scale Stochastic Linear Programming. Kluwer Academic Publishers, Dordrecht.
Higle, J., & Sen, S. (1999). Statistical approximations for stochastic linear programming problems. Annals of Operations Research, 85 , 173–192.
Homem-de-Mello, T. (2003). Variable-sample methods for stochastic optimization.
ACM Transactions on Modeling and Computer Simulation, 13 , 108–133.
Homem-de-Mello, T. (2008). On rates of convergence for stochastic optimization
problems under non-i.i.d. sampling. SIAM Journal on Optimization, 19 , 524–551.
Homem-de-Mello, T., de Matos, V. L., & Finardi, E. C. (2011). Sampling strategies
and stopping criteria for stochastic dual dynamic programming: a case study in
long-term hydrothermal scheduling. Energy Systems, 2 , 1–31.
Infanger, G. (1992). Monte Carlo (importance) sampling within a Benders decomposition algorithm for stochastic linear programs. Annals of Operations Research,
39 , 69–95.
Keller, B., & Bayraksan, G. (2010). Scheduling jobs sharing multiple resources under
uncertainty: A stochastic programming approach. IIE Transactions, 42 , 16–30.
Kenyon, A., & Morton, D. (2003). Stochastic vehicle routing with random travel
times. Transportation Science, 37 , 69–82.
Kim, S.-H., & Nelson, B. L. (2001). A fully sequential procedure for indifference-zone
selection in simulation. ACM Transactions on Modeling and Computer Simulation,
11 , 251–273.
Kim, S.-H., & Nelson, B. L. (2006). On the asymptotic validity of fully sequential
selection procedures for steady-state simulation. Operations Research, 54 , 475–488.
King, A., & Rockafellar, R. (1993). Asymptotic theory for solutions in statistical
estimation and stochastic programming. Mathematics of Operations Research, 18 ,
148–162.
Koivu, M. (2005). Variance reduction in sample approximations of stochastic programs. Mathematical Programming, 103 , 463–485.
Kolmogorov, V. (2009). Blossom V: a new implementation of a minimum cost perfect
matching algorithm. Mathematical Programming Computation, 1 , 43–67.
Lan, G., Nemirovski, A., & Shapiro, A. (2012). Validation analysis of mirror descent
stochastic approximation method. Mathematical Programming, 134 , 425–458.
Law, A. (2007). Simulation Modeling and Analysis, 4th ed . McGraw-Hill, New York.
Lemieux, C. (2009). Monte Carlo and Quasi-Monte Carlo Sampling. Springer, New
York.
Linderoth, J., Shapiro, A., & Wright, S. (2006). The empirical behavior of sampling
methods for stochastic programming. Annals of Operations Research, 142 , 215–
241.
Loh, W. L. (1996). On Latin hypercube sampling. The Annals of Statistics, 24 ,
2058–2080.
Mak, W., Morton, D., & Wood, R. (1999). Monte Carlo bounding techniques for
determining solution quality in stochastic programs. Operations Research Letters,
24 , 47–56.
McKay, M., Conover, W., & Beckman, R. (1979). A comparison of three methods for
selecting values of input variables in the analysis of output from a computer code.
Technometrics, 21 , 239–245.
Mehlhorn, K., & Schäfer, G. (2002). Implementation of O(nm log n) weighted matchings in general graphs: the power of data structures. Journal on Experimental
Algorithmics, 7 .
Morton, D. (1998). Stopping rules for a class of sampling-based stochastic programming algorithms. Operations Research, 46, 710–718.
Mulvey, J., & Ruszczyński, A. (1995). A new scenario decomposition method for
large scale stochastic optimization. Operations Research, 43 , 477–490.
Nadas, A. (1969). An extension of a theorem of Chow and Robbins on sequential
confidence intervals for the mean. Annals of Mathematical Statistics, 40 , 667–671.
Norkin, V., Pflug, G., & Ruszczyński, A. (1998). A branch and bound method for
stochastic global optimization. Mathematical Programming, 83 , 425–450.
Owen, A. B. (1992). A central limit theorem for Latin hypercube sampling. Journal
of the Royal Statistical Society, Series B , 54 , 541–551.
Owen, A. B. (1997). Monte Carlo variance of scrambled net quadrature. SIAM
Journal on Numerical Analysis, 34 , 1884–1910.
Partani, A. (2007). Adaptive Jackknife Estimators for Stochastic Programming. Ph.D.
thesis, The University of Texas at Austin.
Partani, A., Morton, D., & Popova, I. (2006). Jackknife estimators for reducing bias
in asset allocation. In Proceedings of the 2006 Winter Simulation Conference, (pp.
783–791).
Pasupathy, R. (2010). On choosing parameters in retrospective-approximation algorithms for stochastic root finding and simulation optimization. Operations Research, 58, 889–901.
Pennanen, T., & Koivu, M. (2005). Epi-convergent discretizations of stochastic programs via integration quadratures. Numerische Mathematik , 100 , 141–163.
Pflug, G. C. (1988). Stepsize rules, stopping times and their implementations in
stochastic quasigradient algorithms. In Y. Ermoliev, & R. Wets (Eds.) Numerical
Techniques for Stochastic Optimization, (pp. 353–372). Springer-Verlag, Berlin.
Pierre-Louis, P. (2012). Algorithmic Developments in Monte Carlo Sampling-Based
Methods For Stochastic Programming. Ph.D. thesis, The University of Arizona.
Pierre-Louis, P., Morton, D., & Bayraksan, G. (2011). A combined deterministic and
sampling-based sequential bounding method for stochastic programming. In Proceedings of the 2011 Winter Simulation Conference, (pp. 4172–4183). Piscataway,
New Jersey.
Polak, E., & Royset, J. (2008). Efficient sample sizes in stochastic nonlinear programming. Journal of Computational and Applied Mathematics, 217 , 301–310.
Rachev, S. T. (1991). Probability Metrics and the Stability of Stochastic Models.
Wiley, New York.
Rockafellar, R., & Wets, R.-B. (1998). Variational Analysis. Springer-Verlag, Berlin.
Römisch, W. (2003). Stability of stochastic programming problems. In A. Ruszczyński, & A. Shapiro (Eds.) Handbooks in Operations Research and Management Science, Vol. 10: Stochastic Programming, (pp. 483–554). Elsevier, Amsterdam.
Royset, J. (2013). On sample size control in sample average approximations for solving
smooth stochastic programs. Computational Optimization and Applications, 55 ,
265–309.
Royset, J., & Szechtman, R. (2013). Optimal budget allocation for sample average
approximation. Operations Research, 61 , 762–776.
Rubinstein, R. Y., & Shapiro, A. (1993). Discrete Event Systems: Sensitivity and
Stochastic Optimization by the Score Function Method . John Wiley & Sons, Chichester.
Ruszczyński, A. (1986). A regularized decomposition method for minimizing a sum
of polyhedral functions. Mathematical Programming, 35 , 309–333.
Ruszczyński, A., & Świetanowski, A. (1997). Accelerating the regularized decomposition method for two stage stochastic linear problems. European Journal of
Operational Research, 101 , 328–342.
Santoso, T., Ahmed, S., Goetschalckx, M., & Shapiro, A. (2005). A stochastic programming approach for supply chain network design under uncertainty. European
Journal of Operational Research, 167 , 96–115.
Sen, S., Doverspike, R., & Cosares, S. (1994). Network planning with random demand.
Telecommunication Systems, 3 , 11–30.
Shao, J., & Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, New York.
Shapiro, A. (1989). Asymptotic properties of statistical estimators in stochastic programming. The Annals of Statistics, 17 , 841–858.
Shapiro, A. (2003). Monte Carlo sampling methods. In A. Ruszczyński, & A. Shapiro
(Eds.) Handbooks in Operations Research and Management Science, Vol. 10:
Stochastic Programming, (pp. 353–425). Elsevier, Amsterdam.
Shapiro, A., & Homem-de-Mello, T. (1998). A simulation-based approach to two-stage
stochastic programming with recourse. Mathematical Programming, 81 , 301–325.
Shapiro, A., & Homem-de-Mello, T. (2000). On rate of convergence of Monte Carlo
approximations of stochastic programs. SIAM Journal on Optimization, 11 , 70–86.
Shapiro, A., Homem-de-Mello, T., & Kim, J. (2002). Conditioning of convex piecewise
linear stochastic programs. Mathematical Programming, 94 , 1–19.
Shapiro, A., & Nemirovski, A. (2005). On complexity of stochastic programming
problems. In V. Jeyakumar, & A. Rubinov (Eds.) Continuous Optimization: Current Trends and Applications, (pp. 111–144). Springer, Berlin.
Stockbridge, R., & Bayraksan, G. (2012). A probability metrics approach for reducing
the bias of optimality gap estimators in two-stage stochastic linear programming.
Mathematical Programming. doi:10.1007/s10107-012-0563-6.
Verweij, B., Ahmed, S., Kleywegt, A., Nemhauser, G., & Shapiro, A. (2003). The
sample average approximation method applied to stochastic vehicle routing problems: a computational study. Computational Optimization and Applications, 24 ,
289–333.
Wagner, R. (2009). Mersenne twister random number generator. http://svn.seqan.de/seqan/trunk/core/include/seqan/random/ext_MersenneTwister.h. Last accessed June 3, 2013.
Wallace, S., & Ziemba, W. (Eds.) (2005). Applications of Stochastic Programming.
MPS-SIAM Series in Optimization, Philadelphia.
Wets, R. (1983). Stochastic programming: solution techniques and approximation
schemes. In A. Bachem, M. Grötschel, & B. Korte (Eds.) Mathematical Programming: The State of the Art (Bonn 1982), (pp. 560–603). Springer-Verlag, Berlin.