Numerical Optimization and Empty Set Problem in
Application of Empirical Likelihood Methods
by
Ashley Elizabeth Askew
(Under the Direction of Nicole A. Lazar)
Abstract
Empirical likelihood is a nonparametric method based on a data-driven likelihood. The
flexibility of empirical likelihood facilitates its use in complex settings, which can in turn
create computational challenges. Additionally, the Empty Set Problem (ESP), which arises
with the Empirical Estimating Equations approach, can pose problems with estimation, as
the data are unable to meet the constraints when the true parameter lies outside the convex hull.
As an alternative to the Newton and quasi-Newton methods conventionally used in empirical likelihood, this dissertation develops and examines various Evolutionary Algorithms for
global optimization of empirical likelihood estimates, and compares ESP and non-ESP data sets on an overdetermined problem. Finally, we carry out a preliminary
application of composite empirical likelihood methods, noting the impact of the ESP on the
subsets of data, and compare the results obtained on all possible combinations against those
from convergent subsets only.
Index words: Empirical Likelihood, Estimating Equations, Empty Set Problem, Evolutionary Algorithm, Composite Likelihood
Numerical Optimization and Empty Set Problem in
Application of Empirical Likelihood Methods
by
Ashley Elizabeth Askew
B.S., Clayton State University, 2006
M.S., University of Georgia, 2008
A Dissertation Submitted to the Graduate Faculty
of The University of Georgia in Partial Fulfillment
of the
Requirements for the Degree
Doctor of Philosophy
Athens, Georgia
2012
© 2012
Ashley Elizabeth Askew
Numerical Optimization and Empty Set Problem in
Application of Empirical Likelihood Methods
by
Ashley Elizabeth Askew
Approved:

Major Professor:  Nicole A. Lazar

Committee:        Jeongyoun Ahn
                  William P. McCormick
                  Cheolwoo Park
                  Lynne Seymour

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
December 2012
Numerical Optimization and Empty Set
Problem in Application of Empirical
Likelihood Methods
Ashley Elizabeth Askew
December 3, 2012
Acknowledgments
The path to this accomplishment has been and continues to be an amazing journey, one
where each step has been possible because of blessings and the impact of so many. First,
I thank God and my Lord Jesus Christ for the strength to come this far, to celebrate the
joys, to overcome trials along the way, and to embark on this dream, wherever the next step
takes me. Innumerable blessings have helped me to overcome difficulties and to “hold fast”
while aspiring for this degree.
My family has been a wellspring of support, specifically my grandmother and grandfather,
who raised me as their own, and my aunt. My grandmother, also known as “Mama One”
to me, has been there for me through the hard times and the triumphs, and without her, I
would not have made it this far. My grandfather and “poker bud,” Paw-Paw, is in my heart
and was an amazing source of support in the time we had together. While I miss him dearly,
his influence continues to live on because of the encouragement he gave me in so many ways,
whether it was a bear hug or a note with kind words of love. My aunt has also been an
inspirational person in my life, providing emotional support and helping out at our family
home during my time at university.
I am appreciative of the experience of being a part of the Department of Statistics here
at the University of Georgia. I would like to convey my deepest gratitude for the influence of
my faculty advisor, Dr. Nicole Lazar, and the other faculty members who have been a part
of my education, whether as a committee member or while imparting valuable statistical
or mathematical knowledge. Dr. Lazar’s great patience and understanding helped me as I
was going through some rough times, and her excellence as a researcher and mentor shone
through in our collaboration over the years. I appreciate very much the opportunity to have
worked with her on a one-to-one basis, to gain knowledge under her guidance. My thanks
also go to the committee members, who have graciously given of their time regarding their
role in overseeing this dissertation work: Dr. Jeongyoun Ahn, Dr. Cheolwoo Park, Dr. Bill
McCormick, and Dr. Lynne Seymour. I am also thankful to Dr. J. Michael Bowker, for the
invaluable experience and refined skills I have gained through the research projects we have
worked on over the course of four years and counting. The staff and so many fellow students
have also helped to make my experience memorable and enjoyable.
Contents

Acknowledgments

List of Figures

List of Tables

1 Fundamental Concepts of Empirical Likelihood
   1.1 Introduction
   1.2 Basic Definitions
   1.3 Likelihood Ratio Functions
   1.4 Empirical Likelihood for Univariate Mean
   1.5 Estimating Equations and Extensions of Vector Empirical Likelihood Theorem
   1.6 Empty Set Problem

2 Empty Set Problem
   2.1 Empty Set Problem, Example
   2.2 Visualizing the Empty Set Problem
   2.3 Conclusions of Empty Set Problem

3 Evolutionary Algorithms in Empirical Likelihood Estimation
   3.1 Optimization: Core Concepts and Issues
   3.2 Numerical Optimizers in R: Newton and Quasi-Newton Methods
   3.3 Evolutionary Algorithms
   3.4 Applications in Empirical Likelihood: Preliminary Algorithm Implementation
   3.5 EA versus Newton-Raphson in R: Fisher's Iris Data

4 Composite Likelihood Functions and Inference
   4.1 Motivation
   4.2 Definitions

5 Conclusions and Directions for Future Research
   5.1 Conclusions
   5.2 Directions for Future Research

References
List of Figures

1.1 Visualization of a non-convex and convex set

2.1 The first of four non-ESP data sets of n = 15 drawn from N(0,1) with the constraints of ESP example, µ = (µ0, 2µ0² + 1)

2.2 The second of four non-ESP data sets of n = 15 drawn from N(0,1) with the constraints of ESP µ = (µ0, 2µ0² + 1)

2.3 The third of four non-ESP data sets of n = 15 drawn from N(0,1) with the constraints of ESP µ = (µ0, 2µ0² + 1)

2.4 The last of four non-ESP data sets of n = 15 drawn from N(0,1) with the constraints of ESP µ = (µ0, 2µ0² + 1)

2.5 The ECDF of the four data sets based on weights created from MELE statistic µ̂_MELE

2.6 Convex hulls generated for four data sets based on n = 10 elements independently drawn from N(0,1). Because the data are two-dimensional (Xi and Xi², i = 1, . . . , n), the convex hull plot becomes a convenient visual representation of the ESP. The true mean of µ = (µ0, 2µ0² + 1) = (0, 1) is indicated in red; the convex hull in (a) fails to contain this point. The other three samples, (b)-(d), capture the true mean, though this point appears in various locations relative to the space spanned by the data.
2.7 We compute all possible k-tuples from the data set of n = 10 and divide the plots by their convergence status. This plot is from the computation of the 4-tuples, where those that converged produced the pattern of the true mean being inside the convex hull. The use of four points leads to quite different shapes of regions, from the rather narrow (as in row C and column (2)) to a comparatively large quadrilateral (as in row E and column (2), where the points were relatively far apart). The shapes depend not only on the k-tuple used but also on the data themselves.

2.8 Again, to supplement Figure 2.7, we examine all possible k-tuples from the data set of n = 10 and divide the plots by their convergence status. This is another plot from the computation of the 4-tuples, wherein those that failed to converge showed the pattern of the true mean point being outside the convex hull created from the data. The use of four points leads to quite different shapes of regions, from the very narrow (as in row B and column (3), in which it would be very difficult to capture the true mean) to a comparatively large quadrilateral (as in row E and column (1), where the points were relatively far apart). The shapes depend on the k-tuple used and also the data themselves. Data points that are close together will produce very narrow convex hulls as opposed to those that are farther apart, thus creating more coverage.

2.9 For the data set from N(0,1) with n = 10 and two estimating equations, we take all possible 3-tuples to 9-tuples, and plot the convex hulls for those convergent groups. The denser areas result from overlapping the convex hulls generated from convergent sets.
3.1 The basic steps of the (α, β)-Evolutionary Strategy algorithm can be visualized as a flow chart. The initial population is used as a starting point. At each generation, the groups of parents are subdivided and mated via a rule established in advance. A subset of the offspring undergoes mutation, and these mutants are entered into a pool along with the parents, the non-mutated offspring, and newly generated immigrants. We find the top α candidates according to the objective function, where the one with the best fitness is the potential solution, and if convergence or the maximum number of iterations has been reached, then we terminate the algorithm with the appropriate output. Otherwise, we take the top α individuals as the new generation and repeat the process. In the event that all solutions have stagnated to the same value, we maintain one of the vectors as a candidate and refresh the other α − 1 from the same distribution that generated the initial population.

3.2 In the Gaussian mutation, we use a vector of values drawn from a convenient normal distribution, N(0, σ²). Here, we randomly select a number of mutants from the pool of offspring (not the parents), and a random vector of standard Gaussian noise is added.
3.3 For a vector that is randomly chosen for uniform mutation, we draw the value for the mutant component from a continuous uniform distribution with limits to be determined as follows. First, we draw a lower and an upper limit from a discrete uniform distribution of possible integers from 1 to 100. Denote these draws of the random variables as LB_i = lb_i (obtained by multiplying the randomly selected limit value by -1) and UB_i = ub_i, where these correspond to the ith mutant. Then, we draw the mutant element from U(lb_i, ub_i), where the lower and upper bounds are not necessarily equal. This incorporates an additional level of randomness, as opposed to selecting all values from the same fixed limits characterizing the distribution.

3.4 The one-point crossover, shown here, produces a pair of offspring by operating on the genotypes of parameter vectors A and B. Along the dimension of the vector, an index is randomly chosen as a split point, and we create offspring by alternating the sequences preceding and following the cut point. Though it is possible to keep both offspring, in order to manage the pool of candidates, we randomly choose one of the offspring to go into the pool, along with the parents.

3.5 The uniform crossover generates a single offspring (or more if desired) based on two parents A and B through a sequence of values drawn from U(0, 1). If the value meets a predetermined threshold, such as 0.5, then the genotype at a given index is inherited from A; otherwise, it will come from B. This continues along the length of the vector until the offspring has been randomly generated.
3.6 With the guaranteed average crossover, we take the constant a = 1/2 as the multiplier for both parents, for which the summation produces an offspring. Each element from A is averaged with its corresponding element in B, thereby producing a child that represents a mean of the features contained in A and B.

3.7 The previous crossovers do not incorporate a decision-making step that helps to preserve the desirable features of A or B; they rely on randomness to generate offspring, with the hopes of diversifying the pool enough to encounter a desirable candidate. The generation process tends to preserve the desirable features when the objective function is evaluated, but in the heuristic crossover, this step is supplemented by the screening of parents at each stage of producing offspring. We distinguish which parent has the more favorable objective function (comparing f(A) to f(B)) and construct the new sequence based on a randomly drawn value from U(0, 1). Randomness is incorporated by the scalar multiplier r ∈ [0, 1]. We produce a child of the form C = r(A − B) + A for f(A) ≥ f(B). If we have f(A) ≤ f(B), then we will switch the ordering of A and B in the formula for the child.

3.8 Boxplot of ELR statistics from Newton-Raphson and preliminary application of EA with 1000 simulations of N(0,1) with n = 15

3.9 Plots of ELR statistics for Newton-Raphson and Preliminary EA by ESP status

3.10 Plots of constraints (C1), (C2), and (C3) for data sets (all, ESP, and non-ESP) for Newton-Raphson and EA preliminary results
3.11 The plots of the ELR statistics from Newton-Raphson against the BFGS estimation in el.test.bfgs for µ0 = −0.5, 0, and 0.5 in (a)–(c), respectively, along with the line of slope 1 and intercept 0, show the correspondence of the two outputs. The values in the upper right corner are those that failed to converge under both algorithms, though these receive a larger value under Newton-Raphson.

3.12 All ELR statistics for valid Evolutionary Algorithms by mean and Newton-Raphson from R for ESP example

3.13 ELR statistics from valid Evolutionary Algorithms and Newton-Raphson from R for ESP data sets at µ0 = −0.5, 0, and 0.5

3.14 ELR statistics from valid Evolutionary Algorithms and Newton-Raphson from R for non-ESP data sets at µ0 = −0.5, 0, and 0.5

3.15 The mean absolute violations for constraint (C1), Σ wi = 1, across three hypothesized means µ0 = −0.5, 0, and 0.5

3.16 The mean absolute violations for constraint (C2), Σ wi xi = µ0, across three hypothesized means µ0 = −0.5, 0, and 0.5

3.17 The mean absolute violations for constraint (C3), Σ wi xi² = 2µ0² + 1, across three hypothesized means µ0 = −0.5, 0, and 0.5

3.18 The mean running time (in seconds) for Evolutionary Algorithm and Newton-Raphson analyses across three hypothesized means µ0 = −0.5, 0, and 0.5

3.19 The mean number of iterations for Evolutionary Algorithm and Newton-Raphson analyses across three hypothesized means µ0 = −0.5, 0, and 0.5

3.20 An example of a solution process for an Evolutionary Algorithm, specifically the heuristic crossover with normal mutation
3.21 A visualization of the convex hull and assigned weights to data points, demonstrating how points farther away from the true mean, indicated by the star at (0,1), receive a lower weight than those with a smaller absolute deviation

3.22 The scatterplot of a subset of Fisher's iris data shows the relationship between the sepal width and length for 50 observations from the Iris setosa species. The convex hull is shown outlining a boundary enclosing the points.

3.23 Plot of ELR statistics for Fisher's iris data for six algorithms

3.24 Plot of total time for Fisher's iris data for six algorithms

3.25 Plot of total iterations for Fisher's iris data for six algorithms

3.26 Plot of summation of weights produced by the six algorithms

3.27 Plot of sepal length means for Fisher's iris data, which has a sample mean of 5.006, based on EL weights for six algorithms

3.28 Plot of sepal width means for Fisher's iris data, which has a sample mean of 3.428, based on EL weights for six algorithms
4.1 A histogram of the data set with n = 15 from N(0, 1)

4.2 The ELR statistics for a data set with n = 15 from N(0, 1) by full and composite empirical likelihood computed on k-tuples, k = 3, . . . , 14, are divided into all summations (a) and only convergent summations (b).

4.3 The individual plots for ELR statistics for k-tuples (2 < k < n) and the full likelihood on all possible (n choose k) combinations of the data

4.4 The individual plots for ELR statistics for k-tuples (2 < k < n) and the full likelihood on convergent subsets of the data

4.5 The plot shows the total running time in seconds for a data set with n = 15 elements from N(0, 1), as we use it to estimate the full and composite empirical likelihood computed on k-tuples, k = 3, . . . , 14. Since we know that the data sets with ESP will fail to produce a solution, we do not add the running time for any subset with the ESP. However, for those non-ESP subsets that produced a solution, we include their running time, since the decision to exclude them was made after the attempt to optimize.

4.6 For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of the first-order mean for all possible k-tuples (k = 3, . . . , 14) and the full empirical likelihood using a data set with n = 15 from N(0, 1).

4.7 For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of the first-order mean for only convergent k-tuples (k = 3, . . . , 14) and the full empirical likelihood using a data set with n = 15 from N(0, 1).

4.8 For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of the second-order mean for all k-tuples (k = 3, . . . , 14) and the full empirical likelihood using a data set with n = 15 from N(0, 1).

4.9 For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of the second-order mean for only convergent k-tuples (k = 3, . . . , 14) and the full empirical likelihood using a data set with n = 15 from N(0, 1).

4.10 The visualization of the estimator µ0 with respect to the ESP status by k-tuple composite likelihood (2 < k < n) and full empirical likelihood reveals that, by k = 4, the behaviors of the ESP and non-ESP curves are quite similar, due to the reduced percentage of data sets with the ESP.

4.11 The k = 7 case produces the following histograms based on the component empirical likelihoods for all possible seven-point combinations of the data. We examine the effect of all estimators, only those from the ESP sets, and only those from the non-ESP sets, as determined by the ESP criterion for this example.
List of Tables

3.1 Table of 24 Evolutionary Algorithms under consideration for ESP example

4.1 The values of µ for which the ELR statistic curve for the k-wise empirical composite likelihood reaches a minimum on a grid spaced from -2.5 to 4.5 in increments of 0.01, under two computational scenarios: one in which the ELR statistic for every subset is included in the summation and another for which only convergent subsets contribute to the ELR statistic overall

4.2 Table of empirical composite likelihood estimators of µ for the ESP example, subdivided into summations taken over all values, on non-ESP sets, and on ESP sets, along with standard deviations of the estimators across all k-tuples used to derive the estimate
1 Fundamental Concepts of Empirical Likelihood

1.1 Introduction
In applying statistical methods to a data set for estimation and/or inference, an essential step
is the specification of the assumptions that will be made. Such assumptions may include
conditions on the error component and, more importantly, whether a parametric family
will be used for explaining the underlying mechanism that generated the data. Parametric
methods are underscored by established theorems and asymptotic behaviors. As a result,
it is possible to compare the theorized population parameter(s) and the estimates from
the data through confidence interval estimation or hypothesis testing. Another advantage
to specifying a parametric family is that one may formulate the likelihood, which permits
efficient estimation and powerful tests (Casella and Berger, 2002). The likelihood also allows
for the accommodation of unusual data properties, such as biased sampling or incomplete
observations, and such external information as a limited domain for the likelihood function
as a result of constraints (Owen, 2001).
A widely used parametric family is the Gaussian distribution, which is also found in many
theorems pertaining to test statistics and asymptotic properties. The central limit theorem
states that, though the underlying distribution may be non-Normal, a large sample size permits the testing of a mean to proceed under the Normal properties of the test statistics under
the null hypothesis. This allows us to have a more robust test by relaxing the distributional
assumption through the central limit theorem. A method is said to be robust against an
assumption when the failure of the data to satisfy that assumption does not invalidate the
analysis completely. Having fewer assumptions produces a more robust test, since there are
not as many conditions to meet in order to validate the analysis/inference.
A parametric family that is specified correctly results in optimum efficiency and most
powerful tests. However, the problem with parametric methods is that there is no guarantee
that the correct distribution has been used, and oftentimes, there are no clear indications
as to which family is most appropriate. Rather, one can only determine whether the data
depart significantly from what is expected by the parametric family. However, making a
definitive statement about a data set’s underlying distribution with a parametric family
results in the potential for misspecification and invalid inferences. Furthermore, even if
we know which parametric family is appropriate, it may be too complex computationally in
terms of representation or time. In that case, one may utilize nonparametric methods, which
are less powerful than a correctly chosen parametric family, though more powerful than an
incorrect specification. A primary disadvantage of nonparametric methods is the loss of
power through less information, such as with the sign test based on the count of positive
differences for a large-sample matched pairs experiment. Though the test statistic can be
used to outline a rejection region based on a Poisson distribution, we lose information on the
magnitude of the differences by considering only the sign of the paired differences (Conover,
1999). However, as Tukey wrote, “Far better an approximate answer to the right question,
which is often vague, than an exact answer to the wrong question, which can always be made
precise” (Tukey, 1962, p. 13-14).
The nonparametric and parametric approaches need not be a dichotomy; there exists
a method which incorporates elements of each and presents an attractive alternative: the
empirical likelihood. The empirical likelihood approach has the advantage of not requiring a
specified parametric distribution, instead using the observations to construct a data-driven
likelihood while maintaining robustness. This approach also allows one to utilize the many
properties gained from likelihood theory, including asymptotic properties, confidence interval
estimation, and inference, with more power than those from nonparametric methods. In nonparametric statistics, though we may construct confidence intervals and carry out hypothesis
testing, we have traded power for flexibility (Owen, 2001). In using empirical
likelihood, we see an improvement in power over other nonparametric methods, which is the
primary advantage, while not requiring the knowledge of the underlying distribution (Owen,
2001).
Furthermore, the ability of empirical likelihood to accommodate constraints on parameters and/or data, unconventional or distorted sampling, and partial knowledge of the underlying distribution also makes it a very desirable method of analysis. Examples where the
empirical likelihood is useful include data with missing observations or records, as well as the
case where a discrepancy exists between the source data’s distribution and what was to be
used for inference (Owen, 2001). Empirical likelihood may also be used to analyze a sample
that consists of two data sets where one’s distribution is known and the other unknown, or
it may be used when data are combined from various sampling schemes, each potentially
with its own challenges. An example of the former would be the comparison of an older,
established process whose behavior is well known or well-modeled with a new or modified
process that is under study (Owen, 2001). Another setting in which empirical likelihood may
be useful is where the data do not seem to match one of the well-known or easily computed
parametric families. If the data exhibit skewness and kurtosis, it would be difficult to select
a parametric family that is flexible in terms of these two characteristics.
Owen (1988), building on the work of Thomas and Grunkemeier (1975), facilitated the
development of empirical likelihood theory. Empirical likelihood links many parametric concepts to a nonparametric form through a data-based likelihood. For instance, empirical
likelihood is Bartlett-correctable, as shown in DiCiccio, Hall, and Romano (1991), and it
is also the only member of the power-divergence family to feature this property (Baggerly,
1998). In other words, we may achieve an improvement in coverage accuracy via a “simple
correction” (DiCiccio et al., 1991, p. 1053). Bartlett corrections are quite useful and may be
applied to any parameter that may be expressed as a function of the mean. Hall (1990) discusses a location adjustment for empirical likelihood for correctness of regions to the second
order; the subsequent adjusted regions are also Bartlett-correctable, though an alternative
formulation is necessary because of the location adjustment.
In the years since its introduction, various aspects of empirical likelihood in the analytical setting have been explored, such as in analysis of missing data (Wang and Rao, 2002),
weakly dependent processes (Kitamura, 1997), and generalized linear models (GLMs) (Kolaczyk, 1994). If a specified model form is desired, then we may use empirical likelihood
as a tool to ease computations. For GLMs with complicated structures and distributional
assumptions, it becomes impractical and challenging to use the parametric counterpart. Kolaczyk (1994) shows the benefits of empirical likelihood for GLMs when compared to the
parametric approach. The empirical likelihood’s ability to handle constraints makes it a
powerful tool, such as when modeling dispersion in addition to the mean, link, and variance
functions.
Because of the flexibility and robustness of empirical likelihood, research in the field has
shown a progression of complexity and variety of applications, with still more uncharted
territory for future studies. As an example of such development, Owen’s introduction of
empirical likelihood began in the context of independent and identically distributed data. In
any statistical setting, results and procedures become more generalizable (at the cost of potentially decreased power) with less stringent assumptions, thereby broadening the potential
to appear in even more fields. In the event that the assumption of identically distributed
data is not met, empirical likelihood may be used with independent data, according to the
Triangular Array Empirical Likelihood Theorem with some regularity conditions (Owen,
2001). Furthermore, if the data fail to be independent, it is possible to adapt the empirical
likelihood approach to account for the dependence structure, much like in bootstrapping and
parametric theory (Owen, 2001). Though there are many approaches–and likely more still
to be discovered–one way to handle dependence is to view the data as functions of independent variables; one strategy is to utilize appropriate estimating equations. Another stems
from the close relationship between the bootstrap and empirical likelihood, a connection by
which block, sieve, and local bootstrapping may be integrated into the empirical likelihood
theory with suitable adjustments (Wu and Cao, 2011). Hall (1990) proves that the difference between confidence regions based on empirical likelihood and bootstrap likelihood is to
the second order, though their foundational distributions (when centered) agree to second
order. Rao (2006) discusses advancements made in the application of empirical likelihood to
survey data since the publication of Hartley and Rao (1968), which termed it the scale-load
approach.
Much of the research in time series within the empirical likelihood framework had operated under the assumptions of stationarity and the study of observations from a continuous
scale. Wu and Cao (2011) utilize blocking methods in studying time series based on count
data, arriving at a maximum blockwise empirical likelihood estimator. This is just one
example of the progression of the complexity from the core of empirical likelihood to the
ongoing potential for expanding the theory.
The empirical likelihood method of analysis or inference is not without its disadvantages
or problems, however. A significant issue is the computational complexity of maximizing
certain objective functions with respect to constraints. Furthermore, concepts that are used
as building blocks in empirical likelihood theory may have their own issues or complexities
that resurface after the appropriate modifications or applications are made. An example is
how the convex hull duality problem carries over and reappears in a more complex form,
better known as the Empty Set Problem. This interesting problem was first discussed in the
empirical likelihood context by Grendár and Judge (2009); we discuss it in further detail
in Section 1.6. In this dissertation, we seek to provide an alternative estimation procedure
that is not based on derivatives. To this end, we develop an Evolutionary Algorithm and
compare it to the standard methods in use. We also investigate how the Empty Set Problem
affects the estimation of weights in certain circumstances. As composite likelihood is used
as a computational tool to simplify estimations, we examine the adaptation of composite
likelihood to empirical likelihood and the subsequent implications of the Empty Set Problem.
1.2 Basic Definitions
As reflected in the name, a core concept in empirical likelihood theory is indeed the empirical
distribution computed on the data. The usual cumulative distribution function (CDF),
defined for a random variable X ∈ R, is F (x) = Pr(X ≤ x), where the support of x is
(−∞, ∞).
Definition 1. Given a set of random variables X1 , X2 , . . . , Xn ∈ R and x ∈ (−∞, ∞), the
empirical cumulative distribution function (ECDF) is
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{X_i \le x}.$$
F (x−) represents the probability that the random variable X is strictly less than x.
One can therefore write Pr(X = x) = F (x) − F (x−). Using similar notation, one may
define the nonparametric likelihood of a CDF F given independent X1 , . . . , Xn ∈ R and
Xi ∼ F, i = 1, . . . , n as follows:
$$L(F) = \prod_{i=1}^{n} \left(F(X_i) - F(X_i-)\right).$$
In order for the nonparametric likelihood to have non-zero values, each data value must
receive non-zero probability in accordance to the distribution F . If the distribution is continuous, then L(F ) = 0 by definition. An additional property of the ECDF is that it is
the nonparametric maximum likelihood estimator (NPMLE) of the distribution F (Kaplan
and Meier, 1958). For X1 , X2 , . . . , Xn ∈ R, suppose the underlying (but usually unknown in
practice) distribution function is F0 . For the ECDF computed on the sample, denoted by
Fn , comparing this to any general CDF F (where F ≠ Fn ), the result is that L(F ) < L(Fn ).
Thus, Fn maximizes the nonparametric likelihood. The invariance property of maximum
likelihood with respect to functions holds for the NPMLE as well. Given a function T on
the real numbers that operates on distributions and a parameter of interest θ = T (F ), the
NPMLE of θ is θ̂ = T (Fn ). Since the true parameter, θ0 = T (F0 ), is not known, the NPMLE
may be used for estimation given the information available from the data. In the nonparametric setting, this plug-in rule refers to the estimation of a distribution F with its empirical
distribution function Fn . On the other hand, if we assume a parametric distribution F (x; θ),
the bootstrap or plug-in rule produces an estimate of the function at the maximum likelihood
estimate, θ̂M LE , or F (x; θ̂M LE ) (Davison, Hinkley, and Young, 2003). This plug-in rule is at
the core of the bootstrap method: the ability to estimate a distribution F ∗ via the following
procedure. First, build the sample probability distribution F̂ with fixed mass 1/n on each
xi , i = 1, . . . , n, and randomly sample n elements with replacement from the set of xi to
construct Xi∗ = x∗i , where Xi∗ ∼ F̂ . The sampling distribution of R, some random variable
we have fixed that is a function of X and the unknown distribution F , may be constructed
as R∗ = R(X ∗ , F̂ ), for X ∗ = (x∗1 , . . . , x∗n ) (Efron, 1979).
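To make the plug-in rule concrete, here is a minimal base-R sketch (the data and the choice of the sample mean as the statistic R are arbitrary illustrations, not taken from the dissertation): mass 1/n is placed on each observation, resamples are drawn from that distribution, and the sampling distribution of R* is built up.

```r
# Minimal sketch of the plug-in/bootstrap rule described above.
set.seed(1)
x  <- rnorm(30)                 # observed sample (illustrative only)
Fn <- ecdf(x)                   # the ECDF, i.e., the NPMLE of F
Fn(0)                           # plug-in estimate of Pr(X <= 0)

B <- 2000
R_star <- replicate(B, {
  x_star <- sample(x, length(x), replace = TRUE)  # X* drawn from F-hat
  mean(x_star)                                    # R* = R(X*, F-hat), here the mean
})
quantile(R_star, c(0.025, 0.975))  # bootstrap percentile interval for the mean
```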
1.3 Likelihood Ratio Functions
For further inference with the nonparametric likelihood with respect to a distribution F ,
examine the ratio
$$R(F) = \frac{L(F)}{L(F_n)}.$$
Wilks’ Theorem, with respect to the parametric counterpart of such a statistic, allows one
to construct a confidence interval based on a similarly defined ratio of likelihood functions
(Owen, 2001). Suppose that we are interested in parameter vectors η and θ through the
mapping θ(η). For testing H0 : η = η0 with an estimate η̂, the statistic −2 log(L(η0 )/L(η̂))
converges in distribution to a χ2 distribution as n → ∞. In light of this property, confidence
intervals for θ as a function of η may be constructed; in this case, c is determined from the
χ2 distribution with degrees of freedom equal to the dimension of θ:
$$\{\theta(\eta) : L(\eta) \ge c\, L(\hat{\eta})\}. \tag{1.1}$$
The empirical likelihood ratio, first implemented by Thomas and Grunkemeier in 1975,
may be used in constructing a confidence interval to obtain a range of plausible values for
θ = T (F ) under appropriate regularity conditions (Owen, 1988; 2001). Assume that F
belongs to a set of distributions F. The profile likelihood ratio function is defined as follows:
R(θ) = sup{R(F ) : T (F ) = θ, F ∈ F}.
Here, we profile out the nuisance parameters and focus on the behavior of the statistics at
values for the parameters of interest (Owen, 1990). The confidence interval based on the
statistic from (1.1) returns the set of θ for which R(θ) ≥ r0 , where r0 is a critical value
that is appropriately chosen as discussed below. Once we compute these statistics on a
grid of values for the parameter vector θ, we may use a contouring algorithm to provide a
visualization of the confidence interval/region (Owen, 1990).
Without loss of generality, let us consider Xi ∈ R, where the data are such that Xi ≠ Xj
for i ≠ j; in other words, no observations are tied. The distribution under consideration,
F , computes a probability or weight, denoted by pi , for each observed value of the random
variable Xi ∈ R for i = 1, . . . , n. The weights must satisfy pi ≥ 0 for all i and $\sum_{i=1}^{n} p_i = 1$.
By the definition of likelihood,
$$L(F) = \prod_{i=1}^{n} \Pr(X_i = x_i) = \prod_{i=1}^{n} p_i, \tag{1.2}$$
since each Xi = xi is assumed to be distinct. If ties exist in the data, we have k distinct
values, denoted by zj , j = 1, . . . , k, where there are nj ≥ 1 copies of zj . For each of the zj ,
the estimate of pj is pˆj = nj /n. We then have the likelihood ratio
$$R(F) = \frac{L(F)}{L(F_n)} = \prod_{j=1}^{k} \left(\frac{p_j}{\hat{p}_j}\right)^{n_j} = \prod_{j=1}^{k} \left(\frac{n p_j}{n_j}\right)^{n_j}.$$
Equation (1.2) corresponds to the case where k = n and nj = 1 for j = 1, . . . , n. In the case
where ties exist for some distinct value zj , the probabilities may be subdivided into smaller
components among the random variables that are tied at zj . In other words, the probability
pj may be subdivided into smaller values whose sum is pj for the Xi = zj . These weights
are denoted as wi , so that $\sum_{\{i : X_i = z_j\}} w_i = p_j$. The result is that distributing wi among the
observations leads to the same F as distributing pi . The distinction is important only when
considering computational performance, particularly if methods exist that run more quickly
(i.e., reach a solution more quickly and efficiently) in the event of data with numerous ties.
To maximize each wi in the product of weights over all observations, each weight attains its
maximum when $w_i = p_{j(i)}/n_{j(i)}$, where j(i) represents the index of the elements from the set
such that $X_i = z_{j(i)}$ and
$$\sum_{\{i : X_i = z_{j(i)}\}} w_i = p_j.$$
Thus, maximizing the product of the weights in terms of wi under the distribution F corresponds to
$$\prod_{i=1}^{n} w_i = \prod_{j=1}^{k} \left(\frac{p_j}{n_j}\right)^{n_j} = \frac{\prod_{j=1}^{k} p_j^{n_j}}{\prod_{j=1}^{k} n_j^{n_j}} = \frac{L(F)}{\prod_{j=1}^{k} n_j^{n_j}}.$$
The ratio of the likelihood for the distribution F in theory and Fn in practice may be written
as follows:
$$R(F) = \frac{L(F)}{L(F_n)} = \prod_{i=1}^{n} n w_i,$$
with the constraints that wi ≥ 0 and that the weights must sum to 1. For all repeated
observations equaling some value v (i.e., ties), the weights for the multiple listings of v must
sum to the value that represents the probability distributed to the distinct value v by F .
Now that the weights have been determined, the probability simplex may be defined.
Definition 2. The set of n weights, w1 , . . . , wn , meeting the constraints of nonnegative weights
that sum to 1 is known as the probability simplex:
$$S_{n-1} = \left\{ (w_1, \ldots, w_n) : w_i \ge 0, \ \sum_{i=1}^{n} w_i = 1 \right\}.$$
The dimension of the simplex is n − 1 since one weight may be written as a linear combination
of the others: $w_k = 1 - \sum_{i' \ne k} w_{i'}$.
Extending to the multivariate case, we again assume that there are no ties in the data,
without loss of generality. With the specification of weights wi for each Xi ∈ Rd , i = 1, . . . , n,
the profile empirical likelihood ratio function (ELRF) for a d-dimensional mean vector µ is
computed as follows:
$$R(\mu) = \max\left\{ \prod_{i=1}^{n} n w_i : \sum_{i=1}^{n} w_i X_i = \mu, \ w_i \ge 0, \ \sum_{i=1}^{n} w_i = 1 \right\}. \tag{1.3}$$
Using the confidence interval in (1.1) and extending Wilks’ Theorem to empirical likelihood, we may formulate the Empirical Likelihood Theorem (ELT), where the asymptotic
behavior is based on the profile ELRF multiplied by an appropriate scalar. The motivation
is to test H0 : T (F0 ) = µ0 for some theorized distribution F0 and corresponding parameter,
µ0 . Then, the resulting test rejects any value of µ for which R(µ0 ) < r0 . In this case, r0 is
a critical value chosen according to the ELT, as stated below in general form (Owen, 2001):
Theorem 1.3.1. Vector Empirical Likelihood Theorem (ELT). Suppose the random
variables X1 , X2 , . . . , Xn are independent, and for all i = 1, . . . , n, assume the following:
Xi ∈ Rd , Xi ∼ F0 , E(Xi ) = µ0 , ΣXi = V0 , and rank(V0 ) = q > 0. Define the confidence
region on Rd as
$$C_{r,n} = \left\{ \sum_{i=1}^{n} w_i X_i : \prod_{i=1}^{n} n w_i \ge r, \ w_i \ge 0, \ \sum_{i=1}^{n} w_i = 1 \right\}.$$
Then, Cr,n qualifies as a convex set, and the empirical likelihood ratio statistic
$$W_E(\mu_0) = -2 \log R(\mu_0) \xrightarrow{d} \chi^2_{(q)} \quad \text{as } n \to \infty.$$
The empirical likelihood confidence region for the mean µ is given as follows:
$$\{\mu : R(\mu) \ge r_0\} = \left\{ \sum_{i=1}^{n} w_i X_i : \prod_{i=1}^{n} n w_i \ge r_0, \ w_i \ge 0, \ \sum_{i=1}^{n} w_i = 1 \right\}.$$
The critical value r0 is determined from the χ2 distribution with q degrees of freedom:
$r_0 = \exp(-\chi^2_{q,1-\alpha}/2)$. If the underlying distribution's variance-covariance matrix, V0 , is
of full rank, then q = d. However, if V0 is not of full rank (q < d), then the variables
Xi , 1 ≤ i ≤ n, are examined relative to the smaller subspace, Rq , and the degrees of freedom
are modified appropriately.
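As a small numerical illustration of the theorem (not part of the original text), the sketch below computes W_E(µ0) for a bivariate mean and compares it with the χ² critical value. It assumes the R package emplik is installed; its el.test() function solves the constrained optimization in (1.3).

```r
# Hedged sketch: W_E(mu0) = -2 log R(mu0) for a bivariate mean via emplik.
library(emplik)

set.seed(2)
X   <- cbind(rnorm(50, mean = 0), rnorm(50, mean = 1))  # placeholder data
mu0 <- c(0, 1)

fit <- el.test(X, mu = mu0)
W_E <- fit$`-2LLR`                     # empirical likelihood ratio statistic
W_E <= qchisq(0.95, df = 2)            # TRUE when mu0 lies in the 95% region
r0  <- exp(-qchisq(0.95, df = 2) / 2)  # equivalent threshold on R(mu0)
```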
In one sense, the process of assigning the weights to the observed values of each random
variable Xi is similar to modeling a multinomial distribution, and the larger the sample
size, the more the “representative power of the multinomial improves” (Owen, 1988, p. 238).
Baggerly (1998) likens it to distributing probabilities to n cells of a contingency table, all
while optimizing the goodness of fit.
If the underlying distribution F is discrete, the total number of possible distinct Xi for
which to distribute weights is limited to a countable set as well. Each of the possible values
will be observed at least once when n is large enough, and the empirical likelihood becomes
equivalent to the parametric likelihood, since the entire sample space has been exhausted.
For a distribution F , there are up to n − 1 parameters to distribute among the n observed
values. With a continuous distribution, it is almost impossible to have ties in the data, so if
F is continuous, the number of parameters diverges to infinity as the sample size approaches
infinity. In contrast, in the traditional MLE setting, it is conventional to consider a finite
or bounded number of parameters with the sample size going to infinity. If the count of
parameters also goes to infinity, the difference between the MLE and the true parameter
value is not guaranteed to go to 0.
1.4 Empirical Likelihood for Univariate Mean
Without loss of generality and for a more detailed discussion, let us consider the estimation
of empirical likelihood confidence regions for the univariate mean, µ, or d = 1. The data
consist of independent (X1 , X2 , . . . , Xn ), where Xi ∼ F0 , E(Xi ) = µ0 , i = 1, . . . , n, and X(i)
denotes the ith order statistic of the set. Additionally, assume that the variance of each Xi
is bounded, thus restricting the rate of change for X(1) and X(n) as the sample size increases:
Var(Xi ) ∈ (0, ∞). Testing H0 : µ = µ0 with asymptotic level α is possible according to the
ELT, since $W_E(\mu_0) = -2\log(R(\mu_0)) \to \chi^2_{(1)}$ as n → ∞. This allows the rejection region to
be formulated as the set of µ such that $-2\log R(\mu) > \chi^2_{1,1-\alpha}$. Consequently, the coverage
error behaves in the following manner, as n → ∞:
$$\Pr\left(-2\log R(\mu_0) \le \chi^2_{1,1-\alpha}\right) - (1 - \alpha) \to 0.$$
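A quick Monte Carlo check of this coverage statement can be sketched as follows (again leaning on emplik::el.test as the optimizer; the sample size and number of replications are arbitrary choices).

```r
# Hedged sketch: empirical coverage of the asymptotic EL test for the mean.
library(emplik)

set.seed(3)
mu0  <- 0
nsim <- 1000
hit  <- replicate(nsim, {
  x <- rnorm(25, mean = mu0)
  el.test(x, mu = mu0)$`-2LLR` <= qchisq(0.95, df = 1)
})
mean(hit)    # should be close to 0.95, approaching it as n grows
```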
Returning to the example of the univariate mean, the domain of R(µ) depends on the
data, as to be expected. If µ is outside the range of the data, then it is not possible to
find weights such that $\sum_{i=1}^{n} x_i w_i = \mu$ while simultaneously requiring all weights to be nonnegative and to sum to 1. Thus, in this trivial case, R(µ) = 0. If both the minimum and
maximum order statistics are equal to the mean, then we set R(µ) = 1. For the case where µ
is bounded by both the minimum and maximum, i.e., X(1) < µ < X(n) , we may ascertain the
value of the weights by maximizing $\prod_{i=1}^{n} n w_i$ or $\sum_{i=1}^{n} \log(n w_i)$ with respect to the constraint.
With the use of $\sum_{i=1}^{n} \log(n w_i)$ as an objective function, the strict concavity of the function
on a convex set formulated from the weights assures a global maximum. Since wi = 0 would
minimize the function, the fact that wi > 0, for i = 1, . . . , n, means that the maximized
objective function is contained strictly inside the boundary defining the domain.
We solve the optimization problem using Lagrange multipliers, and hence write
$$G = \sum_{i=1}^{n} \log(n w_i) - n\lambda \sum_{i=1}^{n} w_i (X_i - \mu) + \gamma\left(\sum_{i=1}^{n} w_i - 1\right). \tag{1.4}$$
The objective function is the first element of the summation, whereas the next two summands
correspond to the equality constraints. The next step is to perform partial differentiation of
(1.4) with respect to wi and set the result equal to 0 to obtain
$$\frac{\partial G}{\partial w_i} = \frac{1}{w_i} - n\lambda(X_i - \mu) + \gamma = 0. \tag{1.5}$$
This produces an expression that can be summed over all 1 ≤ i ≤ n, where each equation
takes the form
$$w_i \frac{\partial G}{\partial w_i} = 1 - n\lambda w_i (X_i - \mu) + w_i \gamma = 0. \tag{1.6}$$
With simplification, the summation becomes
$$\sum_{i=1}^{n} w_i \frac{\partial G}{\partial w_i} = n + \gamma := 0,$$
from which γ may be determined to be γ = −n. Now, using (1.6), we obtain an expression
for wi :
$$w_i = \frac{1}{n}\, \frac{1}{1 + \lambda(X_i - \mu)}.$$
Then, substituting this expression for wi and summing (1.5) for 1 ≤ i ≤ n, we obtain
$$\frac{1}{n} \sum_{i=1}^{n} \frac{X_i - \mu}{1 + \lambda(X_i - \mu)} = 0. \tag{1.7}$$
The result is an equation in terms of µ, as λ is also a function of µ, so we now may treat
this as an optimization problem for λ instead of working with the original objective function.
Once we have an estimate of λ, we may determine the weights wi to be distributed among
the n points, which in turn yields a confidence interval.
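The derivation above translates directly into a few lines of base R: for a fixed µ strictly between X(1) and X(n), solve (1.7) for λ by root-finding, recover the weights, and form the ELR statistic. This is a bare-bones sketch with no safeguards for a µ outside the range of the data.

```r
# Bare-bones sketch of the dual problem (1.7) for the univariate mean.
set.seed(4)
x  <- rnorm(25)
mu <- 0                          # assumes X_(1) < mu < X_(n)
n  <- length(x)

score <- function(lambda) mean((x - mu) / (1 + lambda * (x - mu)))

# Weights stay positive only while 1 + lambda*(x_i - mu) > 0, which confines
# lambda to the open interval (-1/max(x - mu), -1/min(x - mu)).
eps    <- 1e-8
lambda <- uniroot(score,
                  lower = -1 / max(x - mu) + eps,
                  upper = -1 / min(x - mu) - eps)$root

w   <- 1 / (n * (1 + lambda * (x - mu)))
c(sum(w), sum(w * x))            # constraints recovered: 1 and mu (numerically)
W_E <- -2 * sum(log(n * w))      # compare to qchisq(0.95, df = 1)
```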
1.5 Estimating Equations and Extensions of Vector Empirical Likelihood Theorem
The theory of estimating equations, initially researched in the context of maximum likelihood
optimality by Godambe (1960), provides a powerful tool for estimating a parameter θ that
indexes a family of distributions. Suppose that we have a class of distributions F = {F } on
X = {x}, which represents the sample space; we also have a class G comprising real-valued
functions g(x, θ). Any g ∈ G that satisfies
$$E_F[g(x, \theta(F))] = 0 \quad (F \in \mathcal{F}),$$
with appropriate regularity conditions, is an estimating function. The parameter θ may subsequently be estimated by solving g(x, θ) = 0 (Godambe and Thompson, 1984). Godambe (1960)
proves that the score function is the optimum estimating function in establishing a maximum
likelihood estimator. The usefulness of estimating equations is not restricted to the parametric framework. The connection between empirical likelihood and estimating equations is
made with ease, especially since constraints are expressed via estimating equations, essentially leading to a constrained optimization. In some settings of data analysis, the researcher
is able to incorporate partial knowledge of the parameter without making more stringent
statements about the underlying distribution. As an example, consider a researcher examining a group of manufactured items produced by process A and another group by process B.
Then, in testing if the two processes have a common mean, θ, the estimating equations can
be appropriately formulated and added to the constraints in the optimizing problem (Owen,
2001). The method of estimating equations is not without its limitations. The estimating
function may have multiple roots presenting challenges in isolating the optimal θ̂; in this
case, there exists $\theta_1 \ne \theta_0$ such that $E_{\theta_0}\{g(x, \theta_1)\} = 0$ (Hanfelt and Liang, 1995).
Consider xi ∈ Rd , 1 ≤ i ≤ n, as well as X = (x1 , . . . , xn ). Suppose we are interested
in making statements about a distribution using information about θ ∈ Rp in terms of
functions (Qin and Lawless, 1994). Such functions may be written as gj (x, θ), j = 1, . . . , r.
These r ≥ p “functionally independent unbiased estimating functions” have the property
that EF {gj (x, θ)} = 0, where g(x, θ) = (g1 (X, θ), . . . , gr (x, θ))T (Qin and Lawless, 1994,
p. 301).
Expressing the ELRF for θ using estimating equations yields
$$R(\theta) = \max\left\{ \prod_{i=1}^{n} n w_i : \sum_{i=1}^{n} w_i\, g(X_i, \theta) = 0, \ w_i \ge 0, \ \sum_{i=1}^{n} w_i = 1 \right\}.$$
A more general vector ELT follows (Owen, 2001).
Lemma 1.5.1. For 1 ≤ i ≤ n, suppose Xi ∈ Rd , each Xi ∼ F0 , and Xi ⊥ Xj , i ≠ j. The
parameter of interest is θ ∈ Θ ⊆ Rp , and the estimating functions are g(X, θ) ∈ Rr . The
hypothesized value of the parameter (to be tested) is denoted by θ0 ∈ Θ, and we require that
Var(g(Xi , θ0 )) < ∞ with rank q > 0. If θ0 is a solution to the estimating equation, that is,
E(g(X, θ0 )) = 0, then the asymptotic result is that
$$W_E(\theta_0) = -2 \log R(\theta_0) \xrightarrow{d} \chi^2_{(q)} \quad \text{as } n \to \infty.$$
With the asymptotic distributional results described in the Lemma, an issue to examine
is the fitness of the estimator θ̂ for θ0 . In fact, we cannot even guarantee the existence of
a solution for θ0 . The result in Lemma 1.5.1 may be used to formulate confidence intervals
and hypothesis tests if two conditions are met:
1. for θ0 and any θ1 meeting the conditions of Lemma 1.5.1, it must be that θ0 = θ1 ,
that is, the estimating equations are associated with a unique θ0 ; and
2. the estimator θ̂ is consistent for θ0 .
It is not always the case that the number of estimating functions matches the dimension
of the parameter vector. Additionally, it is not always possible to construct such confidence
intervals (or hypothesis tests) based on Lemma 1.5.1. Described below are the three possible scenarios (just determined, underdetermined, and overdetermined), along with some
examples and remarks on the estimation of confidence intervals.
Case: p = r
The common scenario is the “just determined” problem, where the dimension of the parameter vector matches the number of estimating functions, in which case p = r. With
appropriate regularity conditions on the estimating function g(X, θ) and/or on the underlying distribution F , this setup may lead to a unique solution for θ via the following system
of equations:
$$\frac{1}{n} \sum_{i=1}^{n} g(X_i, \hat{\theta}) = 0.$$
Examples of problems where the number of estimating functions matches the dimension
of the parameter vector are given below.
• One way to incorporate information about the mean of the underlying distribution through an
estimating equation is to define g(X, θ) = X − θ. In this case, the MELE for the
univariate mean is the sample mean, X̄, in light of
$$\frac{1}{n} \sum_{i=1}^{n} g(X_i, \hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\theta}) = 0.$$
• For (X, Y ), we may characterize the mean and variance for X and Y , along with the
covariance; define the parameter vector $\theta = (\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho)$. Then, the estimating
functions are as follows:
$$\begin{aligned}
g_1(X, Y, \theta) &= X - \mu_X,\\
g_2(X, Y, \theta) &= Y - \mu_Y,\\
g_3(X, Y, \theta) &= (X - \mu_X)^2 - \sigma_X^2,\\
g_4(X, Y, \theta) &= (Y - \mu_Y)^2 - \sigma_Y^2, \text{ and}\\
g_5(X, Y, \theta) &= (X - \mu_X)(Y - \mu_Y) - \rho\, \sigma_X \sigma_Y.
\end{aligned}$$
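For the just-determined bivariate example above, solving (1/n) Σᵢ g(Xᵢ, Yᵢ, θ̂) = 0 reproduces the usual moment estimates (with 1/n rather than 1/(n − 1) scaling for the variances). The base-R sketch below, using simulated placeholder data, verifies this numerically.

```r
# Sketch: the just-determined system recovers the moment estimates.
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

mu_x <- mean(x);            mu_y <- mean(y)
s2_x <- mean((x - mu_x)^2); s2_y <- mean((y - mu_y)^2)   # note the 1/n scaling
rho  <- mean((x - mu_x) * (y - mu_y)) / sqrt(s2_x * s2_y)

# Averaged estimating functions g1,...,g5 evaluated at theta-hat:
g_bar <- c(mean(x - mu_x),
           mean(y - mu_y),
           mean((x - mu_x)^2 - s2_x),
           mean((y - mu_y)^2 - s2_y),
           mean((x - mu_x) * (y - mu_y) - rho * sqrt(s2_x * s2_y)))
round(g_bar, 12)            # all entries are (numerically) zero
```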
Case: p > r
In the event that p > r, the number of estimating equations is less than the dimension of the
θ vector. Such a case is said to be underdetermined, as we have not described all elements
of the θ vector through estimating equations, and θ is hence not estimable. Contributing
causes may be an ill-behaved underlying distribution or the selection of g(X, θ) that is not
appropriate for the context.
If all solutions have the same rank, q, for the Var(g(Xi , θ0 )) matrix, then the probability
(for n → ∞) that every θ is contained in the confidence region is 1 − α. If the rank q of the
variance matrix above is the same for all solutions, we have at least some common ground
on which to make comparisons across the solutions θ. With fewer estimating equations than
parameters, we have fewer restrictions and thus could have more candidate solutions than
the case with p = r. An example of an underdetermined problem is given below.
• Suppose that θ = (µ, σ 2 ) characterizes an unknown distribution for X. Then, setting
g(X, θ) = X − µ as the only estimating function would qualify the problem to be
underdetermined since σ 2 is not specified.
Case: p < r
The case where p < r is said to be overdetermined and includes more (or possibly too many)
restrictions. The additional constraints mean that it may be difficult (or even impossible)
to find solutions θ satisfying the estimating equation E(g(X, θ)) = 0. The asymptotic
results above may yield a confidence region that is null. Some interesting examples of
overdetermined estimating equations appear below.
• Let Qα denote the α quantile of a distribution, and suppose that θ = µ ∈ R. We
may specify two estimating functions to incorporate information about the quantile
as well. This is overdetermined because of the presence of one parameter, µ, and two
estimating functions. One of the functions specifies the mean, which by itself would
be a just-determined problem, but the estimating function for the α quantile behaves
as an additional constraint involving the data, where α is given and used to determine
Qα .
$$g_1(X, \mu) = X - \mu, \qquad g_2(X) = I_{X \le Q_\alpha} - \alpha.$$
• Assume that it is known that the variance is a function of the mean, µ ∈ R, so we write
Var(X) = h(µ). Then, the following estimating functions outnumber the dimension of
the parameter (µ):
$$g_1(X, \mu) = X - \mu, \qquad g_2(X, \mu) = (X - \mu)^2 - h(\mu).$$
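This second example has the same overdetermined structure as the Empty Set Problem illustration used later in this dissertation, whose constraints (C2) and (C3) are Σ wi xi = µ0 and Σ wi xi² = 2µ0² + 1 for N(0,1) data. The hedged R sketch below stacks the two estimating functions and passes them to emplik::el.test (assumed available); it is an illustration, not the dissertation's own implementation.

```r
# Hedged sketch of an overdetermined problem: one parameter, two functions.
library(emplik)

set.seed(6)
x   <- rnorm(15)                        # n = 15, as in the ESP example
mu0 <- 0
g   <- cbind(x - mu0, x^2 - (2 * mu0^2 + 1))

fit <- el.test(g, mu = c(0, 0))
fit$`-2LLR`                             # compare to qchisq(0.95, df = 2)
# If (mu0, 2*mu0^2 + 1) falls outside the convex hull of the (x_i, x_i^2)
# pairs, no weights can satisfy the constraints: the Empty Set Problem.
```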
We can categorize problems involving estimating equations into one of three categories:
just determined, underdetermined, or overdetermined. The estimating equations provide a
powerful link to estimating the parameter(s) of interest with the imposition of simultaneous
constraints. These constraints can vary in degree of how much they restrict the possible
values of a parameter. Hypothesis testing allows us to determine whether a fixed set of
values for a parameter vector is plausible, particularly how well they perform with respect
to the estimating equations. Though a distribution is characterized by a set of parameters,
we may not be interested in testing the entire vector. Suppose that we may partition the
parameter vector θ into two components: θ = (ψ, τ ), where ψ comprises the parameters of
interest and τ the nuisance parameters. How do we solve for two separate parameter classes,
where ψ may depend on some of the nuisance parameters?
In estimation with nuisance parameters, Owen (1990) suggests a nested approach based
on two steps. First, we use a candidate value (such as from a hypothesis test) for ψ and then
construct a distribution estimate for τ . Using this distribution, we return to optimizing the
actual parameter(s) of interest in the outer step. Let us consider the estimating equation
$$\int g(X, \theta)\, dF = \int g(X, \tau, \psi)\, dF = 0.$$
Fix ψ at a value ψ 0 , which may be a hypothesized value for testing or at least close to what
could be a legitimate value based on the data. Then, considering R(ψ) = supτ R(ψ, τ ), we
maximize with respect to τ and instead use the dual problem of minimizing with respect to
λ:
$$\min_{\lambda} \left( -\frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \lambda^{T} g(X_i, \psi, \tau)\right) \right).$$
The next step is to find, for a fixed value of ψ, the maximum of $\frac{1}{n} \log R(\psi, \tau)$ over τ,
where λ = λ(τ ). This two-step process allows us to handle the parameters of interest and
the nuisance parameters separately, which we can use in hypothesis testing of subvectors of
parameters.
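A minimal sketch of this two-step scheme, assuming emplik::el.test for the inner (dual) problem and a single scalar nuisance parameter τ, is given below. The estimating functions used here (mean ψ, variance τ) are placeholders for illustration only.

```r
# Hedged sketch of the nested approach: fix psi at psi0, let el.test solve
# the inner dual problem, and profile out tau in the outer step.
library(emplik)

inner_neg2llr <- function(tau, psi0, x) {
  g <- cbind(x - psi0, (x - psi0)^2 - tau)   # placeholder g(X, psi, tau)
  el.test(g, mu = c(0, 0))$`-2LLR`
}

set.seed(7)
x    <- rnorm(50)
psi0 <- 0                                    # hypothesized value of psi
# Outer step: maximizing R(psi0, tau) over tau = minimizing -2 log R.
prof <- optimize(inner_neg2llr, interval = c(0.5, 2) * var(x),
                 psi0 = psi0, x = x)
prof$minimum      # profiled nuisance parameter tau-hat(psi0)
prof$objective    # -2 log R(psi0), the statistic for testing H0: psi = psi0
```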
After considering the constraints of the estimating equations, the asymptotic properties
of ELR statistics provide a means by which we can carry out hypothesis testing. The results
of Wilks’ Theorem permit testing H0 : θ = θ 0 using the statistic
$$W_0 = -2 \log\{L(\theta_0)/L(\hat{\theta})\},$$
from which we may compare against a χ2 approximation with q degrees of freedom, the
dimension of θ. Recall that we can utilize any parameter that can be written as a smooth
function of the mean parameter µ, or θ(µ). Then, to have a coverage accuracy on the order
of $O(n^{-1})$ means that
$$P(W_0 \le z) = P(\chi^2_q \le z) + O(n^{-1}).$$
In addition to the ELR statistic, the bootstrap (which is not Bartlett-correctable) typically produces confidence intervals with coverage error on the order of O(n^{−1}), in contrast to the performance of a Bartlett correction of an empirical log-likelihood ratio, which theoretically has coverage error of O(n^{−2}) (DiCiccio and Romano, 1989). The bootstrap may require iterated bootstrap calibration in order to reduce the coverage error, which is computationally more intensive than a simple Bartlett correction.
While ELR statistics and estimating equations feature certain advantages (e.g., robustness), the coverage accuracy of asymptotic confidence intervals via the Bartlett correction may not improve in certain scenarios (Qin and Lawless, 1995). By the derivation in DiCiccio et al. (1991), we have the Bartlett correction

P{W_0(1 − â n^{−1}) ≤ z} = P(χ²_q ≤ z) + O(n^{−2}).
The quantity a usually requires estimation, based on sample moments and the values of the derivatives at those points such that the estimator is √n-consistent (Owen, 2001). We then have a confidence interval of the form

{θ : −2 log R(θ) ≤ (1 + â/n) χ²_{q,(1−α)}}.
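As a small numerical aside (with â, n, q, and α treated as given inputs), the corrected cutoff in the display above is just an inflated chi-squared quantile; a minimal R sketch is

# Bartlett-corrected cutoff: theta belongs to the confidence region
# whenever -2 log R(theta) does not exceed this value.
bartlett_cutoff <- function(a_hat, n, q, alpha = 0.05) {
  (1 + a_hat / n) * qchisq(1 - alpha, df = q)
}
# e.g., with a_hat = 1.2, n = 50, q = 1 the usual 3.84 cutoff is inflated by a factor of 1.024
bartlett_cutoff(a_hat = 1.2, n = 50, q = 1)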
Theoretically, application of the Bartlett correction for ELR statistics may help to improve the accuracy of inference, though this advantage comes at the cost of additional computations (DiCiccio and Romano, 1989). However, computational considerations aside, in practice the Bartlett correction does not always improve on
the asymptotic coverage error. From the parametric standpoint, Chandra and Ghosh (1979) confirm the validity of a Bartlett-corrected likelihood ratio statistic, under appropriate regularity conditions, for distributions F that are absolutely continuous with respect to Lebesgue measure. For distributions F that are strictly lattice or a mixture of lattice and continuous covariates, the Bartlett correction does not always reduce the coverage error of the asymptotic χ² distribution in
simulations (Corcoran, Davison, and Spady, 1995). For a mixed underlying distribution, an
improvement may be achievable if the lattice variables are not vital to F (Corcoran et al.,
1995). Bartlett correctability does not prove beneficial for discrete distributions due to the
lack of a valid continuous Edgeworth expansion (Frydenberg and Jensen, 1989). Corcoran
et al. (1995) remark that, because empirical likelihood behaves like a multinomial placing
weights on observations from an assumed continuous distribution, we may encounter issues
with coverage accuracy, especially in small- to moderate-sized samples in practice.
As Qin and Lawless noted in 1994, there are multiple ways in which to simultaneously
consider estimating functions, and we are not restricted to likelihood as the only distance
measure for the weights that in turn produce confidence intervals. Before, we used an ELR
statistic, which measures the difference between distributions based on the hypothesized and
estimated parameters. However, we may use the Kullback-Leibler distance (DiCiccio and
Romano, 1990) or the Euclidean likelihood distance as an objective function instead, though
the Kullback-Leibler has the practical interpretation as an entropy and gauges how much one
distribution deviates from another (Cressie and Read, 1984). Both the Kullback-Leibler and
Euclidean likelihood distance measures are nicely contained within the class of Cressie-Read
statistics (Baggerly, 1998). For the Cressie-Read power-divergence statistic,
CR(λ) = [2 / (λ(λ + 1))] Σ_{i=1}^k o_i [(o_i/e_i)^λ − 1],   −∞ < λ < ∞,

there are k distinct cells for a given table in which, for index i, we have the observed count o_i and the expected count e_i. The flexibility of this statistic stems from the specification of the parameter
λ. For example, λ = 0 in the limit leads to the maximum empirical likelihood statistic, and
the value of λ = −2 produces the Euclidean likelihood from Owen (1990), also known as a
Neyman-modified χ2 statistic. The maximum entropy measure can be derived through the
specification of λ = −1 in the limit. The empirical likelihood has the distinction of being
the only Bartlett-correctable Cressie-Read statistic (Baggerly, 1998).
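To make the role of λ concrete, the power-divergence family can be coded directly, with the λ → 0 and λ → −1 cases handled as limits; o and e below are generic vectors of observed and expected counts (assumed to share the same total), and the function name is ours.

# Cressie-Read power-divergence statistic for observed counts o and expected counts e
cressie_read <- function(o, e, lambda) {
  if (abs(lambda) < 1e-10) {            # limit as lambda -> 0: likelihood-ratio form
    2 * sum(o * log(o / e))
  } else if (abs(lambda + 1) < 1e-10) { # limit as lambda -> -1: maximum-entropy form (matching totals)
    2 * sum(e * log(e / o))
  } else {
    2 / (lambda * (lambda + 1)) * sum(o * ((o / e)^lambda - 1))
  }
}
# lambda = 1 gives Pearson's chi-squared; lambda = -2 gives the Neyman-modified (Euclidean) form
cressie_read(o = c(10, 20, 30), e = c(15, 15, 30), lambda = 1)   # equals sum((o - e)^2 / e) here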
Another interesting property of the empirical likelihood with respect to estimating equations is the ability to conduct inference on scenarios where the number of estimating equations can grow with n (Hjort, McKeague, and Van Keilegom, 2009). Usually, we first set
a fixed number r of estimating equations, whether based on theory or on hypotheses to test, after which we conduct the inference procedure. If we instead permit r to grow with n, we have a “triangular” pattern of d_n-dimensional random variables Z_{i,n} for i = 1, . . . , n and the subsequent empirical likelihood:
EL_n(θ_n) = max{ Π_{i=1}^n n w_i : w_i > 0, i = 1, . . . , n, Σ_{i=1}^n w_i = 1, Σ_{i=1}^n w_i g_n(Z_{i,n}, θ_n) = 0 }.
Hjort, McKeague, and Van Keilegom (2009) provide examples of such a scenario, demonstrating the validity of the inference under empirical likelihood, such as with a growing
polynomial regression model. Nevertheless, there is the additional challenge of handling Lagrange multipliers that grow with the sample size (Leng and Tang, 2012). Leng and Tang
(2012) examine the high-dimensionality problem for estimating equations, as well as a penalized empirical likelihood method to address variable selection, in response to modern data
complexities. This recent publication reflects the continued applicability of and interest in
empirical likelihood, especially when adapting it to more complex data.
1.6 Empty Set Problem
Mathematical Background for Convex/Affine Hulls
The issue of optimization arises in formulating the profile empirical likelihood, thus adding to
the computational complexity with the use of such methods as iterative solving algorithms.
Since we have based the empirical likelihood on the data, it is necessary to examine the
convex hull generated by the observations. Before defining the convex hull, we review some
of the foundational mathematical concepts.
A convex set is one such that every pair of points in the set is connected via a line
segment; C is a convex set if all x1 , x2 ∈ C are such that θx1 + (1 − θ)x2 ∈ C, for any
0 ≤ θ ≤ 1 (Boyd and Vandenberghe, 2004). In Figure 1.1, the shaded blue regions represent
sets which envelop the points appearing in black. In the non-convex set, there exists a line
segment connecting two points in the set which goes outside the shaded region. The convex
set is such that all points are connected via a line segment while remaining within the bounds
of the set (de Berg, Cheong, van Kreveld, and Overmars, 2008). The definition for an affine
set C is very similar to that of the convex set with one exception: an affine set does not
require that 0 ≤ θ ≤ 1.
Figure 1.1: Visualization of a non-convex and convex set
Suppose that x_1, . . . , x_n ∈ C; then y = θ_1 x_1 + · · · + θ_n x_n is a convex combination of the x_i when θ_i ≥ 0 for 1 ≤ i ≤ n and Σ_{i=1}^n θ_i = 1. Now, extending this single convex
combination to a broader set of convex combinations, we arrive at the definition of a convex
hull, namely the collection containing all convex combinations of points from C (Boyd and
Vandenberghe, 2004). In the notation of Boyd and Vandenberghe (2004), the convex hull is
conv C = {θ1 x1 + · · · + θn xn : xi ∈ C, θi ≥ 0, i = 1, . . . , n, θ1 + · · · + θn = 1}.
Though we can formulate a variety of convex sets that contain a group of points, our
interest is in the smallest set such that a line segment between any two points remains inside
the set. This special set, also known as the convex hull, is essentially the intersection of all
convex sets containing the group of points. For a plane or a three-dimensional setting, the
running time of constructing a convex hull is O(n log n), where n is the number of points. A convex hull is not limited to two- or three-dimensional Cartesian coordinates; though the definition applies to any dimension, we lose this favorable running time and the ability to visualize (de Berg et al., 2008). Removing the constraint that θ_i ≥ 0 for 1 ≤ i ≤ n returns the affine hull of C,
which contains linear combinations of elements from C:
aff C = {θ1 x1 + · · · + θn xn : xi ∈ C, θi ∈ R, i = 1, . . . , n, θ1 + · · · + θn = 1} .
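In two dimensions, the membership question that recurs throughout this work (does a given point lie inside the convex hull of the data?) can be checked crudely with base R's chull(); the helper below is our own rough illustration of the geometry, it ignores boundary cases, and it is not the weight-based check used later for empirical likelihood.

# A point strictly outside the convex hull of a 2-D point cloud becomes a vertex
# of the convex hull of the augmented point set; a point strictly inside does not.
outside_hull_2d <- function(pts, p) {
  aug <- rbind(pts, p)            # pts: n x 2 matrix of points; p: candidate point (length 2)
  nrow(aug) %in% chull(aug)       # TRUE if the candidate is a hull vertex, i.e., outside
}

set.seed(4)
pts <- cbind(rnorm(20), rnorm(20))
outside_hull_2d(pts, c(0, 0))     # usually FALSE: the origin tends to fall inside the cloud
outside_hull_2d(pts, c(10, 10))   # TRUE: a point far from the cloud lies outside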
Another mathematical definition underlying the convex hull concept is that of a hyperplane. A hyperplane, as defined in Boyd and Vandenberghe (2004), is simply the set of all solutions such that, for a ∈ R^n \ {0} and b ∈ R, {x : a^T x = b}. Since we are working with
linear combinations of points defined within a set, it is natural to consider a hyperplane.
In (1.7) on page 14, we transform an equation in terms of the weights wi , i = 1, . . . , n, into
a version that is more conducive for solving via an optimizer. Not only have we reduced the
number of constraints to n − 1, we also have no constraints on the dual variables, λ. Because
each of the weights wi is a nonlinear function of λ, we consider nonlinear programming
methods, which seek solutions for a problem in which the objective function or some or all
of the constraints are nonlinear (Nocedal and Wright, 1999).
Duality is a mathematical concept that is important in nonlinear programming and in
convex nonsmooth optimization. It provides a link between the original problem, often called the primal, and a modified version (the dual problem) that is better suited computationally
(Nocedal and Wright, 1999). As we consider Lagrangian duality, denote the domain of a
function f : A → B to be dom f . Assume that we are interested in optimizing the following
function with respect to its constraints, where x ∈ R^n:

minimize   f_0(x)
subject to f_i(x) = 0,  1 ≤ i ≤ m,     (1.8)
           h_j(x) ≥ 0,  1 ≤ j ≤ p.

Suppose that D = ( ∩_{i=1}^m dom f_i ) ∩ ( ∩_{j=1}^p dom h_j ) ≠ ∅, the domain for equation (1.8).
In other words, there is at least one point in the set of x values such that all the constraints
are met, and we are looking for the optimal choice, denoted by x? , within that set (Boyd
and Vandenberghe, 2004).
The Lagrangian duality transforms an equation of the form (1.8) into an equation involving Lagrange multipliers, or dual variables, λ ∈ R^m and γ ∈ R^p:

L(x, λ, γ) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{j=1}^p γ_j h_j(x),

where the domain is now dom L = D × R^m × R^p. Now we have a convex optimization problem in working with the dual function:

h(λ, γ) = inf_{x∈D} L(x, λ, γ),     (1.9)

where h : R^m × R^p → R. The values for which (1.9) is satisfied are said to be dual feasible, and within those solutions, the optimal one is said to be dual optimal, denoted by (λ*, γ*). The newly expressed problem is

maximize h(λ, γ)  subject to  λ ⪰ 0,     (1.10)

where λ ⪰ 0 means that each element λ_i ≥ 0, i = 1, . . . , m.
Extension to Empirical Likelihood Theory
Now we extend the mathematical definitions relating to convexity to the empirical likelihood
theory. In the notation of Owen (2001), denote the convex hull as
H = H(X_1, . . . , X_n) = { Σ_{i=1}^n w_i X_i : w_i ≥ 0, Σ_{i=1}^n w_i = 1 }.
Given a p-dimensional parameter µ, we must ascertain whether µ ∈ H. If µ ∉ H, then we define R(µ) = 0. If µ is on the boundary of H, then we may assign the value of R(µ) in one
of two ways. First, if all Xi belong to a hyperplane with dimension 0 < q < p, then we may
restrict our analysis to the smaller problem, that is full rank of dimension q. To accomplish
this objective, we first identify the components across all Xi such that the reduced Xi∗
forms a collection of q-dimensional random variables that are of full rank. For these specified
components, we reduce the µ parameter to µ∗ , which will be of dimension q rather than
p. Using these subsets of Xi , we may now carry out empirical likelihood methods for µ∗ .
Secondly, if the analysis cannot be restricted to a full rank, smaller dimensional problem,
then we set R(µ) = 0.
We now discuss the convex duality problem with the profile empirical likelihood function
for µ. Suppose we have a nontrivial case for µ, meaning that R(µ) 6= 0 and µ belongs to
the interior of H. Let λ ∈ Rp , which is the set of Lagrange multipliers or dual variables. As
seen before, the Lagrangian is given by the equation
G = Σ_{i=1}^n log(n w_i) − nλ^T ( Σ_{i=1}^n w_i(X_i − µ) ) + γ ( Σ_{i=1}^n w_i − 1 ).
Partial differentiation produces the result that γ = −n and that
w_i = (1/n) · 1/(1 + λ^T(X_i − µ)).     (1.11)

Then, λ = λ(µ) must be chosen such that the p equations

(1/n) Σ_{i=1}^n (X_i − µ)/(1 + λ^T(X_i − µ)) = 0
hold true, which is determined by using (1.11) in place of wi to formulate a new objective
function

log R(F) = log( Π_{i=1}^n n w_i ) = log Π_{i=1}^n [1/(1 + λ^T(X_i − µ))] = − Σ_{i=1}^n log(1 + λ^T(X_i − µ)) ≡ L(λ).
Instead of finding wi directly using the original formulation, we have expressed a new equation
from which we find λ, which is a function of the data and µ. With a value for λ, we may
now find the values of the weights, wi . In the original optimization problem, we had p + 1
equality constraints. Now we are able to work with p-dimensional λ, since we were able to
reduce γ = −n and did not have to include it in the optimization. The original inequalities
required that wi ≥ 0 for 1 ≤ i ≤ n, so for all i ∈ {1, . . . , n} we impose the modified inequality
constraint
w_i ≥ 0  ⇒  (1/n) · 1/(1 + λ^T(X_i − µ)) ≥ 0  ⇒  1 + λ^T(X_i − µ) > 0.
By examining all of the λ for which this inequality holds, we arrive at an intersection that is
a nonempty and convex subset. In optimizing for λ, if we take the Hessian of L, we obtain an expression that is positive semidefinite in λ. L may be reformulated so that the result is convex over the entire set R^p; additionally, this may be done with no impact on the function's value in a neighborhood of the solution.
An alternative formulation of L exists so that it is not necessary to handle constraints
on λ ∈ Rp . The original optimization was reduced to the dual problem for maximizing
log(R(F )) with constraints on the weights wi , i = 1, . . . , n, as well as the means or estimating
functions. The formulation in terms of Lagrange multipliers λ ∈ Rp allowed us to handle p
constraints instead of the p + 1 when working with the original weight variables. However,
even this can be cumbersome in optimization, as it requires a check that the denominator
of the resulting wi is positive for i = 1, . . . , n. This prevents negative and zero weights from
being designated as the best guess of the mass to place on the corresponding data points.
Owen (2001) uses a variation on the logarithm function, known as the pseudo-logarithm,
which is given by
log*(z) = log(z) if z ≥ 1/n, and log*(z) = log(1/n) − 1.5 + 2nz − (nz)²/2 if z < 1/n.
In this approach, the objective function becomes
L*(λ) = − Σ_{i=1}^n log*(1 + λ^T(X_i − µ)),     (1.12)
which is free of constraints on λ and maintains the property of convexity, which factors into
the optimization process.
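As a sketch of how (1.12) might be evaluated in practice (our own minimal implementation, not Owen's code), with X an n × p data matrix and mu the hypothesized mean vector:

# Pseudo-logarithm of Owen (2001): quadratic below 1/n, ordinary log above
log_star <- function(z, n) {
  ifelse(z >= 1/n,
         log(z),
         log(1/n) - 1.5 + 2 * n * z - (n * z)^2 / 2)
}

# Unconstrained dual objective L*(lambda) = -sum log*(1 + lambda'(X_i - mu))
L_star <- function(lambda, X, mu) {
  n <- nrow(X)
  z <- 1 + (X - matrix(mu, n, ncol(X), byrow = TRUE)) %*% lambda
  -sum(log_star(z, n))
}

# A candidate lambda can then be sought with a generic smooth minimizer, e.g.
# optim(rep(0, ncol(X)), L_star, X = X, mu = mu, method = "BFGS")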
The dual problem includes the determination of a λ ∈ Rp that minimizes a convex
function. If µ is an interior point of H and the Xi are not restricted to a hyperplane of
dimension p − 1, then the optimizer is unique and is the global solution for that problem. Once
we obtain possible values for λ, the next question concerns whether the solutions are unique
when µ ∈ H. If the data belong to a hyperplane with dimension less than p, then there may
be other solutions that may be used for λ. If there exists τ such that hτ , Xi − µi = 0 for
all i = 1, . . . , n, then additional solutions may be generated from λ by creating λ0 = λ + τ .
Regardless of the specific solution used, the weights wi are indeed unique for any of the
candidates for λ. In the case that the interior of H fails to include µ, iterative solvers may
output values of λ such that ||λ|| → ∞ as the algorithm progresses. If the length becomes
too large, this is detected by the value of the gradient falling below the threshold established
beforehand. Since it is cumbersome to consider the convex hull H and to try to see if µ ∈ H,
it is possible to test this by checking the sum of the weights. If the weights fail to sum to 1,
then µ ∉ H.
2 Empty Set Problem
The flexibility of empirical likelihood, particularly the ability to incorporate external information, allows one to include estimating equations in an approach known as Empirical
Estimating Equations (E³). However, this estimation is not always possible due to an interesting problem, known as the Empty Set Problem (ESP), described by Grendár and Judge (2009). The main result is that not all data sets are guaranteed to have a valid solution using the E³ method. One may ascertain which (if any) characteristics determine that a given data set suffers from this issue; whether a data set has the ESP depends on the assumed estimating equations and constraints. If we are examining a scalar mean µ, for instance, and the hypothesized µ lies outside the convex hull of the data, then no solution or viable estimate can be determined, and the algorithm typically reaches large values of λ, indicating a lack of convergence, which is also confirmed by the weights failing to sum to 1 (Owen, 1990). Generally, suppose that the random variable under study is X ∈ R^d, along with
the parameter of interest θ ∈ Θ ⊆ Rp . Additionally, define fX (x; θ), which symbolizes the
unknown probability distribution.
Grendár and Judge (2009) utilize estimating functions g(X; θ) ∈ Rr in order to incorporate external information in estimating θ. The choice of θ will in turn characterize the
probability distribution for F . With these two components and an arbitrary value for θ, let
us define the set of all distributions F , Φ, satisfying the estimating equations for θ:
Φ(θ) = { F(x; θ) : ∫ g(x; θ) dF(x; θ) = 0 }.
Extending to the set containing all possible values of θ, we may define Φ(Θ) = ∪_{θ∈Θ} Φ(θ).
In general, this set does not necessarily contain the probability distribution fX (x; θ). The
objective of the E³ approach is to estimate θ and subsequently to make inference. Instead of working with the theoretical Φ(Θ), the E³ method uses the information available from the data to produce an estimate of the set, denoted by Φ_q(Θ). Since this set depends on the data under analysis, we must incorporate a constraint based on the convex hull of the data. For X_1^n = (X_1, . . . , X_n), a randomly sampled data set, the estimated set is as follows: Φ_q(θ) = {q(x; θ) : ch(g(X_1^n; θ)) ∩ {0} ≠ ∅}, where ch is the convex hull. In this case, ch(A) is the convex hull of A ⊆ R^r, and 0 is the vector of length r with all elements equal to 0. In this estimated set Φ_q(Θ), the data produce a set of possible probability mass
functions q(x; θ). The next step is to choose the q that has the best fit, which may be chosen
through Maximum Empirical Likelihood Estimation (MELE). In this procedure, the MELE
is computed as

q̂(·; θ̂)_MELE = arg sup_{q(·,θ)∈Φ_q(Θ)} (1/n) Σ_{i=1}^n log q(x_i; θ).
The probability mass function q, defined in terms of the parameters (through the estimating
equations) and the Lagrange multiplier(s), is given as

q(·; θ, λ) = [ n( 1 − Σ_{j=1}^r λ_j g_j(·; θ) ) ]^{−1}.
Consequently, the results from the convex duality problem mean that we may compute
θ̂_MELE as follows:

θ̂_MELE = arg sup_{θ∈Θ} sup_{λ∈R^r} (1/n) Σ_{i=1}^n log q(x_i; θ, λ)
        = arg inf_{θ∈Θ} sup_{λ∈R^r} −(1/n) Σ_{i=1}^n log q(x_i; θ, λ).
To obtain an estimate of θ using the MELE, which has desirable asymptotic properties,
we utilize numerical optimization methods. Then, we carry out the inference based on the
asymptotic distributional results of θ̂_MELE. To this end, we construct a distribution using the MELE by computing the weights/mass for each point and using the distribution function

F̃_n(x) = Σ_{i=1}^n p̃_i I(x_i < x),

where the p̃_i are constructed based on the weights computed from θ̂_MELE (Qin and Lawless,
1994).
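Given the data x and the fitted weights p_tilde, the distribution estimate above is a one-line weighted empirical distribution function; the following sketch (names ours) evaluates it at arbitrary points t.

# Weighted empirical distribution function: F_n(t) = sum_i p_i * I(x_i < t)
F_tilde <- function(t, x, p_tilde) {
  sapply(t, function(tt) sum(p_tilde * (x < tt)))
}
# With equal weights p_tilde = 1/n this reduces to the ordinary ECDF, e.g.
# F_tilde(0, x = rnorm(20), p_tilde = rep(1/20, 20))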
Though the MELE is a desirable candidate for use in the E³ framework, it is not the
only way in which to carry out such an estimation; the objective function may be chosen in
another manner that may be more appropriate for the analysis. Examples include a function
that is defined in terms of distances between the data and another distribution, such as the
uniform with support including the data range.
In the E³ algorithm, if we were to remove the constraint q(x_i; θ) ≥ 0, then the modification to Φ_q(θ) involves the affine hull:

Φ_q^m(θ) = {q(x; θ) : ah(g(X_1^n; θ)) ∩ {0} ≠ ∅}.
In this case, after relaxing the non-negativity constraint, we may use the Euclidean Empirical
Likelihood for estimation and inference.
The ESP occurs when Φq (Θ) = ∅, which means that an estimate cannot be obtained given
the data set under any value of θ ∈ Θ. If we were to restrict Θ to just a single point, say, Θ =
{θ0 }, then the ESP reduces to the convex hull problem. In some circumstances, it is possible
to derive conditions by which a data set may be tested for the ESP. These formulations are
based on the constraints that were included, as demonstrated in the following example.
Data with the ESP fail to produce valid estimates and inferences. We are interested in
seeing how this problem surfaces in the course of numerical optimization, particularly what
behavior gives a hint that the data set is problematic. The following example, described in
Grendár and Judge (2009), provides a useful avenue for exploring properties and behaviors
of empty versus non-empty sets through simulation.
2.1 Empty Set Problem, Example
Suppose that X_1^n = (X_1, . . . , X_n) are drawn independently from F_0, with µ ∈ Θ = R. The side conditions to be incorporated in the form of estimating functions are

g_1(X; µ) = X − µ
g_2(X; µ) = X² − (2µ² + 1).
In this case, we have two estimating equations with one parameter, so this is an example of
an overdetermined problem. The E³ algorithm produces the following set:

Φ_w(µ) = { w(x_i; µ) : Σ_{i=1}^n w(x_i; µ) g_1(x_i; µ) = 0; Σ_{i=1}^n w(x_i; µ) g_2(x_i; µ) = 0;
           Σ_{i=1}^n w(x_i; µ) = 1; w(x_i) ≥ 0, 1 ≤ i ≤ n }.     (2.1)
To determine whether a data set drawn from F0 has the ESP, we must formulate an
easily checked condition based on what makes (2.1) empty. For the estimating equations to
hold, Σ_{i=1}^n w(x_i; µ) x_i = µ and Σ_{i=1}^n w(x_i; µ) x_i² = 2µ² + 1. Solving for µ² and equating the two expressions, we obtain

2( Σ_{i=1}^n w(x_i; µ) x_i )² + 1 = Σ_{i=1}^n w(x_i; µ) x_i²,

which may be expressed as

Σ_{i=1}^n w(x_i; µ) x_i² − 2( Σ_{i=1}^n w(x_i; µ) x_i )² = 1.     (2.2)
To maximize the left-hand side of (2.2) with respect to the data, we distribute the mass
on two points, the minimum and maximum of the data, x(1) and x(n) , respectively. Since
we need the weights allocated to the observations to add up to 1, once we have the weight
distributed by w(1) allocated to terms involving x(1) , the other weight is simply 1 − w(1) to
terms involving x(n) . Thus, we obtain an expression for maximizing to obtain ŵ(1) , which
may be used to obtain ŵ(n) = 1 − ŵ(1) . The term
ŵ_(1) = arg max_{w_(1) ∈ [0,1]} L(w_(1))

produces the maximized left-hand side of (2.2) based on

L(w_(1)) := w_(1) x_(1)² + w_(n) x_(n)² − 2( w_(1) x_(1) + w_(n) x_(n) )²
          = w_(1) x_(1)² + (1 − w_(1)) x_(n)² − 2( w_(1) x_(1) + (1 − w_(1)) x_(n) )².     (2.3)
Differentiating (2.3) with respect to w(1) , setting the result equal to 0 and solving for w(1)
produces

ŵ_(1)^m = [ (x_(1)² − x_(n)²) / (4(x_(1) − x_(n))) − x_(n) ] / (x_(1) − x_(n))

for x_(n) − x_(1) ≠ 0. Since we only have two weights to distribute, we can simply obtain ŵ_(n)^m = 1 − ŵ_(1)^m.
The next step is to formulate a condition that allows us to check whether a given data
set falls into the ESP. The equality in (2.2) must fail to hold, which means that the left-hand side must be strictly less than 1. For this strict inequality to occur, plugging ŵ_(1)^m into the equation L(w_(1)) < 1 produces the interval

{ x_(1) : 3x_(n) − 2√(2x_(n)² + 2) ≤ x_(1) ≤ 3x_(n) + 2√(2x_(n)² + 2) }.     (2.4)
Any data set for which the minimum falls into the interval in (2.4) meets the criterion of the
ESP.
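Condition (2.4) is easy to check numerically. The sketch below (the function name is ours) flags a sample as having the ESP, under these two estimating functions, whenever its minimum falls inside the interval determined by its maximum; the final line estimates how often this happens for N(0, 1) samples of size 15.

# ESP check for the overdetermined example g1 = X - mu, g2 = X^2 - (2*mu^2 + 1):
# the sample has the ESP when its minimum lies inside the interval (2.4).
has_esp <- function(x) {
  lo <- 3 * max(x) - 2 * sqrt(2 * max(x)^2 + 2)
  hi <- 3 * max(x) + 2 * sqrt(2 * max(x)^2 + 2)
  min(x) >= lo && min(x) <= hi
}

set.seed(1)
mean(replicate(1000, has_esp(rnorm(15))))   # proportion of simulated samples exhibiting the ESP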
We now use this overdetermined example in order to investigate some interesting properties through the MELE in the case of four non-ESP data sets drawn from N (0, 1) with
n = 15 subject to the two constraints above. We see the impact of the shape of the data
on the MELE and the MLE versus the true mean. The ELR statistics for the first data set,
shown in Figure 2.1, show a very close correspondence of the MELE and the MLE, both of
which are slightly above (at approximately 0.2) the true mean value, µ0 = 0.
Figure 2.1: The first of four non-ESP data sets of n = 15 drawn from N(0, 1) with the constraints of the ESP example, µ = (µ_0, 2µ_0² + 1)
In Figure 2.2, the shape of the curve for the ELR statistics takes on a slightly different
shape, increasing more quickly with larger values for the mean than with Figure 2.1, which
featured a symmetric, parabolic shape. This increase at a faster rate can be attributed to
the right-skewness of the sample, as the mean becomes less likely with the increasing values.
In this case, the MELE and MLE are still above the true mean, though there is a disparity
between the two estimates–the MELE is pulled to the right of the MLE.
Figure 2.2: The second of four non-ESP data sets of n = 15 drawn from N(0, 1) with the constraints of the ESP example, µ = (µ_0, 2µ_0² + 1)
Moving to a slightly left-skewed sample from N (0, 1), the ELR statistics appear in Figure
2.3; we see quite a shift to the left of the actual mean, as to be expected. Both the MELE
and the MLE are pulled away from the true mean by the skewness, and the MLE is once
again in between the actual and MELE mean values.
Figure 2.3: The third of four non-ESP data sets of n = 15 drawn from N(0, 1) with the constraints of the ESP example, µ = (µ_0, 2µ_0² + 1)
The feature of the data in Figure 2.4 is that there is a relatively strong peak between 0
and 1. More than half of the values appear in this interval, so it is interesting to note that
the curve of candidate ELR statistics is not symmetric as it was in the first and third data
sets. We see the close correspondence of the MLE and MELE statistics, which also occurred
for the first data set, which shared the symmetry property. However, the MLE/MELE duo
is not as close to the true mean for the fourth data set, perhaps because of the strong peak.
Figure 2.4: The last of four non-ESP data sets of n = 15 drawn from N(0, 1) with the constraints of the ESP example, µ = (µ_0, 2µ_0² + 1)
Having observed the behaviors of the MELE and MLE statistics in the four data sets above, it is also interesting to examine the impact on the empirical distribution functions based on the weights. In Figure 2.5, we observe the respective empirical cumulative
distribution functions for all four data sets drawn from the same distribution, with the same
constraints. The behaviors of the first and third appear to be similar; likewise the second
and fourth. The curves for all four seem to follow a general pattern of shooting up quickly
until 8 or 9 data points, pursuing a more gradual climb to the upper bound of 1, and seeing
quite a difference in the behavior with the last two points (the largest weights). Data set
2 involves the largest weight assigned out of all four, followed by data set 4; both of these
samples had observations greater than 2. The other two reach the limit more quickly, the
ones with the symmetric ELR statistic curves. Though all four are from the same underlying
distribution, the variation in sampling creates differences in the MELE, which is affected by
the shape of the sample and thereby subsequently impacts the ECDF.
Figure 2.5: The ECDF of the four data sets based on weights created from the MELE statistic µ̂_MELE
2.2 Visualizing the Empty Set Problem
Using the ESP example above, we draw n = 10 random values from the N (0, 1) distribution
with the constraints (C1) Σ_{i=1}^n w_i = 1, (C2) Σ_{i=1}^n w_i x_i = µ, and (C3) Σ_{i=1}^n w_i x_i² = 2µ² + 1. If (C1) is not met, then the algorithm has failed to produce a viable weighting
scheme for the problem. This can occur as a result of the ESP or a failure of the optimization
routine. Algorithms exist that construct the convex hull given a set of points, such as the
QuickHull (Barber, Dobkin, and Huhdanpaa, 1996) or Gift Wrapping, also known as Jarvis’s
March (Jarvis, 1973; Cormen, Leiserson, Rivest, and Stein, 2001) algorithms, but these can
be computationally complex and time-consuming, especially with higher dimensions. It is
much more efficient to check the summation of the weights and/or the number of iterations.
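The weight-sum check can be implemented directly. For a hypothesized µ we solve the dual problem for λ (here with a generic smooth minimizer rather than the Newton-Raphson routine used in this chapter), recover the candidate weights, and inspect their sum; everything below is our own illustrative sketch for the two estimating functions of this example.

# Check (C1): do the fitted weights sum to 1 at the hypothesized mu?
log_star <- function(z, n) ifelse(z >= 1/n, log(z),
                                  log(1/n) - 1.5 + 2 * n * z - (n * z)^2 / 2)

weights_at_mu <- function(x, mu) {
  n <- length(x)
  G <- cbind(x - mu, x^2 - (2 * mu^2 + 1))          # n x 2 matrix of estimating functions
  obj <- function(lambda) -sum(log_star(1 + G %*% lambda, n))
  lambda <- optim(c(0, 0), obj, method = "BFGS")$par
  w <- 1 / (n * (1 + drop(G %*% lambda)))           # candidate weights
  list(weights = w, sum_w = sum(w), lambda = lambda)
}

set.seed(2)
fit <- weights_at_mu(rnorm(10), mu = 0)
abs(fit$sum_w - 1) < 1e-4   # TRUE suggests the constraints can be met; FALSE flags a possible ESP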
Figure 2.6: Convex hulls generated for four data sets based on n = 10 elements independently
drawn from N(0,1). Because the data are two-dimensional (Xi and Xi2 , i = 1, . . . , n), the
convex hull plot becomes a convenient visual representation of the ESP. The true mean of
µ = (µ0 , 2µ20 + 1) = (0, 1) is indicated in red; the convex hull in (a) fails to contain this
point. The other three samples, (b)-(d), capture the true mean, though this point appears
in various locations relative to the space spanned by the data.
In Chapter 4, we consider the use of composite likelihood methods for empirical likelihood,
during which we examine all possible subsets of size k ≤ n of the sampled data. First, we investigate the implications of the ESP on subsets of n points; we must see whether issues arise when decomposing the sample into all (n choose k) possible combinations. In Figure 2.6, focusing on the data from (c) without the ESP, we now investigate the impact of using k-tuples from the data (2 < k < 10). Because we have two estimating equations, the smallest
grouping we consider is the triplet. Taking all possible combinations of triplets from the data,
each triplet enters the Newton-Raphson program as input, and the output can be classified
as convergent or non-convergent. Dividing the output into these two categories, where lack
of convergence is gauged by the weights failing to sum to 1, reveals two interesting findings.
Firstly, examples of the convergent and non-convergent k-tuples appear in Figure 2.7 and
Figure 2.8, respectively, and demonstrate the ESP. The true mean appears inside the convex
hulls for the convergent k-tuples, while appearing outside the boundaries of the convex hulls
for the non-convergent subsets.
Figure 2.7: We compute all possible k-tuples from the data set of n = 10 and divide the
plots by their convergence status. This plot is from the computation of the 4-tuples, where
those that converged produced the pattern of the true mean being inside the convex hull.
The use of four points leads to quite different shapes of regions, from the rather narrow (as
in row C and column (2)) to a comparatively large quadrilateral (as in row E and column
(2), where the points were relatively far apart). The shapes depend not only on the k-tuple
used but also on the data themselves.
Figure 2.8: Again, to supplement Figure 2.7, we examine all possible k-tuples from the data
set of n = 10 and divide the plots by their convergence status. This is another plot from the
computation of the 4-tuples, wherein those that failed to converge showed the pattern of the
true mean point being outside the convex hull created from the data. The use of four points
leads to quite different shapes of regions, from the very narrow (as in row B and column
(3), in which it would be very difficult to capture the true mean) to a comparatively large
quadrilateral (as in row E and column (1), where the points were relatively far apart). The
shapes depend on the k-tuple used and also the data themselves. Data points that are close
together will produce very narrow convex hulls as opposed to those that are farther apart,
thus creating more coverage.
Secondly, as we increase the number of points in the k-tuple, we have more points from
which to generate linear combinations, thus creating a possibly larger convex hull. If a set
of points is used to formulate a convex hull, then including those same points in a superset
will produce a new convex hull that contains the previously constructed one. This property
stems from the basis of the convex hulls in linear combinations, and we can see this in Figure
2.9. The overlapping of the convex hulls from the convergent 9-tuples recreates the same
shape as the union of convex hulls from convergent 3-tuples.
Figure 2.9: For the data set from N(0,1) with n = 10 and two estimating equations, we take
all possible 3-tuples to 9-tuples, and plot the convex hulls for those convergent groups. The
denser areas result from overlapping the convex hulls generated from convergent sets.
2.3 Conclusions of the Empty Set Problem
The ESP can be problematic for small sample sizes, especially when there are relatively
few data points from which to span a convex hull with ample coverage. In addition to
the sample size, the probability of a given data set having the ESP depends on the assumed constraints and on any distributional assumptions made. Since EL is used for such challenging data situations as small sample sizes, this problem has potentially significant implications for the viability of E³ as a method of choice. In the example just discussed, the underlying distribution was known, thus permitting comparison of the E³ algorithm's
performance under the empty and non-empty sets. The next question then becomes whether
such issues are detectable in a real-world data analysis. The question may be answered by
the inability of an algorithm to find weights that sum to 1. It may be that the particular data
set at hand is ill-behaved for a given algorithm. In that case, we may wish to consider other
algorithms that may handle aspects of the problem better than others. The complexity and
flexibility of the EL approach can complicate the universality of algorithms, thus requiring
that care is taken to ensure that solutions are indeed valid.
3 Evolutionary Algorithms in Empirical Likelihood Estimation
3.1 Optimization: Core Concepts and Issues
Optimization, sometimes called “mathematical programming,” is an extremely useful tool
spanning multiple disciplines. It unites mathematics, statistics, and computing via a common
objective: to determine the input for which the system, function, or model of interest realizes
its maximum or minimum value. It is convenient to consider just one of these two operations,
such as minimization, without loss of generality, since maximizing a function f(x) is equivalent to minimizing −f(x).
Before exploring specific algorithms, it is informative to consider first the general outline
of optimization. The system or function of interest should be measurable in some manner,
so that one may assess the output and gauge its optimality, if possible (Nocedal and Wright,
1999). This is accomplished using an objective, such as a mathematical function, production cost, or lifespan of a system. This objective returns an output using the variables or
unknowns that are recorded as input. The desired result is to find the values of the variables
so that the objective is minimized (or maximized). If the variables are free of constraints,
we are able to use unconstrained optimization methods. However, the variables are often
restricted in some way, whether the limitations take the form of equality constraints, meaning that some function of the variables must equal some constant, or the less
restrictive inequality constraints. The inequality constraints are given as lower and/or upper
bounds on function(s) of the inputs.
In optimization theory, the modeling phase involves identifying three elements: the objective, variables, and constraints (if any). In the case of empirical likelihood, these are
typically specified with each problem, so we are ready to proceed with an algorithm to reach
a valid solution under the specified model. The question of which algorithm to use is not
a simple and readily answered one. There are in fact multiple algorithms, differentiated by
such elements as the nature of the problem, the complexity of the set of potential solutions,
or the characteristics of the constraints (linear, nonlinear, and/or convex). Deciding which
algorithm to use must be made with careful consideration, as one that suits the problem well
has the potential to perform more quickly. A poorly chosen algorithm may waste computational resources and may not even reach a plausible solution. Once an algorithm outputs a
solution, the next step is to determine whether the answer is indeed optimal or if it can be
used to refine a new input value to reach a better solution. This may be accomplished via
optimality conditions, which provide a mathematical way to verify solutions. If the solution
is not an optimum, then we may use the optimality conditions to refine the input and to try
again. Otherwise, another algorithm must be implemented.
In reality, when we reach a “solution,” we have not pinpointed the actual value(s) of
interest for the phenomena; we have represented a very complicated reality through a model
and used this to arrive at values compatible with the limits of the model. In other words,
instead of having
Problem ⇒ Solution,
we have in fact
Problem ⇒ Model ⇒ Solution
as the process (Blum, Chiong, Clerc, De Jong, Michalewicz, Neri, and Weise, 2011). We
must keep in mind that real-life problems are rarely as neat as a model represents them, as
we have several factors to consider. First, a search for the optimal solution may have to span
a very complex or large set of candidate solutions, thus rendering an “exhaustive search”
difficult or impossible (Blum et al., 2011, p. 3). We may also have to maintain a set of
multiple candidates, as the objective function may be distorted with noise. The constraints
may restrict the search space to the point that even finding a candidate can be challenging,
thus rendering the goal of finding an optimum problematic. Blum et al. (2011) note that
a metaheuristic model is useful in handling problems with increasing degrees of challenges;
this class contains Evolutionary Algorithms, a focus of this project.
Elements of Optimization Problem
The general form of an optimization problem consists of the following elements: the variables
x, the objective function f , and the constraints, given as functions ci and cj . The setup is
min_{x∈R^n} f(x)  subject to  c_i(x) = 0, i ∈ E,  and  c_j(x) ≥ 0, j ∈ I.
The objective and constraint functions are assumed to be smooth, a property that facilitates
aspects of the algorithm, such as improving the steps taken towards the solution. If a
collection of constraints is not smooth, then we may equivalently define the constraints
using a set of smooth functions. Furthermore, unconstrained problems that are nonsmooth
may be expressed as a smooth constrained optimization.
We carry out the identification of a possible solution across a search space, denoted
by S ⊆ Rn (Michalewicz and Schoenauer, 1996). In the minimization problem above, E
represents a set for which each element indexes a constraint function that must be equal to
0. If a constraint is of the form g(x) = a, then we may express it as g(x) − a = 0 to match
the format above. The elements of I are matched to a constraint function that involves an
inequality. If we have a constraint in the form of a ≤ g(x) ≤ b, we may specify jointly that
g(x) − a ≥ 0 and that −g(x) + b ≥ 0. The constraint functions identify a set of possible
x values that characterize them simultaneously, called the feasible region, denoted by Ω in
Michalewicz and Schoenauer (1996):
Ω = {x : ci (x) = 0, i ∈ E, and cj (x) ≥ 0, j ∈ I}.
Using this notation, it is possible to express the minimization problem more succinctly:
min_{x∈Ω} f(x).
The complement of Ω, S − Ω, is identified as the infeasible region. If E = I = ∅, then the
problem reduces to an unconstrained optimization.
Another important distinction to make is between a local and a global solution. Denote
a neighborhood about a specified point x∗ as N = N (x∗ ), and suppose we are interested in
an input that minimizes the system. A local solution is a vector x∗ that, when input into
the function, meets the following criteria: x∗ ∈ Ω and there exists N defined around x∗
such that f (x) ≥ f (x∗ ) for all x ∈ N ∩ Ω. If this is a strict inequality, then the point is
said to be a strict or strong local solution. If the point is the only solution in N ∩ Ω, then
x∗ is an isolated local solution (Nocedal and Wright, 1999). A system may present one or many local solutions, so we must take care in finding the truly global solution, which occurs at the input x∗ such that f(x∗) is the overall minimum over the entire
function. In the three-dimensional case, we may have multiple valleys representing local
minima, but the lowest overall represents the location of the global optimum. For a convex
programming problem, we have a convex objective function, linear equality constraints, and
concave inequality constraints; the consequence is that a local solution is indeed a global
minimum.
Selection of Optimization Algorithm
The appropriate optimizer depends on the types of functions involved in a problem. If
the objective function and equality constraints are linear functions, we may utilize linear
programming methods. Nonlinear programming methods may be better suited for nonlinear
objective function and/or constraint function(s). The functions may also be convex, which
will be the case for the application of methods to empirical likelihoods. Other considerations
to take into account while deciding on a solver include such aspects as the number of variables
and the differentiability of the function. If we have integer-valued inputs, then we should
consider an optimizer equipped for integer-programming problems. In the case that there is
a combination of integer-valued and continuous inputs, then we can examine mixed integer
programming approaches. If we are able to differentiate the function, then we may use the
information contained in those derivatives to guide us towards a solution, much like in a line search method, such as the steepest ascent method based on the gradient.
We must also determine for an optimization procedure whether a solution has indeed
been found. For methods that are not appropriate for a problem (and applied without
understanding the unsuitable application), the value returned as output is not necessarily a
solution. Even if we apply the correct type of optimization, we must also ascertain whether
the solution is local or global. A global solution is the true minimum (or maximum) out
of all possible points in the entire feasible region. However, it is not always convenient
to strive for a global optimum. Firstly, the computational complexity may prevent such
a solution from being found, especially if there are multiple local optima. If the solver
cannot refine the search steps to reach the optimum, it may tend to a local solution. Price,
Storn, and Lampinen (2005) discuss the starting point problem that is often seen with
multi-modal functions, where each mode is a local optimum, thus leading to multiple local
optima. With the starting point problem, the optimizer has a tendency to seek a local
optimum that is close to the starting point, or initial solution value used in the algorithm.
Additionally, it is not always possible to verify that a solution is global (Nocedal and Wright,
1999). Algorithms constructed to search for global optima incorporate methods from local
optimization methods, as it is sometimes necessary to search for local optima in the process.
One aspect of global methods with constraints is that the restrictions may help to reduce the
feasible set to a more manageable and searchable space. This may remove some of the local
optima, thereby possibly making it easier to identify the globally optimum value. However,
this feasible region may also be very irregular and difficult to characterize.
As previously discussed, convexity is an important property in optimization theory. It is
a useful property that helps to simplify algorithms, thus leading to more readily identified
solutions. Convex programming methods require the following components to a constrained
problem: a convex objective function, linear (or affine) equality constraint functions, and
concave inequality constraint functions.
General Steps and Desirable Properties of Optimization Algorithms
Now that we have identified the fundamental elements of an optimization problem, we can
provide a general outline for the algorithm, which is iterative in nature. First, we input
an initial guess for the solution into the algorithm, and evaluate its fitness with respect to
the objective function and constraints. Based on the specific estimation procedure, we may
tailor this guess into a better candidate for the next iteration. This may utilize derivatives,
information from previous iterations, or the value of the point at the current step. Then, we
continue to loop through the steps until convergence is reached, as determined by previously
set criteria, or until the maximum number of iterations has been reached.
The specifics at each step differentiate the various algorithms available for use, but regardless, there are some basic properties that should be met. First, the algorithms should be
robust, meaning that they should reach a solution (if one exists) fairly quickly with reasonable starting values “on a wide variety of problems in their class” (Nocedal and Wright, 1999,
p. 8). Secondly, the algorithm itself should be optimal if possible; computational resources
should be used efficiently. An example of poor algorithm performance would be the excessive
use of “for” loops to carry out matrix multiplication or addition instead of using the more
efficient matrix operations. Lastly, if a solution is indeed output, it should be reasonably
accurate. Issues with computational accuracy, such as rounding and representation of small
numbers, and data errors should not completely change the solution. As to be expected
with such a complex problem, it is not always possible to achieve these three objectives
simultaneously. Similar to the variance-bias tradeoff, we may have to give preference to one
objective over another, such as choosing accuracy over efficiency. A method that produces
the most accurate answer may be one that utilizes much computing power and time.
3.2 Numerical Optimizers in R: Newton and Quasi-Newton Methods
Among the standard optimizers used to carry out a hypothesis test on a mean (scalar or
vector) for empirical likelihood is the Newton-Raphson algorithm. The set of constraints
from the hypotheses is used in estimating a set of weights which may or may not result in a
valid empirical likelihood, depending on the limitations of the constraints or the optimizer. In
addition to the code for the Newton-Raphson algorithm for S-PLUS posted on Art Owen’s
web site, hypothesis testing for the mean(s) in R can be carried out via the emplik or
el.convex packages by Zhou (2004) and Yang and Small (2012), respectively. The emplik
package in R by Zhou (2004) contains a function el.test for testing means using uncensored
data, which is the same as Owen’s Newton-Raphson program. The el.convex package
written by Yang and Small (2012) includes multiple convex optimizers: Davidon-Fletcher-Powell (DFP), Broyden-Fletcher-Goldfarb-Shanno (BFGS), and damped Newton.
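As a usage sketch, assuming the emplik package is installed and using the component names given in its documentation, a test of H0: µ = 0 for a univariate sample looks roughly like the following.

# install.packages("emplik")   # if not already available
library(emplik)

set.seed(3)
x <- rnorm(30)

fit <- el.test(x, mu = 0)             # empirical likelihood ratio test of H0: mu = 0
fit$"-2LLR"                           # -2 log ELR statistic, calibrated against chi-squared(1)
fit$Pval                              # corresponding asymptotic p-value
fit$"-2LLR" <= qchisq(0.95, df = 1)   # TRUE means H0 is not rejected at the 5% level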
The el.test function does not always produce a correctly estimated empirical likelihood, which is problematic in that the resulting inference is invalid. The problem can be
resolved through the implementation of another candidate, such as the damped Newton that
“[p]ractically... converges within [the] same number of iterations as el.test and always gives
us the correct” empirical likelihood (Yang and Small, 2009, p. 1). Yang and Small (2009)
claim that DFP is also guaranteed to succeed, as it searches for solutions at a superlinear
rate (Owen, 2001). Also, treating the empirical likelihood testing or estimation as a convex optimization procedure means that, according to Rheinboldt (1974), there is a unique
solution, one to which algorithms like the damped Newton and DFP will stabilize.
Starting with the Newton method, recall that we are minimizing the convex function
L∗ (λ), given in (1.12) on page 30. Denote the gradient as g = ∇L∗ (λ) and the Hessian
matrix as G = ∇2 L∗ (λ). Within the emplik package, the el.test function takes steps of
a determined size towards a potential solution; it adapts the step size at each iteration and
continues until a maximum number of steps is taken or a solution has been reached. Yang
and Small (2009) outline this algorithm, which starts with initialized values and a preset
vector of weights, particularly 16 pairs of Newton and gradient weights, respectively:
nwts = (1, 1/3, 1/9, 1/27, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)^T

and

gwts = (0, 4/10¹, 2/10¹, 1/10¹, 6/10², 3/10², 2/10², 8/10³, 4/10³, 2/10³, 1/10³, 5/10³, 2/10⁶, 1/10⁷, 6/10⁹, 3/10¹⁰)^T.
In Owen’s original program, the gwts are formulated as approximations of the values above.
The Newton direction at the kth step is δ k = −(Gk )−1 gk , which uses first and second
derivatives. The weights above, in conjunction with the gradient and Hessian, serve as
the coefficients by which we evaluate candidates for L∗ for i = 1, . . . , 16: λk + nwts[i] ·
δ k + gwts[i] · g k . The goal is to proceed along a sequence of Lagrange multipliers (or dual
variables) λk in search of an input that decreases the function L∗ (λk ); at each step, we
may progress towards a solution with an updated λk , reach a satisfactory solution, or meet
criteria at which we terminate the search. For a given iteration k, the first attempt (i = 1)
takes a full Newton step forward from the current value, thereby evaluating λk + δ k as a
candidate for λk+1 . If this results in a smaller function value, the algorithm uses this value.
Otherwise, it sequentially considers 15 more candidates, with progressively smaller steps formed from δ_k and g_k, with scalars determined by nwts and gwts. Suppose that,
in our sequential progression along index k, the first point such that the newly evaluated
function falls below L∗ (λk ) has index i0 . Then, the update on the λ candidate is λk+1 =
λk +nwts[i0 ]·δk +gwts[i0 ]·g k . Problems can arise with this algorithm based on unfavorable
initial values; if we start the optimizer too far away from a solution, it becomes difficult to
take an appropriate step towards a global minimum. We can fail to adequately take enough
steps towards a solution because of the dependence of the step size on the previous value of
the λ.
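The candidate-step logic just described can be sketched schematically (this is our own paraphrase, not the package source): obj is the objective L*, grad its gradient at the current λ, and delta the Newton direction.

# One iteration of the el.test-style step search (schematic, not the package code):
# try the full Newton step first, then progressively damped combinations of the
# Newton direction delta and the gradient grad, as dictated by nwts and gwts.
step_search <- function(lambda, delta, grad, obj, nwts, gwts) {
  f_old <- obj(lambda)
  for (i in seq_along(nwts)) {
    candidate <- lambda + nwts[i] * delta + gwts[i] * grad
    if (obj(candidate) < f_old) return(candidate)   # accept the first improving candidate
  }
  lambda   # no candidate improved the objective; keep the current value
}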
The simple Newton method function el.test.newton from the el.convex package by
Yang and Small (2009) follows the same general procedure as the aforementioned el.test
from Zhou (2004), with the exception of how it takes steps towards a solution. It does not
include the gradient component g k in the step, and the algorithm instead guides the search
in a direction using the Newton step based on δk and λk ; we can think of the gwts as a
zero vector in this case. The procedure continues using the previous and updated values of
the input λk+1 and λk in exploring for a solution within tolerance until it reaches one or is
halted by the limit of iterations. It also has the problem of taking appropriate step sizes and
the limitation of the starting point problem.
The damped Newton method follows a similar path but updates the Newton direction
differently: it may be that taking the full step δ k = −(Gk )−1 g k overshoots the target solution
or a path leading to the solution in a manageable time. In that case, we can obtain an
approximate step that is some fraction of the full Newton step. In so doing, we seek a vector
of multipliers (or a scalar) αk for this step such that the components range between 0 and
1, hence the “damped” name. An approximate line search can estimate the value of this
multiplier vector, thereby producing an estimate of the Lagrangian parameter vector for the
iteration at k + 1: λk+1 = λk + αk δ k (Yang and Small, 2009). A downside of this algorithm
is the need to compute the inverse Hessian matrix, G−1 , as this can be computationally
expensive.
The first quasi-Newton (“Newton-like”) algorithm was the Davidon-Fletcher-Powell (DFP)
method. The feature of quasi-Newton methods is that they are based on changes in the gradients of the objective function, thus eliminating the need to compute second derivatives as
in Newton's method (Yang and Small, 2009). Additionally, the quasi-Newton approach to approximating a Hessian may involve intensive matrix computations and requires differentiability of the objective function (Price et al., 2005). Though advances in computing have
revived the usability of Newton’s method, it is still not a universally workable or desirable
algorithm. In some cases, the resources needed to compute second derivatives may outweigh
the cost of using a gradient. W.C. Davidon developed the DFP method in 1959 to circumvent the limited computational resources of the time (Yang and Small, 2009; Nocedal and
Wright, 1999). Fletcher and Powell, who were instrumental in bringing Davidon’s method
to light, showed that it actually outperformed the coordinate descent method that inspired
the search for an alternative. Davidon’s work was not published until decades later, in the
inaugural 1991 issue of the SIAM Journal on Optimization (Nocedal and Wright, 1999). The
general outline of Davidon’s algorithm, as well as “variable-metric algorithm[s],” consists of
five major steps: computing a step direction, taking a step in that direction by some amount
or recomputing a new step, ascertaining the objective function f and gradient ∇f , updating
the values, and determining whether to reiterate or to terminate (Davidon, 1991).
The BFGS method is “[t]he most popular quasi-Newton algorithm,” and the DFP method
is “its close relative” (Nocedal and Wright, 1999, p. 136). Both algorithms, which aim for
a minimum of L∗ (λ), read in a starting point λ0 , a pre-defined convergence criterion ε > 0,
and maximum number of iterations, along with an approximation of the inverse Hessian
(G_0)^{−1} ≈ H_0. For the kth iteration, while ‖∇f‖ > ε and k has not exceeded the maximum
number of iterations permitted, the algorithms continue the search towards a solution. This
is carried out by finding a direction in which to take a step, namely δ k = −Hk ∇L∗ (λk ), where
Hk is a symmetric positive definite matrix that approximates the inverse Hessian without
actually taking the inverse (Nocedal and Wright, 1999). This saves computation time and
resources; the way in which we accomplish this approximation distinguishes the BFGS and
DFP methods (Yang and Small, 2009). Once the inverse Hessian has been approximated,
we compute λk+1 = λk + αk δ k , where αk is the step length and is determined through
an appropriate line search (same as in the damped Newton to ensure that the solution
proceeds in a viable yet efficient direction). We subsequently compute sk = λk+1 − λk and
y k = ∇L∗ (λk+1 ) − ∇L∗ (λk ). These values are used to estimate Hk+1 . The DFP estimate of
Hk+1 is
Hk+1 = Hk −
sk sTk
Hk y k y Tk Hk
+
.
y Tk Hk y k
y Tk sk
By contrast, the BFGS estimate is
Hk+1 = (I − ρk sk y Tk )Hk (I − y k sTk ) + ρk sk sTk ,
where ρk =
1
.
yT
k sk
The differences between the DFP and BFGS stem from roundoff error
59
and tolerances for convergence (Press, Teukolsky, Vetterling, and Flannery, 2007). The DFP updates converge to the inverse of the Hessian, and the BFGS, considered to be more straightforward than and superior to the DFP, uses updates that converge to the Hessian (Venkataraman, 2009).
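Both updates are simple rank-two corrections and can be written directly; the sketch below is our own transcription of the formulas above, mapping the current approximation H, the step s_k, and the gradient change y_k to the updated approximation H_{k+1}.

# Quasi-Newton updates of the inverse-Hessian approximation H, given
# s = lambda_{k+1} - lambda_k and y = grad_{k+1} - grad_k.
dfp_update <- function(H, s, y) {
  H - (H %*% y %*% t(y) %*% H) / drop(t(y) %*% H %*% y) + (s %*% t(s)) / drop(t(y) %*% s)
}

bfgs_update <- function(H, s, y) {
  rho <- 1 / drop(t(y) %*% s)
  I <- diag(length(s))
  (I - rho * s %*% t(y)) %*% H %*% (I - rho * y %*% t(s)) + rho * s %*% t(s)
}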
If convergence is not reached, the algorithm continues with the updates of all components
accordingly. The rate of convergence is superlinear, which is defined in terms of the limiting
behavior of the Lagrangian parameters. Without loss of generality, suppose we examine a
hypothesis test of a single mean, thus using a single Lagrangian parameter, i.e., the scalar
λk for k = 0, 1, 2, . . . , and we take the limit of λk as k goes to infinity, a value we will denote
by λ . The series (λk ) converges superlinearly to λ if a sequence (εk ) exists that converges
to 0, along with the following properties:
|λ_k − λ| ≤ ε_k  and  lim_{k→∞} ε_{k+1}/ε_k = 0.
In other words, as the sequence of λk approaches the limit sufficiently, we can establish an
increasingly tightening upper bound for the absolute difference of the sequence and the limit
λ using some variable εk . This bounding variable has the property that taking the ratio
of the (k + 1)th value to the kth realization has a limit of 0. As we proceed towards k → ∞, the absolute difference should grow smaller, so eventually ε_{k+1} ≤ ε_k, and the ratio will become so small that it is 0 in the limit (Süli and Mayers, 2003).
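As a quick numerical illustration of the definition, using an artificial error sequence of our own choosing, a superlinearly convergent sequence has successive error ratios that tend to zero:

# Errors of a quadratically (hence superlinearly) convergent iteration, e.g. eps_k = 0.5^(2^k)
eps <- 0.5^(2^(1:6))
eps[-1] / eps[-length(eps)]   # the ratios eps_{k+1}/eps_k shrink rapidly toward 0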
Based on their findings, Yang and Small (2009) recommend that el.test be used first, recognizing that it has the potential to fail in certain applications, as indicated by weights that do not sum to 1. In addition to the ESP, which is not resolvable by any algorithm, the el.test method can have problems searching for a solution, which Yang and Small (2009) trace in their example to the algorithm taking steps in the steepest-descent direction −g in place of −G^{-1}g. Yang and Small (2009) remark that “[n]o theoretical result will ensure that the el.test procedure will converge to the global minimum” (p. 10). In the event that the weights fail to sum to 1, the next step may be to
examine an alternative, such as the damped Newton. Though the straightforward Newton
is most expedient, it does not always converge to a solution and can be sensitive to starting
values. The other algorithms, BFGS, DFP, and damped Newton, are viable alternatives
that can require more iterations to reach a solution than el.test, though with the benefit
of greater stability.
3.3
Evolutionary Algorithms
Since the introduction of empirical likelihood, the optimizers used for estimation and inference have been of the Newton or gradient-based type, like the DFP algorithm. These
methods are not without their disadvantages, such as the need to compute a gradient, a
Hessian, or an approximation to the Hessian. Prior to this research, no work published
(aside from our own Askew and Lazar, 2011) has addressed the potential for Evolutionary
Algorithms in the empirical likelihood context. The ability of Evolutionary Algorithms to
address complex problems and to search for global solutions makes them an attractive alternative, one that this dissertation investigates in more detail. By establishing another tool
for estimation and inference using a different approach, we can perhaps broaden the applicability of empirical likelihood; we should not be limited by shortcomings of an algorithm
because of an artifact, such as a matrix operation that is numerically unstable at a certain
point.
The class of Evolutionary Algorithms contains such well-established procedures as the
Genetic Algorithm (GA) and the Evolutionary Strategy (ES). In contrast to the single-point and derivative-based Newton or quasi-Newton approaches, the Evolutionary Algorithm
maintains multiple candidate points and is derivative-free, a property that “provide[s] greater
flexibility” (Price et al., 2005, p. 11). An Evolutionary Algorithm is an example of a
metaheuristic (Blum et al., 2011). Metaheuristic algorithms maintain singular or multiple
candidate solutions on the way to searching for the best solution. At each stage, members
of the group of candidates represent in some way a betterment over those from the previous
group, as metaheuristics incorporate past information to drive the process forward towards a
solution. The algorithm continues to search for successively improved candidates until either
it has stagnated or keeps fluctuating and fails to converge within a reasonable amount of
time. In contrast to gradient-based methods, Evolutionary Algorithms “involve some form
of randomness and selection” (Blum et al., 2011, p. 2). These strategies can be useful
in situations that thwart such methods as the Newton optimization, as in the case of a
plethora of solutions (for which a thorough search is not feasible or practical) or where there
is a dearth of candidates (where the constraints restrict the pool to the point that a single
answer may be difficult to unearth, much less discovering the minimum) (Blum et al., 2011).
However, the “No Free Lunch” Theorem by Wolpert and Macready (1997) states “that
for any algorithm, any elevated performance over one class of problems is offset by performance over another class” (p. 67). If an algorithm A outperforms another (say, B) in a
problem-solving context, then there exist other problems for which B excels over A. Wolpert
and Macready (1997) suggest first evaluating an objective function for the important characteristics by which to formulate and customize a search algorithm for the minimum. Thus,
there is a need for “problem-specific knowledge to achieve better than random performance”
(Wolpert and Macready, 2005, p. 721). This can be seen in that gradient-based methods
may outperform an Evolutionary Algorithm in the case of simple-to-compute derivatives and
clear direction, though they may perform poorly in the event of a complex problem with
constraints. An algorithm that indeed qualifies as “free lunch” is the coevolutionary method,
which views the optimization through a game framework. In this approach, the objective
function for a single candidate can be affected by another’s value, a setting not covered by
the “No Free Lunch” Theorem. If an algorithm permits “self-play,” the candidates may interact and “train” as a champion, or the optimum for which we are striving. In this self-play
Generalized Optimization, “there are algorithms that are superior to other algorithms for all
problems” (Wolpert and Macready, 2005, p. 721). Because this builds on the complexity of
an Evolutionary Algorithm (i.e., multiple groups interacting in a manner like a tournament),
we first explore the Evolutionary Algorithm for this dissertation and note the potential for
future research directions along this line in Chapter 5.
There are multiple useful optimization methods contained in the metaheuristic class,
ranging from Hill Climbing to Swarm Intelligence (used in artificial intelligence) and, of
course, Evolutionary Algorithms. The distinctions among these algorithms stem from whether
they maintain a single solution or multiple ones, whether they are deterministic or stochastic,
and whether they refresh each generation with a new set of solutions or modified sets based
on the previous generation. Of course, there are variations within these elements, and the
flexibility of this approach allows for hybrid methods to be used. The Evolutionary Algorithm mimics a natural framework involving a pool of candidates (a population) who vie for
a fixed number of assets, with fitness evaluated according to some criterion. The candidates,
or elements in the search space, represent the genotypes (the inherited properties), and when
their viabilities are assessed in terms of the objective function, the algorithm treats those
values as phenotypes (the realizations of the genetic structures) (Blum et al., 2011).
The first step is to initialize the algorithm with an initial guess for the solution, i.e., a
starting population. The researcher must also establish in advance the convergence criteria,
such as the tolerance level and the maximum number of iterations permitted. The following
elements must also be determined beforehand: the number in the starting population, the
way of selecting parents and producing children, the ways in which to mutate children, the
number of immigrants to permit at each generation, and how to preserve the best group for
the next iteration (Blum et al., 2011).
Within the population, births and deaths change the features for the next generation.
The candidates may reproduce according to some pre-defined rule, and depending on the rule
chosen, the offspring will incorporate some aspects of their parents, whether it is a bit swap
or an average of the parents’ values, for better or worse. Generally, a poorly performing
parent will produce offspring that will not survive the next generation. In keeping with
the natural terminology, those candidates who fail to proceed to the next iteration are said
to have “died.” In a sense, the best features will progress and survive each iteration, thus
propelling the desirable features forward towards a solution (Blum et al., 2011). We can
think of this as a competition at each level, with the top winners proceeding to the next
round. At the end, if a solution is deemed to be found, the one(s) declared to be the winner
overall outperformed all candidates, those who were based on improvements of others or
those who immigrated.
Now we outline the (α, β)-Evolution Strategy (ES) algorithm as discussed in Price et al.
(2005) and illustrated in Figure 3.1. The initial population of α members and subsequent
members must be evaluated against a fitness function that measures the outputs. Each
iteration of the algorithm is referred to as a generation. At each iteration, we generate
β ≥ α children using some predetermined rule after randomly choosing the parental pairs.
This process is called recombination. If the features or components are randomly chosen
from either parent, the recombination is discrete, and if the qualities are averaged across
the pair, then it is intermediate. Once a child has been produced from a pair of parents,
it may be mutated to help introduce randomness and also to prevent the algorithm from
stalemating. As another means to maintain variety of individuals, we permit immigrants
to enter the pool, which can be drawn from the same distribution as the initial population.
From the pool of candidates, the computer program must choose the α best individuals for
the next generation (if the population size is to remain the same). These individuals may
be chosen from the total generation of parents, children, and immigrants. This extensive
pool permits desirable solutions from previous generations to remain under consideration.
Maintaining multiple candidates simultaneously helps to avoid the starting point problem.
Next, we consider the variations that may be tailored within a specific problem.
Figure 3.1: The basic steps of the (α, β)-Evolutionary Strategy algorithm can be visualized
as a flow chart. The initial population is used as a starting point. At each generation, the
groups of parents are subdivided and mated via a rule established in advance. A subset
of the offspring undergoes mutation, and these mutants are entered into a pool along with
the parents, the non-mutated offspring, and newly generated immigrants. We find the top
α candidates according to the objective function, where the one with the best fitness is
the potential solution, and if convergence or the maximum number of iterations has been
reached, then we terminate the algorithm with the appropriate output. Otherwise, we take
the top α individuals as the new generation and repeat the process. In the event that all
solutions have stagnated to the same value, we maintain one of the vectors as a candidate
and refresh the other α − 1 from the same distribution that generated the initial population.
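As a rough R sketch of this flow chart (an illustration only, with recombine, mutate, and make_immigrants standing in for the operators described in the following subsections), the (α, β)-ES loop can be organized as:

    # Skeleton of an (alpha, beta)-Evolution Strategy that minimizes `fitness`;
    # the operator functions are placeholders for the strategies discussed below.
    evolve <- function(fitness, init_pop, recombine, mutate, make_immigrants,
                       alpha = 100, n_imm = 10, max_gen = 200, eps = 1e-6) {
      pop <- init_pop(alpha)                      # list of candidate vectors
      best_prev <- NA_real_
      for (gen in seq_len(max_gen)) {
        children <- recombine(pop)                # beta >= alpha offspring
        children <- mutate(children)              # mutation (may act on a subset)
        pool <- c(pop, children, make_immigrants(n_imm))
        scores <- vapply(pool, fitness, numeric(1))
        keep <- order(scores)[seq_len(alpha)]     # keep the top alpha candidates
        pop <- pool[keep]
        best <- min(scores)
        if (!is.na(best_prev) &&
            abs(best - best_prev) / abs(best_prev) < eps) break
        best_prev <- best
      }
      pop[[1]]                                    # best surviving candidate
    }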
Mutation Strategies
Michalewicz and Schoenauer (1996) discuss strategies for mutating individuals in each generation, including Gaussian mutation and uniform mutation. Gaussian mutation is “[t]he most popular mutation operator” and involves randomly sampling a vector of Gaussian errors to add to the solution vector, as in Figure 3.2 (Michalewicz and Schoenauer, 1996, p. 3). The modified solution then becomes x_{t+1} = x_t + N(0, σ^2), where σ^2 is established in advance. This can incorporate partial knowledge from the researcher (i.e., permitted fluctuations for a given process in a quality control framework) or can be set arbitrarily.
Figure 3.2: In the Gaussian mutation, we use a vector of values drawn from a convenient
normal distribution, N (0, σ 2 ). Here, we randomly select a number of mutants from the pool
of offspring (not the parents), and a random vector of standard Gaussian noise is added.
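A minimal R sketch of the Gaussian mutation, with σ treated as a tuning value chosen by the researcher (the default below is illustrative):

    # Gaussian mutation: add N(0, sigma^2) noise to each component of a candidate.
    gaussian_mutate <- function(x, sigma = 0.1) {
      x + rnorm(length(x), mean = 0, sd = sigma)
    }

    set.seed(1)
    gaussian_mutate(c(0.4, -0.2), sigma = 0.05)   # example mutation of a candidate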
The uniform mutation in Figure 3.3 produces a new solution vector that is different from
the previous one in a single element. A component is randomly selected and replaced by a value drawn from an appropriate uniform distribution. For x_t = (x_1, . . . , x_k, . . . , x_n), suppose that the kth component is chosen at random. Then, the valid domain for x_k is specified and is used to formulate a uniform distribution from which to randomly choose x'_k. The mutated update of x_t becomes x_{t+1} = (x_1, . . . , x'_k, . . . , x_n). Suppose that we use the Evolutionary Algorithm to find the optimum Lagrangian parameters, which in turn characterize
the weights. In theory, the Lagrangian parameters come from R, but we have limitations on
how we can represent this distribution. Furthermore, from the Newton-Raphson program,
the Lagrangian parameters for solutions that converged tend to come from a smaller interval, whereas the larger values generally correspond to non-convergent solutions. We recall
that the weights for the empirical likelihood use the Lagrangian in the denominator, so the
value will tend to 0 as we choose larger values for the parameter. As a starting point, we
use the range of -100 to 100 as the absolute limits; this can be easily adjusted as needed,
if the researcher feels that sufficient coverage is not gained. If we take a large enough pool,
in theory, we would more fairly cover the uniform values, but this drives up the running
time considerably and impractically. It actually creates more problems with convergence,
which occurred in the initial development of the simulations discussed in Section 3.4. Adequate coverage is quite difficult to achieve unless a large enough generation
is maintained. In fact, we may have a restricted range for parameters corresponding to convergence, say between 0.2 and 1.4. This restriction is not always known in advance, though
we may adjust the limits accordingly if we suspect that we are not covering the appropriate
ranges. In order to introduce more variation, we choose values from a continuous uniform
distribution with randomly determined limits. We draw the lower and upper limits from a
discrete uniform distribution with 100 as the parameter, and we multiply the lower bound
value by -1; as an example, a draw of 9 and 2, respectively, means that we use the distribution U (−9, 2) for randomly choosing candidate parameters. Under this construct, the
lower and upper bounds are not necessarily the same; this is set up to maintain variability in
exploring various ranges of parameter values. Because larger generations translate to longer
running times, we may have just a few generations in order to practically and sufficiently
cover the range of plausible values. If we always allow larger values in the range with large
limits, we can have an undercoverage of smaller numbers, where we know solutions may
belong. Instead of using U (−100, 100) for all potential parameters, we can use the varying
limits so that we can draw from distributions like U (−3, 20), U (−2, 4), or U (−51, 15), which
distribute the probabilities for draws over differing lengths of values.
Figure 3.3: For a vector that is randomly chosen for uniform mutation, we draw the value for
the mutant component from a continuous uniform distribution with limits to be determined
as follows. First, we draw a lower and an upper limit from a discrete uniform distribution of possible integers from 1 to 100. Denote these draws of the random variables as LB_i = lb_i (obtained by multiplying the randomly selected limit value by -1) and UB_i = ub_i, where these
correspond to the ith mutant. Then, we draw the mutant element from U (lbi , ubi ), where
the lower and upper bounds are not necessarily equal. This incorporates an additional level
of randomness, as opposed to selecting all values from the same fixed limits characterizing
the distribution.
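A minimal R sketch of this random-limit uniform mutation; the bound of 100 on the discrete draws mirrors the absolute limits mentioned above, and the rest of the interface is an assumption for illustration.

    # Uniform mutation with randomly drawn limits: one randomly chosen component
    # is replaced by a draw from U(lb, ub), where -lb and ub are sampled from the
    # integers 1..100.
    uniform_mutate <- function(x, max_limit = 100) {
      k  <- sample(length(x), 1)       # component to mutate
      lb <- -sample(max_limit, 1)      # lower limit, multiplied by -1
      ub <- sample(max_limit, 1)       # upper limit
      x[k] <- runif(1, min = lb, max = ub)
      x
    }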
Crossover Methods
Crossover is a strategy for producing children based on two or more parents. Examples of
discrete recombinations include the one-point, two-point, and uniform crossover. Instead of
swapping bits, the real-valued components are exchanged between the two individuals. A
one-point crossover, shown in Figure 3.4, produces children based on two parents by selecting
an element at random for dividing the vector. Suppose that the parent vector is to be split
at some random point. For parents A and B, the two children that can be produced are
constructed by alternating between parents across each segment. The offspring can inherit
the first segment from parent A and the second from parent B, or it can receive the first
segment from B and the second from A.
Figure 3.4: The one-point crossover, shown here, produces a pair of offspring by operating on
the genotypes of parameter vectors A and B. Along the dimension of the vector, an index
is randomly chosen as a split point, and we create offspring by alternating the sequences
preceding and following the cut point. Though it is possible to keep both offspring, in
order to manage the pool of candidates, we randomly choose one of the offspring to go into
the pool, along with the parents.
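A minimal R sketch of the one-point crossover on two real-valued parents; returning one randomly chosen child, as in the figure, is an assumption of this sketch.

    # One-point crossover: split both parents at a random index and swap tails.
    one_point_crossover <- function(A, B) {
      n <- length(A)
      cut <- sample(n - 1, 1)                   # split point between 1 and n - 1
      child1 <- c(A[1:cut], B[(cut + 1):n])
      child2 <- c(B[1:cut], A[(cut + 1):n])
      if (runif(1) < 0.5) child1 else child2    # keep one child at random
    }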
For parents with a longer length, it may be more suitable to use a two-point crossover,
which uses two randomly chosen elements for splicing. For example, let the vector of parent
A be given as (A1 , A2 , A3 ), where the splits are made between A1 and A2 and between A2
and A3 . Then, let parent B be expressed as (B 1 , B 2 , B 3 ), split at the same elements as in
parent A. The recombination of these two parents will produce two children with spliced
segments of the original vector. The segments are formulated by alternating between the
parents, so that the children will not have two consecutive subsets from the same parent.
The uniform crossover, demonstrated in Figure 3.5, involves generating a random number
at each point from the uniform U (0, 1) distribution, with a previously determined threshold
(such as 0.5). For two parents under consideration, a child is generated element-wise. At each
element of an offspring, we draw a value from U (0, 1), and if it is above the threshold (say,
0.5), then we will assign to that index the corresponding element from parent A; otherwise,
we assign the value from parent B. We repeat this process for each element of the offspring vector
so that we have a mixture of values from A and B, in the hopes that we end up producing
a more desirable combination. If this is the case, then the offspring will be chosen over the
parents for the next generation. If we obtain a worse-performing offspring, then we give
preference to the parents for the next iteration.
Figure 3.5: The uniform crossover generates a single offspring (or more if desired) based on
two parents A and B through a sequence of values drawn from U (0, 1). If the value meets
a predetermined threshold, such as 0.5, then the genotype at a given index is inherited from
A; otherwise, it will come from B. This continues along the length of the vector until the
offspring has been randomly generated.
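A minimal R sketch of the uniform crossover with a 0.5 threshold, as described above:

    # Uniform crossover: at each position, inherit from parent A when a U(0,1)
    # draw exceeds the threshold, and from parent B otherwise.
    uniform_crossover <- function(A, B, threshold = 0.5) {
      u <- runif(length(A))
      ifelse(u > threshold, A, B)
    }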
These three algorithms are referred to as discrete due to the selection of one or two
outcomes at each element. However, one may carry out recombination in an arithmetical
fashion, much like on a continuous variable; an example is the arithmetical crossover algorithm. Suppose that A and B are parents with two children, p and q. Then, given a constant
a, values for the children are generated as p = aA + (1 − a)B and q = (1 − a)A + aB.
If a = 1/2, then the arithmetical crossover is specifically a “guaranteed average crossover,”
where the operations are visualized in Figure 3.6 (Michalewicz and Schoenauer, 1996, p. 4).
The recombination is not restricted to using just two parents; rather, it may use r parents with r − 1 constants, (a_i) ∈ [0, 1]^r such that ∑_{i=1}^r a_i = 1. The newly formed offspring based on r parents is given by y = ∑_{i=1}^r a_i x_i.
Figure 3.6: With the guaranteed average crossover, we take the constant a = 1/2 as the
multiplier for both parents, for which the summation produces an offspring. Each element
from A is averaged with its corresponding element in B, thereby producing a child that
represents a mean of the features contained in A and B.
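A minimal R sketch of the arithmetical crossover, including the multi-parent form with weights summing to 1; drawing the a_i at random and normalizing them is an illustrative choice, not a prescribed rule.

    # Two-parent arithmetical crossover: p = a*A + (1 - a)*B, q = (1 - a)*A + a*B.
    arithmetic_crossover <- function(A, B, a = 0.5) {
      list(p = a * A + (1 - a) * B, q = (1 - a) * A + a * B)
    }

    # Multi-parent version: offspring y = sum_i a_i x_i with sum(a_i) = 1,
    # where `parents` is a list of equal-length vectors.
    multi_parent_crossover <- function(parents) {
      a <- runif(length(parents))
      a <- a / sum(a)
      Reduce(`+`, Map(`*`, a, parents))
    }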
Flavors of crossovers exist that may keep desirable members of the generation under
consideration at subsequent iterations. One example is the heuristic crossover operator,
shown in Figure 3.7. This produces one offspring based on two parents A and B and a
random value r ∈ [0, 1]. We compute an offspring based on the properties of the parents. If
we are seeking a minimum and f (A) ≤ f (B), then the child is C = r(A−B)+A; otherwise,
if f (A) ≥ f (B), then we compute C = r(B − A) + B. If the optimization strives for a
maximum, then we calculate C = r(A−B)+A for f (A) ≥ f (B) and C = r(B −A)+B for
f (A) ≤ f (B). This type of crossover incorporates information about the objective function
while including some randomness.
Figure 3.7: The previous crossovers do not incorporate a decision-making step that helps to
preserve the desirable features of A or B; they rely on randomness to generate offspring,
with the hopes of diversifying the pool enough to encounter a desirable candidate. The
generation process tends to preserve the desirable features when the objective function is
evaluated, but in the heuristic crossover, this step is supplemented by the screening of parents
at each stage of producing offspring. We distinguish which parent has the more favorable
objective function (comparing f (A) to f (B)) and construct the new sequence based on a
randomly drawn value from U (0, 1). Randomness is incorporated by the scalar multiplier
r ∈ [0, 1]. We produce a child of the form C = r(A − B) + A for f (A) ≥ f (B). If we have
f (A) ≤ f (B), then we will switch the ordering of A and B in the formula for the child.
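A minimal R sketch of the heuristic crossover for a minimization problem, following the rule in the text; the objective f is passed in as a function (an assumed interface).

    # Heuristic crossover (minimization): move from the worse parent past the
    # better one, C = r*(better - worse) + better, with r drawn from U(0,1).
    heuristic_crossover <- function(A, B, f) {
      r <- runif(1)
      if (f(A) <= f(B)) r * (A - B) + A else r * (B - A) + B
    }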
Constraints in Evolutionary Algorithms
Like other optimization routines, constraints may be handled by Evolutionary Algorithms.
Michalewicz and Schoenauer (1996) remark that Evolutionary Algorithms have yet to handle
nonlinear constraints in a systematic way; though methods have been proposed, the inability
to generalize the approach was not traceable to a single cause in their research (e.g., quantity
of variables or categorizing and counting the constraints by linearity or nonlinearity). As
generalizability is already a difficult problem in itself, it does not mean that the methods are
not useful–it means that it may take some trial-and-error or considering multiple candidates
to find the best one for the problem at hand. We have ways to gauge whether an algorithm
performs well or poorly (e.g., summing the weights and assessing the constraint violations).
Michalewicz and Schoenauer (1996) discuss four ways of addressing constraints: “methods
based on preserving feasibility of solutions,” “methods based on penalty functions,” “methods which make a clear distinction between feasible and infeasible solutions,” and “hybrid
methods” (p. 6).
1. The first approach involves forcing infeasible solutions back into the feasible region F
within the search space via an operator that is closed on F. Such a method might
be well suited for linear constraints, for which operators may be easily specified for
forcing feasibility. The convexity of the sets also helps to facilitate the algorithm. If
the feasible search space F is convex, then we may use this method, but it fails at the
inclusion or encountering of a non-convex set. If a mutated offspring falls outside F,
then it may be forced back in with the appropriate uniform crossover. If two parents
are feasible, then the arithmetic crossover (by definition of convexity) helps to ensure
the children are feasible as well. Another recombination method to consider is the
heuristic crossover, which produces one feasible child based on two feasible parents.
The other method to preserve feasibility is that of restricting the search to the boundary
of the feasible region F. In exploring the search space, Evolutionary Algorithms are
unable to distinguish the boundary between the set of feasibility and infeasibility. This
can have a significant impact in the case of optimizing with nonlinear constraints or
nonlinear equality constraints at the value of the true optimum. It is thus desirable to
make such a search efficient while exploring the boundaries of the search region, which
would be an extremely valuable enhancement for Evolutionary Algorithms. Restricting
to the boundaries of the search region also produces a smaller search set.
Methods based on feasibility are discarded from further consideration with the empirical likelihood application, as it can be difficult and computationally intensive to delineate the feasible region quickly given the estimating equations and other constraints.
Our goal is to find an effective and relatively efficient Evolutionary Algorithm.
2. The approach we select in this dissertation involves using penalty functions to penalize
infeasible solutions. We transform the original problem into
eval(x) = f(x) if x ∈ F, and eval(x) = f(x) + penalty(x) otherwise.    (3.1)
The penalty is equal to 0 if the solution is in the feasible region, and it is otherwise
defined on a measure of infeasibility, such as the distance to the feasible set F.
This is the easiest to implement of the four methods, as maintaining a feasible set
or even distinguishing the feasible set provides a challenge in the empirical likelihood
context, especially with the presence of multiple estimating equations. It is a relatively
quick computation to see how well or badly the constraints are met–that is, whether
the candidate is in the feasible region, but to actually determine the feasible region in
its entirety is a computationally intensive challenge.
3. An example of an algorithm that imposes penalties by searching the space S is the
behavioral memory method, first discussed by Schoenauer and Xanthakis (1993). It
proceeds sequentially along the set of constraints. The algorithm begins with an initial
pool of individuals, regardless of whether they are feasible. For a given constraint,
the population is entered into an Evolutionary Algorithm until some predetermined
percentage φ of the individuals are feasible for this constraint (j = 1). Then, the
pool is entered into a new iteration of the algorithm, where j = 2 constraints are
now included. If individuals fail to meet the constraints 1, . . . , (j − 1), then they are
discarded from further consideration. The evolution proceeds in a similar manner until
a proportion φ of feasible individuals meet the jth constraint. Then, the process repeats
until all constraints are assessed in a sequence.
Another example of a method that searches for feasible solutions is that which rewards feasible points while penalizing infeasible ones. It requires the specification of a
constant r; the measure applied to each point is

eval(x) = f(x) + r ∑_{j=1}^m f_j(x) + θ(t, x).

The function θ is an adjustment so that a feasible solution is preferred over an infeasible one. An example is

θ(t, x) = 0 if x ∈ F, and θ(t, x) = max{ 0, max_{x∈F} f(x) − min_{x∈S−F} [ f(x) + r ∑_{j=1}^m f_j(x) ] } otherwise.
The region S −F represents the set of infeasible points; this formula guarantees that the
maximum of the feasible solutions will be preferred over the minimum of the infeasible
points.
A downside to this approach is that the order of the constraints is not well studied
and could lead to different behaviors in attaining a solution (such as in running time).
There is not an algorithm in place to designate the optimum order, so this is another
consideration in choosing this algorithm. Additionally, it may require immense computational power to continue running the procedure until we obtain φ feasible individuals.
However, we could investigate this in a future study to see how well it applies to the
empirical likelihood.
4. The final category includes any hybrid methods that have elements from numerical
programming problems that are deterministic, while incorporating features from Evolutionary Algorithms. The compartmentalizing of Evolutionary Algorithms allows for
elements to be swapped out with possibly new approaches or elements from deterministic theory. It is what makes the algorithm so useful, though it places a burden on
the user to ascertain the best course of action to take.
Nocedal and Wright (1999) discuss the merit function, which is similar to the fitness
function (3.1) from Michalewicz and Schoenauer (1996). The difficulty in working with
constraints and searching for viable solutions involves judgment calls that must be made at
each step. For instance, if a solution optimizes the objective function but fails to satisfy the
constraints, the question then becomes which step to take. Merit functions and filters are
candidates for helping to overcome these obstacles.
An example of a commonly used merit function, with inputs from the data and based on
penalty parameters p = (p1 , p2 ), is
φ_1(x; p_1, p_2) = f(x) + p_1 ∑_{i∈E} |c_i(x)| + p_2 ∑_{j∈I} [c_j(x)]^−,

where [c_j(x)]^− = max(0, −c_j(x)). This is also known as the ℓ_1 penalty function. In a minimization problem, constraint violations accumulate in the merit function, which therefore takes larger values for candidates with large deviations, thus steering the
algorithm away from undesirable solutions. One downside of using merit functions is that
they may exhibit the “Maratos effect,” which refers to the possibility that the algorithm may
trek in the wrong direction, thus taking longer to get back to the solution (if convergence
is achieved before the maximum number of iterations) (Nocedal and Wright, 1999). If the
algorithm is proceeding along a path to a solution, it may take the longer way around
because of constraint violations at stages of the route. The Maratos effect may be visualized
by means of an analogy to traffic, where the vehicle (representing a candidate solution to a
function) proceeds along routes in search of a destination while avoiding traffic (representing
the constraint violations and/or running time costs). If the vehicle constantly takes detours
to avoid traffic (taking the paths with minimum constraint violations), it may take longer
to arrive at the destination, as opposed to waiting in some amount of traffic on certain roads that may lead to a clear path afterward. However, when addressing constrained optimization problems, the question of how to handle the violations and in which direction to proceed is not an easy one. Although
it may sound reasonable to proceed along the path of minimum sums (objective function
plus magnitude of constraint violations), the Maratos effect shows that it is indeed not a
simple problem. Furthermore, a decision must be made regarding how to specify the penalty
parameter; whether it is set arbitrarily or within a range of reasonable values, a parameter
creates another aspect to investigate, though it can be made powerful with the right choice.
As usual, there is a tradeoff between how efficient and fast of an algorithm we desire versus
the additional steps, parameters, or resources needed to create a more global or suitable
algorithm.
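Returning to the ℓ_1 merit function above, a minimal R sketch is given below; the lists of equality and inequality constraint functions, and the default penalty parameters, are assumed for illustration.

    # l1 merit function: phi_1(x) = f(x) + p1 * sum |c_i(x)| over equalities E
    #                              + p2 * sum max(0, -c_j(x)) over inequalities I.
    merit_l1 <- function(f, eq, ineq, p1 = 1, p2 = 1) {
      function(x) {
        eq_viol   <- sum(vapply(eq,   function(ci) abs(ci(x)),     numeric(1)))
        ineq_viol <- sum(vapply(ineq, function(cj) max(0, -cj(x)), numeric(1)))
        f(x) + p1 * eq_viol + p2 * ineq_viol
      }
    }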
3.4
Applications in Empirical Likelihood: Preliminary
Algorithm Implementation
When compared with a numerical optimizer, the Evolutionary Algorithm approach incorporates randomness at each iteration. Furthermore, it considers several candidates at once
against a fitness function, which broadens the scope for solution-finding while increasing
running time. In this work, we consider two approaches for the Evolutionary Algorithm, one
of which is an improvement over the other. Both approaches utilize the example from the
Empty Set Problem discussed in Chapter 2, as it presents challenges we might not observe
with a simple just-identified problem. We first examine the preliminary implementation and
its results, following up with the improved algorithm on which we conduct a more in-depth
analysis.
For the initial adaptation of Evolutionary Algorithms to empirical likelihood estimation,
the strategy is to express the optimization in the framework of a constrained problem and
to apply the penalty function method of directing away from undesirable solutions. From
the example in the ESP section, we use data drawn from N (0, 1) with a sample size of
n = 15 (using 1000 replicates), along with the estimating functions g_1(X; µ) = X − µ and g_2(X; µ) = X^2 − (2µ^2 + 1). For this overdetermined problem (one parameter, two estimating
functions), we initially used the following setup:
min_w f(x, w) = min_w −2 ∑_{i=1}^n log(n w_i)   subject to

∑_{i=1}^n w_i x_i − µ = 0,
∑_{i=1}^n w_i x_i^2 − (2µ^2 + 1) = 0,
∑_{i=1}^n w_i = 1, and
w_i ≥ 0, i = 1, . . . , n.

Let us denote the three equality constraints (C1) ∑_{i=1}^n w_i = 1, (C2) ∑_{i=1}^n w_i x_i = µ, and (C3) ∑_{i=1}^n w_i x_i^2 = 2µ^2 + 1. To handle the constraint of positive weights, we draw
our candidates from U (0, 1) so that each vector of n = 15 populates the pool. To enforce
the constraint of summing to 1, we scale each vector by the sum of the components before
entering it into the fitness function. To handle the other equality constraints, we first try a
fitness function with arbitrary static penalty parameters p = (p1 , p2 ) = (1000, 1000):
f_fitness(x) = −2 ∑_{i=1}^n log(n w_i) + p_1 | ∑_{i=1}^n w_i x_i − µ | + p_2 | ∑_{i=1}^n w_i x_i^2 − (2µ^2 + 1) |.    (3.2)
We use large positive parameters so that the minimization will not choose weights that
produce large deviations from the constraints. The choice of the penalty parameter can
be used to emphasize one constraint over another or to scale the absolute deviations as
needed. That said, it can also obscure the original function to be minimized, especially
if the penalty terms are dominant. Another issue with this framework is that it becomes
difficult to simultaneously minimize all three summands: the function and the absolute
deviations of the first and second constraints. For example, suppose that four candidates
entered into the fitness function above produce 10 + 3 + 2, 5 + 5 + 5, 15 + 0 + 0, and 0 + 10 + 5,
all of which sum to 15. From the perspective of the algorithm, all are viewed as the same,
though the last one minimizes the function with quite a large absolute deviation on the
two constraints. This leads to a difficult question to answer: How do we establish which
is “best”? Do we give priority to the second or third summand (representing the first and
second constraint absolute deviations)? These kinds of questions become a disadvantage of
this approach, especially in the EL problem. This penalized strategy may be better suited
to problems without the rather rigid constraints and limited range on the weights.
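A minimal R sketch of the fitness function (3.2) for this example, taking a candidate weight vector already scaled to sum to 1 along with the data; the static penalties default to the values p_1 = p_2 = 1000 stated above, and this is an illustrative sketch rather than the program used for the simulations.

    # Fitness (3.2): ELR term plus penalized absolute deviations from the two
    # estimating-equation constraints (C2) and (C3); w sums to 1, x is the data.
    fitness_esp <- function(w, x, mu, p1 = 1000, p2 = 1000) {
      n <- length(x)
      elr <- -2 * sum(log(n * w))                  # objective to be minimized
      c2  <- abs(sum(w * x) - mu)                  # mean constraint (C2)
      c3  <- abs(sum(w * x^2) - (2 * mu^2 + 1))    # second-moment constraint (C3)
      elr + p1 * c2 + p2 * c3
    }

    set.seed(2)
    x <- rnorm(15)
    w <- runif(15); w <- w / sum(w)                # scaled candidate from U(0,1)
    fitness_esp(w, x, mu = 0)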
Now we describe the initial algorithm implementation:
• In advance, specify both the soft and hard stopping criteria: the acceptable rate of
change between a previous solution and a new solution (ε_conv) that indicates a stabilizing of the results, and the maximum number of iterations, n_max. We set these values at ε_conv = 10^{-6} and n_max = 200.
• Generate the initial population of α members, where each vector is a candidate solution to the constrained optimization problem. For the ESP under study, each initial
candidate comprises n = 15 elements from U (0, 1), and we set the generation size α to
be 100.
• Carry out recombination by splitting the population in half; all possible pairs between
the first α/2 and the second group of α/2 members are combined. We use the single-point crossover, thus producing two children per pair of parents. The number of children, β ≥ α, is a function of the crossover method used. In this case, we create (α/2)^2 = 2500 offspring.
• Introduce mutation throughout the offspring by randomly choosing m offspring and
adding a vector of length n = 15 with absolute values of random draws from the N(0,1)
distribution. The absolute value is needed to keep the weights from becoming negative.
Ten mutants are chosen at random from the offspring at each iteration; parents and
immigrants do not enter into the mutation operator.
• Add immigrants to the population by drawing them from the U (0, 1) distribution,
much like the initial generation. We use 10 immigrants in the work reported here, but
this value can be adjusted as needed.
• Consider the total grouping of candidates: the parents, children (those who were mutated and those who weren’t), and immigrants. Based on the numbers we selected,
this total grouping is 100 + 2500 + 10 = 2610, meaning that we consider that many
solutions at once at each iteration. Evaluate these candidates according to the fitness
function (3.2). We enter the vectors scaled by the sum into the fitness function so that
the constraint of summing to 1 is met. We do not, however, modify the actual vector
itself so as to permit more variability and arithmetic fluctuations in the recombination
and mutation processes, and once a solution is reached, we output the scaled version
for analysis.
• Using the fitness function (3.2), determine the minimum α = 100 candidates, and use
this pool as an initial population for the next iteration. Continue until nmax iterations
have been reached or
| f_new(x) − f_prev(x) | / f_prev(x) < ε_conv.    (3.3)
As the generation size and choice of recombination strategy can have a strong impact
on the running time, it is desirable to maintain a moderate number of candidates while
allowing for an extensive search for the best one. Nevertheless, if the generation size is
small, a solution may survive two consecutive generations if the subsequent generation fails
to produce a better candidate. In that case, the criterion (3.3) leads to the termination
of the algorithm. There is the possibility that this solution is not the best answer overall,
and we do not want an output to be an artifact of the algorithm. We cannot impose a
strict positivity condition in (3.3), as it would not permit the survival of a best answer from
generation to generation. We thus require the solution to survive at least some fixed number
of generations, so that we have at least tested that many more candidates. We can be more
confident in an answer that survives 10 or 20 successive generations than one that we have
terminated after making it through two. This number can be modified according to the
application, but we use a value of 10 for the simulation in the next section.
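The combined stopping rule, pairing the relative-change criterion (3.3) with this survival requirement, can be sketched in R as follows; the counter logic is an illustrative implementation choice.

    # Stop when the best fitness changes by less than eps_conv (criterion 3.3)
    # for `survive` consecutive generations, or when n_max generations elapse.
    should_stop <- function(f_new, f_prev, count, gen,
                            eps_conv = 1e-6, survive = 10, n_max = 200) {
      small_change <- abs(f_new - f_prev) / abs(f_prev) < eps_conv
      count <- if (small_change) count + 1 else 0
      list(stop = (count >= survive) || (gen >= n_max), count = count)
    }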
ESP: Gradient Method versus Preliminary Evolutionary Algorithm
Implementation
Comparison of ELR Statistic Estimates
The Evolutionary Algorithm (EA) is a very flexible but heuristic global algorithm, whereas
a numerical optimizer like the one from the Newton-Raphson optimization generally uses a
predetermined set of rules for choosing the next step based on derivatives. Now that the
Preliminary EA approach has been identified, we may compare the estimates obtained from
both programs. Let us focus on testing the true µ0 = 0, along with a symmetric deviation,
say, µ0 = −0.5 and µ0 = 0.5. At these three values, we compute test statistics using both
Newton-Raphson and the Preliminary EA.
The statistic output by Newton-Raphson is multiplied by -2 to obtain the ELR, so a
large negative number from the output appears as a large positive number in the boxplot, as
seen in Figure 3.8. Newton-Raphson also outputs the weights such that a legitimate solution
sums to n rather than 1; we simply adjust the weights as needed, such as when computing
the mean and higher moments. There are three possible summations of the weights from
the Newton-Raphson algorithm: (i) a valid sum of n when µ0 belongs to the convex hull’s
interior; (ii) a sum of 0 < k < n when µ0 is on the boundary of the convex hull surrounded
by k data points within the face; and (iii) a sum close to or exactly 0 when µ0 is not in the
convex hull. Thus, the convex hull condition, a complex problem in itself, can be checked
quickly in Newton-Raphson by the summation of the weights. With the Preliminary EA, the
summation of weights is constrained to 1 with the scaling operation, thereby not permitting
this convenient check. Thus, to further develop the EA work, we also consider an alternative
formulation that instead operates on the dual parameter without constraints.
Figure 3.8: Boxplot of ELR statistics from Newton-Raphson and preliminary application of
EA with 1000 simulations of N(0,1) with n = 15
Examining Figure 3.8, the boxplot in panel (a) corresponds to the statistics from all 1000
simulations entered into the Newton-Raphson program (Owen’s program or the el.test
function from the R emplik package). There are two groups: those with a very large (> 490)
ELR statistic and those in the (0, 200) range. The large values indicate that the algorithm
failed to produce an estimate, whether it is due to the optimization or the ESP.
To see if the groupings of ELR statistics correspond to the ESP characteristic, panels (b)
and (c) correspond to the Newton-Raphson algorithm applied to the sets with and without
the ESP, respectively. Subdividing the 1000 simulated data sets by the status of the ESP
gives some insight and also an avenue for further investigation. In panel (b) of Figure 3.8,
we see that all of the values are large (> 500), thus indicating a failure of the algorithm
to produce a viable solution, as to be expected by theory. However, in panel (c), we see
the same two divisions as in (a). It is interesting that not all data sets without the ESP
produced a solution, especially with the data set being of a standard type.
In both the Newton-Raphson and Preliminary EA programs, more solutions are found for
the true mean than the others. Fewer solutions are viable as the discrepancy grows between
the test and true means, which is a desirable property. The IQR for the statistics computed
while testing µ0 = 0 is smaller for the overall and non-ESP data sets, a reflection of the
equivalence of the test mean and population parameter. The behaviors for the statistics
under µ0 = −0.5 and µ0 = 0.5 are similar for the overall and non-ESP sets, reflecting the
symmetry of the normal distribution.
The bottom three panels in Figure 3.8 correspond to the ELR statistics from the Preliminary EA program. The behaviors differ from the Newton-Raphson program in that there
is not a distinct division between the moderate and very large values. The boxplots in (d)
for all 1000 data sets show a similar behavior across the three means, though the boxplot
is slightly lower for the true µ0 = 0. This pattern also emerges for the non-ESP boxplots
in (f). When examining panel (e), created from the estimates from those data sets with the
ESP, we see a slight resemblance in the boxplots for µ0 = −0.5 and µ0 = 0.5, a reflection of
the symmetry of the underlying distribution.
The boxplots in Figure 3.8 give us a sense of center and spread for the ELR statistics
across the three means and two algorithms, but we cannot gain a sense of how the Newton-Raphson and Preliminary EA outputs compare for the same data set. Figure 3.9 illustrates
the relationship between the two algorithms on a data-by-data basis. We see the clear
division of the Newton-Raphson output for those data sets that failed to produce a solution.
A perfect correspondence would have produced a strong linear trend about the line y = x, a
pattern we do not see here.
Figure 3.9: Plots of ELR statistics for Newton-Raphson and Preliminary EA by ESP status
Absolute Constraint Violations
As another means to compare the algorithms, we examine the sum of the absolute deviations
from the constraints (C1)–(C3) defined above.
Figure 3.10: Plots of constraints (C1), (C2), and (C3) for data sets (all, ESP, and non-ESP)
for Newton-Raphson and EA preliminary results
In Figure 3.10, the first row corresponds to the summation of absolute deviations from
(C3) against (C2). The second row displays the sums of absolute deviations of (C3) against
(C1). The differences from the Newton-Raphson program are limited to a few distinct values,
whereas the values for the Preliminary EA program range quite widely. A possible explanation is that the randomness introduced throughout the Preliminary EA program results
in varying estimates, in contrast to the systematic program from Newton-Raphson, which
takes a functionally determined path towards a solution. For the Newton-Raphson output,
the estimates are able to match the mean constraint (C2) very well when the hypothesized
mean matches the actual µ = 0, regardless of the ESP status. As we veer away from the
true mean to µ0 = −0.5 and µ0 = 0.5, the Newton-Raphson algorithm either meets the
(C2) constraint very well or has an absolute constraint violation of around 0.5. For the
(C1) and (C3) constraint violations, we see the division of solutions from Newton-Raphson
into two categories: those that meet the constraint well and those that do not. The worst
absolute constraint violations for (C3) are on the order of 1 (for µ0 = 0) and 1.5 for the
incorrectly specified means. There were some weights that summed to 0, as indicated by the
absolute constraint (C1) violation of 1. Recall that the Newton-Raphson outputs a weight
whose sum is (or is close to) 0 when there is a failure to solve. In the data sets that fail to
solve, there are some belonging to both the ESP and non-ESP categories. Even if we try
another solver, like the BFGS method, we obtain essentially the same results; those that do
not converge receive a slightly lower value under BFGS, though all other values match those
from Newton-Raphson, as seen in Figure 3.11. Thus, the lack of convergence is not specific
to Newton-Raphson.
Figure 3.11: The plots of the ELR statistics from Newton-Raphson against the BFGS estimation in el.test.bfgs for µ0 = −0.5, 0, and 0.5 in (a)–(c), respectively, along with the
line of slope 1 and intercept 0 show the correspondence of the two outputs. The values in
the upper right corner are those that failed to converge under both algorithms, though these
receive a larger value under Newton-Raphson.
For the Evolutionary Algorithm, in Figure 3.10, we see a concentration of points near the
origin that appear to satisfy both the (C2) and (C3) constraints quite well simultaneously.
The fanning-out pattern of points becomes increasingly sparse as we veer away with larger
absolute constraint violations. There are points which reached a solution satisfying one
constraint well, say (C2) or (C3), while performing worse with the other. There are also
some points which perform equally poorly on these two constraints. Interestingly, the points
on the topmost parts of the plot (those with the worst overall constraint violations) tend to
come from those with the ESP. Points without the ESP are scattered about below, indicating
that the constraints are not completely met but are more satisfactory with respect to one
or more constraints. The Evolutionary Algorithm shows more clearly the inability of the
constraints to be met for the ESP data sets. The points for the ESP under (C3) against (C2)
tend to follow a negative linear or quadratic trend. When considering (C1), the Preliminary
EA procedure imposes the constraint of the weights summing to 1.
Improvement on EA for EL
One improvement to the Evolutionary Algorithm application to empirical likelihood is to simplify it while maintaining the power to search for global solutions. A way to mitigate the complexity is to avoid handling constraints, as that is in itself a difficult problem. Recall that we can work on the likelihood in the form of L∗(λ) = ∑_{i=1}^n log∗(1 + λ^T (g(X_i) − µ)) with λ ∈ R^r and no constraints on the Lagrangian parameters. Thus, we shift our genotypes from
the weights to be distributed to the Lagrangian parameters. Though we can vary all the
parameters in an Evolutionary Algorithm problem to investigate the resultant impact, we
focus on the impact of the following: four types of crossover, two types of mutation, and three
generation sizes (50, 500, and 1000) for n = 15 from N (0, 1) and the ESP example from above,
with the constraints (C1) ∑_{i=1}^n w_i = 1; (C2) ∑_{i=1}^n w_i x_i = µ; and (C3) ∑_{i=1}^n w_i x_i^2 = 2µ^2 + 1.
We use three test means µ0 = −0.5, 0, and 0.5 to discover any patterns of interest.
In all, there are 24 total algorithms under consideration, shown in Table 3.1. However,
some of these algorithms are not viable, as they produce ELR statistics in the imaginary
range. For H0 : µ0 = −0.5 and H0 : µ0 = 0.5, we remove from consideration the Evolutionary
Algorithms with ID 4, 7, 10, 13, and 16, and for H0 : µ0 = 0, the algorithms with ID 10
and 16 are not viable. These all correspond to a generation size of 50 and a crossover other
than the heuristic; the random-based crossovers seem to fare better with a sufficiently large
generation size, though this has the downside of a longer running time.
In Figure 3.12, we plot the density of the ELR statistics from all 1000 data sets as processed by the valid Evolutionary Algorithms (those that did not produce imaginary values), in addition to the theoretical χ^2 density and the output from Owen’s R program for Newton-Raphson. This allows us
to see at once the general behavior of the statistics from all Evolutionary Algorithms that
Table 3.1: Table of 24 Evolutionary Algorithms under consideration for ESP example

EA ID #   Crossover    Mutation   Generation Size
1         One-Point    Normal       50
2         One-Point    Normal      500
3         One-Point    Normal     1000
4         One-Point    Uniform      50
5         One-Point    Uniform     500
6         One-Point    Uniform    1000
7         Uniform      Normal       50
8         Uniform      Normal      500
9         Uniform      Normal     1000
10        Uniform      Uniform      50
11        Uniform      Uniform     500
12        Uniform      Uniform    1000
13        Arithmetic   Normal       50
14        Arithmetic   Normal      500
15        Arithmetic   Normal     1000
16        Arithmetic   Uniform      50
17        Arithmetic   Uniform     500
18        Arithmetic   Uniform    1000
19        Heuristic    Normal       50
20        Heuristic    Normal      500
21        Heuristic    Normal     1000
22        Heuristic    Uniform      50
23        Heuristic    Uniform     500
24        Heuristic    Uniform    1000
do not produce imaginary ELR statistics and the Newton-Raphson method. The output
unveils a similar behavior for the Evolutionary Algorithms and the Newton-Raphson output
around the theoretical density, though the Newton-Raphson program assigns much larger
values for those failing to converge. The EA methods show a distinction between the values
that converged close to the theoretical curve and those with a larger value, much like the
one for R, only without such an expansive gap between them. Once again, we return to the
question of the ESP and explore whether this observation can be attributed to it.
The ELR statistics for the ESP data sets, shown in Figure 3.13, indeed show the larger
values that deviate from the theoretical curve for both the Evolutionary Algorithm and
Newton-Raphson methods. This is in accordance with the results we had seen previously
with the initial examination of the ESP. All data sets with the ESP appear farther away
from the theoretical curve, an indication of their failure to solve. However, we will see that
this feature is not restricted to just the ESP data sets.
In examining the non-ESP data sets in Figure 3.14, we see a similar division of convergent
and non-convergent values as with the plots for all of the data in Figure 3.12. For those
convergent values, we see a closer approximation to the theoretical results with the hypothesis
test of the true mean, H0 : µ0 = 0. For the two misspecified means, we see a skewing of the
distribution of ELR statistics to the right. Under the incorrectly specified mean, it becomes
more difficult to make the constraints match, especially as we go farther out from the mean.
A test of H0 : µ0 = −2 would produce fewer viable weights than a test under the true mean
or even a slightly misspecified mean, such as H0 : µ0 = −0.25. An interesting result is
that we still fail to have total convergence on the non-ESP problems, as we have seen with
the Newton-Raphson program, even when we take an Evolutionary Algorithm approach.
However, under the Evolutionary Algorithm, the deviations of the ELR statistics do not
appear to be as large as those under Newton-Raphson.
Figure 3.12: All ELR statistics for valid Evolutionary Algorithms by mean and Newton-Raphson from R for the ESP example
Figure 3.13: ELR statistics from valid Evolutionary Algorithms and Newton-Raphson from
R for ESP data sets at µ0 = −0.5, 0, and 0.5
Figure 3.14: ELR statistics from valid Evolutionary Algorithms and Newton-Raphson from
R for non-ESP data sets at µ0 = −0.5, 0, and 0.5
Moving on to evaluating the constraint violations, we take the mean of all absolute constraint violations for each of the three constraints. To assess the ability of the algorithm to
closely approximate the constraint, we use this form of a single measure to compare across
all Evolutionary Algorithms under consideration. For plotting purposes, we include the algorithms that produced imaginary ELR statistics and their constraint violations with respect
to the weights (which, when entered into the ELR statistic function, produced an invalid
response) to show the change across generation sizes. The incorporation of the heuristic
crossover results in an interesting property: those without it seem to show an improvement
in approximating the constraints with a larger generation size (which has the downside of
longer running time), whereas those based on it seem to perform with a generation size of
50 just as well as 1000.
Since we are trying to improve over the Newton-Raphson approach, we use the corresponding values as a baseline for comparison. In these figures, we divide the data set into the
non-ESP and ESP groups, so as to make a fairer comparison. In Figure 3.15, we see that the
two Evolutionary Algorithms based on the heuristic crossover perform the best out of the
methods under consideration. Though the Newton-Raphson method has a marginally better
performance in the non-ESP data sets, we have made improvements over the Preliminary
EA application. In the ESP sets, though the Evolutionary Algorithms appear to have a
smaller deviation overall, we note that the small scale indicates the algorithms are all fairly
close to each other in performance.
The (C2) constraint violations appear in Figure 3.16, and once again, we see the marginally
better ability of the Newton-Raphson approach to approximate this constraint based on the
sample mean for the non-ESP sets. The heuristic algorithms also show a markedly improved
performance over all the other strictly random crossovers. The step of moving the more
desirable candidate forward makes a huge improvement in both time and approximations.
Figure 3.15: The mean absolute violations for constraint (C1) ∑_{i=1}^n w_i = 1 across three hypothesized means µ0 = −0.5, 0, and 0.5
Figure 3.16: The mean absolute violations for constraint (C2) ∑_{i=1}^n w_i x_i = µ0 across three hypothesized means µ0 = −0.5, 0, and 0.5
The (C3) constraint on the second moment is where we begin to see the Evolutionary
Algorithms outperforming the Newton-Raphson method, demonstrated in Figure 3.17. As
with the “No Free Lunch” Theorem, though Newton-Raphson results in finer estimates
(though marginal) on the other two constraints, the Evolutionary Algorithm excels with
handling the higher-order constraint. We see that the heuristic-based crossovers have the
lowest mean absolute deviations from (C3).
If an algorithm performs very well with respect to constraints, it is also desirable that it
runs in a short amount of time. Sometimes we even consider a rougher approximation that
runs very quickly in place of an algorithm that performs very well but takes days to run.
We visualize the mean time in seconds spent on each of the data sets (within non-ESP and
ESP). It is also important to note that internal structures in MATLAB (i.e., “for” loops) can
drive up the running time as opposed to the same structure programmed in, say, FORTRAN.
Figure 3.18 shows that the increasing generation size drives up the running time, as these
yield much larger groups of offspring. The method for generating the offspring must be
chosen carefully to explore the better candidates while not wasting time on the unviable
ones. The heuristic method of crossover incorporates this decision-making step with a quick
check of objective function values, whereas the others are relying on randomness to bring
the best candidates to the surface. Generally, using randomness by itself means that we
risk requiring a large pool of candidates or a longer running time to sufficiently explore the
space, and we can reduce this somewhat with the incorporation of some “intelligent” steps to
direct the search in a more fruitful direction. In all three test means, the two heuristic-based
algorithms perform the most quickly relative to the Evolutionary Algorithms and perform
well in general with respect to running time. The heuristic crossover with normal mutation
has the lowest running time overall for the generation sizes of 50, 500, and 1000. Given that
the constraints are well met even at a generation size of 50 with this algorithm, it seems to
be the best candidate to contend with the Newton-Raphson method.
Figure 3.17: The mean absolute violations for constraint (C3) ∑_{i=1}^n w_i x_i^2 = 2µ0^2 + 1 across three hypothesized means µ0 = −0.5, 0, and 0.5
Figure 3.18: The mean running time (in seconds) for Evolutionary Algorithm and Newton-Raphson analyses across three hypothesized means µ0 = −0.5, 0, and 0.5
Lastly, we check the numbers of iterations (in Figure 3.19) it takes for an algorithm to
terminate. This is related to the running time and the maximum number of evaluations
permitted. For the Evolutionary Algorithms, we require a candidate to survive four consecutive generations before calling it a solution, to prevent the algorithm from choosing a
candidate because it was not able to find a better one (whether restricted by a generation
size, “bad” sampling, etc.). We also set the maximum number of iterations allowed at 200
for the Evolutionary Algorithms and 25 for the Newton-Raphson method (the default given
by Owen’s code). Keeping in mind this disparity in bounds, we see that the Evolutionary
Algorithms take longer to reach a solution, on average, with the non-ESP sets. However,
they terminate more quickly in the event of an ESP data set. The Evolutionary Algorithms
show a decreasing mean count of iterations as the generation size expands, as we include
more candidates and may encounter a solution more quickly, though the rates at which they
decrease vary across the specific types. For instance, the heuristic-based methods show a
more substantial reduction in iterations with increasing generation size than the ones using
the uniform and one-point crossovers. The heuristic-based methods reach a solution more
quickly than the others, as they integrate a decision-making step that saves time and computational resources. The step of checking the relationship between two parents and randomly
drawing from U (0, 1) for the heuristic crossover is much faster than the vector operations
needed to shift elements around in the other crossovers.
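The termination rule used above (a candidate must survive four consecutive generations, with an overall cap of 200 iterations) can be sketched in R as follows; the generation-level search is abstracted into a user-supplied function, and the names and tolerance are our assumptions rather than the actual MATLAB code.

    # Sketch of the "survive k consecutive generations" termination rule.
    # 'next_best' is any function returning the best candidate of the next
    # generation; k = 4 and max_iter = 200 mirror the settings in the text.
    run_until_stable <- function(next_best, k = 4, max_iter = 200, tol = 1e-8) {
      best   <- next_best()
      streak <- 1
      for (iter in 2:max_iter) {
        cand <- next_best()
        if (max(abs(cand - best)) < tol) {        # the same candidate survives
          streak <- streak + 1
          if (streak >= k) return(list(solution = best, iterations = iter))
        } else {                                  # a better candidate displaced it
          best   <- cand
          streak <- 1
        }
      }
      list(solution = best, iterations = max_iter)  # iteration cap reached
    }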
Given that the constraints are met quite well across all three generation sizes for the
heuristic-based methods, we weigh the tradeoff between a shorter mean running time and fewer iterations to reach the solution, since the running time increases while the iteration count declines as the generation size grows. We should take care not to choose too
large of a generation size so as to avoid a large pool of offspring that can tax computational resources in processing their phenotypes (objective function values). There is also
the consideration that a larger pool can result in exploring more solutions and taking fewer
Figure 3.19: The mean number of iterations for Evolutionary Algorithms and Newton-Raphson analyses across three hypothesized means µ0 = −0.5, 0, and 0.5
iterations (though with a longer running time per iteration) to reach a solution. In weighing
all of these, the best Evolutionary Algorithm for the ESP example would be the one with
heuristic crossover for offspring and normal mutation. Noting that the (C3) constraint is
well met under this algorithm, both this and the Newton-Raphson method have tradeoffs
such that one is not a clear winner. One is faster in computational time and more accurate
for non-ESP problems, though marginally, in meeting the lower-level constraints; the other
excels at meeting the highest-ordered mean for this particular problem.
Focusing on the heuristic crossover with normal mutation, we examine an example of how
quickly a solution converges under H0 : µ0 = 0 for a non-ESP data set in Figure 3.20. For plotting
purposes, we expand the required number of consecutive iterations for a solution to survive
from 4 to 20. Regardless of the generation size, the solution seems to start stabilizing by 12
iterations, so we are able to zoom in and view the behaviors for the first few iterations. As
to be expected, the largest generation size of 1000 takes fewer iterations to stabilize to an
answer (though with the tradeoff of a longer time spent on each).
Figure 3.20: An example of a solution process for an Evolutionary Algorithm, specifically
the heuristic crossover with normal mutation
The solution for the non-ESP data set from Figure 3.20 appears in Figure 3.21 using the
output from the Evolutionary Algorithm with a generation size of 50, the heuristic crossover,
and normal mutation of offspring. We can see that the convex hull indeed includes the true
mean point, with the shape based on the location of the data points. Under H0 : µ0 = 0,
the true mean, the weights are distributed across the 15 points in the manner shown in the
figure. The points that are farthest away from the mean receive a smaller weight, as seen
with the rightmost point receiving a weight of about 0.03. Those closest to the mean receive
a weight of approximately 0.08, and we see a progressive lowering of the weights as we move
away from the mean point.
Figure 3.21: A visualization of the convex hull and assigned weights to data points, demonstrating how points farther away from the true mean, indicated by the star at (0,1), receive
a lower weight than those with a smaller absolute deviation
3.5 EA versus Newton-Raphson in R: Fisher's Iris Data
To compare the performance of numerically-based optimizations in Newton-Raphson against
an EA implementation, we use a well-known data set for illustration, namely the iris data
used by Fisher (1936). Fisher’s original purpose in 1936 was to discern the species of iris
based on linear functions of the dimensions (width and length) of the sepals and petals of
50 observations from three categories: Iris setosa, Iris virginica, and Iris versicolor. As our
primary objective is not in discriminant analysis, we restrict the data set to those observations
from Iris setosa and focus on the sepal length and width. In Figure 3.22, the data appear
with the corresponding convex hull based on 50 points and the sample mean indicated at
(5.006, 3.428).
Figure 3.22: The scatterplot of a subset of Fisher’s iris data shows the relationship between
the sepal width and length for 50 observations from the Iris setosa species. The convex hull
is shown outlining a boundary enclosing the points.
We translate this into a “just-identified” EL estimating problem by examining two estimating equations, one for the mean of the sepal length and the other for the mean of the
sepal width. The possible values of the mean length and width, µL and µW , are specified
on a two-dimensional grid: µL ranges from 4.3 to 5.8 and µW from 2.3 to 4.4 with increments of 0.1. In addition to using the Newton-Raphson program, we also use four other R
functions available through the el.convex package for R by Yang and Small (2009). The
package includes ELR test routines, and we utilize it to obtain estimates on the grid based
on the following strategies: Broyden-Fletcher-Goldfarb-Shanno (BFGS), damped Newton,
Davidon-Fletcher-Powell (DFP), and Newton’s method. All four use the default tolerance
of 10−8 , and the default limit for iterations are 100, 200, 100, and 25, respectively, for the
BFGS, damped Newton, DFP, and Newton’s methods. Since there are differences among the
algorithms, we may not have a standard that works for all equally well, hence the varying
limits. The EL program in R uses a gradient tolerance of 10−8 and an upper bound on
iterations of 25, and its core is essentially a Newton-Raphson approach, thereby resulting in
similarities with the Newton’s method limits above.
The EA adapted to this purpose uses a generation size of 1000 with the requirement
that the output survives at least four generations consecutively to avoid a stagnation of the
algorithm. In conjunction with µ = (µL , µW ), we determine λ = (λL , λW ) that generates a
set of potential weights through which to place mass on each point. We employ the heuristic crossover with standard normal mutation, the combination that proved to be a strong candidate for estimation, even at a generation size of 50, in the simulation study of Evolutionary Algorithm variations.
At each iteration, 10% of the offspring are randomly chosen for mutation through adding a standard
normal random vector to the components, and the immigration rate is 30%.
Because we are minimizing
$$L^*(\lambda) = -\sum_{i=1}^{n} \log^*\!\left(1 + \lambda^T(X_i - \mu)\right),$$
we recall that setting the gradient equal to 0 leads to the equation
$$0 = \frac{1}{n}\sum_{i=1}^{n} \log^{*\prime}\!\left(1 + \lambda^T(X_i - \mu)\right)(X_i - \mu). \qquad (3.4)$$
Since (3.4) should be satisfied to some degree, we first find all candidates where the
absolute value of the summation above is less than 0.1, a value that is neither too stringent nor too lenient. Within these candidates, the likelihood $L^*(\lambda)$ is computed, and the minimum
becomes the top guess. If this guess survives at least four generations, then it becomes the
algorithm’s output for that particular problem. Otherwise, we reach a maximum number
of iterations and an answer that is the best of the ones considered, though not necessarily
ideal overall. Using this method, we create a contender for the other five numerically based
routines. Noting that the color scales are not equivalent across panels, the filled contour
plots of the ELR statistics in Figure 3.23 show a similar trend across all six algorithms. The
general shape of the convex hull spanned by the data appears to contain the minimum ELR
statistics, as to be expected. Once the µ = (µL , µW ) vector goes outside the convex hull, we
see a drastic increase in the statistics, thereby indicating a failure to solve.
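The screening step just described (retain candidates whose gradient condition (3.4) is nearly met, then rank the survivors by $L^*(\lambda)$) can be sketched in R as follows; the use of the maximum absolute gradient component as the "absolute value of the summation," the variable names, and the guard for an empty candidate pool are our assumptions.

    # Sketch of the candidate-screening step: keep candidates whose gradient
    # condition (3.4) is nearly met, then rank the survivors by L*(lambda).
    log_star <- function(z, eps) {
      # Owen's log*: log(z) above eps, quadratic continuation below eps
      ifelse(z >= eps, log(z), log(eps) - 1.5 + 2 * z / eps - z^2 / (2 * eps^2))
    }

    screen_candidates <- function(lambdas, X, mu, tol = 0.1) {
      n         <- nrow(X)
      eps       <- 1 / n
      centered  <- sweep(X, 2, mu)                      # rows are X_i - mu
      grad_size <- function(lam) {                      # size of the gradient in (3.4)
        z <- as.vector(1 + centered %*% lam)
        d <- ifelse(z >= eps, 1 / z, 2 / eps - z / eps^2)   # derivative of log*
        max(abs(colMeans(d * centered)))
      }
      objective <- function(lam) -sum(log_star(as.vector(1 + centered %*% lam), eps))
      keep <- apply(lambdas, 1, grad_size) < tol        # near-zero gradient only
      if (!any(keep)) return(NULL)                      # no viable candidate this round
      survivors <- lambdas[keep, , drop = FALSE]
      survivors[which.min(apply(survivors, 1, objective)), ]   # current top guess
    }

    # Toy usage: 1000 random lambda candidates for a bivariate sample and mean:
    set.seed(2)
    X <- cbind(rnorm(50, 5, 0.4), rnorm(50, 3.4, 0.4))
    screen_candidates(matrix(rnorm(2000, sd = 0.2), ncol = 2), X, mu = c(5, 3.4))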
In Figure 3.25, the general tendency of the derivative-based algorithms is to take fewer
iterations to arrive at a solution within the convex hull. Even if no solution is possible, these algorithms "linger" in pursuit of a potential solution until they are halted by the iteration limit. In
contrast, the EA shows the opposite trend: more iterations are taken within the convex hull
than outside. This is possibly due to the fact that there are multiple candidates within the
convex hull, and they must contend with each other, leading to a tournament-style match,
where the “winner” must advance. Outside the convex hull, if all solutions are equally bad
(and no truly good solution exists anyway), then we may see a termination of the algorithm
occur sooner because of a “solution” not being improved upon in four consecutive iterations.
This is also reflected in the time it takes for each algorithm to reach a solution, as displayed
in Figure 3.24. The EA takes the most time within the convex hull, whereas the others spend
more time outside. The running time for the EA is much longer than that of the more expedient numerical algorithms, but it is a heuristic and a global minimizer.
We now examine the behavior of the summation of the weights for the output based on
the six algorithms in Figure 3.26. The damped Newton, Newton, and EA approaches produce
reasonable summations of weights on means within the convex hull. We see a sharp drop-off
to 0 for these summations when reaching the exterior of the convex hull, as to be expected.
Figure 3.23: Plot of ELR statistics for Fisher’s iris data for six algorithms
Figure 3.24: Plot of total time for Fisher’s iris data for six algorithms
Figure 3.25: Plot of total iterations for Fisher’s iris data for six algorithms
The EA appears to define a sharper boundary within the convex hull. The R program for EL
estimation produces weights that exceed 1 in areas along the convex hull’s border. Again,
noting that the color charts are not equivalent, the BFGS and DFP algorithms also produce
summations that are satisfactory within the convex hull, though they assign a much larger
value to those non-convergent solutions.
Figure 3.26: Plot of summation of weights produced by the six algorithms
Looking at the final set of constraints, shown in Figures 3.27 and 3.28, we examine
how the algorithms perform with handling the mean for the sepal length and sepal width,
respectively. The sample mean for the sepal length is 5.006, and for the sepal width it is
3.428. We see that there are contours within the convex hull on which these are satisfied.
Again, as we go outside the convex hull, we fail to meet these constraints. The EA program
seems to delineate a finer boundary within the convex hull than the other strategies. This
gives us a finer sense of accuracy, at the cost of increased running time. We also note that
the BFGS and DFP assign large values to the points outside of the convex hull, as opposed
to the very small values (close to 0) assigned by the other algorithms (Damped Newton,
Newton, Newton-Raphson, and the Evolutionary Algorithm).
Figure 3.27: Plot of sepal length means for Fisher’s iris data, which has a sample mean of
5.006, based on EL weights for six algorithms
Figure 3.28: Plot of sepal width means for Fisher’s iris data, which has a sample mean of
3.428, based on EL weights for six algorithms
4 Composite Likelihood Functions and Inference

4.1 Motivation
Joint likelihoods in the parametric or semiparametric context are often very complex and
computationally intensive, thus posing challenges in estimation. One potential difficulty occurs when the full probability density contains an “awkward normalizing constant” (Mardia,
Kent, Hughes, and Taylor, 2009, p. 975). On the other hand, estimating a full likelihood
may become computationally intractable or impractical when nuisance parameters are involved or the data exhibit correlations. As an alternative to using the full likelihood, the
estimation procedure may be transformed into a simpler likelihood (usually based on subsets of the data) with minimal loss of information (Varin and Vidoni, 2005). Cox (1975)
lists potential avenues to explore: obtaining a robust estimate with a reduced likelihood;
addressing the problem of intractable full likelihoods, such as in stochastic processes; using
the information from a finite number of moments as opposed to using the stringent distributional assumption affiliated with the full likelihood; and reducing the dimension of the
parameter vector, particularly in the case with many nuisance parameters that can plague
full likelihood estimation. Regardless of how the modified likelihood is used, these issues
must be considered and addressed. However, the challenges associated with simplifying the likelihood do not stop there. Though Cox (1975) writes in the context of the
partial likelihood, the following apply to any strategy using a reduced likelihood formulation: identifying “constructive procedures for finding useful” formulations, determining the
cases for which all or part of the relevant information is captured in the simplified setting,
and formulating small- and large-sample tests (p. 270). The practicality and usefulness of composite likelihood have continued to spur research and application in many areas,
such as genetics, longitudinal modeling, and spatial statistics (Varin, 2008).
In problems where the joint likelihood may be very difficult (or even impossible) to
compute, it may be more convenient to consider a product of likelihoods (or equivalently,
summation in log likelihoods) that are marginal and/or conditional (Lindsay, 1988). These
methods have been referred to as a “pseudolikelihood,” dating back to Besag in 1974 (Besag,
1974; Lindsay, 1988), or even a "quasi-likelihood" or "approximate likelihood" (Varin, Reid, and Firth, 2011). There is potential for confusion, as these terms have either been applied in other settings or are too vague. As an example, Wedderburn (1974) uses the mean and variance structure to create a simplified likelihood, known as the quasi-likelihood approach in generalized linear models, and illustrates it with a data analysis using a logit model of proportions. Lindsay (1988) suggests that, in lieu of the more general term "pseudolikelihoods," the most accurate description is to call them composite likelihoods.
The ordinary likelihood is contained within the class defined by the composite likelihood. Varin and Vidoni (2005) discuss two groupings of composite likelihoods: “subsetting
methods” that utilize subsets of the data and “omission methods” that do not use all the
components from the full likelihood (p. 520). Taking all k-tuples (1 < k < n) of n observations from a distribution falls into this first class, as does using marginal or conditional likelihoods. Reducing the dimension of the margins leads to a simpler construct that may be
more viable computationally (Varin, 2008). The category of omission methods may involve
factoring out terms that are not relevant to the parameter of interest and dropping them from
estimation, hopefully leading to a more manageable estimation with minimal loss. This class
contains Besag’s pseudolikelihood (1974) and Cox’s partial likelihood (1975). Differences between the composite and partial likelihood methods stem from the components. Firstly, the
components in the partial likelihood of Cox (1975) do not necessarily match the marginal
or conditional likelihoods. Additionally, the partial likelihood is such that the scores are
not correlated, though this is permitted in the composite likelihood. The “total information
[in the partial likelihood] is the sum of the component informations,” so information can be
added by swapping out a component with a more informative one (Lindsay, 1988, p. 224).
This property does not hold in the composite likelihood framework. Despite the differences,
both are approaches to a general issue, though we focus on the application of the composite
likelihood in the context of empirical likelihood.
Focusing on the composite likelihood, the question then becomes how θ̂ CL , the parameter
vector estimated under the composite likelihood, compares to θ̂ F L , the value under the full
likelihood. In other words, with the reduction from the full likelihood, how much information
is preserved or available in a simplified construct? The composite likelihood must contain
the necessary information concerning θ, which helps to assure identifiability (Mardia et al.,
2009). Cox (1975) states that any data discarded from the reduced likelihood should not
be essential to estimating the parameter of interest. If we fail to incorporate the necessary
information on the parameters in the composite likelihood, the estimator will fail to be
consistent (Varin et al., 2011).
A full likelihood, especially with higher dimensions, can be associated with a very complex
surface that poses difficulty in optimization procedures. Relaxing the restrictions imposed
by the full joint likelihood and instead using a composite likelihood can produce a smoother
surface that is better suited for identifying solutions. Permitting this structure may also
simplify the parameter space to the point where we can identify the most informative features
(Varin et al., 2011).
The components that go into the product are themselves likelihoods on the restricted
sample space. Because of the foundation on marginal or conditional likelihoods, if we formulate an estimating equation based on the derivative of the composite log-likelihood, the
outcome is unbiased. Regardless of whether the components satisfy the independence property, the “inference function has the properties of likelihood from a misspecified model”
(Varin et al., 2011, p. 5). As such, we can relate concepts from a misspecified modeling
perspective to composite likelihood; we may gain additional insight into the estimation via
the relative efficiency of the estimation based on the reduced likelihood versus using the
full likelihood (Varin et al., 2011). This misspecification may occur in two ways: the true
underlying distribution occurring outside of the family of those under consideration and the
reduction of the full likelihood to a composite likelihood (Varin and Vidoni, 2005).
There is a “form of consistency robustness” when it comes to the composite likelihood;
sometimes it may be the case that the full MLE fails to be consistent (Lindsay, 1988, p.
222). The notion of robustness in this context, much like that of the generalized estimating
equations as opposed to point estimation, stems from the use of fewer model assumptions
on lower dimensions in lieu of full joint distributional specifications. In the event that there
are multiple joint distributions affiliated with common marginal or conditional distributions,
we may use a common procedure of inference (Varin et al., 2011). The issue of robustness
has been studied in depth for certain applications, such as the pairwise likelihood for multilevel models applied to binary data as opposed to a maximum marginal likelihood method
requiring integration (for which the computational complexity grows especially with higher
dimensions of integration) in Renard, Molenberghs, and Geys (2004). However, robustness
in this context has not yet been generalized. First of all, there are multiple ways to reduce
a full likelihood: Do we choose marginal or conditional likelihoods (or a mixture)? Do we
condition on two, three, or more sample points at a time? These preliminary exploratory
questions show that there is no "one-size-fits-all" approach, especially with higher dimensions. Once a composite likelihood has been formulated, we still have to investigate how
well these lower dimensional representations represent the important information from the
full likelihood. We may have an issue of identifiability, in the event that there is no full joint
likelihood “compatible with the component densities” (Varin et al., 2011, p. 28). This goes
back to Cox (1975), who mentions the need “to provide constructive procedures for finding
useful partial likelihoods,” and likewise extends to composite likelihood (p. 270).
If the components are from conditional densities, the Hammersley-Clifford Theorem guarantees the existence of a valid joint distribution, even if the full distribution is not estimable.
In Besag (1974), the Hammersley-Clifford Theorem is discussed in the context of a lattice, where a conditional probability model is used in place of specifying the full likelihood
with the positivity condition assumed–that is, if f (xi ) > 0 for each i = 1, . . . , n, then
f (x1 , . . . , xn ) > 0. To have a most generalized probability formulation “given the neighbors of each site,” the Hammersley-Clifford Theorem uses component conditional densities
to construct a “valid probability structure to the system” (Besag, 1974, p. 197). On the
other hand, the case is not so simple with composite likelihoods based on marginal density
foundations. One tool that has been used in this context is the copula to reformulate the
underlying joint likelihood (Varin et al., 2011).
4.2 Definitions
Denote the random variable of interest by Y = (Y1 , . . . , Yn )T . This random variable is
associated with a probability density function f (y; θ), where θ ∈ Θ ⊆ Rp is assumed to be
unknown. In this type of problem, the challenge typically lies with optimizing f (y; θ) in
order to estimate θ, though it may be possible to make such estimates on subsets of the data.
Varin et al. (2011) and Varin (2008) define a composite likelihood in terms of measurable
events $\{A_i;\, i = 1, \ldots, m\}$ and weights $w_i > 0$, $i = 1, \ldots, m$:
$$CL(\theta; y) = \prod_{i=1}^{m} f(y \in A_i; \theta)^{w_i} = \prod_{i=1}^{m} L_i(\theta; y)^{w_i}.$$
The weighting term wi may be disregarded if all are equal for i = 1, . . . , m. On the other
hand, we may boost the efficiency with an appropriate choice of the weights (Varin et al.,
2011). The composite log-likelihood is defined as $c\ell(\theta; y) = \log CL(\theta; y)$. If a unique
maximum exists for this function, it is referred to as the maximum composite likelihood
(MCL) estimator.
When estimating with composite likelihoods, the loss of information from using a coarser
formulation versus the complicated joint likelihood is hoped to be minimal, with little impact
on parameter estimation. Composite likelihoods utilizing marginal densities or conditional
densities are labeled as composite marginal or conditional likelihoods, respectively. Now we
discuss some examples and their nomenclature. Generally, we represent the event as the set
of y ∈ Ai , though we now demonstrate specific examples as discussed in Varin (2008) and
Varin et al. (2011).
Besag’s pseudolikelihood may be viewed as incorporating conditional information based
on the relationship between a single point and its neighbors. The notation for such a construct is
$$L_C(\theta; y) = \prod_{r=1}^{m} f\left(y_r \mid \{y_s : y_s \text{ is a neighbor of } y_r\}; \theta\right).$$
Another example of a composite conditional likelihood is that discussed by Liang (1987),
where the formulation is
$$L_C(\theta; y) = \prod_{r=1}^{m-1} \prod_{s=r+1}^{m} f\left(y_r \mid y_r + y_s; \theta\right),$$
which arises from stratified case-control studies, where the parameter of interest is the degree
of aggregation of a given disease. After matching each control to a case, the disease status
of the first-degree relatives for the case and control are recorded, along with supplemental
information like risk factors. The conditional likelihood involves computing the probability
that a given relative is linked to the case or control observation, given the disease status of
all other first-degree relatives for the matched case/control pair (Liang, 1987; Hanfelt, 2004).
The pairwise conditional composite likelihood
$$L_{PCCL}(\theta; y) = \prod_{r=1}^{m} \prod_{s=1}^{m} f\left(y_r \mid y_s; \theta\right)$$
involves combining pairwise conditional densities, examining only one neighboring point in
each condition (Mardia et al., 2009). This is in contrast to the pairwise marginal composite
likelihood, which takes the form
$$L_{PMCL}(\theta; y) = \prod_{r=1}^{m-1} \prod_{s=r+1}^{m} f\left(y_r, y_s; \theta\right).$$
We may also condition on a set of all other points except yr , denoted by y(−r) :
$$L_C(\theta; y) = \prod_{r=1}^{m} f\left(y_r \mid y_{(-r)}; \theta\right).$$
In other words, we represent the composite likelihood by conditioning on all other points and
multiplying the results together (Varin et al., 2011). As we can see, there are multiple ways
in which to condition on a subset or function of observations. The question then becomes
which is most appropriate for the type of analysis being carried out.
Following the notation of composite likelihoods, the independence assumption may be
worked into a composite likelihood as
$$IL(\theta; y) = \prod_{i=1}^{n} f(y_i; \theta)^{w_i},$$
which is intuitive and may be referred to as an independence likelihood. With wi = 1 for i =
1, . . . , n for independent data, this is also the usual full joint likelihood. The independence
likelihood belongs to the class of composite marginal likelihoods, and inference is focused on
the marginal parameters (Varin et al., 2011). If we are considering parameters describing
dependence, then it is vital to maintain blocks of data/information. Though this block size
may vary, as an example, this may be carried out in blocks of two through the pairwise
likelihood:
$$PL(\theta; y) = \prod_{i=1}^{n-1} \prod_{j=i+1}^{n} f(y_i, y_j; \theta)^{w_{i,j}}.$$
We can extend this into higher dimensions, such as triplets, but a high enough dimension
might be reached such that we encounter the same problem of complexity that surfaced with
the original problem of joint likelihood. It is also possible to combine various types under the
pseudolikelihood class, such as mixing independent and pairwise likelihoods (Varin, 2008).
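As a simple, self-contained illustration of a pairwise likelihood (not an example drawn from the references), the following R sketch writes down the pairwise log composite likelihood for an exchangeable Gaussian model with common mean mu, unit variances, and common correlation rho, with all weights w_{i,j} set to 1, and maximizes it with optim to obtain the maximum composite likelihood estimate.

    # Sketch of a pairwise (marginal) log composite likelihood: each pair
    # (y_i, y_j) is treated as bivariate normal with mean mu, unit variances,
    # and correlation rho; all weights w_{i,j} are 1.
    pairwise_cloglik <- function(theta, y) {
      mu <- theta[1]; rho <- theta[2]
      pairs <- combn(length(y), 2)
      sum(apply(pairs, 2, function(idx) {
        a <- y[idx[1]] - mu
        b <- y[idx[2]] - mu
        q <- a^2 - 2 * rho * a * b + b^2
        -log(2 * pi) - 0.5 * log(1 - rho^2) - q / (2 * (1 - rho^2))
      }))
    }

    # Maximum composite likelihood estimate of (mu, rho) on simulated data:
    set.seed(3)
    y <- rnorm(20)
    optim(c(0, 0.1), function(th) -pairwise_cloglik(th, y),
          method = "L-BFGS-B", lower = c(-Inf, -0.99), upper = c(Inf, 0.99))$par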
Another variation of the composite marginal likelihood is that based on pairwise differences:
$$L_{diff}(\theta; y) = \prod_{r=1}^{m-1} \prod_{s=r+1}^{m} f(y_r - y_s; \theta),$$
which has been applied in “continuous symmetric responses” with “dependence structure”
(Varin et al., 2011, p. 7). These examples demonstrate the flexibility of using composite
likelihoods, whether the components are based on conditional or marginal likelihoods. The
ultimate goal is common: to estimate or approximate the model that is otherwise too complex
in its joint form. A desirable outcome is one such that
L(θ; y1 , . . . , yn ) = CL + loss ≈ CL,
where CL represents the chosen composite likelihood and the loss is negligible. In this case,
the computational resources spent on the left-hand side are not needed, as it is acceptable
to use the simpler approximation. Varin et al. (2011) use the following notation: $L_C$ for composite likelihood and, if clarification is needed, $L_{MC}$ for marginal likelihood and $L_{CC}$ for composite conditional likelihood.
Asymptotic Theory and Properties
An objective in traditional likelihood theory is to estimate the parameter and to determine its
asymptotic behavior. Similar formulations may be accomplished with composite likelihood.
Of interest is the behavior of the maximum composite likelihood estimator θ̂ CL , which is
$$\arg\max_{\theta \in \Theta} L_C(\theta; y) \quad \text{or} \quad \arg\max_{\theta \in \Theta}\, c\ell(\theta; y) = \arg\max_{\theta \in \Theta} \sum_{k=1}^{K} w_k \log L_k(\theta; y).$$
Using the notation of Varin (2008) and Varin et al. (2011), we represent the composite
score function as
$$u(\theta; y) = \nabla_\theta\, c\ell(\theta; y) = \sum_{k=1}^{K} w_k \nabla_\theta \log f(y \in A_k; \theta).$$
This summation of individually viable likelihood score functions has the property of being
unbiased, provided regularity conditions are satisfied (Varin, 2008). Now we define the
matrices to be used in the asymptotic results: the sensitivity matrix is
$$H(\theta) = E_\theta\{-\nabla_\theta u(\theta; Y)\} = \int \{-\nabla_\theta u(\theta; y)\}\, f(y; \theta)\, dy,$$
and the variability matrix is represented by
$$J(\theta) = \mathrm{Var}_\theta\{u(\theta; Y)\}.$$
The q × q Godambe information matrix, also known as the sandwich information matrix, is
defined as
$$G(\theta) = H(\theta)\, J(\theta)^{-1} H(\theta).$$
The expected Fisher information matrix will be denoted as $I(\theta) = \mathrm{Var}_\theta\{\nabla_\theta \log f(Y; \theta)\}$. The second Bartlett identity states that $H(\theta) = J(\theta)$. If we are indeed using the full log-likelihood function, then $G = H = I$. Otherwise, under misspecification commonly present
with composite likelihood to a degree, this identity does not hold, thereby necessitating the
use of the Godambe information matrix instead of the Fisher information matrix (Varin et
al., 2011).
Now that we have defined the elements that explain the asymptotic behavior of the
composite maximum likelihood estimator θ̂ CL , with appropriate regularity conditions, we
may express the composite likelihood counterpart of the central limit theorem regarding the
score statistic. The asymptotic distribution of θ̂ CL is given by
$$\sqrt{n}\,(\hat{\theta}_{CL} - \theta) \xrightarrow{d} N_p\{0, G^{-1}(\theta)\}.$$
The asymptotic efficiency of θ̂ CL relative to θ̂ F L from the complete likelihood is given by
the ratio of G(θ) to I(θ), since the latter is what we associate with the traditional MLE.
As with the asymptotic tests for the MLE, we may conduct tests for a vector of parameters of interest, ψ, of θ = (ψ, τ) with H0 : ψ = ψ0 under composite likelihood as well. Here, τ represents the vector of nuisance parameters. Though the sensitivity matrix H(θ) and the variability matrix J(θ) may be solvable via expectation directly, sometimes it is
necessary to use a numerical approximation of these values in the tests below. The approximation is accomplished using functions based on the composite score functions evaluated at
the composite likelihood estimator, θ̂ CL. The sensitivity matrix H(θ) is an expectation and
thus may be estimated as follows:
$$\hat{H}(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \nabla u(\hat{\theta}_{CL}; y_i), \qquad (4.1)$$
and in the case that the second Bartlett identity holds true, we may avoid the differentiation and instead compute
$$\hat{H}(\theta) = \frac{1}{n}\sum_{i=1}^{n} u(\hat{\theta}_{CL}; y_i)\, u(\hat{\theta}_{CL}; y_i)^T.$$
The variability matrix, J(θ), is also estimated with the values from the composite score
functions:
$$\hat{J}(\theta) = \frac{1}{n}\sum_{i=1}^{n} u(\hat{\theta}_{CL}; y_i)\, u(\hat{\theta}_{CL}; y_i)^T. \qquad (4.2)$$
If the sample size n does not exceed the dimension of θ by a large enough margin, then
(4.1) and (4.2) may become poor estimators, such as in longitudinal data or other settings
with a fixed sample size with a comparatively large length for the data vector y i , i = 1, . . . , n,
like a long-running single time series. The variability matrix J may be estimated via (4.2); we may estimate the sensitivity matrix H using the assumptions of the model under study
in the parametric setting instead of (4.1). If we can simulate data from the model, we may
alternatively build an empirical estimate of the J matrix using Monte Carlo methods and an
average over the independent simulations (Varin et al., 2011). Again, we have a procedure
that may require modification under special circumstances such as those listed above, a reflection of the challenges of this approach and of the fact that "the range of models is simply too broad" (Varin et al., 2011, p. 34). Given that composite methods of estimation and inference are useful in so many fields, there is potential for further research in this area on these open problems.
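A brief R sketch of the plug-in sandwich calculation follows; the per-observation score vectors and negative Hessian contributions are assumed to be supplied by the model at hand, so the inputs and names here are placeholders rather than a general-purpose routine.

    # Sketch of the plug-in sandwich calculation from composite score output.
    # 'scores' is an n x p matrix whose i-th row is u(theta_hat; y_i), and
    # 'neg_hessians' is a p x p x n array whose i-th slice is -grad u(theta_hat; y_i).
    godambe <- function(scores, neg_hessians) {
      n <- nrow(scores)
      H <- apply(neg_hessians, c(1, 2), mean)   # sensitivity matrix, as in (4.1)
      J <- crossprod(scores) / n                # variability matrix, as in (4.2)
      G <- H %*% solve(J) %*% H                 # Godambe (sandwich) information
      list(H = H, J = J, G = G,
           avar = solve(G) / n)                 # approximate variance of theta_hat
    }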
Now, we examine three tests based on composite likelihood estimators. Denote the
Godambe information matrix G corresponding to ψ as Gψψ . Additionally, let the matrix
H ψψ represent the q × q matrix formulated from the elements corresponding to ψ from
H −1 (θ). The evaluation of H at ψ 0 and τ̂ CL (ψ 0 ) corresponds to H̃ = H{ψ 0 , τ̂ CL (ψ 0 )}.
1. Wald-Type Test: The Wald-type test statistic is given by
$$W_e = n\,(\hat{\psi}_{CL} - \psi_0)^T G_{\psi\psi}(\hat{\theta}_{CL})\,(\hat{\psi}_{CL} - \psi_0). \qquad (4.3)$$
The disadvantage of the test statistic (4.3) is that if we are interested in a one-to-one function of ψ, say $\psi_T = T(\psi)$, then $\widehat{\psi_T}_{CL}$ does not necessarily equal $T(\hat{\psi}_{CL})$. Computationally speaking, the terms H(θ) and J(θ) require evaluation, with a potential cost of complexity (Varin et al., 2011). We may resort to using values for these sensitivity and variability matrices, respectively, based on empirical estimates as in (4.1) and (4.2); a small sketch of the resulting computation appears after this list.
2. Score-Type Test: We may also carry out a hypothesis test with the score-type
statistic. In this case, τ acts as a nuisance parameter, so we estimate it first using the maximum composite likelihood estimator. Then, we are able to test
whether enough evidence exists to disprove that ψ = ψ 0 . The score statistic, given
by the following expression, may be ill-behaved computationally and also requires
computing the sensitivity and variability matrices:
$$W_u = \frac{1}{n}\, u_\psi\{\psi_0, \hat{\tau}_{CL}(\psi_0)\}^T\, \tilde{H}^{\psi\psi}\, \tilde{G}_{\psi\psi}\, \tilde{H}^{\psi\psi}\, u_\psi\{\psi_0, \hat{\tau}_{CL}(\psi_0)\}.$$
3. Likelihood Ratio Test: The third asymptotic test that may be used is the composite
likelihood ratio statistic. The statistic is given by the following expression:
$$W = 2\left[c\ell\{\hat{\theta}_{CL}; y\} - c\ell\{\psi_0, \hat{\tau}_{CL}(\psi_0); y\}\right] \xrightarrow{d} \sum_{j=1}^{q} \lambda_j \chi^2_{1,j}.$$
In this formulation, the $\chi^2_{1,j}$ are independent chi-squared variables with one degree of freedom, and $\lambda_j$, $j = 1, \ldots, q$, are the eigenvalues of $(H^{\psi\psi})^{-1} G_{\psi\psi}$; computing them thus adds complexity through the additional matrix operations required.
Adjustments exist that may help to re-express the W statistic in terms of a standard distribution that is more convenient for calculating p-values. Examples of such adjustments include the Satterthwaite approximation $W' = \nu W/(q\bar{\lambda})$, which is approximately distributed as a $\chi^2_\nu$ random variable with degrees of freedom $\nu = \left(\sum_{j=1}^{q} \lambda_j\right)^2 / \sum_{j=1}^{q} \lambda_j^2$. Another suggestion, made by Chandler and Bate (2007), is to adjust the composite log-likelihoods so that the second Bartlett identity holds, i.e., $H(\theta) = J(\theta)$, thus leading to the use of the traditional approximation to a standard chi-squared distribution.
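As referenced under the Wald-type test above, a minimal R sketch of the statistic (4.3) is given below; the Godambe information block for ψ is assumed to be available (for instance, from the sandwich sketch earlier), and the p-value uses the standard chi-squared reference distribution with dim(ψ) degrees of freedom.

    # Sketch of the Wald-type statistic (4.3); G_psipsi is the block of the
    # Godambe information corresponding to psi, assumed already computed.
    wald_type <- function(psi_hat, psi0, G_psipsi, n) {
      d    <- psi_hat - psi0
      stat <- as.numeric(n * t(d) %*% G_psipsi %*% d)
      c(statistic = stat,
        p.value = pchisq(stat, df = length(psi_hat), lower.tail = FALSE))
    }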
Composite Likelihood in Empirical Likelihood Context
The composite likelihood can prove beneficial in the parametric likelihood setting, especially
when the full likelihood is quite difficult to optimize or perhaps integrals are intractable.
The goal is to reduce the likelihood to the point where the important features are preserved,
while making the problem much easier to optimize. In the empirical likelihood context,
there is an important distinction between a composite empirical likelihood and an empirical
composite likelihood. The empirical composite likelihood represents the optimization of the
overall composite likelihood, whereas the composite empirical likelihood constructs an overall
composite likelihood with the optimized components. Without loss of generality, let us focus
on a pairwise composite likelihood for data of size n, where we take all possible $\binom{n}{2}$ pairs of
data points and formulate the empirical likelihood. Henceforth, we are making a distinction
between
$$\min_{\lambda} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log\left(L^*(\lambda, X_i, X_j)\right)$$
and
$$\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \min_{\lambda} \log\left(L^*(\lambda, X_i, X_j)\right).$$
The first case demonstrates the empirical composite likelihood and the second the composite empirical likelihood. With the empirical composite likelihood, we have a single set
of weights or parameters for all composite likelihoods concurrently, and with the composite
empirical likelihood, we estimate parameters on each sub-grouping.
One issue with the composite likelihood approach when used in an empirical likelihood
setting is the ESP. This is because, while the entire sample may form a convex hull that
captures the true mean parameter, a given subset of those points is not guaranteed to form
a non-empty set. Because no valid solution may be found, these will produce a large ELR
statistic, indicating a failure of the algorithm to solve. Thus, summing the components
without checking whether an ESP has occurred can distort the results, and we may have
some problematic point combinations if we do not take care to handle them appropriately.
This problem is not akin to removing outliers, since removing a single point is a relatively
simple process to execute. The ESP is contingent on the convex hull, which is formed from convex combinations of the points, so it is not as simple as removing a single point in order to remove any
undesirable properties. As such, we will remove from the summation of the log likelihoods
any data sets that suffer from the ESP. A question then becomes whether we are removing
noise or non-information when we encounter an ESP set. Additionally, do we gain any savings
in time or sharpen the usefulness of the information through this subdivision process?
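One quick way to implement such a screening check for the two constraints of the ESP example is sketched below in R: the target point (µ0, 2µ0² + 1) must lie inside the convex hull of the transformed points (x_i, x_i²) for the constraints to be satisfiable by nonnegative weights summing to one. The use of chull() is our own shortcut for this two-dimensional case, not the implementation used elsewhere in this dissertation.

    # Sketch of an ESP check for the two-constraint example: flag a subset as
    # ESP when the target (mu0, 2*mu0^2 + 1) falls outside the convex hull of
    # the transformed points (x_i, x_i^2). Two-dimensional case only.
    is_esp <- function(x, mu0) {
      pts    <- cbind(x, x^2)
      target <- c(mu0, 2 * mu0^2 + 1)
      hull   <- chull(rbind(pts, target))   # indices of the hull vertices
      (nrow(pts) + 1) %in% hull             # target is a vertex => outside => ESP
    }

    set.seed(4)
    x <- rnorm(15)
    is_esp(x, 0)    # typically FALSE near the true mean
    is_esp(x, 3)    # typically TRUE well outside the data range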
The composite empirical likelihood can be thought of as a decomposition of the full
convex hull into pieces, some of which will produce a theoretically valid estimate with others
qualifying as an empty set. First, we examine a single data set of size n = 15 from the
N (0, 1) distribution, where the entire data set does not form an empty set and appears in
Figure 4.1.
Figure 4.1: A histogram of the data set with n = 15 from N (0, 1)
Supposing that we do not wish to assume the underlying distribution, we again use the
same ESP example constraints as elsewhere in this dissertation. Because we have a vector
of length n = 15, we can compute empirical likelihoods on all possible triplets, quadruplets,
and so on to 14-tuples, along with the full empirical likelihood. Since the empirical likelihood is built around two estimating equations, we cannot estimate a pairwise likelihood: the number of points in each subset must exceed the number of estimating equations. At each stage, we
compute all possible k-tuples, rejecting those that meet the empty set problem. There may
be some that fail to converge while belonging to the non-ESP class, but this can indicate
some feature of that data set that made it unfavorable for optimization or a poor choice of
constraints or hypothesized mean value.
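The enumeration just described can be sketched in R as follows, again leaning on emplik::el.test applied to the stacked estimating functions of the ESP example; the convergence flag is based on the iteration count, as in the text, and the function and variable names are ours.

    # Sketch of the k-tuple composite empirical likelihood for the ESP example:
    # for each k-subset, el.test is applied to the stacked estimating functions
    # g1 = x - mu0 and g2 = x^2 - (2*mu0^2 + 1), and the ELR statistics are
    # summed over all subsets and over the convergent subsets only.
    library(emplik)

    composite_elr <- function(x, mu0, k, maxit = 25) {
      subsets <- combn(length(x), k)
      stats <- apply(subsets, 2, function(idx) {
        xs  <- x[idx]
        g   <- cbind(xs - mu0, xs^2 - (2 * mu0^2 + 1))
        fit <- el.test(g, mu = c(0, 0), maxit = maxit)
        c(elr = fit$"-2LLR", nits = fit$nits)
      })
      converged <- stats["nits", ] < maxit
      c(all = sum(stats["elr", ]),
        convergent_only = sum(stats["elr", converged]))
    }

    set.seed(5)
    x <- rnorm(15)
    composite_elr(x, mu0 = 0, k = 5)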
Figure 4.2 shows the plots of the summations of the composite ELR statistics across
hypothesized means from -2.5 to 4.5, along with the results for the k-tuplet empirical composite likelihoods (2 < k < n) and the full empirical likelihood. If we have n = 15 data
points and examine the triplet empirical composite likelihood, we have $\binom{n}{3} = \binom{15}{3} = 455$
individual empirical likelihoods on the respective subset of data. In panel (a) is the plot
resulting from taking all summations, regardless of convergence status. We also look at the
results obtained by taking only the summations of components that converged (as indicated
by the number of iterations being less than 25, the maximum set for Newton-Raphson),
which appears in panel (b). When we look at the summations taken without accounting
for the ESP or non-convergent data sets, the center appears to occur around µ0 ≈ −0.09;
for the summations based only on convergent components, the center appears to be around
µ0 ≈ −0.10. Thus, we would take these approximations as the maximum composite empirical likelihood estimator (MCELE). The sample mean for the data is 0.0213, though we have
seen previously from four data sets how the MELE and the MLE can behave similarly or
have quite a difference based on the sample distribution. Though the curves exhibit subtle
differences, all tend towards the center in that neighborhood. The total number of possible
summations (without considering the removal of non-convergent entities) is the same for
the five-wise composite empirical likelihood as it is for the ten-wise construct, due to the
combinatorial nature of the formulation. With the quadruplet empirical composite likelihood (in solid red), the curve tends to its minimum over approximately the same mean
as the minimum for the full empirical likelihood, though at different heights. Once we go
outside the approximate interval of [−0.5, 0.5], we see stronger differences in the rates of
change of the curve for the quadruplet empirical composite likelihood in comparison to the
full empirical likelihood, which is a much smoother curve. As we tend towards the seven- and eight-wise empirical composite likelihoods, we are reaching our maximum number of combinations for n = 15, thus bringing the running time into consideration for this choice. This actually entails quite a long running time, one that does not seem to reveal enough additional trends or information to justify the expense of the computations. If we take too small
of a k-tuplet (2 < k < n), then we risk having more empty subsets, as the space will not
be as well explored by 3 points as by 10 points. Although we may have the same number
of total combinations for k = 4 and k = 11, say, we may have a different running time with
fitting a subset of size 4 versus a subset of size 11. In this case, it can be more difficult to
find valid solutions with the smaller k-tuples, as the probability of the ESP can be higher
with fewer points.
Figure 4.2: The ELR statistics for a data set with n = 15 from N (0, 1) by full and composite
empirical likelihood computed on k-tuples, k = 3, . . . , 14 are divided into all summations (a)
and only convergent summations (b).
We now take a look at the behaviors of the curves individually, for both the ELR statistics computed on all possible k-tuples and just the convergent ones, as determined by the
iterations used for each k-tuple. In Figure 4.3, the curves represent the summation of all
individual k-tuplet ELR statistics, regardless of convergence or ESP status. As k increases,
the ELR statistic curve becomes smoother and closer to that of the full empirical likelihood.
The horizontal line in each plot indicates the estimate of µ using the minimum of the ELR.
Figure 4.3: The individual plots for ELR statistics for k-tuples (2 < k < n) and the full
likelihood on all possible $\binom{n}{k}$ combinations of the data
Let us now look at the summation of ELR statistics for only convergent k-tuples, as
shown in Figure 4.4. Subsets with the ESP appear in the group of non-convergent values, as
they fail to produce a valid solution. The group of non-convergent subsets will, however, also
include data without the ESP, as seen in the ESP example simulations in Chapter 2. Thus,
to categorize whether the subsets are convergent, we cannot rely on the relatively fast ESP
condition check. For each non-ESP subset, we must carry out the optimization and then
determine whether it failed to converge, until we identify the feature(s) that make it ill-suited
for optimization. Removing the noise of ELR statistics from non-convergent subsets, we see
a much coarser plot at the k-tuples of 3 to 8. Using the rule of including only convergent
summations has an influence on the estimators that would be output as the solution based
on minimizing the ELR statistic curve.
Figure 4.4: The individual plots for ELR statistics for k-tuples (2 < k < n) and the full
likelihood on convergent subsets of the data
We now summarize the numerical results of Figures 4.3 and 4.4 in Table 4.1. These
estimates of µ are based on the minimum of the appropriate (all or convergent-only) ELR
statistic curve for the k-wise empirical composite likelihood. Recall that the sample mean
of the data set is 0.0213, while the true underlying µ = 0. For the column corresponding
to a summation of all subsets’ ELR statistics, the increasing k produces a better estimate,
from -0.10 to 0.02 under the full empirical likelihood. On the other hand, if we consider a
summation with summands only from convergent k-tuples, the estimator reaches 0.02 much
faster, as early as k = 5. With k = 14, the curve seems to flatten out quite a bit and thus
may pose a challenge with finding the minimum–if this is flat enough at the base of the
curve, then we can have a wider interval on which the minimum may occur.
The running time for this procedure is actually much greater than just computing for
the full likelihood. The long running time is an artifact of taking all possible combinations
from the data–it may be that we can improve on this by randomly sampling a subset of
Table 4.1: The values of µ for which the ELR statistic curve for the k-wise empirical composite likelihood reaches a minimum on a grid spaced from -2.5 to 4.5 in increments of 0.01,
under two computational scenarios: one in which the ELR statistic for every subset is included in the summation and another for which only convergent subsets contribute to the
ELR statistic overall
k-Tuple    All Subsets    Convergent Subsets Only
3          -0.10          0.48
4          -0.09          0.84
5          -0.09          0.20
6          -0.09          0.22
7          -0.09          0.20
8          -0.09          0.20
9          -0.09          0.20
10         -0.08          0.20
11         -0.06          0.00
12         -0.04          0.01
13          0.00          0.03
14          0.01          0.06
Full        0.02          0.01
a fixed size, say 100 or 200, to compute the composite empirical likelihood estimator as a
future research direction. This random sampling framework would enable us to work with
fewer subsets of data that would be more manageable for estimation. An interesting aspect
for future investigation would be to ascertain whether using subsets yields interesting finds
that can be obscured by examining the entire convex hull at once; if this avenue were to
prove successful, then it would provide a benefit of this approach. In Figure 4.5, because
we are accounting for non-convergence, though k = 4 and k = 11 have the same number of
combinations, they will not yield the same total running time. In this case, k = 5 (solid blue
line) shows the best running time for 3 < k < 10.
Figure 4.5: The plot shows the total running time in seconds for a data set with n = 15
elements from N (0, 1), as we use it to estimate the full and composite empirical likelihood
computed on k-tuples, k = 3, . . . , 14. Since we know that the data sets with ESP will fail to
produce a solution, we do not add the running time for any subset with the ESP. However,
for those non-ESP subsets that produced a solution, we include their running time, since the
decision to exclude them was made after the attempt to optimize.
The two mean constraints (C2) and (C3) of the ESP example are used to further gauge
how the k-tuplet constructs (2 < k < n) behave over the range of hypothesized means. For
Figure 4.6, at each µ0 , we take all possible k-tuplets, compute an estimate of the weights,
and use the weights to formulate the mean (noted as “mean 1”). Then, the mean of these
mean 1 values is used to visualize the behavior of the constructs in handling (C2). We see
that a positive linear trend is apparent over a very small interval in all cases, though the
centers occur at different points. Once a µ0 is reached so that (C2) is not well met, we see
a dropping off to 0, as the weights are assigned values close to 0.
Figure 4.6: For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of
the first-order mean for all possible k-tuples (k = 3, . . . , 14) and the full empirical likelihood
using a data set with n = 15 from N (0, 1).
Similarly, we observe the plot of the first-order means corresponding to only the subsets
that converged, displayed in Figure 4.7. By restricting to only convergent subsets in the
summand, we see a zooming in of the plot to only those values that produced a non-zero mean (the near-zero means arise from weights close to 0). For k = 3, we see some non-linear trends at the edges of
the plot, though this disappears by k = 5. Starting with k = 5 and upwards, the empirical
composite likelihoods seem to meet constraint (C2) quite well.
Figure 4.7: For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of the
first-order mean for only convergent k-tuples (k = 3, . . . , 14) and the full empirical likelihood
using a data set with n = 15 from N (0, 1).
Moving to the constraint (C3)–the higher-ordered mean, which we will call “mean 2” for
plotting purposes–we first use the estimated weights to compute the second-ordered mean
for each convergent subset. Then, all of the subsets are summed together and divided by
the count to produce a mean of the mean 2 statistics. Plotted in Figure 4.8, we see the
result of summing all subsets without considering convergence. All k = 3, . . . , 14 and the
full empirical likelihood show a quadratic curve, which becomes increasingly smoother as we
increment k. Constraint (C3) gives the mean 2 as $2\mu^2 + 1$, a function better matched with
k > 13 than at k = 3.
Figure 4.8: For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of the
second-order mean for all k-tuples (k = 3, . . . , 14) and the full empirical likelihood using a
data set with n = 15 from N (0, 1).
Figure 4.9 shows the mean of the mean 2 computations by µ0 for convergent subsets
only. We see the zooming-in effect again, and once we reach k = 5, the tails dropping off to
0 disappear. The quadratic trend becomes stronger, and while the shapes are similar, how
quickly the curve rises differs among the k-tuples.
Figure 4.9: For a grid of µ0 values from -2.5 to 4.5 by 0.01, the plot shows the mean of
the second-order mean for only convergent k-tuples (k = 3, . . . , 14) and the full empirical
likelihood using a data set with n = 15 from N (0, 1).
Overall, dividing the components as convergent or non-convergent does not substantially
alter the overall trends of the plots, such as the quadratic nature of the second-order mean
or the linear trend of the first-order mean. However, an impact was observed in how quickly
the curves change and where the centers occur. Because the centers are used to obtain
an estimate from the ELR statistic plots, this produces different estimators under the two
categories, all or convergent-only subsets. It appears that the convergent-only subsets reach
a viable solution more quickly, and they have the advantage of requiring fewer solutions in
the case of a large number of ESP subsets. The results support the use of higher values of k,
such as k = 14 or even, more simply, the full mean. It would be very interesting to see if a
data set even exists where the overall collection of points fails to have the ESP and also fails
convergence, while having subsets that converge. In that case, this approach based on the
empirical composite likelihood would realize a great benefit and potential for further study.
As an alternative to using the ELR statistic curve’s minimum for an overall µ̂ value, we
can produce individual estimates and then pool the estimates via the mean. Varin, Reid,
and Firth (2011) use the example of a pairwise likelihood based on all possible $\binom{n}{2}$ pairings
from the data to discuss obtaining an overall estimate based on the individual parameter
estimates from the components. The parameters are given as $\omega = (\theta_{1,2}, \theta_{1,3}, \ldots, \theta_{n-1,n})$, and we estimate $\hat{\omega} = (\hat{\theta}_{1,2}, \hat{\theta}_{1,3}, \ldots, \hat{\theta}_{n-1,n})$, obtaining an overall estimate of θ by averaging all
components in ω̂. In our case, we consider both summations of all elements and summations
of only convergent or non-ESP components for a sample n = 15 from N (0, 1) with the ESP
example framework. In Table 4.2, we see that the percentage of the ESP starts at around 71%
with the tripletwise empirical composite likelihood and then declines to almost 0 by k = 8,
as seen in Figure 4.10. The estimators all hover around the true mean of µ0 = 0, but slight
differences appear when factoring in the ESP status. The decreasing standard deviation
reflects the use of fewer k-tuples as we approach the full likelihood. In the 5 ≤ k < 12 case,
the estimators on sets with the ESP show greater variation than those for the non-ESP,
though the reverse is true for k = 3 and k = 4, perhaps because the estimator is more
difficult to pin down with relatively few points. Table 4.1 contains the estimates obtained
from examining a plot of the ELR curves for the summation over all k-tuples and only
convergent k-tuples; in contrast, the estimates in Table 4.2 are obtained by averaging over
the individual estimates for the k-tuples, which allows us to compute variance of estimators.
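For completeness, a rough R sketch of this pooling scheme for the ESP example is given below, reusing emplik::el.test as in the earlier sketch. The grid is deliberately coarse and k is chosen fairly large so the illustration runs quickly; splitting the pooled average by ESP status, as in Table 4.2, would simply add a flag such as the is_esp check sketched earlier. This is an illustration under our own naming conventions, not the program used to produce the table.

    # Sketch of pooling component-wise estimates: each k-subset receives its own
    # mu estimate (the grid minimizer of that subset's ELR), and the pooled
    # estimate is the mean of those component estimates.
    library(emplik)

    pooled_estimate <- function(x, k, grid = seq(-2.5, 4.5, by = 0.1)) {
      subsets <- combn(length(x), k)
      comp <- apply(subsets, 2, function(idx) {
        xs  <- x[idx]
        elr <- sapply(grid, function(mu0) {
          g <- cbind(xs - mu0, xs^2 - (2 * mu0^2 + 1))
          el.test(g, mu = c(0, 0))$"-2LLR"
        })
        grid[which.min(elr)]                 # component estimate for this subset
      })
      c(estimate = mean(comp), sd = sd(comp))
    }

    set.seed(6)
    x <- rnorm(15)
    pooled_estimate(x, k = 12)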
As an example of visualizing the maximum empirical likelihood estimators on each subset
of data over all combinations, we take a closer look in Figure 4.11 at the case k = 7. The
maximum empirical composite likelihood estimator is then taken as the mean of all the
relevant values. The histograms for the total set of estimators and those from only the non-ESP sets resemble each other, with a slightly left-skewed shape and a bit of heaviness on the upper tail that deviates from the usual normal shape. The values in the ESP histogram form
a bimodal shape, departing from a normal distribution due to not having a valid optimization.
Table 4.2: Table of empirical composite likelihood estimators of µ for the ESP example,
subdivided into summations taken over all values, on non-ESP sets, and on ESP sets, along
with standard deviations of the estimators across all k-tuples used to derive estimate
k-Tuplet #   All (SD)          Non-ESP (SD)      ESP (SD)          Percentage of ESP
3            0.0151 (0.3125)   0.0230 (0.3617)   0.0120 (0.2915)   0.7187
4            0.0065 (0.2779)   0.0028 (0.2848)   0.0096 (0.2720)   0.5407
5            0.0069 (0.2453)   0.0067 (0.2340)   0.0074 (0.2629)   0.3776
6            0.0061 (0.2143)   0.0062 (0.1988)   0.0057 (0.2568)   0.2434
7            0.0044 (0.1861)   0.0043 (0.1726)   0.0046 (0.2526)   0.1427
8            0.0056 (0.1586)   0.0057 (0.1489)   0.0038 (0.2500)   0.0741
9            0.0074 (0.1338)   0.0075 (0.1283)   0.0031 (0.2485)   0.0326
10           0.0089 (0.1110)   0.0090 (0.1085)   0.0033 (0.2520)   0.0110
11           0.0109 (0.0901)   0.0109 (0.0894)   0.0033 (0.3037)   0.0022
12           0.0126 (0.0712)   0.0126 (0.0712)   NA                0
13           0.0136 (0.0530)   0.0136 (0.0530)   NA                0
14           0.0153 (0.0344)   0.0153 (0.0344)   NA                0
15 (Full)    0.02              0.02              NA                0
Figure 4.10: The visualization of the estimator µ0 with respect to the ESP status by k-tuplet
composite likelihood (2 < k < n) and full empirical likelihood reveals that, by k = 4, the
behaviors of the ESP and non-ESP curves are quite similar, due to the reduced percentage
of data sets with the ESP.
Figure 4.11: The k = 7 case produces the following histograms based on the component
empirical likelihoods for all possible seven-point combinations of the data. We examine the
effect of all estimators, only those from the ESP sets, and only those from the non-ESP sets,
as determined by the ESP criterion for this example.
5 Conclusions and Directions for Future Research

5.1 Conclusions
Empirical likelihood continues to attract interest and research as its methods find increasingly wide applicability. The theory is relatively young, with many open questions for researchers to explore; this potential, coupled with the flexibility of empirical likelihood, means
that there are even more avenues to research. While correctly specified parametric families
are most powerful, a nonparametric method presents an alternative with fewer assumptions,
thereby having robustness as an advantage at a cost of power. Empirical likelihood incorporates the power of a likelihood without the stringent assumptions often found in many
parametric tests. While a powerful tool, empirical likelihood is not without its challenges
in optimization, such as the running time and complexity required to estimate or test a
parameter. In this dissertation, we explored the ESP and its implications for optimization
and empirical composite likelihood. We developed an Evolutionary Algorithm and applied
it to empirical likelihood estimation and testing, and in so doing, we now have an alternative
optimizer in addition to the Newton and quasi-Newton algorithms.
Because empirical likelihood is so data-driven, it can suffer from data failing to adequately
cover the true parameter. This scenario may be checked easily by the summation of the
weights or a direct formulation of an ESP condition based on the constraints that are used.
We recall that a failure to converge is not always directly related to the ESP–some non-ESP
data sets failed to converge, both in the Newton-Raphson and the Evolutionary Algorithm
applications. For the ESP example, we were not able to isolate a common characteristic that
contributed to the failure of a non-ESP data set to solve, even with conventional optimizers.
The question then became whether we would gain additional insight with the use of an
Evolutionary Algorithm, a method free from the derivatives used in the Newton and quasi-Newton methods from the emplik and el.convex R packages.
Past research has used variants of Newton or quasi-Newton optimization for estimating or testing the weights associated with each datum. Prior to this dissertation research, the Evolutionary Algorithm had not been published as an alternative. A great advantage of Evolutionary Algorithms is their ability to find global solutions. However, this comes at the cost of a longer running time, driven by factors such as the generation size, the method of crossover and mutation, and the evaluation of the objective function and any constraints. An initial application of the Evolutionary Algorithm based on finding the optimal weights proved unsuccessful, though not unfruitful. The focus of the Evolutionary Algorithm then shifted to the dual problem, where the members of the population were unconstrained Lagrangian parameters rather than weights. We evaluated four crossover methods with two types of mutation at three generation sizes (for a total of 24 possible programs) on the ESP problem example used throughout this dissertation. We chose this example because of the complexity of the overdetermined problem, with a single parameter µ required to meet two estimating equations. The Evolutionary Algorithms did not all perform equally well; those based on the heuristic crossover tended toward a solution more quickly and used an efficient decision-making process in generating offspring. The Newton-Raphson algorithm outperformed them in some respects, particularly running time, but the chosen Evolutionary Algorithms tended to accommodate the higher-level constraints more adeptly. In this work, the heuristic crossover with normal mutation represented a marked improvement over the initial implementation on the weights.
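To make the dual formulation concrete, the sketch below outlines a toy Evolutionary Algorithm of this kind for a single mean constraint; it is illustrative only, and the population size, crossover step, and mutation settings are assumptions rather than the tuned values studied in this work. For fixed $\theta$, the profile empirical log likelihood satisfies $-\log R(\theta) = \max_{\lambda} \sum_{i} \log\{1 + \lambda^{T} g(X_i, \theta)\}$ over feasible $\lambda$, so the population consists of candidate Lagrangian vectors $\lambda$.

dual_objective <- function(lambda, G) {
  # G: n x m matrix with rows g(X_i, theta); lambda: candidate multiplier
  u <- 1 + as.vector(G %*% lambda)
  if (any(u <= 0)) return(-Inf)                  # outside the feasible region
  sum(log(u))
}

ea_dual <- function(G, pop_size = 50, n_gen = 200, sd_mut = 0.1, n_elite = 5) {
  m   <- ncol(G)
  pop <- matrix(rnorm(pop_size * m, sd = 0.5), nrow = pop_size)
  for (gen in seq_len(n_gen)) {
    fit <- apply(pop, 1, dual_objective, G = G)
    pop <- pop[order(fit, decreasing = TRUE), , drop = FALSE]  # best rows first
    kids <- matrix(NA_real_, nrow = pop_size, ncol = m)
    for (i in seq_len(pop_size)) {
      idx <- sort(sample(ceiling(pop_size / 2), 2))  # two parents from better half
      better <- pop[idx[1], ]; worse <- pop[idx[2], ]
      # heuristic-style crossover (step from worse toward better) + normal mutation
      kids[i, ] <- worse + runif(1) * (better - worse) + rnorm(m, sd = sd_mut)
    }
    pop <- rbind(pop[seq_len(n_elite), , drop = FALSE],
                 kids[seq_len(pop_size - n_elite), , drop = FALSE])
  }
  fit  <- apply(pop, 1, dual_objective, G = G)
  list(lambda = pop[which.max(fit), ], minus2LLR = 2 * max(fit))
}

# Toy use: test H0: mean = 0.2 with a single estimating equation.
set.seed(2)
x <- rnorm(30, mean = 0.1)
ea_dual(cbind(x - 0.2))$minus2LLR   # approximately emplik::el.test(x, 0.2)$`-2LLR`

A heuristic-style crossover steps from the worse of two parents toward the better one, a normal perturbation supplies the mutation, and elitism carries the best few members forward unchanged.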
Composite likelihood is a tool for facilitating optimization of a likelihood, for instance when the complexity of a full parametric likelihood hinders estimation, whether through integration or a normalizing constant. A downside of composite likelihood methods is that some information is lost, ideally to a minimal extent. The question then became whether such an approach is possible with the empirical likelihood, to facilitate estimation and/or inference. In this dissertation, such an approach did not yield any time savings. In fact, the running time was actually much greater when taking all k-tuples from a data set of length n to construct the ELR statistics. However, it may not be necessary to investigate every single combination, as some may share an overlapping convex hull with others. The ESP enters again with the composite empirical likelihood: once a data set has the ESP, all of its subsets will likewise have the ESP. If a data set is without the ESP based on all n points, taking a subset of those points is not guaranteed to produce a non-ESP set. Thus, we have an additional screening step in the summation of the ELR statistics that is not inherent to the parametric composite likelihood.
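For a scalar mean constraint, the screening step can be made explicit as in the following sketch, which sums the component ELR statistics over the non-ESP k-point subsets only. el.test from the emplik package supplies the component statistics; the overdetermined problem studied here would require its own estimating-equation routine in its place.

library(emplik)

composite_elr <- function(x, mu0, k) {
  subsets <- combn(length(x), k)                 # all n-choose-k index sets
  elr <- apply(subsets, 2, function(idx) {
    xs <- x[idx]
    # ESP screen: for a scalar mean, mu0 must lie strictly inside the subset's range
    if (mu0 <= min(xs) || mu0 >= max(xs)) return(NA_real_)
    el.test(xs, mu = mu0)$`-2LLR`
  })
  c(sum_elr = sum(elr, na.rm = TRUE),            # convergent (non-ESP) subsets only
    n_esp   = sum(is.na(elr)),                   # subsets screened out by the ESP
    n_used  = sum(!is.na(elr)))
}

set.seed(3)
x <- rnorm(12)
composite_elr(x, mu0 = 0.1, k = 4)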
To organize the discussion of the implications of this research, we use the ESP example to unify the three major themes: the development of an Evolutionary Algorithm for empirical likelihood, the pattern and behavior of the ESP, and the composite likelihood approach. These research objectives may realize greater improvements over standard methods in more complex situations, especially since Newton-Raphson and other gradient-based methods can perform poorly on an unfavorable surface. Just as we improved on the Preliminary EA, we can continue to investigate and adapt the programs accordingly. As of this writing, publications continue to surface in the theory of empirical likelihood, and given the complexity of modern data, the method is likely to see continued interest and advancement. Just as Davidon developed a groundbreaking algorithm while working around the limitations of computing power, sometimes it takes a solution to a practical problem to make advances and move forward. The field of empirical likelihood is still relatively young, and there are many more interesting findings and open problems to be explored.
5.2 Directions for Future Research
In the course of this research, we identified multiple directions for future work. The extensive impact of the ESP means that it underlies many empirical likelihood methods and strategies. We now list some future research directions based on the results we obtained, most of which involve the ESP in some way.
Evolutionary Algorithm
A next step in the application of Evolutionary Algorithms to empirical likelihood would be to continue making improvements, with the goal of increasing their appeal by reducing the running time. There may be ways to take "smarter" steps with the empirical likelihood than are typically available to a general Evolutionary Algorithm. Another potential research direction is the co-evolutionary approach, which is said to offer a true "free lunch." If this method can outperform even Newton-Raphson, we would gain another viable and attractive tool for optimization in empirical likelihood analysis. Perhaps the co-evolutionary method would also shed light on the ESP and on non-convergent non-ESP data sets.
MELE Estimation
To determine the MELE, the ELR statistics were computed on a grid of mean values in increments of 0.01, to manage the number of iterations, and the value producing the minimum ELR statistic was chosen as the MELE. This process is quite time-consuming, and a better search strategy for the mean could deliver a closer approximation with greater precision. This is a case where the theory is much simpler than the computational implementation. The brute-force method wastes precious time on non-candidate means, but how do we tell a computer which way to go? Perhaps the "training" method of the co-evolutionary algorithm would be a direction to pursue.
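For reference, a minimal version of this grid search for a scalar mean is sketched below, again with el.test standing in for the overdetermined estimating-equation calculation actually used; the grid is restricted to the interior of the convex hull so that the ESP is avoided by construction.

library(emplik)

mele_grid <- function(x, step = 0.01) {
  # keep the grid strictly inside the convex hull so every point avoids the ESP
  grid <- seq(min(x) + step, max(x) - step, by = step)
  elr  <- vapply(grid, function(mu) el.test(x, mu = mu)$`-2LLR`, numeric(1))
  list(mele = grid[which.min(elr)], min_elr = min(elr))
}

set.seed(4)
x <- rexp(25)
mele_grid(x)

For the just-identified mean constraint the minimizer is essentially the sample mean; in the overdetermined case the minimum ELR statistic is generally positive and the grid bounds must be chosen to cover the plausible parameter values.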
Composite Empirical Likelihood
With large data sets, it may be necessary to use many covariates to capture the underlying phenomena with an adequate model. This increase in the number of variables leads to a full likelihood that is extremely computationally intensive. We can also utilize asymptotic results, while accepting some tradeoffs or making adjustments as needed. Again, though composite likelihood simplifies the problem, it may not completely resolve the original difficulty of optimizing a fully joint likelihood.
Not yet thoroughly established is a procedure for selecting the most appropriate composite likelihood. It may be a function of blocking the data, in which case the question becomes how large to make the blocks, or of the dimension of the margins. If the methods are applied to time series or other dependent data, the issue is the inclusion of correlation structures that are sufficient without unduly increasing the computational running time.
Varin (2008) remarks that an “interesting possibility... not much explored in applications” involves formulating a new objective function based on the independence and pairwise
likelihoods:
$$c\ell(\theta; y) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log f(y_i, y_j; \theta) \; - \; a_n \sum_{i=1}^{n} \log f(y_i; \theta)$$
(p. 23). The constant $a_n$ may be assessed relative to some optimality criterion established beforehand, such as the Godambe information. Since composite likelihoods are underpinned by their own theoretical developments and results, they are not simply approximations of the full likelihood; rather, they should be thought of as "surrogates of a complex full likelihood" (Varin, 2008, p. 24).
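As a small illustration of this objective, the sketch below evaluates it for user-supplied log densities; logf1, logf2, and the constant a_n are placeholders for whatever model is under study. The toy check exploits the fact that, for independent observations, the pairwise term collapses to $(n-1)$ times the independence log likelihood.

combined_cl <- function(theta, y, logf1, logf2, a_n) {
  # logf1(yi, theta): marginal log density; logf2(yi, yj, theta): pairwise log density
  n <- length(y)
  pair_term <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) pair_term <- pair_term + logf2(y[i], y[j], theta)
  }
  ind_term <- sum(vapply(y, logf1, numeric(1), theta = theta))
  pair_term - a_n * ind_term
}

# Toy check with independent N(theta, 1) data: the pairwise term collapses to
# (n - 1) * sum(logf1), so a_n = n - 2 recovers the ordinary log likelihood.
set.seed(5)
y   <- rnorm(10, mean = 1)
lf1 <- function(yi, theta) dnorm(yi, mean = theta, log = TRUE)
lf2 <- function(yi, yj, theta) lf1(yi, theta) + lf1(yj, theta)
combined_cl(1, y, lf1, lf2, a_n = length(y) - 2)
sum(dnorm(y, mean = 1, log = TRUE))             # same value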
A future direction for study would be to examine whether we can choose sufficient groupings of points for adequate coverage without throwing away too much information. Since the ESP arises from a problem with the linear combinations of points, it is necessary to examine the subsets for the ESP. Another question is whether, of two convex hulls covering the mean point, the larger one provides more information or contributes more noise.
The composite likelihood formulation would be most appealing if we were to encounter an empirical likelihood that proved difficult or impossible to formulate jointly. There is a tradeoff between the choice of k (which determines the total number of combinations) and the possibility of more occurrences of the ESP. The ESP means that we cannot take the summations at face value; we must take a closer look in order to avoid an invalid inference, which in turn drives up running time. If we were to base the statistic on invalid components, we would render the subsequent conclusions ineffective. Another question is whether there is a way to efficiently sample the data to maximize coverage while avoiding the need to consider all $\binom{n}{k}$ combinations. How would this compare to a composite likelihood based on randomly sampled subsets? If n becomes large enough, it may be that we cannot distribute a weight across a large number of points without some loss of numerical accuracy or representation. If we restrict attention to a manageable group of points in a composite likelihood paradigm, we may be able to handle the weights more accurately. This would be worth investigating if it can offset the downside of a longer running time.
Asymptotics and Model Selection Criteria
In least-squares modeling, we can compute the Akaike Information Criterion (AIC) or the
Bayes Information Criterion (BIC) to make decisions about variable selection. In composite
likelihood, measures of model complexity also exist. The AIC and BIC are defined as
$$\mathrm{AIC} = -2\, c\ell(\hat{\theta}_{CL}; y) + 2 \dim(\theta)$$
and
$$\mathrm{BIC} = -2\, c\ell(\hat{\theta}_{CL}; y) + \dim(\theta) \log(n).$$
The dimension of $\theta$ is determined as the trace of a function of the sensitivity and Godambe information matrices, $\mathrm{tr}\{H(\theta)G(\theta)^{-1}\}$. Alternatively, the composite likelihood information criterion (CLIC) gives preference to models that maximize the penalized log-composite likelihood
$$c\ell^{*}(\hat{\theta}_{MCL}; Y) = c\ell(\hat{\theta}_{MCL}; Y) - \mathrm{tr}\{\hat{J}(Y)\hat{H}(Y)^{-1}\}.$$
In this formula, $\hat{J}(Y)$ must be consistent for $J(\theta_0)$, and likewise $\hat{H}(Y)$ for $H(\theta_0)$. We may reformulate the problem as minimizing $-2\, c\ell^{*}(\hat{\theta}_{MCL}; y)$ to obtain the model chosen by CLIC. The question becomes how best to apply this to the empirical likelihood framework, where we must both choose the most appropriate composite likelihood and decide how to construct a variable selection statistic on that composite likelihood. We should also take into account the number of k-tuples needed to carry out the computations, a quantity that does not appear in the AIC and BIC criteria above.
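A direct computation of these criteria from estimated sensitivity and variability matrices might be sketched as follows; clhat, Hhat, and Jhat are assumed inputs from a fitted composite likelihood rather than quantities produced in this work.

cl_ic <- function(clhat, Hhat, Jhat, n) {
  Ghat    <- Hhat %*% solve(Jhat) %*% Hhat       # Godambe information G = H J^{-1} H
  eff_dim <- sum(diag(Hhat %*% solve(Ghat)))     # effective dimension tr{H G^{-1}}
  c(AIC     = -2 * clhat + 2 * eff_dim,
    BIC     = -2 * clhat + eff_dim * log(n),
    CLIC    = clhat - sum(diag(Jhat %*% solve(Hhat))),  # penalized log-composite likelihood
    eff_dim = eff_dim)
}

cl_ic(clhat = -120, Hhat = diag(2), Jhat = diag(c(1.5, 2)), n = 100)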
Composite Likelihood Adjustments
To improve the approximation of a composite likelihood to the full likelihood, we may use a magnitude or a curvature adjustment (Ribatet, Cooley, and Davison, 2011). The parameters for a given problem may be decomposed into those of interest and the nuisance elements: $\theta = (\psi^{T}, \tau^{T})^{T}$. To formulate a hypothesis $H_0: \psi = \psi_0$ is to fix the value of $\psi$ and to estimate a corresponding $\tau$, for which we test
$$\Lambda(\psi_0) = 2\{\ell(\hat{\theta}; Y) - \ell(\tilde{\theta}; Y)\},$$
which converges in distribution to $\chi^2_q$, where $q$ is the dimension of the subvector of the parameter of interest. The result for composite likelihood is constructed in a similar manner: $\Lambda(\psi_0) = 2\{c\ell(\hat{\theta}_c; Y) - c\ell(\tilde{\theta}_c; Y)\}$, where the statistic converges in distribution to $\sum_{i=1}^{q} \lambda_i X_i$ with independent $X_i \sim \chi^2_1$. The $\lambda_i$, $i = 1, \ldots, q$, are derived from the eigenvalues of the $q \times q$ matrix based on $H$ and $J$ evaluated at the hypothesized value of $\theta$:
$$\left\{ H(\theta_0)^{-1} J(\theta_0) H(\theta_0)^{-1} \right\}_{\psi} \left[ \left\{ H(\theta_0)^{-1} \right\}_{\psi} \right]^{-1},$$
where $A_{\psi}$ is the component of the matrix $A$ corresponding to $\psi$.
The magnitude adjustment (Rotnitzky and Jewell, 1990) works on the entire composite log likelihood, expressing it on a more appropriate magnitude or scale. Only the first moment is guaranteed under the $\chi^2_q$ approximation with the magnitude adjustment. First, we compute
$$k = \frac{q}{\sum_{i=1}^{q} \lambda_i},$$
where the $\lambda_i$ are the eigenvalues of $H(\theta_0)^{-1} J(\theta_0)$ (Pauli, Racugno, and Ventura, 2011). The higher moments of the magnitude-adjusted composite likelihood do not match the $\chi^2_q$ distribution unless the $\lambda_i$, $i = 1, \ldots, q$, are all equal or $q = 1$. The magnitude-adjusted composite likelihood is given by
$$\ell_{\mathrm{magn}}(\theta; y) = k\, \ell^{\mathrm{tot}}_{c}(\theta; y), \qquad \theta \in \Theta,$$
where $\ell^{\mathrm{tot}}_{c}(\theta; y) = \sum_{j=1}^{n} \sum_{i \in I} w_i \log f(y_j \in A_i; \theta)$. Then, we have the asymptotic result that, as $n \to \infty$,
$$\Lambda_{\mathrm{magn}}(\psi_0) = 2\left\{ \ell_{\mathrm{magn}}(\hat{\theta}_c; y) - \ell_{\mathrm{magn}}(\tilde{\theta}_c; y) \right\} \xrightarrow{d} k \sum_{i=1}^{q} \lambda_i X_i.$$
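A minimal sketch of the computation, assuming estimates of $H(\theta_0)$ and $J(\theta_0)$ are available and that the full parameter is the parameter of interest (so $q = \dim\theta$), is the following.

magnitude_adjust <- function(Lambda_raw, Hhat, Jhat) {
  lambda <- Re(eigen(solve(Hhat) %*% Jhat, only.values = TRUE)$values)
  q <- length(lambda)
  k <- q / sum(lambda)
  # the rescaled statistic has first moment matching a chi-squared on q df
  list(k = k, lambda = lambda, Lambda_magn = k * Lambda_raw)
}

# Toy input: an unadjusted composite LRT statistic with assumed H and J
magnitude_adjust(Lambda_raw = 6.3, Hhat = diag(2),
                 Jhat = matrix(c(2, 0.3, 0.3, 1), 2))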
The curvature adjustment, on the other hand, guarantees convergence to a $\chi^2_q$ distribution by modifying the curvature of the composite likelihood in the neighborhood of its maximum $\hat{\theta}_c$. For a constant $p \times p$ matrix $C$, we compute the adjustment as
$$\ell_{\mathrm{curv}}(\theta; y) = \ell^{\mathrm{tot}}_{c}(\theta^{*}; y),$$
where $\theta^{*} = \hat{\theta}_c + C(\theta - \hat{\theta}_c)$. The matrix $C$ may be estimated as a semi-definite matrix satisfying
$$C^{T} H(\theta_0) C = H(\theta_0) J(\theta_0)^{-1} H(\theta_0).$$
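The defining identity does not determine $C$ uniquely. One construction, sketched below using Cholesky factors, is offered as an illustrative assumption rather than the estimator used in the cited work.

curvature_C <- function(Hhat, Jhat) {
  M  <- chol(Hhat)                               # upper triangular, t(M) %*% M = H
  MA <- chol(Hhat %*% solve(Jhat) %*% Hhat)      # t(MA) %*% MA = H J^{-1} H
  solve(M) %*% MA                                # C with t(C) %*% H %*% C = H J^{-1} H
}

# Quick check that the defining identity holds for assumed H and J
H <- matrix(c(2, 0.5, 0.5, 1), 2)
J <- matrix(c(1.2, 0.2, 0.2, 0.8), 2)
C <- curvature_C(H, J)
all.equal(t(C) %*% H %*% C, H %*% solve(J) %*% H)   # TRUE

# The adjusted composite log likelihood then evaluates the unadjusted one at
# theta* = theta_hat + C %*% (theta - theta_hat), for a user-supplied cl_tot().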
The impact and applicability of these adjustments for the composite empirical likelihood provide a direction for future research. Another consideration is to prove or disprove whether it is possible for an empirical likelihood to fail to converge jointly while having convergent composite likelihoods based on the same data. This is not necessarily tied to the ESP, since in our particular example, once a data set has the ESP, any subset of it has the ESP as well. However, with a different set of constraints, it may be possible to find a data set that would benefit from the composite empirical likelihood approach. It would be such a data set that realizes the benefits of an adjustment like the magnitude or curvature adjustment, if the connection can be made.
References
Askew, A. and Lazar, N. (2011). Application of evolutionary algorithms in estimation
of empirical likelihoods. Proceedings of the 2011 Joint Statistical Meetings. Paper
presented at 2011 Joint Statistical Meetings, Miami Beach, FL.
Baggerly, K. A. (1998). Empirical likelihood as a goodness-of-fit measure. Biometrika,
85 (3), 535–547.
Barber, B., Dobkin, D., and Huhdanpaa, H. (1996). The QuickHull algorithm for
convex hulls. ACM Transactions on Mathematical Software, 22 (4), 469–483.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society, Series B, 36 (2), 192–236.
Blum, C., Chiong, R., Clerc, M., De Jong, K., Michalewicz, Z., Neri, F., and Weise,
T. (2011). Evolutionary Optimization. In Raymond Chiong, Thomas Weise, and
Zbigniew Michalewicz (Eds.), Variants of Evolutionary Algorithms for Real-World Applications 129. Berlin/Heidelberg: Springer-Verlag.
Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. New York: Cambridge
University Press.
Casella, G. and Berger, R. L. (2002). Statistical Inference. Pacific Grove, CA: Duxbury.
Chandra, T. K. and Ghosh, J. K. (1979). Valid asymptotic expansions for the likelihood
ratio statistic and other perturbed chi-square variables. Sankhyā: The Indian Journal
of Statistics, Series A, 41 (1/2), 22–47.
Conover, W. J. (1999). Practical Nonparametric Statistics (3rd ed.). New York: John
Wiley & Sons, Inc.
Corcoran, S. A., Davison, A. C., and Spady, R. H. (1995). Reliable inference from
empirical likelihoods. Discussion Paper, Department of Statistics, Oxford University.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2001). Introduction to
Algorithms (2nd ed.). Cambridge, MA: MIT Press.
Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of
the Royal Statistical Society, Series B, 46 (3), 440–464.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62 (2), 269–276.
Davidon, W. C. (1991). Variable metric method for minimization. SIAM Journal on
Optimization, 1 (1), 1–17.
Davison, A. C., Hinkley, D. V., and Young, G. A. (2003). Recent developments in
bootstrap methodology. Statistical Science, 18 (2), 141–157.
De Berg, M., Cheong, O., Van Kreveld, M., and Overmars, M. (2008). Computational
Geometry: Algorithms and Applications (3rd ed.). Springer-Verlag Berlin Heidelberg.
DiCiccio, T. and Romano, J. (1989). On adjustments based on the signed root of the
empirical likelihood ratio statistic. Biometrika, 76 (3), 447–456.
DiCiccio, T. and Romano, J. (1990). Nonparametric confidence limits by resampling
methods and least favorable families. International Statistical Review, 58 (1), 59–76.
DiCiccio, T., Hall, P., and Romano, J. (1991).
Empirical likelihood is Bartlett-
correctable. The Annals of Statistics, 19 (2), 1053–1061.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7 (1), 1–26.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188.
Frydenberg, M. and Jensen, J. L. (1989). Is the “improved likelihood ratio statistic”
really improved in the discrete case? Biometrika, 76 (4), 655–661.
Grendár, M., and Judge, G. (2009). Empty set problem of maximum empirical likelihood methods. Electronic Journal of Statistics, 3, 1542–1555.
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31 (4), 1208–1211.
Godambe, V. P. and Thompson, M. E. (1984). Robust estimation through estimating
equations. Biometrika, 71 (1), 115–125.
Hall, P. (1990). Pseudo-likelihood theory for empirical likelihood. Annals of Statistics,
18 (1), 121–140.
Hanfelt, J. (2004). Composite conditional likelihood for sparse clustered data. Journal
of the Royal Statistical Society, Series B, 66 (1), 259–273.
Hanfelt, J. and Liang, K. (1995). Approximate likelihood ratios for general estimating
equations. Biometrika, 82 (3), 461–477.
Hartley, H. O. and Rao, J. N. K. (1968). A new estimation theory for sample surveys.
Biometrika, 55 (3), 547–557.
Hjort, N. L., McKeague, I. W., and Van Keilegom, I. (2009). Extending the scope of
empirical likelihood. Annals of Statistics, 37 (3), 1079–1111.
Jarvis, R. A. (1973). On the identification of the convex hull of a finite set of points
in the plane. Information Processing Letters, 2 (1), 18–21.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53 (282), 457–481.
Kitamura, Y. (1997). Empirical likelihood methods with weakly dependent processes.
Annals of Statistics, 25 (5), 2084–2102.
Kolaczyk, E. (1994). Empirical likelihood for generalized linear models. Statistica
Sinica, 4 (1), 199–218.
Leng, C. and Tang, C. Y. (2012). Penalized empirical likelihood and growing dimensional general estimating equations. Biometrika, 99 (3), 703–716.
Liang, K. (1987). Estimating functions and approximate conditional likelihood. Biometrika,
74 (4), 695–702.
Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics,
80, 221–239.
Mardia, K., Kent, J., Hughes, G., and Taylor, C. (2009). Maximum likelihood estimation using composite likelihoods for closed exponential families. Biometrika, 96 (4), 975–982.
Michalewicz, Z., and Schoenauer, M. (1996). Evolutionary algorithms for constrained
parameter optimization problems. Evolutionary Computation, 4 (1), 1–32.
Nocedal, J., and Wright, S. (1999). Numerical Optimization. New York: Springer
Science+Business Media, Inc.
Owen, A. (1988). Empirical likelihood ratio confidence intervals for a single functional.
Biometrika, 75 (2), 237–249.
Owen, A. (1990). Empirical likelihood ratio confidence regions. The Annals of Statistics, 18 (1), 90–120.
Owen, A. (2001). Empirical Likelihood. Boca Raton: Chapman & Hall/CRC.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. New York: Cambridge
University Press.
Price, K., Storn, R., and Lampinen, J. (2005). Differential Evolution: A Practical
Approach to Global Optimization. Berlin Heidelberg: Springer-Verlag.
Qin, J., and Lawless, J. (1994). Empirical likelihood and general estimating equations.
The Annals of Statistics, 22 (1), 300–325.
Qin, J., and Lawless, J. (1995). Estimating equations, empirical likelihood, and constraints on parameters. The Canadian Journal of Statistics, 23 (2), 145–159.
Rao, J. N. K. (2006). Empirical likelihood methods for sample survey data: An
overview. Austrian Journal of Statistics, 35 (2 and 3), 191–196.
Renard, D., Molenberghs, G., and Geys, H. (2004). A pairwise likelihood approach to
estimation in multilevel probit models. Computational Statistics & Data Analysis, 44
(4), 649–667.
Ribatet, M., Cooley, D., and Davison, A. C. (2011). Bayesian inference from composite
likelihoods, with an application to spatial extremes. arXiv preprint arXiv:0911.5357.
Rotnitzky, A. and Jewell, N. P. (1990). Hypothesis testing of regression parameters
in semiparametric generalized linear models for cluster correlated data. Biometrika,
77 (3), 485–497.
Süli, E. and Mayers, D. (2003). An Introduction to Numerical Analysis. New York:
Cambridge University Press.
Thomas, D. and Grunkemeier, G. (1975). Confidence interval estimation of survival
probabilities for censored data. Journal of the American Statistical Association, 70,
865–871.
Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33 (1), 1–67.
Varin, C. (2008). On composite marginal likelihoods. Advances in Statistical Analysis,
92, 1–28.
Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods.
Statistica Sinica, 21, 5–42.
Varin, C. and Vidoni, P. (2005). A note on composite likelihood inference and model
selection. Biometrika, 92 (3), 519–528.
Venkataraman, P. (2009). Applied Optimization with MATLAB Programming (2nd
ed.). Hoboken, NJ: John Wiley & Sons, Inc.
Wang, Q. and Rao, J. N. K. (2002). Empirical likelihood-based inference in linear models
with missing data. Scandinavian Journal of Statistics, 29 (3), 563–576.
Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models,
and the Gauss-Newton method. Biometrika, 61 (3), 439–447.
Wolpert, D. H. and Macready, W. G. (1997). No Free Lunch theorems for optimization.
IEEE Transactions on Evolutionary Computation, 1 (1), 67–82.
Wolpert, D. H. and Macready, W. G. (2005). Coevolutionary free lunches. IEEE
Transactions on Evolutionary Computation, 9 (6), 721–735.
Wu, R. and Cao, J. (2011). Blockwise empirical likelihood for time series of counts.
Journal of Multivariate Analysis, 102 (3), 661–673.
Yang, D. and Small, D. (February 14, 2012). Package ‘el.convex.’ The Comprehensive
R Archive Network. From http://cran.r-project.org/web/packages/el.convex/
el.convex.pdf.
Yang, D. and Small, D. (2009). An R package and a study of methods for computing
empirical likelihood. Journal of Statistical Computation & Simulation, 00 (00), 1–10.
Zhou, M. (September 11, 2004). The emplik package. The Comprehensive R Archive
Network. From http://cran.r-project.org/web/packages/emplik/emplik.pdf.