Nakamura, Miguel (1989). "Transformations to Symmetry in the Transform-Both-Sides Regression Model."

TRANSFORMATIONS TO SYMMETRY IN THE
TRANSFORM-BOTH-SIDES REGRESSION MODEL
by
Miguel Nakamura
A dissertation submitted to the faculty of the
University of North Carolina at Chapel Hill in
partial fulfillment of the requirements for the
degree of Doctor of Philosophy in the Department
of Statistics
Chapel Hill
1989
Approved by:

Advisor

Reader
MIGUEL NAKAMURA. Transformations to Symmetry in the Transform-Both-Sides Regression Model. (Under the direction of DAVID RUPPERT.)
ABSTRACT
The Transform-Both-Sides (TBS) regression model of Carroll and Ruppert (1984) is practical when there exists a conjectured relationship between the dependent variable and the independent variables, although the exact way in which randomness affects the relationship is unknown. The object is to make prediction more precise while respecting the pre-specified relationship. This dissertation is concerned with estimation methods which are alternate to the normal-theory Maximum Likelihood Estimate (MLE), and whose properties rely solely on symmetry of errors.
First, consistency and asymptotic normality of the skewness estimator proposed by Ruppert and Aldershof (1989) are proved. For consistency, an asymptotically equivalent estimator is constructed in order to attain convergence.
Secondly, a new estimator is proposed based on the notion of a weighted Minimum Distance (MD). This estimate is seen to be a special case of Generalized M-Estimation, a general class for which consistency and asymptotic normality are then proved. The MD estimate is shown to be consistent for Box-Cox modified power transformations, and for modified power transformations with shift when the power is held fixed. Some examples are provided which tend to show that the original parameterization in a shifted power transformation is inappropriate.
Finally, different features of MD are explored in a simulation study. For a power transformation and normal errors, a weight which appears suitable for the transformation parameter is produced. Comparisons of MD with other estimators in TBS are also provided; in particular, it is found that the loss of efficiency with respect to the MLE in the normal case is not extreme.
ACKNOWLEDGMENTS
First, I wish to express my sincere appreciation to my advisor,
David Ruppert, for the guidance and patience which he displayed through
all stages of work on this dissertation.
I thank E. Carlstein, I. M. Chakravarti, G. Simons, and W. L. Smith, the members of my committee, for reviewing the manuscript.
I am grateful to the staff of the Department of Statistics for providing a fruitful environment for learning and where help could always be found if needed, with special thanks to June Maxwell.
For the warm friendship and hospitality during my "year abroad", I am grateful to the '87-'88 dwellers of 357 Upson Hall in Cornell University: Andrew, Karen, Maika, and Mary.
I also wish to thank my fellow graduate students and friends in Chapel Hill for their fellowship and words of encouragement. To construct an exhaustive list of names for this would be a nearly impossible task; I duly thank all.
Likewise, I thank my family for their constant support throughout the years.
For financial support, I thank the Department of Statistics at UNC and the Consejo Nacional de Ciencia y Tecnologia (Mexico).
This work was partially supported by The National Science Foundation under grants DMS-8701201 and DMS-8902973.
TABLE OF CONTENTS

CHAPTER I: INTRODUCTION ... 1
  1.0 The Regression Setting ... 1
  1.1 Transformations ... 6
  1.2 The Transform-Both-Sides (TBS) Model ... 10
  1.3 Transformations to Symmetry and Homoscedasticity ... 15
  1.4 Summary of Work ... 17

CHAPTER II: TRANSFORMATIONS TO SYMMETRY ... 19
  2.0 Introduction ... 19
  2.1 General Results about Consistency ... 22
  2.2 Consistency and Asymptotic Normality of the Transformation to Symmetry ... 30
  2.3 Identification of a Consistent Solution in the Transformation to Symmetry ... 33

CHAPTER III: CONSISTENCY OF A MINIMUM DISTANCE ESTIMATOR ... 46
  3.0 Introduction ... 46
  3.1 Proposed Estimator ... 48
  3.2 Alternate Representation of the Estimator ... 53
  3.3 Generalized M-Estimators ... 57
  3.4 U-Statistics ... 59
  3.5 Consistency of Generalized M-Estimators ... 62
  3.6 A Special Case: Minimum Distance Estimation of a Transformation ... 66
  3.7 Discussion ... 77

CHAPTER IV: ASYMPTOTIC NORMALITY OF THE MINIMUM DISTANCE ESTIMATOR ... 82
  4.0 Introduction ... 82
  4.1 Asymptotic Normality of Generalized M-Estimators ... 84
  4.2 A Special Case: Minimum Distance Estimation of a Transformation ... 89
  4.3 The Single-sample Model with a Modified Power Transformation with Shift ... 97
  4.4 Figures ... 103

CHAPTER V: WEIGHTS IN MINIMUM DISTANCE ESTIMATION, AND A MONTE CARLO STUDY ... 107
  5.0 Introduction ... 107
  5.1 A Heuristic Argument for an Optimal Weight ... 108
  5.2 Monte Carlo Study ... 115
  5.3 Discussion ... 118
  5.4 Tables and Figures ... 123

BIBLIOGRAPHY ... 135
CHAPTER I
INTRODUCTION

1.0 The Regression Setting
Broadly speaking, we may say that the object of regression analysis is to study relationships that may be present between a dependent variable Y and an independent variable X. Most of the time, we are restricted to only being able to record occurrences of X and Y which include unobservable random perturbations, due to random variation and measurement error. Regression methods focus on extracting characteristics of the underlying relationship which are masked by the observed randomness. These methods are means for obtaining a better understanding of the connections between X and Y and for making predictions on Y based on X. The accuracy of these procedures depends on a few assumptions which may be violated in given situations, in particular, ones that involve properties of the probability distributions of the perturbations. The present work deals with the application of transformations to data that arise in the regression setting when some key distributional properties of the perturbations are not verified. The ultimate purpose behind transformation is to increase the efficiency and precision of the inferences that are drawn.
When observing pairs of variables (X_1, Y_1), ..., (X_n, Y_n), frequently a model of the following form is adopted. For 1 ≤ i ≤ n let

(1.1)    Y_i = f(X_i, β) + σε_i

where the following notation and assumptions are employed: β is a p-dimensional vector of regression parameters; X_i = (X_{i1}, ..., X_{ik})^T is a k-dimensional random vector; σ > 0 is a parameter to model the extent of variability; the ε_i for 1 ≤ i ≤ n are independent, having a distribution with mean zero and variance one; and f : R^{k+p} → R is of a given (known) form.
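To make the setting concrete, the following minimal Python sketch simulates data from model (1.1); the exponential mean function, parameter values, and sample size are hypothetical illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, beta):
    # Hypothetical known mean function f(x, beta) = beta1 * exp(beta2 * x).
    return beta[0] * np.exp(beta[1] * x)

n, beta, sigma = 100, np.array([2.0, 0.5]), 0.3
X = rng.uniform(0.0, 2.0, size=n)      # random covariate
eps = rng.standard_normal(n)           # mean zero, variance one errors
Y = f(X, beta) + sigma * eps           # model (1.1): additive perturbation
```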
As stated, the term f(X_i, β) models the conditional expectation of Y_i given X_i, while the term σε_i can be interpreted as the unobserved random perturbation introduced by nature. When considering the above model, the implicit presumption is that this perturbation is additive. In the case that the probability distribution of the term ε_i is completely known, model (1.1) provides a description of the conditional distribution of Y_i given X_i.
When the primary objective is to estimate the parameter β, several estimation procedures can be applied assuming model (1.1). Least squares and M-estimation are examples of these procedures; these methods are generally efficient and/or consistent when the ε_i are (at least) symmetric with tails no heavier than the normal distribution and have constant variance (see Hampel et al. (1986), chapter 6). Jennrich (1969), Malinvaud (1970), and Wu (1981) discuss the properties of least squares estimators. When heteroscedasticity (that is, the lack of constant variance across the ε_i) is present, and estimation procedures are therefore deemed non-optimal, alternate techniques such as weighting or heteroscedastic regression may be applied (Carroll and Ruppert (1988), chapters 1-3).
Transformation is another way that investigators have proposed to counteract the negative effect of lack of symmetry or lack of homoscedasticity in errors. Although transformations applied to random variables can be traced back to early work and in different settings (e.g. Bartlett (1947)), it is primarily regression models which we have in mind for this work. We can then speak of transforming either the response variable (Y) or the explanatory variable (X), or both, and hope to produce one of several favorable effects. Transformations of independent variables, and transformations that convert an intrinsically linear model to explicit linearity in the parameters, will not be of primary concern here. When the latter is done, some non-normality or heteroscedasticity can be induced in the errors, as shown by the examples in Carroll and Ruppert (1984) or Box and Hill (1974). Some of the work in transformations has focused on searching for the form of f either nonparametrically or parametrically through families of transformations (Box and Tidwell (1962)). In the present work, we are concerned with transformations that intentionally alter the distribution of the errors in the response.
We can see the significance of a general transformation by noting the following remark (Carroll and Ruppert (1988)): if φ(y) is a strictly convex function, its effect on the distribution of a random variable Y is to increase right skewness. If the function φ is increasing, this effect can be explained by noting that the derivative of a convex function is greater on the right tail of the distribution than on the left. Thus, values on the right will be spread out more than those on the left. The argument still holds if the function φ is decreasing, since the left tail of Y becomes the right tail of φ(Y). Similarly, a concave transformation will decrease right skewness. In a general discussion on convex transformations, van Zwet (1964) makes this point into a precise statement by proving the following result: if φ is convex and increasing, then γ(φ(X)) ≥ γ(X), where γ(X), the standard coefficient of skewness of a random variable X, is given by

γ(X) = E[X − EX]³ / (Var X)^{3/2}.

The coefficient of skewness thus defined recurs in work that concerns symmetry. In Chapter II, we will notice an example of this observation.
A transformation may have an effect not only on skewness, but also on heteroscedasticity. Bartlett (1947) noted that if random variables Y_i have means μ_i and standard deviations σ_i related by σ_i = g(μ_i), then φ(Y_i) will have approximately constant variance if (d/dy)φ(y) is proportional to [g(y)]^{-1}. For example, if g(μ) = μ, so that the standard deviation is proportional to the mean, then φ(y) = log(y) is approximately variance-stabilizing.
Box and Cox (1964) proposed one possible approach for incorporating a transformation in regression: extend model (1.1) by assuming that it is not the Y_i's themselves which possess symmetric distributions with constant variance (given the X_i's), but rather transformed values h(Y_i) for some function h(·) with reasonable properties (e.g. monotone). Considering a family of monotonic functions in y, { h(y, λ) : λ ∈ Λ ⊂ R^q }, provides a parameterized version of the idea. We regard λ as an additional parameter and assume that for some λ,

(1.2)    h(Y_i, λ) = f(X_i, β) + σε_i

where the ε_i are independent, identically distributed, and symmetric random variables. If h(y, λ) = y for some λ, then (1.2) includes (1.1) as a special case.
We must now estimate λ and β simultaneously. Although greater efficiency in estimating β will be attained as a consequence, the estimation of λ by itself may also be of interest since it may provide insight into the relationship between X and Y. Furthermore, we would be in a better position to model the complete conditional distribution of Y given X if we were able to recognize a simple distributional form for the errors ε.
Non-parametric approaches to the transformation problem are also possible, although we will not pursue them any further. In this direction, we point out the work of Breiman and Friedman (1985), in which the following model is a special case:

θ(Y_i) = Σ_{j=1}^k φ_j(X_{ij}) + σε_i

where θ and φ_j, j = 1, ..., k, are (nonparameterized) functions, the ε_i are standard normal, and { φ_j(X_{ij}) : j = 1, ..., k } are jointly normal. These authors provide a method for estimating θ, φ_j, j = 1, ..., k, using smoothing techniques, and call it the ACE algorithm (after Alternating Conditional Expectations). They obtain this algorithm as a consequence of maximizing the correlation between θ(Y) and

φ(X_1, ..., X_k) = Σ_{j=1}^k φ_j(X_j).

They state that if the φ_j(X_j) are not normal, then the estimates are approximately correct. In a different approach, Hastie and Tibshirani (1986) treat the problem of estimating E(Y|X) when the assumed form is

E(Y|X) = Σ_{j=1}^k φ_j(X_j).

Note that this latter view deals with transformations of the independent variables but not the response.
In section 1.1, we review the transformation literature, in particular that of the Box-Cox transformation, which deals with the case when only the response is transformed. We will emphasize the parametric point of view of a transformation, as in equation (1.2). In section 1.2 we take a different approach to transformations when we consider the Transform-Both-Sides model of Carroll and Ruppert (1984). We shall briefly discuss some properties of these transformations as well as the maximum likelihood estimators. We also make some comparisons with the Box-Cox transformation model in this section. In section 1.3, we introduce some alternative estimators of transformation parameters, which will later be the subject of study in Chapter II.
1.1 Transformations

Box and Cox (1964) proposed the following family of transformations and a corresponding special case of model (1.2):

(1.3)    y^(λ) = h(y, λ) = (y^λ − 1)/λ   if λ ≠ 0,
                           log(y)        if λ = 0.

We shall refer to this family of transformations as the Modified Power Transformation, which is valid for positive y. We notice that model (1.2) with λ = 1 corresponds to "no transformation" in model (1.1).
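A direct Python sketch of the family (1.3), vectorized over positive y; the λ = 0 branch is the logarithm.

```python
import numpy as np

def box_cox(y, lam):
    """Modified power transformation (1.3); valid for y > 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y**lam - 1.0) / lam
```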
A transformation in the family (1.3) is either concave or convex according to λ < 1 or λ > 1. Thus, the effect on the skewness of Y is intuitively clear according to the heuristic explanation mentioned in section 1.0: λ < 1 decreases right skewness; λ > 1 increases right skewness.
Box and Cox (1964) assume that for some λ,

(1.4)    Y_i^(λ) = X_i^T β + σε_i

where the ε_i are assumed to be independent and identically distributed random variables having the standard normal distribution with density φ. Furthermore, they propose to estimate λ by maximum likelihood. We shall denote this estimator by λ̂_BC. Because of the form of the postulated model, we notice that this transformation attempts to simultaneously attain three properties: normality, simplicity in structure for E(Y|X), and homoscedasticity.
Various authors return to the study of the Box and Cox transformation in later papers. Draper and Cox (1969) obtain an approximate expression for the asymptotic variance of λ̂_BC, while Andrews (1971) and Atkinson (1973) consider the problem of testing and constructing confidence intervals for the parameters of the model. Hernandez and Johnson (1980) show that λ̂_BC converges to a value that minimizes the Kullback-Leibler information between the underlying density of Y in model (1.2) and a normal density with the same mean and variance. They provide an interpretation of the estimate, which is that λ̂_BC transforms Y to the "nearest" possible normal density, even if the transformed Y does not appear normal at all. Bickel and Doksum (1981) consider the Box-Cox model with a general density g for the ε_i instead of φ. They study consistency and consider the estimate from the robustness point of view. They also verify the fact that λ̂_BC is not consistent for λ except in the case g = φ or λ = 0 (Bickel and Doksum (1981), Hinkley (1975)).
The problem of estimating the regression parameter β when λ has also been estimated is an issue of controversy. Box and Cox (1964) advocate first estimating λ, and then treating the estimate λ̂ as fixed for computation of the covariance matrices of β̂. In their case, estimation is accomplished using linear least squares and then "standard analysis" on the estimated transformation is employed. Part of the reason for the controversy arises from the fact that λ is treated as unknown and has to be estimated. The covariance matrix of the parameter β̂ suffers great modification when compared to the same matrix when λ is known and only β has to be estimated. We refer to the following references for further details: Bickel and Doksum (1981); Draper and Cox (1969); Carroll and Ruppert (1981); Box and Cox (1982); and Hinkley and Runger (1984). We shall only mention that the problem has its origin in the fact that λ̂ and β̂ are highly correlated.
On the other hand, when the goal is to perform inference about Y in the original scale of observations, the effect of not knowing λ and having to estimate it is only moderate. Carroll and Ruppert (1981) discuss estimation of the conditional median of Y given X, and Taylor (1986) considers a pair of methods for estimating the conditional mean. In both conditioning situations, they discovered that there is only a small to moderate cost due to estimating λ, when cost is measured by the Mean Squared Error of estimation.
Several authors have proposed other estimators for λ and β which are based on ideas that are different from maximum likelihood. Most of these estimators have their roots in the one-sample model

y_i^(λ) = μ + σε_i

where the density of the ε_i is denoted by g. Some of the methods can be extended heuristically to the general case. We now briefly describe some of the approaches:
(a) Hinkley (1975) assumes that g is a symmetric distribution and proposes finding λ̂ to solve

(ξ̂_{1−p} − ξ̂_{1/2}) − (ξ̂_{1/2} − ξ̂_p) = 0

for fixed p ∈ (0, 1/2), where ξ̂_q is the sample q-percentile of { y_i^(λ) }, i = 1, ..., n; under symmetry, the upper and lower sample percentiles are equidistant from the sample median.

(b) In a later paper, Hinkley (1977) solves an analogous equation in which the sample mean

ȳ^(λ) = (1/n) Σ_{i=1}^n y_i^(λ)

is balanced against the sample median of the transformed values.

(c) Bickel and Doksum (1981) drop the assumption that g is normal by modifying the likelihood equation used in the estimation of β. They extend the estimating procedure to include some estimators that have some robust properties, and they consider the simplifying assumption σ → 0 in addition to n → ∞ for asymptotic results. This small-σ approach also appears in the work of Draper and Cox (1969), and later in Carroll and Ruppert (1984) and Kettl (1987).

(d) Some estimators for the transformation parameter λ that have some robust properties have also been considered by Carroll (1980) and Bickel and Doksum (1981). Taylor (1985) compares these estimators with the estimators of Hinkley.

(e) In the case of a single sample and symmetric errors, Taylor (1985) considers estimators that solve the equation

(1.5)    (1/n) Σ_{i=1}^n ψ( (y_i^(λ) − ȳ^(λ)) / σ̂(λ) ) = 0

where ψ is an odd function and ȳ^(λ) and σ̂(λ) denote the sample mean and standard deviation of the transformed values. He is able to obtain consistency and asymptotic normality results, as well as to give an optimal choice for the function ψ. When the errors are normally distributed, the optimal choice for ψ is ψ(u) = u³; thus, the left hand side of (1.5) becomes the standard coefficient of skewness.
1.2 The Transform-Both-Sides (TBS) Model

Carroll and Ruppert (1984) introduced a model which they describe as follows. Assume there exists a theoretical model that relates Y and X given by

(1.6)    Y_i = f(X_i, β)

for a known function f and an unknown p-vector of parameters β. The form of the function f might be postulated by physical or scientific considerations, for instance by chemical theory. Snee (1986); Carroll and Ruppert (1984, 1987, 1988); Ruppert and Carroll (1985); Kettl (1987); Bates, Wolf and Watts (1986); and Ruppert, Cressie, and Carroll (1989) provide many examples of f with applications of the TBS methodology from several fields of study.
Errors in measurement and random variation will cause the relation (1.6) not to be observed exactly. Fitting model (1.1) by least squares is a conceivable solution, but after a fit, the residuals may show skewness and/or heteroscedasticity. This aspect is an indication that the optimal properties of the least-squares estimator are spoiled; therefore a transformation of Y to h(Y) may be suggested in order to remove skewness, induce homoscedasticity, or both.
If the transformation is monotonic, it is possible to preserve the theoretical relationship (1.6) between X and Y by transforming f(X, β) to h(f(X, β)) as well. If we let { h(y, λ) : λ ∈ Λ ⊂ R^q } denote a monotonic family of transformations, the TBS model states that for some values of (λ, β, σ),

(1.7)    h(Y_i, λ) = h(f(X_i, β), λ) + σε_i

where the ε_i are i.i.d. symmetric (mean zero) random variables with unit variance. The modified power transformation of Box and Cox is a special case for the family { h(y, λ) }. The complete designation for this choice would be TBS with Box-Cox transformation, but frequently we shall simply write TBS.
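As an illustration of (1.7) with the Box-Cox family, the sketch below simulates TBS data by transforming both sides, perturbing on the transformed scale, and inverting the transformation. The Michaelis-Menten-type mean function and all parameter values are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def box_cox(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def box_cox_inv(z, lam):
    # Inverse of the modified power transformation as a function of y.
    return np.exp(z) if lam == 0.0 else (lam * z + 1.0) ** (1.0 / lam)

def f(x, beta):
    # Hypothetical Michaelis-Menten-type theoretical relationship (1.6).
    return beta[0] * x / (beta[1] + x)

n, beta, lam, sigma = 200, np.array([10.0, 1.5]), 0.5, 0.2
X = rng.uniform(0.5, 5.0, size=n)
eps = rng.standard_normal(n)
# TBS model (1.7): h(Y, lam) = h(f(X, beta), lam) + sigma * eps.
Y = box_cox_inv(box_cox(f(X, beta), lam) + sigma * eps, lam)
# Monotonicity of h implies the conditional median of Y is still f(X, beta).
```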
The TBS model (1.7) allows randomness to affect the relationship Y = f(X, β) in different ways, not necessarily in an additive fashion. It is also capable of explaining skewness and heteroscedasticity that may emerge when Y is fitted to f(X, β). We maintain the relationship (1.6) in the sense that Y = f(X, β) in the absence of error, because we apply the same transformation to "both sides" of the theoretical equation (1.6) and h(y, λ) is monotonic. This concept is the main difference from the Box-Cox model (1.4), in which only the response Y is transformed using a modified power transformation, and f(X, β) is postulated as linear in order to obtain a simple form for the regression. In other words, the choice f(X, β) = X^T β in the Box-Cox model does not arise from a scientific theory; the form X^T β is chosen instead in order to obtain a model that allows for "simplicity of structure for E(Y)" (Box and Cox (1964)). Carroll and Ruppert (1984, 1988) emphasize that the TBS model (1.7) is not to be considered a replacement or generalization of the model (1.4), since each model is directed toward achieving different goals. They also note that within the TBS formulation, f(X, β) can be interpreted as the conditional median of Y given X, regardless of the real value of λ. In contrast, in the Box-Cox model, f(X, β) is the median (and mean) of Y^(λ) for the true value of λ, but f(X, β) has no physical interpretation otherwise.
We will now deal with the question of estimation in the TBS model. Carroll and Ruppert (1988, chapter 4) discuss estimation of (λ, β, σ) by maximum likelihood in the case of normal errors and the modified power transformation. They show that the MLE is found by minimizing the geometric-mean-scaled residual sum of squares

Σ_{i=1}^n { e_i(λ, β) / Ỹ^{λ−1} }²

where Ỹ is the geometric mean of Y_1, ..., Y_n and each e_i(λ, β) is the residual defined by

(1.8)    e_i(λ, β) = Y_i^(λ) − f^(λ)(X_i, β).
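This criterion can be sketched numerically as follows, profiling out σ and minimizing over (β, λ); the Nelder-Mead optimizer and the function names are incidental choices, not part of the original text.

```python
import numpy as np
from scipy.optimize import minimize

def box_cox(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def tbs_mle(X, Y, f, theta0):
    """Minimize the geometric-mean-scaled RSS over theta = (beta..., lam)."""
    gmean = np.exp(np.mean(np.log(Y)))  # geometric mean of the responses

    def objective(theta):
        beta, lam = theta[:-1], theta[-1]
        e = box_cox(Y, lam) - box_cox(f(X, beta), lam)  # residuals (1.8)
        return np.sum((e / gmean**(lam - 1.0))**2)

    return minimize(objective, theta0, method="Nelder-Mead")
```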
There is a significant feature that TBS possesses, in contrast to the transform-the-response model discussed in the previous section: Carroll and Ruppert (1984) found that in TBS, λ̂ and β̂ are only weakly correlated. Thus, the controversy between treating λ̂ as fixed and acknowledging that it is random does not exist as it does in the Box-Cox transformation model.
Its sensitivity to outliers (Carroll and Ruppert (1985)) and its inconsistency under lack of normality of the errors (Carroll and Ruppert (1988)) are two disadvantages of the maximum likelihood estimator. We note, however, that since the responses are assumed positive, the distribution of ε in models (1.4) and (1.7) cannot have infinite support for all λ. It is more desirable to assume only approximate normality (perhaps up to truncation) or symmetry on a compact set, and to develop estimation procedures that could be applied under these circumstances. Examples of methods that follow this trend are the estimates of Hinkley and Taylor mentioned above, and the Ruppert-Aldershof estimates to be defined in section 1.3.
In the context of maximum likelihood estimation, several methods exist to obtain large sample estimates of the covariance matrix of (λ̂, β̂, σ̂); Carroll and Ruppert (1988) recognize at least six methods and give some guidelines in connection with their use. Once the covariance matrix is estimated, we would then apply standard procedures to perform inference on the parameters. For example, the asymptotic normality of the MLE can be used for the construction of Wald-type confidence intervals and tests; another alternative is likelihood-ratio confidence regions and tests.
Sometimes inference in the original untransformed scale of the Y observations will be required. Carroll and Ruppert (1988, chapter 4) first notice that TBS provides the following model for the observations:

Y_i = h^{-1}( h(f(X_i, β), λ) + σε_i, λ )

where h^{-1}(y, λ) is the inverse of the transformation h(y, λ) as a function of y. If the distribution of the ε_i is denoted by F, then the q-th quantile of Y given X is

(1.9)    h^{-1}( h(f(X, β), λ) + σ F^{-1}(q), λ )

for q ∈ (0, 1), and the conditional mean of Y given X is

(1.10)    E(Y|X) = ∫ h^{-1}( h(f(X, β), λ) + σz, λ ) dF(z).

Through (1.9) and (1.10), we gain partial information on the conditional distribution of Y given X. One way to obtain estimates of the above two quantities is by substituting estimates β̂, λ̂, and F̂(·) for β, λ, and F(·). A few different possibilities exist for estimating F. For instance, if F is assumed to be N(0, σ²), then an estimate for F is the N(0, σ̂²) distribution. If we estimate F using the empirical distribution F_n of the residuals { e_i(λ̂, β̂) }, we obtain the estimator for E(Y|X) which Duan (1983) called the "smearing estimator".
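A sketch of the plug-in use of (1.9)-(1.10): the smearing estimate of E(Y|X) averages the back-transformed fitted values over the empirical residual distribution. The helper names below are hypothetical.

```python
import numpy as np

def box_cox(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def box_cox_inv(z, lam):
    return np.exp(z) if lam == 0.0 else (lam * z + 1.0) ** (1.0 / lam)

def smearing_mean(x, f, beta_hat, lam_hat, residuals):
    """Duan's smearing estimate of E(Y | X = x) under the TBS fit.

    residuals is the array of fitted residuals e_i(lam_hat, beta_hat).
    """
    z = box_cox(f(x, beta_hat), lam_hat) + residuals  # add each residual
    return np.mean(box_cox_inv(z, lam_hat))           # average back-transform
```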
Carroll and Ruppert (1987) also address the problem of detecting outlying observations and assessing their influence on the estimate of (β, λ) in the TBS model. They develop a case-deletion diagnostic that avoids recomputation of estimates for β and λ, in order to evaluate the effect of an observation on the estimates. They also propose a robust estimator obtained by solving

Σ_{i=1}^n w_i(Y_i, X_i, λ, β) ∇ℓ_i( Y_i, X_i, λ, β, σ̂(λ, β) ) = 0

where ℓ_i is the log-likelihood, w_i is a weighting function, and

σ̂²(λ, β) = { Σ_{i=1}^n w_i(Y_i, X_i, λ, β) e_i²(λ, β) } / Σ_{i=1}^n w_i(Y_i, X_i, λ, β)

is a weighted variance estimator.
1.3 Transformations to Symmetry and Homoscedasticity

When considering the transform-both-sides model (1.7), authors have proposed and studied other methods for estimating λ based on different ideas. The following estimates provide examples.
Ruppert and Aldershof (1989) define the "skewness estimator" (or transformation to symmetry) λ̂_sk as the value of λ that solves

(1.11)    (1/n) Σ_{i=1}^n { e_i(λ, β̂(λ)) / σ̂(λ) }³ = 0

where

σ̂²(λ) = (1/n) Σ_{i=1}^n e_i²(λ, β̂(λ))

and β̂(λ) is the least-squares estimator of β for fixed λ. This proposal was based on the work of Hinkley (1975) and Taylor (1985) (see equation (1.5)). This estimator λ̂_sk then defines the estimate β̂_sk = β̂(λ̂_sk). More generally, the Ruppert-Aldershof skewness estimator would solve the equation

(1.12)    (1/n) Σ_{i=1}^n ψ( e_i(λ, β̂(λ)) / σ̂(λ) ) = 0

where ψ is a function such that (1/n) Σ_{i=1}^n ψ( e_i(λ, β̂(λ)) / σ̂(λ) ) = 0 if the residuals { e_i } are symmetric (or Eψ(Y) = 0 if Y is a symmetric random variable). If a value of λ exists for which the TBS model holds with symmetric errors, this procedure will estimate it consistently. In Chapter II, we will prove some asymptotic properties of this estimator.
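A numerical sketch of (1.11): for each candidate λ the regression is re-fit by least squares on the transformed scale, and a scalar root-finder drives the standardized third moment of the residuals to zero. The scipy routines and the bracket are incidental choices, and brentq assumes a sign change over the bracket.

```python
import numpy as np
from scipy.optimize import brentq, least_squares

def box_cox(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def skewness_stat(lam, X, Y, f, beta0):
    # beta_hat(lam): least-squares fit on the transformed scale.
    fit = least_squares(
        lambda b: box_cox(Y, lam) - box_cox(f(X, b), lam), beta0)
    e = fit.fun                        # residuals e_i(lam, beta_hat(lam))
    return np.mean((e / e.std())**3)   # standardized skewness, eq. (1.11)

def lambda_sk(X, Y, f, beta0, bracket=(-2.0, 2.0)):
    # Solve the skewness equation for lambda within the bracket.
    return brentq(skewness_stat, *bracket, args=(X, Y, f, beta0))
```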
In a different approach, Ruppert and Aldershof (1989) define the "heteroscedasticity estimator" (or transformation to homoscedasticity) λ̂_het as the value of λ that solves the equation

(1.13)    (1/n) Σ_{i=1}^n { e_i²(λ, β̂(λ)) / σ̂²(λ) } [ log f(X_i, β̂(λ)) − (log f)‾(λ) ] = 0

where

(log f)‾(λ) = (1/n) Σ_{i=1}^n log f(X_i, β̂(λ)).

We can express this approach in general form as follows. For fixed λ, let H_n(λ) be some measure of heteroscedasticity that depends on { e_i(λ, β̂(λ)) } and { f(X_i, β̂(λ)) }; an example is the measure given by the Pearson correlation between { e_i²(λ, β̂(λ)) } and { log f(X_i, β̂(λ)) }. We require that H_n(λ) satisfies H_n(λ) = 0 if there is no heteroscedasticity, H_n(λ) > 0 if variance is an increasing function of the mean, and H_n(λ) < 0 if it is decreasing. In the general setting, a heteroscedasticity estimator would solve

(1.14)    H_n(λ) = 0.
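The same pattern gives a sketch of the general form (1.14), here with the Pearson-correlation measure mentioned above; np.corrcoef supplies the correlation, and the least-squares refit mirrors the previous sketch.

```python
import numpy as np
from scipy.optimize import brentq, least_squares

def box_cox(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def het_measure(lam, X, Y, f, beta0):
    """H_n(lam): correlation of squared residuals with log fitted values."""
    fit = least_squares(
        lambda b: box_cox(Y, lam) - box_cox(f(X, b), lam), beta0)
    e2 = fit.fun**2
    logf = np.log(f(X, fit.x))
    return np.corrcoef(e2, logf)[0, 1]

def lambda_het(X, Y, f, beta0, bracket=(-2.0, 2.0)):
    return brentq(het_measure, *bracket, args=(X, Y, f, beta0))
```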
Carroll and Ruppert (1988) also discuss these general heteroscedasticity estimators. They anticipate consistency of λ̂_het under the TBS model, assuming homoscedastic but not necessarily symmetric errors. In their work, Ruppert and Aldershof (1989) study the form of asymptotic covariances and find that estimating the nuisance parameter σ does not affect asymptotic covariances of (β̂, λ̂). In addition, the asymptotic distribution of the estimate for λ does not depend on whether β is known or not. For small σ, the estimating equation (1.13) is optimal among families where b(f(X_i, β̂(λ))) replaces log f(X_i, β̂(λ)) for monotonically increasing functions b.

Since the estimator λ̂_sk attempts to transform to symmetry and λ̂_het tries to eliminate heteroscedasticity, a natural question arises: the possibility that a value of the transformation parameter exists that attains both distinctions.
In this case both of the estimators defined would estimate the same quantity. Ruppert and Aldershof (1989) propose a possible test of this hypothesis based on asymptotic normality, and explore the possibility of improving the estimate for λ by a combination of λ̂_sk and λ̂_het. Thus, they propose a third estimator λ̂_hsk, which combines the ideas of the two previous estimators. They state that it is natural to consider the linear combination

λ̂_hsk = w λ̂_sk + (1 − w) λ̂_het

where the weight w would be chosen to achieve minimum asymptotic variance. They examine another view which solves a linear combination, with weights w and (1 − w), of the estimating equations (1.12) and (1.13) set to zero; an additional equation must then be introduced to specify the weight. If a transformation that induces both symmetry and homoscedasticity exists, λ̂_hsk should produce a better estimate than either of the two components. Ruppert and Aldershof (1989) inspect some applications of all the stated estimators using real data sets.
1.4 Summary of Work

In Chapter II we will establish consistency and asymptotic normality for the Ruppert-Aldershof estimate of the transformation to symmetry. As an alternative to this approach and the others described above, in this work we propose and study a new estimator of a transformation based on the notion of minimum distance. One advantage of this estimator is that it will be applicable in situations where other procedures fail to be consistent, while assuming only a symmetric distribution on the errors. In Chapter III we define this estimator, show that it belongs to a certain class of estimators (Generalized M-Estimators), and then prove consistency for this class. Asymptotic normality for estimators in this class is the subject of Chapter IV. In Chapter V we address the problem of increasing the efficiency of the estimator, and we also report the results of a Monte Carlo study comparing the new estimate to the maximum likelihood estimate and examining its performance under different error distributions.
CHAPTER II
TRANSFORMATIONS TO SYMMETRY

2.0 Introduction

In Chapter I, we were introduced to the following Transform-Both-Sides regression model of Carroll and Ruppert (1984):

(2.1)    h(Y_i, λ) = h(f(X_i, β), λ) + σε_i

where f(X, β) is of an assumed known form, β is an unknown regression parameter, h(Y, λ) is a one-to-one transformation of Y for each fixed λ, and λ is an unknown transformation parameter that may be, in general, vector-valued. The motivation behind this specification was to induce symmetric and homoscedastic errors while preserving a theoretical relationship between X and the conditional median of Y given X. The advantages are to increase efficiency in parameter estimation, and to make prediction intervals more accurate. Also, the model provides a way of incorporating random variation into a pre-specified relationship. When there is no prior knowledge on the structure of variability, this is an important feature.
Carroll and Ruppert (1988) explored estimation by maximum likelihood. The purpose of this chapter is to establish some asymptotic properties of one of the alternate estimating procedures introduced by Ruppert and Aldershof (1989) for the parameters in model (2.1). As we reviewed in Chapter I (section 1.3), two main procedures were proposed in that paper for estimating the transformation parameter. The first of these methods is an extension of the transformations to symmetry of Taylor (1985), and uses a measure of skewness that is applied to residuals. When errors are normally distributed, we obtain an estimator with optimal properties when the measure of skewness is chosen to be the standardized coefficient of skewness. The second proposed estimator employs a certain measure of heteroscedasticity, and was shown to be approximately optimal under small-σ asymptotics. In either case, the estimates were members of the class of M-estimators, which enabled theoretical analysis. In this chapter we are concerned with the first of the two approaches.
We will assume that for 1 ≤ i ≤ n, the errors ε_i are independent and identically distributed, and that the X_i are random variables also with a common probability distribution. Thus, the pairs Z_i = (Y_i, X_i) are i.i.d. random vectors, with values in some space 𝒵. Following Ruppert and Aldershof (1989), we introduce an additional parameter μ in order to center residuals at their mean in the definition of the transformation to symmetry. The vector of all the parameters is θ = (β, λ, μ)^T, and it will be assumed to be an element of Θ, a compact subset of Euclidean space.
For convenience of later reference we now introduce the following notation.

Notation 2.1:

N1. θ = (β, λ, μ) ∈ Θ ⊂ R^{p+2}, with β ∈ R^p.

N2. r(θ, Z) = h(Y, λ) − h(f(X, β), λ) − μ.

N3. r_i(θ) = r(θ, Z_i).

N4. Z = (Y, X).

N5. Z_i = (Y_i, X_i).

N6. 𝒵_n = (Z_1, ..., Z_n).

N7. ψ_j(θ, Z) = r(θ, Z) (∂/∂θ_j) r(θ, Z)   if 1 ≤ j ≤ p,
    ψ_{p+1}(θ, Z) = [r(θ, Z)]³,
    ψ_{p+2}(θ, Z) = r(θ, Z).

N8. G_j(θ) = Eψ_j(θ, Z), 1 ≤ j ≤ p+2.

N9. ψ_j^(n)(θ, 𝒵_n) = (1/n) Σ_{i=1}^n ψ_j(θ, Z_i), 1 ≤ j ≤ p+2.

N10. ψ^(n) : R^{p+2} × 𝒵^n → R^{p+2} is the function with coordinates ψ_j^(n), 1 ≤ j ≤ p+2.

N11. H(θ) = (∂/∂θ) G(θ) (a p+2 square matrix).
•
Definition 2.2: (Ruppert and Aldershof (1989))

The Ruppert-Aldershof Skewness Estimator θ̂_n is defined as a solution to the following equation:

(2.2)    ψ^(n)(θ̂_n, 𝒵_n) = 0.
•
2.1 General Results About Consistency

The first property to be investigated about this estimator is the existence of a sequence of solutions to equation (2.2) that is consistent. This existence, as well as asymptotic normality of a consistent solution, will be the topics of section 2.2. In this section, we give some useful lemmas and develop some foundations. An indexed set of functions on T to R^q, F = { f(t, x) : x ∈ A }, is said to be equicontinuous in t if for each t ∈ T and ε > 0 there is a δ(t) > 0 such that if |t − t′| < δ, then ||f(t, x) − f(t′, x)|| < ε for all x ∈ A.
Lemma 2.3: (Rubin (1956))

Let X_1, ..., X_n be a sequence of independent and identically distributed random variables with values in an arbitrary space 𝒳, having common distribution P. Let T be a compact topological space, and let f be a function on T × 𝒳, measurable for each t ∈ T. Define

H_n(t) = (1/n) Σ_{j=1}^n f(t, X_j)

and

H(t) = ∫ f(t, x) dP(x).

If there is an integrable g such that |f(t, x)| ≤ g(x) for all t ∈ T and x ∈ 𝒳, and if there is a sequence S_i of measurable sets such that

P( 𝒳 − ∪_{i=1}^∞ S_i ) = 0

and for each i, f(t, x) is equicontinuous in t for x ∈ S_i, then with probability one,

sup_{t∈T} | H_n(t) − H(t) | → 0

as n → ∞, and the limit function H(t) is continuous.
•
Two ways exist which are used to define M-estimators: by minimizing a sample-based function, or by solving a system of equations (appropriately called the estimating equations). In the next lemma, we see how consistency of M-estimation follows. Condition (2.3) below is strong, but will be satisfied for the cases that we are concerned with in this chapter; in fact, (2.3) will be implied by Lemma 2.3. This condition may be replaced by other assumptions (Huber (1967)), but in its present form it also serves well to illustrate the principle of M-estimation.
Lemma 2.4: Let (T, 𝒯, P_T) be a probability space, and let H_n(t) and H(t) be defined as in Lemma 2.3. If

(2.3)    sup_{t∈T} | H_n(t) − H(t) | → 0

and either

(a) t_n is defined by H_n(t_n) = inf_{t∈T} H_n(t), and the function H(t) has a unique minimum at t_0 ∈ T, or

(b) t_n is defined by H_n(t_n) = 0, and the equation H(t) = 0 has a unique root at t_0 ∈ T,

then t_n → t_0 with P_T-probability 1.
proof: Fix ε > 0 and define

A_ε = T ∩ { t : |t − t_0| ≥ ε }.

We wish to show that, with probability one, |t_n − t_0| < ε for all n sufficiently large, or equivalently that t_n ∉ A_ε for all n sufficiently large.

Case (a): Since A_ε is compact, inf_{t∈A_ε} H(t) is attained, and by assumption it is strictly greater than H(t_0), say greater than H(t_0) + 3δ for some δ > 0. If (2.3) holds, then with probability one there exists an integer N_0 such that for all n > N_0,

sup_{t∈T} | H_n(t) − H(t) | < δ

and therefore with probability one

H(t) − 2δ < H(t) − δ ≤ H_n(t) ≤ H(t) + δ

for all t and all n > N_0. This implies

inf_{t∈A_ε} H_n(t) ≥ inf_{t∈A_ε} H(t) − δ > H(t_0) + 2δ

and

inf_{t∉A_ε} H_n(t) ≤ H_n(t_0) ≤ H(t_0) + δ,

and thus for all n > N_0,

inf_{t∈A_ε} H_n(t) > H(t_0) + δ ≥ inf_{t∉A_ε} H_n(t),

thus showing that H_n(t) is not minimized on A_ε; that is, t_n ∉ A_ε.

Case (b): Since H(t) is continuous, there exists δ > 0 such that if t ∈ T and |t − t_0| ≥ ε, then |H(t)| > δ. If (2.3) holds, then there exists an integer N_0 such that with probability one,

sup_{t∈T} | H_n(t) − H(t) | < δ/2

for all n > N_0. Using the inequality

| H_n(t) | ≥ | H(t) | − | H_n(t) − H(t) |,

we have

inf_{t∈A_ε} | H_n(t) | ≥ inf_{t∈A_ε} | H(t) | − sup_{t∈A_ε} | H_n(t) − H(t) |.

Thus, with probability one, for all n > N_0,

0 < δ − δ/2 ≤ inf_{t∈A_ε} | H_n(t) |,

showing that H_n(t) has no roots on A_ε; that is, t_n ∉ A_ε.
•
It is convenient at this point to establish some preliminary results. The setting shall be more general than the situation just described for the Ruppert-Aldershof transformation to symmetry. We will make provisions for estimators defined by other sets of M-estimating equations as well, for example the equations that define the Ruppert-Aldershof heteroscedasticity estimator. The dimension of the parameter that is being estimated need not be equal to one.
Lemma 2.5: Let X_1, X_2, ... be i.i.d. random vectors on some probability space (Ω, 𝒜, P), let Θ ⊂ R^p be a compact parameter space, and let 𝒳 be a given space. Denote 𝒵_n = (X_1, ..., X_n), and assume ψ_j(θ, X) : Θ × 𝒳 → R, for 1 ≤ j ≤ p, is a given set of functions. Define

G_j(θ) = Eψ_j(θ, X),

ψ_j^(n)(θ, 𝒵_n) = (1/n) Σ_{i=1}^n ψ_j(θ, X_i),

ψ^(n) : R^p × 𝒳^n → R^p is the (random) function with coordinates ψ_j^(n)(θ), and

H(θ) = (∂/∂θ) G(θ).

If the following assumptions hold:

Assumptions 2.6:

A1. There exists a θ_0 ∈ Θ such that G_j(θ_0) = 0 for 1 ≤ j ≤ p. Without loss of generality we may assume θ_0 = 0.

A2. The support of X is bounded.

A3. H(θ_0) is nonsingular.

A4. Third order partials of ψ_j(θ, X) with respect to θ exist and are a.s. continuous, 1 ≤ j ≤ p.

Then, with probability one, for ε > 0 sufficiently small, there exists N_ε and a sequence θ̂_n such that for all n > N_ε,

(2.4)    ψ^(n)(θ̂_n, 𝒵_n) = 0

and such that || θ̂_n − θ_0 || < ε.
A key step in the proof will utilize Brouwer's Fixed Point Theorem from analysis, which states that any continuous function f with domain D = { x ∈ R^p : ||x|| ≤ B } for some B > 0 and range contained in D has at least one point u ∈ D such that f(u) = u.

proof: The system of equations (2.4) is equivalent to the fixed-point equation

(2.5)    Γ^(n)(θ, 𝒵_n) = θ,    where    Γ_j^(n)(θ, 𝒵_n) = θ_j − Σ_{ℓ=1}^p h^{(jℓ)} ψ_ℓ^(n)(θ, 𝒵_n),

H = H(0), and h^{(jℓ)} denotes the (j, ℓ)-th entry of H^{-1}. We have assumed enough conditions so that Lemma 2.3 can be applied to insure that

(2.6)    sup_θ | ψ_j^(n)(θ, 𝒵_n) − G_j(θ) | → 0

as n → ∞, for all 1 ≤ j ≤ p, and likewise for the partial derivatives of ψ_j^(n) and G_j; that is, differentiation and expectation do interchange. Define

Γ_j(θ) = θ_j − Σ_{ℓ=1}^p h^{(jℓ)} G_ℓ(θ).

By (2.6), we conclude with probability one,

(2.7)    sup_θ | Γ_j^(n)(θ, 𝒵_n) − Γ_j(θ) | → 0.

Expanding G to first order about θ = 0, and using the fact that G(0) = 0, we obtain

G_ℓ(θ) = Σ_{k=1}^p H_{ℓk} θ_k + o(||θ||),

so that

(2.8)    Γ_j(θ) = θ_j − Σ_{ℓ=1}^p h^{(jℓ)} { Σ_{k=1}^p H_{ℓk} θ_k + o(||θ||) } = θ_j − θ_j + o(||θ||) = o(||θ||).

This implies that ||Γ(θ)|| = o(||θ||), so there exists a D > 0 such that ||θ|| < D implies ||Γ(θ)|| < ||θ||/2. Restrict ε to be smaller than D; then ||θ|| < ε implies ||Γ(θ)|| < ||θ||/2 < ε/2. Since (2.7) implies that with probability one there exists an N_ε such that

(2.9)    sup_θ || Γ^(n)(θ, 𝒵_n) − Γ(θ) || < ε/2    for all n > N_ε,

using the triangle inequality together with (2.8) and (2.9), we may conclude that with probability one, there exists an integer N_ε such that

|| Γ^(n)(θ, 𝒵_n) || ≤ || Γ(θ) || + || Γ^(n)(θ, 𝒵_n) − Γ(θ) || < ε

for all n > N_ε on the set ||θ|| ≤ ε.

By Brouwer's Fixed Point Theorem mentioned above, for all n > N_ε there exists θ̂_n with || θ̂_n || ≤ ε such that Γ^(n)(θ̂_n, 𝒵_n) = θ̂_n. Equation (2.4) now follows from (2.5).
•
Under the same notation and assumptions, we have the following lemma, which amounts to the existence of a consistent sequence of solutions to a given set of M-estimating equations.

Lemma 2.7: Suppose Assumptions 2.6 hold. Then, with probability 1, there exists a sequence { θ̂_n } such that θ̂_n → θ_0 and ψ^(n)(θ̂_n, 𝒵_n) = 0.

proof: We denote the underlying probability space by (Ω, 𝒜, P). For fixed ε in Lemma 2.5, there is a set Ω_ε such that P(Ω_ε) = 1, and such that for all ω ∈ Ω_ε there is an N_ε and a solution θ̂_n such that

ψ^(n)(θ̂_n, 𝒵_n) = 0    and    || θ̂_n − θ_0 || < ε

for all n > N_ε. Denote the explicit dependence of θ̂_n on ω and ε by θ̂_{nε}(ω). Define

Ω_0 = ∩_{k=1}^∞ Ω_{1/k}

and notice that P(Ω_0) = 1. Consider ω ∈ Ω_0. Without loss of generality, the N_{1/j} which exist by Lemma 2.5 can be chosen to satisfy

N_1(ω) ≤ N_{1/2}(ω) ≤ N_{1/3}(ω) ≤ ...

If ω ∈ Ω_0, define

θ̂_n(ω) = θ̂_{n,1/k}(ω)    if N_{1/k}(ω) ≤ n < N_{1/(k+1)}(ω), k = 1, 2, ...,

and θ̂_n(ω) = 0 if n < N_1(ω); if ω ∉ Ω_0, set θ̂_n(ω) = 0 for all n. The sequence of random variables θ̂_n achieves the conclusion of the lemma.
•
2.2 Consistency and Asymptotic Normality of the Transformation to Symmetry

We now establish that if the TBS model (2.1) holds for some parameter values and symmetric errors, then with probability one, the Ruppert-Aldershof estimating equations based on skewness admit a solution θ̂_n that is consistent. A problem which we will have to address after this result is obtained is how to identify the consistent solutions in practice, since the estimating equations may have multiple roots. Lemma 2.7 only states that a consistent solution exists.

The next theorem makes use of the Notation 2.1 and the definition of the Ruppert-Aldershof estimator via the estimating equation (2.2).

Theorem 2.8: Consider the TBS model (2.1) and the conventions of Notation 2.1; in particular, let the function ψ and the matrix H be defined by N7 and N11, respectively. Assume there exists a parameter value θ_0 ∈ Θ such that the TBS model (2.1) holds. Suppose Z_i = (Y_i, X_i), 1 ≤ i ≤ n, are i.i.d. observations obeying the model, and that the following assumptions hold:

B1. The distribution of ε is symmetric on a bounded support.

B2. The support of the random variable X is bounded.

B3. The matrix H(θ_0) is nonsingular.

B4. The third order partial derivatives of f(X, β) with respect to β all exist and are a.s. continuous.

B5. The third order partial derivatives of h(Y, λ) with respect to λ all exist and are a.s. continuous.

B6. The parameter space Θ is compact.

B7. For each i, the error ε_i and the X_i are independent.

Then, with probability one, there exists a sequence of solutions θ̂_n to the Ruppert-Aldershof equations (2.2) such that θ̂_n → θ_0.

proof: The theorem will be a direct consequence of Lemma 2.7, once we identify that the Ruppert-Aldershof equations (2.2) are a special case through the particular form of the function ψ given by N7 of Notation 2.1. Since a parameter value θ_0 such that the model holds exists, we check that G_j(θ_0) = Eψ_j(θ_0, Z) = 0 for all j = 1, ..., p+2, as follows. In view of the fact that under the assumed model r(θ_0, Z) = σε, which is symmetric, we have by B1

Er(θ_0, Z) = Er³(θ_0, Z) = 0,

and since (∂/∂θ_j) r(θ_0, Z) for 1 ≤ j ≤ p is a function of X alone and is independent of ε, we have by B7

E[ r(θ_0, Z) (∂/∂θ_j) r(θ_0, Z) ] = 0.

Thus A1 of Assumptions 2.6 holds. The remaining assumptions of Lemma 2.7 are satisfied through assumptions B1-B7. In particular, the required smoothness properties of the functions ψ_j(θ, Z) for 1 ≤ j ≤ p+2 are a consequence of the smoothness of the functions r(θ, Z), f(X, β), and h(Y, λ).
•
Next we consider a consistent solution θ̂_n. Asymptotic normality under i.i.d. observations is established in the following result, which is set forth by Ruppert and Aldershof (1989), but is stated here for completeness.

Theorem 2.9: Under the assumptions of Theorem 2.8, any sequence of solutions to (2.2) such that θ̂_n →^P θ_0 satisfies

√n ( θ̂_n − θ_0 ) →^d N( 0, A^{-1} B A^{-T} )

where

A = (∂/∂θ) Eψ(θ, Z) |_{θ=θ_0},

B = E[ ψ(θ_0, Z) ψ^T(θ_0, Z) ],

A^{-T} = (A^{-1})^T, ψ(θ, Z) = (ψ_1(θ, Z), ..., ψ_{p+2}(θ, Z))^T, and the ψ_j(θ, Z), j = 1, ..., p+2, are the functions defined by N7 in Notation 2.1.
proof: The Ruppert-Aldershof estimator θ̂_n satisfies

(1/n) Σ_{i=1}^n ψ(θ̂_n, Z_i) = 0

and thus it is an M-Estimator. Asymptotic normality will follow from an application of Huber's (1967) Theorem 3 on general M-Estimation theory.
•
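A sketch of estimating the sandwich covariance A^{-1} B A^{-T} from data, with A approximated by a finite-difference Jacobian of the averaged ψ and B by the average of ψψ^T; the psi signature below is a hypothetical convention.

```python
import numpy as np

def sandwich_cov(psi, theta_hat, Z, h=1e-6):
    """Estimate the asymptotic covariance A^{-1} B A^{-T} of Theorem 2.9.

    psi(theta, z) returns the vector of estimating functions at one
    observation z = (y, x); Z is a list of observations.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    n, d = len(Z), theta_hat.size
    psis = np.array([psi(theta_hat, z) for z in Z])
    B = psis.T @ psis / n                      # estimate of E[psi psi^T]
    A = np.zeros((d, d))
    for k in range(d):                         # finite-difference Jacobian
        e = np.zeros(d); e[k] = h
        col = [(psi(theta_hat + e, z) - psi(theta_hat - e, z)) / (2.0 * h)
               for z in Z]
        A[:, k] = np.mean(col, axis=0)
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv.T
```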
2.3 Identification of a Consistent Solution in the Transformation to Symmetry

We now turn to the problem of identifying a sequence of solutions to the Ruppert-Aldershof equations that is consistent. The problem exists because several solutions might exist, not all of which need converge to the unknown parameter value.

We will describe a device which will enable us to compute a certain estimate that will possess the same asymptotic normal distribution as stated in Theorem 2.9. A preliminary step shall be to estimate the regression parameter β consistently, even in the case that the transformation parameter λ is unknown. This can be achieved under the TBS formulation by using the Least Absolute Deviations estimate for β, to be described below.

Once this preliminary estimate for β is obtained, a consistent estimate for λ will be constructed, and furthermore, a one-step procedure will be applied in order to produce an estimate of (β, λ) with the desired asymptotic properties. In Theorem 2.13, we will state this stepwise estimating procedure in greater detail. We first define the Least Absolute Deviation (LAD) estimator for β.
Least Absolute Deviation (LAD) estimator for
Definition 2.10:
(~,~)
Consider a general non-linear regression model
Y.
= f(X.,~o) + e.
I I I
where each e. has median equal to zero (they need not be symmetrically
1
distributed).
The Least AbsoLute Deviation (LAD) estimator for
~
based on obser-
vations (X 1 'Y 1 ) ..... (Xn . Yn ). denoted by ~LAD' is defined by
n
( l/n)
A
L
i=l
I
Y. - f(\ '~LAD) I = inf (l/n)
1
~
n
L
i=l
I Y.1
- f(X.1 ,~)
I.
•
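A minimal sketch of Definition 2.10; the derivative-free optimizer is an incidental choice that accommodates the nondifferentiability of the absolute-value objective.

```python
import numpy as np
from scipy.optimize import minimize

def lad_fit(X, Y, f, beta0):
    """Least Absolute Deviations estimate of beta (Definition 2.10)."""
    objective = lambda b: np.mean(np.abs(Y - f(X, b)))
    return minimize(objective, beta0, method="Nelder-Mead").x
```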
This estimator has desirable properties that are a consequence of the fact that the conditional median of Y given X is f(X, β_0), and the fact that E|Y − f(X, β)| as a function of β is minimized at β = β_0. Bassett and Koenker (1978) investigate the least absolute deviation estimator in the linear regression case. They find that the LAD and the Least Squares estimators compare in much the same way as the sample median and the sample mean compare when estimating location in a single sample. In the nonlinear case, Oberhofer (1982) investigates general sufficient conditions for consistency to the true value β_0.

A distinguishing property of the TBS regression model with median-zero errors is that the conditional median of Y given X is still f(X, β_0), regardless of the value of λ. This makes LAD estimation a likely method for estimating β consistently even if λ is unknown, although it may not be fully efficient.
Lemma 2.11: Consider the TBS regression model with errors that have a unique median at zero, and the assumptions of Theorem 2.8. If the regression function f(X, β) is such that β ≠ β_0 implies f(X, β) ≠ f(X, β_0) on a set of positive measure, then the estimator β̂_LAD of Definition 2.10 satisfies

β̂_LAD →^{a.s.} β_0

and

√n ( β̂_LAD − β_0 ) →^d N(0, Σ)

for some matrix Σ.
proof: Let Z = (X, Y), ρ(Z, β) = | Y − f(X, β) |, and γ(β) = Eρ(Z, β). Then β̂_LAD minimizes

(1/n) Σ_{i=1}^n ρ(Z_i, β)

and is therefore an M-Estimate. The proof of consistency will follow from an application of Huber's (1967) Theorem 1. An assumption that is critical for the result is that β ≠ β_0 implies γ(β) > γ(β_0). This can be verified as follows. Denote by h^{-1}(x, λ) the inverse of the function h(y, λ) in the TBS model. Then

Y = Y(X, ε) = h^{-1}( h(f(X, β_0), λ_0) + σ_0 ε, λ_0 )

and the conditional median of Y given X is f(X, β_0), since

1/2 = P{ Y ≤ f(X, β_0) } = P{ h(f(X, β_0), λ_0) + σ_0 ε ≤ h(f(X, β_0), λ_0) } = P{ ε ≤ 0 }.

Due to independence between ε and X, we can write

γ(β) = E { E[ | Y − f(X, β) | | X ] }.

Now, E[ | Y − f(X, β) | | X ] for each fixed X is minimized uniquely when f(X, β) = f(X, β_0), since the conditional median minimizes expected absolute deviation. Thus if β ≠ β_0, there is a set of positive X-measure such that f(X, β) ≠ f(X, β_0) on this set, and hence E[ | Y − f(X, β) | | X ] > E[ | Y − f(X, β_0) | | X ] on this set. It follows that γ(β) > γ(β_0). The remaining assumptions of Huber's Theorem 1 are verifiable because of the compactness of Θ, the boundedness of the support of X, and the continuity of the function ρ(Z, β). This proves consistency of β̂_LAD under the assumptions we have adopted.

For asymptotic normality, we note that β̂_LAD satisfies

(1/n) Σ_{i=1}^n ψ(Z_i, β̂_LAD) = 0

where

ψ(Z, β) = −(∂/∂β) f(X, β)   if Y > f(X, β),
           (∂/∂β) f(X, β)    if Y < f(X, β),
           0                 if Y = f(X, β).

Asymptotic normality will follow from an application of Huber's (1967) Theorem 3, as Eψ(Z, β) is differentiable in β.
•
The next result will make it possible to estimate λ consistently in the case that the regression parameter β_0 is known. In turn, this capability will be a crucial fact in Theorem 2.13. We will consider the special case obtained when h(y, λ) is the Box-Cox modified power transform. Other cases would have to be studied individually.
Lemma 2.12: Consider the TBS model (2.1) with the Box-Cox modified power transform

h(y, λ) = y^(λ) = (y^λ − 1)/λ   if λ ≠ 0,
                   log(y)        if λ = 0,

and assume that the value β_0 is known. Let the error ε possess a symmetric distribution F_ε(·) with a symmetric density f_ε(·), let F_X be the distribution of X, and assume X and ε are independent. Then the equation

E r³(β_0, λ) = 0

has a unique solution at λ = λ_0.

proof: Suppose λ_0 > 0. The proof for the case λ_0 ≤ 0 is similar. Since ε is symmetric and the TBS model holds, the equation

(2.10)    E r³(β_0, λ) = 0

has a root at λ = λ_0. We wish to show that it is unique. It is sufficient to prove that for each fixed X,

(2.11)    E[ r³(β_0, λ) | X ] ≠ 0    for λ ≠ λ_0.

For the modified power transform, the model gives Y^{λ_0} = f(X, β_0)^{λ_0} + λ_0 σ_0 ε, and therefore, for λ ≠ 0,

(2.12)    r(β_0, λ) = Y^(λ) − f^(λ)(X, β_0) = { (f_0 + σ̃ε)^{λ/λ_0} − f_0^{λ/λ_0} } / λ,

where we denote f_0 = f(X, β_0)^{λ_0} and σ̃ = λ_0 σ_0. Using (2.12),

E[ r³(β_0, λ) | X ] = ∫_{−∞}^∞ { [ (f_0 + σ̃u)^{λ/λ_0} − f_0^{λ/λ_0} ] / λ }³ f_ε(u) du.

Because of the symmetry of f_ε(·), the change of variable w = −u shows that the part of the integral over (−∞, 0) equals

− ∫_0^∞ { [ f_0^{λ/λ_0} − (f_0 − σ̃w)^{λ/λ_0} ] / λ }³ f_ε(w) dw,

and thus

(2.13)    E[ r³(β_0, λ) | X ] = ∫_0^∞ ( { [ (f_0 + σ̃u)^{λ/λ_0} − f_0^{λ/λ_0} ] / λ }³ − { [ f_0^{λ/λ_0} − (f_0 − σ̃u)^{λ/λ_0} ] / λ }³ ) f_ε(u) du.

Consider the function φ(u) = (u^{λ/λ_0}) / λ, defined for u > 0. We find that

φ″(u) = (1/λ_0)(λ/λ_0 − 1) u^{λ/λ_0 − 2},

and that φ is increasing. For each u > 0 in the integrand, consider the equally spaced points

(2.14)    a_1 = f_0 − σ̃u < a_2 = f_0 < a_3 = f_0 + σ̃u,    a_2 − a_1 = a_3 − a_2 > 0,

so that the integrand in (2.13) may be written as { [φ(a_3) − φ(a_2)]³ − [φ(a_2) − φ(a_1)]³ } f_ε(u). We consider two cases.

Case 1: λ > λ_0. In this case, since φ″(u) > 0, the function φ(u) is strictly convex, and therefore φ(a_3) − φ(a_2) > φ(a_2) − φ(a_1) under condition (2.14). Therefore, the integrand in (2.13) is positive, so that E[ r³(β_0, λ) | X ] > 0 for all X.

Case 2: λ < λ_0. In this case the function φ(u) is strictly concave, and thus φ(a_3) − φ(a_2) < φ(a_2) − φ(a_1) under condition (2.14); the integrand in (2.13) is negative, and hence E[ r³(β_0, λ) | X ] < 0 for all X.

Hence, (2.11) holds.
•
In Theorem 2.13 to follow, we will use a one-step algorithm in order to approximate the solution of a system of non-linear equations. We make reference to the Newton-Raphson procedure, which can be summarized as follows. Let G be a function from R^p to R^p and assume a root of the equation G(x) = 0 exists at x_0. Let x^(0) be an approximation to x_0, and define

x^(1) = x^(0) − [ Ġ(x^(0)) ]^{-1} G(x^(0)),

where Ġ denotes the matrix of partial derivatives of G. This x^(1) is called the one-step Newton-Raphson improvement towards the solution x_0. In the numerical analysis literature (e.g. Dahlquist, Björck and Anderson (1974), section 6.9.2), it is common knowledge that this method is of "second order"; that is, for each d > 0 there is a constant C = C(d) such that || x^(0) − x_0 || ≤ d implies

|| x^(1) − x_0 || ≤ C || x^(0) − x_0 ||².
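A sketch of the one-step improvement, with the derivative matrix Ġ approximated by central finite differences; in the theorem the exact Jacobian of the estimating equations would be used.

```python
import numpy as np

def newton_one_step(G, x0, h=1e-6):
    """One Newton-Raphson step towards a root of G, starting from x0."""
    x0 = np.asarray(x0, dtype=float)
    p = x0.size
    J = np.zeros((p, p))
    for k in range(p):                     # finite-difference Jacobian of G
        e = np.zeros(p); e[k] = h
        J[:, k] = (G(x0 + e) - G(x0 - e)) / (2.0 * h)
    return x0 - np.linalg.solve(J, G(x0))  # x1 = x0 - Jinv G(x0)
```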
Theorem 2.13 provides the method for obtaining an estimate that has the same asymptotic properties as the Ruppert-Aldershof transformation to symmetry. We need this procedure because solving the system (2.2) does not guarantee consistency: in the general case, the equation G(θ) = Eψ(θ, Z) = 0 may have solutions other than θ_0.

Theorem 2.13: Consider observations 𝒵_n = ((Y_1, X_1), ..., (Y_n, X_n)) according to the model (2.1) with the Box-Cox modified power transform; assume that the errors are symmetric, and the assumptions of Theorem 2.8. Let θ̂_n = (β̂_n, λ̂_n) be the consistent solution to the Ruppert-Aldershof equations (2.2), which exists with probability one by Theorem 2.8. We recall the notation

r_i(β, λ) = Y_i^(λ) − f^(λ)(X_i, β),    r(β, λ) = Y^(λ) − f^(λ)(X, β),    θ = (β, λ).

1. Define β̂_LAD by

(2.15)    Σ_{i=1}^n | Y_i − f(X_i, β̂_LAD) | = inf_β Σ_{i=1}^n | Y_i − f(X_i, β) |.

2. Let λ̂_LAD be the solution to

(2.16)    Σ_{i=1}^n r_i³(β̂_LAD, λ) = 0

as a function of λ, and set θ̂_LAD = (β̂_LAD, λ̂_LAD).

3. Let θ̂_n* = (β̂_n*, λ̂_n*) be a 1-step Newton-Raphson modification towards the solution to the Ruppert-Aldershof equations with starting value θ̂_LAD, that is, let

(2.17)    θ̂_n* = θ̂_LAD − [ ψ̇^(n)(θ̂_LAD, 𝒵_n) ]^{-1} ψ^(n)(θ̂_LAD, 𝒵_n),

where ψ̇^(n)(θ, 𝒵_n) = (1/n) Σ_{i=1}^n (∂/∂θ) ψ(θ, Z_i).

Then θ̂_n* → θ_0, and furthermore, θ̂_n* and θ̂_n have the same asymptotic distribution.
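Steps 1-3 can be sketched end to end as follows. The estimating equations used here are the (β, λ) components of Notation 2.1 (least-squares components r·∂r/∂β_j and the skewness component r³, without the centering parameter μ), matching the form used in the statement; the optimizers, bracket, and finite-difference derivatives are incidental choices.

```python
import numpy as np
from scipy.optimize import minimize, brentq

def box_cox(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def tbs_one_step(X, Y, f, beta0, bracket=(-2.0, 2.0), h=1e-5):
    """Steps 1-3 of Theorem 2.13: LAD, cubic-residual root, one NR step."""
    # Step 1: beta_LAD minimizes the mean absolute residual (2.15).
    beta_lad = minimize(lambda b: np.mean(np.abs(Y - f(X, b))),
                        beta0, method="Nelder-Mead").x
    r = lambda lam, b: box_cox(Y, lam) - box_cox(f(X, b), lam)
    # Step 2: lambda_LAD solves the cubic residual equation (2.16).
    lam_lad = brentq(lambda lam: np.mean(r(lam, beta_lad)**3), *bracket)
    theta_lad = np.append(beta_lad, lam_lad)

    def psi_bar(theta):
        # Averaged estimating equations: r * dr/dbeta_j and mean(r^3).
        b, lam = theta[:-1], theta[-1]
        res, eqs = r(lam, b), []
        for j in range(b.size):
            e = np.zeros(b.size); e[j] = h
            dr = (r(lam, b + e) - r(lam, b - e)) / (2.0 * h)
            eqs.append(np.mean(res * dr))
        eqs.append(np.mean(res**3))
        return np.array(eqs)

    # Step 3: one Newton-Raphson step (2.17) starting from theta_LAD.
    d = theta_lad.size
    J = np.zeros((d, d))
    for k in range(d):
        e = np.zeros(d); e[k] = h
        J[:, k] = (psi_bar(theta_lad + e) - psi_bar(theta_lad - e)) / (2.0 * h)
    return theta_lad - np.linalg.solve(J, psi_bar(theta_lad))
```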
proof: The first thing to establish is that λ̂_LAD → λ_0 as n tends to infinity. For this it will be useful to show that the function r(β, λ) is equicontinuous in (β, λ) with respect to (Y, X). Equicontinuity was defined in section 2.1.

Since the function r(β, λ) = Y^(λ) − f^(λ)(X, β) is continuously differentiable in both λ and β, and the parameter space is assumed to be compact, we can find continuous functions K_1(X, Y) and K_2(X, Y) such that

|| (∂/∂β) r(β, λ) || ≤ K_1(X, Y)    and    || (∂/∂λ) r(β, λ) || ≤ K_2(X, Y).

Since the support of both Y and X is bounded, K_1 and K_2 can be bounded by some constant K, and equicontinuity follows from the inequality that results:

(2.18)    sup_{X,Y} || r(β, λ) − r(β′, λ′) || ≤ K || β − β′ || + K || λ − λ′ ||.

By Lemma 2.3, since the observations are i.i.d. and the parameter space has been assumed to be compact, (2.18) implies

(2.19)    sup_Θ | (1/n) Σ_{i=1}^n r_i³(β, λ) − E r³(β, λ) | → 0.

From Lemma 2.11 we know that β̂_LAD → β_0. Therefore, because of uniform continuity of r_i in (β, λ), we obtain

(2.20)    sup_λ | (1/n) Σ_{i=1}^n r_i³(β̂_LAD, λ) − (1/n) Σ_{i=1}^n r_i³(β_0, λ) | → 0.

Now, since

sup_λ | (1/n) Σ_{i=1}^n r_i³(β̂_LAD, λ) − E r³(β_0, λ) |
    ≤ sup_λ | (1/n) Σ_{i=1}^n r_i³(β̂_LAD, λ) − (1/n) Σ_{i=1}^n r_i³(β_0, λ) |
    + sup_λ | (1/n) Σ_{i=1}^n r_i³(β_0, λ) − E r³(β_0, λ) |,

then (2.19) and (2.20) imply

(2.21)    sup_λ | (1/n) Σ_{i=1}^n r_i³(β̂_LAD, λ) − E r³(β_0, λ) | → 0.

By Lemma 2.12, the unique root of E r³(β_0, λ) = 0 is λ = λ_0. Consistency of λ̂_LAD then follows from the uniform convergence (2.21), by using Lemma 2.4.

An application of Huber's M-Estimation theory will establish that (β̂_LAD, λ̂_LAD) is asymptotically normal, but for the proof of the current theorem we will only need the fact that

(2.22)    || θ̂_LAD − θ_0 || = o_p(n^{-1/4}).
The proof of the theorem now follows. The notation is that defined by Notation 2.1. Assuming the bounded supports, since ψ̇(θ, Z) is continuous, we have uniform continuity of ψ̇(θ, Z) in (θ, Z), and since θ̂_LAD → θ_0 we may conclude

(2.23)    ψ̇^(n)(θ̂_LAD, 𝒵_n) → Eψ̇(θ_0, Z) = A    a.s.

That θ̂_n* → θ_0 follows from (2.17) and (2.23).
To prove asymptotic normality, we begin by noting that we have assumed enough derivatives so that for 1 ≤ j ≤ p+2, almost surely,

ψ_j(θ̂_LAD, Z_i) = ψ_j(θ_0, Z_i) + Σ_{k=1}^{p+2} (∂/∂θ_k) ψ_j(θ_0, Z_i) ( θ̂_LAD^{(k)} − θ_0^{(k)} )
    + (1/2) Σ_{k=1}^{p+2} Σ_{k′=1}^{p+2} (∂²/∂θ_k ∂θ_{k′}) ψ_j(θ̄, Z_i) ( θ̂_LAD^{(k)} − θ_0^{(k)} ) ( θ̂_LAD^{(k′)} − θ_0^{(k′)} )

for some θ̄ = θ̄(i, j) on the line between θ̂_LAD and θ_0. There is a constant Ā such that the third term is bounded by Ā (p+2)² || θ̂_LAD − θ_0 ||² for all j, so by (2.22) the third term is o_p(n^{-1/2}). Next, for 1 ≤ j ≤ p+2,

(2.24)    ψ_j^(n)(θ̂_LAD, 𝒵_n) = (1/n) Σ_{i=1}^n ψ_j(θ̂_LAD, Z_i)
    = (1/n) Σ_{i=1}^n ψ_j(θ_0, Z_i) + Σ_{k=1}^{p+2} { (1/n) Σ_{i=1}^n (∂/∂θ_k) ψ_j(θ_0, Z_i) } ( θ̂_LAD^{(k)} − θ_0^{(k)} ) + o_p(n^{-1/2}),

that is,
(2.25)    ψ^(n)(θ̂_LAD, 𝒵_n) = ψ^(n)(θ_0, 𝒵_n) + ψ̇^(n)(θ_0, 𝒵_n) ( θ̂_LAD − θ_0 ) + o_p(n^{-1/2}).

Using (2.25) and (2.17), we find

(2.26)    √n ( θ̂_n* − θ_0 ) = − [ ψ̇^(n)(θ̂_LAD, 𝒵_n) ]^{-1} √n ψ^(n)(θ_0, 𝒵_n) + o_p(1).

By the Central Limit Theorem,

(2.27)    √n ψ^(n)(θ_0, 𝒵_n) →^d N(0, B)

where B = Cov(ψ(θ_0, Z)), and by the Strong Law of Large Numbers,

(2.28)    ψ̇^(n)(θ_0, 𝒵_n) → A    a.s.,

where A = Eψ̇(θ_0, Z). Equations (2.26), (2.27), and (2.28) imply

√n ( θ̂_n* − θ_0 ) →^d N( 0, A^{-1} B A^{-T} ).

This distribution is precisely the limit distribution which was obtained in the conclusion of Theorem 2.9, and thus the proof is complete.
•
CHAPTER III
CONSISTENCY OF A MINIMUM DISTANCE ESTIMATOR

3.0 Introduction

In this chapter we shall describe a new estimator for the transformation parameter in the transform-both-sides regression model, and investigate consistency. We discussed the TBS model in detail in section 1.2, where the definition was given by

(3.1)    h(Y_i, λ) = h(f(X_i, β), λ) + σε_i.
In Chapter I we described several methods that researchers have proposed to estimate transformations, and we particularly stressed those techniques whose favorable properties were a consequence of the symmetry of the errors. Shortcomings of maximum likelihood are avoided in this way. The methods of Taylor (1985) or the Ruppert-Aldershof estimator studied in Chapter II are examples of this approach.

We assume throughout this chapter that the error ε does indeed possess a symmetric distribution F_ε(x). We will propose a method for estimating the parameter λ which is based on a certain measure of departure of a probability distribution from symmetry. In section 3.1, we apply this criterion to the empirical distribution F_n(x; λ, β) of the residuals, and then select as an estimate of λ a value which minimizes the criterion. Based on symmetry of the error distribution, the method will produce consistent estimates of transformation parameters in two important cases where other estimators such as maximum likelihood fail to converge: (i) the modified power transformations, and (ii) their extension to shifted power transformations. An extensive discussion of these examples is contained in section 3.7; in this section, we only point out some preliminary background.
To grasp the origin of the new estimate, we begin by letting F_n(x) denote the empirical distribution function of a random sample of size n from a distribution F(x). We then define

    Q_n(x) = F_n(x) + F_n(−x) − 1

and

    γ_n² = ∫ {Q_n(x)}² dx.

If F_n(x) is not symmetric, the number γ_n² assesses the degree of departure from symmetry. Orlov (1972) and Rothman and Woodroofe (1972) suggested the statistic γ_n² for testing the null hypothesis that F is symmetric; Koziol (1980) further elaborates upon the same statistic. Test statistics of this form are known in the literature as being of the Cramér–von Mises type. Other methods based on Q_n(x) for testing symmetry are also possible; for example, Butler (1969) considered sup_{x≥0} Q_n(x).
The notion of Cramér–von Mises-type statistics can be used for estimation purposes as well. Let Γ = { F_θ : θ ∈ Θ } be a parameterized family of distributions, let X₁, …, X_n be a random sample from a distribution G, and let G_n be the empirical distribution function of the sample. Minimization over the parameter space Θ of the quantity

    cvm(G_n, F_θ) = ∫ { G_n(x) − F_θ(x) }² dx

yields the so-called Minimum Distance Estimator of θ (using the Cramér–von Mises discrepancy). If G = F_{θ₀} for some θ₀ ∈ Θ, then the quantity which is being estimated is clearly meaningful. Otherwise, we have to interpret the estimate as selecting a member of Γ which is close to G in the sense of the quantity cvm(G, F_θ). For work on the minimum distance estimate in the problem of estimating location, see Boos (1981) and Parr and De Wet (1981), for example.
In section 3.1 we propose an estimate of the transformation parameter which is based on a discrepancy of the Cramér–von Mises type. Sections 3.2 through 3.5 deal with material that leads to the proof of consistency. In section 3.6 we prove consistency of the estimator for selected families of transformations, including the Box–Cox modified power transformation and the modified power transformation with shift.
3.1  Proposed Estimator

We consider the situation of observing independent realizations Z_i = (X_i, Y_i), for 1 ≤ i ≤ n, that follow the regression model (3.1), where the errors ε_i have distribution F_ε(ε). The primary objective is to estimate the parameter θ = (λ,β,σ); in this section we propose an alternate method for achieving this purpose. We make the following general assumptions with regard to the transform-both-sides model (3.1):
Assumptions 3.1:

A1. The parameter (λ,β,σ) lies in a compact set Θ.
A2. There exists (λ₀,β₀,σ₀) ∈ Θ such that (3.1) holds.
A3. The pairs (X_i, ε_i), i = 1, …, n, are independent.
A4. X is independent of ε.
A5. The random variables X and ε have compact supports.
A6. The distribution F_ε(x) is symmetric.
A7. The function f(x,β) is continuously differentiable with respect to β for each fixed x.
A8. The function h(y,λ) is continuously differentiable with respect to λ for each fixed y.  •
We shall first allow for a brief heuristic construction of the alternate estimate before proceeding to the formal definitions. The idea is to utilize the information contained in the residuals to discern whether their distribution is symmetric or not. The criterion which we employ for this bears a resemblance to the Cramér–von Mises type, and the estimate is picked by minimization. We therefore adopt the qualifier "Cramér–von Mises Minimum Distance" to designate the new estimate.
Putting Z_i = (X_i, Y_i), we define the residuals by

(3.2)    r(λ,β,Z_i) = h(Y_i,λ) − h(f(X_i,β),λ),

and notice immediately that

    r(λ₀,β₀,Z_i) = σ₀ ε_i,

so that at the correct value of the parameter, the residuals are symmetric if the ε_i are symmetric.
We now define the standardized residuals by

(3.3)    r*(θ, Z_i) = r(λ,β,Z_i) / σ.

Let F_n(x,θ) denote the empirical distribution function of the standardized residuals, that is,

(3.4)    F_n(x,θ) = (1/n) Σ_{i=1}^n 1( r(λ,β,Z_i) ≤ σx ),    −∞ < x < ∞.

Suppose for heuristic purposes that the parameters β₀ and σ₀ are known, and thus F_n(x,θ) is a function of λ and x only. We could then define

    F_n(x,λ) = (1/n) Σ_{i=1}^n 1( r(λ,β₀,Z_i) ≤ σ₀x )

and

    Q_n(x,λ) = F_n(x,λ) + F_n(−x,λ) − 1.
We then define, for each λ, the statistic

(3.5)    V_n²(λ) = ∫ Q_n²(x,λ) w(x) dx.

The quantity V_n²(λ) is a measure of the degree of symmetry of the empirical distribution function F_n(x,λ). We introduce a weighting function w(x) to allow for some flexibility, as it will have an effect on the efficiency of the estimate. Later on, it will be of interest to investigate precisely what influence this weighting function has on the properties of the estimate. Without loss of generality, we may assume at this point that w(x) is symmetric, because adopting any w(x) is equivalent to using [w(x) + w(−x)]/2. This equivalence holds because Q_n²(x,λ) is symmetric in x, and thus

    V_n²(λ) = ∫₀^∞ Q_n²(x,λ) w(x) dx + ∫_{−∞}^0 Q_n²(x,λ) w(x) dx
            = ∫₀^∞ Q_n²(x,λ) w(x) dx + ∫₀^∞ Q_n²(x,λ) w(−x) dx
            = ∫_{−∞}^∞ Q_n²(x,λ) [w(x) + w(−x)]/2 dx.
Since r*(θ₀,Z_i) = ε_i if the regression model truly holds for some value of the parameters, F_n(x,λ₀) then estimates the error distribution. Furthermore, if the latter distribution is symmetric, V_n²(λ₀) estimates zero. It thus appears sensible to treat V_n²(λ) as a function of λ and to select as an estimate of λ₀ that value of λ which produces a minimum. We will, in fact, formally define the "Minimum Distance Estimator based on a Cramér–von Mises-type statistic", denoted by λ̂_CVM, as the value of λ which minimizes V_n²(λ).

When β and σ are not known, some adjustments must be made to the preceding argument. Since the well-known optimal properties of the least-squares estimator in the presence of normal errors are desirable, we will employ least squares for this purpose. We will denote by β̂(λ) and σ̂(λ) the least-squares estimators of β and σ for fixed λ; that is, β̂(λ) satisfies

    (1/n) Σ_{i=1}^n r²(λ, β̂(λ), Z_i) = inf_β (1/n) Σ_{i=1}^n r²(λ, β, Z_i),

and σ̂(λ) is then given by

    σ̂²(λ) = (1/n) Σ_{i=1}^n r²(λ, β̂(λ), Z_i).
We make this estimation principle more precise in Definition 3.3. Some notation is convenient at this point. Suppose we observe (X_i, Y_i), 1 ≤ i ≤ n, under the assumption that the transform-both-sides regression model (3.1) holds for a family of transformations { h(y,λ) : λ ∈ Λ }. Throughout this chapter, we let w(x) be a non-negative, fixed symmetric function.
Notation 3.2:

N1. Z_i = (X_i, Y_i), Z = (X, Y).
N2. r(λ,β,Z) = h(Y,λ) − h(f(X,β),λ).
N3. β̂(λ) is the least-squares estimator for fixed λ; that is, β̂(λ) minimizes (1/n) Σ_{i=1}^n {r(λ,β,Z_i)}² as a function of β.
N4. σ̂²(λ) = (1/n) Σ_{i=1}^n r²(λ, β̂(λ), Z_i).
N5. r_i*(λ) = r(λ, β̂(λ), Z_i) / σ̂(λ).
N6. F_n(x,λ) = (1/n) Σ_{i=1}^n 1( r_i*(λ) ≤ x ).
N7. Q_n(x,λ) = F_n(x,λ) + F_n(−x,λ) − 1.
N8. V_n²(λ) = ∫ Q_n²(x,λ) w(x) dx.  •
Definition 3.3: The Minimum Distance Estimator λ̂_CVM is defined by minimizing the quantity V_n²(λ) over Λ:

(3.6)    inf_{λ∈Λ} V_n²(λ) = V_n²(λ̂_CVM).

This also defines the following estimators of β and σ:

(3.7)    β̂_CVM = β̂(λ̂_CVM),

(3.8)    σ̂_CVM = σ̂(λ̂_CVM).  •
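For concreteness, the sketch below spells out Definition 3.3 numerically for one simple instance. The Box–Cox family, the mean function f(x,β) = βx, the uniform weight w(x) = 1 on [−3,3], and the grid search over λ are illustrative assumptions of the sketch, not prescriptions from the text.

```python
# A minimal numerical sketch of Definition 3.3 (not the author's code).
import numpy as np
from scipy.optimize import minimize_scalar

def h(y, lam):
    """Box-Cox modified power transform (3.27)."""
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def V2n(lam, X, Y, grid=np.linspace(-3, 3, 301)):
    """Criterion V_n^2(lam) of N8, with least-squares beta-hat (N3-N5)."""
    sse = lambda b: np.sum((h(Y, lam) - h(b * X, lam))**2)      # N3
    beta_hat = minimize_scalar(sse, bounds=(0.1, 10.0), method="bounded").x
    r = h(Y, lam) - h(beta_hat * X, lam)                        # residuals (3.2)
    r_star = r / np.sqrt(np.mean(r**2))                         # N4-N5
    Fn = lambda x: np.mean(r_star[:, None] <= x, axis=0)        # N6
    Qn = Fn(grid) + Fn(-grid) - 1.0                             # N7
    return np.trapz(Qn**2, grid)                                # N8, w(x) = 1

# Usage: grid search for lam_CVM, as in (3.6)
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, 200)
Y = (2.0 * X) * np.exp(0.1 * rng.standard_normal(200))   # TBS holds, lam0 = 0
lams = np.linspace(-1, 1, 41)
lam_cvm = lams[np.argmin([V2n(l, X, Y) for l in lams])]
```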
To study this estimator analytically, it is convenient to investigate the behavior of the estimate of λ when β₀ and σ₀ are known. In this case, the estimate λ̂_CVM is still defined by (3.6), but with β₀ and σ₀ substituted for β̂(λ) and σ̂(λ) in the definition of r_i*(λ).

The first property we study for this estimator is consistency, in section 3.6. After demonstrating in section 3.2 that the estimator belongs to a broad class of estimators, we will then prove results for this general class in sections 3.3 through 3.5. We will discuss the asymptotic distribution of the estimate in Chapter IV.
3.2  Alternate Representation of the Estimator

It is useful to observe that the minimum distance estimator defined above can be represented in an alternate form, which shall be the basis for the theoretical analysis. This framework provides motivation for the class of estimators to be studied in section 3.3, and it also allows for simplification in the computation of the estimate.
Lemma 3.4: We refer to Notation 3.2, assuming that β₀ and σ₀ are known. For each λ ∈ Λ, the quantity V_n²(λ) can be written in the form

(3.9)    V_n²(λ) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n K_λ(Z_i, Z_j),

where for each λ the function K_λ(·,·) is symmetric in the sense that K_λ(z₁,z₂) = K_λ(z₂,z₁).

proof: Letting

    a_i(x,λ) = 1( r_i*(λ) ≤ x ) + 1( r_i*(λ) ≤ −x ) − 1,

we note that

    Q_n²(x,λ) = { (1/n) Σ_{i=1}^n [ 1( r_i*(λ) ≤ x ) + 1( r_i*(λ) ≤ −x ) − 1 ] }²
              = (1/n²) Σ_{i=1}^n Σ_{j=1}^n a_i(x,λ) a_j(x,λ).

Therefore,

(3.10)    V_n²(λ) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n ∫ a_i(x,λ) a_j(x,λ) w(x) dx.

If β₀ and σ₀ are both known, then putting r_i*(λ) = r(λ,β₀,Z_i)/σ₀ shows that, for each λ, V_n²(λ) is of the form described in the lemma, by setting

    K_λ(Z_i, Z_j) = ∫ a_i(x,λ) a_j(x,λ) w(x) dx.  •
We have one additional result in this section, which is of interest from a computational point of view. It enables us to compute the criterion V_n²(λ) without having to actually perform integration, exploiting the fact that empirical distribution functions are step functions. This is an advantage because minimization algorithms often operate by multiple evaluations of the objective function. We note that each evaluation of V_n²(λ) for fixed λ already involves least squares (most likely non-linear), so this simplification reduces the task.
Proposition 3.5: Assume that the weighting function w(x) that appears in the definition of V_n²(λ) is a symmetric function, and that there exists a (necessarily odd) nondecreasing function Φ(x) such that Φ'(x) = w(x) for all x. Let R_i be the rank of Φ(r_i*(λ)) and Q_i the rank of |Φ(r_i*(λ))|. We note, since Φ is nondecreasing, that R_i is just the rank of r_i*(λ). Then

    V_n²(λ) = (2/n²) Σ_{i=1}^n (2c_i + 1) |Φ(r_i*(λ))|,

where

    c_i = n − 2R_i + Q_i          if r_i*(λ) ≥ 0,
        = 2R_i − 2 − n + Q_i      if r_i*(λ) < 0.

proof: Consider the expression for V_n²(λ) given by Lemma 3.4. For 1 ≤ i ≤ n, use the notation r_i for r_i*(λ), and for 1 ≤ i,j ≤ n let

    t_ij = ∫_{−∞}^{∞} a_i(x,λ) a_j(x,λ) w(x) dx.

Since we have assumed the transform-both-sides model with bounded supports for X and ε, as well as compact parameter spaces, the integral above over (−∞,∞) is actually over a bounded interval, say [L,U]. Thus

    t_ij = ∫_L^U [ 1(r_i ≤ x) + 1(r_i ≤ −x) − 1 ][ 1(r_j ≤ x) + 1(r_j ≤ −x) − 1 ] w(x) dx.

Expanding the product of the two bracketed sums and integrating term by term (each term being the integral of w over an interval with endpoints among L, U, ±r_i, ±r_j), the terms in Φ(L) and Φ(U) cancel and we are left with

    t_ij = −Φ(r_i∨r_j) + Φ(−r_i∨r_j) + Φ(r_i∨−r_j) + Φ(−r_i∧−r_j) + Φ(r_i) + Φ(r_j)
         = −2Φ(r_i∨r_j) + Φ(−r_i∨r_j) + Φ(r_i∨−r_j) + Φ(r_i) + Φ(r_j),

where the second equality uses the fact that Φ is odd. Notice t_ij = t_ji and t_ii = 2|Φ(r_i)|. We can show, by considering cases, that for i ≠ j

    t_ij = 2( |Φ(r_i)| ∧ |Φ(r_j)| )     if r_i r_j ≥ 0,
         = −2( |Φ(r_i)| ∧ |Φ(r_j)| )    if r_i r_j < 0.

For instance, if 0 < r_i < r_j, then from the expression for t_ij we obtain

    t_ij = −2Φ(r_j) + Φ(r_j) + Φ(r_i) + Φ(r_i) + Φ(r_j) = 2Φ(r_i) = 2|Φ(r_i)|.

The remaining configurations are all similar.

Now, by Lemma 3.4,

    V_n²(λ) = (1/n²) { Σ_{i=1}^n t_ii + 2 Σ_{i<j} t_ij },

and each t_ij is one of ±2|Φ(r_i)| or ±2|Φ(r_j)|, depending on the relative location of r_i and r_j. Thus the problem reduces to counting the number of times that each term 2|Φ(r_k)| appears with a positive or a negative sign as the double sum over (i<j) progresses. Let P_k be the number of times that 2|Φ(r_k)| appears with positive sign, and N_k the number of times that it appears with negative sign. Using the definitions of R_k and Q_k, the following two cases can be verified.

Case A. If r_k is negative, then

    P_k = R_k − 1,    N_k = (n − Q_k) − (R_k − 1).

Case B. If r_k is non-negative, then

    P_k = n − R_k,    N_k = (n − Q_k) − (n − R_k) = R_k − Q_k.

Hence the coefficient P_k − N_k of the term 2|Φ(r_k)| in the sum over (i<j) is

    2R_k − 2 − n + Q_k    if r_k < 0,
    n − 2R_k + Q_k        if r_k ≥ 0,

as stated.  •
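The following sketch implements the rank shortcut of Proposition 3.5 and checks it against direct numerical integration of (3.5). The choice w(x) = 1 on [−c,c], which makes Φ the identity clipped to [−c,c], is an assumption of this example, as is the closed form reconstructed above.

```python
# Sketch of the rank shortcut in Proposition 3.5 (not verbatim from the text).
import numpy as np

def V2n_ranks(r_star, c=10.0):
    """V_n^2(lam) = (2/n^2) sum_i (2 c_i + 1) |Phi(r_i*)| via ranks."""
    n = len(r_star)
    phi = np.clip(r_star, -c, c)                    # Phi' = w = 1 on [-c, c]
    R = np.argsort(np.argsort(r_star)) + 1          # ranks of r_i*
    Q = np.argsort(np.argsort(np.abs(phi))) + 1     # ranks of |Phi(r_i*)|
    ci = np.where(r_star >= 0, n - 2*R + Q, 2*R - 2 - n + Q)
    return (2.0 / n**2) * np.sum((2*ci + 1) * np.abs(phi))

def V2n_integral(r_star, c=10.0, m=40001):
    """Direct numerical integration of (3.5), for comparison."""
    x = np.linspace(-c, c, m)
    Fn = np.mean(r_star[:, None] <= x, axis=0)
    Qn = Fn + Fn[::-1] - 1.0                        # F_n(x) + F_n(-x) - 1
    return np.trapz(Qn**2, x)

r = np.random.default_rng(1).standard_normal(50)
print(V2n_ranks(r), V2n_integral(r))                # agree to grid accuracy
```

The rank form avoids quadrature entirely, which matters because, as noted above, each evaluation of V_n²(λ) already requires a (possibly non-linear) least-squares fit.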
3.3  Generalized M-Estimators

Relation (3.9) shows that the proposed estimator λ̂_CVM is a special case of a definite family of estimators. Serfling (1984) termed this the class of Generalized M-Estimators, while studying classes of a related type. One way of characterizing the general case is through the following description:
Let m be a fixed integer and X₁, X₂, …, X_n random variables on a probability space (Ω, 𝒜, P). Consider a function K : 𝒳^m × Θ → ℝ, with Θ a parameter space which is a subset of ℝ^p, such that K(x₁, x₂, …, x_m, θ) is symmetric in its x arguments for each fixed θ ∈ Θ; that is,

    K(x_{i₁}, …, x_{i_m}, θ) = K(x₁, …, x_m, θ)

for all permutations (i₁, …, i_m) of the integers (1, …, m). We define a Generalized M-Estimator to be a random variable θ̂_n that satisfies

(3.11)    (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n K(X_{i₁}, …, X_{i_m}, θ̂_n)
              = inf_θ (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n K(X_{i₁}, …, X_{i_m}, θ).

In the event that the function K has a derivative with respect to θ, then, after putting K'(x₁, …, x_m, θ) = (∂/∂θ) K(x₁, …, x_m, θ), we would find that the Generalized M-Estimator satisfies

(3.12)    Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n K'(X_{i₁}, …, X_{i_m}, θ̂_n) = 0.
If we take m = 1 we obtain Huber's M-Estimator as a special case, namely, a θ̂_n which is defined by

    (1/n) Σ_{i=1}^n ρ(X_i, θ̂_n) = inf_θ (1/n) Σ_{i=1}^n ρ(X_i, θ)

for some function ρ : 𝒳 × Θ → ℝ, or alternatively as satisfying

    (1/n) Σ_{i=1}^n ψ(X_i, θ̂_n) = 0,

where ψ : 𝒳 × Θ → ℝ^p.

In the remaining part of this chapter we establish some conditions for consistency of a generalized M-estimate. The special case of estimating the transformation using the estimator described in section 3.1 will be examined closely in section 3.6. We first state a few preliminary definitions and key results for later reference.
3.4  U-Statistics

Let X₁, X₂, …, X_n be independent and identically distributed random variables with distribution F. In this section, let K be a function from ℝ^m to ℝ^q, which we call a Kernel of Order m. We will assume that K(x₁, …, x_m) is symmetric in its m arguments, that is,

    K(x_{i₁}, …, x_{i_m}) = K(x₁, …, x_m)

for all permutations (i₁, …, i_m) of the integers {1, …, m}, and we write γ = E_F K(X₁, …, X_m).

For a given symmetric kernel K, the associated U-Statistic based on a sample of size n ≥ m is

(3.13)    U_n = (n choose m)^{-1} Σ_C K(X_{i₁}, …, X_{i_m}),

where the sum Σ_C is over all combinations of distinct elements {i₁, …, i_m} from the set {1, …, n}.

A second statistic, also based on the kernel K and on the sample of size n, is termed a V-Statistic if it is defined as

(3.14)    V_n = (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n K(X_{i₁}, …, X_{i_m}).
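As a concrete illustration of the two definitions (an example of mine, not from the text), take m = 2 and the kernel K(x₁,x₂) = (x₁ − x₂)²/2, for which U_n is the usual unbiased sample variance; the V-statistic differs from it by a factor (n−1)/n, a discrepancy of the order described in Lemma 3.8 below.

```python
# U-statistic (3.13) vs V-statistic (3.14) for K(x1, x2) = (x1 - x2)^2 / 2.
import itertools
import numpy as np

def U_stat(x):
    pairs = itertools.combinations(range(len(x)), 2)
    return np.mean([(x[i] - x[j])**2 / 2 for i, j in pairs])

def V_stat(x):
    d = x[:, None] - x[None, :]      # all n^2 ordered pairs (i, j)
    return np.mean(d**2 / 2)         # diagonal terms K(x_i, x_i) = 0 included

x = np.random.default_rng(0).standard_normal(500)
# U_n equals the unbiased variance, and V_n = ((n-1)/n) U_n here,
# so U_n - V_n = O(1/n).
print(U_stat(x), V_stat(x), np.var(x, ddof=1))
```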
Some well-known results on U-Statistics for real-valued (q = 1) kernels will be quoted here without proof. For a more detailed discussion of U-Statistics see, for example, Chapter 5 of Serfling (1980).
We define the functions

    K_c(x₁, …, x_c) = E_F { K(x₁, …, x_c, X_{c+1}, …, X_m) },    1 ≤ c ≤ m − 1,
    K_m(x₁, …, x_m) = K(x₁, …, x_m),

as well as the numbers

(3.15)    ζ_c = Var_F { K_c(X₁, …, X_c) },    1 ≤ c ≤ m.

Lemma 3.6: For n ≥ m,

(i)    (m²/n) ζ₁ ≤ Var_F{U_n} ≤ (m/n) ζ_m,
(ii)   (n+1) Var_F{U_{n+1}} ≤ n Var_F{U_n},
(iii)  Var_F{U_n} = (m²/n) ζ₁ + O(n^{-2})    as n → ∞.  •
Lemma 3.7: Let r ≥ 2, and suppose that E_F{ |K(X₁, …, X_m)|^r } < ∞. Then

    E | U_n − γ |^r = O(n^{-r/2})    as n → ∞.  •
The next lemma shows the close connection between a U-Statistic and a V-Statistic when they are based on the same symmetric kernel K of order m.

Lemma 3.8: Let r be a positive integer and suppose that E_F{ |K(X_{i₁}, …, X_{i_m})|^r } < ∞ for all 1 ≤ i₁, …, i_m ≤ m. Then the statistics U_n and V_n defined by (3.13) and (3.14) satisfy

    E | U_n − V_n |^r = O(n^{-r}).  •
Lemma 3.9: If E_F{ |K(X_{i₁}, …, X_{i_m})|² } < ∞ for all 1 ≤ i₁, …, i_m ≤ m, then n^{1/2}(U_n − V_n) → 0 in probability, and hence n^{1/2}(U_n − γ) and n^{1/2}(V_n − γ) possess the same asymptotic distribution.  •

Lemma 3.10: (Strong Law of Large Numbers for V-Statistics) If E_F |K(X_{i₁}, …, X_{i_m})| < ∞ for all 1 ≤ i₁, …, i_m ≤ m, then V_n → γ a.s.  •

Lemma 3.11: (Asymptotic Normality of V-Statistics) If E_F{ K²(X₁, …, X_m) } < ∞ and ζ₁ > 0, then

    n^{1/2} ( V_n − γ ) →_d N(0, m² ζ₁).  •
Lemma 3.12: If E_F{ K²(X₁, …, X_m) } < ∞, then for each ε > 0 there is a constant C such that

    P( | U_n − γ | ≥ ε ) ≤ C / (n ε²)

for sufficiently large n.  •
3.5  Consistency of Generalized M-Estimators

Let us now consider a collection of real-valued kernels of order m, { K(x₁, …, x_m, θ) : θ ∈ Θ }, such that each K is symmetric in its x arguments for each fixed θ in a parameter space Θ. Let X₁, …, X_n be independent and identically distributed random variables with values in 𝒳 and having the common distribution F. A Generalized M-Estimator was defined in section 3.3 as a random variable θ̂_n that minimizes the expression

(3.16)    H_n(θ) = (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n K(X_{i₁}, …, X_{i_m}, θ)

over θ ∈ Θ. Strong consistency of θ̂_n, with convergence to some constant θ₀, is the subject of this section. The following lemma can be viewed as an extension of Lemma 2.3, to establish a form of uniform convergence which shall be important. The assumptions make use of the notion of equicontinuity, which was stated in section 2.1.

Lemma 3.13: Denote

(3.17)    H(θ) = E_F K(X₁, …, X_m, θ).
If we assume

(i)   E_F | K(X₁, …, X_m, θ) | < ∞ for all θ ∈ Θ,
(ii)  K(x₁, …, x_m, θ) is equicontinuous in θ for all x₁, …, x_m in their domain, and
(iii) the parameter space Θ is compact,

then

    sup_θ | H_n(θ) − H(θ) | → 0  a.s.
proof: Let ε > 0 be fixed. Since K is equicontinuous in θ for (x₁, …, x_m) ∈ 𝒳^m and the set Θ is compact, there exist θ₁, …, θ_Q ∈ Θ and open subsets U₁, …, U_Q ⊆ Θ such that

    ∪_{j=1}^Q U_j ⊇ Θ,    θ_j ∈ U_j,

and, for all θ ∈ U_j and all x₁, …, x_m,

    | K(x₁, …, x_m, θ) − K(x₁, …, x_m, θ_j) | < ε/3.

For any θ ∈ U_j, j = 1, …, Q, we have

    | H_n(θ) − H(θ) | ≤ | H_n(θ) − H_n(θ_j) | + | H_n(θ_j) − H(θ_j) | + | H(θ_j) − H(θ) |
        ≤ (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n | K(X_{i₁}, …, X_{i_m}, θ) − K(X_{i₁}, …, X_{i_m}, θ_j) |
          + | H_n(θ_j) − H(θ_j) | + ε/3
        ≤ 2ε/3 + | H_n(θ_j) − H(θ_j) |.

Now,

    sup_θ | H_n(θ) − H(θ) | ≤ max_{1≤j≤Q} sup_{θ∈U_j} | H_n(θ) − H(θ) |
        ≤ max_{1≤j≤Q} { 2ε/3 + | H_n(θ_j) − H(θ_j) | }
        = 2ε/3 + max_{1≤j≤Q} | H_n(θ_j) − H(θ_j) |.

For any δ > 0, by Lemma 3.10 of U-statistic theory, there exists an integer N such that, for j = 1, …, Q,

    P( | H_n(θ_j) − H(θ_j) | ≥ ε/3 for some n ≥ N ) < δ/Q.

Denoting by E_j the complement of the event within parentheses, we note that

    ∩_{j=1}^Q E_j ⊆ { max_{1≤j≤Q} | H_n(θ_j) − H(θ_j) | < ε/3 for all n ≥ N }
                 ⊆ { sup_θ | H_n(θ) − H(θ) | < ε for all n ≥ N }.

By Bonferroni's inequality,

    P( ∩_{j=1}^Q E_j ) ≥ 1 − Σ_{j=1}^Q P(E_j^c) ≥ 1 − δ,

and hence

    P( sup_θ | H_n(θ) − H(θ) | < ε for all n ≥ N ) ≥ 1 − δ,

completing the proof.  •
The following theorem establishes strong consistency of a Generalized M-Estimator θ̂_n. We note that condition (3.18) below is a characterization of the limit. Although (3.18) may appear manageable, it may be the case that it is actually hard to check in a given situation, particularly if θ is multi-dimensional.

Theorem 3.14: Let H(θ) be defined by (3.17). Assume that conditions (i)–(iii) of Lemma 3.13 hold, and that there exists θ₀ ∈ Θ such that

(3.18)    H(θ) > H(θ₀)    for all θ ≠ θ₀, θ ∈ Θ.

Then θ̂_n, defined by minimizing (3.16), satisfies θ̂_n → θ₀ a.s.

proof: For each ε > 0, let U_ε = { θ : | θ − θ₀ | < ε }. It is sufficient to prove

    P( there exists n₀ : θ̂_n ∈ U_ε for all n > n₀ ) = 1.
is a compact set,
inf
H(S)
8-Vt
is attained and by assumption
is
strictly greater
> 0.
greater than H(B o ) + 36 for some 6
and thus H(B) - 26
I Hn (8)
- H(S)I
- 6
~
~
n
> no.
inf H (B)
~
< H(B)
Hn(S)
> no
<6
H(B) + 6.
This implies
8-V n
t
inf H(B) - 26 and
8-V
65
t
say
By Lemma 3.13, with probabili-
ty one there exists an no such that for all n
sup
BE8
than H(B o ),
for all
B.
and all
inf H (0)
OEU n
S
c
and thus for all n
inf H(O) + 0 = H(Oe) + 0,
8EU
c
> no
~
inf H (8)
8Ee-U n
c
inf H(8) - 20
8Ee-U
c
Hence with probability one. 8
n
> H(8 e )
+ 0 ~
inf H (8).
8EU
n
c
E U
> no-
for all n
c
•
3.6  A Special Case: Minimum Distance Estimation of a Transformation

We return to the study of estimation of a transformation parameter by the method outlined in section 3.1. There we stated that the quantity

(3.19)    V_n²(λ) = ∫ { F_n(x,λ) + F_n(−x,λ) − 1 }² w(x) dx

is minimized over λ in order to obtain the estimate denoted by λ̂_CVM. The notation F_n(x,λ) stands for the empirical distribution function of the standardized residuals

    r_i*(λ) = { h(Y_i,λ) − h(f(X_i, β̂(λ)), λ) } / σ̂(λ),    1 ≤ i ≤ n.

We showed in section 3.2 that an alternate expression for V_n²(λ) is

    V_n²(λ) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n K(Z_i, Z_j, λ),

where Z_i = (X_i, Y_i), 1 ≤ i ≤ n, are independent and identically distributed observations, and furthermore that

    K(Z_i, Z_j, λ) = ∫ a_i(x,λ) a_j(x,λ) w(x) dx,

where a_i(x,λ) = 1( r_i*(λ) ≤ x ) + 1( r_i*(λ) ≤ −x ) − 1.
If the parameters β₀ and σ₀ are known, replacing β̂(λ) by β₀ and σ̂(λ) by σ₀ shows directly that λ̂_CVM is a generalized M-estimator with a kernel K of order m = 2. The topic in the current section is consistency of such an estimate of the transformation, under the structure of the transform-both-sides regression model.

To establish consistency using the result from Theorem 3.14, a relevant quantity to be examined closely is

(3.20)    V²(λ) = E K(Z₁, Z₂, λ).

For consistency to be plausible, a required condition is that there exists a value λ₀ such that V²(λ) > V²(λ₀) for all λ ≠ λ₀. We shall see, in one of the results to follow, that this requisite is a consequence of assuming the existence of a parameter value such that the transform-both-sides model holds, but it will have to be checked for each particular family of transformations that is considered. With respect to the other assumptions of Theorem 3.14, the fact that λ is restricted to lie in a compact set has already been assumed. The other crucial assumption in Theorem 3.14 is equicontinuity in λ of the function K(Z₁,Z₂,λ) for Z₁ and Z₂ in their domain; in Lemma 3.15 we check that equicontinuity holds.
Lemma 3.15: Let h(y,λ), the family of transformations in the TBS model (3.1), be such that it is continuously differentiable in λ for each fixed y (this is A8 of Assumptions 3.1). Then K(Z₁,Z₂,λ) is equicontinuous in λ for Z₁ and Z₂ in their domain.
proof: We can show (see the proof of Proposition 3.5) that K(Z₁,Z₂,λ) = g(r₁*(λ), r₂*(λ)), where the function g is the combination of integrals of w obtained there; equivalently,

    g(u,v) = −2Φ(u∨v) + Φ(−u∨v) + Φ(u∨−v) + Φ(u) + Φ(v).

This g(u,v) is a continuous function of (u,v), and for (u,v) on a compact support it is thus uniformly continuous. Then, for given ε > 0, there exists δ* such that

    | g(u,v) − g(u',v') | < ε    if | u − u' | + | v − v' | < δ*,

for all (u,u',v,v'). It will suffice to show that r*(λ) is equicontinuous in λ for Z in its compact domain, for then there is δ such that ||λ − λ'|| < δ implies

    | r₁*(λ) − r₁*(λ') | + | r₂*(λ) − r₂*(λ') | < δ*,

and thus ||λ − λ'|| < δ would imply

    | K(Z₁,Z₂,λ) − K(Z₁,Z₂,λ') | < ε,

which completes the proof of the lemma.

It thus remains to show that r(λ) is equicontinuous in λ. By the form of the model and by the definition of a residual,

    r(λ) = h(Y,λ) − h(f(X,β₀),λ).

Suppose λ is p-dimensional, λ = (λ₁, …, λ_p)'. Since the derivatives are assumed continuous on bounded supports, there exists a constant C such that | (∂/∂λ_j) r(λ) | < C for all j = 1, …, p. By the Mean Value Theorem for functions from ℝ^p to ℝ, for some t ∈ [0,1],

    r(λ) − r(λ') = Σ_{j=1}^p (∂/∂λ_j) r(λ*) (λ_j − λ'_j),    with λ* = tλ + (1−t)λ',

and thus

    | r(λ) − r(λ') | ≤ C Σ_{j=1}^p | λ_j − λ'_j |.

If ||λ − λ'|| ≤ ε/(Cp), then | r(λ) − r(λ') | ≤ ε, showing that r(λ) is equicontinuous in λ, as required.  •
We next proceed to study special cases of parameterized families of transformations { h(y,λ) : λ ∈ Λ }, and seek to verify assumption (3.18) in the case that β₀ and σ₀ are known. Recall that we have assumed that the transform-both-sides model (3.1) holds for some parameters in a compact subset, that the supports of X and ε are compact sets, and that the observations (X_i, Y_i), 1 ≤ i ≤ n, are independent. We also recall from equation (3.20) that

    V²(λ) = E ∫ a₁(x,λ) a₂(x,λ) w(x) dx,

where

    a_i(x,λ) = 1( r_i*(λ) ≤ x ) + 1( r_i*(λ) ≤ −x ) − 1,

and r_i*(λ) depends on the transformation family through (3.2) and (3.3).
Given that | r_i*(λ) | is bounded for all (λ, X, ε), we can write

    V²(λ) = ∫ { P( r₁*(λ) ≤ x ) + P( r₂*(λ) ≤ −x ) − 1 }² w(x) dx,

since the r_i*(λ) are independent and identically distributed. Define F(x,λ) = P( r₁*(λ) ≤ x ) = P( r(λ) ≤ σ₀x ), so that

(3.21)    V²(λ) = ∫ { F(x,λ) + F(−x,λ) − 1 }² w(x) dx.
Now let f₀ = f(X,β₀). We notice that if the TBS model with a family { h(y,λ) } holds, then for fixed X,

    F(x,λ) = P( r(λ) ≤ σ₀x )
           = P( h(Y,λ) − h(f₀,λ) ≤ σ₀x )
           = P( h( h^{-1}( h(f₀,λ₀) + σ₀ε, λ₀ ), λ ) − h(f₀,λ) ≤ σ₀x )
           = P( h^{-1}( h(f₀,λ₀) + σ₀ε, λ₀ ) ≤ h^{-1}( σ₀x + h(f₀,λ), λ ) )
           = P( ε ≤ { h( h^{-1}( σ₀x + h(f₀,λ), λ ), λ₀ ) − h(f₀,λ₀) } / σ₀ ).

Denote the right-hand side of the last inequality by φ_x(λ). Since X and ε are independent, we can write

(3.22)    F(x,λ) = E_X F_ε( φ_x(λ) ),

where F_ε is the distribution function of ε, with

(3.23)    φ_x(λ) = { h( h^{-1}( σ₀x + h(f₀,λ), λ ), λ₀ ) − h(f₀,λ₀) } / σ₀.

Notice, of course, F(x,λ₀) = F_ε(x), and furthermore, if F_ε is symmetric, then

(3.24)    F(x,λ) + F(−x,λ) − 1 = E_X { F_ε( φ_x(λ) ) − F_ε( −φ_{−x}(λ) ) }.
We introduce one last piece of notation to use in the proofs of the lemmas that follow. Let

(3.25)    ξ(X,x,λ) = F_ε( φ_x(λ) ) − F_ε( −φ_{−x}(λ) ),

so that (3.24), (3.25), and (3.21) imply

(3.26)    V²(λ) = ∫ { E_X ξ(X,x,λ) }² w(x) dx.

In the lemmas that ensue, the goal is to verify that there exists a value λ₀ such that V²(λ) > V²(λ₀) for all λ ≠ λ₀. We shall prove this assuming that the parameters β₀ and σ₀ are known; in this case we substitute σ₀ for σ̂(λ) and β₀ for β̂(λ) in (3.19) and (3.20). This is a first step towards the issue of consistency in the case in which these parameters are also estimated; the results will be summarized in a later section. Lemma 3.15 and Theorem 3.14 would then imply that the minimum distance estimator is consistent for the transformation parameter.
Lemma 3.16: Assume that the distribution F_ε(x) is symmetric and strictly increasing on the support [−M,M] of ε, and that the weight function w(x) is strictly positive. If for each fixed X, that is, for each fixed f₀ = f(X,β₀), the quantity φ_x(λ) defined by (3.23) is such that φ_x(λ) ≠ −φ_{−x}(λ) for all λ ≠ λ₀ and all x ≠ 0, then V²(λ) > V²(λ₀).

proof: First note that φ_x(λ₀) = x for all f₀, and hence, by (3.25) and (3.26), V²(λ₀) = 0. If F_ε is strictly increasing and φ_x(λ) satisfies the conditions of the lemma, then ξ(X,x,λ), in the notation defined by (3.25), is strictly non-zero. By (3.26), we conclude that V²(λ) > 0 if λ ≠ λ₀.  •
Lemma 3.17: Modified power transform, λ₀ ≠ 0. Let h(y,λ) be the modified power transform of Box and Cox, given by

(3.27)    h(y,λ) = (y^λ − 1)/λ    if λ ≠ 0,
                 = log(y)          if λ = 0.

Assume that the distribution F_ε(x) is symmetric and strictly increasing on the support [−M,M] of ε, and that λ₀ ≠ 0. Then V²(λ) > V²(λ₀) if λ ≠ λ₀.
-1
h
=
(U.A)
(AU
{
+ 1)
11A
exp(u)
if
~ ~
if
~
0
= O.
Consider two cases for A.
Case A.
A ~ O.
The quantity
(3.28)
~x (A)
~ (~)
x
in (3.23) is in this case
= { [~(oox
+
[fo~ - 1] /~) + 1]~01A - 1} /OoA o
- [fo~O - 1] /OoAo
A
= ([f o +
=(
Fix A
~
[fo~
AOoX]~olA - 1) / OoAo
+ AOoX]AolA - f oAo ) /
[fo~O - 1] / oo~o
oo~o.
Ao and consider the function g defined by
g(u)
= u~IA
0
•
Since
g'(u)
g"(u)
we observe that i f A
strictly convex.
> Ao
= (AolA)
~ lA-I
u 0
and
= (A olA)(A olA-1)
or A
< Ao,
g
u
A0 1A-2 •
is ei ther strictly concave or
Thus
[foA + AOoX]AolA _ (foA)AolA
72
is either strictly greater or strictly smaller than
(foA)Ao/A _ [faA - AaoXJAo/A
for all x "# O.
~
Therefore
x
(A) "#
-~
-x
(A). and the conclusion follows
from Lemma 3.16.
Case B.
A
= O.
In this case. (3.23) yields
~
(3.29)
x (A)
A
= {[exp(aox
+ 10g(fo)J a -
I} /aoA o
- [fa Ao - 1J / aoA o
= {exp(aoAox
= {exp(aoAox
Consider
A
+ log(f a 0)) - fa AO} / aoA o
A
A
+ log{f a 0)) - exp(log(f o a))} / aoA o ·
the function g defined by g(u)
= exp(u).
Since g is
strictl"y convex, we conclude
is either strictly larger or strictly smaller than
exp(log(f oAO )) - exp(-aoAox + 10g(f oAO ))
for all x #- O.
Therefore
~
x
(A) #-
-~
-x
(A).
and the conclusion also
follows from Lemma 3.16.
•
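A quick numerical spot-check of this convexity argument can be made as follows; the parameter values below are arbitrary illustrative choices of mine.

```python
# Spot-check of the key step in Lemma 3.17: for lam != lam0 the map
# u -> u**(lam0/lam) is strictly convex or concave, so that
# phi_x(lam) != -phi_{-x}(lam) for x != 0.
f0, lam0, s0 = 2.0, 0.5, 0.3       # arbitrary f0 = f(X, beta0), lam0, sigma0

def phi(x, lam):
    # (3.28): phi_x(lam) = ([f0^lam + lam*s0*x]^(lam0/lam) - f0^lam0)/(s0*lam0)
    return ((f0**lam + lam * s0 * x)**(lam0 / lam) - f0**lam0) / (s0 * lam0)

x, lam = 0.7, 1.5
print(phi(x, lam), -phi(-x, lam))   # unequal, as the lemma requires
print(phi(x, lam0), x)              # phi_x(lam0) = x, used in Lemma 3.16
```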
Lemma 3.18: Log transformation. Let h(y,λ) be the modified power transform of Box and Cox given by (3.27). Assume that the distribution F_ε(x) is symmetric and strictly increasing on the support [−M,M] of ε, and that λ₀ = 0, so that the true transformation is the log transform. Then V²(λ) > V²(λ₀) if λ ≠ λ₀.

proof: Fixing λ ≠ 0, h^{-1}(u,λ) = (λu + 1)^{1/λ}, and by (3.23) we obtain

(3.30)    φ_x(λ) = log[ ( λ(σ₀x + (f₀^λ − 1)/λ) + 1 )^{1/λ} ] / σ₀ − log(f₀)/σ₀
                 = { log[ (λσ₀x + f₀^λ)^{1/λ} ] − log(f₀) } / σ₀
                 = { log( λσ₀x + f₀^λ ) − log( f₀^λ ) } / (λσ₀).

Since the function log(u) is concave, we conclude that

    log( λσ₀x + f₀^λ ) − log( f₀^λ )

is either strictly larger or strictly smaller than

    log( f₀^λ ) − log( −λσ₀x + f₀^λ )

for all x ≠ 0. Thus φ_x(λ) ≠ −φ_{−x}(λ), and the conclusion follows from Lemma 3.16.  •
Lemma 3.19: Estimating the shift for fixed power, λ₀ ≠ 0. Let h(y,μ) be the modified shifted power transform of Box and Cox with fixed power λ₀, given by

(3.31)    h(y,μ) = ( (y − μ)^{λ₀} − 1 ) / λ₀    if λ₀ ≠ 0,
                 = log(y − μ)                   if λ₀ = 0.

Assume that the distribution F_ε(x) is symmetric and strictly increasing on the support [−M,M] of ε, and that λ₀ ≠ 0. Then V²(μ) > V²(μ₀) if μ ≠ μ₀.

proof: In this case, h^{-1}(u,μ) = μ + (λ₀u + 1)^{1/λ₀}. Equation (3.23) yields

(3.32)    φ_x(μ) = ( { μ + ( λ₀[σ₀x + ((f₀ − μ)^{λ₀} − 1)/λ₀] + 1 )^{1/λ₀} − μ₀ }^{λ₀} − 1 ) / (λ₀σ₀)
                   − ( (f₀ − μ₀)^{λ₀} − 1 ) / (λ₀σ₀)
                 = [ { μ − μ₀ + ( λ₀σ₀x + (f₀ − μ)^{λ₀} )^{1/λ₀} }^{λ₀} − (f₀ − μ₀)^{λ₀} ] / (λ₀σ₀)
                 = (λ₀σ₀)^{-1} [ { μ − μ₀ + ( λ₀σ₀x + (f₀ − μ)^{λ₀} )^{1/λ₀} }^{λ₀}
                                 − { μ − μ₀ + ( (f₀ − μ)^{λ₀} )^{1/λ₀} }^{λ₀} ].

Now consider the function g defined by

    g(u) = { μ − μ₀ + u^{1/λ₀} }^{λ₀}.

We find

    g'(u) = { μ − μ₀ + u^{1/λ₀} }^{λ₀−1} u^{1/λ₀−1}

and

    g''(u) = (1 − 1/λ₀) { μ − μ₀ + u^{1/λ₀} }^{λ₀−2} u^{1/λ₀−2} [ u^{1/λ₀} − ( μ − μ₀ + u^{1/λ₀} ) ]
           = (1 − 1/λ₀)( μ₀ − μ ) { μ − μ₀ + u^{1/λ₀} }^{λ₀−2} u^{1/λ₀−2}.

The sign of g''(u) is constant for all u > 0 and changes as μ < μ₀ or μ > μ₀. Furthermore, g''(u) ≠ 0 if μ ≠ μ₀. Therefore, if μ ≠ μ₀, the function g is either strictly concave or strictly convex in u. We then conclude that, if x ≠ 0,

    g( (f₀ − μ)^{λ₀} + λ₀σ₀x ) − g( (f₀ − μ)^{λ₀} )

is either strictly larger or strictly smaller than

    g( (f₀ − μ)^{λ₀} ) − g( (f₀ − μ)^{λ₀} − λ₀σ₀x ),

and hence φ_x(μ) ≠ −φ_{−x}(μ) if μ ≠ μ₀. The conclusion now follows from Lemma 3.16.  •
Lemma 3.20: Shifted log transform. Let h(y,μ) be the modified shifted power transform of Box and Cox with fixed power λ₀, given by (3.31), and suppose that λ₀ = 0. Assume that the distribution F_ε(x) is symmetric and strictly increasing on the support [−M,M] of ε. Then V²(μ) > V²(μ₀) if μ ≠ μ₀.

proof: We check that, in this case, h^{-1}(u,μ) = μ + exp(u). Equation (3.23) yields

(3.33)    φ_x(μ) = log{ μ + exp[σ₀x + log(f₀ − μ)] − μ₀ } / σ₀ − log(f₀ − μ₀) / σ₀
                 = [ log{ μ − μ₀ + exp[σ₀x + log(f₀ − μ)] } − log(f₀ − μ₀) ] / σ₀
                 = [ log{ μ − μ₀ + exp[σ₀x + log(f₀ − μ)] }
                     − log{ μ − μ₀ + exp(log(f₀ − μ)) } ] / σ₀.

If we now consider the function g defined by

    g(u) = log{ μ − μ₀ + exp(u) },

we find

    g'(u) = exp(u) { μ − μ₀ + exp(u) }^{-1}

and

    g''(u) = exp(u) { μ − μ₀ + exp(u) }^{-1} − exp(u) { μ − μ₀ + exp(u) }^{-2} exp(u)
           = exp(u) ( μ − μ₀ ) { μ − μ₀ + exp(u) }^{-2}.

We note that the sign of g''(u) is constant for all u, changes sign as μ < μ₀ or μ > μ₀, and also that g''(u) ≠ 0 if μ ≠ μ₀. Therefore, if μ ≠ μ₀, the function g is either strictly concave or strictly convex in u. We then conclude that, if x ≠ 0,

    log{ μ − μ₀ + exp[σ₀x + log(f₀ − μ)] } − log{ μ − μ₀ + exp(log(f₀ − μ)) }

is either strictly larger or strictly smaller than

    log{ μ − μ₀ + exp(log(f₀ − μ)) } − log{ μ − μ₀ + exp[−σ₀x + log(f₀ − μ)] },

and hence φ_x(μ) ≠ −φ_{−x}(μ) if μ ≠ μ₀. An application of Lemma 3.16 completes the proof.  •
We may summarize the results of this section in the following theorem.

Theorem 3.21: Assume conditions A1–A7 of Assumptions 3.1, and that the distribution F_ε(x) is strictly increasing on the support [−M,M] of ε. If β₀ and σ₀ are known, and the transformation h is either the modified power transformation (3.27) or the modified fixed-power transformation with shift (3.31), then the minimum distance estimator of the transformation parameter is strongly consistent.

proof: Follows from Theorem 3.14 and Lemmas 3.15 through 3.20.  •
3.7  Discussion

We showed that minimum distance estimation yields consistent estimates of the transformation parameter when β₀ and σ₀ are known. This was achieved under nonparametric assumptions on the error distribution, in the sense that we only assumed symmetry. If β₀ and σ₀ are unknown and are estimated by least squares for fixed λ, consistency is a different matter. However, consistency in the (β₀,σ₀)-known case can be exploited to establish consistency in the more general case. We now sketch a procedure to obtain a consistent estimate of (λ,β,σ), which is similar to the method explained in Theorem 2.13 for the Ruppert–Aldershof transformation to symmetry. We let Z_i = (X_i, Y_i), i = 1, …, n, denote the sequence of observations, and suppose that λ and β are t- and r-dimensional, respectively.
1. Estimate β using the Least Absolute Deviation estimator (Definition 2.10); that is, β̂_LAD minimizes

(3.34)    Σ_{i=1}^n | Y_i − f(X_i,β) |

as a function of β. We know (section 2.3) that β̂_LAD → β₀ if the observations follow the TBS model with errors that have a unique median at zero, regardless of the unknown value of λ₀.
Estimate X by minimizing
(3.35)
y2(X) =
n
I{Fn (x.X)
+
Fn (-x.X) -1 }2 w(x) dx
where
n
F (x,X) = (1/n) }; 1(
i=1
n
and denote this estimate by X
.
LAD
r(X'~LAD,Zi) ~
x )
By using arguments similar to the
ones in Theorem 2.13. as well as Lemmas 3.16 through 3.20. we would
obtain
XLAD ~ Xo
whenever h is the modified power transformation or
the modified fixed-power transformation with shift.
78
3. Define the estimate of σ by

(3.36)    σ̂²_LAD = (1/n) Σ_{i=1}^n r²(λ̂_LAD, β̂_LAD, Z_i),

and put θ̂_LAD = (β̂_LAD, λ̂_LAD, σ̂_LAD). Using θ̂_LAD as a starting point, now let θ̂*_n be a one-step Newton–Raphson approximation (section 2.3) to the solution of the following system of equations:

(3.37)    (1/n²) Σ_{i=1}^n Σ_{j=1}^n ∫ (∂/∂λ_k)[ a_i(x,λ) a_j(x,λ) ] w(x) dx = 0,    1 ≤ k ≤ t,

          Σ_{i=1}^n Σ_{j=1}^n { r(λ,β,Z_i) (∂/∂β_k) r(λ,β,Z_i) + r(λ,β,Z_j) (∂/∂β_k) r(λ,β,Z_j) } = 0,    1 ≤ k ≤ r,

          Σ_{i=1}^n Σ_{j=1}^n [ (1/2){ r²(λ,β,Z_i) + r²(λ,β,Z_j) } − σ² ] = 0,

where

    a_i(x,λ) = 1( r(λ,β,Z_i) ≤ σx ) + 1( r(λ,β,Z_i) ≤ −σx ) − 1.

By definition, the estimate θ̂_CVM solves system (3.37), but we are once again confronted with the possibility that multiple solutions exist. The algorithm just described serves as a means of assuring that a consistent solution is chosen. Since θ̂_LAD → θ₀ a.s. at an appropriate rate, we could then prove θ̂*_n → θ₀ using similar techniques as were applied in the proof of Theorem 2.13; Theorem 3.21 is the foundation for this last result. Additionally, we conclude that n^{1/2}(θ̂_CVM − θ₀) and n^{1/2}(θ̂*_n − θ₀) have the same asymptotic distribution, which will be derived in Chapter IV; therefore, θ̂_CVM and θ̂*_n are asymptotically equivalent for estimation purposes.
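The sketch below renders steps 1 and 2 schematically for one simple instance; the mean function f(x,β) = βx, the weight w = 1 on [−3,3], and the grid minimization are illustrative assumptions of mine, and the one-step Newton–Raphson refinement of (3.37) in step 3 is only indicated.

```python
# Schematic sketch of the LAD-start procedure above (illustration only).
import numpy as np
from scipy.optimize import minimize_scalar

def h(y, lam):
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def fit_tbs_lad_cvm(X, Y, lam_grid, x_grid=np.linspace(-3, 3, 301)):
    # Step 1: LAD estimate of beta, per (3.34); consistent whatever lam0 is
    lad = lambda b: np.sum(np.abs(Y - b * X))
    beta_lad = minimize_scalar(lad, bounds=(0.1, 10.0), method="bounded").x

    # Step 2: minimize V_n^2(lam) of (3.35), with beta fixed at beta_lad
    def V2n(lam):
        r = h(Y, lam) - h(beta_lad * X, lam)     # unstandardized residuals
        Fn = lambda x: np.mean(r[:, None] <= x, axis=0)
        Qn = Fn(x_grid) + Fn(-x_grid) - 1.0
        return np.trapz(Qn**2, x_grid)
    lam_lad = lam_grid[np.argmin([V2n(l) for l in lam_grid])]

    # Step 3 (not shown): sigma-hat from (3.36), then one Newton-Raphson
    # step on system (3.37) starting from (beta_lad, lam_lad, sigma_lad).
    sigma2 = np.mean((h(Y, lam_lad) - h(beta_lad * X, lam_lad))**2)
    return beta_lad, lam_lad, np.sqrt(sigma2)
```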
In Lemmas 3.19 and 3.20 we considered the transformation which applies a shift before taking a power. This transformation is a restricted case of a more general family of transformations, known as the family of modified power transformations with shift, which is defined by

(3.38)    h(y,μ,λ) = [ (y − μ)^λ − 1 ] / λ    if λ ≠ 0,
                   = log(y − μ)              if λ = 0.

It is interesting to consider this two-parameter family of transformations because it may accommodate observations Y_i that are not necessarily positive, as we had to assume when we considered the Box–Cox transform. By an appropriate choice of μ, all Y_i − μ can be made positive. Adding a positive constant to all the Y_i's would be desirable, for instance, if one of them is zero and we wished to include the possibility of applying the log transform. Carroll and Ruppert (1988) study an example (Example 6.2) using the modified power transformation (3.27) where some of the Y measurements were zero, and inference on λ is performed after adding an arbitrary fixed constant to all observations. They found that estimates λ̂ of the transformation parameter for fixed shifts are very sensitive to the choice of μ. This observation illustrates that the capability of using the data to estimate μ is essential.
Atkinson (1985) discusses the shift in the modified power transformation in detail. In particular, he shows that the maximum likelihood estimate of (μ,λ) under normal errors fails to be consistent. The problem lies in the fact that the likelihood function is unbounded; more precisely, the likelihood tends to infinity when μ tends to the minimum of the Y_i's.
If the power parameter in (3.38) is fixed at λ₀, the problem reduces to estimation of the shift. In practical terms, this situation is interesting because it is quite common that a researcher wishes to apply a specified transformation, for example the log transformation if the data seem to be positively skewed with variance increasing with the mean. However, it may be the case that some observations are zero or even negative, and adding a constant to the observations is required before taking logs. Since this constant is an unknown parameter, it would be an advantage to have the capability of estimating its value. Lemmas 3.19 and 3.20 show that minimum distance estimation is consistent in this fixed-λ₀ case, whereas methods such as maximum likelihood still exhibit the difficulties mentioned above.
The minimum distance estimator of a transformation also has the potential to consistently estimate both of the parameters of a power transformation with shift simultaneously. Of course, the computation of the minimum distance estimator would involve solving (3.6) in two-dimensional (μ,λ) space, and this introduces the need for numerical optimization methods in higher dimensions that do not rely on differentiability. However, it turns out that numerical procedures are not the main problem here. Rather, the complications are statistical in nature: jointly estimating (μ,λ) in (3.38) is not a task that can be accomplished efficiently by symmetry considerations alone. More shall be said about this problem in Chapter IV, after we derive the asymptotic covariance of the minimum distance estimate.
CHAPTER IV

ASYMPTOTIC NORMALITY OF THE MINIMUM DISTANCE ESTIMATOR

4.0  Introduction

The goal in this chapter is to establish asymptotic normality of the minimum distance estimator of transformation parameters which was described in Chapter III. We saw that the proposed estimator is a special case of generalized M-estimation, minimizing the expression

(4.1)    (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n K(X_{i₁}, …, X_{i_m}, θ)

over the parameter space Θ, thus giving rise to the generalized M-estimator θ̂_n. We wish to investigate conditions under which a sequence of estimates θ̂_n that minimizes (4.1), and that satisfies convergence in probability to some fixed θ₀ ∈ Θ, will also satisfy

    n^{1/2} ( θ̂_n − θ₀ ) →_d N(0, V)

for some (positive definite) matrix V.
Application of the result to the special case of estimating the transformation parameters in the transform-both-sides regression setting will directly yield asymptotic normality of the minimum distance estimator of section 3.1. Of course, what we achieve with this conclusion is the capability of obtaining large-sample confidence sets and approximate tests of hypotheses with respect to the parameter θ = (λ,β,σ). Testing hypotheses about transformation parameters is of special interest because tests can be performed to assess the validity of a postulated model. For instance, if the family of transformations is h(y,λ) = (y^λ − 1)/λ, then testing the null hypothesis H₀ : λ = 1 amounts to testing for additivity of the error in the regression model. More general hypotheses, say linear combinations of the elements of θ, are handled by any one of the standard asymptotic tests. Examples of these asymptotic methods are the Wald-type confidence sets and tests, and the Rao score tests.
Assuming that consistency holds, we proceed in section 4.1 with the theory of asymptotic normality for generalized M-estimators in the general setting, and then return to the particular case of estimating a transformation parameter in section 4.2. In section 4.3 we study a special case of a regression model which was selected to serve as an example for further examination; we carry out actual computations for the matrix V assuming the single-sample model f(X,β₀) = β₀ for all X, that is,

(4.2)    h(Y_i,λ) = h(β₀,λ) + σ₀ε_i.

This concrete case will be enlightening for two main reasons. First, some comparisons with other estimation procedures can be accomplished, for instance with maximum likelihood, the Taylor estimates, or the Ruppert–Aldershof estimator. Secondly, we will be able to assess the effect of the weighting function w(x) used in the computation of the minimum distance estimate. It is important to investigate how the asymptotic variance depends on the particular choice of w(x) for different error distributions. Understanding the effect of w(x), we could then give some guidelines for the choice of w(x) in order to obtain an efficient estimator θ̂_n.
4.1  Asymptotic Normality of Generalized M-Estimators

We consider the setting of generalized M-estimation, which was introduced in section 3.3: let X₁, X₂, …, X_n be independent and identically distributed random variables on a probability space (Ω, 𝒜, P), and let Θ ⊆ ℝ^p be a parameter space. We assume there is a function ψ : 𝒳^m × Θ → ℝ^p such that ψ(x₁, …, x_m, θ) is symmetric in its x arguments for each fixed θ ∈ Θ, and that a sequence { θ̂_n } of random variables satisfies

(4.3)    (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n ψ(X_{i₁}, …, X_{i_m}, θ̂_n) →_p 0.

It is possible that each θ̂_n is obtained by minimizing the expression

    (1/n^m) Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n K(X_{i₁}, …, X_{i_m}, θ)

over Θ for some function K : 𝒳^m × Θ → ℝ, so that θ̂_n satisfies (4.3) with
ψ = (∂/∂θ) K. We present the proof of asymptotic normality after several preliminary results. The first of these extends the asymptotic normality of U-statistics (Lemma 3.11) to kernels that have values in ℝ^k, k > 1.

Lemma 4.1: Let K : 𝒳^m → ℝ^k be a symmetric kernel and define

    U_n = (n choose m)^{-1} Σ_C K(X_{i₁}, …, X_{i_m}),    γ = E K(X₁, …, X_m),
    h₁(x) = E[ K(x, X₂, …, X_m) ],    Γ₁ = Cov[ h₁(X₁) ].

Then, if E | K(X₁, …, X_m) |² < ∞ and Γ₁ is positive definite,

    n^{1/2} ( U_n − γ ) →_d N_k(0, m² Γ₁).

proof: Using the Cramér–Wold device, it is sufficient to prove that n^{1/2}(a'U_n − a'γ) →_d N₁(0, m² a'Γ₁ a) for every a ∈ ℝ^k. Therefore fix a ∈ ℝ^k and define

    K*(X₁, …, X_m) = a' K(X₁, …, X_m),
    γ* = E[ K*(X₁, …, X_m) ],    U_n* = a' U_n,
    h₁*(x) = E[ K*(x, X₂, …, X_m) ],    ζ₁* = Var[ h₁*(X₁) ].

Then notice that h₁*(x) = a'h₁(x), ζ₁* = a'Γ₁ a, and γ* = a'γ. Now E | K*(X₁, …, X_m) |² < ∞, and since Γ₁ is positive definite, ζ₁* > 0, so by Lemma 3.11 we have

    n^{1/2} ( U_n* − γ* ) →_d N(0, m² ζ₁*),

that is, n^{1/2}(a'U_n − a'γ) →_d N(0, m² a'Γ₁ a).  •
The following two results extend the work of Huber (1967) on the asymptotic normality of M-estimators. We put x = (x₁, …, x_m) and define

    u(x₁, …, x_m, θ, d) = u(x,θ,d) = sup_{|τ−θ|≤d} | ψ(x,τ) − ψ(x,θ) |

and

    Λ(θ) = E ψ(x,θ) = ∫⋯∫ ψ(x₁, …, x_m, θ) dP(x₁) ⋯ dP(x_m).

We notice that u(x₁, …, x_m, θ, d) is symmetric in x₁, …, x_m for each fixed (θ,d) if ψ is symmetric. We also make the following assumptions.

Assumptions 4.2:

A1. For each fixed θ ∈ Θ ⊆ ℝ^p, ψ(x,θ) is measurable.
A2. There is a θ₀ ∈ Θ such that Λ(θ₀) = 0.
A3. There are strictly positive numbers a, b, c, d₀ such that
    (i)   | Λ(θ) | ≥ a | θ − θ₀ |    for | θ − θ₀ | ≤ d₀,
    (ii)  E[ u(X,θ,d) ] ≤ b·d       for | θ − θ₀ | + d ≤ d₀,
    (iii) E[ u(X,θ,d)² ] ≤ c·d      for | θ − θ₀ | + d ≤ d₀.
A4. E | ψ(x,θ₀) |² < ∞.  •
Lemma 4.3: Let

    Z_n(τ,θ) = | n^{-m} Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n { ψ(X_{i₁}, …, X_{i_m}, τ) − ψ(X_{i₁}, …, X_{i_m}, θ) } − Λ(τ) + Λ(θ) |
               / ( | τ − θ | + n^{-1/2} ).

Assumptions A1, A2, and A3 imply

    sup_{|τ−θ₀|≤d₀} Z_n(τ,θ₀) →_p 0    as n → ∞.

proof: Huber's rigorous proof of Lemma 3 in his 1967 paper pushes through; only the major changes required will be mentioned here. Basically, a sum Σ_{i=1}^n ψ(x_i, θ) is replaced by the sum Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n ψ(X_{i₁}, …, X_{i_m}, θ), and wherever a reference to Chebyshev's inequality is encountered, Lemma 3.12 is used instead.  •
Lemma 4.4: Assume that Assumptions 4.2 hold, and that θ̂_n satisfies

(4.4)    n^{1/2−m} Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n ψ(X_{i₁}, …, X_{i_m}, θ̂_n) →_p 0

and

(4.5)    θ̂_n →_p θ₀.

Then

    n^{1/2−m} Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n ψ(X_{i₁}, …, X_{i_m}, θ₀) + n^{1/2} Λ(θ̂_n) →_p 0.

proof: The reader is again referred to Huber's 1967 paper and the proof of his Theorem 3 therein. The current proof is supported by Lemma 4.3, and similar changes are required to make the former proof go through.  •
The following theorem is the main result of this chapter:

Theorem 4.5: Suppose Assumptions 4.2 hold, and that { θ̂_n } is a sequence of random variables that satisfies (4.4) and (4.5) for some θ₀ ∈ Θ. Also assume that the derivative Λ̇(θ) = (∂/∂θ) Λ(θ) exists and is nonsingular when evaluated at θ₀. Let

    A = Λ̇(θ₀),    B = Cov( ψ₁(X₁) ),    where ψ₁(x) = E[ ψ(x, X₂, …, X_m, θ₀) ].

Then

    n^{1/2} ( θ̂_n − θ₀ ) →_d N(0, m² A^{-1} B A^{-T}).

proof: Using the assumption that Λ(θ₀) = 0, the definition of a derivative yields Λ(θ̂_n) = A(θ̂_n − θ₀) + o(|θ̂_n − θ₀|), and thus

    n^{1/2} A (θ̂_n − θ₀) = n^{1/2} Λ(θ̂_n) + o_p( n^{1/2} | θ̂_n − θ₀ | )
        = − n^{1/2−m} Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n ψ(X_{i₁}, …, X_{i_m}, θ₀)
          + [ n^{1/2} Λ(θ̂_n) + n^{1/2−m} Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n ψ(X_{i₁}, …, X_{i_m}, θ₀) ]
          + o_p( n^{1/2} | θ̂_n − θ₀ | ).

By Lemma 4.4, the term in square brackets converges to zero in probability, and by Lemmas 4.1 and 3.9,

    n^{1/2−m} Σ_{i₁=1}^n ⋯ Σ_{i_m=1}^n ψ(X_{i₁}, …, X_{i_m}, θ₀) →_d N(0, m² B).

Thus, by Slutsky's theorem, we obtain

    n^{1/2} ( θ̂_n − θ₀ ) →_d N(0, m² A^{-1} B A^{-T}).  •
4.2  A Special Case: Minimum Distance Estimation of a Transformation

We turn to the minimum distance estimator described in Definition 3.3. We restate the definition here, and then show explicitly how the minimum distance estimator is viewed as a generalized M-estimator. We observe (X₁,Y₁), …, (X_n,Y_n) under the assumption that the transform-both-sides regression model holds:

    h(Y,λ) = h(f(X,β),λ) + σε.

Notation 4.6:

N1. Z = (X,Y), Z_i = (X_i,Y_i).
N2. r(λ,β,Z) = h(Y,λ) − h(f(X,β),λ).
N3. β̂(λ) is the least-squares estimator for fixed λ; that is, β̂(λ) minimizes (1/n) Σ_{i=1}^n {r(λ,β,Z_i)}² as a function of β.
N4. σ̂²(λ) = (1/n) Σ_{i=1}^n r²(λ, β̂(λ), Z_i).
N5. r_i*(λ) = r(λ, β̂(λ), Z_i) / σ̂(λ).
N6. F_n(x,λ) = (1/n) Σ_{i=1}^n 1( r_i*(λ) ≤ x ).
N7. Q_n(x,λ) = F_n(x,λ) + F_n(−x,λ) − 1.
N8. V_n²(λ) = ∫ Q_n²(x,λ) w(x) dx.  •
Definition 4.7: The Minimum Distance Estimator λ̂_CVM was defined by minimizing the quantity V_n²(λ) over λ:

(4.6)    inf_{λ∈Λ} V_n²(λ) = V_n²(λ̂_CVM).

This also defined the following estimators of β and σ:

    β̂_CVM = β̂(λ̂_CVM),    σ̂_CVM = σ̂(λ̂_CVM).  •

Letting θ = (λ,β,σ) be the parameter, we wish to show that the minimum distance estimator θ̂_CVM thus defined is truly a generalized M-estimator in the sense of (4.3). We assume that λ is t-dimensional and β is r-dimensional, and we record the notation

    a_i(x,λ) = 1( r_i*(λ) ≤ x ) + 1( r_i*(λ) ≤ −x ) − 1,    1 ≤ i ≤ n.
~
1
Now.
~ n.
i
in order to observe how 8
is a Generalized M-Estimator
CVM
even when (f3.a)
is also estimated using least-squares. we note that
8
is a solution to the following set of equations:
CVM
n
};
(4.7)
j=1
2
n
(l/n ) };
i=1
f
B"\B
A
};n
a. (x.'A) a .(x.'A) w(x) dx
k
J
1
=0
1
~
k
~
1
~
k
~ r
t
r('A.f3.Z.) Bf3B r('A.f3.Z.) +
{
j=1
k
1
1
B
r('A.f3.Z.) Bf3 r('A,f3.Z.)
J
k
J
}= 0
n
2 2 2
}; [(1/2){ r ('A,f3.Z.) + r ('A.f3,Z.) } - a ]
1
J
j=1
= O.
We do note that all the summands are symmetric functions of Z. and
1
Z..
J
The system of equations (4.7) defines the function ~ : ~2x8 ~ ffiP
such that the estimator 8
satisfies
CVM
n
};~(Z .. Z ..
j=1
1
8CVM) =0.
J
If consistency of θ̂_CVM is assumed, then in order to achieve asymptotic normality the conditions of Theorem 4.5 must be verified. The stated conclusion would then be

    n^{1/2} ( θ̂_CVM − θ₀ ) →_d N(0, 4 A^{-1} B A^{-T})

for specified matrices A and B, where θ₀ is such that E ψ(Z₁,Z₂,θ₀) = 0. The situation in which we are mainly interested is the transform-both-sides regression setting. We check, in the following lemma, that Assumptions 4.2 are verified in this situation.
Lemma 4.8: If Z_i = (X_i,Y_i), i = 1, …, n, are independent and identically distributed observations following the TBS model, the supports of X and ε are compact, the errors ε are symmetric, the random variables X and ε are independent, and ψ is defined by equations (4.7), then Assumptions 4.2 are satisfied.

proof: Measurability of ψ follows from continuity. We next wish to show that there exists a θ₀ such that E[ψ(Z₁,Z₂,θ₀)] = 0. There is a parameter value θ₀ for which the TBS model holds with symmetric errors. Also, the model implies that Y can be expressed in the form Y = g(X,ε) for some function g. Thus Z = (X,Y) may be written Z = (X, g(X,ε)), where the random variables X and ε have been assumed independent of each other. Since r(λ,β,Z) = h(Y,λ) − h(f(X,β),λ), the derivative (∂/∂β_k) r(λ,β,Z) is independent of Y, and thus it does not depend on ε. Therefore

(4.8)    E[ r(λ₀,β₀,Z) (∂/∂β_k) r(λ₀,β₀,Z) ] = σ₀ E[ε] E[ (∂/∂β_k) r(λ₀,β₀,Z) ] = 0,

and since Eε² = 1,

(4.9)    E[ r²(λ₀,β₀,Z) − σ₀² ] = E[ σ₀²ε² − σ₀² ] = 0.

Now let F(x,θ) = P( r(λ,β,Z) ≤ σx ). The function ∫ a₁(x,λ) a₂(x,λ) w(x) dx is Lipschitz continuous as a function of λ, so the following interchange of expectation and differentiation is allowed:

    E[ (∂/∂λ_k) ∫ a₁(x,λ) a₂(x,λ) w(x) dx ] = (∂/∂λ_k) ∫ E[ a₁(x,λ) a₂(x,λ) ] w(x) dx
        = (∂/∂λ_k) ∫ { F(x,θ) + F(−x,θ) − 1 }² w(x) dx
        = ∫ 2 { F(x,θ) + F(−x,θ) − 1 } (∂/∂λ_k){ F(x,θ) + F(−x,θ) − 1 } w(x) dx.

Since F(x,θ₀) = F_ε(x) and F_ε is a symmetric distribution, the above quantity is zero when evaluated at θ = θ₀, for all k = 1, …, t. Thus, by (4.8) and (4.9), assumption A2 holds.

Assumptions A3 and A4 follow from the compact supports that were assumed for X and ε; since ψ(Z₁,Z₂,θ) is bounded, u(x,θ,d) is bounded.  •
Theorem 4.5 and Lemma 4.8 yield the following result:

Theorem 4.9: (Asymptotic normality of the minimum distance estimator) Under the assumptions of Lemma 4.8, if θ̂_CVM →_p θ₀, then

    n^{1/2} ( θ̂_CVM − θ₀ ) →_d N(0, 4 A^{-1} B A^{-T}),

where A = Λ̇(θ₀) and B = Cov( E[ ψ(Z₁,Z₂,θ₀) | Z₁ ] ).  •
The next theorem shows that the nuisance parameter σ can be disregarded, since it does not have an effect on the asymptotic covariance of (λ̂, β̂) when compared to the covariance that results from assuming that σ was known and did not have to be simultaneously estimated. This enables us to concentrate only on the covariance of (λ̂, β̂) when performing actual computations. For this reason, we will hardly refer to minimum distance estimates of σ in the remaining part of this work. The absence of σ estimates will be particularly noticeable in Chapter V.
Lemma 4.10: (Ruppert and Aldershof (1989)) Let A and B be q×q matrices partitioned as

    A = [ A₁₁   0  ]        B = [ B₁₁  B₁₂ ]
        [ A₂₁  A₂₂ ],           [ B₂₁  B₂₂ ],

where A₁₁ and B₁₁ are q₁×q₁, and A₂₂ and B₂₂ are q₂×q₂. Then the q₁×q₁ upper-left corner of A^{-1} B A^{-T} is A₁₁^{-1} B₁₁ A₁₁^{-T}.

proof: Matrix algebra, using

    A^{-1} = [ A₁₁^{-1}                  0        ]
             [ −A₂₂^{-1} A₂₁ A₁₁^{-1}   A₂₂^{-1} ].  •
Theorem 4.11: Let (λ̂*_CVM, β̂*_CVM) be the minimum distance estimate of the parameter (λ,β) assuming σ₀ is known and does not need to be estimated, and let (λ̂_CVM, β̂_CVM, σ̂_CVM) be the minimum distance estimate of (λ,β,σ) when σ is also estimated. Denote the corresponding asymptotic covariances by Σ* and Σ, respectively. Then the (t+r)×(t+r) upper-left corner of Σ is equal to Σ*.
proof: Using Notation 4.6, by Theorem 4.9 and Lemma 4.10 it is sufficient to prove that the derivatives

(4.10)    A^k_{βσ}(θ) = (∂/∂σ) E ψ_{β_k}(Z₁,Z₂,θ),    1 ≤ k ≤ r,

and

(4.11)    A^k_{λσ}(θ) = (∂/∂σ) E ψ_{λ_k}(Z₁,Z₂,θ),    1 ≤ k ≤ t,

are all zero when evaluated at θ = θ₀.

Clearly A^k_{βσ}(θ) = 0 for all θ and all k, since the function that is differentiated with respect to σ in equation (4.10) does not depend on σ. For the terms A^k_{λσ}(θ), let us note, putting F(x,θ) = P( r(λ,β,Z) ≤ σx ) and

    M(x,θ) = E a₁(x,θ) = F(x,θ) + F(−x,θ) − 1,

that we have assumed enough conditions so that

(4.12)    (∂/∂σ)(∂/∂λ_k) ∫ M²(x,θ) w(x) dx = (∂/∂σ) ∫ 2 M(x,θ) (∂/∂λ_k) M(x,θ) w(x) dx
        = 2 ∫ [ (∂/∂σ) M(x,θ) · (∂/∂λ_k) M(x,θ) + M(x,θ) · (∂²/∂σ∂λ_k) M(x,θ) ] w(x) dx.

Next we note, by letting f = f(X,β) in the TBS model with family of transformations h(y,λ) and error distribution F_ε (and density f_ε), that

    F(x,θ) = P( r(λ,β,Z) ≤ σx )
           = P( h(Y,λ) − h(f,λ) ≤ σx )
           = P( h( h^{-1}( h(f₀,λ₀) + σ₀ε, λ₀ ), λ ) − h(f,λ) ≤ σx )
           = P( h^{-1}( h(f₀,λ₀) + σ₀ε, λ₀ ) ≤ h^{-1}( σx + h(f,λ), λ ) )
           = P( ε ≤ { h( h^{-1}( σx + h(f,λ), λ ), λ₀ ) − h(f₀,λ₀) } / σ₀ )
           = E_X F_ε( { h( h^{-1}( σx + h(f,λ), λ ), λ₀ ) − h(f₀,λ₀) } / σ₀ ),

and hence

    (∂/∂σ) F(x,θ) = E_X [ (1/σ₀) f_ε( { h( h^{-1}( σx + h(f,λ), λ ), λ₀ ) − h(f₀,λ₀) } / σ₀ )
                          × ḣ( h^{-1}( σx + h(f,λ), λ ), λ₀ ) · (∂/∂u) h^{-1}( σx + h(f,λ), λ ) · x ],

where ḣ(y,λ) denotes (∂/∂y) h(y,λ). The second and third factors above simplify considerably using the fact that, when the indicated derivatives exist for any function f(x), (d/du) f^{-1}(u) = 1 / f'( f^{-1}(u) ). From this we obtain

    (∂/∂σ) F(x,θ) |_{θ=θ₀} = E_X [ (1/σ₀) f_ε(x) x ] = (1/σ₀) f_ε(x) x,

showing that if f_ε is a symmetric density,

(4.13)    (∂/∂σ) M(x,θ₀) = (∂/∂σ) F(x,θ₀) + (∂/∂σ) F(−x,θ₀) = 0

and

(4.14)    M(x,θ₀) = 0.

Equations (4.12), (4.13), and (4.14) show that A^k_{λσ}(θ₀) = 0 for all k, and this completes the proof.  •
4.3  The Single-Sample Model with a Modified Power Transformation with Shift

We define the modified power transformation with shift by

(4.15)    h(y,μ,λ) = [ (y − μ)^λ − 1 ] / λ    if λ ≠ 0,
                   = log(y − μ)              if λ = 0,

for −∞ < μ, λ < ∞. In section 3.7 we discussed why this family of transformations is suitable when there is a need to account for non-positive observations of the response. We also noted that conventional estimation methods such as maximum likelihood are unable to properly estimate the shift parameter, whereas minimum distance produces consistent estimates.

Next, we present the results of the computations for the matrices A and B of Theorem 4.9 under a simple one-sample model with the shifted power transformation. If μ is restricted to be zero, we obtain the Box–Cox transformation and the one-sample model considered also by Taylor (1985) and Hinkley (1975, 1977). This model will be the basis for a simulation study in Chapter V, where we vary error densities, estimators, and the weights used in minimum distance estimation.
Proposition 4.12: Consider the one-sample model

(4.16)    h(Y_i,μ,λ) = h(β,μ,λ) + σε_i,

where h is given by (4.15), ε is assumed to have a continuous density f_ε(·) which is symmetric on [−M,M], and Var(ε) = 1. We adopt the notation θ = (μ,λ,β), with a similar notation for the entries of A and B, and assume that (4.16) holds for some value θ₀ = (μ₀,λ₀,β₀). Also, we define the following:

    F_μ(x) = f_ε(x) σ₀^{-1} { [ (β₀ − μ₀)^{λ₀} + λ₀σ₀x ]^{1 − 1/λ₀} − (β₀ − μ₀)^{λ₀ − 1} },

    F_λ(x) = f_ε(x) σ₀^{-1} { λ₀^{-1} [ σ₀x + (β₀ − μ₀)^{λ₀} log(β₀ − μ₀) ]
             − λ₀^{-2} [ (β₀ − μ₀)^{λ₀} + λ₀σ₀x ] log[ (β₀ − μ₀)^{λ₀} + λ₀σ₀x ] },

    F_β(x) = f_ε(x) (β₀ − μ₀)^{λ₀ − 1} / σ₀,

    P_μ(x) = F_μ(x) + F_μ(−x),    P_λ(x) = F_λ(x) + F_λ(−x),    P_β(x) = F_β(x) + F_β(−x).

We then have

    n^{1/2} ( θ̂_CVM − θ₀ ) →_d N(0, 4 A^{-1} B A^{-T}),

where the entries of A and B are explicit integrals, over [0,M], of products of the functions P_μ, P_λ, P_β with the weight w(x) and the error density f_ε; for instance,

    A_{μμ} = 4 ∫₀^M P_μ²(x) w(x) dx    and    B_{ββ} = σ₀² (β₀ − μ₀)^{2(λ₀ − 1)},

with the remaining entries of the same structure.

proof: Omitted. It follows from lengthy algebra and calculus, starting from the expressions obtained in Theorem 4.9.  •
We may also use these expressions to obtain the asymptotic covariance matrix of n^{1/2}(λ̂, β̂) when the transformation is the one-parameter Box–Cox modified power transformation without the shift; in this case, we simply set μ₀ = 0 and retain only the (λ,β) rows and columns of A and B. Likewise, when estimating the shift μ for fixed known power λ₀, we obtain the asymptotic covariance matrix of n^{1/2}(μ̂, β̂) by setting

    A = [ A_{μμ}  A_{μβ} ]
        [ A_{βμ}  A_{ββ} ],

and similarly for B.
A~~ A~~
mention has
and
been
made
about
estimating
the
matrix
4A-l BA-T. we note in short that the expressions in Proposi tion 4.12
provide one feasible way of estimating A and B. and hence an estimate
of the asymptotic covariance of
n~eCVM may be found by
100
4A- 1BA-T
It is
possible to replace f (y) for a symmetric density estimate f (y) of the
residuals {
* "'ri(A
t
CVM
t
) } and substitute 8 0 for 8
.
CVM
If ft(y)
~
ft(y)
uniformly in y. we obtain that A and B are consistent estimators of A
and B.
Other methods for estimating A and B could and should also be
explored in the future.
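As one possible rendering of this plug-in suggestion (an assumption-laden sketch of mine, not the author's implementation), the fragment below symmetrizes a kernel density estimate of the residuals and evaluates the entry A_μμ of Proposition 4.12 by numerical quadrature, with w(x) = 1 chosen for simplicity.

```python
# Plug-in sketch for one entry of the covariance estimate (illustration).
import numpy as np
from scipy.stats import gaussian_kde

def symmetric_kde(residuals):
    """Symmetrize by reflecting residuals, enforcing f(-y) = f(y)."""
    return gaussian_kde(np.concatenate([residuals, -residuals]))

def A_mumu_hat(residuals, mu, lam, beta, sigma, M=3.0, m=801):
    # Assumes lam != 0 and (beta - mu)**lam > lam * sigma * M, so that the
    # bracketed quantity below stays positive on the grid.
    f_eps = symmetric_kde(residuals)
    x = np.linspace(0.0, M, m)
    base = (beta - mu)**lam
    # F_mu(x) from Proposition 4.12, with estimates plugged in for theta_0
    F_mu = lambda t: f_eps(t) / sigma * (
        (base + lam * sigma * t)**(1.0 - 1.0/lam) - (beta - mu)**(lam - 1.0))
    P_mu = F_mu(x) + F_mu(-x)
    w = np.ones_like(x)                      # weight w(x) = 1 (a choice)
    return 4.0 * np.trapz(P_mu**2 * w, x)    # A_mumu = 4 * int_0^M P_mu^2 w dx
```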
It was interesting to take the shifted power transformations into consideration in the first place because of the reasons given in section 3.7. In addition, it is interesting to explore the estimability of the (μ,λ) parameter by a consistent method, as when using minimum distance. In actuality, estimating μ and λ simultaneously is a very hard problem. One example should suffice to illustrate the difficulties that we encounter when estimating (μ,λ) jointly. With this intention, we consider model (4.16) under the two-parameter transformation (4.15). By definition, the estimate of the transformation parameter (μ,λ) is obtained by minimizing the quantity V_n²(μ,λ) (which was defined in N8 of Notation 4.6) over two-dimensional (μ,λ) space. A mere glance at the immense order of magnitude of the diagonal elements of the matrix 4A^{-1}BA^{-T} would show that estimating (μ,λ) is a difficult problem. However, we consider a graphic explanation of this fact instead, which perhaps better clarifies the nature of the problem. If n is large, then for each fixed (μ,λ), V_n²(μ,λ) is an approximation to its expected value (the limit of V_n²(μ,λ)); thus, a large n gives the general picture. The usefulness of V_n²(μ,λ) for estimating (μ,λ) is related to the location of the minimum of V_n²(μ,λ) with respect to the true parameter values, and to how discriminating the minimum really is with respect to surrounding values of (μ,λ). With this notion in mind, we examine contour plots of the quantity

(4.17)    T_n(μ,λ) = log{ (n²/2) V_n²(μ,λ) }

as a function of (μ,λ).
Here. the log function is used solely to make
graphics more well-behaved and to improve resolution.
One example was selected to be studied closely when the distribution of errors is standard normal, by setting the parameters at n = 250, β₀ = 16, μ₀ = 0, and σ₀ = 0.25, while λ₀ was allowed to take the values −0.25 (Figure 4.1), 0.25 (Figure 4.2), and 1.25 (Figure 4.3).
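The contour plots were produced by evaluating (4.17) over a grid of (μ, λ) values; a minimal sketch of that computation in Python is given below. The function name and grid handling are our own, and `criterion(mu, lam)` stands for an evaluation of V_n²(μ, λ) on the simulated sample:

```python
import numpy as np

def Tn_surface(criterion, mu_grid, lam_grid, n):
    """Evaluate T_n(mu, lam) = log((n**2 / 2) * V2_n(mu, lam)) on a grid,
    as in (4.17); the matrix can then be passed to a contouring routine."""
    T = np.empty((len(mu_grid), len(lam_grid)))
    for i, mu in enumerate(mu_grid):
        for j, lam in enumerate(lam_grid):
            T[i, j] = np.log((n ** 2 / 2.0) * criterion(mu, lam))
    return T
```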
As we examine the contour plots, we discover peculiar "topographic" features in the vicinity of the true (μ, λ) pair, depending on the true value of λ: long narrow gorges, wide flat valleys, or virtually flat plains. As the regions plotted in the figures show a wide range of values for both μ and λ, in any case they tend to show that V_n²(μ, λ) is unable to pick out departures from the true values unless the sample size were drastically increased. Under this pictorial representation, consistency means that eventually the plots conform so that they all have a unique minimum at the true value; a well-determined estimate would occur when the true value lies at the bottom of a pronounced and rounded "hole". All of this exposition translates into the fact that (μ, λ) cannot be discerned unless the sample size is very large (perhaps unattainably so); it also means that the (μ, λ) parameterization (4.15) of the shifted power transformations is ineffective. This strongly suggests that reparameterizations of the shifted power need to be investigated. We also note that the one-sample model which we considered does not have any heteroscedasticity, since there is no variation in the median. Symmetry considerations alone seem to be unable to pinpoint (μ, λ) in this case. In the more general model, the inclusion of a device which also measures heteroscedasticity might possibly improve the situation.
4.4 Figures

Figures for Chapter IV follow.
Figure 4.1: Contour plot of T_n(μ, λ) = log{ (n²/2) V_n²(μ, λ) } in model (4.16) under the modified power transformation with shift (4.15), when n = 250, β₀ = 16, μ₀ = 0, λ₀ = −0.25, and σ₀ = 0.25. The error density is standard normal.

[Contour plot of the Cramér-von Mises criterion; horizontal axis: lambda, vertical axis: mu.]
Figure 4.2: Contour plot of T_n(μ, λ) = log{ (n²/2) V_n²(μ, λ) } in model (4.16) under the modified power transformation with shift (4.15), when n = 250, β₀ = 16, μ₀ = 0, λ₀ = 0.25, and σ₀ = 0.25. The error density is standard normal.

[Contour plot of the Cramér-von Mises criterion; horizontal axis: lambda, vertical axis: mu.]
Figure 4.3: Contour plot of T_n(μ, λ) = log{ (n²/2) V_n²(μ, λ) } in model (4.16) under the modified power transformation with shift (4.15), when n = 250, β₀ = 16, μ₀ = 0, λ₀ = 1.25, and σ₀ = 0.25. The error density is standard normal.

[Contour plot of the Cramér-von Mises criterion; horizontal axis: lambda, vertical axis: mu.]
CHAPTER V

WEIGHTS IN MINIMUM DISTANCE ESTIMATION, A MONTE CARLO STUDY

5.0 Introduction
In the definition of the minimum distance estimator introduced in Chapter III, we considered the empirical distribution function (e.d.f.) of standardized residuals and sought to minimize a criterion that measures symmetry. Explicitly, denoting by F_n(x, λ) the e.d.f. of standardized residuals, the quantity to be minimized is

(5.1)  V_n²(λ) = ∫ { F_n(x, λ) + F_n(−x, λ) − 1 }² w(x) dx,

where the weight function w(x) was assumed symmetric and non-negative. The weight was arbitrary, but considered fixed throughout Chapters III and IV. When the weight is expressible as the derivative of an odd, nondecreasing function, the computation of (5.1) for arbitrary λ is simplified (Proposition 3.4).
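To make (5.1) concrete, here is a minimal Python sketch that evaluates V_n²(λ) by direct quadrature on a grid; the function name, the integration range, and the grid size are our own illustrative choices, and Proposition 3.4 gives a faster closed form when the weight has the special structure noted above:

```python
import numpy as np

def V2n(resid_std, w, x_max=6.0, num=2001):
    """Symmetry criterion (5.1) for a sample of standardized residuals.

    F_n is the e.d.f. of the residuals; the integral of
    {F_n(x) + F_n(-x) - 1}**2 * w(x) is approximated on a uniform grid.
    """
    r = np.sort(np.asarray(resid_std, dtype=float))
    x = np.linspace(-x_max, x_max, num)
    Fn_pos = np.searchsorted(r, x, side="right") / r.size
    Fn_neg = np.searchsorted(r, -x, side="right") / r.size
    integrand = (Fn_pos + Fn_neg - 1.0) ** 2 * w(x)
    return integrand.sum() * (x[1] - x[0])   # simple Riemann sum
```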
In the present chapter, we wish to explore possible forms of the weight function so that a better estimator of the transformation parameter is obtained. This matter is the subject of section 5.1, where we consider an intuitive argument that points in the direction of the weight exp(x²/2) when the error distribution is standard normal and the transformation is the modified power. The remaining part of this chapter summarizes the results of a Monte Carlo study performed on the minimum distance estimator. The purposes of this study are to evaluate the effect of different weights, to check the asymptotic theory, and to explore the performance of the estimate for small sample sizes.
5.1 A Heuristic Argument for an Optimal Weight

In Theorem 4.9 we established that the covariance of the limiting distribution of the minimum distance estimator is given by 4A⁻¹BA⁻ᵀ, where A and B are defined by equations (4.7).
While exploring the effects of different weights on the estimates of λ, intuitively we expect at least one property that the weight w should possess in order to improve the estimates. To see this notion, we consider the quantity

(5.2)  Q_n(x, λ) = F_n(x, λ) + F_n(−x, λ) − 1.
For any set value of λ, we see from properties of distribution functions that, as the absolute value of x increases, the value of Q_n(x, λ) tends to zero. This behavior suggests that we should upweight for large values of |x| in order to increase the sensitivity of Q_n to departures of λ from its true value. In this way, a weight with heavier values at the tails is expected to yield more efficient estimates of λ.
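This tail behavior of (5.2) is easy to verify numerically; in the hypothetical snippet below, standard normal draws stand in for residuals computed at the true λ:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.sort(rng.standard_normal(200))     # stand-in standardized residuals

def Qn(x):
    """Q_n(x) = F_n(x) + F_n(-x) - 1, as in (5.2), at a fixed lambda."""
    Fn = lambda t: np.searchsorted(r, t, side="right") / r.size
    return Fn(x) + Fn(-x) - 1.0

for x in (0.5, 1.0, 2.0, 3.0):
    print(x, Qn(x))   # |Q_n(x)| shrinks toward zero as |x| grows
```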
Although the most general problem is finding the weight which is optimal for a given error distribution, it would be interesting as a preliminary step to discover the form of an optimal weight in the case that the errors are normal. The use of such a weight would ensure that we obtain an efficient estimator whenever the errors happened to be normal, but we would need to evaluate the behavior of the identified weight when departures from normality occur. If the weight is fixed, we obtain consistent estimators of λ under any symmetric (not necessarily normal) error density, but it is plausible that one weight will not perform uniformly well for all error densities.
We will concentrate on the case in which the parameter λ is one-dimensional and the transformation is the modified power family of Box and Cox, which we denoted by { y^(λ) }. The model to be considered is then

(5.3)  Y_i^(λ) = μ^(λ) + σε_i.

We assume ε possesses a symmetric distribution F_ε(·) and a density f_ε(·). For the remaining part of this section, we also assume that the parameters β and σ are known, and we denote μ = f(x, β).

The model gives rise to the family of distribution functions for Y, indexed by λ, given by

(5.4)  F(y, λ) = F_ε( (y^(λ) − μ^(λ))/σ ),

and corresponding densities

(5.5)  f(y, λ) = f_ε( (y^(λ) − μ^(λ))/σ ) y^{λ−1}/σ.
We assume i.i.d. observations Y₁, ..., Y_n following model (5.3), and we let

(5.6)  F_n(y, λ) = e.d.f.{ (Y_i^(λ) − μ^(λ))/σ, i = 1, ..., n }

and

(5.7)  F̃_n(y) = e.d.f.{ Y_i, i = 1, ..., n }.

We note that

(5.8)  F_n(z, λ) = F̃_n[ (μ^λ + λσz)^{1/λ} ].
By definition, the minimum distance estimator λ̂_CVM minimizes

(5.9)  V_n²(λ) = ∫ { F_n(y, λ) + F_n(−y, λ) − 1 }² w(y) dy

for a fixed function w(y). However, we shall now allow w(y) to depend on λ in the minimization; that is, we let w(y) = w(y, λ).
'"
If. on the other hand. we were to define A by minimizing
(5.10)
for some given weight w* . we would obtain an estimate of the type which
has been examined by Parr and DeWet (1981).
By considering an inf1u-
ence function argument. these authors show that in order to obtain an
'"
efficient estimate A. the form of the weight should be
(5.11)
By the change of variable z = (y^(λ) − μ^(λ))/σ = (y^λ − μ^λ)/(σλ), so that y = (μ^λ + λσz)^{1/λ} and dy = (μ^λ + λσz)^{1/λ−1} σ dz, in (5.10) we can use (5.4) and (5.8) to obtain

(5.12)  W_n²(λ) = ∫ { F_n(z, λ) − F_ε(z) }² w*[ (μ^λ + λσz)^{1/λ}, λ ] (μ^λ + λσz)^{1/λ−1} σ dz.
Next, we consider approximating F_ε(z) in (5.12). Since the error distribution was assumed symmetric, it seems natural to consider a symmetrization of the empirical distribution of the residuals, given by

(5.13)  F̂_ε(z) = [ F_n(z, λ) + (1 − F_n(−z, λ)) ] / 2.

If we also define

(5.14)  w(z, λ) = w*[ (μ^λ + λσz)^{1/λ}, λ ] (μ^λ + λσz)^{1/λ−1} σ,

then we will obtain that

(5.15)  W_n²(λ) ≈ (1/4) ∫ { F_n(z, λ) + F_n(−z, λ) − 1 }² w(z, λ) dz.
By comparing with (5.9), we can see that

(5.16)  W_n²(λ) ≈ (1/4) V_n²(λ).

Thus, the two estimators λ̃ and λ̂_CVM are equivalent in the sense that one minimizes its defining criterion if and only if it also minimizes the opposing criterion. That is, the minimum distance estimator is approximately equal to an estimator of the Parr-De Wet type, and therefore an (approximately) optimal weight for the minimum distance estimator would be given by

(5.17)  w(z, λ) = w*[ (μ^λ + λσz)^{1/λ}, λ ] (μ^λ + λσz)^{1/λ−1} σ,

where w*(z, λ) is chosen to be (5.11).
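As an aside, the symmetrization (5.13) is straightforward to carry out on a sample; a minimal Python sketch (the function name is ours) is:

```python
import numpy as np

def symmetrized_edf(resid_std, z):
    """[F_n(z) + 1 - F_n(-z)] / 2, the symmetrized residual e.d.f.
    of (5.13), evaluated at scalar or array z."""
    r = np.sort(np.asarray(resid_std, dtype=float))
    Fn = lambda t: np.searchsorted(r, t, side="right") / r.size
    return 0.5 * (Fn(z) + 1.0 - Fn(np.negative(z)))
```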
We acknowledge the fact that (5.17) was obtained through an important approximation, when F_ε was replaced by the estimate F̂_ε in (5.12), so at best this optimal weight will only be crude. There is yet a second reason why (5.17) should be taken with caution: it was obtained assuming that the value of the β parameter was known. When β is also estimated, the optimal weight which we seek may be different; the problem is understandably more complex in this case.

There is no reason why (5.17) should give rise to a symmetric weight; furthermore, it may be possible that (5.17) varies greatly with λ, so that the weight is not a function of y only, as it is in (5.9).
Next, we compute the function w(z, λ) in the case that the errors are standard normal. Consider (5.4) and (5.5), which hold for a general error distribution F_ε(·) and error density f_ε(·). If we assume that the errors are standard normal, then we obtain

(∂/∂λ) F(y, λ) = f_ε( (y^(λ) − μ^(λ))/σ ) σ⁻¹ (∂/∂λ)(y^(λ) − μ^(λ))

and

(∂²/∂λ∂y) log f_ε( (y^(λ) − μ^(λ))/σ ) = −σ⁻² [ (y^(λ) − μ^(λ)) y^{λ−1} log(y) + y^{λ−1} (∂/∂λ)(y^(λ) − μ^(λ)) ].

Substituting, we obtain, using (5.11),

(5.18)  w*(y, λ) = (∂²/∂λ∂y) log f(y, λ) / [ f_ε( (y^(λ) − μ^(λ))/σ ) σ⁻¹ (∂/∂λ)(y^(λ) − μ^(λ)) ].

Next we find an expression for (∂/∂λ)(y^(λ) − μ^(λ)):
(∂/∂λ)(y^(λ) − μ^(λ)) = (∂/∂λ)( (y^λ − μ^λ)/λ ) = { λ( y^λ log(y) − μ^λ log(μ) ) − (y^λ − μ^λ) } / λ²,

which, evaluated at y₀ = (μ^λ + λσz)^{1/λ}, becomes

(5.19)  D(z, λ) := (∂/∂λ)(y^(λ) − μ^(λ)) |_{y=y₀} = { λ[ (μ^λ + λσz)(1/λ) log(μ^λ + λσz) − μ^λ log(μ) ] − λσz } / λ².
Using (5.18) and (5.19),

w*[ (μ^λ + λσz)^{1/λ}, λ ] = (2π)^{1/2} e^{z²/2} { σ⁻¹ [ z (μ^λ + λσz)^{(λ−1)/λ} log(μ^λ + λσz)^{1/λ} ] / D(z, λ) − σ (μ^λ + λσz)^{−1/λ} / D(z, λ) + σ⁻¹ (μ^λ + λσz)^{(λ−1)/λ} }.

Finally, using (5.17), we obtain

w(z, λ) = w*[ (μ^λ + λσz)^{1/λ}, λ ] (μ^λ + λσz)^{1/λ−1} σ
        = (2π)^{1/2} e^{z²/2} × { [ z log(μ^λ + λσz)^{1/λ} ] / D(z, λ) − σ² (μ^λ + λσz)^{−1} / D(z, λ) + 1 }.

If we denote

(5.20)  T_λ(z) = [ z log(μ^λ + λσz)^{1/λ} ] / D(z, λ) − σ² (μ^λ + λσz)^{−1} / D(z, λ) + 1,

the result we have obtained is that

(5.21)  w(z, λ) = (2π)^{1/2} e^{z²/2} T_λ(z).
In this way, it is the reciprocal of the standard normal density which produces the term exp(z²/2) that appears as the leading factor in the optimal weight. The function exp(x²/2) is appealing as a weight because of its smoothness and because it upweights at the tails in a way that was intuitively foreseen for an optimal weight. In order to justify using w(x) = exp(x²/2) in (5.1), it is necessary to compare w(z, λ) and exp(z²/2), or equivalently, to check that T_λ(z) is approximately constant for all (λ, z).
Various plots of T_λ(z) for different values of λ have shown that its general behavior does not depend on λ or σ. The weight w(z, λ) obtained in this way has a peculiar feature in that w(0−, λ) = −∞ and w(0+, λ) = +∞. This fact seems troublesome at first, since w(z, λ) resists proper interpretation around z = 0. However, we next see that this has only a marginal effect. We consider the example (λ, μ, σ) = (.25, 12.25, 1), which shows the typical trend of the function w(z, λ). Figure 5.1 shows a plot of the functions log{ w(z, λ)/6 } on 0.3 < |z| ≤ 4 and z²/2 on 0 ≤ |z| ≤ 4. We observe that, excluding the singularity at z = 0, both curves resemble each other over most of the range of z, and that log{ w(z, λ)/6 } is (unintentionally) almost a symmetric function. Now, if we consider the symmetrized version of w(z, λ) given by w_s(z, λ) = [ w(z, λ) + w(−z, λ) ]/2, we find that the problem with the singularity at z = 0 is no longer present and the resemblance with z²/2 is markedly stronger. Figure 5.2 is a simple plot which shows graphs of the functions log{ w_s(z, λ)/6 } and z²/2 over the range 0 ≤ |z| ≤ 4; we notice that these two curves are virtually indistinguishable from each other.

As a result of these findings, we will subsequently consider the weight w(x) = exp(x²/2) as a likely candidate for minimum distance estimation under normal errors, and we next proceed to entertain this w(x) in a Monte Carlo study.
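In practice the candidate weight must be truncated, since exp(x²/2) is unbounded and a direct quadrature of (5.1) would otherwise diverge; a minimal sketch follows, where the cutoff ±4 is an illustrative choice of ours rather than one fixed by the text:

```python
import numpy as np

def w_exp(x, x_max=4.0):
    """Candidate weight w(x) = exp(x**2 / 2), set to zero beyond
    |x| = x_max so that the quadrature version of (5.1) stays finite."""
    x = np.asarray(x, dtype=float)
    capped = np.minimum(x ** 2, x_max ** 2)   # avoid overflow outside cut
    return np.where(np.abs(x) <= x_max, np.exp(capped / 2.0), 0.0)
```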
5.2 Monte Carlo Study

We intend to study minimum distance estimation in the single-sample TBS transformation model

(5.22)  h(Y_i, λ) = h(β, λ) + σε_i,

where the errors ε_i are symmetric with density f_ε. If h is the modified shifted power transformation, this is precisely the model we considered in Proposition 4.12; there, we computed the asymptotic covariance of the estimate as derived from generalized M-estimation theory.
The densities f_ε which were considered were: a standard normal truncated at ±4; a standard normal truncated at ±2; a uniform density; a triangular density; Student's t-distribution with 3 degrees of freedom, truncated at ±6; and a standard normal contaminated with probability .15 by a normal with mean zero and variance 5. In all cases, the densities were standardized so that they had mean zero and variance one.
The number of Monte Carlo replications was fixed at 1000 in each case; a selected case was defined by a fixed setting of the parameter θ = (λ, β, σ), the density f_ε, the sample size n, and the chosen estimating method. A replication then consisted in generating a sample following model (5.22) and then computing the estimator.
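A sketch of one Monte Carlo cell under normal-±4 errors is given below. The helper names, the rejection-sampling step, and the sample-level standardization are our own illustrative choices (the study standardized each density to mean zero and variance one exactly):

```python
import numpy as np

def truncated_normal(rng, n, c=4.0):
    """Standard normal truncated at +/- c, rescaled to unit variance."""
    e = rng.standard_normal(4 * n)        # oversample, then reject
    e = e[np.abs(e) <= c][:n]
    return e / e.std()                     # approximate standardization

def monte_carlo_cell(estimator, lam0, beta0, sigma0, n=120, reps=1000, seed=0):
    """One cell of the study: `reps` samples from model (5.22) with the
    modified power transformation, one estimate of lambda per sample."""
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for k in range(reps):
        eps = truncated_normal(rng, n)
        # invert h(Y, lam0) = h(beta0, lam0) + sigma0*eps for the sample Y
        # (assumes lam0 != 0 and lam0*z + 1 > 0 over the sample)
        z = (beta0 ** lam0 - 1.0) / lam0 + sigma0 * eps
        out[k] = estimator((lam0 * z + 1.0) ** (1.0 / lam0))
    return out
```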
When dealing with the minimum distance estimate, the calculation requires minimization of a non-differentiable function in λ. The solution requires some numerical elaboration: the Nelder-Mead simplex algorithm (Nelder and Mead (1965)) was the numerical algorithm employed for this matter. This iterative procedure is based only on evaluations of the function itself and not of its derivatives, and requires a starting point, which in this study was found by search over a grid of λ-values. Apart from the fact that this algorithm renders convergence that is painstakingly slow (especially in higher dimensions), it was otherwise effective for the problem at hand.
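A minimal sketch of this two-stage minimization, using SciPy's implementation of the Nelder-Mead simplex, is given below; the wrapper is our own, and `criterion` stands for the function λ ↦ V_n²(λ):

```python
import numpy as np
from scipy.optimize import minimize

def minimize_md_criterion(criterion, lam_grid):
    """Grid search for a starting value, then Nelder-Mead refinement.

    Nelder-Mead uses function evaluations only, which suits the
    non-differentiable criterion V2_n."""
    lam_grid = np.asarray(lam_grid, dtype=float)
    start = lam_grid[np.argmin([criterion(l) for l in lam_grid])]
    res = minimize(lambda v: criterion(v[0]), x0=[start],
                   method="Nelder-Mead")
    return res.x[0]
```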
For comparative purposes, we shall consider three estimators of the parameter λ: (a) λ̂_CVM, the minimum distance estimate using the weight function w(x) of Chapter III; (b) λ̂_B&C, the Maximum Likelihood estimate of Box and Cox (1964), which is the MLE computed under the assumption that f_ε is standard normal; and (c) λ̂_R&A, the Ruppert-Aldershof skewness estimator of Chapter II.
This simulation study is organized into three main segments.

Part A. Here we consider minimum distance estimation under the simple power transform

(5.23)  h(y, λ) = y^λ

and a large sample size (n = 120), in order to compare with the asymptotics of Theorem 4.9 and to observe the action of a few selected weight functions w(x). The parameter settings were chosen to be (λ, β, σ) = (.5, 100, 2) (Table 5.1), (λ, β, σ) = (.5, 100, 1) (Table 5.2), and (λ, β, σ) = (.5, 100, .2) (Table 5.3).
Here we consider the modified power transformation of Box
and Cox
(y
(5.24)
h(y,X) = {
using a large sample size (n
of weight functions.
x
- 1)/A
log(y)
= 120).
i f X#-O
if X
=0
In order to perceive the effects
several choices of the weight function w(x) are
examined when using X
'
CVM
We are particularly interested in observing
the performance of the weight w(x)
= exp(x 2 /2)
obtained in section 5.1,
but other weights considered for experimentation were the unit weight
w(x):l. and the power weights w(x) = x
k
for k=2,4, and 6.
The other
two estimates were also included in this part of the study.
Since we
are interested in assessing the performance of the estimate X
under
CVM
normal errors, the case when f
c.
is normal will be of special interest.
We will be able to gauge the loss of efficiency when we compare against
XB&C' which is asymptotically optimal in the normal case.
Of course,
when the errors are not normal then XB&C is not necessarily a consistent estimate of Xo.
However. other studies have shown that XB&C pro-
duces meaningful estimates even in non-consistent cases (Hernandez and
Johnson (1980».
study were
(X.~.a)
Parameter settings for Part B of the Monte Carlo
=
(-.24 . . 037432.2) (Table 5.4) and
(X,~.a)
= (-.24 .
. 104383.1) (Table 5.5).
Part C. As in Part B, in this third part we also consider the transformation (5.24), but we concentrate only on the minimum distance estimate that is obtained when using the weight w(x) = exp(x²/2). Some comparisons between λ̂_CVM, λ̂_B&C, and λ̂_R&A under different error densities are provided, when the sample sizes are not quite as large as in Parts A and B. For sample size n = 40, the selected parameter settings were (λ, β, σ) = (.5, 12.25, .25) (Table 5.6), (λ, β, σ) = (.5, 36, 1) (Table 5.7), (λ, β, σ) = (1.5, 4.16497, 1) (Table 5.8), and (λ, β, σ) = (.25, 25.6289, .25) (Table 5.9).
In the tables, whenever an estimate θ̂ = (λ̂, β̂) satisfies √n(θ̂ − θ₀) → N(0, Σ) in distribution, we resort to the notation

    Σ = [ Σ(λ,λ)  Σ(λ,β) ]
        [ Σ(λ,β)  Σ(β,β) ] .
Discussion
The one-sample model which we considered in the Monte Carlo study
provides a first step towards the more general regression model. and
several general conclusions with respect to minimum distance estimation
can be drawn.
An important issue is to decide on the weight function
which should be employed for minimum distance estimation.
ing a power parameter. we derived the weight w(x)
For estimat-
= exp(x 2 /2)
as opti-
mal for normal errors, and this weight was one of the main focal points
in the simulation study.
In the tables that correspond to Part A, we generally observe agreement with respect to the asymptotic theory. Severe computational difficulties were encountered in some of the asymptotics, particularly in the case of normal errors and minimum distance estimation using the unit weight. These problems are presumably due to the two-level numerical integrations that are present in Proposition 4.12; any numerical errors in the inner integrals are carried over to the outer ones and compounded. As with any numerical calculation, we should be cautious about lurking numerical instability; in this particular problem, the numerical difficulties were unquestionably present in some of the cases considered, and are very likely due to shortcomings in the intrinsic numerical integration routines that were employed. For this reason, while a few of the entries in Tables 5.1, 5.2 and 5.3 show discrepancies between asymptotic and simulation results in excess of 10%, we will not be too rigorous. It is of greater importance here to notice the relative behavior across different weights.
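The compounding can be seen in miniature with a nested quadrature, sketched below on a toy integrand of our own using SciPy's `quad`: the value returned by the inner call, together with its numerical error, is exactly what the outer rule integrates.

```python
from scipy.integrate import quad

# inner integral: int_0^y x*y dx = y**3 / 2
inner = lambda y: quad(lambda x: x * y, 0.0, y)[0]

# outer integral: int_0^1 y**3/2 dy = 1/8; the reported error estimate
# does not account for the error already present in `inner`
value, err = quad(inner, 0.0, 1.0)
print(value, err)
```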
While looking exclusively at the variance of λ̂_CVM, even at an early stage we begin observing from Tables 5.1, 5.2, and 5.3 that the weight exp(x²/2) performs better than the other weights considered. This exponential weight is preferred for all error densities except the uniform, where the squared weight seems to do just as well. The unit weight performs badly in all cases and clearly should be disregarded for any estimation purposes; it appears imperative to upweight values away from zero in order to produce worthwhile estimates.
In the tables of Part B, we again observe that among the different weight choices for λ̂_CVM, the weight exp(x²/2) is without doubt the one that gives the best estimates of λ. The few exceptions where it is not the optimal choice involve the uniform or the normal-truncated-at-±2 densities; these densities are the less likely to be interesting from the point of view of a real-life application. It is significant to observe that the exp(x²/2) weight performs well despite the fact that β has not been assumed known and is also being estimated in the Monte Carlo. We recall that in section 5.1 the weight exp(x²/2) was obtained by an informal argument in which we assumed that β was known and that the errors were standard normal. Since the previous findings only establish that the exp(x²/2) weights are good within the family of minimum distance estimators with varying weights, there is the need for comparing optimal minimum distance estimation to other procedures.
Strictly speaking, since we are restricting the densities to be truncated, we did not consider a case with normal errors, but a close approximation to normality is given by the standard normal density truncated at ±4. In the normal case it is known that λ̂_B&C is asymptotically optimal, but this estimator is not even consistent for other error distributions. This last fact is evident in some of the large bias terms in Tables 5.6-5.9, although we note that even in several non-normal cases the bias for λ̂_B&C was negligible. This agrees with the findings of Taylor (1985) and Hernandez and Johnson (1980), who showed that the Box-Cox estimate has an interpretation even in inconsistent cases.
Even if minimum distance estimation is not expected to yield fully efficient estimators, it is certainly desirable to observe how it compares to λ̂_B&C in the normal case, as a basis for judgment. This loss of efficiency is foreseen for minimum distance estimation because we only made the nonparametric assumption of symmetry of the errors. In Tables 5.4 and 5.5, we see that when estimating the modified power transformation, the loss of efficiency for λ̂_CVM in the normal case is roughly of the order of 18% and 11%, respectively. Comparison with the optimal in this case is merely a standard; the full potential of minimum distance estimation is called for in situations where other procedures fail to be consistent: for example, λ̂_B&C under non-normality, λ̂_B&C when estimating a shift, or λ̂_R&A under multi-dimensional λ.
An accessory remark that results from Part B is that the weight w(x) = x² appears to be a reasonable contender when we grant that computation of the estimate is much simplified when using x² instead of exp(x²/2). For uniform errors, squared weights are evidently better; for the other cases they are dominated by the exp(x²/2) weight, but not severely.
We detect no difference in overall behavior in Part C, when the sample size is n = 40. That is, the estimates compare amongst themselves in much the same way as they did for n = 120. However, in this part we considered a greater collection of error densities, and therefore we can better assess the behavior of the estimates when departures from normality occur. Here, the weight for minimum distance was fixed at exp(x²/2), which was a weight designed to yield minimum variance for normal errors. While it would have been favorable to find that this weight is particularly efficient also in non-normal cases, there does not seem to be evidence that this is the case. It is self-evident, for example by noting the bias terms in Table 5.8, how departures from normality influence λ̂_B&C; in order to note what the effects are on λ̂_CVM, we must compare with λ̂_R&A, which is also consistent. In most instances reported in Tables 5.6 through 5.9, λ̂_CVM has a larger variance than λ̂_R&A does. In a few cases they behave practically in the same way, while λ̂_CVM is superior particularly in the rather radical examples that involve uniform errors. This discovery exposes the need for investigating the form of an optimal weight for heavy-tailed error distributions (e.g., the t-distribution).
The minimum distance estimator is based on full symmetry of the errors, but it seems that the wrong choice of a weight function causes λ̂_CVM to be merely comparable to λ̂_R&A, which is based only on skewness.
A natural question arises: what happens if we apply the heuristic argument for finding an optimal weight to a heavy-tailed error density? The answer involved only some straightforward calculations, but the result was hard to interpret. For a t-distribution, the argument yields a weight function which was not as clear-cut as the ones illustrated in Figures 5.1 and 5.2. The problem is that the result could not be expressed as a simple function of z; this property was desirable to simplify computation if simulations were to be performed. A possible solution is to approximate the weight by some rational function, say a symmetric polynomial, and then use the approximation as the weight. Since the original weight might initially be in error because of the approximations that were involved in its derivation, this issue was not elaborated upon. Thus an optimal weight for heavy-tailed distributions remains an open problem. It is not clear at this point what characteristics the optimal weight should have in order to accommodate heavy tails.
Conceivably, the next effort in further research on minimum distance estimation should be focused on estimating a shifted power transformation. As we remarked in Chapter IV, maximum likelihood is unable to cope with a shift parameter. Two related problems are estimating the shift for fixed power, and estimating the shift and the power simultaneously. Questions about choices of weights and/or possible alternate parameterizations remain to be answered.
Figure 5.1: Plot of log(w(z, λ)/6) for 0.3 < |z| ≤ 4 (solid line) and z²/2 for 0 ≤ |z| ≤ 4 (broken line), when (λ, μ, σ) = (.25, 12.25, 1).

[Line plot; horizontal axis: z.]
Figure 5.2: Plot of log(w_s(z, λ)/6) (solid line) and z²/2 (broken line) for 0 ≤ |z| ≤ 4, when (λ, μ, σ) = (.25, 12.25, 1).

[Line plot; horizontal axis: z.]
Table 5.1: Minimum distance estimation of θ = (λ, β) in model (5.22) with the simple power transformation (5.23). Monte Carlo results (MC) using n = 120, and asymptotic values (asy) of the covariance matrix Σ when λ₀ = .5, β₀ = 100, σ₀ = 2. (*** denotes an entry that had a numerical difficulty.)

                              ERROR DENSITY
                     normal ±4   normal ±2    uniform    triang.
  unit wt
    Σ(λ,λ)   MC          6.10        6.05        8.73       5.71
             asy          ***        5.67        8.59       5.82
    Σ(β,β)   MC       1890.73     2221.36     3376.73    2322.00
             asy          ***     2381.51     3655.36    2308.15
    Σ(λ,β)   MC         46.96       69.42      157.59      71.02
             asy          ***       71.11      167.87      67.21
  x² wt
    Σ(λ,λ)   MC          5.43        3.90        7.93       4.06
             asy         4.91        3.39        7.64       3.71
    Σ(β,β)   MC       1938.23     2422.68     3630.97    2391.81
             asy      1930.30     2306.52     3690.33    2287.93
    Σ(λ,β)   MC         41.13       60.33      160.00      61.26
             asy        40.30       57.23      159.99      57.38
  exp(x²/2) wt
    Σ(λ,λ)   MC          4.73        3.57        8.72       3.64
             asy         3.69        2.87        7.60       3.22
    Σ(β,β)   MC       1894.55     2322.42     3775.01    2276.24
             asy      1854.55     2301.67     3687.49    2272.54
    Σ(λ,β)   MC         38.66       58.09      171.39      54.87
             asy        30.67       54.83      159.65      54.43
Table 5.2: Minimum distance estimation of θ = (λ, β) in model (5.22) with the simple power transformation (5.23). Monte Carlo results (MC) using n = 120, and asymptotic values (asy) of the covariance matrix Σ when λ₀ = .5, β₀ = 100, σ₀ = 1. (*** denotes an entry that had a numerical difficulty.)

                              ERROR DENSITY
                     normal ±4   normal ±2    uniform    triang.
  unit wt
    Σ(λ,λ)   MC         26.24       24.62       41.12      23.33
             asy          ***       23.64       35.38      24.20
    Σ(β,β)   MC        430.01      595.34      964.84     561.21
             asy          ***      596.78      941.88     577.76
    Σ(λ,β)   MC         37.38       74.14      182.38      66.79
             asy          ***       72.74      170.55      68.56
  x² wt
    Σ(λ,λ)   MC         19.73       15.92       31.71      16.97
             asy        20.64       14.18       31.51      15.47
    Σ(β,β)   MC        478.55      542.46      853.15     573.72
             asy       483.86      578.34      926.13     573.39
    Σ(λ,β)   MC         35.06       56.67      155.54      58.81
             asy        41.61       58.65      162.74      58.71
  exp(x²/2) wt
    Σ(λ,λ)   MC         18.36       14.79       33.39      14.25
             asy        15.87       12.05       31.34      13.46
    Σ(β,β)   MC        444.47      573.84      926.99     552.16
             asy       466.02      577.33      925.44     569.75
    Σ(λ,β)   MC         32.56       60.65      165.36      51.95
             asy        32.37       56.26      162.39      55.78
Table 5.3: Minimum distance estimation of θ = (λ, β) in model (5.22) with the simple power transformation (5.23). Monte Carlo results (MC) using n = 120, and asymptotic values (asy) of the covariance matrix Σ when λ₀ = .5, β₀ = 100, σ₀ = 0.2. (*** denotes an entry that had a numerical difficulty.)

                              ERROR DENSITY
                     normal ±4   normal ±2    uniform    triang.
  unit wt
    Σ(λ,λ)   MC        663.39      520.08      889.38     550.35
             asy          ***      598.68      892.53     612.21
    Σ(β,β)   MC         19.04       20.24       34.87      23.60
             asy          ***       23.88       37.71      23.11
    Σ(λ,β)   MC         46.48       58.29      160.26      59.96
             asy          ***       73.24      171.39      68.98
  x² wt
    Σ(λ,λ)   MC        568.55      459.36      787.67     472.27
             asy       523.62      359.43      795.14     391.86
    Σ(β,β)   MC         18.22       24.40       34.42      23.26
             asy        19.36       23.15       37.08      22.95
    Σ(λ,β)   MC         40.75       72.84      155.03      62.60
             asy        42.00       59.08      163.60      59.12
  exp(x²/2) wt
    Σ(λ,λ)   MC        495.22      400.69      807.18     359.70
             asy       404.87      305.70      790.85     341.08
    Σ(β,β)   MC         18.64       22.64       38.20      21.13
             asy        18.66       23.11       37.06      22.80
    Σ(λ,β)   MC         38.16       62.45      165.17      52.18
             asy        32.86       56.70      163.25      56.19
Table 5.4: Monte Carlo results for the covariance matrix Σ of √n(θ̂ − θ₀) for different estimators of θ = (λ, β), based on samples of size n = 120, when λ₀ = −.24, β₀ = .037432, σ₀ = 2 in the single-sample model (5.22) with the modified power transformation (5.24).

                              ERROR DENSITY
                  normal ±4  normal ±2   uniform   triang.  Student-t
  MD unit
    Σ(λ,λ)          1.164      0.937      2.074     1.031     1.530
    Σ(β,β)          0.0014     0.0017     0.0029    0.0017    0.0010
    Σ(λ,β)          0.018      0.023      0.072     0.024     0.003
  MD x²
    Σ(λ,λ)          0.889      0.806      1.457     0.763     1.354
    Σ(β,β)          0.0015     0.0018     0.0026    0.0018    0.0010
    Σ(λ,β)          0.014      0.026      0.058     0.024     0.004
  MD x⁴
    Σ(λ,λ)          0.915      0.573      1.499     0.640     1.222
    Σ(β,β)          0.0014     0.0016     0.0027    0.0015    0.0011
    Σ(λ,β)          0.013      0.020      0.060     0.0208    0.006
  MD x⁶
    Σ(λ,λ)          0.997      0.552      1.689     0.622     1.309
    Σ(β,β)          0.0015     0.0017     0.0032    0.0016    0.0010
    Σ(λ,β)          0.014      0.021      0.071     0.021     0.007
  MD exp(x²/2)
    Σ(λ,λ)          0.890      0.679      1.669     0.673     1.205
    Σ(β,β)          0.0013     0.0017     0.0028    0.0015    0.0011
    Σ(λ,β)          0.013      0.022      0.064     0.019     0.007
  B&C MLE  (← note: inconsistent for these errors)
    Σ(λ,λ)          0.754      0.544      0.920     0.500     1.320
    Σ(β,β)          0.0014     0.0016     0.0025    0.0015    0.0010
    Σ(λ,β)          0.011      0.017      0.044     0.017     0.003
  R&A Skew.
    Σ(λ,λ)          0.806      0.722      1.772     0.686     1.117
    Σ(β,β)          0.0014     0.0017     0.0028    0.0016    0.0010
    Σ(λ,β)          0.012      0.023      0.065     0.020     0.003
Table 5.5: Monte Carlo results for the covariance matrix Σ of √n(θ̂ − θ₀) for different estimators of θ = (λ, β), based on samples of size n = 120, when λ₀ = −.24, β₀ = .104383, σ₀ = 1 in the single-sample model (5.22) with the modified power transformation (5.24).

                              ERROR DENSITY
                  normal ±4  normal ±2   uniform   triang.  Student-t
  MD unit
    Σ(λ,λ)          2.499      2.880      4.653     2.417     4.099
    Σ(β,β)          0.004      0.005      0.009     0.005     0.003
    Σ(λ,β)          0.037      0.075      0.189     0.056     0.005
  MD x²
    Σ(λ,λ)          2.331      1.850      4.138     1.865     3.377
    Σ(β,β)          0.005      0.005      0.009     0.005     0.003
    Σ(λ,β)          0.037      0.061      0.188     0.060     0.006
  MD x⁴
    Σ(λ,λ)          2.306      1.655      4.144     1.718     3.242
    Σ(β,β)          0.004      0.006      0.009     0.005     0.003
    Σ(λ,β)          0.041      0.058      0.160     0.063     0.017
  MD x⁶
    Σ(λ,λ)          2.633      1.470      3.519     1.614     3.336
    Σ(β,β)          0.004      0.005      0.009     0.005     0.003
    Σ(λ,β)          0.047      0.064      0.166     0.059     0.020
  MD exp(x²/2)
    Σ(λ,λ)          2.214      1.700      4.055     1.920     3.467
    Σ(β,β)          0.004      0.005      0.009     0.005     0.003
    Σ(λ,β)          0.036      0.059      0.181     0.065     0.023
  B&C MLE  (← note: inconsistent for these errors)
    Σ(λ,λ)          1.990      1.361      1.460     2.433     3.298
    Σ(β,β)          0.004      0.005      0.008     0.005     0.003
    Σ(λ,β)          0.035      0.057      0.122     0.053     0.015
  R&A Skew.
    Σ(λ,λ)          2.075      1.733      4.476     1.669     3.005
    Σ(β,β)          0.004      0.005      0.009     0.005     0.003
    Σ(λ,β)          0.038      0.060      0.186     0.059     0.004
Table 5.6: Monte Carlo results for selected estimates of λ when λ₀ = .5, β₀ = 12.25, σ₀ = .25, and n = 40 in the single-sample model (5.22) under the modified power transformation (5.24).

                        λ̂_CVM             λ̂_B&C             λ̂_R&A
  errors ↓          bias     var       bias     var       bias     var
  Normal ±4        .0137    4.936     .0165    3.908    −.0300    4.545
  Normal ±2        .0347    4.066    −.1790    2.942    −.0446    3.539
  Uniform         −.0311    7.116     .1319    6.789    −.0836    7.257
  Student-t       −.0461    7.734    −.0714    4.807    −.0562    7.218
  Cont. Normal     .0770    4.279    −.1315    8.348     .1078    7.779
Table 5.7: Monte Carlo results for selected estimates of λ when λ₀ = .5, β₀ = 36, σ₀ = 1, and n = 40 in the single-sample model (5.22) under the modified power transformation (5.24).

                        λ̂_CVM             λ̂_B&C             λ̂_R&A
  errors ↓          bias     var       bias     var       bias     var
  Normal ±4       −.0103    0.926    −.0580    0.709    −.0577    0.874
  Normal ±2       −.0122    0.701    −.0713    0.505     .0020    0.634
  Uniform          .0068    1.4224    .0081    1.243     .0392    1.391
  Student-t       −.0260    1.616    −.0044    1.157    −.0047    1.292
  Cont. Normal     .0557    0.771    −.1516    1.422     .0724    1.479
Table 5.8: Monte Carlo results for selected estimates of λ when λ₀ = 1.5, β₀ = 4.164977, σ₀ = 1, and n = 40 in the single-sample model (5.22) under the modified power transformation (5.24).

                        λ̂_CVM             λ̂_B&C             λ̂_R&A
  errors ↓          bias     var       bias     var       bias     var
  Normal ±4       −.0003    1.709    −.0980    1.342    −.0007    1.642
  Normal ±2        .0564    1.378    −.1678    0.957     .0075    1.372
  Uniform         −.0397    2.768     .0420    2.353    −.0361    2.726
  Student-t       −.0740    2.874    −.0584    2.114    −.0188    2.503
  Cont. Normal     .0836    1.455    −.4053    2.715     .1494    2.905
Table 5.9: Monte Carlo results for selected estimates of λ when λ₀ = .25, β₀ = 25.629, σ₀ = .25, and n = 40 in the single-sample model (5.22) under the modified power transformation (5.24).

                        λ̂_CVM             λ̂_B&C             λ̂_R&A
  errors ↓          bias     var       bias     var       bias     var
  Normal ±4        .0232    2.033    −.0199    1.607     .0178    1.829
  Normal ±2        .0244    1.591    −.0418    1.097     .0397    1.580
  Uniform          .0113    3.273     .0088    1.656    −.0259    3.041
  Student-t       −.0053    3.188     .0319    2.991     .0626    3.055
  Cont. Normal     .0090    3.161    −.1046    1.936     .0293    2.850
BIBLIOGRAPHY

Andrews, D. F. (1971) "A Note on the Selection of Data Transformations", Biometrika 58, 249-254.

Atkinson, A. C. (1973) "Testing Transformations to Normality", Journal of the Royal Statistical Society, B 35, 473-479.

Atkinson, A. C. (1985) Plots, Transformations, and Regression, Clarendon Press, Oxford.

Bartlett, M. S. (1947) "The Use of Transformations", Biometrics 3, 39-52.

Bassett, G. and Koenker, R. (1978) "Asymptotic Theory of Least Absolute Error Regression", Journal of the American Statistical Association 73, 618-622.

Bates, D. M., Wolf, D. A., and Watts, D. G. (1986) "Nonlinear Least Squares and First-order Kinetics", in Proceedings of Computer Science and Statistics: Seventeenth Symposium on the Interface, ed. David Allen, New York: North-Holland.

Bickel, P. J. (1975) "One-step Huber Estimates in the Linear Model", Journal of the American Statistical Association 70, 428-434.

Bickel, P. J. and Doksum, K. A. (1981) "An Analysis of Transformations Revisited", Journal of the American Statistical Association 76, 296-311.

Boos, D. D. (1981) "Minimum Distance Estimators for Location and Goodness of Fit", Journal of the American Statistical Association 76, 663-670.

Box, G. E. P. and Cox, D. R. (1964) "An Analysis of Transformations", Journal of the Royal Statistical Society, B 26, 211-252.

Box, G. E. P. and Cox, D. R. (1982) "An Analysis of Transformations Revisited", Journal of the American Statistical Association 77, 209-210.

Box, G. E. P. and Hill, W. J. (1974) "Correcting Inhomogeneity of Variance with Power Transformation Weighting", Technometrics 16, 385-389.

Box, G. E. P. and Tidwell, P. W. (1962) "Transformation of the Independent Variables", Technometrics 4, 531-550.

Breiman, L. and Friedman, J. H. (1985) "Estimating Optimal Transformations for Multiple Regression and Correlation", Journal of the American Statistical Association 80, 580-619.

Butler, C. C. (1969) "A Test for Symmetry Using the Sample Distribution Function", Annals of Mathematical Statistics 40, 2209-2210.

Carroll, R. J. (1980) "A Robust Method for Testing Transformations to Achieve Approximate Normality", Journal of the American Statistical Association 74, 674-679.

Carroll, R. J. and Ruppert, D. (1981) "On Prediction and the Power Transformation Family", Biometrika 68, 609-615.

Carroll, R. J. and Ruppert, D. (1984) "Power Transformations when Fitting Theoretical Models to Data", Journal of the American Statistical Association 79, 321-328.

Carroll, R. J. and Ruppert, D. (1985) "Transformations in Regression: A Robust Analysis", Technometrics 27, 1-12.

Carroll, R. J. and Ruppert, D. (1987) "Diagnostics and Robust Estimation When Transforming the Regression Model and the Response", Technometrics 29, 287-299.

Carroll, R. J. and Ruppert, D. (1988) Transformations and Weighting in Regression, Chapman and Hall, New York and London.

Dahlquist, G., Bjorck, A. and Anderson, N. (1974) Numerical Methods, Prentice-Hall, Englewood Cliffs, New Jersey.

Draper, N. R. and Cox, D. R. (1969) "On Distributions and Their Transformation to Normality", Journal of the Royal Statistical Society, B 31, 472-476.

Duan, N. (1983) "Smearing Estimate: A Nonparametric Retransformation Method", Journal of the American Statistical Association 78, 605-610.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions, Wiley, New York.

Hastie, T. and Tibshirani, R. (1986) "Generalized Additive Models", Statistical Science 1, 297-318.

Hernandez, F. and Johnson, R. A. (1980) "The Large-Sample Behaviour of Transformations to Normality", Journal of the American Statistical Association 75, 855-861.

Hinkley, D. V. (1975) "On Power Transformations to Symmetry", Biometrika 62, 101-111.

Hinkley, D. V. (1977) "On Quick Choice of Power Transformation", Applied Statistics 26, 67-68.

Hinkley, D. V. and Runger, G. (1984) "Analysis of Transformed Data", Journal of the American Statistical Association 79, 302-309.

Huber, P. J. (1967) "Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions", Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 221-233.

Jennrich, R. I. (1969) "Asymptotic Properties of Nonlinear Least Squares Estimators", Annals of Mathematical Statistics 40, 633-643.

Kettl, E. (1987) "Some Applications of the Transform-Both-Sides Regression Model (dissertation)", Mimeo Series #1725, Department of Statistics, University of North Carolina at Chapel Hill.

Koziol, J. A. (1980) "On a Cramer-von Mises-Type Statistic for Testing Symmetry", Journal of the American Statistical Association 75, 161-167.

Malinvaud, E. (1970) "The Consistency of Nonlinear Regression", Annals of Mathematical Statistics 41, 956-969.

Nelder, J. A. and Mead, R. (1965) "A Simplex Method for Function Minimization", The Computer Journal 7, 308-313.

Oberhofer, W. (1982) "The Consistency of Nonlinear Regression Minimizing the L₁-norm", The Annals of Statistics 10, 316-319.

Orlov, A. I. (1972) "On Testing the Symmetry of Distributions", Theory of Probability and Its Applications 17, 357-361.

Parr, W. C. and De Wet, T. (1981) "On Minimum Cramer-von Mises-Norm Parameter Estimation", Communications in Statistics, Part A - Theory and Methods 10(12), 1149-1166.

Rothman, E. D. and Woodroofe, M. (1972) "A Cramer-von Mises Type Statistic for Testing Symmetry", Annals of Mathematical Statistics 43, 2035-2038.

Rubin, H. (1956) "Uniform Convergence of Random Functions with Applications to Statistics", The Annals of Mathematical Statistics 27, 200-203.

Ruppert, D. and Aldershof, B. (1989) "Transformations to Symmetry and Homoscedasticity", Journal of the American Statistical Association 84, 437-446.

Ruppert, D. and Carroll, R. J. (1985) "Data Transformations in Regression Analysis with Applications to Stock-Recruitment Relationships", in Resource Management, M. Mangel editor, Lecture Notes in Biomathematics 61, Springer Verlag, New York.

Ruppert, D., Cressie, N., and Carroll, R. J. (1989) "A Transformation/Weighting Model for Estimating Michaelis-Menten Parameters", Biometrics 45, 637-656.

Serfling, R. J. (1980) Approximation Theorems of Mathematical Statistics, Wiley, New York.

Serfling, R. J. (1984) "Generalized L-, M-, and R-Statistics", The Annals of Statistics 12, 76-86.

Snee, R. D. (1986) "An Alternative Approach to Fitting Models when Re-expression of the Response is Useful", Journal of Quality Technology 18, 211-225.

Taylor, J. M. G. (1985) "Power Transformations to Symmetry", Biometrika 72, 145-152.

Taylor, J. M. G. (1986) "The Retransformed Mean After a Fitted Power Transformation", Journal of the American Statistical Association 81, 114-118.

Van Zwet, W. R. (1964) "Convex Transformations of Random Variables", Mathematisch Centrum, Amsterdam.

Wu, C. F. (1981) "Asymptotic Theory of Nonlinear Least Squares Estimation", The Annals of Statistics 9, 501-513.