Kettl, Ernestine E. (1987). Some Applications of the Transform-Both-Sides Regression Model.

SOME APPLICATIONS OF THE
TRANSFORM-BOTH-SIDES REGRESSION MODEL
by
Ernestine E. Kettl
A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Statistics

Chapel Hill

1987
Approved by

Advisor

Reader
ERNESTINE E. KETTL. Some Applications of the Transform-Both-Sides Regression Model. (Under the direction of RAYMOND J. CARROLL.)
Transform-both-sides regression is a useful method when an assumed relationship exists between the response and the predictors but the way in which error enters the model is not well understood. Transform-both-sides applies the same transformation to both the response and the regression function, thereby preserving the relationship between the response and the predictors yet allowing variability to enter the model in different ways.
In this dissertation we investigate three applications of the transform-both-sides regression model.
First, the transform-both-sides model is extended to include power-of-the-mean weighting. Power transformation and weighting are used together to remove heteroscedasticity and induce symmetry in the error distribution. The regression parameters are estimated using a generalized least squares method where the transformation parameters are fixed. An approximate confidence region is obtained by allowing the transformation parameters to range over a grid of values; the regression parameters are then estimated for each fixed pair of the transformation parameters. We compute the asymptotic distribution of the estimated parameters and show that as N → ∞ and then σ → 0 there is no effect due to fixing the transformation parameters. This approach is then applied to three sets of data.
Next we consider the Errors-in-Variables problem in the transform-both-sides model. We compute the asymptotic distribution for the estimated parameters when the measurement error is ignored and form corrected estimators to remove the resulting bias.
Finally we study the nonparametric ACE procedure in the transform-both-sides setting. We show that the parameterized transform-both-sides ACE criterion yields inconsistent estimates for the transformation parameter when it is equal to zero. We find that ACE does not account for heteroscedasticity, nor does it always identify the correct transformation in a collection of examples generated by a transform-both-sides model. In some numerical examples we find ACE to be useful in discriminating between competing models and able to identify known influential points.
ACKNOWLEDGEMENTS
I want to express my gratitude to my advisor, Raymond Carroll, for the amount of time and effort he put forth to direct this work. I appreciate his unlimited patience and support, and thank him for his guidance and encouragement throughout the preparation of this thesis.

I acknowledge the members of my committee, David Ruppert, Don Richards, Bob Rodriguez, and M. R. Leadbetter, for their diligence in reading the manuscript in such a timely fashion and for the careful attention given to detail. In addition I thank Bob Rodriguez for making available the programmed ACE procedure, and the SAS Institute for allowing me the use of their computing facilities.
I also thank the members of the staff who have been most helpful throughout the past five years, especially June Maxwell for somehow always managing to keep things in perspective.

A special note of appreciation to my roommate, classmate and friend, Marie Davidian. More than anyone else she knows exactly what obtaining this degree has meant. She has been there to pick up the pieces when things fell apart and to celebrate when things finally did go right. Thanks for being there when it counted.
I also want to thank my fellow graduate students for their friendship and support. Without you, David, Yin, Charu, George, ..., this would not have been possible.

Finally I want to thank my family and friends for their encouragement throughout my education. It is truly appreciated.
This work was supported by the US Air Force Office of Scientific
Research under Grant AFOSR F49620-82-C-0009.
TABLE OF CONTENTS

CHAPTER I: INTRODUCTION
  1.0 Introduction
  1.1 Transformations
  1.2 Heteroscedasticity
  1.3 Errors-in-Variables
  1.4 Nonparametric Estimation of the Transformation
  1.5 Summary of Results

CHAPTER II: TRANSFORM-BOTH-SIDES AND POWER-OF-THE-MEAN WEIGHTING
  2.0 Introduction
  2.1 Theoretical Analysis
  2.2 Examples
    2.2.1 Capacitance Data
    2.2.2 The Skeena River Sockeye Salmon Data
    2.2.3 Population A Sockeye Salmon Data
    2.2.4 Discussion
  2.3 Appendix
    2.3.1 Heuristic Sketch of the Carroll-Ruppert Efficiency Result
    2.3.2 A Heuristic Derivation of (2.3)
    2.3.3 Approximate Confidence Regions for (λ,θ)
  2.4 Tables and Figures

CHAPTER III: ERRORS-IN-VARIABLES AND THE TRANSFORM-BOTH-SIDES MODEL
  3.0 Introduction
  3.1 Asymptotic Theory
  3.2 Detailed Calculations
  3.3 Appendix

CHAPTER IV: THE ACE ALGORITHM AND TRANSFORM-BOTH-SIDES REGRESSION
  4.0 Introduction
  4.1 The ACE Criterion in the Transform-Both-Sides Setting
    4.1.1 Theoretical Analysis
    4.1.2 Examples
  4.2 Using ACE to Determine Model Suitability
    4.2.1 The Atlantic Menhaden Fishery Data
    4.2.2 Chemical Reaction Data
  4.3 ACE as a Diagnostic Tool
  4.4 Discussion
  4.5 Appendix
  4.6 Tables and Figures

BIBLIOGRAPHY
CHAPTER I

INTRODUCTION

1.0 Introduction
In regression one seeks to find a relationship between the response, y_i, and the independent variables, x_i = (x_{i1}, ..., x_{ip}). Frequently there is a previously known and well understood functional relationship between the response variable y_i and the explanatory variables x_i. This relationship might be derived from some physical process or biological system that is known, in the absence of error, to generate data according to y_i = f(x_i, β), where f is some known function and β is a vector of unknown parameters. Error is introduced into the system through measurement and modeling error as well as through random variability.
While it is not always clear how this error enters the model, it is common to assume that the error is additive, i.e.,

(1.1)    y_i = f(x_i, β) + σε_i,

where the ε_i are independent and identically distributed standard normal random variables. When the data indicate that this model is inadequate one might then assume that the errors are multiplicative and lognormal, so that

(1.2)    y_i = f(x_i, β) exp(σε_i).
Models (1.1) and (1.2) are based on the same theoretical transformation model, where we write

(1.3)    h(y_i) = h(f(x_i, β)) + σε_i

for some monotonic transformation h(·). This transform-both-sides approach preserves the assumed relationship between y_i and x_i, yet allows variability to enter the model in different ways.
Although transforming both sides of the regression equation is not a new technique, Carroll and Ruppert (1984) are the first to advocate its routine use for reasons other than linearizing the model, i.e., for symmetrizing the error distribution and stabilizing the variance. We consider some applications of this approach in the following chapters. In the first two sections of this chapter we describe in detail the transform-both-sides model and review the transformation and heteroscedasticity literature. In Section 1.3 the relevant literature for the measurement error Errors-in-Variables problem is discussed. In Section 1.4 we briefly discuss the ACE algorithm, a nonparametric procedure for estimating transformations. Section 1.5 contains a brief summary of the results obtained.
1.1 Transformations
Transformations have long been used to stabilize variance and to transform responses from a known nonnormal distribution to approximate normality, thereby allowing the use of normal theory methods. Transformations have also been used to linearize regression models when the responses are related to the independent variables by a known nonlinear function of the unknown parameters. Standard least squares methods can then be used to analyze the model if the resulting errors are independent, homogeneous and normally distributed. For a review of transformations see Draper and Smith (1981, pp. 220-241) and Cook and Weisberg (1982, pp. 58-86).
In certain problems, standard regression procedures sometimes yield residuals that are skewed, heteroscedastic or otherwise nonnormal. Either some violation of the assumptions of independence, homogeneity, or normality has occurred, or there may exist some other deficiency in the postulated model. Box and Cox (1964) propose, for a positive response variable y, a family of transformations indexed by the parameter λ:

(1.4)    y^(λ) = (y^λ − 1)/λ   if λ ≠ 0,
         y^(λ) = log(y)        if λ = 0.

They assume that there exists some transformation for which y^(λ) will be normally distributed with constant variance and additive errors. Since (1.4) assumes that y is positive, only approximate normality can be achieved, except for λ = 0. Furthermore, a single transformation may not exist that will simultaneously achieve all the stated objectives, and the scale chosen may be a compromise selection.
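The family (1.4) is simple to compute. As an illustrative aside (ours, not part of the original text), a minimal Python sketch of the transformation for a positive response vector is:

    import numpy as np

    def box_cox(y, lam):
        """Box-Cox power transformation (1.4); y must be positive."""
        y = np.asarray(y, dtype=float)
        if lam == 0:
            return np.log(y)
        return (y ** lam - 1.0) / lam

For example, box_cox(y, 0.5) gives the square-root-type member of the family, and box_cox(y, 0) the log transformation.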
The Box-Cox transformation family has received much attention in the literature. Maximum likelihood estimation and normal theory methods have been investigated by Draper and Cox (1969), Andrews (1971), Atkinson (1973), Hinkley (1975), Hernandez and Johnson (1980), Carroll (1980, 1982b, 1982c, 1982d) and Carroll and Ruppert (1981). Robust estimation and diagnostic methods for this model are reviewed by Carroll and Ruppert (1985).
Much discussion has been generated over the question of how one should proceed once an appropriate transformation has been selected. In practice one usually estimates λ and then proceeds as if the estimated λ were the correct scale. In the "transform the response only" model, where y_i^(λ) = f(x_i, β) + σε_i, Bickel and Doksum (1981) estimate (β, σ, λ) jointly and show that the estimate of β when λ is unknown can be much more variable than when λ is known. A meaningful interpretation of results in the joint estimation scheme is difficult since it is made in reference to an unknown scale. Although the error of the estimated parameters may be large, Carroll and Ruppert (1981) show that prediction error in the original scale is not much larger than when λ is known. In many important cases, because the differences between knowing and estimating λ are small, and since linear model parameters are meaningful only in reference to a particular scale, Box and Cox (1982) and Hinkley and Runger (1984) recommend conditioning on λ̂ as if it were known and reporting subsequent results on the estimated scale. If one is willing to accept the estimated scale as the true scale, such an approach leads to results that are more easily interpreted since they are based on a known scale.
Often in practice, after estimating λ, one then chooses the nearest value of λ that makes sense within the framework of the problem. The selected transformation may have more scientific relevance and lead to simpler, more meaningful interpretations of the results than analyses done on the λ̂-scale. Carroll (1982b) addresses this problem and shows that restricted estimation of λ can lead to inferences different from the maximum likelihood estimation of λ, although in many practical problems the restricted estimate works well.
In the transform-both-sides problem, the same transformation is applied to the response as well as to the regression function. Carroll and Ruppert (1984) investigate maximum likelihood estimation in the "transform-both-sides" problem using the Box-Cox family of transformations, i.e.,

(1.5)    y_i^(λ) = f^(λ)(x_i, β) + σε_i.

This is model (1.3) where h(·) is the Box-Cox transformation.
Besides preserving the assumed relationship between x_i and y_i, transform-both-sides is appealing in that, unlike the Box-Cox "transform the response only" model, f(x_i, β) still has meaning after transformation. According to model (1.5), y_i^(λ) is normally distributed about f^(λ)(x_i, β); therefore f^(λ)(x_i, β) gives the conditional mean and median of y_i^(λ)|x_i. Unless λ = 1, f(x_i, β) is not the conditional mean for y_i|x_i. Regardless of the value of λ, however, f(x_i, β) is the conditional median for the untransformed response y_i|x_i. See Ruppert and Carroll (1985) for a discussion of the conditional distribution of y_i|x_i in the transform-both-sides setting.
Transformations have also been applied to both sides of the equation where the response and each of the explanatory variables are transformed by not necessarily the same transformation. See for example Box and Tidwell (1962) and Boylan, et al. (1982). This differs from the transform-both-sides of Carroll and Ruppert (1984), where the same transformation is applied to both the response and the known regression function.
There are settings in which transformations can be used to remove heteroscedasticity and to induce symmetry. Using Bartlett's (1947) method for stabilizing variance, by Taylor approximation we get

(1.6)    Var(y_i^(λ)) ≈ [ dy^(λ)/dy |_{y_i = f(x_i,β)} ]² Var(y_i).

In order for Var(y_i^(λ)) to be approximately constant, Var(y_i) must be proportional to [f(x_i, β)]^{2(1−λ)}. Therefore if the variance of y_i is proportional to its expected value, then a value for λ can be found so that the transformed response will have approximately homogeneous errors.
Since the Jacobian of the transformation, y^{λ−1}, forces the transformed density to alter its shape, skewness can also be reduced by transformation. Suppose that the density of y is positively skewed. To induce symmetry, the large values of y that come from the extended right tail must be brought together and the smaller values in the shorter left tail must be spread out. Since y^(λ) is increasing in y and is concave for λ ≤ 1, the desired adjustment in shape will occur. Smaller values of λ increase the amount of compression in the right tail. For λ ≥ 1, y^(λ) is convex; similarly, left-skewed distributions can be transformed to approximate symmetry.
As demonstrated, power transformations can reduce both skewness and heteroscedasticity in a model. However, the value of λ that transforms to approximate symmetry may not be the same value that stabilizes the variance. Therefore, if a model shows substantial skewness and heteroscedasticity, it may be an unrealistic expectation to require that a single transformation simultaneously achieve both symmetry and homoscedastic errors.
It is noted by Zarembka (1974) that the estimate of the transformation parameter of the response is biased in the direction of stabilizing the variance. In some econometric problems, where the interest in λ is to determine the optimal functional form, it is important for the researcher to know what contributed to the estimate λ̂, i.e., nonnormality, heteroscedastic errors, or structural nonlinearity. Some work has been done in an attempt to model out the heteroscedasticity. See Gaudry and Dagenais (1979) and Blaylock and Smallwood (1985) for examples in Box-Tidwell type transformation models. Ruppert and Carroll (1985) extend their transform-both-sides model by including a separate parameter to model the heteroscedasticity. Their expanded model is more flexible, allowing θ to control the heteroscedasticity and λ to remove the skewness. Egy and Lahiri (1979) construct a similar model where they also model the heteroscedasticity by weighting with a power of the predictor variable x_i. Their model, in our notation, differs slightly from the Carroll-Ruppert transform-both-sides model with the inclusion of an intercept (α₀) and a slope (α₁). Their intent is to create a model such that the transformation parameter λ represents the nonlinearity of the structure and not the heteroscedasticity or nonnormality.
1.2 Heteroscedasticity
Heteroscedastic regression models, where

y_i = f(x_i, β) + σ_i ε_i,    i = 1, 2, ..., n,

f is a known function, β is a vector of unknown parameters, and σ_i represents the heteroscedasticity, have been used successfully to account for nonconstant variance. The variance σ_i can be of unknown form or modeled as a parametric function of the explanatory variables.
One popular form of σ_i is the "power-of-the-mean" model, where

σ_i = σ f^θ(x_i, β)   for f(x_i, β) > 0,

and θ and σ are unknown scalars. This model is widely used in applications where it is reasonable to expect a positive mean response. See for example Box and Hill (1974), Pritchard, Downie and Bacon (1977), and Bates, Wolf and Watts (1985).
When theoretical considerations have already determined the regression function, transform-both-sides is a way to preserve the functional relationship and still allow for transformation. The "transform-both-sides" model of Carroll and Ruppert (1984) is related to the "power-of-the-mean" model when σ is small. Carroll and Ruppert (1984) show by a Taylor expansion that model (1.5) may be written as

y_i ≈ f(x_i, β) + σ f^{1−λ}(x_i, β) ε_i.

Thus for small σ, where θ ≈ 1 − λ, the transform-both-sides and power-of-the-mean models should yield approximately the same results. See Carroll and Ruppert (1984), Snee (1985), and Bates, Wolf and Watts (1985) for examples where transform-both-sides is used to analyze data.
Ruppert and Carroll (1985) expand their transform-both-sides model by modeling the heteroscedasticity separately. In addition to transforming both the response and the regression function by the Box-Cox transformation, they model the variance as a power of the explanatory variable, thus allowing for greater flexibility in the model.
While maximum likelihood is the standard procedure used to estimate parameters in transformation models, the usual method to estimate parameters in heteroscedastic regression is some form of generalized least squares (GLS). Consider the power-of-the-mean model,

y_i = f(x_i, β) + σ f^θ(x_i, β) ε_i,

where f is a known function, β is a vector of unknown parameters, θ is an unknown scalar and the ε_i are independent standard normal errors. In a generalized least squares procedure we start with some "good" preliminary estimate for β, say β̂_p. For example, as a preliminary estimate for β, we might use the unweighted least squares estimate.
If θ is known, we form the estimated weights, f^{2θ}(x_i, β̂_p), and then apply weighted least squares to estimate β. If θ is unknown, it too must be estimated. Box and Hill (1974), Pritchard, Downie and Bacon (1977), Jobson and Fuller (1980), and Carroll and Ruppert (1982a) study this model. Jobson and Fuller (1980) and Carroll and Ruppert (1982a) prove that if (β̂, θ̂) are consistent estimates of (β, θ), then all generalized least squares estimates of β will have the same asymptotic distribution as weighted least squares with known weights. Carroll and Ruppert (1982a) propose a pseudo-likelihood approach for obtaining a consistent estimate of θ.
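As a rough sketch of the iteration just described (ours; it assumes SciPy is available, and the helper name and the fixed number of reweighting passes are illustrative):

    import numpy as np
    from scipy.optimize import least_squares

    def power_of_mean_gls(x, y, f, beta0, theta, n_iter=3):
        """GLS for y_i = f(x_i, b) + sigma * f^theta(x_i, b) * eps_i, theta known.
        Start from an unweighted fit, then reweight by the current estimate
        of the power-of-the-mean standard deviation f^theta."""
        beta = least_squares(lambda b: y - f(x, b), beta0).x  # preliminary estimate
        for _ in range(n_iter):
            w = f(x, beta) ** (-theta)  # weights 1/f^theta on the sd scale
            beta = least_squares(lambda b: w * (y - f(x, b)), beta).x
        return beta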
Although asymptotically optimal when the model and the distributional assumptions are correct, maximum likelihood estimators are sensitive to outliers and are not robust against departures from the assumptions. McCullagh (1983) and Carroll and Ruppert (1982a) caution against the routine use of normal theory maximum likelihood estimation, since the maximum likelihood estimates can be much less efficient than generalized least squares estimates for nonnormal distributions. Carroll and Ruppert (1982b) show that misspecifications in the form of the variance function lead to a bias in β̂_MLE while hardly affecting the asymptotic distribution of β̂_GLS.

As a numerical indicator of heteroscedasticity, Carroll and Ruppert (1987) compute Spearman's rank correlation coefficient (ρ) between the absolute studentized residuals and the predictor or predicted values. This correlation is computed by ranking each variable and computing the usual correlation based on the ranks. Since Spearman's correlation is not changed by monotone transformations of either variable, we can use squared residuals or their logarithms, as well as the logarithms of the predictors or predicted values. The Spearman correlation significance levels can be used as a rough indicator of heteroscedasticity. However, one should not rely too strongly on these significance levels since no formal theory has yet been developed.
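A minimal version of this numerical check, assuming SciPy is available (the function name is ours):

    import numpy as np
    from scipy.stats import spearmanr

    def spearman_het_check(residuals, fitted):
        """Spearman rank correlation between |residuals| and fitted values.
        A markedly positive rho suggests variance growing with the mean;
        the p-value is only a rough guide, as cautioned above."""
        rho, pvalue = spearmanr(np.abs(residuals), fitted)
        return rho, pvalue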
1.3 Errors-in-Variables
In the usual regression situation we assume the responses y_i are random variables but that the x_i's are fixed and known. However, in many practical situations the x_i's are subject to measurement error; for example, physical quantities such as temperature, pressure, and time cannot be measured precisely. Typically one ignores this variability and treats the observed x_i's as known. Such an analysis will lead to estimates that are biased, and the predicted values may differ from those based on the correct model.
There is a vast literature on the subject of measurement error. The problem was considered even before the turn of the century by Adcock (1878) and Kummell (1879). Much of the subsequent work has concentrated on the linear model. For discussion and review see Madansky (1959) and Kendall and Stuart (1979, Ch. 29). The asymptotic normality of the maximum likelihood estimates of the regression parameters is established in the univariate case by Fuller (1980) and in the multivariate case by Gleser (1981). Recent interest has been in the area of nonlinear models; see Wolter and Fuller (1982a and 1982b), Stefanski (1983) and Carroll, et al. (1984). Amemiya and Fuller (1985) consider estimation in an implicit nonlinear functional relationship.
Stefanski (1985) presents a general formulation for the Errors-in-Variables (EIV) problem encompassing both linear and nonlinear models and functional and structural relationships.
We want to consider the nonlinear transform-both-sides model where the observations (y_i, X_i) are such that the predictors x_i are measured with error. Suppose we have the following model when x_i contains no error:

(1.8)    y_i^(λ) = f^(λ)(x_i, β) + σ₁ε_i,

where f is a known function, β is an unknown vector of parameters, and the superscript (λ) denotes the Box-Cox transformation. Further assume that the ε_i are independent, symmetric with mean 0 and variance 1. Assume also that we can model the measurement error as X_i = x_i + σ₂v_i, where the v_i's are independent, symmetric, with mean 0 and variance 1, and ε_i is independent of v_i for all i. Model (1.8), postulated for (y_i, x_i), does not necessarily hold for the observed data (y_i, X_i).
To demonstrate this, let us consider simple linear regression. Assume y_i and X_i are independent given x_i, and that x_i ~ N(ξ, σ₃²). Assume also that y_i|x_i ~ N(α + βx_i, σ₁²), where y_i = α + βx_i + σ₁ε_i, and X_i|x_i ~ N(x_i, σ₂²), where X_i = x_i + σ₂v_i. Working with the appropriate joint and marginal densities we find that X_i ~ N(ξ, σ₂² + σ₃²) and y_i|X_i ~ N(α_M + β_M X_i, σ_M²), where

α_M = α + βξσ₂²/(σ₂² + σ₃²),   β_M = βσ₃²/(σ₂² + σ₃²),   σ_M² = σ₁² + β²σ₂²σ₃²/(σ₂² + σ₃²).

Thus in this simple linear regression we retain the linear structure, but the slope β_M is smaller in magnitude than the original slope β, the intercept α_M differs from α by an additive term, and the variance σ_M² is increased by a term depending on the original slope and the error variances of x_i and X_i|x_i.
Conditioning on the observed X_i changes the model considerably, even in this simple case. It is not at all clear what will happen in a more complicated nonlinear model.
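A small simulation (ours, with arbitrary parameter values) makes the attenuation concrete; the naive slope estimate converges to βσ₃²/(σ₂² + σ₃²) rather than to β:

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha, beta = 100_000, 1.0, 2.0
    s1, s2, s3 = 0.5, 0.8, 1.0            # sigma_1, sigma_2, sigma_3

    x = rng.normal(0.0, s3, n)            # true predictors, here with xi = 0
    y = alpha + beta * x + s1 * rng.normal(size=n)
    X = x + s2 * rng.normal(size=n)       # observed, error-prone predictors

    c = np.cov(y, X)
    print(c[0, 1] / c[1, 1])              # about beta * s3**2/(s2**2 + s3**2) = 1.22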
If we look at the transform-both-sides model, where

(1.9)    y_i^(λ) = f^(λ)(x_i, β) + σ₁ε_i,

so that y_i^(λ)|x_i ~ N(f^(λ)(x_i, β), σ₁²), and if we assume that X_i|x_i ~ N(x_i, σ₂²) and x_i has density g, then the density of y_i^(λ)|X_i, say h, is

h(t) = [∫ exp(−½{[(t − f^(λ)(x,β))/σ₁]² + [(X_i − x)/σ₂]²}) g(x) dx] / [σ₁(2π)^{1/2} ∫ exp(−½[(X_i − x)/σ₂]²) g(x) dx].

Depending on g, we may or may not be able to get a closed form for h, and the resulting conditional model may bear little resemblance to the original model. If h(t) has a closed form then standard likelihood procedures can be used to estimate the parameters.
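When no closed form is available, h(t) can still be evaluated numerically. The sketch below (ours; it assumes a scalar predictor, a positive regression function, SciPy, and the box_cox helper from Section 1.1) approximates the ratio of integrals displayed above by quadrature:

    import numpy as np
    from scipy.integrate import quad

    def h_density(t, X, f, beta, lam, s1, s2, g):
        """Density of y^(lambda) given the observed X, by numerical quadrature."""
        def num(x):
            r1 = (t - box_cox(f(x, beta), lam)) / s1
            r2 = (X - x) / s2
            return np.exp(-0.5 * (r1 ** 2 + r2 ** 2)) * g(x)

        def den(x):
            return np.exp(-0.5 * ((X - x) / s2) ** 2) * g(x)

        top = quad(num, -np.inf, np.inf)[0]
        bot = quad(den, -np.inf, np.inf)[0]
        return top / (s1 * np.sqrt(2.0 * np.pi) * bot)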
We hope that the model continues to hold, at least approximately, in the observed predictors, i.e., that

(1.10)    y_i^(λ) ≈ f^(λ)(X_i, β) + σ₁ε_i.

In most cases (1.9) will be difficult if not impossible to work with, and since (1.10) is the desired result, we can use (1.8) to obtain estimates for β by pretending there is no error in the observed x_i's. If we then substitute X_i for x_i into the estimator, the resulting estimates are biased. Employing methods similar to those of Wolter and Fuller (1982) and Amemiya and Fuller (1985), we derive the asymptotic distribution of the estimators. Stefanski (1983) uses a similar asymptotic theory in a logistic regression model. See also Stefanski and Carroll (1985).
1.4 Nonparametric estimation of the transformation
Some research has been done in estimating transformations through the use of nonparametric regression, notably that of Breiman and Friedman (1985a, b), Young (1981, 1985), and Young, de Leeuw, and Takane (1976). The work of Young (1981, 1985) and Young, et al. (1976) centers around an algorithm called MORALS ("Multiple Optimal Regression by Alternating Least Squares"). Originally developed for use with qualitative data in a psychometric setting, the MORALS procedure deals with discrete data, whereas the more recent ACE ("Alternating Conditional Expectation") algorithm of Breiman and Friedman (1985) is more flexible. The ACE procedure employs nonparametric smoothing and yields numerical transformations such that the subsequent plots are easier to interpret than the MORALS output. While the general transformations of ACE and MORALS are not necessarily monotone, they can be determined quite readily with ACE, although some grouping of data into discrete intervals must occur before MORALS can be applied. See Rodriguez (1985) for a comparison of the ACE and MORALS procedures. Breiman and Friedman (1985a) develop a large body of theoretical results for the ACE procedure. Nonparametric monotone transformations have previously been studied by Kruskal (1964, 1965).
1.5 Summary of results
The discussion of Sections 1.1 and 1.2 reviews the role of the transform-both-sides approach in nonlinear regression. We investigate some applications of the transform-both-sides model, where specifically we consider heteroscedasticity, measurement error and a nonparametric regression procedure for estimating the transformation.

In Chapter 2 we use transform-both-sides but model the variance separately as a power-of-the-mean. We consider the model

(1.11)    y_i^(λ) = f^(λ)(x_i, β) + σ f^θ(x_i, β) ε_i,

where the ε_i are symmetric, independent and identically distributed with mean 0 and variance 1. As in Carroll and Ruppert (1984), we compute the asymptotic distribution for the estimators when (λ,θ) are fixed and then again when (λ,θ) are estimated. We find that there is no effect on the asymptotic distribution of the estimated regression parameters.
In Chapter 3 we study the effect of measurement error in the transform-both-sides setting. Here the predictors x_i are measured with error, and the observed values (y_i, X_i) are such that X_i = x_i + σ₂v_i. The resulting estimates for the parameters are biased. We calculate the asymptotic distribution of the estimators and study the effect of this bias for the transform-both-sides Errors-in-Variables model. We also compute bias-corrected estimators and determine the effect of measurement error on the corrected estimates.
In Chapter 4 we explore how the ACE algorithm performs in the transform-both-sides framework. Since we are interested in the modified power transformation family applied to continuous data, we do not consider the MORALS procedure. We show that for λ = 0, the transform-both-sides parameterized version of the ACE criterion does not yield consistent estimates for λ. Data generated according to model (1.5) with λ = 0 are used to study the performance of ACE in the transform-both-sides setting. We then use ACE as a tool for selecting the better of two or more competing models. Finally we construct two case deletion diagnostics and study the effectiveness of ACE as a diagnostic tool for identifying influential points.
CHAPTER II

TRANSFORM-BOTH-SIDES AND POWER-OF-THE-MEAN WEIGHTING

2.0 Introduction
Suppose that in the absence of random error the relationship between x and y is such that y_i = f(x_i, β). Consider the model

(2.1)    y_i^(λ) = f^(λ)(x_i, β) + σ f^θ(x_i, β) ε_i,

where the response and the regression function are both transformed by a Box-Cox power transformation and the variance is modeled as a power-of-the-mean. Here f(x_i, β) is a known function and η = (β, σ, λ, θ) is a vector of unknown parameters. Assume the ε_i are independent and symmetric with mean 0 and variance 1. The way in which error enters the model is controlled by the choice of (λ, θ).
In the study of this transformation model we employ the now standard small-σ asymptotic distribution theory of Bickel and Doksum (1981). Calculations made in computing the asymptotic distribution when σ is fixed are complex and yield little insight into the problem, whereas the small-σ asymptotic theory, where N → ∞ and then σ → 0, leads to major simplifications, allowing a tractable theory to be developed in the transformation setting. In data sets where the data agree closely with the model, i.e., where there is little variation about the mean, the small-σ assumption is reasonable. See for example Bickel and Doksum (1981) and Carroll and Ruppert (1987).
Carroll and Ruppert (1987) suggest using the median coefficient of variation, median{s_i/|ŷ_i|}, to determine the adequacy of modeling the variance as a power-of-the-mean. Based on their experience they find the median coefficient of variation is often less than .35 when power-of-the-mean variance is used. Since the small-σ assumption in model (2.1) yields a power-of-the-mean equivalent model, we use the median coefficient of variation to indicate the appropriateness of the small-σ assumption in a given situation. For model (2.1), the median coefficient of variation for a given (λ,θ) combination is defined as median_i [σ̂ f^θ̂(x_i, β̂)/f(x_i, β̂)].
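In code the diagnostic is a one-liner (our sketch; the argument names are illustrative):

    import numpy as np

    def median_cv(x, f, beta_hat, sigma_hat, theta_hat):
        """Median coefficient of variation for model (2.1); values well
        above roughly .35 cast doubt on the small-sigma assumption."""
        mu = f(x, beta_hat)
        return np.median(sigma_hat * mu ** theta_hat / mu)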
In Section 2.1 we ascertain the effect of fixing (λ,θ) on the asymptotic distribution of β̂. We employ the small-σ asymptotic theory where N → ∞ and then σ → 0. In Section 2.2 the methods discussed in Section 2.1 are applied to three examples. Section 2.3 is the appendix for Chapter 2 and Section 2.4 contains all the tables and plots for the examples.
2.1 Theoretical Analysis
As discussed in Chapter 1, the parameter β has physical meaning whether λ is known or unknown; that is, f(x,β) is the median of y regardless of the value of λ. Thus conditioning on the value of λ should not lead to the same controversy that arises in the Box-Cox model where only the response is transformed. We can estimate all the elements of η simultaneously using maximum likelihood, but this is complex numerically. By treating (λ,θ) as fixed and known and conditioning on these values, standard nonlinear regression software can be used to estimate (β,σ). We
estimate here the effect of treating (λ,θ) as known by computing the asymptotic distributions of (β̂,σ̂,λ̂,θ̂) and of (β̂,σ̂) given (λ,θ). Carroll and Ruppert (1984) have done this for the case where θ = 0. As they did, in order to get manageable expressions, we employ the small-σ asymptotic theory to derive and compare the asymptotic distributions of (β̂,σ̂,λ̂,θ̂) and (β̂,σ̂|λ,θ).
We make the following assumptions, to be used throughout the arguments contained in this section.

(A.1) The (ε_i, x_i) are independent and identically distributed, and (ε_i) and (x_i) are independent.

(A.2) The x_i are uniformly bounded.

(A.3) There exist 0 < M₁, M₂ < ∞ such that for all x, ||β* − β|| ≤ M₂ implies that M₁⁻¹ ≤ f(x, β*) ≤ M₁.

(A.4) The standard convergence result for normal theory maximum likelihood holds (Serfling, 1980, Chapter 4, and Rao, 1973, Section 5f.2) for each σ; i.e., as N → ∞,

N^{1/2}(β̂ − β, σ̂ − σ, θ̂ − θ, λ̂ − λ) ⇒ N(0, I*(β,σ,θ,λ)⁻¹),

and, if l is the loglikelihood function,

E{∂l(β,σ,θ,λ)/∂(β,σ,θ,λ)ᵀ} = 0.
(A.5) The standard maximum likelihood asymptotic distribution theory holds if λ is known, θ is known, or both; i.e., the limiting covariance matrix of the remaining estimates is the inverse of the appropriate information matrix.

(A.6) The matrix A defined in Lemma 2.2 exists and is positive definite.

(A.7) The first two derivatives of f(x_i, β) exist, are continuous, and are uniformly bounded.

Remark: Assumption (A.4) holds strictly only for λ = 0, because y^(λ) is undefined for y < 0, and if P(y > 0) = 1 then, except for λ = 0, exact transformation to normality is impossible. There is a long history, starting with Box and Cox (1964), of ignoring this problem. One way to make the transformation exact is to replace y^(λ) by (sign(y)|y|^λ − 1)/λ. We have instead made the decision to do our calculations in the usual way, so that, strictly speaking, they are exact only for λ = 0.
Now, for each fixed σ, write

I* = [ I_ββ  I_βσ  I_βθ  I_βλ
       I_σβ  I_σσ  I_σθ  I_σλ
       I_θβ  I_θσ  I_θθ  I_θλ
       I_λβ  I_λσ  I_λθ  I_λλ ].
Define F* = D I* D, where D = diag(σ, σ, 1, 1).

Lemma 2.1: For fixed σ,

N^{1/2}((β̂ − β)/σ, (σ̂ − σ)/σ, θ̂ − θ, λ̂ − λ) ⇒ N(0, F*⁻¹).

Proof: Matrix algebra using (A.4). ∎
The loglikelihood is given by

l(η) = −log σ − θ log[f(x_i,β)] + (λ − 1) log y_i − (1/2) {[h(y_i,λ) − h(f(x_i,β),λ)]/[σ f^θ(x_i,β)]}².

Write h(x_i,β,λ) = f^(λ)(x_i,β). We proceed on a case-by-case inspection of F*, as σ → 0. The strong boundedness conditions (A.2), (A.3) and (A.7) make these calculations legitimate.
Lemma 2.2: Write Y_j = E{[log f(x_i,β)]^j}, j = 1, 2, and let A = E[h_β(x_i,β,λ) h_βᵀ(x_i,β,λ)/f^{2θ}(x_i,β)], where h_β = ∂h/∂β. Then as σ → 0,

F* → F*_∞ = [ A   0
              0   M* ],

where, with rows and columns ordered as (σ, θ, λ),

M* = 2 [  1    Y₁   −Y₁
          Y₁   Y₂   −Y₂
         −Y₁  −Y₂    Y₂ ].

Proof: This is just a case-by-case examination of F*. We consider the terms σ²I_σσ, σ²I_ββ, and σ²I_βσ, and omit the details for the others.
Now,

l_σ = −σ⁻¹ + σ⁻³ [h(y,λ) − h(x,β,λ)]² / f^{2θ}(x,β),

and thus

l_βσ = −2σ⁻³ [h(y,λ) − h(x,β,λ)] h_β(x,β,λ) / f^{2θ}(x,β) − 2θσ⁻³ [h(y,λ) − h(x,β,λ)]² f_β(x,β) / f^{2θ+1}(x,β).

Since [h(y,λ) − h(x,β,λ)] = σ f^θ(x,β) ε, taking expectations gives

σ² I_βσ = −σ² E[l_βσ] = 2σθ E[f_β(x,β)/f(x,β)].

See (A.3). Thus lim_{σ→0} σ² I_βσ = 0.
The derivative of l_σ with respect to σ yields

l_σσ = σ⁻² − 3σ⁻⁴ [h(y,λ) − h(x,β,λ)]² / f^{2θ}(x,β),

and thus σ² I_σσ = 2. Note that

l_β = −θ f_β(x,β)/f(x,β) + σ⁻² [h(y,λ) − h(x,β,λ)] h_β(x,β,λ)/f^{2θ}(x,β) + θσ⁻² [h(y,λ) − h(x,β,λ)]² f_β(x,β)/f^{2θ+1}(x,β).
Now, writing Δ = h(y,λ) − h(x,β,λ),

l_ββ = θ [f_ββ(x,β)/f(x,β) − f_β(x,β)f_βᵀ(x,β)/f²(x,β)] [Δ²/(σ² f^{2θ}(x,β)) − 1]
      − 2θ² [f_β(x,β)f_βᵀ(x,β)/f²(x,β)] Δ²/(σ² f^{2θ}(x,β))
      + [Δ h_ββ(x,β,λ) − h_β(x,β,λ)h_βᵀ(x,β,λ)] / (σ² f^{2θ}(x,β))
      − 2θ Δ [h_β(x,β,λ)f_βᵀ(x,β) + f_β(x,β)h_βᵀ(x,β,λ)] / (σ² f^{2θ+1}(x,β)).

Since E[Δ²/(σ² f^{2θ})] = 1, the bracketed terms and the terms linear in Δ drop out in expectation, and only the h_β h_βᵀ term survives the scaling by σ². Therefore

lim_{σ→0} σ² I_ββ = E[h_β(x,β,λ) h_βᵀ(x,β,λ)/f^{2θ}(x,β)] = A.

The other terms are similar but no more difficult. ∎
The next lemma shows that for small σ, if either λ or θ is known, the maximum likelihood estimator for β has approximately the same distribution as if both λ and θ were known.

Definition: A statistic T_N will be said to satisfy T_N ⇒ N(0,Σ) as N → ∞ and then σ → 0 if for all d,

lim_{σ→0} lim_{N→∞} P{T_N ≤ d} = P{N(0,Σ) ≤ d}.

Lemma 2.3: Let β̂(λ,θ) be the maximum likelihood estimator for β at (λ,θ). Then as N → ∞ and then σ → 0,

N^{1/2}(β̂(λ,θ) − β)/σ ⇒ N(0, A⁻¹),
N^{1/2}(β̂(λ,θ̂) − β)/σ ⇒ N(0, A⁻¹),
N^{1/2}(β̂(λ̂,θ) − β)/σ ⇒ N(0, A⁻¹).
Proof: We verify only the last case, the others being similar. Let F*(β,σ,λ) be that part of F* with the row and column corresponding to the known θ left out. As in Lemma 2.1,

N^{1/2}((β̂ − β)/σ, (σ̂ − σ)/σ, λ̂ − λ) ⇒ N(0, F*(β,σ,λ)⁻¹).

Write bᵀ = (aᵀ 0 0). Then

P[N^{1/2} bᵀ((β̂ − β)/σ, (σ̂ − σ)/σ, λ̂ − λ) ≤ v] → P[bᵀ N(0, F*(β,σ,λ)⁻¹) ≤ v].

But as σ → 0, F*(β,σ,λ) → F*_∞(β,σ,λ). Since F*_∞(β,σ,λ) is of full rank and hence invertible,

F*(β,σ,λ)⁻¹ → F*_∞(β,σ,λ)⁻¹

and

bᵀ F*(β,σ,λ)⁻¹ b → bᵀ F*_∞(β,σ,λ)⁻¹ b = aᵀ A⁻¹ a.

Therefore

lim_{σ→0} P[bᵀ N(0, F*(β,σ,λ)⁻¹) ≤ v] = P[bᵀ N(0, F*_∞(β,σ,λ)⁻¹) ≤ v] = P[aᵀ N(0, A⁻¹) ≤ v]. ∎
We must now consider what happens if neither λ nor θ is known. In this case the entire matrix F*_∞ in Lemma 2.2 is no longer positive definite. The intuitive reason is that as σ → 0, transformation and weighting are no longer distinguishable. The technical problem is that while C_σ → C₀ and C_σ⁻ is a generalized inverse of C_σ, we need not necessarily have C_σ⁻ → C₀⁻. A simple counterexample is

C_σ = [1 0; 0 σ] → C₀ = [1 0; 0 0]   as σ → 0,

with generalized inverse

C_σ⁻ = [1 0; 0 1/σ],

which does not converge as σ → 0.

Lemma 2.4: Make the additional assumption that for any c > 0,

lim_{σ→0} lim_{N→∞} P{||(F* − F*_∞) T_N|| > c} = 0,

where

T_N = N^{1/2}((β̂ − β)/σ, (σ̂ − σ)/σ, θ̂ − θ, λ̂ − λ),   β̂ = β̂(λ̂,θ̂),

and ||·|| denotes the maximum element. Then as N → ∞ and then σ → 0, N^{1/2}(β̂ − β)/σ ⇒ N(0, A⁻¹).
Proof: By Lemma 2.1, for fixed σ, T_N ⇒ N(0, F*⁻¹), and therefore F* T_N ⇒ N(0, F*). By the assumption, according to the definition, this means that for any vector a,

lim_{σ→0} lim_{N→∞} P{aᵀ F*_∞ T_N ≤ v} = P{aᵀ N(0, F*_∞) ≤ v}.

To see this, note that for any c > 0,

P{aᵀ F*_∞ T_N ≤ v} = P{aᵀ F*_∞ T_N ≤ v and |aᵀ(F* − F*_∞)T_N| ≤ c} + P{aᵀ F*_∞ T_N ≤ v and |aᵀ(F* − F*_∞)T_N| > c}
                   ≤ P{aᵀ F* T_N ≤ v + c} + P{|aᵀ(F* − F*_∞)T_N| > c}.

Also,

P{aᵀ F*_∞ T_N ≤ v} ≥ P{aᵀ F* T_N ≤ v − c} − P{|aᵀ(F* − F*_∞)T_N| > c}.

Now let N → ∞ and then σ → 0; by the assumption of the lemma, for any c > 0,

lim_{σ→0} lim_{N→∞} P{aᵀ F* T_N ≤ v − c} ≤ lim_{σ→0} lim_{N→∞} P{aᵀ F*_∞ T_N ≤ v} ≤ lim_{σ→0} lim_{N→∞} P{aᵀ F* T_N ≤ v + c}.

Now let c → 0, so that

lim_{σ→0} lim_{N→∞} P{aᵀ F*_∞ T_N ≤ v} = P{aᵀ N(0, F*_∞) ≤ v}.

Therefore N^{1/2}(β̂ − β)/σ ⇒ N(0, A⁻¹) because of the form of F*_∞ (see Lemma 2.2). ∎
While the small-σ asymptotic distribution theory allows for a manageable theory, the results are of somewhat limited value since we are interested in all values of σ. Carroll and Ruppert (1984) show for the transform-both-sides model with fixed σ that the asymptotic relative efficiency of the MLE for β when λ is estimated, β̂(λ̂), relative to the MLE for β when λ is known, β̂(λ), is at least 2/π; that is, ARE[β̂(λ̂), β̂(λ)] ≥ 2/π. Our model differs in that we have also included the power-of-the-mean variance, which introduces the additional parameter θ. This however does not change the Carroll and Ruppert result, and we have that ARE[β̂(λ̂,θ̂), β̂(λ,θ)] ≥ 2/π. See Section 2.3 for a heuristic sketch of the proof. This result together with Lemma 2.4 tells us that the estimates from a nonlinear procedure where we condition on the value of (λ,θ) will be only moderately in error.

It is generally difficult to compute the maximum likelihood estimators in this problem. For this reason, we will use a generalized least squares estimate in all of our practical calculations. That is, for fixed (λ,θ), we will solve the generalized least squares equations (see Carroll and Ruppert, 1987, Chapter 2):
(2.2)    0 = Σ_{i=1}^{N} [y_i^(λ) − f^(λ)(x_i, β)] h_β(x_i, β, λ) / f^{2θ}(x_i, β),

where h_β(x_i, β, λ) = ∂f^(λ)(x_i, β)/∂β.
Perhaps the greatest justification for replacing β̂_MLE by β̂_GLS at each step of the calculations is that in the small-σ asymptotic distribution theory (N → ∞, then σ → 0), β̂_MLE(λ,θ) and β̂_GLS(λ,θ) have the same limit distributions in this sense:

(2.3)    lim_{σ→0} lim_{N→∞} ( P{(N^{1/2}/σ)[β̂_MLE(λ,θ) − β] ≤ v} − P{(N^{1/2}/σ)[β̂_GLS(λ,θ) − β] ≤ v} ) = 0

for all v. A sketch which indicates that this result is reasonable is given in Section 2.3.2 of the appendix.
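A sketch of such a fit at fixed (λ,θ) follows (ours, not the author's code; it reuses the box_cox helper and SciPy's least_squares, and holds the weights fixed within each pass so that the normal equations match (2.2)):

    import numpy as np
    from scipy.optimize import least_squares

    def tbs_gls(x, y, f, beta0, lam, theta, n_iter=5):
        """Iterative GLS for model (2.1) with (lam, theta) fixed."""
        beta = np.asarray(beta0, dtype=float)
        for _ in range(n_iter):
            w = f(x, beta) ** (-theta)  # weights frozen within this pass
            beta = least_squares(
                lambda b: w * (box_cox(y, lam) - box_cox(f(x, b), lam)), beta).x
        r = (box_cox(y, lam) - box_cox(f(x, beta), lam)) / f(x, beta) ** theta
        return beta, float(np.sqrt(np.mean(r ** 2)))  # beta_hat, sigma_hat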
2.2 Examples
In this section we apply the transform-both-sides power-of-the-mean model to three sets of data. This section demonstrates how the method can be applied in real situations and numerically illustrates how skewness and heteroscedasticity are addressed by the parameters (λ,θ).

In the first example, N is large (N = 297) and σ is small (the median coefficient of variation is approximately .1). In this example the assumed model can be linearized by transformation, but the transform-both-sides power-of-the-mean analysis makes it clear that linearization is not necessary.

The second and third examples arise from fisheries management. In both examples N is small, 26 and 27 respectively, and the median coefficients of variation are approximately .4 and .9. These examples are interesting in that we shall see the interplay between skewness and heteroscedasticity and the subsequent effect on the parameters (λ,θ).
2.2.1 Capacitance Data

These data originate from the National Bureau of Standards (Thurber, Lowney and Phillips, 1985), and are used in the characterization of deep-level defect centers in semiconductor depletion regions through the use of transient capacitance techniques. There are two observed variables: x = time and C(x) = capacitance measured at time x. The capacitance ratio, y = C_r(T,x), is a function of a fixed measurement temperature (T) and time (x), such that

y = C_r(T,x) = [C_b²(T) − C_i²(T)][C_f²(T) − C²(x)] / {[C_f²(T) − C_i²(T)][C_b²(T) − C²(x)]},

where

C_f(T) = the capacitance at reverse voltage, V_r, at temperature T,
C_b(T) = the capacitance at charging voltage, V_c, at temperature T,
C_i(T) = the capacitance when the reverse voltage, V_r, is restored at temperature T and at time x = 0,
C(x)   = the capacitance at time x.

See Figure 2.1 for a schematic. For these data

y = 5.805346 [667628.747 − C²(x)] / [785925.600447 − C²(x)],

where C_b(T) = 886.5245, C_f(T) = 817.0855, and C_i(T) = 801.8797. There are 11,872 original data points from which a total of 297 points are chosen by selecting every fortieth observation.
The relationship between time (x) and capacitance ratio (y) is assumed to be exponential. The data are plotted in Figure 2.2, where an exponential relationship is indicated. The log transformation is the linearizing transformation. Plotting log(y) against x (Figure 2.3) shows a heteroscedastic trend not as apparent in Figure 2.2 and suggests that the log scale is not appropriate. We assume model (2.1) where f(x_i, β) = β₁exp(β₂x_i) and the parameters (β₁, β₂, σ, λ, θ) are unknown. The parameters (β₁, β₂) are estimated for fixed (λ, θ) using an iterative generalized least squares procedure. See Carroll and Ruppert (1987) for a discussion of generalized least squares procedures. A grid of values for (λ, θ) is selected, ranging from −0.25 to 1.75 for λ and −1 to 1 for θ. Table 2.1 contains the values of the loglikelihoods computed at (β̂, σ̂, λ, θ). See Section 2.3.3 for a discussion of how to obtain confidence regions for this problem.
From Table 2.1, the parameters for which the maximum value of the loglikelihood is attained over the grid are (λ,θ) = (1,0), the parameters corresponding to no transformation. The (λ,θ) pairs that fall within the 99% confidence region are such that for any given θ, λ = 1 + θ; that is, a change in one parameter corresponds to a similar change in the other. This indicates that the model is overparametrized and supports the small-σ assumption.
In addition to the no-transform model we consider two other models, (λ,θ) = (0,0) and (0,−1). The (λ,θ) = (0,0) model, the linearized model one would have had if the original errors had been homoscedastic and had entered the model in a multiplicative fashion, yields

log(y_i) = log β₁ + β₂x_i + σε_i.

This model ignores the possibility of heteroscedastic errors or the creation of such errors by the transformation. If the true model is such that no transformation is needed (the errors are homoscedastic and additive) but a log transformation is incorrectly applied to both sides, then heteroscedasticity is induced and should be taken into account in the model by using (λ,θ) = (0,−1). For example, if

y_i = f(x_i, β) + σε_i

is the correct model, applying logs to both sides yields

log y_i = log[f(x_i, β) + σε_i] ≈ log f(x_i, β) + σε_i · [1/f(x_i, β)].

Weighting by the inverse power-of-the-mean removes the heteroscedasticity induced by the log transformation. This is the model typically used to model these data.
As illustrated in Table 2.1, neither (λ,θ) = (0,0) nor (0,−1) falls in the 99% confidence region, but (λ,θ) = (0,−1) is on the same ridge of values where, for a given λ, the maximum is attained over the θ's. The loglikelihood value for (λ,θ) = (0,0) is small relative to the maximum over the grid, so one might expect it to do poorly in relationship to the other two.
The residual plot for (λ,θ) = (1,0) (Figure 2.4) reveals no evidence of heteroscedasticity or skewness, whereas for (λ,θ) = (0,0) there is clear heteroscedasticity and the data are somewhat skewed to the left (see Figure 2.5). The nonlinearity of the normal probability plot (Figure 2.6) indicates that the assumption of normality may be inappropriate for these data, although it is not clear how to interpret such plots when the residuals are heteroscedastic. In the case (λ,θ) = (0,−1), the residuals do not exhibit any heteroscedastic pattern, but there are twenty fewer positive residuals, indicating a slight right skewness.
See Table 2.2 for computed values of the skewness, kurtosis and Spearman's rank correlation coefficient (ρ). These values are consistent with the residual plots. The (λ,θ) = (0,0) model yields residuals that are skewed and heteroscedastic and have a large value for kurtosis. The (λ,θ) = (0,−1) model competes favorably with the no-transformation model as far as heteroscedasticity is concerned, but is more skewed and has a somewhat inflated value for kurtosis. The skewness and kurtosis for the no-transformation model are closer to 0 than the other two, and it does as well as the (λ,θ) = (0,−1) model in controlling the heteroscedasticity.
Analyzing the data in the original scale with no weighting yields residuals that are symmetric about zero with no evidence of heteroscedasticity. Applying a log transformation with weights formed by f⁻¹(x_i,β) removes the induced heteroscedasticity but causes the residuals to be skewed. Simply taking a log transformation without accounting for the induced heteroscedasticity produces residuals that are not only heteroscedastic, but clearly do not satisfy the normal assumptions. This example demonstrates that it is not necessary to linearize nonlinear models even when it is possible to do so.
2.2.2 The Skeena River Sockeye Salmon Data

These data originate from Ricker and Smith (1975) and represent the stock-recruitment of the Skeena River sockeye salmon from 1940 to 1967. For each year there are two variables, x = spawners (mature fish) and y = the number of recruits into the fishery. A preliminary analysis indicated that observations 12 and 16 were extreme points, and they are removed for this analysis. Both observations have small x values. Justification for their removal is that in 1951 (observation 12) there was a rockslide that blocked the river and prevented the fish from getting upriver to spawn. Since it takes about four years for the sockeye salmon to mature and return to the river to spawn, the observation in 1955 (observation 16) is also severely affected. Removal of these two points leaves 26 points for the analysis. The data are listed in Table 2.3 and are plotted in Figure 2.7.
We fit model (2.1) over a grid of values for (λ,θ), where we use the Ricker model, f(x_i,β) = β₁x_i exp(β₂x_i). Table 2.4 illustrates the 95% confidence region for the loglikelihood, where λ ranges from −.75 to 1.5 and θ ranges from −.5 to 2.25. Note that the confidence region in this example is much larger than that in the previous example. This is due in part to the small sample size and in part to the range of x and y relative to the model. When x and y vary over a wide range of values, the parameters are better determined by the data than when the range of x and y is narrow, which consequently results in smaller confidence regions.
The maximum likelihood estimators for (λ,θ) over the grid are (.5,1.25). In Figure 2.7 the mean regression line, corresponding to (λ,θ) = (1,0), and the median regression line, (λ,θ) = (.5,1.25), are plotted.

The Ricker model is such that as x increases, y also increases until x reaches a maximum at −1/β₂. After that point, increasing x results in decreasing y. A possible explanation for this behavior is that after a certain number of fish enter the system, overcrowding occurs and food and oxygen supplies are depleted, resulting in fewer recruits. It is important for those managing the fishery to know where this maximum will occur so that the amount of fish can be regulated if necessary. Thus we are interested in estimating β₂ as efficiently as possible. For these two models, β̂₂(λ = 1, θ = 0) = −.00085 and β̂₂(λ = .5, θ = 1.25) = −.00064. This corresponds to the maximum of the Ricker curve occurring at 1176.5 and 1562.5 spawners, respectively, for the mean and the median regression.
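As a quick check of these numbers (ours), the Ricker curve f(x,β) = β₁x exp(β₂x) has derivative β₁(1 + β₂x)exp(β₂x), so its maximum is at x = −1/β₂:

    from math import exp

    def ricker(x, b1, b2):
        """Ricker stock-recruitment curve; maximized at x = -1/b2."""
        return b1 * x * exp(b2 * x)

    for b2 in (-0.00085, -0.00064):  # the two estimates quoted above
        print(-1.0 / b2)             # 1176.5 and 1562.5 spawners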
The no-transformation model, (λ,θ) = (1,0), does not fall within the confidence region. Taylor series expansions about σ = 0 yield the following small-σ equivalent models: (−.75,0), transform-both-sides with no weighting; (1,1.75), weight only; and (0,.75), linearize the model with the log transformation but weight to offset the induced heteroscedasticity. It is also of interest to look at the (λ,θ) = (0,0) model to see how badly one does if the model is linearized but the log-induced heteroscedasticity is not taken into account.
Table 2.5 contains the values of skewness, kurtosis and Spearman's rank correlation coefficient for these six models. It is interesting to see how well the maximum likelihood estimators transform to both symmetry and homoscedasticity, at least in terms of these numerical measures. As expected, the residual plots for the maximum likelihood model exhibit no heteroscedasticity, whereas the residual plots for the no-transformation model show marked heteroscedasticity. See Figures 2.8 and 2.9 for the residual plots for these two models.
In several models, observation 4 is associated with a large residual. Carroll and Ruppert (1986) in their robust analysis found this point to be influential; see also Section 4.3. However, in this analysis the leverage is not large. See Table 2.3 for computed leverages for the no-transformation model and the maximum likelihood model. Whereas most of the nearby points are fit well, observation 4 is poorly fit. See Table 2.6 for the effect on the summary statistics when this point is removed from the analysis. In Table 2.6, θ has been set equal to 0 and λ varies. Since observation 4, the third smallest x value, has a relatively large residual, and the general trend of the residuals is to grow as the mean increases, it is not surprising that the value for Spearman's ρ increases (more heteroscedastic) when observation 4 is removed from the analysis. Also the value for skewness generally decreases, but the differences between the models stay the same.
It
is
int~resting
that
the model
(A,9)
= (-.75,0) given by
small- a Taylor manipulations is not the best 9 = 0 model when one
considers how well the model accounts for both heteroscedasticity and
skewness.
It appears that this model is trying to account for hetero-
scedasticity by over transforming at the expense of skewness.
Closer
investigation of the likelihood surface indicates that given 9 = 0,
the loglikelihood is maximized for A between 0 and -.25, which agrees
wi th the summary statistics given in Table 2.6.
This is a further
indication that care should be used when one is not in a small- a
setting.
In Table 2.7, λ has been fixed at 1 and θ ranges from 0 to 2.25. Heteroscedasticity is adequately accounted for when θ = 1.75 (ρ = .06) or θ = 2 (ρ = −.02). It is interesting to note that as θ changes the skewness remains fairly constant while θ accounts for the heteroscedasticity in the model. This is not the case when θ is set equal to 0 and λ varies (Table 2.6). When λ is the only parameter in the model used to transform to approximate normality, it plays the dual role of trying to account for both skewness and heteroscedasticity, whereas θ alone attempts to model only the heteroscedasticity. The (λ,θ) = (1,1.75) model is the model obtained through the small-σ Taylor expansions and appears to be adequate given λ = 1, unlike the small-σ derived model for θ = 0.
These data have been extensively analyzed by Carroll and Ruppert (1986, 1987) and Ruppert and Carroll (1985). They use the Ricker model and do a robust analysis of the full data set, and also analyze the data with observation 12 deleted. From the robust analysis they find observations 4, 5, 12, 16, and 18 to be influential points, but they only remove observation 12 (the rockslide year) for the standard nonrobust analysis. Observation 16 has an exceptionally small number of spawners, since the spawning stock that year came from the recruitment of the rockslide year, but since observation 16 had normal recruitment it was retained for their analysis. Carroll and Ruppert (1987) find for the transform-both-sides model with no weighting (θ = 0) that λ̂ = −.2, supporting our results, which indicate a value between 0 and −.25. They have also calculated λ̂_sk, the value of λ that produces symmetric residuals, and λ̂_het, which transforms to homoscedasticity, to be .45 and −.86 respectively. According to our analysis (see Table 2.6), λ̂_sk ≈ .5 and λ̂_het ≈ −1. Note that our table stops with λ = −1; the actual estimate for λ̂_het may be smaller. It is interesting that, as in Carroll and Ruppert (1987), the transform-both-sides model with power weighting yields λ̂ close to λ̂_sk, indicating that skewness alone is determining λ̂. While our estimates differ slightly from those of Carroll and Ruppert (1987), in that we chose to delete observation 16 as well as observation 12, the essential characteristics of the analysis agree.
2.2.3 Population A Sockeye Salmon Data

This is another example of spawner-recruit data from the fisheries industry. Since permission to identify the stock was refused by the original source, Ruppert and Carroll (1985) and Carroll and Ruppert (1987) refer to these data as the "Population A" data. There are 28 years of data, which are listed in Table 2.8 and plotted in Figure 2.10. The number of recruits for the last observation year is extremely small relative to the other observations. Carroll and Ruppert (1987) suspected that the collection for that year was not complete and subsequently deleted the observation. We do the same here.
As in Ruppert and Carroll (1985) we look at both the Ricker model, f(x_i,β) = β₁x_i exp(β₂x_i), and the Beverton-Holt model, f(x_i,β) = 1/(β₁ + β₂/x_i). See Table 2.9 for 95% confidence regions for the two models. The maximum likelihood estimator for the Ricker model is (λ,θ) = (.25,1), while that for the Beverton-Holt model is (λ,θ) = (.25,3). For each model we consider the no-transformation model, the linearizing transformation, the maximum likelihood transformation, and the log transformation. See Table 2.10. As with the Skeena River data, it is interesting to see how well the maximum likelihood estimates simultaneously account for both skewness and heteroscedasticity.
The data under either model are somewhat insensitive to the amount of weighting used, but are greatly affected by the value of λ. This is not surprising, since the confidence regions are narrow for λ but long for θ, especially for the Beverton-Holt model. The Beverton-Holt median regression line (Figure 2.10) quickly approaches its asymptote and is nearly constant over most of the range of x. Therefore a power-of-the-mean variance model is effectively a constant variance model, and the parameter θ is not well determined.
For these data the median coefficient of variation is approximately .9, indicating that the small-σ assumption may not be appropriate. This is further supported in that the relationship θ̂ ≈ 1 − λ̂ does not appear to hold based on the loglikelihood grid. It is reasonable to assume that if λ and θ are both included in the model, then λ will model the skewness and θ will control the heteroscedasticity.

Consider the Beverton-Holt model. The Spearman's ρ values for the (λ,θ) combinations (.25,0) and (.25,3) are respectively .20 and .18. According to this numerical measure, both models seem to do a comparable job accounting for heteroscedasticity. Skewness and kurtosis are closer to 0 for the maximum likelihood estimators, but the difference between the two models is slight. As was anticipated by the extremely long confidence interval, the inclusion of θ does not produce significantly different results than the θ = 0 model. This analysis seems to indicate that skewness, rather than heteroscedasticity, is the predominant feature of the data.

Compare Figure 2.10 with the Skeena River data (Figure 2.7). Relative to the plotted regression lines, the Skeena River data exhibit a clear heteroscedastic trend not as apparent in the Population A data, where the data appear to be skewed relative to the plotted regression lines. This skewness is further evidenced by the median regression being smaller than the mean regression.
As in the Skeena River example, the managers of the fishery are interested in the efficient estimation of the parameters. The behavior of the Beverton-Holt model is quite different from that of the Ricker model. The Beverton-Holt curve does not exhibit the downward trend that the Ricker curve does. In the Beverton-Holt model the number of recruits seems to stabilize at an asymptote determined by the parameters (for f(x,β) = 1/(β₁ + β₂/x), the asymptote is 1/β₁ as x → ∞). We see from Figure 2.10 that this asymptote is very different depending on whether or not we use transformations.
2.2.4 Discussion

We see from these examples that the transform-both-sides power-of-the-mean model is important in nonlinear regression. While it may effectively reduce to a transform-both-sides model, or a power-of-the-mean model, or a no-transformation model, the transform-both-sides power-of-the-mean model is useful in discovering characteristics of the data that might otherwise go undetected in a model that is initially limited to one or the other.

We saw that linearizing a nonlinear model may not always be appropriate (Example 1), and the transform-both-sides power-of-the-mean approach was able to recognize that no transformation was necessary. In Examples 2 and 3 we saw that skewness and heteroscedasticity were major factors in the data. The choice of (λ,θ) allows error to enter the model in different ways, and the effect of skewness and heteroscedasticity is better understood. By choosing a (λ,θ) combination such that the errors are approximately normally distributed, we can achieve more efficient estimators than would be possible if the model were not transformed. Improved estimators, in addition to having nice statistical properties, can lead to savings in cost and resources and may even change management policies, as in the fisheries examples. Thus the transform-both-sides power-of-the-mean model is a tool whereby insight into the error behavior can be gained and more efficient modeling can occur.
2.3
Appendix
2.3.1
Heuristic Sketch of the Carroll-Ruppert Effeciency Result
We sketch here a proof of Carroll and Ruppert' s (1984) resul t
adapted to the transform-both-sides regession model with power-ofthe-mean variance.
for
~O.
Basically the idea is to compute a third estimator
the true parameter. and compute its efficiency with respect to
(~1~.9 known).
This gives a bound on the efficiency of the maximum
likelihood estimators.
"-
Let w •...• w be positive numbers and let
1
n
"-
minimizes ~ Wi IYi - f(xi'~1)1.
~1
be any point that
Under (2.1). f(xi'~O) is the unique
"-
median of Yi'
conditions.
so
~1
will be consistent under certain regularity
If Wi is set equal to the Jacobian of the transformation
evaluated at the true parameters.
is.
~O' ~O'
90 , and y =
Wi is proportional to the density of [Y i -
f(xi'~O)'
that
f(x . 13 )] at its
i 0
I A.a}
A
-1
= 81 ,
The asymptotic optimality of
-1
A
-1
maximum likelihood estimates shows that 8
~ Cov(l3} ~ w/2 8
• where
1
1
the inequalities are in the sense of positive definiteness.
From (2.5).
the Cov(1l
•
2.3.2 A Heuristic Derivation of (2.3)
Fix A and
a.
We rely on the calculations of Olapter 7 of Carroll
and Ruppert (1987). who indicate by Taylor series that for fixed (A.a)
-1/2 -1
N
8
1
(2.4)
~~
i=1
where
McCullagh (1983) has a similar calculation.
By formal
likelihood
theory. for fixed (a.A.a).
1 2
N / [
~ ~ (3 ] ~
a - a
I(Il.a}
is
the
1 2
I(I3.a}]-1 a N- / [ ll3(l3.a) ].
l (l3.a)
a
where
2
[0
a
information matrix for a
assuming the x·s are random.
single observation
From previous calculations
44
a 2 I(~.a) ~ [8
01
and
~
-1/2 a
=a N
'lJO
N
= N- 1/ 2 I
O ],
2
e(~.a)
(~~ - 1).
1=1
Thus
a
a
MLE - a
a
{a
+
2
I(~.a)}
-1
-
[
[ (u2 I(~.u» -1
-
r
-1
s~
o ]]
~O
1/2
we get
A
N1/ 2 fjMLE - fj
a
A
a
MLE - a
a
r
ae 8-1
1
0
o ] [ Al
1/2
~
].
45
Therefore
1/2
N
-1
~
A
(f3MLE - 13)/0 = Sl AI'
and.
Now let
0
~
0 and get a stronger form of (2.3) since
N-
1/2
N
~ (~2 - l)f A (x i •f3 )/f(xi .f3) = 0p(l).
i=1
i
,.,
•
2.3.3 Approximate Confidence Regions for (A.9>
We compute an approximate (1-a)100% confidence region in the
following way.
Define
=
Lmax (A.9)
l[f3(A.9). 0(A.9). A. 9]
to be the loglikelihood maximized with respect to f3 and
and 9.
0
for fixed A
The loglikelihood ratio is given by
A
A
LR(AO·90 ) = Lmax (A.9) - Lmax (AO·90 )
where (A.9) are those values of (A.9) over the grid for which the
loglikelihood
estimation
is
maximized.
scheme
(Rao.
asymptotically )(~ with
(A.9) =
(~.
90 ).
In
a
usual
section
1913.
level
the
under
maximum
2LR(~.90)
6.e).
the null
likelihood
hypothesis
is
that
We form an approximate confidence region by finding
all those values of
(~0.90)
such that
~
Two points need to be addressed.
likelihood estimators but
squares estimation scheme.
cally eqUivalent as
0 ~
2
)(2 (1 - a).
First we have not computed maximum
instead have used a generalized
least
Although the two estimators are asymptoti-
O. it is not an automatic consequence that the
46
loglikelihood
ratio
evaluated
at
2
estimators is asymptotically X
the
generalized
least
squares
A second point is that the maximum
is attained over a grid and is thus not exact.
Therefore
we
must
conf idence regions.
qualify
our
usage
of
the
normal
theory
I t can be shown (see be low) that as N -+
00
and
then a -+ O.
(2.5)
It is not clear what effect the overparamterization and subsequent
loss of a degree of freedom signifies.
For the examples considered
there is little or no difference in the confidence region when based
on one or two degrees of freedom.
A difference would be observed if
our estimation of X and 9 were not limited to a grid of values.
Calculations for (2.5):
Recall that
~
= (~.a.X.9).
ratio test 2[t{SMLE) -
•
hyPOthesis.
The usual normal theory likelihood
2 Wlder the null
l.{e)]
is asymptotically X
Cox and Hinkley (1974)
A
show that
A
likelihood ratio test 2[t{e
MLE ) - t{eO)]
the normal
theory
2
is asymptotically X
with
degrees of freedom given by the number of parameters estimated under
A
when N -+
00
A
2
In order to show that 2[t{eGLS ) - t{eO)] - Xl
and then a -+ O. rewrite
the null hyPOthesis.
A
A
A
t{eGLS ) - t{eO)
= [t(eGLS )
A
A
- t{eMLE )] + [t{eMLE ) - t{e)]
A
- [t{eo ) - t{e)].
We must show
A
(a)
A
A
(b)
P
t{eGLS ) - t{eMLE ) --+ O. and
2[t{eMLE ) - t{e)] - 2[t{eo) - t(e)]
A
47
Part (a) is an immediate consequence of the discussion of Section
2.2.3.
In order to show (b) we first define additional notation.
all estimates be maximum likelihood estimates, and define 8
e01 =
8 2 = (A,e), and
(~o,aOIA=AO,e=eO).
1
=
Let
(~,a),
Let' be the information
matrix, partitioned such that
'11 = E
'21 = E
*
and let Cov(8) = I.
A
A
[lpp
la~
;]
aa
[l~ lM ]
le~
'12 = E
'22 = E
lea
laA
lae
[lM
lAe
leA
lee
]
Thus
""',...,
A
and (801182) - N(8 1- I 12I 22 (82 - 82 ),
Iy.
[lJlh lpa ]
Iy) for some covariance matrix
A
A
(See Rao 1973, p. 522.)
Write [l(8
) - l(8)] - [l(8 ) - l(8)]
MLE
0
A
A
A
= [l(8 1 ,82 ) - l(8 1 ,82 )] - [l(801 ,82 ) - l(8 1 ,82 )]. Expand in a Taylor
series,
where
..
-.2
l(8 1 ,82 ) = ol(8)/a8
T
ae.
Combining the above we get
48
"
"
2[l(8
) - l(8 )]
0
MLE
(2.6)
A
A""''''
~
A
""J
Replace 8
with 8 - 1 1;(8 - 8 ), I is not of full rank and must
2
12
2
1
01
be calculated using generalized inverses. Substitution and simplification reduce (2.6) to
I'
= E[E.4i ]
(I'
+ 2)/2
(92
- 82 )T
- 3 represents the kurtosis.
l(8" )] is asymptotically
0
of 1 22 .
...
...1
22
"
(8 - 8 ), where
2
2
"
Thus 4/(1' + 2} [l(8MLE } -
2 with one degree of freedom. the dimension
~
2.4 Tables and Figures
The tables and figures for the Chapter 2 examples follow.
•
49
Table 2.1:
A
9
Loglikelihood computed at (p,A,a) for Capacitance Data.
-1
-.75 -.5
-.25
o
.25
.5
.75
1
**
-.25
525.4 471. 7 405.3 329.7 246.5 155.0 51.4 -75.2
0.00
578.9 550.1 502.0 440.2 368.4 288.6 200.7 102.1 -14.5
0.25
590.7 590.3 566.4 522.9 465.1 396.9 320.6 236.7 143.5
0.50
569.2 594.4 596.9 576.4 536.4 481.8 416 8 343.7 263.7
0.75
524.8 569.6 595.9 600.0 581.7 544.2 492.1 429.6 359.5
1.00
465.6 524.1 568.9 595.6 600.5 583.5 547.5 497.2 436.8
1.25
396.9 464.6 522.8 567.3 593.8 598.9 582.3 547.2 498.2
1.50
321.1 395.7 463.2 521.0 565.0 590.8 595.4 578.6 543.9
1.75
240.9 320.3 394.5 461.6 518.7 561.8 586.7 590.2 572.8
**
_
Convergence not obtained wi thin time limit
99% confidence region
=
maximum value for loglikelihood over grid
Table 2.2:
Summary statistics for Capaci tance Data
A
9
Skewness Kurtosis
p
1
0
-. 115
o
0
1.774
10.3921
.673
.0001
o
-1
.624
1.6006
.133
.0221
.4154. 134
p value
.0210
50
V
V
r
r
, . . - - - - - - - - - - - - - - - - - - Vol tage
V
c
'1,(T)
-1I
-Capaci tance
C(x)
----------4----------------- Time
o
Figure 2.1:
Capacitance Data. Schematic of vol tage and waveforms
associated with repetively changing the voltage and
analyzing the resulting capacitance ratio.
51
CAPACXTANCE DATA
...
0
oft
...
•
•c
.•..
•
•
0.8
II
0
o.e
oft
0
0.'-
0.
U
0.2
o
o
2
B
e
T1m.
Figure 2.2:
Capacitance Data.
by time.
Scatterplot of capaci tance ratio
52
CAPAC~TANC.
-...
CATA
0
0
....
•
•
-~
I[
0
-2
C
II
.6J
'f't
-3
0
II
a.
*
II
-4
0
0
-e
0
.J
e
*
*
-e
0
2
4
e
T1m.
Figure 2.3:
Capacitance Data.
ratio) by time.
Scatterplot of log(capacitance
e
53
CAPAC%TANCE DATA
Non11neer Mode1
•
•
•I
I•
rot
:J
lJ
2
off
lJ
•
o
N
off
011
C
-~
•
-2
OJ
-3
'C
:J
011
....
..
.. II
-4
o
e
2
T1me
Figure 2.4:
Capacitance Data.
Plot of studentized residuals
from the no transformation model. (A.S) = (1.0).
e
54
CAPAC%TANCE DATA
•
•
•
a•:
..
r1
:J
"0
~
0
..
.
..
"0
•
'•"
N
~
C
-!!
e
'0
:J
'"
OJ
-£0
o
e
2
e
T1me
Figure 2.5:
Capacitance Data.
Plot of studentized residuals
from the log transformation model, (A,G)
= (0,0).
55
CAPAC%TANCE DATA
Norm.l
Prob.b~l~~y Plo~
*
•
rf
I
~
lJ
oft
••
0
a:
lJ
I
N
oft
oil
C
•
-!5
lJ
~
oil
CD
-so
-3
-2
o
2
Fl.nk.
Figure 2.6:
Capacitance Data.
Normal probability plot for the
log transformation model.
3
56
Table 2.3:
Skeena River Sockeye Salmon Data
LEVERAGES
Observation Year Spawners
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
MM
Recrui ts (A,9)=(1,0)
963
572
2215
1334
305
800
438
272
824
940
486
307
1066
480
393
176
237
700
511
87
370
448
819
799
273
936
558
597
848
619
397
616
3071
957
934
971
2257
1451
686
127
700
1381
1393
.1469
.0501
.0680
.0648
.0751
.1318
.0592
.0681
.2304
.0599
.0679
MM
MM
.1928
.0489
.0354
.0563
MM
668
.0690
.0634
.0734
.0671
.0649
.1296
.0514
644
1747
744
1087
1335
1981
627
1099
1532
2086
.1128
.0356
.1045
.1389
.0737
.1055
.0370
.1028
.1490
.0375
.0551
.0593
.0489
363
2067
(A,9)=( .5,1.25)
.0483
.0842
.0472
.0677
.0473
not included in the analysis
MM
.0639
.0417
.0725
.0679
.1377
.1043
.0351
.0369
.0796
.0387
.0538
.0384
51
Table 2.4:
~ 9
1.50
1.25
1.00
.15
.50
.25
0.00
-.25
-.50
-.15
(~.9)
Confidence Region for Skeena River Sockeye Salmon
-.50 -.25 0.00 0.25 0.50 0.15 1.00 1.25 1.50 1.15 2.00 2.25
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
t88f
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Swmnary Statistics for Skeena River Data
9
1.00
1.00
.50
0.00
0.00
-.75
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
90X Confidence Region
MLE over the grid
MMM
Table 2.5:
+
+
+
+
+
+
+
+
+
+
+
1.15
0.00
1.25
.75
0.00
0.00
Skewness
-.406
-.466
.005
.396
.352
.929
Kurtosis
-.121
.486
-.501
-.500
-.390
.114
Spearman •s p
(p value)
.061
.562
.101
.203
( .1665)
( .0028)
( .6029)
(.3189)
.389 (.0493)
.179 (.3821)
+
+
+
+
58
Table 2.6:
x
Summary statistics for Skeens. River data.
to O.
SKEWNESS
1.00
.75
.50
.25
0.00
-.25
-.50
-.75
-1.00
-.47 [-.46]
-.26 [-.26]
-.05
.15
.35
.54
.74
.93
1.09
[-.06]
[.13]
[.31]
[.48]
[.65]
[.80]
[.95]
KURTOSIS
.486 [.454]
.063
-.214
-.359
-.390
-.319
-.152
.114
.494
[.075]
[-.160]
[-.267]
[-.266]
[-.177]
[-.020]
[.185]
[.421]
e has
been set
SPEARMAN"S II (p value)
.56
.52
.48
.42
.39
.32
.28
.18
.13
(.0028)
(.0063)
(.0135)
(.0311)
(.0493)
(.1043)
(.1596)
(.3821)
(.5347)
[.57
[ . 57
[.53
[.50
[.46
[.38
[ . 39
[ .35
[.31
(.0028)]
(.0028)]
(.0062)]
(.0105)]
(.0219)]
(.0644)]
(.0524)]
(.0875)]
(.1350)]
[ ] represent values when observation 4 is deleted
Table 2.7:
e
0.00
.25
.50
.75
1.00
1.25
1.50
1.75
2.00
Summary statistics for Skeens. River data. X has been set
to 1.
SKEWNESS
KURTOSIS
-.466
.486
-.459
-.452
.307
.127
-.445
-.438
-.042
-.181
-.264
-.432
-.424
-.256
-.406
-.365
-.121
.165
SPEARMAN"S II (p value)
.56
.54
.47
.38
.29
.26
.18
.06
-.02
(.0028)
(.0043)
(.0146)
(.0566)
(.1444)
(.1944)
(.3859)
(.7665)
(.9037)
59
SKEENA AXVEA DATA
.cooo
*
3000
...:J•
.IJ
to
*
2000
0
•
* *
**
It
~ooo
o
o
.coo
aoo
~200
S~.wn.r".
Figure 2.7:
Skeena River Data.
·spawners.
Scatterplot of recruits by
The sol id line represents the median
regression line and the dotted line represents the
mean regression line for the Ricker model.
60
SKEENA R%VER DATA
No Tr.n.~orm.~~on Mode1
3
•
•
••
II:
.
2
rt
:J
1J
oft
1J
•.
. * ..
0
*
oft
-II
c
-so
.. *
*
....
..*
..
..
•
'tJ
:J
-II
II
..*
..
t\f
e
*
*
-2
-3
SOO
Figure 2.8:
SlOO
Skeena River Data.
S.200
s.eoo
Scatterplot of studentized
residuals by predicted values for the no
transformation model.
s.eoo
61
SKEENA RXVER DATA
ML.E Made1
2
•
•
•
a:•
*'
::J
'0
~
¥
0
~
*
•
'"
*
*
*
**
**
*
C
....
*
-~
....
'0
::J
*
*
'0
.•.
M
**
rt
-2
m
-3
soo
'-400
'-000
P~.d~ct.d
Figure 2.9:
Skeena River Data.
'-soo
Va1ue.
Scatterplot of studentized
residuals by predicted values for the MLE model.
62
Table 2.8:
Population A Sockeye Salmon Data
Observation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
Spawners
19431
12725
32539
12856
28050
8989
28137
21636
8690
20237
38439
8363
7033
4246
15824
14469
1196
13521
13360
19720
2407
41716
6108
2482
15193
27806
40032
11308
Recruits
131635
54928
182836
116935
10933
217870
232492
72378
21572
8801
30715
39208
4623
57472
47854
161915
24962
32377
12377
30940
43587
46276
63126
20372
72928
34050
41873
1541
e
63
95% COnfidence Regions for (A.a) for Population A Data
Table 2.9:
Beverton-Holt Model
A
0.0 .50 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
a
.50
+
+
+
+
+
+
.25
+
+
+
+
+
M**
+
+
+
+
+
+
0
+
+
+
+
+
+
+
+
+
+
+
+
Ricker Model
A
a
-.25 0.0 .25 .50 .15 1.0 1.25 1.50 1.15 2.0 2.25
.50
+
+
+
+
+
+
.25
+
+
+
+
M**
+
+
+
+
0
+
+
+
+
+
+
+
+
+
M**
+
MLE over the grid
95% confidence region
64
Table 2.10:
Summary statistics for Population A Data
Skewness
Model
Kurtosis
Spearman's p
(p value)
Beverton-Holt
(A.e) = (1,0)
-1.36
1.12
no transformation
(A, e) = (-1,0)
2.90
9.69
.04
(.85)
.12
-.08
.18
( .36)
-.68
-.11
linearizing
(A,e) = (.25,3)
MI.E
(A,e) = (.5,0)
.44
(.02)
.31
( .11)
= (.25,0)
= (0,0)
-.20
-.36
.38
-.01
.20
( .31)
.09
( .64)
Ricker
(A, e) = (1,0)
-1.25
1.05
.59
-.08
no transformation
(A,e) = (0,0)
MI.E
.05
( .80)
linearizing
(A,e) = (.25,1)
.42
( .03)
-.20
-.37
-.15
(.44)
65
PQPU~AT%QN
ex
A
CATA
.BV.RTQN-HQ~T MQCB~
iOO)
a.-oo
*
aooo
*
~eoo
~aoo
*
800
.,..__ - - !l-=----if
.... - - - -- --- - - - --- -- - - --;.
~
.-00
o
o
*
*
*
~oo
*
aoo
*
*
*
300
ex
Sp.wn.• r"'a
Figure 2.10:
Population A Data.
spawners.
!!So 0
Scatterplot of recruits by
The solid line represents the mean
regression line for the Beverton-Holt model.
The
dashed line represents the median regression model
for the Beverton-Holt model.
~OO)
aIAPTER III
ERRORS-IN-VARIABLFS AND THE TRANSFORM-B01H-SIDES MODEL
3.0
Introduction
We assume the transform-both-sides model
(3.1)
but
the observed data (yi .X ) are measured with error such that
i
Xi = Xi + 02 v i'
If the observed variables Xi's are substi tuted for
the xi's
into
(3.1)
and
the measurement
estimated parameters are biased.
error
is
ignored,
the
By varying the rates of convergence
we are able to gain insight into the nature of this bias through the
investigation of asymptotic distributions.
In section 3.1 we consider
two types of asymptotics and compute the asymptotic distribution of
A
the maximum likelihood estimator,
~.
based on each.
We study the bias
A
terms and suggest corrected estimators
for~.
The calculations here
are deliberately left on a heuristic basis as we are
understand the form of the asymptotic bias.
3.1
Recall that
~
trying to
= (fj,o,X).
Asymptotic Theory
We assume model (3.1), but the observed data (yi ,Xi) are measured
with error such that Xi = Xi + 02vi'
We assume further that the
~i's
and the Vi'S are independent, symmetric random variables with mean 0
and variance 1.
~
67
We consider two cases of the asymptotics used by Amemiya and
Fuller {1985} in their study of the nonlinear functional relationship.
In their work,
to obtain consistency and asymptotic normality,
1/2
measurement error must tend to 0 as N increases, that is, N a
o
~
f < w.
How fast a
approaches 0 relative to a
2
the choice of , where a /{a 1+' }
2
some ,.
~
1
6, for 0
~
6
1
2
the
~
f,
is controlled by
< wand,
~
0, for
We compute the asymptotic distribution of the estimators
based on two cases of these asymptotics where in case {A}, ,
in case {B}, ,
= O.
(3.4)
Case (A)
N1/ 2a
2
~f
~
(a )
1
6
2
o ~ f <W
Case (B)
In each case N ....
w,
error variance a
2
2
and
Specifically we let
a2
(3.S)
= I,
a
1
~
0, and a
2
is very small
.... O.
In case (A), the measurement
relative
to
the equation error
2
variance aI' whereas in case (B) they are proportional.
The case (B)
rates of convergence lead to more interesting results than those of
case (A) since under rates (B) the model is farther from the non-EIV
model than in case (A).
In order to simplify the presentation, let subscripts A and B
represent the asymptotics employed and subscripts
the designated parameter.
~,
a, and X refer to
58
Assume the rates of convergence given by (3.4).
Resul t 3.1:
N
~ 00.
a
l
~
O. and a
2
~
Let
0 simultaneously. then
. N('aA' ~}
-~A .
l 2
N / n-l(S" - S)
Under the rates given by (3.5).
I 2
N / n-I(S" - S}
where
a2
n = (a l •...•
= O).
and AA
- . N('aBo Y_}
--a .
a • I). !A
l
= {Q
=I
T
AAa AAl\} • AB
(the covariance matrix when
= (AB13
T-
~ ABA) • and
1J
are
as defined in Section 3.2.
It is not surprising that there is an asymptotic bias since the
estimating equations are biased.
See Section 3.2.
Under rate (A)
convergence the only effect of measurement error is a bias in the
estimates.
~ and;.
It is interesting that the bias in
~ does not
"
correspond to a bias in the estimated regression parameters 13.
the weaker rates (B). where a
Under
to a l • not only is
" "
there bias for each of the parameters but the bias in (A.a)
is not
convergent.
2
is proportional
In addition. the asymptotic variance increases for each
of the parameters. that is. ~
>!A
ii
ii
In order to obtain asymptotically unbiased estimates we subtract
from the original estimating equations the estimated bias. where
is defined on page
70.
l(~)
By asymptotically unbiased. we mean that
1 2
N /
{~" - ~} :::::> N{O~)
• ~ •
i.e .• the limit distribution is centered
at~.
Al though the resulting estimating equations produce asymptotically unbiased estimators of the parameters. there is an effect on the
~
69
,.
asymptotic covariance matrix since
E[e(~)]
will have random elements
due to the measurement error u 2 u i .
Result
3.2:
e* (~)
The
= e(~)
-
bias-corrected
,.
E[e(~)]
estimating
equations
given
by
yield under rates (A),
and under rates (B),
N(Q, :I* ),
=>
,.
where
E[e(~)]
is
E[e(~)]
evaluated at Xi
:I*
P
0
0
2*
:I* =
~
0
+ u2 u i , and
0
0
o
o
=
u"l\
0
= xi
Thus there is no effect due to measurement error under case (A)
when the bias-adjusted estimating equations are used.
Under case (B)
A
we can obtain unbiased estimates for {3 but there is a cost due to
measurement error in that
*
:I{3
= '"1JJp
> :Ip ' Since there is no change in
* = 1JJ13'
'"
the asymptotic covariance matrix. :I{3
there is no additional
cost due to the bias correction.
"l\ and
the
0'
In obtaining unbiased estimates for
there is an additional effect due to the correction imposed by
measurement
error
contained
in
the
correction
terms,
i.e.,
3.2 Detailed Calculations
In this section we sketch results 3.1 and 3.2.
To simplify the
70
= y~A), g(xi.~.A) =
proofs we adopt additional notation.
Let h(yi,A)
f(A)(xi'~) and D
Subscripts on g and f indicate
= (a 1 •
•••• a • 1).
1
the derivative taken with respect to that parameter.
indicates that a
2
Superscript
0
has been set to 0 after the last a 2 differentiation.
Once Xi has been replaced by xi + a v i and the subsequent expansion
2
about a
g~.
2
=0
occurs. the arguments are suppressed. i.e .• g~(Xi.P.A)
=
If there is any ambigui ty the arguments will be retained.
Calculations for Result 3.1:
Let
2(~)
= N-1
(derivative of the loglikelihood with respect to
evaluated at the observed data).
1/2
N
1/2
Let
-E[2*(~)]
H
We assume that
A
(P - p)/a1 =op(l)
A
N
(a - a )/a
1
1
1
1 2
N /
(X - A)
= 0p(l)
= 0 p (1)
be the information in the correctly measured data.
By the form of D. if A =
E[2(~)]
then
Thus
71
When set equal to zero. the estimating equations based on the normal
likelihood for model {3.1} are:
~{a}
=.2-.
N
l
Na 1 i=l
'eA)
=
;1 1
[h(Y1,A)
a: g(Xl.I3.A)H~(Yl'A) a: ~(Xl.I3.A)]
1
+ - I log {Y i }
Ni
Replacing Xi with xi + a v
2 i
a
2
= 0 gives
.
and expanding in a Taylor series about
72
N
2(a 1 } =_1
Na 1 i=l
l [~~
0
2
2 0
- 2(a2/a 1 } v·E:.ig
1
a 2- (a2/a1}E:.iViga2a2
+ (a2/a2 ) 2 0 0
2 1 viga ga
2 2
3
+ O(a2 )
1
N
2(X} =~
l
NUl i=l
[
0
E:.i(llx-gx) - (u2/u 1 ) Vi (llx-gx)ga0 - U2E:.iVigM2
2
A Taylor expansion about a
(3.6)
1
= 0 and taking expectations yields
E[2(~)] ~ -a~/(Na~)
E[2(X)] ~ -a~/(Na~)
We compute A =
DA
=
E[a/a~ 2(~)]
and pre-multiply by D to obtain
1
a
1
73
where -v
= N-1
-
! log f. w
= N-1
! (log f)
2
and all sums Sij are from 1
i
i
to N and are defined in Section 3.3.
Up to this point we have not yet had to consider the differing
The
asymptotic
bias.
given
rates
of
convergence.
1/2
-1
T
N (DA) E[t(~)]. under case (A) tends to AA = (Q AAo AAA) where
~
In case (B).
AB~
-
-
--2
AAo
= -f612
(w S22- v S23)/(w - v )
AAA
= -f612
-2
(v S22- S23)/(w - v ) .
= (~ ~~)
where
4 * -1 2
S12 + f~ ID Sll [(If + ~ S33) S12S22
-1
= -f~2/4 Sll
-
T
by
2
2
+ (v + ~ S23)(2S13S22 - S12S23) - (2 + 3~ S22) S13S23]
~
and D*
E[ t
(~)]
= 2 (w-
= -2N 1/2ID*
+
~
2 -
2
2
+ ~ S33) S22 + (v + ~ S23)]
[~(w
2S33)(2 +
3~
2S22) - 4 (v
- + ~2
2
S23).
We also compute
where
.
DE[t(~)]D
=
-
2
-S11 + °2 So
2
(°21°1 ) S12
2
T
(° 21°1) 8 12
2 2
-2 - 3(°2/° 1) S22
2
T
(°21°1 ) S13
-
2 2
2
-(°21°1 ) S13
T
2v + 2(°2/° 1 ) S23
-
2 2
2v + 2(°2'° 1 ) S23
-
2 2
-2w - (° 21°1 ) S33
74
Under case (A).
0
-Sll
0
0
-2
2v
0
2v
-2w
.
DE[2(~)]D ~
and under Case (B).
..
DE[2(~)]D ~
0
0
-Sll
0
-2 -3'l'2S22
0
2v + 2'l'S32
-
-
2
2v + 2'l' S23
- - 'l'2S
-2w
2
33
1: is the covariance matrix for 2(~) and we have under case (A)
ND1: D =
A
Sll
0
E6
0
and under case (B).
ND~D
e
0
0
4 - 1
i
-(E64 -1) -v
i
-(E64 - 1) -v
i
(E61- 1) ;;
= ND1: D + 1:A where
A
o
o
4
4
['l' (Ev i
-
2
I)S50 + 'l' S60]
75
Therefore we can compute the asymptotic covariance matrix for case
~
(A). !A
= N(DAD) -1 D!AD(DAD) -1 •
8
-
-1
which can be written as
0
11
0
4
(E~i
- 1)w
-(Ee4 - 1) vi
0
!A =
-2
4(w - v )
0
-2
4(w - v )
-(Ee4 - 1) vi
Ee
-2
4(w - v )
Under
the asymptotics for
case
4
- 1
i
-2
4(w - v )
(B).
the
terms of
the asymptotic
~ = N(DAD)-1(D!:aDHDAD)-1. are unwieldy.
covariance matrix.
The
A
variance term for
P is
given by
•
Calculations for Result 3.2:
The calculations parallel those of Result 3.1.
A
Let
l*(~)
= l(~)
A
- E[l(~)]. where E[l(~)] is E[l(~)] evaluated at Xi = Xi + 02vi.
Expanding in a Taylor series about 02
=0
yields
76
2
3
0
0
2*(a 1 ) = 2(a 1 ) - [a2/(Na 1 )] ! [g g ]
i
a2 a2
3
3
0
0
4
- 2[a2/(Na 1 )] ! vi[g g
] + O(a2 )
i
a2 a2a2
l*(A)
= 2(A)
2
2
3
2
0
0
+ [a2/(Na 1 )] ! [g g~_]
i
a 2 ""2
0
0
+ [a2/(Na 1 )] ! vi[g
g~-]
i
a2a2 ""2
3
2
0
0
+ [(a /{Na )] ! vi[g g~_
2
Var[2*UD]
~
1
4
]+ 0(a2 )
a2 ""2a2
i
Var[2{!!)] but E[2M (!!)] ¢ E[2(!!)] and will affect the
asymptotic covariance of
the estimated parameters.
We have under
either (A) or (B) that
o
o
-2
o
M
o
-2w
M
....
Thus !13 is not affected but !oA is different from ~.
•
3.3 Appendix
The following sums are used throughout Chapter 3.
from 1 to N.
8 12
= -2N-1
! [g
i
0
gA + 2g
° 2° 2 ",
0
0
gR,.,.]
°2 '""'2
All sums are
77
S22=N
-1
-1
S33 = N
0
!g
i
g
0
a2 a2
! (g
0
,)
2
a 2",
i
S20
-1
S30 = N
S50
=
-1
N
! (g
i
0
a2
0
0
i
0
a2
)
4
a2
i
2
! (g g,_)
i
a2 N.l 2
! (g
-1
0
2
S40 = N ! (log f){g )
3 0
) g,_
N.l
= N-1
2
Soo
-1
= N
0
0
! (g log f + g,_ )
i
a2
N.l
2
aIAPTER 4
THE ACE ALGORITIIM AND TRANSFORM-BOTH-SIDES REGRESSION
4.0
Introduction
The ACE algorithm of Breiman and Friedman (19S5a. b) is designed
to find transformations for each of the regression variables such that
the best-fi tting additive model is constructed.
By "bes t-fi t ting"
they mean the model where the fraction of variance not explained by
the
regression of
the
transformed response onto
the
transformed
predictors is minimized.
The ACE procedure is conceptually simple.
Al though ACE is
sufficiently general for vector-valued responses and predictors. we
consider only the bivariate case with one response and one predictor.
Let y and x be random variables such that y is the response and x is
the predictor. where 9 and
functions of y and x.
~
are respectively the transformation
The ACE algorithm minimizes with respect to 9
and~
(4.1)
= E{[9(y) - ~(x)]2}
Var[9(y)]
The basis for the ACE algorithm is
(4.2)
which mimimizes (4.1) with respect to 9(y) for a given function of
79
,(x).
Minimizing (4.l) with respect to
flex}
(4.3)
,(x) for a given B(y} yields
= E[B(y} Ix].
See Section 4.4 for a derivation.
The ACE algorithm is an iterative
procedure based on computing and updating (4.2) and (4.3) in order to
minimize (4.l).
estimate
In the bivariate case,
the maximal
minimizing e
2
the effect of ACE is to
correlation between
the
two variables,
2
is eqUivalent to maximizing R = 1 -
2
e,
thus
the usual
correlation coefficient.
The ACE procedure is nonparametric in the sense that the only
distributional
independent
assumption
samples
drawn
made
is
that
the
data
from
the
distribution
of
(yi ,xi)
are
(y,x).
No
assumptions are made about the form of the transformation, i. e., it
need not be from a
monotone.
particular parametric family and need not be
However, the algorithm is flexible enough to constrain the
transformations to be monotone if so desired.
Our goal in this chapter is to investigate the performance of ACE
in the transform-both-sides setting.
When little is mown about the
relationship between the response and
the predictors,
ACE may be
useful in identifying some otherwise unlmown structure in the data.
The transform-both-sides problem is slightly different.
We have an
assumed regression model which we want to preserve, but the way in
which error enters the system is unlmown.
We consider transforming
both the response and the regression function simul taneously by the
same transformation in order to maintain this relationship yet still
address problems in the error distribution, i. e., heteroscedasticity
and lack of symmetry.
Breiman and Friedman (1985) claim that finding
80
the best-fitting additive model is a more comprehensive goal than the
common goals of variance stabilization and symmetrization of errors.
Thus it seems the ACE procedure is a likely candidate for estimating
the transformation in the transform-both-sides setting.
In the following we address several issues concerning the performance of ACE in the transform-bath-sides model.
X • Breiman and Friedman
on the joint distribution of y. Xl'
(1985)
show that optimal
Under weak conditions
p
transformations exi stand are generally
unique up to a change of sign.
There may exist suboptimal transfor-
mations not identified by ACE. that is. eigenfunction solutions that
are apprOXimately equal to the optimal solutions. that wi 11 provide
different information about the relationship between the response and
the predictors.
The first question is whether or not maximizing R2 is
a reliable criterion for model selection.
Maximizing R2 with respect
to the regression parameters in a linear model leads to the least
squares estimators.
Thus in a linear model under normali ty we get
consistent estimates using R2 as our selection criterion.
This
cri terion. under the nonlinear transform-both-sides model. does not
clearly lead to consistent estimates.
4.1.
We investigate this in Section
In transform-both-sides we assume that we have a regression
model based on some prior understanding about the underlying physical
process.
In Section 4.2 we look at ACE as a goodness-of-fit tool for
the assumed model. y
= f(x.~).
We investigate the ability of ACE to
select the more appropriate model in two examples where there are two
or more competing models.
Breiman and Friedman (1985) state that ACE
does not perform well in the presence of outliers. where by outliers
they mean points that are located at extreme points.
These points are
81
not necessarily bad points but have high leverage by virtue of their
location apart from the bulk of the data.
In section 4.3 we investi-
gate how ACE responds to influential points that are not outliers in
the Breiman-Friedman sense and determine if ACE can be used as a
diagnostic tool to identify such points.
The numerical
results
were
obtained using an ACE procedure
programmed by R. N. Rodriguez that implements the ACE algorithm given
in Breiman and Friedman (1982) and the data smoother found in Friedman
and Stuetzle (1982).
The smoother used by this program is not the
same smoother used by Breiman and Friedman (1985).
4.1
4.1.1
The ACE criterion in the Transform-Both-Sides Setting
Theoretical Analysis
Since ACE is a nonparametric procedure for estimating transforma-
tions its performance will be no more efficient than a parametric
procedure where the distribution and model assumptions are correct.
For the sake of comparison we formulate the ACE criterion as a parametric estimation problem where the errors are assumed to be independent and identically distributed normal
assumed model is correct.
random variables and the
We consider the transformations 9 and • to
be the same and to be members of the Box-COx modified power transformation family indexed by the parameter A.
value of the parameter A that minimizes
(4.4)
We define
~2
to be the
82
A
A
~2
In addition to
we consider the maximum likelihood estimator (AMLE )
A
and an estimator suggested by Breiman and Friedman (1985).
estimate
A
~F
(~F).
The
2
is that value of A for which R (A) is maximized when ACE
A
A
is applied to {y .f (x.P».
example where
there
Breiman and Friedman suggest this in an
is a
strong
response and the predictors.
linear
relationship between the
In that example the ACE procedure does
not indicate a transformation but there is reason to believe that a
transformation is called for.
We know that at A = O. the estimating equations for the maximum
likelihood estimators are unbiased in a
with known fl.
Le ..
Er~
lv LaJ
transform-both-sides model
loglikelihood
]
=
o.
(See below.)
A=O
Therefore. by M-estimation theory. for any a we can expect transformboth-sides to be consistent if A
that we want to study ACE.
= O.
Our goal here is different in
We want to show in the case where A = O.
Since P is known. we
that ACE does not consistently estimate A.
suppress the arguments and write f
i
= f{xi.P).
The loglikelihood for
the transform-both-sides model is
l{A.a)
=
-1/2 log a2 + (A-1)log{Yi) - 1/(202 )[y (A) -f (A) ]2.
We have
a
-:2 l{A.a)
00
I
A=O
{yeA) _ f{A)r2
= --
4
20
= (log
y - log f)
4
-
20
2 2
= a (~ - 1)
4
20
2
-
2
1/(20 )
- a
2
83
Therefore
Also
a
~
i(X,a)
Er~
lao i(X,a) Ix=o] = o.
I
X=o
= {log
X
y - l/(Xa2 )[y (X) - f (X) ][yXlog y - flog
f]
Repeated application of L'Hospital's rule yields
a i(X,a)
ax
Ix=o=
log y
- 1/{2a2 )[log y - log f] [(log y) 2 - (log f) 2 ]
= log
f + a~ - ~(2a) [(log f + a~)2 - (log f)2]
= log
f + a~ - ~2 log f - a~3/2.
This also has mean 0, which shows that the estimating equation for the
maximum likelihood estimator is unbiased.
Assume that
(fi'~i)
f i is independent of
~i'
are independent and identically distributed,
and
~i ~
N(O,l).
< M- 1 < f i < M < ~.
Assume also that for some M > 0,
0
Proposition 1:
>0
(1)
For any integer k
k a
-a
E[log Yi] [Yi + Yi ]
Let log Yi
and any a
< b(a,k) < ~,
where b(a,k) is independent of 1.
> 0,
= log
fi +
a~i·
84
Proof:
(Part 1):
M- 1 < f i < M.
Let Mi
= log
M.
Therefore
k
-a]
[1 og Yi ] [ Ya
i + Yi
= [1 og
aei a
ae
k
) + {fie i )-a]
f i + ae i ] [(f ie
Therefore for constants d •
j
It suffices that for any k. aM' Eleil
kaMe
e
< ~. which holds for the
normal distribtion.
(Part 2):
ITI
By the Markov inequality.
= 0p(1)
if
Elxl
Write
T
p[lTI > a]
= N-1
~ EIXI/a.
Thus
N
k
a
-a
I (log Yi) (Y + Y1 ) and
1=1
i
apply Part (1).
•
Proposition 2:
If
Ixl
~ a. then
85
y{X) _ log y
Proof:
Let x
=X
log y.
there exists an
= {yX
- 1 - X log y)1A
= {eX
log y _ 1 - X log y)lA.
Then by a Taylor series expansion about x
"*
between 0 and x such that eX - 1 - x
=0
= 1/2
x
where
2
e
xM
,
we can write
=
I
2
1/2 X (log
= Ixl/2
r) 2 eXMlog y
XMlog y
2
(log y)
Ie
= Ixl/2 {log y )2
Since XM is between X and 0 and
X
y
M
< ya
+ y
-a .
I
y~.
Ix I ~ a,
IXM I ~ a,
To see this, note that we have the four cases:
~
> O.
y
>1
y~
~
> 0, y
<1
y
XM < 0, y
>1
X
y M < ya
< O.
<1
y
~
therefore
y
XM
< ya
< y-a
~ < y -a .
Therefore
•
86
Proposition 3:
If
Ixi
~ a. then
a/ax y(X)
Proof:
= _(yX
= [X
_ 1)/A2 + 1/A yX log y
yX log y _ yX + 1]/A2
= 1/A
[(1 - YX)/A] + 1/A yX(log y)
= _y(X)/A
+ [(yX - 1) log y + log y]/A
= y(X)log
y + 1/A log y - 1/A y(X)
= y(X)
log y _ 1/A [y(X) - log y].
From the proof of Proposition 2
y
( X)
= log
y + 1/(2A) [X log y]
2 X*log y
e
Therefore
1/A (y
Thus
a/ox y(X)
(X)
- log y)
= y(X)log
(log y)
= 1/2
(log y)
2 X*log y
e
2
2 X*log y
+ 1/2 (log y) [e
- 1].
y - 1/A [y(X) - log y]
= y (X) log y
= (log
= 1/2
- 1/2 (log y)
2
2 X*log y
- 1/2 (log y) [e
- 1]
y)[(log y) + 1/(2A) (X log y)
2 X*log y
e
]
2
2 X*log y
- 1]
- 1/2 (log y) - 1/2 (log y) [e
= (log
y)
2
3 X*log y
+ A/2 (log y) e
- 1/2 (log y)
2 X*log y
- 1]
- 1/2 (log y) [e
2
87
2
= 1/2 (log y) + X/2 (log y)
3 X*log y
e
2 X*log y
- 1/2 (log y) [e
- 1].
Now do a Taylor series expansion about X*
for some X** between 0 and X*.
la/ax
y
(X)
- 1/2 [log y]
2
= O.
Therefore
I = I{log
2
3 X*log y
y) /2 + X/2 (log y) e
X*log y
2
2
- (log y) /2 [e
- 1] - (log y) /2
= 1X/2
3 X*log Y
2
X*log y
(log y) e
- (log y) /2 [e
- 1]1
3
~ Ix I Ilog y I
~
[y
X**
+ Y
]
~ 2 Ixi Ilog yl3 [ya + y-a].
•
88
_ { (A) _f(A)} ~ { (A) _f(A)}
Hi4 ( ')
A
Yi
i
8A Yi
i'
Thus for
Proof:
j
= 1.
,. P
2. 3. 4 if A ~ AO
We verify only for
j =
1.
= O.
The rest are similar [for
j
= 3. 4
use also Proposition 3].
,.
Hil(~)
= {y~A)
-.::--
_y(A)}2
,.
= {log Yi-
,.
log y + y~A)_ log Yi- (y(A)_ log y)}2.
From Proposition 2
Consider I~I ~ a.
Therefore.
,.
IHil(~) - Hi1 (O) I
= 12(log Yi-
-.::--
log y) [y~A)_ log Yi- (y(A)_ log y)}
89
~ 2 I~I I(log y.1 - log
Y)I
2 a
-a
2 a
-a
x [(log Yi ) (Y i + Yi ) + (log y) (y + y )]
A2
2 a
-a
2 a
-a 2
+ X [(log Yi) (Y i + Yi ) + (log y) (y + y )].
By Proposition 1 and Cauchy-Schwartz.
log y = 0p (1)
2
a
(log y) (y + y
-a
.
) = 0p (1)
. and further application of Proposition 1 yields
-1 N
N
,.
P
-1 N
Ali = 0 (1) = N I ~i·
i=l
P
i=l
Since X ~ O. we have
I
-1 N A P
N
I
i=l
{H i1 (X} - Hi1 (O}}
~
O.
and we are
done.
Lenuna 2:
•
For the purpose of bookkeeping we have
Hi3 (O} = {log Yi - log y} {(log Yi )
Proof:
Notation.
2
2
- (log y) }
•
90
N
Lemma. 3:
N- 1 ~ H (O) ~ Var[log f] +
i1
i=l
c)
N- 1 ~ H (O) ~ 1/2{E(log f)3_ [E(log f)] [E(log f)2]}
i3
i=l
N
d)
-1 N
N
~
i=l
a)
P
~
Hi4 (O)
a
2
a~(log
f)
E(log f)
It is given for H , H , and H by strong laws.
i1
i2
i4
-1 N
1
N
i=1
Hi1 (O)
p
~
Var[log y]
= Var[log f +
a~]
= Var[log f] + a
d)
a
p
+
Proof:
2
p
a)
-1 N
1
N
i=1
-1
Hi4 (O) = N /2
-1
= N /2
N
~ (a~i)[(log
2
a
f +
a~i)
2
N
3
~
[(a~i)
.
E(log f)
2 2
2
- (log f) ]}
i=1
i=1
p
---+
2
+ 2a ~i (log f)]
91
c)
-1 N
~
Since log y = N
(log f +
P
a~i) ~
E[log f]
and
i=1
----:::.-2
(log y)
-1 N
= N
~
i=1
(log f +
a~.)
2
P
~
E(log f}
2
2
+ a .
1
we have from strong laws that
2N
-1 N
~
1 N
Hi3 (O) = N- ~ (log Yi )
i=1
i=1
3
- (log y)(log y)
1 N
= N- ~ (log f i + a~i)
i=1
P
~
3
E[log f] +
2..-
3a~[log
3
2
- (log y}(log y)
2
2
2
f] - E[log f][a + E(log f} ]
32..-2
= E[log f] +
2a~[log
f] - E[log f] E[log f] .
•
Theorem 4.1:
o=
2
"
"
Let X
= ~2.
-a /2 [E(log f}
Proof:
3
" P
If X ~ 0 for every a. then we must have
- [E(log f}] 3 ] + 3/2 a 2 E[log f] Var[log f].
ACE minimizes
I og [N
-1 N
(')
(') 2
-1 N
~ {y ~ - f ~ } ] - I og [N
~
~
~
i=1 i
i
i=1
Taking derivatives with respect to X. we get
(') - ----('}
Y ~ }2 ]
{Yi~
92
i.e ..
-1 N
o=N
-1 N
A
! H. (A) N
i=l
1
4
-1 N
A
! H. (A) - N
i=l
1
1
-1 N
A
A
! H. 2 (A).
! H (A) N
i=l i3
i=l
1
From Lemmas 1 - 3. we get.
o = [o2E(log £)][Var(log £) + 0 2 ]
-
0
2
[1/2 E(log £)3+ o2E(log f) - 1/2 E(log f)E(log f)2]
= o2E(log f) Var(log f) + o~(log f) ~
- o-c(log f) +
= -
0
2
0
2
2
2
/2 E(log f)3
/2 E(log f) E(log f)
2
/2 E(log f)3 + o2E(log f) Var(log f)
+
= -
0
0
0
2
/2 E(log f) Var(log f} +
0
2
/2 E{log f}3
/2 {E{log f)3 - [E{log f)]3}
2
+ 30 /2 E(log f) Var(log fl.
•
,.,
Note:
Suppose E[log f] = O.
E[log f]3 = O.
Then
~2 ---+
AO = 0 implies
This simple counterexample shows that ACE
cannot be consistent in general.
4.1.2 Examples
At this point it will be illustrative to see how these estimators
perform on data.
Sixteen sets of data were generated according to the
model log{y i ) = log{x i } + o~i' where the ~i are independent standard
normal. the Xi are independent uniform (O.10). 0 = (.1 •. 3 •. 5. 1) and
N = (50. 100. 200. 500).
The values for each of the estimators for
93
each data set are given in Table 4.1.
Table 4.2 contains the summary
measures of skewness, kurtosis and Spearman's p for the residuals of
the three models when N
= 100.
Also included are the summary measures
based on the residuals from the no-transformation model,
the log
transformation model and the residuals based on the ACE procedure
itself.
These are recorded in the column labeled AACE'
"
The maximum likelihood estimator ~
is closer to the true value
of A than either
~F
"
~2
or
and as N increases
"
~
tends to O.
From
"
our calculations in the previous section we would expect ~2
to be
biased.
For
these
data
it
appears
that
Xa2
consistently
"
underestimates A, and as N increases the value of Xa2
stabilize around -.5.
For
"
~F
seems to
we record only the results for
0
= .1.
Breiman and Friedman suggested this estimator when there are reasons
to believe that a transformation is needed but because a strong linear
relationship exists between the response and the predictor ACE does
not
identify a
transformation.
transform-bath-sides model when
this
model
is
approximately
variance model with power a
~
Such a
0
si tuation occurs
is small.
equivalent
1 - A.
to
in the
As seen in Chapter 2,
the
power-of-the-mean
Data arising from such a model
will exhibit a strong linear trend as well as heteroscedasticity.
Consider the simulated data where N
= 100 and
0
is a scatterplot of the untransformed data, x and y.
= .1.
Figure 4.1
Notice that as x
increases the points aPPear to fan out.
When ACE is applied to these
data
(Figure
no
residuals,
transformation
is
indicated
4.2)
yet
the
ACE
"
"
a(y) - .p(x)
, are markedly heteroscedastic (Figure 4.3).
Following Breiman and Friedman's suggestion, we apply ACE to (yA, XA)
for A ranging over
the grid (-1,
0) by increments of.!.
The
94
See Figure 4.4 where the ACE
resul ting estimate for A
is -.6.
BF
transformation for y is plotted.
"-
~F
log transformation and the
We also plot scaled
= -.6
transformation.
v~rsions
of the
It is clear that
"-
the
~F
over transforms the data while the ACE procedure indicates no
transformation.
"-
The question arises as to why
to
"- 2
~.
The estimate
2
for which the R
"-
~F
~F
does so poorly in relationship
is selected as that value of A over the grid
A
A
value is maximized when ACE is applied to (y .x ).
2
We compute R (A::O)
= .9718
2
and R (A:-.6)
= .9997.
Examination of the
- 6
transformation plot for e(y- . 6 ) versus y'
(Figure 4.5) shows that
the spacing along the yA axis has been manipulated so that there are
Si~ilar spacing occurs for
two points far out in the yA direction.
the transformed x plots.
This occurs because values of y near 0 are
being driven to inHni ty as A decreases.
In order to avoid the
problem of small observations we multiplied the observations by 100
forcing all observations to be greater than 1.
Exactly the same sort
of behavior occurs for these scaled data. where two points are being
forced to 1 and the rest are near O.
This subsequent spacing of the
untransformed variables causes the increased R2 when there is no real
increase in model performance.
We have seen that ACE does not perform particularly well in
transform-bath-sides situations when a is small.
We investigate what
happens when a is large. that is. the "strong linear relationship" is
obscured by the increased variation.
where N
= 100.
Consider the simulated data
As a increases the ACE transformation plots indicate
increasingly severe transformations.
The transformed x plots. while
also indicating transformations. are in each case milder than those
95
The ACE transformation at a = .5 seems
indicated for the y variable.
to lie very close to the log transformation.
mation is too mild and at a
=1
< .5
For a
the transformation is much too severe.
See Figures 4.2. 4.6. 4.7 and 4.8 for comparison.
summary measures from the A
=0
the transfor-
If we use
the
model as a reference guide the small-
a residuals seem to be heteroscedastic and approximately synunetric.
=
For a
model.
1 all
three measures are large compared to the reference
Figure 4.9 is a scatterplot of the ACE residuals from the
a = .5 model whereas Figure 4.10 shows the ACE residuals from the
a = 1 model.
Figure 4.11
illustrates the ACE transformation.
and
A
scaled versions of the log transformation and the
~2
transformation
A
for
the a = 1 model.
The
~2
seems to well approximate the ACE
transformation but both are more severe than the underlying transformation.
The difficulty that ACE is encountering with data for large a is
that a number of data points. although small in number relative to the
bulk of the data. can be viewed as extreme data points. "outliers" in
the Breiman and Friedman terminology.
that
y
> 10
are effectively
truncated
In Figure 4.8 all data such
in
the
transformed values.
Figure 4.12 is a scatterplot of the (x.y) values for the a
It is clear that the points y
> 10
=1
model.
are separated from the bulk of the
data and are outliers in that sense.
It is interesting that ACE seems to do fairly well at a
not at a = 1 or .1.
= .5
but
I t appears that as a increases A decreases.
Therefore it is reasonable to conjecture that for some a
transformation will approximate the true transformation.
the ACE
One must
then question whether or not. for a given set of data. the errors are
96
sufficiently large or small enough for ACE to perform adequately well.
The severity of the transformation seems to be dictated by the size of
u rather than the need for transformation.
It is clear from these
examples that the ACE procedure does not always account for heteroscedasticity. nor does it induce sYmmetry in the error distribution.
The transformation plots provided by ACE are imprecise and only
"
give a general idea of what AACE
should be.
If estimation of A is the
"
primary goal. ACE will provide only a rough estimate of AACE which may
be a very biased estimate for the true A.
If prediction is the goal.
there is a predictive ACE procedure (PACE) developed by Friedman and
Owens (1985).
Since the ACE procedure yields inconsistent estimates
for A in the transform-both-sides setting when A
=
O.
we might
question the predictive effectiveness of PACE when applied to data
arising from a transform-both-sides model.
The
parameterized
ACE
criterion
differs
from
the
maximum
likelihood criterion by the scaling factor in the denominator.
The
ACE criterion scales by the variance of the transformed response while
maximum likelihood scales by the Jacobian of the transformation.
An
area of further research would be to adapt the ACE cri terion to be
like the maximum likelihood criterion.
Since the Jacobian is the
derivative of the transformation. with the availability of a data
smoother. numerical derivatives can easily be computed for B(y) at
each point and the appropriate scaling can be accomplished.
97
4.2
Using ACE to Determine Model Suitability
In the transform-both-sides setting if there is no error in the
In a
competing models, say f
1
situation where
there are
two
and f , plotting y versus f and f would be
1
2
2
an easy visual check on the appropriateness of the assumed function if
no transformation is required.
However if an unlmown transformation
is involved this method is not applicable and provides a setting where
ACE might usefully be employed
to
determine which
is
the
more
appropriate model.
The ACE procedure maximizes a criterion
max
h,g
~[h(y),g(x)]
such that
~
= ~[91(Y)'+1(x)].
If ACE acts upon y and f(x,P) rather than y and x then
max
h,g
Thus +1(x)
-1
+2
0
~[h(y),g(f(x»]
= +2(f(x,P»,
+1 (x).
= ~[92(Y)'+2(f(x»].
and it follows that f(x) is approximated by
If f(x,P) is a suitable model then 9 1 , 92 and +2 should
all approximate the same transformation.
Plotting 9
-1
-1
yield approximately the same results as plotting 9
2
f(x,P) is the true model.
different
functions
Comparing plots of
f(x,P)
with
the plot
indicate the better fitting model.
of
-1
If
transformation
the choice is
can
be
-1
when
+2(f(x,P» for
0
+l(x)
should
In order to construct these plots
limited
estimated
0
91
+1(x) should
+2(f(x,P»
0
9
2
an assumption must be made about the form of the 9
mations.
0
1
to
by
1
and 9 trans for2
the Box-Cox family,
overlaying
on
the
the
ACE
98
transformation plot standardized Box-COx functions for different A and
selecting the one that best approximates the 9 curve.
A similar but slightly different setting is where there are two
competing models but the response also differs in each model, say
h 1 (y,z}
= f1(x,~}
the covariates.
and h2 (y,z}
= f2(x,~},
where z may contain some of
Again we wish to determine the better fitting model.
By applying ACE
to
each model,
the
transformation plots
should
indicate similar transformations for the better model and somewhat
different transformations if the model is inappropriate.
4.2.1
The Atlantic Menhaden Fishery Data
An area of science where transform-both-sides has been success-
fully used is in the area of fisheries management.
Ruppert (1984) and Ruppert and Carroll (1985).
See Carroll and
An important issue in
the management of fisheries is the spawner-recruit relationship.
The
way in which this relationship is modeled can have a large impact on
the subsequent management policies of the fishery.
of recrui ts and x = the number of spawners.
= the
Let y
number
There are two mode I s
often used to model the spawner-recruit relationship:
y =
y
1/(~1
= ~1x
+
~2/x}
exp(~2x}
In the Bever ton-HoI t model,
asymptote
Beverton-Holt (1957)
in recruitment,
Ricker (1954) .
large numbers of spawners lead to an
whereas
the Ricker model
implies
that
increasing spawners ultimately result in decreasing recruitment.
See
Carroll and Ruppert (1984) for a more complete discussion of these
e
99
models.
The data are from the Atlantic menhaden fishery and are listed
and analyzed in Carroll and Ruppert (1984).
•
Although their results
indicate that both models are reasonable since the fitted curves are
similar over the observed range, they decided upon the Beverton-Holt
model because there was no real evidence for asymptotically decreasing
recruitment.
We investigate if ACE can be useful in determining the
better model.
A
A
A
We applied ACE to (y,x), (y,fBH(x,P» and (y,fR(x,P», where
P
are the estimated parameters from an untransformed fit for each model,
and the subscripts BH and R indicate which spawner-recruit model is
The transformations 9 , 9 and 9 are approximately the same,
3
1
2
used.
A
say 9. and X based on these plots is approximately -1.
4. 13.
Carro11 and Rupper t (1984) find
both models.
Thus X
= -1
~
See Figure
to be approximate ly -.7 for
seems a reasonable choice.
•
Note also that
X '= -1 linearizes the Beverton-Hol t model but not the Ricker model.
We
then plot 9
compare the results.
-1
0
+1 (x),
and 9
See Figure 4.14.
-1
0
+3(f R) and
o +l(x) curve compares
favorably wi th the Bever ton-Hoi t curve and not the Ricker model.
In
this example ACE appears to select the Bever ton-Ho 1t model as being
the more appropriate model.
This agrees wi th
the conclusion of
Carroll and Ruppert but is not based on any considerations of the
problem as was their decision.
•
•
100
4.2.2
Chemical Reaction Data
The data set of Carr (1960) on the isomerization of pentane is a
good candidate for demonstrating ACE's ability to identify the better
model in the second setting.
The data are listed and the model is
discussed in Box and Hill (1974).
Pritchard. Downie and Bacon (1977)
and Carroll and Ruppert (1984) further discuss and analyze the data.
Our purpose here is not to reanalyze the data but to see if the ACE
procedure when applied to different models will
appropriate model.
select
the more
One proposed model for these data is
Modell:
y =
(31133(~
- ~/1.632)
1 + 132x 1 + 133~ + {34~
•
This model can be linearized by transform-both-sides with A = -1 and
yields
Model 2:
where
the
parameters.
a's
are
the
appropriate
functions
of
the
original
Carr (1960) chose to linearize the model by the following
transformation:
Model 3:
In the previously mentioned analyses the inverse transformed
.~
model 2 induced marked heteroscedasticity while
there was sl ight
heteroscedasticity for the original untransformed model.
Box and Hill
A
•
(1974) and Carroll and Ruppert (1984) find values for A to be .8 and
.71. respectively, indicating a slight transformation for the original
101
nonlinear model 1.
Carr (1960) rejects model 3 as inadequate.
It
will be interesting to see how ACE performs on these models.
A
•
A
In each of the three mode 1s we form f i (x, tl) where the tl are
estimated values of the parameters and the subscript i = 1, 2. and 3
refers
to
the corresponding model.
We then apply ACE to (y, f 1)'
Figures 4.15. 4.16, and 4.17 show the
transformation plots for each model.
transformation
and
model
2
shows
We see that model 1 calls for no
a
slight
convexi ty,
thereby
supporting the claim that the inverse transformation is too strong.
Model
3
indicates a
need
for
transformation but
it
is not
the
transform-bath-sides situation.
Model 1 (X = 1) and model 2 (X
form-bath-sides
transformation
= -1)
family
belong to the same trans-
while
model
3
does
not.
Supporting this, the ACE transformation plots indicate that models 1
•
and 2 are candidates for the transform-bath-sides model. whereas model
3 is completely inadequate from a transform-bath-sides point of view.
We note here that ACE's performance on model 1 does not indicate
a transformation but previous parametric analysis did.
ACE does not control for heteroscedasticity.
As seen before
Thus it is not surpris-
ing that the slight transformation indicated by parametric methods is
missed by ACE.
It is encouraging to see. at least in this example.
that ACE does behave differently when confronting two very different
models and seems to correctly indicate the more appropriate model.
•
102
4.3
ACE as a Diagnostic Tool
Often in parametric settings there are influential points which
have a large effect on the value of the estimated parameters.
Robust
methods exist for estimation that minimize the effect of the extreme
points, and case deletion diagnostics are useful in identifying such
points.
Breiman and Friedman (1985) state that the ACE procedure does not
perform well in the presence of extreme data points.
The question
then arises as to whether or not ACE can locate points that are not
"outliers"
in
the Breiman-Friedman sense yet are
influential
in
determining the estimated transformation.
In some numerical examples the deletion of a single point causes
•
the ACE transformation plots to change dramatically whereas in other
examples the deletion of a known influential point hardly affects the
plotted transformation.
We need a diagnostic that does not depend so
heavily on a subjective assessment of the ACE plots yet 15 able to
locate influential points even when the plotted transformation does
not appear to change much.
In trying to form a diagnostic based 'on the ACE procedure we are
limited in the amount of available information.
There is no theore-
tical distributional framework so methods based on likelihood distance
or influence functions are not available.
Since ACE maximizes
R2,
the diagnostic
considered, where the subscript i
•
R~i
= I R2
-
R~i}
I
is
indicates that the ith point 15
deleted and the procedure is reapplied to the reduced data set.
A
second diagnostic, DACE , is in the spirit of Cook's (1977) distance
i
103
•
and DFFITS (Belsley, Kuh, and Welsh, 1980) both of which measure the
effect of case deletion in a linear model based on the least squares
fitted values.
See also Cook and Weisberg (1982) for a discussion of
single case deletion diagnostics.
complicated.
Our situation is a little more
The model is nonlinear with the estimated transforma-
tions occurring on both sides of the equation.
Since our goal is to
see how much a single point affects the estimated transformation we
decided to compute [9(y j ) 9(y) and
~(x)
9(i)(Y j )] and [~(Xj) - ~(i)(Xj)] where
are the ACE transformed values based on the full data
set and (i) represents the reduced data set where the ith point has
been removed.
Notice that we are unable to Use 9(Yi) and
~(xi)
since
the ith point has been deleted and there is no corresponding 9(i)(yj)
and ~(i)(Xj) available to form the DACE
In order to
i diagnostic.
combine the information about 9 and ~ we form the following diagnostic
...
n
DACE i
~
2
[(9(y j ) - 9(i)(Y j )} -
j~l
jili
n
~l
j~l
j#
n
~l
j=l
j#
T
~ [RES - RES(i)] [RES - RES(i)]
where RES j represents the jth ACE residual and RES(i)j represents the
•
•
104
jth ACE residual for the reduced data.
Consider the perforllBIlce of these two 'estimators on the Skeena
River data set discussed in Chapter 2.
observations 12 and 16.
In that analysis we delete
Carroll and Ruppert (1986) analyze these data
using robust procedures.
They find observation 12, the year in which
the
to be an outlier.
rockslide occurred,
Observation 4,
whose
influence is masked by observation 12, is also determined to be an
influential point.
They find observations 5,
16.
19 and 25 to be
somewhat influential but with the removal of 12 and the downweighting
of 4 the other points seem to stabilize.
We compare these resu1 ts
with the outcome of our two diagnostics.
See Figure 4.18 for plots of the diagnostics based on the full
•
data set.
The
R~i
diagnostic does not indicate that observation 12
is an influential point.
moderate
influence
observations.
value of DACE
other values.
Based on this plot i t seems to have only
and
is
smaller
than
that
This is not the case for the DACE
i
of
i
several
diagnostic.
other
The
whim 12 is deleted is twice as large as any of the
It is interesting that the value of both diagnostics
for observations 8, 13. 16, 19 and 25 is moderately large but observation 4 is not flagged by either diagnostic.
We delete observation 12 and recompute the diagnostics.
Figure
4.19.
diagnostics.
,POints
4
and
16
are
clearly
for
both
As in Carroll and Ruppert's analysis the influence of
observation 4 is masked by observation 12.
•
flagged
See
For DACE
i
the points 8.
13. 19 and 25 again show moderate influence, whereas observations 6.
13, and 22 have rather large values for
We recompute
R~i'
the diagnostics when observations 12 and 4 are
•
105
removed and then again when observations 12 and 4 are deleted.
In the
first case when observations 4 and 12 are deleted both diagnostics are
stable except at the points 11. 17 and 25.
The ACE procedure con-
structs transformations such that there is a large residual at point
16.
,
The value for
R~i)
is not computed by the ACE. procedure for these
points and the corresponding value for DACE
observations 12 and 16 are deleted a
fourth observation for 9(21)(y).
i
becomes very large.
When
large residual occurs at the
Observation 8 also shows influence.
The results are not conclusive.
Observation 4 when 12 and 16 are
deleted and observation 16 when 12 and 4 are deleted continue to cause
points to be flagged as influential even though the points themselves
do not appear
•
diagnostics.
to be
identified as
influential by either of
the
It appears that retaining observation 16 creates more
problems than retaining observation 4.
Using also the rationale that
observation 16 occurs four years after the rock slide (observation
12). the apprOXimate time for which it takes the recruits to mature
and return to the river to spawn. we have some additional justifi-
cation for
deleting observation 16 from
the analysis as well as
observation 12. as we did for the example in Olapter 2.
It is important to realize that case deletion diagnostics in the
ACE setting are of limited value.
Nonparametric regression techniques
should be applied to large sets of data. but as the number of observations increases. case deletion methods become impractical.
However
it is interesting to see that in this example for a relatively small
•
data set. the performance of DACE
i
essentially agrees with the robust
analysis of Carroll and Ruppert (1986).
106
4.4
Discussion
The ACE procedure and the transform-both-sides regression model are alike in that both estimate unknown transformations. In transform-both-sides regression the transformation parameter λ controls the way in which error enters the model and does not contribute to the structure of the regression relationship. The primary goal in transform-both-sides is to model the data in such a way as to obtain the most efficient regression estimators. In ACE the transformations θ(·) and φ(·) create the regression relationship between x and y, and in this sense ACE is a prediction model rather than an estimation model.
We saw in Section 4.1 that the parameterized transform-both-sides ACE criterion leads to inconsistent estimates for λ when λ = 0. ACE, when applied to the generated data of Section 4.1.2, is generally not sensitive to the skewness and heteroscedasticity of the data, and the resulting ACE transformations are not always the transformation used to generate the data, i.e., the log transformation, λ = 0.
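As a reminder of how such data arise, the following is a minimal sketch of a transform-both-sides simulation with λ = 0. The Ricker-type mean function f(x, β) = β₁ x exp(−β₂ x) and the particular parameter values are illustrative choices for this sketch, not necessarily those of Section 4.1.2.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_tbs_data(n, sigma, beta1=3.0, beta2=0.5):
        """Transform-both-sides model with lambda = 0 (the log):
        log y = log f(x, beta) + sigma * eps, equivalently
        y = f(x, beta) * exp(sigma * eps)."""
        x = rng.uniform(0.1, 10.0, size=n)
        f = beta1 * x * np.exp(-beta2 * x)     # illustrative Ricker-type mean
        eps = rng.standard_normal(n)
        return x, f * np.exp(sigma * eps)      # multiplicative log-normal error

    # On the original scale these data are skewed and heteroscedastic;
    # an estimated transformation that recovers lambda = 0 removes both.
    x, y = generate_tbs_data(n=100, sigma=0.5)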
In Section 4.2, based on two examples, ACE worked well as a goodness-of-fit tool to determine the better of two or more competing models. Since ACE is designed to deal with the structure of regression models, we found ACE to be useful in identifying the model that previous analyses in the literature indicate as appropriate for the data.
The goals of ACE and transform-both-sides regression apparently differ in that ACE is a prediction tool and transform-both-sides regression is an estimation model. While the ACE algorithm may be a useful tool when trying to build a model or discriminate between models that are structurally different, it is not clear at this point how adequate ACE is for estimation. In order to further study the effectiveness of ACE for estimation it is necessary to gain more understanding about certain aspects of the ACE procedure, such as the relationship between sample size, variability, and the tuning constants for the smoothing procedure.
4.5 Appendix
We sketch the derivation of (4.2) and (4.3). Minimizing

    e² = E[θ(y) − φ(x)]²

with respect to φ(x) given θ(y) is the same as finding the best mean square error predictor of φ(x). That is,

    E[θ(y) − φ(x)]² = E{θ(y) − E[θ(y)|x]}² + E{φ(x) − E[θ(y)|x]}²,

since the cross term vanishes by conditioning on x, and the right-hand side is minimized by φ(x) = E[θ(y)|x]. Similarly, θ(y) = E[φ(x)|y] / ‖E[φ(x)|y]‖ minimizes e² with respect to θ(y) given φ(x), where θ(y) is scaled such that Var[θ(y)] = 1.
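To make the alternation concrete, the following is a minimal numerical sketch (in Python, assuming only NumPy) of these two updates, with the conditional expectations replaced by crude binned means. The binned-mean smoother, the bin count, the fixed iteration count, and the standardization enforcing Var[θ(y)] = 1 are simplifications of the smoother-based ACE algorithm of Breiman and Friedman (1985a).

    import numpy as np

    def cond_mean(z, by, nbins=10):
        """Crude estimate of E[z | by]: means over quantile bins of `by`
        stand in for a scatterplot smoother (assumes enough distinct
        values of `by` that no bin is empty)."""
        edges = np.quantile(by, np.linspace(0.0, 1.0, nbins + 1))
        idx = np.clip(np.searchsorted(edges, by, side="right") - 1, 0, nbins - 1)
        means = np.array([z[idx == b].mean() for b in range(nbins)])
        return means[idx]

    def ace_sketch(x, y, iters=20, nbins=10):
        """Alternate the two updates derived above:
        phi(x) <- E[theta(y) | x], then theta(y) <- E[phi(x) | y],
        standardized so that Var[theta(y)] = 1."""
        theta = (y - y.mean()) / y.std()        # starting transformation
        for _ in range(iters):
            phi = cond_mean(theta, x, nbins)    # phi(x) = E[theta(y) | x]
            theta = cond_mean(phi, y, nbins)    # unscaled E[phi(x) | y]
            theta = (theta - theta.mean()) / theta.std()
        e2 = np.mean((theta - phi) ** 2)        # the criterion e^2
        return theta, phi, e2

Applied to data generated under λ = 0, the estimated θ can be compared with a scaled log transformation, as is done graphically in Figures 4.2 and 4.6 through 4.8.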
4.6 Tables and Figures
All tables and figures for the examples in Chapter 4 follow.
Table 4.1: Simulated Data. Parameter estimates for λ̂, λ̂_2, and λ̂_F.

  N     σ    λ̂_F    λ̂_2     λ̂      95% Confidence Interval
  50   .1   -1.1   -0.86   -0.26   (-.44, -.04)
       .3          -0.88   -0.22   (-.41,  .02)
       .5          -0.92   -0.16   (-.37,  .08)
       1           -1.08   -0.03   (-.20,  .15)
 100   .1   -0.6   -0.45   -0.08   (-.18,  .04)
       .3          -0.45   -0.06   (-.16,  .05)
       .5          -0.45   -0.04   (-.14,  .07)
       1           -0.36   -0.02   (-.09,  .08)
 200   .1   -1.7   -0.52   -0.04   (-.13,  .05)
       .3          -0.57   -0.04   (-.12,  .05)
       .5          -0.49   -0.03   (-.10,  .04)
       1           -0.36   -0.02   (-.08,  .04)
 500   .1   -1.0   -0.55    0.02   (-.03,  .08)
       .3          -0.57    0.02   (-.03,  .08)
       .5          -0.56    0.02   (-.03,  .08)
       1           -0.45    0.02   (-.02,  .06)
Table 4.2: Summary measures for simulated data, N = 100. Skewness, kurtosis, and Spearman's ρ (with p-values) for the fits based on λ̂, λ̂_ACE, λ̂_2, and λ̂_BF, under λ = 1 and λ = 0, at σ = .1, .3, .5, and 1.
Figure 4.1: Simulated Data. Scatterplot of y versus x for N = 100 and σ = .1.
Figure 4.2: Simulated Data. The plotted points represent the ACE transformation and the solid line represents a scaled log transformation for N = 100 and σ = .1.
Figure 4.3: Simulated Data. Plot of ACE residuals against x for N = 100 and σ = .1.
Figure 4.4: Simulated Data. The jagged line represents the ACE transformation, the smooth solid line is the scaled log transformation curve, and the dashed line is the scaled λ = -.6 transformation curve.
Figure 4.5: Simulated Data. The plotted ACE transformation when ACE is applied to the λ̂_BF-transformed data, where λ̂_BF = -.6, N = 100, and σ = .1.
Figure 4.6: Simulated Data. The smooth curve is the scaled log transformation curve and the points represent the plotted ACE transformation for N = 100 and σ = .3.
Figure 4.7: Simulated Data. The solid line represents the ACE transformation curve and the dotted line represents the scaled log transformation curve for N = 100 and σ = .5.
Figure 4.8: Simulated Data. The solid line represents the ACE transformation curve and the dotted line represents the scaled log transformation curve for N = 100 and σ = 1.
Figure 4.9: Simulated Data. Plot of ACE residuals versus time for N = 100 and σ = .5.
Figure 4.10: Simulated Data. Plot of the ACE residuals against x for N = 100 and σ = 1.
Figure 4.11: Simulated Data. The dashed line represents the scaled log transformation curve, the solid line represents the ACE transformation curve, and the dotted line represents the scaled λ̂_2 transformation curve for N = 100 and σ = 1.
Figure 4.12: Simulated Data. Scatterplot of y versus x for σ = 1 and N = 100.
Figure 4.13: Atlantic Menhaden Data. The points represent the ACE transformation curve, the solid line is the scaled log transformation curve, and the dotted line is the scaled inverse transformation curve.
Figure 4.14: Atlantic Menhaden Data. The solid line represents the ACE transformation, the dotted line represents the Ricker model, and the dashed line represents the Beverton-Holt model.
Figure 4.15: Chemical Reaction Data - Model 1. ACE transformation plots for y and x.
Figure 4.17: Chemical Reaction Data. ACE transformation plots for y and f.
Figure 4.18: Skeena River Data. Diagnostic plots for the full data set, deleting one point at a time. Upper chart is for R²_Di and the lower is for DACE_i.
Figure 4.19: Skeena River Data. Diagnostic plots for the reduced data set where observation #12 has been removed. Upper chart is for R²_Di and the lower is for DACE_i.
BIBLIOGRAPHY
Adcock, R. J. (1878). A problem in least squares. The Analyst 5, 53-54.

Amemiya, Y. and Fuller, W. A. (1985). Estimation for the nonlinear functional relationship. Unpublished manuscript.

Andrews, D. F. (1971). A note on the selection of data transformations. Biometrika 58, 249-254.

Atkinson, A. C. (1973). Testing transformations to normality. Journal of the Royal Statistical Society, Series B 35, 473-479.

Bartlett, M. S. (1947). The use of transformations. Biometrics 3, 39-52.

Bates, D. M., Wolf, D. A., and Watts, D. G. (1985). Nonlinear least squares and first-order kinetics. Manuscript.

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression Diagnostics. New York: Wiley.

Beverton, R. J. H., and Holt, S. J. (1957). On the dynamics of exploited fish populations. Her Majesty's Stationery Office, London, 533.

Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformations revisited. Journal of the American Statistical Association 76, 296-311.

Blaylock, J. R., and Smallwood, D. M. (1985). Box-Cox transformations and a heteroscedastic error variance: import demand equations revisited. International Statistical Review 53, 91-97.

Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B 26, 211-246.

Box, G. E. P. and Cox, D. R. (1982). An analysis of transformations revisited. Journal of the American Statistical Association 77, 209-210.

Box, G. E. P. and Hill, W. J. (1974). Correcting inhomogeneity of variance with power transformation weighting. Technometrics 16, 385-389.

Box, G. E. P., and Tidwell, P. W. (1962). Transformation of the independent variables. Technometrics 4, 531-550.

Boylan, T. A., Cuddy, M. P., and O'Muircheartaigh, I. G. (1982). Import demand equations: an application of a generalized Box-Cox methodology. International Statistical Review 50, 103-112.

Breiman, L. and Friedman, J. H. (1982). Estimating optimal transformations for multiple regression and correlation. ORION 010, Department of Statistics, Stanford University.

Breiman, L. and Friedman, J. H. (1985a). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580-619.

Breiman, L. and Friedman, J. H. (1985b). Estimating optimal transformations for multiple regression. Computer Science and Statistics: The Interface, L. Billard (ed.). Elsevier Science Publishers, B. V., North Holland.

Carr, N. L. (1960). Kinetics of catalytic isomerization of n-pentane. Industrial and Engineering Chemistry 52, 391-396.

Carroll, R. J. (1980). A robust method for testing transformations to achieve approximate normality. Journal of the American Statistical Association 75, 614-619.

Carroll, R. J. (1982a). Adapting for heteroscedasticity in linear models. Annals of Statistics 10, 1224-1233.

Carroll, R. J. (1982b). Power transformations when the choice of power is restricted to a finite set. Journal of the American Statistical Association 77, 908-915.

Carroll, R. J. (1982c). Two examples where there are outliers. Applied Statistics 31, 149-152.

Carroll, R. J. (1982d). Tests for regression parameters in power transformation models. Scandinavian Journal of Statistics 9, 211-222.

Carroll, R. J., and Ruppert, D. (1981). On prediction and the power transformation family. Biometrika 68, 609-615.

Carroll, R. J., and Ruppert, D. (1982a). A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model. Journal of the American Statistical Association 77, 878-882.

Carroll, R. J., and Ruppert, D. (1982b). Robust estimation in heteroscedastic linear models. Annals of Statistics 10, 429-441.

Carroll, R. J., and Ruppert, D. (1984). Power transformations when fitting theoretical models to data. Journal of the American Statistical Association 79, 321-328.

Carroll, R. J., and Ruppert, D. (1985). Transformations: a robust analysis. Technometrics 27, 1-12.

Carroll, R. J., and Ruppert, D. (1986). Diagnostics when transforming the regression model and the response. Manuscript.

Carroll, R. J., and Ruppert, D. (1987). Transformations and Weighting in Regression. Chapman and Hall: New York and London. Manuscript.

Carroll, R. J., Spiegelman, C. H., Lan, K. K. G., Bailey, K. T., and Abbott, R. D. (1984). On errors-in-variables for binary regression models. Biometrika 71, 19-25.

Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics 19, 15-18.

Cook, R. D., and Weisberg, S. (1982). Residuals and Influence in Regression, 58-86. Chapman and Hall, New York and London.

Cox, D. R., and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, New York.

Draper, N. R. and Cox, D. R. (1969). On distributions and their transformations to normality. Journal of the Royal Statistical Society, Series B 31, 472-476.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, 220-241, 2nd edition. Wiley: New York.

Egy, D., and Lahiri, K. (1979). On maximum likelihood estimation of functional form and heteroskedasticity. Economic Letters 2, 155-159. North Holland Publishing Company.

Friedman, J. H., and Stuetzle, W. (1982). Smoothing of scatterplots. ORION 006 Report, Department of Statistics, Stanford University.

Friedman, J. H., and Owen, A. (1985). Predictive ACE. Unpublished manuscript.

Fuller, W. A. (1980). Properties of some estimators for the errors-in-variables model. Annals of Statistics 8, 407-422.

Gaudry, M. J. I., and Dagenais, M. G. (1979). Heteroscedasticity and the use of Box-Cox transformations. Economic Letters 2, 225-229.

Gleser, L. J. (1981). Estimation in a multivariate "errors in variables" regression model: large sample results. Annals of Statistics 9, 24-44.

Hernandez, F. and Johnson, R. A. (1980). The large sample behaviour of transformations to normality. Journal of the American Statistical Association 75, 855-861.

Hinkley, D. V. (1975). On power transformations to symmetry. Biometrika 62, 101-112.

Hinkley, D. V. and Runger, G. (1984). The analysis of transformed data. Journal of the American Statistical Association 79, 302-320.

Jobson, J. D. and Fuller, W. A. (1980). Least squares estimation when the covariance matrix and parameter vector are functionally related. Journal of the American Statistical Association 75, 176-181.

Kendall, M. G., and Stuart, A. (1979). The Advanced Theory of Statistics 2. London: Griffin.

Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika 29, 115-129.

Kruskal, J. B. (1965). Analysis of factorial experiments by estimating monotone transformations of the data. Journal of the Royal Statistical Society, Series B 27, 251-263.

Kummell, C. H. (1879). Reduction of observed equations which contain more than one observed quantity. The Analyst 6, 97-105.

Madansky, A. (1959). The fitting of straight lines when both variables are subject to error. Journal of the American Statistical Association 54, 173-205.

McCullagh, P. (1983). Quasi-likelihood functions. Annals of Statistics 11, 59-67.

Pritchard, D. J., Downie, J., and Bacon, D. W. (1977). Further considerations of heteroscedasticity in fitting kinetic models. Technometrics 19, 227-236.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. New York: Wiley.

Ricker, W. E. (1954). Stock and recruitment. Journal of the Fisheries Research Board of Canada 11, 559-623.

Ricker, W. E., and Smith, H. D. (1975). A revised interpretation of the history of the Skeena River sockeye salmon (Oncorhynchus nerka). Journal of the Fisheries Research Board of Canada 32, 1369-1381.

Rodriguez, R. N. (1985). A comparison of the ACE and MORALS algorithms in an application to engine exhaust emissions modeling. Computer Science and Statistics: The Interface, L. Billard (ed.). Elsevier Science Publishers, B. V., North Holland.

Ruppert, D., and Carroll, R. J. (1985). Data transformations in regression analysis with applications to stock-recruitment relationships. In Resource Management, M. Mangel (ed.). Lecture Notes in Biomathematics 61. Springer-Verlag, New York.

Serfling, R. J. (1981). Approximation Theorems of Mathematical Statistics. New York: Wiley.

Snee, R. D. (1986). An alternative approach to fitting models when reexpression of the response is useful. Journal of Quality Technology 18, 211-225.

Stefanski, L. A. (1983). Influence and measurement error in logistic regression. Unpublished Ph.D. dissertation, University of North Carolina, Chapel Hill.

Stefanski, L. A. (1985). The effects of measurement error on parameter estimation. Biometrika 72, 583-592.

Stefanski, L. A., and Carroll, R. J. (1985). Covariate measurement error in logistic regression. Annals of Statistics 13, 1335-1351.

Thurber, W. R., Lowney, J. R., and Phillips, W. E. (1985). Standard practice for characterizing semiconductor deep levels by transient capacitance techniques. Draft 6, document 6D44, ASTM Committee F1 on Electronics.

Wolter, K. M., and Fuller, W. A. (1982a). Estimation of nonlinear errors-in-variables models. Annals of Statistics 10, 539-548.

Wolter, K. M., and Fuller, W. A. (1982b). Estimation of the quadratic errors-in-variables model. Biometrika 69, 175-182.

Young, F. W. (1981). Quantitative analyses of qualitative data. Psychometrika 46, 357-388.

Young, F. W. (1985). Quantitative analyses of qualitative data. Computer Science and Statistics: The Interface, L. Billard (ed.). Elsevier Science Publishers, B. V., North Holland.

Young, F. W., de Leeuw, J., and Takane, Y. (1976). Regression with qualitative and quantitative variables: an alternating least squares method with optimal scaling features. Psychometrika 41, 505-529.

Zarembka, P. (1974). Transformation of variables in econometrics. In: P. Zarembka (ed.), Frontiers in Econometrics, 81-103. Academic Press, New York.