Carroll, R.J. and David Ruppert. "On Prediction and the Power Transformation Family."

ON PREDICTION AND THE POWER TRANSFORMATION FAMILY
by
R.J. Carroll* and David Ruppert**
University of North Carolina
Abstract
The power transformation family studied by Box and Cox (1964) for
transforming to a normal linear model has recently been further studied by
Bickel and Doksum (1978), who show that "the cost of not knowing λ and
estimating it . . . is generally severe"; some of the variances for
regression parameters can be extremely large. We consider prediction of
future observations (untransformed) when the data can be transformed to a
linear model and show that while there is a cost due to estimating λ it is
generally not severe. Similar results emerge for the two-sample problem.
Monte-Carlo results lend credence to the asymptotic calculations.
Key Words and Phrases:
Power transformations, Prediction, Robustness,
Box-Cox family, Asymptotic theory.
AMS 1970 Subject Classification: Primary 62E20; Secondary 62G35.
*Research for this work was supported by the Air Force Office of Scientific
Research under Grants AFOSR-75-2796 and AFOSR-80-0080.
**Research for this work was supported by the National Science Foundation
under Grant MCS78-01240.
1. Introduction
The power transformation family studied by Box and Cox (1964) takes the
following form: for some unknown λ,

(1.1)   $y_i^{(\lambda)} = x_i'\beta + \epsilon_i \qquad (i = 1, \ldots, N),$

where the $\{\epsilon_i\}$ are independently and identically distributed with
distribution F, and where

$y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$   if $\lambda \neq 0$,
$y^{(\lambda)} = \log y$   if $\lambda = 0$.
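In code the transformation and its inverse are one-liners; the following is a
minimal Python sketch (function names are ours, not from the paper), reused in
the examples later in this section.

```python
import numpy as np

def boxcox(y, lam):
    """The power transformation y^(lambda) of (1.1)."""
    if lam == 0.0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def boxcox_inverse(z, lam):
    """Inverse transformation: recover y from z = y^(lambda)."""
    if lam == 0.0:
        return np.exp(z)
    return (1.0 + lam * z) ** (1.0 / lam)
```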
Box and Cox propose maximum likelihood estimates for λ and β when F is the
normal distribution. There are numerous alternate methods as well as
proposals for testing hypotheses of the form $H_0: \lambda = \lambda_0$
(Hinkley (1975), Andrews (1971), Atkinson (1973), Carroll (1978)). Carroll
studied the testing problem via Monte-Carlo; by allowing F to be non-normal
he approximated a problem with outliers and found that the chance of
mistakenly rejecting the null hypothesis can be very high indeed.
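Under normal errors the maximum likelihood estimate of λ can be computed by
maximizing the Box-Cox profile log-likelihood over a grid; a minimal sketch,
reusing boxcox from above (the grid and function names are ours):

```python
def profile_loglik(lam, y, X):
    """Box-Cox profile log-likelihood for lambda under normal errors:
    regress y^(lambda) on X and add the Jacobian term (lam - 1)*sum(log y)."""
    z = boxcox(y, lam)
    beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
    sigma2_hat = np.mean((z - X @ beta_hat) ** 2)
    return -0.5 * len(y) * np.log(sigma2_hat) + (lam - 1.0) * np.log(y).sum()

def fit_boxcox(y, X, grid=np.linspace(-2.0, 2.0, 401)):
    """Grid-search MLE: returns (lambda_hat, beta_hat(lambda_hat))."""
    lam_hat = max(grid, key=lambda lam: profile_loglik(lam, y, X))
    beta_hat, *_ = np.linalg.lstsq(X, boxcox(y, lam_hat), rcond=None)
    return lam_hat, beta_hat
```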
Bickel and Doksum (1978, hereafter denoted B-D) develop an asymptotic theory
for estimation. They assume that $x_1, x_2, \ldots$ are independent and
identically distributed according to G, a lead which we follow throughout.
If the maximum likelihood estimate of the regression parameter is $\hat\beta$
($\hat\beta(\hat\lambda)$) when λ is known (unknown and estimated by
$\hat\lambda$), they compute the asymptotic distributions of
$n^{\frac12}(\hat\beta - \beta)/\sigma$ and
$n^{\frac12}(\hat\beta(\hat\lambda) - \beta)/\sigma$ as $n \to \infty$,
$\sigma \to 0$. These distributions are different, and as regards variances:
'the cost of not knowing λ and estimating it . . . is generally
severe. . . . The problem is that $\hat\beta(\hat\lambda)$ and $\hat\lambda$
are highly correlated. Thus, the assertion (of Box and Cox) that "having
chosen a suitable λ, we should make the usual detailed examination and
interpretation on this transformed scale" is incorrect.'
""
"
They thus show that A and
""
~(A)
"
are unstable (i.e., highly variable) and
highly correlated, a problem similar in nature to that of multicollinearity
in regression.
These conclusions made by B-D seem to have caused some controversy.
One point of discussion concerns the scale on which inference is to be made:
i.e., should one make unconditional inference about the regression parameter
in the correct but unknown scale (the B-D theory) or a conditional inference
for an appropriately defined "regression parameter" in an estimated scale?
In order to eliminate such problems, we will study the cost of estimating λ
when one wants to make inferences in the original scale of the observations.
In the multicollinearity problem, reasonably good prediction is still
possible if new vectors x arrive independently with the distribution G.
Motivated by this fact (see the discussion in Section 2), we will focus our
attention specifically on prediction, but we also discuss the two-sample
problem and a somewhat more general estimation theory. Using the B-D
asymptotic theory and Monte-Carlo, we find that for prediction (and other
problems in the original scale) there is a cost due to estimating λ, but it
is generally not severe.
2. Predicting the conditional median in regression
Our model specifically includes an intercept, i.e., $x_i = (1, c_i)$; by
suitable rescaling we assume the $c_i$ have mean zero and identity
covariance. From the sample we calculate $\hat\lambda$ and
$\hat\beta(\hat\lambda)$, and we are given a new vector $x_0 = (1, c_0)$
which is independent of the other x's but still has the same distribution G.
Our predicted value in the transformed scale would be
$x_0'\hat\beta(\hat\lambda)$, so a natural predictor is

(2.1)   $f(\hat\lambda,\; x_0'\hat\beta(\hat\lambda)),$

where

$f(\lambda, \theta) = (1 + \lambda\theta)^{1/\lambda}$   if $\lambda \neq 0$,
$f(\lambda, \theta) = \exp(\theta)$   if $\lambda = 0$.

Notice that if F is symmetric, or more generally has median equal to 0, then
$f(\lambda, x_0'\beta)$ is the median of the conditional distribution of y
given $x_0$, even though it is not necessarily the conditional expectation.
Calculation of conditional expectations would require the use of numerical
integration and that F be known or an estimator of F be available (see
Section 4).
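In code the predictor (2.1) is simply the inverse transformation evaluated at
the fitted linear predictor; a sketch continuing the Python helpers above:

```python
def predict_median(lam_hat, beta_hat, x0):
    """Natural predictor (2.1): f(lambda_hat, x0' beta_hat(lambda_hat)).
    When F has median zero this estimates the conditional median of y given x0."""
    return boxcox_inverse(float(np.dot(x0, beta_hat)), lam_hat)
```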
Assuming $\hat\lambda$ and $\hat\beta(\hat\lambda)$ are consistent,

(2.2)   $f(\hat\lambda, x_0'\hat\beta(\hat\lambda)) \to_p f(\lambda, x_0'\beta).$

A Taylor expansion shows that

(2.3)   $f(\hat\lambda, x_0'\hat\beta(\hat\lambda)) - f(\lambda, x_0'\beta)
\approx g(\lambda, x_0'\beta)\,\{x_0'(\hat\beta(\hat\lambda) - \beta)
+ h(\lambda, x_0'\beta)(\hat\lambda - \lambda)\},$

where

$g(\lambda, \theta) = f(\lambda, \theta)/(1 + \lambda\theta),$
$h(\lambda, \theta) = \theta/\lambda - \{(1 + \lambda\theta)\log(1 + \lambda\theta)\}/\lambda^2 .$
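The functions g and h are just the partial derivatives of f, which is the
content of the expansion (2.3); the short verification is:

```latex
% Derivatives of f(\lambda,\theta) = (1+\lambda\theta)^{1/\lambda}:
\frac{\partial f}{\partial \theta}
   = (1+\lambda\theta)^{1/\lambda - 1}
   = \frac{f(\lambda,\theta)}{1+\lambda\theta}
   = g(\lambda,\theta),
\qquad
\frac{\partial f}{\partial \lambda}
   = f(\lambda,\theta)\left[\frac{\theta}{\lambda(1+\lambda\theta)}
        - \frac{\log(1+\lambda\theta)}{\lambda^{2}}\right]
   = g(\lambda,\theta)\,h(\lambda,\theta).
```

The linear term in $(\hat\beta(\hat\lambda) - \beta,\, \hat\lambda - \lambda)$
is therefore $g(\lambda, x_0'\beta)\{x_0'(\hat\beta(\hat\lambda) - \beta) +
h(\lambda, x_0'\beta)(\hat\lambda - \lambda)\}$, as in (2.3).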
Noting that $\hat\lambda$ and $\hat\beta(\hat\lambda)$ are unstable and
highly correlated, the expansion (2.3) shows that our problem as presently
formulated is quite similar to a prediction problem in regression when there
is multicollinearity.
Note: If $1 + \hat\lambda\, x_0'\hat\beta(\hat\lambda) < 0$, then f in (2.1)
must be defined by absolute values. This circumstance will not occur often;
if $\hat\lambda > 0$ ($< 0$), then except in rare cases
$x_0'\hat\beta(\hat\lambda) \geq \min_i y_i^{(\hat\lambda)}$
($\leq \max_i y_i^{(\hat\lambda)}$), in which case
$1 + \hat\lambda\, x_0'\hat\beta(\hat\lambda) \geq 0$.
Case I (λ = 0; σ = 1; normal errors; simple linear regression with slope
$\beta_1$, intercept $\beta_0$)

For this special case, likelihood calculations (cf. Hinkley (1975)) can be
made. Here the correct scale is the log scale, and $Ec_i = 0$, $Ec_i^2 = 1$,
$Ec_i^3 = \mu_3$, $Ec_i^4 = \mu_4$. Lengthy likelihood analysis shows that
the asymptotic covariance matrix of
$N^{\frac12}(\hat\lambda, \hat\beta_0(\hat\lambda), \hat\beta_1(\hat\lambda))$
is $\Sigma_0$, where

$\Sigma_0 = 2\gamma^{-1}\begin{bmatrix}
1 & -c & \beta_0\beta_1 \\
-c & \gamma + c^2 & -c\beta_0\beta_1^* \\
\beta_0\beta_1 & -c\beta_0\beta_1^* & \gamma + \beta_0^2\beta_1^{*2}
\end{bmatrix}$

and where

$c = -\tfrac12(1 + \beta_0^2 + \beta_1^2),$
$\beta_1^* = \beta_1 + \beta_1^2\mu_3/(2\beta_0),$
$\gamma = 3 + 4\beta_1^2 + \beta_1^4(\mu_4 - \mu_3^2 - 1).$
Theorem 1.

(2.4)   $\dfrac{\mathrm{Var}\,[\{f(\hat\lambda, x_0'\hat\beta(\hat\lambda)) - f(\lambda{=}0, x_0'\beta)\}\exp(-x_0'\beta)]}
{\mathrm{Var}\,[\{f(\lambda{=}0, x_0'\hat\beta) - f(\lambda{=}0, x_0'\beta)\}\exp(-x_0'\beta)]}
\;\to\; H(\beta_1).$
The normalization by $\exp(-x_0'\beta) = (g(\lambda{=}0, x_0'\beta))^{-1}$ is
as in (2.3) and is made to simplify the calculations and results. Since the
numerator of (2.4) is the (suitably normalized) variance of the prediction
error when λ is estimated while the denominator is the same variance for the
case λ known, Theorem 1 tells us: (i) there is a cost due to estimating λ,
but it cannot exceed 50%; (ii) for the balanced two-sample problem
($c_i = \pm 1$ with probability ½), the cost is at most 8% and decreases to
zero as $\beta_1 \to \infty$.
Case II (symmetric errors, p-parameter regression, any λ)

Here we use the small-σ asymptotic theory ($n \to \infty$, $\sigma \to 0$) of
B-D. They find that (equation (8.2))
$n^{\frac12}(\hat\sigma(\hat\lambda) - \sigma)/\sigma$ has variance ½ and is
asymptotically independent of
$n^{\frac12}((\hat\lambda, \hat\beta(\hat\lambda)) - (\lambda, \beta))/\sigma$,
which has limiting covariance

$\Sigma_1 = e^{-1}\begin{bmatrix} 1 & -a' \\ -a & eI + aa' \end{bmatrix},$

where

$x = (x_1, \ldots, x_p),$
$H(a, \lambda) = \lambda^{-1}a - \lambda^{-2}(1 + \lambda a)\log(1 + \lambda a),$
$a = E\,\{x\,H(x'\beta, \lambda)\},$
$e = E\,H^2(x'\beta, \lambda) - \sum_{j=1}^{p}\{E\,x_j H(x'\beta, \lambda)\}^2 .$

It is interesting to note that in the case of simple linear regression with
λ = 0, $\Sigma_1$ is different from but of the same form as $\Sigma_0$
(replace $c$ by $c^*$ and $\gamma/2$ by
$e = \beta_1^4(\mu_4 - \mu_3^2 - 1)/4$).
Theorem 2. In Theorem 1, replace λ = 0 by λ and $\exp(-x_0'\beta)$ by
$(\sigma g(\lambda, x_0'\beta))^{-1}$. Then Theorem 1 holds with
$H(\beta_1) = 1 + 1/p$.

The small-σ asymptotics of B-D tell us that there is a positive but bounded
cost due to estimating λ, with the cost decreasing as p increases. Note that
Theorem 2 and Theorem 1 agree for simple linear regression, λ = 0,
$\mu_4 - 1 - \mu_3^2 > 0$ and $\beta_1 \to \infty$.
B-D and Carroll also simultaneously introduced robust estimates of
$(\lambda, \beta)$ based on the ideas of Huber (1977). One can use the B-D
small-σ asymptotics to show that (i) the cost in robust estimation for
estimating λ is still 1/p and (ii) the Bickel-Doksum and Carroll methods have
better robustness properties than does maximum likelihood.
We conducted a small Monte-Carlo study to check small-sample performance and
to investigate the results of Theorems 1 and 2. The observations were
generated according to

$y_i = (1 + \beta_0 + \beta_1 c_i + \epsilon_i)^{1/\lambda}$   (λ = −1, 1)
$y_i = \exp(\beta_0 + \beta_1 c_i + \epsilon_i)$   (λ = 0),   (i = 1, ..., N).

Here N = 20, the $\{\epsilon_i\}$ are standard normal, $\beta_0 = 5$,
$\beta_1 = 1$, and the $c_i$, centered at zero and equally spaced, satisfy
$\sum c_i^2 = N$ and range from −1.65 to 1.65. Then $\mu_4 = 1.79$ and
$H(\beta_1) = 1.06$, so that Theorems 1 and 2 lead us to expect very little
cost due to estimating λ.
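A hedged sketch of this data-generating scheme, continuing the Python helpers
above (the rescaling of the design reproduces the stated range ±1.65 and
$\mu_4 \approx 1.79$):

```python
import numpy as np

def generate_sample(lam, n=20, beta0=5.0, beta1=1.0, rng=None):
    """One Monte-Carlo sample from the design in the text: c_i equally
    spaced and centered with sum(c_i^2) = n; y returned on the original scale."""
    rng = np.random.default_rng() if rng is None else rng
    c = np.linspace(-1.0, 1.0, n)
    c = c * np.sqrt(n / np.sum(c ** 2))     # now sum(c_i^2) = n, range ~ +/-1.65
    z = beta0 + beta1 * c + rng.standard_normal(n)
    y = np.exp(z) if lam == 0.0 else (1.0 + z) ** (1.0 / lam)
    return c, y
```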
There were 600 iterations in the experiment. For this example, the estimated
covariance matrix of
$N^{\frac12}(\hat\lambda, \hat\beta_0(\hat\lambda), \hat\beta_1(\hat\lambda))$
was

$\begin{bmatrix} .27 & 3.65 & 1.35 \\ 3.65 & 50.28 & 18.25 \\ 1.35 & 18.25 & 7.76 \end{bmatrix},$
with
Correlation Matrix $= \begin{bmatrix} 1 & .99 & .93 \\ .99 & 1 & .92 \\ .93 & .92 & 1 \end{bmatrix},$

which illustrates the multicollinearity quite well.
In Table 1 we provide an analysis of the estimates $\hat\beta_0(\hat\lambda)$
and $\hat\beta_1(\hat\lambda)$. The estimates are quite biased and, as noted
by Carroll and B-D, the variances of $\hat\beta_0(\hat\lambda),
\hat\beta_1(\hat\lambda)$ are much larger than those of $\hat\beta_0,
\hat\beta_1$ when λ is known. In Table 2 we present the results for the
prediction problem. The first two rows are the relative biases when λ is
known or unknown, respectively. In the third row we have the prediction
variance when λ is estimated relative to that when λ is known. These tables
support our asymptotic calculations and show that there is very little cost
involved in estimating λ for prediction.
To this point we have defined the cost of estimating λ as an average over the
distribution of the new value $x_0$. It is also of interest to study the
costs conditional on a given value of $x_0$. For Case I when
$x_0 = (1, c_0)'$, the asymptotic ratio of the conditional prediction
variances (when λ is estimated versus known) is given by

(2.5)   $\lim_{N \to \infty}$ of this ratio $= T_0(c_0, \Sigma_0)$,

while for Case II this limit ($N \to \infty$, $\sigma \to 0$) is
$T_1(c_0, \Sigma_1)$, where

$T_j(c_0, \Sigma_j) = a\,\Sigma_j\,a' \qquad (j = 0, 1),$

with a the vector of derivatives appearing in the expansion (2.3).
In Table 3 we list the values of $T_0(c_0, \Sigma_0)$, $T_1(c_0, \Sigma_1)$
and the results of a Monte-Carlo experiment (600 iterations) for the simple
linear regression design discussed previously (at the different values
$c_1, \ldots, c_{20}$ in this design). As expected from Theorems 1 and 2,
there is only a slight cost due to estimating λ, and the B-D small-σ
asymptotics are somewhat conservative.
Table 1

Biases and relative costs in the simple linear regression model when λ is
estimated ($\beta_0 = 5$, $\beta_1 = 1$).

λ      $E\hat\beta_0(\hat\lambda) - \beta_0$   $E\hat\beta_1(\hat\lambda) - \beta_1$   $\mathrm{Var}(\hat\beta_0(\hat\lambda))/\mathrm{Var}(\hat\beta_0)$   $\mathrm{Var}(\hat\beta_1(\hat\lambda))/\mathrm{Var}(\hat\beta_1)$
1.0    .44    .20    12.9    4.0
0.0    .60    .26     9.6    4.0
-1.0   .44    .20    12.9    4.0
Table 2

Prediction biases and relative variance in the simple linear regression model.

                                                                              λ = 1    λ = 0    λ = -1
$E\{f(\lambda, x_0'\hat\beta) - f(\lambda, x_0'\beta)\} / E f(\lambda, x_0'\beta)$      .00      .07      .00
$E\{f(\hat\lambda, x_0'\hat\beta(\hat\lambda)) - f(\lambda, x_0'\beta)\} / E f(\lambda, x_0'\beta)$      .01      .07      .00
$\mathrm{Var}\,[f(\hat\lambda, x_0'\hat\beta(\hat\lambda)) - f(\lambda, x_0'\beta)] / \mathrm{Var}\,[f(\lambda, x_0'\hat\beta) - f(\lambda, x_0'\beta)]$      1.06     1.06     1.02
Table 3

Relative conditional costs in the simple linear regression model when the
data are generated by exp(5 + c + ε). The rows run over
$c_0 = \pm 1.65, \pm 1.47, \pm 1.30, \pm 1.13, \pm .95, \pm .78, \pm .61,
\pm .43, \pm .26, \pm .087$ and $c_0 = 0$ (not in design); the three columns
give the likelihood analysis using $\Sigma_0$ (i.e., $T_0(c_0, \Sigma_0)$),
the Monte-Carlo results, and the small-σ analysis using $\Sigma_1$ (i.e.,
$T_1(c_0, \Sigma_1)$). The entries are near 1.00 for $c_0$ near the center of
the design and rise toward roughly 2.3 at $c_0 = \pm 1.65$; the final rows
give the column averages and the predicted averages, 1.06 (Theorem 1) for the
likelihood analysis and 1.50 (Theorem 2) for the small-σ analysis.
3. Two samples: estimating the difference in medians

The balanced two-sample problem has the structure

$y_i^{(\lambda)} = \theta_1 + \epsilon_i \qquad (i = 1, \ldots, N)$
$y_i^{(\lambda)} = \theta_2 + \epsilon_i \qquad (i = N+1, \ldots, 2N).$

B-D show that for estimating $\theta_1 - \theta_2$ there is a substantial
penalty for not knowing λ unless $\theta_1 = \theta_2$. The "treatment
difference" in the original scale that we estimate is defined by

$\Delta = f(\lambda, \theta_2) - f(\lambda, \theta_1).$

When the errors are symmetric, Δ is just the difference in the medians of the
two populations. Because the B-D small-σ asymptotics do not apply, we only
consider the case σ fixed. We confine our attention to λ = 0 and standard
normal errors. Lengthy likelihood calculations show that the asymptotic
covariance matrix of
$N^{\frac12}(\hat\lambda, \hat\theta_1(\hat\lambda), \hat\theta_2(\hat\lambda))$
is

$\Omega = \gamma^{-1}\begin{bmatrix}
4 & 2\alpha & 2\beta \\
2\alpha & \gamma + \alpha^2 & \alpha\beta \\
2\beta & \alpha\beta & \gamma + \beta^2
\end{bmatrix},$

where

$\alpha = 1 + \theta_1^2,$
$\beta = 1 + \theta_2^2,$
$\mu = \theta_1 + \theta_2,$
$\gamma = 14 + 10(\theta_1^2 + \theta_2^2) + \theta_1^4 + \theta_2^4 - \alpha^2 - \beta^2 - 4\mu^2 .$
Define $H(\lambda, \theta_1, \theta_2) = f(\lambda, \theta_2) - f(\lambda,
\theta_1)$, so that $\Delta = H(\lambda, \theta_1, \theta_2)$. Then, by
Taylor expansions, it is possible to show

Theorem 3.
$E[H(\hat\lambda, \hat\theta_1(\hat\lambda), \hat\theta_2(\hat\lambda)) - \Delta]^2
\,\big/\, E[H(\lambda, \hat\theta_1, \hat\theta_2) - \Delta]^2
\;\to\; H(\theta_1, \theta_2).$

If $\theta_1 = \theta_2$, then $H(\theta_1, \theta_2) = 1$ and there is no
cost for estimating λ. Our numerical calculations indicate that
$1 \leq H(\theta_1, \theta_2) \leq 1.03$, which indeed means there is a very
small cost overall.
We conducted a Monte-Carlo experiment with $\theta_1 = 4$, $\theta_2 = 6$.
The sample size for each population was 10 and there were 600 iterations of
the experiment. Here

$\hat\Omega = \begin{bmatrix} .14 & 1.21 & 2.64 \\ 1.21 & 11.32 & 22.46 \\ 2.64 & 22.46 & 49.89 \end{bmatrix},$
Correlation Matrix $= \begin{bmatrix} 1 & .96 & .99 \\ .96 & 1 & .95 \\ .99 & .95 & 1 \end{bmatrix}.$

The results are given in Tables 4 and 5, from which it appears that when
working in the original scale, and estimating population medians, there is
essentially no cost for estimating λ.
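The displayed matrix can be checked by evaluating Ω directly at
$\theta_1 = 4$, $\theta_2 = 6$; a short Python script (variable names are
ours):

```python
import numpy as np

theta1, theta2 = 4.0, 6.0
alpha, beta = 1.0 + theta1**2, 1.0 + theta2**2          # 17, 37
mu = theta1 + theta2                                    # 10
gamma = (14.0 + 10.0 * (theta1**2 + theta2**2) + theta1**4 + theta2**4
         - alpha**2 - beta**2 - 4.0 * mu**2)            # 28
omega = np.array([[4.0,     2*alpha,          2*beta],
                  [2*alpha, gamma + alpha**2, alpha*beta],
                  [2*beta,  alpha*beta,       gamma + beta**2]]) / gamma
# diagonal: .14, 11.32, 49.89 -- matching the displayed values
d = np.sqrt(np.diag(omega))
print(omega, omega / np.outer(d, d))                    # correlations .96, .99, .95
```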
Table 4

Biases and relative costs in the two-sample problem when λ is estimated. The
populations have transformed means $\theta_1 = 4$, $\theta_2 = 6$, so that
$\beta_0 = (\theta_1 + \theta_2)/2 = 5$ and $\beta_1 = (\theta_2 - \theta_1)/2 = 1$.

λ      $E\hat\beta_0(\hat\lambda) - \beta_0$   $E\hat\beta_1(\hat\lambda) - \beta_1$   $\mathrm{Var}(\hat\beta_0(\hat\lambda))/\mathrm{Var}(\hat\beta_0)$   $\mathrm{Var}(\hat\beta_1(\hat\lambda))/\mathrm{Var}(\hat\beta_1)$
1.0    .20    .13    13.2    4.2
0.0    .67    .28    16.3    5.8
-1.0   .21    .14    13.2    4.2
Table 5

Biases and relative costs in the two-sample problem. Here the treatment
difference is $\Delta = f(\lambda, \theta_2) - f(\lambda, \theta_1)$ and
$H(\lambda, \theta_1, \theta_2) = f(\lambda, \theta_2) - f(\lambda, \theta_1)$.

                                                                              λ = 1    λ = 0    λ = -1
$E[H(\lambda, \hat\theta_1, \hat\theta_2) - \Delta]/\Delta$      .01      .04      .00
$E[H(\hat\lambda, \hat\theta_1(\hat\lambda), \hat\theta_2(\hat\lambda)) - \Delta]/\Delta$      .02      .02      .01
$E[H(\hat\lambda, \hat\theta_1(\hat\lambda), \hat\theta_2(\hat\lambda)) - \Delta]^2 / E[H(\lambda, \hat\theta_1, \hat\theta_2) - \Delta]^2$      1.00     1.00     .99
4. Prediction of the conditional mean

The estimator in Section 2 is the median of the conditional distribution of
y given $x_0$ (when F is symmetric). Our focus in this section is on
estimating the conditional mean of y given $x_0$.

We sketch a general result which indicates that the cost of extra nuisance
parameters (such as λ) is not large. We assume a regression model with
$(Y_i, x_i)$ having a joint density $g(y, x\,|\,\theta_0)$. As in normal
theory regression we assume that the density factors as
$g(y, x\,|\,\theta_0) = g_1(y\,|\,x, \theta_0)\,g_2(x)$, so that the
distribution of x does not depend on $\theta_0$.
Letting $L_N(\theta)$ denote the log-likelihood, we make the usual
assumptions under which

(4.1)   $N(\hat\theta_N - \theta_0)'\,I(\theta_0)\,(\hat\theta_N - \theta_0) \xrightarrow{L} \chi^2(q),$

where $\hat\theta_N$ is the maximum likelihood estimate, $I(\theta_0)$ is the
information matrix, and q is the dimension of the parameter $\theta_0$. We
are given a new value $x_0$ and wish to predict $E(Y\,|\,x_0)$; the natural
estimate (usually only computable numerically) is

(4.2)   $\hat E(Y\,|\,x_0) = \int y\,g_1(y\,|\,x_0, \hat\theta_N)\,dy .$

A Taylor expansion shows that

(4.3)   $A_N(\theta_0, x_0) \equiv \hat E(Y\,|\,x_0) - E(Y\,|\,x_0)
\approx \int y\,(\hat\theta_N - \theta_0)'\,\frac{\partial}{\partial\theta}\,g_1(y\,|\,x_0, \theta_0)\,dy
= \int \{y - E(Y\,|\,x_0)\}\Big\{\frac{\partial}{\partial\theta}\log g_1(y\,|\,x_0, \theta_0)\Big\}'(\hat\theta_N - \theta_0)\,g_1(y\,|\,x_0, \theta_0)\,dy .$
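For the Box-Cox model with normal errors the integral in (4.2) reduces to
$E\,f(\hat\lambda, \hat\theta + \hat\sigma Z)$ with Z standard normal, which
Gauss-Hermite quadrature handles well; a minimal sketch, assuming fitted
values $\hat\lambda$, $\hat\theta = x_0'\hat\beta$, $\hat\sigma$ (the
absolute value follows the Note of Section 2):

```python
import numpy as np

def conditional_mean(lam_hat, theta_hat, sigma_hat, n_nodes=40):
    """Numerical version of (4.2): E(Y|x0) = E f(lam, theta + sigma*Z),
    Z ~ N(0,1), via Gauss-Hermite quadrature (z = theta + sqrt(2)*sigma*t)."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    z = theta_hat + np.sqrt(2.0) * sigma_hat * t
    if lam_hat == 0.0:
        vals = np.exp(z)            # lognormal case: equals exp(theta + sigma^2/2)
    else:
        vals = np.abs(1.0 + lam_hat * z) ** (1.0 / lam_hat)
    return float(np.sum(w * vals) / np.sqrt(np.pi))
```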
An overall measure of the accuracy of the prediction is
$N\,E\,A_N^2(\theta_0, x_0)$; (4.1), (4.3) and the Schwarz inequality show

(4.4)   $N\,E[A_N^2(\theta_0, x_0)\,|\,\text{sample}]
\leq \mathrm{Var}(Y - E(Y\,|\,x_0))\,N^{\frac12}(\hat\theta_N - \theta_0)'\,I(\theta_0)\,N^{\frac12}(\hat\theta_N - \theta_0)
\xrightarrow{L} \mathrm{Var}(Y - E(Y\,|\,x_0))\,\chi^2(q),$

where $\chi^2(q)$ is a chi-square with q degrees of freedom. This suggests

(4.5)   $N\,E\,A_N^2(\theta_0, x_0) \leq \mathrm{Var}(Y - E(Y\,|\,x_0))\,q .$

Equation (4.5) shows that in prediction with q parameters, the average
squared prediction error is bounded and this bound increases in relative
magnitude by r/q when r additional nuisance parameters are added. A similar
result holds for the two-sample problem.
Example: Consider the transformation model (1.1) but take λ = 1; this means
one uses the Box-Cox model when transformation is unnecessary. If there are
p regression parameters, then q = p + 1 when λ = 1 is known and
$N\,E\,A_N^2(\theta_0, x_0) = \mathrm{Var}(y - E(y\,|\,x_0))\,p$. When one
estimates λ, (4.5) shows that
$N\,E\,A_N^2(\theta_0, x_0) \leq \mathrm{Var}(y - E(y\,|\,x_0))(p + 2)$.
Thus, the relative cost of estimating λ is 2/p (which agrees qualitatively
with Theorem 2).
References

Andrews, D.F. (1971). A note on the selection of data transformations.
Biometrika 58, 249-254.

Atkinson, A.C. (1973). Testing transformations to normality. J. Royal
Statist. Soc. Ser. B 35, 473-479.

Bickel, P.J. and Doksum, K.A. (1978). An analysis of transformations
revisited. Unpublished manuscript, University of California, Berkeley.

Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations. J. Royal
Statist. Soc. Ser. B 26, 211-252.

Carroll, R.J. (1978). A robust method for testing transformations to achieve
approximate normality. Institute of Statistics Mimeo Series #1190,
University of North Carolina at Chapel Hill. To appear in J. Royal Statist.
Soc. Ser. B.

Hinkley, D.V. (1975). On power transformations to symmetry. Biometrika 62,
101-111.

Huber, P.J. (1977). Robust Statistical Procedures. SIAM, Philadelphia, PA.