ON PREDICTION AND THE POWER TRANSFORMATION FAMILY

by R.J. Carroll* and David Ruppert**

University of North Carolina

Abstract

The power transformation family studied by Box and Cox (1964) for transforming to a normal linear model has recently been further studied by Bickel and Doksum (1978), who show that "the cost of not knowing λ and estimating it . . . is generally severe"; some of the variances for regression parameters can be extremely large. We consider prediction of future observations (untransformed) when the data can be transformed to a linear model and show that while there is a cost due to estimating λ, it is generally not severe. Similar results emerge for the two-sample problem. Monte-Carlo results lend credence to the asymptotic calculations.

Key Words and Phrases: Power transformations, Prediction, Robustness, Box-Cox family, Asymptotic theory.

AMS 1970 Subject Classification: Primary 62E20; Secondary 62G35.

* Research for this work was supported by the Air Force Office of Scientific Research under Grants AFOSR-75-2796 and AFOSR-80-0080.
** Research for this work was supported by the National Science Foundation under Grant MCS78-01240.

1. Introduction

The power transformation family studied by Box and Cox (1964) takes the following form: for some unknown λ,

(1.1)    y_i^{(λ)} = x_i β + ε_i    (i = 1, ..., N),

where {ε_i} are independently and identically distributed with distribution F, and where

    y^{(λ)} = (y^λ − 1)/λ    (λ ≠ 0)
            = log y          (λ = 0).

Box and Cox propose maximum likelihood estimates for λ and β when F is the normal distribution. There are numerous alternate methods, as well as proposals for testing hypotheses of the form H₀: λ = λ₀ (Andrews (1971), Atkinson (1973), Hinkley (1975), Carroll (1978)). Carroll studied the testing problem via Monte-Carlo; by allowing F to be non-normal he approximated a problem with outliers and found that the chance of mistakenly rejecting the null hypothesis can be very high indeed.

Bickel and Doksum (1978, hereafter denoted B-D) develop an asymptotic theory for estimation. They assume that x₁, x₂, ... are independent and identically distributed according to G, a lead which we follow throughout. If β̂ denotes the maximum likelihood estimate of the regression parameter when λ is known, and β̂(λ̂) the estimate when λ is unknown and estimated by λ̂, they compute the asymptotic distributions of n^{1/2}(β̂ − β)/σ and n^{1/2}(β̂(λ̂) − β)/σ as n → ∞, σ → 0. These distributions are different as regards variances:

'the cost of not knowing λ and estimating it . . . is generally severe. . . . The problem is that β̂(λ̂) and λ̂ are highly correlated. Thus, the assertion (of Box and Cox) that "having chosen a suitable λ, we should make the usual detailed examination and interpretation on this transformed scale" is incorrect.'

They thus show that λ̂ and β̂(λ̂) are unstable (i.e., highly variable) and highly correlated, a problem similar in nature to that of multicollinearity in regression.

These conclusions made by B-D seem to have caused some controversy. One point of discussion concerns the scale on which inference is to be made: i.e., should one make unconditional inference about the regression parameter in the correct but unknown scale (the B-D theory), or a conditional inference for an appropriately defined "regression parameter" in an estimated scale? In order to eliminate such problems, we will study the cost of estimating λ when one wants to make inferences in the original scale of the observations.
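To fix notation computationally, here is a minimal sketch (ours, not the authors') of the transformation (1.1) and of the inverse map that carries a value on the transformed scale back to the original scale of y; the helper names box_cox and box_cox_inverse are hypothetical.

```python
import numpy as np

def box_cox(y, lam):
    """The power transformation y^(lam) of (1.1):
    (y**lam - 1)/lam for lam != 0, and log(y) for lam = 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def box_cox_inverse(z, lam):
    """Carry a value z on the transformed scale back to the original scale of y."""
    z = np.asarray(z, dtype=float)
    if lam == 0.0:
        return np.exp(z)
    return (1.0 + lam * z) ** (1.0 / lam)
```

Making "inferences in the original scale" then means that every quantity built on the transformed scale is eventually passed back through the inverse map.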
In the multicollinearity problem, reasonably good prediction is still possible if new vectors x arrive independently with the distribution G. Motivated by this fact (see the discussion in Section 2), we will focus our attention specifically on prediction, but we also discuss the two-sample problem and a somewhat more general estimation theory. Using the B-D asymptotic theory and Monte-Carlo, we find that for prediction (and other problems in the original scale), there is a cost due to estimating λ, but it is generally not severe.

2. Predicting the conditional median in regression

Our model specifically includes an intercept, i.e., x_i = (1, c_i); by suitable rescaling we assume the c_i have mean zero and identity covariance. From the sample we calculate λ̂ and β̂(λ̂), and we are given a new vector x₀ = (1, c₀) which is independent of the other x's but still has the same distribution G. Our predicted value in the transformed scale would be x₀β̂(λ̂), so a natural predictor is

(2.1)    f(λ̂, x₀β̂(λ̂)),

where

    f(λ, θ) = (1 + λθ)^{1/λ}    (λ ≠ 0)
            = exp(θ)            (λ = 0).

Notice that if F is symmetric, or more generally has median equal to 0, then f(λ, x₀β) is the median of the conditional distribution of y given x₀, even though it is not necessarily the conditional expectation. Calculation of conditional expectations would require the use of numerical integration and that F be known or an estimator of F be available (see Section 4).

Assuming λ̂ and β̂(λ̂) are consistent,

(2.2)    f(λ̂, x₀β̂(λ̂)) → f(λ, x₀β) in probability.

A Taylor expansion shows that

(2.3)    f(λ̂, x₀β̂(λ̂)) − f(λ, x₀β) ≈ g(λ, x₀β){x₀(β̂(λ̂) − β) + h(λ, x₀β)(λ̂ − λ)},

where

    g(λ, θ) = f(λ, θ)/(1 + λθ),
    h(λ, θ) = θ/λ − (1 + λθ)log(1 + λθ)/λ².

Noting that λ̂ and β̂(λ̂) are unstable and highly correlated, the expansion (2.3) shows that our problem as presently formulated is quite similar to a prediction problem in regression when there is multicollinearity.

Note: If 1 + λ̂x₀β̂(λ̂) < 0, then f in (2.1) must be defined by absolute values. This circumstance will not occur often; if λ̂ > 0 (λ̂ < 0), then except in rare cases x₀β̂(λ̂) ≥ min_i y_i^{(λ̂)} (respectively ≤ max_i y_i^{(λ̂)}), in which case 1 + λ̂x₀β̂(λ̂) ≥ 0.

Case I (λ = 0; σ = 1; normal errors; simple linear regression with slope θ₁, intercept θ₀). For this special case, likelihood calculations (cf. Hinkley (1975)) can be made. Here the correct scale is the log scale, and E c_i = 0, E c_i² = 1, E c_i³ = μ₃, E c_i⁴ = μ₄. Lengthy likelihood analysis shows that the limiting covariance structure of (λ̂, θ̂₀(λ̂), θ̂₁(λ̂)) is given by

    Σ₀ = γ⁻¹ [ 1       −c          θ₀θ₁        ]
             [ −c      γ + c²      −cθ₀θ₁*     ]
             [ θ₀θ₁    −cθ₀θ₁*     γ + θ₀²θ₁*² ],

where

    c = −(1 + θ₀² + θ₁²)/2,
    θ₁* = θ₁² + θ₁μ₃/(2θ₀),
    γ = 3/2 + 2θ₁² + θ₁⁴(μ₄ − μ₃² − 1)/4.

Theorem 1.

(2.4)    Var[{f(λ̂, x₀β̂(λ̂)) − f(λ=0, x₀β)} exp(−x₀β)] / Var[{f(λ=0, x₀β̂) − f(λ=0, x₀β)} exp(−x₀β)] → H(θ₁),

where H(θ₁) is an explicit function computed from Σ₀. □

The normalization by exp(−x₀β) = (g(λ=0, x₀β))⁻¹ is as in (2.3) and is made to simplify the calculations and results. Since the numerator of (2.4) is the (suitably normalized) variance of the prediction error when λ is estimated, while the denominator is the same variance for the case of λ known, Theorem 1 tells us: (i) there is a cost due to estimating λ, but it cannot exceed 50%; (ii) for the balanced two-sample problem (c_i = ±1 with probability ½), the cost is at most 8% and decreases to zero as θ₁ → ∞.
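Before turning to Case II, here is a brief illustrative sketch (ours) of the predictor (2.1), using the absolute-value convention of the Note when 1 + λ̂x₀β̂(λ̂) < 0; the estimates lam_hat and beta_hat are assumed to come from, e.g., maximum likelihood.

```python
import numpy as np

def f(lam, theta):
    """f(lam, theta) of (2.1): (1 + lam*theta)**(1/lam), or exp(theta) at lam = 0."""
    if lam == 0.0:
        return np.exp(theta)
    base = 1.0 + lam * theta
    # Rare case noted in the text: define f through absolute values
    # when 1 + lam*theta < 0.
    return np.sign(base) * np.abs(base) ** (1.0 / lam)

def median_predictor(lam_hat, beta_hat, x0):
    """Estimate of the conditional median of y given x0 when F has median zero."""
    return f(lam_hat, float(np.dot(x0, beta_hat)))
```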
Case II (symmetric errors, p-parameter regression, any λ). Here we use the (n → ∞, σ → 0) asymptotic theory of B-D. They find that (their equation (8.2)) n^{1/2}(σ̂(λ̂) − σ)/σ has variance ½ and is asymptotically independent of n^{1/2}((λ̂, β̂(λ̂)) − (λ, β))/σ, which has limiting covariance

    Σ₁ = e⁻¹ [ 1      −δ′      ]
             [ −δ     eI + δδ′ ],

where

    H(a, λ) = λ⁻¹a − λ⁻²(1 + λa)log(1 + λa),
    x = (x₁, ..., x_p),
    δ = E[x H(xβ, λ)],
    e = E[H²(xβ, λ)] − Σ_{j=1}^p (E[x_j H(xβ, λ)])².

It is interesting to note that in the case of simple linear regression with λ = 0, Σ₁ is different from but of the same form as Σ₀ (replace c by c* = c + ½ and γ by e₄ = θ₁⁴(μ₄ − μ₃² − 1)/4).

Theorem 2. In Theorem 1, replace λ = 0 by λ and exp(−x₀β) by (σg(λ, x₀β))⁻¹. Then Theorem 1 holds with H(θ₁) replaced by

    H = 1 + 1/p. □

The small σ asymptotics of B-D tell us that there is a positive but bounded cost due to estimating λ, with the cost decreasing as p increases. Note that Theorem 2 and Theorem 1 agree for simple linear regression, λ = 0, μ₄ − μ₃² − 1 > 0 and θ₁ → ∞.

B-D and Carroll also simultaneously introduced robust estimates of (λ, β) based on the ideas of Huber (1977). One can use the B-D small σ asymptotics to show that (i) the cost in robust estimation for estimating λ is still 1/p and (ii) the Bickel-Doksum and Carroll methods have better robustness properties than does maximum likelihood.

We conducted a small Monte-Carlo study to check small sample performance and to investigate the results of Theorems 1 and 2. The observations were generated according to

    y_i = (1 + λ(β₀ + β₁c_i + ε_i))^{1/λ}    (λ = −1, 1)
        = exp(β₀ + β₁c_i + ε_i)              (λ = 0),    i = 1, ..., N.

Here N = 20, the {ε_i} are standard normal, β₀ = 5, β₁ = 1, and the c_i, centered at zero and equally spaced, satisfy Σc_i² = N and range from −1.65 to 1.65. Then μ₄ = 1.79 and H(β₁) = 1.06, so that Theorems 1 and 2 lead us to expect very little cost due to estimating λ. There were 600 iterations in the experiment. For this example (λ = 0),

    Σ₀ = [ .27     3.65     1.35  ]                           [ 1      .99    .93 ]
         [ 3.65    50.28    18.25 ]      Correlation Matrix = [ .99    1      .92 ]
         [ 1.35    18.25    7.76  ],                          [ .93    .92    1   ],

which illustrates the multicollinearity quite well.

In Table 1 we provide an analysis of the estimates β̂₀(λ̂) and β̂₁(λ̂). The estimates are quite biased and, as noted by Carroll and B-D, the variances of β̂₀(λ̂), β̂₁(λ̂) are much larger than those of β̂₀, β̂₁ when λ is known. In Table 2 we present the results for the prediction problem. The first two rows are the relative biases when λ is known or unknown, respectively. In the third row we have the prediction variance when λ is estimated relative to that when λ is known. These tables support our asymptotic calculations and show that there is very little cost involved in estimating λ for prediction.

To this point we have defined the cost of estimating λ as an average over the distribution of the new value x₀. It is also of interest to study the costs conditional on a given value of x₀. For Case I when x₀ = (1, c₀)′, the asymptotic ratio of the conditional prediction variances (when λ is estimated versus known) is given by

(2.5)    lim_{N→∞} Var[f(λ̂, x₀β̂(λ̂)) − f(λ, x₀β) | x₀] / Var[f(λ, x₀β̂) − f(λ, x₀β) | x₀] = T₀(c₀, Σ₀),

while for Case II this limit (N → ∞, σ → 0) is T₁(c₀, Σ₁), where T_j(c₀, Σ_j) = aΣ_j a′ (j = 0, 1) for the (suitably normalized) coefficient vector a appearing in the expansion (2.3).

In Table 3 we list the values of T₀(c₀, Σ₀), T₁(c₀, Σ₁) and the results of a Monte-Carlo experiment (600 iterations) for the simple linear regression design discussed previously (at the different values c₁, ..., c₂₀ of this design). As expected from Theorems 1 and 2, there is only a slight cost due to estimating λ, and the B-D small σ asymptotics are somewhat conservative.
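To make such an experiment concrete, the following sketch reproduces the λ = 0 case in outline, with λ estimated by maximizing the usual Box-Cox profile log-likelihood over a grid. The design, β₀ = 5, β₁ = 1 and the 600 iterations follow the text; the grid and the new point x₀ are our own choices, and this is a sketch rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, b0, b1 = 20, 5.0, 1.0
c = np.linspace(-1.65, 1.65, N)            # centered, equally spaced design
X = np.column_stack([np.ones(N), c])

def bc(y, lam):                            # the transformation (1.1)
    return np.log(y) if lam == 0.0 else (y ** lam - 1.0) / lam

def profile_loglik(lam, y):
    """Box-Cox profile log-likelihood after maximizing over beta and sigma."""
    z = bc(y, lam)
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    rss = np.sum((z - X @ beta) ** 2)
    return -0.5 * N * np.log(rss / N) + (lam - 1.0) * np.sum(np.log(y))

def f(lam, theta):                         # predictor (2.1), abs-value convention
    if lam == 0.0:
        return np.exp(theta)
    base = 1.0 + lam * theta
    return np.sign(base) * np.abs(base) ** (1.0 / lam)

lam_grid = np.linspace(-1.0, 1.0, 81)
x0 = np.array([1.0, 0.5])                  # one new design point
true_median = np.exp(x0 @ np.array([b0, b1]))
err_known, err_est = [], []
for _ in range(600):
    y = np.exp(b0 + b1 * c + rng.standard_normal(N))    # true lambda = 0
    # lambda known:
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    err_known.append(f(0.0, x0 @ beta) - true_median)
    # lambda estimated:
    lam_hat = max(lam_grid, key=lambda l: profile_loglik(l, y))
    beta_hat = np.linalg.lstsq(X, bc(y, lam_hat), rcond=None)[0]
    err_est.append(f(lam_hat, x0 @ beta_hat) - true_median)

# relative prediction variance, the quantity bounded in Theorems 1 and 2
print(np.var(err_est) / np.var(err_known))
```

If the asymptotics above are a guide, the printed ratio should land close to the value 1.06 reported in Table 2.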
Table 1
Biases and relative costs in the simple linear regression model when λ is estimated (β₀ = 5, β₁ = 1).

    λ        Eβ̂₀(λ̂) − β₀    Eβ̂₁(λ̂) − β₁    Var(β̂₀(λ̂))/Var(β̂₀)    Var(β̂₁(λ̂))/Var(β̂₁)
    1.0          .44             .20               12.9                   4.0
    0.0          .60             .26                9.6                   4.0
    −1.0         .44             .20               12.9                   4.0

Table 2
Prediction biases and relative variance in the simple linear regression model.

                                                                  λ = 1    λ = 0    λ = −1
    E[f(λ, x₀β̂) − f(λ, x₀β)] / E f(λ, x₀β)                         .00      .07      .00
    E[f(λ̂, x₀β̂(λ̂)) − f(λ, x₀β)] / E f(λ, x₀β)                      .01      .07      .00
    Var[f(λ̂, x₀β̂(λ̂)) − f(λ, x₀β)] / Var[f(λ, x₀β̂) − f(λ, x₀β)]     1.06     1.06     1.02

Table 3
Relative conditional costs in the simple linear regression model when the data are generated by exp(5 + c + ε). For each c₀ (the twenty design points ±.087, ±.26, ±.43, ±.61, ±.78, ±.95, ±1.13, ±1.30, ±1.47, ±1.65, together with c₀ = 0, which is not in the design), the table gives the conditional cost ratio from the likelihood analysis using Σ₀, from the small σ analysis using Σ₁, and from the Monte-Carlo experiment; the tabulated ratios run from roughly 1.0 to 2.3, and the final rows give each column's average together with the predicted averages 1.06 (Theorem 1) and 1.50 (Theorem 2).

3. Two samples: estimating the difference in medians

The balanced two-sample problem has the structure

    y_i^{(λ)} = θ₁ + ε_i    (i = 1, ..., N)
              = θ₂ + ε_i    (i = N+1, ..., 2N).

B-D show that for estimating θ₁ − θ₂ there is a substantial penalty for not knowing λ unless θ₁ = θ₂. The "treatment difference" in the original scale that we estimate is defined by

    Δ = f(λ, θ₂) − f(λ, θ₁).

When the errors are symmetric, Δ is just the difference in the medians of the two populations. Because the B-D small σ asymptotics do not apply, we only consider the case σ fixed. We confine our attention to λ = 0 and standard normal errors. Lengthy likelihood calculations show that the limiting covariance structure of (λ̂, θ̂₁(λ̂), θ̂₂(λ̂)) is given by

    Q = γ⁻¹ [ 4      2α       2β     ]
            [ 2α     γ + α²   αβ     ]
            [ 2β     αβ       γ + β² ],

where

    α = 1 + θ₁²,    β = 1 + θ₂²,    μ = θ₁ + θ₂,
    γ = 14 + 10(θ₁² + θ₂²) + θ₁⁴ + θ₂⁴ − α² − β² − 4μ².

Define Δ̂ = H(λ̂, θ̂₁(λ̂), θ̂₂(λ̂)) and Δ̃ = H(λ, θ̂₁, θ̂₂), where H(λ, a₁, a₂) = f(λ, a₂) − f(λ, a₁), so that Δ = H(λ, θ₁, θ₂). Then, by Taylor expansions, it is possible to show

Theorem 3.

    E[Δ̂ − Δ]² / E[Δ̃ − Δ]² → H(θ₁, θ₂). □

If θ₁ = θ₂, then H(θ₁, θ₂) = 1 and there is no cost for estimating λ. Our numerical calculations indicate that 1 ≤ H(θ₁, θ₂) ≤ 1.03, which indeed means there is a very small cost overall.

We conducted a Monte-Carlo experiment with θ₁ = 4, θ₂ = 6. The sample size for each population was 10 and there were 600 iterations of the experiment. Here

    Q = [ .14     1.21     2.64  ]                           [ 1      .96    .99 ]
        [ 1.21    11.32    22.46 ]      Correlation Matrix = [ .96    1      .95 ]
        [ 2.64    22.46    49.89 ],                          [ .99    .95    1   ].

The results are given in Tables 4 and 5, from which it appears that when working in the original scale, and estimating population medians, there is essentially no cost for estimating λ.

Table 4
Biases and relative costs in the two-sample problem when λ is estimated. The populations have transformed means θ₁ = 4, θ₂ = 6, so that β₀ = (θ₁ + θ₂)/2 = 5 and β₁ = (θ₂ − θ₁)/2 = 1.

    λ        Eβ̂₀(λ̂) − β₀    Eβ̂₁(λ̂) − β₁    Var(β̂₀(λ̂))/Var(β̂₀)    Var(β̂₁(λ̂))/Var(β̂₁)
    1.0          .20             .13               13.2                   4.2
    0.0          .67             .28               16.3                   5.8
    −1.0         .21             .14               13.2                   4.2

Table 5
Biases and relative costs in the two-sample problem. Here the treatment difference is Δ = f(λ, θ₂) − f(λ, θ₁) and H(λ, a₁, a₂) = f(λ, a₂) − f(λ, a₁).

                                                              λ = 1    λ = 0    λ = −1
    E[H(λ, θ̂₁, θ̂₂) − Δ] / Δ                                    .01      .04      .00
    E[H(λ̂, θ̂₁(λ̂), θ̂₂(λ̂)) − Δ] / Δ                              .02      .02      .01
    E[H(λ̂, θ̂₁(λ̂), θ̂₂(λ̂)) − Δ]² / E[H(λ, θ̂₁, θ̂₂) − Δ]²          1.00     1.00     .99
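The two-sample computation can be sketched in the same style (again our construction, not the paper's: the pooled profile likelihood and the grid search are assumptions, with θ̂_j(λ̂) taken as the mean of the transformed sample j).

```python
import numpy as np

def bc(y, lam):
    return np.log(y) if lam == 0.0 else (y ** lam - 1.0) / lam

def f(lam, theta):
    if lam == 0.0:
        return np.exp(theta)
    base = 1.0 + lam * theta
    return np.sign(base) * np.abs(base) ** (1.0 / lam)

def profile_loglik(lam, y1, y2):
    """Pooled two-sample profile log-likelihood: theta_j is the mean of the
    transformed sample j, and sigma is profiled out analytically."""
    z1, z2 = bc(y1, lam), bc(y2, lam)
    rss = np.sum((z1 - z1.mean()) ** 2) + np.sum((z2 - z2.mean()) ** 2)
    n = len(y1) + len(y2)
    jac = (lam - 1.0) * (np.sum(np.log(y1)) + np.sum(np.log(y2)))
    return -0.5 * n * np.log(rss / n) + jac

def median_difference(y1, y2, lam_grid=np.linspace(-1.0, 1.0, 81)):
    lam_hat = max(lam_grid, key=lambda l: profile_loglik(l, y1, y2))
    th1_hat, th2_hat = bc(y1, lam_hat).mean(), bc(y2, lam_hat).mean()
    return f(lam_hat, th2_hat) - f(lam_hat, th1_hat)   # H(lam, th1, th2)

# e.g., the setting of the experiment above: lambda = 0, theta1 = 4, theta2 = 6
rng = np.random.default_rng(0)
y1 = np.exp(4.0 + rng.standard_normal(10))
y2 = np.exp(6.0 + rng.standard_normal(10))
print(median_difference(y1, y2))          # estimates exp(6) - exp(4)
```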
4. Prediction of the conditional mean

The estimator in Section 2 is the median of the conditional distribution of y given x₀ (when F is symmetric). Our focus in this section is on estimating the conditional mean of y given x₀. We sketch a general result which indicates that the cost of extra nuisance parameters (such as λ) is not large. We assume a regression model with (Y_i, x_i) having a joint density g(y, x|θ₀); as in normal theory regression, we assume that this density factors as g(y, x|θ₀) = g₁(y|x, θ₀)g₂(x). Letting L_N(θ) denote the log-likelihood, we make the usual assumptions, in particular

(4.1)    N^{1/2}(θ̂_N − θ₀) →_L N(0, I(θ₀)⁻¹),

where θ̂_N is the maximum likelihood estimate, I(θ₀) is the Fisher information, and q is the dimension of the parameter θ₀. We are given a new value x₀ and wish to predict E(Y|x₀); the natural estimate (usually only computable numerically) is

(4.2)    Ê(Y|x₀) = ∫ y g₁(y|x₀, θ̂_N) dy.

A Taylor expansion shows that

(4.3)    Δ_N(θ₀, x₀) = N^{1/2}{Ê(Y|x₀) − E(Y|x₀)}
                     ≈ ∫ y {(∂/∂θ) log g₁(y|x₀, θ₀)}′ N^{1/2}(θ̂_N − θ₀) g₁(y|x₀, θ₀) dy
                     = ∫ (y − E(Y|x₀)) {(∂/∂θ) log g₁(y|x₀, θ₀)}′ N^{1/2}(θ̂_N − θ₀) g₁(y|x₀, θ₀) dy,

the last step because the score has conditional mean zero. An overall measure of the accuracy of the prediction is EΔ_N²(θ₀, x₀); (4.1), (4.3) and the Schwarz inequality show

(4.4)    E[Δ_N²(θ₀, x₀) | sample] ≤ Var(Y − E(Y|x₀)) N^{1/2}(θ̂_N − θ₀)′ I(θ₀) N^{1/2}(θ̂_N − θ₀)
                                  →_L Var(Y − E(Y|x₀)) χ²(q),

where χ²(q) is a chi-square with q degrees of freedom. This suggests

(4.5)    lim_{N→∞} EΔ_N²(θ₀, x₀) ≤ Var(Y − E(Y|x₀)) q.

Equation (4.5) shows that in prediction with q parameters, the average squared prediction error is bounded, and this bound increases in relative magnitude by r/q when r additional nuisance parameters are added. A similar result holds for the two-sample problem.

Example: Consider the transformation model (1.1) but take λ = 1; this means one uses the Box-Cox model when transformation is unnecessary. If there are p regression parameters, then q = p + 1 when λ = 1 is known and EΔ_N²(θ₀, x₀) = Var(Y − E(Y|x₀))p. When one estimates λ, (4.5) shows that EΔ_N²(θ₀, x₀) ≤ Var(Y − E(Y|x₀))(p + 2). Thus, the relative cost of estimating λ is 2/p (which agrees qualitatively with Theorem 2).
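When F is modeled as N(0, σ²), the integral in (4.2) for the Box-Cox model can be approximated by Gauss-Hermite quadrature. A minimal sketch, with hypothetical argument names and the absolute-value convention for f:

```python
import numpy as np

def f(lam, theta):
    theta = np.asarray(theta, dtype=float)
    if lam == 0.0:
        return np.exp(theta)
    base = 1.0 + lam * theta
    return np.sign(base) * np.abs(base) ** (1.0 / lam)

def conditional_mean(lam_hat, beta_hat, sigma_hat, x0, n_nodes=40):
    """Plug-in version of (4.2) for the Box-Cox model with normal errors:
    E(Y | x0) = integral of f(lam, x0'beta + sigma * eps) dPhi(eps),
    approximated with Gauss-Hermite nodes and weights."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    eps = np.sqrt(2.0) * t                 # rescale nodes to a standard normal
    vals = f(lam_hat, float(np.dot(x0, beta_hat)) + sigma_hat * eps)
    return float(np.sum(w * vals) / np.sqrt(np.pi))
```

For λ̂ = 0 this reduces to the lognormal mean exp(x₀β̂ + σ̂²/2), which gives a quick check on the quadrature.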
References

Andrews, D.F. (1971). A note on the selection of data transformations. Biometrika 58, 249-254.

Atkinson, A.C. (1973). Testing transformations to normality. J. Royal Statist. Soc. Ser. B 35, 473-479.

Bickel, P.J. and Doksum, K.A. (1978). An analysis of transformations revisited. Unpublished manuscript, University of California, Berkeley.

Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations. J. Royal Statist. Soc. Ser. B 26, 211-252.

Carroll, R.J. (1978). A robust method for testing transformations to achieve approximate normality. Institute of Statistics Mimeo Series #1190, University of North Carolina at Chapel Hill. To appear in J. Royal Statist. Soc. Ser. B.

Hinkley, D.V. (1975). On power transformations to symmetry. Biometrika 62, 101-111.

Huber, P.J. (1977). Robust Statistical Procedures. SIAM, Philadelphia, PA.