Some Comments on C_P

C. L. Mallows
Bell Laboratories, Murray Hill, New Jersey

Technometrics, Vol. 15, No. 4 (Nov., 1973), pp. 661-675.
Stable URL: http://links.jstor.org/sici?sici=0040-1706%28197311%2915%3A4%3C661%3ASCO%3E2.0.CO%3B2-6
Received June 1972; revised October 1972.

We discuss the interpretation of Cp-plots and show how they can be calibrated in several ways. We comment on the practice of using the display as a basis for formal selection of a subset-regression model, and extend the range of application of the device to encompass arbitrary linear estimates of the regression coefficients, for example Ridge estimates.

KEY WORDS: Linear regression; Selection of variables; Ridge regression.

Suppose that we have data consisting of n observations on each of k + 1 variables, namely k independent variables x_1, ..., x_k and one dependent variable y. Write x_0 ≡ 1, x (1 × (k+1)) = (x_0, x_1, ..., x_k), y (n × 1) = (y_1, ..., y_n)^T, X (n × (k+1)) = (x_{uj}). A model of the form

$$y_u = \eta(x_u) + e_u, \qquad \eta(x) = \sum_{j=0}^{k} \beta_j x_j, \qquad (1)$$

is to be entertained, with the residuals e_1, ..., e_n being regarded (tentatively) as independent random variables with mean zero and unknown common variance σ². The x's are not to be regarded as being sampled randomly from some population, but rather are to be taken as fixed design variables. We suppose that the statistician is interested in choosing an estimate β̂ = (β̂_0, ..., β̂_k)^T, with the idea that for any point x in the general vicinity of the data at hand, the value x β̂ will be a good estimate of η(x). In particular he may be interested in choosing a "subset least-squares" estimate in which some components of β̂ are set at zero and the remainder estimated by least squares.¹ The Cp-plot is a graphical display device that helps the analyst to examine his data with this framework in mind.

Consider a subset P of the set of indices K⁺ = {0, 1, 2, ..., k}; let Q be the complementary subset. Suppose the numbers of elements in P, Q are |P| = p, |Q| = q, so that p + q = k + 1. Denote by β̂_P the vector of estimates that is obtained when the coefficients with subscripts in P are estimated by least squares, the remaining coefficients being set equal to zero; i.e.

$$\hat\beta_P = X_P^{+}\, y,$$

where X_P^+ is the (Moore-Penrose) generalized inverse of X_P, which in turn is obtained from X by replacing the columns having subscripts in Q by columns of zeroes. (Thus X_P^+ has zeroes in the rows corresponding to Q, and the remaining rows contain the matrix (Z_P^T Z_P)^{-1} Z_P^T, where Z_P is obtained from X by deleting the columns corresponding to Q.)

¹ If β_0 is absent the development is entirely similar. We assume throughout that X has rank k + 1.
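In modern matrix terms, this definition can be sketched as follows; this is an editorial illustration (the function name and interface are not from the paper), relying only on the zero-column construction of X_P and its Moore-Penrose inverse.

```python
import numpy as np

def subset_ls(X, y, P):
    """Subset least-squares estimate beta_hat_P: coefficients with indices
    in P are fit by least squares, the rest (indices in Q) are set to zero."""
    XP = np.zeros_like(X)
    XP[:, list(P)] = X[:, list(P)]        # replace the Q columns by zeroes
    return np.linalg.pinv(XP) @ y         # Moore-Penrose inverse of X_P

# Example: X includes the constant column x_0 = 1 as column 0;
# subset_ls(X, y, P={0, 1, 4}) returns zeroes in the remaining positions.
```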
Let RSS_P denote the corresponding residual sum of squares, i.e.

$$RSS_P = \sum_{u=1}^{n} (y_u - x_u \hat\beta_P)^2.$$

For any such estimate β̂_P, a measure of adequacy for prediction is the "scaled sum of squared errors"

$$J_P = \frac{1}{\sigma^2} \sum_{u=1}^{n} (x_u \hat\beta_P - \eta(x_u))^2,$$

the expectation of which is easily found to be

$$E(J_P) = V_P + B_P \qquad (2)$$

where V_P, B_P are respectively "variance" and "bias" contributions given by

$$V_P = \operatorname{trace}(M_P) = p, \qquad B_P = \frac{1}{\sigma^2}\, \beta_Q^T X^T (I - M_P) X \beta_Q,$$

where β_Q is β with the elements corresponding to P replaced by zeroes, and

$$M_P = X X_P^{+} = X_P X_P^{+} = Z_P (Z_P^T Z_P)^{-1} Z_P^T.$$

The C_P statistic is defined to be

$$C_P = \frac{RSS_P}{\hat\sigma^2} - (n - 2p) \qquad (3)$$

where σ̂² is an estimate of σ²; correspondingly

$$E(RSS_P) = (B_P + n - p)\,\sigma^2. \qquad (4)$$

Clearly (as has been remarked by Kennard (1971)), C_P is a simple function of RSS_P, as are the multiple correlation coefficient defined by 1 − R_P² = RSS_P/TSS (where TSS is the total sum of squares) and the "adjusted" version of this. However the form (3) has the advantage (as has been shown by Gorman and Toman (1966), Daniel and Wood (1971), and Godfrey (1972)) that, under the above assumptions, C_P is an estimate of E(J_P), and is suitably standardized for graphical display, being plotted against p. Graphical presentation of the various regression sums of squares themselves against p was advocated by Watts (1965).

For k not too large it is feasible ([4], [5], [19]) to compute and display all the 2^{k+1} values of C_P; for larger values one can use the algorithms of [2], [8], [14] to compute only the more interesting (smaller) values. In section 2 we describe some of the configurations that can arise; in section 3 we provide some formal calibration for the display, and in section 4 we comment on the practice of using it as a basis for formal selection. The approach is extended in section 5 to handle arbitrary linear estimates of the regression coefficients.

The approach can also be extended to handle multivariate response data and to deal with an arbitrary weight function w(x) in factor-space, describing a region of interest different from that indicated by the configuration of the data currently to hand. In each case the derivation is exactly parallel to that given above. In the former case one obtains a matrix analog of C_P in the form

$$\hat\Sigma^{-1} \cdot RSS_P - (n - 2p)\, I,$$

where Σ̂ is an estimate of the residual covariance matrix, and RSS_P is now the matrix Σ_u (y_u − x_u β̂_P)^T (y_u − x_u β̂_P). One or more measures of the "size" of this quantity (such as the trace, or the largest eigenvalue) can be plotted against p. In the latter case, with the matrix A = (A_{jl}) defined by A_{jl} = ∫ x_j x_l w(x) dx, one arrives at a statistic of the form

$$C_P^A = \frac{RSS_P}{\hat\sigma^2} - (n - p) + V_P^A,$$

where V_{K⁺}^A = trace(A(X^TX)^{-1}), V_P^A = trace(A(X_P^T X_P)^{+}), and we can plot C_P^A against V_P^A. This reduces to the Cp-plot when A = X^TX. If interest is concentrated at a single point x, we have A = x^T x, and the statistic is equivalent to that suggested by Allen (1971) (his equation (9)).

From (2), (3), (4) we see that if β_Q = 0, so that the P-subset model is in fact completely appropriate, then RSS_P ≈ (n − p)σ² and C_P ≈ p. If σ̂² is taken as RSS_{K⁺}/(n − k − 1), then C_{K⁺} = |K⁺| = k + 1 exactly. Notice that if P* is a (p + 1)-element subset which contains P, then

$$C_{P^*} - C_P = 2 - SS/\hat\sigma^2, \qquad (5)$$

where SS is the one-d.f. contribution to the regression sum of squares due to the (p + 1)-th variable, so that SS/σ̂² is a t² statistic that could be used in a stepwise testing algorithm. If the additional variable is unimportant, i.e. if the bias contribution B_P − B_{P*} is small, then E(SS) ≈ σ² and so E(C_{P*} − C_P) ≈ 1.
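For small k, all 2^{k+1} values of C_P can be computed directly, as the references above note. A minimal illustrative sketch (names are ours, not the paper's; σ̂² is taken as the full-model residual mean square, so that C_{K⁺} = k + 1 exactly):

```python
import itertools
import numpy as np

def cp_plot_points(X, y):
    """All (p, C_P) points for subsets P containing the constant (column 0)."""
    n, k1 = X.shape
    rss = lambda Z: np.sum((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]) ** 2)
    s2 = rss(X) / (n - k1)                  # sigma_hat^2 = RSS_{K+}/(n - k - 1)
    pts = []
    for r in range(k1):
        for rest in itertools.combinations(range(1, k1), r):
            P = [0] + list(rest)
            pts.append((len(P), rss(X[:, P]) / s2 - (n - 2 * len(P))))
    return pts                              # plot C_P against p
```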
Mantel (1970) has discussed the use of stepwise procedures, and how they behave in the face of various patterns of correlation amongst the independent variables. It is illuminating to consider how patterns similar to those he describes would show up on a Cp-plot.

First, suppose the independent variables are not highly correlated, that β = β_P, and that every non-zero element of β is large (relative to the standard error of its least-squares estimate). Then the Cp-plot will look something like Figure 1 (drawn for the case p = k − 2, K⁺ − P = {1, 2, 3}). Notice the approximately linear diagonal configuration of points corresponding to the well-fitting subsets of variables.

Now suppose x_1, x_2, x_3 are highly correlated with each other, with each being about equally correlated with y. Then any two of these variables, but not all three, can be deleted from the model without much effect. In this case the relevant points on the Cp-plot will look something like Figure 2a if no other variables are of importance, or like Figure 2b if some other subset P is also needed. (In all these examples we are assuming that the constant term β_0 is always needed.) Notice that now the diagonal pattern is incomplete. In an intermediate case, when x_1, x_2, x_3 have moderate correlations, a picture intermediate between Figures 1 and 2b will be obtained.

Thirdly, suppose x_1, x_2 are individually unimportant but jointly are quite effective in reducing the residual sum of squares; suppose some further subset P of variables is also needed. Mantel gives an explicit example of this behavior. Figure 3 shows the resulting configuration in the case |P| = k − 4. Notice that even if C_{P∪{1,2}} is the smallest Cp-value for subsets of size p + 2, there might be subsets P'_1, P'_2, ... (not containing P) with |P'_i| = p or p + 1 that gave smaller values of C_P than those for P, P ∪ {1}, P ∪ {2}. In this case an upward stepwise testing algorithm might be led to include variables in these subsets and so never get to the subset P ∪ {1, 2}. Mantel describes a situation where this would happen.

To derive benchmarks for more formal interpretation of Cp-plots, we assume that the model (1) is in fact exactly appropriate, with the residuals e_1, ..., e_n being independent and Normal (0, σ²). Suppose σ² is estimated by σ̂² = RSS_{K⁺}/ν, where ν = n − k − 1 is the residual degrees of freedom. We do not of course recommend that the following distributional results be used blindly, without careful inspection of the empirical residuals y_u − η̂(x_u), u = 1, ..., n. However, they should give pause to workers who are tempted to assign significance to quantities of the magnitude of a few units, or even fractions of a unit, on the Cp scale.

FIGURE 2a. Cp-plot: variables 1, 2, 3 are highly explanatory but also highly correlated.

First, notice that the increment C_{P*} − C_P (in (5) above) is distributed as 2 − t², where the t-statistic t is central if β = β_P. In this case this increment has mean and variance of approximately 1 and 2 respectively. Similarly

$$C_{K^+} - C_P = q\,(2 - F), \qquad (6)$$

where q = k + 1 − p and the F statistic (on q and ν degrees of freedom) is central if β = β_P; thus if ν is large compared with q this increment has mean and variance approximately q and 2q respectively. The variance of the slope of the line joining the points (p, C_P), (k + 1, k + 1) is thus 2/q, so that the slope of a diagonal configuration such as is shown in Figure 1 will vary considerably about 45°.
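These benchmarks are easy to reproduce by simulation. An illustrative sketch (the design, coefficient values, and seed are arbitrary; variables 3, 4, 5 are truly inert, so that the subset {0, 1, 2} is adequate):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
beta = np.array([1.0, 2.0, -1.5, 0.0, 0.0, 0.0])

def cp(Z, y, s2):
    rss = np.sum((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]) ** 2)
    return rss / s2 - (n - 2 * Z.shape[1])

inc = []
for _ in range(4000):
    y = X @ beta + rng.standard_normal(n)
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum(r * r) / (n - k - 1)                      # nu = n - k - 1
    inc.append(cp(X[:, :4], y, s2) - cp(X[:, :3], y, s2)) # C_{P*} - C_P

print(np.mean(inc), np.var(inc))   # near 1 and 2, as stated above
```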
The following tables (derived from (6)) give values of C_P − p that will be exceeded with probability α when the subset P is in fact adequate (i.e. when β = β_P, so that β_Q = 0), for the cases ν = n − k − 1 = 30, ∞. The value tabulated is q(F_{q,ν}(α) − 1).

[Table: values of q(F_{q,ν}(α) − 1); the numerical entries are not recoverable from this copy.]

For comparing two Cp-values corresponding to subsets P, P' with P ∩ P' = B, P = A ∪ B, P' = A' ∪ B, it is straightforward to derive the results, valid under the null hypothesis that each of P and P' is an adequate subset,

$$E(C_P - C_{P'}) \approx |P| - |P'| = |A| - |A'|,$$
$$\operatorname{Var}(C_P - C_{P'}) \approx 2(|A| + |A'| - 2R^2),$$

where R² is the sum of squares of the canonical correlations between the sets of variables X_A and X_{A'}, after partialling out the variables X_B. (Thus if |B| = |P| − 1, Var(C_P − C_{P'}) ≈ 4(1 − ρ²), where ρ is the partial correlation coefficient ρ_{AA'.B}.)²

² Srikantan (1970) has proposed the average, rather than the sum, of the squared canonical correlations as an overall measure of association. This measure has the property that its value is changed when a new variable, completely uncorrelated with all the previous ones, is added to one of the sets of variates.

FIGURE 2b. Cp-plot: same as 2a, except that the variables in P are also explanatory.

We now use the Scheffé confidence ellipsoid to derive a different kind of result. Let us write β̂^T = (β̂_0, β̂_K^T) for the least-squares estimate of β^T = (β_0, β_K^T), and let

$$X^T X = \begin{pmatrix} n & m^T \\ m & D \end{pmatrix}, \qquad D_K = D - \frac{1}{n}\, m m^T.$$

Then the Scheffé 100α% confidence ellipsoid for the elements of β_K is the region

$$S_\alpha = \{\beta_K : (\beta_K - \hat\beta_K)^T D_K (\beta_K - \hat\beta_K) \le k \hat\sigma^2 F_\alpha\} \qquad (7)$$

where F_α is the upper 100α% quantile of the F distribution on k, n − k − 1 degrees of freedom. Notice that S_α can be written S_α = β̂_K + S_α⁰, where S_α⁰ is a fixed ellipsoid centered at the origin:

$$S_\alpha^0 = \{\gamma : \gamma^T D_K \gamma \le k \hat\sigma^2 F_\alpha\}.$$

FIGURE 3. Cp-plot: two variables that jointly are explanatory but separately are not.

Let P⁻, Q be any complementary subsets of K = {1, ..., k}, and let P = {0, P⁻}. The following lemma is proved in the Appendix.

Lemma. The following statements are equivalent:
(i) the region S_α intersects the coordinate hyperplane H_P = {β_K : β_Q = 0};
(ii) the projection of S_α onto the H_P hyperplane contains the origin;
(iii) the subset least-squares estimate β̂_P = (β̂_{P,0}, β̂_{P,K}) has β̂_{P,K} in S_α (here β̂_{P,K} denotes the last k components of β̂_P, with zeroes in the positions Q);
(iv) C_P ≤ 2p − k − 1 + kF_α;
(v) RSS_P − RSS_{K⁺} ≤ k σ̂² F_α.

Now consider any hypothesis that specifies the value of β_K, say β_K = β_K⁰, and the corresponding 100α% acceptance region

$$T_{\beta^0} = \{\hat\beta_K : (\hat\beta_K - \beta_K^0)^T D_K (\hat\beta_K - \beta_K^0) \le k \hat\sigma^2 F_\alpha\} \qquad (8)$$

(clearly Pr(β̂_K ∈ T_{β⁰} | β_K⁰) is in fact equal to α; this is just the confidence property of the Scheffé ellipsoid (7)). Starting from this family of acceptance regions for hypotheses that specify β_K completely, a natural acceptance region for a composite hypothesis of the form β_Q = 0 is given by the union of all regions T_{β⁰} for values of β_K⁰ such that β_Q⁰ = 0; the reasoning is that the hypothesis β_Q = 0 cannot be rejected if there is any β_K⁰ with β_Q⁰ = 0 that is acceptable according to the corresponding test in the family, i.e. if there is any such vector lying within the confidence ellipsoid S_α. By the Lemma, the corresponding acceptable subsets
P = {0, P⁻} are just those that have

$$C_P \le 2p - k - 1 + k F_\alpha. \qquad (9)$$

We state the property formally: a subset P = {0, P⁻} satisfies (9) if and only if there is some vector of coefficients β_K having β_Q = 0 that lies within the Scheffé ellipsoid (7), i.e. if and only if there is some vector of this form that is accepted by the corresponding test with acceptance region of the form (8).

As an example, consider the 10-variable data studied by Gorman and Toman (1966). Taking α = 0.10, k = 10, ν = 25, we find that among the 58 subsets for which Gorman and Toman computed Cp-values, there are 39 that satisfy (9), in number 7, 13, 9, 10 with p = 7, 8, 9, 10 respectively. This result gives little support to the view that this set of data is sending a clear message regarding the relative importance of the variables under consideration.

Notice that if the true coefficient vector β* has β_{Q*} = 0, then

$$\Pr\{\text{for all } P \text{ containing } P^*:\ C_P \le 2p - k - 1 + kF_\alpha\} \ge \alpha,$$

with equality only if p* = 1 (i.e. P* = {0}). This property of the procedure is not completely satisfying, since it is not an equality; also the form of the boundary in the Cp-plot is inflexible. In theory, one way of getting a better result is the following. Given any subset P* and a sequence of constants c_1, c_2, ..., c_k (and the matrix D_K), one could compute the probability Pr {for all P containing P*: C_P ≤ c_p}; this probability depends on c_1, ..., c_k, P* and D_K, but not on any other parameters. One could then adjust c_1, ..., c_k so as to make the minimum of this probability over all choices of P* (or possibly only over all choices with p* > some p_0) equal to some desired level α. The computation would presumably be done by simulation.

Starting from the Scheffé ellipsoid, Spjøtvoll (1972) has developed a multiple-comparison approach that provides confidence intervals for arbitrary quadratic functions of the unknown regression parameters, for example B_P − B_{P'} for two subsets P, P'.

Many authors have studied the problem of giving formal rules for the selection of predictors; Kennedy and Bancroft (1971) give many references. Lindley (1968) presents a Bayesian formulation of the problem. The discussion in section 3 above does not lend any support to the practice of taking the lowest point on a Cp-plot as defining a "best" subset of terms. The present author feels that the greatest value of the device is that it helps the statistician to examine some aspects of the structure of his data and helps him to recognize the ambiguities that confront him. The device cannot be expected to provide a single "best" equation when the data are intrinsically inadequate to support such a strong inference.

To make these remarks more precise and objective, we shall compute (in a special case) a measure of the performance to be expected of the rule "choose the subset that minimizes C_P, and fit it by least squares". We shall use as a figure of merit of an arbitrary estimator η̂(x) the same quantity as was used in setting up the Cp-plot, namely the scaled sum of predictive squared errors

$$J = \frac{1}{\sigma^2} \sum_{u=1}^{n} (\hat\eta(x_u) - \eta(x_u))^2.$$

We can handle in detail only the case of orthogonal regressors, and so now assume X^TX = nI. In this case we see from (5) that C_P is minimized when P contains just those terms for which t_j² > 2, where t_j = √n b_j / σ̂ is the t-statistic for the j-th regression coefficient, and b_j is the least-squares estimate, b_j = Σ_u x_{uj} y_u / n.
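This equivalence can be verified by brute force. An illustrative sketch, assuming a design scaled so that X^TX = nI (the data and seed are arbitrary):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, k = 64, 4
Q, _ = np.linalg.qr(rng.standard_normal((n, k + 1)))
X = Q * np.sqrt(n)                         # columns scaled so X^T X = nI
y = X @ np.array([0.5, 1.0, 0.1, -0.3, 0.0]) + rng.standard_normal(n)

b  = X.T @ y / n                           # least-squares estimates
s2 = np.sum((y - X @ b) ** 2) / (n - k - 1)
t  = np.sqrt(n) * b / np.sqrt(s2)          # t-statistics
keep = set(np.where(t ** 2 > 2)[0])        # terms retained by the rule

def cp(P):
    Z = X[:, list(P)]
    rss = np.sum((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]) ** 2)
    return rss / s2 - (n - 2 * len(P))

best = min((P for r in range(1, k + 2)
            for P in itertools.combinations(range(k + 1), r)), key=cp)
assert set(best) == keep                   # minimizing Cp keeps exactly t^2 > 2
```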
Thus in this case the "minimize Cp" rule is equivalent to a stepwise regression algorithm in which all critical t-values are set at √2 and σ̂² is kept at the full-equation value throughout. Now let us assume that n is sufficiently large that variation in σ̂ can be ignored; then t_0, t_1, ..., t_k will be independent Normal variables with unit variances and with means τ_0, ..., τ_k, where τ_j = √n β_j / σ. Let d(t) be the function that equals 0 for |t| < √2 and equals 1 otherwise; then J for the "minimum-Cp subset least-squares" estimate can be written

$$J = \frac{n}{\sigma^2} \sum_{j=0}^{k} (b_j\, d(t_j) - \beta_j)^2,$$

which reduces to

$$J = \sum_{j=0}^{k} (t_j\, d(t_j) - \tau_j)^2.$$

Hence

$$E(J) = \sum_{j=0}^{k} m(\tau_j),$$

where m(τ) = E((u + τ) d(u + τ) − τ)² (u being a standard Normal variable); m is the function displayed in Figure 4 (labelled "16%", since Pr(|u| > √2) = .1573). If the constant term is always to be included in the selected subset, the corresponding result is

$$E(J) = 1 + \sum_{j=1}^{k} m(\tau_j). \qquad (10)$$

FIGURE 4. m-functions, plotted against the standardized regression coefficient τ.

Notice that the function m(τ) is less than 1 only for |τ| < .78, and rises to a maximum value of 1.65 at |τ| = 1.88. It exceeds 1.25 for 1.05 < |τ| < 3.05. We reiterate that in this case of orthogonal regressors with n very large, the "minimum Cp" rule is equivalent to a stepwise regression algorithm with all critical levels set at 15.73%. Also shown in Figure 4 are the m-functions corresponding to several other critical levels; when all k + 1 terms are infallibly included (the "full-l.s." rule), m(τ) = 1 for all τ, so that E(J_{full l.s.}) = k + 1. We see that the "minimum Cp" rule will give a smaller value for E(J) than the "full-l.s." rule only when rather more of the true regression coefficients satisfy |τ| < .78 than satisfy |τ| > 1; in the worst case, with |τ_j| = 1.88 for j = 1, ..., k, E(J) for the "minimize Cp" rule is 165% of that for the "full-l.s." rule. Similarly for rejection rules with other critical levels; in particular, a rule with a nominal level of 5% (two-tailed) gives an E(J) at worst 246% of that of the "full-l.s." rule.

Thus using the "minimum Cp" rule to select a subset of terms for least-squares fitting cannot be recommended universally. Notice however that by examining the Cp-plot in the light of the distributional results of the previous section one can see whether or not a single best subset is uniquely indicated; the ambiguous cases where the "minimum Cp" rule will give bad results are exactly those where a large number of subsets are close competitors for the honor. With such data no selection rule can be expected to perform reliably.

We now extend the Cp-plot device to handle general linear estimators. With the same set-up as in the Introduction, consider an estimate of the form

$$\hat\beta_0 = \bar y, \qquad \hat\beta_L = (\hat\beta_1, ..., \hat\beta_k)^T = L\, y,$$

where ȳ is the mean ȳ = Σ_u y_u / n, and L is a k × n matrix of constants. We shall assume that L 1_n = 0 (where 1_n^T = (1, 1, ..., 1)), so that a change in the origin of measurement of y affects only β̂_0, and not β̂_1, ..., β̂_k. Examples of estimators of this class are: full least squares; subset least squares; and Bayes estimates under multinormal specifications with a multinormal prior, a special case of which is the class of "Ridge" estimates advocated by Hoerl and Kennard (1970a, b) (see also Theil (1963), section 2.3):

$$\hat\beta_f = (X^T X + f I)^{-1} X^T y, \qquad (11)$$

where f is a (small) scalar parameter (Hoerl and Kennard used k for this parameter). In this section we are writing X for the n × k matrix of independent variables, which we are now assuming have been standardized to have zero means and unit variances; thus 1^T X = 0, diag(X^TX) = I.
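To see numerically that the Ridge estimate (11) belongs to this linear class, with β̂_f = L_f y and L_f 1_n = 0 once the columns of X are centered, one can proceed as follows (an illustrative sketch; the data and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, f = 30, 4, 0.1
X = rng.standard_normal((n, k))
X = X - X.mean(axis=0)                      # zero column means: 1^T X = 0
X = X / np.sqrt((X ** 2).sum(axis=0))       # scale so that diag(X^T X) = I
y = rng.standard_normal(n)

Lf = np.linalg.solve(X.T @ X + f * np.eye(k), X.T)   # L_f = (X^TX + fI)^{-1} X^T
beta_f = Lf @ y                                      # the Ridge estimate (11)

print(np.allclose(Lf @ np.ones(n), 0))      # True: L 1_n = 0, so shifting the
                                            # origin of y changes only beta_0
```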
As a measure of adequacy for prediction we again use the scaled summed mean square error, which in the present notation is

$$J_L = \frac{1}{\sigma^2} \sum_{u=1}^{n} (\bar y + x_u \hat\beta_L - \eta(x_u))^2,$$

and which has expectation E(J_L) = V_L + B_L, where

$$V_L = 1 + \operatorname{tr}(X^T X L L^T), \qquad B_L = \frac{1}{\sigma^2}\, \beta_K^T (L X - I)^T X^T X (L X - I)\, \beta_K.$$

The sum of squares about the fitted regression is

$$RSS_L = \| y - \bar y\, 1 - X \hat\beta_L \|^2,$$

which has expectation E(RSS_L) = (B_L + V_L') σ², where

$$V_L' = n - 1 - 2 \operatorname{tr}(X L) + \operatorname{tr}(X^T X L L^T).$$

Thus we have an estimator of E(J_L), namely

$$C_L = \frac{RSS_L}{\hat\sigma^2} - n + 2 + 2 \operatorname{tr}(X L). \qquad (12)$$

By analogy with the Cp development, we propose that values of C_L (for various choices of L) should be plotted against values of V_L. Notice that when L is a matrix corresponding to subset least squares, C_L, V_L reduce to C_P, p respectively.

For computing values of C_L, V_L for Ridge estimates (11), the following steps can be taken. First, find H (orthogonal) and Λ = diag(λ_1, λ_2, ..., λ_k) so that X^TX = H^T Λ H. Compute z = H X^T y. Then

$$\operatorname{tr}(X L) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda_i + f}, \qquad \operatorname{tr}(X^T X L L^T) = \sum_{i=1}^{k} \frac{\lambda_i^2}{(\lambda_i + f)^2}, \qquad (13)$$

and RSS_L can be computed directly from (11).

Figure 5 gives the resulting plot for the set of 10-variable data analyzed by Gorman and Toman (1966) and by Hoerl and Kennard (1970b). Shown are (p, C_P) points corresponding to various subset-least-squares estimates, and a continuous arc of (V_L, C_L) points corresponding to Ridge estimates with values of f varying from zero at (11, 11) and increasing to the left. For this example, Hoerl and Kennard (1970b) suggested that a value of f in the interval (0.2, 0.3) would "undoubtedly" give estimated coefficients "closer to β and more stable for prediction than the least-squares coefficients or some subset of them". On the other hand, from Figure 5 one would be inclined to suggest a value of f nearer to .02 than to 0.2.

FIGURE 5. C_L plot for the Gorman-Toman data: subset (C_P) values and Ridge (C_L) values.

One obvious suggestion is to choose f to minimize C_L. Some insight into the effect of this choice can be gained as follows. First consider the case of orthogonal regressors, and now assume X^TX = I. Notice that in this case our risk function E(J_L) is equivalent to the quantity E Σ_j (β̂_j − β_j)² used by Hoerl and Kennard (1970a). We may take H = I, so that z_j = β̂_j, the least-squares estimate of β_j. From (12) and (13) we see that C_L is a minimum when f satisfies

$$\frac{f}{1 + f} = \frac{k \hat\sigma^2}{\sum_j z_j^2};$$

the adjusted estimates are then given by

$$\hat\beta^* = \left(1 - \frac{k \hat\sigma^2}{\sum_j z_j^2}\right) \hat\beta. \qquad (14)$$

It is interesting that this set of estimates is of the form suggested by Stein (1960) for the problem of estimating regression coefficients in a multivariate Normal distribution. James and Stein (1961) showed that for k ≥ 3 the vector of estimates β̂** obtained by replacing the multiplier k in (14) by any number between 0 and 2k − 4 has the property that E(J**) is less than the full-least-squares value k + 1 (see (10)) for all values of the true regression coefficients. Thus our "minimize C_L" rule dominates full least squares for k ≥ 4. This result stands in interesting contrast to the disappointing result found above for the "minimize Cp" rule.
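Putting (12) and (13) together, the Ridge arc of Figure 5 can be traced as follows. An illustrative sketch, assuming X centered and standardized with X^TX of full rank, and with σ̂² taken from the full least-squares fit (the function name and interface are ours):

```python
import numpy as np

def ridge_cl_curve(X, y, fs):
    """(V_L, C_L) points for Ridge estimates over a grid of f values."""
    n, k = X.shape
    yc = y - y.mean()
    lam, H = np.linalg.eigh(X.T @ X)       # X^T X = H diag(lam) H^T
    z = H.T @ (X.T @ yc)                   # rotated coordinates
    rss_full = yc @ yc - np.sum(z * z / lam)
    s2 = rss_full / (n - k - 1)            # full least-squares sigma_hat^2
    pts = []
    for f in fs:
        a = lam / (lam + f)                # shrinkage factors lambda_i/(lambda_i + f)
        rss = yc @ yc - np.sum((2 * a - a * a) * z * z / lam)
        cl = rss / s2 - n + 2 + 2 * np.sum(a)   # equation (12)
        vl = 1 + np.sum(a * a)                  # V_L = 1 + tr(X^T X L L^T)
        pts.append((vl, cl))
    return pts
```

At f = 0 the arc starts from (k + 1, k + 1), which is the point (11, 11) cited above for the ten-variable Gorman-Toman data.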
Now consider the case of equi-correlated regressors, with X^TX = I + ρ(11^T − I). In this case the least-squares estimate of β̄ = Σ_j β_j / k has variance σ²/(k(1 − ρ + kρ)), and the vector of deviations (β̂_j − β̄̂) has covariance matrix σ²(I − k^{-1} 1 1^T)/(1 − ρ). Thus when ρ is large, these deviations become very unstable. It is found that for ρ near unity, C_L is minimized when f is near (1 − ρ)g for a certain data-dependent quantity g [the displayed expressions for g and for the resulting adjusted estimates are not legible in this copy]. Thus here the "minimize C_L" rule leads to shrinking the least-squares estimates towards their average. While the details have not been fully worked out, one expects that this rule will dominate full least squares for k ≥ 5.

To handle the case of non-orthogonal regressors, Sclove (1968) has suggested transforming to orthogonality before applying a shrinkage factor. A composite procedure with much intuitive appeal for this writer would be to use the Cp-plot or some similar device to identify the terms that should certainly be included (since they appear in all subsets that give reasonably good fits to the data), to fit these by least squares, and to adjust the remaining estimates by orthogonalizing and shrinking towards zero as in (14).

It is a great personal pleasure to recall that the idea for the Cp-plot arose in the course of some discussions with Cuthbert Daniel around Christmas 1963. The use of the letter C is intended to do him honor. The device was described publicly in 1964 [16] and again in 1966 [17] (with the extensions described at the end of section 1 above) and has appeared in several unpublished manuscripts. Impetus for preparing the present exposition was gained in the course of delivering a series of lectures at the University of California at Berkeley in February 1972; their support is gratefully acknowledged.

Proof of the Lemma

The key to these results is the identity, true for any subset P that includes 0, i.e. P = {0, P⁻},

$$RSS_P - RSS_{K^+} = (\hat\beta_{P,K} - \hat\beta_K)^T D_K (\hat\beta_{P,K} - \hat\beta_K),$$

where β̂^T = (β̂_0, β̂_K^T) is the vector of least-squares estimates of all the coefficients in the model, and β̂_P^T = (β̂_{P,0}, β̂_{P,K}^T) is the vector of subset-least-squares estimates. From the form of S_α (7) it now follows that (iii) β̂_{P,K} is in S_α if and only if (v) RSS_P − RSS_{K⁺} ≤ kσ̂²F_α, which is directly equivalent (if σ̂² = RSS_{K⁺}/(n − k − 1)) to (iv) C_P ≤ 2p − k − 1 + kF_α. Clearly (iii) implies (i); to prove the converse we remark that for any vector β^T = (β_0, β_K^T) with β_K in the hyperplane H_P = {β_K : β_Q = 0}, we have

$$(\beta_K - \hat\beta_K)^T D_K (\beta_K - \hat\beta_K) = (\beta_K - \hat\beta_{P,K})^T D_K (\beta_K - \hat\beta_{P,K}) + (\hat\beta_{P,K} - \hat\beta_K)^T D_K (\hat\beta_{P,K} - \hat\beta_K),$$

the cross-product term vanishing by the definition of β̂_P. Thus if any point of H_P is in S_α, β̂_{P,K} must be. Finally, (i) is directly equivalent to (ii) by a simple geometrical argument.

REFERENCES

[1] Allen, D. M. (1971). Mean square error of prediction as a criterion for selecting variables. Technometrics 13, 469-475.
[2] Beale, E. M. L., Kendall, M. G. and Mann, D. W. (1967). The discarding of variables in multivariate analysis. Biometrika 54, 357-366.
[3] Daniel, C. and Wood, F. S. (1971). Fitting Equations to Data. New York: Wiley-Interscience.
[4] Furnival, G. M. (1971). All possible regressions with less computation. Technometrics 13, 403-408.
[5] Garside, M. J. (1965). The best sub-set in multiple regression analysis. Applied Statistics 14, 196-200.
[6] Godfrey, M. B. (1972). Relations between Cp, RSS, and mean square residual. Submitted to Technometrics.
[7] Gorman, J. W. and Toman, R. J. (1966). Selection of variables for fitting equations to data. Technometrics 8, 27-51.
[8] Hocking, R. R. and Leslie, R. N. (1967). Selection of the best subset in regression analysis. Technometrics 9, 531-540.
[9] Hoerl, A. E. and Kennard, R. W. (1970a). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.
[10] Hoerl, A. E. and Kennard, R. W. (1970b). Ridge regression: applications to nonorthogonal problems. Technometrics 12, 69-82.
[11] James, W. and Stein, C. (1961). Estimation with quadratic loss. Proc. Fourth Berkeley Symp. 1, 361-379. Univ. of California Press.
[12] Kennard, R. W. (1971). A note on the Cp statistic. Technometrics 13, 899-900.
[13] Kennedy, W. J. and Bancroft, T. A. (1971). Model-building for prediction in regression based on repeated significance tests. Ann. Math. Statist. 42, 1273-1284.
[14] La Motte, L. R. and Hocking, R. R. (1970). Computational efficiency in the selection of regression variables. Technometrics 12, 83-93.
[15] Lindley, D. V. (1968). The choice of variables in multiple regression. J. Roy. Statist. Soc. B 30, 31-53 (discussion, 54-66).
[16] Mallows, C. L. (1964). Choosing variables in a linear regression: a graphical aid. Presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas, May 7-9.
[17] Mallows, C. L. (1966). Choosing a subset regression. Presented at the Annual Meeting of the American Statistical Association, Los Angeles, August 15-19.
[18] Mantel, N. (1970). Why stepdown procedures in variable selection. Technometrics 12, 621-625.
[19] Schatzoff, M., Tsao, R. and Fienberg, S. (1968). Efficient calculation of all possible regressions. Technometrics 10, 769-779.
[20] Sclove, S. L. (1968). Improved estimators for coefficients in linear regression. J. Amer. Statist. Assoc. 63, 596-606.
[21] Spjøtvoll, E. (1972). Multiple comparison of regression functions. Ann. Math. Statist. 43, 1076-1088.
[22] Srikantan, K. S. (1970). Canonical association between nominal measurements. J. Amer. Statist. Assoc. 65, 284-292.
[23] Stein, C. (1960). Multiple regression. In Contributions to Probability and Statistics (ed. I. Olkin). Stanford University Press, 424-443.
[24] Theil, H. (1963). On the use of incomplete prior information in regression analysis. J. Amer. Statist. Assoc. 58, 401-414.
[25] Watts, H. W. (1965). The Test-o-Gram; a pedagogical and presentational device. Amer. Stat. 19, No. 4, 22-28.