Some Comments on C P C. L. Mallows Technometrics, Vol. 15, No

Some Comments on C P
C. L. Mallows
Technometrics, Vol. 15, No. 4. (Nov., 1973), pp. 661-675.
Stable URL:
http://links.jstor.org/sici?sici=0040-1706%28197311%2915%3A4%3C661%3ASCO%3E2.0.CO%3B2-6
Technometrics is currently published by American Statistical Association.
Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at
http://www.jstor.org/about/terms.html. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained
prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in
the JSTOR archive only for your personal, non-commercial use.
Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at
http://www.jstor.org/journals/astata.html.
Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed
page of such transmission.
JSTOR is an independent not-for-profit organization dedicated to and preserving a digital archive of scholarly journals. For
more information regarding JSTOR, please contact [email protected].
http://www.jstor.org
Thu Apr 5 14:50:09 2007
SOME COMMENTS ON C_P

C. L. Mallows
Bell Laboratories, Murray Hill, New Jersey

Technometrics, Vol. 15, No. 4 (Nov., 1973), pp. 661-675.
Stable URL: http://links.jstor.org/sici?sici=0040-1706%28197311%2915%3A4%3C661%3ASCO%3E2.0.CO%3B2-6

We discuss the interpretation of C_p-plots and show how they can be calibrated in several ways. We comment on the practice of using the display as a basis for formal selection of a subset-regression model, and extend the range of application of the device to encompass arbitrary linear estimates of the regression coefficients, for example Ridge estimates.

Received June 1972; revised Oct. 1972.
Key Words: Linear Regression; Selection of Variables; Ridge Regression.
Suppose that we have data consisting of n observations on each of k + 1 variables, namely k independent variables x_1, ..., x_k and one dependent variable, y. Write x_0 ≡ 1, x (1 × (k + 1)) = (x_0, x_1, ..., x_k), y (n × 1) = (y_1, ..., y_n)^T, X (n × (k + 1)) = (x_{uj}). A model of the form

    y_u = η(x_u) + e_u,    u = 1, ..., n,                         (1)

where

    η(x) = β_0 x_0 + β_1 x_1 + ... + β_k x_k = xβ,

is to be entertained,* with the residuals e_1, ..., e_n being regarded (tentatively) as independent random variables with mean zero and unknown common variance σ². The x's are not to be regarded as being sampled randomly from some population, but rather are to be taken as fixed design variables. We suppose that the statistician is interested in choosing an estimate β̂ = (β̂_0, ..., β̂_k)^T, with the idea that for any point x in the general vicinity of the data at hand, the value

    η̂(x) = xβ̂

will be a good estimate of η(x). In particular he may be interested in choosing a "subset least-squares" estimate in which some components of β̂ are set at zero and the remainder estimated by least squares.

* If β_0 is absent the development is entirely similar. We assume throughout that X has rank k + 1.
The C_p-plot is a graphical display device that helps the analyst to examine his data with this framework in mind. Consider a subset P of the set of indices K⁺ = {0, 1, 2, ..., k}; let Q be the complementary subset. Suppose the numbers of elements in P, Q are |P| = p, |Q| = q, so that p + q = k + 1. Denote by β̂_P the vector of estimates that is obtained when the coefficients with subscripts in P are estimated by least squares, the remaining coefficients being set equal to zero; i.e.

    β̂_P = X_P⁺ y,

where X_P⁺ is the (Moore-Penrose) generalized inverse of X_P, which in turn is obtained from X by replacing the columns having subscripts in Q by columns of zeroes. (Thus X_P⁺ has zeroes in the rows corresponding to Q, and the remaining rows contain the matrix (Z_P^T Z_P)⁻¹ Z_P^T, where Z_P is obtained from X by deleting the columns corresponding to Q.) Let RSS_P denote the corresponding residual sum of squares, i.e.
    RSS_P = Σ_u (y_u − x_u β̂_P)².

For any such estimate β̂_P, a measure of adequacy for prediction is the "scaled sum of squared errors"

    J_P = (1/σ²) Σ_u (x_u β̂_P − η(x_u))²,                         (2)

the expectation of which is easily found to be

    E(J_P) = V_P + B_P,

where V_P, B_P are respectively "variance" and "bias" contributions given by

    V_P = tr(M_P) = p,    B_P = (1/σ²) β̄_P^T X^T (I − M_P) X β̄_P,

and β̄_P is β with the elements corresponding to P replaced by zeroes, and

    M_P = X X_P⁺ = X_P X_P⁺ = Z_P (Z_P^T Z_P)⁻¹ Z_P^T.
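As an illustration of these definitions, the following is a minimal computational sketch (ours, not the paper's, assuming Python with numpy; the function and variable names are illustrative): it forms X_P by zeroing the Q-columns, takes the Moore-Penrose inverse, and returns the projection M_P, whose trace recovers V_P = p.

```python
# Minimal sketch (assuming numpy) of the subset least-squares objects defined
# above; names are ours, not the paper's.
import numpy as np

def subset_ls(X, y, P):
    """beta_hat_P = X_P^+ y, with X_P = X with its Q-columns zeroed; also M_P."""
    XP = np.zeros_like(X)
    XP[:, P] = X[:, P]                  # replace Q-columns by columns of zeroes
    XP_plus = np.linalg.pinv(XP)        # Moore-Penrose generalized inverse X_P^+
    beta_P = XP_plus @ y                # zeroes appear in the Q positions
    M_P = X @ XP_plus                   # equals Z_P (Z_P^T Z_P)^{-1} Z_P^T
    return beta_P, M_P                  # note: trace(M_P) = p = |P|
```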
The C_P statistic is defined to be

    C_P = RSS_P/σ̂² − n + 2p,                         (3)

where σ̂² is an estimate of σ². Clearly (as has been remarked by Kennard (1971)), C_P is a simple function of RSS_P, as are the multiple correlation coefficient defined by 1 − R_P² = RSS_P/TSS (where TSS is the total sum of squares) and the "adjusted" version of this. However the form (3) has the advantage (as has been shown by Gorman and Toman (1966), Daniel and Wood (1971), and Godfrey (1972)) that, since under the above assumptions

    E(RSS_P) = (n − p + B_P)σ²,                         (4)

C_P is an estimate of E(J_P), and is suitably standardized for graphical display, plotted against p. Graphical presentation of the various regression sums of squares themselves against p was advocated by Watts (1965). For k not too large it is feasible ([4], [5], [19]) to compute and display all the 2^{k+1} values of C_P; for larger values one can use algorithms of [2], [8], [14] to compute only the more interesting (smaller) values.
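For concreteness, here is a brute-force sketch of that computation (our code, assuming numpy; feasible only for small k, as just noted):

```python
# Brute-force C_P for all subsets (a sketch, assuming numpy); sigma^2 is
# estimated by the full-model residual mean square RSS_{K+}/(n - k - 1).
import itertools
import numpy as np

def cp_all_subsets(X, y):
    """X: n x (k+1) design whose first column is ones; returns {P: C_P}."""
    n, k1 = X.shape
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    rss_full = np.sum((y - X @ beta_full) ** 2)
    sigma2_hat = rss_full / (n - k1)            # nu = n - k - 1 d.f.
    cp = {}
    for p in range(1, k1 + 1):
        for P in itertools.combinations(range(k1), p):
            Z = X[:, P]                         # Z_P: columns with subscripts in P
            b = np.linalg.lstsq(Z, y, rcond=None)[0]
            rss_p = np.sum((y - Z @ b) ** 2)
            cp[P] = rss_p / sigma2_hat - n + 2 * p    # eq. (3)
    return cp
```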
In section 2 we describe some of the configurations that can arise; in section 3 we provide some formal calibration for the display, and in section 4 comment on the practice of using it as a basis for formal selection. The approach is extended in section 5 to handle arbitrary linear estimates of the regression coefficients. The approach can also be extended to handle multivariate response data and to deal with an arbitrary weight function w(x) in factor-space, describing a region
of interest different from that indicated by the configuration of the data currently to hand. In each case, the derivation is exactly parallel to that given above. In the former case, one obtains a matrix analog of C_P in the form Σ̂⁻¹·RSS_P − (n − 2p)I, where Σ̂ is an estimate of the residual covariance matrix, and RSS_P is Σ_u (y_u − x_u β̂_P)^T (y_u − x_u β̂_P). One or more measures of the "size" of C_P (such as the trace, or largest eigenvalue) can be plotted against p. In the latter case, with the matrix A = (A_{ij}) defined by A_{ij} = ∫ x_i x_j w(x) dx, one arrives at a statistic of the form C_P^w, where V_{K+}^w = tr(A(X^T X)⁻¹), V_P^w = tr(A(X_P^T X_P)⁺), and we can plot C_P^w against V_P^w. This reduces to the C_p-plot when A = X^T X. If interest is concentrated at a single point x, we have A = x^T x, and the statistic is equivalent to that suggested by Allen (1971); his equation (9) = σ̂²(C_P^w − x(X^T X)⁻¹x^T).
From (2), (3), (4) we see that if β̄_P = 0, so that the P-subset model is in fact completely appropriate, then RSS_P ≈ (n − p)σ² and C_P ≈ p. If σ̂² is taken as RSS_{K+}/(n − k − 1), then C_{K+} = |K⁺| = k + 1 exactly. Notice that if P* is a (p + 1)-element subset which contains P, then

    C_{P*} − C_P = 2 − SS/σ̂²,                         (5)

where SS is the one-d.f. contribution to the regression sum of squares due to the (p + 1)-th variable, so that SS/σ̂² is a t² statistic that could be used in a stepwise testing algorithm. If the additional variable is unimportant, i.e. if the bias contribution B_P − B_{P*} is small, then E(SS) ≈ σ² and so

    E(C_{P*} − C_P) ≈ 1.
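The identity (5) is exact once σ̂² is held at the full-equation value; a quick numerical check on simulated data (our sketch, assuming numpy; the data and subsets are arbitrary):

```python
# Numerical check of eq. (5) on simulated data (a sketch, assuming numpy).
import numpy as np

rng = np.random.default_rng(0)
n, k1 = 50, 4                                   # k + 1 = 4 columns incl. x_0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k1 - 1))])
y = X @ np.array([1.0, 2.0, 0.5, 0.0]) + rng.normal(size=n)

def rss(cols):
    b = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    return np.sum((y - X[:, cols] @ b) ** 2)

sigma2 = rss(range(k1)) / (n - k1)              # full-equation estimate
cp = lambda cols: rss(cols) / sigma2 - n + 2 * len(cols)   # eq. (3)

P, Pstar = [0, 1], [0, 1, 2]
SS = rss(P) - rss(Pstar)                        # one-d.f. regression SS
print(cp(Pstar) - cp(P), 2 - SS / sigma2)       # equal, as eq. (5) asserts
```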
Mantel (1970) has discussed the use of stepwise procedures, and how they behave in the face of various patterns of correlation amongst the independent variables. It is illuminating to consider how patterns similar to those he describes would show up on a C_p-plot.

First, suppose the independent variables are not highly correlated, that β = β_P, and that every non-zero element of β is large (relative to the standard error of its least-squares estimate). Then the C_p-plot will look something like Figure 1 (drawn for the case p = k − 2, K⁺ − P = {1, 2, 3}). Notice the approximately linear diagonal configuration of points corresponding to the well-fitting subsets of variables.
Now, suppose x_1, x_2, x_3 are highly correlated with each other, with each being about equally correlated with y. Then any two of these variables, but not all three, can be deleted from the model without much effect. In this case the relevant points on the C_p-plot will look something like Figure 2a, if no other variables are of importance, or like Figure 2b if some other subset P is also needed. (In all these examples we are assuming that the constant term β_0 is always needed.) Notice that now the diagonal pattern is incomplete. In an intermediate case, when x_1, x_2, x_3 have moderate correlations, a picture intermediate between Figures 1 and 2b will be obtained.
Thirdly, suppose x_1, x_2 are individually unimportant but jointly are quite effective in reducing the residual sum of squares; suppose some further subset P of variables is also needed. Mantel gives an explicit example of this behavior. Figure 3 shows the resulting configuration in the case |P| = k − 4. Notice that even if C_{P,1,2} is the smallest C_P-value for subsets of size p + 2, there might be subsets P′_1, P′_2 (not containing P) with |P′_i| = p or p + 1 that gave smaller values of C_P than those for P, {P, 1}, {P, 2}. In this case an upward stepwise testing algorithm might be led to include variables in these subsets and so not get to the subset {P, 1, 2}. Mantel describes a situation where this would happen.
To derive benchmarks for more formal interpretation of C_p-plots, we assume that the model (1) is in fact exactly appropriate, with the residuals e_1, ..., e_n being independent and Normal (0, σ²). Suppose σ² is estimated by σ̂² = RSS_{K+}/ν, where ν = n − k − 1, the residual degrees of freedom. We do not of course recommend that the following distributional results be used blindly without careful inspection of the empirical residuals y_u − η̂(x_u), u = 1, ..., n. However, they should give pause to workers who are tempted to assign significance to quantities of the magnitude of a few units or even fractions of a unit on the C_P scale.
FIGURE 2a — C_p-plot: Variables 1, 2, 3 are highly explanatory, also highly correlated.
First, notice that the increment C_{P*} − C_P (in (5) above) is distributed as 2 − t², where the t-statistic t is central if β = β_P. In this case this increment has mean and variance of approximately 1 and 2 respectively. Similarly,

    C_{K+} − C_P = q(2 − F),                         (6)

where q = k + 1 − p and the F statistic (on q and ν degrees of freedom) is central if β_Q = 0; thus if ν is large compared with q this increment has mean and variance approximately q and 2q respectively. The variance of the slope of the line joining the points (p, C_P), (k + 1, k + 1) is thus 2/q, so that the slope of a diagonal configuration such as is shown in Figure 1 will vary considerably about 45°. The following tables (derived from (6)) give values of C_P − p that will be exceeded with probability α when the subset P is in fact adequate (i.e. when β = β_P, so that β̄_P = 0), for the cases ν = n − k − 1 = 30, ∞. The value tabulated is q(F_{q,ν}(α) − 1), where F_{q,ν}(α) is the upper α point of the F distribution on q and ν degrees of freedom.
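These benchmark values are easy to regenerate; a sketch assuming scipy is available (names ours):

```python
# Calibration values q * (F_{q,nu}(alpha) - 1): the amount by which C_P - p
# is exceeded with probability alpha when subset P is adequate (a sketch,
# assuming scipy).
from scipy.stats import f

def cp_benchmark(q, nu, alpha):
    return q * (f.ppf(1 - alpha, q, nu) - 1.0)  # upper-alpha point of F(q, nu)

for q in (1, 2, 5, 10):
    print(q, round(cp_benchmark(q, nu=30, alpha=0.05), 2))
```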
For comparing two C_P-values corresponding to subsets P, P′ with P ∩ P′ = B, P = A ∪ B, P′ = A′ ∪ B, it is straightforward to derive the results, valid under the null hypothesis that each of P and P′ is an adequate subset,

    E(C_P − C_{P′}) ≈ |P| − |P′| = |A| − |A′|,

    Var(C_P − C_{P′}) ≈ 2(|A| + |A′| − 2R²),

where R² is the sum of squares of the canonical correlations between the sets of variables X_A and X_{A′}, after partialling out the variables X_B. (Thus if |B| = |P| − 1, Var(C_P − C_{P′}) ≈ 4(1 − ρ²), where ρ is the partial correlation coefficient ρ_{AA′·B}.)†
FIGURE 2b — C_p-plot: Same as 2a except that variables in P are also explanatory.
† Srikantan (1970) has proposed the average, rather than the sum, of the squared canonical correlations as an overall measure of association. This measure has the property that its value is changed when a new variable, completely uncorrelated with all the previous ones, is added to one of the sets of variates.
We now use the Scheffé confidence ellipsoid to derive a different kind of result. Let us write β̂^T = (β̂_0, β̂_K^T) for the least-squares estimate of β^T = (β_0, β_K^T), and let

    X^T X = [ n   m^T ]
            [ m   D   ],        D_K = D − (1/n) m m^T.

Then the Scheffé 100α% confidence ellipsoid for the elements of β_K is the region

    S_α = {β_K : (β_K − β̂_K)^T D_K (β_K − β̂_K) ≤ kσ̂²F_α},                         (7)

where F_α is the upper 100α% quantile of the F distribution on k and n − k − 1 degrees of freedom. Notice that S_α can be written

    S_α = β̂_K + σ̂ S*_α,

where S*_α is a fixed ellipsoid centered at the origin:

    S*_α = {γ : γ^T D_K γ ≤ kF_α}.
FIGURE 3 — C_p-plot: Two variables that jointly are explanatory but separately are not.
Let P−, Q be any complementary subsets of K, and P = {0, P−}. The following lemma is proved in the Appendix.

Lemma. The following statements are equivalent:
(i) The region S_α intersects the coordinate hyperplane H_P = {β_K : β_Q = 0};
(ii) The projection of S_α onto the H_Q hyperplane contains the origin;
(iii) The subset least-squares estimate β̂_P = (β̂_0, β̃_P) has β̃_P in S_α, where β̃_P denotes the fitted coefficients for P− extended with zeroes in the Q positions;
(iv) C_P ≤ 2p − k − 1 + kF_α;
(v) RSS_P − RSS_{K+} ≤ kσ̂²F_α.
Now consider any hypothesis that specifies the value of β_K, say β_K = β_K°, and the corresponding 100α% acceptance region

    T_{β_K°} = {β̂_K : (β̂_K − β_K°)^T D_K (β̂_K − β_K°) ≤ kσ̂²F_α}                         (8)

(clearly P(β̂_K ∈ T_{β_K°} | β_K°) is in fact equal to α; this is just the confidence property of the Scheffé ellipsoid (7)). Starting from this family of acceptance regions for hypotheses that specify β_K completely, a natural acceptance region for a composite hypothesis of the form β_Q = 0 is given by the union of all regions T_{β_K°} for values of β_K° such that β_Q° = 0; the reasoning is that the hypothesis β_Q = 0 cannot be rejected if there is any β_K° with β_Q° = 0 that is acceptable according to the corresponding test in the family, i.e. if there is any β_K° with β_Q° = 0 lying within the confidence ellipsoid S_α.
[Table (p. 668 of the original): values of q(F_{q,ν}(α) − 1), the amount by which C_P − p is exceeded with probability α when the subset P is adequate, for ν = 30 and ν = ∞; the individual entries are not legible in this scan.]
By the Lemma, the corresponding acceptable subsets {0, P−} are just those that have

    C_P ≤ 2p − k − 1 + kF_α.                         (9)

We state the property formally: A subset P = {0, P−} satisfies (9) if and only if there is some vector of coefficients β having β_Q = 0 that lies within the Scheffé ellipsoid (7), i.e. if and only if there is some vector of this form that is accepted by the corresponding test with acceptance region of the form (8).
As an example, consider the 10-variable data studied by Gorman and Toman (1966). Taking α = 0.10, k = 10, ν = 25, we find that among the 58 subsets for which Gorman and Toman computed C_P-values, there are 39 that satisfy (9), in number 7, 13, 9, 10 with p = 7, 8, 9, 10 respectively. This result gives little support to the view that this set of data is sending a clear message regarding the relative importance of the variables under consideration.
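Given a collection of C_P values (for example from the all-subsets sketch in section 1), criterion (9) is a one-line filter; a hedged sketch assuming scipy, with names ours:

```python
# Flag the subsets acceptable under criterion (9); a sketch assuming scipy.
from scipy.stats import f

def acceptable_subsets(cp_values, k, nu, alpha=0.10):
    """cp_values: dict mapping subset (tuple of indices incl. 0) -> C_P.

    Returns subsets with C_P <= 2p - k - 1 + k * F_alpha, eq. (9)."""
    F_alpha = f.ppf(1 - alpha, k, nu)   # upper point of F on (k, nu) d.f.
    return [P for P, cp in cp_values.items()
            if cp <= 2 * len(P) - k - 1 + k * F_alpha]
```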
Notice that if the true coefficient vector β* has β*_{Q*} = 0, then Pr{for all P containing P*, C_P ≤ 2p − k − 1 + kF_α} ≥ α, with equality only if p* = 1 (i.e. P* = {0}). This property of the procedure is not completely satisfying since it is not an equality; also the form of the boundary in the C_p-plot is inflexible. In theory, one way of getting a better result is the following. Given any subset P* and a sequence of constants c_1, c_2, ..., c_k (and the matrix D_K) one could compute the probability Pr{for all P containing P*, C_P ≤ c_p}; this probability depends on c_1, ..., c_k, P* and D_K, but not on any other parameters. One could then adjust c_1, ..., c_k so as to make the minimum of this probability over all choices of P* (or possibly only over all choices with p* > some p_0) equal to some desired level α. The computation would presumably be done by simulation.
Starting from the Scheffé ellipsoid, Spjøtvoll (1972) has developed a multiple-comparison approach that provides confidence intervals for arbitrary quadratic functions of the unknown regression parameters, for example B_P − B_{P′} for two subsets P, P′.
Many authors have studied the problem of giving formal rules for the selection of predictors; Kennedy and Bancroft (1971) give many references. Lindley (1968) presents a Bayesian formulation of the problem. The discussion in section 3 above does not lend any support to the practice of taking the lowest point on a C_p-plot as defining a "best" subset of terms. The present author feels that the greatest value of the device is that it helps the statistician to examine some aspects of the structure of his data and helps him to recognize the ambiguities that confront him. The device cannot be expected to provide a single "best" equation when the data are intrinsically inadequate to support such a strong inference.

To make these remarks more precise and objective, we shall compute (in a special case) a measure of the performance to be expected of the rule "choose the subset that minimizes C_P, and fit it by least-squares". We shall use as a figure of merit of an arbitrary estimator η̂(x) the same quantity as was used in setting up the C_p-plot, namely the sum of predictive squared errors

    J = (1/σ²) Σ_u (η̂(x_u) − η(x_u))².
We can handle in detail only the case of orthogonal regressors, and so now assume X^T X = nI. In this case we see from (5) that C_P is minimized when P contains just those terms for which t_j² > 2, where t_j = √n b_j/σ̂ is the t-statistic for the j-th regression coefficient, and b_j is the least squares estimate, b_j = Σ_u x_{uj} y_u/n. Thus in this case the "minimize C_P" rule is equivalent to a stepwise regression algorithm in which all critical t-values are set at √2 and σ̂² is kept at the full-equation value throughout.
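In code, the rule in this orthogonal case is a simple threshold (our sketch, assuming numpy; names ours):

```python
# The "minimize C_P" rule when X^T X = nI: keep exactly the terms with
# t_j^2 > 2 (a sketch, assuming numpy).
import numpy as np

def min_cp_subset(X, y, sigma_hat):
    n = len(y)
    b = X.T @ y / n                      # b_j = sum_u x_uj y_u / n
    t = np.sqrt(n) * b / sigma_hat       # t_j = sqrt(n) b_j / sigma_hat
    return np.flatnonzero(t ** 2 > 2.0)  # the minimizing subset P
```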
Now let us assume that n is sufficiently large that variation in σ̂ can be ignored; then t_0, t_1, ..., t_k will be independent Normal variables with unit variances and with means τ_0, τ_1, ..., τ_k, where τ_j = √n β_j/σ. Let d(t) be the function that equals 0 for |t| < √2, and equals 1 otherwise; then J for the "minimum-C_P subset least squares" estimate can be written

    J = (n/σ²) Σ_j (b_j d(t_j) − β_j)²,

which reduces to

    J = Σ_j (t_j d(t_j) − τ_j)².

Hence

    E(J) = Σ_{j=0}^{k} m(τ_j),

where m(τ) = E((u + τ)d(u + τ) − τ)² (where u is a standard Normal variable), and is the function displayed in Figure 4 (labelled "16%", since Pr(|u| > √2) = .1573).

[FIGURE 4 — m-functions, plotted against the standardized regression coefficient τ.]

If the constant term is always to be included in the selected subset, the corresponding result is

    E(J) = 1 + Σ_{j=1}^{k} m(τ_j).
Notice that the function m(τ) is less than 1 only for |τ| < .78, and rises to a maximum value of 1.65 at |τ| = 1.88. It exceeds 1.25 for 1.05 < |τ| < 3.05. We reiterate that in this case of orthogonal regressors with n very large, the "minimum C_P" rule is equivalent to a stepwise regression algorithm with all critical levels set at 15.73%. Also shown in Figure 4 are the m-functions corresponding to several other critical levels; when all k + 1 terms are infallibly included (the "full-l.s." rule), m(τ) = 1 for all τ, so that E(J_{full-l.s.}) = k + 1. We see that the "minimum C_P" rule will give a smaller value for E(J) than the "full-l.s." rule only when rather more of the true regression coefficients satisfy |τ| < .78 than satisfy |τ| > 1; in the worst case with |τ_j| = 1.88 for j = 1, ..., k, E(J) for the "minimize C_P" rule is 165% of that for the "full-l.s." rule. Similarly for rejection rules with other critical levels; in particular, a rule with a nominal level of 5% (two-tailed) gives an E(J) at worst 246% of that of the "full-l.s." rule. Thus using the "minimum C_P" rule to select a subset of terms for least-squares fitting cannot be recommended universally. Notice however that by examining the C_p-plot in the light of the distributional results of the previous section one can see whether or not a single best subset is uniquely indicated; the ambiguous cases where the "minimum C_P" rule will give bad results are exactly those where a large number of subsets are close competitors for the honor. With such data no selection rule can be expected to perform reliably.
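Since Figure 4 itself is lost in this scan, the risk function m(τ) is easy to evaluate numerically; a sketch (ours, assuming numpy and scipy):

```python
# Numerical evaluation of m(tau) = E[((u + tau) d(u + tau) - tau)^2], where
# d(t) = 0 for |t| < sqrt(2) and 1 otherwise (a sketch, assuming scipy).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def m(tau, c=np.sqrt(2)):
    def integrand(u):
        t = u + tau
        est = t if abs(t) >= c else 0.0     # t * d(t)
        return (est - tau) ** 2 * norm.pdf(u)
    return quad(integrand, -8, 8)[0]

print(round(m(0.78), 2))   # roughly 1: the break-even point quoted above
print(round(m(1.88), 2))   # roughly 1.65: the quoted maximum
```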
We now extend the C_p-plot device to handle general linear estimators. With the same set-up as in the Introduction, consider an estimate of the form

    β̂_0 = ȳ,    β̂_K = Ly,                         (10)

where ȳ is the mean ȳ = Σ_u y_u/n, and L is a k × n matrix of constants. We shall assume that L1_n = 0 (where 1_n^T = (1, 1, ..., 1)), so that a change in the origin of measurement of y affects only β̂_0, and not β̂_1, ..., β̂_k. Examples of estimators of this class are: full least-squares; subset-least-squares; and Bayes estimates under multinormal specifications with a multinormal prior, a special case of which is the class of "Ridge" estimates advocated by Hoerl and Kennard (1970a, b) (see also Theil (1963), section 2.3):

    β̂_K = (X^T X + fI)⁻¹ X^T y,                         (11)

where f is a (small) scalar parameter (Hoerl and Kennard used k), and in this section we are writing X for the n × k matrix of independent variables, which we are now assuming have been standardized to have zero means and unit variances. Thus 1_n^T X = 0, diag(X^T X) = I.
As a measure of adequacy for prediction we again use the scaled summed mean square error, which in the present notation is

    J_L = (1/σ²) Σ_u (β̂_0 + x_u β̂_K − η(x_u))²,

and which has expectation

    E(J_L) = V_L + B_L,

where

    V_L = 1 + tr(X^T X L L^T),

    B_L = (1/σ²) β_K^T (LX − I)^T X^T X (LX − I) β_K.

The sum of squares about the fitted regression is

    RSS_L = ‖y − ȳ1_n − X β̂_K‖²,

which has expectation

    E(RSS_L) = σ² V′_L + σ² B_L,

where

    V′_L = n − 1 − 2 tr(XL) + tr(X^T X L L^T).

Thus we have an estimator of E(J_L), namely

    C_L = RSS_L/σ̂² − n + 2 + 2 tr(XL).                         (12)
By analogy with the C_P development, we propose that values of C_L (for various choices of L) should be plotted against values of V_L. Notice that when L is a matrix corresponding to subset least squares, C_L, V_L reduce to C_P, p respectively.

For computing values of C_L, V_L for Ridge estimates (11), the following steps can be taken. First, find H (orthogonal) and Λ = diag(λ_1, λ_2, ..., λ_k) so that X^T X = H^T Λ H. Compute z = H X^T y. Then

    tr(XL) = Σ_{i=1}^{k} λ_i/(λ_i + f),    tr(X^T X L L^T) = Σ_{i=1}^{k} λ_i²/(λ_i + f)².                         (13)
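A computational sketch of these steps (ours, assuming numpy; the eigendecomposition convention below is numpy's, with eigenvectors as columns):

```python
# C_L and V_L for Ridge estimates over a grid of f, via the eigendecomposition
# of X^T X (a sketch, assuming numpy; X has standardized columns).
import numpy as np

def ridge_cl_curve(X, y, sigma2_hat, f_grid):
    """Return (V_L, C_L) pairs for each f, using eqs. (12)-(13)."""
    n = len(y)
    yc = y - y.mean()                       # intercept is estimated by y-bar
    lam, H = np.linalg.eigh(X.T @ X)        # X^T X = H diag(lam) H^T
    z = H.T @ (X.T @ yc)                    # rotated coordinates
    out = []
    for f in f_grid:
        beta = H @ (z / (lam + f))          # ridge estimate, eq. (11)
        rss = np.sum((yc - X @ beta) ** 2)
        tr_XL = np.sum(lam / (lam + f))
        V_L = 1 + np.sum(lam ** 2 / (lam + f) ** 2)
        C_L = rss / sigma2_hat - n + 2 + 2 * tr_XL   # eq. (12)
        out.append((V_L, C_L))
    return out
```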
Figure 5 gives the resulting plot for the set of 10-variable data analyzed by Gorman and Toman (1966) and by Hoerl and Kennard (1970b). Shown are (p, C_P) points corresponding to various subset-least-squares estimates and a continuous arc of (V_L, C_L) points corresponding to Ridge estimates with values of f varying from zero at (11, 11) and increasing to the left. For this example, Hoerl and Kennard (1970b) suggested that a value of f in the interval (0.2, 0.3) would "undoubtedly" give estimated coefficients "closer to β and more stable for prediction than the least-squares coefficients or some subset of them". On the other hand, from Figure 5 one would be inclined to suggest a value of f nearer to .02 than to 0.2.
One obvious suggestion is to choose f to minimize C_L. Some insight into the effect of this choice can be gained as follows. First consider the case of orthogonal regressors, and now assume X^T X = I. Notice that in this case our risk function E(J) is equivalent to the quantity Σ_{i=1}^{k} E(β̂_i − β_i)² used by Hoerl and Kennard (1970a). We may take H = I, so that z_i = β̂_i, the least-squares estimate of β_i.
FIGURE 5 — C_L plot for Gorman-Toman data; subset (C_P) values and Ridge (C_L) values.
From (12) and (13) we see that C_L is a minimum when f satisfies

    f = kσ̂² / (Σ_j β̂_j² − kσ̂²);

the adjusted estimates are then given by

    β̂_j* = (1 − kσ̂²/Σ_i β̂_i²) β̂_j.                         (14)

It is interesting that this set of estimates is of the form suggested by Stein (1960) for the problem of estimating regression coefficients in a multivariate Normal distribution. James and Stein (1961) showed that for k ≥ 3 the vector of estimates β̂** obtained by replacing the multiplier k in (14) by any number between 0 and 2k − 4 has the property that E(J**) is less than the full-least-squares value k + 1 (see (10)), for all values of the true regression coefficients. Thus our "minimize C_L" rule dominates full least-squares for k ≥ 4. This result stands in interesting contrast to the disappointing result found above for the "minimize C_P" rule.
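The "minimize C_L" estimate in this orthogonal case is a two-line computation; a minimal sketch (ours, assuming numpy):

```python
# The "minimize C_L" shrinkage in the orthogonal case X^T X = I: the
# minimizing f and the Stein-type estimate (14) (a sketch, assuming numpy).
import numpy as np

def min_cl_shrinkage(beta_hat, sigma2_hat):
    """Shrink least-squares estimates toward zero as in eq. (14)."""
    k = len(beta_hat)
    s = np.sum(beta_hat ** 2)
    f = k * sigma2_hat / (s - k * sigma2_hat)  # minimizing f (needs s > k*sigma2)
    shrink = 1 - k * sigma2_hat / s            # equals 1/(1 + f)
    return shrink * beta_hat, f
```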
Now, consider the case of equi-correlated regressors, with X^T X = I + ρ(11^T − I). In this case the least-squares estimate of β̄ = Σ_j β_j/k has variance 1/[k(1 − ρ + kρ)], and the vector of deviations (β̂_j − β̂̄) has covariance matrix (I − k⁻¹11^T)/(1 − ρ). Thus when ρ is large, these deviations become very unstable. It is found that for ρ near unity, C_L is minimized when f is near (1 − ρ)g, where

    g = (k − 1)σ̂² / [(1 − ρ) Σ_j (β̂_j − β̂̄)² − (k − 1)σ̂²].

The adjusted estimates are given approximately by

    β̂_j* ≈ β̂̄ + [1 − (k − 1)σ̂² / ((1 − ρ) Σ_i (β̂_i − β̂̄)²)] (β̂_j − β̂̄).

Thus here the "minimize C_L" rule leads to shrinking the least-squares estimates towards their average. While the details have not been fully worked out, one expects that this rule will dominate full least-squares for k ≥ 5.

To handle the case of non-orthogonal regressors, Sclove (1968) has suggested transforming to orthogonality before applying a shrinkage factor. A composite procedure with much intuitive appeal for this writer would be to use the C_p-plot or some similar device to identify the terms that should certainly be included (since they appear in all subsets that give reasonably good fits to the data), to fit these by least squares, and to adjust the remaining estimates by orthogonalizing and shrinking towards zero as in (14).
It is a great personal pleasure to recall that the idea for the C_p-plot arose in the course of some discussions with Cuthbert Daniel around Christmas 1963. The use of the letter C is intended to do him honor. The device was described publicly in 1964 [16] and again in 1966 [17] (with the extensions described at the end of section 1 above) and has appeared in several unpublished manuscripts. Impetus for preparing the present exposition was gained in the course of delivering a series of lectures at the University of California at Berkeley in February 1972; their support is gratefully acknowledged.
Proof of the Lemma

The key to these results is the identity, true for any subset P that includes 0, i.e. P = {0, P−},

    RSS_P − RSS_{K+} = (β̃_P − β̂_K)^T D_K (β̃_P − β̂_K),

where β̂^T = (β̂_0, β̂_K^T) is the vector of least-squares estimates of all the coefficients in the model, and (β̂_0^{(P)}, β̃_P) is the vector of subset-least-squares estimates. From the form of S_α (7) it now follows that (iii) β̃_P is in S_α if and only if (v) RSS_P − RSS_{K+} ≤ kσ̂²F_α, which is directly equivalent (if σ̂² = RSS_{K+}/(n − k − 1)) to (iv) C_P ≤ 2p − k − 1 + kF_α. Clearly (iii) implies (i); to prove the converse we remark that for any vector β^T = (β_0, β_K^T) with β_K in the hyperplane H_P = {β_K : β_Q = 0}, we have

    (β_K − β̂_K)^T D_K (β_K − β̂_K) = (β̃_P − β̂_K)^T D_K (β̃_P − β̂_K) + (β_K − β̃_P)^T D_K (β_K − β̃_P),

the cross-product term vanishing by definition of β̃_P. Thus if any point of H_P is in S_α, β̃_P must be. Finally, (i) is directly equivalent to (ii) by a simple geometrical argument.
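The key identity is easy to confirm numerically; a quick check on simulated data (our sketch, assuming numpy; data and subset are arbitrary):

```python
# Numerical check of RSS_P - RSS_{K+} = (btilde_P - bhat_K)^T D_K (btilde_P - bhat_K)
# (a sketch, assuming numpy).
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 5
X = rng.normal(size=(n, k))                     # the k non-constant columns
X1 = np.column_stack([np.ones(n), X])           # full design with x_0 = 1
y = X1 @ rng.normal(size=k + 1) + rng.normal(size=n)

def fit(cols):
    b = np.linalg.lstsq(X1[:, cols], y, rcond=None)[0]
    return b, np.sum((y - X1[:, cols] @ b) ** 2)

m = X.sum(axis=0)
D_K = X.T @ X - np.outer(m, m) / n              # D_K = D - (1/n) m m^T

b_full, rss_full = fit(range(k + 1))
P = [0, 1, 3]                                   # a subset containing 0
b_sub, rss_p = fit(P)
btilde = np.zeros(k)                            # K-part, zeroes in Q positions
btilde[[0, 2]] = b_sub[1:]                      # coefficients for X columns 0, 2
d = btilde - b_full[1:]
print(rss_p - rss_full, d @ D_K @ d)            # agree, per the identity
```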
REFERENCES

[1] Allen, D. M. (1971). Mean square error of prediction as a criterion for selecting variables. Technometrics 13, 469-475.
[2] Beale, E. M. L., Kendall, M. G. and Mann, D. W. (1967). The discarding of variables in multivariate analysis. Biometrika 54, 357-366.
[3] Daniel, C. and Wood, F. S. (1971). Fitting Equations to Data. New York: Wiley-Interscience.
[4] Furnival, G. M. (1971). All possible regressions with less computation. Technometrics 13, 403-408.
[5] Garside, M. J. (1965). The best subset in multiple regression analysis. Applied Statistics 14, 196-200.
[6] Godfrey (1972). Relations between Cp, RSS, and mean square residual. Submitted to Technometrics.
[7] Gorman, J. W. and Toman, R. J. (1966). Selection of variables for fitting equations to data. Technometrics 8, 27-51.
[8] Hocking, R. R. and Leslie, R. N. (1967). Selection of the best subset in regression analysis. Technometrics 9, 531-540.
[9] Hoerl, A. E. and Kennard, R. W. (1970a). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.
[10] Hoerl, A. E. and Kennard, R. W. (1970b). Ridge regression: applications to nonorthogonal problems. Technometrics 12, 69-82.
[11] James, W. and Stein, C. (1961). Estimation with quadratic loss. Proc. Fourth Berkeley Symp. 1, 361-379. Univ. of California Press.
[12] Kennard, R. W. (1971). A note on the Cp statistic. Technometrics 13, 899-900.
[13] Kennedy, W. J. and Bancroft, T. A. (1971). Model-building for prediction in regression based on repeated significance tests. Ann. Math. Statist. 42, 1273-1284.
[14] La Motte, L. R. and Hocking, R. R. (1970). Computational efficiency in the selection of regression variables. Technometrics 12, 83-93.
[15] Lindley, D. V. (1968). The choice of variables in multiple regression. J. Roy. Statist. Soc. B 30, 31-53 (Discussion, 54-66).
[16] Mallows, C. L. (1964). Choosing variables in a linear regression: a graphical aid. Presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas, May 7-9.
[17] Mallows, C. L. (1966). Choosing a subset regression. Presented at the Annual Meeting of the American Statistical Association, Los Angeles, August 15-19, 1966.
[18] Mantel, N. (1970). Why stepdown procedures in variable selection. Technometrics 12, 621-625.
[19] Schatzoff, M., Tsao, R. and Fienberg, S. (1968). Efficient calculation of all possible regressions. Technometrics 10, 769-779.
[20] Sclove, S. L. (1968). Improved estimators for coefficients in linear regression. J. Amer. Statist. Assoc. 63, 596-606.
[21] Spjøtvoll, E. (1972). Multiple comparison of regression functions. Ann. Math. Statist. 43, 1076-1088.
[22] Srikantan, K. S. (1970). Canonical association between nominal measurements. J. Amer. Statist. Assoc. 65, 284-292.
[23] Stein, C. (1960). Multiple regression. In Contributions to Probability and Statistics, ed. I. Olkin. Stanford University Press, 424-443.
[24] Theil, H. (1963). On the use of incomplete prior information in regression analysis. J. Amer. Statist. Assoc. 58, 401-414.
[25] Watts, H. W. (1965). The Test-o-Gram; a pedagogical and presentational device. Amer. Stat. 19, No. 4, 22-28.