17 Shrinkage

17.1 Mallows Averaging and Shrinkage

Suppose there are two models or estimators of $g = E(y \mid X)$:

(1) $\hat{g}_0 = 0$

(2) $\hat{g}_1 = X\hat{\beta}$.

Given weights $(1-w)$ and $w$, an averaging estimator is $\hat{g} = w X \hat{\beta}$.

The Mallows criterion is
$$
C(w) = \mathbf{w}'\hat{E}'\hat{E}\,\mathbf{w} + 2\hat\sigma^2\,\mathbf{w}'K
$$
where $\mathbf{w} = (1-w,\; w)'$, $\hat{E} = (\hat{e}_0,\; \hat{e}_1) = (y,\; \hat{e})$ collects the residuals of the two models with $\hat{e} = y - X\hat\beta$, and $K = (0,\; k)'$. Thus
$$
C(w) = \begin{pmatrix}1-w & w\end{pmatrix}\begin{pmatrix} y'y & y'\hat{e} \\ \hat{e}'y & \hat{e}'\hat{e}\end{pmatrix}\begin{pmatrix}1-w \\ w\end{pmatrix} + 2\hat\sigma^2 w k
= (1-w)^2\, y'y + 2w(1-w)\, \hat{e}'\hat{e} + w^2\, \hat{e}'\hat{e} + 2\hat\sigma^2 w k,
$$
using $y'\hat{e} = \hat{e}'\hat{e}$.

The FOC for minimization is
$$
\frac{d}{dw} C(w) = -2(1-w)\, y'y + 2(1-w)\, \hat{e}'\hat{e} + 2\hat\sigma^2 k = 0
$$
with solution
$$
\hat{w} = 1 - \frac{k}{F}
$$
where
$$
F = \frac{y'y - \hat{e}'\hat{e}}{\hat\sigma^2}
$$
is the Wald statistic for $\beta = 0$.

Imposing the constraint $\hat{w} \in [0,1]$ we obtain
$$
\hat{w} = \begin{cases} 1 - \dfrac{k}{F} & F \ge k \\ 0 & F < k. \end{cases}
$$

The Mallows averaging estimator thus equals
$$
\hat{\beta}^* = \left(1 - \frac{k}{F}\right)_+ \hat{\beta}
$$
where $(a)_+ = a$ if $a \ge 0$, $0$ else. This is a Stein-type shrinkage estimator.

17.2 Loss and Risk

A great reference is Theory of Point Estimation, 2nd Edition, by Lehmann and Casella.

Let $\hat\theta$ be an estimator of $\theta$, a $k \times 1$ vector. Suppose $\hat\theta$ is an (asymptotic) sufficient statistic for $\theta$, so that any other estimator can be written as a function of $\hat\theta$. We call $\hat\theta$ the "usual" estimator.

Suppose that $\sqrt{n}(\hat\theta - \theta) \to_d N(0, V)$. Thus, approximately,
$$
\hat\theta \sim_a N(\theta, V_n)
$$
where $V_n = n^{-1}V$.

Most of Stein-type theory is developed for the exact distribution case. It carries over to the asymptotic setting as an approximation. From now on we will assume that $\hat\theta$ has an exact normal distribution and that $V_n = V$ is known. (Equivalently, we can rewrite the statistical problem as local to $\theta$ using the "Limits of Experiments" theory.)

Is $\hat\theta$ the best estimator of $\theta$, in the sense of minimizing the risk (expected loss)? The risk of an estimator $\tilde\theta$ under weighted squared error loss is
$$
R(\theta, \tilde\theta, W) = E\left[(\tilde\theta - \theta)' W (\tilde\theta - \theta)\right] = \mathrm{tr}\left(W\, E\left[(\tilde\theta - \theta)(\tilde\theta - \theta)'\right]\right).
$$

A convenient choice for the weight matrix is $W = V^{-1}$. Then
$$
R(\theta, \hat\theta, V^{-1}) = \mathrm{tr}\left(V^{-1} E\left[(\hat\theta - \theta)(\hat\theta - \theta)'\right]\right) = \mathrm{tr}\left(V^{-1} V\right) = k.
$$
If $W \ne V^{-1}$ then
$$
R(\theta, \hat\theta, W) = \mathrm{tr}\left(W\, E\left[(\hat\theta - \theta)(\hat\theta - \theta)'\right]\right) = \mathrm{tr}(WV),
$$
which depends on $WV$. Again, we want to know if the risk of another feasible estimator is smaller than $\mathrm{tr}(WV)$.

Take the simple (or silly) estimator $\tilde\theta = 0$. This has risk
$$
R(\theta, 0, W) = \theta' W \theta.
$$
Thus $\tilde\theta = 0$ has smaller risk than $\hat\theta$ when $\theta' W \theta < \mathrm{tr}(WV)$, and larger risk when $\theta' W \theta > \mathrm{tr}(WV)$. Neither $\hat\theta$ nor $\tilde\theta = 0$ is "better" in the sense of having (uniformly) smaller risk! It is not enough to ask that one estimator has smaller risk than another, as in general the risk is a function of unknowns.

As another example, take the simple averaging (or shrinkage) estimator $\tilde\theta = w\hat\theta$ where $w$ is a fixed constant. Since $\tilde\theta - \theta = w(\hat\theta - \theta) - (1-w)\theta$ we can calculate that
$$
R(\theta, \tilde\theta, W) = w^2\, R(\theta, \hat\theta, W) + (1-w)^2\, \theta' W \theta = w^2\, \mathrm{tr}(WV) + (1-w)^2\, \theta' W \theta.
$$
This is minimized by setting
$$
w = \frac{\theta' W \theta}{\mathrm{tr}(WV) + \theta' W \theta},
$$
which is strictly in $(0,1)$. [This is illustrative, and does not suggest an empirical rule for selecting $w$.]
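The risk formula for the fixed-weight estimator is easy to check by simulation. Below is a minimal sketch (not part of the original notes): it assumes an arbitrary $k = 5$, $W = V = I_k$, and an arbitrary $\theta$, and compares Monte Carlo risk with the formula $w^2\,\mathrm{tr}(WV) + (1-w)^2\,\theta' W \theta$; all variable names are illustrative.

```python
import numpy as np

# Monte Carlo check of R(theta, w*theta_hat, W) = w^2 tr(WV) + (1-w)^2 theta'W theta.
# Illustrative settings: k = 5, V = W = I_k, theta chosen arbitrarily.
rng = np.random.default_rng(0)
k = 5
V = np.eye(k)                      # known covariance of theta_hat
W = np.eye(k)                      # weight matrix (here W = V^{-1} = I_k)
theta = np.array([0.5, -0.3, 0.8, 0.0, 1.2])

def simulated_risk(w, reps=200_000):
    """Average weighted squared error loss of the estimator w * theta_hat."""
    theta_hat = rng.multivariate_normal(theta, V, size=reps)
    err = w * theta_hat - theta                      # estimation error of w * theta_hat
    return np.mean(np.einsum('ij,jk,ik->i', err, W, err))

def theoretical_risk(w):
    return w**2 * np.trace(W @ V) + (1 - w)**2 * theta @ W @ theta

# Risk-minimizing fixed weight from the formula above
w_star = (theta @ W @ theta) / (np.trace(W @ V) + theta @ W @ theta)
for w in (1.0, 0.0, w_star):
    print(f"w={w:.3f}  simulated={simulated_risk(w):.3f}  formula={theoretical_risk(w):.3f}")
```

With $w = 1$ the simulated risk should be close to $\mathrm{tr}(WV) = k$, with $w = 0$ close to $\theta' W \theta$, and the minimizing $w$ should give a value below both, matching the discussion above.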
17.3 Admissible and Minimax Estimators

For reference. To compare the risk functions of two estimators, we have the following concepts.

Definition 1 $\hat\theta$ weakly dominates $\tilde\theta$ if $R(\theta, \hat\theta) \le R(\theta, \tilde\theta)$ for all $\theta$.

Definition 2 $\hat\theta$ dominates $\tilde\theta$ if $R(\theta, \hat\theta) \le R(\theta, \tilde\theta)$ for all $\theta$, and $R(\theta, \hat\theta) < R(\theta, \tilde\theta)$ for at least one $\theta$.

Clearly, we should prefer an estimator if it dominates the other.

Definition 3 An estimator is admissible if it is not dominated by another estimator. An estimator is inadmissible if it is dominated by another estimator.

Admissibility is a desirable property for an estimator.

If the risk functions of two estimators cross, then neither dominates the other. How do we compare these two estimators? One approach is to calculate the worst-case scenario. Specifically, we define the maximum risk of an estimator $\tilde\theta$ as
$$
\bar{R}(\tilde\theta) = \sup_{\theta} R(\theta, \tilde\theta).
$$
We can think of it this way: suppose we use $\tilde\theta$ to estimate $\theta$. What is the worst case, how badly can this estimator do?

For example, for the usual estimator, $R(\theta, \hat\theta, W) = \mathrm{tr}(WV)$ for all $\theta$, so
$$
\bar{R}(\hat\theta) = \mathrm{tr}(WV),
$$
while for the silly estimator $\tilde\theta = 0$,
$$
\bar{R}(0) = \infty.
$$
The latter is an example of an estimator with unbounded risk. To guard against extreme worst cases, it seems sensible to avoid estimators with unbounded risk.

The minimum value of the maximum risk $\bar{R}(\tilde\theta)$ across all estimators $\tilde\theta = t(\hat\theta)$ is
$$
\inf_{\tilde\theta} \bar{R}(\tilde\theta) = \inf_{\tilde\theta} \sup_{\theta} R(\theta, \tilde\theta).
$$

Definition 4 An estimator $\tilde\theta$ of $\theta$ which minimizes the maximum risk, i.e. which satisfies
$$
\sup_{\theta} R(\theta, \tilde\theta) = \inf_{\bar\theta} \sup_{\theta} R(\theta, \bar\theta),
$$
is called a minimax estimator.

It is desirable for an estimator to be minimax, again as a protection against the worst-case scenario. There is no general rule for determining the minimax bound. However, in the case $\hat\theta \sim N(\theta, I_k)$, it is known that $\hat\theta$ is minimax for $\theta$.

17.4 Shrinkage Estimators

Suppose
$$
\hat\theta \sim N(\theta, V).
$$
A general form for a shrinkage estimator of $\theta$ is
$$
\hat\theta^* = \left(1 - h(\hat\theta' W \hat\theta)\right) \hat\theta
$$
where $h : [0, \infty) \to [0, \infty)$. Sometimes this is written as
$$
\hat\theta^* = \left(1 - \frac{c(\hat\theta' W \hat\theta)}{\hat\theta' W \hat\theta}\right) \hat\theta
$$
where $c(q) = q\, h(q)$. This notation includes the James-Stein estimator, pretest estimators, selection estimators, and the model averaging estimator of Section 17.1.

Pretest and selection estimators take the form
$$
h(q) = 1(q < a)
$$
where $a = 2k$ for Mallows selection, and $a$ is the critical value from a chi-square distribution for a pretest estimator.

We now calculate the risk of $\hat\theta^*$. Note
$$
\hat\theta^* - \theta = (\hat\theta - \theta) - h(\hat\theta' W \hat\theta)\, \hat\theta.
$$
Thus
$$
(\hat\theta^* - \theta)' W (\hat\theta^* - \theta) = (\hat\theta - \theta)' W (\hat\theta - \theta) + h(\hat\theta' W \hat\theta)^2\, \hat\theta' W \hat\theta - 2\, h(\hat\theta' W \hat\theta)\, \hat\theta' W (\hat\theta - \theta).
$$
Taking expectations:
$$
R(\theta, \hat\theta^*, W) = \mathrm{tr}(WV) + E\left[h(\hat\theta' W \hat\theta)^2\, \hat\theta' W \hat\theta\right] - 2\, E\left[h(\hat\theta' W \hat\theta)\, \hat\theta' W (\hat\theta - \theta)\right].
$$
To simplify the second expectation when $h$ is continuous we use:

Lemma 1 (Stein's Lemma) If $\eta(\cdot) : R^k \to R^k$ is absolutely continuous and $\hat\theta \sim N(\theta, V)$, then
$$
E\left[\eta(\hat\theta)'(\hat\theta - \theta)\right] = E\left[\mathrm{tr}\left(\frac{\partial}{\partial\theta}\eta(\hat\theta)'\, V\right)\right].
$$

Proof: Let
$$
\phi(x) = (2\pi)^{-k/2} (\det V)^{-1/2} \exp\left(-\tfrac{1}{2}\, x' V^{-1} x\right)
$$
denote the $N(0, V)$ density. Then
$$
\frac{\partial}{\partial x}\phi(x) = -V^{-1} x\, \phi(x)
$$
and
$$
\frac{\partial}{\partial x}\phi(x - \theta) = -V^{-1}(x - \theta)\, \phi(x - \theta).
$$
By multivariate integration by parts,
$$
E\left[\eta(\hat\theta)'(\hat\theta - \theta)\right] = \int \eta(x)'\, V V^{-1} (x - \theta)\, \phi(x - \theta)\, dx = \int \mathrm{tr}\left(\frac{\partial}{\partial x}\eta(x)'\, V\right) \phi(x - \theta)\, dx = E\left[\mathrm{tr}\left(\frac{\partial}{\partial\theta}\eta(\hat\theta)'\, V\right)\right],
$$
as stated.

Let $\eta(\theta)' = h(\theta' W \theta)\, \theta' W$, for which
$$
\frac{\partial}{\partial\theta}\eta(\theta)' = h(\theta' W \theta)\, W + 2\, W \theta \theta' W\, h'(\theta' W \theta)
$$
and
$$
\mathrm{tr}\left(\frac{\partial}{\partial\theta}\eta(\theta)'\, V\right) = \mathrm{tr}(WV)\, h(\theta' W \theta) + 2\, \theta' W V W \theta\, h'(\theta' W \theta).
$$
Then by Stein's Lemma
$$
E\left[h(\hat\theta' W \hat\theta)\, \hat\theta' W (\hat\theta - \theta)\right] = \mathrm{tr}(WV)\, E\left[h(\hat\theta' W \hat\theta)\right] + 2\, E\left[(\hat\theta' W V W \hat\theta)\, h'(\hat\theta' W \hat\theta)\right].
$$
Applying this to the risk calculation, we obtain

Theorem.
$$
R(\theta, \hat\theta^*, W) = \mathrm{tr}(WV) + E\left[h(\hat\theta' W \hat\theta)^2\, \hat\theta' W \hat\theta\right] - 2\, \mathrm{tr}(WV)\, E\left[h(\hat\theta' W \hat\theta)\right] - 4\, E\left[(\hat\theta' W V W \hat\theta)\, h'(\hat\theta' W \hat\theta)\right]
$$
$$
= \mathrm{tr}(WV) + E\left[\frac{c(\hat\theta' W \hat\theta)}{\hat\theta' W \hat\theta}\left(c(\hat\theta' W \hat\theta) - 2\, \mathrm{tr}(WV) + 4\, \frac{\hat\theta' W V W \hat\theta}{\hat\theta' W \hat\theta}\right) - 4\, \frac{\hat\theta' W V W \hat\theta}{\hat\theta' W \hat\theta}\, c'(\hat\theta' W \hat\theta)\right]
$$
where the final equality uses the alternative expression $h(q) = c(q)/q$.

We are trying to find cases where $R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W)$. This requires the term in the expectation to be negative. We now explore some special cases.

17.5 Default Weight Matrix

Set
$$
W = V^{-1}
$$
and write $R(\theta, \hat\theta^*) = R(\theta, \hat\theta^*, V^{-1})$. Then
$$
R(\theta, \hat\theta^*) = k + E\left[\frac{c(\hat\theta' V^{-1} \hat\theta)}{\hat\theta' V^{-1} \hat\theta}\left(c(\hat\theta' V^{-1} \hat\theta) - 2k + 4\right) - 4\, c'(\hat\theta' V^{-1} \hat\theta)\right].
$$

Theorem 1 For any absolutely continuous and non-decreasing function $c(q)$ such that
$$
0 < c(q) < 2(k - 2) \qquad (1)
$$
then $R(\theta, \hat\theta^*) < R(\theta, \hat\theta)$; the risk of $\hat\theta^*$ is strictly less than the risk of $\hat\theta$. This inequality holds for all values of the parameter $\theta$.

Note: Condition (1) can only hold if $k > 2$. (Since $k$, the dimension of $\theta$, is an integer, this means $k \ge 3$.)

Proof. Let
$$
g(q) = \frac{c(q)\left(c(q) - 2(k - 2)\right)}{q} - 4\, c'(q).
$$
For all $q > 0$, $g(q) < 0$ by the assumptions. Thus $E[g(q)] < 0$ for any non-negative random variable $q$. Setting $q_k = \hat\theta' V^{-1} \hat\theta$,
$$
R(\theta, \hat\theta^*) = k + E[g(q_k)] < k = R(\theta, \hat\theta),
$$
which proves the result.

It is also useful to note that
$$
q_k = \hat\theta' V^{-1} \hat\theta \sim \chi^2_k(\lambda),
$$
a non-central chi-square random variable with $k$ degrees of freedom and non-centrality parameter $\lambda = \theta' V^{-1} \theta$.
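Theorem 1 can also be illustrated numerically. The sketch below (an illustration, not from the original notes) uses one arbitrary choice $c(q) = (k-2)(1 - e^{-q})$, which is absolutely continuous, non-decreasing, and satisfies $0 < c(q) < 2(k-2)$ for $q > 0$; the settings $k = 6$, $V = I_k$, and the grid of $\theta$ values are likewise assumptions.

```python
import numpy as np

# Monte Carlo illustration of Theorem 1 with the default weight W = V^{-1}.
# c(q) = (k-2)(1 - exp(-q)) is one arbitrary non-decreasing choice with 0 < c(q) < 2(k-2).
rng = np.random.default_rng(1)
k = 6
V = np.eye(k)
V_inv = np.linalg.inv(V)

def c(q):
    return (k - 2) * (1.0 - np.exp(-q))

def shrinkage_risk(theta, reps=200_000):
    """Simulated risk R(theta, theta*, V^{-1}) of theta* = (1 - c(q)/q) * theta_hat."""
    theta_hat = rng.multivariate_normal(theta, V, size=reps)
    q = np.einsum('ij,jk,ik->i', theta_hat, V_inv, theta_hat)   # q = theta_hat' V^{-1} theta_hat
    theta_star = (1.0 - c(q) / q)[:, None] * theta_hat
    err = theta_star - theta
    return np.mean(np.einsum('ij,jk,ik->i', err, V_inv, err))

# The usual estimator has risk k for every theta; the shrinkage risk stays below k.
for scale in (0.0, 0.5, 2.0, 5.0):
    theta = scale * np.ones(k)
    print(f"theta scale={scale:4.1f}   shrinkage risk={shrinkage_risk(theta):.3f}   usual risk={k}")
```

At every $\theta$ the simulated risk of the shrinkage estimator should come out below the usual estimator's risk $k$, with the largest gains near $\theta = 0$.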
17.6 James-Stein Estimator

Set $c(q) = c$, a constant. This is the James-Stein estimator
$$
\hat\theta^* = \left(1 - \frac{c}{\hat\theta' V^{-1} \hat\theta}\right) \hat\theta. \qquad (2)
$$

Theorem 2 If $\hat\theta \sim N(\theta, V)$, $k > 2$, and $0 < c < 2(k-2)$, then for (2), $R(\theta, \hat\theta^*) < R(\theta, \hat\theta)$; the risk of the James-Stein estimator is strictly less than that of the usual estimator. This inequality holds for all values of the parameter $\theta$.

Since the risk is quadratic in $c$, we can also see that the risk is minimized by setting $c = k - 2$. This yields the classic form of the James-Stein estimator
$$
\hat\theta^* = \left(1 - \frac{k - 2}{\hat\theta' V^{-1} \hat\theta}\right) \hat\theta.
$$

17.7 Positive-Part James-Stein

If $\hat\theta' V^{-1} \hat\theta < c$ then
$$
1 - \frac{c}{\hat\theta' V^{-1} \hat\theta} < 0
$$
and the James-Stein estimator over-shrinks and flips the sign of $\hat\theta^*$ relative to $\hat\theta$. This is corrected by using the positive-part version
$$
\hat\theta^+ = \left(1 - \frac{c}{\hat\theta' V^{-1} \hat\theta}\right)_+ \hat\theta
= \begin{cases} 0 & \hat\theta' V^{-1} \hat\theta \le c \\ \left(1 - \dfrac{c}{\hat\theta' V^{-1} \hat\theta}\right)\hat\theta & \text{else.} \end{cases}
$$
This bears some resemblance to selection estimators. The positive-part estimator takes the shrinkage form with
$$
c(q) = \begin{cases} q & q < c \\ c & q \ge c \end{cases}
$$
or
$$
h(q) = \begin{cases} 1 & q < c \\ \dfrac{c}{q} & q \ge c. \end{cases}
$$
In general the positive-part version of
$$
\hat\theta^* = \left(1 - h(\hat\theta' W \hat\theta)\right)\hat\theta
$$
is
$$
\hat\theta^+ = \left(1 - h(\hat\theta' W \hat\theta)\right)_+ \hat\theta.
$$

Theorem. For any shrinkage estimator, $R(\theta, \hat\theta^+) \le R(\theta, \hat\theta^*)$: the positive-part version has risk no larger than the unadjusted version. The proof is a bit technical, so we will skip it.

17.8 General Weight Matrix

Recall that for general $c(q)$ and weight $W$ we had
$$
R(\theta, \hat\theta^*, W) = \mathrm{tr}(WV) + E\left[\frac{c(\hat\theta' W \hat\theta)}{\hat\theta' W \hat\theta}\left(c(\hat\theta' W \hat\theta) - 2\,\mathrm{tr}(WV) + 4\,\frac{\hat\theta' W V W \hat\theta}{\hat\theta' W \hat\theta}\right) - 4\,\frac{\hat\theta' W V W \hat\theta}{\hat\theta' W \hat\theta}\, c'(\hat\theta' W \hat\theta)\right].
$$
Using a result about eigenvalues and setting $h = W^{1/2}\hat\theta$,
$$
\frac{\hat\theta' W V W \hat\theta}{\hat\theta' W \hat\theta} = \frac{h' W^{1/2} V W^{1/2} h}{h' h} \le \lambda_{\max}\left(W^{1/2} V W^{1/2}\right) = \lambda_{\max}(WV).
$$
Thus if $c'(q) \ge 0$,
$$
R(\theta, \hat\theta^*, W) \le \mathrm{tr}(WV) + E\left[\frac{c(\hat\theta' W \hat\theta)}{\hat\theta' W \hat\theta}\left(c(\hat\theta' W \hat\theta) - 2\,\mathrm{tr}(WV) + 4\,\lambda_{\max}(WV)\right)\right] < \mathrm{tr}(WV),
$$
the final inequality holding if
$$
0 < c(q) < 2\left(\mathrm{tr}(WV) - 2\,\lambda_{\max}(WV)\right). \qquad (3)
$$
When $W = V^{-1}$, the upper bound is $2(k - 2)$, so this is the same as for the default weight matrix.

Theorem 3 For any absolutely continuous and non-decreasing function $c(q)$ such that (3) holds, $R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W)$; the risk of $\hat\theta^*$ is strictly less than the risk of $\hat\theta$.

17.9 Shrinkage Towards Restrictions

The classic James-Stein estimator shrinks towards the zero vector. More generally, shrinkage can be towards restricted estimators, or towards linear or non-linear subspaces. These estimators take the form
$$
\hat\theta^* = \hat\theta - h\left((\hat\theta - \tilde\theta)' W (\hat\theta - \tilde\theta)\right)\left(\hat\theta - \tilde\theta\right)
$$
where $\hat\theta$ is the unrestricted estimator (e.g. the long regression) and $\tilde\theta$ is the restricted estimator (e.g. the short regression). The classic form is
$$
\hat\theta^* = \hat\theta - \left(\frac{(\hat\theta - \tilde\theta)'\, \hat{V}^{-1}\, (\hat\theta - \tilde\theta)}{r - 2}\right)_1^{-1}\left(\hat\theta - \tilde\theta\right)
$$
where $(a)_1 = \max(a, 1)$, $\hat{V}$ is the covariance matrix for $\hat\theta$, and $r$ is the number of restrictions (implied by the restriction from $\hat\theta$ to $\tilde\theta$).

This estimator shrinks $\hat\theta$ towards $\tilde\theta$, with the degree of shrinkage depending on the magnitude of $\hat\theta - \tilde\theta$. This approach works for nested models, so that $(\hat\theta - \tilde\theta)'\, \hat{V}^{-1}\, (\hat\theta - \tilde\theta)$ is approximately (non-central) chi-square. It is unclear how to extend the idea to non-nested models, where $(\hat\theta - \tilde\theta)'\, \hat{V}^{-1}\, (\hat\theta - \tilde\theta)$ is not chi-square.

17.10 Inference

We have discussed shrinkage estimation. Model averaging, selection, and shrinkage estimators have non-standard, non-normal distributions. Standard errors, testing, and confidence intervals need development.
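To close, here is a small numerical sketch of the James-Stein and positive-part James-Stein estimators of Sections 17.6-17.7 (shrinking towards zero with the default weight $W = V^{-1}$). It is an illustration only; the dimension $k = 8$, $V = I_k$, and the grid of $\theta$ values are arbitrary assumptions, not from the original notes.

```python
import numpy as np

# Monte Carlo risk comparison of the usual estimator, the classic James-Stein
# estimator with c = k - 2, and its positive-part version (Sections 17.6-17.7).
rng = np.random.default_rng(2)
k = 8
V = np.eye(k)
V_inv = np.linalg.inv(V)

def risks(theta, reps=200_000):
    theta_hat = rng.multivariate_normal(theta, V, size=reps)
    q = np.einsum('ij,jk,ik->i', theta_hat, V_inv, theta_hat)   # theta_hat' V^{-1} theta_hat
    w_js = 1.0 - (k - 2) / q                                    # James-Stein shrinkage weight
    w_pp = np.maximum(w_js, 0.0)                                # positive-part weight
    out = []
    for est in (theta_hat, w_js[:, None] * theta_hat, w_pp[:, None] * theta_hat):
        err = est - theta
        out.append(np.mean(np.einsum('ij,jk,ik->i', err, V_inv, err)))
    return out

for scale in (0.0, 1.0, 3.0):
    theta = scale * np.ones(k) / np.sqrt(k)
    usual, js, js_plus = risks(theta)
    print(f"||theta||={scale:3.1f}   usual={usual:.3f}   JS={js:.3f}   JS+={js_plus:.3f}")
```

The positive-part risk should be no larger than the James-Stein risk at every $\theta$, consistent with the theorem stated (without proof) in Section 17.7, and both should lie below the usual estimator's risk $k$.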