
AN EFFECTIVE SELECTION OF REGRESSION VARIABLES
WHEN THE ERROR DISTRIBUTION IS INCORRECTLY SPECIFIED*
Wolfgang Härdle
FB Mathematik
Johann-Wolfgang-Goethe-Universität
D-6000 Frankfurt/M.
WEST GERMANY
and
Department of Statistics
University of North Carolina
Chapel Hill, North Carolina 27514
Abstract
In the situation where the statistician estimates regression parameters by maximum likelihood methods but fails to choose a likelihood function matching the true error distribution, an asymptotically efficient selection of regression variables is considered. The proposed procedure is especially useful when a robust regression is applied but the data in fact do not require that treatment. Examples are given, and relationships to other selectors such as Mallows' C_p are investigated.
Keywords and Phrases: variable selection, regression analysis, robust regression, model choice

AMS Subject Classification: Primary 62J05, Secondary 62G99
*Research supported by Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 123 "Stochastische Mathematische Modelle," and AFOSR Contract No. F49620 82 C 0009.
1. Introduction and Results
Suppose that Y = (Y_1, ..., Y_n)' is a random vector of n observations with mean μ = (μ_1, ..., μ_n)', and assume that the i-th observation is associated with a covariate x_i, i = 1, ..., n. The dependence of μ_i on x_i is specified by an infinite parameter vector β; therefore at most n parameters can be estimated on the basis of the observations. Suppose that a certain likelihood function, not necessarily matching the true error distribution, has been chosen and that parameter estimates in a finite dimensional submodel p have been obtained by the maximum likelihood principle. The regression curve μ_i at x_i is estimated by μ̂_i(p), and a loss

L_n(p) = ||μ̂(p) - μ||²

is suffered. We shall consider an efficient model selection procedure that asymptotically minimizes the loss L_n(p) over a certain class of finite dimensional models of increasing dimension.
This paper completes earlier papers in various ways. Shibata (1981) and Breiman and Freedman (1983) consider the problem of selecting regression variables when the true error distribution is known to be Gaussian and derive selectors that are equivalent to ours in this case. In the setting of least squares estimation, Li (1984) recently gave conditions for asymptotic efficiency of model choice procedures based on cross-validation, FPE and other means. Schrader and Hettmansperger (1980) provided a robust analysis of variance based on Huber's M-estimates and propose a likelihood ratio type of test for testing between finite dimensional submodels. This approach was pursued by Ronchetti (1985), who derived a "robust model selection" procedure that is related to ours.
The mismatch of the chosen likelihood function and of the true error distribution can happen when a robust regression estimation (Huber, 1973) is applied but the data are in fact Gaussian. We may also think of the reverse situation, in which a Gaussian maximum likelihood estimate (i.e. the least squares estimate) is computed but the true error distribution is different, possibly long tailed. This mismatch is mathematically reflected in the way that a larger constant in the dimensionality penalty term has to be chosen. For instance, Akaike's (1970) AIC has the penalty constant 2, whereas in case of mismatch this constant is changed depending on the type of mismatch. If there are outliers in the data and we apply AIC based on a Gaussian maximum likelihood estimate, we will overfit the data since the model selection procedure tries to fit the outliers. On the other hand, if we apply the model selection procedure to be presented below, a high dimensional model is penalized more heavily since the penalty constant is bigger than 2.
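To see how the penalty reacts to a mismatch, the following small numerical sketch estimates the constant γ = E_Fψ²(e)/(E_Fψ'(e))², defined later in this section, by Monte Carlo for two illustrative scenarios. The Huber tuning constant c = 1.345 and the 10% contamination model are assumptions of this sketch only, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi_ls(u):                      # Gaussian likelihood (least squares): psi(u) = u
    return u

def dpsi_ls(u):
    return np.ones_like(u)

def psi_huber(u, c=1.345):          # Huber's psi (assumed tuning constant)
    return np.clip(u, -c, c)

def dpsi_huber(u, c=1.345):
    return (np.abs(u) <= c).astype(float)

def gamma(psi, dpsi, e):
    # Monte Carlo version of gamma = E psi^2(e) / (E psi'(e))^2
    return np.mean(psi(e) ** 2) / np.mean(dpsi(e)) ** 2

n = 200_000
gauss = rng.standard_normal(n)
# long-tailed errors: 10 percent contamination by a 5-times-wider Gaussian (illustrative)
longtail = np.where(rng.random(n) < 0.1, 5.0 * rng.standard_normal(n), rng.standard_normal(n))

print("LS psi on Gaussian errors:    ", gamma(psi_ls, dpsi_ls, gauss))        # about 1
print("Huber psi on Gaussian errors: ", gamma(psi_huber, dpsi_huber, gauss))  # slightly above 1
print("LS psi on long-tailed errors: ", gamma(psi_ls, dpsi_ls, longtail))     # much larger than 1
```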
In the simple case that the data are Gaussian and the statistician chooses a Gaussian likelihood function, our model selection procedure is equivalent to Mallows' C_p (1973) (see Section 4). This entails equivalence to many other selectors such as FPE, AIC and GCV, as was shown by Li (1984).
We will assume that the control variables x_i = (x_{i1}, x_{i2}, ...)', i = 1, ..., n, and the parameter vector β = (β_1, β_2, ...)' are in ℓ_2. The model can then be written as

Y = Xβ + e,

where e = (e_1, ..., e_n)' is the vector of the independent observation errors having distribution F with density f, and X' = (x_1 x_2 ... x_n) is considered as a linear operator from ℓ_2 to ℝ^n. By p = (p_1, p_2, ..., p_{k(p)}) we denote a finite dimensional submodel with parameter

β'(p) = (0, ..., β_{p_1}, 0, ..., β_{p_2}, 0, ..., β_{p_{k(p)}}, 0, ...),

where p_1 < p_2 < ... < p_{k(p)}, k(p) ≥ 1.
The statistician chooses a likelihood function h which he believes to represent the true error distribution, and estimates the parameters in a submodel p by maximizing

∏_{i=1}^n h(Y_i - x_i'(p)β(p)),

where x_i'(p) = (0, ..., x_{i,p_1}, 0, ..., x_{i,p_2}, 0, ..., x_{i,p_{k(p)}}, 0, ...). Call this maximum likelihood estimate β̂(p). The regression surface point μ_i is estimated by

μ̂_i(p) = x_i'(p)β̂(p).

Define ψ(u) = -(d/du) log h(u), γ = E_Fψ²(e)/(E_Fψ'(e))², B_n = X'X, and let P_n be a family of models p. A possible selection rule for choosing a model p ∈ P_n could be defined by

W_n'(p) = -||μ̂(p)||² + 2γk(p) + ||μ||²,

since

W_n'(p) - L_n(p) = -{||μ̂(p) - μ||² + 2⟨β̂(p) - β, β⟩_{B_n} + ||μ||²} + 2γk(p) + ||μ||² - ||μ̂(p) - μ||²
                 = -2{||μ̂(p) - μ||² + ⟨β̂(p) - β, β⟩_{B_n}} + 2γk(p)
                 = 2{γk(p) - ⟨β̂(p) - β, β̂(p)⟩_{B_n}},                              (1)
where ⟨u,v⟩_{B_n} denotes the bilinear form u'B_n v for vectors u and v. Suppose that it can be shown that the last term in (1) is tending to a constant uniformly over the model class P_n. Then minimizing W_n'(p) over P_n will be the same task, at least asymptotically, as minimizing L_n(p). However, W_n' cannot be computed directly from the data since it depends on the unknown regression curve μ. But note that the last term in W_n'(p) is independent of the model p. We will therefore define

W_n(p) = -||μ̂(p)||² + 2γk(p)

as the score function that is to be minimized over P_n. The problem of simultaneously estimating γ from the data, in order to make W_n completely data driven, is considered in Section 4.
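As an illustration of how the score W_n(p) could be used in practice, the following sketch fits nested models by a Huber-type M-estimator and minimizes W_n(p) = -||μ̂(p)||² + 2γk(p) over p. The IRLS fitting routine, the tuning constant c = 1.345 and the treatment of γ as known are assumptions of this sketch, not prescriptions taken from the paper.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """M-estimate of the regression coefficients by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))   # Huber weights psi(r)/r
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

def select_model(X, y, gamma, c=1.345):
    """Minimize W_n(p) = -||mu_hat(p)||^2 + 2*gamma*k(p) over nested models p = 1, ..., p_max."""
    scores = []
    for p in range(1, X.shape[1] + 1):
        Xp = X[:, :p]
        mu_hat = Xp @ huber_irls(Xp, y, c)
        scores.append(-np.sum(mu_hat ** 2) + 2.0 * gamma * p)
    return int(np.argmin(scores)) + 1, scores

# illustrative usage: polynomial design with three nonzero coefficients
rng = np.random.default_rng(0)
n = 200
i = np.arange(1, n + 1)
X = np.column_stack([(i / n) ** j for j in range(8)])
y = X[:, :3] @ np.array([1.0, -2.0, 1.5]) + rng.standard_normal(n)
print(select_model(X, y, gamma=1.05)[0])   # gamma for Huber psi on Gaussian errors, assumed known
```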
Remark 1
If the statistician is in the happy situation of knowing f, then he will choose h = f. If f is symmetric, then by partial integration

∫ψ'f = ∫ψ²f = I(F),

and therefore the constant γ reduces to I(F)^{-1}, the inverse of the Fisher information number in a location family with density f.
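The partial integration behind Remark 1 can be spelled out as follows (a worked detail, assuming h = f with f symmetric and smooth enough that the boundary terms vanish, so that ψ = -f'/f):

```latex
\[
  E_F \psi'(e) = \int \psi'(u) f(u)\,du
               = -\int \psi(u) f'(u)\,du
               = \int \psi^2(u) f(u)\,du = I(F),
\]
\[
  \gamma = \frac{E_F \psi^2(e)}{\bigl(E_F \psi'(e)\bigr)^2}
         = \frac{I(F)}{I(F)^2} = I(F)^{-1}.
\]
```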
We will use the concept of asymptotic efficiency as in Shibata (1981), Li (1984) and Stone (1984): a selected p̂ is called asymptotically optimal if, as n → ∞,

L_n(p̂) / inf_{p∈P_n} L_n(p) →_p 1.                                              (2)
The following condition on ψ will be needed.

Condition 1
The function ψ is centered, i.e., E_Fψ(e) = 0, and twice differentiable with bounded second derivative. We furthermore assume that E_F[q^{-1}(ψ'(e) - q)]^{2N} < ∞ for some positive integer N and q = E_Fψ'(e) > 0.
The estimates β̂(p) will be compared with the Gauss-Markov estimates in the model p based on the (unobservable) pseudodata Ỹ_i = μ_i + ẽ_i, ẽ_i = ψ(e_i)/q. Define X(p) as the (n, k(p)) matrix containing the nonzero control variables in model p and assume that B_n(p) = X'(p)X(p) has full rank k(p). Then the Gauss-Markov estimate of β based on the pseudodata Ỹ = (Ỹ_1, ..., Ỹ_n)' is defined as β̃(p) = B_n^{-1}(p)X'(p)Ỹ, so that μ̃(p) = M_n(p)Ỹ, where M_n(p) = X(p)B_n^{-1}(p)X'(p) denotes the hat matrix in model p. The loss for the Gauss-Markov estimate is L̃_n(p) = ||μ̃(p) - μ||², which will be approximately L_n(p), as will be seen later on.
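The following minimal sketch (hypothetical helpers, not the paper's code) makes this proof device concrete: it builds the pseudodata Ỹ_i = μ_i + ψ(e_i)/q and the Gauss-Markov estimate fitted to them. Replacing the population constant q = E_Fψ'(e) by a sample mean is an approximation used only for this illustration.

```python
import numpy as np

def pseudo_data(mu, e, psi, dpsi):
    """Pseudodata Y~_i = mu_i + psi(e_i)/q, with q approximated by a sample mean."""
    q = np.mean(dpsi(e))
    return mu + psi(e) / q

def gauss_markov(Xp, y_tilde):
    """beta_tilde(p) = B_n^{-1}(p) X'(p) Y~ and the fitted values mu_tilde(p) = M_n(p) Y~."""
    beta_tilde = np.linalg.solve(Xp.T @ Xp, Xp.T @ y_tilde)
    return beta_tilde, Xp @ beta_tilde
```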
A well behaved design is guaranteed by

Condition 2
There exists a positive integer N such that, with R_n(p) = γk(p) + ||μ - μ̄(p)||² and μ̄(p) = M_n(p)μ,

Σ_{p∈P_n} R_n(p)^{-N} → 0, as n → ∞.

Let h(p) be the largest diagonal element of the hat matrix M_n(p). The speed of h(p) is controlled by

Condition 3
sup_{p∈P_n} h(p)R_n(p) → 0, as n → ∞.
Remark 2
It follows from Condition 3 that k²(p)/n → 0 as n → ∞, since R_n(p) = γk(p) + ||μ - μ̄(p)||², μ̄(p) = M_n(p)μ. This should be seen as an analogue of the necessary condition p²/n → 0 that can be found in Huber (1981, p. 166). Conditions 2 and 3 also imply

Σ_{p∈P_n} h(p)^N → 0.                                                            (3)
Remark 3
If ψ is bounded, as is assumed in robust regression analysis, Condition 2 can be weakened. It is seen from the proofs that in this case Bernstein's inequality could be used instead of Whittle's (1960, Th. 2), and Condition 2 could be weakened to Σ_{p∈P_n} exp(-C R_n(p)) → 0, for some C > 0.

Denote by p̂ a model p that minimizes W_n(p) over P_n. The main result is as follows.

Theorem
Under Conditions 1-3, p̂ is asymptotically optimal.
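A minimal Monte Carlo sketch of the optimality criterion (2) in the simplest mismatch-free setting (Gaussian likelihood on standard normal errors, so ψ(u) = u and γ = 1). The polynomial design of Example 1, the coefficient decay and the truncation at p_max are illustrative assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_max = 400, 12
i = np.arange(1, n + 1)
X = np.column_stack([(i / n) ** (j - 1) for j in range(1, p_max + 1)])  # design of Example 1
beta = 1.0 / np.arange(1, p_max + 1) ** 2                               # assumed coefficient decay
mu = X @ beta

ratios = []
for _ in range(200):
    y = mu + rng.standard_normal(n)
    W, L = [], []
    for p in range(1, p_max + 1):
        Q, _ = np.linalg.qr(X[:, :p])
        mu_hat = Q @ (Q.T @ y)                       # least squares fit in model p
        W.append(-np.sum(mu_hat ** 2) + 2.0 * p)     # score W_n(p) with gamma = 1
        L.append(np.sum((mu_hat - mu) ** 2))         # loss L_n(p)
    p_hat = int(np.argmin(W))
    ratios.append(L[p_hat] / min(L))

print("average of L_n(p_hat)/inf_p L_n(p):", np.mean(ratios))   # should be close to 1
```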
The rest of the paper is organized in four sections. In Section 2 the theorem above is proved, in Section 3 we give a variety of examples that satisfy our conditions, and in Section 4 the estimation of γ and the relation to other model selection procedures are investigated. Section 5 is devoted to detailed proofs of the lemmata that are needed in showing the asymptotic optimality.
2. Proof of the Theorem

In the proof of the Theorem, the following lemmata will be used.

Lemma 1
Under the conditions of the Theorem, for all ε > 0,

P{ sup_{p∈P_n} ||μ̂(p) - μ̃(p)||² / R_n(p) > ε } → 0, as n → ∞.

Lemma 2
Under the conditions of the Theorem, for all ε > 0,

P{ sup_{p∈P_n} |L̃_n(p) - R_n(p)| / R_n(p) > ε } → 0, as n → ∞.

Lemma 3
Under the conditions of the Theorem, for all ε > 0,

P{ sup_{p∈P_n} |γk(p) - (μ̃(p) - μ)'μ̃(p)| / R_n(p) > ε } → 0, as n → ∞.
Recall that the Gauss-Markov estimate based on the pseudodata Ỹ is β̃(p) = B_n^{-1}(p)X'(p)Ỹ. The crossterm in (1) will be approximated by a corresponding crossterm based on the linearized estimates β̃(p):

⟨β̂(p) - β, β̂(p)⟩_{B_n} - ⟨β̃(p) - β, β̃(p)⟩_{B_n}
  = ||β̂(p) - β̃(p)||²_{B_n} + ⟨β̂(p) - β̃(p), β̃(p) - β(p)⟩_{B_n}
    + ⟨β(p) - β, β̂(p) - β̃(p)⟩_{B_n} + ⟨β̂(p) - β̃(p), β̃(p)⟩_{B_n}.             (4)

By Lemma 1, the first term is of lower order than R_n(p) uniformly over P_n; the second term is bounded by the Cauchy-Schwarz inequality, and then Lemma 1 and Lemma 2 are applied. The third term is handled by formula (8), given in the proof of Lemma 1, by setting a = β̃(p), η = β̂(p). The fourth term is handled as the second term.
Suppose that

sup_{p,p'∈P_n} |(W_n(p) - W_n(p')) - (L_n(p) - L_n(p'))| / (L_n(p) + L_n(p')) = o_p(1),     (5)

and let p* denote a minimizer of L_n(p) over P_n. Then by (5), with probability greater than 1 - ε,

|W_n(p̂) - W_n(p*) - (L_n(p̂) - L_n(p*))| ≤ (ε/2)(L_n(p̂) + L_n(p*)).

By the definition of p̂, W_n(p̂) - W_n(p*) ≤ 0; therefore,

-(L_n(p̂) - L_n(p*)) ≥ -(ε/2)(L_n(p̂) + L_n(p*)),
L_n(p*)(1 + ε/2) ≥ L_n(p̂)(1 - ε/2),
L_n(p*)/L_n(p̂) ≥ (1 - ε/2)/(1 + ε/2),

which shows that (2) holds, i.e. p̂ is asymptotically optimal. Formula (5) follows by observing that

(W_n(p) + ||μ||² - L_n(p)) / L_n(p)
  = [2{γk(p) - ⟨β̂(p) - β, β̂(p)⟩_{B_n}} / R_n(p)] · [R_n(p) / L̃_n(p)] · [L̃_n(p) / L_n(p)].

The first factor is tending to zero in probability, uniformly over P_n, by Lemma 3 and formula (4). The two other factors tend to one in probability, uniformly over P_n, by Lemma 1 and Lemma 2.
3. Examples

We start with a reformulation of Condition 2 in the case of hierarchical model sequences, i.e. P_n = {(1), (1,2), ..., (1,2,...,p_n)} with p_n tending to infinity. In this case Condition 2 follows for N = 2 from

Condition 2'
inf_{p∈P_n} R_n(p) → ∞, as n → ∞.

We slightly abuse notation by writing j for (1, ..., j). Then Condition 2 follows from R_n(j) = γj + ||μ̄(j) - μ||² and

Σ_{p∈P_n} R_n(p)^{-2} = Σ_{j=1}^{J_n} R_n(j)^{-2} + Σ_{j=J_n+1}^{p_n} R_n(j)^{-2}
                      ≤ J_n {inf_{p∈P_n} R_n(p)}^{-2} + γ^{-2} Σ_{j=J_n+1}^{∞} j^{-2} → 0,

if J_n tends to infinity slowly enough.

In the following examples we assume that P_n represents a hierarchical model sequence. The following lemma, which is due to Shibata (1981), is useful in checking Condition 2'.

Lemma 4
Assume that with a positive divergent sequence {c_n} the linear operator c_n^{-1}B_n converges weakly to a nonsingular operator B: ℓ_2 → ℓ_2 such that every p×p principal submatrix B(p) has full rank p for all p > 0. If β has infinitely many nonzero coordinates, then Condition 2' holds and p* diverges to infinity as n → ∞.

Are the conditions of the Theorem fulfilled for typical examples? We check our conditions in examples given by Shibata (1981).
Example 1
Consider the polynomial regression on the interval [0,1]. Here

x_{ij} = (i/n)^{j-1},   i = 1, ..., n,   j = 1, 2, ...,

and Y_i, i = 1, ..., n, is observed. Condition 1 is model independent and is an assumption about the error distribution. Condition 2' is satisfied via Lemma 4 (set c_n = n). It remains to check Condition 3. The symmetric matrix B_n^{-1}(p) has a spectral decomposition (Mardia, Kent and Bibby, 1979, Theorem A.6.4)

B_n^{-1}(p) = Γ_n Λ_n Γ_n',

where Λ_n = diag(λ_1(p), ..., λ_p(p)) and Γ_n = (γ_1, ..., γ_p), γ_j = (γ_{1j}, ..., γ_{pj})' the j-th normalized eigenvector of B_n^{-1}(p). Lemma 4 insures that λ_min(p), the smallest eigenvalue of n^{-1}B_n(p), is bounded away from zero by a constant C. Therefore each diagonal element h_i of M_n(p) can be estimated by

h_i ≤ Σ_{ℓ=1}^p λ_ℓ(p) (Σ_{j=1}^p x_{ij}²)(Σ_{j=1}^p γ_{ℓj}²)
    ≤ p λ_max(p) Σ_{j=1}^p x_{ij}²
    ≤ p² (n λ_min(p))^{-1} ≤ C^{-1} p²/n,

where λ_max(p) denotes the largest eigenvalue of B_n^{-1}(p). So Condition 3 is fulfilled if we ask for

sup_{1≤p≤p_n} p² R_n(p)/n → 0.                                                   (6)

A necessary condition is p_n³/n → 0, which is slightly stronger than Huber's (1981) conditions.
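A small numerical check (not from the paper) of the leverage bound h_i ≤ C^{-1}p²/n for this polynomial design: it compares the largest diagonal element of the hat matrix with the rate p²/n.

```python
import numpy as np

n = 500
i = np.arange(1, n + 1)
for p in (3, 5, 10):
    Xp = np.column_stack([(i / n) ** (j - 1) for j in range(1, p + 1)])
    Q, _ = np.linalg.qr(Xp)
    h = np.max(np.sum(Q ** 2, axis=1))      # largest diagonal element of the hat matrix M_n(p)
    print(f"p={p:2d}   h(p)={h:.4f}   p^2/n={p ** 2 / n:.4f}")
```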
Example 2
Consider the following representation of the regression curve:

μ_i = Σ_{j=1}^∞ β_j cos(π(j-1)(i-1)/n).

Here the observations are taken at x = 0, n^{-1}, ..., (n-1)/n. As in the example above, Condition 2 is satisfied by Lemma 4, setting c_n = n/2. Condition 3 is satisfied by similar arguments as above if we assume that (6) holds.
Example 3
Consider the robust M-estimation of location at different units x_j. Observations are taken repeatedly at p_n different units, and n/p_n observations are taken at the point x_j, j = 1, ..., p_n. Assume that E_Fψ(e) = 0; then Condition 1 is satisfied if ψ and ψ' are bounded. Shibata (1981, p. 51) shows that Condition 2 is satisfied if the vectors of the control variables (x_1, ..., x_{p_n}) are linearly independent. Condition 3 can be checked as above.
4. Other methods and estimation of γ

There are a variety of other model selection methods, most of which were shown to be equivalent to Mallows' C_p. We therefore compare our method with C_p only. For simplicity, we work with the linearized estimate μ̃(p) based on the pseudodata Ỹ. Mallows' (1973) score function reads

C_p(p) = ||Ỹ - μ̃(p)||² + 2γk(p)
       = ||ẽ||² + L̃_n(p) + 2{γk(p) - ẽ'M_n(p)ẽ} + 2ẽ'(I_n - M_n(p))μ.

The first term is independent of p, and the third and the last term vanish uniformly over model classes P_n, as can be seen in the next section. This shows that a model selected by C_p is asymptotically optimal.

It can now be seen that W_n(p) has a similar structure:

W̃_n(p) = -||μ̃(p)||² + 2γk(p)
        = L̃_n(p) - 2(μ̃(p) - μ)'μ̃(p) + 2γk(p) - ||μ||²
        = L̃_n(p) + 2{γk(p) - ẽ'M_n(p)ẽ} + 2ẽ'(I_n - M_n(p))μ - 2μ'ẽ - ||μ||².

Here the last two terms are independent of the model. The remaining terms are identical to those in Mallows' C_p, which shows that W_n(p) is equivalent to C_p.
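In the least squares case the equivalence can also be seen directly on data: W_n(p) and C_p(p) then differ only by the model-independent constant ||Y||², so they select the same model. The following sketch (arbitrary simulated design, γ = σ² = 1 treated as known) verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_max = 200, 8
X = rng.standard_normal((n, p_max))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(n)

for p in range(1, p_max + 1):
    Q, _ = np.linalg.qr(X[:, :p])
    mu_hat = Q @ (Q.T @ y)                              # least squares fit in model p
    W = -np.sum(mu_hat ** 2) + 2.0 * p                  # W_n(p) with gamma = 1
    Cp = np.sum((y - mu_hat) ** 2) + 2.0 * p            # Mallows' C_p(p)
    print(f"p={p}:  C_p - W_n = {Cp - W:.6f}   ||Y||^2 = {np.sum(y ** 2):.6f}")
```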
It could be argued that the score function proposed here is not very reasonable in practical applications, since the constant γ is unknown to the statistician. However, if the constant γ can be consistently estimated (independently of p), then the score function based on the estimated γ is also asymptotically optimal. A consistent estimate γ̂_n of γ is provided, for instance, by

γ̂_n = n^{-1} Σ_{i=1}^n ψ²(ê_i(p_n)) / (n^{-1} Σ_{i=1}^n ψ'(ê_i(p_n)))²,

where ê_i(p_n) denote the residuals from a fit with a deterministic model p_n, increasing in magnitude as n → ∞. A Taylor expansion and the Cauchy-Schwarz inequality show that γ̂_n →_p γ, as n → ∞.
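A short sketch of this estimate for a Huber likelihood; the tuning constant c = 1.345 and the way the pilot residuals ê_i(p_n) are obtained are assumptions of the sketch rather than prescriptions from the paper.

```python
import numpy as np

def gamma_hat(residuals, c=1.345):
    """gamma_hat_n = mean(psi^2(e_hat)) / mean(psi'(e_hat))^2 for Huber's psi."""
    psi = np.clip(residuals, -c, c)
    dpsi = (np.abs(residuals) <= c).astype(float)
    return np.mean(psi ** 2) / np.mean(dpsi) ** 2

# usage: residuals from a pilot fit in a deterministic model p_n whose dimension
# grows slowly with n; the resulting gamma_hat is then plugged into W_n(p).
```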
5. Proofs

In this section we give the proofs of Lemmas 1-3. The proof of Lemma 1 follows a related proof in Huber (1981), Section 7.4. Similar ideas were used by Cox (1983), who considered M-type smoothing splines. In order to simplify notation we will consider the hierarchical case only, i.e. the model "p" is identified with "(1, 2, ..., k(p)), k(p) = p." Furthermore we assume without loss of generality that the coordinate system in the p-dimensional subspace of the first p components has been chosen so that X'(p)X(p) is the identity. Consider the mapping φ = (φ_1, ..., φ_p)': ℝ^p → ℝ^p, η = (η_1, ..., η_p)' ∈ ℝ^p, with

φ_k(η) = -q^{-1} Σ_{i=1}^n ψ(Y_i - Σ_{j=1}^p x_{ij}η_j) x_{ik},   k = 1, ..., p.

A zero (with respect to η) of φ will be compared with a zero of

ψ̃_k(η) = η_k - Σ_{i=1}^n (μ_i + ẽ_i) x_{ik},   where ẽ_i = ψ(e_i)/q, q = E_Fψ'(e);

the zero of ψ̃ is the least squares estimate β̃(p) = X'(p)Ỹ based on the pseudodata Ỹ.
Consider an arbitrary normalized vector a ∈ ℝ^p, ||a|| = 1. A Taylor expansion of φ, using Condition 1, leads to

Σ_{k=1}^p a_k (ψ̃_k(η) - φ_k(η))
  = q^{-1} Σ_{k=1}^p a_k Σ_{i=1}^n (ψ'(e_i) - q)(Σ_{j=p+1}^∞ x_{ij}β_j) x_{ik}
  + q^{-1} Σ_{k=1}^p a_k Σ_{i=1}^n (ψ'(e_i) - q) Σ_{j=1}^p x_{ij}x_{ik}(β_j - η_j)
  + (2q)^{-1} Σ_{k=1}^p a_k Σ_{i=1}^n ψ''(e_i + v(Σ_{j=1}^∞ x_{ij}β_j - Σ_{j=1}^p x_{ij}η_j)) (Σ_{j=1}^∞ x_{ij}β_j - Σ_{j=1}^p x_{ij}η_j)² x_{ik}
  = T_{1,η}(p) + T_{2,η}(p) + T_{3,η}(p),   v ∈ (-1,1).
We will now show that each of these terms vanishes uniformly over P_n, in the sense that

P{ sup_{p∈P_n} |T_{a,η}(p)| / R_n^{1/2}(p) > ε } → 0,   a = 1, 2, 3,             (7)

for all (η,a) in the set

F_n = ∩_{p∈P_n} { (η,a) : Σ_{i=1}^n (Σ_{j=1}^p x_{ij}(β_j - η_j))² ≤ K R_n(p), ||a|| = 1 }.

Define for i = 1, ..., n

v_i = q^{-1}(ψ'(e_i) - q),   B_{i,n}(p) = Σ_{j=p+1}^∞ x_{ij}β_j,   s_i = Σ_{k=1}^p a_k x_{ik},   Δ_{i,n}(p) = Σ_{j=1}^p x_{ij}(β_j - η_j).

Note that

Σ_{i=1}^n s_i² = Σ_{i=1}^n (Σ_{k=1}^p a_k x_{ik})² = ||X(p)a||² = ||a||² = 1.

The first term T_{1,η}(p) is estimated as follows:

P{ sup_{p∈P_n} |T_{1,η}(p)| / R_n^{1/2}(p) > ε } ≤ Σ_{p∈P_n} ε^{-2N} E{ |Σ_{i=1}^n s_i B_{i,n}(p) v_i|^{2N} / R_n^N(p) }.
Applying Condition 1 and Whittle's (1960, Th. 2) inequality, this term is bounded by

Σ_{p∈P_n} C_1 ε^{-2N} (Σ_{i=1}^n s_i² B_{i,n}²(p))^N / R_n^N(p).

Recall that R_n(p) = γp + ||(I_n - M_n(p))μ||², which controls Σ_{i=1}^n B_{i,n}²(p). Conditions 2 and 3 (see Remark 2, formula (3)) and the simple inequality s_i² ≤ Σ_{j=1}^p x_{ij}² Σ_{k=1}^p a_k² = h_i(p), where h_i(p) denotes the i-th diagonal element of the hat matrix M_n(p), imply that (7) holds for a = 1.
The second term is estimated similarly; we omit some details. If (η,a) ∈ F_n, then as above the corresponding probability is bounded by a constant multiple of

Σ_{p∈P_n} ε^{-2N} h(p)^N (Σ_{i=1}^n Δ_{i,n}²(p))^N / R_n^N(p),

and Conditions 2 and 3 are applied as described in Remark 2, formula (3). The third term, involving the second derivative of ψ, is bounded by

|T_{3,η}(p)| ≤ C Σ_{i=1}^n |s_i| (B_{i,n}(p) + Δ_{i,n}(p))².

If (η,a) ∈ F_n we obtain, with a constant C_2,

sup_{p∈P_n} |T_{3,η}(p)| / R_n^{1/2}(p) ≤ C_2 sup_{p∈P_n} (h(p) R_n(p))^{1/2},

which tends to zero by Condition 3.
Altogether we have shown that

P{ sup_{p∈P_n} |Σ_{k=1}^p a_k (ψ̃_k(η) - φ_k(η))| / R_n^{1/2}(p) > ε } → 0        (8)

for all (η,a) ∈ F_n. This entails that, for all η in the set

G_n = { η ∈ ℝ^p : sup_{p∈P_n} Σ_{i=1}^n (Σ_{k=1}^p (η_k - β_k) x_{ik})² / R_n(p) ≤ K },

P{ sup_{p∈P_n} ||ψ̃(η) - φ(η)|| / R_n^{1/2}(p) > ε } → 0.                         (9)

Condition 2 and bounds on higher moments as above imply that with probability greater than 1 - δ,

sup_{p∈P_n} ||β̃(p) - β(p)|| / R_n^{1/2}(p) ≤ K^{1/2}/2.                          (10)

This shows that β̃(p) is in G_n with high probability. Note that

||(η - φ(η)) - β(p)|| = ||(β̃(p) - β(p)) - (φ(η) - ψ̃(η))|| ≤ ||ψ̃(η) - φ(η)|| + ||β̃(p) - β(p)||.   (11)

From formula (9) we know that the first term on the right vanishes asymptotically relative to R_n^{1/2}(p). From (10) we conclude that for K big enough the right hand side of (11) can be made less than K^{1/2} R_n^{1/2}(p) uniformly over η ∈ G_n. Thus the function η ↦ η - φ(η) maps G_n into itself and has a fixed point η* in the compact, convex set G_n. Since this fixed point is necessarily a zero of φ, it is seen that β̂(p) is in G_n with probability greater than 1 - δ. Substituting β̂(p) into equation (9) shows that Lemma 1 holds.
Lemma 2 is seen from the equation

L̃_n(p) - R_n(p) = ||M_n(p)ẽ||² - γk(p).

Condition 2 implies that

sup_{p∈P_n} | ||M_n(p)ẽ||² - γk(p) | / R_n(p) → 0 in probability,

which shows Lemma 2. Lemma 3 follows similarly, observing that (Ỹ - μ̃(p))'μ̃(p) = 0.
Acknowledgement
I would like to thank Charles J. Stone and Johanna Behrens for helpful
discussions.
References

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203-217.

Breiman, L. and Freedman, D. (1983). How many variables should be entered in a regression equation? J. Amer. Statist. Assoc. 78, 131-136.

Cox, D. (1983). Asymptotics for M-type smoothing splines. Ann. Statist. 11, 530-551.

Huber, P. (1973). Robust regression: Asymptotics, conjectures, and Monte Carlo. Ann. Statist. 1, 799-821.

Huber, P. (1981). Robust Statistics. Wiley, New York.

Mallows, C. (1973). Some comments on C_p. Technometrics 15, 661-675.

Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.

Li, K.C. (1984). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: Discrete index set. Manuscript.

Ronchetti, E. (1985). Robust model selection in regression. Statist. Prob. Letters 3, 21-23.

Schrader, R.M. and Hettmansperger, T.P. (1980). Robust analysis of variance based upon a likelihood ratio criterion. Biometrika 67, 93-101.

Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68, 45-54.

Stone, C.J. (1984). An asymptotically optimal window selection rule for kernel density estimates. Ann. Statist. 12, 1285-1297.

Whittle, P. (1960). Bounds for the moments of linear and quadratic forms in independent variables. Theor. Prob. Appl. 5, 302-305.