
Consistent Hazard Regression Estimation by Sieved
Maximum Likelihood Estimators
Sebastian Döhler
Institut für Mathematische Stochastik, Universität Freiburg
Abstract
We consider maximum likelihood estimators in general sieve function classes
for the estimation of the conditional log-hazard function of a survival time in a
censored data model. We prove consistency of these types of estimators under the
assumption that the conditional log-hazard function is continuous, by using results
from empirical process theory. As special examples we consider feedforward neural
network estimators and radial basis function network estimators.
1 Introduction
The aim of this paper is to show that sieved maximum likelihood estimators can be used
to consistently estimate the logarithm of a continuous conditional hazard function of the
survival time in a censoring model.
We consider random variables (rvs) $T$, $C$ and $X$, where
$$T : (\Omega, \mathcal{A}, P) \to \mathbb{R}_+ \quad \text{is a survival (failure) time,}$$
$$C : (\Omega, \mathcal{A}, P) \to \mathcal{T} := [0,1] \quad \text{is a censoring time and}$$
$$X : (\Omega, \mathcal{A}, P) \to \mathcal{X} := [0,1]^k \quad \text{is a vector of covariates.}$$
In survival analysis the survival and censoring times are not observed directly; instead the "observable time"
$$Y := T \wedge C,$$
the "censoring indicator"
$$\delta := 1_{\{T \le C\}} = \begin{cases} 1 & \text{failure takes place before censoring,}\\ 0 & \text{failure event is censored,}\end{cases}$$
as well as the covariates $X$ are observed. This censoring mechanism is called right-censorship. We are interested in inference on the conditional distribution of the lifetime $T$ given the vector of covariates $X$.
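As a minimal illustration of this observation scheme (the distributional choices below are hypothetical and only serve to generate data of the right form), right-censored observations can be simulated as follows:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 500, 2

    X = rng.uniform(size=(n, k))                       # covariates in [0, 1]^k
    T = rng.exponential(scale=1.0 / (1.0 + X[:, 0]))   # survival times (hypothetical conditional model)
    C = rng.uniform(size=n)                            # censoring times in [0, 1]
    Y = np.minimum(T, C)                               # observable time Y = min(T, C)
    delta = (T <= C).astype(int)                       # censoring indicator: 1 if failure observed

    # the observed sample consists of the triples (Y_i, delta_i, X_i)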
We assume that for the model described above there exists a conditional density function $f_0(t|x)$ and denote by $F_0(t|x)$ the conditional distribution function of $T$ given $X = x$. Then
$$\bar F_0(t|x) := 1 - F_0(t|x) \quad \text{is the conditional survival function,}$$
$$\lambda_0(t|x) := \frac{f_0(t|x)}{\bar F_0(t|x)} \quad \text{is the conditional hazard function and}$$
$$\alpha_0(t|x) := \log \lambda_0(t|x) \quad \text{is the conditional log-hazard function.}$$
The conditional hazard function has the following interpretation (cf. p. 127 in [FH]): for small $\Delta t$,
$$\lambda_0(t|x)\,\Delta t \approx P(t \le T < t + \Delta t \mid T \ge t, x)$$
is the approximate conditional probability of observing a failure in the time interval $[t, t + \Delta t)$ given $x$ and no failure before time $t$. Our goal is to find an estimator for $\lambda_0$, respectively $\alpha_0$, based on iid censored data $(y_1, \delta_1, x_1), \ldots, (y_n, \delta_n, x_n)$.
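As a simple illustration (a hypothetical special case, not taken from the paper), suppose that given $X = x$ the survival time is exponentially distributed with rate $\theta(x) > 0$. Then
$$\bar F_0(t|x) = e^{-\theta(x) t}, \qquad f_0(t|x) = \theta(x) e^{-\theta(x) t}, \qquad \lambda_0(t|x) = \theta(x), \qquad \alpha_0(t|x) = \log\theta(x),$$
so the hazard does not depend on $t$, and $\lambda_0(t|x)\,\Delta t = \theta(x)\,\Delta t$ is indeed the approximate probability of a failure in $[t, t + \Delta t)$ given survival up to time $t$.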
One well-known way to model the conditional hazard function is the Cox (proportional hazards) model. Here it is assumed that the conditional log-hazard function is the sum of an unspecified function of $t$ and a linear function in $x$,
$$\alpha_0(t|x) = f_0(t) + x\beta,$$
where $f_0$ is called the baseline hazard function and $\beta \in \mathbb{R}^k$ is a vector of parameters; often one is interested only in estimating $\beta$. This model has the property that $\frac{\lambda_0(t|x_1)}{\lambda_0(t|x_2)}$ is independent of $t$ (since $\lambda_0(t|x) = \exp(f_0(t))\exp(x\beta)$, the ratio equals $\exp((x_1 - x_2)\beta)$). Partial likelihood can be used to estimate the vector $\beta$.
The disadvantage of the Cox model is that it does not allow for interactions between the covariates and time. We want to deal with the problem of estimating $\alpha_0$, where $\alpha_0 = \alpha_0(t, x)$ is known to belong to some (usually large) function space (e.g. $L_2(\mathcal{T} \times \mathcal{X})$). This means that we should use functions as estimators that are known to have some sort of universal approximation property. Since neural networks and radial basis function networks have this property, they should be considered as candidates for these estimators.
A different approach by Kooperberg, Stone and Truong in [KST] is based on tensor product splines. Their main result is an $L_2$ rate of convergence for spline estimators under the assumption that the true conditional log-hazard function is sufficiently smooth. In our approach we dismiss the smoothness condition as well as an additional condition on the censoring distribution, and establish a general consistency result under the assumption that the true conditional log-hazard function is continuous.
Other related results could be expected in the context of density estimation, for instance for neural networks. However, as far as we know there are no results available for the case of censored data.
Our approach also differs from these two approaches in the sense that we do not consider a special sieve like splines or neural networks, but a rather general type of sieve that includes various estimators used in nonparametric statistics. Furthermore, the method established in this paper should also be applicable to further types of censoring as well as to other types of estimators.
We use two assumptions that are similar to two of the three assumptions introduced in [KST]. The first condition is the same as in [KST].

Condition 1
$T$ and $C$ are conditionally independent given $X$.

The second condition is weaker than the corresponding condition in [KST]. It could be replaced by an even weaker integrability condition on $\alpha_0$, but since we assume for our main results that $\alpha_0$ is bounded we use this assumption.

Condition 2
$\alpha_0$ is bounded on $\mathcal{T} \times \mathcal{X}$.
Let $(t_1, c_1, x_1), \ldots, (t_n, c_n, x_n)$ be an iid sample from the distribution of $(T, C, X)$. We observe $(y_1, \delta_1, x_1), \ldots, (y_n, \delta_n, x_n)$, where
$$y_i := t_i \wedge c_i, \qquad \delta_i := \begin{cases} 1 & \text{$i$-th survival time observed,}\\ 0 & \text{$i$-th survival event censored.}\end{cases}$$
According to [KST] the conditional log-likelihood corresponding to this sample is
$$L_n(\alpha) := \sum_{i=1}^n \Big[\delta_i\,\alpha(y_i|x_i) - \int_0^{y_i} \exp\alpha(u|x_i)\,du\Big],$$
where $\alpha : \mathcal{T} \times \mathcal{X} \to \mathbb{R}$. The expected conditional log-likelihood is¹
$$\ell(\alpha) := E\, L_1(\alpha).$$
Later we will define our estimator as the maximum likelihood estimator over some class of functions (sieve): $\hat\alpha_n := \operatorname{argmax}_{\alpha \in \mathcal{F}_n} L_n(\alpha)$, where the sieve $\mathcal{F}_n$ depends on the number $n$ of observations. Of course the crucial question with regard to consistency is how this sequence of function classes should be chosen.
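The following sketch shows how $L_n(\alpha)$ can be evaluated and maximized in practice; it is only an illustration under simplifying assumptions (the inner integral is approximated by the trapezoidal rule on a grid, the "sieve" is a hypothetical three-parameter family rather than one of the classes studied below, and scipy's Nelder-Mead optimizer is used for the maximization):

    import numpy as np
    from scipy.optimize import minimize

    def log_likelihood(alpha, y, delta, x, grid_size=100):
        """L_n(alpha) = sum_i [ delta_i * alpha(y_i, x_i) - int_0^{y_i} exp(alpha(u, x_i)) du ]."""
        total = 0.0
        for yi, di, xi in zip(y, delta, x):
            u = np.linspace(0.0, yi, grid_size)              # grid on [0, y_i]
            vals = np.exp(alpha(u, xi))
            integral = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(u))  # trapezoidal rule
            total += di * alpha(yi, xi) - integral
        return total

    # hypothetical finite-dimensional sieve: alpha_c(t, x) = c0 + c1*t + c2*x[0]
    def make_alpha(c):
        return lambda t, xi: c[0] + c[1] * t + c[2] * xi[0]

    def fit(y, delta, x):
        objective = lambda c: -log_likelihood(make_alpha(c), y, delta, x)
        return minimize(objective, x0=np.zeros(3), method="Nelder-Mead").x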
The following lemma states a formula for the "distance" between an arbitrary function $\alpha$ and the "true" conditional log-hazard function $\alpha_0$ with respect to the expected conditional log-likelihood.

¹ Here and for the rest of the paper, when we use integrals etc., we always assume them to exist and be well-defined.
Lemma 1
Let $\alpha : \mathcal{T} \times \mathcal{X} \to \mathbb{R}$ and $G : \mathbb{R} \to \mathbb{R}$, $y \mapsto \exp(y) - (1 + y)$. Let the rvs $(T, C, X)$ satisfy Conditions 1 and 2. Then we have
$$\ell(\alpha_0) - \ell(\alpha) = \int_{\mathcal{T} \times \mathcal{X}} \bar F_C\, G(\alpha - \alpha_0)\, dP^{(T,X)},$$
where $\alpha_0$ is the "true" conditional log-hazard function and $\bar F_C$ is the conditional survival function of the censoring time $C$ given $X$.
For $y \to -\infty$ the function $G(y)$ behaves like $-(1 + y)$ and for $y \to \infty$ it behaves like an exponential function.

[Figure 1: The function $G$]
Proof. We use the following representation from [KST]:
$$\ell(\alpha) = \int_{\mathcal{X}} \int_{\mathcal{T}} \bar F_C(t|x)\Big[\alpha(t|x)\, f_0(t|x) - \bar F_0(t|x)\exp(\alpha(t|x))\Big]\, dt\, f_X(x)\, dx,$$
where $\alpha : \mathcal{X} \times \mathcal{T} \to \mathbb{R}$ and $f_X$ denotes the density function of $X$. The claim of the lemma then follows from some algebra and the definition of $\lambda_0$. □
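For the reader's convenience, a sketch of this algebra (not spelled out in the original) reads: using $f_0 = \lambda_0 \bar F_0 = \exp(\alpha_0)\bar F_0$ and $f_0(t|x)\, f_X(x)\, dt\, dx = dP^{(T,X)}$,
$$\ell(\alpha_0) - \ell(\alpha) = \int_{\mathcal{X}}\int_{\mathcal{T}} \bar F_C \Big[(\alpha_0 - \alpha) f_0 - \bar F_0\big(\exp(\alpha_0) - \exp(\alpha)\big)\Big]\, dt\, f_X\, dx$$
$$= \int_{\mathcal{X}}\int_{\mathcal{T}} \bar F_C\, \bar F_0 \exp(\alpha_0)\Big[\exp(\alpha - \alpha_0) - 1 - (\alpha - \alpha_0)\Big]\, dt\, f_X\, dx = \int_{\mathcal{T}\times\mathcal{X}} \bar F_C\, G(\alpha - \alpha_0)\, dP^{(T,X)}.$$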
2 Consistency of Sieve Estimators

Here we introduce a type of function class that satisfies some conditions sufficient to prove consistency of a sieved maximum likelihood estimator. We consider vector spaces of functions $\mathcal{F} = \{\sum_i a_i f_i \mid a_i \in \mathbb{R},\ f_i \in \mathcal{F}_0\}$ generated by a set of "basis functions" $\mathcal{F}_0$. The first condition we impose is that $\mathcal{F}$ is dense in a suitable space of functions. This is a natural requirement, since it guarantees that it is possible to (asymptotically) remove the bias of estimators that are members of this class. The other condition we need is control of the "complexity" or "richness" of the set of basis functions $\mathcal{F}_0$. This will enable us to control the stochastic part of the overall error of the estimator. Examples of such classes are "feedforward neural networks with one hidden layer" and "radial basis function networks".
Condition 3
Let $\mathcal{F}_0 \subset \{f : \mathcal{T} \times \mathcal{X} \to \mathbb{R}\}$ be a class of functions with the following properties:
a) $\mathcal{F}$ is dense in $(C(\mathcal{T} \times \mathcal{X}), \|\cdot\|_\infty)$, where $\|\cdot\|_\infty$ is the sup-norm on $\mathcal{T} \times \mathcal{X}$.
b) $\mathcal{F} = \{\sum_i a_i f_i \mid a_i \in \mathbb{R},\ f_i \in \mathcal{F}_0\}$ is a vector space generated by a Vapnik-Červonenkis class $\mathcal{F}_0$ with $f : \mathcal{T} \times \mathcal{X} \to [0, 1]$ for $f \in \mathcal{F}_0$.
Definition 4 (Neural networks)
For an increasing function $\sigma : \mathbb{R} \to [0, 1]$ satisfying $\lim_{x\to\infty}\sigma(x) = 1$ and $\lim_{x\to-\infty}\sigma(x) = 0$ let
$$\mathcal{F} = \mathcal{F}(\sigma) := \Big\{ f : \mathbb{R}^m \to \mathbb{R},\ z \mapsto \sum_{i=1}^K c_i\,\sigma(a_i^T z + b_i) + c_0 \ \Big|\ K \in \mathbb{N},\ a_i \in \mathbb{R}^m,\ b_i, c_i \in \mathbb{R} \Big\}.$$
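A minimal sketch of such a network function (with the logistic function as one admissible choice of $\sigma$ and arbitrary illustrative weights) could look as follows:

    import numpy as np

    def sigma(x):
        """Logistic squashing function: increasing, with limits 0 at -infinity and 1 at +infinity."""
        return 1.0 / (1.0 + np.exp(-x))

    def network(z, a, b, c, c0):
        """f(z) = sum_{i=1}^K c_i * sigma(a_i^T z + b_i) + c_0, where K = len(c)."""
        return float(np.sum(c * sigma(a @ z + b)) + c0)

    # example with m = 3 inputs and K = 4 hidden units (arbitrary weights)
    rng = np.random.default_rng(1)
    a, b, c = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
    print(network(np.array([0.2, 0.5, 0.7]), a, b, c, c0=0.1))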
Definition 5 (Radial basis function networks)
For a decreasing, continuous, nonzero function $h : \mathbb{R}_+ \to [0, 1]$ with $\int_{\mathbb{R}^m} h(\|x\|)\, d\lambda^m(x) < \infty$ let
$$\mathcal{F} = \mathcal{F}(h) := \Big\{ f : \mathbb{R}^m \to \mathbb{R},\ z \mapsto \sum_{i=1}^K c_i\, h(\|A_i z + b_i\|) + c_0 \ \Big|\ K \in \mathbb{N},\ A_i \in \mathbb{R}^{m\times m},\ b_i \in \mathbb{R}^m,\ c_i \in \mathbb{R} \Big\}.$$
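Analogously, a radial basis function network in the sense of Definition 5 can be sketched with the Gaussian profile $h(r) = \exp(-r^2)$ (one admissible choice of $h$):

    import numpy as np

    def rbf_network(z, A, b, c, c0):
        """f(z) = sum_{i=1}^K c_i * h(||A_i z + b_i||) + c_0 with h(r) = exp(-r^2)."""
        h = lambda r: np.exp(-r ** 2)
        return sum(ci * h(np.linalg.norm(Ai @ z + bi)) for Ai, bi, ci in zip(A, b, c)) + c0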
Remark 6
Both neural networks and radial basis function networks satisfy Condition 3 (as functions defined on $[0,1] \times [0,1]^k$).

Proof. For the denseness part cf. Theorem 2.3 in [HSW] for the case of neural networks and Theorem 5 in [PS] for the case of radial basis function networks.
The statement about the finiteness of the VC-dimension follows from the fact that in both cases every member of $\mathcal{F}_0$ is the composition of a fixed increasing or decreasing function (in one case $\sigma$, in the other case $h \circ \sqrt{\,\cdot\,}$) with a finite-dimensional vector space of functions. Since finite-dimensional vector spaces of measurable functions are VC-classes (cf. Lemma 2.6.15 in [vdV-W]) and since the composition of a fixed monotone function with a VC-class is again a VC-class (cf. Lemma 2.6.18(viii) in [vdV-W]), condition b) is also satisfied. □
To establish "universal consistency" for a sieved maximum likelihood estimator with respect to the $\ell$-error, we introduce bounds for the number of components and for the weights of elements of $\mathcal{F}$.
Definition 7
Let $\mathcal{F}$ satisfy Condition 3. For $\beta > 0$, $K \in \mathbb{N}$ define
$$\mathcal{F}(\beta, K) := \Big\{ \alpha : \mathcal{T} \times \mathcal{X} \to [-\beta K, \beta K],\ (t, x) \mapsto \sum_{i=1}^K c_i f_i(t, x) \ \Big|\ f_i \in \mathcal{F}_0,\ |c_i| \le \beta \Big\}.$$
In the case of neural networks this means that $\mathcal{F}(\beta, K)$ is a set of "feedforward neural networks" with one "hidden layer" consisting of $K$ hidden units and the magnitude of the weights bounded by $\beta$.
Theorem 8 (Consistency with respect to the $\ell$-error)
Let $\mathcal{F}$ be a class of functions that satisfies Condition 3. For a subset $\mathcal{F}_n \subset \mathcal{F}$ let $\hat\alpha_n$ be the maximum likelihood estimator $\hat\alpha_n := \operatorname{argmax}_{\alpha \in \mathcal{F}_n} L_n(\alpha)$.
Then there are sequences $\beta_n \uparrow \infty$, $K_n \uparrow \infty$ such that $\mathcal{F}_n := \mathcal{F}(\beta_n, K_n) \uparrow \mathcal{F}$ and for all rvs $(T, C, X)$ that satisfy Conditions 1 and 2 with $\alpha_0 \in C(\mathcal{T} \times \mathcal{X})$:
$$\ell(\alpha_0) - \ell(\hat\alpha_n) \to 0 \quad \text{a.s.}$$
Convergence holds for any $\beta_n \uparrow \infty$, $K_n \uparrow \infty$ such that
$$K_n \exp(3\beta_n K_n) = O(n^{1/4}).$$
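For illustration (one admissible choice, not prescribed by the theorem), the growth condition is satisfied by
$$K_n := \lfloor\sqrt{\log n}\rfloor, \qquad \beta_n := \tfrac{1}{13}\sqrt{\log n},$$
since then $3\beta_n K_n \le \tfrac{3}{13}\log n$ and hence $K_n \exp(3\beta_n K_n) \le \sqrt{\log n}\; n^{3/13} = o(n^{1/4})$.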
From the preceding result we can easily deduce $L_1$-consistency of $\hat\alpha_n$.
Theorem 9 (Consistency with respect to the $L_1$-error)
Let the assumptions and terminology of Theorem 8 be given. Then
$$\int_{\mathcal{T} \times \mathcal{X}} \bar F_C\, |\hat\alpha_n - \alpha_0|\, dP^{(T,X)} \to 0 \quad \text{a.s.}$$
Proof. An easy calculation shows that there exist $A, B > 0$ such that
$$G(y) \ge A y^2 \ \text{ for } |y| \le 1, \qquad G(y) \ge B |y| \ \text{ for } |y| > 1.$$
Let $E_n := \{(t, x) \in \mathcal{T} \times \mathcal{X} \mid |\hat\alpha_n(t, x) - \alpha_0(t, x)| \le 1\}$. Then we have
$$\int_{\mathcal{T} \times \mathcal{X}} \bar F_C\, G(\hat\alpha_n - \alpha_0)\, dP^{(T,X)} \ge A \int_{E_n} \bar F_C\, |\hat\alpha_n - \alpha_0|^2\, dP^{(T,X)} + B \int_{E_n^c} \bar F_C\, |\hat\alpha_n - \alpha_0|\, dP^{(T,X)}.$$
Since
$$\int_{\mathcal{T} \times \mathcal{X}} \bar F_C\, G(\hat\alpha_n - \alpha_0)\, dP^{(T,X)} = \ell(\alpha_0) - \ell(\hat\alpha_n)$$
on account of Lemma 1, we can conclude with Theorem 8 that
$$\int_{E_n} \bar F_C\, |\hat\alpha_n - \alpha_0|^2\, dP^{(T,X)} \to 0 \quad \text{and} \quad \int_{E_n^c} \bar F_C\, |\hat\alpha_n - \alpha_0|\, dP^{(T,X)} \to 0.$$
Since $\big(\int_{E_n} \bar F_C\, |\hat\alpha_n - \alpha_0|\, dP^{(T,X)}\big)^2 \le \int_{E_n} \bar F_C\, |\hat\alpha_n - \alpha_0|^2\, dP^{(T,X)}$, an application of Jensen's inequality leads to
$$\int_{E_n} \bar F_C\, |\hat\alpha_n - \alpha_0|\, dP^{(T,X)} \to 0,$$
which finishes the proof. □
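For concreteness (a choice not made explicit in the paper), one may take $A = B = e^{-1}$: on $[-1, 1]$ the function $G(y)/y^2 = \sum_{k \ge 0} y^k/(k+2)!$ is increasing, so its minimum there is $G(-1) = e^{-1}$; for $y \ge 1$ the quotient $G(y)/y$ is increasing with value $e - 2 > e^{-1}$ at $y = 1$; and for $y \le -1$ one has $G(y)/|y| = 1 - (1 - e^{-|y|})/|y| \ge 1 - (1 - e^{-1}) = e^{-1}$, since $(1 - e^{-u})/u$ is decreasing in $u$.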
These theorems show that the function classes considered here can be used to consistently estimate the conditional log-hazard function of a survival time. In addition they indicate how the constraints on the number of basis functions and the magnitude of the weights should be chosen to achieve consistency.
The idea of the proof of Theorem 8 is to decompose the overall error into an approximation error and an estimation error. The approximation error can then be removed asymptotically due to the denseness condition imposed on $\mathcal{F}$, and the estimation error is handled by a uniform strong law of large numbers.
Lemma 2 (Decomposition of the $\ell$-error)
Let $\mathcal{F}$ be a class of functions such that $\alpha : \mathcal{T} \times \mathcal{X} \to \mathbb{R}$ for all $\alpha \in \mathcal{F}$, and let $\hat\alpha_n := \operatorname{argmax}_{\alpha \in \mathcal{F}} L_n(\alpha)$. Then
$$\ell(\alpha_0) - \ell(\hat\alpha_n) \le \inf_{\alpha \in \mathcal{F}} \big(\ell(\alpha_0) - \ell(\alpha)\big) + 2 \sup_{\alpha \in \mathcal{F}} \Big|\frac{1}{n} L_n(\alpha) - \ell(\alpha)\Big|.$$
Proof. Let $\alpha^* \in \mathcal{F}$ with $\ell(\alpha^*) - \ell(\alpha_0) = \inf_{\alpha \in \mathcal{F}} \big(\ell(\alpha) - \ell(\alpha_0)\big)$. Since $\ell$ is maximal for $\alpha_0$, the left-hand side of the asserted inequality is nonnegative, and
$$\ell(\alpha_0) - \ell(\hat\alpha_n) = \ell(\alpha_0) - \ell(\alpha^*) + \ell(\alpha^*) - \ell(\hat\alpha_n)$$
$$\le \inf_{\alpha \in \mathcal{F}} \big(\ell(\alpha_0) - \ell(\alpha)\big) + \ell(\alpha^*) - \tfrac{1}{n} L_n(\alpha^*) + \tfrac{1}{n} L_n(\alpha^*) - \tfrac{1}{n} L_n(\hat\alpha_n) + \tfrac{1}{n} L_n(\hat\alpha_n) - \ell(\hat\alpha_n)$$
$$\le \inf_{\alpha \in \mathcal{F}} \big(\ell(\alpha_0) - \ell(\alpha)\big) + 2 \sup_{\alpha \in \mathcal{F}} \Big|\frac{1}{n} L_n(\alpha) - \ell(\alpha)\Big|,$$
since $\hat\alpha_n$ is defined as maximizing $L_n$. □
2.1 Proof of Theorem 8

By Lemma 2 we have
$$\ell(\alpha_0) - \ell(\hat\alpha_n) \le \inf_{\alpha \in \mathcal{F}_n} \big(\ell(\alpha_0) - \ell(\alpha)\big) + 2 \sup_{\alpha \in \mathcal{F}_n} \Big|\frac{1}{n} L_n(\alpha) - \ell(\alpha)\Big|.$$
The second term converges to 0 (a.s.) because of the uniform law of large numbers for the log-likelihood functional, Theorem 10, which is given in the following section. It remains to show that the first term also converges to 0. Since $\alpha_0 \in C(\mathcal{T} \times \mathcal{X})$ and $\mathcal{F}$ is dense in $(C(\mathcal{T} \times \mathcal{X}), \|\cdot\|_\infty)$, for every $\varepsilon > 0$ there exist an $n_0 \in \mathbb{N}$ and functions $\alpha_n \in \mathcal{F}_n$, $n \ge n_0$, such that $\|\alpha_n - \alpha_0\|_\infty < \varepsilon$. With Lemma 1 we conclude that
$$\ell(\alpha_0) - \ell(\alpha_n) = \int_{\mathcal{T} \times \mathcal{X}} \bar F_C\, G(\alpha_n - \alpha_0)\, dP^{(T,X)} \le \max\big(G(\varepsilon), G(-\varepsilon)\big).$$
Since $G(x) \to 0$ for $x \to 0$ this finishes the proof. □
3 A uniform law of large numbers for the log-likelihood functional

The main part of the proof of Theorem 8 is the following uniform law of large numbers for the log-likelihood functional, which is also of interest in itself.

Theorem 10 (LLN for the log-likelihood functional)
Suppose Condition 3 holds for $\mathcal{F}$. Then there are sequences $\beta_n \uparrow \infty$, $K_n \uparrow \infty$ such that $\mathcal{F}_n := \mathcal{F}(\beta_n, K_n) \uparrow \mathcal{F}$ and for all rvs $(T, C, X)$ that satisfy Conditions 1 and 2:
$$\sup_{\alpha \in \mathcal{F}_n} \Big|\frac{1}{n} L_n(\alpha) - \ell(\alpha)\Big| \to 0 \quad \text{a.s.}$$
Convergence holds for any $\beta_n \uparrow \infty$, $K_n \uparrow \infty$ such that
$$K_n \exp(3\beta_n K_n) = O(n^{1/4}).$$

The idea of the proof of Theorem 10 is to derive an exponential maximal inequality for the likelihood functional over function classes. We accomplish this by using some results from empirical process and Vapnik-Červonenkis theory that we recall here.
3.1 Some results from empirical process theory

Here we introduce some results and notation we need to derive probability bounds on the maximal deviation of empirical processes indexed by functions. For the first two definitions cf. [vdV-W].

Definition 11 (Covering Numbers, Packing Numbers)
Let $(T, d)$ be a semimetric space. Then the covering number $N(\varepsilon, T, d)$ is defined as the minimal number of balls of radius $\varepsilon$ needed to cover $T$. The packing number $D(\varepsilon, T, d)$ is defined as the maximal number of points in $T$ such that the distance between each pair of points is strictly larger than $\varepsilon$.
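These two quantities are related by the standard inequalities (which are all that is needed later, when packing numbers are bounded by covering numbers):
$$N(\varepsilon, T, d) \le D(\varepsilon, T, d) \le N\big(\tfrac{\varepsilon}{2}, T, d\big) \qquad \text{for all } \varepsilon > 0.$$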
Definition 12 (Vapnik-Červonenkis class)
Let $\mathcal{C}$ be a collection of subsets of some set $S$. A set $s \subset S$ with $s = \{s_1, \ldots, s_n\}$ is shattered by $\mathcal{C}$ if for any $s' \subset s$ there is a $c' \in \mathcal{C}$ such that $s' = s \cap c'$. The VC-index of $\mathcal{C}$ is the smallest $n \in \mathbb{N}$ such that no set $s \subset S$ with $|s| = n$ can be shattered by $\mathcal{C}$.
A collection $\mathcal{F}$ of measurable functions on some space $\mathcal{Y}$ is called a VC-class of functions (and the associated VC-index is called the VC-dimension) if the collection of all subgraphs of the functions in $\mathcal{F}$ (that is, the sets $\{(x, t) : t < f(x)\}$ for $f \in \mathcal{F}$) is a VC-class of sets in $\mathcal{Y} \times \mathbb{R}$.
Lemma 3 (Pollard)
Let $\mathcal{F}$ be a Vapnik-Červonenkis class with envelope function $F$ and $d = \dim_{VC}(\mathcal{F})$. Then there is a constant $C(d) > 0$ such that
$$N\big(\varepsilon \|F\|_{L_2(\mu)}, \mathcal{F}, d_{L_2(\mu)}\big) \le C(d)\, \varepsilon^{-2(d-1)}$$
for all $0 < \varepsilon \le 1$ and all probability measures $\mu$.

Proof. Cf. Theorem 2.6.7 in [vdV-W].
Lemma 4 (Stability properties of covering numbers)
Let $\mu$ be a probability measure and let $\mathcal{F}$, $\mathcal{G}$ be function classes on $\mathbb{R}^m$ (here $\mathcal{F} \oplus \mathcal{G} := \{f + g \mid f \in \mathcal{F},\ g \in \mathcal{G}\}$, $[-a, a] \cdot \mathcal{F} := \{c f \mid c \in [-a, a],\ f \in \mathcal{F}\}$ and $\mathcal{F} \cdot \mathcal{G} := \{f g \mid f \in \mathcal{F},\ g \in \mathcal{G}\}$). Then
a. $N(\varepsilon + \delta, \mathcal{F} \oplus \mathcal{G}, d_{L_2(\mu)}) \le N(\varepsilon, \mathcal{F}, d_{L_2(\mu)}) \cdot N(\delta, \mathcal{G}, d_{L_2(\mu)})$
b. If $|f| \le K$ for all $f \in \mathcal{F}$, then for $a > 0$
$$N\big(\varepsilon, [-a, a] \cdot \mathcal{F}, d_{L_2(\mu)}\big) \le N\Big(\frac{\varepsilon}{2a}, \mathcal{F}, d_{L_2(\mu)}\Big) \cdot \frac{4aK}{\varepsilon}$$
c. Let $|f| \le K$ for all $f \in \mathcal{F}$ and let $\varphi : [-K, K] \to \mathbb{R}$ be a Lipschitz function with Lipschitz constant $\mathrm{Lip}(\varphi)$; then
$$N\big(\varepsilon, \varphi \circ \mathcal{F}, d_{L_2(\mu)}\big) \le N\Big(\frac{\varepsilon}{\mathrm{Lip}(\varphi)}, \mathcal{F}, d_{L_2(\mu)}\Big)$$
d. If $|f| \le K$ for all $f \in \mathcal{F}$ and $|g| \le L$ for all $g \in \mathcal{G}$, then
$$N\big(\varepsilon, \mathcal{F} \cdot \mathcal{G}, d_{L_2(\mu)}\big) \le N\Big(\frac{\varepsilon}{2L}, \mathcal{F}, d_{L_2(\mu)}\Big) \cdot N\Big(\frac{\varepsilon}{2K}, \mathcal{G}, d_{L_2(\mu)}\Big)$$

Proof. Similar to the calculations in Section 5 in [P] and Chapter 29 in [DGL]. □

Next we introduce some terminology from Section 7 in [P].
Definition 13
For a class of functions $\mathcal{F}$ let $\xi_i : \mathcal{F} \to \mathbb{R}$, $i = 1, \ldots, n$, be independent stochastic processes. Define the random set of points indexed by $\mathcal{F}$,
$$Z_n(\mathcal{F})(\omega) := \big\{ \big(\xi_1(\alpha, \omega), \ldots, \xi_n(\alpha, \omega)\big) \mid \alpha \in \mathcal{F} \big\} \subset \mathbb{R}^n,$$
and the random entropy integral
$$J_n(\mathcal{F})(\omega) := 9 \int_0^{\tau_n(\omega)} \sqrt{\log D_2\big(x, Z_n(\mathcal{F})(\omega)\big)}\, dx,$$
where $\tau_n(\omega) := \sup\{ \|y\|_2 \mid y \in Z_n(\mathcal{F})(\omega) \}$ and $D_2(\cdot, \cdot)$ denotes packing numbers with respect to the Euclidean distance on $\mathbb{R}^n$.
3.2 Maximal Inequality for $L_n$

Definition 14
For $\beta > 0$, $K \in \mathbb{N}$ let
$$\mathcal{G}(\beta, K) := \Big\{ g : \{0,1\} \times \mathcal{T} \times \mathcal{X} \to \mathbb{R},\ (\delta, y, x) \mapsto \delta\,\alpha(y, x) - \int_0^{y} \exp\alpha(u, x)\, du \ \Big|\ \alpha \in \mathcal{F}(\beta, K) \Big\}$$
and for $g \in \mathcal{G}(\beta, K)$
$$\xi_i(g, \omega) := g(\delta_i, y_i, x_i).$$
Proposition 15
Let the rvs $(T, C, X)$ satisfy Conditions 1 and 2. Let $d = \dim_{VC}(\mathcal{F}_0)$ with $\mathcal{F}_0$ as in Condition 3. Then there exists a constant $C(d) > 0$ such that for all $\beta > 0$, $K \in \mathbb{N}$ and $t > 0$:
$$P\Big[\sup_{\alpha \in \mathcal{F}(\beta,K)} \Big|\frac{1}{n} L_n(\alpha) - \ell(\alpha)\Big| \ge t\Big] = P\Big[\sup_{g \in \mathcal{G}(\beta,K)} \Big|\frac{1}{n}\sum_{i=1}^n g(\delta_i, y_i, x_i) - E\, g(\delta, Y, X)\Big| \ge t\Big]$$
$$\le 5 \exp\Big(-\frac{1}{2}\, \frac{n\, t^2}{C(d)^2\, K\, (\beta K + \exp(\beta K))^2}\Big).$$
Proof. For $\beta > 0$, $K \in \mathbb{N}$ let
$$\mathcal{H}(\beta, K) := \big\{ h : \{0,1\} \times \mathcal{T} \times \mathcal{X} \to \mathbb{R},\ (\delta, y, x) \mapsto \delta\,\alpha(y, x) \ \big|\ \alpha \in \mathcal{F}(\beta, K) \big\},$$
$$\mathcal{K}(\beta, K) := \Big\{ k : \{0,1\} \times \mathcal{T} \times \mathcal{X} \to \mathbb{R},\ (\delta, y, x) \mapsto -\int_0^{y} \exp\alpha(u, x)\, du \ \Big|\ \alpha \in \mathcal{F}(\beta, K) \Big\};$$
then we have $\mathcal{G}(\beta, K) \subset \mathcal{H}(\beta, K) \oplus \mathcal{K}(\beta, K)$. Let $\mu_n := \frac{1}{n}\sum_{i=1}^n \varepsilon_{(\delta_i, y_i, x_i)}$ be the empirical measure associated with the data $\{(\delta_1, y_1, x_1), \ldots, (\delta_n, y_n, x_n)\}$ (with $\varepsilon_z$ denoting the Dirac measure at $z$), and let $\tilde\mu_n := \frac{1}{n}\sum_{i=1}^n \varepsilon_{(y_i, x_i)}$, $\tilde{\tilde\mu}_n := \frac{1}{n}\sum_{i=1}^n \varepsilon_{x_i}$. By Lemma 4 a.,
$$N\big(\varepsilon, \mathcal{G}(\beta, K), d_{L_2(\mu_n)}\big) \le N\Big(\frac{\varepsilon}{2}, \mathcal{H}(\beta, K), d_{L_2(\mu_n)}\Big) \cdot N\Big(\frac{\varepsilon}{2}, \mathcal{K}(\beta, K), d_{L_2(\mu_n)}\Big). \qquad (1)$$
For the covering numbers of $\mathcal{H}$ it is easily seen that
$$N\big(\varepsilon, \mathcal{H}(\beta, K), d_{L_2(\mu_n)}\big) \le N\big(\varepsilon, \mathcal{F}(\beta, K), d_{L_2(\tilde\mu_n)}\big). \qquad (2)$$
For the covering numbers of $\mathcal{K}$ we show
$$N\big(\varepsilon, \mathcal{K}(\beta, K), d_{L_2(\mu_n)}\big) \le N\big(\varepsilon, \exp \circ \mathcal{F}(\beta, K), d_{L_2(\tilde{\tilde\mu}_n \otimes U[0,1])}\big), \qquad (3)$$
where $U[0,1]$ is the uniform distribution on $[0,1]$. For $k_1, k_2 \in \mathcal{K}(\beta, K)$ with corresponding $\alpha_1, \alpha_2 \in \mathcal{F}(\beta, K)$ we have
$$d_{L_2(\mu_n)}^2(k_1, k_2) = \frac{1}{n}\sum_{i=1}^n \Big(\int_0^{y_i} \exp\alpha_1(u, x_i) - \exp\alpha_2(u, x_i)\, du\Big)^2$$
$$\le \frac{1}{n}\sum_{i=1}^n y_i \int_0^{y_i} \big(\exp\alpha_1(u, x_i) - \exp\alpha_2(u, x_i)\big)^2\, du \qquad \text{(by Jensen's inequality)}$$
$$\le \frac{1}{n}\sum_{i=1}^n \int_0^{1} \big(\exp\alpha_1(u, x_i) - \exp\alpha_2(u, x_i)\big)^2\, du = d_{L_2(\tilde{\tilde\mu}_n \otimes U[0,1])}^2(\exp\alpha_1, \exp\alpha_2).$$
Applying Lemma 4 c. to (3) then yields
$$N\big(\varepsilon, \mathcal{K}(\beta, K), d_{L_2(\mu_n)}\big) \le N\Big(\frac{\varepsilon}{\exp(\beta K)}, \mathcal{F}(\beta, K), d_{L_2(\tilde{\tilde\mu}_n \otimes U[0,1])}\Big),$$
and we conclude with (1) and (2) that
$$N\big(\varepsilon, \mathcal{G}(\beta, K), d_{L_2(\mu_n)}\big) \le N\Big(\frac{\varepsilon}{2}, \mathcal{F}(\beta, K), d_{L_2(\tilde\mu_n)}\Big) \cdot N\Big(\frac{\varepsilon}{2\exp(\beta K)}, \mathcal{F}(\beta, K), d_{L_2(\tilde{\tilde\mu}_n \otimes U[0,1])}\Big). \qquad (4)$$
Now let $\nu$ be any probability measure on $\mathcal{T} \times \mathcal{X}$. Then for all $\varepsilon > 0$
$$N\big(\varepsilon, \mathcal{F}(\beta, K), d_{L_2(\nu)}\big) \le N\Big(\varepsilon, \bigoplus_{i=1}^K [-\beta, \beta] \cdot \mathcal{F}_0, d_{L_2(\nu)}\Big)$$
$$\le N\Big(\frac{\varepsilon}{K}, [-\beta, \beta] \cdot \mathcal{F}_0, d_{L_2(\nu)}\Big)^K \qquad \text{(Lemma 4 a.)}$$
$$\le \Big( N\Big(\frac{\varepsilon}{2\beta K}, \mathcal{F}_0, d_{L_2(\nu)}\Big) \cdot \frac{4\beta K}{\varepsilon} \Big)^K \qquad \text{(Lemma 4 b.)}$$
$$\le \Big( C(d)\Big(\frac{2\beta K}{\varepsilon}\Big)^{2(d-1)} \cdot \frac{4\beta K}{\varepsilon} \Big)^K \qquad \text{(Lemma 3)}$$
$$= C(d)^K \Big(\frac{\beta K}{\varepsilon}\Big)^{K(2d-1)} \qquad \text{with a new } C(d).$$
Since
$$D_2\big(\varepsilon, Z_n(\mathcal{G}(\beta, K))(\omega)\big) \le N\Big(\frac{\varepsilon}{2}, Z_n(\mathcal{G}(\beta, K))(\omega)\Big) = N\Big(\frac{\varepsilon}{2}, \mathcal{G}(\beta, K), \sqrt n\, d_{L_2(\mu_n)}\Big) = N\Big(\frac{\varepsilon}{2\sqrt n}, \mathcal{G}(\beta, K), d_{L_2(\mu_n)}\Big)$$
by the definition of $N$ in [P],
we obtain
$$D_2\big(\varepsilon, Z_n(\mathcal{G}(\beta, K))(\omega)\big) \le C(d)^K\, (\beta K)^{2K(2d-1)}\, \exp(\beta K)^{K(2d-1)}\, \Big(\frac{\sqrt n}{\varepsilon}\Big)^{2K(2d-1)}. \qquad (5)$$
Now let $a_n := \sqrt n\, (\beta K + \exp(\beta K))$. Then $\tau_n(\omega) \le a_n$ (where $\tau_n(\omega)$ is defined in Definition 13). So
$$J_n(\mathcal{G}(\beta, K))(\omega) \le 9 \int_0^{a_n} \sqrt{\log D_2\big(x, Z_n(\mathcal{G}(\beta, K))(\omega)\big)}\, dx = 9\, a_n \int_0^{1} \sqrt{\log D_2\big(a_n y, Z_n(\mathcal{G}(\beta, K))(\omega)\big)}\, dy \qquad (6)$$
using the substitution $y := \frac{x}{a_n}$. Estimate (5) yields
$$D_2\big(a_n y, Z_n(\mathcal{G}(\beta, K))(\omega)\big) \le C(d)^K\, \Big(\frac{(\beta K)^2 \exp(\beta K)}{(\beta K + \exp(\beta K))^2}\Big)^{K(2d-1)} \Big(\frac{1}{y}\Big)^{2K(2d-1)}$$
and, since the function $\mathbb{R}_+ \to \mathbb{R}$, $x \mapsto \frac{x^2 \exp(x)}{(x + \exp(x))^2}$, is bounded,
$$D_2\big(a_n y, Z_n(\mathcal{G}(\beta, K))(\omega)\big) \le C(d)^K \Big(\frac{1}{y}\Big)^{2K(2d-1)}$$
with a new $C(d)$, and so
$$\log D_2\big(a_n y, Z_n(\mathcal{G}(\beta, K))(\omega)\big) \le K \log C(d) + 2K(2d-1)\log\frac{1}{y}.$$
Since $\sqrt{a + b} \le \sqrt a + \sqrt b$ for $a, b > 0$ this leads to
$$\sqrt{\log D_2\big(a_n y, Z_n(\mathcal{G}(\beta, K))(\omega)\big)} \le \sqrt{K \log C(d)} + \sqrt{2K(2d-1)\log\frac{1}{y}}.$$
Because
$$\int_0^1 \sqrt{\log\frac{1}{y}}\, dy < \infty$$
we can conclude that
$$\int_0^1 \sqrt{\log D_2\big(a_n y, Z_n(\mathcal{G}(\beta, K))(\omega)\big)}\, dy \le C(d)\sqrt K.$$
Therefore, inequality (6) implies
$$J_n(\mathcal{G}(\beta, K))(\omega) \le 9 \sqrt n\, (\beta K + \exp(\beta K))\, C(d)\sqrt K = C(d)\, \sqrt n\, \sqrt K\, (\beta K + \exp(\beta K)).$$
This means that the rv $J_n(\mathcal{G}(\beta, K))$ is bounded by the constant (rv) $C(d)\sqrt n\,\sqrt K\,(\beta K + \exp(\beta K))$. Hence we can use equation (7.3) of [P] to obtain the maximal inequality
$$P\Big[\sup_{g \in \mathcal{G}(\beta,K)} \Big|\frac{1}{n}\sum_{i=1}^n g(\delta_i, y_i, x_i) - E\, g(\delta, Y, X)\Big| \ge t\Big] \le 5 \exp\Big(-\frac{1}{2}\Big(\frac{\sqrt n\, t}{C(d)\sqrt K\,(\beta K + \exp(\beta K))}\Big)^2\Big)$$
(our $\sup_{g \in \mathcal{G}(\beta,K)} |\frac{1}{n}\sum_{i=1}^n g(\delta_i, y_i, x_i) - E\, g(\delta, Y, X)|$ corresponds to $\frac{1}{n}$ times the supremum of the process considered in [P])
$$= 5 \exp\Big(-\frac{1}{2}\, \frac{n\, t^2}{C(d)^2\, K\, (\beta K + \exp(\beta K))^2}\Big). \qquad \Box$$
We note that the bound (4) in the proof holds true for any function class $\mathcal{F}$ that is uniformly bounded by some constant $M > 0$:
$$N\big(\varepsilon, \mathcal{G}, d_{L_2(\mu_n)}\big) \le N\Big(\frac{\varepsilon}{2}, \mathcal{F}, d_{L_2(\tilde\mu_n)}\Big) \cdot N\Big(\frac{\varepsilon}{2\exp M}, \mathcal{F}, d_{L_2(\tilde{\tilde\mu}_n \otimes U[0,1])}\Big),$$
where $\mathcal{G} = \{ g_\alpha \mid \alpha \in \mathcal{F} \}$ is defined similarly to Definition 14. Since the behavior of the random entropy integral is determined by the random covering numbers $N(\cdot, \mathcal{G}, d_{L_2(\mu_n)})$, the key to proving the maximal inequality (and therefore the LLN for the log-likelihood functional) lies in obtaining suitable bounds for the random covering numbers $N(\cdot, \mathcal{F}, d_{L_2(\tilde\mu_n)})$ and $N(\cdot, \mathcal{F}, d_{L_2(\tilde{\tilde\mu}_n \otimes U[0,1])})$. This means that a result similar to Proposition 15 (and therefore to Theorem 10) is true for any function class $\mathcal{F}$ for which $N(\varepsilon, \mathcal{F}, d_{L_2(\nu)})$ increases sufficiently slowly (for instance polynomially) in $\frac{1}{\varepsilon}$ for any probability measure $\nu$.
3.3 Proof of Theorem 10

We let $K_n \uparrow \infty$, $\beta_n \uparrow \infty$ so that
$$K_n \exp(3\beta_n K_n) = O(n^{1/4}).$$
Let $\mathcal{F}_n := \mathcal{F}(\beta_n, K_n)$ and, for fixed $\varepsilon > 0$,
$$A_n := \Big\{ \sup_{\alpha \in \mathcal{F}_n} \Big|\frac{1}{n} L_n(\alpha) - \ell(\alpha)\Big| \ge \varepsilon \Big\}.$$
From Proposition 15 we get for large $n \in \mathbb{N}$
$$P(A_n) \le 5 \exp\Big(-\frac{n^{1/4}}{C}\Big).$$
Since for large $n \in \mathbb{N}$
$$\exp\Big(-\frac{n^{1/4}}{C}\Big) \le \frac{1}{n^2},$$
we have
$$\sum_{n \in \mathbb{N}} P(A_n) < \infty,$$
and we apply the Borel-Cantelli lemma to conclude the proof. □
Acknowledgement. I thank Prof. Dr. Ludger Rüschendorf for a very careful reading of the manuscript and for numerous suggestions that greatly improved the presentation of the paper.
References

[DGL] Devroye, L., Györfi, L., Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, Springer, New York.

[FH] Fleming, T.R., Harrington, D.P. (1991), Counting Processes and Survival Analysis, Wiley & Sons, New York.

[HSW] Hornik, K., Stinchcombe, M., White, H. (1989), Multilayer Feedforward Networks are Universal Approximators, Neural Networks, 2, 359-366.

[KST] Kooperberg, C., Stone, C.J., Truong, Y.K. (1995), The $L_2$ Rate of Convergence for Hazard Regression, Scandinavian Journal of Statistics, 22, 143-157.

[PS] Park, J., Sandberg, I. (1993), Approximation and Radial-Basis-Function Networks, Neural Computation, 5, 305-316.

[P] Pollard, D. (1990), Empirical Processes: Theory and Applications, Institute of Mathematical Statistics.

[vdV-W] van der Vaart, A., Wellner, J.A. (1996), Weak Convergence and Empirical Processes, Springer.