Neurocomputing 74 (2011) 516–521
Estimation of learning rate of least square algorithm via Jackson operator

Yongquan Zhang a, Feilong Cao b,*, Zongben Xu a

a Institute for Information and System Sciences, Xi'an Jiaotong University, Xi'an 710049, Shaanxi Province, PR China
b Department of Information and Mathematics Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, PR China
Article history: Received 25 December 2009; received in revised form 8 July 2010; accepted 24 August 2010; available online 28 October 2010. Communicated by G.-B. Huang.

Keywords: Learning theory; Covering number; Rate of convergence; Jackson operator

Abstract: In this paper, the regression problem in learning theory is investigated by least square schemes in polynomial space. Results concerning the estimation of the rate of convergence are derived. In particular, it is shown that for a smooth regression function of one variable, the estimation achieves a good rate of convergence. As the main tool in the study, the Jackson operator from approximation theory is used to estimate the rate. Finally, the obtained estimation is illustrated by applying it to simulated data.

Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.
1. Introduction
In many applications there is usually no a priori information about the regression function. Therefore, it is necessary to apply least square methods in the study of the regression problem. This paper considers the regression problem in learning theory. The upper bound of the learning rate is estimated by using a general probability inequality and the Jackson operator.
It is well known that the regression problem is an important one in learning theory (e.g., [2,14,9,11]), and there have been many studies on its convergence (see [17,5,3,13]). These methods minimize a kind of least square risk of the regression estimation over a reproducing kernel Hilbert space (see [6,16]). In this paper, we instead consider a function set consisting of polynomial functions on $X=[-1,1]$, over which we minimize the least square risk.
In regression analysis, an $\mathbb{R}\times\mathbb{R}$-valued random vector $(X,Y)$ with $EY^2<\infty$ is considered, and the dependency of $Y$ on the value of $X$ is of interest. More precisely, the goal of the regression problem is to find a function $f:\mathbb{R}\to\mathbb{R}$ such that $f(X)$ is a good approximation of $Y$. In the sequel, we assume that the main aim of the analysis is the minimization of the mean squared prediction error, or $L^2$ risk,
$$\mathcal{E}(f)=E\{|f(x)-y|^2\}.$$
The function that minimizes this error is called the regression function; it is given by
$$m(x)=E\{Y\,|\,X=x\},\quad x\in\mathbb{R}.$$
Indeed, let $f:\mathbb{R}\to\mathbb{R}$ be an arbitrary measurable function on $\mathbb{R}$, and denote the distribution of $X$ by $\mu$. The well-known relation
$$E\{|f(x)-y|^2\}=E\{|m(x)-y|^2\}+\int_{\mathbb{R}}(f(x)-m(x))^2\,\mu(dx)$$
(see [10,16]) implies that the regression function is the optimal predictor in view of minimization of the $L^2$ risk:
$$E\{|m(x)-y|^2\}=\min_{f:\mathbb{R}\to\mathbb{R}}E\{|f(x)-y|^2\}.$$
In addition, any measurable function $f$ is a good predictor, in the sense that its $L^2$ risk is close to the optimal value, if and only if the $L^2$ risk
$$\mathcal{E}(f)=E\{|f(x)-y|^2\}\qquad(1)$$
is small. This motivates us to measure the error caused by using the function $f$ instead of the regression function by (1).
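For completeness, the relation above follows from a one-line bias calculation; a sketch under the stated assumptions ($EY^2<\infty$, $f$ measurable):
$$E\{|f(X)-Y|^2\}=E\{|m(X)-Y|^2\}+E\{(f(X)-m(X))^2\}+2E\{(f(X)-m(X))(m(X)-Y)\}=E\{|m(X)-Y|^2\}+\int_{\mathbb{R}}(f(x)-m(x))^2\,\mu(dx),$$
since conditioning on $X$ gives $E\{(f(X)-m(X))(m(X)-Y)\,|\,X\}=(f(X)-m(X))(m(X)-E\{Y|X\})=0$.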
The distribution of the sample is in general unknown; hence the regression function is also unknown. But it is often possible to observe samples drawn according to the distribution, which leads to the regression estimation problem. Let $z=\{z_i\}_{i=1}^{n}=\{(x_i,y_i)\}_{i=1}^{n}$ be independent and identically distributed random samples drawn on $X\times Y$ ($Y\subset\mathbb{R}$). Our goal is to construct an estimator
$$f_z(\cdot)=f(\cdot,z)$$
of the regression function such that the $L^2$ error
$$\int_X (f_z(x)-m(x))^2\,\mu(dx)$$
is small.

Throughout this paper, we assume that $|y|\le M$ for some $M\in\mathbb{R}_+$; then $|m(x)|\le M$ for any $x\in[-1,1]$. We also need to impose a smoothness condition on the regression function.

(The research was supported by the National 973 Project (2007CB311002) and the National Natural Science Foundation of China (nos. 90818020, 60873206). Corresponding author: F. Cao, e-mail address [email protected].)
Definition 1 (See Xie and Zhou [18]). Let $k$ be a natural number. For a continuous function $f:[-1,1]\to\mathbb{R}$, we define the $k$-th difference of the function $f$:
$$\Delta_h^k f(x)=\sum_{j=0}^{k}(-1)^{k-j}\binom{k}{j}f(x+jh).$$
Then the $k$-th order continuous modulus of $f$ is defined by
$$\omega_k(f,t)=\max_{-1\le x,\,x+kh\le 1,\;0<h\le t}|\Delta_h^k f(x)|.$$

Definition 2. Let $k$ be a natural number and let $f$ be defined on $[-1,1]$. If the $k$-th order derivative $f^{(k)}$ exists and there is a constant $C>0$ such that
$$|f^{(k)}(x)-f^{(k)}(y)|\le C|x-y|\quad\text{for all } x,y\in[-1,1],$$
then we say that $f^{(k)}$ belongs to the Lipschitz class, written $f^{(k)}\in\mathrm{Lip}_C 1$.

It is well known that the learning process needs some structure at its beginning. This structure (called the hypothesis space) usually takes the form of a function space (e.g., a polynomial space, the space of continuous functions, etc.). A familiar hypothesis space is the polynomial space, which has been used in [4,23].

In the sequel, we introduce the polynomial functions (see [18]) on $X=[-1,1]$. Let
$$H_d=\Bigl\{f:\mathbb{R}\to\mathbb{R},\;f(x)=\sum_{i=0}^{d}a_i x^i,\;x\in[-1,1],\;a_i\in\mathbb{R},\;i=0,1,\dots,d\Bigr\}.$$
Obviously, the dimension of $H_d$ is $d+1$.

We consider $\mathcal{F}_d=\{f\in H_d:|f(x)|\le M+C^*,\,x\in[-1,1]\}$ as the hypothesis space, where $C^*=CC_k$, $C_k$ is given in Proposition 1 and $C$ is the Lipschitz constant given in Definition 2.

The estimator $f_z$ is given by
$$f_z=\arg\min_{f\in\mathcal{F}_d}\Bigl(\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-y_i)^2+\lambda\|a(f)\|_2^2\Bigr),\qquad(2)$$
where $\|a(f)\|_2^2=\sum_{i=0}^{d}|a_i|^2$ for $f(x)=\sum_{i=0}^{d}a_i x^i$, and $\lambda$ is a regularization parameter.
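To make the estimator (2) concrete, here is a minimal computational sketch. It fits the coefficient vector $a=(a_0,\dots,a_d)$ by ridge-regularized least squares in the monomial basis; the sup-norm constraint defining $\mathcal{F}_d$ is only imitated by truncating the fitted values, and all names (fit_polynomial_ls, etc.) are illustrative rather than taken from the paper.

import numpy as np

def fit_polynomial_ls(x, y, d, lam, bound=None):
    """Ridge-regularized least squares in the monomial basis, a sketch of estimator (2):
    minimize (1/n) * sum((f(x_i) - y_i)^2) + lam * ||a||_2^2 over a in R^(d+1)."""
    V = np.vander(x, N=d + 1, increasing=True)       # V[i, j] = x_i ** j
    n = len(x)
    # Normal equations of the penalized problem: (V^T V / n + lam * I) a = V^T y / n
    A = V.T @ V / n + lam * np.eye(d + 1)
    a = np.linalg.solve(A, V.T @ y / n)

    def f(t):
        val = np.vander(np.atleast_1d(t), N=d + 1, increasing=True) @ a
        if bound is not None:                        # crude surrogate for the bound in F_d
            val = np.clip(val, -bound, bound)
        return val

    return a, f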
We will analyze the rate of convergence of the estimator $f_z$. The efficiency of algorithm (2) is measured by the difference between $f_z$ and $m$. According to the definition of $m(x)$, we have
$$\int_X (f_z(x)-m(x))^2\,\mu(dx)=\mathcal{E}(f_z)-\mathcal{E}(m).$$
Let
$$\mathcal{E}_z(f)=\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-y_i)^2;$$
it is a discretization of $\mathcal{E}(f)$. Therefore, $f_z$ can be written as
$$f_z=\arg\min_{f\in\mathcal{F}_d}\{\mathcal{E}_z(f)+\lambda\|a(f)\|_2^2\}.$$

Theorem 1. Suppose that $\mathcal{F}_d$ is defined as above and $|m(x)|\le M$ for any $x\in[-1,1]$. Let $\lambda=1/(n(1+\|a(Q_d(m,\cdot))\|_2^2))$, and let $f_z$ be given by (2). Then for all $0<\delta<1$, with confidence $1-\delta$, we have
$$\mathcal{E}(f_z)-\mathcal{E}(m)\le\frac{128(2M+C^*)^2(d+1)\log\frac{32n(2M+C^*)^2}{M^2}}{9n}+\frac{88(2M+C^*)^2\log\frac{4}{\delta}}{9n}+2\int_X(Q_d(m,x)-m(x))^2\,\mu(dx)+\frac{3}{n},$$
where $Q_d(m,\cdot)$ is the Jackson operator with respect to $m$, which will be given in Section 2.

The approximation result of Theorem 1 implies Corollary 1, which gives the rate of convergence for a smooth regression function.

Corollary 1. Suppose that $\mathcal{F}_d$ is defined as above, $|m(x)|\le M$ for any $x\in[-1,1]$, and $m^{(k)}\in\mathrm{Lip}_C 1$. Let
$$d=\Biggl[\Biggl(\frac{9C^* n}{64(2M+C^*)^2\log\frac{32n(2M+C^*)^2}{M^2}}\Biggr)^{1/(2k+1)}\Biggr],\qquad \lambda=\frac{1}{n(1+\|a(Q_d(m,\cdot))\|_2^2)}.$$
Then for all $0<\delta<1$, with confidence $1-\delta$, there holds
$$\mathcal{E}(f_z)-\mathcal{E}(m)\le 4C^*\Biggl(\frac{128(2M+C^*)^2\log\frac{32n(2M+C^*)^2}{M^2}}{9C^* n}\Biggr)^{2k/(2k+1)}+\frac{88(2M+C^*)^2\log\frac{4}{\delta}}{9n}+\frac{3}{n},$$
where $[b]$ denotes the integer part of the real number $b$.
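To read the bound in Corollary 1 at a glance: up to constants and the logarithmic factor, it gives a learning rate of order $n^{-2k/(2k+1)}$. For instance (this numerical instance is ours, not from the paper), for a regression function with a Lipschitz first derivative ($k=1$),
$$\mathcal{E}(f_z)-\mathcal{E}(m)=O\bigl((n^{-1}\log n)^{2/3}\bigr),\qquad d\asymp\Bigl(\frac{n}{\log n}\Bigr)^{1/3},$$
so smoother targets (larger $k$) allow higher-degree polynomial spaces and a rate approaching $O(n^{-1})$.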
The remainder of this paper is organized as follows. In Section 2, we introduce the Jackson operator used in this paper. In Section 3, the estimation is illustrated by applying it to simulated data. We give the proofs of Theorem 1 and Corollary 1 in Section 4. Finally, we conclude the paper with the obtained results.

2. Approximation of Jackson operator

In this paper we obtain the convergence rate of algorithm (2) by using the Jackson operator on $X=[-1,1]$. The Jackson operator (see [18,12]) plays an important role in approximation theory. For two natural numbers $d$, $r$, taking $q=[d/r]+1$, the Jackson kernel is defined by
$$K_{dr}(t)=L_{q,r}(t)=\frac{1}{\lambda_{qr}}\Biggl(\frac{\sin\frac{qt}{2}}{\sin\frac{t}{2}}\Biggr)^{2r},\qquad(3)$$
where
$$\lambda_{qr}=\int_{-\pi}^{\pi}\Biggl(\frac{\sin\frac{qt}{2}}{\sin\frac{t}{2}}\Biggr)^{2r}dt.$$

Lemma 1 (See Xie and Zhou [18]). Let $K_{dr}(t)$ be defined by (3). Then $K_{dr}(t)$ is a trigonometric polynomial of order $d$, and
$$\int_{-\pi}^{\pi}K_{dr}(t)\,dt=1,\qquad \int_{-\pi}^{\pi}t^k K_{dr}(t)\,dt\le C_k(d+1)^{-k},\quad k=0,1,\dots,2r-2.$$
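As a quick numerical sanity check of (3) and Lemma 1 (our own illustration, with a hypothetical helper name), one can evaluate the Jackson kernel on a grid and verify that it integrates to one and that its $k$-th absolute moment decays roughly like $(d+1)^{-k}$:

import numpy as np

def jackson_kernel(d, r, num=20000):
    """Evaluate the Jackson kernel K_{dr} of (3) on a uniform grid over [-pi, pi]."""
    q = d // r + 1                                   # q = [d/r] + 1
    t = np.linspace(-np.pi, np.pi, num)
    s = np.sin(t / 2)
    core = np.full_like(t, float(q) ** (2 * r))      # limit value of the ratio at t = 0
    mask = np.abs(s) > 1e-12
    core[mask] = (np.sin(q * t[mask] / 2) / s[mask]) ** (2 * r)
    lam = np.trapz(core, t)                          # numerical lambda_{qr}
    return t, core / lam

for d in (5, 10, 20, 40):
    r, k = 2, 1
    t, K = jackson_kernel(d, r)
    mass = np.trapz(K, t)                            # should be close to 1 (Lemma 1)
    moment = np.trapz(np.abs(t) ** k * K, t)         # shrinks roughly like (d + 1)^(-k)
    print(d, round(mass, 4), round(moment, 4))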
Let $f(x)=f(\cos u)=g(u)$ for $u=\arccos x$. By using the kernel $K_{dr}(t)$, we define the Jackson operator on $[-1,1]$ (see [18]):
$$Q_d(f,x)=J_d(g,u)=-\int_{-\pi}^{\pi}K_{dr}(t)\sum_{j=1}^{k}(-1)^{j}\binom{k}{j}g(u+jt)\,dt,$$
where $r$ is the minimum integer satisfying $r\ge[(k+2)/2]$.

Lemma 2 (See Xie and Zhou [18]). Let $g\in C[-\pi,\pi]$, $l=0,1,\dots$, $j=1,2,\dots$. When $l$ is not exactly divisible by $j$, we have
$$\int_{-\pi}^{\pi}g(jt)e^{ilt}\,dt=0,$$
where $C[a,b]$ denotes the set of all continuous functions on $[a,b]$.
From Lemma 2 and the fact that
$$\int_{-\pi}^{\pi}g(u+jt)\cos lt\,dt=\int_{-\pi}^{\pi}g(jt)\cos l\Bigl(t-\frac{u}{j}\Bigr)dt=\int_{-\pi}^{\pi}g(jt)\Bigl(\cos lt\cos\frac{lu}{j}+\sin lt\sin\frac{lu}{j}\Bigr)dt,\qquad(4)$$
if $l$ is not exactly divisible by $j$, then (4) equals $0$; otherwise, (4) is a trigonometric polynomial of order $l/j$ in $u$. From the above discussion, we know that $K_{dr}$ is a trigonometric polynomial of order $d$. Then $J_d(g,u)$ is a linear combination of $\int_{-\pi}^{\pi}g(u+jt)\cos lt\,dt$ for $1\le j\le k$, $0\le l\le d$; i.e., $J_d(g,u)$ is a trigonometric polynomial of order at most $d$ in $u$. Therefore, $Q_d(f,x)=J_d(g,\arccos x)$ with $u=\arccos x$ is a polynomial of order $d$ in $x$.

Proposition 1. Let $k$ be a natural number and $f\in C[-1,1]$. For $d=0,1,\dots$, there holds
$$|f(x)-Q_d(f,x)|\le C_k\,\omega_k\Bigl(f,\frac{1}{d+1}\Bigr),\quad\forall x\in[-1,1],$$
where $C_k$ is a constant depending on $k$.

Proof. From Lemma 2 and the definition of $Q_d(f,x)$, we have, for $u=\arccos x$,
$$|f(x)-Q_d(f,x)|=|g(u)-J_d(g,u)|=\Bigl|\int_{-\pi}^{\pi}K_{dr}(t)\Delta_t^k g(u)\,dt\Bigr|\le 2\int_{0}^{\pi}K_{dr}(t)\,\omega_k(g,t)\,dt.$$
From the definition of the smoothness modulus, we know
$$\omega_k(g,t)\le(1+(d+1)t)^k\,\omega_k\Bigl(g,\frac{1}{d+1}\Bigr).$$
For $k\le 2r-2$, applying Lemma 1, we get for $d=0,1,\dots$
$$|f(x)-Q_d(f,x)|\le 2\,\omega_k\Bigl(g,\frac{1}{d+1}\Bigr)\int_{0}^{\pi}(1+(d+1)t)^k K_{dr}(t)\,dt\le C_k\,\omega_k\Bigl(g,\frac{1}{d+1}\Bigr)$$
for any $x\in[-1,1]$.

For any $t>0$ and $u\in[-\pi,\pi]$ with $u+t\in[-\pi,\pi]$, there holds $|\cos(u+t)-\cos u|\le|t|$. Hence, for $s=\cos t$, we obtain
$$\sup_{|t|\le h}|\Delta_t^k g(u)|\le\sup_{|\cos t|\le h}|\Delta_t^k g(u)|=\sup_{|s|\le h}|\Delta_s^k f(x)|,$$
so that $\omega_k(g,h)\le\omega_k(f,h)$. Combining this with the above inequality, we obtain
$$|f(x)-Q_d(f,x)|\le C_k\,\omega_k\Bigl(f,\frac{1}{d+1}\Bigr)$$
for any $x\in[-1,1]$. The proof of Proposition 1 is completed. □
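Proposition 1 can also be checked numerically. The sketch below (our own illustration; it assumes the jackson_kernel helper defined after Lemma 1 and uses $k=2$, $r=2$, which satisfy $r\ge[(k+2)/2]$ and $k\le 2r-2$) approximates $Q_d(f,x)=J_d(g,\arccos x)$ by quadrature and reports the maximal deviation from $f$, which should shrink as $d$ grows:

from math import comb
import numpy as np

def jackson_operator(f, d, k=2, r=2, num=4000):
    """Evaluate x -> Q_d(f, x) = J_d(g, arccos x), with g(u) = f(cos u), by quadrature."""
    t, K = jackson_kernel(d, r, num)                 # helper from the sketch after Lemma 1
    def Q(x):
        u = np.arccos(np.clip(x, -1.0, 1.0))
        acc = np.zeros_like(u, dtype=float)
        for j in range(1, k + 1):
            # g(u + j*t) = f(cos(u + j*t)); the cosine keeps the shifted argument valid
            g_vals = f(np.cos(u[:, None] + j * t[None, :]))
            acc -= (-1) ** j * comb(k, j) * np.trapz(K * g_vals, t, axis=1)
        return acc
    return Q

f = lambda x: np.abs(x) ** 1.5                       # a C^1 function on [-1, 1]
xs = np.linspace(-1.0, 1.0, 201)
for d in (5, 10, 20, 40):
    err = np.max(np.abs(f(xs) - jackson_operator(f, d)(xs)))
    print(d, float(err))                             # should decrease as d grows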
3. Applying to simulated data

We choose the function in $\mathcal{F}_d$ by minimizing this risk with respect to the parameter $a=(a_0,\dots,a_d)\in\mathbb{R}^{d+1}$. To compute the estimation in the polynomial space, we need to minimize
$$\frac{1}{n}\sum_{i=1}^{n}\Bigl(\sum_{j=0}^{d}a_j x_i^j-y_i\Bigr)^2+\lambda\|a\|_2^2$$
for given $x_1,x_2,\dots,x_n\in[-1,1]$ and $y_1,y_2,\dots,y_n\in\mathbb{R}$ with respect to $a_j\in\mathbb{R}$, $j=0,1,\dots,d$.

We may solve this minimization problem exactly by setting the gradient with respect to $a$ to zero. In the sequel, we illustrate the estimation by applying it to a few simulated data sets. Here we define $(X,Y)$ by
$$Y=m(X)+\sigma\cdot\varepsilon,$$
where $X$ is uniformly distributed on $[-1,1]$, $\varepsilon$ is standard normally distributed and independent of $X$, and $\sigma>0$. In Figs. 1 and 2, we choose $\sigma=1$ and use two different univariate regression functions in order to define two different data sets of size $n=500$. Each figure shows the true regression function with its formula, a corresponding sample of size $n=500$, and our estimation applied to this sample.

[Fig. 1. $m(x)=3(0.5x^2-1)^2+2x/(\cos x+2)$ with $n=500$, $\sigma=1$.]
[Fig. 2. $m(x)=8x/(x^2+2)$ with $n=500$, $\sigma=1$.]
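The experiment in this section can be reproduced in outline as follows (a sketch under our reading of the setup above; it reuses the hypothetical fit_polynomial_ls helper from Section 1, and the choices of d and lam here are illustrative, not the data-driven ones of Corollary 1):

import numpy as np

rng = np.random.default_rng(0)

def simulate_and_fit(m, n=500, sigma=1.0, d=8, lam=1e-3, M=4.0):
    x = rng.uniform(-1.0, 1.0, size=n)               # X ~ U[-1, 1]
    y = m(x) + sigma * rng.standard_normal(n)        # Y = m(X) + sigma * eps
    _, f_hat = fit_polynomial_ls(x, y, d=d, lam=lam, bound=M)
    grid = np.linspace(-1, 1, 401)
    # root mean squared deviation on a grid, a proxy for the L2(mu) error with uniform mu
    l2_error = np.sqrt(np.mean((f_hat(grid) - m(grid)) ** 2))
    return f_hat, l2_error

m1 = lambda x: 3 * (0.5 * x ** 2 - 1) ** 2 + 2 * x / (np.cos(x) + 2)   # Fig. 1
m2 = lambda x: 8 * x / (x ** 2 + 2)                                    # Fig. 2
for name, m in (("Fig. 1", m1), ("Fig. 2", m2)):
    _, err = simulate_and_fit(m)
    print(name, "empirical L2 error of the fit:", round(float(err), 3))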
4. Proof of Theorem 1

According to the definition of $f_z$, it is easy to obtain
$$\mathcal{E}(f_z)-\mathcal{E}(m)\le\mathcal{E}(f_z)-\mathcal{E}(m)+\lambda\|a(f_z)\|_2^2\le|\{\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))\}|+|\{\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m)-(\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m))\}|+D(\lambda),\qquad(5)$$
where $D(\lambda)=\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)+\lambda\|a(Q_d(m,\cdot))\|_2^2$ and $Q_d(m,\cdot)$ is the Jackson operator with respect to $m$.

We first estimate $|\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)-(\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m))|$ in (5). Concerning the random variable $\xi=(Q_d(m,x)-y)^2-(m(x)-y)^2$, we need the following probability inequality.

Lemma 3 (See van der Vaart and Wellner [1]). Let $P$ be a probability measure on $Z=X\times Y$ and let $z_1=(x_1,y_1),\dots,z_n=(x_n,y_n)$ be independent random variables distributed according to $P$. Given a function $g:Z\to\mathbb{R}$, set $S=\sum_{i=1}^{n}g(z_i)$, let $b=\|g\|_\infty$ and put $\sigma^2=nEg^2$. Then
$$\mathrm{Prob}_{z\in Z^n}\{|S-ES|\ge t\}\le 2\exp\Biggl\{-\frac{t^2}{2\bigl(\sigma^2+\frac{bt}{3}\bigr)}\Biggr\}.$$
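In the arguments below, Lemma 3 is applied with $g(z)=\frac{1}{n}((Q_d(m,x)-y)^2-(m(x)-y)^2)$ (or its analogue for $f\in\mathcal{F}_d$), for which the following bookkeeping, spelled out here for the reader's convenience (our own remark), holds:
$$S=\sum_{i=1}^{n}g(z_i)=\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m),\qquad ES=\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m),$$
so that $|S-ES|$ is exactly the quantity bounded in Theorem 2 below.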
Using Lemma 3, we obtain the following theorem.

Theorem 2. For every $0<\delta<1$, with confidence $1-\delta/2$, there holds
$$|\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)-(\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m))|\le\frac{8(4M+C^*)^2}{3n}\log\frac{4}{\delta}+\frac{1}{2}D(\lambda).$$

Proof. Let $g(z)=\frac{1}{n}((Q_d(m,x)-y)^2-(m(x)-y)^2)$. From Proposition 1, we know that
$$\|Q_d(m,\cdot)\|_\infty\le C_k\,\omega_k\Bigl(m,\frac{1}{d+1}\Bigr)+\|m\|_\infty,$$
where $\|m\|_\infty=\max_{x\in[-1,1]}|m(x)|$. Since $|m(x)|\le M$ for any $x\in[-1,1]$, we obtain
$$\|Q_d(m,\cdot)\|_\infty\le M+\frac{CC_k}{(d+1)^k}\le M+C^*,$$
where $C^*=CC_k$. For any $z\in Z$ we have
$$|g(z)|=\frac{1}{n}|(Q_d(m,x)-2y+m(x))(Q_d(m,x)-m(x))|\le\frac{(4M+C^*)^2}{n}.$$
Hence $\|g\|_\infty\le(4M+C^*)^2/n=:b$. Moreover, we get
$$E(g^2)=\frac{1}{n^2}E\bigl((Q_d(m,x)-2y+m(x))^2(Q_d(m,x)-m(x))^2\bigr)\le\frac{(4M+C^*)^2}{n^2}\bigl(\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)\bigr)\le\frac{(4M+C^*)^2}{n}\,Eg.$$
Now we apply Lemma 3 with $t=\sqrt{\varepsilon(\varepsilon+\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m))}$ to $g=\frac{1}{n}((Q_d(m,x)-y)^2-(m(x)-y)^2)$. It asserts that for every $\varepsilon>0$, with confidence at least
$$1-2\exp\Biggl\{-\frac{\varepsilon(\varepsilon+\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m))}{2\Bigl(\frac{(4M+C^*)^2}{n}(\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m))+\frac{(4M+C^*)^2}{3n}\sqrt{\varepsilon(\varepsilon+\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m))}\Bigr)}\Biggr\}\ge 1-2\exp\Biggl\{-\frac{3n\varepsilon}{8(4M+C^*)^2}\Biggr\},$$
there holds
$$\frac{|\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)-(\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m))|}{\sqrt{\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)+\varepsilon}}\le\sqrt{\varepsilon}.$$
Recalling the elementary inequality $ab\le\frac{1}{2}(a^2+b^2)$ for all $a,b\in\mathbb{R}$, we have
$$|\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)-(\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m))|\le\frac{\varepsilon}{2}+\frac{1}{2}\bigl(\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)+\varepsilon\bigr)\le\varepsilon+\frac{1}{2}D(\lambda).$$
Let $\delta/2=2\exp\{-3n\varepsilon/(8(4M+C^*)^2)\}$; then
$$\varepsilon=\frac{8(4M+C^*)^2}{3n}\log\frac{4}{\delta}.$$
Therefore, with confidence $1-\delta/2$, there holds
$$|\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)-(\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m))|\le\frac{8(4M+C^*)^2}{3n}\log\frac{4}{\delta}+\frac{1}{2}D(\lambda).$$
The proof of Theorem 2 is completed. □
For the first part in (5), we shall bound it by using the covering number of the unit ball $B_1$ in $\mathcal{F}_d$.

Definition 3 (See Zhou [21]). Let $\mathcal{F}$ be a subset of a metric space. For any $\varepsilon>0$, the covering number $\mathcal{N}(\mathcal{F},\varepsilon)$ is defined to be the minimal integer $l$ such that there exist $l$ balls with radius $\varepsilon$ covering $\mathcal{F}$.

The covering number is used widely in the literature (see [7,15,22,8,19,20]). Let $B_R$ be a ball in $\mathcal{F}_d$ with radius $R$. Since the dimension of $\mathcal{F}_d$ is $d+1$, we know that (see [21])
$$\log\mathcal{N}(B_R,\varepsilon)\le(d+1)\log\frac{4R}{\varepsilon}.\qquad(6)$$
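For orientation (our own remark, using the standard volumetric argument rather than the derivation in [21]): a ball of radius $R$ in a $(d+1)$-dimensional normed space can be covered by at most $(1+2R/\varepsilon)^{d+1}$ balls of radius $\varepsilon$, so for $\varepsilon\le 2R$ one obtains a bound of the same form as (6):
$$\log\mathcal{N}(B_R,\varepsilon)\le(d+1)\log\Bigl(1+\frac{2R}{\varepsilon}\Bigr)\le(d+1)\log\frac{4R}{\varepsilon}.$$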
The first part of (5) is bounded by the following proposition.

Proposition 2. For all $\varepsilon>0$, we have
$$\mathrm{Prob}_{z\in Z^n}\{|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\ge\varepsilon\}\le 2\exp\Biggl\{(d+1)\log\frac{32(2M+C^*)^2}{\varepsilon}-\frac{3n(3\varepsilon-D(\lambda))}{64(2M+C^*)^2}\Biggr\}.$$

Proof. From the definition of $f_z$, we know that
$$|\mathcal{E}(f_z)-\mathcal{E}_z(f_z)-\mathcal{E}(m)+\mathcal{E}_z(m)|\le\sup_{f\in\mathcal{F}_d}|\mathcal{E}(f)-\mathcal{E}_z(f)-\mathcal{E}(m)+\mathcal{E}_z(m)|.$$
Moreover, since
$$|(y-h(x))^2-(y-g(x))^2|=|(h(x)-g(x))(h(x)+g(x)-2y)|\le(4M+2C^*)\|h-g\|_\infty,\quad h,g\in\mathcal{F}_d,$$
it follows that
$$|\mathcal{E}(h)-\mathcal{E}_z(h)-\mathcal{E}(g)+\mathcal{E}_z(g)|\le 2(4M+2C^*)\|h-g\|_\infty,\quad h,g\in\mathcal{F}_d.$$
Let $U=\{f_1,f_2,\dots,f_l\}\subset\mathcal{F}_d$ be a $\gamma$-net of $\mathcal{F}_d$ of size $l=\mathcal{N}(\mathcal{F}_d,\gamma)$. So we have
$$\sup_{f\in\mathcal{F}_d}|\mathcal{E}(f)-\mathcal{E}_z(f)-\mathcal{E}(m)+\mathcal{E}_z(m)|\le\sup_{f\in U}|\mathcal{E}(f)-\mathcal{E}_z(f)-\mathcal{E}(m)+\mathcal{E}_z(m)|+2(4M+2C^*)\gamma.$$
In the same way as in Theorem 2, for any $f_i\in U$ there holds
$$\mathrm{Prob}_{z\in Z^n}\{|\mathcal{E}(f_i)-\mathcal{E}(m)-(\mathcal{E}_z(f_i)-\mathcal{E}_z(m))|\ge\varepsilon\}\le 2\exp\Biggl\{-\frac{3n\bigl(\varepsilon-\frac{1}{2}D(\lambda)\bigr)}{8(4M+2C^*)^2}\Biggr\},$$
which implies that
$$\mathrm{Prob}_{z\in Z^n}\{|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\ge\varepsilon\}\le\mathrm{Prob}_{z\in Z^n}\Bigl\{\sup_{f\in\mathcal{F}_d}|\mathcal{E}(f)-\mathcal{E}(m)-(\mathcal{E}_z(f)-\mathcal{E}_z(m))|\ge\varepsilon\Bigr\}$$
$$\le\mathrm{Prob}_{z\in Z^n}\Bigl\{\sup_{f\in U}|\mathcal{E}(f)-\mathcal{E}(m)-(\mathcal{E}_z(f)-\mathcal{E}_z(m))|\ge\varepsilon-2(4M+2C^*)\gamma\Bigr\}$$
$$\le\mathcal{N}(\mathcal{F}_d,\gamma)\sup_{f\in U}\mathrm{Prob}_{z\in Z^n}\{|\mathcal{E}(f)-\mathcal{E}(m)-(\mathcal{E}_z(f)-\mathcal{E}_z(m))|\ge\varepsilon-2(4M+2C^*)\gamma\}$$
$$\le 2\mathcal{N}(\mathcal{F}_d,\gamma)\exp\Biggl\{-\frac{3n\bigl(\varepsilon-2(4M+2C^*)\gamma-\frac{1}{2}D(\lambda)\bigr)}{8(4M+2C^*)^2}\Biggr\}.$$
We take $\gamma=\varepsilon/(8(2M+C^*))$; then
$$\mathrm{Prob}_{z\in Z^n}\{|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\ge\varepsilon\}\le 2\mathcal{N}\Bigl(\mathcal{F}_d,\frac{\varepsilon}{8(2M+C^*)}\Bigr)\exp\Biggl\{-\frac{3n(3\varepsilon-D(\lambda))}{64(2M+C^*)^2}\Biggr\}.$$
From the definition of $\mathcal{F}_d$, we know that $\|f\|_\infty\le M+C^*$ for $f\in\mathcal{F}_d$. Combining this with (6), we have
$$\log\mathcal{N}\Bigl(\mathcal{F}_d,\frac{\varepsilon}{8(2M+C^*)}\Bigr)\le(d+1)\log\frac{32(2M+C^*)^2}{\varepsilon}.$$
Therefore,
$$\mathrm{Prob}_{z\in Z^n}\{|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\ge\varepsilon\}\le 2\exp\Biggl\{(d+1)\log\frac{32(2M+C^*)^2}{\varepsilon}-\frac{3n(3\varepsilon-D(\lambda))}{64(2M+C^*)^2}\Biggr\}.$$
The proof of Proposition 2 is finished. □

From Theorem 2 and Proposition 2, we can now turn to the proof of Theorem 1.

Proof of Theorem 1. We use the following error decomposition:
$$\mathcal{E}(f_z)-\mathcal{E}(m)\le|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|+|\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m)-(\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m))|+D(\lambda)=:T_1+T_2+D(\lambda).\qquad(7)$$
We begin with bounding $T_1$ in (7), and discuss the two cases $\varepsilon\ge M^2/n$ and $\varepsilon<M^2/n$.

(i) When $\varepsilon\ge M^2/n$, we know from Proposition 2 that
$$\mathrm{Prob}_{z\in Z^n}\{|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\le\varepsilon\}\ge 1-2\exp\Biggl\{(d+1)\log\frac{32(2M+C^*)^2}{\varepsilon}-\frac{3n(3\varepsilon-D(\lambda))}{64(2M+C^*)^2}\Biggr\}\ge 1-2\exp\Biggl\{(d+1)\log\frac{32n(2M+C^*)^2}{M^2}-\frac{3n(3\varepsilon-D(\lambda))}{64(2M+C^*)^2}\Biggr\}.$$
Let
$$2\exp\Biggl\{(d+1)\log\frac{32n(2M+C^*)^2}{M^2}-\frac{3n(3\varepsilon-D(\lambda))}{64(2M+C^*)^2}\Biggr\}=\frac{\delta}{2}.$$
Then we have
$$\varepsilon=\frac{64(2M+C^*)^2(d+1)\log\frac{32n(2M+C^*)^2}{M^2}}{9n}+\frac{64(2M+C^*)^2\log\frac{4}{\delta}}{9n}+\frac{D(\lambda)}{3}.$$
So there holds
$$|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\le\frac{64(2M+C^*)^2(d+1)\log\frac{32n(2M+C^*)^2}{M^2}}{9n}+\frac{64(2M+C^*)^2\log\frac{4}{\delta}}{9n}+\frac{D(\lambda)}{3}$$
with confidence $1-\delta/2$.

(ii) When $\varepsilon\le M^2/n$, we know from Proposition 2 that
$$|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\le\frac{M^2}{n}$$
with confidence $1-\delta/2$.

Combining the cases $\varepsilon>M^2/n$ and $\varepsilon\le M^2/n$, there holds
$$|\mathcal{E}(f_z)-\mathcal{E}(m)-(\mathcal{E}_z(f_z)-\mathcal{E}_z(m))|\le\frac{64(2M+C^*)^2(d+1)\log\frac{32n(2M+C^*)^2}{M^2}}{9n}+\frac{64(2M+C^*)^2\log\frac{4}{\delta}}{9n}+\frac{D(\lambda)}{3}$$
with confidence $1-\delta/2$.

To estimate $T_2$, by using Theorem 2 we know that, with confidence $1-\delta/2$, there holds
$$|\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)-(\mathcal{E}_z(Q_d(m,\cdot))-\mathcal{E}_z(m))|\le\frac{8(4M+C^*)^2}{3n}\log\frac{4}{\delta}+\frac{1}{2}D(\lambda).$$

We bound $D(\lambda)$ in (7) by taking $\lambda=1/(n(\|a(Q_d(m,\cdot))\|_2^2+1))$:
$$D(\lambda)=\mathcal{E}(Q_d(m,\cdot))-\mathcal{E}(m)+\lambda\|a(Q_d(m,\cdot))\|_2^2=\int_X(Q_d(m,x)-m(x))^2\,\mu(dx)+\frac{1}{n}.$$
Combining the upper bounds of $T_1$, $T_2$ and $D(\lambda)$, there holds
$$\mathcal{E}(f_z)-\mathcal{E}(m)\le\frac{64(2M+C^*)^2(d+1)\log\frac{32n(2M+C^*)^2}{M^2}}{9n}+\frac{88(2M+C^*)^2\log\frac{4}{\delta}}{9n}+2\int_X(Q_d(m,x)-m(x))^2\,\mu(dx)+\frac{3}{n}$$
with confidence $1-\delta$. The proof of Theorem 1 is completed. □

In order to prove Corollary 1, we need to estimate
$$\int_X(Q_d(m,x)-m(x))^2\,\mu(dx).$$
From (4), we know that $Q_d(m,x)$ is a polynomial of order $d$, and Proposition 1 tells us
$$|Q_d(m,x)|=|J_d(m,\arccos x)|\le|m(x)|+\omega_k\Bigl(m,\frac{1}{d+1}\Bigr).$$
Since $m^{(k)}\in\mathrm{Lip}_C 1$, we get
$$\omega_k\Bigl(m,\frac{1}{d+1}\Bigr)\le\frac{CC_k}{(d+1)^k}=\frac{C^*}{(d+1)^k}\le C^*,$$
where $C^*=CC_k$ is a constant depending on $k$. So we obtain
$$|Q_d(m,x)|=|J_d(m,\arccos x)|\le|m(x)|+\omega_k\Bigl(m,\frac{1}{d+1}\Bigr)\le M+C^*,\quad\forall x\in[-1,1].$$
Hence $Q_d(m,x)\in\mathcal{F}_d$.

From Proposition 1, we know
$$\int_X(Q_d(m,x)-m(x))^2\,\mu(dx)\le\|Q_d(m,\cdot)-m\|_\infty^2\le\Bigl(C_k\,\omega_k\Bigl(m,\frac{1}{d+1}\Bigr)\Bigr)^2\le\frac{C^{*2}}{d^{2k}}.$$
Combining Theorem 1 with the above inequality, we get
$$\mathcal{E}(f_z)-\mathcal{E}(m)\le\frac{128(2M+C^*)^2 d\log\frac{32n(2M+C^*)^2}{M^2}}{9n}+\frac{2C^{*2}}{d^{2k}}+\frac{88(2M+C^*)^2\log\frac{4}{\delta}}{9n}+\frac{3}{n},$$
and this expression is minimized for
$$d=\Biggl[\Biggl(\frac{9C^*n}{64(2M+C^*)^2\log\frac{32n(2M+C^*)^2}{M^2}}\Biggr)^{1/(2k+1)}\Biggr].$$
With confidence $1-\delta$, there holds
$$\mathcal{E}(f_z)-\mathcal{E}(m)\le 4C^*\Biggl(\frac{128(2M+C^*)^2\log\frac{32n(2M+C^*)^2}{M^2}}{9C^*n}\Biggr)^{2k/(2k+1)}+\frac{88(2M+C^*)^2\log\frac{4}{\delta}}{9n}+\frac{3}{n}.$$
The proof of Corollary 1 is finished. □
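The choice of $d$ above is the usual balancing of the two $d$-dependent terms; schematically (our own summary of the calculation, ignoring constants and the logarithmic factor),
$$\frac{d}{n}+\frac{1}{d^{2k}}\ \text{is minimized at}\ d\asymp n^{1/(2k+1)},\ \text{where}\ \frac{d}{n}+\frac{1}{d^{2k}}\asymp n^{-2k/(2k+1)},$$
which is exactly the exponent appearing in Corollary 1.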
5. Conclusions
In this paper, explicit upper bounds on the learning rate have been derived by using least square schemes in polynomial space. In particular, the estimated bounds achieve a good rate of convergence for a smooth regression function of one variable. To our knowledge, these bounds improve, to some extent, the previously known bounds under the smoothness condition. In the proofs, the Jackson operator from approximation theory and a general probability inequality were used. The obtained error estimation has also been illustrated by applying it to simulated data.
References
[1] A.W. van der Vaart, J.A. Wellner, Weak Convergence and Empirical Processes, Springer, 1996.
[2] A.M. Bagirov, C. Clausen, M. Kohler, Estimation of a regression function by maxima of minima of linear functions, IEEE Trans. Inform. Theory 55 (2009) 833–845.
[3] A. Caponnetto, E. DeVito, Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. 7 (2007) 331–368.
[4] D. Chen, Q. Wu, Y. Ying, D.X. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learn. Res. 5 (2004) 1143–1175.
[5] F. Cucker, S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001) 1–49.
[6] F. Cucker, S. Smale, Best choices for regularization parameters in learning theory, Found. Comput. Math. 1 (2002) 413–428.
[7] Y. Guo, P.L. Bartlett, J. Shawe-Taylor, R.C. Williamson, Covering numbers for support vector machines, IEEE Trans. Inform. Theory 48 (2002) 239–250.
[8] M. Pontil, A note on different covering numbers in learning theory, J. Complexity 19 (2003) 665–671.
[9] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inform. Theory 44 (1998) 1926–1940.
[10] S. Smale, D.X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (1) (2003) 17–41.
[11] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007) 153–172.
[12] Y.S. Sun, Function Approximation Theory (I), Beijing Normal University Press, 1989 (in Chinese).
[13] V.N. Temlyakov, Approximation in Learning Theory, IMI Preprints, vol. 5, 2005, pp. 1–42.
[14] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[15] R.C. Williamson, A.J. Smola, B. Schölkopf, Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, IEEE Trans. Inform. Theory 47 (2001) 2516–2532.
[16] Q. Wu, Y.M. Ying, D.X. Zhou, Learning theory: from regression to classification, in: Topics in Multivariate Approximation and Interpolation, Elsevier B.V., Amsterdam, 2004.
[17] Q. Wu, Y.M. Ying, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2007) 171–192.
[18] T.F. Xie, S.P. Zhou, Real Function Approximation Theory, Hangzhou University Press, 1998 (in Chinese).
[19] T. Zhang, Effective dimension and generalization of kernel learning, NIPS, 2002, pp. 454–461.
[20] T. Zhang, Leave-one-out bounds for kernel methods, Neural Comput. 13 (2003) 1397–1437.
[21] D.X. Zhou, The covering number in learning theory, J. Complexity 18 (2002) 739–767.
[22] D.X. Zhou, Capacity of reproducing kernel Hilbert spaces in learning theory, IEEE Trans. Inform. Theory 49 (2003) 1734–1752.
[23] D.X. Zhou, K. Jetter, Approximation with polynomial kernels and SVM classifiers, Adv. Comput. Math. 25 (2006) 323–344.

Yongquan Zhang received the M.S. degree from China Jiliang University in 2007. He is currently pursuing the Ph.D. degree at Xi'an Jiaotong University. His research interests include neural networks and machine learning.
Feilong Cao was born in Zhejiang Province, China, in August 1965. He received the B.S. degree in mathematics and the M.S. degree in applied mathematics from Ningxia University, China, in 1987 and 1998, respectively. In 2003, he received the Ph.D. degree from the Institute for Information and System Sciences, Xi'an Jiaotong University, China. From 1987 to 1992 he was an assistant professor, and during 1992–2002 an associate professor. He is now a professor at China Jiliang University. His current research interests include neural networks, machine learning and approximation theory. He is the author or coauthor of more than 100 scientific papers.
Zong-Ben Xu received the M.S. degree in mathematics in 1981 and the Ph.D. degree in applied mathematics in 1987 from Xi'an Jiaotong University, China. In 1998, he was a postdoctoral researcher in the Department of Mathematics, The University of Strathclyde, United Kingdom. At The Chinese University of Hong Kong, he worked as a research fellow in the Information Engineering Department from February 1992 to March 1994, the Center for Environmental Studies from April 1995 to August 1995, and the Mechanical Engineering and Automation Department from September 1996 to October 1996. From January 1995 to April 1995, he was a research fellow in the Department of Computing at The Hong Kong Polytechnic University. He has been with the Faculty of Science and the Institute for Information and System Sciences at Xi'an Jiaotong University since 1982, where he was promoted to associate professor in 1987 and full professor in 1991; he now serves as an authorized Ph.D. supervisor in mathematics and computer science, vice president of Xi'an Jiaotong University, and director of the Institute for Information and System Sciences.
Professor Xu has published 150 academic papers. His current interests are in computational intelligence (particularly neural networks, evolutionary computation, and data mining theory and algorithms), nonlinear functional analysis and the geometry of Banach spaces.