Neurocomputing 74 (2011) 516–521
Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Estimation of learning rate of least square algorithm via Jackson operator

Yongquan Zhang (a), Feilong Cao (b,*), Zongben Xu (a)
(a) Institute for Information and System Sciences, Xi'an Jiaotong University, Xi'an 710049, Shaanxi Province, PR China
(b) Department of Information and Mathematics Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, PR China
* Corresponding author. E-mail address: [email protected] (F. Cao).
The research was supported by the National 973 Project (2007CB311002) and the National Natural Science Foundation of China (nos. 90818020, 60873206).

Article history: Received 25 December 2009; received in revised form 8 July 2010; accepted 24 August 2010; available online 28 October 2010. Communicated by G.-B. Huang.

Abstract: In this paper, the regression problem in learning theory is investigated by least square schemes in polynomial space, and estimates of the rate of convergence are derived. In particular, it is shown that for a smooth univariate regression function the estimator achieves a good rate of convergence. The Jackson operator from approximation theory is used as the main tool to estimate the rate. Finally, the obtained estimate is illustrated on simulated data.

Keywords: Learning theory; Covering number; Rate of convergence; Jackson operator

Crown Copyright (c) 2010 Published by Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2010.08.023

1. Introduction

In many applications there is no a priori information about the regression function, so it is natural to apply least square methods to the regression problem. This paper considers the regression problem in learning theory; the upper bound of the learning rate is estimated by means of a general probability inequality and the Jackson operator.

The regression problem is a central one in learning theory (e.g., [2,14,9,11]), and its convergence has been studied extensively (see [17,5,3,13]). These studies minimize a least square risk of the regression estimator over a reproducing kernel Hilbert space (see [6,16]). In this paper we instead consider a set of polynomial functions on X = [−1,1], over which we minimize the least square risk.

In regression analysis, an R × R-valued random vector (X,Y) with E Y^2 < ∞ is considered, and the dependency of Y on the value of X is of interest. More precisely, the goal of the regression problem is to find a function f: R → R such that f(X) is a good approximation of Y. In the sequel we assume that the aim of the analysis is to minimize the mean squared prediction error, or L^2 risk,

    E(f) = E{|f(x) - y|^2}.

The function minimizing this error is called the regression function; it is given by

    m(x) = E{Y | X = x},  x in R.

Indeed, let f: R → R be an arbitrary measurable function and denote the distribution of X by \mu. The well-known relation

    E{|f(x) - y|^2} = E{|m(x) - y|^2} + \int_{R} (f(x) - m(x))^2 \mu(dx)

(see [10,16]) implies that the regression function is the optimal predictor with respect to the L^2 risk:

    E{|m(x) - y|^2} = \min_{f: R \to R} E{|f(x) - y|^2}.

Moreover, a measurable function f is a good predictor, in the sense that its L^2 risk is close to this optimal value, if and only if the L^2 risk

    E(f) = E{|f(x) - y|^2}                                                         (1)

is small. This motivates measuring the error caused by using f instead of the regression function by (1).
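For completeness, the decomposition displayed above follows from a short conditioning argument; the sketch below is the standard derivation and is not spelled out in the paper itself.

    E{|f(X) - Y|^2} = E{|f(X) - m(X)|^2} + 2 E{(f(X) - m(X))(m(X) - Y)} + E{|m(X) - Y|^2}
                    = \int_{R} (f(x) - m(x))^2 \mu(dx) + E{|m(X) - Y|^2},

since the cross term vanishes after conditioning on X:
E[(f(X) - m(X))(m(X) - Y) | X] = (f(X) - m(X))(m(X) - E[Y | X]) = 0.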
We know that the distribution of the sample is unknown in general, and hence so is the regression function. Often, however, it is possible to observe samples drawn according to this distribution, which leads to the regression estimation problem. Let z = {z_i}_{i=1}^{n} = {(x_i, y_i)}_{i=1}^{n} be independent and identically distributed samples drawn on X × Y (Y a subset of R). Our goal is to construct an estimator f_z(·) = f(·, z) of the regression function such that the L^2 error

    \int_X (f_z(x) - m(x))^2 \mu(dx)

is small. Throughout this paper we assume that |y| <= M for some M in R_+, so that |m(x)| <= M for every x in [−1,1]. We also need to impose a smoothness condition on the regression function.

Definition 1 (See Xie and Zhou [18]). Let k be a natural number. For a continuous function f: [−1,1] → R, the k-th difference of f is

    \Delta_h^k f(x) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} f(x + jh).

The k-th order modulus of continuity of f is then defined by

    \omega_k(f,t) = \max_{-1 \le x,\, x+kh \le 1,\ 0 < h \le t} |\Delta_h^k f(x)|.

Definition 2. Let k be a natural number and let f be defined on [−1,1]. If the k-th derivative f^{(k)} exists and there is a constant C > 0 such that

    |f^{(k)}(x) - f^{(k)}(y)| \le C |x - y|   for all x, y in [−1,1],

then we say that f^{(k)} belongs to the Lipschitz class, written f^{(k)} in Lip_C 1.

It is well known that a learning process needs some structure at its start. This structure, called the hypothesis space, usually takes the form of a function space (e.g., a polynomial space or a space of continuous functions). A familiar hypothesis space is the polynomial space, which has been used in [4,23]. In the sequel we use the polynomial functions (see [18]) on X = [−1,1]:

    H_d = { f: R \to R,\ f(x) = \sum_{i=0}^{d} a_i x^i,\ x in [-1,1],\ a_i in R,\ i = 0,1,\dots,d }.

Obviously, the dimension of H_d is d + 1. We consider

    F_d = { f in H_d : |f(x)| \le M + C^*,\ x in [-1,1] }

as the hypothesis space, where C^* = C C_k, C_k is the constant given in Proposition 1 and C is the Lipschitz constant of Definition 2. The estimator f_z is given by

    f_z = \arg\min_{f \in F_d} ( \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|a(f)\|_2^2 ),              (2)

where \|a(f)\|_2^2 = \sum_{i=0}^{d} |a_i|^2 for f(x) = \sum_{i=0}^{d} a_i x^i, and \lambda is a regularization parameter.

We will analyze the rate of convergence of the estimator f_z. The efficiency of algorithm (2) is measured by the difference between f_z and m. By the definition of m(x),

    \int_X (f_z(x) - m(x))^2 \mu(dx) = E(f_z) - E(m).

Let

    E_z(f) = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2,

which is the discretization of E(f) on the sample. Therefore f_z can be written as

    f_z = \arg\min_{f \in F_d} { E_z(f) + \lambda \|a(f)\|_2^2 }.
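Concretely, since every f in F_d is determined by its coefficient vector a = (a_0, ..., a_d), the penalized empirical risk in (2) is an ordinary ridge regression in the monomial features 1, x, ..., x^d and can be minimized in closed form (the uniform bound |f(x)| <= M + C^* defining F_d matters only for the analysis and is ignored here). The following Python sketch is ours, not the authors'; the function names are illustrative.

import numpy as np

def fit_polynomial_ridge(x, y, d, lam):
    """Minimize (1/n) * sum_i (sum_j a_j x_i^j - y_i)^2 + lam * ||a||_2^2.

    Setting the gradient to zero gives the normal equations
    (V^T V / n + lam * I) a = V^T y / n, with V the Vandermonde matrix.
    Returns the coefficient vector a = (a_0, ..., a_d).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    V = np.vander(x, N=d + 1, increasing=True)          # V[i, j] = x_i**j
    A = V.T @ V / n + lam * np.eye(d + 1)
    b = V.T @ y / n
    return np.linalg.solve(A, b)

def evaluate_polynomial(a, x):
    """Evaluate f(x) = sum_j a_j x^j at the points x."""
    return np.vander(np.asarray(x, dtype=float), N=len(a), increasing=True) @ a

On [−1,1] the monomial basis can become ill-conditioned for large d; switching to a Chebyshev basis would be a standard remedy, but the estimator itself is unchanged.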
Theorem 1. Suppose that F_d is defined as above and |m(x)| <= M for every x in [−1,1]. Let \lambda = 1/(n(1 + \|a(Q_d(m,\cdot))\|_2^2)) and let f_z be given by (2). Then for all 0 < \delta < 1, with confidence 1 − \delta,

    E(f_z) - E(m) \le \frac{128 (2M + C^*)^2 (d+1) \log\frac{32 n (2M + C^*)^2}{M^2}}{9n}
                    + \frac{88 (2M + C^*)^2 \log\frac{4}{\delta}}{9n}
                    + 2 \int_X (Q_d(m,x) - m(x))^2 \mu(dx) + \frac{3}{n},

where Q_d(m,\cdot) is the Jackson operator applied to m, which will be defined in Section 2.

The approximation result of Theorem 1 implies Corollary 1, which gives the rate of convergence for a smooth regression function.

Corollary 1. Suppose that F_d is defined as above, |m(x)| <= M for every x in [−1,1], and m^{(k)} in Lip_C 1. Let

    d = \Bigl[ \Bigl( \frac{9 C^* n}{64 (2M + C^*)^2 \log\frac{32 n (2M + C^*)^2}{M^2}} \Bigr)^{1/(2k+1)} \Bigr],
    \qquad \lambda = \frac{1}{n (1 + \|a(Q_d(m,\cdot))\|_2^2)}.

Then for all 0 < \delta < 1, with confidence 1 − \delta,

    E(f_z) - E(m) \le 4 C^* \Bigl( \frac{128 (2M + C^*)^2 \log\frac{32 n (2M + C^*)^2}{M^2}}{9 C^* n} \Bigr)^{2k/(2k+1)}
                    + \frac{88 (2M + C^*)^2 \log\frac{4}{\delta}}{9n} + \frac{3}{n},

where [b] denotes the integer part of the real number b.
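As a purely numerical illustration of Corollary 1 (not part of the paper), the snippet below evaluates the prescribed degree d and the leading term of the bound for a few sample sizes, using the constants exactly as stated above; the values k = 1, M = 1 and C* = 1 are arbitrary choices, and the confidence term involving log(4/delta) is omitted.

import numpy as np

k, M, Cs = 1, 1.0, 1.0   # arbitrary illustrative values of k, M and C*

for n in (10**3, 10**4, 10**5, 10**6):
    L = np.log(32 * n * (2 * M + Cs) ** 2 / M ** 2)
    d = int((9 * Cs * n / (64 * (2 * M + Cs) ** 2 * L)) ** (1 / (2 * k + 1)))
    lead = 4 * Cs * (128 * (2 * M + Cs) ** 2 * L / (9 * Cs * n)) ** (2 * k / (2 * k + 1))
    print(f"n = {n:7d}   d = {d:3d}   leading term = {lead:.4f}")

The leading term decays like (log n / n)^{2k/(2k+1)}, which is the rate asserted by Corollary 1.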
The remainder of this paper is organized as follows. In Section 2 we introduce the Jackson operator used in this paper. In Section 3 the estimator is illustrated by applying it to simulated data. The proofs of Theorem 1 and Corollary 1 are given in Section 4. Finally, Section 5 concludes the paper with the obtained results.

2. Approximation of Jackson operator

In this paper we obtain the convergence rate of algorithm (2) by means of the Jackson operator on X = [−1,1]. The Jackson operator (see [18,12]) plays an important role in approximation theory. For two natural numbers d and r, let q = [d/r] + 1. The Jackson kernel is defined by

    K_{d,r}(t) = L_{q,r}(t) = \frac{1}{\lambda_{q,r}} \Bigl( \frac{\sin(qt/2)}{\sin(t/2)} \Bigr)^{2r},               (3)

where

    \lambda_{q,r} = \int_{-\pi}^{\pi} \Bigl( \frac{\sin(qt/2)}{\sin(t/2)} \Bigr)^{2r} dt.

Lemma 1 (See Xie and Zhou [18]). Let K_{d,r}(t) be defined by (3). Then K_{d,r}(t) is a trigonometric polynomial of degree d, and

    \int_{-\pi}^{\pi} K_{d,r}(t)\,dt = 1,
    \qquad
    \int_{-\pi}^{\pi} |t|^k K_{d,r}(t)\,dt \le C_k (d+1)^{-k},   k = 0,1,\dots,2r-2.

Let f(x) = f(cos u) = g(u) with u = arccos x. Using the kernel K_{d,r}(t), the Jackson operator on [−1,1] is defined by (see [18])

    Q_d(f,x) = J_d(g,u) = -\int_{-\pi}^{\pi} K_{d,r}(t) \sum_{j=1}^{k} (-1)^{j} \binom{k}{j} g(u + jt)\,dt,

where r is the smallest integer satisfying r >= (k+2)/2.

Lemma 2 (See Xie and Zhou [18]). Let g in C[−\pi,\pi], and let l = 0,1,... and j = 1,2,.... If l is not divisible by j, then

    \int_{-\pi}^{\pi} g(jt)\, e^{i l t}\,dt = 0,

where C[a,b] denotes the set of all continuous functions on [a,b].

From Lemma 2 and the identity

    \int_{-\pi}^{\pi} g(u+jt)\cos(lt)\,dt
      = \int_{-\pi}^{\pi} g(jt)\cos\bigl(l(t - u/j)\bigr)\,dt
      = \int_{-\pi}^{\pi} g(jt)\Bigl(\cos(lt)\cos\frac{lu}{j} + \sin(lt)\sin\frac{lu}{j}\Bigr)dt,                     (4)

we see that if l is not divisible by j, then (4) equals 0; otherwise (4) is a trigonometric polynomial of degree l/j in u. From the above discussion, K_{d,r} is a trigonometric polynomial of degree d, so J_d(g,u) is a linear combination of the integrals \int_{-\pi}^{\pi} g(u+jt)\cos(lt)\,dt for 1 <= j <= k and 0 <= l <= d; that is, J_d(g,u) is a trigonometric polynomial of degree at most d. Therefore Q_d(f,x) = J_d(g, arccos x), with u = arccos x, is an algebraic polynomial of degree at most d in x.

Proposition 1. Let k be a natural number and f in C[−1,1]. For d = 0,1,...,

    |f(x) - Q_d(f,x)| \le C_k\, \omega_k\Bigl(f, \frac{1}{d+1}\Bigr)   for all x in [-1,1],

where C_k is a constant depending only on k.

Proof. From Lemma 2 and the definition of Q_d(f,x), we have, for u = arccos x,

    |f(x) - Q_d(f,x)| = |g(u) - J_d(g,u)|
      = \Bigl| \int_{-\pi}^{\pi} K_{d,r}(t)\, \Delta_t^k g(u)\,dt \Bigr|
      \le 2 \int_{0}^{\pi} K_{d,r}(t)\, \omega_k(g,t)\,dt.

From the definition of the modulus of smoothness,

    \omega_k(g,t) \le (1 + (d+1)t)^k\, \omega_k\Bigl(g, \frac{1}{d+1}\Bigr).

For k <= 2r − 2, applying Lemma 1, we get, for d = 0,1,...,

    |f(x) - Q_d(f,x)| \le 2\,\omega_k\Bigl(g, \frac{1}{d+1}\Bigr) \int_{0}^{\pi} (1 + (d+1)t)^k K_{d,r}(t)\,dt
                      \le C_k\, \omega_k\Bigl(g, \frac{1}{d+1}\Bigr)

for any x in [−1,1]. Moreover, for any t > 0 and u, u + t in [−\pi,\pi] we have |cos(u+t) − cos u| <= |t|, so that with s = cos t,

    \sup_{|t| \le h} |\Delta_t^k g(u)| \le \sup_{|s| \le h} |\Delta_s^k f(x)|,

and hence \omega_k(g,h) <= \omega_k(f,h). Combining this with the above inequality gives

    |f(x) - Q_d(f,x)| \le C_k\, \omega_k\Bigl(f, \frac{1}{d+1}\Bigr)

for any x in [−1,1]. The proof of Proposition 1 is complete. □
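To make the kernel concrete, here is a small numerical sketch (ours, not the authors') that builds K_{d,r} on a grid, normalizes it by lambda_{q,r}, and checks the two properties in Lemma 1: unit mass and decay of the k-th moment at the rate (d+1)^{-k}. The grid size and the choices r = 2, k = 2 are arbitrary.

import numpy as np

def trapezoid(f, t):
    """Simple trapezoidal rule, to keep the sketch free of version-specific APIs."""
    return float(np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2)

def jackson_kernel(d, r, num=20001):
    """Grid on [-pi, pi] and normalized Jackson kernel K_{d,r}(t), with q = [d/r] + 1."""
    q = d // r + 1
    t = np.linspace(-np.pi, np.pi, num)
    # sin(qt/2)/sin(t/2) has a removable singularity at t = 0 with limit q.
    with np.errstate(invalid="ignore", divide="ignore"):
        ratio = np.where(t == 0.0, float(q), np.sin(q * t / 2) / np.sin(t / 2))
    L = ratio ** (2 * r)
    return t, L / trapezoid(L, t)          # divide by lambda_{q,r}

r, k = 2, 2                                 # r is the smallest integer with r >= (k+2)/2
for d in (4, 8, 16, 32, 64):
    t, K = jackson_kernel(d, r)
    mass = trapezoid(K, t)                               # should be close to 1
    moment = trapezoid(np.abs(t) ** k * K, t)            # should be O((d+1)^(-k))
    print(f"d = {d:2d}   mass = {mass:.6f}   moment * (d+1)^k = {moment * (d + 1) ** k:.3f}")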
3. Applying to simulated data

We choose the function in F_d by minimizing the penalized empirical risk with respect to the coefficient vector a = (a_0, ..., a_d) in R^{d+1}. To compute the estimator in polynomial space we need to minimize

    \frac{1}{n} \sum_{i=1}^{n} \Bigl( \sum_{j=0}^{d} a_j x_i^j - y_i \Bigr)^2 + \lambda \|a\|_2^2

for given x_1, ..., x_n in [−1,1] and y_1, ..., y_n in R with respect to a_j in R, j = 0,1,...,d. This minimization problem can be solved exactly by setting the gradient to zero. In the sequel we illustrate the estimator by applying it to a few simulated data sets. We define (X,Y) by

    Y = m(X) + \sigma \varepsilon,

where X is uniformly distributed on [−1,1], \varepsilon is standard normally distributed and independent of X, and \sigma > 0. In Figs. 1 and 2 we choose \sigma = 1 and use two different univariate regression functions to define two data sets of size n = 500. Each figure shows the true regression function together with its formula, a corresponding sample of size n = 500, and our estimate computed from that sample.

Fig. 1. m(x) = 3(0.5x^2 − 1)^2 + 2x/(cos x + 2), with n = 500, \sigma = 1.
Fig. 2. m(x) = 8x/(x^2 + 2), with n = 500, \sigma = 1.
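A minimal, self-contained sketch of this experiment (our own code, not the authors'): draw n = 500 points with the regression function of Fig. 2 and sigma = 1, fit the ridge-regularized polynomial estimator of algorithm (2), and report the empirical L^2 error against m. The degree d = 5 and lambda = 1/n are illustrative choices only.

import numpy as np

rng = np.random.default_rng(0)

def m(x):                                    # regression function of Fig. 2
    return 8 * x / (x ** 2 + 2)

n, sigma, d, lam = 500, 1.0, 5, 1.0 / 500    # d and lam are illustrative choices
x = rng.uniform(-1.0, 1.0, n)
y = m(x) + sigma * rng.standard_normal(n)    # Y = m(X) + sigma * eps

# Ridge-regularized polynomial least squares, i.e. the estimator of (2).
V = np.vander(x, N=d + 1, increasing=True)
a = np.linalg.solve(V.T @ V / n + lam * np.eye(d + 1), V.T @ y / n)

# Approximate the L2(mu) error of the fit on a fine grid (mu is uniform here).
grid = np.linspace(-1, 1, 2001)
fz = np.vander(grid, N=d + 1, increasing=True) @ a
print("approximate L2 error:", np.sqrt(np.mean((fz - m(grid)) ** 2)))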
4. Proof of Theorem 1

According to the definition of f_z, it is easy to obtain

    E(f_z) - E(m) \le E(f_z) - E(m) + \lambda \|a(f_z)\|_2^2
                  \le |E(f_z) - E(m) - (E_z(f_z) - E_z(m))|
                    + |E_z(Q_d(m,\cdot)) - E_z(m) - (E(Q_d(m,\cdot)) - E(m))| + D(\lambda),                   (5)

where

    D(\lambda) = E(Q_d(m,\cdot)) - E(m) + \lambda \|a(Q_d(m,\cdot))\|_2^2

and Q_d(m,\cdot) is the Jackson operator applied to m.

We first estimate the term |E(Q_d(m,\cdot)) - E(m) - (E_z(Q_d(m,\cdot)) - E_z(m))| in (5). It concerns the random variable \xi = (Q_d(m,x) - y)^2 - (m(x) - y)^2, and we need the following probability inequality.

Lemma 3 (See van der Vaart and Wellner [1]). Let P be a probability measure on Z = X × Y and let z_1 = (x_1,y_1), ..., z_n = (x_n,y_n) be independent random variables distributed according to P. Given a function g: Z → R, set S = \sum_{i=1}^{n} g(z_i), b = \|g\|_\infty and \sigma^2 = n E g^2. Then

    Prob_{z \in Z^n} \{ |S - ES| \ge t \} \le 2 \exp\Bigl\{ -\frac{t^2}{2(\sigma^2 + bt/3)} \Bigr\}.

Using Lemma 3 we obtain the following theorem.

Theorem 2. For every 0 < \delta < 1, with confidence 1 − \delta/2,

    |E(Q_d(m,\cdot)) - E(m) - (E_z(Q_d(m,\cdot)) - E_z(m))| \le \frac{8(4M + C^*)^2}{3n} \log\frac{4}{\delta} + \frac{1}{2} D(\lambda).

Proof. Let g(z) = (1/n)((Q_d(m,x) - y)^2 - (m(x) - y)^2). From Proposition 1 we know that

    \|Q_d(m,\cdot)\|_\infty \le C_k\, \omega_k\Bigl(m, \frac{1}{d+1}\Bigr) + \|m\|_\infty,

where \|m\|_\infty = \max_{x \in [-1,1]} |m(x)|. Since |m(x)| <= M for every x in [−1,1], we obtain

    \|Q_d(m,\cdot)\|_\infty \le M + C C_k (d+1)^{-k} \le M + C^*,

where C^* = C C_k. For any z in Z,

    |g(z)| = \frac{1}{n} |(Q_d(m,x) - 2y + m(x))(Q_d(m,x) - m(x))| \le \frac{(4M + C^*)^2}{n},

hence \|g\|_\infty <= (4M + C^*)^2/n = b. Moreover,

    E(g^2) = \frac{1}{n^2} E\bigl( (Q_d(m,x) - 2y + m(x))^2 (Q_d(m,x) - m(x))^2 \bigr)
           \le \frac{(4M + C^*)^2}{n^2} (E(Q_d(m,\cdot)) - E(m))
           \le \frac{(4M + C^*)^2}{n} E g.

Now apply Lemma 3 with t = \sqrt{\varepsilon(\varepsilon + E(Q_d(m,\cdot)) - E(m))} to g. It asserts that, for every \varepsilon > 0, with confidence at least

    1 - 2\exp\Bigl\{ -\frac{\varepsilon(\varepsilon + E(Q_d(m,\cdot)) - E(m))}
        {2\bigl( \frac{(4M + C^*)^2}{n}(E(Q_d(m,\cdot)) - E(m)) + \frac{(4M + C^*)^2}{3n}\sqrt{\varepsilon(\varepsilon + E(Q_d(m,\cdot)) - E(m))} \bigr)} \Bigr\}
    \ge 1 - 2\exp\Bigl\{ -\frac{3n\varepsilon}{8(4M + C^*)^2} \Bigr\},

there holds

    \frac{|E(Q_d(m,\cdot)) - E(m) - (E_z(Q_d(m,\cdot)) - E_z(m))|}{\sqrt{E(Q_d(m,\cdot)) - E(m) + \varepsilon}} \le \sqrt{\varepsilon}.

Recalling the elementary inequality ab <= (a^2 + b^2)/2 for all a, b in R, we have

    |E(Q_d(m,\cdot)) - E(m) - (E_z(Q_d(m,\cdot)) - E_z(m))|
      \le \frac{\varepsilon}{2} + \frac{1}{2}(E(Q_d(m,\cdot)) - E(m) + \varepsilon)
      \le \varepsilon + \frac{1}{2} D(\lambda).

Setting \delta/2 = 2\exp\{-3n\varepsilon/(8(4M + C^*)^2)\} gives

    \varepsilon = \frac{8(4M + C^*)^2}{3n} \log\frac{4}{\delta}.

Therefore, with confidence 1 − \delta/2,

    |E(Q_d(m,\cdot)) - E(m) - (E_z(Q_d(m,\cdot)) - E_z(m))| \le \frac{8(4M + C^*)^2}{3n}\log\frac{4}{\delta} + \frac{1}{2} D(\lambda).

The proof of Theorem 2 is complete. □

For the first term in (5) we use the covering number of F_d.

Definition 3 (See Zhou [21]). Let F be a subset of a metric space. For any \varepsilon > 0, the covering number N(F, \varepsilon) is the minimal integer l such that there exist l balls of radius \varepsilon covering F.

The covering number has been used widely in the literature (see [7,15,22,8,19,20]). Let B_R be the ball of radius R in F_d. Since the dimension of F_d is d + 1, we have (see [21])

    \log N(B_R, \varepsilon) \le (d+1) \log\frac{4R}{\varepsilon}.                                            (6)

The first term of (5) is bounded by the following proposition.

Proposition 2. For all \varepsilon > 0,

    Prob_{z \in Z^n} \{ |E(f_z) - E(m) - (E_z(f_z) - E_z(m))| \ge \varepsilon \}
      \le 2 \exp\Bigl\{ (d+1)\log\frac{32(2M + C^*)^2}{\varepsilon} - \frac{3n(3\varepsilon - D(\lambda))}{64(2M + C^*)^2} \Bigr\}.

Proof. From the definition of f_z,

    |E(f_z) - E_z(f_z) - E(m) + E_z(m)| \le \sup_{f \in F_d} |E(f) - E_z(f) - E(m) + E_z(m)|.

Moreover, since, for h, g in F_d,

    |(y - h(x))^2 - (y - g(x))^2| = |(h(x) - g(x))(h(x) + g(x) - 2y)| \le (4M + 2C^*) \|h - g\|_\infty,

it follows that

    |E(h) - E_z(h) - E(g) + E_z(g)| \le 2(4M + 2C^*) \|h - g\|_\infty,   h, g in F_d.

Let U = {f_1, f_2, ..., f_l} \subset F_d be a \gamma-net of F_d of size l = N(F_d, \gamma). Then

    \sup_{f \in F_d} |E(f) - E_z(f) - E(m) + E_z(m)|
      \le \sup_{f \in U} |E(f) - E_z(f) - E(m) + E_z(m)| + 2(4M + 2C^*)\gamma.

Arguing as in the proof of Theorem 2, for any f_i in U,

    Prob_{z \in Z^n} \{ |E(f_i) - E(m) - (E_z(f_i) - E_z(m))| \ge \varepsilon \}
      \le 2\exp\Bigl\{ -\frac{3n(\varepsilon - \tfrac{1}{2}D(\lambda))}{8(4M + 2C^*)^2} \Bigr\},

which implies

    Prob_{z \in Z^n} \{ |E(f_z) - E(m) - (E_z(f_z) - E_z(m))| \ge \varepsilon \}
      \le Prob_{z \in Z^n} \{ \sup_{f \in F_d} |E(f) - E(m) - (E_z(f) - E_z(m))| \ge \varepsilon \}
      \le Prob_{z \in Z^n} \{ \sup_{f \in U} |E(f) - E(m) - (E_z(f) - E_z(m))| \ge \varepsilon - 2(4M + 2C^*)\gamma \}
      \le N(F_d, \gamma) \sup_{f \in U} Prob_{z \in Z^n} \{ |E(f) - E(m) - (E_z(f) - E_z(m))| \ge \varepsilon - 2(4M + 2C^*)\gamma \}
      \le 2 N(F_d, \gamma) \exp\Bigl\{ -\frac{3n(\varepsilon + 2(4M + 2C^*)\gamma - \tfrac{1}{2}D(\lambda))}{8(4M + 2C^*)^2} \Bigr\}.

Taking \gamma = \varepsilon/(8(2M + C^*)) yields

    Prob_{z \in Z^n} \{ |E(f_z) - E(m) - (E_z(f_z) - E_z(m))| \ge \varepsilon \}
      \le 2 N\Bigl( F_d, \frac{\varepsilon}{8(2M + C^*)} \Bigr) \exp\Bigl\{ -\frac{3n(3\varepsilon - D(\lambda))}{64(2M + C^*)^2} \Bigr\}.

From the definition of F_d we know that \|f\|_\infty <= M + C^* for f in F_d, so combining with (6),

    \log N\Bigl( F_d, \frac{\varepsilon}{8(2M + C^*)} \Bigr) \le (d+1) \log\frac{32(2M + C^*)^2}{\varepsilon}.

Therefore

    Prob_{z \in Z^n} \{ |E(f_z) - E(m) - (E_z(f_z) - E_z(m))| \ge \varepsilon \}
      \le 2 \exp\Bigl\{ (d+1)\log\frac{32(2M + C^*)^2}{\varepsilon} - \frac{3n(3\varepsilon - D(\lambda))}{64(2M + C^*)^2} \Bigr\}.

The proof of Proposition 2 is finished. □

From Theorem 2 and Proposition 2 we can now prove Theorem 1.

Proof of Theorem 1. We use the error decomposition

    E(f_z) - E(m) \le |E(f_z) - E(m) - (E_z(f_z) - E_z(m))|
                    + |E_z(Q_d(m,\cdot)) - E_z(m) - (E(Q_d(m,\cdot)) - E(m))| + D(\lambda)
                  = T_1 + T_2 + D(\lambda).                                                                   (7)

We begin by bounding T_1 in (7), distinguishing the cases \varepsilon >= M^2/n and \varepsilon < M^2/n.

(i) When \varepsilon >= M^2/n, Proposition 2 gives

    Prob_{z \in Z^n} \{ |E(f_z) - E(m) - (E_z(f_z) - E_z(m))| \le \varepsilon \}
      \ge 1 - 2\exp\Bigl\{ (d+1)\log\frac{32(2M + C^*)^2}{\varepsilon} - \frac{3n(3\varepsilon - D(\lambda))}{64(2M + C^*)^2} \Bigr\}
      \ge 1 - 2\exp\Bigl\{ (d+1)\log\frac{32 n (2M + C^*)^2}{M^2} - \frac{3n(3\varepsilon - D(\lambda))}{64(2M + C^*)^2} \Bigr\}.

Let

    2\exp\Bigl\{ (d+1)\log\frac{32 n (2M + C^*)^2}{M^2} - \frac{3n(3\varepsilon - D(\lambda))}{64(2M + C^*)^2} \Bigr\} = \frac{\delta}{2}.

Then

    \varepsilon = \frac{64(2M + C^*)^2 (d+1)\log\frac{32 n (2M + C^*)^2}{M^2}}{9n}
                + \frac{64(2M + C^*)^2 \log\frac{4}{\delta}}{9n} + \frac{D(\lambda)}{3} \ge \frac{M^2}{n},

so this choice of \varepsilon is admissible in case (i), and with confidence 1 − \delta/2 there holds

    |E(f_z) - E(m) - (E_z(f_z) - E_z(m))|
      \le \frac{64(2M + C^*)^2 (d+1)\log\frac{32 n (2M + C^*)^2}{M^2}}{9n}
        + \frac{64(2M + C^*)^2 \log\frac{4}{\delta}}{9n} + \frac{D(\lambda)}{3}.

(ii) When \varepsilon <= M^2/n, Proposition 2 implies that, with confidence 1 − \delta/2,

    |E(f_z) - E(m) - (E_z(f_z) - E_z(m))| \le \frac{M^2}{n}.

Combining the two cases, with confidence 1 − \delta/2 there holds

    T_1 \le \frac{64(2M + C^*)^2 (d+1)\log\frac{32 n (2M + C^*)^2}{M^2}}{9n}
          + \frac{64(2M + C^*)^2 \log\frac{4}{\delta}}{9n} + \frac{D(\lambda)}{3} + \frac{M^2}{n}.

To estimate T_2, Theorem 2 shows that, with confidence 1 − \delta/2,

    T_2 = |E(Q_d(m,\cdot)) - E(m) - (E_z(Q_d(m,\cdot)) - E_z(m))| \le \frac{8(4M + C^*)^2}{3n}\log\frac{4}{\delta} + \frac{1}{2} D(\lambda).

Finally, we bound D(\lambda) in (7) by taking \lambda = 1/(n(\|a(Q_d(m,\cdot))\|_2^2 + 1)):

    D(\lambda) = E(Q_d(m,\cdot)) - E(m) + \lambda \|a(Q_d(m,\cdot))\|_2^2
               \le \int_X (Q_d(m,x) - m(x))^2 \mu(dx) + \frac{1}{n}.

Combining the upper bounds of T_1, T_2 and D(\lambda), with confidence 1 − \delta there holds

    E(f_z) - E(m) \le \frac{128(2M + C^*)^2 (d+1)\log\frac{32 n (2M + C^*)^2}{M^2}}{9n}
                    + \frac{88(2M + C^*)^2 \log\frac{4}{\delta}}{9n}
                    + 2 \int_X (Q_d(m,x) - m(x))^2 \mu(dx) + \frac{3}{n}.

The proof of Theorem 1 is complete. □

In order to prove Corollary 1, we need to estimate \int_X (Q_d(m,x) - m(x))^2 \mu(dx). From (4) we know that Q_d(m,x) is a polynomial of degree d, and Proposition 1 tells us

    |Q_d(m,x)| = |J_d(m, \arccos x)| \le |m(x)| + C_k\, \omega_k\Bigl(m, \frac{1}{d+1}\Bigr).

Since m^{(k)} in Lip_C 1, we get

    C_k\, \omega_k\Bigl(m, \frac{1}{d+1}\Bigr) \le \frac{C C_k}{(d+1)^k} = \frac{C^*}{(d+1)^k} \le C^*,

where C^* = C C_k is a constant depending on k. So we obtain

    |Q_d(m,x)| \le M + C^*   for all x in [-1,1],

and hence Q_d(m,\cdot) in F_d. From Proposition 1,

    \int_X (Q_d(m,x) - m(x))^2 \mu(dx) \le \|Q_d(m,\cdot) - m\|_\infty^2
      \le \Bigl( C_k\, \omega_k\Bigl(m, \frac{1}{d+1}\Bigr) \Bigr)^2 \le \frac{C^*}{d^{2k}}.

Combining Theorem 1 with this inequality, we get

    E(f_z) - E(m) \le \frac{128(2M + C^*)^2\, d\, \log\frac{32 n (2M + C^*)^2}{M^2}}{9n} + \frac{2C^*}{d^{2k}}
                    + \frac{88(2M + C^*)^2 \log\frac{4}{\delta}}{9n} + \frac{3}{n},

and this expression is minimized for

    d = \Bigl[ \Bigl( \frac{9 C^* n}{64 (2M + C^*)^2 \log\frac{32 n (2M + C^*)^2}{M^2}} \Bigr)^{1/(2k+1)} \Bigr].

With confidence 1 − \delta, there holds

    E(f_z) - E(m) \le 4 C^* \Bigl( \frac{128(2M + C^*)^2 \log\frac{32 n (2M + C^*)^2}{M^2}}{9 C^* n} \Bigr)^{2k/(2k+1)}
                    + \frac{88(2M + C^*)^2 \log\frac{4}{\delta}}{9n} + \frac{3}{n}.

The proof of Corollary 1 is finished. □
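The degree chosen in Corollary 1 comes from balancing the two d-dependent terms of the bound just displayed; a schematic version of that step, with generic constants A and B in place of the explicit ones above, is

    \min_{d > 0}\Bigl( \frac{A\,d}{n} + \frac{B}{d^{2k}} \Bigr)
    \;\Longrightarrow\; \frac{A}{n} - \frac{2kB}{d^{2k+1}} = 0
    \;\Longrightarrow\; d^{*} = \Bigl( \frac{2kBn}{A} \Bigr)^{1/(2k+1)},

and substituting d^* back gives a value of order (A/n)^{2k/(2k+1)}, i.e. the n^{-2k/(2k+1)} rate (up to the logarithmic factor) stated in Corollary 1.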
5. Conclusions

In this paper, explicit upper bounds on the learning rate of least square schemes in polynomial space have been derived. In particular, the estimate achieves a good rate of convergence for smooth univariate regression functions. To our knowledge, these bounds improve, to some extent, previously known bounds under the smoothness condition. The proofs use the Jackson operator from approximation theory together with a general probability inequality. The obtained error estimate has also been illustrated on simulated data.

References

[1] A.W. van der Vaart, J.A. Wellner, Weak Convergence and Empirical Processes, Springer, 1996.
[2] A.M. Bagirov, C. Clausen, M. Kohler, Estimation of a regression function by maxima of minima of linear functions, IEEE Trans. Inform. Theory 55 (2009) 833–845.
[3] A. Caponnetto, E. De Vito, Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. 7 (2007) 331–368.
[4] D. Chen, Q. Wu, Y. Ying, D.X. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learn. Res. 5 (2004) 1143–1175.
[5] F. Cucker, S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001) 1–49.
[6] F. Cucker, S. Smale, Best choices for regularization parameters in learning theory, Found. Comput. Math. 1 (2002) 413–428.
[7] Y. Guo, P.L. Bartlett, J. Shawe-Taylor, R.C. Williamson, Covering numbers for support vector machines, IEEE Trans. Inform. Theory 48 (2002) 239–250.
[8] M. Pontil, A note on different covering numbers in learning theory, J. Complexity 19 (2003) 665–671.
[9] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inform. Theory 44 (1998) 1926–1940.
[10] S. Smale, D.X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (1) (2003) 17–41.
[11] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007) 153–172.
[12] Y.S. Sun, Function Approximation Theory (I), Beijing Normal University Press, 1989 (in Chinese).
[13] V.N. Temlyakov, Approximation in Learning Theory, IMI Preprints, vol. 5, 2005, pp. 1–42.
[14] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[15] R.C. Williamson, A.J. Smola, B. Schölkopf, Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, IEEE Trans. Inform. Theory 47 (2001) 2516–2532.
[16] Q. Wu, Y.M. Ying, D.X. Zhou, Learning theory: from regression to classification, in: Topics in Multivariate Approximation and Interpolation, Elsevier B.V., Amsterdam, 2004.
[17] Q. Wu, Y.M. Ying, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2007) 171–192.
[18] T.F. Xie, S.P. Zhou, Real Function Approximation Theory, Hangzhou University Press, 1998 (in Chinese).
[19] T. Zhang, Effective dimension and generalization of kernel learning, in: NIPS, 2002, pp. 454–461.
[20] T. Zhang, Leave-one-out bounds for kernel methods, Neural Comput. 13 (2003) 1397–1437.
[21] D.X. Zhou, The covering number in learning theory, J. Complexity 18 (2002) 739–767.
[22] D.X. Zhou, Capacity of reproducing kernel Hilbert spaces in learning theory, IEEE Trans. Inform. Theory 49 (2003) 1734–1752.
[23] D.X. Zhou, K. Jetter, Approximation with polynomial kernels and SVM classifiers, Adv. Comput. Math. 25 (2006) 323–344.
Yongquan Zhang received the M.S. degree from China Jiliang University in 2007. He is currently pursuing the Ph.D. degree at Xi'an Jiaotong University. His research interests include neural networks and machine learning.

Feilong Cao was born in Zhejiang Province, China, in August 1965. He received the B.S. degree in mathematics and the M.S. degree in applied mathematics from Ningxia University, China, in 1987 and 1998, respectively. In 2003, he received the Ph.D. degree from the Institute for Information and System Sciences, Xi'an Jiaotong University, China. From 1987 to 1992 he was an assistant professor, and from 1992 to 2002 an associate professor. He is now a professor at China Jiliang University. His current research interests include neural networks, machine learning and approximation theory. He is the author or coauthor of more than 100 scientific papers.

Zong-Ben Xu received the M.S. degree in mathematics in 1981 and the Ph.D. degree in applied mathematics in 1987 from Xi'an Jiaotong University, China. In 1998, he was a postdoctoral researcher in the Department of Mathematics, The University of Strathclyde, United Kingdom. He worked as a research fellow in the Information Engineering Department from February 1992 to March 1994, the Center for Environmental Studies from April 1995 to August 1995, and the Mechanical Engineering and Automation Department from September 1996 to October 1996, at The Chinese University of Hong Kong. From January 1995 to April 1995, he was a research fellow in the Department of Computing at The Hong Kong Polytechnic University. He has been with the Faculty of Science and the Institute for Information and System Sciences at Xi'an Jiaotong University since 1982, where he was promoted to associate professor in 1987 and to full professor in 1991; he now serves as an authorized Ph.D. supervisor in mathematics and computer science, vice president of Xi'an Jiaotong University, and director of the Institute for Information and System Sciences. Professor Xu has published 150 academic papers. His current interests are in computational intelligence (in particular, neural networks, evolutionary computation, and data mining theory and algorithms), nonlinear functional analysis, and the geometry of Banach spaces.