On the Generalization Ability of Recurrent Networks

Barbara Hammer
Department of Mathematics/Computer Science, University of Osnabrück, D-49069 Osnabrück, Germany
[email protected]

Abstract. The generalization ability of discrete time partially recurrent networks is examined. It is well known that the VC dimension of recurrent networks is infinite in most interesting cases, and hence the standard VC analysis cannot be applied directly. We find guarantees for specific situations where the transition function forms a contraction or the probability of long inputs is restricted. For the general case, we derive posterior bounds which take the input data into account. They are obtained via a generalization of the luckiness framework to the agnostic setting. The general formalism allows focusing on representative parts of the data as well as on more general situations such as long-term prediction.

1 Introduction

Recurrent networks (RNNs) constitute an elegant way of enlarging the capacity of feedforward networks (FNNs) to deal with complex data: sequences of real vectors, as opposed to simple real vectors. They process data where time is involved in a natural way, such as language or time series [5, 7]. Moreover, sequences constitute a simple form of symbolic data. Indeed, RNNs can easily be generalized such that general symbolic data, terms and formulas, can be processed as well [4, 6, 8]. One hopes that the advantages of FNNs transfer to RNNs directly; in particular, the theoretical guarantees for the success of FNNs should hold for RNNs as well, i.e. their universal approximation property, the possibility of efficient training, and their generalization ability [2, 11, 18]. There exist well known results showing the universal approximation capability; remarkable algorithms for efficient training have been developed, though training is more complex for RNNs than for FNNs [15]; and some work has been done on the generalization ability [9].
However, the latter question is not yet answered satisfactorily, and in this paper we take a further step towards establishing the generalization ability of RNNs. We first examine restricted situations where the standard VC arguments can be applied directly: restricted transition functions or restricted input distributions, respectively. Afterwards, we derive posterior bounds on the generalization capability which depend on the concrete training set. For this purpose we generalize the luckiness framework to the general agnostic setting [10, 12, 16]. Unlike in [9], we obtain results which cover long-term prediction and allow the restriction to representative parts of the data.

2 The Learning Scenario

A single-layer FNN computes a function $f : \mathbb{R}^m \to \mathbb{R}^n$, $x \mapsto \sigma(Ax + b)$, where $A \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^n$, and $\sigma$ denotes the componentwise application of some activation function $\sigma : \mathbb{R} \to \mathbb{R}$. $A$ and $b$ are the weights of the network. Popular choices for $\sigma$ are the sigmoidal function $\mathrm{sgd}(x) = (1 + e^{-x})^{-1}$ or piecewise polynomial functions. For $m > n$, recursive application of $f$ induces a recursive network which computes $\tilde f : (\mathbb{R}^{m-n})^* \to \mathbb{R}^n$ ($(\mathbb{R}^{m-n})^*$ denoting sequences of arbitrary length with elements in $\mathbb{R}^{m-n}$) defined by $\tilde f([\,]) = 0$ and $\tilde f([x_1, \ldots, x_t]) = f(x_1, \tilde f([x_2, \ldots, x_t]))$, where $[\,]$ denotes the empty sequence and $[x_1, \ldots, x_t]$ denotes a sequence of length $t$. Hence, starting with $0$, we recursively apply the transition function $f$ to the already computed value, the so-called context, and the next entry in the sequence. $n$ is the number of context neurons. Often we implicitly assume that $\tilde f$ is combined with a projection to a lower dimensional vector space. It is easy to see that various more general definitions of RNNs can be simulated in the above formalism, such that the argumentation presented in this paper holds for various realistic architectures [8]. A neural architecture only specifies the structure and the activation function, but not the weights. Sometimes, RNNs are used for long-term prediction.
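For concreteness, the recursion defining $\tilde f$ can be sketched in code. This is a minimal illustrative sketch and not part of the original paper: the function names, weight values, and dimensions below are assumptions; only the sigmoidal activation and the recursion $\tilde f([\,]) = 0$, $\tilde f([x_1,\ldots,x_t]) = f(x_1, \tilde f([x_2,\ldots,x_t]))$ follow the definitions above.

```python
import numpy as np

def sgd(x):
    """Sigmoidal activation sgd(x) = (1 + e^(-x))^(-1), applied componentwise."""
    return 1.0 / (1.0 + np.exp(-x))

def make_transition(A, b):
    """Single-layer transition f : R^m -> R^n, (x, u) |-> sgd(A [x; u] + b),
    where the m inputs split into m - n sequence entries and n context units."""
    def f(x, u):
        return sgd(A @ np.concatenate([x, u]) + b)
    return f

def recursive_net(f, seq, n):
    """Compute f~ on [x1, ..., xt]: start from the zero context for the empty
    suffix and fold the transition over the entries, last entry first."""
    u = np.zeros(n)
    for x in reversed(seq):   # f~([x1..xt]) = f(x1, f~([x2..xt]))
        u = f(x, u)
    return u

# Toy usage (hypothetical sizes): 2-dimensional entries, 3 context neurons.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))   # m = 5 = 2 + 3, n = 3
b = rng.normal(size=3)
f = make_transition(A, b)
out = recursive_net(f, [rng.normal(size=2) for _ in range(4)], n=3)
```

Iterating over the sequence in reverse reproduces the recursion, since the context consumed together with $x_1$ is the value of $\tilde f$ on the suffix $[x_2, \ldots, x_t]$.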
That is, given an input sequence $x$, we are interested in the output sequence with entries $\tilde f([0^t x])$, $t \in \mathbb{N}$, obtained via further application of $f$ with zero entries.

The so-called pseudodimension plays a key role in learning theory; it measures the richness of a function class. Formally, the pseudodimension $VC(\mathcal F)$ of a class $\mathcal F : X \to [0,1]$ is the largest cardinality (possibly infinite) of a set $X' \subseteq X$ which can be shattered, which means that for each $x \in X'$ a reference point $r_x \in \mathbb{R}$ exists such that for every $d : X' \to \{0,1\}$ some $f \in \mathcal F$ exists with $f(x) \ge r_x$ iff $d(x) = 1$ for all $x \in X'$. Upper and lower bounds on $VC(\mathcal F)$ where $\mathcal F$ is computed by an RNN can be found in the literature or easily derived from the arguments given therein [3, 8, 13]. For a sigmoidal or piecewise polynomial RNN with $W$ weights and inputs of length at most $t$, the bounds are polynomial in $W$ and $t$. Since all lower bounds depend on $t$, we obtain an infinite pseudodimension for arbitrary inputs, in contrast to standard FNNs.

We consider the following so-called agnostic setting [17]. All sets are assumed to be measurable. Denote by $X$ the input set, by $U$ the output set, and by $\mathcal F$ a function class mapping from $X$ to $U$ which is used for learning and may be given by a neural architecture. A set of probabilities $\mathcal P$ on $X \times Y$ is to be learned. Some loss function $l : U \times Y \to [0,1]$ measures the distance between the real and the predicted output; this may be the Euclidean distance if $U$ and $Y$ are subsets of $[0,1]$. Associated to each $f \in \mathcal F$ is the function $l_f(x, y) = l(f(x), y)$, and associated to $\mathcal F$ is the class $l_{\mathcal F} = \{ l_f \mid f \in \mathcal F \}$. Given a finite set of examples $(x_i, y_i)_{i=1}^m$ chosen according to some unknown $P \in \mathcal P$, we want to find a function $f \in \mathcal F$ which models the regularity best, i.e. the error $E_P(l_f) = \int_{X \times Y} l(f(x), y) \, dP(x, y)$ should be as small as possible. Since this quantity is not known, often simply the empirical error $\hat E_m(l_f, (\mathbf x, \mathbf y)) = \sum_i l(f(x_i), y_i)/m$ is minimized.
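To make these quantities concrete, the empirical error $\hat E_m$ can be sketched as follows. This is an illustrative sketch: the function $f$, the toy sample, and the choice of the absolute loss are assumptions; any loss $l : U \times Y \to [0,1]$ fits the same scheme.

```python
import numpy as np

def absolute_loss(u, y):
    """A loss l : U x Y -> [0,1] for U = Y = [0,1], here l(u, y) = |u - y|."""
    return abs(u - y)

def empirical_error(f, sample, loss):
    """Empirical error E^_m(l_f, (x, y)) = (1/m) * sum_i l(f(x_i), y_i)."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)

# Toy example: f maps a sequence to the mean of its entries, clipped to [0,1].
f = lambda seq: float(np.clip(np.mean(seq), 0.0, 1.0))
sample = [([0.2, 0.4], 0.3), ([0.8, 0.6], 0.5)]
err = empirical_error(f, sample, absolute_loss)
```

Minimizing `err` over the class stands in for empirical risk minimization; the question treated next is when this empirical error is a reliable estimate of the true error $E_P(l_f)$.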
In order to guarantee valid generalization, the so-called uniform convergence of empirical means (UCEM) property for $\mathcal F$ w.r.t. $l$ should hold:

$$\sup_{P \in \mathcal P} P^m\left( (\mathbf x, \mathbf y) \;\middle|\; \sup_{f \in \mathcal F} \left| E_P(l_f) - \hat E_m(l_f, (\mathbf x, \mathbf y)) \right| > \epsilon \right) \to 0 \quad (m \to \infty).$$

For standard FNNs the UCEM property is guaranteed in the distribution independent case, i.e. $\mathcal P$ may be arbitrary, since FNNs have finite pseudodimension. The situation is more difficult for recursive networks since the pseudodimension is infinite for unrestricted inputs; in general, distribution independent learnability cannot be guaranteed [8]. However, one can equivalently characterize the UCEM property w.r.t. $\mathcal P$, which may be a proper subset of all probabilities, i.e. for the so-called distribution dependent setting, via

$$\lim_{m \to \infty} \sup_{P \in \mathcal P} \frac{E_{P^m}\big(\lg L(\epsilon, l_{\mathcal F}|(\mathbf x, \mathbf y), |\cdot|_\infty)\big)}{m} = 0.$$

Here $L$ denotes the external covering number of $l_{\mathcal F}$ on inputs $(\mathbf x, \mathbf y)$, measured with the pseudometric $|\cdot|_\infty$ mapping two functions to the maximum absolute pointwise distance. Explicit bounds on the generalization error arise. The key observation for the distribution independent case is that for a function class with finite pseudodimension $d$ the external covering number on $m$ points can be limited by $2 \left( \frac{2em}{\epsilon} \ln \frac{2em}{\epsilon} \right)^d$. Hence the distribution independent UCEM property is guaranteed. Additionally, the covering numbers of $l_{\mathcal F}$ and $\mathcal F$ can be related under mild conditions on $l$ [1, 17].

3 Restricted Architectures or Distributions

Recurrent architectures with step activation function and binary input sequences have finite pseudodimension [13]. Moreover, noise may reduce the capacity of general architectures to a finite pseudodimension [14]. An analogous result holds for contractions:

Theorem 1. Assume the class $\mathcal F : X \times U \to U$ induces the recursive class $\tilde{\mathcal F}$. Assume $C < 1$ exists such that for each $f \in \mathcal F$ and all $x$, $u_1$, $u_2$, $|f(x, u_1) - f(x, u_2)| \le C \, |u_1 - u_2|$ holds. Then $L(\epsilon/(1-C), \tilde{\mathcal F}, |\cdot|_\infty) \le L(\epsilon, \mathcal F, |\cdot|_\infty)$.

Proof. Assume $\{f_1, \ldots, f_n\}$ is an external $\epsilon$-cover of $\mathcal F$. Choose $f \in \mathcal F$.
Assume $f_i$ is a function from the cover with distance smaller than $\epsilon$ to $f$. For contexts $u_1$ and $u_2$ with $|u_1 - u_2| \le (1 + C + C^2 + \cdots + C^t)\epsilon$ we find for any $x$

$$|f(x, u_1) - f_i(x, u_2)| \le |f(x, u_1) - f(x, u_2)| + |f(x, u_2) - f_i(x, u_2)| \le (C + \cdots + C^{t+1})\epsilon + \epsilon,$$

and hence by induction $|(\tilde f - \tilde f_i)([x_1, \ldots, x_t])| \le (1 + \cdots + C^t)\epsilon \le \epsilon/(1 - C)$. □

Hence RNNs induced by a contraction provide valid generalization with respect to every loss $l$ such that $l : U \times Y \to [0,1]$ is equicontinuous with respect to $U$, i.e. $\forall \epsilon > 0 \; \exists \delta(\epsilon) > 0$ such that for all $y \in Y$ and $u_1, u_2 \in U$, $|u_1 - u_2| \le \delta(\epsilon)$ implies $|l(u_1, y) - l(u_2, y)| \le \epsilon$. We call such a loss function admissible. This holds due to the inequality $L(\epsilon, l_{\mathcal F}|(\mathbf x, \mathbf y), |\cdot|_\infty) \le L(\delta(\epsilon), \mathcal F|\mathbf x, |\cdot|_\infty)$, as proved in [17]. Admissible loss functions are, for example, the popular choices $|u - y|^s$ for $s \ge 1$ and $U = Y = [0,1]$. Sufficient guarantees for the contraction property of the transition function can be stated easily for a large number of activation functions including the sigmoidal function:

Lemma 1. Any recurrent architecture with a Lipschitz continuous activation function with constant $K$, $n$ context neurons, and weights bounded by $C/(Kn)$ for some $C < 1$ fulfills the distribution independent UCEM property with respect to admissible loss functions.

A second guarantee for valid generalization can be obtained for restricted input distributions. Denote by $X_t$ the input sequences restricted to length at most $t$ and choose values $p_t$ with $0 \le p_t \uparrow 1$. Assume that we restrict to those probabilities $P$ on $X \times Y$ such that the marginal distribution on $X$ assigns a probability of at least $p_t$ to $X_t$; hence prior information on the probability of long input sequences is available. It is shown in [9] that the UCEM property with respect to the linear loss holds under these conditions if $VC(\mathcal F|X_t) < \infty$. One can generalize the result as follows:

Theorem 2. Assume $\mathcal F$ is a function class with inputs in $X = \bigcup_t X_t$ with $X_t \subseteq X_{t+1}$ and outputs in $[0,1]$. Assume $0 \le p_t \uparrow 1$.
Assume $\mathcal P$ are distributions on $X \times Y$ such that the marginal distributions assign a value of at least $p_t$ to $X_t$. Then $\mathcal F$ fulfills the UCEM property w.r.t. any admissible loss function and $\mathcal P$ if the restrictions $\mathcal F|X_t$ fulfill the UCEM property w.r.t. the probabilities induced by $\mathcal P$ on $X_t$ and the linear loss.

Proof. Assume the parameter $\delta(\epsilon)$ describes the admissible loss. As proved in [17], we can equivalently substitute the metric $|\cdot|_\infty$ in the UCEM characterization by the Euclidean metric $|\cdot|$. Denote the restriction of the marginal distribution of $P$ to $X_t$ by $P_t$ and the restriction of $\mathcal F$ to inputs in $X_t$ by $\mathcal F_t$. One can estimate

$$\frac{E_{P^m}\big(\lg L(\epsilon, l_{\mathcal F}|(\mathbf x, \mathbf y), |\cdot|)\big)}{m} \le \frac{E_{P^m}\big(\lg L(\delta(\epsilon)/2, \mathcal F|\mathbf x, |\cdot|)\big)}{m}$$
$$\le P^m\big(\text{less than } m(1 - \delta(\epsilon)/2) \text{ points in } X_t\big) \, \lg\frac{2}{\delta(\epsilon)} + \frac{E_{P_t^m}\big(\lg L(\delta(\epsilon)/2, \mathcal F_t|\mathbf x, |\cdot|)\big)}{m}$$
$$\le \frac{16 \, p_t (1 - p_t)}{m \, \delta(\epsilon)^2} \, \lg\frac{2}{\delta(\epsilon)} + \frac{E_{P_t^m}\big(\lg L(\delta(\epsilon)/2, \mathcal F_t|\mathbf x, |\cdot|)\big)}{m}$$

for $p_t \ge 1 - \delta(\epsilon)/4$, due to Chebyshev's inequality. The second term tends to zero for $m \to \infty$ if $\mathcal F_t$ fulfills the UCEM property with respect to $P_t$. □

Hence RNNs with finite pseudodimension for restricted input length $t$ fulfill the UCEM property for arbitrary inputs with respect to admissible losses and input distributions where sequences of length $t$ have probability $p_t \uparrow 1$.

4 Posterior Bounds

However, it may happen that neither prior information about the input distribution is available nor the weights fulfill appropriate restrictions. A different approach allows us to derive posterior, data dependent bounds on the generalization ability. For this purpose we use a modification of the luckiness framework proposed in [16]. The key idea is very simple: generalization does not hold in the general setting; however, in lucky situations, e.g. for a restricted maximum input length of the training sequences, very good generalization bounds can be obtained. Hence the situation is stratified according to the concrete setting. Some luckiness function $L : X^m \times \mathcal F \to \mathbb{R}^+$ measures the concrete luckiness of the output of a learning algorithm on the training data. In lucky situations, i.e.
for large $t$, the covering number $L_{L \ge t}(\epsilon, l_{\mathcal F}|(\mathbf x, \mathbf y), |\cdot|_\infty)$ of those functions in $\mathcal F$ for which $L$ yields at least $t$ on $\mathbf x$ should be small. Additionally, the luckiness function should be smooth in the following sense: denote by $\tilde P$ the marginal distribution of $P$ on $X$, and denote by $B : \mathbb{R}^+ \to \mathbb{R}^+$ some monotonically increasing function. Then

$$\sup_{P \in \mathcal P} \tilde P^{2m}\left( \mathbf x \mathbf x' \;\middle|\; \exists f \in \mathcal F \; \forall \mathbf X \sqsubseteq_\eta \mathbf x, \, \mathbf X' \sqsubseteq_\eta \mathbf x' : \; L(\mathbf X \mathbf X', f) < B(L(\mathbf x, f)) \right) \le \delta$$

should hold for some $\eta = \eta(m, L(\mathbf x, f), \delta)$. Here $\mathbf X \sqsubseteq_\eta \mathbf x$ indicates that a fraction $\eta$ of the points of $\mathbf x$ is canceled in $\mathbf X$. This tells us that we can bound the luckiness of a double sample with regard to the luckiness on the first half of the sample.

Theorem 3. Assume $p_t$ are positive values which add up to $1$, and assume $\epsilon > 0$. If $L$ is a smooth luckiness function then one can estimate under the above conditions

$$\sup_{P \in \mathcal P} P^m\left( (\mathbf x, \mathbf y) \;\middle|\; \exists f \, \exists t \, \big( L(\mathbf x, f) \ge t \; \wedge \; |\hat E_m(l_f, (\mathbf x, \mathbf y)) - E_P(l_f)| > \hat\epsilon \big) \right) \le \delta$$

for $\eta = \eta(m, L(\mathbf x, f), p_t \delta / 4)$ and

$$\hat\epsilon = 6\eta + 4\epsilon + \sqrt{\frac{4}{m}\left( \lg m + 1 + 2 \lg \frac{16}{p_t \delta} + 2 \lg \sup_{P \in \mathcal P} E_{\tilde P^{2m}}\Big( L_{L, B(t)}\big(\epsilon, l_{\mathcal F}|(\mathbf X \mathbf X', \mathbf Y \mathbf Y'), |\cdot|_\infty\big) \Big) \right)}.$$

(The proof is similar to [1, 17] and is omitted due to space limitations.)

We want to apply this result to recurrent networks. For this purpose, we consider the dual formalism, an unluckiness function. Let $\tilde{\mathcal F}$ be given by a recurrent architecture with outputs in $[0,1]$, and consider the absolute loss function. We define the unluckiness function $L(\mathbf x, f)$ as the maximum length of all but a fraction $\eta$ of the sequences in $\mathbf x$. The expected external covering number of $\tilde{\mathcal F}$ on $m$ points restricted to situations with unluckiness at most $t$ can be limited by ($X_t$ being the sequences of length at most $t$)

$$\left(\frac{2}{\epsilon}\right)^{\eta m} \cdot 2 \left( \frac{2e(1-\eta)m}{\epsilon} \ln \frac{2e(1-\eta)m}{\epsilon} \right)^{VC(\tilde{\mathcal F}|X_t)}.$$

Corollary 1. The inequality $\sup_{P \in \mathcal P} P^m\big( \mathbf z \mid |\hat E_m(l_{h(\mathbf z)}, \mathbf z) - E_P(l_{h(\mathbf z)})| > \hat\epsilon(\mathbf x) \big) \le \delta$ is valid in the above situation for any output $h(\mathbf z)$ of a learning algorithm $h$ on the sample $\mathbf z = (\mathbf x, \mathbf y)$ and for

$$\hat\epsilon(\mathbf x) = O\left( \epsilon + \eta \ln\frac{1}{\epsilon} + \sqrt{\frac{1}{m} \ln\frac{1}{\delta}} + \sqrt{\frac{t_{\mathbf x}}{m}} + \sqrt{\frac{VC(\tilde{\mathcal F}|X_{t_{\mathbf x}})}{m} \ln m} \right),$$

$t_{\mathbf x}$ being the maximum length of all but a fraction $\eta$ of the sequences in the sample $\mathbf x$.

Proof.
The above unluckiness function is smooth with the identity as $B$ and $\eta(m, L, \delta) = \sqrt{\ln(2/\delta)/(2m)}$, which follows from Hoeffding's inequality. Hence the result follows directly from Theorem 3 and the above estimation of the covering number, setting $p_t = 1/2^t$. □

Hence we obtain posterior bounds which depend on the concrete training sample. Moreover, compared to [8, 16], we may drop a fraction $\eta$ of the sequences when measuring the maximum length. Note that this possibility is essential in order to obtain good bounds, since the fraction of long input sequences will approximate their respective probability in the limit. A second advantage lies in the possibility to deal with general loss functions. One interesting topic is long-term prediction, where the outputs consist of sequences. One possible loss function computes $d(x, y) = \sum_t p_t |x_t - y_t|$, where the $p_t$ are positive values adding up to $1$; $x_t$ and $y_t$ are the $t$-th components of the sequences $x$ and $y$, respectively, if existing, or $0$ otherwise. Theorem 3 yields the following result:

Theorem 4. Assume a recurrent sigmoidal architecture with $W$ weights is given. We consider long-term prediction, i.e. inputs and outputs are contained in $[0,1]^*$. Choose the above loss function and choose $T$ such that $\sum_{t > T} p_t < \epsilon$. Assume $\mathbf x$ is the training sample and $t_{\mathbf x}$ is the maximum length of all but a fraction $\eta$ of the sequences in $\mathbf x$. Then the empirical and real error deviate by at most $\hat\epsilon(\mathbf x)$ with confidence $\delta$ for

$$\hat\epsilon(\mathbf x) = O\left( \epsilon + \eta \ln\frac{1}{\epsilon} + \sqrt{\frac{1}{m} \ln\frac{1}{\delta}} + \sqrt{\frac{t_{\mathbf x}}{m}} + \sqrt{\frac{W^4 (T^3 + T t_{\mathbf x}^2)}{m} \ln m} \right).$$

Proof. As above we choose the smooth unluckiness function which measures the maximum length of all but a fraction $\eta$ of the sample. The covering number $L_{L, B(t)}(\epsilon, l_{\mathcal F}|(\mathbf X \mathbf X', \mathbf Y \mathbf Y'), |\cdot|_\infty)$ is to be estimated. $T$ is chosen such that sequences of length $> T$ do not contribute to the minimum number of covering functions. For outputs of varying length we can upper bound the covering number of the entire class via the product of the covering numbers restricted to outputs of a fixed length [1].
Hence we obtain the bound

$$\prod_{t \le T} 2 \left( \frac{2em}{\epsilon} \ln \frac{2em}{\epsilon} \right)^{d_{t + t_{\mathbf x}}},$$

where $d_{t + t_{\mathbf x}}$ is the pseudodimension of the sigmoidal recurrent architecture with inputs of length at most $t + t_{\mathbf x}$, $t_{\mathbf x}$ being the maximum length of all but a fraction $\eta$ of the sequences in $\mathbf x$; this pseudodimension is limited by $O(W^4 (t + t_{\mathbf x})^2)$. □

5 Conclusions

We have considered the in-principle ability of learning neural mappings dealing with sequences, more precisely, the generalization ability of RNNs. We obtained prior bounds for restricted weights or restricted probabilities of long input sequences. Alternatively, posterior bounds depending on the concrete sample could be derived. Since the general agnostic scenario is considered, the results apply to general loss functions, as demonstrated for the example of long-term prediction.

References

1. M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
2. E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1, 1989.
3. B. Dasgupta and E. D. Sontag. Sample complexity for learning recurrent perceptron mappings. IEEE Transactions on Information Theory, 42, 1996.
4. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data sequences. IEEE Transactions on Neural Networks, 9(5), 1997.
5. C. L. Giles, G. M. Kuhn, and R. J. Williams. Special issue on dynamic recurrent neural networks. IEEE Transactions on Neural Networks, 5(2), 1994.
6. M. Gori (chair). Special session: Adaptive computation of data structures. ESANN'99, 1999.
7. M. Gori, M. Mozer, A. C. Tsoi, and R. L. Watrous. Special issue on recurrent neural networks for sequence processing. Neurocomputing, 15(3-4), 1997.
8. B. Hammer. Learning with Recurrent Neural Networks. Springer Lecture Notes in Control and Information Sciences 254, 2000.
9. B. Hammer. Generalization ability of folding networks. To appear in IEEE Transactions on Knowledge and Data Engineering.
10. D. Haussler.
Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 1992.
11. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2, 1989.
12. M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17, 1994.
13. P. Koiran and E. D. Sontag. Vapnik-Chervonenkis dimension of recurrent neural networks. Discrete Applied Mathematics, 86(1), 1998.
14. W. Maass and P. Orponen. On the effect of analog noise in discrete-time analog computation. Neural Computation, 10(5), 1998.
15. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
16. J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
17. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.
18. P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.