On the Generalization Ability of Recurrent Networks

Barbara Hammer
Department of Mathematics/Computer Science,
University of Osnabrück
D-49069 Osnabrück, Germany
[email protected]
Abstract. The generalization ability of discrete time partially recurrent networks
is examined. It is well known that the VC dimension of recurrent networks is
infinite in most interesting cases and hence the standard VC analysis cannot be
applied directly. We find guarantees for specific situations where the transition
function forms a contraction or the probability of long inputs is restricted. For the
general case, we derive posterior bounds which take the input data into account.
They are obtained via a generalization of the luckiness framework to the agnostic
setting. The general formalism allows one to focus on representative parts of the data
as well as on more general situations such as long term prediction.
1 Introduction
Recurrent networks (RNNs) constitute an elegant way of enlarging the capacity of feedforward networks (FNNs) to deal with complex data, sequences of real vectors, as opposed to simple real vectors. They process data where time is involved in a natural way
– such as language or time series [5, 7]. Moreover, sequences constitute a simple form
of symbolic data. Indeed, RNNs can easily be generalized such that general symbolic
data, terms and formulas, can be processed as well [4, 6, 8]. One hopes that the advantages of FNNs transfer to RNNs directly; in particular, the theoretical guarantees for the
success of FNNs should hold for RNNs, i.e. their universal approximation property, the
possibility of efficient training, and their generalization ability [2, 11, 18]. There exist
well known results showing the universal approximation capability; remarkable algorithms for efficient training have been developed, though training is more complex for
RNNs than for FNNs [15]; some work has been done on the generalization ability [9].
However, the latter question has not yet been answered satisfactorily, and in this paper
we take a further step toward establishing the generalization ability of RNNs.
We first examine restricted situations where the standard VC arguments can be applied directly: restricted transition functions or restricted input distributions, respectively. Afterwards, we derive posterior bounds on the generalization capability which
depend on the concrete training set. For this purpose we generalize the luckiness framework to the general agnostic setting [10, 12, 16]. Unlike in [9], we obtain results which
cover long-term prediction and allow the restriction to representative parts of data.
2 The Learning Scenario
A single-layer FNN computes a function f : R^m → R^n, x ↦ σ(Ax + b), where
A ∈ R^{n×m}, b ∈ R^n, and σ denotes the componentwise application of some activation
function σ : R → R. A and b are the weights of the network. Popular choices for σ are
the sigmoidal function sgd(x) = (1 + e^{−x})^{−1}, or piecewise polynomial functions.
For m > n, recursive application of f induces a recursive network which computes
f̃ : (R^{m−n})* → R^n ((R^{m−n})* denoting sequences of arbitrary length with elements
in R^{m−n}) defined by f̃([ ]) = 0 and f̃([x_1, …, x_t]) = f(x_1, f̃([x_2, …, x_t])),
[ ] denoting the empty sequence and [x_1, …, x_t] denoting a sequence of length t.
Hence, starting with 0, we recursively apply the transition function f to the already
computed value, the so-called context, and the next entry in the sequence. n is the
number of context neurons. Often we implicitly assume that f̃ is combined with a
projection to a lower dimensional vector space. It is easy to see that various more
general definitions of RNNs can be simulated in the above formalism such that the
argumentation presented in this paper holds for various realistic architectures [8].
A neural architecture only specifies the structure but not the weights A and b.
Sometimes, RNNs are used for long term prediction, i.e., given an input sequence x
we are interested in the output sequence with entries f̃([0^t; x]), t ∈ N, obtained via
further application of f.
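As an illustration, the recursion f̃([x_1, …, x_t]) = f(x_1, f̃([x_2, …, x_t])) can be sketched in a few lines of Python. The concrete transition (a single sigmoidal layer acting on the concatenation of input and context) and the weight shapes are assumptions made for the sake of the example, not part of the formal development:

```python
import numpy as np

def sgd(x):
    """Sigmoidal activation sgd(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def make_transition(A, b):
    """Single-layer transition f(x, u) = sgd(A @ [x; u] + b),
    where x is the next sequence entry and u the context."""
    def f(x, u):
        return sgd(A @ np.concatenate([x, u]) + b)
    return f

def f_tilde(f, seq, n):
    """Recursive network: f~([]) = 0, f~([x1,...,xt]) = f(x1, f~([x2,...,xt])).
    The innermost recursion consumes the last entry first, so we iterate
    over the sequence in reverse, starting from the zero context."""
    context = np.zeros(n)
    for x in reversed(seq):
        context = f(x, context)
    return context
```

A usage example: with n = 3 context neurons and entries in R^2, `A` has shape (3, 5) and `f_tilde(f, [], 3)` returns the zero context, as the definition requires.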
The so-called pseudodimension plays a key role in learning theory. It measures the
richness of a function class. Formally, the pseudodimension VC(F) of a class F : X →
[0, 1] is the largest cardinality (possibly infinite) of a set X′ ⊆ X which can be
shattered, which means that for each x ∈ X′ a reference point r_x ∈ R exists such that
for every d : X′ → {0, 1} some f ∈ F exists with f(x) ≥ r_x iff d(x) = 1 for all
x ∈ X′. Upper and lower bounds on VC(F) where F is computed by a RNN can be found in
the literature or easily derived from the arguments given therein [3, 8, 13]. For a
sigmoidal or piecewise polynomial RNN with W weights and inputs of length at most t,
the bounds are polynomial in W and t. Since all lower bounds depend on t, we obtain an
infinite pseudodimension for arbitrary inputs, in contrast to standard FNNs.
We consider the following so-called agnostic setting [17]. All sets are assumed
to be measurable. Denote by X the input set, by U the output set, and by F a function
class mapping from X to U used for learning, which may be given by a neural architecture.
A set of probabilities P on X × Y is to be learned. Some loss function
l : U × Y → [0, 1] measures the distance between the real and the predicted output. This
may be the Euclidean distance if U and Y are subsets of [0, 1]. Associated to each f ∈ F
is the function l_f(x, y) = l(f(x), y), and associated to F is the class l_F = {l_f | f ∈ F}.
Given a finite set of examples (x_i, y_i)_{i=1,…,m} chosen according to some unknown P ∈ P,
we want to find a function f ∈ F which models the regularity best, i.e. the error
E_P(l_f) = ∫_{X×Y} l(f(x), y) dP(x, y) should be as small as possible. Since this quantity
is not known, often simply the empirical error Ê_m(l_f, (x, y)) = Σ_i l(f(x_i), y_i)/m
is minimized. In order to guarantee valid generalization, the so-called uniform convergence
of empirical means (UCEM) property for F w.r.t. l should hold:

sup_{P∈P} P^m( (x, y) | sup_{f∈F} |E_P(l_f) − Ê_m(l_f, (x, y))| > ε ) → 0   (m → ∞).
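The empirical error Ê_m(l_f, (x, y)) defined above is straightforward to compute. A minimal sketch, with illustrative function names and the absolute loss as an admissible example:

```python
import numpy as np

def empirical_error(f, xs, ys, loss):
    """Empirical error \\hat{E}_m(l_f, (x, y)) = (1/m) * sum_i l(f(x_i), y_i)."""
    return float(np.mean([loss(f(x), y) for x, y in zip(xs, ys)]))

# absolute loss on [0, 1]-valued outputs, an admissible loss in the sense below
abs_loss = lambda u, y: abs(u - y)
```

For instance, for the identity predictor on the sample (0, 0), (0.5, 1), (1, 1) the pointwise losses are 0, 0.5, 0, so the empirical error is 1/6.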
For standard FNNs the UCEM property is guaranteed for the distribution independent case,
i.e. P may be arbitrary, since FNNs have finite pseudodimension. The situation is more
difficult for recursive networks, since the pseudodimension is infinite for
unrestricted inputs. In general, distribution independent learnability cannot be
guaranteed [8]. However, one can equivalently characterize the UCEM property w.r.t. P,
which may be a proper subset of all probabilities, i.e. for the so-called distribution
dependent setting, via

sup_{P∈P} E_{P^m}( lg L(ε, l_F|(x, y), |·|_∞) )/m → 0   (m → ∞)

for every ε > 0. Here L denotes the external covering number of l_F on inputs (x, y)
measured with the pseudometric |·|_∞ mapping two functions to the maximum absolute
pointwise distance. Explicit bounds on the generalization error arise. The key
observation for the distribution independent case is that for a function class with
finite pseudodimension d the external covering number on m points can be limited by
2((2em/ε) ln(2em/ε))^d. Hence the distribution independent UCEM property is guaranteed.
Additionally, the covering numbers of l_F and F can be related under mild conditions
on l [1, 17].
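On a concrete sample, the external covering number at scale ε can be upper-bounded by a greedy covering heuristic. The sketch below is an illustration added here (not part of the cited results); it represents each function by its vector of values on the sample points and covers in the pseudometric |·|_∞:

```python
import numpy as np

def greedy_cover_size(values, eps):
    """Upper bound on the eps-covering number of a function class on a finite
    sample.  `values` is a (num_functions, num_points) array holding f(x_j)
    for each function f; the distance between two functions is the maximum
    absolute pointwise difference |f - g|_inf = max_j |f(x_j) - g(x_j)|."""
    remaining = list(range(values.shape[0]))
    cover = 0
    while remaining:
        center = values[remaining[0]]
        # discard every function within eps of the chosen center
        remaining = [i for i in remaining
                     if np.max(np.abs(values[i] - center)) > eps]
        cover += 1
    return cover
```

Greedy covering overestimates the minimal cover by at most a constant factor in the scale, which suffices for the logarithmic dependence of the bounds above.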
3 Restricted Architectures or Distributions
Recurrent architectures with step activation function and binary input sequences have
finite pseudodimension [13]. Moreover, noise may reduce the capacity of general architectures to a finite pseudodimension [14]. An analogous result holds for contractions:
Theorem 1. Assume the class F : X × U → U induces the recursive class F̃. Assume
C < 1 exists such that for each f ∈ F and all x, u_1, u_2, |f(x, u_1) − f(x, u_2)| ≤ C|u_1 − u_2|
holds. Then

L( ε/(1 − C), F̃, |·|_∞ ) ≤ L( ε, F, |·|_∞ ).

Proof. Assume {f_1, …, f_n} is an external ε-covering of F. Choose f ∈ F. Assume f_i
is a function from the cover with distance smaller than ε to f. For contexts u_1 and u_2
with |u_1 − u_2| ≤ (1 + C + C² + … + C^t)ε we find for any x

|f(x, u_1) − f_i(x, u_2)| ≤ |f(x, u_1) − f(x, u_2)| + |f(x, u_2) − f_i(x, u_2)|
≤ (C + … + C^{t+1})ε + ε,

and hence by induction |(f̃ − f̃_i)([x_1, …, x_t])| ≤ (1 + … + C^t)ε ≤ ε/(1 − C). □
Hence RNNs induced by a contraction provide valid generalization with respect to every
loss l such that l : U × Y → [0, 1] is equicontinuous with respect to U, i.e., ∀ε > 0 ∃δ(ε)
such that ∀y ∈ Y and u_1, u_2 ∈ U: if |u_1 − u_2| ≤ δ(ε) then |l(u_1, y) − l(u_2, y)| ≤ ε.
We call such a loss function admissible. This holds due to the inequality
L(ε, l_F|(x, y), |·|_∞) ≤ L(δ(ε), F|x, |·|_∞) as proved in [17]. Admissible loss functions
include the popular choices |u − y|^s for s ≥ 1 and U = Y = [0, 1], as an example.
Sufficient guarantees for the contraction property of the transition function can be
stated easily for a large number of activation functions including the sigmoidal function:

Lemma 1. Any recurrent architecture with Lipschitz continuous activation function
with constant K, n context neurons, and weights bounded by C/(Kn) for C < 1 fulfills
the distribution independent UCEM property with respect to admissible loss functions.
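The weight bound of Lemma 1 can be checked numerically. The sketch below is illustrative and rests on two assumptions made explicit here: the maximum norm on contexts, and the Lipschitz constant K = 1/4 of the sigmoidal function sgd. It samples random context pairs and records the worst observed contraction ratio, which by the lemma's argument stays below C:

```python
import numpy as np

def sgd(x):
    return 1.0 / (1.0 + np.exp(-x))

def contraction_gap(A_ctx, num_trials=1000, seed=0):
    """Empirically check |f(x, u1) - f(x, u2)|_inf <= C |u1 - u2|_inf for a
    sigmoidal transition; returns the largest observed ratio.  A_ctx is the
    n x n weight block acting on the context; z bundles the (shared)
    contribution of the input x and the bias."""
    rng = np.random.default_rng(seed)
    n = A_ctx.shape[0]
    worst = 0.0
    for _ in range(num_trials):
        z = rng.normal(size=n)
        u1, u2 = rng.normal(size=n), rng.normal(size=n)
        num = np.max(np.abs(sgd(z + A_ctx @ u1) - sgd(z + A_ctx @ u2)))
        den = np.max(np.abs(u1 - u2))
        worst = max(worst, num / den)
    return worst
```

With n = 3 context neurons, K = 1/4 and C = 1/2, the lemma's weight bound is C/(Kn) = 2/3; drawing A_ctx with entries in [−2/3, 2/3] then guarantees a ratio of at most C = 1/2 in every trial.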
A second guarantee for valid generalization can be obtained for restricted input
distributions. Denote by X_t the input sequences restricted to length at most t and choose
values p_t such that 0 ≤ p_t ↑ 1. Assume that we restrict to those probabilities P on X × Y
such that the marginal distribution on X assigns a probability of at least p_t to X_t; hence
prior information on the probability of long input sequences is available. It is shown in
[9] that the UCEM property with respect to the linear loss holds under these conditions
if VC(F|X_t) < ∞. One can generalize the result as follows:

Theorem 2. Assume F is a function class with inputs in X = ∪_t X_t, where X_t ⊆ X_{t+1},
and outputs in [0, 1]. Assume 0 ≤ p_t ↑ 1. Assume P are distributions on X × Y such
that the marginal distributions assign a value ≥ p_t to X_t. Then F fulfills the UCEM
property w.r.t. any admissible loss function and P if the restrictions F|X_t fulfill the
UCEM property w.r.t. the probabilities induced by P on X_t and the linear loss.
Proof. Assume the parameter δ(ε) describes the admissible loss. As proved in [17],
we can equivalently substitute the metric |·|_∞ in the UCEM characterization by the
Euclidean metric |·|. Denote the restrictions of the marginal distribution of P to X_t by
P_t and the restriction of F to inputs X_t by F_t. One can estimate

E_{P^m}( lg L(ε, l_F|(x, y), |·|) )/m ≤ E_{P^m}( lg L(δ(ε)/2, F|x, |·|) )/m
≤ P( ≤ m(1 − δ(ε)/2) points in X_t ) · lg(2/δ(ε)) + E_{P_t^m}( lg L(δ(ε)/2, F_t|x, |·|) )/m
≤ lg(2/δ(ε)) · 16 p_t(1 − p_t)/(m δ(ε)²) + E_{P_t^m}( lg L(δ(ε)/2, F_t|x, |·|) )/m

for p_t ≥ 1 − δ(ε)/4 due to the Chebyshev inequality. The second term tends to zero
for m → ∞ if F_t fulfills the UCEM property with respect to P_t. □
Hence RNNs with finite pseudodimension for restricted input length t fulfill the UCEM
property for arbitrary inputs with respect to admissible losses and input distributions
where sequences of length ≤ t have a probability p_t ↑ 1.
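Whether such prior information on the length distribution is plausible can at least be checked on data. A minimal sketch (the helper name is illustrative) estimates the mass p_t that the marginal distribution assigns to X_t empirically:

```python
def length_mass(sequences, t):
    """Empirical estimate of p_t: the fraction of input sequences whose
    length is at most t, i.e. the empirical mass of X_t."""
    return sum(1 for s in sequences if len(s) <= t) / len(sequences)
```

Monotonicity 0 <= p_t ↑ 1 holds for the empirical estimate by construction, since X_t ⊆ X_{t+1}.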
4 Posterior Bounds
However, it may happen that neither prior information about the input distribution is
available nor do the weights fulfill appropriate restrictions. A different approach allows
us to derive posterior, data dependent bounds on the generalization ability. For this
purpose we use a modification of the luckiness framework proposed in [16]. The key idea is
very simple: Generalization does not hold in the general setting. However, in lucky
situations, e.g. for a restricted maximum input length of the sequences, very good
generalization bounds can be obtained. Hence the situation is stratified according to the
concrete setting. A luckiness function L : X^m × F → R⁺ measures the concrete luckiness
of the output of a learning algorithm on the training data x. In lucky situations, i.e.
for large t, the covering number L_{L≥t}(ε, l_F|(x, y), |·|_∞) of the functions in F such
that L yields at least t on x should be small. Additionally, the luckiness function should
be smooth in the following sense: Denote by P the marginal distributions of P on X.
Denote by B : R⁺ → R⁺ some monotonically increasing function. Then
sup_{P∈P} P^{2m}( xx′ | ∃f ∈ F ∀X ⊆_η x, X′ ⊆_η x′ : L(XX′, f) > B(L(x, f)) ) ≤ δ

should hold for some η = η(m, L(x, f), δ) : N × R² → R⁺. Here X ⊆_η x indicates that
a fraction η of the points of x is canceled in X. This tells us that we can bound the
luckiness of a double sample with regard to the luckiness on the first half of the sample.
Theorem 3. Assume p_t are positive values which add up to 1. Assume γ > 0. If L is a
smooth luckiness function then one can estimate under the above conditions

sup_{P∈P} P^m( (x, y) | ∃f ∃t ( L(x, f) ≥ t ∧ |Ê_m(l_f, (x, y)) − E_P(l_f)| > ε ) ) ≤ δ

for η = η(m, L(x, f), p_t δ/4) and

ε = 6η + 4γ + 2 √( (1/m)( 4 lg m + 2 lg(16/(p_t δ))
    + 2 lg sup_{P∈P} E_{P^{2m}}( L_{L≤B(t)}(γ, l_F|(XX′, YY′), |·|_∞) ) ) ).
(The proof is similar to [1, 17] and is omitted due to space limitations.)
We want to apply this result to recurrent networks. For this purpose, we consider
the dual formalism, an unluckiness function. Let F̃ be given by a recurrent architecture
with outputs in [0, 1]. We define the unluckiness function L*(x, f) = maximum length of
all but a fraction η of the sequences in x. We consider the absolute loss function.
The expected external covering number of F̃ on m points restricted to situations with
unluckiness at most t can be limited by (X_t are the sequences of length ≤ t)

2^{ηm} · 2( (2e(1 − η)m/ε) ln(2e(1 − η)m/ε) )^{VC(F̃|X_t)}.
Corollary 1. The inequality sup_{P∈P} P^{2m}( z | |Ê_m(l_{h(z)}, z) − E_P(l_{h(z)})| > ε(x) ) ≤ δ
is valid in the above situation for any output h(z) of a learning algorithm h on z and
for

ε(x) = O( η + √( (1/m) ln(1/δ) ) + √( (1/m) VC(F̃|X_{t_x}) ln m · ln(m/δ) ) ),

t_x being the maximum length of all but a fraction η in the sample x.

Proof. The above unluckiness function is smooth with the identity as B and η(m, L, δ) =
√( (1/(2m)) ln(2/δ) ), which follows from Hoeffding's inequality. Hence the result follows
directly from Theorem 3 and the estimation of the covering number, setting p_t = 2^{−t}
and γ = ε. □
Hence we obtain posterior bounds which depend on the concrete training sample. Moreover,
we may drop a fraction η when measuring the maximum length, in contrast to [8, 16].
Note that this possibility is essential in order to obtain good bounds, since the fraction
of long input sequences will approximate their respective probability in the limit.
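The sample-dependent quantity t_x, the maximum length of all but a fraction η of the sequences, is simply a quantile of the length distribution. A minimal sketch with an illustrative function name:

```python
def unluckiness(sequences, eta):
    """t_x: the maximal length of all but a fraction eta of the sequences,
    i.e. roughly the (1 - eta)-quantile of the length distribution.
    Dropping the eta longest sequences makes the bound immune to a few
    exceptionally long outliers in the sample."""
    lengths = sorted(len(s) for s in sequences)
    keep = max(1, int((1.0 - eta) * len(lengths)))  # number of sequences kept
    return lengths[keep - 1]
```

For a sample with lengths 1, 2, 3, 100 and η = 0.25, the single outlier of length 100 is dropped and t_x = 3, whereas η = 0 yields t_x = 100; this is exactly why dropping a fraction is essential for good bounds.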
A second advantage lies in the possibility to deal with general loss functions. One
interesting topic is long term prediction, where the outputs consist of sequences. One
possible loss function computes d(x, y) = Σ_t p_t |x_t − y_t|, where p_t are positive values
adding up to 1, and x_t and y_t are the t-th components of the sequences x and y,
respectively, if existing, and 0 otherwise. Theorem 3 yields the following result:
Theorem 4. Assume a recurrent sigmoidal architecture with W weights is given. We
consider long term prediction, i.e. inputs and outputs are contained in [0, 1]. Choose
the above loss function and choose T such that Σ_{t>T} p_t < ε. Assume x is the training
sample and t_x is the maximum length of all but a fraction η in x. Then the empirical
and the real error deviate by at most ε with confidence δ for

ε(x) = O( η + √( (1/m) ln(1/δ) ) + √( (1/m) W⁴(T³ + T t_x²) ln m · ln(m/δ) ) ).
m
1
Proof. As above we choose the smooth unluckiness which measures the length of all
0;
0 ); jj1 )
but a fraction of the sample. The covering number LLB (t) (; lF j(
XX YY
is to be estimated. T is chosen such that sequences of length > T do not contribute to
the minimum number of covering functions. For outputs of a varying length we can
upper bound the covering number of the entire class via the product of the covering
numbers restricted to outputs of a fixed length [1]. Hence we obtain the bound m Q
d
2 tT (2em= ln 2em= ) t+tx where dt+tx is the pseudodimension of the sigmoidal
recurrent architecture with inputs of length at most t + tx , tx being the maximum length
of all but a fraction in which is limited by O(W 4 (t + tx )2 ).
2
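The long term prediction loss d(x, y) = Σ_t p_t |x_t − y_t| used in Theorem 4 can be sketched directly. The weight sequence p is passed in explicitly (e.g. p_t = 2^{−(t+1)}), missing entries are treated as 0 as in the definition, and the function name is illustrative:

```python
def lt_loss(x, y, p):
    """Long term prediction loss d(x, y) = sum_t p_t * |x_t - y_t|.
    x and y are sequences of possibly different length; entries beyond a
    sequence's end count as 0.  p is a sequence of positive weights
    summing to 1, so that d maps into [0, 1] for [0, 1]-valued entries."""
    d = 0.0
    for t, pt in enumerate(p):
        xt = x[t] if t < len(x) else 0.0
        yt = y[t] if t < len(y) else 0.0
        d += pt * abs(xt - yt)
    return d
```

Because the tail weights Σ_{t>T} p_t can be made smaller than any ε, truncating the sum at T changes the loss by less than ε, which is exactly the role of T in Theorem 4.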
x
5 Conclusions
We have considered the in principle ability of learning neural mappings dealing with sequences, more precisely, the generalization ability of RNNs. We obtained prior bounds
for restricted weights or restricted probabilities of long input sequences. Alternatively,
posterior bounds depending on the concrete sample could be derived. Since the general
agnostic scenario is considered, the results apply to general loss functions as demonstrated for the example of long term prediction.
References
1. M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, 1999.
2. E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation,
1, 1989.
3. B. Dasgupta and E. D. Sontag. Sample complexity for learning recurrent perceptron mappings. IEEE Transactions on Information Theory, 42, 1996.
4. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data
sequences. IEEE Transactions on Neural Networks, 9(5), 1997.
5. C. L. Giles, G. M. Kuhn, and R. J. Williams. Special issue on dynamic recurrent neural
networks. IEEE Transactions on Neural Networks, 5(2), 1994.
6. M. Gori (chair). Special session: Adaptive computation of data structures. ESANN’99, 1999.
7. M. Gori, M. Mozer, A. C. Tsoi, and R. L. Watrous. Special issue on recurrent neural networks
for sequence processing. Neurocomputing, 15(3-4), 1997.
8. B. Hammer. Learning with Recurrent Neural Networks. Springer Lecture Notes in Control
and Information Sciences 254, 2000.
9. B. Hammer. Generalization ability of folding networks. To appear in IEEE Transactions on
Knowledge and Data Engineering.
10. D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other
learning applications. Information and Computation, 100, 1992.
11. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 2, 1989.
12. M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward Efficient Agnostic Learning. Machine
Learning, 17, 1994.
13. P. Koiran and E. D. Sontag. Vapnik-Chervonenkis dimension of recurrent neural networks.
Discrete Applied Mathematics, 86(1), 1998.
14. W. Maass and P. Orponen. On the effect of analog noise in discrete-time analog computation.
Neural Computation, 10(5), 1998.
15. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8),
1997.
16. J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization
over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
17. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.
18. P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Science. PhD thesis, Harvard University, 1974.