Sequence Learning with Recurrent Networks: Analysis of Internal Representations
Joydeep Ghosh
Department of Electrical and Computer Engineering,
University of Texas, Austin, TX 78712-1084
Vijay Karamcheti
Coordinated Science Laboratory,
University of Illinois, Urbana, IL.
Abstract. The recognition and learning of temporal sequences is fundamental to cognitive processing. Several
recurrent networks attempt to encode past history through feedback connections from "context units". However,
the internal representations formed by these networks are not well understood. In this paper, we use cluster analysis
to interpret the hidden unit encodings formed when a network with context units is trained to recognize strings
from a finite state machine. If the number of hidden units is small, the network forms fuzzy representations of the
underlying machine states. With more hidden units, different representations may evolve for alternative paths to the
same state. Thus, appropriate network size is indicated by the complexity of the underlying finite state machine.
The analysis of internal representations can be used to model an unknown system based on observation of its
output sequences.
1 Introduction
Many cognitive behaviors involve the generation and recognition of temporal sequences [3, 4, 8]. Time is an integral part of goal-directed behaviors and planning that are characterized by coordination of functionality over long
sequences of input-output pairs. This implies that goals and plans act as a context for the interpretation and generation of individual events. Therefore the processing of sequences using artificial neural networks is of considerable
importance.
In this paper, we further the study of Cleeremans et al. [2] on recognizing sequences from a Finite State Automaton
(FSA) using Elman's recurrent network. The emphasis is on characterizing the internal representation of the states
of the FSA as activation patterns of the hidden units, and thus obtaining a description of what the network "learns".
Such an understanding makes it feasible to perform reverse engineering and extract an unknown FSA model given a
sequence of positive and negative examples generated by that model. This ability has obvious implications in modeling
and control of discrete event systems that can be characterized by Markov models to a good approximation. The
problem of representation is easily solved for serial processing approaches, where a representation of sequences is
implicit in the control flow mechanism.
The next section compares two approaches to neural net based recognizers of temporal sequences, and introduces
Elman's recurrent network. Section 3 describes the network architecture and experiments. Section 4 presents
an analysis of the internal representations formed by the trained recurrent network, and Section 5 highlights the
significance of the results.
2 Implicit Representations of Time using Recurrent Networks
Consider a neural network used for recognition or classification based on observing input signals/patterns over a
period of time. One obvious way of handling input patterns with a temporal factor is to represent time explicitly
by associating the serial order of the pattern with the dimensionality of the pattern vector. The first temporal
event is represented by the first element in the pattern vector, the second temporal event by the second component
of the pattern vector, and so on, in what is called a `moving window' paradigm. Thus, the temporal dimension of
data is transformed into a spatial dimension across the input units. The transformed input can then be fed into a
feedforward network such as the multilayered perceptron to perform the required mapping.
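As a minimal illustration of this spatialization (the function and variable names here are ours, not from the original work), the moving-window transformation can be sketched as:

```python
def moving_window(sequence, window):
    """Unroll a temporal sequence into overlapping fixed-length vectors:
    the temporal dimension becomes a spatial dimension across the
    input units of a feedforward network."""
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]

# A 6-step signal with a window of 3 yields 4 spatialized input patterns,
# each of which can be fed to a multilayered perceptron as one input vector.
patterns = moving_window([0.1, 0.4, 0.9, 0.4, 0.1, 0.0], window=3)
```

Note that the window length fixes the input dimensionality of the network, which is precisely the limitation discussed next.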
Such explicit representations of time have been used in NETtalk [11] and "time-delay neural networks" such
as those used with some success in speech recognition [13]. While explicit representations obviate the need for
feedback connections or other methods for realizing a memory to store past events, they result in large inputs and a
computationally expensive technique that is not `psychologically satisfying' [1]. An interface with the external world
is required to buffer the input so that it can all be presented at the same time. This imposes a temporal window
of the same size as the set of input units, so that no data occurring before the window can influence the way the spatialized
vector is processed. Thus, the representation is not suitable for patterns of variable length.
An alternative is to use a dynamic network that is given some kind of memory to encode past history. Examples
of dynamic neural networks include those based on attractor transitions [1] and those using context units [7, 9].
In networks of the latter class, non-trainable sets of feedback connections are added to a standard feed-forward
framework. The current input vector is combined with residual activations (context) from the previous cycle in order
to obtain the current output, and the new context is generated in terms of the previous context and the current
input. An example of such networks is the Simple Recurrent Network (SRN) of Elman, shown in Fig. 1, in which
the hidden unit layer is allowed to feed back onto itself so that the intermediate results of processing at time t-1 can
influence the intermediate results of processing at time t. In practice, the SRN is implemented by copying the pattern
of activation on the hidden units onto a set of `context units' which feed into the hidden layer along with the input
units.
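The copy-back mechanism can be made concrete with a short sketch of one SRN time step (a minimal illustration with names and weight layout of our own choosing, not the original implementation):

```python
import math

def srn_step(x, context, W_in, W_ctx, W_out):
    """One forward step of a Simple Recurrent Network: the current input x
    and the previous hidden activations (the context) jointly drive the
    hidden layer; the resulting hidden pattern produces the output and is
    copied back to serve as the context at the next time step."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row_in, x)) +
                      sum(w * ci for w, ci in zip(row_ctx, context)))
              for row_in, row_ctx in zip(W_in, W_ctx)]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
              for row in W_out]
    return output, hidden   # caller feeds `hidden` back in as the next context
```

A sequence is processed by threading the returned hidden vector back in as `context`, with the context initialized to neutral values at the start of each sequence.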
Figure 1: Elman's Simple Recurrent Network [3].
The set of context units provides the system with memory in the form of a trace of activations obtained at the
previous time step. The activation values of the hidden units correspond to an `encoding' or internal representation of
the presented input pattern that is partially processed into features relevant to the task. In the case of Elman's network,
the internal representations encode not only the input (prior event) but also relevant aspects of the representation
that was developed in the hidden units during the prediction of the prior event from its predecessor. When fed back,
such representations could provide information that allows the network to maintain prediction relevant features of
an entire sequence.
Cleeremans et al. [12, 2] conducted an extensive set of experiments training Elman's Simple Recurrent Network
on strings from a finite state grammar and reported some very impressive results:
(i) The network was correctly able to predict strings from the grammar after being trained on exemplars drawn from
the grammar.
(ii) It was able to remember length constraints on the strings.
(iii) Long-distance contingencies could be remembered given certain constraints.
Other notable efforts on grammatical inference using second-order recurrent networks are presented in [5, 14].
This paper attempts to analyze the internal representations that develop in an experiment similar to the one
above, in an effort to explain the prediction capabilities of the network. It also tries to show the relationship that
exists between the internal representations and the amount of representational resources (number of hidden units).
3 Recognition of Strings from FSAs
The following experiments were conducted in order to obtain the internal representations that develop corresponding
to different input patterns obtained from an FSA. These representations were then analyzed in an attempt to explain
the prediction capabilities of the network. In all the experiments conducted, the network was assigned the task of
predicting the successive elements of a sequence. The task is interesting from the point of view of seeing whether
the network develops some representation of the complete sequence without, at any time, seeing more
than two elements of the sequence simultaneously. Different amounts of temporal information can be incorporated
into the sequences by constructing training and testing strings under different constraints. The stimulus set
thus needs to exhibit various features, such as different sequence lengths, loops in sequences, and elements that are
more or less predictable, in order to exploit the potential of the network architecture.
Figure 2: A small Finite-State Grammar.
The sequences on which the network was trained were chosen from the grammar of a small FSA given by Reber
[10], and shown in Fig. 2. A grammatical string (a string belonging to the grammar of the automaton) is generated
by starting at node #0 (start node) and traversing various nodes till node #5 (end node) is reached. Each transition
between nodes results in the addition of the letter on the arc between the two nodes to the string being constructed.
The difficulty the network has in learning this task arises from the fact that no simple correspondence can be inferred
between input-output pairs drawn from the sequence. Two instances of the same letter may lead to different nodes
in the FSA and therefore to different successors.
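For concreteness, a string generator for such a grammar can be sketched as follows. The transition table below is the standard Reber automaton and is consistent with the example strings analyzed in Section 4 (e.g. BTXSE, BPVVE), but the arc labels should be read off Fig. 2; treat them as an assumption here:

```python
import random

# (node -> [(letter, next_node), ...]) for the Reber grammar of Fig. 2.
# Arc labels follow the standard Reber (1967) automaton (assumed).
TRANSITIONS = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def generate_string(max_len=32, rng=random):
    """Random walk from start node 0 to end node 5, choosing each outgoing
    arc with probability 0.5, bracketed by the B and E markers. Walks that
    hit the length constraint before reaching node 5 are retried."""
    while True:
        node, letters = 0, ['B']
        while node != 5 and len(letters) < max_len:
            letter, node = rng.choice(TRANSITIONS[node])
            letters.append(letter)
        if node == 5:
            letters.append('E')
            return ''.join(letters)
```

Every generated string is grammatical by construction; two draws of the same letter (say `X` from node 1 versus node 3) lead to different nodes, which is exactly the ambiguity the network must resolve from context.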
3.1 Network Architecture
The network used was Elman's Simple Recurrent Network discussed earlier. The network consists of two pools of
input units: the context pool is used to represent the temporal context by holding a copy of the hidden
units' activation levels at the previous time-slice (fully connected feedback). To help encode the various input-output
pairs that are possible when the sequences belong to the FSA discussed above, the number of units in the second
input pool (and at the output) was chosen to be the same as the number of letters in the alphabet of the FSA. An
additional unit was used at both the input and the output to encode the start (B) and end (E) letters respectively.
Thus the network used had 6 input, 6 output, and a variable number of hidden units. Each unit has an
activation level in the range [0, 1].
On each trial the network was presented with an element of the string and was required to predict the next element
in the string. Each letter is encoded by a 6-bit vector, which is equivalent to representing it by the activation of a
single unit. The number of units in the hidden layer was varied as a parameter in the experiments. Experiments
were performed for 3, 4, 6 and 10 hidden units.
Coding of the Strings. A string of n letters is coded as a series of n + 1 training patterns. Each pattern
consists of two input vectors and one target vector:
1. An NH-component vector (NH is the number of hidden units) representing the activation of the hidden units at time
t-1.
2. A 6-bit vector representing element t of the string.
The target vector is a 6-bit vector representing the next element of the string.
To ensure that the network does not carry over the context of the last element of the previous string to the next
string, the context units are reset at the beginning of each new string to activation values of 0.5.
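The coding scheme above can be sketched as follows. The ordering of letters over the six units is our own assumption for illustration; the paper fixes only that the additional sixth unit encodes B on the input side and E on the output side:

```python
LETTERS = ['T', 'S', 'X', 'V', 'P']   # unit ordering assumed for illustration

def one_hot(letter):
    """6-bit local encoding; the sixth unit stands for B at the input
    and for E at the output."""
    vec = [0.0] * 6
    vec[LETTERS.index(letter) if letter in LETTERS else 5] = 1.0
    return vec

def string_to_patterns(s, NH):
    """Code a string (with its B and E markers) as successive
    (input letter, target letter) vector pairs. The context half of the
    first input is reset to 0.5; thereafter the context is the previous
    hidden activation, supplied by the network at run time."""
    initial_context = [0.5] * NH
    pairs = [(one_hot(cur), one_hot(nxt)) for cur, nxt in zip(s, s[1:])]
    return initial_context, pairs
```

For a string such as BTXSE this yields four letter-prediction pairs, one per transition, matching the n + 1 patterns described above.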
Training. Training consisted of the presentation of between 60,000 and 80,000 strings (the exact number varied with
the experiment) generated according to the FSA, with a length constraint of 32 imposed to keep the execution
time within bounds (since infinite strings can also be generated by the FSA grammar). The number of strings on
which the training is performed needs to be quite large, since we are trying to represent the infinite behavior of the
grammar with a finite number of exemplars. This also means that the learning rate for any weight-update algorithm
used should be quite small. Further, in generating the strings of the grammar, equal probabilities were
assumed along all outgoing arcs from a node. In our case this worked out to 0.5, since all internal nodes of the FSA
had two possible successors.
Training took place using the Back-propagation algorithm. The error signal was derived from comparing activation
on the output layer (predicted by the network) with the target pattern determined by the next letter in the string.
A slight tolerance, ε, was allowed in the activation values for computing the value of the error signal. This is because
the system cannot actually reach its extreme values of 1 or 0 without infinitely large weights. A momentum term, α,
was also introduced to allow learning to take place without oscillation. The ranges of values of the various parameters
of the Back-propagation algorithm are: (i) learning rate, η = 0.01–0.1, (ii) momentum term, α = 0.9, and (iii)
tolerance, ε = 0.1.
The weights were updated after the presentation of each letter. There really is no concept of a training epoch in
this case, since exemplars are generated at random. It is this randomness that allows the network to learn an
infinite behavior with only a finite set of exemplars.
Testing and Performance of the network. Testing in this case is slightly different from the usual case, where a
neural net is trained on a set of input-output pairs and in the test phase only the input vector is presented to the
network: the output generated by the network is compared with the corresponding target vector, and this gives a
measure of the network's performance. A similar exercise is not possible for this experiment, since more than one
successor is possible for a particular letter presented to the network as an input. This non-determinism arises for
two reasons:
(i) There is more than one successor to each node in the FSA, and
(ii) A letter can be present on more than one transition between different nodes. In this case the current state of the
FSA comes into play in deciding the successor to an input letter.
Further, on the presentation of an input letter, the activation of the output units is in accordance
with the probability of occurrence of each letter as the successor to the input letter. To take a specific
example, the network always activates both P and T (the output units corresponding to those letters) whenever the start
symbol B is presented as an input. Since during training P and T followed B equally often, each is activated partially
in order to minimize error.
In order to test whether the network would generate good predictions after every letter of any grammatical
string, its behavior was tested on two test cases of 100 and 1000 randomly generated strings. An idea of true
network performance can only be obtained by testing with a large number of strings; however, the object
in performing this experiment was not to evaluate network performance, but to analyze the representations that
develop as a result of training. The performance figures were only used as a guideline to identify the instances in
which better training has taken place, the thesis being that these instances would have `better organized' internal
representations. The network was deemed to accept a string if it met the following criterion:
It correctly predicts all letters in the string up to the end letter E, when presented with the previous letter in the
string.
The network is assumed to correctly predict a letter if the activation of the output unit corresponding to the
letter is > 0.3. This threshold is an arbitrary one which follows from the fact that, for the FSA being considered,
each letter has at most two successors (the network is assumed to have learned to differentiate between different
contexts of a letter). Thus, for each letter, 2 of the 6 output units have a `high' value and the rest a `low' value.
Using a value between 0 and 0.1 as the tolerance (fuzziness) of each output unit, an estimate of the `high' value
can be shown to be close to 0.3.
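The acceptance criterion can be written out as a short test harness. The interface `network_step(x, context) -> (output, hidden)` and the letter-to-unit ordering are our assumptions for illustration:

```python
LETTERS = ['T', 'S', 'X', 'V', 'P']          # unit ordering assumed
THRESHOLD = 0.3                              # acceptance threshold from the text

def letter_index(letter):
    # The sixth unit encodes B at the input and E at the output.
    return LETTERS.index(letter) if letter in LETTERS else 5

def one_hot(letter):
    vec = [0.0] * 6
    vec[letter_index(letter)] = 1.0
    return vec

def accepts(network_step, string, NH):
    """Walk a trained SRN along a string (including its B and E markers)
    and accept iff, at every position, the output unit for the actual
    next letter exceeds the threshold."""
    context = [0.5] * NH                     # context reset at string start
    for cur, nxt in zip(string, string[1:]):
        output, context = network_step(one_hot(cur), context)
        if output[letter_index(nxt)] <= THRESHOLD:
            return False
    return True
```

Note that this checks only that the correct successor is among the highly activated units; it does not require the network to choose a single successor, consistent with the non-determinism discussed above.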
4 Analysis of Internal Representations
Having obtained the best cases of training for each NH value, the internal representations were recorded as the
hidden unit activations corresponding to the presentation of each letter in the following small set of
randomly generated strings, with the length of the strings limited to 9:
BTXXTTVPSE
BTXXTVPSE
BPVVE
BPTVPXTVVE
BTXXTVVE
BPVPSE
BTXXVPXVVE
BPTVPSE
BTSSXXVPSE
BPVPXVVE
BPTVVE
BTXSE
BTSSXSE
BPTVPXVVE
BTSXSE
BTXXTTTVVE
BTSXXTVVE
BTSXXVPSE
BTSSXXVVE
BTSSSXSE
The internal representations are obtained as points in an NH-dimensional space.
Table 1 shows the fraction of strings accepted by the network after training with dierent parameters.
Table 1: Fraction of Reber Grammar strings accepted by simple recurrent networks of various sizes.
# Hidden Units   Learning Rate   # Training Strings   # Test Strings   Fraction Accepted
3                0.1             60,000               100              0.48
3                0.1             70,000               100              0.52
3                0.05            60,000               100              0.78
3                0.05            60,000               1000             0.826
3                0.01            70,000               100              0.60
3                0.01            60,000               100              0.68
3                0.01            60,000               1000             0.721
4                0.1             70,000               100              0.73
4                0.1             70,000               1000             0.758
4                0.05            70,000               100              0.61
4                0.01            70,000               100              0.76
4                0.01            70,000               1000             0.746
6                0.05            70,000               100              0.79
6                0.05            70,000               1000             0.846
6                0.01            70,000               100              0.68
6                0.01            70,000               1000             0.694
10               0.1             70,000               100              0.94
10               0.1             70,000               1000             0.974
10               0.1             70,000               100              0.99
10               0.1             70,000               1000             0.999
The internal representations of the best trained networks from above were obtained by testing them on the small
subset of strings specified earlier. A vector of NH components is obtained corresponding to the presentation of each
letter in each of the 20 strings. For the subset of strings chosen, there were 121 such presentations, and therefore
121 vectors in NH-dimensional space corresponding to each value of NH.
To facilitate analysis of the representation that develops across the hidden units, these vectors were subjected
to a clustering analysis where Euclidean distance was used as a criterion for forming clusters. The resulting cluster
trees are shown as Figs. 3-6. The ordinate (label of the leaf) in the cluster tree refers to the hidden unit vector
corresponding to a particular letter presentation (this is indicated by the upper-case letter in the string). For example,
if the leaf is identified as `tssXse', `X' is the current letter and its predecessors were `T', `S', and `S'. The abscissa refers
to the distance between clusters. Clusters formed at small distances are similar to each other.
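The construction of such cluster trees is agglomerative clustering of the hidden-unit vectors under Euclidean distance. A small self-contained sketch follows (our own implementation, assuming single linkage; the paper does not specify the linkage rule):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_tree(points, labels):
    """Agglomerative clustering of hidden-unit activation vectors:
    repeatedly merge the two closest clusters (single linkage assumed),
    recording each merge and the distance at which it occurred."""
    clusters = [([p], [l]) for p, l in zip(points, labels)]
    merges = []
    while len(clusters) > 1:
        pairs = [(min(euclidean(u, v) for u in ci[0] for v in cj[0]), i, j)
                 for i, ci in enumerate(clusters)
                 for j, cj in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        merges.append((clusters[i][1], clusters[j][1], d))
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

The list of merges, ordered by increasing distance, is exactly the information drawn as a dendrogram: early merges correspond to tight subclusters, late merges to the coarse groupings interpreted as FSA nodes.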
From an analysis of the cluster trees for various numbers of hidden units the following observations can be made:
1. For 3 hidden nodes (Fig. 3), the activation patterns are grouped, to a degree of approximation, according to
the different nodes in the Finite-State Grammar: all the patterns that produce a similar prediction are grouped
together, independently of the current letter. For example, the bottom cluster groups together patterns that
result in the activation of the arcs from node # 2, i.e., `T' and `V'. Therefore, copying the hidden layer to the
context layer provides the network with information about the current node. That information is combined
with the current letter to produce a pattern on the hidden units that is a representation of the next node.
2. At very small distances in the cluster tree, patterns within a cluster corresponding to a particular node are
further divided according to the path traversed before that node. An example of this type of behavior is seen
by looking at the cluster tree for NH = 3. The patterns in the cluster corresponding to node # 2 are further
subdivided into two subclusters: one containing patterns that arrive at node # 2 from nodes # 0 or 2, and the
other from node # 3.
3. For 10 hidden nodes (Fig. 6), clusters of internal representations do not correspond to the nodes of the grammar:
different representations for a node have developed. A closer look reveals that the internal representations
correspond to a representation of the path by which the pattern has reached a particular node in the FSA.
The figures next to each cluster in Fig. 6 show the paths taken by the patterns. Within a cluster, the
various subclusters again show a further differentiation with respect to the path taken in reaching a particular node.
For example, consider the bottom cluster in Fig. 6 that corresponds to reaching node # 5 from node # 4. This
cluster is made up of subclusters that contain patterns that have reached node # 4 from the node sequence
3-2-2-4 (e.g. `txxtvV'), 0-2-4 (e.g. `pvV'), 3-2-4 (e.g. `pvpxvV'), and 0-2-2-4 (e.g. `ptvV').
4. Figs. 4 and 5 show intermediate stages of this development where path representations become macro behaviors
and explicit node representations are no longer present.
Based on these observations, the following can be stated:
- Copying the state of activation on the hidden layer and using it as an input for the next time step provides
the network with the basic equipment to act as a sequence generator/recognizer.
- The Simple Recurrent Network can be used as a recognizer of a finite-state machine when the exemplars
presented to the net are the strings of the FSA grammar.
- Constrained representation resources result in the network picking up representations of the actual nodes of
the grammar. Even in this case, path information is stored in the network at the micro-level.
- A larger set of representation resources results in the storage of the path information at a macro-level. This
does not contradict the earlier statement, since now the network has learned to respond to contexts which
are more abstractly defined. Thus, sensitivity to context does not preclude the network's ability to capture
generalizations that are at a high level of abstraction. This is evidenced by the near perfect acceptance of
strings by the network when it has NH = 10 units.
- An interesting fact to be noted is that the network picks up path information without explicitly being trained
to do so. It develops path representations that mimic the paths in the actual FSA (Fig. 6). This behavior
seems to be an attribute of the recurrent nature of the network, since preserving information about the path
does not contribute in itself to reducing error in the prediction task.
- The analysis performed can also be viewed as looking at the network in h-space. On seeing a letter, the
current state of the hidden units moves to another point in h-space. Depending on which cluster the network
is in, one can draw an interpretation of the state/path the FSA would be in if it were accepting letters from a
string of the grammar.
5 Extraction of FSA from Trained Recurrent Networks
The experiments described above show a linkage between the internal representations developed and the number of
hidden units in the network. This helps to determine the number of hidden units required by a network to satisfactorily
learn strings from an arbitrary FSA. Note that the internal representations can be grouped together and a "state"
label given to each group. Moreover, by tracking the sequence of groups visited by the hidden unit vector when an
input string is presented, one obtains transitions among the groups along with the letters associated with these
transitions. Thus it is possible to extract an FSA description by analyzing the internal representations of a trained
SRN. If many hidden units are used, then the extracted FSA has a large number of states. However, such FSAs can
easily be reduced to equivalent minimal FSAs [6].
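A minimal sketch of this read-off follows. The interface is hypothetical: here each run is a list of (cluster label, input letter) pairs, where the cluster labels would come from a dendrogram analysis such as that of Figs. 3-6:

```python
def extract_fsa(runs):
    """Read an FSA off clustered hidden-unit trajectories. Each run is a
    list of (cluster_label, input_letter) pairs in presentation order;
    clusters become states, and each consecutive pair contributes the
    labeled transition (src_cluster, letter) -> dst_cluster. Conflicting
    targets signal that the clustering is too coarse to act as states."""
    transitions = {}
    for run in runs:
        for (src, _), (dst, letter) in zip(run, run[1:]):
            prev = transitions.setdefault((src, letter), dst)
            if prev != dst:
                raise ValueError(f"ambiguous transition from {src} on {letter}")
    return transitions
```

The resulting transition table, possibly with many states when NH is large, can then be fed to a standard FSA minimization procedure to recover an equivalent minimal machine.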
This leads to the intriguing possibility of being able to extract an unknown FSA from an SRN trained using
strings from this FSA. Thus, this technique can be used to find a model for a plant given sequences of observed
output behavior. Pioneering work in this direction has been performed by Giles et al. [5, 6], who use a second-order
recurrent network and a clustering technique based on partitioning the space of hidden unit activations into
hypercubes. While this work highlights the feasibility of FSA extraction, it leaves open the question of finding the
right-sized hypercubes and avoiding artificial boundaries generated by such space quantization approaches.
Since the clustering method used in this paper yields natural, similarity-based partitions, it is worthwhile to
investigate automated routines for FSA extraction based on cluster diagrams such as those shown in Figs. 3-6. Also,
the efficacy of FSA extraction should be explored for larger or more involved grammars, including those that have
probabilities associated with the state transition arcs.
Acknowledgements: This research was supported in part by contract N00014-89-C-0298 with Dr. Barbara
Yoon (DARPA) and Dr. Thomas McKenna (ONR) as government cognizants, and by a Faculty Development Grant
from TRW. Vijay Karamcheti was supported by an MCD Fellowship while he was at UT, Austin. We thank Dr. C.
Lee Giles for helpful comments on this topic.
References
[1] T. Bell. Sequential processing using attractor transitions. In Proceedings of the 1988 Connectionist Models
Summer School, pages 93–102, June 1988.
[2] A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland. Finite state automata and simple recurrent networks.
Neural Computation, 1(3):372–381, 1989.
[3] J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[4] J. L. Elman. Distributed representations, simple recurrent networks, and grammatical inference. Machine
Learning, 7(2/3):91–121, 1991.
[5] C. L. Giles et al. Second-order recurrent neural networks for grammatical inference. In Proceedings of the
International Joint Conference on Neural Networks, pages II:273–281, Seattle, WA, July 1991.
[6] C. L. Giles et al. Extracting and learning an unknown grammar with recurrent neural networks. In J. E. Moody,
S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan
Kaufmann, San Mateo, CA, 1992.
[7] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley,
1991.
[8] J. Pollack. The induction of dynamical recognizers. Machine Learning, 7(2/3):227–, 1991.
[9] R. F. Port. Representation and recognition of temporal patterns. Connection Science, 1-2:151–176, 1990.
[10] A. S. Reber. Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 5:855–863, 1967.
[11] T. J. Sejnowski and C. R. Rosenberg. NETtalk: A parallel network that learns to read aloud. Technical Report
JHU/EECS-86/01, Johns Hopkins Univ., Baltimore, Jan. 1986.
[12] D. Servan-Schreiber, A. Cleeremans, and J. L. McClelland. Encoding sequential structure in simple recurrent
networks. Technical Report CMU-CS-88-183, Carnegie Mellon University, Pittsburgh, Nov. 1988.
[13] A. Waibel. Modular construction of time-delay neural networks for speech recognition. Neural Computation,
1(1):39–46, 1989.
[14] R. L. Watrous and G. M. Kuhn. Induction of finite-state languages using second-order recurrent networks. In
J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4.
Morgan Kaufmann, San Mateo, CA, 1992.