Long Short-Term Memory (LSTM) networks
Manu Airaksinen
21.2.2017
Overview
● Artificial neural networks
● Recurrent neural networks
● Long short-term memory
● LSTM extensions
● LSTM applications
● Homework
Artificial neural networks
● Originally created as computational models of biological brains
● In reality, only vague similarities with the “real deal”
● Basic unit = Neuron
  ● Has n input connections
  ● The weighted sum of the inputs is fed into the neuron’s non-linear activation function (e.g., tanh, sigmoid); see the sketch below
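A minimal numpy sketch of such a neuron (the input values, weights, and bias below are illustrative assumptions, not taken from the slides):

    import numpy as np

    x = np.array([0.5, -1.2, 0.3])        # n = 3 input connections
    w = np.array([0.8, 0.1, -0.4])        # one weight per input connection
    b = 0.05                              # bias term
    y = np.tanh(w @ x + b)                # weighted sum fed through the activation
    print(y)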
Multi-layer perceptron (‘DNN’)
● A common ANN architecture is a network of interconnected layers where each neuron i within layer k is connected to each neuron j within the next layer k+1 with a weight w_ij
● Activations of a single layer can be transmitted to the next layer by multiplication with the weight matrix W that contains the weights w_ij (see the sketch below)
● A “deep” neural network (DNN) commonly consists of at least two hidden layers between the input and the output
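A minimal sketch of this layer-by-layer propagation (the layer sizes and random weights are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [4, 8, 8, 2]                   # input, two hidden layers, output
    Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]

    a = rng.standard_normal(sizes[0])      # input vector
    for W, b in zip(Ws, bs):
        a = np.tanh(W @ a + b)             # layer k+1 activations = f(W a_k + b)
    print(a)                               # network output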
Tasks and training of neural networks
● We have defined the basic neural network architecture
● Where can we use it?
  ● Great pattern recognition capabilities
    – Classification, regression
● How do we train it?
  ● Supervised learning task: learn the network weight matrices W between each layer with backpropagation (a minimal sketch follows at the end of this slide):
    1. Initialize the weights (e.g., random initialization or layer-wise pre-training)
    2. Perform a forward pass on a chunk of training data (input and desired output known)
    3. Compute the error between the NN output and the target according to an error criterion (e.g., L2 norm)
    4. Backpropagate the error gradient through the layers, and adjust the weights slightly in the direction that reduces the error (against the gradient)
    5. Repeat steps 2-4 until a stopping criterion has been met (e.g., convergence, a certain number of epochs)
● The MLP architecture doesn’t account for sequential structure
  ● Sequential structure can be “hacked in” by including e.g. delta and delta-delta features, or by concatenating consecutive frames in the input layer
  ● This can achieve some performance in sequence labeling tasks, but suffers from a bloated structure as well as from a fixed context window
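A minimal numpy sketch of steps 1-5 for a single-hidden-layer MLP with a tanh hidden layer, a linear output, and an L2 error criterion; the layer sizes, learning rate, epoch count, and random training data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out, lr = 4, 8, 2, 0.01

    # 1. Initialize weights (here: small random values)
    W1 = rng.standard_normal((n_hidden, n_in)) * 0.1; b1 = np.zeros(n_hidden)
    W2 = rng.standard_normal((n_out, n_hidden)) * 0.1; b2 = np.zeros(n_out)

    X = rng.standard_normal((100, n_in))    # training inputs
    Y = rng.standard_normal((100, n_out))   # desired outputs

    for epoch in range(50):                 # 5. repeat until the stopping criterion
        for x, y in zip(X, Y):
            # 2. Forward pass
            h = np.tanh(W1 @ x + b1)
            y_hat = W2 @ h + b2
            # 3. Error criterion 0.5*||y_hat - y||^2; its gradient w.r.t. y_hat:
            e = y_hat - y
            # 4. Backpropagate and take a small step against the gradient
            dW2 = np.outer(e, h); db2 = e
            dh = (W2.T @ e) * (1.0 - h ** 2)    # tanh'(a) = 1 - tanh(a)^2
            dW1 = np.outer(dh, x); db1 = dh
            W2 -= lr * dW2; b2 -= lr * db2
            W1 -= lr * dW1; b1 -= lr * db1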
Recurrent neural networks (RNN)
● A natural extension of the MLP, where one (or more) hidden layer(s) is allowed to be self-connected
  ● i.e. the input at time instant t contains the current input vector as well as the layer output at the previous time step t-1 (sketched below)
  ● Allows propagation of sequential information within the network
● Training is done by unfolding the structure over a sequence of length T and performing backpropagation through time (BPTT)
  ● Same principle as normal BP, but errors are also propagated within the unfolded sequence structure
  ● Training has a considerably higher computational & memory footprint compared to the MLP
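A minimal sketch of the recurrent forward pass, h_t = tanh(W_x x_t + W_h h_{t-1} + b); the dimensions and the random input sequence are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, T = 3, 5, 10
    W_x = rng.standard_normal((n_hidden, n_in)) * 0.1
    W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.1
    b = np.zeros(n_hidden)

    x_seq = rng.standard_normal((T, n_in))
    h = np.zeros(n_hidden)                  # initial hidden state
    outputs = []
    for t in range(T):
        # current input vector + the layer output from the previous time step
        h = np.tanh(W_x @ x_seq[t] + W_h @ h + b)
        outputs.append(h)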
Problem with basic RNNs: vanishing gradient
● In theory, RNNs are capable of modeling sequential information over the whole history of the sequence
● In practice, the “vanishing gradient problem” limits the capabilities of RNNs
  ● Over time, the effect of the recurrence either blows up or decays exponentially, depending on the magnitude of the weights of the recurrent connection (a numerical illustration follows below)
    – Decay is greatly preferred over blowing up
  ● A realistic expectation is to be able to use sequential information over 5-10 discrete time steps
    – Not a really significant increase in sequential scope over time-window-based MLPs
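The effect can be illustrated with a back-of-the-envelope calculation (the weight values are illustrative assumptions): the error signal flowing through the recurrent connection is multiplied by roughly the same factor at every time step, so it shrinks or grows exponentially with the number of steps.

    for w in (0.5, 1.5):                    # magnitude of the recurrent weight
        grad = 1.0
        for t in range(20):
            grad *= w                       # repeated multiplication through time
        print(f"|w| = {w}: gradient after 20 steps ~ {grad:.3g}")
    # |w| = 0.5 -> ~9.5e-07 (vanishes), |w| = 1.5 -> ~3.3e+03 (blows up)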
Long short-term memory (LSTM)
● A family of RNN architectures that aims to explicitly overcome the vanishing gradient problem
● A gradient-based approach that enforces constant error flow through the internal states
● The basic LSTM neuron, or “cell”, has a separate “cell state” that keeps track of long-term sequential information
● The cell has specific “input”, “output”, and “forget” gates that regulate the cell state’s interactions with the conventional RNN data flow (input, output, and recurrent connection)
LSTM cell structure
● The LSTM cell state (a single value for a single cell) is manipulated by a “Forget gate”, an “Input gate”, and an “Output gate”
● A differentiable and continuous analogy to digital memory blocks
● Cell state updates are controlled by the conventional RNN connections (current input + previous output)
● The cell output is determined by the new cell state multiplied by the sigmoid activation of the conventional RNN connections
LSTM Forget gate
● “Reset” in the digital memory analogy
● Apply a sigmoid layer (weights + nonlinearity) to the input (concatenated input x_t + previous output h_{t-1}) to obtain a scaling value between [0,1], and multiply the cell state C (the top horizontal line in the cell diagram) by it (see the equation below)
  ● i.e. choose between [0,1] how much of the previous state should be forgotten
● Controls how much of the previous cell state is kept in the update to the next state
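In the commonly used notation (a standard formulation assumed here, not written out on the slide), with [h_{t-1}, x_t] denoting the concatenated input, the forget gate is

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \qquad f_t \in [0,1]

and the previous cell state is scaled elementwise by f_t in the state update shown on the next slide.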
LSTM Input gate
● “Write” in the digital memory analogy
● Apply sigmoid and tanh NN layers to the input to obtain the candidate input to the cell state (from the tanh layer) and its corresponding weighting (from the sigmoid layer)
● Add the new state information provided by the input gate to the cell state (see the equations below)
  ● If the gate activation is small (i.e. i_t ≈ 0), the cell state is not updated
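In the same assumed notation, the input gate, the candidate state, and the cell state update are

    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
    \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

which is consistent with the four weight matrices W_f, W_i, W_C, W_o counted in the network-size example later on.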
LSTM Output gate
● “Read” in the digital memory analogy
● Determines the output h_t of the cell
● The cell state C_t is passed through a nonlinearity and multiplied by the sigmoid activation of the input (see the equations and the sketch below)
  ● i.e. decide which parts of the state you want to output
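Completing the assumed standard formulation, the output gate and the cell output are

    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
    h_t = o_t \odot \tanh(C_t)

The following minimal numpy sketch puts the three gates together for one time step of a vanilla cell without peepholes; the dimensions, random weights, and helper names (sigmoid, lstm_step) are illustrative assumptions, not part of the slides:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o):
        z = np.concatenate([h_prev, x_t, [1.0]])  # [h_{t-1}, x_t] plus a bias input
        f = sigmoid(W_f @ z)                      # forget gate
        i = sigmoid(W_i @ z)                      # input gate
        C_tilde = np.tanh(W_C @ z)                # candidate cell state
        C = f * C_prev + i * C_tilde              # new cell state
        o = sigmoid(W_o @ z)                      # output gate
        h = o * np.tanh(C)                        # cell output
        return h, C

    rng = np.random.default_rng(0)
    n_in, n_cells = 3, 4
    W = [rng.standard_normal((n_cells, n_cells + n_in + 1)) * 0.1 for _ in range(4)]
    h, C = np.zeros(n_cells), np.zeros(n_cells)
    for x_t in rng.standard_normal((5, n_in)):    # a short random input sequence
        h, C = lstm_step(x_t, h, C, *W)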
LSTM Training
● Since the LSTM architecture is fully differentiable, BPTT can be used
● The originally proposed method was Real Time Recurrent Learning with truncated BPTT, which reduces computational complexity at the expense of exactness
Extensions to LSTM architecture
● In the original formulation (1997), the LSTM contained only the input and output gates
● Addition of the Forget gate (1999)
  ● Better learning for continual tasks
● “Peephole connections” between the Cell state and the gate activations (2000)
  ● The computation of the gate activations depends also on the Cell state, not just the cell input
● These are included in today’s “Vanilla LSTM” architecture
● Bidirectional LSTMs (and RNNs), detailed on the next slide
Extensions to LSTM architecture: Bidirectional LSTM
● Only past information is taken into account in the training of a vanilla “unidirectional” RNN/LSTM
  ● Upcoming samples might have crucial information about the context (e.g. in text analysis)
● The bidirectional architecture enables the use of future information
● Implemented with separate “Forward-pass” and “Backward-pass” specific layer weights
● The final output is computed as the sum of the forward and backward layer outputs (see the sketch below)
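A minimal sketch of the bidirectional idea, using a plain tanh RNN layer as a stand-in for the LSTM layer (all sizes and random weights are illustrative assumptions): one layer reads the sequence forwards, the other backwards, and their outputs are summed per time step.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, T = 3, 5, 8
    x_seq = rng.standard_normal((T, n_in))

    def run_layer(xs, W_x, W_h, b):
        h, out = np.zeros(n_hidden), []
        for x in xs:
            h = np.tanh(W_x @ x + W_h @ h + b)
            out.append(h)
        return np.stack(out)

    # Separate forward-pass and backward-pass layer weights
    params_fw = (rng.standard_normal((n_hidden, n_in)) * 0.1,
                 rng.standard_normal((n_hidden, n_hidden)) * 0.1, np.zeros(n_hidden))
    params_bw = (rng.standard_normal((n_hidden, n_in)) * 0.1,
                 rng.standard_normal((n_hidden, n_hidden)) * 0.1, np.zeros(n_hidden))

    h_fw = run_layer(x_seq, *params_fw)              # left to right
    h_bw = run_layer(x_seq[::-1], *params_bw)[::-1]  # right to left, re-aligned
    y = h_fw + h_bw                                  # sum of the two layer outputs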
LSTM network architecture
● LSTM blocks can be combined with traditional neurons etc.
● An example LSTM network:
  ● 50-dimensional input layer
  ● One hidden unidirectional LSTM layer with 100 cells (without peepholes)
    – LSTM weight matrix dimensions for the hidden layer (including biases):
      Wf = 151x100, Wi = 151x100, WC = 151x100, Wo = 151x100
    – Total: 4x151x100 = 60400
  ● 10-dimensional output layer (e.g. softmax for classification)
    – Output layer weight matrix dimension (with biases) = 101x10 = 1010
● Total number of weights = 60400 + 1010 = 61410 (verified in the snippet below)
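A quick snippet that reproduces the counts above for the base network (the homework variants are intentionally not computed here):

    n_in, n_cells, n_out = 50, 100, 10
    per_gate = (n_in + n_cells + 1) * n_cells      # [x_t, h_{t-1}, bias] -> 151 x 100
    lstm_weights = 4 * per_gate                    # W_f, W_i, W_C, W_o   -> 60400
    out_weights = (n_cells + 1) * n_out            # 101 x 10             -> 1010
    print(lstm_weights, out_weights, lstm_weights + out_weights)   # 60400 1010 61410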
Homework
● Consider the LSTM network architecture on the previous slide. Determine the total number of network weights with the following changes to the LSTM system:
  a) Identical network structure, but the LSTM layer contains peephole connections
  b) Bidirectional LSTM without peephole connections
  c) Bidirectional LSTM with peephole connections
LSTM applications
● LSTMs have pushed the envelope in many fields of research
  ● E.g., handwriting recognition & generation, language modeling & translation, acoustic modeling of speech (recognition & synthesis), etc.
● Example: statistical parametric speech synthesis
  ● LSTM-based acoustic modeling represents the state of the art
LSTM-based statistical machine translation
● Separate Encoder/Decoder LSTMs
  ● The input sentence (terminated by an <EOS> tag) is mapped to a fixed-length parametric representation by the Encoder
  ● The Decoder generates the target sentence symbol by symbol (i.e. word by word), using the sentence-level parametric representation and the previously generated symbol as input, until it outputs the <EOS> tag (see the sketch below)
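A toy sketch of the encoder/decoder control flow with random weights, using a plain tanh RNN step as a stand-in for the LSTMs; the vocabulary, sizes, weights, and start symbol are illustrative assumptions, and a real system would be trained as in Sutskever et al. (2014):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["<EOS>", "the", "cat", "sat"]               # token 0 is <EOS>
    n_vocab, n_hidden = len(vocab), 6
    E = rng.standard_normal((n_vocab, n_hidden)) * 0.1   # token embeddings
    W_enc = rng.standard_normal((n_hidden, n_hidden)) * 0.1
    W_dec = rng.standard_normal((n_hidden, n_hidden)) * 0.1
    W_out = rng.standard_normal((n_vocab, n_hidden)) * 0.1

    def step(h, x_vec, W):                               # stand-in recurrent step
        return np.tanh(W @ h + x_vec)

    # Encoder: map the input sentence (ending in <EOS>) to a fixed-length vector
    h = np.zeros(n_hidden)
    for tok in ["the", "cat", "sat", "<EOS>"]:
        h = step(h, E[vocab.index(tok)], W_enc)

    # Decoder: generate symbol by symbol, feeding back the previous symbol,
    # until <EOS> is produced (or a hard length cap is hit)
    prev, out = "<EOS>", []
    for _ in range(10):
        h = step(h, E[vocab.index(prev)], W_dec)
        prev = vocab[int(np.argmax(W_out @ h))]          # most probable next symbol
        if prev == "<EOS>":
            break
        out.append(prev)
    print(out)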
References
● Multilayer perceptron and backpropagation:
  ● D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Learning Internal Representations by Error Propagation”, pages 318–362. MIT Press, 1986.
● Recurrent neural networks, backpropagation through time (BPTT):
  ● P. Werbos. “Backpropagation Through Time: What It Does and How to Do It”. Proceedings of the IEEE, 78(10):1550–1560, 1990.
  ● R. J. Williams and D. Zipser. “Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity”. In Back-propagation: Theory, Architectures and Applications, pages 433–486. Lawrence Erlbaum Publishers, 1995.
● Vanishing gradient problem:
  ● Y. Bengio, P. Simard, and P. Frasconi. “Learning Long-Term Dependencies with Gradient Descent is Difficult”. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
● Original LSTM articles:
  ● S. Hochreiter and J. Schmidhuber. “Long Short-Term Memory”. Neural Computation, 9(8):1735–1780, 1997.
  ● F. A. Gers, J. Schmidhuber, and F. Cummins. “Learning to Forget: Continual Prediction with LSTM”. Neural Computation, 12(10):2451–2471, 2000.
LSTM improvements
● Bidirectional RNNs:
  ● M. Schuster and K. K. Paliwal. “Bidirectional Recurrent Neural Networks”. IEEE Transactions on Signal Processing, 45:2673–2681, 1997.
● Peephole connections:
  ● F. Gers, N. Schraudolph, and J. Schmidhuber. “Learning Precise Timing with LSTM Recurrent Networks”. Journal of Machine Learning Research, 3:115–143, 2002.
● Untruncated BPTT:
  ● A. Graves and J. Schmidhuber. “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures”. Neural Networks, 18(5-6):602–610, June/July 2005.
LSTM applications
● Statistical parametric speech synthesis:
  ● Y. Fan, Y. Qian, F. Xie, and F. K. Soong. “TTS Synthesis with Bidirectional LSTM based Recurrent Neural Networks”. In Proc. Interspeech, 2014.
● LSTM-based neural machine translation:
  ● I. Sutskever, O. Vinyals, and Q. V. Le. “Sequence to Sequence Learning with Neural Networks”. In Proc. NIPS, 2014.