Long Short-Term Memory (LSTM) networks
Manu Airaksinen, 21.2.2017

Overview
● Artificial neural networks
● Recurrent neural networks
● Long short-term memory
● LSTM extensions
● LSTM applications
● Homework

Artificial neural networks
● Originally created as computational models of biological brains
  ● In reality, they share only vague similarities with the “real deal”
● Basic unit = neuron
  ● Has n input connections
  ● A weighted sum of the inputs is fed into the neuron’s non-linear activation function (e.g., tanh, sigmoid)

Multi-layer perceptron (“DNN”)
● A common ANN architecture is a network of interconnected layers where each neuron i within layer k is connected to each neuron j within the next layer k+1 with a weight wij
  ● The activations of a single layer can be propagated to the next layer by multiplying them with the weight matrix W that contains the weights wij
● A “deep” neural network (DNN) commonly consists of at least two hidden layers between the input and output layers

Tasks and training of neural networks
● We have defined the basic neural network architecture
● Where can we use it?
  ● Great pattern recognition capabilities
  ● Classification, regression
● How do we train it?
  ● Supervised learning task: learn the network weight matrices W between each layer with backpropagation:
    1. Initialize the weights (e.g., random weights or layer-wise pre-training)
    2. Perform a forward pass on a chunk of training data (input and desired output known)
    3. Compute the error between the NN output and the target according to an error criterion (e.g., the L2 norm)
    4. Backpropagate the error gradient through the layers and adjust the weights slightly in the direction that reduces the error
    5. Repeat steps 2-4 until a stopping criterion (e.g., convergence, a certain number of epochs) has been met
● The MLP architecture does not account for sequential structure
  ● Sequential structure can be “hacked in” by including, e.g., delta and delta-delta features, or by concatenating consecutive frames in the input layer
  ● This can reach some performance in sequence labeling tasks, but it suffers from a bloated structure as well as from a fixed context window
● A minimal sketch of the forward pass and one backpropagation step is given below
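The slides contain no code, so the following is a minimal NumPy sketch of the training recipe above: a single hidden tanh layer, the L2 error criterion, and one backpropagation step. Layer sizes, the learning rate, and all variable names are illustrative assumptions, and biases are omitted for brevity.

```python
# Minimal sketch (not from the slides): one-hidden-layer MLP with tanh activation,
# L2 error criterion, and a single gradient step of backpropagation.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 50, 100, 10   # illustrative layer sizes
lr = 0.01                             # learning rate

# Step 1: initialize weights (random initialization)
W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
W2 = 0.1 * rng.standard_normal((n_out, n_hidden))

x = rng.standard_normal(n_in)         # one training input
t = rng.standard_normal(n_out)        # its desired (target) output

# Step 2: forward pass
h = np.tanh(W1 @ x)                   # hidden-layer activations
y = W2 @ h                            # linear output layer

# Step 3: L2 error between network output and target
error = 0.5 * np.sum((y - t) ** 2)

# Step 4: backpropagate the error gradient and adjust the weights
d_y = y - t                           # dE/dy
d_a = (W2.T @ d_y) * (1.0 - h ** 2)   # gradient at the hidden pre-activation; tanh'(a) = 1 - tanh(a)^2
W2 -= lr * np.outer(d_y, h)
W1 -= lr * np.outer(d_a, x)
# Step 5: repeat steps 2-4 over the training data until a stopping criterion is met.
```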
Recurrent neural networks (RNN)
● A natural extension of the MLP, where one (or more) hidden layer(s) is allowed to be self-connected
  ● i.e., the layer input at time instant t contains the current input vector as well as the layer’s own output at the previous time step t-1
  ● Allows sequential information to propagate within the network
● Training is done by unfolding the structure over a sequence of length T and performing backpropagation through time (BPTT)
  ● Same principle as normal backpropagation, but the errors are also propagated through the unfolded sequence structure
  ● Training has a considerably higher computational and memory footprint than for an MLP

Problem with basic RNNs: vanishing gradient
● In theory, RNNs are capable of modeling sequential information over the whole history of the sequence
● In practice, the “vanishing gradient problem” limits the capabilities of RNNs
  ● Over time, the effect of the recurrence either blows up or decays exponentially, depending on the magnitude of the weights given to the recurrent connection
  ● Decay is greatly preferred over blowing up
● A realistic expectation is to be able to exploit sequential information over roughly 5-10 discrete time steps
  ● Not a really significant increase in sequential scope over time-window-based MLPs
● The recurrent update that gives rise to this behavior is sketched below
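A minimal NumPy sketch of the recurrence described above (not from the original slides): one self-connected tanh layer unfolded over a sequence of length T. The names and sizes (Wx, Wh, T) are illustrative assumptions; the closing comment points at why BPTT gradients tend to vanish or explode.

```python
# Minimal sketch (not from the slides): one self-connected (recurrent) tanh layer
# unfolded over a sequence of length T.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, T = 50, 100, 20       # illustrative sizes

Wx = 0.1 * rng.standard_normal((n_hidden, n_in))      # input -> hidden weights
Wh = 0.1 * rng.standard_normal((n_hidden, n_hidden))  # hidden -> hidden (recurrent) weights

x = rng.standard_normal((T, n_in))    # input sequence
h = np.zeros(n_hidden)                # previous-time-step output, h_{t-1}

outputs = []
for t in range(T):
    # The layer input at time t is the current input vector x_t together with
    # the layer output of the previous time step h_{t-1}.
    h = np.tanh(Wx @ x[t] + Wh @ h)
    outputs.append(h)

# In BPTT, the gradient flowing from step t back to step t-k picks up a factor of
# (roughly) Wh^T and tanh' at every step, which is why it tends to decay or blow up.
```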
Long short-term memory (LSTM)
● A family of RNN architectures that aims to explicitly overcome the vanishing gradient problem
● A gradient-based approach that enforces constant error flow through the internal states
● The basic LSTM neuron, or “cell”, has a separate “cell state” that keeps track of long-term sequential information
● The cell has specific “input”, “output”, and “forget” gates that regulate the cell state’s interactions with the conventional RNN data flow (input, output, and recurrent connection)

LSTM cell structure
● The LSTM cell state (a single value for a single cell) is manipulated by a “forget gate”, an “input gate”, and an “output gate”
● A differentiable and continuous analogy to digital memory blocks
● Cell state updates are controlled by the conventional RNN connections (present input + previous output)
● The cell output is determined by the new cell state passed through a non-linearity and multiplied by the sigmoid activation of the conventional RNN connections

LSTM forget gate
● “Reset” in the digital memory analogy
● Apply a sigmoid layer (weights + non-linearity) to the input (concatenated input xt and previous output ht-1) to obtain a value in [0,1], and multiply the cell state C (the top horizontal line in the usual cell diagram) by it
  ● i.e., choose on a scale of [0,1] how much of the previous state should be forgotten
● Controls how much of the previous cell state is kept in the update to the next state

LSTM input gate
● “Write” in the digital memory analogy
● Apply sigmoid and tanh layers to the input to obtain the candidate update to the cell state (from the tanh layer) and its corresponding weighting (from the sigmoid layer)
● Add the new state information provided by the input gate to the cell state
  ● If the activation is small (i.e., approximately 0), the cell state is not updated

LSTM output gate
● “Read” in the digital memory analogy
● Determines the output ht of the cell
● The cell state Ct is passed through a non-linearity and multiplied by the sigmoid activation of the input
  ● i.e., decide which parts of the state you want to output

LSTM training
● Since the LSTM architecture is fully differentiable, BPTT can be used
● The originally proposed method was Real Time Recurrent Learning with truncated BPTT, which reduces computational complexity at the expense of exactness

Extensions to LSTM architecture
● In the original formulation (1997), the LSTM cell contained only the input and output gates
● Addition of the forget gate (1999)
  ● Better learning for continual tasks
● “Peephole connections” between the cell state and the gate activations (2000)
  ● The gate activations then depend also on the cell state, not just the cell input
● These are included in today’s “vanilla LSTM” architecture

Extensions to LSTM architecture: Bidirectional LSTM
● Only past information is taken into account when training a vanilla “unidirectional” RNN/LSTM
  ● Upcoming samples might carry crucial information about the context (e.g., in text analysis)
● The bidirectional architecture enables the use of future information
● Implemented with separate “forward-pass” and “backward-pass” specific layer weights
● The final output is computed as the sum of the forward and backward layer outputs

LSTM network architecture
● LSTM blocks can be combined with traditional neurons etc.
● An example LSTM network:
  ● 50-dimensional input layer
  ● One hidden unidirectional LSTM layer with 100 cells (without peepholes)
    ● LSTM weight matrix dimensions for the hidden layer (including biases):
      – Wf = 151x100, Wi = 151x100, WC = 151x100, Wo = 151x100
      – Total: 4 x 151 x 100 = 60400
  ● 10-dimensional output layer (e.g., softmax for classification)
    ● Output layer weight matrix dimension (with biases) = 101x10 = 1010
  ● Total number of weights = 60400 + 1010 = 61410
● The corresponding cell equations and weight count are sketched in the code example after the applications slides below

Homework
● Consider the LSTM network architecture on the previous slide. Determine the total number of network weights with the following changes to the LSTM system:
  a) Identical network structure, but the LSTM layer contains peephole connections
  b) Bidirectional LSTM without peephole connections
  c) Bidirectional LSTM with peephole connections

LSTM applications
● LSTMs have pushed the envelope in many fields of research
  ● E.g., handwriting recognition and generation, language modeling and translation, acoustic modeling of speech (recognition and synthesis), etc.
● Example: statistical parametric speech synthesis
  ● LSTM-based acoustic modeling represents the state of the art

LSTM-based statistical machine translation
● Separate encoder and decoder LSTMs
  ● The input sentence (terminated by an <EOS> tag) is mapped to a fixed-length parametric representation with the encoder
  ● The decoder generates the target sentence symbol by symbol (i.e., word by word), with the sentence-level parametric representation and the previously generated symbol as input, until it outputs the <EOS> tag
  ● A sketch of this encoder/decoder loop is given below
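A minimal NumPy sketch (not from the original slides) of one forward step of a vanilla LSTM layer without peepholes, using the example dimensions from the architecture slide (50-dimensional input, 100 cells, 10-dimensional output); the final lines reproduce the 61410-weight count. The gate matrix names follow the Wf/Wi/WC/Wo convention on the slide; everything else is an illustrative assumption.

```python
# Minimal sketch (not from the slides): one forward step of a vanilla LSTM layer
# (no peepholes) with the example dimensions from the architecture slide:
# 50-dimensional input, 100 LSTM cells, 10-dimensional output layer.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_cells, n_out = 50, 100, 10

# One weight matrix per gate / candidate, shaped (n_in + n_cells + 1) x n_cells,
# where the extra row holds the bias -> 151 x 100, as on the slide.
Wf, Wi, Wc, Wo = (0.1 * rng.standard_normal((n_in + n_cells + 1, n_cells)) for _ in range(4))
Wy = 0.1 * rng.standard_normal((n_cells + 1, n_out))   # output layer, 101 x 10

x_t = rng.standard_normal(n_in)            # current input
h_prev = np.zeros(n_cells)                 # previous cell output h_{t-1}
C_prev = np.zeros(n_cells)                 # previous cell state C_{t-1}

z = np.concatenate([x_t, h_prev, [1.0]])   # concatenated input (with bias term)

f_t = sigmoid(z @ Wf)                  # forget gate: how much of C_{t-1} to keep
i_t = sigmoid(z @ Wi)                  # input gate: how much of the candidate to write
C_tilde = np.tanh(z @ Wc)              # candidate cell-state update
C_t = f_t * C_prev + i_t * C_tilde     # new cell state
o_t = sigmoid(z @ Wo)                  # output gate: which parts of the state to read
h_t = o_t * np.tanh(C_t)               # new cell output

y_t = np.concatenate([h_t, [1.0]]) @ Wy    # output layer (e.g., input to a softmax)

# Weight count of this network, matching the slide:
n_weights = Wf.size + Wi.size + Wc.size + Wo.size + Wy.size
print(n_weights)                       # 4*151*100 + 101*10 = 61410
```

The homework above asks how this count changes when peephole connections and/or a bidirectional layer are added.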
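As a rough illustration of the encoder/decoder scheme just described (not from the original slides and not the exact setup of Sutskever et al.): the encoder reads the source sequence, its final state acts as the fixed-length sentence representation, and the decoder then generates greedily, feeding back its previously generated symbol until it emits <EOS>. The toy vocabulary, embedding matrix, `lstm_step` helper, and all sizes are hypothetical placeholders.

```python
# Minimal sketch (not from the slides): greedy encoder/decoder loop in the spirit
# of the sequence-to-sequence scheme described above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(W, x, h, C):
    """One vanilla LSTM step; W packs the four gate matrices side by side."""
    z = np.concatenate([x, h, [1.0]])
    n = h.size
    f, i, o = (sigmoid(z @ W[:, k * n:(k + 1) * n]) for k in range(3))
    C = f * C + i * np.tanh(z @ W[:, 3 * n:4 * n])
    return o * np.tanh(C), C

rng = np.random.default_rng(0)
vocab = ["<EOS>", "the", "cat", "sat"]                 # toy target vocabulary
emb_dim, n_cells = 8, 16

E = 0.1 * rng.standard_normal((len(vocab), emb_dim))   # toy target embeddings
W_enc = 0.1 * rng.standard_normal((emb_dim + n_cells + 1, 4 * n_cells))
W_dec = 0.1 * rng.standard_normal((emb_dim + n_cells + 1, 4 * n_cells))
W_out = 0.1 * rng.standard_normal((n_cells, len(vocab)))

# Encoder: read the source sequence; the final (h, C) acts as the fixed-length
# sentence-level parametric representation.
source = [rng.standard_normal(emb_dim) for _ in range(5)]   # dummy source embeddings
h, C = np.zeros(n_cells), np.zeros(n_cells)
for x in source:
    h, C = lstm_step(W_enc, x, h, C)

# Decoder: generate symbol by symbol, feeding the previously generated symbol
# back in, until <EOS> is produced (or a length limit is hit).
prev = "<EOS>"                                   # start symbol in this sketch
translation = []
for _ in range(20):
    h, C = lstm_step(W_dec, E[vocab.index(prev)], h, C)
    prev = vocab[int(np.argmax(h @ W_out))]      # greedy choice of the next word
    if prev == "<EOS>":
        break
    translation.append(prev)
print(translation)
```

With untrained random weights the output is of course meaningless; the point is only the data flow between the encoder, the sentence-level representation, and the symbol-by-symbol decoder.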
References
● Multilayer perceptron and backpropagation:
  ● D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Learning Internal Representations by Error Propagation”, pages 318–362. MIT Press, 1986.
● Recurrent neural networks, backpropagation through time (BPTT):
  ● P. Werbos. “Backpropagation Through Time: What It Does and How to Do It”. Proceedings of the IEEE, 78(10):1550–1560, 1990.
  ● R. J. Williams and D. Zipser. “Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity”. In Back-propagation: Theory, Architectures and Applications, pages 433–486. Lawrence Erlbaum Publishers, 1995.
● Vanishing gradient problem:
  ● Y. Bengio, P. Simard, and P. Frasconi. “Learning Long-Term Dependencies with Gradient Descent is Difficult”. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
● Original LSTM articles:
  ● S. Hochreiter and J. Schmidhuber. “Long Short-Term Memory”. Neural Computation, 9(8):1735–1780, 1997.
  ● F. A. Gers, J. Schmidhuber, and F. Cummins. “Learning to Forget: Continual Prediction with LSTM”. Neural Computation, 12(10):2451–2471, 2000.
● LSTM improvements:
  ● Bidirectional RNNs:
    – M. Schuster and K. K. Paliwal. “Bidirectional Recurrent Neural Networks”. IEEE Transactions on Signal Processing, 45:2673–2681, 1997.
    – A. Graves and J. Schmidhuber. “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures”. Neural Networks, 18(5-6):602–610, June/July 2005.
  ● Peephole connections, untruncated BPTT:
    – F. Gers, N. Schraudolph, and J. Schmidhuber. “Learning Precise Timing with LSTM Recurrent Networks”. Journal of Machine Learning Research, 3:115–143, 2002.
● LSTM applications:
  ● Statistical parametric speech synthesis:
    – Y. Fan, Y. Qian, F. Xie, and F. K. Soong. “TTS Synthesis with Bidirectional LSTM based Recurrent Neural Networks”. In Proc. Interspeech, 2014.
  ● LSTM-based neural machine translation:
    – I. Sutskever, O. Vinyals, and Q. V. Le. “Sequence to Sequence Learning with Neural Networks”. In Proc. NIPS, 2014.