Training recurrent and deep variational models for un-, semi- and supervised learning

Ole Winther
with Lars Maaløe, Casper and Søren Sønderby

Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU)
Bioinformatics Centre/BRIC, University of Copenhagen (KU)

March 7, 2016

Are we heading towards the singularity?
(image: kurzweilai.net)
• Elon Musk at the MIT AeroAstro Symposium:
  • "If I were to guess at what our biggest existential threat is, it's probably that... With artificial intelligence, we are summoning the demon."
• Unofficial quotes (email to a friend):
  • "The risk of something seriously dangerous happening is in the five year timeframe. 10 years at most."
  • "Unless you have direct exposure to groups like Deepmind, you have no idea how fast — it is growing at a pace close to exponential."
• mashable.com/2014/11/17/elon-musk-singularity/

Growth in computer power

GPU processing - projection
• Nvidia next-generation GPU processors
• Fast progress in recent years
• Tech and academia focused on deep learning
• In response to the perceived threat from AI, Elon Musk and others have founded openai.org.

Deep learning – DTU and KU research group
• End-to-end!
• Structured data - sequences+
• Bioinformatics
• Information retrieval - search in FindZebra
• Green tech – Siemens Windpower and greengoenergy.com
• Document interpretation - tradeshift.com
• Methodology – generative models
• Variational un- and semi-supervised learning (Casper's and Lars's talks)

Recurrent neural networks (RNNs)
• Feedforward neural networks
• RNN applications and architectures
• Long short-term memory (LSTM) RNN
• Further applications and architectures

Approximately $10^{11}$ neurons and $10^{14}$ synapses in a human brain.

Feed-forward neural networks
[Diagram: two-layer network with inputs $x_0, \dots, x_D$, hidden units $z_0, \dots, z_M$ and outputs $y_1, \dots, y_K$, connected by weights $w^{(1)}$ and $w^{(2)}$.]

Neural network mapping
• Compute the weighted sum of the inputs:
  $a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$, with the convention $x_0 = 1$.
• Output $k$ of a two-layer network:
  $y_k(x, w) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} f\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)$
• $f$ is the hidden-unit activation function.

Non-linearity - activation function
• Logistic function: $\sigma(a) = \dfrac{1}{1 + e^{-a}}$
• Hyperbolic tangent: $\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
• Rectified linear: $\mathrm{relu}(a) = \max(0, a)$
[Plot: logistic and tanh activation functions]
• Linear activation functions will give a linear network.
• Activation functions can be mixed, e.g. linear for output units and tanh for hidden units.
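The two-layer mapping above translates directly into code. Below is a minimal NumPy sketch of the forward pass, assuming tanh hidden units and a logistic output, with biases absorbed via the convention $x_0 = z_0 = 1$ as on the slide; the weight shapes and random initialization are purely illustrative.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Two-layer network: y_k(x, w) = sigma(sum_j w2_kj * f(sum_i w1_ji * x_i)),
    with f = tanh and biases absorbed by the units x_0 = z_0 = 1."""
    x = np.concatenate(([1.0], x))   # prepend bias unit x_0 = 1
    z = np.tanh(W1 @ x)              # hidden-unit activations z_1..z_M
    z = np.concatenate(([1.0], z))   # prepend bias unit z_0 = 1
    return logistic(W2 @ z)          # outputs y_1..y_K

# Illustrative shapes: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3 + 1))     # w^(1), first column holds the biases w_j0
W2 = rng.normal(size=(2, 4 + 1))     # w^(2), first column holds the biases w_k0
print(forward(rng.normal(size=3), W1, W2))
```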
Recurrent neural networks – DeepSpeech
• Deep Speech 2: ∼human-level performance + real-time server

Recurrent neural networks
  $a_h^t = \sum_{i=1}^{I} w_{ih} x_i^t + \sum_{h'=1}^{H} w_{h'h} f(a_{h'}^{t-1})$
(a NumPy sketch of this recurrence is given at the end of this section)
www.cs.toronto.edu/~graves/preprint.pdf

Recurrent neural networks unrolled
[Figure: the same recurrence unrolled over time]
www.cs.toronto.edu/~graves/preprint.pdf

Bidirectional recurrent neural networks
[Figure: unrolled bidirectional RNN]
www.cs.toronto.edu/~graves/preprint.pdf

Long short-term memory cells
(Hochreiter & Schmidhuber, 1997)

FindZebra search

Deep learning for search
• ∼8,260 CUIs - unique disease labels
• ∼14,261 articles from OMIM, Wikipedia, Orphanet and NIH sources
• Training: map 100-character article snippets to disease labels
• Test: queries with known diagnosis

Visualizing the latent representation for characters
Visualizing the latent representation for words
Example queries

Results
• SOLR and C2W2D make different errors (Recall@20)
• Simple combination: MRR = 0.373, Recall@10 = 0.657 and Recall@20 = 0.738

Sub-cellular localization

Sub-cellular localization machine learning set-up
• Use the (one-hot encoded) protein sequence as input
• 11 localization classes, softmax output
• RNN architecture with a single output - the network needs to memorize to the end!
• First layer is a convolutional layer
• Sønderby, Sønderby, Nielsen and Winther, AlCoB, 2015.

Visualizing learned convolutions

Attention mechanism
• Attention mechanism (adapted from Bahdanau et al, 2014)
• The forward and backward hidden states are concatenated, $h_t = [h_t^{(f)}; h_t^{(b)}]$, and fed into a feed-forward network $f$.
• Compute the context vector $c$, with $\dim(c) = \dim(h_t)$:
  $c = \sum_{t=1}^{T_x} a_t h_t, \qquad a_t = \frac{\exp(f(h_t; W))}{\sum_{t'=1}^{T_x} \exp(f(h_{t'}; W))}$
(sketched in code at the end of this section)

Where does the network pay attention?
Confusion matrix
Performance comparison
t-SNE visualization of the last hidden state

Deep variational learning
• So far we have only considered supervised learning
• Probabilistic mapping from input $x$ to label $y$: $p(y|x)$
• Unsupervised learning with latent variables $z$:
  $p(x) = \int p(x|z)\, p(z)\, dz$
• The posterior $p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}$ is difficult; variational learning instead maximizes a lower bound (sketched in code below):
  $\log p(x) \ge \mathcal{L} = \int q(z|x) \log \frac{p(x|z)\, p(z)}{q(z|x)}\, dz$
  (Kingma & Welling 2013, Rezende et al 2013)
• Useful in semi-supervised learning
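Three short NumPy sketches illustrate the mechanics from this section. First, the hidden-state recurrence from the RNN slides above; the shapes, the zero initial state and the choice $f = \tanh$ are assumptions for illustration, not the implementation behind the slides.

```python
import numpy as np

def rnn_hidden_states(X, W_in, W_rec, f=np.tanh):
    """a_h^t = sum_i w_ih x_i^t + sum_h' w_h'h f(a_h'^{t-1}).
    X: inputs of shape (T, I); returns pre-activations A of shape (T, H)."""
    T = X.shape[0]
    H = W_rec.shape[0]
    A = np.zeros((T, H))
    a_prev = np.zeros(H)                        # a^0 = 0 by convention
    for t in range(T):
        A[t] = X[t] @ W_in + f(a_prev) @ W_rec  # input term + recurrent term
        a_prev = A[t]
    return A

# Illustrative shapes: T = 5 steps, I = 3 inputs, H = 4 hidden units
rng = np.random.default_rng(0)
A = rnn_hidden_states(rng.normal(size=(5, 3)),
                      0.1 * rng.normal(size=(3, 4)),
                      0.1 * rng.normal(size=(4, 4)))
```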
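Second, the attention context vector from the attention slide. For concreteness this sketch scores each state with a simple linear function $f(h_t; w) = h_t \cdot w$, which is a stand-in: Bahdanau et al use a small feed-forward network for $f$.

```python
import numpy as np

def attention_context(H_states, w):
    """c = sum_t a_t h_t with a_t = exp(f(h_t; w)) / sum_t' exp(f(h_t'; w)).
    H_states: concatenated forward/backward states, shape (T_x, d).
    The linear score f(h_t; w) = h_t . w is an illustrative stand-in."""
    scores = H_states @ w                       # f(h_t; w), shape (T_x,)
    scores = scores - scores.max()              # subtract max for stability
    a = np.exp(scores) / np.exp(scores).sum()   # attention weights a_t
    return a @ H_states                         # context vector c, shape (d,)

# Illustrative usage: T_x = 7 steps, dim(h_t) = 6
rng = np.random.default_rng(0)
c = attention_context(rng.normal(size=(7, 6)), rng.normal(size=6))
```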
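Third, the variational lower bound. The sketch below computes a single-sample Monte Carlo estimate of $\mathcal{L}$ for a Gaussian $q(z|x)$, a standard-normal prior $p(z)$ and a Bernoulli decoder, using the reparametrization trick of Kingma and Welling; the `decode` function and all shapes are hypothetical placeholders, not the models from the talks.

```python
import numpy as np

def elbo_estimate(x, mu, log_var, decode, rng):
    """One-sample estimate of L = E_q[log p(x|z)] - KL(q(z|x) || p(z)),
    with q(z|x) = N(mu, diag(exp(log_var))) and p(z) = N(0, I)."""
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps        # reparametrization trick
    p = decode(z)                               # Bernoulli means p(x|z)
    log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    # KL(N(mu, sigma^2) || N(0, I)) in closed form, summed over dimensions
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return log_lik - kl

# Hypothetical linear-logistic decoder on 784-pixel binary images, 20-dim z
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(784, 20))
decode = lambda z: 1.0 / (1.0 + np.exp(-(W @ z)))
x = rng.integers(0, 2, size=784).astype(float)
print(elbo_estimate(x, np.zeros(20), np.zeros(20), decode, rng))
```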
Summary and Outlook
• Deep learning
  • End-to-end! Less feature engineering - more data,
  • fast computation (GPUs),
  • mature software (for example Theano/Lasagne),
  • a lot of attention from industry and academia, and
  • (new) algorithms!
• Will it bring us to the singularity?
• Recurrent neural networks - more elegant and better performance for sequence data
• Versatile ground for variations on architectures: bidirectional, attention, ...