
Training recurrent and deep variational models for un-, semi- and supervised learning
Ole Winther
with Lars Maaløe, Casper and Søren Sønderby
Department of Applied Mathematics and Computer Science
Technical University of Denmark (DTU)
Bioinformatics Centre/BRIC
University of Copenhagen (KU)
March 7, 2016
Are we heading towards the singularity?
[Image credit: kurzweilai.net]
• Elon Musk at the MIT AeroAstro Symposium:
• If I were to guess at what our biggest existential threat is, it's probably that ...
• With artificial intelligence, we are summoning the demon.
• Unofficial quotes (email to a friend):
• The risk of something seriously dangerous happening is in the five-year timeframe. 10 years at most.
• Unless you have direct exposure to groups like Deepmind, you have no idea how fast - it is growing at a pace close to exponential.
• mashable.com/2014/11/17/elon-musk-singularity/
Growth in computer power
GPU processing - projection
• Nvidia next-generation GPU processors
• Fast progress in recent years
• Tech industry and academia focused on deep learning
• In response to the perceived threat from AI, Elon Musk and others have founded openai.org.
Deep learning – DTU and KU research group
• End-to-end!
• Structured data - sequences+
• Bioinformatics
• Information retrieval - search in findzebra
• Green tech – Siemens Windpower and greengoenergy.com
• Document interpretation - tradeshift.com
• Methodology: generative models
• Variational un- and semi-supervised learning (Casper and Lars talks)
Recurrent neural networks (RNNs)
• Feedforward neural networks
• RNN applications and architectures
• Long short-term memory (LSTM) RNNs
• Further applications and architectures
Approx. $10^{11}$ neurons and $10^{14}$ synapses in a human brain
Feed-forward neural networks
[Figure: two-layer network with inputs $x_0,\dots,x_D$, hidden units $z_0,\dots,z_M$, outputs $y_1,\dots,y_K$, and weight layers $w^{(1)}$, $w^{(2)}$.]
Neural network mapping
• Compute weighted sum of inputs:
$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} = \sum_{i=0}^{D} w^{(1)}_{ji} x_i$$
• Output $k$ of a two-layer network (a numerical sketch follows below):
$$y_k(x, w) = \sigma\!\left( \sum_{j=0}^{M} w^{(2)}_{kj}\, f\!\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right)$$
• $f$ is the hidden unit activation function
[Figure: the two-layer network diagram with inputs $x_0,\dots,x_D$, hidden units $z_0,\dots,z_M$ and outputs $y_1,\dots,y_K$.]
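As a concrete reading of the two formulas above, here is a minimal numpy sketch of the forward pass. It is an illustration, not the lecture's code; the weight shapes and the bias-as-$x_0$/$z_0$ convention follow the slide's diagram.

```python
import numpy as np

def logistic(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def two_layer_forward(x, W1, W2, f=np.tanh, sigma=logistic):
    """Forward pass y_k(x, w) of the two-layer network on the slide.

    x  : (D,)       input vector
    W1 : (M, D+1)   first-layer weights; column 0 holds the biases w^(1)_{j0}
    W2 : (K, M+1)   second-layer weights; column 0 holds the biases w^(2)_{k0}
    """
    x_ext = np.concatenate(([1.0], x))     # prepend x_0 = 1 (bias unit)
    a = W1 @ x_ext                         # a_j = sum_{i=0}^{D} w^(1)_{ji} x_i
    z = np.concatenate(([1.0], f(a)))      # hidden units, z_0 = 1 (bias unit)
    return sigma(W2 @ z)                   # y_k = sigma(sum_{j=0}^{M} w^(2)_{kj} z_j)

# Toy example: D=3 inputs, M=4 hidden units, K=2 outputs
rng = np.random.default_rng(0)
y = two_layer_forward(rng.normal(size=3),
                      rng.normal(size=(4, 4)),
                      rng.normal(size=(2, 5)))
```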
Non-linearity - activation function
• Logistic function:
$$\sigma(a) = \frac{1}{1 + e^{-a}}$$
• Hyperbolic tangent:
$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$
• Rectified linear:
$$\mathrm{relu}(a) = \max(0, a)$$
• Linear activation functions will give a linear network (see the sketch below).
• Can be mixed, e.g. linear for output units and tanh for hidden units.
[Plot: logistic and tanh activation functions.]
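To make the bullet about linear activations concrete, a small sketch (illustrative only, biases omitted): with a linear hidden activation the two-layer map collapses to a single matrix product, while logistic, tanh and relu keep it non-linear.

```python
import numpy as np

logistic = lambda a: 1.0 / (1.0 + np.exp(-a))   # 1 / (1 + e^{-a})
tanh     = np.tanh                              # (e^a - e^{-a}) / (e^a + e^{-a})
relu     = lambda a: np.maximum(0.0, a)         # max(0, a)

# With a *linear* hidden activation the two-layer map collapses to a single
# linear map: W2 (W1 x) = (W2 W1) x, so the extra layer adds nothing.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x  = rng.normal(size=3)
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
```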
Recurrent neural networks – DeepSpeech
Deep Speech 2: ∼ human-level performance + real-time server
Recurrent neural networks
$$a_h^t = \sum_{i=1}^{I} w_{ih}\, x_i^t + \sum_{h'=1}^{H} w_{h'h}\, f\!\left(a_{h'}^{t-1}\right)$$
www.cs.toronto.edu/~graves/preprint.pdf
Recurrent neural networks unrolled
$$a_h^t = \sum_{i=1}^{I} w_{ih}\, x_i^t + \sum_{h'=1}^{H} w_{h'h}\, f\!\left(a_{h'}^{t-1}\right)$$
www.cs.toronto.edu/~graves/preprint.pdf
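A minimal numpy sketch of the recurrence above, applied once per time step as in the unrolled picture. This is an illustration, not the Graves or DeepSpeech implementation, and the weight-matrix index convention is transposed relative to the slide.

```python
import numpy as np

def rnn_step(x_t, a_prev, W_in, W_rec, f=np.tanh):
    """One step of the recurrence on the slide:
    a_h^t = sum_i w_{ih} x_i^t + sum_{h'} w_{h'h} f(a_{h'}^{t-1})
    """
    return W_in @ x_t + W_rec @ f(a_prev)

def rnn_forward(xs, W_in, W_rec, f=np.tanh):
    # Unrolled network: the same step, with the same weights, at every t
    a = np.zeros(W_rec.shape[0])
    states = []
    for x_t in xs:
        a = rnn_step(x_t, a, W_in, W_rec, f)
        states.append(a)
    return np.stack(states)     # (T, H) hidden pre-activations

# Toy example: T=5 time steps, I=3 inputs, H=4 hidden units
rng = np.random.default_rng(2)
states = rnn_forward(rng.normal(size=(5, 3)),
                     rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
```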
Bidirectional recurrent neural networks
Recurrent neural networks unrolled bidirectional
www.cs.toronto.edu/~graves/preprint.pdf
Long short-term memory cells
Hochreiter and Schmidhuber, 1997.
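For reference, a sketch of the cell update in its commonly used gated form. The original 1997 formulation had no forget gate; the variant below, with input, forget and output gates, is the one found in modern toolkits, and all names and shapes here are illustrative.

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One step of an LSTM cell in its common gated form.

    W : (4H, I), U : (4H, H), b : (4H,) stack the parameters of the
    input gate i, forget gate f, output gate o and candidate update g.
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g      # cell state: old memory kept by f, new info gated by i
    h = o * np.tanh(c)          # hidden state / output, gated by o
    return h, c

# Toy step: I=3 inputs, H=2 hidden units
rng = np.random.default_rng(3)
h, c = lstm_cell(rng.normal(size=3), np.zeros(2), np.zeros(2),
                 rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8))
```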
FindZebra search
Deep learning for search
• ∼ 8,260 CUIs - unique disease labels
• ∼ 14,261 articles from OMIM, Wikipedia, Orphanet and NIH sources
• Training: map 100-character article snippets to a disease label (a data-preparation sketch follows below)
• Test: queries with a known diagnosis.
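A hedged sketch of how the training pairs described above could be generated. Only the 100-character snippet length comes from the slide; the stride, function name and the CUI shown are assumptions or placeholders.

```python
def snippet_pairs(article_text, disease_cui, snippet_len=100, stride=100):
    """Cut an article into fixed-length character snippets, each paired with
    the article's disease label (CUI). The non-overlapping stride is an
    assumption; only the 100-character length comes from the slide."""
    pairs = []
    for start in range(0, len(article_text) - snippet_len + 1, stride):
        pairs.append((article_text[start:start + snippet_len], disease_cui))
    return pairs

# Each (snippet, CUI) pair becomes one training example for a character-level
# classifier with a softmax over the ~8,260 disease labels.
pairs = snippet_pairs("Marfan syndrome is a disorder of the connective tissue" * 4,
                      "CUI-PLACEHOLDER")
```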
Visualizing latent representation for characters
Visualizing latent representation for words
Example queries
Results
• SOLR and C2W2D make different errors (Recall@20)
• Simple combination: MRR=0.373, Recall@10=0.657 and Recall@20=0.738
Sub-cellular localization
Sub-cellular localization machine learning set-up
• Use the (one-hot encoded) protein sequence as input
• 11 localization classes, softmax output
• RNN architecture with a single output - it needs to memorize until the end!
• The first layer is a convolutional layer (an input-side sketch follows below)
• Sønderby, Sønderby, Nielsen and Winther, AlCoB, 2015.
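An illustrative sketch of the input side of this set-up: one-hot encoding of an amino-acid sequence followed by a 1D convolutional first layer. This is a toy numpy version under stated assumptions (filter sizes, residue alphabet handling and function names are made up); in the actual set-up the resulting features are fed to an LSTM whose final state goes through an 11-way softmax.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues

def one_hot(seq):
    """One-hot encode a protein sequence into an (L, 20) matrix."""
    idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, idx[aa]] = 1.0
    return x

def conv1d(x, filters):
    """First layer: 1D convolution along the sequence.
    x: (L, 20), filters: (F, width, 20) -> feature map (L - width + 1, F)."""
    F, width, _ = filters.shape
    L = x.shape[0]
    out = np.zeros((L - width + 1, F))
    for t in range(L - width + 1):
        window = x[t:t + width]    # (width, 20) slice of the sequence
        out[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out

# Toy usage: 10-residue sequence, 8 filters of width 5
x = one_hot("MKTAYIAKQR")
feat = conv1d(x, np.random.default_rng(7).normal(size=(8, 5, 20)))
```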
Visualizing learned convolutions
Attention mechanism
• Attention mechanism (adapted from Bahdanau et al, 2014)
• Compute the context vector $c$, with $\dim(c) = \dim(h_t)$ and $h_t = \left[ h_t^{(f)};\, h_t^{(b)} \right]$:
$$c = \sum_{t=1}^{T_x} a_t\, h_t, \qquad a_t = \frac{\exp(f(h_t; W))}{\sum_{t'=1}^{T_x} \exp(f(h_{t'}; W))}$$
which is then fed into a feed-forward network.
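A small numpy sketch of the attention computation above. It is assumption-laden: the scoring function $f(h_t; W)$ is taken to be a simple dot product with a weight vector, whereas the talk leaves its exact form open.

```python
import numpy as np

def attention_context(hs, w):
    """Context vector and attention weights as on the slide.

    hs : (T, H)  concatenated forward/backward hidden states h_t
    w  : (H,)    scoring parameters, here f(h_t; W) = w . h_t
    """
    scores = hs @ w                              # f(h_t; W) for every time step
    scores = scores - scores.max()               # for numerical stability
    a = np.exp(scores) / np.exp(scores).sum()    # softmax over t: attention weights a_t
    c = a @ hs                                   # c = sum_t a_t h_t
    return c, a

# Toy example: T_x = 7 time steps, hidden dimension H = 6
rng = np.random.default_rng(4)
c, a = attention_context(rng.normal(size=(7, 6)), rng.normal(size=6))
```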
Where does the network pay attention?
Confusion matrix
Performance comparison
t-SNE visualization of the last hidden state
Deep variational learning
• So far we have only considered supervised learning
• Probabilistic mapping from input $x$ to label $y$: $p(y|x)$
• Unsupervised learning with latent variables $z$:
$$p(x) = \int p(x|z)\, p(z)\, dz$$
• The posterior $p(z|x) = \dfrac{p(x|z)\, p(z)}{p(x)}$ is difficult to compute, hence variational learning:
$$\log p(x) \ge \mathcal{L} = \int q(z|x) \log \frac{p(x|z)\, p(z)}{q(z|x)}\, dz$$
(Kingma+Welling 2013, Rezende et al 2013)
• Useful in semi-supervised learning
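To make the bound concrete, here is a sketch of a single-sample Monte Carlo estimate of $\mathcal{L}$ with a diagonal-Gaussian $q(z|x)$, a standard-normal prior and a Bernoulli decoder, using the reparameterization trick of Kingma+Welling. It is illustrative only; `decode_logits` stands in for a hypothetical decoder network.

```python
import numpy as np

def elbo_one_sample(x, mu, log_var, decode_logits, rng=np.random.default_rng(5)):
    """Single-sample Monte Carlo estimate of the bound
    L = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)),
    with q(z|x) = N(mu, diag(exp(log_var))), prior p(z) = N(0, I) and a
    Bernoulli decoder whose logits are produced by `decode_logits(z)`."""
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps                          # reparameterization trick
    logits = decode_logits(z)
    log_px_z = np.sum(x * logits - np.logaddexp(0.0, logits))     # Bernoulli log-likelihood
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)    # analytic KL to N(0, I)
    return log_px_z - kl

# Toy usage: a linear decoder from 2 latent dimensions to 4 binary "pixels"
W = np.random.default_rng(6).normal(size=(4, 2))
L = elbo_one_sample(x=np.array([1.0, 0.0, 1.0, 1.0]),
                    mu=np.zeros(2), log_var=np.zeros(2),
                    decode_logits=lambda z: W @ z)
```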
Summary and Outlook
• Deep learning
• End-to-end! Less feature engineering - more data,
• fast computation (GPUs),
• mature software (for example Theano/Lasagne),
• a lot of attention from industry and academia, and
• (new) algorithms!
• Will it bring us to the singularity?
• Recurrent neural networks - more elegant and better performance for sequence data
• Versatile ground for variations on architectures: bidirectional, attention, ...