Natural language processing

Neural networks
Natural language processing - convolutional network
WORD TAGGING
Topics: word tagging
• In many NLP applications, it is useful to augment text data with syntactic and semantic information
‣ we would like to add syntactic/semantic labels to each word
• This problem can be tackled using a conditional random field with neural network unary potentials
‣ we will describe the model developed by Ronan Collobert and Jason Weston in:
A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning
Collobert and Weston, 2008
(see Natural Language Processing (Almost) from Scratch for the journal version)
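As a toy illustration of such word-level labels (the tag sets below are common CoNLL-style conventions, used here for illustration only and not prescribed by the paper):

```python
# One syntactic and one semantic label per word (illustrative tags only).
sentence = ["The", "cat", "sat", "on", "the", "mat"]
pos_tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]   # part-of-speech labels
# Semantic roles relative to the verb "sat", in IOB2 format:
srl_tags = ["B-A0", "I-A0", "B-V", "B-AM-LOC", "I-AM-LOC", "I-AM-LOC"]
```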
SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• How to model each label sequence
‣ could use a CRF with neural network unary potentials, based on a window (context) of words
- not appropriate for semantic role labeling, because relevant context might be very far away
‣ Collobert and Weston suggest a convolutional network over the whole sentence
- prediction at a given position can exploit information from any word in the sentence
[Figure 2 of Collobert, Weston, Bottou, Karlen, Kavukcuoglu and Kuksa: the sentence approach network. The input sentence (‘‘The cat sat on the mat’’, with padding) is described by K features per word; lookup tables LTW1 ... LTWK map them to a d-dimensional representation per word, followed by Convolution (M1 × ·), Max Over Time (max(·)), Linear (M2 × ·), HardTanh, and Linear (M3 × ·), with n3hu = #tags output units.]
SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Each word can be represented by more than one feature
‣ feature of the word itself
‣ substring features
- prefix: ‘‘eating’’ → ‘‘eat’’
- suffix: ‘‘eating’’ → ‘‘ing’’
‣ gazetteer features
- whether the word belongs to a list of known locations, persons, etc.
• These features are treated like word features, with their own lookup tables (sketch below)
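A minimal sketch of what ‘‘their own lookup tables’’ could look like; the vocabulary sizes, embedding dimensions, and function names below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# One lookup table per feature type (toy sizes; illustrative only).
word_table = rng.standard_normal((5000, 50))    # word identity -> 50 dims
suffix_table = rng.standard_normal((100, 10))   # suffix id     -> 10 dims
gazetteer_table = rng.standard_normal((2, 5))   # in list? 0/1  -> 5 dims

def represent_word(word_id, suffix_id, in_gazetteer):
    """Concatenate the feature embeddings into one d-dimensional vector."""
    return np.concatenate([
        word_table[word_id],
        suffix_table[suffix_id],
        gazetteer_table[int(in_gazetteer)],
    ])  # d = 50 + 10 + 5 = 65

x = represent_word(word_id=42, suffix_id=7, in_gazetteer=True)  # shape (65,)
```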
SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Features must encode for which word we are making a prediction
‣ done by adding the relative position i - posw, where posw is the position of the current word
• For SRL, must know for which verb we are predicting the roles
‣ also add the relative position of that verb, i - posv
‣ this feature also has its own lookup table (sketch below)
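A small sketch of these relative-position features; the table size, embedding dimension, and clipping range are assumptions for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
MAX_DIST = 10  # clip relative distances to a fixed range (assumed value)
dist_table = rng.standard_normal((2 * MAX_DIST + 1, 5))  # one row per offset

def position_feature(i, pos):
    """Embedding of the clipped relative position i - pos."""
    offset = int(np.clip(i - pos, -MAX_DIST, MAX_DIST)) + MAX_DIST  # shift to >= 0
    return dist_table[offset]

# For word i, its full representation concatenates the word features with
# position_feature(i, pos_w) and, for SRL, position_feature(i, pos_v).
```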
SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Lookup table:
‣ for each word, concatenate the representations of its features
• Convolution:
‣ at every position, compute linear activations from a window of representations
‣ this is a convolution in 1D
• Max pooling:
‣ obtain a fixed-size hidden layer with a max across positions (sketch below)
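The three steps above can be sketched end to end in NumPy; the dimensions, zero-padding scheme, and weight matrix are placeholder choices for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, k, n1 = 6, 65, 3, 20   # sentence length, word dims, window size, conv units

# Lookup-table output: one d-dimensional column per word (stand-in values).
X = rng.standard_normal((d, N))
M1 = rng.standard_normal((n1, k * d))  # convolution weights, shared across positions

# Pad so that every position has a full window of k words.
Xp = np.pad(X, ((0, 0), (k // 2, k // 2)))

# 1D convolution: the same linear map applied to each window of k columns
# (order="F" flattens the window as a concatenation of word vectors).
C = np.stack([M1 @ Xp[:, i:i + k].reshape(-1, order="F")
              for i in range(N)], axis=1)  # shape (n1, N)

# Max over time: one fixed-size vector, regardless of sentence length N.
h = C.max(axis=1)                          # shape (n1,)
```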
SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Regular neural network:
‣ the pooled representation serves as the input of a regular neural network
‣ they proposed using a ‘‘hard’’ version of the tanh activation function
• The outputs are used as the unary potentials of a chain CRF over the labels
‣ no connections between the CRFs of the different tasks
‣ a separate neural network is used for each task (one CRF per task); see the sketch below
of matrix
7