
NATURAL LANGUAGE PROCESSING
(COM4513/6513)
DISTRIBUTED WORD REPRESENTATIONS
Andreas Vlachos
[email protected]
Department of Computer Science
University of Sheffield
SO FAR
Word senses and their disambiguation.
We considered senses to be discrete representations of
meaning.
Thus, each sense is equally different from every other sense.
IN THIS LECTURE
We will challenge the discreteness of the word senses.
Instead of senses, we will map words to a vector.
A QUICK TEST
What is tesguino?
A bottle of tesguino is on the table.
Everybody likes tesguino.
Tesguino makes you drunk.
We make tesguino out of corn.
"You shall know a word by the company it keeps" (Firth, 1957).
Thus, two words are similar if they appear in similar contexts.
WORD-CONTEXT MATRIX
typically very sparse
the shorter the window, the more syntactic the representations
the longer the window, the more semantic the representations
is the example above (a 7-word window) syntactic or semantic? (a sketch of building such counts follows below)
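To make the window idea concrete, here is a minimal sketch (not from the slides) of counting word-context co-occurrences with a symmetric window; the toy corpus, the window size, and the whitespace tokenisation are all assumptions for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each context word appears within `window`
    positions to the left or right of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

# toy corpus echoing the tesguino example from the slides
corpus = [["we", "make", "tesguino", "out", "of", "corn"],
          ["everybody", "likes", "tesguino"]]
print(dict(cooccurrence_counts(corpus, window=2)["tesguino"]))
# {'we': 1, 'make': 1, 'out': 1, 'of': 1, 'everybody': 1, 'likes': 1}
```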
CONTEXT VECTORS
The most basic word representation; the intuition is similar to the term-document matrix in information retrieval.
one-hot:
x_apricot = [1, 0, 0, …, 0] ∈ ℝ^|V|
x_pineapple = [0, 1, 0, …, 0] ∈ ℝ^|V|
similarity(x_apricot, x_pineapple) = 0
context:
x_apricot = [0, 1, 1, …, 0] ∈ ℝ^|V|
x_pineapple = [0, 1, 1, …, 0] ∈ ℝ^|V|
similarity(x_apricot, x_pineapple) > 0
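A tiny illustration of the contrast above, with made-up five-dimensional vectors: one-hot vectors of different words are orthogonal, while context-count vectors of words that share contexts overlap.

```python
import numpy as np

onehot_apricot   = np.array([1, 0, 0, 0, 0])
onehot_pineapple = np.array([0, 1, 0, 0, 0])
ctx_apricot      = np.array([0, 1, 1, 0, 0])   # counts over contexts
ctx_pineapple    = np.array([0, 1, 1, 1, 0])

print(onehot_apricot @ onehot_pineapple)  # 0: no similarity signal at all
print(ctx_apricot @ ctx_pineapple)        # 2: shared contexts give similarity > 0
```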
SIMILARITY
inner-product(w, v) = w ⋅ v = ∑_{i=1}^{|V|} w_i v_i
Problem: more frequent words appear in more contexts, so they will be more similar to everything.
Solution: divide by the vector lengths (a.k.a. cosine).
cosine(w, v) = (w ⋅ v) / (|w| |v|) = ∑_{i=1}^{|V|} w_i v_i / ( √(∑_i w_i²) √(∑_i v_i²) )
Other options: Dice, Jaccard, etc.
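A possible implementation of the cosine formula above using numpy; the two count vectors are made up to show that cosine discounts sheer frequency while the raw inner product does not.

```python
import numpy as np

def cosine(w, v):
    """Dot product divided by the two vector lengths, as in the formula above."""
    return (w @ v) / (np.linalg.norm(w) * np.linalg.norm(v))

frequent = np.array([10.0, 20.0, 30.0])   # a frequent word: large raw counts
rare     = np.array([1.0, 2.0, 3.0])      # a rare word with the same context profile
print(frequent @ rare)         # 140.0: the inner product rewards sheer frequency
print(cosine(frequent, rare))  # 1.0: cosine ignores length, the directions are identical
```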
POINTWISE MUTUAL INFORMATION
Raw counts are OK, but frequent words (articles, pronouns,
etc.) can dominate without being informative.
Pointwise mutual information measures how often two
events occur relative to them occurring independently:
PMI(word, context) = log₂ ( P(word, context) / (P(word) P(context)) )
Positive values quantify relatedness.
Negative values? Usually ignored:
positivePMI(word, context) = max(PMI(word, context), 0)
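One way to compute positive PMI from a word-context count matrix, as a sketch; the counts below are invented, and cells with zero counts are simply mapped to 0.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI for every cell of a dense word-context count matrix."""
    total = counts.sum()
    p_wc = counts / total                      # joint P(word, context)
    p_w = p_wc.sum(axis=1, keepdims=True)      # marginal P(word)
    p_c = p_wc.sum(axis=0, keepdims=True)      # marginal P(context)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts give log(0): treat as 0
    return np.maximum(pmi, 0.0)                # keep only the positive values

counts = np.array([[2.0, 1.0, 0.0],
                   [1.0, 0.0, 3.0]])
print(ppmi(counts))
```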
CHOICE OF CONTEXTS
We can refine contexts using:
their part-of-speech tags (bank_V vs. bank_N)
syntactic dependencies (eat_dobj vs. eat_subj)
We can weight contexts according to their distance from the word: the further away, the lower the weight.
Rare contexts dominate, so replace
P(context) = #(context) / ∑_c #(c)
with:
P_α(context) = #(context)^α / ∑_c #(c)^α,   α = 0.75
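A small sketch of the context-distribution smoothing above: raising the counts to α = 0.75 before normalising increases the probability assigned to rare contexts, which in turn lowers their PMI; the counts here are made up.

```python
import numpy as np

def smoothed_context_probs(context_counts, alpha=0.75):
    """P_alpha(context): counts raised to alpha, then renormalised."""
    weighted = context_counts ** alpha
    return weighted / weighted.sum()

counts = np.array([1000.0, 10.0, 1.0])
print(counts / counts.sum())            # raw P(context):      ~[0.989, 0.010, 0.001]
print(smoothed_context_probs(counts))   # smoothed P_a(context): ~[0.964, 0.030, 0.005]
```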
SINGULAR VALUE DECOMPOSITION
PPMI matrices are good, but:
high dimensional
very sparse
Dimensionality reduction using truncated singular value
decomposition:
PPMI_{|V|×|C|} ≈ W_{|V|×k} S_{k×k} C_{k×|C|}
The approximation is good: it removes noise and redundancy.
Same as latent semantic analysis for term-document matrices.
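A sketch of the truncated decomposition above using numpy's dense SVD on a small random matrix that stands in for the PPMI matrix; for a real |V| × |C| matrix a sparse solver such as scipy.sparse.linalg.svds would be the usual choice, and folding the singular values into the word matrix is just one common convention.

```python
import numpy as np

ppmi_matrix = np.random.rand(6, 5)          # stand-in for a |V| x |C| PPMI matrix
k = 2                                       # target dimensionality

U, S, Vt = np.linalg.svd(ppmi_matrix, full_matrices=False)
W = U[:, :k] * S[:k]                        # |V| x k word vectors (singular values folded in)
C = Vt[:k, :]                               # k x |C| context matrix
print(W.shape, C.shape)                     # (6, 2) (2, 5)
print(np.linalg.norm(ppmi_matrix - W @ C))  # reconstruction error of the rank-k approximation
```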
SINGULAR VALUE DECOMPOSITION
[figure]
SKIP-GRAM (MIKOLOV ET AL. 2013)
Running SVD on large matrices is expensive.
Let's look at one word at a time:
P(context | w_t) = P(w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} | w_t)   (skip-gram)
P(context | w_t) = ∏_{w_c ∈ context} P(w_c | w_t)   (independence)
P(w_c | w_t) = exp(w_c ⋅ w_t) / ∑_{w_c′ ∈ V} exp(w_c′ ⋅ w_t)   (word vectors)
A giant logistic regression classifier: words are the labels and
the parameters are the word vectors
Negative sampling: negative training examples are sampled intelligently, instead of normalising over the whole vocabulary (see the sketch below)
EACH WORD HAS TWO EMBEDDINGS
Can discard the context word embeddings, add them, or
concatenate them with the target word embeddings
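The sketch below shows one gradient step of skip-gram with negative sampling in numpy, written for illustration rather than as Mikolov et al.'s implementation: the vocabulary size, dimensionality, learning rate, and sampled indices are all toy assumptions, and the last line shows the "add the two embeddings" option from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 1000, 50, 0.05
W_target  = rng.normal(scale=0.1, size=(V, d))   # target-word embeddings
W_context = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(t, c, negatives):
    """One SGD step for target index t, observed context index c,
    and a handful of sampled negative context indices."""
    grad_t = np.zeros(d)
    # positive pair: push sigma(w_c . w_t) towards 1
    g = sigmoid(W_context[c] @ W_target[t]) - 1.0
    grad_t += g * W_context[c]
    W_context[c] -= lr * g * W_target[t]
    # negative pairs: push sigma(w_n . w_t) towards 0
    for n in negatives:
        g = sigmoid(W_context[n] @ W_target[t])
        grad_t += g * W_context[n]
        W_context[n] -= lr * g * W_target[t]
    W_target[t] -= lr * grad_t

sgns_update(t=3, c=17, negatives=rng.integers(0, V, size=5))
embeddings = W_target + W_context   # one way to combine the two embeddings
```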
SKIP-GRAM
[figure]
EVALUATION
Intrinsic:
similarity: order word pairs according to their semantic
similarity
in-context similarity: substitute a word in a sentence without changing its meaning.
analogy: Athens is to Greece what Rome is to ...? (see the sketch after this list)
Extrinsic: use them to improve performance in a task. They
are an easy way to take advantage of unlabeled data to do
semi-supervised learning.
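The analogy test can be run as a nearest-neighbour search around v(Greece) − v(Athens) + v(Rome); the sketch below uses hand-picked two-dimensional toy vectors, so the embeddings and the helper name solve_analogy are illustrative assumptions.

```python
import numpy as np

def solve_analogy(a, b, a2, embeddings):
    """Return the word whose vector is closest (by cosine) to b - a + a2."""
    query = embeddings[b] - embeddings[a] + embeddings[a2]
    query /= np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, a2):               # exclude the question words themselves
            continue
        sim = (vec @ query) / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# toy 2-d vectors chosen by hand so the arithmetic works out
emb = {"athens": np.array([1.0, 0.0]), "greece": np.array([1.0, 1.0]),
       "rome":   np.array([2.0, 0.0]), "italy":  np.array([2.0, 1.0]),
       "paris":  np.array([3.0, 0.0])}
print(solve_analogy("athens", "greece", "rome", emb))  # 'italy' with these toy vectors
```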
BEST WORD VECTORS?
high-dimensional (processed) counts?
low-dimensional neural/SVD?
A recent paper by Levy et al. (2015) showed that the choice of context window size, rare-word removal, and context-distribution smoothing matter more.
Choice of texts to obtain the counts matters. More text is
better, and low-dimensional methods scale better.
Learning the vectors makes it possible to incorporate domain
knowledge, e.g. bass being more similar to fish.
WHAT ABOUT POLYSEMY?
All occurrences of a word (and all its senses) are represented
by one vector.
How do we handle polysemy?
all senses are present in the vector
given a task, it is often useful to adapt the vectors to
represent the appropriate sense
LIMITATIONS
antonyms appear in similar contexts, so it is hard to distinguish them from synonyms
compositionality: what is the meaning of a sequence of words? While we might be able to obtain context vectors for short phrases, this does not scale to whole sentences, paragraphs, etc.
BIBLIOGRAPHY
Jurafsky & Martin Chapter 19
Omer Levy's article
Turian et al.'s paper on how to use them
COMING UP NEXT
Language modeling