
Lecture 8:
Word Clustering
Kai-Wei Chang
CS @ University of Virginia
[email protected]
Course webpage: http://kwchang.net/teaching/NLP16
6501 Natural Language Processing
This lecture
 Brown Clustering
Brown Clustering
 Similar to a language model, but the basic unit is a “word cluster”
 Intuition: as before, similar words appear in similar contexts
 Recap: bigram language models
P(w0, w1, w2, …, wn) = P(w1 | w0) P(w2 | w1) … P(wn | wn−1) = Π_{i=1}^{n} P(wi | wi−1)
w0 is a dummy word representing “beginning of a sentence”
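The bigram recap above can be sketched in a few lines; a minimal MLE bigram model over a toy corpus (the function name, corpus, and `<s>` start token are illustrative, not from the lecture):

```python
from collections import defaultdict

def bigram_prob(sentence, corpus):
    """P(w1..wn) = prod_i P(wi | wi-1), with a dummy start word w0 = <s>.

    Probabilities are maximum-likelihood estimates from bigram counts.
    """
    unigram = defaultdict(int)   # #(w) counted as a left context
    bigram = defaultdict(int)    # #(w, w')
    for sent in corpus:
        words = ["<s>"] + sent
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    prob = 1.0
    words = ["<s>"] + sentence
    for prev, cur in zip(words, words[1:]):
        prob *= bigram[(prev, cur)] / unigram[prev]
    return prob

corpus = [["a", "dog"], ["a", "cat"]]
# P(a | <s>) = 2/2, P(dog | a) = 1/2
print(bigram_prob(["a", "dog"], corpus))  # 0.5
```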
Motivation example
 “a dog is chasing a cat”
 P(w0, “a”, “dog”, …, “cat”) = P(“a” | w0) P(“dog” | “a”) … P(“cat” | “a”)
 Assume every word belongs to a cluster:
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, …
Motivation example
 Assume every word belongs to a cluster
 “a dog is chasing a cat”
a → C3, dog → C46, is → C64, chasing → C8, a → C3, cat → C46
Motivation example
 Assume every word belongs to a cluster
 “the boy is following a rabbit”
the → C3, boy → C46, is → C64, following → C8, a → C3, rabbit → C46
Motivation example
 Assume every word belongs to a cluster
 “a fox was chasing a bird”
a → C3, fox → C46, was → C64, chasing → C8, a → C3, bird → C46
Brown Clustering
 Let C(w) denote the cluster that w belongs to
 “a dog is chasing a cat”
a → C3, dog → C46, is → C64, chasing → C8, a → C3, cat → C46
Cluster-to-cluster transition: P(C(dog) | C(a)); cluster-to-word emission: P(cat | C(cat))
Brown clustering model
 P(“a dog is chasing a cat”)
= P(C(“a”) | C0) P(C(“dog”) | C(“a”)) P(C(“is”) | C(“dog”)) …
× P(“a” | C(“a”)) P(“dog” | C(“dog”)) …
Brown clustering model
 P(“a dog is chasing a cat”)
= P(C(“a”) | C0) P(C(“dog”) | C(“a”)) P(C(“is”) | C(“dog”)) …
× P(“a” | C(“a”)) P(“dog” | C(“dog”)) …
 In general
P(w0, w1, w2, …, wn) = P(C(w1) | C(w0)) P(C(w2) | C(w1)) … P(C(wn) | C(wn−1)) × P(w1 | C(w1)) P(w2 | C(w2)) … P(wn | C(wn))
= Π_{i=1}^{n} P(C(wi) | C(wi−1)) P(wi | C(wi))
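The factorization above can be transcribed directly. In this sketch the cluster map and the transition/emission tables are toy values set by hand for illustration, not estimated from data:

```python
def brown_sentence_prob(sentence, cluster_of, trans, emit):
    """P(w1..wn) = prod_i P(C(wi) | C(wi-1)) * P(wi | C(wi)).

    trans[(c, c')] is the cluster bigram probability, emit[(w, c)] is the
    word-given-cluster probability; "C0" is the dummy start cluster.
    """
    prob, prev = 1.0, "C0"
    for w in sentence:
        c = cluster_of[w]
        prob *= trans[(prev, c)] * emit[(w, c)]
        prev = c
    return prob

# Toy tables (made up for illustration)
cluster_of = {"a": "C3", "dog": "C46"}
trans = {("C0", "C3"): 1.0, ("C3", "C46"): 0.5}
emit = {("a", "C3"): 0.5, ("dog", "C46"): 0.25}
print(brown_sentence_prob(["a", "dog"], cluster_of, trans, emit))  # 0.0625
```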
Model parameters
P(w0, w1, w2, …, wn) = Π_{i=1}^{n} P(C(wi) | C(wi−1)) P(wi | C(wi))
Parameter set 1: P(C(wi) | C(wi−1)) (cluster transitions)
Parameter set 2: P(wi | C(wi)) (word emissions)
Parameter set 3: C(wi) (the clustering itself)
Model parameters
P(w0, w1, w2, …, wn) = Π_{i=1}^{n} P(C(wi) | C(wi−1)) P(wi | C(wi))
 A vocabulary set W
 A function C: W → {1, 2, 3, …, k}
 A partition of the vocabulary into k classes
 Conditional probabilities P(c′ | c) for c, c′ ∈ {1, …, k}
 Conditional probabilities P(w | c) for c ∈ {1, …, k}, w ∈ c
θ represents the set of conditional probability parameters; C represents the clustering
Log likelihood
LL(θ, C) = log P(w0, w1, w2, …, wn | θ, C)
= log Π_{i=1}^{n} P(C(wi) | C(wi−1)) P(wi | C(wi))
= Σ_{i=1}^{n} [log P(C(wi) | C(wi−1)) + log P(wi | C(wi))]
 Maximizing LL(θ, C) can be done by alternately updating θ and C:
1. max_{θ∈Θ} LL(θ, C)
2. max_{C} LL(θ, C)
max_{θ∈Θ} LL(θ, C)
LL(θ, C) = log P(w0, w1, w2, …, wn | θ, C)
= log Π_{i=1}^{n} P(C(wi) | C(wi−1)) P(wi | C(wi))
= Σ_{i=1}^{n} [log P(C(wi) | C(wi−1)) + log P(wi | C(wi))]
 For a fixed clustering C, the maximizing θ is given by relative counts:
 P(c′ | c) = #(c′, c) / #(c)
 P(w | c) = #(w, c) / #(c)
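These count ratios are straightforward to compute from a clustered corpus. A sketch, with a toy corpus and cluster map as assumptions (the "C0" start cluster is ours):

```python
from collections import defaultdict

def estimate_params(corpus, cluster_of):
    """MLE estimates: P(c'|c) = #(c',c)/#(c), P(w|c) = #(w,c)/#(c).

    #(c) in the transition denominator counts c as the conditioning
    (previous) cluster; "C0" is the dummy start cluster.
    """
    cc = defaultdict(int)          # #(c', c): cluster c' following cluster c
    prev_count = defaultdict(int)  # #(c) as a left context
    wc = defaultdict(int)          # #(w, c)
    c_count = defaultdict(int)     # #(c) as an emitting cluster
    for sent in corpus:
        prev = "C0"
        for w in sent:
            c = cluster_of[w]
            cc[(prev, c)] += 1
            prev_count[prev] += 1
            wc[(w, c)] += 1
            c_count[c] += 1
            prev = c
    trans = {(c, c2): n / prev_count[c] for (c, c2), n in cc.items()}
    emit = {(w, c): n / c_count[c] for (w, c), n in wc.items()}
    return trans, emit

corpus = [["a", "dog"], ["a", "cat"]]
cluster_of = {"a": "C3", "dog": "C46", "cat": "C46"}
trans, emit = estimate_params(corpus, cluster_of)
print(trans[("C3", "C46")])  # 1.0: C46 always follows C3
print(emit[("dog", "C46")])  # 0.5: dog is 1 of 2 C46 tokens
```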
max_{C} LL(θ, C)
max_{C} Σ_{i=1}^{n} [log P(C(wi) | C(wi−1)) + log P(wi | C(wi))]
= n Σ_{c=1}^{k} Σ_{c′=1}^{k} p(c, c′) log [ p(c, c′) / (p(c) p(c′)) ] + G,
where G is a constant
 Here,
p(c, c′) = #(c, c′) / Σ_{c,c′} #(c, c′),  p(c) = #(c) / Σ_{c} #(c)
 p(c, c′) / (p(c) p(c′)) = P(c′ | c) / p(c′)
 The double sum is the mutual information between adjacent clusters
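The mutual-information term can be computed directly from adjacent-cluster counts. A sketch under one simplifying assumption of ours: the marginals p(c) are taken separately for the left and right positions of the pair distribution, which keeps the quantity a true (nonnegative) mutual information:

```python
import math
from collections import defaultdict

def cluster_mi(corpus, cluster_of):
    """sum_{c,c'} p(c,c') * log( p(c,c') / (p(c) p(c')) ) over adjacent pairs."""
    pair = defaultdict(int)
    left, right = defaultdict(int), defaultdict(int)
    total = 0
    for sent in corpus:
        cs = [cluster_of[w] for w in sent]
        for c, c2 in zip(cs, cs[1:]):
            pair[(c, c2)] += 1
            left[c] += 1
            right[c2] += 1
            total += 1
    mi = 0.0
    for (c, c2), n in pair.items():
        p = n / total
        mi += p * math.log(p / ((left[c] / total) * (right[c2] / total)))
    return mi

corpus = [["a", "dog", "a", "dog"]]
cluster_of = {"a": "C3", "dog": "C46"}
print(cluster_mi(corpus, cluster_of))  # ≈ 0.6365
```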
Algorithm 1
 Start with |V| clusters: each word is in its own cluster
 The goal is to get k clusters
 We run |V| − k merge steps:
 Pick 2 clusters and merge them
 Each step picks the merge maximizing LL(θ, C)
 Cost: O(|V| − k) iterations × O(|V|²) candidate pairs × O(|V|²) to compute LL = O(|V|⁵) (can be improved to O(|V|³))
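A brute-force sketch of this merge loop under the mutual-information objective (since the word-level part of LL reduces to the constant G across merges, only the cluster MI needs re-evaluating; function names and the tiny corpus are ours). It recomputes the objective from scratch for every candidate pair, which is exactly the naive O(|V|⁵) cost:

```python
import math
from collections import defaultdict

def quality(corpus, cluster_of):
    """Mutual information of adjacent cluster pairs (pair-based marginals)."""
    pair, total = defaultdict(int), 0
    for sent in corpus:
        cs = [cluster_of[w] for w in sent]
        for a, b in zip(cs, cs[1:]):
            pair[(a, b)] += 1
            total += 1
    left, right = defaultdict(int), defaultdict(int)
    for (a, b), n in pair.items():
        left[a] += n
        right[b] += n
    return sum((n / total) * math.log(n * total / (left[a] * right[b]))
               for (a, b), n in pair.items())

def brown_merge(corpus, k):
    """Start with |V| singleton clusters; run |V|-k greedy merges."""
    vocab = sorted({w for sent in corpus for w in sent})
    cluster_of = {w: i for i, w in enumerate(vocab)}
    clusters = sorted(set(cluster_of.values()))
    while len(clusters) > k:
        best, best_pair = -math.inf, None
        for a in clusters:                 # try every pair of clusters
            for b in clusters:
                if a >= b:
                    continue
                trial = {w: (a if c == b else c) for w, c in cluster_of.items()}
                q = quality(corpus, trial)  # objective recomputed from scratch
                if q > best:
                    best, best_pair = q, (a, b)
        a, b = best_pair                   # commit the best merge
        cluster_of = {w: (a if c == b else c) for w, c in cluster_of.items()}
        clusters.remove(b)
    return cluster_of

corpus = [["a", "dog", "is"], ["the", "cat", "is"],
          ["a", "cat", "was"], ["the", "dog", "was"]]
clustering = brown_merge(corpus, 3)
print(clustering)  # maps the 6 word types into 3 clusters
```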
Algorithm 2
 m: a hyper-parameter; sort words by frequency
 Take the top m most frequent words and put each of them in its own cluster c1, c2, c3, …, cm
 For i = m + 1 … |V|:
 Create a new cluster cm+1 for the i-th word (we now have m+1 clusters)
 Choose two clusters from the m+1 clusters based on LL(θ, C) and merge them ⇒ back to m clusters
 Carry out (m−1) final merges ⇒ full hierarchy
 Running time: O(|V|m² + n), where n = #words in corpus
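The loop structure above can be sketched as follows. This is a simplified illustration, not the lecture's implementation: it reuses the MI-style objective as a stand-in for LL, skips over not-yet-clustered words when counting adjacencies (treating their neighbors as adjacent), and omits the final (m−1) hierarchy merges:

```python
import math
from collections import Counter, defaultdict

def quality(corpus, cluster_of):
    """Mutual information of adjacent cluster pairs (pair-based marginals)."""
    pair, total = defaultdict(int), 0
    for sent in corpus:
        # Simplification: words not yet assigned a cluster are skipped,
        # so their neighbors are treated as adjacent.
        cs = [cluster_of[w] for w in sent if w in cluster_of]
        for a, b in zip(cs, cs[1:]):
            pair[(a, b)] += 1
            total += 1
    left, right = defaultdict(int), defaultdict(int)
    for (a, b), n in pair.items():
        left[a] += n
        right[b] += n
    return sum((n / total) * math.log(n * total / (left[a] * right[b]))
               for (a, b), n in pair.items())

def brown_top_m(corpus, m):
    """Seed with the m most frequent words; add each remaining word as a
    new cluster, then merge the best pair to get back to m clusters."""
    freq = Counter(w for sent in corpus for w in sent)
    order = [w for w, _ in freq.most_common()]
    cluster_of = {w: i for i, w in enumerate(order[:m])}
    for i, w in enumerate(order[m:], start=m):
        cluster_of[w] = i  # temporary (m+1)-th cluster
        clusters = sorted(set(cluster_of.values()))
        best, best_pair = -math.inf, None
        for a in clusters:  # choose the merge that keeps the objective highest
            for b in clusters:
                if a >= b:
                    continue
                trial = {u: (a if c == b else c) for u, c in cluster_of.items()}
                q = quality(corpus, trial)
                if q > best:
                    best, best_pair = q, (a, b)
        a, b = best_pair
        cluster_of = {u: (a if c == b else c) for u, c in cluster_of.items()}
    return cluster_of

corpus = [["a", "dog", "is"], ["the", "dog", "is"], ["a", "cat", "is"]]
print(brown_top_m(corpus, 3))  # 5 word types mapped into 3 clusters
```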
Example clusters (Brown+1992)
[figure omitted in this transcript]
Example hierarchy (Miller+2004)
[figure omitted in this transcript]
Quiz 1
 30 min (9/20 Tue. 12:30pm–1:00pm)
 Fill-in-the-blank, true/false
 Short answer
 Closed book, closed notes, closed laptop
 Sample questions:
 Add-one smoothing vs. add-lambda smoothing
 a = (1, 3, 5), b = (2, 3, 6): what is the cosine similarity between a and b?
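The cosine question can be checked mechanically (the helper function is ours, not part of the quiz):

```python
import math

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# a . b = 2 + 9 + 30 = 41, |a| = sqrt(35), |b| = sqrt(49) = 7
print(cosine([1, 3, 5], [2, 3, 6]))  # 41 / (7 * sqrt(35)) ≈ 0.990
```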