Lecture 8: Word Clustering
Kai-Wei Chang
CS @ University of Virginia
[email protected]
Course webpage: http://kwchang.net/teaching/NLP16

This lecture
- Brown Clustering

Brown Clustering
- Similar to a language model, but the basic unit is the "word cluster"
- Intuition: again, similar words appear in similar contexts

Recap: Bigram Language Models
P(w_0, w_1, w_2, ..., w_n) = P(w_1 | w_0) P(w_2 | w_1) ... P(w_n | w_{n-1})
                           = \prod_{i=1}^{n} P(w_i | w_{i-1})
w_0 is a dummy word representing the "beginning of a sentence".

Motivation example
"a dog is chasing a cat"
P(w_0, "a", "dog", ..., "cat") = P("a" | w_0) P("dog" | "a") ... P("cat" | "a")
Assume every word belongs to a cluster, e.g.:
- Cluster 3:  a, the
- Cluster 46: dog, cat, fox, rabbit, bird, boy
- Cluster 64: is, was
- Cluster 8:  chasing, following, biting, ...
The sentence then maps to a cluster sequence:
  "a dog is chasing a cat"        ->  C3 C46 C64 C8 C3 C46
Sentences with the same structure map to the same cluster sequence:
  "the boy is following a rabbit" ->  C3 C46 C64 C8 C3 C46
  "a fox was chasing a bird"      ->  C3 C46 C64 C8 C3 C46

Brown Clustering
Let C(w) denote the cluster that w belongs to. The model combines
cluster-transition probabilities such as P(C("dog") | C("a")) with
word-emission probabilities such as P("cat" | C("cat")).

Brown clustering model
P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
    * P("a" | C("a")) P("dog" | C("dog")) ...
In general,
P(w_0, w_1, w_2, ..., w_n)
  = P(C(w_1) | C(w_0)) P(C(w_2) | C(w_1)) ... P(C(w_n) | C(w_{n-1}))
    * P(w_1 | C(w_1)) P(w_2 | C(w_2)) ... P(w_n | C(w_n))
  = \prod_{i=1}^{n} P(C(w_i) | C(w_{i-1})) P(w_i | C(w_i))
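To make this factorization concrete, here is a minimal Python sketch that scores a sentence as a product of cluster-transition and word-emission probabilities. The cluster assignments follow the motivation example above, but all probability values are made-up toy numbers, not parameters from the lecture.

```python
import math

# Toy cluster assignment C(w), following the motivation example above.
C = {"a": 3, "the": 3,
     "dog": 46, "cat": 46, "fox": 46, "rabbit": 46, "bird": 46, "boy": 46,
     "is": 64, "was": 64,
     "chasing": 8, "following": 8, "biting": 8}

# Hypothetical parameter tables (assumed values, for illustration only).
# trans[(c, c_next)] = P(c_next | c); cluster 0 stands in for the dummy C_0.
trans = {(0, 3): 0.5, (3, 46): 0.8, (46, 64): 0.6, (64, 8): 0.7, (8, 3): 0.9}
# emit[(c, w)] = P(w | c)
emit = {(3, "a"): 0.6, (3, "the"): 0.4,
        (46, "dog"): 0.2, (46, "cat"): 0.2, (46, "fox"): 0.1,
        (46, "rabbit"): 0.1, (46, "bird"): 0.2, (46, "boy"): 0.2,
        (64, "is"): 0.7, (64, "was"): 0.3,
        (8, "chasing"): 0.5, (8, "following"): 0.3, (8, "biting"): 0.2}

def log_prob(words):
    """log P(w_1..w_n) = sum_i [log P(C(w_i)|C(w_{i-1})) + log P(w_i|C(w_i))]."""
    prev = 0  # dummy start cluster C_0
    total = 0.0
    for w in words:
        c = C[w]
        total += math.log(trans[(prev, c)]) + math.log(emit[(c, w)])
        prev = c
    return total

print(log_prob("a dog is chasing a cat".split()))
```

Note that "the boy is following a rabbit" reuses exactly the same transition terms, since it shares the cluster sequence C3 C46 C64 C8 C3 C46; only the emission terms differ.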
Model parameters
P(w_0, w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(C(w_i) | C(w_{i-1})) P(w_i | C(w_i))
- Parameter set 1: the cluster-transition probabilities P(C(w_i) | C(w_{i-1}))
- Parameter set 2: the word-emission probabilities P(w_i | C(w_i))
- Parameter set 3: the clustering C(w_i) itself

Formally, the model consists of:
- A vocabulary set W
- A function C: W -> {1, 2, 3, ..., k}, i.e., a partition of the vocabulary into k classes
- Conditional probabilities P(c' | c) for c, c' in {1, ..., k}
- Conditional probabilities P(w | c) for c in {1, ..., k} and w with C(w) = c
θ represents the set of conditional probability parameters; C represents the clustering.

Log likelihood
LL(θ, C) = log P(w_0, w_1, w_2, ..., w_n | θ, C)
         = log \prod_{i=1}^{n} P(C(w_i) | C(w_{i-1})) P(w_i | C(w_i))
         = \sum_{i=1}^{n} [ log P(C(w_i) | C(w_{i-1})) + log P(w_i | C(w_i)) ]
Maximizing LL(θ, C) can be done by alternately updating θ and C:
1. max_{θ in Θ} LL(θ, C)
2. max_C LL(θ, C)

max_{θ in Θ} LL(θ, C)
For a fixed clustering C, the maximizing parameters are the count ratios
  P(c' | c) = #(c, c') / #(c)        P(w | c) = #(w, c) / #(c)

max_C LL(θ, C)
Plugging these estimates back in,
max_C \sum_{i=1}^{n} [ log P(C(w_i) | C(w_{i-1})) + log P(w_i | C(w_i)) ]
  = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G,
where G is a constant that does not depend on C. Here
  p(c, c') = #(c, c') / \sum_{c, c'} #(c, c'),    p(c) = #(c) / \sum_c #(c),
so the double sum is the mutual information between adjacent clusters.

Algorithm 1
- Start with |V| clusters: each word is in its own cluster
- The goal is to get k clusters
- We run |V| - k merge steps:
  - Pick 2 clusters and merge them
  - Each step picks the merge maximizing LL(θ, C)
- Cost: O(|V| - k) iterations x O(|V|^2) candidate pairs x O(|V|^2) to compute LL
  = O(|V|^5); this can be improved to O(|V|^3)
(A runnable sketch of this greedy merging appears at the end of these notes.)

Algorithm 2
- m: a hyper-parameter; sort the words by frequency
- Take the top m most frequent words and put each of them in its own cluster c_1, c_2, c_3, ..., c_m
- For i = m + 1, ..., |V|:
  - Create a new cluster c_{m+1} for the i-th most frequent word (we now have m + 1 clusters)
  - Choose two clusters from the m + 1 clusters based on LL(θ, C) and merge them => back to m clusters
- Carry out m - 1 final merges => full hierarchy
- Running time: O(|V| m^2 + n), where n = #words in the corpus

Example clusters (Brown et al., 1992)
[figure: example word clusters from Brown et al., 1992]

Example hierarchy (Miller et al., 2004)
[figure: example cluster hierarchy from Miller et al., 2004]

Quiz 1
- 30 min (9/20 Tue. 12:30pm-1:00pm)
- Fill-in-the-blank, True/False, Short answer
- Closed book, closed notes, closed laptop
- Sample questions:
  - Add-one smoothing vs. add-lambda smoothing
  - a = (1, 3, 5), b = (2, 3, 6): what is the cosine similarity between a and b?
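For reference, the cosine similarity in the second sample question is the dot product normalized by the two vector lengths, and works out as:

  \cos(a, b) = (a \cdot b) / (\|a\| \|b\|)
             = (1*2 + 3*3 + 5*6) / (\sqrt{1+9+25} * \sqrt{4+9+36})
             = 41 / (\sqrt{35} * 7) \approx 0.990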
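As referenced in the Algorithm 1 slide, below is a minimal Python sketch of the naive greedy merging: start with one cluster per word and repeatedly apply the merge that keeps the mutual-information objective highest. The function names, the toy corpus, and the simplifications (no start symbol w_0, brute-force re-scoring of every candidate merge) are assumptions made for illustration; a practical implementation needs the incremental bookkeeping that brings the cost down toward O(|V|^3).

```python
from collections import defaultdict
from itertools import combinations
import math

def mutual_information(pairs, tokens, assign):
    """sum_{c,c'} p(c,c') * log[ p(c,c') / (p(c) p(c')) ], with p(c,c')
    estimated from adjacent-cluster counts and p(c) from cluster unigram
    counts, as defined on the max_C LL(theta, C) slide."""
    joint = defaultdict(float)
    marg = defaultdict(float)
    for w1, w2 in pairs:
        joint[(assign[w1], assign[w2])] += 1.0 / len(pairs)
    for w in tokens:
        marg[assign[w]] += 1.0 / len(tokens)
    return sum(p * math.log(p / (marg[c1] * marg[c2]))
               for (c1, c2), p in joint.items())

def brown_clusters(corpus, k):
    """Naive Algorithm 1: one cluster per word, then |V| - k greedy merges,
    each time keeping the merge with the highest mutual information."""
    tokens = [w for sent in corpus for w in sent]
    pairs = [(s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1)]
    assign = {w: i for i, w in enumerate(sorted(set(tokens)))}  # C(w)
    clusters = set(assign.values())
    while len(clusters) > k:
        best = None
        for c1, c2 in combinations(sorted(clusters), 2):  # O(|V|^2) candidates
            trial = {w: (c1 if c == c2 else c) for w, c in assign.items()}
            mi = mutual_information(pairs, tokens, trial)
            if best is None or mi > best[0]:
                best = (mi, c1, c2)
        _, c1, c2 = best
        assign = {w: (c1 if c == c2 else c) for w, c in assign.items()}
        clusters.remove(c2)  # cluster c2 has been merged into c1
    return assign

# Toy corpus echoing the motivation example; a real run uses a large corpus.
corpus = [s.split() for s in ["a dog is chasing a cat",
                              "the boy is following a rabbit",
                              "a fox was chasing a bird"]]
print(brown_clusters(corpus, k=4))
```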