Introduction to Text Analysis

Bryce J. Dietrich
University of Iowa

Agenda
1 Dictionaries
2 Document-Term Matrices
3 Preprocessing
4 Unsupervised Learning
5 Expectation-Maximization Algorithm
6 LDA
7 Conclusion
Text and Political Science
Post 2000, things have changed...

Massive collections of texts are increasingly used as a data source:
- Congressional speeches, press releases, newsletters...
- Facebook posts, tweets, emails, text messages...
- Newspapers, magazines, transcripts...
- Foreign news sources, treaties...

Why?
- LOTS of unstructured text data (201 billion emails sent and received every day)
- LOTS of cheap storage: 1956: $10,000 per megabyte. 2016: $0.0001 per megabyte.
- LOTS of methods and software to analyze text
Text and Political Science
Ultimately, these trends mean...

Analyzing text has become bigger, faster, and stronger:
- Generalizable → one method can be used across a variety of texts
- Systematic → one method can be used again and again
- Cheap → one computer can do a lot; 100 computers can do even more

Analyzing text is still important:
- Laws
- Treaties
- News media
- Campaigns
- Petitions
- Speeches
- Press Releases
What Can We Do?
There are two things we may want to do with this haystack...

Understanding the meaning of a sentence or phrase → analyzing a straw of hay
- Humans = 1
- Computers = 0

Classifying text → organizing the haystack
- Humans = 0
- Computers = 1
Text = Complex?

This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.
Text = Simple?
Speech by Barbara Mikulski (D-MD) delivered on June 28, 2016...

word          count
zika          4
million       4
emergency     3
health        3
act           2
response      2
republican    2
report        1
treat         1
world         1
organization  1
affordable    1
Text = Simple?
The Republican conference report also doesn't treat Zika like the emergency it is. The World Health Organization declared the Zika virus a public health emergency on February 1. And Zika meets the Budget Act criteria for emergency spending: It is urgent, unforeseen, and temporary. Yet Republicans insisted that we cut $750 million to pay for the response to Zika, including $543 million from the Affordable Care Act, $100 million from the Department of Health and Human Services, HHS, nonrecurring expense fund, and $107 million from Ebola response funds.
“Big Data”
Suppose we want to categorize 100 documents...
- Consider two documents, A and B: how many ways can we partition them? → {A,B}; {A}{B} = 2
- Consider three documents, A, B, and C: how many ways? → {A,B,C}; {A,B}{C}; {A,C}{B}; {B,C}{A}; {A}{B}{C} = 5
- Bell(n) = number of ways to partition n objects. Bell(2) = 2, Bell(3) = 5, Bell(5) = 52, etc.
- Bell(100) = 4.758539 × 10^115
- It takes R about 0.001 seconds to count to 100, so it would take R about 4.758539 × 10^110 seconds to count to Bell(100).
- There are 3.154 × 10^7 seconds in a year: 4.758539 × 10^110 / (3.154 × 10^7) = 1.508731 × 10^103 years.
Automated methods can help with even small tasks!
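
A minimal sketch of how these Bell numbers can be computed, using the Bell-triangle recurrence; the helper name bell is ours, not from the deck:

# compute Bell numbers with the Bell-triangle recurrence
def bell(n):
    row = [1]  # row 1 of the Bell triangle; its last entry is Bell(1)
    for _ in range(n - 1):
        new_row = [row[-1]]  # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]  # the last entry of row n is Bell(n)

print(bell(2), bell(3), bell(5))  # 2 5 52
print(bell(100))                  # a 116-digit number, about 4.758539e115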
Principles of Automated Text Analysis (from Justin Grimmer)
Principle 1: All Quantitative Models of Language are Wrong – But Some are Useful
- Data generation process unknown
- Complexity of language:
  - Time flies like an arrow; fruit flies like a banana
  - Make peace, not war; Make war, not peace
- Models necessarily fail to capture language

Principle 2: Quantitative Methods Augment Humans, Not Replace Them
- Computer-Assisted Content Analysis
- Computers suggest, humans interpret

Principle 3: There is no Globally Best Method for Automated Text Analysis
- Supervised methods = known categories
- Unsupervised methods = discover categories

Principle 4: Validate, Validate, Validate
- Few theorems to guarantee performance
- Apply methods → validate
- Do not blindly use methods!
Types of Classification Problems
Topic: What is this text about?
- Policy area of legislation
- Party agendas

Sentiment: What is said in this text?
- For or against legislation
- Agree or disagree with an argument
- A liberal/conservative position

Style/Tone: How is it said?
- Positive/Negative Emotion
Weighted Words
Weighted Words!
Dictionary Methods
Many Dictionary Methods (like LIWC)
1) Proprietary, wrapped in a GUI
2) Basic tasks:
   a) Count words
   b) Weight some words more than others
   c) Some graphics
3) Expensive!
Other Dictionaries
1) General Inquirer Database (http://www.wjh.harvard.edu/~inquirer/)
   - Stone, P.J., Dunphy, D.C., and Ogilvie, D.M. (1966) The General Inquirer: A Computer Approach to Content Analysis
   - 1,915 positive words and 2,291 negative words
2) Linguistic Inquiry and Word Count (LIWC)
   - Creation process:
     1) Generate word list for categories: "We drew on common emotion rating scales...[then] brain-storming sessions among 3-6 judges were held" to generate other words
     2) Judge round: (a) Does the word belong? (b) What other categories might it belong to?
   - 406 positive words and 499 negative words
Generating New Words
Three ways to create dictionaries:
- Statistical methods
- Manual generation
  - "Theory"
- "Research Assistants"
  a) Grad Students
  b) Undergraduates
  c) Mechanical Turkers
  - Example: {Happy, Unhappy}
  - Ask Turkers: how happy is elevator, car, pretty, young?
Applying a Dictionary to NYT Articles
Python!
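
The live demo is not reproduced here; below is a minimal sketch of what a dictionary method does, with hypothetical word lists standing in for a published dictionary such as the General Inquirer:

# score a document with hypothetical positive/negative word lists
positive_words = {'support', 'hope', 'together', 'great'}
negative_words = {'cut', 'emergency', 'injustice', 'war'}

def dictionary_score(text):
    # tokenize crudely on whitespace after lowercasing
    tokens = text.lower().split()
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    # net tone, normalized by document length
    return (pos - neg) / max(len(tokens), 1)

print(dictionary_score('we have hope and support despite the emergency'))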
Document-Term Matrices

$$X = \begin{bmatrix} 1 & 0 & 0 & \cdots & 3 \\ 0 & 2 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 5 \end{bmatrix}$$

X = N × K matrix
- N = Number of documents
- K = Number of features
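
A minimal sketch of building such an N × K matrix with scikit-learn; the toy corpus is hypothetical:

# build a document-term matrix from a toy corpus
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the bill has support',
        'the bill was debated',
        'support the amendment']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # N x K sparse matrix: N documents, K features

print(vectorizer.get_feature_names_out())
print(X.toarray())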
“I Have A Dream”

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = 'http://avalon.law.yale.edu/20th_century/mlk01.asp'

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, 'html.parser')

# get speech
speech = ""
lines = soup.find_all('p')
for line in lines:
    speech += line.get_text()
Installing Beautiful Soup

# activate our virtual environment
source activate python-UI

# install beautiful soup
python -m pip install beautifulsoup4
Using Beautiful Soup

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = 'http://avalon.law.yale.edu/20th_century/mlk01.asp'

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, 'html.parser')

# print soup
print(soup.prettify()[0:300])
HTML

<html>
<head>
<link href="../css/site.css" rel="stylesheet" type="text/css">
<title>
Avalon Project - I have a Dream by Martin Luther King, Jr; August 28, 1963
</title>
</link>
</head>
<body>
<div class="HeaderContainer">
<ul class="HeaderTopTools">
<li class="Search">
HTML
A browser reads in the HTML document, parses it into a DOM (Document Object Model) structure, and then renders the DOM structure.
Finding Text

<p>
I have a dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood.
</p>
Finding Text

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = 'http://avalon.law.yale.edu/20th_century/mlk01.asp'

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, 'html.parser')

# get lines
lines = soup.find_all('p')

# get type
print(type(lines))
Finding Text

This is a bs4.element.ResultSet object, which is simply a collection of tags. Note, this is a bs4 object, so most standard functions will not work.

# inspect first element
print(lines[0])

# output
'<p>Delivered on the steps at the Lincoln Memorial in Washington D.C. on August 28, 1963</p>'

# get text
print(lines[0].get_text())

# output
'Delivered on the steps at the Lincoln Memorial in Washington D.C. on August 28, 1963'
Preprocessing

Preprocessing: simplify text in order to make it useful.

1 Remove capitalization, punctuation
2 Discard word order (Bag of Words Assumption)
3 Discard stop words
4 Create equivalence class: stem, lemmatize, or synonym
5 Discard less useful features
6 Other reduction, specialization
Step 1: Removing Capitalization and Punctuation

Removing capitalization:
- Python: string.lower()
- R: tolower('string')

Removing punctuation:
- Python: re.sub('\W', ' ', string)
- R: gsub('\\W', ' ', string)
Step 1: Removing Capitalization and Punctuation

# import modules
import re

# create sentence
sentence = 'Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation.'
sentence2 = sentence.lower()
sentence3 = re.sub('\W', ' ', sentence2)

# output
'five score years ago  a great american  in whose symbolic shadow we stand signed the emancipation proclamation '
Step 1: Removing Capitalization and Punctuation

# import modules
import re

# create sentence
sentence = 'Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation.'
sentence = re.sub('\W', ' ', sentence.lower())
print(sentence)

# output
five score years ago  a great american  in whose symbolic shadow we stand signed the emancipation proclamation
Step 2: “Bag of Words Assumption”

Assumption: Discard Word Order

Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation. → five score years ago a great american in whose symbolic shadow we stand signed the emancipation proclamation

Unigrams:

Unigram   Count
the       101
of        96
to        57
and       44
a         36
be        31
will      26
that      24
is        21
freedom   19
in        19

Bigrams:

Bigram           Count
five score       1
score years      1
years ago        1
ago a            1
a great          1
great american   1
american in      1
in whose         1
whose symbolic   1
symbolic shadow  1
shadow we        1
Step 2: “Bag of Words Assumption”

# activate virtual environment
source activate python-UI

# install nltk
python -m pip install nltk

# download stop words
nltk.download('stopwords')

# download wordnet
nltk.download('wordnet')
Step 2: “Bag of Words Assumption”

# import modules
import requests
import re
from bs4 import BeautifulSoup
from nltk import FreqDist
from nltk import word_tokenize
from nltk import bigrams
from nltk import trigrams
Step 2: “Bag of Words Assumption”

# url
url = 'http://avalon.law.yale.edu/20th_century/mlk01.asp'

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, 'html.parser')

# get text
speech = ""
lines = soup.find_all('p')
for line in lines:
    speech += line.get_text()

# remove punctuation
speech = re.sub('\W', ' ', speech.lower())
Step 2: “Bag of Words Assumption”

# get unigrams
words = word_tokenize(speech.lower())
frequency = FreqDist(words)
frequency.most_common(10)

# get bigrams
list(bigrams(words))

# get trigrams
list(trigrams(words))
How Could This Possibly Work?
Three answers:
1) It might not: validation is critical (task specific)
2) Central tendency in text: words often imply what a text is about (war, civil, union) or its tone (consecrate, dead, died, lives). Such words are likely to be used repeatedly, creating a theme for an article.
3) Human supervision: inject human judgement (coders) to help methods identify subtle relationships between words and outcomes of interest
   - Dictionaries
   - Training Sets
Step 3: Discard Stopwords

Stop Words: English-language placeholding words
- the, it, if, a, able, at, be, because...
- Add "noise" to documents (without conveying much information)
- Discard stop words: focus on substantive words

Be Careful!
- she, he, her, his
- You may need to customize your stop word list (abbreviations, titles, etc.)
Step 3: Discard Stopwords

# import modules
import requests
import re
import string
from bs4 import BeautifulSoup
from nltk import FreqDist
from nltk import word_tokenize
from nltk import bigrams
from nltk import trigrams
from nltk.corpus import stopwords
Step 3: Discard Stopwords

# url
url = 'http://avalon.law.yale.edu/20th_century/mlk01.asp'

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, 'html.parser')

# get text
speech = ""
lines = soup.find_all('p')
for line in lines:
    speech += line.get_text()

# remove punctuation
speech = re.sub('\W', ' ', speech.lower())
Step 3: Discard Stopwords

# import modules
import string

# create word list
words = word_tokenize(speech.lower())

# remove stopwords
stop_words = stopwords.words('english')
clean_speech = filter(lambda x: x not in stop_words, words)
clean_speech2 = [word for word in words if word not in stop_words]
Step 4: Create an Equivalence Class of Words
Reduce dimensionality further
Comparing Stemming and Lemmatizing

Stemming algorithms:
→ Porter – most commonly used stemmer.
→ Lancaster – very aggressive; stems may not be interpretable.
→ Snowball (Porter 2) – essentially a better version of Porter.
Comparing Stemming Algorithms

# import modules
from nltk.corpus import stopwords
from nltk.stem.porter import *
from nltk.stem.lancaster import *
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
Comparing Stemming Algorithms

# sample text
'five score years ago a great american in whose symbolic shadow we stand today signed the emancipation proclamation this momentous decree came as a great beacon light of hope to millions of negro slaves who had been seared in the flames of withering injustice it came as a joyous daybreak to end the long night of their captivity'
Porter

# use porter
stemmer = PorterStemmer()
porter_text = [stemmer.stem(word) for word in sample_words]

# output
'five score year ago a great american in whose symbol shadow we stand today sign the emancip proclam thi moment decre came as a great beacon light of hope to million of negro slave who had been sear in the flame of wither injustic it came as a joyou daybreak to end the long night of their captiv'
Lancaster

# use lancaster
stemmer = LancasterStemmer()
lancaster_text = [stemmer.stem(word) for word in sample_words]

# output
'fiv scor year ago a gre am in whos symbol shadow we stand today sign the emancip proclam thi mom decr cam as a gre beacon light of hop to mil of negro slav who had been sear in the flam of with injust it cam as a joy daybreak to end the long night of their capt'
Snowball

# use snowball
stemmer = SnowballStemmer('english')
snowball_text = [stemmer.stem(word) for word in sample_words]

# output
'five score year ago a great american in whose symbol shadow we stand today sign the emancip proclam this moment decre came as a great beacon light of hope to million of negro slave who had been sear in the flame of wither injustic it came as a joyous daybreak to end the long night of their captiv'
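
The imports above also include WordNetLemmatizer, which the deck does not demonstrate; a minimal sketch on a few sample words, for contrast with the stemmers (lemmas remain real dictionary words, and the default part of speech is noun):

# use wordnet lemmatizer (assumes nltk.download('wordnet') has been run)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
sample_words = ['years', 'signed', 'flames', 'millions', 'captivity']
lemma_text = [lemmatizer.lemmatize(word) for word in sample_words]
print(lemma_text)  # ['year', 'signed', 'flame', 'million', 'captivity']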
Finding Lower Dimensional Embeddings of Text

1) Task:
- Embed our documents in a lower dimensional space
- Visualize our documents
- Inference about similarity
- Inference about behavior

2) Supervised Learning:
- Predict the values of one or more outputs or response variables $Y = (Y_1, \ldots, Y_m)$ for a given set of input or predictor variables $X^T = (X_1, \ldots, X_p)$
- $x_i^T = (x_{i1}, \ldots, x_{ip})$ denotes the inputs for the $i$th training case and $\hat{y}_i$ is the response measure.
- The "student" presents an answer $\hat{y}_i$ for each $x_i$ in the training sample, and the supervisor or "teacher" grades the answer.
- Usually this requires some loss function $L(y, \hat{y})$, for example, $L(y, \hat{y}) = (y - \hat{y})^2$.
Finding Lower Dimensional Embeddings of Text

3) Unsupervised Learning:
- In this case one has a set of $N$ observations $(x_1, x_2, \ldots, x_N)$ of a random $p$-vector $X$ having joint density $\Pr(X)$
- The goal is to directly infer the properties of this probability density without the help of a supervisor or teacher.
- The dimension of $X$ is sometimes much higher than in supervised learning, and the properties of interest are often more complicated than simple point estimates.
k-Means

Suppose we have a dataset with two variables, x = (1, 2, 3, 4, 5) and y = (1, 2, 3, 4, 5):

1 Randomly place k centroids inside the two-dimensional space (X, Y).
2 For each point (x_i, y_i), find the nearest centroid by minimizing some distance measure.
3 Assign each point (x_i, y_i) to cluster j.
4 For each cluster j = 1 ... K, create a new centroid c_j using the average across all points x_i and y_i. (A sketch follows below.)
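
A minimal sketch of these four steps on the toy data, assuming Euclidean distance and k = 2; in practice a library implementation such as sklearn.cluster.KMeans is preferable:

# k-means on the toy data (k = 2), assuming Euclidean distance
import numpy as np

points = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], dtype=float)
rng = np.random.default_rng(0)
centroids = points[rng.choice(len(points), size=2, replace=False)]  # step 1

for _ in range(10):  # iterate steps 2-4 until (approximate) convergence
    # steps 2-3: assign each point to its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 4: move each centroid to the mean of its assigned points
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])

print(labels, centroids)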
Expectation-Maximization Algorithm
The Expectation-Maximization algorithm enables parameter estimation in probabilistic models with incomplete data.
Expectation-Maximization Algorithm

The Expectation-Maximization algorithm enables parameter estimation in probabilistic models with incomplete data, for the exponential family of distributions:
- Normal
- Exponential
- Gamma
- Chi-Squared
- Beta
- Dirichlet (Der-rick-let)
- Bernoulli
- Poisson

Each iteration does not decrease the likelihood: $P_{\theta^{t+1}}(X) \geq P_{\theta^t}(X)$
Expectation-Maximization Algorithm

The Expectation-Maximization algorithm enables parameter estimation in probabilistic models with incomplete data (for the exponential family of distributions), but it has caveats:
- Not guaranteed to give $\theta_{MLE}$
- Overfitting
- Slow
- Generally, it can't be used for non-exponential distributions.
Gaussian Mixture Model

Let's assume we have observations $x_1 \ldots x_n$:
- Each $x_i$ is drawn from one of two normal distributions.
- One of these distributions (red) has a mean of $\mu_{red}$ and a variance of $\sigma^2_{red}$.
- The other distribution (blue) has a mean of $\mu_{blue}$ and a variance of $\sigma^2_{blue}$.
- If we know the source of each observation, then estimating $\mu_{red}$, $\mu_{blue}$, $\sigma^2_{red}$, and $\sigma^2_{blue}$ is trivial.
Gaussian Mixture Model

Let's assume we have observations $x_1 \ldots x_n$ drawn from the red distribution:

$$\mu_{red} = \frac{x_1 + x_2 + \cdots + x_n}{n_{red}}$$

$$\sigma^2_{red} = \frac{(x_1 - \mu_{red})^2 + \cdots + (x_n - \mu_{red})^2}{n_{red}}$$
Gaussian Mixture Model

Let's assume we have observations $x_1 \ldots x_n$ drawn from the blue distribution:

$$\mu_{blue} = \frac{x_1 + x_2 + \cdots + x_n}{n_{blue}}$$

$$\sigma^2_{blue} = \frac{(x_1 - \mu_{blue})^2 + \cdots + (x_n - \mu_{blue})^2}{n_{blue}}$$
Gaussian Mixture Model
However, what if we do not know the source?
Gaussian Mixture Model
Why don’t we just guess?
Gaussian Mixture Model

Let's assume someone gave you the parameters $\mu_{red}$ and $\sigma^2_{red}$; what is the probability a given point, $x_i$, is from that distribution? (Normal PDF)

$$P(x_i|red) = \frac{1}{\sqrt{2\pi\sigma^2_{red}}} \exp\left(-\frac{(x_i - \mu_{red})^2}{2\sigma^2_{red}}\right) \quad (1)$$
Gaussian Mixture Model

What is the probability the parameters $\mu_{red}$ and $\sigma^2_{red}$ are correct, given point $x_i$? (Bayes' Rule)

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} \quad (2)$$

$$P(Yes|x_i) = \frac{P(x_i|Yes)P(Yes)}{P(x_i|Yes)P(Yes) + P(x_i|No)P(No)} \quad (3)$$
Gaussian Mixture Model

What is the probability the parameters $\mu_{red}$ and $\sigma^2_{red}$ are correct, given point $x_i$? (Bayes' Rule)

$$P(red|x_i) = \frac{P(x_i|red)P(red)}{P(x_i|red)P(red) + P(x_i|blue)P(blue)} \quad (4)$$

$$P(blue|x_i) = 1 - P(red|x_i) \quad (5)$$
Gaussian Mixture Model

If you knew where the points came from, you could estimate $\mu$ and $\sigma^2$ easily. Unfortunately, you do not know where the points came from. If we knew $\mu_{red}$, $\sigma^2_{red}$, $\mu_{blue}$, and $\sigma^2_{blue}$, we could figure out which distribution the points came from.

EM Algorithm
- Start with two randomly placed normal distributions ($\mu_{red}$, $\sigma^2_{red}$) and ($\mu_{blue}$, $\sigma^2_{blue}$).
- For each $x_i$, determine $P(red|x_i)$ = the probability that the point was drawn from the red distribution (E-STEP).
- This is a soft assignment, meaning that each $x_i$ will have two probabilities: $P(red|x_i)$ and $P(blue|x_i)$.
- Once this is done, re-estimate ($\mu_{red}$, $\sigma^2_{red}$) and ($\mu_{blue}$, $\sigma^2_{blue}$), given what we learned (M-STEP).
- Iterate until convergence. (A code sketch follows below.)
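
A minimal sketch of this E-STEP/M-STEP loop for a two-component one-dimensional mixture, using the ten points from the worked example that follows; equal priors P(red) = P(blue) = 0.5 and the starting values are assumptions for illustration:

# EM for a two-component 1-D Gaussian mixture (equal priors assumed)
import numpy as np
from scipy.stats import norm

x = np.array([5, 10, 15, 20, 30, 39, 51, 54, 58, 70], dtype=float)
mu = np.array([10.0, 40.0])   # initial guesses for the (red, blue) means
var = np.array([9.0, 9.0])    # initial guesses for the variances

for _ in range(50):
    # E-step: soft assignment P(red|x_i), P(blue|x_i) via Bayes' rule
    density = norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))  # P(x_i|color)
    resp = density / density.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means and variances
    weights = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / weights
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / weights

print(mu, var)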
EM Example

To use the EM algorithm, we have to answer several questions:

1 How Likely is Each of the Points to Come From the Red Distribution?

$$P(x_i|red) = \frac{1}{\sqrt{2\pi\sigma^2_{red}}} \exp\left(-\frac{(x_i - \mu_{red})^2}{2\sigma^2_{red}}\right)$$
EM Example

1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)?

x_i   P(x_i|red)
5     0.03316
10    0.13298
15    0.03316
20    0.00051
30    0
39    0
51    0
54    0
58    0
70    0
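
These table values can be checked directly with scipy's normal density (scale is the standard deviation, here √9 = 3):

# verify the P(x_i|red) column under mu_red = 10, variance 9
from scipy.stats import norm

for x in [5, 10, 15, 20, 30, 39, 51, 54, 58, 70]:
    print(x, round(norm.pdf(x, loc=10, scale=3), 5))
# 5 -> 0.03316, 10 -> 0.13298, 15 -> 0.03316, 20 -> 0.00051, the rest round to 0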
EM Example

1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)?
2 How Likely is it the Red Distribution is specified correctly, given $x_i$?

$$P(red_i|x_i) = \frac{P(x_i|red)P(red)}{P(x_i|red)P(red) + P(x_i|blue)P(blue)}$$

$$P(red_i|x_i) = \frac{P(x_i|red) \cdot 0.50}{P(x_i|red) \cdot 0.50 + P(x_i|blue) \cdot 0.50}$$
EM Example

2 How Likely is it the Red Distribution is specified correctly, given $x_i$?

x_i   P(red_i|x_i)
5     0.01658
10    0.06649
15    0.01658
20    0.00026
30    0
39    0
51    0
54    0
58    0
70    0
EM Example

1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)?
2 How Likely is it the Red Distribution is specified correctly, given $x_i$?
3 How Likely is it the Blue Distribution is specified correctly, given $x_i$?

$$P(blue_i|x_i) = 1 - P(red_i|x_i)$$
EM Example

3 How Likely is it the Blue Distribution is specified correctly, given $x_i$?

x_i   P(blue_i|x_i)
5     0.98342
10    0.93351
15    0.98342
20    0.99974
30    1
39    1
51    1
54    1
58    1
70    1
EM Example

1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)?
2 How Likely is it the Red Distribution is specified correctly, given $x_i$?
3 How Likely is it the Blue Distribution is specified correctly, given $x_i$?
4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)?

$$\mu_{red} = \frac{red_1 x_1 + red_2 x_2 + \cdots + red_n x_n}{red_1 + red_2 + \cdots + red_n}$$

$$\sigma^2_{red} = \frac{red_1 (x_1 - \mu_{red})^2 + \cdots + red_n (x_n - \mu_{red})^2}{red_1 + red_2 + \cdots + red_n}$$
EM Example

4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)?

$$\hat{\mu}_{red} = 10.02573 \qquad \hat{\sigma}^2_{red} = 8.554147$$
EM Example

5 Given what we know, what is the likely blue mean ($\mu_{blue}$) and variance ($\sigma^2_{blue}$)?

$$\mu_{blue} = \frac{blue_1 x_1 + blue_2 x_2 + \cdots + blue_n x_n}{blue_1 + blue_2 + \cdots + blue_n}$$

$$\sigma^2_{blue} = \frac{blue_1 (x_1 - \mu_{blue})^2 + \cdots + blue_n (x_n - \mu_{blue})^2}{blue_1 + blue_2 + \cdots + blue_n}$$
EM Example

5 Given what we know, what is the likely blue mean ($\mu_{blue}$) and variance ($\sigma^2_{blue}$)?

$$\mu_{blue} = 35.45405 \qquad \sigma^2_{blue} = 454.2171$$
Topic and Mixed Membership Models (Grimmer)

[Diagram: documents Doc 1, Doc 2, Doc 3, ..., Doc N on the left, linked to Cluster 1, Cluster 2, ..., Cluster K on the right]
Latent Dirichlet Allocation (LDA)

Suppose we have the following sentences:
- Donald Trump is running for president.
- Donald Trump debated last night.
- Hillary Clinton is running for president.
- Hillary Clinton debated last night.
- Hillary Clinton debated better than Donald Trump last night.
Latent Dirichlet Allocation (LDA)

Suppose we have the following sentences:
- Donald Trump is running for president. (50% Topic A, 50% Topic C)
- Donald Trump debated last night. (50% Topic A, 50% Topic D)
- Hillary Clinton is running for president. (50% Topic B, 50% Topic C)
- Hillary Clinton debated last night. (50% Topic B, 50% Topic D)
- Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)
Latent Dirichlet Allocation (LDA)

LDA represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article:

1 How many N words will your article have? (Let's assume this follows a Poisson distribution.)
2 What is the mixture of K topics? (Let's assume this follows a Dirichlet distribution.)
3 For each word in your article: first, pick a topic from the mixture outlined above; second, given the topic you have selected, choose the word that appears the most. (A code sketch follows below.)
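
A minimal sketch of fitting a two-topic LDA model to the five toy sentences with scikit-learn; the learned topic labels and mixtures will differ from the stylized percentages above:

# fit a two-topic LDA model on the toy sentences
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['donald trump is running for president',
        'donald trump debated last night',
        'hillary clinton is running for president',
        'hillary clinton debated last night',
        'hillary clinton debated better than donald trump last night']

X = CountVectorizer().fit_transform(docs)            # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))   # per-document topic mixtures (rows sum to 1)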
Latent Dirichlet Allocation (LDA)

LDA represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article:

1.) Let's assume your article will have 5 words.
2.) Let's assume our topic distribution will be 75% Topic B and 25% Topic A.
3.) Let's assume "Hillary Clinton," "president," and "win" appear in 75%, 50%, and 25% of the documents that are included in Topic B, respectively.
4.) Let's assume "Donald Trump" and "president" appear in 75% and 50% of the documents that are included in Topic A, respectively.
Latent Dirichlet Allocation (LDA)

LDA represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article:

4.) Let's assume "Donald Trump" and "president" appear in 75% and 50% of the documents that are included in Topic A, respectively.
- 1st word comes from Topic B, which then gives you "Hillary Clinton."
- 2nd word comes from Topic A, which then gives you "Donald Trump."
- 3rd word comes from Topic A, which then gives you "president."
- 4th word comes from Topic B, which then gives you "president."
- 5th word comes from Topic B, which then gives you "win."

"Hillary Clinton Donald Trump president president win."
Harmonic Mean

Wallach et al. (2009) suggest the harmonic mean could be considered a goodness-of-fit test for the LDA model:

$$\frac{1}{M} \sum_{m=1}^{M} \left( \frac{1}{K} \sum_{k=1}^{K} \theta_{m,k} \right)^{-1} \quad (6)$$

where $\theta_{m,k}$ is the topic distribution for document $m$ and topic $k$.

- The degree of differentiation of a distribution $\theta$ over all topics is theoretically captured by the harmonic mean.
- If the topic distribution has a high probability for only a few topics, then it would have a lower harmonic mean value → the model has a better ability to separate documents into different topics.
Perplexity

Heinrich (2005) suggests perplexity as a way to prevent overfitting an LDA model. Perplexity is equivalent to the geometric mean per-word likelihood:

$$\text{Perplexity}(w) = \exp\left\{ -\frac{\log(p(w))}{\sum_{d=1}^{D} \sum_{j=1}^{V} n_{jd}} \right\} \quad (7)$$

where $n_{jd}$ denotes how often the $j$th term occurred in the $d$th document.

- Perplexity is essentially the reciprocal geometric mean of the likelihood of testing data given the trained model M.
- Therefore, a lower perplexity value indicates that the model fits the testing data better.
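
A small numeric illustration of equation (7); the held-out log-likelihood and token count here are hypothetical values:

# perplexity from a held-out log-likelihood and total token count
import numpy as np

def perplexity(log_p_w, n_tokens):
    # exp of the negative per-token log-likelihood
    return np.exp(-log_p_w / n_tokens)

print(perplexity(log_p_w=-2302.0, n_tokens=500))  # exp(4.604), about 99.9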
Entropy

The entropy measure can also be used to indicate how the topic distributions differ across LDA models. Higher values indicate that the topic distributions are more evenly spread over the topics.

sapply(fitted_models, function(x) mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))
Entropy

The entropy measure can also be used to indicate how the topic distributions differ across LDA models. Higher values indicate that the topic distributions are more evenly spread over the topics.

entropy <- NULL
for (i in 1:length(fitted_models)) {
  temp_topics <- posterior(fitted_models[[i]])$topics
  temp_sums <- NULL
  for (j in 1:NROW(temp_topics)) {
    temp_sums <- c(temp_sums, sum(temp_topics[j, ] * log(temp_topics[j, ])))
  }
  entropy <- c(entropy, mean(-temp_sums))
}
Applying LDA to NYT Articles
R!
Cluster Quality (Grimmer and King 2011)

Assessing cluster quality with experiments:
- Goal: group together similar documents
- Who knows if the similarity measure corresponds with semantic similarity?
- Inject human judgement on pairs of documents

Design to assess cluster quality (see the sketch below):
- Sample pairs of documents
- Scale: (1) unrelated, (2) loosely related, (3) closely related
- Cluster Quality = mean(within cluster) - mean(between clusters)
- Select clustering with highest cluster quality
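
A minimal sketch of the cluster-quality calculation; the cluster labels and human pair ratings below are hypothetical:

# cluster quality = mean(within-cluster ratings) - mean(between-cluster ratings)
import numpy as np

labels = {1: 'A', 2: 'A', 3: 'B', 4: 'B'}               # document -> cluster
ratings = {(1, 2): 3, (3, 4): 2, (1, 3): 1, (2, 4): 2}  # sampled pair -> rating (1-3)

within = [r for (i, j), r in ratings.items() if labels[i] == labels[j]]
between = [r for (i, j), r in ratings.items() if labels[i] != labels[j]]

cluster_quality = np.mean(within) - np.mean(between)
print(cluster_quality)  # 2.5 - 1.5 = 1.0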
Additional Resources
Webscraping – http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html
Machine Learning – The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al. 2009
Text Analysis – https://aeshin.org/textmining/
Questions?