NYU
Bag-of-Words Methods for Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman
Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
– document retrieval
– opinion mining
– association mining
• see how far we can get with document-level bag-of-words models
– and introduce some of our mathematical approaches
• document retrieval
• opinion mining
• association mining
Information Retrieval
• Task: given query = list of keywords, identify
and rank relevant documents from collection
• Basic idea:
find documents whose set of words most
closely matches words in query
Topic Vector
• Suppose the document collection has n distinct words,
w1, …, wn
• Each document is characterized by an n-dimensional
vector whose ith component is the frequency of word
wi in the document
Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [The, chased, dog, cat, mouse] (n = 5)
• V1 = [ 2 , 1 , 0 , 1 , 1 ]
• V2 = [ 2 , 1 , 1 , 1 , 0 ]
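As a minimal sketch of how these term-frequency vectors could be computed (the tokenization, lowercasing, and vocabulary ordering here are my own assumptions, not prescribed by the slides):

```python
# Build term-frequency vectors for the two example documents.
# Tokenization (lowercasing, stripping punctuation) is an assumption.
from collections import Counter

def tokenize(text):
    return [w.strip(".,").lower() for w in text.split()]

docs = ["The cat chased the mouse.", "The dog chased the cat."]
vocab = ["the", "chased", "dog", "cat", "mouse"]   # W from the example

counts = [Counter(tokenize(d)) for d in docs]
vectors = [[c[w] for w in vocab] for c in counts]

print(vectors[0])   # [2, 1, 0, 1, 1]  = V1
print(vectors[1])   # [2, 1, 1, 1, 0]  = V2
```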
Weighting the components
• Unusual words like “elephant” determine the topic much more than common words such as “the” or “have”
• can ignore words on a stop list, or
• weight each term frequency tf_i by its inverse document frequency idf_i:
    idf_i = log( N / n_i )
  where N = size of collection and n_i = number of documents containing term i
  so each component becomes w_i = tf_i × idf_i
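A small sketch of this weighting in code (natural log is assumed here; the log base and the exact tf-idf variant differ across systems):

```python
# tf-idf weighting as defined above: w_i = tf_i * idf_i, with idf_i = log(N / n_i).
import math

def idf(N, n_i):
    """N = documents in the collection, n_i = documents containing term i."""
    return math.log(N / n_i)

def weight(tf_i, N, n_i):
    return tf_i * idf(N, n_i)

# e.g. a term occurring 3 times in a document and in 10 of 1000 documents:
print(weight(3, 1000, 10))   # 3 * log(100) ≈ 13.8
```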
Cosine similarity metric
Define a similarity metric between topic vectors.
A common choice is cosine similarity (a normalized dot product):

    sim(A, B) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )
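In code, the same metric could look like this (pure Python for clarity; in practice a library such as numpy or scikit-learn would normally be used):

```python
# Cosine similarity between two term vectors, following the formula above.
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [2, 1, 0, 1, 1]   # V1 from the earlier example
v2 = [2, 1, 1, 1, 0]   # V2
print(cosine_sim(v1, v2))   # 6 / 7 ≈ 0.857
```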
Cosine similarity metric
• the cosine similarity metric is the cosine of the
angle between the term vectors:
[figure: two term vectors plotted in a two-dimensional space with axes w1 and w2, showing the angle between them]
Verdict: a success
• For heterogeneous text collections, the vector
space model, tf-idf weighting, and cosine
similarity have been the basis for successful
document retrieval for over 50 years
• document retrieval
• opinion mining
• association mining
Opinion Mining
– Task: judge whether a document expresses a positive
or negative opinion (or no opinion) about an object or
topic
• classification task
• valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words
• make lists of positive and negative words
• see which predominate in a given document
(and mark as ‘no opinion’ if there are few words of either type; see the sketch below)
• problem: hard to make such lists
– lists will differ for different topics
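A minimal sketch of that word-list strategy; the word lists and the ‘no opinion’ threshold are invented for illustration:

```python
# Count positive vs. negative words; the lists here are toy examples.
POSITIVE = {"great", "excellent", "good", "wonderful"}
NEGATIVE = {"bad", "poor", "terrible", "awful"}

def classify(text, min_hits=1):
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg < min_hits or pos == neg:
        return "no opinion"
    return "positive" if pos > neg else "negative"

print(classify("a great camera with excellent battery life"))   # positive
print(classify("the manual describes the settings"))            # no opinion
```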
Training a Model
– Instead, label the documents in a corpus and then
train a classifier from the corpus
• labeling is easier than thinking up words
• in effect, learn positive and negative words from the
corpus
– probabilistic classifier
• identify most likely class
s = argmax_{t ∈ {pos, neg}} P( t | W )
Using Bayes’ Rule
argmax_t P(t | W) = argmax_t P(W | t) P(t) / P(W)
                  = argmax_t P(W | t) P(t)
                  = argmax_t P(w1, ..., wn | t) P(t)
                  = argmax_t P(t) Π_i P(wi | t)
• The last step is based on the naïve assumption
of independence of the word probabilities
Training
• We now estimate these probabilities from the
training corpus (of N documents) using maximum
likelihood estimators
• P ( t ) = count ( docs labeled t) / N
• P ( wi | t ) = count ( docs labeled t containing wi ) / count ( docs labeled t )
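A sketch of these maximum-likelihood estimates and the argmax classification, on a tiny invented corpus (each document represented as the set of words it contains); it also shows how an unseen word forces a zero, which the next slides address:

```python
# Naive Bayes with maximum-likelihood estimates over labeled documents.
from collections import defaultdict

corpus = [                       # toy training data, invented for illustration
    ("pos", {"great", "plot", "acting"}),
    ("pos", {"great", "fun"}),
    ("neg", {"boring", "plot"}),
]
N = len(corpus)

doc_count = defaultdict(int)                          # count(docs labeled t)
word_count = defaultdict(lambda: defaultdict(int))    # count(docs labeled t containing w)
for t, words in corpus:
    doc_count[t] += 1
    for w in words:
        word_count[t][w] += 1

def p_t(t):
    return doc_count[t] / N

def p_w_given_t(w, t):
    return word_count[t][w] / doc_count[t]

def classify(words):
    scores = {}
    for t in doc_count:
        score = p_t(t)
        for w in words:
            score *= p_w_given_t(w, t)      # 0 if w was never seen with label t
        scores[t] = score
    return max(scores, key=scores.get)

print(classify({"great", "plot"}))   # 'pos'
```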
A Problem
• Suppose a glowing review GR (with lots of
positive words) includes one word,
“mathematical”, previously seen only in
negative reviews
• P ( positive | GR ) = ?
P ( positive | GR ) = 0
because P ( “mathematical” | positive ) = 0, which makes the entire product of word probabilities zero
The maximum likelihood estimate is poor when
there is very little data
We need to ‘smooth’ the probabilities to avoid
this problem
Laplace Smoothing
• A simple remedy is to add 1 to each count
– to keep them as probabilities, we also increase the denominator N by the
number of outcomes (values of t), which is 2 for ‘positive’ and ‘negative’:
    P( t ) = ( count ( docs labeled t ) + 1 ) / ( N + 2 )
– for the conditional probabilities P( w | t ) there are similarly two
outcomes (w is present or absent)
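A sketch of the add-one estimates just described (the counts are passed in as plain numbers so the snippet stands alone):

```python
# Laplace ("add one") smoothing of the earlier estimates.
def p_t_smoothed(count_t, N, num_classes=2):
    # add 1 to the count; add the number of outcomes of t (2: pos/neg) to N
    return (count_t + 1) / (N + num_classes)

def p_w_given_t_smoothed(count_w_and_t, count_t):
    # 2 outcomes for each word: present or absent in the document
    return (count_w_and_t + 1) / (count_t + 2)

# "mathematical" never seen in, say, 10 positive training reviews:
print(p_w_given_t_smoothed(0, 10))   # 1/12 ≈ 0.083, no longer 0
```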
An NLTK Demo
Sentiment Analysis with Python NLTK Text
Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/textclassification-sentiment-analysis-naive-bayes-classifier
• Is “low” a positive or a negative term?
Ambiguous terms
“low” can be positive
“low price”
or negative
“low quality”
Verdict: mixed
• A simple bag-of-words strategy with an NB
model works quite well for simple reviews
referring to a single item, but fails
– for ambiguous terms
– for comparative reviews
– to reveal aspects of an opinion
• the car looked great and handled well,
but the wheels kept falling off
• document retrieval
• opinion mining
• association mining
Association Mining
• Goal: find interesting relationships among attributes
of an object in a large collection …
objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
– widely used in scientific and medical literature
Bag-of-words
• Simplest approach
• look for words A and B for which
frequency( A and B in same document ) >> frequency( A ) × frequency( B )
(see the sketch below)
• doesn’t work well
• want to find names (of companies, products, genes),
not individual words
• interested in specific types of terms
• want to learn from a few examples
– need contexts to avoid noise
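A rough sketch of that frequency test over documents treated as sets of words; the toy corpus and the lift-style ratio are my own illustration, not from the lecture:

```python
# Compare observed co-occurrence of A and B with what independence predicts.
from collections import Counter
from itertools import combinations

docs = [                                   # toy document collection
    {"aspirin", "headache", "dose"},
    {"aspirin", "headache", "trial"},
    {"gene", "expression", "trial"},
]
N = len(docs)

word_freq = Counter(w for d in docs for w in d)
pair_freq = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))

def lift(a, b):
    # observed P(A and B in same document) / (P(A) * P(B))
    p_ab = pair_freq[frozenset((a, b))] / N
    return p_ab / ((word_freq[a] / N) * (word_freq[b] / N))

print(lift("aspirin", "headache"))   # 1.5  (well above 1: associated)
print(lift("aspirin", "gene"))       # 0.0  (never co-occur)
```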
Needed Ingredients
• Effective text association mining needs
– name recognition
– term classification
– preferably: ability to learn patterns
– preferably: a good GUI
– demo: www.coremine.com
Conclusion
• Some tasks can be handled effectively (and
very simply) by bag-of-words models,
• but most benefit from an analysis of language
structure