Vector Space Text Classification
Adapted from Lectures by
Raymond Mooney and Barbara Rosario
Text Classification
Today:
Introduction to Text Classification
K Nearest Neighbors
Decision boundaries
Vector space classification using centroids
Decision Trees (briefly)
Later
More text classification
Support Vector Machines
Text-specific issues in classification
Recap: Naïve Bayes classifiers
Classify based on prior weight of class and
conditional parameter for what each word says:
\[
c_{NB} = \operatorname*{argmax}_{c_j \in C} \Big[\, \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \,\Big]
\]
Training is done by counting and dividing:
\[
P(c_j) = \frac{N_{c_j}}{N}
\qquad
P(x_k \mid c_j) = \frac{T_{c_j x_k}}{\sum_{x_i \in V} T_{c_j x_i}}
\]
(Don't forget to smooth.)
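As a concrete (hypothetical) sketch of this counting-and-dividing training and the argmax rule above – with add-one smoothing over tokenized documents; all names are illustrative, not from the lecture:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate P(c) and P(x|c) by counting and dividing, with add-one smoothing."""
    vocab = {w for d in docs for w in d}
    prior = Counter(labels)                      # N_c
    n_docs = len(docs)
    term_counts = defaultdict(Counter)           # T_{c,w}
    for d, c in zip(docs, labels):
        term_counts[c].update(d)
    cond = {}
    for c in prior:
        total = sum(term_counts[c].values()) + len(vocab)   # denominator with smoothing
        cond[c] = {w: (term_counts[c][w] + 1) / total for w in vocab}
    log_prior = {c: math.log(prior[c] / n_docs) for c in prior}
    return log_prior, cond

def classify_nb(doc, log_prior, cond):
    """Return argmax_c [ log P(c) + sum_i log P(x_i|c) ], skipping unseen words."""
    def score(c):
        return log_prior[c] + sum(math.log(cond[c][w]) for w in doc if w in cond[c])
    return max(log_prior, key=score)
```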
Recall: Vector Space Representation
Each document is a vector, one component for
each term (= word).
Normally normalize vector to unit length.
High-dimensional vector space:
Terms are axes
10,000+ dimensions, or even 100,000+
Docs are vectors in this space
How can we do classification in this space?
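For illustration only (not part of the lecture), a document could be held as a sparse term-weight dictionary and normalized to unit length like this:

```python
import math
from collections import Counter

def doc_vector(tokens):
    """Sparse term-frequency vector, normalized to unit (L2) length."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values())) or 1.0
    return {term: count / norm for term, count in tf.items()}

# e.g. doc_vector("the cat sat on the mat".split())
```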
14.1
Classification Using Vector Spaces
As before, the training set is a set of documents,
each labeled with its class (e.g., topic)
In vector space classification, this set corresponds
to a labeled set of points (or, equivalently, vectors)
in the vector space
Premise 1: Documents in the same class form a
contiguous region of space
Premise 2: Documents from different classes don’t
overlap (much)
We define surfaces to delineate classes in the
space
Documents / Classes in a Vector Space
[Figure: documents plotted as points in the vector space, grouped into three class regions: Government, Science, Arts]
Test Document of what class?
[Figure: the Government, Science, and Arts regions, with an unlabeled test document plotted among them]
Test Document = Government
Is this similarity hypothesis true in general?
[Figure: the test document falls inside the Government region, so it is labeled Government]
Our main topic today is how to find good separators
k Nearest Neighbor Classification
Aside: 2D/3D graphs can be misleading
k Nearest Neighbor Classification
kNN = k Nearest Neighbor
To classify document d into class c:
Define k-neighborhood N as k nearest neighbors of d
Count number of documents ic in N that belong to c
Estimate P(c|d) as ic/k
Choose as class argmax_c P(c|d)  [= the majority class in N]
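A minimal sketch of this kNN rule, assuming documents are unit-length sparse vectors as above (function names are illustrative):

```python
from collections import Counter

def cosine(u, v):
    """Dot product of sparse vectors; equals cosine similarity if both are unit length."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_classify(d, training_set, k=3):
    """training_set: list of (vector, class). Returns (majority class, estimated P(c|d))."""
    neighbors = sorted(training_set, key=lambda ex: cosine(d, ex[0]), reverse=True)[:k]
    votes = Counter(c for _, c in neighbors)
    best_class, count = votes.most_common(1)[0]
    return best_class, count / k        # P(c|d) estimated as i_c / k
```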
14.3
Example: k=6 (6NN)
P(science | d) = ? for the test document d marked in the figure
[Figure: the test document with its 6 nearest neighbors among the Government, Science, and Arts points]
Nearest-Neighbor Learning
Algorithm
Learning is just storing the representations of the training examples in D.
Testing instance x:
Compute similarity between x and all examples in D.
Assign x the category of the most similar example in D.
Does not explicitly compute a generalization, category prototypes, or class boundaries.
Also called: case-based learning, memory-based learning, lazy learning.
Rationale of kNN: the contiguity hypothesis.
kNN: Value of k
Using only the closest example (k=1) to
determine the category is subject to errors due to:
A single atypical example.
Noise in the category label of a single training
example.
A more robust alternative is to find the k (> 1) most similar examples and return the majority category of these k examples.
Value of k is typically odd to avoid ties; 3 and 5
are most common.
Probabilistic kNN
What are the 1NN and 3NN classification decisions for the star in the figure?
Exercise
How is star classified by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
kNN decision boundaries
Boundaries are, in principle, arbitrary surfaces, but are usually polyhedra.
[Figure: kNN decision boundaries between the Government, Science, and Arts regions]
kNN gives locally defined decision boundaries between classes – far-away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)
kNN is not a linear classifier
Classification decision
based on majority of k
nearest neighbors.
The decision boundaries between classes are piecewise linear . . .
. . . but kNN is in general not a linear classifier that can be described by a single hyperplane \( \vec{w}^{\,T}\vec{x} = b \).
Similarity Metrics
Nearest neighbor method depends on a
similarity (or distance) metric.
Simplest for continuous m-dimensional
instance space is Euclidean distance.
Simplest for m-dimensional binary instance
space is Hamming distance (number of
feature values that differ).
For text, cosine similarity of tf.idf weighted
vectors is typically most effective.
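One plausible coding of the three metrics just listed (a sketch, not a library API):

```python
import math

def euclidean(u, v):
    """Distance between two equal-length continuous vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hamming(u, v):
    """Number of positions at which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def cosine_sim(u, v):
    """Cosine similarity of two dense vectors (typically tf.idf weighted for text)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```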
Illustration of 3 Nearest Neighbor
for Text Vector Space
Nearest Neighbor with Inverted Index
Naively finding nearest neighbors requires a linear search through the |D| documents in the collection.
But determining k nearest neighbors is the same
as determining the k best retrievals using the test
document as a query to a database of training
documents.
Heuristics: Use standard vector space inverted
index methods to find the k nearest neighbors.
Testing time: O(B|Vt|), where B is the average number of training documents in which a test-document word appears.
Typically B << |D|
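A rough sketch of this heuristic: build postings once, then accumulate scores only over training documents that share at least one term with the test document (all names are illustrative):

```python
from collections import defaultdict, Counter

def build_index(training_set):
    """training_set: list of (sparse_unit_vector, class). Postings map term -> doc ids."""
    index = defaultdict(list)
    for doc_id, (vec, _) in enumerate(training_set):
        for term in vec:
            index[term].append(doc_id)
    return index

def knn_with_index(d, training_set, index, k=3):
    """Accumulate cosine scores only over postings of the test document's terms."""
    scores = defaultdict(float)
    for term, weight in d.items():
        for doc_id in index.get(term, ()):
            scores[doc_id] += weight * training_set[doc_id][0][term]
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    # assumes the test doc shares at least one term with some training doc
    votes = Counter(training_set[doc_id][1] for doc_id in top)
    return votes.most_common(1)[0][0]
```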
kNN: Discussion
No feature selection necessary
Scales well with large number of classes
(don't need to train n classifiers for n classes)
No training necessary
(actually, some data editing may be necessary)
Classes can influence each other
(small changes to one class can have a ripple effect)
Scores can be hard to convert to probabilities
May be more expensive at test time
kNN vs. Naive Bayes : Bias/Variance tradeoff
[Bias : Variance] :: lack of [expressiveness : robustness]
kNN has high variance and low bias (infinite memory).
NB has low variance and high bias (the decision surface has to be linear – a hyperplane; see later).
Consider asking a botanist: Is an object a tree?
Too much capacity/variance, low bias: a botanist who memorizes – will always say "no" to a new object (e.g., a different # of leaves).
Not enough capacity/variance, high bias: a lazy botanist – says "yes" if the object is green.
You want the middle ground.
(Example due to C. Burges)
Bias vs. variance:
Choosing the correct model capacity
14.6
Decision Boundaries Classification
Binary Classification : Linear Classifier
Consider 2-class problems:
Deciding between two classes, perhaps,
government and non-government
[one-versus-rest classification]
How do we define (and find) the separating
surface?
How do we test which region a test doc is in?
Separation by Hyperplanes
A strong high-bias assumption is linear separability:
in 2 dimensions, can separate classes by a line
in higher dimensions, need hyperplanes
Can find separating hyperplane by linear programming
(or can iteratively fit a solution via the perceptron)
The separator can be expressed as ax + by = c
Linear programming / Perceptron
Find a, b, c such that
ax + by ≥ c for red points
ax + by < c for green points.
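A minimal perceptron sketch for this 2-D setting, nudging a, b, c after each misclassified point; it assumes the data are in fact linearly separable, and the names are illustrative:

```python
def train_perceptron(red_points, green_points, epochs=100, lr=0.1):
    """Find a, b, c with a*x + b*y >= c for red and < c for green (if separable)."""
    a = b = c = 0.0
    labeled = [((x, y), 1) for x, y in red_points] + [((x, y), -1) for x, y in green_points]
    for _ in range(epochs):
        errors = 0
        for (x, y), t in labeled:
            pred = 1 if a * x + b * y - c >= 0 else -1
            if pred != t:                       # misclassified: nudge the separator
                a += lr * t * x
                b += lr * t * y
                c -= lr * t
                errors += 1
        if errors == 0:                         # all points on the correct side
            break
    return a, b, c

# e.g. AND: red = {(1, 1)}, green = {(0, 0), (0, 1), (1, 0)}
# train_perceptron({(1, 1)}, {(0, 0), (0, 1), (1, 0)})
```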
AND (linearly separable)

x + y ≥ 1.5 for the red (c = 1) points; x + y < 1.5 for the green (c = 0) points.

  x  y  c
  1  1  1
  1  0  0
  0  1  0
  0  0  0

Separating line: x + y − 1.5 = 0

[Figure: the four points in the unit square, with the line x + y = 1.5 separating (1,1) from the rest]
OR (linearly separable)

x + y ≥ 0.5 for the red (c = 1) points; x + y < 0.5 for the green (c = 0) points.

  x  y  c
  1  1  1
  1  0  1
  0  1  1
  0  0  0

Separating line: x + y − 0.5 = 0

[Figure: the four points in the unit square, with the line x + y = 0.5 separating (0,0) from the rest]
The planar decision surface
in data-space for the simple
linear discriminant function:
\[
\mathbf{w}^{T}\mathbf{x} + w_0 = 0
\]
XOR (linearly inseparable)

  x  y  c
  1  1  0
  1  0  1
  0  1  1
  0  0  0

No single line can separate the c = 1 points from the c = 0 points.
Which Hyperplane?
In general, lots of possible
solutions for a,b,c.
Which Hyperplane?
Lots of possible solutions for a,b,c.
Some methods find a separating
hyperplane, but not the optimal one
[according to some criterion of
expected goodness]
E.g., perceptron
Which points should influence optimality?
All points: linear regression, Naïve Bayes
Only "difficult" points close to the decision boundary: support vector machines
Linear Classifiers
Many common text classifiers are linear
classifiers
Naïve Bayes
Perceptron
Rocchio
Logistic regression
Support vector machines (with linear
kernel)
Linear regression
(Simple) perceptron neural networks
Linear Classifiers
Despite this similarity, noticeable
performance differences
For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
What to do for non-separable problems?
Different training methods pick different
hyperplanes
Classifiers more powerful than linear often
don’t perform better. Why?
Naive Bayes is a linear classifier
Two-class Naive Bayes. We compute:
\[
\log \frac{P(C \mid d)}{P(\bar{C} \mid d)}
= \log \frac{P(C)}{P(\bar{C})}
+ \sum_{w \in d} \log \frac{P(w \mid C)}{P(w \mid \bar{C})}
\]
Decide class C if the odds ratio is greater than 1, i.e., if the log odds is greater than 0.
So the decision boundary is the hyperplane
\[
\beta + \sum_{w \in V} \alpha_w n_w = 0,
\quad \text{where} \quad
\beta = \log \frac{P(C)}{P(\bar{C})}, \quad
\alpha_w = \log \frac{P(w \mid C)}{P(w \mid \bar{C})}, \quad
n_w = \text{\# of occurrences of } w \text{ in } d
\]
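A sketch of the resulting linear rule, assuming two-class conditional probabilities P(w|C), P(w|C̄) estimated as in the earlier Naive Bayes recap (names are illustrative):

```python
import math

def nb_linear_weights(log_prior, cond, pos, neg):
    """beta = log P(C)/P(C~); alpha_w = log P(w|C)/P(w|C~) for each vocabulary word w."""
    beta = log_prior[pos] - log_prior[neg]
    alpha = {w: math.log(cond[pos][w]) - math.log(cond[neg][w]) for w in cond[pos]}
    return beta, alpha

def nb_linear_classify(doc, beta, alpha, pos, neg):
    """Decide class C iff beta + sum_w alpha_w * n_w > 0, where n_w = count of w in doc."""
    score = beta + sum(alpha.get(w, 0.0) for w in doc)   # adds alpha_w once per occurrence
    return pos if score > 0 else neg
```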
A nonlinear problem
A linear classifier like Naïve Bayes does badly on this task.
kNN will do very well (assuming enough training data).
High Dimensional Data
Pictures like the one on the right are
absolutely misleading!
Documents are zero along almost all axes
Most document pairs are very far apart
(i.e., not strictly orthogonal, but only share
very common words and a few scattered
others)
In classification terms: virtually all document sets are separable, for almost any classification.
This is partly why linear classifiers are
quite successful in this domain
More Than Two Classes
Any-of or multivalue classification
Classes are independent of each other.
A document can belong to 0, 1, or >1 classes.
Decompose into n binary problems
Quite common for documents
One-of or multinomial or polytomous
classification
Classes are mutually exclusive.
Each document belongs to exactly one class
E.g., digit recognition is polytomous classification
Digits are mutually exclusive
14.5
Set of Binary Classifiers: Any of
Build a separator between each class and its
complementary set (docs from all other classes).
Given test doc, evaluate it for membership in
each class.
Apply decision criterion of classifiers independently. Done.
Sometimes you could do better by considering dependencies between categories.
Set of Binary Classifiers: One of
Build a separator between each class and its
complementary set (docs from all other classes).
Given test doc, evaluate it for membership in
each class.
Assign document to class with:
maximum score
maximum confidence
maximum probability
Why is this different from multiclass (any-of) classification?
Classification
using Centroids
Relevance feedback: Basic idea
The user issues a (short, simple) query.
The search engine returns a set of documents.
User marks some docs as relevant, some as
nonrelevant.
Search engine computes a new representation of the
information need – should be better than the initial query.
Search engine runs new query and returns new results.
New results have (hopefully) better recall.
Rocchio illustrated
Using Rocchio for text classification
Relevance feedback methods can be adapted for text
categorization
2-class classification (Relevant vs. nonrelevant documents)
Use standard TF/IDF weighted vectors to represent
text documents
For each category, compute a prototype vector by
summing the vectors of the training documents in the
category.
Prototype = centroid of members of class
Assign test documents to the category with the
closest prototype vector based on cosine similarity.
14.2
Illustration of Rocchio Text
Categorization
Definition of centroid
\[
\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)
\]
where D_c is the set of all documents that belong to class c and \(\vec{v}(d)\) is the vector space representation of d.
Note that centroid will in general not be a unit
vector even when the inputs are unit vectors.
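A sketch of centroid computation and nearest-centroid (Rocchio) classification over sparse document vectors; the helper names are illustrative, not from the lecture:

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dict term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """mu(c) = (1/|D_c|) * sum of the document vectors in class c."""
    total = defaultdict(float)
    for v in vectors:
        for term, weight in v.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def rocchio_train(training_set):
    """training_set: list of (vector, class). One centroid/prototype per class."""
    by_class = defaultdict(list)
    for vec, c in training_set:
        by_class[c].append(vec)
    return {c: centroid(vs) for c, vs in by_class.items()}

def rocchio_classify(d, centroids):
    """Assign d to the class whose centroid is most cosine-similar."""
    return max(centroids, key=lambda c: cosine(d, centroids[c]))
```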
Rocchio Properties
Forms a simple generalization of the examples in
each class (a prototype).
Prototype vector does not need to be averaged
or otherwise normalized for length since cosine
similarity is insensitive to vector length.
Classification is based on similarity to class
prototypes.
Does not guarantee classifications are consistent
with the given training data.
Rocchio Anomaly
Prototype models have problems with
polymorphic (disjunctive) categories.
3 Nearest Neighbor Comparison
Nearest Neighbor tends to handle polymorphic
categories better.
Rocchio is a linear classifier
Two-class Rocchio as a linear
classifier
Line or hyperplane defined by: \( \vec{w}^{\,T}\vec{x} = b \)
For Rocchio, set:
\[
\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2),
\qquad
b = 0.5\,\big( |\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2 \big)
\]
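A small sketch deriving w and b from the two centroids. One hedge: the hyperplane above corresponds to picking the centroid that is nearer in Euclidean distance, whereas the earlier slides classify by cosine similarity, so this is the geometric reading of the boundary rather than a drop-in replacement:

```python
def rocchio_hyperplane(mu1, mu2):
    """w = mu(c1) - mu(c2); b = 0.5 * (|mu(c1)|^2 - |mu(c2)|^2). Sparse dicts."""
    terms = set(mu1) | set(mu2)
    w = {t: mu1.get(t, 0.0) - mu2.get(t, 0.0) for t in terms}
    b = 0.5 * (sum(v * v for v in mu1.values()) - sum(v * v for v in mu2.values()))
    return w, b

def classify_linear(x, w, b, c1="c1", c2="c2"):
    """Decide c1 iff w.x >= b; equivalent to choosing the (Euclidean) nearer centroid."""
    score = sum(weight * x.get(t, 0.0) for t, weight in w.items())
    return c1 if score >= b else c2
```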
Rocchio classification
Rocchio forms a simple representation for each
class: the centroid/prototype
Classification is based on similarity to / distance
from the prototype/centroid
It does not guarantee that classifications are
consistent with the given training data
It is little used outside text classification, but has
been used quite effectively for text classification
Again, cheap to train and test documents
Rocchio cannot handle nonconvex,
multimodal classes
Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?
[Figure: class a forms two separate clusters with the b points between them; A is the centroid of the a's, B is the centroid of the b's, and o is a test point lying between A and B.]
The point o is closer to A than to B, but o is a better fit for the b class.
a is a multimodal class with two prototypes, but in Rocchio we only have one prototype per class.
SKIP WHAT FOLLOWS
Decision Tree Classification
Tree with internal nodes labeled by terms
Branches are labeled by tests on the weight that
the term has
Leaves are labeled by categories
Classifier categorizes document by descending
tree following tests to leaf
The label of the leaf node is then assigned to the
document
Most decision trees are binary trees
(advantageous; may require extra internal nodes)
Decision trees make good use of a few high-leverage features
Decision Tree Categorization:
Example
Geometric interpretation of DT?
Decision Tree Learning
Learn a sequence of tests on features, typically
using top-down, greedy search
At each stage choose the unused feature with
highest Information Gain (feature/class MI)
Binary (yes/no) or continuous decisions
[Example tree: first branch on f1 vs. !f1, then on f7 vs. !f7; the leaves carry class probabilities P(class) = .9, .6, and .2]
Entropy Calculations
If we have a set with k different values in it, we can calculate the
entropy as follows:
\[
\text{entropy}(Set) = I(Set) = -\sum_{i=1}^{k} P(value_i)\,\log_2 P(value_i)
\]
Where P(valuei) is the probability of getting the ith value when
randomly selecting one from the set.
So, for the set R = {a,a,a,b,b,b,b,b}
\[
\text{entropy}(R) = I(R) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8}
\]
(the 3/8 term comes from the a-values, the 5/8 term from the b-values)
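A direct, illustrative transcription of this entropy formula:

```python
import math
from collections import Counter

def entropy(values):
    """I(Set) = -sum_i P(value_i) * log2 P(value_i) over the distinct values in the set."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# entropy("aaabbbbb")  ->  -(3/8)log2(3/8) - (5/8)log2(5/8)  ~= 0.9544
```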
Looking at some data
Color   Size   Shape      Edible?
Yellow  Small  Round      +
Yellow  Small  Round      -
Green   Small  Irregular  +
Green   Large  Irregular  -
Yellow  Large  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Green   Small  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Small  Irregular  +
Yellow  Large  Irregular  +
Entropy for our data set
16 instances: 9 positive, 7 negative.
\[
\text{entropy}(all\_data) = I(all\_data) = -\frac{9}{16}\log_2\frac{9}{16} - \frac{7}{16}\log_2\frac{7}{16}
\]
This equals: 0.9836
This makes sense – it’s almost a 50/50 split; so, the entropy
should be close to 1.
Visualizing Information Gain
[Figure: the 16 examples (entropy of the full set = 0.9836) split on the Size attribute into two child tables.]
Size = Small: 8 examples (6 edible, 2 not), entropy = 0.8113
Size = Large: 8 examples (3 edible, 5 not), entropy = 0.9544
Visualizing Information Gain
The data set that goes down each branch of the tree has its own entropy value. We can calculate, for each possible attribute, its expected entropy – the entropy that would remain if we branch on this attribute. You add the entropies of the two children, weighted by the proportion of examples from the parent node that ended up at that child.
expected_entropy(size) = I(size)
[Figure: the Size split of the 16 examples (parent entropy 0.9836): the Small child has 8 examples with entropy I(size=small) = 0.8113; the Large child has 8 examples with entropy I(size=large) = 0.9544.]
\[
I(size) = \frac{8}{16}(0.8113) + \frac{8}{16}(0.9544) = 0.8828
\]
G(attrib) = I(parent) – I(attrib)
We want to calculate the information gain (or entropy reduction): the reduction in 'uncertainty' from choosing 'size' as our first branch. We will represent information gain as "G."
Entropy of all data at the parent node: I(parent) = 0.9836
The children's expected entropy for the 'size' split: I(size) = 0.8828
G(size) = I(parent) – I(size) = 0.9836 – 0.8828 = 0.1008
So, we have gained 0.1008 bits of information about the dataset by choosing 'size' as the first branch of our decision tree.
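An illustrative sketch of the whole computation – entropy, expected entropy of a split, and gain – for records stored as dictionaries with an 'Edible?' label (names assumed, not from the slides):

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """I(Set) = -sum_i P(value_i) * log2 P(value_i)."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, label="Edible?"):
    """G(attr) = I(parent) - sum_v (|child_v| / |parent|) * I(child_v)."""
    parent_labels = [e[label] for e in examples]
    children = defaultdict(list)
    for e in examples:
        children[e[attribute]].append(e[label])
    expected = sum(len(ls) / len(examples) * entropy(ls) for ls in children.values())
    return entropy(parent_labels) - expected

# With the slide's 16-example fruit data, information_gain(data, "Size")
# should come out near 0.1008, matching the hand calculation above.
```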
Decision Tree Learning
Fully grown trees tend to have decision rules that are
overly specific and therefore unable to categorize
documents well
Therefore, pruning or early stopping methods for Decision
Trees are normally a standard part of classification packages
Use of small number of features is potentially bad in
text categorization, but in practice decision trees do
well for some text classification tasks
Decision trees are easily interpreted by humans –
compared to probabilistic methods like Naive Bayes
Decision Trees are normally regarded as a symbolic
machine learning algorithm, though they can be used
probabilistically
Category: “interest” – Dumais et al. (Microsoft) Decision Tree
[Figure: decision tree for the category, built from tests such as rate=1, rate.t=1, lending=0, prime=0, discount=0, pct=1, year=0, year=1]
Summary: Representation of
Text Categorization Attributes
Representations of text are usually very high
dimensional (one feature for each word)
High-bias algorithms that prevent overfitting in
high-dimensional space generally work best
For most text categorization tasks, there are
many relevant features and many irrelevant ones
Methods that combine evidence from many or all
features (e.g. naive Bayes, kNN, neural-nets)
often tend to work better than ones that try to
isolate just a few relevant features (standard
decision-tree or rule induction)*
*Although one can compensate by using many rules
Which classifier do I use for a
given text classification problem?
Is there a learning method that is optimal for all
text classification problems?
No, because of bias-variance tradeoff.
Factors to take into account:
How much training data is available?
How simple/complex is the problem? (linear vs.
nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and
robust classifier.