
CS 188: Artificial Intelligence
Fall 2006
Lecture 25: Clustering
11/30/2006
Dan Klein – UC Berkeley
Announcements
• Project 4 is up
• Pacman contest!
• Watch the web page for final review info after last class
Properties of Perceptrons
• Separability: some parameters get the training set perfectly correct
• Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
• Mistake Bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability
[Figures: a separable training set vs. a non-separable one]
Non-Linear Separators
• Data that is linearly separable (with some noise) works out great:
  [Figure: 1D data on the x-axis, separable by a threshold]
• But what are we going to do if the dataset is just too hard?
  [Figure: 1D data on the x-axis that no threshold separates]
• How about… mapping data to a higher-dimensional space:
  [Figure: the same data mapped to (x, x²), where it becomes separable]
This and next few slides adapted from Ray Mooney, UT
Non-Linear Separators
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)
Issues with Perceptrons
• Overtraining: test / held-out accuracy usually rises, then falls
  • Overtraining isn’t quite as bad as overfitting, but is similar
• Regularization: if the data isn’t separable, weights might thrash around
  • Averaging weight vectors over time can help (averaged perceptron)
• Mediocre generalization: finds a “barely” separating solution
Support Vector Machines
• Several (related) problems with the perceptron:
  • Can thrash around
  • No telling which separator you get
  • Once you make an update, can’t retract it
• SVMs address these problems:
  • Converge to globally optimal parameters
  • Good choice of separator (maximum margin)
  • Find sparse vectors (dual sparsity)
Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
  [Figure: a separating hyperplane w · x = 0, with w · x > 0 on one side and w · x < 0 on the other]
Linear Separators
• Which of the linear separators is optimal?
Classification Margin
• Distance from example xᵢ to the separator is r
• Examples closest to the hyperplane are support vectors
• Margin δ of the separator is the distance between support vectors
  [Figure: a separator with margin δ and one example at distance r]
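As a concrete aside (not from the slides): for a hyperplane w · x + b = 0, the signed distance of a point to the separator is (w · x + b) / ||w||. A minimal sketch in Python/numpy:

```python
import numpy as np

def distance_to_separator(x, w, b=0.0):
    """Signed distance from point x to the hyperplane w . z + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Example: a 2D separator and one point on each side of it.
w = np.array([1.0, 1.0])
print(distance_to_separator(np.array([2.0, 2.0]), w))    # positive side
print(distance_to_separator(np.array([-1.0, -1.0]), w))  # negative side
```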
Support Vector Machines
• Maximizing the margin: good according to intuition and PAC theory
• Only support vectors matter; other training examples are ignorable
• Support vector machines (SVMs) find the separator with max margin
• Mathematically, gives a quadratic program which calculates the alphas
• Basically, SVMs are perceptrons with smarter update counts!
Today
• Clustering
  • K-means
  • Similarity measures
  • Agglomerative clustering
• Case-based reasoning
  • K-nearest neighbors
  • Collaborative filtering
Recap: Classification
• Classification systems:
  • Supervised learning
  • Make a rational prediction given evidence
  • We’ve seen several methods for this
  • Useful when you have labeled data (or can get it)
Clustering
• Clustering systems:
  • Unsupervised learning
  • Detect patterns in unlabeled data
    • E.g. group emails or search results
    • E.g. find categories of customers
    • E.g. detect anomalous program executions
  • Useful when you don’t know what you’re looking for
  • Requires data, but no labels
  • Often get gibberish
Clustering
• Basic idea: group together similar instances
• Example: 2D point patterns
• What could “similar” mean?
  • One option: small (squared) Euclidean distance
K-Means
• An iterative clustering algorithm
  • Pick K random points as cluster centers (means)
  • Alternate:
    • Assign data instances to their closest mean
    • Assign each mean to the average of its assigned points
  • Stop when no points’ assignments change
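A minimal sketch of the algorithm just described, using numpy and squared Euclidean distance; the function name and defaults are my own:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch. points: (n, d) array; k: number of clusters."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Pick K random data points as the initial cluster centers (means).
    means = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignments = np.full(len(points), -1)
    for _ in range(iters):
        # Assign each data instance to its closest mean (squared Euclidean distance).
        dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # stop when no points' assignments change
        assignments = new_assignments
        # Move each mean to the average of its assigned points.
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return means, assignments
```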
K-Means Example
Example: K-Means
• [web demos]
  • http://www.cs.tubs.de/rob/lehre/bv/Kmeans/Kmeans.html
  • http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/
K-Means as Optimization
• Consider the total distance to the means:
  φ({a}, {c}) = Σᵢ dist(xᵢ, c_{aᵢ})   (xᵢ: points, c_k: means, aᵢ: assignments)
• Each iteration reduces φ
• Two stages each iteration:
  • Update assignments: fix means c, change assignments a
  • Update means: fix assignments a, change means c
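A small helper (my own naming) that computes this objective in the squared-Euclidean case, handy for checking that each stage really does reduce it:

```python
import numpy as np

def phi(points, means, assignments):
    """Total squared Euclidean distance from each point to its assigned mean."""
    return sum(float(((x - means[a]) ** 2).sum()) for x, a in zip(points, assignments))
```

Evaluating phi before and after each stage of the kmeans sketch above should give a non-increasing sequence.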
Phase I: Update Assignments
• For each point, re-assign it to its closest mean:
  aᵢ = argmin_k dist(xᵢ, c_k)
• Can only decrease the total distance φ!
Phase II: Update Means
• Move each mean to the average of its assigned points:
  c_k = (1 / n_k) Σ_{i : aᵢ = k} xᵢ, where n_k is the number of points assigned to cluster k
• Also can only decrease total distance… (Why?)
• Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
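Why can this phase only decrease φ? The gradient of Σᵢ ||xᵢ − y||² with respect to y is 2 Σᵢ (y − xᵢ), which is zero exactly when y is the mean of the xᵢ. A quick numerical check of the fun fact (my own illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(50, 2))          # a random cloud of 2D points
mean = xs.mean(axis=0)

def total_sq_dist(y):
    """Total squared Euclidean distance from candidate point y to every x."""
    return float(((xs - y) ** 2).sum())

# Nudging y away from the mean in any direction only increases the total.
for delta in [np.array([0.1, 0.0]), np.array([0.0, -0.2]), rng.normal(size=2)]:
    assert total_sq_dist(mean) <= total_sq_dist(mean + delta)
print("mean minimizes total squared distance (on this sample)")
```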
Initialization
• K-means is nondeterministic
  • Requires initial means
  • It does matter what you pick!
  • What can go wrong?
  • Various schemes for preventing this kind of thing: variance-based split / merge, initialization heuristics
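One simple guard against bad initializations (a sketch, assuming the kmeans and phi helpers above): run from several random seeds and keep the run with the lowest φ.

```python
def kmeans_restarts(points, k, restarts=10):
    """Run k-means from several random initializations, keep the lowest-phi run.

    Assumes the kmeans() and phi() sketches defined earlier in these notes.
    """
    best = None
    for seed in range(restarts):
        means, assignments = kmeans(points, k, seed=seed)
        score = phi(points, means, assignments)
        if best is None or score < best[0]:
            best = (score, means, assignments)
    return best[1], best[2]
```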
K-Means Getting Stuck
• A local optimum: why doesn’t this work out like the earlier example, with the purple cluster taking over half the blue one?
K-Means Questions
• Will K-means converge?
  • To a global optimum?
• Will it always find the true patterns in the data?
  • If the patterns are very very clear?
• Will it find something interesting?
• Do people ever use it?
• How many clusters to pick?
Clustering for Segmentation
• Quick taste of a simple vision algorithm
• Idea: break images into manageable regions for visual processing (object recognition, activity detection, etc.)
http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/
Representing Pixels
• Basic representation of pixels:
  • 3-dimensional color vector <r, g, b>
  • Ranges: r, g, b in [0, 1]
• What will happen if we cluster the pixels in an image using this representation?
• Improved representation for segmentation:
  • 5-dimensional vector <r, g, b, x, y>
  • Ranges: x in [0, M], y in [0, N]
  • Bigger M, N makes position more important
  • How does this change the similarities?
• Note: real vision systems use more sophisticated encodings which can capture intensity, texture, shape, and so on.
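A sketch of the improved representation (my own helper; it normalizes the coordinates and exposes a single position weight instead of choosing M and N directly):

```python
import numpy as np

def pixel_features(image, pos_weight=1.0):
    """Build <r, g, b, x, y> feature vectors for an (H, W, 3) image, rgb in [0, 1].

    pos_weight scales the x, y coordinates: larger values make position matter
    more relative to color (the role M and N play on the slide).
    """
    H, W, _ = image.shape
    ys, xs = np.mgrid[0:H, 0:W]
    feats = np.concatenate(
        [image.reshape(H * W, 3),
         pos_weight * xs.reshape(H * W, 1) / W,
         pos_weight * ys.reshape(H * W, 1) / H],
        axis=1)
    return feats  # (H*W, 5) array, ready to hand to k-means
```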
K-Means Segmentation
• Results depend on initialization!
• Why?
• Note: best systems use graph segmentation algorithms
Other Uses of K-Means
• Speech recognition: can use it to quantize wave slices into a small number of types (SOTA: work with multivariate continuous features)
• Document clustering: detect similar documents on the basis of shared words (SOTA: use probabilistic models which operate on topics rather than words)
Agglomerative Clustering
• Agglomerative clustering:
  • First merge very similar instances
  • Incrementally build larger clusters out of smaller clusters
• Algorithm (see the sketch below):
  • Maintain a set of clusters
  • Initially, each instance is in its own cluster
  • Repeat:
    • Pick the two closest clusters
    • Merge them into a new cluster
    • Stop when there’s only one cluster left
• Produces not one clustering, but a family of clusterings represented by a dendrogram
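A straightforward (and deliberately inefficient) sketch of that loop; the names are my own, and the linkage argument is filled in on the next slide:

```python
import numpy as np

def agglomerative(points, linkage):
    """Greedy agglomerative clustering sketch.

    points: (n, d) array; linkage: function mapping (points, index list, index list)
    to a distance. Returns the merge history (a simple stand-in for a dendrogram).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters under the given linkage.
        _, i, j = min((linkage(points, a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Example usage with a single-link (closest-pair) distance:
# merges = agglomerative(pts, lambda P, a, b:
#     min(np.linalg.norm(P[i] - P[j]) for i in a for j in b))
```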
Agglomerative Clustering
• How should we define “closest” for clusters with multiple elements?
• Many options (a few are sketched in code below):
  • Closest pair (single-link clustering)
  • Farthest pair (complete-link clustering)
  • Average of all pairs
  • Distance between centroids (broken)
  • Ward’s method (my pick, like k-means)
• Different choices create different clustering behaviors
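Three of these options, written as linkage functions compatible with the agglomerative sketch above (centroid distance and Ward’s method are omitted here):

```python
import numpy as np

def single_link(points, a, b):
    """Closest pair: distance between the two nearest members of clusters a and b."""
    return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

def complete_link(points, a, b):
    """Farthest pair: distance between the two most distant members."""
    return max(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

def average_link(points, a, b):
    """Average distance over all cross-cluster pairs."""
    return float(np.mean([np.linalg.norm(points[i] - points[j]) for i in a for j in b]))
```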
Agglomerative Clustering
• Complete Link (farthest) vs. Single Link (closest)
Back to Similarity
• K-means naturally operates in Euclidean space (why?)
• Agglomerative clustering didn’t require any mention of averaging
  • Can use any function which takes two instances and returns a similarity
  • Kernelized clustering: if your similarity function has the right properties, you can adapt k-means too
• Kinds of similarity functions:
  • Euclidean (dot product)
  • Weighted Euclidean
  • Edit distance between strings (sketched below)
  • Anything else?
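As an example of the last non-Euclidean kind, here is standard Levenshtein edit distance (turn it into a similarity by, e.g., negating or exponentiating it); plain Python, nothing slide-specific:

```python
def edit_distance(s, t):
    """Levenshtein distance: min insertions, deletions, substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```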
Collaborative Filtering
• Ever wonder how online merchants decide what products to recommend to you?
• Simplest idea: recommend the most popular items to everyone
  • Not entirely crazy! (Why?)
• Can do better if you know something about the customer (e.g. what they’ve bought)
• Better idea: recommend items that similar customers bought
• A popular technique: collaborative filtering
  • Define a similarity function over customers (how?)
  • Look at purchases made by people with high similarity
  • Trade-off: relevance of comparison set vs. confidence in predictions
  • How can this go wrong?
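A toy sketch of the idea (everything here, including Jaccard overlap as the customer similarity, is my own illustration, not a claim about what real merchants do):

```python
def recommend(purchases, target, k=5, n_items=3):
    """User-based collaborative filtering sketch.

    purchases: dict user -> set of item ids; target: a user in purchases.
    Similarity between customers = Jaccard overlap of their purchase sets.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    others = [u for u in purchases if u != target]
    # The k customers most similar to the target.
    neighbors = sorted(others, key=lambda u: jaccard(purchases[u], purchases[target]),
                       reverse=True)[:k]
    # Count how often each unseen item appears among the neighbors' purchases.
    counts = {}
    for u in neighbors:
        for item in purchases[u] - purchases[target]:
            counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:n_items]
```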
Case-Based Reasoning
• Similarity for classification
• Case-based reasoning
  • Predict an instance’s label using similar instances
• Nearest-neighbor classification
  • 1-NN: copy the label of the most similar data point
  • K-NN: let the k nearest neighbors vote (have to devise a weighting scheme); see the sketch below
  • Key issue: how to define similarity
  • Trade-off:
    • Small k gives relevant neighbors
    • Large k gives smoother functions
    • Sound familiar?
• [DEMO]
http://www.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
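A minimal K-NN voter matching the description above (equal-weight votes, which is just one possible weighting scheme; names are my own):

```python
from collections import Counter

def knn_classify(query, examples, similarity, k=3):
    """k-nearest-neighbor classification sketch.

    examples: list of (instance, label) pairs; similarity: function of two instances.
    The k most similar examples vote with equal weight.
    """
    neighbors = sorted(examples, key=lambda ex: similarity(query, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```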
Parametric / Non-parametric
• Parametric models:
  • Fixed set of parameters
  • More data means better settings
• Non-parametric models:
  • Complexity of the classifier increases with data
  • Better in the limit, often worse in the non-limit
• (K)NN is non-parametric
[Figure: 1-NN fits to 2, 10, 100, and 10000 examples, compared to the truth]
Nearest-Neighbor Classification
• Nearest neighbor for digits:
  • Take a new image
  • Compare to all training images
  • Assign based on the closest example
• Encoding: image is a vector of pixel intensities
• What’s the similarity function?
  • Dot product of two image vectors?
  • Usually normalize the vectors so ||x|| = 1
  • min = 0 (when?), max = 1 (when?)
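A sketch of that normalized dot product for images stored as numpy intensity arrays (my own helper, assuming nonzero images):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of length-normalized image vectors (intensities flattened to 1D)."""
    a = a.ravel() / np.linalg.norm(a)
    b = b.ravel() / np.linalg.norm(b)
    return float(np.dot(a, b))

# With nonnegative intensities this is 0 when the images share no "on" pixels,
# and 1 when one image is a positive rescaling of the other.
```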
Basic Similarity
• Many similarities are based on feature dot products:
  sim(x, x′) = φ(x) · φ(x′) = Σᵢ φᵢ(x) φᵢ(x′)
• If the features are just the pixels:
  sim(x, x′) = x · x′ = Σᵢ xᵢ x′ᵢ
• Note: not all similarities are of this form
Invariant Metrics
• Better distances use knowledge about vision
• Invariant metrics:
  • Similarities are invariant under certain transformations
  • Rotation, scaling, translation, stroke-thickness…
• E.g.:
  • 16 x 16 = 256 pixels; a point in 256-dim space
  • Small similarity in R^256 (why?)
• How to incorporate invariance into similarities?
This and next few slides adapted from Xiao Hu, UIUC
Rotation Invariant Metrics
• Each example is now a curve in R^256
• Rotation-invariant similarity: s′ = max over rotations r of s(r(A), r(B)), where A and B are the two images
• E.g. highest similarity between the images’ rotation lines
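A rough, discretized stand-in for s′ (my own sketch, assuming the cosine_similarity helper above; it rotates only one image over a grid of angles rather than searching continuous rotations of both):

```python
from scipy.ndimage import rotate

def rotation_invariant_similarity(a, b, angles=range(0, 360, 10)):
    """Approximate s': max similarity over a grid of rotations of the first image.

    a, b: 2D intensity arrays. This discrete, one-sided search is only a cheap
    stand-in for the true maximization over continuous rotations.
    """
    return max(cosine_similarity(rotate(a, angle, reshape=False), b) for angle in angles)
```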
Tangent Families
• Problems with s′:
  • Hard to compute
  • Allows large transformations (6 → 9)
• Tangent distance:
  • 1st-order approximation at the original points
  • Easy to compute
  • Models small rotations
Template Deformation
• Deformable templates:
  • An “ideal” version of each category
  • Best-fit to the image using min variance
  • Cost for high distortion of the template
  • Cost for image points being far from the distorted template
• Used in many commercial digit recognizers
Examples from [Hastie 94]