A Theory of Learning and Clustering
via Similarity Functions
Maria-Florina Balcan
Carnegie Mellon University
Joint work with Avrim Blum and Santosh Vempala
09/17/2007
2-Minute Version
Generic classification problem:
learn to distinguish men from women.
Problem: pixel representation not so good.
Powerful technique: use a kernel, a special kind of similarity
function K(·,·). Nice SLT theory, but stated in terms of implicit mappings.
Can we develop a theory that views K as a measure of similarity?
What are general sufficient conditions for K to be useful for
learning?
2-Minute Version
Generic classification problem:
learn to distinguish men from women.
Problem: pixel representation not so good.
Powerful technique: use a kernel, a special kind of similarity
function K(·,·).
What if we don’t have any labeled data? (i.e., clustering)
Can we develop a theory of conditions sufficient for K to be
useful now?
Part I: On Similarity
Functions for Classification
Kernel Functions and Learning
E.g., given images
labeled by gender, learn a rule
to distinguish men from women.
[Goal: do well on new data]
Problem: our best algorithms learn linear separators,
not good for data in natural representation.
Old approach: learn a more complex class of functions.
New approach: use a kernel.
[Figure: data not linearly separable in the natural representation becomes separable under a kernel.]
Kernels, Kernelizable Algorithms
• K is a kernel if there exists an implicit mapping φ s.t. K(x,y) = φ(x)·φ(y).
Point: many algorithms interact with data only via dot-products.
• If we replace x·y with K(x,y), the algorithm acts implicitly as if the
data were in the higher-dimensional φ-space.
• If the data is linearly separable by a large margin in φ-space, we don’t
have to pay in terms of sample complexity or computation time.
[Figure: a linear separator w with margin γ in φ-space.]
If the margin is γ in φ-space, only ≈ 1/γ² examples are needed to learn well.
Kernels and Similarity Functions
Kernels: useful for many kinds of data, elegant SLT.
Our Work: analyze more general similarity functions.
Characterization of good similarity functions:
1) In terms of natural direct properties.
• no implicit high-dimensional spaces
• no requirement of positive-semidefiniteness
2) If K satisfies these, can be used for learning.
3) Is broad: includes the usual notion of a “good kernel”
(one that has a large margin separator in φ-space).
A First Attempt: Definition Satisfying (1) and (2)
P: distribution over labeled examples (x, l(x)).
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob.
mass of x satisfy:
Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ
• E.g., K(x,y) ≥ 0.2 when l(x) = l(y);
K(x,y) random in [-1,1] when l(x) ≠ l(y).
Note: might not be a legal kernel.
A First Attempt: Definition Satisfying (1) and (2).
How to use it?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob.
mass of x satisfy:
Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw S- of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the higher average similarity score.
Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
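A minimal sketch of this averaging classifier (Python; the toy cosine similarity and the Gaussian stand-in samples below are illustrative, not from the talk):

import numpy as np

def avg_similarity_classifier(K, S_plus, S_minus):
    """Label x by whichever sample (positive or negative) it is more
    similar to on average."""
    def classify(x):
        score_plus = np.mean([K(x, y) for y in S_plus])
        score_minus = np.mean([K(x, y) for y in S_minus])
        return +1 if score_plus >= score_minus else -1
    return classify

# Toy usage with a cosine similarity and Gaussian stand-in data:
def K(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
S_plus = rng.normal(loc=+1.0, size=(200, 5))   # stand-in for drawn positives
S_minus = rng.normal(loc=-1.0, size=(200, 5))  # stand-in for drawn negatives
h = avg_similarity_classifier(K, S_plus, S_minus)
print(h(rng.normal(loc=+1.0, size=5)))  # typically +1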
A First Attempt: Definition Satisfying (1) and (2).
How to use it?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob.
mass of x satisfy:
Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ
Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
• Hoeffding: for any given “good x”, the prob. of error w.r.t. x
(over the draw of S+, S-) is ≤ δ².
• So there is at most a δ chance that the error rate over GOOD is ≥ δ.
• Overall error rate ≤ ε + δ.
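A worked version of the Hoeffding step (a sketch; constants are illustrative, using K ∈ [-1,1] and sample size d per class):

\[
\Pr\Big[\Big|\tfrac{1}{d}\textstyle\sum_{y\in S^+}K(x,y)-\mathbb{E}_{y}\big[K(x,y)\mid l(y)=l(x)\big]\Big|\ge \tfrac{\gamma}{2}\Big]\le 2e^{-d\gamma^2/8}.
\]

For d = O((1/γ²) ln(1/δ²)) this is at most δ² for each of S+ and S- (up to a union bound); since a good x has a gap of γ between the two true expectations, it is misclassified with probability ≤ δ². By Markov's inequality, the chance (over S+, S-) that more than a δ fraction of good points are misclassified is at most δ.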
A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob.
mass of x satisfy:
Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ
[Figure: a large-margin dataset; annotation: “more similar to + than to typical -”.]
• K(x,y) = x·y has a large margin separator but doesn’t satisfy
our definition.
A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob.
mass of x satisfy:
Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ
[Figure: the same data, with a non-negligible region R of positives marked.]
Broaden: OK if there exists a non-negligible R s.t. most x are on average
more similar to y ∈ R of the same label than to y ∈ R of the other label.
Broader/Main Definition
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a
weighting function w(y) ∈ [0,1] s.t. a 1-ε prob. mass of x satisfy:
Ey~P[w(y)K(x,y) | l(y)=l(x)] ≥ Ey~P[w(y)K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+={y1,…,yd}, S-={z1,…,zd}, d=O((1/γ²) ln(1/δ²)).
• “Triangulate” data:
F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
• Take a new set of labeled examples, project to this space, and run
any alg for learning lin. separators.
Theorem: with probability ≥ 1-δ, there exists a linear separator of
error ≤ ε + δ at margin γ/4.
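A minimal sketch of this pipeline (Python; the landmark sets stand in for S+ and S-, and any off-the-shelf linear-separator learner, here scikit-learn's LinearSVC, can play the last step):

import numpy as np
from sklearn.svm import LinearSVC

def triangulate(K, landmarks_pos, landmarks_neg):
    """Map x to F(x) = [K(x,y1),...,K(x,yd), K(x,z1),...,K(x,zd)]."""
    def F(x):
        return np.array([K(x, y) for y in landmarks_pos] +
                        [K(x, z) for z in landmarks_neg])
    return F

def learn_with_similarity(K, landmarks_pos, landmarks_neg, X_train, y_train):
    # Project a fresh labeled sample into the similarity-feature space,
    # then run any linear-separator learner there.
    F = triangulate(K, landmarks_pos, landmarks_neg)
    Z = np.stack([F(x) for x in X_train])  # n x 2d feature matrix
    clf = LinearSVC().fit(Z, y_train)
    return lambda x: clf.predict(F(x).reshape(1, -1))[0]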
Main Definition & Algorithm, Implications
• S+={y1,…,yd}, S-={z1,…,zd}, d=O((1/γ²) ln(1/δ²)).
• “Triangulate” data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Theorem: with prob. ≥ 1-δ, there exists a linear separator of error ≤ ε + δ
at margin γ/4.
[Diagram: within the space of arbitrary similarity functions K, legal kernels and (ε,γ)-good similarity functions overlap; an (ε,γ)-good similarity function yields an (ε+δ, γ/4)-good kernel.]
Theorem
Any (ε,γ)-good kernel is an (ε′,γ′)-good similarity function
(with some penalty: ε′ = ε + ε_extra, γ′ = γ²ε_extra).
Similarity Functions for Classification, Summary
• Formal way of understanding kernels as similarity
functions.
• Algorithms and guarantees for general similarity
functions that aren’t necessarily PSD.
Part II: Can we use this angle to
help think about Clustering?
What if only unlabeled examples are available?
S: set of n objects. [documents, images]
There is some (unknown) “ground truth” clustering. [clusters: sports, fashion]
Each object has a true label l(x) in {1,…,t}. [topic]
Goal: h of low error up to isomorphism of label names.
Err(h) = min_σ Pr_{x~S}[σ(h(x)) ≠ l(x)]
Problem: only have unlabeled data!
But we have a Similarity function!
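A tiny concrete version of this error measure (Python; brute-forces the label permutation σ, which is fine for small t):

import itertools

def clustering_error(h_labels, true_labels, t):
    """Err(h) = min over permutations sigma of Pr[sigma(h(x)) != l(x)]."""
    n = len(true_labels)
    best = 1.0
    for sigma in itertools.permutations(range(t)):
        mismatches = sum(sigma[h] != l for h, l in zip(h_labels, true_labels))
        best = min(best, mismatches / n)
    return best

# Two clusters with names swapped: zero error up to isomorphism.
print(clustering_error([0, 0, 1, 1], [1, 1, 0, 0], t=2))  # -> 0.0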
Contrast with “Standard” Approach
Traditional approach: the input is a graph or an embedding of points
into R^d.
- analyze algorithms that optimize various criteria
- ask which criterion produces “better-looking” results
We flip this perspective around.
More natural, since the input graph/similarity is merely
based on some heuristic.
- closer to learning mixtures of Gaussians
- discriminative, not generative
What conditions on a similarity function would be
enough to allow one to cluster well?
A condition that trivially works:
K(x,y) > 0 for all x,y with l(x) = l(y).
K(x,y) < 0 for all x,y with l(x) ≠ l(y).
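Under this trivial condition, clustering reduces to taking connected components of the graph with an edge wherever K > 0 (a minimal sketch; the similarity matrix sim is a placeholder):

import numpy as np

def cluster_by_sign(sim):
    """Connected components of the graph {(u,v) : sim[u,v] > 0}."""
    n = sim.shape[0]
    labels = [-1] * n
    cur = 0
    for s in range(n):
        if labels[s] != -1:
            continue
        stack = [s]
        labels[s] = cur
        while stack:  # depth-first search from s
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and sim[u, v] > 0:
                    labels[v] = cur
                    stack.append(v)
        cur += 1
    return labels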
What conditions on a similarity function would be enough to
allow one to cluster well?
Strict Ordering Property (still strong)
K is s.t. all x are more similar to points y in their own cluster
than to any y’ in other clusters.
Problem: same K can satisfy it for two very different clusterings
of the same data!
Unlike learning, you can’t
even test your hypotheses!
[Figure: the same objects (soccer, tennis, Lacoste, Coco Chanel) under two very different clusterings into sports and fashion, both satisfying the property for the same K.]
Relax Our Goals
1. Produce a hierarchical clustering s.t. the correct answer is
approximately some pruning of it.
[Tree: All topics splits into sports (soccer, tennis) and fashion (Lacoste, Coco Chanel).]
Relax Our Goals
1. Produce a hierarchical clustering s.t. the correct answer is
approximately some pruning of it.
2. Produce a list of clusterings s.t. at least one has low error.
Tradeoff: strength of assumption vs. size of list.
Start Getting Nice Algorithms/Properties
Strict Ordering Property (sufficient for hierarchical clustering)
K is s.t. all x are more similar to points y in their own
cluster than to any y’ in other clusters.
Weak Stability Property (sufficient for hierarchical clustering)
For all clusters C, C’, for all A in C, A’ in C’:
at least one of A, A’ is more attracted
to its own cluster than to the other.
Example Analysis for Strong Stability Property
K is s.t. for all C, C’, all A in C, A’ in C’:
K(A, C-A) > K(A, A’)
(K(A, A’) denotes the average attraction between A and A’)
Algorithm: average single-linkage.
• Merge the two “parts” whose average similarity is highest.
Analysis: all “parts” made are laminar w.r.t. the target clustering.
• Failure iff we merge P1, P2 s.t. P1 ⊂ C and P2 ∩ C = ∅.
• But then there must exist P3 ⊂ C s.t. K(P1,P3) ≥ K(P1,C-P1), and
K(P1,C-P1) > K(P1,P2). Contradiction.
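A minimal sketch of average single-linkage (Python; sim is a placeholder n x n similarity matrix):

import numpy as np

def average_linkage(sim):
    """Bottom-up merging: repeatedly merge the two current parts with the
    highest average pairwise similarity, recording the merge tree."""
    n = sim.shape[0]
    parts = [[i] for i in range(n)]        # start from singletons
    tree = list(parts)                     # leaves of the hierarchy
    while len(parts) > 1:
        best, pair = -np.inf, None
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                # average similarity between part i and part j
                avg = np.mean([sim[a, b] for a in parts[i] for b in parts[j]])
                if avg > best:
                    best, pair = avg, (i, j)
        i, j = pair
        tree.append((parts[i], parts[j]))  # record this merge
        parts = [p for k, p in enumerate(parts) if k not in (i, j)]
        parts.append(tree[-1][0] + tree[-1][1])
    return tree

Under strong stability, the argument above says every merge keeps the parts laminar w.r.t. the target clustering, so the target is (approximately) a pruning of the returned tree.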
Strong Stability Property, Inductive Setting
Draw a sample S, hierarchically partition S.
Insert new points as they arrive.
Assume for all C, C’, all A ⊂ C, A’ ⊆ C’:
K(A, C-A) > K(A, A’) + γ
– Need to argue that sampling preserves stability.
– A sample-complexity-type argument using regularity-type
results of [AFKK].
Weaker Conditions
Average Attraction Property (not sufficient for hierarchy)
Ex’∈C(x)[K(x,x’)] > Ex’∈C’[K(x,x’)] + γ (∀ C’ ≠ C(x))
Can produce a small list of clusterings.
Upper bound: t^{O(t/γ²)}. [doesn’t depend on n]
Lower bound: ~ t^{Ω(1/γ)}.
Stability of Large Subsets Property (sufficient for hierarchy)
Might cause bottom-up algorithms to fail.
Find the hierarchy using a learning-based algorithm
(running time t^{O(t/γ²)}).
Similarity Functions for Clustering, Summary
Discriminative/SLT-style model for Clustering with
non-interactive feedback.
• Minimal conditions on K to be useful for clustering:
– Hierarchical clustering
– List clustering
• Our notion of a property: the analogue of a data-dependent concept
class in classification.