Learning with Similarity Functions
Maria-Florina Balcan & Avrim Blum
CMU, CSD
Kernels and Similarity Functions
Kernels have become a powerful tool in ML.
• Useful in practice for dealing with many different kinds
of data.
• Elegant theory about what makes a given kernel good for
a given learning problem.
Our Goal: analyze more general similarity functions.
• In the process, we describe ways of constructing good
data-dependent kernels.
Kernels
• A kernel K is a pairwise similarity function s.t. there exists an implicit
mapping φ s.t. K(x,y) = φ(x)·φ(y).
• Point is: many learning algorithms can be written so they only
interact with the data via dot products.
• If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in
the higher-dimensional φ-space.
• If the data is linearly separable by a large margin in φ-space, we don't
have to pay much in terms of data or computation time.
[Figure: data mapped by φ, separated by w with margin γ.]
If margin γ in φ-space, only need ~1/γ² examples to learn well.
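To make the implicit mapping concrete, here is a minimal Python sketch (my own illustration, not from the talk): the quadratic kernel K(x,y) = (x·y)² equals an ordinary dot product in an explicit φ-space of pairwise coordinate products.

import numpy as np

def quadratic_kernel(x, y):
    # K(x,y) = (x . y)^2, computed without ever building the feature space.
    return np.dot(x, y) ** 2

def phi(x):
    # Explicit feature map for this kernel: all pairwise products x_i * x_j.
    return np.outer(x, x).ravel()

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(quadratic_kernel(x, y))    # 1.0
print(np.dot(phi(x), phi(y)))    # 1.0 as well -- the same value via the phi-space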
General Similarity Functions
Goal: definition of good similarity function for a learning
problem that:
1) Talks in terms of natural direct properties:
• no implicit high-dimensional spaces
• no requirement of positive-semidefiniteness
2) If K satisfies these properties for our given problem,
then it has implications for learning.
3) Is broad: includes the usual notion of a “good kernel”
(one that induces a large-margin separator in φ-space).
A First Attempt: Definition satisfying
properties (1) and (2)
Let P be a distribution over labeled examples (x, l(x))
• K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a
1-ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
• E.g., suppose pairs of positives have K(x,y) ≥ 0.2, pairs of negatives have
K(x,y) ≥ 0.2, but for a positive and a negative, K(x,y) is
uniform random in [-1,1].
Note: this might not be a legal kernel.
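A minimal sketch (my own illustration, not from the talk) that checks the gap condition on a toy version of this example; the 0.2 same-label similarity and the uniform cross-label similarity follow the slide, the rest is made up.

import numpy as np

rng = np.random.default_rng(0)
n = 2000
labels = rng.choice([+1, -1], size=n)

def K(i, j):
    # Toy similarity: same-label pairs score 0.2, cross-label pairs are
    # uniform in [-1, 1]; note this need not be a legal (PSD) kernel.
    if labels[i] == labels[j]:
        return 0.2
    return rng.uniform(-1.0, 1.0)

# Estimate E_y[K(x,y) | same label] - E_y[K(x,y) | different label] for one x.
x = 0
same = [K(x, j) for j in range(1, n) if labels[j] == labels[x]]
diff = [K(x, j) for j in range(1, n) if labels[j] != labels[x]]
print(np.mean(same) - np.mean(diff))   # close to 0.2, so the definition holds with γ ≈ 0.2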
A First Attempt: Definition satisfying
properties (1) and (2). How to use it?
• K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a
1-ε probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw a set S- of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the better average similarity score (a sketch follows below).
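A minimal sketch of this classification rule (assuming K is available as a Python function and S_plus, S_minus are the drawn samples; the names are mine, not the talk's):

def predict(K, S_plus, S_minus, x):
    # Average similarity of x to the positive sample vs. the negative sample;
    # predict the label of whichever side scores higher.
    score_plus = sum(K(x, y) for y in S_plus) / len(S_plus)
    score_minus = sum(K(x, z) for z in S_minus) / len(S_minus)
    return +1 if score_plus >= score_minus else -1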
A First Attempt: How to use it?
• K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε
probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw a set S- of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the better average similarity score.
Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
Proof
• Hoeffding: for any given “good x”, the probability of error
w.r.t. x (over the draw of S+, S-) is at most δ².
• By Markov, there is at most a δ chance that the error rate over
the good x's exceeds δ. So the overall error rate is ≤ ε + δ.
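To spell out the Hoeffding step with explicit (my own, illustrative) constants: for a good x the empirical gap between the two average scores has mean at least γ, and the classifier errs only if that gap drops to 0 or below; the 2d similarity values each lie in [-1,1] and are averaged with weight 1/d, so

Pr over the draw of S+, S- [ (1/d)·Σ_{y in same-label set} K(x,y) ≤ (1/d)·Σ_{z in other-label set} K(x,z) ] ≤ exp(-dγ²/4) ≤ δ², once d ≥ (4/γ²)·ln(1/δ²).

Markov's inequality applied to the expected error rate over the good x's (at most δ²) then gives the δ bound above.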
A First Attempt: Not Broad Enough
• K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε
probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: positives in two clusters above the negatives; points in one positive cluster are more similar to the negatives than to typical positives.]
• K(x,y) = x·y has a good (large margin) separator but doesn't
satisfy our definition.
A First Attempt: Not Broad Enough
• K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε
probability mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: the same example, with a large region R of examples marked.]
Idea: this would work if we didn't pick the y's from the top-left region.
Broaden to say: OK if ∃ a large region R s.t. most x are on
average more similar to y∈R of the same label than to y∈R
of the other label.
Broader/Main Definition
• K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a
weighting function w(y) ∈ [0,1] such that at least a 1-ε probability
mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Main Definition, How to Use It
• K : (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting
function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ = {y1,…,yd}, S- = {z1,…,zd}, d = O((1/γ²) ln(1/δ²)).
• Use these to “triangulate” the data:
F(x) = [K(x,y1),…,K(x,yd), K(x,z1),…,K(x,zd)].
• Take a new set of labeled examples, project them to this space,
and run your favorite algorithm for learning linear separators (a sketch follows below).
Point is: with probability ≥ 1-δ, there exists a linear separator of
error ≤ ε + δ at margin γ/4.
(Namely, w = [w(y1),…,w(yd), -w(z1),…,-w(zd)].)
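A minimal end-to-end sketch of this mapping (my own illustration, not the talk's implementation; it assumes a Python similarity function K and uses scikit-learn's LinearSVC as the "favorite algorithm"):

import numpy as np
from sklearn.svm import LinearSVC

def similarity_map(K, landmarks, X):
    # F(x) = [K(x, y1), ..., K(x, yd), K(x, z1), ..., K(x, zd)] for each row x of X.
    return np.array([[K(x, l) for l in landmarks] for x in X])

def learn_with_similarity(K, S_plus, S_minus, X_train, y_train):
    landmarks = list(S_plus) + list(S_minus)   # the 2d "triangulation" points
    clf = LinearSVC()                          # any large-margin linear learner would do
    clf.fit(similarity_map(K, landmarks, X_train), y_train)
    return landmarks, clf

def predict(K, landmarks, clf, X_test):
    return clf.predict(similarity_map(K, landmarks, X_test))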
Main Definition, Implications
Algorithm
• Draw S+ = {y1,…,yd}, S- = {z1,…,zd}, d = O((1/γ²) ln(1/δ²)).
• Use these to “triangulate” the data: F(x) = [K(x,y1),…,K(x,yd), K(x,z1),…,K(x,zd)].
Guarantee: with prob. ≥ 1-δ, there exists a linear separator of error ≤ ε + δ
at margin γ/4.
Implications
• K, an arbitrary similarity function that is (ε,γ)-good, yields (via the mapping F)
a legal kernel that is an (ε+δ, γ/4)-good kernel function.
Good Kernels are Good Similarity Functions
Main Definition: K : (x,y) → [-1,1] is an (ε,γ)-good similarity
for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Theorem
• An (ε,γ)-good kernel is an (ε′,γ′)-good similarity function
under the main definition.
Our current proofs incur some penalty:
ε′ = ε + ε_extra, γ′ = γ³·ε_extra.
Good Kernels are Good Similarity Functions
Theorem
• An (ε,γ)-good kernel is an (ε′,γ′)-good similarity function
under the main definition, where ε′ = ε + ε_extra and γ′ = γ³·ε_extra.
Proof Sketch
• Suppose K is a good kernel in usual sense.
• Then, standard margin bounds imply:
– if S is a random sample of size Õ(1/(γ²ε)), then w.h.p. we
can give weights w_S(y) to all examples y ∈ S so that the
weighted sum of these examples defines a good LTF (linear threshold function).
• But, we want sample-independent weights [and bounded].
– Boundedness not too hard (imagine a margin-perceptron
run over just the good y).
– Get sample-independence using an averaging argument.
Learning with Multiple Similarity Functions
• Let K1, …, Kr be similarity functions s.t. some (unknown)
convex combination of them is (ε,γ)-good.
Algorithm
• Draw S+ = {y1,…,yd}, S- = {z1,…,zd}, d = O((1/γ²) ln(1/δ²)).
• Use these to “triangulate” the data:
F(x) = [K1(x,y1),…,Kr(x,yd), K1(x,z1),…,Kr(x,zd)].
Guarantee: The induced distribution F(P) in R^{2dr} has a
separator of error ≤ ε + δ at margin at least γ/(4√r).
Sample complexity is
roughly Õ(r/(γ²ε)).
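And a minimal sketch of the multi-similarity version (again my own illustration; it reuses LinearSVC from the previous sketch and assumes each Ki is available as a Python function):

import numpy as np
from sklearn.svm import LinearSVC

def multi_similarity_map(Ks, landmarks, X):
    # For each row x of X: [K1(x,l), ..., Kr(x,l)] for every landmark l,
    # giving 2*d*r coordinates in total.
    return np.array([[K(x, l) for l in landmarks for K in Ks] for x in X])

def learn_with_similarities(Ks, S_plus, S_minus, X_train, y_train):
    landmarks = list(S_plus) + list(S_minus)
    clf = LinearSVC()
    clf.fit(multi_similarity_map(Ks, landmarks, X_train), y_train)
    return landmarks, clf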
Implications & Conclusions
• We develop a theory that provides a formal way of understanding
kernels as similarity functions.
• Our algorithms work for similarity functions that aren't
necessarily PSD (or even symmetric).
Open Problems
• Improve existing bounds.
• Better results for learning with multiple similarity
functions. Extending [SB’06].