Some Mathematical Tools for Machine Learning (part 1)

Some Mathematical Tools
for Machine Learning
Chris Burges
Microsoft Research
February 13th, 2009
Contents – Part 1
1.Lagrange Multipliers: An Overview
2.Basic Concepts in Functional Analysis
3.Some Notes on Matrix Analysis
4.Convex Optimization: A Brief Tour
A Brief Diversion
Leslie Lamport: “How to Write a Proof”: DEC tech report, Feb 14, 1993
Abstract: A method of writing proofs is proposed that makes it much harder
to prove things that are not true. The method, based on hierarchical
structuring, is simple and practical.
“Anecdotal evidence suggests that as many as a third of all papers
published in mathematical journals contain mistakes – not just minor errors,
but incorrect theorems and proofs.”
Lagrange Multipliers:
An Overview, and Some Examples
Lagrange the Mathematician
• Born 1736 in Turin, one of two of 11 to survive infancy
• “Responsible for much fine mathematics published under the names
of other mathematicians”
• Believed that a mathematician has not thoroughly understood his
own work till he has made it so clear that he can go out and explain
it to the first person he meets on the street
• Worked on mechanics, calculus, the calculus of variations
astronomy, probability, group theory, and number theory
• At least partly responsible for the choice of base 10 for the metric
system, rather than 12
• Supported by Euler and d’Alembert, financed by Frederick and Louis
XIV, close to Lavoisier, Marie Antoinette
An indirect approach can be easier
Example: Minimize f ( x) subject to c( x)
xAx  1, x 
n
If A 0, could rotate to coordinate system and rescale so that
constraints take the form yy  1, substitute with a parameterization
that encodes the constraints that x lives on S n 1 :
y1  sin 1 sin  2
sin  n 1
y2  sin 1 sin  2
cos  n 1
solve, and then apply the inverse mapping. But this can get
very complicated!
It is often not even possible to parameterize constraints (for
example, polynomial constraints in several variables)
One equality constraint
Minimize f ( x) subject to c( x)  0, x 
2
.
f
c( x)  0
c
f 0
f 3
f 2
f 1
One equality constraint, cont.
f
c( x)  0
c
f 0
f 3
f 2
f 1
Hence at the optimum, we must have f  c, or:
f  c
L ( f   c)  0
Multiple equality constraints
n constraints:
ci ( x)  0, i  1,
Define gradients:
, n.
gi ( x)  ci ( x).
Let S be the subspace spanned by the gi , and let S  be its
orthogonal complement.
Suppose that at some point, all constraints hold, and
 f    0
Then can increase (or decrease) f by moving along  f 
Hence  f   0, or: f   i ci ( x) :
i
L
( f  i ci )  0
i
Puzzle: why not multiple Lagrangians?
One inequality constraint
Find x* that minimizes f ( x) subject to c( x)  0.
What's new? At the solution, it's possible that c( x)  0.
(Simple but not guaranteed: solve 'minimize f ( x) ', check that c( x* )  0.)
f 3
f 2
f 1
f
f 0
c
c( x)  0
c( x)  0
c( x)  0
f  c : ( f  c)  0,   0
Multiple inequality constraints
( f   i ci )  0, i  0
i
Suppose that at the solution
Then removing
makes no difference, and we must drop
from the sum in
Equivalently we can set
Hence, always impose
A simple example
Extremize the distance between two points on S n :
Embed in
n 1
: extremize f  x1  x2 , x1 , x2  R nn 1
2
subject to c1 ( x1 , x2 )  1  x1
2
 0, c2 ( x1 , x2 )  1  x2
L( x1 , x2 )  f   i ci  x1  x2
2
2
 1 (1  x1 )  2 (1  x2 )
i
1L  0  ( x1  x2 )  1 x1  0
 2 L  0  ( x2  x1 )  2 x2  0
 x2  (1  1 ) x1, x1  (1  2 ) x2
 antipodal or equal: i  2 or 0.
2
2
Another simple example
Given a parallelogram whose sidelengths you can choose but
whose perimeter c is fixed - what shape has the largest area?
h
b
Maximize bh subject to 2(b  h)  c
L(b, h)  bh   (2(b  h)  c)
L  0  b  h
Again,  not explicitly needed: hence “method of
undetermined multipliers”
Simple exercises
Puzzle: what coefficients maximize a convex sum of fixed numbers?
Puzzle: minimize
 xi 2 subject to  xi  1.
i
Puzzle: maximize
i
 xi 2 subject to  xi  1 and xi  0 (hint: use i xi  0)
i
i
Resource Allocation
Cost
Rate
ci (ni )
Number of widgets ni
Distortion
Puzzle: Does this work for economics, too?
Fiber has
bit errors per second, and sends
bits total, per second. We
wish to maximize the bit rate at a fixed distortion rate:
A variational problem
An isoperimetric problem: find the curve of fixed length  and
fixed endpoints {a,b} that encloses maximum area above [a,b].
1
1
Area   y dx,
length    (1  y '2 )1/ 2 dx
0
0
1

L   y dx     (1  y '2 )1/ 2 dx   


0
0

1
1
1
0
0
 L    y dx    (1  y '2 ) 1/ 2 y '  y ' dx
1
y
x=0
d


  1  
y '(1  y '2 ) 1/ 2   y dx
dx

0
1



 1   y ''(1  y ' )
2 3 / 2
  y dx
0
 1   y ''(1  y '2 ) 3 / 2  1    0
…straight line, or arc of circle.
x=1
Which univariate distribution has max entropy?

Minimize

subject to: f ( x)  0 x,
f ( x) log f ( x) dx



f ( x) dx  1,


 x f ( x) dx
Need functional derivative:
 g ( x)
  ( x  y)
 g ( y)
 c1 ,



x 2 f ( x) dx  c2


L

f ( x) log f ( x) dx   (1 

Impose



L
 f ( y)




f ( x) dx)  1  x f ( x) dx   2  x 2 f ( x) dx
 0, integrate w.r.t x  log f ( y)  log(e)    1 y   2 y 2  0
→ f must be Gaussian!
Which univariate distribution has max
entropy?
Puzzle: I thought the uniform distribution has max entropy.
What’s going on?
Puzzle: What distribution do you get if you fix the mean,
but not the variance?
Puzzle: What distribution do you get if you fix only the
function’s support?
Max Entropy for Discrete Distribn + Linear
Constraints
Have discrete distribution Pi :
 Pi  1, Pi  0
i
Suppose you also have known linear constraints:
 aij Pj  C j
i
but you are maximally uncertain about everything else. So want max
entropy distribution subject to these constraints.
L   Pi log Pi 
i




  j   aij Pj  C j      Pi  1    i Pi
j


i

i

i
 L  0  Pk  exp(1     k    j akj )  (1/ Z ) exp(   j akj )
j
→ logistic regression!
j
Are Lagrange Multipliers Really That Common?
Yes.
For example:
•
•
•
•
•
•
Most flavors of Support Vector Machines
Principal Component Analysis
Canonical Correlation Analysis
Locally Linear Embedding
Laplacian Eigenmaps
…
Basic Concepts in
Functional Analysis:
A Brief Tour
of Hilbert spaces, Norms,
and All That
What is a Field?
{F , , }: F a set, {, } operations
{F ,} is an Abelian group with identity denoted by 0
{F  0, } is an Abelian group with identity denoted by 1
x  ( y  z)  x y  x z  {x, y, z  F}
A field generalizes the notion of arithmetic on reals.
Field : Examples
With + meaning addition and  meaning multiplication,
the reals:
F R
the rationals:
F Q
the complex numbers:
F C
What's the smallest field?
Z2  {0,1} with + meaning "XOR" and  meaning "AND"
Puzzle 1: For Z 2 , why doesn't using "OR" for + work?
Puzzle 2: Is
+
 {x : x  0} a field under {, }?
How Many Fields Are There?
Actually infinitely many - e.g. Z p , p prime.
However,
for for
some
fields.
However,can
candefine
defineananordering
'ordering;'
fields
(like <, =, > for R ).
R, Q can be ordered; C and Z 2 cannot; in fact all finite fields
cannot be ordered.
For ordered fields, 'supremum' and 'infimum' can be defined.
An ordered field is 'complete' iff every nonempty subset of F
that has an upper bound in F also has a supremum in F .
Q is not complete, but R is. In fact: every complete, ordered
field is isomorphic to R !
What is a Vector Space?
A vector space is a nonempty set E , a field F and operations 'addition'
(( x, y )  x  y from E  E  E ) and 'multiplication by a scalar'
(( , x)   x from F  E into E ) such that:
(a) x  y  y  x;
(b) ( x  y )  z  z  ( y  z );
(c) For every x, y  E , there exists z  E such that x  y  z;
(d)  ( x)  ( ) x;
(e) (   ) x   x   x;
(f)  ( x  y )   x   y;
(g) 1x  x.
A vector space generalizes the notion of vectors in R nn
Vector Spaces: Field Matters!
{E , F }  {
N
, } (dimension N );
{E , F }  {
N
, } (dimension N );
{E , F }  {
N
, } (dimension 2 N );
'Linear dependence' depends on field F :
For the vector space { , }, vectors 1 and i are linearly
independent.
For the vector space { , } they are not.
Vector Spaces: More Examples
(1) Functions, whose range is a vector space, also
themselves form a vector space:
( f  g )( x)  f ( x)  g ( x);
( f )( x)   f ( x).
(2) M mn (complex m by n matrices), over the field ;
(3) l p : for p  1, the space of all infinite sequences of

complex numbers such that

n 1
x y 


n 1
n 1
 ( xn  yn ),  x    xn
zn
p

What is an Inner Product?
Let V be a vector space over or . A function  ,  : V  V  F
is an inner product if for all x, y, z V ,
(a)  x, x  0
(b)  x, x  0  x  0
(c)  x  y, z    x, z    y, z 
(d) cx, y  c x, y for all scalars c  F
(e)  x, y   y, x
The inner product generalizes the notion of dot product
Inner Product: Examples
(1) Vector space {
N
, } with , defined by  x, y
(2) Vector space l2 over
with , defined by  x, y
(3) Vector space of matrices over
(X , Y  M pm )
 i xi yi
 i1 xi yi

with  X , Y  Trace( X T Y )
Inner Product: Trace
Tr ( X T X )   X i
(a)  X , X   0
2
0
2
0
i
Tr ( X T X )   X i
(b)  X , X   0  x  0
i
(c)  x  y, z    x, z   y, z :
Tr (( X  Y )T , Z )  Tr ( X T Z )  Tr (Y T Z )
(d)  cx, y  c x, y for all scalars c  F
(e)  x, y   y, x
Tr ( X T Y )   Tr ( X T Y )
Tr ( X T Y )  Tr (Y T X )
Inner Product is General
Cauchy Schartz inequality holds for any inner product ,:
|  x, y |2   x, x y, y
for all x, y  V
"Angle" between two vectors can be defined for any inner product:

1 

|  x, y  |
cos 
:
1/ 2 
 ( x, x y, y ) 
0  

2
1 2
 1 0 
E.g. the angle between 
 and 
 is 78.4 degrees !
3 4
 1 2
What is a Norm?
Let V be a vector space over
or . A function  : V 
a norm if for all x, y  V ,
(a) x  0
(b) x  0  x  0
(c) cx | c | x
for all scalars c  F
(d) x  y  x  y
If condition (b) is dropped, it's a 'seminorm'.
What is the simplest seminorm?
Ans : x  0
is
Seminorm splits the space
If  is a seminorm on V , then V0
{v  V : v  0} is a subspace
(called the null space).
If V1 is a subspace of V such that V0 V1  , then  is a norm
on V1.
Define x
y  x  y  0. The corresponding cosets form a vector
space, and on that space,  is a norm.
Example seminorm on 5:   ,  x ' Dx, D
spanned by (0, 0, 0,1, 0) and (0, 0, 0, 0,1).
diag (3, 2,1, 0, 0), null space
Norm Generalizes Length
x  x12  x2 2 
 xn 2
for x  ( x1 , x2 ,
, xn ) 
n
is the Euclidean
norm.
1/ p
Let z  {zn }  l p . The function defined by z
  z p
 n 
 n=1

This works for finite sums too: define the l p norm on
is a norm in l p .
n
as:
1/ p
x
 n x p
 i 
 i=1

The l1 norm is the 'Manhattan' norm x | x1 |  | x2 | 
The l norm is the 'max' norm x  max({xi })
A normed vector space is a pair {vector space, norm}.
 | xn |
What is convexity?
A function is convex if, between x1 and x2  x1, it lies below the
chord joining f ( x1 ) and f ( x2 ). It is strictly convex if it lies strictly below.
f ((1   ) x1   x2 )  (1   ) f ( x1 )   f ( x2 )
x2
0   1
x1
A set S is convex if, for all a  S and b  S, all points on the line
joining a and b lie in S .
When is a sphere a square?
Red: the l 1-sphere
Blue: the l2 1-sphere
Green: the l1 1-sphere
Interesting... each ball looks convex
When is a sphere a cross?
?
x
0
lim p0

i
xi
p
?
In this context, never. The "l0 norm" (count the number of
nonzero elements) is not a norm (for
 v 0  0 if   0,
 v
0
if   0
| | v
0
unless   1.
n
over
or
).
All norms are convex functions
(1   ) x1   x2  1    x1   x2
 | 1    | x1  |  | x2
 1    x1   x2
If f ( x) is convex, then f ( x)  0 defines a convex set:
let R
 x : f ( x)  0.
Then if f ( x1 )  0 and f ( x2 )  0,
f ((1   ) x1   x2 )  (1   ) f ( x1 )   f ( x2 )  0
 the unit ball, x  1, for any norm is a convex region.
2
Open, Closed, Compact
Given a norm n
 , and a set S in a vector space V :
Open :  x  S ,    0 s.t. Bn ( , x)  S
Closed : complement of open
Bounded :  r  0 such that S  Bn (r , 0)
Compact : Every sequence  xi  in S contains convergent subsequence
with limit in S
OR : Every cover has a finite sub - cover :

S   S ,  S 1 , S 2 ,
, S N , N finite, such that
N
i
S i  S
Compact sets are closed and bounded.
For finite dimensional normed vector spaces, every closed bounded
set is compact.
If for a normed vector space V , the unit ball is compact, then V has
finite dimension.
Making your own norm
In fact, in real finite dimensional spaces, any symmetric,
compact, convex region centered on the origin defines
a norm (as the unit ball for that norm):
Rescuing l0 : the Hamming norm
Consider n-tuples taking values in Z 2 . These form a vector
space over the field Z 2 : for example,
 ( x  y)   x   y
 (  x)  ( ) x
1 x  x
Now define the l0 norm x 0 to be the number N of nonzero elements
of x. Is this a norm?
(a) x 0  0
(b) x 0  0  x  0
(c) cx 0 | c | x
0
(d) x  y 0  x 0  y
0
Hamming norm, cont.
Puzzle: (N  1)  Z 2 - how can this be correct?
Puzzle: What is the subtraction operation, in Z 2?
Puzzle: What is the Hamming distance x  y 0 ?
When does a norm come from an inner product?
Every inner product defines a norm: x   x, x
Does every norm define an inner product? If so, for real vector spaces,
1
2
2
 x, y  
x y  x y
4


No! A necessary and sufficient condition for a norm to correspond t o an
inner product is the paralellogram identity:

1
2
x y  x y
2
2

 x  y
2
2
(Jordan and von Neumann, 1935)
Lp norms, inner products
f
f
Lp
2



| f |
p
| f | 
p
2/ p
, where | f | p is integrable
?f, f
  | fg | 
f , g     | ( f  
E.g. try  f , g 
1 f1   2

1/ p
2
unless p  2.
p/2
1 1
2/ p
: then   f , g     f , g  but
2 f2 ) g
|
p/2

2/ p
 1  f1 , g    2  f 2 , g 
Lp norms, inner products cont.
Maybe we could find some other inner product that works for
all p  1?
No: if a norm is derivable from an inner product (over
 x, y  

1
2
x y  x y
4
1
2
.
Choose two functions:
f1 ( x)  0 : x  0
f2  0 : x  1
f1 ( x)  1 : 0  x  1
f2  1 : 1  x  2
f1 ( x)  2 : 1  x  2
1
2
), then
f1 ( x)  0 : x  2
f2  0 : x  2
Lp norms, inner products cont.
We’d like:
2.5
2.4
2.3
2.2
2.1
2
1.9
1.8
1.7
1.6
1.5
1
2
3
4
5
6
p
7
8
9
10
norm on Rn has no inner product
Example in R 2 : x  [1, 0], y  [0, 2]
Use parallelogram test:
1
2
2
2
x y  x y ? x  y
2


1
2
x y  x y
2

2


2
1 2
2
2
(2  22 )  4  x  y  1  4  5
2
Extend to R n : x  [1, 0, 0,
, 0], y  [0, 2, 0,
, 0]
What is a Cauchy Sequence?
A sequence of vectors {xn } in a normed vector space is called
a Cauchy sequence if for every  >0 there exists a number M
such that xm  xn   for all m, n  M .
x1
x2
x3 x4 x5 x6
Key idea: the Cauchy sequence
allows us to define notions of
convergence without ever
leaving the space
Cauchy sequences, cont.
Every convergent sequence is a Cauchy sequence.
Not every Cauchy sequence is a convergent sequence.
P (0,1) : the space of polynomials on [0,1]. Choose the l
norm
P  max[0,1] | P( x) | . Define
x 2 x3
xn
Pn ( x)  1  x    
2! 3!
n!
for n  1, 2, . Then {Pn } is a Cauchy sequence but it does
not converge in P (0,1) because its limit is not a polynomial.
Note Cauchy sequence requires choice of norm!
The notion of completeness
A normed vector space E is called complete if every Cauchy
sequence in E converges to an element of E.
A complete normed space is called a Banach space.
n
with l p norm is complete, for all 1  p  .
The sequence space l p is complete, for all 1  p  .
C[a, b] with L norm is complete.
C[a, b] with L2 norm or L1 norm is not complete.
"All square integrable functions with L2 norm" is complete.
A complete space "contains all its limit points."
It is always possible to 'complete' a non-complete space.
Hilbert and Banach spaces
A Hilbert space is a complete inner product space.
Polynomials on [0,1] with max norm
Normed
Inner Product
Banach
Hilbert
C[a, b] :  x, y   xy
Hilbert and Banach spaces, cont.
C[a, b] with L norm
Lp2
Normed
C[a, b] with L2
norm
Banach
l2
Hilbert
L2 , all square
integrable fns
l p2
Finite dim. inner product spaces
Polynomials on [0,1], max norm
How many kinds of Hilbert spaces are there?
A mapping T : E1  E2 , where E1,2 are vector spaces, is called
a linear mapping if T ( x   y )   T ( x)   T ( y )  x, y  E1 and
all scalars  , .
A Hilbert space H1 is said to be isomorphic to a Hilbert space H 2
if there exists a 1-1 linear mapping T from H1 to H 2 such that
T ( x), T ( y )   x, y
for every x, y  H1. Such a T is called a Hilbert space isomorphism
of H1 onto H 2.
How many kinds of Hilbert spaces are there?
A Hilbert space H is called separable if and only if it admits a
countable orthonormal basis. (So, all finite dimensional
Hilbert spaces are separable).
E.g. l2, L2[a, b] are separable Hilbert spaces.
Answer: 2...
Let H be separable:
If H is infinite dimensional, then it is isomorphic to l2 .
If H has dimension N , then it is isomorphic to
N
.
The Riesz Representation Theorem
A linear functional on a normed vector space {V , F} is
a linear mapping  : V  F .
The operator norm of a linear functional f is defined:
f
sup x 1 | f ( x) |
A linear functional on a normed vector space is bounded
( K s.t. f ( x)  K x  x V ) if and only if its operator
norm is finite.
The Riesz Representation Theorem
Let f be a bounded linear functional on a Hilbert space H .
Then there exists exactly one x0  H such that f ( x)   x, x0 
for all x  H , and in fact f  x0 .
Example: H  L2 [a, b],
   a  b  . Define a linear functional by
b
f ( x)
 x(t )dt
a
b
b
b
a
a
a
Linear ? f ( x   y )    x(t )   y (t )dt   x(t )dt    y (t )dt  f ( x )   f ( y )
Bounded ?
f ( x) 
b
b
b
a
a
a
 x(t )dt   x(t ) dt   x(t )1 dt


2
   x(t ) dt 


a

b
1
2
1
2


1dt  b  a x
  
a 
b
Riesz Representation Theorem, cont.
Can we find x0?
b
a
a
1:  x,1   x(t ).1dt   x(t )dt  f ( x)
f  x0 ?
Check:
x0 
Try x0
b
 1 
b 2 1/ 2
a
 ba
b
 b

 sup x 1 | f ( x) |  sup x 1 |  x(t )dt |  sup |  x(t ) dt | :  x(t ) 2 dt  1 
 a

a
a
b
f
The sup is found by choosing x  1/ b  a , a  t  b, x  0 otherwise.
 f  ba
A Brief Look Ahead
The Riesz representation theorem can be used to show that any Hilbert
space for which the evaluation functional is continuous, is a Reproducing
Kernel Hilbert Space: there exists K such that
In particular:
Representer Theorem (Kimeldorf and Wahba, 1971; Schölkopf and
Smola, 2002): Let
be a strictly monotonic increasing function
and let be an arbitrary loss function. Then each minimizer
of
the “regularized risk”
admits a representation of the form
What is a metric space?
For any set E, let  ( x, y ) be a function (with range in ) defined on
the set E  E of all ordered pairs ( x, y ) of members of E , satisfying:
(i)  ( x, y ) is a finite real number for every pair ( x, y ) of E  E;
(ii)  ( x, y )  0  x  y;
(iii)  ( y, z )   ( x, y )   ( x, z ), {x, y, z}  E.
Such a function  : E  E  is called a metric on E; a set E with
metric  is called a metric space. Different choices of metric on E
give different metric spaces.
Puzzle: What about  ( x, y)  0?
Puzzle: How about  ( x, y)   ( y, x)?
Puzzle: Where's the triangle inequality:  ( x, y)   ( x, z)   ( z, y)?
Metric versus norm
Every normed vector space is a metric space: define  ( x, y)
x y
But metric spaces are much more general:
Metric spaces
0.5
1.1
3.4
Normed spaces
5.9
6.0
2.2
1.0
4.5
3.1
d  19.5
Metrics extend "continuity": f ( B ( x))  B f ( x)
1.1
6.4
2.3
9.6
Some metrics on function spaces
Let A be the set of all bounded functions f :[a, b]  R.. For two
points f , g  A,  ( f , g ) sup x[ a ,b ] f ( x)  g ( x ) is a metric.
b
a
Let A be C[a, b] : 1 ( f , g )
2 ( f , g )
f ( x)  g ( x) dx,
   f (x)  g (x) dx  ,  ( f , g )  ?
b
2
1
2
p
a
Suppose instead A  C r [a, b] : then

,r ( f , g )  sup x[ a ,b ] f ( x)  g ( x) , f '( x)  g '( x) ,
, f ( r ) ( x)  g ( r ) ( x)

Topological Spaces
A topological space T   A, S  : A is a non-empty set, S aa fixed collection
of subsets of A, satisfying
(1) A,   S ,
(2) Intersection of any two sets in S is
i in S ,
(3) Union of any collection of sets in S is in S..
S is called a topology for A, and the members of S are called the open
sets of T .
Topological spaces are more general than metric spaces!..
Topological spaces extend "continuity": Given T1   A1 , S1  and T2   A2 , S2  and a
map  : A1  A2 ,  is "continuous" if U  A2   1 (U )  A1
Continuity, convergence, connectedness… without distance!
Putting Spaces in their Places…
S   A,  : the "indiscrete" topology on A
Topological spaces
Metric spaces
Normed vector spaces
Banach spaces
Hilbert spaces
~ The Middle ~
Thanks!

Download Report

Some Mathematical Tools for Machine Learning (part 1)

Paperzz.com

Your Paperzz