Linear Threshold Units

  h(x) = +1 if w1x1 + … + wnxn ≥ w0
         −1 otherwise

We assume that each feature xj and each weight
wj is a real number (we will relax this later)
We will study three different algorithms for
learning linear threshold units:
– Perceptron: classifier
– Logistic Regression: conditional distribution
– Linear Discriminant Analysis: joint distribution
26
What can be represented by an
LTU:
Conjunctions
  x1 ∧ x2 ∧ x4 ⇔ y
  1·x1 + 1·x2 + 0·x3 + 1·x4 ≥ 3
At least m-of-n
  at least 2 of {x1, x3, x4} ⇔ y
  1·x1 + 0·x2 + 1·x3 + 1·x4 ≥ 2
27
Things that cannot be represented:
Non-trivial disjunctions:
  (x1 ∧ x2) ∨ (x3 ∧ x4) ⇔ y
  1·x1 + 1·x2 + 1·x3 + 1·x4 ≥ 2 predicts f(⟨1, 0, 1, 0⟩) = 1
Exclusive-OR:
  (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2) ⇔ y
28
A canonical representation
Given a training example of the form
  (⟨x1, x2, x3, x4⟩, y)
transform it to
  (⟨1, x1, x2, x3, x4⟩, y)
The parameter vector will then be
  w = ⟨w0, w1, w2, w3, w4⟩.
We will call the unthresholded hypothesis u(x, w)
  u(x, w) = w · x
Each hypothesis can be written
  h(x) = sgn(u(x, w))
Our goal is to find w.
29
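The canonical representation above can be sketched in a few lines of Python. The function names (`augment`, `u`, `h`) are mine, not the slides', and sgn(0) is taken as +1 by convention:

```python
def augment(x):
    # Prepend the constant feature x0 = 1 so the threshold w0
    # becomes an ordinary weight.
    return [1.0] + list(x)

def u(x, w):
    # Unthresholded hypothesis: the dot product w . x.
    return sum(wj * xj for wj, xj in zip(w, x))

def h(x, w):
    # Thresholded hypothesis: sgn(u(x, w)), with sgn(0) taken as +1.
    return 1 if u(x, w) >= 0 else -1
```

For example, with w = ⟨−1, 1, 1⟩ the unit fires only when x1 + x2 ≥ 1.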
The LTU Hypothesis Space
Fixed size: There are O(2^(n²)) distinct
linear threshold units over n boolean
features
Deterministic
Continuous parameters
30
Geometrical View
Consider three training examples ⟨·, ·⟩ (coordinates shown in the
original figure). We want a classifier that looks like the following:
[figure: a line separating the positive from the negative examples]
31
The Unthresholded Discriminant
Function is a Hyperplane
The equation
  u(x) = w · x
is a plane
  ŷ = +1 if u(x) ≥ 0
       −1 otherwise
32
Machine Learning and Optimization
When learning a classifier, the natural way to
formulate the learning problem is the following:
– Given:
  A set of N training examples {(x1, y1), (x2, y2), …, (xN, yN)}
  A loss function L
– Find:
  The weight vector w that minimizes the expected loss on the
  training data
  J(w) = (1/N) Σi L(sgn(w · xi), yi)
In general, machine learning algorithms apply
some optimization algorithm to find a good
hypothesis. In this case, J is piecewise
constant, which makes this a difficult problem
33
Approximating the expected loss by
a smooth function
Simplify the optimization problem by replacing the
original objective function by a smooth, differentiable
function. For example, consider the hinge loss:
  J̃(w) = (1/N) Σi max(0, −yi w · xi)
[figure: the loss plotted as a function of w · xi when yi = 1]
34
Minimizing J̃ by Gradient Descent Search
Start with weight vector w0
Compute gradient
  ∇J̃(w0) = ( ∂J̃(w0)/∂w0, ∂J̃(w0)/∂w1, …, ∂J̃(w0)/∂wn )
Compute w1 = w0 − η ∇J̃(w0)
where η is a “step size” parameter
Repeat until convergence
35
Computing the Gradient
Let J̃i(w) = max(0, −yi w · xi). Then
  ∂J̃(w)/∂wk = (1/N) Σi ∂J̃i(w)/∂wk
  ∂J̃i(w)/∂wk = ∂/∂wk [ max(0, −yi Σj wj xij) ]
              = −yi xik if yi Σj wj xij < 0
                0       otherwise
36
Batch Perceptron Algorithm
Given: training examples (xi, yi), i = 1 … N
Let w = (0, 0, …, 0) be the initial weight vector.
Repeat until convergence:
  Let g = (0, 0, …, 0) be the gradient vector.
  For i = 1 to N do:
    ui = w · xi
    If yi · ui ≤ 0:
      For j = 1 to n do:
        gj = gj + yi · xij
  g = g / N
  w = w + η g
Simplest case: η = 1, don’t normalize g: “Fixed Increment Perceptron”
37
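The batch algorithm can be sketched as follows. This is a minimal reading of the pseudocode above, assuming labels in {−1, +1} and feature vectors that already include the constant feature; the fixed-increment variant (η = 1, g not normalized) is used:

```python
def batch_perceptron(X, y, eta=1.0, max_epochs=100):
    # X: list of feature vectors (constant feature already prepended),
    # y: labels in {-1, +1}.
    n = len(X[0])
    w = [0.0] * n
    for _ in range(max_epochs):
        g = [0.0] * n
        for xi, yi in zip(X, y):
            ui = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * ui <= 0:                # example misclassified (or on the boundary)
                for j in range(n):
                    g[j] += yi * xi[j]      # accumulate the update direction
        if all(gj == 0.0 for gj in g):      # no mistakes this epoch: converged
            return w
        w = [wj + eta * gj for wj, gj in zip(w, g)]
    return w
```

On a linearly separable toy set (e.g. the conjunction x1 ∧ x2 with a bias feature) this converges to a separating weight vector within a few epochs.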
Online Perceptron Algorithm
Let w = (0, 0, …, 0) be the initial weight vector.
Repeat forever:
  Accept training example i: ⟨xi, yi⟩
  ui = w · xi
  If yi · ui ≤ 0:
    For j = 1 to n do:
      gj = yi · xij
    w = w + η g
This is called stochastic gradient descent because the
overall gradient is approximated by the gradient from each
individual example
38
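A single online step can be written as a small function (the name `online_perceptron_step` is mine); the caller feeds examples one at a time, in contrast to the batch version, which sums updates over a full pass:

```python
def online_perceptron_step(w, x, y, eta=1.0):
    # One stochastic-gradient step: update w only if (x, y) is
    # misclassified (or lies exactly on the boundary).
    u = sum(wj * xj for wj, xj in zip(w, x))
    if y * u <= 0:
        w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w
```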
Learning Rates and Convergence
The learning rate η must decrease to zero in order to guarantee
convergence. The online case is known as the Robbins-Monro
algorithm. It is guaranteed to converge under the following
assumptions:
  lim(t→∞) ηt = 0
  Σt ηt = ∞
  Σt ηt² < ∞
The learning rate is also called the step size. Some algorithms (e.g.,
Newton’s method, conjugate gradient) choose the step size
automatically and converge faster
There is only one “basin” for linear threshold units, so a local
minimum is the global minimum. Choosing a good starting point can
make the algorithm converge faster
39
Decision Boundaries
A classifier can be viewed as partitioning the input space or feature
space X into decision regions
A linear threshold unit always produces a linear decision boundary.
A set of points that can be separated by a linear decision boundary
is said to be linearly separable.
40
Exclusive-OR is Not Linearly
Separable
41
Extending Perceptron to More than
Two Classes
If we have K > 2 classes, we can learn a
separate LTU for each class. Let wk be the
weight vector for class k. We train it by treating
examples from class y = k as the positive
examples and treating the examples from all
other classes as negative examples. Then we
classify a new data point x according to
  ŷ = argmax_k wk · x.
42
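The one-versus-rest prediction rule above is just an argmax over the K linear scores. A minimal sketch (function name mine):

```python
def predict_multiclass(ws, x):
    # ws: one weight vector per class; return the class k maximizing wk . x.
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in ws]
    return max(range(len(ws)), key=lambda k: scores[k])
```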
Summary of Perceptron algorithm
for LTUs
Directly Learns a Classifier
Local Search
– Begins with an initial weight vector. Modifies it
iteratively to minimize an error function. The error
function is loosely related to the goal of minimizing
the number of classification errors
Eager
– The classifier is constructed from the training
examples
– The training examples can then be discarded
Online or Batch
– Both variants of the algorithm can be used
43
Logistic Regression
Learn the conditional distribution P(y | x)
Let py(x; w) be our estimate of P(y | x), where w is a
vector of adjustable parameters. Assume only two
classes y = 0 and y = 1, and
  p1(x; w) = exp(w · x) / (1 + exp(w · x))
  p0(x; w) = 1 − p1(x; w).
On the homework, you will show that this is equivalent to
  log [ p1(x; w) / p0(x; w) ] = w · x.
In other words, the log odds of class 1 is a linear function
of x.
44
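The two identities above can be checked numerically. A small sketch (function names mine): `p1` computes the logistic probability and `log_odds` recovers the linear score w · x from it:

```python
import math

def p1(x, w):
    # Estimated P(y = 1 | x): exp(w . x) / (1 + exp(w . x)).
    s = sum(wj * xj for wj, xj in zip(w, x))
    return math.exp(s) / (1.0 + math.exp(s))

def log_odds(x, w):
    # The log odds of class 1 recover the linear score w . x.
    p = p1(x, w)
    return math.log(p / (1.0 - p))
```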
Why the exp function?
One reason: a linear function has a range from
(−∞, +∞) and we need to force it to be positive
and sum to 1 in order to be a probability:
45
Deriving a Learning Algorithm
Since we are fitting a conditional probability distribution, we no
longer seek to minimize the loss on the training data. Instead, we
seek to find the probability distribution h that is most likely given the
training data
Let S be the training sample. Our goal is to find h to maximize P(h | S):
  argmax_h P(h | S) = argmax_h P(S | h) P(h) / P(S)   (by Bayes’ Rule)
                    = argmax_h P(S | h) P(h)          (because P(S) doesn’t depend on h)
                    = argmax_h P(S | h)               (if we assume P(h) uniform)
                    = argmax_h log P(S | h)           (because log is monotonic)
The distribution P(S | h) is called the likelihood function. The log
likelihood is frequently used as the objective function for learning. It is
often written as ℓ(w).
The h that maximizes the likelihood on the training data is called the
maximum likelihood estimator (MLE)
46
Computing the Likelihood
In our framework, we assume that each training
example (xi, yi) is drawn independently from the same
(but unknown) probability distribution P(x, y). This
means that the log likelihood of S is the sum of
the log likelihoods of the individual training
examples:
  log P(S | h) = log Πi P(xi, yi | h)
               = Σi log P(xi, yi | h)
47
Computing the Likelihood (2)
Recall that any joint distribution P(a, b) can be
factored as P(a | b) P(b). Hence, we can write
  argmax_h log P(S | h) = argmax_h Σi log P(xi, yi | h)
                        = argmax_h Σi log P(yi | xi, h) P(xi | h)
In our case, P(x | h) = P(x), because it does not
depend on h, so
  argmax_h log P(S | h) = argmax_h Σi log P(yi | xi, h) P(xi | h)
                        = argmax_h Σi log P(yi | xi, h)
48
Log Likelihood for Conditional
Probability Estimators
We can express the log likelihood in a compact
form known as the cross entropy.
Consider an example (xi, yi)
– If yi = 0, the log likelihood is log [1 − p1(xi; w)]
– If yi = 1, the log likelihood is log [p1(xi; w)]
These cases are mutually exclusive, so we can
combine them to obtain:
  ℓ(yi; xi, w) = log P(yi | xi, w) = (1 − yi) log[1 − p1(xi; w)] + yi log p1(xi; w)
The goal of our learning algorithm will be to find
w to maximize
  J(w) = Σi ℓ(yi; xi, w)
49
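The cross-entropy objective J(w) is straightforward to evaluate. A sketch (function names mine), assuming labels in {0, 1}:

```python
import math

def p1(x, w):
    # Logistic estimate of P(y = 1 | x).
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))

def log_likelihood(X, y, w):
    # J(w) = sum_i (1 - yi) log[1 - p1(xi; w)] + yi log p1(xi; w)
    total = 0.0
    for xi, yi in zip(X, y):
        p = p1(xi, w)
        total += (1 - yi) * math.log(1 - p) + yi * math.log(p)
    return total
```

With w = 0 every example gets probability ½, so J(w) = −N log 2 — a useful sanity check when debugging a training loop.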
Fitting Logistic Regression by
Gradient Ascent
  ∂J(w)/∂wj = Σi ∂ℓ(yi; xi, w)/∂wj
  ∂ℓ(yi; xi, w)/∂wj
    = ∂/∂wj [ yi log p1(xi; w) + (1 − yi) log(1 − p1(xi; w)) ]
    = [ yi / p1(xi; w) ] ∂p1(xi; w)/∂wj − [ (1 − yi) / (1 − p1(xi; w)) ] ∂p1(xi; w)/∂wj
    = [ yi (1 − p1(xi; w)) − (1 − yi) p1(xi; w) ] / [ p1(xi; w) (1 − p1(xi; w)) ] · ∂p1(xi; w)/∂wj
    = [ yi − p1(xi; w) ] / [ p1(xi; w) (1 − p1(xi; w)) ] · ∂p1(xi; w)/∂wj
50
Gradient Computation (continued)
Note that p1 can also be written as
  p1(xi; w) = 1 / (1 + exp[−w · xi]).
From this, we obtain:
  ∂p1(xi; w)/∂wj = ∂/∂wj (1 + exp[−w · xi])⁻¹
    = −(1 + exp[−w · xi])⁻² · ∂/∂wj exp[−w · xi]
    = (1 + exp[−w · xi])⁻² · exp[−w · xi] · xij
    = [ 1 / (1 + exp[−w · xi]) ] · [ exp[−w · xi] / (1 + exp[−w · xi]) ] · xij
    = p1(xi; w) (1 − p1(xi; w)) xij
51
Completing the Gradient
Computation
The gradient of the log likelihood of a
single point is therefore
  ∂ℓ(yi; xi, w)/∂wj
    = [ yi − p1(xi; w) ] / [ p1(xi; w) (1 − p1(xi; w)) ] · ∂p1(xi; w)/∂wj
    = [ yi − p1(xi; w) ] / [ p1(xi; w) (1 − p1(xi; w)) ] · p1(xi; w) (1 − p1(xi; w)) xij
    = [ yi − p1(xi; w) ] xij
The overall gradient is
  ∂J(w)/∂wj = Σi [ yi − p1(xi; w) ] xij
52
Batch Gradient Ascent for Logistic Regression
Given: training examples (xi, yi), i = 1 … N
Let w = (0, 0, …, 0) be the initial weight vector.
Repeat until convergence:
  Let g = (0, 0, …, 0) be the gradient vector.
  For i = 1 to N do:
    pi = 1 / (1 + exp[−w · xi])
    errori = yi − pi
    For j = 1 to n do:
      gj = gj + errori · xij
  w = w + η g   (step in direction of increasing gradient)
An online gradient ascent algorithm can be constructed, of course
Most statistical packages use a second-order (Newton-Raphson)
algorithm for faster convergence. Each iteration of the second-order
method can be viewed as a weighted least squares computation, so
the algorithm is known as Iteratively-Reweighted Least Squares
(IRLS)
53
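The batch gradient ascent loop above translates directly to code. A minimal sketch (function name, step size, and epoch count are mine), assuming labels in {0, 1} and a constant feature in each row; a fixed η and a fixed number of epochs stand in for a real convergence test:

```python
import math

def fit_logistic(X, y, eta=0.1, epochs=1000):
    # Batch gradient ascent on the log likelihood.
    n = len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        g = [0.0] * n
        for xi, yi in zip(X, y):
            s = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-s))
            err = yi - p                    # error_i = yi - pi
            for j in range(n):
                g[j] += err * xi[j]
        w = [wj + eta * gj for wj, gj in zip(w, g)]   # ascend the gradient
    return w
```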
Logistic Regression Implements a
Linear Discriminant Function
In the 2-class 0/1 loss function case, we should
predict ŷ = 1 if the expected loss of predicting 1 is less
than the expected loss of predicting 0:
  E_{y|x}[L(1, y)] < E_{y|x}[L(0, y)]
  Σy P(y | x) L(1, y) < Σy P(y | x) L(0, y)
  P(y = 0 | x) L(1, 0) < P(y = 1 | x) L(0, 1)
With 0/1 loss, L(1, 0) = L(0, 1) = 1, so we predict ŷ = 1 if
  P(y = 1 | x) > P(y = 0 | x)
  P(y = 1 | x) / P(y = 0 | x) > 1
  log [ P(y = 1 | x) / P(y = 0 | x) ] > 0
  w · x > 0
A similar derivation can be done for arbitrary
L(0,1) and L(1,0).
54
Extending Logistic Regression to K > 2 classes
Choose class K to be the “reference class” and
represent each of the other classes as a logistic
function of the odds of class k versus class K:
  log [ P(y = 1 | x) / P(y = K | x) ] = w1 · x
  log [ P(y = 2 | x) / P(y = K | x) ] = w2 · x
  …
  log [ P(y = K−1 | x) / P(y = K | x) ] = w(K−1) · x
Gradient ascent can be applied to
simultaneously train all of these weight vectors
wk
55
Logistic Regression for K > 2 (continued)
The conditional probability for class k ≠ K can be
computed as
  P(y = k | x) = exp(wk · x) / (1 + Σ_{ℓ=1}^{K−1} exp(wℓ · x))
For class K, the conditional probability is
  P(y = K | x) = 1 / (1 + Σ_{ℓ=1}^{K−1} exp(wℓ · x))
56
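These two formulas can be sketched together (function name mine): given K−1 weight vectors for the non-reference classes, the function returns all K class probabilities, which sum to 1 by construction:

```python
import math

def class_probs(ws, x):
    # ws: K-1 weight vectors, one per non-reference class;
    # class K (the last entry of the result) is the reference class.
    scores = [math.exp(sum(wj * xj for wj, xj in zip(w, x))) for w in ws]
    z = 1.0 + sum(scores)
    probs = [s / z for s in scores]   # P(y = k | x), k = 1..K-1
    probs.append(1.0 / z)             # P(y = K | x)
    return probs
```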
Summary of Logistic Regression
Learns conditional probability distribution P(y | x)
Local Search
– begins with initial weight vector. Modifies it iteratively
to maximize the log likelihood of the data
Eager
– the classifier is constructed from the training
examples, which can then be discarded
Online or Batch
– both online and batch variants of the algorithm exist
57
Linear Discriminant Analysis
Learn P(x, y). This is sometimes
called the generative approach,
because we can think of P(x, y) as a
model of how the data is generated.
– For example, if we factor the joint
distribution into the form
  P(x, y) = P(y) P(x | y)
– we can think of P(y) as “generating” a
value for y according to P(y). Then we
can think of P(x | y) as generating a value
for x given the previously-generated
value for y.
– This can be described as a Bayesian
network: [figure: node y with an arrow to node x]
58
Linear Discriminant Analysis (2)
P(y) is a discrete multinomial distribution
– example: P(y = 0) = 0.31, P(y = 1) = 0.69 will
generate 31% negative examples and 69%
positive examples
For LDA, we assume that P(x | y) is a
multivariate normal distribution with
mean μk and covariance matrix Σ:
  P(x | y = k) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −½ [x − μk]ᵀ Σ⁻¹ [x − μk] )
59
Multivariate Normal Distributions:
A tutorial
Recall that the univariate normal (Gaussian) distribution has the formula
  p(x) = 1 / ((2π)^(1/2) σ) · exp( −(x − μ)² / (2σ²) )
where μ is the mean and σ² is the variance
Graphically, it looks like this: [figure: bell curve]
60
The Multivariate Gaussian
A 2-dimensional Gaussian is defined by a
mean vector μ = (μ1, μ2) and a covariance
matrix
  Σ = [ σ²₁,₁  σ²₁,₂ ]
      [ σ²₂,₁  σ²₂,₂ ]
where σ²i,j = E[(xi − μi)(xj − μj)] is the
variance (if i = j) or co-variance (if i ≠ j). Σ
is symmetric and positive-definite.
61
The Multivariate Gaussian (2)
If Σ is the identity matrix
  Σ = [ 1  0 ]
      [ 0  1 ]
and μ = (0, 0), we get the standard normal
distribution: [figure]
62
The Multivariate Gaussian (3)
If Σ is a diagonal matrix, then x1 and x2 are independent random
variables, and lines of equal probability are ellipses parallel to the
coordinate axes. For example, with a diagonal Σ and μ (values
shown in the original figure) we obtain: [figure]
63
The Multivariate Gaussian (4)
Finally, if Σ is an arbitrary matrix, then x1 and x2 are
dependent, and lines of equal probability are ellipses
tilted relative to the coordinate axes. For example, with a
non-diagonal Σ and μ (values shown in the original
figure) we obtain: [figure]
64
Estimating a Multivariate Gaussian
Given a set of N data points {x1, …, xN}, we can compute
the maximum likelihood estimate for the multivariate
Gaussian distribution as follows:
  μ̂ = (1/N) Σi xi
  Σ̂ = (1/N) Σi (xi − μ̂) · (xi − μ̂)ᵀ
Note that the dot product in the second equation is an
outer product. The outer product of two vectors is a
matrix:
  x · yᵀ = [ x1 ]                  [ x1y1  x1y2  x1y3 ]
           [ x2 ] · [y1 y2 y3]  = [ x2y1  x2y2  x2y3 ]
           [ x3 ]                  [ x3y1  x3y2  x3y3 ]
For comparison, the usual dot product is written as xᵀ · y
65
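The two MLE formulas can be sketched directly in Python (function name mine), accumulating the outer products (x − μ̂)(x − μ̂)ᵀ element by element:

```python
def fit_gaussian(xs):
    # mu_hat    = (1/N) sum_i xi
    # Sigma_hat = (1/N) sum_i (xi - mu_hat)(xi - mu_hat)^T
    N, d = len(xs), len(xs[0])
    mu = [sum(x[j] for x in xs) / N for j in range(d)]
    sigma = [[0.0] * d for _ in range(d)]
    for x in xs:
        diff = [x[j] - mu[j] for j in range(d)]
        for a in range(d):            # outer product diff . diff^T is d x d
            for b in range(d):
                sigma[a][b] += diff[a] * diff[b] / N
    return mu, sigma
```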
The LDA Model
Linear discriminant analysis assumes that the
joint distribution has the form
  P(x, y) = P(y) · 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −½ [x − μy]ᵀ Σ⁻¹ [x − μy] )
where each μy is the mean of a multivariate
Gaussian for examples belonging to class y and
Σ is a single covariance matrix shared by all
classes.
66
Fitting the LDA Model
It is easy to learn the LDA model in a single pass
through the data:
– Let π̂k be our estimate of P(y = k)
– Let Nk be the number of training examples belonging to class k
  π̂k = Nk / N
  μ̂k = (1/Nk) Σ_{i : yi = k} xi
  Σ̂ = (1/N) Σi (xi − μ̂yi) · (xi − μ̂yi)ᵀ
Note that each xi has its corresponding μ̂yi subtracted
prior to taking the outer product. This gives us the
“pooled” estimate of Σ
67
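The single-pass fit can be sketched as follows (function name mine), assuming integer class labels 0 … K−1; note that the pooled Σ̂ subtracts each example's own class mean:

```python
def fit_lda(xs, ys, K=2):
    # pi_hat_k = Nk / N ; mu_hat_k = class-k mean ; Sigma_hat = pooled covariance
    N, d = len(xs), len(xs[0])
    counts = [sum(1 for y in ys if y == k) for k in range(K)]
    pis = [c / N for c in counts]
    mus = [[sum(x[j] for x, y in zip(xs, ys) if y == k) / counts[k]
            for j in range(d)] for k in range(K)]
    sigma = [[0.0] * d for _ in range(d)]
    for x, y in zip(xs, ys):
        diff = [x[j] - mus[y][j] for j in range(d)]   # subtract the class mean mu_{yi}
        for a in range(d):
            for b in range(d):
                sigma[a][b] += diff[a] * diff[b] / N  # pooled estimate of Sigma
    return pis, mus, sigma
```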
LDA learns an LTU
Consider the 2-class case with a 0/1 loss function. Recall that
  P(y = 0 | x) = P(x, y = 0) / (P(x, y = 0) + P(x, y = 1))
  P(y = 1 | x) = P(x, y = 1) / (P(x, y = 0) + P(x, y = 1))
Also recall from our derivation of the Logistic Regression classifier
that we should classify into class ŷ = 1 if
  log [ P(y = 1 | x) / P(y = 0 | x) ] > 0
Hence, for LDA, we should classify into ŷ = 1 if
  log [ P(x, y = 1) / P(x, y = 0) ] > 0
because the denominators cancel
68
LDA learns an LTU (2)
  P(x, y) = P(y) · 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −½ [x − μy]ᵀ Σ⁻¹ [x − μy] )
  P(x, y = 1) / P(x, y = 0)
    = [ P(y = 1) exp( −½ [x − μ1]ᵀ Σ⁻¹ [x − μ1] ) ] / [ P(y = 0) exp( −½ [x − μ0]ᵀ Σ⁻¹ [x − μ0] ) ]
  log [ P(x, y = 1) / P(x, y = 0) ]
    = log [ P(y = 1) / P(y = 0) ] − ½ ( [x − μ1]ᵀ Σ⁻¹ [x − μ1] − [x − μ0]ᵀ Σ⁻¹ [x − μ0] )
69
LDA learns an LTU (3)
Let’s focus on the term in brackets:
  [x − μ1]ᵀ Σ⁻¹ [x − μ1] − [x − μ0]ᵀ Σ⁻¹ [x − μ0]
Expand the quadratic forms as follows:
  [x − μ1]ᵀ Σ⁻¹ [x − μ1] = xᵀ Σ⁻¹ x − xᵀ Σ⁻¹ μ1 − μ1ᵀ Σ⁻¹ x + μ1ᵀ Σ⁻¹ μ1
  [x − μ0]ᵀ Σ⁻¹ [x − μ0] = xᵀ Σ⁻¹ x − xᵀ Σ⁻¹ μ0 − μ0ᵀ Σ⁻¹ x + μ0ᵀ Σ⁻¹ μ0
Subtract the lower from the upper line and collect similar
terms. Note that the quadratic terms cancel! This
leaves only terms linear in x:
  −xᵀ Σ⁻¹ μ1 + xᵀ Σ⁻¹ μ0 − μ1ᵀ Σ⁻¹ x + μ0ᵀ Σ⁻¹ x + μ1ᵀ Σ⁻¹ μ1 − μ0ᵀ Σ⁻¹ μ0
70
LDA learns an LTU (4)
  −xᵀ Σ⁻¹ μ1 + xᵀ Σ⁻¹ μ0 − μ1ᵀ Σ⁻¹ x + μ0ᵀ Σ⁻¹ x + μ1ᵀ Σ⁻¹ μ1 − μ0ᵀ Σ⁻¹ μ0
Note that since Σ⁻¹ is symmetric, aᵀ Σ⁻¹ b = bᵀ Σ⁻¹ a
for any two vectors a and b. Hence, the terms linear in x
can be combined to give
  −2 xᵀ Σ⁻¹ (μ1 − μ0) + μ1ᵀ Σ⁻¹ μ1 − μ0ᵀ Σ⁻¹ μ0.
Now plug this back in:
  log [ P(x, y = 1) / P(x, y = 0) ]
    = log [ P(y = 1) / P(y = 0) ] − ½ ( −2 xᵀ Σ⁻¹ (μ1 − μ0) + μ1ᵀ Σ⁻¹ μ1 − μ0ᵀ Σ⁻¹ μ0 )
    = log [ P(y = 1) / P(y = 0) ] + xᵀ Σ⁻¹ (μ1 − μ0) − ½ ( μ1ᵀ Σ⁻¹ μ1 − μ0ᵀ Σ⁻¹ μ0 )
71
LDA learns an LTU (5)
  log [ P(x, y = 1) / P(x, y = 0) ]
    = log [ P(y = 1) / P(y = 0) ] + xᵀ Σ⁻¹ (μ1 − μ0) − ½ ( μ1ᵀ Σ⁻¹ μ1 − μ0ᵀ Σ⁻¹ μ0 )
Let
  w = Σ⁻¹ (μ1 − μ0)
  c = log [ P(y = 1) / P(y = 0) ] − ½ ( μ1ᵀ Σ⁻¹ μ1 − μ0ᵀ Σ⁻¹ μ0 )
Then we will classify into class ŷ = 1 if
  w · x + c > 0.
This is an LTU
72
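The conversion of fitted LDA parameters into LTU weights can be sketched for the 2-D case (function names mine; the 2×2 matrix is inverted with the closed-form formula to keep the sketch dependency-free):

```python
import math

def lda_to_ltu(pi0, pi1, mu0, mu1, sigma):
    # w = Sigma^{-1} (mu1 - mu0)
    # c = log[P(y=1)/P(y=0)] - (1/2)(mu1^T Sigma^{-1} mu1 - mu0^T Sigma^{-1} mu0)
    (s00, s01), (s10, s11) = sigma
    det = s00 * s11 - s01 * s10
    inv = [[s11 / det, -s01 / det], [-s10 / det, s00 / det]]
    mv = lambda M, v: [M[0][0]*v[0] + M[0][1]*v[1],
                       M[1][0]*v[0] + M[1][1]*v[1]]
    dot = lambda u, v: u[0]*v[0] + u[1]*v[1]
    w = mv(inv, [mu1[0] - mu0[0], mu1[1] - mu0[1]])
    c = math.log(pi1 / pi0) - 0.5 * (dot(mu1, mv(inv, mu1)) - dot(mu0, mv(inv, mu0)))
    return w, c

def classify(w, c, x):
    # The resulting LTU: predict class 1 iff w . x + c > 0.
    return 1 if w[0]*x[0] + w[1]*x[1] + c > 0 else 0
```

With equal priors, μ0 = (0, 0), μ1 = (2, 0) and Σ = I, the boundary falls midway between the means, at x1 = 1.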
Two Geometric Views of LDA
View 1: Mahalanobis Distance
The quantity D_M(x, u)² = (x − u)ᵀ Σ⁻¹ (x − u) is known as
the (squared) Mahalanobis distance between x and u. We can think
of the matrix Σ⁻¹ as a linear distortion of the coordinate system that
converts the standard Euclidean distance into the Mahalanobis
distance
Note that
  log [ πk P(x | y = k) ] = log πk − ½ (x − μk)ᵀ Σ⁻¹ (x − μk) + const
                          = log πk − ½ D_M(x, μk)² + const
Therefore, we can view LDA as computing
  D_M(x, μ0)² and D_M(x, μ1)²
and then classifying x according to which mean μ0 or μ1 is closest in
Mahalanobis distance (corrected by log πk)
73
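The Mahalanobis view can be sketched directly (function names mine; 2-D case, with Σ⁻¹ supplied rather than computed):

```python
import math

def mahalanobis2(x, u, sigma_inv):
    # Squared Mahalanobis distance (x - u)^T Sigma^{-1} (x - u).
    d = [x[0] - u[0], x[1] - u[1]]
    v = [sigma_inv[0][0]*d[0] + sigma_inv[0][1]*d[1],
         sigma_inv[1][0]*d[0] + sigma_inv[1][1]*d[1]]
    return d[0]*v[0] + d[1]*v[1]

def lda_classify(x, mus, pis, sigma_inv):
    # Pick the class whose mean is nearest in Mahalanobis
    # distance, corrected by log pi_k.
    scores = [math.log(pis[k]) - 0.5 * mahalanobis2(x, mus[k], sigma_inv)
              for k in range(len(mus))]
    return max(range(len(mus)), key=lambda k: scores[k])
```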
View 2: Most Informative Low-Dimensional Projection
LDA can also be viewed as finding a hyperplane of
dimension K − 1 such that x and the {μk} are projected
down into this hyperplane and then x is classified to the
nearest μk using Euclidean distance inside this
hyperplane
74
Generalizations of LDA
General Gaussian Classifier
– Instead of assuming that all classes share the same
Σ, we can allow each class k to have its own Σk. In
this case, the resulting classifier will be a quadratic
threshold unit (instead of an LTU)
Naïve Gaussian Classifier
– Allow each class to have its own Σk, but require that
each Σk be diagonal. This means that within each
class, any pair of features xj1 and xj2 will be assumed
to be statistically independent. The resulting classifier
is still a quadratic threshold unit (but with a restricted
form)
75
Summary of
Linear Discriminant Analysis
Learns the joint probability distribution P(x, y).
Direct Computation. The maximum likelihood estimate
of P(x, y) can be computed from the data without search.
However, inverting the Σ matrix requires O(n³) time.
Eager. The classifier is constructed from the training
examples. The examples can then be discarded.
Batch. Only a batch algorithm is available. An online
algorithm could be constructed if there is an online
algorithm for incrementally updating Σ⁻¹. [This is easy for
the case where Σ is diagonal.]
76
Comparing Perceptron, Logistic
Regression, and LDA
How should we choose among these three
algorithms?
There is a big debate within the machine
learning community!
77
Issues in the Debate
Statistical Efficiency. If the generative model
P(x,y) is correct, then LDA usually gives the
highest accuracy, particularly when the amount
of training data is small. If the model is correct,
LDA requires 30% less data than Logistic
Regression in theory
Computational Efficiency. Generative models
typically are the easiest to learn. In our
example, LDA can be computed directly from the
data without using gradient descent.
78
Issues in the Debate
Robustness to changing loss functions. Both generative
and conditional probability models allow the loss function
to be changed at run time without re-learning.
Perceptron requires re-training the classifier when the
loss function changes.
Robustness to model assumptions. The generative
model usually performs poorly when the assumptions
are violated. For example, if P(x | y) is very non-Gaussian,
then LDA won’t work well. Logistic Regression is more
robust to model assumptions, and Perceptron is even more robust.
Robustness to missing values and noise. In many
applications, some of the features xij may be missing or
corrupted in some of the training examples. Generative
models typically provide better ways of handling this than
non-generative models.
79