Neural networks

FUNCTION LEARNING AND NEURAL NETS
SETTING
Learn a function with:
- Continuous-valued inputs
  - E.g., the pixels of an image
- Continuous-valued output
  - E.g., the likelihood that the image is a '7'
- This is known as regression
  [Regression can be turned into classification via thresholds]
FUNCTION-LEARNING (REGRESSION) FORMULATION
- Goal function f
- Training set: (x^(i), y^(i)), i = 1,…,n, where y^(i) = f(x^(i))
- Inductive inference: find a function h that fits the points well
- Same Keep-It-Simple bias
[Figure: training points lying on the curve f(x)]
LEAST-SQUARES FITTING
- Hypothesize a class of functions g(x,θ) parameterized by θ
- Minimize the squared loss E(θ) = Σ_i (g(x^(i),θ) − y^(i))^2
[Figure: candidate fits to the training points on f(x)]
LINEAR LEAST-SQUARES
- Model: g(x,θ) = θ x
- The value of θ that minimizes E(θ) is θ = [Σ_i x^(i) y^(i)] / [Σ_i (x^(i))^2]
- Derivation:
  E(θ) = Σ_i (x^(i) θ − y^(i))^2 = Σ_i ((x^(i))^2 θ^2 − 2 x^(i) y^(i) θ + (y^(i))^2)
  Setting E′(θ) = 0:
  E′(θ) = Σ_i (2 (x^(i))^2 θ − 2 x^(i) y^(i)) = 0
  ⇒ θ = [Σ_i x^(i) y^(i)] / [Σ_i (x^(i))^2]
[Figure: the linear fit g(x,θ) plotted against the true curve f(x)]
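
As a quick numerical check, a minimal Python sketch of this closed form (the function name and data are my own illustration):

    import numpy as np

    # Closed-form slope for the no-offset linear model g(x, theta) = theta * x:
    # theta = [sum_i x^(i) y^(i)] / [sum_i (x^(i))^2].
    def fit_linear_no_offset(x, y):
        return np.sum(x * y) / np.sum(x * x)

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 2.0 * x + np.array([0.1, -0.1, 0.05, 0.0])   # data close to y = 2x
    print(fit_linear_no_offset(x, y))                # prints roughly 2.0
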
LINEAR LEAST-SQUARES WITH CONSTANT OFFSET
- Model: g(x,θ0,θ1) = θ0 + θ1 x
- E(θ0,θ1) = Σ_i (θ0 + θ1 x^(i) − y^(i))^2
  = Σ_i (θ0^2 + θ1^2 (x^(i))^2 + (y^(i))^2 + 2θ0θ1 x^(i) − 2θ0 y^(i) − 2θ1 x^(i) y^(i))
- At the minimum (θ0*, θ1*), ∂E/∂θ0 = 0 and ∂E/∂θ1 = 0, so:
  0 = 2 Σ_i (θ0* + θ1* x^(i) − y^(i))
  0 = 2 Σ_i x^(i) (θ0* + θ1* x^(i) − y^(i))
- Verify the solution:
  θ0* = (1/N) Σ_i (y^(i) − θ1* x^(i))
  θ1* = [N Σ_i x^(i) y^(i) − (Σ_i x^(i))(Σ_i y^(i))] / [N Σ_i (x^(i))^2 − (Σ_i x^(i))^2]
[Figure: the affine fit g(x,θ) plotted against f(x)]
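
A minimal sketch of these formulas in Python, checked against NumPy's polynomial fit (the function name and data are my own):

    import numpy as np

    # Closed-form offset fit: theta1* from the ratio of sums, then theta0*
    # as the mean residual, exactly as on the slide.
    def fit_linear_with_offset(x, y):
        n = len(x)
        theta1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / \
                 (n * np.sum(x**2) - np.sum(x)**2)
        theta0 = np.mean(y - theta1 * x)
        return theta0, theta1

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])     # data close to y = 2x + 1
    print(fit_linear_with_offset(x, y))    # (about 1.09, about 1.94)
    print(np.polyfit(x, y, 1))             # the same line, as [theta1, theta0]
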
MULTI-DIMENSIONAL LEAST-SQUARES
- Let x include the attributes (x1,…,xN)
- Let θ include the coefficients (θ1,…,θN)
- Model: g(x,θ) = x1 θ1 + … + xN θN
[Figure: a linear fit g(x,θ) to f(x) in several dimensions]
MULTI-DIMENSIONAL LEAST-SQUARES
- g(x,θ) = x1 θ1 + … + xN θN
- The best θ is given by the normal equations: θ = (AᵀA)⁻¹ Aᵀ b
- where A is the matrix whose rows are the x^(i)'s, and b is the vector of the y^(i)'s
[Figure: the resulting fit g(x,θ) against f(x)]
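
A minimal NumPy sketch of the normal equations (the matrix A and the θ used to generate b are my own illustration); np.linalg.lstsq is the numerically safer route in practice:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 3.0],
                  [1.0, 0.0]])           # rows are the x^(i)'s
    b = A @ np.array([2.0, -1.0])        # y^(i)'s generated from theta = (2, -1)

    theta = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
    print(theta)                                   # [ 2. -1.]
    print(np.linalg.lstsq(A, b, rcond=None)[0])    # same answer
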
NONLINEAR LEAST-SQUARES
- E.g., quadratic: g(x,θ) = θ0 + θ1 x + θ2 x^2
- E.g., exponential: g(x,θ) = exp(θ0 + θ1 x)
- Any combination: g(x,θ) = exp(θ0 + θ1 x) + θ2 + θ3 x
- Fitting can be done using gradient descent
[Figure: linear, quadratic, and other fits to f(x)]
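
Besides hand-rolled gradient descent, an off-the-shelf nonlinear least-squares solver can do the fitting. A minimal sketch with SciPy's curve_fit on the exponential model (the data and the starting point p0 are my own illustration):

    import numpy as np
    from scipy.optimize import curve_fit

    # Exponential model g(x, theta) = exp(theta0 + theta1 * x)
    def g(x, theta0, theta1):
        return np.exp(theta0 + theta1 * x)

    x = np.linspace(0.0, 2.0, 20)
    y = np.exp(0.5 + 1.5 * x)                       # data from theta = (0.5, 1.5)
    theta_hat, _ = curve_fit(g, x, y, p0=[0.0, 1.0])
    print(theta_hat)                                # close to [0.5, 1.5]
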
GRADIENT DESCENT
- g(x,θ) = x1 θ1 + … + xN θN
- Error: E(θ) = Σ_i (g(x^(i),θ) − y^(i))^2
- Take the derivative:
  dE(θ)/dθ = 2 Σ_i dg(x^(i),θ)/dθ (g(x^(i),θ) − y^(i))
- Since dg(x^(i),θ)/dθ = x^(i):
  dE(θ)/dθ = 2 Σ_i x^(i) (g(x^(i),θ) − y^(i))
- Update rule: θ ← θ − α Σ_i x^(i) (g(x^(i),θ) − y^(i))
- Convergence to the global minimum is guaranteed (with the step size α chosen small enough) because E is a convex function
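
A minimal sketch of this batch update for the linear model (the constant 2 is folded into the step size α; all names and data are my own):

    import numpy as np

    # Batch gradient descent for g(x, theta) = x . theta:
    # theta <- theta - alpha * sum_i x^(i) (g(x^(i), theta) - y^(i))
    def batch_gd(X, y, alpha=0.01, steps=500):
        theta = np.zeros(X.shape[1])
        for _ in range(steps):
            residuals = X @ theta - y            # g(x^(i), theta) - y^(i), all i
            theta -= alpha * X.T @ residuals     # the sum over examples
        return theta

    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [1.0, 0.0]])
    y = X @ np.array([2.0, -1.0])
    print(batch_gd(X, y))                        # converges toward [ 2. -1.]
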

STOCHASTIC GRADIENT DESCENT
- The prior rule was a batch update, because all examples were incorporated in each step
  - It needs to store all prior examples
- Stochastic gradient descent: use a single example on each step
- Update rule:
  - Pick an example i (either at random or in order) and a step size α
  - θ ← θ + α x^(i) (y^(i) − g(x^(i),θ))
- This reduces the error on the i'th example… but does it converge?
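
The same problem with the stochastic update, one randomly picked example per step (a minimal sketch; the names and data are my own):

    import numpy as np

    def sgd(X, y, alpha=0.05, steps=2000, seed=0):
        rng = np.random.default_rng(seed)
        theta = np.zeros(X.shape[1])
        for _ in range(steps):
            i = rng.integers(len(y))                       # pick example i at random
            theta += alpha * X[i] * (y[i] - X[i] @ theta)  # single-example update
        return theta

    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [1.0, 0.0]])
    y = X @ np.array([2.0, -1.0])
    print(sgd(X, y))                                       # hovers near [ 2. -1.]
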
PERCEPTRON
(THE GOAL FUNCTION F IS A BOOLEAN ONE)
- y = g(Σ_{i=1,…,n} w_i x_i)
[Figure: a unit with inputs x1,…,xn and weights w_i feeding a sum Σ and threshold g; in the (x1, x2) plane, the line w1 x1 + w2 x2 = 0 separates the + examples from the − examples]
PERCEPTRON
(THE GOAL FUNCTION F IS A BOOLEAN ONE)
- y = g(Σ_{i=1,…,n} w_i x_i)
[Figure: the same unit classifying a new example '?' among the + and − points]
PERCEPTRON LEARNING RULE
- θ ← θ + α x^(i) (y^(i) − g(θᵀ x^(i)))
- (g outputs either 0 or 1; y is either 0 or 1)
- If the output is correct, the weights are unchanged
- If g is 0 but y is 1, the weights are increased (in proportion to the attributes of x^(i))
- If g is 1 but y is 0, the weights are decreased
- Converges if the data is linearly separable, but oscillates otherwise
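
A minimal sketch of the rule on a linearly separable target (the OR function); the step threshold g, the bias column, and α = 0.5 are my own choices:

    import numpy as np

    def g(u):
        return 1 if u > 0 else 0           # step threshold: outputs 0 or 1

    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # column 0 is a bias
    y = np.array([0, 1, 1, 1])             # OR of the last two columns
    theta, alpha = np.zeros(3), 0.5
    for _ in range(20):                    # a few passes over the data
        for i in range(len(y)):
            theta += alpha * X[i] * (y[i] - g(theta @ X[i]))
    print([g(theta @ x) for x in X])       # [0, 1, 1, 1]: converged
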
UNIT (NEURON)
- y = g(Σ_{i=1,…,n} w_i x_i)
- Sigmoid activation: g(u) = 1/[1 + exp(−u)]
[Figure: a unit with inputs x1,…,xn, weights w_i, a summation Σ, and activation g producing y]
A SINGLE NEURON CAN LEARN
- A disjunction of Boolean literals, x1 ∨ x2 ∨ x3
- The majority function
- XOR?
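
For the first two items, suitable weights exist. A minimal sketch with hand-set weights (my own) for the disjunction x1 ∨ x2 ∨ x3; no single unit can represent XOR, since no line separates its positive and negative cases:

    import numpy as np

    def unit(w, x):
        return 1.0 / (1.0 + np.exp(-(w @ x)))   # sigmoid of the weighted sum

    w_or = np.array([-5.0, 10.0, 10.0, 10.0])    # bias, then one weight per literal
    for x1 in (0, 1):
        for x2 in (0, 1):
            for x3 in (0, 1):
                out = unit(w_or, np.array([1.0, x1, x2, x3]))
                print((x1, x2, x3), round(out))  # rounds to 1 iff x1 v x2 v x3
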
NEURAL NETWORK
- A network of interconnected neurons
- Acyclic (feed-forward) vs. recurrent networks
[Figure: two connected units, each computing y = g(Σ_i w_i x_i)]
TWO-LAYER FEED-FORWARD NEURAL NETWORK
[Figure: inputs feed a hidden layer through weights w1j, and the hidden layer feeds an output layer through weights w2k]
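
A minimal sketch of the forward pass through such a network (the layer sizes and weights are my own illustration):

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def forward(W1, W2, x):
        hidden = sigmoid(W1 @ x)          # hidden-layer activations
        return sigmoid(W2 @ hidden)       # output-layer activations

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))          # 3 inputs -> 4 hidden units
    W2 = rng.normal(size=(1, 4))          # 4 hidden units -> 1 output
    print(forward(W1, W2, np.array([0.5, -1.0, 2.0])))
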
NETWORKS WITH HIDDEN LAYERS
- Can learn XOR and other nonlinear functions (see the sketch below)
- As the number of hidden units increases, so does the network's capacity to learn functions with more nonlinear features
- It is difficult to characterize which class of functions, though!
- How to train the hidden layers?
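
To see the XOR claim concretely, a minimal sketch with one hidden layer and hand-set weights (my own choice: h1 acts like OR, h2 like AND, and the output like h1 AND NOT h2):

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    # Weights are large so each sigmoid behaves almost like a step function.
    W1 = np.array([[20.0, 20.0, -10.0],    # h1 ~ (x1 OR x2)
                   [20.0, 20.0, -30.0]])   # h2 ~ (x1 AND x2)
    w2 = np.array([20.0, -40.0, -10.0])    # y ~ (h1 AND NOT h2)

    for x1 in (0, 1):
        for x2 in (0, 1):
            h = sigmoid(W1 @ np.array([x1, x2, 1.0]))   # last input is a constant 1
            y = sigmoid(w2 @ np.append(h, 1.0))
            print((x1, x2), round(y))                   # rounds to 1 iff x1 != x2
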
BACKPROPAGATION (PRINCIPLE)
- New example: y^(k) = f(x^(k))
- φ^(k) = output of the NN with weights w^(k−1) on inputs x^(k)
- Error function: E^(k)(w^(k−1)) = (φ^(k) − y^(k))^2
- Update: w_ij^(k) = w_ij^(k−1) − ε ∂E^(k)/∂w_ij
  (in vector form: w^(k) = w^(k−1) − ε ∇E^(k))
- Backpropagation algorithm: update the weights of the inputs to the last layer first, then the weights of the inputs to the previous layer, and so on
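
A minimal sketch of one such update on a tiny two-layer sigmoid network: the gradient of E = (φ − y)^2 is computed for the output layer first and then pushed back to the hidden layer (the names, sizes, and ε are my own illustration):

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def backprop_update(W1, w2, x, y, eps):
        # Forward pass
        h = sigmoid(W1 @ x)                          # hidden activations
        phi = sigmoid(w2 @ h)                        # network output
        # Backward pass: output layer first...
        d_out = 2.0 * (phi - y) * phi * (1.0 - phi)  # dE/d(output pre-activation)
        grad_w2 = d_out * h
        # ...then the hidden layer
        d_hid = d_out * w2 * h * (1.0 - h)           # dE/d(hidden pre-activations)
        grad_W1 = np.outer(d_hid, x)
        return W1 - eps * grad_W1, w2 - eps * grad_w2

    rng = np.random.default_rng(0)
    W1, w2 = rng.normal(size=(3, 2)), rng.normal(size=3)
    x, y = np.array([1.0, 0.0]), 1.0
    for _ in range(100):                     # repeated updates on one example
        W1, w2 = backprop_update(W1, w2, x, y, eps=0.5)
    print(sigmoid(w2 @ sigmoid(W1 @ x)))     # output has moved toward y = 1
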
UNDERSTANDING BACKPROPAGATION
- Minimize E(θ)
- Gradient descent…
[Figure: the error surface E(θ); each update computes the gradient of E and takes a step proportional to it]
LEARNING ALGORITHM
- Given many examples (x^(1),y^(1)),…,(x^(N),y^(N)) and a learning rate ε
- Init: set k = 1 (or rand(1,N))
- Repeat:
  - Tweak the weights with a backpropagation update on example (x^(k), y^(k))
  - Set k = k+1 (or rand(1,N))
UNDERSTANDING BACKPROPAGATION
- This is an example of stochastic gradient descent
- Decompose E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ek = (g(x^(k),θ) − y^(k))^2
- On each iteration, take a step to reduce a single ek
[Figure: successive steps on the surface E(θ) follow the gradients of e1, e2, e3, …]
STOCHASTIC GRADIENT DESCENT
- The objective function values (measured over all examples) settle into a local minimum over time
- The step size must be reduced over time, e.g., as O(1/t)
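
A minimal sketch of such a schedule (α0 is a hypothetical initial rate of my choosing):

    # O(1/t) step-size schedule: alpha_t shrinks on every iteration t,
    # which is what lets the stochastic iterates settle down.
    alpha0 = 0.5
    for t in range(1, 6):
        alpha_t = alpha0 / t
        print(t, alpha_t)      # 0.5, 0.25, 0.1667, 0.125, 0.1
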
CAVEATS
- Choosing a convergent "learning rate" ε can be hard in practice
[Figure: the error surface E(θ)]
COMMENTS AND ISSUES
- How to choose the size and structure of the network?
  - If the network is too large, there is a risk of overfitting (data caching)
  - If the network is too small, the representation may not be rich enough
- The role of representation: e.g., learning the concept of an odd number
- Incremental learning
- Low interpretability
PERFORMANCE OF FUNCTION LEARNING
- Overfitting: too many parameters
- Regularization: penalize large parameter values (see the ridge sketch below)
- Efficient optimization:
  - If E(θ) is nonconvex, we can only guarantee finding a local minimum
  - Batch updates are expensive; stochastic updates converge slowly
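
One common way to penalize large parameters is an L2 (ridge) term, which in the least-squares setting just modifies the normal equations. A minimal sketch (λ and the data are my own illustration):

    import numpy as np

    # Ridge regression: minimize E(theta) + lam * ||theta||^2, giving
    # theta = (A^T A + lam * I)^{-1} A^T b.
    def ridge(A, b, lam):
        n = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

    A = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [1.0, 0.0]])
    b = A @ np.array([2.0, -1.0])
    print(ridge(A, b, lam=0.0))    # unregularized: [ 2. -1.]
    print(ridge(A, b, lam=1.0))    # coefficients shrink toward zero
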
READINGS
- R&N 18.8-9
- HW5 due on Thursday