Chap. 10 Models
data analysis objectives
Regression
Optimization: Linear Regression (Least Squares)
• Curve fitting with a line
• w = (w0, w1) for h(x, w) = w0 + w1x
• Given a dataset
(x1, y1), (x2, y2), ….. (xN, yN)
• Find w that minimizes the error:
• Least-squares (mean squared) error
• In matrix form, y = Xw + e:

    [ y1 ]   [ 1  x1 ]            [ e1 ]
    [ y2 ] = [ 1  x2 ]  [ w0 ]  + [ e2 ]
    [ y3 ]   [ 1  x3 ]  [ w1 ]    [ e3 ]
    [ ⋮  ]   [ ⋮   ⋮ ]            [ ⋮  ]
• E(w) = (1/N) Σi=1,N (yi − h(xi, w))²
• In matrix form, E(w) = (y − Xw)ᵀ(y − Xw)
• Setting the gradient to zero: Xᵀ(y − Xw) = 0  ⇒  XᵀXw = Xᵀy
• w = (XᵀX)⁻¹Xᵀy
Optimization: Linear Regression (Least Squares)
• For a straight line (checked numerically in the sketch below):
  • w1 = Σ(xi − E[x])(yi − E[y]) / Σ(xi − E[x])²
  • w0 = E[y] − w1·E[x]
  • Equivalently, w1 = correlation(x, y) · stdev(y) / stdev(x)
• Interpretation:
  • If x and y are perfectly correlated, a one-standard-deviation increase in x
    corresponds to a one-standard-deviation increase in y
  • If they are anti-correlated, an increase in x (in standard-deviation units)
    corresponds to a decrease in y
  • If they are uncorrelated, changes in x do not affect the prediction
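These closed-form expressions can be checked numerically. A minimal NumPy sketch (the data is synthetic and purely illustrative):

import numpy as np

# Synthetic data: y = 2 + 3x + noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

# Slope and intercept from the moment formulas above
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Same slope via correlation and standard deviations
w1_alt = np.corrcoef(x, y)[0, 1] * y.std() / x.std()

# Same result from the normal equation w = (X^T X)^-1 X^T y
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w0, w1, w1_alt, w)   # all agree up to rounding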
Polynomial Regression
• Curve fitting by a polynomial of degree m
• w = (w0, w1, …, wm) for h(x, w) = w0 + w1x + w2x² + … + wmxᵐ
• Given a dataset
(x1, y1), (x2, y2), ….. (xN, yN)
• Find w that minimizes the error:
• E(w) = (1/N) Σi=1,N (h(xi, w) − yi)²
• Least-squares (mean squared) error
• In matrix form, y = Xw + e
• E(w) = (1/N) ‖Xw − y‖²
• ∇E(w) = (2/N) Xᵀ(Xw − y) = 0
• XᵀXw = Xᵀy
• w = (XᵀX)⁻¹Xᵀy
• Pseudo-inverse: (XᵀX)⁻¹Xᵀ (see the sketch below)
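A minimal sketch of the pseudo-inverse solution for a degree-m polynomial fit (the data and the degree are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.size)

m = 3                                        # polynomial degree
X = np.vander(x, m + 1, increasing=True)     # columns: 1, x, x^2, ..., x^m
w = np.linalg.pinv(X) @ y                    # w = (X^T X)^-1 X^T y via the pseudo-inverse
y_hat = X @ w                                # fitted values h(x, w)
print(w)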
Linear Classifier
Linear classifier
• Given training data
  (petal length, sepal length, iris-versicolor or iris-virginica)
• Given new data (petal length, sepal length), predict: versicolor or virginica?
• Given a dataset
  (x11, x12), (x21, x22), …, (xN1, xN2) and (y1, y2, …, yN),
  assume a model:
  h(xi, w) = w1xi1 + w2xi2
• Error in prediction:
  εi = yi − h(xi, w)
Linear classifier
• Cost function J:
  • J(w) = (1/2) Σi=1,N (yi − h(xi, w))²
• Update the weights iteratively: w(t+1) = w(t) + Δw
• To first order, J(w + Δw) ≈ J(w) + ∇J(w)ᵀΔw
• Use Δw = −η ∇J(w) (gradient descent)
import numpy as np

# Gradient-descent training loop (X, y, and the learning rate eta are assumed defined)
W = np.zeros(X.shape[1] + 1)   # weights; W[0] is the bias
costs = []
n_iter = 50
for i in range(n_iter):
    output = np.dot(X, W[1:]) + W[0]     # linear activation h(x, w)
    error = y - output                   # residuals y - h(x, w)
    W[1:] += eta * np.dot(X.T, error)    # gradient step for the weights
    W[0] += eta * error.sum()            # gradient step for the bias
    cost = (error**2).sum() / 2.0        # cost J(w)
    costs.append(cost)
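The loop above assumes X (an N×d feature matrix), y, and the learning rate eta are already defined. A purely illustrative setup that could precede it:

import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),    # one cluster of points
               rng.normal(+1.0, 0.5, size=(50, 2))])   # a second cluster
y = np.concatenate([-np.ones(50), np.ones(50)])        # targets for the two clusters
eta = 0.01                                             # learning rate (illustrative value)
# With data on this scale, the recorded costs should decrease over the 50 iterations.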
Gradient Descent
Classifier
• Linear classifier: (X, y)
  (x11, x12), (x21, x22), …, (xN1, xN2) and (y1, y2, …, yN),
  assume a model:
  h(xi, w) = w1xi1 + w2xi2
  Error in prediction: εi = yi − h(xi, w)
• Input X is d-dimensional
  ((xi1, xi2, …, xid), yi)
  h(xi, w) = w1xi1 + w2xi2 + … + wdxid = Σj wj xij
  εi = yi − h(xi, w)
• Nonlinear, multiple outputs
  ((xi1, xi2, …, xid), (yi1, yi2, …, yik, …))
  hk(xi, w) = w1kxi1 + w2kxi2 + … + wdkxid = Σj wjk xij
  εik = yik − g(hk(xi, w))
Single Neuron (Perceptron)
• Single output, y
• The connection from input i to the neuron has a positive or negative weight wi
• Total input s = Σi wi xi
• Output y = ϕ(s) = +1 if s > 0
             −1 otherwise
Perceptron Example (j=1)
• Input x = (x1, x2, …, xd)
  • e.g., customer profile info (income, debt, years employed, …)
• Output = +1 if (Σi=1,d wi xi − threshold) > 0
         = −1 otherwise
• Use w0 for the threshold, so that the
  hypothesis ϕ(x) = sign(Σi=1,d wi xi + w0) = sign(Σi=0,d wi xi)
  = sign(wᵀx)
• Given a dataset with input x = (1, x1, x2, …, xd):
  (x1, y1), …, (xN, yN)
• Learning algorithm (sketched below):
  Update the weights, w ← w + ynxn, for each misclassified example, i.e., whenever
  sign(wᵀxn) ≠ yn
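A minimal sketch of this update rule, assuming X already carries a leading column of 1s and y takes values in {-1, +1} (the function name and stopping rule are illustrative):

import numpy as np

def perceptron_learn(X, y, max_epochs=100):
    """Perceptron learning algorithm: w <- w + y_n x_n for misclassified points."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xn, yn in zip(X, y):
            if np.sign(w @ xn) != yn:    # misclassified: sign(w^T x_n) != y_n
                w = w + yn * xn          # update rule
                updated = True
        if not updated:                  # stop when every point is classified correctly
            break
    return w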
Perceptron Example
• Two inputs, one output
• Trained by
  (x1, x2) → y
  (0, 1/2) → 1
  (1, 1)   → 1
  (1, 1/2) → 0
  (0, 0)   → 0
• Total input s = w1x1 + w2x2 + w0 (w0 is the bias)
• Assume a step function for g(s); the four training points require
  w2/2 + w0 > 0
  w1 + w2 + w0 > 0
  w1 + w2/2 + w0 < 0
  w0 < 0
Perceptron Example
• Visualize
  w2/2 + w0 > 0
  w1 + w2 + w0 > 0
  w1 + w2/2 + w0 < 0
  w0 < 0
• Can pick the separating line x2 = 1/4 + (1/2)x1
• i.e., −1/4 − (1/2)x1 + x2 > 0 for class 1
• w1 = −1/2, w2 = 1, w0 = −1/4 (checked numerically below)
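A quick numerical check of these weights against the four training points (purely illustrative):

w0, w1, w2 = -0.25, -0.5, 1.0
data = [((0.0, 0.5), 1), ((1.0, 1.0), 1), ((1.0, 0.5), 0), ((0.0, 0.0), 0)]
for (x1, x2), target in data:
    s = w1 * x1 + w2 * x2 + w0              # total input
    prediction = 1 if s > 0 else 0          # step activation g(s)
    print((x1, x2), round(s, 2), prediction == target)   # all True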
Perceptron Example
• Given a dataset with input x = (1, x1, x2, …, xd):
  (x1, y1), …, (xN, yN)
• Learning algorithm:
  Update the weights, w ← w + ynxn, for each misclassified example, i.e., whenever
  sign(wᵀxn) ≠ yn
import numpy as np

# Perceptron-style training loop (X, y in {-1, +1}, and eta are assumed defined)
W = np.zeros(X.shape[1] + 1)   # weights; W[0] is the bias
costs = []
n_iter = 50
for i in range(n_iter):
    output = np.dot(X, W[1:]) + W[0]
    error = y - np.where(output >= 0.0, 1, -1)   # label minus thresholded prediction
    W[1:] += eta * np.dot(X.T, error)
    W[0] += eta * error.sum()
    cost = (error**2).sum() / 2.0
    costs.append(cost)
Logistic Regression
• Total input s = Σi wi xi = wᵀx
• Output y = Θ(s) = 1/(1 + e⁻ˢ)
• Θ(s)
  • sigmoid function
  • can be interpreted as a probability
• Example: input x = (cholesterol level, age, weight, …)
• Signal s = wᵀx
• Output Θ(s): probability of a heart attack
Error Measure
• For each (x, y), y is generated with probability f(x)
• A plausible error measure is based on likelihood:
  if h = f, how likely is it to get y from x?
    p(y|x) = h(x)       for y = +1
           = 1 − h(x)   for y = −1
• Substitute h(x) = Θ(s) = Θ(wᵀx)
• Since Θ(−s) = 1 − Θ(s),
    p(y|x) = Θ(y wᵀx)
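A small sketch of the sigmoid and the compact form Θ(y wᵀx); the weights and input values are made up:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.5, -1.2, 0.8])     # hypothetical weights (bias first)
x = np.array([1.0, 0.3, 2.0])      # hypothetical input with a leading 1
s = w @ x                          # signal s = w^T x

p_plus = sigmoid(s)                # p(y = +1 | x) = theta(s)
p_minus = 1.0 - p_plus             # p(y = -1 | x) = 1 - theta(s)
# Since theta(-s) = 1 - theta(s), both cases collapse to theta(y * w^T x):
print(np.isclose(p_minus, sigmoid(-s)))   # True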
Error Measure
• Likelihood of observing (x1, y1), …, (xN, yN):
  Πn=1,N p(yn|xn) = Πn=1,N Θ(yn wᵀxn)
• Maximize (1/N) Σn=1,N ln Θ(yn wᵀxn)
• Equivalently, minimize −(1/N) Σn=1,N ln Θ(yn wᵀxn)
  = (1/N) Σn=1,N ln(1/Θ(yn wᵀxn))
  = (1/N) Σn=1,N ln(1 + exp(−yn wᵀxn))
• Error measure:
  E(w) = (1/N) Σn=1,N ln(1 + exp(−yn wᵀxn))
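This error measure can be computed directly. A minimal sketch (the helper name and the tiny dataset are illustrative; labels are in {-1, +1} and X carries a leading column of 1s):

import numpy as np

def in_sample_error(w, X, y):
    """E(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))."""
    margins = y * (X @ w)                   # y_n * w^T x_n for each sample
    return np.mean(np.log1p(np.exp(-margins)))

X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
print(in_sample_error(np.zeros(2), X, y))   # ln(2) ~= 0.693 when w = 0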
Minimizing Error Measure – Gradient Descent
• E(w) = (1/N) Σn=1,N ln(1 + exp(−yn wᵀxn))
• Start with w(0)
• Take a fixed step of size η along a unit vector v:
  ΔE = E(w(1)) − E(w(0))
     = E(w(0) + ηv) − E(w(0))
     = η ∇E(w(0))ᵀv + O(η²)
     ≥ −η ‖∇E(w(0))‖
• Since v is a unit vector, the bound is attained (steepest descent) with
  v = −∇E(w(0)) / ‖∇E(w(0))‖
Implementation
• η affects performance
• Instead of Δw = ηv = −η ∇E(w(0)) / ‖∇E(w(0))‖,
  use Δw = −η ∇E(w(0))  (η: learning rate)
• Rule of thumb: η = 0.1
Logistic regression algorithm
1. Initialize the weights at t = 0 to w(0)
2. For t = 0, 1, 2, …
   a. Compute the gradient
      ∇E = −(1/N) Σn=1,N yn xn / (1 + exp(yn w(t)ᵀxn))
   b. Update the weights: w(t+1) = w(t) − η ∇E
3. Repeat until ‖Δw‖ < threshold
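A minimal sketch of this algorithm, assuming X carries a leading column of 1s and y is in {-1, +1} (eta, the tolerance, and the iteration cap are illustrative choices):

import numpy as np

def logistic_regression_gd(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    """Batch gradient descent on E(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))."""
    w = np.zeros(X.shape[1])                            # step 1: w(0) = 0
    for _ in range(max_iter):                           # step 2
        margins = y * (X @ w)                           # y_n w^T x_n
        # gradient: -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        step = -eta * grad                              # w(t+1) = w(t) - eta * grad
        w = w + step
        if np.linalg.norm(step) < tol:                  # stop when the update is tiny
            break
    return w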
Gradient Descent
• Minimize the error measure:
  E(w) = (1/N) Σn=1,N e(h(xn), yn)
       = (1/N) Σn=1,N ln(1 + exp(−yn wᵀxn)) for logistic regression
• Using ALL samples for every update (batch GD) can be expensive
• Stochastic GD
  • Pick one (xn, yn) at a time
  • Apply GD to e(h(xn), yn)
  • En[−∇e(h(xn), yn)] = −(1/N) Σn=1,N ∇e(h(xn), yn)
                       = −∇E  (on average, the same direction as batch GD)
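A sketch of the stochastic variant for the logistic error, picking one example per update (the function name, epoch count, and shuffling scheme are assumptions):

import numpy as np

def logistic_regression_sgd(X, y, eta=0.1, n_epochs=50, seed=0):
    """One-sample-at-a-time updates; on average they follow -grad E."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(y)):               # visit samples in random order
            xn, yn = X[n], y[n]
            # gradient of ln(1 + exp(-y_n w^T x_n)) with respect to w
            grad_n = -yn * xn / (1.0 + np.exp(yn * (w @ xn)))
            w = w - eta * grad_n
    return w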
Neural Networks
Neural Networks
• Loosely modeled on the human nervous system
• Neurons and synapses
• A neuron puts out a real number between 0 and 1
• A perceptron is good for linear classification
• It cannot separate data that is not linearly separable (e.g., the XOR pattern)
Combining perceptrons
• Combine h1 and h2
• Logical OR and AND
• x1 and x2 can be +1 or -1
Multilayer perceptrons
• 3 layers
Neural Network
• Multilayer perceptrons are limited to combinations of linear separations
• They can have too many weights as the number of layers increases
• Soften the hard threshold to a logistic (sigmoid) function
• Each Θ can be different
NN parameters
• Assume that each Θ is identical:
  tanh(s) = (eˢ − e⁻ˢ)/(eˢ + e⁻ˢ)
• Weights
  • w_ij^(l)
  • l: layer; i: source node; j: destination node
• Relationship (forward propagation, sketched below):
  x_j^(l) = Θ(s_j^(l)) = Θ(Σi=0,d(l−1) w_ij^(l) x_i^(l−1))
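A sketch of this forward relationship for a small fully connected network with tanh at every node; the layer sizes and weights are arbitrary, and x_0^(l) = 1 is the bias node:

import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 3, 1]                          # d(0), d(1), d(2): illustrative architecture
# W[l-1][i, j] plays the role of w_ij^(l); row 0 holds the bias weights w_0j^(l)
W = [rng.normal(size=(sizes[l - 1] + 1, sizes[l])) for l in range(1, len(sizes))]

x = np.array([0.5, -1.0])                  # input x^(0)
for Wl in W:
    x = np.concatenate(([1.0], x))         # prepend the bias node x_0 = 1
    s = x @ Wl                             # s_j^(l) = sum_i w_ij^(l) x_i^(l-1)
    x = np.tanh(s)                         # x_j^(l) = theta(s_j^(l))
print(x)                                   # network output h(x)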
How NN works
• All the weights {w_ij^(l)} determine the model h(x)
• The error on one sample (xn, yn) is
  e(w) = e(h(xn), yn)
• Compute the gradient ∇e(w), i.e., all partial derivatives
  ∂e(w)/∂w_ij^(l)
Computing Δe(w)
• Use the chain rule:
  ∂e(w)/∂w_ij^(l) = [∂e(w)/∂s_j^(l)] × [∂s_j^(l)/∂w_ij^(l)]
  where [∂s_j^(l)/∂w_ij^(l)] = x_i^(l−1)
  and   [∂e(w)/∂s_j^(l)] = δ_j^(l)
• δ_j^(l) can be computed recursively
• The derivative thus separates into a product of x_i^(l−1) and δ_j^(l)
δ for the final layer
• δ_j^(l) = ∂e(w)/∂s_j^(l)
• For the final layer, j = 1 and l = L:
  δ_1^(L) = ∂e(w)/∂s_1^(L)
• e(w) = e(h(xn), yn)
       = e(x_1^(L), yn)
       = (x_1^(L) − yn)²   (if the last layer is linear, x_1^(L) = s_1^(L))
       = (Θ(s_1^(L)) − yn)² for a tanh output, where Θ'(s) = 1 − Θ²(s)
Backpropagation of δ
• δ_i^(l−1) = ∂e(w)/∂s_i^(l−1)
            = Σj=1,d(l) [∂e(w)/∂s_j^(l)] × [∂s_j^(l)/∂x_i^(l−1)] × [∂x_i^(l−1)/∂s_i^(l−1)]
  where
  [∂x_i^(l−1)/∂s_i^(l−1)] = Θ'(s_i^(l−1))
  [∂s_j^(l)/∂x_i^(l−1)] = w_ij^(l)
  [∂e(w)/∂s_j^(l)] = δ_j^(l)
• δ_i^(l−1) = [1 − (x_i^(l−1))²] Σj=1,d(l) w_ij^(l) δ_j^(l)   (for tanh)
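A sketch of one forward and backward pass implementing these formulas for a small tanh network with squared error on a single example (the architecture, weights, and the tanh output node are all illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
sizes = [2, 3, 1]                                       # illustrative architecture
W = [rng.normal(size=(sizes[l - 1] + 1, sizes[l]))      # W[l-1][i, j] = w_ij^(l)
     for l in range(1, len(sizes))]
xn, yn = np.array([0.5, -1.0]), 1.0                     # one training example

# Forward pass: store x^(l) for every layer, each with the bias node prepended
xs = [np.concatenate(([1.0], xn))]
for Wl in W:
    xs.append(np.concatenate(([1.0], np.tanh(xs[-1] @ Wl))))

# Final-layer delta: e = (x_1^(L) - y)^2 and theta'(s) = 1 - theta(s)^2
out = xs[-1][1:]
delta = 2.0 * (out - yn) * (1.0 - out ** 2)

# Backpropagation: de/dw_ij^(l) = x_i^(l-1) delta_j^(l), then
# delta_i^(l-1) = (1 - (x_i^(l-1))^2) * sum_j w_ij^(l) delta_j^(l)
grads = []
for l in range(len(W) - 1, -1, -1):
    grads.append(np.outer(xs[l], delta))
    delta = (1.0 - xs[l][1:] ** 2) * (W[l][1:] @ delta)   # skip the bias row of W
grads.reverse()
print([g.shape for g in grads])    # gradient arrays match the weight shapes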
NN Application
• Protein structure prediction by PROF
• Input layer
• Sliding 15-residue window
• Predict secondary structure of the central residue
• One residue has 20 input nodes
• Hidden layer
• Connected to ALL input and output nodes
Convolutional NN
• Character Recognition
• Represent each picture as a 16x16 grid of pixels
• x = (1, x1, x2, …, x256)
• Need to compute w = (w0, w1, w2, …, w256)
• Use intensity and symmetry as features
• x = (1, x1, x2)
Conventional NN
• Conventional (fully connected) NN
  • Every node in layer l is connected to every node in layer (l+1)
  • 28x28 pixel images
  • flattened into 784 input nodes
Convolutional NN
• Convolutional NN
  • Keeps the input layer in 2D
  • Each neuron in the next layer is connected to only a subset of the input
    neurons
  • If each hidden neuron sees a 5x5 patch of input neurons, the first hidden
    layer has 24x24 neurons (28 − 5 + 1 = 24)
Convolutional NN
• Convolutional layer
  • Every hidden neuron uses the same 5x5 weights and the same bias (shared weights)
  • For the (j, k) hidden neuron:
    s = Σl=0,4 Σm=0,4 w_{l,m} a_{j+l,k+m} + b   -- similar to a convolution
  • Output: Θ(s)
  • The shared 5x5 weights act as a filter that detects one feature across the input layer
  • The map from the input layer to the hidden layer is called a feature map (sketched below)
  • A convolutional layer usually has multiple feature maps
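A sketch of how one feature map is computed with shared weights (the image, the 5x5 weights, and the bias below are random placeholders, and Θ is taken to be tanh):

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))            # 2D input layer
w = rng.normal(size=(5, 5))             # shared 5x5 weights for one feature map
b = 0.1                                 # shared bias

feature_map = np.zeros((24, 24))        # 28 - 5 + 1 = 24
for j in range(24):
    for k in range(24):
        # s = sum_{l,m} w_{l,m} a_{j+l, k+m} + b  -- a convolution-style sum
        s = np.sum(w * image[j:j + 5, k:k + 5]) + b
        feature_map[j, k] = np.tanh(s)  # output theta(s)
print(feature_map.shape)                # (24, 24)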
Convolutional NN
• Some feature maps
• Dark cells indicate high weights
Convolutional NN
• Pooling layer
  • Condenses the output of the convolutional layer
  • E.g., take the maximum over each 2x2 block of the 24x24 feature map (sketched below)
  • The pooling layer is then 12x12
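A sketch of 2x2 max pooling on a 24x24 feature map (here a random stand-in for the output of the previous step), which yields the 12x12 layer described above:

import numpy as np

feature_map = np.random.default_rng(1).random((24, 24))      # stand-in activations
# Group the 24x24 map into 12x12 blocks of size 2x2 and keep the max of each block
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))
print(pooled.shape)                                           # (12, 12)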