
Machine Learning
10-701/15-781, Fall 2011
Neural Networks
Eric Xing
Lecture 5, September 26, 2011
Reading: Chap. 5, C. Bishop
Learning highly non-linear functions
f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables
Examples: the XOR gate; speech recognition
Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
  [Figure: a biological neuron (dendrites, soma, axon, synapses) beside the artificial neuron: inputs x1, x2 with synaptic weights w1, w2 feed a linear combiner Σ and a hard limiter with threshold θ, producing output Y]
  X = Σ_{i=1}^{n} x_i w_i
  Y = +1 if X ≥ ω0,  −1 if X < ω0
• Artificial neuron networks
  - supervised learning
  - gradient descent
  [Figure: a layered network; input signals enter the input layer, pass through a middle layer, and leave as output signals from the output layer]
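The threshold unit above maps directly to a few lines of code. Below is a minimal sketch (not from the slides) of the hard-limiter perceptron; the function name, weights, and threshold value are illustrative assumptions.

```python
import numpy as np

def hard_limit_perceptron(x, w, omega0):
    """Artificial neuron: linear combiner followed by a hard limiter.
    X = sum_i x_i * w_i; output +1 if X >= omega0, else -1."""
    X = np.dot(w, x)                   # linear combiner
    return 1 if X >= omega0 else -1    # hard limiter (threshold activation)

# Illustrative values (not from the lecture): an OR-like unit.
w = np.array([1.0, 1.0])
print(hard_limit_perceptron(np.array([0, 1]), w, omega0=0.5))   # -> 1
print(hard_limit_perceptron(np.array([0, 0]), w, omega0=0.5))   # -> -1
```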
Connectionist Models
• Consider humans:
  - Neuron switching time: ~0.001 second
  - Number of neurons: ~10^10
  - Connections per neuron (synapses): ~10^4-5
  - Scene recognition time: ~0.1 second
  - 100 inference steps doesn't seem like enough → much parallel computation
  [Figure: a neuron schematic with dendrites, soma, axon, and excitatory (+) / inhibitory (−) synapses acting as nodes and weights]
• Properties of artificial neural nets (ANNs):
  - Many neuron-like threshold switching units
  - Many weighted interconnections among units
  - Highly parallel, distributed processes
Abdominal Pain Perceptron
[Figure: a single-layer perceptron for abdominal pain diagnosis. Inputs: Age, Male, Temp, WBC, Pain Intensity, Pain Duration (example values shown). Adjustable weights connect the inputs to 0/1 output units for Appendicitis, Diverticulitis, Perforated Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis, and Duodenal Ulcer.]
Myocardial Infarction Network
[Figure: a perceptron for myocardial infarction. Inputs (with example values): Pain Duration 2, Pain Intensity 4, ECG: ST Elevation 1, Smoker 1, Age 50, Male 1. The output unit gives 0.8, the "probability" of MI.]
The "Driver" Network
Perceptrons
[Figure: input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis). The error is "what we got" minus "what we wanted"; the Δ rule changes the weights to decrease the error.]
© Eric Xing @ CMU, 2006-2011
8
4
Perceptrons
[Figure: input units i feeding output units j]
• Input to unit i: a_i = measured value of variable i
• Input to unit j: a_j = Σ_i w_ij a_i
• Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j))
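As a concrete illustration (a sketch, not the lecture's own code), the sigmoid unit above can be evaluated as follows; the array names and example numbers are assumptions.

```python
import numpy as np

def sigmoid_unit(a_in, w_j, theta_j):
    """Output of unit j: o_j = 1 / (1 + exp(-(a_j + theta_j))),
    where a_j = sum_i w_ij * a_i is the weighted input from upstream units."""
    a_j = np.dot(w_j, a_in)
    return 1.0 / (1.0 + np.exp(-(a_j + theta_j)))

# Example with made-up numbers: two input units feeding unit j.
a_in = np.array([1.0, 0.0])        # measured values a_i
w_j = np.array([0.5, -0.3])        # weights w_ij
print(sigmoid_unit(a_in, w_j, theta_j=0.1))   # ~0.65
```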
Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"

Logistic Regression Model (the sigmoid unit)
[Figure: inputs Age = 34, Gender = 1, Stage = 4 (the independent variables x1, x2, x3) are combined through coefficients ("weights" a, b, c; example values 5, 4, 8) into Σ, and the sigmoid output 0.6 is the "probability of being alive" (the dependent variable p, the prediction).]
The perceptron learning algorithm
• Recall the nice property of the sigmoid function
• Consider the regression problem f: X → Y, for scalar Y
• Let's maximize the conditional data likelihood
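The update that maximizes the conditional data likelihood of a single sigmoid unit is the familiar logistic-regression gradient. The slide's own derivation is in equations that did not survive extraction, so the sketch below is a standard stand-in with assumed names (`X`, `t`, learning rate `eta`).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_sigmoid_unit(X, t, eta=0.1, n_iters=1000):
    """Gradient ascent on the conditional log-likelihood
    sum_d [ t_d log o_d + (1 - t_d) log(1 - o_d) ],  o_d = sigmoid(w . x_d).
    Its gradient is sum_d (t_d - o_d) x_d."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        o = sigmoid(X @ w)
        w += eta * X.T @ (t - o)
    return w

# Toy data (made up): one feature plus a constant bias column.
X = np.array([[0.0, 1], [0.2, 1], [0.8, 1], [1.0, 1]])
t = np.array([0, 0, 1, 1])
print(fit_sigmoid_unit(X, t))
```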
Gradient Descent
Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i
[The error function and its gradient-descent weight update appear as equations/figures in the original.]
The perceptron learning rules
Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i

Incremental mode:
  Do until converged:
    For each training example d in D:
      1. compute the gradient ∇E_d[w]
      2. update w ← w − η ∇E_d[w]

Batch mode:
  Do until converged:
    1. compute the gradient ∇E_D[w]
    2. update w ← w − η ∇E_D[w]

where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)² and E_d[w] = ½ (t_d − o_d)².
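A sketch of the two modes for a single sigmoid unit with squared error (an assumption consistent with the rest of the lecture; `X`, `t`, and `eta` are hypothetical names):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_Ed(w, x_d, t_d):
    """Gradient of E_d[w] = 1/2 (t_d - o_d)^2 for a sigmoid unit o_d = sigmoid(w . x_d)."""
    o_d = sigmoid(w @ x_d)
    return -(t_d - o_d) * o_d * (1 - o_d) * x_d

def train(X, t, eta=0.5, epochs=200, mode="incremental"):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if mode == "batch":
            # One step on the gradient of the summed error E_D.
            w -= eta * sum(grad_Ed(w, x_d, t_d) for x_d, t_d in zip(X, t))
        else:
            # Incremental mode: a small step after every training example.
            for x_d, t_d in zip(X, t):
                w -= eta * grad_Ed(w, x_d, t_d)
    return w
```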
MLE vs MAP
• Maximum conditional likelihood estimate
• Maximum a posteriori estimate
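The two estimators differ only by a prior term. The slide's formulas are images in the original, so this is a generic sketch: with a zero-mean Gaussian prior on the weights, the MAP objective is the conditional log-likelihood minus a quadratic penalty (the value of `lam` is illustrative).

```python
import numpy as np

def cond_log_likelihood(w, X, t):
    """sum_d [ t_d log o_d + (1 - t_d) log(1 - o_d) ] for a sigmoid unit."""
    o = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

def mle_objective(w, X, t):
    return cond_log_likelihood(w, X, t)

def map_objective(w, X, t, lam=0.1):
    # Zero-mean Gaussian prior on w contributes a quadratic penalty (lam is illustrative).
    return cond_log_likelihood(w, X, t) - lam * np.sum(w ** 2)
```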
What decision surface does a perceptron define?

NAND
  x  y  Z (color)
  0  0  1
  0  1  1
  1  0  1
  1  1  0

[Figure: a single threshold unit y = f(x1·w1 + x2·w2), θ = 0.5, with activation
  f(a) = 1, for a > θ
         0, for a ≤ θ]

  f(0·w1 + 0·w2) = 1
  f(0·w1 + 1·w2) = 1
  f(1·w1 + 0·w2) = 1
  f(1·w1 + 1·w2) = 0

Some possible values for w1 and w2:
  w1: 0.20  0.20  0.25  0.40
  w2: 0.35  0.40  0.30  0.20
What decision surface does a perceptron define?

XOR
  x  y  Z (color)
  0  0  0
  0  1  1
  1  0  1
  1  1  0

[Figure: the same single threshold unit y = f(x1·w1 + x2·w2), θ = 0.5, with activation
  f(a) = 1, for a > θ
         0, for a ≤ θ]

  f(0·w1 + 0·w2) = 0
  f(0·w1 + 1·w2) = 1
  f(1·w1 + 0·w2) = 1
  f(1·w1 + 1·w2) = 0

Some possible values for w1 and w2: none. This truth table is not linearly separable, so no single threshold unit can produce it.
What decision surface does a perceptron define?

XOR
  x  y  Z (color)
  0  0  0
  0  1  1
  1  0  1
  1  1  0

[Figure: a two-layer network in which x1 and x2 feed two hidden threshold units through weights w1, w2, w3, w4, and the hidden units feed the output unit through w5, w6; θ = 0.5 for all units, with activation
  f(a) = 1, for a > θ
         0, for a ≤ θ]

A possible set of values for (w1, w2, w3, w4, w5, w6):
  (0.6, -0.6, -0.7, 0.8, 1, 1)
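To see that hidden units buy non-linear separation, here is a sketch that wires the quoted weights into a two-layer threshold network. The exact assignment of w1…w6 to edges is an assumption about the figure (hidden unit 1 gets (w1, w2), hidden unit 2 gets (w3, w4), the output gets (w5, w6)), but under that wiring it reproduces XOR.

```python
def f(a, theta=0.5):
    # Threshold activation used on the slide: 1 for a > theta, 0 otherwise.
    return 1 if a > theta else 0

def two_layer_xor(x1, x2, w=(0.6, -0.6, -0.7, 0.8, 1.0, 1.0)):
    w1, w2, w3, w4, w5, w6 = w
    h1 = f(w1 * x1 + w2 * x2)    # fires for x1 AND NOT x2
    h2 = f(w3 * x1 + w4 * x2)    # fires for x2 AND NOT x1
    return f(w5 * h1 + w6 * h2)  # ORs the two hidden detectors

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", two_layer_xor(x1, x2))   # prints the 0, 1, 1, 0 pattern
```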
Non Linear Separation
[Figure: the Cough/Headache input space partitioned non-linearly. The corners of the square are labeled: 00 (no cough, no headache) = No disease, 10 (cough, no headache) = Pneumonia, 01 (no cough, headache) = Meningitis, 11 (cough, headache) = Flu, with a Treatment / No treatment boundary separating them; a second panel shows the analogous 3-bit cube (000 … 111).]
Neural Network Model
[Figure: a feed-forward network. Independent variables (inputs): Age = 34, Gender = 2, Stage = 4. A hidden layer of sigmoid units (Σ) connects the inputs to the output through two sets of weights (example values shown on the edges). The output unit gives 0.6, the "probability of being alive" (the dependent variable / prediction).]
“Combined logistic models”
[Figure: the same network redrawn to highlight one of its constituent logistic models: the inputs (Age = 34, Gender = 2, Stage = 4) feed a single hidden sigmoid unit through one set of weights, and that unit feeds the output, 0.6 = "probability of being alive".]
[Figure: a second constituent logistic model highlighted: the same inputs feed another hidden sigmoid unit through its own weights, which again contributes to the output, 0.6 = "probability of being alive".]
[Figure: a third constituent logistic model highlighted in the same way, with its own set of weights feeding the output, 0.6 = "probability of being alive".]
Not really, no target for hidden units...
[Figure: the full network again (inputs Age = 34, Gender = 2, Stage = 4; hidden sigmoid units; output 0.6 = "probability of being alive"). The hidden units have no target values of their own, so the simple Δ rule cannot be applied to them directly.]
Recall perceptrons
[Figure: the earlier single-layer perceptron. Input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis); error = what we got − what we wanted, and the Δ rule changes the weights to decrease the error.]
Hidden Units and Backpropagation
[Figure: a network with input units, hidden units, and output units. The error at the outputs (what we got − what we wanted) drives a Δ-rule update of the output-layer weights, and a Δ-rule update is propagated back to the hidden-layer weights.]
Backpropagation Algorithm
Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i

• Initialize all weights to small random numbers.
• Until convergence, do:
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k, compute its error term δ_k
  3. For each hidden unit h, compute its error term δ_h
  4. Update each network weight w_{i,j}
  [The δ and weight-update formulas appear as equations in the original; the sketch below uses the standard sigmoid-unit versions.]
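Because the formulas themselves are missing from the extracted slide, the sketch below assumes the standard backpropagation rules for sigmoid units with squared error: δ_k = o_k(1 − o_k)(t_k − o_k) at the outputs, δ_h = o_h(1 − o_h) Σ_k w_{h,k} δ_k at the hidden units, and Δw = η · δ · input. Array names and sizes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta=0.5):
    """One incremental backprop update for a 1-hidden-layer sigmoid network.
    W1: (n_hidden, n_in) weights, W2: (n_out, n_hidden) weights."""
    # 1. Forward pass: compute the network outputs.
    h = sigmoid(W1 @ x)
    o = sigmoid(W2 @ h)
    # 2. Error term for each output unit k.
    delta_o = o * (1 - o) * (t - o)
    # 3. Error term for each hidden unit h.
    delta_h = h * (1 - h) * (W2.T @ delta_o)
    # 4. Update each network weight (gradient step of size eta).
    W2 += eta * np.outer(delta_o, h)
    W1 += eta * np.outer(delta_h, x)
    return W1, W2

# Initialize all weights to small random numbers (illustrative 2-2-1 network).
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((2, 2))
W2 = 0.1 * rng.standard_normal((1, 2))
W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
```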
More on Backpropagation
• It performs gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  - in practice, often works well (can run multiple times)
• Often include weight momentum α (see the sketch after this list)
• Minimizes error over the training examples
  - will it generalize well to subsequent testing examples?
• Training can take thousands of iterations → very slow!
• Using the network after training is very fast
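A minimal sketch of the momentum-augmented update mentioned above, assuming the usual form Δw(t) = −η ∇E + α Δw(t−1); the gradient function and constants are placeholders, not from the slide.

```python
import numpy as np

def momentum_updates(grad_E, w, eta=0.1, alpha=0.9, n_steps=100):
    """Gradient descent with weight momentum:
    delta_w(t) = -eta * grad_E(w) + alpha * delta_w(t-1)."""
    delta_w = np.zeros_like(w)
    for _ in range(n_steps):
        delta_w = -eta * grad_E(w) + alpha * delta_w
        w = w + delta_w
    return w

# Toy use: minimize E(w) = ||w||^2 (gradient 2w); momentum smooths the descent.
print(momentum_updates(lambda w: 2 * w, np.array([1.0, -2.0])))
```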
Learning Hidden Layer Representation
• A network [shown as a figure in the original]
• A target function [shown as a table in the original]
• Can this be learned?
Learning Hidden Layer Representation
• A network [shown as a figure in the original]
• Learned hidden layer representation [shown as a table in the original]
Training
[Figure-only slide in the original.]
Training
[Figure-only slide in the original.]
Minimizing the Error
[Figure: an error surface over a weight w. Starting at w_initial with the initial error, the negative derivative gives a positive change in w; gradient steps move toward w_trained, a local minimum with the final error.]
Overfitting in Neural Nets
[Figure: left, CHD vs. age, contrasting an overfitted model with the "real" model; right, error vs. training cycles for the training set and a holdout set.]
Alternative Error Functions
• Penalize large weights (see the sketch below)
• Train on target slopes as well as values
• Tie weights together
  - e.g., in phoneme recognition
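A sketch of the first alternative, assuming the usual weight-decay form E(w) = Σ_d (t_d − o_d)² + γ Σ_i w_i² (the value of `gamma` and the names below are illustrative, not from the slide):

```python
import numpy as np

def penalized_error(w, X, t, gamma=0.01):
    """Squared training error plus a penalty on large weights (weight decay)."""
    o = 1.0 / (1.0 + np.exp(-(X @ w)))        # sigmoid unit outputs
    return np.sum((t - o) ** 2) + gamma * np.sum(w ** 2)
```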
Regression vs. Neural Networks
[Figure: left, a regression model Y = a(X1) + b(X2) + c(X3) + d(X1X2) + … must enumerate the interaction terms X1X2, X1X3, X2X3, X1X2X3, i.e. (2^3 − 1) possible combinations; right, a neural network on inputs X1, X2, X3 whose hidden units ("X1", "X2", "X1X3", "X1X2X3") learn which combinations matter.]
Artificial neural networks – what you should know
• Highly expressive non-linear functions
• Highly parallel network of logistic function units
• Minimizing the sum of squared training errors
  - gives MLE estimates of the network weights if we assume zero-mean Gaussian noise on the output values
• Minimizing the sum of squared errors plus the sum of squared weights (regularization)
  - gives MAP estimates assuming the weight priors are zero-mean Gaussian
• Gradient descent as the training procedure
• Discovers useful representations at the hidden units
• Local minima are the greatest problem
• Overfitting, regularization, early stopping
• How to derive your own gradient descent procedure
Modern ANN topics:
“Deep” Learning
Expressive Capabilities of ANNs
• Boolean functions:
  - Every Boolean function can be represented by a network with a single hidden layer
  - but it might require exponentially many (in the number of inputs) hidden units
• Continuous functions:
  - Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  - Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Using ANNs to learn hierarchical representations
[Pipeline: Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier]
Good representations are hierarchical
• In language: hierarchy in syntax and semantics
  - Words → Parts of Speech → Sentences → Text
  - Objects, Actions, Attributes, … → Phrases → Statements → Stories
• In vision: part-whole hierarchy
  - Pixels → Edges → Textons → Parts → Objects → Scenes
"Deep" learning: learning hierarchical representations
[Pipeline: Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier, with a learned internal representation]
• Deep learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to object identities
• Representations are increasingly invariant as we go up the layers
• Using multiple stages gets around the specificity/invariance dilemma
Filtering + Non-Linearity + Pooling = one stage of a Convolutional Net
• [Hubel & Wiesel 1962]:
  - simple cells detect local features
  - complex cells "pool" the outputs of simple cells within a retinotopic neighborhood
[Figure: retinotopic feature maps produced by multiple convolutions ("simple cells"), followed by pooling/subsampling ("complex cells")]
Convolutional Network: Multi-Stage Trainable Architecture
[Pipeline: Convolutions/Filtering → Pooling/Subsampling → Convolutions/Filtering → Pooling/Subsampling → Convolutions/Classification]
• Hierarchical architecture
  - Representations are more global, more invariant, and more abstract as we go up the layers
• Alternated layers of filtering and spatial pooling
  - Filtering detects conjunctions of features
  - Pooling computes local disjunctions of features
• Fully trainable
  - All the layers are trainable
Convolutional Net Architecture for Handwriting Recognition
[Figure: input 1@32x32 → 5x5 convolution → Layer 1: 6@28x28 → 2x2 pooling/subsampling → Layer 2: 6@14x14 → 5x5 convolution → Layer 3: 12@10x10 → 2x2 pooling/subsampling → Layer 4: 12@5x5 → 5x5 convolution → Layer 5: 100@1x1 → Layer 6: 10 outputs]
• Convolutional net for handwriting recognition (400,000 synapses)
• Convolutional layers (simple cells): all units in a feature plane share the same weights
• Pooling/subsampling layers (complex cells): for invariance to small distortions
• Supervised gradient-descent learning using back-propagation
• The entire network is trained end-to-end; all the layers are trained simultaneously
• [LeCun et al., Proc. IEEE, 1998]
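For concreteness, here is a hedged PyTorch-style sketch of a network with the feature-map sizes listed above. This is an approximation, not the original LeNet implementation; the choice of tanh activations, average pooling, and the use of torch itself are assumptions.

```python
import torch
import torch.nn as nn

# Feature-map sizes follow the slide: 1@32x32 -> 6@28x28 -> 6@14x14
# -> 12@10x10 -> 12@5x5 -> 100@1x1 -> 10 outputs.
lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # 5x5 convolution: 1@32x32 -> 6@28x28
    nn.Tanh(),                          # non-linearity (assumed)
    nn.AvgPool2d(2),                    # 2x2 pooling/subsampling -> 6@14x14
    nn.Conv2d(6, 12, kernel_size=5),    # -> 12@10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                    # -> 12@5x5
    nn.Conv2d(12, 100, kernel_size=5),  # -> 100@1x1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(100, 10),                 # Layer 6: 10 digit classes
)

print(lenet_like(torch.zeros(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```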
How to train?
Application: MNIST Handwritten Digit Dataset
MNIST: 60,000 training samples, 10,000 test samples
Results on MNIST Handwritten Digits

CLASSIFIER | DEFORMATION | PREPROCESSING | ERROR (%) | REFERENCE
linear classifier (1-layer NN) | none | none | 12.00 | LeCun et al. 1998
linear classifier (1-layer NN) | none | deskewing | 8.40 | LeCun et al. 1998
pairwise linear classifier | none | deskewing | 7.60 | LeCun et al. 1998
K-nearest-neighbors (L2) | none | none | 3.09 | Kenneth Wilder, U. Chicago
K-nearest-neighbors (L2) | none | deskewing | 2.40 | LeCun et al. 1998
K-nearest-neighbors (L2) | none | deskew, clean, blur | 1.80 | Kenneth Wilder, U. Chicago
K-NN L3, 2-pixel jitter | none | deskew, clean, blur | 1.22 | Kenneth Wilder, U. Chicago
K-NN, shape context matching | none | shape context feature | 0.63 | Belongie et al., IEEE PAMI 2002
40 PCA + quadratic classifier | none | none | 3.30 | LeCun et al. 1998
1000 RBF + linear classifier | none | none | 3.60 | LeCun et al. 1998
K-NN, Tangent Distance | none | subsamp 16x16 pixels | 1.10 | LeCun et al. 1998
SVM, Gaussian kernel | none | none | 1.40 | LeCun et al. 1998
SVM, deg 4 polynomial | none | deskewing | 1.10 | LeCun et al. 1998
Reduced Set SVM, deg 5 poly | none | deskewing | 1.00 | LeCun et al. 1998
Virtual SVM, deg-9 poly | Affine | none | 0.80 | LeCun et al. 1998
V-SVM, 2-pixel jittered | none | none | 0.68 | DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered | none | deskewing | 0.56 | DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE | none | none | 4.70 | LeCun et al. 1998
2-layer NN, 300 HU, MSE | Affine | none | 3.60 | LeCun et al. 1998
2-layer NN, 300 HU | none | deskewing | 1.60 | LeCun et al. 1998
3-layer NN, 500+150 HU | none | none | 2.95 | LeCun et al. 1998
3-layer NN, 500+150 HU | Affine | none | 2.45 | LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg | none | none | 1.53 | Hinton, unpublished, 2005
2-layer NN, 800 HU, CE | none | none | 1.60 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Affine | none | 1.10 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE | Elastic | none | 0.90 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Elastic | none | 0.70 | Simard et al., ICDAR 2003
Convolutional net LeNet-1 | none | subsamp 16x16 pixels | 1.70 | LeCun et al. 1998
Convolutional net LeNet-4 | none | none | 1.10 | LeCun et al. 1998
Convolutional net LeNet-5 | none | none | 0.95 | LeCun et al. 1998
Conv. net LeNet-5 | Affine | none | 0.80 | LeCun et al. 1998
Boosted LeNet-4 | Affine | none | 0.70 | LeCun et al. 1998
Conv. net, CE | Affine | none | 0.60 | Simard et al., ICDAR 2003
Conv. net, CE | Elastic | none | 0.40 | Simard et al., ICDAR 2003
Application: ANN for Face Recognition
• The model [shown as a figure in the original]
• The learned hidden unit weights [shown as a figure in the original]
Face Detection with a Convolutional Net
Computer vision features
[Figure: hand-engineered descriptors such as SIFT, Spin image, HoG, RIFT, Textons, GLOH]
Drawbacks of feature engineering:
1. Needs expert knowledge
2. Time-consuming hand-tuning
Courtesy: Lee and Ng
Sparse coding on images
• Natural images → learned bases: "edges"
• New example: x = 0.8 * b36 + 0.3 * b42 + 0.5 * b65
• [0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)
Courtesy: Lee and Ng
Bases (or features) can be learned by optimization
Given input data {x^(1), …, x^(m)}, we want to find good bases {b_1, …, b_n}:

  min_{b,a}  Σ_i || x^(i) − Σ_j a_j^(i) b_j ||₂²  +  β Σ_i || a^(i) ||₁
  s.t.  ∀j: || b_j || ≤ 1

The first term is the reconstruction error, the second is the sparsity penalty, and the constraint is the normalization constraint on the bases.

Solve by alternating minimization (see the sketch below):
  - Keep b fixed, find the optimal a.
  - Keep a fixed, find the optimal b.
Courtesy: Lee and Ng
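A minimal sketch of the alternating minimization, under assumptions not stated on the slide: ISTA-style updates (gradient step plus soft-thresholding) for the sparse codes a, and projected gradient steps for the bases b. Step sizes and names are illustrative.

```python
import numpy as np

def sparse_coding(X, n_bases, beta=0.1, n_outer=20, lr=0.01):
    """Alternate between the codes a (L1-penalized) and the bases b (norm-constrained).
    X: (m, d) data matrix; returns B: (n_bases, d) and A: (m, n_bases)."""
    m, d = X.shape
    rng = np.random.default_rng(0)
    B = rng.standard_normal((n_bases, d))
    B /= np.linalg.norm(B, axis=1, keepdims=True)     # enforce ||b_j|| <= 1
    A = np.zeros((m, n_bases))
    for _ in range(n_outer):
        # Keep b fixed, improve a: one gradient + soft-threshold (ISTA) step.
        step = 1.0 / (np.linalg.norm(B @ B.T, 2) + 1e-8)
        A = A - step * (A @ B - X) @ B.T
        A = np.sign(A) * np.maximum(np.abs(A) - step * beta, 0.0)
        # Keep a fixed, improve b: gradient step on reconstruction error, then project.
        B = B - lr * A.T @ (A @ B - X)
        B /= np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B, A
```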
Learning Feature Hierarchy
[Figure: input image (pixels) → "sparse coding" layer (edges) → higher layer (combinations of edges)]
• A DBN (Hinton et al., 2006) with an additional sparseness constraint
• [Related work: Hinton, Bengio, LeCun, and others]
Courtesy: Lee and Ng
Convolutional architectures
[Figure: a stack of alternating layers, from the input through a detection (convolution) layer, a max-pooling layer (maximum over a 2x2 grid), another detection layer, and another max-pooling layer]
• Weight sharing by convolution (e.g., [LeCun et al., 1989])
• "Max-pooling" (see the sketch below)
  - invariance
  - computational efficiency
  - deterministic and feed-forward
• One can develop a convolutional Restricted Boltzmann Machine (CRBM)
• One can define probabilistic max-pooling that combines bottom-up and top-down information
Courtesy: Lee and Ng
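As a concrete illustration of the two building blocks named above, here is a NumPy sketch; the array shapes, the "valid" border handling, and the toy filter are assumptions (and, as is standard in conv nets, "convolution" is implemented as cross-correlation).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Detection layer: 'valid' 2-D convolution of a single-channel image with one filter."""
    kh, kw = kernel.shape
    H = image.shape[0] - kh + 1
    W = image.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(fmap):
    """Max-pooling layer: maximum over non-overlapping 2x2 grids."""
    H, W = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    f = fmap[:H, :W]
    return f.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)       # toy "image"
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # toy vertical-edge detector
print(max_pool_2x2(convolve2d_valid(img, edge_filter)).shape)   # (2, 2)
```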
Convolutional Deep Belief Networks
• Bottom-up (greedy), layer-wise training
  - Train one layer (convolutional RBM) at a time
• Inference (approximate)
  - Undirected connections for all layers (Markov net) [Related work: Salakhutdinov and Hinton, 2009]
  - Block Gibbs sampling or mean-field
  - Hierarchical probabilistic inference
Courtesy: Lee and Ng
Unsupervised learning of object parts
[Figure: learned higher-layer features resembling object parts for Faces, Cars, Elephants, and Chairs]
Courtesy: Lee and Ng
Weaknesses & Criticisms
• Learning everything: better to encode prior knowledge about the structure of images
  - A: Compare with the machine learning vs. linguists debate in NLP
• Results not yet competitive with the best engineered systems
  - A: Agreed. True for some domains
Courtesy: Lee and Ng