
Optimizing Neural-network Learning Rate
by Using a Genetic Algorithm with Per-epoch Mutations
Yasusi Kanada
Hitachi, Ltd.
IJCNN 2016
2016-7-26
Contents
► Introduction
► Proposed learning method (LOG-BP)
► Two application results
■ Pedestrian recognition (Caltech benchmark)
■ Hand-written digit recognition (MNIST benchmark)
► Conclusion and future work
Introduction
► Two difficult problems concerning BP (back propagation)
■ Deciding (or scheduling) the learning rate
• Constant or prescheduled learning rates are not adaptive.
• Adaptive scheduling methods have sensitive hyperparameters that are difficult to tune.
■ Controlling the locality of search properly
• Gradient-descent algorithms, including SGD, do not search the space globally.
• Multiple trials are required to find a better solution efficiently.
Introduction (cont’d)
► Proposal: LOG-BP (the learning-rate-optimizing genetic back-propagation) learning method
■ A new combination of BP and a GA (genetic algorithm)
■ Multiple neural networks run in parallel.
■ Per-epoch genetic operations are used.
Outline of LOG-BP
► In LOG-BP, multiple “individuals” (neural networks) learn in parallel and search for the best network.
► Each individual contains a chromosome c.
■ c = (η; w11, w12, …, w1n1, b1; w21, w22, …, w2n2, b2; …; wN1, wN2, …, wNnN, bN),
where η is the learning rate, wij (1 ≤ j ≤ ni) are the weights of the i-th layer of the N-layer network, and bi is the bias of the i-th layer.
► A mutation-only GA is applied to these chromosomes.
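The chromosome can be pictured as a small per-network record that holds the learning rate alongside all layer weights and biases. The sketch below is a minimal Python illustration; the class name Chromosome and its field names are assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Chromosome:
    """One LOG-BP individual: c = (eta; W1, b1; W2, b2; ...; WN, bN)."""
    eta: float                  # learning rate (the gene mutated by the GA)
    weights: List[np.ndarray]   # weight matrix of each layer, W1 ... WN
    biases: List[np.ndarray]    # bias vector of each layer, b1 ... bN
```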
Learning algorithm of LOG-BP
(Figure: data flow of the LOG-BP learning algorithm; the recoverable steps are summarized below.)
► Initialization: randomize the weights and the learning rates. Each individual ci starts with initial weights Wi0 and learning rate ηi.
► Epoch 1 (the same three steps are repeated in every epoch):
■ 1.1 Learning by back-propagation (stochastic gradient descent with mini-batches): the weights of each individual are mutated by BP, Wi0 → Wi1.
■ 1.2 Evaluation: the validation losses ei1 (least-square errors) are calculated, and the individuals are ranked from best (here c1) to worst (here c2).
■ 1.3 Selection and mutation (no crossover): the best individual c1 is duplicated and the learning rate of the copy is mutated, η1’ = f·η1 or η1’ = η1 / f, each with probability 0.5 (f > 1); the worst individual c2 is killed.
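The per-epoch cycle above can be sketched as a short Python loop. The selection step and the learning-rate mutation rule (multiply or divide by a factor f > 1, each with probability 0.5) follow the slide; train_one_epoch, validation_loss, and the value of f are placeholders standing in for the mini-batch SGD back-propagation step and the validation-loss evaluation.

```python
import copy
import random

def train_one_epoch(individual):
    """Placeholder: one epoch of mini-batch SGD back-propagation that updates
    individual.weights / individual.biases using individual.eta."""
    pass

def validation_loss(individual):
    """Placeholder: validation loss (least-square error) of the individual."""
    return 0.0

def logbp_epoch(population, f=1.2):
    """One LOG-BP epoch: BP learning, evaluation, then selection and mutation.
    f > 1 is the learning-rate mutation factor (1.2 is an assumed example)."""
    # 1.1 Learning: every individual runs one epoch of BP with its own eta
    #     (the weight part of each chromosome is mutated by BP itself).
    for c in population:
        train_one_epoch(c)

    # 1.2 Evaluation: rank individuals by validation loss, best first.
    population.sort(key=validation_loss)
    best = population[0]

    # 1.3 Selection and mutation (no crossover): duplicate the best individual,
    #     mutate only the learning rate of the copy, and kill the worst one.
    child = copy.deepcopy(best)
    child.eta = child.eta * f if random.random() < 0.5 else child.eta / f
    population[-1] = child          # the worst individual is replaced
    return population
```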
Application to Pedestrian Recognition
► Caltech Pedestrian Dataset
■ A famous pedestrian-detection benchmark that contains videos with more than 190,000 “small” pedestrians.
■ Sets of training data and test data, both consisting of 24×48- and 32×64-pixel images, were generated from the videos.
► Network: CNN2 and CNN3
■ Convolutional neural networks (CNNs) with two or three convolution layers (5×5 and 3×3 filters, each stage followed by ReLU and max pooling), 50 hidden neurons, and output neurons (see the sketch below).
(Figure: CNN architecture with weight matrices W1–W4, feature maps, hidden neurons (50), and output neurons.)
► Environment for computation
■ Deep-learning environment: Theano
■ GPU: NVIDIA GeForce GTX TITAN X
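As a concrete reading of the CNN2 description, the sketch below uses PyTorch (the paper itself used Theano). Only the filter counts [16, 32] (as in one of the CNN2 trials), the kernel sizes [5, 3], and the 50 hidden neurons come from the slides; the 3-channel 32×64 input, the lack of padding, the 2×2 pooling, and the two output classes are assumptions.

```python
import torch.nn as nn

# CNN2 sketch: two conv + ReLU + max-pool stages, then 50 hidden neurons and the
# output neurons. Shapes in the comments assume a 3-channel 32x64 input.
cnn2 = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),   # 3x32x64  -> 16x28x60
    nn.ReLU(),
    nn.MaxPool2d(2),                   #          -> 16x14x30
    nn.Conv2d(16, 32, kernel_size=3),  #          -> 32x12x28
    nn.ReLU(),
    nn.MaxPool2d(2),                   #          -> 32x6x14
    nn.Flatten(),
    nn.Linear(32 * 6 * 14, 50),        # hidden neurons (50)
    nn.ReLU(),
    nn.Linear(50, 2),                  # output neurons (pedestrian / background)
)
```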
Pedestrian: Change of Learning Rate
► Example: CNN2
(a) A trial with 12 individuals (filters = [16, 26])
(b) A trial with 12 individuals (filters = [16, 32])
(Figure: learning rate vs. epoch for trials (a) and (b), with ±stdev bands.)
■ The learning rate usually decreases.
■ The learning rate may increase if the initial value is too small.
■ The learning rate may become mostly stationary at some epoch.
Pedestrian: Change of Learning Rate (cont’d)
► Example: CNN3
(a) A trial with 12 individuals (mutation rate = 8.3% (1/12))
(b) A trial with 24 individuals (mutation rate = 4.2% (1/24))
(Figure: learning rate (log scale) vs. epoch for trials (a) and (b), with ±stdev bands.)
■ The decrease rate may be exponential.
■ The learning rate may decrease continuously by more than three orders of magnitude.
■ The learning rate may become mostly stationary at some epoch.
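A brief note on why the decrease can look exponential, inferred from the mutation rule on the LOG-BP algorithm slide rather than stated here: each accepted mutation multiplies or divides η by the fixed factor f, so along the lineage of surviving individuals the learning rate after u accepted up-mutations and d accepted down-mutations is η = η0 · f^(u − d). A steady surplus of down-mutations therefore changes η geometrically, which appears as a roughly straight line on the log-scale plots above.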
Application to MNIST (Character Recognition)
► MNIST benchmark
■ A set of hand-written digit images (28×28) containing a training set
with 60,000 samples and a test set with 10,000 samples.
► An adaptive learning rate is not required for MNIST!
■ The learning rate stayed around 0.1 throughout the whole LOG-BP learning process.
► Summary: LOG-BP may still have benefits in terms of
parallel-search performance.
MNIST: Performance (CNN3, Examples)
► Mutation rate 4%
► Mutation rate 2%
► Mutation rate 0
(Figure: three panels of error rate (%) vs. time (s), one per mutation rate.)
■ With the 4% mutation rate, convergence was quicker and the final error rate was better.
MNIST: Performance (CNN3, Statistics)
► Convergence time and final error rate
Mutation rate | Average convergence time (std. dev., s) | Final error rate (std. dev., %)
4%            | 2.6×10³ (0.5×10³)                       | 0.82 (0.01)
2%            | 4.9×10³ (2.6×10³)                       | 0.84 (0.07)
0%            | 5.6×10³ (0.9×10³)                       | 0.86 (0.03)
Conclusion
► LOG-BP, which combines BP and a GA in a new manner, was proposed.
► LOG-BP solves two problems concerning BP:
■ Scheduling of the learning rate
■ Controlling the locality of search
► Two benchmarks show the high performance of LOG-BP.
■ The MNIST benchmark suggests advantages of LOG-BP over conventional SGD algorithms.
■ LOG-BP will make machine learning less dependent on the properties of various applications.