Optimizing Neural-network Learning Rate by Using a Genetic Algorithm with Per-epoch Mutations
Yasusi Kanada, Hitachi, Ltd.
IJCNN 2016, 2016-07-26

Contents
► Introduction
► Proposed learning method (LOG-BP)
► Two application results
  ■ Pedestrian recognition (Caltech benchmark)
  ■ Hand-written digit recognition (MNIST benchmark)
► Conclusion and future work

Introduction
► Two difficult problems concerning BP (back-propagation):
  ■ Deciding (or scheduling) the learning rate
    • Constant or prescheduled learning rates are not adaptive.
    • Adaptive scheduling methods have sensitive hyper-parameters that are difficult to tune.
  ■ Controlling the locality of the search properly
    • Gradient-descent algorithms, including SGD, do not search the space globally.
    • Finding a better solution efficiently therefore requires multiple trials.

Introduction (cont'd)
► Proposal: LOG-BP (learning-rate-optimizing genetic back-propagation) learning method
  ■ A new combination of BP and a GA (genetic algorithm)
  ■ Multiple neural networks run in parallel.
  ■ Per-epoch genetic operations are applied.

Outline of LOG-BP
► In LOG-BP, multiple "individuals" (neural networks) learn and search for the best network in parallel.
► Each individual contains a chromosome c.
  ■ c = (η; w11, w12, …, w1n1, b1; w21, w22, …, w2n2, b2; …; wN1, wN2, …, wNnN, bN),
    where η is the learning rate, wij (1 ≤ j ≤ ni) are the weights of the i-th layer of the network, and bi is the bias of the i-th layer.
► A mutation-only GA is applied to these chromosomes.

Learning algorithm of LOG-BP
► Initialization: the weights Wi0 and the learning rates ηi of all individuals c1, c2, …, cn are randomized.
► Each epoch then consists of three steps (a code sketch of this loop appears below):
  ■ 1.1 Learning by back-propagation (stochastic gradient descent with mini-batches); the weights of every individual are mutated by BP itself.
  ■ 1.2 Evaluation: the validation losses (least-square errors) ei are calculated and the individuals are ranked from best to worst.
  ■ 1.3 Selection and mutation (no crossover): the best individual (say c1) is duplicated and its learning rate is mutated, η1' = f·η1 with probability 0.5 or η1' = η1 / f with probability 0.5, where f > 1; the worst individual (say c2) is killed.

Application to Pedestrian Recognition
► Caltech Pedestrian Dataset
  ■ A well-known pedestrian-detection benchmark that contains videos with more than 190,000 "small" pedestrians.
  ■ Training and test sets, both consisting of 24×48- and 32×64-pixel images, were generated.
► Networks: CNN2 and CNN3
  ■ Convolutional neural networks (CNNs) with two or three convolution layers (5×5 and 3×3 filters, each followed by ReLU and max pooling), the resulting feature maps, a hidden layer of 50 neurons, and the output neurons.
► Computational environment
  ■ Deep-learning environment: Theano
  ■ GPU: NVIDIA GeForce GTX TITAN X

Pedestrian: Change of Learning Rate
► Example: CNN2
  ■ (a) A trial with 12 individuals (filters = [16, 26]); (b) a trial with 12 individuals (filters = [16, 32]).
  ■ [Figure: learning rate (mean ± stdev) vs. epoch, 0–100 epochs, for trials (a) and (b).]
  ■ The learning rate usually decreases, but it may increase if the initial value is too small.
  ■ The learning rate may become mostly stationary at some epoch.
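The following is a minimal Python sketch of the per-epoch loop described under "Learning algorithm of LOG-BP". It is an illustration, not the authors' code: make_network, train_one_epoch, and validation_loss are assumed placeholders for an ordinary mini-batch SGD/BP implementation (e.g., in Theano), and the population size, mutation factor f, and learning-rate initialization range are illustrative choices (the slides only require f > 1).

    import copy
    import random

    def log_bp(make_network, train_one_epoch, validation_loss,
               n_individuals=12, f=2.0, n_epochs=100):
        # Initialization: each "individual" is a chromosome (learning rate, weights).
        # The initial learning rates are drawn from an illustrative range.
        population = [{"eta": 10 ** random.uniform(-3, -1), "net": make_network()}
                      for _ in range(n_individuals)]

        for epoch in range(n_epochs):
            # 1.1 Learning: every individual runs one epoch of mini-batch BP,
            #     which mutates its weights.
            for ind in population:
                train_one_epoch(ind["net"], ind["eta"])

            # 1.2 Evaluation: rank the individuals by validation loss (best first).
            population.sort(key=lambda ind: validation_loss(ind["net"]))
            best = population[0]

            # 1.3 Selection and mutation (no crossover): duplicate the best
            #     individual, mutate its learning rate by the factor f or 1/f
            #     with equal probability, and kill the worst individual.
            eta_new = best["eta"] * f if random.random() < 0.5 else best["eta"] / f
            population[-1] = {"eta": eta_new, "net": copy.deepcopy(best["net"])}

        return min(population, key=lambda ind: validation_loss(ind["net"]))

Because exactly one individual is replaced per epoch in this sketch, the fraction of the population mutated each epoch is 1/n_individuals, which matches the mutation rates quoted for the CNN3 trials below (1/12 ≈ 8.3%, 1/24 ≈ 4.2%).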
Pedestrian: Change of Learning Rate (cont'd)
► Example: CNN3
  ■ (a) A trial with 12 individuals (mutation rate = 8.3% = 1/12); (b) a trial with 24 individuals (mutation rate = 4.2% = 1/24).
  ■ [Figure: learning rate (log scale, mean ± stdev) vs. epoch for trials (a) and (b).]
  ■ The decrease rate may be exponential: the learning rate continuously decreases by more than three orders of magnitude.
  ■ The learning rate may become mostly stationary at some epoch.

Application to MNIST (Character Recognition)
► MNIST benchmark
  ■ A set of hand-written digit images (28×28 pixels) containing a training set with 60,000 samples and a test set with 10,000 samples.
► An adaptive learning rate is not required!
  ■ The learning rate stayed around 0.1 during the whole LOG-BP learning process.
► Summary: LOG-BP may still have benefits in terms of parallel-search performance.

MNIST: Performance (CNN3, Examples)
► Mutation rates of 4%, 2%, and 0 were compared.
  ■ [Figure: error rate (%, 0.7–1.1) vs. time (s, 0–9000) for each mutation rate.]
  ■ The 4% mutation rate gave quicker convergence and a better final error rate.

MNIST: Performance (CNN3, Statistics)
► Convergence time and final error rate

  Mutation rate   Average convergence time (std. dev., s)   Final error rate (std. dev., %)
  4%              2.6×10³ (0.5×10³)                          0.82 (0.01)
  2%              4.9×10³ (2.6×10³)                          0.84 (0.07)
  0%              5.6×10³ (0.9×10³)                          0.86 (0.03)

Conclusion
► LOG-BP, which combines BP and a GA in a new manner, is proposed.
► LOG-BP solves two problems concerning BP:
  ■ Scheduling of the learning rate
  ■ Controlling the locality of the search
► Two benchmarks show the high performance of LOG-BP.
  ■ The MNIST benchmark suggests advantages of LOG-BP over conventional SGD algorithms.
  ■ LOG-BP will make machine learning less dependent on the properties of various applications.
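As a closing illustration (not taken from the slides), step 1.3 of the earlier sketch can be generalized to an arbitrary mutation rate; here the mutation rate is read as the fraction of the population replaced per epoch, which matches the CNN3 trials (1/12 ≈ 8.3%, 1/24 ≈ 4.2%). The function name and the population representation are assumptions carried over from the earlier sketch.

    import copy
    import random

    def select_and_mutate(population, mutation_rate, f=2.0):
        # Generalized step 1.3 (illustrative sketch, not the authors' code).
        # `population` is assumed sorted best-first by validation loss; the
        # k worst individuals are replaced by mutated copies of the k best,
        # where k is set by the mutation rate.
        k = round(mutation_rate * len(population))
        for i in range(k):
            best = population[i]
            eta = best["eta"] * f if random.random() < 0.5 else best["eta"] / f
            population[-(i + 1)] = {"eta": eta, "net": copy.deepcopy(best["net"])}
        return population

With mutation_rate = 0, no individual is replaced, so LOG-BP reduces to independent parallel SGD runs, corresponding to the 0% baseline in the statistics table above.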