NN-ch3-notes

Hand-written character recognition
• MNIST: a data set of hand-written digits
− 60,000 training samples
− 10,000 test samples
− Each sample consists of 28 x 28 = 784 pixels
• Various techniques have been tried, with the following failure rates on the test samples:
− Linear classifier: 12.0%
− 2-layer BP net (300 hidden nodes): 4.7%
− 3-layer BP net (300+200 hidden nodes): 3.05%
− Support vector machine (SVM): 1.4%
− Convolutional net: 0.4%
− 6-layer BP net (7500 hidden nodes): 0.35%
Hand-written character recognition
• Our own experiment:
− BP learning with 784-300-10 architecture
− Total # of weights: 784*300+300*10 = 238,200
− Total # of Δw computed for each epoch: 238,200 weights × 60,000 samples ≈ 1.4×10^10
− Ran for 1 month before it stopped
− Test error rate: 5.0%
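For reference, a minimal numpy sketch of the 784-300-10 forward pass described above; the layer sizes come from the slide, while the initialization scale and sigmoid activations are assumptions for illustration.

```python
import numpy as np

# Layer sizes from the slide: 784 inputs, 300 hidden nodes, 10 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.05, size=(784, 300))   # 235,200 weights
W2 = rng.normal(scale=0.05, size=(300, 10))    # 3,000 weights
print(W1.size + W2.size)                        # 238,200 weights in total

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Forward pass for one 784-pixel sample (no bias terms, matching the slide's weight count)."""
    h = sigmoid(x @ W1)
    return sigmoid(h @ W2)

y_hat = forward(rng.random(784))   # 10 output activations, one per digit class
```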
Risk-Averting Error Function
• Mean Squared Error (MSE)

$$Q(w) \;=\; \frac{1}{K}\sum_{k=1}^{K}\bigl(y_k - \hat{f}(x_k, w)\bigr)^2 \;=\; \frac{1}{K}\sum_{k=1}^{K} g_k(w)^2$$

where $g_k(w) := y_k - \hat{f}(x_k, w)$ and $(x_k, y_k)$, $k = 1, \ldots, K$, are the training data pairs.
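A direct numpy rendering of $Q(w)$, assuming the residuals $g_k$ have already been computed for the $K$ training pairs:

```python
import numpy as np

def mse(g):
    """Mean Squared Error Q(w) = (1/K) * sum_k g_k(w)^2,
    where g is the length-K vector of residuals y_k - f_hat(x_k, w)."""
    g = np.asarray(g, dtype=float)
    return np.mean(g ** 2)
```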
• Risk-Averting Error (RAE)

$$J_\lambda(w) \;=\; \frac{1}{K}\sum_{k=1}^{K}\exp\Bigl(\lambda\bigl(y_k - \hat{f}(x_k, w)\bigr)^2\Bigr) \;=\; \frac{1}{K}\sum_{k=1}^{K}\exp\bigl(\lambda\, g_k(w)^2\bigr)$$

$\lambda$: Risk-Sensitivity Index (RSI)
1. James Ting-Ho Lo. Convexification for data fitting. Journal of
Global Optimization, 46(2):307–315, February 2010.
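A matching sketch of $J_\lambda(w)$ in the same style. Evaluating it naively also shows why the normalized form on the next slide is needed: $\exp(\lambda g_k^2)$ overflows quickly for large $\lambda$ or large residuals.

```python
import numpy as np

def rae(g, lam):
    """Risk-Averting Error J_lambda(w) = (1/K) * sum_k exp(lambda * g_k(w)^2).
    lam is the risk-sensitivity index (RSI). This naive form overflows when
    lam * g_k^2 is large, which motivates the normalized version (NRAE)."""
    g = np.asarray(g, dtype=float)
    return np.mean(np.exp(lam * g ** 2))
```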
Normalized Risk-Averting Error
• Normalized Risk-Averting Error (NRAE)

$$C_\lambda(w) \;=\; \frac{1}{\lambda}\ln J_\lambda(w)$$

It can be simplified as

$$C_\lambda(w) \;=\; g_M(w)^2 + \frac{1}{\lambda}\bigl[\ln h_\lambda(w) - \ln K\bigr]$$

where $M := \arg\max_k\, g_k(w)^2$ (so $g_M(w)^2 = \max_k g_k(w)^2$),
$h_\lambda(w) := \sum_{k=1}^{K}\exp\bigl(\lambda\, v_k(w)\bigr)$, and
$v_k(w) := g_k(w)^2 - g_M(w)^2$.
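The simplified form above is a max-shifted (log-sum-exp) rewrite, which is also how $C_\lambda$ can be computed without overflow. A minimal sketch, assuming a vector of residuals g:

```python
import numpy as np

def nrae(g, lam):
    """Normalized Risk-Averting Error C_lambda(w) = (1/lambda) * ln J_lambda(w),
    computed via the max-shifted form so exp() never overflows."""
    g2 = np.asarray(g, dtype=float) ** 2
    g2_max = g2.max()                      # g_M(w)^2
    v = g2 - g2_max                        # v_k(w) <= 0, so exp(lam * v) <= 1
    h = np.sum(np.exp(lam * v))            # h_lambda(w)
    return g2_max + (np.log(h) - np.log(g2.size)) / lam
```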
The Broyden-Fletcher-Goldfarb-Shanno
(BFGS) Method
• A quasi-Newton method for solving nonlinear optimization problems
• Uses first-order gradient information to build an approximation to the Hessian (second-order derivative) matrix
• Avoiding the calculation of the exact Hessian matrix significantly reduces the computational cost of the optimization
The Broyden-Fletcher-Goldfarb-Shanno
(BFGS) Method
The BFGS Algorithm:
1. Generate an initial guess $x_0$ and an initial approximate Hessian matrix $B_0 = I$.
2. Obtain a search direction $p_k$ at step $k$ by solving
$$B_k p_k = -\nabla f(x_k)$$
where $\nabla f(x_k)$ is the gradient of the objective function $f(x)$ evaluated at $x_k$.
3. Perform a line search to find an acceptable step size $\alpha_k$ in the direction $p_k$, then update
$$x_{k+1} = x_k + \alpha_k p_k$$
The Broyden-Fletcher-Goldfarb-Shanno
(BFGS) Method
4. Set $s_k = \alpha_k p_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
5. Update the approximate Hessian matrix $B_k$ by
$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$$
6. Repeat steps 2-5 until $x$ converges to the solution. Convergence can be checked by monitoring the norm of the gradient, $\|\nabla f(x_k)\|$.
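A compact Python sketch of steps 1-6 for a generic objective; the backtracking line search and the stopping tolerance are simplified assumptions for illustration rather than part of the slides.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    """Minimize f with the BFGS iteration of steps 1-6.
    B is the approximate Hessian (B0 = I); the line search is a
    simple backtracking (Armijo) rule, assumed for illustration."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # step 6: check the gradient norm
            break
        p = np.linalg.solve(B, -g)           # step 2: solve B_k p_k = -grad f(x_k)
        alpha = 1.0                          # step 3: backtracking line search
        while alpha > 1e-10 and f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p                        # step 4: s_k and y_k
        x_new = x + s
        y = grad(x_new) - g
        if y @ s > 1e-12:                    # step 5: BFGS update (skipped if curvature is bad)
            Bs = B @ s
            B = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        x = x_new
    return x

# Example: minimize the Rosenbrock function (illustrative test problem).
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
rosen_grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                                 200 * (x[1] - x[0]**2)])
print(bfgs(rosen, rosen_grad, np.array([-1.2, 1.0])))   # approaches [1, 1]
```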
The Broyden-Fletcher-Goldfarb-Shanno
(BFGS) Method
Limited-memory BFGS Method:
• A variation of the BFGS method
• Represents the Hessian approximation implicitly using only a few vectors
• Requires much less memory
• Well suited for optimization problems with a large number of variables
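In practice one would typically reach for an existing implementation; for example, SciPy exposes a limited-memory BFGS variant. This usage example is an assumption about tooling, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Same illustrative objective as before; "L-BFGS-B" is SciPy's
# limited-memory BFGS (with optional bound constraints).
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
result = minimize(rosen, x0=np.array([-1.2, 1.0]), method="L-BFGS-B")
print(result.x)   # approaches [1, 1]
```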
References
1. J. T. Lo and D. Bassu. An adaptive method of training
multilayer perceptrons. In Proceedings of the 2001
International Joint Conference on Neural Networks, volume
3, pages 2013–2018, July 2001.
2. James Ting-Ho Lo. Convexification for data fitting. Journal
of Global Optimization, 46(2):307–315, February 2010.
3. BFGS: http://en.wikipedia.org/wiki/BFGS
A Notch Function
$$y = f(x) = \chi_{[1,2]\,\cup\,[2.1,\,3.1]}(x)$$

where $\chi_A(x) = 1$ if $x \in A$ and $\chi_A(x) = 0$ if $x \notin A$.
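A minimal Python rendering of this target function, assuming the union-of-intervals reading above:

```python
def notch(x):
    """Indicator of [1, 2] U [2.1, 3.1]: equals 1 on the two intervals
    (leaving a narrow 'notch' over (2, 2.1)) and 0 elsewhere."""
    return 1.0 if (1.0 <= x <= 2.0) or (2.1 <= x <= 3.1) else 0.0
```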
MSE vs. RAE [figure slides; plots not reproduced in these notes]