Neural networks
Training neural networks - regularization
Hugo Larochelle
Département d’informatique
Université de Sherbrooke
[email protected]
September 13, 2012

MACHINE LEARNING
Topics: stochastic gradient descent (SGD)
• $\arg\min_\theta \; \frac{1}{T} \sum_t l(f(x^{(t)}; \theta), y^{(t)}) + \lambda\,\Omega(\theta)$
• Algorithm that performs updates after each example
  ‣ initialize $\theta$   ($\theta \equiv \{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\}$)
  ‣ for N iterations
    - for each training example $(x^{(t)}, y^{(t)})$
      $\Delta = -\nabla_\theta\, l(f(x^{(t)}; \theta), y^{(t)}) - \lambda\, \nabla_\theta\, \Omega(\theta)$
      $\theta \leftarrow \theta + \alpha\, \Delta$
    - iteration over all examples = training epoch
• To apply this algorithm to neural network training, we need
  ‣ the loss function $l(f(x^{(t)}; \theta), y^{(t)})$
  ‣ a procedure to compute the parameter gradients $\nabla_\theta\, l(f(x^{(t)}; \theta), y^{(t)})$
  ‣ the regularizer $\Omega(\theta)$ (and the gradient $\nabla_\theta\, \Omega(\theta)$)
  ‣ initialization method
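A minimal numpy sketch of this loop, not from the slides: `loss_grad` and `omega_grad` are hypothetical helpers standing in for the loss-gradient and regularizer-gradient procedures listed above, and the toy usage at the bottom plugs in a linear model with squared loss so the sketch runs end to end.

```python
import numpy as np

def sgd(theta, data, loss_grad, omega_grad, alpha=0.01, lam=0.001, n_iterations=10):
    """SGD with a regularizer, as on the slide:
    Delta = -grad_theta(loss) - lam * grad_theta(Omega), then theta <- theta + alpha * Delta."""
    for _ in range(n_iterations):               # "for N iterations"
        for x_t, y_t in data:                   # one pass over all examples = one training epoch
            delta = -loss_grad(x_t, y_t, theta) - lam * omega_grad(theta)
            theta = theta + alpha * delta       # parameter update after each example
    return theta

# Toy usage: a linear model with squared loss and an L2 regularizer.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
w_true = rng.randn(5)
y = X @ w_true + 0.1 * rng.randn(100)

loss_grad = lambda x, t, w: (x @ w - t) * x     # gradient of 0.5 * (x.w - t)^2 w.r.t. w
omega_grad = lambda w: 2 * w                    # gradient of ||w||^2
theta = sgd(np.zeros(5), list(zip(X, y)), loss_grad, omega_grad)
print(theta)
```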
REGULARIZATION
Topics: L2 regularization
• $\Omega(\theta) = \sum_k \sum_i \sum_j \left(W_{i,j}^{(k)}\right)^2 = \sum_k \|W^{(k)}\|_F^2$
• Gradient: $\nabla_{W^{(k)}}\, \Omega(\theta) = 2\, W^{(k)}$
• Only applied on weights, not on biases (weight decay)
• Can be interpreted as having a Gaussian prior over the weights
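As a concrete illustration (assumed storage layout, not from the slides): with the weights kept as a list of per-layer matrices W^(k), the penalty and its gradient are a few lines of numpy.

```python
import numpy as np

def l2_penalty(weights):
    """Omega(theta) = sum_k ||W^(k)||_F^2; biases are deliberately left out."""
    return sum(np.sum(W ** 2) for W in weights)

def l2_penalty_grad(weights):
    """Gradient of the L2 penalty w.r.t. each weight matrix: 2 W^(k)."""
    return [2 * W for W in weights]

# Example: two weight matrices of a small network.
weights = [np.random.randn(3, 4), np.random.randn(4, 2)]
print(l2_penalty(weights))
print([g.shape for g in l2_penalty_grad(weights)])
```

In the SGD sketch above, `l2_penalty_grad` would play the role of `omega_grad`, applied layer by layer.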
REGULARIZATION
Topics: L1 regularization
• $\Omega(\theta) = \sum_k \sum_i \sum_j \left|W_{i,j}^{(k)}\right|$
• Gradient: $\nabla_{W^{(k)}}\, \Omega(\theta) = \mathrm{sign}(W^{(k)})$
  ‣ where $\mathrm{sign}(W^{(k)})_{i,j} = 1_{W_{i,j}^{(k)} > 0} - 1_{W_{i,j}^{(k)} < 0}$
• Also only applied on weights
• Unlike L2, L1 will push certain weights to be exactly 0
• Can be interpreted as having a Laplacian prior over the weights
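A matching sketch for the L1 case, under the same assumed list-of-matrices layout; note that `np.sign` returns 0 for entries that are exactly 0, which matches the indicator convention above.

```python
import numpy as np

def l1_penalty(weights):
    """Omega(theta) = sum_k sum_i sum_j |W_{i,j}^(k)|; again weights only, no biases."""
    return sum(np.sum(np.abs(W)) for W in weights)

def l1_penalty_grad(weights):
    """(Sub)gradient of the L1 penalty: sign(W^(k)) entrywise (+1, -1, or 0)."""
    return [np.sign(W) for W in weights]

weights = [np.random.randn(3, 4), np.random.randn(4, 2)]
print(l1_penalty(weights))
print(l1_penalty_grad(weights)[0])
```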
MACHINE LEARNING
Topics: bias-variance trade-off
• Variance of trained model: does it vary a lot if the training set changes
• Bias of trained model: is the average model close to the true solution
• Generalization error can be seen as the sum of the (squared) bias and the variance
[Figure: learned functions f vs. the true function f*, showing low variance / high bias, a good trade-off, and high variance / low bias]
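This decomposition can be illustrated numerically (a rough sketch, not from the slides): train the same class of model on many resampled training sets, then measure how far the average prediction is from the true function (squared bias) and how much individual fits scatter around that average (variance). Here polynomials of increasing degree stand in for models of increasing capacity.

```python
import numpy as np

rng = np.random.RandomState(0)
f_true = lambda x: np.sin(2 * np.pi * x)        # the "true solution"
x_grid = np.linspace(0, 1, 50)                  # points where models are compared

def fit_predict(degree, n_train=20):
    """Fit a polynomial of the given degree on one freshly sampled training set
    and return its predictions on the fixed grid."""
    x = rng.rand(n_train)
    t = f_true(x) + 0.3 * rng.randn(n_train)
    coeffs = np.polyfit(x, t, degree)
    return np.polyval(coeffs, x_grid)

for degree in (1, 3, 9):                        # low to high capacity
    preds = np.array([fit_predict(degree) for _ in range(200)])
    bias2 = np.mean((preds.mean(axis=0) - f_true(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: squared bias {bias2:.3f}, variance {variance:.3f}")
```

Low-degree fits show high bias and low variance, high-degree fits the opposite, matching the trade-off pictured on the slide.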