
In the Name of God
NN Learning based on Information Maximization
Outline
• Introduction to Information Theory
• Information Maximization in NNs
▫ Single Input Single Output
▫ Single Layer Networks
▫ Causal Filters
• Applications
▫ Blind Separation
▫ Blind Deconvolution
▫ Experimental Results
• Discussion
Introduction to Information Theory
• The uncertainty (entropy) function for an RV is defined as
H(X) = −Σ_{i=1}^{M} p(x_i) log p(x_i)        (discrete RVs)
H(X) = −∫_X f_x(x) log f_x(x) dx             (continuous RVs)
• Consider the RV given by the roll of a die:

x        1      2      3      4      5      6    | H(X)
p1(x)    0      0      0      0      0      1    |  0
p2(x)    1/24   1/24   1/3    1/24   1/24   1/2  |  1.8
p3(x)    1/6    1/6    1/6    1/6    1/6    1/6  |  2.6
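• As a sanity check on the table, a minimal Python sketch (assuming log base 2, which matches the listed entropy values):

    import math

    def entropy(p):
        # Shannon entropy in bits; zero-probability outcomes contribute nothing
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    p1 = [0, 0, 0, 0, 0, 1]
    p2 = [1/24, 1/24, 1/3, 1/24, 1/24, 1/2]
    p3 = [1/6] * 6

    for name, p in (("p1", p1), ("p2", p2), ("p3", p3)):
        print(name, round(entropy(p), 1))   # p1 0.0, p2 1.8, p3 2.6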
Introduction to Information Theory
• Joint Entropy:
H(X,Y) = −∫∫_{XY} f_xy(x,y) log f_xy(x,y) dx dy
• Conditional Entropy:
H(Y|X) = H(X,Y) − H(X)
• Mutual Information:
I(X;Y) = H(Y) − H(Y|X)
(H(Y) is the uncertainty about the output; H(Y|X) is the uncertainty about the output given the input)
[Figure: X → deterministic neural network → Y]
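• A small Python sketch of these identities (the 2×2 joint distribution below is a made-up example, not taken from the slides):

    import math

    def H(probs):
        # Shannon entropy in bits of a list of probabilities
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical joint distribution p(x, y) over two binary RVs
    pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px = [sum(p for (xi, _), p in pxy.items() if xi == x) for x in (0, 1)]
    py = [sum(p for (_, yi), p in pxy.items() if yi == y) for y in (0, 1)]

    H_XY = H(pxy.values())          # joint entropy H(X,Y)
    H_Y_given_X = H_XY - H(px)      # H(Y|X) = H(X,Y) - H(X)
    I_XY = H(py) - H_Y_given_X      # I(X;Y) = H(Y) - H(Y|X)
    print(round(H_Y_given_X, 3), round(I_XY, 3))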
Information Maximization in NNs
• Maximize I(X;Y) = H(Y) − H(Y|X) with respect to the network parameters W
[Figure: X → deterministic neural network → Y]
• But H(Y|X) is independent of W:
∂I(X;Y)/∂W = ∂H(Y)/∂W
Maximize I(X;Y)  ≡  Maximize H(Y)
Single Input Single Output
x → g(·) → y
• g(·) is a monotonic function of x:
f_y(y) = f_x(x) / |∂y/∂x|,   x = g^{−1}(y)
H(Y) = −E[log f_y(y)] = E[log |∂y/∂x|] − E[log f_x(x)]
(the last term is independent of the network parameters)
ΔW ∝ ∂H/∂W = ∂/∂W log(∂y/∂x) = (∂y/∂x)^{−1} ∂/∂W (∂y/∂x)
Single Input Single Output
• Let
y = 1 / (1 + e^{−u}),   u = wx + w0
• Matching a neuron's input–output function to the distribution of signals:
∂y/∂x = wy(1 − y)
∂/∂w (∂y/∂x) = y(1 − y)(1 + wx(1 − 2y))
⇓
Δw ∝ 1/w + x(1 − 2y),   Δw0 = 1 − 2y
(1/w is the anti-decay term; x(1 − 2y) is the anti-Hebbian term)
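• A minimal stochastic-gradient sketch of this single-unit rule (the learning rate, iteration count, and Gaussian input distribution are arbitrary choices, not from the slides):

    import math, random

    def logistic(u):
        return 1.0 / (1.0 + math.exp(-u))

    w, w0, lr = 1.0, 0.0, 0.01              # initial weight, bias, learning rate
    for _ in range(10000):
        x = random.gauss(2.0, 1.0)          # sample from an assumed input distribution
        y = logistic(w * x + w0)
        w  += lr * (1.0 / w + x * (1 - 2 * y))   # anti-decay + anti-Hebbian terms
        w0 += lr * (1 - 2 * y)
    # The sigmoid's steep region ends up aligned with the bulk of p(x),
    # which is what maximizing H(Y) through a single sigmoid amounts to.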
Single Layer Networks
Y = g(WX + W0),   g(u) = 1 / (1 + e^{−u})
[Figure: single-layer network mapping inputs x1, x2, …, xN to outputs y1, y2, …, yN]
f_y(y) = f_x(x) / |J|
J = det[∂yi/∂xj] = (det W) Π_{i=1}^{N} yi(1 − yi),   ∂yi/∂xj = wij yi(1 − yi)
H(Y) = −E[log f_y(y)] = E[log |J|] − E[log f_x(x)]
Δwij ∝ ∂H/∂wij = (1/J) ∂J/∂wij = cof(wij)/det W + xj(1 − 2yi)
ΔW ∝ [W^T]^{−1} + (1 − 2Y) X^T,   ΔW0 = 1 − 2Y
([W^T]^{−1} is the anti-redundancy term; (1 − 2Y) X^T is the anti-Hebbian term)
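• A compact NumPy sketch of the batch form of this rule on synthetic mixtures (the Laplacian sources, mixing matrix, learning rate, and iteration count are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, T, lr = 3, 5000, 0.01                 # sources, samples, learning rate
    S = rng.laplace(size=(N, T))             # assumed super-Gaussian sources
    A = rng.normal(size=(N, N))              # unknown mixing matrix
    X = A @ S                                # observed mixtures

    W, W0 = np.eye(N), np.zeros((N, 1))
    for _ in range(2000):
        U = np.clip(W @ X + W0, -50, 50)                  # avoid overflow in exp
        Y = 1.0 / (1.0 + np.exp(-U))                      # logistic outputs
        dW  = np.linalg.inv(W.T) + (1 - 2 * Y) @ X.T / T  # anti-redundancy + anti-Hebbian
        dW0 = np.mean(1 - 2 * Y, axis=1, keepdims=True)
        W  += lr * dW
        W0 += lr * dW0
    # If separation worked, W @ A is close to a scaled permutation matrix.
    print(np.round(W @ A, 2))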
Causal Filters
[Figure: x(t) → causal filter w(t) → u(t) → nonlinearity g(·) → y(t)]
y(t) = g(u(t)) = g(w(t) ∗ x(t))
Y = g(U) = g(WX)

W = ⎡ wL      0     ⋯      0  ⎤
    ⎢ wL−1    wL    ⋯      0  ⎥
    ⎢  ⋮       ⋱    ⋱      ⋮  ⎥
    ⎣  0      ⋯    wL−1    wL ⎦

ΔwL ∝ Σ_{t=1}^{M} (1/wL − 2 xt yt)
ΔwL−j ∝ Σ_{t=1}^{M} (−2 xt−j yt)
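• A sketch of the corresponding tap-by-tap updates on a synthetic echo (the Laplacian source, the "unknown" filter a(t), the filter length, and the learning rate are placeholders for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    M, L, lr = 2000, 15, 0.05                # block length, number of taps, learning rate
    s = rng.laplace(size=M)                  # assumed sparse source s(t)
    a = np.array([1.0, 0.6, 0.3])            # "unknown" causal echo filter a(t)
    x = np.convolve(s, a)[:M]                # observed signal x(t) = a(t) * s(t)

    w = np.zeros(L)                          # taps w_1 .. w_L, leading weight w_L last
    w[-1] = 1.0
    for _ in range(400):
        u = np.convolve(x, w[::-1])[:M]      # causal filtering: u_t = sum_j w_{L-j} x_{t-j}
        y = 1.0 / (1.0 + np.exp(-u))         # sigmoid outputs y_t
        w[-1] += lr * np.mean(1.0 / w[-1] - 2 * x * y)          # leading-weight rule
        for j in range(1, L):
            w[-1 - j] += lr * np.mean(-2 * x[:M - j] * y[j:])   # delayed-tap rule
    # Ideally np.convolve(w[::-1], a) approaches a scaled (possibly delayed) delta,
    # i.e. the learned filter undoes the echo.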
Applications
• Blind Separation (e.g., the cocktail party problem)
[Figure: sources s1, s2, …, sN → unknown mixing matrix A → observations x1, x2, …, xN → W → outputs u1, u2, …, uN; find the W that will reverse the effect of A]
• Blind Deconvolution (e.g., echo cancellation)
[Figure: s(t) → unknown causal filter a(t) → x(t) → causal filter w(t) → u(t)]
Experimental Results
• A 5×5 information-maximization network for blind separation of 5 speakers from 7-second segments.
Experimental Results
• Filters used to convolve the speech signals
[Figure: filter impulse responses]
Discussion
• The algorithms are limited:
▫ Only single-layer networks were used
▫ N-to-N mappings (not useful for dimensionality reduction or compression)
▫ Time delays are not considered in blind separation
• The learning rule is not local
• The presented work brings new applications to the NN domain
References
• A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129–1159, 1995.
• R. B. Ash, Information Theory. New York: Interscience, 1965.