PROCEEDINGS OF THE INTERNATIONAL WORKSHOP ON TELECOMMUNICATIONS - IWT/09
On the improvement of the learning rate in Blind
Source Separation using techniques from Artificial
Neural Networks theory
Felipe Augusto Pereira de Figueiredo
Instituto Nacional de Telecomunicações - Inatel
P.O. Box 05 - 37540-000
Santa Rita do Sapucaí - MG - Brazil
[email protected]

Carlos Alberto Ynoguti
Instituto Nacional de Telecomunicações - Inatel
P.O. Box 05 - 37540-000
Santa Rita do Sapucaí - MG - Brazil
[email protected]
Abstract— In this work, techniques from the Artificial Neural Networks theory are used to improve the convergence speed of a blind source separation (BSS) algorithm. The momentum term, bold driver and exponential decay techniques were used, and experimental results show a reduction of the convergence time by a factor of about 16.
Index Terms— Blind source separation, Neural Networks,
Natural Gradient, Multiple-Input-Multiple-Output (MIMO) Systems, Adaptive Filtering, convolutive mixtures, Second Order
Statistics, Dynamic learning rate, Momentum term.
I. INTRODUCTION

The problem of blind source separation is illustrated in Figure 1.

[Fig. 1. Linear MIMO model for BSS: sources s_q pass through the mixing system H (filters h_qp) to the sensors x_p, and the demixing system W (filters w_pq) produces the outputs y_q.]

In this work a MIMO (Multiple-Input Multiple-Output) model is assumed, in which the signals are convolutively mixed. Also, the number of source signals (s_q(n), q = 1, ..., Q) is assumed to be equal to the number of sensor signals (x_p(n), p = 1, ..., P).

Each of the outputs of the mixing system H is described by

x_p(n) = Σ_{q=1}^{P} Σ_{k=0}^{M−1} h_qp(k) s_q(n − k),   (1)

where h_qp(k), k = 0, ..., M − 1, denote the coefficients of the filter from the q-th source to the p-th sensor.

The problem of BSS consists in finding a corresponding demixing system according to Figure 1, where the output signals y_q(n), q = 1, ..., P (P = Q), are described by

y_q(n) = Σ_{p=1}^{P} Σ_{k=0}^{L−1} w_pq(k) x_p(n − k),   (2)

where L is the length of the demixing system filters.

It can be shown (see, e.g., [1]) that the MIMO demixing system coefficients can in fact reconstruct [10] the sources up to an unknown permutation and an unknown filtering of the individual signals, provided that L is chosen to be at least equal to M.

To estimate the P²L coefficients w_pq(k) of the MIMO demixing filter W, an approach using second-order statistics [6] is considered in this work, which exploits the nonwhiteness and nonstationarity properties of the signals. The nonwhiteness property is exploited by simultaneous diagonalization of output correlation matrices over multiple time-lags, e.g., [4], and the nonstationarity property is exploited by simultaneous diagonalization of short-time output correlation matrices at different time intervals, e.g., [5], [7]-[8]. In the sequel, an algorithm for convolutive mixtures is presented by first introducing a general matrix formulation, following [1], that includes all time-lags.

II. TIME-DOMAIN ALGORITHM FOR BSS

In this section, the matrix formulation that allows the derivation of a time-domain [9] algorithm from a cost function which inherently takes into account the nonstationarity and nonwhiteness properties is introduced.

A. Matrix Notation for Convolutive Mixtures

From Fig. 1, it can be seen that the output signals y_q(n), q = 1, ..., P, of the demixing system at time n are given by
y_q(n) = Σ_{p=1}^{P} x_p^T(n) w_pq,   (3)

where

x_p(n) = [x_p(n), x_p(n − 1), ..., x_p(n − L + 1)]^T   (4)

is a vector containing the latest L samples of the sensor signal x_p(n) of the p-th channel, and

w_pq = [w_pq,0, w_pq,1, ..., w_pq,L−1]^T   (5)

contains the current weights of the MIMO filter taps from the p-th sensor channel to the q-th output channel.

An algorithm for BSS of convolutive mixtures which exploits those two signal properties can be obtained from the definition of the following matrix:

Y_q(m) = [ y_q(mL)          ...  y_q(mL − L + 1)
           y_q(mL + 1)      ...  y_q(mL − L + 2)
             ...            ...    ...
           y_q(mL + N − 1)  ...  y_q(mL − L + N) ]   (6)

where m denotes the time index of the block being processed and N is the length of the system output blocks taken into account for the estimates of the short-time correlations used below. This matrix captures L subsequent output signal vectors

y_q(m) = [y_q(mL), ..., y_q(mL + N − 1)]^T   (7)

in order to incorporate L time-lags into the cost function, so that the algorithm is able to exploit the nonwhiteness property.

With the definitions above, (2) can be rewritten as

Y_q(m) = Σ_{p=1}^{P} X_p(m) W_pq   (8)

The approach followed here is carried out with overlapping data blocks to increase the convergence rate and reduce the signal delay. Overlapping is introduced by simply replacing the time index mL in the equations by m(L/a), with the overlap factor 1 ≤ a ≤ L. The matrices X_p(m), p = 1, ..., P, used in (8) are defined as

X_p(m) = [ x_p(mL)          x_p(mL − 1)      ...  x_p(mL − 2L + 1)
           x_p(mL + 1)      x_p(mL)          ...  x_p(mL − 2L + 2)
             ...              ...            ...    ...
           x_p(mL + N − 1)  x_p(mL + N − 2)  ...  x_p(mL − 2L + N) ]   (9)

Those matrices are Toeplitz with dimension N × 2L, so the first row contains 2L input samples and each subsequent row is shifted to the right by one sample and thus contains one new input sample. The W_pq are 2L × L Sylvester matrices, defined as

W_pq = [ w_pq,0     0          ...  0
         w_pq,1     w_pq,0     ...  0
           ...      w_pq,1     ...    ...
         w_pq,L−1     ...      ...  w_pq,0
         0          w_pq,L−1   ...  w_pq,1
           ...        ...      ...    ...
         0          0          ...  w_pq,L−1
         0          0          ...  0        ]   (10)

Finally, to allow a convenient notation of the algorithm combining all channels, (8) can be compactly rewritten as

Y(m) = X(m) W   (11)

where:

Y(m) = [Y_1(m), Y_2(m), ..., Y_P(m)]   (12)

X(m) = [X_1(m), X_2(m), ..., X_P(m)]   (13)

W = [ W_11  ...  W_1P
       ...  ...   ...
      W_P1  ...  W_PP ]   (14)
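To make the block formulation concrete, the following NumPy sketch (function and variable names are ours, not from the paper) builds the N × 2L Toeplitz matrix X_p(m) of (9) and the 2L × L Sylvester matrix W_pq of (10) for a single channel pair, and checks that the first column of X_p(m)W_pq reproduces the convolution (2):

```python
import numpy as np

def toeplitz_block(x, m, L, N):
    """N x 2L Toeplitz matrix X_p(m): row i holds x[mL+i], x[mL+i-1], ..., x[mL+i-2L+1]."""
    rows = []
    for i in range(N):
        idx = m * L + i - np.arange(2 * L)      # mL+i, mL+i-1, ..., mL+i-2L+1
        rows.append([x[j] if j >= 0 else 0.0 for j in idx])
    return np.array(rows)

def sylvester_block(w):
    """2L x L Sylvester matrix W_pq: column j is the filter w shifted down by j taps."""
    L = len(w)
    W = np.zeros((2 * L, L))
    for j in range(L):
        W[j:j + L, j] = w
    return W

rng = np.random.default_rng(0)
L, N, m = 4, 6, 2
x = rng.standard_normal(m * L + N)              # one sensor signal
w = rng.standard_normal(L)                      # one demixing filter w_pq

Y_block = toeplitz_block(x, m, L, N) @ sylvester_block(w)

# Column 0 of the block equals the direct convolution (2),
# y(n) = sum_k w[k] x[n-k], evaluated at n = mL, ..., mL+N-1.
y_direct = [sum(w[k] * (x[n - k] if n - k >= 0 else 0.0) for k in range(L))
            for n in range(m * L, m * L + N)]
assert np.allclose(Y_block[:, 0], y_direct)
```

The remaining columns hold the same output signal at successive lags, which is what allows the cost function below to see all L time-lags at once.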
B. Cost Function and Algorithm Derivation
Having defined the compact matrix formulation (11) for the block MIMO filtering, a cost function is now defined that explicitly contains correlation matrices over several time-lags under the assumption of short-time stationarity. This cost function is based on a generalization of Shannon's mutual information [11], [12] and simultaneously accounts for the two signal properties used here:

J(m) = (1/M) Σ_{i=0}^{M−1} { log |bdiag Y^T(i)Y(i)| − log |Y^T(i)Y(i)| }   (15)

where the bdiag operation on a block matrix consisting of several sub-matrices sets all sub-matrices off the block diagonal to zero. The cost function above was first introduced in [13] as a generalization of [14]. Since the matrix formulation (11) is used for calculating the short-time correlation matrices Y^T(m)Y(m), the cost function inherently includes all L time-lags of all auto-correlations and cross-correlations of the BSS output signals.

By Oppenheim's inequality [15], Σ_q log |Y_q^T(m)Y_q(m)| ≥ log |Y^T(m)Y(m)|, which ensures that the first term in the braces of (15) is always greater than or equal to the second term; equality holds if all elements of Y^T(m)Y(m) outside the block diagonal, i.e., the output cross-correlations over all time-lags, vanish.
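As an illustration, the cost (15) can be evaluated directly from output blocks with a few lines of NumPy (a sketch; `bdiag` and `cost_J` are our names, and the random blocks merely stand in for real BSS outputs):

```python
import numpy as np

def bdiag(R, P, L):
    """Zero out every sub-matrix off the block diagonal of a (PL x PL) matrix."""
    out = np.zeros_like(R)
    for q in range(P):
        s = slice(q * L, (q + 1) * L)
        out[s, s] = R[s, s]
    return out

def cost_J(Y_blocks, P, L):
    """Cost (15): average of log|bdiag Y^T Y| - log|Y^T Y| over the data blocks."""
    total = 0.0
    for Y in Y_blocks:
        R = Y.T @ Y                              # short-time correlation matrix
        _, logdet_b = np.linalg.slogdet(bdiag(R, P, L))
        _, logdet_f = np.linalg.slogdet(R)
        total += logdet_b - logdet_f
    return total / len(Y_blocks)

rng = np.random.default_rng(1)
P, L, N = 2, 3, 32                               # Y(m) is N x PL here
Y_blocks = [rng.standard_normal((N, P * L)) for _ in range(4)]
J = cost_J(Y_blocks, P, L)
assert J >= 0.0    # the inequality discussed in the text: J vanishes only
                   # when the off-diagonal (cross-correlation) blocks do
```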
The algorithm is based on the first-order gradient. In order to express the update equations of the filter coefficients exclusively by Sylvester matrices W, the gradient is taken with respect to W, and the Sylvester structure of the result is ensured by selecting the non-redundant values using a constraint:

∇_W J(m) = ∂J(m)/∂W   (16)

As a result,

∇_W J(m) = (2/M) Σ_{i=0}^{M−1} R_xy(i) R_yy^{−1}(i) (R_yy(i) − bdiag R_yy(i)) bdiag^{−1} R_yy(i)   (17)

With an iterative optimization procedure, the current demixing matrix is obtained by the recursive update equation

W(m) = W(m−1) − µ(m) ∆W(m)   (18)

The parameter µ(m) gives the length of the step in the negative gradient direction and is often called the step size or learning rate. It can be made either dynamic or constant, depending on the technique adopted, as will be shown in the next section. The choice of an appropriate learning rate µ is essential for the convergence of the algorithm: a very small value leads to slow convergence; very large values, on the other hand, lead to overshooting and instability, which prevent convergence altogether. As is well known, non-quadratic cost functions may have many local maxima and minima, and therefore good choices for the initial values are important.

C. Natural Gradient

The gradient of a function J(m) points in the steepest direction in a Euclidean orthogonal coordinate system. However, the parameter space is not always Euclidean; in fact, it has a Riemannian metric structure, as pointed out by Amari [17]. In such a case, the steepest direction is given by the so-called natural gradient instead. Therefore, in order to use the natural gradient as the update term ∆W(m), the following modification is applied to the gradient:

∇_W^NG J(m) = W W^T ∇_W J(m) = W W^T ∂J(m)/∂W   (19)

and then we have

∇_W^NG J(m) = (2/M) Σ_{i=0}^{M−1} W(m) (R_yy(i) − bdiag R_yy(i)) bdiag^{−1} R_yy(i)   (20)

III. LEARNING RATE ADAPTATION

The Artificial Neural Networks literature has produced a large number of heuristic techniques for speeding up gradient-descent-based algorithms: adjustable learning rates, the addition of some kind of derivative term, clever choices of the initial values, etc. In this work, some of these strategies are applied to the BSS algorithm, and the results are described.

A. Momentum term

The momentum term is a simple and effective technique for increasing the learning rate while avoiding the danger of instability. It is represented by the following equation:

ψ(m) = β (W(m−1) − W(m−2))   (21)

where 0 < β < 1 is a new global parameter which must be determined by trial and error. The use of the momentum term produces the following update equation:

W(m) = W(m−1) − µ(m) ∆W(m) + ψ(m)   (22)

Momentum simply adds a fraction of the previous weight update to the current one. It is important to note that here the learning rate µ is kept constant. When the gradient keeps pointing in the same direction, momentum increases the size of the steps taken towards the minimum; when the gradient keeps changing direction, it smooths out the variations. This technique may also have the benefit of preventing the algorithm from terminating in a shallow local minimum of the error surface.

B. Bold Driver

A useful batch method for adapting the global learning rate µ is the so-called bold driver technique. Its operation is simple: after each epoch, compare the value of the cost function with its previous value. If it has decreased, increase µ by a small proportion (typically 1%-10%; 10% was adopted in this work). If it has increased by more than a tiny proportion (say, 10^−10), undo the last weight change and decrease µ sharply; a reduction of 50% was used in this work. Thus, bold driver keeps growing µ slowly until it finds itself taking a step that has clearly gone too far, up onto the opposite slope of the cost function. Since this means that the algorithm has reached a tricky area of the cost surface, it makes sense to reduce the step size quite drastically at this point.

C. Exponential Decay

This is a simple non-adaptive technique, since it does not rely on any of the outputs of the algorithm; nevertheless, it is effective in accelerating the search for the demixing filter coefficients. The following empirically derived equation is adopted as the time-variant learning factor:

µ(m) = µ0 e^{−m²/(10M)}   (23)
where µ0 is the initial value, M is the total number of epochs adopted and m is the index of the current epoch. The function is time-variant: it starts at the value µ0 and then decreases gradually at each epoch of the algorithm. At the beginning of the learning process the convergence is very fast, since the value of µ is high, and as a result the algorithm can quickly find its way towards the minimum of the cost function; at the end of the process, the small values of µ provide a fine tuning of the parameters being estimated.
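The three schedules of this section can be summarized in a few lines of Python (a sketch with our own names; `grad` stands for the natural-gradient update ∆W(m) of Section II):

```python
import numpy as np

def momentum_step(W, W_prev, grad, mu, beta):
    """Update (22): gradient step plus a fraction beta of the previous weight change."""
    return W - mu * grad + beta * (W - W_prev)

def bold_driver(mu, J, J_prev, up=1.10, down=0.50, tol=1e-10):
    """Grow mu while the cost J decreases; on a clear increase, shrink mu sharply.
    Returns the new mu and whether the last weight change should be kept."""
    if J < J_prev:
        return mu * up, True        # cost went down: 10% raise, keep the step
    if J > J_prev + tol:
        return mu * down, False     # overshoot: halve mu and undo the step
    return mu, True

def exp_decay(mu0, m, M):
    """Schedule (23): mu(m) = mu0 * exp(-m**2 / (10 * M))."""
    return mu0 * np.exp(-m**2 / (10.0 * M))

mu, kept = bold_driver(0.002, J=1.0, J_prev=2.0)        # -> (~0.0022, True)
decays = [exp_decay(0.07, m, 1500) for m in range(0, 1500, 100)]
assert decays[0] == 0.07 and all(a >= b for a, b in zip(decays, decays[1:]))
```

Note that momentum and bold driver adapt from feedback (the previous update and the cost history), while (23) is a fixed schedule; this difference matters when the techniques are combined, as discussed in Section IV.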
IV. EXPERIMENTS AND RESULTS

In this section, results of the experiments performed with the above-mentioned techniques are presented. The experiments were conducted using speech signals convolved with synthetic impulse responses that simulate the acoustical behavior of real rooms [18]. For this purpose, filters with 100 taps were used.
Two audio signals with 5 seconds of speech each were passed through the filters. These signals correspond to the voices of a male and a female speaker. The recordings were made in a low-noise environment, with a sampling frequency of 11025 Hz and 16-bit resolution.
As mentioned before, the number of source signals is assumed to be equal to the number of sensors (two).
The Signal-to-Interference Ratio (SIR), defined as the ratio of the power of the target signal to the power of the jammer signal, was used to evaluate the performance of the algorithm. The SIR measured at the input of the demixing system was 5.1496 dB; the SIR measured at the output of the system for each of the techniques is presented below.
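For reference, the SIR figures quoted below can be computed as a simple power ratio once the target and jammer components of an output are available separately (a sketch; `sir_db` is our name):

```python
import numpy as np

def sir_db(target, interference):
    """SIR in dB: ratio of target-signal power to jammer-signal power."""
    return 10.0 * np.log10(np.sum(target**2) / np.sum(interference**2))

t = np.sin(np.linspace(0.0, 100.0, 8000))   # stand-in target component
sir = sir_db(t, 0.5 * t)                    # jammer at half the amplitude
# power ratio is 4, so sir = 10*log10(4) ≈ 6.02 dB
```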
To decrease the delay introduced in the output signal and to increase the convergence rate, the overlapping method [23] was adopted, with an overlap factor a = 2, i.e., 50% overlap.
The length L of the demixing filters was made equal to the length M of the mixing filters. The demixing filters W_pp were initialized with a unit impulse at the first tap, and all the taps of the filters W_pq, p ≠ q, were set to zero.
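This initialization amounts to starting W as an identity-like system; as a sketch (function and variable names are ours):

```python
import numpy as np

def init_demixing(P, L):
    """W_pp gets a unit impulse at the first tap; W_pq, p != q, starts at zero."""
    w = np.zeros((P, P, L))            # w[p, q] holds the L taps of filter w_pq
    for p in range(P):
        w[p, p, 0] = 1.0
    return w

w0 = init_demixing(P=2, L=100)         # the two-channel, 100-tap setup used here
```

With this start the demixing system initially passes each sensor signal straight through, and the adaptation only has to learn the cross-channel filters.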
In the following sections, experimental test results are reported in order to compare the performance of the proposed methods of modifying the learning rate parameter µ.

A. Initial test

In [1], [9], [13] and [16] the parameter µ is always constant. The first test follows this strategy, in order to establish a baseline performance for the system. A value of µ = 0.002 was chosen, and the final result can be seen in Figure 2.
Convergence was achieved after 1300 training epochs, and the final SIR was 36.7790 dB.
[Fig. 2. Initial test: SIR (dB) versus training epochs.]
B. Momentum

The second test makes use of the momentum term (Equation (22)) to re-estimate the demixing matrix. As in the previous case, a learning rate µ = 0.002 was used. The value of β that led to the best result was 0.8. The result of this experiment is shown in Figure 3.
[Fig. 3. Momentum: SIR (dB) versus training epochs.]
For this test, convergence was achieved after 270 training epochs, leading to a final SIR of 36.7556 dB.
C. Bold driver

An initial value µ0 = 0.002 was set for the learning rate. The values used to increase and decrease this parameter were 10% and 50%, respectively. The result of applying this technique is presented in Figure 4.
Convergence was reached after 130 training epochs, and the final SIR was 35.8116 dB.
[Fig. 4. Bold driver: SIR (dB) versus training epochs.]

D. Exponential decay

For this case, the initial value of the learning rate was chosen to be µ0 = 0.07. This higher value was chosen to accelerate the convergence. The results are shown in Figure 5.
This technique led to an SIR of 36.8 dB in 80 training epochs.

[Fig. 5. Exponential decay: SIR (dB) versus training epochs.]

The oscillatory behaviour shown in the Exponential Decay figure is due to changes in the sign, i.e., changes in the direction, of the gradient at each epoch of the algorithm. In some cases the error surface has substantially different curvature along different directions, leading to the formation of long narrow valleys. For most points on the surface, the gradient does not point towards the minimum, and successive steps of gradient descent can oscillate from one side to the other, progressing only very slowly towards the minimum.

E. Experiments with combined techniques

As additional tests, some of the techniques mentioned above were combined, with the intention of verifying the results produced by these combinations. For all the cases shown below, the same set of parameters as those used when testing the techniques alone was adopted.
Two such combinations were tested:
• Bold Driver + Momentum
• Exponential Decay + Momentum
The results are shown in Figures 6 and 7.

[Fig. 6. Bold driver + momentum: SIR (dB) versus training epochs.]

[Fig. 7. Exponential decay + momentum: SIR (dB) versus training epochs.]

Using the bold driver combined with the momentum term, no noticeable changes were observed, either in the final SIR or in the convergence speed.
Combining the Momentum and Exponential Decay techniques led to a worse result than using either of them separately. A possible reason for this behavior is that the latter gradually decreases the learning rate µ at each training epoch; after some epochs its value tends towards zero, resulting in a flat Signal-to-Interference Ratio. Since the momentum term is a kind of derivative technique, it smooths out the learning rate of the algorithm; consequently, its effect after the algorithm has reached this flat state is also negligible, as the values of the momentum term likewise tend to zero. In order to avoid the flat state being reached before the algorithm can converge to its optimal result, an appropriate (higher) value for µ0 should be chosen.

F. Comparison among all techniques

The results presented above are summarized in Table I. The first line shows the results of the original approach, with a fixed step size; next come the results for the bold driver (BD), exponential decay (ED) and momentum techniques; finally, the results of the combined techniques are described.
TABLE I
COMPARISON OF THE DIFFERENT METHODS.

Technique       Convergence epoch   SIR (dB)   Convergence time (min)
Fixed           1300                36.7790    317.42 (5.29 hours)
BD              130                 35.8116    32.39
ED              80                  36.8       21.88
Momentum        270                 36.7556    67.99
Momentum + ED   120                 29.2217    29.36
Momentum + BD   135                 36.2502    33.885
The analysis of these results shows that all the proposed techniques, except the combination of the momentum term with the exponential decay, lead to essentially the same SIR as the fixed step-size approach. The convergence time, however, is much lower: the fastest method converges about 16 times faster than the fixed step size.
V. CONCLUSIONS AND FUTURE WORK
In this work, some techniques derived from the Artificial
Neural Networks theory were used to improve the algorithm
proposed by H. Buchner and colleagues [1].
The main idea presented here is to dynamically modify the learning rate µ. Three such techniques were evaluated: momentum, bold driver and exponential decay; in addition, combinations of the momentum term with the bold driver and with the exponential decay were implemented and tested. Compared with the fixed step size proposed in [1], all of them sped up the convergence considerably while reaching a similar final SIR (except for the combination of momentum with exponential decay, which degraded it). The Exponential Decay technique achieved the best result of all, with a reduction of about 16 times in the number of training epochs until convergence.
For the future, different strategies to choose the initial values
for the demixing filter are being considered.
ACKNOWLEDGEMENTS
The authors would like to thank CAPES for partial funding.
REFERENCES
[1] H. Buchner, R. Aichner, W. Kellermann, A generalization of blind source
separation algorithms for convolutive mixtures based on second-order
statistics, IEEE Trans. Speech Audio Process. 13 (1) (January 2005)
120-134.
[2] E. Colin Cherry, "Some experiments on the recognition of speech, with one and with two ears", Journal of the Acoustical Society of America, vol. 25, pp. 975-979, 1953.
[3] Herault, Jeanny and Christian Jutten. Space or time adaptive signal
processing by neural network models. AIP Conference Proceedings, 151,
pp. 206-211, 1986.
[4] L. Molgedey and H. G. Schuster, "Separation of a mixture of independent signals using time delayed correlations", Physical Review Letters, vol. 72, pp. 3634-3636, 1994.
[5] E. Weinstein, M. Feder, and A. Oppenheim, “Multi-channel signal separation by decorrelation”, IEEE Trans. on Speech and Audio Processing,
vol 1, no. 4, pp. 405-413, Oct. 1993 .
[6] R. Battiti, “First and second-order methods for learning: between
steepest descent and Newton’s method”, Technical report, University
of Trento, 1991.
[7] S. Van Gerven and D. Van Compernolle, “Signal separation by symmetric adaptive decorrelation: stability, convergence, and uniqueness”,
IEEE Trans. Signal Processing, vol. 43, no. 7, pp. 1602-1612, 1995.
[8] C. L. Fancourt and L. Parra, “The coherence function in blind source
separation of convolutive mixtures of non-stationary signals”, in Proc.
Int. Workshop on Neural Networks for Signal Processing (NNSP), 2001.
[9] R. Aichner et al., “Time-domain blind source separation of nonstationary convolved signals with utilization of geometric beamforming”,
in Proc. Int. Workshop on Neural Networks for Signal Processing,
Martigny, Switzerland, 2002.
[10] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley & Sons, Inc., New York, 2001.
[11] Shannon, C.E. and Weaver, W. (1949). The mathematical theory of
communication. University of Illinois Press, Urbana, Illinois.
[12] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley
& Sons, New York, 1991.
[13] H. Buchner, R. Aichner, and W. Kellermann, “A generalization of a class
of blind source separation algorithms for convolutive mixtures”, Proc.
IEEE Int. Symposium on Independent Component Analysis and Blind
Signal Separation (ICA), Nara, Japan, Apr. 2003, pp. 945-950.
[14] K. Matsuoka, M. Ohya, and M. Kawamoto, “A neural net for blind
separation of nonstationary signals”, Neural Networks, vol. 8, no. 3, pp.
411-419, 1995.
[15] A. Oppenheim, “Inequalities connected with definite hermitian forms”,
J. London Math. Soc., vol. 5, pp. 114-119, 1930.
[16] H. Buchner, R. Aichner, and W. Kellermann, "A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics", IEEE Trans. on Speech and Audio Processing, vol. 13, no. 1, pp. 120-134, Jan. 2005.
[17] S. I. Amari, “Natural Gradient Works Efficiently in Learning”, Neural
Computation, February 15, 1998, Vol. 10, No. 2, Pages 251-276
[18] Stephen G. McGovern, “A Model for Room Acoustics”,
http://www.2pi.us/rir.html, 2003-2004