ADAPTIVE STEP SIZE TECHNIQUES FOR DECORRELATION AND BLIND
SOURCE SEPARATION
S.C. Douglas¹ and A. Cichocki²

¹Department of Electrical Engineering, Southern Methodist University, Dallas, TX 75275 USA
²Brain-Style Information Systems Group, RIKEN Brain Science Institute, Wako-shi, Saitama 351-0198 JAPAN
ABSTRACT
Careful selection of step size parameters is often
necessary to obtain good performance from gradient-based adaptive algorithms for decorrelation and
source separation tasks. In this paper, we provide
an overview of methods for the on-line calculation
of step size parameters for these systems. A particular emphasis is placed on gradient adaptive step
sizes for a class of natural gradient algorithms for
decorrelation and blind source separation. Simulations verifying their useful behaviors are provided.
1. INTRODUCTION
Blind source separation (BSS) is the task of separating multiple statistically-independent signals from observed linear mixtures. Useful for many signal processing tasks, BSS has received much recent research attention, and numerous useful algorithms have been developed [1]-[4]. In BSS, one measures a vector sequence x(k) = [x_1(k), ..., x_n(k)]^T that is assumed to fit the model

  x(k) = H s(k)    (1)

where s(k) = [s_1(k), ..., s_m(k)]^T, m <= n, contains m independent source signals and H is an (n x m) mixing matrix. In adaptive solutions to the BSS task, an output signal vector y(k) = [y_1(k), ..., y_m(k)]^T is computed as

  y(k) = W(k) x(k)    (2)

where W(k) is an (m x n) matrix of adaptive parameters. Then, the goal is to adjust W(k) such that the combined system matrix C(k) = W(k)H evolves as

  lim_{k→∞} C(k) = P D    (3)

where P and D are permutation and nonsingular diagonal scaling matrices, respectively, such that each s_i(k) in s(k) appears in y(k).
One particularly-useful iterative technique for BSS is the natural gradient algorithm given by

  W(k+1) = W(k) + μ(k)[I - f(y(k)) y^T(k)] W(k)    (4)

where f(y(k)) = [f_1(y_1(k)), ..., f_m(y_m(k))]^T is a vector of nonlinearly-modified output signals. This algorithm attempts to minimize the entropy-based cost function E{J(W(k))} where

  J(W) = -(1/2) log |det W^T W| - Σ_{i=1}^{m} log p_i(y_i(k))    (5)

and f_i(y) = -∂ log p_i(y)/∂y. Numerous useful properties concerning the behavior of (4) and the natural gradient have been given in the literature [3]-[6]. Perhaps the most important property possessed by (4) from the standpoint of performance is its uniform convergence behavior for arbitrary matrices H, a property known as equivariance [4].
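To make the update concrete, here is a minimal NumPy sketch of one iteration of (4). The 2x2 mixing matrix, the cubic nonlinearity f(y) = y^3 (a common illustrative choice for sub-Gaussian sources such as uniform signals), and the step size value are all assumptions for the demo, not values taken from this paper.

```python
import numpy as np

def natural_gradient_step(W, x, mu, f):
    """One natural gradient BSS iteration, eq. (4):
    y = W x;  W <- W + mu * (I - f(y) y^T) W."""
    y = W @ x
    return W + mu * (np.eye(W.shape[0]) - np.outer(f(y), y)) @ W, y

# Toy demo with a hypothetical 2x2 mixture of uniform (sub-Gaussian) sources.
rng = np.random.default_rng(0)
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])          # assumed mixing matrix
W = np.eye(2)
cubic = lambda y: y**3              # illustrative nonlinearity for sub-Gaussian sources
for k in range(20000):
    s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=2)   # unit-variance sources
    W, y = natural_gradient_step(W, H @ s, mu=0.002, f=cubic)

C = W @ H   # combined system; ideally close to a scaled permutation matrix
```

After adaptation, each row of C should be dominated by a single entry, reflecting the goal (3) of reaching a scaled permutation of the sources.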
A related task to BSS is adaptive decorrelation (AD) or prewhitening, in which x(k) has an autocorrelation matrix

  R_xx = E{x(k) x^T(k)}.    (6)

The goal of AD is to determine W(k) in (2) such that R_yy(k) = E{y(k) y^T(k)} obeys

  lim_{k→∞} R_yy(k) = I.    (7)

As many adaptive systems behave better when driven by decorrelated signals, AD is a useful preprocessing step in adaptive filtering, array processing, blind deconvolution/equalization, and multilayer perceptron training. It can be shown that choosing f(y(k)) = y(k) in (4) yields an equivariant AD algorithm. Alternatively, the computationally-simple update

  W(k+1) = W(k) + μ(k)[I - y(k) y^T(k)]    (8)

with W(0) chosen as a symmetric matrix can be used. Performance analyses of both (4) and (8) for AD indicate that they work as desired, and the analyses provide useful information for choosing μ(k) to obtain stable, robust behavior from these schemes [7].
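A minimal sketch of the simple update (8), run on an assumed 2x2 correlating mixture with an illustrative fixed step size; the running estimate of R_yy is only for monitoring the whitening behavior (7).

```python
import numpy as np

def decorrelation_step(W, x, mu):
    """One iteration of the simple AD update (8):
    y = W x;  W <- W + mu * (I - y y^T).  W(0) must be symmetric."""
    y = W @ x
    return W + mu * (np.eye(len(y)) - np.outer(y, y)), y

rng = np.random.default_rng(1)
H = np.array([[2.0, 0.8],
              [0.8, 1.0]])        # assumed correlating mixture
W = np.eye(2)                     # symmetric initial condition
Ryy = np.zeros((2, 2))
for k in range(40000):
    W, y = decorrelation_step(W, H @ rng.standard_normal(2), mu=0.002)
    Ryy = 0.999 * Ryy + 0.001 * np.outer(y, y)   # running output correlation
```

Because the correction term I - y y^T is symmetric, W stays symmetric for all k, and the running output correlation approaches the identity as required by (7).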
A critical challenge in the above tasks is the choice of step size μ(k) to achieve fast initial adaptation, a low steady-state error, and, in the case of nonstationary or time-varying signal models, good tracking performance. In this paper, we develop gradient-based methods for adjusting μ(k) for the algorithms in (4) and (8). Like gradient step size methods for other systems [8]-[13], the proposed schemes have desirable performance properties, as verified by simulations.

The organization of this paper is as follows. In the next section, we review the task of step size selection for adaptive systems. In Section 3, gradient adaptive step size methods for (4) and (8) are derived, and simulations of their performances are described in Section 4. Section 5 presents our conclusions.
2. STEP SIZE SELECTION
(Portions of this section are taken from the review article [14].)

2.1. Summary

An adaptive system is by its very nature time-varying. The rate at which such a system changes its internal parameters
determines its capabilities to adjust to and to obtain useful
information from an unknown physical environment. Most
practical adaptive systems employ one or more step size
parameters, or step sizes, that control the dynamics of the
internal states of the system.
In this section, we provide an overview of several techniques for selecting time-varying step sizes for the on-line
training of adaptive systems. In our discussion, we pay
particular attention to the techniques developed for linear
adaptive systems, as these systems have received the most
attention in the technical literature.
2.2. On-Line vs. Batch Training

Before discussing step size selection, it is important to distinguish the difference between on-line and batch training. In on-line training, the adaptive system continually adapts its internal states as new signal measurements become available. In this context, the step size parameters control the degree of averaging within the parameter updates as well as the speed with which old measurements are "forgotten" within the iterative procedure.

By contrast, batch training methods are off-line parameter estimation techniques in which the size of the training set is fixed. For nonlinear regression tasks such as multilayer perceptron training, batch training takes the form of an iterative optimization procedure, with an associated step size sequence, as applied to a cost function of the data. Classical optimization methods include steepest descent, conjugate gradient, Newton's, and quasi-Newton methods [15]. In such methods, step size selection depends on the form of the cost function surface that one must search to determine a desired set of fixed parameters. Standard optimization techniques are often designed for linear regression or filtering tasks and quadratic-type cost functions, and thus they may not be well-suited to training nonlinear systems or systems adapted using non-quadratic cost functions. It is possible, however, to improve the convergence behaviors of nonlinear systems by modifying standard optimization techniques for such systems. For a discussion of one of these methods as applied to multilayer perceptron training, see [16].
2.3. Parameter Estimation Methods

One of the simplest and most popular techniques for adjusting an adaptive system's parameters is the gradient descent method, in which the parameters are changed according to the derivatives of a particular cost function with respect to the current parameter values. Both the least-mean-square (LMS) algorithm for adaptive filters [17] and the back-propagation algorithm for multilayer perceptrons [18] are gradient descent methods. Much is known about the behavioral characteristics of gradient adaptation, and the algorithms are usually numerically-robust. In gradient descent, the step sizes control the magnitudes of the changes in the parameters in the negative direction of the gradient.

The Kalman filter forms the basis for another class of parameter estimation techniques that employ second-order information about the cost function being minimized. Recursive least-squares (RLS) techniques are widely used in linear estimation tasks, and [19] explores the connections between RLS and Kalman techniques. The extended Kalman filter is a linearized version of the Kalman filter for nonlinear state-space estimation tasks [20], and it has been successfully applied to multilayer perceptron training [21]. Many simplified and approximate versions of this algorithm have been studied. Note that in RLS and linearized least-squares methods with forgetting factor λ, the parameter (1 - λ) plays the role of the step size, whereas in Kalman techniques the step size is automatically determined from its underlying Bayesian problem formulation.
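As a concrete illustration of this correspondence (a toy example, not from the paper), an exponentially-weighted running mean with forgetting factor λ can be rewritten as a gradient-style update whose step size is exactly 1 - λ:

```python
def ewma(samples, lam=0.99):
    """Exponentially-weighted mean: m <- lam*m + (1-lam)*x is the same
    recursion as the gradient-style step m <- m + mu*(x - m), with
    step size mu = 1 - lam."""
    m = 0.0
    for x in samples:
        m += (1.0 - lam) * (x - m)   # step size mu = 1 - lam
    return m
```

Small λ (a large step size 1 - λ) tracks quickly but averages little; λ near one averages heavily, which is exactly the convergence/averaging trade-off described above.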
2.4. Goal of Step Size Selection

The performance of any adaptive system that is attempting to drive its adjustable parameters to an optimum fixed set of parameters is governed by two quantities: (i) its convergence rate and (ii) the misadjustment in steady state. The convergence rate refers to the transient behavior of the parameters as they approach their optimum values. Misadjustment refers to the additional error in the output of the system caused by the random fluctuations of the parameters
in steady-state. Generally speaking, the convergence rate
of a system increases for step sizes that are somewhat less
than one-half of the maximum value of the step size that
provides stable adaptive behavior. In contrast, the misadjustment generally decreases as the step size is decreased.
The goal of any time-varying step size procedure is to
increase the step size to a large but stable operating value
when the parameters are some distance from their optimum settings and to systematically decrease the step size
to reduce the misadjustment when the parameters are in
the vicinity of their optimum settings. Since stability is often not explicitly ensured within a time-varying step size
method, it is generally necessary to limit the range of step
sizes to guarantee stable operation of the system.
When the desired parameter settings vary with time, the
system must continually readjust its parameters to follow
these variations. The error induced in the model parameters
by any time-variation of the unknown optimal parameters
is called the lag error. In some cases where the velocity
of the model parameters is constant, an optimum step size
value exists that minimizes the contributions of the misadjustment and the lag error after all initial transients in the
parameters have died out.
2.5. Types of Step Size Methods

We can classify a step size selection method as adaptive or non-adaptive depending on its form. Non-adaptive methods calculate a time-varying step size according to a priori knowledge about the signals being processed, the cost function being optimized, and/or the parameter structure of the system. These techniques include asymptotically-optimal methods as derived via the theory of stochastic approximation [22], methods based on a statistical analysis of the particular system [23, 24], and heuristic approximations to these methods, commonly known as "search-and-converge" [25] or "gearshifting." By contrast, adaptive methods are based on on-line measurements of the state of the adaptive system, usually as characterized by the outputs or by the parameter updates of the system.

Non-adaptive step size methods usually require more information about the adaptive system and the problem context than do adaptive step size methods. However, non-adaptive step size methods usually outperform adaptive step size methods because of this increased knowledge.
2.6. Adaptive Step Size Methods

Gradient adaptation has proven to be quite useful for parameter estimation. It can also be used to optimize the step size parameters in an on-line fashion. This idea has appeared and reappeared in the scientific literature. One of the earliest descriptions of the methodology appears in [8], and it was later reintroduced to both the neural network [9] and signal processing [10, 11] communities, where it has become known as the "delta-bar-delta rule" and "gradient step size method," respectively. An alternative version of this algorithm is studied within a stochastic approximation formulation in [12], and an extension of the method to the RLS algorithm is described in [13].

In linear adaptive filtering, it can be shown under certain conditions that gradient step size methods attempt to find the optimum step size sequence that provides the fastest adaptation of the system [11]. Moreover, when tracking constant-velocity time-varying systems, gradient step size methods achieve near-optimal tracking ability in steady-state. However, the convergence behavior of the step size procedure is generally slow due to the limitations of the standard gradient descent procedure. In cases where the unknown model parameters exhibit acceleration and higher-order movement effects, techniques for gradient adjustment of the step size procedure (i.e., adjusting the "step size of the step size") could be considered but are typically discounted due to their additional computational complexity and to the difficulty in tuning these procedures for best performance.

Adaptive step size methods can also employ other statistical relationships between the processed signals to vary the step sizes in a useful way. Some of the techniques adjust the step sizes according to the statistical correlations between the input and error signals [26] or the error signals at different time instants [27]. Alternatively, one can employ the magnitude of the gradient in a suitable way [28]. While these methods are not optimal in any particular sense, they have been shown to be useful in certain contexts.

Although much research effort has been devoted to adaptive step size strategies for supervised training, relatively little work has been focused on adaptive step size strategies for unsupervised training. It is possible to extend many of the ideas described previously to the unsupervised case, e.g., by applying gradient-based step size methods to a particular unsupervised cost criterion. In the latter portion of this paper, we consider just such an extension for decorrelation and source separation. An example of a somewhat-different philosophy can be found in [29], in which the parameter update terms are used directly within a nonlinear filtering structure. While heuristic, such an approach is more general than other cost-function-based approaches, as certain parameter adaptation rules for unsupervised training are not derived from the explicit minimization of a cost function.
3. GRADIENT STEP SIZES FOR DECORRELATION AND SOURCE SEPARATION

In this section, we derive gradient step size algorithms for (4) and (8) that adjust μ(k) such that E{J(W(k))} in (5) is approximately minimized for an arbitrary f(y(k)) and f(y(k)) = y(k), respectively. In its generic form, the gradient step size method calculates μ(k) as [9, 10]

  μ(k) = μ(k-1) - ρ(k) ∂J(W(k))/∂μ(k-1)    (9)

where ρ(k) is a step size parameter.
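Before specializing (9) to blind criteria, it may help to see the generic recipe on a familiar supervised example. The sketch below applies the gradient step size idea to a scalar LMS filter, where (9) reduces to the well-known rule of [10], μ(k) = μ(k-1) + ρ e(k) e(k-1) x(k) x(k-1); the signal model and all numerical constants are illustrative assumptions.

```python
import numpy as np

def lms_gradient_mu(x, d, rho=1e-3, mu0=0.01, mu_min=1e-4, mu_max=0.1):
    """Scalar LMS whose step size follows the generic rule (9):
    mu(k) = mu(k-1) - rho * dJ/dmu(k-1).  For the squared-error cost this
    reduces to mu(k) = mu(k-1) + rho*e(k)*e(k-1)*x(k)*x(k-1), clipped to
    a stable range."""
    w, mu = 0.0, mu0
    e_prev = x_prev = 0.0
    for xk, dk in zip(x, d):
        e = dk - w * xk                       # a priori error
        mu = float(np.clip(mu + rho * e * e_prev * xk * x_prev, mu_min, mu_max))
        w += mu * e * xk                      # LMS parameter update
        e_prev, x_prev = e, xk
    return w, mu

rng = np.random.default_rng(2)
x = rng.standard_normal(5000)
d = 0.7 * x + 0.01 * rng.standard_normal(5000)   # unknown gain w* = 0.7
w_hat, mu_final = lms_gradient_mu(x, d)
```

Early on, consecutive error-input products are positively correlated, so μ(k) grows and speeds convergence; near the solution they decorrelate and μ(k) shrinks toward the floor, reducing misadjustment.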
3.1. Blind Source Separation
To determine the form of (9) for the algorithm in (4), we must determine the form of the second term on the RHS of (9) for an arbitrary f(y(k)). For notational simplicity, assume that m = n, although the algorithms derived are applicable in the general case of m < n. Taking derivatives of both sides of (5) with respect to μ(k-1) gives

  ∂J(W(k))/∂μ(k-1) = -∂ log |det W(k)|/∂μ(k-1) + Σ_{i=1}^{m} f_i(y_i(k)) ∂y_i(k)/∂μ(k-1)    (10)

where f_i(y) = -∂ log p_i(y)/∂y.

To evaluate the RHS of (10), note that (4) can be written as

  W(k+1) = [I + μ(k)(I - f(y(k)) y^T(k))] W(k)    (11)

and thus

  det W(k+1) = det[I + μ(k)(I - f(y(k)) y^T(k))] det W(k).    (12)

The matrix in large brackets on the RHS of (11) has m-1 eigenvalues equal to 1 + μ(k) and one eigenvalue equal to 1 + μ(k)[1 - y^T(k) f(y(k))]. Thus,

  log |det W(k+1)| = (m-1) log |1 + μ(k)| + log |1 + μ(k)[1 - y^T(k) f(y(k))]| + log |det W(k)|.    (13)

Therefore, so long as 0 < μ(k) |1 - y^T(k) f(y(k))| < 1, we have

  ∂ log |det W(k+1)|/∂μ(k) = (m-1)/(1 + μ(k)) + (1 - y^T(k) f(y(k)))/(1 + μ(k)[1 - y^T(k) f(y(k))]).    (14)

To evaluate the second term on the RHS of (10), note that

  y(k+1) = W(k+1) x(k+1)    (15)
         = W(k) x(k+1) + μ(k)[I - f(y(k)) y^T(k)] W(k) x(k+1)    (16)

and thus

  ∂y(k+1)/∂μ(k) = [I - f(y(k)) y^T(k)] W(k) x(k+1).    (17)

Combining (10), (14), and (17), we obtain

  ∂J(W(k+1))/∂μ(k) = -(m-1)/(1 + μ(k)) - (1 - y^T(k) f(y(k)))/(1 + μ(k)[1 - y^T(k) f(y(k))]) + f^T(y(k+1))[I - f(y(k)) y^T(k)] W(k) x(k+1).    (18)

Finally, for small step sizes, we can approximate W(k) x(k+1) with y(k+1) to obtain

  ∂J(W(k+1))/∂μ(k) ≈ -(m-1)/(1 + μ(k)) - (1 - y^T(k) f(y(k)))/(1 + μ(k)[1 - y^T(k) f(y(k))]) + y^T(k+1) f(y(k+1)) - f^T(y(k+1)) f(y(k)) y^T(k) y(k+1).    (19)

Thus, the gradient step size algorithm is

  μ(k) = μ(k-1) + ρ(k) [ (m-1)/(1 + μ(k-1)) + (1 - y^T(k-1) f(y(k-1)))/(1 + μ(k-1)[1 - y^T(k-1) f(y(k-1))]) - y^T(k) f(y(k)) + f^T(y(k)) f(y(k-1)) y^T(k-1) y(k) ].    (20)

Note that this algorithm requires approximately 4m multiplies and two divides; thus, its complexity is typically much less than that of the algorithm in (4), which requires approximately 4mn multiplies.
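A sketch of how (20) couples to (4) in practice. The mixing matrix, nonlinearity, and all numerical settings are illustrative assumptions, and the clipping range follows the practical limits discussed in Sec. 3.3.

```python
import numpy as np

def bss_step_adaptive_mu(W, mu, x, y_prev, f, rho, mu_min, mu_max):
    """One joint iteration of the BSS update (4) with the gradient
    step size rule (20); mu is clipped to [mu_min, mu_max]."""
    m = W.shape[0]
    y = W @ x                                  # y(k)
    fy, fy_prev = f(y), f(y_prev)
    a = y_prev @ fy_prev                       # y^T(k-1) f(y(k-1))
    grad = ((m - 1) / (1 + mu)
            + (1 - a) / (1 + mu * (1 - a))
            - y @ fy
            + (fy @ fy_prev) * (y_prev @ y))   # bracketed term of (20)
    mu = float(np.clip(mu + rho * grad, mu_min, mu_max))
    W = W + mu * (np.eye(m) - np.outer(fy, y)) @ W   # eq. (4)
    return W, mu, y

rng = np.random.default_rng(3)
H = np.array([[1.0, 0.5],
              [0.3, 1.0]])                     # assumed mixing matrix
W, mu, y_prev = np.eye(2), 0.002, np.zeros(2)
for k in range(10000):
    s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=2)
    W, mu, y_prev = bss_step_adaptive_mu(W, mu, H @ s, y_prev,
                                         f=lambda y: y**3, rho=1e-5,
                                         mu_min=1e-4, mu_max=0.005)
```

The step size update costs only a handful of inner products per iteration, consistent with the O(m) complexity noted above for (20).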
3.2. Adaptive Decorrelation

The gradient step size algorithm in (20) can be applied for AD as well by choosing f(y(k)) = y(k) within both (4) and (20). If (8) is used instead, a different gradient step size algorithm must be derived that follows the update in (9). The derivation for this case is similar to that above. We can write (8) as

  W(k+1) = [I + μ(k)(I - y(k) x^T(k))] W(k)    (21)

such that ∂ log |det W(k+1)|/∂μ(k) is identical to (14) for f(y(k)) = x(k). Similarly, we have

  ∂y(k+1)/∂μ(k) = [I - y(k) x^T(k)] W(k) x(k+1)    (22)
               ≈ y(k+1) - y(k) x^T(k) y(k+1)    (23)

if y(k+1) ≈ W(k) x(k+1). Hence, we can evaluate (10) for f(y(k)) = y(k) as

  ∂J(W(k+1))/∂μ(k) ≈ -(m-1)/(1 + μ(k)) - (1 - y^T(k) x(k))/(1 + μ(k)[1 - y^T(k) x(k)]) + y^T(k+1) y(k+1) - y^T(k+1) y(k) x^T(k) y(k+1)    (24)

and thus the update for μ(k) becomes

  μ(k) = μ(k-1) + ρ(k) [ (m-1)/(1 + μ(k-1)) + (1 - y^T(k-1) x(k-1))/(1 + μ(k-1)[1 - y^T(k-1) x(k-1)]) - y^T(k) y(k) + y^T(k) y(k-1) x^T(k-1) y(k) ].    (25)
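The analogous coupling of (8) with (25) can be sketched the same way; the input statistics and all numerical settings below are assumptions for the demo.

```python
import numpy as np

def ad_step_adaptive_mu(W, mu, x, x_prev, y_prev, rho, mu_min, mu_max):
    """One joint iteration of the decorrelation update (8) with the
    gradient step size rule (25); mu is clipped to [mu_min, mu_max]."""
    m = W.shape[0]
    y = W @ x                                  # y(k)
    a = y_prev @ x_prev                        # y^T(k-1) x(k-1)
    grad = ((m - 1) / (1 + mu)
            + (1 - a) / (1 + mu * (1 - a))
            - y @ y
            + (y @ y_prev) * (x_prev @ y))     # bracketed term of (25)
    mu = float(np.clip(mu + rho * grad, mu_min, mu_max))
    W = W + mu * (np.eye(m) - np.outer(y, y))  # eq. (8)
    return W, mu, y

rng = np.random.default_rng(4)
H = np.array([[2.0, 0.8],
              [0.8, 1.0]])                     # assumed correlating mixture
W, mu = np.eye(2), 0.005
x_prev, y_prev = np.zeros(2), np.zeros(2)
for k in range(30000):
    x = H @ rng.standard_normal(2)
    W, mu, y_prev = ad_step_adaptive_mu(W, mu, x, x_prev, y_prev,
                                        rho=5e-5, mu_min=1e-4, mu_max=0.015)
    x_prev = x
```

Note that only one extra matrix-vector product beyond (8) itself is needed, since y(k) is already available.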
3.3. Practical Issues

Gradient step size methods are optimized for performance and not robustness; hence, they typically do not guarantee the stability of the overall systems in which they are used. For this reason, one must typically restrict μ(k) to a finite range [μ_min, μ_max] in practice. Here, μ_min is a small constant that prevents adaptation from completely ceasing, and μ_max is chosen to limit the size of μ(k) for stability purposes.

One drawback to gradient step size schemes in general is the fact that the "step size of the step size" ρ(k) is a parameter that must be chosen. Analyses and simulations of gradient step size algorithms for adaptive filters have indicated that their performance can be relatively insensitive to the value of ρ(k). Simulations of (20) and (25) indicate that their performances are somewhat sensitive to the choice of ρ(k), but that the choice of μ_max is more critical to the convergence behavior of the overall system.

4. SIMULATIONS

We now explore the behaviors of the gradient step size methods via simulations. For the BSS example, we have generated x(k) according to (1) in which H = H_0 or H = H_0^T for 0 <= k < 8000 and k >= 8000, respectively, where

        [ 0.4  1.0  0.9 ]
  H_0 = [ 0.6  0.5  0.5 ]    (26)
        [ 0.3  0.7  0.2 ]

and each s(k) contains one uniform-[-(5)^{1/4}, (5)^{1/4}]-distributed and two binary-{±1}-distributed independent sources. The condition number of H_0 H_0^T in this case is 36.9. Versions of (4) with fixed and gradient step sizes were used to process the data. Twenty trials were run, in which W(0) was a different random matrix with W(0) W^T(0) = I, and ensemble averages of

  η_BSS(k) = (1/3) Σ_{i=1}^{3} [ ( Σ_{j=1}^{3} c_{ij}^2(k) ) / ( max_{1<=l<=3} c_{il}^2(k) ) - 1 ]    (27)

for each algorithm were taken, where [C(k)]_{ij} = c_{ij}(k) and l_i ≠ l_j for i ≠ j.
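The performance measure (27) can be computed directly from the combined matrix C(k); the sketch below also checks its two boundary cases.

```python
import numpy as np

def eta_bss(C):
    """Crosstalk measure (27): for each row of C = W H, total squared
    power divided by the largest squared entry, minus one, averaged over
    the rows.  Zero exactly when C is a scaled permutation matrix."""
    C2 = np.asarray(C, dtype=float)**2
    return float(np.mean(C2.sum(axis=1) / C2.max(axis=1) - 1.0))

# A scaled permutation (perfect separation) scores 0; an all-ones
# (fully mixed) 3x3 matrix scores m - 1 = 2.
P = np.array([[0.0, 2.0,  0.0],
              [0.0, 0.0, -1.5],
              [0.5, 0.0,  0.0]])
```

For example, `eta_bss(P)` returns 0.0 and `eta_bss(np.ones((3, 3)))` returns 2.0.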
Fig. 1(a) shows the evolution of η_BSS(k) for (4) with μ(k) = 0.002 as well as the adaptive step size method in (20) with μ(0) = 0.002, ρ(k) = 10^{-5}, μ_max = 0.005, and μ_min = 0.0001. As can be seen, the gradient step size method provides both fast initial convergence as well as a slight reduction in crosstalk in steady-state as compared to the fixed step size case. Fig. 1(b) shows the average evolution of μ(k) for this algorithm, in which the desirable behavior of the gradient step size method (an increased μ(k) initially and at system changes) is apparent.
We then simulated the behavior of the AD algorithms in (4) with f(y(k)) = y(k) and in (8), respectively, with both fixed and gradient step sizes. The mixing matrix used for H_0 in this case was the same as that of A in the second example in [7], in which the condition number of H_0 H_0^T is 422.4. Fig. 2(a) shows the evolution of the performance factor

  η_AD(k) = || I - C(k) C^T(k) ||_F^2    (28)

for the equivariant algorithm in (4) with f(y(k)) = y(k) with both fixed and gradient step sizes, where || · ||_F denotes the Frobenius norm. In this case, a fixed step size of μ(k) = 0.01 was chosen for (4), and the parameters for the gradient step size algorithm were μ(0) = 0.01, ρ(k) = 5 x 10^{-5}, μ_max = 0.015, and μ_min = 0.0001. As in the source separation case, the algorithm with gradient step size outperforms its fixed-step-size counterpart, and the desirable behaviors of the gradient step size method are clearly observed in Fig. 2(b). Simulations of the simplified algorithm in (8) with both fixed and gradient step sizes yield similar relative performance differences, although the absolute performances of the equivariant methods were found to be far superior.
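The factor (28) is a one-liner in code; the sketch below includes a sanity check that any orthogonal C scores zero.

```python
import numpy as np

def eta_ad(C):
    """Decorrelation performance factor (28): squared Frobenius norm of
    I - C C^T.  Zero exactly when C is orthogonal, i.e., when the combined
    system perfectly whitens unit-variance white sources."""
    m = C.shape[0]
    return float(np.linalg.norm(np.eye(m) - C @ C.T, ord='fro')**2)

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation is orthogonal
```

For example, `eta_ad(R)` is 0 (up to roundoff), while `eta_ad(2 * np.eye(2))` gives || I - 4I ||_F^2 = 18.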
Fig. 1: Average evolutions of (a) η_BSS(k) and (b) μ(k) for the various algorithms in the source separation example. [Log-scale plots versus number of iterations, comparing adaptive and fixed μ(k).]
Fig. 2: Average evolutions of (a) η_AD(k) and (b) μ(k) for the various algorithms in the decorrelation example. [Log-scale plots versus number of iterations, comparing adaptive and fixed μ(k).]
5. CONCLUSIONS
In this paper, we have provided gradient step size algorithms for blind source separation and adaptive decorrelation techniques. The algorithms are particularly simple
and perform well in their respective tasks. Our results suggest that gradient step size methods can be successfully applied to blind adaptation criteria and achieve relative performance improvements that are similar to those obtained
for trained adaptation criteria.
REFERENCES
[1] P. Comon, "Independent component analysis: A new concept?" Signal Processing, vol. 36, no. 3, pp. 287-314, Apr. 1994.
[2] A.J. Bell and T.J. Sejnowski, "An information maximization approach to blind separation and blind deconvolution," Neural Comput., vol. 7, no. 6, pp. 1129-1159, Nov. 1995.
[3] S. Amari, A. Cichocki, and H.H. Yang, "A new learning algorithm for blind signal separation," Adv. Neural Inform. Proc. Sys. 8 (Cambridge, MA: MIT Press, 1996), pp. 757-763.
[4] J.-F. Cardoso and B. Laheld, "Equivariant adaptive source separation," IEEE Trans. Signal Processing, vol. 44, pp. 3017-3030, Dec. 1996.
[5] S. Amari and J.-F. Cardoso, "Blind source separation: Semiparametric statistical approach," IEEE Trans. Signal Processing, vol. 45, pp. 2692-2700, Dec. 1997.
[6] S. Amari and S.C. Douglas, "Why natural gradient?" Proc. Int. Conf. Acoust., Speech, Signal Processing, Seattle, WA, vol. 2, pp. 1213-1216, May 1998.
[7] S.C. Douglas and A. Cichocki, "Neural networks for blind decorrelation of signals," IEEE Trans. Signal Processing, vol. 45, pp. 2829-2842, Nov. 1997.
[8] S. Amari, "A theory of adaptive pattern classifiers," IEEE Trans. Electron. Comput., vol. 16, pp. 299-307, 1967.
[9] R.A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, pp. 295-307, 1988.
[10] V.J. Mathews and Z. Xie, "A stochastic gradient adaptive filter with gradient adaptive step size," IEEE Trans. Signal Processing, vol. 41, pp. 2075-2087, June 1993.
[11] S.C. Douglas, "Generalized gradient step sizes for stochastic gradient adaptive filters," Proc. Int. Conf. Acoust., Speech, Signal Processing, Detroit, MI, vol. 2, pp. 1396-1399, May 1995.
[12] H.J. Kushner and J. Yang, "Analysis of adaptive step-size SA algorithms for parameter tracking," IEEE Trans. Automatic Control, vol. 40, pp. 1403-1410, Aug. 1995.
[13] S. Haykin, Adaptive Filter Theory, 3rd ed. (Upper Saddle River, NJ: Prentice-Hall, 1996).
[14] S.C. Douglas and A. Cichocki, "On-line step-size selection for training of adaptive systems," IEEE Signal Processing Mag., vol. 14, no. 6, pp. 45-46, Nov. 1997.
[15] D.G. Luenberger, Linear and Nonlinear Programming, 2nd ed. (Reading, MA: Addison-Wesley, 1984).
[16] S. Becker and Y. Le Cun, "Improving the convergence of back-propagation learning with second-order methods," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds. (San Mateo, CA: Morgan Kaufmann, 1989), pp. 29-37.
[17] B. Widrow and S.D. Stearns, Adaptive Signal Processing (Englewood Cliffs, NJ: Prentice-Hall, 1985).
[18] S. Haykin, Neural Networks: A Comprehensive Foundation (Englewood Cliffs, NJ: Macmillan, 1994).
[19] A.H. Sayed and T. Kailath, "A state-space approach to adaptive RLS filtering," IEEE Signal Processing Mag., vol. 11, no. 3, pp. 18-60, July 1994.
[20] L. Ljung and T. Soderstrom, Theory and Practice of Recursive Identification (Cambridge, MA: MIT Press, 1983).
[21] S. Singhal and L. Wu, "Training feedforward networks with the extended Kalman filter," Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Glasgow, UK, vol. 2, pp. 1187-1190, May 1989.
[22] H.J. Kushner and D.S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems (New York: Springer-Verlag, 1978).
[23] D.T.M. Slock, "On the convergence behavior of the LMS and the normalized LMS algorithms," IEEE Trans. Signal Processing, vol. 41, pp. 2811-2825, Sept. 1993.
[24] S.C. Douglas and T.H.-Y. Meng, "Normalized data nonlinearities for LMS adaptation," IEEE Trans. Signal Processing, vol. 42, pp. 1352-1365, June 1994.
[25] C. Darken and J. Moody, "Towards faster stochastic gradient search," Advances in Neural Information Processing Systems, vol. 4 (San Mateo, CA: Morgan Kaufmann, 1991), pp. 1009-1016.
[26] T.J. Shan and T. Kailath, "Adaptive algorithms with an automatic gain control feature," IEEE Trans. Circuits Syst., vol. 35, pp. 122-127, Jan. 1988.
[27] T. Aboulnasr and K. Mayyas, "A robust variable step-size LMS-type algorithm: Analysis and simulations," IEEE Trans. Signal Processing, vol. 45, pp. 631-639, Mar. 1997.
[28] S. Karni and G. Zeng, "A new convergence factor for adaptive filters," IEEE Trans. Circuits Syst., vol. 36, pp. 1011-1012, July 1989.
[29] A. Cichocki, S. Amari, M. Adachi, and W. Kasprzak, "Self-adaptive neural networks for blind separation of sources," Proc. Int. Symp. Circuits Syst., Atlanta, GA, vol. 2, pp. 157-161, May 1996.
[30] H. Sompolinsky, N. Barkai, and H.S. Seung, "On-line learning of dichotomies: Algorithms and learning curves," in Neural Networks: The Statistical Mechanics Perspective, J.-H. Oh, C. Kwon, and S. Cho, eds. (Singapore: World Scientific, 1995), pp. 105-130.
[31] N. Murata, K. Mueller, A. Ziehe, and S. Amari, "Adaptive on-line learning in changing environments," Advances in Neural Information Processing Systems, vol. 9 (Cambridge, MA: MIT Press, 1997), pp. 599-605.
[32] A. Cichocki, B. Orsier, A. Back, and S. Amari, "On-line adaptive algorithms in non-stationary environments using a modified conjugate gradient approach," Proc. IEEE Workshop Neural Networks Signal Processing, Amelia Island Plantation, FL, Sept. 1997.