Structural Adaptation in Mixture of Experts
Viswanath Ramamurti and Joydeep Ghosh
Department of Electrical and Computer Engineering
The University of Texas at Austin, Austin, TX 78712-1084
E-mail: {viswa, [email protected]

(This research was supported in part by ARO contracts DAAH04-94-G-0417 and 04-95-10494 and NSF grant ECS 9307632.)
Abstract
The "mixture of experts" framework provides a modular and flexible approach to function approximation. However, the important problem of determining the appropriate number and complexity of experts has not been fully explored. In this paper, we consider a localized form of the gating network that can perform function approximation tasks very well with only one layer of experts. Measures needed for the smooth functioning of the training algorithm for this model are described first. We then propose two techniques to overcome the model selection problem in the mixture of experts architecture. In the first technique, we present an efficient way to grow expert networks so as to arrive at an appropriate number of experts for a given problem. In the second approach, we start with a certain number of experts and present methods to prune experts that become less useful and to add experts that would be more effective. Simulation results are presented that support the proposed techniques. We observe that the growing/pruning approach yields substantially better results than the standard approach, even when the final network sizes are chosen to be the same.
1. Introduction
The mixture of experts architecture is a powerful
modular architecture which works on the principle of
divide and conquer. The model employs probabilistic
methods to divide the input space into overlapping regions on which "experts" act, and a gating network
to weight these experts and form a combined network
output.
A simple mixture of experts model is shown in Figure 1. Each expert network is typically a single-layer network with a possible output non-linearity. In the original mixture of experts model [1], the gating network is a single-layer feedforward network with a "softmax" output activation function. The "softmax" function ensures that the gating network outputs are nonnegative and sum to 1. With a gating network of this form, the input space is divided into different but overlapping regions by "soft" hyperplanes, and each region is assigned to an expert. Such a structure is limited in that, when the function approximation task is non-trivial, the gating network finds it difficult to come up with the right partitions for the linear experts to do a good job of approximating the function at hand.

The one-level mixture of experts architecture was extended in [2] to a hierarchy in which each expert network in Figure 1 is replaced by a mixture of experts network consisting of a gate and sub-experts. The resulting two-level hierarchy can be extended to form deeper trees. The deeper the HME tree, the more powerful the architecture becomes. Figure 2 illustrates this phenomenon. The task was to approximate a sinc function; a simple mixture of experts network with 16 experts, a 4-level binary tree hierarchy, and an 8-level binary tree hierarchy were employed. It is seen that the deeper the network, the better it performs. The same phenomenon is also reported for classification-type HME networks [5][4]. Unfortunately, the computational cost increases with the tree height, as more gating and expert networks need to be trained. The depth of the tree also raises the question of what the right size and structure of the tree is to best solve a problem, i.e., one is faced with the model selection problem.

Xu, Jordan and Hinton [6] have recently come up with an alternative form of the gating network which divides the input space into "soft" hyper-ellipsoids, as opposed to the "soft" hyperplane divisions created by the original gating network. With such localized
regions of expertise, a single layer of linear experts is
able to perform function approximation tasks very well.
Figure 3 shows the performance of the localized model
with only one layer of 15 experts on the same sinc function that was approximated earlier.
This paper begins by observing that the localized gating network also provides a basis for overcoming the model selection problem. Two techniques are proposed to counter this problem. In the first technique, one starts by training a network consisting of only one expert, which corresponds to a linear regression, and grows more experts sequentially to fit the complexity of the problem. In the second technique, when it is known a priori that the function approximation task is non-trivial, one starts by training an adequately powerful network and, while training, prunes away experts which become less useful and grows experts, if needed, which would be more effective. Compared to training a network with a fixed number of experts, the second approach makes good use of every expert in the network and turns out to be efficient at avoiding bad local minima. In this paper, we first briefly describe the localized model for mixture of experts and present measures to be taken for the smooth functioning of the training algorithm used to train the model. This is followed by a description of the two techniques proposed to counter the model selection problem.
2. The Localized Mixture of Experts Model
In the mixture of experts model, we have a set of expert networks $j = 1, \ldots, M$, all of which look at the input vector $x$ to form outputs $\hat{y}_j$. The gating network also looks at the input vector $x$ and produces outputs $g_j \geq 0$, $\sum_j g_j = 1$, which weight the expert network outputs to form the overall output $\hat{y} = \sum_j g_j \hat{y}_j$.
The gating network proposed in [6] is of the form
$$ g_j(x; \nu) = \frac{\alpha_j P(x \mid \nu_j)}{\sum_i \alpha_i P(x \mid \nu_i)}, $$
where
$$ P(x \mid \nu_j) = (2\pi)^{-n/2} \, |\Sigma_j|^{-1/2} \exp\{-0.5\,(x - m_j)^T \Sigma_j^{-1} (x - m_j)\}. $$
Thus the $j$th expert's influence is localized to a region around $m_j$.
The expert networks are single-layer networks with an identity output activation function for regression problems. The parameter vector of expert $j$ is denoted by $\theta_j$ ($\hat{y}_j = \theta_j^T x$).
The Expectation-Maximization (EM) formulation for estimating the gating network and expert network parameters $\alpha_j$, $\nu_j = (m_j, \Sigma_j)$, and $\theta_j$ from the set of training patterns $t$ is given by [6]:

E step:
$$ h_j^{(k)}(y^{(t)} \mid x^{(t)}) = \frac{ g_j(x^{(t)}; \nu^{(k)}) \, \exp\{-0.5\,(y - \hat{y}_j)^T (y - \hat{y}_j)\} }{ \sum_i g_i(x^{(t)}; \nu^{(k)}) \, \exp\{-0.5\,(y - \hat{y}_i)^T (y - \hat{y}_i)\} } $$

M step (gating network):
$$ \alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) $$
$$ m_j^{(k+1)} = \frac{1}{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) \, x^{(t)} $$
$$ \Sigma_j^{(k+1)} = \frac{1}{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) \, [x^{(t)} - m_j^{(k)}][x^{(t)} - m_j^{(k)}]^T $$

M step (expert network):
$$ \theta_j^{(k+1)} = \arg\min_{\theta_j} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) \, \|y - \hat{y}_j\|^2 $$
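Since the experts are linear ($\hat{y}_j = \theta_j^T x$), the arg min above is a weighted least-squares problem. Its standard closed-form solution, which the text does not spell out but which follows from the usual derivation, is
$$ \theta_j^{(k+1)} = \Big( \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) \, x^{(t)} x^{(t)T} \Big)^{-1} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) \, x^{(t)} y^{(t)T}. $$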
We have employed this model to solve a number of regression problems and have found that the network performs very well with just a single layer of experts. Because it requires only one gating network, this model is significantly faster to train than a hierarchical model. Also, the M-step for this gating network has an exact one-pass analytical solution, whereas the M-step for the original gating network can only be solved approximately in one pass.
However, there are a couple of things that need attention while training the localized model in practice.
In the EM iterations, the E-step of the gating network
requires inverting the covariance matrices j s obtained
in the M-step. Some of the covariance matrices that
are obtained often turn out to be close to singular.
To avoid this problem, we consider only the diagonal
elements of the covariance matrices and make the odiagonal elements zero. This means that only the variance terms of the dierent input elements are considered, the cross terms are being ignored. This does not
aect the function approximation capability of the network, it would at worst make the network need more
experts. The eect of this change on the EM derivation
would be that , while calculating the j s in the M-step,
only the terms corresponding to the diagonal entries
need be computed. Also, we impose lower bounds on
the diagonal entries which prevent them from becoming too small. The lower threshold for each entry is
selected in such a way that it reects the variance of
the individual input elements. When a computed value
goes below the threshold at any time, it is replaced by
the constant threshold value. We have also found the
algorithm to be sensitive to initial random means associated with the experts. Initializing these means by the
standard K-means algorithm works well in practice.
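As a concrete illustration, the sketch below implements one EM iteration of the localized mixture of linear experts with the safeguards described above (diagonal covariances, variance floors, K-means initialization of the means). It is a minimal reading of the equations in this section, not the authors' original code; the array shapes, the use of scipy's kmeans2 for initialization, and the small ridge term in the expert update are our own assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # K-means initialization of the expert means


def init_params(X, Y, n_experts, seed=0):
    """Initialize priors, means, diagonal covariances and linear expert weights.
    X: (n, d) inputs, Y: (n, o) targets."""
    means, _ = kmeans2(X, n_experts, seed=seed, minit='++')
    alphas = np.full(n_experts, 1.0 / n_experts)           # gating priors
    variances = np.tile(X.var(axis=0), (n_experts, 1))     # diagonal covariances only
    thetas = np.zeros((n_experts, X.shape[1], Y.shape[1])) # linear expert weights
    return alphas, means, variances, thetas


def em_step(X, Y, alphas, means, variances, thetas, var_floor):
    """One EM iteration for the localized mixture of experts (diagonal covariances)."""
    n, d = X.shape
    # Gating densities P(x | nu_j): diagonal Gaussians around each expert mean
    diff = X[:, None, :] - means[None, :, :]                         # (n, M, d)
    log_px = -0.5 * np.sum(diff**2 / variances[None], axis=2) \
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)   # (n, M)
    log_g = np.log(alphas)[None] + log_px
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)                                # gating outputs g_j(x)

    # E-step: posteriors h_j combine the gate and the expert fit
    Yhat = np.einsum('nd,mdo->nmo', X, thetas)                       # per-expert outputs
    err = np.sum((Y[:, None, :] - Yhat)**2, axis=2)                  # squared errors
    h = g * np.exp(-0.5 * (err - err.min(axis=1, keepdims=True)))
    h /= h.sum(axis=1, keepdims=True)

    # M-step (gating): closed-form updates of priors, means, diagonal covariances
    Nj = h.sum(axis=0)
    alphas = Nj / n
    means = (h.T @ X) / Nj[:, None]
    diff = X[:, None, :] - means[None, :, :]
    variances = np.einsum('nm,nmd->md', h, diff**2) / Nj[:, None]
    variances = np.maximum(variances, var_floor)                     # variance floor

    # M-step (experts): weighted least squares for each linear expert
    for j in range(len(alphas)):
        W = h[:, j]
        A = X.T @ (W[:, None] * X) + 1e-6 * np.eye(d)                # small ridge for stability
        thetas[j] = np.linalg.solve(A, X.T @ (W[:, None] * Y))
    return alphas, means, variances, thetas, h
```

In practice an outer loop would call em_step repeatedly, monitoring the validation MSE to decide when to stop, as the growing procedure of Section 3 assumes.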
The capability of the localized model is illustrated
by simulation results on the building2 data-set, a function approximation task, from the PROBEN1 set of
benchmark data sets [3]. In the building2 data-set the
input is a 14-d vector and the output is a 3-d vector.
The task was to perform function approximation with 2104 training samples and 1052 test samples. A single-layer mixture of 10 experts gives an average test set MSE of 0.0084 and a training set MSE of 0.0072. This result is as good as the best results obtained in [3] using a fully connected MLP with short-cut connections, and slightly better than the results quoted in [3] when short-cut connections were not present.
3. The Growing Mixture of Experts Network
In this section, a constructive algorithm is presented for building a mixture of experts network. The idea is to start with one expert and add experts one at a time so as to systematically reduce the output error.
Let us assume that at some stage there are m experts in the network (m − 1 experts having already been added). The mixture of m experts is trained using the EM algorithm described in the previous section until the MSE on the validation set fails to decrease, and the network parameters obtained are saved. We define a weighted MSE $E_j$ for each expert $j$ ($j = 1, \ldots, m$) over the validation set samples $p$:
$$ E_j = \frac{\sum_p h_j^p \, \|y_p - \hat{y}_p\|^2}{\sum_p h_j^p}, $$
where the $h_j^p$ are obtained by evaluating the E-step expression on the validation set samples. $E_j$ measures how well a given expert performed on the validation set. If the partitioning had been crisp, i.e., if every $h_j^p$ were either 1 or 0, every validation sample would have been associated with exactly one expert, and $E_j$ would have been the MSE of expert $j$ in approximating a sub-function from the samples associated with it. Since the $h_j^p$ in the mixture of experts architecture are not all 1s and 0s, the $E_j$ in reality correspond to a soft version of the mean-squared error. If $E_j$ is large, it indicates one of two possibilities: (1) the localized model is not able to approximate the target function well for the weighted samples associated with expert $j$, or (2) the localized model has overfit the training data and therefore does not perform well on the validation data. In the former case, adding an expert to the localized region spanned by expert $j$ would reduce the output error due to the added flexibility. In the latter case, such an addition would only overfit the training data further.
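The weighted validation errors are straightforward to compute from the posteriors. The short sketch below does so; the array names (combined validation predictions and posteriors evaluated on the validation set) follow the conventions of the earlier sketch and are our own.

```python
import numpy as np


def weighted_validation_errors(Y_val, Yhat_val, h_val):
    """Soft per-expert error E_j on the validation set.

    Y_val:    (P, O) validation targets
    Yhat_val: (P, O) combined network outputs on the validation set
    h_val:    (P, M) E-step posteriors h_j^p evaluated on the validation samples
    """
    sq_err = np.sum((Y_val - Yhat_val) ** 2, axis=1)                      # (P,)
    return (h_val * sq_err[:, None]).sum(axis=0) / h_val.sum(axis=0)      # E_j per expert
```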
To add the most effective expert, the $E_j$ are therefore ranked from largest to smallest. Let c denote the expert network with the highest rank. We now try to add an expert m+1 that reduces the error due to expert c. To this end, a new expert m+1 is created as follows:

- The weights of expert network m+1 are set equal to the weights of expert network c.
- "Weighted 2-means" is performed on the training samples t (with weights $h_c^t$). The two means obtained become the new means associated with the two experts, m+1 and c.
- Set $\Sigma_{m+1} = \Sigma_c^{new} = \frac{1}{2} \Sigma_c^{old}$.
- Set $\alpha_{m+1} = \alpha_c$.
The "weighted 2-means" algorithm is performed as follows:

- Initialize the two means to the two training samples having the largest $h_c^t$.
- Assign each point in the training set to subset $t_1$ or $t_2$ depending on whether it is closer to $\hat{m}_1$ or $\hat{m}_2$.
- Update the means:
$$ \hat{m}_1^{new} = \frac{\sum_{t \in t_1} h_c^t \, x^{(t)}}{\sum_{t \in t_1} h_c^t}, \qquad \hat{m}_2^{new} = \frac{\sum_{t \in t_2} h_c^t \, x^{(t)}}{\sum_{t \in t_2} h_c^t}. $$
- Iterate until convergence.
In the crisp case, the "weighted 2-means" procedure reduces to finding two means for the set of samples associated with expert c using the K-means algorithm.
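The following sketch combines the splitting steps above with the weighted 2-means procedure. It is one possible reading of the description, using the array layout of the earlier EM sketch; the function names, the iteration cap, and the small epsilon guarding empty clusters are our own assumptions.

```python
import numpy as np


def weighted_two_means(X, w, n_iter=50):
    """Cluster the training samples into two groups, weighting by w = h_c^t."""
    m = X[np.argsort(w)[-2:]].copy()         # init: the two samples with the largest h_c^t
    for _ in range(n_iter):
        assign = np.argmin(((X[:, None, :] - m[None]) ** 2).sum(axis=2), axis=1)
        new_m = m.copy()
        for k in range(2):
            mask = assign == k
            if mask.any():                    # guard against an empty cluster
                new_m[k] = np.average(X[mask], axis=0, weights=w[mask] + 1e-12)
        if np.allclose(new_m, m):
            break
        m = new_m
    return m


def split_expert(X, h, c, alphas, means, variances, thetas):
    """Create expert m+1 from expert c: copy its weights and prior, place the two
    means by weighted 2-means, and halve the (diagonal) covariance of both."""
    m1, m2 = weighted_two_means(X, h[:, c])
    means = np.vstack([means, m2[None]])
    means[c] = m1
    variances = np.vstack([variances, 0.5 * variances[c][None]])
    variances[c] *= 0.5
    thetas = np.concatenate([thetas, thetas[c][None]], axis=0)   # same expert weights
    alphas = np.append(alphas, alphas[c])    # priors are re-estimated at the next M-step
    return alphas, means, variances, thetas
```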
The newly formed network is trained for one EM iteration. If there is a decrease in MSE on the validation data-set, we proceed further. Otherwise, we revert to the previously saved network and try adding a new expert to reduce the error due to the expert with the next highest rank. This procedure terminates when, at some stage, adding a new expert to reduce the error of any of the existing experts fails to reduce the overall MSE on the validation data-set. Also, at any time during the growing procedure, when an overfitting expert is detected, its expert network parameters, mean, and covariance matrix are permanently frozen.
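Putting the pieces together, the control flow of the growing procedure might look like the outline below. This is only a sketch: train_until_plateau, validation_mse, predictions, and posteriors are hypothetical helpers (wrappers around the em_step routine sketched earlier), and the freezing of overfitting experts is omitted.

```python
import copy
import numpy as np


def grow(params, X, Y, X_val, Y_val, var_floor):
    """Grow experts one at a time, keeping a split only if validation MSE drops.
    `params` is the (alphas, means, variances, thetas) tuple from the earlier
    sketches; train_until_plateau, validation_mse, predictions and posteriors
    are hypothetical helpers built on em_step."""
    params = train_until_plateau(params, X, Y, X_val, Y_val)
    while True:
        saved = copy.deepcopy(params)
        base = validation_mse(params, X_val, Y_val)
        E = weighted_validation_errors(Y_val, predictions(params, X_val),
                                       posteriors(params, X_val, Y_val))
        accepted = False
        for c in np.argsort(E)[::-1]:                      # worst weighted error first
            trial = split_expert(X, posteriors(params, X, Y), c, *params)
            trial = em_step(X, Y, *trial, var_floor)[:4]   # one EM iteration
            if validation_mse(trial, X_val, Y_val) < base:
                params = train_until_plateau(trial, X, Y, X_val, Y_val)
                accepted = True
                break
            params = copy.deepcopy(saved)                  # revert and try the next expert
        if not accepted:
            return params                                  # no split reduces validation MSE
```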
The above algorithm was tested on both synthetic and real-life problems. The first task was to approximate a sinc function with additive Gaussian noise of variance 0.25; the training set contained 100 samples. Figure 4 shows results from the growing algorithm. The test MSE in this case was computed by measuring the error between the network output and the true sinc function. The MSE obtained at the end of the growing operation was better than the MSE obtained by training static networks of 10, 15, or 20 experts. Figure 5 shows the performance of the final trained network.
The algorithm was next tried on the building2 data set described earlier. The results are shown in Table 1. It is seen that with 4 experts the network performs very well, with a test set MSE on par with that obtained in Section 2. A multivariate function approximation task was considered next. The function to be approximated was
$$ y = 0.79 + 1.27 x_1 x_2 + 1.56 x_1 x_4 + 3.42 x_2 x_5 + 2.06 x_3 x_4 x_5. $$
The training set consisted of 820 samples and the validation set had 204 samples. Table 2 compares the average performance on the validation set.
4. Pruning and Growing Mixture of Experts
A popular approach to structural adaptation is to start with an adequately powerful model and then simplify it based on the training data. Various weight decay strategies for pruning links have been proposed, especially for the MLP network. From Section 2, observe that the prior $\alpha_j$ is computed in the M-step as
$$ \alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}). $$
Thus $\alpha_j$ is proportional to the sum of the $h_j^t$ over all patterns $t$ in the training set, $h_j^t$ being the posterior probability of selecting expert $j$ given input $x^{(t)}$ and its corresponding output $y^{(t)}$. In every iteration of the EM algorithm, $\alpha_j$ therefore directly measures how important the network considers expert $j$ to be relative to the other experts. Hence, when it is desired to prune an expert, the obvious candidate is the expert with the smallest value of $\alpha_j$. The following method for pruning and growing while training a mixture of experts network serves two purposes: (i) to remove redundant experts and (ii) to avoid local minima and achieve better generalization.
Pruning and Growing: Train the mixture of experts network with a certain initial number of experts m. Prune the expert with the lowest $\alpha_j$. If there is no significant change in network performance, continue pruning. Otherwise, perform network growth as described in the previous section. There is no strict order in which pruning and growing should be performed; it has generally been found more useful to perform initial pruning followed by growing.
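A sketch of the pruning step, under the same array conventions as the earlier sketches (the renormalization of the remaining priors is our own choice; the next EM iteration would re-estimate them in any case):

```python
import numpy as np


def prune_weakest_expert(alphas, means, variances, thetas):
    """Remove the expert with the smallest prior alpha_j and renormalize the priors."""
    j = int(np.argmin(alphas))                 # the expert the gate uses least
    keep = np.arange(len(alphas)) != j
    alphas = alphas[keep]
    alphas = alphas / alphas.sum()             # keep the gating priors summing to 1
    return alphas, means[keep], variances[keep], thetas[keep]
```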
The simultaneous pruning and growing technique was applied to two function approximation problems: (1) approximating the two-dimensional Gabor function
$$ G(x, y) = \frac{1}{2\pi (0.5)^2} \exp\!\left( -\frac{x^2 + y^2}{2 (0.5)^2} \right) \cos\big(2\pi (x + y)\big), $$
with 64 training set samples and 192 validation set samples, and (2) approximating the multivariate function described in the previous section.
For the Gabor function, the initial network configuration had 30 experts. The network converged to a validation set MSE of 0.0516. Network pruning was then performed using the method described above. No significant change in MSE was observed until the number of experts was brought down to 21; the MSE with 21 experts was 0.0503. Next, network growth was performed. As shown in Table 3, significant performance gains were obtained by the addition of 2 more experts, lowering the MSE to 0.0296. It should be noted that this result is much better than simply pruning the 30-expert network down to 23 experts. Table 4 shows results on the multivariate data-set.
5. Conclusions and Future Work
In this paper, two techniques were presented to overcome the model selection problem in the mixture of experts architecture. Both proposed techniques are based on a batch-mode training algorithm. Efforts are underway to develop an on-line algorithm to perform structural adaptation. Such a model would have the attractive capability of adapting to changing environments.
References
[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.
[2] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.
[3] L. Prechelt. PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, Sept. 1994.
[4] V. Ramamurti and J. Ghosh. Advances in using hierarchical mixture of experts for signal classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1996.
[5] S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177–186. IEEE Press, New York, 1994.
[6] L. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of experts. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 633–640. MIT Press, 1995.
Figure 1. A mixture of experts network (experts 1..K, gating network outputs g_1..g_K, weighted sum forming the combined output).

Figure 2. HME networks of different sizes approximating a sinc function (true sinc; one level with 16 experts; 4-level binary tree; 8-level binary tree).

Figure 3. Sinc function approximation with the localized model - one layer of 15 experts (true sinc; network output).

Figure 4. Number of experts vs. test set MSE during growing (sinc with noise).

Figure 5. Sinc-with-noise approximation with 10 experts at the end of the growing algorithm (test MSE = 0.0087).

Table 1. Growing mixture of experts network on the Building2 data-set

  Number of Experts   Test set MSE
  1                   0.0100
  2                   0.0090
  3                   0.0087
  4                   0.0083

Table 2. Growing mixture of experts on the multivariate data-set

                        Number of Experts   Ave. MSE
  Static architecture   15                  0.0107
                        20                  0.0093
                        25                  0.0085
                        30                  0.0089
  Network growth        15.5                0.0077

Table 3. Pruning and growing on the 2-D Gabor function

             Number of Experts   MSE
  Pruning    30                  0.0516
             21                  0.0503
  Growing    22                  0.0451
             23                  0.0296

Table 4. Pruning and growing on the multivariate data-set

             Number of Experts   MSE
  Pruning    25                  0.0086
             16                  0.0091
  Growing    17                  0.0083
             19                  0.0080
             21                  0.0073