Structural Adaptation in Mixture of Experts

Viswanath Ramamurti and Joydeep Ghosh
Department of Electrical and Computer Engineering
The University of Texas at Austin, Austin, TX 78712-1084
E-mail: {viswa, ghosh}@...

Abstract

The "mixture of experts" framework provides a modular and flexible approach to function approximation. However, the important problem of determining the appropriate number and complexity of experts has not been fully explored. In this paper, we consider a localized form of the gating network that can perform function approximation tasks very well with only one layer of experts. Certain measures for the smooth functioning of the algorithm used to train this model are described first. We then propose two techniques to overcome the model selection problem in the mixture of experts architecture. In the first technique, we present an efficient way to grow expert networks so as to arrive at an appropriate number of experts for a given problem. In the second approach, we start with a certain number of experts and present methods to prune experts that become less useful and to add experts that would be more effective. Simulation results are presented which support the proposed techniques. We observe that the growing/pruning approach yields substantially better results than the standard approach even when the final network sizes are chosen to be the same.

1. Introduction
The mixture of experts architecture is a powerful modular architecture that works on the principle of divide and conquer. The model employs probabilistic methods to divide the input space into overlapping regions on which "experts" act, and a gating network to weight these experts and form a combined network output. (This research was supported in part by ARO contracts DAAH04-94-G-0417 and 04-95-10494 and NSF grant ECS 9307632.)

A simple mixture of experts model is shown in Figure 1. Each expert network is typically a single layer network with a possible output non-linearity. In the original mixture of experts model [1], the gating network is a single layer feedforward network with a "softmax" output activation function. The "softmax" function ensures that the gating network outputs are nonnegative and sum to 1. With a gating network of this form, the input space is divided into different but overlapping regions by "soft" hyperplanes, and each region is assigned to an expert. Such a structure is limited in that, when the function approximation task is non-trivial, the gating network finds it difficult to come up with the right partitions for the linear experts to do a good job of approximating the function at hand.

The one-level mixture of experts architecture was extended in [2] to a hierarchy in which each expert network in Figure 1 is replaced by a mixture of experts network consisting of a gate and sub-experts. The resulting two-level hierarchy can be extended to form deeper trees. The deeper the HME tree, the more powerful the architecture becomes. Figure 2 illustrates this phenomenon. The task was to approximate a sinc function. A simple mixture of experts network consisting of 16 experts, a 4-level binary tree hierarchy, and an 8-level binary tree hierarchy were employed. It is seen that the deeper the network, the better it performs. The same phenomenon is also reported for classification-type HME networks [5][4]. Unfortunately, the computational cost increases as the tree height increases, since more gating and expert networks need to be trained. The depth of the tree also raises the question of what is the right size and structure of the tree to best solve a problem, i.e., one is faced with the model selection problem.

Xu, Jordan and Hinton [6] have recently come up with an alternative form of the gating network which divides the input space into "soft" hyper-ellipsoids as opposed to the "soft" hyperplane divisions created by the original gating network. With such localized regions of expertise, a single layer of linear experts is able to perform function approximation tasks very well. Figure 3 shows the performance of the localized model with only one layer of 15 experts on the same sinc function that was approximated earlier.

This paper begins by observing that the localized gating network also provides a basis for overcoming the model selection problem. Two techniques are proposed to counter the model selection problem. In the first technique, one starts by training a network consisting of only one expert, which corresponds to a linear regression, and grows more experts sequentially to fit the complexity of the problem. In the second technique, when it is known a priori that the function approximation task is non-trivial, one starts training an adequately powerful network and, while training, prunes away experts that become less useful and grows experts, if needed, that would be more effective. Compared to training a network with a fixed number of experts, the second approach makes good use of every expert in the network and turns out to be efficient at avoiding bad local minima.

In this paper, we first briefly describe the localized model for mixture of experts and also present measures to be taken for the smooth functioning of the training algorithm used to train the model. This is followed by a description of the two techniques proposed to counter the model selection problem.

2. The Localized Mixture of Experts Model

In the mixture of experts model, we have a set of expert networks j = 1..M, all of which look at the input vector x to form outputs $\hat{y}_j$. The gating network also looks at the input vector x and produces outputs $g_j \geq 0$, $\sum_j g_j = 1$, which weight the expert outputs to form the overall output $\hat{y} = \sum_j g_j \hat{y}_j$. The gating network proposed in [6] is of the form

$$g_j(x; \nu) = \frac{\alpha_j P(x|j)}{\sum_i \alpha_i P(x|i)}, \quad \text{where} \quad P(x|j) = (2\pi)^{-n/2} |\Sigma_j|^{-1/2} \exp\{-0.5\,(x - m_j)^T \Sigma_j^{-1} (x - m_j)\}.$$

Thus the jth expert's influence is localized to a region around $m_j$. The expert networks are single layer networks with identity output activation function for regression problems. The expert network parameter vector for expert j is denoted by $\theta_j$ ($\hat{y}_j = \theta_j^T x$). The Expectation-Maximization (EM) formulation for estimating the gating network and expert network parameters $\alpha_j, m_j, \Sigma_j, \theta_j$ from the set of training patterns t is given by [6]:

E step:
$$h_j^{(k)}(y^{(t)}|x^{(t)}) = \frac{g_j(x^{(t)}; \nu^{(k)}) \exp\{-0.5\,(y^{(t)} - \hat{y}_j)^T (y^{(t)} - \hat{y}_j)\}}{\sum_i g_i(x^{(t)}; \nu^{(k)}) \exp\{-0.5\,(y^{(t)} - \hat{y}_i)^T (y^{(t)} - \hat{y}_i)\}}$$

M step (gating network):
$$\alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^{(t)}|x^{(t)})$$
$$m_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)})\, x^{(t)}}{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)})}$$
$$\Sigma_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)})\, [x^{(t)} - m_j^{(k)}][x^{(t)} - m_j^{(k)}]^T}{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)})}$$

M step (expert network):
$$\theta_j^{(k+1)} = \arg\min_{\theta_j} \sum_t h_j^{(k)}(y^{(t)}|x^{(t)})\, \|y^{(t)} - \hat{y}_j\|^2$$

We have employed this model for solving a number of regression problems and have found that the network performs very well with just a single layer of experts. By requiring only one gating network, this model is significantly faster to train than a hierarchical model.
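To make the update concrete, the following is a minimal NumPy sketch of one EM iteration for this localized model. It is an illustration under assumed names (gating_probs, em_step, alphas, means, covs, thetas), not the authors' implementation; here the covariance update uses the freshly updated mean, and the expert M-step is solved by weighted least squares.

```python
import numpy as np

def gating_probs(X, alphas, means, covs):
    """g_j(x) = alpha_j P(x|j) / sum_i alpha_i P(x|i), with Gaussian P(x|j)."""
    N, n = X.shape
    M = len(alphas)
    px = np.zeros((N, M))
    for j in range(M):
        diff = X - means[j]
        inv = np.linalg.inv(covs[j])
        quad = np.einsum('ni,ij,nj->n', diff, inv, diff)       # (x-m)^T Sigma^-1 (x-m)
        norm = (2 * np.pi) ** (-n / 2) * np.linalg.det(covs[j]) ** (-0.5)
        px[:, j] = norm * np.exp(-0.5 * quad)
    g = np.asarray(alphas) * px
    return g / g.sum(axis=1, keepdims=True)

def em_step(X, Y, alphas, means, covs, thetas):
    """One EM iteration; means, covs, thetas are lists with one entry per expert."""
    M = len(alphas)
    # E-step: posterior h_j(y|x) for every pattern and expert.
    Yhat = np.stack([X @ thetas[j] for j in range(M)], axis=1)          # (N, M, dy)
    lik = np.exp(-0.5 * np.sum((Y[:, None, :] - Yhat) ** 2, axis=2))    # (N, M)
    h = gating_probs(X, alphas, means, covs) * lik
    h /= h.sum(axis=1, keepdims=True)
    # M-step, gating network (exact one-pass update).
    new_alphas = h.mean(axis=0)
    new_means, new_covs, new_thetas = [], [], []
    for j in range(M):
        w = h[:, j]
        mj = (w[:, None] * X).sum(axis=0) / w.sum()
        d = X - mj                                              # uses the updated mean
        Sj = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0) / w.sum()
        new_means.append(mj)
        new_covs.append(Sj)
        # M-step, expert network: weighted least squares for the linear expert.
        sw = np.sqrt(w)[:, None]
        theta_j, *_ = np.linalg.lstsq(sw * X, sw * Y, rcond=None)
        new_thetas.append(theta_j)
    return new_alphas, new_means, new_covs, new_thetas
```

Iterating em_step until the validation MSE stops decreasing would correspond to the training loop assumed in the following sections.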
Also, the M-step for the localized gating network has an exact one-pass analytical solution, whereas the M-step for the original gating network can only be solved approximately in one pass. However, there are a couple of things that need attention when training the localized model in practice. In the EM iterations, the E-step of the gating network requires inverting the covariance matrices $\Sigma_j$ obtained in the M-step. Some of the covariance matrices obtained often turn out to be close to singular. To avoid this problem, we consider only the diagonal elements of the covariance matrices and set the off-diagonal elements to zero. This means that only the variance terms of the different input elements are considered; the cross terms are ignored. This does not affect the function approximation capability of the network; at worst it makes the network need more experts. The effect of this change on the EM derivation is that, while calculating the $\Sigma_j$ in the M-step, only the terms corresponding to the diagonal entries need be computed. Also, we impose lower bounds on the diagonal entries to prevent them from becoming too small. The lower threshold for each entry is selected so that it reflects the variance of the individual input elements. When a computed value goes below the threshold at any time, it is replaced by the constant threshold value. We have also found the algorithm to be sensitive to the initial random means associated with the experts. Initializing these means with the standard K-means algorithm works well in practice.

The capability of the localized model is illustrated by simulation results on the building2 data set, a function approximation task from the PROBEN1 set of benchmark data sets [3]. In the building2 data set the input is a 14-dimensional vector and the output is a 3-dimensional vector. The task was to perform function approximation with 2104 training samples and 1052 samples to test the approximation. A single layer mixture of 10 experts gives an average test set MSE of 0.0084 and training set MSE of 0.0072. This result is as good as the best results obtained using a fully connected MLP including short-cut connections [3] and slightly better than the results quoted in [3] when short-cut connections were not present.

3. The Growing Mixture of Experts Network

In this section, a constructive algorithm is presented to build a mixture of experts network. The idea is to start with one expert and add experts one at a time to systematically reduce the output error. Let us assume that at some given instant there are m experts in the network (m - 1 experts having already been added). The mixture of m experts network is trained using the EM algorithm described in the previous section until the MSE on the validation set fails to decrease. The network parameters obtained are saved. We define a weighted MSE $E_j$ for each expert j (j = 1..m) on the validation set samples p. For every expert,
$$E_j = \frac{\sum_p h_j^p \, \|y^p - \hat{y}^p\|^2}{\sum_p h_j^p},$$
where the $h_j^p$ are obtained by evaluating the E-step expression on the validation set samples p. $E_j$ gives a measure of how well a given expert performed on the validation set. If the partitioning had been crisp, i.e., if the $h_j^p$ were either 1 or 0, every sample in the validation set would have been associated with only one expert, and $E_j$ would have been the MSE of expert j trying to approximate a sub-function from the samples associated with it. Since the $h_j^p$ in the mixture of experts architecture are not all 1s and 0s, the $E_j$ in reality correspond to a soft version of the mean-squared error.
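As a concrete illustration (not the authors' code), the following NumPy sketch computes these weighted validation errors from a matrix h of posteriors $h_j^p$ and the corresponding network outputs; all names are illustrative assumptions.

```python
# Illustrative sketch: per-expert weighted validation error E_j.
# h    : (P, M) posteriors h_j^p from the E-step on the validation set
# y    : (P, dy) validation targets
# yhat : (P, dy) network outputs on the validation set
import numpy as np

def expert_validation_errors(h, y, yhat):
    """E_j = sum_p h_j^p ||y^p - yhat^p||^2 / sum_p h_j^p."""
    sq_err = np.sum((y - yhat) ** 2, axis=1)                    # (P,)
    return (h * sq_err[:, None]).sum(axis=0) / h.sum(axis=0)    # (M,)

# The experts would then be ranked with np.argsort(-E), largest error first.
```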
If $E_j$ is large, it indicates one of two possibilities: (1) the localized model is not able to approximate well the target function for the weighted samples associated with expert j, or (2) the localized model has overfit the training data and therefore is not able to perform well on the validation set data. In the former case, adding an expert to the localized space spanned by expert j would reduce the output error due to the added flexibility. In the latter case, such an addition would only help in overfitting the training data further. Therefore, to add the most effective expert, the $E_j$ are ranked from largest to smallest. Let c denote the expert network with the highest rank. We now try to add an expert m+1 which tries to reduce the error due to expert c. Towards this end, a new expert m+1 is created as follows:

- Weights of expert network m+1 = weights of expert network c.
- Perform "weighted 2-means" on the training samples t (weights = $h_c^t$). The two means obtained are the new means associated with the two experts, m+1 and c.
- Set $\alpha_{m+1} = \alpha_c^{new} = \frac{1}{2} \alpha_c^{old}$.
- Set $\Sigma_{m+1} = \Sigma_c$.

The "weighted 2-means" algorithm is performed as follows:

- Initialize the two means to the two training samples having the largest $h_c^t$.
- Assign each point in the training set to subset $t_1$ or $t_2$ depending on whether it is closer to $\hat{m}_1$ or $\hat{m}_2$.
- Determine
$$\hat{m}_1^{new} = \frac{\sum_{t_1} h_c^{t_1} x^{t_1}}{\sum_{t_1} h_c^{t_1}}, \quad \hat{m}_2^{new} = \frac{\sum_{t_2} h_c^{t_2} x^{t_2}}{\sum_{t_2} h_c^{t_2}}.$$
- Iterate until convergence.

In the crisp case, the "weighted 2-means" procedure reduces to finding two means for the set of samples associated with expert c using the K-means algorithm.
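The following is a minimal sketch of the weighted 2-means split described above, under assumed names (weighted_two_means, X, h_c); it is an illustration, not the authors' implementation.

```python
# Hedged sketch of the "weighted 2-means" split used when growing an expert.
# X   : (N, n) training inputs
# h_c : (N,) posteriors h_c^t for the expert c being split
import numpy as np

def weighted_two_means(X, h_c, n_iter=100):
    # Initialize the two means at the two samples with the largest h_c^t.
    top2 = np.argsort(-h_c)[:2]
    m1, m2 = X[top2[0]].astype(float), X[top2[1]].astype(float)
    for _ in range(n_iter):
        d1 = np.sum((X - m1) ** 2, axis=1)
        d2 = np.sum((X - m2) ** 2, axis=1)
        in1 = d1 <= d2                        # assign each point to the nearer mean
        if h_c[in1].sum() == 0 or h_c[~in1].sum() == 0:
            break
        new1 = (h_c[in1][:, None] * X[in1]).sum(axis=0) / h_c[in1].sum()
        new2 = (h_c[~in1][:, None] * X[~in1]).sum(axis=0) / h_c[~in1].sum()
        if np.allclose(new1, m1) and np.allclose(new2, m2):
            break                             # converged
        m1, m2 = new1, new2
    return m1, m2
```

The two returned means would replace $m_c$ and initialize $m_{m+1}$; the new expert copies $\theta_c$ and $\Sigma_c$, and both experts take half of the old $\alpha_c$, as listed above.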
The newly formed network is trained for one EM iteration. If there is a decrease in MSE on the validation data set, we proceed further. Otherwise, we revert to the previously saved network and try adding a new expert to reduce the error due to expert c-1. This procedure is terminated when, at any stage, adding a new expert to reduce the error of any of the existing experts does not reduce the overall MSE on the validation data set. Also, at any time during the growing procedure, when an overfitting expert is detected, its corresponding expert network parameters, mean and covariance matrix are permanently frozen.

The above algorithm was tested on both synthetic and real-life problems. The first task was to approximate a sinc function with Gaussian noise of variance 0.25 added to it. The training set contained 100 samples. Figure 4 shows results from the growing algorithm. The test MSE in this case was computed by measuring the error between the network output and the true sinc function. The MSE obtained at the end of the growing operation was found to be better than the MSE obtained by training static networks of 10, 15 or 20 experts. Figure 5 shows the performance of the final trained network.

The algorithm was next tried on the building2 data set described earlier. The results on the building2 data set are shown in Table 1. It is seen that with 4 experts the network performs very well, with a test set MSE on par with that obtained in Section 2.

A multivariate function approximation task was considered next. The function to be approximated was $y = 0.79 + 1.27 x_1 x_2 + 1.56 x_1 x_4 + 3.42 x_2 x_5 + 2.06 x_3 x_4 x_5$. The training set consisted of 820 samples and the validation set had 204 samples. Table 2 gives a comparison of the average performance on the validation set.

4. Pruning and Growing Mixture of Experts

A popular approach to structural adaptation is to start with an adequately powerful model and then to simplify it based on training data. Various weight decay strategies for pruning links have been proposed, especially for the MLP network. From Section 2, it is observed that the prior $\alpha_j$ is computed in the M-step as $\alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^{(t)}|x^{(t)})$. $\alpha_j$ is thus proportional to the sum of the $h_j^t$ over all patterns t in the training set, $h_j^t$ being the posterior probability of selecting expert j given input $x^{(t)}$ and its corresponding output $y^{(t)}$. We therefore see that in every iteration of the EM algorithm, $\alpha_j$ directly gives us a measure of how important the network feels expert j is in relation to the other experts. Hence, when it is desired to prune an expert, the obvious candidate is the expert with the least value of $\alpha_j$.

The following method for pruning and growing while training a mixture of experts network serves two purposes: (i) to remove redundant experts and (ii) to avoid local minima and achieve better generalization.

Pruning and Growing:
- Train the mixture of experts network with a certain initial number of experts m.
- Prune the expert with the lowest $\alpha_j$.
- If there is no significant change in network performance, continue pruning. Otherwise, perform network growth as described in the previous section.

There is no strict order in which pruning and growing should be performed. It has generally been found more useful to perform initial pruning followed by growing.
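A minimal sketch of the pruning step follows, under assumed names and with the remaining priors renormalized to sum to one (a detail not spelled out in the text; a subsequent EM iteration would have a similar effect).

```python
# Illustrative sketch: prune the expert with the smallest prior alpha_j.
import numpy as np

def prune_least_used(alphas, means, covs, thetas):
    """Drop the least-used expert and renormalize the remaining priors."""
    j = int(np.argmin(alphas))                  # expert the gate relies on least
    keep = [i for i in range(len(alphas)) if i != j]
    new_alphas = np.array([alphas[i] for i in keep])
    new_alphas = new_alphas / new_alphas.sum()  # assumed renormalization
    return (new_alphas,
            [means[i] for i in keep],
            [covs[i] for i in keep],
            [thetas[i] for i in keep])
```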
The simultaneous pruning and growing technique was applied to two function approximation problems: (1) approximating the two-dimensional Gabor function
$$G(x, y) = \frac{1}{2\pi(0.5)^2} \exp\!\left(-\frac{x^2 + y^2}{2(0.5)^2}\right) \cos(2\pi(x + y)),$$
with 64 training set samples and 192 validation set samples, and (2) approximating the multivariate function described in the previous section.

For the Gabor function, the initial network configuration had 30 experts. The network converged to a validation set MSE of 0.0516. Network pruning was performed using the pruning method described above. No significant change in MSE was observed until the number of experts was brought down to 21. The MSE with 21 experts was 0.0503. Next, network growth was performed. As shown in Table 3, significant performance gains were obtained by the addition of 2 more experts, lowering the MSE to 0.0296. It is to be noted that this result is much better than simply pruning the 30-expert network to 23 experts. Table 4 shows results on the multivariate data set.

5. Conclusions and Future Work

In this paper, two techniques were presented to overcome the model selection problem in the mixture of experts architecture. Both of the proposed techniques are based on a batch-mode training algorithm. Efforts are underway to develop an on-line algorithm to perform structural adaptation. Such a model would have the attractive capability of being able to adapt to changing environments.

References

[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
[2] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[3] L. Prechelt. PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, Sept. 1994.
[4] V. Ramamurti and J. Ghosh. Advances in using hierarchical mixture of experts for signal classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1996.
[5] S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177-186. IEEE Press, New York, 1994.
[6] L. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixture of experts. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 633-640. The MIT Press, 1995.

Figure 1. A mixture of experts network.
Figure 2. HME networks of different sizes trying to approximate a sinc function (true sinc; 1-level network with 16 experts; 4-level binary tree; 8-level binary tree).
Figure 3. Sinc function approximation with the localized model: one layer with 15 experts.
Figure 4. Number of experts vs. test set MSE (sinc with noise).
Figure 5. Sinc-with-noise approximation with 10 experts at the end of the growing algorithm (test MSE = 0.0087).

Table 1. Growing mixture of experts network on the building2 data set
  Number of Experts    Test set MSE
  1                    0.0100
  2                    0.0090
  3                    0.0087
  4                    0.0083

Table 2. Growing mixture of experts on the multivariate data set
                        Number of Experts    Ave. MSE
  Static architecture   15                   0.0107
  Static architecture   20                   0.0093
  Static architecture   25                   0.0085
  Static architecture   30                   0.0089
  Network growth        15.5                 0.0077

Table 3. Pruning and growing on the 2-D Gabor function
             Number of Experts    MSE
  Pruning    30                   0.0516
  Pruning    21                   0.0503
  Growing    22                   0.0451
  Growing    23                   0.0296

Table 4. Pruning and growing on the multivariate data set
             Number of Experts    MSE
  Pruning    25                   0.0086
  Pruning    16                   0.0091
  Growing    17                   0.0083
  Growing    19                   0.0080
  Growing    21                   0.0073