1 A Universal Approximation Theorem for Gaussian-Gated Mixture of Experts Models arXiv:1704.00946v1 [stat.ME] 4 Apr 2017 Hien D. Nguyen Abstract Mixture of experts (MoE) models are a powerful probabilistic neural network framework that can be used for classification, clustering, and regression. The most common types of MoE models are those with soft-max gating and experts with linear means. Such models are referred to as soft-max gated mixture of linear experts (MoLE) models. Due to their popularity, the majority of theoretical results regarding MoE and MoLE models have concentrated on those with soft-max gating. For example, it has been proved that soft-max gated MoLE mean functions are dense in the class of continuous functions, over any compact subset of Euclidean space. A popular alternative to soft-max gating is Gaussian gating. We prove that the denseness result regarding soft-max gated MoLE mean functions extends to Gaussian-gated MoLE mean functions. That is, we prove that Gaussian-gated MoLE mean functions are dense in the class of continuous functions, over any compact subset of Euclidean space. Index Terms Mean Function; Mixture of experts; Stone-Weierstrass Theorem; Universal Approximation Theorem I. I NTRODUCTION The mixture of experts (MoE) models are probabilistic artificial neural networks that were first introduced in [1] and then developed further in [2] and [3]. In the contemporary setting, MoE models have become highly popular and successful in a range of applications and areas including audio classification, bioinformatics, climate prediction, face recognition, financial forecasting, handwriting recognition, and text classification, among many others; see [4] and the references Department of Mathematics and Statistics, La Trobe University, Bundoora 3086, Victoria Australia. Email: [email protected]. 2 therein. Two representative recent applications of MoE models are for functional data analysis ([5], [6]) and for robust regression ([7], [8]). Let X > , Y 2 X ⇥ R be a random observation from some probability model, where each X may be non-stochastic and fixed at some value x and X ⇢ Rd (d 2 N). Here, > is the transposition operator. Suppose now that for some m 2 N, there is a latent random variable Z, such that P (Z = i|X = x) = gi (x; ↵) , (1) where i 2 [m] ([m] = {1, . . . , m}), and gi (·; ↵) are parametric functions that depend on some additional vector ↵ 2 Rq , for q 2 N. Here, we say that m is the number of experts. Further suppose that the conditional probability density function (PDF) of Y given X = x and Z = i can be written as f (y|X = x, Z = i) = hi (x; ) , (2) where hi (·; , ) are parametric PDFs that depend on the additional vector scalar 2 Rp (p 2 N) and 2 R. Taking the characterizations (1) and (2) together, we can write conditional PDF of Y on X = x as f (y|X = x) = m X gi (x; ↵) hi (x; i) , i, (3) i=1 which can be obtained via the law of total probability. We call (3) an MoE with gating functions gi and experts hi . We further say that (3) has parameter vector ✓ > = ↵> , > 1 ,..., > m, 1, . . . , m . Suppose that the expectation E (Y |X = x) exists for all x 2 X. If we also have E (Y |X = x) = M (x; ✓) = m X gi (x; ↵) i=1 ⇥ > i x + i ⇤ , (4) where p = d, then we say instead that (3) is a mixture of linear experts (MoLE) model, rather than an MoE model. In general, most MoE models—that are used in practice—are MoLE models. Additionally, the most common kinds of MoLE models in use are those that are soft-max gated. That is, the gating functions are of the form where ↵i 2 Rd and i exp ↵> i x+ i gi (x; ↵) = Pm > j=1 exp ↵j x + , (5) j > 2 R for i 2 [m]. Here, ↵> = ↵> 1 , . . . , ↵m , 1 , . . . , m and q = (d + 1) m. Examples of (5)-gated MoLE models are the original MoE models of [1], and the 3 robust MoLE models of [7] and [8]. In the respective models, the experts are taken to be Gaussian, t, and Laplace, each with means > i x + i. Although (5) is the most popular choice of gating, an often-used alternative is the Gaussian gate, which has the form ⇡i gi (x; ↵) = Pm (x; µi , ⌃i ) , j=1 ⇡j d (x; µj , ⌃j ) where d (x; µ, ⌃) = |2⇡⌃| 1/2 exp d 1 (x 2 µ)> ⌃ (6) 1 (x µ) is the multivariate Gaussian distribution in Rd with mean vector µ 2 Rd and symmetric positive- > definite covariance matrix ⌃ 2 Rd⇥d . Here ⇡i > 0 for i 2 [m] and ↵> = ↵> 1 , . . . , ↵m , where ⇣ ⌘ > ↵> ⇡i , µ > and vech (·) is the half-vectorization operator (cf. [9, Sec. 11.3]). i = i , [vech⌃i ] Further, q = m + dm + d (d 1) m/2. The (6)-gated MoE models were first considered by [10]. Since their introduction, their have been numerous examples of (6)-gated MoE and MoLE models in the literature; see for example [11], [12], [13], [14], [15], [16], [17]. More recently, the (6)-gated MoLE model, with Gaussian and t experts have been shown to be special cases of cluster-weighted models [18]. Furthermore, (6)-gated MoLE models—along with a wider class of MoLE models—have been shown useful for Bayesian nonparametric regression [19]. Due to the popularity of the (5)-gated MoE models, and MoLE models in particular, there have been numerous theoretical studies of their approximation capacity. For example, [20] demonstrated the uniform denseness of (5)-gated mean functions (4) in Sobolev space (cf. [21]), under assumptions on differentiability, measurability, and support compactness. The uniform denseness of (5)-gated mean functions in the space of continuous functions on finite supports is proved in [22] via an application of the Stone-Weierstrass theorem (SW; [23]). Other results pertaining to (5)-gated MoE models include proofs of denseness, in Kullback-Leibler divergence (KL; [24]), of (3) with exponential family experts within the class of arbitrary conditional PDFs under some regularity conditions ([25], [26]). An alternative result for (5)-gated MoE models with Gaussian experts, which makes different assumptions and which allows for multivariate estimation, is proved in [27]. A denseness in KL divergence result, within the class of conditional PDFs under regularity for a kind of (5)-gated hierarchical-MoE model with Gaussian experts, is proved in [28]. 4 In contrast, to the best of our knowledge, the only formal result regarding (6)-gated MoE models appears in [19]. In [19], the (6)-gated MoLE models with location-scale experts are proved to be dense in KL divergence within the class of conditional PDFs, under regularity. This result is analogous to cases of theorems obtained in [25], [26], [27]. There is currently no analogy to the results of [20] and [22] for (6)-gated MoLE models. In this paper, we follow the approach of [22] and utilize the SW theorem to prove the uniform denseness of the (6)-gated MoLE mean function (4), within the class of continuous functions on a compact support. We note that the use of the SW theorem has a rich history in the neural networks literature. For example, [29] famously utilized the theorem to show the universal approximation property for the classical sum-of-sigmoidal functions networks. Another example is the use of the SW theorem to prove the uniform denseness of radial basis networks [30]. An excellent introduction and demonstration of the use of the SW theorem for neural networks is [31]. The paper proceeds as follows. The main result of the paper is presented in Section II. Proofs pertaining to the main result are presented in Section III. Conclusions are drawn in Section IV. II. M AIN R ESULT Let C (X) denote the class of all continuous functions on the support X. For a pair of functions u and v with support X, we define the uniform distance between the functions to be d (u, v) = maxx2X |u (x) v (x)|, if it exists. If U (X) and V (X) are two classes of functions on the support X, then we say that U is dense in V if for any v 2 V and ✏ > 0, there exists a u 2 U such that d (u, v) < ✏. When U (X) is dense in V (X) for every compact X ⇢ Rd , U is said to be fundamental in V Rd (cf. [32, Ch. 18]). Let the class of (6)-gated mean functions (4) on the support X be denoted M (X) = {M has form (4) : gi has form (6), i 2 [m] , m 2 N} . In this paper, we prove that M is fundamental in C Rd . That is, we obtain the following result. Theorem 1. If X is a compact subset of Rd , then M (X) is dense in C (X). III. P ROOFS OF M AIN R ESULT We make use of the following version of the SW theorem (cf. [22], [31]). 5 Theorem 2. Let X ⇢ Rd be compact and U (X) be a set of continuous real-valued functions on X. Assume the following properties. (A1) The constant function u (x) = 1 is in U (X). (A2) For any two points x1 , x2 2 X, such that x1 6= x2 , the exists a u 2 U (X), such that u (x1 ) 6= u (x2 ). (A3) If a 2 R and u 2 U (X), then au 2 U (X). (A4) If u, v 2 U (X), then uv 2 U (X). (A5) If u, v 2 U (X), then u + v 2 U (X). If assumptions A1–A5 are fulfilled, then U (X) is dense in C (X). We shall utilize Theorem 2 in two ways. In the first application, we directly prove that there exists a subclass of M that satisfies the assumptions of Theorem 2, for any compact real subset X. In the second application, we indirectly show that the class of (5)-gated mean functions, which is known to be dense via the SW theorem, is a subclass of M, and thus M must also be dense. A. Direct Proof We shall limit ourselves to the subclass M1 (X) = {M 2 M (X) : i = 0, i 2 [m] , m 2 N} , where 0 is the zero vector. That is, we consider functions of the form M (x; ) = m X i=1 where = ↵> , 1, . . . , m ⇡ Pm i (x; µi , ⌃i ) j=1 ⇡j d (x; µj , ⌃j ) d i 2 M1 (X) , is the restricted parameter space that is obtained by setting i = 0, for all i. We now prove that class M1 is fundamental in C Rd , via demonstrating that each of the assumptions A1–A5 are fulfilled. As a consequence, since M1 ⇢ M, we will prove that M is fundamental in C Rd . Lemma 3. [A1] The constant function M (x) = 1 2 M1 (X). Proof: For any m 2 N, set ↵1 = · · · = ↵m and M = m ⇥ (1/m) = 1, thus completing the proof. 1 = ··· = m = 1. We then have Lemma 4. [A2] For any two x1 , x2 2 Rd , such that x1 6= x2 , there exists a function M 2 M1 , such that M (x1 ) 6= M (x2 ). 6 ⇣ ⌘ ⇣ ⌘ > > > > > > ¯ ¯ Proof: Set m = 2, ↵1 = 1, 0 , [vechI] , and ↵2 = 1, 1 , [vechI] . Here 1 is the one ¯ 1 and ↵ ¯ 2 in ↵, ¯ and set ¯ > = ↵ ¯ > , 1, 0 . vector and I is the identity matrix, respectively. Put ↵ That is, let 1 = 1 and 2 = 0. We can now write M as M (x; ¯ ) exp x> x/2 = exp ( x> x/2) + exp ( x> x/2 1 = . 1 + exp ( 1> 1/2 + 1> x) 1> 1/2 + 1> x) If we suppose that M (x1 ; ¯ ) = M (x2 ; ¯ ), then we obtain exp 1> 1/2 + 1> x1 = exp 1> 1/2 + 1> x2 , which implies that x1 = x2 is the unique solution to the system. Thus, for any x1 6= x2 , ¯ 6= M (x2 ; ↵), ¯ as is required to obtain the desired result. M (x1 ; ↵) Lemma 5. [A3] If a 2 R and M 2 M1 , then aM 2 M1 . Proof: For any m 2 N and , we can write aM (x; ) = a = m X i=1 m X i=1 ⇡ Pm i (x; µi , ⌃i ) j=1 ⇡j d (x; µj , ⌃j ) ⇡ Pm i d i (x; µi , ⌃i ) (a i ) j=1 ⇡j d (x; µj , ⌃j ) d = M (x; ¯ ) , where ¯ > = ↵> , ¯1 , . . . , ¯m and ¯i = a i , for each i 2 [m]. This completes the proof. In order to move forward, we require a result regarding the product of Gaussian PDFs. The following result is given in [33]. Lemma 6. If µ1 , µ2 2 Rd and ⌃1 , ⌃2 2 Rd⇥d are symmetric positive-definite matrices, then 2 Y d (x; µi , ⌃i ) = A d (x; µ, ⌃) , i=1 where A > 0, ⌃ 1 = P2 i=1 ⌃i 1 is positive definite, and µ = ⌃ Lemma 7. [A4] If M, N 2 M1 , then M N 2 M1 . P2 i=1 ⌃i 1 µ i . 7 Proof: Let M = M ·; and N = M ·; [1] [2] , where [j]> ⇣ = ↵ j 2 {1, 2}. Here m1 , m2 2 N. Taking the direct product M N yields ⇣ ⌘ [j] [j] [j] mj 2 ⇡i d x; µi , ⌃i YX ⌘ (M N ) (x) = Pmj [j] ⇣ [j] [j] j=1 i=1 d x; µk , ⌃k k=1 ⇡k [j]> , [j] 1 ,..., [j] mj ⌘ [j] i , for (7) [1] [2] [1] [2] For each i 2 [m1 ] and j 2 [m2 ], consider the following mapping. Let ¯ij = i j , ⇡ ˜ij = ⇡i ⇡j , ⇣ ⌘ ¯ 1 = ⌃[1] 1 +⌃[2] 1 , and µ̄ij = ⌃ ¯ ij ⌃[1] 1 µ[1] + ⌃[2] 1 µ[2] . Using Lemma 6, we can rewrite ⌃ ij i j i i j j (7) as (M N ) (x) = m1 X m2 X i=1 j=1 = m1 X m2 X i=1 j=1 ¯ ij Aij ⇡ ˜ij d x; µ̄ij , ⌃ Pm1 Pm2 ¯ kl ¯ij ˜ij d x; µ̄kl , ⌃ k=1 l=1 Akl ⇡ ¯ ij ⇡ ¯ij d x; µ̄ij , ⌃ Pm1 Pm2 ¯ kl ¯ij , ¯kl d x; µ̄kl , ⌃ k=1 l=1 ⇡ where ⇡ ¯ij = Ai ⇡ ˜ij . Now, via some pairing function that maps (i, j) to k 2 [m̄] (m̄ = m1 m2 ; see e.g. [34], [35]), we can write (M N ) (x) = m̄ X k=1 ¯ > , ¯1 , . . . , ¯m̄ where ¯ > = ↵ ¯k x; µ̄k , ⌃ ¯ ¯l k ¯l d x; µ̄l , ⌃ l=1 ⇡ ⇡ ¯k Pm̄ d = M (x; ¯ ) , ⇣ ⇥ ⇤> ⌘ > > ¯ ¯k = ⇡ and ↵ ¯k , µ̄k , vech⌃k . Thus, M N 2 M1 , as required. Lemma 8. [A5] If M, N 2 M1 , then M + N 2 M1 . Proof: Let M = M ·; [1] and N = M ·; [2] , where [j]> ⇣ = ↵[j]> , j 2 {1, 2}. Here m1 , m2 2 N. Taking the direct sum M + N yields (M + N ) (x) mj 2 X X [j] ⇡i ⇣ [j] [j] [j] mj ⌘ , for (8) ⌘ x; µi , ⌃i ⌘ i[j] = Pmj [j] ⇣ [j] [j] j=1 i=1 d x; µk , ⌃k k=1 ⇡k ⇣ ⌘ ⇣ ⌘ [1] [1] [1] Pm2 [2] [2] [2] m 1 X ⇡i d x; µi , ⌃i d x; µk , ⌃k k=1 ⇡k ⇣ ⌘ = Q2 Pmj [j] [j] [j] ⇡ x; µ , ⌃ i=1 d k=1 k k k j=1 ⇣ ⌘P ⇣ ⌘ [2] [2] [2] [1] [1] [1] m2 m 2 X ⇡i d x; µi , ⌃i d x; µk , ⌃k k=1 ⇡k ⇣ ⌘ + Q2 Pmj [j] [j] [j] ⇡ x; µ , ⌃ i=1 d k=1 k k k j=1 d [j] 1 ,..., [1] i [2] i 8 [1] [2] For each i 2 [m1 ] and j 2 [m2 ], consider the following mapping. Let ¯ij = i + j , ⇡ ˜ij = ⇣ ⌘ [1] [2] ¯ 1 [1] 1 [2] 1 [1] 1 [1] [2] 1 [2] ¯ ij ⌃ ⇡i ⇡j , ⌃ij = ⌃i + ⌃j , and µ̄ij = ⌃ µi + ⌃j µj . Using Lemma 6, we i can rewrite (8) as (M + N ) (x) = m1 X m2 X i=1 j=1 = m1 X m2 X i=1 j=1 ¯ ij Aij ⇡ ˜ij d x; µ̄ij , ⌃ Pm1 Pm2 ¯ kl ¯ij ˜ij d x; µ̄kl , ⌃ k=1 l=1 Akl ⇡ ¯ ij ⇡ ¯ij d x; µ̄ij , ⌃ Pm1 Pm2 ¯ kl ¯ij , ¯kl d x; µ̄kl , ⌃ k=1 l=1 ⇡ where ⇡ ¯ij = Ai ⇡ ˜ij . Now, via some pairing function that maps (i, j) to k 2 [m̄] (m̄ = m1 m2 ), we can write (M + N ) (x) = m̄ X k=1 ¯ > , ¯1 , . . . , ¯m̄ where ¯ > = ↵ required. ¯k x; µ̄k , ⌃ ¯ ¯l k ¯l d x; µ̄l , ⌃ l=1 ⇡ ⇡ ¯k Pm̄ d = M (x; ¯ ) , ⇣ ⇥ ⇤> ⌘ > ¯ ¯> and ↵ = ⇡ ¯ , µ̄ , vech ⌃ . Thus, M + N 2 M1 , as k k k k Together, Lemmas 3–5, 7, and 8 imply that M1 (X) fulfills the assumptions of Theorem 2 for any compact subset X ⇢ Rd . Thus M1 is fundamental in C Rd , by Theorem 2. Furthermore, the main result, Theorem 1, is implied since M1 (X) ⇢ M (X), for any X. B. Indirect Proof We shall limit ourselves to the subclass M2 (X) = {M 2 M (X) : ⌃i = I, i 2 [m] , m 2 N} . That is, we consider functions of the form M (x; !) = m X i=1 where ! > = > , > 1 ,..., ⇥ (x; µi , I) j=1 ⇡j d (x; µj , I) ⇡ Pm i > m, setting ⌃i = I, for all i. Here, d + i ⇤ 2 M2 (X) , is the restricted parameter space that is obtained by 1, . . . , m > > 1 ,..., = > i x > m and > i = ⇡i , µ > i , for each i. 9 Let the class of (6)-gated mean functions (4) on the support X be denoted N (X) = {M has form (4) : gi has form (5), i 2 [m] , m 2 N} . From [22], we have the following result. Theorem 9. If X is a compact subset of Rd , then N (X) is dense in C (X). We now seek to demonstrate that every member of N can be written as a member of M2 , for any X. The following result is sufficient for the purpose. Lemma 10. For any i 2 [m], m 2 N, ↵, and [N ] gi can be written in the form exp ↵> i x+ i (x; ↵) = Pm > j=1 exp ↵j x + [M] gi and vice versa. [M] Proof: Start with gi [M] gi , the function ⇡i (x; ) = Pm j (x; µi , I) , j=1 ⇡j d (x; µj , I) d and note that the normalizing constants of d cancel to yield > ⇡i exp x> x/2 µ> i µi /2 + µi x (x; ) = Pm > x> x/2 µ> j µj /2 + µj x j=1 ⇡j exp > exp log ⇡i µ> i µi /2 + µi x = Pm . > µ> j µj /2 + µj x j=1 exp log ⇡j [N ] We observe that we can write gi i = log ⇡i [M] (x; ↵) in the form of gi (x; ) by setting ↵i = µi and µ> i µi /2, for each i 2 [m]. The mapping is unique, as is the inverse mapping µi = ↵i and ⇡i = exp i + µ> i µi /2 . Therefore, we have obtained the desired result. Lemma 10 implies that N is a subclass of M2 and thus is a subclass of M, for any X. Since N is fundamental in C Rd via Theorem 9, we obtain the main result, Theorem 1, as a consequence of Lemma 10. IV. C ONCLUSIONS We have proved that the class of (6)-gated mean functions (4) is fundamental in the continuous functions C Rd . This result is a direct parallel to the denseness result of [22], who showed 10 that the class of (5)-gated mean functions is fundamental in the continuous functions C Rd . The result provides practitioners the guarantee that the (6)-gated mean functions are universal approximators and can approximate any continuous function arbitrarily well, provided that there are enough experts m. However, the main result of the paper says nothing about how one should obtain an estimate of a (6)-gated mean function from data, in practice. Given a fixed number of components m, one can estimate the parameters of the (6)-gated MoLE model with Gaussian experts via maximum likelihood estimation and an EM algorithm (expectation–maximization; [36]), as in [10] and [18]. The EM algorithm for the case of t distribution experts is given in [18] and [37]. When m is unknown, some information theoretic criterion is required to choose between different numbers of experts. In [37], the BIC (Bayes information criterion; [38]) and the ICL criterion (integrated completed likelihood; [39]) were found to be appropriate for the task. R EFERENCES [1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, pp. 79–87, 1991. [2] M. I. Jordan and R. A. Jacobs, “Hierachical mixtures of experts and the EM algorithm,” Neural Computation, vol. 6, pp. 181–214, 1994. [3] M. I. Jordan and L. Xu, “Convergence results for the EM approach to mixtures of experts architectures,” Neural Networks, vol. 8, pp. 1409–1431, 1995. [4] S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture of experts,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 1177–1193, 2012. [5] F. Chamroukhi, A. Same, G. Govaert, and P. Aknin, “A hidden process regression model for functional data description. Application to curve discrimination,” Neurocomputing, vol. 73, pp. 1210–1221, 2010. [6] F. Chamroukhi, H. Glotin, and A. Same, “Model-based functional mixture discriminant analysis with hidden process regression for curve classification,” Neurocomputing, vol. 112, pp. 153–163, 2013. [7] F. Chamroukhi, “Robust mixture of experts modeling using the t distribution,” Neural Networks, vol. 79, pp. 20–36, 2016. [8] H. D. Nguyen and G. J. McLachlan, “Laplace mixture of linear experts,” Computational Statistics and Data Analysis, vol. 93, pp. 177–191, 2016. [9] K. M. Abadir and J. R. Magnus, Matrix Algebra. Cambridge: Cambridge University Press, 2005. [10] L. Xu, M. I. Jordan, and G. E. Hinton, “An alternative model for mixtures of experts,” in Advances in Neural Information Processing Systems, 1995, pp. 633–640. [11] K. Chen and H. Chi, “A method for combining multiple probabilistic classifiers through soft competition on different feature sets,” Neurocomputing, vol. 20, pp. 227–252, 1998. [12] T. Jebara and A. Pentland, “Maximum conditional likelihood via bound maximization and the CEM algorithm,” in Proceedings of the 11th International Conference on Neural Information Processing Systems, 1998, pp. 494–500. [13] J. V. Hansen, “Combining predictors: comparison of five meta machine learning methods,” Information Sciences, vol. 119, pp. 91–105, 1999. 11 [14] M.-W. Mak and S.-Y. Kung, “Estimation of elliptical basis function parameters by the EM algorithm with application to speaker verification,” IEEE Transactions on Neural Networks, vol. 11, pp. 961–969, 2000. [15] M. Sato and S. Ishii, “Online EM algorithm for the normalized Gaussian network,” Neural Computation, vol. 12, pp. 407–432, 2000. [16] S. Vijayakumar, A. D’Souza, and S. Schaal, “Incremental online learning in high dimensions,” Neural Computation, vol. 17, 2005. [17] C. A. M. Lima, A. L. M. Coelho, and F. J. Von Zuben, “Hybridizing mixtures of experts with support vector machines: investigation into nonlinear dynamic systems identification,” Information Sciences, vol. 177, pp. 2049–2074, 2007. [18] S. Ingrassia, S. C. Minotti, and G. Vittadini, “Local statistical modeling via a cluster-weighted approach with elliptical distributions,” Journal of Classification, vol. 29, pp. 363–401, 2012. [19] A. Norets and J. Pelenis, “Posterior consistency in conditional density estimation by covariate dependent mixtures,” Econometric Theory, vol. 30, pp. 606–646, 2014. [20] A. J. Zeevi, R. Meir, and V. Maiorov, “Error bounds for functional approximation and estimation using mixtures of experts,” IEEE Transactions on Information Theory, vol. 44, pp. 1010–1025, 1998. [21] J. Heinonen, P. Koskela, N. Shanmugalingam, and J. T. Tyson, Sobolev spaces on metric measure spaces: an approach based on upper gradients. Cambridge: Cambridge University Press, 2015. [22] H. D. Nguyen, L. R. Lloyd-Jones, and G. J. McLachlan, “A universal approximation theorem for mixture-of-experts models,” Neural Computation, vol. 28, pp. 2585–2593, 2016. [23] M. H. Stone, “The generalized Weierstrass approximation theorem,” Mathematical Magazine, vol. 21, pp. 237–254, 1948. [24] S. Kullback and R. A. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics, vol. 22, pp. 79–86, 1951. [25] W. Jiang and M. A. Tanner, “Hierachical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation,” Annals of Statistics, vol. 27, pp. 987–1011, 1999. [26] E. F. Mendes and W. Jiang, “On convergence rates of mixture of polynomial experts,” Neural Computation, vol. 24, 2012, 3025-3051. [27] A. Norets, “Approximation of conditional densities by smooth mixtures of regressions,” Annals of Statistics, vol. 38, pp. 1733–1766, 2010. [28] J. Pelenis, “Bayesian regression with heteroscedastic error density and parametric mean function,” Journal of Econometrics, vol. 178, pp. 624–638, 2014. [29] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems, vol. 2, pp. 303–314, 1989. [30] I. W. Sandberg, “Gaussian radial basis functions and inner product sspace,” Circuits, Systems and Signal Processing, vol. 20, pp. 635–642, 2001. [31] N. E. Cotter, “The Stone-Weierstrass theorem and its application to neural networks,” IEEE Transactions on Neural Networks, vol. 1, pp. 290–295, 1990. [32] W. Cheney and W. Light, A Course in Approximation Theory. Pacific Grove: Brooks/Cole, 2000. [33] P. A. Bromiley, “Products and convolutions of Gaussian probability density functions,” TINA-VISION, Manchester, Tech. Rep. 2003-003, 2014. [34] K. W. Regan, “Minimum-complexity pairing functions,” Journal of Computer and System Sciences, vol. 45, pp. 285–295, 1992. [35] M. Lisi, “Some remarks on the Cantor pairing function,” Le Matematiche, vol. 62, pp. 55–65, 2007. 12 [36] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977. [37] S. Ingrassia, S. C. Minotti, and A. Punzo, “Model-based clustering via linear cluster-weighted models,” Computational Statistics and Data Analysis, vol. 71, pp. 159–182, 2014. [38] G. Schwarz, “Estimating the dimensions of a model,” Annals of Statistics, vol. 6, pp. 461–464, 1978. [39] C. Biernacki, G. Celeux, and G. Govaert, “Assessing a mixture model for clustering wit the integrated completed likelihood,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 719–725, 2000.
© Copyright 2026 Paperzz