A Universal Approximation Theorem for Gaussian-Gated

1
A Universal Approximation Theorem for
Gaussian-Gated Mixture of Experts Models
arXiv:1704.00946v1 [stat.ME] 4 Apr 2017
Hien D. Nguyen
Abstract
Mixture of experts (MoE) models are a powerful probabilistic neural network framework that can
be used for classification, clustering, and regression. The most common types of MoE models are those
with soft-max gating and experts with linear means. Such models are referred to as soft-max gated
mixture of linear experts (MoLE) models. Due to their popularity, the majority of theoretical results
regarding MoE and MoLE models have concentrated on those with soft-max gating. For example, it has
been proved that soft-max gated MoLE mean functions are dense in the class of continuous functions,
over any compact subset of Euclidean space. A popular alternative to soft-max gating is Gaussian
gating. We prove that the denseness result regarding soft-max gated MoLE mean functions extends to
Gaussian-gated MoLE mean functions. That is, we prove that Gaussian-gated MoLE mean functions
are dense in the class of continuous functions, over any compact subset of Euclidean space.
Index Terms
Mean Function; Mixture of experts; Stone-Weierstrass Theorem; Universal Approximation Theorem
I. I NTRODUCTION
The mixture of experts (MoE) models are probabilistic artificial neural networks that were first
introduced in [1] and then developed further in [2] and [3]. In the contemporary setting, MoE
models have become highly popular and successful in a range of applications and areas including
audio classification, bioinformatics, climate prediction, face recognition, financial forecasting,
handwriting recognition, and text classification, among many others; see [4] and the references
Department of Mathematics and Statistics, La Trobe University, Bundoora 3086, Victoria Australia. Email:
[email protected].
2
therein. Two representative recent applications of MoE models are for functional data analysis
([5], [6]) and for robust regression ([7], [8]).
Let X > , Y
2 X ⇥ R be a random observation from some probability model, where each
X may be non-stochastic and fixed at some value x and X ⇢ Rd (d 2 N). Here, > is the
transposition operator. Suppose now that for some m 2 N, there is a latent random variable Z,
such that
P (Z = i|X = x) = gi (x; ↵) ,
(1)
where i 2 [m] ([m] = {1, . . . , m}), and gi (·; ↵) are parametric functions that depend on some
additional vector ↵ 2 Rq , for q 2 N. Here, we say that m is the number of experts. Further
suppose that the conditional probability density function (PDF) of Y given X = x and Z = i
can be written as
f (y|X = x, Z = i) = hi (x; ) ,
(2)
where hi (·; , ) are parametric PDFs that depend on the additional vector
scalar
2 Rp (p 2 N) and
2 R. Taking the characterizations (1) and (2) together, we can write conditional PDF
of Y on X = x as
f (y|X = x) =
m
X
gi (x; ↵) hi (x;
i) ,
i,
(3)
i=1
which can be obtained via the law of total probability. We call (3) an MoE with gating functions gi
and experts hi . We further say that (3) has parameter vector ✓ > = ↵> ,
>
1 ,...,
>
m,
1, . . . ,
m
.
Suppose that the expectation E (Y |X = x) exists for all x 2 X. If we also have
E (Y |X = x) = M (x; ✓) =
m
X
gi (x; ↵)
i=1
⇥
>
i x
+
i
⇤
,
(4)
where p = d, then we say instead that (3) is a mixture of linear experts (MoLE) model, rather
than an MoE model.
In general, most MoE models—that are used in practice—are MoLE models. Additionally,
the most common kinds of MoLE models in use are those that are soft-max gated. That is, the
gating functions are of the form
where ↵i 2 Rd and
i
exp ↵>
i x+ i
gi (x; ↵) = Pm
>
j=1 exp ↵j x +
,
(5)
j
>
2 R for i 2 [m]. Here, ↵> = ↵>
1 , . . . , ↵m , 1 , . . . ,
m
and q =
(d + 1) m. Examples of (5)-gated MoLE models are the original MoE models of [1], and the
3
robust MoLE models of [7] and [8]. In the respective models, the experts are taken to be Gaussian,
t, and Laplace, each with means
>
i x
+
i.
Although (5) is the most popular choice of gating, an often-used alternative is the Gaussian
gate, which has the form
⇡i
gi (x; ↵) = Pm
(x; µi , ⌃i )
,
j=1 ⇡j d (x; µj , ⌃j )
where
d
(x; µ, ⌃) = |2⇡⌃|
1/2
exp
d

1
(x
2
µ)> ⌃
(6)
1
(x
µ)
is the multivariate Gaussian distribution in Rd with mean vector µ 2 Rd and symmetric positive-
>
definite covariance matrix ⌃ 2 Rd⇥d . Here ⇡i > 0 for i 2 [m] and ↵> = ↵>
1 , . . . , ↵m , where
⇣
⌘
>
↵>
⇡i , µ >
and vech (·) is the half-vectorization operator (cf. [9, Sec. 11.3]).
i =
i , [vech⌃i ]
Further, q = m + dm + d (d
1) m/2.
The (6)-gated MoE models were first considered by [10]. Since their introduction, their have
been numerous examples of (6)-gated MoE and MoLE models in the literature; see for example
[11], [12], [13], [14], [15], [16], [17]. More recently, the (6)-gated MoLE model, with Gaussian
and t experts have been shown to be special cases of cluster-weighted models [18]. Furthermore,
(6)-gated MoLE models—along with a wider class of MoLE models—have been shown useful
for Bayesian nonparametric regression [19].
Due to the popularity of the (5)-gated MoE models, and MoLE models in particular, there
have been numerous theoretical studies of their approximation capacity. For example, [20]
demonstrated the uniform denseness of (5)-gated mean functions (4) in Sobolev space (cf. [21]),
under assumptions on differentiability, measurability, and support compactness. The uniform
denseness of (5)-gated mean functions in the space of continuous functions on finite supports
is proved in [22] via an application of the Stone-Weierstrass theorem (SW; [23]). Other results
pertaining to (5)-gated MoE models include proofs of denseness, in Kullback-Leibler divergence
(KL; [24]), of (3) with exponential family experts within the class of arbitrary conditional PDFs
under some regularity conditions ([25], [26]). An alternative result for (5)-gated MoE models
with Gaussian experts, which makes different assumptions and which allows for multivariate
estimation, is proved in [27]. A denseness in KL divergence result, within the class of conditional
PDFs under regularity for a kind of (5)-gated hierarchical-MoE model with Gaussian experts, is
proved in [28].
4
In contrast, to the best of our knowledge, the only formal result regarding (6)-gated MoE
models appears in [19]. In [19], the (6)-gated MoLE models with location-scale experts are
proved to be dense in KL divergence within the class of conditional PDFs, under regularity.
This result is analogous to cases of theorems obtained in [25], [26], [27]. There is currently no
analogy to the results of [20] and [22] for (6)-gated MoLE models.
In this paper, we follow the approach of [22] and utilize the SW theorem to prove the uniform
denseness of the (6)-gated MoLE mean function (4), within the class of continuous functions on a
compact support. We note that the use of the SW theorem has a rich history in the neural networks
literature. For example, [29] famously utilized the theorem to show the universal approximation
property for the classical sum-of-sigmoidal functions networks. Another example is the use of
the SW theorem to prove the uniform denseness of radial basis networks [30]. An excellent
introduction and demonstration of the use of the SW theorem for neural networks is [31].
The paper proceeds as follows. The main result of the paper is presented in Section II. Proofs
pertaining to the main result are presented in Section III. Conclusions are drawn in Section IV.
II. M AIN R ESULT
Let C (X) denote the class of all continuous functions on the support X. For a pair of functions
u and v with support X, we define the uniform distance between the functions to be d (u, v) =
maxx2X |u (x)
v (x)|, if it exists. If U (X) and V (X) are two classes of functions on the support
X, then we say that U is dense in V if for any v 2 V and ✏ > 0, there exists a u 2 U such
that d (u, v) < ✏. When U (X) is dense in V (X) for every compact X ⇢ Rd , U is said to be
fundamental in V Rd (cf. [32, Ch. 18]).
Let the class of (6)-gated mean functions (4) on the support X be denoted
M (X) =
{M has form (4) : gi has form (6), i 2 [m] , m 2 N} .
In this paper, we prove that M is fundamental in C Rd . That is, we obtain the following result.
Theorem 1. If X is a compact subset of Rd , then M (X) is dense in C (X).
III. P ROOFS OF M AIN R ESULT
We make use of the following version of the SW theorem (cf. [22], [31]).
5
Theorem 2. Let X ⇢ Rd be compact and U (X) be a set of continuous real-valued functions on
X. Assume the following properties. (A1) The constant function u (x) = 1 is in U (X). (A2) For
any two points x1 , x2 2 X, such that x1 6= x2 , the exists a u 2 U (X), such that u (x1 ) 6= u (x2 ).
(A3) If a 2 R and u 2 U (X), then au 2 U (X). (A4) If u, v 2 U (X), then uv 2 U (X). (A5)
If u, v 2 U (X), then u + v 2 U (X). If assumptions A1–A5 are fulfilled, then U (X) is dense in
C (X).
We shall utilize Theorem 2 in two ways. In the first application, we directly prove that there
exists a subclass of M that satisfies the assumptions of Theorem 2, for any compact real subset
X. In the second application, we indirectly show that the class of (5)-gated mean functions,
which is known to be dense via the SW theorem, is a subclass of M, and thus M must also
be dense.
A. Direct Proof
We shall limit ourselves to the subclass
M1 (X) = {M 2 M (X) :
i
= 0, i 2 [m] , m 2 N} ,
where 0 is the zero vector. That is, we consider functions of the form
M (x; ) =
m
X
i=1
where
= ↵> ,
1, . . . ,
m
⇡
Pm i
(x; µi , ⌃i )
j=1 ⇡j d (x; µj , ⌃j )
d
i
2 M1 (X) ,
is the restricted parameter space that is obtained by setting
i
= 0,
for all i. We now prove that class M1 is fundamental in C Rd , via demonstrating that each
of the assumptions A1–A5 are fulfilled. As a consequence, since M1 ⇢ M, we will prove that
M is fundamental in C Rd .
Lemma 3. [A1] The constant function M (x) = 1 2 M1 (X).
Proof: For any m 2 N, set ↵1 = · · · = ↵m and
M = m ⇥ (1/m) = 1, thus completing the proof.
1
= ··· =
m
= 1. We then have
Lemma 4. [A2] For any two x1 , x2 2 Rd , such that x1 6= x2 , there exists a function M 2 M1 ,
such that M (x1 ) 6= M (x2 ).
6
⇣
⌘
⇣
⌘
>
>
>
>
>
>
¯
¯
Proof: Set m = 2, ↵1 = 1, 0 , [vechI] , and ↵2 = 1, 1 , [vechI] . Here 1 is the one
¯ 1 and ↵
¯ 2 in ↵,
¯ and set ¯ > = ↵
¯ > , 1, 0 .
vector and I is the identity matrix, respectively. Put ↵
That is, let
1
= 1 and
2
= 0. We can now write M as
M (x; ¯ )
exp x> x/2
=
exp ( x> x/2) + exp ( x> x/2
1
=
.
1 + exp ( 1> 1/2 + 1> x)
1> 1/2 + 1> x)
If we suppose that M (x1 ; ¯ ) = M (x2 ; ¯ ), then we obtain
exp
1> 1/2 + 1> x1 = exp
1> 1/2 + 1> x2 ,
which implies that x1 = x2 is the unique solution to the system. Thus, for any x1 6= x2 ,
¯ 6= M (x2 ; ↵),
¯ as is required to obtain the desired result.
M (x1 ; ↵)
Lemma 5. [A3] If a 2 R and M 2 M1 , then aM 2 M1 .
Proof: For any m 2 N and
, we can write
aM (x; ) = a
=
m
X
i=1
m
X
i=1
⇡
Pm i
(x; µi , ⌃i )
j=1 ⇡j d (x; µj , ⌃j )
⇡
Pm i
d
i
(x; µi , ⌃i )
(a i )
j=1 ⇡j d (x; µj , ⌃j )
d
= M (x; ¯ ) ,
where ¯ > = ↵> , ¯1 , . . . , ¯m and ¯i = a i , for each i 2 [m]. This completes the proof.
In order to move forward, we require a result regarding the product of Gaussian PDFs. The
following result is given in [33].
Lemma 6. If µ1 , µ2 2 Rd and ⌃1 , ⌃2 2 Rd⇥d are symmetric positive-definite matrices, then
2
Y
d
(x; µi , ⌃i ) = A
d
(x; µ, ⌃) ,
i=1
where A > 0, ⌃
1
=
P2
i=1
⌃i
1
is positive definite, and µ = ⌃
Lemma 7. [A4] If M, N 2 M1 , then M N 2 M1 .
P2
i=1
⌃i 1 µ i .
7
Proof: Let M = M ·;
and N = M ·;
[1]
[2]
, where
[j]>
⇣
= ↵
j 2 {1, 2}. Here m1 , m2 2 N. Taking the direct product M N yields
⇣
⌘
[j]
[j]
[j]
mj
2
⇡i d x; µi , ⌃i
YX
⌘
(M N ) (x) =
Pmj [j] ⇣
[j]
[j]
j=1 i=1
d x; µk , ⌃k
k=1 ⇡k
[j]>
,
[j]
1 ,...,
[j]
mj
⌘
[j]
i
, for
(7)
[1] [2]
[1] [2]
For each i 2 [m1 ] and j 2 [m2 ], consider the following mapping. Let ¯ij = i j , ⇡
˜ij = ⇡i ⇡j ,
⇣
⌘
¯ 1 = ⌃[1] 1 +⌃[2] 1 , and µ̄ij = ⌃
¯ ij ⌃[1] 1 µ[1] + ⌃[2] 1 µ[2] . Using Lemma 6, we can rewrite
⌃
ij
i
j
i
i
j
j
(7) as
(M N ) (x) =
m1 X
m2
X
i=1 j=1
=
m1 X
m2
X
i=1 j=1
¯ ij
Aij ⇡
˜ij d x; µ̄ij , ⌃
Pm1 Pm2
¯ kl ¯ij
˜ij d x; µ̄kl , ⌃
k=1
l=1 Akl ⇡
¯ ij
⇡
¯ij d x; µ̄ij , ⌃
Pm1 Pm2
¯ kl ¯ij ,
¯kl d x; µ̄kl , ⌃
k=1
l=1 ⇡
where ⇡
¯ij = Ai ⇡
˜ij . Now, via some pairing function that maps (i, j) to k 2 [m̄] (m̄ = m1 m2 ; see
e.g. [34], [35]), we can write
(M N ) (x) =
m̄
X
k=1
¯ > , ¯1 , . . . , ¯m̄
where ¯ > = ↵
¯k
x; µ̄k , ⌃
¯
¯l k
¯l d x; µ̄l , ⌃
l=1 ⇡
⇡
¯k
Pm̄
d
= M (x; ¯ ) ,
⇣
⇥
⇤> ⌘
>
>
¯
¯k = ⇡
and ↵
¯k , µ̄k , vech⌃k
. Thus, M N 2 M1 , as required.
Lemma 8. [A5] If M, N 2 M1 , then M + N 2 M1 .
Proof: Let M = M ·;
[1]
and N = M ·;
[2]
, where
[j]>
⇣
= ↵[j]> ,
j 2 {1, 2}. Here m1 , m2 2 N. Taking the direct sum M + N yields
(M + N ) (x)
mj
2 X
X
[j]
⇡i
⇣
[j]
[j]
[j]
mj
⌘
, for
(8)
⌘
x; µi , ⌃i
⌘ i[j]
=
Pmj [j] ⇣
[j]
[j]
j=1 i=1
d x; µk , ⌃k
k=1 ⇡k
⇣
⌘
⇣
⌘
[1]
[1]
[1] Pm2
[2]
[2]
[2]
m
1
X ⇡i d x; µi , ⌃i
d x; µk , ⌃k
k=1 ⇡k
⇣
⌘
=
Q2 Pmj [j]
[j]
[j]
⇡
x;
µ
,
⌃
i=1
d
k=1 k
k
k
j=1
⇣
⌘P
⇣
⌘
[2]
[2]
[2]
[1]
[1]
[1]
m2
m
2
X ⇡i d x; µi , ⌃i
d x; µk , ⌃k
k=1 ⇡k
⇣
⌘
+
Q2 Pmj [j]
[j]
[j]
⇡
x;
µ
,
⌃
i=1
d
k=1 k
k
k
j=1
d
[j]
1 ,...,
[1]
i
[2]
i
8
[1]
[2]
For each i 2 [m1 ] and j 2 [m2 ], consider the following mapping. Let ¯ij = i + j , ⇡
˜ij =
⇣
⌘
[1] [2] ¯ 1
[1] 1
[2] 1
[1]
1
[1]
[2]
1
[2]
¯ ij ⌃
⇡i ⇡j , ⌃ij = ⌃i
+ ⌃j , and µ̄ij = ⌃
µi + ⌃j µj . Using Lemma 6, we
i
can rewrite (8) as
(M + N ) (x)
=
m1 X
m2
X
i=1 j=1
=
m1 X
m2
X
i=1 j=1
¯ ij
Aij ⇡
˜ij d x; µ̄ij , ⌃
Pm1 Pm2
¯ kl ¯ij
˜ij d x; µ̄kl , ⌃
k=1
l=1 Akl ⇡
¯ ij
⇡
¯ij d x; µ̄ij , ⌃
Pm1 Pm2
¯ kl ¯ij ,
¯kl d x; µ̄kl , ⌃
k=1
l=1 ⇡
where ⇡
¯ij = Ai ⇡
˜ij . Now, via some pairing function that maps (i, j) to k 2 [m̄] (m̄ = m1 m2 ),
we can write
(M + N ) (x) =
m̄
X
k=1
¯ > , ¯1 , . . . , ¯m̄
where ¯ > = ↵
required.
¯k
x; µ̄k , ⌃
¯
¯l k
¯l d x; µ̄l , ⌃
l=1 ⇡
⇡
¯k
Pm̄
d
= M (x; ¯ ) ,
⇣
⇥
⇤> ⌘
>
¯
¯>
and ↵
=
⇡
¯
,
µ̄
,
vech
⌃
. Thus, M + N 2 M1 , as
k
k
k
k
Together, Lemmas 3–5, 7, and 8 imply that M1 (X) fulfills the assumptions of Theorem 2 for
any compact subset X ⇢ Rd . Thus M1 is fundamental in C Rd , by Theorem 2. Furthermore,
the main result, Theorem 1, is implied since M1 (X) ⇢ M (X), for any X.
B. Indirect Proof
We shall limit ourselves to the subclass
M2 (X) = {M 2 M (X) : ⌃i = I, i 2 [m] , m 2 N} .
That is, we consider functions of the form
M (x; !)
=
m
X
i=1
where ! > =
>
,
>
1 ,...,
⇥
(x; µi , I)
j=1 ⇡j d (x; µj , I)
⇡
Pm i
>
m,
setting ⌃i = I, for all i. Here,
d
+
i
⇤
2 M2 (X) ,
is the restricted parameter space that is obtained by
1, . . . ,
m
>
>
1 ,...,
=
>
i x
>
m
and
>
i
= ⇡i , µ >
i , for each i.
9
Let the class of (6)-gated mean functions (4) on the support X be denoted
N (X) =
{M has form (4) : gi has form (5), i 2 [m] , m 2 N} .
From [22], we have the following result.
Theorem 9. If X is a compact subset of Rd , then N (X) is dense in C (X).
We now seek to demonstrate that every member of N can be written as a member of M2 ,
for any X. The following result is sufficient for the purpose.
Lemma 10. For any i 2 [m], m 2 N, ↵, and
[N ]
gi
can be written in the form
exp ↵>
i x+ i
(x; ↵) = Pm
>
j=1 exp ↵j x +
[M]
gi
and vice versa.
[M]
Proof: Start with gi
[M]
gi
, the function
⇡i
(x; ) = Pm
j
(x; µi , I)
,
j=1 ⇡j d (x; µj , I)
d
and note that the normalizing constants of
d
cancel to yield
>
⇡i exp x> x/2 µ>
i µi /2 + µi x
(x; ) = Pm
>
x> x/2 µ>
j µj /2 + µj x
j=1 ⇡j exp
>
exp log ⇡i µ>
i µi /2 + µi x
= Pm
.
>
µ>
j µj /2 + µj x
j=1 exp log ⇡j
[N ]
We observe that we can write gi
i
= log ⇡i
[M]
(x; ↵) in the form of gi
(x; ) by setting ↵i = µi and
µ>
i µi /2, for each i 2 [m]. The mapping is unique, as is the inverse mapping
µi = ↵i and ⇡i = exp
i
+ µ>
i µi /2 . Therefore, we have obtained the desired result.
Lemma 10 implies that N is a subclass of M2 and thus is a subclass of M, for any X.
Since N is fundamental in C Rd via Theorem 9, we obtain the main result, Theorem 1, as a
consequence of Lemma 10.
IV. C ONCLUSIONS
We have proved that the class of (6)-gated mean functions (4) is fundamental in the continuous
functions C Rd . This result is a direct parallel to the denseness result of [22], who showed
10
that the class of (5)-gated mean functions is fundamental in the continuous functions C Rd .
The result provides practitioners the guarantee that the (6)-gated mean functions are universal
approximators and can approximate any continuous function arbitrarily well, provided that there
are enough experts m.
However, the main result of the paper says nothing about how one should obtain an estimate
of a (6)-gated mean function from data, in practice. Given a fixed number of components m, one
can estimate the parameters of the (6)-gated MoLE model with Gaussian experts via maximum
likelihood estimation and an EM algorithm (expectation–maximization; [36]), as in [10] and
[18]. The EM algorithm for the case of t distribution experts is given in [18] and [37].
When m is unknown, some information theoretic criterion is required to choose between
different numbers of experts. In [37], the BIC (Bayes information criterion; [38]) and the ICL
criterion (integrated completed likelihood; [39]) were found to be appropriate for the task.
R EFERENCES
[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation,
vol. 3, pp. 79–87, 1991.
[2] M. I. Jordan and R. A. Jacobs, “Hierachical mixtures of experts and the EM algorithm,” Neural Computation, vol. 6, pp.
181–214, 1994.
[3] M. I. Jordan and L. Xu, “Convergence results for the EM approach to mixtures of experts architectures,” Neural Networks,
vol. 8, pp. 1409–1431, 1995.
[4] S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture of experts,” IEEE Transactions on Neural Networks
and Learning Systems, vol. 23, pp. 1177–1193, 2012.
[5] F. Chamroukhi, A. Same, G. Govaert, and P. Aknin, “A hidden process regression model for functional data description.
Application to curve discrimination,” Neurocomputing, vol. 73, pp. 1210–1221, 2010.
[6] F. Chamroukhi, H. Glotin, and A. Same, “Model-based functional mixture discriminant analysis with hidden process
regression for curve classification,” Neurocomputing, vol. 112, pp. 153–163, 2013.
[7] F. Chamroukhi, “Robust mixture of experts modeling using the t distribution,” Neural Networks, vol. 79, pp. 20–36, 2016.
[8] H. D. Nguyen and G. J. McLachlan, “Laplace mixture of linear experts,” Computational Statistics and Data Analysis,
vol. 93, pp. 177–191, 2016.
[9] K. M. Abadir and J. R. Magnus, Matrix Algebra.
Cambridge: Cambridge University Press, 2005.
[10] L. Xu, M. I. Jordan, and G. E. Hinton, “An alternative model for mixtures of experts,” in Advances in Neural Information
Processing Systems, 1995, pp. 633–640.
[11] K. Chen and H. Chi, “A method for combining multiple probabilistic classifiers through soft competition on different
feature sets,” Neurocomputing, vol. 20, pp. 227–252, 1998.
[12] T. Jebara and A. Pentland, “Maximum conditional likelihood via bound maximization and the CEM algorithm,” in
Proceedings of the 11th International Conference on Neural Information Processing Systems, 1998, pp. 494–500.
[13] J. V. Hansen, “Combining predictors: comparison of five meta machine learning methods,” Information Sciences, vol. 119,
pp. 91–105, 1999.
11
[14] M.-W. Mak and S.-Y. Kung, “Estimation of elliptical basis function parameters by the EM algorithm with application to
speaker verification,” IEEE Transactions on Neural Networks, vol. 11, pp. 961–969, 2000.
[15] M. Sato and S. Ishii, “Online EM algorithm for the normalized Gaussian network,” Neural Computation, vol. 12, pp.
407–432, 2000.
[16] S. Vijayakumar, A. D’Souza, and S. Schaal, “Incremental online learning in high dimensions,” Neural Computation, vol. 17,
2005.
[17] C. A. M. Lima, A. L. M. Coelho, and F. J. Von Zuben, “Hybridizing mixtures of experts with support vector machines:
investigation into nonlinear dynamic systems identification,” Information Sciences, vol. 177, pp. 2049–2074, 2007.
[18] S. Ingrassia, S. C. Minotti, and G. Vittadini, “Local statistical modeling via a cluster-weighted approach with elliptical
distributions,” Journal of Classification, vol. 29, pp. 363–401, 2012.
[19] A. Norets and J. Pelenis, “Posterior consistency in conditional density estimation by covariate dependent mixtures,”
Econometric Theory, vol. 30, pp. 606–646, 2014.
[20] A. J. Zeevi, R. Meir, and V. Maiorov, “Error bounds for functional approximation and estimation using mixtures of experts,”
IEEE Transactions on Information Theory, vol. 44, pp. 1010–1025, 1998.
[21] J. Heinonen, P. Koskela, N. Shanmugalingam, and J. T. Tyson, Sobolev spaces on metric measure spaces: an approach
based on upper gradients.
Cambridge: Cambridge University Press, 2015.
[22] H. D. Nguyen, L. R. Lloyd-Jones, and G. J. McLachlan, “A universal approximation theorem for mixture-of-experts
models,” Neural Computation, vol. 28, pp. 2585–2593, 2016.
[23] M. H. Stone, “The generalized Weierstrass approximation theorem,” Mathematical Magazine, vol. 21, pp. 237–254, 1948.
[24] S. Kullback and R. A. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics, vol. 22, pp. 79–86,
1951.
[25] W. Jiang and M. A. Tanner, “Hierachical mixtures-of-experts for exponential family regression models: approximation and
maximum likelihood estimation,” Annals of Statistics, vol. 27, pp. 987–1011, 1999.
[26] E. F. Mendes and W. Jiang, “On convergence rates of mixture of polynomial experts,” Neural Computation, vol. 24, 2012,
3025-3051.
[27] A. Norets, “Approximation of conditional densities by smooth mixtures of regressions,” Annals of Statistics, vol. 38, pp.
1733–1766, 2010.
[28] J. Pelenis, “Bayesian regression with heteroscedastic error density and parametric mean function,” Journal of Econometrics,
vol. 178, pp. 624–638, 2014.
[29] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems,
vol. 2, pp. 303–314, 1989.
[30] I. W. Sandberg, “Gaussian radial basis functions and inner product sspace,” Circuits, Systems and Signal Processing,
vol. 20, pp. 635–642, 2001.
[31] N. E. Cotter, “The Stone-Weierstrass theorem and its application to neural networks,” IEEE Transactions on Neural
Networks, vol. 1, pp. 290–295, 1990.
[32] W. Cheney and W. Light, A Course in Approximation Theory.
Pacific Grove: Brooks/Cole, 2000.
[33] P. A. Bromiley, “Products and convolutions of Gaussian probability density functions,” TINA-VISION, Manchester, Tech.
Rep. 2003-003, 2014.
[34] K. W. Regan, “Minimum-complexity pairing functions,” Journal of Computer and System Sciences, vol. 45, pp. 285–295,
1992.
[35] M. Lisi, “Some remarks on the Cantor pairing function,” Le Matematiche, vol. 62, pp. 55–65, 2007.
12
[36] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal
of the Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977.
[37] S. Ingrassia, S. C. Minotti, and A. Punzo, “Model-based clustering via linear cluster-weighted models,” Computational
Statistics and Data Analysis, vol. 71, pp. 159–182, 2014.
[38] G. Schwarz, “Estimating the dimensions of a model,” Annals of Statistics, vol. 6, pp. 461–464, 1978.
[39] C. Biernacki, G. Celeux, and G. Govaert, “Assessing a mixture model for clustering wit the integrated completed likelihood,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 719–725, 2000.

Download Report

A Universal Approximation Theorem for Gaussian-Gated

Paperzz.com

Your Paperzz