
Monte Carlo Algorithms for Computing α−Permanents
BY JUNSHAN WANG & AJAY JASRA
Department of Statistics & Applied Probability, National University of Singapore, Singapore, 117546, SG.
E-Mail: [email protected] , [email protected]
Abstract
We consider the computation of the α-permanent of a non-negative n × n matrix. This appears in a wide variety of real applications in statistics, physics and computer science. It is well known that exact computation is a #P complete problem. This has resulted in a large collection of simulation-based methods which produce randomized solutions whose complexity is only polynomial in n. This paper will review and develop algorithms for the computation of both the permanent (α = 1) and the α-permanent for general α > 0. In the context of binary n × n matrices, a variety of Markov chain Monte Carlo (MCMC) computational algorithms have been introduced in the literature whose cost, in order to achieve a given level of accuracy, is $O(n^7\log^4(n))$; see [4, 18]. These algorithms use a particular collection of probability distributions, the 'ideal' choice of which is (in some sense) not known and needs to be approximated. In this paper we propose an adaptive sequential Monte Carlo (SMC) algorithm that can estimate both the permanent and the ideal sequence of probabilities on the fly, with little user input. We provide theoretical results associated to the SMC estimate of the permanent, establishing its convergence. We also analyze the relative variance of the estimate associated to an 'ideal' algorithm (related to our algorithm), rather than the one we actually implement; in particular, we compute explicit bounds on the relative variance which depend upon n. As this analysis is for an ideal algorithm, it gives a lower bound on the computational cost required to achieve an arbitrarily small relative variance; we find that this cost is $O(n^4\log^4(n))$. For the α-permanent, perhaps the gold-standard algorithm is the importance sampling algorithm of [16]; in this paper we develop and compare new algorithms to this method. A priori one expects, due to the weight degeneracy problem, that the method of [16] might perform very badly in comparison to the more advanced SMC methods we consider. We also present a statistical application of the α-permanent to the estimation of boson point processes and MCMC methods to fit the associated model to data.
Key Words: Monte Carlo, α−Permanents, Relative Variance
1 Introduction
Consider an n × n non-negative matrix A = (a_{ij}); the α-permanent is defined as
$$\mathrm{per}_{\alpha}(A) = \sum_{\sigma\in S_n} \alpha^{\mathrm{cyc}(\sigma)} \prod_{i=1}^{n} a_{i\sigma(i)}$$
where α > 0, Sn is the set of permutations of [n] := {1, . . . , n}, cyc(σ) is the number of cycles of σ and the
case α = 1 is the permanent (in which case we write per(A)). Computing the α−permanent occurs in a wide
variety of real contexts, including applications in physics [15] and statistics - see [14] for an overview. The exact
computation of the α-permanent is not possible in polynomial time (as a function of n), and there is a wide variety of ground-breaking randomized algorithms designed to approximate it: for the permanent ([4, 18]) these can provably find approximate solutions in polynomial time, and for the α-permanent ([16, 14]) they can be polynomial in computational cost. The computation of the permanent can be linked to a variety of counting problems (see e.g. [7, 14]) for sampling matrices, although this link is not explored in this article. The fastest algorithm for the permanent is $O(n^7\log^4(n))$, given in [4]. These algorithms use MCMC (a simulated annealing algorithm) or SMC methods. Much of
the work we consider is focussed upon SMC methods.
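As a point of reference for the definition above, the following is a minimal brute-force sketch in Python (assuming only the standard library and NumPy; the function names are ours, not from the paper) that enumerates all n! permutations, counts cycles and accumulates the α-permanent. It is only practical for very small n, which is precisely why the randomized algorithms discussed in this paper are needed.

import numpy as np
from itertools import permutations

def cycle_count(sigma):
    """Number of cycles of a permutation sigma, given as a tuple with sigma[i] = image of i."""
    seen, cycles = set(), 0
    for start in range(len(sigma)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = sigma[j]
    return cycles

def alpha_permanent_bruteforce(A, alpha=1.0):
    """per_alpha(A) = sum_{sigma in S_n} alpha^cyc(sigma) * prod_i a_{i,sigma(i)} (O(n * n!) cost)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    total = 0.0
    for sigma in permutations(range(n)):
        prod = 1.0
        for i in range(n):
            prod *= A[i, sigma[i]]
        total += alpha ** cycle_count(sigma) * prod
    return total

With alpha = 1 this returns the ordinary permanent per(A).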
SMC methods are amongst the most widely used computational techniques in statistics, engineering, physics,
finance and many other disciplines; see [12] for a recent overview. They are designed to approximate a sequence of
probability distributions of increasing dimension. The method uses N ≥ 1 samples (or particles) that are generated
in parallel, using importance sampling and resampling methods. The approach can provide estimates of expectations
with respect to this sequence of distributions using the N weighted particles, of increasing accuracy as N grows.
These methods can also be used to approximate a sequence of probabilities on a common space, along with the
ratio of normalizing constants; see [10]. They have been found to out-perform MCMC in some situations.
The work in this article will focus upon developing and comparing algorithms:
1. designed just for the permanent (but could be extended to the α− permanent)
2. designed just for the α−permanent.
With regards to the first collection of algorithms there is a distinction between MCMC- and SMC-based algorithms. For the former, a wide variety of complexity results can be found in [5, 6, 4, 18], and these seem to indicate, for different types of matrices, that MCMC-type algorithms work much better. Indeed, [6] seem to indicate that the SMC approach of [7] requires effort exponential in n in some scenarios. This leads us to consider only MCMC-type algorithms for the permanent, together with SMC generalizations of them, which are different to those in e.g. [7]. For the α-permanent, we will explore SMC-based algorithms.
In the first direction, we propose an adaptive SMC algorithm. The consistency of this method is also established (that is, as N grows); we show that our estimate of the permanent converges in probability to the true value. This is a non-trivial convergence result, as the literature on adaptive SMC algorithms is in its infancy; see [2]. In addition, we consider the relative variance of the SMC estimate of the permanent, and its dependence upon n. Due to the aforementioned issues with the analysis of adaptive SMC algorithms, we consider a non-adaptive 'perfect' algorithm and the relative variance associated to this algorithm. Using the results in [3, 18, 20] we show that in order to control the relative variance up to arbitrary precision one requires a computational effort of $O(n^2\log^4(n))$ for this perfect algorithm; the adaptive SMC algorithm requires an additional cost which increases this to $O(n^4\log^4(n))$. As this analysis is for a simplified version of the
new algorithm the cost of O(n4 log4 (n)) is expected to be a lower-bound on the computational effort to control the
relative variance. This cost is, however, very favorable in comparison to the existing work and suggests that the
SMC procedure is a useful contribution to the literature on approximating permanents. It should be remarked that
the consistency result holds for exactly the algorithm that is implemented, which, as mentioned above is not the
case for the relative variance result.
In the second direction, we focus upon SMC extensions of the approach in [16], which appears to be the gold standard in the literature. This method is expected to perform badly in the sense that estimates of the form given in [16] might depend exponentially on the time for which the algorithm is run (which often depends upon n in order to obtain good results). This is due to the well-known weight-degeneracy effect (see e.g. [12]). We add a resampling mechanism, which one expects to fix this problem, at the cost of introducing an additional problem: the path degeneracy effect. We also develop a discrete particle filter (DPF) [13] which can perform better than SMC in some scenarios; see e.g. [22]. Finally we present a statistical application which relies on the SMC methods considered for the α-permanent.
This article is structured as follows. In Section 2 we focus upon computational algorithms for the permanent. In
particular our new algorithm is given and analyzed. In Section 3 we consider SMC algorithms for the α−permanent.
In Section 4 the numerical results for our comparisons are given. In Section 5 the statistical application is presented. Finally in Section 6 the article is summarized. The appendix holds some technical results associated to the
consistency analysis in Section 2.
2 Computational Algorithms for the Permanent

2.1 Basic Procedure
For the permanent, we focus upon binary matrices; one can extend to non-negative matrices as in [18]. The calculation of the permanent can be rephrased in terms of counting the perfect matchings of a bipartite graph.
Consider a bipartite graph G = (U, V, E), where U = {u1 , . . . , un } and V = {v1 , . . . , vn } are disjoint sets, and E
is the edge set which is associated to the matrix A; for (i, j) ∈ [n]2 , (ui , vj ) ∈ U × V , (ui , vj ) ∈ E if and only if
aij = 1. Recall that a perfect matching of G is a set of edges with cardinality n, such that no two edges contain
the same vertex. If
$$\mathcal{M} := \{((u_{k_1},v_{s_1}),\dots,(u_{k_n},v_{s_n})) \in E^n : (k_i,s_i)\in[n]^2,\ k_1\neq k_2\neq\cdots\neq k_n,\ s_1\neq s_2\neq\cdots\neq s_n\}$$
denotes the set of perfect matchings, then from the definitions $\mathrm{per}(A) = \mathrm{Card}(\mathcal{M})$. A near-perfect matching is a perfect matching with a single edge removed; we denote the collection of such matchings by $\mathcal{N}(u,v)$, where $(u,v)$ is the pair of vertices that do not lie in the set. That is, for an $M \in \mathcal{M}$ such that $(u,v) \in M$,
$$M \setminus \{(u,v)\} \in \mathcal{N}(u,v).$$
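To make the correspondence between permutations and matchings concrete, the short Python sketch below (the helper names edge_set and count_perfect_matchings are ours, purely for illustration) builds E from a binary matrix and counts the perfect matchings by identifying each permutation σ with the candidate matching {(u_i, v_σ(i))}; for a 0-1 matrix this count equals per(A), as stated above.

from itertools import permutations

def edge_set(A):
    """Edges (i, j) of the bipartite graph G = (U, V, E), with (u_i, v_j) in E iff a_ij = 1."""
    n = len(A)
    return {(i, j) for i in range(n) for j in range(n) if A[i][j] == 1}

def count_perfect_matchings(A):
    """Card(M): the number of perfect matchings, equal to per(A) for a binary matrix A."""
    n = len(A)
    E = edge_set(A)
    count = 0
    for sigma in permutations(range(n)):
        # the matching {(u_i, v_{sigma(i)}) : i in [n]} is perfect iff all of its edges lie in E
        if all((i, sigma[i]) in E for i in range(n)):
            count += 1
    return count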
The work in [3, 18] focusses firstly on a Metropolis-Hastings (M-H) algorithm which is defined on the space (note that the graph is completed, which we discuss later on) $\bar{\mathcal{M}} = \mathcal{M} \cup \big(\bigcup_{(u,v)\in U\times V}\mathcal{N}(u,v)\big)$. In particular, efficiency results are proved about the spectral gap associated to the given M-H kernel for a particular collection of probabilities defined on $\bar{\mathcal{M}}$. Simulation from these probabilities allows one to approximate the permanent. In particular, the idea is to construct a sequence of probabilities on $\bar{\mathcal{M}}$, which are increasingly more complex and of the form
$$\eta_p(M) \propto \Phi_p(M), \qquad M \in \bar{\mathcal{M}},\ 0 \le p \le r,$$
with $\Phi_p : \bar{\mathcal{M}} \to \mathbb{R}^+$; these are defined later on. Writing $Z_p = \sum_{M\in\bar{\mathcal{M}}} \Phi_p(M)$, [18] show that
$$\mathrm{per}(A) \approx \frac{Z_r}{n^2+1}$$
and use the standard decomposition
$$Z_r = Z_0 \prod_{k=1}^{r} \frac{Z_k}{Z_{k-1}}$$
to facilitate an accurate estimation of $Z_r$ and hence to estimate the permanent; note $Z_0$ is known. The idea is that it is 'easy' to estimate $Z_1$, and so if the discrepancy between consecutive $Z$'s is small, the resulting estimate is better than if one just estimated $Z_r$ from the beginning. In order that the estimate of the permanent can be made arbitrarily accurate, $r$ is a function of $n$, and most recently [4] give a procedure which costs $O(n^7\log^4(n))$. The results rely upon a particular property of 'ideal' (say) $\{\Phi^*_p\}_{0\le p\le r}$, which cannot be computed in practice.
2.2 Simulated Annealing Algorithm
The simulated annealing algorithms of [4, 18] work in the following way. The authors define a sequence of activities
(φp : U × V → R+ )0≤p≤r , such that φ0 (u, v) = 1 ∀(u, v) ∈ U × V and φr (u, v) = 1, if (u, v) ∈ E and φr (u, v) = 1/n!
otherwise.
The idea is to define a sequence of target distributions on the set of perfect and near-perfect matchings associated
to a completion of the original graph (a complete graph is one for which every vertex is connected to every other).
The initial graph is such that all perfect and near perfect matchings have close to uniform probability and as the
sequence gets closer to r, so the graph becomes closer to the original graph and the complexity of the target much
higher (so for example, it may be difficult to define a Markov transition that easily moves around on the given
sample space).
The targets are defined on the common space $\bar{\mathcal{M}} = \mathcal{M} \cup \big(\bigcup_{(u,v)\in U\times V}\mathcal{N}(u,v)\big)$; for $p \in \{0,\dots,r\}$,
$$\eta_p(M) \propto \Phi_p(M) \qquad (1)$$
where
$$\Phi_p(M) = \begin{cases} \phi_p(M)\,w_p(u,v) & \text{if } M \in \mathcal{N}(u,v) \text{ for some } (u,v) \in U\times V \\ \phi_p(M) & \text{if } M \in \mathcal{M} \end{cases}$$
where $\phi_p(M) = \prod_{(u,v)\in M} \phi_p(u,v)$ and $w_p : U\times V \to \mathbb{R}^+$ is a weight to be defined.
[18] note that ideally, one should choose $w_p = w_p^*$, where
$$w_p^*(u,v) = \frac{\Xi_p(\mathcal{M})}{\Xi_p(\mathcal{N}(u,v))}, \qquad \mathcal{N}(u,v) \neq \emptyset, \qquad (2)$$
and $\Xi_p(C) := \sum_{M\in C}\phi_p(M)$, $C \subseteq \bar{\mathcal{M}}$. This means that $\sum_{M\in\mathcal{M}}\eta_p(M) \ge 1/(n^2+1)$. The definition of $\phi_p(u,v)$ is given in either of [4, 18] and we refer the reader there for good choices of this cooling sequence. Note that $r$ is $O(n^2\log(n))$ in [18] and this is improved to $O(n\log^2(n))$ in [3].
The simulated annealing algorithm is then essentially to generate a sequence of Markov chains, each with invariant measure $\eta_k$, although some additional improvements are in [18]. Appropriate M-H kernels $(K_p)_{1\le p\le r}$ of invariant measure $\{\eta_p\}_{1\le p\le r}$ can be found in [18]. In order to estimate the permanent, [18] state that
$$\mathrm{per}(A) \approx \frac{Z_r}{n^2+1}$$
and use the standard decomposition
$$Z_r = Z_0 \prod_{k=1}^{r} \frac{Z_k}{Z_{k-1}}$$
to estimate the ratio of normalizing constants and hence the permanent (note Z0 = n!(n2 + 1)). In the analysis in
[3, 18], particular emphasis is placed upon being able to estimate the weights wp (u, v) to within a factor of 2 of the
ideal ones wp∗ (u, v).
2.3 New Adaptive SMC Algorithm
One of the major issues with the simulated annealing algorithm is that the methodology is not really designed to adaptively compute approximations of (2) in an elegant manner, and it uses a single Markov chain for simulation (or multiple non-interacting chains). The algorithm samples from a sequence of targets on a common space and seeks to estimate ratios of normalizing constants; a method which is designed exactly for this is in [10].
The approach in [10] is as follows. It uses a combination of importance sampling, MCMC and resampling, with each step being performed sequentially in time. $N > 1$ samples are generated in parallel and weights $\omega_p^i$, $i \in [N]$, are used to approximate the probabilities. This technique can often out-perform single-chain methods as it generates a population of interacting samples in parallel (see [17]). In the context of interest, one would like to sample from the sequence in (1), when $\Phi_p$ uses the ideal weights. This is not possible in general, and so we will use the collection of samples generated at the previous time point to approximate the ideal weights.
The algorithm is now described; it is simply an adaptive version of the class of algorithms found in [10]. We fix a small $\delta > 0$ (say $\delta \approx 10^{-10}$) which is used below to avoid dividing by zero. Below, the Markov kernels $K_p$ (with invariant measure $\eta_p$) are as described in [18].
Algorithm 1 An Adaptive SMC algorithm for approximating the permanent.
1. Sample $M_0^1,\dots,M_0^N$ i.i.d. from $\eta_0$. Set $p = 0$ and $\omega_p^i = 1$ for each $i \in [N]$.
2. If $p = r$ stop; otherwise, for each $(u,v) \in U\times V$, compute
$$w_{p+1}^N(u,v) = \frac{\sum_{i=1}^{N}\omega_p^i\,\mathbb{I}_{\mathcal{M}}(M_p^i)\big[\prod_{(u',v')\in M_p^i}\phi_{p+1}(u',v')/\phi_p(u',v')\big] + \delta}{\frac{1}{w_p^N(u,v)}\sum_{i=1}^{N}\omega_p^i\,\mathbb{I}_{\mathcal{N}(u,v)}(M_p^i)\big[\prod_{(u',v')\in M_p^i}\phi_{p+1}(u',v')/\phi_p(u',v')\big] + \delta}$$
and set $p = p+1$.
3. Compute, for $i \in [N]$,
$$\omega_p^i = \omega_{p-1}^i\,G_{p-1,N}(M_{p-1}^i), \qquad G_{p-1,N}(M_{p-1}^i) = \frac{\Phi_p^N(M_{p-1}^i)}{\Phi_{p-1}^N(M_{p-1}^i)},$$
where
$$\Phi_p^N(M) = \begin{cases} \phi_p(M)\,w_p^N(u,v) & \text{if } M \in \mathcal{N}(u,v) \text{ for some } (u,v) \in U\times V \\ \phi_p(M) & \text{if } M \in \mathcal{M}. \end{cases}$$
Compute $\mathrm{ESS} = (\sum_{i=1}^N \omega_p^i)^2/\sum_{i=1}^N(\omega_p^i)^2$; if $\mathrm{ESS} < N_T$, resample and set $\omega_p^i = 1$ (denoting the resampled particles with the same notation), otherwise go to 4.
4. For $i \in [N]$ sample $M_p^i\,|\,M_{p-1}^i \sim K_p(M_{p-1}^i,\cdot)$ and go to 2.
Step 2 requires an $O(n^2)$ operation, that is, to approximate $w_{p+1}^*(u,v)$ for each $(u,v)$. For large $N$, $w_{p+1}^N(u,v)$ should be close to $w_{p+1}^*(u,v)$; this is proved formally below (see the proofs in the appendix). We note, however, that no rates of convergence are obtained, which removes the possibility of calibrating $N$ so as to ensure that $w_p^N(u,v)$ is within a factor of 2 of the ideal weights; we discuss this issue below.
Step 3. is called resampling; see [12] for some overview of this approach. The resampling is performed dynamically, that is, when the ESS drops below a threshold NT ; roughly, the ESS measures the number of useful samples
and is a number between 1 and N . Typically, one sets NT = N/2.
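As a concrete illustration of the dynamic resampling rule, the sketch below (Python/NumPy; the particle representation is left abstract and the function names are ours) computes the ESS from the current unnormalized weights and, when it falls below N_T, performs a multinomial resample and resets the weights to 1, as in Step 3 of Algorithm 1.

import numpy as np

def ess(weights):
    """ESS = (sum_i w_i)^2 / sum_i w_i^2, a number between 1 and N."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def maybe_resample(particles, weights, N_T, rng=np.random.default_rng()):
    """Multinomial resampling triggered when ESS < N_T; otherwise particles and weights are unchanged."""
    N = len(weights)
    if ess(weights) < N_T:
        probs = np.asarray(weights, dtype=float)
        probs = probs / probs.sum()
        idx = rng.choice(N, size=N, replace=True, p=probs)
        particles = [particles[i] for i in idx]
        weights = np.ones(N)          # weights are reset to 1 after resampling
    return particles, weights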
The estimate of the permanent is
$$n!\,\prod_{p=1}^{l+1}\Big(\frac{1}{N}\sum_{i=1}^{N}\omega_{k_p}^{i}\Big) \qquad (3)$$
where one assumes that resampling occurs $l$ times, at time points $1 \le k_1 < \cdots < k_l \le k_{l+1} = r$. See [10] and the references therein for a discussion of these estimates, along with their convergence. We will discuss the convergence below, in the situation where $N_T = 1$ (i.e. one resamples at every time point). If we denote by $\mathcal{M}_G$ the perfect matchings of the original graph, then an alternative estimate of the permanent is
$$n!\,(n^2+1)\,\prod_{p=1}^{l+1}\Big(\frac{1}{N}\sum_{i=1}^{N}\omega_{k_p}^{i}\Big)\,\sum_{i=1}^{N}\mathbb{I}_{\mathcal{M}_G}(M_r^i)\,\frac{\omega_r^i}{\sum_{j=1}^{N}\omega_r^j}. \qquad (4)$$
We remark that all of the subsequent analysis can be adapted for this estimate and the conclusions do not change (so our analysis is for the estimate (3)). One might expect, for moderate $n$, that this estimate could be marginally better. However, if the original graph has very few perfect matchings, then the number that are sampled is low and this estimate may perform more poorly than our first estimate. We perform an empirical comparison in Section 4.
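In implementation terms, estimate (3) is simply n! times a running product of averaged unnormalized weights, recorded just before each resampling time and at the final time r. A minimal sketch, assuming these weight vectors have been stored during the run of Algorithm 1 (the function name is ours), is:

import math
import numpy as np

def permanent_estimate(weight_snapshots, n):
    """Estimate (3): n! times the product over resampling epochs of (1/N) sum_i omega_i.

    weight_snapshots: list of 1-D arrays, the unnormalized particle weights recorded
    just before each resampling time and at the final time r.
    """
    prod = 1.0
    for w in weight_snapshots:
        prod *= float(np.mean(w))
    return math.factorial(n) * prod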
The algorithm as presented may have a number of advantages over simulated annealing. The first, as noted above, is the population-based nature of the evolution of the samples; they interact with each other, which can improve performance relative to single-chain approaches such as simulated annealing (see e.g. [17]). Secondly, as noted above, the approach of estimating the ideal $w_p$ is naturally incorporated into the sampling mechanism. One disadvantage, relative to simulated annealing, however, is the need to store $N$ samples in $\bar{\mathcal{M}}$.
2.4 Convergence Analysis
Below we will use →P to denote convergence in probability as N grows. We will analyze the algorithm when one
resamples multinomially at every time step (NT = 1); this is an assumption typically made in the literature - see
[9]. We do not need to specify the scheme associated to the change of φp and this can be either that in [18] or [3].
For reasons that will be clear later on in the article, we use γrN (1) to denote the estimate of per(A)/n! = γr (1) (see
Section 2.5). It should be remarked that this result does not depend upon the ergodicity/mixing properties of the
MCMC kernels - this is typically not needed for such consistency results. We have the following result, whose proof
is in the appendix.
Theorem 2.1. For any $n > 1$ fixed, we have
$$\gamma_r^N(1) \to_{\mathbb{P}} \gamma_r(1).$$
The result establishes the consistency of our approach, which is a non-trivial convergence result, in that it is not a simple extension of the convergence results that are currently in the literature. However, it does not establish any rate of convergence; it should be straightforward to obtain these, through non-asymptotic $L_s$-bounds (although there are not any in the literature which apply to our algorithm), but it is a non-trivial task to ensure that these bounds are sharp in $n$. Nevertheless, these types of results are important. For example, one is interested in being able to guarantee, with high probability, that the empirical weights $w_p^N$ are close to the ideal weights. In this direction, one would want to establish a Hoeffding-type inequality; that is, at least for any $\epsilon > 0$, $(u,v) \in U\times V$,
$$\mathbb{P}\big(|w_p^N(u,v) - w_p(u,v)| \ge \epsilon\big) \le C_1(n,N,\epsilon)\exp\{C_2(n,N,\epsilon)\}$$
for some constants $C_1, C_2$ that depend upon $n$, $\epsilon$, $N$, where $C_2$ goes to $-\infty$ as $N$ grows, and $C_1$ grows more slowly than $\exp\{C_2(n,N,\epsilon)\}$ decreases. For non-adaptive algorithms some similar results have been established in [9, Chapter 7], but not directly about quantities such as $w_p^N(u,v)$, with constants that are explicit in $n$. Even if one converts such results for $w_p^N(u,v)$, providing sharp bounds in $n$ is expected to be fairly challenging. One would want to
replace the Dobrushin coefficient analysis in [9] with one related to spectral properties of the associated Markov
chain semi-groups, which are those exploited in the next Section; then an extension to the adaptive case is required.
This programme is particularly important, but left as a topic for future work.
2.5 Complexity Analysis

2.5.1 Notation and Assumptions
We now prove our complexity result. We will consider a ‘perfect algorithm’ that does not use the adaptation in
Section 2.3 (Step 2). It should be remarked that asymptotically in N , the performance of the adaptive and perfect
algorithm can coincide, see [2], although this has not been proved for the algorithm in this article. However, if n
(and hence N ) is large, one expects a similar performance of the adaptive and perfect algorithms, due to the aforementioned reason. The difficulty in the analysis when the algorithm is adaptive is as follows. When the algorithm
is non-adaptive, the estimate γrN (1) is unbiased; and it is this property which leads to a sharp analysis of its relative
variance (e.g. [20]). In the adaptive case, this property does not always hold which significantly complicates the
analysis; thus we focus on a non-adaptive version of the algorithm. As the adaptive algorithm requires estimation of
the targets (so a likely increase in variance in estimation), our results will lead to a lower-bound on the complexity
associated to controlling the relative variance of the estimate of the permanent.
Our proofs use Feynman-Kac notations, which we give here. We set $\{G_p\}_{0\le p\le r-1}$ as the incremental weights:
$$G_p(M) = \frac{\Phi_{p+1}(M)}{\Phi_p(M)}.$$
We define the Markov kernels $\{K_p\}_{1\le p\le r}$ as the reversible MCMC kernels in [18]. One can show that
$$\eta_t(M) = \frac{\gamma_t(M)}{\gamma_t(1)}, \qquad 1 \le t \le r,$$
where, for $\varphi \in \mathcal{B}_b(\bar{\mathcal{M}})$ (the collection of real-valued, bounded and measurable functions on $\bar{\mathcal{M}}$),
$$\gamma_t(\varphi) = \mathbb{E}\Big[\prod_{p=0}^{t-1} G_p(M_p)\,\varphi(M_t)\Big] \qquad (5)$$
where the expectation is w.r.t. a non-homogeneous Markov chain with initial measure $\eta_0$ and transitions $\{K_p\}_{1\le p\le r}$. Note that one can also show that $\gamma_p(1) = Z_p/Z_0$. We introduce the following non-negative operator:
$$Q_p(M,M') = G_{p-1}(M)\,K_p(M,M').$$
We also use the semigroup notation, for $0 \le p < t$,
$$Q_{p,t}(M_p,M_t) = \sum_{(M_{p+1},\dots,M_{t-1})\in\bar{\mathcal{M}}^{t-p-1}} Q_{p+1}(M_p,M_{p+1})\times\cdots\times Q_t(M_{t-1},M_t).$$
Finally, the notation
$$\lambda_p = \eta_p(G_p) = \sum_{M\in\bar{\mathcal{M}}}\eta_p(M)\,G_p(M), \qquad 0 \le p \le r-1,$$
will prove to be useful.
Our analysis will be associated to an SMC algorithm that resamples (multinomially) at each time point. We will consider the variance of the estimate
$$\gamma_r^N(1) = \prod_{p=0}^{r-1}\Big(\frac{1}{N}\sum_{i=1}^{N}G_p(M_p^i)\Big)$$
which will approximate per(A)/n!; the factor 1/n! does not affect the complexity result in Theorem 2.2. We will
make the following assumption:
(A1) We have that for each $(u,v) \in U\times V$, $\{w_p(u,v)\}_{0\le p\le r}$ are deterministic, and for $0 \le p \le r$,
$$\tfrac{1}{2}w_p^*(u,v) \le w_p(u,v) \le 2\,w_p^*(u,v)$$
for each $(u,v) \in U\times V$.
The assumption means that one does not perform step 2. in Section 2.3, but is consistent with the assumptions
made in [3, 18]. The cooling scheme in [3] is adopted.
Introduce the Dirichlet form of a reversible Markov kernel $P$, with invariant measure $\xi$ on a finite state-space $E$, for a real-valued function $\varphi : E \to \mathbb{R}$:
$$\mathcal{E}(\varphi,\varphi) = \frac{1}{2}\sum_{x,y\in E}(\varphi(x)-\varphi(y))^2\,\xi(x)\,P(x,y).$$
Then the spectral gap of $P$ is
$$\mathrm{Gap}(P) = \inf\Big\{\frac{\mathcal{E}(\varphi,\varphi)}{\xi([\varphi-\xi(\varphi)]^2)} : \varphi \text{ is non-constant}\Big\}.$$
Then it follows that, under our assumptions, by the analysis in [3], the congestion of the MCMC kernels is $O(n^2)$ and, via the Poincaré inequality (see e.g. [11]), that for $1 \le p \le r$ and some $0 < C < \infty$,
$$1 - \mathrm{Gap}(K_p) \le 1 - \frac{1}{Cn^2}. \qquad (6)$$
This fact will become useful later on in the proofs. This shows that, uniformly in $p$ and for $n$ fixed, the MCMC kernels have a lower-bounded spectral gap and are hence geometrically ergodic.
2.5.2 Technical Results
The following technical results will allow us to give our main result associated to the complexity of the SMC
algorithm.
Lemma 2.1. Assume (A1). Then for any $n > 1$ and $0 \le p \le r-1$,
$$\sup_{M\in\bar{\mathcal{M}}}\frac{|G_p(M)|}{\lambda_p} \le \frac{8(n^2+1)}{n^2}.$$
Proof. Let $0 \le p \le r-1$ be arbitrary. We note that by [3, Corollary 4.4.2]
$$\sup_{M\in\bar{\mathcal{M}}}|G_p(M)| \le \sqrt{2}. \qquad (7)$$
Thus, we will focus upon λp .
We start our calculations by noting:
$$Z_p = \sum_{M\in\mathcal{M}}\phi_p(M) + \sum_{(u,v)\in U\times V}\ \sum_{M\in\mathcal{N}(u,v)}\phi_p(M)\,w_p(u,v) \le \sum_{M\in\mathcal{M}}\phi_p(M) + 2\sum_{(u,v)\in U\times V}\ \sum_{M\in\mathcal{N}(u,v)}\phi_p(M)\,w_p^*(u,v) \le 2\,\Xi_p(\mathcal{M})(n^2+1) \qquad (8)$$
where we have applied (A1) to go to the second line. Now, moving onto $\lambda_p$:
$$\begin{aligned}
\lambda_p &\ge \sum_{(u,v)\in U\times V}\ \sum_{M\in\mathcal{N}(u,v)}\eta_p(M)\,G_p(M)\\
&= \sum_{(u,v)\in U\times V}\ \sum_{M\in\mathcal{N}(u,v)}\Big\{\frac{\phi_p(M)\,w_p(u,v)}{Z_p}\Big\}\Big\{\frac{\phi_{p+1}(M)\,w_{p+1}(u,v)}{\phi_p(M)\,w_p(u,v)}\Big\}\\
&\ge \frac{1}{2Z_p}\sum_{(u,v)\in U\times V}\ \sum_{M\in\mathcal{N}(u,v)}\phi_{p+1}(M)\,w_{p+1}^*(u,v)\\
&\ge \frac{1}{4\,\Xi_p(\mathcal{M})(n^2+1)}\sum_{(u,v)\in U\times V}\ \sum_{M\in\mathcal{N}(u,v)}\frac{\phi_{p+1}(M)\,\Xi_{p+1}(\mathcal{M})}{\Xi_{p+1}(\mathcal{N}(u,v))}\\
&= \frac{n^2\,\Xi_{p+1}(\mathcal{M})}{4\,\Xi_p(\mathcal{M})(n^2+1)}\\
&\ge \frac{n^2}{4\sqrt{2}\,(n^2+1)} \qquad (9)
\end{aligned}$$
where we have used (8) to go to the fourth line, the fact that $\sum_{M\in\mathcal{N}(u,v)}\phi_{p+1}(M) = \Xi_{p+1}(\mathcal{N}(u,v))$ to go to the fifth line, and the inequality [3, (4.14)] to go to the final line.
Thus, noting (7) and (9), we have shown that
$$\sup_{M\in\bar{\mathcal{M}}}\frac{|G_p(M)|}{\lambda_p} \le \frac{8(n^2+1)}{n^2},$$
which completes the proof.
We now write the $L_s(\eta_p)$ norm, $s \ge 1$, for $f \in \mathcal{B}_b(\bar{\mathcal{M}})$:
$$\|f\|_{L_s(\eta_p)} := \Big(\sum_{M\in\bar{\mathcal{M}}}|f(M)|^s\,\eta_p(M)\Big)^{1/s}.$$
Let
$$\tau(n) = \frac{8(n^2+1)}{n^2}, \qquad \rho(n) = \big(1 - 1/(Cn^2)\big)^2.$$
Then, we have the following result.
Lemma 2.2. Assume (A1). Then if $\tau(n)^3(1-\rho(n)) < 1$ we have that for any $f \in \mathcal{B}_b(\bar{\mathcal{M}})$, $0 \le p < t \le r$:
$$\Bigg\|\frac{Q_{p,t}(f)}{\prod_{q=p}^{t-1}\lambda_q}\Bigg\|_{L_4(\eta_p)} \le \frac{\tau(n)^{3/4}}{1-(1-\rho(n))\tau(n)^3}\,\|f\|_{L_4(\eta_t)}.$$
Proof. The proof follows by using the technical results in [20]. In particular, Lemma 2.1 will establish Assumption
B in [20] and (6) Assumption D and hence Assumption C of [20]. Application of Corollary 5.3 (r = 2) of [20],
followed by Lemma 4.8 of [20] completes the proof.
Remark 2.1. The condition τ (n)3 (1 − ρ(n)) < 1 is not restrictive and will hold for n moderate; both τ (n) and ρ(n)
are O(1) which means that τ (n)3 (1 − ρ(n)) < 1 for n large enough.
2.5.3 Main Result and Interpretation
Let
$$\bar{C}(n) = \Bigg(\frac{\tau(n)^{3/4}}{1-(1-\rho(n))\tau(n)^3}\Bigg)^2.$$
Below, the expectation is w.r.t. the process associated to the SMC algorithm which is actually simulated.
Theorem 2.2. Assume (A1). Then if $\tau(n)^3(1-\rho(n)) < 1$ and $N > 2\bar{C}(n)(r+1)(3+\bar{C}(n)^2)$ we have that
$$\mathbb{E}\Bigg[\Big(\frac{\gamma_r^N(1)}{\gamma_r(1)} - 1\Big)^2\Bigg] \le \frac{(r+1)\bar{C}(n)^2}{N}\Bigg(1 + \frac{2(r+1)\bar{C}(n)(3+\bar{C}(n)^2)}{N}\Bigg).$$
Proof. Lemma 2.2, combined with [20, Lemma 4.1], shows that Assumption A of [20] holds, with $c_{p,t}(p)$ (of that paper) equal to $\bar{C}(n)$; that is, for $0 \le p < t \le r$, $f \in \mathcal{B}_b(\bar{\mathcal{M}})$:
$$\max\Bigg\{\Bigg\|\frac{Q_{p,t}(f^2)}{\prod_{q=p}^{t-1}\lambda_q}\Bigg\|_{L_4(\eta_p)},\ \Bigg\|\frac{Q_{p,t}(f)}{\prod_{q=p}^{t-1}\lambda_q}\Bigg\|_{L_4(\eta_p)}^2,\ \Bigg\|\frac{Q_{p,t}(f)}{\prod_{q=p}^{t-1}\lambda_q}\Bigg\|_{L_4(\eta_p)}\Bigg\} \le \bar{C}(n)\,\|f\|_{L_4(\eta_t)}^2. \qquad (10)$$
Then, one can apply [20, Theorem 3.2]: if $N > 2\hat{c}_r$,
$$\mathbb{E}\Bigg[\Big(\frac{\gamma_r^N(1)}{\gamma_r(1)} - 1\Big)^2\Bigg] \le \frac{1}{N}\Bigg\{\sum_{p=0}^{r}\mathrm{Var}_{\eta_p}\Bigg[\frac{Q_{p,r}(1)}{\prod_{q=p}^{r-1}\lambda_q}\Bigg] + \frac{2}{N}\hat{c}_r v_r\Bigg\} \qquad (11)$$
where $\hat{c}_r, v_r$ are defined in [20] and $\mathrm{Var}_{\eta_p}[\cdot]$ is the variance w.r.t. the probability $\eta_p$. By (10) and Jensen's inequality,
$$\sum_{p=0}^{r}\mathrm{Var}_{\eta_p}\Bigg[\frac{Q_{p,r}(1)}{\prod_{q=p}^{r-1}\lambda_q}\Bigg] \le \sum_{p=0}^{r}\Bigg\|\frac{Q_{p,r}(1)}{\prod_{q=p}^{r-1}\lambda_q}\Bigg\|_{L_2(\eta_p)}^2 \le \sum_{p=0}^{r}\Bigg\|\frac{Q_{p,r}(1)}{\prod_{q=p}^{r-1}\lambda_q}\Bigg\|_{L_4(\eta_p)}^2 \le (r+1)\bar{C}(n)^2.$$
From the definitions in [20], one can easily conclude that
$$\hat{c}_r \le \bar{C}(n)(r+1)(3+\bar{C}(n)^2), \qquad v_r \le (r+1)\bar{C}(n)^2.$$
Combining the above arguments with (11) gives that, for $N > 2\bar{C}(n)(r+1)(3+\bar{C}(n)^2)$,
$$\mathbb{E}\Bigg[\Big(\frac{\gamma_r^N(1)}{\gamma_r(1)} - 1\Big)^2\Bigg] \le \frac{(r+1)\bar{C}(n)^2}{N}\Bigg(1 + \frac{2(r+1)\bar{C}(n)(3+\bar{C}(n)^2)}{N}\Bigg),$$
which concludes the proof.
As $\bar{C}(n)$ is $O(1)$ and $r$ is $O(n\log^2(n))$, if $N$ is $O(n\log^2(n))$ then one can make the relative variance arbitrarily small; thus the cost of this perfect algorithm is $O(n^2\log^4(n))$, which is a lower bound on the complexity of the algorithm actually applied. For example, the approximation of the weights is $O(n^2)$ per time step, which is an additional cost; so one would expect that, at best, the adaptive algorithm would have a cost of $O(n^4\log^4(n))$ in order to control the relative variance. As noted previously, our complexity analysis does not take into account the ability to approximate the ideal weights up to a factor of 2, which is another reason why $O(n^4\log^4(n))$ is a lower bound on the complexity of the adaptive algorithm. As noted by a referee, one would also like to have an upper bound on the computational complexity. As we have argued in the introduction and as we shall see in Section 4, the adaptive SMC algorithm seems to out-perform simulated annealing, so we expect that, at worst, the complexity should be $O(n^7\log^4(n))$. We expect that the upper bound is much tighter than this; to prove such an upper bound, one would seek to analyze the algorithm as implemented, which is beyond the scope of the current work.
3 Computational Algorithms for the α-Permanent

3.1 SMC
To approximate the α-permanent, [16] introduce a set $K_n$ which contains all ordered partitions of the set $\{1,2,\dots,n\}$. They state that to each $k \in K_n$ there corresponds a permutation $s(k)$; namely, for each permutation $\sigma$, one can find a corresponding partition $k$ such that $s(k) = \sigma$. Let $A\{s(k)\} = A(\sigma) = \prod_{i=1}^{n}a_{i,\sigma(i)}$. Theorem 1 of [16] says that if $k$ is a random ordered partition whose distribution $g(\cdot)$ is strictly positive on $K_n$, then
$$\frac{q(k)}{g(k)}\,\alpha^{\mathrm{cyc}\{s(k)\}}\,A\{s(k)\} \qquad (12)$$
is an unbiased estimate of $\mathrm{per}_\alpha(A)$, as long as $q(\cdot) > 0$ satisfies $\sum_{k:s(k)=\sigma}q(k) = 1$ for all $\sigma$.
Furthermore, [16] identify a particular choice of $q(\cdot)$, and the associated importance sampling (IS) algorithm provides a method to obtain a particular $k$ with probability $g(k) = q(k)p(k)$, where $p(k) = p^{(1)}_{\sigma(k_1)}p^{(2)}_{\sigma(k_2)}\cdots p^{(n-1)}_{\sigma(k_{n-1})}$ is as given in equation (6) of [16]. It is then shown that the ratio generated from the IS algorithm, $w(\sigma) = \alpha^{\mathrm{cyc}(\sigma)}A\{\sigma\}/p(k)$, is an unbiased estimate of $\mathrm{per}_\alpha(A)$. As hinted at above, the overall estimate is of product form and should be subject to the weight degeneracy problem; this could mean that one needs an exponential effort in $n$ in order to obtain reasonable estimates.
To deal with the possible issue of weight degeneracy we introduce the SMC algorithm given in Algorithm 2; this is just the method of [16] with resampling. Note that in Algorithm 2 the notation $\Delta\mathrm{cyc}(\sigma_{k_1} = j)$ denotes the number of new cycles formed by assigning $\sigma_{k_1}$ to $j$. Algorithm 2 approximates the α-permanent by the following estimate:
$$\mathrm{Per}_\alpha^N(A) = \prod_{j=1}^{r_{n-1}+1}\Bigg\{\frac{1}{N}\sum_{m=1}^{N}\omega_{k_j}^{m}\Bigg\}$$
where $k_j$ is the $j$th time index at which we resample for $j > 1$ and $\omega_{k_j}^m$ are the weights before resampling (if this occurs). The number of resampling steps between 1 and $n-1$ is denoted by $r_{n-1}$ and we set $k_{r_{n-1}+1} = n$. Similarly to the unbiasedness of the particle approximation of the Feynman-Kac prediction model in [9], the estimate of the α-permanent obtained by the above SMC algorithm is unbiased.
An important remark is that, although the SMC method may help to deal with the weight degeneracy issue in the IS method, the SMC method also has drawbacks; one well-known issue is the path degeneracy effect for estimates on the path. That is, due to resampling, the number of unique particles amongst those states initially sampled is quite small. According to [12], whether an SMC method works in practice mostly depends on the presence or absence of path degeneracy: if the finite sample actually used is degenerate, we cannot expect good performance of the SMC method even if we have strong convergence results. Thus, if an SMC method is subject to the path degeneracy effect, the obtained sample cannot represent the target density well and the simulation results will be unreliable. This may not be such a problem for this application as we have found, on finite state-spaces, that SMC on the path can perform quite well; see [22].
3.2 DPF
To deal with path degeneracy we consider the DPF algorithm. The discrete particle filter algorithm was proposed in [13] to approximate the posterior and marginal densities of hidden Markov models (HMMs), but it is a non-iterative procedure.
Algorithm 2 SMC algorithm
1. Initialization: define $M = [N]$; for each $m \in M$,
(a) For $j \in \{1,2,\dots,n\}$, let $C_j^m$ be the $j$th column sum of $A$.
(b) Set $R^m = \{1,2,\dots,n\}$.
(c) Choose $k_1^m$ uniformly from $R^m$.
(d) Set $i = 0$ and $\omega_i^m = 1$.
2. For $i = 1,2,\dots,n-1$:
(a) For each $m \in M$, if any $C_j^m = 0$ for $j \in R^m$, set $\omega_i^m = 0$ and $M = M\setminus\{m\}$.
(b) For each $m \in M$,
   i. set $p_j^m = \alpha^{\Delta\mathrm{cyc}(\sigma^m(k_i^m)=j)}\times|a_{k_i^m,j}|/(C_j^m - |a_{k_i^m,j}|)$ for $j \in R^m$.
   ii. if (more than one $p_j^m = \infty$) or ($p_j^m = 0$ for all $j \in R^m$), set $\omega_i^m = 0$ and $M = M\setminus\{m\}$.
(c) For each $m \in M$,
   i. if there exists $j$ such that $p_j^m = \infty$, set $\sigma^m(k_i^m) = j$ and $p^m_{\sigma^m(k_i^m)} = 1$; else normalize the $p_j^m$ such that $\sum_{j\in R^m}p_j^m = 1$ and sample $\sigma^m(k_i^m)$ from $R^m$ with probabilities $p_j^m$.
   ii. set $\omega_i^m = \omega_{i-1}^m\times a_{k_i^m,\sigma^m(k_i^m)}\times\alpha^{\Delta\mathrm{cyc}(\sigma^m(k_i^m)=\sigma^m(k_i^m))}/p^m_{\sigma^m(k_i^m)}$.
(d) Normalize the weights $\bar\omega_i^m = \omega_i^m/\sum_{m=1}^N\omega_i^m$ and compute $\mathrm{ESS} = (\sum_{m=1}^N(\bar\omega_i^m)^2)^{-1}$. If $\mathrm{ESS} < N_T$, resample the particles $(k_{1:i}^m,\sigma^m(k_{1:i}^m))_{1\le m\le N}$ and $(R^m,(C_j^m)_{j\in R^m})_{1\le m\le N}$, denoting them with the same notation. Set $\omega_i^m = 1$ and $M = \{1,2,\dots,N\}$. (Here $N_T$ is the threshold value in the resampling procedure.)
(e) For each $m \in M$,
   i. set $R^m = R^m\setminus\{\sigma^m(k_i^m)\}$.
   ii. set $C_j^m = C_j^m - |a_{k_i^m,j}|$ for $j \in R^m$.
   iii. If the assignment of $\sigma^m(k_i^m)$ completes the current cycle, choose $k_{i+1}^m$ uniformly from $R^m$; else set $k_{i+1}^m = \sigma^m(k_i^m)$.
3. At step $i = n$:
(a) Repeat step 2(a).
(b) For each $m \in M$,
   i. set $\sigma^m(k_i^m)$ to be the only remaining element of $R^m$.
   ii. set $\omega_i^m = \omega_{i-1}^m\times a_{k_i^m,\sigma^m(k_i^m)}\times\alpha$.
(c) Normalize the weights $\bar\omega_i^m = \omega_i^m/\sum_{m=1}^N\omega_i^m$.
Later, [23] approximate the posterior and marginal densities of switching state-space models sequentially in time using the DPF algorithm. More generally, the DPF method has shown good performance in a variety of applications, e.g. [22]. It provides a clever random pruning mechanism to select support points from the whole state-space, so that it can attempt to explore all possible states in the finite state-space; this does not occur in many alternative approaches.
We now consider using the DPF method to compute the α-permanent. For simplicity of presentation, for $k \in \{1,2,\dots,n\}$, set $\sigma_k = \sigma(k)$ and define $V_k$ to be the set of the positions of the non-zero elements in the $k$th row of the matrix $A$; the DPF algorithm for computing the α-permanent is then given in Algorithms 3 and 4. The estimate of the α-permanent is
$$\mathrm{per}_\alpha^N(A) = \prod_{k=1}^{n}\ \sum_{\sigma_{1:k}\in W_k}w_k(\sigma_{1:k})$$
where the definition of $W_k$ is in Algorithm 3.
Algorithm 3 DPF algorithm
1. At time $k = 1$, set $W_1 = V_1$.
(a) For each $\sigma_1 \in W_1$ compute $w_1(\sigma_1) = \alpha^{\Delta\mathrm{cyc}(\sigma(1)=\sigma_1)}a_{1,\sigma_1}$.
(b) Normalize the weights $\bar w_1(\sigma_1) = w_1(\sigma_1)/\sum_{\sigma_1\in W_1}w_1(\sigma_1)$.
2. At times $k = 2,3,\dots,n$:
(a) If $\mathrm{Card}(W_{k-1}) \le N$, set $W'_{k-1} = W_{k-1}$, $C_{k-1} = \infty$ and go to (b). Otherwise, perform the resampling step described in Algorithm 4, which returns $W'_{k-1}$ and $C_{k-1}$.
(b) Set $W_k = \{\sigma_{1:k} : \sigma_{1:k-1} \in W'_{k-1},\ \sigma_k \in V_k\setminus\{\sigma_{1:k-1}\}\}$.
(c) For each $\sigma_{1:k} \in W_k$ compute
$$w_k(\sigma_{1:k}) = \alpha^{\Delta\mathrm{cyc}(\sigma(k)=\sigma_k)}a_{k,\sigma_k}\,\frac{\bar w_{k-1}(\sigma_{1:k-1})}{1\wedge C_{k-1}\bar w_{k-1}(\sigma_{1:k-1})}.$$
(d) Normalize the weights $\bar w_k(\sigma_{1:k}) = w_k(\sigma_{1:k})/\sum_{\sigma_{1:k}\in W_k}w_k(\sigma_{1:k})$.

4 Numerical Results
We now give some numerical illustration of our algorithms. All numerical results are coded in MATLAB.
4.1 Permanents: Toy Example
Consider the following matrix:
$$A = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
The permanent of this graph is 2. We will illustrate some issues associated to the proposed SMC algorithm, the
estimates (3), (4) and some comparison to the simulated annealing algorithms (SA) in [4, 18]. Throughout, the
evolution of the (φp )0≤p≤r is as [3] and the implementation of SA is as described in [18].
We will estimate the relative variance of the estimate of the permanent, using the adaptive SMC algorithm as
well as the SMC algorithm which uses the ideal weights. We will also consider this quantity for the SA algorithm.
We will use 50 repeats of each algorithm to estimate the relative variance of the estimates. The number of particles
for the SMC algorithm is N ∈ {100, 1000, 10000}, and some results are given in Table 1.
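For reference, the toy matrix and the relative-variance figures reported below can be checked with a few lines of Python/NumPy; the brute-force routine is the illustrative one sketched in Section 1, and we assume (as one common convention) that the relative variance over the 50 repeats is the mean squared relative error around the known value per(A) = 2.

import numpy as np

A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 0]])
# alpha_permanent_bruteforce(A, alpha=1.0) from the Section 1 sketch returns 2.0 here

def relative_variance(estimates, truth):
    """Empirical relative variance: mean of (estimate/truth - 1)^2 over repeated runs."""
    est = np.asarray(estimates, dtype=float)
    return np.mean((est / truth - 1.0) ** 2)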
In Table 1 we can see the performance of the proposed SMC algorithms versus the 'perfect' algorithm which uses the ideal weights. At least in this example, there does not appear to be a significant degradation in performance (for either of the estimates (3) and (4)), at a similar computational cost, for the adaptive SMC algorithm.
Algorithm 4 Resampling procedure in Algorithm 3
• Set $C_{k-1}$ to be the unique solution of
$$\sum_{\sigma_{1:k-1}\in W_{k-1}}1\wedge C_{k-1}\bar w_{k-1}(\sigma_{1:k-1}) = N.$$
• Keep the $L_{k-1}$ particles $\sigma_{1:k-1}$ whose weights are greater than $1/C_{k-1}$. For the remaining $\mathrm{Card}(W_{k-1}) - L_{k-1}$ particles perform the following stratified resampling scheme.
• Normalize the weights $\bar w_{k-1}(\sigma_{1:k-1})$ of the remaining $\mathrm{Card}(W_{k-1}) - L_{k-1}$ particles and label them to obtain the normalized weights $\hat w_{k-1}(\sigma_{1:k-1}^i)$, $i \in \{1,\dots,\mathrm{Card}(W_{k-1}) - L_{k-1}\}$.
• Construct the CDF: for $i \in \{1,\dots,\mathrm{Card}(W_{k-1}) - L_{k-1}\}$,
$$Q_{k-1}(i) := \sum_{j=1}^{i}\hat w_{k-1}(\sigma_{1:k-1}^j), \qquad Q_{k-1}(0) := 0.$$
• Sample $U_1$ uniformly on $[0, 1/(N - L_{k-1})]$ and set $U_j = U_1 + (j-1)/(N-L_{k-1})$, $j \in \{2,\dots,N-L_{k-1}\}$.
• For $i \in \{1,\dots,\mathrm{Card}(W_{k-1}) - L_{k-1}\}$, if there exists a $j \in \{1,\dots,N-L_{k-1}\}$ such that $Q_{k-1}(i-1) < U_j \le Q_{k-1}(i)$, then $\sigma_{1:k-1}^i$ and $\bar\omega_{k-1}^i$ survive.
• Set $W'_{k-1}$ to be the set of surviving particles from the resampling together with the $L_{k-1}$ samples that were maintained.
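A sketch of the threshold search and stratified selection in Algorithm 4 is given below (Python/NumPy; it assumes strictly positive normalized weights and more particles than N, and the helper names are ours). The threshold C_{k-1} is found by bisection on the monotone map C -> sum_i min(1, C w̄_i), the high-weight particles are kept deterministically, and the remainder are selected by the stratified rule above.

import numpy as np

def find_threshold_C(wbar, N, iters=200):
    """Solve sum_i min(1, C * wbar_i) = N for C by bisection (wbar: normalized weights, len(wbar) > N)."""
    w = np.asarray(wbar, dtype=float)
    lo, hi = 0.0, 1.0 / w[w > 0].min()          # at hi every term equals 1, so the sum is len(w) >= N
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.minimum(1.0, mid * w).sum() < N:
            lo = mid
        else:
            hi = mid
    return hi

def optimal_resample(wbar, N, rng=np.random.default_rng()):
    """Return the indices surviving the resampling step of Algorithm 4 (kept plus stratified draws)."""
    w = np.asarray(wbar, dtype=float)
    C = find_threshold_C(w, N)
    keep = np.where(w > 1.0 / C)[0]             # the L_{k-1} particles kept deterministically
    rest = np.where(w <= 1.0 / C)[0]
    m = N - len(keep)                           # number of survivors to draw from the remainder
    if m <= 0:
        return keep[:N]
    q = w[rest] / w[rest].sum()                 # re-normalized weights of the remaining particles
    cdf = np.cumsum(q)
    cdf[-1] = 1.0                               # guard against floating-point round-off
    u = rng.uniform(0.0, 1.0 / m) + np.arange(m) / m   # stratified uniforms U_1, ..., U_m
    survivors = rest[np.searchsorted(cdf, u)]   # index i with Q(i-1) < U_j <= Q(i) survives
    return np.concatenate([keep, survivors])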
N | Adaptive SMC | SMC | Adaptive SMC with estimate (4)
100 | 0.1279 (19.42) | 0.0733 (17.89) | 0.3094 (19.98)
1000 | 0.0683 (181.23) | 0.0639 (164.26) | 0.0733 (178.49)
10000 | 0.0587 (1879.92) | 0.0637 (1607.33) | 0.0513 (1883.17)
Table 1: Relative variance of the Adaptive SMC estimates compared with the ideal weights SMC estimates. The value in brackets is the computation time in seconds.
We now consider a comparison with SA and compute the relative variances; the results are shown in Table 2.
The results in Tables 1 and 2 show that, if one considers a computation time of about 1800 seconds, the relative
variance of the adaptive SMC is 0.0587 (estimate (3)), whilst the relative variance of SA is 0.1695. To further
analyze, if we consider the relative variance of about 0.0680, the computation time of the adaptive SMC is 181.23
seconds (estimate (3)), whilst the computation time of the SA is 3674.13 seconds. This suggests, at least for this
example, that the adaptive SMC is out-performing SA with regards to relative variance.
To end this first toy example, we consider what happens to the relative variance of the estimate (3) as the size of
the matrix increases. Clearly, we can only consider n small (or a matrix that is very sparse) if we want to compute
the permanent, so we consider only n ∈ {6, 7, 8} for N ∈ {1000, 2000, 5000}. The results are in Table 3, where we can see the expected trend: for a given n, as N grows the variance falls, and for a given N, as n grows the variance increases.
4.2 Permanents: A Larger Matrix
Now we consider two matrices with n = 18. The first matrix is relatively dense with 175 non-zero entries and the second more sparse with only 50 non-zero entries. All algorithms are run with N = 2000 and 50 repeats. Tables 4 and 5 show the estimates of the permanent (using (3), (4)) along with the variability and wall-clock computation time. The tables show the expected results: for a sparse graph the estimate (3) out-performs (4). The improvement is due to the fact that one does not need to count the number of perfect matchings in the original graph in (3), which is a likely source of variance for the estimate (4). When the graph becomes less sparse, this apparent advantage is not present and (4) performs relatively better. Note that we ran the SA method, but it failed to produce competitive results in the same computational time; the results are hence omitted. We also remark that the approach in [14] has been
Computation time (s) | Relative Variance of SA estimates
854.24 | 0.3560
1698.59 | 0.1695
2189.26 | 0.1524
2785.27 | 0.1134
3674.13 | 0.0695
7824.65 | 0.0513
Table 2: Relative variance of the Simulated Annealing estimates against the computation time.
size | N=1000 | N=2000 | N=5000
6 | 0.4120 | 0.1849 | 0.0431
7 | 0.7391 | 0.1243 | 0.0679
8 | 0.9401 | 0.1121 | 0.0428
Table 3: Relative variance of the Adaptive SMC estimates against the size of the graph. We consider estimate (3).
compared to [16] so we do not run it. We also run the IS method of [16]; the algorithm returns similar estimates
with significant reductions in run time. However, the standard deviation is much larger and it appears that the
adaptive SMC method brings some benefit for the particular matrices investigated, particularly in Table 5, where
the IS seems to be a little unreliable.
Method | Mean | Std | Computation time (s)
Adaptive SMC | 5.6723e+08 | 1.3901e+08 | 42156
Adaptive SMC with estimate (4) | 5.5683e+08 | 1.3748e+08 | 42486
IS | 5.7380e+08 | 9.7000e+08 | 4896
Table 4: Comparison of 20 estimates for n = 18 and 175 non-zero entries. The computation time is the overall time taken and Std is standard deviation.
4.3 α-Permanent: Simulation Results of the SMC Method
In order to investigate the performance of the SMC algorithm in Section 3.1, especially in comparison to the IS approach in [16], we apply the IS and our SMC algorithm to several different matrices and values of α. Our simulations contain two parts: the first part is the comparison of these two methods for six different-sized matrices; the second part emphasizes the performance on a series of 100×100 matrices with different numbers of non-zero entries. The purpose of these examples will be explained in the subsequent discussion. Note that all estimation is done over multiple runs, for example when estimating the relative variance of an estimate.
It is remarked that for the results in the next section, our adaptive SMC algorithm was also run. The results are similar to the previous study, in that there are reductions in standard deviations (of similar order), with a similar increase in computational time; the results are not provided for brevity.
4.3.1 Six Matrices
In the first part of our simulation, we will test the matrices $A_1$-$A_4$, $K(x)^{Tr}_{100}$ and $K(x)_{100}$, which are used in the simulation studies of [16]. The details of these six matrices are listed in [16]. From [16], we see that both $A_1$ and $A_2$ are 20×20 matrices and α = 1. The main difference is that the entries of $A_1$ are all positive but $A_2$ has zero entries. $A_3$ and $A_4$ are 15×15 matrices and α = 1/2. Also, $A_3$ only has positive entries but $A_4$ has some zero entries. [16] state that the reason for choosing $A_1$-$A_4$ is that matrix size 20 is the threshold for computing the exact value of the α-permanent under α = 1 and size 15 is the threshold for computing the exact value of the α-permanent under α = 1/2. As to the other two matrices, $K_{100}$ and $K^{Tr}_{100}$, they are both 100×100 matrices; $K_{100}$ only has
Method | Mean | Std | Computation time (s)
Adaptive SMC | 3.0122 | 11.3899 | 45672
Adaptive SMC with estimate (4) | 3.1001 | 11.7158 | 45756
IS | 3.9902 | 66.9716 | 5002
Table 5: Comparison of 20 estimates for n = 18 and 50 non-zero entries. The computation time is the overall time taken and Std is standard deviation.
positive entries and $K^{Tr}_{100}$ has a few non-zero entries, thus $K^{Tr}_{100}$ is a sparse matrix. We also consider α = 1/2 for these two matrices as in [16].
For the first five matrices, the exact value of the α-permanent is known and given in Table 1 of [16]; therefore we can estimate the relative variance. We set N = 10000 and run each algorithm 50 times to obtain 50 estimates for both the IS method and the SMC method. The threshold value in the SMC method is $N_T$ = 5000 for matrices $A_1$-$A_4$ and $N_T$ = 2000 for matrices $K^{Tr}_{100}$ and $K_{100}$. In addition, we also compute the ESS and the number of unique particles (UN) at every step of the SMC method and at the last step of the IS method. The results for the above six matrices are given in Figures 1-2 as well as Table 6.
Figure 1 displays the simulation results for the first matrix $A_1$ (the results for $A_2$-$A_4$ are similar); we can observe from this figure that the results of the SMC method and the IS method are almost the same (see also Table 6). The means and the variances of these two estimates are very close and all of the relative variances are quite small. We remark that the ESS values of the SMC method show that the SMC method seldom resamples, which means that the SMC algorithm for these four matrices is essentially the same as the IS algorithm. Also, the UN and the ESS values of these two methods are quite large, so these two estimates should be very effective.
In Figure 2, for the matrix $K_{100}$ (the results for $K^{Tr}_{100}$ are similar), the ESS values in the figure indicate that the IS method has the weight degeneracy problem here and the SMC helps with this issue. However, the UN values of the SMC method turn out to be low eventually, so that those samples may not be representative of the target density and the estimate would not be ideal. This is also reflected in Table 6: for this matrix, not only is the variance (and the computation time) of the SMC estimate about 10 times (and 100 times) larger than the variance (and the computation time) of the IS estimate, but also the mean of the SMC estimates is somewhat far from the mean of the IS estimates. This indeed shows that, for these problems, the path degeneracy effect in an SMC method influences the performance of the SMC method in an adverse manner, whilst the performance of the IS method for this matrix is almost similar to that for the previous matrix $K^{Tr}_{100}$, which is quite good. These results mean that, also for matrix $K_{100}$, the IS method outperforms the SMC method.
On the basis of the above results for matrices $A_1$-$A_4$, $K^{Tr}_{100}$ and $K_{100}$, we are led to conclude that the IS method provides an excellent proposal density, whereas the SMC method suffers from the path degeneracy problem, which is more troublesome, at least for these examples, than the weight degeneracy problem.
4.3.2 Different Number of Non-zero Entries
As we mentioned above, the performance of the SMC method is not as good as we expected. In order to further
explore the properties of the SMC method, we consider one more collection of simulations. This will emphasize
the performance on a series of 100×100 matrices with different number of non-zero entries. These matrices are
randomly generated following the rule:
$$A_n^p = (a_{ij}^p), \qquad a_{ij}^p = \begin{cases} 0 & \text{with probability } p \\ 1 & \text{with probability } (1-p)/2 \\ 2 & \text{with probability } (1-p)/2 \end{cases} \qquad (13)$$
where $p$ indicates the degree of sparseness of the matrix $A$.
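A short sketch of this generation rule in Python/NumPy (the function name is ours):

import numpy as np

def random_sparse_matrix(n, p, rng=np.random.default_rng()):
    """Entries drawn i.i.d. as in (13): 0 with probability p, 1 and 2 each with probability (1-p)/2."""
    return rng.choice([0, 1, 2], size=(n, n), p=[p, (1 - p) / 2, (1 - p) / 2])

# e.g. the matrices of this subsection: n = 100, p in {0.1, 0.3, 0.5, 0.7, 0.9}
A = random_sparse_matrix(100, 0.5)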
Following the above rule, we generate five matrices with n = 100 and p ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. These matrices
are all 100×100 sized but with decreasing number of non-zero entries. In addition, we consider approximating the
α-permanent of these matrices with the same α = 1/2. Under the above set-ups, we will be able to investigate
whether the sparseness of the matrix is the cause of the inferior results of the SMC method. For both the SMC
method and the IS method, we set N = 10000, NT = 2000 and run the algorithm 50 times to obtain 50 estimates.
The results are given in Table 7 .
Table 7 provides us with very intuitive results. Given the premise that all of those matrices are 100×100 sized,
as p gets larger, from 0.1 to 0.9, the difference of the mean, the standard deviation and the computation time
 | SMC estimate | IS estimate
per1(A1): Mean ± Std (×10^32) | 9.7850 ± 0.0175 | 9.7840 ± 0.0168
Relative Variance | 3.1411×10^-6 | 2.9062×10^-6
Computation time (s) | 798.5538 | 712.4889
per1(A2): Mean ± Std (×10^32) | 3.5119 ± 0.0104 | 3.5124 ± 0.0106
Relative Variance | 1.7810×10^-5 | 8.9920×10^-6
Computation time (s) | 947.9858 | 709.1709
per1/2(A3): Mean ± Std (×10^22) | 1.4399 ± 0.0047 | 1.4390 ± 0.0046
Relative Variance | 1.0676×10^-5 | 1.0072×10^-5
Computation time (s) | 537.4382 | 482.2118
per1/2(A4): Mean ± Std (×10^21) | 7.0280 ± 0.0327 | 7.0328 ± 0.0344
Relative Variance | 2.1990×10^-5 | 2.3420×10^-5
Computation time (s) | 525.5042 | 480.5708
per1/2(K(x)^Tr_100): Mean ± Std (×10^-16) | 1.9314 ± 0.2437 | 1.8902 ± 0.1209
Relative Variance | 0.0161 | 0.0040
Computation time (s) | 1.7287×10^5 | 3.9505×10^3
per1/2(K(x)_100): Mean ± Std (×10^111) | 1.2609 ± 0.9960 | 2.6982 ± 0.0869
Relative Variance | 0.0161 | 0.0040
Computation time (s) | 6.0522×10^5 | 5.7963×10^3
Table 6: Results for comparison of SMC to IS. Both algorithms were run 50 times, each with 10000 samples.
[Figure 1 here: plot of ESS and UN (y-axis, 0-10000) against Iteration No. (x-axis, 0-20) for matrix A1, with curves for the ESS of SMC, the ESS of IS and the UN of SMC.]
Figure 1: Simulation results for matrix A1. The displayed estimate is the mean ± std of M estimates. In the figure, the blue dash-dot line with stars is the ESS of the SMC method; the green circles are the ESS of the IS method; the red dash-dot line with pluses is the UN of the SMC method.
[Figure 2 here: plot of ESS and UN (y-axis, 0-10000) against Iteration No. (x-axis, 0-100) for matrix K100, with curves for the ESS of SMC, the ESS of IS and the UN of SMC.]
Figure 2: Simulation results for matrix K100. The displayed estimate is the mean ± std of M estimates. In the figure, the blue dash-dot line with stars is the ESS of the SMC method; the green circles are the ESS of the IS method; the red dash-dot line with pluses is the UN of the SMC method.
n = 100, α = 1/2 | | SMC estimate | IS estimate
p = 0.1 | M±S (×10^159) | 5.2880 ± 0.0330 | 5.2793 ± 0.0372
 | CT (s) | 9.9450×10^3 | 8.5028×10^3
p = 0.3 | M±S (×10^158) | 4.0591 ± 0.0027 | 4.0565 ± 0.0028
 | CT (s) | 9.4052×10^3 | 8.3776×10^3
p = 0.5 | M±S (×10^143) | 3.1825 ± 0.0581 | 3.1703 ± 0.0418
 | CT (s) | 1.2538×10^4 | 8.5339×10^3
p = 0.7 | M±S (×10^121) | 2.9539 ± 0.2914 | 2.9636 ± 0.1020
 | CT (s) | 3.2236×10^4 | 7.9802×10^3
p = 0.9 | M±S (×10^71) | 0.9218 ± 1.0568 | 1.3510 ± 2.0268
 | CT (s) | 2.3002×10^4 | 1.4866×10^4
Table 7: Estimated α-permanent for several matrices with n = 100, p ∈ {0.1, 0.3, 0.5, 0.7, 0.9} (see (13)) and α = 1/2. The displayed M±S represents the mean±std of 50 estimates and CT represents the total computation time.
between the SMC estimate and the IS estimate becomes more significant. To be more specific, the means for these five matrices are quite close, except when p = 0.9, where they differ significantly. Also, the standard deviation is generally larger for the SMC method compared with the IS method. Moreover, the computation time of the IS method is relatively smaller in comparison to the SMC method as p becomes bigger. On the other hand, in results not displayed, as p becomes larger the ESS value of the IS method becomes smaller and the SMC method resamples more often; most importantly, the UN values of the SMC method at the last several steps become lower and almost 1 when p = 0.9. As we said in the previous subsection, small UN values may lead to ineffective estimates; this is seen clearly in the table. The above simulations tell us that for this example, as p becomes larger, namely as the degree of sparseness becomes higher, the SMC method is no longer effective in approximating the α-permanent; the sparseness appears to exacerbate the path degeneracy issue.
4.4 α-Permanent: Simulation Results of the DPF Method

4.4.1 Six Matrices
Now we apply the DPF method to the matrices $A_1$-$A_4$, $K^{Tr}_{100}$ and $K_{100}$, which were also used in Section 4.3.1, to see how the DPF algorithm works on these matrices. We will compare its performance with the previous results of the IS method. We set N ∈ {1500, 5000, 10000} and repeat the DPF algorithm 50 times to obtain 50 estimates for each value of N and each matrix. Similarly to the simulation of the SMC method, we will obtain the mean and the standard deviation of these 50 estimates, as well as the relative variance if the exact value of the α-permanent is known. Also, the total wall-clock computation time for obtaining these 50 estimates will be recorded. The results are listed in Table 8.
 | DPF, N=1500 | DPF, N=5000 | DPF, N=10000 | IS, N=10000
per1(A1) (×10^32) | 9.7528 ± 0.2599 | 9.7970 ± 0.1632 | 9.7763 ± 0.0815 | 9.7840 ± 0.0168
Relative variance | 7.0163×10^-4 | 2.7459×10^-4 | 6.8506×10^-5 | 2.9062×10^-6
Computation time (s) | 704 | 6237 | 23832 | 712
per1(A2) (×10^32) | 3.4824 ± 0.1391 | 3.5022 ± 0.0702 | 3.5125 ± 0.0635 | 3.5124 ± 0.0106
Relative variance | 0.0016 | 4.0165×10^-4 | 3.1984×10^-4 | 8.9920×10^-6
Computation time (s) | 660 | 5976 | 22643 | 709
per1/2(A3) (×10^22) | 1.4384 ± 0.0343 | 1.4397 ± 0.0240 | 1.4375 ± 0.0153 | 1.4390 ± 0.0046
Relative variance | 5.5720×10^-4 | 2.4039×10^-4 | 1.1197×10^-4 | 1.0072×10^-5
Computation time (s) | 360 | 2783 | 10611 | 482
per1/2(A4) (×10^21) | 7.0338 ± 0.0179 | 7.0275 ± 0.0991 | 7.0389 ± 0.0569 | 7.0328 ± 0.0344
Relative variance | 6.3278×10^-4 | 1.9543×10^-4 | 6.4540×10^-5 | 2.3420×10^-5
Computation time (s) | 335 | 2460 | 9741 | 481
per1/2(K^Tr_100) (×10^-16) | 1.9214 ± 0.2108 | 1.9023 ± 0.1173 | 1.8973 ± 0.0845 | 1.8902 ± 0.1209
Relative variance | 0.0120 | 0.0037 | 0.0020 | 0.0040
Computation time (s) | 2170 | 23311 | 94320 | 3950
Table 8: Estimated α-permanent for matrices A1-A4 and K^Tr_100 using the DPF method. The first three columns of results correspond to N ∈ {1500, 5000, 10000} and the final column is for the method of [16] run with N = 10000. All results are repeated 50 times.
To start the analysis of the results in Table 8, we would like to give a short statement about matrix $K_{100}$. We attempted to apply the DPF algorithm to this matrix, but the value of N required to obtain a meaningful estimate is so large that the computational cost is too large for a comparison. Thus the simulation results for this matrix are treated as unavailable; in effect, the DPF method is not good at approximating the α-permanent of matrix $K_{100}$. On the other hand, the most distinctive feature of $K_{100}$ compared with the other five matrices is that the cardinality of the state space of the associated target in SMC is very large, equal to 100!, so the above phenomenon indicates that the DPF algorithm may not be suitable if the state space has too many elements.
We now return to Table 8, which displays useful information on the estimates for matrices $A_1$-$A_4$ and $K^{Tr}_{100}$. Basically, we have two main deductions from this table. The first is about the DPF itself: the performance of the DPF algorithm for each matrix is quite good when N = 1500 and the respective computation time is low; the results of the DPF method improve for each matrix from N = 1500 to N = 10000, at the cost that the computation time increases significantly. For example, for matrix $A_1$, the relative variance decreases from 7.0163×10^-4 to 6.8506×10^-5 but the computation time increases from 704 seconds to 23832 seconds. The second deduction concerns the comparison between the DPF method and the IS method. From the table, we can deduce that, for almost the same computation time, the relative variance of the IS method would be slightly smaller than that of the DPF method; equivalently, for the relative variances of the DPF and IS methods to be almost equal, the computation time of the DPF method would have to be much larger than that of the IS method.
Although the performance of the DPF method is worse than that of the IS method for all of these five matrices, the inferiority of the DPF method for matrix $K^{Tr}_{100}$ is much smaller than for matrices $A_1$-$A_4$. For example, for matrix $A_1$, the DPF method needs to spend at least 33 times (23832 seconds) more computational cost than the IS method (712 seconds) to achieve the same relative variance (2.9062×10^-6); whilst for matrix $K^{Tr}_{100}$, the DPF method only needs to spend at most 6 times (23311 seconds) more computational cost than the IS method (3950 seconds) to achieve the same relative variance (0.0040). In some sense the DPF method works much better for the matrix $K^{Tr}_{100}$ than for matrices $A_1$-$A_4$. For the purpose of finding beneficial cases for the DPF method, we note that one obvious feature of matrix $K^{Tr}_{100}$ is that its degree of sparseness is very high: it has about 70 percent zero entries. Thus, from the above study, one point we can conclude is that the DPF method can deal with small or moderate state-space cases with good performance, but it may not do well in large state-space cases; this is confirmed in further simulations in [21].
Summary
In this numerical study we might conclude the following, which seems to be consistent with the published literature:
1. For the permanent, MCMC-type methods (including SMC samplers) seem to be more accurate than IS, but in general, the computational cost of the latter seems to be less for a given N.
2. For the α-permanent, the method of [16] seems to be quite robust, even in the presence of weight degeneracy. As noted by [16], the method is designed for symmetric and positive definite matrices and may not always work well. In all of our numerical studies (including results in [21]), it seemed that the effect of weight degeneracy was not substantial enough to cause a degradation of the estimate of the permanent, relative to SMC methods, other than our adaptive SMC method. We suspect that, for this problem, the relative variance of the IS method does not have an exponential dependence on the time parameter, which is why standard SMC is not doing much better in general. It also seems that, for standard SMC, the introduction of resampling and the path degeneracy effect is more significant than weight degeneracy, which is another reason why IS seems to do better than standard SMC.
Our main recommendation is that if the method of [16] seems to work well (e.g. robust results from multiple runs)
then this seems to be adequate for both the permanent and α−permanent. If this method fails, one needs to pay
the price and we recommend our approach in Section 2.3 which will be computationally expensive, but reduces the
variance in estimation (see e.g. Table 5); it can also be used for the α−permanent.
5 Bayesian Estimation
In this section, we consider using the estimation procedures above to perform statistical inference associated to a boson point process. Let $S \subset \mathbb{R}^d$ be a compact set and suppose we have a boson point process $X$ on $S$.
If $x = (x_1, \ldots, x_n)$ is a finite point configuration of X which is contained in S, we are interested in the posterior
density of a parameter θ ∈ Θ associated to the process (to be specified later):
$$\pi(\theta|x) \propto f_\theta(x)\,\pi(\theta)$$
where $f_\theta(x)$ is the marginal density of x and π(θ) is a prior density for θ.
5.1 Marginal Density of the Boson Process
In [8], the boson point process is introduced as a special case of Poisson process. Given that $\Lambda_\theta \equiv (\Lambda_\theta(x))_{x\in S}$ is
the non-negative random intensity function of X, which also satisfies $\int_S \Lambda_\theta(x)dx < \infty$ almost surely, the marginal
density of x is ($E_{\Lambda_\theta}$ is the expectation w.r.t. $\Lambda_\theta$):
$$f_\theta(x) = E_{\Lambda_\theta}\Big[\exp\Big\{|S| - \int_S \Lambda_\theta(x)dx\Big\}\prod_{j=1}^n \Lambda_\theta(x_j)\Big] \qquad (14)$$
The above marginal density is not available in closed form.
Later, [19] adopted a class of flexible Cox process models to form the above intensity function $\Lambda_\theta$. They state
that for any integer k > 0 and real covariance function $C_\theta(x, x')$, the intensity function is equal to the sum of the squares of k
independent, zero-mean Gaussian processes with covariance function $C_\theta/2$ (Equation (2) of [19]). Therefore, [19] also
call the Poisson process X the permanent process with parameters α = k/2 and $C_\theta$. Further, they prove that
the marginal density of x in equation (14) under the permanent process is equivalent to:
$$f_\theta(x) = e^{|S| - \alpha D(\theta)}\,\mathrm{per}_\alpha\big(\tilde{C}_\theta(x)\big) \qquad (15)$$
where $D(\theta) = \sum_{r=0}^{\infty}\log(1+\lambda_r)$ with $\lambda_0 \geq \lambda_1 \geq \cdots \geq 0$ the eigenvalues of the corresponding covariance
function $C_\theta(x)$; $\tilde{C}_\theta(x)$ is another covariance function related to $C_\theta(x)$, see more details in [19]. Usually, $D(\theta)$ is not
computable but can be approximated by
$$D^n(\theta) = \sum_{r=0}^{n}\log\big(1+\lambda_r^{C_\theta}\big)$$
where $\lambda_r^{C_\theta}$ are the ordered eigenvalues of $C_\theta$. Therefore, the problem of estimating the marginal density of a Poisson
process is replaced by approximating the $\alpha$-permanent $\mathrm{per}_\alpha(\tilde{C}_\theta(x))$ in equation (15), which has been the subject
of this article.
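For intuition, the following is a minimal sketch of this truncated eigenvalue sum, under the assumption (made only for illustration) that $C_\theta$ is represented by the $n \times n$ matrix obtained by evaluating the covariance function at the observed points; the function name is illustrative and not from [19].

```python
import numpy as np

def D_n(C):
    """Approximate D(theta) by the truncated sum of log(1 + lambda_r), where the
    lambda_r are the eigenvalues of the covariance matrix C (assumed here to be
    the covariance function evaluated at the n observed points)."""
    eigvals = np.linalg.eigvalsh(C)        # real eigenvalues of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)  # guard against small negative round-off
    return float(np.sum(np.log1p(eigvals)))
```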
Finally, according to [19], a boson point process is a permanent process with parameter α = 1 and Cθ . Overall,
given a known parameter θ, the marginal density fθ (x) of a boson process can be approximated by:
$$e^{|S| - \alpha D^n(\theta)}\,\mathrm{per}_1^N\big(\tilde{C}_\theta(x)\big)$$
where $\mathrm{per}_1^N(\tilde{C}_\theta(x))$ is the IS estimate (of [16]) of the α-permanent, using N particles in the IS algorithm.
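Putting the pieces together, a hedged sketch of the log of this approximation is given below; `is_log_permanent` is a hypothetical stand-in for the log of the IS estimate of [16], and `cov` / `cov_tilde` are assumed (for illustration) to return the matrices $C_\theta$ and $\tilde{C}_\theta(x)$ evaluated at the data.

```python
import numpy as np

def log_fhat(theta, x, S_size, alpha, cov, cov_tilde, is_log_permanent):
    """Log of the approximate marginal density of the boson/permanent process:
    |S| - alpha * D^n(theta) + log per_1^N( C~_theta(x) )."""
    lam = np.clip(np.linalg.eigvalsh(cov(x, theta)), 0.0, None)
    D_n = np.sum(np.log1p(lam))                     # truncated eigenvalue sum
    return S_size - alpha * D_n + is_log_permanent(cov_tilde(x, theta))
```

In the pseudo-marginal sketch given after Algorithm 5 below, a function of this form plays the role of the (log) unbiased likelihood estimate.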
5.2 Pseudo Marginal MCMC
Now we consider a Bayesian parameter estimation problem: we want to sample from the posterior
density π(θ|x). We will use the pseudo-marginal algorithm, which samples from a density proportional to
$e^{|S|-\alpha D^n(\theta)}\,\mathrm{per}_1^N(\tilde{C}_\theta(x))\,\pi(\theta)$; if $D(\theta)$ were known exactly, this would provide exact inference from π(θ|x), which follows
from the unbiasedness of the estimate of $f_\theta(x)$. The pseudo-marginal algorithm is given in Algorithm 5. Note that the
ergodicity of the Markov chain generated by this algorithm is easily obtained by using Theorem 9 of [1].
Algorithm 5 Pseudo-Marginal algorithm
1. Initialization, i = 0:
   (a) set θ(0) arbitrarily.
   (b) run the IS algorithm of [16] to obtain the estimate of the α-permanent, denoted by $\mathrm{per}_1^N(\tilde{C}_{\theta(0)}(x))$.
2. For iteration i ≥ 1:
   (a) sample $\theta^\star$ from a proposal kernel.
   (b) run the IS algorithm of [16] to obtain the estimate of the α-permanent, denoted by $\mathrm{per}_1^N(\tilde{C}_{\theta^\star}(x))$.
   (c) with probability
   $$1 \wedge \frac{e^{|S|-\alpha D^n(\theta^\star)}\,\mathrm{per}_1^N\big(\tilde{C}_{\theta^\star}(x)\big)\,\pi(\theta^\star)}{e^{|S|-\alpha D^n(\theta(i-1))}\,\mathrm{per}_1^N\big(\tilde{C}_{\theta(i-1)}(x)\big)\,\pi(\theta(i-1))}$$
   set $\theta(i) = \theta^\star$; otherwise set $\theta(i) = \theta(i-1)$.
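To make the structure of Algorithm 5 concrete, here is a minimal sketch (not the authors' implementation) of a pseudo-marginal Metropolis-Hastings loop in Python. The argument `log_fhat` stands in for the log of the estimate $|S|-\alpha D^n(\theta)+\log\mathrm{per}_1^N(\tilde{C}_\theta(x))$ (e.g. a function of the form sketched above, which in practice would call the IS estimator of [16]); the proposal standard deviation, starting point and placeholder likelihood at the bottom are purely illustrative.

```python
import numpy as np

def pseudo_marginal_mh(log_fhat, log_prior, theta0, n_iter=20000, prop_sd=0.5, seed=1):
    """Pseudo-marginal MH with a Gaussian random-walk proposal.

    log_fhat(theta) returns the log of a non-negative unbiased estimate of
    f_theta(x); it is evaluated afresh only at proposed states, while the value
    attached to the current state is recycled, as the pseudo-marginal scheme requires.
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    log_post = log_fhat(theta) + log_prior(theta)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        theta_star = theta + prop_sd * rng.standard_normal()
        log_post_star = log_fhat(theta_star) + log_prior(theta_star)
        if np.log(rng.uniform()) < log_post_star - log_post:  # accept w.p. 1 /\ ratio
            theta, log_post = theta_star, log_post_star
        chain[i] = theta
    return chain

# Illustrative stand-ins only (a placeholder likelihood, not the estimator of [16]):
log_prior = lambda t: (-t / 10.0 - np.log(10.0)) if t > 0 else -np.inf  # Exp(mean 10)
log_fhat_demo = lambda t: -0.5 * (t - 1.0) ** 2 if t > 0 else -np.inf
samples = pseudo_marginal_mh(log_fhat_demo, log_prior, theta0=5.0)
```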
5.3 Simulation Results
5.3.1 Experimental Set-ups
In order to investigate the above pseudo marginal algorithm, with regards to making inference on θ, we consider
the following experiment.
• Covariance function: consider the continuous covariance function $\tilde{C}_\theta(x, x') = \exp\{-\theta(x-x')^2\}$ with θ = 1 (a small construction sketch is given after this list).
• Data: data-sets of size 100, 200, 500 and 2000 are drawn from the process (see [19]).
• Prior density: exponential density with mean 10.
• Proposal kernel: given the current state θ, propose $\theta^\star$ from a Gaussian centered at the current state with variance $\sigma^2$.
• For each data-set, we retain every 5th sample from each of 10 Markov chains, giving 20000 samples in total.
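As referenced in the first bullet above, the following is a small sketch of how the matrix $\tilde{C}_\theta(x)$ could be formed in this set-up; the uniformly drawn points and the interval $[0,1]$ are illustrative assumptions (in the experiments the data are drawn from the process itself, as in [19]).

```python
import numpy as np

def cov_matrix(points, theta=1.0):
    """Squared-exponential covariance C~_theta(x, x') = exp(-theta * (x - x')^2),
    evaluated at all pairs of observed (one-dimensional) points."""
    diff = points[:, None] - points[None, :]
    return np.exp(-theta * diff ** 2)

# e.g. a 100-point configuration and the matrix whose alpha-permanent is estimated:
x = np.sort(np.random.default_rng(0).uniform(0.0, 1.0, size=100))
C_tilde = cov_matrix(x, theta=1.0)
```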
5.3.2 Analysis of Samples
The algorithm appears to converge quite quickly in each case, so we do not include the results. The posterior
samples from each of the data-sets can be found in Figure 3. The results are quite reassuring; the true value of θ
is 1 and it seems that (at least under our set-up) one can recover the true value. In other experiments not
shown, if the prior is reasonably vague (that is, its variance is not small in some sense), we did not find significant
differences in the posterior. Thus, it seems that this model can be useful in real data scenarios.
19
0.7
density
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.6
0.5
0.4
density
0.3
0.2
0.1
0.0
0
20
40
60
80
0
20
θ
60
80
θ
0.6
0.5
0.4
density
0.0
0.1
0.2
0.3
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
(b) Histogram, 200 data-points
0.7
(a) Histogram, 100 data-points
density
40
0
20
40
60
80
0
20
θ
40
60
80
θ
(c) Histogram, 500 data-points
(d) Histogram, 2000 data-points
Figure 3: Posterior histograms for different data sets.
6 Summary
We have considered a number of issues in this article. Firstly, we have introduced a new adaptive SMC algorithm
for approximating permanents of n × n binary matrices and established the convergence of the estimate. We have
also provided a lower-bound on the cost in n to achieve an arbitrarily small relative variance of the estimate of
the permanent; this was $O(n^4\log^4(n))$. Secondly, we have considered existing SMC methods for approximating the
α−permanent. On the basis of our numerical results, the approach of [16] seems to work well (although
not always) and one might use the new method suggested here in scenarios where the algorithm of [16] does not work well.
There are several directions for future work. The most pressing is a direct non-asymptotic analysis of the new
SMC algorithm which is actually implemented. As noted numerous times, the mathematical analysis of adaptive
SMC algorithms is in its infancy and so we expect this aforementioned problem to be particularly demanding. In
particular, one must analyze the MCMC kernels when one is using SMC approximations of the target densities,
which is a non-trivial task. One could also consider a complexity analysis of [16] as was done in [5] for the approach
of [7]. A consideration of the approaches in this article for real-valued (i.e. possibly negative entries) matrices may
also be of interest.
Acknowledgements
The second author was supported by an MOE Singapore grant R-155-000-119-133 and is affiliated with the Risk
Management Institute and Centre for Quantitative Finance at NUS. We thank two referees for very useful comments
that have vastly improved the paper. We also thank the editor, Prof. Mark Girolami for his assistance on the article.
A Technical Results for Section 2
We will use the Feynman-Kac notations established in Section 2.5 and the reader should be familiar with that
Section to proceed. Recall, from Section 2.3, for $0 \leq p \leq r-1$
$$G_{p,N}(M) = \frac{\Phi_{p+1}^N(M)}{\Phi_p^N(M)}$$
and recall that $\Phi_0^N(M)$ is deterministic and known. In addition, for $(u,v) \in U \times V$
$$w_p^N(u,v) = \frac{\delta + \eta_{p-1}^N\big(I_{\mathcal{M}(u,v)}\,\phi_{p+1}/\phi_p\big)}{\delta + \big[\eta_{p-1}^N\big(I_{\mathcal{M}(u,v)}\,\phi_{p+1}/\phi_p\big)\big]\big/ w_{p-1}^N(u,v)}$$
where for ϕ ∈ Bb (M), 0 ≤ p ≤ r
$$\eta_p^N(\varphi) = \frac{1}{N}\sum_{i=1}^N \varphi(M_p^i)$$
is the SMC approximation of ηp (recall that one will resample at every time-point, in this analysis). By a simple
inductive argument, it follows that one can find a 0 < c(n) < ∞ such that for any 0 ≤ p ≤ r, N ≥ 1, (u, v) ∈ U × V
$$c(n) \leq w_p^N(u,v) \leq \frac{\delta+1}{\delta}.$$
Using the above formulation, for any $N \geq 1$
$$\sup_{M\in\mathsf{M}}|G_{p,N}(M)| \leq 1 \vee \Big\{\frac{\delta+1}{\delta c(n)}\Big\} \qquad (16)$$
which will be used later. Note that
$$\gamma_r^N(1) = \prod_{p=0}^{r-1}\eta_p^N(G_{p,N}) = \prod_{p=0}^{r-1}\Big\{\frac{1}{N}\sum_{i=1}^N G_{p,N}(M_p^i)\Big\}.$$
Given $G_{p-1,N}$, define $Q_{p,N}(M,M') = G_{p-1,N}(M)K_{p,N}(M,M')$ ($K_{p,N}$ is the MCMC kernel in [18] with invariant
measure proportional to $\Phi_p^N$) and let $G_{p-1}$, $Q_p$ denote the limiting versions (that is, on replacing $\eta_p^N$ with $\eta_p$ and
so forth). Recall the definition of $\gamma_t(1)$ in (5), which uses the limiting versions of $G_{p-1}$ and $K_p$.
Proof of Theorem 2.1. We start with the following decomposition
$$\gamma_r^N(1) - \gamma_r(1) = \Big\{\prod_{p=0}^{r-1}\eta_p^N(G_{p,N}) - \prod_{p=0}^{r-1}\eta_p^N(G_p)\Big\} + \Big\{\prod_{p=0}^{r-1}\eta_p^N(G_p) - \prod_{p=0}^{r-1}\eta_p(G_p)\Big\}$$
where one can show that $\gamma_r(1) = \prod_{p=0}^{r-1}\eta_p(G_p)$; see [9]. By Theorem A.1, the second term on the R.H.S. goes to
zero. Hence we will focus on $\prod_{p=0}^{r-1}\eta_p^N(G_{p,N}) - \prod_{p=0}^{r-1}\eta_p^N(G_p)$.
We have the following collapsing sum representation
$$\prod_{p=0}^{r-1}\eta_p^N(G_{p,N}) - \prod_{p=0}^{r-1}\eta_p^N(G_p) = \sum_{q=0}^{r-1}\Big[\prod_{s=0}^{q-1}\eta_s^N(G_s)\Big]\Big[\eta_q^N(G_{q,N}) - \eta_q^N(G_q)\Big]\Big[\prod_{s=q+1}^{r-1}\eta_s^N(G_{s,N})\Big]$$
where we are using the convention $\prod_{\emptyset} = 1$. We can consider each summand separately. By Theorem A.1,
$\prod_{s=0}^{q-1}\eta_s^N(G_s)$ will converge in probability to a constant. By the proof of Theorem A.1 (see (20)), $\eta_q^N(G_{q,N}) - \eta_q^N(G_q)$
converges to zero in probability and $\prod_{s=q+1}^{r-1}\eta_s^N(G_{s,N})$ converges in probability to a constant; this completes the
proof of the theorem.
E will be used to denote expectation w.r.t. the probability associated to the SMC algorithm.
Theorem A.1. For any $0 \leq p \leq r-1$, $(\varphi_0,\ldots,\varphi_p) \in B_b(\mathsf{M})^{p+1}$ and $((u_1,v_1),\ldots,(u_{p+1},v_{p+1})) \in (U\times V)^{p+1}$, we
have
$$\big(\eta_0^N(\varphi_0),\, w_1^N(u_1,v_1),\, \ldots,\, \eta_p^N(\varphi_p),\, w_{p+1}^N(u_{p+1},v_{p+1})\big) \rightarrow_{\mathbb{P}} \big(\eta_0(\varphi_0),\, w_1^*(u_1,v_1),\, \ldots,\, \eta_p(\varphi_p),\, w_{p+1}^*(u_{p+1},v_{p+1})\big).$$
Proof. Our proof proceeds via strong induction. For p = 0, by the WLLN for i.i.d. random variables, $\eta_0^N(\varphi_0) \rightarrow_{\mathbb{P}} \eta_0(\varphi_0)$.
Then by the continuous mapping theorem, it clearly follows that for any fixed $(u_1,v_1)$, $w_1^N(u_1,v_1) \rightarrow_{\mathbb{P}} w_1^*(u_1,v_1)$,
and indeed that for any $M_0 \in \mathsf{M}$, $G_{0,N}(M_0) \rightarrow_{\mathbb{P}} G_0(M_0)$, which will be used later on. Thus, the proof of the
initialization follows easily.
Now assume the result for p − 1 and consider the proof at rank p. We have that
$$\eta_p^N(\varphi_p) - \eta_p(\varphi_p) = \eta_p^N(\varphi_p) - E[\eta_p^N(\varphi_p)|\mathcal{F}_{p-1}] + E[\eta_p^N(\varphi_p)|\mathcal{F}_{p-1}] - \eta_p(\varphi_p) \qquad (17)$$
where $\mathcal{F}_{p-1}$ is the filtration generated by the particle system up to time p − 1. We focus on the second term on the
R.H.S., which can be written as:
$$E[\eta_p^N(\varphi_p)|\mathcal{F}_{p-1}] - \eta_p(\varphi_p) = \frac{\eta_{p-1}^N(Q_p(\varphi_p)) - \eta_{p-1}(Q_p(\varphi_p))}{\eta_{p-1}(G_{p-1})} + \eta_{p-1}^N(Q_p(\varphi_p))\Big\{\frac{1}{\eta_{p-1}^N(G_{p-1,N})} - \frac{1}{\eta_{p-1}(G_{p-1})}\Big\} + \frac{\eta_{p-1}^N[\{Q_{p,N}-Q_p\}(\varphi_p)]}{\eta_{p-1}^N(G_{p-1,N})}. \qquad (18)$$
By the induction hypothesis, as Qp (ϕp ) ∈ Bb (M), the first term on the R.H.S. of (18) converges in probability to
zero. To proceed, we will consider the remaining two terms on the R.H.S. of (18) in turn, starting with the second.
Second Term on R.H.S. of (18). Consider
$$E[|\eta_{p-1}^N(G_{p-1,N}) - \eta_{p-1}(G_{p-1})|] = E[|\eta_{p-1}^N(G_{p-1,N} - G_{p-1}) + \eta_{p-1}^N(G_{p-1}) - \eta_{p-1}(G_{p-1})|] \leq E[|\eta_{p-1}^N(G_{p-1,N} - G_{p-1})|] + E[|\eta_{p-1}^N(G_{p-1}) - \eta_{p-1}(G_{p-1})|].$$
For the second term on the R.H.S. of the inequality, by the induction hypothesis $|\eta_{p-1}^N(G_{p-1}) - \eta_{p-1}(G_{p-1})| \rightarrow_{\mathbb{P}} 0$
and, as $G_{p-1}$ is a bounded function, $E[|\eta_{p-1}^N(G_{p-1}) - \eta_{p-1}(G_{p-1})|]$ will converge to zero. For the first term, we
have
$$E[|\eta_{p-1}^N(G_{p-1,N} - G_{p-1})|] \leq E[|G_{p-1,N}(M_{p-1}^1) - G_{p-1}(M_{p-1}^1)|]$$
where we have used the exchangeability of the particle system (the marginal law of any sample $M_{p-1}^i$ is the same
for each $i \in [N]$). Then, noting that the inductive hypothesis implies that for any fixed $M_{p-1} \in \mathsf{M}$
$$G_{p-1,N}(M_{p-1}) \rightarrow_{\mathbb{P}} G_{p-1}(M_{p-1}) \qquad (19)$$
by essentially the above arguments (note (16)), we have that $E[|\eta_{p-1}^N(G_{p-1,N} - G_{p-1})|] \rightarrow 0$. This establishes
$$\eta_{p-1}^N(G_{p-1,N}) \rightarrow_{\mathbb{P}} \eta_{p-1}(G_{p-1}). \qquad (20)$$
Thus, using the induction hypothesis, as $Q_p(\varphi_p) \in B_b(\mathsf{M})$, $\eta_{p-1}^N(Q_p(\varphi_p))$ converges in probability to a constant.
This fact, combined with the above argument and the continuous mapping theorem, shows that the second term
on the R.H.S. of (18) will converge to zero in probability.
Third Term on R.H.S. of (18). We would like to show that $E[|\eta_{p-1}^N[\{Q_{p,N} - Q_p\}(\varphi_p)]|]$ goes to zero; note that
$$E[|\eta_{p-1}^N[\{Q_{p,N} - Q_p\}(\varphi_p)]|] \leq E[|Q_{p,N}(\varphi_p)(M_{p-1}^1) - Q_p(\varphi_p)(M_{p-1}^1)|].$$
As the term in the expectation on the R.H.S. of the inequality is bounded (note (16)), it suffices to
prove that this term will converge to zero in probability. We have, for any fixed $M \in \mathsf{M}$,
$$Q_{p,N}(\varphi_p)(M) - Q_p(\varphi_p)(M) = [G_{p-1,N}(M) - G_{p-1}(M)]K_{p,N}(\varphi_p)(M) + G_{p-1}(M)[K_{p,N}(\varphi_p)(M) - K_p(\varphi_p)(M)].$$
As $K_{p,N}(\varphi_p)(M)$ is bounded, it clearly follows via the induction hypothesis (note (19)) that $[G_{p-1,N}(M) - G_{p-1}(M)]K_{p,N}(\varphi_p)(M)$ will converge to zero in probability. To deal with the second part, we consider only the
'acceptance' part of the M-H kernel; dealing with the 'rejection' part is very similar and omitted for brevity:
$$\sum_{M'\in\mathsf{M}} q_p(M,M')\,\varphi_p(M')\Big(1 \wedge \frac{\Phi_p^N(M')}{\Phi_p^N(M)} - 1 \wedge \frac{\Phi_p(M')}{\Phi_p(M)}\Big) \qquad (21)$$
where $q_p(M,M')$ is the symmetric proposal probability. For any fixed $M, M'$, $1 \wedge \frac{\Phi_p^N(M')}{\Phi_p^N(M)}$ is a continuous function
of $\eta_{p-1}^N(\cdot)$, $w_p^N$ (when they appear), so by the induction hypothesis, it follows that for any $M, M' \in \mathsf{M}$,
$$1 \wedge \frac{\Phi_p^N(M')}{\Phi_p^N(M)} - 1 \wedge \frac{\Phi_p(M')}{\Phi_p(M)} \rightarrow_{\mathbb{P}} 0$$
and hence so does (21) (recall $\mathsf{M}$ is finite). By (20), $\eta_{p-1}^N(G_{p-1,N})$ converges in probability to $\eta_{p-1}(G_{p-1})$ and hence
the third term on the R.H.S. of (18) will converge to zero in probability.
Now, following the proof of [2, Theorem 4.1] and the above arguments, the first term on the R.H.S. of (17) will
converge to zero in probability. Thus, we have shown that $\eta_p^N(\varphi_p) - \eta_p(\varphi_p)$ will converge to zero in probability.
Then, by this latter result and the induction hypothesis, along with the continuous mapping theorem, it follows that
for $(u_{p+1}, v_{p+1}) \in (U \times V)$ arbitrary, $w_{p+1}^N(u_{p+1},v_{p+1}) \rightarrow_{\mathbb{P}} w_{p+1}^*(u_{p+1},v_{p+1})$ and indeed that $G_{p,N}(M_p)$ converges
in probability to $G_p(M_p)$ for any fixed $M_p \in \mathsf{M}$. From here one can conclude the proof with standard results in
probability.
References
[1] Andrieu, C. & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations.
Ann. Statist., 37, 697–725.
[2] Beskos, A., Jasra, A., Kantas, N. & Thiery, A. (2014). On the convergence of adaptive sequential Monte
Carlo methods. arXiv preprint, arXiv:1306.6462.
[3] Bezakova, I. (2006). Faster Markov chain Monte Carlo Algorithms for the Permanent and Binary Contingency
Tables. PhD thesis, University of Chicago.
[4] Bezakova, I., Stefankovic, D., Vazirani, V. & Vigoda, E. (2008). Accelerating simulated annealing for
the permanent and combinatorial counting problems. SIAM J. Comp., 37, 1429–1454.
[5] Bezakova, I., Sinclair, A., Stefankovic, D., & Vigoda, E. (2006). Negative examples for sequential
importance sampling of binary contingency tables. In Y. Azar & T. Erlebach (Eds.), Algorithms ESA 2006
(Vol. 4168, p. 136-147). Berlin/Heidelberg: Springer.
[6] Bezakova, I., Bhatnagar, N., & Vigoda, E. (2007). Sampling binary contingency tables with a greedy
start. Random Struct. Algor., 30, 168–205.
[7] Chen, Y., Diaconis, P., Holmes, S., & Liu, J. S. (2005). Sequential Monte Carlo methods for statistical analysis
of tables. J. Amer. Statist. Ass., 100, 109–120.
[8] Daley, D. & Vere-Jones, D. (2003). An Introduction to the Theory of Point Processes. 2nd edition, Springer:
New York.
[9] Del Moral, P. (2004). Feynman-Kac Formulae. Springer: New York.
[10] Del Moral, P., Doucet, A. & Jasra, A. (2006). Sequential Monte Carlo samplers. J. R. Statist. Soc. B,
68, 411–436.
[11] Diaconis, P. & Stroock, D. (1991). Geometric bounds for eigenvalues of Markov chains. Ann. Appl. Probab. 1,
36–61.
[12] Doucet, A. & Johansen, A. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In
Handbook of Nonlinear Filtering (eds. D. Crisan & B. Rozovsky), Oxford University Press: Oxford.
[13] Fearnhead, P. (1998). Sequential Monte Carlo Methods in Filter Theory. PhD thesis, University of Oxford.
[14] Harrison, M. & Miller, J. (2013). Importance sampling for weighted binary random matrices. arXiv preprint,
arXiv:1301.3928.
[15] Kasteleyn, P. W. (1961). The statistics of dimers on a lattice I: The number of dimer arrangements on a
quadratic lattice. Physica, 27, 1664–1672.
[16] Kou, S. C. & McCullagh, P. (2009). Approximating the α−permanent. Biometrika, 96, 635–644.
[17] Jasra, A., Stephens, D. A. & Holmes, C. C. (2007). On population-based simulation. Statist. Comp, 17,
263–279.
[18] Jerrum, M., Sinclair, A. & Vigoda, E. (2004). A polynomial-time approximation algorithm for the permanent
of a matrix with non-negative entries. J. Ass. Comp. Mach., 51, 671–697.
[19] McCullagh, P. & Moller, J. (2006). The permanental process. Adv. Appl. Probab., 38, 873–888.
[20] Schweizer, N. (2012). Non-asymptotic error bounds for sequential MCMC and stability of Feynman-Kac
propagators. arXiv preprint, arXiv:1204.2382.
[21] Wang, J. (2014). Sequential Monte Carlo Methods for Problems on Finite State-Spaces. PhD Thesis, National
University of Singapore (in progress).
[22] Wang, J., Jasra, A. & De Iorio, M. (2014). Computational methods for a class of network models. J. Comp.
Biol., 21, 141–161.
[23] Whiteley, N. P., Andrieu, C., & Doucet, A. (2010). Efficient Bayesian inference for switching state-space
models using discrete particle Markov chain Monte Carlo methods. arXiv preprint, arXiv:1011.2437.