
Generating random permutations by coin-tossing:
classical algorithms, new analysis and
modern implementation

Axel Bacher, Olivier Bodini, Hsien-Kuei Hwang, and Tsung-Hsi Tsai

LIPN, Université Paris 13, Villetaneuse, France
Institute of Statistical Science, Academia Sinica, Taipei, Taiwan

March 10, 2016
Abstract
Several simple, classical, little-known algorithms in the statistical literature for generating
random permutations by coin-tossing are examined, analyzed and implemented. These algorithms are either asymptotically optimal or close to being so in terms of the expected number of
times the random bits are generated. In addition to asymptotic approximations to the expected
complexity, we also clarify the corresponding variances, as well as the asymptotic distributions. A brief comparative discussion with numerical computations in a multicore system is
also given.
1 Introduction
Random permutations are indispensable in widespread applications ranging from cryptology to
statistical testing, from data structures to experimental design, from data randomization to Monte
Carlo simulations, etc. Natural examples include the block cipher (Rivest, 1994), permutation tests
(Berry et al., 2014) and the generalized association plots (Chen, 2002). Random permutations are
also central in the framework of Boltzmann sampling for labeled combinatorial classes (Flajolet
et al., 2007) where they intervene in the labeling process of samplers. Finding simple, efficient
and cryptographically secure algorithms for generating large random permutations is then of vital importance from the modern perspective. We are concerned in this paper with several simple classical algorithms for generating random permutations (each permutation being generated with the same probability), some having remained little known in the statistical and computer science literature, and focus mostly on their stochastic behaviors for large samples; implementation issues are also discussed.

(This research was partially supported by the ANR-MOST Joint Project MetAConC under the Grants ANR 2015-BLAN-0204 and MOST 105-2923-E-001-001-MY4.)
Algorithm Laisant-Lehmer: when Unif[1, n!] is available. (Here and throughout this paper Unif[a, b] represents a discrete uniform distribution over all integers in the interval [a, b].) The earliest algorithm for generating a random permutation dates back to Laisant's work near the end of the 19th century (Laisant, 1888), which was later re-discovered in (Lehmer, 1960). It is based on the factorial representation of an integer

\[ k = c_1 (n-1)! + c_2 (n-2)! + \cdots + c_{n-1}\, 1! \qquad (0 \le k < n!), \]

where $0 \le c_j \le n - j$ for $1 \le j \le n - 1$. A simple algorithm implementing this representation then proceeds as follows; see (Devroye, 1986, p. 648) or (Robson, 1969).
Let $k = \mathrm{Unif}[0, n! - 1]$. The first element of the random permutation is $c_1 + 1$, which is then removed from $\{1, \dots, n\}$. The next element of the random permutation will then be the $(c_2 + 1)$st of the $n - 1$ remaining elements. Continue this procedure until the last element is removed. A direct implementation of this algorithm results in a two-loop procedure; a simple one-loop procedure was devised in (Robson, 1969) and is shown below; see also (Plackett, 1968) and Devroye's book (Devroye, 1986, §XIII.1) for variants, implementation details and references.

Algorithm 1: LL(n, c)
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
1   u := Unif[1, n!];
2   for i := n downto 2 do
3     t := ⌊u/i⌋; j := u − i·t + 1;
4     swap(c_i, c_j); u := t;

This algorithm is mostly useful when n is small, say less than 20, because n! grows very fast and the large-number arithmetic involved reduces its efficiency for larger n. Also, the generation of the uniform distribution is better realized by the coin-tossing algorithms described in Section 2.
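In modern notation, the one-loop procedure can be sketched as follows; this is our Python transcription (the function name ll_shuffle and the optional argument u are ours), a sketch rather than a definitive implementation:

```python
import random
from math import factorial

def ll_shuffle(c, u=None):
    """One-loop Laisant-Lehmer shuffle (sketch of Robson's variant).

    u is a uniform integer in [1, n!]; it is drawn at random if not given.
    Each value of u yields a distinct permutation, so the output is uniform."""
    n = len(c)
    c = list(c)
    if u is None:
        u = random.randint(1, factorial(n))
    for i in range(n, 1, -1):
        t = u // i             # quotient: the "seed" passed to the next round
        j = u - i * t + 1      # remainder shifted into the range 1..i
        c[i - 1], c[j - 1] = c[j - 1], c[i - 1]
        u = t
    return c
```

For instance, enumerating u = 1, ..., 24 for n = 4 produces each of the 24 permutations exactly once, which is the bijection underlying the factorial representation.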
Algorithm Fisher-Yates (FY): when Unif[1, n] is available. One of the simplest and most widely used algorithms (based on a given sequence of distinct numbers $\{c_1, \dots, c_n\}$) for generating a random permutation is the Fisher-Yates or Knuth shuffle (in its modern form due to Durstenfeld (Durstenfeld, 1964); see Wikipedia's page on the Fisher-Yates shuffle and (Devroye, 1986; Durstenfeld, 1964; Fisher and Yates, 1948; Knuth, 1998a) for more information). The algorithm starts by swapping $c_n$ with a randomly chosen element in $\{c_1, \dots, c_n\}$ (each with the same probability of being selected), and then repeats the same procedure for $c_{n-1}, \dots, c_2$. See also the recent book (Berry et al., 2014) or the survey paper (Ritter, 1991) for a more detailed account.

Algorithm 2: FY(n, c)
  Input: c: an array with n ≥ 2 elements
  Output: A random permutation on c
  begin
1   for i := n downto 2 by 1 do
2     j := Unif[1, i]; swap(c_i, c_j)

Such an algorithm seems to have it all: single loop, one-line description, constant extra storage, efficient and easy to code. Yet it is not optimal in situations such as (i) when only a partial subset of the permuted elements is needed (see (Black and Rogaway, 2002; Brassard and Kannan, 1988)), (ii) when implemented on a non-array type data structure such as a list (see (Ressler, 1992)), (iii) when numerical truncation errors are inherent (see (Kimble, 1989)), and (iv) when a parallel or distributed computing environment is available (see (Anderson, 1990; Langr et al., 2014)). On the other hand, at the memory-access level, a direct generation of the uniform random variable results in a higher rate of cache misses (see (Andrés and Pérez, 2011)), making it less efficient than it seems, notably when n is very large; see also Section 6 for some implementation and simulation aspects. Finally, this algorithm is sequential in nature and the memory-conflict problem is subtle in parallel implementations; see (Waechter et al., 2011). Note that the implementation of this algorithm relies strongly on the availability of a uniform random variate generator, and that its bit-complexity (number of random bits needed) is of linearithmic order (not linear); see (Lumbroso, 2013; Sandelius, 1962) and below for a detailed analysis.
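The loop is short enough to state in full; the following is our Python sketch (the injectable unif parameter is ours, added so that the exhaustive uniformity check below is deterministic):

```python
import random
from itertools import product

def fisher_yates(c, unif=random.randint):
    """Fisher-Yates shuffle (Durstenfeld's form): for i = n downto 2,
    swap c[i-1] with a uniformly chosen c[j-1], j in [1, i]."""
    c = list(c)
    for i in range(len(c), 1, -1):
        j = unif(1, i)
        c[i - 1], c[j - 1] = c[j - 1], c[i - 1]
    return c

# Exhaustive check for n = 3: over the 3 * 2 = 6 equally likely choice
# sequences, each of the 3! permutations appears exactly once.
outcomes = []
for js in product(range(1, 4), range(1, 3)):   # j for i = 3, then for i = 2
    it = iter(js)
    outcomes.append(tuple(fisher_yates([1, 2, 3], unif=lambda a, b: next(it))))
assert len(set(outcomes)) == 6
```

The exhaustive enumeration makes the uniformity argument concrete: the map from choice sequences to permutations is a bijection, so equal probabilities on the left give equal probabilities on the right.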
From unbounded uniforms to bounded uniforms? Instead of relying on uniform distributions with a very large range, our starting question was: can we generate random permutations by bounded uniform distributions (for example, by flipping unbiased coins)? There are at least two different ways to achieve this:

- Fisher-Yates type: simulate the uniform distribution used in the Fisher-Yates shuffle by coin-tossing, which can be realized either by von Neumann's rejection method (von Neumann, 1951) or by the Knuth-Yao algorithm (for generating a discrete distribution by unbiased coins; see (Devroye, 1986; Knuth and Yao, 1976) and Section 2), and

- divide-and-conquer type: each element flips an unbiased coin and, depending on the outcome being head or tail, the elements are divided into two groups. The same procedure continues recursively for each of the two groups. Then a random resampling is achieved by an inorder traversal of the corresponding digital tree; see the next section for details. This realization naturally induces a binary trie (Knuth, 1998b), which is closely related to a few other binomial splitting processes that will be briefly described below; see (Fuchs et al., 2014).
It turns out that exactly the same binomial splitting idea was already developed in the early 1960's in the statistical literature in (Rao, 1961) and independently in (Sandelius, 1962), and analyzed later in (Plackett, 1968). The papers by Rao and by Sandelius also propose other variants, which have their modern algorithmic interest per se. However, all these algorithms have remained little known not only in computer science but also in statistics (see (Berry et al., 2014; Devroye, 1986)), partly because they rely on tables of random digits instead of more modern computer-generated random bits, although the underlying principle remains the same. Since a complete and rigorous analysis of the bit-complexity of these algorithms remains open, for historical reasons and for completeness, we will provide a detailed analysis of the algorithms proposed in (Rao, 1961) and (Sandelius, 1962) (and partially analyzed in (Plackett, 1968)) and of two versions of Fisher-Yates
with different implementations of the underlying uniform Unif[0, n − 1] by coin-tossing: one relying on von Neumann's rejection method (Devroye and Gravel, 2015; von Neumann, 1951) and the other on Knuth-Yao's tree method (Devroye, 1986; Knuth and Yao, 1976).
As the ideas of these algorithms are very simple, it is no wonder that similar ideas also appeared
in computer science literature but in different guises; see (Barker and Kelsey, 2007; Flajolet et al.,
2011; Koo et al., 2014; Ressler, 1992) and the references therein. We will comment more on this
in the next section.
We describe in the next section the algorithms we will analyze in this paper. Then we give a
complete probabilistic analysis of the number of random bits used by each of them. Implementation aspects and benchmarks are briefly discussed in the final section. Note that Fisher-Yates
shuffle and its variants for generating cyclic permutations have been analyzed in (Louchard et al.,
2008; Mahmoud, 2003; Prodinger, 2002; Wilson, 2009) but their focus is on data movements rather
than on bit-complexity.
2 Generating random permutations by coin-tossing
We describe in this section three algorithms for generating random permutations, assuming that a bounded uniform Unif[0, r − 1] is available for some fixed integer r ≥ 2. The first algorithm relies on the divide-and-conquer strategy and was first proposed in (Rao, 1961) and independently in (Sandelius, 1962), so we will refer to it as Algorithm RS (Rao-Sandelius). The other two we study are of Fisher-Yates type but differ in the way they simulate Unif[0, n − 1] by a bounded uniform Unif[0, r − 1]: the first of these simulates Unif[0, n − 1] by a rejection procedure in the spirit of von Neumann (von Neumann, 1951) and was proposed and implemented in (Sandelius, 1962), named ORP (One-stage Randomization Procedure) there, but for convenience we will refer to it as Algorithm FYvN (Fisher-Yates-von-Neumann); see also (Moses and Oakford, 1963). The other relies on an optimized version of Lumbroso's implementation (Lumbroso, 2013) of Knuth-Yao's DDG-tree (discrete distribution generating tree) algorithm (Knuth and Yao, 1976), and will be referred to as Algorithm FYKY (Fisher-Yates-Knuth-Yao). See also (Devroye, 1986, Ch. XV) on the "bit model" and the more recent updates (Devroye, 2010; Devroye and Gravel, 2015).

For simplicity of presentation and practical usefulness, we focus in what follows on the binary case r = 2. For convenience, let rand-bit denote the random variable Bernoulli(1/2), which returns zero or one with equal probability.
2.1 Algorithm RS: divide-and-conquer
We describe Algorithm RS only in the binary case assuming an unbiased coin is available. Since
we will carry out a detailed analysis of this algorithm, we give its procedure in recursive form as
follows. (For practical implementation, it is more efficient to remove the recursions by standard
techniques; see Section 6.)
A sequence of distinct numbers $\{c_1, \dots, c_n\}$ is given.

1. Each $c_i$ generates a rand-bit, independently of the others;
2. Group them according to the outcomes being 0 or 1, and arrange the groups in increasing order of the group labels.
3. For each group of cardinality $\ell$:
   (a) if $\ell = 1$, then stop;
   (b) if $\ell = 2$, then generate a rand-bit $b$ and reverse their relative order if $b = 1$;
   (c) if $\ell > 2$, then repeat Steps 1-3 for the group.

Algorithm 3: RS(n, c)
  Input: c: a sequence with n elements
  Output: A random permutation on c
  begin
1   if n ≤ 1 then
2     return c
3   if n = 2 then
4     if rand-bit = 1 then
5       return (c_2, c_1)
6     else
7       return (c_1, c_2)
8   Let A_0 and A_1 be two empty arrays
9   for i := 1 to n do
      add c_i into A_{rand-bit}
10  return RS(|A_0|, A_0), RS(|A_1|, A_1)
As an illustrative example, we begin with the sequence $\{c_1, \dots, c_6\}$. Assume that the flipped binary sequence is

  c_1 c_2 c_3 c_4 c_5 c_6
   1   0   1   1   0   0

Then we split the $c_i$'s into the 0-group $(c_2, c_5, c_6)$ and the 1-group $(c_1, c_3, c_4)$, which can be written in the form $(c_2\, c_5\, c_6)\,(c_1\, c_3\, c_4)$. As both groups have cardinality larger than two, we run the same coin-flipping process for both groups. Assume that further coin-flippings yield

  c_2 c_5 c_6         c_1 c_3 c_4
   0   0   1    and    0   1   0

respectively. Then we obtain $(c_2\, c_5)\, c_6\, (c_1\, c_4)\, c_3$. If the two extra coin-flippings needed to permute the two subgroups of size two are 0 and 1, respectively, then we get the random permutation $c_2\, c_5\, c_6\, c_4\, c_1\, c_3$.
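The recursive form of Algorithm RS can be sketched in a few lines of Python (our transcription; the injectable bit parameter is ours). Note that our depth-first implementation consumes coin flips in a slightly different order than the example's presentation, so the scripted bit stream below is reordered accordingly to reproduce the same outcome:

```python
import random

def rs_shuffle(c, bit=lambda: random.getrandbits(1)):
    """Algorithm RS (binary case): split on coin flips, then recurse on
    the 0-group and the 1-group; a pair is ordered by a single flip."""
    n = len(c)
    if n <= 1:
        return list(c)
    if n == 2:                        # one flip decides the order
        return [c[1], c[0]] if bit() else list(c)
    a0, a1 = [], []
    for x in c:                       # one independent flip per element
        (a1 if bit() else a0).append(x)
    return rs_shuffle(a0, bit) + rs_shuffle(a1, bit)

# Reproduces the example above (bits reordered for depth-first traversal):
bits = iter([1, 0, 1, 1, 0, 0,   # split c1..c6
             0, 0, 1,            # split the 0-group (c2, c5, c6)
             0,                   # order the pair (c2, c5)
             0, 1, 0,            # split the 1-group (c1, c3, c4)
             1])                  # order the pair (c1, c4) -> (c4, c1)
assert rs_shuffle([1, 2, 3, 4, 5, 6], bit=lambda: next(bits)) == [2, 5, 6, 4, 1, 3]
```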
The splitting process of this algorithm is, up to the boundary conditions, essentially the same
as constructing a random trie under the Bernoulli model or sorting using radixsort (see (Fuchs
et al., 2014; Knuth, 1998b)), and was also briefly mentioned in (Flajolet et al., 2011). On the other
hand, Ressler in (Ressler, 1992) proposed an algorithm for randomly permuting a list structure
using a similar divide-and-conquer idea, but performed in a rather different way. To the best of our knowledge, except for these references, this simple algorithm seems to have remained unknown in the literature, and we believe that more attention should be paid to its practical usefulness and theoretical relevance.
Essentially identical binomial splitting processes. In addition to the above connection to tries and radix sort, the splitting process of Algorithm RS is also reminiscent of the so-called initialization problem in distributed computing (or processor identity problem), where a unique identifier is to be assigned to each processor in some distributed computing environment; see (Nakano and Olariu, 2000; Ravelomanana, 2007). Yet another context where exactly the same coin-tossing process is used to resolve conflicts is the tree algorithm (or CTM algorithm, named after Capetanakis, Tsybakov and Mikhailov) in multi-access channels; see (Massey, 1981; Wagner, 2009). For more references on binomial splitting processes, see (Fuchs et al., 2014).
Nowadays, it is well known that the stochastic behaviors of these structures can be understood through the study of the binomial recurrence

\[ f_n = g_n + 2^{-n} \sum_{0 \le k \le n} \binom{n}{k} (f_k + f_{n-k}), \tag{1} \]

with suitably given initial conditions. In almost all cases of interest, such a recurrence gives rise to asymptotic approximations (for large n) that involve periodic oscillations with minute amplitudes (say, of order $10^{-5}$), which may lead to inexact conjectures (see for example (Massey, 1981)) but can be well described by standard complex-analytic tools such as the Mellin transform (Flajolet et al., 1995) and the saddle-point method (Flajolet and Sedgewick, 2009) (or analytic de-Poissonization (Jacquet and Szpankowski, 1998)); see (Fuchs et al., 2014) and the references compiled there. From a historical perspective, such a clarification through analytic means was first worked out by Flajolet and his co-authors in the early 1980's; see again (Fuchs et al., 2014) for a brief account. However, the periodic oscillations had already been observed in the 1960's by Plackett (Plackett, 1968) based on heuristic arguments and figures, which seems unexpected given the limited computer power at that time and the proper normalization needed to visualize the fluctuations; see Figures 1 and 2 for the subtleties involved.
Unlike Algorithm FY, Algorithm RS is more easily adapted to a distributed or parallel computing environment because the random bits needed can be generated simultaneously. Furthermore, we will prove that the total number of random bits used is asymptotically optimal, namely, the expected complexity is asymptotic to $n \log_2 n + n F_{RS}(\log_2 n) + O(1)$, where $F_{RS}(t)$ is a periodic function of period 1 whose fluctuation around its mean value has very small amplitude ($\le 1.1 \times 10^{-5}$); see Figure 1. Another distinctive feature is that $F_{RS}$ is very smooth (infinitely differentiable), differing from most other periodic functions arising in the analysis below. Note that the information-theoretic lower bound satisfies $\log_2 n! = n \log_2 n - \frac{n}{\log 2} + O(\log n)$. While the asymptotic optimality of such a simple algorithm was already discussed in detail in (Sandelius, 1962) and such an asymptotic pattern anticipated in (Plackett, 1968), the rigorous proof and the explicit characterization of the periodic function $F_{RS}$ are new. Also, we show that the variance is relatively small (being of linear order with periodic fluctuations) and that the distribution is asymptotically normal.
2.2 Algorithms FYvN and FYKY

We describe in this subsection the two versions of Algorithm FY: FYvN and FYKY. Both algorithms follow the same loop of the Fisher-Yates shuffle and simulate successively the discrete uniform distributions Unif[1, n], ..., Unif[1, 2] by flipping unbiased coins. To simulate Unif[1, k], both algorithms first generate $\lceil \log_2 k \rceil$ random bits. If these bits, when read as a binary representation, have a value less than k, then this value plus 1 is returned as the required random element; otherwise, Algorithm FYvN rejects these bits and restarts the same procedure until finding a value < k, whereas Algorithm FYKY does not reject the flipped bits but uses the difference between this value and k as the "seed" of the next round and repeats the same procedure.

We modified and optimized these two procedures so as to reduce the number of arithmetic operations; they differ only in their final step (line 9 of each procedure); see Algorithms 4 and 5.
Algorithm 4: Algorithm FYvN
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
1   for i := n downto 2 by 1 do
2     j := von-Neumann(i) + 1; swap(c_i, c_j);

Procedure von-Neumann(n)
  Input: a positive integer n
  Output: Unif[0, n − 1]
  begin
1   u := 1; x := 0;
2   while 1 = 1 do
3     while u < n do
4       u := 2u; x := 2x + rand-bit;
5     d := u − n;
6     if x ≥ d then
7       return x − d;
8     else
9       u := 1; x := 0;

Algorithm 5: Algorithm FYKY
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
1   for i := n downto 2 by 1 do
2     j := Knuth-Yao(i) + 1; swap(c_i, c_j);

Procedure Knuth-Yao(n)
  Input: a positive integer n
  Output: Unif[0, n − 1]
  begin
1   u := 1; x := 0;
2   while 1 = 1 do
3     while u < n do
4       u := 2u; x := 2x + rand-bit;
5     d := u − n;
6     if x ≥ d then
7       return x − d;
8     else
9       u := d;
Note that both algorithms are identical when $n = 2^k$ and when $n = 3$. For $n = 3$, the parameters $(u, x)$ evolve as follows: starting from $(1, 0)$, two coin flips lead to $(2, 0)$ or $(2, 1)$, and then to one of $(4, 0)$, $(4, 1)$, $(4, 2)$, $(4, 3)$; here $d = 1$, and the outcomes $x = 1, 2, 3$ return $x - d = 0, 1, 2$, respectively, while the outcome $x = 0$ leads back to $(1, 0)$ in both procedures (a full rejection for von-Neumann, and $u := d = 1$ for Knuth-Yao).
While the difference between the two algorithms at such a pseudo-code level is minor, we show that the asymptotic behaviors of their bit-complexities for generating a random permutation of n elements differ significantly:

  Algorithm | Mean                                              | Variance             | Method
  RS        | $n \log_2 n + n F_{RS}(\cdot)$                    | $n\, G_{RS}(\cdot)$  | Analytic
  FYvN      | $n(\log n) F_{vN}^{[1]}(\cdot) + n F_{vN}^{[2]}(\cdot)$ | $n(\log n)^2 G_{vN}(\cdot)$ | Elementary
  FYKY      | $n \log_2 n + n F_{KY}(\cdot)$                    | $n\, G_{KY}(\cdot)$  | Analytic

Here the $F$'s and $G$'s are all bounded, continuous periodic functions of the parameter $\log_2 n$. We see that the minor difference in Algorithm FYvN results not only in a higher mean but also in a larger variance, making FYvN less competitive in modern practical applications, although it was used, for example, by Moses and Oakford to produce tables of random permutations (Moses and Oakford, 1963). Also, the procedure von-Neumann in Algorithm 4, as one of the simplest and most natural ideas for simulating a uniform by coin-tossing, was independently proposed under different names in the literature; see, for example, (Granboulan and Pornin, 2007; Koo et al., 2014); in particular, it is called the "Simple Discard Method" in NIST's "Recommendation for random number generation using deterministic random bit generators" (Barker and Kelsey, 2007). Thus, we also include the analysis of FYvN in this paper although it is less efficient in bit-complexity. The mean and the variance of Algorithm FYvN were already derived in (Plackett, 1968) but only when $n = 2^k$. In addition to this approximation, we will also show that the variance is of the less common higher order $n(\log n)^2$, and that the distribution remains asymptotically normal.
2.3 Outline of this paper

We focus in this paper on a detailed probabilistic analysis of the bit-complexity of the three algorithms RS, FYvN and FYKY. Indeed, in all three cases we will establish a very strong local limit theorem for the bit-complexity of the form (although the variances are not of the same order)

\[ P\Big(W_n = E(W_n) + x\sqrt{V(W_n)}\Big) = \frac{e^{-x^2/2}}{\sqrt{2\pi V(W_n)}} \bigg(1 + O\bigg(\frac{1 + |x|^3}{\sqrt{n}}\bigg)\bigg), \]

uniformly for $x = o(n^{1/6})$, where $W_n$ represents the bit-complexity of any of the three algorithms {RS, FYvN, FYKY}. Our method of proof is mostly analytic, relying on a proper use of generating functions (including characteristic functions) and standard complex-analytic techniques (see (Flajolet and Sedgewick, 2009)). The diverse uniform estimates needed for the characteristic functions constitute the hard part of our proofs. The same method can be extended to clarify finer probabilities of moderate deviations, but for simplicity, we content ourselves with the above result in the central range.
We also implemented these algorithms and tested their efficiency in terms of running time. The simulation results are given in the last section. Briefly, Algorithm FYKY is recommended when n is not very large, say $n \le 10^7$, and Algorithm RS performs better for larger n or when a multicore system is available.
Finally, our analysis and simulations also suggest that the "Simple Discard Algorithm" recommended in NIST's "Recommendation for random number generation" (Barker and Kelsey, 2007) is better replaced by the procedure Knuth-Yao in Algorithm 5, whose expected optimality (in bit-complexity) was established in (Horibe, 1981).
3 The bit-complexity of Algorithm RS
We consider the total number Xn of times the random variable rand-bit is used in Algorithm
RS for generating a random permutation of n elements. We will derive precise asymptotic approximations to the mean, the variance and the distribution by applying the approaches developed in
our previous papers (Fuchs et al., 2014; Hwang, 2003; Hwang et al., 2010).
Recurrences and generating functions. By construction, $X_n$ satisfies the distributional recurrence

\[ X_n \stackrel{d}{=} \underbrace{X_{I_n}}_{\text{0-group}} + \underbrace{X^*_{n - I_n}}_{\text{1-group}} + n \qquad (n \ge 3), \]

with the initial conditions $X_0 = X_1 = 0$ and $X_2 = 1$, where $I_n$, the size of the 0-group, follows the binomial distribution with parameters $n$ and $\frac12$. Here the $(X^*_n)$'s are independent copies of the $(X_n)$'s and are independent of $I_n$. This random variable is, up to initial conditions, identical to the external path length of random tries constructed from n random binary strings. It may also be interpreted in many different ways; see (Fuchs et al., 2014; Knuth, 1998b) and the references therein.
In terms of the moment generating function $P_n(t) := E(e^{X_n t})$, we have the recurrence

\[ P_n(t) = e^{nt}\, 2^{-n} \sum_{0 \le k \le n} \binom{n}{k} P_k(t) P_{n-k}(t) \qquad (n \ge 3), \tag{2} \]

with $P_0(t) = P_1(t) = 1$ and $P_2(t) = e^t$. From these relations, we see that the bivariate Poisson generating function $\tilde P(z, t) := e^{-z} \sum_{n \ge 0} \frac{P_n(t)}{n!}\, z^n$ satisfies the functional equation

\[ \tilde P(z, t) = e^{(e^t - 1)z}\, \tilde P\big(\tfrac12 e^t z,\, t\big)^2 + \tilde Q(z, t), \tag{3} \]

where

\[ \tilde Q(z, t) := (1 - e^t)\, z e^{-z} \big(1 + \tfrac14 e^t (e^t + 2) z\big). \]

Let now $\tilde f_m(z) := m! [t^m] \tilde P(z, t) = e^{-z} \sum_{n \ge 0} \frac{E(X_n^m)}{n!}\, z^n$ denote the Poisson generating function of the m-th moment of $X_n$. From (3), we see that

\[ \tilde f_1(z) = 2 \tilde f_1\big(\tfrac z2\big) + \tilde g_1(z), \qquad \tilde f_2(z) = 2 \tilde f_2\big(\tfrac z2\big) + \tilde g_2(z), \tag{4} \]

with $\tilde g_1(0) = \tilde g_2(0) = 0$, where

\[ \tilde g_1(z) = z - z e^{-z} \big(1 + \tfrac34 z\big), \qquad \tilde g_2(z) = 2 \tilde f_1\big(\tfrac z2\big)^2 + 4z \tilde f_1\big(\tfrac z2\big) + 2z \tilde f_1'\big(\tfrac z2\big) + z + z^2 - z e^{-z} \big(1 + \tfrac{11}4 z\big). \]
Mean value. From the recurrence (2), we see that the mean $\mu_n := E(X_n)$ can be computed recursively by

\[ \mu_n = n + 2^{1-n} \sum_{0 \le k \le n} \binom{n}{k} \mu_k \qquad (n \ge 3), \]

with $\mu_0 = \mu_1 = 0$ and $\mu_2 = 1$. Let $H_n := \sum_{1 \le j \le n} \frac1j$ denote the harmonic numbers and $\gamma$ denote Euler's constant.

Theorem 1. The expected number $\mu_n$ of random bits used by Algorithm RS for generating a random permutation of n elements satisfies the identity

\[ \frac{\mu_n}{n} = \frac{H_{n-1}}{\log 2} + \frac12 - \frac{3}{4\log 2} + \frac{1}{\log 2} \sum_{k \in \mathbb Z \setminus \{0\}} \frac{\Gamma(\chi_k)\Gamma(n)}{\Gamma(n + \chi_k)} \Big(1 + \frac34 \chi_k\Big), \tag{5} \]

for $n \ge 3$, where $\Gamma$ is the Gamma function and $\chi_k := \frac{2k\pi i}{\log 2}$. Asymptotically, $\mu_n$ satisfies

\[ \mu_n = n \log_2 n + n F_{RS}(\log_2 n) + O(1), \tag{6} \]

where $F_{RS}(t)$ is a periodic function of period 1 whose Fourier series expansion is given by

\[ F_{RS}(t) = \frac{\gamma}{\log 2} + \frac12 - \frac{3}{4\log 2} + \frac{1}{\log 2} \sum_{k \in \mathbb Z \setminus \{0\}} \Gamma(\chi_k) \Big(1 + \frac34 \chi_k\Big) e^{2k\pi i t}, \]

the Fourier series being absolutely convergent.
Proof. To derive a more effective asymptotic approximation to $\mu_n$, we begin with the expansion

\[ \tilde g_1(z) = \sum_{j \ge 2} \frac{(-1)^{j-1}(3j - 7)}{4\,(j-1)!}\, z^j. \]

We then see that the sequence $\tilde\mu_n := n! [z^n] \tilde f_1(z)$, where $[z^n] f(z)$ denotes the coefficient of $z^n$ in the Taylor expansion of $f$, satisfies

\[ \tilde\mu_n = \frac{n! [z^n] \tilde g_1(z)}{1 - 2^{1-n}} \qquad (n \ge 2). \]

It follows, by Cauchy convolution, that the coefficient $\mu_n = n! [z^n] e^z \tilde f_1(z)$ has the closed-form expression

\[ \mu_n = \sum_{2 \le k \le n} \binom{n}{k} \frac{(-1)^{k-1}(3k - 7)\, k}{4\,(1 - 2^{1-k})} \qquad (n \ge 1), \]

which, by the standard integral representation for finite differences (see (Flajolet and Sedgewick, 1995)), can be expressed as

\[ \frac{\mu_n}{n} = \frac{1}{2\pi i} \int_{-\frac12 - i\infty}^{-\frac12 + i\infty} \frac{\Gamma(n)\Gamma(s)}{\Gamma(n+s)\,(1 - 2^s)} \Big(1 + \frac34 s\Big)\, ds, \]

where the integration path is the vertical line $\Re(s) = -\frac12$. By moving the line of integration to the right and collecting all residues at the poles $\chi_k = \frac{2k\pi i}{\log 2}$ $(k \in \mathbb Z)$, we obtain

\[ \frac{\mu_n}{n} = \frac{H_{n-1}}{\log 2} + \frac12 - \frac{3}{4\log 2} + \frac{1}{\log 2} \sum_{k \in \mathbb Z \setminus \{0\}} \frac{\Gamma(\chi_k)\Gamma(n)}{\Gamma(n+\chi_k)} \Big(1 + \frac34 \chi_k\Big) + R_n, \]

where

\[ R_n := \frac{1}{2\pi i} \int_{\frac12 - i\infty}^{\frac12 + i\infty} \frac{\Gamma(n)\Gamma(s)}{\Gamma(n+s)\,(1 - 2^s)} \Big(1 + \frac34 s\Big)\, ds. \]

Since there is no other singularity lying to the right of this line, we deform the integration path into a large half-circle to the right, and then prove that the integral tends to zero as the radius of the circle tends to infinity. In this way, we deduce that $R_n \equiv 0$ for $n \ge 3$, proving the identity (5). The asymptotic approximation (6) then follows from the asymptotic expansion for the ratio of Gamma functions (see (Erdélyi et al., 1953, §1.18))

\[ \frac{\Gamma(n)}{\Gamma(n+\chi_k)} = n^{-\chi_k} \bigg(1 - \frac{\chi_k(\chi_k - 1)}{2n} + O\Big(\frac{|\chi_k|^4}{n^2}\Big)\bigg) \qquad (k \ne 0). \]

Indeed, the $O(1)$-term in (6) can be further refined by this expansion, and be replaced by

\[ -\frac{1}{2\log 2} \sum_{k \in \mathbb Z} \Gamma(1 + \chi_k)(\chi_k - 1)\Big(1 + \frac34 \chi_k\Big) n^{-\chi_k} + O\big(n^{-1}\big), \]

the series on the right-hand side defining another bounded periodic function. Finally, the Fourier series is absolutely convergent because of the estimate (see (Erdélyi et al., 1953, §1.18))

\[ |\Gamma(c + it)| = O\big(|t|^{c - \frac12} e^{-\frac{\pi}{2}|t|}\big), \tag{7} \]

for large $|t|$ and bounded $c$. Indeed, $F_{RS}(t)$ is infinitely differentiable.
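The identity can be checked numerically. The following verification (ours, not from the paper) computes $\mu_n$ exactly from the recurrence and from the alternating closed-form sum above, using exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

def mu_recurrence(nmax):
    """mu_n = n + 2^(1-n) * sum_k C(n,k) mu_k (n >= 3), mu_0 = mu_1 = 0,
    mu_2 = 1; the k = n term is moved to the left-hand side and solved for."""
    mu = [Fraction(0), Fraction(0), Fraction(1)]
    for n in range(3, nmax + 1):
        s = sum(comb(n, k) * mu[k] for k in range(n))   # k = 0..n-1
        mu.append((n + Fraction(2, 2**n) * s) / (1 - Fraction(2, 2**n)))
    return mu

def mu_closed(n):
    """Closed form from the proof of Theorem 1."""
    return sum(comb(n, k) * Fraction((-1)**(k - 1) * (3 * k - 7) * k, 4)
               / (1 - Fraction(1, 2**(k - 1)))
               for k in range(2, n + 1))

mu = mu_recurrence(12)
assert mu[2] == 1 and mu[3] == 5
assert all(mu[n] == mu_closed(n) for n in range(2, 13))
```

For instance, $\mu_3 = 5$: three bits split the elements, and with probability $\frac14$ all three land in the same group, so the expected cost solves $\mu_3 = 3 + 2^{-2}(3\mu_2 + \mu_3)$.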
Periodic fluctuations of $\mu_n$. Due to the small amplitude of variation of $F_{RS}(t)$, the periodic oscillations are invisible if one naively plots $\mu_n/n - \log_2 n$ for increasing values of $n$ as approximations to $F_{RS}(t)$ (see Figure 1). Also note that the mean value of $F_{RS}$ equals

\[ \frac{\gamma}{\log 2} + \frac12 - \frac{3}{4\log 2} = 0.25072\,48966\,10144\ldots, \tag{8} \]

which is larger than the corresponding linear term $-\frac{1}{\log 2} \approx -1.44$ in the information-theoretic lower bound.

Figure 1: Periodic fluctuations of $\mu_n$ for $n = 16$ to $1024$ in log-scale: $\mu_n/n - \log_2 n$ (first from left), $\mu_n/n - \log_2 n - \frac{1}{2n\log 2}$ (second), $\mu_n/n - H_{n-1}/\log 2$ (third), and $F_{RS}(t)$ for $t \in [0, 1]$ (fourth).
Variance. We prove that the variance is asymptotically linear with periodic oscillations. The expressions involved are very complicated, reflecting the complexity of the underlying asymptotic problem.

Theorem 2. The variance of $X_n$ satisfies

\[ V(X_n) = n\, G_{RS}(\log_2 n) + O(1), \]

where $G_{RS}(t)$ is a periodic function of period 1 whose Fourier series is given by

\[ G_{RS}(t) = \frac{1}{\log 2} \sum_{k \in \mathbb Z} \tilde g^*\Big({-1} + \frac{2k\pi i}{\log 2}\Big)\, e^{2k\pi i t}. \]

The Fourier series is absolutely convergent (and $G_{RS}$ is infinitely differentiable). An explicit expression for the function $\tilde g^*(s)$ is given as follows:

\[ \frac{\tilde g^*(s)}{\Gamma(s+1)} = (3s + 5) \sum_{k \ge 1} \bigg(\frac{1}{(1 + 2^{-k})^{s}} - 1\bigg) + \tilde h(s) + \frac{s + 5}{4} - \frac{(s+2)(9s^3 + 66s^2 + 163s + 362)}{2^{s+9}} \qquad (\Re(s) > -2), \tag{9} \]
where $\tilde h(s)$ is an explicit remainder series of the same kind,

\[ \tilde h(s) = \sum_{k \ge 1} 2^{-k}\, \big(1 + 2^{-k}\big)^{-s-5}\, \Pi\big(s, 2^{-k}\big), \]

in which $\Pi(s, w)$ is a polynomial of degree 4 in $s$ with polynomial coefficients in $w$, built from the polynomials $3s^3 + 34s^2 + 41s + 6$, $9s^4 + 87s^3 + 317s^2 + 333s + 30$, $3s^3 + 22s^2 + 141s + 170$, $3s^2 + 37s + 50$ and $3s + 5$ attached to increasing powers of $2^{-k}$; it converges absolutely for $\Re(s) > -2$ and arises from the integration by parts carried out in the proof below.
Proof. For the variance, we consider, as in (Fuchs et al., 2014), the corrected Poissonized variance

\[ \tilde V(z) := \tilde f_2(z) - \tilde f_1(z)^2 - z \tilde f_1'(z)^2. \]

Then, by (4),

\[ \tilde V(z) = 2 \tilde V\big(\tfrac z2\big) + \tilde g(z), \tag{10} \]

where

\[ \tilde g(z) = e^{-z} \Big\{ z(3z + 4)\, \tilde f_1\big(\tfrac z2\big) - \tfrac12 z\big(3z^2 - 2z - 4\big)\, \tilde f_1'\big(\tfrac z2\big) + z\big(1 + \tfrac14 z\big) - \tfrac{z}{16}(z+1)\big(9z^3 - 12z^2 + 16z + 16\big)\, e^{-z} \Big\}, \]

which is exponentially small for large $|z|$ in the half-plane $\Re(z) \ge 0$. Indeed,

\[ \tilde g(z) = O\big(e^{-\Re z} |z|^3 \log |z|\big) \qquad (|z| \to \infty;\ \Re(z) \ge 0). \tag{11} \]

We follow the same method of proof developed in (Fuchs et al., 2014) and need to compute the Mellin transform of $\tilde g(z)$, which exists in the half-plane $\Re(s) > -2$ because $\tilde g(z) = O(|z|^2)$ as $|z| \to 0$. Now

\[ \tilde f_1(z) = \sum_{k \ge 0} 2^k\, \tilde g_1\big(\tfrac{z}{2^k}\big). \]

Thus

\[ \tilde g^*(s) := \int_0^\infty \tilde g(z)\, z^{s-1}\, dz = T_1(s) + T_2(s) + T_3(s), \]

where

\[ T_1(s) := \int_0^\infty z^s e^{-z}(3z + 4)\, \tilde f_1\big(\tfrac z2\big)\, dz, \]
\[ T_2(s) := -\tfrac12 \int_0^\infty z^s e^{-z}\big(3z^2 - 2z - 4\big)\, \tilde f_1'\big(\tfrac z2\big)\, dz, \]
\[ T_3(s) := \int_0^\infty z^s e^{-z}\Big(1 + \tfrac14 z - \tfrac1{16}(z+1)\big(9z^3 - 12z^2 + 16z + 16\big) e^{-z}\Big)\, dz. \]

First, for $\Re(s) > -2$,

\[ T_3(s) = \Gamma(s+1)\Big(\tfrac14(s+5) - 2^{-s-9}(s+2)\big(9s^3 + 66s^2 + 163s + 362\big)\Big). \]

Note that $T_3(s)$ has no singularity at $s = -1$ (indeed, $T_3(-1) = -\tfrac{125}{128} + \log 2$). On the other hand, by an integration by parts,

\begin{align*} T_1(s) + T_2(s) &= \int_0^\infty z^{s-1} e^{-z}\, \tilde f_1\big(\tfrac z2\big)\big({-3z^3} + (3s+11)z^2 - 2(s-3)z - 4s\big)\, dz \\ &= \sum_{k \ge 1} 2^{k-1} \int_0^\infty z^{s-1} e^{-z}\, \tilde g_1\big(\tfrac{z}{2^k}\big)\big({-3z^3} + (3s+11)z^2 - 2(s-3)z - 4s\big)\, dz \\ &= \Gamma(s+1)\bigg( (3s+5) \sum_{k \ge 1} \Big(\frac{1}{(1+2^{-k})^{s}} - 1\Big) + \tilde h(s) \bigg), \end{align*}

which can be continued into the half-plane $\Re(s) > -2$ and then leads to (9). Note that $T_1(s) + T_2(s)$ likewise has no singularity at $s = -1$. Also, by (7), $|\tilde g^*(c + it)| = O\big(|t|^{c + \frac72}\, e^{-\frac{\pi}{2}|t|}\big)$ for large $|t|$ and $c > -2$. Thus the Fourier series expansion for $G_{RS}(t)$ is absolutely convergent. By the same Poisson-Charlier approach used in (Fuchs et al., 2014), we see that
\[ V(X_n) = \underbrace{\tilde V(n) - \frac n2 \tilde V''(n)}_{=O(n)} - \underbrace{\frac{n^2}{2}\, \tilde f_1''(n)^2}_{=O(1)} + O\big(n^{-1}\big), \]

where the $O$-terms can be made more precise by Mellin transform techniques (see (Flajolet et al., 1995)) as follows. First, by moving the line of integration to the right and collecting all residues encountered, we deduce that (with $\chi_k := \frac{2k\pi i}{\log 2}$)

\begin{align*} \tilde V(n) &= \frac{1}{2\pi i} \int_{-\frac32 - i\infty}^{-\frac32 + i\infty} \frac{\tilde g^*(s)}{1 - 2^{s+1}}\, n^{-s}\, ds \\ &= \frac{n}{\log 2} \sum_{k \in \mathbb Z} \tilde g^*(-1 + \chi_k)\, n^{-\chi_k} + \frac{1}{2\pi i} \int_{\frac12 - i\infty}^{\frac12 + i\infty} \frac{\tilde g^*(s)}{1 - 2^{s+1}}\, n^{-s}\, ds \\ &= n\, G_{RS}(\log_2 n) - \sum_{k \ge 1} 2^{-k}\, \tilde g\big(2^k n\big), \end{align*}

which is not only an asymptotic expansion but also an identity for $n \ge 1$. Here $G_{RS}(t)$ is a 1-periodic function with small amplitude, and the series over $k$ represents exponentially small terms; see (11). Similarly,

\[ n \tilde V''(n) = \frac{1}{\log 2} \sum_{k \in \mathbb Z} \chi_k(\chi_k - 1)\, \tilde g^*(-1 + \chi_k)\, n^{-\chi_k} - n \sum_{k \ge 1} 2^{k}\, \tilde g''\big(2^k n\big), \]

the first series being bounded and the second exponentially small for large $n$.

In particular, the mean value of the periodic function $G_{RS}$ is given by

\begin{align*} \frac{\tilde g^*(-1)}{\log 2} &= 1 - \frac{125}{128 \log 2} + 2 \sum_{k \ge 1} \log_2\big(1 + 2^{-k}\big) - \frac{1}{4 \log 2} \sum_{k \ge 1} \frac{3 \cdot 8^k + 10 \cdot 4^k - 34 \cdot 2^k - 14}{(2^k + 1)^4} \\ &= 1.82994\,99550\,89434\,82695\,96208\,44\ldots, \end{align*} \tag{12}

in accordance with the numerical calculations; see Figure 2.
Asymptotic normality. By applying either the contraction method (see (Neininger and Rüschendorf, 2004)) or the refined method of moments (see (Hwang, 2003)), we can establish the convergence in distribution of the centered and normalized random variables $(X_n - \mu_n)/\sigma_n$ to the standard normal distribution, where $\mu_n := E(X_n)$ and $\sigma_n^2 := V(X_n)$. The latter method is also useful in providing stronger results such as the following.

Figure 2: A plot (right) of $V(X_n)/n - c_0$ for $n$ from 12 to 256 in logarithmic scale, where $c_0 = \frac{1}{2(\log 2)^2}$ is the mean value of the second-order term (another periodic function). Without this correction term $c_0$, the fluctuations are invisible (left).
Theorem 3. The sequence of random variables $\{X_n\}$ satisfies a local limit theorem of the form

\[ P\big(X_n = \lfloor \mu_n + x \sigma_n \rfloor\big) = \frac{e^{-x^2/2}}{\sqrt{2\pi}\, \sigma_n} \bigg(1 + O\bigg(\frac{1 + |x|^3}{\sqrt{n}}\bigg)\bigg) \tag{13} \]

uniformly for $x = o\big(n^{1/6}\big)$.
Proof. (Sketch) The refined method of moments proposed in (Hwang, 2003) begins with introducing the normalized function

\[ \varphi_n(\theta) := e^{-\frac12 \sigma_n^2 \theta^2}\, E\big(e^{(X_n - \mu_n)\theta}\big) = e^{-\mu_n \theta - \frac12 \sigma_n^2 \theta^2}\, P_n(\theta). \]

Then $\varphi_0(\theta) = \varphi_1(\theta) = \varphi_2(\theta) = 1$ and

\[ \varphi_n(\theta) = 2^{-n} \sum_{0 \le k \le n} \binom{n}{k}\, \varphi_k(\theta)\, \varphi_{n-k}(\theta)\, e^{\Delta_{n,k}\theta + \delta_{n,k}\theta^2} \qquad (n \ge 3), \]

where $\Delta_{n,k} := n + \mu_k + \mu_{n-k} - \mu_n$ and $\delta_{n,k} := \frac12\big(\sigma_k^2 + \sigma_{n-k}^2 - \sigma_n^2\big)$. From this, we see that all Taylor coefficients $\varphi_n^{(m)}(0)$ satisfy recurrences of the same form (1) with different non-homogeneous parts. Then a good estimate for $|\varphi_n(\theta)|$ for small $\theta$ is obtained by establishing the uniform bounds

\[ \big|\varphi_n^{(m)}(0)\big| \le m!\, C^m\, n^{m/3} \qquad (m \ge 3), \]

for a sufficiently large constant $C > 0$. Such bounds are proved by induction using the Gaussian tails of the binomial distribution and the estimates

\[ \Delta_{n, \frac n2 + x\sqrt n},\ \delta_{n, \frac n2 + x\sqrt n} = O\big(1 + x^2\big), \]

uniformly for $x = o(\sqrt n)$ (the remaining range being completed by using the smallness of the binomial distribution). It then follows that

\[ \big|\varphi_n(i\theta) - 1\big| \le \sum_{m \ge 3} \frac{\big|\varphi_n^{(m)}(0)\big|\, |\theta|^m}{m!} = O\big(n |\theta|^3\big), \]

uniformly for $|\theta| = o\big(n^{-1/3}\big)$, or, equivalently,

\[ E\big(e^{X_n i\theta}\big) = e^{i\mu_n \theta - \frac12 \sigma_n^2 \theta^2}\Big(1 + O\big(n |\theta|^3\big)\Big), \tag{14} \]

for $\theta$ in the same range. Then another induction leads to the uniform estimate (see (Hwang, 2003) for a similar setting)

\[ \big|E\big(e^{X_n i\theta}\big)\big| \le e^{-\varepsilon (n+1) \theta^2} \qquad (|\theta| \le \pi;\ n \ge 3), \tag{15} \]

where $\varepsilon > 0$ is a sufficiently small constant. (We use $\varepsilon > 0$ as a generic symbol representing a sufficiently small number whose value may change from one occurrence to another.) These two uniform bounds are sufficient to prove the local limit theorem by standard Fourier analysis (see (Petrov, 1975)), starting from the inversion formula

\[ P(X_n = k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-ik\theta}\, E\big(e^{X_n i\theta}\big)\, d\theta, \]

and then splitting the integration range into two parts:
Z
Z
1
P.Xn D k/ D
C
e
1
1
2
jj6"n 3
"n 3 <jj6
By (15), the second integral is asymptotically negligible
ˇZ
ˇ
Z 1
ˇ
1 ˇˇ
ik
Xn i
e
E e
d ˇˇ D O
e
1
2 ˇ "n 13 <jj6
"n 3
The integral over the central range jj 6 "n
1
3
"n 2
ik
E e Xn i d:
d
DO n
1
6
1
e
"n 3
:
is then evaluated by (14) using
k D bn C xn c DW n C xn C n ;
n D O.1/;
giving
1
2
Z
jj6"n
1
3
e
ik
E e
Xn i
Z
X n n 1
1
i
ix
2
n
d D
e
E e
1CO n
d
2n jj6"n 61
Z 1
1
1
2
D
e ix 2 1 C O n 2 j 1 C j3
d
2n 1
x2
e 2
1 C jxj3
Dp
1CO
;
p
n
2 n
which completes the proof of (13).
Note that our estimates for the characteristic function of Xn also lead to an optimal BerryEsseen bound
ˇ ˇ
Z x
ˇ
ˇ
Xn n
1
t2
1
ˇ
sup ˇP
e 2 dt ˇˇ D O n 2 :
6x
p
n
2 1
x2R
16
Figure 3: Normalized (multiplied by standard deviation) histograms of the random variables Xn
for n D 15; : : : ; 50; the tendency to normality becomes apparent for larger n.
A simple improved version The first few terms of n and those of the expected bit complexity
of Algorithm FYKY are given in the following table.
Algorithm
E.RS/.D n /
E.FYKY/
2
1
1
3
4
5
5
8:29 12:1
3:67 5:67 9:27
6
16:3
12:9
7
20:7
16:4
8
25:3
19:4
9
30:1
24
10
35
28:6
and we see that for small n Algorithm RS is better replaced by Algorithm FYKY. The analysis of
the bit-complexity of these mixed algorithms (using FYKY for small n and RS for larger n) can be
done by the same methods used above but the expressions become more involved.
4
The bit-complexity of Algorithm FYvN
In this section, we analyze the bit-complexity of Algorithm FYvN (D Sandelius’s ORP in (Sandelius,
1962)), which is described in Introduction. Briefly, for each 2 6 k 6 n, select D dlog2 ke random bits (independently and uniformly at random), which gives rise to a number 0 6 u < 2 . If
u < k, use u as the required random number, otherwise repeat the same procedure until success.
Let Yn represent the total number of random digits used for generating a random permutation
of n element. Plackett showed (see (Plackett, 1968)), in the special case when n D 2 , that
E.Yn / 2n .log n
log 2/ ;
and
V.Yn / 2 .1
log 2/ n log22 n
2 log2 n C 1 :
(16)
In the section, we complete the analysis of Planckett of the mean and the variance for all n, and
establish a stronger local limit theorem for the bit-complexity.
Lemma 1. Let k WD dlog2 ke and Geok be a geometric random variable with probability of
success k=2k (with support on the positive integers). Then
X
d
Yn D
k Geok
.n > 1/:
(17)
16k6n
17
Proof. Observe that the number of random bits used for selecting each ck is a geometric random
variable Geok .
Expected value The mean of Yn satisfies
X k
2k :
k
E.Yn / D
16k6n
By splitting the range Œ1; n into blocks of the form .2j ; 2j C1 , we obtain the following asymptotic
approximation to E.Yn /.
Theorem 4. The expected number of random digits used by Algorithm FYvN to generate a random
permutation of n elements satisfies
Œ2
Œ1
.log2 n/ C O..log n/2 /;
.log2 n/n log n C nFvN
E.Yn / D FvN
(18)
Œ1
Œ2
where FvN
.t/ and FvN
.t/ are continuous, 1-periodic functions defined by
Œ1
FvN
.t/ WD 21
Œ2
FvN
.t/ WD
ft g
21
.1 C ft g/
ft g
.log 2/ 1 C ftg2 :
Proof. We start with the decomposition
X
E.Yn / D
.` C 1/2`C1
X
2` <j 62`C1
06`6n 2
1
C n 2 n
j
X
2n
1 <j 6n
1
:
j
By using the estimates
X
2` <j 62`C1
X
2n
1 <j 6n
1
D log 2
j
1
2`C2
1
n
D log j
2 n
CO 4
`
2 n
n2n
1
n
1
CO n
2
;
We deduce that
E.Yn / D log 2 C log
When n ¤ 2n , write n D 2n
E.Yn / D 21
n
1Cn
n 2n
1
2n n
2n C1 log 2 C O .log n/2 :
, where n WD flog2 ng. Then
.1 C n / n log n
21
n
.log 2/ 1 C n2 n C O .log n/2 ;
which is also valid when n D 2n . This completes the proof of (18) and Theorem 4.
Œ1
Œ1
Note that the dominant term can be written as FvN
./n log n D .log 2/FvN
./n log2 n, and the
Œ1
minimum value of .log 2/FvN ./ equals 2 log 2 1:38 > 1, which means that Algorithm FYvN
requires more random bits than Algorithm RS for large n; see (8).
18
Œ1
Œ2
Œ1
Œ2
Œ3
Figure 4: The periodic functions (from left to right) FvN
; FvN
; GvN
; GvN
; GvN
in the unit interval.
Variance Analogously, by (17), the variance of Yn is given by
X 2 k k
2k 2k :
V.Yn / D
2
k
16k6n
From this expression and a similar analysis as above, we can derive the following asymptotic
approximation to the variance whose proof is omitted here.
Theorem 5. The variance of Yn satisfies
Œ1
Œ2
Œ3
V.Yn / D GvN
.log2 n/n.log n/2 C GvN
.log2 n/n log n C nGvN
.log2 n/ C O .log n/3 ;
(19)
Œ1
Œ2
Œ3
where GvN
.t /; GvN
.t/ and GvN
.t / are continuous, 1-periodic functions defined by (see Figure 4)
Œ1
GvN
.t/ WD
21 ft g
3
.log 2/2
.log 2/.1 C ft g/
1
21
ft g
log 2 3
2
log 2
Œ2
GvN
.t/ WD 2.log 2/.1
Œ1
ftg/GvN
.t/
Œ3
GvN
.t/ WD .log 2/2 .1
Œ1
ftg/2 GvN
.t/ C .1
ft g
log 2/.1 C 2ftg/22
ft g
:
In particular, if n D 2n , then
V.Yn / D 2 .1
log 2/ n .log2 n/2
2 log2 n C 3 C O .log n/3 ;
where the last term inside the parentheses differs slightly from Plackett’s expression in (Plackett,
1968).
Asymptotic normality Since Yn is the sum of independent geometric random variables, we can
derive very precise limit theorems by following the classical approach; see (Petrov, 1975).
Theorem 6. The bit-complexity of Algorithm FYvN satisfies the local limit theorem
x2
j
k
p
e 2
1 C jxj3
P Yn D E.Yn / C x V.Yn / D p
1CO
p
n
2V.Yn /
1
uniformly for x D o n 6 .
19
Proof. By (17), the moment generating function of Yn satisfies (pk D k=2k )
Y
E e Yn D
16k6n
1
p k e k :
.1 pk /e k By induction, we see that the cumulant of order m satisfies
X
m
m
m
k pk polynomialm .pk / D O .n.log n/ /
(20)
.m D 1; 2; : : : /:
16k6n
From this we deduce that
3 Yn i n i 2
jj
E exp p
D exp p
CO p
;
(21)
2
n
V.Yn /
V.Yn /
p
uniformly for jj 6 " n. This estimate, coupling with the usual Berry-Esseen inequality, is
sufficient to prove an optimal convergence rate to normality. For the stronger local limit theorem,
it suffices to prove the bound
ˇ
ˇ
ˇE e Yn i ˇ 6 e "1 n 2 ;
(22)
uniformly for jj 6 , where "1 > 0 is a sufficiently small constant. Then the local limit theorem
follows from the same argument used in the proof of (13). To prove (22), a direct calculation from
(20) yields
Y
ˇ
ˇ
1
ˇE e Yn i ˇ D
q
1 C 2.1 pk /pk 2 .1 cos k /
16k6n
Y
1
6
p
1 C 2.1 pk /.1 cos k /
16k62n 1
Y
Y
1
:
D
q
k
`
1
1
C
2
.1
cos
`/
16`<n 16k<2
2`
x
1
For 0 6 x 6 4, we have the elementary inequality p1Cx
6 e 5 , so that
X
X k
ˇ
ˇ
ˇE e Yn i ˇ 6 exp 2
.1 cos `/
5
2`
`
1
16`<n 16k<2
1 X `
6 exp
.2
2/.1 cos `/ :
20
16`<n
By the inequality 2`
2 > 2`
for ` > 2, we then obtain
X
ˇ
ˇ
1
Y
i
`
ˇE e n ˇ 6 exp
2 .1 cos `/ 6 e
40
1
26`<n
20
1
40
2n n . /
;
where
n ./ WD
5
4 cos C cos n 2 cos.n
2.5 4 cos /
1/
:
By monotonicity and induction, we deduce that n ./ > 61 .1 cos / for n > 2; consequently,
ˇ
ˇ
1
1
2n .1 cos /
ˇE e Yn i ˇ 6 e 240
6 e 480 n.1 cos / ;
uniformly for jj 6 . But 1
5
cos >
2
2
2 for jj 6 , so that (22) follows.
The bit-complexity of Algorithm FYKY
We analyze the total number of bits used by Algorithm FYKY for generating a random permutation
of n elements. Let Bn denote the total number of random bits flipped in the procedure Knuth-Yao
of Algorithm FYKY for generating UnifŒ0; n 1. Then B2k D k.
Lemma 2. The probability generating function E t Bn of Bn satisfies
X n 2k Bn
E t
D 1 .1 t/
tk
.n D 2; 3; : : : /:
(23)
k
2
n
k>0
Proof. The probability that the algorithm does not stop after k random flips is given by
n 2k
P.Bn > k/ D k
.k D 0; 1; : : : /;
2
n
because after the first k random coin-tossings (2k different configurations) there are exactly 2k mod
˚ k
n D n 2n cases that the algorithm does not return a random integer in the specified interval
Œ0; n 1.
˚ k
Throughout this section, write Lx WD blog2 xc for x > 0 and L0 WD 0. Since 2nk 2n D 1 for
0 6 k 6 Ln when n ¤ 2Ln , we obtain
X n 2k Bn
Ln C1
E t
Dt
C .t 1/
tk
.n ¤ 2Ln /:
2k n
k>Ln
We see that Bn is close to Ln C 1, plus some geometric-type perturbations.
For computational purposes, the infinite series in (23) is less useful and it is preferable to use
the following finite representation. Let .n/ denote Euler’s totient function (the number of positive
integers less than n and relatively prime to n).
Corollary 1. For n > 2
8
Bn 2 ;
t
E
t
ˆ
<
1 t
E t Bn D
ˆ
:1
t .n/
1
2
if n is evenI
X
06k<.n/
21
k
2 mod n k
t ; if n is odd:
2k
(24)
Proof. This follows from (23) by grouping terms containing the same fractional parts.
Let Zn D B1 C C Bn represent the total number of bits required by Algorithm FYKY for
generating a random permutation of n elements.
5.1
Expected value of Zn
By (23), we have
n WD E.Zn / D
X
am ;
16m6n
where
X 2k n
:
an WD E.Bn / D
n 2k
k>0
This sequence has been studied in the literature; see (Knuth and Yao, 1976; Pokhodzeı̆, 1985;
Lumbroso, 2013). Obviously,
X
X 2k n
;
an D
1C
n 2k
06k6Ln
k>Ln
when n ¤ 2Ln , so we obtain the easy bounds
Ln 6 an 6 Ln C 1 C
n
2Ln C1
;
and an D log2 n C O.1/.
Lemma 3. For n > 1
8
ˆ
<a n2 C 1;
2.n/
an D
ˆ
: 2.n/ 1
if n is even;
j
X
06j <.n/
2 mod n
;
2j
if n is odd:
Proof. These relations follow from (24).
Corollary 2. For n > 1
an D log2 n C F0 .log2 n/
.n > 1/;
(25)
for some 1-periodic function F0 ; see Figure 5.
A formal expansion for F0 .t/ was derived in (Lumbroso, 2013). We prove the following estimate for n .
22
Figure 5: Periodic fluctuations of an log2 n in log-scale (left) and normalized in the unit interval
(right). The largest value achieved by the periodic function in the interval n 2 Œ2k ; 2kC1  is at
n D 2k C 1 for which a2k C1 k D 2 2kkC1 , which approaches 2 for large k.
Figure 6: Periodic fluctuations of
interval (right).
n C 12 log2 n
n
1
3
log2 n in log-scale (left) and FKY in the unit
Theorem 7. The expected number n of random bits required by Algorithm FYKY satisfies
n D n log2 n C nFKY .log2 n/ C O .log n/2 ;
where FKY .t/ is a continuous 1-periodic function whose Fourier expansion is given by (k WD
FKY .t/ D
1
1 X .k C 1/
C
e
2 log 2 log 2
2k 1
k¤0
„ ƒ‚ …
2k i t
.t 2 R/;
(26)
2k i
)
log 2
(27)
0:33274
the series being absolutely convergent. Here .s/ denotes Riemann’s zeta function.
2
Note that j.1 C i t/j D O .log jtj/ 3 for large jtj; see (Titchmarsh, 1986). Our method of
proof is based on approximating the partial sum n by an integral
Z x
X 2k x
M.x/ WD
a.t/ dt; where a.x/ WD
.x > 0/;
x 2k
0
k>0
and estimating their difference. Obviously, an D a.n/ for integer n > 0. The asymptotics of M.x/
is comparatively simpler and can be derived by standard Mellin transform techniques; see (Flajolet
et al., 1995). Indeed, we derive an asymptotic expansion that is itself an identity for x > 1.
23
Proposition 1. The integral M.x/ satisfies the identity
M.x/ D x log2 x C xFKY .log2 x/ C
2
;
12
(28)
for x > 1, where FKY is given in (27).
Proof. We start with the relation
(
a.x/ D a
x
2
C
1;
if x > 1I
˚1
x x ; if 0 < x 6 1:
Then for x > 1
M.x/
2M
x
2
x
Z
Z
a.t/ dt
D
x
2
a.t/ dt
2
0
0
x
Z
a.t/ a 2t dt
Z 1
˚ 1C
t 1t dt:
D
0
Dx
0
The last integral is equal to
Z
Z 1
˚1
t t dt D
0
1
1
X
ftg
dt
D
t3
j >1
Thus, M.x/ satisfies the functional equation
M.x/ D 2M x2 C x
Z
1
0
t
dt D 1
.j C t/3
2
;
12
2
:
12
.x > 1/;
(29)
2
M.x/ 12
log2 x is a periodic function, namely, MN .2x/ D MN .x/
which implies that MN .x/ WD
x
for x > 1, or, equivalently (28), and it remains to derive finer properties of the periodic function
FKY . For that purpose, we apply Mellin transform.
First, the integral M is decomposed as
X Z 1 ft g
X Z x t 2k dt D
2k k 3 dt:
(30)
M.x/ D
k
2
t
t
0 2
x
k>0
k>0
Then the Mellin transform of M.x/ can be derived as follows (assuming 2 < <.s/ <
Z 1
Z 1
Z 1
X Z 1
X
ftg
ftg
k
s 1
k.sC1/
s 1
2
x
dt
dx
D
2
x
dt dx
3
3
2k
t
t
0
0
x
x
k>0
k>0
Z 1
Z
X
ftg t s 1
k.sC1/
D
2
x
dx dt
t3 0
0
k>0
Z
1 X k.sC1/ 1 ftg
D
2
dt
s
t sC3
0
k>0
D
.s C 2/
;
s.s C 2/.1 2sC1 /
24
1):
where we used the integral representation for .s C 1/ (see (Titchmarsh, 1986, p. 14))
Z 1
ftg
.s C 1/ D .s C 1/
dt
. 1 < <.s/ < 0/:
t sC2
0
All steps here are justified by absolute convergence if 2 < <.s/ < 1. We then have the inverse
Mellin integral representation
Z 32 Ci1
.s C 2/
1
M.x/ D
x s ds
sC1
3
2 i
s.s C 2/.1 2 /
2 i1
Z 12 Ci1
.s C 1/
1
D
x 1 s ds
.x > 0/:
2
s/
1
2 i
.s
1/.1
2
i1
2
Move now the line of integration to the right using known asymptotic estimates for j.s/j (see
(Titchmarsh, 1986, Ch. V))
(
1
O jtj 2 .1 c/C" ; if 0 6 c 6 1I
(31)
j.c C i t/j D
2
O .log jtj/ 3 ;
if c D 1;
as jt j ! 1. A direct calculation of the residues at the poles (a double pole at s D 0 and simple
poles at s D k , s D 1) then gives
Z 21 Ci1
.s C 1/
1
2
x 1 s ds D x log2 x C xFKY .log2 x/ C 12 C .x/;
2
s
1
2 i
.s
1/.1 2 /
2 i1
for x > 0, where FKY is given in (27) and .x/ is give by
Z 23 Ci1
1
.s C 1/
.x/ WD
x1
2
s
3
2 i 2 i1 .s
1/.1 2 /
To evaluate this integral, we use the relations
Z cCi1
1
x s
0;
ds
D
x
2 i c i1 1 s 2
2
1
;
2x
s
if x > 1I
if 0 < x 6 1;
ds:
(32)
.c > 1/;
by standard residue calculus (integrating along a large half-circle to the right of the line <.s/ D c
if x > 1, and to the left otherwise). With this relation, we then have
8
ˆ
if x > 1I
ˆ 0; X <
1
1
2k 1 x 2
; if 0 < x 6 1:
.x/ D
k
ˆ
2 k
2 `2
ˆ
: 2 `x61
k;`>1
s
by expanding the zeta function and 1 12s D 1 2 2 s in Dirichlet series and then integrating term by
term. Note that the double sum expression for .x/ can be simplified but we do not need it. Also
.x/ D 0 for 12 < x 6 1.
25
Observe that the Fourier series expansion (27) converges only polynomially. We derive a different expansion for FKY , with an exponential convergence rate.
Lemma 4. The periodic function FKY has the series expansion
X
b2k ft g c
ft g
2
2k
FKY .t/ D 1 ftg
2
C
1
6
2kC1 ft g
1 ft g
0
k ft g
2
˘
C1
;
(33)
k>1
denotes the digamma function (derivative of log €) and
for t 2 R, where
0
.k C 1/ D
1
j >k j 2 .
P
For large k, we have
b2k ft g c
2kC1 ft g
1
since
0
.k C 1/ D
2k
1 ft g
1
2k 2
1
k
C
0
62k 6k C 1
CO 2
2kC2 ft g
3 22.kC1 ft g/
˚
C O k14 , where k WD 2k ft g .
1
6k 3
2k
ft g
˘
Proof. By (30), we have, for x > 0,
0 k˘
Z 2x C1 Z 1
X
k@
M.x/ D
2
C k˘
k
2
x
2
x
k>0
0
X BZ 1
t
D
2k @ n k o ˘
2
2k
k>0
x
X
D
x
k>0
1
C1 D
x
3k
1
A ftg dt
t3
C1
1
Ct
x 2 2k
2kC1 x
X
3 dt C
2
k 1
0
0
k˘
j>
2
x
2k
x
1
Z
t
C
dt A
.j C t/3
C1
C1
:
Now if x ¤ 2m , then (Lx WD blog2 xc)
X
M.x/ D
x
2
12
06k6Lx
2
k
x 2 2k
2kC1 x
X C
x
k>Lx C1
2
D xLx C x
2Lx C 12
6
kCLx X
x
2
Cx
1
2kCLx C1
x
2
k 1
0
2k
x
2kCLx
x
1
0
2kCLx
x
C1 ;
which also holds for x D 2m , and in that case we have
M.2m / D m2m C
0
$
2
C1
2
6
2m C
2
;
12
1 C 6 , where (see Section 5.3)
X
$ WD
1 2k 0 .2k C 1/ 0:44637 64113 48039 93349 : : : :
2
.2/ D
k>1
This proves (33) by writing Lx D log2 x
C1
2
k>1
by using
flog2 xg.
26
;
Note that if we use the expression (33) for FKY .t/, then the identity (28) holds for x > 0.
We turn now to estimating the difference between n and M.n/.
Proposition 2. The difference n
M.n/ satisfies
n
M.n/ D O .log n/2 :
(34)
Proof. We have (defining a.0/ D 0)
n
M.n/ D
X Z
06m6n
1
.a.m/
a.m C t// dt C O.log n/:
0
Now
X 2k m 2k m C t a.m C t/ D
m 2k
mCt
2k
k>Lm
k k X
X m
X
2
2
1
m
CO
C
D
2k
m
mCt
2k
2k
k>Lm
k>2Lm
Lm 6k<2Lm
k k X
m
2
1
2
D
CO
:
k
2
m
mCt
m
a.m/
Lm 6k<2Lm
Thus
n
M .n/ D
X
X
26m6n Lm 6k<2Lm
m
2k
Z 1 0
2k
m
By writing fxg D x bxc, we then obtain
Z 1
Z k m 1 2k
2
t
dt D
dt
k
2 0
m
mCt
0 mCt
m
2k
2k
mCt
Z 1 0
dt C O.log n/:
2k
m
2k
mCt
The first integral on the right-hand side is bounded above by
X
Z 1
X
X
log m
t
dt D O
D O .log n/2 :
m
0 mCt
26m6n
26m6n Lm 6k<2Lm
It remains to estimate the double-sum
X
X
M1 WD
Z k 2
m 1 2k
dt
k
2 0
m
mCt
26m6n Lm <k<2Lm
k k X
X
m
2
2
6
k
2
m
mC1
26m6n Lm <k<2Lm
k k X
X
m
2
2
D
:
k
2
m
m
C
1
36k<2Ln b k cC1
2 2
6m6minf2k ;ng
27
dt:
For a fixed k, the difference
in the interval
2k ˘
m
2k
mC1
˘
assumes the value 1 if there exists an integer q lying
2k
2k
<q6 ;
mC1
m
k˘
and 2m
inequality
2k
mC1
m
6 q1 .
2k
(35)
˘
assumes the value 0 otherwise. For those m satisfying (35), we have the
It follows that
m
2k
X
2b 2 c
k
C1
2k
m
2k
mC1
6
6m6minf2k ;ng
X 1
D O.k/;
q
k
16q62
and, consequently,
M1 D O
X
k
D O .log n/2 :
36k62Ln
This proves the proposition.
5.2
Variance of Zn
Let bn D E.Bn2 / D Bn00 .1/ C Bn0 .1/.
Lemma 5. For n > 1
8
ˆ
<b n2 C 2a n2 C 1; X 2k mod n
2k C 1
21 .n/ .n/
bn D
C
;
ˆ
:
2k
1 2 .n/
.1 2 .n/ /2
06k<.n/
Proof. By (23) and (24).
Note that the variance vn WD bn
a2n of Bn satisfies the recurrence
v2n D vn
.n > 1/:
On the other hand, we also have
bn D
X
k>0
n 2k
;
.2k C 1/ k
2
n
and we will derive an asymptotic approximation to the variance of Zn
X
X
&n2 WD V.Zn / D
vm D
bn a2n :
26m6n
28
26m6n
if n is evenI
if n is odd:
Figure 7: Periodic fluctuations of the variance of Bn (D bn
(left) and for n D 29 ; : : : ; 210 (right).
a2n ) in log-scale for n D 2; : : : ; 210
Theorem 8. The variance of the total number of random bits flipped to generate a random permutation by Algorithm FYKY satisfies
&n2 D nGKY .log2 n/ C O..log n/3 /;
(36)
where GKY .u/ is a continuous, bounded, periodic function of period 1 defined by
Z 2fug j
X
1 fug
j fug
GKY .u/ D v0 2
C
2
g.t/ dt:
j >1
Here v0 WD
R1
0
(37)
0
g.t/ dt and
g.x/ WD 1
1
1
x
x
2a
Cx
:
x
2
x
(38)
Numerically, v0 0:47021 47736 99741 30560 : : : ; see Section 5.3 for different approaches
of numerical evaluation. This theorem will follow from Propositions 3 and 5 given below.
Similar to the case of n , a good approximation to &n2 is given by the integral
Z x
Z x
V .x/ WD
v.t / dt D
b.t/ a.t/2 dt;
0
where v.x/ WD b.x/
0
a.x/ represents a continuous version of vn and
X
x 2k
:
b.x/ WD
.2k C 1/ k
2
x
2
k>0
Now consider
v.x/ D
X
k>0
x 2k
.2k C 1/ k
2
x
!2
X x 2k 2k x
k>0
X
x k
2
1
Dx
C
.2k C 3/ 2k x
x
2
2
k>0
X x k !2
2
1
2
x
C
;
x
k
x
2
2
From this relation, we derive the following functional equation.
29
k>0
Lemma 6. For x > 0
V .x/
2V
x 2
(39)
0
v.x/ D v
v.x/ D v
g.t/ dt:
D
Proof. If x > 1, then
if 0 < x 6 1, then
minf1;xg
Z
x x 2
2
I
C g.x/;
where g is defined in (38).
We now show that this functional equation leads to an asymptotic approximation that is itself
an identity, as in the case of M.x/.
Proposition 3. The integral V .x/ satisfies
V .x/ D xGKY .log2 x/
v0 ;
(40)
for x > 1, where GKY is defined in (37).
Proof. By a direct iteration of (39), we obtain
V .x/ D v0 2
Lx C1
1 C
X
j >1
2
Lx Cj
Z
x
2Lx Cj
g.t/ dt;
0
˚ for x > 1, where the sum is absolutely convergent because (a.x/ D O.x/ and x x1 D O.x/)
Z x
Z x
t
1
1
g.t/ dt D
1 t
2a
Ct
dt D O x 2 ;
(41)
t
2
t
0
0
as x ! 0. Now writing x D 2Lx Cx , where x WD flog2 xg, we obtain (40). Note that GKY .0/ D
limu!1 GKY .u/, and GKY is continuous on Œ0; 1. On the other hand, since g.x/ D O.x/, its
integral is of order x 2 as x ! 0, which implies that the series in (37) is absolutely convergent.
Accordingly, GKY is a bounded periodic function.
P
Proposition 4. The Fourier coefficients of GKY .u/ D k2Z gk e 2k iu can be computed by
Z 1
1
gk D
g.t/t k 1 dt
.k 2 Z/;
(42)
.log 2/.k C 1/ 0
the series being absolutely convergent. In particular, the mean value g0 is given by
2
1
1
2 2 X k.k C 1/. k C 1/
2
g0 D
C
2
1
2
24 2.log 2/2 6
.log 2/3
sinh 2k
log 2
k>1
1:55834 75820 73324 42639 35697 76811 51355 37715 91606 58602 30
(43)
where 1 is a Stieltjes constant:
.log m/2
2
X log j
j
1 WD lim
m!1
26j 6m
!
0:72815 84548 36767 24860 : : : :
Note that the terms in the series in (43) are convergent extremely fast with the rate
k
2k 2
4
4
3
k.log k/ exp
k.log k/ 3 2:33 1012 ;
log 2
(44)
by (31). Furthermore, the mean value (43) is smaller than that (12) of Algorithm RS.
Proof. By definition,
1
Z
gk D v0
2k iu 1 u
e
2
du C
0
gk0 D
2
X Z
j >1
j
1
Z
C
0
0
2k iu j u
e
2u
Z
j
g.t/ dt du:
2
0
The second term gk0 can be simplified as follows.
21
Z
1
0
j >1
1
.
.log 2/.k C1/
The first term equals
XZ
2
j
Z
j
X
1
D
2j
.log 2/.k C 1/
j >1
!
1
g.t /e
2k iu j u
2
du dt
j Clog2 t
1
2
Z
j
21
Z
g.t / dt C
j
!
k 1
g.t/ t
0
2
2j
j
1
dt :
By summation by parts, we see that
X
j 1
2
Z
2
j >1
j
g.t/ dt D
0
X
.2
j
1/
g.t/ dt
2
j >0
D
X
2
j 1
Z
21
j
1
j
Z
g.t/ dt
2
j >1
j
2
Z
j
1
g.t / dt:
0
Thus, we obtain (42). The proof of (43), together with different numerical procedures, will be
given in the next section.
We now show that &n2
V .n/ is small.
Proposition 5. The differencebetween the variance &n2 and its continuous approximation V .n/ is
bounded above by O .log n/3 .
31
Figure 8: Periodic fluctuations of
(right).
&n2 C2 log2 nC3
n
in log-scale for n D 27 ; : : : ; 211 (left) and GKY .u/
Proof. The proof is similar to that of Proposition 2. By definition,
&n2
V .n/ D
06m6n
D
1
X Z
.v.m/
1
X Z
06m6n
v.m C t// dt C O.1/
0
b.m/
b.m C t/
a.m/2
a.m C t/2
dt C O.1/:
0
Now divide the sum of terms into three parts:
&n2
V .n/ D 2W1 .n/ C W2 .n/ C W3 .n/ C O.1/;
where
W1 .n/ D
06m6n
W2 .n/ D
0
k>1
a.m/
k mCt
2
k k
dt
2
mCt
a.m C t/ dt
0
1
X Z
06m6n
X m 2k k k
2 m
1
X Z
06m6n
W3 .n/ D
1
X Z
a.m C t /2 dt:
a.m/2
0
We already proved in Proposition 2 that W2 .n/ D O .log n/2 . On the other hand,
W3 .n/ D
X Z
06m6n
1
a.m/
a.m C t / a.m/ C a.m C t/ dt
0
X Z 1ˇ
ˇa.m/
D O .log n/
06m6n
0
D O .log n/3 ;
32
ˇ
a.m C t/ˇ dt
by Proposition 2. For W1 .n/, we again follow exactly the same argument used in proving Proposition 2 and deduce that
Z k X
X
2
m 1 2k
dt C O.log n/
W1 .n/ D
k k
2 0
m
mCt
06m6n Lm 6k<2Lm
!
X
X
X
X k
k
DO
C
C O.log n/
mC1
q
06m6n Lm 6k<2Lm
16k62Ln 16q62k
3
D O .log n/ :
Theorem 8 now follows from Propositions 3, 4 and 5. It remains to prove the more precise
expression (43) for the mean value g0 and other Fourier coefficients gk .
5.3
Evaluation of gk
We show in this part how the coefficients g0 and gk with k ¤ 0 can be numerically evaluated to
high precision. For that purpose, we will derive a few different expressions for them, which are
of interest per se. We focus mainly on g0 , and most of the approaches used also apply to other
constants or coefficients appeared in this paper.
The mean value of GKY
The mean value of GKY is split, by (38), into two parts
Z 1
g 0 C g000
1
g.t/
g0 D
dt DW 0
;
log 2 0 t
log 2
where
g00
Z
1
1
WD
0
Z 1
1
1
ftg
dt D
t
t
t
t2
1
and
g000
1
Z
1
1
t
WD 2
0
ftg2
t3
dt D
2
12
1
;
2
1
t
t
a
dt:
t
2
Lemma 7.
g000
D
X
k>1
k
Z
2
0
1
1
e 2k t
1
t
et
1
1
dt:
Proof. By definition and direct expansions
X Z 1 1 2k 1
00
g0 D 2
1
t
dt
k
t
t
0 2
k>1
X X Z 1
2k j t
D2
3 dt:
0
2k j C ` C t
k;j >1 06`<2k
33
(45)
Now, by the integral representation
x
s
1
D
€.s/
1
Z
e
xu s 1
u
.x; <.s/ > 0/;
du
0
we see that
2
X X Z
j >1 06`<2k
1
0
jt
1
Z
2k j C ` C t
3 dt D
2
u
0
X
2k j u
je
j >1
Z
1
u
e 2k t
0
e
`u
1 eu
1
1
Z
te
tu
dt du
0
06`<2k
1
D
X
1 du:
This proves (45).
Lemma 8.
g000 D 2
X
`>3
where h` D 2hd ` e C
2
˙`
2
h`
;
`2 .` 1/
(46)
1 for ` > 2 with h0 D h1 D 0.
The first few terms of h` are
fh2` g`>1 D fh2` 1 g`>1 D f0; 1; 4; 5; 12; 13; 16; 17; 32; 33; 36; 37; 44; 45; 48; 49; 80; g ;
which correspond to sequence A080277 in Sloane’s OEIS (Online Encyclopedia of Integer Sequences), and is connected to partial sums of dyadic valuation.
Proof. Inverting (45) using Binet’s formula (see (Erdélyi et al., 1953, ~1.9))
Z 1
t
0
1 z .z C 1/ D z
1 e zt dt;
t
e
1
0
we get
g000
D
X X 1
k>1 j >1
Since
1
m
0
k
2
j
.m C 1/ D
0
.2 j C 1/ :
X
`>mC1
k
1
`2 .`
1/
;
by grouping terms with the same number, we get
X
X1
00
0
g0 D 2
.m C 1/
2k ;
m
k
m>2
2 jm
k>1
which then implies (46).
34
(47)
First approach: k 1 convergence rate The most naive approach to compute g0 consists of
evaluating exactly the first k > 1 terms of the series (46) and adding the error by an asymptotic
estimate of the remainders. More precisely, choose k sufficiently large and then split the series
into two parts depending on ` < k and ` > k. Since h` D 12 ` log2 ` C O.`/ for large `, we see
that the remainder is asymptotic to
2
X
`>k
with an additional error of order k
X log `
h`
log2 k
2
;
2
2
` .` 1/
`
k
`>k
1
. But such an approach is poor in terms of convergence rate.
A better approach to compute g000 from (46) consists
Second approach: 3 k convergence rate
in expanding the series
X
`>3
X
h`
D
D1 .k/;
`2 .` 1/
D1 .s/ WD
where
X h`
`>3
k>3
`s
;
and then evaluate D1 by the recurrence relation of h` , namely,
D1 .s/ D
X 2h` C `
.2`/s
`>1
D
1
2 .s
1
Since D1 .k/ D O.3
k
2/
1
C
X 2h` C `
`>1
.1
.2`
2 /.s
s
1
1/s
.s/ C 2
1/
1 D1 .s C j /
:
2sCj
X s C j
j
j >1
/ for large k, the terms in such a series converge at the rate O.j <.s/ 1 6 j /.
Third approach: k5 k convergence rate We can do better by applying the 21 -balancing technique introduced in (Grabner and Hwang, 2005), which begins with the relation
˘
X
X . 1/k k C 1
X h`
h`
2
D
D
.k
C
3/;
where
D
.s/
WD
s :
2
2
`2 .` 1/
2k
` 12
`>3
`>3
k>0
Here the convergence rate is of order k5
k
. So it suffices to compute D2 .j / for j > 3. Now
1 X 2h` C ` 1
C
1 s
3 s
2`
2`
2
2
`>1
`>1
X h` 1
1 s
D2
1
1
1 s
4
`
`
2
2
`>1
D2 .s/ D
X 2h` C `
s
C 1C
s
1
4 `
1
2
where
Z.s/ WD s
1; 14 C s
1; 34
35
1
4
s; 41
3
4
s; 34 :
C 2 s Z.s/;
Thus, we obtain the recurrence
1 D2 .s C 2j /
;
16j
X s C 2j
Z.s/
1
D2 .s/ D
C
4.2s 2 1/ 2s 2 1
2j
j >1
where the convergence rate is now improved to O.j <.s/ 1 100 j /. In this way, we obtain the
g 0 Cg 00
numerical value in (43) since g0 D 0log 20 .
Such an approach is generally satisfactory. But for our g0 it turns out that a very special
symmetric property makes the identity (43) possible, which is not the case for other constants
appearing in this paper (e.g., v0 and $).
4
2k 2
Fourth approach: k.log k/ 3 e log 2 convergence rate Instead of the elementary approach used
above, we now apply Mellin transform to compute the Fourier series of GKY . We start with defining
VN .x/ WD V .x/ C v0 . Then, by (39),
(
x 0;
if x > 1I
R1
D
VN .x/ 2VN
2
g.t/ dt; if 0 < x 6 1:
x
From this it follows that the Mellin transform V .s/ of VN .x/ satisfies
V .s/ 1 2sC1 D g .s/;
where
g .s/ WD
Z
1
x
s 1
Z
1
g.t/ dt dx D
0
x
1
s
Z
1
g.t /t s dt:
0
By (41), we see that g .s/ is well-defined in the half-plane <.s/ > 2. Thus, we anticipate the
same expansion (40) with the Fourier coefficients (42). What is missing here is the growth order
of jg .c C i t/j for c > 2 as jtj ! 1, which can be obtained by the integral representation (48)
below.
By (38), we first decompose g into two parts:
Z 1
t
1
1 1 1
1 t
2a
Ct
t s dx DW
g1 .s/ C g2 .s/ ;
g .s/ D
s 0
t
2
t
s
where
1
1
t s
a
t dt
D2
1 t
t
2
0
Z 1
1
1 sC1
g2 .s/ D
1 t
t
dt:
t
t
0
g1 .s/
Z
36
The second integral is easier and we have
Z 1
ftg2
.s C 3/
ft g
dt D
g2 .s/ D
sC3
sC4
t
t
sC3
1
.s C 1/.s C 2/
;
.s C 2/.s C 3/
for <.s/ > 2 (when s D 1, the last term is taken as the limit 12 ).
Consider now g1 .s/. The following integral representation is crucial in proving (43).
Lemma 9. For <.s/ >
g1 .s/
2,
2
1
D
€.s C 4/ 2 i
Z
cCi1
c i1
€.w C 1/.w C 1/€.s w C 2/.s
1 2 w
w C 2/
dw;
(48)
where 1 < c < <.s/ C 1.
Proof. By straightforward expansions as above
g1 .s/
X
2
2k.sC2/
€.s C 4/
D
Z
k>1
Since
Z
1
uw
1
u
0
1
usC1
e 2k u
1 eu
1 du D €.w C 1/.w C 1/
u
1
1 du:
. 1 < <.w/ < 0/;
eu 1
we obtain the Mellin inversion representation
0
u
eu
1
1
1D
2 i
Z
cCi1
€.w C 1/.w C 1/u
w
dw
.c 2 . 1; 0//:
c i1
Substituting this into (49), we obtain (48).
Proof of (43)
Taking s D
g1 .
1
1/ D
2 i
1 in (48), we get
1
2 Ci1
Z
1
2
i1
€.w C 1/.w C 1/€. w C 1/. w C 1/
dw
1 2 w
D R1 C J 2 ;
where R1 sums over all residues of the poles on the imaginary axis and
1
J2 WD
2 i
D
1
2 Ci1
Z
1
2 i
1
2
i1
€.w C 1/.w C 1/€. w C 1/. w C 1/
dw
1 2 w
1
2 Ci1
Z
1
2
i1
€.w C 1/.w C 1/€. w C 1/. w C 1/
dw:
1 2w
37
(49)
The last integral is almost identical to g1 . 1/ except the denominator for which we write
1
2w
1
Thus J2 D
D
1C
1
1 2
w
:
g1 . 1/ C J3 , where
1
J3 WD
2 i
Z 1
D
0
1
2
eu
6
€.w C 1/.w C 1/€. w C 1/. w C 1/ dw
i1
1
2
D1
1
2 Ci1
Z
u
1 eu
1
1 du
:
Collecting these relations, we see that
g1 . 1/ D
R1
J3
C ;
2
2
and
g . 1/ D g1 . 1/ C g2 . 1/ D
because g2 . 1/ D
axis:
2
12
1
2
J3
.
2
It remains to compute the residues of the poles on the imaginary
€.w C 1/.w C 1/€. w C 1/. w C 1/
Res
1 2 w
wDk
k2Z
2
X
2
log 2
1
2k .k C 1/. k C 1/
;
D
C
2 21
2
24
2 log 2 6
.log 2/2 sinh 2k
log 2
k>1
R1
g . 1/ D
D
2
D
R1
;
2
X
where 1 is defined in Proposition 4. The terms in the series are convergent at the rate (44), and is
much faster than the previous three approaches:
g0 D
g . 1/
1:55834 75820 73324 42639 35697 76811 51355 377159 16065 86021
log 2
33003 19983 06704 40332 28575 51733 41447 78391 56441 48117 : : :
(using only 18 terms of the series, one gets an error less than 1:8 10 108 ). Also the dominant
term alone, namely,
2
1
1
2
C
21 1:55834 75821 66122 : : : ;
24 2.log 2/2 6
gives an approximation to g0 to within an error less than 9:3 10
38
11
.
Consider now g1 . 1 C k / when k ¤ 0. Similarly, by (48) with
Calculation of gk for k ¤ 0
s D 1 C k , we have
g1 . 1 C k / D R2 C J4 ;
where R2 denotes the sum of all residues of the poles on the imaginary axis and
1
2
J4 WD
€.3 C k / 2 i
1
2 Ci1
Z
1
2
€.w C 1/.w C 1/€.1 C k
1 2 w
i1
By the change of variables w 7! k
J4 D
D
2
1
€.3 C k / 2 i
1
2
w/
dw:
w, we get
1
2 Ci1
Z
w/.1 C k
i1
€.w C 1/.w C 1/€.1 C k
1 2w
w/.1 C k
w/
w/.1 C k
w/ dw
dw
g1 . 1 C k / C J5 ;
where
Z 21 Ci1
2
1
J5 WD
€.w C 1/.w C 1/€.1 C k
1
€.3 C k / 2 i
2 i1
Z 1 k 2
u
u
D
1 du
€.3 C k / 0 e u 1 e u 1
k .k C 1/
.k C 2/
;
D2
.k C 2/.k C 1/
k C 2
which equals 2g2 . 1 C k /. Then
g
k
g . 1 C k /
R2
D
.log 2/. k C 1/
2.log 2/. k C 1/
0
0
2 .k C 1/ C .k C 1/.k C 1/
D
.log 2/2 .2k 1/.k C 2/
X €.kCj C 1/.kCj C 1/€. j C 1/. j C 1/
2
C
.log 2/2
.k 1/€.k C 3/
D
(50)
j >1
C
1
.log 2/2
X
16j 6k 1
€.j C 1/.j C 1/€.k j C 1/.k
.k 1/€.k C 3/
j
C 1/
:
By the order estimate (7) for Gamma function and (31) for -function (which implies that j 0 .1 C
5
i t/j D O .log jtj 3 /, we deduce that
gk D O k
2
5
.log k/ 3 ;
for large jkj, so that the Fourier series of GKY is absolutely convergent.
39
(51)
5.4
Asymptotic normality of Zn
We prove in this section the asymptotic normality of the bit-complexity Zn of Algorithm FYKY.
Such a result is well anticipated because Zn D B1 C C Bn and each Bk is close to Lk C 1 with
a geometric perturbation having bounded mean and variance. Indeed, we can establish a stronger
local limit theorem for Zn .
Theorem 9. The bit-complexity Zn of Algorithm FYKY satisfies a local limit theorem of the form
x2
1 C jxj3
e 2
;
1CO
P .Zn D bn C x&n c/ D p
p
n
2 &n
1
uniformly for x D o n 6 , where n WD E.Zn / and &n2 WD V.Zn /; see (26) and (36).
(52)
Proof. Since $Z_n$ is the sum of $n$ independent random variables, the $r$-th cumulant of $Z_n$, denoted by $K_r(n)$, satisfies
\[
K_r(n) = \sum_{2\le m\le n}\kappa_r(m) \qquad(r\ge1),
\]
where $\kappa_r(m)$ stands for the $r$-th cumulant of $B_m$. To show that the $\kappa_r(m)$ are bounded for all $m$ and $r\ge2$, we observe that $E\bigl(t^{B_n}\bigr)$ can be extended to any $x>0$ by defining
\[
B(x,t) := 1-(1-t)\sum_{k\ge0}\Bigl(1-\Bigl\lfloor\frac{2^k}{x}\Bigr\rfloor\frac{x}{2^k}\Bigr)t^k \qquad(x>0),
\]
so that $E\bigl(t^{B_n}\bigr) = B(n,t)$. Also $B(x,t) = tB\bigl(\frac{x}{2},t\bigr)$ for $x>1$, and the cumulants $\kappa_r(x) := r!\,[s^r]\log B(x,e^s)$ are well-defined. It follows that, for $x>1$,
\[
\kappa_r(x) = r!\,[s^r]\Bigl(s+\log B\Bigl(\frac{x}{2},e^s\Bigr)\Bigr) = \kappa_r\Bigl(\frac{x}{2}\Bigr)
\]
for $r\ge2$, which then implies that $\kappa_r(x) = \kappa_r\bigl(x2^{-L_x-1}\bigr)$ for $x>1$. It remains to prove that $\kappa_r(x) = O(1)$ for $x\in(0,1)$. Note that $\kappa_r(x)$ is a (finite) linear combination of sums of the following form
\[
\sum_{k\ge0}\Bigl(1-\Bigl\lfloor\frac{2^k}{x}\Bigr\rfloor\frac{x}{2^k}\Bigr)^j
= O\Bigl(x^j\sum_{k\ge0}2^{-jk}\Bigr) = O(x) = O(1),
\]
for each $j=1,2,\dots$. This proves that each $\kappa_r(x)$ is bounded for $x>0$, and, accordingly,
\[
K_r(n) = \sum_{2\le m\le n}\kappa_r(m) = O(n) \qquad(r=2,3,\dots).
\]
These estimates, together with those in (26) and (36), yield
\[
E\exp\Bigl(\frac{(Z_n-\mu_n)i\theta}{\varsigma_n}\Bigr)
= \exp\Bigl(-\frac{\theta^2}{2}+O\Bigl(\frac{|\theta|^3}{\sqrt{n}}\Bigr)\Bigr), \tag{53}
\]
uniformly for $|\theta|\le\varepsilon\sqrt{n}$.
We now derive a uniform bound of the form
\[
\bigl|E\bigl(e^{Z_ni\theta}\bigr)\bigr| \le e^{-\varepsilon n\theta^2}
\qquad(|\theta|\le\pi;\ n\ge5;\ n\ne2^{L_n}), \tag{54}
\]
for some $\varepsilon>0$. This bound, together with (53), will then be sufficient to prove the local limit theorem (52).
For $n\ne2^{L_n}$, let $E\bigl(e^{B_ni\theta}\bigr) = e^{(L_n+1)i\theta}\sum_{k\ge0}p_{n,k}e^{ik\theta}$, where
\[
p_{n,k} := \Bigl\lfloor\frac{2^{L_n+k+1}}{n}\Bigr\rfloor\frac{n}{2^{L_n+k+1}}
         - \Bigl\lfloor\frac{2^{L_n+k}}{n}\Bigr\rfloor\frac{n}{2^{L_n+k}}.
\]
When both $p_{n,0}$ and $p_{n,1}$ are nonzero, we have
\[
\begin{split}
\bigl|E\bigl(e^{B_ni\theta}\bigr)\bigr|
&\le 1-p_{n,0}-p_{n,1}+\bigl|p_{n,0}+p_{n,1}e^{i\theta}\bigr| \\
&= 1-p_{n,0}-p_{n,1}+\sqrt{(p_{n,0}+p_{n,1})^2-2p_{n,0}p_{n,1}(1-\cos\theta)} \\
&\le 1-p_{n,0}-p_{n,1}+(p_{n,0}+p_{n,1})
     \Bigl(1-\frac{p_{n,0}p_{n,1}}{(p_{n,0}+p_{n,1})^2}(1-\cos\theta)\Bigr),
\end{split}
\]
by using the inequality $\sqrt{1-x}\le1-\frac12x$ for $x\in[0,1]$. Then by the inequalities $1-x\le e^{-x}$ and $1-\cos\theta\ge\frac{2\theta^2}{\pi^2}$ for $|\theta|\le\pi$, we obtain, for $|\theta|\le\pi$,
\[
\bigl|E\bigl(e^{B_ni\theta}\bigr)\bigr|
\le \exp\Bigl(-\frac{p_{n,0}p_{n,1}}{p_{n,0}+p_{n,1}}(1-\cos\theta)\Bigr)
\le \exp\Bigl(-\frac{2}{\pi^2}\,\frac{p_{n,0}p_{n,1}}{p_{n,0}+p_{n,1}}\,\theta^2\Bigr),
\]
which holds for all $n\ge1$ provided we interpret $\frac{p_{n,0}p_{n,1}}{p_{n,0}+p_{n,1}}$ as zero when $p_{n,1}=0$. In this way, we see that
\[
\bigl|E\bigl(e^{Z_ni\theta}\bigr)\bigr| \le e^{-\frac{2}{\pi^2}\Lambda_n\theta^2} \le e^{-\frac15\Lambda_n\theta^2},
\]
for $|\theta|\le\pi$, where
\[
\Lambda_n := \sum_{1\le k\le n}\frac{p_{k,0}p_{k,1}}{p_{k,0}+p_{k,1}}.
\]
We now prove that $\Lambda_n\ge\varepsilon n$ for some $\varepsilon>0$. Observe that $p_{n,0} = \frac{n}{2^{L_n+1}}$ when $n\ne2^{L_n}$, and
\[
p_{k,1} =
\begin{cases}
\dfrac{k}{2^{L_k+2}}, & \text{if } 2^{L_k} < k < \Bigl\lceil\dfrac{2^{L_k+2}}{3}\Bigr\rceil; \\[2mm]
0, & \text{if } \Bigl\lceil\dfrac{2^{L_k+2}}{3}\Bigr\rceil \le k \le 2^{L_k+1}.
\end{cases}
\]
It follows that
\[
\begin{split}
\Lambda_n &\ge \sum_{2\le\ell<L_n}\;\sum_{2^\ell<k<\lceil 2^{\ell+2}/3\rceil}
\frac{\frac{k}{2^{\ell+1}}\cdot\frac{k}{2^{\ell+2}}}{\frac{k}{2^{\ell+1}}+\frac{k}{2^{\ell+2}}}
= \frac16\sum_{2\le\ell<L_n}\;\sum_{2^\ell<k<\lceil 2^{\ell+2}/3\rceil}\frac{k}{2^\ell} \\
&\ge \frac16\sum_{2\le\ell<L_n}\frac{7\cdot2^\ell}{18}
\ge \varepsilon_0 2^{L_n} \ge \varepsilon n,
\end{split}
\]
for a sufficiently small $\varepsilon>0$. This completes the proof of (54) and the local limit theorem (52).
6 Implementation and testing
We discuss in this section the implementation and testing of the two algorithms FYKY and RS.
We implemented the algorithms in the C language, taking as input an array of 32-bit integers
(which is enough to represent permutations of size up to over four billion). To generate the needed
random bits, we used the rdrand instruction, present on Intel processors since 2012 (Intel, 2012)
and AMD processors since 2015. This instruction provides access to physical randomness, which
does not have the biases of a pseudorandom generator. This choice also makes it easy to compare
the performance of the algorithms without relying on third-party software. Alternatively, one could
use a pseudorandom generator like Mersenne Twister, which is the default choice in most software,
such as R, Python, Matlab and Maple, and runs faster than rdrand when properly implemented.
But such a generator is known to be cryptographically insecure because one can predict all future outputs once a sufficient number (624 in the case of MT19937) of consecutive outputs has been observed. The hardware-driven instruction rdrand, in contrast, is designed to be cryptographically secure.
Our implementation takes care of not wasting any random bits and provides the option to track the
number of random bits consumed.
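The bit accounting just described can be sketched as a small buffering layer. The code below is an illustrative sketch, not our actual source: the names are hypothetical, and a portable xorshift generator stands in for rdrand (in a real implementation the refill would call the `_rdrand32_step` intrinsic from `immintrin.h`).

```c
#include <assert.h>
#include <stdint.h>

/* A tiny bit pool: draws 32-bit words from an entropy source, hands out
 * single bits, and counts exactly how many random bits are consumed.
 * Hypothetical sketch: xorshift64 stands in for rdrand here so the code
 * stays portable. */
typedef struct {
    uint32_t word;      /* buffered random bits            */
    int      avail;     /* bits still unused in `word`     */
    uint64_t consumed;  /* total bits handed out so far    */
    uint64_t state;     /* state of the stand-in generator */
} bitpool;

static uint32_t refill(bitpool *p) {
    /* stand-in for rdrand: xorshift64, truncated to 32 bits */
    p->state ^= p->state << 13;
    p->state ^= p->state >> 7;
    p->state ^= p->state << 17;
    return (uint32_t)p->state;
}

static void bitpool_init(bitpool *p, uint64_t seed) {
    p->word = 0;
    p->avail = 0;
    p->consumed = 0;
    p->state = seed ? seed : 1;
}

/* Return one random bit; a fresh 32-bit word is fetched only when the
 * buffer is empty, so no drawn bit is ever discarded. */
static int next_bit(bitpool *p) {
    if (p->avail == 0) { p->word = refill(p); p->avail = 32; }
    int b = (int)(p->word & 1u);
    p->word >>= 1;
    p->avail--;
    p->consumed++;
    return b;
}
```

Because the pool only refills when empty, the `consumed` counter reports exactly the number of bits the algorithms use, which is how the expected-complexity measurements can be checked against the analysis.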
The implementation of Algorithm FYKY is rather straightforward, but that of Algorithm RS is more involved. First of all, the recursive calls in RS are handled in the following fashion, depending on the size of the input:
- for large inputs, we run the recursive calls in parallel using the POSIX thread library pthread;
- for intermediate inputs, we run the recursive calls sequentially to limit the number of threads;
- for small inputs, we use the Fisher-Yates algorithm instead to reduce the number of recursive calls.
The cutoffs between small, intermediate and large inputs were determined experimentally; in our tests, thresholds of 2^16 and 2^20 seemed efficient, but this may depend on the machine and other implementation details.
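The three-way dispatch above can be sketched as follows. This is a hypothetical skeleton rather than our actual source: the helper names are invented, the split is written in its naive branchy form, the stand-in samplers are not exactly uniform, and the parallel branch is shown as a plain call where real code would use `pthread_create`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative cutoffs mirroring the experimentally chosen 2^16 / 2^20. */
enum { SMALL = 1u << 16, LARGE = 1u << 20 };

static int rand_bit(void) { return rand() & 1; }  /* stand-in bit source */

static void fisher_yates(uint32_t *a, size_t n) {
    for (size_t i = n - 1; i > 0; i--) {
        /* placeholder sampler (biased; real code draws unbiased values
         * from random bits) */
        size_t j = (size_t)rand() % (i + 1);
        uint32_t t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Split a[0..n) into two groups by fair coin flips, then shuffle each
 * group recursively. */
static void rs_shuffle(uint32_t *a, size_t n) {
    if (n < 2) return;
    if (n < SMALL) { fisher_yates(a, n); return; }  /* small: no recursion */
    size_t lo = 0;
    for (size_t i = 0; i < n; i++) {  /* naive branchy split */
        if (rand_bit() == 0) {
            uint32_t t = a[lo]; a[lo] = a[i]; a[i] = t;
            lo++;
        }
    }
    if (n >= LARGE) {
        /* large: the two calls would run in parallel via pthread_create */
        rs_shuffle(a, lo);
        rs_shuffle(a + lo, n - lo);
    } else {
        /* intermediate: sequential recursion */
        rs_shuffle(a, lo);
        rs_shuffle(a + lo, n - lo);
    }
}
```

Switching to Fisher-Yates below the small cutoff bounds the recursion depth, while the intermediate tier keeps the number of spawned threads proportional to the number of large subproblems only.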
The second optimization for Algorithm RS concerns the splitting routine. Written naively, this
routine contains a loop with an if statement depending on random data. This is a problem because
branches are considerably more efficient if they can be correctly predicted by the processor during
execution. We are able to avoid using branches altogether by vectorizing the code, i.e., using SIMD
(Single Instruction, Multiple Data) processor instructions. Such instructions take as input 128-bit
vector registers capable of storing four 32-bit integers and operate on all four elements at the same
time. The C language provides extensions capable of accessing such instructions. Specifically, we
used in our implementation two instructions, present in the AVX (Advanced Vector Extensions)
instruction set supported by newer processors. They are vpermilps, which arbitrarily permutes
the four 32-bit elements of a vector; and vmaskmovps, which writes an arbitrary subset of the
four elements of a vector to memory. Both instructions take as additional input a control vector
specifying the permutation or subset, of which only two bits out of every 32-bit element are read.
We use these instructions to separate four elements of the permutation at a time into two groups.
This can be done in 16 possible ways, which means that we have to supply each instruction with
one of 16 possible control registers. We do this by building a master register containing all 16 of
them in a packed fashion. We then draw randomly an integer r between 0 and 15 and shift every
component of the master register by 2r bits to select the appropriate control register. This lets us
handle four elements at a time without using branches.
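The packing trick can be illustrated in portable scalar C. In the real routine the extracted 2-bit selectors feed vpermilps and vmaskmovps; here hypothetical helpers apply them with ordinary moves, and the within-group order is not preserved (harmless for a random split, since each group is shuffled recursively afterwards).

```c
#include <assert.h>
#include <stdint.h>

/* For each of the 16 outcomes r of four coin flips (bit i of r = group
 * of element i), lane j needs a 2-bit selector saying which source
 * element to take; all 16 selectors of a lane are packed into one
 * 32-bit "master" word, and shifting by 2r exposes the right one. */
static uint32_t master[4];  /* master[j]: packed selectors for lane j */

static void build_master(void) {
    for (int r = 0; r < 16; r++) {
        int sel[4], lo = 0, hi = 3;
        /* group-0 elements go to the front, group-1 to the back */
        for (int i = 0; i < 4; i++) {
            if ((r >> i) & 1) sel[hi--] = i; else sel[lo++] = i;
        }
        for (int j = 0; j < 4; j++)
            master[j] |= (uint32_t)sel[j] << (2 * r);
    }
}

/* Apply the selectors for outcome r: dst[j] = src[sel[j]].  Returns the
 * number of elements that fell into group 0 (i.e. how far the write
 * pointer of the first group advances). */
static int apply(int r, const int *src, int *dst) {
    for (int j = 0; j < 4; j++) {
        int sel = (int)((master[j] >> (2 * r)) & 3u);
        dst[j] = src[sel];
    }
    int zeros = 0;
    for (int i = 0; i < 4; i++) zeros += !((r >> i) & 1);
    return zeros;
}
```

The vector code does exactly this with one shift and two instructions per block of four elements, with no data-dependent branch.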
Benchmarks Below are our benchmarks for Algorithm FYKY, Algorithm RS and one of its parallel versions. The tests were performed on a machine with 32 processors.

    n       FYKY      RS        Parallel RS
    10^5    4.84ms    4.59ms    4.18ms
    10^6    51.1ms    51.6ms    18.5ms
    10^7    712ms     623ms     121ms
    10^8    12.5s     7.26s     1.04s
    10^9    145s      81.7s     10.3s

    Algorithm    Mean                       Variance
    RS           n log2 n + (0.25 ± ε)n     (1.83 ± ε)n
    FYKY         n log2 n − (0.33 ± ε)n     (1.56 ± ε)n

Table 1: Left: the execution times to sample permutations of sizes from 10^5 to 10^9 (each averaged over 100 runs for sizes up to 10 million and 10 runs otherwise). Right: the analytic results we obtained in this paper. Here c ± ε indicates fluctuations around the mean value c (coming from the periodic functions); see (8), (12), (27) and (43).
As expected, parallelism speeds up the execution by as much as a factor of 8. What is more surprising is that, even in its sequential form, Algorithm RS is nearly twice as fast as Fisher-Yates for the larger sizes, despite making a linearithmic number of memory accesses instead of a linear one. The reason has to do with the memory cache, which makes it much cheaper to access memory sequentially than at haphazard places. The Fisher-Yates shuffle accesses memory at a random location at each iteration of its loop, causing a large number of cache misses. Algorithm RS, in comparison, does not have this drawback, which accounts for the observed gap in performance.
References
Anderson, R. J. (1990). Parallel algorithms for generating random permutations on a shared memory machine. In Proceedings of the Second Annual ACM Symposium on Parallel Algorithms
and Architectures, pages 95–102.
Andrés, D. M. and Pérez, L. P. (2011). Efficient parallel random rearrange. In International
Symposium on Distributed Computing and Artificial Intelligence, pages 183–190. Springer.
Barker, E. B. and Kelsey, J. M. (2007). Recommendation for random number generation using deterministic random bit generators (revised). US Department of Commerce, Technology
Administration, National Institute of Standards and Technology, Computer Security Division,
Information Technology Laboratory.
Berry, K. J., Johnston, J. E., and Mielke, Jr., P. W. (2014). A Chronicle of Permutation Statistical
Methods (1920–2000, and Beyond). Springer, Cham.
Black, J. and Rogaway, P. (2002). Ciphers with arbitrary finite domains. In Topics in Cryptology
- CT-RSA 2002, The Cryptographer’s Track at the RSA Conference, 2002, San Jose, CA, USA,
February 18-22, 2002, Proceedings, pages 114–130.
Brassard, G. and Kannan, S. (1988). The generation of random permutations on the fly. Inf.
Process. Lett., 28(4):207–212.
Chen, C.-H. (2002). Generalized association plots: Information visualization via iteratively generated correlation matrices. Statistica Sinica, 12(1):7–30.
Devroye, L. (1986). Nonuniform Random Variate Generation. Springer-Verlag, New York.
Devroye, L. (2010). Complexity questions in non-uniform random variate generation. In Proceedings of COMPSTAT’2010, pages 3–18. Springer.
Devroye, L. and Gravel, C. (2015). The expected bit complexity of the von Neumann rejection algorithm. arXiv preprint arXiv:1511.02273.
Durstenfeld, R. (1964). Algorithm 235: Random permutation. Commun. ACM, 7(7):420.
Erdélyi, A., Magnus, W., Oberhettinger, F., and Tricomi, F. (1953). Higher Transcendental Functions. Vol. I. McGraw-Hill, New York.
Fisher, R. A. and Yates, F. (1948). Statistical tables for biological, agricultural and medical research. 3rd ed. Edinburgh and London, 13(3):26–27.
Flajolet, P., Fusy, É., and Pivoteau, C. (2007). Boltzmann sampling of unlabelled structures. In
Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth
Workshop on Analytic Algorithmics and Combinatorics, pages 201–211. SIAM, Philadelphia,
PA.
Flajolet, P., Gourdon, X., and Dumas, P. (1995). Mellin transforms and asymptotics: harmonic
sums. Theoret. Comput. Sci., 144(1-2):3–58.
Flajolet, P., Pelletier, M., and Soria, M. (2011). On Buffon machines and numbers. In Proceedings
of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San
Francisco, California, USA, January 23-25, 2011, pages 172–183.
Flajolet, P. and Sedgewick, R. (1995). Mellin transforms and asymptotics: finite differences and
Rice’s integrals. Theoret. Comput. Sci., 144(1-2):101–124.
Flajolet, P. and Sedgewick, R. (2009). Analytic Combinatorics. Cambridge University Press, New
York, NY, USA.
Fuchs, M., Hwang, H.-K., and Zacharovas, V. (2014). An analytic approach to the asymptotic
variance of trie statistics and related structures. Theoret. Comput. Sci., 527:1–36.
Grabner, P. J. and Hwang, H.-K. (2005). Digital sums and divide-and-conquer recurrences: Fourier
expansions and absolute convergence. Constr. Approx., 21(2):149–179.
Granboulan, L. and Pornin, T. (2007). Perfect block ciphers with small blocks. In Fast Software
Encryption, 14th International Workshop, FSE 2007, Luxembourg, Luxembourg, March 26-28,
2007, Revised Selected Papers, pages 452–465.
Horibe, Y. (1981). Entropy and an optimal random number transformation. IEEE Transactions on
Information Theory, 27(4):527–529.
Hwang, H.-K. (2003). Second phase changes in random m-ary search trees and generalized quicksort: convergence rates. Ann. Probab., 31(2):609–629.
Hwang, H.-K., Fuchs, M., and Zacharovas, V. (2010). Asymptotic variance of random symmetric
digital search trees. Discrete Math. Theor. Comput. Sci., 12(2):103–165.
Intel (2012). Intel Digital Random Number Generator (DRNG): Software Implementation Guide.
Intel Corporation.
Jacquet, P. and Szpankowski, W. (1998). Analytical de-Poissonization and its applications. Theoret. Comput. Sci., 201(1-2):1–62.
Kimble, G. W. (1989). Observations on the generation of permutations from random sequences.
International Journal of Computer Mathematics, 29(1):11–19.
Knuth, D. E. (1998a). The Art of Computer Programming. Vol. 2, Seminumerical Algorithms.
Addison-Wesley, Reading, MA.
Knuth, D. E. (1998b). The Art of Computer Programming. Vol. 3. Sorting and Searching. Addison-Wesley, Reading, MA. Second edition.
Knuth, D. E. and Yao, A. C. (1976). The complexity of nonuniform random number generation.
Algorithms and Complexity: New Directions and Recent Results, pages 357–428.
Koo, B., Roh, D., and Kwon, D. (2014). Converting random bits into random numbers. The
Journal of Supercomputing, 70(1):236–246.
Laisant, C.-A. (1888). Sur la numération factorielle, application aux permutations. Bulletin de la
Société Mathématique de France, 16:176–183.
Langr, D., Tvrdík, P., Dytrych, T., and Draayer, J. P. (2014). Algorithm 947: Paraperm—parallel generation of random permutations with MPI. ACM Trans. Math. Softw., 41(1):5:1–5:26.
Lehmer, D. H. (1960). Teaching combinatorial tricks to a computer. In Proc. Sympos. Appl. Math.
Combinatorial Analysis, volume 10, pages 179–193.
Louchard, G., Prodinger, H., and Wagner, S. (2008). Joint distributions for movements of elements
in Sattolo’s and the Fisher-Yates algorithm. Quaest. Math., 31(4):307–344.
Lumbroso, J. (2013). Optimal discrete uniform generation from coin flips, and applications. CoRR,
abs/1304.1916.
Mahmoud, H. M. (2003). Mixed distributions in Sattolo’s algorithm for cyclic permutations via
randomization and derandomization. J. Appl. Probab., 40(3):790–796.
Massey, J. L. (1981). Collision-Resolution Algorithms and Random-Access Communications.
Springer.
Moses, L. E. and Oakford, R. V. (1963). Tables of Random Permutations. Stanford University
Press.
Nakano, K. and Olariu, S. (2000). Randomized initialization protocols for ad hoc networks. IEEE
Transactions on Parallel and Distributed Systems, 11(7):749–759.
Neininger, R. and Rüschendorf, L. (2004). A general limit theorem for recursive algorithms and
combinatorial structures. Ann. Appl. Probab., 14(1):378–418.
Petrov, V. V. (1975). Sums of Independent Random Variables. Springer-Verlag, New York-Heidelberg. Translated from the Russian by A. A. Brown, Ergebnisse der Mathematik und ihrer Grenzgebiete, Band 82.
Plackett, R. L. (1968). Random permutations. Journal of the Royal Statistical Society. Series B (Methodological), 30(3):517–534.
Pokhodzeĭ, B. B. (1985). Complexity of tabular methods of simulating finite discrete distributions. Izv. Vyssh. Uchebn. Zaved. Mat., (7):45–50 & 85.
Prodinger, H. (2002). On the analysis of an algorithm to generate a random cyclic permutation.
Ars Combin., 65:75–78.
Rao, C. R. (1961). Generation of random permutation of given number of elements using random
sampling numbers. Sankhya A, 23:305–307.
Ravelomanana, V. (2007). Optimal initialization and gossiping algorithms for random radio networks. IEEE Transactions on Parallel and Distributed Systems, 18(1):17–28.
Ressler, E. K. (1992). Random list permutations in place. Inf. Process. Lett., 43(5):271–275.
Ritter, T. (1991). The efficient generation of cryptographic confusion sequences. Cryptologia,
15(2):81–139.
Rivest, R. L. (1994). The RC5 encryption algorithm. In Fast Software Encryption: Second International Workshop. Leuven, Belgium, 14-16 December 1994, Proceedings, pages 86–96.
Robson, J. M. (1969). Algorithm 362: Generation of random permutations [G6]. Commun. ACM,
12(11):634–635.
Sandelius, M. (1962). A simple randomization procedure. Journal of the Royal Statistical Society. Series B (Methodological), 24(2):472–481.
Titchmarsh, E. C. (1986). The Theory of the Riemann Zeta-Function. The Clarendon Press Oxford
University Press, New York, second edition. Edited and with a preface by D. R. Heath-Brown.
von Neumann, J. (1951). Various techniques used in connection with random digits. Journal of
Research of the National Bureau of Standards. Applied Mathematics Series, 12:36–38.
Waechter, M., Hamacher, K., Hoffgaard, F., Widmer, S., and Goesele, M. (2011). Is your permutation algorithm unbiased for n ≠ 2^m? In Parallel Processing and Applied Mathematics, pages 297–306. Springer.
Wagner, S. (2009). On tries, contention trees and their analysis. Annals of Combinatorics,
12(4):493–507.
Wilson, M. C. (2009). Random and exhaustive generation of permutations and cycles. Ann. Comb.,
12(4):509–520.