Generating random permutations by coin-tossing: classical algorithms, new analysis and modern implementation

Axel Bacher, Olivier Bodini, Hsien-Kuei Hwang, and Tsung-Hsi Tsai

LIPN, Université Paris 13, Villetaneuse, France
Institute of Statistical Science, Academia Sinica, Taipei, Taiwan

March 10, 2016

Abstract

Several simple, classical, little-known algorithms in the statistical literature for generating random permutations by coin-tossing are examined, analyzed and implemented. These algorithms are either asymptotically optimal or close to being so in terms of the expected number of times the random bits are generated. In addition to asymptotic approximations to the expected complexity, we also clarify the corresponding variances, as well as the asymptotic distributions. A brief comparative discussion with numerical computations in a multicore system is also given.

1 Introduction

Random permutations are indispensable in widespread applications ranging from cryptology to statistical testing, from data structures to experimental design, from data randomization to Monte Carlo simulations, etc. Natural examples include the block cipher (Rivest, 1994), permutation tests (Berry et al., 2014) and the generalized association plots (Chen, 2002). Random permutations are also central in the framework of Boltzmann sampling for labeled combinatorial classes (Flajolet et al., 2007), where they intervene in the labeling process of samplers. Finding simple, efficient and cryptographically secure algorithms for generating large random permutations is thus of vital importance from the modern perspective. We are concerned in this paper with several simple classical algorithms for generating random permutations (each with the same probability of being

(This research was partially supported by the ANR-MOST Joint Project MetAConC under the Grants ANR 2015-BLAN-0204 and MOST 105-2923-E-001-001-MY4.)
generated), some having remained little known in the statistical and computer science literature, and focus mostly on their stochastic behaviors for large samples; implementation issues are also discussed.

Algorithm Laisant-Lehmer: when $\mathrm{Unif}[1, n!]$ is available

(Here and throughout this paper $\mathrm{Unif}[a,b]$ represents a discrete uniform distribution over all integers in the interval $[a,b]$.) The earliest algorithm for generating a random permutation dates back to Laisant's work near the end of the 19th century (Laisant, 1888), which was later re-discovered in (Lehmer, 1960). It is based on the factorial representation of an integer
$$k = c_1\,(n-1)! + c_2\,(n-2)! + \cdots + c_{n-1}\,1! \qquad (0 \le k < n!),$$
where $0 \le c_j \le n-j$ for $1 \le j \le n-1$. A simple algorithm implementing this representation then proceeds as follows; see (Devroye, 1986, p. 648) or (Robson, 1969). Let $k = \mathrm{Unif}[0, n!-1]$. The first element of the random permutation is $c_1+1$, which is then removed from $\{1,\dots,n\}$. The next element of the random permutation will then be the $(c_2+1)$st of the $n-1$ remaining elements. Continue this procedure until the last element is removed. A direct implementation of this algorithm results in a two-loop procedure; a simple one-loop procedure was devised in (Robson, 1969) and is shown below; see also (Plackett, 1968) and Devroye's book (Devroye, 1986, §XIII.1) for variants, implementation details and references.

Algorithm 1: LL(n, c)
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
    1  u := Unif[1, n!]
    2  for i := n downto 2 do
    3      t := ⌊u/i⌋;  j := u − i·t + 1
    4      swap(c_i, c_j);  u := t

This algorithm is mostly useful when $n$ is small, say less than 20, because $n!$ grows very fast and the large-number arithmetics involved reduce its efficiency for large $n$. Also, the generation of the uniform distribution is better realized by the coin-tossing algorithms described in Section 2.
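As a concrete illustration, the one-loop procedure above can be sketched in Python as follows (the function name and the optional `u` argument are ours; when `u` is given, the function deterministically decodes that factorial-base integer into a permutation):

```python
import math
import random

def laisant_lehmer(n, u=None, rng=random):
    """One-loop decoding of an integer u in [0, n!) into a permutation of
    {1, ..., n}, in the spirit of Robson's variant of the Laisant-Lehmer
    correspondence.  A sketch, not the paper's reference implementation."""
    if u is None:
        u = rng.randrange(math.factorial(n))  # plays the role of Unif[0, n!-1]
    c = list(range(1, n + 1))
    for i in range(n, 1, -1):
        t, r = divmod(u, i)      # r = u mod i picks one of the i candidates
        c[i - 1], c[r] = c[r], c[i - 1]
        u = t
    return c
```

Because the map $u \mapsto$ permutation is a bijection between $[0, n!)$ and the $n!$ permutations, a uniform $u$ yields a uniform permutation.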
Algorithm Fisher-Yates (FY): when $\mathrm{Unif}[1,n]$ is available

One of the simplest and most widely used algorithms (based on a given sequence of distinct numbers $\{c_1,\dots,c_n\}$) for generating a random permutation is the Fisher-Yates or Knuth shuffle (in its modern form due to Durstenfeld (Durstenfeld, 1964); see Wikipedia's page on the Fisher-Yates shuffle and (Devroye, 1986; Durstenfeld, 1964; Fisher and Yates, 1948; Knuth, 1998a) for more information). The algorithm starts by swapping $c_n$ with a randomly chosen element of $\{c_1,\dots,c_n\}$ (each with the same probability of being selected), and then repeats the same procedure for $c_{n-1}, \dots, c_2$. See also the recent book (Berry et al., 2014) or the survey paper (Ritter, 1991) for a more detailed account.

Algorithm 2: FY(n, c)
  Input: c: an array with n ≥ 2 elements
  Output: A random permutation on c
  begin
    1  for i := n downto 2 by −1 do
    2      j := Unif[1, i];  swap(c_i, c_j)

Such an algorithm seems to have it all: single loop, one-line description, constant extra storage, efficient and easy to code. Yet it is not optimal in situations such as (i) when only a partial subset of the permuted elements is needed (see (Black and Rogaway, 2002; Brassard and Kannan, 1988)), (ii) when implemented on a non-array data structure such as a list (see (Ressler, 1992)), (iii) when numerical truncation errors are inherent (see (Kimble, 1989)), and (iv) when a parallel or distributed computing environment is available (see (Anderson, 1990; Langr et al., 2014)). On the other hand, at the memory-access level, a direct generation of the uniform random variable results in a higher rate of cache misses (see (Andrés and Pérez, 2011)), making it less efficient than it seems, notably when $n$ is very large; see also Section 6 for some implementation and simulation aspects. Finally, this algorithm is sequential in nature, and the memory conflict problem is subtle in parallel implementations; see (Waechter et al., 2011).
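The loop above translates directly into a few lines of Python (a sketch; the function name is ours):

```python
import random

def fisher_yates(c, rng=random):
    """Durstenfeld's in-place form of the Fisher-Yates shuffle.

    Assuming rng.randrange(i + 1) is uniform on {0, ..., i}, every one of
    the n! permutations of c is produced with equal probability."""
    for i in range(len(c) - 1, 0, -1):
        j = rng.randrange(i + 1)   # 0-indexed analogue of j := Unif[1, i+1]
        c[i], c[j] = c[j], c[i]
    return c
```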
Note that the implementation of this algorithm strongly relies on the availability of a uniform random variate generator, and its bit-complexity (the number of random bits needed) is of linearithmic order (not linear); see (Lumbroso, 2013; Sandelius, 1962) and below for a detailed analysis.

From unbounded uniforms to bounded uniforms?

Instead of relying on uniform distributions with a very large range, our starting question was: can we generate random permutations by bounded uniform distributions (for example, by flipping unbiased coins)? There are at least two different ways to achieve this:

Fisher-Yates type: simulate the uniform distribution used in the Fisher-Yates shuffle by coin-tossing, which can be realized either by von Neumann's rejection method (von Neumann, 1951) or by the Knuth-Yao algorithm (for generating a discrete distribution by unbiased coins; see (Devroye, 1986; Knuth and Yao, 1976) and Section 2), and

divide-and-conquer type: each element flips an unbiased coin, and then, depending on the outcome being head or tail, the elements are divided into two groups. Continue recursively with the same procedure for each of the two groups. Then a random resampling is achieved by an inorder traversal of the corresponding digital tree; see the next section for details. This realization naturally induces a binary trie (Knuth, 1998b), which is closely related to a few other binomial splitting processes that will be briefly described below; see (Fuchs et al., 2014).

It turns out that exactly the same binomial splitting idea was already developed in the early 1960's in the statistical literature in (Rao, 1961) and independently in (Sandelius, 1962), and analyzed later in (Plackett, 1968). The papers by Rao and by Sandelius also propose other variants, which have their modern algorithmic interest per se.
However, all these algorithms have remained little known not only in computer science but also in statistics (see (Berry et al., 2014; Devroye, 1986)), partly because they rely on tables of random digits instead of more modern computer-generated random bits, although the underlying principle remains the same. Since a complete and rigorous analysis of the bit-complexity of these algorithms remains open, for historical reasons and for completeness, we will provide a detailed analysis of the algorithms proposed in (Rao, 1961) and (Sandelius, 1962) (and partially analyzed in (Plackett, 1968)), and of two versions of Fisher-Yates with different implementations of the underlying uniform $\mathrm{Unif}[0, n-1]$ by coin-tossing: one relying on von Neumann's rejection method (Devroye and Gravel, 2015; von Neumann, 1951) and the other on Knuth-Yao's tree method (Devroye, 1986; Knuth and Yao, 1976). As the ideas of these algorithms are very simple, it is no wonder that similar ideas also appeared in the computer science literature in different guises; see (Barker and Kelsey, 2007; Flajolet et al., 2011; Koo et al., 2014; Ressler, 1992) and the references therein. We will comment more on this in the next section.

We describe in the next section the algorithms we will analyze in this paper. Then we give a complete probabilistic analysis of the number of random bits used by each of them. Implementation aspects and benchmarks are briefly discussed in the final section. Note that the Fisher-Yates shuffle and its variants for generating cyclic permutations have been analyzed in (Louchard et al., 2008; Mahmoud, 2003; Prodinger, 2002; Wilson, 2009), but their focus is on data movements rather than on bit-complexity.

2 Generating random permutations by coin-tossing

We describe in this section three algorithms for generating random permutations, assuming that a bounded uniform $\mathrm{Unif}[0, r-1]$ is available for some fixed integer $r \ge 2$.
The first algorithm relies on the divide-and-conquer strategy and was first proposed in (Rao, 1961) and independently in (Sandelius, 1962), so we will refer to it as Algorithm RS (Rao-Sandelius). The other two we study are of Fisher-Yates type but differ in the way they simulate $\mathrm{Unif}[0, n-1]$ by a bounded uniform $\mathrm{Unif}[0, r-1]$: the first simulates $\mathrm{Unif}[0, n-1]$ by a rejection procedure in the spirit of von Neumann (von Neumann, 1951) and was proposed and implemented in (Sandelius, 1962), named ORP (One-stage Randomization Procedure) there, but for convenience we will refer to it as Algorithm FYvN (Fisher-Yates-von-Neumann); see also (Moses and Oakford, 1963). The other relies on an optimized version of Lumbroso's implementation (Lumbroso, 2013) of Knuth-Yao's DDG-tree (discrete distribution generating tree) algorithm (Knuth and Yao, 1976), which will be referred to as Algorithm FYKY (Fisher-Yates-Knuth-Yao). See also (Devroye, 1986, Ch. XV) on the "bit model" and the more recent updates (Devroye, 2010; Devroye and Gravel, 2015).

For simplicity of presentation and practical usefulness, we focus in what follows on the binary case $r = 2$. For convenience, let rand-bit denote the random variable $\mathrm{Bernoulli}(\frac12)$, which returns zero or one with equal probability.

2.1 Algorithm RS: divide-and-conquer

We describe Algorithm RS only in the binary case, assuming an unbiased coin is available. Since we will carry out a detailed analysis of this algorithm, we give its procedure in recursive form as follows. (For practical implementation, it is more efficient to remove the recursions by standard techniques; see Section 6.)

A sequence of distinct numbers $\{c_1,\dots,c_n\}$ is given.

1. Each $c_i$ generates a rand-bit, independently of the others;
2. Group them according to the outcomes being 0 or 1, and arrange the groups in increasing order of the group labels.
3. For each group of cardinality $\ell$:
   (a) if $\ell = 1$, then stop;
   (b) if $\ell = 2$, then generate a rand-bit $b$ and reverse the relative order of the two elements if $b = 1$;
   (c) if $\ell > 2$, then repeat Steps 1-3 for the group.

Algorithm 3: RS(n, c)
  Input: c: a sequence with n elements
  Output: A random permutation on c
  begin
    1  if n ≤ 1 then return c
    2  if n = 2 then
    3      if rand-bit = 1 then return (c_2, c_1)
    4      else return (c_1, c_2)
    5  Let A_0 and A_1 be two empty arrays
    6  for i := 1 to n do
    7      add c_i into A_{rand-bit}
    8  return RS(|A_0|, A_0), RS(|A_1|, A_1)

As an illustrative example, we begin with the sequence $\{c_1,\dots,c_6\}$ and assume that the flipped binary sequence is

  c1 c2 c3 c4 c5 c6
   1  0  1  1  0  0

Then we split the $c_i$'s into the 0-group $(c_2, c_5, c_6)$ and the 1-group $(c_1, c_3, c_4)$, which can be written in the form $(c_2\, c_5\, c_6)(c_1\, c_3\, c_4)$. As both groups have cardinality larger than two, we run the same coin-flipping process for both groups. Assume that further coin-flippings yield

  c2 c5 c6        c1 c3 c4
   0  0  1         0  1  0

respectively. Then we obtain $(c_2\, c_5)\, c_6\, (c_1\, c_4)\, c_3$. If the two extra coin-flippings needed to permute the two subgroups of size two are 0 and 1, respectively, then we get the random permutation
$$c_2\, c_5\, c_6\, c_4\, c_1\, c_3.$$

The splitting process of this algorithm is, up to the boundary conditions, essentially the same as constructing a random trie under the Bernoulli model or sorting by radix sort (see (Fuchs et al., 2014; Knuth, 1998b)), and was also briefly mentioned in (Flajolet et al., 2011). On the other hand, Ressler (Ressler, 1992) proposed an algorithm for randomly permuting a list structure using a similar divide-and-conquer idea but performed in a rather different way. To the best of our knowledge, except for these references, this simple algorithm seems to have remained unknown in the literature, and we believe that more attention needs to be paid to its practical usefulness and theoretical relevance.
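The recursive procedure can be sketched in a few lines of Python (our own sketch of Algorithm 3, not an optimized implementation):

```python
import random

def rs_shuffle(c, rng=random):
    """Rao-Sandelius divide-and-conquer shuffle, binary case (r = 2).

    Each element flips one fair coin; the 0-group is recursively shuffled
    and placed before the recursively shuffled 1-group.  Groups of size
    <= 2 are handled directly, as in Algorithm 3."""
    n = len(c)
    if n <= 1:
        return list(c)
    if n == 2:
        return [c[1], c[0]] if rng.getrandbits(1) else [c[0], c[1]]
    groups = [[], []]
    for x in c:                       # one independent coin flip per element
        groups[rng.getrandbits(1)].append(x)
    return rs_shuffle(groups[0], rng) + rs_shuffle(groups[1], rng)
```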
Essentially identical binomial splitting processes

In addition to the above connection to tries and radix sort, the splitting process of Algorithm RS is also reminiscent of the so-called initialization problem in distributed computing (or processor identity problem), where a unique identifier is to be assigned to each processor in some distributed computing environment; see (Nakano and Olariu, 2000; Ravelomanana, 2007). Yet another context where exactly the same coin-tossing process is used to resolve conflicts is the tree algorithm (or CTM algorithm, named after Capetanakis, Tsybakov and Mikhailov) in multi-access channels; see (Massey, 1981; Wagner, 2009). For more references on binomial splitting processes, see (Fuchs et al., 2014).

Nowadays, it is well known that the stochastic behaviors of these structures can be understood through the study of the binomial recurrence
$$f_n = g_n + 2^{-n}\sum_{0\le k\le n}\binom nk \bigl(f_k + f_{n-k}\bigr), \tag{1}$$
with suitably given initial conditions. In almost all cases of interest, such a recurrence gives rise to asymptotic approximations (for large $n$) that involve periodic oscillations with minute amplitudes (say, of order $10^{-5}$), which may lead to inexact conjectures (see for example (Massey, 1981)) but can be well described by standard complex-analytic tools such as the Mellin transform (Flajolet et al., 1995) and the saddle-point method (Flajolet and Sedgewick, 2009) (or analytic de-Poissonization (Jacquet and Szpankowski, 1998)); see (Fuchs et al., 2014) and the references compiled there. From a historical perspective, such a clarification through analytic means was first worked out by Flajolet and his co-authors in the early 1980's; see again (Fuchs et al., 2014) for a brief account.
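Recurrences of the form (1) are straightforward to iterate numerically, since $f_n$ appears on both sides only through the terms $k = 0$ and $k = n$ and can be isolated. The following sketch (our own code) takes $g_n = n$ with $f_0 = f_1 = 0$, $f_2 = 1$, which is precisely the mean bit cost of Algorithm RS analyzed in Section 3; the quantity $f_n/n - \log_2 n$ then hovers near the constant $\approx 0.2507$ appearing there:

```python
from fractions import Fraction
from math import comb

def binomial_recurrence_mean(nmax):
    """Exact solution of f_n = n + 2^{-n} sum_k C(n,k)(f_k + f_{n-k})
    for n >= 3, with f_0 = f_1 = 0 and f_2 = 1 (rational arithmetic).

    By symmetry the sum equals 2 * sum_k C(n,k) f_k; the k = n term
    contains f_n itself, so we solve for f_n."""
    f = [Fraction(0), Fraction(0), Fraction(1)]
    for n in range(3, nmax + 1):
        s = 2 * sum(comb(n, k) * f[k] for k in range(n))   # terms k < n
        w = Fraction(1, 2 ** n)
        f.append((n + w * s) / (1 - 2 * w))                # isolate f_n
    return f
```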
However, the periodic oscillations had already been observed in the 1960's by Plackett (Plackett, 1968), based on heuristic arguments and figures, which seems rather unexpected in view of the limited computer power at that time and of the proper normalization needed to visualize the fluctuations; see Figures 1 and 2 for the subtleties involved.

Unlike Algorithm FY, Algorithm RS is more easily adapted to a distributed or parallel computing environment because the random bits needed can be generated simultaneously. Furthermore, we will prove that the total number of random bits used is asymptotically optimal, namely, the expected complexity is asymptotic to $n\log_2 n + n\,F_{RS}(\log_2 n) + O(1)$, where $F_{RS}(t)$ is a periodic function of period 1 with very small fluctuation amplitude (the oscillating part is bounded by $1.1\times10^{-5}$ in absolute value); see Figure 1. Another distinctive feature is that $F_{RS}$ is very smooth (infinitely differentiable), differing from most other periodic functions arising in the analysis below. Note that the information-theoretic lower bound satisfies $\log_2 n! = n\log_2 n - \frac{n}{\log2} + O(\log n)$. While the asymptotic optimality of such a simple algorithm was already discussed in detail in (Sandelius, 1962), and such an asymptotic pattern anticipated in (Plackett, 1968), the rigorous proof and the explicit characterization of the periodic function $F_{RS}$ are new. Also, we show that the variance is relatively small (being of linear order with periodic fluctuations) and that the distribution is asymptotically normal.

2.2 Algorithms FYvN and FYKY

We describe in this subsection the two versions of Algorithm FY: FYvN and FYKY. Both algorithms follow the same loop of the Fisher-Yates shuffle and simulate successively the discrete uniform distributions $\mathrm{Unif}[1,n], \dots, \mathrm{Unif}[1,2]$ by flipping unbiased coins. To simulate $\mathrm{Unif}[1,k]$, both algorithms first generate $\lceil\log_2 k\rceil$ random bits.
If these bits, when read as a binary representation, have a value less than $k$, then this value plus 1 is returned as the required random element; otherwise, Algorithm FYvN rejects these bits and restarts the same procedure until finding a value $< k$. Algorithm FYKY, on the other hand, does not reject the flipped bits, but uses the difference between this value and $k$ as the "seed" of the next round and repeats the same procedure. We modified and optimized these two procedures so as to reduce the number of arithmetic operations; the two procedures below differ only in their last line.

Algorithm 4: Algorithm FYvN
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
    1  for i := n downto 2 by −1 do
    2      j := von-Neumann(i) + 1;  swap(c_i, c_j)

Procedure von-Neumann(n)
  Input: a positive integer n
  Output: Unif[0, n−1]
  begin
    1  u := 1;  x := 0
    2  while 1 = 1 do
    3      while u < n do
    4          u := 2u;  x := 2x + rand-bit
    5      d := u − n
    6      if x ≥ d then return x − d
    7      else u := 1;  x := 0

Algorithm 5: Algorithm FYKY
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
    1  for i := n downto 2 by −1 do
    2      j := Knuth-Yao(i) + 1;  swap(c_i, c_j)

Procedure Knuth-Yao(n)
  Input: a positive integer n
  Output: Unif[0, n−1]
  begin
    1  u := 1;  x := 0
    2  while 1 = 1 do
    3      while u < n do
    4          u := 2u;  x := 2x + rand-bit
    5      d := u − n
    6      if x ≥ d then return x − d
    7      else u := d

Note that both algorithms are identical when $n = 2^k$ and when $n = 3$, for which the algorithm parameters evolve as in the following diagram.
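The two inner procedures differ only in how they recover from a failed round; in Python (our own sketch following the pseudocode above, with `rng.getrandbits(1)` playing the role of rand-bit):

```python
import random

def von_neumann(n, rng=random):
    """Unif[0, n-1] from fair coin flips by rejection ('simple discard'):
    on failure, all flipped bits are thrown away and the round restarts."""
    while True:
        u, x = 1, 0
        while u < n:                  # draw ceil(log2 n) random bits
            u, x = 2 * u, 2 * x + rng.getrandbits(1)
        d = u - n
        if x >= d:
            return x - d              # accept: x - d is uniform on [0, n)

def knuth_yao(n, rng=random):
    """Unif[0, n-1] in the Knuth-Yao spirit: after a failure the leftover
    randomness is recycled (u := d, keeping x), instead of restarting."""
    u, x = 1, 0
    while True:
        while u < n:
            u, x = 2 * u, 2 * x + rng.getrandbits(1)
        d = u - n
        if x >= d:
            return x - d
        u = d                         # keep x (< d): x is still Unif[0, u)
```

In both procedures, the invariant is that $x$ is uniformly distributed on $\{0,\dots,u-1\}$ whenever the inner loop is entered; the recycling step of Knuth-Yao preserves this invariant without discarding bits.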
$$n=3:\qquad (u,x):\ (1,0)\ \longrightarrow\ (2,0),(2,1)\ \longrightarrow\ (4,0),(4,1),(4,2),(4,3),$$
where $d = 1$; the states $(4,1), (4,2), (4,3)$ return $x-d = 0, 1, 2$, respectively, while $(4,0)$ leads recursively back to $(1,0)$.

While the difference between the two algorithms at such a pseudo-code level is minor, we show that the asymptotic behaviors of their bit-complexities for generating a random permutation of $n$ elements differ significantly:

  Algorithm | Mean                                                  | Variance               | Method
  RS        | $n\log_2 n + n\,F_{RS}(\cdot)$                        | $n\,G_{RS}(\cdot)$     | Analytic
  FYvN      | $n(\log n)\,F^{[1]}_{vN}(\cdot) + n\,F^{[2]}_{vN}(\cdot)$ | $n(\log n)^2\,G_{vN}(\cdot)$ | Elementary
  FYKY      | $n\log_2 n + n\,F_{KY}(\cdot)$                        | $n\,G_{KY}(\cdot)$     | Analytic

Here the $F$'s and $G$'s are all bounded, continuous periodic functions of the parameter $\log_2 n$. We see that the minor difference in Algorithm FYvN results not only in a higher mean but also in a larger variance, making FYvN less competitive in modern practical applications, although it was used, for example, by Moses and Oakford to produce tables of random permutations (Moses and Oakford, 1963). Also, the procedure von-Neumann in Algorithm 4, as one of the simplest and most natural ideas for simulating a uniform by coin-tossing, was independently proposed under different names in the literature; see, for example, (Granboulan and Pornin, 2007; Koo et al., 2014); in particular, it is called the "Simple Discard Method" in NIST's "Recommendation for random number generation using deterministic random bit generators" (Barker and Kelsey, 2007). Thus, we also include the analysis of FYvN in this paper although it is less efficient in bit-complexity. The mean and the variance of Algorithm FYvN were already derived in (Plackett, 1968), but only when $n = 2^k$. In addition to this approximation, we will also show that the variance is of the less common higher order $n(\log n)^2$, and that the distribution remains asymptotically normal.

2.3 Outline of this paper

We focus in this paper on a detailed probabilistic analysis of the bit-complexity of the three algorithms RS, FYvN and FYKY.
Indeed, in all three cases we will establish a very strong local limit theorem for the bit-complexity, of the form (although the variances are not of the same order)
$$\mathbb P\Bigl(W_n = \bigl\lfloor E(W_n) + x\sqrt{V(W_n)}\bigr\rfloor\Bigr) = \frac{e^{-x^2/2}}{\sqrt{2\pi V(W_n)}}\Bigl(1+O\Bigl(\frac{1+|x|^3}{\sqrt n}\Bigr)\Bigr),$$
uniformly for $x = o(n^{1/6})$, where $W_n$ represents the bit-complexity of any of the three algorithms RS, FYvN, FYKY. Our method of proof is mostly analytic, relying on a proper use of generating functions (including characteristic functions) and standard complex-analytic techniques (see (Flajolet and Sedgewick, 2009)). The diverse uniform estimates needed for the characteristic functions constitute the hard part of our proofs. The same method can be extended to clarify finer probabilities of moderate deviations, but for simplicity we content ourselves with the above result in the central range.

We also implemented these algorithms and tested their efficiency in terms of running time. The simulation results are given in the last section. Briefly, Algorithm FYKY is recommended when $n$ is not very large, say $n \le 10^7$, and Algorithm RS performs better for larger $n$ or when a multicore system is available. Finally, our analysis and simulations also suggest that the "Simple Discard Method" recommended in NIST's "Recommendation for random number generation" (Barker and Kelsey, 2007) is better replaced by the procedure Knuth-Yao in Algorithm 5, whose expected optimality (in bit-complexity) was established in (Horibe, 1981).

3 The bit-complexity of Algorithm RS

We consider the total number $X_n$ of times the random variable rand-bit is used in Algorithm RS for generating a random permutation of $n$ elements. We will derive precise asymptotic approximations to the mean, the variance and the distribution by applying the approaches developed in our previous papers (Fuchs et al., 2014; Hwang, 2003; Hwang et al., 2010).
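Before turning to the analysis, the bit count $X_n$ of Algorithm RS is easy to observe empirically; the following small simulation (our own sketch, not the paper's implementation) counts the coin flips consumed, and the empirical mean divided by $n$, minus $\log_2 n$, lands near the constant $\approx 0.2507$ derived in this section:

```python
import random

class CountingBits:
    """Wraps a Random instance and counts how many fair bits are drawn."""
    def __init__(self, rng):
        self.rng, self.count = rng, 0
    def bit(self):
        self.count += 1
        return self.rng.getrandbits(1)

def rs_bits(n, src):
    """Random permutation of range(n) by Algorithm RS; src.count records
    the number of rand-bit calls (the random variable X_n)."""
    def rec(c):
        if len(c) <= 1:
            return c
        if len(c) == 2:
            return [c[1], c[0]] if src.bit() else c
        g = [[], []]
        for x in c:
            g[src.bit()].append(x)
        return rec(g[0]) + rec(g[1])
    return rec(list(range(n)))
```

For $n = 256$, the expected count is about $256\cdot8 + 0.25\cdot256 \approx 2112$ with a standard deviation of only a few dozen, so even modest sample sizes reveal the linear-order term.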
Recurrences and generating functions

By construction, $X_n$ satisfies the distributional recurrence
$$X_n \stackrel{d}{=} \underbrace{X_{I_n}}_{\text{0-group}} + \underbrace{X^{*}_{n-I_n}}_{\text{1-group}} + n \qquad (n\ge3),$$
with the initial conditions $X_0 = X_1 = 0$ and $X_2 = 1$, where $I_n$ denotes the binomial distribution with parameters $n$ and $\frac12$. Here the $(X^{*}_n)$'s are independent copies of the $(X_n)$'s and are independent of $I_n$. This random variable is, up to initial conditions, identical to the external path length of random tries constructed from $n$ random binary strings. It may also be interpreted in many different ways; see (Fuchs et al., 2014; Knuth, 1998b) and the references therein.

In terms of the moment generating function $P_n(t) := E(e^{X_n t})$, we have the recurrence
$$P_n(t) = e^{nt}\,2^{-n}\sum_{0\le k\le n}\binom nk P_k(t)\,P_{n-k}(t) \qquad (n\ge3), \tag{2}$$
with $P_0(t) = P_1(t) = 1$ and $P_2(t) = e^t$. From these relations, we see that the bivariate Poisson generating function $\tilde P(z,t) := e^{-z}\sum_{n\ge0}\frac{P_n(t)}{n!}z^n$ satisfies the functional equation
$$\tilde P(z,t) = e^{(e^t-1)z}\,\tilde P\bigl(\tfrac12 e^t z,\, t\bigr)^2 + \tilde Q(z,t), \tag{3}$$
where
$$\tilde Q(z,t) := \bigl(1-e^t\bigr)\,z e^{-z}\Bigl(1+\tfrac14 e^t\bigl(2+e^t\bigr)z\Bigr).$$
Let now $\tilde f_m(z) := m!\,[t^m]\tilde P(z,t) = e^{-z}\sum_{n\ge0}\frac{E(X_n^m)}{n!}z^n$ denote the Poisson generating function of the $m$th moment of $X_n$. From (3), we see that
$$\tilde f_1(z) = 2\tilde f_1\bigl(\tfrac z2\bigr) + \tilde g_1(z), \qquad \tilde f_2(z) = 2\tilde f_2\bigl(\tfrac z2\bigr) + \tilde g_2(z),$$
with $\tilde g_1(0) = \tilde g_2(0) = 0$, where
$$\tilde g_1(z) = z - z e^{-z}\bigl(1+\tfrac34 z\bigr), \qquad \tilde g_2(z) = 2\tilde f_1\bigl(\tfrac z2\bigr)^2 + 4z\,\tilde f_1\bigl(\tfrac z2\bigr) + 2z\,\tilde f_1'\bigl(\tfrac z2\bigr) + z + z^2 - z e^{-z}\bigl(1+\tfrac{11}4 z\bigr). \tag{4}$$

Mean value

From the recurrence (2), we see that the mean $\mu_n := E(X_n)$ can be computed recursively by
$$\mu_n = n + 2^{1-n}\sum_{0\le k\le n}\binom nk \mu_k \qquad (n\ge3),$$
with $\mu_0 = \mu_1 = 0$ and $\mu_2 = 1$. Let $H_n := \sum_{1\le j\le n} j^{-1}$ denote the harmonic numbers and $\gamma$ denote Euler's constant.

Theorem 1. The expected number $\mu_n$ of random bits used by Algorithm RS for generating a random permutation of $n$ elements satisfies the identity
$$\frac{\mu_n}{n} = \frac{H_{n-1}}{\log2} + \frac12 - \frac3{4\log2} + \frac1{\log2}\sum_{k\in\mathbb Z\setminus\{0\}}\Gamma(\chi_k)\,\frac{\Gamma(n)}{\Gamma(n+\chi_k)}\Bigl(1+\frac34\chi_k\Bigr), \tag{5}$$
for $n\ge3$, where $\Gamma$ is the Gamma function and $\chi_k := \frac{2k\pi i}{\log2}$.
Asymptotically, $\mu_n$ satisfies
$$\mu_n = n\log_2 n + n\,F_{RS}(\log_2 n) + O(1), \tag{6}$$
where $F_{RS}(t)$ is a periodic function of period 1 whose Fourier series expansion is given by
$$F_{RS}(t) = \frac{\gamma}{\log2} + \frac12 - \frac3{4\log2} + \frac1{\log2}\sum_{k\in\mathbb Z\setminus\{0\}}\Gamma(\chi_k)\Bigl(1+\frac34\chi_k\Bigr)e^{2k\pi i t},$$
the Fourier series being absolutely convergent.

Proof. To derive a more effective asymptotic approximation to $\mu_n$, we begin with the expansion
$$\tilde g_1(z) = \sum_{j\ge2}(-1)^j\,\frac{7-3j}{4}\,\frac{z^j}{(j-1)!}.$$
We then see that the sequence $\tilde\mu_n := n!\,[z^n]\tilde f_1(z)$, where $[z^n]f(z)$ denotes the coefficient of $z^n$ in the Taylor expansion of $f$, satisfies
$$\tilde\mu_n = \frac{n!\,[z^n]\tilde g_1(z)}{1-2^{1-n}} \qquad (n\ge2).$$
It follows, by Cauchy convolution, that the coefficient $\mu_n = n!\,[z^n]\,e^z\tilde f_1(z)$ has the closed-form expression
$$\mu_n = \sum_{2\le k\le n}\binom nk(-1)^k\,\frac{k\,(7-3k)}{4\,\bigl(1-2^{1-k}\bigr)} \qquad (n\ge1),$$
which, by the standard integral representation for finite differences (see (Flajolet and Sedgewick, 1995)), can be expressed as
$$\frac{\mu_n}{n} = \frac1{2\pi i}\int_{-\frac12-i\infty}^{-\frac12+i\infty}\frac{\Gamma(n)\,\Gamma(s)}{\Gamma(n+s)}\cdot\frac{1+\frac34 s}{1-2^s}\,\mathrm ds,$$
where the integration path is the vertical line $\Re(s) = -\frac12$. By moving the line of integration to the right and by collecting all residues at the poles $\chi_k = \frac{2k\pi i}{\log2}$ ($k\in\mathbb Z$), we obtain
$$\frac{\mu_n}{n} = \frac{H_{n-1}}{\log2} + \frac12 - \frac3{4\log2} + \frac1{\log2}\sum_{k\in\mathbb Z\setminus\{0\}}\Gamma(\chi_k)\,\frac{\Gamma(n)}{\Gamma(n+\chi_k)}\Bigl(1+\frac34\chi_k\Bigr) + R_n,$$
where
$$R_n := \frac1{2\pi i}\int_{\frac12-i\infty}^{\frac12+i\infty}\frac{\Gamma(n)\,\Gamma(s)}{\Gamma(n+s)}\cdot\frac{1+\frac34 s}{1-2^s}\,\mathrm ds.$$
Since there is no other singularity lying to the right of the imaginary axis, we first deform the integration path into a large half-circle to the right, and then prove that the integral tends to zero as the radius of the circle tends to infinity. In this way, we deduce that $R_n \equiv 0$ for $n\ge3$, proving the identity (5).
The asymptotic approximation (6) then follows from the asymptotic expansion for the ratio of Gamma functions (see (Erdélyi et al., 1953, §1.18))
$$\frac{\Gamma(n)}{\Gamma(n+\chi_k)} = n^{-\chi_k}\Bigl(1 - \frac{\chi_k(\chi_k-1)}{2n}\Bigr) + O\Bigl(\frac{|\chi_k|^4}{n^2}\Bigr) \qquad (k\ne0).$$
Indeed, the $O(1)$-term in (6) can be further refined by this expansion, and replaced by
$$-\frac1{2\log2}\sum_{k\in\mathbb Z}\Gamma(1+\chi_k)\,(\chi_k-1)\Bigl(1+\frac34\chi_k\Bigr)n^{-\chi_k} + O\bigl(n^{-1}\bigr),$$
the series on the right-hand side defining another bounded periodic function. Finally, the Fourier series is absolutely convergent because of the estimate (see (Erdélyi et al., 1953, §1.18))
$$|\Gamma(c+it)| = O\bigl(|t|^{c-\frac12}e^{-\frac\pi2|t|}\bigr), \tag{7}$$
for large $|t|$ and bounded $c$. Indeed, $F_{RS}$ is infinitely differentiable.

Periodic fluctuations of $\mu_n$

Due to the small amplitude of variation of $F_{RS}(t)$, the periodic oscillations are invisible if one naively plots $\frac{\mu_n}n - \log_2 n$ for increasing values of $n$ as approximations to $F_{RS}(t)$ (see Figure 1). Also note that the mean value of $F_{RS}$ equals
$$\frac{\gamma}{\log2} + \frac12 - \frac3{4\log2} = 0.25072\,48966\,10144\ldots, \tag{8}$$
which is larger than the corresponding linear term $-\frac1{\log2}\approx-1.44$ in the information-theoretic lower bound.

Figure 1: Periodic fluctuations of $\mu_n$ for $n=16$ to $1024$ in log-scale: $\frac{\mu_n}n-\log_2 n$ (first from left), the same quantity with the constant (8) subtracted (second), $\frac{\mu_n}n-\frac{H_{n-1}}{\log2}$ (third), and $F_{RS}(t)$ for $t\in[0,1]$ (fourth).

Variance

We prove that the variance is asymptotically linear with periodic oscillations. The expressions involved are very complicated, reflecting the complexity of the underlying asymptotic problem.

Theorem 2. The variance of $X_n$ satisfies
$$V(X_n) = n\,G_{RS}(\log_2 n) + O(1),$$
where $G_{RS}(t)$ is a periodic function of period 1 whose Fourier series is given by
$$G_{RS}(t) = \frac1{\log2}\sum_{k\in\mathbb Z}\tilde g^{*}\Bigl(-1-\frac{2k\pi i}{\log2}\Bigr)e^{2k\pi i t}.$$
The Fourier series is absolutely convergent (and $G_{RS}$ is infinitely differentiable). An explicit expression for the function $\tilde g^{*}(s)$ is given as follows.
$$\tilde g^{*}(s) = \Gamma(s+1)\Bigl((3s+5)\sum_{k\ge1}\bigl(1+2^{-k}\bigr)^{-s} + \tilde h(s)\Bigr) + \frac{s+5}4\,\Gamma(s+1) - 2^{-s-9}\,\Gamma(s+2)\bigl(9s^3+66s^2+163s+362\bigr) \tag{9}$$
for $\Re(s) > -2$, where
$$\tilde h(s) := \sum_{k\ge1}\frac{2^{-k}}{3\,\bigl(1+2^{-k}\bigr)^{s+5}}\Bigl(3s^3-34s^2-41s+6 + \bigl(9s^4+87s^3+317s^2+333s+30\bigr)2^{-k} + \bigl(3s^3+22s^2+141s+170\bigr)2^{-2k} + \bigl(3s^2+37s+50\bigr)2^{-3k} + (3s+5)^2\,2^{-4k}\Bigr).$$

Proof. For the variance we consider, as in (Fuchs et al., 2014), the corrected Poissonized variance
$$\tilde V(z) := \tilde f_2(z) - \tilde f_1(z)^2 - z\,\tilde f_1'(z)^2.$$
Then, by (4),
$$\tilde V(z) = 2\,\tilde V\bigl(\tfrac z2\bigr) + \tilde g(z), \tag{10}$$
where
$$\tilde g(z) = e^{-z}\Bigl\{z(3z+4)\,\tilde f_1\bigl(\tfrac z2\bigr) + \tfrac12 z\bigl(3z^2-2z-4\bigr)\tilde f_1'\bigl(\tfrac z2\bigr) + z + \tfrac14 z^2 - \tfrac1{16}\,z(z+1)\bigl(9z^3-12z^2+16z+16\bigr)e^{-z}\Bigr\},$$
which is exponentially small for large $|z|$ in the half-plane $\Re(z)\ge0$. Indeed,
$$\tilde g(z) = O\bigl(e^{-\Re z}\,|z|^3\log|z|\bigr) \qquad (|z|\to\infty;\ \Re(z)\ge0). \tag{11}$$
We follow the same method of proof developed in (Fuchs et al., 2014) and need to compute the Mellin transform of $\tilde g(z)$, which exists in the half-plane $\Re(s)>-2$ because $\tilde g(z) = O(|z|^2)$ as $|z|\to0$. Now
$$\tilde f_1(z) = \sum_{k\ge0}2^k\,\tilde g_1\bigl(\tfrac z{2^k}\bigr).$$
Thus
$$\tilde g^{*}(s) := \int_0^\infty \tilde g(z)\,z^{s-1}\,\mathrm dz = \Lambda_1(s)+\Lambda_2(s)+\Lambda_3(s),$$
where
$$\Lambda_1(s) := \int_0^\infty z^s e^{-z}(3z+4)\,\tilde f_1\bigl(\tfrac z2\bigr)\mathrm dz, \qquad \Lambda_2(s) := \frac12\int_0^\infty z^s e^{-z}\bigl(3z^2-2z-4\bigr)\tilde f_1'\bigl(\tfrac z2\bigr)\mathrm dz,$$
$$\Lambda_3(s) := \int_0^\infty z^s e^{-z}\Bigl(1+\tfrac14 z-\tfrac1{16}(z+1)\bigl(9z^3-12z^2+16z+16\bigr)e^{-z}\Bigr)\mathrm dz.$$
First, for $\Re(s)>-2$,
$$\Lambda_3(s) = \frac{s+5}4\,\Gamma(s+1) - 2^{-s-9}\,\Gamma(s+2)\bigl(9s^3+66s^2+163s+362\bigr).$$
Note that $\Lambda_3(s)$ has no singularity at $s=-1$ (indeed, $\Lambda_3(-1) = \frac{125}{128}+\log2$). On the other hand, by an integration by parts,
$$\Lambda_1(s)+\Lambda_2(s) = \int_0^\infty z^{s-1}e^{-z}\,\tilde f_1\bigl(\tfrac z2\bigr)\bigl(3z^3+(3s+11)z^2-2(s-3)z-4s\bigr)\mathrm dz$$
$$= \sum_{k\ge1}2^{k-1}\int_0^\infty z^{s-1}e^{-z}\,\tilde g_1\bigl(\tfrac z{2^k}\bigr)\bigl(3z^3+(3s+11)z^2-2(s-3)z-4s\bigr)\mathrm dz = \Gamma(s+1)\Bigl((3s+5)\sum_{k\ge1}\bigl(1+2^{-k}\bigr)^{-s}+\tilde h(s)\Bigr),$$
which can be continued into the half-plane $\Re(s)>-2$ and then leads to (9). Note that the value of $\Lambda_1+\Lambda_2$ at $s=-1$ equals $-\frac{37}{32}+7\log2$. Also, by (7), $|\tilde g^{*}(c+it)| = O\bigl(|t|^{c+\frac72}e^{-\frac\pi2|t|}\bigr)$ for large $|t|$ and $c>-2$. Thus the Fourier series expansion for $G_{RS}(t)$ is absolutely convergent.
By the same Poisson-Charlier approach used in (Fuchs et al., 2014), we see that
$$V(X_n) = \underbrace{\tilde V(n)}_{=O(n)}\ \underbrace{-\ \frac n2\,\tilde V''(n) - \frac n2\,\tilde f_1''(n)^2 + O\bigl(n^{-1}\bigr)}_{=O(1)},$$
where the $O$-terms can be made more precise by Mellin transform techniques (see (Flajolet et al., 1995)) as follows. First, by moving the line of integration to the right and collecting all residues encountered (with $\chi_k := \frac{2k\pi i}{\log2}$), we deduce that
$$\tilde V(n) = \frac1{2\pi i}\int_{-\frac32-i\infty}^{-\frac32+i\infty}\frac{\tilde g^{*}(s)}{1-2^{s+1}}\,n^{-s}\,\mathrm ds = \frac1{\log2}\sum_{k\in\mathbb Z}\tilde g^{*}(-1-\chi_k)\,n^{1+\chi_k} + \frac1{2\pi i}\int_{-\frac12-i\infty}^{-\frac12+i\infty}\frac{\tilde g^{*}(s)}{1-2^{s+1}}\,n^{-s}\,\mathrm ds$$
$$= n\,G_{RS}(\log_2 n) - \sum_{k\ge1}2^{-k}\,\tilde g\bigl(2^k n\bigr),$$
which is not only an asymptotic expansion but also an identity for $n\ge1$. Here $G_{RS}(t)$ is a 1-periodic function with small amplitude, and the series over $k$ represents exponentially small terms; see (11). Similarly,
$$n\,\tilde V''(n) = \frac1{\log2}\sum_{k\in\mathbb Z}\chi_k(\chi_k+1)\,\tilde g^{*}(-1-\chi_k)\,n^{\chi_k} - \sum_{k\ge1}2^k\,\tilde g''\bigl(2^k n\bigr),$$
the first series being bounded while the second is exponentially small for large $n$. In particular, the mean value of the periodic function $G_{RS}$ is given by
$$\frac{\tilde g^{*}(-1)}{\log2} = 1.82994\,99550\,89434\,82695\,96208\,44\ldots,$$ 
in accordance with the numerical calculations; see Figure 2. (12)

Asymptotic normality

By applying either the contraction method (see (Neininger and Rüschendorf, 2004)) or the refined method of moments (see (Hwang, 2003)), we can establish the convergence in distribution of the centered and normalized random variables $(X_n-\mu_n)/\sigma_n$ to the standard normal distribution, where $\mu_n := E(X_n)$ and $\sigma_n^2 := V(X_n)$. The latter approach is also useful in providing stronger results such as the following.

Figure 2: A plot (right) of $V(X_n)/n - c_0$ for $n$ from 12 to 256 in logarithmic scale, where $c_0 = \frac1{2(\log2)^2}$ is the mean value of the second-order term (another periodic function). Without this correction term $c_0$, the fluctuations are invisible (left).

Theorem 3.
The sequence of random variables $\{X_n\}$ satisfies a local limit theorem of the form
$$P\bigl(X_n = \lfloor\mu_n + x\sigma_n\rfloor\bigr) = \frac{e^{-x^2/2}}{\sqrt{2\pi}\,\sigma_n}\Bigl(1+O\Bigl(\frac{1+|x|^3}{\sqrt n}\Bigr)\Bigr) \tag{13}$$
uniformly for $x = o\bigl(n^{1/6}\bigr)$.

Proof. (Sketch) The refined method of moments proposed in (Hwang, 2003) begins with introducing the normalized function
$$\varphi_n(\theta) := e^{-\frac12\sigma_n^2\theta^2}\,E\bigl(e^{(X_n-\mu_n)\theta}\bigr) = e^{-\mu_n\theta-\frac12\sigma_n^2\theta^2}\,P_n(\theta).$$
Then $\varphi_0(\theta) = \varphi_1(\theta) = \varphi_2(\theta) = 1$ and
$$\varphi_n(\theta) = 2^{-n}\sum_{0\le k\le n}\binom nk\,\varphi_k(\theta)\,\varphi_{n-k}(\theta)\,e^{\Delta_{n,k}\theta+\delta_{n,k}\theta^2} \qquad (n\ge3),$$
where $\Delta_{n,k} := n+\mu_k+\mu_{n-k}-\mu_n$ and $\delta_{n,k} := \frac12\bigl(\sigma_k^2+\sigma_{n-k}^2-\sigma_n^2\bigr)$. From this, we see that all Taylor coefficients $\varphi_n^{(m)}(0)$ satisfy a recurrence of the same form as (1) with different non-homogeneous parts. Then a good estimate for $|\varphi_n(\theta)|$ for small $\theta$ is obtained by establishing the uniform bounds
$$\bigl|\varphi_n^{(m)}(0)\bigr| \le m!\,C^m\,n^{\frac m3} \qquad (m\ge3),$$
for a sufficiently large constant $C>0$. Such bounds are proved by induction, using the Gaussian tails of the binomial distribution and the estimates
$$\Delta_{n,\frac n2+\frac{x\sqrt n}2},\ \delta_{n,\frac n2+\frac{x\sqrt n}2} = O\bigl(1+x^2\bigr),$$
uniformly for $x = o(\sqrt n)$ (the remaining range being completed by using the smallness of the binomial distribution). It then follows that
$$\Bigl|\varphi_n\Bigl(\frac{i\theta}{\sigma_n}\Bigr)-1\Bigr| \le \sum_{m\ge3}\frac{\bigl|\varphi_n^{(m)}(0)\bigr|\,|\theta|^m}{m!\,\sigma_n^m} = O\bigl(n^{-\frac12}|\theta|^3\bigr),$$
uniformly for $|\theta| = o\bigl(n^{1/6}\bigr)$, or, equivalently,
$$E\bigl(e^{X_n i\theta/\sigma_n}\bigr) = e^{\frac{i\mu_n\theta}{\sigma_n}-\frac{\theta^2}2} + O\Bigl(n^{-\frac12}|\theta|^3\,e^{-\frac{\theta^2}2}\Bigr), \tag{14}$$
for $\theta$ in the same range. Then another induction leads to the uniform estimate (see (Hwang, 2003) for a similar setting)
$$\bigl|E\bigl(e^{X_n i\theta}\bigr)\bigr| \le e^{-\varepsilon(n+1)\theta^2} \qquad (|\theta|\le\pi;\ n\ge3), \tag{15}$$
where $\varepsilon>0$ is a sufficiently small constant. (We use $\varepsilon>0$ as a generic symbol representing a sufficiently small number whose value may change from one occurrence to another.)
These two uniform bounds are sufficient to prove the local limit theorem by standard Fourier analysis (see (Petrov, 1975)), starting from the inversion formula

P(X_n = k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-ik\theta}\, E\bigl(e^{X_n i\theta}\bigr)\, d\theta,

and then splitting the integration range into two parts:

P(X_n = k) = \frac{1}{2\pi} \Bigl( \int_{|\theta| \le \varepsilon n^{-1/3}} + \int_{\varepsilon n^{-1/3} < |\theta| \le \pi} \Bigr) e^{-ik\theta}\, E\bigl(e^{X_n i\theta}\bigr)\, d\theta.

By (15), the second integral is asymptotically negligible:

\Bigl| \frac{1}{2\pi} \int_{\varepsilon n^{-1/3} < |\theta| \le \pi} e^{-ik\theta}\, E\bigl(e^{X_n i\theta}\bigr)\, d\theta \Bigr| = O\Bigl( \int_{\varepsilon n^{-1/3}}^{\infty} e^{-\varepsilon n \theta^2}\, d\theta \Bigr) = O\bigl( n^{-1/6}\, e^{-\varepsilon^3 n^{1/3}} \bigr).

The integral over the central range |\theta| \le \varepsilon n^{-1/3} is then evaluated by (14), using

k = \lfloor \mu_n + x\sigma_n \rfloor =: \mu_n + x\sigma_n + \delta_n, \qquad \delta_n = O(1),

giving

\frac{1}{2\pi} \int_{|\theta| \le \varepsilon n^{-1/3}} e^{-ik\theta}\, E\bigl(e^{X_n i\theta}\bigr)\, d\theta
= \frac{1}{2\pi \sigma_n} \int_{|\theta| \le \varepsilon n^{1/6}} e^{-ix\theta - \frac12 \theta^2} \Bigl(1 + O\bigl(n^{-1/2}(1 + |\theta|^3)\bigr)\Bigr)\, d\theta
= \frac{e^{-x^2/2}}{\sqrt{2\pi}\,\sigma_n} \Bigl(1 + O\Bigl(\frac{1+|x|^3}{\sqrt{n}}\Bigr)\Bigr),

which completes the proof of (13).

Note that our estimates for the characteristic function of X_n also lead to an optimal Berry–Esseen bound

\sup_{x \in \mathbb{R}} \Bigl| P\Bigl( \frac{X_n - \mu_n}{\sigma_n} \le x \Bigr) - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt \Bigr| = O\bigl(n^{-1/2}\bigr).

Figure 3: Normalized (multiplied by the standard deviation) histograms of the random variables X_n for n = 15, …, 50; the tendency to normality becomes apparent for larger n.

A simple improved version

The first few values of \mu_n and of the expected bit-complexity of Algorithm FYKY are given in the following table:

n                2    3      4      5      6      7      8      9      10
E(RS) (= \mu_n)  1    5      8.29   12.1   16.3   20.7   25.3   30.1   35
E(FYKY)          1    3.67   5.67   9.27   12.9   16.4   19.4   24     28.6

and we see that for small n Algorithm RS is better replaced by Algorithm FYKY. The analysis of the bit-complexity of these mixed algorithms (using FYKY for small n and RS for larger n) can be carried out by the same methods used above, but the expressions become more involved.

4 The bit-complexity of Algorithm FYvN

In this section, we analyze the bit-complexity of Algorithm FYvN (= Sandelius's ORP in (Sandelius, 1962)), which is described in the Introduction. Briefly, for each 2 \le k \le n, select \ell := \lceil \log_2 k \rceil random bits (independently and uniformly at random), which gives rise to a number 0 \le u < 2^{\ell}.
If u < k, use u as the required random number; otherwise repeat the same procedure until success. Let Y_n represent the total number of random bits used for generating a random permutation of n elements. Plackett showed (see (Plackett, 1968)), in the special case when n = 2^{\nu}, that

E(Y_n) \sim 2n(\log n - \log 2), \qquad V(Y_n) \sim 2(1 - \log 2)\, n \bigl( (\log_2 n)^2 - 2\log_2 n + 1 \bigr). \qquad (16)

In this section, we complete Plackett's analysis of the mean and the variance for all n, and establish a stronger local limit theorem for the bit-complexity.

Lemma 1. Let \ell_k := \lceil \log_2 k \rceil and let Geo_k be a geometric random variable with probability of success k/2^{\ell_k} (with support on the positive integers), the Geo_k being independent. Then

Y_n \overset{d}{=} \sum_{1 \le k \le n} \ell_k\, Geo_k \qquad (n \ge 1). \qquad (17)

Proof. Observe that each attempt at drawing c_k consumes \ell_k random bits and succeeds with probability k/2^{\ell_k}, independently of all other attempts; hence the number of random bits used for selecting c_k is \ell_k Geo_k.

Expected value

The mean of Y_n satisfies

E(Y_n) = \sum_{1 \le k \le n} \frac{\ell_k\, 2^{\ell_k}}{k}.

By splitting the range [1, n] into blocks of the form (2^j, 2^{j+1}], we obtain the following asymptotic approximation to E(Y_n).

Theorem 4. The expected number of random digits used by Algorithm FYvN to generate a random permutation of n elements satisfies

E(Y_n) = F_{vN}^{[1]}(\log_2 n)\, n \log n + n\, F_{vN}^{[2]}(\log_2 n) + O\bigl((\log n)^2\bigr), \qquad (18)

where F_{vN}^{[1]}(t) and F_{vN}^{[2]}(t) are continuous, 1-periodic functions defined by

F_{vN}^{[1]}(t) := 2^{1-\{t\}}\bigl(1 + \{t\}\bigr), \qquad
F_{vN}^{[2]}(t) := -2^{1-\{t\}}(\log 2)\bigl(1 + \{t\}^2\bigr).

Proof. Write \nu_n := \lfloor \log_2 n \rfloor and \theta_n := \{\log_2 n\}, so that n = 2^{\nu_n + \theta_n}. We start with the decomposition

E(Y_n) = \sum_{0 \le \ell < \nu_n} (\ell+1)\, 2^{\ell+1} \sum_{2^{\ell} < j \le 2^{\ell+1}} \frac{1}{j} \;+\; (\nu_n+1)\, 2^{\nu_n+1} \sum_{2^{\nu_n} < j \le n} \frac{1}{j}.

By using the estimates

\sum_{2^{\ell} < j \le 2^{\ell+1}} \frac{1}{j} = \log 2 - 2^{-\ell-2} + O\bigl(4^{-\ell}\bigr), \qquad
\sum_{2^{\nu_n} < j \le n} \frac{1}{j} = \log\bigl(n\, 2^{-\nu_n}\bigr) + O\bigl(2^{-\nu_n}\bigr),

together with \sum_{1 \le m \le \nu} m\, 2^m = (\nu-1)2^{\nu+1} + 2, we deduce that

E(Y_n) = 2^{\nu_n+1} (\log 2) \bigl( \nu_n(1+\theta_n) + \theta_n - 1 \bigr) + O\bigl((\log n)^2\bigr).

Writing 2^{\nu_n+1} = 2^{1-\theta_n} n and \nu_n = \log_2 n - \theta_n, this becomes

E(Y_n) = 2^{1-\theta_n}(1+\theta_n)\, n \log n \;-\; 2^{1-\theta_n} (\log 2) \bigl(1+\theta_n^2\bigr)\, n + O\bigl((\log n)^2\bigr),

which is also valid when n = 2^{\nu_n}. This completes the proof of (18) and Theorem 4.
Note that the dominant term can be written as F_{vN}^{[1]}(\cdot)\, n \log n = (\log 2) F_{vN}^{[1]}(\cdot)\, n \log_2 n, and the minimum value of (\log 2)F_{vN}^{[1]}(\cdot) equals 2\log 2 \approx 1.38 > 1, which means that Algorithm FYvN requires more random bits than Algorithm RS for large n; see (8).

Figure 4: The periodic functions (from left to right) F_{vN}^{[1]}, F_{vN}^{[2]}, G_{vN}^{[1]}, G_{vN}^{[2]}, G_{vN}^{[3]} in the unit interval.

Variance

Analogously, by (17), the variance of Y_n is given by

V(Y_n) = \sum_{1 \le k \le n} \ell_k^2\, \frac{2^{\ell_k}}{k} \Bigl( \frac{2^{\ell_k}}{k} - 1 \Bigr).

From this expression and a similar analysis as above, we can derive the following asymptotic approximation to the variance, whose proof is omitted here.

Theorem 5. The variance of Y_n satisfies

V(Y_n) = G_{vN}^{[1]}(\log_2 n)\, n (\log n)^2 + G_{vN}^{[2]}(\log_2 n)\, n \log n + n\, G_{vN}^{[3]}(\log_2 n) + O\bigl((\log n)^3\bigr), \qquad (19)

where G_{vN}^{[1]}(t), G_{vN}^{[2]}(t) and G_{vN}^{[3]}(t) are continuous, 1-periodic functions defined by (see Figure 4)

G_{vN}^{[1]}(t) := 2^{1-\{t\}} (\log 2)^{-2} \bigl( 3 - (\log 2)(1 + \{t\}) - 2^{1-\{t\}} \bigr),
G_{vN}^{[2]}(t) := 2(\log 2)(1 - \{t\})\, G_{vN}^{[1]}(t) - 2^{3-\{t\}} (\log 2)^{-1} (1 - \log 2),
G_{vN}^{[3]}(t) := (\log 2)^2 (1 - \{t\})^2\, G_{vN}^{[1]}(t) + 2^{2-\{t\}} (1 - \log 2)(1 + 2\{t\}).

In particular, if n = 2^{\nu}, then

V(Y_n) = 2(1 - \log 2)\, n \bigl( (\log_2 n)^2 - 2\log_2 n + 3 \bigr) + O\bigl((\log n)^3\bigr),

where the last term inside the parentheses differs slightly from Plackett's expression in (Plackett, 1968).

Asymptotic normality

Since Y_n is the sum of independent geometric random variables, we can derive very precise limit theorems by following the classical approach; see (Petrov, 1975).

Theorem 6. The bit-complexity of Algorithm FYvN satisfies the local limit theorem

P\Bigl( Y_n = \bigl\lfloor E(Y_n) + x\sqrt{V(Y_n)} \bigr\rfloor \Bigr) = \frac{e^{-x^2/2}}{\sqrt{2\pi V(Y_n)}} \Bigl( 1 + O\Bigl( \frac{1+|x|^3}{\sqrt{n}} \Bigr) \Bigr),

uniformly for x = o(n^{1/6}).

Proof.
By (17), the moment generating function of Y_n satisfies (p_k := k/2^{\ell_k})

E\bigl(e^{Y_n \theta}\bigr) = \prod_{1 \le k \le n} \frac{p_k\, e^{\ell_k \theta}}{1 - (1-p_k)\, e^{\ell_k \theta}}.

By induction, we see that the cumulant of order m satisfies

\kappa_m(Y_n) = \sum_{1 \le k \le n} \ell_k^m\, p_k^{-m}\, \mathrm{polynomial}_m(p_k) = O\bigl(n (\log n)^m\bigr) \qquad (m = 1, 2, \dots). \qquad (20)

From this we deduce that

E \exp\Bigl( \frac{(Y_n - E(Y_n))\, i\theta}{\sqrt{V(Y_n)}} \Bigr) = \exp\Bigl( -\frac{\theta^2}{2} + O\Bigl( \frac{|\theta|^3}{\sqrt{n}} \Bigr) \Bigr), \qquad (21)

uniformly for |\theta| \le \varepsilon\sqrt{n}. This estimate, coupled with the usual Berry–Esseen inequality, is sufficient to prove an optimal convergence rate to normality. For the stronger local limit theorem, it suffices to prove the bound

\bigl| E\bigl(e^{Y_n i\theta}\bigr) \bigr| \le e^{-\varepsilon_1 n \theta^2}, \qquad (22)

uniformly for |\theta| \le \pi, where \varepsilon_1 > 0 is a sufficiently small constant. Then the local limit theorem follows from the same argument used in the proof of (13).

To prove (22), a direct calculation yields (\nu_n := \lfloor \log_2 n \rfloor)

\bigl| E\bigl(e^{Y_n i\theta}\bigr) \bigr|
= \prod_{1 \le k \le n} \frac{1}{\sqrt{1 + 2(1-p_k)\, p_k^{-2}\, (1 - \cos \ell_k\theta)}}
\le \prod_{1 \le k \le 2^{\nu_n}-1} \frac{1}{\sqrt{1 + 2(1-p_k)(1 - \cos \ell_k\theta)}}
= \prod_{1 \le \ell < \nu_n} \prod_{1 \le k < 2^{\ell-1}} \frac{1}{\sqrt{1 + 2^{1-\ell} k\, (1 - \cos \ell\theta)}},

after the substitution k \mapsto 2^{\ell} - k within each block \{k : \ell_k = \ell\}. For 0 \le x \le 4, we have the elementary inequality \frac{1}{\sqrt{1+x}} \le e^{-x/5}, so that

\bigl| E\bigl(e^{Y_n i\theta}\bigr) \bigr|
\le \exp\Bigl( -\frac{1}{5} \sum_{1 \le \ell < \nu_n} \sum_{1 \le k < 2^{\ell-1}} 2^{1-\ell} k\, (1 - \cos \ell\theta) \Bigr)
= \exp\Bigl( -\frac{1}{20} \sum_{1 \le \ell < \nu_n} (2^{\ell} - 2)(1 - \cos \ell\theta) \Bigr).

By the inequality 2^{\ell} - 2 \ge 2^{\ell-1} for \ell \ge 2, we then obtain

\bigl| E\bigl(e^{Y_n i\theta}\bigr) \bigr| \le \exp\Bigl( -\frac{1}{40} \sum_{2 \le \ell < \nu_n} 2^{\ell} (1 - \cos \ell\theta) \Bigr) \le e^{-\frac{1}{40}\, 2^{\nu_n}\, \psi_{\nu_n}(\theta)},

where

\psi_n(\theta) := \frac{5 - 4\cos\theta + \cos n\theta - 2\cos((n-1)\theta)}{2\,(5 - 4\cos\theta)}.

By monotonicity and induction, we deduce that \psi_n(\theta) \ge \frac{1}{6}(1 - \cos\theta) for n \ge 2; consequently,

\bigl| E\bigl(e^{Y_n i\theta}\bigr) \bigr| \le e^{-\frac{1}{240}\, 2^{\nu_n} (1 - \cos\theta)} \le e^{-\frac{1}{480}\, n (1 - \cos\theta)},

uniformly for |\theta| \le \pi. But 1 - \cos\theta \ge \frac{2\theta^2}{\pi^2} for |\theta| \le \pi, so that (22) follows.

5 The bit-complexity of Algorithm FYKY

We analyze the total number of bits used by Algorithm FYKY for generating a random permutation of n elements. Let B_n denote the total number of random bits flipped in the procedure Knuth-Yao of Algorithm FYKY for generating Unif[0, n-1]. Then B_{2^k} = k.

Lemma 2. The probability generating function E(t^{B_n}) of B_n satisfies

E\bigl(t^{B_n}\bigr) = 1 - (1-t) \sum_{k \ge 0} \frac{n}{2^k} \Bigl\{ \frac{2^k}{n} \Bigr\} t^k \qquad (n = 2, 3, \dots). \qquad (23)

Proof.
The probability that the algorithm does not stop after k random flips is given by

P(B_n > k) = \frac{n}{2^k} \Bigl\{ \frac{2^k}{n} \Bigr\} \qquad (k = 0, 1, \dots),

because after the first k random coin-tossings (2^k different configurations) there are exactly 2^k - \lfloor 2^k/n \rfloor n = 2^k \bmod n cases in which the algorithm does not return a random integer in the specified interval [0, n-1]. The identity (23) then follows from E(t^{B_n}) = 1 - (1-t)\sum_{k \ge 0} P(B_n > k)\, t^k.

Throughout this section, write L_x := \lfloor \log_2 x \rfloor for x > 0 and L_0 := 0. Since (n/2^k)\{2^k/n\} = 1 for 0 \le k \le L_n when n \ne 2^{L_n}, we obtain

E\bigl(t^{B_n}\bigr) = t^{L_n+1} + (t-1) \sum_{k > L_n} \frac{n}{2^k} \Bigl\{ \frac{2^k}{n} \Bigr\} t^k \qquad (n \ne 2^{L_n}).

We see that B_n is close to L_n + 1, plus some geometric-type perturbations. For computational purposes, the infinite series in (23) is less useful and it is preferable to use the following finite representation. Let \varphi(n) denote Euler's totient function (the number of positive integers less than n and relatively prime to n).

Corollary 1. For n \ge 2,

E\bigl(t^{B_n}\bigr) = t\, E\bigl(t^{B_{n/2}}\bigr), \qquad if n is even;
E\bigl(t^{B_n}\bigr) = 1 - \frac{1-t}{1 - 2^{-\varphi(n)} t^{\varphi(n)}} \sum_{0 \le k < \varphi(n)} \frac{2^k \bmod n}{2^k}\, t^k, \qquad if n is odd. \qquad (24)

Proof. This follows from (23) by grouping terms containing the same fractional parts \{2^k/n\}, which for odd n repeat with period \varphi(n) since 2^{\varphi(n)} \equiv 1 \pmod{n}.

Let Z_n = B_1 + \cdots + B_n represent the total number of bits required by Algorithm FYKY for generating a random permutation of n elements.

5.1 Expected value of Z_n

By (23), we have

\mu_n := E(Z_n) = \sum_{1 \le m \le n} a_m, \qquad where \qquad a_n := E(B_n) = \sum_{k \ge 0} \frac{n}{2^k} \Bigl\{ \frac{2^k}{n} \Bigr\}.

This sequence has been studied in the literature; see (Knuth and Yao, 1976; Pokhodzeĭ, 1985; Lumbroso, 2013). Obviously,

a_n = \sum_{0 \le k \le L_n} 1 + \sum_{k > L_n} \frac{n}{2^k} \Bigl\{ \frac{2^k}{n} \Bigr\},

when n \ne 2^{L_n}, so we obtain the easy bounds

L_n \le a_n \le L_n + 1 + n\, 2^{-L_n},

and a_n = \log_2 n + O(1).

Lemma 3. For n \ge 1,

a_n = a_{n/2} + 1, \qquad if n is even;
a_n = \frac{2^{\varphi(n)}}{2^{\varphi(n)} - 1} \sum_{0 \le j < \varphi(n)} \frac{2^j \bmod n}{2^j}, \qquad if n is odd.

Proof. These relations follow from (24).

Corollary 2. For n \ge 1,

a_n = \log_2 n + F_0(\log_2 n), \qquad (25)

for some 1-periodic function F_0; see Figure 5. A formal expansion for F_0(t) was derived in (Lumbroso, 2013).

We prove the following estimate for \mu_n.
Figure 5: Periodic fluctuations of a_n - \log_2 n in log-scale (left) and normalized in the unit interval (right). The largest value achieved by the periodic function in the interval n \in [2^k, 2^{k+1}] is at n = 2^k + 1, for which a_{2^k+1} - k approaches 2 for large k.

Figure 6: Periodic fluctuations of \mu_n/n - \log_2 n in log-scale (left) and F_{KY} in the unit interval (right).

Theorem 7. The expected number \mu_n of random bits required by Algorithm FYKY satisfies

\mu_n = n \log_2 n + n\, F_{KY}(\log_2 n) + O\bigl((\log n)^2\bigr), \qquad (26)

where F_{KY}(t) is a continuous 1-periodic function whose Fourier expansion is given by (\chi_k := 2k\pi i/\log 2)

F_{KY}(t) = \underbrace{\frac12 - \frac{\gamma}{\log 2}}_{\approx\, -0.33274} \;+\; \frac{1}{\log 2} \sum_{k \ne 0} \frac{\zeta(1 + \chi_k)}{1 - \chi_k^2}\, e^{2k\pi i t} \qquad (t \in \mathbb{R}), \qquad (27)

the series being absolutely convergent. Here \zeta(s) denotes Riemann's zeta function and \gamma is Euler's constant. Note that |\zeta(1+it)| = O\bigl((\log |t|)^{2/3}\bigr) for large |t|; see (Titchmarsh, 1986).

Our method of proof is based on approximating the partial sum \mu_n by an integral

M(x) := \int_0^x a(t)\, dt, \qquad where \qquad a(x) := \sum_{k \ge 0} \frac{x}{2^k} \Bigl\{ \frac{2^k}{x} \Bigr\} \qquad (x > 0),

and estimating their difference. Obviously, a_n = a(n) for integer n > 0. The asymptotics of M(x) is comparatively simpler and can be derived by standard Mellin transform techniques; see (Flajolet et al., 1995). Indeed, we derive an asymptotic expansion that is itself an identity for x \ge 1.

Proposition 1. The integral M(x) satisfies the identity

M(x) = x \log_2 x + x\, F_{KY}(\log_2 x) + \frac{\pi^2}{12}, \qquad (28)

for x \ge 1, where F_{KY} is given in (27).

Proof.
We start with the relation

a(x) - a\Bigl(\frac{x}{2}\Bigr) = x \Bigl\{ \frac1x \Bigr\} = \begin{cases} 1, & if\ x > 1;\\ x\{1/x\}, & if\ 0 < x \le 1,\end{cases}

which holds because a(x/2) = \sum_{k \ge 1} (x/2^k)\{2^k/x\}, so the difference is the k = 0 term. Then for x > 1,

M(x) - 2M\Bigl(\frac{x}{2}\Bigr) = \int_0^x \bigl( a(t) - a(t/2) \bigr)\, dt = x - 1 + \int_0^1 t \Bigl\{ \frac1t \Bigr\}\, dt.

The last integral is equal to

\int_0^1 t \Bigl\{ \frac1t \Bigr\}\, dt = \int_1^{\infty} \frac{\{t\}}{t^3}\, dt = \sum_{j \ge 1} \int_0^1 \frac{t}{(j+t)^3}\, dt = 1 - \frac{\pi^2}{12}.

Thus, M(x) satisfies the functional equation

M(x) = 2M\Bigl(\frac{x}{2}\Bigr) + x - \frac{\pi^2}{12} \qquad (x > 1), \qquad (29)

which implies that \bar{M}(x) := \bigl(M(x) - \frac{\pi^2}{12}\bigr)/x - \log_2 x is a periodic function, namely, \bar{M}(2x) = \bar{M}(x) for x \ge 1, or, equivalently, (28); it remains to derive finer properties of the periodic function F_{KY}. For that purpose, we apply the Mellin transform. First, the integral M is decomposed as

M(x) = \sum_{k \ge 0} \int_0^x \frac{t}{2^k} \Bigl\{ \frac{2^k}{t} \Bigr\}\, dt = \sum_{k \ge 0} 2^k \int_{2^k/x}^{\infty} \frac{\{t\}}{t^3}\, dt. \qquad (30)

Then the Mellin transform of M(x) can be derived as follows (assuming -2 < \Re(s) < -1):

\int_0^{\infty} x^{s-1} M(x)\, dx
= \sum_{k \ge 0} 2^k \int_0^{\infty} \frac{\{t\}}{t^3} \int_{2^k/t}^{\infty} x^{s-1}\, dx\, dt
= -\frac{1}{s} \sum_{k \ge 0} 2^{k(s+1)} \int_0^{\infty} \frac{\{t\}}{t^{s+3}}\, dt
= \frac{\zeta(s+2)}{s(s+2)\bigl(1 - 2^{s+1}\bigr)},

where we used the integral representation (see (Titchmarsh, 1986, p. 14))

\zeta(s+1) = -(s+1) \int_1^{\infty} \frac{\{t\}}{t^{s+2}}\, dt \qquad (-1 < \Re(s) < 0).

All steps here are justified by absolute convergence when -2 < \Re(s) < -1. We then have the inverse Mellin integral representation

M(x) = \frac{1}{2\pi i} \int_{-\frac32 - i\infty}^{-\frac32 + i\infty} \frac{\zeta(s+2)}{s(s+2)(1 - 2^{s+1})}\, x^{-s}\, ds
= \frac{1}{2\pi i} \int_{-\frac12 - i\infty}^{-\frac12 + i\infty} \frac{\zeta(s+1)}{(s-1)(s+1)(1 - 2^{s})}\, x^{1-s}\, ds \qquad (x > 0).

Move now the line of integration to the right, using the known asymptotic estimates for |\zeta(s)| (see (Titchmarsh, 1986, Ch. V))

|\zeta(c + it)| = \begin{cases} O\bigl(|t|^{\frac12(1-c)+\varepsilon}\bigr), & if\ 0 \le c \le 1;\\ O\bigl((\log |t|)^{2/3}\bigr), & if\ c = 1, \end{cases} \qquad (31)

as |t| \to \infty.
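As a supplementary worked step (not spelled out in the text), the contribution of the double pole at s = 0 can be extracted from the local expansions, with \gamma denoting Euler's constant:

```latex
\zeta(s+1) = \frac{1}{s} + \gamma + O(s), \qquad
\frac{1}{1-2^{s}} = -\frac{1}{s\log 2}\Bigl(1 - \frac{s\log 2}{2} + O(s^{2})\Bigr), \qquad
x^{1-s} = x\,e^{-s\log x},
\]
so that, since $1/((s-1)(s+1)) = -1 + O(s^{2})$,
\[
\operatorname*{Res}_{s=0}\; \frac{\zeta(s+1)\,x^{1-s}}{(s-1)(s+1)(1-2^{s})}
= -x\log_{2}x + \Bigl(\frac{\gamma}{\log 2}-\frac12\Bigr)x.
```

With the orientation of the contour (moving the line to the right collects the negatives of the residues), this yields the terms x\log_2 x + (\tfrac12 - \gamma/\log 2)x of (28) and identifies the mean value 1/2 - \gamma/\log 2 \approx -0.33274 of F_{KY} in (27).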
A direct calculation of the residues at the poles (a double pole at s D 0 and simple poles at s D k , s D 1) then gives Z 21 Ci1 .s C 1/ 1 2 x 1 s ds D x log2 x C xFKY .log2 x/ C 12 C .x/; 2 s 1 2 i .s 1/.1 2 / 2 i1 for x > 0, where FKY is given in (27) and .x/ is give by Z 23 Ci1 1 .s C 1/ .x/ WD x1 2 s 3 2 i 2 i1 .s 1/.1 2 / To evaluate this integral, we use the relations Z cCi1 1 x s 0; ds D x 2 i c i1 1 s 2 2 1 ; 2x s if x > 1I if 0 < x 6 1; ds: (32) .c > 1/; by standard residue calculus (integrating along a large half-circle to the right of the line <.s/ D c if x > 1, and to the left otherwise). With this relation, we then have 8 ˆ if x > 1I ˆ 0; X < 1 1 2k 1 x 2 ; if 0 < x 6 1: .x/ D k ˆ 2 k 2 `2 ˆ : 2 `x61 k;`>1 s by expanding the zeta function and 1 12s D 1 2 2 s in Dirichlet series and then integrating term by term. Note that the double sum expression for .x/ can be simplified but we do not need it. Also .x/ D 0 for 12 < x 6 1. 25 Observe that the Fourier series expansion (27) converges only polynomially. We derive a different expansion for FKY , with an exponential convergence rate. Lemma 4. The periodic function FKY has the series expansion X b2k ft g c ft g 2 2k FKY .t/ D 1 ftg 2 C 1 6 2kC1 ft g 1 ft g 0 k ft g 2 ˘ C1 ; (33) k>1 denotes the digamma function (derivative of log ) and for t 2 R, where 0 .k C 1/ D 1 j >k j 2 . P For large k, we have b2k ft g c 2kC1 ft g 1 since 0 .k C 1/ D 2k 1 ft g 1 2k 2 1 k C 0 62k 6k C 1 CO 2 2kC2 ft g 3 22.kC1 ft g/ ˚ C O k14 , where k WD 2k ft g . 1 6k 3 2k ft g ˘ Proof. 
By (30), we have, for x > 0, 0 k˘ Z 2x C1 Z 1 X k@ M.x/ D 2 C k˘ k 2 x 2 x k>0 0 X BZ 1 t D 2k @ n k o ˘ 2 2k k>0 x X D x k>0 1 C1 D x 3k 1 A ftg dt t3 C1 1 Ct x 2 2k 2kC1 x X 3 dt C 2 k 1 0 0 k˘ j> 2 x 2k x 1 Z t C dt A .j C t/3 C1 C1 : Now if x ¤ 2m , then (Lx WD blog2 xc) X M.x/ D x 2 12 06k6Lx 2 k x 2 2k 2kC1 x X C x k>Lx C1 2 D xLx C x 2Lx C 12 6 kCLx X x 2 Cx 1 2kCLx C1 x 2 k 1 0 2k x 2kCLx x 1 0 2kCLx x C1 ; which also holds for x D 2m , and in that case we have M.2m / D m2m C 0 $ 2 C1 2 6 2m C 2 ; 12 1 C 6 , where (see Section 5.3) X $ WD 1 2k 0 .2k C 1/ 0:44637 64113 48039 93349 : : : : 2 .2/ D k>1 This proves (33) by writing Lx D log2 x C1 2 k>1 by using flog2 xg. 26 ; Note that if we use the expression (33) for FKY .t/, then the identity (28) holds for x > 0. We turn now to estimating the difference between n and M.n/. Proposition 2. The difference n M.n/ satisfies n M.n/ D O .log n/2 : (34) Proof. We have (defining a.0/ D 0) n M.n/ D X Z 06m6n 1 .a.m/ a.m C t// dt C O.log n/: 0 Now X 2k m 2k m C t a.m C t/ D m 2k mCt 2k k>Lm k k X X m X 2 2 1 m CO C D 2k m mCt 2k 2k k>Lm k>2Lm Lm 6k<2Lm k k X m 2 1 2 D CO : k 2 m mCt m a.m/ Lm 6k<2Lm Thus n M .n/ D X X 26m6n Lm 6k<2Lm m 2k Z 1 0 2k m By writing fxg D x bxc, we then obtain Z 1 Z k m 1 2k 2 t dt D dt k 2 0 m mCt 0 mCt m 2k 2k mCt Z 1 0 dt C O.log n/: 2k m 2k mCt The first integral on the right-hand side is bounded above by X Z 1 X X log m t dt D O D O .log n/2 : m 0 mCt 26m6n 26m6n Lm 6k<2Lm It remains to estimate the double-sum X X M1 WD Z k 2 m 1 2k dt k 2 0 m mCt 26m6n Lm <k<2Lm k k X X m 2 2 6 k 2 m mC1 26m6n Lm <k<2Lm k k X X m 2 2 D : k 2 m m C 1 36k<2Ln b k cC1 2 2 6m6minf2k ;ng 27 dt: For a fixed k, the difference in the interval 2k ˘ m 2k mC1 ˘ assumes the value 1 if there exists an integer q lying 2k 2k <q6 ; mC1 m k˘ and 2m inequality 2k mC1 m 6 q1 . 2k (35) ˘ assumes the value 0 otherwise. 
For those m satisfying (35), we have the It follows that m 2k X 2b 2 c k C1 2k m 2k mC1 6 6m6minf2k ;ng X 1 D O.k/; q k 16q62 and, consequently, M1 D O X k D O .log n/2 : 36k62Ln This proves the proposition. 5.2 Variance of Zn Let bn D E.Bn2 / D Bn00 .1/ C Bn0 .1/. Lemma 5. For n > 1 8 ˆ <b n2 C 2a n2 C 1; X 2k mod n 2k C 1 21 .n/ .n/ bn D C ; ˆ : 2k 1 2 .n/ .1 2 .n/ /2 06k<.n/ Proof. By (23) and (24). Note that the variance vn WD bn a2n of Bn satisfies the recurrence v2n D vn .n > 1/: On the other hand, we also have bn D X k>0 n 2k ; .2k C 1/ k 2 n and we will derive an asymptotic approximation to the variance of Zn X X &n2 WD V.Zn / D vm D bn a2n : 26m6n 28 26m6n if n is evenI if n is odd: Figure 7: Periodic fluctuations of the variance of Bn (D bn (left) and for n D 29 ; : : : ; 210 (right). a2n ) in log-scale for n D 2; : : : ; 210 Theorem 8. The variance of the total number of random bits flipped to generate a random permutation by Algorithm FYKY satisfies &n2 D nGKY .log2 n/ C O..log n/3 /; (36) where GKY .u/ is a continuous, bounded, periodic function of period 1 defined by Z 2fug j X 1 fug j fug GKY .u/ D v0 2 C 2 g.t/ dt: j >1 Here v0 WD R1 0 (37) 0 g.t/ dt and g.x/ WD 1 1 1 x x 2a Cx : x 2 x (38) Numerically, v0 0:47021 47736 99741 30560 : : : ; see Section 5.3 for different approaches of numerical evaluation. This theorem will follow from Propositions 3 and 5 given below. Similar to the case of n , a good approximation to &n2 is given by the integral Z x Z x V .x/ WD v.t / dt D b.t/ a.t/2 dt; 0 where v.x/ WD b.x/ 0 a.x/ represents a continuous version of vn and X x 2k : b.x/ WD .2k C 1/ k 2 x 2 k>0 Now consider v.x/ D X k>0 x 2k .2k C 1/ k 2 x !2 X x 2k 2k x k>0 X x k 2 1 Dx C .2k C 3/ 2k x x 2 2 k>0 X x k !2 2 1 2 x C ; x k x 2 2 From this relation, we derive the following functional equation. 29 k>0 Lemma 6. For x > 0 V .x/ 2V x 2 (39) 0 v.x/ D v v.x/ D v g.t/ dt: D Proof. 
If x > 1, then if 0 < x 6 1, then minf1;xg Z x x 2 2 I C g.x/; where g is defined in (38). We now show that this functional equation leads to an asymptotic approximation that is itself an identity, as in the case of M.x/. Proposition 3. The integral V .x/ satisfies V .x/ D xGKY .log2 x/ v0 ; (40) for x > 1, where GKY is defined in (37). Proof. By a direct iteration of (39), we obtain V .x/ D v0 2 Lx C1 1 C X j >1 2 Lx Cj Z x 2Lx Cj g.t/ dt; 0 ˚ for x > 1, where the sum is absolutely convergent because (a.x/ D O.x/ and x x1 D O.x/) Z x Z x t 1 1 g.t/ dt D 1 t 2a Ct dt D O x 2 ; (41) t 2 t 0 0 as x ! 0. Now writing x D 2Lx Cx , where x WD flog2 xg, we obtain (40). Note that GKY .0/ D limu!1 GKY .u/, and GKY is continuous on Œ0; 1. On the other hand, since g.x/ D O.x/, its integral is of order x 2 as x ! 0, which implies that the series in (37) is absolutely convergent. Accordingly, GKY is a bounded periodic function. P Proposition 4. The Fourier coefficients of GKY .u/ D k2Z gk e 2k iu can be computed by Z 1 1 gk D g.t/t k 1 dt .k 2 Z/; (42) .log 2/.k C 1/ 0 the series being absolutely convergent. In particular, the mean value g0 is given by 2 1 1 2 2 X k.k C 1/. k C 1/ 2 g0 D C 2 1 2 24 2.log 2/2 6 .log 2/3 sinh 2k log 2 k>1 1:55834 75820 73324 42639 35697 76811 51355 37715 91606 58602 30 (43) where 1 is a Stieltjes constant: .log m/2 2 X log j j 1 WD lim m!1 26j 6m ! 0:72815 84548 36767 24860 : : : : Note that the terms in the series in (43) are convergent extremely fast with the rate k 2k 2 4 4 3 k.log k/ exp k.log k/ 3 2:33 1012 ; log 2 (44) by (31). Furthermore, the mean value (43) is smaller than that (12) of Algorithm RS. Proof. By definition, 1 Z gk D v0 2k iu 1 u e 2 du C 0 gk0 D 2 X Z j >1 j 1 Z C 0 0 2k iu j u e 2u Z j g.t/ dt du: 2 0 The second term gk0 can be simplified as follows. 21 Z 1 0 j >1 1 . .log 2/.k C1/ The first term equals XZ 2 j Z j X 1 D 2j .log 2/.k C 1/ j >1 ! 1 g.t /e 2k iu j u 2 du dt j Clog2 t 1 2 Z j 21 Z g.t / dt C j ! 
k 1 g.t/ t 0 2 2j j 1 dt : By summation by parts, we see that X j 1 2 Z 2 j >1 j g.t/ dt D 0 X .2 j 1/ g.t/ dt 2 j >0 D X 2 j 1 Z 21 j 1 j Z g.t/ dt 2 j >1 j 2 Z j 1 g.t / dt: 0 Thus, we obtain (42). The proof of (43), together with different numerical procedures, will be given in the next section. We now show that &n2 V .n/ is small. Proposition 5. The differencebetween the variance &n2 and its continuous approximation V .n/ is bounded above by O .log n/3 . 31 Figure 8: Periodic fluctuations of (right). &n2 C2 log2 nC3 n in log-scale for n D 27 ; : : : ; 211 (left) and GKY .u/ Proof. The proof is similar to that of Proposition 2. By definition, &n2 V .n/ D 06m6n D 1 X Z .v.m/ 1 X Z 06m6n v.m C t// dt C O.1/ 0 b.m/ b.m C t/ a.m/2 a.m C t/2 dt C O.1/: 0 Now divide the sum of terms into three parts: &n2 V .n/ D 2W1 .n/ C W2 .n/ C W3 .n/ C O.1/; where W1 .n/ D 06m6n W2 .n/ D 0 k>1 a.m/ k mCt 2 k k dt 2 mCt a.m C t/ dt 0 1 X Z 06m6n X m 2k k k 2 m 1 X Z 06m6n W3 .n/ D 1 X Z a.m C t /2 dt: a.m/2 0 We already proved in Proposition 2 that W2 .n/ D O .log n/2 . On the other hand, W3 .n/ D X Z 06m6n 1 a.m/ a.m C t / a.m/ C a.m C t/ dt 0 X Z 1ˇ ˇa.m/ D O .log n/ 06m6n 0 D O .log n/3 ; 32 ˇ a.m C t/ˇ dt by Proposition 2. For W1 .n/, we again follow exactly the same argument used in proving Proposition 2 and deduce that Z k X X 2 m 1 2k dt C O.log n/ W1 .n/ D k k 2 0 m mCt 06m6n Lm 6k<2Lm ! X X X X k k DO C C O.log n/ mC1 q 06m6n Lm 6k<2Lm 16k62Ln 16q62k 3 D O .log n/ : Theorem 8 now follows from Propositions 3, 4 and 5. It remains to prove the more precise expression (43) for the mean value g0 and other Fourier coefficients gk . 5.3 Evaluation of gk We show in this part how the coefficients g0 and gk with k ¤ 0 can be numerically evaluated to high precision. For that purpose, we will derive a few different expressions for them, which are of interest per se. 
We focus mainly on g0 , and most of the approaches used also apply to other constants or coefficients appeared in this paper. The mean value of GKY The mean value of GKY is split, by (38), into two parts Z 1 g 0 C g000 1 g.t/ g0 D dt DW 0 ; log 2 0 t log 2 where g00 Z 1 1 WD 0 Z 1 1 1 ftg dt D t t t t2 1 and g000 1 Z 1 1 t WD 2 0 ftg2 t3 dt D 2 12 1 ; 2 1 t t a dt: t 2 Lemma 7. g000 D X k>1 k Z 2 0 1 1 e 2k t 1 t et 1 1 dt: Proof. By definition and direct expansions X Z 1 1 2k 1 00 g0 D 2 1 t dt k t t 0 2 k>1 X X Z 1 2k j t D2 3 dt: 0 2k j C ` C t k;j >1 06`<2k 33 (45) Now, by the integral representation x s 1 D .s/ 1 Z e xu s 1 u .x; <.s/ > 0/; du 0 we see that 2 X X Z j >1 06`<2k 1 0 jt 1 Z 2k j C ` C t 3 dt D 2 u 0 X 2k j u je j >1 Z 1 u e 2k t 0 e `u 1 eu 1 1 Z te tu dt du 0 06`<2k 1 D X 1 du: This proves (45). Lemma 8. g000 D 2 X `>3 where h` D 2hd ` e C 2 ˙` 2 h` ; `2 .` 1/ (46) 1 for ` > 2 with h0 D h1 D 0. The first few terms of h` are fh2` g`>1 D fh2` 1 g`>1 D f0; 1; 4; 5; 12; 13; 16; 17; 32; 33; 36; 37; 44; 45; 48; 49; 80; g ; which correspond to sequence A080277 in Sloane’s OEIS (Online Encyclopedia of Integer Sequences), and is connected to partial sums of dyadic valuation. Proof. Inverting (45) using Binet’s formula (see (Erdélyi et al., 1953, ~1.9)) Z 1 t 0 1 z .z C 1/ D z 1 e zt dt; t e 1 0 we get g000 D X X 1 k>1 j >1 Since 1 m 0 k 2 j .m C 1/ D 0 .2 j C 1/ : X `>mC1 k 1 `2 .` 1/ ; by grouping terms with the same number, we get X X1 00 0 g0 D 2 .m C 1/ 2k ; m k m>2 2 jm k>1 which then implies (46). 34 (47) First approach: k 1 convergence rate The most naive approach to compute g0 consists of evaluating exactly the first k > 1 terms of the series (46) and adding the error by an asymptotic estimate of the remainders. More precisely, choose k sufficiently large and then split the series into two parts depending on ` < k and ` > k. 
Since h` D 12 ` log2 ` C O.`/ for large `, we see that the remainder is asymptotic to 2 X `>k with an additional error of order k X log ` h` log2 k 2 ; 2 2 ` .` 1/ ` k `>k 1 . But such an approach is poor in terms of convergence rate. A better approach to compute g000 from (46) consists Second approach: 3 k convergence rate in expanding the series X `>3 X h` D D1 .k/; `2 .` 1/ D1 .s/ WD where X h` `>3 k>3 `s ; and then evaluate D1 by the recurrence relation of h` , namely, D1 .s/ D X 2h` C ` .2`/s `>1 D 1 2 .s 1 Since D1 .k/ D O.3 k 2/ 1 C X 2h` C ` `>1 .1 .2` 2 /.s s 1 1/s .s/ C 2 1/ 1 D1 .s C j / : 2sCj X s C j j j >1 / for large k, the terms in such a series converge at the rate O.j <.s/ 1 6 j /. Third approach: k5 k convergence rate We can do better by applying the 21 -balancing technique introduced in (Grabner and Hwang, 2005), which begins with the relation ˘ X X . 1/k k C 1 X h` h` 2 D D .k C 3/; where D .s/ WD s : 2 2 `2 .` 1/ 2k ` 12 `>3 `>3 k>0 Here the convergence rate is of order k5 k . So it suffices to compute D2 .j / for j > 3. Now 1 X 2h` C ` 1 C 1 s 3 s 2` 2` 2 2 `>1 `>1 X h` 1 1 s D2 1 1 1 s 4 ` ` 2 2 `>1 D2 .s/ D X 2h` C ` s C 1C s 1 4 ` 1 2 where Z.s/ WD s 1; 14 C s 1; 34 35 1 4 s; 41 3 4 s; 34 : C 2 s Z.s/; Thus, we obtain the recurrence 1 D2 .s C 2j / ; 16j X s C 2j Z.s/ 1 D2 .s/ D C 4.2s 2 1/ 2s 2 1 2j j >1 where the convergence rate is now improved to O.j <.s/ 1 100 j /. In this way, we obtain the g 0 Cg 00 numerical value in (43) since g0 D 0log 20 . Such an approach is generally satisfactory. But for our g0 it turns out that a very special symmetric property makes the identity (43) possible, which is not the case for other constants appearing in this paper (e.g., v0 and $). 4 2k 2 Fourth approach: k.log k/ 3 e log 2 convergence rate Instead of the elementary approach used above, we now apply Mellin transform to compute the Fourier series of GKY . We start with defining VN .x/ WD V .x/ C v0 . 
Then, by (39), ( x 0; if x > 1I R1 D VN .x/ 2VN 2 g.t/ dt; if 0 < x 6 1: x From this it follows that the Mellin transform V .s/ of VN .x/ satisfies V .s/ 1 2sC1 D g .s/; where g .s/ WD Z 1 x s 1 Z 1 g.t/ dt dx D 0 x 1 s Z 1 g.t /t s dt: 0 By (41), we see that g .s/ is well-defined in the half-plane <.s/ > 2. Thus, we anticipate the same expansion (40) with the Fourier coefficients (42). What is missing here is the growth order of jg .c C i t/j for c > 2 as jtj ! 1, which can be obtained by the integral representation (48) below. By (38), we first decompose g into two parts: Z 1 t 1 1 1 1 1 t 2a Ct t s dx DW g1 .s/ C g2 .s/ ; g .s/ D s 0 t 2 t s where 1 1 t s a t dt D2 1 t t 2 0 Z 1 1 1 sC1 g2 .s/ D 1 t t dt: t t 0 g1 .s/ Z 36 The second integral is easier and we have Z 1 ftg2 .s C 3/ ft g dt D g2 .s/ D sC3 sC4 t t sC3 1 .s C 1/.s C 2/ ; .s C 2/.s C 3/ for <.s/ > 2 (when s D 1, the last term is taken as the limit 12 ). Consider now g1 .s/. The following integral representation is crucial in proving (43). Lemma 9. For <.s/ > g1 .s/ 2, 2 1 D .s C 4/ 2 i Z cCi1 c i1 .w C 1/.w C 1/.s w C 2/.s 1 2 w w C 2/ dw; (48) where 1 < c < <.s/ C 1. Proof. By straightforward expansions as above g1 .s/ X 2 2k.sC2/ .s C 4/ D Z k>1 Since Z 1 uw 1 u 0 1 usC1 e 2k u 1 eu 1 du D .w C 1/.w C 1/ u 1 1 du: . 1 < <.w/ < 0/; eu 1 we obtain the Mellin inversion representation 0 u eu 1 1 1D 2 i Z cCi1 .w C 1/.w C 1/u w dw .c 2 . 1; 0//: c i1 Substituting this into (49), we obtain (48). Proof of (43) Taking s D g1 . 1 1/ D 2 i 1 in (48), we get 1 2 Ci1 Z 1 2 i1 .w C 1/.w C 1/. w C 1/. w C 1/ dw 1 2 w D R1 C J 2 ; where R1 sums over all residues of the poles on the imaginary axis and 1 J2 WD 2 i D 1 2 Ci1 Z 1 2 i 1 2 i1 .w C 1/.w C 1/. w C 1/. w C 1/ dw 1 2 w 1 2 Ci1 Z 1 2 i1 .w C 1/.w C 1/. w C 1/. w C 1/ dw: 1 2w 37 (49) The last integral is almost identical to g1 . 1/ except the denominator for which we write 1 2w 1 Thus J2 D D 1C 1 1 2 w : g1 . 
1/ C J3 , where 1 J3 WD 2 i Z 1 D 0 1 2 eu 6 .w C 1/.w C 1/. w C 1/. w C 1/ dw i1 1 2 D1 1 2 Ci1 Z u 1 eu 1 1 du : Collecting these relations, we see that g1 . 1/ D R1 J3 C ; 2 2 and g . 1/ D g1 . 1/ C g2 . 1/ D because g2 . 1/ D axis: 2 12 1 2 J3 . 2 It remains to compute the residues of the poles on the imaginary .w C 1/.w C 1/. w C 1/. w C 1/ Res 1 2 w wDk k2Z 2 X 2 log 2 1 2k .k C 1/. k C 1/ ; D C 2 21 2 24 2 log 2 6 .log 2/2 sinh 2k log 2 k>1 R1 g . 1/ D D 2 D R1 ; 2 X where 1 is defined in Proposition 4. The terms in the series are convergent at the rate (44), and is much faster than the previous three approaches: g0 D g . 1/ 1:55834 75820 73324 42639 35697 76811 51355 377159 16065 86021 log 2 33003 19983 06704 40332 28575 51733 41447 78391 56441 48117 : : : (using only 18 terms of the series, one gets an error less than 1:8 10 108 ). Also the dominant term alone, namely, 2 1 1 2 C 21 1:55834 75821 66122 : : : ; 24 2.log 2/2 6 gives an approximation to g0 to within an error less than 9:3 10 38 11 . Consider now g1 . 1 C k / when k ¤ 0. Similarly, by (48) with Calculation of gk for k ¤ 0 s D 1 C k , we have g1 . 1 C k / D R2 C J4 ; where R2 denotes the sum of all residues of the poles on the imaginary axis and 1 2 J4 WD .3 C k / 2 i 1 2 Ci1 Z 1 2 .w C 1/.w C 1/.1 C k 1 2 w i1 By the change of variables w 7! k J4 D D 2 1 .3 C k / 2 i 1 2 w/ dw: w, we get 1 2 Ci1 Z w/.1 C k i1 .w C 1/.w C 1/.1 C k 1 2w w/.1 C k w/ w/.1 C k w/ dw dw g1 . 1 C k / C J5 ; where Z 21 Ci1 2 1 J5 WD .w C 1/.w C 1/.1 C k 1 .3 C k / 2 i 2 i1 Z 1 k 2 u u D 1 du .3 C k / 0 e u 1 e u 1 k .k C 1/ .k C 2/ ; D2 .k C 2/.k C 1/ k C 2 which equals 2g2 . 1 C k /. Then g k g . 1 C k / R2 D .log 2/. k C 1/ 2.log 2/. k C 1/ 0 0 2 .k C 1/ C .k C 1/.k C 1/ D .log 2/2 .2k 1/.k C 2/ X .kCj C 1/.kCj C 1/. j C 1/. 
j C 1/ 2 C .log 2/2 .k 1/.k C 3/ D (50) j >1 C 1 .log 2/2 X 16j 6k 1 .j C 1/.j C 1/.k j C 1/.k .k 1/.k C 3/ j C 1/ : By the order estimate (7) for Gamma function and (31) for -function (which implies that j 0 .1 C 5 i t/j D O .log jtj 3 /, we deduce that gk D O k 2 5 .log k/ 3 ; for large jkj, so that the Fourier series of GKY is absolutely convergent. 39 (51) 5.4 Asymptotic normality of Zn We prove in this section the asymptotic normality of the bit-complexity Zn of Algorithm FYKY. Such a result is well anticipated because Zn D B1 C C Bn and each Bk is close to Lk C 1 with a geometric perturbation having bounded mean and variance. Indeed, we can establish a stronger local limit theorem for Zn . Theorem 9. The bit-complexity Zn of Algorithm FYKY satisfies a local limit theorem of the form x2 1 C jxj3 e 2 ; 1CO P .Zn D bn C x&n c/ D p p n 2 &n 1 uniformly for x D o n 6 , where n WD E.Zn / and &n2 WD V.Zn /; see (26) and (36). (52) Proof. Since Zn is the sum of n independent random variables, the r -th cumulant of Zn , denoted by Kr .n/, satisfies X Kr .n/ D r .m/ .r > 1/; 26m6n where r .m/ stands for the r -th cumulant of Bm . To show that r .m/ are bounded for all m and Bn r > 2, we observe that E t can be extended to any x > 0 by defining B.x; t / WD 1 .1 X x 2k t/ tk k 2 x .x > 0/; k>0 so that E t Bn D B.n; t /. Also B.x; t/ D tB. x2 ; t/ for x > 1 and the cumulants r .x/ WD r !Œs r log B.x; e s / are well-defined. It follows that for x > 1 x x ; e s D r r .x/ D r !Œs r s C log B 2 2 for r > 2, which then implies that r .x/ D r 2Lxx C1 for x > 1. It remains to prove that r .x/ D O.1/ for x 2 .0; 1/. Note that r .x/ is a (finite) linear combination of sums of the following form k X X 2 j k j x DO x k 2 D O.x/ D O.1/; k k 2 x k>0 k>0 for each j D 1; 2; : : : . 
This proves that each \kappa_r(x) is bounded for x > 0, and, accordingly,

K_r(n) = \sum_{2 \le m \le n} \kappa_r(m) = O(n) \qquad (r = 2, 3, \dots).

These estimates, together with those in (26) and (36), yield

E \exp\Bigl( \frac{(Z_n - \mu_n)\, i\theta}{\varsigma_n} \Bigr) = \exp\Bigl( -\frac{\theta^2}{2} + O\Bigl( \frac{|\theta|^3}{\sqrt{n}} \Bigr) \Bigr), \qquad (53)

uniformly for |\theta| \le \varepsilon\sqrt{n}. We now derive a uniform bound of the form

\bigl| E\bigl(e^{Z_n i\theta}\bigr) \bigr| \le e^{-\varepsilon n \theta^2} \qquad (|\theta| \le \pi;\ n \ge 5,\ n \ne 2^{L_n}), \qquad (54)

for some \varepsilon > 0. This bound, together with (53), will then be sufficient to prove the local limit theorem (52).

For n \ne 2^{L_n}, let E\bigl(e^{B_n i\theta}\bigr) = e^{(L_n+1) i\theta} \sum_{k \ge 0} p_{n,k}\, e^{k i\theta}, where

p_{n,k} := \frac{n}{2^{L_n+k}} \Bigl\{ \frac{2^{L_n+k}}{n} \Bigr\} - \frac{n}{2^{L_n+k+1}} \Bigl\{ \frac{2^{L_n+k+1}}{n} \Bigr\} = P(B_n = L_n + k + 1).

When both p_{n,0} and p_{n,1} are nonzero, we have

\bigl| E\bigl(e^{B_n i\theta}\bigr) \bigr| \le 1 - p_{n,0} - p_{n,1} + \bigl| p_{n,0} + p_{n,1} e^{i\theta} \bigr|
= 1 - p_{n,0} - p_{n,1} + \sqrt{(p_{n,0} + p_{n,1})^2 - 2 p_{n,0} p_{n,1} (1 - \cos\theta)}
\le 1 - \frac{p_{n,0}\, p_{n,1}}{p_{n,0} + p_{n,1}} (1 - \cos\theta),

by using the inequality \sqrt{1-x} \le 1 - \frac{x}{2} for x \in [0, 1]. Then, by the inequalities 1 - x \le e^{-x} and 1 - \cos\theta \ge \frac{2\theta^2}{\pi^2} for |\theta| \le \pi, we obtain, for |\theta| \le \pi,

\bigl| E\bigl(e^{B_n i\theta}\bigr) \bigr| \le \exp\Bigl( -\frac{2\theta^2}{\pi^2} \cdot \frac{p_{n,0}\, p_{n,1}}{p_{n,0} + p_{n,1}} \Bigr),

which holds for all n \ge 1 provided we interpret 0/0 as zero. In this way, we see that

\bigl| E\bigl(e^{Z_n i\theta}\bigr) \bigr| \le e^{-\frac{2\theta^2}{\pi^2} \Lambda_n} \le e^{-\frac{1}{5} \Lambda_n \theta^2} \qquad (|\theta| \le \pi), \qquad where \qquad \Lambda_n := \sum_{1 \le k \le n} \frac{p_{k,0}\, p_{k,1}}{p_{k,0} + p_{k,1}}.

We now prove that \Lambda_n \ge \varepsilon n for some \varepsilon > 0. Observe that p_{n,0} = n/2^{L_n+1} when n \ne 2^{L_n}, and

p_{k,1} = \frac{k}{2^{L_k+2}}, \quad if\ 2^{L_k} < k < \frac{2^{L_k+2}}{3}; \qquad p_{k,1} = 0, \quad if\ \frac{2^{L_k+2}}{3} \le k \le 2^{L_k+1}.

It follows that

\Lambda_n \ge \sum_{2 \le \ell < L_n} \sum_{2^{\ell} < k < 2^{\ell+2}/3} \frac{(k/2^{\ell+1})(k/2^{\ell+2})}{k/2^{\ell+1} + k/2^{\ell+2}}
= \sum_{2 \le \ell < L_n} \sum_{2^{\ell} < k < 2^{\ell+2}/3} \frac{k}{3 \cdot 2^{\ell+1}}
\ge \frac16 \sum_{2 \le \ell < L_n} \Bigl( \Bigl\lceil \frac{2^{\ell+2}}{3} \Bigr\rceil - 2^{\ell} - 1 \Bigr)
\ge \varepsilon' 2^{L_n} \ge \varepsilon n,

for a sufficiently small \varepsilon > 0. This completes the proof of (54) and the local limit theorem (52).

6 Implementation and testing

We discuss in this section the implementation and testing of the two algorithms FYKY and RS. We implemented the algorithms in the C language, taking as input an array of 32-bit integers (which is enough to represent permutations of size up to over four billion).
To generate the needed random bits, we used the rdrand instruction, present on Intel processors since 2012 (Intel, 2012) and on AMD processors since 2015. This instruction provides access to physical randomness and therefore does not have the biases of a pseudorandom generator. This choice also makes it easy to compare the performance of the algorithms without relying on third-party software. Alternatively, one could use a pseudorandom generator such as the Mersenne Twister, which is the default choice in most software (R, Python, Matlab, Maple, etc.) and runs faster than rdrand when properly implemented. But such a generator is known to be cryptographically insecure, because all future iterations can be predicted once a sufficient number of outputs (624 in the case of MT19937) has been observed. The hardware instruction rdrand, in contrast, is designed to be cryptographically secure. Our implementation takes care not to waste any random bits and provides the option of tracking the number of random bits consumed. The implementation of Algorithm FYKY is rather straightforward, but that of Algorithm RS is more involved. First, the recursive calls in RS are handled as follows, depending on the size of the input: for large inputs, we run the recursive calls in parallel using the POSIX thread library pthread; for intermediate inputs, we run the recursive calls sequentially to limit the number of threads; for small inputs, we use the Fisher-Yates algorithm instead to reduce the number of recursive calls. The cutoffs between small, intermediate and large inputs were determined experimentally; in our tests, thresholds of 2^16 and 2^20 seemed efficient, but this may depend on the machine and other implementation details. The second optimization for Algorithm RS concerns the splitting routine. Written naively, this routine contains a loop with an if statement depending on random data.
This is a problem because branches are considerably more efficient when they can be correctly predicted by the processor during execution. We avoid branches altogether by vectorizing the code, i.e., by using SIMD (Single Instruction, Multiple Data) processor instructions. Such instructions take as input 128-bit vector registers capable of storing four 32-bit integers, and operate on all four elements at the same time. The C language provides extensions for accessing such instructions. Specifically, our implementation uses two instructions from the AVX (Advanced Vector Extensions) instruction set supported by newer processors: vpermilps, which arbitrarily permutes the four 32-bit elements of a vector, and vmaskmovps, which writes an arbitrary subset of the four elements of a vector to memory. Both instructions take as additional input a control vector specifying the permutation or subset, of which only two bits of every 32-bit element are read. We use these instructions to separate four elements of the permutation at a time into two groups. This can be done in 16 possible ways, which means that we have to supply each instruction with one of 16 possible control registers. We do this by building a master register containing all 16 of them in packed form. We then draw a random integer r between 0 and 15 and shift every component of the master register by 2r bits to select the appropriate control register. This lets us handle four elements at a time without using branches.

Benchmarks

Below are our benchmarks for Algorithm FYKY, Algorithm RS and one of its parallel versions. The tests were performed on a machine with 32 processors.
n      | FYKY    | RS      | Parallel RS
10^5   | 4.84ms  | 4.59ms  | 4.18ms
10^6   | 51.1ms  | 51.6ms  | 18.5ms
10^7   | 712ms   | 623ms   | 121ms
10^8   | 12.5s   | 7.26s   | 1.04s
10^9   | 145s    | 81.7s   | 10.3s

         | FYKY                   | RS
Mean     | n log2 n + (0.25 ± ε)n | n log2 n − (0.33 ± ε)n
Variance | (1.83 ± ε)n            | (1.56 ± ε)n

Table 1: Left: the execution times to sample permutations of sizes from 10^5 to 10^9 (each averaged over 100 runs for sizes up to 10 million, and over 10 runs otherwise). Right: the analytic results we obtained in this paper. Here c ± ε indicates fluctuations around the mean value c (coming from the periodic functions); see (8), (12), (27) and (43).

As expected, parallelism speeds up the execution by as much as a factor of 8. What is more surprising is that, even in its sequential form, Algorithm RS is nearly twice as fast as Fisher-Yates for the larger sizes, despite performing a linearithmic rather than a linear number of memory accesses. The reason lies in the memory cache, which makes it more efficient to access memory sequentially than at haphazard places. The Fisher-Yates shuffle accesses memory at a random place at each iteration of its loop, causing a large number of cache misses. Algorithm RS, in comparison, does not have this drawback, which accounts for the observed gap in performance.

References

Anderson, R. J. (1990). Parallel algorithms for generating random permutations on a shared memory machine. In Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95–102.
Andrés, D. M. and Pérez, L. P. (2011). Efficient parallel random rearrange. In International Symposium on Distributed Computing and Artificial Intelligence, pages 183–190. Springer.
Barker, E. B. and Kelsey, J. M. (2007). Recommendation for random number generation using deterministic random bit generators (revised). US Department of Commerce, Technology Administration, National Institute of Standards and Technology, Computer Security Division, Information Technology Laboratory.
Berry, K. J., Johnston, J. E., and Mielke, Jr., P. W. (2014). A Chronicle of Permutation Statistical Methods (1920–2000, and Beyond). Springer, Cham.
Black, J. and Rogaway, P. (2002). Ciphers with arbitrary finite domains. In Topics in Cryptology - CT-RSA 2002, The Cryptographer's Track at the RSA Conference, San Jose, CA, USA, February 18-22, 2002, Proceedings, pages 114–130.
Brassard, G. and Kannan, S. (1988). The generation of random permutations on the fly. Inf. Process. Lett., 28(4):207–212.
Chen, C.-H. (2002). Generalized association plots: Information visualization via iteratively generated correlation matrices. Statistica Sinica, 12(1):7–30.
Devroye, L. (1986). Nonuniform Random Variate Generation. Springer-Verlag, New York.
Devroye, L. (2010). Complexity questions in non-uniform random variate generation. In Proceedings of COMPSTAT'2010, pages 3–18. Springer.
Devroye, L. and Gravel, C. (2015). The expected bit complexity of the von Neumann rejection algorithm. arXiv preprint arXiv:1511.02273.
Durstenfeld, R. (1964). Algorithm 235: Random permutation. Commun. ACM, 7(7):420.
Erdélyi, A., Magnus, W., Oberhettinger, F., and Tricomi, F. (1953). Higher Transcendental Functions. Vol. I. McGraw-Hill, New York.
Fisher, R. A. and Yates, F. (1948). Statistical Tables for Biological, Agricultural and Medical Research. 3rd ed. Oliver and Boyd, Edinburgh and London.
Flajolet, P., Fusy, É., and Pivoteau, C. (2007). Boltzmann sampling of unlabelled structures. In Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth Workshop on Analytic Algorithmics and Combinatorics, pages 201–211. SIAM, Philadelphia, PA.
Flajolet, P., Gourdon, X., and Dumas, P. (1995). Mellin transforms and asymptotics: harmonic sums. Theoret. Comput. Sci., 144(1-2):3–58.
Flajolet, P., Pelletier, M., and Soria, M. (2011). On Buffon machines and numbers.
In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23-25, 2011, pages 172–183. Flajolet, P. and Sedgewick, R. (1995). Mellin transforms and asymptotics: finite differences and Rice’s integrals. Theoret. Comput. Sci., 144(1-2):101–124. Flajolet, P. and Sedgewick, R. (2009). Analytic Combinatorics. Cambridge University Press, New York, NY, USA. Fuchs, M., Hwang, H.-K., and Zacharovas, V. (2014). An analytic approach to the asymptotic variance of trie statistics and related structures. Theoret. Comput. Sci., 527:1–36. Grabner, P. J. and Hwang, H.-K. (2005). Digital sums and divide-and-conquer recurrences: Fourier expansions and absolute convergence. Constr. Approx., 21(2):149–179. Granboulan, L. and Pornin, T. (2007). Perfect block ciphers with small blocks. In Fast Software Encryption, 14th International Workshop, FSE 2007, Luxembourg, Luxembourg, March 26-28, 2007, Revised Selected Papers, pages 452–465. Horibe, Y. (1981). Entropy and an optimal random number transformation. IEEE Transactions on Information Theory, 27(4):527–529. Hwang, H.-K. (2003). Second phase changes in random m-ary search trees and generalized quicksort: convergence rates. Ann. Probab., 31(2):609–629. Hwang, H.-K., Fuchs, M., and Zacharovas, V. (2010). Asymptotic variance of random symmetric digital search trees. Discrete Math. Theor. Comput. Sci., 12(2):103–165. Intel (2012). Intel Digital Random Number Generator (DRNG): Software Implementation Guide. Intel Corporation. Jacquet, P. and Szpankowski, W. (1998). Analytical de-Poissonization and its applications. Theoret. Comput. Sci., 201(1-2):1–62. Kimble, G. W. (1989). Observations on the generation of permutations from random sequences. International Journal of Computer Mathematics, 29(1):11–19. Knuth, D. E. (1998a). The Art of Computer Programming. Vol. 2, Seminumerical Algorithms. Addison-Wesley, Reading, MA. Knuth, D. E. (1998b). 
The Art of Computer Programming. Vol. 3, Sorting and Searching. Addison-Wesley, Reading, MA. Second edition.
Knuth, D. E. and Yao, A. C. (1976). The complexity of nonuniform random number generation. In Algorithms and Complexity: New Directions and Recent Results, pages 357–428.
Koo, B., Roh, D., and Kwon, D. (2014). Converting random bits into random numbers. The Journal of Supercomputing, 70(1):236–246.
Laisant, C.-A. (1888). Sur la numération factorielle, application aux permutations. Bulletin de la Société Mathématique de France, 16:176–183.
Langr, D., Tvrdík, P., Dytrych, T., and Draayer, J. P. (2014). Algorithm 947: Paraperm—parallel generation of random permutations with MPI. ACM Trans. Math. Softw., 41(1):5:1–5:26.
Lehmer, D. H. (1960). Teaching combinatorial tricks to a computer. In Proc. Sympos. Appl. Math., Combinatorial Analysis, volume 10, pages 179–193.
Louchard, G., Prodinger, H., and Wagner, S. (2008). Joint distributions for movements of elements in Sattolo's and the Fisher-Yates algorithm. Quaest. Math., 31(4):307–344.
Lumbroso, J. (2013). Optimal discrete uniform generation from coin flips, and applications. CoRR, abs/1304.1916.
Mahmoud, H. M. (2003). Mixed distributions in Sattolo's algorithm for cyclic permutations via randomization and derandomization. J. Appl. Probab., 40(3):790–796.
Massey, J. L. (1981). Collision-Resolution Algorithms and Random-Access Communications. Springer.
Moses, L. E. and Oakford, R. V. (1963). Tables of Random Permutations. Stanford University Press.
Nakano, K. and Olariu, S. (2000). Randomized initialization protocols for ad hoc networks. IEEE Transactions on Parallel and Distributed Systems, 11(7):749–759.
Neininger, R. and Rüschendorf, L. (2004). A general limit theorem for recursive algorithms and combinatorial structures. Ann. Appl. Probab., 14(1):378–418.
Petrov, V. V. (1975). Sums of Independent Random Variables. Springer-Verlag, New York-Heidelberg. Translated from the Russian by A. A.
Brown. Ergebnisse der Mathematik und ihrer Grenzgebiete, Band 82.
Plackett, R. L. (1968). Random permutations. Journal of the Royal Statistical Society. Series B (Methodological), 30(3):517–534.
Pokhodzeĭ, B. B. (1985). Complexity of tabular methods of simulating finite discrete distributions. Izv. Vyssh. Uchebn. Zaved. Mat., (7):45–50 & 85.
Prodinger, H. (2002). On the analysis of an algorithm to generate a random cyclic permutation. Ars Combin., 65:75–78.
Rao, C. R. (1961). Generation of random permutation of given number of elements using random sampling numbers. Sankhya A, 23:305–307.
Ravelomanana, V. (2007). Optimal initialization and gossiping algorithms for random radio networks. IEEE Transactions on Parallel and Distributed Systems, 18(1):17–28.
Ressler, E. K. (1992). Random list permutations in place. Inf. Process. Lett., 43(5):271–275.
Ritter, T. (1991). The efficient generation of cryptographic confusion sequences. Cryptologia, 15(2):81–139.
Rivest, R. L. (1994). The RC5 encryption algorithm. In Fast Software Encryption: Second International Workshop, Leuven, Belgium, 14-16 December 1994, Proceedings, pages 86–96.
Robson, J. M. (1969). Algorithm 362: Generation of random permutations [G6]. Commun. ACM, 12(11):634–635.
Sandelius, M. (1962). A simple randomization procedure. Journal of the Royal Statistical Society. Series B (Methodological), 24(2):472–481.
Titchmarsh, E. C. (1986). The Theory of the Riemann Zeta-Function. The Clarendon Press, Oxford University Press, New York, second edition. Edited and with a preface by D. R. Heath-Brown.
von Neumann, J. (1951). Various techniques used in connection with random digits. Journal of Research of the National Bureau of Standards. Applied Mathematics Series, 12:36–38.
Waechter, M., Hamacher, K., Hoffgaard, F., Widmer, S., and Goesele, M. (2011). Is your permutation algorithm unbiased for n ≠ 2^m? In Parallel Processing and Applied Mathematics, pages 297–306. Springer.
Wagner, S. (2009).
On tries, contention trees and their analysis. Ann. Comb., 12(4):493–507.
Wilson, M. C. (2009). Random and exhaustive generation of permutations and cycles. Ann. Comb., 12(4):509–520.