Journal of Multivariate Analysis 127 (2014) 126–146

Independence tests for continuous random variables based on the longest increasing subsequence

Jesús E. García, V. A. González-López ∗
Department of Statistics, University of Campinas, Rua Sérgio Buarque de Holanda, 651, Campinas, São Paulo. CEP 13083-859, Brazil

Article info
Article history: Received 13 March 2013. Available online 3 March 2014.
AMS 2000 subject classifications: 62G10, 62G30
Keywords: Longest increasing subsequence; Test for independence; Copula

Abstract
We propose a new class of nonparametric tests for the hypothesis of independence between two continuous random variables X and Y. Given a sample of size n, let π be the permutation which maps the ranks of the X observations onto the ranks of the Y observations. We identify the independence assumption of the null hypothesis with the uniform distribution on the permutation space. A test based on the size of the longest increasing subsequence of π (Ln) is defined. The exact distribution of Ln is computed from Schensted's theorem (Schensted, 1961). The asymptotic distribution of Ln was obtained by Baik et al. (1999). As the statistic Ln is discrete, there is a small set of possible significance levels. To solve this problem we define the JLn statistic, which is a jackknife version of Ln, as well as the corresponding hypothesis test. A third test is defined based on the JLMn statistic, which is a jackknife version of the longest monotonic subsequence of π. In a simulation study we apply our tests to diverse dependence situations with null or very small correlations, where the independence hypothesis is difficult to reject. We show that the Ln, JLn and JLMn tests perform very well in such situations. We illustrate the use of these tests on two real data examples with small sample sizes.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction

Call Ω the space of univariate continuous cumulative distributions. Let (X, Y) be a random vector with unknown joint cumulative distribution H and univariate marginal distributions F and G respectively, F ∈ Ω, G ∈ Ω. Suppose that (x1, y1), ..., (xn, yn) is a paired sample of size n of (X, Y). Set

H0 : X and Y are independent. (1)

A test is constructed with no extra assumption (other than continuity) about the form of the marginal distributions (marginal free test). The procedure is based on the size of the longest increasing subsequence of the random permutation defined by the paired sample, denoted by Ln. Theorem 3.1 shows how to compute the exact distribution of Ln; it is a straightforward application of Schensted's theorem and Frame et al.'s theorem, see Schensted [12] and Frame et al. [6]. In addition, we propose two test statistics denoted briefly by JLn and JLMn, respectively. JLn is a jackknife version of Ln, while JLMn is based on the size of the longest monotonic subsequence. The power of these tests is compared with that of various existing tests by simulation. This new class of tests is rank-based; therefore, it is compared with other rank-based procedures for testing independence, such as the nonparametric tests of Kendall, Spearman and Hoeffding and the independence test from Genest et al. [7], denoted here by Genest's test.

∗ Corresponding author.
http://dx.doi.org/10.1016/j.jmva.2014.02.010
0047-259X

J.E. García, V. A. González-López / Journal of Multivariate Analysis 127 (2014) 126–146

Fig. 1. The left figure is the scatter plot of a sample (size = 200) from a 50–50 mixture of two bivariate Normal distributions, with correlation 0.9 and −0.9 respectively (distribution D1 from Section 4). The right figure shows the plot of the sample size vs.
the empirical power (level 0.01) for the same distribution.

We also include the MIC test, based on the maximal information coefficient, from Reshef et al. [11]. In addition we include Pearson's test for its well known performance in the normal case. In the case of Kendall's test, Spearman's test, Hoeffding's test and Pearson's test, each methodology estimates the association between X and Y and tests whether that association is zero. They use different measures of association, all of them in the interval [−1, 1], with 0 indicating no association/correlation. The asymptotic Genest's test consists of computing the approximate p-values of the test statistic with respect to the empirical distribution obtained by simulation. For the MIC test, the p-value of a given MIC score is computed by selecting a probability δ of false rejection, creating a set of 1/δ − 1 surrogate datasets, and comparing the MIC of the real data with the MIC scores of the surrogate datasets.

To compute the p-values for the Kendall, Spearman and Pearson methods, we use the "cor.test" function, available in the "stats" package from R-project. Details about each test may be found in Hollander et al. [9]. In the case of Hoeffding's test, to compute the p-values, we use the "hoeffd" function, available in the "Hmisc" package from R-project. For Genest's test we use the "indepTest" function, available in the "copula" package from R-project. For the MIC test we used the support program given in http://www.exploredata.net/.

We performed a simulation study under different conditions. For example, we use a 50–50 mixture of two bivariate Normal distributions, with correlation ρ and −ρ respectively (zero expected correlation). In this case Ln, JLn and JLMn were competitive and markedly more powerful than the other six tests considered. Fig. 1, on the left, shows a scatter plot for a sample (size = 200) of this mixture when ρ = 0.9 and Fig.
1, on the right, shows the sample size versus the empirical power (level 0.01). The other tests do not detect the dependence for any sample size. This situation illustrates the usefulness of our proposal; we explore more situations of this kind in Section 4.2.

We applied the tests based on the longest increasing subsequence to two real datasets, both with small sample sizes, considering that for bigger sample sizes there exist very efficient procedures designed for asymptotic situations. The first dataset was provided by Professor Dalia Chakrabarty, researcher in the School of Physics and Astronomy, University of Nottingham. It consists of two measurements, the projected radius and the radial velocity, for 30 Globular Clusters around the galaxy NGC 3379 (see Chakrabarty [5]). The second dataset, named "coalminers", appears in "VGAM" (a package from R-project). The data are about coal miners who are smokers without radiological pneumoconiosis, classified by age, breathlessness and wheeze.

We adapted and implemented (in C language) the algorithm provided by Zoghbi et al. [13]. We use that algorithm to compute the exact probabilities of Ln in the case n ≤ 100. For n > 100 the asymptotic distribution of Ln, obtained by Baik et al. [3], can be used, and we show how to use it in our test in Section 3. Nevertheless, the exact probabilities could also be calculated for n > 100. The probabilities for JLn and JLMn were estimated by simulation. The tests and simulations were implemented in the R-project environment (LIStest package).

Section 2 provides the main concepts and the definition of the test statistic. In Section 3 we calculate the distribution of the test statistic proposed here. Theorem 3.1 gives the exact distribution of the test statistic under the independence assumption, by a direct application of results from Schensted [12] and Frame et al. [6]. Section 4 is devoted to showing the capacity of each test statistic introduced here to detect dependence.
Through simulations, we examine each of the test statistics against several dependence situations. We apply the tests to real datasets in Section 4.3. Appendix A contains the proof of Theorem 3.1.

Table 1. Paired sample of size 5.
xi   4.1   1.1   2.51   3.61   1.8
yi   3.2   3.5   4.17   3.18   2.86

Fig. 2. Dispersion plot and permutation (Example 2.2). (a) The dispersion plot for the sample; (b) the permutation defined by the sample, where the solid line shows the longest increasing subsequence; (c) the empirical copula of the sample.

2. Preliminaries

We introduce some basic concepts related to the size of the longest increasing subsequence associated with a paired sample of size n of (X, Y) with continuous marginal distributions.

Definition 2.1. Let Sn denote the group of permutations of {1, ..., n}. If π ∈ Sn, we say that π(i1), ..., π(ik) is an increasing subsequence of π if 1 ≤ i1 < ··· < ik ≤ n and 1 ≤ π(i1) < π(i2) < ··· < π(ik) ≤ n.

Definition 2.2. Given a permutation π ∈ Sn, we call $l_n(\pi)$ (or $ld_n(\pi)$) the length of the longest increasing (or decreasing) subsequence of π.

Example 2.1. Consider the set {1, 2, 3, 4, 5, 6, 7, 8}. Let π be the permutation which transforms the previous set into {3, 6, 1, 7, 4, 2, 5, 8}, where π(1) = 3, π(2) = 6, π(3) = 1, π(4) = 7, π(5) = 4, π(6) = 2, π(7) = 5, π(8) = 8. Examples of increasing subsequences are {1, 7, 8}, {3, 6, 7, 8}, {1, 2, 5, 8}. The maximal size of the increasing subsequences is 4, which is reached by the sequences {1, 2, 5, 8}, {1, 4, 5, 8} and {3, 6, 7, 8}; hence $l_8(\pi) = 4$.

We bring the concept of the longest increasing subsequence to the sample space using the next example, in which we connect the sample with a specific permutation π of n points.

Example 2.2. Let us consider the paired sample $\{(x_i, y_i)\}_{i=1}^{n}$ (from Table 1).
First, sort the sample in increasing order of the marginal sample $\{x_i\}_{i=1}^{n}$ and replace each xi value with its rank in the sequence; this produces {(1, 3.5), (2, 2.86), (3, 4.17), (4, 3.18), (5, 3.2)}. Next, replace each yi with its rank in the $\{y_i\}_{i=1}^{n}$ sequence; this produces {(1, 4), (2, 1), (3, 5), (4, 2), (5, 3)}. The permutation π related to this sample is defined by π(1) = 4, π(2) = 1, π(3) = 5, π(4) = 2, π(5) = 3. The longest increasing subsequence is {1, 2, 3} and $l_5(\pi) = 3$, see Fig. 2(b).

We now define the length of the longest increasing subsequence as a random variable.

Definition 2.3. Let (X1, Y1), (X2, Y2), ..., (Xn, Yn) be replications of (X, Y) with continuous marginal distributions. We denote by Ln the random variable $L_n = l_n(\pi_D)$, where $D = \{(X_i, Y_i)\}_{i=1}^{n}$ and $\pi_D$ is the permutation which assigns π(rank(Xi)) = rank(Yi), i = 1, ..., n.

In the next section we show the distribution of Ln under the assumption of independence between X and Y.

3. The distribution of Ln

The exact distribution of Ln in the case of independence can be obtained using the next theorem, in which the probability that Ln equals k, for k = 1, ..., n, is denoted by $p_k^n$.

Theorem 3.1. Let (X, Y) be a random vector with continuous marginal distributions, under Hypothesis (1). Suppose that (x1, y1), ..., (xn, yn) is a paired sample of size n of (X, Y). Let Sn denote the group of permutations of {1, ..., n} and let $(S_n, U)$ be Sn with uniform distribution U. Then, for k = 1, 2, ..., n, if $p_k^n = \mathrm{Prob}(L_n = k)$,

$$p_k^n = \frac{1}{n!} \sum_{m=1}^{n} \sum_{W \in V_n(k,m)} N(W)^2 \qquad (2)$$

where Ln is given by Definition 2.3, $V_n(k, m)$ is the set of shapes of standard Young tableaux of order n having k columns and m rows, and N(W) is the number of standard Young tableaux with shape W, as given by Formula (6).

Proof. See Appendix A.

Remark 1.
$(S_n, U)$ is the space of permutations $\pi_D$ where $D = \{(X_i, Y_i)\}_{i=1}^{n}$, and the $(X_i, Y_i)$ are i.i.d. with the same law as (X, Y) under Hypothesis (1). For k = 1, 2, ..., n,

$$p_k^n = \frac{\#\{\pi \in S_n : l_n(\pi) = k\}}{n!}.$$

There are diverse algorithms in the literature to find $V_n(k, m)$; we implemented the ZS2 algorithm by Zoghbi et al. [13]. Using Theorem 3.1 we compute $p_k^n$ for 1 ≤ k ≤ n, n ≤ 100. The table can be accessed from the LIStest package, implemented in R project.

The asymptotic distribution of Ln in the case of independence, after appropriate centering and scaling, was first obtained by Baik et al. [3]. Let q(z) denote the solution of the Painlevé II equation

$$q''(z) = 2q^3(z) + z\,q(z),$$

satisfying the boundary condition $q(z) \sim -\mathrm{Ai}(z)$ when z → ∞, where Ai is the Airy function and q′′ denotes the second derivative of q. Hastings et al. [8] show the asymptotic solutions

$$q(z) = -\mathrm{Ai}(z) + O\!\left(\frac{e^{-(4/3)z^{3/2}}}{z^{1/4}}\right) \text{ as } z \to \infty, \qquad q(z) = -\sqrt{\frac{-z}{2}}\left(1 + O\!\left(\frac{1}{z^2}\right)\right) \text{ as } z \to -\infty.$$

The Tracy–Widom distribution is defined by the following cumulative distribution

$$F_{TW}(t) = \exp\left(-\int_t^{\infty} (z - t)\, q^2(z)\, dz\right), \quad t \in \mathbb{R}. \qquad (3)$$

Theorem 3.2. Under the assumptions of Theorem 3.1, if χ is a random variable whose distribution function is $F_{TW}$, given by Eq. (3), then

$$\chi_n = \frac{L_n - 2\sqrt{n}}{n^{1/6}} \to \chi \text{ in distribution, as } n \to \infty. \qquad (4)$$

Proof. See Baik et al. [3].

For n > 100 we use the asymptotic distribution of Ln through Eq. (4). We calculate the asymptotic p-values using the R-package "RMTstat", specifying the parameter β = 2 in the cumulative function "ptw".

3.1. The Ln independence test

Let (x1, y1), ..., (xn, yn) be a paired sample of size n of (X, Y) with continuous marginal distributions. The p-value for a statistical test with null hypothesis of independence against the alternative hypothesis of no independence between X and Y is defined in the following way.

Definition 3.1.
The two-sided p-value is

$$\min\left\{ 2F_{L_n}(l_0)\, I_{\{F_{L_n}(l_0) \le 1/2\}} + 2\left(1 - F_{L_n}(l_0)\right) I_{\{F_{L_n}(l_0) > 1/2\}},\ 1 \right\},$$

where $l_0$ is the observed value of Ln in the sample, $F_{L_n}$ is the cumulative distribution function, $F_{L_n}(l_0) = \sum_{k=1}^{l_0} p_k^n$ (see Eq. (2)), and $I_E$ denotes the indicator function of the set E.

3.2. The JLn statistic

The JLn statistic is obtained from two modifications to the Ln statistic. The first modification is based on Johansson [10]. That paper shows that, in the independent case, if we consider Ui = rank(Xi) and Vi = rank(Yi), for i = 1, ..., n, then the typical deviation of a maximal path from the diagonal U = V is of order $n^{5/6}$. The first modification is that we only consider points whose ranks are at a distance less than or equal to $c\,n^{5/6}$ from the diagonal U = V, i.e. $|U_i - V_i| \le c\,n^{5/6}$, where c is a constant. To choose the value of c, we tried different values of c on simulated data and used the best one. We started with c = 0.1, then c = 0.2, which gave better power, and so on until the power started to go down, which happened for c = 0.5. Table 15 shows the results of the simulation study used to choose c. Note that the power of the test does not change much for values of c between 0.3 and 0.5. We choose c = 0.4, as it seems to give the best power for the distributions and sample sizes used in the simulation. Formally, we introduce the set

$$D^{diag} = \left\{ (U_i, V_i),\ i = 1, \ldots, n : |U_i - V_i| \le 0.4\, n^{5/6} \right\},$$

and we define $L_n^{diag} = l_{n_{diag}}(\pi_{D^{diag}})$, with $n_{diag} = \# D^{diag}$.

The second modification is a jackknife procedure. The statistic Ln is discrete; Fig. 3(a) shows $F_{L_{80}}$, which is the cumulative distribution function of Ln for n = 80.
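As an illustration only, the exact null distribution of Theorem 3.1 and the two-sided p-value of Definition 3.1 can be sketched in Python. The sketch assumes that N(W) in Formula (6) is given by the hook length formula of Frame et al. [6]. The function names (`lis_length`, `observed_ln`, `lis_pmf`, `two_sided_p_value`) are hypothetical and are not part of the LIStest package, and a plain recursive enumeration of partitions stands in for the ZS2 algorithm, so the sketch is only practical for small n.

```python
from bisect import bisect_left
from fractions import Fraction
from math import factorial


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k]: smallest last value of an increasing subsequence of length k + 1
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def observed_ln(x, y):
    """L_n of a paired sample: LIS of the y values once the pairs are sorted by x."""
    return lis_length([yi for _, yi in sorted(zip(x, y))])


def partitions(n, max_part=None):
    """Yield the partitions of n as non-increasing tuples (the shapes W)."""
    max_part = n if max_part is None else max_part
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def num_syt(shape):
    """N(W) via hook lengths (assumed here to be what Formula (6) states)."""
    conj = [sum(1 for row in shape if row > j) for j in range(shape[0])]
    hook_product = 1
    for i, row in enumerate(shape):
        for j in range(row):
            hook_product *= (row - j) + (conj[j] - i) - 1  # arm + leg + 1
    return factorial(sum(shape)) // hook_product


def lis_pmf(n):
    """pmf[k] = P(L_n = k) under independence, as in Eq. (2); pmf[0] is 0."""
    counts = [0] * (n + 1)
    for shape in partitions(n):
        counts[shape[0]] += num_syt(shape) ** 2  # shape[0] is the number of columns
    return [Fraction(c, factorial(n)) for c in counts]


def two_sided_p_value(l0, pmf):
    """Two-sided p-value of Definition 3.1, implemented literally."""
    cdf = sum(pmf[1:l0 + 1])
    p = 2 * cdf if cdf <= Fraction(1, 2) else 2 * (1 - cdf)
    return min(p, Fraction(1))


# Paired sample of Table 1 (n = 5)
x = [4.1, 1.1, 2.51, 3.61, 1.8]
y = [3.2, 3.5, 4.17, 3.18, 2.86]
l0 = observed_ln(x, y)
pmf = lis_pmf(len(x))
print(l0, pmf[l0], two_sided_p_value(l0, pmf))  # prints: 3 61/120 17/60
```

For the sample of Table 1 the sketch recovers $l_5(\pi) = 3$ from Example 2.2. Exact rational arithmetic is used so that the probabilities sum to one without rounding error.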
For example, $F_{L_{80}}(\cdot)$ jumps from 0.0078 for x = 11 to 0.057 for x = 12. Hence, if we want to test at level 0.05 against a unilateral alternative, we reject independence when the unilateral p-value is ≤ 0.0078, and the exact level 0.05 cannot be achieved. To mitigate this characteristic we define a jackknife version of the $L_n^{diag}$ statistic.

Definition 3.2. Let (X1, Y1), (X2, Y2), ..., (Xn, Yn) be replications of (X, Y) with continuous marginal distributions. We define

$$JL_n = \frac{1}{n_{diag}} \sum_{(u,v) \in D^{diag}} L_n^{diag}(u, v),$$

where $L_n^{diag}(u, v) = l_{n_{diag}-1}(\pi_{D^{(u,v)}})$ with $D^{(u,v)} = D^{diag} \setminus \{(u, v)\}$, for each $(u, v) \in D^{diag}$.

Fig. 3(b) shows $F_{JL_{80}}$, which is the JLn cumulative distribution function for n = 80. We can see that the number of steps in the function has grown. In this case, $F_{JL_{80}}(\cdot)$ jumps from 0.0495 for x = 11.83 to 0.0505 for x = 11.834; this means that if we want to test at level 0.05, in practice we test at level 0.0495.

Remark 2. If we reexamine the cumulative distribution of the statistic L80 in Fig. 3 (left), we can see that (under Hypothesis (1)) the set of values with probabilities significantly different from zero for L80 is {11, 12, ..., 17, 18, 19}, as can be seen in the following table.

l           10    11    12    13    14    15    16    17    18    19    20
p_l^{80}    0.00  0.01  0.05  0.15  0.25  0.25  0.17  0.09  0.03  0.01  0.00

In the same way, for JLn under Hypothesis (1), the set of values with probabilities significantly different from zero is inside the interval (9, 22), as can be seen from the right picture in Fig. 3. Because of this, if some underlying dependence structure between the random variables increases or decreases the size of the longest increasing subsequence, even by a small quantity (4 or 5 for n = 80), it can easily take the Ln or JLn statistic to regions of very low probability. Another useful characteristic of the Ln (and JLn) statistic is that, as seen in Aldous et al.
[1], under the Hypothesis (1), $\lim_{n \to \infty} E(L_n)/\sqrt{n} = 2$; in other words, under the assumption of independence, when n grows Ln grows like $2\sqrt{n}$.

3.3. The JLMn statistic

The idea behind the JLMn statistic is to use the size of the longest monotonic subsequence, which is the maximum between the longest increasing subsequence and the longest decreasing subsequence. The size of the longest decreasing subsequence of a sample $\{(x_i, y_i)\}_{i=1}^{n}$ is the size of the longest increasing subsequence of the sample $\{(x_i, -y_i)\}_{i=1}^{n}$. As before, we only consider points that are at a distance smaller than $0.4\,n^{5/6}$ from the corresponding diagonal, derived from the observations transformed through the ranks, as described in Section 3.2.

Definition 3.3. Let (X1, Y1), (X2, Y2), ..., (Xn, Yn) be replications of (X, Y) with continuous marginal distributions. We define

$$JLM_n = \max\{JL_n, JL_n^{-}\},$$

with $JL_n$ and $JL_n^{-}$ given by Definition 3.2 applied to (X1, Y1), (X2, Y2), ..., (Xn, Yn) and (X1, −Y1), (X2, −Y2), ..., (Xn, −Yn), respectively.

Fig. 3. (a) Ln cumulative distribution function for n = 80 (left). (b) JLn cumulative distribution function for n = 80 (right).

4. Simulations

To compare the power of our tests against Pearson's test, Kendall's test, Spearman's test, Hoeffding's test, Genest's test and the MIC test, we carried out a simulation study in which, for each test, we estimate the power function for different sample sizes and diverse joint distributions. For each joint distribution and for the sample sizes 20, 40, 60, 80, 100 we simulated 5000 samples and computed the p-values. The simulation was implemented in R-project. Denote by $\{(X_i^j, Y_i^j)\}_{i=1}^{n}$ the j-th simulated sample, with j = 1, ..., 5000 and n = 20, 40, 60, 80, 100.
Given a level α, we calculate the empirical significance level as

$$\frac{\#\left\{ j : \text{p-value}\left(\{(X_i^j, Y_i^j)\}_{i=1}^{n}\right) \le \alpha \right\}}{5000} \qquad (5)$$

where $\text{p-value}(\{(X_i^j, Y_i^j)\}_{i=1}^{n})$ denotes the p-value associated with the sample j, $\{(X_i^j, Y_i^j)\}_{i=1}^{n}$. The p-values for the Ln, JLn and JLMn statistics were calculated using our R package LIStest. For n ≤ 100 the LIStest package uses the exact probabilities for Ln, computed using Theorem 3.1. For n > 100 it uses the Tracy–Widom approximation given by Theorem 3.2. In the LIStest package, the distributions of JLn and JLMn were estimated by simulation for n ≤ 200.

We divided our simulations into two parts: the independence case, to compare (5) for different sample sizes, and the dependent case, to measure and compare the power of the tests in diverse situations. Complete tables with the empirical power obtained in each situation can be consulted in the Appendix of this paper (see Appendix B).

4.1. Independence

Considering that the tests (except Pearson's test) are marginal free, we analyze only two distributions for the case of independence. The first distribution consists of pairs of independent random variables with standard Normal marginal distributions. The second distribution, with heavier tails, consists of pairs of independent random variables with Pareto distribution of parameter 4. Fig. 4 and Tables 7 and 8 show the behavior of the empirical significance levels for these two distributions. We note that the empirical significance level of Pearson's test can reach values significantly higher than α = 0.01 under the effect of the marginal distributions, for example when the marginal distributions are heavy-tailed, as is the Pareto distribution. For illustration, see the lower panel in Fig. 4, which shows the effect of the marginal distribution on Pearson's test. Situations such as these show the importance of using a marginal free methodology to detect dependence.
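The jackknife statistics of Definitions 3.2 and 3.3, the simulation of their null distribution and the empirical level (5) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the LIStest implementation: all function names are hypothetical, uniform marginals are used for the null simulation because the statistics are rank-based, and returning 0 when no point falls in the diagonal band is a convention chosen only for this sketch.

```python
import random
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def ranks(values):
    """Ranks 1..n of the values (no ties are expected for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def jl_n(x, y, c=0.4):
    """JL_n: mean leave-one-out LIS over the points with |U - V| <= c * n^(5/6)."""
    n = len(x)
    band = c * n ** (5 / 6)
    diag = sorted((u, v) for u, v in zip(ranks(x), ranks(y)) if abs(u - v) <= band)
    if not diag:
        return 0.0  # assumption of this sketch: empty band gives 0
    vs = [v for _, v in diag]  # V ranks ordered by U rank
    leave_one_out = [lis_length(vs[:i] + vs[i + 1:]) for i in range(len(vs))]
    return sum(leave_one_out) / len(vs)


def jlm_n(x, y, c=0.4):
    """JLM_n = max(JL_n, JL_n^-), where JL_n^- uses the sample (X, -Y)."""
    return max(jl_n(x, y, c), jl_n(x, [-t for t in y], c))


def simulate_null(stat, n, reps, rng):
    """Sorted Monte Carlo sample of a statistic under independence."""
    return sorted(
        stat([rng.random() for _ in range(n)], [rng.random() for _ in range(n)])
        for _ in range(reps)
    )


def empirical_level(p_values, alpha):
    """Eq. (5): fraction of simulated samples whose p-value is at most alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


# A perfectly increasing sample of size 10: every leave-one-out LIS has length 9.
x = list(range(1, 11))
print(jl_n(x, x), jlm_n(x, x))  # prints: 9.0 9.0
null_jl = simulate_null(jl_n, 20, 200, random.Random(0))
```

The sorted list returned by `simulate_null` plays the role of the simulated null distribution from which p-values for JLn or JLMn would be read off; the number of replications used here is arbitrary.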
We can also see that the empirical significance levels for the JLn and JLMn tests are much closer to the theoretical α than those of the Ln test.

Fig. 4. The picture on the left is the scatter plot of a sample of size 200; the picture on the right is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top: independent N(0, 1) random variables; on bottom: independent Pareto(4) random variables.

4.2. Dependence

4.2.1. The conjecture

No dependence test is known to be optimal under all dependence structures. With this in mind, our conjecture is that the proposed family of tests is efficient at detecting types of dependence with null correlation or with very small correlation, and we concentrate our study on joint distributions with zero or very small correlations that challenge many tests of independence.

In order to give some intuition about the behavior of the new tests in traditional situations, we next explore four models with medium and high correlations: (i) Gumbel's copula with parameter θ, where the cumulative distribution is given by $C_G(x, y|\theta) = \exp\{-[(-\ln x)^{\theta} + (-\ln y)^{\theta}]^{1/\theta}\}$, θ ∈ [1, ∞); (ii) Frank's copula with parameter θ and cumulative distribution $C_F(x, y|\theta) = -\frac{1}{\theta} \ln\left\{1 + \frac{(\exp(-\theta x) - 1)(\exp(-\theta y) - 1)}{\exp(-\theta) - 1}\right\}$, θ ∈ (−∞, ∞) \ {0}; (iii) Clayton's copula with parameter θ and cumulative distribution $C_C(x, y|\theta) = \max\{(x^{-\theta} + y^{-\theta} - 1)^{-1/\theta}, 0\}$, θ ∈ [−1, ∞) \ {0}; and (iv) the bivariate normal distribution with correlation ρ. Tables 2 and 3 show the results. In all the simulated cases we fixed the parameters θ and ρ in order to obtain a correlation between x and y approximately equal to 0.5 and 0.7. For large sample sizes, at the nominal level equal to 0.01, the new family of tests detects the dependence but with a lower power than the other tests.
Exceptions are some situations, as in the case of correlation approximately equal to 0.7, where the new family of tests showed positive results. In general, Pearson's test shows the highest power levels in all the distributions with moderate to large correlation considered in this study, see Tables 2 and 3. To illustrate in detail the behavior that we observe in the cases included in Tables 2 and 3, see Fig. 5.

We also observed that the statistics introduced in this paper are not consistent against all alternatives. For example, they are not able to recognize the difference between the Uniform distribution on $[0, 1]^2$ and the distribution given by f(x, y) = 1 if $(x, y) \in [0, 1]^2 \setminus (A \cup B)$, f(x, y) = 2 if (x, y) ∈ B and f(x, y) = 0 otherwise, where A = [0.5 − a, 0.5 + a] × [1 − a, 1] and B = [0.5 − a, 0.5 + a] × [0, a], with a very small value a. That is, the distribution f is equal to the Uniform except on the sets A and B. On A, f takes null values and on B, f takes the value 2 (because the mass on A was reallocated to B).

Table 2. Empirical significance level (α = 0.01). In the published table the best power in each case is set in bold.

Dist. / n    Spe    Ken    Pea    Hoe    Mine   Gen    Ln     JLn    JLMn
Gumbel θ = 1.55 (cor ≈ 0.5)
20           0.354  0.357  0.393  0.403  0.102  0.300  0.056  0.147  0.125
40           0.765  0.768  0.791  0.750  0.305  0.722  0.162  0.385  0.253
60           0.931  0.939  0.939  0.921  0.453  0.887  0.277  0.524  0.389
80           0.984  0.985  0.987  0.980  0.594  0.974  0.505  0.583  0.554
100          0.996  0.997  0.998  0.995  0.696  0.992  0.639  0.691  0.665
Gumbel θ = 2.07 (cor ≈ 0.7)
20           0.798  0.802  0.854  0.817  0.366  0.744  0.199  0.437  0.395
40           0.993  0.994  0.995  0.990  0.788  0.988  0.530  0.828  0.685
60           1.000  1.000  1.000  0.999  0.938  0.999  0.768  0.941  0.854
80           1.000  1.000  1.000  1.000  0.985  1.000  0.932  0.957  0.956
100          1.000  1.000  1.000  1.000  0.998  1.000  0.979  0.988  0.987
Frank θ = 3.45 (cor ≈ 0.5)
20           0.333  0.321  0.390  0.408  0.126  0.307  0.028  0.106  0.088
40           0.752  0.745  0.778  0.760  0.320  0.713  0.081  0.260  0.158
60           0.931  0.930  0.942  0.930  0.479  0.932  0.135  0.362  0.238
80           0.985  0.983  0.987  0.981  0.645  0.976  0.288  0.387  0.360
100          0.997  0.997  0.997  0.996  0.752  0.997  0.401  0.480  0.450
Frank θ = 5.83 (cor ≈ 0.7)
20           0.798  0.786  0.856  0.827  0.414  0.764  0.123  0.317  0.282
40           0.995  0.994  0.997  0.993  0.832  0.991  0.346  0.713  0.532
60           1.000  1.000  1.000  1.000  0.959  1.000  0.536  0.846  0.701
80           1.000  1.000  1.000  1.000  0.990  1.000  0.777  0.865  0.864
100          1.000  1.000  1.000  1.000  0.999  1.000  0.891  0.930  0.930
Clayton θ = 1.08 (cor ≈ 0.5)
20           0.353  0.357  0.399  0.417  0.113  0.314  0.046  0.137  0.113
40           0.759  0.755  0.779  0.763  0.329  0.711  0.139  0.347  0.239
60           0.930  0.930  0.939  0.929  0.507  0.929  0.268  0.510  0.369
80           0.982  0.981  0.987  0.981  0.645  0.972  0.477  0.548  0.515
100          0.995  0.996  0.996  0.995  0.752  0.996  0.624  0.671  0.642
Clayton θ = 2.13 (cor ≈ 0.7)
20           0.797  0.801  0.856  0.831  0.397  0.766  0.208  0.437  0.399
40           0.990  0.989  0.996  0.991  0.833  0.985  0.544  0.836  0.690
60           0.999  0.999  1.000  1.000  0.963  0.999  0.788  0.945  0.870
80           1.000  1.000  1.000  1.000  0.993  1.000  0.943  0.964  0.961
100          1.000  1.000  1.000  1.000  0.999  1.000  0.981  0.986  0.986

Table 3. Empirical significance level (α = 0.01), bivariate Normal distribution with variance 1 and correlation ρ. In the published table the best power in each case is set in bold.

ρ / n        Spe    Ken    Pea    Hoe    Mine   Gen    Ln     JLn    JLMn
ρ = 0.50
20           0.304  0.290  0.376  0.344  0.093  0.254  0.031  0.097  0.081
40           0.704  0.699  0.776  0.673  0.237  0.615  0.081  0.241  0.148
60           0.910  0.911  0.946  0.887  0.382  0.895  0.151  0.355  0.235
80           0.975  0.974  0.987  0.963  0.512  0.948  0.296  0.372  0.331
100          0.994  0.994  0.998  0.989  0.627  0.987  0.429  0.475  0.436
ρ = 0.70
20           0.770  0.759  0.867  0.770  0.323  0.694  0.132  0.321  0.291
40           0.991  0.990  0.995  0.984  0.731  0.979  0.371  0.693  0.523
60           1.000  1.000  1.000  0.999  0.915  0.999  0.589  0.836  0.704
80           1.000  1.000  1.000  1.000  0.976  1.000  0.812  0.884  0.861
100          1.000  1.000  1.000  1.000  0.994  1.000  0.902  0.928  0.925
The family of tests given in this paper recognizes the magnitude of the density distributed on the diagonals (x = y and/or x = −y) and in their neighborhood.

4.2.2. Settings of dependence

We consider two main situations in which the samples show low correlation: (a) visible dependence; (b) hidden dependence. In the first group we explore distributions with the following x–y plot shapes: (i) a cross, (ii) a ring, and (iii) a square. All of them are types of dependence with null expected correlation coefficients. We note that Pearson's test, Kendall's test and Spearman's test are not consistent for Hypothesis (1), which explains their poor performance in almost all the cases presented in this section.

Fig. 5. The picture on the left is the scatter plot of a sample of size 200 of the Gumbel copula. The picture on the right is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top θ = 1.55 (correlation ≈ 0.5), and on bottom θ = 2.07 (correlation ≈ 0.7).

For case (a) we implemented the following joint distributions.

D1: Mixture of two bivariate Normal distributions with variances 1 and correlations ρ and −ρ;

$$(X, Y) \sim \tfrac{1}{2} N_2(0, \Sigma_1) + \tfrac{1}{2} N_2(0, \Sigma_2), \quad \text{where } 0 = (0, 0),\ \Sigma_1 = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \text{ and } \Sigma_2 = \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.$$

D2: Uniform ring centered at 0 with internal radius ρ and external radius 1.

D3: Uniform distribution on {[−1, 1] × [−1, 1]} \ {[−ρ, ρ] × [−ρ, ρ]} (border of a square).

In all the cases (see Figs. 1, 6–9 and Tables 9–11 respectively), the new family of tests attains the highest empirical power. Situations D1, D2 and D3 show how it is possible to enhance the efficiency of statistical tests based on Ln, exemplified here by the statistics JLn and JLMn.

For case (b) we implemented the following joint distributions

Fig. 6.
The picture on the left is the scatter plot of a sample of size 200 of D1. The picture on the right is the sample size vs. the empirical significance level (α = 0.01) for the same distribution with ρ = 0.7.

Table 4. P-values, Globular Clusters, NGC 3379 galaxy.
Spe     Ken     Hoe     Pea     Gen     MIC   Ln      JLn     JLMn
0.1390  0.0900  0.1331  0.2760  0.1953  1     0.0203  0.0052  0.0094

D4: Mixture of two bivariate Normal distributions, one independent with standard deviation 4 and the other dependent with standard deviation 1 and correlation ρ,

$$(X, Y) \sim \tfrac{1}{2} N_2(0, 16I) + \tfrac{1}{2} N_2(0, \Sigma), \quad \text{where } 0 = (0, 0),\ 16I = \begin{pmatrix} 16 & 0 \\ 0 & 16 \end{pmatrix} \text{ and } \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

D5: Mixture of two bivariate Normal distributions, γ% independent with standard deviation 4 and (1 − γ)% dependent with standard deviation 0.5 and correlation ρ = 0.95,

$$(X, Y) \sim \gamma N_2(0, 16I) + (1 - \gamma) N_2(0, \Sigma), \quad \text{where } 0 = (0, 0),\ 16I = \begin{pmatrix} 16 & 0 \\ 0 & 16 \end{pmatrix} \text{ and } \Sigma = \begin{pmatrix} 0.25 & 0.25\rho \\ 0.25\rho & 0.25 \end{pmatrix}.$$

D6: Mixture of two bivariate Clayton's copulas, one with parameter −0.1 and the other with parameter equal to 10; $(X, Y) \sim 0.75\, C_C(\cdot, \cdot \mid -0.1) + 0.25\, C_C(\cdot, \cdot \mid 10)$.

For distribution D6, Spearman's correlation was about 0.15. Spearman's, Kendall's and Pearson's tests cannot detect the dependence, because the sample proportion (25%) which has strong correlation (around 0.94) is too small compared with the sample proportion (75%) having negative and small correlation (about −0.17). According to the results shown in Figs. 10–12 and Tables 12–14 respectively, in scenarios D4, D5 and D6 the statistics Ln, JLn and JLMn reach the best results, followed by Hoeffding's test.

4.3. Applications

4.3.1. Application 1, "Globular Clusters" data

The dataset is composed of a sample of globular clusters (GC) around the galaxy NGC 3379 (see Bergond et al. [4] and Chakrabarty [5]). NGC 3379 is the brightest elliptical galaxy in the constellation Leo and it is known to have a supermassive black hole. The measures (Fig.
13) are the Projected Distance expressed in kpc (x axis) and the Line of Sight (LOS) radial Velocity expressed in km/s (y axis) to the galaxy for 30 GCs. While conceptually the dependence between the Projected Distance and the LOS Velocity exists, it is not detected by Pearson's test, Kendall's test, Spearman's test, Hoeffding's test, Genest's test or the MIC test; see the results in Table 4. Astronomers use this kind of relation to infer the total mass distribution in galaxies. They can compare, for example, the globular cluster system of NGC 3379 (with a black hole) and some planetary nebulae (without a black hole) in order to infer the influence produced by the presence of a black hole. The Ln test (p-value = 0.0203), the JLn test (p-value = 0.0052) and the JLMn test (p-value = 0.0094) are able to detect that dependence. Spearman's correlation between the Projected Distance and the LOS Velocity is equal to 0.2766.

Fig. 7. The left panel is the scatter plot of a sample of size 200 of D2. The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0, and on bottom ρ = 0.1.

4.3.2. Application 2, "coalminers" data

The dataset named "coalminers" appears in VGAM (a package from R-project). The data are about coal miners who are smokers without radiological pneumoconiosis, classified by age, breathlessness and wheeze. Denote by BW the counts with breathlessness and wheeze, by BnW the counts with breathlessness but no wheeze, and by nBW the counts with no breathlessness and wheeze. Fig. 14, on the left, shows the plot of BnW against BW, while Fig. 15, on the left, shows the plot of BW against nBW. Each point corresponds to one of 9 age groups. In both situations the dependency appears as a consequence of event B (breathlessness) or W (wheeze) respectively. Since Figs.
14 and 15 (left) show an increasing tendency, we can test some unilateral hypotheses for the relation BnW versus BW and BW versus nBW, respectively. For the tests based on a specific measure, we test “measure > 0”; for the Ln test we test Ln > M0, where M0 is the mode of the distribution under the independence assumption. For that kind of hypothesis, we compute the exact p-values for Pearson’s test, Spearman’s test, Kendall’s test (see the description of the function “cor.test” from the R-project) and the Ln test. For the first case (BnW versus BW) all the tests reject independence in favor of an increasing tendency, with small p-values (less than 1e-05) and with dependence coefficients approximately equal to 1 (Pearson’s correlation, Spearman’s rank correlation and Kendall’s rank correlation). In contrast, for the second situation (BW versus nBW) the tests based on correlations fail to reject the null hypothesis, while the Ln test and the JLn test reject it at the 5% level. Table 5 shows the results for this case, in which the Ln test and the JLn test show the best performance.

Fig. 8. The left panel is the scatter plot of a sample of size 200 from D2. The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0.3, and on bottom ρ = 0.5.

Table 5
Unilateral hypotheses (BW, nBW).
Test         Spe     Ken     Pea     Ln      JLn
Coefficient  0.4666  0.3888  0.4230  –       –
p-value      0.1063  0.0901  0.1283  0.0497  0.0266

5. Conclusions

In this work we develop a new class of nonparametric tests for the independence of two continuous random variables. For the Ln test we show the exact distribution of the test statistic; for the JLn and JLMn tests we estimated the distribution by simulation. We compare our family of tests with Pearson’s test, Kendall’s test, Spearman’s test, Genest’s test, Hoeffding’s test and MIC’s test using simulations. We apply the Ln test and its variants to two real data examples. The inability of Ln to reach the nominal α level (by construction) is successfully eliminated in its variants JLn and JLMn. For the sample sizes considered in our study, the tests based on the longest increasing subsequence were the only ones able to detect dependence across the distributions D1–D6; Hoeffding’s test came next in the cases D4 and D6. We emphasize the ability of these new tests to handle moderate and even small sample sizes. In all the cases in which the Ln test and its variants work well, they have the highest power for sample sizes bigger than 40. This property, added to the capacity to control the significance level (through the JLn and/or JLMn versions), puts this procedure in an advantageous position in relation to the other tests, as is shown in the simulation study and in the applications to real data. According to the simulation study (Section 4), the Ln, JLn and JLMn tests behave remarkably well in the mixture cases, in which the samples are composed of two subsamples, one coming from a strongly correlated distribution and the other from a weakly correlated distribution. In summary, in this paper we wish to draw attention to the potential of statistics that are constructed with the longest increasing subsequence. Such statistics can be useful for detecting types of dependency that are difficult to identify with the tests available in the literature.

Fig. 9. The left panel is the scatter plot of a sample of size 200 from D3. The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0.5, and on bottom ρ = 0.7.

Fig. 10. The left panel is the scatter plot of a sample of size 200 from D4.
The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0.9, and on bottom ρ = 0.95.

Acknowledgments

The authors gratefully acknowledge the support for this research provided by (a) USP project “Mathematics, computation, language and the brain”, (b) “Portuguese in time and space: linguistic contact, grammars in competition and parametric change, FAPESP’s project, grant 2012/06078-9” and (c) “FAPESP Center for Neuromathematics (grant 2013/07699-0, S. Paulo Research Foundation)”. Special thanks to Professor Dalia Chakrabarty for making the astronomical data used in this paper available to us. We wish to thank the referees and an associate editor for their many helpful comments and suggestions on an earlier draft of this paper.

Appendix A. Proof of Theorem 3.1

The combinatorial concepts that we will introduce, useful in representation theory, have been extensively developed in Frame et al. [6], Schensted [12] and Baer et al. [2].

Fig. 11. The left panel is the scatter plot of a sample of size 200 from D5. The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top γ = 0.75, and on bottom γ = 0.85.

Definition A.1. A standard Young Tableau of order n is an arrangement of n distinct natural numbers in rows and columns so that the numbers in each row and in each column form increasing sequences, and so that there is an element of each row in the first column and an element of each column in the first row, and there are no gaps between numbers.

There is a 1–1 correspondence between permutations and pairs of standard Young Tableaux of the same shape. To each permutation we can assign the first tableau of its pair, citing Baer et al. [2], in the following way. Let the permutation be $\{x_1, x_2, \ldots, x_n\}$.
For the moment, define the first entry in the first row of the tableau to be $x_1$. Now, if at the i-th step the first i entries of the sequence have been used in the developing tableau, then at the next step the element $x_{i+1}$ is inserted into the first row of the tableau by displacing the smallest entry in the first row which is larger than $x_{i+1}$, or by appending $x_{i+1}$ at the end of the first row if it is larger than all entries in the first row. If an entry y is displaced from the first row by $x_{i+1}$, then y is inserted into the second row by letting it displace the smallest entry in the second row which is larger than y, or by simply appending y to the second row if there is no such element. The process is continued from row to row until either the original $x_{i+1}$ or a displaced element is appended to the end of a row. Then the whole process is repeated for $x_{i+2}, \ldots$ until all of the entries of the original permutation sequence have been entered into the tableau.

Fig. 12. The left panel is the scatter plot of a sample of size 200 from D6. The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution.

Fig. 13. The plots show the Projected Distance (kpc) vs. the Line of Sight Velocity (on the left) and the ranks of the Projected Distance vs. the ranks of the Line of Sight Velocity (on the right) for 30 GCs associated with NGC 3379.

Example A.1. Applying the algorithm given by Baer et al. [2] and reproduced in the previous paragraph to the permutation {3, 6, 1, 5, 7, 2, 4, 8}, we find the following sequence of arrangements, written with the rows of the algorithm displayed horizontally. The last arrangement (step 8) is the standard Young Tableau.

step 1:  3

step 2:  3 6

step 3:  1 6
         3

step 4:  1 5
         3 6

step 5:  1 5 7
         3 6

step 6:  1 2 7
         3 5
         6

step 7:  1 2 4
         3 5 7
         6

step 8:  1 2 4 8
         3 5 7
         6

Remark 3. The length of the first row of the standard Young Tableau equals the length of the longest increasing subsequence. In Example A.1 the first row, 1 2 4 8, is itself one of the longest increasing subsequences of the permutation.
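The insertion procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `schensted_tableau` is hypothetical, and the tableau is assumed to be stored as a list of rows, each row kept in increasing order.

```python
from bisect import bisect_right


def schensted_tableau(perm):
    """Row-insert each entry of perm; return the tableau as a list of rows.

    Hypothetical helper for illustration: rows[0] is the first row.
    """
    rows = []
    for x in perm:
        for row in rows:
            # Index of the smallest entry in this row that is larger than x.
            pos = bisect_right(row, x)
            if pos == len(row):
                # x is larger than every entry: append it and stop.
                row.append(x)
                break
            # Displace that entry and carry it down to the next row.
            row[pos], x = x, row[pos]
        else:
            # No row could absorb the carried entry: start a new row.
            rows.append([x])
    return rows


tableau = schensted_tableau([3, 6, 1, 5, 7, 2, 4, 8])
for row in tableau:
    print(*row)
```

For the permutation of Example A.1, the first row returned by this sketch has four entries, which is the length of the longest increasing subsequence (for instance 3, 6, 7, 8).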
Fig. 14. The plots show counts with breathlessness and no wheeze (BnW) vs. counts with breathlessness and wheeze (BW) (on the left) and the ranks of BnW vs. the ranks of BW (on the right).

Fig. 15. The plots show counts with breathlessness and wheeze (BW) vs. counts with no breathlessness and wheeze (nBW) (on the left) and the ranks of BW vs. the ranks of nBW (on the right).

Definition A.2. If T is a standard Young Tableau of order n, for each element j, j ∈ {1, . . . , n}, of the arrangement we define the hook number of j, $h_j$, as the number of elements lying in the same column below j or in the same row to the right of j, counting j itself once.

Example A.2. We illustrate the concepts introduced by means of Example 2.2.

standard Young Tableau      hook numbers
1 4                         4 2
2 5                         3 1
3                           1

Remark 4. By definition, the $h_j$ numbers depend on the shape of the tableau, not on the numbers filling it. Each permutation is directly associated with the shape of a standard Young Tableau, but different permutations of {1, . . . , n} can give the same tableau shape. The next example shows all the possible shapes of standard Young Tableaux that can be obtained from the permutations of 5 numbers.

Example A.3. Consider the set {1, 2, 3, 4, 5}. Each shape in the list (Table 6) is associated with an integer partition, which is a way of writing n as a sum of positive integers, denoted by IP(n) (n = 5 in this case).

Table 6
List of shapes (of standard Young Tableaux) and hook numbers for the permutations of {1, 2, 3, 4, 5}.
Shape     Hook numbers (rows separated by /)   Integer partition
Shape 1   5 / 4 / 3 / 2 / 1                    IP1(5) = 5
Shape 2   5 1 / 3 / 2 / 1                      IP2(5) = 4+1
Shape 3   4 2 / 3 1 / 1                        IP3(5) = 3+2
Shape 4   5 2 1 / 2 / 1                        IP4(5) = 3+1+1
Shape 5   4 3 1 / 2 1                          IP5(5) = 2+2+1
Shape 6   5 3 2 1 / 1                          IP6(5) = 2+1+1+1
Shape 7   5 4 3 2 1                            IP7(5) = 1+1+1+1+1

Table 7
Empirical power at level α = 0.01 for independent random variables with Normal marginal distributions.
n      20     40     60     80     100
Spe    0.008  0.011  0.009  0.010  0.011
Ken    0.007  0.010  0.009  0.011  0.011
Pea    0.008  0.012  0.010  0.009  0.011
Hoe    0.017  0.015  0.013  0.012  0.013
MIC    0.011  0.007  0.009  0.012  0.010
Gen    0.008  0.008  0.011  0.008  0.018
Ln     0.001  0.002  0.001  0.003  0.006
JLn    0.009  0.010  0.010  0.012  0.009
JLMn   0.009  0.010  0.011  0.011  0.009

Table 8
Empirical power at level α = 0.01 for independent random variables with Pareto(4) marginal distributions.
n      20     40     60     80     100
Spe    0.009  0.010  0.012  0.012  0.010
Ken    0.008  0.009  0.010  0.012  0.010
Pea    0.018  0.021  0.019  0.023  0.017
Hoe    0.021  0.016  0.013  0.015  0.012
MIC    0.009  0.010  0.012  0.010  0.008
Gen    0.008  0.012  0.009  0.018  0.008
Ln     0.000  0.002  0.002  0.002  0.005
JLn    0.008  0.008  0.010  0.010  0.011
JLMn   0.013  0.010  0.013  0.010  0.011

Table 9
Empirical power at level α = 0.01 for distribution D1. In each case the best power is set in bold.
ρ = 0.70
n      20     40     60     80     100
Spe    0.004  0.007  0.006  0.004  0.006
Ken    0.005  0.010  0.009  0.008  0.009
Pea    0.026  0.035  0.035  0.034  0.035
Hoe    0.014  0.010  0.011  0.010  0.011
MIC    0.007  0.010  0.009  0.008  0.007
Gen    0.003  0.005  0.008  0.002  0.005
Ln     0.006  0.013  0.022  0.067  0.105
JLn    0.021  0.048  0.078  0.105  0.152
JLMn   0.019  0.048  0.094  0.138  0.201

ρ = 0.90
n      20     40     60     80     100
Spe    0.005  0.005  0.006  0.003  0.005
Ken    0.009  0.007  0.010  0.007  0.008
Pea    0.048  0.053  0.057  0.051  0.052
Hoe    0.014  0.015  0.023  0.031  0.057
MIC    0.014  0.024  0.043  0.089  0.207
Gen    0.004  0.004  0.015  0.008  0.026
Ln     0.019  0.106  0.251  0.483  0.662
JLn    0.073  0.294  0.509  0.619  0.765
JLMn   0.092  0.372  0.648  0.851  0.945

Shape 1 corresponds to the permutation π(1) = 5, π(2) = 4, π(3) = 3, π(4) = 2, π(5) = 1 ($ld_5 = 5$) and it is associated with the integer partition of n = 5 given by IP1(5) = 5 (the number of elements in the first column of shape 1). Shape 5 is associated with the integer partition of n = 5, IP5(5) = 2 + 2 + 1, where each term of IP5(5) (from left to right) is the number of elements in the corresponding column of shape 5.

Given a permutation π, the size of the longest increasing subsequence of π is the size of the first row in the shape of the Tableau corresponding to the permutation. The next results allow us to compute the number of permutations of n numbers such that $l_n(\pi) = k$ from the numbers of standard Young tableaux whose shapes have a first row of size k.

Theorem A.1 (Frame et al. [6]). Given a shape W, the number of standard Young tableaux with shape W containing the integers {1, . . . , n} is
$$N(W) = \frac{n!}{\prod_{j=1}^{n} h_j}, \qquad (6)$$
where the $h_j$, j = 1, . . . , n, are the hook numbers for each cell of the Tableau.

Table 10
Empirical power at level α = 0.01 for distribution D2. In each case the best power is set in bold.
ρ = 0.00
n      20     40     60     80     100
Spe    0.003  0.003  0.003  0.002  0.002
Ken    0.002  0.001  0.002  0.001  0.001
Pea    0.003  0.003  0.002  0.001  0.002
Hoe    0.012  0.008  0.007  0.005  0.005
MIC    0.010  0.009  0.014  0.008  0.013
Gen    0.007  0.010  0.003  0.002  0.011
Ln     0.000  0.002  0.002  0.003  0.013
JLn    0.010  0.016  0.045  0.057  0.070
JLMn   0.030  0.047  0.104  0.140  0.202

ρ = 0.10
n      20     40     60     80     100
Spe    0.002  0.002  0.004  0.003  0.003
Ken    0.001  0.001  0.002  0.002  0.001
Pea    0.002  0.001  0.003  0.002  0.002
Hoe    0.010  0.005  0.006  0.005  0.003
MIC    0.009  0.009  0.013  0.014  0.011
Gen    0.005  0.009  0.004  0.003  0.009
Ln     0.001  0.001  0.001  0.003  0.016
JLn    0.013  0.020  0.052  0.078  0.090
JLMn   0.031  0.062  0.123  0.178  0.262

ρ = 0.30
n      20     40     60     80     100
Spe    0.002  0.002  0.002  0.002  0.002
Ken    0.001  0.000  0.001  0.001  0.000
Pea    0.001  0.002  0.001  0.002  0.001
Hoe    0.009  0.004  0.004  0.005  0.005
MIC    0.019  0.018  0.019  0.019  0.017
Gen    0.005  0.009  0.003  0.005  0.014
Ln     0.000  0.001  0.001  0.003  0.019
JLn    0.025  0.076  0.194  0.319  0.399
JLMn   0.077  0.248  0.431  0.606  0.752

ρ = 0.50
n      20     40     60     80     100
Spe    0.001  0.001  0.000  0.001  0.000
Ken    0.001  0.000  0.000  0.000  0.000
Pea    0.000  0.000  0.001  0.000  0.001
Hoe    0.011  0.006  0.007  0.014  0.033
MIC    0.037  0.059  0.051  0.146  0.355
Gen    0.006  0.011  0.004  0.006  0.036
Ln     0.000  0.001  0.002  0.004  0.012
JLn    0.049  0.270  0.626  0.805  0.889
JLMn   0.236  0.669  0.883  0.962  0.988

Table 11
Empirical power at level α = 0.01 for distribution D3. In each case the best power is set in bold.
ρ = 0.50
n      20     40     60     80     100
Spe    0.005  0.004  0.003  0.004  0.004
Ken    0.004  0.003  0.002  0.002  0.003
Pea    0.007  0.005  0.004  0.003  0.004
Hoe    0.012  0.008  0.006  0.007  0.007
MIC    0.019  0.020  0.020  0.016  0.018
Gen    0.008  0.005  0.006  0.007  0.009
Ln     0.000  0.002  0.001  0.002  0.003
JLn    0.020  0.059  0.135  0.198  0.255
JLMn   0.056  0.152  0.262  0.381  0.523

ρ = 0.70
n      20     40     60     80     100
Spe    0.001  0.002  0.002  0.001  0.001
Ken    0.001  0.001  0.000  0.001  0.001
Pea    0.002  0.001  0.005  0.004  0.004
Hoe    0.009  0.006  0.006  0.007  0.010
MIC    0.026  0.037  0.040  0.137  0.348
Gen    0.003  0.005  0.007  0.013  0.015
Ln     0.001  0.002  0.001  0.003  0.003
JLn    0.037  0.164  0.357  0.527  0.618
JLMn   0.138  0.394  0.610  0.762  0.873

Table 12
Empirical power at level α = 0.01 for distribution D4. In each case the best power is set in bold.

ρ = 0.90
n      20     40     60     80     100
Spe    0.069  0.117  0.178  0.229  0.306
Ken    0.122  0.233  0.368  0.485  0.605
Pea    0.062  0.061  0.066  0.067  0.079
Hoe    0.266  0.537  0.771  0.905  0.965
MIC    0.094  0.311  0.497  0.673  0.820
Gen    0.087  0.203  0.459  0.544  0.774
Ln     0.150  0.494  0.768  0.937  0.984
JLn    0.337  0.817  0.957  0.977  0.995
JLMn   0.298  0.686  0.886  0.976  0.994

ρ = 0.95
n      20     40     60     80     100
Spe    0.072  0.138  0.189  0.257  0.336
Ken    0.136  0.300  0.447  0.586  0.700
Pea    0.064  0.071  0.072  0.073  0.079
Hoe    0.346  0.671  0.879  0.969  0.993
MIC    0.107  0.435  0.679  0.875  0.959
Gen    0.095  0.256  0.535  0.649  0.868
Ln     0.257  0.734  0.940  0.993  1.000
JLn    0.486  0.936  0.995  0.998  1.000
JLMn   0.443  0.868  0.979  0.998  1.000

Example A.4. The number of standard Young tableaux containing the numbers {1, 2, 3, 4, 5} with the shape given by Example A.2 (left) is 5!/(4 · 3 · 2 · 1 · 1) = 5 (using the values of Example A.2 (right)).

Theorem A.2 (Schensted [12]). Let $V_n(k, m)$ be the set of shapes of Young tableaux of order n having k columns and m rows. The number of permutations of n elements with a longest increasing subsequence of size k and a longest decreasing subsequence of size m is
$$\sum_{W \in V_n(k,m)} N(W)^2.$$

Example A.5. Considering the set of numbers {1, 2, 3, 4, 5}, we want to calculate the number of sequences having $l_5 = 3$.
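The count requested in Example A.5 can also be checked numerically by combining Theorem A.1 and Theorem A.2. The following Python lines are a minimal sketch under stated assumptions: shapes are represented by their row lengths, and the helper names `partitions`, `count_tableaux` and `lis_counts` are illustrative and do not come from the paper.

```python
from math import factorial


def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples of row lengths."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def count_tableaux(shape):
    """Theorem A.1: n! divided by the product of the hook numbers of the shape."""
    n = sum(shape)
    product = 1
    for i, row_len in enumerate(shape):
        for j in range(row_len):
            arm = row_len - j - 1                          # cells to the right
            leg = sum(1 for r in shape[i + 1:] if r > j)   # cells below
            product *= arm + leg + 1
    return factorial(n) // product


def lis_counts(n):
    """Map k to the number of permutations of n elements whose longest
    increasing subsequence has size k (Theorem A.2, summed over m)."""
    counts = {}
    for shape in partitions(n):
        k = shape[0]  # size of the first row
        counts[k] = counts.get(k, 0) + count_tableaux(shape) ** 2
    return counts


counts = lis_counts(5)
print(counts[3])  # permutations of {1, ..., 5} with l5 = 3
```

Dividing each count by n! gives the exact null distribution of Ln; for n = 5 the sketch returns 61 permutations with $l_5 = 3$, and the counts over all k add up to 5! = 120.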
Let us denote by #{A} the cardinal of the set A; $\#\{l_5 = 3\} = \#\{l_5 = 3, ld_5 = 2\} + \#\{l_5 = 3, ld_5 = 3\}$, corresponding to the only two possible shapes of Young tableaux with a first row of size 3, shape 4 and shape 5 (see Table 6). Using Theorem A.2, $\#\{l_5 = 3, ld_5 = 2\} = 5^2 = 25$, $\#\{l_5 = 3, ld_5 = 3\} = 6^2 = 36$ and $\#\{l_5 = 3\} = 25 + 36 = 61$.

Proof. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent, identically distributed bivariate random vectors with the same distribution as (X, Y), where X and Y satisfy Hypothesis (1) and have continuous marginal distributions. We can define Ln as given by Definition 2.3. The number of permutations of n elements with a longest increasing subsequence of size k and a longest decreasing subsequence of size m is given by Theorem A.2, then
$$P(L_n = k) = \frac{\#\{\pi \in S_n : l_n(\pi) = k\}}{n!} = \frac{\sum_{m=1}^{n} \sum_{W \in V_n(k,m)} N(W)^2}{n!}.$$

Appendix B. Tables

See Tables 7–15.

Table 13
Empirical power at level α = 0.01 for distribution D5. In each case the best power is set in bold.

γ = 0.75
n      20     40     60     80     100
Spe    0.019  0.023  0.028  0.031  0.037
Ken    0.023  0.038  0.048  0.061  0.070
Pea    0.025  0.023  0.023  0.024  0.026
Hoe    0.067  0.102  0.149  0.193  0.278
MIC    0.024  0.054  0.094  0.127  0.177
Gen    0.021  0.027  0.057  0.048  0.073
Ln     0.025  0.121  0.286  0.561  0.750
JLn    0.082  0.326  0.578  0.715  0.857
JLMn   0.058  0.227  0.465  0.678  0.837

γ = 0.85
n      20     40     60     80     100
Spe    0.016  0.022  0.017  0.020  0.015
Ken    0.019  0.024  0.021  0.026  0.022
Pea    0.019  0.020  0.016  0.015  0.013
Hoe    0.038  0.040  0.044  0.054  0.054
MIC    0.013  0.020  0.031  0.041  0.050
Gen    0.016  0.017  0.026  0.020  0.021
Ln     0.005  0.022  0.055  0.147  0.249
JLn    0.025  0.076  0.158  0.255  0.375
JLMn   0.019  0.043  0.112  0.191  0.302

Table 14
Empirical power at level α = 0.01 for distribution D6. In each case the best power is set in bold.
n      20     40     60     80     100
Spe    0.032  0.068  0.108  0.162  0.209
Ken    0.035  0.080  0.131  0.196  0.255
Pea    0.037  0.072  0.111  0.167  0.221
Hoe    0.066  0.111  0.173  0.252  0.327
MIC    0.018  0.030  0.040  0.047  0.048
Gen    0.027  0.061  0.129  0.150  0.295
Ln     0.014  0.056  0.141  0.370  0.543
JLn    0.056  0.188  0.354  0.508  0.673
JLMn   0.040  0.118  0.264  0.441  0.618

Table 15
Empirical power of the test JLn at level α = 0.01, for c = 0.1, 0.2, 0.3, 0.4 and 0.5, for six dependence situations.

D1, ρ = 0.70
n \ c  0.1    0.2    0.3    0.4    0.5
20     0.017  0.018  0.018  0.021  0.018
40     0.041  0.048  0.049  0.048  0.036
60     0.065  0.065  0.073  0.078  0.072
80     0.069  0.090  0.103  0.105  0.083
100    0.069  0.123  0.129  0.152  0.157

D2, ρ = 0.30
n \ c  0.1    0.2    0.3    0.4    0.5
20     0.001  0.000  0.003  0.025  0.028
40     0.000  0.053  0.072  0.076  0.062
60     0.012  0.079  0.174  0.194  0.166
80     0.013  0.116  0.253  0.319  0.290
100    0.003  0.241  0.413  0.399  0.345

D3, ρ = 0.70
n \ c  0.1    0.2    0.3    0.4    0.5
20     0.000  0.000  0.008  0.037  0.050
40     0.000  0.120  0.174  0.164  0.117
60     0.029  0.240  0.381  0.357  0.265
80     0.037  0.322  0.483  0.527  0.432
100    0.017  0.560  0.685  0.618  0.474

D4, ρ = 0.90
n \ c  0.1    0.2    0.3    0.4    0.5
20     0.219  0.300  0.370  0.337  0.342
40     0.559  0.732  0.783  0.817  0.711
60     0.773  0.887  0.935  0.957  0.923
80     0.801  0.941  0.979  0.977  0.979
100    0.888  0.985  0.994  0.995  0.996

D5, γ = 0.75
n \ c  0.1    0.2    0.3    0.4    0.5
20     0.057  0.071  0.080  0.082  0.076
40     0.221  0.311  0.328  0.326  0.265
60     0.397  0.506  0.567  0.578  0.504
80     0.449  0.653  0.755  0.715  0.684
100    0.552  0.803  0.865  0.867  0.862

Bivariate Normal, ρ = 0.70
n \ c  0.1    0.2    0.3    0.4    0.5
20     0.203  0.282  0.359  0.321  0.348
40     0.426  0.559  0.621  0.693  0.572
60     0.582  0.698  0.763  0.836  0.812
80     0.590  0.775  0.876  0.884  0.882
100    0.658  0.871  0.922  0.928  0.916

References

[1] D. Aldous, P.
Diaconis, Hammersley’s interacting particle process and longest increasing subsequence, Probab. Theory Related Fields 103 (1995) 199–213.
[2] R.M. Baer, P. Brock, Natural sorting over permutation spaces, Math. Comp. 22 (1968) 385–410.
[3] J. Baik, P. Deift, K. Johansson, On the distribution of the length of the longest increasing subsequence of random permutations, J. Amer. Math. Soc. 12 (1999) 1119–1178.
[4] G. Bergond, S.E. Zepf, A.J. Romanowsky, R.M. Sharples, K.L. Rhode, Wide-field kinematics of globular clusters in the Leo I group, Astronom. Astrophys. 448 (2006) 155–164.
[5] D. Chakrabarty, Different traces give different gravitational mass distributions. http://wrap.warwick.ac.uk/35233/ (last viewed in March, 2013), 2009.
[6] J.S. Frame, B. Robinson, R.M. Thrall, The hook graphs of the symmetric group, Canad. J. Math. 6 (1954) 316–324.
[7] C. Genest, B. Remillard, Tests of independence or randomness based on the empirical copula process, Test 13 (2004) 335–369.
[8] S.P. Hastings, J.B. McLeod, A boundary value problem associated with the second Painlevé transcendent and the Korteweg de Vries equations, Arch. Ration. Mech. Anal. 73 (1980) 31–51.
[9] M. Hollander, D. Wolfe, Nonparametric Statistical Methods, John Wiley & Sons, New York, 1973, pp. 185–194.
[10] K. Johansson, Transversal fluctuations for increasing subsequences on the plane, Probab. Theory Related Fields 116 (4) (2000) 445–456.
[11] D.N. Reshef, Y.A. Reshef, H.K. Finucane, S.R. Grossman, G. McVean, P.J. Turnbaugh, E.S. Lander, M. Mitzenmacher, P.C. Sabeti, Detecting novel associations in large data sets, Science 334 (2011) 1518–1524.
[12] C. Schensted, Longest increasing and decreasing subsequences, Canad. J. Math. 13 (1961) 179–191.
[13] A. Zoghbi, I. Stojmenovic, Fast algorithms for generating integer partitions, Int. J. Comput. Math. 70 (1998) 319–332.