Psychology Science, Volume 47, 2005 (3/4), p. 391-400

An algorithm and tool for computing exact conditional probabilities of configuration frequencies

MANFRED BEIER¹

Abstract

Traditionally the exact conditional probability of a configuration frequency is calculated with methods based on Fisher's well known formula for two by two contingency tables or its extensions for tables of higher dimensions. I present here a different, combinatorial approach that shows a much better scaling behavior for an increasing number of variables, and is in principle independent of the number of categories.

Key words: Exact conditional probability, multidimensional contingency table, configural frequency analysis (CFA)

¹ Dipl.-Ing. Manfred Beier, Institut für Humangenetik und Anthropologie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, D-40225 Düsseldorf, Germany; E-mail: [email protected]

1. Algorithm

Please note that in the following the term “matrix” does not refer to a contingency table but to the raw data matrix a contingency table can be constructed from.

Let $A = (a_{ij})$ be a matrix with $m$ rows (observations, sample size) and $n$ columns (variables, dimension of a contingency table). What is the probability for finding $k$ or more identical row vectors $W = (w_j)$ (configurations, patterns) in the matrix if the components (attributes) of the column vectors may be ordered at random?

Let $F = (f_j)$ be a vector holding the frequencies (corresponding marginal sums of a contingency table) of $w_j$ for each column $j = 1,\dots,n$. For the first component of the first row, $a_{11}$, the probability for matching $w_1$ is:

$$P(a_{11} = w_1;\ m, f_1) = \frac{f_1}{m}, \quad \text{for } a_{12}\text{:}\quad P(a_{12} = w_2;\ m, f_2) = \frac{f_2}{m} \quad \text{etc.} \tag{1}$$

Assuming independence of the columns the probability for all cells of the first row being equal to $W = (w_j)$ is:

$$P\big((a_{11},\dots,a_{1n}) = W;\ m, F\big) = \prod_{j=1}^{n} \frac{f_j}{m}. \tag{2}$$

The probability for at least having the first $k$ rows filled with $W$ is:

$$P\big((a_{11},\dots,a_{1n}) = \dots = (a_{k1},\dots,a_{kn}) = W;\ m, F\big) = \prod_{j=1}^{n} \frac{f_j}{m}\cdot\frac{f_j-1}{m-1}\cdot\ldots\cdot\frac{f_j-k+1}{m-k+1} = \prod_{j=1}^{n} \frac{\binom{f_j}{k}}{\binom{m}{k}} = P_0. \tag{3}$$

The maximum possible number of $W$ is limited by $f_{\min}$, i.e. the smallest component of $F$.² Consequently, for $k = f_{\min}$ the formula above gives us the probability for exactly the first $k$ rows containing $W$ with no further occurrence in the remaining matrix. Therefore, the probability for a matrix $A_k$ with $k$ vectors $W$ occurring in arbitrary rows can be computed by summing up the probabilities of all possible ways to choose $k$ rows out of $m$:

$$P(A_k;\ k = f_{\min}, m, F) = \binom{m}{k}\cdot P_0. \tag{4}$$

² The minimum possible number is given by $\max\{0,\ \sum f_j - m(n-1)\}$.

In the case of $n = 2$ with $f_1 = f_{\min} = k$:

$$\binom{m}{k}\cdot\prod_{j=1}^{2}\frac{\binom{f_j}{k}}{\binom{m}{k}} = \binom{m}{f_{\min}}\cdot\frac{\binom{f_{\min}}{f_{\min}}}{\binom{m}{f_{\min}}}\cdot\frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}, \tag{5}$$

this is equivalent to the p-value given by the hypergeometric distribution of a 2 by 2 contingency table, here shown with Fisher's formula for $R_1 = f_1 = f_{\min} = k$ and $C_1 = f_2$:

$$\begin{array}{ll|l}
O_{11} = k = f_{\min} & O_{12} = 0 & R_1 = f_1 = f_{\min} \\
O_{21} = f_2 - f_{\min} & O_{22} = C_2 & R_2 = m - f_{\min} \\ \hline
C_1 = f_2 & C_2 & m
\end{array}$$

$$\frac{R_1!\,R_2!\,C_1!\,C_2!}{m!\,O_{11}!\,O_{12}!\,O_{21}!\,O_{22}!} = \frac{f_{\min}!\,(m-f_{\min})!\,f_2!\,C_2!}{m!\,f_{\min}!\,0!\,(f_2-f_{\min})!\,C_2!} = \frac{\dfrac{f_2!}{(f_2-f_{\min})!}}{\dfrac{m!}{(m-f_{\min})!}} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}. \tag{6}$$

For $k < f_{\min}$ the probability for exactly $k$ row vectors matching $W$ corresponds to the probability of getting at least $k$ rows and no further row:

$$P(A_k;\ k < f_{\min}, m, F) = \binom{m}{k}\cdot P_0\cdot\Big(1 - P\big(\{A_1,\dots,A_{f_{\min}-k}\};\ m-k,\ (f_1-k,\dots,f_n-k)\big)\Big). \tag{7}$$

This leads to the following recursive formula for getting $k$ or more row vectors $W$:

$$P\big(\{A_k,\dots,A_{f_{\min}}\};\ m, F\big) = \frac{\prod_{j=1}^{n}\binom{f_j}{k}}{\binom{m}{k}^{n-1}}\cdot\Big(1 - P\big(\{A_1,\dots,A_{f_{\min}-k}\};\ m-k,\ (f_1-k,\dots,f_n-k)\big)\Big) + P\big(\{A_{k+1},\dots,A_{f_{\min}}\};\ m, F\big). \tag{8}$$
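Formula (8) can be transcribed almost literally into code. The following is a minimal sketch in C under stated assumptions, not the author's implementation: it uses double precision instead of exact arithmetic and no cache, so it is only meant for small inputs with k >= 1. The names `upper_p_naive`, `choose_d`, `min_freq` and the limit `MAX_VARS` are hypothetical and chosen for illustration only.

```c
#include <assert.h>

#define MAX_VARS 16  /* assumed upper bound on n for this sketch */

/* Binomial coefficient as a double; returns 0 when k is out of range. */
static double choose_d(int n, int k) {
    if (k < 0 || k > n) return 0.0;
    double c = 1.0;
    for (int i = 1; i <= k; i++)
        c = c * (n - k + i) / i;   /* stays integral at every step */
    return c;
}

/* Smallest component of the frequency vector, i.e. f_min. */
static int min_freq(const int *f, int n) {
    int fmin = f[0];
    for (int j = 1; j < n; j++)
        if (f[j] < fmin) fmin = f[j];
    return fmin;
}

/* Naive evaluation of formula (8): probability of k or more rows equal to W,
   given sample size m and the n marginal frequencies f[0..n-1]. */
double upper_p_naive(int k, int m, const int *f, int n) {
    assert(n >= 1 && n <= MAX_VARS && k >= 1);
    int fmin = min_freq(f, n);
    if (k > fmin) return 0.0;

    /* first factor: prod_j C(f_j, k) / C(m, k)^(n-1) */
    double term = 1.0;
    for (int j = 0; j < n; j++) term *= choose_d(f[j], k);
    for (int j = 0; j < n - 1; j++) term /= choose_d(m, k);
    if (k == fmin) return term;            /* formula (4): no further W possible */

    /* k < f_min: factor 1 - P(one or more W among the remaining m-k rows) */
    int rest[MAX_VARS];
    for (int j = 0; j < n; j++) rest[j] = f[j] - k;
    double no_more = 1.0 - upper_p_naive(1, m - k, rest, n);

    return term * no_more + upper_p_naive(k + 1, m, f, n);
}
```

As a hand check, for m = 3, F = (2, 2, 2) and k = 1 formula (8) gives 8/9 · (1 − 1/4) + 1/9 = 7/9, which is the value `upper_p_naive(1, 3, f, 3)` should return up to rounding. Because nothing is cached, the number of recursive calls grows quickly with the distance between k and f_min.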
At first sight this formula seems to be computationally infeasible. The number of p-values that have to be computed grows exponentially with the distance between $k$ and $f_{\min}$. But by storing and reusing partial results for $1 - P(\{A_1,\dots,A_{f_{\min}-i}\};\ m-i,\ F-i)$ for all $i = k,\dots,f_{\min}-1$, the number of interim values is reduced to $\sum_{1 \le i \le f_{\min}-k+1} i$, i.e. a quadratic growth rate.

An algorithm is given below in the form of an implementation in the programming language R (www.R-project.org). In addition to the necessary “data cache” just mentioned (array p1), a second array (p0) allows the first part of the formula to be calculated using the recurrence relation

$$\frac{\prod_{j=1}^{n}\binom{f_j}{k}}{\binom{m}{k}^{n-1}} = \frac{\prod_{j=1}^{n}\binom{f_j}{k-1}}{\binom{m}{k-1}^{n-1}}\cdot\frac{\prod_{j=1}^{n}(f_j-k+1)}{k\,(m-k+1)^{n-1}}. \tag{9}$$

    exact.p <- function(k,m,f) {
      n <- length(f); min <- which.min(f)
      # caches: p1 holds the (1 - P(...)) factors, p0 the first part of formula (8)
      p1 <- array(dim=c(f[min]-k)); p0 <- array(dim=c(f[min]))
      inner.loop <- function(k,m,f) {
        if (is.na(p0[f[min]]))
          # cache still empty: compute the first part directly
          p0[f[min]] <<- prod(choose(f,k)) / choose(m,k)^(n-1)
        else
          # otherwise update it with recurrence relation (9)
          p0[f[min]] <<- p0[f[min]] * prod(f-k+1) / (k * (m-k+1)^(n-1))
        if (k < f[min]) {
          if (is.na(p1[f[min]-k]))
            p1[f[min]-k] <<- 1 - inner.loop(1,m-k,f-k)
          return(p0[f[min]] * p1[f[min]-k] + inner.loop(k+1,m,f))
        }
        else return(p0[f[min]])
      }
      return(inner.loop(k,m,f))
    }

For example, to calculate the upper probability for a configuration frequency of 31 with marginal sums 90, 93 and 94 occurring in a sample of 158, call and output would look like:

    > exact.p (31,158,c(90,93,94))
    [1] 0.6150942

2. The special case n = 2

It is important to note that the probability for getting k or more patterns is completely independent of the number of categories of each variable. From the standpoint of one specific configuration each variable is binary: either the observation holds the correct value or not. Therefore, for a given pattern, any raw data table can be expressed and treated as a binary contingency table.
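The binary view described above can be made concrete by deriving the inputs of the algorithm (configuration frequency, sample size, marginal sums) directly from a raw data matrix. The following C function is a minimal sketch under assumptions that are not part of the original text: the matrix is stored row-major as integer category codes, and the name `pattern_counts` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reduce a raw m x n data matrix a (row-major, integer category codes) to the
   inputs needed for one pattern w[0..n-1]:
   - f[j] receives the number of rows whose column j equals w[j] (marginal sum),
   - the return value is the number of rows equal to w in all n columns. */
int pattern_counts(const int *a, int m, int n, const int *w, int *f) {
    assert(a != NULL && w != NULL && f != NULL && m > 0 && n > 0);
    int k = 0;
    for (int j = 0; j < n; j++) f[j] = 0;
    for (int i = 0; i < m; i++) {
        int all_match = 1;
        for (int j = 0; j < n; j++) {
            if (a[i * n + j] == w[j])
                f[j]++;            /* binary view: the cell matches the pattern */
            else
                all_match = 0;     /* ... or it does not, whatever its category */
        }
        k += all_match;
    }
    return k;
}
```

Only equality with the pattern is tested, so the number of distinct category codes in a column never enters the computation. The returned count and the filled vector `f`, together with m, would then be the arguments passed on to a routine such as `exact.p`.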
In the case of two variables, fixing one cell at k determines the values of the remaining three cells. Since the above algorithm simply sums up the p-values for getting exactly k to $f_{\min}$ configurations, it is in fact easier and faster to use the hypergeometric distribution for that job, which lowers the number of p-values that have to be computed to $f_{\min} - k + 1$:

$$\begin{array}{ll|l}
O_{11} = k & O_{12} = f_1 - k & R_1 = f_1 \\
O_{21} & O_{22} & R_2 \\ \hline
C_1 = f_2 & C_2 = m - f_2 & m
\end{array}$$

$$P\big(\{A_k,\dots,A_{f_{\min}}\};\ m, f_1, f_2\big) = \sum_{O_{11}=k}^{f_{\min}} \frac{\binom{C_1}{O_{11}}\binom{C_2}{O_{12}}}{\binom{m}{R_1}} = \frac{\sum_{i=k}^{f_{\min}} \binom{f_2}{i}\binom{m-f_2}{f_1-i}}{\binom{m}{f_1}}. \tag{10}$$

Remark: Summing up over all possible k would of course result in a p-value of 1, i.e. numerator and denominator are then identical. This identity, commonly known as Vandermonde's convolution, leads to another combinatorial interpretation: While the denominator counts all possible ways to choose $f_1$ elements from the union of two disjoint sets of size $f_2$ and $m - f_2$, the numerator counts only those selections containing k or more elements from the first set, with the remaining $f_1 - k$ or fewer elements coming from the second set.

3. Implementation in C

The software EXACTP, written in ANSI C, implements a command line interface for the recursive algorithm, while the special case of two variables is computed using the hypergeometric distribution. Below you see some examples taken from Krauth (1993, p. 33) and computed with EXACTP. According to Krauth, the original p-values for the exact hypergeometric test, computed with the software accompanying his publication (KFA.EXE), required the evaluation of 1242184 contingency tables. By contrast, the total number of p-values computed by EXACTP is 10244. For EXACTP the parameters are entered in the same order as for the R script: configuration frequency, sample size and an arbitrary number of marginal sums:

    C:\>exactp 8 158 65 94 68
    P = 0.99938264 (0 sec.)
    C:\>exactp 26 158 65 94 90
    P = 0.13814997 (0 sec.)
    C:\>exactp 6 158 65 64 68
    P = 0.98948248 (0 sec.)
    C:\>exactp 25 158 65 64 90
    P = 0.00072640 (0 sec.)
    C:\>exactp 29 158 93 94 68
    P = 0.07499715 (0 sec.)
    C:\>exactp 31 158 93 94 90
    P = 0.61504758 (0 sec.)
    C:\>exactp 25 158 93 64 68
    P = 0.00309131 (0 sec.)
    C:\>exactp 8 158 93 64 90
    P = 0.99999850 (0 sec.)

The sixth example is the same as the one shown for the R script. The less accurate p-value returned by the R function (0.6150942 versus 0.61504758) is caused by limitations of double precision floating point arithmetic. Therefore, the C version makes use of the rational arithmetic routines offered by the GNU Multiple Precision Arithmetic Library (GMP, www.swox.com/gmp/). By staying with integers and computing the true fraction, the result is exact; only the final conversion to a decimal number for output rounds it.

In more demanding cases, like this artificial 10-dimensional example, progress is reported every 10 seconds:

    C:\>exactp 5 1000 500 510 520 530 540 550 560 570 580 590
     24.07% done
     40.27% done
     54.30% done
     66.45% done
     77.50% done
     87.93% done
     98.00% done
    P = 0.07927212 (1 min. 12 sec.)

4. Conclusions

EXACTP provides a tool for computing exact conditional probabilities of configuration frequencies for high dimensional cases. Runtime mainly depends on the distance between the configuration frequency and the lowest of its marginal sums. Compared to this, the influence of dimensionality is negligible, and the number of categories is irrelevant.

The source code of EXACTP and a DOS executable, compiled with MinGW (www.mingw.org), are available at: www-public.rz.uni-duesseldorf.de/~beierm/exactp.html.

Acknowledgement

I would like to thank Prof. Joachim Krauth, Institut für Experimentelle Psychologie, Heinrich-Heine-Universität Düsseldorf, for his kind support and “beta-testing” the software.

References

1. Krauth, J.: Einführung in die Konfigurationsfrequenzanalyse (KFA). Ein multivariates nichtparametrisches Verfahren zum Nachweis und zur Interpretation von Typen und Syndromen. Weinheim/Basel: Beltz, Psychologie-Verlags-Union, 1993.

Appendix

/*
  exactp (2004-08-09): exact conditional cell probabilities of
  multidimensional, multicategorial contingency tables

  Language: ANSI C
  requires GMP (GNU Multiple Precision Arithmetic Library)
  should compile with: gcc exactp.c -lgmp -oexactp

  Copyright (C) 2004 Manfred Beier, [email protected]

  This program is free software; you can redistribute it and/or modify it
  under the terms of the GNU General Public License as published by the Free
  Software Foundation; either version 2 of the License, or (at your option)
  any later version.

  This program is distributed in the hope that it will be useful, but WITHOUT
  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
  FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
  more details.

  You should have received a copy of the GNU General Public License along
  with this program; if not, write to the Free Software Foundation, Inc.,
  59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*/

#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include <time.h>
#include <gmp.h>

/* global variables */
static int n;            /* dimension of contingency table, size of frequency vector */
static int min = 0;      /* index of minimum component of frequency vector */
static int total;        /* total number of p-values to calculate */
static time_t start_time, timer;
static mpq_t *p1, *p0;   /* cache for partial results */
static mpq_t one;        /* rational constant 1/1 */

/* issue computing progress every 10 seconds */
static void countdown()
{
  static int done = 0;
  done++;
  if (10 <= (int) difftime(time(NULL), timer)) {
    printf(" %5.2f%% done\n", 100 * (double) done / total);
    timer = time(NULL);
  }
}

/* recursive algorithm for n > 2 */
static void exactp(mpq_t p, int k, int m, int *f)
{
  int j;

  /* compute Po(k) */
  if (! mpq_sgn(p0[f[min]])) {   /* cache still empty? */
    mpq_set(p0[f[min]], one);
    for (j = 0; j < n; j++) {
      mpz_bin_uiui(mpq_numref(p), f[j], k);
      mpz_bin_uiui(mpq_denref(p), m, k);
      mpq_canonicalize(p);
      mpq_mul(p0[f[min]], p0[f[min]], p);
    }
  }
  else
    for (j = 0; j < n; j++) {
      mpz_set_ui(mpq_numref(p), f[j] - k + 1);
      mpz_set_ui(mpq_denref(p), m - k + 1);
      mpq_canonicalize(p);
      mpq_mul(p0[f[min]], p0[f[min]], p);
    }

  /* P(k) = (m choose k) x Po(k) */
  mpz_bin_uiui(mpq_numref(p), m, k);
  mpz_set_ui(mpq_denref(p), 1);
  mpq_mul(p, p, p0[f[min]]);
  countdown();

  if (k < f[min]) {
    if (! mpq_sgn(p1[f[min] - k])) {
      /* compute P1 = (1-...) */
      int *f_minus_k = malloc(n * sizeof (int));
      for (j = 0; j < n; j++)
        f_minus_k[j] = f[j] - k;
      exactp(p1[f[min] - k], 1, m - k, f_minus_k);
      mpq_sub(p1[f[min] - k], one, p1[f[min] - k]);
      free(f_minus_k);
    }
    mpq_mul(p, p, p1[f[min] - k]);

    /* add P(k+1) */
    mpq_t op;
    mpq_init(op);
    exactp(op, k + 1, m, f);
    mpq_add(p, p, op);
    mpq_clear(op);
  }
}

/* hypergeometric distribution for n = 2 */
static void hyper(mpq_t p, int k, int m, int *f)
{
  mpz_t bin1, bin2;
  mpz_init(bin1);
  mpz_init(bin2);
  mpz_bin_uiui(bin1, f[0], k);
  mpz_bin_uiui(bin2, m - f[0], f[1] - k);
  mpz_mul(mpq_numref(p), bin1, bin2);

  /* compute binomial coeff. series using recurrence relation:
     n choose k+1 = (n-k)/(k+1) x (n choose k) */
  int nb1 = f[0] - k, db1 = k + 1;
  int nb2 = f[1] - k, db2 = m - f[0] - f[1] + k + 1;
  for (++k; k <= f[min]; k++) {
    mpz_mul_ui(bin1, bin1, nb1--);
    mpz_divexact_ui(bin1, bin1, db1++);
    mpz_mul_ui(bin2, bin2, nb2--);
    mpz_divexact_ui(bin2, bin2, db2++);
    mpz_addmul(mpq_numref(p), bin1, bin2);
    countdown();
  }
  mpz_bin_uiui(mpq_denref(p), m, f[1]);
  mpq_canonicalize(p);
  mpz_clear(bin2);
  mpz_clear(bin1);
}

int main(int argc, char *argv[])
{
  /* print usage */
  if (argc < 5) {
    printf("\nUsage: %s k m f1 f2 [f3 ...]\n", argv[0]);
    printf("Computes the exact conditional probability for a pat");
    printf("tern to occur k or more\ntimes in a total of m obser");
    printf("vations given its attribute frequencies f1..fn.\n\n");
    return(0);
  }

  /* get arguments */
  int v = 1;
  int k = atoi(argv[v++]);
  int m = atoi(argv[v++]);
  n = argc - v;                               /* dimension of contingency table */
  int *f = (int *) malloc(n * sizeof (int));  /* frequency vector */
  int i;
  for (i = 0; i < n; i++) {
    f[i] = atoi(argv[v + i]);
    if (f[min] > f[i]) min = i;               /* determine minimum index */
  }

  /* initialize some globals */
  mpq_t p;
  mpq_init(p);                                /* P-value */
  total = f[min] - k + 1;
  if (n > 2) total = total * (total + 1) / 2;

  /* compute P-value */
  start_time = timer = time(NULL);
  if (n > 2) {   /* multidimensional case */
    /* initialize p1- and p0 cache */
    p1 = (mpq_t *) malloc((f[min] + 1) * sizeof (mpq_t));
    for (i = 0; i <= f[min]; i++) mpq_init(p1[i]);
    p0 = (mpq_t *) malloc((f[min] + 1) * sizeof (mpq_t));
    for (i = 0; i <= f[min]; i++) mpq_init(p0[i]);

    /* rational constant 1/1 needed for P1 = (1-...)) */
    mpq_init(one);
    mpq_set_ui(one, 1, 1);

    exactp(p, k, m, f);

    mpq_clear(one);
    for (i = f[min]; i >= 0; i--) mpq_clear(p0[i]);
    free(p0);
    for (i = f[min]; i >= 0; i--) mpq_clear(p1[i]);
    free(p1);
  }
  else           /* twodimensional case */
    hyper(p, k, m, f);
  int elapsed = (int) difftime(time(NULL), start_time);

  /* print result */
  printf("P = %1.8f\t(", mpq_get_d(p));
  if (elapsed >= 60) printf("%d min. ", elapsed / 60);
  printf("%d sec.)\n", elapsed % 60);

  /* clean up memory */
  mpq_clear(p);
  free(f);
  return(0);
}
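For readers without GMP at hand, the two-variable branch of the listing above can be cross-checked on small inputs with a plain double precision version of the hypergeometric sum in formula (10). This is a hedged sketch, not part of EXACTP: the names `hyper_tail_sketch` and `choose_d` are hypothetical, and double precision limits it to moderate sample sizes.

```c
#include <assert.h>

/* Binomial coefficient as a double; returns 0 when k is out of range,
   which silently drops impossible tables from the sum below. */
static double choose_d(int n, int k) {
    if (k < 0 || k > n) return 0.0;
    double c = 1.0;
    for (int i = 1; i <= k; i++)
        c = c * (n - k + i) / i;
    return c;
}

/* Upper tail of formula (10):
   sum over i = k..f_min of C(f2, i) * C(m - f2, f1 - i), divided by C(m, f1),
   where f_min is the smaller of the two marginal sums f1 and f2. */
double hyper_tail_sketch(int k, int m, int f1, int f2) {
    assert(k >= 0 && f1 >= 0 && f2 >= 0 && f1 <= m && f2 <= m);
    int fmin = f1 < f2 ? f1 : f2;
    double sum = 0.0;
    for (int i = k; i <= fmin; i++)
        sum += choose_d(f2, i) * choose_d(m - f2, f1 - i);
    return sum / choose_d(m, f1);
}
```

For m = 10, f1 = 4, f2 = 5 and k = 3 the numerator is 10 · 5 + 5 · 1 = 55 and the denominator is 210, so the function should return 11/42 up to rounding. Starting the sum at k = 0 should give 1, in line with Vandermonde's convolution. Unlike the GMP version, this sketch recomputes each binomial coefficient instead of updating it by recurrence, which keeps the code short at the cost of speed.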