Psychology Science, Volume 47, 2005 (3/4), p. 391-400
An algorithm and tool for computing exact conditional probabilities
of configuration frequencies
MANFRED BEIER¹
Abstract
Traditionally, the exact conditional probability of a configuration frequency is calculated with
methods based on Fisher's well-known formula for two-by-two contingency tables or its extensions to
tables of higher dimensions. I present here a different, combinatorial approach that scales much better
with an increasing number of variables and is, in principle, independent of the number of
categories.
Key words: Exact conditional probability, multidimensional contingency table, configural frequency analysis (CFA)
¹ Dipl.-Ing. Manfred Beier, Institut für Humangenetik und Anthropologie, Heinrich-Heine-Universität
Düsseldorf, Universitätsstr. 1, D-40225 Düsseldorf, Germany; E-mail: [email protected]
1. Algorithm
Please note that in the following the term “matrix” does not refer to a contingency table
but to the raw data matrix a contingency table can be constructed from.
Let A = (aij) be a matrix with m rows (observations, sample size) and n columns (variables, dimension of a contingency table). What is the probability of finding k or more identical row vectors W = (wj) (configurations, patterns) in the matrix if the components (attributes) of each column vector are ordered at random?
Let F = (fj) be a vector holding the frequencies (corresponding marginal sums of a contingency table) of wj for each column j = 1,..,n. For the first component of the first row, a11,
the probability for matching w1 is:
$$P(a_{11} = w_1;\, m, f_1) = \frac{f_1}{m}, \quad \text{for } a_{12}\text{: } P(a_{12} = w_2;\, m, f_2) = \frac{f_2}{m} \quad \text{etc.} \tag{1}$$
Assuming independence of the columns the probability for all cells of the first row being
equal to W = (wj) is:
$$P\bigl((a_{11},\ldots,a_{1n}) = W;\, m, F\bigr) = \prod_{j=1}^{n} \frac{f_j}{m}. \tag{2}$$
The probability for at least having the first k rows filled with W is:
$$P\bigl((a_{11},\ldots,a_{1n}) = \ldots = (a_{k1},\ldots,a_{kn}) = W;\, m, F\bigr) = \prod_{j=1}^{n} \frac{f_j}{m} \cdot \frac{f_j - 1}{m - 1} \cdot \ldots \cdot \frac{f_j - k + 1}{m - k + 1} = \prod_{j=1}^{n} \frac{\binom{f_j}{k}}{\binom{m}{k}} = P_0. \tag{3}$$
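As a quick numerical check of formula (3), the following C sketch evaluates P0 both as the product of falling ratios and as the ratio of binomial coefficients. It works in double precision rather than with exact rationals, and the function names (binom, p0_falling, p0_binomial) are illustrative assumptions, not part of the original program.

```c
#include <assert.h>

/* Binomial coefficient C(n, k) in double precision (multiplicative form).
   Adequate for a sketch with moderate arguments only. */
double binom(int n, int k) {
    double b = 1.0;
    int i;
    if (k < 0 || k > n) return 0.0;
    for (i = 1; i <= k; i++)
        b = b * (n - k + i) / i;
    return b;
}

/* Middle form of formula (3): product over all columns j of
   f_j/m * (f_j-1)/(m-1) * ... * (f_j-k+1)/(m-k+1). */
double p0_falling(int k, int m, const int *f, int n) {
    double p = 1.0;
    int i, j;
    assert(k >= 0 && k <= m);
    for (j = 0; j < n; j++)
        for (i = 0; i < k; i++)
            p *= (double)(f[j] - i) / (double)(m - i);
    return p;
}

/* Right-hand form of formula (3): product over j of C(f_j,k) / C(m,k). */
double p0_binomial(int k, int m, const int *f, int n) {
    double p = 1.0;
    int j;
    assert(k >= 0 && k <= m);
    for (j = 0; j < n; j++)
        p *= binom(f[j], k) / binom(m, k);
    return p;
}
```

For F = (2, 2, 2), m = 4 and k = 2, each column contributes (2/4)(1/3) = 1/6, so both functions return 1/216 up to rounding.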
The maximum possible number of W is limited by fmin, i.e. the smallest component of F.²
Consequently, for k = fmin the formula above gives us the probability for exactly the first k
rows containing W with no further occurrence in the remaining matrix. Therefore, the probability for a matrix Ak with k vectors W occurring in arbitrary rows can be computed by
summing up the probabilities of all possible ways to choose k rows out of m:
$$P(A_k;\, k = f_{\min}, m, F) = \binom{m}{k} \cdot P_0. \tag{4}$$
In the case of n = 2 with f1 = fmin = k:
$$\binom{m}{k} \cdot \prod_{j=1}^{2} \frac{\binom{f_j}{k}}{\binom{m}{k}} = \binom{m}{f_{\min}} \cdot \frac{\binom{f_{\min}}{f_{\min}}}{\binom{m}{f_{\min}}} \cdot \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}, \tag{5}$$
² The minimum possible number is given by max {0, Σfj - m(n-1)}.
this is equivalent to the p-value given by the hypergeometric distribution of a 2 by 2 contingency table, here shown with Fisher's formula for R1 = f1 = fmin = k and C1 = f2:
O11 = k = fmin      O12 = 0      R1 = f1 = fmin
O21 = f2-fmin       O22 = C2     R2 = m-fmin
C1 = f2             C2           m

$$\frac{R_1!\,R_2!\,C_1!\,C_2!}{m!\;O_{11}!\,O_{12}!\,O_{21}!\,O_{22}!} = \frac{f_{\min}!\,(m - f_{\min})!\,f_2!\,C_2!}{m!\;f_{\min}!\,0!\,(f_2 - f_{\min})!\,C_2!} = \frac{\dfrac{f_2!}{(f_2 - f_{\min})!}}{\dfrac{m!}{(m - f_{\min})!}} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}. \tag{6}$$
For k < fmin the probability for exactly k row vectors matching W corresponds to the probability of getting at least k rows and no further row:
$$P(A_k;\, k < f_{\min}, m, F) = \binom{m}{k} \cdot P_0 \cdot \Bigl(1 - P\bigl(\{A_1,\ldots,A_{f_{\min}-k}\};\, m - k, (f_1 - k,\ldots,f_n - k)\bigr)\Bigr). \tag{7}$$
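Formula (7) can be compared with Fisher's point probability in a small case. The C sketch below is restricted, as a simplifying assumption, to k = fmin - 1, where the inner probability reduces to the single term given by formula (4) with k = 1. It uses double precision, and all function names are hypothetical.

```c
#include <assert.h>

double factorial(int n) {
    double r = 1.0;
    int i;
    for (i = 2; i <= n; i++) r *= i;
    return r;
}

double binom(int n, int k) {
    double b = 1.0;
    int i;
    if (k < 0 || k > n) return 0.0;
    for (i = 1; i <= k; i++)
        b = b * (n - k + i) / i;
    return b;
}

/* Fisher's point probability of a 2 x 2 table with cells o11..o22. */
double fisher_point(int o11, int o12, int o21, int o22) {
    int r1 = o11 + o12, r2 = o21 + o22;
    int c1 = o11 + o21, c2 = o12 + o22, m = r1 + r2;
    return factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
         / (factorial(m) * factorial(o11) * factorial(o12)
            * factorial(o21) * factorial(o22));
}

/* Formula (7) for k = fmin - 1 only: the inner probability
   P({A_1}; m-k, F-k) is then prod_j (f_j-k) / (m-k)^(n-1). */
double exactly_k(int k, int m, const int *f, int n) {
    double lead = binom(m, k);   /* becomes C(m,k) * P0 */
    double inner = 1.0;
    int j, fmin = f[0];
    for (j = 1; j < n; j++)
        if (f[j] < fmin) fmin = f[j];
    assert(k == fmin - 1);       /* restriction of this sketch */
    for (j = 0; j < n; j++) {
        lead *= binom(f[j], k) / binom(m, k);
        inner *= (double)(f[j] - k);
    }
    for (j = 1; j < n; j++)
        inner /= (double)(m - k);
    return lead * (1.0 - inner);
}
```

With m = 10, F = (3, 5) and k = 2, exactly_k gives (2/3)(5/8) = 5/12, which agrees with fisher_point(2, 1, 3, 4) for the corresponding 2 by 2 table.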
This leads to the following recursive formula for getting k or more row vectors W:
$$P\bigl(\{A_k,\ldots,A_{f_{\min}}\};\, m, F\bigr) = \frac{\prod_{j=1}^{n} \binom{f_j}{k}}{\binom{m}{k}^{n-1}} \cdot \Bigl(1 - P\bigl(\{A_1,\ldots,A_{f_{\min}-k}\};\, m - k, (f_1 - k,\ldots,f_n - k)\bigr)\Bigr) + P\bigl(\{A_{k+1},\ldots,A_{f_{\min}}\};\, m, F\bigr). \tag{8}$$
At first sight this formula seems to be computationally infeasible: the number of p-values that have to be computed grows exponentially with the distance between k and fmin.
But by storing and reusing the partial results 1-P({A1,...,Afmin-i}; m-i, F-i) for all i = k,...,fmin-1,
the number of interim values is reduced to $\sum_{i=1}^{f_{\min}-k+1} i = (f_{\min}-k+1)(f_{\min}-k+2)/2$, i.e. a quadratic growth rate.
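The effect of this caching can be illustrated with a minimal C sketch of formula (8) that counts how many interim values it evaluates. It is not the EXACTP implementation: it uses double precision, recomputes the binomial term directly instead of using a second cache, and the names cache_t, term, upper, exact_p and the bound MAXF are assumptions made for this example.

```c
#include <assert.h>

#define MAXF 64  /* hypothetical upper bound on fmin for this sketch */

typedef struct {
    double p1[MAXF + 1];   /* cached complements 1 - P(...), index fmin - k */
    int    have[MAXF + 1]; /* 1 if p1[i] has already been computed */
    long   terms;          /* number of interim values evaluated */
} cache_t;

double binom(int n, int k) {
    double b = 1.0;
    int i;
    if (k < 0 || k > n) return 0.0;
    for (i = 1; i <= k; i++)
        b = b * (n - k + i) / i;
    return b;
}

/* First factor of formula (8), prod_j C(f_j,k) / C(m,k)^(n-1),
   with the offset d subtracted from m and from every f_j. */
double term(int k, int m, const int *f, int n, int d) {
    double t = 1.0, c = binom(m - d, k);
    int j;
    for (j = 0; j < n; j++) t *= binom(f[j] - d, k);
    for (j = 1; j < n; j++) t /= c;
    return t;
}

/* P({A_k,...,A_fmin}; m - d, F - d) by formula (8), reusing complements. */
double upper(int k, int m, const int *f, int n, int fmin, int d, cache_t *c) {
    double p = term(k, m, f, n, d);
    int rest = fmin - d - k;   /* distance between k and the current fmin */
    c->terms++;
    if (rest == 0) return p;
    if (!c->have[rest]) {
        c->p1[rest] = 1.0 - upper(1, m, f, n, fmin, d + k, c);
        c->have[rest] = 1;
    }
    return p * c->p1[rest] + upper(k + 1, m, f, n, fmin, d, c);
}

/* Upper probability for k or more occurrences; *terms receives the count. */
double exact_p(int k, int m, const int *f, int n, long *terms) {
    cache_t c = {{0}, {0}, 0};
    int j, fmin = f[0];
    double p;
    for (j = 1; j < n; j++)
        if (f[j] < fmin) fmin = f[j];
    assert(k >= 1 && k <= fmin && fmin <= MAXF);
    p = upper(k, m, f, n, fmin, 0, &c);
    if (terms) *terms = c.terms;
    return p;
}
```

For F = (2, 2, 2), m = 4 and k = 1 the sketch returns 17/36 after 3 evaluations; for F = (5, 6), m = 12 and k = 2 it needs 10 evaluations, matching (fmin-k+1)(fmin-k+2)/2 in both cases.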
An algorithm is given below in the form of an implementation in the programming language R (www.R-project.org). In addition to the necessary “data cache” just mentioned (p1), a second array (p0) allows the first part of the formula to be calculated using the
recurrence relation
$$\frac{\prod_{j=1}^{n} \binom{f_j}{k}}{\binom{m}{k}^{n-1}} = \frac{\prod_{j=1}^{n} \binom{f_j}{k-1}}{\binom{m}{k-1}^{n-1}} \cdot \frac{\prod_{j=1}^{n} (f_j - k + 1)}{k\,(m - k + 1)^{n-1}}. \tag{9}$$
exact.p <- function(k,m,f) {
  n <- length(f); min <- which.min(f)
  # caches: p1 holds the complements 1 - P(...) of formula (8),
  # p0 holds the first factor of formula (8), updated via recurrence (9)
  p1 <- array(dim=c(f[min]-k)); p0 <- array(dim=c(f[min]))
  inner.loop <- function(k,m,f) {
    if (is.na(p0[f[min]]))   # first call for this cache slot: compute directly
      p0[f[min]] <<- prod(choose(f,k)) / choose(m,k)^(n-1)
    else                     # later calls: step from k-1 to k
      p0[f[min]] <<- p0[f[min]] * prod(f-k+1) / (k * (m-k+1)^(n-1))
    if (k < f[min]) {
      if (is.na(p1[f[min]-k])) p1[f[min]-k] <<- 1 - inner.loop(1,m-k,f-k)
      return(p0[f[min]] * p1[f[min]-k] + inner.loop(k+1,m,f))
    } else return(p0[f[min]])
  }
  return(inner.loop(k,m,f))
}
For example, to calculate the upper probability for a configuration frequency of 31 with
marginal sums 90, 93 and 94 occurring in a sample of 158, call and output would look like:
> exact.p (31,158,c(90,93,94))
[1] 0.6150942
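Recurrence (9), which exact.p uses to update p0, can also be sketched in C. The function below (term_series, a hypothetical name) fills an array with the first factor of formula (8) for k = 0, 1, ..., kmax, starting from the value 1 at k = 0; it uses double precision and is meant only as an illustration.

```c
#include <assert.h>

/* Fill t[0..kmax] with T(k) = prod_j C(f_j,k) / C(m,k)^(n-1),
   stepping from k-1 to k with recurrence (9). */
void term_series(int kmax, int m, const int *f, int n, double *t) {
    int j, k;
    assert(kmax >= 0 && kmax <= m && n >= 1);
    t[0] = 1.0;                       /* every C(x,0) equals 1 */
    for (k = 1; k <= kmax; k++) {
        double r = 1.0 / k;           /* the single factor k in the denominator */
        for (j = 0; j < n; j++)
            r *= (double)(f[j] - k + 1);
        for (j = 1; j < n; j++)       /* (m-k+1) to the power n-1 */
            r /= (double)(m - k + 1);
        t[k] = t[k - 1] * r;
    }
}
```

For F = (2, 2, 2) and m = 4 this yields T(1) = 8/16 = 1/2 and T(2) = (1/2)(1/18) = 1/36, the same values as the direct ratios of binomial coefficients.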
2. The special case n = 2
It is important to note that the probability for getting k or more patterns is completely independent of the number of categories of each variable. From the standpoint of one specific
configuration each variable is binary: either the observation holds the correct value or not.
Therefore, for a given pattern, any raw data table can be expressed and treated as a binary
contingency table.
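This binary view can be made concrete with a short C sketch that derives the inputs of the algorithm from a raw data matrix. The function name pattern_counts and the row-major storage of the matrix are assumptions made for this example.

```c
#include <assert.h>

/* For a raw data matrix a (m rows, n columns, row-major: a[i * n + j])
   and a pattern w, store in f[j] the number of rows with a[i][j] == w[j]
   (the marginal sums) and return k, the number of rows equal to w in
   every column (the configuration frequency). */
int pattern_counts(const int *a, int m, int n, const int *w, int *f) {
    int i, j, k = 0;
    assert(m >= 0 && n >= 1);
    for (j = 0; j < n; j++)
        f[j] = 0;
    for (i = 0; i < m; i++) {
        int all = 1;                  /* does row i match w completely? */
        for (j = 0; j < n; j++) {
            if (a[i * n + j] == w[j]) f[j]++;
            else all = 0;
        }
        k += all;
    }
    return k;
}
```

For the five observations (1,2), (1,3), (2,2), (1,2), (3,1) and the pattern W = (1,2), the function returns k = 2 with marginal sums F = (3, 3), regardless of how many other categories occur in either column.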
In the case of two variables, once one cell holds k, the values of the remaining three cells
are fixed. Since the above algorithm simply sums up the p-values for getting exactly k to fmin
configurations, it is in fact easier and faster to use the hypergeometric distribution for that
job, which lowers the number of p-values that have to be computed to fmin - k + 1:
O11 = k        O12 = f1-k      R1 = f1
O21            O22             R2
C1 = f2        C2 = m-f2       m

$$P\bigl(\{A_k,\ldots,A_{f_{\min}}\};\, m, f_1, f_2\bigr) = \sum_{O_{11}=k}^{f_{\min}} \frac{\binom{C_1}{O_{11}} \binom{C_2}{O_{12}}}{\binom{m}{R_1}} = \frac{\sum_{i=k}^{f_{\min}} \binom{f_2}{i} \binom{m - f_2}{f_1 - i}}{\binom{m}{f_1}} \tag{10}$$
Remark: Summing up over all possible k would of course result in a p-value of 1, i.e. numerator and denominator are identical. This identity, commonly known as Vandermonde's convolution, leads to another combinatorial interpretation: While the denominator
depicts the number of all possible ways to choose f1 elements from the union of two disjoint
sets of size f2 and m - f2, the numerator is made up of only those containing k or more elements chosen from the first set, with the remaining f1 - k or fewer elements coming from the
second set.
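Formula (10) and the remark on Vandermonde's convolution can be checked with exact integer arithmetic for small tables. The C sketch below uses 64-bit unsigned integers instead of GMP, so it is only valid while the binomial coefficients fit into that type; binom_u and tail_numerator are hypothetical names.

```c
#include <assert.h>

/* Exact binomial coefficient for small arguments (no overflow protection). */
unsigned long long binom_u(unsigned n, unsigned k) {
    unsigned long long b = 1;
    unsigned i;
    if (k > n) return 0;
    for (i = 1; i <= k; i++)
        b = b * (n - k + i) / i;   /* the division is exact at every step */
    return b;
}

/* Numerator of formula (10): sum over i = k..fmin of
   C(f2,i) * C(m-f2, f1-i). The denominator is binom_u(m, f1). */
unsigned long long tail_numerator(unsigned k, unsigned m,
                                  unsigned f1, unsigned f2) {
    unsigned fmin = f1 < f2 ? f1 : f2, i;
    unsigned long long sum = 0;
    assert(f1 <= m && f2 <= m);
    for (i = k; i <= fmin; i++)
        sum += binom_u(f2, i) * binom_u(m - f2, f1 - i);
    return sum;
}
```

For m = 12, f1 = 5 and f2 = 6, tail_numerator(2, 12, 5, 6) is 300 + 300 + 90 + 6 = 696, giving a p-value of 696/792. Starting the sum at k = 0 adds 6 + 90 and yields 792 = C(12,5), which is Vandermonde's convolution for this table.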
3. Implementation in C
The software EXACTP, written in ANSI C, implements a command line interface for the
recursive algorithm, while the special case of two variables is computed using the hypergeometric distribution. Below you see some examples taken from Krauth (1993, p. 33) and
computed with EXACTP. According to Krauth, the original p-values for the exact hypergeometric test, computed with the software accompanying his publication (KFA.EXE),
required the evaluation of 1242184 contingency tables. By contrast, the total number of p-values computed by EXACTP is 10244. For EXACTP the parameters are entered in the
same order as for the R script: configuration frequency, sample size and an arbitrary number
of marginal sums:
C:\>exactp 8 158 65 94 68
P = 0.99938264 (0 sec.)
C:\>exactp 26 158 65 94 90
P = 0.13814997 (0 sec.)
C:\>exactp 6 158 65 64 68
P = 0.98948248 (0 sec.)
C:\>exactp 25 158 65 64 90
P = 0.00072640 (0 sec.)
C:\>exactp 29 158 93 94 68
P = 0.07499715 (0 sec.)
C:\>exactp 31 158 93 94 90
P = 0.61504758 (0 sec.)
C:\>exactp 25 158 93 64 68
P = 0.00309131 (0 sec.)
C:\>exactp 8 158 93 64 90
P = 0.99999850 (0 sec.)
The sixth example is the same as the one shown for the R script. The less accurate p-value returned by the R function (0.6150942 versus 0.61504758) is caused by limitations of
double precision floating point arithmetic. Therefore, the C version makes use of the rational
arithmetic routines offered by the GNU Multiple Precision Arithmetic Library (GMP,
www.swox.com/gmp/). By staying with integers and computing the true fraction, the result
is exact; rounding occurs only when the fraction is converted to a decimal number for printing.
In more demanding cases, such as this artificial 10-dimensional example, progress is
reported every 10 seconds:
C:\>exactp 5 1000 500 510 520 530 540 550 560 570 580 590
24.07% done
40.27% done
54.30% done
66.45% done
77.50% done
87.93% done
98.00% done
P = 0.07927212 (1 min. 12 sec.)
4. Conclusions
EXACTP provides a tool for computing exact conditional probabilities of configuration
frequencies for high dimensional cases. Runtime mainly depends on the distance between the
configuration frequency and the lowest of its marginal sums. Compared to this, the influence
of dimensionality is negligible, and the number of categories is irrelevant.
The source code of EXACTP and a DOS executable, compiled with MinGW
(www.mingw.org), are available at: www-public.rz.uni-duesseldorf.de/~beierm/exactp.html.
Acknowledgement
I would like to thank Prof. Joachim Krauth, Institut für Experimentelle Psychologie,
Heinrich-Heine-Universität Düsseldorf, for his kind support and “beta-testing” the software.
References
1. Krauth, J.: Einführung in die Konfigurationsfrequenzanalyse (KFA). Ein multivariates
nichtparametrisches Verfahren zum Nachweis und zur Interpretation von Typen und
Syndromen. Weinheim/Basel: Beltz, Psychologie-Verlags-Union, 1993.
Appendix
/* exactp (2004-08-09):
exact conditional cell probabilities of multidimensional,
multicategorial contingency tables
Language: ANSI C
requires GMP (GNU Multiple Precision Arithmetic Library)
should compile with: gcc exactp.c -lgmp -oexactp
Copyright (C) 2004 Manfred Beier, [email protected]
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <gmp.h>
/* global variables */
static int n; /* dimension of contingency table, size of frequency vector */
static int min = 0; /* index of minimum component of frequency vector */
static int total; /* total number of p-values to calculate */
static time_t start_time, timer;
static mpq_t *p1, *p0; /* cache for partial results */
static mpq_t one; /* rational constant 1/1 */
/* issue computing progress every 10 seconds */
static void
countdown(void) {
    static int done = 0;

    done++;
    if (10 <= (int) difftime(time(NULL), timer)) {
        printf(" %5.2f%% done\n", 100 * (double) done / total);
        timer = time(NULL);
    }
}
/* recursive algorithm for n > 2 */
static void
exactp(mpq_t p, int k, int m, int *f) {
    int j;

    /* compute P0(k) */
    if (! mpq_sgn(p0[f[min]])) {    /* cache still empty? */
        mpq_set(p0[f[min]], one);
        for (j = 0; j < n; j++) {
            mpz_bin_uiui(mpq_numref(p), f[j], k);
            mpz_bin_uiui(mpq_denref(p), m, k);
            mpq_canonicalize(p);
            mpq_mul(p0[f[min]], p0[f[min]], p);
        }
    } else
        for (j = 0; j < n; j++) {
            mpz_set_ui(mpq_numref(p), f[j] - k + 1);
            mpz_set_ui(mpq_denref(p), m - k + 1);
            mpq_canonicalize(p);
            mpq_mul(p0[f[min]], p0[f[min]], p);
        }

    /* P(k) = (m choose k) x P0(k) */
    mpz_bin_uiui(mpq_numref(p), m, k);
    mpz_set_ui(mpq_denref(p), 1);
    mpq_mul(p, p, p0[f[min]]);
    countdown();

    if (k < f[min]) {
        if (! mpq_sgn(p1[f[min] - k])) {    /* compute P1 = (1-...) */
            int *f_minus_k = malloc(n * sizeof (int));
            for (j = 0; j < n; j++) f_minus_k[j] = f[j] - k;
            exactp(p1[f[min] - k], 1, m - k, f_minus_k);
            mpq_sub(p1[f[min] - k], one, p1[f[min] - k]);
            free(f_minus_k);
        }
        mpq_mul(p, p, p1[f[min] - k]);

        /* add P(k+1) */
        mpq_t op; mpq_init(op);
        exactp(op, k + 1, m, f);
        mpq_add(p, p, op);
        mpq_clear(op);
    }
}
/* hypergeometric distribution for n = 2 */
static void
hyper(mpq_t p, int k, int m, int *f) {
    mpz_t bin1, bin2; mpz_init(bin1); mpz_init(bin2);

    mpz_bin_uiui(bin1, f[0], k);
    mpz_bin_uiui(bin2, m - f[0], f[1] - k);
    mpz_mul(mpq_numref(p), bin1, bin2);

    /* compute binomial coeff. series using recurrence relation:
       n choose k+1 = (n-k)/(k+1) x (n choose k) */
    int nb1 = f[0] - k, db1 = k + 1;
    int nb2 = f[1] - k, db2 = m - f[0] - f[1] + k + 1;
    for (++k; k <= f[min]; k++) {
        mpz_mul_ui(bin1, bin1, nb1--);
        mpz_divexact_ui(bin1, bin1, db1++);
        mpz_mul_ui(bin2, bin2, nb2--);
        mpz_divexact_ui(bin2, bin2, db2++);
        mpz_addmul(mpq_numref(p), bin1, bin2);
        countdown();
    }

    mpz_bin_uiui(mpq_denref(p), m, f[1]);
    mpq_canonicalize(p);
    mpz_clear(bin2); mpz_clear(bin1);
}
int
main(int argc, char *argv[]) {
    /* print usage */
    if (argc < 5) {
        printf("\nUsage: %s k m f1 f2 [f3 ...]\n", argv[0]);
        printf("Computes the exact conditional probability for a pat");
        printf("tern to occur k or more\ntimes in a total of m obser");
        printf("vations given its attribute frequencies f1..fn.\n\n");
        return(0);
    }

    /* get arguments */
    int v = 1;
    int k = atoi(argv[v++]);
    int m = atoi(argv[v++]);
    n = argc - v;    /* dimension of contingency table */
    int *f = (int *) malloc(n * sizeof (int));    /* frequency vector */
    int i;
    for (i = 0; i < n; i++) {
        f[i] = atoi(argv[v + i]);
        if (f[min] > f[i]) min = i;    /* determine minimum index */
    }

    /* initialize some globals */
    mpq_t p; mpq_init(p);    /* P-value */
    total = f[min] - k + 1;
    if (n > 2) total = total * (total + 1) / 2;

    /* compute P-value */
    start_time = timer = time(NULL);
    if (n > 2) {    /* multidimensional case */
        /* initialize p1- and p0 cache */
        p1 = (mpq_t *) malloc((f[min] + 1) * sizeof (mpq_t));
        for (i = 0; i <= f[min]; i++) mpq_init(p1[i]);
        p0 = (mpq_t *) malloc((f[min] + 1) * sizeof (mpq_t));
        for (i = 0; i <= f[min]; i++) mpq_init(p0[i]);
        /* rational constant 1/1 needed for P1 = (1-...) */
        mpq_init(one); mpq_set_ui(one, 1, 1);
        exactp(p, k, m, f);
        mpq_clear(one);
        for (i = f[min]; i >= 0; i--) mpq_clear(p0[i]);
        free(p0);
        for (i = f[min]; i >= 0; i--) mpq_clear(p1[i]);
        free(p1);
    } else    /* twodimensional case */
        hyper(p, k, m, f);
    int elapsed = (int) difftime(time(NULL), start_time);

    /* print result */
    printf("P = %1.8f\t(", mpq_get_d(p));
    if (elapsed >= 60) printf("%d min. ", elapsed / 60);
    printf("%d sec.)\n", elapsed % 60);

    /* clean up memory */
    mpq_clear(p);
    free(f);
    return(0);
}