Appendix B - BioMed Central

Appendix B. Upper estimate of the P-value for similarity between two alignment
columns.
For two alignment columns $m^*$ and $n^*$, we estimate the P-value for the
independent generation of these columns by a single emission vector $f$, with prior
distribution of emission vectors $\rho(f)$:

$$P\left(m^*,n^* \mid \rho(f)\right) = \int_{\rho(m,n \mid f)\,<\,\rho(m^*,n^* \mid f)} \rho(m,n \mid f)\,\rho(f)\; dm\, dn\, df \tag{B1}$$
Here, the integral is calculated over all emission vectors $f$ and over those random
residue counts $m$, $n$ whose probability density $\rho(m,n \mid f)$ is lower than that for the
generation of the vectors $m^*$, $n^*$. For each given vector $f$, the integral over $m$, $n$ can be
calculated precisely. As follows from Appendix A, the P-value for separate generation of
each column by an emission vector $f$ with dimensionality $d$ obeys a $\chi^2$ distribution with
$(d-1)$ degrees of freedom (formula (A8)). Therefore, the combined distribution for two
independently generated random columns is also a $\chi^2$, with the number of degrees of
freedom $(d-1) + (d-1) = 2(d-1)$. Using the notation of Appendix A,
$$P\left(m^*,n^* \mid f\right) = \int_{\rho(m,n \mid f)\,<\,\rho(m^*,n^* \mid f)} \rho(m \mid f)\,\rho(n \mid f)\; dm\, dn = Q\left(d-1, \sum_i R_i^2\right) \tag{B2}$$

where

$$R_i^2 = \frac{1}{2}\left[\frac{(m_i^* - M f_i)^2}{M f_i} + \frac{(n_i^* - N f_i)^2}{N f_i}\right] \tag{B3}$$
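As a minimal sketch of how (B2) and (B3) can be evaluated numerically, the Python fragment below computes $Q(d-1, \sum_i R_i^2)$ for a given emission vector. It relies on two assumptions that are not stated in this appendix: $M$ and $N$ are taken to be the total residue counts of the two columns, and $Q(a,x)$ is evaluated with the closed form that holds only for integer $a$ (which is the case here, since $a = d-1$). The function names are illustrative.

```python
import math

def q_regularized(a: int, x: float) -> float:
    """Regularized upper incomplete gamma Q(a, x) for a positive integer a.

    Uses Q(a, x) = exp(-x) * sum_{k=0}^{a-1} x**k / k!, valid for integer a only.
    """
    if a < 1 or x < 0:
        raise ValueError("need integer a >= 1 and x >= 0")
    term = total = 1.0
    for k in range(1, a):
        term *= x / k
        total += term
    return math.exp(-x) * total

def r_squared_terms(m_obs, n_obs, f):
    """Per-residue terms R_i^2 of (B3); every f_i must be positive."""
    M, N = sum(m_obs), sum(n_obs)  # assumption: M, N are the column totals
    return [0.5 * ((m - M * fi) ** 2 / (M * fi) + (n - N * fi) ** 2 / (N * fi))
            for m, n, fi in zip(m_obs, n_obs, f)]

def p_value_given_f(m_obs, n_obs, f):
    """P(m*, n* | f) = Q(d - 1, sum_i R_i^2), following (B2)."""
    return q_regularized(len(f) - 1, sum(r_squared_terms(m_obs, n_obs, f)))
```

For example, `p_value_given_f([5, 3, 2], [10, 6, 4], [0.5, 0.3, 0.2])` describes two columns whose counts match the expected counts exactly, so every $R_i^2$ vanishes and the P-value is at its maximum.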
We use the prior distribution $\rho(f)$ that maximizes the likelihood for the observed alignment
columns to be generated by a single emission vector $f$. Considering $\rho(f)$ in a simple
Gaussian form

$$\rho(f) = \left(\prod_i 2\pi\hat{\sigma}_i^2\right)^{-1/2} \exp\left(-\sum_i \frac{\left(f_i - \hat{f}_i\right)^2}{2\hat{\sigma}_i^2}\right) \tag{B4}$$


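it is easy to show [49] that

$$\hat{f}_i = \frac{m_i^* + n_i^*}{M + N}, \qquad \hat{\sigma}_i^2 = \frac{1}{2}\left[\left(\frac{n_i^* - N\hat{f}_i}{N}\right)^2 + \left(\frac{m_i^* - M\hat{f}_i}{M}\right)^2\right] \tag{B5}$$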
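The estimates in (B5) are simple to compute directly. The sketch below is illustrative only: it assumes that $M$ and $N$ are the column totals and that the counts entering (B5) are the observed ones; the function name is hypothetical.

```python
def prior_parameters(m_obs, n_obs):
    """Return (f_hat, var_hat): the means and variances of the Gaussian prior (B4)."""
    M, N = sum(m_obs), sum(n_obs)  # assumption: M, N are the column totals
    # Pooled frequency of each residue type over both columns.
    f_hat = [(m + n) / (M + N) for m, n in zip(m_obs, n_obs)]
    # Half the summed squared deviations of the two column frequencies from f_hat.
    var_hat = [0.5 * (((n - N * fh) / N) ** 2 + ((m - M * fh) / M) ** 2)
               for m, n, fh in zip(m_obs, n_obs, f_hat)]
    return f_hat, var_hat
```

Note that $\hat{\sigma}_i^2$ vanishes whenever a residue type has identical frequencies in both columns, so any code that divides by $\hat{\sigma}_i$ needs to treat that case separately.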


Formula (B2) and expression (B4) for the prior transform (B1) into

$$P = \left(\prod_i 2\pi\hat{\sigma}_i^2\right)^{-1/2} \int_{\sum_i f_i = 1} Q\left(d-1, \sum_i R_i^2\right) \exp\left(-\sum_i \frac{\left(f_i - \hat{f}_i\right)^2}{2\hat{\sigma}_i^2}\right) df \tag{B6}$$
Analytic calculation of this integral is problematic, so we estimate its approximate
value using two observations. First, the argument of the regularized gamma function is
a sum of partial functions of the individual emission frequencies $f_i$, which reaches its
minimum at $f^{(0)} = \{f_i^{(0)}\}$:

$$\left(f_i^{(0)}\right)^2 = \frac{1}{M+N}\left(\frac{(n_i^*)^2}{N} + \frac{(m_i^*)^2}{M}\right) \tag{B7}$$

Second, the function $Q(a,x) = \Gamma(a,x)/\Gamma(a)$ monotonically decreases with $x$ from $Q(a,0) = 1$ to
$Q(a,\infty) = 0$. Therefore, the regularized gamma function $Q$ in (B6) reaches its single
maximum $Q_{max}$ at $f^{(0)} = \{f_i^{(0)}\}$ and rapidly decreases to zero outside the vicinity of $f^{(0)}$.
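The location and height of this maximum can be sketched as follows. The fragment assumes that each partial function $R_i^2(f_i)$ of (B3) is minimized independently, without enforcing $\sum_i f_i = 1$; setting its derivative to zero gives $(f_i^{(0)})^2 = \left((n_i^*)^2/N + (m_i^*)^2/M\right)/(M+N)$. It also assumes that $M$ and $N$ are the column totals and that every residue type occurs in at least one of the two columns, so that $f_i^{(0)} > 0$. All names are illustrative.

```python
import math

def q_regularized(a, x):
    """Q(a, x) for a positive integer a (closed form, integer a only)."""
    term = total = 1.0
    for k in range(1, a):
        term *= x / k
        total += term
    return math.exp(-x) * total

def r_squared(f, m, n, M, N):
    """Single term R_i^2 of (B3) for one residue type."""
    return 0.5 * ((m - M * f) ** 2 / (M * f) + (n - N * f) ** 2 / (N * f))

def q_max(m_obs, n_obs):
    """Return (f0, x0, Qmax): minimizers, minimal argument, and the maximum of Q."""
    M, N = sum(m_obs), sum(n_obs)  # assumption: M, N are the column totals
    # Unconstrained minimizer of each R_i^2(f_i).
    f0 = [math.sqrt((n * n / N + m * m / M) / (M + N))
          for m, n in zip(m_obs, n_obs)]
    x0 = sum(r_squared(f, m, n, M, N) for f, m, n in zip(f0, m_obs, n_obs))
    return f0, x0, q_regularized(len(m_obs) - 1, x0)
```

When the two columns have identical residue frequencies, $f^{(0)}$ coincides with those frequencies, $x^{(0)} = 0$, and $Q_{max} = 1$.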

Based on this result, we determine a volume $\Omega$ around $f^{(0)}$ where $Q\left(d-1, \sum_i R_i^2\right)$ is still
comparable with $Q_{max}$, and calculate an approximate upper estimate of integral (B6) as

$$P \lesssim Q_{max} \left(\prod_i 2\pi\hat{\sigma}_i^2\right)^{-1/2} \int_{\Omega} \exp\left(-\sum_i \frac{\left(f_i - \hat{f}_i\right)^2}{2\hat{\sigma}_i^2}\right) df \tag{B8}$$


We define the volume $\Omega$ as a parallelepiped and estimate the location of its edges. First, we
consider $Q\left(d-1, \sum_i R_i^2\right)$ in (B6) and estimate the characteristic distance $\delta$ from the
maximum point $x^{(0)} = \sum_i R_i^2\left(f_i^{(0)}\right)$ for which $Q\left(d-1, x^{(0)} + \delta\right) \ll Q\left(d-1, x^{(0)}\right)$.
Specifically, we approximate $Q(d-1, x)$ with its tangent at $x^{(0)}$ and find the point of
intersection between the tangent and the abscissa:

$$Q\left(d-1, x^{(0)}\right) + \delta \left.\frac{\partial Q(d-1, x)}{\partial x}\right|_{x = x^{(0)}} = 0 \tag{B9}$$

Given that $\dfrac{\partial Q(a,x)}{\partial x} = -\dfrac{x^{a-1} e^{-x}}{\Gamma(a)}$, we can estimate the distance $\delta$ as

$$\delta = \frac{\Gamma\left(d-1, x^{(0)}\right) e^{x^{(0)}}}{\left(x^{(0)}\right)^{d-2}} \tag{B10}$$
Based on this estimate, we determine the borders of the volume $\Omega$ in the space of emission
vectors. For simplicity, we define $\Omega$ as the parallelepiped $\{f_i^{(1)} < f_i < f_i^{(2)}\}$, $i = 1, \ldots, d-1$, where
the limits $f_i^{(1,2)}$ are determined from the equation

$$R_i^2(f_i) = R_i^2\left(f_i^{(0)}\right) + \delta \tag{B11}$$
Using the expression for $R_i^2(f_i)$ from (B3), we get the borders of $\Omega$ as

$$f_i^{(1,2)} = \frac{b \mp \sqrt{b^2 - ac}}{a} \tag{B12}$$

where

$$a = M + N, \qquad b = m_i^* + n_i^* + R_i^2\left(f_i^{(0)}\right) + \delta, \qquad c = \frac{(m_i^*)^2}{M} + \frac{(n_i^*)^2}{N},$$

and $\delta$ is the distance estimated in (B10).
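The borders can be checked numerically with the sketch below. It assumes that the limits solve $R_i^2(f_i) = R_i^2(f_i^{(0)}) + \delta$ with $R_i^2$ taken from (B3); multiplying that equation by $2f_i$ gives the quadratic $a f_i^2 - 2 b f_i + c = 0$ with $a = M+N$, $b = m_i^* + n_i^* + R_i^2(f_i^{(0)}) + \delta$, and $c = (m_i^*)^2/M + (n_i^*)^2/N$. The helper names are hypothetical, and the residue type is assumed to occur in at least one column so that $f_i^{(0)} > 0$.

```python
import math

def r_squared(f, m, n, M, N):
    """Single term R_i^2 of (B3) for one residue type."""
    return 0.5 * ((m - M * f) ** 2 / (M * f) + (n - N * f) ** 2 / (N * f))

def borders(m, n, M, N, delta):
    """Lower and upper limits (f1, f2) for one residue type, as in (B12)."""
    a = M + N
    c = m * m / M + n * n / N
    f0 = math.sqrt(c / a)                      # minimizer of R_i^2, requires c > 0
    b = m + n + r_squared(f0, m, n, M, N) + delta
    root = math.sqrt(max(b * b - a * c, 0.0))  # clamp tiny negative rounding errors
    return (b - root) / a, (b + root) / a
```

Substituting either returned limit back into `r_squared` should reproduce `r_squared(f0, ...) + delta`, and the minimizer `f0` should lie between the two limits. The upper limit may exceed 1, which only widens the integration volume and is consistent with an upper estimate.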
Having $\Omega$ and using the definition of the error function $\mathrm{erf}(x)$, we calculate an approximate
upper estimate for the P-value (B8) as

$$P \lesssim Q\left(d-1, \sum_i R_i^2\left(f_i^{(0)}\right)\right) \prod_i \frac{1}{2}\left[\mathrm{erf}\left(\frac{f_i^{(2)} - \hat{f}_i}{\sqrt{2\hat{\sigma}_i^2}}\right) - \mathrm{erf}\left(\frac{f_i^{(1)} - \hat{f}_i}{\sqrt{2\hat{\sigma}_i^2}}\right)\right] \tag{B13}$$

where $R_i^2\left(f_i^{(0)}\right)$ is defined by (B3) and (B7), $\hat{f}_i$ and $\hat{\sigma}_i^2$ are defined by (B5), and $f_i^{(1,2)}$
are defined by (B12).
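Putting the steps together, a minimal end-to-end sketch of (B13) might look as follows. It is an illustration under several assumptions rather than a reference implementation: $M$ and $N$ are the column totals; every residue type occurs in at least one column; each $f_i^{(0)}$ is the unconstrained minimizer of $R_i^2$ in (B3); the borders solve $R_i^2(f_i) = R_i^2(f_i^{(0)}) + \delta$; the product runs over the first $d-1$ residue types (the choice of the omitted type is arbitrary here); and a vanishing $\hat{\sigma}_i^2$ is treated as a point mass at $\hat{f}_i$.

```python
import math

def upper_p_value(m_obs, n_obs):
    """Approximate upper estimate of the P-value for two columns of residue counts."""
    d = len(m_obs)
    M, N = sum(m_obs), sum(n_obs)      # assumption: M, N are the column totals
    a_gamma = d - 1                    # first argument of Q
    # Minimizers f_i^(0) and minimal terms R_i^2(f_i^(0)) from (B3).
    r0 = []
    for m, n in zip(m_obs, n_obs):
        f = math.sqrt((m * m / M + n * n / N) / (M + N))
        r0.append(0.5 * ((m - M * f) ** 2 / (M * f) + (n - N * f) ** 2 / (N * f)))
    x0 = sum(r0)
    # series = sum_{k<a} x0^k / k!, shared by Qmax and delta (integer a only).
    term = series = 1.0
    for k in range(1, a_gamma):
        term *= x0 / k
        series += term
    q_max = math.exp(-x0) * series
    if a_gamma > 1 and x0 <= 1e-12:
        return q_max                   # delta is unbounded, every erf factor tends to 1
    delta = math.factorial(a_gamma - 1) * series / x0 ** (a_gamma - 1)   # (B10)
    estimate = q_max
    for i in range(d - 1):             # assumption: product over the first d-1 types
        m, n = m_obs[i], n_obs[i]
        f_hat = (m + n) / (M + N)                                         # (B5)
        var = 0.5 * ((n / N - f_hat) ** 2 + (m / M - f_hat) ** 2)         # (B5)
        a, c = M + N, m * m / M + n * n / N
        b = m + n + r0[i] + delta
        root = math.sqrt(max(b * b - a * c, 0.0))
        lo, hi = (b - root) / a, (b + root) / a                           # (B12)
        if var == 0.0:
            factor = 1.0 if lo <= f_hat <= hi else 0.0   # point-mass limit of the prior
        else:
            s = math.sqrt(2.0 * var)
            factor = 0.5 * (math.erf((hi - f_hat) / s) - math.erf((lo - f_hat) / s))
        estimate *= factor
    return estimate
```

As a usage example, `upper_p_value([8, 1, 1], [7, 2, 1])` compares two similar columns and `upper_p_value([8, 1, 1], [1, 8, 1])` two dissimilar ones; the second call should return the smaller estimate.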