Appendix 2. Upper estimate of P-value for similarity between two alignment
columns.
For two alignment columns $m^*$ and $n^*$, we estimate the P-value for the
independent generation of these columns by a single emission vector $f$, with prior
distribution of emission vectors $\varphi(f)$:

$$P = \int\!\!\iint_{\rho(m,n|f) \,<\, \rho(m^*,n^*|f)} \rho(m,n|f)\,\varphi(f)\,dm\,dn\,df$$
(B1)
Here, the integral is calculated over all emission vectors $f$ and over those random
residue counts $m$, $n$ whose probability density $\rho(m,n|f)$ is lower than that for the
generation of the vectors $m^*$, $n^*$. For each given vector $f$, the integral over $m$, $n$ can be
calculated precisely. As follows from Appendix A, the P-value for the separate generation of
each column by an emission vector $f$ with dimensionality $d$ obeys a $\chi^2$ distribution with
$(d-1)$ degrees of freedom (formula (A8)). Therefore, the combined distribution for two
independently generated random columns is also a $\chi^2$ distribution, with the number of degrees of
freedom $(d-1) + (d-1) = 2(d-1)$. Using the notation of Appendix A,

$$P(m^*,n^*|f) = \iint_{\rho(m,n|f) \,<\, \rho(m^*,n^*|f)} \rho(m|f)\,\rho(n|f)\,dm\,dn = Q\Big(d-1,\ \sum_i R_i^2\Big)$$
(B2)
where

$$R_i^2 = \frac{1}{2}\left[\frac{(m_i^* - M f_i)^2}{M f_i} + \frac{(n_i^* - N f_i)^2}{N f_i}\right]$$
(B3)
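The evaluation of (B2) and (B3) for one fixed emission vector can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy 4-letter counts are hypothetical, and the finite series used for $Q(a,x)$ is exact only because $a = d-1$ is a positive integer.

```python
import math

def reg_upper_gamma(a: int, x: float) -> float:
    """Q(a, x) = Gamma(a, x) / Gamma(a) for a positive integer a.

    For integer a the finite series exp(-x) * sum_{k<a} x**k / k! is exact.
    """
    term, total = 1.0, 1.0
    for k in range(1, a):
        term *= x / k
        total += term
    return math.exp(-x) * total

def r_squared(m_obs, n_obs, f):
    """Per-residue terms R_i^2 of (B3) for observed counts and emission vector f."""
    M, N = sum(m_obs), sum(n_obs)
    return [0.5 * ((m - M * fi) ** 2 / (M * fi) + (n - N * fi) ** 2 / (N * fi))
            for m, n, fi in zip(m_obs, n_obs, f)]

def p_value_given_f(m_obs, n_obs, f):
    """P(m*, n* | f) = Q(d - 1, sum_i R_i^2), following (B2)."""
    d = len(f)
    return reg_upper_gamma(d - 1, sum(r_squared(m_obs, n_obs, f)))

# Toy example with a 4-letter alphabet (hypothetical counts and frequencies).
m_star = [6, 2, 1, 1]
n_star = [5, 3, 1, 1]
f = [0.55, 0.25, 0.10, 0.10]
print(p_value_given_f(m_star, n_star, f))
```

The second argument of $Q$ is half of the usual $\chi^2$ statistic, which is why a $\chi^2$ distribution with $2(d-1)$ degrees of freedom corresponds to the first argument $d-1$.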
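We will use the prior distribution $\varphi(f)$ that maximizes the likelihood for the observed alignment
columns to be generated by a single emission vector $f$. Considering $\varphi(f)$ in a simple
Gaussian form

$$\varphi(f) = \prod_i \left(2\pi\hat\sigma_i^2\right)^{-1/2} \exp\left(-\sum_i \frac{(f_i - \hat f_i)^2}{2\hat\sigma_i^2}\right)$$
(B4)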
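it is easy to show [49] that

$$\hat f_i = \frac{m_i^* + n_i^*}{M + N}, \qquad \hat\sigma_i^2 = \frac{1}{2}\left[\left(\frac{n_i^* - N\hat f_i}{N}\right)^2 + \left(\frac{m_i^* - M\hat f_i}{M}\right)^2\right]$$
(B5)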
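As a small illustration of (B5), the prior parameters can be computed directly from the two count vectors. The sketch below assumes the variance is the mean squared deviation of the two observed column frequencies $m_i^*/M$ and $n_i^*/N$ from the pooled frequency $\hat f_i$; the function name is hypothetical.

```python
def gaussian_prior_parameters(m_obs, n_obs):
    """Return (f_hat, sigma2_hat) for the Gaussian prior, following (B5)."""
    M, N = sum(m_obs), sum(n_obs)
    # Pooled frequency of each residue over both columns.
    f_hat = [(m + n) / (M + N) for m, n in zip(m_obs, n_obs)]
    # Mean squared deviation of the two column frequencies from f_hat.
    sigma2_hat = [0.5 * (((m - M * fh) / M) ** 2 + ((n - N * fh) / N) ** 2)
                  for m, n, fh in zip(m_obs, n_obs, f_hat)]
    return f_hat, sigma2_hat

f_hat, sigma2_hat = gaussian_prior_parameters([6, 4], [2, 8])
print(f_hat, sigma2_hat)
```

Note that a residue with identical frequencies in both columns gives a zero variance under this form, so any code that divides by $\hat\sigma_i$ needs to treat that case separately.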
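Formula (B2) and expression (B4) for the prior transform (B1) into

$$P = \prod_i \left(2\pi\hat\sigma_i^2\right)^{-1/2} \int_{\sum_i f_i = 1} Q\Big(d-1,\ \sum_i R_i^2\Big) \exp\left(-\sum_i \frac{(f_i - \hat f_i)^2}{2\hat\sigma_i^2}\right) df$$
(B6)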
Analytic calculation of this integral is problematic, so we estimate its approximate
value using two observations. First, the argument of the regularized gamma function is
a sum of partial functions of the individual emission frequencies $f_i$, which reaches its
minimum at $f^{(0)} = \{f_i^{(0)}\}$:

$$\left(f_i^{(0)}\right)^2 = \frac{1}{M + N}\left(\frac{n_i^{*2}}{N} + \frac{m_i^{*2}}{M}\right)$$
(B7)
Second, the function $Q(a,x) = \Gamma(a,x)/\Gamma(a)$ decreases monotonically with $x$ from $Q(a,0) = 1$ to
$Q(a,\infty) = 0$. Therefore, the regularized gamma function $Q$ in (B6) reaches its single
maximum $Q_{\max}$ at $f^{(0)} = \{f_i^{(0)}\}$ and rapidly decreases to zero outside the vicinity of $f^{(0)}$.
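The location of the minimum can be checked numerically. The sketch below takes $R_i^2$ exactly in the form of (B3) and sets its derivative with respect to $f_i$ to zero, which gives $(f_i^{(0)})^2 = (m_i^{*2}/M + n_i^{*2}/N)/(M+N)$; this closed form is an assumption derived from (B3) alone, it ignores the normalization $\sum_i f_i = 1$, and the function names and counts are hypothetical.

```python
import math

def r_squared_i(m, n, M, N, f):
    """Single term R_i^2 of (B3) as a function of the frequency f."""
    return 0.5 * ((m - M * f) ** 2 / (M * f) + (n - N * f) ** 2 / (N * f))

def f_min(m, n, M, N):
    """Stationary point of R_i^2(f): f**2 = (m**2/M + n**2/N) / (M + N)."""
    return math.sqrt((m * m / M + n * n / N) / (M + N))

m_star, n_star = [6, 2, 1, 1], [8, 5, 4, 3]
M, N = sum(m_star), sum(n_star)
for m, n in zip(m_star, n_star):
    f0 = f_min(m, n, M, N)
    r0 = r_squared_i(m, n, M, N, f0)
    # R_i^2 should not decrease when f is nudged away from f0 in either direction.
    assert r0 <= r_squared_i(m, n, M, N, 1.01 * f0)
    assert r0 <= r_squared_i(m, n, M, N, 0.99 * f0)
    print(f"f0 = {f0:.4f}, R^2(f0) = {r0:.5f}")
```

Because $Q(a,x)$ decreases with $x$, the sum of these minimal values is the argument at which $Q$ attains $Q_{\max}$.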
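Based on this result, we determine the volume $\Omega$ around $f^{(0)}$ where $Q\big(d-1, \sum_i R_i^2\big)$ is still
comparable with $Q_{\max}$, and calculate an approximate upper estimate of integral (B6) as

$$P \lesssim Q_{\max} \prod_i \left(2\pi\hat\sigma_i^2\right)^{-1/2} \int_{\Omega} \exp\left(-\sum_i \frac{(f_i - \hat f_i)^2}{2\hat\sigma_i^2}\right) df$$
(B8)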
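We define the volume $\Omega$ as a parallelepiped and estimate the location of its edges. First, we
consider $Q\big(d-1, \sum_i R_i^2\big)$ in (B6) and estimate the characteristic distance $\Delta$ from the
maximum point $x^{(0)} = \sum_i R_i^2\big(f_i^{(0)}\big)$ for which
$Q\big(d-1, x^{(0)} + \Delta\big) \ll Q\big(d-1, x^{(0)}\big)$.
Specifically, we approximate $Q(d-1, x)$ with its tangent at $x^{(0)}$ and find the point of
intersection between the tangent and the abscissa:

$$Q\big(d-1, x^{(0)}\big) + \left.\frac{\partial Q(d-1, x)}{\partial x}\right|_{x = x^{(0)}} \Delta = 0$$
(B9)

Given that $\partial Q(a,x)/\partial x = -x^{a-1}e^{-x}/\Gamma(a)$, we can estimate the distance $\Delta$ as

$$\Delta = \frac{\Gamma\big(d-1, x^{(0)}\big)\, e^{x^{(0)}}}{\big(x^{(0)}\big)^{d-2}}$$
(B10)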
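The tangent construction of (B9) and (B10) is easy to verify numerically. The following sketch assumes an integer first argument $a = d-1$, for which $\Gamma(a,x) = (a-1)!\,e^{-x}\sum_{k<a} x^k/k!$ holds exactly, and a strictly positive $x^{(0)}$; the function name is hypothetical.

```python
import math

def tangent_distance(a: int, x0: float) -> float:
    """Delta = Gamma(a, x0) * exp(x0) / x0**(a - 1) for integer a >= 1 and x0 > 0.

    This is the distance from x0 to the point where the tangent to Q(a, x)
    at x0 crosses the abscissa, i.e. -Q(a, x0) / Q'(a, x0).
    """
    # For integer a, Gamma(a, x0) * exp(x0) = (a - 1)! * sum_{k<a} x0**k / k!
    term, total = 1.0, 1.0
    for k in range(1, a):
        term *= x0 / k
        total += term
    return math.factorial(a - 1) * total / x0 ** (a - 1)

# With d = 20 residue types the first argument of Q is a = d - 1 = 19.
print(tangent_distance(19, 12.0))
```

For $a = 1$ the function $Q(1,x) = e^{-x}$ has a tangent that always reaches zero at distance 1, which gives a quick sanity check of the formula.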
Based on this estimate, we determine the borders of the volume $\Omega$ in the space of emission
vectors. For simplicity, we define $\Omega$ as the parallelepiped $\{f_i^{(1)} < f_i < f_i^{(2)}\}$, $i = 1, \dots, d-1$, where
the limits $f_i^{(1,2)}$ are determined from the equation

$$R_i^2(f_i) = R_i^2\big(f_i^{(0)}\big) + \Delta$$
(B11)
Using the expressions for $R_i^2(f_i)$ from (B3), we get the borders of $\Omega$ as

$$f_i^{(1,2)} = \frac{b \mp \sqrt{b^2 - ac}}{a}$$
(B12)

where

$$a = M + N, \qquad b = m_i^* + n_i^* + R_i^2\big(f_i^{(0)}\big) + \Delta, \qquad c = \frac{m_i^{*2}}{M} + \frac{n_i^{*2}}{N}.$$
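The borders can be sketched as the two roots of a quadratic in $f_i$. The code below assumes that each border solves $R_i^2(f_i) = R_i^2(f_i^{(0)}) + \Delta$ with $R_i^2$ taken from (B3); multiplying that equation by $2 f_i$ gives $a f_i^2 - 2 b f_i + c = 0$ with $a = M + N$, $b = m_i^* + n_i^* + R_i^2(f_i^{(0)}) + \Delta$ and $c = m_i^{*2}/M + n_i^{*2}/N$. The function names and numbers are hypothetical.

```python
import math

def r_squared_i(m, n, M, N, f):
    """Single term R_i^2 of (B3) as a function of the frequency f."""
    return 0.5 * ((m - M * f) ** 2 / (M * f) + (n - N * f) ** 2 / (N * f))

def borders(m, n, M, N, delta):
    """Lower and upper borders f^(1), f^(2) for one residue."""
    a = M + N
    c = m * m / M + n * n / N
    f0 = math.sqrt(c / a)                       # minimum of R_i^2
    b = m + n + r_squared_i(m, n, M, N, f0) + delta
    root = math.sqrt(max(b * b - a * c, 0.0))   # guard against rounding below zero
    return (b - root) / a, (b + root) / a

lo, hi = borders(6, 5, 10, 10, delta=0.5)
print(lo, hi)
```

With $\Delta = 0$ the discriminant vanishes and both roots collapse onto $f_i^{(0)}$, which is a convenient consistency check. The upper root is not clipped to 1 in this sketch.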
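Having $\Omega$ and using the definition of the error function erf(x), we calculate an approximate
upper estimate for the P-value (B8) as

$$P \lesssim Q\Big(d-1,\ \sum_i R_i^2\big(f_i^{(0)}\big)\Big) \prod_i \frac{1}{2}\left[\mathrm{erf}\left(\frac{f_i^{(2)} - \hat f_i}{\sqrt{2\hat\sigma_i^2}}\right) - \mathrm{erf}\left(\frac{f_i^{(1)} - \hat f_i}{\sqrt{2\hat\sigma_i^2}}\right)\right]$$
(B13)

where $R_i^2\big(f_i^{(0)}\big)$ is defined by (B3) and (B7), $\hat f_i$ and $\hat\sigma_i^2$ are defined by (B5), and $f_i^{(1,2)}$
are defined by (B12).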
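Putting the pieces together, the whole estimate can be sketched as a single function. This is an illustrative sketch under stated assumptions rather than the authors' code: $R_i^2$ is taken from (B3), its minimum value is written in the closed form $\sqrt{(M+N)c_i} - (m_i^* + n_i^*)$ that follows from it, the product runs over all $d$ residues, the columns are assumed to differ so that $x^{(0)} > 0$, and a residue with zero prior variance is skipped. All names and counts are hypothetical.

```python
import math

def reg_upper_gamma(a, x):
    """Q(a, x) for a positive integer a (exact finite series)."""
    term, total = 1.0, 1.0
    for k in range(1, a):
        term *= x / k
        total += term
    return math.exp(-x) * total

def upper_p_value(m_obs, n_obs):
    """Approximate upper estimate of the P-value for two count columns."""
    M, N, d = sum(m_obs), sum(n_obs), len(m_obs)
    a_q = d - 1                                   # first argument of Q
    c = [m * m / M + n * n / N for m, n in zip(m_obs, n_obs)]
    # Minimal value of each R_i^2, attained at f0 = sqrt(c_i / (M + N)).
    r0 = [max(math.sqrt((M + N) * ci) - (m + n), 0.0)
          for ci, m, n in zip(c, m_obs, n_obs)]
    x0 = sum(r0)                                  # assumed > 0 (columns differ)
    q_max = reg_upper_gamma(a_q, x0)
    # Tangent distance: Gamma(a, x0) * exp(x0) / x0**(a - 1).
    delta = q_max * math.factorial(a_q - 1) * math.exp(x0) / x0 ** (a_q - 1)
    estimate = q_max
    for m, n, ci, ri in zip(m_obs, n_obs, c, r0):
        a, b = M + N, m + n + ri + delta
        root = math.sqrt(max(b * b - a * ci, 0.0))
        lo, hi = (b - root) / a, (b + root) / a   # borders of the parallelepiped
        f_hat = (m + n) / (M + N)
        s2 = 0.5 * (((m - M * f_hat) / M) ** 2 + ((n - N * f_hat) / N) ** 2)
        if s2 == 0.0:
            continue  # assumption: a point-mass prior at f_hat contributes factor 1
        z = math.sqrt(2.0 * s2)
        estimate *= 0.5 * (math.erf((hi - f_hat) / z) - math.erf((lo - f_hat) / z))
    return estimate

similar = upper_p_value([6, 2, 1, 1], [8, 5, 4, 3])
different = upper_p_value([9, 1, 0, 0], [0, 0, 10, 10])
print(similar, different)
```

Each erf factor is at most 1, so the result never exceeds $Q_{\max}$; columns with very different compositions therefore receive a much smaller estimate than similar ones.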