Calculating relative abundance values

Let $P$ be an ancestor-to-descendant substitution pattern of length $L$:
\[ P = b_1 b_2 \ldots b_L \rightarrow b'_1 b'_2 \ldots b'_L \]
where $b_1, b_L \in [A, T, G, C]$ and all other $b_i \in [A, T, G, C, N]$. We can write each ancestor-descendant nucleotide pair as $B_i = b_i \rightarrow b'_i$. Then
\[ P = B_1 B_2 \ldots B_L \]
Given a set of ancestor-descendant alignments, the proportion of P is the fraction of ancestral
words that convert to the appropriate descendant sequence:
\[ \mathrm{pr}(P) = \frac{\text{Number of observed } b_1 b_2 \ldots b_L \rightarrow b'_1 b'_2 \ldots b'_L}{\text{Number of observed } b_1 b_2 \ldots b_L} \]
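As a rough illustration of the counting behind pr(P), the sketch below slides a window of width L along each aligned ancestor/descendant pair. It assumes ungapped alignments supplied as equal-length strings and treats N in a pattern as matching any base. The names `matches`, `proportion`, and `toy_alignments` are hypothetical, and the toy sequences are made up for illustration; none of them come from the original analysis.

```python
from typing import Iterable, Tuple


def matches(pattern: str, word: str) -> bool:
    """True if `word` fits `pattern`, where 'N' in the pattern matches any base."""
    return all(p == "N" or p == w for p, w in zip(pattern, word))


def proportion(anc_pat: str, desc_pat: str,
               alignments: Iterable[Tuple[str, str]]) -> float:
    """pr(P): fraction of ancestral words matching `anc_pat` whose aligned
    descendant word matches `desc_pat`."""
    if len(anc_pat) != len(desc_pat):
        raise ValueError("ancestral and descendant patterns must have equal length")
    width = len(anc_pat)
    ancestral_hits = 0
    converted_hits = 0
    for anc, desc in alignments:
        # Slide a window of the pattern's width along the aligned pair.
        for i in range(min(len(anc), len(desc)) - width + 1):
            if matches(anc_pat, anc[i:i + width]):
                ancestral_hits += 1
                if matches(desc_pat, desc[i:i + width]):
                    converted_hits += 1
    if ancestral_hits == 0:
        raise ValueError("the ancestral word was never observed")
    return converted_hits / ancestral_hits


# Made-up (ancestor, descendant) pairs, for illustration only.
toy_alignments = [("ACGT", "ATGT"), ("ACGA", "ACGA")]
print(proportion("C", "T", toy_alignments))      # 0.5
print(proportion("ANG", "ANG", toy_alignments))  # 1.0
```

In the first call the ancestral C is observed twice and becomes T once. In the second, the gapped ancestral word ANG is observed twice and both descendants still match ANG.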
The normal recursive method for calculating relative abundance is:
\[ \rho(P) = \begin{cases} \mathrm{pr}(P) & \text{if } L = 1 \\[4pt] \dfrac{\mathrm{pr}(P)}{\psi(P)} & \text{if } L > 1 \end{cases} \tag{1} \]
where $\psi(P)$ is the product of $\rho(s)$ over all elements $s$ of $S_P$, the set of all proper subpatterns of $P$:
\[ \psi(P) = \prod_{s \in S_P} \rho(s) \]
$S_P$ contains all gapped and ungapped subpatterns, with $N$ representing any base.
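One way to read the recursion of Eq. 1 in code is sketched below. It assumes, following the L = 3 expansion in the proof, that each subpattern corresponds to a non-empty proper subset of the informative (non-N) positions of P, with skipped interior positions read as N. The helper names (`proper_subsets`, `make_rho`) are hypothetical, `pr` is assumed to be supplied as a function of a set of positions, and the proportions in `toy_pr` are arbitrary numbers rather than real data.

```python
from functools import lru_cache
from itertools import combinations


def proper_subsets(positions):
    """Yield every non-empty proper subset of `positions` as a frozenset."""
    items = sorted(positions)
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)


def make_rho(pr):
    """Build rho from Eq. 1 for a proportion function `pr` (assumed given)."""

    @lru_cache(maxsize=None)
    def rho(positions):
        if len(positions) == 1:
            return pr(positions)
        # psi(P): product of rho over all proper subpatterns of P.
        psi = 1.0
        for sub in proper_subsets(positions):
            psi *= rho(sub)
        return pr(positions) / psi

    return rho


# Arbitrary illustrative proportions for a pattern B1 B2 B3 (not real data).
toy_pr = {
    frozenset({1}): 0.1, frozenset({2}): 0.2, frozenset({3}): 0.3,
    frozenset({1, 2}): 0.05, frozenset({2, 3}): 0.08, frozenset({1, 3}): 0.04,
    frozenset({1, 2, 3}): 0.02,
}
rho = make_rho(toy_pr.__getitem__)
print(rho(frozenset({1, 2, 3})))  # close to 0.75 for these toy values
```

The memoisation is only a convenience: the number of subpatterns grows exponentially with the number of informative positions, so each subpattern is evaluated once rather than repeatedly.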
We have proposed a different method of calculating relative abundance, which we refer to as the “seg algorithm.” If we let $G_P$ be the set of all full-length gapped subpatterns $s$ of $P$, we can define a new function $\gamma$:
\[ \gamma(P) = \prod_{s \in G_P} \rho(s) \]
The seg algorithm is:
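\[ \rho(P) = \begin{cases} \mathrm{pr}(P) & \text{if } L = 1 \\[4pt] \dfrac{\mathrm{pr}(P)}{\psi(P)} & \text{if } L = 2 \\[8pt] \dfrac{\mathrm{pr}(P)\,\mathrm{pr}(B_2 \ldots B_{L-1})}{\mathrm{pr}(B_1 \ldots B_{L-1})\,\mathrm{pr}(B_2 \ldots B_L)\,\gamma(P)} & \text{if } L > 2 \end{cases} \tag{2} \]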
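A minimal sketch of Eq. 2 follows, under stated assumptions: a pattern is represented by the set of its informative (non-N) positions, `pr` is a hypothetical function returning the proportion for such a set, and a gapped subpattern with only two informative positions (for example B1 N B3) is evaluated with the pr(P)/ψ(P) branch, as the L = 3 case of the proof does. The function name `rho_seg` and the toy proportions are illustrative and not taken from the original work.

```python
from itertools import combinations


def rho_seg(positions, pr):
    """Relative abundance via Eq. 2; `positions` are the non-N positions of P."""
    positions = frozenset(positions)
    ordered = sorted(positions)
    first, last = ordered[0], ordered[-1]
    interior = ordered[1:-1]
    if len(ordered) == 1:
        return pr(positions)
    if len(ordered) == 2:
        # Here psi(P) reduces to the product of the two single-pair proportions.
        return pr(positions) / (pr(frozenset({first})) * pr(frozenset({last})))
    # gamma(P): product of rho over full-length gapped subpatterns, i.e. both
    # end positions kept and at least one interior position replaced by N.
    gamma = 1.0
    for size in range(len(interior)):
        for kept in combinations(interior, size):
            gamma *= rho_seg(frozenset({first, last, *kept}), pr)
    numerator = pr(positions) * pr(frozenset(interior))
    denominator = pr(positions - {last}) * pr(positions - {first}) * gamma
    return numerator / denominator


# Arbitrary illustrative proportions for a pattern B1 B2 B3 (not real data).
toy_pr = {
    frozenset({1}): 0.1, frozenset({2}): 0.2, frozenset({3}): 0.3,
    frozenset({1, 2}): 0.05, frozenset({2, 3}): 0.08, frozenset({1, 3}): 0.04,
    frozenset({1, 2, 3}): 0.02,
}
print(rho_seg({1, 2, 3}, toy_pr.__getitem__))  # close to 0.75 for these toy values
```

Compared with the full recursion over every subpattern, only the subpatterns that keep both end positions are visited recursively; the remaining terms are replaced by three direct proportion lookups.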
The algorithms are the same for patterns of length 1 or 2 by definition. We can demonstrate by
mathematical induction that they are also equal for all patterns with L > 2.
Justification of the “seg algorithm”
Proof. Suppose that P is a substitution pattern with L = 3. Then from Equation 1, we have
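\begin{align*}
\rho(P) &= \frac{\mathrm{pr}(B_1 B_2 B_3)}{\psi(B_1 B_2 B_3)} \\[4pt]
&= \frac{\mathrm{pr}(B_1 B_2 B_3)}{\mathrm{pr}(B_1)\,\mathrm{pr}(B_2)\,\mathrm{pr}(B_3)\,\rho(B_1 B_2)\,\rho(B_1 N B_3)\,\rho(B_2 B_3)} \\[4pt]
&= \frac{\mathrm{pr}(B_1 B_2 B_3)}{\mathrm{pr}(B_1)\,\mathrm{pr}(B_2)\,\mathrm{pr}(B_3) \left[ \dfrac{\mathrm{pr}(B_1 B_2)}{\mathrm{pr}(B_1)\,\mathrm{pr}(B_2)} \right] \left[ \dfrac{\mathrm{pr}(B_1 N B_3)}{\mathrm{pr}(B_1)\,\mathrm{pr}(B_3)} \right] \left[ \dfrac{\mathrm{pr}(B_2 B_3)}{\mathrm{pr}(B_2)\,\mathrm{pr}(B_3)} \right]} \\[4pt]
&= \frac{\mathrm{pr}(B_1 B_2 B_3)\,\mathrm{pr}(B_1)\,\mathrm{pr}(B_2)\,\mathrm{pr}(B_3)}{\mathrm{pr}(B_1 B_2)\,\mathrm{pr}(B_2 B_3)\,\mathrm{pr}(B_1 N B_3)}
\end{align*}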
Similarly, using the same pattern P and Equation 2,
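\begin{align*}
\rho(P) &= \frac{\mathrm{pr}(B_1 B_2 B_3)\,\mathrm{pr}(B_2)}{\mathrm{pr}(B_1 B_2)\,\mathrm{pr}(B_2 B_3)\,\gamma(B_1 B_2 B_3)} \\[4pt]
&= \frac{\mathrm{pr}(B_1 B_2 B_3)\,\mathrm{pr}(B_2)}{\mathrm{pr}(B_1 B_2)\,\mathrm{pr}(B_2 B_3)\,\rho(B_1 N B_3)} \\[4pt]
&= \frac{\mathrm{pr}(B_1 B_2 B_3)\,\mathrm{pr}(B_2)}{\mathrm{pr}(B_1 B_2)\,\mathrm{pr}(B_2 B_3) \left[ \dfrac{\mathrm{pr}(B_1 N B_3)}{\mathrm{pr}(B_1)\,\mathrm{pr}(B_3)} \right]} \\[4pt]
&= \frac{\mathrm{pr}(B_1 B_2 B_3)\,\mathrm{pr}(B_1)\,\mathrm{pr}(B_2)\,\mathrm{pr}(B_3)}{\mathrm{pr}(B_1 B_2)\,\mathrm{pr}(B_2 B_3)\,\mathrm{pr}(B_1 N B_3)}
\end{align*}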
Thus, Equations 1 and 2 are equal for patterns with L = 3.
Inductive step. Suppose that Eq. 1 is equal to Eq. 2 for patterns of length $n > 2$. Then for a pattern $P = B_1 \ldots B_n$, combining the equations gives us the following inductive hypothesis:
\[ \rho(P) = \frac{\mathrm{pr}(P)}{\psi(P)} = \frac{\mathrm{pr}(P)\,\mathrm{pr}(B_2 \ldots B_{n-1})}{\mathrm{pr}(B_1 \ldots B_{n-1})\,\mathrm{pr}(B_2 \ldots B_n)\,\gamma(P)} \tag{3} \]
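Assuming that Eq. 3 holds for $P$, we want to prove that it also holds for a pattern $P^+ = B_1 \ldots B_{n+1}$ of length $n + 1$. Starting with the right side of Eq. 3, for $P^+$ we have:
\begin{align}
\rho(P^+) &= \frac{\mathrm{pr}(P^+)\,\mathrm{pr}(B_2 \ldots B_n)}{\mathrm{pr}(B_1 \ldots B_n)\,\mathrm{pr}(B_2 \ldots B_{n+1})\,\gamma(P^+)} \tag{4} \\[4pt]
&= \frac{\mathrm{pr}(P^+)\,\mathrm{pr}(B_2 \ldots B_n)}{\mathrm{pr}(P)\,\mathrm{pr}(B_2 \ldots B_{n+1})\,\gamma(P^+)} \notag
\end{align}
where the second line uses $\mathrm{pr}(B_1 \ldots B_n) = \mathrm{pr}(P)$.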
Solving Eq. 3 for $\gamma(P)$ gives
\[ \gamma(P) = \frac{\psi(P)\,\mathrm{pr}(B_2 \ldots B_{n-1})}{\mathrm{pr}(B_1 \ldots B_{n-1})\,\mathrm{pr}(B_2 \ldots B_n)} \tag{5} \]
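Then, starting from Eq. 4 and using the expression for $\gamma$ in Eq. 5, written for $P^+$, we obtain
\begin{align*}
\frac{\mathrm{pr}(P^+)\,\mathrm{pr}(B_2 \ldots B_n)}{\mathrm{pr}(P)\,\mathrm{pr}(B_2 \ldots B_{n+1})\,\gamma(P^+)}
&= \frac{\mathrm{pr}(P^+)\,\mathrm{pr}(B_2 \ldots B_n)}{\mathrm{pr}(P)\,\mathrm{pr}(B_2 \ldots B_{n+1}) \left[ \dfrac{\psi(P^+)\,\mathrm{pr}(B_2 \ldots B_n)}{\mathrm{pr}(B_1 \ldots B_n)\,\mathrm{pr}(B_2 \ldots B_{n+1})} \right]} \\[4pt]
&= \frac{\mathrm{pr}(P^+)\,\mathrm{pr}(B_1 \ldots B_n)}{\mathrm{pr}(P)\,\psi(P^+)} \\[4pt]
&= \frac{\mathrm{pr}(P^+)\,\mathrm{pr}(P)}{\mathrm{pr}(P)\,\psi(P^+)} \\[4pt]
&= \frac{\mathrm{pr}(P^+)}{\psi(P^+)}
\end{align*}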
We have shown that the two algorithms are equal for patterns of length 1, 2, and 3. We have also
shown by induction that if they are equivalent for patterns of length n > 2, then they must also
be equal for patterns of length n + 1. As such, we conclude that the algorithms are equivalent
for substitution patterns of all lengths.