
Comparing Classifiers for Nucleotide Pairs in Multiple Sequence Alignments
Nam-Phuong D. Nguyen
1 Introduction and Background
Computing a multiple sequence alignment (MSA) is a vital first step in many areas of computational biology. The resulting MSA provides a wealth of information, such as identifying homology between sequences, inferring phylogenetic trees, and finding proteins of similar function.
Errors in an MSA can hamper efforts to infer information from the sequences. One example of an error is when a pairwise homologous character between two sequences, known as a nucleotide (NT) pair, is present in the estimated MSA but not in the true alignment. This is known as a false positive (FP). Likewise, when the estimated MSA is missing an NT pair that exists in the true alignment, this is known as a false negative (FN). One metric for measuring the accuracy of an estimated MSA is the total number of NT pairs in the estimated MSA that do not appear in the true alignment divided by the total number of NT pairs in the estimated MSA. This metric is known as the sum-of-pairs false positive rate (sop-FP). The sop-FP ranges from 0, where all NT pairs in the estimated alignment exist in the true alignment, to 1, where no NT pair in the estimated alignment appears in the true alignment. However, with real biological data, the true evolutionary history of the set of sequences is unknown. Therefore, it would be advantageous to have a method to estimate the sop-FP of an MSA. This paper examines two different methods for classifying whether NT pairs in an estimated MSA are in the true alignment. An accurate classifier lets us estimate the sop-FP of an estimated MSA, as well as identify local regions that are reliable.
In 2008, Landan and Graur presented a method to determine the local reliability of an MSA by generating many alternative MSAs and scoring each NT pair in the estimated MSA by the fraction of alternate MSAs in which it appears [4]. This fraction is called the support value of an NT pair. Thus, NT pairs with a support value of 1.0 appear in all of the alternate MSAs, while NT pairs with a support value of 0.0 appear in none of them. From the support values, a classifier can be built by classifying all NT pairs with support value above some threshold as positive, and as negative otherwise. Landan and Graur generate the alternate set of MSAs using a method called Heads or Tails (HoT). However, the HoT method is computationally expensive, requiring 8(N − 3) alignments, where N is the number of sequences in the MSA.
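The support value itself is independent of how the alternate set is produced: given any collection of alternate alignments, each NT pair's support is simply the fraction of alternates that contain it. A minimal sketch, assuming alignments are represented as dicts mapping sequence name to gapped string (this representation and the helper names are illustrative, not from the paper):

```python
from itertools import combinations

def nt_pairs(alignment):
    """Extract the set of homologous nucleotide (NT) pairs from an alignment.

    `alignment` maps sequence name -> aligned string with '-' gaps. An NT
    pair records that residue x of sequence a sits in the same column as
    residue y of sequence b.
    """
    names = sorted(alignment)
    # Map each aligned column back to its ungapped residue index (None at gaps).
    residue_idx = {n: [None] * len(alignment[n]) for n in names}
    for n in names:
        k = 0
        for col, ch in enumerate(alignment[n]):
            if ch != '-':
                residue_idx[n][col] = k
                k += 1
    pairs = set()
    ncols = len(alignment[names[0]])
    for a, b in combinations(names, 2):
        for col in range(ncols):
            x, y = residue_idx[a][col], residue_idx[b][col]
            if x is not None and y is not None:
                pairs.add((a, x, b, y))
    return pairs

def support_values(estimated, alternates):
    """Support of each NT pair in `estimated`: the fraction of the
    alternate alignments in which the same pair appears."""
    est_pairs = nt_pairs(estimated)
    alt_sets = [nt_pairs(alt) for alt in alternates]
    return {p: sum(p in s for s in alt_sets) / len(alt_sets) for p in est_pairs}
```

With this scoring in hand, either the HoT alternates or the two cheaper alternate sets examined below can be plugged in unchanged.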
2 Method and Algorithm
This paper examines two different methods for generating the set of alternate alignments used to score the NT pairs with a support value, both of which require far fewer alignments than the HoT method. The first method, called the Multiple MSA method (MMSA), runs 6 different MSA methods on the set of unaligned sequences to generate an alternate set of MSAs. The methods are Prank [7], Probtree [8], ClustalW [9], Mafft [3], Muscle [1], and Opal [10]. The second method, called the SATé method, uses SATé-II [6], a simultaneous estimator of trees and alignments, to generate an alternate set of 25 MSAs. SATé-II is an iterative method that generates an alignment/tree pair at each iteration. The tree is scored with a maximum likelihood (ML) score, which is used as a stopping condition. The MSAs are generated by running SATé-II on an alignment and iterating until the ML score of the resulting tree decreases. At this point, the MSA corresponding to this tree is added to the alternate MSA set. The method continues to run SATé-II for 24 additional iterations, generating a total of 25 MSAs. This process can be thought of as a random walk among MSAs near the locally optimal ML tree.
The first step of this method is to compute the estimated reference MSA. As in the SATé method, we run SATé-II [6] on an unaligned set of sequences, generating an ML tree/MSA pair. We then use the ML tree to guide the next alignment, and the process repeats until the ML score of the current tree decreases. At that point the previous ML tree was a local optimum, so we take the MSA that corresponds to the previous tree as our estimated MSA. Each NT pair in the estimated MSA is scored using both the MMSA method and the SATé method. Finally, for a given threshold, each NT pair is classified as positive if its support value is at or above the threshold, and as negative otherwise.
Given a threshold T , a NT pair in the estimated alignment is considered a classified positive (CP) if it has support greater
than or equal to T , and considered a classified negative (CN) if it has support less than T . An NT pair is an actual positive
(AP) if it exists in the true alignment and is an actual negative (AN) otherwise. The set of TP is defined as CP ∩ AP , the
set of FP is defined as CP ∩ AN , the set of TN is defined as CN ∩ AN , and the set of FN is defined as CN ∩ AP . Two
measurements for the accuracy of this classification are precision and recall. Precision is the total number of TP pairs over the total number of CP pairs in the estimated alignment and is defined as |TP|/|CP|. Recall, also known as the true positive rate (TPR), is the total number of TP pairs over the total number of AP in the estimated alignment and is defined as |TP|/|AP|. Thus,
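Both definitions follow directly from the set intersections above and can be computed from a support-value map and the set of true pairs. A small sketch (the dict/set representation is an assumption carried over from the support-value computation):

```python
def precision_recall(support, true_pairs, threshold):
    """Precision = |TP|/|CP| and recall = |TP|/|AP| at a given threshold.

    CP is the set of NT pairs with support >= threshold; AP is the set of
    pairs in the estimated alignment that also exist in the true alignment.
    """
    cp = {p for p, s in support.items() if s >= threshold}
    ap = set(support) & true_pairs
    tp = cp & ap
    precision = len(tp) / len(cp) if cp else 1.0  # convention: 0/0 -> 1
    recall = len(tp) / len(ap) if ap else 1.0
    return precision, recall
```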
the precision and recall can be calculated for a given threshold T. When the threshold is low, more NT pairs will be classified as CP, which typically results in higher recall but lower precision. Likewise, when the threshold is high, fewer NT pairs will be classified as CP, which typically results in higher precision but lower recall. If classifier A dominates classifier B on the precision versus recall graph, then classifier A will have better precision than classifier B at the equivalent recall.
An equivalent graph is a receiver operating characteristic (ROC) curve. A ROC curve plots TPR versus the false positive rate (FPR). FPR is the proportion of FP pairs in the estimated alignment over all AN and is defined as |FP|/|AN|. Similar to the precision versus recall graph, if classifier A dominates classifier B on the ROC curve, then classifier A will have higher TPR and lower FPR than classifier B for any threshold T. However, the ROC curve has an additional advantage in that if we
calculate the area under the curve (AUC) of a ROC curve, this number gives the probability that a randomly picked AP will have a higher support score than a randomly picked AN [2]. The AUC is exactly this probability only when there is an infinite number of samples and the support values span the entire continuum. Since the number of support values is finite, and the trapezoidal rule is typically used to find the area under the ROC curve, the AUC scores will underestimate the true probability. However, the larger the AUC value, the more easily the classifier can discriminate between a pair of positive and negative examples.
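The trapezoidal AUC and its probabilistic reading can both be sketched directly; the Monte-Carlo estimate mirrors the random AP/AN pair-sampling test used in the Results section. A sketch under the same dict representation as above (helper names are illustrative; both AP and AN are assumed non-empty):

```python
import random

def roc_auc(support, true_pairs):
    """Area under the ROC curve via the trapezoidal rule, sweeping the
    threshold over every distinct support value."""
    ap = {p for p in support if p in true_pairs}
    an = set(support) - ap
    pts = [(0.0, 0.0), (1.0, 1.0)]
    for t in sorted(set(support.values()), reverse=True):
        cp = {p for p, s in support.items() if s >= t}
        pts.append((len(cp & an) / len(an), len(cp & ap) / len(ap)))
    pts.sort()
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def empirical_auc(support, true_pairs, n=40000, seed=0):
    """Monte-Carlo estimate of P(support(AP) > support(AN))."""
    rng = random.Random(seed)
    ap = [s for p, s in support.items() if p in true_pairs]
    an = [s for p, s in support.items() if p not in true_pairs]
    return sum(rng.choice(ap) > rng.choice(an) for _ in range(n)) / n
```

On perfectly separated support values the two estimates agree at 1.0; ties between AP and AN scores are what drive them apart, as discussed below.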
In order to test the classifiers, simulated data is used so that the true alignment is known. For this project, I used the simulated data generated by Kevin Liu for the SATé-I paper [5]. There are 7 datasets, each with 100 sequences and 1000 sites on average, with varying mutation rates, indel rates, and gap length distributions. Each dataset contains 20 replicates. The results for each dataset are averaged over the replicates. I present the results on 2 datasets; more results can be found in the Supplemental Material.
3 Results
We show the results for two model conditions: 100M1 and 100M3. 100M1 is a dataset with 100 sequences, 1000 sites
on average per sequence, 1E-4 gap probability, and tree height of 4. 100M3 is a dataset with 100 sequences, 1000 sites on
average per sequence, 5E-5 gap probability, and tree height of 15. The remaining results are similar and can be found in the
Supplemental Material. We pick 100M1 because it is easy to align: the estimated MSA from SATé has less than 0.04 sop-FP error. We pick 100M3 because it is more difficult to align: its estimated MSA from SATé has 0.27 sop-FP error.
We examined the classification of NT pairs based on a support threshold T. If an NT pair has support equal to or greater than the threshold T, it is classified as a CP. Likewise, if the NT pair has support below the threshold T, it is classified as a CN. The precision versus recall and TPR versus FPR can be measured for these classifications at each threshold. Fig. 1-2 show the precision versus recall for various thresholds of both methods on 100M1 and 100M3. Fig. 3-4 show the ROC curves for various thresholds of both methods on 100M1 and 100M3.
The AUC of the ROC curves is calculated using the trapezoidal rule. To test the accuracy of the AUC score, 40000 random AP and AN point pairs are selected, and the probability that the support value of the AP is higher than the support value of the AN is estimated. The results are listed in Table 1.
4 Discussion
The precision versus recall for 100M1 and 100M3 in Fig. 1 and Fig. 2 show that the MMSA method dominates the SATé method for both model conditions. The results suggest that on both easy and more difficult-to-align conditions, the MMSA method has better performance. At a given recall level, the MMSA method will have better precision than the SATé method.
Figure 1. Graph showing the precision versus recall for the MMSA method and SATé method. The
precision and recall are calculated for various thresholds T and plotted. The results are over 20
replicates for the model 100M1.
Figure 2. Graph showing the precision versus recall for the MMSA method and SATé method. The
precision and recall are calculated for various thresholds T and plotted. The results are over 20
replicates for the model 100M3.
Figure 3. Graph showing the ROC curves for the MMSA method and SATé method. The TPR and FPR
are calculated for various thresholds T and plotted. The results are over 20 replicates for the model
100M1.
Figure 4. Graph showing the ROC curves for the MMSA method and SATé method. The TPR and FPR
are calculated for various thresholds T and plotted. The results are over 20 replicates for the model
100M3.
Table 1. The AUC scores for the ROC curves generated by the MMSA method and the SATé method for the models 100M1 and 100M3. The probability that a randomly selected AP has greater support than a randomly selected AN is also reported.

Model   AUC of MMSA   P(AP support > AN support) for MMSA   AUC of SATé   P(AP support > AN support) for SATé
100M1   0.893         0.934                                 0.646         0.347
100M3   0.940         0.983                                 0.558         0.122
Both figures show that the MMSA method has 7 threshold levels, with each change resulting in a sizable difference in precision, recall, or both. This result greatly differs from the SATé method, where there are only 2 regions in which changing the threshold significantly changes the results. The first is changing the threshold from classifying everything as negative to not classifying everything as negative. In Fig. 2, the SATé method shows a jump from a recall of 0.0 and a precision of 1.0 to a recall of 0.83 and a precision of 0.92. The second region is classifying everything as positive. In Fig. 2, the SATé method shows a jump from a recall of 0.73 and a precision of 1.0 to a recall of 0.78 and a precision of 0.97. Between recalls of 0.73 and 0.83, small changes in the threshold result in small changes in precision. These results suggest that the SATé method for generating alignments has little variability: the alignments returned by the random walk do not change very much. This result suggests that using different aligners to generate the set of alternate MSAs will result in better support values for classification. Thus, if two or more different methods contain the same NT pair, it is likely that the NT pair is in the true alignment.
Similarly, the ROC curves for 100M1 and 100M3 in Fig. 3 and Fig. 4 show that the MMSA method dominates the SATé method for both model conditions. It is easier to see on the ROC curves how the threshold affects the TPR and FPR. In both figures, the MMSA method is much closer to the top left region, whereas the SATé method is much closer to the red line. In addition, Table 1 shows that the MMSA method results in curves with much higher AUC. This suggests that the MMSA
method can better discriminate between a pair of positive and negative examples. I tested this result by randomly picking pairs of positive and negative examples and recording whether the positive example had the higher support value. The results of this test show that the MMSA method usually ranks the positive example above the negative one: 0.93 of the time for 100M1 and 0.98 of the time for 100M3, close to the AUC scores of 0.89 and 0.94, respectively. However, the results for the SATé method are much worse. The AUC scores for 100M1 and 100M3 using the SATé method are 0.65 and 0.56, respectively, yet the SATé method was only able to pick the positive example 0.35 and 0.12 of the time, as the negative example had support equal to or greater than the positive example the large majority of the time. How is this possible, given that the SATé method could do better by always reversing its decision? Examining the distribution of the support scores shows that the large majority of the NT pairs have support of either 0.0 or 1.0. The probability that a positive example has support strictly greater than a negative example is very low if nearly all NT pairs have support of 1.0, since most pairs will have the same support value. If we adjust the procedure so that whenever the two examples have the same support value, a coin flip is used to predict which is positive and which is negative, the probability of correctly classifying the positive example is much closer to the AUC scores, 0.65 and 0.57 for 100M1 and 100M3, respectively.
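The tie-corrected comparison just described awards a half win when the two sampled support values are equal, which recovers the standard AUC interpretation in the presence of ties. A sketch (the function name is illustrative):

```python
import random

def tie_corrected_auc(ap_support, an_support, n=40000, seed=0):
    """Estimate P(AP > AN) + 0.5 * P(AP == AN) by sampling.

    When the two sampled support values tie, a fair coin decides, so each
    tie contributes 0.5 to the win count on average.
    """
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(n):
        a, b = rng.choice(ap_support), rng.choice(an_support)
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5  # tie: coin flip is correct half the time
    return wins / n
```

This adjustment is what closes the gap between the SATé method's strict-inequality probabilities (0.35 and 0.12) and its AUC scores.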
The results suggest that one should use the MMSA method for classifying NT pairs. One immediate application of this method is to estimate the errors at a site. First, for each site, find the number of estimated TP at the site over all NT pairs at the site. If this value is below some threshold, this suggests that the site is unreliable, and we can throw out this site. We repeat this over all sites to come up with a masked MSA. We can then run a tree reconstruction method on the masked alignment and examine whether removing sites with high estimated errors improves the accuracy of the reconstructed tree.
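The masking step above can be sketched as a column filter, assuming the support values have already been grouped by column (the `site_support` structure and both thresholds here are hypothetical, for illustration only):

```python
def mask_unreliable_sites(alignment, site_support, min_fraction, support_threshold):
    """Return a masked alignment keeping only reliable columns.

    `site_support[col]` is the list of support values of all NT pairs at
    column `col`. A column is kept when the fraction of its pairs with
    support >= support_threshold is at least min_fraction.
    """
    keep = [
        col for col, vals in enumerate(site_support)
        if vals and sum(s >= support_threshold for s in vals) / len(vals) >= min_fraction
    ]
    # Rebuild each sequence from the surviving columns only.
    return {name: ''.join(seq[c] for c in keep) for name, seq in alignment.items()}
```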
Another possible idea is to assign weights to each of the MSA methods used for calculating the support values. Then, if a method has a higher weight, NT pairs that show up in that method's alignment will have more effect on the support value.
References
[1] R. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32(5):1792–1797, 2004.
[2] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology,
143:29–36, 1982.
[3] K. Katoh, K.-i. Kuma, H. Toh, and T. Miyata. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucl. Acids Res., 33(2):511–518, 2005.
[4] G. Landan and D. Graur. Local reliability measures from sets of co-optimal multiple sequence alignments. Pacific Symposium on Biocomputing, 13:15–24, 2008.
[5] K. Liu, S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. Rapid and accurate large-scale coestimation of sequence alignments
and phylogenetic trees. Science, 324(19):1561–1564, 2009.
[6] K. Liu, T. Warnow, M. Holder, S. Nelesen, J. Yu, A. Stamatakis, and C. R. Linder. SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Submitted to Proceedings of the National Academy of Sciences of the United States of America.
[7] A. Löytynoja and N. Goldman. An algorithm for progressive multiple alignment of sequences with insertions. Proceedings of the National Academy of Sciences of the United States of America, 102(30):10557–10562, 2005.
[8] S. Nelesen, K. Liu, D. Zhao, R. Linder, and T. Warnow. The effect of the guide tree on multiple sequence alignments and subsequent
phylogenetic analyses. Pac Symp Biocomput, pages 25–36, 2008.
[9] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22(22):4673–4680, 1994.
[10] T. J. Wheeler and J. D. Kececioglu. Multiple alignment by aligning alignments. Bioinformatics, 23(13):i559–568, 2007.