Vol. 12 no. 2 1996, Pages 95-107

Hidden Markov models for sequence analysis: extension and analysis of the basic method

Richard Hughey and Anders Krogh(1,2)

Computer Engineering, University of California, Santa Cruz, CA 95064, USA and (1)NORDITA, Blegdamsvej 17, 2100 Copenhagen, Denmark. (2)Present address: The Sanger Centre, Hinxton, Cambridge CB10 1RQ, UK

© Oxford University Press

Abstract

Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectation-maximization training procedure is relatively straightforward. In this paper, we review the mathematical extensions and heuristics that move the method from the theoretical to the practical. We then experimentally analyze the effectiveness of model regularization, dynamic model modification and optimization strategies. Finally, it is demonstrated on the SH2 domain how a domain can be found in unaligned sequences using a special model type. The experimental work was completed with the aid of the Sequence Alignment and Modeling software suite.

Introduction

Since their introduction to the computational biology community (Haussler et al., 1993; Krogh et al., 1994a), hidden Markov models (HMMs) have gained increasing acceptance as a means of sequence modeling, multiple alignment and profiling (Baldi et al., 1994; Eddy, 1995; Eddy et al., 1995). An HMM is a statistical model similar to a profile (Gribskov et al., 1987; Bucher and Bairoch, 1994), but it can also be estimated from unaligned sequences.

A linear HMM for a family of nucleotide or amino acid sequences is a set of positions that roughly correspond to columns in a multiple alignment. Typically, each position has three states: match, insert and delete, corresponding to matching a single character to a column of the multiple alignment, inserting characters not modeled by the HMM, or skipping over that column and proceeding to the next. Probabilities or costs (negative log-probabilities) are associated with each character emission and each transition between states. The alignment of a sequence is simply the highest-probability, or lowest-cost, path through the HMM.

The primary advantage of HMMs over, for example, profiles is that they can be automatically estimated, or trained, from unaligned sequences. The most basic mathematical method of training an HMM does not work well without the addition of several mathematical extensions and heuristics. The body of this paper is a study of three of the most important extensions to the basic model presented previously (Krogh et al., 1994a): regularizers, dynamic model modification and 'free insertion modules'. After reviewing these in turn, we present an experimental evaluation of the utility of these and other extensions.

System and methods

The Sequence Alignment and Modeling (SAM) software suite is written in ANSI C for Unix workstations. Additionally, we have implemented the inner loop in MPL for the MasPar family of parallel computers. The system has been tested on DEC Alpha, DECstation, IBM RS/6000, SGI and Sun SPARC workstations. The research described herein was performed on a Sun SPARC 30/10 and a MasPar MP-2204 connected to a DEC Alpha 5000/300. Newly revised and documented workstation and MasPar versions of SAM are available from our WWW site: http://www.cse.ucsc.edu/research/compbio/sam.html.
We recently created a server for using SAM to align and model sequences, which is accessible from the SAM WWW page. Questions concerning the software and its distribution can be mailed to [email protected].

Algorithm

This section discusses the basic theory and use of HMMs. Although most aspects of our linear HMMs have been described previously (Krogh et al., 1994a), this section re-examines several operational features as the foundation for the experimental evaluation that follows.

The hidden Markov model

This section briefly describes the structure and basic theory of the type of HMM used in this work. For a general introduction to HMMs, see Rabiner (1989). For additional details on our HMMs, see Krogh et al. (1994a).

Fig. 1. A linear hidden Markov model.

The basic model topology is shown in Figure 1. Each position, or module, in the model has three states. A state shown as a rectangular box is a match state that models the distribution of letters in the corresponding column of an alignment. A state shown by a diamond-shaped box models insertions of random letters between two alignment positions, and a state shown by a circle models a deletion, corresponding to a gap in an alignment. States of neighboring positions are connected, as shown by lines. For each of these lines there is an associated 'transition probability', which is the probability of going from one state to the other. This model topology was chosen so as to feature insertions and deletions similar to biological sequence alignment techniques based on affine gap penalties. Other model topologies can certainly be considered (see, for example, Baldi et al., 1994; Krogh et al., 1994b), but for reasons of efficiency the software described here is limited to the topology shown in Figure 1. Gibbs sampling is an alternative approach that does not allow arbitrary gaps within aligned blocks (Lawrence et al., 1993).

Alignment of a sequence to a model means that each letter in the sequence is associated with a match or insert state in the model. Two five-character sequences, A and B, are shown in a four-state model in Figure 2, along with the corresponding alignment between the sequences. One can specify such an alignment by giving the corresponding sequence of states, with the restriction that the transition lines in the figure must be followed. For example, matching a letter to the first match state (m_1) and the next letter to the third match state (m_3) can only be done by using the intermediate delete state (d_2), so that part of the alignment can be written as m_1 d_2 m_3. In HMM terminology such an alignment of a sequence to a model is called a path through the model. A sequence can be aligned to a model in many different ways, just as in sequence-to-sequence or sequence-to-profile alignments.

Fig. 2. An example of two sequences whose characters are matched to states in an HMM, and the corresponding alignment.

Each alignment of a given sequence to the model can be scored by using the probability parameters of the model. In the following, a sequence of L letters is denoted x_1 ... x_L. The path specifying the particular alignment is denoted q_1 ... q_L, where q_k is the kth state in the path. To simplify notation (as compared to Krogh et al., 1994a), only the match and insert states are included in the path; for two states not directly connected, delete states are filled in as needed. Because the first state in a path is always the initial state m_0, and the last one is always the end state m_{M+1}, these two trivial states are also not included in the state sequence.
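To make the bookkeeping concrete, the fragment below shows one way the parameters of such a linear model could be held in memory. It is an illustrative Python sketch only; the class and field names are ours and do not correspond to SAM's internal C data structures.

```python
# Illustrative sketch only: one way to hold the parameters of a linear HMM
# (match/insert/delete states per position).  Names are ours, not SAM's.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"          # protein alphabet assumed here

@dataclass
class Module:
    """One model position: emission tables plus transitions to the next position."""
    match_emit: Dict[str, float]                       # P(x | m_k)
    insert_emit: Dict[str, float]                      # P(x | i_k)
    trans: Dict[Tuple[str, str], float] = field(default_factory=dict)
    # e.g. trans[("m", "d")] = T(d_{k+1} | m_k), trans[("i", "i")] = self-insertion

@dataclass
class LinearHMM:
    modules: List[Module]                              # positions 1..M

    @property
    def length(self) -> int:                           # M, the model length
        return len(self.modules)

uniform = {x: 1.0 / len(ALPHABET) for x in ALPHABET}
model = LinearHMM([Module(match_emit=dict(uniform), insert_emit=dict(uniform))
                   for _ in range(10)])                # a blank 10-position model
```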
The number of match positions in the model is called M, and is referred to as the length of the model. The probability of an alignment of a sequence to a model is the product of all the probabilities met in the path:

    Prob(x_1 ... x_L, q_1 ... q_L | model) = ∏_{i=1..L+1} T(q_i | q_{i-1}) ∏_{i=1..L} P(x_i | q_i)    (1)

with the conventions q_0 = m_0 and q_{L+1} = m_{M+1}, where T(q_i | q_{i-1}) is the probability associated with the transition between states q_{i-1} and q_i. The probability of matching letter x to state q is called P(x | q). If two states in the path are not directly connected by a transition, the transition probability has to be calculated by filling in the appropriate delete states. For example, the probability of a transition from the match state at position 1 (m_1) to the insert state at position 4 (i_4), passing through delete states d_2, d_3 and d_4, is

    T(i_4 | m_1) = T(d_2 | m_1) T(d_3 | d_2) T(d_4 | d_3) T(i_4 | d_4)    (2)

By taking the negative logarithm of equation (1) it can be turned into a more familiar type of alignment 'cost':

    -log Prob(x_1 ... x_L, q_1 ... q_L | model) = -Σ_{i=1..L+1} log T(q_i | q_{i-1}) - Σ_{i=1..L} log P(x_i | q_i)    (3)

Now the numbers can be interpreted as 'penalties'. A term like -log T(d_{i+1} | m_i) is a positive penalty for initiating a deletion after position i, and -log T(d_{i+1} | d_i) is the penalty for continuing a deletion at position i. Some of the other terms have similar interpretations.
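As a concrete illustration of equations (1)-(3), the sketch below scores one explicit path through a toy model. The probability tables are made-up numbers and the function name is ours; it is not SAM code, and it assumes the delete states have already been folded into the transition table as in equation (2).

```python
import math

# Minimal sketch of equations (1)-(3): score one explicit path through a model.
def path_cost(sequence, path, emit, trans):
    """Return -log Prob(sequence, path | model) as in equation (3).

    `path` lists the match/insert state visited by each letter; the initial
    state 'm0' and end state 'mEnd' are added implicitly.  trans[(a, b)] is
    T(b | a) with delete states already filled in; emit[(state, x)] is P(x | state).
    """
    states = ["m0"] + list(path) + ["mEnd"]
    cost = 0.0
    for prev, cur in zip(states, states[1:]):
        cost -= math.log(trans[(prev, cur)])          # -log T(q_i | q_{i-1})
    for state, x in zip(path, sequence):
        cost -= math.log(emit[(state, x)])            # -log P(x_i | q_i)
    return cost

# Toy two-position model and a two-letter sequence aligned as m1, m2.
trans = {("m0", "m1"): 0.9, ("m1", "m2"): 0.8, ("m2", "mEnd"): 0.9}
emit = {("m1", "A"): 0.5, ("m2", "C"): 0.4}
print(path_cost("AC", ["m1", "m2"], emit, trans))     # about 2.04
```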
Estimation of the model

One great advantage of HMMs is that they can be estimated from sequences, without having to align the sequences first. The sequences used to estimate or train the model are called the training sequences, and any reserved sequences used to evaluate the model are called the test sequences. The model estimation is done with the forward-backward algorithm, also known as the Baum-Welch algorithm, which is described in Rabiner (1989). It is an iterative algorithm that maximizes the likelihood of the training sequences. We have modified the algorithm to implement maximum a posteriori (MAP) estimation.

In the basic algorithm the expected number of times a certain transition or letter emission is used by the training sequences is calculated (Rabiner, 1989). For a letter emission probability P(x|q) this is called n(x|q). Then the re-estimated parameter is

    P(x|q) = n(x|q) / Σ_{x'} n(x'|q)    (4)

where the sum is over all the characters x' in the alphabet, such as the 20 amino acids. The re-estimation formula for the transition probabilities is similar. To begin with, all the n(x|q) values are found from an arbitrary initial model. Next, a new set of parameters is found using equation (4) and the similar formula for the transition probabilities. Then, using the new model, the re-estimation procedure is repeated until the change in the parameters is insignificant.

Prior distribution and regularization. When estimating a model from data, there is always the possibility that the model will overfit the data: it models the training sequences very well, but will not fit other sequences from the same family. This is particularly likely if there are few training sequences. With only one training sequence, a perfect model would have a match state for each residue in which that residue would have unity probability and all other residues zero probability. Such a model would give zero probability to all sequences other than the training sequence! For larger sets of training data, similar problems are still present but not as extreme. To avoid this problem a regularizer can be used.

Regularization is a method of avoiding overfitting the data, and in Bayesian statistics it is tightly connected with the so-called prior distribution. The prior distribution is a distribution over the model parameters; for the HMM it is a probability distribution over probability distributions. The prior contains our prior beliefs about the parameters of the model. In our work we use Dirichlet distributions for the prior (Berger, 1985; Santner and Duffy, 1989). For a discrete probability distribution p_1, ..., p_M a Dirichlet distribution is described by M parameters α_1, ..., α_M. The mean of the Dirichlet distribution is p̄_i = α_i/α_0, where α_0 = Σ_i α_i, and the variance is inversely proportional to α_0. If α_0 is large, it is highly probable that p_i ≈ α_i/α_0.

For each probability distribution in the HMM, a Dirichlet distribution is used as a prior, and we call the corresponding α values α(x|q) for the distribution over characters, and α(q_i|q_j) for the transition probabilities. The re-estimation formula corresponding to (4) is

    P(x|m_k) = [n(x|m_k) + α(x|m_k)] / Σ_{x'} [n(x'|m_k) + α(x'|m_k)]    (5)

The set of all the αs is called the regularizer. We call this the MAP estimate, although the correct MAP formula has α(x|m_k) - 1 instead of α(x|m_k). Equation (5) is really a least-squares estimate (see Krogh et al., 1994a), but one can also view it as a MAP estimate with redefined αs. Even without the theoretical justification, this formula is appealing. For each parameter in the model a number (α) is added to the corresponding n before the new parameter is found. If n is small compared to α, as when there is little training data, the regularizer essentially determines the parameter, and

    P(x|m_k) ≈ α(x|m_k) / Σ_{x'} α(x'|m_k)    (6)

(This is the average of the Dirichlet distribution.) The size of the sum Σ_{x'} α(x'|m_k) determines the strength of the regularization, or the strength of the prior beliefs. If this sum is small, say 1, just a few sequences will be enough to 'take over' the model. On the other hand, if the sum is large, say 1000, then of the order of 1000 training sequences will be needed to make the model differ significantly from the prior beliefs.

This type of regularization is convenient when modeling biological sequences because we have prior knowledge from conventional alignment methods. For example, in both pairwise and multiple alignments, the penalty for starting a deletion is usually larger than for continuing a deletion. This 'prior belief' that match-to-delete transitions are less probable than delete-to-delete transitions can easily be built into an HMM by setting the corresponding α values in the regularizer accordingly. The SAM system can also use the more complicated Dirichlet mixture priors for regularization. These priors include several different distributions for some number of different types of columns, such as hydrophobic and hydrophilic positions. For more information on these distributions, refer to our previous work (Brown et al., 1993).

By normalizing the regularizer as in equation (6), a valid model is obtained. Since this model represents prior beliefs, it is natural to use it as the initial model for the estimation process. Usually noise is added to this initial model for reasons discussed next.
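Before turning to the noise heuristics, the fragment below illustrates the regularized re-estimation of equations (4)-(6) for a single match-state character distribution. The counts would normally come from the forward-backward pass; here they are supplied directly, and the function and example numbers are ours rather than SAM output.

```python
# Sketch of the MAP-style re-estimation of equation (5): expected counts n(x|m_k)
# plus regularizer pseudocounts alpha(x|m_k), renormalized to a distribution.
def reestimate_match(counts, alpha):
    """counts[x] = n(x|m_k); alpha[x] = regularizer value for character x."""
    total = sum(counts.get(x, 0.0) + a for x, a in alpha.items())
    return {x: (counts.get(x, 0.0) + a) / total for x, a in alpha.items()}

# With no data the estimate collapses to the normalized regularizer, equation (6);
# with many observations the counts dominate and equation (4) is recovered.
alpha = {"A": 2.0, "C": 1.0, "G": 1.0, "T": 1.0}                   # made-up DNA-style example
print(reestimate_match({}, alpha))                                  # prior mean: A gets 0.4
print(reestimate_match({"A": 3, "C": 90, "G": 4, "T": 3}, alpha))   # data dominates: C about 0.87
```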
Problems with local maxima. A serious problem with any hill-climbing optimization technique is that it often ends up in a local maximum. The same is true for the forward-backward procedure used to estimate HMMs by maximizing the likelihood (or the a posteriori model probability). In fact, it will almost always end up in a local maximum. To deal with this problem, we start the algorithm several times from different initial models. The resulting models then represent different local maxima, and we pick the one with the highest likelihood. The different initial models are obtained by taking the normalized regularizer and adding noise to all the parameters.

The noise is added to a model in the following way. A number R of sequences is generated from the regularizer model, stepping from state to state and generating characters according to the regularizer's transition and match distributions. Given a noise level N_0, each of these R paths through the model is added to the counts with a weight of N_0/R before MAP re-estimation. By default, R is set to 50 random sequences.

One can also add noise to the models during their estimation and decrease the noise level gradually, in a technique similar to the general optimization method called simulated annealing (Kirkpatrick and Vecchi, 1983). The initial level of the noise in this annealing process is called N_0, as above. During the estimation process the annealing noise is decreased at a rate determined by r. We have tested two methods for decreasing the noise:

Linearly: if r > 1, the noise is decreased linearly to zero in r iterations, so in iteration i

    N_i = N_0 (1 - i/r)  for i < r,  and  N_i = 0  for i ≥ r

Exponentially: if r < 1, the noise is decreased exponentially by multiplying the noise by r in each iteration:

    N_i = N_0 r^i

An alternative and very elegant way of simulated annealing is described in Eddy (1995).

Multiple sequence alignment

The probability (1) or the score (3) can (in principle) be calculated for all possible alignments to a model, and thus the most probable (i.e. the best) can be found. There exists a dynamic programming technique, called the Viterbi algorithm (Rabiner, 1989), that can find the best alignment and its probability without going through all the possible alignments. It is this best alignment to the model that is used to produce multiple alignments of a set of sequences. For each of the sequences the alignment to the model is found. The columns of the multiple alignment are then determined by the match states. All the amino acids matched to a particular match state are placed under each other to form the columns of the alignment. For sequences that do not have a match to a certain match state, a gap character is added (Figure 2).

Searching

By summing the probabilities of all the different alignments of a sequence to a model, one can calculate the total probability of the sequence given that model:

    Prob(x_1 ... x_L | model) = Σ_{q_1 ... q_L} Prob(x_1 ... x_L, q_1 ... q_L | model)    (7)

where the sum is over all possible alignments (paths) q_1 ... q_L, and the probability in the sum is given by equation (1). This probability can be calculated efficiently, without having to consider explicitly all the possible alignments, by the forward algorithm (Rabiner, 1989). The negative logarithm of this probability is called the negative log-likelihood score:

    NLL(x_1 ... x_L | model) = -log Prob(x_1 ... x_L | model)    (8)

Any sequence can be compared to a model by calculating this NLL score. For sequences of equal length the NLL scores measure how 'far' they are from the model, and can be used to select sequences that are from the same family.
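The sketch below shows a generic forward recursion of the kind used to compute equations (7) and (8). To stay short it uses a fully specified toy HMM and omits the delete states, begin/end states and flanking modules of the linear model, so it illustrates the idea rather than SAM's algorithm; every name in it is ours.

```python
import math

# Generic forward pass (Rabiner, 1989) returning the NLL score of equation (8).
# The two-state model below is made up; linear profile HMMs add delete states
# and begin/end states, which are omitted here for brevity.
def nll_score(seq, states, start, trans, emit):
    """-log Prob(seq | model), summing over all paths (equation 7)."""
    forward = {s: start[s] * emit[s][seq[0]] for s in states}     # first column
    for x in seq[1:]:
        forward = {s: sum(forward[r] * trans[r][s] for r in states) * emit[s][x]
                   for s in states}
    return -math.log(sum(forward.values()))

states = ["m", "i"]
start = {"m": 0.9, "i": 0.1}
trans = {"m": {"m": 0.8, "i": 0.2}, "i": {"m": 0.5, "i": 0.5}}
emit = {"m": {"A": 0.7, "C": 0.3}, "i": {"A": 0.25, "C": 0.75}}
print(nll_score("ACA", states, start, trans, emit))
```

In practice the recursion is carried out with scaling or in log space to avoid numerical underflow on long sequences.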
However, the NLL score has a strong dependence on sequence length and model length (see Figure 3). One means of overcoming this length bias is to use Z-scores, i.e. the number of standard deviations each NLL is away from the average NLL of sequences of the same length which are not part of the family being modeled, or do not contain the motif being modeled.

Fig. 3. NLL scores versus sequence length for a search of SwissProt using a model of the SH2 domain described in the text. The + symbols show the SH2-containing sequences (the training set), and dots show the ~30 000 remaining sequences. All sequences containing the letter 'X' are excluded.

When searching a database like SwissProt (Bairoch and Boeckmann, 1994) with an HMM, the smooth average and the Z-scores are calculated as follows. For a fixed sequence length we assume that the NLL scores are distributed as a normal distribution, with some outliers representing the sequences in the modeled family. The smooth average should be the average of the normal distribution, and it is found by iteratively removing the outliers:

1. For k larger than or equal to the minimum sequence length, find the minimum length e_k ≥ k such that the length interval [k, e_k] contains at least K sequences. We usually use K = 1000. For all such intervals, calculate the average sequence length t_k and the average NLL score.

2. For any integer length l the smoothed average NLL score is found by linear interpolation of the average NLLs corresponding to the closest t_k ≤ l and t_{k+1}. For l either smaller than the smallest t_k or larger than the largest t_k, linear extrapolation from the two closest points is used.

3. For each interval the standard deviation of the NLL scores from the averages is found. For any integer length the smoothed standard deviation is found by linear interpolation or extrapolation as above.

4. All sequences with an NLL score more than a certain number of standard deviations (usually 4) from the smooth average are considered outliers and are excluded in the next iteration.

5. If the excluded sequences are identical to the ones excluded in the previous iteration, the process is stopped. Otherwise it is repeated, unless the maximum number of iterations has been performed.

This procedure often produces excellent results on a large database like SwissProt, but there is no guarantee that it works. It is easy to detect when it is not working, because the sequences in the family, such as the training sequences, have low Z-scores. In this case, the training sequences and other obvious outliers can be removed by hand, and the above process repeated. This method always yields good results.
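A compressed sketch of this iterative outlier-removal loop is shown below. It bins sequences by length rank instead of building the overlapping [k, e_k] intervals exactly, and it uses numpy's clamping interpolation rather than linear extrapolation at the ends, so it should be read as an approximation of the published procedure; all function and variable names are ours.

```python
import numpy as np

# Simplified sketch of the smoothed-average / Z-score procedure described above.
def z_scores(lengths, nlls, K=1000, cutoff=4.0, max_iter=10):
    lengths, nlls = np.asarray(lengths, float), np.asarray(nlls, float)
    keep = np.ones(len(nlls), dtype=bool)
    for _ in range(max_iter):
        order = np.argsort(lengths[keep])
        L, S = lengths[keep][order], nlls[keep][order]
        # Step 1 (approximate): average length and NLL in consecutive bins of ~K sequences.
        bins = [slice(i, min(i + K, len(L))) for i in range(0, len(L), K)]
        t = np.array([L[b].mean() for b in bins])
        mean = np.array([S[b].mean() for b in bins])
        std = np.array([S[b].std() for b in bins])
        # Steps 2-3: smoothed mean and deviation by interpolation over length.
        smooth_mean = np.interp(lengths, t, mean)     # note: clamps at the ends
        smooth_std = np.interp(lengths, t, std)
        z = (smooth_mean - nlls) / smooth_std         # low NLL => high Z-score
        new_keep = np.abs(z) < cutoff                 # step 4: drop outliers
        if np.array_equal(new_keep, keep):            # step 5: converged
            break
        keep = new_keep
    return z
```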
In some sequences there are unknown residues that are indicated by special characters. A completely unknown residue is represented by the letter X in proteins and by N in DNA and RNA. For proteins the letters B (meaning amino acid N or D) and Z (Q or E) are also taken into account. For DNA and RNA the letters R for purine and Y for pyrimidine are recognized. All other letters that are not part of the sequence alphabet or equal to one of these wildcard characters are taken to be unknown, i.e. changed to X or N depending on the sequence type. The probability of a wildcard character in a state of the HMM is set equal to the maximum probability of all the letters it represents. This has the unfortunate side effect that sequences with many unknowns automatically receive a large probability, and these sequences have to be inspected separately. Another solution would be to set the probability equal to the average probability of the letters the wildcard represents, but then the opposite problem might occur. This can also be chosen as an option in SAM.

Model surgery

It is often the case that during the course of learning, some match states in the model are used by few sequences, while some insertion states are used by many sequences. Model surgery is a means of dynamically adjusting the model during training. The basic operation of surgery is to delete modules with unused match states and to insert modules in place of overused insert states. In the default case, any match state used by fewer than half the sequences is removed, forcing those sequences to use an insert state or to change their alignment to the model significantly. Similarly, any insert state used by more than half the sequences is replaced with a number of match states approximating the average number of characters inserted by that insert state. After surgery the model is retrained to adjust to the new situation. This process can be continued until the model is stable.

For finer tuning of the surgery process, the user can adjust the fraction of sequences that must use a match state, the fraction of sequences that are allowed to use an insert state, and the scaling used when replacing an insert state with one or more match states. While the default 50% usage heuristic stabilizes after a few iterations of surgery, careless setting of these fine-tuning parameters can result in a model that oscillates, switching back and forth between match and insert states.

Modeling domains and motifs

Imagine that the same domain appears in several sequences. In that case, building an HMM of that domain and using it for searching or aligning requires a few extensions. We augment the HMM with insertion states on both ends that have no preference for which letters are inserted (i.e. they have uniform character distributions). The cost of aligning a sequence to such a model does not strongly depend on the position of the pattern in the sequence, and thus it will pick up the piece of the sequence that best fits the model. A new parameter then has to be set, namely the probability of making an insertion in these flanking insertion modules. If this probability is low, it becomes 'expensive' to use the flanking insertion states, so insertions in the main part of the model are encouraged and deletions discouraged, whereas a high insertion probability encourages the opposite behavior. The setting of this parameter is most important when estimating a model like this, because there is a danger of finding a model in which, for instance, only the delete states are used if it is too 'cheap' to make insertions in the flanking modules.

These flanking modules are called free insertion modules (FIMs), because the self-transition probability in the insert state is set to unity. Other values can, of course, be used. FIMs are implemented by using only the delete state and the insert state of a standard module. All the transitions from the previous module go into the delete state, which is ensured by setting the other probabilities to zero. From the delete state there is a transition to the insert state with the probability set to one. In the insert state all characters have the same probability, and the probability of insert-to-insert is set to one. From the delete and insert states there are transitions to the next module which have the same probabilities (delete-to-match and insert-to-match have the same probability, and so on). Note that the probabilities do not sum to one properly in such a module. A model with three FIMs is shown in Figure 4.

Fig. 4. A model with a FIM at both ends and one at module 4. The width of a line is proportional to the probability of the transition. This example uses the RNA alphabet, and the probabilities in the modules other than the FIMs are set according to our standard regularizer, which strongly favors the transitions from match state to match state and has a low probability of starting insertions or deletions.
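The fragment below builds the parameters of one such free insertion module in the spirit of the description above; the dictionary layout follows the earlier illustrative data structure, not SAM's internal representation, and the state labels are ours.

```python
# Sketch of a free insertion module (FIM): only the delete and insert states are
# used; everything funnels through the delete state into a self-looping insert
# state with a uniform character distribution.
def make_fim(alphabet, self_loop=1.0):
    uniform = {x: 1.0 / len(alphabet) for x in alphabet}
    trans = {
        ("d", "i"): 1.0,             # delete -> insert with probability one
        ("i", "i"): self_loop,       # 'free' self-insertion (unity by default)
        # Transitions into the next module: delete and insert leave with the
        # same probabilities, so the rows need not sum to one, as noted above.
        ("d", "next_m"): 1.0, ("i", "next_m"): 1.0,
    }
    return {"match_emit": None, "insert_emit": uniform, "trans": trans}

fim = make_fim("ACGU")               # RNA alphabet, as in Figure 4
```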
These special modules can be used to learn, align or discriminate domains that occur once per sequence. Typically, FIMs are used at the beginning and the end of a model to allow an arbitrary number of insertions at either end. To train a model to find several (different) domains, one can also add FIMs at other positions in the model. Note, however, that only domains always occurring in the same order and the same number of times can be modeled this way. To model domains that come in a different order or in variable numbers, a model with backward transitions is required, which is something we might add to SAM at a later stage.

There are two different ways to model subsequences. The first method is to cut out the subsequences from their host molecules, build a model from these, and then add a FIM at both ends afterwards. The second method is to use the full sequences and train with the FIMs. The second method is generally easier, but also slower in terms of CPU time, especially if the sequences are much longer than the subsequences. In Krogh et al. (1994a) the first method was used to model the protein kinase catalytic domain and the EF-hand domain; in this paper the second method will be demonstrated on the SH2 domain.

In Lawrence and Reilly (1990) and Lawrence et al. (1993), two related methods of automatically finding common patterns in a set of sequences are described. Those papers deal only with gap-less alignments, i.e. patterns without insertions or deletions. The method described here can be viewed as an extension to alignments with gaps.

Additional features

Several additional features also aid the learning performance of the SAM system. First, SAM performs the first training pass (before any surgery occurs) on multiple models. The best of those models is selected for the remaining surgery and training iterations. Second, SAM also supports Viterbi training, in which the model is trained according to the best path of each sequence through the model, rather than by the probability distributions over all possible paths. Although this is slightly faster, the results are generally not as good. Third, when models are derived from an alignment or an existing profile, training of a module's match table or transitions can be turned off, and the module can be insulated from the surgery procedure. This last feature was not used for the results of this paper, apart from the related protection of FIMs from surgery.
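Since Viterbi training uses only the single best path of each sequence, the earlier forward sketch changes in just one place: the sum over predecessor states becomes a maximum. A brief illustration, again with our own names and the same simplified toy model conventions:

```python
import math

# Viterbi-style scoring: identical to the forward sketch except that the sum over
# predecessor states is replaced by a max, giving the single best path used for
# multiple alignment and for Viterbi training.
def best_path_nll(seq, states, start, trans, emit):
    score = {s: start[s] * emit[s][seq[0]] for s in states}
    for x in seq[1:]:
        score = {s: max(score[r] * trans[r][s] for r in states) * emit[s][x]
                 for s in states}
    return -math.log(max(score.values()))
```

Recovering the alignment itself additionally requires keeping a back-pointer to the maximizing predecessor at each step.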
Implementation and performance

The inner loop of the algorithm is an O(n²) dynamic programming algorithm that calculates the forward and backward values for each of the three states at a given pair of model and sequence indices. The serial implementation of this algorithm is straightforward. For Cray vectorization, the two inner loops were modified to calculate counter-diagonals in order (i.e. (0,1), (1,0), (0,2), (1,1), (2,0), ...) to make effective use of the vector pipelines.

The MasPar parallel processor features a two-dimensional mesh of processing elements (PEs) and a global router (Nickolls, 1990). For large numbers of sequences, the best algorithm mapping can be to perform a complete model-sequence dynamic programming computation in each PE; unfortunately, the 64 kbytes of local memory per PE is too small for such coarse-grained parallelism. Instead, the array is treated as a collection of linear arrays of PEs, where each linear array contains one model. Thus, if the models are of length 100, 40 sequence-against-model calculations are performed at a time using 4000 of the 4096 processing elements.

The linear arrays are used as follows. During the forward part of the dynamic programming, the (i, j) value is computed during step i + j in the ith PE of one of the linear arrays. This calculation depends on values from (i-1, j), (i, j-1) and (i-1, j-1), all of which have already been calculated in either PE_i or PE_{i-1}. The characters of the sequence are also shifted through the linear array, one PE at a time, to ensure that character j is in PE_i at time step i + j. During the backward calculation, this process is reversed, with the diagonals computed in the opposite order. After all sequences have been compared to the model, the frequencies are combined using a log-time reduction and uploaded to the host computer for high-level processing such as surgery and noise injection.

Performance can be rated in terms of dynamic programming cell updates per second (CUPS). Performing both the forward and backward calculations at a single dynamic programming cell (one character from one sequence against one model node) is a single cell update. The total number of cell updates, which depends greatly on parameter settings, is the sum of model lengths over all re-estimation cycles multiplied by the total number of characters in all the sequences. For the experiments described in the next section, a typical training run on 50 globins of average length 146, with a model length of 145 and 40 re-estimation cycles, required 50 s on our 4096-PE MasPar MP-2 or ~7 min on a DEC Alpha 3000/400 (longer runs heighten the difference between the two machines to factors of 10-15). Experiments based on 400 training sequences and an earlier version of the code are summarized in Table I.

Table I. HMM training on various machines

System                                      Rel. porting difficulty   Performance (kCUPS)   Performance (Sun 4/50s)
Sun 3/110                                   0                         3.2                   0.1
Sun 4/50                                    0                         37.1                  1.0
DECstation 5000/240                         0                         39.2                  1.1
SGI 440VGX, 1 CPU                           0                         59.0                  1.6
DEC Alpha 3000/500                          0                         107                   2.9
C-Linda, 7 DECstation 5000s (240 and 125)   4                         147                   4.0
Cray Y-MP, 1 CPU, vectorized                8                         167                   4.5
8K MP-1, optimized                          60                        1530                  41
4K MP-2, optimized                          60                        1580                  43
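Before turning to the experiments, a brief serial illustration of the counter-diagonal (wavefront) ordering described above; it only enumerates the traversal order and is not the MPL implementation.

```python
# Serial illustration of the counter-diagonal traversal used for the Cray and
# MasPar mappings: every cell on diagonal d = i + j depends only on diagonals
# d-1 and d-2, so all cells of one diagonal can be computed together.
def antidiagonal_order(n_model, n_seq):
    for d in range(n_model + n_seq - 1):
        for i in range(max(0, d - n_seq + 1), min(n_model - 1, d) + 1):
            yield i, d - i            # cell (i, j) with j = d - i

print(list(antidiagonal_order(3, 3)))
# [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0), (1, 2), (2, 1), (2, 2)]
```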
Results and discussion

While several reports have shown the general effectiveness of HMMs (Brown et al., 1993; Baldi et al., 1994; Krogh et al., 1994a; Eddy, 1995; Eddy et al., 1995), this section takes a close look at the effectiveness of each extension to the basic method. We chose the globin family for the first of these illustrative experiments because of our previous familiarity with the family. From a set of 624 globins, close homologs were removed using a maximum entropy weighting scheme (Krogh and Mitchison, 1995) by removing all sequences with a very small weight (<10^-5), which left us with 167 globins. For the experiments, our group of 167 sequences was randomly divided into a training set of 50 sequences and a test set of 117 sequences, except in the experiments on training set size.

The statistical goodness of an HMM is tied to the final probability result of the test set. SAM reports this as a negative log-likelihood (-ln P), or NLL, score. This section considers the effects of each of the more important extensions on NLL scores. Ideally, we would like small NLL scores that, with multiple runs using different random seeds, are sharply peaked. Such a peaked distribution implies that far fewer than the thousands of runs performed in these experiments are required to generate a good model.

Noise

To find default noise settings, we ran 50 random seeds for all combinations of nine annealing schedules and seven noise values. Average performance over the runs was typically 5% better than without noise, while the best NLL score over the 50 runs was 12% better than without noise. Given that the typical mode of running SAM is to generate many models and pick the best, this 12% value is quite an improvement. Our chosen default is five sequences' worth of noise using an exponential annealing schedule with factor 0.8. This is a somewhat arbitrary choice based on the range of scores obtained; no clear winner among the settings emerged. The tested setpoints added between 20% and 350% more re-estimation cycles over the noiseless case. If less time is available, we suggest a linear schedule with one noise sequence. In general, as many models should be created as possible, and then the best one further refined. This procedure is automated in SAM.

The histograms in Figure 5 show average test set NLL scores for 1000 training runs on 50 training globins with just default noise, with random model lengths without noise, and with all heuristics (noise, random model lengths and surgery). The vertical bar at 334 indicates the NLL score for training without noise. Note in particular how the combination of noise and surgery both improves the test set scores and sharpens their distribution, indicating that far fewer than 1000 runs are needed to generate good models.

Fig. 5. Test NLL scores (average over 117 test sequences) from running SAM 1000 times on 50 other globin sequences with (a) default noise, (b) random starting model lengths and (c) all heuristics including surgery. The solid vertical line at 334 is the average test sequence score without any random heuristics (in this case, surgery has no effect on the non-random training routine).

Regularization

The effect of regularization is even more dramatic than the effects of noise and surgery. SAM supports both Dirichlet and Dirichlet mixture regularization. To gauge the effects of regularization, we compared no regularization to four other choices: the default simple regularizer, the original nine-component mixture (Brown et al., 1993), and more recent nine- and 21-component regularizers (Karplus, 1995). Histograms of the test set NLL scores for these experiments, which used the noise settings determined in the previous experiments, are shown in Figure 6. Clearly, regularization is needed. With 50 training sequences, however, the distinction between the different regularizers is not great, as expected, since the model is built from a reasonably sized training set. Although the original nine-component mixture appears to be the best, this mixture was in part based on a globin alignment and thus is at an advantage in this experiment.
Where regularization truly shines is with small training sets. Figure 7 shows the performance of the regularization methods for small training sets. In this case, each test point represents the average test set NLL score over 20 training runs, each based on a different random training set of the indicated size. The Dirichlet mixture priors require additional running time, but have somewhat improved performance over the single-component prior. All four of these regularizers are available with the SAM distribution.

Fig. 6. NLL scores on the test set from running SAM 200 times on 50 globin sequences with no regularization (a), and single-component (b), original nine-component (c), revised nine-component (d) and 21-component (e) regularizers. The solid vertical line at 334 is the score with no random heuristics and the default regularizer.

Fig. 7. The effect of training set size and regularization on model building. Especially for small training sets, regularization greatly improved modeling with respect to the larger test set. The differences in test set scores between the available regularizers are shown in the blow-up to the right.

Case study: modeling the SH2 domain with FIMs

In this section we demonstrate the use of FIMs for modeling domains. We stress that this is mostly for illustrative purposes, so we will not go deeply into any biological implications of the model or alignment. We use the SH2 domain, which is found in a variety of proteins involved in signal transduction, where it mediates protein-protein interactions. For a review see Kuriyan and Cowburn (1993). The domain has a length of ~100.

Initially a file was created with 78 SH2-containing proteins by searching SwissProt (release 30) for the keyword 'SH2 domain' (see Table II). Fifty models of length 100 were trained in batches of 10 with FIMs at both ends and without surgery. For each batch the model with the best overall score was examined. Of these five models, the best-scoring one happened to cover the SH2 domain almost entirely. It started ~20 amino acids prior to the domain and ended ~20 amino acids early as compared to the alignment in Kuriyan and Cowburn (1993). Some of the other models also covered part of the domain, whereas others had picked up a different signal. This signal was probably the kinase catalytic domain of some of the proteins in the file. It is quite remarkable that the model can find the domain completely unsupervised, although that might not always be the case.
Using this first model, a search was made of the entire SwissProt database, and all sequences scoring better than a Z-score of 7 were examined (not taking sequences with many 'X' characters into account); see Table II.

Table II. All the sequences with a Z-score >7 using the first model

ID            Length  Z1      Z2      |  ID             Length  Z1      Z2
KSRC_AVISS    568     67.569  57.941  |  PIP5_RAT       1265    41.204  41.236
KSRC_AVIS2    587     66.610  59.765  |  PTNB_MOUSE     585     38.255  36.095
KSRC_AVIST    557     66.187  56.073  |  PIP5_HUMAN     1252    38.087  38.653
KSRC_CHICK    533     63.475  56.653  |  KATK_HUMAN     659     37.859  31.787
KYES_XENLA    537     63.177  55.735  |  PTNB_HUMAN     593     37.700  36.920
KSRC_AVISR    526     63.007  57.519  |  GTPA_HUMAN     1047    37.255  35.081
KSRC_RSVSR    526     63.006  57.519  |  GTPA_BOVIN     1044    36.794  34.776
KSRC_RSVPA    523     62.815  58.016  |  KATK_MOUSE     659     36.593  31.475
KSRC_HUMAN    536     62.782  56.550  |  P85A_MOUSE     724     35.106  35.299
KSRN_MOUSE    541     62.609  56.243  |  P85A_BOVIN     724     35.104  35.299
KYES_CHICK    541     62.162  55.339  |  P85A_HUMAN     724     35.042  35.389
KHCK_MOUSE    503     62.012  51.266  |  KABL_MLVAB***  746     34.189  44.282
KFGR_MOUSE    517     61.846  53.802  |  KCSK_RAT       450     34.115  30.778
KFYN_HUMAN    536     61.811  52.437  |  CSW_DROME      841     33.930  36.004
KFYN_CHICK    533     61.799  51.222  |  PTN6_MOUSE     595     33.681  31.583
KYES_AVISY    528     61.798  55.872  |  PTN6_HUMAN     595     32.909  30.885
KHCK_HUMAN    505     61.731  50.077  |  KLYK_HUMAN     620     32.546  29.987
KFYN_XENLA    536     60.899  51.891  |  P85B_BOVIN     724     31.012  34.030
KYES_MOUSE    541     60.894  55.201  |  KSR2_DROME     590     30.245  30.025
KYES_HUMAN    543     60.808  54.594  |  SEM5_CAEEL     228     29.742  25.657
KSR1_XENLA    531     60.516  54.921  |  SHC_HUMAN      473     28.570  24.458
KSR2_XENLA    531     60.320  54.793  |  KABL_CAEEL***  557     28.330  32.172
KYRK_CHICK    535     59.673  50.890  |  KFES_FSVST***  477     28.243  35.183
KSRC_RSVH1    526     59.639  54.652  |  KFES_HUMAN     822     28.214  40.554
KFYN_XIPHE    536     59.610  50.490  |  NCK_HUMAN      377     28.194  26.543
KSRC_RSVP     526     59.370  53.892  |  GRB2_HUMAN     217     27.546  24.833
KLCK_HUMAN    508     58.265  49.467  |  KFES_FELCA     820     27.264  40.511
KLCK_MOUSE    508     58.265  49.287  |  VAV_MOUSE      845     25.937  29.161
KLYN_HUMAN    511     58.265  49.868  |  KFER_HUMAN     822     25.905  35.117
KFGR_HUMAN    529     58.024  50.468  |  KFES_FSVGA***  609     25.838  36.585
KLYN_MOUSE    511     57.843  49.468  |  GAGC_AVISC     440     25.058  24.473
KLYN_RAT      511     57.843  48.955  |  VAV_HUMAN      846     24.997  28.856
KYES_XIPHE    544     57.477  50.010  |  KFES_MOUSE***  820     23.101  38.987
KFGR_FSVGR*** 545     55.885  51.428  |  KSYK_PIG       628     22.652  28.247
KBLK_MOUSE    499     55.812  45.981  |  KAKT_MLVAT     501     21.399  18.600
KSTK_HYDAT    509     47.469  43.094  |  KFPS_AVISP***  533     21.396  33.346
KABL_MOUSE    1123    47.440  46.331  |  KFPS_FUJSV***  873     21.063  36.728
KABL_HUMAN    1130    47.408  46.458  |  KRAC_MOUSE     480     20.673  17.461
KABL_DROME    1520    46.822  41.191  |  KTEC_MOUSE     527     20.547  26.322
KABL_FSVHY*** 439     43.836  40.175  |  KRAC_HUMAN     480     20.509  17.461
KSR1_DROME    552     43.509  38.564  |  KRCB_HUMAN     520     16.613  16.221
PIP4_HUMAN    1290    42.121  39.355  |  KFPS_DROME***  803     12.358  25.804
PIP4_RAT      1290    41.892  39.526  |  SPT6_YEAST     1451    9.461   13.805
PIP4_BOVIN    1291    41.533  39.341  |  YKF1_CAEEL***  424     7.138   19.053

The scores from the search with the first model are shown in the columns labelled 'Z1', and the columns labelled 'Z2' show the Z-scores for the new (second) model. All entries except those marked '***' were part of the training set initially extracted from SwissProt; those marked '***' were subsequently included in the training set. The first training set also contained the fragment KLCK_RAT of length 17. It had a Z-score of 0.18 with the first model and was then removed from the training set.
All the sequences in the training set were among those highscoring ones, except a fragment of length 17 which was 104 then deleted from the data set. Of these high-scoring sequences, 10 had Z-scores of 12 or more, and by checking the alignment, we consider it to be certain that they contain SH2. The last sequence with a score of 7.1 we also believe contains SH2. All these high-scoring sequences were now included in the data set. The old and new sequences in the training set are listed in Table II, and all sequences with Z-scores >4 are shown in Table III. The model was then modified by deleting the first 20 modules and inserting 30 'blank' modules in the end, so it Hidden Markov models for sequence analysis Table III. All the sequences that have a Z-score of >4, but not shown in Table II ID Length Z1 ID Length Z2 MYCN CHICK HLYA_SERMA PSBO_CHLRE MYSN_DROME MYCN SERCA TRP_YEAST DRRA STRPE COAT_RCNMV GLNA_AZOBR FIXL_BRAJA MIXIC_SHIFL ENL_HUMAN K2C1_XENLA RF3_SACUV TF3A_XENLA POL_SIVGB TBP1_YEAST ORA_PLAFN PIP1_DROME ATPF_BACSU TPM4_RAT ODO2_YEAST VSI1_REOVL ADX_HUMAN H2B EMENI HLY4_ECOLI UBC6_YEAST BNC3_RAT PSBO SPIOL RF3 YEAST MERA BACSR 441 1608 291 2017 426 707 330 339 468 505 182 559 425 476 344 1009 434 701 1312 170 248 475 470 184 139 478 250 318 332 559 631 6.067 5.507 5.426 5.187 5.111 4.994 4.958 4.937 4.831 4.654 4.593 4.592 4.503 4.500 4.467 4.422 4.421 4.407 4.347 4.322 4.294 4.291 4.285 4.276 4.241 4.199 4 164 4.134 4.094 4.094 4.078 KFLK_RAT PI PI DROME MYSD_MOUSE GSPF_ERWCA DRRA_STRPE NFL-COTJA MPP1_NEUCR CHLB CHLHU VG24_HSVI1 UFO_MOUSE TPM SCHPO POL_MMTVB RM02 YEAST UFO_HUMAN RS15_PODAN RL9_ARATH SY61_DISOM ADX_HUMAN SYT1_RAT SYT1_HUMAN TTL_PIG DNIV_ECOLI TTL_BOVIN RL15_SCHPO 323 1312 1853 408 330 555 577 431 333 888 161 899 371 887 152 197 426 184 421 422 379 184 377 29 8.445 5.432 5.245 5.227 4.775 4.750 4.718 4.603 4.587 4.521 4.477 4.413 4.378 4.364 4.349 4.336 4.246 4.215 4.171 4.149 4.138 4.084 4.068 4.056 The column labeled 'Z 1' contains the scores from the search with the first model and the one labeled 'Z 2' the scores with the second model. could better fit the domain. Starting from this modified model, 20 new models were trained on the new set of 88 protein sequences, and the best one selected. This was the final model for the SH2 domain (see Figure 8). Using this model, a new search was performed. The new model picked up all of the 88 sequences in the training set, and the smallest score was significantly higher than that of the first model (13.8 compared to 7.1)—see Table II. It also found a new one (KFLK_RAT) with a Z-score of 8.4 that is a fragment and in SwissProt is described as containing part of the SH2 domain. All other sequences in SwissProt had Z-scores <5.5, and in the highest scoring ones (Table III) we did not see any signs of the SH2 domain. The SH2 domain often occurs several times in the same protein, which is not modeled properly by such a model (see above: Modeling domains and motifs). When training, all domains in a given protein contribute to the model (because all paths are taken into account), but when aligning, only the occurrence matching the model best will be found. There are ways to find all occurrences by finding suboptimal paths or masking domains already found, the latter of which is currently being added to SAM. Conclusion In principle the HMM method is a simple and elegant way of modeling sequence families. For practical purposes, however, various problems arise, of which the most important ones are discussed in this paper. 
A variety of heuristics and extensions to overcome these problems, all of which are part of the SAM software package, were presented. Most of these techniques were mentioned in our original paper (Krogh et al., 1994a), but here we have discussed them in more detail and treated them from a more practical perspective, whereas we have not focused on the biological results at all.

The algorithm for estimating HMMs is a simple hill-climbing optimization technique which will very often find suboptimal solutions, resulting in inferior models. Three methods were introduced to deal with this problem. First, several different models can be trained and the best one selected, based on the overall NLL score. Second, noise of slowly decreasing size can be added during the estimation process in a fashion similar to simulated annealing. Third, the surgery method for adding and deleting states from the model was shown also to help to obtain models with better NLL scores.

The general theory for HMMs does not tell one how to choose the topology of the model. In our work we have chosen a model structure that we believe fits biological sequences particularly well, and most of the probability parameters can be interpreted as penalties familiar from other alignment methods, such as gap penalties. For choosing the length of the model, the above-mentioned surgery heuristic was introduced, which deletes parts of the model that are not used very much and inserts more states where the existing states are 'overloaded'.

In any parameter estimation process overfitting is a danger, in particular when there are many parameters and little data. This can be overcome using Dirichlet and Dirichlet mixture prior distributions to regularize the models. They are particularly useful because they incorporate prior biological information and insight into the model, but can be overridden by sufficient numbers of sequences.

For the problem of modeling domains and other subsequences we introduced FIMs. These modules treat the parts of the sequences outside the domain as completely random sequences, and by an example it was shown that an HMM can locate domains automatically. By using several FIMs one can model proteins with many subdomains (always appearing in the same order). By using long backward transitions one could in principle model domains occurring several times in different orders, but this is not currently implemented in the software package.

SAM is an evolving system.
Future additions will include repeated motif location, alternative scoring methods using null models, subsequence-to-submodel training, sequence weighting, an improved user interface, and use of our new algorithm for parallel sequence alignment in limited space to greatly extend the capabilities of the MasPar code (Grice et al., 1995).

Acknowledgements

We would like to thank Saira Mian for many valuable suggestions, and David Haussler and the rest of the computational biology group at the Baskin Center for Computer Engineering and Information Sciences at UCSC for many suggestions and contributions. We also thank the anonymous referees for their excellent suggestions. This research was supported in part by NSF grants CDA-9115268 and BIR-9408579, a grant from CONNECT, Denmark, and a grant from the Novo-Nordisk Foundation. SAM source code, the experimental server and SAM documentation can be accessed on the World-Wide Web at http://www.cse.ucsc.edu/research/compbio/sam.html or by sending e-mail to [email protected].

References

Bairoch,A. and Boeckmann,B. (1994) The SWISS-PROT protein sequence data bank: current status. Nucleic Acids Res., 22, 3578-3580.
Baldi,P., Chauvin,Y., Hunkapiller,T. and McClure,M. (1994) Hidden Markov models of biological primary sequence information. Proc. Natl Acad. Sci. USA, 91, 1059-1063.
Berger,J. (1985) Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
Brown,M.P., Hughey,R., Krogh,A., Mian,I.S., Sjolander,K. and Haussler,D. (1993) Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Hunter,L., Searls,D. and Shavlik,J. (eds), Proc. First Int. Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Menlo Park, CA, pp. 47-55.
Bucher,P. and Bairoch,A. (1994) A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proc. Int. Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Stanford, CA, pp. 53-61.
Eddy,S. (1995) Multiple alignment using hidden Markov models. In Proc. Int. Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Cambridge, pp. 114-120.
Eddy,S., Mitchison,G. and Durbin,R. (1995) Maximum discrimination hidden Markov models for sequence consensus. J. Comput. Biol., 2, 9-23.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355-4358.
Grice,J.A., Hughey,R. and Speck,D. (1995) Parallel sequence alignment in limited space. In Proc. Third Int. Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Menlo Park, CA.
Haussler,D., Krogh,A., Mian,I.S. and Sjolander,K. (1993) Protein modeling using hidden Markov models: analysis of globins. In Proc. Hawaii International Conference on System Sciences. IEEE Computer Society Press, Los Alamitos, CA, Vol. 1, pp. 792-802.
Karplus,K. (1995) Regularizers for estimating distributions of amino acids from small samples. In Proc. Third Int. Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Menlo Park, CA. Full version available as UCSC Technical Report UCSC-CRL-95-11.
Kirkpatrick,S., Gelatt,C.D.,Jr and Vecchi,M.P. (1983) Optimization by simulated annealing. Science, 220, 671-680.
Krogh,A. and Mitchison,G. (1995) Maximum entropy weighting of aligned sequences of proteins or DNA. In Proc. Third Int. Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Menlo Park, CA.
Krogh,A., Brown,M., Mian,I.S., Sjolander,K. and Haussler,D. (1994a) Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol., 235, 1501-1531.
Krogh,A., Mian,I.S. and Haussler,D. (1994b) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res., 22, 4768-4778.
Kuriyan,J. and Cowburn,D. (1993) Structures of SH2 and SH3 domains. Curr. Opin. Struct. Biol., 3, 828-837.
Lawrence,C. and Reilly,A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7, 41-51.
Lawrence,C., Altschul,S., Boguski,M., Liu,J., Neuwald,A. and Wootton,J. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214.
Nickolls,J.R. (1990) The design of the MasPar MP-1: a cost-effective massively parallel computer. In Proc. COMPCON Spring 1990. IEEE Computer Society, Los Alamitos, CA, pp. 25-28.
Rabiner,L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257-286.
Santner,T.J. and Duffy,D.E. (1989) The Statistical Analysis of Discrete Data. Springer-Verlag, New York.

Received on July 24, 1995; accepted on October 23, 1995