Vol. 12 no. 6 1996, Pages 447-454  CABIOS

Kohonen map as a visualization tool for the analysis of protein sequences: multiple alignments, domains and segments of secondary structures

Jens Hanke and Jens G. Reich

Abstract

The method of Kohonen maps, a special form of neural networks, was applied as a visualization tool for the analysis of protein sequence similarity. The procedure converts sequences (domains, aligned sequences, segments of secondary structure) into a characteristic signal matrix. This conversion depends on the property or replacement score vector selected by the user. Similar sequences have a small distance in the signal space. The trained Kohonen network is functionally equivalent to an unsupervised non-linear cluster analyzer. Protein families, aligned sequences, or segments of similar secondary structure aggregate as clusters, and their proximity may be inspected on a color screen or on paper. Pull-down menus permit access to background information in the established text-oriented way.

Introduction

Computer analysis of protein sequence 'homology' in the ever-growing data collections, e.g. with BLAST (Altschul et al., 1990) or with PROFILE (Gribskov et al., 1990), usually results in long lists of 'hits', consisting of sequence identifiers, indicators of hit localization, and score or probability values. There is a need for graphic display tools which make this information more directly conspicuous. Previous work (Ferran and Ferrara, 1991; Hanke et al., 1996) has demonstrated that neural networks in the special form of so-called Kohonen maps (Kohonen, 1989) are a good tool to organize and store data for which similarity or distance relations are defined. The map works as a non-linear cluster analyzer, stores information in an associative (rather than list-type) manner, and permits the recognition and classification of so far unknown information samples.
In this paper, we propose Kohonen maps as a versatile tool for the two-dimensional (2-D) display of sequence similarity information. Background text information may be retrieved in the usual way if required.

Max-Delbrück-Center for Molecular Medicine, Department of Bioinformatics, Robert-Rössle-Straße 10, D-13125 Berlin-Buch, Germany. E-mail: hanke@bioinf.mdc-berlin.de

© Oxford University Press

System and methods

Signal coding of sequences

An accepted method of quantitative sequence analysis is the position-wise evaluation of similarity scores. Such a score may be derived from a vector (e.g. of physicochemical properties, in categorical or real-valued coding; see Taylor, 1986) or from a scalar quantity (e.g. comparison between aligned sequences using weight matrices like BLOSUM (Henikoff and Henikoff, 1992) or PAM (Dayhoff et al., 1978)). In this paper, we do not prescribe the method of evaluation, but rather leave the choice to the user. However, a concept is used that defines, and organizes, similarity information in a unified way. Sequences are represented as signals, assembled from position-specific property vectors. Such a signal is a real-valued or binary matrix of dimension Q x L (where Q is the dimension of the property vector of the amino acid in the ith position, i = 1...L, and L is the length of the sequence). An example of this is the scoring method by Taylor (1986), who represents each amino acid by a binary vector of 11 physicochemical properties of this amino acid. For a sequence this leads to a matrix which, as an 'image', is characteristic of that item. If similarity is expressed in terms of entries of the accepted weight matrices, which are usually evaluated by pairwise comparison of sequence elements, then a convenient property vector of an amino acid is just the column vector pertinent to that amino acid. It expresses the scores of an amino acid when compared to each of the 20 amino acids (itself included).
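The signal coding just described can be made concrete with a short sketch. The property table below is hypothetical (four made-up binary properties rather than Taylor's 11 physicochemical properties); it only illustrates how position-specific property vectors are stacked into a Q x L signal matrix:

```python
import numpy as np

# Hypothetical binary property vectors (hydrophobic, polar, positively
# charged, negatively charged) for a few amino acids; Taylor (1986)
# uses 11 physicochemical properties per residue in the same spirit.
PROPS = {
    "A": [1, 0, 0, 0],
    "N": [0, 1, 0, 0],
    "D": [0, 1, 0, 1],
    "K": [0, 1, 1, 0],
}

def signal_matrix(seq):
    """Stack the per-position property column vectors of a sequence
    into a Q x L signal matrix (Q properties, L residues)."""
    return np.array([PROPS[aa] for aa in seq]).T
```

Under these assumed property values, `signal_matrix("AND")` is a 4 x 3 binary image characteristic of the peptide AND.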
A single sequence is so converted to an image matrix symbolizing the scoring capacity of that sequence. For matrices M, N so obtained, if they have the same dimension, a measure of signal dissimilarity is immediately given by the Euclidean distance:

D(M, N) = sqrt( SUM_{i,j} (M_ij - N_ij)^2 )

with i, j running over the matrix entries. One problem with this type of signal representation of a sequence is that a distance concept is not applicable to sequences of unequal length.

Fig. 1. Coding of secondary structure segments. Jeffrey (1990) has developed a coding principle that represents each sequence in a unique way by a walk through a quadratic pixel space. In the figure, the amino acid alphabet is replaced by a four-property alphabet (hydrophobic, not hydrophobic, positive charge, negative charge). Each amino acid has one or two of these properties (see the inserted table). The graph for the peptide AND starts in the center and makes a step half-way to the coordinate of the property set of the first amino acid (A) in the sequence. Then it turns half-way into the direction of N's corner, continues with a half-step towards the D corner, and so on. In this way, each sequence is represented as a graphic pattern of straight lines oscillating through the space. The graph is then converted into a binary pattern matrix by inserting '1' (unity) into each quarter cell lying on the route of the graph. A further transformation (not shown here) according to the Quad-Tree principle (Lynch, 1985) results in a condensed, length-normalized and contour-preserving vector representation of the pattern matrix. We show the pixel values only in that quarter of the grid that has been traversed by the first letters of the example segment AND.
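The distance D(M, N) between two equal-sized signal matrices can be computed directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def signal_distance(M, N):
    """Euclidean distance D(M, N) = sqrt(sum over i, j of (M_ij - N_ij)^2)
    between two signal matrices of identical dimension."""
    if M.shape != N.shape:
        raise ValueError("signal matrices must share the same Q x L dimension")
    return float(np.sqrt(((M - N) ** 2).sum()))
```

As the text notes, the measure is undefined for matrices of different dimension, which is why unequal-length sequences need separate treatment.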
To deal with this difficulty, we applied the method of fractal encoding (see below) to the image representation of short sequences. For the treatment of longer sequences of unequal length, an alignment is required, from which one may either take gap-free sub-blocks or replace gaps, if not too numerous, by an average vector (whose elements are the mean values over the corresponding entries of all 20 amino acids). A further method is to represent sequences of different length by a matrix of subsequence frequencies (e.g. dipeptide or oligoword frequencies; see Ferran and Ferrara, 1991), but this results in loss of the information residing in the succession of elements in the sequence.

Fractal encoding of sequences

After some experimentation, we encoded short sequence segments of unequal length according to Jeffrey (1990). In essence, this converts a sequence into a walk through a quadratic area with pixel elements. The edges of that area are labeled by the elements of the alphabet (amino acid names or some transformation like physicochemical properties). Each new sequence letter causes a jump halfway to the edge corresponding to the current letter. Jeffrey recorded only the end points of the steps and filled the pertinent pixels. In our applications (shorter segments), we draw the whole route between both pixel points. The example of Figure 1 shows, simplified for illustration, a transformation of the amino acid alphabet into an alphabet made of two basic properties. A zig-zag curve unique for any sequence segment is obtained, which may be scanned and encoded according to the Quad-Tree principle (Lynch, 1985), resulting in a binary vector characteristic of the input segment. Similar segments give characteristically similar code vectors suitable for training of the Kohonen network. In particular, a Hamming distance metric between the walk patterns is established by this coding principle.
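A sketch of the Jeffrey-style walk, under simplifying assumptions: a hypothetical four-letter property alphabet with one corner per letter, and only the step end points rasterized (Jeffrey's original variant), whereas for short segments the paper draws the whole route between points:

```python
import numpy as np

# Corner coordinates for a hypothetical four-letter property alphabet
# (the paper maps amino acids onto such property classes; Jeffrey (1990)
# labels the corners with the four nucleotides in the same way).
CORNERS = {"H": (0.0, 0.0), "N": (1.0, 0.0), "P": (0.0, 1.0), "D": (1.0, 1.0)}

def cgr_walk(seq, grid=8):
    """Chaos-game walk: start at the centre of the unit square and step
    halfway toward the corner of each successive letter, marking the
    visited cell of a grid x grid pixel raster (end points only)."""
    img = np.zeros((grid, grid), dtype=np.uint8)
    x, y = 0.5, 0.5
    for letter in seq:
        cx, cy = CORNERS[letter]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        row = min(int(y * grid), grid - 1)
        col = min(int(x * grid), grid - 1)
        img[row, col] = 1                     # fill the pertinent pixel
    return img
```

Flattening `img` row-wise yields a binary code vector, on which the Hamming distance between walk patterns mentioned above is defined.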
Jeffrey's method was found to work well with sequence segments of <30 residues. The resulting walk trace was typical enough for a sequence, and sufficiently similar even in the case of unequal length but otherwise high similarity. However, with increasing length, the walking trace tends to become entangled and therefore increasingly uninformative. In the case of moderate or high length, we took recourse to pruning sequences to equal length, which works well if the length difference, and hence the loss of information, is small. The more general problem of coding distinctly length-variable sequences has so far not been satisfactorily solved.

Kohonen mapping

A Kohonen map as modeled here is a computer program that projects signals onto a 2-D map of 'neurons'. A neuron is identified by its two coordinates in a quadratic lattice. Between different neurons there is a map distance defined: simply the Euclidean distance between the coordinates. For instance, the distance between neurons (6,3) and (3,7) is sqrt((6 - 3)^2 + (3 - 7)^2) = sqrt(9 + 16) = 5. A second distance metric is required for the signal space. We selected the Euclidean distance between code vectors (as described above) as this metric. One of the salient features of Kohonen mapping is the non-linear relationship between the two distances, on the neuron lattice and in the signal space. In effect, an n-dimensional signal vector is projected onto a 2-D neuron lattice. This projection is intended to retain proximity: two vectors close to each other in the signal space should also be close on the projection map. The projection itself is non-linear and approximately topology preserving.
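The lattice distance, including the worked example from the text:

```python
import math

def map_distance(n1, n2):
    """Lattice distance between two neurons given by their (row, column)
    coordinates on the quadratic grid."""
    return math.hypot(n1[0] - n2[0], n1[1] - n2[1])

# The worked example from the text: neurons (6, 3) and (3, 7).
print(map_distance((6, 3), (3, 7)))  # prints 5.0
```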
Each neuron on the map has a 'reference vector' in that signal space, and it is possible to assign to any signal v the 'pertinent' neuron n on the map, namely that whose reference vector w_n is closest to the signal (j running over all neurons):

distance(v, w_n) = min_j { distance(v, w_j) }

In this way, for a certain set of reference vectors w_j, the signal space is uniquely mapped into domains of the pertinent neurons (the 'receptor fields' of the neurons). 'Learning' is an iterative process, during which all input signals sent out from the collection of sequences are 'shown' in turn, and many times, to the mapping; each time the reference vectors of the pertinent neuron (the 'winner') and its closest neighbors are updated so as better to memorize this signal (shifting the reference vector somewhat closer towards it). Over prolonged cycles, this produces reference vectors lying in the center of the feature specimens clustered around them. As a result, we get a map of neurons whose reference vectors populate the signal space such that densely populated input regions also have a dense representation (many neurons) on the map, whereas in sparsely populated input regions the reference vectors occur rarely (i.e. reference vectors of neighboring neurons have a large distance). A sequence cluster emerges in this transformation as a set of input signals projecting onto the same neuron or onto its immediate neighbors on the map. Their distance from the pertinent reference vector in the signal space (called the 'quantization error') is small. Any 'non-member' of the cluster will be projected either onto distant neurons (this occurs when relatives of it have been offered in the training set), or (if the pattern is unprecedented) onto a neuron that happens by chance to be at minimum distance, but with a large quantization error. The training time depends on the data set.
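The learning rule described above can be sketched as follows. This is an illustrative self-organizing-map update, not the authors' program; parameter names and the linear decay schedule are our assumptions (the actual training settings are reported below):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(signals, rows=4, cols=4, steps=500, lr0=0.2, radius0=2.0):
    """Illustrative Kohonen training loop: the winner and its lattice
    neighbourhood shift their reference vectors toward each presented
    signal, with shrinking learning rate and neighbourhood radius."""
    dim = signals.shape[1]
    W = rng.random((rows, cols, dim))                     # reference vectors
    coords = np.indices((rows, cols)).transpose(1, 2, 0)  # lattice coordinates
    for t in range(steps):
        v = signals[rng.integers(len(signals))]           # present one signal
        d = np.linalg.norm(W - v, axis=2)                 # signal-space distances
        win = np.unravel_index(np.argmin(d), d.shape)     # the 'winner'
        frac = 1.0 - t / steps                            # decay schedule (assumed)
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        lat = np.linalg.norm(coords - np.array(win), axis=2)  # map distances
        h = np.exp(-(lat ** 2) / (2.0 * radius ** 2))     # neighbourhood kernel
        W += lr * h[:, :, None] * (v - W)                 # shift toward the signal
    return W
```

After training on two well-separated groups of signals, the two groups are assigned to different winner neurons, which is the clustering behaviour exploited throughout the paper.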
The set of secondary structure segments, comprising 3844 training patterns, required 2.3 CPU h on a SPARC 20 machine. Training consists of two phases: an ordering phase and a fine-tuning phase. The respective training parameters are: number of steps 40 000/65 000, learning rate 0.2/0.05 and neighborhood radius 20/5. Both additional examples, the block and multiple alignment methods, use the following training parameters: number of steps 1000/10 000, learning rate 0.05/0.02 and neighborhood radius 10/3. All these parameters approach unity during the course of training.

Sammon mapping

A Kohonen map provides a conspicuous 2-D arrangement of neurons, but its distance measure is qualitative only, in the sense that one does not see the 'true' distances in the multi-dimensional space of sequence signals. Such n-dimensional signal distances cannot be exactly displayed visually. However, the method of Sammon mapping (Sammon, 1969) makes it possible to generate a 2-D map where the signal distances are retained at least approximately. If D_ij are the true distances in the signal space (i and j running to n, the number of sequences projected onto the Sammon area) and D*_ij are the corresponding distances on the Sammon area, then the positions of the projected signals are improved by a stepwise reduction of the fit criterion E:

E = [1 / SUM_{i<j} D_ij] SUM_{i<j} (D_ij - D*_ij)^2 / D_ij

Retrieval of background information

Sammon and Kohonen maps visualize only one aspect of a sequence, namely its similarity to others according to a scoring principle. Tools to retrieve background information pertinent to any sequence or group of sequences inspected on a map are therefore required. For this we used a program called XGobi (Swayne et al., 1991), which brings contemporary dynamic graphics for statistics to the workstation environment. The combination of the Kohonen map and XGobi offers a biologist or chemist the power of motion and rapid surveys for discovering and understanding local and global relationships between sequences.
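The stepwise reduction of the Sammon criterion E introduced above can be sketched as plain gradient descent on the 2-D positions. This is an assumed implementation for illustration (Sammon's 1969 procedure actually uses a pseudo-Newton step, and the learning rate here is ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def sammon(D, steps=300, lr=0.05):
    """Reduce Sammon's criterion E = (1/c) * sum_{i<j} (D_ij - d_ij)^2 / D_ij
    by gradient descent on 2-D point positions Y; D holds the true
    pairwise signal distances, d the current map distances."""
    n = D.shape[0]
    Y = rng.random((n, 2))                         # random initial layout
    c = D[np.triu_indices(n, 1)].sum()             # normalizing constant
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.linalg.norm(diff, axis=2) + np.eye(n)   # guard the diagonal
        ratio = (D - d) / (d * (D + np.eye(n)))        # (D_ij - d_ij)/(d_ij D_ij)
        np.fill_diagonal(ratio, 0.0)
        grad = -(2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad                              # one descent step
    return Y
```

Given the pairwise distances of a few well-separated points, a few hundred steps reduce E well below its value at the random starting layout.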
With a pull-down menu, it is possible to show the contents of each neuron (e.g. identifier code, position and amino acid sequence), and to switch to information present in further databases.

Table I. Alignment of a set of 25 HSP-20 C-terminal domain sequences. The left-most column contains the SWISS-PROT codes of the proteins. The first row is a consensus sequence composed of the highest-average-scoring amino acids in each position.

consensus
14KD_MYCTU
18KD_MYCLE
CRA2_MOUSE
HS11_CAEEL
HS11_HELAN
HS18_CLOAB
HS20_NIPBR
HS22_DROME
HS22_PHANI
HS26_YEAST
HS27_CRILO
HS2C_ARATH
HS2C_CHERU
HS2C_CHLRE
HS2C_PETHY
HS2C_WHEAT
HS3C_XENLA
HS6A_DROME
HS6C_DROME
IBPA_ECOLI
IBPB_ECOLI
OV21_ONCVO
P40_SCHMA
SP21_STIAU
YKZ1_CAEEL

MKEGR..YEVRAILPOVDPDDVDIMVRDGQr.TIKXERTEQKDFDGRSEGSFVRTVSU>VGADEDDIKATYDK0ILTVSVAKPTEKHIQI
AWREGEEFVVEFT)LP<aiKADSLDIDIERNVVTVRI^RPGVDPDREMLERPFNRQLVLaENLr7rERII^SYQE(WIiKLSIPRAKPRKISV
VRSDRDKFVIFLDVKHFSPEDLTVKVLEDFVIIHOKHNERQDDHGYISREFHRRYRLPSNVDQSALSCSLSIXBILTFSGPKHSERAIPV
IVNNDQKFAINLHVSQFKPEDLKINLDGHTL8IQOEQEL.KTEHGYSKKSFSRVILLPEDVDVGAVASNLSDOKLSIEAPKKQGRSIPI
WKETPEAHVLKADLPOMKKEEVKVEVEIXSVI^ISOREQEEKDiyraRVERSFIRRFRLPENAKMDEVKAMMENOVLTVVVPKEEEEKKPM
IKEDDDKYTVAADLPOVlUU3NIEI^YENNYLTINAKRDETKDDNNRRERSYGRRSFYVDNIDDSKIDASFLDaVLRITLPKKVKRRIDI
VINDDKKFAVSU3VKHFKPNELKVQLDDRDLTVEOMQEV.KTEHGYIKKQFVHRWSLPEDCDLDAVHTELNHOHLSIEAPEDPSQKFQS
ATVNKDGYKLTLDVKDYS..ELKVKVDESWLVEAKSEQQEAEQGGSSRHFLGRYVLPDGYEADKVSSSLSDOVLTISVPNPPEREVTI
VKEYPNSYVFIADMP<r/KAAEIKVQVEDDVLVVSOERTEEKDEKDRMERRFMRKFVLPENANVEAINAVYQDOVLQVTVSKPPEPKKPK
ILDHDNNYELKVVVPOVKSKK.DIDIENQILVIPSTI^EESKDKVKVKESFKRVITLPDGVDADNIKADYANOVLTLTVPKKPQKDGKN
IRQTADRWRVSLDVNHFAPEELTVKTKEGWIITaKHEERQDEHGYISRCFTRKYTLPPGVDPTLVSSSLSEOTLTVEAPQSAEITIPV
IKEEEHEIKMRFDMPOLSKEDVKISVEDNVLVlKaEQKKEDSDDSWSGRSYGTRLQLPDNCEKDKIKAELKNaVLFITIPKKVERKV..
VREDEEALELKVDMPOLAKEDVKVSVEDNTLIIKSEAEKETEEEEQ.RRRYSSRIELTPNLKIDGIKAEMKNOVLKVTVPKKEEEKKDV
IIESPTAFELHADAP<3MGPDDVKVE]^EGV^ll^VTaERHTTKEAGGKA/WRSFSRAFSIiPENANPIX3ITAA^roKaVLVVTVPKREPPAXPE
GKDGKDHFELTljr^^FSPHELTVKTQGRRVrVTaKHERKSDTEDHEYREWKREAELPESVNPEQVVCSLKNOHLHIQAPRAPETPIPI
SWNRNGFQVSMWVKQFAANELTVKTIDNCIWEOQHDEKEDGHGVISRHFIRKYILPKGYDPNEVHSTLSDaiLTVKAPQRQERIVDI
ASNKQGNFEVHLDVGLFQPGELTVKLVNECIWEOKHEEREDDHGHVSRHF VPAVSAAQGVR
ELVDENHYRIAIAVAOFAESELEITAQDNLLWKaAHADEQKEQGIAERNFERKFQLXENIHVRG..ANLVNOLLYIDL....ERVIPE
EKSDDNHYRITLALAOFRQEDLEIQLEGTRLSVKOTPEQPKEEQGLMNQPFSLSFTLXENMEVSG..ATFVNOLLHIDLIRNEPEPIAA
VINEKDKFAVRADVSHFHPKELSVSVRDRELVTEQHHKEDSAGHGSIERHFIRKYVIJ'EEVQPDTIESHLSDafVLTIAVOTTASRNIPI
GEDGKVHFKVRFDAQOFAPQDINVTSSENRVTVHAKKETTTDGR.KCSREFCRhWQIJKSIDDSQLKCRMTIXIVLMLEAPVKVDQNQSL
VHNTKEKFEVGLDVQrFTPKEIEVKVSGQELLIHCRHETRSDNHGTVAREINRAYKLPDDVDVSTVKSHLTROVLTITASKKA

Results

Associative storage of a protein domain: the HSP-20 family

HSP-20 (see Henikoff and Henikoff, BLOCKS Version 8.0) is a family of proteins induced by heat shock. For demonstration, we selected the C-terminal domain of these proteins, as well as the alignment of a representative subset of this family (G. Beckmann, personal communication; for details, see also Caspers et al., 1995). This alignment of 25 sequences is reproduced in Table I. Extracting the pertinent column of the BLOSUM 62 weight matrix (Henikoff and Henikoff, 1992) as the vector for an amino acid, one obtains a 20 x 75 matrix image characteristic of each specimen. We offered these 25 images to a learning program and obtained a Kohonen map as displayed in Figure 2. It is clearly seen that the Kohonen map memorizes this data set by allocating clusters of similar sequences to specific neurons. The 'content' of a neuron may be obtained by clicking a pull-down menu, from which further information from the database entries of that sequence may be called.
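The image construction used here takes each residue's column of a substitution matrix. A sketch with a hypothetical three-letter alphabet and a made-up miniature matrix standing in for the 20 x 20 BLOSUM 62 matrix:

```python
import numpy as np

# Hypothetical miniature substitution matrix over a three-letter
# alphabet; rows and columns are indexed by the alphabet. The values
# are invented for illustration only.
ALPHABET = "ADK"
SCORES = np.array([[ 4, -2, -1],
                   [-2,  6, -1],
                   [-1, -1,  5]])

def score_image(seq):
    """Represent each residue by its column of the substitution matrix,
    so an aligned sequence of length L becomes a |alphabet| x L image."""
    idx = [ALPHABET.index(aa) for aa in seq]
    return SCORES[:, idx]
```

With the real BLOSUM 62 matrix, each of the 25 aligned 75-residue sequences becomes a 20 x 75 image in exactly this way.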
It is noteworthy that the 'consensus' sequence (according to Gribskov et al., 1990), made of the amino acids which in each position would score best against the amino acids present in the alignment, projects into the center of the map. The subclustering of the domain sample according to BLOSUM 62 similarity deserves further study, but will not be pursued here. Figure 2 gives a clear impression of the similarity and distance of the sequences present in the alignment. However, the distance between neurons can be interpreted only in a qualitative manner. One does not see how close in the signal space the sequences pertinent to neighboring neurons are, nor can one gauge the precise distance to the content of remote neurons. To answer questions of this type, one may construct a Sammon map as depicted in Figure 3. This picture is scaled in units of the signal distance between the reference vectors of each neuron. It gives a better impression of the mutual distance between members of subclusters (note that the 2-D projection of n-dimensional points can represent distance only approximately).

Fig. 2. Kohonen map memorizing the HSP-20 C-terminal domain of heat shock proteins. The pull-down menu of each occupied neuron is displayed. The rows of the alignment in Table I were presented in turn to the learning algorithm together with the BLOSUM 62 weight matrix as developed by Henikoff and Henikoff (1992). Ten out of the 16 neurons of this map are 'occupied' after training, i.e. sequences of the set are projected into them. It is seen that some of the sequences project into a common neuron (indicating subclustering of more closely related specimens). The 2-D grid visualizes this behavior. Note that the consensus sequence is positioned in the center of the map.

Fig. 3. Sammon map of the HSP-20 heat shock protein family. The 16 neurons of the lattice in Figure 2 are now displayed on a map scaled by the Euclidean distance of the full multidimensional signal space as defined in Sammon (1969). Nodes denote neurons. The distance on paper between any two nodes is approximately proportional to the distance in signal space between the pertinent reference vectors. Neighbors in the lattice (Figure 2) are identified by a connecting line. Occupied neurons are labeled by one of the pertinent sequences.

Memorizing many domains of the BLOCKS database

Here we show a somewhat larger neuronal network that has been trained to memorize different domain sequences from the BLOCKS database. We selected, for the sake of illustration, all domains of length 40 or slightly longer (up to 46 residues). We trimmed all of them to exactly 40 residues by cutting off the few surplus positions (if any). The 130-neuron network proved able to place all sequences into distinct areas. Figure 4 displays the occupancy, showing that the families form clusters comprising several neurons. Figure 5 is a screen snapshot (produced by the XGobi tool) showing how the memory information and related items may be addressed. The example shows the associative performance of the Kohonen map memory. Again, a Sammon mapping may be consulted to get an impression of the signal distances (rather than the neuron distances) between different sets of entries (Figure 6). This Sammon map was obtained by projecting the whole set of 159 individual sequences with the 2-D Sammon algorithm. One might prefer (not shown) a picture where only the reference vectors (i.e. the average of each family or subfamily) are processed by the Sammon algorithm. This reduces the dimensionality of the iteration procedure.

Fig. 4. Kohonen map memorizing domains of the BLOCKS database (Henikoff and Henikoff, 1991). The following blocks of 18 domains from 11 families were selected and tagged by one letter. A: hemopexin domain proteins, BL00024C; B: trefoil domain proteins, BL00025; C: paired box domain proteins, BL00034A; D: POU domain proteins, BL00035B; E: kinesin motor domain proteins, BL00411E; F: apple domain proteins, BL00495A; G: apple domain proteins, BL00495C; H: apple domain proteins, BL00495D; I: apple domain proteins, BL00495H; J: apple domain proteins, BL00495I; K: apple domain proteins, BL00495K; L: apple domain proteins, BL00495L; M: fibrinogen β and γ chains, C-terminal domain proteins, BL00514E; N: somatomedin B domain proteins, BL00524C; O: somatomedin B domain proteins, BL00524D; P: cellulose-binding domain proteins, fungal type, BL00562; Q: osteonectin domain proteins, BL00612A; R: MAM domain proteins, BL00740D. The blocks, consisting of 159 sequences between 40 and 46 letters long, were trimmed to a common length of 40 by dropping positions from the C-terminal end (if >40). Fifty-four out of the 13 x 10 neurons on the grid 'contain' some block elements after training. Most blocks, while being projected into a compact region, were distributed over several neurons (subclustering). In all cases, neurons were assigned exclusively to segments from one domain family.

Storing secondary structure segments from the Brookhaven database

The PDB database (Bernstein et al., 1977) of protein structures may be complemented by annotation of all segments to which a definite secondary structure may be assigned (cf. Kabsch and Sander, 1983).
We decided to illustrate the usage of the Kohonen map tool for a collection of specimens of the most salient features of secondary structure, namely the α-helix (ranging in length from 5 to 40 residues, averaging around 10) and the β-pleated sheet (usually 5-10 residues).

Fig. 5. XGobi-produced screen output of the Kohonen map of Figure 4 (trained to memorize BLOCKS domains). Neurons that have memorized sequences are labeled by a • sign (for neuron content, see the previous figure). The seven neurons occupied by the A-family have been labeled previously by squares (lower-right corner). One of the two neurons labeled by filled circles has just been selected and says that it contains five specimens of the O-family. Further clicking opens the right-hand window showing the coordinates of the neuron (9,7) and the five blocks allocated to this memory element.

We collected the set of 1798 α-helical segments and 1461 β-sheet segments present in the FSSP database of representative protein structures (Holm and Sander, 1994). We presented all these segments to a Kohonen map for self-organizing cluster definition. Using the Jeffrey coding strategy (see System and methods), we avoided problems with the differing lengths of the relatively short segments; hence one large map could memorize all offered segments. We specified 62 x 62 = 3844 neurons to learn the above-mentioned 3259 structure segments, in keeping with the accepted rule that, in order to avoid over-individualization as well as over-generalization, there should be a slight excess in the number of neurons over the number of elements presented for learning. The self-organizing algorithm produced the map shown in Figure 7.
The training procedure converges to a final state in which each helix or β segment is projected to one definite neuron. A mouse click produces a pull-down list of all FSSP segments belonging to the domain of that neuron. In Figure 7, the neurons are labeled according to their 'content' of segments. Four classes are discerned: (i) pure α neurons (981); (ii) pure β neurons (817); (iii) mixed neurons (310 = 8% of all, containing both α and β segments); (iv) void neurons (no segment projected into them; 1736 = 45% of all). Thus, 3259 segments are assigned to 1798 neurons of classes (i)-(iii); i.e. somewhat less than two segments are assigned on average to one neuron. Sixty per cent of all neurons in classes (i)-(iii) contain just one segment, 23% contain two segments and 17% contain more than two segments in their domain. The highest content recorded (in 24 out of 1798 neurons) was an aggregation of seven segments in one neuron. A closer study of Figure 7 reveals the information structure in the final Kohonen map. It is seen that the assignment of segments to neurons does not produce global α or β segment regions. Instead, we have at best local aggregations of some neurons of the same class. Not rare are the examples where the nearest neighbor of a neuron of β class is one of α class, and vice versa. The whole picture suggests the conclusion that there is no clear continuous (let alone linear) separation (in the sense of a discriminator curve) between helix and sheet segments in the sequence space. Nevertheless, 78% of all segments are correctly projected into neurons of classes (i) or (ii), and are therefore correctly and unambiguously characterized by the Kohonen map.

Fig. 6. Sammon map of the BLOCKS domains selected for Figure 4. This picture emerges after 40 000 iteration steps of Sammon's steepest descent procedure fitting the signal distances to the map distances. Points: 130 neurons, 54 of them occupied (letter labels). The distance between points is (approximately) proportional to the distance between the neurons' reference signals in the 20 x 40-dimensional signal space. The tight cluster structure of the domain families (in particular of A, D, E, M) is clearly visible. Note that the domains tend to settle in the corners of the map, leaving the center unpopulated.

Fig. 7. Kohonen map of all α and β segments found in the FSSP database (Holm and Sander, 1994) of established protein structure elements. This database contains all segments with clearly defined α or β structure, obtained according to the procedure of Kabsch and Sander (1983). This resulted in a learning set for instruction of the Kohonen map (62 x 62 neurons). The figure displays all neuron positions and assigns to each a symbol according to one of the following four classes. (1) Cells with an open square: only α segments projected to this neuron. (2) Cells with a filled circle: only β segments projected to this neuron. (3) Cells with filled squares (1 and 2 superimposed): containing some α and some β segments (mixed neurons). (4) No symbol: neither α nor β segments residing here.

Discussion

In a previous paper (Hanke et al., 1996), we studied the Kohonen version of neural networks and showed that this algorithmic system is well suited to store, in an associative manner, the characteristic information present in protein sequence patterns. Here we apply this method to the visualization of such information. The sequence segment is transformed, using one of the accepted scoring systems, into a signal, and then mapped from the multidimensional signal space onto a two-dimensional grid filled with neuronal nodes. The term 'neuron' is used as a metaphor for cells which integrate weighted input signals and emit an output signal. The peculiarity of Kohonen networks is that their neurons no longer learn independently of each other: there is always a neighborhood region, rather than an isolated cell, which memorizes sequence signals (the principle of lateral inhibition in brain physiology, stating that regions more distant from the addressed memory are prevented from learning). The result is a transformation of similarity between sequence segments into proximity on a 2-D neuronal grid. Inspection of local neuron regions in such a lattice reveals the existence of proximity clusters in an immediately conspicuous way. The domain of one neuron is the ensemble of similar sequences projected into it. A reference vector is the representative ('center') of this projection field. There are two metrics carrying information on similarity and dissimilarity: the metric of the Kohonen lattice (distance between neurons), as opposed to the metric in the signal space, which is derived from the similarity criterion provided by the scoring scheme applied during the training phase. The neuronal metric (Figures 2, 4 and 7) preserves neighborhood relations between elements. It leads to a clear display of cluster and neighborhood structures. The score-derived metric, by contrast, which can be approximately displayed visually (with some difficulty due to dimension entangling) as a Sammon map (Figures 3 and 6), gives an impression of the 'true' distances between the encoded signals. Both types of display combined allow a direct visual evaluation of what is usually presented as long lists of items and cross-comparisons characterized by numerical scores or probability statements.

The mapping method described here demands that a sequence be coded as a signal, i.e. as an image of matrix entries. A frequent problem is that sequence segments of different length have to be compared.
This was tackled for short sequences by the length-independent fractal coding method, and for longer sequences by using (and pruning) alignments. Ferran and Ferrara (1991) used frequency tables as characteristics of a longer sequence, but this entails, of course, loss of the information residing in the succession of letters. It is noteworthy that the inner structure of domains (Figures 2 and 3), as well as an ensemble of domains (sequence blocks, Figures 4-6), are transformed into a much more compactly clustered form than are secondary structure segments (Figure 7). α-Helix segments and β-sheets appear on the map in a very patchy way, without the formation of large regions. The topology of the helix space and of the β-sheet space is extremely entangled. Nevertheless, neurons memorize individual subgroups in a way that permits correct distinction between secondary structure elements in many (78% of all) cases. Mapping of sequence segments onto points of a 2-D grid is a convenient visual display of one important ensemble property, their 'similarity' or 'homology'. The user of such a tool, when examining such a surface, may click to find more information relevant to the study, for instance which sequences of the data set belong to a certain neuron or neuron cluster. There is also a need for additional information from sequence databases pertinent to the sequences seen in the drop-down table attached to each neuron. The whole methodology of catchword-oriented information browsing may be integrated as background to each neuron of the Kohonen map. We consider the development of knowledge-based visualization tools a useful complement to the text-record-based database files which are in widespread use at the present time. It is the exploding amount of information that necessitates conspicuous and comprehensive integration of existing knowledge.
After an introductory period, the whole system and the Kohonen self-learning network (available via ftp from 130.233.168.48) are easy to handle and impress through their stability and reliability. Furthermore, we offer a WWW service (http://www.mdc-berlin.de) for the visualization of protein sequences, together with a number of convenient tools.

References

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542.

Caspers,G.J., Leunissen,J.A. and De Jong,W.W. (1995) The expanding small heat-shock protein family, and structure predictions of the conserved 'alpha-crystallin domain'. J. Mol. Evol., 40, 238-248.

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5, Suppl. 3, pp. 345-352.

Ferran,E.A. and Ferrara,P. (1991) Topological maps of protein sequences. Biol. Cybern., 65, 451-458.

Gribskov,M., Lüthy,R. and Eisenberg,D. (1990) Profile analysis. Methods Enzymol., 183, 146-159.

Hanke,J., Beckmann,G., Bork,P. and Reich,J.G. (1996) Self-organizing hierarchic networks for pattern recognition in protein sequence. Protein Sci., 5, 72-84.

Henikoff,S. and Henikoff,J.G. (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Res., 19, 6565-6572.

Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915-10919.

Holm,L. and Sander,C. (1994) The FSSP database of structurally aligned protein fold families. Nucleic Acids Res., 22, 3600-3609.

Jeffrey,H.J. (1990) Chaos game representation of gene structure. Nucleic Acids Res., 18, 2163-2170.

Kabsch,W. and Sander,C. (1983) How good are predictions of protein secondary structure? FEBS Lett., 155, 179-182.

Kohonen,T. (1989) Self-Organization and Associative Memory. Springer-Verlag, Berlin.

Lynch,T.J. (1985) Data Compression Techniques and Applications. Lifetime Learning Publications, Belmont.

Sammon,J.W. (1969) A nonlinear mapping for data structure analysis. IEEE Trans. Comput., C-18, 401-409.

Swayne,D.F., Cook,D. and Buja,A. (1991) User's Manual for XGobi, a Dynamic Graphics Program for Data Analysis Implemented in the X Window System. Bellcore Technical Memorandum.

Taylor,W.R. (1986) Identification of protein sequence homology by consensus template alignment. J. Mol. Biol., 188, 233-258.

Received on October 5, 1995; revised and accepted on August 9, 1996.