CABIOS Vol. 12 no. 6 1996, Pages 447-454
Kohonen map as a visualization tool for the
analysis of protein sequences: multiple
alignments, domains and segments of
secondary structures
Jens Hanke and Jens G. Reich
Abstract
The method of Kohonen maps, a special form of neural
networks, was applied as a visualization tool for the analysis
of protein sequence similarity. The procedure converts a sequence (domain, aligned sequence, or segment of secondary structure) into a characteristic signal matrix. This conversion depends on the property or replacement score vector selected by the user. Similar sequences have small
distance in the signal space. The trained Kohonen network is
functionally equivalent to an unsupervised non-linear cluster
analyzer. Protein families, or aligned sequences, or segments of similar secondary structure, aggregate as clusters,
and their proximity may be inspected on a color screen or on
paper. Pull-down menus permit access to background
information in the established text-oriented way.
Introduction
Computer analysis of protein sequence 'homology' in the
ever-growing data collections, e.g. with BLAST (Altschul
et al., 1990) or with PROFILE (Gribskov et al., 1990),
usually results in long lists of 'hits', consisting of sequence
identifiers, indicators of hit localization and score or
probability values. There is a need for graphic display
tools which make the information more directly conspicuous. Previous work (Ferran and Ferrara, 1991; Hanke
et al., 1996) has demonstrated that neural networks in the
special form of so-called Kohonen maps (Kohonen, 1989)
are a good tool to organize and store data for which
similarity or distance relations are defined. The map works
as a non-linear cluster analyzer, stores information in an
associative (rather than list-type) manner, and permits
the recognition and classification of so far unknown
information samples. In this paper, we propose Kohonen
maps as a versatile tool for the two-dimensional (2-D)
display of sequence similarity information. Background
text information may be retrieved in the usual way if
required.
Max-Delbrück-Center for Molecular Medicine, Department of Bioinformatics, Robert-Rössle-Straße 10, D-13125 Berlin-Buch, Germany
E-mail: hanke@bioinf.mdc-berlin.de
© Oxford University Press
System and methods
Signal coding of sequences
An accepted method of quantitative sequence analysis is
the position-wise evaluation of similarity scores. Such a
score may be derived from a vector (e.g. of physicochemical properties, in categorical or real-valued coding; see Taylor, 1986) or from a scalar quantity (e.g. comparison between aligned sequences using weight matrices like BLOSUM (Henikoff and Henikoff, 1992) or PAM (Dayhoff et al., 1978)). In this paper, we do not prescribe the
method of evaluation, but rather leave the choice to the
user. However, a concept is used that defines, and
organizes, similarity information in a unified way.
Sequences are represented as signals, assembled from
position-specific property vectors. Such a signal is a real-valued or binary matrix of dimension Q x L (where Q is the dimension of the property vector of the amino acid in the ith position, i = 1...L, and L is the length of the sequence). An example of this is the scoring method by
Taylor (1986), who represents each amino acid by a binary
vector of 11 physicochemical properties of this amino
acid. For a sequence this leads to a matrix, which as an
'image' is characteristic of that item. If similarity is
expressed in terms of entries of the accepted weight
matrices, which are usually evaluated by pairwise comparison of sequence elements, then a convenient property
vector of an amino acid is just the column vector pertinent
to that amino acid.
It expresses the scores of an amino acid when being
compared to one out of the 20 amino acids (itself
included). A single sequence is so converted to an image
matrix symbolizing the scoring capacity of that sequence.
For matrices M, N so obtained, if they have the same
dimension, a measure of signal dissimilarity is immediately
given by the Euclidean distance:
D(M,N) = SQRT( Σ_{i,j} (M_ij − N_ij)² )
with i, j running over the matrix entries.
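To make the coding concrete, the following sketch builds signal matrices from a small, hypothetical property table (three invented properties for four amino acids, not Taylor's actual 11-property set) and evaluates the distance D(M,N):

```python
import numpy as np

# Hypothetical property vectors for a few amino acids (illustration only;
# the paper uses e.g. Taylor's 11 physicochemical properties or BLOSUM columns).
PROPS = {
    "A": (1, 0, 1),   # invented values
    "N": (0, 0, 1),
    "D": (0, 1, 1),
    "K": (0, 1, 0),
}

def signal_matrix(seq):
    """Convert a sequence into its Q x L signal matrix (one property
    column vector per sequence position)."""
    return np.array([PROPS[aa] for aa in seq], dtype=float).T

def signal_distance(m, n):
    """Euclidean distance D(M, N) over all matrix entries."""
    return float(np.sqrt(((m - n) ** 2).sum()))
```

Two sequences of equal length then compare directly, e.g. `signal_distance(signal_matrix("AND"), signal_matrix("ANK"))`.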
One problem with this type of signal representation of a
sequence is that a distance concept is not applicable to
"" " 1
Po. N ,
0
0 0
0 0
0 0
0
0
0 0
0 0
0 0
0
0 0 0
tK n
0
0
0
0
0
0
0
0
0
0
0 /I%
o h
0
\0
0
0
0
0
0
0
0
0 0
\
0
1
( o
•
0
/
A *D
C *E
F N
G P
*H Q
I *R
*K S
L
M
T
V
W
Y
—
/H
^
/
_ J
1
H.
H D
K E
R
Na
Fig. 1. Coding of secondary structure segments. Jeffrey (1990) has developed a coding principle that represents each sequence in a unique way by a walk through a quadratic pixel space. In the figure, the amino acid alphabet is replaced by a four-property alphabet (hydrophobic, not hydrophobic, positive charge, negative charge). Each amino acid has one or two of these properties (see the inserted table). The graph for the peptide AND starts in the center and makes a step half-way to the coordinate of the property set of the first amino acid (A) in the sequence. Then it turns half-way into the direction of N's corner, continues with a half-step towards the D corner, and so on. In this way, each sequence is represented as a graphic pattern of straight lines oscillating through the space. The graph is then converted into a binary pattern matrix by inserting '1' (unity) into each quarter cell lying along the route of the graph. A further transformation (not shown here) according to the Quad-Tree principle (Lynch, 1985) results in a condensed, length-normalized and contour-preserving vector representation of the pattern matrix. We show the pixel values only in that quarter of the grid that has been traversed by the first letters of the example segment AND.
sequences of unequal length. To deal with this
difficulty, we applied the method of fractal encoding
(see below) to the image representation of short
sequences. For the treatment of longer sequences of
unequal length, an alignment is required, from which
one may either take gap-free sub-blocks or replace
gaps, if not too numerous, by an average vector (whose
elements are the mean values over the corresponding
entries of all 20 amino acids). A further method is to
represent sequences of different length by a matrix of
subsequence frequencies (e.g. dipeptide or oligoword
frequencies; see Ferran and Ferrara, 1991), but this
results in loss of information residing in the succession
of elements in the sequence.
Fractal encoding of sequences
After some experimentation, we encoded short sequence
segments of unequal length according to Jeffrey (1990). In
essence, this converts a sequence into a walk through a
quadratic area with pixel elements. The edges of that area
are labeled by the elements of the alphabet (amino acid
names or some transformation like physicochemical
properties). Each new sequence letter causes a jump halfway to the edge corresponding to the current letter. Jeffrey
recorded only the end points of the steps and filled the
pertinent pixels. In our applications (shorter segments), we
draw the whole route between both pixel points. The
example of Figure 1 shows, simplified for illustration, a
simple transformation of the amino acid alphabet into an
alphabet made of two basic properties. A zig-zag curve
unique for any sequence segment is obtained, which may
be scanned and encoded according to the Quad-Tree principle (Lynch, 1985), resulting in a binary vector
characteristic for the input segment. Similar segments
give characteristically similar code vectors suitable for
training of the Kohonen network. In particular, a
Hamming distance metric between the walk patterns is
established by this coding principle.
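A minimal sketch of this walk-and-pixel coding follows. The corner labels and grid size are illustrative; the paper additionally marks every quarter cell along the connecting lines and compresses the grid by the Quad-Tree principle, which is omitted here.

```python
# Chaos-game walk over a grid x grid pixel area (a sketch of Jeffrey's coding).
# Corners for a four-property alphabet: H = hydrophobic, N = not hydrophobic,
# '+' = positive charge, '-' = negative charge (corner placement is illustrative).
CORNERS = {"H": (0.0, 0.0), "N": (1.0, 0.0), "+": (0.0, 1.0), "-": (1.0, 1.0)}

def walk_pixels(seq, grid=8):
    """Return the set of grid pixels visited by the half-step walk."""
    x, y = 0.5, 0.5                           # start in the center
    pixels = set()
    for letter in seq:
        cx, cy = CORNERS[letter]
        x, y = (x + cx) / 2, (y + cy) / 2     # jump half-way to the corner
        pixels.add((int(x * grid), int(y * grid)))
    return pixels

def binary_vector(pixels, grid=8):
    """Flatten the visited-pixel set into a binary code vector."""
    return [1 if (i, j) in pixels else 0 for i in range(grid) for j in range(grid)]
```

Similar walks share pixels, so the Hamming distance between two such binary vectors reflects segment similarity.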
Jeffrey's method was found to work well with sequence
segments < 30 residues. The resulting walk trace was
typical enough for a sequence and sufficiently similar even
in the case of unequal length, but otherwise high
similarity. However, with increasing length, the walking
trace tends to become entangled and therefore increasingly
uninformative.
In the case of moderate or high length, we took recourse
to pruning sequences to equal length, which works well if
the length difference and hence the loss of information is
small. The more general problem of coding distinctly
length-variable sequences has so far not been satisfactorily
solved.
Kohonen mapping
A Kohonen map as modeled here is a computer program
that projects signals onto a 2-D map of 'neurons'. A
neuron is identified by its two coordinates in a quadratic
lattice. Between different neurons there is a map distance
defined: simply the distance between the coordinates. For
instance, the distance between neurons (6,3) and (3,7)
is SQRT((6 − 3)² + (3 − 7)²) = SQRT(9 + 16) = 5. A second distance metric is required for
the signal space. We selected the Euclidean distance
between code vectors (as described above) as such metrics.
One of the salient features of Kohonen mapping is the
non-linear relationship between the two distances, on the
neuron lattice and in the signal space. In effect, an n-dimensional signal vector is projected onto a 2-D neuron
lattice. This projection is intended to retain proximity: two
vectors close to each other in the signal space should also
be close on the projection map. The projection itself is non-linear and approximately topology preserving.
Each neuron on the map has a 'reference vector' in that
signal space, and it is possible to assign to any signal v the
'pertinent' neuron n on the map, namely that whose
reference vector w_n is closest to the signal (j running over all
neurons):
distance(v, w_n) = min_j { distance(v, w_j) }
In this way, for a certain set of reference vectors w_j, the
signal space is uniquely mapped into domains of the
pertinent neurons (the 'receptor field' of the neurons).
'Learning' is an iterative process, during which all input
signals sent out from the collection of sequences are
'shown' in turn and many times to the mapping, and each
time the reference vectors of the pertinent neuron (the
'winner') and its closest neighbors become updated so as
better to memorize this signal (shifting the reference vector
somewhat closer towards it). Over a prolonged cycle, this
produces reference vectors being in the center of the
feature specimens in the cluster around them.
As a result, we get a map of neurons whose reference
vectors populate the signal space such that densely
populated input regions also have a dense representation
(many neurons) on the map, whereas in sparsely
populated input regions the reference vectors occur
rarely (i.e. reference vectors of neighboring neurons have
a large distance).
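The learning cycle described above can be sketched as follows. This is a minimal self-organizing-map loop, not the authors' implementation; the lattice size, decay schedule and Gaussian neighborhood kernel are illustrative choices.

```python
import numpy as np

def train_som(signals, rows=6, cols=6, steps=5000,
              lr0=0.2, lr1=0.05, radius0=3.0, radius1=0.5, seed=0):
    """Train reference vectors on a rows x cols lattice: each step picks a
    signal, finds the winner neuron, and pulls the winner's and its lattice
    neighbors' reference vectors toward the signal."""
    rng = np.random.default_rng(seed)
    refs = rng.random((rows * cols, signals.shape[1]))
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    for t in range(steps):
        frac = t / steps
        lr = lr0 * (lr1 / lr0) ** frac                   # decaying learning rate
        radius = radius0 * (radius1 / radius0) ** frac   # shrinking neighborhood
        v = signals[rng.integers(len(signals))]
        winner = np.argmin(((refs - v) ** 2).sum(axis=1))      # closest reference
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)      # lattice distance^2
        h = np.exp(-d2 / (2 * radius ** 2))                    # neighborhood kernel
        refs += lr * h[:, None] * (v - refs)                   # shift toward signal
    return refs

def winner_of(refs, v):
    """Index of the neuron whose reference vector is closest to v."""
    return int(np.argmin(((refs - v) ** 2).sum(axis=1)))
```

After training, `winner_of` assigns any signal (trained or novel) to its pertinent neuron; the residual distance to that neuron's reference vector is the quantization error.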
A sequence cluster emerges in this transformation as a
set of input signals projecting onto the same neuron or
onto its immediate neighbors on the map. Their distance
from the pertinent reference vector in the signal space
(called 'quantization error') is small. Any 'non-member' of
the cluster will become projected either on distant neurons
(this occurs when relatives of it have been offered in the
training set), or (if the pattern is unprecedented) it will be
projected on a neuron that happens by chance to be at
minimum distance, but with large quantization error.
The training time depends on the data set. The set of secondary structure segments, which consists of 3844 training patterns, required 2.3 CPU h on a SPARC 20 machine. Training consists of two phases: an ordering phase and a fine-tuning phase. The respective training parameters are:
number of steps 40 000/65 000, learning rate 0.2/0.05 and
neighborhood radius 20/5. Both additional examples,
block and multiple alignment method, use the following
training parameters: number of steps 1000/10 000,
learning rate 0.05/0.02 and neighborhood radius 10/3.
All these parameters approach unity during the training
course.
Sammon mapping
A Kohonen map provides a conspicuous 2-D arrangement
of neurons, but their distance measure is qualitative only,
in the sense that one does not see the 'true' distances in the
multi-dimensional space of sequence signals. Such n-dimensional signal distances cannot be exactly displayed
visually. However, the method of Sammon mapping
(Sammon, 1969) makes it possible to generate a 2-D
map where the signal distances are retained at least
approximately. If D_ij are the true distances in the signal space (i and j running to n, the number of sequences projected onto the Sammon area) and D*_ij are the corresponding distances on the Sammon area, then the positions of the projected signals are improved by a stepwise reduction of the fit criterion E:

E = (1 / Σ_{i<j} D_ij) · Σ_{i<j} (D_ij − D*_ij)² / D_ij
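The fitting procedure can be sketched as follows. This uses plain gradient descent on E with a random initial layout; Sammon's 1969 paper uses a second-order step-size rule instead, so this is an assumption-laden simplification, not the original algorithm.

```python
import numpy as np

def pairwise(x):
    """Matrix of Euclidean distances between the rows of x."""
    return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))

def sammon_stress(D, d):
    """Sammon's fit criterion E over the upper triangle of D and d."""
    iu = np.triu_indices_from(D, 1)
    return float((((D[iu] - d[iu]) ** 2) / D[iu]).sum() / D[iu].sum())

def sammon(signals, steps=500, lr=0.1, seed=0):
    """Project signals to 2-D by stepwise reduction of E (plain gradient
    descent sketch, not Sammon's original second-order update)."""
    D = pairwise(signals)
    c = D[np.triu_indices_from(D, 1)].sum()
    y = np.random.default_rng(seed).random((len(signals), 2))
    for _ in range(steps):
        d = pairwise(y)
        np.fill_diagonal(d, 1.0)                         # guard the diagonal
        w = (D - d) / (np.where(D > 0, D, 1.0) * d)      # (D_ij - d_ij)/(D_ij d_ij)
        np.fill_diagonal(w, 0.0)
        grad = (-2.0 / c) * (w[:, :, None] * (y[:, None, :] - y[None, :, :])).sum(1)
        y -= lr * grad                                   # descend on E
    return y
```

Points that sit closer on the map than in the signal space repel each other, and vice versa, so the layout relaxes toward the true distances.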
Retrieval of background information
Sammon and Kohonen maps visualize only one aspect of
a sequence, namely its similarity to others according to a
scoring principle. Tools to retrieve background information pertinent to any sequence or group of sequences
inspected on a map are required. For this we used a
program called XGobi (Swayne et al., 1991) which brings
contemporary dynamic graphics for statistics to the
workstation environment. Together, the Kohonen map and XGobi offer a biologist or chemist the power of motion and rapid surveys for discovering and understanding local and global relationships between sequences.
With a pull-down menu, it is possible to show the
contents of each neuron (e.g. identifier code, position and
amino acid sequence), and to switch to information
present in further databases.
Table I. Alignment of a set of 25 HSP-20 C-terminal domain sequences. The left-most column contains the SWISS-PROT codes of the proteins. The first row is a consensus sequence composed of the highest-average-scoring amino acids in each position
consensus
14KD_MYCTU
18KD_MYCLE
CRA2_MOUSE
HS11_CAEEL
HS11_HELAN
HS18_CLOAB
HS20_NIPBR
HS22_DROME
HS22_PHANI
HS26_YEAST
HS27_CRILO
HS2C_ARATH
HS2C_CHERU
HS2C_CHLRE
HS2C_PETHY
HS2C_WHEAT
HS3C_XENLA
HS6A_DROME
HS6C_DROME
IBPA_ECOLI
IBPB_ECOLI
OV21_ONCVO
P40_SCHMA
SP21_STIAU
YKZ1_CAEEL
[aligned sequence rows not reproduced]
Results
Associative storage of a protein domain: HSP-20 family
HSP-20 (see Henikoff and Henikoff, BLOCKS Version
8.0) is a family of proteins induced by heat shock. For
demonstration, we selected here the C-terminal domain of
these proteins as well as the alignment of a representative
subset of this family (G.Beckmann, personal communication; for details, see also Caspers et al., 1995). This
alignment of 25 sequences is reproduced in Table I.
Extracting the pertinent column of the BLOSUM 62
weight matrix (Henikoff and Henikoff, 1992) as vector for
an amino acid, one obtains a 20 x 75 matrix image
characteristic for each specimen. We offered these 25
images to a learning program and obtained a Kohonen
map as displayed in Figure 2. It is clearly seen that the
Kohonen map memorizes this data set by allocating
certain clusters of similar sequences into specific neurons.
The 'content' of a neuron may be obtained by clicking a
pull-down menu, from which further information out of
the database entries of that sequence may be called. It is
noteworthy that the 'consensus' sequence (according to
Gribskov et al., 1990), made of amino acids which in each
position would score best against the amino acids as
present in the alignment, projects into the center of the
map. The subclustering of the domain sample according to
BLOSUM 62 similarity deserves further study, but will
not be pursued here.
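The column-extraction step above can be sketched as follows, with a toy 4-letter replacement-score matrix standing in for the real 20 x 20 BLOSUM 62 matrix (the numerical values here are invented for illustration). Gap handling by the average vector, as described in System and methods, is included.

```python
import numpy as np

# Toy symmetric replacement-score matrix over a 4-letter alphabet
# (illustrative values only; the paper uses BLOSUM 62 over 20 amino acids).
ALPHABET = "ACDE"
W = np.array([[ 4.0,  0.0, -2.0, -1.0],
              [ 0.0,  9.0, -3.0, -4.0],
              [-2.0, -3.0,  6.0,  2.0],
              [-1.0, -4.0,  2.0,  5.0]])
COLUMN = {aa: W[:, i] for i, aa in enumerate(ALPHABET)}
GAP = W.mean(axis=1)          # average vector substituted for alignment gaps

def image_matrix(aligned_row):
    """Turn one alignment row into its score-column image; '.' marks a gap."""
    cols = [GAP if aa == "." else COLUMN[aa] for aa in aligned_row]
    return np.stack(cols, axis=1)
```

Applied to the 25 rows of Table I with the full BLOSUM 62 matrix, this construction yields the 20 x 75 images offered to the learning program.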
Figure 2 gives a clear impression of similarity and
distance of the sequences present in the alignment.
However, the distance between neurons can be interpreted
only in a qualitative manner. One does not see how close in
the signal space are the sequences pertinent to neighboring
neurons, nor can one gauge the precise distance to the
content of remote neurons. To answer questions of this
type, one may construct a Sammon map as depicted in
Figure 3. This picture is scaled in units of the signal distance
Fig. 2. Kohonen map memorizing the HSP-20 C-terminal domain of heat
shock proteins. The pull-down menu of each occupied neuron is
displayed. The rows of the alignment in Table I were presented in turn
to the learning algorithm together with the BLOSUM 62 weight matrix as
developed by Henikoff and Henikoff (1992). Ten out of the 16 neurons of
this map are 'occupied' after training, i.e. sequences of the set are
projected into them. It is seen that some of the sequences project into a
common neuron (indicating subclustering of more closely related
specimens). The 2-D grid visualizes this behavior. Note that the
consensus sequence is positioned in the center of the map.
Fig. 3. Sammon map of the HSP-20 heat shock protein family. The 16
neurons of the lattice in Figure 2 are now displayed on a map scaled by
the Euclidean distance of the full multidimensional signal space as
defined in Sammon (1969). Nodes denote neurons. The distance on paper
between any two nodes is approximately proportional to the distance in
signal space between the pertinent reference vectors. Neighbors in the
lattice (Figure 2) are identified by a connection line. Occupied neurons are
labeled by one of the pertinent sequences.
between the reference vectors of each neuron. It gives a
better impression of the mutual distance between members
of subclusters (note that the 2-D projection of n-dimensional points can represent distance only approximately).
Memorizing many domains of the BLOCKS database
Here we show a somewhat larger neuronal network that
has been trained to memorize different domain sequences from the BLOCKS database. We selected, for the sake of
illustration, all domains of length 40 or slightly longer (up
to 46 residues). We trimmed all of them to 40 residues
exactly by just cutting off a few surplus positions (if any).
The 130-neuron network proved able to place all
sequences into distinct areas. Figure 4 displays the
occupancy showing that the families form clusters
comprising several neurons. Figure 5 is a screen snapshot
(produced by the XGobi tool) showing how the memory
information and related items may be addressed. The
example shows the associative performance of the
Kohonen map memory.
Again, a Sammon mapping may be consulted to get an
impression of the signal distances (rather than neuron
distances) between different sets of entries (Figure 6). This
Fig. 4. Kohonen map memorizing domains of the BLOCKS database (Henikoff and Henikoff, 1991). The following blocks of 18 domains from 11 families were selected and tagged by one letter. A: hemopexin domain proteins, BL00024C; B: trefoil domain proteins, BL00025; C: paired box domain proteins, BL00034A; D: POU domain proteins, BL00035B; E: kinesin motor domain proteins, BL00411E; F: apple domain proteins, BL00495A; G: apple domain proteins, BL00495C; H: apple domain proteins, BL00495D; I: apple domain proteins, BL00495H; J: apple domain proteins, BL00495I; K: apple domain proteins, BL00495K; L: apple domain proteins, BL00495L; M: fibrinogen β and γ chains, C-terminal domain proteins, BL00514E; N: somatomedin B domain proteins, BL00524C; O: somatomedin B domain proteins, BL00524D; P: cellulose-binding domain proteins, fungal type, BL00562; Q: osteonectin domain proteins, BL00612A; R: MAM domain proteins, BL00740D. The blocks, consisting of 159 sequences between 40 and 46 letters long, were trimmed to a common length of 40 by dropping positions from the C-terminal end (if >40). Fifty-four out of the 13 x 10 neurons on the grid 'contain' some block elements after training. Most blocks, while being projected into a compact region, were distributed on several neurons (subclustering). In all cases, neurons were assigned exclusively to segments from one domain family.
Sammon map was obtained by projecting the whole set of 159 individual sequences with the 2-D Sammon algorithm. One might prefer (not shown) a picture where only the reference vectors (i.e. the averages of each family or subfamily) are processed by the Sammon algorithm. This reduces the dimensionality of the iteration procedure.
Storing secondary structure segments from Brookhaven
database
The PDB database (Bernstein et al., 1977) of protein
structures may be complemented by annotation of all
segments to which a definite secondary structure may be
assigned (cf. Kabsch and Sander, 1983). We decided to
illustrate the usage of the Kohonen map tool for a
collection of specimens of the most salient features of
secondary structure, namely α-helix (ranging in length from 5 to 40, average around 10 residues) and β-pleated sheet (usually 5-10 residues).
Fig. 5. XGobi-produced screen output of the Kohonen map of Figure 4 (trained to memorize BLOCKS domains). Neurons that have memorized sequences are labeled by a • sign (for neuron content, see the previous figure). The seven neurons occupied by the A-family have been labeled previously by squares (lower-right corner). One of the two neurons labeled by filled circles has just been selected and says that it contains five specimens of the O-family. Further clicking opens the right-hand window showing the coordinates of the neuron (9,7) and the five blocks allocated to this memory element.
We collected the set of 1798 α-helical segments and 1461 β-sheet segments as present in the FSSP database of
representative protein structures (Holm and Sander,
1994). We presented all these segments to a Kohonen
map for self-organizing cluster definition. Using the
Jeffrey coding strategy (see System and methods), we
avoided problems with the different length of the relatively
short segments; hence one large map could memorize all
offered segments.
We specified 62 x 62 = 3844 neurons to learn the above-mentioned 3259 structure segments, in keeping with the accepted rule that, in order to avoid over-individualization as well as over-generalization, there should be a slight excess in the number of neurons over the number of elements presented for learning.
The self-organized algorithm produced the map shown
in Figure 7. The training procedure converges to a final state in which each helix or β segment is projected to one definite neuron. A mouse click produces a pull-down list
of all FSSP segments belonging to the domain of that
neuron. In Figure 7, the neurons are labeled in accordance
with what their 'content' of segments is. Four classes are
discerned: (i) pure α neurons (981 specimens); (ii) pure β neurons (817 specimens); (iii) mixed neurons (310 specimens = 8% of all, containing both α and β segments); (iv) void neurons (no segment projected into them; 1736 pieces = 45% of all). Thus, 3259 segments are
assigned to 1798 neurons of classes (i)-(iii); i.e. somewhat
less than two segments are assigned on average to one
neuron. Sixty percent of all neurons in classes (i)-(iii) contain just one segment, 23% contain two segments and 17% contain more than two segments in their domain.
The highest content recorded was (in 24 out of 1798
neurons) an aggregation of seven segments in one neuron.
A closer study of Figure 7 reveals the information
structure in the final Kohonen map. It is seen that the
assignment of segments to neurons does not produce
global α or β segment regions. Instead, we have at best local aggregations of some neurons of the same class. Not rare are the examples where the nearest neighbor of a neuron of β class is one of α class, and vice versa. The
whole picture suggests the conclusion that there is no clear
continuous (let alone linear) separation (in the sense of a
discriminator curve) between helix and sheet segments in
Fig. 6. Sammon map of the BLOCKS domains selected for Figure 4. This
picture emerges after 40 000 iteration steps of Sammon's steepest descent
procedure fitting the signal distances to the map distances. Points: 130
neurons, 54 of them occupied (letter label). The distance between points is
(approximately) proportional to the distance of the neuron's reference
signal in the 20 x 40-dimensional signal space. The tight cluster structure
of the domain families (in particular of A, D, E, M) is clearly visible. Note
that the domains tend to settle in corners of the map, leaving the center
unpopulated.
the sequence space. Nevertheless, 78% of all segments
become correctly projected into neurons of classes (i) or
(ii), and are therefore correctly and unambiguously
characterized by the Kohonen map.
Discussion
In a previous paper (Hanke et al., 1996), we studied the
Kohonen version of neural networks and showed that this
algorithmic system is well suited to store, in an associative
manner, the characteristic information as present in
protein sequence patterns. Here we apply this method to
the visualization of such information. The sequence
segment is transformed, using one of the accepted scoring
systems, into a signal and then mapped from the multidimensional signal space onto a two-dimensional grid
filled with neuronal nodes. The term 'neuron' is used as a
metaphor for cells which integrate weighted input signals
and emit an output signal. The peculiarity of Kohonen
networks is that their neurons no longer learn independently of each other: there is always a neighborhood
region rather than an isolated cell which is memorizing
sequence signals (principle of lateral inhibition in brain
physiology, stating that regions more distant from the
addressed memory are prevented from learning). The
result is transformation of similarity of sequence segments
into proximity on a 2-D neuronal grid. Inspection of local
Fig. 7. Kohonen map of all α and β segments found in the FSSP database (Holm and Sander, 1994) of established protein structure elements. This database contains all segments with clearly defined α or β structure, obtained according to the procedure of Kabsch and Sander (1983). This resulted in a learning set for instruction of the Kohonen map (62 x 62 neurons). The figure displays all neuron positions and assigns to them a symbol according to one of the following four classes. (1) Cells with open square: only α segments projected to this neuron. (2) Cells with filled circle: only β segments projected to this neuron. (3) Cells with filled squares (1 and 2 superimposed): containing some α and some β segments (mixed neurons). (4) No symbol: neither α nor β segments residing here.
neuron regions in such a lattice reveals the existence of
proximity clusters in an immediately conspicuous way.
The domain of one neuron is the ensemble of similar
sequences projected into it. A reference vector is the
representative ('center') of this projection field.
There are two metrics carrying information on
similarity and dissimilarity: the metrics of the Kohonen
lattice (distance between neurons) as opposed to metrics
in the signal space, which is derived from the similarity
criterion provided by the scoring scheme applied during
the training phase. The neuronal metric (Figures 2, 4 or
7) preserves neighborhood relations between elements. It
leads to a clear display of cluster and neighborhood
structures. The score-derived metric, by contrast, which
can be approximately displayed visually (with some
difficulty due to dimension entangling) as Sammon map
(Figures 3 and 6), gives an impression of the 'true'
distances between encoded signals. Both types of display
combined allow a direct visual evaluation of what is
usually presented by long lists of items and cross-comparisons characterized by numerical scores or
probability statements.
The mapping method described here demands that a
sequence is coded as a signal, i.e. as an image of matrix
entries. A frequent problem is that sequence segments of
different length have to be compared. This was tackled for
short sequences by the length-independent fractal coding
method, and for longer sequences by using (and pruning)
alignments. Ferran and Ferrara (1991) used frequency
tables as characteristics of a longer sequence, but this
entails, of course, loss of information residing in the
succession of letters.
It is noteworthy that the inner structure of domains
(Figures 2 and 3) as well as an ensemble of domains
(sequence blocks, Figures 4-6) are transformed into a much more compactly clustered form than are secondary structure segments (Figure 7). α-Helix segments and β-sheets appear on the map in a very patchy way without the formation of large regions. The topology of the helix space and of the β-sheet space is extremely entangled. Nevertheless, neurons memorize individual subgroups in a way
that permits correct distinction between secondary elements in many (78% of all) cases.
Mapping of sequence segments onto points of a 2-D
grid is a convenient visual display of one important
ensemble property, their 'similarity' or 'homology'. The
user of such a tool, when examining such a surface, may
click to find more information relevant to the study, for
instance which sequences of the data set belong to a
certain neuron or neuron cluster. There is also a need for
additional information from sequence databases pertinent
to sequences seen on the drop-down table attached to each
neuron. The whole methodology of catchword-oriented
information browsing may be integrated as background to
each neuron of the Kohonen map.
We consider the development of knowledge-based
visualization tools as a useful complement to the text-record-based database files which are in widespread use at the present time. It is the exploding amount of information
that necessitates conspicuous and comprehensive integration of existing knowledge. After an introductory period,
the whole system and the Kohonen self-learning network
(available: ftp 130.233.168.48) are easy to handle and impress with their stability and reliability. Furthermore, we offer a WWW service (http://www.mdc-berlin.de) for the visualization of protein sequences and additionally a number of convenient tools.
References
Altschul,S., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542.
Caspers,G.J., Leunissen,J.A. and De-Jong,W.W. (1995) The expanding small heat-shock protein family, and structure predictions of the conserved 'alpha-crystallin domain'. J. Mol. Evol., 40, 238-248.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. V, Suppl. 3, pp. 345-352.
Ferran,E.A. and Ferrara,P. (1991) Topological maps of protein sequences. Biol. Cybern., 65, 451-458.
Gribskov,M., Lüthy,R. and Eisenberg,D. (1990) Profile analysis. Methods Enzymol., 183, 146-159.
Hanke,J., Beckmann,G., Bork,P. and Reich,J.G. (1996) Self organizing hierarchic networks for pattern recognition in protein sequence. Protein Sci., 5, 72-84.
Henikoff,S. and Henikoff,J.G. (1991) Automated assembly of protein blocks for data base searching. Nucleic Acids Res., 19, 6565-6572.
Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915-10919.
Holm,L. and Sander,C. (1994) The FSSP database of structurally aligned protein fold families. Nucleic Acids Res., 22, 3600-3609.
Jeffrey,H.J. (1990) Chaos game representation of gene structure. Nucleic Acids Res., 18, 2163-2170.
Kabsch,W. and Sander,C. (1983) How good are predictions of protein secondary structure? FEBS Lett., 155, 179-182.
Kohonen,T. (1989) Self-Organization and Associative Memory. Springer-Verlag, Berlin.
Lynch,T.J. (1985) Data Compression Techniques and Applications. Lifetime Learning Publications, Belmont.
Swayne,D.F., Cook,B.D. and Buja,A. (1991) User's Manual for XGobi, a Dynamic Program of Data Analysis Implemented in the X Window System. Bellcore Technical Memorandum.
Taylor,W.R. (1986) Identification of protein sequence homology by consensus template alignment. J. Mol. Biol., 188, 233-258.

Received on October 5, 1995; revised and accepted on August 9, 1996