© 1999 Oxford University Press
Nucleic Acids Research, 1999, Vol. 27, No. 1
275–279
The CATH Database provides insights into protein
structure/function relationships
C. A. Orengo, F. M. G. Pearl, J. E. Bray, A. E. Todd, A. C. Martin, L. Lo Conte and
J. M. Thornton
Department of Biochemistry and Molecular Biology, Darwin Building, University College London, Gower Street,
London WC1E 6BT, UK
Received October 2, 1998; Revised October 13, 1998; Accepted October 28, 1998
ABSTRACT
We report the latest release (version 1.4) of the CATH
protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath).
This is a hierarchical classification of
13 359 protein domain structures into evolutionary
families and structural groupings. We currently identify 827 homologous families in which the proteins have
both structural similarity and sequence and/or functional similarity. These can be further clustered into
593 fold groups and 32 distinct architectures. Using
our structural classification and associated data on
protein functions, stored in the database (EC identifiers, SWISS-PROT keywords and information from
the Enzyme database and literature) we have been able
to analyse the correlation between the 3D structure
and function. More than 96% of folds in the PDB are
associated with a single homologous family. However,
within the superfolds, three or more different functions
are observed. Considering enzyme functions, more
than 95% of clearly homologous families exhibit either
single or closely related functions, as demonstrated by
the EC identifiers of their relatives. Our analysis
supports the view that determining structures, for
example as part of a ‘structural genomics’ initiative,
will make a major contribution to interpreting genome
data.
INTRODUCTION
The CATH classification of protein domain structures was
established in 1993 (1) as a hierarchical clustering of protein
domain structures into evolutionary families and structural
groupings, depending on sequence and structure similarity. There
are four major levels, corresponding to protein class, architecture,
topology or fold and homologous family (Fig. 1). Since 1995,
information about these structural groups and protein families has
been accessible over the Web (http://www.biochem.ucl.ac.uk/bsm/cath),
together with summary information about each
individual protein structure (PDBsum) (2).
Figure 1. Schematic representation of the (C)lass, (A)rchitecture and
(T)opology/fold levels in the CATH database.
Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies, for the subtilisin family (CATH id: 3.40.50.200). Tables
display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural
alignment shown has been coloured according to secondary structure assignments (red for helices, blue for strands).
*To whom correspondence should be addressed. Tel: +44 171 419 3284; Fax: +44 171 380 7193; Email: [email protected]
CATH consists of both phylogenetic and phenetic descriptors
for protein domain relationships. At the lowest levels in the
hierarchy, proteins are grouped into evolutionary families
(homologous families) if they have either significant sequence
similarity (≥35% identity) or high structural similarity and some
sequence similarity (≥20% identity). Structural similarity is
assessed using an automatic method (SSAP) (3,4), which scores
100 for identical proteins and generally returns scores above 80
for homologous proteins. More distantly related folds generally
give scores above 70 (Topology or fold level), though in the
absence of any sequence or functional similarity this may simply
represent examples of convergent evolution, reinforcing the
hypothesis that there exists a limited number of folds in nature
(5,6).
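The grouping rules quoted above can be sketched as a small decision function. This is only an illustration, assuming that the thresholds given in the text (35% and 20% sequence identity, SSAP scores of 80 and 70) act as hard cut-offs; in practice borderline cases are validated manually. The names `PairComparison` and `classify_pair` are hypothetical and are not part of CATH or SSAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairComparison:
    """Result of comparing two domains (hypothetical container)."""
    ssap_score: float    # SSAP structural score, 0-100
    seq_identity: float  # percent sequence identity, 0-100


def classify_pair(pair: PairComparison) -> str:
    """Return 'H' (homologous family), 'T' (same fold only) or 'none'.

    The cut-offs follow the values quoted in the text; treating them
    as strict thresholds is an assumption of this sketch.
    """
    # Significant sequence similarity alone is sufficient.
    if pair.seq_identity >= 35.0:
        return "H"
    # High structural similarity plus some sequence similarity.
    if pair.ssap_score >= 80.0 and pair.seq_identity >= 20.0:
        return "H"
    # Structural similarity only: same fold, common ancestry unproven.
    if pair.ssap_score >= 70.0:
        return "T"
    return "none"
```

For example, `classify_pair(PairComparison(ssap_score=85.0, seq_identity=10.0))` is placed at the fold level only, because the sequence evidence is too weak to support homology on its own.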
The Architecture level in CATH groups proteins whose folds
have similar 3D arrangements of secondary structures (e.g.,
barrel, sandwich or propeller), regardless of their connectivity,
whilst the top level, Class, simply reflects the proportion of
α-helix or β-strand secondary structures. Three major classes are
recognised, mainly-α, mainly-β and α–β, since analysis revealed
considerable overlap between the α+β and alternating α/β
classes, originally described by Levitt and Chothia (7).
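A CATH identifier such as 3.40.50.200 (the subtilisin family) encodes the four levels of the hierarchy as dot-separated fields. The sketch below splits an identifier into its C, A, T and H prefixes so that two domains can be compared at any level. The function names are hypothetical, and the mapping of class numbers to class names is an assumption based on the usual CATH numbering, not something stated in this text.

```python
# Assumed class numbering (usual CATH convention, not given in the text).
CLASS_NAMES = {
    "1": "mainly-α",
    "2": "mainly-β",
    "3": "α–β",
    "4": "few secondary structures",
}


def parse_cath_id(cath_id: str) -> dict:
    """Split a four-field CATH id into cumulative level prefixes."""
    fields = cath_id.strip().split(".")
    if len(fields) != 4 or not all(f.isdigit() for f in fields):
        raise ValueError(f"expected four numeric fields, got {cath_id!r}")
    return {
        "class": fields[0],
        "architecture": ".".join(fields[:2]),
        "topology": ".".join(fields[:3]),
        "homologous_family": ".".join(fields),
    }


def same_level(id_a: str, id_b: str, level: str) -> bool:
    """True if two ids agree down to the given level of the hierarchy."""
    return parse_cath_id(id_a)[level] == parse_cath_id(id_b)[level]
```

With this representation, two domains share a fold when `same_level(a, b, "topology")` is true, and belong to the same homologous family only when all four fields agree.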
Before classification, multidomain proteins are first separated
into their constituent folds using a consensus method which seeks
agreement between three independent algorithms (8). Whilst the
protocol for updating CATH is largely automatic (9), several
stages require manual validation, in particular establishing
domain boundaries in proteins for which no consensus could be
reached and in checking the relationships of very distant
homologues and proteins having borderline fold similarity.
Although there are plans to assign the more regular architectures
automatically, all architecture groupings are currently assigned
manually.
A homologous family Dictionary is now available within
CATH, which contains functional data, where available, for each
protein within a homologous family. This includes EC identifiers,
SWISS-PROT keywords and information from the Enzyme
database or the literature (Fig. 2). Multiple structure-based
alignments are also available, coloured according to secondary
structure assignments or residue properties, and there are schematic plots showing domain representations annotated by
protein–ligand interactions (DOMPLOTS) (A.E.Todd,
C.A.Orengo and J.M.Thornton, submitted to Protein Engng.).
Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to
protein class (red, mainly-α; green, mainly-β; yellow, α–β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families
in CATH whilst each band in the outer wheel corresponds to a single fold family. The size of each ‘fold band’ therefore reflects the number of homologous families
having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands, containing many
homologous families. The inner wheel shows the population of homologous families in the different architectures.
The topology of each domain is illustrated by schematic TOPS
diagrams (http://www3.ebi.ac.uk/tops; 10).
We have also recently set up a Web Server (11), which enables
the user to scan the CATH database with a newly determined
protein structure and identify possible fold similarities or
evolutionary relationships. There are also plans to incorporate
sequence searches (using BLAST or PSI-BLAST) (12) to
identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains
9342 protein chains from the PDB (13), which divide into 13 359
domain folds. Currently 32 different architectures are recognised.
Since the last release, three new architectures have been
described, including the five-bladed α–β propeller. Grouping
proteins on the basis of sequence, structure and functional
similarity gives 827 evolutionary homologous families (H-level),
whilst recognising more distant structural similarity with no
accompanying sequence or functional similarity gives rise to 593
different fold groups (T-level).
The population of the different levels in the CATH hierarchy is
illustrated by the CATH wheel shown in Figure 3. It can be seen
that several highly populated fold families, which we describe as
superfolds (6), as they support a diverse range of sequences and
more than three different functions, still account for nearly 30%
of non-homologous structures.
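The population statistics summarised by the wheel plot amount to counting homologous families per fold group. The sketch below shows one way to do this from a list of family identifiers. The identifiers in `toy_families` are invented for illustration and are not real CATH entries, and the superfold criterion used here (three or more homologous families sharing a fold) is a simplification of the definition used in the paper.

```python
from collections import defaultdict


def families_per_fold(family_ids):
    """Count distinct homologous families (C.A.T.H) in each fold (C.A.T)."""
    folds = defaultdict(set)
    for family_id in family_ids:
        fold_id = family_id.rsplit(".", 1)[0]  # drop the H-level field
        folds[fold_id].add(family_id)
    return {fold: len(members) for fold, members in folds.items()}


def summarise(family_ids, superfold_min=3):
    """Report the number of folds, the single-family fraction and superfolds."""
    counts = families_per_fold(family_ids)
    if not counts:
        raise ValueError("no family identifiers supplied")
    single = [fold for fold, n in counts.items() if n == 1]
    superfolds = [fold for fold, n in counts.items() if n >= superfold_min]
    return {
        "n_folds": len(counts),
        "single_family_fraction": len(single) / len(counts),
        "superfolds": sorted(superfolds),
    }


# Invented identifiers, for illustration only.
toy_families = [
    "1.10.1.1", "1.20.2.1", "2.40.3.1",
    "3.40.50.1", "3.40.50.2", "3.40.50.3",
    "3.20.20.1", "3.20.20.2",
]
```

On the toy list, three of the five folds contain a single family and only the fold holding three families is reported as a superfold.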
IMPLICATIONS FOR STRUCTURAL GENOMICS
As the sequence databases grow rapidly, the need to interpret
these sequences and assign functions to specific genes becomes
increasingly important. Many techniques exist for matching
protein sequences and thereby inheriting functional information.
However, for very distant homologues there is often no detectable
sequence similarity, despite conservation of 3D structure and
function. For these cases, evolutionary relationships and thereby
functions can only be assigned by comparing the structures.
Therefore, a number of structural genomics initiatives are being
proposed (14) which aim to identify all the folds in nature with
the ultimate goal of being able to predict the function of a new
protein from its known or probable structure. Two important
questions follow: how many more folds do we need to
determine before we have the complete set, and how confident
can we be in assigning function between proteins having similar
structures?
In the current genomes, on average only 30–46% of sequences
can be assigned to a structural family, by recognising sequence
similarity to a protein of known structure (15,16). With only ∼600
unique structures currently in the PDB, compared with ∼20 000
sequence families, it is clear that we still need to determine many
more structures if we are to understand biology at the molecular
level.
However, analysis of recently deposited structural data is very
revealing. Figure 4a illustrates the distribution of 2159 new
structural domains classified in the 10 months from June 1997 to
March 1998. A large proportion of these (79%) were clearly
homologous (≥30% identity) to proteins of known structure.
Figure 4. Pie charts showing the proportion of 2159 recently deposited
structures that match structures in CATH. (a) Proportion of new structures
matching by sequence alignment (21) or structure alignment (SSAP) (3).
(b) Proportion of new non-homologous structures (<30% sequence identity to
any previous CATH entry) that match previous CATH entries by structure.
Those which have more than 20% sequence identity, measured after structural
alignment, or functional similarity, are assigned as homologues. The remaining
structures are analogues, having no clear evolutionary relationship.
Of the remaining 443 structures (Fig. 4b) corresponding to
‘new’ sequences, we found only 8% were novel folds, the
remainder resembling a previously determined structure. Many
of these, 199 (45%), could be identified as clear homologues by
having significant structure and sequence similarity (SSAP ≥80
and ≥20% sequence identity). A further 169 (38%) were probable
homologues as, although the sequence identity was below 20%,
they had functional similarity and/or gave significant scores using
sequence search methods designed to detect very distant
homologues (PSI-BLAST) (12). There remained a further 40 (9%)
proteins which were analogous, i.e., they had the same fold as
a previous entry, but neither the sequence nor the function gave
definite evidence of a common ancestor.
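The percentages quoted for the newly deposited domains can be rechecked from the counts given in the text. In the sketch below, the counts 2159, 443, 199, 169 and 40 come from the text; the number of novel folds (35) and the number of domains matched by sequence (1716) are not stated explicitly and are obtained here by subtraction, which is an assumption about how the figures were derived.

```python
TOTAL_NEW_DOMAINS = 2159     # domains classified June 1997 to March 1998
NEW_SEQUENCE_DOMAINS = 443   # domains with <30% identity to known structures

# Counts quoted in the text for the 443 'new' sequences.
STRUCTURE_MATCHES = {
    "clear homologues": 199,
    "probable homologues": 169,
    "analogues": 40,
}


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100.0 * part / whole)


def breakdown() -> dict:
    """Map each category to (count, percent of the 443 new sequences)."""
    novel = NEW_SEQUENCE_DOMAINS - sum(STRUCTURE_MATCHES.values())  # derived
    rows = dict(STRUCTURE_MATCHES)
    rows["novel folds"] = novel
    return {name: (n, percent(n, NEW_SEQUENCE_DOMAINS)) for name, n in rows.items()}


# Domains matched by sequence alone (derived by subtraction).
sequence_homologues = TOTAL_NEW_DOMAINS - NEW_SEQUENCE_DOMAINS
homologues_by_structure = (
    STRUCTURE_MATCHES["clear homologues"] + STRUCTURE_MATCHES["probable homologues"]
)
```

Under these assumptions the derived values agree with the rounded figures in the text: 79% matched by sequence, and 45%, 38%, 9% and 8% for the four categories of new sequences, with clear and probable homologues together making up 83% of the 443.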
RELATIONSHIP BETWEEN PROTEIN STRUCTURE
AND FUNCTION
We now need to consider at what levels of structural similarity or
evolutionary ‘distance’ it is reasonable to inherit functional
information, within a protein family. Data on the CATH
evolutionary families and structural groupings is stored in a
Postgres relational database (11) with links to a ligand database
containing information about protein–ligand interactions (2).
This allows us to analyse the relationship between the 3D
structure and function, using stored data on EC identifiers,
SWISS-PROT keywords and protein–ligand interactions (11).
Considering the degree of functional similarity observed in
structures with similar folds, the vast majority (>96%) of fold
groups in the PDB derive from a single homologous family, with
similar or closely related functions within the family. However,
for the very common folds (superfolds, see above) which derive
from three or more apparently unrelated homologous families,
the proteins can perform quite unrelated functions even though
they have the same fold. We have described these as analogous
folds, which may or may not have a common ancestor.
At the homologous superfamily level in CATH, a more detailed
analysis of enzyme functions showed that the majority of
homologous enzyme families in CATH (>90%) contained
proteins for which the first three EC identifiers were the same.
Considering those families where homologues have significant
sequence identity (≥20%) after structural alignment, 95% were
found to have a single EC identifier, whilst for families where
proteins have more than 30% sequence similarity, we observed
that 98% had a single EC code.
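The comparison of EC identifiers within a family can be illustrated with a short check on whether all members agree over a given number of EC fields. This sketch assumes that "the first three EC identifiers" refers to the first three dot-separated fields of the EC number. The function names are hypothetical, and the EC numbers in the usage note (trypsin 3.4.21.4, chymotrypsin 3.4.21.1, alcohol dehydrogenase 1.1.1.1) are standard ones chosen for illustration, not family data taken from CATH.

```python
def ec_prefix(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` fields of an EC number as a tuple."""
    fields = ec_number.strip().split(".")
    if len(fields) < levels:
        raise ValueError(f"{ec_number!r} has fewer than {levels} fields")
    return tuple(fields[:levels])


def family_is_consistent(ec_numbers, levels: int = 3) -> bool:
    """True if every EC number in the family shares the same prefix."""
    ec_numbers = list(ec_numbers)
    if not ec_numbers:
        raise ValueError("family has no EC numbers")
    prefixes = {ec_prefix(ec, levels) for ec in ec_numbers}
    return len(prefixes) == 1
```

For example, `family_is_consistent(["3.4.21.4", "3.4.21.1"])` holds at three levels but fails with `levels=4`, which corresponds to the distinction drawn above between families with closely related functions and families with a single EC code.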
Although assigning function on the basis of homology is
common practice, it is clear that some caution should be
exercised, particularly where there is little or no sequence
similarity. There are also some clear examples where homologues
with significant sequence similarity perform different functions.
The role of ‘gene recruitment’ is especially clear in the eye lens
proteins, which function as enzymes in other cellular environments, but which are used as structural proteins in this context
(17). The extent of such ‘gene recruitment’ and context-sensitive
function is really not known at this time. For enzymes, it is clear
that catalytic function can change and evolve, usually to act on a
different but related substrate. Similarly, within the lipocalin
family (CATH id #: 2.40.130.10), several proteins are found with
very similar structures, which bind different fatty acids in the
same region at the base of the β-barrel (e.g., retinol, bilin, biotin).
Nearly half of the homologous families in which two or more
different EC numbers were observed belong to the superfolds.
This suggests that if a new protein is assigned to a superfold
family, more caution should be used when inheriting functional
information, as there appears to be greater tolerance to changes
in sequence and ultimately function, for these families. However,
it is interesting to note that many of these were TIM barrel or
Rossmann folds. These are superfolds in which the substrate or
ligand commonly binds in the same place. This is in the base of
the β-barrel for the TIMs and at the crossover of the polypeptide
chain for the doubly wound Rossmann structures.
ASSIGNMENT OF FUNCTION THROUGH STRUCTURE
One of the reasons for determining structures is to derive more
information to facilitate the assignment of function. From our
analysis of proteins in CATH, we suggest that structural data can
help to assign function in several ways:
(i) The structural data allow recognition of more distant
homologues compared with sequence data—in our analysis, 83%
of structures with novel sequences could be assigned as
homologues in this way (note that such assignment of function is
again subject to the caveats imposed by ‘gene recruitment’
discussed above).
(ii) The structural data allow detailed inspection of the
functional site—to suggest if and how the function may have
evolved. For example, if an enzyme has evolved to act on a
different substrate, the binding site may reveal, or at least suggest,
possible changes in the substrate.
(iii) For the superfolds, similarity of structure does not
necessarily mean similarity of function. However, the active
site/binding sites are often conserved, e.g., in the TIM barrel or
Rossmann fold structures, the ligand always binds at the same end
of the barrel or sheet.
(iv) Some methods have already been developed, and will
increasingly be the focus of attention over the next few years,
which aim to predict function ab initio from structure. For
example, enzymes can often be identified by the presence of a
major cleft, which also locates the active site (18). Similarly,
critical surface patches, which are used for molecular recognition
in binding other proteins or ligands, may be identified using
knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure 4 to a new
genome, we can expect that, of the 54–70% of sequences which
currently have no obvious sequence matches in the PDB, we will
find roughly 80–90% to be homologous to a known family using
the structural data alone. For the singlet folds, this will almost
certainly reveal some clues to the function. For the superfolds,
some folds will reveal information on the functional class (e.g.,
enzyme for TIM barrels) or the location of the active site, if not
the specific function. Only 10–20% will be expected to be ‘novel’
folds. For these the ab initio methods referred to above may
provide some clues to guide experiments. Therefore, it is clear
that determining structures, as part of a ‘structural genomics’
initiative, for example, will make a major contribution to
interpreting genome data.
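The extrapolation in this summary can be made explicit with a little arithmetic. The sketch below takes the fraction of genome sequences assigned by sequence matching (30–46% in the text) and the fraction of the remainder expected to be homologous by structure (80–90%), and expresses each category as a share of the whole genome. Multiplying the two ranges in this way is an illustrative reading of the summary, not a calculation reported by the authors, and the function name is hypothetical.

```python
def extrapolate(assigned_by_sequence: float, homologous_by_structure: float) -> dict:
    """Split a genome into three categories, as fractions of all sequences.

    assigned_by_sequence: fraction matched to a known structure by sequence.
    homologous_by_structure: fraction of the unmatched sequences expected
        to prove homologous to a known family once a structure is solved.
    """
    for value in (assigned_by_sequence, homologous_by_structure):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must lie between 0 and 1")
    unassigned = 1.0 - assigned_by_sequence
    return {
        "assigned_by_sequence": assigned_by_sequence,
        "homologous_by_structure": unassigned * homologous_by_structure,
        "novel_fold": unassigned * (1.0 - homologous_by_structure),
    }


# The two ends of the ranges quoted in the text.
pessimistic = extrapolate(assigned_by_sequence=0.30, homologous_by_structure=0.80)
optimistic = extrapolate(assigned_by_sequence=0.46, homologous_by_structure=0.90)
```

On this reading, novel folds would account for somewhere between about 5% and 14% of all sequences in a genome, with the rest falling into families that are recognisable by sequence or by structure.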
REFERENCES
1 Orengo,C.A., Flores,T.P., Taylor,W.R. and Thornton,J.M. (1993) Protein
Engng., 6, 485–500.
2 Laskowski,R.A., Hutchinson,E.G., Michie,A.D., Wallace,A.C., Jones,M.L.
and Thornton,J.M. (1997) Trends Biochem. Sci., 22, 488–490.
3 Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.
4 Orengo,C.A., Brown,N.P. and Taylor,W.R. (1992) Proteins, 14, 139–167.
5 Chothia,C. (1992) Nature, 357, 543–544.
6 Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.
7 Levitt,M. and Chothia,C. (1976) Nature, 261, 552–558.
8 Jones,S., Swindells,M.B., Stewart,M., Michie,A.D., Orengo,C.A. and
Thornton,J.M. (1998) Protein Sci., 7, 233–242.
9 Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and
Thornton,J.M. (1997) Structure, 5, 1093–1108.
10 Westhead,D.R., Hatton,D.C. and Thornton,J.M. (1998) Trends Biochem.
Sci., 23, 35–36.
11 Martin,A.C.R., Orengo,C.A., Hutchinson,E.G., Jones,S., Karmirantzou,M.,
Laskowski,R.A., Mitchell,J.B.O., Taroni,C. and Thornton,J.M. (1998)
Structure, 6, 1–10.
12 Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W.
and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
13 Abola,E.E., Bernstein,F.C., Bryant,S.H., Koetzle,T.F. and Weng,J. (1987)
In Allen,F.H., Bergerhoff,G. and Sievers,R. (eds), Crystallographic
Databases-Information Content, Software Systems, Scientific Applications.
Data Commission of the International Union of Crystallography,
Bonn/Cambridge/Chester, pp. 107–132.
14 Pennisi,L. (1998) Science, 279, 978–979.
15 Huynen,M., Doerks,T., Eisenhaber,F., Orengo,C.A., Sunyaev,S., Yuan,Y.
and Bork,P. (1998) J. Mol. Biol., 280, 323–326.
16 Jones,D.T. (1998) J. Mol. Biol., in press.
17 Piatigorsky,J. and Wistow,G. (1991) Science, 252, 1078–1079.
18 Laskowski,R.A., Luscombe,N.M., Swindells,M.B. and Thornton,J.M.
(1996) Protein Sci., 5, 2438–2452.
19 Jones,S. and Thornton,J.M. (1997) J. Mol. Biol., 272, 133–143.
20 Jones,S. and Thornton,J.M. (1997) J. Mol. Biol., 272, 121–132.
21 Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443–453.