© 2003 Oxford University Press
Nucleic Acids Research, 2003, Vol. 31, No. 1
DOI: 10.1093/nar/gkg051
469–473
Gene3D: structural assignments for the biologist
and bioinformaticist alike
Daniel W. A. Buchan1, Stuart C. G. Rison1, James E. Bray1, David Lee1,2, Frances Pearl1,
Janet M. Thornton1,3 and Christine A. Orengo1,*
1Biomolecular Structure and Modelling Group, Department of Biochemistry and Molecular Biology,
University College London, Gower Street, London WC1E 6BT, UK, 2Department of Crystallography, Birkbeck College,
Malet Street, Bloomsbury, London WC1E 7HX, UK and 3EMBL—European Bioinformatics Institute,
Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Received August 15, 2002; Revised and Accepted October 2, 2002
ABSTRACT
The Gene3D database (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/)
provides structural assignments for genes within complete genomes.
These are available via the internet from either the
World Wide Web or FTP. Assignments are made
using PSI-BLAST and subsequently processed
using the DRange protocol. The DRange protocol is
an empirically benchmarked method for assessing
the validity of structural assignments made using
sequence searching methods; where appropriate,
assignment statistics are collected and made available. Gene3D links assignments to their appropriate
entries in relevant structural and classification
resources (PDBsum, CATH database and the
Dictionary of Homologous Superfamilies). Release
2.0 of Gene3D includes 62 genomes, 2 eukaryotes,
10 archaea and 40 bacteria. Currently, structural
assignments can be made for between 30 and 40
percent of any given genome. In any genome, around
half of those genes assigned a structural domain are
assigned a single domain and the other half of the
genes are assigned multiple structural domains.
Gene3D is linked to the CATH database and is
updated with each new update of CATH.
INTRODUCTION
Considerable progress has been made in the field of genome
annotation in the past five years and it is now evident that some
structural or functional annotation can be provided for most of
the genes in any given organism (6–12). Currently, the state
of the art allows up to 80% (7) of the genes in any given
organism to be assigned functional or structural annotation.
Most annotation methods rely almost solely on inheriting
functional annotation via sequence comparison, but one must
exercise a degree of caution when interpreting such results.
This is particularly pertinent when considering the annotation
of distant homologues [<30% sequence identity, (13)].
Structural annotation is often useful when assessing
the functional annotations of these homologues. Use of
structural data enables 3D models to be built to inform
functional predictions (14,15). Gene3D aims to provide the
biologist with reliable precalculated relationships to protein
structures and, as a result, the relevant links to the functional
and structural data curated within the CATH domain structure
classification database. These data can then be used as the
starting point for homology modelling or evolutionary studies.
A related resource, SUPERFAMILY (16), is linked to the
SCOP structural database (17).
METHODS
The Gene3D database is derived from data produced by the
DomainFinder algorithm (18) and the DRange protocol (2).
This resource is created by scanning the sequences from the
CATH structural domains against a large database derived
from the non-redundant sequence database from GenBank that
contains the sequences from the completed genomes. The PSI-BLAST (1) iterative database search algorithm is used (19) to
scan CATH database sequences against the GenBank
sequences. Preprocessing is carried out by DomainFinder
and the DRange protocol selects and validates the putative
structural annotations suggested by DomainFinder. Gene3D
and the associated DRange protocol are described below.
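As an illustration of the search step only, the sketch below assembles a command line for the `psiblast` program from the current NCBI BLAST+ suite. It is not the authors' original invocation: the file names, database name, iteration count and E-value cut-off are all assumptions chosen for the example, and the command is printed rather than executed.

```python
def psiblast_command(query_fasta, database, out_table, iterations=3, evalue=1e-3):
    """Build a BLAST+ psiblast command line as an argument list.

    All values passed in here are illustrative assumptions, not the
    settings used to build Gene3D.
    """
    return [
        "psiblast",
        "-query", str(query_fasta),          # one CATH domain sequence (FASTA)
        "-db", str(database),                # formatted protein database
        "-num_iterations", str(iterations),  # number of PSI-BLAST rounds
        "-evalue", str(evalue),              # reporting threshold
        "-outfmt", "6",                      # tabular output
        "-out", str(out_table),
    ]


# Hypothetical file and database names.
cmd = psiblast_command("cath_domain.fasta", "nr_plus_genomes", "hits.tsv")
print(" ".join(cmd))

# To run the search (requires BLAST+ and a formatted database):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the example testable on a machine without BLAST+ installed.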
DomainFinder and DRange
The Gene3D population process is illustrated in Figure 1. The
procedure starts with a dataset of non-identical sequences from
the CATH database (CATH S95Reps), which is
searched against a library of sequences (in this case the
sequences from the GenBank non-redundant database, which
includes the sequences from the completed genomes) using
PSI-BLAST (Fig. 1A) with the aim of producing a series of
matches of the structural domains to the genomic sequences
(Fig. 1B).
*To whom correspondence should be addressed. Tel: +44 2076797193; Email: [email protected]
Present address:
Stuart C. G. Rison, Royal Veterinary College, Department of Pathology and Infectious Diseases, Royal College Street, London NW1 0TU
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors
Figure 1. Populating the Gene3D Database. (A) CATH Representative sequences (S95Reps) are scanned against the GenBank non-redundant database
containing the sequences from the completed genomes using PSI-BLAST. Search results (B) are processed by DomainFinder to generate ‘Ranges’
(C). These are ‘cleaned-up’ by the DRange package (D) and final assignments are assimilated in the Gene3D database (E).
In the subsequent step, the DomainFinder algorithm is used to
convert the ‘raw’ hits into ‘Ranges’ (18). These ‘Ranges’ act as
descriptors which indicate which regions of a gene
putatively belong to which CATH Homologous Superfamilies
(Fig. 1C). In the last data manipulation step, assignments are cleaned up using the DRange package (Fig. 1D) and
the resulting assignments are stored in the Gene3D database
(Fig. 1E). The DRange package is composed of three modules:
Collapse, MultiParse and CleanAssign (2). These three modules
are used to verify structural domain assignments. The ‘clean-up’
procedure is a triage procedure distinguishing between probably
correct and probably incorrect assignments.
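The idea of turning raw hits into per-superfamily ‘Ranges’ and then triaging them can be illustrated with a simple interval merge. The sketch below is not the DomainFinder or DRange implementation, which are described in references (18) and (2). The tuple layout, the E-value cut-off of 1e-3 and the example hits are assumptions made for illustration, and the superfamily codes are used only as example identifiers.

```python
from collections import defaultdict


def hits_to_ranges(hits, max_evalue=1e-3):
    """Group hits by (gene, superfamily) and merge overlapping residue spans.

    hits: iterable of (gene_id, superfamily, start, end, evalue) tuples.
    Returns {(gene_id, superfamily): [(start, end), ...]} with sorted spans.
    The threshold is a hypothetical stand-in for the benchmarked triage.
    """
    grouped = defaultdict(list)
    for gene, superfamily, start, end, evalue in hits:
        if evalue <= max_evalue:  # triage: discard weak matches
            grouped[(gene, superfamily)].append((start, end))

    ranges = {}
    for key, spans in grouped.items():
        spans.sort()
        merged = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:  # overlaps or touches the previous span
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        ranges[key] = merged
    return ranges


# Hypothetical hits on one gene.
example_hits = [
    ("geneA", "3.40.50.300", 10, 120, 1e-20),
    ("geneA", "3.40.50.300", 100, 180, 1e-12),
    ("geneA", "2.60.40.10", 200, 290, 1e-8),
    ("geneA", "1.10.10.10", 300, 350, 5.0),  # rejected: E-value too high
]
ranges = hits_to_ranges(example_hits)
for (gene, superfamily), spans in sorted(ranges.items()):
    print(gene, superfamily, spans)
```

In this toy example the two overlapping hits to the same superfamily collapse into a single range, and the weak hit is dropped before any range is built.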
Figure 2. Bar chart showing the distribution of domains assigned to genes in three typical organisms: Caenorhabditis elegans, Methanococcus jannaschii and
Escherichia coli. The Y axis has been truncated.
Table 1. Genome assignments statistics
Organism                                  Celeg         Sacc         Mjan        Ecoli
Total number of genes                     17046         6297         1706         4289
Total number of bases               100 109 819   12 155 026    1 664 970    4 639 221
Total number of residues              7 689 303    2 973 987      480 140    1 358 990
CATH covered residues                   630 655      346 272      103 259      354 537
% CATH covered residues                    8.20        11.64        21.51        26.09
Number of genes with CATH domains          3641         2030          607         1788
% Genes with CATH domains                 21.36        32.23        35.58        41.69
Statistics are shown for four representative organisms: the first two are eukaryotes, the next an archaeon and the final a bacterium. Celeg: Caenorhabditis
elegans, Sacc: Saccharomyces cerevisiae, Mjan: Methanococcus jannaschii and Ecoli: Escherichia coli.
RESULTS
Gene3D is the repository for structural assignments verified
using the DRange protocol and is available on-line at
http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/. This protocol is applied to all complete genomes released. In May 2002,
Gene3D included whole genome structural assignments for 66
genomes. The data are also available via the CATH FTP site at
ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/.
Typical assignment statistics for four genomes in
the database, one from each of the major branches of life (one
multicellular eukaryote, one unicellular eukaryote, an archaeon
and a bacterium), are presented in Table 1. Between 22% and 55% of the genes in a
given organism in the database receive annotation with at least
one structural domain. Of these genes, usually around half are
annotated with a single domain and the other half
are assigned multiple domains (see Fig. 2). The figure also
shows that the eukaryotic genomes have many more genes
with a large number of domains; closer inspection
indicates that the largest of these genes are made of long
strings of immunoglobulin-like domains and are likely to
encode cell–cell signalling proteins. The percentage of residues
covered (see Table 1) is often around half (and frequently
lower) of the percentage of genes with an annotation. This
indicates that many genes that pick up a domain are not being
fully annotated and could be annotated with further domains.
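The two percentages in Table 1 follow directly from the raw counts, and their ratio makes the gap between residue-level and gene-level coverage explicit. The short calculation below uses only the counts reported in Table 1; the function and variable names are our own.

```python
# Counts taken from Table 1:
# organism: (genes, residues, CATH-covered residues, genes with CATH domains)
table1 = {
    "Celeg": (17046, 7689303, 630655, 3641),
    "Sacc": (6297, 2973987, 346272, 2030),
    "Mjan": (1706, 480140, 103259, 607),
    "Ecoli": (4289, 1358990, 354537, 1788),
}


def coverage(genes, residues, covered_residues, genes_with_domains):
    """Return (% residues covered, % genes annotated, ratio of the two)."""
    pct_residues = 100.0 * covered_residues / residues
    pct_genes = 100.0 * genes_with_domains / genes
    return pct_residues, pct_genes, pct_residues / pct_genes


for organism, counts in table1.items():
    pct_residues, pct_genes, ratio = coverage(*counts)
    print(f"{organism}: {pct_residues:.2f}% residues, "
          f"{pct_genes:.2f}% genes, ratio {ratio:.2f}")
```

For these four genomes the ratio of residue coverage to gene coverage lies between roughly 0.36 and 0.63, consistent with the statement that annotated genes are often only partly covered by assigned domains.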
Cursory inspection of the assignment data shows that
bacterial and archaeal genomes pick up approximately the
same ratios of the various types of CATH domains and that no
single genome appears to be strongly biased in the type of
CATH domains it utilises (see Fig. 3). The eukaryotes appear to
make more use of the all-beta domains in the CATH database,
which is probably due to their greater use of cell–cell signalling
proteins that typically use immunoglobulin-like domains.
In the database, the eukaryotic genomes pick up the least
annotation, which may be due to a prokaryotic bias in the
structures that are deposited within the PDB (20).
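A class distribution of the kind plotted in Figure 3 can be derived from the assigned superfamily codes, because the first number of a CATH code is its class. The sketch below shows that tally only. The input list is a made-up example rather than Gene3D data, and the function name is our own.

```python
from collections import Counter

# The four CATH classes as named in the Figure 3 caption.
CLASS_NAMES = {
    1: "all alpha",
    2: "all beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


def class_percentages(superfamily_codes):
    """Return {class number: percentage} for a list of CATH codes.

    Each code is a dotted string such as "3.40.50.300"; the class is the
    first field. Every domain assignment is counted once.
    """
    counts = Counter(int(code.split(".")[0]) for code in superfamily_codes)
    total = sum(counts.values())
    if total == 0:
        return {cls: 0.0 for cls in CLASS_NAMES}
    return {cls: 100.0 * counts.get(cls, 0) / total for cls in CLASS_NAMES}


# Hypothetical assignments for a handful of domains.
example_codes = ["3.40.50.300", "2.60.40.10", "2.60.40.10", "1.10.10.10"]
percentages = class_percentages(example_codes)
for cls, pct in percentages.items():
    print(f"Class {cls} ({CLASS_NAMES[cls]}): {pct:.1f}%")
```

Counting per domain assignment rather than per distinct superfamily is a choice made for this example; either convention could be used provided it is applied consistently across genomes.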
Figure 3. Bar chart showing the relative percentage of domains from each of the four major CATH classes for genes that have been assigned a CATH domain. The
four classes are: Class 1: all alpha folds; Class 2: all beta folds; Class 3: mixed alpha and beta folds and Class 4: folds with little secondary structure.
Figure 4. (A) A typical entry page for a given genome (e.g. Mycoplasma genitalium) with the statistics presented and (B) an example of the
diagram and data that can be retrieved for a gene.
The Gene3D Web Server
The Gene3D web server is made up of a number of interlinked
web pages which allow the retrieval of data on specific genes
within the represented genomes. Each genome features an
entry page (Fig. 4A) with a summary of the assignment
statistics and a CATH wheel (21). The CATH wheel is a pie
plot indicating which folds in CATH are present in the
organism. Those folds not detected in the genome are blacked
out. The statistics are similar to those presented in Table 1.
From this entry page, it is possible to search the genome in one
of two ways. The first is by a simple keyword or gene
identifier search that returns a list of matching genes. The
second is to browse the complete list of genes within an
organism that have had a structural assignment made to them.
By either route, once a gene is selected a results page is
returned (Fig. 4B). This results page presents a schematic
diagram of both the gene (hatched in green) and the structural
domains assigned (colour coded by domain type). Presented
alongside this is the ‘ranges’ data for this assignment and the
E-values from PSI-BLAST upon which this assignment was
accepted. We recommend that for batch downloads users refer
to the FTP site (ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/).
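For batch access, a script along the following lines could list the files on the FTP site using the Python standard library. This is a sketch under assumptions: it presumes that the server accepts anonymous logins and is still reachable at the address given above, and it makes no assumption about the file names stored there. The network call is defined but not executed here.

```python
from ftplib import FTP
from urllib.parse import urlparse

GENE3D_FTP = "ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory path)."""
    parts = urlparse(url)
    return parts.hostname, parts.path


def list_remote_files(url=GENE3D_FTP, timeout=30):
    """Return the directory listing at the given FTP URL.

    Assumes anonymous access is permitted; raises an ftplib or socket
    error if the server is unavailable.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)
        return ftp.nlst()


host, path = split_ftp_url(GENE3D_FTP)
print(host, path)

# To fetch the listing (requires network access to the server):
# for name in list_remote_files():
#     print(name)
```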
DISCUSSION
The data within Gene3D are there to provide biologists and
bioinformaticists with an initial stepping stone from which
structural, functional and evolutionary studies can begin. In
future, we hope to integrate Pfam domain assignments (12) to
maximise the annotated coverage of genomes and we also hope
to provide alignments of the CATH or Pfam domains to the
genes that they matched. It is our hope that by integrating Pfam
domain assignments, we can provide the assignments for most,
if not all, of the genes in the complete genomes.
That we can annotate so much of the complete genome
sequences from the structure databases alone suggests that we
may not need to solve structures for every sequence but rather
for every sequence family containing relatives of high
sequence identity (for example 40%). In
such families, homology modelling could then be used to
predict the structures of all the relatives from one representative structure. This bodes well for the success of the structural
genomics projects.
REFERENCES
1. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and
Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389–3402.
2. Buchan,D., Shepherd,A., Lee,D., Pearl,F., Rison,S., Thornton,J. and
Orengo,C. (2002) Gene3D: structural assignment for whole genes and
genomes using the CATH domain structure database. Genome Res., 12,
503–514.
3. Laskowski,R. (2001) PDBsum: summaries and analyses of PDB structures.
Nucleic Acids Res., 29, 221–222.
4. Pearl,F., Martin,N., Bray,J., Buchan,D., Harrison,A., Lee,D., Reeves,G.,
Shepherd,A., Sillitoe,I., Todd,A., Thornton,J. and Orengo,C. (2001) A
rapid classification protocol for the CATH domain database to support
structural genomics. Nucleic Acids Res., 29, 223–227.
5. Bray,J., Todd,A., Pearl,F., Thornton,J. and Orengo,C. (2000) The CATH
Dictionary of Homologous Superfamilies (DHS): a consensus approach for
identifying distant structural homologues. Protein Eng., 13, 153–165.
6. Gerstein,M. (1997) A structural census of genomes: comparing bacterial,
eukaryotic and archaeal genomes in terms of protein structure. J. Mol.
Biol., 274, 562–576.
7. Teichmann,S., Chothia,C. and Gerstein,M. (1999) Advances in structural
genomics. Curr. Opin. Struct. Biol., 9, 390–399.
8. Muller,A., MacCallum,R. and Sternberg,M. (1999) Benchmarking
PSI-BLAST in genome annotation. J. Mol. Biol., 293, 1257–1271.
9. Iliopoulos,I., Tsoka,S., Andrade,M., Janssen,P., Audit,B., Tramontano,A.,
Valencia,A., Leroy,C., Sander,C. and Ouzounis,C. (2001) Genome
sequences and great expectations. Genome Biol., 2, Interactions0001.
10. Apweiler,R., Biswas,M., Fleischmann,W., Kanapin,A., Karavidopoulou,Y.,
Kersey,P., Kriventseva,E., Mittard,V., Mulder,N., Phan,I. and Zdobnov,E.
(2001) Proteome Analysis Database: online application of InterPro and
CluSTr for the functional classification of proteins in whole genomes.
Nucleic Acids Res., 29, 44–48.
11. Kanehisa,M., Goto,S., Kawashima,S. and Nakaya,A. (2002) The KEGG
databases at GenomeNet. Nucleic Acids Res., 30, 42–46.
12. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.,
Griffiths-Jones,S., Howe,K., Marshall,M. and Sonnhammer,E. (2002) The
Pfam protein families database. Nucleic Acids Res., 30, 276–280.
13. Todd,A., Orengo,C. and Thornton,J. (2001) Evolution of function in
protein superfamilies, from a structural perspective. J. Mol. Biol., 307,
1113–1143.
14. Laskowski,R., Luscombe,N., Swindells,M. and Thornton,J. (1996) Protein
clefts in molecular recognition and function. Protein Sci., 5, 2438–2452.
15. Luscombe,N., Laskowski,R. and Thornton,J. (1997) NUCPLOT: a
program to generate schematic diagrams of protein–nucleic acid
interactions. Nucleic Acids Res., 25, 4940–4945.
16. Gough,J. and Chothia,C. (2002) SUPERFAMILY: HMMs representing all
proteins of known structure. SCOP sequence searches, alignments and
genome assignments. Nucleic Acids Res., 30, 268–272.
17. Lo Conte,L., Brenner,S., Hubbard,T., Chothia,C. and Murzin,A. (2002)
SCOP database in 2002: refinements accommodate structural genomics.
Nucleic Acids Res., 30, 264–272.
18. Pearl,F., Lee,D., Bray,J., Buchan,D., Shepherd,A. and Orengo,C. (2002)
The CATH extended protein-family database: providing structural
annotations for genome sequences. Protein Sci., 11, 233–244.
19. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and
Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389–3402.
20. Westbrook,J., Feng,Z., Jain,S., Bhat,T., Thanki,N., Ravichandran,V.,
Gilliland,G., Bluhm,W., Weissig,H., Greer,D., Bourne,P. and Berman,H.
(2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res.,
30, 245–248.
21. Michie,A., Orengo,C. and Thornton,J. (1996) Analysis of domain
structural class using an automated class assignment protocol. J. Mol. Biol.,
262, 168–185.