COMPUTATIONAL PROTEOMICS
PDBSITE: A DATABASE ON PROTEIN ACTIVE SITES AND
THEIR ENVIRONMENT
*1,2 Ivanisenko V.A., 1 Grigorovich D.A., 1 Kolchanov N.A.
1 Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia
2 State Research Center of Virology and Biotechnology “Vector”, Koltsovo, Novosibirsk region, Russia
e-mail: [email protected]
*Corresponding author
Key words: biologically active protein sites, protein tertiary structure, databases
Resume
Motivation: The Protein Data Bank (PDB) contains data on the biologically active sites of many proteins: ligand-binding domains, enzyme catalytic centers, sites of biochemical modification, etc. However, access to these data through
current retrieval systems is very limited. A database containing information on the features of active
sites and their spatial environment would provide a basis for comprehensive study of the properties of such sites.
Results: We have constructed PDBSite, a database of biologically active sites retrieved from PDB. PDBSite
contains descriptions of site functions; lists of residues and their positions; structural features calculated from the 3D structures
of the proteins; and physicochemical features of the sites and their spatial environment. The relationships between the
properties of the site residues and those of the residues in their environment have been analyzed.
Availability:
http://srs6.bionet.nsc.ru/srs6bin/cgi-bin/wgetz?-page+LibInfo+-newId+-lib+PDBSite
or
http://wwwmgs.bionet.nsc.ru/mgs/systems/fastprot/
Introduction
Information on biologically active sites is of paramount importance in solving many problems of molecular biology,
biotechnology, and medicine. Highly specific biological activities of proteins are provided by the unique structure of active
sites and their environment. For example, catalytic centers of enzymes occur in cavities (Chothia, 1976), and antigenic
determinants are shaped as protrusions on a protein surface (Davies, Cohen, 1996). The structure of sites is largely
dependent on their environment. The activity of many natural and mutant proteins depends on the physicochemical
properties of the residues surrounding the functional sites (Ivanisenko, Eroshkin, 1997). Data on the 3D structures of
proteins are necessary for determination of the spatial environment of their biologically active sites. The database PDB
(Bernstein et al., 1977) contains the data on 3D structures of proteins. For many proteins, the amino acid residues
composing biologically active sites are marked, and the sites are briefly described. The purpose of the present study is the
development of a daughter database PDBSite on the specific features of spatial organization of the biologically active sites
stored in PDB and their environment.
For this purpose, we developed original methods, algorithms and software for calculating the 3D environment of sites and
their structural and physicochemical properties.
Most investigations of the structure-function organization of functional sites are aimed at the invariant properties of these
sites or their environment (Bagley, Altman, 1995; Sekharuda, Sundaralingam, 1988). We searched for correlations
among the variable properties of sites and their environment. The search showed that a correlated change in the
“site-environment” property pairs is a typical feature of the sites. These results may contribute to understanding the mechanisms of site
operation and evolution.
Materials and Methods
The PDBSite database includes data obtained by processing the following PDB fields: HEADER, TITLE, KEYWDS,
REMARK 800, SITE, and ATOM. Grammar analysis programs (parsers) were used to process PDB records. If a single PDB
record contained data on several sites, an individual record was created for each site in PDBSite. Internet access to
PDBSite is provided through the Sequence Retrieval System (SRS).
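As an illustration of the splitting step, the following Python sketch groups PDB SITE records by site identifier, so that each site yields its own record. It relies on the fixed-column layout of SITE records in the PDB file format. The function name parse_site_records and the output structure are assumptions made for illustration; this is not the parser used to build PDBSite.

```python
from collections import OrderedDict


def parse_site_records(pdb_lines):
    """Group residues listed in PDB SITE records by site identifier.

    Returns an ordered mapping: site ID -> list of (resName, chainID, seqNum).
    Hypothetical helper for illustration only.
    """
    sites = OrderedDict()
    for line in pdb_lines:
        if not line.startswith("SITE  "):
            continue  # skip every record type other than SITE
        # Pad short lines so that fixed-column slicing never runs off the end
        line = line.rstrip("\n").ljust(61)
        site_id = line[11:14].strip()  # columns 12-14: site identifier
        residues = sites.setdefault(site_id, [])
        # Each SITE record lists up to four residues in 11-column blocks
        for k in range(4):
            base = 18 + 11 * k  # blocks start at columns 19, 30, 41, 52
            res_name = line[base:base + 3].strip()
            if not res_name:
                continue  # unused residue slot
            chain_id = line[base + 4].strip()
            seq_num = int(line[base + 5:base + 9])
            residues.append((res_name, chain_id, seq_num))
    return sites
```

Passing the lines of one PDB file to parse_site_records gives one key per site, with continuation records for the same site merged into a single residue list. Insertion codes are ignored in this sketch.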
The spatial environment of sites was calculated as follows. The smallest parallelepiped enclosing all the atoms of the amino acid
residues of a site was constructed from the coordinates of the protein atoms. The spatial environment of the site was
taken to include every amino acid residue with at least one atom inside this parallelepiped.
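The environment calculation can be sketched in Python as follows. The sketch assumes that the enclosing box is aligned with the coordinate axes (the text does not state its orientation) and that the site residues themselves are excluded from the environment. The function name site_environment and the input layout, a list of (residue_id, x, y, z) tuples, are hypothetical.

```python
def site_environment(atoms, site_residues):
    """Return the residues that have at least one atom inside the site's box.

    atoms: iterable of (residue_id, x, y, z) for all protein atoms.
    site_residues: identifiers of the residues forming the site.
    Assumption: the box is axis-aligned and site residues are not reported.
    """
    atoms = list(atoms)
    site_residues = set(site_residues)
    site_xyz = [(x, y, z) for rid, x, y, z in atoms if rid in site_residues]
    if not site_xyz:
        raise ValueError("no atoms found for the site residues")
    # Lower and upper corners of the smallest axis-aligned enclosing box
    lo = tuple(min(c[k] for c in site_xyz) for k in range(3))
    hi = tuple(max(c[k] for c in site_xyz) for k in range(3))
    environment = set()
    for rid, x, y, z in atoms:
        if rid in site_residues:
            continue
        # One atom inside the box is enough to include the whole residue
        if all(lo[k] <= v <= hi[k] for k, v in enumerate((x, y, z))):
            environment.add(rid)
    return environment
```

A box-based definition is cheap to compute but depends on the orientation of the coordinate frame; a distance cutoff around the site atoms would be a rotation-independent alternative, though it is not what the text describes.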
The following properties of sites and environment were calculated: solvent accessibility of amino acid residues; the mean
value, sum, and spatial moment of amino acid physicochemical indices; coordinates of the geometrical center of mass for
each residue; and the pairwise distances between the centers of mass of the residues. Another site index calculated was its
discontinuity in the primary structure. The spatial moment was calculated as follows:
BGRS’ 2002
\[
SM = \left\{ \left[ \sum_{i=1}^{N} p_i x_i \right]^{2} + \left[ \sum_{i=1}^{N} p_i y_i \right]^{2} + \left[ \sum_{i=1}^{N} p_i z_i \right]^{2} \right\}^{1/2},
\]
where $p_i$ ($i = 1, 2, \ldots, N$) is the value of a certain property of the $i$th residue of the 3D site, comprising $N$ amino acids; $x_i$, $y_i$, $z_i$
are the coordinates of the Cα atom of the $i$th residue taken with reference to the geometrical center of the 3D site.
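A minimal Python sketch of the spatial moment is given below. It assumes that the geometrical center of the 3D site is the mean of the Cα coordinates of its residues, which the text does not state explicitly. The function name spatial_moment and the argument layout are hypothetical.

```python
import math


def spatial_moment(props, ca_coords):
    """Spatial moment SM of a residue property over a 3D site.

    props: property value p_i for each residue of the site.
    ca_coords: (x, y, z) of the C-alpha atom of each residue, same order.
    Assumption: the site center is the mean of the C-alpha coordinates.
    """
    n = len(props)
    if n == 0 or n != len(ca_coords):
        raise ValueError("props and ca_coords must be non-empty and equal in length")
    # Geometrical center of the site (assumed: centroid of C-alpha atoms)
    center = [sum(c[k] for c in ca_coords) / n for k in range(3)]
    # Property-weighted sums of the centered x, y, and z coordinates
    sums = [
        sum(p * (c[k] - center[k]) for p, c in zip(props, ca_coords))
        for k in range(3)
    ]
    return math.sqrt(sum(s * s for s in sums))
```

With this definition, a property that is the same for every residue gives a zero moment, because the centered coordinates sum to zero, whereas opposite property values placed at opposite ends of the site give a large moment.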
The discontinuity index of a site was taken to be
\[
\frac{1}{N} \sum_{i=1}^{N} \left( P_{i+1} - P_i - 1 \right),
\]
where $N$ is the number of residues of the site and $P_i$ is the ordinal number of the $i$th residue of the site in the protein sequence. This index reflects the mean number of positions
between the neighboring residues of the site in the primary structure. For example, if a site consists of a continuous
sequence fragment, then the discontinuity index is zero. To calculate the solvent accessibility of amino acids, we used the
program DSSP (Kabsch, Sander, 1983).
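The discontinuity index defined above can be sketched in Python as follows. The printed sum runs to N, but the term for i = N would need a position beyond the last residue of the site, so this sketch sums over the N - 1 pairs of neighboring residues and divides by N as printed; that reading is an assumption. The sketch also assumes that all residues lie on one chain, and the function name discontinuity_index is hypothetical.

```python
def discontinuity_index(positions):
    """Discontinuity of a site from the sequence positions of its residues.

    positions: ordinal numbers P_i of the site residues in the protein sequence.
    Assumption: gaps are summed over the N - 1 neighboring pairs, then divided by N.
    """
    pos = sorted(positions)
    n = len(pos)
    if n == 0:
        raise ValueError("a site must contain at least one residue")
    # Count the sequence positions skipped between neighboring site residues
    gaps = sum(pos[i + 1] - pos[i] - 1 for i in range(n - 1))
    return gaps / n
```

For a continuous fragment such as positions 10, 11, 12, 13 every gap is zero and the index is zero, in agreement with the example in the text.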
Results and Discussion
The fields of the PDBSite database used for search in the system SRS are listed and briefly described in Table 1.
Table 1. The description of PDBSite fields that can be queried.
Field name         Description
ID                 Entry identifier.
PDBID              PDB ID code.
Header             PDB classification for the entry. The field content corresponds to that in PDB.
Title              Title for experiment or analysis described in the entry. The field content corresponds to that in PDB.
Keyword            Keywords describing the macromolecule. The content corresponds to that of KEYWDS in PDB.
Molecule           Contains names of macromolecules from the COMPND of PDB and is designed to look for entries by the names of macromolecules.
NumSiteChains      Number of different chains to which the residues of the site belong.
SiteDescr          Description of the site. The content corresponds to that of SITE_DESCRIPTION subfield of REMARK 800 field of PDB.
ResidueNotAA       Names of residues that are not amino acids but are present in the site.
LenSite            Number of residues in the site.
LenSurround        Number of residues in the site environment.
ExposureSite       Average exposure of residues in the site.
ExposureSurround   Average exposure of residues in the site environment.
Discontinuity      Discontinuity of the site according to its primary structure.
In addition, the database contains data on the structural and physicochemical indices of sites and their spatial environment
that are not used for search but can be applied to analysis of the structure-functional organization of sites.
The PDBSite database contains the descriptions of 4723 sites. For statistical analysis, we obtained a nonredundant set by
excluding complete analogs that result from protein duplication in PDB. This set included 4038 sites. All the sites can be
divided into three groups: continuous sites, discontinuous sites, and discontinuous sites whose residues belong to different
subunits of a molecule. The proportion of the last group was considerable (10.6%). The distribution of sites by
discontinuity index is shown in Fig. 1.
The figure shows that most sites present in the database are discontinuous. Thus, patterns based only on the primary
structure can hardly be developed for many sites. Recognition of the residues of such sites requires at least consideration of
the matrix of distances between protein residues in the tertiary structure. We found that sites with increased discontinuity
have limited solvent accessibility (Fig. 2); i.e., buried residues form highly discontinuous sites.
For further analysis, sites were grouped according to their functions. Only those sites whose function is unambiguously
described in the field SiteDescr were taken into consideration. Of them, we chose site types including no fewer than 10
members. Thus, 3611 sites of 30 types were examined. Figure 3 shows the hierarchical classification of sites of different
types by their mean amino acid composition. The sites clustered into two major groups in the resulting tree. One of them is
dominated by organic ligand-binding sites, and the other, by metal ion-binding sites. Sites of acetylation fell into the first
group, and the sites of glycosylation and phosphorylation, into the second one. Catalytic centers of enzymes belong to the first
group. Both groups also contain local clusters of sites of different types. In general, the tree constructed according to the
amino acid composition of sites is in good agreement with their function.
Fig. 1. Distribution of sites for discontinuity. [Histogram of sites over the discontinuity index bins 0, 1-10, 11-20, 21-30, 31-50, and >51.]
Fig. 2. Interrelation between the discontinuity and solvent accessibility of sites. [Axes: discontinuity (x), accessibility (y).]
Fig. 3. The hierarchical tree classifying the sites of various types according to their amino acid composition. [Leaves, in the order printed: ACETYLATED; PHOSPHATE binding; SULFATE binding; NUCLEOTIDE binding; NADPH binding; NAD binding; FAD binding; NAP binding; NDP binding; FMN binding; ADP binding; GTPASE binding; HEC binding; GLYCOSYLATION; CARBOHYDRATE binding; GLUCOSE binding; NDP binding; PHOSPHORYLATION; CATALYTIC; ACTIVE; ZINC binding; FE binding; COPPER binding; NI binding; SUGAR binding; CALCIUM binding; MN binding; CO binding; MG binding; ATP binding. Distance scale: 0 to 1.2.]
Features presented in the PDBSite database were used for analysis of correlations between the physicochemical properties
of sites and their environment.
Figure 4 illustrates the correlation between the hydrophilicity of the Mn-binding, Co-binding, and Mg-binding sites, pooled
into one group, and the hydrophobicity of their solvent-exposed environment, i.e., residues on the molecule surface. It is
apparent that the hydrophobicity of the exposed environment increases with site hydrophilicity.
The results of our analysis suggest that the function of site environment manifests itself as early as the stage of initial
recognition of the site target and correct orientation of the site with respect to the target.
The mean or total physicochemical properties of sites were found to correlate with their spatial moment. This can be related
to the importance of the distribution of these properties over the spatial structure of the site.
In particular, the increase in the number of positively charged residues in ATP-binding sites correlates with the increase in
charge moment (Fig. 5). In other words, the residues aggregate into separate clusters of positively and negatively charged
residues.
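Correlations of this kind between a site property and an environment property can be computed from per-site values along the following lines. The sketch assumes Pearson's product-moment coefficient; the text does not state which coefficient was used, and the function name pearson_r is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples.

    xs: one property value per site (e.g., hydrophilicity of the site).
    ys: the paired value per site (e.g., hydrophobicity of its environment).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the sample means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

Each site contributes one (x, y) pair, so the coefficient describes how the two properties vary together across the sites of a given type.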
Fig. 4. Correlation (R = 0.77) between the hydrophilicity of the Mn-binding, Co-binding, and Mg-binding sites, bulked into one group, and
the hydrophobicity of their exposed environment; P>0.95. [Axes: hydrophilicity of site (x), hydrophobicity of exposed environment (y).]
Fig. 5. Correlation (R = 0.78) of the polarity of ATP-binding sites and the
spatial charge moment. The correlation is significant at P>0.95. [Axes: polarity (x), spatial moment of charge (y).]
Conclusions
We have developed the PDBSite database, which can be applied to various tasks in the investigation of the functions of
proteins and their active centers.
The PDBSite database is equipped with an automated system that searches for structural similarity between active sites and a
user-defined 3D structure of a protein (Ivanisenko et al., 2002). Searching PDBSite for structural similarity
allows recognition of active sites in the spatial structure of proteins.
Our results indicate that the correlations between physicochemical properties of sites and their spatial environment are
characteristic of globular proteins. A more complete solution of the problem of finding relations between site properties and
environment demands further analysis, involving the conformation and physicochemical properties of sites and their
environment. Application of methods of molecular dynamics and conformation analysis is promising for further study of
this problem.
Acknowledgements
The study was supported in part by the Russian Foundation for Basic Research (grants № 01-07-90376 and 01-07-90084);
Russian Ministry of Industry, Science, and Technologies (grant № 43.073.1.1.1501); Siberian Branch of the Russian
Academy of Sciences (Integration Project № 65); US National Institutes of Health (grant № 2 R01-HG-01539-04A2); and
US Department of Energy (grant № 535228 CFDA 81.049).
References
1. Bagley S.C., Altman R.B. (1995). Characterizing the microenvironment surrounding protein sites. Protein Sci. 4, 622-635.
2. Bernstein F.C., Koetzle T.F., Williams G.J.B., Meyer E.F., Brice M.D., Rodgers J.R., Kennard O., Shimanouchi T., Tasumi M. (1977).
The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.
3. Chothia C. (1976). The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 105, 1-14.
4. Davies D.R., Cohen G.H. (1996). Interactions of protein antigens with antibodies. Proc. Natl Acad. Sci. USA. 93, 7-12.
5. Ivanisenko V.A., Eroshkin A.M. (1997). Search for sites containing functionally important substitutions in series of related or mutant
proteins. Mol. Biol. (Mosk.). 31, 880-887.
6. Ivanisenko V.A., Debelov V.A., Matsokin A.M., Pintus S.S., Grigorovich D.A., Kolchanov N.A. (2002). PDBSiteScan: a tool for
search for best-matching superposition over the PDBSite database. This volume.
7. Kabsch W., Sander C. (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical
features. Biopolymers. 22, 2577-2637.
8. Sekharuda Y.C., Sundaralingam M. (1988). A structure-function relationship for the calcium affinities of regulatory proteins containing
“EF-hand” pairs. Protein Eng. 2, 139-146.