Prediction of interacting single-stranded RNA bases by protein

Prediction of interacting single-stranded RNA bases
by protein binding patterns
Alexandra Shulman-Peleg1∗, Maxim Shatsky2 , Ruth Nussinov3,4†, Haim J. Wolfson1
1
2
School of Computer Science
Beverly and Raymond Sackler Faculty of Exact Sciences
Tel Aviv University, Tel Aviv 69978, Israel
Physical Biosciences Division, Berkeley National Lab, California, USA
3
Basic Research Program, SAIC-Frederick, Inc.
Center for Cancer Research Nanobiology Program, NCI,
Frederick, MD 21702, USA
4
Department of Human Genetics and Molecular Medicine
Sackler Faculty of Medicine
Tel Aviv University, Tel Aviv 69978, Israel
Abstract
Prediction of protein-RNA interactions at the atomic level of detail is crucial for
our ability to understand and interfere with processes such as gene expression and
regulation. Here, we investigate protein binding pockets that accommodate extruded
nucleotides not involved in RNA base pairing. We observed that 86% of protein
interacting nucleotides are part of a consecutive fragment of at least two nucleotides,
whose rings have a significant interaction with the protein. Many of these share the
same protein binding cavity and almost 30% of such pairs are π-stacked. Since these
local geometries can not be inferred from the nucleotide identities, we present a novel
framework for their prediction from the properties of protein binding sites.
First, we present a classification of known RNA nucleotide and dinucleotide protein
binding sites and identify the common types of shared 3D physico-chemical binding
patterns. These are recognized by a new classification methodology which is based
on spatial multiple alignment. The shared patterns reveal novel similarities between
dinucleotide binding sites of proteins with different overall sequences, folds and functions. Then, we use these patterns for the prediction of RNA fragments that can
bind to a protein of interest. Specifically, we search a given target protein for regions
similar to known binding patterns and achieve a success rate of 80%. Based on the
interaction patterns, we further predict an RNA fragment that interacts with those
protein binding sites. Since the RNA fragment can have a previously unknown sequence and structure this is especially useful for such applications as RNA aptamer
design. By searching a database of all known small molecule binding sites for regions
similar to nucleotide and dinucleotide binding patterns, we can suggest fragments
and scaffolds that target nucleotide binding sites.
Keywords: protein-RNA interactions, nucleotide and dinucleotide binding sites, physico-chemical binding
patterns, multiple binding site alignment, RNA aptamer drug design.
∗
Corresponding authors’ emails:{shulmana,wolfson}@post.tau.ac.il
The publisher or recipient acknowledges right of the U.S. Government to retain a nonexclusive, royalty-free license
in and to any copyright covering the article.
†
1
1
Introduction
Protein-RNA interactions are crucial for many cellular processes, such as gene expression and
regulation, protein synthesis and many more. Understanding and predicting such interactions at
the atomic level is crucial for our ability to interfere with malfunctioning processes in disease.
Consequently, analysis of the protein-RNA interactions and RNA binding sites have been a field
of intensive research. The pioneering works of Jones et al.1 and Nadassy et al2 have provided
important insights into the physico-chemical and geometrical nature as well as the amino acid
composition of these regions. The specific atomic interactions formed between nucleotides and
amino acids have been analyzed and compared in several important contributions 3–8 . By analyzing
the amino acid composition in RNA binding sites, several successful methods for prediction of RNA
binding sites were developed
9–13 .
However, most of these methods do not distinguish between
two main interaction types which were thoroughly studied by Draper14 : (1) interactions with the
backbone of double-stranded RNA molecules; (2) interactions of single-stranded RNA bases that
are accommodated in the protein binding pockets. As noted in a recent comprehensive review of
Auweter et al15 , while the first type of interactions occur through positively charged protein surface
patches, the second type of contacts with single-stranded nucleotides, often involve hydrophobic
patches. The contacts are often with the unpaired nucleic acid bases, while the direct contacts
with the phosphate moieties of the backbone, which point towards the bulk solution can be rare15 .
Since flexibility plays an important role in interactions with the RNA backbone15 , we focus on the
prediction of protein interactions formed with the single-stranded nucleotide bases. Sequences of
such nucleotides, which are not involved in local base pairs and are extruded from the surrounding
double-stranded helix, also termed extruded helical single strands, were recently described as special
motifs in SCOR (Structural Classification of RNA) and were proposed to be mediators in RNARNA and RNA-protein interactions16, 17 .
Several works have classified the protein-RNA interactions based on the sequences and/or overall
structures of the corresponding protein18–20 or RNA molecules17, 21, 22 . However, these do not always
capture the similarity in the local regions which are responsible for nucleotide binding. Analysis
1
of these regions is important due to several reasons. First, proteins of the same family can form
different interactions with RNA nucleotides
23–25 .
Second, RNA molecules can be flexible and can
explore different conformations that are “fixed” by the protein whose binding site is more rigid
and quenches this motion26, 27 . This observation is also supported by a recent study of Ellis and
Jones28 who evaluated the conformational changes in known RNA binding proteins and observed
that the flexibility in the protein binding sites is not significant and should allow the structural
prediction of these interaction regions. Previous works that analyzed the nucleotides’ physicochemical binding patterns focused on single nucleotides1, 5, 29, 30 and did not apply them to the
prediction of protein-RNA interactions. Since the binding patterns of nucleotides of the same type
(e.g Adenine, Guanine) are very similar
5
their application to the prediction of RNA binding sites
and reconstruction of protein-RNA complexes is rather limited.
Here, we investigate the protein binding pockets that accommodate extruded nucleotides not
involved in RNA base pairing. We observed that many of these protein cavities are common to
pairs of consecutive nucleotides that are often pi-stacked with each other. Consequently, we suggest
that consideration of binding patterns of pairs of consecutive RNA nucleotides may be essential
for the correct prediction of protein-RNA interactions. We observed that the local nucleotide
geometries can not be inferred from the nucleotide identities and developed a novel framework for
their prediction through the recognition of known protein binding patterns.
Specifically, we present a classification of nucleotide and dinucleotide binding sites, which are
described by a set of physico-chemical properties that may be created by amino acids with different
identities and spatial location of backbone atoms. Towards this goal, we have developed a new
classification algorithm which performs multiple structural alignments and validates the spatial
superimposition of the cluster members. The created clusters describe the common types of 3D
consensus binding patterns that are used for several applications. First, their recognition allows
atomic level predictions of dinucleotide binding sites with a success rate of 80%. Second, combining
the predictions of neighboring dinucleotide binding sites allows us to predict the structure and the
sequence of consecutive RNA fragments of on average 5 nucleotides. Finally, searching the database
2
of drug binding sites for patterns similar nucleotide and dinucleotide binding sites can assist in
the prediction of ligands and ligand fragments that can be used to interfere with protein-RNA
interactions.
2
Results
Our goal to recognize and predict the main types of interactions between protein binding sites and
single-stranded RNA bases. Specifically, we focus on the protein binding pockets that accommodate
extruded16 nucleotides not involved in RNA base pairing (see Figure 1). We define a nucleotide
binding site by the protein Connolly solvent accessible surface area31 within 2Å from the surface
of the RNA nucleotide ring. Nucleotides with a protein binding sites area larger than 3Å2 are
considered as protein interacting. Given a pair of extruded consecutive nucleotides that interact
with the protein, a dinucleotide binding site is defined by a pair of the corresponding nucleotide
binding sites. The physico-chemical properties of the binding sites are represented by points in
3D space termed pseudocenters, extracted from the protein amino acids according to Schmitt et
al32 . Each pseudocenter represents a group of atoms according to the interactions in which it may
participate: hydrogen-bond donor, hydrogen-bond acceptor, mixed donor/acceptor, hydrophobic
aliphatic and aromatic (π) contacts. Only surface exposed pseudocenters are considered. Figure
1 presents several examples of extruded nucleotide pairs and their protein dinucleotide binding
sites. The analysis below is performed on a non-redundant dataset of all existing protein-RNA
structures, which consisted of 278 complexes with resolution 3Å and better and less than 50%
sequence identity in at least one chain (see Methods).
This section starts with some general observations regarding protein-RNA interactions and
protein binding pockets that are involved. We present our classification methodology and show its
application for the recognition of 3D consensus patterns of dinucleotide binding sites. As illustrated
in Figure 2 these patterns are further used for the prediction of dinucleotide binding sites as well
as RNA fragments and protein-RNA complexes. Finally, we describe how our infrastructure can
be used in drug design applications for the discovery of RNA aptamers33 and small molecules that
3
target nucleotide and dinucleotide binding sites.
2.1
Interactions of single-stranded RNA bases
Our first goal is to understand which types of interactions and binding pockets are involved in
protein-RNA interactions. We observed that 24% of all unpaired RNA nucleotides have a significant
contact with the protein (calculated on the above described dataset of protein-RNA complexes, see
also Methods and Supplementary Material). Only 3% are the binding pockets of single nucleotides
whose neighbors on the strand do not form significant contact with the protein. Consequently, 21%
of the unpaired protein interacting nucleotides have at least one RNA strand neighbor that interacts
with the protein via its ring as well. Thus, 86% of protein interacting extruded nucleotides are part
of a fragment of at least two nucleotides that have a significant interaction with the protein.
Interestingly, we observed that in many cases, two consecutive nucleotides share the same
binding cavity. They can be either π-stacked with each other or can form interactions with the same
water molecule (e.g. PDB: 1wmq) or the protein (see Figure 1). Here, we estimate the frequency
and the geometries of the first phenomenon, which was biochemically studied for some specific
protein-RNA complexes34 . We define a pair of consecutive extruded nucleotides as π-stacking, if
the corresponding nucleotides’ planes are either parallel or perpendicular and have an angle of
180 ± 20o or 90 ± 20o respectively. In addition, we require that the distance between nucleotide
ring centroids is less than 7Å35 . Table 1 presents the frequency of π-stacking observed in two nonredundant datasets constructed from all the available RNA structures and protein-RNA complexes
respectively. We observed that in general 31% of consecutive pairs of extruded nucleotides are
involved in local intra-RNA π-stacking interactions. When considering protein-RNA complexes,
28% of consecutive, protein interacting, nucleotides are π-stacked with each other. As expected,
T-shaped (perpendicular) π-stacking is very rare (8% of consecutive pairs) and was never observed
in protein binding pockets.
When trying to estimate the sequence specificity of π-stacking, the only dinucleotide pair which
was observed to have a clear tendency for π-stacking, was the AC pair (significant by the hyper
geometric p-value above 0.009). It constituted 60% and 37% of π-stacked pairs in the datasets of
4
protein-RNA complexes and all RNA structures respectively (see Supplementary Material). These
results show that the nucleotide geometries can not be predicted from the RNA sequence alone and
nucleotide identities are not sufficient for the atomic level prediction of protein-RNA complexes.
To summarize, we have observed that in most cases protein-RNA interactions are mediated by
consecutive, protein interacting nucleotides many of which are accommodated in the same protein
binding cavity and are π-stacked. Consequently, the binding cavities of single nucleotides are not
always sufficient for a proper physico-chemical description of protein-RNA interactions and the
binding patterns of dinucleotide binding sites should be considered. Moreover, since the RNA
sequence can not be used for the prediction of nucleotide local geometries, below we propose to
utilize the protein binding patterns for the prediction of RNA nucleotide orientations and proteinRNA complexes.
2.2
Classification of nucleotide and dinucleotide binding sites
Here, we classify all the nucleotide and dinucleotide binding sites extracted from the dataset of
protein-RNA complexes (see Methods). We consider all the dinucleotide binding sites that accommodate pairs of consecutive extruded nucleotides. In addition, binding sites of single nucleotides
that are not a part of such pairs are considered as single nucleotide binding sites.
Here, we present a unique classification methodology which validates the cluster quality by
multiple spatial binding site alignment. Specifically, we have developed a center-star classification
algorithm (see Methods), which creates clusters by iteratively adding binding sites in the order of
their decreasing similarity (based on pairwise spatial alignments) and validates each new addition
by the multiple spatial alignment among all current cluster members. If the multiple similarity,
measured by the score of the common physico-chemical binding pattern, is lower than a predefined
threshold (e.g. ≤ 30% of one of the binding sites) the new member is ignored and not added to
the cluster. The main advantage of this approach is that we validate the spatial superimposition
of the cluster members and assess the quality of the shared physico-chemical binding pattern.
We applied this classification methodology and obtained 186 clusters of protein dinucleotide
binding patterns that were created based on the binding sites of all pairs of consecutive extruded
5
nucleotides. In addition, we obtained 41 clusters that describe protein binding patterns of single
nucleotides that are not part of a consecutive protein interacting pair. The full details of all the
clusters are provided in the Supplementary Material. However, many of the considered binding sites
are unique and the number of clusters with more than one member is 53 and 15 for dinucleotide and
nucleotide binding sites respectively. Since binding patterns of single nucleotides were considered
by previous studies1, 5, 29, 30, 36 , here we focus on the analysis of dinucleotide binding patterns and
distinguish between two types: (1) patterns formed by dinucleotide binding sites of proteins with
similar structural folds; (2) patterns formed by dinucleotide binding sites of proteins with different
structures. We start this section with a description of the clusters that involve members of the
RNA-binding domain family which exhibit all of these types of similarities. Then, we present
examples of additional similarities revealed by our classification.
2.2.1
The RNA-binding domains
The RNA recognition motif (RRM), also known as RNA-binding domain (RBD) or ribonucleoprotein domain (RNP) is considered to be one of the most abundant protein domains23 . In spite of
its overall structural simplicity this domain can recognize a wide variety of RNAs and can perform
various biological functions. Our dataset contained 9 complexes with RBDs of the 6 following types:
(1) splicesomal U1A proteins (PDBs: 1m5o, 1urn, 1sj3); (2) splicing factor U2B’ (PDB: 1a9n); (3)
sex lethal protein, Sxl (PDB:1b7f); (4) HuD proteins (PDBs: 1g2e, 1fxl) and (5) poly(A)-binding
protein (PDB:1cvj); (6) Pre-mRNA splicing factor U2AF65 (PDB:2g4b). Although all of these
proteins share less than 2% overall sequence identity, all of them possess the RPM1 and RPM2
sequence motifs23 . Our clusters reveal the similarities and the differences of the various nucleotide
binding sites which allow these proteins to achieve a range of required biological functions. Notably,
the results obtained with our protein binding site alignment method (see Table 2), are consistent
with the manual observations and comparisons of nucleotide binding sites which were performed
for some of the complexes23, 24 .
Table 2 presents the details of all the dinucleotide clusters which involve RBDs. As can be
seen, the created clusters divide our proteins into three main groups: (1) HuD and Sxl, which
6
have 7 common dinucleotide binding sites, in spite of sharing only 49% sequence identity; (2) U1A
and U2B proteins that share 69% sequence identity and 4 common dinucleotide binding sites; (3)
U2AF65 and poly(A) binding proteins, which are clear outliers. They do not share clusters with
other RBDs, and are indeed known to be different from the rest and each other37, 38 . Below we
provide a detailed description of the similarities recognized by our clusters.
Tyr-Phe aromatic platform of RBDs
Figure 3(a) presents a superimposition of six com-
plexes, which constitute one of our largest RBD clusters formed by four different protein types
(U1A,U2B,HuD and Sxl, see Table 2, cluster 100 ). All the dinucleotide binding sites in this cluster
share the same aromatic binding platform formed by the conserved Phe and Tyr residues (see Figure 3(b)). These aromatic residues form π-stacking interactions with nucleotides that have exactly
the same spatial orientation but different nucleotide identities. Previous biological studies have
shown the importance of these π-stacking interactions for the affinity and stability39 . The correct
recognition of this platform by our method provides a validation of its correctness. The spatial
conservation of the physico-chemical properties shared by all the complexes provides additional
insights into the formation and importance of this platform (See Supplementary Material).
Interestingly, the structure of HuD in complex with c-fos AU-rich element (PDB: 1fxl), was
classified in a separate cluster together with a Rho terminating factor (PDB: 1pvo), which was
recognized to have the same Tyr-Phe aromatic binding platform. This is a novel similarity, which
we could not find in previous studies. The reason that the HuD binding site of 1fxl was not classified
with the rest of the RBDs is that its similarity to U1A (PDB: 1sj3) is less than 30% and these two
binding pockets share only 5 common features, while the common pattern of the above described
cluster has 6. Remarkably, the binding site of the Rho terminating factor exhibits a pattern of 7
physico-chemical properties which is very similar to that of HuD with c-fos (See Supplementary
Material). As illustrated in Figure 3(b) although the Rho terminating factor is structurally different
from RBDs and does not contain the required RPM sequence motifs, our binding site alignment
method reveals that it uses a very similar aromatic binding platform which binds dinucleotides in
a manner similar to RBDs.
7
RBDs and proteins of other folds In addition to the already mentioned similarity of the PheTyr RBD platform to that of Rho termination factor, our clusters reveal novel similarities of RBDs
to other structurally, functionally and evolutionary unrelated proteins (see Table 2). Figure 3 (c,
top and middle) illustrates the similarity between dinucleotide binding sites of a π-stacked A{C|G}
pairs of U1A and U2B to the binding site of a π-stacked anticodon pair GC of cysteinyl-tRNA
synthetase (PDB:1u0b, G33 − C34 are part of the GCA anticodon loop). Figure 3 (c, bottom)
presents an additional example of the similarity between U2AF and a guanine transglycosylase
(PDB: 1q2r). In both structures there is an unusual turn between the considered nucleotides. As
can be seen, this kink, which is measured to be 120◦ in U2AF38 , breaks the binding site to two
separate surface patches, which are similar in the two structures. In both examples, in spite of
the structural differences between the proteins, the binding sites have similar surfaces and shapes
and bind the π-stacked nucleotide pairs in exactly the same orientation. A very similar situation
is observed in the rest of the clusters of this type, which are detailed in Table 2.
2.2.2
Similar structures, similar dinucleotide binding sites
Similar binding patterns formed by structurally similar proteins are expected and not surprising.
This type of similarity is usually well illustrated in the biological literature, which allows us to
verify the correctness of the recognized physico-chemical patterns. In addition, the correct superimposition of the overall structures by the transformations calculated for the dinucleotide binding
sites allows us to validate the correctness of the method. Table 3 details all the recognized clusters
of this type. Below we describe several clusters that provide some insights into the RNA sequence
specificity and reveal the repetitive nature of certain binding patterns.
Nucleotide sequence specificity
MS2 RNA hairpin coat-protein complexes, have been widely
used as a model system for studying RNA sequence specificity40, 41 . Our dataset contained 20
complexes of RNA bacteriophage capsid proteins, most of which were determined as part of these
studies. These studies have revealed that the main driving force for the complex formation is the
π-stacking interaction between the -5 base and the conserved tyrosine side chain. Our clusters
8
support this finding. First, they reveal that the only extruded pair of nucleotides which have a
sufficient contact with the protein are nucleotide bases -4 and -541 (see Table3 cluster 215). Second,
they recognize the physico-chemical binding pattern common to all binding sites of this dinucleotide
pair (see Figure 4 (a)). They clearly show the conservation of the critical tyrosine side chain (Tyr85,
PDB: 2izn), which can tolerate any nucleotide sequence mutations (U |G|A|C). In addition, the
specific pattern recognized for the binding site of adenosine, can explain the specificity of this
position to this nucleotide type42 .
Repetitive Protein Patterns
Our dataset contained 4 structures of Pumilio proteins, that
regulate mRNA expression43 . These were observed to have three types of dinucleotide binding sites
which are detailed in clusters 83, 92 and 206. Interestingly, our clusters reveal the repetitive nature
of the Pumilio repeats, which are comprised of similar structural and sequence motifs. Figure 4
shows one example of the similarity between the binding sites located on the sequence motifs known
as repeats 6 and 843 . The similar physico-chemical properties and surfaces of these binding sites
are formed by the amino acid sequence QYYQ which has the same spatial arrangement in both
repeats (see Supplementary Material). Although the overall structures of the compared proteins
are similar, the repetitive nature of these spatial patterns can not be recognized using standard
structural alignment methods44 .
2.2.3
Different structures, similar dinucleotide binding sites
As was illustrated in the example of Pumilio repeat proteins, similarity of the sequence patterns
can lead to the similarity of the dinucleotide binding sites. We start this section with an additional
example in which we recognize similar binding patterns created by an even more fuzzy sequence
motif. We further describe several of our clusters that reveal similar dinucleotide binding sites of
proteins that do not share any sequence or structure motifs (see Table 4).
K homology (KH) motif The K homology (KH) is a widespread RNA-binding sequence motif,
an invariant Gly-X-X-Gly segment45 , which is known to be shared by proteins with globally distinct
structures46 . Two of our clusters (clusters 34 and 106, see Table 4) reveal the similarity between
9
consecutive dinucleotide binding sites of transcription elongation proteins NusA (PDB: 2atw, 2asb)
and neuronal splicing factors Nova-1 (PDB: 2anr, 2ann). Both of these protein types possess
the KH motifs which are aligned by our method. Figure 5 presents the common binding pattern
of three consecutive nucleotides GAA (G1-A3, 2atw) and CAC (C214-C216, 2anr) of NusA and
Nova respectively. The recognized common protein physico-chemical pattern, which is consistent
with previous studies
19, 45, 47 ,
consists of the conserved hydrogen bond donors contributed by the
GXXG motif (2anr:G22,G25, 2atw:G237, G240), and hydrophobic aliphatic interactions formed by
the conserved Ile residues (2anr:I21,I39, 2atw:I236,I255).
Aminoacyl-tRNA synthetases Nine of our clusters of the second type, reveal similarities between different types of Aminoacyl-tRNA synthetase proteins. Interestingly, the structures of tRNA
synthetases coding for different amino acids are different and they are currently estimated to evolve
from different single domain proteins which are not known to have any common ancestor48 . However, it is estimated that the structural relationship between these proteins may reflect intermediate
steps in the establishment of codon-amino acid interactions48 . Our clusters, that are detailed in
Table 4 support this theory and reveal the binding patterns that may be reused in different protein
regions that are important for RNA recognition.
Figure 5(b) illustrates one such example of the similarity between two anticodon loop binding
sites of Glutamyl-tRNA synthetase (PDB:1n78) and Aspartyl-tRNA synthetase (PDB:1il2). These
binding sites accommodate the dinucleotide pairs CU and GU that are part of the codons CU C and
GU C coding for Glu and Asp respectively. Notably, neither the overall structures of these complexes
nor the rest of the anticodon loops are aligned by our method. This suggests that the recognized
pattern may be the only building block reused during evolution in a manner that will allow the
specificity of each binding site. Figure 5(c) presents an additional example of similarity between
Aspartyl-tRNA synthetase (PDB: 1asy) rRNA (Uracil-5-)-methyltransferase RumA (PDB: 2bh2).
Both binding sites bind the GU nucleotide pair. In 1asy this pair of nucleotides (634-635) is located
in the anticodon loop, while in 2bh2 the GU pair (1954-1955) is responsible for the recognition of
the hairpin by the N-terminal OB fold. As can be seen, in both of the above examples in spite of
10
the structural and functional differences of the proteins, their binding sites share similar physicochemical properties and shapes and bind exactly the same dinucleotide sequence in the same spatial
orientation.
2.3
Prediction of RNA binding sites and structure of protein RNA complex
Given a target protein structure, we aim to recognize the binding sites of its extruded nucleotides
and to construct the RNA fragments that can bind to them, predicting the structure of an RNA
strand and of the protein-RNA complex. Figure 2 illustrates the flow of our classification and
prediction processes. Specifically, given the clusters of binding sites described in the previous
section, we define the 3D consensus binding patterns, by the physico-chemical properties shared by
all cluster members (see Methods). Given a target protein structure not used in the classification
we search its surface for the presence of any of these 3D consensus patterns, predicting its nucleotide
and dinucleotide binding sites. Using the nucleotide orientations observed in the consensus patterns
allows us to predict the structure of RNA strands and protein-RNA complexes.
2.3.1
Prediction of RNA binding sites
Here, we use the created clusters to predict RNA binding sites that accommodate unpaired extruded
nucleotides. Specifically, given a target protein structure not used for the classification, we search
its surface for regions similar to the created 3D consensus binding patterns, which are predicted to
serve as binding sites. Due to a low number of single nucleotide clusters, we evaluate our predictions
of dinucleotide binding sites. We perform leave-one-out tests on all the structures that participated
in the above described clusters with more than one member. For each left out complex, we repeat
the classification procedure without its structure, and use the obtained clusters to define a new set
of 3D consensus binding patterns. Then, we search the surface of the left out protein for a presence
of the constructed 3D patterns. All of these are searched by a single algorithm (RnaPred, see
Methods) which recognizes the top ranking patterns that are similar to some protein regions. For
each dinucleotide binding site of each left out protein we check whether it was correctly predicted
by this procedure. A dinucleotide binding site is considered to be correctly predicted if one of the
11
20 top ranking solutions fulfils at least one of the following requirements: (1) the RMSD distance
between the dinucleotides of the real complex and those predicted by the protein-profile alignment is
≤ 3Å; (2) There are at least four (or 30%) of physico-chemical properties (pseudocenters) involved
in protein-dinucleotide interactions of the specific site that were correctly predicted by the proteinprofile alignment. The percentage of successfully predicted dinucleotide binding sites is the success
rate of our methodology. Notably, the success rate calculated in these leave-one-out tests is 80%.
Moreover, in 50% of the cases the correct prediction was the top ranking solution and in 82% of
the cases it was within the 5 top ranking solutions. We consider this to be a success, especially
since we do not only point to some specific amino acids that are involved in the interactions, but
predict the spatial orientation of the RNA nucleotides in the protein binding site. Consequently,
we expect that the combination of such predictions will allow the reconstruction of protein-RNA
complexes with an unknown structure. Below, we detail our first steps in achieving this goal.
2.3.2
Prediction of RNA strands and protein-RNA complexes
Given a protein structure, we would like to be able to predict which RNA fragments if can bind
and what is the structure of the resulting protein-RNA complex. This is a very ambitious goal and
currently we only show some initial steps in achieving it.
Existing tools are unable to provide a good solution to this problem for several reasons. First,
we aim to solve this problem without any assumption or knowledge of the structure of the RNA
molecule. Due to the limited number of existing RNA structures, this is an important requirement.
Unfortunately, it prevents using standard methods like docking for its solution. Currently, the most
applicable approach is the superimposition of the given protein upon the protein-RNA complex of
its closest homologue. This provides the prediction of the complex between the input protein and
the RNA molecule bound to the homologue. However, this approach has several limitations. First,
as can be seen in the example of the RBD clusters, similar overall sequences and folds do not
always lead to similar nucleotide binding modes. Second, the superimposition of proteins done by
their backbone atoms often misaligns the RNA molecules and the specific nucleotide binding sites
of interest. Our methodology improves these points in the following ways. First, by looking for
12
the similarity of the physico-chemical patterns in the binding sites, we use only the most reliable
protein information which indeed leads to the similarity in binding. Second, we superimpose the
proteins according to the transformations calculated for their binding sites which usually optimize
the similarity of the RNA nucleotide orientations.
Here, we extend the RnaPred algorithm to the following scheme. Given a protein structure,
we first recognize all regions that are similar to any of the above described 3D consensus patterns.
Then, we superimpose the matched 3D patterns upon our target structure and consider the RNA
dinucleotides that are bound to these patterns. We consider all the solutions and try to connect
the nucleotides superimposed on neighboring regions to form a valid continuous RNA fragment.
The longest RNA fragment with the highest similarity score of the alignment between the protein
binding sites and the selected 3D patterns is the top ranking solution. Since the constructed
RNA fragment is comprised of dinucleotides taken from 3D consensus patterns of totally different
proteins, it can be different from any RNA fragment of known sequence and structure. This allows
us to make unique predictions that can not be achieved by other methods. As described below, these
can contribute to such applications as the design of RNA aptamer drugs which target protein-RNA
interactions.
To evaluate the performance of this method, we have performed leave-one-out tests similar to
those described in the previous section. As before, each time we have left out one protein structure,
re-clustered and created a new set of 3D consensus patterns that do not assume any knowledge of
the left out structure. Then, this structure was searched for the highest scoring sequence of regions
similar to the constructed 3D consensus patterns that allows us to create the longest continuous
RNA fragment. The average length of the predicted fragment was 5 and the average running time
on a standard PC was 3.5 minutes (AMD Opteron 242, 1593MHz). Notably, when we calculated
the RMSD between the predicted RNA fragments and the real fragments bound to the left out
structures in 22% of the cases it was less than 3Å, and in 17% of the cases it was even less than
1Å. In the rest of the cases, although we have precisely reconstructed some sub fragments, we have
added false positive predictions which pointed to regions that are not in interaction in the given
13
complex. Figure 6 presents two examples which illustrate our success and limitations. In the first
example, we reconstruct the RNA fragment and the structure of HutP antitermination complex
(1wpu). The total length of the RNA strand of this complex is 7 nucleotides and it interacts
with the protein through 4 dinucleotide binding sites. When we applied our prediction method and
searched the surface of the HutP protein, we have correctly predicted all 4 of its dinucleotide binding
sites. As expected, the prediction was made based on four 3D consensus patterns of homologous
proteins (1wrq, 1wmq), the chaining of which constituted the top ranking solution. Interestingly, 2
of these patterns were based on clusters that contained additional binding sites of Aminoacyl-tRNA
synthetase (1n78) the anticodon CU C loop of which was recognized to be similar to the central
U AG motif of HutP (see Table 4, cluster 63). In spite of this addition, the patterns were correctly
mapped to the query protein and the structure of the RNA fragment was correctly predicted with
the overall RMSD of 0.5Å from the native. However, the structure of two 5, U U nucleotides was
not predicted by our method. The first 5, U was ignored due to some missing atoms in one of
the structures (PDB:1wrq) which led to a singleton cluster of 1wmq, which was not used for the
construction of 3D patterns. Since the second U nucleotide has no interaction with the protein,
its binding site and ring orientation could not be predicted by our method (see Figure 6 (a) top).
Interestingly, this nucleotide is also more flexible than the rest and its prediction is more difficult.
Figure 6 (b) presents another example in which we reconstruct part of the hepatitis delta
precursor in interaction with the U1A protein. The loop of this RNA molecule interacts with the
protein through 6 dinucleotide binding sites. Four of these sites, formed by the nucleotides A1-C5
of U1A
39
were correctly predicted and mapped by our method. This allowed us to reconstruct a
fragment of five RNA nucleotides with overall RMSD of less than 0.3Å from their native structure.
However, we failed to recognize the dinucleotide binding site of C5-A6. Consequently, the fragment
A6-C7, the binding site of which was correctly mapped to the protein surface, was not added to
the reconstructed RNA fragment. Interestingly, this failure occurred in the prediction of the TyrPhe binding platform which is described above (see Figure 3). This is explained by the fact that
during the reclustering without query protein, another structure of Sxl (PDB:1fxl) was added to
14
the cluster. The Tyr-Phe binding platform of the query protein detailed in Figure 3 (a) is slightly
different from the binding platform of 1fxl represented in Figure 3 (b). The main difference is not in
the aromatic stacking platform, but in the spatial disposition of the rest of the properties that form
the binding sites. Consequently, when we combined the two patterns into one cluster represented
by only 5 pseudocenters of 1fxl, we failed to correctly map this pattern on the structure of 1sj3,
which is the most different from 1fxl (see Section 2.1). Since in the current version we attempt to
reconstruct only sequential fragments, we stopped upon the failure and did not extend the predicted
RNA fragment.
2.4
Drug design applications
There are two main types of drugs that can be developed to prevent the formation of protein-RNA
interactions. The first type are the most common small molecule drugs, which are bioavailable when
administered orally. The second type are drugs based on short strands of RNA oligonucleotides.
These ligands, known as aptamers are selected for their ability to bind proteins with both high
affinity and high specificity33 . Below we present drugs discovery applications and show how our
methodology can contribute to the development of both type of drugs.
2.4.1
Discovery of small molecule drugs
We have observed that the average surface area of the protein binding sites that accommodate
dinucleotide pairs is 145Å2 if the pairs are π-stacked and 210Å2 otherwise. Since optimal drug
molecules were proposed to have a surface area smaller than 140Å2 due to bioavailability reasons49 ,
the dinucleotide binding sites have a high potential to accommodate drugs and serve as drug targets. To provide suggestions of potential ligands and ligand fragments that can target dinucleotide
binding sites, we have searched the constructed 3D consensus patterns against a database of all
known small molecule binding sites (see Methods). The top ranking solutions of this search application are the binding sites with properties similar to dinucleotide binding sites. The drug leads
and the substrates that are bound to them provide ideas for small molecules that can bind to the
dinucleotide binding sites. The details of the top ranking binding sites and their corresponding
15
ligands is provided in the Supplementary Material. Figure 7 (a) presents one example of U 3 − A4
dinucleotide binding site of HutP aligned to the binding site of Cyclophilin A bound to sanglifehrin
macrolide (PDB:1nmk, rank 24). As can be seen, the sanglifehrin molecule has a volume and shape
similar to the U A pair. Its chemical groups and scaffold can be used in the design of new inhibitors
for HutP.
2.4.2
Discovery of novel aptamers
Our new, RnaPred, method described in Section 2.3.1 predicts the sequence and the structure of
the RNA strands that can bind to the protein of interest. It provides the suggestions of the initial
RNA sequences that can serve as the starting point in the design of new aptamers. Since our
method uses the information from all available protein-RNA complexes, it can design previously
unknown RNA sequences and structures that were never experimentally observed. Since all the
predictions are knowledged-based, the constructed models are based on rational considerations of
the nucleotide geometries typical to the observed protein physico-chemical patterns.
2.4.3
Aptamer optimization
Obviously, the predictions provided by a computational method are only a set of suggestions, which
require a further validation and optimization. One way to improve the binding affinity and/or
selectivity is to optimize the single extruded nucleotides, which are unpaired and are not part of
a protein interacting dinucleotide pair. To obtain ideas for the chemical groups and scaffolds that
can be used for the modification, we can search the above described database of drug-like binding
sites for those that are similar to the nucleotide binding site of interest. The detailed results of such
searches performed with the constructed 3D consensus binding patterns of single nucleotide binding
sites is provided in the Supplementary Material. Figure 7 (b) presents one example of the similarity
between a single nucleotide binding site (A37) of Cysteinyl-tRNA synthetase (PDB:1u0b) aligned to
the Human macrophage elastase (MMP-12, PDB:1ros, ranked 28) bound to its inhibitor. Since the
binding regions of two proteins have similar physico-chemical properties and secondary structure
elements (helix and 2 strands), the inhibitors developed for the MMP-12 can provide useful ideas
16
for the chemical groups that can be used to substitute and optimize the adenine nucleotide of
Cysteinyl-tRNA synthetase.
3
Summary and conclusions
Motivated by the important role of extruded non-paired RNA nucleotides in protein-RNA recognition, we have investigated their local geometries and interactions. We have observed that in most
cases of protein-nucleotide interactions, there are several consecutive RNA nucleotides that are not
involved in RNA base pairing. Since the nucleotide identities are not indicative of their spatial
geometry, we consider the protein pockets that accommodate them. We observed that many of
the consecutive nucleotide pairs share the same binding cavity and interact with each other. Consequently, we suggest that the binding patterns common to such nucleotide pairs provide a more
correct representation of these interactions.
We proposed a novel algorithmic framework which starts with the classification of all known
nucleotide and dinucleotide binding patterns according to their spatial physico-chemical patterns.
These patterns, termed 3D consensus patterns, are further used for the prediction of binding sites
and RNA fragments. Obviously, the proposed framework is just a starting point and each of its
stages can be further enhanced and improved. The classification methodology, which has the advantage of the spatial validation of the created patterns, has all the disadvantages of the regular
center star clustering and is sensitive to the selected star centers and the order of traversal. The
created 3D consensus patterns, which were shown to be extremely useful, do not contain the information about the variation of the spatial patterns of the cluster members, whose description is not
straightforward. Selection of the shared pattern coordinates from a single structure could further
influence the results. Currently, we do not predict the interactions formed with the RNA backbone, which are often represented by smaller and more flexible physico-chemical binding patterns.
Nonetheless, the results of this paper indicate that it is possible to predict the interactions and the
structure of fragments of single-stranded RNA bases. We intend to use the methodology presented
here as the first part in a two stage scheme for the prediction of complete protein-interacting RNA
17
strands. The results of this paper allow the prediction of binding sites and interactions formed
with the nucleotide bases. Modeling the short RNA sub-fragments between the predicted regions
is expected to allow the challenging reconstruction of complete RNA fragments.
In addition to being an important milestone towards achieving the ultimate RNA-protein structure prediction goal, our results provided several important insights. First, the presented classification reveals novel and surprising similarities between dinucleotide binding sites formed by proteins
with different overall sequences, folds and functions. These results suggest that certain physicochemical patterns may be reused during the evolution in different protein regions that are important
for RNA recognition. Second, we have presented a framework which allows a successful prediction
of dinucleotide binding sites as well as recognition of ligands and ligand fragments that can target
them. We hope that this will be useful in the design of aptamer and small molecule drugs that
interfere with protein-RNA interactions.
4
Methods
Dataset construction and resources
The datasets of RNA structures and protein-RNA com-
plexes were retrieved from the NDB database50 , March 2007, and contained 1103 and 278 structures respectively. The dataset of protein-RNA complexes contained only structures with resolution
3Å and better, while the dataset of all RNA structures contained NMR structures as well. Sequence redundancy was removed using BlastClust51 with a 50% sequence identity threshold. When
considering protein-RNA complexes, structures with more than 50% sequence identity in both RNA
and protein chains were removed. The local base pairing of RNA nucleotides was recognized using
the 3DNA software
52 .
Alignment of nucleotide and dinucleotide binding sites
Since, it was shown that single
nucleotides can bind in alternative modes even to the same protein binding site53 , the alignment
of their binding sites was performed, using our previously developed, MultiBind method54 . This
method allows the recognition of the maximal physico-chemical pattern common to the input set
of binding sites, without using any information regarding the corresponding binding partners.
18
The dinucleotide binding sites extracted from protein-RNA complexes contain additional information about the spatial orientation of its two nucleotides, which we aim to predict. Consequently,
it is essential that the alignment method will require the similarity of the nucleotide geometries
in the aligned binding sites. To fulfil this requirement, we have developed a new algorithm, RnaBind, which aligns between dinucleotide binding sites and utilizes the nucleotide orientation for
the construction of 3D transformations that superimpose the input binding sites. Specifically, each
dinucleotide pair is represented by the two centroids of the corresponding nucleotide rings and the
phosphate atom between them. Then we apply the Least-Squares Fitting method55 to calculate
the transformation that provides the best alignment of such representative triplets extracted from
the input binding sites. Once the binding sites are superimposed in 3D space we apply maximum
weight match in a weighted bipartite graph56 to determine the 1:1 correspondences between the
matched pseudocenters of the input binding sites and score their similarity by the scoring functions
developed for our MultiBind method54 .
Multiple center star clustering
The standard clustering methods such as UPGMA or k-
means57 provide a general methodology to group any elements based on their pair-wise relations.
Obviously, some complex elements that may be similar between each other, i.e. in pairs, may
not be similar as a whole group. For the clustering of dinucleotide binding sites we are interested
in computing clusters of binding sites that are similar as a whole group, i.e. finding a group of
binding sites that share a significant consensus pattern. Therefore, we developed the following new
clustering procedure.
Similar to most existing methods, we start with performing all-against all pairwise alignments
between objects of interest. Then, we use the calculated pairwise scores to create a graph in
which the objects of interest are the nodes. Edges are created between similar nodes with a
pairwise similarity score above certain threshold. We require that the similarity score between two
nodes (S(n1 , n2 )) will be at least 30% of the score of one of them aligned to itself ((S(n1 , n2 ) ≥
0.3 ∗ S(n1 , n1 )) ∨ (S(n1 , n2 ) ≥ 0.3 ∗ S(n2 , n2 ))). Then, for each node, we consider the “star” that
is created by its edges and determine its weight by the sum of their corresponding scores. We go
19
over all the “stars” in decreasing order of their weight and try to create a cluster around each “star
center”. For each star, we go over its edges in the decreasing order of their score and try to create
a cluster comprised of the star center and the nodes of the selected edges. For example, we start
with the highest scoring edge and perform the alignment between its two nodes. Then, we add the
second edge and perform a multiple alignment between three nodes: the “star” center and the two
nodes of its two highest scoring edges. If the score of the core of this multiple alignment is above
30% of the score of each node, this triplet is defined as a cluster. Otherwise, the last node that
was just added in this iteration is removed from the cluster. It will be considered later according
to its other edges. As to the current “star”, we proceed to go over the rest of its edges and add to
the cluster only those nodes which multiple alignment with the current cluster members receives a
high enough score (more than 30% similarity in our example). Figure 8 (a) illustrates the process
of creation of three clusters. The procedure terminates when all the nodes have been assigned to a
cluster. Each cluster defines a consensus 3D pattern, which are the 3D points shared by all of its
members. The coordinates that describe this pattern are taken from the node that served as “star
center” and have originated the cluster.
Prediction of dinucleotide binding sites We have developed a method, RnaPred, which
searches a complete protein for regions similar to the created 3D consensus patterns. The input
protein and the patterns are represented by the set of their corresponding pseudocenter points (see
Section 2.1). The RnaPred algorithm consists of the following 4 stages: (1) Generation of a set
candidate transformations and the superimposition ; (2) Defining the 1:1 correspondence between
the matched points; (3) Scoring and the selection of top ranking solutions. (4) Clustering the
solutions of each 3D pattern as well as of different patterns mapped to the same protein region.
Superimposition: The generation of all candidate transformations is done with the Geometric
Hashing method58 which consists of two stages, preprocessing and recognition. At the preprocessing,
each triplet of pseudocenters from the complete molecule is considered as a local reference frame.
Then, the coordinates of the other pseudocenters, calculated with respect to this coordinate system,
are stored in a Geometric Hash Table. The key to the hash table consists of the point coordinates
20
and physico-chemical property. In the recognition stage the same process is repeated for each 3D
consensus pattern. For each pair of reference frames, one from a 3D pattern and one from the
query protein, we count the number of matched points. For each pattern, we consider the reference
frames with a significant number of matched points and construct a set of candidate transformations
that can superimpose the given pattern upon the entire query protein. This procedure has several
advantages for our goal. First, all the information of the complete molecule is stored only once
for all the 3D patterns. Second, we avoid the processing of points that cannot be matched under
any transformation. This is especially important for our application since the complete molecule is
significantly larger than the 3D consensus patterns.
Correspondence and Scoring: The 1:1 correspondence and the scoring are implemented in the
same manner as in the RnaBind method. We ignore solutions that are too small and insignificant
and align less than 30% of a 3D consensus pattern (i.e. the score of the alignment is less than 30%
of the score of the pattern aligned to itself).
Clustering: First, for each 3D pattern we cluster similar solutions that superimpose it to similar
spatial locations. This is achieved by applying efficient RMSD clustering, described by Rarey et
al59 , with a default threshold of 3Å . When similar alignments are detected, we retain only a
single solution with the highest score. Since our goal is to retain only a small set of top ranking
solutions, we need to cluster the solutions that align different 3D consensus patterns to similar
protein regions. We define two mappings of different 3D consensus patterns to be similar if their
corresponding match lists (defined by the 1:1 correspondence) are based on at least 70% of identical
pseudocenters of the complete query molecule.
Since the algorithm performs simultaneous alignment against all the constructed patterns, it is
extremely fast. Its average running time for searching a complete protein surface for the presence
of all the di-nucleotide consensus binding patterns measured during all the leave-one-out tests (see
Results) is 3 minutes on a standard PC (AMD Opteron 242, 1593MHz).
Reconstruction of protein-RNA complexes Here, we extend the RnaPred algorithm described in the previous section to the recognition of the longest consecutive set of alignment of 3D
21
patterns to neighboring protein regions. Specifically, given all possible alignments of all the created
3D consensus patterns to the query protein, we construct a graph, in which these solutions are
the nodes. Let (a1 , a2 ) and (b1 , b2 ) represent the RNA nucleotide pairs bound to two aligned 3D
patterns. An edge in this graph is created if the alignments of 3D patterns and their corresponding
dinucleotides satisfy the following requirements: (1) They have a pair of overlapping nucleotide
rings: dist(a2 , b1 ) ≤ 1 ; (2) They have a pair of distant rings: dist(a1 , b2 ) > 1 (i.e. the 3D patterns
are aligned to different protein regions); (3) The planes of the overlapping rings are parallel to
each other: angle(a2 , b1 ) ≤ 2 . Using 1 = 2.0 and 2 = 20o , we ensure that the dinucleotide pairs
from two 3D patterns can be combined to a consecutive RNA fragment. Given the constructed
graph, we implemented a variation of the Bellman-Ford algorithm60 which recognizes K longest
paths with the maximal weight. Specifically, one path (p1 ) is considered to be better that the other
(p2 ) if it has at least the same length (length(p1 ) ≥ length(p2 )) and its cumulative score is higher
(Cs(p1 ) ≥ Cs(p2 )). The cumulative score, which is the sum of scores of all the nodes along the
path, reflects the quality of the alignments between the 3D consensus patterns to the corresponding
neighboring protein regions. Figure 8 (b) provides a schematic representation of the main idea of
this method.
Database of small molecule binding sites
The dataset of PDB structures complexed with
small molecules was retrieved from the PDBsum database61 , May 2007. We retained only the binding sites of compounds with more than 7 non-hydrogen atoms that are not covalently linked to the
protein. In addition, to remove the binding sites complexed with natural substrates, such as ATP
or GTP, we counted the frequency of occurrence of each ligand in the PDB. Small molecules that
appeared in more than 10 structures were assumed to be natural, frequently occurring substrates
which can not make a significant contribution in our searches for drugs and drug fragments. As
a result we constructed a dataset of 3999 binding sites which were screened for their similarity to
nucleotide and dinucleotide binding patterns.
22
Supplementary Material
Supplementary Figure S1 presents histograms of angles and distances between the ring planes
of consecutive, protein interacting nucleotides. Supplementary Table S1 presents the general
frequencies of consecutive, extruded, protein interacting nucleotides that are not involved in RNA
base pairing. Table S2 presents the analysis of the sequence specificity of pairs of consecutive,
extruded π-stacked nucleotides. Table S3 presents the full list of clusters of single, extruded,
nucleotide binding sites. Table S4 presents the full list of clusters of dinucleotide binding sites.
Table S5 presents the details of the similarity between the Phe-Tyr binding platforms of HuD RBD
(1fxl) and Rho terminating factor (1pvo) revealed by cluster 18. Table S6 details the common
physico-chemical binding pattern of the Phe-Tyr binding platform of RBDs revealed by cluster 100.
Table S7 details the physico-chemical properties shared by 8 dinucleotide binding sites of Pumilio
proteins of cluster 83 (QYYQ motif). Tables S8-S9 provide lists of 200 top ranking protein small
molecule binding sites that were recognized to be similar to the created nucleotide and dinucleotide
consensus binding patterns respectively.
Acknowledgments
We thank O. Dror for contribution of software for this project. The research of AS-P was supported by the
Clore PhD Fellowship. The research of HJW has been supported in part by the Israel Science Foundation
(grant no. 281/05), by the NIAID, NIH (grant No. 1UC1AI067231), by the Binational US-Israel Science
Foundation (BSF)and by the Hermann Minkowski-Minerva Center for Geometry at TAU. This publication
has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes
of Health, under contract NO1-CO-12400. This research was supported [in part] by the Intramural Research
Program of the NIH, National Cancer Institute, Center for Cancer Research. The content of this publication
does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does
mention of trade names, commercial products, or organizations imply endorsement by the US Government.
23
References
[1] Jones, S., Daley, D. T. A., Luscombe, N. M., Berman, H. M. & Thornton, J. M. (2001). ProteinRNA interactions: a structural analysis. Nucleic Acids Res. 29, 943–954.
[2] Nadassy, K., Wodak, S. J. & Janin, J. (1999). Structural features of protein-nucleic acid
recognition sites. Biochemistry, 38, 1999–2017.
[3] Chen, Y., Kortemme, T., Robertson, T., Baker, D. & Varani, G. (2004). A new hydrogenbonding potential for the design of protein-RNA interactions predicts specific contacts and discriminates decoy. Nucleic Acids Res. 32, 5147–5162.
[4] Allers, J. & Shamoo, Y. (2001). Structure-based analysis of protein-RNA interactions using the
program ENTANGLE. J. Mol. Biol. 311, 75–86.
[5] Morozova, N., Allers, J., Myers, J. & Shamoo, Y. (2006). Protein-RNA interactions: exploring
binding patterns with a three-dimensional superposition analysis of high resolution structures.
Bioinformatics, 22, 2746–2752.
[6] Hoffman, M. M., Khrapov, M. A., Cox, J. C., Yao, J., Tong, L. & Ellington, A. D. (2004).
AANT: the Amino Acid-Nucleotide Interaction Database. Nucleic Acids Res. 32, D174 – 181.
[7] Treger, M. & Westhof, E. (2001). Statistical analysis of atomic contacts at RNA-protein interfaces. J Mol Recognit. 14, 199–214.
[8] Cheng, A. C., Chen, W. W., Fuhrmann, C. N. & Frankel, A. D. (2003). Recognition of nucleic
acid bases and base-pairs by hydrogen bonding to amino acid side-chains. J. Mol. Biol. 327,
781–96.
[9] Jeong, E., Chung, I. F. & Miyano, S. (2004). A neural network method for identification of
RNA-interacting residues in protein. Genome Inform. 15, 105–16.
[10] Terribilini, M., Lee, J.-H., Yan, C., Jernigan, R. L., Honavar, V. & Dobbs, D. (2006). Prediction
of RNA binding sites in proteins from amino acid sequence. RNA, 12, 1450–1462.
24
[11] Terribilini, M., Sander, J. D., Lee, J. H., Zaback, P., Jernigan, R. L., Honavar, V. & Dobbs, D.
(2007). RNABindR: a server for analyzing and predicting RNA-binding sites in proteins. Nucleic
Acids Res. 35, W578–84.
[12] Kim, O. T. P., Yura, K. & Go, N. (2006). Amino acid residue doublet propensity in the
protein-RNA interface and its application to RNA interface prediction. Nucleic Acids Res. 34,
6450–60.
[13] Wang, L. & Brown, S. J. (2006). BindN: a web-based tool for efficient prediction of DNA and
RNA binding sites in amino acid sequences. Nucleic Acids Res. 34, W243 – W248.
[14] Draper, D. E. (1999). Themes in RNA-protein recognition. J. Mol. Biol. 293, 255–70.
[15] Auweter, S. D., Oberstrass, F. C. & Allain, F. H.-T. (2006). Sequence-specific binding of
single-stranded RNA: is there a code for recognition? Nucleic Acids Res. 34, 4943–4959.
[16] Klosterman, P. S., Hendrix, D. K., Tamura, M., Holbrook, S. R. & Brenner, S. E. (2004).
Three-dimensional motifs from the SCOR, structural classification of RNA database: extruded
strands, base triples, tetraloops and U-turns. Nucleic Acids Res. 32, 2342–52.
[17] Tamura, M., Hendrix, D. K., Klosterman, P. S., Schimmelman, N. R., Brenner, S. E. &
Holbrook, S. R. (2004). SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Res.
32, D182–4.
[18] Chen, Y. & Varani, G. (2005). Protein families and RNA recognition. FEBS J. 272, 2088–97.
[19] Lunde, B. M., Moore, C. & Varani, G. (2007). RNA-binding proteins: modular design for
efficient function. Nat Rev Mol Cell Biol. 8, 479–490.
[20] Murzin, A., Brenner, S., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification
of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
[21] Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S. R. & Bateman, A. (2005).
Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121–4.
25
[22] Macke, T. J., Ecker, D. J., Gutell, R. R., Gautheret, D., Case, D. A. & Sampath, R. (2001).
RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res. 29,
4724–35.
[23] Maris, C., Dominguez, C. & Allain, F. H.-T. (2005). The RNA recognition motif, a plastic
RNA-binding platform to regulate post-transcriptional gene expression. FEBS Journal, 272
(9), 2118–2131.
[24] Handa, N., Nureki, O., Kurimoto, K., Kim, I., Sakamoto, H., Shimura, Y., Muto, Y. &
Yokoyama, S. (1999). Structural basis for recognition of the tra mRNA precursor by the Sexlethal protein. Nature, 398, 579–585.
[25] Antson, A. A. (2000). Single-stranded-RNA binding proteins. Curr Opin Struct Biol. 10,
87–94.
[26] Turner, B., Melcher, S. E., Wilson, T. J., Norman, D. G. & Lilley, D. M. (2005). Induced fit
of RNA on binding the L7Ae protein to the kink-turn motif. RNA, 11, 1192–200.
[27] Shajani, Z., Drobny, G. & Varani, G. (2007). Binding of U1A protein changes RNA dynamics
as observed by 13C NMR relaxation studies. Biochemistry, 46, 5875–83.
[28] Ellis, J. J. & Jones, S. (2007). Evaluating conformational changes in protein structures binding
RNA. Proteins, . in press.
[29] Moodie, S. L., Mitchell, J. B. O. & Thornton, J. M. (1996). Protein recognition of adenylate:
an example of a fuzzy recognition template. J. Mol. Biol. 263, 486–500.
[30] Kuttner, Y. Y., Sobolev, V., Raskind, A. & Edelman, M. (2003). A consensus-binding structure
for adenine at the atomic level permits searching for the ligand site in a wide spectrum of adeninecontaining complexes. Proteins, 52, 400–411.
[31] Connolly, M. (1983). Analytical molecular surface calculation. J. Appl. Cryst. 16, 548–558.
26
[32] Schmitt, S., Kuhn, D. & Klebe, G. (2002). A new method to detect related function among
proteins independent of sequence or fold homology. J. Mol. Biol. 323, 387–406.
[33] Bunka, D. H. J. & Stockley, G. (2006). Aptamers come of age - at last. Nat Rev Microbiol,
4, 588–596.
[34] Guallar, V. & Borrelli, K. W. (2005). A binding mechanism in protein-nucleotide interactions:
Implication for U1A RNA binding. Proc. Natl. Acad. Sci. USA, 102, 3954–3959.
[35] Burley, S. K. & Petsko, G. A. (1985). Aromatic-aromatic interaction: a mechanism of protein
structure stabilization. Science, 229, 23–28.
[36] Shulman-Peleg, A., Nussinov, R. & Wolfson, H. J. (2004). Recognition of functional sites in
protein structures. J. Mol. Biol. 339(3), 607–633.
[37] Deo, R. C., Bonanno, J. B., Sonenberg, N. & Burley, S. K. (1999). Recognition of polyadenylate
RNA by the poly(A)-binding protein. Cell, 98, 835–845.
[38] Sickmier, E. A., Frato, K. E., Shen, H., Paranawithana, S. R., Green, M. R. & Kielkopf,
C. L. (2006). Structural Basis for Polypyrimidine Tract Recognition by the Essential Pre-mRNA
Splicing Factor U2AF65. Molecular Cell, 23, 49–59.
[39] Shiels, J. C., Tuite, J. B., Nolan, S. J. & Baranger, A. M. (2002). Investigation of a conserved
stacking interaction in target site recognition by the U1A protein. Nucleic Acids Res. 30, 550–
558.
[40] Grahn, E., Moss, T., Helgstrand, C., Fridborg, K., Sundaram, M., Tars, K., Lago, H., Stonehouse, N. J., Davis, D. R., Stockley, P. G. & Liljas, L. (2001). Structural basis of pyrimidine
specificity in the MS2 RNA hairpin-coat-protein complex. RNA, 7, 1616–1627.
[41] Horn, W. T., Tars, K., Grahn, E., Helgstrand, C., Baron, A. J., Lago, H., Adams, C. J.,
Peabody, D. S., Phillips, S. E., Stonehouse, N. J., Liljas, L. & Stockley, P. G. (2006). Structural
basis of RNA binding discrimination between bacteriophages Qbeta and MS2. Structure, 14,
487–495.
27
[42] Helgstrand, C., Grahn, E., Moss, T., Stonehouse, N. J., Tars, K., Stockley, P. G. & Liljas,
L. (2002). Investigating the structural basis of purine specificity in the structures of MS2 coat
protein RNA translational operator hairpins. Nucl. Acids Res. 30 (12), 2678–2685.
[43] Wang, X., McLachlan, J., Zamore, P. D. & Hall, T. M. (2002). Modular recognition of RNA
by a human pumilio-homology domain. Cell, 110, 501–512.
[44] Shatsky, M., Nussinov, R. & Wolfson, H. J. (2004). A method for simultaneous alignment of
multiple protein structures. Proteins, 56, 143–156.
[45] Lewis, H. A., Musunuru, K., Jensen, K. B., Edo, C., Chen, H., Darnell, R. B. & Burley, S. K.
(2000). Sequence-Specific RNA Binding by a Nova KH Domain: Implications for Paraneoplastic
Disease and the Fragile X Syndrome. Cell, 100, 323–332.
[46] Grishin, N. V. (2001). KH domain: one motif, two folds. Nucleic Acids Res. 29, 638–643.
[47] Beuth, B., Pennell, S., Arnvig, K. B., Martin, S. R. & Taylor, I. A. (2005). Structure of a
Mycobacterium tuberculosis NusA-RNA complex. EMBO J. 24, 3576–87.
[48] Bori-Sanz, T., Guitart, T. & Ribas de Pouplana, L. (2006). Aminoacyl-tRNA synthetases: a
complex system beyond protein synthesis. Cont. to Science, 3, 149–165.
[49] Veber, D. F., Johnson, S. R., Cheng, H. Y., Smith, B. R., Ward, K. W. & Kopple, K. D.
(2002). Molecular properties that influence the oral bioavailability of drug candidates. J. Med.
Chem. 45, 2615–2623.
[50] Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh,
S. H., Srinivasan, A. R. & Schneider, B. (2003). The nucleic acid database. A comprehensive
relational database of three-dimensional structures of nucleic acids. Biophys J. 63 (3), 751–759.
[51] Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D. (1990). Basic local alignment
search tool. J. Mol. Biol. 215, 403–410.
28
[52] Lu, X.-J. & Olson, W. K. (2003). 3DNA: a software package for the analysis, rebuilding and
visualization of three-dimensional nucleic acid structures. Nucl. Acids Res. 31 (17), 5108–5121.
[53] Denessiouk, K. A., Rantanen, V. & Johnson, M. (2001). Adenine Recognition: A motif present
in ATP-,CoA-,NAD-,NADP-, and FAD-dependent proteins. Proteins, 44, 282–291.
[54] Shatsky, M., Shulman-Peleg, A., Nussinov, R. & Wolfson, H. (2006). The multiple common
point set problem and its application to molecule binding pattern detection. J Comput Biol. 13,
407–42.
[55] Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of
vectors. Acta Crystallogr. A 34, 827–828.
[56] Mehlhorn, K. (1999). The LEDA platform of combinatorial and geometric computing. Cambridge University Press.
[57] Jain, A. K., Murty, M. N. & Flynn, P. J. (1999). Data clustering: a review. ACM Computing
Surveys, 3, 264 – 323.
[58] Wolfson, H. J. (1990). Model-Based Object Recognition by Geometric Hashing. In Proc. of
the 1st European Conf. on Comp. Vision (ECCV) LNCS pp. 526–536, Springer-Verlang.
[59] Rarey, M., Wefing, S. & Lengauer, T. (1996). Placement of medium-sized molecular fragments
into active sites of proteins. J. Computer-Aided Molecular Design, 10, 41–54.
[60] Cormen, T., Leiserson, C. & Rivest, R. (1990). Introduction to Algorithms. MIT Press.
[61] Laskowski, R. A., Chistyakov, V. V. & Thornton, J. M. (2005). PDBsum more: new summaries
and analyses of the known 3D structures of proteins and nucleic acids. Nucleic Acids Res., 33,
D266–268.
[62] Andreeva, A., Howorth, D. & Murzin, A. G. (June 4, 2007). The pre-SCOP database. http:
//www.mrc-lmb.cam.ac.uk/agm/pre-scop/.
29
[63] Eiler, S., Dock-Bregeon, A., Moulinier, L., Thierry, J. C. & Moras, D. (1999). Synthesis of
aspartyl-tRNA(Asp) in Escherichia coli–a snapshot of the second step. EMBO J., 18, 6532–6541.
[64] Yang, X., Grczei, T., Glover, L. T. & Correll, C. C. (2001). Crystal structures of restrictocininhibitor complexes with implications for RNA recognition and base flipping. Nat Struct Biol.,
8, 968–973.
[65] Tishchenko, A., Nikulin, A., Fomenkova, N., Nevskaya, N., Nikonov, O., Dumas, P., Moine,
H., Ehresmann, B., Ehresmann, C., Piendl, W., Lamzin, V., Garber, M. & Nikonov, S. (2001).
Detailed analysis of RNA-protein interactions within the ribosomal protein S8-rRNA complex
from the archaeon Methanococcus jannaschii. J. Mol. Biol., 311, 311–324.
[66] Lee, T. T., Agarwalla, S. & Stroud, R. M. (2005). A unique RNA Fold in the RumA-RNAcofactor ternary complex contributes to substrate selectivity and enzymatic function. Cell, ,
120, 599–611.
30
Tables
Dataset
Num. of PDB
structures
Num. of consecutive
extruded pairs
π-stacked
parallel
π-stacked
perpendicular
Protein-RNA
complexes
All RNA structures
278
307
87 (28%)
0 (0%)
1139
4114
1291 (31%)
27 (0.7%)
Table 1: Frequency of π-stacking interactions between consecutive nucleotides. The table presents
the statistics of π-stacking interactions observed in the datasets of protein-RNA complexes and all RNA
structures. Columns 1-2 present the dataset details. Column 3 details the number of pairs of consecutive
extruded nucleotides, where only protein interacting nucleotides were considered in the dataset of proteinRNA complexes. Columns 4-5 present the total numbers and the frequencies of the considered types of πstacking interactions. To prevent the influence of missing strands which are part of a double helix, sequences
of more than 5 consecutive π-stacked nucleotides were not considered. Only the standard, unmodified,
nucleotides A, G, C, U, I were considered.
31
Num. Scop
dist.
96
1
28
1
122
3
19
3
194
3
127
3
128
3
129
3
79
3
99
3
100
3
π
type
1
2
2
2
2
2
2
2
1
2
2
Cluster
size
3
3
2
3
3
3
3
3
4
4
6
Nucleotide
pairs
AU
UU
UA, UG
UU
UU
AU, UU
UU
UU
UG
GC
CA, AU,
GU
AC, AG,
GC
Cluster
center
(pdb:chains:nts)
1sj3:PR:A49U50
1urn:AP:U7U8
1g2e:AB:U1A2
1fxl:AB:U2U3
1b7f:AP:U6U7
1g2e:AB:A6U7
1g2e:AB:U7U8
1g2e:AB:U8U9
1m5o:CB:U38G39
1sj3:PR:G52C53
1sj3:PR:C53A54
32
7
1
5
168
7
1
2
AA, GA
1cvj:AM:A4A5
280
7
2
2
CU, UU
1q2r:AE:C8U9
18
7
2
2
UU, UC
1fxl:AB:U1U2
1urn:AP:A11C12
Location
Cluster members annotation
A1-U2 of U1A39
U2-U3 of U1A39
5, U3-G4 of Sxl24
U5-U6 of Sxl24
U6-U7 of Sxl24
U6-U7 of Sxl24
U7-U8 of Sxl24
U8-U9 of Sxl24
U3-G4 of U1A39
G4-C5 of U1A39
G4-U5 of Sxl24 ,
C5-A6 of U1A39
A6-C7 of U1A39 ,
G33-C34 of anticodon loop
A4-A5 of poly(A)
and G2-A3 of
NSP3
Rna turn at C32C33 or tRNA and
U4-U5 of U2AF
U1-U2 of HuD,
U1-C2 of Rho
U1A (1urn, 1m5o, 1sj3)
U1A (1urn, 1m5o, 1sj3)
HuD (1g2e), Sxl(1b7f)
HuD (1fxl, 1g2e), Sxl(1b7f)
HuD (1fxl, 1g2e), Sxl(1b7f)
HuD (1fxl, 1g2e), Sxl(1b7f)
HuD (1fxl, 1g2e), Sxl(1b7f)
HuD (1fxl, 1g2e), Sxl(1b7f)
U1A(1sj3,1m5o,1urn), U2B(1a9n)
U1A(1sj3,1m5o,1urn), U2B(1a9n)
U1A(1sj3,1m5o,1urn),
U2B(1a9n), HuD(1g2e), Sxl(1b7f)
U1A(1sj3,1m5o,1urn),
U1B’(1a9n), tRNA synthetases
(1u0b)
RNA-binding domain, NSP3 homodimer(1knz)
tRNA-guanine
transglycosylase(1q2r), U2AF(2g4b)
RNA-binding domain, Rho termination factor (1pvo)
Table 2: Dinucleotide clusters that involve RNA-binding domain proteins. The first column is
the cluster number (according to the full list provided in the Supplementary Material). The second column
specifies the maximal SCOP20, 62 distance between the cluster members. Proteins at distance 3 belong to
the same SCOP super family, while proteins at distance 7 have totally different overall folds and no common
structural annotation. Column 4 indicates whether all the cluster dinucleotide pairs are π-stacked (1) or
non π-stacked (2). Column 5 indicates the number of binding sites in the cluster and column 6 provides the
details of the binding site that initiated the cluster (PDB code, protein and RNA chains, nucleotides’ (nts)
numbers according to the appearance on the RNA chain). Column 7 specifies the location of the considered
dinucleotides and Column 8 details the cluster members.
32
Num. Scop
dist.
215
1
π
type
2
Cluster Nucleotide pairs
size
20
UA,GA,AA,CA
Cluster center
(pdb:chains:nts)
2izn:AR:U9A10
Location
206
1
2
9
AU,AG,GU,UA
83
1
2
8
UG,UA,UC
92
1
2
4
CC,AU
254
148
149
126
1
1
1
1
1
1
1
1
3
2
2
2
UA,UU
GC
CC
UA
Rna 4nt-loop: 5, -442
1m8y:AC:A8U9 Pumilio repeats
3,4,643
1m8w:AC:U2G3 Pumilio repeats
6, 843
1m8w:AE:C5C6 Pumilio repeat
443
1n1h:AB:U1A2
Rna 5,
1c0a:AB:G65C66 Rna 3, GCCA63
1c0a:AB:C66C67 Rna 3, GCCA63
1g2e:AB:U5A6
Rna U-6, A-7
14
225
1
1
2
2
2
2
GA
AG,AU
1h2c:AR:G2A3
1jbt:AC:A15G16
200
146
1
2
1
3
2
4
CU
UC
208
2
2
2
AU, AA
113
49
2
2
2
1
2
2
CC
CA
Rna 3, G2-A3
Unfolded SRL
tetraloop64
1si2ABC6U7
Rna C6-U7
1c0a:AB:U31C32 Rna anticodon
loop U-635, C636 63
1i6u:AC:U25A26 Rna A-640, U64165
1qtq:AB:C72C73 Rna 3, CCA
1n78:AC:C73A74 Rna 3, CCA
Protein annotation
RNA bacteriophage capsid
protein
Pumilio-Homology Domain
Pumilio-Homology Domain
Pumilio-Homology Domains
Minor core protein λ3
Aspartyl-tRNA Synthetases
Aspartyl-tRNA Synthetases
HuD protein (RNA-binding
domain)
Matrix Protein Vp40
Ribotoxin restrictocin
Translation initiation factor
Aspartyl-tRNA Synthetases
Ribosomal proteins S8
Glutamyl-tRNA synthetases
Glutamyl-tRNA synthetases
Table 3: Similar structures, similar dinucleotide binding sites. Clusters comprised of dinucleotide binding sites of structurally similar proteins (same SCOP super family, according to the pre-SCOP
database62 ). The columns are the same as in Table 2, except the last column which details the common
annotation of proteins in the cluster.
33
Num. π
type
63
1
Cluster Nucleotide Cluster center
size
pairs
(pdb:chains:nts)
4
AG, UC
1wrq:AC:A4G5
87
1
4
34
2
106
Location
3
UA, CA,
UU
CA, GA
central U AG motif of
hut mRNA, CU C aticodon loop in synthetase
1m8w:AC:U6A7 Pumilio repeat 3-443 ,
Rna U10-U11 of Ro
2anr:AB:C13A14 protein KH motif
1
3
AA, AC
2atw:AB:A3A4
protein KH motif
74
2
2
IC, CA
1f7u:AB:I27C28
ICG anticodon loop,
3, CCA
187
2
2
CA, AC
1il2:AC:C66A67
45
3
2
CU, GU
47
3
2
CA
261
2
2
UC, CA
285
2
2
GU
3, CCA and A3-C4 of
NSP3
1n78:AC:C33U34 CU C and GU C aticodon loops
1n78:AC:C35A36 anticodon loop (34nts) and anticodon
loop
1qf6AB:U30C31 anticodon loop start
(with the wobble base)
and ACA motif
1asy:AR:G34U35 anticodon loop, hairpin recognition
Cluster members annotation
Hut operon regulatory protein, Aminoacyl-tRNA synthetases (1n78)
Pumilio-Homology Domain,
Ro ribonucleoprotein (2i91)
splicing factors Nova-1, transcription elongation proteins
NusA
transcription elongation proteins NusA, splicing factors
Nova-1
Aminoacyl-tRNA
synthetases, tRNA-binding arm
(1gax)
Aspartyl-tRNA synthetase,
NSP3 homodimer (1knz)
Glutamyl-tRNA synthetase,
Aspartyl-tRNA
synthetase
(1il2)
Glutamyl-tRNA synthetases,
methionyl-tRNA synthetase
(2csxACC34A35)
Threonyl-tRNA synthetase,
tRNA pseudouridine synthase (2hvy)
Aspartyl-tRNA synthetase,
rRNA
methyltransferase
(2bh2)
Table 4: Different structures, similar dinucleotide binding sites. Clusters comprised of similar
dinucleotide binding sites of proteins with different overall folds. The columns are the same as in Table 2.
34
Figures
(a) Parallel stacking
(b) Perpendicular stacking
(c) Non π-stacking
Figure 1: Geometries of RNA di-nucleotide pairs. (a) Parallel π-stacking interactions, as observed in
the U1A RNA-binding domain (PDB: 1m5o, A6-C7 of U1A39 ). The protein pseudocenters are represented
as balls. Hydrogen bond donors are blue, acceptors - red, donors/acceptors - green, and aromatic - grey.
(b) Perpendicular stacking, observed in an unbound structure of 23S Ribosomal RNA (1vor, chain B). (c)
Non pi-stacking interactions of methyltransferase RumA (PDB: 2bh2, G1954-U1955). Although the rings
participate in aromatic interactions, they are pi-stacked with the protein and not with each other66 .
Figure 2: The algorithmic framework. The framework consists of two main stages. First, we classify
known dinucleotide binding patterns and recognize the common types of the physico-chemical binding patterns (left flowchart). Then, we use these patterns for the prediction of di-nucleotide binding sites and for
the reconstruction of RNA fragments and protein-RNA complexes (right flowchart).
35
(a)
(b)
(c)
Figure 3: Di-nucleotide binding sites of RNA-binding domains. (a) Alignment between the six
RBD proteins of cluster 100 (see Table 2) according to the transformation calculated for the di-nucleotide
binding sites. Proteins are colored from ranging pink (1sj3) to blue (1b7f) according to cluster order in
the table. The bottom figure provides a focused view of the common Tyr-Phe aromatic binding platform.
The conserved amino acids of Tyr (Y13, Y128, Y214 in U1A/U2B, Hud and Sxl respectively) and Phe
(F56, F170, F256) are represented as sticks and colored purple. The RNA nucleotides are colored magenta
and represented as sticks. The protein pseudocenters are represented as balls and are colored as in Figure
1. (b) Alignment between HuD RBD (1fxl, pink) and Rho terminating factor (1pvo, monochrome). The
bottom figure presents the Tyr-Phe binding platform shared by the proteins. As can be seen, in spite of the
overall structural differences between the proteins they use similar Tyr-Phe binding platforms for nucleotide
recognition. (c) Top: Alignment of the first four members of cluster 32 (see Table 2), colored as in a. Middle,
the alignment of all the five members of cluster 32. Colored as in a, 1u0b is green. Bottom: Cluster 180,
guanine transglycosylase (1q2r) is green and U2AF(2g4b) is blue and pink.
36
(a)
(b)
Figure 4: Di-nucleotide binding sites of proteins with similar structures or sequence patterns.
(a) Alignment between 20 di-nucleotide binding sites of MS2 RNA hairpin coat-proteins (cluster 215, Table3).
The left binding site can tolerate any nucleotide sequence, due to the conservation of its Tyr residue, represented by two pseudocenters: donor/accpetor (green) and aromatic (black). (b) Alignment between 8
binding sites of Pumilio proteins (see cluster 83, Table3). The surfaces of the binding sites are represented
as small dots and are colored by the physico-chemical property as in Figure 1. Proteins in which the binding
site is located on repeat 6 43 are colored cyan and those of repeat 8 are magenta. The RNA nucleotides are
colored as following: C - orange, A - red, G - yellow, C - green.
(a)
(b)
(c)
Figure 5: Similar di-nucleotide binding sites of unrelated proteins. (a) Alignment of binding sites of
transcription elongation proteins NusA (PDB: 2atw) and neuronal splicing factors Nova-1 (PDB: 2anr) that
accommodate nucleotide triplets GAA (G1-A3, purple) and CAC (C214-C216, cyan) respectively. Surface
patches with exactly the same properties and shapes (up to 1Å distance deviation) are represented as dots
colored as in Figure 1. (b) Alignment between rRNA (Uracil-5-)-methyltransferase RumA (PDB: 2bh2, blue)
and Aspartyl-tRNA synthetase (PDB: 1asy, green). Both binding sites bind the GU nucleotide pair. In 1asy
this pair of nucleotides (634-635) is located on the anticodon loop, while in 2bh2 the GU pair (1954-1955) is
responsible for the recognition of the hairpin by the N-terminal OB fold. The surfaces patches are as in (a).
(c) Glutamyl-tRNA synthetase (PDB:1n78, blue) and Aspartyl-tRNA synthetase (PDB:1il2, green), which
bind the rna fragments CU and GU (colored CPK) respectively. The binding sites were recognized to share
7 pseudocenters, which are represented as balls and are colored as in Figure 1 (the pseudocenters of 1n78
are represented as smaller balls).
37
(a)
(b)
Figure 6: Structure prediction of RNA fragments. (a) Reconstruction of the RNA fragment of HutP
antitermination complex (1wpu). The predicted and the native nucleotide fragments are red and orange
sticks respectively. The opposite view in the upper figure shows the non interacting U nucleotide, not
predicted by our method. (b) Reconstruction of the RNA loop of hepatitis delta precursor, which interacts
with U1A protein (1sj3). The predicted nucleotides are red and the native are orange.
(a)
(b)
Figure 7: (a) Dinucleotide binding site of HutP (1wrq, light blue) aligned to the binding site of Cyclophilin
A (pink) bound to sanglifehrin macrolide (PDB:1nmk). The bottom figure shows the alignment of the RNA
(blue) and sanglifehrin (red) from a different angle. (b) Nucleotide binding site (A37, yellow) of CysteinyltRNA synthetase (1u0b, light blue) aligned to the Human macrophage elastase (1ros, pink) bound to its
inhibitor (green). The aligned proteins have similar binding sites and similar secondary structure elements
(helix and 2 strands), which suggest the potential similarity in ligand binding.
38
(a)
(b)
Figure 8: (a) Center star classification algorithm. The star centers s1 − s4 are labeled according to the
weight and the order of traversal. The edges of s1 to s2 − s4 nodes failed to fulfil the multiple alignment
requirements and were not added to the cluster of s1. Consequently, the algorithm proceeded to s2, where
s4 fulfilled the multiple alignment requirement and was added to the cluster. (b) The RnaPred algorithms
aligns the 3D consensus patterns to some regions (represented by curves) of the complete query protein .
We consider the RNA dinucleotides bound to the aligned 3D patterns (pairs of balls, colored according to
the pattern) and select the longest and highest scoring fragment of consecutive mappings of such pairs. In
this example, the best RNA fragment prediction is of length 4 nucleotides (nts).
39