ParaDock – a flexible DNA rigid protein geometric complementarity

TEL-AVIV UNIVERSITY
RAYMOND AND BEVERLY SACKLER
FACULTY OF EXACT SCIENCES
BLAVATNIK SCHOOL OF COMPUTER SCIENCE
ParaDock – a flexible DNA rigid
protein geometric complementarity
based docking algorithm
This thesis is submitted in partial fulfillment of the requirements for the
M.Sc. degree in the Blavatnik School of Computer Science, Tel-Aviv
University by
Itamar Banitt
The research work in this thesis has been carried out at Tel-Aviv University
under the supervision of
Prof. Haim Wolfson
September 2010
1
Abstract
Accurate prediction of protein-DNA complexes could provide an important stepping stone towards a thorough
comprehension of vital intracellular processes. Few attempts were made to tackle this issue, through binding
patch prediction, protein function classification, and distance constraints based docking. In this work we
introduce ParaDock: a novel ab initio protein – DNA docking algorithm. ParaDock combines short DNA
sections, rigidly docked to the protein based on geometric complementarity, to create planarly bent DNA
molecules, ranked using a geometrical complementarity score. Our algorithm was tested on the bound and
unbound targets of a protein-DNA benchmark including 47 complexes. With neither addressing protein
flexibility, nor applying any refinement procedure, CAPRI acceptable solutions were ranked in the top 10 of
83% of the bound complexes, and 70% of the unbound. Without requiring prior knowledge of DNA length
and sequence, and within less than 2h per target on a standard 2.0GHz CPU, ParaDock offers a fast ab initio
docking solution.
2
Table of contents
Motivation
4
Background
Biological background
5
Molecular Docking
6
Related work
Protein – DNA binding characterization
9
Existing prediction solutions
9
ParaDock algorithm
Algorithm Motivation
11
Concept of the algorithm
12
Algorithm details
PatchDock
13
Partial solution filtering
13
Curve creation
14
Molecular solution
14
Scoring
15
Experimental results
Data set
16
Solution evaluation
16
Results
11
Discussion
20
Future work
22
Appendix
Implementation
23
Complexity
24
Main parameters
25
Benchmark contents
26
References
28
Hebrew abstract
33
3
Motivation
The central dogma of molecular biology (1) states that genetic information stored in the DNA is transcripted
and then translated to create proteins in all living cells. This widely agreed conjecture places the DNA
molecule as an extremely important element in many intracellular processes. All DNA functions within the
living cell, including among others transcription, replication, damage repair, packing and strand splitting, are
dependent on interaction with proteins. Comprehensive understanding of these interactions, of which
uncovering of the structure of protein – DNA complexes is a vital component, will most likely provide
important insights to the inner cell systems, such as gene expression mechanisms, and DNA related diseases.
Although many studies tackled various aspects of protein – DNA binding (some of which will be detailed
hereafter), no well established algorithm exists for ab initio protein - DNA docking. Such an algorithm can be
an important stepping stone in the establishment of a complete structural complex.
4
Background
Biological background
The DeoxyriboNucleic Acid (DNA) was first described in 1869 by Friedrich Miescher (2). It is a polymeric
macromolecule built as a chain of nucleotides, each composed of a nucleic acid, a sugar group, and a
phosphate group(3). Four types of nucleotides appear in natural DNA - adenine, guanine, thymine and
cytosine. While other polymers present varied structural behavior, DNA is almost exclusively found as a
double-stranded molecule, containing two polymeric chains bound together. The nucleotides of both chains
create Watson-Crick pairs: adenine bases of one chain are paired with thymine bases of the other, and the
same goes for guanine and cytosine. Hydrogen bonds between adjacent pairs create the well known helical
structure of the DNA molecule, named B-form. DNA molecules vary in length, depending on the organism,
and can be very long: e.g. human chromosome #1, is ~220 million base pairs long.
DNA protein interaction, responsible for many crucial intracellular processes, takes the form of intermolecular
binding, in which a DNA molecule and one or more protein molecules create a polymolecular complex. While
DNA is found in nature almost solely in B-form, this native DNA structure changes noticeably when bound to
a protein. A protein – DNA benchmark published by Bonvin et al. (4), contains an assembly of 47 complexes
covering most major groups of DNA binding proteins, as these groups were defined by Luscombe et al.(5). In
this benchmark, interface RMSD was calculated for each complex, resulting in 12 complexes displaying over
5.0 Å IRMSD, and another 22 of more than 2.0 Å, thus demonstrating the magnitude of structural deformation
in protein DNA binding, projecting on its importance and affect on docking attempts.
The DNA molecular deformations found in the majority of the complexes can be classified (according to 6)
into just a few types: bending, twisting, groove widening or compression, and base flipping. Visualization of
such deformations can be seen in figure 1, and representatives of all types can be found in the previously
mentioned benchmark (4). Despite the numerous types, bending is the most common, and, more importantly,
noticeably has the greatest effect on the DNA molecule global shape, whereas other types of structure
distortion are of a more local character.
5
a
b
c
d
Figure 1 - various conformational changes undergone by DNA molecules when bound to proteins: a – base flipping
(1EMH); b – helix unwinding (1EYU); c – bending (1O3T); d – A-form helix (1QRV)
Molecular docking
The newly developed field of bioinformatics employs mathematical, statistical and computational tools and
methods, to address various aspects of molecular biology. Probably the most common practice of
bioinformatics involves the analysis of biological sequences, referring both to DNA and protein sequences.
Studies of gene expression and regulation, modeling of evolutionary processes, and computational systems
biology are some of the other recent additions to this newborn field of science. Among these, structural
bioinformatics deals with the development and analysis of prediction tools based on structural data, notably,
focusing on the prediction of folding and docking of macromolecules. While folding attempts to predict the
structure of a molecule given its sequence (be it protein or RNA), docking deals with computationally
uncovering the structure of a complex composed of two or more molecules bound together.
Numerous docking algorithms have been published in the last two decades following the immense increase in
available biological data, much owing to the human genome project. Docking techniques differ from one
another both in concept and purpose, but the vast majority are composed of two main elements: a search
algorithm and a scoring function. Difficulty of docking is in a large extent the result of the vastness of the
conformation space. Three translational and three rotational degrees of freedom, with additional degrees as
molecular flexibility is introduced, combined with the large size of the molecules involved, generate too large
a space to be searched exhaustively. Thus, the search algorithm is aimed at locating the largest proportion of
eligible candidates in minimum time, and several such approaches can be identified when surveying the
various solution methods (7).
6
Conformational sampling methods use known properties of bound complexes (such as shape
complementarity) to actively and specifically generate the conformations to be considered. These techniques
embed an assumption of some sort regarding the final solution (inspiring the method by which the complexes
are generated), resulting in great speed, accompanied with possible weaknesses produced by the assumptions
in question. PatchDock (8) for example, generates conformations based on surface shape complementarity.
Incremental construction methods, used mainly to dock small, very flexible ligands, reduce the search space
by anchoring a part of the ligand to a predicted binding site, and building the rest of it piece by piece, fixing
the last added fragment in place. FlexX (9) may be considered as a representative of this sort of algorithms.
Stochastic methods scan the conformational space on the fly. Normally, a population of random initial
conformations is generated, after which multiple small steps are executed, slowly transforming each initial
solution. Transformation in each step can be carried out in different ways, many of them involving an energy
function. Monte Carlo based algorithms, as well as simulated annealing, steepest descent, and genetic
algorithms have been published among others. ROSSETA (10) and HADDOCK (11) are members of this
family.
Scoring functions are the second major element of docking algorithms, and are commonly dual purposed, used
both for directing the search algorithms and assessing candidate poses for final ranking (7). As in search
algorithms, scoring functions can be roughly sorted into three main categories. Force field based functions
employ a direct calculation of forces, based on physicochemical properties, including kinetic, thermodynamic,
electrostatic and others, which are active at the molecular and intermolecular scale. As the molecular system
in question presents high complexity, force fields are difficult to calculate, and elements not accounted for
might significantly influence results (e.g. water molecules). Empirical scoring functions use an energy
function approximated by the most (allegedly) relevant factors, often including hydrogen bonds, hydrophobic
effects, electrostatics, etc. Such simplifications offer improved speed, but might reduce accuracy. Knowledge
based functions abandon the physical and chemical known forces, and adopt a more pragmatic potential,
based on the statistical analysis of databases containing known complex structures. Contacts can be
categorized into types (e.g. certain pairs of amino acids) to create typical distributions used to assess the
7
candidate conformations. HADDOCK (11), PatchDock (8), and DrugScore (12) can be considered
representatives of force field, empirical and knowledge based functions respectively.
Despite this categorized summary, the variety and diversity of docking algorithms is immense. Many do not
fit the categories mentioned here or combine some of them, and the wide selection cannot be thoroughly
described in this scope.
8
Related work
Protein-DNA binding characterization
A vital key to the comprehension and prediction of DNA – protein interactions, lies in the broad and general
characterization of the complex's intermolecular interface and the conformational changes undergone by the
participating molecules during binding.
Among the first characteristics to be spotted in the early DNA - protein complexes found, was the abundance
of positively charged amino acids concentrated in the binding site upon the protein surface (13, 14). This
positive electrostatic charge is assumed to complement the negative charge found mainly on the B-form DNA
backbone. It was further suggested in some studies, that this electrostatic compatibility steers protein – DNA
recognition, and is the source of the primary force pulling the molecules together (15, 16).
A comprehensive study of the protein surface patch taking part in DNA binding was performed by Thornton
et al. (17). All intermolecular contacts between protein and DNA atoms were surveyed through 129 protein DNA complexes. For each type of amino acid, the number of found contacts to the DNA molecule was
divided by the expected number of contacts (calculated using that amino acid relative ASA, and the native
binding site size), thus receiving a parameter representing each acid type's propensity to appear in the binding
site. While the expected propensity value would be 1.0, many types of acids (including not only the
electrostatically charged acids) presented significantly different values.
Existing prediction solutions
Several different approaches and methods were employed trying to predict DNA – protein binding, addressing
various aspects of the problem.
Function recognition algorithms attempt to predict whether a certain protein is DNA binding or not. Such
algorithm was introduced by Mandel-Gutfreund et al. (18), based on the aforementioned electrostatic
properties of DNA binding sites. Given a protein structure, the surface electrostatic potential was assessed,
and the largest electrostatic patch located. The boolean prediction (binding or non-binding) was based on the
9
size (ASA) of the positive patch, as well as its average potential. Three DNA binding structural motifs (helixturn-helix, helix-hairpin-helix, and helix-loop-helix) were coupled by Thornton et al (19) with the electrostatic
properties at the suggested binding site for this same purpose, and shown to possess significant discriminating
power.
Further work done by Thornton et al. (20) tackles a different perspective of protein-DNA binding, trying to
predict the binding site of the protein. Their work embodies the comparison between the abilities of 5 different
score functions to distinguish the native binding site on the protein surface from equally sized decoys on that
same protein. The scores studied were each based on one of the following properties of the candidate binding
site: ASA (accessible surface area), amino acid propensities, electrostatic charge, hydrophobicity, and residue
conservation. The electrostatic and amino acid propensity based scores, have shown good performance, while
all three remaining scores exhibited poor prediction ability. Other attempts to locate the native binding site
employed machine learning methods, such as SVM and perceptron, based on sequential motifs and conserved
residues data (21-23). All mentioned attempts showed ability to predict the binding site with non trivial
probability, none managing to achieve more than 85% accuracy.
Recent publications (24, 25) by Bonvin et al. embody the first large scale head on confrontation with protein –
DNA docking. The HADDOCK algorithm, applied to the whole (previously discussed) benchmark, modeling
both protein and DNA flexibility, displayed highly accurate results. The algorithm utilizes distance constraints
extracted from experimental data (gathered from various possible sources, such as NMR, conservation data,
etc.), to reconstruct and refine the protein-DNA complex. In this recent paper (25), the experimental data was
replaced with distance constraints extracted from the target complexes themselves. The results, measured by
the CAPRI (Critical Assessment of PRedicted Interactions) standards (26) of fnat (Frequency of native
contacts) and IRMS (Interface RMS) , included a high proportion of extremely accurate solutions, depicting
HADDOCK's ability to accurately predict DNA-protein complexes, provided that the bulk of the interface
distance constraints are experimentally supplied a-priori by experiment or other prior knowledge.
11
ParaDock algorithm
Algorithm Motivation
Three observations have motivated the algorithmic design of ParaDock. The first is the well established
structural consistency and repetition of B-form DNA. Second, is the ability to characterize probable binding
sites, using geometric shape complementarity, electrostatics and amino acid propensities, as was earlier
discussed. The third observation is the approximal planarity displayed by bound DNA helices. Two studies
indirectly point to this conclusion. In the previously mentioned survey done by Thornton et al. (6) bending is
the most common of all deformation modes, and its influence on global molecular shape seems the most
significant. A second perspective as displayed by Dickerson (27, 28) views these deformations through the
angular position of the base-pairs, relative to the helix central axis. Each such deformation is classified as
major or minor, and further divided into three modes – kinks, planar curves, and writhes. While smooth planar
curves are the least common in this study, it can be observed that kinks and minor writhes create
approximately planar DNA curves, as the deformation is local. Therefore, apart from major writhed
molecules, amounting to 10 out of 86 surveyed complexes, the surveyed DNA molecules are approximately
planar. Within the 10 writhed complexes, some complete a full writhe, eventually returning to close to planar
form. To further strengthen the claim of DNA planarity (at the relevant scale), we have approximated the
DNA target molecules of the protein-DNA benchmark (4) with planes. Only an average of 0.08% of the atoms
were located farther than 12 Å (DNA helix radius (29)) from the approximated plane, and only in a single
complex, 1EMH, the ratio of these distant atoms was higher than 1% - easily explicable by a flipped base
(shown in figure 1a).
11
Concept of the ParaDock algorithm
The structural repetition and DNA planarity suggest the combination of partial rigid docking solutions
(filtered using electrostatics, geometry and amino acids propensities) into a complete structure, and the
approximation of the central DNA axis using planar curves. The algorithm structure is as follows:
I - Local docking and filtering – local rigid docking solutions of the protein and short DNA section are found
using PatchDock (8), and then filtered.
II - Co – linear pairs of solutions, and co – planar triplets are located, and planar conic curves as well as linear
curves are generated to fit them.
III - Molecular solutions are constructed along the curves.
IV - Solutions are scored and ranked by geometrical complementarity.
Figure 2 Illustration of the ParaDock algorithm. Illustrations are genuine figures from the 1O3T test run.
12
Algorithm details
PatchDock
Paradock's input is a PDB file, containing the sequence and structure of the protein in question. First, we seek
to find rigid partial solutions, indicating local complementarity of the protein's shape to B-form DNA. We use
PatchDock (8), a geometric hashing (30) based rigid docking algorithm, which detects shape complementarity
of molecular surfaces. Input to PatchDock is the protein molecule, and a short (6 base pairs) section of B-form
DNA, of arbitrary sequence. PatchDock program parameters were adjusted to yield 15 - 30 thousand
solutions, each of which is a complex containing the protein and the short DNA section docked together.
PatchDock scores the solutions, using a distance transform grid, thus scoring every atom of the ligand (in our
case, the short DNA section molecule), by its distance from the protein surface. Distant atoms (more than 1 Å
from the surface) are not scored, close atoms (-1, +1 Å) receive a positive score, and penetrating atoms receive
a negative score, respective to the penetration depth.
Partial solution filtering
Two other scoring functions are added to the geometric score calculated by PatchDock in order to filter out
the relevant partial solutions. A primitive electrostatic score is the number of interacting atoms of the protein,
belonging to a positively charged amino acid, minus the number of these belonging to negatively charged
amino acids. This score gives a rough approximation of the electrostatic charge of the interface patch on the
protein surface, very simply and swiftly. Finally, a third score was devised to account for the amino acid
propensities observed by Thornton (17) et al. Protein atoms participating in the interface were counted by their
residue type, resulting in a 20th dimensional vector. The inner product of this vector and the vector of average
propensities found in (17), weighing both patch composition and size, was used as the third score.
Out of all the solutions generated by PatchDock, a mere 2000, having the highest scores, were left to proceed
to the next step, focusing attention on specific areas on the protein surface, possessing properties of
electrostatics, geometry and amino acid composition, compatible with DNA binding patches.
13
Curve creation
Next, we exploit the consistency of B-DNA shape, and represent the partial solutions by their central axes
only, discarding the molecular data of the DNA. Hence, we are left with a protein molecule, surrounded by a
cloud of segments in 3 dimensions. It should be noted that the phase of the partial solutions, i.e. their
rotational angle around the central axis, is discarded along with the whole molecular data. It will be implicitly
accounted for during scoring .
Utilizing the aforementioned relative planarity of DNA molecules we now combine pairs and triplets of
segments to generate long planar solutions, employing the known local compatibility to the protein. The
segments extracted from the PatchDock filtered results are scanned to find co-linear pairs, and co-planar
triplets (with necessary certain tolerances), used in turn to create linear and conic curves respectively. These
approximate well the central axes of observed DNA molecules. The produced curves represent the location of
the desired DNA molecule central axis, around which we will forthwith build the solution molecule.
The creation of conic curves (the handling of which was implemented using CGAL (30)) produces undesired
effects such as complete circles and ellipses, alongside with double branched hyperbolas, all of which are not
particularly compatible with known DNA molecules. These are therefore trimmed, to consist of a single
branch, with an overall turn angle not exceeding 90 degrees. The curve is trimmed to leave a section that is
coherent with the maximal number of segments out of the 2000 filtered solutions. A coherent segment is
defined as one that is close to the curve, and almost parallel to its tangent. Curves containing points with
curvature higher than a certain threshold are discarded (as very sharp kinks are not common in B-form DNA,
and might result in DNA molecules "wrapped" around the protein, thus receiving a very high score).
Molecular solution
Around the trimmed curve, a DNA molecule needs to be constructed. First, a template molecule of arbitrary
sequence is multiplied to create a straight DNA molecule of the desired length. In order to bend this molecule,
its central axis is segmented, and each segment is paired with a section on the target curve. Rigid
transformations are created transforming each such segment onto its corresponding section on the curve.
Finally, each atom from the straight molecule is transformed onto its location using the appropriate
14
transformation, creating the final bent molecule. Several molecules are created for each curve, using different
translations along the central axis, and rotational angles around it. A geometrical score is calculated to choose
the final candidate molecule. These final molecules are bent only DNA molecules, not accounting for any
other type of flexibility.
Scoring
Several scoring functions were examined when attempting to rank the 1,000 – 2,000 solutions each complex
produced:
I - Electrostatics and amino acid propensities were earlier described. They have shown mediocre results as
basis to score functions, possibly because they were already used in filtering the partial solutions.
II - The amount of coherent partial solutions, which was used while trimming the curves, might have been a
good measure for geometrical compatibility, and as small tolerances are allowed might have compensated for
the lack of local flexibility in the final solution. Partial solutions individual scores (geometric, electrostatic,
propensities) were also considered. Probably due to the uneven spread of partial solutions, this score function
seemed to let their spatial density to weigh heavily on final ranking.
III - Curvature of the central axis curve might represent the energetic cost of the deformation involved, but has
also shown poor performance.
IV - PatchDock geometrical score, plainly yielding the best results, was eventually used.
The geometrical score function uses a distance transform grid to score every atom of the DNA molecule by its
distance from the protein surface. Distant atoms (more than 1 Å from the surface) are not scored, close atoms
(-1, +1 Å) receive a positive score, and penetrating atoms receive a negative score, respective to the
penetration depth.
Many variations of these basic score functions including weighted combinations were tested, none proving
better than the geometrical score.
Program output is therefore a geometrically ranked list of solution complexes, each containing the unchanged
protein docked with an arbitrary sequenced DNA molecule, of varying length.
15
Experimental results
Data set
The Bonvin et al. benchmark (4) consists of experimentally elucidated complexes of 47 different DNA
binding proteins, bound to DNA, alongside with the native unbound proteins. The benchmark contains
representatives of seven out of the eight groups defined in the structural classification of protein DNA
complexes by Luscombe et al. (5), categorized by their IRMS (Interface RMS) as easy, intermediate or hard.
IRMS is the root mean square deviation between interface atoms of both molecules in the bound and unbound
version, representing the magnitude of conformational changes undergone by both molecules when bound to
each other. ParaDock was tested on both bound and unbound cases. The DNA molecule was removed from
the target complex, leaving the protein structure as the input for ParaDock in the bound case, its results to be
compared back with the target complex. As the protein's flexibility is not modeled, ParaDock's output, when
run on the unbound protein, must be aligned with the target complex in order to evaluate correctness. This was
established using our C-alpha based structural alignment server (30, 32). Proteins which bind to DNA in a
non-monomeric form constitute a significant portion of the benchmark. Because Paradock is currently unable
to deal with multiple molecules, only one copy of the protein was left in each complex, both as input and for
solution evaluation.
Solution evaluation
ParaDock does not require knowledge of the DNA sequence and length, thus requiring some alterations in the
result evaluation measures commonly used in docking attempts. Difference of length between the target DNA
and the DNA output molecule, combined with the arbitrarity of the sequence used to create the solution, does
not allow for a straightforward comparison between the two molecules, owing to three facts: (i) The
correspondence between the target base pairs and the solution base pairs is undefined. (ii) Not all atoms that
exist in the target DNA appear in the solution, as the DNA sequence differs. (iii) Some base pairs of the target
might be missing, and vice versa (as lengths differ). Drawn from the CAPRI evaluation methods (26), fnat is
16
the first measure used with a slight adjustment. Fnat (Fraction of the NATive contacts) is calculated as the
proportion of native contacts between a nucleotide and an amino acid that appear in the evaluated solution. As
the correspondence of the base pairs is not given, all possible correspondences are examined (maintaining
order and sequence of base pairs), and the highest scoring is set as the work point correspondence list. To
complete fnat, fnonnat is also calculated, being the proportion of the contacts found in the evaluated solution
that do not exist in the target complex. IRMS, also drawn from CAPRI, is calculated between the solution and
the target locations of all backbone atoms in the nucleotides and amino acids, which take part in the interface.
Another important measure is the number of intermolecular clashes in the solution. A large amount of clashes
will inflate the interface in both molecules, creating more contacts between nucleotides and amino acids, thus
resulting in high fnat numbers. Clashes were defined to fit the CAPRI definition, as non-hydrogen atoms from
different molecules, less than 3 Å apart. It may be noted that the CAPRI evaluation system utilizes another
measure – ligand RMSD, calculated for the whole ligand, and used as an alternative to IRMS. DNA "tails" far
from the protein are very difficult to predict, and create an increase in ligand RMSD, more suited for use with
smaller (in diameter) ligands. Since it is fully interchangeable with IRMS in terms of CAPRI standards, we
have not used it here. Interface distance threshold was taken as 5 Å. To better perceive these measures, a few
solution complexes are illustrated in figure 3.
Results
ParaDock was run on the 47 benchmark complexes, in both bound and unbound versions. For each complex
approximately 15,000 – 30,000 partial solutions were found by PatchDock, usually dispersed over the entire
protein surface (as illustrated in figure 2). The 2,000 filtered partial solutions (in most complexes) are
distinctly concentrated in certain areas (figure 2), implying either geometric, electrostatic or propensity
compatibility in that location on the surface. Within these, 300 – 500 pairs of collinear solutions were
normally located (each of which yields a linear curve), alongside 2 – 3 million coplanar triplets, resulting in
1,000 – 2,000 solutions of both types (as most segment triplets are unsuitable to create a conic curve).
17
ParaDock was run on a single 2.0 GHz processor, with an average overall run time of ~2 hours (~45 minutes
of which consumed by PatchDock).
CAPRI definition of an acceptable solution which was used here, is either fnat >= 0.3, or fnat >=0.1 and
IRMS < 4.0 Å. When tested on the bound version, ParaDock places an acceptable or better solution within the
10 top ranking solutions, in 83% of the cases. From the remaining, another 13% presented a solution with fnat
> 0.2 (failing only to reach IRMS < 4.0 Å).
ParaDock exhibited slightly worse results when run on the unbound version of the benchmark. Scoring
parameters were altered to allow more collisions (compensating for the lack of protein flexibility), and a
clustering algorithm was applied to filter out similar solutions. An acceptable solution was found in the top 10
of 70% of the complexes, with an additional 17% with fnat > 0.2. Illustrations of result complexes are
depicted in figure 3. Both bound and unbound results are detailed in table 1.
b
a
Figure 3 – Illustration of best result complexes within
the top 10 ranking of four targets. Proteins are shown
in blue (differences between bound and unbound
versions can be observed), native solution in red,
ParaDock solution in green. Trimmed curve (DNA
central axis) appears in black.
c
d
18
a
b
c
d
complex
1B3T unbound
1B3T bound
1DFM unbound
1DFM bound
Fnat
0.37
0.61
0.23
0.67
Fnonnat
0.61
0.42
0.78
0.64
IRMS (Å)
5.44
3.37
6.7
5.5
Best in top 10 - bound protein
complex
2C5R
1PT3
1MNN
1FOK
1KSY
3CRO
1EMH
1H9T
1TRO
1BY4
1HJC
1DIZ
1RPE
1VRR
1F4K
1K79
1KC6
1EA4
1Z63
1R4O
1AZP
1W0T
1CMA
1JJ4
1VAS
4KTQ
1Z9C
1DDN
2IRF
1JT0
1G9Z
1A73
2FIO
1QNE
1ZS4
1QRV
1O3T
1B3T
3BAM
1RVA
1ZME
1DFM
1BDT
2FL3
7MHT
1EYU
2OAA
Average:
native
IRMS
0.49
1.35
1.48
1.53
1.58
1.58
1.62
1.68
1.7
1.77
1.8
1.82
1.87
2.08
2.26
2.37
2.38
2.43
2.51
2.61
2.7
2.78
2.81
2.83
3.04
3.23
3.24
3.26
3.35
3.49
3.67
4.26
4.41
4.57
4.71
5.19
5.2
5.32
5.55
5.68
5.76
6.31
6.45
6.71
6.71
6.82
8.95
evaluation
medium
medium
acceptable
medium
acceptable
acceptable
medium
medium
medium
medium
acceptable
medium
medium
medium
medium
acceptable
medium
medium
acceptable
acceptable
acceptable
medium
medium
medium
medium
acceptable
medium
acceptable
medium
medium
medium
medium
medium
medium
acceptable
acceptable
medium
acceptable
medium
Best in top 10 - unbound protein
Fnat
0.62
0.73
0.49
0.64
0.36
0.07
0.43
0.52
0.73
0.29
0.90
0.50
0.38
0.85
0.21
0.77
0.75
0.94
0.30
0.55
0.52
0.40
0.29
0.41
0.33
0.21
0.22
0.90
0.81
0.00
0.61
0.29
0.51
0.32
0.53
0.43
0.74
0.58
0.71
0.59
0.55
0.67
0.36
0.44
0.57
0.40
0.64
Fnonnat
0.90
0.72
0.65
0.58
0.89
0.97
0.84
0.72
0.72
0.89
0.50
0.76
0.79
0.60
0.89
0.48
0.73
0.63
0.95
0.80
0.82
0.68
0.88
0.71
0.80
0.88
0.93
0.61
0.64
0.00
0.48
0.74
0.75
0.78
0.70
0.84
0.57
0.47
0.50
0.55
0.72
0.64
0.74
0.63
0.71
0.73
0.46
IRMS
8.8
4.6
3.8
3.1
9.8
5.1
3.3
6.0
3.8
9.6
2.7
5.1
8.9
4.9
11.4
3.1
3.7
3.0
9.0
5.8
9.2
5.8
8.2
4.8
6.4
9.8
7.6
3.1
2.7
25.1
5.0
10.0
5.1
10.3
3.7
8.7
3.1
4.1
2.4
4.3
6.4
5.5
12.4
3.2
5.7
6.1
2.9
0.51
0.70
6.3
evaluation
medium
medium
acceptable
acceptable
acceptable
medium
acceptable
medium
acceptable
acceptable
medium
acceptable
medium
medium
medium
acceptable
medium
acceptable
acceptable
acceptable
acceptable
acceptable
acceptable
acceptable
acceptable
medium
acceptable
acceptable
acceptable
medium
acceptable
acceptable
acceptable
Fnat
1.00
0.50
0.41
0.32
0.32
0.58
0.27
0.29
0.16
0.42
0.50
0.33
0.47
0.70
0.21
0.31
0.56
0.65
0.53
0.30
0.39
0.79
0.38
0.38
0.39
0.21
0.41
0.40
0.45
0.09
0.44
0.29
0.15
0.32
0.06
0.88
0.08
0.37
0.32
0.13
0.21
0.23
0.42
0.56
0.48
0.37
0.38
Fnonnat
0.92
0.78
0.88
0.72
0.91
0.63
0.89
0.84
0.95
0.86
0.81
0.83
0.73
0.57
0.91
0.84
0.66
0.62
0.88
0.89
0.85
0.70
0.85
0.62
0.75
0.89
0.73
0.84
0.81
0.96
0.55
0.72
0.94
0.79
0.96
0.87
0.96
0.61
0.82
0.91
0.88
0.78
0.66
0.61
0.59
0.67
0.59
IRMS
4.3
5.2
5.6
5.3
9.5
3.6
5.2
13.1
10.5
6.9
8.2
7.9
7.2
2.6
12.2
7.8
4.2
3.3
5.8
10.5
11.7
3.1
6.9
2.7
7.8
7.1
5.0
4.3
4.8
12.4
5.1
10.4
13.1
9.0
14.5
4.6
17.0
5.4
11.0
16.5
15.4
6.7
5.7
5.4
6.9
8.4
8.4
0.39
0.79
7.8
Table 1 - ParaDock results as tested on the benchmark. Shown are the best solutions within the top 10 ranked.
Native IRMS (Å) – IRMS between bound and unbound molecules, Taken from (4).
Evaluation – solution classification according to CAPRI standards.
Fnat - The proportion of native contacts between a nucleotide and an amino acid that appear in the evaluated
solution.
Fnonnat - The proportion of the contacts found in the evaluated solution that do not exist in the target complex.
IRMS (Å) – calculated between the solution and the target locations of all backbone atoms in the nucleotides
and amino acids, which take part in the interface.
Interface distance is defined to be 5 Å, as in CAPRI.
19
The acceptable solutions (both bound and unbound) present an average of 135.5 clashes per solution. More
than 60% of these are very shallow clashes (less than 3 Å deep into the protein), which will most likely be
handled easily by a future refinement procedure , leaving an average of 54 non-shallow clashes per solution,
well in the neighborhood of normally accepted CAPRI solutions.
Discussion
ParaDock provides a protein – DNA docking solution based entirely on the protein's structure, requiring
neither distance constraints, nor any information about the DNA molecule (sequence, length, etc.). We have
employed generic observations regarding the electrostatic and geometric properties, as well as amino acidpropensities of protein – DNA complexes, and made simplifying assumptions on the flexibility of DNA
molecules, thus deriving a simple and fast algorithm, which combines partial binding solutions to predict
complexes containing long bent DNA molecules. CAPRI acceptable solutions are within the 10 top ranked
solutions in 76% of the bound and 70% of the unbound cases, prior to any form of refinement. The algorithm
presents very good ability to predict the location and general position of the DNA molecule, though it exhibits
lower accuracy when it comes to RMSD measures. This is explicable due to the lack of refinement and the
simplification of flexibility in the DNA molecule, combined with the unaddressed flexibility of the protein. A
succeeding stage of such local refinement is obviously called for, but remains outside of this work's scope.
As expected, ParaDock presents better performance on bound versions of the proteins. Target complexes in
table 2 are sorted by their bound-unbound IRMS, representing the magnitude of the conformational changes
undergone during binding and as can be foreseen, it is apparent that in the unbound version results deteriorate
as native IRMS rises.
In a vast majority of the solutions, fnat + fnonnat > 1, implying that the molecular interface found on the
solution complex is larger than the target interface. This is due to two possible reasons – first, as mentioned
before, the definition of interface used here inflates the interface size as the volume of intermolecular overlap
(steric clashes) grows. Second, because the length of the target DNA molecule is not used by the algorithm,
output DNA molecule may be longer than the target molecule, likely enlarging intermolecular interface
21
(figure 3c exhibits a solution quite longer than the target molecule). The nature of the geometric score, seeking
large complementary interface strengthens the latter explanation, bearing in mind that the solutions in question
are amongst the highest scoring solutions. As the average sum of fnat and Fnonnat is merely 1.2, we do not
regard this as a significant drawback.
Arbitrary DNA sequence was used both as input for PatchDock, and as a template for the full molecular
solution. Though such a choice might have interrupted severely with molecular complementarity, it seems not
to have deteriorated results significantly (as they are rather good). Despite the fact that energy was very
indirectly handled here, this might imply that the main contributors to the affinity of the protein and DNA are
not heavily sequence related, thus supporting the "DNA sliding" theory (33). The latter suggests that some
DNA binding proteins slide along the DNA searching for their cognate sequence, continuously in contact with
the DNA, thereby, exhibiting significant affinity to any DNA sequence.
While ParaDock handles the Bonvin et al. benchmark quite well, it is important to mind two main inherent
inabilities, in addition to the lack of refinement. The filtering stage denies the creation of irregular solutions,
i.e. solutions in which the protein's interface does not sustain such properties of electrostatics and amino acid
propensities earlier observed. Such target complexes, will not be predictable by ParaDock. On top of that, as
the length of target DNA molecules will increase, the underlying approximation of the central axis with a
planar conic curve, will eventually fail. Such targets will demand a different treatment, perhaps of combining
several ParaDock solution curves to create a more complex DNA molecule.
21
Future work
When discussing future work, solution refinement, modeling local DNA changes as well as protein flexibility,
is probably the most obvious choice, enabling increased accuracy in complex prediction combined with a
more precise score function (due to local conformation optimizations). Multiple molecule docking can also be
handled. Assuming symmetry (as done in Symmdock , (34)), will enable running ParaDock on a single copy
of the protein, and concatenating multiple copies of each solution to create final candidate complexes.
A more advanced improvement would be an attempt to approximate DNA helices with more complex curves,
discarding the planarity assumption, thus creating a larger more diverse solution space. Such curves can either
be concatenated conic curves (as used here), other parametric curves, or non-parametric curves custom built
using the 3D segments representation.
22
Appendix
Implementation
The search for co-planar segment triplets is a time consuming task, accounting for a significant portion of
ParaDock's overall runtime. The naive algorithm would require n^3 attempts, amounting to 8,000,000,000
possible triplets in our 2,000 segments case. A simple speed-up was achieved by locating all co-planar
segment pairs, and storing a list of co-planar mates for each segment. Then, through intersecting every pair of
lists, all co-planar triplets are located. As the lists are sorted, their intersection takes
the length of i's and j's lists, and the overall complexity is
,
and
being
. Worst case complexity is still
O(n^3), but as list lengths average 40-50 segments at most, this approach proves 20 times faster.
All handling of conic curves was done using CGAL(31). As CGAL is required to be geometrically consistent,
all parameters are stored as accurate rational numbers, and all calculations use their full length. This yields
some actions which are somewhat time consuming. The transformation of ~2,500,000 co-planar triplets into
conic curves presented a problem. As most triplets are unsuitable to create a conic curve (e.g. zigzagging
segments), a quick filtering stage was inserted. Conic curves are defined by 5 points. We have selected P1 and
P5 to be the farthest pair of segment endpoints, P2 to be the second endpoint of P1's segment, P4 to be the
second endpoint of P5's segment, and P3 as the center of the third segment. A segment triplet passes the
filtering if for every i,j,k, s.t. i<j<k, IS_LEFT_TURN(Pi,Pj,Pk) = constant. Thus making sure that the chain of
points is curved in consistent direction, and does not intersect itself. In most complexes, less than 5% of the
triplets passed this test, and many others which passed failed later because the middle segment was not
parallel to the created curve. The final result is less than 2,000 solution curves per target.
In the unbound version of the benchmark, a clustering method was exercised in order to thin out similar
solutions crowding the 10 top ranking solutions. A directional graph was created, solutions being its vertices,
and an edge was added from solution i to solution j if more than a certain percentage of j's curve was in close
proximity to i's curve, meaning i "covered" j. A greedy algorithm selected the solution with the highest
23
outgoing degree, and deleted all its neighbors, hopefully remaining with fewer, longer and more complete
solutions.
Complexity
PatchDock's complexity is detailed in (35).
Scoring of partial solutions (electrostatic score, amino acid propensity score) is done using a distance
transform grid, and time complexity is the same as the original PatchDock geometric complementarity score –
, DTG is the constant time required to create the distance transform
grid, protein_size is the number of protein atoms (relevant when filling the grid), n is the number of partial
solutions,
and
sect_size
is
the
constant
number
Filtering the solutions merely requires sorting, i.e
located in
of
atoms
in
the
short
DNA
section.
. Co-planar triplets and co-linear pairs are
as detailed above, m being the number of partial solutions after filtering (2,000 were used
here).
Trimming requires
per solution, in order to locate all coherent segments
using a grid. Assuming a constant maxima of close segments per query point, and minding that solution length
is limited by the protein diameter, leaves
overall, k being the number of final
solutions.
Molecular solution creation and scoring takes
, where
is the length of
solution i, rot_num is the number of sampled rotations around the central axis, and trans_num is the number of
sampled translations along it. As rot_num and trans_num are constants, overall complexity is
.
Clustering demands proximity test for each solution pair. Each test is linear to the solution length, thus the
overall is
. Greedy selection merely requires solution sorting, as does the final
ranking.
Overall complexity is thus:
.
n - # of partial solutions, m – number of filtered solutions, k – number of final solutions.
24
PatchDock runtime is on the average ~45 minutes per complex on a single 2.0 GHz processor, program
parameters adjusted to create 15 – 30 thousand solutions. Overall actual running time averages ~120 minutes.
Main parameters
Many parameters were used in the process of implementing ParaDock. Listed here are parameters found to be
of significant influence on either results or runtime.
parameter
parameter default value
parameter main influence
# of partial solutions that pass
filtering
2000
Geometric score in respect do
distance from surface.
1 Å < dist : 0 points.
-1 Å < dist < 1 Å : 1 points.
-2.2 Å < dist < -1 Å : 0 points.
-3.6 Å < dist < -2.2 Å : -4 points.
dist < -3.6 Å : -8 points.
Major effect on runtime, due to
m^3 complexity of locating
triplets.
Influences ability to rank solutions
correctly. Lower penalties were
used in unbound case for shallow
penetrations.
Parallelity threshold – minimal
absolute value allowed as inner
product of two parallel unit vectors.
Co-planarity threshold. Two planes
were defined, each based on a
segment, and the other segment's
center point. The threshold is the
minimal absolute value allowed as
inner product of unit normal vectors
of both planes.
DNA short section length. The
section used as input ligand in
PatchDock.
0.96
0.999
6 base pairs.
25
Determines number of linear
solutions. Light effect on amount
of results.
Determines number of co-planar
triplets, has significant influence
on their numbers, and thus on
runtime.
Parameter's influence was not
thoroughly studied. The shorter the
section, solutions will probably
tend to posses higher curvature.
Benchmark contents (4)
PDB id
Easy
2c5r
1pt3
1mnn
1fok
1ksy
3cro
1emh
1h9t
1tro
1by4
1hjc
1diz
1rpe
Intermediate
1vrr
1f4k
1k79
1kc6
1ea4
1z63
1r4o
1azp
1w0t
1cma
1jj4
1vas
4ktq
1z9c
1ddn
2irf
1jt0
1g9z
1a73
2fio
1qne
1zs4
Protein description
Number
of IRMS
molecules
in (Å)
target complex
Phage PHI29 replication organizer protein P16.7
Col-E7 nuclease domain
Sporulation specific transcription factor NDT80
Restriction endonuclease FOKI
Papillomavirus replication initiation domain E-1
Phage 434 CRO
Human uracil-DNA glucosylase
FADR, fatty acid responsive transcription factor
TRP repressor
Retinoid X receptor DNA binding domain
RUNX1 runt domain
E. coli 3-methyladenine DNA glycosylase II
Phage 434 repressor
4
2
2
2
3
3
2
3
3
3
2
2
3
0.49
1.35
1.48
1.53
1.58
1.58
1.62
1.68
1.70
1.77
1.80
1.82
1.87
Restriction endonuclease BSTYI
Replication terminator protein
ETS-1 DNA binding and autoinhibitory domain
Restriction endonuclease HINCII
Transcription repressor COPG
Sulfolobus solfataricus WI2/SNF2 ATPase core
domain
Glucocorticoid receptor
Hyperthermophile chromosomal protein SAC7D
HTRF1 DNA-binding domain
Methionine repressor
Papillomavirus type 18 E2
pyrimidine dimer specific excision repair
DNA polymerase
Organic hydroperoxide resistence transcription
regulator
Diphtheria TOX repressor
Interferon Regulatory Factor 2
Multidrug binding transcription factor QACR
I-CreI endonuclease
Intron-encoded homing endonuclease I-PPOI
Phage PHI29 transcription regulator P4
Adenovirus major late promotor TBP
Phage lambda CII
3
3
2
3
3
2
2.08
2.26
2.37
2.38
2.43
2.51
3
2
3
2
2
2
2
3
2.61
2.70
2.78
2.81
2.83
3.04
3.23
3.24
2
2
2
2
2
2
2
2
3.26
3.35
3.49
3.67
4.26
4.41
4.57
4.71
26
Difficult
1qrv
1o3t
1b3t
3bam
1rva
1zme
1dfm
1bdt
7mht
2fl3
1eyu
2oaa
High mobility group protein D
CAP-CAMP
Epstein-Barr virus nuclear antigen-1
Restriction endonuclease BAMHI
Eco RV endonuclease
Proline utilization transcription activator PUT3
Restriction endonuclease BGLII
Phage P22 Arc gene regulating protein
HHAI methyltransferase
Restriction endonuclease HINP1I
PVUII endonuclease
Restriction endonuclease MVAI
27
3
2
2
3
2
2
3
3
2
2
2
2
5.19
5.20
5.32
5.55
5.68
5.76
6.31
6.45
6.71
6.71
6.82
8.95
References
1. Crick, F. (1970): Central Dogma of Molecular Biology. Nature 227, 561-563. PMID 4913914
2. Dahm, R.(2005) Friedrich Miescher and the discovery of DNA Developmental Biology 278 274–288
3. Voet, D. and Voet, J. G. (1995). Biochemistry – Second Edition. John Willey and
Sons, Inc.pp 850-861.
4. Van Dijk, M. and Bonvin, A. M. J. J. (2008). A protein–DNA docking benchmark. Nucleic Acids
Research, Vol. 36, No. 14 e88
5. Luscombe, N.M., Austin, S.E., Berman, H.M. and Thornton, J.M. (2000) An overview of the
structures of protein-DNA complexes. Genome Biol., 1, e1.
6. Jones, S., Van Heyningen, P., Berman, H. M., Thornton, J. M. (1999). Protein-DNA interactions: a
structural analysis, Journal of Molecular Biology, Volume 287, Issue 5, 877-896.
7. Moitessier, N., Englebienne, P., Lee, D., Lawandi, J. and Corbeil, CR.(2008)Towards the
development of universal, fast and highly accurate docking/scoring methods: a long way to go. British
Journal of Pharmacology 153, S7–S26
8. Duhovny, D., Nussinov, R. and Wolfson, H.J. (2002) Efficient unbound docking of rigid molecules.
In Guigo,R. and Gusfield,D. (eds), Proceedings of the Fourth International Workshop on Algorithms
in Bioinformatics. Springer-Verlag GmbH Rome, Italy, September 17–21, 2002, Vol. 2452, pp. 185–
200.
9. Rarey, M., Kramer, B., Lengauer, T. and Klebe, G. (1996) A Fast Flexible Docking Method using an
Incremental Construction Algorithm. J. Mol. Biol. 261, 470–489
10. Meiler, J. and Baker, D. (2006) ROSETTALIGAND: Protein–Small Molecule Docking with Full
Side-Chain Flexibility. PROTEINS: Structure, Function, and Bioinformatics 65:538–548.
11. Dominguez, C., Boelens, R. and Bonvin A.M.J.J. (2003). HADDOCK: a protein–protein docking
approach based on biochemical or biophysical information. J Am Chem Soc 125: 1731–1737.
28
12. Velec, H.F.G., Gohlke, H., Klebe, G. (2005). DrugScoreCSD - knowledgebased scoring function
derived from small molecule crystal data with superior recognition rate of near-native ligand poses
and better affinity prediction. J Med Chem 48: 6296–6303.
13. Nadassy, K., Wodak, S. J. and Janin J. (1999). Structural Features of Protein−Nucleic Acid
Recognition Sites. Biochemistry, 38 (7), 1999–2017.
14. Ohlendorf, D. H. and Matthew, J. B. (1985). Electrostatics and flexibility in protein–DNA
interactions. Advan. Biophys. 20, 137–151.
15. Wade, R. C., Gabdoulline, R.R., Lu'demann, S. K. and Lounnas V. (1998). Electrostatic steering and
ionic tethering in enzyme–ligandbinding: Insights from simulations. Proc. Natl. Acad. Sci. USA Vol.
95, pp. 5942–5949, May 1998.
16. Mandel-Gutfreund, Y., Margalit, H., Jernigan, R. L. and Zhurkin, V. B. (1998). A role for CH…O
interactions in protein–DNA recognition. J. Mol. Biol. 277, .1140–1121
17. Luscombe, M. N., Laskowski, R. A., and Thornton, J. M. (2001). Amino acid-base interactions: a
three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Research,
21, 2860 – 2874.
18. Stawiski, E. W., Gregoret, L. M. and Mandel-Gutfreund, Y. (2003). Annotating Nucleic Acid-Binding
Function Based on Protein Structure. J. Mol. Biol. 326, 1065–1079.
19. Shanahan, H. P., Garcia, M. A., Jones, S. and Thornton, J. M. (2004) Identifying DNA-binding
proteins using structural motifs and the electrostatic potential. Nucleic Acids Research, 2004, Vol. 32,
No. 16 4732–4741doi:10.1093/nar/gkh803
20. Jones, S., Shanahan, H. P., Berman, H. M. and Thornton, J. M. (2003). Using electrostatic potentials
to predict DNA-binding sites on DNA-binding proteins. Nucleic Acids Research, Vol. 31, No. 24
7189-7198
21. Ahmad, S., Gromiha, M. M. and Sarai, A. (2004). Analysis and prediction of DNA-binding proteins
and their binding residues based on composition, sequence and structural information.
BIOINFORMATICS Vol. 20 no. 4, 477–486
22. Ahmad, S., and Sarai, A. (2005). PSSM-based prediction of DNA binding sites in proteins. BMC
Bioinformatics 6:33
29
23. Wu, J., Wu, H., Liu, H., Zhou, H. and Sun, X.. (2007) Support Vector Machine for Prediction of
DNA-Binding Domains in Protein-DNA Complexes. Life System Modeling and Simulation. Lecture
Notes in Computer Science, Volume 4689/2007, 180-187.
24. Van Dijk, M., Van Dijk, A. D. J., Hsu, V., Boelens R., and Bonvin, A. M. J. J. (2006). Informationdriven protein–DNA docking using HADDOCK: it is a matter of flexibility. Nucleic Acids Research,
Vol. 34, No. 11 3317–3325. doi:10.1093/nar/gkl412
25. Van Dijk M. and Bonvin, A. M. J. J. (2010). Pushing the limits of what is achievable in protein–DNA
docking: benchmarking HADDOCK’s performance. Nucleic Acids Research, 2010, 1–14.
doi:10.1093/nar/gkq222
26. Me´ndez, R., Leplae, R., De Maria, L., and Wodak, S. J. (2003). Assessment of Blind Predictions of
Protein–Protein Interactions: Current Status of Docking Methods. PROTEINS: Structure, Function,
and Genetics 52:51–67
27. Dickerson, R. E. (1998). DNA bending: the prevalence of kinkiness and the virtues of normality
Nucleic Acids Research, Vol. 26, No. 8, 1906–1926
28. Dickerson, R. E. and Chiu, T. K. (1997), Helix bending as a factor in protein/DNA recognition.
Biopolymers, 44: 361–403. doi: 10.1002/(SICI)1097-0282(1997)44:4<361::AID-BIP4>3.0.CO;2-X.
29. Mandelkern, M., Elias, J., Eden, D. and Crothers, D. (1981). The dimensions of DNA in solution. J
Mol Biol 152 (1): 153–61. doi:10.1016/0022-2836(81)90099-1
30. Wolfson, H.J. and Rigoutsos, I. Geometric hashing: An overview. IEEE Computational Science and
Eng., 11:263–278, 1997.
31. CGAL, Computational Geometry Algorithms Library, http://www.cgal.org
32. Nussinov, R. and Wolfson, H.J. Efficient Detection of Three - Dimensional Motifs in Biological
Macromolecules by Computer Vision Techniques, Proceedings of the National Academy of Sciences,
U.S.A., 88, 10495-10499, (1991).
33. Berg, O. G., Winter, R. B. and Von Hippel, P. H. (1981). Diffusion-driven mechanisms of protein
translocation on nucleic acids. 1. Models and theory Biochemistry 20 (24), 6929-6948
31
34. Schneidman-Duhovny, D., Inbar, Y., Nussinov, R. and Wolfson, H.J. Geometry based flexible and
symmetric protein docking. Proteins, 60: 224-231, 2005.
35. Schneidman-Duhovny, D., Nussinov, R and Wolfson, H.J. (2003) PatchDock: a method for binding
site detection and unbound docking of rigid molecules. M.Sc. thesis, Tel – Aviv University, School of
Computer Science.
31
32
‫תקציר‬
‫חלבונים ומולקולות דנ"א נקשרים אלו לאלו במסגרת תהליכים פנים תאיים חיוניים רבים‪ .‬ביצוע תחזיות מדויקות ואמינות למבנה‬
‫המרחבי של החלבון והדנ"א כאשר הם קשורים זה לזה‪ ,‬עשוי להוות נקודת ציון משמעותית בדרך להבנה מעמיקה של תפקודי התא‪.‬‬
‫גישות שנוסו בתחום זה כוללות ניסיון לחזות את אתר הקישור על גבי מעטפת החלבון‪ ,‬סיווג תפקודו של החלבון כקושר‪/‬לא קושר‬
‫דנ"א‪ ,‬ואלגוריתם עגינה המבוסס על אילוצי מרחק מתוך מידע ניסויי‪ .‬בעבודה זו אנו מציגים את ‪ :ParaDock‬אלגוריתם עגינה המשלב‬
‫פתרונות עגינה קשיחה של מקטעי דנ"א קצרים‪ ,‬המבוססים על תאימות גיאומטרית‪ ,‬לכדי מולקולות דנ"א מכופפות מישורית אשר‬
‫מדורגות גם הן על בסיס תאימות גיאומטרית‪ .‬האלגוריתם נבדק על בסיס נתונים הכולל ‪ 41‬מבני חלבון – דנ"א‪ ,‬בגרסאותיהם הקשורות‬
‫והלא קשורות כאחד‪ .‬ללא התמודדות עם גמישות החלבון‪ ,‬וללא שלב עידון המאפשר שינויים מבניים מקומיים‪ ,‬ב ‪ 33%‬מהמבנים‬
‫בצורתם הקשורה‪ ,‬וב ‪ 10%‬מהמבנים בצורתם הלא קשורה‪ ,‬לפחות פתרון מקובל אחד (על פי סטנדרט ‪ )CAPRI‬מדורג בעשרת‬
‫הפתרונות העליונים‪ .‬ללא דרישת רצף או אורך מולקולת הדנ"א‪ ,‬ובזמן ריצה של כשעתיים על מעבד ‪ 2.0 GHz‬סטנדרטי‪,‬‬
‫‪ ParaDock‬מספק פתרון עגינה מהיר ופשוט למבני חלבון – דנ"א‪.‬‬
‫‪33‬‬
‫אוניברסיטת תל‪-‬אביב‬
‫הפקולטה למדעים מדויקים ע"ש ריימונד ובברלי סאקלר‬
‫בית הספר למדעי המחשב‬
‫‪ – ParaDock‬אלגוריתם עגינת חלבון קשיח‬
‫ודנ"א גמיש‬
‫חיבור זה מוגש כחלק מהדרישות לקבלת תואר מוסמך בבית הספר למדעי המחשב ע"ש‬
‫בלבטניק באוניברסיטת תל – אביב על ידי‬
‫איתמר בנית‬
‫המחקר וכתיבת העבודה נעשו בהנחייתו של‬
‫פרופ' חיים וולפסון‬
‫תשרי תשע"א‬
‫‪34‬‬