UPTEC X 08 52
Examensarbete 20 p
December 2008
New scoring-functions for docking smaller molecules to proteins
Evaluation on ligands containing nitrogenous base-moieties
Johan Alexander Källberg Zvrskovec
Bioinformatics Engineering Program
Uppsala University School of Engineering
Date of issue 2008-12
UPTEC X 08 052
Author
Johan Alexander Källberg Zvrskovec
Title (English)
New scoring-functions for docking smaller molecules to proteins:
Evaluation on ligands containing nitrogenous base-moieties
Title (Swedish)
Abstract
Molecular modeling approaches that predict the spatial structure of a ligand-receptor complex, given a 3D model of the receptor, are referred to as molecular docking. They are widely used both in fundamental studies of the molecular mechanisms of protein function and in drug design. In this work, new methods to improve the efficiency of scoring putative protein-ligand complexes in molecular docking are presented. A promising method, consensus docking, which has previously been used successfully to score the interactions of adenine-containing ligands with proteins, is here adapted to another important class of nitrogenous-base ligands: cytosine-containing compounds.
Based on a statistical analysis of the 3D structures of a representative set of 50 complexes of cytosine-containing ligands with different proteins, an array of new scores is proposed. These scores capture and model the physical phenomena that drive the recognition of cytosine by protein receptors, such as hydrogen bonds, aromatic stacking interactions and hydrophobic/hydrophilic interactions.
The proposed scores use the concept of molecular hydrophobicity potential (MHP) to model the hydrophobic/hydrophilic properties of molecules. In addition, statistical modeling through a knowledge-based potential is used to describe the above-mentioned intermolecular interactions implicitly. The new scores were validated on a set of docking conformers for the 50 studied complexes, generated using two popular docking programs, GOLD and Glide. Most of the scores that combine multiple scoring methods showed excellent potential for scoring the targeted type of molecular complexes.
Keywords
Molecular docking, protein, ligand, scoring function, nitrogenous base, cytosine,
hydrogen bond, stacking, hydrophobic, hydrophilic, MHP, empirical, knowledge-based,
backbone, motif, GOLD, Glide
Supervisors
Prof. Roman G. Efremov and Ph.D. Timothy V. Pyrkov
Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences
Scientific reviewer
Ph.D. Torgeir R. Hvidsten
Umeå Plant Science Centre
Dep. of Plant Physiology, Umeå University
Language
English
ISSN 1401-2138
Pages
48 (61 inc. attachments)
Biology Education Centre
Biomedical Center
Husargatan 3 Uppsala
Box 592 S-75124 Uppsala
Tel +46 (0)18 4710000
Fax +46 (0)18 555217
POPULAR SCIENCE SUMMARY
In both fundamental studies of the molecular mechanisms of proteins and in drug design, a computer-based simulation method called molecular docking is used. One of its goals is to predict the 3D structure of the molecules when a smaller molecule (ligand) interacts with a protein. In this project, new methods for evaluating such 3D models are presented, which is a very important step in molecular docking. A promising method, consensus docking, which has previously been applied successfully to score interactions between ligands containing adenine structures and proteins, is here adapted to another important class of nitrogenous-base ligands: compounds containing cytosine.
Based on an analysis of the 3D structures in a representative set of 50 complexes of cytosine-containing ligands with different proteins, a number of new scoring functions are proposed that capture and model the physical phenomena governing how proteins recognize cytosine, such as hydrogen bonds, aromatic stacking and hydrophobic/hydrophilic interactions. The proposed scoring functions include the concept of Molecular Hydrophobicity Potential (MHP) to model the hydrophobic/hydrophilic properties of the molecules. In addition, statistical modeling through a knowledge-based potential is used to describe interaction properties implicitly.
The new scoring functions were evaluated on a set of molecular models that are variations of the 50 studied complexes. These molecular models were created using the popular docking programs GOLD and Glide. Most of the scoring functions that combine different methods showed excellent potential for evaluating the type of molecular complexes that our datasets represent.
Table of Contents
1. INTRODUCTION
1.1. Nitrogenous bases
1.2. Biochemical functions of ligands with cytosine-moieties
1.3. Molecular docking
2. METHODS
2.1. Molecular docking and dataset generation
2.2. PLATINUM – a tool for analysis of non-covalent interactions
2.3. Creating a computer program for development and validation of scores
2.4. Scoring component for modeling hydrogen bonds – c_hbond
2.5. Scoring component for modeling of stacking interactions – c_stack
2.6. Molecular Hydrophobicity Potential (MHP)
2.7. Scoring component based on MHP – c_mhp1
2.8. Knowledge-based estimation of atom positioning – EMP1
2.9. Scoring component based on EMP1 – c_emp1
2.10. Recognition of cytosine – protein backbone motifs
2.11. Scoring component based on hydrogen bond motifs – c_motif
2.12. Scoring component of the docking algorithms used – c_dock
2.13. Normalization and calculation of the weighting coefficients
2.14. Validation of scoring functions
3. RESULTS & DISCUSSION
3.1. Analysis of non-covalent interactions in complexes with cytosine-containing ligands
3.2. Investigating methods to score protein-ligand conformers
3.3. Investigating strategies to combine terms of different scoring components and to validate scores
3.4. Validation of scoring approaches
3.5. Time efficiency
4. CONCLUSIONS
5. ACKNOWLEDGMENTS
6. REFERENCES
Attached parts
APPENDIX I – Validation results for new scores
APPENDIX II – The work program “CSTAT”
1. INTRODUCTION
1.1. Nitrogenous bases
Nitrogenous bases, subdivided into purines and pyrimidines, are fundamental
chemical constituents of all known living organisms, viruses and transposons, and have
probably been so since the dawn of life. Compounds containing nitrogenous bases were
present during the emergence and continued evolution of proteins and were thus part of
the environment in which proteins evolved. An adenine-binding motif can be found in
“ancient proteins” common to all living organisms, which suggests that chemical
pathways involving both adenine-containing ligands and the motif that binds them
developed very early in evolution.1
Essential functions of all cells rely on molecules containing nitrogenous bases, which
play prominent roles in cellular energy transfer, signal transduction and protein synthesis. A
combination of nitrogenous bases is the information-carrying constituent of both RNA and
DNA. Adenosine-5’-triphosphate (ATP), guanosine-5’-triphosphate (GTP) and, to a lesser
extent, cytidine-5’-triphosphate (CTP) and thymidine-5’-triphosphate (TTP) serve as cellular
energy carriers and are involved in phosphorylation processes important for cell signalling,
with ATP being the most prominent energy carrier in cells. Coenzyme A (CoA) is an acyl group
carrier with a key role in the citric acid cycle. Nicotinamide adenine dinucleotide (NAD),
nicotinamide adenine dinucleotide phosphate (NADP) and flavin adenine dinucleotide (FAD)
are involved in various redox reactions. These are only a few of the important cellular
contexts in which compounds with nitrogenous-base moieties are involved.
In April 2008, when the Brookhaven Protein Data Bank (PDB)2 had recently passed
50 000 stored structures (50160), a ligand search in the PDB for ligands containing the
adenine substructure returned 4705 structure hits and 421 ligand hits. The guanine
substructure returned 932 structure hits and 140 ligand hits, the cytosine substructure
264 structure hits and 147 ligand hits, and the thymine substructure 246 structure hits and
124 ligand hits. Knowledge of how ligands containing nitrogenous bases are recognized
by proteins is important for understanding the enzymatic mechanisms in which they take
part, as well as the general principles of interaction between proteins and these ligands.
Because of the attractive therapeutic potential of proteins that recognize molecules
containing nitrogenous bases, the recognition of ligands by these proteins is a subject of
great interest.3,4
Recent advances in structural biology have resulted in a growing number of available
X-ray crystallographic structures of proteins with bound ligands, which has facilitated studies
of intermolecular interactions in the respective binding sites.4,9 Many proteins with now
known spatial structure were selected for structure determination because of their therapeutic
potential and new targets for structure determination tend to be selected on the same basis.9
Comparisons of protein folds derived from different structural classification schemes with
their respective enzymatic functions led to the conclusion that protein function does not show
a clear relationship to protein fold, as only a few specific residues are involved in the enzyme
activity.5 Other studies have emphasized the existence of distinct differences in binding site
configuration between highly homologous proteins (more than 80% sequence similarity).6
Classical bioinformatics-approaches where proteins are sorted by sequence similarity may
therefore not necessarily be able to deduce properties related to the active site.3
A number of studies have addressed protein recognition of the adenine moiety, with an
array of promising conclusions. A frequent adenine-recognizing protein motif depends solely
on interactions between the protein backbone and adenine,1,7 which further underlines the
importance of understanding binding-site interactions at a level below the amino-acid
sequence. In the work by Mao and Wang et al. (2004), non-bonded intermolecular
interactions between the adenine base and the surrounding environment in its binding pocket
were investigated.7 Hydrogen bonds, cation-Pi interactions and aromatic Pi-Pi stacking were
concluded to be the main molecular determinants of adenine-moiety recognition.
Additionally, polar and hydrophilic/hydrophobic interactions between the ligand and its
binding site seem to play an important role in the process.3,4,8 While all of these
interaction types affect the specificity of molecular recognition to some extent, hydrogen
bonds in particular are believed to determine specificity because of their directional nature.3
Figure 1. Molecular structure of the
cytosine base. Capacity to form
hydrogen bonds is shown with
hydrogen acceptor sites in blue and
hydrogen donor sites in red.
1.2. Biochemical functions of ligands with cytosine-moieties
After the recent work on adenine4, we turned our focus to ligands containing cytosine as
a substructure. Cytosine is a pyrimidine base, like thymine and uracil and in contrast to
adenine, which is a purine. Figure 1 shows the molecular structure of the cytosine base and
the possibilities for the molecule to form hydrogen bonds. Cytosine has the capacity to form
five hydrogen bonds: O2 can accept two, N3 can accept one, and N4 can donate two. Ligands
containing cytosine are involved in a multitude of essential biochemical pathways, which
makes cytosine, together with the other nitrogenous bases, an important molecular fragment
and thus of interest to study. To provide a reasonably representative picture of the roles of
cytosine-containing ligands, a ligand search for the cytosine substructure was carried out in
the PDB (September 2008), and the structures found were grouped according to E.C. number.
The result of the search is summarized in Fig. 2.
[Figure 2: bar chart titled “The cytosine-substructure, Complex distribution over E.C.-numbers in the PDB”. Vertical axis: number of complexes; horizontal axis: E.C. number.]
Figure 2. Distribution of protein-ligand complexes with cytosine-containing
ligands in the PDB over E.C. numbers, as of September 2008. Some of the
complexes found were stored without an E.C. number and are thus not included here.
For more information about the E.C. numbers, see the ENZYME database17.
This investigation shows that the E.C. number with the largest number of complexes is
2.7.7.7, which corresponds to the DNA-extension reaction of various DNA polymerases.
Another large group, 2.1.3.2, comprises aspartate carbamoyltransferases, which catalyze a
reaction in the pyrimidine metabolism and the alanine and aspartate metabolism pathways in
bacteria such as Escherichia coli. CTP and UTP act as inhibitors, while ATP is the effector.
Below we describe some of the most populated groups from our search.
The number 1.2.99.2 contains carbon-monoxide dehydrogenases and is represented in our
search by enzymes from the bacterium Oligotropha carboxidovorans. These carbon-monoxide
dehydrogenases are involved in the methane metabolism pathway. The number
2.7.7.6 comprises DNA-directed RNA polymerases, where any nucleoside triphosphate is
used as substrate in the process of elongating or shortening RNA. The number 2.7.4.14
corresponds to a reaction step in pyrimidine metabolism catalyzed by a group of cytidylate
kinases. In this reaction, ATP and dCMP are converted into ADP and dCDP, and vice versa.
dCMP and UMP can also act as phosphate-group acceptors in the reaction.
The number 4.6.1.12 refers to a reaction catalyzed by 2-C-methyl-D-erythritol
2,4-cyclodiphosphate synthase, in which CMP is one of the products. All structures found are
from bacteria. The number 2.7.1.74 refers to deoxycytidine kinase and its reaction, in which a
nucleoside triphosphate and deoxycytidine are converted to a nucleoside diphosphate and
dCMP, and vice versa, and which is part of both the purine and pyrimidine pathways.
This serves as a brief (and incomplete) overview of the various roles of cytosine-containing
ligands, and should not be interpreted as a ranking of the importance of reactions or
enzymes from any point of view (see the caption of Fig. 2). The ENZYME database17 was
used for looking up the E.C. numbers.
1.3. Molecular docking
Numerous in-silico techniques now exist for structural prediction and
characterization of protein-ligand complexes. These techniques complement
experimental methods, which can be demanding or, for other reasons, impossible to
apply.4 With the advance of such in-silico approaches, computational
methodologies have become an integral part of many drug discovery programmes and are
used in tasks such as hit identification and lead optimization. Two distinct approaches to
modeling protein-ligand interactions and performing virtual screening have evolved: the
ligand-based approach, often used in QSAR (quantitative structure-activity relationships),
and the structure-based approach, in which compounds are “docked” into the known spatial
structure of the binding site.9,11
The docking of small-molecular-weight, drug-like organic compounds to protein binding
and active sites, a computational approach that emerged in the early 1980s,
continues to be a key methodology for structural prediction.9-12 Molecular docking is
generally a structural optimization of the constituent molecules (usually a
protein and its ligands in the case of protein-ligand docking), in which new structural
conformers of the biomolecular complex are generated and then evaluated by a scoring
function. Such functions represent the quality of a proposed conformer in terms of free
energy of binding or biological activity.4,9,12 New conformers of the complex are generated
by a search algorithm that advances stepwise towards higher-quality (more native-like)
structural solutions, on the assumption that this quality is reflected in the value of the
scoring function. While steps through conformational space can be made by a variety of
search strategies9,12, the scoring function is usually the defining part of a docking algorithm.
However, since the scoring function only evaluates results produced by the search algorithm,
the combined docking algorithm is highly dependent on the architecture of the search
algorithm, in particular on how it defines the search space.12
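The division of labour between search algorithm and scoring function described above can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the one-dimensional "pose", the quadratic toy score and the greedy acceptance rule are hypothetical and do not correspond to the search strategies of GOLD, Glide or any other docking program.

```python
import random


def dock(score, propose, start, n_steps=200, seed=0):
    """Toy docking loop: the search step proposes a pose, the scoring
    function evaluates it, and the pose is kept only if the score improves."""
    rng = random.Random(seed)
    best_pose, best_score = start, score(start)
    for _ in range(n_steps):
        candidate = propose(best_pose, rng)   # search algorithm
        value = score(candidate)              # scoring function
        if value < best_score:                # convention here: lower is better
            best_pose, best_score = candidate, value
    return best_pose, best_score


def toy_score(x):
    """Hypothetical score with its minimum (the "native" pose) at x = 2.0."""
    return (x - 2.0) ** 2


def toy_propose(x, rng):
    """Hypothetical search step: a small random displacement of the pose."""
    return x + rng.uniform(-0.5, 0.5)


pose, value = dock(toy_score, toy_propose, start=0.0)
```

The loop can only find poses that `toy_propose` is able to reach, which mirrors the point made above: the scoring function evaluates what the search algorithm produces, so the definition of the search space limits the result.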
Numerous robust algorithms for conformer generation currently exist9,12, while
inefficient and inaccurate scoring functions continue to be a limiting factor in molecular
docking.9 Ideally, a scoring function should classify conformers precisely enough to drive
the docking process towards a correct solution, while consuming little enough
computational time to be usable in high-throughput virtual screening. Fast but simplified
scoring functions may underestimate the affinity of true binders or of native-like ligand
poses and, conversely, overestimate that of non-binders or incorrect poses.
Molecular docking is a complex task from many perspectives, and numerous
complications can interfere with the docking process. Limited resolution of the
available crystallographic data is a potential source of error, and the true consequences of
this effect may be difficult to assess.9 Possibly the most serious concern is the need to
introduce simplifications into the docking procedure, in both conformer generation
and scoring, in order to satisfy the time constraints of the docking algorithm, among other
reasons. For example, to reduce the conformational space, most search algorithms are
designed to reduce the number of degrees of freedom in various
ways.9,12 As a result of such simplifications, contemporary algorithms for protein-ligand
docking rarely account for conformational changes in the protein receptor that may
occur upon ligand binding, such as inherent flexibility or induced fit, or for the
participation of solvent molecules.9,12 In many cases structural changes are restricted
mainly to the ligand, for simpler and faster modeling. This simplification of the
conformational search has been shown to correspond poorly to realistic protein-ligand
behaviour, where protein rigidity is likely to be the exception rather than the rule.12 All
proteins are flexible to some extent and change between different conformational states in
accordance with the energies of those states, which affects what is considered to be the
location and structure of the ligand binding site.12,13
Regarding scoring, in addition to the problems associated with simplifications, scoring
functions tend to introduce various assumptions, which may be false and thus a
source of error.9 Existing scoring functions also tend to focus more on enthalpic
than on entropic contributions. This bias does not reflect the physical reality that both
entropic and energetic effects determine the ligand-binding event, and that neither of them can
be favoured in specific reaction cases.9
Because contemporary docking algorithms have limited reliability and in some cases
cannot produce unambiguous scoring results, it is common procedure to analyse not
only the best-ranked conformer but several of the top-scoring protein-ligand conformers
produced by the docking algorithm. This also better reflects what is thought to happen in
reality, since the whole complex changes between different conformational states. An
ensemble of docking solutions is considered able to describe this to some extent.12 It is also
common to re-score ligand poses (also referred to as docking solutions) using
scoring functions that are more appropriate for the current class of ligands or protein targets,
while still relying on existing docking algorithms for conformer generation.4 An approach
known as “consensus scoring” combines different scoring schemes, and such combinations
have been shown to compensate for errors in the
individual functions.12 When applying consensus scoring, one should however keep in mind
possible correlations between the scoring functions, since these can lead to
imbalances in the score, including error amplification.12
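A minimal sketch of the consensus-scoring idea follows. It is an illustration only: the z-score normalization, the equal default weights, the names `f1` and `f2` and the score values are all assumptions made for this example, and they are not the normalization and weighting procedure developed in this work.

```python
from statistics import mean, pstdev


def zscores(values):
    """Put one scoring function's values on a common, unitless scale."""
    m, s = mean(values), pstdev(values)
    return [0.0 if s == 0 else (v - m) / s for v in values]


def consensus(score_table, weights=None):
    """score_table maps a scoring-function name to its scores for the same
    list of poses (lower = better). Returns one combined score per pose."""
    names = list(score_table)
    weights = weights or {name: 1.0 for name in names}
    normalized = {name: zscores(score_table[name]) for name in names}
    n_poses = len(normalized[names[0]])
    return [sum(weights[name] * normalized[name][i] for name in names)
            for i in range(n_poses)]


# Made-up scores for three poses from two hypothetical scoring functions.
scores = {"f1": [-40.0, -55.0, -30.0], "f2": [-7.1, -6.9, -5.0]}
combined = consensus(scores)
best = min(range(len(combined)), key=combined.__getitem__)
```

In this made-up example `f2` alone would rank the first pose best, while the combined score prefers the second pose. If two functions are strongly correlated, summing them effectively counts the same information twice, which is the imbalance warned about above.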
2. METHODS
2.1. Molecular docking and dataset generation
Two datasets were used to develop and validate new scoring criteria.
Dataset 1 comprises 50 molecular complexes obtained with X-ray crystallography
which have been deposited to the PDB. The full list of complexes of Dataset 1 is presented in
Table 1. This set was used for a statistical analysis of intermolecular interactions in complexes
of cytosine-containing ligands with proteins.
Dataset 2 comprises various ligand poses in the binding sites, both native-like (correct)
and misleading (incorrect). These were generated with the popular and well-established
docking programs GOLD18-24 and Glide25,26 for all of the complexes of Dataset 1. For each
complex of Dataset 1, the collection of docking results consists of 10 poses from GOLD with
the scoring function Goldscore, 10 poses from GOLD with the scoring function Chemscore,
and 20 poses from Glide. Dataset 2 thus contains 2000 docked complexes
generated with different docking algorithms.
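The size of Dataset 2 follows directly from the numbers given above (50 complexes in Dataset 1, with 10 + 10 + 20 docking solutions each):

$$
50 \times (10 + 10 + 20) = 50 \times 40 = 2000
$$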
2.2. PLATINUM – a tool for analysis of non-covalent interactions
The multi-functional program PLATINUM16 was the main tool used both for analysis
of the available experimental structural data (Dataset 1) and for scoring the docking solutions
(Dataset 2). In our initial analysis of the crystal structures of Dataset 1 we
used PLATINUM to identify hydrogen bonds and stacking interactions. The functions used
are the scoring functions for hydrogen bonds and stacking interactions based on geometrical
criteria, which are described later. The unique feature of PLATINUM is the option to analyse
molecular complexes with respect to molecular hydrophobicity/hydrophilicity using the
concept of the empirical molecular hydrophobicity potential (MHP). We used this option both
for such analysis and in the scoring components based on MHP (i.e. c_mhp1, see section
“Scoring component based on MHP – c_mhp1”).
TABLE 1. Analysed protein-ligand complexes (Dataset 1).
Ligand PDB-code  PDB-code  Resolution [Å]  Protein name
1MC              1BKY      2.00            VP39
AR3              1P5Z      1.60            Deoxycytidine kinase
C2G              1N1D      2.00            glycerol-3-phosphate cytidylyltransferase
C2P              1ROB      1.60            RIBONUCLEASE A
C3P              1RPF      2.20            RIBONUCLEASE A
C5P              1H7F      2.12            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
C5P              1IV4      1.55            2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase
C5P              1LP6      1.90            orotidine monophosphate decarboxylase
C5P              1QF9      1.70            PROTEIN (URIDYLMONOPHOSPHATE/CYTIDYLMONOPHOSPHATE KINASE)
C5P              1UJ2      1.80            Uridine-cytidine kinase 2
CAR              1KDR      2.25            CYTIDYLATE KINASE
CDC              1JYL      2.40            CTP:phosphocholine Cytidylytransferase
CDF              1GX1      1.80            2-C-METHYL-D-ERYTHRITOL 2,4-CYCLODIPHOSPHATE SYNTHASE
CDM              1INI      1.82            4-DIPHOSPHOCYTIDYL-2-C-METHYLERYTHRITOL SYNTHETASE
CDM              1OJ4      2.01            4-DIPHOSPHOCYTIDYL-2-C-METHYL-D-ERYTHRITOL KINASE
CDP              1EYR      2.20            CMP-N-ACETYLNEURAMINIC ACID SYNTHETASE
CDP              1FFU      2.35            CUTS, IRON-SULFUR PROTEIN OF CARBON MONOXIDE DEHYDROGENASE
CDP              1H7H      2.30            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
CDP              1IV2      1.55            2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase
CDP              1XJN      2.25            ribonucleotide reductase, B12-dependent
CDP              2AZ3      2.20            Nucleoside diphosphate kinase
CDP              2CMK      2.00            PROTEIN (CYTIDINE MONOPHOSPHATE KINASE)
CG2              1OJ1      2.10            RC-RNASE6 RIBONUCLEASE
CMK              1GQC      2.60            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
CPA              1RPG      1.40            RIBONUCLEASE A
CSF              1RO7      1.80            alpha-2,3/8-sialyltransferase
CTN              1UEJ      2.61            Uridine-cytidine kinase 2
CTP              1COZ      2.00            PROTEIN (GLYCEROL-3-PHOSPHATE CYTIDYLYLTRANSFERASE)
CTP              1H7G      2.13            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
CTP              1I52      1.50            4-DIPHOSPHOCYTIDYL-2-C-METHYLERYTHRITOL SYNTHASE
CTP              1KFD      3.90            DNA POLYMERASE I KLENOW FRAGMENT
CTP              1MIY      3.52            tRNA CCA-adding enzyme
CTP              1TUG      2.10            Aspartate carbamoyltransferase catalytic chain
CTP              1UDW      2.60            Uridine-cytidine kinase 2
CTP              1UEU      2.00            tRNA nucleotidyltransferase
CTP              2AD5      2.80            CTP synthase
DCM              1B5E      1.60            PROTEIN (DEOXYCYTIDYLATE HYDROXYMETHYLASE)
DCM              1NJE      2.30            THYMIDYLATE SYNTHASE
DCP              1PEO      3.00            Ribonucleoside-diphosphate reductase 2 alpha chain
DCP              1PKK      1.77            Bifunctional deaminase/diphosphatase
DCP              5KTQ      2.50            PROTEIN (DNA POLYMERASE I)
DCZ              1P60      1.96            Deoxycytidine kinase
DOC              1KDT      1.95            CYTIDYLATE KINASE
GEO              1P62      1.90            Deoxycytidine kinase
GPC              1RDS      1.80            RIBONUCLEASE MS
MCN              1DGJ      2.80            ALDEHYDE OXIDOREDUCTASE
MCN              1N62      1.09            Carbon monoxide dehydrogenase small chain
NCC              1QWJ      2.80            cytidine monophospho-N-acetylneuraminic acid synthetase
PCD              1FFV      2.25            CUTS, IRON-SULFUR PROTEIN OF CARBON MONOXIDE DEHYDROGENASE
PCD              1VLB      1.28            ALDEHYDE OXIDOREDUCTASE
2.3. Creating a computer program for development and validation of scores
In this work the author developed a computer program for the implementation of
functions for calling and communicating with the docking programs (GOLD and Glide) and
the program PLATINUM, as well as functions for performing tasks associated with our new
scoring algorithms (incorporated in the scoring components c_emp1 and c_motif) and
algorithms for score training and validation (see detailed description in APPENDIX II). The
program was developed under the working-name “CSTAT” and was written in the Java
programming language using Java SE Development Kit (JDK) 6.
2.4. Scoring component for modeling hydrogen bonds – c_hbond
Our scoring approach for estimating hydrogen bonds uses geometrical criteria to
identify the presence or absence of a bond, and is the same function that was used in the
previous work on adenine4.
If $r_{AD}$ is the distance between the proton acceptor and the proton donor (Fig. 3), $A_{HDA}$ is
the angle between the hydrogen atom and the acceptor and $A_{NDA}$ is the angle between the
donor neighbour atom and the acceptor atom via the donor, the function used to determine if a
hydrogen bond is present is set as:

$$
B_{hbond,\,acceptor:donor} =
\begin{cases}
1; & 3.2\ \text{Å} \le r_{AD} \le 3.4\ \text{Å},\ \cos(A_{HDA}) \ge 0.6 \\
0; & \text{otherwise}
\end{cases}
$$

Figure 3. Schematics of the hydrogen acceptor (A) and the hydrogen donor (D)
connected through the hydrogen atom (H) in a hydrogen bond. The neighbouring heavy atom
to the donor is marked as the neighbour (N). The distance $r_{AD}$ and the angles $A_{HDA}$ and $A_{NDA}$
are shown schematically. In hydroxy groups where hydrogen is free to rotate around the N-D
bond, we used the value of angle $A_{NDA}$ to identify possible values of the angle $A_{HDA}$.
As is seen in the formula, this function is either 1 or 0 – either a hydrogen bond is
detected or not. Four terms are combined into the final scoring component, where three terms
each represent a molecular site on the cytosine moiety (O2, N3, N4) and one term is a product
of the N3 and N4-terms (to detect simultaneous bonding at these sites):
\[ C_{hbond} = \{ T_{hbond,O2},\; T_{hbond,N3},\; T_{hbond,N4},\; T_{hbond,N3,N4} \} \]
\[ T_{hbond,O2} = \sum_{donors} B_{hbond,O2:donor} \]
\[ T_{hbond,N3} = \sum_{donors} B_{hbond,N3:donor} \]
\[ T_{hbond,N4} = \sum_{acceptors} B_{hbond,acceptor:N4} \]
\[ T_{hbond,N3,N4} = T_{hbond,N3} \, T_{hbond,N4} \]
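The way the four c_hbond terms are assembled from the binary bond indicators can be sketched as follows. This is only an illustration, not the CSTAT implementation (which was written in Java); the function name `hbond_terms` and the example input are hypothetical, and the 0/1 indicators are assumed to have been computed already by the geometric criterion above.

```python
def hbond_terms(o2_flags, n3_flags, n4_flags):
    """Combine 0/1 hydrogen-bond indicators into the four c_hbond terms.

    Each argument is a list of B values (1 = bond detected, 0 = not) for
    one cytosine site, one entry per candidate partner atom.
    """
    t_o2 = sum(o2_flags)   # sum over donors paired with O2
    t_n3 = sum(n3_flags)   # sum over donors paired with N3
    t_n4 = sum(n4_flags)   # sum over acceptors paired with N4
    return {
        "O2": t_o2,
        "N3": t_n3,
        "N4": t_n4,
        "N3,N4": t_n3 * t_n4,  # non-zero only if both sites are bonded
    }


# Hypothetical pose: one bond at O2, one at N3, two at N4.
terms = hbond_terms([1, 0], [1], [1, 1, 0])
```

The product term is zero whenever either N3 or N4 lacks a bond, which is what lets it act as a detector for simultaneous bonding at both sites.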
2.5. Scoring component for modeling of stacking interactions – c_stack
Aromatic Pi-Pi stacking interactions between the heteroaromatic ring of the cytosine
moiety of the ligand and the rings of Phe, Tyr, Trp, His, and the guanidine group of Arg, were
measured through a geometrical function, in a way similar to the one used for hydrogen
bonds. This stacking function is a product of three different parts:
\[ T_{stacking} = S_1(\alpha)\, S_2(d_{normal})\, S_3(d_{parallel}) \]
The angular part is determined by:
\[ S_1(\alpha) = 1 - \sin^4 \alpha, \quad \alpha \in [0^\circ, 90^\circ], \]
where α is the dihedral angle between the plane of the aromatic ring of the cytosine moiety
and the plane of the neighbouring planar structure. S2 and S3 are distance-dependent
parts defined as:
\[ S_2(d_{normal}) = \begin{cases} 1; & d_{normal} \le 4.0\ \text{Å} \\ (5.0 - d_{normal}); & 4.0\ \text{Å} < d_{normal} \le 5.0\ \text{Å} \\ 0; & d_{normal} > 5.0\ \text{Å} \end{cases} \]
where d_normal is the distance between the ring centres along the normal of the cytosine
moiety (Fig. 4), and
\[ S_3(d_{parallel}) = \begin{cases} 1; & d_{parallel} \le 2.0\ \text{Å} \\ 1.25\,(3.0 - d_{parallel}); & 2.0\ \text{Å} < d_{parallel} \le 3.0\ \text{Å} \\ 0; & d_{parallel} > 3.0\ \text{Å} \end{cases} \]
where d_parallel is the distance between the ring centres along the plane of the cytosine
moiety. The complete scoring component contains two terms: one for scoring stacking with
Phe, Tyr, Trp and His, and one for scoring stacking with the guanidine group of Arg, so
that:
\[ C_{stacking} = \{ T_{stacking,phe,tyr,trp,his},\; T_{stacking,arg} \} \]
Figure 4. Illustration showing how the dihedral angle, d normal and d parallel are measured
from the ring structure of Ring 1 stacked with ring structure Ring 2 as reference.
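A minimal sketch of the stacking term for a single ring pair is given below. It is an illustration under stated assumptions rather than the original code: the minus signs and the inequality directions are inferred from the piecewise layout of the formulas, the factor 1.25 is used as printed, and all function names are hypothetical.

```python
import math


def s1(angle_deg):
    """Angular part: 1 for parallel planes (0 deg), 0 for perpendicular (90 deg)."""
    return 1.0 - math.sin(math.radians(angle_deg)) ** 4


def s2(d_normal):
    """Distance part along the ring normal (distances in angstrom)."""
    if d_normal <= 4.0:
        return 1.0
    if d_normal <= 5.0:
        return 5.0 - d_normal  # assumed linear fall-off between 4 and 5
    return 0.0


def s3(d_parallel):
    """Distance part within the ring plane (distances in angstrom)."""
    if d_parallel <= 2.0:
        return 1.0
    if d_parallel <= 3.0:
        return 1.25 * (3.0 - d_parallel)  # factor 1.25 taken as printed
    return 0.0


def t_stacking(angle_deg, d_normal, d_parallel):
    """Product of the angular part and the two distance parts."""
    return s1(angle_deg) * s2(d_normal) * s3(d_parallel)
```

Because the term is a product, a pose must satisfy the angular criterion and both distance criteria at once to receive a non-zero stacking contribution.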
2.6. Molecular Hydrophobicity Potential (MHP)
MHP is an empirical method to model hydrophobic/hydrophilic effects which has been
successfully used for scoring in molecular docking and in a number of other applications27,4.
In our first attempt to model hydrophobic and hydrophilic effects for scoring purposes, we
used the MHP model proposed by Pyrkov et al., 20074 (named MHP1 in the present work). In
MHP1, molecular hydrophobic properties of atoms are projected onto the Connolly surface14
of the ligand, according to:
\[ MHP_j = \sum_i f_i \, e^{-c\, r_{i,j}} + \text{offset} \]
In this formula, i is the index of a neighbouring atom and j is the index of a point on
the ligand surface (Fig. 5). Further, f_i is the empirical atomic hydrophobicity constant
corresponding to the atom with index i, c is an arbitrary weight constant (1 Å^-1 in our case),
and r_{i,j} is the distance between the atom with index i and the surface point with index j.
The offset is a variable used to shift the sum in a desired direction (positive, towards the
more hydrophobic range, or negative, towards the more hydrophilic range). Two sums of this
kind are calculated for each point on the surface; one is a sum over all the ligand atoms
(MHPlig), and the other is a sum over all the protein atoms (MHPprot). An MHP-sum above 0 is
considered hydrophobic while a sum below 0 is considered hydrophilic. Since we use
hydrophobicity constants based on octanol-water partitioning of organic compounds from
Ghose et al.15, this is a rather convenient convention. In the PLATINUM implementation of
MHP1, interactions between the solvent (usually water) and the other participating molecules
are treated by generating a grid of solvent “hydrophilic charges” (Fig. 6). For a review of
these, see our previous work on adenine4 and the reference on the program PLATINUM16.
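The MHP-sum at one surface point can be sketched as follows. This is an illustrative sketch and not the PLATINUM code: the function name, the tuple layout of the atoms and the example constants are hypothetical, and the offset is assumed to be added once to the whole sum.

```python
import math


def mhp_at_point(point, atoms, c=1.0, offset=0.0):
    """MHP value at one surface point.

    point  : (x, y, z) of the surface point j, in angstrom
    atoms  : iterable of (x, y, z, f) with f the atomic hydrophobicity constant
    c      : decay constant in 1/angstrom (1.0, as stated in the text)
    offset : shift applied to the whole sum (assumed additive)
    """
    total = offset
    for x, y, z, f in atoms:
        r = math.dist(point, (x, y, z))   # distance r_ij
        total += f * math.exp(-c * r)     # exponentially decaying contribution
    return total


# Hypothetical example: one hydrophobic and one hydrophilic atom.
example_atoms = [(0.0, 0.0, 0.0, 0.5), (2.0, 0.0, 0.0, -0.3)]
value = mhp_at_point((0.0, 0.0, 0.0), example_atoms)
```

Calling the function once with the ligand atoms and once with the protein atoms would give the two sums (MHPlig and MHPprot) mentioned above for the same surface point.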
Figure 5. An example showing the MHP concept applied to an ethanol molecule in a surrounding
medium of water. Each atom in the ethanol molecule in focus has its hydrophobicity constant.
This example shows how atoms contribute to the MHP-sum for the point j, located at the
Connolly surface of the molecule. [Slightly trimmed, reference 16].
Figure 6. The grid of “hydrophilic charges” simulating solvent (azure) around a ligand (black)
in a protein binding pocket (grey). The grid nodes fill empty space. [Slightly trimmed,
reference 16].
2.7. Scoring component based on MHP – c_mhp1
This scoring part is the same function used in our previous work4 to model
complementarity of hydrophobic properties between ligand and protein in a score specifically
created to rank docking results with ligands containing adenine moieties. The complete
formula for the c_mhp1 score component contains one term and is expressed as:
\[ C_{MHP1} = \{ T_{MHP1} \}, \]
\[ T_{MHP1} = r \cdot c, \]
which is a function built from two parts:
\[ r = \frac{A_1 + A_2}{A_1 + A_2 + A_3}, \qquad c = \frac{2A''}{A'_1 + A'_2 + 2A''}, \]
where r describes the shielding of hydrophobic areas of the ligand and the protein from
hydrophilic water, and c represents the complementarity of shared buried hydrophobic areas
between the ligand and protein. In the factor r, A1 is the buried hydrophobic area of the
protein, A2 is the buried hydrophobic area of the ligand and A3 is the exposed hydrophobic
area of the ligand (Fig. 7). In the factor c, A’’ is the matching buried hydrophobic area of the
ligand and the protein, A’1 is the mismatching buried hydrophobic area of the protein and A’2
is the mismatching buried hydrophobic area of the ligand. The factor r is thus the fraction of
buried area of the total hydrophobic area, and c is the fraction matching area of the total
buried hydrophobic area. Areas are computed according to the MHP1-model described
earlier.
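The two fractions can be illustrated with a short sketch. The function names and the example areas are hypothetical, the areas are assumed to be precomputed with the MHP1 model, and combining the two parts as a product is an assumption made for this illustration.

```python
def shielding(a1, a2, a3):
    """r: buried hydrophobic area as a fraction of the total hydrophobic area.

    a1: buried hydrophobic area of the protein
    a2: buried hydrophobic area of the ligand
    a3: exposed hydrophobic area of the ligand
    """
    return (a1 + a2) / (a1 + a2 + a3)


def complementarity(a_match, a1_mismatch, a2_mismatch):
    """c: matching area as a fraction of the total buried hydrophobic area."""
    return 2.0 * a_match / (a1_mismatch + a2_mismatch + 2.0 * a_match)


def t_mhp1(a1, a2, a3, a_match, a1_mismatch, a2_mismatch):
    """Single c_mhp1 term, assumed here to be the product of r and c."""
    return shielding(a1, a2, a3) * complementarity(a_match, a1_mismatch, a2_mismatch)


# Hypothetical areas in square angstrom.
r_value = shielding(30.0, 20.0, 50.0)
c_value = complementarity(20.0, 5.0, 5.0)
```

Both fractions lie between 0 and 1 for non-negative areas, so a fully buried and fully matched hydrophobic surface gives the maximum value of each part.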
Figure 7. The scheme illustrating the shielding (left) and complementarity (right)
parameters used to characterize the strength of hydrophobic contacts for receptor-ligand
binding. S1 and S2 are the buried surface areas of the receptor and the ligand, respectively; A3
is the exposed area of the ligand. A1', A2' and A'' are respectively the mismatching and
matching hydrophobic areas of the receptor and the ligand. Hydrophobic and hydrophilic
surface regions are shown in brown and blue, respectively. [Adapted from reference 4].
2.8. Knowledge-based estimation of atom positioning – EMP1
We also introduced a simple spatial-empirical knowledge-based method to assess the
quality of molecular complexes in terms of free energy and/or biological activity. This
approach relies solely on the spatial distribution of atoms of different types with
distinguishable properties. Data from known spatial structures of biomolecular complexes
are analysed, and atoms with preferred properties are stored in a database in a format
that keeps information about their position relative to a reference, for example an atom in a
ligand. This is called “the training of the EMP1-function”. Each EMP1-function thus
corresponds to a reference centre (in our case a ligand atom or another ligand-based point).
In our current model, such data about atoms can also be weighted to reflect the quality of the
data or any other preferred parameters.
Positions of stored atoms are transformed into local positions relative to the reference
centre by projecting their coordinates onto a local coordinate system associated with the
centre; such projected atom positions are here called knowledge-based centres. This allows
us to score a molecular complex by locating the same reference in it, projecting atoms of the
scored complex corresponding to (preferably but not necessarily) the properties of the atoms
in the database onto a local coordinate system in the same way as was done during the training
procedure, and then calculating the score using a formula based on Gaussian functions.
The EMP1-function formula is described by:
I 1
ri2, j
J 1
wje
2V
i 0, ai A j 0
EMP1
J 1
wj
V 2
j 0
\[ V = \frac{\sum_{j=0}^{J-1} w_j \, r_{j,\mu}^2}{\sum_{j=0}^{J-1} w_j} \]
where A is the set of protein atoms that meet the atom-type condition set for the
particular EMP1-function, I is the number of atoms in A, i is the atom index and ai is the atom
with index i. Further, j is the index of a knowledge-based centre where each centre represents
one known (from preliminary training on experimental structures) spatial atom location in the
local coordinate system for the EMP1-function, J is the number of centres, ri,j is the distance
between the protein atom ai and the knowledge based centre with index j, rj,μ is the distance
between the knowledge based centre with index j and the mean vector of the knowledge based
centres and wj is the weight (1 in our case) of the knowledge based centre with index j.
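The two building blocks that are fully specified by the definitions above, the weighted variance V of the knowledge-based centres and the weighted Gaussian sum over atoms and centres, can be sketched as follows. This is a partial illustration only: the normalising denominator of the EMP1 formula is deliberately not reproduced, the function names are hypothetical, and the mean vector is assumed to be the weighted mean of the centres.

```python
import math


def weighted_mean(centres, weights):
    """Weighted mean position of the knowledge-based centres (assumed definition)."""
    total = sum(weights)
    return tuple(
        sum(w * p[k] for p, w in zip(centres, weights)) / total for k in range(3)
    )


def variance_v(centres, weights):
    """V: weighted mean squared distance of the centres from their mean vector."""
    mu = weighted_mean(centres, weights)
    return sum(
        w * math.dist(p, mu) ** 2 for p, w in zip(centres, weights)
    ) / sum(weights)


def gaussian_sum(atoms, centres, weights):
    """Unnormalised double sum of w_j * exp(-r_ij^2 / (2V)) over atoms i, centres j."""
    v = variance_v(centres, weights)
    return sum(
        w * math.exp(-math.dist(a, p) ** 2 / (2.0 * v))
        for a in atoms
        for p, w in zip(centres, weights)
    )
```

In this sketch the atoms passed in are assumed to be already filtered by the atom-type condition of the EMP1-function and already expressed in its local coordinate system.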
Figure 8. The scheme illustrating the principle of EMP1. The reference centre (white) is
associated with a reference coordinate system where the atoms can be placed in a suitable
way to be compared to the knowledge-based centres (green). The atom ai (blue) is here
compared to the knowledge-based centre j. The distance between them is ri,j. Each
knowledge-based centre is based on the position of a known atom in a reference structure
relative to the reference centre. To be able to produce vectors for projection, a suitable
reference centre must be somewhat rigid with at least 2 discernible directions.
2.9. Scoring component based on EMP1 – c_emp1
While there are many different possible approaches to construct a scoring component
using the EMP1-method, we used the following. To try and capture as many important
phenomena for molecular interaction as possible with as few EMP1-centres as possible, we
created 7 EMP1-functions at different centres, 6 centred at the 3 molecular sites of cytosine
(O2, N3, N4) and one centred in the middle of the ring-structure. The 7 EMP1-functions are:
EMP1-function   site     atom type
O2_HNx          O2       hydrogen, connected to a nitrogen
O2_HOx          O2       hydrogen, connected to an oxygen
N3_HNx          N3       hydrogen, connected to a nitrogen
N3_HOx          N3       hydrogen, connected to an oxygen
N4_Nx           N4       nitrogen
N4_Ox           N4       oxygen
centre_Ar       centre   atoms involved in aromatic contacts
The atom type describes what kind of atoms the EMP1-function centred at the particular
site recognizes. Each EMP1-function is the basis of one term in the c_emp1 scoring
component, but is also normalized by the number of EMP1-terms used; in our case 7. This
makes it possible to compare EMP1-scores with a different number of terms. The score
component for c_emp1 is described by:
\[ C_{EMP1} = \{ T_{O2\_HNx},\; T_{O2\_HOx},\; T_{N3\_HNx},\; T_{N3\_HOx},\; T_{N4\_Nx},\; T_{N4\_Ox},\; T_{centre\_Ar} \} \]
\[ T_{O2\_HNx} = \frac{EMP1_{O2\_HNx}}{N_{EMP1}}, \qquad T_{O2\_HOx} = \frac{EMP1_{O2\_HOx}}{N_{EMP1}} \]
\[ T_{N3\_HNx} = \frac{EMP1_{N3\_HNx}}{N_{EMP1}}, \qquad T_{N3\_HOx} = \frac{EMP1_{N3\_HOx}}{N_{EMP1}} \]
\[ T_{N4\_Nx} = \frac{EMP1_{N4\_Nx}}{N_{EMP1}}, \qquad T_{N4\_Ox} = \frac{EMP1_{N4\_Ox}}{N_{EMP1}} \]
\[ T_{centre\_Ar} = \frac{EMP1_{centre\_Ar}}{N_{EMP1}}, \qquad N_{EMP1} = 7 \]
2.10. Recognition of cytosine – protein backbone motifs
In our tests and scores, the cytosine-protein backbone motifs are detected by
investigating the hydrogen bonds involving the O2-, N3- and N4-sites with the
Bhbond,acceptor:donor functions described earlier in the “Scoring component for modeling
hydrogen bonds – c_hbond” section. Since more than one hydrogen bond can be detected by this
function for each of the three sites, all combinations of bonds are considered and composed
into motifs together with the information about amino acid indexing. When only investigating
how the cytosine binds to the protein backbone, bonds with other parts of the protein are
considered equal; that is, the information about bond type and amino acid index from such a
bond is discarded. We have chosen to index the amino acids from the first valid bond (a bond
with the protein backbone) in the order of the enumeration of the cytosine atomic sites (O2,
N3, N4). This means that the indexing starts at different sites for motifs with a different
number of initial participating atomic sites, which must be considered when comparing motifs.
Motifs are further expressed in the form (a,b; c,d; e,f), where a, c and e are the atom types
of the atoms binding with the O2-, N3- and N4-sites respectively, and b, d and f are the
respective amino-acid indices for a, c and e.
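The relative indexing can be illustrated with a small sketch. The data layout and function names are hypothetical, and the markers “c” (side-chain bond) and “-” (no bond), both given the index 0, are assumed to follow the convention used in the footnotes of Table 2b.

```python
def encode_motif(o2, n3, n4):
    """Encode one bond per site as a motif with relative residue indices.

    Each argument is None (no bond), ("c", None) for a side-chain bond, or
    (atom_type, residue_number) for a bond with the protein backbone.
    Residue indices are counted from the first backbone-bonded site in the
    order O2, N3, N4.
    """
    sites = [o2, n3, n4]
    reference = next(
        (s[1] for s in sites if s is not None and s[1] is not None), None
    )
    motif = []
    for s in sites:
        if s is None:
            motif.append(("-", 0))          # no bond at this site
        elif s[1] is None:
            motif.append(("c", 0))          # side chain: index discarded
        else:
            motif.append((s[0], s[1] - reference))
    return tuple(motif)


def format_motif(motif):
    """Render a motif in the (a,b; c,d; e,f) notation."""
    return "(" + "; ".join(f"{atom},{index}" for atom, index in motif) + ")"


# Backbone bonds to residues 255 (O2), 254 (N3) and 252 (N4).
three_site = format_motif(encode_motif(("N", 255), ("N", 254), ("O", 252)))
```

With these residue numbers the sketch yields (N,0; N,-1; O,-3); dropping the O2 bond shifts the reference to the N3 site, which is why the same spatial arrangement then appears with different indices.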
2.11. Scoring component based on hydrogen bond motifs – c_motif
To use the information from known motifs where the cytosine moiety binds through
hydrogen bonding with the protein backbone, we created a scoring component that uses this
kind of information (see RESULTS & DISCUSSION for a more thorough discussion about
this approach). The motifs used are:
(N,0; N,-1; O,-3),
(N,0; N,-1; -,0),
(N,0; -,0; O,-3),
(-,0; N,0; O,-2),
which are variants of the same motif, but with different participating molecular sites. For
convenience, these are named in order as M1, M2, M3 and M4 in this scoring component. We
constructed the motif scoring component in the following form:
\[ C_{MOTIF} = \{ T_{M1},\; T_{M2},\; T_{M3},\; T_{M4} \} \]
\[ T_{M1} = S_{M1}, \quad T_{M2} = S_{M2}, \quad T_{M3} = S_{M3}, \quad T_{M4} = S_{M4} \]
where S_M1, S_M2, S_M3 and S_M4 are the numbers of the corresponding motifs found in the
complex for each denoted motif type.
2.12. Scoring component of the docking algorithms used – c_dock
This component is simply taken from the score incorporated in the docking algorithm
used for generating the particular docking solutions: GOLD goldscore, GOLD chemscore or
Glide. It can thus only be used in cases where this score is known, and we use it only to
simulate a score which includes the scoring capabilities of the docking algorithms.
2.13. Normalization and calculation of the weighting coefficients
Linear combinations of scoring components result in a score with a number of terms.
We created our scores by first normalizing the terms with their mean value over all complexes
in the score training set, and then weighting these normalized terms with weighting
coefficients. Also, a score constant is added. According to this, our scores are weighted sums
with the following form:
\[ f_{score} = \sum_{i=1}^{I-1} w_i \, n_i \, t_i + w_0, \qquad n_i = \frac{D}{\sum_{\delta=0}^{D-1} t_{i,\delta}} \]
In this formula, i is the index of the term, I is the total number of terms in the score
(including the constant term), t_i is the scoring component term corresponding to the total
score term with index i, n_i is the normalization coefficient for the scoring component term
and w_i is the weighting coefficient for the scoring component term. w_0 is the added score
constant with the term index 0, and is also considered a weighting coefficient. For the
normalization coefficient, δ is the docking result index, t_{i,δ} is the value of term i for
docking result δ, and D is the number of docking results contributing to the training or
evaluation of the score.
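A minimal sketch of the normalization and the weighted sum is shown below. The function names and the toy numbers are hypothetical; in the work described here the weighting coefficients come from the least-squares fit, which is not reproduced in this sketch.

```python
def normalization_coefficients(term_table):
    """n_i = D / sum of term i over all D docking results (inverse of the mean).

    term_table[d][i] is the value of term i for docking result d.
    """
    n_results = len(term_table)
    n_terms = len(term_table[0])
    return [
        n_results / sum(row[i] for row in term_table) for i in range(n_terms)
    ]


def score(terms, weights, norms, w0):
    """Weighted sum of normalized terms plus the score constant w0."""
    return w0 + sum(w * n * t for w, n, t in zip(weights, norms, terms))


# Toy training table: two docking results, two terms.
table = [[1.0, 2.0], [3.0, 6.0]]
norms = normalization_coefficients(table)      # inverse means: 1/2 and 1/4
example = score([2.0, 4.0], [1.0, 1.0], norms, w0=0.5)
```

Dividing each term by its training-set mean puts terms with very different magnitudes on a comparable scale before the weights are fitted.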
The weighting coefficients were obtained by using the docking results (Dataset 2) and
adapting our (normalized) scoring functions to a number of fitting functions by a least squares
procedure using the program KVAS (part of the PLATINUM program package). We have
tried out 4 different fitting functions: Semi- step function, Negative RMSD, Negative RMSD
with cut off and Term correlation. Each function is a function of the RMSD between the
cytosine moiety and the crystal structure of the same substructure in the complex, except the
last function – Term correlation. The fitting functions are:
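Semi-step function:
\[ F = \begin{cases} 1; & d_{RMSD} \le 3\ \text{Å} \\ 4 - d_{RMSD}; & 3\ \text{Å} < d_{RMSD} \le 4\ \text{Å} \\ 0; & 4\ \text{Å} < d_{RMSD} \end{cases} \]
Negative RMSD:
\[ F = -d_{RMSD} \]
Negative RMSD with cut off:
\[ F = \{ d_{RMSD} \le 6\ \text{Å} : -d_{RMSD} \} \]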
Term correlation:
Si
I 1 ti ,
D
Si
i 1
F
I
D 1
Si
ti ,
0
Here d_RMSD is the RMSD between the observed docked cytosine moiety and its
crystal structure. In Negative RMSD with cut off, docking results which do not fulfil the
condition are discarded and are thus not used for calculating weighting coefficients. Term
correlation is not calculated from the RMSD between the docked ligand and its crystal
structure, but from the scoring terms themselves.
While RMSD-based criteria are well known, Term correlation is a new notion and
deserves a more detailed discussion. It replaces the atomic coordinates compared by
RMSD with the values of interaction terms. The basic idea is that a docking solution will be
considered correct (native-like) if its interaction pattern statistically implies high quality
compared to other interaction patterns. In Term correlation, i is the index of a term in a score
and I is the total number of terms in the score (including the constant term), δ is the index of
the docking result and D is the number of docking results. t_{i,δ} is the scoring component term
corresponding to the score term i in the docking result with index δ (see the description of the
score structure earlier). Each score function will thus have its own Term correlation function.
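The three RMSD-based fitting functions can be sketched as follows. This is an illustration only: the function names are hypothetical, the treatment of the exact boundary values is an assumption, and a discarded docking result is represented here by `None`. Term correlation is not included.

```python
def semi_step(d_rmsd):
    """1 up to 3 angstrom, linear fall-off to 0 at 4 angstrom, 0 beyond."""
    if d_rmsd <= 3.0:
        return 1.0
    if d_rmsd <= 4.0:
        return 4.0 - d_rmsd
    return 0.0


def negative_rmsd(d_rmsd):
    """Target value is simply the negated RMSD."""
    return -d_rmsd


def negative_rmsd_cutoff(d_rmsd, cutoff=6.0):
    """Negated RMSD; results beyond the cut-off are discarded (None)."""
    return -d_rmsd if d_rmsd <= cutoff else None


# Hypothetical RMSD values (angstrom) for four docking results.
rmsds = [1.2, 3.5, 5.0, 7.5]
targets = [negative_rmsd_cutoff(d) for d in rmsds]
kept = [t for t in targets if t is not None]
```

Only the retained targets would enter the least-squares fit of the weighting coefficients when the cut-off variant is used.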
2.14. Validation of scoring functions
We validated combinations of scoring components using two different validation
strategies, which we called “Complete training and test” and “Cross- training and test”.
In Complete training and test, Dataset 1 and Dataset 2 are both used for training the
scoring functions and Dataset 2 is used as a test set. Dataset 1 is here used for training the
knowledge-based scoring component c_emp1 and Dataset 2 is used for calculating weighting
coefficients during the training. While this procedure does not provide any new information,
it allows testing the consistency of the obtained result.
In Cross-training and test, Dataset 1 and Dataset 2 are repeatedly divided into training
and test sets. This is done on a protein-ligand combination basis. The scores are trained on the
parts of Dataset 1 and Dataset 2 belonging to the training set and a validation is performed for
each iteration. Complexes belonging to the training set are used for training the scoring
functions; those from Dataset 1 to train the knowledge-based component c_emp1 and those
from Dataset 2 to calculate weighting coefficients. Complexes from Dataset 2 belonging to
the test set are used for the testing. The division into a training and a test set is made
randomly, with all protein-ligand complex combinations treated equally in terms of
probabilities to belong to a particular set. From the cross-training/cross-testing procedure we
also obtain the variances of the weighting coefficients, variances of validation measurements
and mean values of validation measurements (about validation measurements; see the
explanation below). The probabilities used were 0.9 that a complex type belongs to the
training set and 0.1 that it belongs to the test set. Also, one complex type was randomly
chosen to always belong to the test set, to avoid ending up with an empty test set for some
iterations. The number of iterations for the cross-training/cross-validation procedure was
set to 20.
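The random division can be sketched as below. This is a hypothetical illustration: the function name and the fixed seed are assumptions, and the complex type that always belongs to the test set is assumed to be drawn once and kept for all iterations.

```python
import random


def cross_splits(complexes, iterations=20, p_train=0.9, seed=0):
    """Repeatedly divide complex types into training and test sets.

    One complex type is drawn once and forced into every test set, so no
    iteration can end up with an empty test set.
    """
    rng = random.Random(seed)
    forced = rng.choice(complexes)
    splits = []
    for _ in range(iterations):
        train, test = [], [forced]
        for name in complexes:
            if name == forced:
                continue
            # Each remaining complex goes to training with probability p_train.
            (train if rng.random() < p_train else test).append(name)
        splits.append((train, test))
    return forced, splits


# A few complex types taken from Dataset 1 as example input.
names = ["C5P_1IV4", "CTP_1COZ", "CDP_1EYR", "DCM_1B5E", "PCD_1VLB"]
forced, splits = cross_splits(names)
```

Training and validating once per split would then give the means and variances of the weighting coefficients and validation measurements over the 20 iterations.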
To validate how a score is performing we used a number of score validation
measurements mainly based on the RMSD between the docked ligand and its crystal
structure, or in our case – the RMSD between the cytosine substructure and its crystal
structure counterpart.
The first measurement is the “relative enrichment”, which is a measurement of how well
the score orders the docking results. The relative enrichment is calculated by measuring the
area A_I under the ROC (receiver operating characteristic) curve representing the rate of
collecting correct docking solutions in an ordered list (according to the scoring function
under study), and then comparing this to the area A_max under an ideal ROC curve (that is,
where all correct docking poses are scored at the top of the list). The docking poses are
commonly rank-ordered according to their scores from the best score to the worst score. The
relative enrichment is thus calculated in the following way:
\[ E = \frac{A_I}{A_{max}} \]
Figure 9. Example of the ROC curves comparison. In this case a series of 60 docking
poses have been validated by a score indexing them from 1 to 60 with 1 being the index given
to the pose with the best score and 60 given to the worst scoring result. The docking poses are
also ranked by RMSD from the crystal structure of the cytosine moiety, and since an ideal
score in this case would yield equal indexing and RMSD ranking one can compare the
resulting area with the area of an ideal score using this procedure to get an overview of the
ranking performance of the score to be validated. This comparison measurement is called the
relative enrichment.
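One possible way to compute such a ratio is sketched below. It is a simplified illustration, not the procedure used in this work: each pose is assumed to be labelled correct or incorrect, the curve is taken as the cumulative count of correct poses along the score-ordered list, and the area is a plain sum of that count. All names are hypothetical.

```python
def cumulative_area(flags):
    """Area under the step curve of correct poses collected so far."""
    area = 0
    found = 0
    for is_correct in flags:
        if is_correct:
            found += 1
        area += found
    return area


def relative_enrichment(flags):
    """Ratio of the observed area to the area of an ideal ordering.

    flags: booleans in score order (best score first), True = correct pose.
    """
    ideal = sorted(flags, reverse=True)   # all correct poses ranked first
    a_max = cumulative_area(ideal)
    if a_max == 0:
        return 0.0                        # no correct pose in the list
    return cumulative_area(flags) / a_max


# Hypothetical ranked list of four poses.
e_value = relative_enrichment([True, False, True, False])
```

A score that places every correct pose ahead of every incorrect one gives a ratio of 1, and the ratio decreases as correct poses move down the list.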
Another measurement of the performance of a scoring function is the “hit rank”:
the best (lowest) rank of a credible native-like docking solution (RMSDcyt < 2 Å, the
PLATINUM standard) in the rank-ordered list according to the score under study. A hit rank
of 1 is greatly preferred. When validating a score for many different protein-ligand complexes
at once, we use the relative enrichment and hit rank to create three secondary measurements
which summarize the performance of the score on the whole set of complexes: Mean Enrichment
(ME), the mean relative enrichment over all complex types; Mean Hit Rank (MHRk), the mean
hit rank over all complex types; and Hit Rate (HRt), the number of times a complex
type shows a hit rank of 1. For the purpose of comparison we complement these with two
other measurements. Mean Random Hit Rank (MRHRk) is the hypothetical rank of a
correct docking solution under the assumption that our scoring function is completely
nonselective and the native-like solutions are distributed randomly in the rank-ordered list.
Random Hit Rate (RHRt) is the hypothetical Hit Rate if this assumption is made for all
complex types.
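The hit rank and the two summaries built on it can be sketched as follows. The names and the example data are hypothetical, and complex types without any native-like pose are assumed to be skipped when averaging, which may differ from the treatment used in this work.

```python
def hit_rank(rmsds_in_score_order, threshold=2.0):
    """Rank (1-based) of the first pose with RMSD below the threshold.

    rmsds_in_score_order: cytosine RMSD values in angstrom, best score first.
    Returns None if no pose is native-like.
    """
    for rank, rmsd in enumerate(rmsds_in_score_order, start=1):
        if rmsd < threshold:
            return rank
    return None


def summarize(per_complex):
    """Mean Hit Rank and Hit Rate over a dict of complex type -> RMSD list."""
    ranks = [hit_rank(values) for values in per_complex.values()]
    found = [r for r in ranks if r is not None]
    mean_hit_rank = sum(found) / len(found) if found else None
    hit_rate = sum(1 for r in found if r == 1)
    return mean_hit_rank, hit_rate


# Hypothetical docking results for two complex types.
data = {"complex_A": [1.2, 5.0, 3.3], "complex_B": [4.0, 3.5, 0.9]}
mhrk, hrt = summarize(data)
```

In this toy input the first complex type has its native-like pose ranked first and the second has it ranked third, giving a Mean Hit Rank of 2 and a Hit Rate of 1.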
3. RESULTS & DISCUSSION
Figure 10. Schema of the workflow of the project. White blocks symbolize stored data.
Green circles are events and stages in the process with red arrows showing the process flow
and blue arrows showing the data flow. Dotted blue lines are special cases of data flow when
data about crystal structures are used to train the knowledge-based component c_emp1. N is
the number of folds in the cross-training/cross-test procedure.
3.1. Analysis of non-covalent interactions in complexes with cytosine-containing
ligands
Aiming at the development of new scoring functions, we have carried out an analysis of
different types of non-covalent intermolecular interactions between the ligand and the protein
for a set of protein-ligand complexes containing ligands with cytosine-moieties (Dataset 1).
The results of the analysis are shown in Tables 2a – 2e. Three interaction types were selected
to be investigated in the analysis: Hydrogen bonds, stacking-interactions and hydrophobic
interactions described by the MHP-formalism. These interactions have been shown previously
to adequately describe interactions of adenine-containing ligands with proteins.4 Through this
analysis we could conclude that all above-mentioned interaction types are involved in
different binding motifs for the cytosine-moiety, but as expected there does not exist a single
cytosine-binding motif regarding these interaction types, but rather many different kinds of
motifs. Our goal was to reveal these binding patterns, to use them further in a scoring
approach similar to the one proposed for ligands containing adenine-substructures by Pyrkov
et al. in 20074. Each part of the analysis will be discussed separately below.
Hydrogen bonds in protein-ligand complexes for ligands containing adenine have been
investigated in earlier studies3,4 with the conclusion that only a few of all possible hydrogen
bond patterns are actually present in the investigated datasets. The method to count the
number of hydrogen bonds is the same as used in the scoring component c_hbond (see
Methods), and the result of our analysis of hydrogen bonds is presented in Table 2a. We
decided to investigate hydrogen bonds for a selected number of atom types in the ligands,
which were: all oxygen atoms with sp2 hybridization, all oxygen atoms with sp3
hybridization, all oxygen atoms in a saccharide substructure, all oxygen atoms in a phosphate
substructure, all nitrogen atoms with sp2 hybridization, all nitrogen atoms involved in an
aromatic contact, and the specific O2, N3 and N4 atoms in cytosine (in standard nomenclature).
Secondary measurements were derived from these first measurements to further identify
hydrogen bond patterns for the cytosine-moiety. These describe combinations of hydrogen
bonds for the cytosine-moiety and include binary measurements for the presence of hydrogen
bonds at the O2, N3, N4 –sites and all sites simultaneously, and binary measurements for
combinations of hydrogen bonds present at different combinations of the O2, N3, and N4 –
sites. Simultaneous hydrogen bonding at all three sites (O2, N3, N4) was found in 46% of the
investigated complexes. The N4-site turned out to be involved in hydrogen bonding slightly
more frequently (74%) than the other two sites, followed by the O2-site (68%) and the N3-site
(66%). This effect can largely be explained by the fact that both the N4-site and the O2-site
can each form 2 hydrogen bonds. The two-site combinations showed similar frequencies, with a
presence of 58% for the O2/N4 and N3/N4 combinations and 54% for the O2/N3 combination
(Table 2a).
Figure 11. Illustration of hydrogen-bonds (purple lines) between the cytosine-moiety and the
protein backbone in the crystal structure C5P_1IV4. Only the protein backbone is shown from
the protein to illustrate the O2, N3, N4 (N,0; N,-1; O,-3) 3-site cytosine-backbone motif CM1,
which in this case is involving residues with numbers 255, 254 and 252. Also, residue number
249 can be seen to be involved in hydrogen bonding to the cytosine-moiety[ref].
TABLE 2a. Analysis of intermolecular interactions in complexes with cytosine-containing
ligands; Hydrogen Bonds.
COMPLEX    O.2   O.3  O.sug  O.po3   N.2  N.ar  Cyt_O2  Cyt_N3  Cyt_N4  Cyt_Sum  bCyt_O2  bCyt_N3  bCyt_N4  bCyt_All  O2/N3  O2/N4  N3/N4
1MC_1BKY     0     0      0      0     0     0       0       0       0        0        0        0        0         0      0      0      0
AR3_1P5Z     0     0      3      0     2     1       0       1       2        3        0        1        1         0      0      0      1
C2G_1N1D     0     1      6      4     2     1       0       1       2        3        0        1        1         0      0      0      1
C2P_1ROB     1     0      1      4     0     1       1       1       0        2        1        1        0         0      1      0      0
C3P_1RPF     1     0      1      4     0     1       1       1       0        2        1        1        0         0      1      0      0
C5P_1H7F     1     0      3      1     1     1       1       1       1        3        1        1        1         1      1      1      1
C5P_1IV4     1     0      3      2     2     1       1       1       2        4        1        1        1         1      1      1      1
C5P_1LP6     0     0      2      2     0     0       0       0       0        0        0        0        0         0      0      0      0
C5P_1QF9     1     0      3      4     2     0       1       0       2        3        1        0        1         0      0      1      0
C5P_1UJ2     2     0      5      4     1     1       2       1       1        4        1        1        1         1      1      1      1
CAR_1KDR     4     0      1      4     1     0       4       0       1        5        1        0        1         0      0      1      0
CDC_1JYL     2     0      2      1     1     1       2       1       1        4        1        1        1         1      1      1      1
CDF_1GX1     1     0      2      4     2     1       1       1       2        4        1        1        1         1      1      1      1
CDM_1INI     1     1      4      0     2     1       1       1       2        4        1        1        1         1      1      1      1
CDM_1OJ4     1     3      0      1     1     1       1       1       1        3        1        1        1         1      1      1      1
CDP_1EYR     2     0      1      2     1     1       2       1       1        4        1        1        1         1      1      1      1
CDP_1FFU     1     0      2      7     2     1       1       1       2        4        1        1        1         1      1      1      1
CDP_1H7H     1     0      2      0     1     0       1       0       1        2        1        0        1         0      0      1      0
CDP_1IV2     2     0      1      3     2     1       2       1       2        5        1        1        1         1      1      1      1
CDP_1XJN     1     0      4      7     1     0       1       0       1        2        1        0        1         0      0      1      0
CDP_2AZ3     0     0      2      4     0     0       0       0       0        0        0        0        0         0      0      0      0
CDP_2CMK     3     1      0      3     2     1       3       1       2        6        1        1        1         1      1      1      1
CG2_1OJ1     1     0      1      1     1     2       0       0       0        0        0        0        0         0      0      0      0
CMK_1GQC     1     0      6      0     1     1       1       1       1        3        1        1        1         1      1      1      1
CPA_1RPG     1     0      1      3     1     2       1       1       0        2        1        1        0         0      1      0      0
CSF_1RO7     2     2      4      2     1     1       2       1       1        4        1        1        1         1      1      1      1
CTN_1UEJ     1     0      5      0     1     1       1       1       1        3        1        1        1         1      1      1      1
CTP_1COZ     0     1      0      8     2     1       0       1       2        3        0        1        1         0      0      0      1
CTP_1H7G     1     0      2      2     1     1       1       1       1        3        1        1        1         1      1      1      1
CTP_1I52     1     0      2      5     2     1       1       1       2        4        1        1        1         1      1      1      1
CTP_1KFD     0     0      0      1     0     0       0       0       0        0        0        0        0         0      0      0      0
CTP_1MIY     0     1      1      2     1     1       0       1       1        2        0        1        1         0      0      0      1
CTP_1TUG     1     0      1      2     2     0       1       0       2        3        1        0        1         0      0      1      0
CTP_1UDW     1     0      4      8     1     1       1       1       1        3        1        1        1         1      1      1      1
CTP_1UEU     0     1      1      6     0     0       0       0       0        0        0        0        0         0      0      0      0
CTP_2AD5     1     0      3      6     0     1       1       1       0        2        1        1        0         0      1      0      0
DCM_1B5E     2     0      1      7     1     0       2       0       1        3        1        0        1         0      0      1      0
DCM_1NJE     1     0      2      1     0     0       1       0       0        1        1        0        0         0      0      0      0
DCP_1PEO     0     0      0      3     0     0       0       0       0        0        0        0        0         0      0      0      0
DCP_1PKK     0     0      0      0     1     0       0       0       1        1        0        0        1         0      0      0      0
DCP_5KTQ     0     1      0      4     0     0       0       0       0        0        0        0        0         0      0      0      0
DCZ_1P60     0     0      3      0     2     1       0       1       2        3        0        1        1         0      0      0      1
DOC_1KDT     2     0      0      4     2     1       2       1       2        5        1        1        1         1      1      1      1
GEO_1P62     0     0      2      0     2     1       0       1       2        3        0        1        1         0      0      0      1
GPC_1RDS     1     0      0      4     3     2       0       0       1        1        0        0        1         0      0      0      0
MCN_1DGJ     1     2      0      7     3     2       1       1       2        4        1        1        1         1      1      1      1
MCN_1N62     1     2      3      6     3     2       1       1       2        4        1        1        1         1      1      1      1
NCC_1QWJ     3     0      1      0     3     1       2       1       2        5        1        1        1         1      1      1      1
PCD_1FFV     1     3      3     10     4     1       1       1       2        4        1        1        1         1      1      1      1
PCD_1VLB     1     2      1      8     3     1       1       1       2        4        1        1        1         1      1      1      1
Sum         50    21     95    161    67    40      47      33      57      137       34       33       37        23     27     29     29
Mean         1  0.42    1.9   3.22  1.34   0.8    0.94    0.66    1.14     2.74     0.68     0.66     0.74      0.46   0.54   0.58   0.58
1) Hydrogen bonds involving the respective atom types:
oxygen-atoms with an sp2-hybridization (O.2);
oxygen-atoms with an sp3-hybridization (O.3);
oxygen-atoms in a sugar-substructure (O.sug);
oxygen-atoms in a phosphate-substructure (O.po3);
nitrogen-atoms with a sp2 hybridization (N.2);
nitrogen-atoms involved in an aromatic bond (N.ar);
2) Specific cytosine sites: O2 (Cyt_O2), N3 (Cyt_N3) and the two hydrogen atoms of the N4 nitrogen (Cyt_N4),
3) Secondary measurements:
the sum of possible hydrogen bonds for the cytosine substructure (Cyt_Sum);
binary measurements indicating the presence of a hydrogen bond (1) or not (0) for the cytosine sites
(bCyt_O2, bCyt_N3, bCyt_N4);
all the sites at once (bCyt_All);
all pair combinations (O2/N3, O2/N4, N3/N4).
TABLE 2b. Analysis of intermolecular interactions in complexes with cytosine-containing
ligands; Cytosine-Protein motifs.
COMPLEX
1MC_1BKY
AR3_1P5Z
C2G_1N1D
C2P_1ROB
C3P_1RPF
C5P_1H7F
C5P_1IV4
C5P_1LP6
C5P_1QF9
C5P_1UJ2
CAR_1KDR
CDC_1JYL
CDF_1GX1
CDM_1INI
CDM_1OJ4
CDP_1EYR
CDP_1FFU
CDP_1H7H
CDP_1IV2
CDP_1XJN
CDP_2AZ3
CDP_2CMK
CG2_1OJ1
CMK_1GQC
CPA_1RPG
CSF_1RO7
CTN_1UEJ
CTP_1COZ
CTP_1H7G
CTP_1I52
CTP_1KFD
CTP_1MIY
CTP_1TUG
CTP_1UDW
CTP_1UEU
CTP_2AD5
DCM_1B5E
DCM_1NJE
DCP_1PEO
DCP_1PKK
DCP_5KTQ
DCZ_1P60
DOC_1KDT
GEO_1P62
GPC_1RDS
MCN_1DGJ
MCN_1N62
NCC_1QWJ
PCD_1FFV
PCD_1VLB
Sum
Mean
Nr of motifs =88
MOTIF
O2
N3
N
N
c
N
N
N
N
c
c
c
c
c
c
N
c
N
N
N
N
c
N
c
N
N
c
N
N
N
N
c
c
c
c
c
c
c
c
N
N
N
c
c
N
N
c
c
c
N
c
N
N
c
c
c
c
N
N
N
N
N
N
c
c
N
N
N
N
c
c
N
N
c
c
c
N
N
c
c
c
c
N
N
c
c
N
c
c
N
N
N
N
N
N
c
c
c
c
c
c
c
c
N
N
c
N
N
c
c
c
c
c
N
c
c
c
c
c
c
c
c
N
N
N
N
c
c
c
c
N
N
N
N
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
N4
0
0
0
0
0
0
0
0
-1
-1
0
0
0
0
0
0
0
0
0
0
0
-1
-1
0
0
0
0
0
-1
-1
0
0
0
-1
-1
0
0
0
0
0
0
0
0
0
0
0
2
1
0
0
0
0
0
0
0
0
0
0
0
0
-1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
-1
-1
-1
-1
0
0
0
0
-1
-1
-1
-1
c
c
O
O
O
O
O
O
c
c
c
c
c
c
c
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
c
c
c
c
c
c
O
O
O
c
O
O
O
O
O
c
O
O
c
c
c
c
c
c
c
c
c
c
c
c
O
O
O
O
O
O
O
O
O
O
O
O
O
0
0
0
0
3
0
0
-6
-6
-3
0
25
0
0
0
0
0
0
0
72
-5
-6
-3
67
68
0
68
9
-3
68
-6
-5
-2
-6
-3
2
0
0
0
0
0
0
0
0
-6
0
7
6
0
0
3
-6
67
68
0
0
-48
29
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
-3
65
-3
68
63
69
4
10
-3
68
-3
65
bond
O2
chain
bb
bond
N3
chain
bb
bond
N4
chain
bb
0
0
0
0
0
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
0
1
1
1
1
1
0
0
1
1
1
0
0
1
1
1
0
1
1
1
1
0
0
0
0
0
1
1
1
1
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
67
0.761
0
0
0
0
0
0
0
1
0
0
0
0
0
1
1
1
1
1
1
0
1
0
0
0
0
1
0
1
0
0
1
0
0
0
0
1
0
1
1
1
1
1
1
0
1
0
0
0
1
0
0
1
0
0
0
0
1
1
1
0
0
1
0
0
0
0
0
0
0
1
1
1
1
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
31
0.352
0
0
0
0
0
1
1
0
1
1
0
1
1
0
0
0
0
0
0
1
0
1
1
1
1
0
1
0
1
1
0
1
1
1
1
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
1
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
1
1
1
1
36
0.409
0
1
1
1
1
1
1
1
1
1
0
0
0
1
1
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
0
1
0
0
1
0
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
65
0.739
0
1
1
0
0
1
1
1
0
0
0
0
0
1
1
0
0
0
0
1
1
0
0
1
1
0
1
1
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
0
0
1
0
0
1
1
1
0
1
0
0
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
1
1
1
1
0
0
0
0
39
0.443
0
0
0
1
1
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
1
0
0
1
1
0
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
1
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
0
0
0
0
1
1
1
1
26
0.295
0
1
1
1
1
0
0
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
0
1
0
1
1
1
1
1
1
1
1
0
1
1
1
1
0
0
1
1
0
0
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
75
0.852
0
1
1
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
1
0
0
1
1
0
0
1
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
29
0.330
0
0
0
1
1
0
0
1
1
1
0
1
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
0
1
1
0
1
1
1
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
46
0.523
O2/N3
bb
O2/N4
bb
N3/N4
bb
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
1
1
0
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
0
0
0
0
1
1
1
1
21
0.237
0
0
0
0
0
0
0
0
1
1
0
1
0
0
0
0
0
0
0
1
0
1
1
1
1
0
1
0
1
1
0
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
1
1
1
1
29
0.330
0
0
0
1
1
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
1
0
0
1
1
0
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
0
0
0
0
1
1
1
1
25
0.284
1) Atoms binding to the cytosine sites are shown under the respective site, together with the index
when the site forms a hydrogen bond with the protein backbone.
2) A bond with a protein side chain is marked with “c”, and the absence of a bond is marked with “-”.
Both of these special cases receive the index 0.
3) Binary measurements indicate a bond (1) or no bond (0) for any type of bond (bond), a bond with a
side chain (chain) and a bond with the protein backbone (bb).
4) Binary measurements indicating combined backbone binding for pairs of sites are marked “O2/N3”,
“O2/N4” and “N3/N4”.
5) Mean values are taken over all 88 motifs found.
TABLE 2c. Analysis of intermolecular interactions in complexes with cytosine-containing
ligands; Occurrence of Cytosine-Protein motifs.
MOTIF
count
14
8
8
7
5
3
3
3
3
3
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
COUNT,
O2 atom
c
N
c
N
c
N
N
N
N
N
N
N
N
c
c
c
c
N
N
c
N
N
c
c
N
N
N
N
c
c
ALL
O2 aai
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
N3 atom
c
N
c
c
c
N
c
N
N
N
c
N
c
c
N
c
N
N
N
N
N
c
c
c
c
N3 aai
0
0
-1
0
0
0
0
-1
0
-1
0
0
0
0
-1
0
0
0
0
0
0
0
0
0
2
1
0
0
-1
0
0
0
0
0
0
0
N4 atom
c
O
c
c
O
O
O
O
O
O
c
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
c
O
O
O
O
O
N4 aai
0
0
-3
0
0
0
-6
-6
68
68
0
3
0
67
65
25
72
-5
0
9
-6
-5
-2
2
7
6
-48
29
0
0
0
0
63
69
4
10
MOTIF
count
21
17
11
10
9
7
4
4
2
1
1
1
COUNT,
O2 atom
c
N
N
c
N
N
c
N
N
O2,
O2 aai
0
0
0
0
0
0
0
0
0
0
0
0
N3
N3 atom
c
N
c
c
N
N
N
N
N
N3 aai
0
-1
0
0
0
0
0
0
0
0
2
1
MOTIF
count
19
8
8
8
6
5
5
4
3
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
COUNT,
O2 atom
c
N
N
N
c
N
N
N
N
N
N
c
c
c
N
N
c
N
N
c
c
N
N
c
c
O2,
O2 aai
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
N4
N4 atom
c
c
O
O
O
O
O
c
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
N4 aai
0
0
0
-3
68
0
0
-6
-6
0
67
65
25
72
-5
0
9
-5
-2
2
7
6
-48
29
63
69
4
10
MOTIF
count
21
9
9
8
6
5
4
4
3
3
3
2
2
2
2
2
2
1
COUNT,
N3 atom
c
N
c
N
c
N
c
N
N
c
N
N
c
c
N
N3,
N3 aai
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
N4
N4 atom
N4 aai
c
O
c
O
O
O
O
O
O
O
O
O
O
O
O
-
1) Under each list heading, the number of motifs found (count) is shown together with the atom bound
at each site (O2 atom, N3 atom and N4 atom) and the corresponding index (O2 aai, N3 aai and N4 aai).
2) The O2, N3, N4 (N,0; N,-1; O,-3) 3-site cytosine-backbone motif (CM1) is the third-ranking motif
among the 3-site motifs.
Another goal was to investigate whether hydrogen-bond motifs exist between the cytosine and the
protein backbone, using an approach similar to that of a number of analyses performed on
adenine [1,7]. We investigated the hydrogen-bond interactions between the cytosine moiety and the
protein, focusing on the amino-acid sequence index and the protein atom types for backbone
contacts. The results of the analysis are presented in Tables 2b and 2c. All combinations of
hydrogen bonds between the O2, N3 and N4 sites of the cytosine moiety and the protein were
considered, and each combination was set to represent a motif. More than one hydrogen bond at a
single site therefore resulted in more than one motif for the complex in question. Since no
distinction is shown between different atoms of the protein side chains, some complexes appear to
have identical motifs in the results; these are in fact unique motifs, distinguished by the
different protein atoms participating in the bonds. In addition to the list of 3-site motifs
ordered by occurrence, Table 2c presents lists of 2-site motifs sorted by occurrence for all
combinations of two hydrogen-bonding sites of the cytosine.
The O2 and N3 sites are involved in a similar number of the 88 motifs: 67 (76.1%) and
65 (73.9%) respectively. The N4 site is involved in 75 (85.2%) motifs, again demonstrating a
greater tendency to form hydrogen bonds. Among motifs involving amino-acid side chains, the
N3 site interacts with side chains in a larger number of motifs than the other two sites.
Ranked by involvement in motifs with the protein backbone, the three sites are: N4 with 46
(52.3%), O2 with 36 (40.9%) and N3 with 26 (29.5%). The 2-site combinations, ranked by their
involvement in interactions with the protein backbone, are: O2, N4 with 29 (33.0%); N3, N4 with
25 (28.4%); and O2, N3 with 21 (23.9%).
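The percentages quoted above are motif counts divided by the 88 motifs found. A minimal sketch of that arithmetic follows; the counts are those reported in Table 2b, while the dictionary and function names are illustrative and not part of the original analysis.

```python
# Counts as reported in Table 2b; 88 motifs were found in total.
TOTAL_MOTIFS = 88

any_bond = {"O2": 67, "N3": 65, "N4": 75}   # site involved in any hydrogen bond
backbone = {"O2": 36, "N3": 26, "N4": 46}   # site bonded to the protein backbone
backbone_pairs = {("O2", "N3"): 21, ("O2", "N4"): 29, ("N3", "N4"): 25}


def percentages(counts, total=TOTAL_MOTIFS):
    """Return each count as a percentage of the total, rounded to one decimal."""
    return {key: round(100.0 * n / total, 1) for key, n in counts.items()}


def ranked(counts):
    """Return the keys ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


if __name__ == "__main__":
    print(percentages(any_bond))
    print(ranked(backbone))
    print(ranked(backbone_pairs))
```

Keeping the counts and the total separate makes it easy to recheck a percentage whenever a count in the table is revised.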
N4 aai (Table 2c, N3, N4 2-site motifs, continued): 0, 0, -2, 0, 0, -6, -5, -5, 0, 0, 69, 3, 9, 5, 66, 4, 10, 0
Considering the motif counts, the O2, N3, N4 (N,0; N,-1; O,-3) 3-site cytosine-backbone motif
(CM1) clearly outnumbers the others, with 8 occurrences among the total of 88 counted motifs.
This motif is illustrated in Figure 11. The 2-site motifs that form part of this 3-site motif
are O2, N3 (N,0; N,-1) with 17 occurrences, O2, N4 (N,0; O,-3) with 8 occurrences and N3, N4
(N,0; O,-2) with 9 occurrences. It is interesting to compare this motif tendency with the
adenine motifs described by Denessiouk et al. in 2001 [1] and by Mao and Wang et al. in
2004 [7]. The N3 and N4 sites in CM1 show a binding pattern similar to that of the N1 and N6
sites in the “direct” adenine motif of Denessiouk and the tri-residue motif of Mao and Wang.
For cytosine binding to the protein backbone, the O2, N4 site combination tends to participate
in backbone bonds more often than the N3, N4 combination. Also, the O2, N4 2-site motif is the
most frequently recurring among all backbone-binding motifs found.
While we did find cytosine-backbone 2-site motifs corresponding to the “reversed” or
mono-residue motifs of adenine, they were infrequent; only 3 motifs containing this pattern were
found. It might be more interesting to look for “reversed” cytosine-backbone motifs involving
all three binding sites. Our results do contain faint traces of such motifs: the O2 and N3 sites
hydrogen bond to the same amino-acid residue in 2 motifs, and in 2 other motifs the amino-acid
index increases in the direction from O2 to N4. We conclude that, at least in our Dataset 1,
such reversed cytosine-protein backbone motifs are rare. While it would be interesting to
further investigate ligand-protein motifs that include parts other than the cytosine moiety,
for example the sugar moieties or phosphate groups, and/or amino-acid side chains, we did not
prioritize this in the present analysis.
The second type of intermolecular interaction investigated was the aromatic pi-pi stacking
interaction. For this purpose we used the geometrical scoring methodology of the scoring
component c_stack (see Methods), but with a separate scoring term for every stacking
interaction between the cytosine moiety and the amino acids arginine, histidine,
phenylalanine, tryptophan and tyrosine. Scoring terms above zero were treated as detected
stacking interactions. The stacking interactions in Dataset 1 were counted in this manner, and
the results are displayed in Table 2d. With 28 stacking interactions in total across 50
complexes, we conclude that stacking interactions are indeed an important molecular
determinant for our dataset. Phenylalanine had the largest number of stacking interactions
with cytosine (12), and tryptophan the smallest (0). From these results we concluded that the
scoring component c_stack needed no changes compared to how it was constructed for scoring
adenine.
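The counting rule described above (a scoring term above zero counts as one detected stacking interaction) can be sketched as follows. The per-complex scoring terms in `example_terms` are hypothetical values chosen for illustration; only the column sums are taken from Table 2d, and the function name is an assumption, not the original implementation.

```python
N_COMPLEXES = 50
STACKING_RESIDUES = ("ARG", "HIS", "PHE", "TRP", "TYR")


def count_stacking(terms_by_residue):
    """Count detections per residue type: each scoring term above zero is one contact."""
    return {
        res: sum(1 for term in terms_by_residue.get(res, ()) if term > 0)
        for res in STACKING_RESIDUES
    }


# Hypothetical scoring terms for a single complex (illustration only).
example_terms = {"PHE": [0.8, 0.0], "TYR": [0.3], "ARG": [0.0]}

# Column sums as reported in Table 2d.
table_2d_sums = {"ARG": 9, "HIS": 1, "PHE": 12, "TRP": 0, "TYR": 6}
total_stacking = sum(table_2d_sums.values())
mean_per_complex = total_stacking / N_COMPLEXES

if __name__ == "__main__":
    print(count_stacking(example_terms))
    print(total_stacking, mean_per_complex)
```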
TABLE 2d. Analysis of intermolecular interactions in complexes with cytosine-containing
ligands; Stacking-interactions
COMPLEX    #ARG  #HIS  #PHE  #TRP  #TYR  Tot
1MC_1BKY   0     0     1     0     1     2
AR3_1P5Z   0     0     2     0     0     2
C2G_1N1D   1     0     0     0     0     1
C2P_1ROB   0     0     0     0     0     0
C3P_1RPF   0     0     0     0     0     0
C5P_1H7F   0     0     0     0     0     0
C5P_1IV4   0     0     0     0     0     0
C5P_1LP6   0     0     0     0     0     0
C5P_1QF9   0     0     0     0     0     0
C5P_1UJ2   0     0     1     0     0     1
CAR_1KDR   1     0     0     0     1     2
CDC_1JYL   0     0     0     0     0     0
CDF_1GX1   0     0     0     0     0     0
CDM_1INI   0     0     0     0     0     0
CDM_1OJ4   0     0     1     0     1     2
CDP_1EYR   1     0     0     0     0     1
CDP_1FFU   0     0     0     0     0     0
CDP_1H7H   0     0     0     0     0     0
CDP_1IV2   0     0     0     0     0     0
CDP_1XJN   1     0     0     0     0     1
CDP_2AZ3   0     0     1     0     0     1
CDP_2CMK   1     0     0     0     1     2
CG2_1OJ1   0     0     0     0     0     0
CMK_1GQC   0     0     0     0     0     0
CPA_1RPG   0     0     0     0     0     0
CSF_1RO7   0     0     0     0     1     1
CTN_1UEJ   0     0     1     0     0     1
CTP_1COZ   1     0     0     0     0     1
CTP_1H7G   1     0     0     0     0     1
CTP_1I52   0     0     0     0     0     0
CTP_1KFD   0     0     0     0     0     0
CTP_1MIY   1     0     0     0     0     1
CTP_1TUG   0     0     0     0     0     0
CTP_1UDW   0     0     1     0     0     1
CTP_1UEU   0     0     0     0     0     0
CTP_2AD5   0     0     0     0     0     0
DCM_1B5E   0     0     0     0     0     0
DCM_1NJE   0     0     0     0     0     0
DCP_1PEO   0     0     0     0     0     0
DCP_1PKK   0     0     0     0     0     0
DCP_5KTQ   0     0     1     0     0     1
DCZ_1P60   0     0     2     0     0     2
DOC_1KDT   1     0     0     0     1     2
GEO_1P62   0     0     1     0     0     1
GPC_1RDS   0     1     0     0     0     1
MCN_1DGJ   0     0     0     0     0     0
MCN_1N62   0     0     0     0     0     0
NCC_1QWJ   0     0     0     0     0     0
PCD_1FFV   0     0     0     0     0     0
PCD_1VLB   0     0     0     0     0     0
Sum        9     1     12    0     6     28
Mean       0.18  0.02  0.24  0     0.12  0.56
1) The numbers represent the presence of stacking contacts with amino acids denoted with the 3-letter code
(#ARG, #HIS, #PHE, #TRP and #TYR).
2) Tot means the total number of stacking interactions in the investigated complex.
Figure 12. Illustration of the function f_A (see text). The value of the function is shown in
the respective cell. The horizontal axis is the MHP-sum offset for the ligand and the vertical
axis is the MHP-sum offset for the protein. A local maximum is located around an MHP-sum offset
of 0.25 for the ligand and 0.8 for the protein.
TABLE 2e. Analysis of intermolecular interactions in complexes with cytosine; MHP
COMPLEX    Aphob  Aphob range  Aphil  Aphil range  Abur   Abur range
1MC_1BKY   0.377  [0.2,0.4[    0.039  [0,0.2[      0.738  [0.6,0.8[
AR3_1P5Z   0.476  [0.4,0.6[    0.045  [0,0.2[      0.991  [0.8,1.0]
C2G_1N1D   0.009  [0,0.2[      0.068  [0,0.2[      0.993  [0.8,1.0]
C2P_1ROB   0.097  [0,0.2[      0.019  [0,0.2[      0.831  [0.8,1.0]
C3P_1RPF   0.143  [0,0.2[      0.026  [0,0.2[      0.721  [0.6,0.8[
C5P_1H7F   0      [0,0.2[      0.125  [0,0.2[      0.832  [0.8,1.0]
C5P_1IV4   0.342  [0.2,0.4[    0.057  [0,0.2[      0.95   [0.8,1.0]
C5P_1LP6   0.321  [0.2,0.4[    0.026  [0,0.2[      0.643  [0.6,0.8[
C5P_1QF9   0.672  [0.6,0.8[    0.076  [0,0.2[      1      [0.8,1.0]
C5P_1UJ2   0.612  [0.6,0.8[    0.005  [0,0.2[      1      [0.8,1.0]
CAR_1KDR   0.005  [0,0.2[      0.156  [0,0.2[      1      [0.8,1.0]
CDC_1JYL   0.036  [0,0.2[      0.087  [0,0.2[      0.896  [0.8,1.0]
CDF_1GX1   0.396  [0.2,0.4[    0.019  [0,0.2[      0.967  [0.8,1.0]
CDM_1INI   0      [0,0.2[      0.093  [0,0.2[      0.929  [0.8,1.0]
CDM_1OJ4   0.811  [0.8,1.0]    0      [0,0.2[      0.997  [0.8,1.0]
CDP_1EYR   0.002  [0,0.2[      0.119  [0,0.2[      0.827  [0.8,1.0]
CDP_1FFU   0.689  [0.6,0.8[    0.051  [0,0.2[      1      [0.8,1.0]
CDP_1H7H   0      [0,0.2[      0.14   [0,0.2[      0.829  [0.8,1.0]
CDP_1IV2   0.382  [0.2,0.4[    0.041  [0,0.2[      0.969  [0.8,1.0]
CDP_1XJN   0.02   [0,0.2[      0.08   [0,0.2[      0.639  [0.6,0.8[
CDP_2AZ3   0.397  [0.2,0.4[    0.009  [0,0.2[      0.735  [0.6,0.8[
CDP_2CMK   0.02   [0,0.2[      0.095  [0,0.2[      0.998  [0.8,1.0]
CG2_1OJ1   0      [0,0.2[      0      [0,0.2[      0.247  [0.2,0.4[
CMK_1GQC   0      [0,0.2[      0.039  [0,0.2[      0.762  [0.6,0.8[
CPA_1RPG   0.111  [0,0.2[      0.012  [0,0.2[      0.765  [0.6,0.8[
CSF_1RO7   0.291  [0.2,0.4[    0.03   [0,0.2[      1      [0.8,1.0]
CTN_1UEJ   0.586  [0.4,0.6[    0.02   [0,0.2[      1      [0.8,1.0]
CTP_1COZ   0.022  [0,0.2[      0.334  [0.2,0.4[    0.994  [0.8,1.0]
CTP_1H7G   0      [0,0.2[      0.15   [0,0.2[      0.869  [0.8,1.0]
CTP_1I52   0.002  [0,0.2[      0.165  [0,0.2[      0.849  [0.8,1.0]
CTP_1KFD   0.239  [0.2,0.4[    0      [0,0.2[      0.746  [0.6,0.8[
CTP_1MIY   0      [0,0.2[      0.109  [0,0.2[      0.727  [0.6,0.8[
CTP_1TUG   0.476  [0.4,0.6[    0.018  [0,0.2[      0.918  [0.8,1.0]
CTP_1UDW   0.554  [0.4,0.6[    0.007  [0,0.2[      0.991  [0.8,1.0]
CTP_1UEU   0.209  [0.2,0.4[    0.159  [0,0.2[      0.75   [0.6,0.8[
CTP_2AD5   0.004  [0,0.2[      0.051  [0,0.2[      0.543  [0.4,0.6[
DCM_1B5E   0.289  [0.2,0.4[    0.014  [0,0.2[      0.82   [0.8,1.0]
DCM_1NJE   0.089  [0,0.2[      0.026  [0,0.2[      0.591  [0.4,0.6[
DCP_1PEO   0.239  [0.2,0.4[    0      [0,0.2[      0.463  [0.4,0.6[
DCP_1PKK   0.345  [0.2,0.4[    0.019  [0,0.2[      0.707  [0.6,0.8[
DCP_5KTQ   0.189  [0,0.2[      0.016  [0,0.2[      0.334  [0.2,0.4[
DCZ_1P60   0.508  [0.4,0.6[    0.021  [0,0.2[      0.998  [0.8,1.0]
DOC_1KDT   0.036  [0,0.2[      0.084  [0,0.2[      1      [0.8,1.0]
GEO_1P62   0.47   [0.4,0.6[    0.014  [0,0.2[      0.998  [0.8,1.0]
GPC_1RDS   0      [0,0.2[      0.027  [0,0.2[      0.627  [0.6,0.8[
MCN_1DGJ   0.715  [0.6,0.8[    0.017  [0,0.2[      1      [0.8,1.0]
MCN_1N62   0.616  [0.6,0.8[    0.04   [0,0.2[      1      [0.8,1.0]
NCC_1QWJ   0.087  [0,0.2[      0.051  [0,0.2[      0.926  [0.8,1.0]
PCD_1FFV   0.745  [0.6,0.8[    0.04   [0,0.2[      1      [0.8,1.0]
PCD_1VLB   0.547  [0.4,0.6[    0.027  [0,0.2[      1      [0.8,1.0]
Sum        13.176              2.866               42.11
Mean       0.264               0.057               0.842

Number of complexes per range (fraction of the 50 complexes in parentheses):
Range       Aphob      Aphil      Abur
[0,0.2[     24 (0.48)  49 (0.98)  0 (0)
[0.2,0.4[   12 (0.24)  1 (0.02)   2 (0.04)
[0.4,0.6[   7 (0.14)   0 (0)      3 (0.06)
[0.6,0.8[   6 (0.12)   0 (0)      12 (0.24)
[0.8,1.0]   1 (0.02)   0 (0)      33 (0.66)

1) fraction of the total area being hydrophobic complementary (Aphob);
2) hydrophilic complementary (Aphil);
3) buried area (Abur).
Two analyses were performed to investigate the MHP1 approach to scoring in the case
of cytosine moieties.
The first analysis covered the hydrophobic complementary area, the hydrophilic complementary
area and the total buried area of the ligand in the protein binding pocket for all complexes in
our dataset, and is presented in Table 2e. Areas are shown as fractions of the complete ligand
area. In addition, we show which fraction range, in steps of 0.2, each area falls into. The
hydrophobic complementary area is the area of the ligand-protein contact that has hydrophobic
properties as measured through the MHP formalism. The hydrophilic complementary area is the
corresponding area with hydrophilic properties. The buried area is the sum of these two area
types plus the mismatching ligand-protein contact areas, where hydrophilic and hydrophobic
areas are in contact with each other.
The results of the first MHP analysis revealed that the hydrophobic complementary area is
generally larger than the hydrophilic complementary area for our dataset. The mean hydrophobic
complementary area was 0.264 of the total ligand area and the mean hydrophilic complementary
area was 0.057 of the total ligand area. The largest group of complexes (24 of 50) has a
hydrophobic complementary area below 0.2 times the ligand area, with progressively fewer
complexes at larger values. The highest hydrophobic complementary area measured was 0.811
times the total ligand area, for the complex CDM_1OJ4. All hydrophilic complementary areas but
one were found in the range below 0.2 times the ligand area. Most ligands are largely buried,
with a mean buried area of 0.842 times the ligand area. Most complexes (33 of 50) have a buried
area above 0.8, and the least buried complex has a buried area of 0.247 times the ligand area.
In this complex, CG2_1OJ1, we did not register any complementary hydrophobic or hydrophilic
areas. To learn more about the different area types in the complexes of our dataset, we wanted
to analyse the remaining area types together with those covered in the first analysis. We
therefore constructed a second, more informative analysis of all the available area types of
the MHP concept.
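The assignment of an area fraction to one of the five ranges used in Table 2e can be sketched as below. This is a minimal illustration: the function names are assumptions, and the three sample values are Aphob entries from Table 2e. Explicit bin edges with `bisect_right` are used instead of dividing by the bin width, because floating-point division (for example 0.6 / 0.2) can land just below an integer and put a boundary value in the wrong range.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges of the first four ranges; the last range [0.8,1.0] is closed.
EDGES = [0.2, 0.4, 0.6, 0.8]
LABELS = ["[0,0.2[", "[0.2,0.4[", "[0.4,0.6[", "[0.6,0.8[", "[0.8,1.0]"]


def range_label(fraction):
    """Return the 0.2-wide range that an area fraction in [0, 1] falls into."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"fraction must lie in [0, 1], got {fraction}")
    return LABELS[bisect_right(EDGES, fraction)]


def histogram(fractions):
    """Count how many fractions fall into each range, in LABELS order."""
    counts = Counter(range_label(f) for f in fractions)
    return [counts[label] for label in LABELS]


if __name__ == "__main__":
    # Three Aphob values from Table 2e: CDM_1OJ4, 1MC_1BKY, C5P_1H7F.
    sample = [0.811, 0.377, 0.0]
    print([range_label(f) for f in sample])
    print(histogram(sample))
```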
In the second analysis we investigated how the different areas of the model vary when the sum
offset is changed for the ligand and the protein respectively. This was done in a similar
fashion to the investigation of the MHP1 model for the adenine moiety in our earlier study [4].
By shifting the sum offset (see the MHP1 description in Methods) we could change which sum
contributions are considered hydrophobic and which hydrophilic, since this is determined by
whether a contribution to the sum is above or below zero. Since different parts of the protein
and the ligand vary in hydrophobicity and hydrophilicity in terms of hydrophobicity constants,
this changes which parts of the complex are considered hydrophobic and which hydrophilic. The
need to distinguish between hydrophobic and hydrophilic regions follows from the concept of
hydrophobicity/hydrophilicity itself. We were interested in how this definition affects the
complementarity of areas where the ligand and the protein, as well as the ligand and the
solvent, are in contact with each other. More specifically, one of our goals was to find values
of this offset that maximize matching areas (hydrophobic-hydrophobic, hydrophilic-hydrophilic)
and minimize mismatching areas (hydrophobic-hydrophilic), in a way that could be useful in
creating or modifying a score. Complexes were selected for this analysis according to how large
a part of the ligand is represented by the cytosine moiety, the larger the better. The MHP-sum
offset was varied separately for the ligand and the protein to cover a wide spectrum of
possible complementarity situations. Results were obtained as 2-dimensional matrices whose
dimensions are the shifts in MHP-sum offset for the ligand and the protein, respectively. Our
selected complexes were:
1MC_1BKY
AR3_1P5Z
C5P_1IV4
C5P_1LP6
C5P_1QF9
C5P_1UJ2
CDF_1GX1
CDM_1OJ4
CDP_1FFU
CDP_1IV2
CDP_2AZ3
CTN_1UEJ
CTP_1TUG
CTP_1UDW
DCM_1B5E
DCP_1PKK
DCZ_1P60
GEO_1P62
MCN_1DGJ
MCN_1N62
PCD_1FFV
PCD_1VLB
When examining a function of the hydrophobic and hydrophilic complementary areas between
ligand and protein, as well as the hydrophilic complementary area between ligand and
(hydrophilic) solvent, we observed a slight local maximum at an offset of approximately 0.25
for the ligand and 0.8 for the protein. The function of areas used in this case is described
by:
f A A1 ( lig , prot ) A4 ( lig , prot ) A4 ( lig , prot )
In this function, A1, A4 and A4' are the normalized area functions of the two MHP-sum offsets
for the ligand (lig) and the protein (prot), where A1 is the hydrophobic complementary area
between ligand and protein, A4 is the hydrophilic complementary area between ligand and protein
and A4' is the hydrophilic complementary area between ligand and hydrophilic solvent (Fig. 12).
All three area functions have been normalized by the total area sum over all measurements for
the respective area. While this illustrates an inherent problem with the
hydrophobic/hydrophilic property system and also a way to deal with it, we judged that changing
the MHP-sum offset from its default value would have a near-negligible effect on scoring
complexes with cytosine-containing ligands. In a similar analysis made on adenine, a change in
the offset produced a more recognizable effect [4]. Our result for the cytosine moiety could
mean that there are generally no large overlaps in the MHP-sum of the ligand and the protein on
the surface between the two, or that some sort of symmetry cancels out any effect of the
offset. Further investigation is required to determine the real cause, which might include
investigating MHP behaviour in individual complexes. Nevertheless, we deemed this result
promising for the usefulness of the MHP concept in scoring cytosine structures in molecular
docking: if there are generally no large parts of the surface between the cytosine moiety and
the binding pocket where the MHP-sums overlap, the need for calibrating the MHP-sum offset is
largely eliminated.
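The offset scan described above amounts to evaluating an objective on a 2-dimensional grid of ligand and protein offsets and locating its maximum. The sketch below illustrates only that grid search. The function `toy_objective` is a hypothetical stand-in with a peak placed at (0.25, 0.8) for demonstration; it is not the area function f_A, whose values come from the MHP surface calculations, and the offset grids are likewise assumed.

```python
def scan_offsets(objective, lig_offsets, prot_offsets):
    """Evaluate objective(lig, prot) on a grid and return the grid and its maximum.

    Rows of the grid follow prot_offsets, columns follow lig_offsets.
    """
    grid = [[objective(lig, prot) for lig in lig_offsets] for prot in prot_offsets]
    best_value, best_lig, best_prot = max(
        (
            (grid[i][j], lig, prot)
            for i, prot in enumerate(prot_offsets)
            for j, lig in enumerate(lig_offsets)
        ),
        key=lambda item: item[0],
    )
    return grid, (best_lig, best_prot, best_value)


def toy_objective(lig, prot):
    """Hypothetical smooth bump used only to exercise the scan (not the real f_A)."""
    return -((lig - 0.25) ** 2 + (prot - 0.8) ** 2)


if __name__ == "__main__":
    lig_offsets = [0.0, 0.25, 0.5]    # assumed grid for the ligand offset
    prot_offsets = [0.6, 0.8, 1.0]    # assumed grid for the protein offset
    grid, best = scan_offsets(toy_objective, lig_offsets, prot_offsets)
    print(best[:2])
```

A flat grid, as reported for cytosine, would show up here as near-identical values in every cell, so the location of the maximum would carry little meaning.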
Figure 12. Scheme illustrating classification of ligand-protein surface contacts in terms of
MHP values: (1) MHP_LL, (2) MHP_LH, (3) MHP_HL, (4) MHP_HH, (4') MHP_HW. Here, subscripts
denote: L, lipophilic (brown); H, hydrophilic (blue); W, water.
3.2. Investigating methods to score protein-ligand conformers
As mentioned above, and it is worth repeating, molecular docking is a complex task, and scoring
is part of that task. Even when a scoring approach seems to get the job done, it is not always
simple to determine why it is successful and whether it is robust in other cases. The usual
argument for a scoring approach is a successful validation on a representative test set.
However, it is essential to also relate the results to the methods and data, since they are
directly connected: a successful validation should be interpreted through the data and methods
used. This might be especially important in cases such as ours, where only a limited dataset
was available. Starting from our previous work on adenine [4] and the score used there, we have
attempted to develop those scoring approaches further, together with new techniques, and tested
their performance on protein-ligand complexes with cytosine-containing ligands. The score
applied to adenine in the earlier work is a linear combination of scoring terms, constructed as
follows:
\[
f_{score,adenine} = w_0 + w_1 T_{hbond,N6} + w_2 T_{hbond,N1,N3,N7}
  + w_3 T_{hbond,N6} T_{hbond,N1,N3,N7}
  + w_4 T_{stacking,phe,tyr,trp,his} + w_5 T_{stacking,arg} + w_6 T_{MHP1}
\]
\[
T_{hbond,N6} = \sum_{acceptors} B_{hbond,acceptor:N6}
\]
\[
T_{hbond,N1,N3,N7} = \sum_{donors} B_{hbond,N1:donor} + \sum_{donors} B_{hbond,N3:donor}
  + \sum_{donors} B_{hbond,N7:donor}
\]
These are terms describing hydrogen-bond interactions using the standard enumeration
of adenine-atoms N1, N3, N6 and N7 and the hydrogen-bond scoring methodology described
in Methods. The other terms describe stacking and hydrophobic (using MHP) interactions,
and are the same as the terms with the same names that we have used in our new scores (see
Methods). The rationale for that will be explained further on.
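As an illustration of how such a linear combination is evaluated, the sketch below computes the adenine score from already-calculated term values. It is a sketch under assumptions: the class and field names are invented for this example, the weights and term values are placeholders rather than fitted or measured values, and the calculation of the individual terms (described in Methods) is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class AdenineTerms:
    """Pre-computed scoring terms for one conformer (field names are illustrative)."""
    hbond_n6: float        # T_hbond,N6
    hbond_ring: float      # T_hbond,N1,N3,N7
    stack_aromatic: float  # T_stacking,phe,tyr,trp,his
    stack_arg: float       # T_stacking,arg
    mhp1: float            # T_MHP1


def adenine_score(terms, weights):
    """Linear combination w0 + w1*T1 + ... + w6*T6, with a product term at w3."""
    if len(weights) != 7:
        raise ValueError("expected seven weights w0..w6")
    w0, w1, w2, w3, w4, w5, w6 = weights
    return (
        w0
        + w1 * terms.hbond_n6
        + w2 * terms.hbond_ring
        + w3 * terms.hbond_n6 * terms.hbond_ring  # simultaneous N6 and ring bonding
        + w4 * terms.stack_aromatic
        + w5 * terms.stack_arg
        + w6 * terms.mhp1
    )


if __name__ == "__main__":
    # Placeholder values chosen only to make the arithmetic easy to follow.
    terms = AdenineTerms(hbond_n6=1.0, hbond_ring=2.0, stack_aromatic=0.0,
                         stack_arg=1.0, mhp1=0.5)
    weights = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    print(adenine_score(terms, weights))
```

The product term is the only non-additive part of the score: it contributes only when both the N6 term and the ring-atom term are nonzero.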
Our main interests were whether the scoring approaches successfully used on adenine could also
be applied to cytosine and how a score using these methods could be further improved. We were
also interested in determining the performance of each scoring approach, and combinations of
them, on our dataset, and in investigating issues concerning consensus scoring, such as
strategies for constructing consensus scores.
By ordering different scoring approaches under what we call scoring components, we created a
convenient basis on which we could later build our consensus scores by combining scoring
components into an integral scoring function. A scoring component comprises a particular
scoring method with one or more scoring terms. To reuse the scoring methods from the adenine
score, it was divided into components representing the different scoring approaches: hydrogen
bonds (c_hbond), aromatic pi-pi stacking (c_stack) and hydrophobic contact (c_mhp1). These
scoring components were slightly modified for use on the cytosine substructure but remain
largely unchanged. In the scoring component c_hbond, we added and converted terms so that they
correspond to the three molecular sites on cytosine where hydrogen bonding is possible. We kept
the double term representing simultaneous bonding of ring atoms and the N6 site of adenine,
partly as a rudiment that lets us determine the performance of the previous adenine score on
our cytosine dataset. It also had to be converted to the new molecular sites, and was set to be
a combination of the terms representing the N3 and N4 sites in cytosine, since we considered
these analogous to the corresponding sites in adenine.
Our new contributions to the ensemble of scoring components are the c_emp1 and c_motif
components. As the available structural data on protein-ligand complexes grow over time, we
found it promising to introduce a knowledge-based part in our scores that could draw on this
accumulated data and potentially become more accurate as the data continue to grow.
Knowledge-based scores that rely on the stored statistical distribution of receptor atoms
around particular ligand groups also tend to implicitly model effects that are otherwise hard
to capture using other scoring approaches [12]. We were eager to discover what effect this
would have on our scores. As a starting model for this scoring approach we created the EMP1
model, which is intuitively simple (see Methods). To translate this model into a working
scoring component we had to decide which recurring centres to use as reference centres in the
crystal structures and in the evaluated complex (a docking solution); what to use as
knowledge-based centres and, if atoms, which atoms and in combination with which reference
centre; and finally what weight to assign to the structural data used in training the
algorithm. Regarding the reference centres and knowledge-based centres, we settled for a very
simple model that we thought would statistically capture some of the most important physical
effects governing the molecular recognition of the cytosine moiety by its protein binding
pocket. Modeling hydrogen-bond effects at the three molecular sites of cytosine capable of
them, as well as pi-pi stacking contacts of the cytosine heteroaromatic ring, is of course just
one example of what could be done using the same principle. A more elaborate system of
reference centres and knowledge-based centres could plausibly increase the accuracy of a score
using the EMP1 model. Reference centres and knowledge-based centres are largely a design issue,
however; a larger problem might be how to value the data in the training procedure.
The weighting of knowledge-based data connects the score to the underlying physical effects it
models, since the weighting defines exactly what the score represents. Weighting also accounts
for the trustworthiness of the data and possibly of other parts of the model. The understanding
of a knowledge-based score as implicit and statistical, rather than based on clearer physical
models, has to be seen in relation to the method's capacity to statistically model physical
phenomena. Weighting, while sometimes ambiguous, is a useful mechanism for regulating the
influence of individual data on the score outcome. By connecting the weights of the structural
data to physical quantities, such as free energies of binding or measurements of biological
activity, it is possible to model these quantities through the knowledge-based score and, in
theory, to use the model to predict them. Weighting could possibly also be used to account for
differences and uncertainties due to varying X-ray crystallographic resolution in the observed
complexes. Since we lacked a complete set of experimental biological or physical data on free
energy changes at binding or on activity for the complexes in our dataset, we could not use
such data for weighting. One use of weighting that we initially tried was to complement our
knowledge-based data with structural data generated by docking algorithms. This provided us
with negative training data (wrong docking poses), which, complementing positive training data
(crystallographic structures and correct docking solutions), is a prerequisite for developing
an efficient scoring criterion. We believed that we could improve the accuracy of the
knowledge-based score by also training it with these docking-generated conformers, regulating
their trustworthiness through a weight tied to some measure of RMSD between the generated
ligand conformer and its crystal structure. Initial results showed that it was difficult to
determine the trustworthiness of docking-generated ligand poses from RMSD measurements relative
to the known crystal structures and to weight them accordingly. This method was therefore set
aside, so that the knowledge-based score could first be tested with trustworthy data before any
such experiments were attempted. We still believe that an improvement in accuracy can be
achieved by this means, but the method needs to be improved and investigated further.
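The RMSD measure referred to above can be sketched as follows. This is a minimal illustration assuming that both poses list the same atoms in the same order and are already expressed in the same coordinate frame (as is the case for docked poses in a fixed receptor); it does no superposition and no handling of symmetry-equivalent atoms. How an RMSD value would be turned into a training weight is not specified in the text and is deliberately left out.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("both poses must contain the same non-zero number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Hypothetical two-atom example: the second atom is displaced by 2 along z.
    crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    docked = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(crystal, docked))
```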
A subject related to weighting is the impact of training the knowledge-based scoring algorithm
with multiple data. If one complex is treated as one measurement, it is worth defining what a
measurement is meant to represent in the knowledge-based statistical model. This task is less
trivial than it might sound. Depending on what the knowledge-based model is meant to capture,
the weighting of different measurements during training must be balanced so that the resulting
knowledge-based data structure grows to represent a valid physical model. There is a risk that
the input data overlap significantly in information, which, if not properly handled, could
lead to imbalances in the knowledge-based score. A suitable relationship between the weight of
an input datum and its similarity to other data might therefore be required, depending on what
the knowledge-based model represents. In our case, since we lacked information about our
complexes that we could use to train our model, we had to revise our representation of the
model and work with something simpler. In this work we settled on a design of our
knowledge-based scoring component in which all knowledge-based data are treated equally and
assigned a weight of 1. This statistically models atom placement by treating our dataset of
structural data as a sample of the general placement of the cytosine moiety over all possible
complexes. It requires the assumption that our dataset is representative of such placement, and
this assumption might of course be more or less valid. This is especially so since our dataset
is not particularly large, and it is difficult to judge how well the accumulated PDB data on
complexes binding cytosine-containing ligands, let alone the sample from those data that makes
up our dataset, support such an assumption. The statistical model on which the knowledge-based
scoring component is built, and from which the weighting of data was to be derived, might be
one of the largest shortcomings of the c_emp1 scoring approach.
From the investigation of the intermolecular interactions in our dataset we concluded that the
traces of a recurring motif, in which the cytosine binds to the protein backbone through
hydrogen bonding, provide enough support for investigating the use of these data in a new
scoring method. For this purpose, we settled on the current design of the c_motif component.
Both the approach and the design of the motif-recognizing scoring component are simple, which
makes this approach to scoring interesting from two perspectives. First, it is always useful to
have a scoring component that is quick and easy to calculate, and any improvement this
component contributes to our scoring should be seen in relation to its calculation cost. If
proved effective, it would be an interesting advance. Second, we knew from the start that this
approach to scoring is inherently very biased, because it can only score a very limited number
of conformer cases, whereas a complete score needs to be able to evaluate cases outside this
component's capacity. This would prove to be a valuable challenge when investigating methods
to create composite scores.
3.3. Investigating strategies to combine terms of different scoring components and to
validate scores
Most methods to score conformational states of protein-ligand conformers do make
more or less serious misjudgements.9,12 Scoring methods are also biased in how they model
and capture different phenomena. The term consensus score, in its meaning of a score that
combines different scoring methods, is somewhat misleading since it can be applied to most
successful contemporary scoring functions. What is and what is not a consensus score is
more a matter of interpretation. Scoring functions generally show a characteristic of a
consensus score, that is, they combine different scoring methods into one score. This is why
we think the question of consensus scoring is important and why it is interesting to
investigate strategies for constructing composite scores. While there are infinitely many ways
to construct such scores, we continued our work on the score structure that had been used
earlier4 (see “Investigating methods to score protein-ligand conformers” above) since this is a
rather flexible and fast way for constructing a score. This score is simply a linear combination
of a number of scoring terms which are calculated by different methods. These terms are
weighted and usually a constant term is added in order to maximize the predictive
performance of the score. Here the interesting question is how to perform this combination
and weighting procedure, and how to validate the performance of such a score. In the weighting
procedure the score has to be fitted to a function for which the quality to be modelled
(free energy change at binding, biological activity etc.) is already known. There are
cases where such information about molecular complexes is available, but these complexes
are often too few to be able to perform a satisfying fitting procedure. A commonly used
solution to this is to generate a multitude of protein-ligand conformers and approximate
quality from measurements of RMSD from known crystal structures. We did not have any
data on binding energies or activity for our dataset so we resorted first to this alternative of
approximating quality with the RMSD measurement. In generally rare cases when some sort
of symmetry is present in a ligand molecule, the approximation capabilities of this
measurement work better the smaller the values of the measured RMSD. Further away from
the crystal structure in terms of RMSD the scoring environment might change radically to the
disadvantage of the approximation. We believed, however, that such a rare case could prove to
be a challenge for our cytosine-specific scoring functions after we discovered possible
recurring local score maxima at RMSDcyt of approximately 7 Å and 13 Å in preliminary
scoring results obtained for our dataset (Fig. 13). The maximum at 7 Å was believed to be
caused by the ring structure of cytosine in different conformational states corresponding to a
180° rotation of the ring plane, along the R-C1 axis or some other axis, recurring in a
number of complexes. The other maximum at 13 Å can possibly be explained by
conformations of ligands containing two ring structures, where the ligands are positioned
with the rings in opposite positions compared to the crystal structure. Therefore the problem
of symmetry in the ligand molecule appeared to be quite important in our case. The exact
connection between possible quality maxima at these positions and structural symmetries of
these ligands will have to be further investigated.
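The linear score structure described earlier in this section, a weighted sum of scoring terms plus a constant, can be illustrated with a minimal sketch. The term values, weights and constant below are invented for illustration only; they are not fitted values from this work, and the function name composite_score is hypothetical.

```python
def composite_score(terms, weights, constant=0.0):
    """Linear combination of scoring terms: constant + sum(weight * term)."""
    return constant + sum(weights[name] * value for name, value in terms.items())


# Hypothetical term values for two conformers of one complex.
conformers = {
    "pose_1": {"c_hbond": 3.0, "c_stack": 1.0},
    "pose_2": {"c_hbond": 1.0, "c_stack": 2.0},
}
# Illustrative weights, not the result of any fitting procedure.
weights = {"c_hbond": 2.0, "c_stack": 0.5}

scores = {
    name: composite_score(terms, weights, constant=1.0)
    for name, terms in conformers.items()
}
# Conformers ordered from highest to lowest composite score.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In such a scheme the fitting step only has to supply the weights and the constant; the scoring terms themselves can come from any of the component methods.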
Using the RMSD-measurement to approximate docking quality for conformers around
these distances would most likely be a source of error, since a lower quality would be assumed
than is probably the case. If we were to use the RMSD-measurement both for score
fitting and later for validation of our scores, there seemed to be reason to use this
measurement only for an RMSD-range where we were more certain that the given
approximation would correspond to some value near the assumed quality maximum of the
crystal structure. In theory, not doing so could suppress the ability of a score to discern
features of the quality it is supposed to measure and equivalently discredit scores with such
abilities during score validation. To visualize and investigate this theory we decided to
include two different concepts for score training and score validation. In one concept, we use
RMSD-measurements from the whole range for training and validation, while in the other
we use only docking results with RMSD-measurements below a set limit for
training and validation, for comparison.
The first concept is to use the whole RMSD-range for score training and validation.
During training this is realized by using the fitting functions Semi-step function and
Negative RMSD, and during validation by using conformers from the whole RMSD-range.
In the second concept we have placed an arbitrarily selected RMSDcyt cut-off at 6 Å
from the crystal structure, below which we consider RMSD-measurements acceptably reliable
for approximating score quality. When training our scores we realize this concept through
the fitting function Negative RMSD with cut off, and during validation by placing
an RMSD cut-off radius below which conformers are used for validation. The limit was placed
according to the assumption of possible score maxima at RMSDcyt of 7 Å and 13 Å, as a
compromise with obtaining enough docking results below this limit to produce
interesting validation results (see Validation of scoring approaches).
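The two concepts above differ only in which conformers are passed on to training and validation. A minimal sketch of that selection step follows; the 6 Å limit is the one stated in the text, while the conformer records, the field name rmsd_cyt and the function name are assumptions made for illustration.

```python
RMSD_CUTOFF = 6.0  # Å; the RMSDcyt cut-off chosen in the text


def select_conformers(conformers, cutoff=None):
    """Return all conformers (full range) or only those with RMSDcyt below the cut-off."""
    if cutoff is None:
        return list(conformers)
    return [c for c in conformers if c["rmsd_cyt"] < cutoff]


# Hypothetical docking results for one complex.
docked = [
    {"name": "pose_a", "rmsd_cyt": 2.0},
    {"name": "pose_b", "rmsd_cyt": 6.0},
    {"name": "pose_c", "rmsd_cyt": 7.1},
    {"name": "pose_d", "rmsd_cyt": 13.0},
]

full_set = select_conformers(docked)               # first concept: whole RMSD-range
cut_set = select_conformers(docked, RMSD_CUTOFF)   # second concept: below the cut-off
```

With these example values only pose_a falls below the limit, so poses near the suspected maxima at 7 Å and 13 Å are kept out of the restricted set.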
Based on our earlier experiences with using RMSD-measurements as approximations of
the quality of a docking solution, we decided to investigate alternative ways for
approximating conformer quality for score training and possibly also score validation. For this
purpose we created the Term correlation fitting function.
[Figure 13: plot of Score (y-axis, -10 to 50) versus RMSD_cyt (x-axis, 0 to 30) for the score c_hbond c_stack c_mhp1 c_emp1 c_motif]
Figure 13. A plot of the calculated score versus RMSDcyt for all complexes in Dataset 2,
using the score comprising c_hbond, c_stack, c_mhp1, c_emp1, and c_motif trained on the
whole Dataset 1 and Dataset 2 with the fitting function Term correlation. This plot illustrates
possible reoccurring structural quality maxima at different ranges from the crystal-structures
(at 7 Å and 13Å). RMSDcyt is expressed in Ångströms.
3.4. Validation of scoring approaches
To validate both our various scoring approaches, in the form of combinations of scoring
components into complete scores, and our different methods of score
construction, mainly the different fitting functions, we decided to combine the scores with
the score construction methods, our different validation strategies and finally also the different
data subsets from different docking algorithms when performing our evaluation. For each
validation strategy (Complete training and test and Cross-training and test), all proposed
score combinations have been validated once for each fitting function (Semi-step function,
Negative RMSD, Negative RMSD with cut off and Term correlation) for each data subset in
Dataset 2 (GOLD goldscore, GOLD chemscore and GLIDE) and once with the whole dataset.
Results from our evaluation of the scores can be found in Tables AI.1 to AI.8 in Appendix
I. The main reason for testing the performance of scores on the subsets of Dataset 2 is that
doing so allows a comparison between the performance of our scores and that of
the scores used by the docking algorithms on the same dataset. To visualize
the performance of scores at a shorter distance from the known crystal structures in terms of
RMSD, we also produced validation results for complexes from Dataset 2 with a measured
RMSDcyt < 6 Å from the respective known structure in Dataset 1. These results are marked
with “CUT”. Results taking into account the whole RMSD-range are marked with “full”. The
cut-off distance was arbitrarily chosen to lie between 0 and the expected quality
maximum at 7 Å (see Investigating methods to score protein-ligand conformers, Investigating
strategies to combine terms of different scoring components and to validate scores),
excluding complexes believed to rightfully score well but not belonging to the maximum
represented by the crystal structure, while still keeping as many conformers as possible to
make the result somewhat interesting from a statistical point of view.
Each score combination is built up from scoring components adding various numbers of
terms to the score. The scoring components used are those listed in the Methods
section. Additionally, for comparison, we added the dockscore, that is, the score read from the
respective docking program by which the complex conformer was created (GOLD chemscore,
GOLD goldscore, Glide).
When evaluating scoring quality of the scores we mainly used the following
measurements: mean enrichment (ME), relative mean hit rank (RelMHRk), and relative hit
rate (RelHRt).
ME is the mean enrichment over all complexes when measuring enrichment separately
for each protein-ligand combination. ME lies in the range between 0 and 1 and measures the
score’s capacity to order docking results, with 1 denoting a perfect ME and values below 0.5
an unsuccessful ME.
RelMHRk is the ratio between the mean hit rank and the mean random hit rank
and can serve as a measurement of how precise the score is, that is, the mean scoring rank of
the best docking result compared to the random rank of the best docking result. A RelMHRk
< 1 is considered better than a random ordering of the docking results, and a higher RelMHRk
is considered less successful.
RelHRt is the ratio between the hit rate and the random hit rate. RelHRt is also a
measurement of how precise the score is, in terms of the capacity to rank the best result as
number 1. A RelHRt > 1 suggests that the score’s performance is better than random,
and a RelHRt < 1 that it is worse.
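The two rank-based measurements can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used in this work: the "hit" is taken to be the conformer with the lowest RMSD in each complex, the random hit rank for n conformers is taken as (n + 1) / 2, and the random hit rate as 1 / n. The function names and example data are hypothetical.

```python
from statistics import mean


def hit_rank(conformers):
    """1-based rank (by descending score) of the lowest-RMSD conformer (assumed 'hit')."""
    hit = min(conformers, key=lambda c: c["rmsd"])
    ordered = sorted(conformers, key=lambda c: c["score"], reverse=True)
    return ordered.index(hit) + 1


def rel_mhrk(complexes):
    """Mean hit rank divided by the mean rank expected from random ordering."""
    ranks = [hit_rank(c) for c in complexes]
    random_ranks = [(len(c) + 1) / 2 for c in complexes]  # assumed random baseline
    return mean(ranks) / mean(random_ranks)


def rel_hrt(complexes):
    """Fraction of complexes with the hit ranked first, divided by the random hit rate."""
    hit_rate = mean([1.0 if hit_rank(c) == 1 else 0.0 for c in complexes])
    random_rate = mean([1.0 / len(c) for c in complexes])  # assumed random baseline
    return hit_rate / random_rate


# Two hypothetical complexes with three conformers each.
complex_a = [{"score": 5, "rmsd": 1.0}, {"score": 3, "rmsd": 4.0}, {"score": 1, "rmsd": 8.0}]
complex_b = [{"score": 2, "rmsd": 0.5}, {"score": 9, "rmsd": 7.0}, {"score": 4, "rmsd": 3.0}]
example = [complex_a, complex_b]
```

In this example the hit is ranked first in complex_a and last in complex_b, so the mean hit rank equals the random expectation while the hit rate is still above the random rate.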
Table 3a. Ranking of scores by Mean Enrichment (ME), Complete training and test
Semi-step function
ME    score
0.99  chemscore_C
0.98  chemscore
0.93  c_hbond c_stack c_mhp1 c_emp1_C
0.93  c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.91  c_hbond c_stack c_emp1_C
0.9   c_hbond c_stack c_mhp1 c_emp1 c_motif
0.88  c_hbond_C
0.88  c_hbond c_stack c_mhp1_C
0.87  c_hbond c_stack c_mhp1 c_emp1
0.87  c_hbond c_stack_C
0.87  c_hbond c_stack c_motif_C
0.87  glide_C
0.86  glide
0.85  c_hbond
0.85  c_hbond c_stack c_emp1
0.84  c_emp1_C
0.83  c_hbond c_stack c_motif
0.83  goldscore
0.81  c_hbond c_stack c_mhp1
0.8   c_hbond c_stack
0.77  c_emp1
0.76  goldscore_C
0.74  c_mhp1_C
0.71  c_mhp1
0.64  c_stack_C
0.56  c_motif_C
0.54  c_stack
0.4   c_motif
Negative RMSD
ME    score
0.99  chemscore_C
0.98  chemscore
0.94  c_hbond c_stack c_emp1_C
0.93  c_hbond c_stack c_mhp1 c_emp1_C
0.92  c_hbond c_stack c_emp1
0.92  c_hbond c_stack c_mhp1 c_emp1 c_motif
0.92  c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.91  c_hbond c_stack c_mhp1 c_emp1
0.89  c_hbond c_stack_C
0.89  c_hbond c_stack c_motif_C
0.89  c_hbond c_stack c_mhp1_C
0.88  c_hbond_C
0.88  c_emp1_C
0.87  c_hbond c_stack c_mhp1
0.87  glide_C
0.86  c_hbond c_stack c_motif
0.86  glide
0.85  c_hbond
0.85  c_hbond c_stack
0.84  c_emp1
0.83  goldscore
0.76  goldscore_C
0.74  c_mhp1_C
0.71  c_mhp1
0.62  c_stack_C
0.56  c_motif_C
0.5   c_stack
0.4   c_motif
Negative RMSD with cut off
ME    score
0.99  chemscore_C
0.98  chemscore
0.92  c_hbond c_stack c_emp1_C
0.9   c_hbond c_stack c_mhp1 c_emp1_C
0.9   c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.89  c_hbond c_stack c_mhp1 c_emp1 c_motif
0.89  c_hbond c_stack_C
0.89  c_hbond c_stack c_motif_C
0.89  c_hbond c_stack c_mhp1_C
0.88  c_hbond c_stack c_emp1
0.88  c_hbond c_stack c_mhp1 c_emp1
0.88  c_hbond_C
0.87  glide_C
0.86  c_hbond c_stack c_mhp1
0.86  c_emp1_C
0.86  glide
0.85  c_hbond c_stack c_motif
0.84  c_hbond
0.84  c_hbond c_stack
0.83  c_emp1
0.83  goldscore
0.76  goldscore_C
0.74  c_mhp1_C
0.71  c_mhp1
0.62  c_stack_C
0.56  c_motif_C
0.5   c_stack
0.4   c_motif
Term correlation
ME    score
0.99  chemscore_C
0.98  chemscore
0.92  c_hbond c_stack c_emp1_C
0.92  c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.9   c_hbond c_stack c_mhp1 c_emp1 c_motif
0.9   c_hbond c_stack_C
0.9   c_hbond c_stack c_mhp1_C
0.9   c_hbond c_stack c_mhp1 c_emp1_C
0.89  c_hbond c_stack c_emp1
0.89  c_hbond c_stack c_mhp1 c_emp1
0.89  c_hbond c_stack c_motif_C
0.88  c_hbond c_stack c_mhp1
0.88  c_hbond_C
0.87  glide_C
0.86  c_hbond c_stack
0.86  c_hbond c_stack c_motif
0.86  glide
0.85  c_hbond
0.85  c_emp1_C
0.83  goldscore
0.79  c_emp1
0.76  goldscore_C
0.74  c_mhp1_C
0.71  c_mhp1
0.62  c_stack_C
0.56  c_motif_C
0.52  c_stack
0.4   c_motif
Table 3b. Ranking of scores by Mean Enrichment (ME), Cross-training and test
Semi-step function
ME     score
0.950  c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.933  c_hbond c_stack c_mhp1 c_emp1_C
0.917  c_hbond c_stack c_mhp1 c_emp1 c_motif
0.914  c_hbond c_stack c_emp1_C
0.912  c_hbond c_stack c_mhp1_C
0.902  c_hbond c_stack c_mhp1 c_emp1
0.891  c_hbond c_stack c_motif_C
0.876  c_hbond c_stack c_mhp1
0.876  c_hbond c_stack_C
0.875  c_hbond_C
0.847  c_hbond c_stack c_emp1
0.843  c_hbond c_stack c_motif
0.835  c_hbond
0.833  c_hbond c_stack
0.825  c_emp1_C
0.754  c_emp1
0.737  c_mhp1_C
0.727  c_mhp1
0.668  c_stack_C
0.613  c_motif_C
0.549  c_stack
0.454  c_motif
Negative RMSD
ME     score
0.940  c_hbond c_stack c_emp1_C
0.939  c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.935  c_hbond_C
0.930  c_hbond c_stack c_mhp1 c_emp1 c_motif
0.926  c_hbond c_stack c_motif_C
0.921  c_hbond c_stack c_mhp1 c_emp1_C
0.908  c_hbond c_stack c_mhp1 c_emp1
0.907  c_hbond
0.893  c_hbond c_stack c_motif
0.891  c_hbond c_stack c_emp1
0.885  c_emp1_C
0.882  c_hbond c_stack_C
0.868  c_hbond c_stack c_mhp1
0.863  c_hbond c_stack c_mhp1_C
0.846  c_hbond c_stack
0.838  c_emp1
0.720  c_mhp1_C
0.707  c_mhp1
0.576  c_stack_C
0.561  c_motif_C
0.455  c_stack
0.415  c_motif
Negative RMSD with cut off
ME     score
0.950  c_hbond c_stack c_mhp1 c_emp1_C
0.921  c_hbond c_stack c_emp1_C
0.920  c_hbond c_stack_C
0.916  c_hbond c_stack c_mhp1 c_emp1
0.913  c_hbond_C
0.910  c_hbond c_stack c_mhp1_C
0.908  c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.888  c_hbond c_stack c_mhp1
0.884  c_hbond c_stack c_mhp1 c_emp1 c_motif
0.882  c_hbond c_stack c_motif_C
0.879  c_hbond c_stack c_emp1
0.869  c_hbond
0.863  c_hbond c_stack
0.837  c_hbond c_stack c_motif
0.828  c_emp1_C
0.804  c_emp1
0.778  c_mhp1_C
0.737  c_mhp1
0.562  c_stack_C
0.558  c_motif_C
0.438  c_stack
0.418  c_motif
Term correlation
ME     score
0.947  c_hbond c_stack c_emp1_C
0.930  c_hbond c_stack c_mhp1_C
0.928  c_hbond c_stack c_motif_C
0.923  c_hbond c_stack c_emp1
0.919  c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.911  c_hbond c_stack c_mhp1 c_emp1 c_motif
0.910  c_hbond c_stack c_mhp1 c_emp1
0.909  c_hbond c_stack c_mhp1 c_emp1_C
0.906  c_hbond c_stack c_mhp1
0.897  c_hbond c_stack c_motif
0.897  c_hbond c_stack_C
0.858  c_hbond c_stack
0.851  c_hbond_C
0.850  c_emp1_C
0.821  c_hbond
0.801  c_emp1
0.764  c_mhp1_C
0.724  c_mhp1
0.610  c_stack_C
0.525  c_motif_C
0.471  c_stack
0.365  c_motif

Table 3a and Table 3b show the validation results of our scores, ranked according to
achieved Mean Enrichment (ME). Table 3a shows the results from the validation method Complete
training and test, and Table 3b the mean results from the validation method Cross-training and test.
Table 3c. Ranking of scores by Relative Mean Hit Rank (RelMHRk), Complete training and test
Semi-step function
RelMHRk  score
0.47     chemscore
0.74     glide
0.75     c_hbond c_stack c_mhp1 c_emp1 c_motif
0.75     chemscore_C
0.82     goldscore
0.84     c_hbond c_stack c_mhp1 c_emp1
0.89     glide_C
0.90     c_hbond c_stack c_emp1
0.94     c_hbond
0.96     c_hbond c_stack c_mhp1 c_emp1_C
0.96     c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.97     c_hbond c_stack c_emp1_C
1.12     c_hbond c_stack c_motif
1.17     c_emp1_C
1.24     c_hbond c_stack c_mhp1
1.26     c_hbond_C
1.28     c_hbond c_stack
1.32     c_hbond c_stack c_mhp1_C
1.38     goldscore_C
1.39     c_emp1
1.41     c_hbond c_stack_C
1.41     c_hbond c_stack c_motif_C
2.13     c_mhp1
2.42     c_mhp1_C
3.27     c_stack
3.36     c_stack_C
4.62     c_motif_C
4.86     c_motif
Negative RMSD
RelMHRk  score
0.47     chemscore
0.60     c_hbond c_stack c_emp1
0.66     c_hbond c_stack c_mhp1 c_emp1
0.66     c_hbond c_stack c_mhp1 c_emp1 c_motif
0.74     glide
0.75     chemscore_C
0.82     goldscore
0.89     glide_C
0.90     c_hbond c_stack c_mhp1
0.91     c_hbond
0.91     c_hbond c_stack c_emp1_C
0.96     c_hbond c_stack c_motif
0.97     c_hbond c_stack
0.99     c_emp1
1.00     c_emp1_C
1.03     c_hbond c_stack c_mhp1 c_emp1_C
1.03     c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.21     c_hbond_C
1.21     c_hbond c_stack_C
1.21     c_hbond c_stack c_motif_C
1.28     c_hbond c_stack c_mhp1_C
1.38     goldscore_C
2.13     c_mhp1
2.42     c_mhp1_C
3.52     c_stack
3.55     c_stack_C
4.62     c_motif_C
4.86     c_motif
Negative RMSD with cut off
RelMHRk  score
0.47     chemscore
0.74     glide
0.75     chemscore_C
0.78     c_hbond c_stack c_emp1
0.82     goldscore
0.87     c_hbond c_stack c_mhp1 c_emp1 c_motif
0.88     c_hbond c_stack c_mhp1 c_emp1
0.89     glide_C
0.93     c_hbond
0.93     c_hbond c_stack c_emp1_C
0.98     c_hbond c_stack c_mhp1
0.99     c_hbond c_stack
0.99     c_hbond c_stack c_motif
1.08     c_emp1
1.18     c_hbond c_stack c_mhp1 c_emp1_C
1.18     c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.20     c_emp1_C
1.26     c_hbond_C
1.26     c_hbond c_stack_C
1.26     c_hbond c_stack c_motif_C
1.29     c_hbond c_stack c_mhp1_C
1.38     goldscore_C
2.13     c_mhp1
2.42     c_mhp1_C
3.52     c_stack
3.55     c_stack_C
4.62     c_motif_C
4.86     c_motif
Term correlation
RelMHRk  score
0.47     chemscore
0.66     c_hbond c_stack c_emp1
0.74     c_hbond c_stack c_mhp1 c_emp1 c_motif
0.74     glide
0.75     c_hbond c_stack c_mhp1
0.75     chemscore_C
0.77     c_hbond c_stack c_mhp1 c_emp1
0.81     c_hbond c_stack
0.82     goldscore
0.83     c_hbond c_stack c_motif
0.86     c_hbond
0.86     c_hbond c_stack c_emp1_C
0.89     glide_C
1.08     c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.16     c_hbond c_stack c_motif_C
1.17     c_emp1_C
1.18     c_hbond c_stack c_mhp1_C
1.19     c_hbond c_stack c_mhp1 c_emp1_C
1.20     c_hbond c_stack_C
1.21     c_hbond_C
1.25     c_emp1
1.38     goldscore_C
2.13     c_mhp1
2.42     c_mhp1_C
3.32     c_stack
3.44     c_stack_C
4.62     c_motif_C
4.86     c_motif

Table 3d. Ranking of scores by Relative Mean Hit Rank (RelMHRk), Cross-training and test
Semi-step function
RelMHRk  score
0.686    c_hbond c_stack c_mhp1 c_emp1
0.781    c_hbond c_stack c_mhp1 c_emp1 c_motif
0.819    c_hbond c_stack c_mhp1
0.828    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
0.912    c_hbond
0.927    c_hbond c_stack c_mhp1 c_emp1_C
0.931    c_hbond c_stack c_emp1
0.957    c_hbond c_stack c_emp1_C
0.976    c_hbond c_stack c_mhp1_C
1.028    c_hbond_C
1.051    c_emp1_C
1.071    c_hbond c_stack
1.330    c_hbond c_stack c_motif
1.383    c_emp1
1.459    c_hbond c_stack_C
1.604    c_hbond c_stack c_motif_C
2.077    c_mhp1
2.233    c_mhp1_C
3.094    c_stack
3.575    c_stack_C
4.851    c_motif_C
5.234    c_motif
Negative RMSD
RelMHRk  score
0.563    c_hbond
0.572    c_hbond c_stack c_mhp1 c_emp1 c_motif
0.675    c_hbond c_stack c_emp1
0.774    c_hbond c_stack c_mhp1 c_emp1
0.863    c_hbond_C
0.866    c_hbond c_stack c_mhp1
0.875    c_hbond c_stack c_emp1_C
0.953    c_hbond c_stack c_motif
1.007    c_emp1
1.031    c_emp1_C
1.040    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.063    c_hbond c_stack c_mhp1 c_emp1_C
1.070    c_hbond c_stack
1.198    c_hbond c_stack c_motif_C
1.258    c_hbond c_stack_C
1.515    c_hbond c_stack c_mhp1_C
2.218    c_mhp1
2.744    c_mhp1_C
3.592    c_stack_C
3.730    c_stack
4.637    c_motif_C
5.087    c_motif
Negative RMSD with cut off
RelMHRk  score
0.657    c_hbond
0.674    c_hbond c_stack c_mhp1 c_emp1
0.717    c_hbond c_stack c_mhp1
0.829    c_hbond c_stack c_emp1
0.864    c_hbond c_stack c_emp1_C
0.877    c_hbond c_stack c_mhp1 c_emp1_C
0.881    c_hbond_C
0.881    c_hbond c_stack c_mhp1 c_emp1 c_motif
1.026    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.062    c_hbond c_stack
1.066    c_hbond c_stack c_mhp1_C
1.069    c_hbond c_stack_C
1.114    c_hbond c_stack c_motif
1.217    c_emp1
1.387    c_emp1_C
1.441    c_hbond c_stack c_motif_C
1.980    c_mhp1
2.138    c_mhp1_C
3.828    c_stack
3.977    c_stack_C
4.607    c_motif
4.728    c_motif_C
Term correlation
RelMHRk  score
0.569    c_hbond c_stack c_emp1
0.658    c_hbond c_stack c_mhp1
0.747    c_hbond c_stack c_motif
0.790    c_hbond c_stack c_mhp1 c_emp1
0.804    c_hbond c_stack
0.846    c_hbond c_stack c_motif_C
0.861    c_hbond c_stack c_emp1_C
0.863    c_hbond c_stack c_mhp1 c_emp1 c_motif
1.019    c_hbond
1.064    c_hbond c_stack c_mhp1_C
1.183    c_emp1
1.188    c_emp1_C
1.215    c_hbond c_stack c_mhp1 c_emp1_C
1.228    c_hbond c_stack_C
1.262    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.479    c_hbond_C
2.229    c_mhp1
2.490    c_mhp1_C
3.618    c_stack_C
3.706    c_stack
5.236    c_motif
5.245    c_motif_C

Table 3c and Table 3d show the validation results of our scores, ranked according to
achieved Relative Mean Hit Rank (RelMHRk). Table 3c shows the results from the validation method
Complete training and test, and Table 3d the mean results from the validation method Cross-training and test.
Table 3e. Ranking of scores by Relative Hit Rate (RelHRt), Complete training and test
Semi-step function
RelHRt  score
2.00    goldscore
1.93    c_hbond c_stack c_mhp1 c_emp1 c_motif
1.80    c_hbond c_stack c_mhp1 c_emp1
1.64    c_hbond
1.64    c_hbond c_stack c_emp1
1.50    c_hbond c_stack c_motif
1.50    chemscore
1.40    c_hbond c_stack c_mhp1
1.39    glide
1.36    c_hbond c_stack
1.22    c_hbond c_stack c_emp1_C
1.22    c_hbond c_stack c_mhp1 c_emp1_C
1.22    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.20    chemscore_C
1.15    c_hbond_C
1.11    c_hbond c_stack_C
1.11    c_hbond c_stack c_motif_C
1.04    c_hbond c_stack c_mhp1_C
1.00    glide_C
0.86    goldscore_C
0.81    c_emp1_C
0.80    c_mhp1
0.79    c_emp1
0.64    c_motif
0.59    c_mhp1_C
0.50    c_stack
0.33    c_stack_C
0.30    c_motif_C
Negative RMSD
RelHRt  score
2.14    c_hbond c_stack c_emp1
2.00    c_hbond c_stack c_mhp1 c_emp1
2.00    c_hbond c_stack c_mhp1 c_emp1 c_motif
2.00    goldscore
1.86    c_hbond
1.86    c_hbond c_stack c_motif
1.80    c_hbond c_stack c_mhp1
1.79    c_hbond c_stack
1.50    chemscore
1.39    glide
1.26    c_hbond c_stack c_emp1_C
1.22    c_hbond c_stack c_mhp1 c_emp1_C
1.22    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.21    c_emp1
1.20    chemscore_C
1.19    c_hbond_C
1.19    c_hbond c_stack_C
1.19    c_hbond c_stack c_motif_C
1.11    c_hbond c_stack c_mhp1_C
1.07    c_emp1_C
1.00    glide_C
0.86    goldscore_C
0.80    c_mhp1
0.64    c_motif
0.59    c_mhp1_C
0.30    c_stack_C
0.30    c_motif_C
0.29    c_stack
Negative RMSD with cut off
RelHRt  score
2.00    goldscore
1.79    c_hbond
1.79    c_hbond c_stack
1.79    c_hbond c_stack c_motif
1.79    c_hbond c_stack c_emp1
1.73    c_hbond c_stack c_mhp1 c_emp1
1.73    c_hbond c_stack c_mhp1 c_emp1 c_motif
1.67    c_hbond c_stack c_mhp1
1.50    chemscore
1.39    glide
1.26    c_hbond c_stack c_emp1_C
1.20    chemscore_C
1.15    c_hbond_C
1.15    c_hbond c_stack c_mhp1 c_emp1_C
1.15    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.11    c_hbond c_stack_C
1.07    c_hbond c_stack c_mhp1_C
1.00    glide_C
0.86    goldscore_C
0.81    c_emp1_C
0.81    c_hbond c_stack c_motif_C
0.80    c_mhp1
0.79    c_emp1
0.64    c_motif
0.59    c_mhp1_C
0.30    c_stack_C
0.30    c_motif_C
0.29    c_stack
Term correlation
RelHRt  score
2.00    c_hbond c_stack c_motif
2.00    goldscore
1.93    c_hbond c_stack c_mhp1
1.93    c_hbond c_stack c_mhp1 c_emp1 c_motif
1.86    c_hbond c_stack
1.79    c_hbond
1.67    c_hbond c_stack c_mhp1 c_emp1
1.64    c_hbond c_stack c_emp1
1.50    chemscore
1.39    glide
1.22    c_hbond c_stack c_motif_C
1.22    c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.20    chemscore_C
1.19    c_hbond_C
1.19    c_hbond c_stack_C
1.19    c_hbond c_stack c_mhp1_C
1.19    c_hbond c_stack c_emp1_C
1.11    c_hbond c_stack c_mhp1 c_emp1_C
1.00    glide_C
0.89    c_emp1_C
0.86    goldscore_C
0.80    c_mhp1
0.71    c_emp1
0.64    c_motif
0.59    c_mhp1_C
0.30    c_stack_C
0.30    c_motif_C
0.29    c_stack
Table 3f. Ranking of scores by Relative Hit Rate (RelHRt), Cross-training and test
Semi-step function
RelHRt  score
2.179   c_hbond c_stack c_mhp1 c_emp1
1.850   c_hbond c_stack c_mhp1 c_emp1 c_motif
1.769   c_hbond
1.595   c_hbond c_stack c_mhp1
1.450   c_hbond c_stack
1.353   c_hbond c_stack c_motif
1.292   c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.262   c_hbond c_stack c_emp1_C
1.258   c_hbond c_stack c_mhp1_C
1.210   c_hbond c_stack c_mhp1 c_emp1_C
1.167   c_hbond_C
1.082   c_hbond c_stack c_motif_C
1.075   c_hbond c_stack_C
1.050   c_hbond c_stack c_emp1
0.848   c_emp1
0.815   c_emp1_C
0.794   c_mhp1
0.571   c_mhp1_C
0.528   c_motif
0.393   c_stack
0.333   c_stack_C
0.310   c_motif_C
Negative RMSD
RelHRt  score
2.259   c_hbond
2.059   c_hbond c_stack c_mhp1 c_emp1 c_motif
2.000   c_hbond c_stack c_emp1
1.914   c_hbond c_stack c_motif
1.909   c_hbond c_stack
1.886   c_hbond c_stack c_mhp1 c_emp1
1.667   c_hbond c_stack c_mhp1
1.333   c_emp1
1.245   c_hbond_C
1.221   c_hbond c_stack c_emp1_C
1.210   c_hbond c_stack_C
1.194   c_emp1_C
1.188   c_hbond c_stack c_mhp1 c_emp1_C
1.182   c_hbond c_stack c_motif_C
1.147   c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.015   c_hbond c_stack c_mhp1_C
0.784   c_mhp1
0.703   c_motif
0.522   c_mhp1_C
0.409   c_stack
0.288   c_stack_C
0.278   c_motif_C
Negative RMSD with cut off
RelHRt  score
2.000   c_hbond c_stack c_motif
1.833   c_hbond
1.806   c_hbond c_stack c_emp1
1.806   c_hbond c_stack c_mhp1 c_emp1 c_motif
1.727   c_hbond c_stack c_mhp1 c_emp1
1.676   c_hbond c_stack
1.615   c_hbond c_stack c_mhp1
1.339   c_hbond c_stack c_emp1_C
1.210   c_hbond_C
1.188   c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.184   c_hbond c_stack c_mhp1 c_emp1_C
1.153   c_hbond c_stack c_motif_C
1.147   c_hbond c_stack_C
1.086   c_hbond c_stack c_mhp1_C
0.812   c_emp1_C
0.771   c_emp1
0.733   c_mhp1
0.697   c_motif
0.660   c_mhp1_C
0.500   c_stack
0.313   c_stack_C
0.290   c_motif_C
Term correlation
RelHRt  score
2.024   c_hbond c_stack c_mhp1
1.969   c_hbond c_stack c_mhp1 c_emp1 c_motif
1.963   c_hbond
1.895   c_hbond c_stack c_emp1
1.867   c_hbond c_stack c_mhp1 c_emp1
1.733   c_hbond c_stack c_motif
1.719   c_hbond c_stack
1.304   c_hbond c_stack_C
1.205   c_hbond c_stack c_emp1_C
1.200   c_hbond c_stack c_mhp1_C
1.193   c_hbond c_stack c_mhp1 c_emp1_C
1.189   c_hbond c_stack c_motif_C
1.182   c_hbond_C
1.113   c_hbond c_stack c_mhp1 c_emp1 c_motif_C
1.000   c_emp1_C
0.966   c_emp1
0.839   c_mhp1
0.655   c_mhp1_C
0.548   c_motif
0.271   c_stack_C
0.263   c_motif_C
0.132   c_stack

Table 3e and Table 3f show the validation results of our scores, ranked according to
achieved Relative Hit Rate (RelHRt). Table 3e shows the results from the validation method Complete
training and test, and Table 3f the mean results from the validation method Cross-training and test.
A ranking of the scores according to each of the three measurements described above is
presented in Tables 3a – 3f above, with scores that only score the RMSD-range < 6 Å marked
with “_C” at the end. Here we have ranked the performance of scores on Dataset 2 for both
Complete training and test and Cross-training and test. For Cross-training and test the mean
results over 20 iterations are presented. In the rankings for Complete training and test we also
included the performance of the scores of the docking algorithms on their respective subsets of
Dataset 2 as comparison. Results obtained by using different fitting functions are presented
alongside each other for comparison between the functions.
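Reporting a mean over 20 iterations can be sketched as a repeated random split of the complexes into a training part and a test part. Only the number of iterations is taken from the text; the split fraction, the random seed, the function names and the callback interface are assumptions made for illustration, not a description of the procedure actually used.

```python
import random
from statistics import mean


def cross_training_and_test(complex_ids, train_and_evaluate, iterations=20,
                            train_fraction=0.5, seed=0):
    """Average a validation measurement over repeated random train/test splits."""
    rng = random.Random(seed)
    ids = list(complex_ids)
    results = []
    for _ in range(iterations):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)  # assumed split point
        train, test = shuffled[:cut], shuffled[cut:]
        results.append(train_and_evaluate(train, test))
    return mean(results)


# Stand-in for "fit weights on train, compute ME / RelMHRk / RelHRt on test".
calls = []

def fake_train_and_evaluate(train, test):
    calls.append((train, test))
    return len(test) / (len(train) + len(test))

mean_result = cross_training_and_test(range(10), fake_train_and_evaluate)
```

In real use the callback would train the weights of a composite score on the training complexes and return one of the validation measurements computed on the test complexes.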
What is generally interesting is which combinations of scoring components are more
successful. This question has to be viewed from two perspectives however. A traditional view
on how to construct scores is to treat the crystal structure as the final and only goal of the
docking procedure and the score is constructed to somehow reflect the distance from this
conformation and to lead the docking algorithm to this position as swiftly as possible. An
alternative view is to use the RMSD-distance from the crystal structure as an approximation
of the quality of the conformation only on a limited distance from the crystal structure (see
Investigating strategies to combine terms of different scoring components and to validate
scores). In our validation we have tried to fulfil the goals of both these views. The
performance of scores validated on the whole RMSD-range versus performance of scores
validated on the shorter distance should be viewed in this light. The performance on the whole
range can be considered as the performance of the score used for the first purpose of ranking
conformers against the crystal structure conformation. When observing the performance on
the limited range, it should be seen as a more comparative performance since it considers an
RMSD-range that is more likely to approximate conformer quality.
Examining the performance of the scores according to ME, one can see that our scores
which combine different approaches to scoring generally perform better than scores using
only one method for scoring. It should also be noted that scores scoring only the limited
RMSDcyt-range generally score above their full-range counterparts. Both these effects
were expected: combinations of scoring approaches should be able to capture more effects
than scores using only one approach, and the scoring environment just outside the crystal
structures in terms of RMSDcyt was expected to be more forgiving to scoring functions,
compared to the full range. All high ranking scores contain the combination c_hbond and
c_stack. Especially c_hbond shows a good ME performance, even on its own as a score
despite its single-tracked scoring method. The best performing scores according to ME are
scores consisting of c_hbond, c_stack and different combinations of c_mhp1, c_emp1 and
sometimes c_motif. The c_emp1 component shows good results during both Complete
training and test and Cross-training and test, despite the more limited availability of
knowledge-based data when dividing Dataset 2 into different sets during Cross-training and
test. Adding the c_stack component to the c_hbond component results in an improved ME,
except when using the fitting function Semi-step function during Complete training and test.
During Cross-training and test, the trend is more ambiguous. Combining c_hbond and
c_stack with both c_mhp1 and c_emp1 does not always improve the performance compared
to scores where c_hbond and c_stack are used with either c_mhp1 or c_emp1. During
Complete training and test, adding the c_mhp1 component to the c_hbond c_stack
c_emp1 score either does not change or lowers the performance, except when using
the fitting function Semi-step function. During Cross-training and test, combining c_mhp1
and c_emp1 usually results in a better performance except when using Term correlation. The
addition of c_motif does not improve ME performance during Complete training and test,
except when using the fitting function Term correlation. During Cross-training and test, the
pattern is the opposite, with c_motif improving performance when using all fitting functions
except Negative RMSD with cut off. Most of our composite scores show better performance
than goldscore and glide. Chemscore shows an extreme ME of 0.99 on the limited distance,
higher than any other score, and 0.98 on the full distance. Our best performing score considering
ME was c_hbond c_stack c_emp1 (ME=0.94) on the limited distance using the fitting
function Negative RMSD during Complete training and test. During Cross-training and test,
our best performing scores were c_hbond c_stack c_mhp1 c_emp1 c_motif
(Semi-step function) and c_hbond c_stack c_mhp1 c_emp1 (Negative RMSD with
cut off) (ME=0.95), both on the limited distance.
The use of different fitting functions seems to have a greater impact on the performance
according to RelMHRk than on ME. During Complete training and test, combinations of
c_hbond, c_stack and c_mhp1 with c_emp1 and c_motif still rank highest among our scores,
but a difference compared to ME is that it is the full-range versions of the scores that
generally rank highest. Here all scoring functions from the docking algorithms are generally
outperformed by our functions except chemscore, which once again ranks highest. Glide also
shows better performance by RelMHRk than by ME, ranking second for two fitting
functions. During Cross-training and test the full-range versions of scores composed of
c_hbond c_stack plus various combinations of c_mhp1, c_emp1 and c_motif rank high. Also,
the score composed of only c_hbond ranks highest when using Negative RMSD and Negative
RMSD with cut off. For Complete training and test our best performing score is c_hbond
c_stack c_emp1 (RelMHRk=0.60) on the whole distance using the fitting function Negative
RMSD, and for Cross-training and test c_hbond (RelMHRk=0.563) on the whole distance
using the fitting function Negative RMSD.
When ranking according to RelHRt, scores validated on the full RMSDcyt-range generally rank
higher, as for RelMHRk. For Complete training and test our high ranking scores are
combinations of c_hbond and c_stack with c_mhp1, c_emp1 and c_motif. Also, the solo score
of c_hbond ranks high, with a second place when using the fitting function Negative RMSD
with cut off. Generally docking scores are outperformed by our scores when measuring
RelHRt performance, except for the fitting function Negative RMSD with cut off where
goldscore ranks highest. During Cross- training and test results are more ambiguous, and
combinations of c_hbond and c_stack plus combinations of c_emp1, c_mhp1 and c_motif
rank high. Similar to the ranking by RelMHRk, c_hbond ranks highest when using the fitting
function Negative RMSD. The highest performing score for Complete training and test is
c_hbond c_stack c_emp1 (RelHRt=2.14) on the full range when using Negative RMSD, and
the score c_hbond (RelHRt=1.40) on the full range when using Negative RMSD with cut off
for Cross- training and test.
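The relative measures quoted above can be checked against the raw values tabulated in Appendix I. The sketch below assumes that RelMHRk is the ratio MHRk/MRHRk and that RelHRt is the ratio HRt/RHRt; these formulas are an assumption made here because they reproduce the quoted values, and the Methods section remains the authoritative definition. The function names are hypothetical.

```python
def rel_mhrk(mhrk: float, mrhrk: float) -> float:
    """Assumed definition: mean hit rank relative to the mean random hit rank."""
    return mhrk / mrhrk


def rel_hrt(hrt: int, rhrt: int) -> float:
    """Assumed definition: hit rate relative to the random hit rate."""
    return hrt / rhrt


# Values for c_hbond c_stack c_emp1 on the complete Dataset 2, full range,
# Complete training and test with Negative RMSD (Table AI.2).
best_rel_mhrk = round(rel_mhrk(2.56, 4.27), 2)
best_rel_hrt = round(rel_hrt(30, 14), 2)

print(best_rel_mhrk)  # 0.6
print(best_rel_hrt)   # 2.14
```

Under these assumed definitions a lower RelMHRk and a higher RelHRt both indicate better performance than random ranking.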
Information about the weighting coefficients obtained from the score training can be
used to evaluate our different scoring components. Mean weighting coefficients from the
Cross- training and test validation strategy are presented and visualized in Appendix I. An
assumption was made that a term with a large absolute weight is contributing relatively more
to the scoring capacity of a score than a term with a smaller weight, for the used fitting
function, given that the magnitude of these terms is comparable. We compared the weighting
coefficients from the score containing all scoring components and thus all scoring terms with
the aim to estimate the relative performance of the terms when the score is trained on the
same fitting function. The combinatory term in the c_hbond component seems to be
downweighted for all fitting functions – especially when using Semi- step function and Term
correlation. Using Negative RMSD, the terms with index 8 and 13 receive larger weights.
These terms correspond to the c_mhp1 term and the c_emp1 term scoring N4 relative to
nitrogen atoms, respectively. The terms with index 9 and 13, i.e. the c_emp1 terms scoring O2
relative to hydrogen bound to a nitrogen and N4 relative to nitrogen, have larger weights when
using Negative RMSD with cut off, and the c_emp1 term scoring N3 relative to hydrogen bound
to nitrogen (index 11) has a negative weight that is not very large but still significant.
When examining these results one must consider that the obtained mean weighting
coefficients are just that and not the mean of the absolute value of the weighting coefficients,
which might have been a better way to estimate importance of scoring terms for a score.
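The caveat about signed means can be illustrated with a small numerical sketch. The weights below are hypothetical values invented for illustration, not coefficients from this work; they only show how a term that receives large weights of alternating sign across training runs can end up with a mean weight near zero, while the mean of the absolute weights still reflects its size.

```python
from statistics import mean


def mean_weight(weights):
    """Signed mean of a term's weight over training runs."""
    return mean(weights)


def mean_abs_weight(weights):
    """Mean of the absolute weight over training runs."""
    return mean([abs(w) for w in weights])


# Hypothetical weights of one scoring term over four cross-training runs.
hypothetical_weights = [1.5, -1.5, 2.0, -2.0]

print(mean_weight(hypothetical_weights))      # 0.0
print(mean_abs_weight(hypothetical_weights))  # 1.75
```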
Since the fitting functions are based on various different strategies and can be used for
different purposes, it is difficult to determine which are more successful in general. The two
fitting functions which use the whole RMSD-spectrum for approximation do seem to promote
a good scoring performance of the scores validated on the limited range, sometimes with
better scoring performance than the same scores together with fitting functions using the
limited range or no range at all (i.e. Negative RMSD with cut off and Term correlation). Also,
Negative RMSD with cut off and Term correlation seem to promote a good scoring
performance in the scores validated on the full RMSD-range. It must be considered that the
fitting function Negative RMSD with cut off must work with less data than the other three
fitting functions, which might affect the performance of this function compared to the other
functions. Term correlation is not a function of RMSD, which when using RMSD-measurements for score validation might put this function at a disadvantage. However, as can
be seen from the comparisons between the fitting functions, Term correlation seems to
promote comparably good scoring performance in the scores trained with this fitting function,
compared to the fitting functions using RMSD-measurements. It is doing so with an unbiased
consideration of the whole RMSD-range, which might have its own advantages. Term
correlation is shown to possibly incorporate also the scoring capabilities of scoring terms that
model smaller parts of the quality model in the score, or terms with ambiguous scoring
capabilities. This is the case with the scoring component c_motif (see further description of
the validation of this component below). Term correlation can with its different method of
approximating conformer quality also be used for conformer validation. This method should
be seen more as a complement to RMSD-based validation methods because of its inherent
capability to amplify errors. For example, we use the score results obtained from using Term
correlation as one of the arguments for the possibility of two major general docking quality
maxima around RMSDcyt of 7 and 13 Å, respectively, from the crystal structures of our protein-ligand complexes.
Our results show that the same scoring strategy as was earlier successfully used for
scoring complexes with adenine [4], represented by the score c_hbond c_stack c_mhp1 and its
components, is also successful at scoring docking poses for protein-ligand complexes
containing ligands with cytosine as a substructure. The validations of this score indicate that its
performance is higher than that of goldscore and glide when considering ME, and
higher than that of goldscore on the limited RMSDcyt-range when measuring
RelMHRk. Measuring RelHRt, the score shows a performance which is often better than
chemscore and glide on the full RMSDcyt-range and better than goldscore and glide on the
limited range. While commenting on scores we also have to mention the combination c_hbond
c_stack. This component combination is found in all of our more successful scores and
seems to account for a very large part of the model we try to create. On its own, the score
composed of these two components has scoring capabilities often better than those of other scores
with a greater number of components. Scores with better performance than this generally also
contain combinations of c_mhp1, c_emp1 and c_motif.
Our novel scoring approaches, represented by c_emp1 and c_motif, show scoring
potential, although this is less apparent in the case of c_motif. The c_emp1 scoring component
shows a good scoring performance, both as a score of its own and as part of a composite
score. This is the case during both Complete training and test and Cross- training and test,
which gives a hint of its overall potential for scoring this particular dataset, as well as of the
robustness of this scoring approach. Combined with more explicit scoring components,
c_emp1 is indeed a useful contribution to a score, which supports our theory
that the implicit modelling capacity of knowledge-based scores successfully complements
explicit scoring components.
The scoring capacity of the component c_motif is more difficult to evaluate. The
addition of c_motif to either c_hbond c_stack c_mhp1 c_emp1 or to c_hbond c_stack results
in a score that often shows better performance than the original, but results almost as often in a
score with worse performance. It seems like the fitting function Term correlation is able to
incorporate the predictive capabilities of c_motif in such a way that all scores with this
component have higher ME performance than their counterparts without c_motif. This is the
case for both Complete training and test and Cross- training and test.
Since the datasets on which our scores have been trained directly define the
scoring capacity of the scores, our validation results have to be interpreted with this in
mind. Dataset 1 is not in any way representative of a complete set of data on how
proteins recognize and bind cytosine-containing ligands, and must be treated as a sample. This is
also the case when considering Dataset 2, since it is derived from Dataset 1. Training
scores on such limited datasets of course also implies a certain degree of bias or over-training
of those scores on the particular dataset. When comparing our scores with the scores of the
docking algorithms, it should also be mentioned that the results from these docking
algorithms are highly biased by the scoring function used in the docking process. For a
better validation of the scores, they would all have to be tested in combination with the same docking
algorithm.
3.5. Time efficiency
The scoring capacity of a score can be put into relation to its time complexity to
estimate the overall time efficiency of the score. Below we present a list where our scoring
components are annotated with their respective time complexity functions per scored
complex:
Component    Time complexity (per scored complex)
c_hbond      O(a)
c_stack      O(a)
c_mhp1       O(a)
c_emp1       O(a)
c_motif      O(a^3)
Here a is the number of protein atoms in the protein-ligand complex scored by the
component. The component c_motif has a cubic term because of the need to investigate all
possible combinations of hydrogen bonds contributing to a motif. Since the time complexities
of the different components are fairly similar, and the time complexity of c_motif is a result of
taking into account also the existence of extreme and highly unmatching conformers, a
comparison of time efficiencies between our scores can be reduced to comparing their scoring
capacities.
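The origin of a cubic term can be sketched by counting candidate combinations. The example below assumes, purely for illustration, that a motif is built from three hydrogen bonds chosen among n candidate bonds; the actual enumeration performed by c_motif is not specified in this section, and the function name is hypothetical. The number of unordered triples is n(n-1)(n-2)/6, which grows as n^3.

```python
from itertools import combinations
from math import comb


def count_candidate_triples(n_bonds: int) -> int:
    """Count unordered triples of candidate hydrogen bonds by explicit enumeration."""
    return sum(1 for _ in combinations(range(n_bonds), 3))


# Doubling the number of candidate bonds multiplies the work roughly eightfold.
for n in (10, 20, 40):
    print(n, count_candidate_triples(n), comb(n, 3))
```

In practice only atoms near the cytosine fragment can form such bonds, so n is usually far smaller than the total number of protein atoms a; the cubic bound describes the worst case.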
4. CONCLUSIONS
The present work was aimed at improving the performance of standard docking
techniques, i.e. computational methods designed to predict the conformation and orientation
of an organic molecule in the known structure of a receptor binding site. With this aim in
view we used the “consensus docking” approach based on re-ranking the list of putative
ligand poses generated with a docking program by a more efficient scoring function. By
analogy to the adenine-specific score developed before [4] we introduced a novel method to
re-score the docking solutions for another important class of ligands, cytosine-containing
compounds. To develop the cytosine-specific score we collected from PDB a representative set
of 50 structures of complexes of such ligands with different proteins and generated sets of
docking solutions for each of them in order to obtain positive (native-like) and negative
(misplaced) ligand poses for training our scores. The main conclusions of the present work are:
1) Detailed analysis of intermolecular interactions between the cytosine fragment of a
ligand and its protein environment was carried out. From this analysis we derived the
cytosine-backbone hydrogen bond motifs and compared them with those for adenine. Also
stacking and hydrophobic interactions were analysed showing that the former are important
for cytosine-recognition and that the latter also are important, but more difficult to use in a
scoring function due to cytosine not showing distinctive properties suitable for scoring
hydrophobic interactions using the current approach.
2) Novel scoring functions were created based on various scoring methodologies. Our
novel scores show promising capacity on the computer-generated dataset of 2000 protein-ligand conformers (Dataset 2) for 50 structures known from X-ray crystallography. The best
performance achieved on the limited RMSDcyt-range by our scores is a ME of 0.94 on the
complete Dataset 2 during Complete training and test by the score c_hbond c_stack c_emp1
(describing hydrogen bonds, stacking interactions, and knowledge-based potential,
respectively). The score c_hbond c_stack c_mhp1 (the latter describing hydrophobic
interactions), which was earlier used for scoring complexes with ligands containing adenine
instead of cytosine, was shown to perform well also in the case of cytosine. This score
achieved a ME of 0.9 on the limited RMSDcyt-range during Complete training and test on the
complete Dataset 2. Both mentioned scores show high performance also during Cross- training and test, which is an indication of their robustness when used on our dataset.
3) The newly introduced knowledge-based potential component c_emp1 is shown to
perform comparably well to the hydrophobic component c_mhp1, and shows a similar
capacity to successfully complement more explicit scoring methods. The component c_emp1
does however rely on knowledge-based data, and the quality and availability of such data is
thus defining its performance. Our other newly introduced scoring component c_motif does
seem to enhance the scoring capacity of a score in some cases. We have also shown that it
was possible in this case to create scores which had comparable performance to scores
obtained through fitting functions based on traditional RMSD-measurements, by using the
newly introduced fitting function Term correlation.
5. ACKNOWLEDGMENTS
I would like to thank my supervisor Prof. Roman G. Efremov and Ph.D. Timothy V.
Pyrkov for their assistance and guidance in this project and for giving me the fabulous
opportunity to perform my degree project at the Laboratory of Molecular Modelling at IBCh,
Moscow. Many thanks to Prof. Efremov for his warm reception and his help in all the
practical matters concerning my stay in Moscow, without which this would not have been
possible. My praise goes to our program coordination office with Dr. Margareta Krabbe for the
swift and much needed support and encouragement which helped to make this one of the first
degree projects made in cooperation with a Russian institution. I would also like to express
my gratitude to Ph.D. Torgeir R. Hvidsten for reviewing my report and to Hoda Ibrahim and
Andreas Dahlsten for acting as opponents during the presentation. Last but not least – thanks
to all co-workers and friends connected to IBCh. Without you I would have been just as lost
as that Danish professor.
6. REFERENCES
1. Denessiouk K.A., Rantanen V., Johnson M.S., Adenine Recognition: A Motif Present in ATP-, CoA-, NAD-, NADP-, and FAD-Dependent Proteins, Proteins, 2001;44:282-291.
2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E., The Protein Data Bank, Nucleic Acids Research, 2000;28:235-242.
3. Cappello V., Tramontano A., Koch U., Classification of Proteins Based on the Properties of the Ligand-Binding Site: The Case of Adenine-Binding Proteins, Proteins, 2002;47:106-115.
4. Pyrkov T.V., Kosinsky Y.A., Arseniev A.S., Priestle J.P., Jacoby E., Efremov R.G., Complementarity of Hydrophobic Properties in ATP-Protein Binding: A New Criterion to Rank Docking Solutions, Proteins, 2007;66:388-398.
5. Martin A.C., Orengo C.A., Hutchinson E.G., Jones S., Karmirantzou M., Laskowski R.A., Mitchell J.B., Taroni C., Thornton J.M., Protein folds and functions, Structure, 1998 Jul 15;6(7):875-84.
6. Erlandsen H., Abola E.E., Stevens R.C., Combining structural genomics and enzymology: completing the picture in metabolic pathways and enzyme active sites, Curr Opin Struct Biol, 2000 Dec;10(6):719-30.
7. Mao L., Wang Y., Liu Y., Hu X., Molecular determinants for ATP-binding in proteins: a data mining and quantum chemical analysis, J Mol Biol, 2004 Feb 20;336(3):787-807.
8. Kuttner Y.Y., Sobolev V., Raskind A., Edelman M., A consensus-binding structure for adenine at the atomic level permits searching for the ligand site in a wide spectrum of adenine-containing complexes, Proteins, 2003 Aug 15;52(3):400-11.
9. Kitchen D.B., Decornez H., Furr J.R., Bajorath J., Docking and scoring in virtual screening for drug discovery: methods and applications, Nat Rev Drug Discov, 2004 Nov;3(11):935-49.
10. Kuntz I.D., Blaney J.M., Oatley S.J., Langridge R., Ferrin T.E., A geometric approach to macromolecule-ligand interactions, J Mol Biol, 1982 Oct 25;161(2):269-88.
11. Jorgensen W.L., The many roles of computation in drug discovery, Science, 2004 Mar 19;303(5665):1813-8.
12. Sousa S.F., Fernandes P.A., Ramos M.J., Protein-ligand docking: current status and future challenges, Proteins, 2006 Oct 1;65(1):15-26.
13. Teague S.J., Implications of protein flexibility for drug discovery, Nat Rev Drug Discov, 2003 Jul;2(7):527-41.
14. Connolly M.L., Solvent-accessible surfaces of proteins and nucleic acids, Science, 1983 Aug 19;221(4612):709-13.
15. Ghose A.K., Viswanadhan V.N., Wendoloski J.J., Prediction of hydrophobic (lipophilic) properties of small organic molecules using fragmental methods: an analysis of ALOGP and CLOGP methods, J Phys Chem, 1998;102:3762-3772.
16. Pyrkov T.V., Chugunov A.O., Krylov N.A., Nolde D.E., Efremov R.G., PLATINUM, Laboratory of Biomolecular Modeling at Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russia, 2008. http://model.nmr.ru/platinum
17. Bairoch A., The ENZYME database in 2000, Nucleic Acids Res., 2000;28:304-305.
18. Jones G., Willett P. and Glen R.C., Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation, J. Mol. Biol., 245, 43-53, 1995.
19. Jones G., Willett P., Glen R.C., Leach A.R. and Taylor R., Development and Validation of a Genetic Algorithm for Flexible Docking, J. Mol. Biol., 267, 727-748, 1997.
20. Nissink J.W.M., Murray C., Hartshorn M., Verdonk M.L., Cole J.C. and Taylor R., A new test set for validating predictions of protein-ligand interaction, Proteins, 49, 457-471, 2002.
21. Verdonk M.L., Cole J.C., Hartshorn M.J., Murray C.W. and Taylor R.D., Improved Protein-Ligand Docking Using GOLD, Proteins, 52, 609-623, 2003.
22. Cole J.C., Nissink J.W.M., Taylor R., Protein-Ligand Docking and Virtual Screening with GOLD, in Virtual Screening in Drug Discovery (Eds. Shoichet B., Alvarez J.), Taylor & Francis CRC Press, Boca Raton, Florida, USA (2005).
23. Verdonk M.L., Chessari G., Cole J.C., Hartshorn M.J., Murray C.W., Nissink J.W.M., Taylor R.D. and Taylor R., Modeling Water Molecules in Protein-Ligand Docking Using GOLD, J. Med. Chem., 48, 6504-6515, 2005.
24. Hartshorn M.J., Verdonk M.L., Chessari G., Brewerton S.C., Mooij W.T.M., Mortenson P.N., Murray C.W., Diverse, High-Quality Test Set for the Validation of Protein-Ligand Docking Performance, J. Med. Chem., 50, 726-741, 2007.
25. Friesner R.A., Banks J.L., Murphy R.B., Halgren T.A., Klicic J.J., Mainz D.T., Repasky M.P., Knoll E.H., Shaw D.E., Shelley M., Perry J.K., Francis P., Shenkin P.S., Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy, J. Med. Chem., 2004, 47, 1739-1749.
26. Halgren T.A., Murphy R.B., Friesner R.A., Beard H.S., Frye L.L., Pollard W.T., Banks J.L., Glide: A New Approach for Rapid, Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening, J. Med. Chem., 2004, 47, 1750-1759.
27. Efremov R.G., Chugunov A.O., Pyrkov T.V., Priestle J.P., Arseniev A.S. and Jacoby E., Molecular Lipophilicity in Protein Modeling and Drug Design, Current Medicinal Chemistry, 2007, 14, 393-415.
APPENDIX I – Validation results for new scores
Tables AI.1 to AI.8 show the results from the validations performed for our scores. Each table shows the results of a certain
validation strategy (Complete training and test, Cross- training and test) combined with a certain fitting function (Semi- step-function, Negative RMSD, Negative RMSD with cut-off, Term correlation). Within each table, results are shown for separate
subsets of Dataset 2 (chemscore, goldscore, glide) and for the complete Dataset 2 (all). Each score has results presented for
the full RMSDcyt-range (full) and the reduced RMSDcyt-range (CUT). A number of measurements are presented for each
score: the number of terms figuring in the score, excluding the constant term (terms), Mean Enrichment (ME), Mean Hit Rank
(MHRk), Mean Random Hit Rank (MRHRk), Hit Rate (HRt), Random Hit Rate (RHRt), and the total number of complex types
participating in the measurement (total). See Methods for the explanation of the strategies, fitting functions and validation
measurements. Dockscore and c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock are special scores because they contain
components derived from results obtained by the docking algorithms, and are thus only validated on the subset of Dataset 2
corresponding to the docking algorithm from which the subset originated.
TABLE AI.1. Validation of scoring approaches; Complete training and test, Semi- step-function
                                                     full                                  CUT
score                                         terms  ME    MHRk   MRHRk  HRt  RHRt  total  ME    MHRk   MRHRk  HRt  RHRt  total
chemscore
dockscore                                     1      0.98  1      2.11   9    6     9      0.99  1      1.33   6    5     6
c_hbond                                       5      0.93  1.67   2.11   7    6     9      0.92  1.5    1.33   5    5     6
c_stack                                       2      0.68  3.78   2.11   3    6     9      0.78  1.83   1.33   1    5     6
c_mhp1                                        1      0.78  2.89   2.11   5    6     9      0.78  1.83   1.33   4    5     6
c_emp1                                        7      0.86  2      2.11   5    6     9      0.89  1.17   1.33   5    5     6
c_motif                                       4      0.63  4.22   2.11   3    6     9      0.71  2.33   1.33   2    5     6
c_hbond c_stack                               7      0.99  1.11   2.11   8    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_motif                       11     0.99  1.11   2.11   8    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1                        8      0.96  1.33   2.11   7    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_emp1                        14     0.97  1.33   2.11   7    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1 c_emp1                 15     0.97  1.33   2.11   7    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.97  1.33   2.11   7    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.96  1.44   2.11   7    6     9      1     1      1.33   6    5     6
goldscore
dockscore                                     1      0.83  2.35   2.85   14   7     20     0.76  1.8    1.3    6    7     10
c_hbond                                       5      0.86  1.95   2.85   13   7     20     0.82  1.5    1.3    7    7     10
c_stack                                       2      0.39  6.6    2.85   3    7     20     0.43  3.3    1.3    1    7     10
c_mhp1                                        1      0.73  3.2    2.85   9    7     20     0.64  2.3    1.3    4    7     10
c_emp1                                        7      0.75  3.1    2.85   7    7     20     0.81  1.6    1.3    5    7     10
c_motif                                       4      0.49  5.3    2.85   7    7     20     0.56  2.8    1.3    4    7     10
c_hbond c_stack                               7      0.8   2.55   2.85   11   7     20     0.79  1.6    1.3    7    7     10
c_hbond c_stack c_motif                       11     0.83  2.35   2.85   12   7     20     0.79  1.6    1.3    7    7     10
c_hbond c_stack c_mhp1                        8      0.79  2.75   2.85   10   7     20     0.81  1.5    1.3    7    7     10
c_hbond c_stack c_emp1                        14     0.8   2.6    2.85   12   7     20     0.86  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1                 15     0.79  2.7    2.85   12   7     20     0.85  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.85  2.15   2.85   14   7     20     0.86  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.84  2.2    2.85   14   7     20     0.87  1.4    1.3    8    7     10
glide
dockscore                                     1      0.86  2.25   3.03   25   18    36     0.87  1.85   2.09   24   24    33
c_hbond                                       5      0.83  2.86   3.03   22   18    36     0.87  2.06   2.09   25   24    33
c_stack                                       2      0.58  7.19   3.03   8    18    36     0.61  5.55   2.09   5    24    33
c_mhp1                                        1      0.72  5.19   3.03   14   18    36     0.72  4.21   2.06   15   24    33
c_emp1                                        7      0.77  3.58   3.03   14   18    36     0.82  2.36   2.09   19   24    33
c_motif                                       4      0.43  10.83  3.03   5    18    36     0.49  7.45   2.09   3    24    33
c_hbond c_stack                               7      0.84  2.86   3.03   23   18    36     0.87  2.06   2.09   25   24    33
c_hbond c_stack c_motif                       11     0.84  2.86   3.03   23   18    36     0.87  2.06   2.09   25   24    33
c_hbond c_stack c_mhp1                        8      0.88  2.47   3.03   24   18    36     0.9   1.88   2.06   28   24    33
c_hbond c_stack c_emp1                        14     0.87  2.31   3.03   26   18    36     0.9   1.73   2.09   28   24    33
c_hbond c_stack c_mhp1 c_emp1                 15     0.86  2.5    3.03   27   18    36     0.88  1.97   2.06   27   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.86  2.5    3.03   27   18    36     0.88  1.97   2.06   27   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.86  2.44   3.03   27   18    36     0.88  1.97   2.06   27   24    33
all
c_hbond                                       5      0.85  4.02   4.27   23   14    41     0.88  2.3    1.83   31   27    40
c_stack                                       2      0.54  13.95  4.27   7    14    41     0.64  6.15   1.83   9    27    40
c_mhp1                                        1      0.71  8.95   4.2    12   15    41     0.74  4.35   1.8    16   27    40
c_emp1                                        7      0.77  5.95   4.27   11   14    41     0.84  2.15   1.83   22   27    40
c_motif                                       4      0.4   20.76  4.27   9    14    41     0.56  8.45   1.83   8    27    40
c_hbond c_stack                               7      0.8   5.46   4.27   19   14    41     0.87  2.58   1.83   30   27    40
c_hbond c_stack c_motif                       11     0.83  4.78   4.27   21   14    41     0.87  2.58   1.83   30   27    40
c_hbond c_stack c_mhp1                        8      0.81  5.2    4.2    21   15    41     0.88  2.38   1.8    28   27    40
c_hbond c_stack c_emp1                        14     0.85  3.85   4.27   23   14    41     0.91  1.77   1.83   33   27    40
c_hbond c_stack c_mhp1 c_emp1                 15     0.87  3.54   4.2    27   15    41     0.93  1.73   1.8    33   27    40
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.9   3.15   4.2    29   15    41     0.93  1.73   1.8    33   27    40
TABLE AI.2. Validation of scoring approaches; Complete training and test, Negative RMSD
                                                     full                                  CUT
score                                         terms  ME    MHRk   MRHRk  HRt  RHRt  total  ME    MHRk   MRHRk  HRt  RHRt  total
chemscore
dockscore                                     1      0.98  1      2.11   9    6     9      0.99  1      1.33   6    5     6
c_hbond                                       5      0.93  1.67   2.11   7    6     9      0.92  1.5    1.33   5    5     6
c_stack                                       2      0.68  3.78   2.11   3    6     9      0.78  1.83   1.33   1    5     6
c_mhp1                                        1      0.78  2.89   2.11   5    6     9      0.78  1.93   1.33   4    5     6
c_emp1                                        7      0.93  1.56   2.11   6    6     9      0.98  1      1.33   6    5     6
c_motif                                       4      0.63  4.22   2.11   3    6     9      0.71  2.33   1.33   2    5     6
c_hbond c_stack                               7      0.91  1.89   2.11   8    6     9      0.89  1.67   1.33   5    5     6
c_hbond c_stack c_motif                       11     0.91  1.89   2.11   8    6     9      0.89  1.67   1.33   5    5     6
c_hbond c_stack c_mhp1                        8      0.94  1.56   2.11   7    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_emp1                        14     0.93  1.67   2.11   8    6     9      0.92  1.5    1.33   5    5     6
c_hbond c_stack c_mhp1 c_emp1                 15     0.93  1.67   2.11   8    6     9      0.92  1.5    1.33   5    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.93  1.67   2.11   8    6     9      0.92  1.5    1.33   5    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.93  1.67   2.11   8    6     9      0.92  1.5    1.33   5    5     6
goldscore
dockscore                                     1      0.83  2.35   2.85   14   7     20     0.76  1.8    1.3    6    7     10
c_hbond                                       5      0.86  1.9    2.85   13   7     20     0.82  1.5    1.3    7    7     10
c_stack                                       2      0.39  6.6    2.85   3    7     20     0.43  3.3    1.3    1    7     10
c_mhp1                                        1      0.73  3.2    2.85   9    7     20     0.64  2.3    1.3    4    7     10
c_emp1                                        7      0.77  3      2.85   9    7     20     0.83  1.5    1.3    6    7     10
c_motif                                       4      0.49  5.3    2.85   7    7     20     0.56  2.8    1.3    4    7     10
c_hbond c_stack                               7      0.84  2.25   2.85   13   7     20     0.79  1.6    1.3    7    7     10
c_hbond c_stack c_motif                       11     0.84  2.2    2.85   14   7     20     0.79  1.6    1.3    7    7     10
c_hbond c_stack c_mhp1                        8      0.88  1.8    2.85   12   7     20     0.81  1.6    1.3    7    7     10
c_hbond c_stack c_emp1                        14     0.91  1.6    2.85   15   7     20     0.92  1.2    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1                 15     0.87  2.05   2.85   12   7     20     0.88  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.91  1.7    2.85   14   7     20     0.89  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.91  1.6    2.85   16   7     20     0.84  1.5    1.3    7    7     10
glide
dockscore                                     1      0.86  2.25   3.03   25   18    36     0.87  1.85   2.09   24   24    33
c_hbond                                       5      0.84  2.72   3.03   23   18    36     0.86  2.12   2.09   24   24    33
c_stack                                       2      0.58  7.19   3.03   8    18    36     0.61  5.55   2.09   5    24    33
c_mhp1                                        1      0.72  5.19   3.03   14   18    36     0.72  4.21   2.06   15   24    33
c_emp1                                        7      0.84  2.36   3.03   20   18    36     0.84  2.21   2.09   22   24    33
c_motif                                       4      0.43  10.83  3.03   5    18    36     0.49  7.45   2.09   3    24    33
c_hbond c_stack                               7      0.86  2.61   3.03   25   18    36     0.88  2.09   2.09   25   24    33
c_hbond c_stack c_motif                       11     0.86  2.58   3.03   25   18    26     0.88  2.06   2.09   26   24    33
c_hbond c_stack c_mhp1                        8      0.87  2.44   3.03   26   18    36     0.88  2.09   2.06   25   24    33
c_hbond c_stack c_emp1                        14     0.89  2      3.03   28   18    36     0.9   1.7    2.09   27   24    33
c_hbond c_stack c_mhp1 c_emp1                 15     0.91  2.11   3.03   29   18    36     0.91  1.91   2.06   28   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.91  2.11   3.03   29   18    36     0.91  1.91   2.06   28   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.91  2.11   3.03   29   18    36     0.92  1.94   2.06   27   24    33
all
c_hbond                                       5      0.85  3.88   4.27   26   14    41     0.88  2.22   1.83   32   27    40
c_stack                                       2      0.5   15.02  4.27   4    14    41     0.62  6.5    1.83   8    27    40
c_mhp1                                        1      0.71  8.95   4.2    12   15    41     0.74  4.35   1.8    16   27    40
c_emp1                                        7      0.84  4.24   4.27   17   14    41     0.88  1.83   1.83   29   27    40
c_motif                                       4      0.4   20.76  4.27   9    14    41     0.56  8.45   1.83   8    27    40
c_hbond c_stack                               7      0.85  4.15   4.27   25   14    41     0.89  2.22   1.83   32   27    40
c_hbond c_stack c_motif                       11     0.86  4.12   4.27   26   14    14     0.89  2.22   1.83   32   27    40
c_hbond c_stack c_mhp1                        8      0.87  3.8    4.2    27   15    41     0.89  2.3    1.8    30   27    40
c_hbond c_stack c_emp1                        14     0.92  2.56   4.27   30   14    41     0.94  1.67   1.83   34   27    40
c_hbond c_stack c_mhp1 c_emp1                 15     0.91  2.78   4.2    30   15    41     0.93  1.85   1.8    33   27    40
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.92  2.78   4.2    30   15    41     0.92  1.85   1.8    33   27    40
TABLE AI.3. Validation of scoring approaches; Complete training and test, Negative RMSD with cut-off
                                                     full                                  CUT
score                                         terms  ME    MHRk   MRHRk  HRt  RHRt  total  ME    MHRk   MRHRk  HRt  RHRt  total
chemscore
dockscore                                     1      0.98  1      2.11   9    6     9      0.99  1      1.33   6    5     6
c_hbond                                       5      0.94  1.56   2.11   8    6     9      0.92  1.5    1.33   5    5     6
c_stack                                       2      0.68  3.78   2.11   3    6     9      0.78  1.83   1.33   1    5     6
c_mhp1                                        1      0.78  2.89   2.11   5    6     9      0.78  1.83   1.33   4    5     6
c_emp1                                        7      0.92  1.56   2.11   6    6     9      0.97  1      1.33   6    5     6
c_motif                                       4      0.63  4.22   2.11   3    6     9      0.71  2.33   1.33   2    5     6
c_hbond c_stack                               7      1     1      2.11   9    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_motif                       11     1     1      2.11   9    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1                        8      0.98  1.22   2.11   8    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_emp1                        14     1     1      2.11   9    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1 c_emp1                 15     1     1      2.11   9    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     1     1      2.11   9    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.98  1.22   2.11   7    6     9      1     1      1.33   6    5     6
goldscore
dockscore                                     1      0.83  2.35   2.85   14   7     20     0.76  1.8    1.3    6    7     10
c_hbond                                       5      0.85  2      2.85   13   7     20     0.82  1.5    1.3    7    7     10
c_stack                                       2      0.39  6.6    2.85   3    7     20     0.43  3.3    1.3    1    7     10
c_mhp1                                        1      0.73  3.2    2.85   9    7     20     0.64  2.3    1.3    4    7     10
c_emp1                                        7      0.82  2.55   2.85   8    7     20     0.86  1.5    1.3    5    7     10
c_motif                                       4      0.49  5.3    2.85   7    7     20     0.56  2.8    1.3    4    7     10
c_hbond c_stack                               7      0.74  3.2    2.85   9    7     20     0.77  1.7    1.3    6    7     10
c_hbond c_stack c_motif                       11     0.81  2.55   2.85   11   7     20     0.78  1.7    1.3    6    7     10
c_hbond c_stack c_mhp1                        8      0.78  2.9    2.85   9    7     20     0.79  1.7    1.3    6    7     10
c_hbond c_stack c_emp1                        14     0.84  2.2    2.85   14   7     20     0.86  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1                 15     0.85  2.1    2.85   13   7     20     0.83  1.5    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.89  1.7    2.85   15   7     20     0.86  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.91  1.6    2.85   16   7     20     0.84  1.5    1.3    7    7     10
glide
dockscore                                     1      0.86  2.25   3.03   25   18    36     0.87  1.85   2.09   24   24    33
c_hbond                                       5      0.85  2.47   3.03   22   18    36     0.86  1.97   2.09   23   24    33
c_stack                                       2      0.55  7.75   3.03   7    18    36     0.58  5.76   2.09   5    24    33
c_mhp1                                        1      0.72  5.19   3.03   14   18    36     0.72  4.21   2.06   15   24    33
c_emp1                                        7      0.82  2.67   3.03   17   18    36     0.84  2.06   2.09   22   24    33
c_motif                                       4      0.43  10.83  3.03   5    18    36     0.49  7.45   2.09   3    24    33
c_hbond c_stack                               7      0.86  2.47   3.03   24   18    36     0.88  1.91   2.09   24   24    33
c_hbond c_stack c_motif                       11     0.86  2.47   3.03   24   18    36     0.88  1.91   2.09   24   24    33
c_hbond c_stack c_mhp1                        8      0.85  2.56   3.03   25   18    36     0.87  2.12   2.06   25   24    33
c_hbond c_stack c_emp1                        14     0.89  2.19   3.03   26   18    36     0.92  1.79   2.09   28   24    33
c_hbond c_stack c_mhp1 c_emp1                 15     0.88  2.42   3.03   24   18    36     0.89  2.09   2.06   26   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.87  2.42   3.03   24   18    36     0.89  2.09   2.06   26   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.88  2.39   3.03   24   18    36     0.89  2.06   2.06   26   24    33
all
c_hbond                                       5      0.84  3.95   4.27   25   14    41     0.88  2.3    1.83   31   27    40
c_stack                                       2      0.5   15.02  4.27   4    14    41     0.62  6.5    1.83   8    27    40
c_mhp1                                        1      0.71  8.95   4.2    12   15    41     0.74  4.35   1.8    16   27    40
c_emp1                                        7      0.83  4.63   4.27   11   14    41     0.86  2.2    1.83   22   27    40
c_motif                                       4      0.4   20.76  4.27   9    14    41     0.56  8.45   1.83   8    27    40
c_hbond c_stack                               7      0.84  4.24   4.27   25   14    41     0.89  2.3    1.83   30   27    40
c_hbond c_stack c_motif                       11     0.85  4.24   4.27   25   14    41     0.89  2.3    1.83   30   37    40
c_hbond c_stack c_mhp1                        8      0.86  4.12   4.2    25   15    41     0.89  2.33   1.8    29   27    40
c_hbond c_stack c_emp1                        14     0.88  3.34   4.27   25   14    41     0.92  1.7    1.83   34   27    40
c_hbond c_stack c_mhp1 c_emp1                 15     0.88  3.68   4.2    26   15    41     0.9   2.12   1.8    31   27    40
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.89  3.66   4.2    26   15    41     0.9   2.12   1.8    31   27    40
TABLE AI.4. Validation of scoring approaches; Complete training and test, Term correlation
                                                     full                                  CUT
score                                         terms  ME    MHRk   MRHRk  HRt  RHRt  total  ME    MHRk   MRHRk  HRt  RHRt  total
chemscore
dockscore                                     1      0.98  1      2.11   9    6     9      0.99  1      1.33   6    5     6
c_hbond                                       5      0.93  1.67   2.11   7    6     9      0.92  1.5    1.33   5    5     6
c_stack                                       2      0.68  3.78   2.11   3    6     9      0.78  1.83   1.33   1    5     6
c_mhp1                                        1      0.78  2.89   2.11   5    6     9      0.78  1.83   1.33   4    5     6
c_emp1                                        7      0.92  1.56   2.11   6    6     9      0.98  1      1.33   6    5     6
c_motif                                       4      0.63  4.22   2.11   3    6     9      0.71  2.33   1.33   2    5     6
c_hbond c_stack                               7      0.97  1.33   2.11   8    6     9      0.94  1.33   1.33   5    5     6
c_hbond c_stack c_motif                       11     1     1      2.11   9    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1                        8      0.91  1.78   2.11   7    6     9      0.83  1.67   1.33   4    5     6
c_hbond c_stack c_emp1                        14     0.94  1.56   2.11   8    6     9      0.94  1.33   1.33   5    5     6
c_hbond c_stack c_mhp1 c_emp1                 15     0.92  1.78   2.11   7    6     9      0.92  1.5    1.33   5    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.99  1.11   2.11   8    6     9      1     1      1.33   6    5     6
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.99  1.11   2.11   8    6     9      1     1      1.33   6    5     6
goldscore
dockscore                                     1      0.83  2.35   2.85   14   7     20     0.76  1.8    1.3    6    7     10
c_hbond                                       5      0.85  2.05   2.85   13   7     20     0.82  1.5    1.3    7    7     10
c_stack                                       2      0.39  6.6    2.85   3    7     20     0.43  3.3    1.3    1    7     10
c_mhp1                                        1      0.73  3.2    2.85   9    7     20     0.64  2.3    1.3    4    7     10
c_emp1                                        7      0.8   2.8    2.85   8    7     20     0.79  1.7    1.3    5    7     10
c_motif                                       4      0.49  5.3    2.85   7    7     20     0.56  2.8    1.3    4    7     10
c_hbond c_stack                               7      0.84  2.25   2.85   13   7     20     0.82  1.5    1.3    7    7     10
c_hbond c_stack c_motif                       11     0.83  2.25   2.85   14   7     20     0.78  1.6    1.3    7    7     10
c_hbond c_stack c_mhp1                        8      0.87  1.95   2.85   14   7     20     0.88  1.4    1.3    8    7     10
c_hbond c_stack c_emp1                        14     0.92  1.5    2.85   13   7     20     0.89  1.3    1.3    7    7     10
c_hbond c_stack c_mhp1 c_emp1                 15     0.83  2.3    2.85   12   7     20     0.88  1.4    1.3    8    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.87  2.05   2.85   14   7     20     0.81  1.6    1.3    7    7     10
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.88  1.95   2.85   14   7     20     0.81  1.6    1.3    7    7     10
glide
dockscore                                     1      0.86  2.25   3.03   25   18    36     0.87  1.85   2.09   24   24    33
c_hbond                                       5      0.85  2.53   3.03   21   18    36     0.86  2      2.09   22   24    33
c_stack                                       2      0.55  7.61   3.03   7    18    36     0.58  5.73   2.09   5    24    33
c_mhp1                                        1      0.72  5.19   3.03   14   18    36     0.72  4.21   2.06   15   24    33
c_emp1                                        7      0.85  2.53   3.03   18   18    36     0.87  1.94   2.09   23   24    33
c_motif                                       4      0.43  10.83  3.03   5    18    36     0.49  7.45   2.09   3    24    33
c_hbond c_stack                               7      0.87  2.39   3.03   23   18    36     0.88  1.97   2.09   23   24    33
c_hbond c_stack c_motif                       11     0.86  2.56   3.03   26   18    36     0.88  2.06   2.09   26   24    33
c_hbond c_stack c_mhp1                        8      0.88  2.14   3.03   25   18    36     0.89  1.94   2.06   24   24    33
c_hbond c_stack c_emp1                        14     0.9   1.94   3.03   24   18    36     0.92  1.48   2.09   26   24    33
c_hbond c_stack c_mhp1 c_emp1                 15     0.91  2.06   3.03   27   18    36     0.91  1.85   2.06   26   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.89  2.19   3.03   28   18    36     0.91  1.94   2.06   28   24    33
c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock  20     0.89  2.19   3.03   27   18    36     0.9   1.94   2.06   27   24    33
all
c_hbond                                       5      0.85  3.66   4.27   25   14    41     0.88  2.22   1.83   32   27    40
c_stack                                       2      0.52  14.17  4.27   4    14    41     0.62  6.3    1.83   8    27    40
c_mhp1                                        1      0.71  8.95   4.2    12   15    41     0.74  4.35   1.8    16   27    40
c_emp1                                        7      0.79  5.34   4.27   10   14    41     0.85  2.15   1.83   24   27    40
c_motif                                       4      0.4   20.76  4.27   9    14    41     0.56  8.45   1.83   8    27    40
c_hbond c_stack                               7      0.86  3.46   4.27   26   14    41     0.9   2.2    1.83   32   27    40
c_hbond c_stack c_motif                       11     0.86  3.56   4.27   28   14    41     0.89  2.12   1.83   33   27    40
c_hbond c_stack c_mhp1                        8      0.88  3.15   4.2    29   15    41     0.9   2.12   1.8    32   27    40
c_hbond c_stack c_emp1                        14     0.89  2.83   4.27   23   14    41     0.92  1.58   1.83   32   27    40
c_hbond c_stack c_mhp1 c_emp1                 15     0.89  3.22   4.2    25   15    41     0.9   2.15   1.8    30   27    40
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0.9   3.1    4.2    29   15    41     0.92  1.95   1.8    33   27    40
TABLE AI.5a Validation of scoring approaches; Cross-training and test (Mean results), Semi-step-function

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,835   4,485   4,920   2,300   1,300   4,500   0,875   2,070   2,014   3,150   2,700   4,350
c_stack                                       2      0,549   12,408  4,010   0,550   1,400   4,300   0,668   5,950   1,664   1,000   3,000   4,316
c_mhp1                                        1      0,727   8,236   3,966   1,350   1,700   4,050   0,737   4,091   1,832   1,600   2,800   3,900
c_emp1                                        7      0,754   6,093   4,406   1,400   1,650   5,250   0,825   2,150   2,045   2,650   3,250   5,050
c_motif                                       4      0,454   18,147  3,467   1,000   1,895   4,105   0,613   7,209   1,486   0,947   3,053   4,053
c_hbond c_stack                               7      0,833   4,518   4,220   2,900   2,000   5,900   0,876   2,680   1,838   4,300   4,000   5,850
c_hbond c_stack c_motif                       11     0,843   4,582   3,446   2,300   1,700   4,200   0,891   2,662   1,660   3,300   3,050   4,050
c_hbond c_stack c_mhp1                        8      0,876   3,218   3,929   2,950   1,850   5,100   0,912   1,692   1,734   3,900   3,100   4,950
c_hbond c_stack c_emp1                        14     0,847   4,036   4,333   2,100   2,000   4,650   0,914   1,851   1,935   3,850   3,050   4,400
c_hbond c_stack c_mhp1 c_emp1                 15     0,902   2,551   3,718   3,050   1,400   4,600   0,933   1,377   1,485   3,750   3,100   4,450
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,917   2,734   3,503   3,700   2,000   4,850   0,950   1,372   1,657   4,200   3,250   4,650
TABLE AI.5b Validation of scoring approaches; Cross-training and test (Variances), Semi-step-function

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,010   12,931  7,351   1,010   1,710   1,850   0,005   2,473   1,190   1,828   1,410   1,927
c_stack                                       2      0,038   38,370  3,831   0,448   1,240   4,010   0,044   19,306  0,497   1,000   3,000   4,565
c_mhp1                                        1      0,017   23,624  8,918   1,328   1,110   1,747   0,013   3,758   1,031   1,240   1,360   1,390
c_emp1                                        7      0,007   9,953   3,727   1,240   0,727   7,488   0,003   0,906   1,308   2,628   4,388   7,047
c_motif                                       4      0,035   51,003  5,017   0,895   1,757   4,665   0,045   10,729  0,608   0,834   3,698   4,283
c_hbond c_stack                               7      0,009   9,274   4,275   1,590   2,000   4,390   0,009   4,376   0,898   3,810   3,300   4,328
c_hbond c_stack c_motif                       11     0,008   12,781  3,150   1,310   0,810   3,360   0,011   8,570   0,952   1,910   1,248   3,548
c_hbond c_stack c_mhp1                        8      0,006   4,698   3,444   2,148   1,228   3,590   0,003   0,715   0,445   2,290   1,490   3,248
c_hbond c_stack c_emp1                        14     0,005   8,808   5,014   2,090   1,200   3,728   0,006   4,121   0,932   2,928   2,847   3,440
c_hbond c_stack c_mhp1 c_emp1                 15     0,003   1,654   2,954   1,148   0,740   1,840   0,002   0,336   0,196   1,488   1,290   1,748
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,004   4,039   1,443   1,610   0,900   1,628   0,001   0,396   0,385   1,460   1,888   1,527
TABLE AI.6a Validation of scoring approaches; Cross-training and test (Mean results), Negative RMSD

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,907   2,278   4,050   3,050   1,350   4,150   0,935   1,358   1,574   3,300   2,650   3,850
c_stack                                       2      0,455   15,955  4,278   0,450   1,100   4,200   0,576   7,286   2,029   0,750   2,600   4,150
c_mhp1                                        1      0,707   9,362   4,220   1,450   1,850   5,050   0,720   4,287   1,562   1,750   3,350   4,900
c_emp1                                        7      0,838   4,119   4,090   2,200   1,650   5,200   0,885   1,756   1,704   3,700   3,100   5,050
c_motif                                       4      0,415   19,963  3,925   1,300   1,850   5,550   0,561   8,440   1,820   1,000   3,600   5,250
c_hbond c_stack                               7      0,846   4,185   3,910   3,150   1,650   4,700   0,882   2,263   1,799   3,750   3,100   4,650
c_hbond c_stack c_motif                       11     0,893   3,297   3,459   3,350   1,750   4,350   0,926   2,094   1,748   3,900   3,300   4,350
c_hbond c_stack c_mhp1                        8      0,868   3,520   4,064   3,333   2,000   5,667   0,863   2,616   1,727   3,722   3,667   5,500
c_hbond c_stack c_emp1                        14     0,891   2,904   4,303   3,500   1,750   4,950   0,940   1,568   1,793   4,150   3,400   4,800
c_hbond c_stack c_mhp1 c_emp1                 15     0,908   3,019   3,900   3,300   1,750   4,600   0,921   1,865   1,755   3,800   3,200   4,600
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,930   2,347   4,103   3,500   1,700   4,600   0,939   1,655   1,592   3,900   3,400   4,550
TABLE AI.6b Validation of scoring approaches; Cross-training and test (Variances), Negative RMSD

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,004   1,645   3,019   2,048   1,628   2,927   0,003   0,307   0,422   2,110   1,628   2,927
c_stack                                       2      0,026   33,285  4,954   0,448   0,890   3,960   0,022   8,873   0,861   0,588   2,640   3,727
c_mhp1                                        1      0,038   43,321  3,926   1,047   1,628   4,348   0,015   2,651   0,248   1,388   2,428   4,590
c_emp1                                        7      0,009   7,159   2,617   1,160   0,928   1,860   0,003   0,417   0,352   1,210   0,790   1,948
c_motif                                       4      0,015   25,560  1,543   1,210   1,127   4,547   0,010   8,280   0,571   1,100   2,640   3,988
c_hbond c_stack                               7      0,008   10,115  3,100   1,528   0,927   2,510   0,006   5,320   0,574   1,688   1,190   2,228
c_hbond c_stack c_motif                       11     0,008   11,048  3,072   2,328   0,888   2,827   0,008   6,466   0,793   2,690   1,510   2,827
c_hbond c_stack c_mhp1                        8      0,091   7,907   5,942   5,568   3,333   8,012   0,092   3,852   0,867   5,629   6,049   7,722
c_hbond c_stack c_emp1                        14     0,004   3,945   4,756   2,650   1,688   2,948   0,004   1,353   0,795   2,528   2,540   2,660
c_hbond c_stack c_mhp1 c_emp1                 15     0,008   8,943   3,389   3,710   1,088   4,940   0,007   4,634   1,029   4,060   3,760   4,940
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,002   2,598   2,035   2,350   1,110   2,940   0,003   1,580   0,553   2,190   2,240   2,948
TABLE AI.7a Validation of scoring approaches; Cross-training and test (Mean results), Negative RMSD with cut-off

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,869   3,197   4,864   2,750   1,500   4,800   0,913   1,508   1,712   3,750   3,100   4,550
c_stack                                       2      0,438   19,083  4,985   0,550   1,100   4,200   0,562   7,609   1,913   0,750   2,400   4,150
c_mhp1                                        1      0,737   8,107   4,094   1,100   1,500   4,200   0,778   3,868   1,809   1,750   2,650   4,150
c_emp1                                        7      0,804   5,311   4,363   1,350   1,750   5,400   0,828   2,416   1,742   2,800   3,450   5,150
c_motif                                       4      0,418   19,456  4,223   1,211   1,737   4,684   0,558   7,928   1,677   0,947   3,263   4,421
c_hbond c_stack                               7      0,863   3,938   3,709   2,850   1,700   4,850   0,920   1,831   1,713   3,900   3,400   4,800
c_hbond c_stack c_motif                       11     0,837   4,830   4,336   3,200   1,600   5,200   0,882   2,770   1,923   4,150   3,600   5,150
c_hbond c_stack c_mhp1                        8      0,888   3,413   4,757   3,150   1,950   5,150   0,910   1,781   1,671   3,800   3,500   5,050
c_hbond c_stack c_emp1                        14     0,879   3,320   4,004   2,800   1,550   4,400   0,921   1,634   1,891   3,750   2,800   4,400
c_hbond c_stack c_mhp1 c_emp1                 15     0,916   2,688   3,988   3,800   2,200   5,450   0,950   1,422   1,622   4,500   3,800   5,400
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,884   3,705   4,204   3,250   1,800   5,000   0,908   1,849   1,801   3,800   3,200   4,950
TABLE AI.7b Validation of scoring approaches; Cross-training and test (Variances), Negative RMSD with cut-off

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,004   2,831   4,484   2,188   1,250   5,360   0,005   0,482   0,456   3,588   2,390   5,048
c_stack                                       2      0,037   66,716  5,606   0,448   0,690   3,760   0,015   14,403  0,437   0,988   2,840   3,728
c_mhp1                                        1      0,012   16,495  5,709   0,890   0,950   2,060   0,008   3,829   0,544   0,888   1,528   1,828
c_emp1                                        7      0,012   9,364   2,953   1,428   0,988   2,840   0,009   1,371   0,433   3,460   2,647   2,527
c_motif                                       4      0,026   61,229  6,836   1,085   1,511   3,371   0,037   14,771  1,180   0,939   2,860   2,957
c_hbond c_stack                               7      0,010   8,599   3,235   1,927   1,410   4,328   0,003   2,025   0,687   2,990   2,840   4,360
c_hbond c_stack c_motif                       11     0,006   10,349  2,611   2,760   1,440   5,260   0,007   6,442   0,643   4,627   2,640   5,027
c_hbond c_stack c_mhp1                        8      0,006   4,457   10,090  2,128   1,347   3,228   0,006   1,813   0,658   2,360   2,550   2,848
c_hbond c_stack c_emp1                        14     0,004   5,578   3,707   1,660   1,047   3,440   0,002   0,774   0,672   2,388   1,660   3,440
c_hbond c_stack c_mhp1 c_emp1                 15     0,005   3,436   4,095   3,160   1,260   4,248   0,001   0,231   0,259   3,150   2,460   4,340
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,007   6,697   3,518   2,488   2,160   4,700   0,004   1,041   0,619   2,860   2,760   5,048
TABLE AI.8a Validation of scoring approaches; Cross-training and test (Mean results), Term correlation

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,821   4,593   4,507   2,650   1,350   4,550   0,851   2,920   1,974   3,250   2,750   4,350
c_stack                                       2      0,471   16,368  4,417   0,250   1,900   4,750   0,610   6,820   1,885   0,800   2,950   4,500
c_mhp1                                        1      0,724   8,366   3,753   1,300   1,550   4,100   0,764   3,985   1,601   1,800   2,750   4,050
c_emp1                                        7      0,801   4,870   4,118   1,400   1,450   4,250   0,850   2,011   1,693   2,800   2,800   4,200
c_motif                                       4      0,365   22,087  4,219   0,850   1,550   4,650   0,525   9,229   1,760   0,750   2,850   4,550
c_hbond c_stack                               7      0,858   3,531   4,395   2,750   1,600   4,550   0,897   2,210   1,800   3,650   2,800   4,450
c_hbond c_stack c_motif                       11     0,897   2,752   3,683   3,900   2,250   5,050   0,928   1,430   1,690   4,400   3,700   5,000
c_hbond c_stack c_mhp1                        8      0,906   2,232   3,394   4,150   2,050   5,350   0,930   1,567   1,473   4,500   3,750   5,250
c_hbond c_stack c_emp1                        14     0,923   2,062   3,622   3,600   1,900   5,400   0,947   1,204   1,397   4,700   3,900   5,250
c_hbond c_stack c_mhp1 c_emp1                 15     0,910   2,848   3,605   2,800   1,500   4,500   0,909   1,998   1,644   3,400   2,850   4,450
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,911   2,757   3,193   3,150   1,600   4,250   0,919   1,920   1,521   3,450   3,100   4,100
TABLE AI.8b Validation of scoring approaches; Cross-training and test (Variances), Term correlation

all
                                                     all                                             CUT
score                                         terms  ME      MHRk    MRHRk   HRt     RHRt    total   ME      MHRk    MRHRk   HRt     RHRt    total
c_hbond                                       5      0,010   8,930   3,832   2,027   0,827   2,948   0,015   6,537   1,003   2,788   2,588   3,028
c_stack                                       2      0,013   17,659  1,854   0,188   1,290   4,188   0,013   8,608   0,681   0,460   2,148   4,250
c_mhp1                                        1      0,012   14,992  3,874   1,410   1,148   6,290   0,013   4,140   0,411   1,560   2,488   6,148
c_emp1                                        7      0,009   7,188   3,857   1,340   1,148   2,888   0,009   0,891   0,480   2,060   1,860   2,760
c_motif                                       4      0,025   42,155  3,739   0,628   1,147   3,928   0,014   8,269   0,346   0,588   1,727   3,948
c_hbond c_stack                               7      0,010   11,837  3,363   2,288   1,340   2,748   0,008   6,006   0,666   1,928   1,660   2,448
c_hbond c_stack c_motif                       11     0,005   2,829   5,478   3,290   2,688   3,548   0,003   1,094   0,662   3,840   5,510   3,600
c_hbond c_stack c_mhp1                        8      0,003   3,193   2,513   2,528   2,248   3,628   0,003   1,145   0,427   2,350   2,388   2,888
c_hbond c_stack c_emp1                        14     0,002   0,619   2,831   3,540   1,890   5,240   0,002   0,127   0,265   3,810   2,890   5,088
c_hbond c_stack c_mhp1 c_emp1                 15     0,004   4,605   4,761   2,660   1,050   4,750   0,005   1,638   0,590   3,340   2,528   4,348
c_hbond c_stack c_mhp1 c_emp1 c_motif         19     0,004   5,107   2,468   1,828   0,940   2,688   0,004   3,217   0,530   1,748   1,590   2,290
Table AI.9 shows the mean values and variances of the weighting coefficients obtained from the cross-training and test procedure
for each score calculated on the complete Dataset 2. The term with the weight marked “w4_c_hbond(N3’)” is the result of a bug
that added an extra term to the c_hbond component. This extra term captures hydrogen-bond interactions in the same way as the
N3-scoring term of c_hbond (with weight w2_c_hbond(N3)), but it does so for all aromatic-marked nitrogen atoms in the ligand,
not only N3. The c_hbond component therefore has five terms in total, two of which capture the same hydrogen-bond effects of N3.
In addition, the combinatorial term in c_hbond (with weight w5_c_hbond(N3’,N4)) depends on this “aromatic” term rather than on
the term scoring N3. For ligands that contain aromatic nitrogens other than the cytosine N3, the bug term thus introduces a
“model error”, which might have slightly improved the performance of the c_hbond component.
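To make the effect of the extra term concrete, the following minimal Python sketch contrasts the N3 term with a term that runs over all aromatic-marked nitrogens. The `Atom` record, the per-atom `hbond` contribution, the atom label "N7" and the product form of the combinatorial term are assumptions made for illustration only; the actual functional forms of the c_hbond terms are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Atom:
    name: str        # atom label, e.g. "N3" (hypothetical naming)
    element: str     # chemical element symbol
    aromatic: bool   # True if the atom is marked aromatic
    hbond: float     # assumed per-atom hydrogen-bond contribution for one pose


def c_hbond_terms(ligand: List[Atom]) -> Dict[str, float]:
    """Return five c_hbond-like terms for one pose (illustrative encoding only)."""
    by_name = {a.name: a.hbond for a in ligand}
    o2 = by_name.get("O2", 0.0)
    n3 = by_name.get("N3", 0.0)
    n4 = by_name.get("N4", 0.0)
    # Extra term: sums over every aromatic-marked nitrogen, not only N3.
    n3_prime = sum(a.hbond for a in ligand if a.element == "N" and a.aromatic)
    # Assumption: the combinatorial term is modeled as a product and uses N3', not N3.
    return {"O2": o2, "N3": n3, "N4": n4, "N3'": n3_prime, "N3',N4": n3_prime * n4}


# A ligand whose only aromatic nitrogen is N3: both terms coincide.
cytosine_like = [
    Atom("O2", "O", False, 1.0),
    Atom("N3", "N", True, 1.0),
    Atom("N4", "N", False, 1.0),
]
# A ligand with one more aromatic nitrogen (hypothetical "N7"): the terms differ.
extra_aromatic_n = cytosine_like + [Atom("N7", "N", True, 1.0)]

terms_plain = c_hbond_terms(cytosine_like)
terms_extra = c_hbond_terms(extra_aromatic_n)
```

In this sketch the two terms are identical when N3 is the only aromatic nitrogen in the ligand, and they diverge only when another aromatic nitrogen is present.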
Diagrams AI.10 through AI.14 illustrate the weighting coefficients listed in Table AI.9. Diagram AI.10 shows the mean weighting
coefficients of the full score (c_hbond c_stack c_mhp1 c_emp1 c_motif) from the cross-training and test procedure. Diagrams AI.11
through AI.14 show the mean weighting coefficients of all scores from the cross-training and test procedure, one diagram per
fitting function.
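As a minimal sketch of how per-weight means and variances like those in Table AI.9 can be tabulated, assume that each cross-training run yields a dictionary of fitted weights. The function name, the example values and the choice of the population variance are assumptions; the text does not state which variance estimator was used.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def summarize_weights(runs: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Map each weight name to (mean, population variance) over all runs."""
    summary = {}
    for name in runs[0].keys():
        values = [run[name] for run in runs]
        summary[name] = (mean(values), pvariance(values))
    return summary


# Hypothetical fitted weights from three runs (not data from the thesis).
example_runs = [
    {"w0": -0.2, "w1_c_hbond(O2)": 1.0},
    {"w0": -0.1, "w1_c_hbond(O2)": 1.0},
    {"w0": -0.3, "w1_c_hbond(O2)": 0.4},
]
example_summary = summarize_weights(example_runs)
```

A weight that is fitted to the same value in every run has zero variance, which is one way to read the many 0,000 entries in the variance table.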
TABLE AI.9a Weighting coefficients; Cross-training and test (Mean results)
Weights: w0, w1_c_hbond(O2), w2_c_hbond(N3), w3_c_hbond(N4), w4_c_hbond(N3’), w5_c_hbond(N3’,N4), w6_c_stack(phe,tyr,trp,his), w7_cstack(arg), w8_c_mhp1, w9_c_emp1(O2_HNx), w10_c_emp1(O2_HOx), w11_c_emp1(N3_HNx), w12_c_emp1(N3_HOx), w13_c_emp1(N4_Nx), w14_c_emp1(N4_Ox), w15_c_emp1(Ar), w16_c_motif(O2,N3,N4), w17_c_motif(O2,N3,-), w18_c_motif(O2,-,N4), w19_c_motif(-,N3,N4)

semi-step-function
score                                         w0
c_hbond                                       0,116
c_stack                                       0,269
c_mhp1                                        0,208
c_emp1                                        -0,106
c_motif                                       0,355
c_hbond c_stack                               0,107
c_hbond c_stack c_motif                       0,120
c_hbond c_stack c_mhp1                        -0,001
c_hbond c_stack c_emp1                        -0,100
c_hbond c_stack c_mhp1 c_emp1                 -0,150
c_hbond c_stack c_mhp1 c_emp1 c_motif         -0,160
0,147
0,114
0,106
0,952
0,050
0,051
0,148
0,098
0,952
1,000
0,953
0,530
0,437
0,530
1,000
1,000
1,000
0,231
0,322
0,279
1,000
1,000
1,000
1,000
1,000
0,811
1,000
1,000
1,000
0,062
0,055
0,050
0,070
0,080
0,079
0,705
1,230
1,000
0,394
term correlation
score                                         w0
c_hbond                                       -0,001
c_stack                                       -0,001
c_mhp1                                        -0,001
c_emp1                                        -0,001
c_motif                                       -0,001
c_hbond c_stack                               -0,001
c_hbond c_stack c_motif                       0,000
c_hbond c_stack c_mhp1                        -0,001
c_hbond c_stack c_emp1                        -0,001
c_hbond c_stack c_mhp1 c_emp1                 -0,001
c_hbond c_stack c_mhp1 c_emp1 c_motif         0,000
0,366
0,056
0,058
0,566
0,094
0,715
0,808
1,000
1,000
0,757
1,000
1,000
1,000
0,866
0,703
0,458
0,468
1,000
1,000
1,000
1,000
1,000
1,000
0,931
0,034
1,000
0,865
0,272
0,769
0,466
0,513
0,460
0,198
0,332
0,312
1,000
1,000
1,000
0,771
0,942
0,884
0,000
0,000
0,000
1,000
0,000
0,000
0,200
1,000
0,000
1,000
0,000
0,000
0,000
1,000
0,187
1,000
negative RMSD
score                                         w0       w1_c_hbond(O2)
c_hbond                                       -8,813   0,595
c_stack                                       -7,117
c_mhp1                                        -8,045
c_emp1                                        -12,603
c_motif                                       -6,347
c_hbond c_stack                               -8,733   0,601
c_hbond c_stack c_motif                       -8,783   0,610
c_hbond c_stack c_mhp1                        -10,353  0,581
c_hbond c_stack c_emp1                        -12,853  1,000
c_hbond c_stack c_mhp1 c_emp1                 -12,915  1,000
c_hbond c_stack c_mhp1 c_emp1 c_motif         -13,034  1,000

negative RMSD with cut-off
score                                         w0
c_hbond                                       -3,850
c_stack                                       -2,611
c_mhp1                                        -2,482
c_emp1                                        -5,377
c_motif                                       -2,306
c_hbond c_stack                               -3,827
c_hbond c_stack c_motif                       -3,842
c_hbond c_stack c_mhp1                        -3,941
c_hbond c_stack c_emp1                        -4,884
c_hbond c_stack c_mhp1 c_emp1                 -4,903
c_hbond c_stack c_mhp1 c_emp1 c_motif         -4,990
0,101
0,227
1,000
-0,603
1,000
1,089
1,000
0,095
0,030
0,252
1,000
1,000
0,803
1,000
1,000
1,000
1,000
1,000
1,000
1,000
0,265
0,853
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
0,079
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
0,000
0,000
0,000
1,000
0,150
0,150
0,150
1,000
0,000
0,000
0,000
0,131
0,240
0,106
1,085
1,015
1,000
1,000
1,000
1,000
-1,010
0,541
0,863
1,000
1,000
1,000
0,784
0,447
0,392
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
0,603
1,000
4,891
1,047
0,868
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
0,988
1,000
1,000
4,799
4,530
4,777
1,070
1,000
1,000
0,644
1,000
1,000
1,891
1,000
-3,765
1,000
4,481
1,000
0,310
1,822
1,748
1,599
1,000
1,000
1,000
-0,956
-0,772
-0,639
1,000
1,000
1,000
1,838
1,803
1,774
1,000
1,000
0,982
1,000
0,959
1,000
0,800
0,450
0,200
0,550
1,000
0,350
0,000
1,000
1,000
1,000
1,000
1,000
1,000
0,900
0,700
1,000
0,000
0,200
1,000
0,100
0,300
1,000
1,000
0,950
1,000
1,000
0,800
1,000
2,004
0,955
0,977
0,937
1,000
1,000
1,000
0,967
0,996
1,202
0,970
1,008
1,023
1,000
1,000
0,769
1,000
1,000
1,000
0,571
0,569
0,452
0,574
0,896
0,792
0,397
0,618
1,000
0,444
1,711
1,254
1,396
0,257
0,268
0,211
0,352
0,919
0,958
0,961
0,432
0,437
0,443
0,548
0,608
0,511
0,577
0,577
0,572
0,508
0,565
0,493
1,000
1,000
1,000
0,857
0,867
0,867
0,454
0,405
0,647
0,545
0,425
0,621
0,000
0,000
1,000
1,000
0,000
1,000
0,932
0,896
0,001
0,000
1,000
0,000
0,100
0,250
1,000
0,000
1,000
0,000
0,000
0,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
1,000
0,000
0,150
0,000
0,000
0,000
0,000
1,000
1,000
1,000
TABLE AI.9b Weighting coefficients; Cross-training and test (Variances)
Weights: w0, w1_c_hbond(O2), w2_c_hbond(N3), w3_c_hbond(N4), w4_c_hbond(N3’), w5_c_hbond(N3’,N4), w6_c_stack(phe,tyr,trp,his), w7_cstack(arg), w8_c_mhp1, w9_c_emp1(O2_HNx), w10_c_emp1(O2_HOx), w11_c_emp1(N3_HNx), w12_c_emp1(N3_HOx), w13_c_emp1(N4_Nx), w14_c_emp1(N4_Ox), w15_c_emp1(Ar), w16_c_motif(O2,N3,N4), w17_c_motif(O2,N3,-), w18_c_motif(O2,-,N4), w19_c_motif(-,N3,N4)

semi-step-function
c_hbond                                       0,000  0,081  0,042  0,000  0,044  0,000
c_stack                                       0,000  0,000  0,217
c_mhp1                                        0,000  0,000
c_emp1                                        0,001  0,000  0,000  0,004  0,000  0,003  0,000  0,000
c_motif                                       0,000  0,000  0,187  0,000  0,000
c_hbond c_stack                               0,001  0,000  0,221  0,105  0,000  0,001  0,000  0,000
c_hbond c_stack c_motif                       0,001  0,081  0,212  0,153  0,000  0,001  0,000  0,000
c_hbond c_stack c_mhp1                        0,002  0,043  0,221  0,130  0,143  0,001  0,230  0,177  0,000
c_hbond c_stack c_emp1                        0,000  0,044  0,000  0,000  0,000  0,000  0,043  0,000  0,011  0,000  0,781  0,000  0,023  0,000  0,000
c_hbond c_stack c_mhp1 c_emp1                 0,001  0,000  0,000  0,000  0,000  0,000  0,189  0,000  0,103  0,001  0,000  0,898  0,000  0,038  0,000  0,000
c_hbond c_stack c_mhp1 c_emp1 c_motif         0,000  0,042  0,000  0,000  0,000  0,000  0,148  0,000  0,000  0,000  0,000  0,168  0,000  0,010  0,000  0,000  0,000  0,000  0,000  0,000
negative RMSD
score                                         w0       w1_c_hbond(O2)
c_hbond                                       0,026    0,004
c_stack                                       0,025
c_mhp1                                        0,075
c_emp1                                        0,147
c_motif                                       0,019
c_hbond c_stack                               0,014    0,002
c_hbond c_stack c_motif                       0,017    0,002
c_hbond c_stack c_mhp1                        0,103    0,002
c_hbond c_stack c_emp1                        0,174    0,000
c_hbond c_stack c_mhp1 c_emp1                 0,165    0,000
c_hbond c_stack c_mhp1 c_emp1 c_motif         0,168    0,000

negative RMSD with cut-off
score                                         w0       w1_c_hbond(O2)
c_hbond                                       0,005    0,032
c_stack                                       0,004
c_mhp1                                        0,011
c_emp1                                        0,040
c_motif                                       0,006
c_hbond c_stack                               0,006    0,060
c_hbond c_stack c_motif                       0,012    0,033
c_hbond c_stack c_mhp1                        0,008    0,105
c_hbond c_stack c_emp1                        0,044    0,059
c_hbond c_stack c_mhp1 c_emp1                 0,086    0,033
c_hbond c_stack c_mhp1 c_emp1 c_motif         0,073    0,029

term correlation
score                                         w0       w1_c_hbond(O2)
c_hbond                                       0,000    0,000
c_stack                                       0,000
c_mhp1                                        0,000
c_emp1                                        0,000
c_motif                                       0,000
c_hbond c_stack                               0,000    0,000
c_hbond c_stack c_motif                       0,000    0,000
c_hbond c_stack c_mhp1                        0,000    0,000
c_hbond c_stack c_emp1                        0,000    0,090
c_hbond c_stack c_mhp1 c_emp1                 0,000    0,187
c_hbond c_stack c_mhp1 c_emp1 c_motif         0,000    0,000
0,061
0,007
0,000
0,009
0,018
0,010
0,022
0,000
0,000
0,000
0,008
0,010
0,030
0,005
0,001
0,004
0,000
0,000
0,045
0,000
0,000
0,000
0,011
0,008
0,132
0,025
0,033
0,053
0,007
0,002
0,000
0,253
0,001
0,000
0,001
0,088
0,103
0,081
0,001
0,001
0,002
0,029
0,050
0,015
0,000
0,000
0,000
0,082
0,070
0,071
0,244
0,236
0,232
0,213
0,189
0,219
0,000
0,000
0,000
0,000
0,001
0,134
0,002
0,002
0,000
0,000
0,000
0,000
0,000
0,000
0,043
0,402
0,000
0,164
0,001
0,160
0,190
0,194
0,194
0,034
0,112
0,119
0,000
0,000
0,000
0,209
0,063
0,122
0,000
0,000
0,000
0,000
0,000
0,000
0,160
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,021
0,000
0,000
1,424
0,000
0,920
0,041
0,007
0,000
0,123
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,127
0,127
0,127
0,000
0,000
0,000
0,000
0,016
0,013
0,011
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,003
0,000
0,000
0,503
0,096
0,139
0,093
0,000
0,000
0,024
0,000
0,000
1,966
0,000
4,618
0,000
0,109
0,000
0,001
0,005
0,000
0,042
0,061
2,780
2,298
1,441
0,000
0,000
0,000
9,706
9,297
7,388
0,000
0,000
0,000
0,676
0,748
0,678
0,000
0,000
0,004
0,000
0,031
0,000
0,160
0,247
0,160
0,247
0,000
0,227
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,090
0,210
0,000
0,000
0,160
0,000
0,090
0,210
0,000
0,000
0,047
0,000
0,000
0,160
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,000
0,127
0,000
0,000
0,000
0,000
0,000
0,000
0,000
Table AI.10 Weighting coefficients in the full score (c_hbond c_stack c_mhp1 c_emp1 c_motif)
[Diagram: value axis from -14,000 to 6,000; legend: semi-step-function; negative RMSD; negative RMSD with cut-off; term correlation]
Table AI.11 Weighting coefficients, Semi- step-function
[Diagram: value axis from -1,500 to 1,500; legend: c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif]
Table AI.12 Weighting coefficients, Negative RMSD
[Diagram: value axis from -14,000 to 6,000; legend: c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif]
Table AI.13 Weighting coefficients, Negative RMSD with cut-off
[Diagram: value axis from -6,000 to 6,000; legend: c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif]
Table AI.14 Weighting coefficients, Term correlation
[Diagram: value axis from -0,200 to 1,200; legend: c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif]