UPTEC X 08 52
Degree project 20 p
December 2008

New scoring-functions for docking smaller molecules to proteins
Evaluation on ligands containing nitrogenous base -moieties

Johan Alexander Källberg Zvrskovec

Bioinformatics Engineering Program
Uppsala University School of Engineering

UPTEC X 08 052
Date of issue: 2008-12
Author: Johan Alexander Källberg Zvrskovec
Title (English): New scoring-functions for docking smaller molecules to proteins: Evaluation on ligands containing nitrogenous base -moieties
Title (Swedish):

Abstract
Molecular modeling approaches aimed at predicting the spatial structure of a ligand-receptor complex, given the 3D model of the receptor, are referred to as molecular docking. They are widely used both in fundamental studies of the molecular mechanisms of protein functioning and in drug design. In this work, new methods to improve the efficiency of scoring putative protein-ligand complexes in molecular docking are presented. A promising method, consensus docking, which has previously been used successfully to score the interactions of adenine-containing ligands with proteins, is now adapted for another important class of nitrogenous base ligands: cytosine-containing compounds. Based on a statistical analysis of the 3D structures of a representative set of 50 complexes of cytosine-containing ligands with different proteins, an array of new scores is proposed. These scores capture and model the physical phenomena that drive the recognition of cytosine by protein receptors, such as hydrogen bonds, aromatic stacking interactions and hydrophobic/hydrophilic interactions. The proposed scores implement the concept of molecular hydrophobicity potential to model the hydrophobic/hydrophilic properties of molecules. In addition, statistical modeling through a knowledge-based potential is used to implicitly describe the above-mentioned intermolecular interactions.
The new scores were validated on a set of docking conformers for the studied 50 complexes, generated using two popular docking programs, GOLD and Glide. Most of our scores combining multiple scoring methods were demonstrated to have an excellent potential for scoring the targeted type of molecular complexes.

Keywords: Molecular docking, protein, ligand, scoring function, nitrogenous base, cytosine, hydrogen bond, stacking, hydrophobic, hydrophilic, MHP, empirical, knowledge-based, backbone, motif, GOLD, Glide
Supervisors: Prof. Roman G. Efremov and Ph.D. Timothy V. Pyrkov, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences
Scientific reviewer: Ph.D. Torgeir R. Hvidsten, Umeå Plant Science Centre, Dep. of Plant Physiology, Umeå University
Project name:
Sponsors:
Language: English
Security:
Classification:
ISSN 1401-2138
Supplementary bibliographical information:
Pages: 48 (61 inc. attachments)
Biology Education Centre, Biomedical Center, Husargatan 3 Uppsala, Box 592, S-75124 Uppsala, Tel +46 (0)18 4710000, Fax +46 (0)18 555217

POPULAR SCIENCE SUMMARY

In both fundamental studies of the molecular mechanisms of proteins and in drug design, a computer-based simulation method called molecular docking is used. One of its aims is to predict the 3D structure of the molecules when a smaller molecule (ligand) interacts with a protein. This project presents new methods for evaluating such 3D models, which is a very important step in molecular docking.
A promising method, consensus docking, which has previously been applied successfully to score interactions between ligands containing adenine structures and proteins, is here adapted for another significant class of nitrogenous base ligands: compounds containing cytosine. Based on an analysis of the 3D structures in a representative set of 50 complexes of cytosine-containing ligands with different proteins, a number of new scoring functions are proposed. These capture and model the physical phenomena that govern how proteins recognize cytosine, such as hydrogen bonds, aromatic stacking and hydrophobic/hydrophilic interactions. The proposed scoring functions include the concept of Molecular Hydrophobicity Potential (MHP) to model the hydrophobic/hydrophilic properties of the molecules. In addition, statistical modeling through a knowledge-based potential is used to implicitly describe interaction properties. The new scoring functions were evaluated on a set of molecular models that are variations of the 50 studied complexes. These molecular models were created using the popular docking programs GOLD and Glide. Most of the scoring functions that combine different methods showed an excellent potential for evaluating the type of molecular complexes that our datasets represent.

Table of Contents

1. INTRODUCTION
  1.1. Nitrogenous bases
  1.2. Biochemical functions of ligands with cytosine-moieties
  1.3. Molecular docking
2. METHODS
  2.1. Molecular docking and dataset generation
  2.2. PLATINUM – a tool for analysis of non-covalent interactions
  2.3. Creating a computer program for development and validation of scores
  2.4. Scoring component for modeling hydrogen bonds – c_hbond
  2.5. Scoring component for modeling of stacking interactions – c_stack
  2.6. Molecular Hydrophobicity Potential (MHP)
  2.7. Scoring component based on MHP – c_mhp1
  2.8. Knowledge-based estimation of atom positioning – EMP1
  2.9. Scoring component based on EMP1 – c_emp1
  2.10. Recognition of cytosine – protein backbone motifs
  2.11. Scoring component based on hydrogen bond motifs – c_motif
  2.12. Scoring component of the docking algorithms used – c_dock
  2.13. Normalization and calculation of the weighting coefficients
  2.14. Validation of scoring functions
3. RESULTS & DISCUSSION
  3.1. Analysis of non-covalent interactions in complexes with cytosine-containing ligands
  3.2. Investigating methods to score protein-ligand conformers
  3.3. Investigating strategies to combine terms of different scoring components and to validate scores
  3.4. Validation of scoring approaches
  3.5. Time efficiency
4. CONCLUSIONS
5. ACKNOWLEDGMENTS
6. REFERENCES
Attached parts
  APPENDIX I – Validation results for new scores
  APPENDIX II – The work program “CSTAT”

1. INTRODUCTION

1.1. Nitrogenous bases

Nitrogenous bases, further divided into purines and pyrimidines, are fundamental chemical constituents of all known living organisms, viruses and transposons, and have probably been so since the dawn of life. Chemical compounds containing nitrogenous bases have been present during the emergence and continued evolution of proteins and have thus been part of the environment in which proteins evolved. An adenine-binding motif can be found in “ancient proteins” common to all living organisms, which suggests that chemical pathways involving both adenine-containing ligands and the motif that binds them developed very early in evolution.[1]

Essential functions of all cells rely on molecules containing nitrogenous bases. Their role in cellular energy transfer, signal transduction and protein synthesis is prominent. A combination of nitrogenous bases is the information-carrying constituent of both RNA and DNA. Adenosine-5’-triphosphate (ATP), guanosine-5’-triphosphate (GTP), and to a lesser extent cytidine-5’-triphosphate (CTP) and thymidine-5’-triphosphate (TTP), serve as cellular energy carriers and are involved in phosphorylation processes that are important for cell signalling, with ATP being the most prominent energy carrier in cells.
Coenzyme A (CoA) is an acyl group carrier with a key role in the citric acid cycle, and nicotinamide adenine dinucleotide (NAD), nicotinamide adenine dinucleotide phosphate (NADP) and flavin adenine dinucleotide (FAD) are involved in various redox reactions, to mention a few important cellular contexts in which compounds with nitrogenous base moieties take part.

As of April 2008, when the Brookhaven Protein Data Bank (PDB)[2] had recently passed 50 000 stored structures (50160), a ligand search in the PDB for ligands containing the adenine substructure resulted in 4705 structure hits and 421 ligand hits. The guanine substructure gave 932 structure hits and 140 ligand hits, the cytosine substructure 264 structure hits and 147 ligand hits, and the thymine substructure 246 structure hits and 124 ligand hits.

Knowledge of how ligands containing nitrogenous bases are recognized by proteins is important for understanding the enzymatic mechanisms of which they are a part, as well as the general principles of interaction between proteins and these ligands.
Because of the attractive therapeutic potential of proteins that recognize molecules containing nitrogenous bases, the recognition of ligands by these proteins is a subject of great interest.[3,4] Recent advances in structural biology have resulted in a growing number of available X-ray crystallographic structures of proteins with bound ligands, which has facilitated studies of intermolecular interactions in the respective binding sites.[4,9] Many proteins with known spatial structure were selected for structure determination because of their therapeutic potential, and new targets for structure determination tend to be selected on the same basis.[9]

Comparisons of protein folds derived from different structural classification schemes with their respective enzymatic functions led to the conclusion that protein function does not show a clear relationship to protein fold, as only a few specific residues are involved in the enzyme activity.[5] Other studies have emphasized the existence of distinct differences in binding site configuration between highly homologous proteins (more than 80% sequence similarity).[6] Classical bioinformatics approaches in which proteins are sorted by sequence similarity may therefore not be able to deduce properties related to the active site.[3]

A number of studies have addressed protein recognition of the adenine moiety, with an array of promising conclusions. A frequent adenine-recognizing protein motif depends solely on interactions between the protein backbone and adenine,[1,7] which further accentuates the need to understand ligand binding site interactions at a level below that of the amino-acid sequence. In the work by Mao and Wang et al. 2004, non-bonded intermolecular interactions between the adenine base and the surrounding environment in its binding pocket were investigated.[7] Hydrogen bonds, cation-Pi interactions and aromatic Pi-Pi stacking were concluded to be the main molecular determinants of adenine-moiety recognition. Additionally, polar and hydrophilic/hydrophobic interactions between the ligand and its binding site seem to play an important role in the process.[3,4,8] While all of these interaction types affect the specificity of molecular recognition to some extent, hydrogen bonds in particular are believed to determine specificity because of their directional nature.[3]

Figure 1. Molecular structure of the cytosine base. Capacity to form hydrogen bonds is shown with hydrogen acceptor sites in blue and hydrogen donor sites in red.

1.2. Biochemical functions of ligands with cytosine-moieties

After the recent work on adenine [4], we turned our focus to ligands containing cytosine as a substructure. Cytosine is a pyrimidine base like thymine and uracil, in contrast to adenine. Figure 1 shows the molecular structure of the cytosine base and the possibilities for the molecule to form hydrogen bonds. Cytosine has the capacity to form five hydrogen bonds, with O2 as the hydrogen acceptor for two bonds, N3 as the hydrogen acceptor for one bond and N4 as the hydrogen donor for two bonds.

Ligands containing cytosine are involved in a multitude of essential biochemical pathways, which makes cytosine, together with the other nitrogenous bases, an important molecular fragment and thus of interest to study. To provide a somewhat representative picture of the roles in which cytosine-containing ligands are involved, a ligand search on the cytosine substructure was carried out in the PDB (September 2008) and the structures found were grouped according to E.C. numbers. The result of our search is summarized in Fig. 2.

[Figure 2: bar chart titled “The cytosine-substructure, complex distribution over E.C.-numbers in the PDB”; x-axis: E.C.-number; y-axis: number of complexes.]

Figure 2. Distribution of protein-ligand complexes with cytosine-containing ligands in the PDB over E.C. numbers, as of September 2008. Some of the complexes found were stored without an E.C. number and are thus not included here. For more information about the E.C. numbers, see the ENZYME database [17].

This investigation shows that the E.C. number with the largest number of complexes is 2.7.7.7, which corresponds to the DNA-extension reaction of various DNA polymerases. Another large group, 2.1.3.2, belongs to aspartate carbamoyltransferases, which catalyze a reaction in the pyrimidine metabolism and the alanine and aspartate metabolism pathways in bacteria such as Escherichia coli. CTP and UTP work as inhibitors while ATP is the effector. Below we describe some of the most populated groups from our search.

The number 1.2.99.2 contains carbon-monoxide dehydrogenases and is represented in our search by enzymes from the bacterium Oligotropha carboxidovorans. These carbon-monoxide dehydrogenases are involved in the methane metabolism pathway. The number 2.7.7.6 comprises DNA-directed RNA polymerases, where any nucleoside triphosphate is used as substrate in the process of elongating or shortening RNA.
The number 2.7.4.14 corresponds to a reaction step of the pyrimidine metabolism catalyzed by a group of cytidylate kinases, in which ATP and dCMP are converted into ADP and dCDP, and vice versa. dCMP and UMP can also act as phosphate-group acceptors in the reaction. The number 4.6.1.12 refers to a reaction catalyzed by 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase, where CMP is one of the products. All structures found are from bacteria. The number 2.7.1.74 refers to deoxycytidine kinase and its reaction, in which a nucleoside triphosphate and deoxycytidine are converted to a nucleoside diphosphate and dCMP, and vice versa; this reaction is part of both the purine and the pyrimidine pathways.

This serves as a brief (and incomplete) overview of various roles in which cytosine-containing ligands are involved, and should not be interpreted as a ranking of the importance of reactions or enzymes from any point of view (see the caption of Fig. 2). The ENZYME database [17] was used for dereferencing the E.C. numbers.

1.3. Molecular docking

Numerous in-silico techniques now exist for structural prediction and characterization of protein-ligand complexes. These techniques are used to complement experimental methods, which can sometimes prove demanding or for other reasons cannot be applied.[4] With the advance of such in-silico approaches, computational methodologies have become an integral part of many drug discovery programmes and are used in tasks such as hit identification and lead optimization.
Two distinct approaches to modeling protein-ligand interactions and performing virtual screening have evolved: the ligand-based approach, often used in QSAR (quantitative structure-activity relationships), and the structure-based approach, in which compounds are “docked” into the known spatial structure of the binding site.[9,11] The docking of small-molecular-weight, drug-like organic compounds to protein binding and active sites, a computational approach that emerged in the beginning of the 1980s, continues to be a key methodology for structural prediction.[9-12]

Molecular docking is generally a structural optimization process over its constituent molecular parts (usually a protein and its ligands in the case of protein-ligand docking), in which new structural conformers of the biomolecular complex are generated and then evaluated through a scoring function. Such functions represent the quality of a proposed conformer in terms of free energy of binding or biological activity.[4,9,12] New complexes (conformers) are generated by a search algorithm in order to advance step-wise towards higher-quality (native-like) structural solutions, assuming that this quality is reflected in the value of the scoring function. While steps through conformational space can be made by a variety of search strategies [9,12], the scoring function is usually the defining part of a docking algorithm.
However, since the scoring function only validates results produced by the search algorithm, the combined docking algorithm is highly dependent on the architecture of the search algorithm, in particular on how it defines the search space.[12] Numerous robust algorithms for conformer generation currently exist [9,12], while inefficient and fallacious scoring functions continue to be a limiting factor in molecular docking.[9] Ideally, a scoring function should classify conformers precisely enough to drive the docking process towards a correct solution, while consuming little enough computational time to be usable in high-throughput virtual screening. Fast but simplified scoring functions may result in underestimation of the affinity of true binders or of native-like ligand poses and vice versa.

Molecular docking is a complex task from many perspectives, and numerous complications may interfere in the docking process. The limited resolution of available crystallographic data is a potential source of error, and the true consequences of this effect may be difficult to assess.[9] Possibly the most concerning matter is the need to introduce simplifications to the docking procedure, regarding both conformer generation and scoring, in order to satisfy the time constraints of the docking algorithm or for other reasons. For example, to reduce the conformational space, most search algorithms are designed to reduce the number of degrees of freedom in various ways.[9,12] As a result of such simplifications, contemporary algorithms for protein-ligand docking rarely account for conformational changes in the protein receptor that may occur upon ligand binding, such as inherent flexibility or induced fit, or for the participation of solvent molecules.[9,12] In many cases structural changes are mainly restricted to the ligand for simpler and faster modeling.
This simplification of the conformational search algorithm has been shown to correspond poorly to realistic protein-ligand behaviour, where protein rigidity is likely to be more of a rarity than flexibility.[12] All proteins are flexible to some extent and change between different conformational states in correspondence with the energies of those states, which affects what is considered to be the location of the ligand binding site and its structure.[12,13]

Regarding scoring, in addition to the problems associated with simplifications, scoring functions tend to introduce various assumptions, which may be false and thus be a source of error.[9] Existing scoring functions also tend to be more focused on enthalpic than on entropic measures. This bias does not correspond to the physical phenomenon that both entropic and energetic effects determine the ligand-binding event, and that neither of them can be favoured in specific reaction cases.[9]

Because of the limited reliability of contemporary docking algorithms, or their inability to produce unambiguous scoring results in some cases, it is common procedure to analyse not only the best-ranked conformer but several of the top-scoring protein-ligand conformers produced by the docking algorithm. This also better reflects what is thought to happen in reality, since the whole complex changes between different conformational states.
An ensemble of docking solutions is considered to be able to describe this to some extent.[12] It is also a common approach to re-score ligand poses (also referred to as docking solutions) using scoring functions that are more appropriate for the current class of ligands or protein targets, while still relying on existing docking algorithms for conformer generation.[4] An approach known as “consensus scoring” utilizes combinations of different scoring schemes and has shown that combining different scoring functions can indeed compensate for errors in individual functions.[12] When applying consensus scoring, one should however keep in mind possible correlations between different scoring functions, since they can lead to imbalances in the score, including error amplification.[12]

2. METHODS

2.1. Molecular docking and dataset generation

Two datasets were used to develop and validate new scoring criteria. Dataset 1 comprises 50 molecular complexes obtained with X-ray crystallography which have been deposited in the PDB. The full list of complexes of Dataset 1 is presented in Table 1. This set was used for a statistical analysis of intermolecular interactions in complexes of cytosine-containing ligands with proteins.

Dataset 2 comprises various ligand poses in the binding sites: native-like (correct) and misleading (incorrect). These were generated using the popular and well-renowned docking programs GOLD [18-24] and Glide [25,26] for all of the complexes of Dataset 1. For each complex of Dataset 1, this collection of docking results is composed of 10 complexes from GOLD with the scoring function Goldscore, 10 complexes from GOLD with the scoring function Chemscore, and 20 complexes from Glide, that is, 40 docked complexes per crystal structure. Dataset 2 thus contains 50 × 40 = 2000 complexes generated with different docking algorithms.

2.2. PLATINUM – a tool for analysis of non-covalent interactions

The multi-functional program PLATINUM [16] was the main tool used both for analysis of the available experimental structural data (Dataset 1) and for scoring the docking solutions (Dataset 2). In our initial investigations and analysis of the crystal structures of Dataset 1 we used PLATINUM to identify hydrogen bonds and stacking interactions. The functions used are the scoring functions for hydrogen bonds and stacking interactions based on geometrical criteria, which are described later. The unique feature of PLATINUM is the option to analyse molecular complexes with respect to molecular hydrophobicity/hydrophilicity using the concept of empirical molecular hydrophobicity potential (MHP). We used this both to perform such analysis and in scoring components based on MHP (i.e. c_mhp1, see section 2.7, “Scoring component based on MHP – c_mhp1”).

TABLE 1. Analysed protein-ligand complexes (Dataset 1).

Ligand PDB-code  PDB-code  Resolution [Å]  Protein name
1MC              1BKY      2.00            VP39
AR3              1P5Z      1.60            Deoxycytidine kinase
C2G              1N1D      2.00            glycerol-3-phosphate cytidylyltransferase
C2P              1ROB      1.60            RIBONUCLEASE A
C3P              1RPF      2.20            RIBONUCLEASE A
C5P              1H7F      2.12            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
C5P              1IV4      1.55            2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase
C5P              1LP6      1.90            orotidine monophosphate decarboxylase
C5P              1QF9      1.70            PROTEIN (URIDYLMONOPHOSPHATE/CYTIDYLMONOPHOSPHATE KINASE)
C5P              1UJ2      1.80            Uridine-cytidine kinase 2
CAR              1KDR      2.25            CYTIDYLATE KINASE
CDC              1JYL      2.40            CTP:phosphocholine Cytidylytransferase
CDF              1GX1      1.80            2-C-METHYL-D-ERYTHRITOL 2,4-CYCLODIPHOSPHATE SYNTHASE
CDM              1INI      1.82            4-DIPHOSPHOCYTIDYL-2-C-METHYLERYTHRITOL SYNTHETASE
CDM              1OJ4      2.01            4-DIPHOSPHOCYTIDYL-2-C-METHYL-D-ERYTHRITOL KINASE
CDP              1EYR      2.20            CMP-N-ACETYLNEURAMINIC ACID SYNTHETASE
CDP              1FFU      2.35            CUTS, IRON-SULFUR PROTEIN OF CARBON MONOXIDE DEHYDROGENASE
CDP              1H7H      2.30            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
CDP              1IV2      1.55            2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase
CDP              1XJN      2.25            ribonucleotide reductase, B12-dependent
CDP              2AZ3      2.20            Nucleoside diphosphate kinase
CDP              2CMK      2.00            PROTEIN (CYTIDINE MONOPHOSPHATE KINASE)
CG2              1OJ1      2.10            RC-RNASE6 RIBONUCLEASE
CMK              1GQC      2.60            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
CPA              1RPG      1.40            RIBONUCLEASE A
CSF              1RO7      1.80            alpha-2,3/8-sialyltransferase
CTN              1UEJ      2.61            Uridine-cytidine kinase 2
CTP              1COZ      2.00            PROTEIN (GLYCEROL-3-PHOSPHATE CYTIDYLYLTRANSFERASE)
CTP              1H7G      2.13            3-DEOXY-MANNO-OCTULOSONATE CYTIDYLYLTRANSFERASE
CTP              1I52      1.50            4-DIPHOSPHOCYTIDYL-2-C-METHYLERYTHRITOL SYNTHASE
CTP              1KFD      3.90            DNA POLYMERASE I KLENOW FRAGMENT
CTP              1MIY      3.52            tRNA CCA-adding enzyme
CTP              1TUG      2.10            Aspartate carbamoyltransferase catalytic chain
CTP              1UDW      2.60            Uridine-cytidine kinase 2
CTP              1UEU      2.00            tRNA nucleotidyltransferase
CTP              2AD5      2.80            CTP synthase
DCM              1B5E      1.60            PROTEIN (DEOXYCYTIDYLATE HYDROXYMETHYLASE)
DCM              1NJE      2.30            THYMIDYLATE SYNTHASE
DCP              1PEO      3.00            Ribonucleoside-diphosphate reductase 2 alpha chain
DCP              1PKK      1.77            Bifunctional deaminase/diphosphatase
DCP              5KTQ      2.50            PROTEIN (DNA POLYMERASE I)
DCZ              1P60      1.96            Deoxycytidine kinase
DOC              1KDT      1.95            CYTIDYLATE KINASE
GEO              1P62      1.90            Deoxycytidine kinase
GPC              1RDS      1.80            RIBONUCLEASE MS
MCN              1DGJ      2.80            ALDEHYDE OXIDOREDUCTASE
MCN              1N62      1.09            Carbon monoxide dehydrogenase small chain
NCC              1QWJ      2.80            cytidine monophospho-N-acetylneuraminic acid synthetase
PCD              1FFV      2.25            CUTS, IRON-SULFUR PROTEIN OF CARBON MONOXIDE DEHYDROGENASE
PCD              1VLB      1.28            ALDEHYDE OXIDOREDUCTASE

2.3. Creating a computer program for development and validation of scores

In this work the author developed a computer program that implements functions for calling and communicating with the docking programs (GOLD and Glide) and the program PLATINUM, functions for performing tasks associated with our new scoring algorithms (incorporated in the scoring components c_emp1 and c_motif), and algorithms for score training and validation (see the detailed description in APPENDIX II). The program was developed under the working name “CSTAT” and was written in the Java programming language using the Java SE Development Kit (JDK) 6.

2.4. Scoring component for modeling hydrogen bonds – c_hbond

Our scoring approach for estimating hydrogen bonds uses geometrical criteria to identify the presence or absence of a bond, and is the same function that was used in the previous work on adenine [4]. Let $r_{AD}$ be the distance between the proton acceptor and the proton donor (Fig. 3), $A_{HDA}$ the angle between the hydrogen atom and the acceptor, and $A_{NDA}$ the angle between the donor neighbour atom and the acceptor atom via the donor. The function used to determine whether a hydrogen bond is present is then

\[
B_{hbond,\,acceptor:donor} =
\begin{cases}
1; & \text{the criteria on } r_{AD} \text{ and } \cos(A_{HDA}) \text{ are fulfilled} \\
0; & \text{otherwise}
\end{cases}
\]

with the distance limits 3.2 Å and 3.4 Å for $r_{AD}$ and the limit 0.6 for $\cos(A_{HDA})$.

Figure 3. Schematics of the hydrogen acceptor (A) and the hydrogen donor (D) connected through the hydrogen atom (H) in a hydrogen bond. The neighbouring heavy atom to the donor is marked as the neighbour (N). The distance $r_{AD}$ and the angles $A_{HDA}$ and $A_{NDA}$ are shown schematically.

In hydroxy groups, where the hydrogen is free to rotate around the N-D bond, we used the value of the angle $A_{NDA}$ to identify possible values of the angle $A_{HDA}$. As is seen in the formula, this function is either 1 or 0: either a hydrogen bond is detected or not.
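A binary geometric test of this kind can be sketched in a few lines of Python. The sketch is illustrative only and is not the CSTAT or PLATINUM code. It assumes that $A_{HDA}$ is the angle at the donor between the hydrogen and the acceptor, that 3.4 Å acts as an upper limit on $r_{AD}$, and that 0.6 acts as a lower limit on $\cos(A_{HDA})$. The function name, the parameter names and the coordinate-tuple inputs are hypothetical.

```python
import math


def has_hydrogen_bond(donor, hydrogen, acceptor, r_max=3.4, cos_min=0.6):
    """Return 1 if the donor/acceptor pair passes the geometric test, else 0.

    donor, hydrogen, acceptor: (x, y, z) coordinates in Angstrom.
    r_max and cos_min are assumed readings of the distance and angle limits.
    """
    # Distance r_AD between donor and acceptor heavy atoms.
    r_ad = math.dist(donor, acceptor)
    if r_ad == 0.0 or r_ad > r_max:
        return 0

    # Vectors from the donor to the hydrogen and from the donor to the acceptor.
    dh = [h - d for h, d in zip(hydrogen, donor)]
    da = [a - d for a, d in zip(acceptor, donor)]
    norm_dh = math.hypot(*dh)
    if norm_dh == 0.0:
        return 0

    # Cosine of the angle H-D-A (vertex at the donor).
    cos_hda = sum(x * y for x, y in zip(dh, da)) / (norm_dh * r_ad)
    return 1 if cos_hda >= cos_min else 0
```

Passing the limits as arguments keeps the assumed thresholds visible and easy to change if a different reading of the criterion is intended.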
Four terms are combined into the final scoring component, where three terms each represent a molecular site on the cytosine moiety (O2, N3, N4) and one term is a product of the N3 and N4-terms (to detect simultaneous bonding at these sites): Chbond Thbond ,O 2 , Thbond , N 3 , Thbond , N 4 , Thbond , N 3, N 4 Thbond ,O 2 Bhbond ,O 2:donor donors Thbond , N 3 Bhbond , N 3:donor donors Thbond , N 4 Bhbond ,acceptor :N 4 acceptors Thbond , N 3, N 4 Thbond , N 3 Thbond , N 4 11 2.5. Scoring component for modeling of stacking interactions – c_stack Aromatic Pi-Pi stacking interactions between the heteroaromatic ring of the cytosine moiety of the ligand and the rings of Phe, Tyr, Trp, His, and the guanidine group of Arg, were measured through a geometrical function, in a way similar to the one used for hydrogen bonds. This stacking function is a product of three different parts: Tstacking S1 ( )S 2 (d normal )S 3 (d parallel ) The angular part is determined by: S1 ( ) 1 sin 4 , 0 ,90 , where is the dihedral angle between the plane of the aromatic ring of the cytosinemoiety and the plane of the neighbouring planar structure. S 2 and S 3 are distance-dependent parts defined as: 1; d normal S2 (d normal ) 4.0 Å (5.0 d normal );4.0 Å 0; d normal d normal 5.0 Å 5.0 Å , where d normal is the distance between the ring centres along the normal of the cytosinemoiety (Fig. 4), and 1; d parallel S3 (d parallell ) 2.0 Å 1.25 (3.0 d parallel );2.0 Å 0; d parallel d parallel 3.0 Å , 3.0 Å where d parallel is the distance between the ring centres along the plane of the cytosinemoiety. The complete scoring component contains two terms; one for scoring stacking with Phe, Tyr, Trp and His, and one term for scoring stacking with the guanidine group of Arg, so that: Cstacking Tstacking , phe,tyr ,trp, his , Tstacking , arg Figure 4. Illustration showing how , d normal and d parallel are measured from the ring structure of Ring 1 stacked with ring structure Ring 2 as reference. 12 2.6. 
Molecular Hydrophobicity Potential (MHP)

MHP is an empirical method to model hydrophobic/hydrophilic effects which has been successfully used for scoring in molecular docking and in a number of other applications [27,4]. In our first attempt to model hydrophobic and hydrophilic effects for scoring purposes, we used the MHP model proposed by Pyrkov et al., 2007 [4] (named MHP1 in the present work). In MHP1, molecular hydrophobic properties of atoms are projected onto the Connolly surface [14] of the ligand, according to:

$$MHP_j = \sum_i f_i\, e^{-c\, r_{i,j}} + \mathrm{offset}$$

In this formula, i is the index of a neighbouring atom and j is the index of a point on the ligand surface (Fig. 5). Further, f_i is the empirical atomic hydrophobicity constant of the atom with index i, c is an arbitrary weight constant (1 Å^-1 in our case), and r_{i,j} is the distance between the atom with index i and the surface point with index j. The offset is a variable that shifts the sum in a desired direction (positive, towards the more hydrophobic range, or negative, towards the more hydrophilic range). Two sums of this kind are calculated for each point on the surface: one is a sum over all the ligand atoms (MHP_lig), and the other is a sum over all the protein atoms (MHP_prot). An MHP-sum above 0 is considered hydrophobic while a sum below 0 is considered hydrophilic. Since we use hydrophobicity constants based on octanol-water partitioning data for organic compounds from Ghose et al. [15], this is a rather convenient convention.

In the PLATINUM implementation of MHP1, the interactions between the solvent (usually water) and the other participating molecules are treated by generating a grid of solvent “hydrophilic charges” (Fig. 6). For a review of these, see our previous work on adenine [4] and the reference on the program PLATINUM [16].

Figure 5. An example showing the MHP concept applied to an ethanol molecule in a surrounding medium of water. Each atom in the ethanol molecule in focus has its hydrophobicity constant.
This example shows how atoms contribute to the MHP-sum for the point j, located at the Connolly surface of the molecule. [Slightly trimmed, reference 16].

Figure 6. The grid of “hydrophilic charges” simulating solvent (azure) around a ligand (black) in a protein binding pocket (grey). The grid nodes fill empty space. [Slightly trimmed, reference 16].

2.7. Scoring component based on MHP – c_mhp1

This scoring part is the same function used in our previous work [4] to model complementarity of hydrophobic properties between ligand and protein in a score specifically created to rank docking results with ligands containing adenine moieties. The complete formula for the c_mhp1 score component contains one term and is expressed as:

$$C_{MHP1} = \{\, T_{MHP1} \,\}, \qquad T_{MHP1} = r \cdot c,$$

which is a function built from two parts:

$$r = \frac{A_1 + A_2}{A_1 + A_2 + A_3}, \qquad c = \frac{2A''}{A'_1 + A'_2 + 2A''}$$

where r describes the shielding of hydrophobic areas of the ligand and the protein from hydrophilic water, and c represents the complementarity of shared buried hydrophobic areas between the ligand and the protein. In the factor r, A_1 is the buried hydrophobic area of the protein, A_2 is the buried hydrophobic area of the ligand and A_3 is the exposed hydrophobic area of the ligand (Fig. 7). In the factor c, A'' is the matching buried hydrophobic area of the ligand and the protein, A'_1 is the mismatching buried hydrophobic area of the protein and A'_2 is the mismatching buried hydrophobic area of the ligand. The factor r is thus the buried fraction of the total hydrophobic area, and c is the matching fraction of the total buried hydrophobic area. Areas are computed according to the MHP1 model described earlier.

Figure 7. The scheme illustrating the shielding (left) and complementarity (right) parameters used to characterize the strength of hydrophobic contacts for receptor-ligand binding. S1 and S2 are the buried surface areas of the receptor and the ligand, respectively; A3 is the exposed area of the ligand.
A1', A2' and A'' are respectively the mismatching and matching hydrophobic areas of the receptor and the ligand. Hydrophobic and hydrophilic surface regions are shown in brown and blue, respectively. [Adapted from reference 4].

2.8. Knowledge-based estimation of atom positioning – EMP1

We also introduced a simple spatial-empirical knowledge-based method to assess the quality of molecular complexes in terms of free energy and/or biological activity. This knowledge-based approach relies solely on the spatial distribution of atoms of different types with distinguishable properties. Data from known spatial structures of biomolecular complexes are analysed, and atoms with the preferred properties are stored in a database in a format that keeps information about their position relative to a reference, for example an atom in a ligand. This is called “the training of the EMP1-function”. Each EMP1-function thus corresponds to a reference centre (in our case a ligand atom or another ligand-based point).

In our current model, such data about atoms can also be weighted to reflect the quality of the data or any other preferred parameters. Positions of stored atoms are transformed into local positions relative to the reference centre by projecting their coordinates onto a local coordinate system associated with the centre; such projected atom positions are here called knowledge-based centres. This allows us to score a molecular complex by locating the same reference in it, projecting the atoms of the scored complex whose properties correspond (preferably but not necessarily) to those of the atoms in the database onto a local coordinate system in the same way as during the training procedure, and then calculating the score using a formula based on Gaussian functions.
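The projection step described above can be sketched as follows. This is only an illustration of one common way to attach a local coordinate system to a reference centre: it assumes that two neighbouring points that are not collinear with the centre are available to define the axes. The function names and the coordinates in the example are hypothetical and are not taken from CSTAT.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def local_frame(centre, p1, p2):
    """Orthonormal axes built from the centre and two non-collinear neighbours."""
    e1 = unit(sub(p1, centre))                  # first axis: towards p1
    e3 = unit(cross(e1, sub(p2, centre)))       # third axis: normal to the plane
    e2 = cross(e3, e1)                          # second axis completes the frame
    return e1, e2, e3


def to_local(point, centre, frame):
    """Coordinates of a point in the local frame of the reference centre."""
    d = sub(point, centre)
    return tuple(dot(d, axis) for axis in frame)


# Hypothetical coordinates: the same atom seen before and after the whole
# complex is rotated by 90 degrees about the z axis gives the same local position.
frame_a = local_frame((0, 0, 0), (2, 0, 0), (1, 1, 0))
frame_b = local_frame((0, 0, 0), (0, 2, 0), (-1, 1, 0))
print(to_local((1, 2, 3), (0, 0, 0), frame_a))
print(to_local((-2, 1, 3), (0, 0, 0), frame_b))
```

Because the axes move with the reference centre, positions taken from differently oriented complexes become directly comparable, which is what the training and scoring steps rely on.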
The EMP1-function formula is described by:

I 1 ri2, j J 1 wje 2V i 0, ai A j 0 EMP1 J 1 wj V 2 j 0

$$V = \frac{\sum_{j=0}^{J-1} w_j\, r_{j,\mu}^2}{\sum_{j=0}^{J-1} w_j}$$

where A is the set of protein atoms that meet the atom-type condition set for the particular EMP1-function, I is the number of atoms in A, i is the atom index and a_i is the atom with index i. Further, j is the index of a knowledge-based centre, where each centre represents one known (from preliminary training on experimental structures) spatial atom location in the local coordinate system of the EMP1-function, J is the number of centres, r_{i,j} is the distance between the protein atom a_i and the knowledge-based centre with index j, r_{j,μ} is the distance between the knowledge-based centre with index j and the mean vector of the knowledge-based centres, and w_j is the weight (1 in our case) of the knowledge-based centre with index j.

Figure 8. The scheme illustrating the principle of EMP1. The reference centre (white) is associated with a reference coordinate system where the atoms can be placed in a suitable way to be compared to the knowledge-based centres (green). The atom a_i (blue) is here compared to the knowledge-based centre j. The distance between them is r_{i,j}. Each knowledge-based centre is based on the position of a known atom in a reference structure relative to the reference centre. To be able to produce vectors for projection, a suitable reference centre must be somewhat rigid, with at least 2 discernible directions.

2.9. Scoring component based on EMP1 – c_emp1

While there are many possible approaches to constructing a scoring component using the EMP1 method, we used the following. To try to capture as many important phenomena for molecular interaction as possible with as few EMP1-centres as possible, we created 7 EMP1-functions at different centres: 6 centred at the 3 molecular sites of cytosine (O2, N3, N4) and one centred in the middle of the ring structure.
The 7 EMP1-functions are:

EMP1-function   site     atom type
O2_HNx          O2       hydrogen, connected to a nitrogen
O2_HOx          O2       hydrogen, connected to an oxygen
N3_HNx          N3       hydrogen, connected to a nitrogen
N3_HOx          N3       hydrogen, connected to an oxygen
N4_Nx           N4       nitrogen
N4_Ox           N4       oxygen
centre_Ar       centre   atoms involved in aromatic contacts

The atom type describes what kind of atoms the EMP1-function centred at the particular site recognizes. Each EMP1-function is the basis of one term in the c_emp1 scoring component, and each term is normalized by the number of EMP1-terms used, in our case 7. This makes it possible to compare EMP1-scores with a different number of terms. The score component for c_emp1 is described by:

$$C_{EMP1} = \{\, T_{O2\_HNx},\ T_{O2\_HOx},\ T_{N3\_HNx},\ T_{N3\_HOx},\ T_{N4\_Nx},\ T_{N4\_Ox},\ T_{centre\_Ar} \,\}$$

$$T_{O2\_HNx} = \frac{EMP1_{O2\_HNx}}{N_{EMP1}}, \quad T_{O2\_HOx} = \frac{EMP1_{O2\_HOx}}{N_{EMP1}}, \quad T_{N3\_HNx} = \frac{EMP1_{N3\_HNx}}{N_{EMP1}}, \quad T_{N3\_HOx} = \frac{EMP1_{N3\_HOx}}{N_{EMP1}},$$

$$T_{N4\_Nx} = \frac{EMP1_{N4\_Nx}}{N_{EMP1}}, \quad T_{N4\_Ox} = \frac{EMP1_{N4\_Ox}}{N_{EMP1}}, \quad T_{centre\_Ar} = \frac{EMP1_{centre\_Ar}}{N_{EMP1}}, \quad N_{EMP1} = 7$$

2.10. Recognition of cytosine – protein backbone motifs

In our tests and scores, the cytosine – protein backbone motifs are detected by investigating the hydrogen bonds involving the O2-, N3- and N4-sites with the B_hbond,acceptor:donor functions described earlier in the “Scoring component for modeling hydrogen bonds – c_hbond” section. Since more than one hydrogen bond can be detected by this function for each of the three sites, all combinations of bonds are considered and composed into motifs together with the information about amino acid indexing. When only investigating how the cytosine binds to the protein backbone, bonds with other parts of the protein are considered equal; that is, the information about bond type and amino acid index from such a bond is discarded. We have chosen to index the amino acids from the first valid bond (a bond with the protein backbone) in the order of the enumeration of the cytosine atomic sites (O2, N3, N4).
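The indexing rule can be made concrete with a small sketch. It takes one bond (or none) per site and writes the result as three atom,index pairs in the order O2, N3, N4, with side-chain bonds written as “c” and absent bonds as “-”, both with index 0. The function name and the input format are assumptions made for illustration; the residue numbers 255, 254 and 252 are those of the backbone motif in the crystal structure C5P_1IV4.

```python
SITES = ("O2", "N3", "N4")


def backbone_motif(bonds):
    """Build a motif string from one hydrogen bond (or none) per cytosine site.

    bonds maps a site name to (atom_type, residue_number) for a backbone bond,
    to "c" for a side-chain bond, or to None (or a missing key) for no bond.
    """
    reference = None  # residue number of the first backbone bond found
    parts = []
    for site in SITES:
        bond = bonds.get(site)
        if isinstance(bond, tuple):
            atom, residue = bond
            if reference is None:
                reference = residue
            parts.append(f"{atom},{residue - reference}")
        elif bond == "c":
            parts.append("c,0")
        else:
            parts.append("-,0")
    return "(" + "; ".join(parts) + ")"


three_site = {"O2": ("N", 255), "N3": ("N", 254), "N4": ("O", 252)}
two_site = {"N3": ("N", 254), "N4": ("O", 252)}
print(backbone_motif(three_site))
print(backbone_motif(two_site))
```

The second call shows the shift of the reference residue: when the O2-site is not bonded, the same N3 and N4 bonds receive the indices 0 and -2 instead of -1 and -3.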
This means that the indexing starts at different sites for motifs with a different number of initial participating atomic sites, which must be considered when comparing motifs. Motifs are further expressed in the form (a,b; c,d; e,f), where a, c and e are the atom types of the atoms binding with the O2-, N3- and N4-sites respectively, and b, d and f are the respective amino acid indices for a, c and e.

2.11. Scoring component based on hydrogen bond motifs – c_motif

To use the information from known motifs where the cytosine moiety binds through hydrogen bonding with the protein backbone, we created a scoring component that uses this kind of information (see RESULTS & DISCUSSION for a more thorough discussion of this approach). The motifs used are: (N,0; N,-1; O,-3), (N,0; N,-1; -,0), (N,0; -,0; O,-3) and (-,0; N,0; O,-2), which are variants of the same motif, but with different participating molecular sites. For convenience, these are named in order as M1, M2, M3 and M4 in this scoring component. We constructed the motif scoring component in the following form:

$$C_{MOTIF} = \{\, T_{M1},\ T_{M2},\ T_{M3},\ T_{M4} \,\}$$

$$T_{M1} = S_{M1}, \quad T_{M2} = S_{M2}, \quad T_{M3} = S_{M3}, \quad T_{M4} = S_{M4}$$

where S_M1, S_M2, S_M3 and S_M4 are the numbers of motifs of the corresponding type found in the complex.

2.12. Scoring component of the docking algorithms used – c_dock

This component is simply taken from the score incorporated in the docking algorithm used for generating the particular docking solutions: GOLD goldscore, GOLD chemscore or Glide. It can thus only be used in cases where this score is known, and we use it only to simulate a score which includes the scoring capabilities of the docking algorithms.

2.13. Normalization and calculation of the weighting coefficients

Linear combinations of scoring components result in a score with a number of terms.
We created our scores by first normalizing the terms with their mean value over all complexes in the score training set, and then weighting these normalized terms with weighting coefficients. A score constant is also added. Accordingly, our scores are weighted sums of the following form:

$$f_{score} = \sum_{i=1}^{I-1} w_i\, n_i\, t_i + w_0, \qquad n_i = \frac{D}{\sum_{\delta=0}^{D-1} t_{i,\delta}}$$

In this formula, i is the index of the term, I is the total number of terms in the score (including the constant term), t_i is the scoring component term corresponding to the total score term with index i, n_i is the normalization coefficient for the scoring component term and w_i is the weighting coefficient for the scoring component term. w_0 is the added score constant with the term index 0, and is also considered a weighting coefficient. For the normalization coefficient, δ is the docking result index and D is the number of docking results contributing to the training or evaluation of the score.

The weighting coefficients were obtained by using the docking results (Dataset 2) and adapting our (normalized) scoring functions to a number of fitting functions by a least squares procedure using the program KVAS (part of the PLATINUM program package). We have tried out 4 different fitting functions: Semi-step function, Negative RMSD, Negative RMSD with cut off and Term correlation. Each function is a function of the RMSD between the cytosine moiety and the crystal structure of the same substructure in the complex, except the last function, Term correlation. The fitting functions are:

Semi-step function:

$$F = \begin{cases} 1; & d_{RMSD} \le 3\ \text{Å} \\ 4 - d_{RMSD}; & 3\ \text{Å} < d_{RMSD} \le 4\ \text{Å} \\ 0; & 4\ \text{Å} < d_{RMSD} \end{cases}$$

Negative RMSD:

$$F = -d_{RMSD}$$

Negative RMSD with cut off:

$$d_{RMSD} \le 6\ \text{Å}: \quad F = -d_{RMSD}$$

Term correlation:

Si I 1 ti , D Si i 1 F I D 1 Si ti , 0

Here d_RMSD is the RMSD distance between the observed docked cytosine moiety and its crystal structure.
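The three RMSD-based fitting functions can be written down directly, as in the minimal sketch below. The function names are hypothetical, RMSD values are assumed to be given in Å, and returning None for a discarded docking result is a choice made for this illustration only. Term correlation is not included.

```python
def semi_step(d_rmsd):
    """Semi-step target: 1 up to 3 Å, linear ramp down to 0 at 4 Å."""
    if d_rmsd <= 3.0:
        return 1.0
    if d_rmsd <= 4.0:
        return 4.0 - d_rmsd
    return 0.0


def negative_rmsd(d_rmsd):
    """Negative RMSD target."""
    return -d_rmsd


def negative_rmsd_cutoff(d_rmsd, cutoff=6.0):
    """Negative RMSD target; results beyond the cut off are discarded (None)."""
    return -d_rmsd if d_rmsd <= cutoff else None


# Hypothetical RMSD values for five docking results.
for d in (1.2, 3.0, 3.5, 4.8, 7.1):
    print(d, semi_step(d), negative_rmsd(d), negative_rmsd_cutoff(d))
```

The semi-step target treats all poses within 3 Å as equally good, whereas the two negative-RMSD targets keep rewarding a pose for every further reduction of its distance to the crystal structure.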
In Negative RMSD with cut off, docking results that do not fulfil the condition are discarded and are thus not used for calculating weighting coefficients. Term correlation is not calculated from the RMSD between the docked ligand and its crystal structure, but from the scoring terms themselves. While RMSD-based criteria are well known, Term correlation is a new notion and deserves a more detailed discussion. In effect, it replaces the atomic coordinates compared by RMSD with the values of the interaction terms. The basic idea is that a docking solution is considered correct (native-like) if its interaction pattern statistically implies high quality compared to other interaction patterns. In Term correlation, i is the index of a term in a score and I is the total number of terms in the score (including the constant term), δ is the index of the docking result and D is the number of docking results. t_{i,δ} is the scoring component term corresponding to the score term i in the docking result with index δ (see the description of the score structure earlier). Each score function will thus have its own Term correlation function.

2.14. Validation of scoring functions

We validated combinations of scoring components using two different validation strategies, which we called “Complete training and test” and “Cross-training and test”.

In Complete training and test, Dataset 1 and Dataset 2 are both used for training the scoring functions and Dataset 2 is used as a test set. Dataset 1 is here used for training the knowledge-based scoring component c_emp1 and Dataset 2 is used for calculating the weighting coefficients during the training. While this procedure does not provide any new information, it allows testing the consistency of the obtained result.

In Cross-training and test, Dataset 1 and Dataset 2 are repeatedly divided into training and test sets. This is done on a protein-ligand combination basis.
The scores are trained on the parts of Dataset 1 and Dataset 2 belonging to the training set, and a validation is performed for each iteration. Complexes belonging to the training set are used for training the scoring functions: those from Dataset 1 to train the knowledge-based component c_emp1 and those from Dataset 2 to calculate the weighting coefficients. Complexes from Dataset 2 belonging to the test set are used for the testing. The division into a training set and a test set is made randomly, with all protein-ligand complex combinations treated equally in terms of their probability of belonging to a particular set. From the cross-training/cross-testing procedure we also obtain the variances of the weighting coefficients, the variances of the validation measurements and the mean values of the validation measurements (the validation measurements are explained below). The probabilities were set to 0.9 that a complex type belongs to the training set and 0.1 that it belongs to the test set. In addition, one complex type was randomly chosen to always belong to the test set, to avoid ending up with an empty test set in some iterations. The number of iterations for the cross-training/cross-validation procedure was set to 20.

To validate how a score performs, we used a number of score validation measurements mainly based on the RMSD between the docked ligand and its crystal structure, or in our case the RMSD between the cytosine substructure and its crystal structure counterpart. The first measurement is the “relative enrichment”, which is a measurement of how well the score orders the docking results.
The relative enrichment is calculated by measuring the area A_I under the ROC (receiver operating characteristic) curve that represents the rate of collecting correct docking solutions in a list ordered according to the scoring function under study, and then comparing this to the area A_max under an ideal ROC curve (that is, one where all correct docking poses are scored at the top of the list). The docking poses are commonly rank-ordered according to their scores from the best score to the worst score. The relative enrichment is thus calculated in the following way:

$$E = \frac{A_I}{A_{max}}$$

Figure 9. Example of the comparison of ROC curves. In this case a series of 60 docking poses has been validated by a score indexing them from 1 to 60, with 1 being the index given to the pose with the best score and 60 the index given to the worst-scoring result. The docking poses are also ranked by RMSD from the crystal structure of the cytosine moiety. Since an ideal score would in this case yield equal indexing and RMSD ranking, one can compare the resulting area with the area of an ideal score to get an overview of the ranking performance of the score to be validated. This comparison measurement is called the relative enrichment.

Another measurement of the performance of a scoring function is the “hit rank”: the best (lowest) rank of a credible native-like docking solution (RMSD_cyt < 2 Å, the PLATINUM standard) in the list rank-ordered according to the score under study. A hit rank of 1 is greatly preferred.

When validating a score for many different protein-ligand complexes at once, we use the relative enrichment and the hit rank to create three secondary measurements which summarize the performance of the score on the whole set of complexes: Mean Enrichment (ME), the mean relative enrichment over all complex types; Mean Hit Rank (MHRk), the mean hit rank over all complex types; and Hit Rate (HRt), which is the number of times a complex type shows a hit rank of 1.
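These measurements can be sketched as follows. The sketch rests on several assumptions that are not taken from the text: a higher score is treated as better, a pose counts as correct when its RMSD is below the threshold, the area under the curve is approximated by summing the running number of correct poses over the ranked list, and complex types without any correct pose are left out of the mean hit rank. All function names and the example data are hypothetical.

```python
def ranked_flags(scores, rmsds, threshold=2.0):
    """Correct/incorrect flags of the poses, ordered from best to worst score."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return [rmsds[k] < threshold for k in order]


def cumulative_area(flags):
    """Discrete area under the curve of collected correct poses versus rank."""
    area = hits = 0
    for flag in flags:
        hits += int(flag)
        area += hits
    return area


def relative_enrichment(scores, rmsds, threshold=2.0):
    """E = A_I / A_max, with A_max from the list that has all correct poses on top."""
    flags = ranked_flags(scores, rmsds, threshold)
    a_max = cumulative_area(sorted(flags, reverse=True))
    return cumulative_area(flags) / a_max if a_max else 0.0


def hit_rank(scores, rmsds, threshold=2.0):
    """Best (lowest) rank of a correct pose, or None if there is none."""
    flags = ranked_flags(scores, rmsds, threshold)
    return flags.index(True) + 1 if True in flags else None


def summarize(results):
    """results: one (relative enrichment, hit rank) pair per complex type."""
    enrichments = [e for e, _ in results]
    ranks = [r for _, r in results if r is not None]
    return {
        "ME": sum(enrichments) / len(enrichments),
        "MHRk": sum(ranks) / len(ranks) if ranks else None,
        "HRt": sum(1 for r in ranks if r == 1),
    }


# Hypothetical scores and RMSD values (Å) for four poses of one complex.
scores = [0.9, 0.5, 0.7, 0.1]
rmsds = [1.0, 5.0, 3.0, 1.5]
print(relative_enrichment(scores, rmsds), hit_rank(scores, rmsds))
```

In the example one correct pose is ranked first and the other last, so the hit rank is 1 while the relative enrichment stays below 1; the two measurements therefore capture different aspects of the ranking.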
For the purpose of comparison we complement these with two other measurements. Mean Random Hit Rank (MRHRk) is the hypothetical rank of a correct docking solution under the assumption that our scoring function is completely nonselective and the native-like solutions are distributed randomly in the rank-ordered list. Random Hit Rate (RHRt) is the hypothetical Hit Rate if this assumption is made for all complex types.

3. RESULTS & DISCUSSION

Figure 10. Schema of the workflow of the project. White blocks symbolize stored data. Green circles are events and stages in the process, with red arrows showing the process flow and blue arrows showing the data flow. Dotted blue lines are special cases of data flow where data about crystal structures are used to train the knowledge-based component c_emp1. N is the number of folds in the cross-training/cross-test procedure.

3.1. Analysis of non-covalent interactions in complexes with cytosine-containing ligands

Aiming at the development of new scoring functions, we have carried out an analysis of different types of non-covalent intermolecular interactions between the ligand and the protein for a set of protein-ligand complexes containing ligands with cytosine moieties (Dataset 1). The results of the analysis are shown in Tables 2a – 2e. Three interaction types were selected for investigation: hydrogen bonds, stacking interactions and hydrophobic interactions described by the MHP formalism. These interactions have been shown previously to adequately describe the interactions of adenine-containing ligands with proteins [4]. Through this analysis we could conclude that all above-mentioned interaction types are involved in different binding motifs for the cytosine moiety, but as expected there is no single cytosine-binding motif regarding these interaction types, but rather many different kinds of motifs.
Our goal was to reveal these binding patterns and to use them further in a scoring approach similar to the one proposed for ligands containing adenine substructures by Pyrkov et al. in 2007 [4]. Each part of the analysis is discussed separately below.

Hydrogen bonds in protein-ligand complexes for ligands containing adenine have been investigated in earlier studies [3,4], with the conclusion that only a few of all possible hydrogen bond patterns are actually present in the investigated datasets. The method to count the number of hydrogen bonds is the same as used in the scoring component c_hbond (see Methods), and the result of our analysis of hydrogen bonds is presented in Table 2a. We decided to investigate hydrogen bonds for a selected number of atom types in the ligands: all oxygen atoms with sp2 hybridization, all oxygen atoms with sp3 hybridization, all oxygen atoms in a saccharide substructure, all oxygen atoms in a phosphate substructure, all nitrogen atoms with sp2 hybridization, all nitrogen atoms involved in an aromatic contact, and the specific O2, N3 and N4 atoms in cytosine (in standard nomenclature). Secondary measurements were derived from these first measurements to further identify hydrogen bond patterns for the cytosine moiety. These describe combinations of hydrogen bonds for the cytosine moiety and include binary measurements for the presence of hydrogen bonds at the O2, N3 and N4 sites and at all sites simultaneously, and binary measurements for hydrogen bonds present at different combinations of the O2, N3 and N4 sites.

Simultaneous hydrogen bonding at all three sites (O2, N3, N4) was found in 46% of the investigated complexes. The N4-site turned out to be involved in hydrogen bonding slightly more frequently than the other two sites (74%), followed in order of frequency by the O2-site (68%) and the N3-site (66%).
This is an effect which can largely be explained by the fact that both the N4-site and the O2-site have possibilities to bind through 2 hydrogen bonds each. The two-site combinations showed somewhat similar frequencies with a presence of 58% for the O2/N4 and N3/N4 –combinations and a presence of 56% for the O2/N3combination. 23 Figure 11. Illustration of hydrogen-bonds (purple lines) between the cytosine-moiety and the protein backbone in the crystal structure C5P_1IV4. Only the protein backbone is shown from the protein to illustrate the O2, N3, N4 (N,0; N,-1; O,-3) 3-site cytosine-backbone motif CM1, which in this case is involving residues with numbers 255, 254 and 252. Also, residue number 249 can be seen to be involved in hydrogen bonding to the cytosine-moiety[ref]. 24 TABLE 2a. Analysis of intermolecular interactions in complexes with cytosine-containing ligands; Hydrogen Bonds. COMPLEX 1MC_1BKY AR3_1P5Z C2G_1N1D C2P_1ROB C3P_1RPF C5P_1H7F C5P_1IV4 C5P_1LP6 C5P_1QF9 C5P_1UJ2 CAR_1KDR CDC_1JYL CDF_1GX1 CDM_1INI CDM_1OJ4 CDP_1EYR CDP_1FFU CDP_1H7H CDP_1IV2 CDP_1XJN CDP_2AZ3 CDP_2CMK CG2_1OJ1 CMK_1GQC CPA_1RPG CSF_1RO7 CTN_1UEJ CTP_1COZ CTP_1H7G CTP_1I52 CTP_1KFD CTP_1MIY CTP_1TUG CTP_1UDW CTP_1UEU CTP_2AD5 DCM_1B5E DCM_1NJE DCP_1PEO DCP_1PKK DCP_5KTQ DCZ_1P60 DOC_1KDT GEO_1P62 GPC_1RDS MCN_1DGJ MCN_1N62 NCC_1QWJ PCD_1FFV PCD_1VLB Sum Mean O.2 0 0 0 1 1 1 1 0 1 2 4 2 1 1 1 2 1 1 2 1 0 3 1 1 1 2 1 0 1 1 0 0 1 1 0 1 2 1 0 0 0 0 2 0 1 1 1 3 1 1 O.3 0 0 1 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0 1 0 0 0 2 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 2 2 0 3 2 O.sug 0 3 6 1 1 3 3 2 3 5 1 2 2 4 0 1 2 2 1 4 2 0 1 6 1 4 5 0 2 2 0 1 1 4 1 3 1 2 0 0 0 3 0 2 0 0 3 1 3 1 O.po3 0 0 4 4 4 1 2 2 4 4 4 1 4 0 1 2 7 0 3 7 4 3 1 0 3 2 0 8 2 5 1 2 2 8 6 6 7 1 3 0 4 0 4 0 4 7 6 0 10 8 N.2 0 2 2 0 0 1 2 0 2 1 1 1 2 2 1 1 2 1 2 1 0 2 1 1 1 1 1 2 1 2 0 1 2 1 0 0 1 0 0 1 0 2 2 2 3 3 3 3 4 3 N.ar 0 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 2 1 2 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1 2 2 2 1 1 1 
Cyt_O2 0 0 0 1 1 1 1 0 1 2 4 2 1 1 1 2 1 1 2 1 0 3 0 1 1 2 1 0 1 1 0 0 1 1 0 1 2 1 0 0 0 0 2 0 0 1 1 2 1 1 Cyt_N3 0 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 1 50 1 21 0.42 95 1.9 161 3.22 67 1.34 40 0.8 47 0.94 33 0.66 Cyt_N4 Cyt_Sum 0 0 2 3 2 3 0 2 0 2 1 3 2 4 0 0 2 3 1 4 1 5 1 4 2 4 2 4 1 3 1 4 2 4 1 2 2 5 1 2 0 0 2 6 0 0 1 3 0 2 1 4 1 3 2 3 1 3 2 4 0 0 1 2 2 3 1 3 0 0 0 2 1 3 0 1 0 0 1 1 0 0 2 3 2 5 2 3 1 1 2 4 2 4 2 5 2 4 2 4 57 1.14 137 2.74 bCyt_O2 bCyt_N3 bCyt_N4 bCyt_All 0 0 0 0 0 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 34 0.68 33 0.66 37 0.74 23 0.46 O2/N3 0 0 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 1 1 1 O2/N4 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 1 1 N3/N4 0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 27 0.54 29 0.58 29 0.58 1) Hydrogen bonds involving the respective atom types: oxygen-atoms with a sp2-hybridizaton (O.2); oxygen-atoms with a sp3-hybridizaton (O.3); oxygen-atoms in a sugar-substructure (O.sug); oxygen-atoms in a phosphate-substructure (O.po3); nitrogen-atoms with a sp2 hybridization (N.2); nitrogen-atoms involved in an aromatic bond (N.ar); 2) Specific cytosine sites: O2 (Cyt_O2), N3 (Cyt_N3) and the two hydrogen atoms of the N4 nitrogen (Cyt_N4), 3) Secondary measurements: the sum of possible hydrogen bonds for the cytosine substructure (Cyt_Sum); binary measurements indicating the presence of a hydrogen bond (1) or not (0) for the cytosine sites (bCyt_O2, bCyt_N3, bCyt_N4); all 
the sites at once (bCyt_All); all pair combinations (O2/N3, O2/N4, N3/N4). 25 TABLE 2b. Analysis of intermolecular interactions in complexes with cytosine-containing ligands; Cytosine-Protein motifs. COMPLEX 1MC_1BKY AR3_1P5Z C2G_1N1D C2P_1ROB C3P_1RPF C5P_1H7F C5P_1IV4 C5P_1LP6 C5P_1QF9 C5P_1UJ2 CAR_1KDR CDC_1JYL CDF_1GX1 CDM_1INI CDM_1OJ4 CDP_1EYR CDP_1FFU CDP_1H7H CDP_1IV2 CDP_1XJN CDP_2AZ3 CDP_2CMK CG2_1OJ1 CMK_1GQC CPA_1RPG CSF_1RO7 CTN_1UEJ CTP_1COZ CTP_1H7G CTP_1I52 CTP_1KFD CTP_1MIY CTP_1TUG CTP_1UDW CTP_1UEU CTP_2AD5 DCM_1B5E DCM_1NJE DCP_1PEO DCP_1PKK DCP_5KTQ DCZ_1P60 DOC_1KDT GEO_1P62 GPC_1RDS MCN_1DGJ MCN_1N62 NCC_1QWJ PCD_1FFV PCD_1VLB Sum Mean Nr of motifs =88 MOTIF O2 N3 N N c N N N N c c c c c c N c N N N N c N c N N c N N N N c c c c c c c c N N N c c N N c c c N c N N c c c c N N N N N N c c N N N N c c N N c c c N N c c c c N N c c N c c N N N N N N c c c c c c c c N N c N N c c c c c N c c c c c c c c N N N N c c c c N N N N 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N4 0 0 0 0 0 0 0 0 -1 -1 0 0 0 0 0 0 0 0 0 0 0 -1 -1 0 0 0 0 0 -1 -1 0 0 0 -1 -1 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 -1 -1 0 0 0 0 -1 -1 -1 -1 c c O O O O O O c c c c c c c O O O O O O O O O O O O O O O O O c c c c c c O O O c O O O O O c O O c c c c c c c c c c c c O O O O O O O O O O O O O 0 0 0 0 3 0 0 -6 -6 -3 0 25 0 0 0 0 0 0 0 72 -5 -6 -3 67 68 0 68 9 -3 68 -6 -5 -2 -6 -3 2 0 0 0 0 0 0 0 0 -6 0 7 6 0 0 3 -6 67 68 0 0 -48 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -3 65 -3 68 63 69 4 10 -3 68 -3 65 bond O2 chain bb bomd N3 chain bb bond N4 chain bb 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 67 0.761 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 
1 0 1 1 1 1 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 31 0.352 0 0 0 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 36 0.409 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 65 0.739 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 39 0.443 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 26 0.295 0 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 75 0.852 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 29 0.330 0 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 46 0.523 O2/N3 bb O2/N4 bb N3/N4 bb 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 21 0.237 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 29 0.330 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 25 0.284 1) 
Atoms binding with the cytosine sites are shown under the respective site, together with the atom's index if the site forms a hydrogen bond with the protein backbone. 2) A bond with a protein side chain is marked with “c”; absence of a bond is marked with “-”. Both these special cases also receive the index 0. 3) Binary measurements indicate a bond (1) or no bond (0) for any type of bond (bond), a bond with a side chain (chain) and a bond with the protein backbone (bb). 4) Binary measurements indicating combinatory backbone binding for pairs of sites are marked “O2/N3”, “O2/N4” and “N3/N4”. 5) Mean values are taken over all 88 found motifs.

TABLE 2c. Analysis of intermolecular interactions in complexes with cytosine-containing ligands; Occurrence of Cytosine-Protein motifs.

COUNT, ALL
MOTIF count: 14 8 8 7 5 3 3 3 3 3 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
O2 atom: c N c N c N N N N N N N N c c c c N N c N N c c N N N N c c
O2 aai: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N3 atom: c N c c c N c N N N c N c c N c N N N N N c c c c
N3 aai: 0 0 -1 0 0 0 0 -1 0 -1 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 2 1 0 0 -1 0 0 0 0 0 0 0
N4 atom: c O c c O O O O O O c O O O O O O O O O O O O O O O c O O O O O
N4 aai: 0 0 -3 0 0 0 -6 -6 68 68 0 3 0 67 65 25 72 -5 0 9 -6 -5 -2 2 7 6 -48 29 0 0 0 0 63 69 4 10

COUNT, O2, N3
MOTIF count: 21 17 11 10 9 7 4 4 2 1 1 1
O2 atom: c N N c N N c N N
O2 aai: 0 0 0 0 0 0 0 0 0 0 0 0
N3 atom: c N c c N N N N N
N3 aai: 0 -1 0 0 0 0 0 0 0 0 2 1

COUNT, O2, N4
MOTIF count: 19 8 8 8 6 5 5 4 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
O2 atom: c N N N c N N N N N N c c c N N c N N c c N N c c
O2 aai: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N4 atom: c c O O O O O c O O O O O O O O O O O O O O O O O O
N4 aai: 0 0 0 -3 68 0 0 -6 -6 0 67 65 25 72 -5 0 9 -5 -2 2 7 6 -48 29 63 69 4 10

COUNT, N3, N4
MOTIF count: 21 9 9 8 6 5 4 4 3 3 3 2 2 2 2 2 2 1
N3 atom: c N c N c N c N N c N N c c N
N3 aai: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N4 atom: c O c O O O O O O O O O O O O -
N4 aai: 0 0 -2 0 0 -6 -5 -5 0 0 69 3 9 5 66 4 10 0

1) Under each list heading the number of found motifs (count) is shown together with the atom (O2 atom, N3 atom and N4 atom) of each site and the corresponding index (O2 aai, N3 aai and N4 aai). 2) The O2, N3, N4 (N,0; N,-1; O,-3) 3-site cytosine-backbone motif (CM1) is shown as the third ranking motif among the 3-site motifs.

Another goal that we set up was to investigate the existence of possible hydrogen bond motifs between the cytosine and the protein backbone, with an approach similar to the one used in a number of analyses performed on adenine.¹,⁷ We investigated the various hydrogen bond interactions between the cytosine moiety and the protein, with a focus on the amino-acid sequence index and the atom types of the protein for backbone contacts. The results of the analysis are presented in Tables 2b and 2c. All combinations of hydrogen bonds between the O2, N3 and N4 sites of the cytosine moiety and the protein were considered, and each combination was set to represent a motif. When a site formed more than one hydrogen bond, this resulted in more than one motif for the complex in question. Since no distinction is shown between different atoms of the protein side chains, some motifs of a complex look identical in the results; they are in fact unique motifs, distinguished by the different protein atoms participating in the bonds. In Table 2c, in addition to the list of 3-site motifs ordered by occurrence, we also present lists of 2-site motifs sorted by occurrence for all combinations of two hydrogen-bonding sites of the cytosine. The O2 and N3 sites are involved in a similar number of motifs: 67 (76.1%) and 65 (73.9%), respectively. The N4 site is involved in 75 (85.2%) motifs, again demonstrating a greater tendency to form hydrogen bonds.

When looking at motifs involving amino acid side chains, the N3 site is seen interacting with side chains in a larger number of motifs than the other two sites. The three sites can be ranked according to their involvement in motifs when interacting with the protein backbone: N4, 46 (52.3%) times; O2, 36 (40.9%) times; and N3, 26 (29.5%) times. When examining the 2-site combinations and how they are involved in interactions with the protein backbone, the combinations can be ranked as: O2, N4, 29 (33.0%) times; N3, N4, 25 (28.4%) times; and O2, N3, 21 (23.9%) times.

Considering the motif counts, the O2, N3, N4 (N,0; N,-1; O,-3) 3-site cytosine-backbone motif (CM1) clearly outnumbers the others, with 8 occurrences among the total of 88 counted motifs. This motif is illustrated in Figure 11. The 2-site motifs that constitute part of this 3-site motif are the O2, N3 (N,0; N,-1) with 17 occurrences, the O2, N4 (N,0; O,-3) with 8 occurrences and the N3, N4 (N,0; O,-2) with 9 occurrences. It is interesting to compare this motif tendency with the adenine motifs described by Denessiouk et al. in 2001¹ and by Mao and Wang et al. in 2004⁷. The N3 and N4 sites in CM1 show a binding pattern similar to the N1 and N6 sites in the “direct” adenine motif of Denessiouk and the tri-residue motif of Mao and Wang. In the case of cytosine binding to the protein backbone, the O2, N4 site combination tends to participate more in bonds to the protein backbone than the N3, N4 combination. Also, the O2, N4 2-site motif is the most frequently recurring among all backbone-binding motifs found.

While we have discovered cytosine-backbone 2-site motifs corresponding to the “reversed” or mono-residue motifs of adenine, they were not as frequent; only 3 motifs containing this pattern were found. More interesting might be to look for “reversed” cytosine-backbone motifs involving all three binding sites.
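The motif enumeration used in this analysis (one motif per combination of hydrogen bonds at the O2, N3 and N4 sites, so that a site with several bonds yields several motifs) can be sketched in Python. This is a minimal illustration under stated assumptions, not the program used in this work: the data layout, the function name enumerate_motifs and the two example complexes are hypothetical.

```python
from collections import Counter
from itertools import product

SITES = ("O2", "N3", "N4")
NO_BOND = ("-", 0)  # absence of a bond, index 0 as in the table footnotes


def enumerate_motifs(bonds_by_site):
    """Return every motif of one complex.

    bonds_by_site maps a site name to a list of bonding partners.
    Each partner is (atom, index): ("c", 0) for a side chain, or a
    backbone atom type with its amino-acid index, e.g. ("N", -1).
    (Hypothetical layout, chosen only for this sketch.)
    """
    options = [bonds_by_site.get(site) or [NO_BOND] for site in SITES]
    # One motif per combination: a site with two bonds doubles the count.
    return [tuple(combo) for combo in product(*options)]


# Two made-up complexes; the second has two hydrogen bonds at N4.
complexes = {
    "complex_a": {"O2": [("N", 0)], "N3": [("N", -1)], "N4": [("O", -3)]},
    "complex_b": {"O2": [("N", 0)], "N3": [("N", -1)],
                  "N4": [("O", -3), ("c", 0)]},
}

counts = Counter(
    motif for bonds in complexes.values() for motif in enumerate_motifs(bonds)
)
for motif, n in counts.most_common():
    print(n, motif)
```

Counting motifs rather than complexes is what makes the totals exceed the number of complexes, as with the 88 motifs found for the 50 complexes of Dataset 1.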
As can be seen from our results, there are indeed faint traces of such motifs. A pattern in which the O2 and N3 sites hydrogen bond to the same amino acid residue is found in 2 motifs, and in 2 other motifs the amino acid index increases in the direction from O2 to N4. The conclusion must be that, at least in our Dataset 1, such reversed cytosine-protein backbone motifs are rarities. While it would be interesting to further investigate ligand-protein motifs that include parts other than the cytosine moiety, for example the sugar moieties or phosphate groups, and/or amino acid side chains, we did not prioritize this in the present analysis.

The second type of intermolecular interaction investigated was the aromatic pi-pi stacking interaction. For this purpose we used the geometrical scoring methodology of the scoring component c_stack (see Methods), but with a separate scoring term for every stacking interaction between the cytosine moiety and the amino acids arginine, histidine, phenylalanine, tryptophan and tyrosine. Scoring terms above zero were treated as detected stacking interactions. In this manner the stacking interactions in Dataset 1 were counted; the results are displayed in Table 2d. With a total of 28 stacking interactions in 50 complexes (a mean of 0.56 per complex), we conclude that stacking interactions indeed constitute an important molecular determinant for our dataset. Phenylalanine had the largest number of stacking interactions with cytosine (12), and tryptophan had the smallest (0). From these results we concluded that no changes were needed in the construction of the scoring component c_stack compared to how it was constructed for scoring adenine.

TABLE 2d.
Analysis of intermolecular interactions in complexes with cytosine-containing ligands; Stacking-interactions

COMPLEX     #ARG  #HIS  #PHE  #TRP  #TYR  Tot
1MC_1BKY    0     0     1     0     1     2
AR3_1P5Z    0     0     2     0     0     2
C2G_1N1D    1     0     0     0     0     1
C2P_1ROB    0     0     0     0     0     0
C3P_1RPF    0     0     0     0     0     0
C5P_1H7F    0     0     0     0     0     0
C5P_1IV4    0     0     0     0     0     0
C5P_1LP6    0     0     0     0     0     0
C5P_1QF9    0     0     0     0     0     0
C5P_1UJ2    0     0     1     0     0     1
CAR_1KDR    1     0     0     0     1     2
CDC_1JYL    0     0     0     0     0     0
CDF_1GX1    0     0     0     0     0     0
CDM_1INI    0     0     0     0     0     0
CDM_1OJ4    0     0     1     0     1     2
CDP_1EYR    1     0     0     0     0     1
CDP_1FFU    0     0     0     0     0     0
CDP_1H7H    0     0     0     0     0     0
CDP_1IV2    0     0     0     0     0     0
CDP_1XJN    1     0     0     0     0     1
CDP_2AZ3    0     0     1     0     0     1
CDP_2CMK    1     0     0     0     1     2
CG2_1OJ1    0     0     0     0     0     0
CMK_1GQC    0     0     0     0     0     0
CPA_1RPG    0     0     0     0     0     0
CSF_1RO7    0     0     0     0     1     1
CTN_1UEJ    0     0     1     0     0     1
CTP_1COZ    1     0     0     0     0     1
CTP_1H7G    1     0     0     0     0     1
CTP_1I52    0     0     0     0     0     0
CTP_1KFD    0     0     0     0     0     0
CTP_1MIY    1     0     0     0     0     1
CTP_1TUG    0     0     0     0     0     0
CTP_1UDW    0     0     1     0     0     1
CTP_1UEU    0     0     0     0     0     0
CTP_2AD5    0     0     0     0     0     0
DCM_1B5E    0     0     0     0     0     0
DCM_1NJE    0     0     0     0     0     0
DCP_1PEO    0     0     0     0     0     0
DCP_1PKK    0     0     0     0     0     0
DCP_5KTQ    0     0     1     0     0     1
DCZ_1P60    0     0     2     0     0     2
DOC_1KDT    1     0     0     0     1     2
GEO_1P62    0     0     1     0     0     1
GPC_1RDS    0     1     0     0     0     1
MCN_1DGJ    0     0     0     0     0     0
MCN_1N62    0     0     0     0     0     0
NCC_1QWJ    0     0     0     0     0     0
PCD_1FFV    0     0     0     0     0     0
PCD_1VLB    0     0     0     0     0     0
Sum         9     1     12    0     6     28
Mean        0.18  0.02  0.24  0     0.12  0.56

1) The numbers represent the presence of stacking contacts with amino acids denoted with the 3-letter code (#ARG, #HIS, #PHE, #TRP and #TYR). 2) Tot means the total number of stacking interactions in the investigated complex.

Figure 12. Illustration of the function f_A (see text). Values of the function are shown in the respective cells. The horizontal axis is the MHP-sum offset for the ligand and the vertical axis is the MHP-sum offset for the protein. A local maximum is located around an MHP-sum offset of 0.25 for the ligand and 0.8 for the protein.

TABLE 2e.
Analysis of intermolecular interactions in complexes with cytosine; MHP

COMPLEX     Aphob    Aphil    Abur
1MC_1BKY    0.377    0.039    0.738
AR3_1P5Z    0.476    0.045    0.991
C2G_1N1D    0.009    0.068    0.993
C2P_1ROB    0.097    0.019    0.831
C3P_1RPF    0.143    0.026    0.721
C5P_1H7F    0        0.125    0.832
C5P_1IV4    0.342    0.057    0.95
C5P_1LP6    0.321    0.026    0.643
C5P_1QF9    0.672    0.076    1
C5P_1UJ2    0.612    0.005    1
CAR_1KDR    0.005    0.156    1
CDC_1JYL    0.036    0.087    0.896
CDF_1GX1    0.396    0.019    0.967
CDM_1INI    0        0.093    0.929
CDM_1OJ4    0.811    0        0.997
CDP_1EYR    0.002    0.119    0.827
CDP_1FFU    0.689    0.051    1
CDP_1H7H    0        0.14     0.829
CDP_1IV2    0.382    0.041    0.969
CDP_1XJN    0.02     0.08     0.639
CDP_2AZ3    0.397    0.009    0.735
CDP_2CMK    0.02     0.095    0.998
CG2_1OJ1    0        0        0.247
CMK_1GQC    0        0.039    0.762
CPA_1RPG    0.111    0.012    0.765
CSF_1RO7    0.291    0.03     1
CTN_1UEJ    0.586    0.02     1
CTP_1COZ    0.022    0.334    0.994
CTP_1H7G    0        0.15     0.869
CTP_1I52    0.002    0.165    0.849
CTP_1KFD    0.239    0        0.746
CTP_1MIY    0        0.109    0.727
CTP_1TUG    0.476    0.018    0.918
CTP_1UDW    0.554    0.007    0.991
CTP_1UEU    0.209    0.159    0.75
CTP_2AD5    0.004    0.051    0.543
DCM_1B5E    0.289    0.014    0.82
DCM_1NJE    0.089    0.026    0.591
DCP_1PEO    0.239    0        0.463
DCP_1PKK    0.345    0.019    0.707
DCP_5KTQ    0.189    0.016    0.334
DCZ_1P60    0.508    0.021    0.998
DOC_1KDT    0.036    0.084    1
GEO_1P62    0.47     0.014    0.998
GPC_1RDS    0        0.027    0.627
MCN_1DGJ    0.715    0.017    1
MCN_1N62    0.616    0.04     1
NCC_1QWJ    0.087    0.051    0.926
PCD_1FFV    0.745    0.04     1
PCD_1VLB    0.547    0.027    1
Sum         13.176   2.866    42.11
Mean        0.264    0.057    0.842

Number of complexes (Sum) and fraction of all complexes (Mean) in each fraction range; the range of each individual complex follows from the values listed above.

RANGE        Aphob        Aphil        Abur
[0,0.2[      24  0.48     49  0.98     0   0
[0.2,0.4[    12  0.24     1   0.02     2   0.04
[0.4,0.6[    7   0.14     0   0        3   0.06
[0.6,0.8[    6   0.12     0   0        12  0.24
[0.8,1.0]    1   0.02     0   0        33  0.66

1) Fraction of the total area being hydrophobic complementary (Aphob); 2) hydrophilic complementary (Aphil); 3) buried area (Abur).

Two analyses were performed to investigate the MHP1 approach to scoring in the case of cytosine moieties. The first was an analysis of the hydrophobic complementary area, the hydrophilic complementary area and the total buried area of the ligand in the protein binding pocket for all complexes in our dataset; it is presented in Table 2e. Areas are shown as fractions of the complete ligand area. In addition, we present how the areas are distributed over fraction ranges in steps of 0.2. The hydrophobic complementary area is the area of the ligand-protein contact which has hydrophobic properties as measured through the MHP formalism.
The hydrophilic complementary area is the corresponding area with hydrophilic properties. The buried area is the sum of the two area types plus the mismatching ligand-protein contact areas, where hydrophilic and hydrophobic areas are in contact with each other.

The results of the first MHP analysis revealed that the hydrophobic complementary area is generally larger than the hydrophilic complementary area for our dataset. We measured the mean hydrophobic complementary area to be 0.264 of the total ligand area and the mean hydrophilic complementary area to be 0.057 of the total ligand area. In the largest group of complexes (24 of 50) the hydrophobic complementary area is below 0.2 times the ligand area, with fewer complexes in each range of larger values. The highest hydrophobic complementary area measured was 0.811 times the total ligand area, for the complex CDM_1OJ4. All hydrophilic complementary areas but one were found in the range below 0.2 times the ligand area. When examining the buried area, most ligands are largely buried, with a mean buried area of 0.842 times the ligand area. Most complexes (33 of 50) have a buried area above 0.8, and the least buried ligand has a buried area of 0.247 times the ligand area. In this complex, CG2_1OJ1, we did not register any complementary hydrophobic or hydrophilic areas.

To improve our knowledge about the different area types in the complexes of our dataset, we wanted to analyze the other area types together with the areas examined in the first analysis. We therefore constructed a second, more informative analysis of all the available area types of the MHP concept. In the second analysis we intended to investigate how the different areas of the model vary when changing the sum offset for the ligand and the protein, respectively.
This was done in a similar fashion as in the investigation of the MHP1 model for the adenine moiety in our earlier study.⁴ By shifting the sum offset (see the MHP1 description in Methods) we could change which sum contributions are considered hydrophobic and which are considered hydrophilic, since this is a matter of which contributions to the sum are above and below zero. Since different parts of the protein and the ligand have varying hydrophobicity and hydrophilicity in terms of hydrophobicity constants, this changes which parts of the complex are considered hydrophobic and which are considered hydrophilic. The need to distinguish between hydrophobic and hydrophilic regions is a result of the concept of hydrophobicity/hydrophilicity itself. What interested us was how this definition could affect the complementarity of areas where the ligand and the protein, as well as the ligand and the solvent, are in contact with each other. More specifically, one of our goals was to find values of this offset that maximize matching areas (hydrophobic-hydrophobic, hydrophilic-hydrophilic) and minimize mismatching areas (hydrophobic-hydrophilic), in a way that could be useful in creating or modifying a score.

A number of complexes were selected for this analysis based on how large a part of the ligand is represented by the cytosine moiety, the larger the better. The MHP-sum offset was varied separately for the ligand and the protein to cover a wide spectrum of possible complementarity situations. Results were obtained as 2-dimensional matrices whose dimensions are the shifts in MHP-sum offset for the ligand and the protein, respectively.
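The offset scan described above can be sketched as follows. This is only an illustration under stated assumptions, not the MHP1 implementation used in this work: we assume that each contact patch carries an MHP sum for the ligand side and for the protein side, that a patch is hydrophobic when its shifted sum is above zero, and that the offset is subtracted from the sum. The function names and the four example patches are hypothetical.

```python
import numpy as np


def contact_areas(mhp_lig, mhp_prot, area, off_lig=0.0, off_prot=0.0):
    """Classify contact patches and return area fractions.

    Assumption for this sketch: a patch is hydrophobic when its
    MHP sum minus the offset is positive, hydrophilic otherwise.
    """
    mhp_lig = np.asarray(mhp_lig, dtype=float)
    mhp_prot = np.asarray(mhp_prot, dtype=float)
    area = np.asarray(area, dtype=float)
    lig_phobic = (mhp_lig - off_lig) > 0
    prot_phobic = (mhp_prot - off_prot) > 0
    total = area.sum()
    return {
        "phobic_match": area[lig_phobic & prot_phobic].sum() / total,
        "philic_match": area[~lig_phobic & ~prot_phobic].sum() / total,
        "mismatch": area[lig_phobic ^ prot_phobic].sum() / total,
    }


def scan_offsets(mhp_lig, mhp_prot, area, offsets):
    """Matching-area fraction for every (protein, ligand) offset pair.

    Rows follow the protein offset and columns the ligand offset.
    """
    grid = np.zeros((len(offsets), len(offsets)))
    for i, off_prot in enumerate(offsets):
        for j, off_lig in enumerate(offsets):
            a = contact_areas(mhp_lig, mhp_prot, area, off_lig, off_prot)
            grid[i, j] = a["phobic_match"] + a["philic_match"]
    return grid


# Four made-up contact patches of equal area.
mhp_lig = [0.5, -0.5, 0.2, -0.1]
mhp_prot = [0.4, -0.3, -0.2, 0.3]
area = [1.0, 1.0, 1.0, 1.0]
print(contact_areas(mhp_lig, mhp_prot, area))
print(scan_offsets(mhp_lig, mhp_prot, area, [-1.0, 0.0, 1.0]))
```

In this toy data the extreme offsets make every patch fall into the same class, so the matching fraction trivially reaches 1 at the corners of the matrix; a scan of this kind is only informative within a bounded offset range.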
Our selected complexes were:

1MC_1BKY, AR3_1P5Z, C5P_1IV4, C5P_1LP6, C5P_1QF9, C5P_1UJ2, CDF_1GX1, CDM_1OJ4, CDP_1FFU, CDP_1IV2, CDP_2AZ3, CTN_1UEJ, CTP_1TUG, CTP_1UDW, DCM_1B5E, DCP_1PKK, DCZ_1P60, GEO_1P62, MCN_1DGJ, MCN_1N62, PCD_1FFV, PCD_1VLB

When examining a function of the hydrophobic complementary areas and the hydrophilic complementary areas between ligand and protein, as well as between ligand and (hydrophilic) solvent, we could observe a slight local maximum for this function at an offset of approximately 0.25 for the ligand and 0.8 for the protein. The function of areas used in this case is described by:

$$ f_A = A_1(\delta_{lig}, \delta_{prot}) + A_4(\delta_{lig}, \delta_{prot}) + A_{4'}(\delta_{lig}, \delta_{prot}) $$

In this function, $A_1$, $A_4$ and $A_{4'}$ are the normalized area functions of the two MHP-sum offsets $\delta_{lig}$ and $\delta_{prot}$, where $A_1$ is the hydrophobic complementary area between ligand and protein, $A_4$ is the hydrophilic complementary area between ligand and protein and $A_{4'}$ is the hydrophilic complementary area between ligand and hydrophilic solvent (Fig. 12). All three area functions have been normalized by the total area sum over all measurements for the respective area.

While this illustrates an inherent problem with the hydrophobic/hydrophilic property system and also a way to deal with it, we judged that the effect of changing the MHP-sum offset from its default value would be close to negligible when scoring complexes with cytosine-containing ligands. In a similar analysis made on adenine, a change in the offset produced a more recognizable effect.⁴ Our analysis of the cytosine moiety could be interpreted to mean either that there are generally no large overlaps in the MHP sums of the ligand and the protein on the surface between the two, or that there exist some sort of symmetries which cancel out any effects of the offset. Further investigation of the subject is required to determine the real cause, which might include investigation of MHP behaviour in individual complexes.
We nevertheless deemed this result promising for the usefulness of the MHP concept in scoring cytosine structures in molecular docking. If there are generally no large parts of the surface between the cytosine moiety and the binding pocket where the MHP sums overlap, the need for calibrating the MHP-sum offset is largely eliminated.

Figure 12. Scheme illustrating classification of ligand-protein surface contacts in terms of MHP values: (1) MHP_LL, (2) MHP_LH, (3) MHP_HL, (4) MHP_HH, (4') MHP_HW. Here, subscripts denote: L, lipophilic (brown); H, hydrophilic (blue); W, water.

3.2. Investigating methods to score protein-ligand conformers

As mentioned above, and it is worth repeating, molecular docking is a complex task, and scoring is part of that task. Even when a scoring approach seems to get the job done, it is not always simple to determine why it is successful and whether the approach is robust for other cases. Often the argument for a scoring approach is a successful validation of the proposed approach on a representative test set. However, it is essential also to put the results in relation to the methods and data, since they are directly connected. A successful validation of a scoring approach should in turn be interpreted through the data and methods used. This might be especially important in cases such as ours, where we had a limited dataset to work with.

With a starting point in our previous work on adenine⁴ and the score used there, we have attempted to further develop those scoring approaches together with new techniques, and tested their performance on protein-ligand complexes with cytosine-containing ligands.
The score applied on adenine in the earlier work is a linear combination of scoring terms and is constructed as follows:

$$ f_{score,adenine} = w_0 + w_1 T_{hbond,N6} + w_2 T_{hbond,N1,N3,N7} + w_3 T_{hbond,N6} T_{hbond,N1,N3,N7} + w_4 T_{stacking,phe,tyr,trp,his} + w_5 T_{stacking,arg} + w_6 T_{MHP1} $$

$$ T_{hbond,N6} = \sum_{acceptors} B_{hbond,acceptor:N6} $$

$$ T_{hbond,N1,N3,N7} = \sum_{donors} B_{hbond,N1:donor} + \sum_{donors} B_{hbond,N3:donor} + \sum_{donors} B_{hbond,N7:donor} $$

The terms $T_{hbond,N6}$ and $T_{hbond,N1,N3,N7}$ describe hydrogen-bond interactions using the standard enumeration of the adenine atoms N1, N3, N6 and N7 and the hydrogen-bond scoring methodology described in Methods. The other terms describe stacking and hydrophobic (using MHP) interactions, and are the same as the terms with the same names that we have used in our new scores (see Methods). The rationale for that will be explained further on.

Our main interests were whether the scoring approaches successfully used on adenine could also be applied to cytosine, and how a score using these methods could be further improved. We were also interested in determining the performance of each scoring approach, and of combinations of them, on our dataset, and in investigating issues concerning consensus scoring, such as strategies for constructing consensus scores.

By ordering different scoring approaches under what we call scoring components, we created a convenient basis upon which we could later build our consensus scores by combining different scoring components into an integral scoring function. A scoring component was meant to comprise a particular scoring method with one or more scoring terms. To use the scoring methods from the score for adenine, that score was divided into an appropriate number of components representing different scoring approaches, now associated with the scoring components for hydrogen bonds (c_hbond), aromatic pi-pi stacking (c_stack) and hydrophobic contact (c_mhp1).
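As an illustration of how such a linear combination of scoring terms is evaluated, the adenine score can be sketched in Python. The sketch assumes that the double term is the product of the two hydrogen-bond terms and that each hydrogen-bond term is a plain sum of binary bond indicators; the function names and the example weights are hypothetical and are not the fitted values of the earlier work.

```python
def hbond_term(bond_flags):
    """Sum binary hydrogen-bond indicators (1 = bond, 0 = no bond)."""
    return sum(bond_flags)


def adenine_score(w, t_n6, t_ring, t_stack_arom, t_stack_arg, t_mhp1):
    """Linear combination of scoring terms with weights w[0]..w[6].

    Assumption for this sketch: the double term w[3] multiplies the
    product of the N6 term and the ring-atom (N1, N3, N7) term.
    """
    return (
        w[0]
        + w[1] * t_n6
        + w[2] * t_ring
        + w[3] * t_n6 * t_ring
        + w[4] * t_stack_arom
        + w[5] * t_stack_arg
        + w[6] * t_mhp1
    )


# Made-up conformer: one acceptor bonded to N6, two donors bonded to
# ring atoms, one aromatic stacking contact, no arginine stacking.
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # placeholder weights
t_n6 = hbond_term([1])
t_ring = hbond_term([1, 0, 1])
score = adenine_score(weights, t_n6, t_ring, 1.0, 0.0, 0.5)
print(score)
```

Because the score is linear in the weights, adding or removing a scoring component only adds or removes a weighted term, which is what makes this structure convenient for building composite scores.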
These scoring components were slightly modified to adapt them to the cytosine substructure but remain largely unchanged. In the scoring component c_hbond, we added and converted the terms so that they correspond to the three molecular sites on cytosine where hydrogen bonding is possible. We kept a double term representing simultaneous bonding of ring atoms and the N6 site of adenine, partly as a rudiment that lets us determine the performance of the previous adenine score on our cytosine dataset. It also had to be converted to correspond to the new molecular sites, and was set to be a combination of the terms representing the N3 and N4 sites in cytosine, since we considered them analogous to the corresponding sites in adenine.

Our new contributions to the ensemble of scoring components are the c_emp1 and c_motif components. As the available structural data on protein-ligand complexes grow over time, we found it promising to introduce a knowledge-based part in our scores that could draw on this accumulated data and potentially become more accurate as the data continue to grow. Knowledge-based scores that rely on the stored statistical distribution of receptor atoms around particular ligand groups also tend to model implicitly effects that are otherwise hard to capture using other scoring approaches.¹² We were eager to discover what effect this would have on our scores. As a starting model for this scoring approach we created the EMP1 model, which is intuitively simple (see Methods). To translate this model into a working scoring component we had to decide which recurring centres should be used as reference centres in the crystal structures and in the evaluated complex (a docking solution).
We also had to decide what to use as knowledge-based centres (if atoms, which atoms, and in combination with which reference centre) and, finally, what weight to assign to the structural data used in the training of the algorithm. Regarding the reference centres and knowledge-based centres, we settled for a very simple model that we thought would statistically capture some of the most important physical effects governing the molecular recognition of the cytosine moiety by its protein binding pocket. Modeling hydrogen-bond effects at the three molecular sites of cytosine capable of them, as well as pi-pi stacking contacts of the cytosine heteroaromatic ring structure, is of course just one example of what could be done using the same principle. It is likely that the accuracy of a score using the EMP1 model could be increased by applying a more elaborate system of reference centres and knowledge-based centres.

Reference centres and knowledge-based centres aside, which are more of a design issue, a larger problem might be how to value the data in the training procedure. The weighting of knowledge-based data connects the score to the underlying physical effects it is modeling, since the weighting defines exactly what the score represents. Weighting also accounts for the trustworthiness of the data and possibly of other parts of the model. The understanding of a knowledge-based score as implicit and statistical, rather than based on clearer physical models, has to be put in relation to the method's capacity to model physical phenomena statistically. Weighting, while sometimes ambiguous, is in fact a useful mechanism for regulating the influence of individual data on the score outcome.
By connecting the weights of the structural data to physical quantities, such as free energies of binding or measurements of biological activity, it is possible to model these through the knowledge-based score and, in theory, to use the model to predict these values. Weighting could possibly also be used to account for differences and uncertainties due to varying X-ray crystallographic resolution in the observed complexes. In our case, since we lacked a complete set of experimental biological or physical data on free energy changes at binding, or on activity, for the complexes in our dataset, we could not use such data for weighting purposes.

One use for the weighting of knowledge data that we initially tried was to complement our knowledge-based data with structural data generated by docking algorithms. This provided us with negative training data (wrong docking poses) which, complementing the positive training data (crystallographic structures and correct docking solutions), are requisite for developing an efficient scoring criterion. We believed that we could improve the accuracy of the knowledge-based score by also training it with these docking-generated conformers, regulating their trustworthiness through a weight connected to some measure of RMSD between the ligand of the generated conformer and its crystal structure. Initial results showed that it was difficult to determine the trustworthiness of docking-generated ligand poses from RMSD measurements, to put that in relation to the known crystal structures and to weight the poses accordingly. This method was therefore discarded, with the motivation that the knowledge-based score should be tested with trustworthy data before any such experiments are attempted. We still believe that an improvement in accuracy can be achieved by this means, but such a method has yet to be improved and further investigated.
Related to the weighting issue is the impact of having to train the knowledge-based scoring algorithm with multiple data. If one complex is treated as one measurement, it is useful to define what a measurement is meant to represent in the knowledge-based statistical model that we try to create. This task might sound more trivial than it really is. Depending on what is to be modelled by the knowledge-based model, the weighting of the different data measurements in the training of the algorithm must be balanced so that the resulting knowledge-based data structure grows to represent a valid physical model. There is a risk that the input data overlap significantly in information, which could, if not properly taken care of, lead to imbalances in the knowledge-based score. Thus a suitable relationship between the weight of a piece of input data and its similarity to other data might be required, depending on what the knowledge-based model represents. In our case, since we lacked information about our complexes with which we could train our model, we had to revise our representation of the model and work with something simpler.

In this work we settled for a design of our knowledge-based scoring component in which all knowledge-based data were treated equally and assigned a weight of 1. This statistically models atom placement by treating our dataset of structural data as statistics of the general placement of the cytosine moiety over all possible complexes. The assumption that our dataset models such a placement thus has to be made, and such an assumption might of course be more or less valid. This is especially so since our dataset is not particularly large, and it is difficult to judge how well the accumulated data in the PDB on complexes binding cytosine-containing ligands, not to mention the sample from those data making up our dataset, support such an assumption.
Our statistical model, upon which the knowledge-based scoring component has been created and from which the weighting of data was to be derived, might be one of the largest shortcomings in the scoring approach of the c_emp1 component.

From the results obtained in the investigation of the intermolecular interactions in our dataset, we concluded that the traces of a recurring motif in which the cytosine binds to the protein backbone through hydrogen bonding provide enough support for investigating the use of these data in a new method of scoring. For this purpose, we settled on the current design of the c_motif component. Both the approach and the design of the motif-recognizing scoring component are simple, which makes this approach to scoring interesting from two perspectives. First, it is always useful to have a scoring component which is quick and easy to calculate, and any improvement this component contributes to our scoring attempts has to be put in relation to its calculation cost. If proved effective, this would make it an interesting advance. The second perspective is that we knew from the start that this approach to scoring is inherently very biased, because it is only able to score a very limited number of conformer cases, while a complete score would need to be able to evaluate cases outside this component's capacity. This would prove to be a valuable challenge to us when investigating methods to create composite scores.

3.3. Investigating strategies to combine terms of different scoring components and to validate scores

Most methods for scoring conformational states of protein-ligand conformers make more or less serious misjudgements.⁹,¹² Scoring methods are also biased in how they model and capture different phenomena. The term consensus score, in its meaning of a score that combines different scoring methods, is somewhat misleading, since it can be applied to most successful contemporary scoring functions.
What is a consensus score and what is not is more a matter of interpretation. Scoring functions generally show the characteristic of a consensus score, that is, they combine different scoring methods into one score. This is why we think the question of consensus scoring is of importance, and why it is interesting to investigate strategies for constructing composite scores. While there are infinitely many ways to construct such scores, we continued our work on the score structure that had been used earlier⁴ (see “Investigating methods to score protein-ligand conformers” above), since this is a rather flexible and fast way of constructing a score. This score is simply a linear combination of a number of scoring terms which are calculated by different methods. These terms are weighted, and usually a constant term is added, in order to maximize the predictive performance of the score. Here the interesting question is how to perform this combination and weighting procedure, and how to validate the performance of such a score.

In the weighting procedure the score has to be fitted to a function for which information about the quality to be modelled (free energy change at binding, biological activity etc.) is already known. There are cases where such information about molecular complexes is available, but these complexes are often too few for a satisfactory fitting procedure. A commonly used solution is to generate a multitude of protein-ligand conformers and to approximate quality from measurements of RMSD from known crystal structures. We did not have any data on binding energies or activity for our dataset, so we resorted first to this alternative of approximating quality with the RMSD measurement. In the generally rare cases when some sort of symmetry is present in a ligand molecule, the approximation capability of this measurement is better the smaller the measured RMSD.
Further away from the crystal structure in terms of RMSD, the scoring environment might change radically, to the disadvantage of the approximation. We believed, however, that such a rare case could prove to be a challenge for our cytosine-specific scoring functions, after we discovered possible recurring local score maxima at RMSD_cyt of approximately 7 Å and 13 Å in preliminary scoring results obtained for our dataset (Fig. 13). The maximum at 7 Å was believed to be caused, for a recurring number of complexes, by the ring structure of cytosine in conformational states corresponding to a 180° rotation of the ring plane along the R-C1 axis or some other axis. The other maximum, at 13 Å, can possibly be explained by conformations of ligands containing 2 ring structures, where the ligands are positioned with the rings in opposite positions compared to the crystal structure. The problem of symmetry in the ligand molecule therefore appeared to be quite important in our case. The exact connection between possible quality maxima at these positions and structural symmetries of these ligands will have to be further investigated.

Using the RMSD measurement to approximate docking quality for conformers around these distances would most likely be a source of error, since it assumes a lower quality than what is probably the case. If we were to use the RMSD measurement both for score fitting and later for validation of our scores, there seemed to be reason to use this measurement only in an RMSD range where we were more certain that the approximation corresponds to some value near the assumed quality maximum of the crystal structure. In theory, not doing so could suppress the ability of a score to discern features of the quality it is supposed to measure and, equivalently, discredit scores with such abilities during score validation. To visualize and investigate this theory we decided to include two different concepts of score training and score validation.
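The idea of fitting a linear score to an RMSD-based approximation of quality, optionally restricted to conformers close to the crystal structure, can be sketched as follows. This is a minimal sketch and not the fitting procedure of this work: we assume that the fitting target is simply the negated RMSD, that the weights are found by ordinary least squares, and that the cut-off discards conformers at or above the limit. The function name fit_weights and the example numbers are hypothetical.

```python
import numpy as np


def fit_weights(terms, rmsd, cutoff=None):
    """Least-squares fit of a linear score to negated RMSD.

    terms  : (n_conformers, n_terms) array of scoring-term values
    rmsd   : (n_conformers,) array of RMSD values in Angstroms
    cutoff : if given, only conformers with rmsd < cutoff are used
    Returns [w0, w1, ...] with w0 the constant term.
    """
    terms = np.asarray(terms, dtype=float)
    rmsd = np.asarray(rmsd, dtype=float)
    keep = np.ones(len(rmsd), dtype=bool) if cutoff is None else rmsd < cutoff
    # Prepend a column of ones so that the constant term w0 is fitted too.
    design = np.column_stack([np.ones(int(keep.sum())), terms[keep]])
    target = -rmsd[keep]  # assumption: quality approximated by -RMSD
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights


# Made-up conformers with a single scoring term.
terms = [[0.0], [1.0], [2.0], [3.0]]
rmsd = [1.0, 3.0, 5.0, 7.0]
print(fit_weights(terms, rmsd))              # whole RMSD range
print(fit_weights(terms, rmsd, cutoff=6.0))  # conformers below 6 Angstroms
```

Fitting on the whole range and fitting below a cut-off use the same procedure; only the set of conformers that is allowed to influence the weights changes.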
In one concept, we use RMSD-measurements from the whole range for training and validation, while in the other we use only docking results with RMSD-measurements below a set limit, for comparison.

The first concept is to use the whole RMSD-range for score training and validation. During training this is realized by using the fitting functions Semi-step function and Negative RMSD, and during validation by using conformers from the whole RMSD-range. In the second concept we have placed an arbitrarily selected RMSDcyt cut-off at 6 Å from the crystal structure, below which we consider RMSD-measurements acceptably reliable for approximating the quality of a docking solution. During training this concept is realized through the fitting function Negative RMSD with cut off, and during validation by applying an RMSD cut-off below which conformers are used for validation. The limit was placed with regard to the possible score maxima at RMSDcyt of 7 Å and 13 Å, as a compromise with retaining enough docking results below the limit to produce interesting validation results (see Validation of scoring approaches).

Based on our earlier experiences with using RMSD-measurements as approximations of the quality of a docking solution, we decided to investigate alternative ways of approximating conformer quality for score training and possibly also score validation. For this purpose we created the Term correlation fitting function.

Figure 13. A plot of the calculated score versus RMSDcyt for all complexes in Dataset 2, using the score comprising c_hbond, c_stack, c_mhp1, c_emp1, and c_motif trained on the whole Dataset 1 and Dataset 2 with the fitting function Term correlation. This plot illustrates possible recurring structural quality maxima at different distances from the crystal structures (at 7 Å and 13 Å).
RMSDcyt is expressed in Ångströms.

3.4. Validation of scoring approaches

To validate both our scoring approaches, in the form of combinations of scoring components into complete scores, and our different methods of score construction, mainly the different fitting functions, we combined the scores with the score construction methods, our different validation strategies and the different data subsets from different docking algorithms when performing our evaluation. For each validation strategy (Complete training and test and Cross-training and test), all proposed score combinations have been validated once for each fitting function (Semi-step function, Negative RMSD, Negative RMSD with cut off and Term correlation) for each data subset in Dataset 2 (GOLD goldscore, GOLD chemscore and Glide) and once with the whole dataset. Results from our evaluation of the scores can be found in Tables AI.1 to AI.8 in Appendix I. The main reason for testing the performance of scores on the subsets of Dataset 2 is that doing so allows a comparison between the performance of our scores and the performance of the scores used by the docking algorithms on the same dataset. To visualize the performance of scores at a shorter distance from the known crystal structures in terms of RMSD, we also produced validation results for complexes from Dataset 2 with a measured RMSDcyt < 6 Å from the respective known structure in Dataset 1. These results are marked with “CUT”. Results taking into account the whole RMSD-range are marked with “full”.
The cut-off distance was arbitrarily chosen to lie between 0 and the expected quality maximum at 7 Å (see Investigating methods to score protein-ligand conformers, Investigating strategies to combine terms of different scoring components and to validate scores), excluding complexes believed to rightfully score well but not belonging to the maximum represented by the crystal structure, while still keeping as many conformers as possible to make the result meaningful from a statistical point of view.

Each score combination is built up from scoring components, each adding a number of terms to the score. The scoring components used are those listed in the Methods section. Additionally, for comparison, we included the dockscore: the score read from the docking program by which the complex conformer was created (GOLD chemscore, GOLD goldscore, Glide).

When evaluating the scoring quality of the scores we mainly used the following measurements: mean enrichment (ME), relative mean hit rank (RelMHRk), and relative hit rate (RelHRt). ME is the mean enrichment over all complexes, with enrichment measured separately for each protein-ligand combination. ME lies in the range between 0 and 1 and measures the score’s capacity to order docking results, with 1 denoting a perfect ME and a value below 0.5 an unsuccessful one. RelMHRk is the ratio between the mean hit rank and the mean random hit rank and can serve as a measurement of how precise the score is, that is, the mean scoring rank of the best docking result compared to the random rank of the best docking result. A RelMHRk < 1 is considered better than a random ordering of the docking results, and a higher RelMHRk is considered less successful. RelHRt is the ratio between the hit rate and the random hit rate. RelHRt is also a measurement of how precise the score is, in terms of the capacity to rank the best result as number 1.
A RelHRt > 1 suggests that the score’s performance is better than random, and a RelHRt < 1 that it is worse.

Table 3a. Ranking of scores by Mean Enrichment (ME), Complete training and test

semi-step function
score: chemscore_C chemscore c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack_C c_hbond c_stack c_motif_C glide_C glide c_hbond c_hbond c_stack c_emp1 c_emp1_C c_hbond c_stack c_motif goldscore c_hbond c_stack c_mhp1 c_hbond c_stack c_emp1 goldscore_C c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.99 0.98 0.93 0.93 0.91 0.9 0.88 0.88 0.87 0.87 0.87 0.87 0.86 0.85 0.85 0.84 0.83 0.83 0.81 0.8 0.77 0.76 0.74 0.71 0.64 0.56 0.54 0.4

negative rmsd
score: chemscore_C chemscore c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack_C c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1_C c_hbond_C c_emp1_C c_hbond c_stack c_mhp1 glide_C c_hbond c_stack c_motif glide c_hbond c_hbond c_stack c_emp1 goldscore goldscore_C c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.99 0.98 0.94 0.93 0.92 0.92 0.92 0.91 0.89 0.89 0.89 0.88 0.88 0.87 0.87 0.86 0.86 0.85 0.85 0.84 0.83 0.76 0.74 0.71 0.62 0.56 0.5 0.4

negative rmsd with cutoff
score: chemscore_C chemscore c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack_C c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond_C glide_C c_hbond c_stack c_mhp1 c_emp1_C glide c_hbond c_stack c_motif c_hbond c_hbond c_stack c_emp1 goldscore goldscore_C c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.99 0.98 0.92 0.9 0.9 0.89 0.89 0.89 0.89 0.88 0.88 0.88 0.87 0.86 0.86 0.86 0.85 0.84 0.84 0.83 0.83 0.76 0.74 0.71 0.62 0.56 0.5 0.4

term correlation
score: chemscore_C chemscore c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1 c_hbond_C glide_C c_hbond c_stack c_hbond c_stack c_motif glide c_hbond c_emp1_C goldscore c_emp1 goldscore_C c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.99 0.98 0.92 0.92 0.9 0.9 0.9 0.9 0.89 0.89 0.89 0.88 0.88 0.87 0.86 0.86 0.86 0.85 0.85 0.83 0.79 0.76 0.74 0.71 0.62 0.56 0.52 0.4

Table 3b. Ranking of scores by Mean Enrichment (ME), Cross-training and test

semi-step function
score: c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1 c_hbond c_stack_C c_hbond_C c_hbond c_stack c_emp1 c_hbond c_stack c_motif c_hbond c_hbond c_stack c_emp1_C c_emp1 c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.950 0.933 0.917 0.914 0.912 0.902 0.891 0.876 0.876 0.875 0.847 0.843 0.835 0.833 0.825 0.754 0.737 0.727 0.668 0.613 0.549 0.454

negative rmsd
score: c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_hbond c_hbond c_stack c_motif c_hbond c_stack c_emp1 c_emp1_C c_hbond c_stack_C c_hbond c_stack c_mhp1 c_hbond c_stack c_mhp1_C c_hbond c_stack c_emp1 c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.940 0.939 0.935 0.930 0.926 0.921 0.908 0.907 0.893 0.891 0.885 0.882 0.868 0.863 0.846 0.838 0.720 0.707 0.576 0.561 0.455 0.415

negative rmsd with cutoff
score: c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_emp1_C c_hbond c_stack_C c_hbond c_stack c_mhp1 c_emp1 c_hbond_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_motif_C c_hbond c_stack c_emp1 c_hbond c_hbond c_stack c_hbond c_stack c_motif c_emp1_C c_emp1 c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.950 0.921 0.920 0.916 0.913 0.910 0.908 0.888 0.884 0.882 0.879 0.869 0.863 0.837 0.828 0.804 0.778 0.737 0.562 0.558 0.438 0.418

term correlation
score: c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_motif_C c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_hbond c_stack c_motif c_hbond c_stack_C c_hbond c_stack c_hbond_C c_emp1_C c_hbond c_emp1 c_mhp1_C c_mhp1 c_stack_C c_motif_C c_stack c_motif
ME: 0.947 0.930 0.928 0.923 0.919 0.911 0.910 0.909 0.906 0.897 0.897 0.858 0.851 0.850 0.821 0.801 0.764 0.724 0.610 0.525 0.471 0.365

Table 3a and Table 3b show the validation results for our scores, ranked according to achieved Mean Enrichment (ME). Table 3a shows the results obtained with the validation method Complete training and test, and Table 3b the mean results obtained with the validation method Cross-training and test.

Table 3c. Ranking of scores by Relative Mean Hit Rank (RelMHRk), Complete training and test

semi-step function
score: chemscore glide c_hbond c_stack c_mhp1 c_emp1 c_motif chemscore_C goldscore c_hbond c_stack c_mhp1 c_emp1 glide_C c_hbond c_stack c_emp1 c_hbond c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_emp1_C c_hbond c_stack c_motif c_emp1_C c_hbond c_stack c_mhp1 c_hbond_C c_hbond c_stack c_hbond c_stack c_mhp1_C goldscore_C c_emp1 c_hbond c_stack_C c_hbond c_stack c_motif_C c_mhp1 c_mhp1_C c_stack c_stack_C c_motif_C c_motif
RelMHRk: 0.47 0.74 0.75 0.75 0.82 0.84 0.89 0.90 0.94 0.96 0.96 0.97 1.12 1.17 1.24 1.26 1.28 1.32 1.38 1.39 1.41 1.41 2.13 2.42 3.27 3.36 4.62 4.86

negative rmsd
score: chemscore c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif glide chemscore_C goldscore glide_C c_hbond c_stack c_mhp1 c_hbond c_hbond c_stack c_emp1_C c_hbond c_stack c_motif c_hbond c_stack c_emp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond_C c_hbond c_stack_C c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1_C goldscore_C c_mhp1 c_mhp1_C c_stack c_stack_C c_motif_C c_motif
RelMHRk: 0.47 0.60 0.66 0.66 0.74 0.75 0.82 0.89 0.90 0.91 0.91 0.96 0.97 0.99 1.00 1.03 1.03 1.21 1.21 1.21 1.28 1.38 2.13 2.42 3.52 3.55 4.62 4.86

negative rmsd with cutoff
score: chemscore glide chemscore_C c_hbond c_stack c_emp1 goldscore c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 c_emp1 glide_C c_hbond c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_hbond c_stack c_hbond c_stack c_motif c_emp1 c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_emp1_C c_hbond_C c_hbond c_stack_C c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1_C goldscore_C c_mhp1 c_mhp1_C c_stack c_stack_C c_motif_C c_motif
RelMHRk: 0.47 0.74 0.75 0.78 0.82 0.87 0.88 0.89 0.93 0.93 0.98 0.99 0.99 1.08 1.18 1.18 1.20 1.26 1.26 1.26 1.29 1.38 2.13 2.42 3.52 3.55 4.62 4.86

term correlation
score: chemscore c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif glide c_hbond c_stack c_mhp1 chemscore_C c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack goldscore c_hbond c_stack c_motif c_hbond c_hbond c_stack c_emp1_C glide_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_motif_C c_emp1_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack_C c_hbond_C c_emp1 goldscore_C c_mhp1 c_mhp1_C c_stack c_stack_C c_motif_C c_motif
RelMHRk: 0.47 0.66 0.74 0.74 0.75 0.75 0.77 0.81 0.82 0.83 0.86 0.86 0.89 1.08 1.16 1.17 1.18 1.19 1.20 1.21 1.25 1.38 2.13 2.42 3.32 3.44 4.62 4.86

Table 3d. Ranking of scores by Relative Mean Hit Rank (RelMHRk), Cross-training and test

semi-step function
score: c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_emp1 c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1_C c_hbond_C c_emp1_C c_hbond c_stack c_hbond c_stack c_motif c_emp1 c_hbond c_stack_C c_hbond c_stack c_motif_C c_mhp1 c_mhp1_C c_stack c_stack_C c_motif_C c_motif
RelMHRk: 0.686 0.781 0.819 0.828 0.912 0.927 0.931 0.957 0.976 1.028 1.051 1.071 1.330 1.383 1.459 1.604 2.077 2.233 3.094 3.575 4.851 5.234

negative rmsd
score: c_hbond c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond_C c_hbond c_stack c_mhp1 c_hbond c_stack c_emp1_C c_hbond c_stack c_motif c_emp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_hbond c_stack c_motif_C c_hbond c_stack_C c_hbond c_stack c_mhp1_C c_mhp1 c_mhp1_C c_stack_C c_stack c_motif_C c_motif
RelMHRk: 0.563 0.572 0.675 0.774 0.863 0.866 0.875 0.953 1.007 1.031 1.040 1.063 1.070 1.198 1.258 1.515 2.218 2.744 3.592 3.730 4.637 5.087

negative rmsd with cutoff
score: c_hbond c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_hbond c_stack c_emp1 c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_hbond c_stack c_mhp1_C c_hbond c_stack_C c_hbond c_stack c_motif c_emp1 c_emp1_C c_hbond c_stack c_motif_C c_mhp1 c_mhp1_C c_stack c_stack_C c_motif c_motif_C
RelMHRk: 0.657 0.674 0.717 0.829 0.864 0.877 0.881 0.881 1.026 1.062 1.066 1.069 1.114 1.217 1.387 1.441 1.980 2.138 3.828 3.977 4.607 4.728

term correlation
score: c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_hbond c_stack c_motif c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_hbond c_stack c_motif_C c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_hbond c_stack c_mhp1_C c_emp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond_C c_mhp1 c_mhp1_C c_stack_C c_stack c_motif c_motif_C
RelMHRk: 0.569 0.658 0.747 0.790 0.804 0.846 0.861 0.863 1.019 1.064 1.183 1.188 1.215 1.228 1.262 1.479 2.229 2.490 3.618 3.706 5.236 5.245

Table 3c and Table 3d show the validation results for our scores, ranked according to achieved Relative Mean Hit Rank (RelMHRk). Table 3c shows the results obtained with the validation method Complete training and test, and Table 3d the mean results obtained with the validation method Cross-training and test.

Table 3e. Ranking of scores by Relative Hit Rate (RelHRt), Complete training and test

semi-step function
score: goldscore c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 c_emp1 c_hbond c_hbond c_stack c_emp1 c_hbond c_stack c_motif chemscore c_hbond c_stack c_mhp1 glide c_hbond c_stack c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C chemscore_C c_hbond_C c_hbond c_stack_C c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1_C glide_C goldscore_C c_emp1_C c_mhp1 c_emp1 c_motif c_mhp1_C c_stack c_stack_C c_motif_C
RelHRt: 2.00 1.93 1.80 1.64 1.64 1.50 1.50 1.40 1.39 1.36 1.22 1.22 1.22 1.20 1.15 1.11 1.11 1.04 1.00 0.86 0.81 0.80 0.79 0.64 0.59 0.50 0.33 0.30

negative rmsd
score: c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif goldscore c_hbond c_hbond c_stack c_motif c_hbond c_stack c_mhp1 c_hbond c_stack chemscore glide c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_emp1 chemscore_C c_hbond_C c_hbond c_stack_C c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1_C c_emp1_C glide_C goldscore_C c_mhp1 c_motif c_mhp1_C c_stack_C c_motif_C c_stack
RelHRt: 2.14 2.00 2.00 2.00 1.86 1.86 1.80 1.79 1.50 1.39 1.26 1.22 1.22 1.21 1.20 1.19 1.19 1.19 1.11 1.07 1.00 0.86 0.80 0.64 0.59 0.30 0.30 0.29

negative rmsd with cutoff
score: goldscore c_hbond c_hbond c_stack c_hbond c_stack c_motif c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 chemscore glide c_hbond c_stack c_emp1_C chemscore_C c_hbond_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack_C c_hbond c_stack c_mhp1_C glide_C goldscore_C c_emp1_C c_hbond c_stack c_motif_C c_mhp1 c_emp1 c_motif c_mhp1_C c_stack_C c_motif_C c_stack
RelHRt: 2.00 1.79 1.79 1.79 1.79 1.73 1.73 1.67 1.50 1.39 1.26 1.20 1.15 1.15 1.15 1.11 1.07 1.00 0.86 0.81 0.81 0.80 0.79 0.64 0.59 0.30 0.30 0.29

term correlation
score: c_hbond c_stack c_motif goldscore c_hbond c_stack c_mhp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_hbond c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_emp1 chemscore glide c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C chemscore_C c_hbond_C c_hbond c_stack_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C glide_C c_emp1_C goldscore_C c_mhp1 c_emp1 c_motif c_mhp1_C c_stack_C c_motif_C c_stack
RelHRt: 2.00 2.00 1.93 1.93 1.86 1.79 1.67 1.64 1.50 1.39 1.22 1.22 1.20 1.19 1.19 1.19 1.19 1.11 1.00 0.89 0.86 0.80 0.71 0.64 0.59 0.30 0.30 0.29

Table 3f. Ranking of scores by Relative Hit Rate (RelHRt), Cross-training and test

semi-step function
score: c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_hbond c_stack c_mhp1 c_hbond c_stack c_hbond c_stack c_motif c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond_C c_hbond c_stack c_motif_C c_hbond c_stack_C c_hbond c_stack c_emp1 c_emp1 c_emp1_C c_mhp1 c_mhp1_C c_motif c_stack c_stack_C c_motif_C
RelHRt: 2.179 1.850 1.769 1.595 1.450 1.353 1.292 1.262 1.258 1.210 1.167 1.082 1.075 1.050 0.848 0.815 0.794 0.571 0.528 0.393 0.333 0.310

negative rmsd
score: c_hbond c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_emp1 c_hbond c_stack c_motif c_hbond c_stack c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond_C c_hbond c_stack c_emp1_C c_hbond c_stack_C c_emp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_motif_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1_C c_mhp1 c_motif c_mhp1_C c_stack c_stack_C c_motif_C
RelHRt: 2.259 2.059 2.000 1.914 1.909 1.886 1.667 1.333 1.245 1.221 1.210 1.194 1.188 1.182 1.147 1.015 0.784 0.703 0.522 0.409 0.288 0.278

negative rmsd with cutoff
score: c_hbond c_stack c_motif c_hbond c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_hbond c_stack c_mhp1 c_hbond c_stack c_emp1_C c_hbond_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_motif_C c_hbond c_stack_C c_hbond c_stack c_mhp1_C c_emp1_C c_emp1 c_mhp1 c_motif c_mhp1_C c_stack c_stack_C c_motif_C
RelHRt: 2.000 1.833 1.806 1.806 1.727 1.676 1.615 1.339 1.210 1.188 1.184 1.153 1.147 1.086 0.812 0.771 0.733 0.697 0.660 0.500 0.313 0.290

term correlation
score: c_hbond c_stack c_mhp1 c_hbond c_stack c_mhp1 c_emp1 c_motif c_hbond c_hbond c_stack c_emp1 c_hbond c_stack c_mhp1 c_emp1 c_hbond c_stack c_motif c_hbond c_stack c_hbond c_stack_C c_hbond c_stack c_emp1_C c_hbond c_stack c_mhp1_C c_hbond c_stack c_mhp1 c_emp1_C c_hbond c_stack c_motif_C c_hbond_C c_hbond c_stack c_mhp1 c_emp1 c_motif_C c_emp1_C c_emp1 c_mhp1 c_mhp1_C c_motif c_stack_C c_motif_C c_stack
RelHRt: 2.024 1.969 1.963 1.895 1.867 1.733 1.719 1.304 1.205 1.200 1.193 1.189 1.182 1.113 1.000 0.966 0.839 0.655 0.548 0.271 0.263 0.132

Table 3e and Table 3f show the validation results for our scores, ranked according to achieved Relative Hit Rate (RelHRt). Table 3e shows the results obtained with the validation method Complete training and test, and Table 3f the mean results obtained with the validation method Cross-training and test.

A ranking of the scores according to each of the three measurements described above is presented in Tables 3a to 3f, with scores that only score the RMSD-range < 6 Å marked with a trailing “_C”.
Here we have ranked the performance of scores on Dataset 2 for both Complete training and test and Cross-training and test. For Cross-training and test the mean results over 20 iterations are presented. In the rankings for Complete training and test we also included, for comparison, the performance of the scores of the docking algorithms on their respective subsets of Dataset 2. Results obtained with the different fitting functions are presented side by side to allow comparison between the functions.

What is generally of interest is which combinations of scoring components are more successful. This question has to be viewed from two perspectives, however. A traditional view of score construction treats the crystal structure as the final and only goal of the docking procedure; the score is constructed to reflect the distance from this conformation and to lead the docking algorithm to it as swiftly as possible. An alternative view is to use the RMSD from the crystal structure as an approximation of conformer quality only within a limited distance from the crystal structure (see Investigating strategies to combine terms of different scoring components and to validate scores). In our validation we have tried to serve both views. The performance of scores validated on the whole RMSD-range versus that of scores validated on the shorter distance should be seen in this light. Performance on the whole range can be regarded as the performance of the score when used for the first purpose, ranking conformers against the crystal structure conformation. Performance on the limited range should be seen as more comparative, since it considers an RMSD-range in which RMSD is more likely to approximate conformer quality.
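The two rank-based measures defined in this section, and the distinction between full-range and limited-range validation, can be illustrated with a minimal sketch. Several details are assumptions made only for illustration and are not taken from the implementation used in this work: the hit is taken to be the conformer with the lowest RMSDcyt, a higher score is taken to be better, the random hit rank for n conformers is taken as (n + 1)/2 and the random hit rate as 1/n, and the function names and toy data are hypothetical.

```python
from statistics import mean

def hit_rank(scores, rmsds):
    """Rank (1 = top) that the score gives to the lowest-RMSD conformer."""
    hit = min(range(len(rmsds)), key=lambda i: rmsds[i])
    # Assumption: higher score = better; ties are resolved in favour of the hit.
    return 1 + sum(s > scores[hit] for s in scores)

def relative_measures(complexes, cutoff=None):
    """Return (RelMHRk, RelHRt) over a list of (scores, rmsds) pairs.

    With cutoff=None the whole RMSD range is used (the "full" case);
    otherwise only conformers with RMSD below the cutoff are kept ("CUT").
    """
    ranks, random_ranks, hits, random_hits = [], [], [], []
    for scores, rmsds in complexes:
        if cutoff is not None:
            kept = [(s, r) for s, r in zip(scores, rmsds) if r < cutoff]
            if not kept:
                continue  # no conformer of this complex lies below the cutoff
            scores, rmsds = zip(*kept)
        n = len(scores)
        rank = hit_rank(scores, rmsds)
        ranks.append(rank)
        random_ranks.append((n + 1) / 2)  # assumed expected rank under random ordering
        hits.append(1.0 if rank == 1 else 0.0)
        random_hits.append(1.0 / n)       # assumed chance that a random top pick is the hit
    return mean(ranks) / mean(random_ranks), mean(hits) / mean(random_hits)

# Hypothetical toy data: two complexes, with scores and RMSDcyt values in Å.
toy = [
    ([5.0, 3.0, 1.0, 0.5], [1.0, 4.0, 7.0, 13.0]),
    ([2.0, 6.0, 1.0], [2.0, 5.0, 9.0]),
]
full = relative_measures(toy)
cut = relative_measures(toy, cutoff=6.0)
print("full:", full)
print("CUT: ", cut)
```

In this sketch a RelMHRk below 1 and a RelHRt above 1 both indicate better-than-random ranking, and the same score can look different on the two ranges because the cut-off changes both the number of conformers and the random baseline.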
Examining the performance of the scores according to ME, one can see that our scores which combine different approaches to scoring generally perform better than scores using only one method. It should also be noted that scores scoring only the limited RMSDcyt-range generally rank above their full-range counterparts. Both effects were expected: combinations of scoring approaches should be able to capture more effects than scores using only one approach, and the scoring environment in the immediate vicinity of the crystal structures in terms of RMSDcyt was expected to be more forgiving to scoring functions than the full range. All high-ranking scores contain the combination c_hbond and c_stack. In particular, c_hbond shows a good ME performance even on its own, despite its single-tracked scoring method. The best-performing scores according to ME consist of c_hbond, c_stack and different combinations of c_mhp1, c_emp1 and sometimes c_motif. The c_emp1 component shows good results during both Complete training and test and Cross-training and test, despite the more limited availability of knowledge-based data when Dataset 2 is divided into different sets during Cross-training and test. Adding the c_stack component to the c_hbond component results in an improved ME, except when using the fitting function Semi-step function during Complete training and test. During Cross-training and test the trend is more ambiguous. Combining c_hbond and c_stack with both c_mhp1 and c_emp1 does not always improve performance compared to scores in which c_hbond and c_stack are used with either c_mhp1 or c_emp1. During Complete training and test, adding the c_mhp1 component to the c_hbond c_stack c_emp1 score either leaves the performance unchanged or lowers it, except when using the fitting function Semi-step function.
During Cross-training and test, combining c_mhp1 and c_emp1 usually results in better performance, except when using Term correlation. The addition of c_motif does not improve ME performance during Complete training and test, except when using the fitting function Term correlation. During Cross-training and test the pattern is the opposite, with c_motif improving performance for all fitting functions except Negative RMSD with cut off. Most of our composite scores show better performance than goldscore and glide. Chemscore shows an extreme ME of 0.99 on the limited distance, higher than any other score, and 0.98 on the full distance. Our best-performing score considering ME during Complete training and test was c_hbond c_stack c_emp1 (ME=0.94), on the limited distance and using the fitting function Negative RMSD. During Cross-training and test, our best-performing scores were c_hbond c_stack c_mhp1 c_emp1 c_motif (Semi-step function) and c_hbond c_stack c_mhp1 c_emp1 (Negative RMSD with cut off), both with ME=0.95 on the limited distance.

The choice of fitting function seems to have a greater impact on performance according to RelMHRk than according to ME. During Complete training and test, combinations of c_hbond, c_stack and c_mhp1 with c_emp1 and c_motif still rank highest among our scores, but in contrast to ME it is the full-range versions of the scores that generally rank highest. Here all scoring functions from the docking algorithms are generally outperformed by our functions, except chemscore, which once again ranks highest. Glide also performs better by RelMHRk than by ME, ranking second for two fitting functions. During Cross-training and test the full-range versions of scores composed of c_hbond and c_stack plus various combinations of c_mhp1, c_emp1 and c_motif rank high. The score composed of only c_hbond also ranks highest when using Negative RMSD and Negative RMSD with cut off.
For Complete training and test our best-performing score is c_hbond c_stack c_emp1 (RelMHRk=0.6) on the whole distance using the fitting function Negative RMSD, and for Cross-training and test it is c_hbond (RelMHRk=0.563) on the whole distance, also using Negative RMSD.

Ranked according to RelHRt, scores validated on the full RMSDcyt-range generally rank higher, as for RelMHRk. For Complete training and test our high-ranking scores are combinations of c_hbond and c_stack with c_mhp1, c_emp1 and c_motif. The solo score c_hbond also ranks high, taking second place when using the fitting function Negative RMSD with cut off. Docking scores are generally outperformed by our scores when measuring RelHRt performance, except for the fitting function Negative RMSD with cut off, where goldscore ranks highest. During Cross-training and test the results are more ambiguous, and combinations of c_hbond and c_stack plus combinations of c_emp1, c_mhp1 and c_motif rank high. Similar to the ranking by RelMHRk, c_hbond ranks highest when using the fitting function Negative RMSD. The highest-performing score for Complete training and test is c_hbond c_stack c_emp1 (RelHRt=2.14) on the full range when using Negative RMSD, and for Cross-training and test the score c_hbond (RelHRt=1.40) on the full range when using Negative RMSD with cut off.

Information about the weighting coefficients obtained from score training can be used to evaluate our different scoring components. Mean weighting coefficients from the Cross-training and test validation strategy are presented and visualized in Appendix I. We assumed that, for a given fitting function, a term with a large absolute weight contributes relatively more to the scoring capacity of a score than a term with a smaller weight, provided that the magnitudes of the terms are comparable.
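The condition that term magnitudes be comparable can be illustrated with a small sketch. It assumes, purely for illustration, that a score is a weighted sum of its terms and that the spread of a term over the conformers (here the population standard deviation) is a reasonable measure of its magnitude. The term names, weights and values are hypothetical and are not taken from Appendix I.

```python
from statistics import pstdev

def term_contributions(weights, term_values):
    """Estimate each term's influence as |weight| * spread of the term."""
    return {name: abs(w) * pstdev(term_values[name]) for name, w in weights.items()}

# Hypothetical weights and term values over three conformers.
weights = {"term_a": 2.0, "term_b": -0.5}
term_values = {
    "term_a": [0.0, 0.1, 0.2],    # small-magnitude term with a large weight
    "term_b": [0.0, 10.0, 20.0],  # large-magnitude term with a small weight
}
contributions = term_contributions(weights, term_values)
print(contributions)
```

In this toy case term_b dominates despite its smaller absolute weight, which is why weights are only compared directly when the terms are on a comparable scale.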
We compared the weighting coefficients from the score containing all scoring components, and thus all scoring terms, with the aim of estimating the relative performance of the terms when the score is trained with the same fitting function. The combinatory term in the c_hbond component seems to be down-weighted for all fitting functions, especially when using Semi-step function and Term correlation. Using Negative RMSD, the terms with index 8 and 13 receive larger weights. These correspond to the c_mhp1 term and the c_emp1 term scoring N4 relative to nitrogen atoms, respectively. The terms with index 9 and 13, the c_emp1 terms scoring O2 relative to hydrogen bound to a nitrogen and N4 relative to nitrogen, have larger weights when using Negative RMSD with cut off, and the c_emp1 term scoring N3 relative to hydrogen bound to nitrogen (index 11) has a negative weight that is not very large but significant. When examining these results one must keep in mind that the reported values are means of the signed weighting coefficients, not means of their absolute values, which might have been a better way to estimate the importance of scoring terms for a score.

Since the fitting functions are based on different strategies and can be used for different purposes, it is difficult to determine which are more successful in general. The two fitting functions which use the whole RMSD-range for approximation (Semi-step function and Negative RMSD) do seem to promote good scoring performance in scores validated on the limited range, sometimes better than the same scores trained with fitting functions using the limited range or no range at all (i.e. Negative RMSD with cut off and Term correlation). Likewise, Negative RMSD with cut off and Term correlation seem to promote good scoring performance in scores validated on the full RMSD-range.
It must be considered that the fitting function Negative RMSD with cut off has to work with less data than the other three fitting functions, which might affect its performance in comparison. Term correlation is not a function of RMSD, which might put it at a disadvantage when RMSD-measurements are used for score validation. However, as can be seen from the comparisons between the fitting functions, Term correlation seems to promote scoring performance comparable to that of the fitting functions using RMSD-measurements. It does so with an unbiased consideration of the whole RMSD-range, which might have its own advantages. Term correlation also appears able to incorporate the scoring capabilities of terms that model smaller parts of the quality model, or terms with ambiguous scoring capabilities. This is the case with the scoring component c_motif (see the further description of the validation of this component below). With its different method of approximating conformer quality, Term correlation can also be used for conformer validation. It should then be seen as a complement to RMSD-based validation methods, because of its inherent capability to amplify errors. For example, we use the score results obtained with Term correlation as one of the arguments for the possibility of two major general docking quality maxima at RMSDcyt of around 7 Å and 13 Å from the crystal structures of our protein-ligand complexes.

Our results show that the scoring strategy earlier used successfully for scoring complexes with adenine [4], represented by the score c_hbond c_stack c_mhp1 and its components, is also successful at scoring docking poses for protein-ligand complexes containing ligands with cytosine as a substructure.
The validations of this score indicate that its performance is higher than that of goldscore and glide when considering ME, and higher than that of goldscore on the limited RMSDcyt-range when measuring RelMHRk. Measuring RelHRt, the score shows a performance that is often better than chemscore and glide on the full RMSDcyt-range and better than goldscore and glide on the limited range.

When commenting on scores we also have to mention the combination c_hbond c_stack. This component combination is found in all of our more successful scores and seems to account for a very large part of the model we try to create. On its own, the score composed of these two components often has better scoring capabilities than other scores with a greater number of components. Scores with better performance than this generally also contain combinations of c_mhp1, c_emp1 and c_motif.

Our novel scoring approaches, represented by c_emp1 and c_motif, show scoring potential, although this is less apparent in the case of c_motif. The c_emp1 scoring component shows good scoring performance, both as a score of its own and as part of a composite score. This is the case during both Complete training and test and Cross-training and test, which gives a hint of its overall potential for scoring this particular dataset, as well as of the robustness of this scoring approach. Combined with more explicit scoring components, c_emp1 is indeed a useful contribution to a score, which supports our theory that the implicit modelling capacity of knowledge-based scores successfully complements explicit scoring components.

The scoring capacity of the component c_motif is more difficult to evaluate. The addition of c_motif to either c_hbond c_stack c_mhp1 c_emp1 or c_hbond c_stack results in a score that often shows better performance than the original, but almost as often in a score with worse performance.
It seems that the fitting function Term correlation is able to incorporate the predictive capabilities of c_motif in such a way that all scores with this component have higher ME performance than their counterparts without c_motif. This is the case for both Complete training and test and Cross-training and test.

Since the datasets on which our scores have been trained directly define the scoring capacity of the scores, our validation results have to be interpreted with this in mind. Dataset 1 is not in any way representative of a complete set of data on how proteins recognize and bind cytosine-containing ligands, and must be treated as a sample. This is also the case for Dataset 2, since it is derived from Dataset 1. Training scores on such limited datasets of course also implies a certain degree of bias or over-training of those scores on the particular dataset. When comparing our scores with the scores of the docking algorithms, it should also be mentioned that the results from these docking algorithms are highly biased by the scoring function used in the docking process. For a better validation, all scores would have to be tested in combination with the same docking algorithm.

3.5. Time efficiency

The scoring capacity of a score can be related to its time complexity to estimate the overall time efficiency of the score. Below we list our scoring components together with their respective time complexity per scored complex:

Component    Time complexity (per scored complex)
c_hbond      O(a)
c_stack      O(a)
c_mhp1       O(a)
c_emp1       O(a)
c_motif      O(a^3)

Here a is the number of protein atoms in the protein-ligand complex scored by the component. The component c_motif has a cubic term because of the need to investigate all possible combinations of hydrogen bonds contributing to a motif.
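The cubic growth can be made concrete with a small sketch. It assumes, purely for illustration, that a motif candidate is an unordered triple of hydrogen-bond partners chosen among the a protein atoms; this definition and the function name are assumptions and do not reproduce the actual c_motif implementation. The number of such triples is a(a-1)(a-2)/6, which is of order a^3.

```python
from itertools import combinations
from math import comb

def count_triples(n_atoms):
    """Count unordered atom triples by explicit enumeration (cubic in n_atoms)."""
    return sum(1 for _ in combinations(range(n_atoms), 3))

for a in (10, 20, 40):
    # The enumerated count agrees with the closed form C(a, 3).
    assert count_triples(a) == comb(a, 3)
    print(a, count_triples(a))
```

Doubling the number of atoms multiplies the number of candidate triples by roughly eight, which is the behaviour summarized by the cubic term.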
Since the time complexities of the different components are fairly similar, and the time complexity of c_motif results from also taking into account the existence of extreme and highly unmatching conformers, a comparison of time efficiencies between our scores can be reduced to comparing their scoring capacities.

4. CONCLUSIONS

The present work was aimed at improving the performance of standard docking techniques, i.e. computational methods designed to predict the conformation and orientation of an organic molecule in the known structure of a receptor binding site. With this aim in view we used the “consensus docking” approach, based on re-ranking the list of putative ligand poses generated with a docking program by a more efficient scoring function. By analogy to the adenine-specific score developed before [4], we introduced a novel method to re-score the docking solutions for another important class of ligands: cytosine-containing compounds. To develop the cytosine-specific score we collected from the PDB a representative set of 50 structures of complexes of such ligands with different proteins and generated sets of docking solutions for each of them, in order to obtain positive (native-like) and negative (misplaced) ligand poses for training our scores. The main conclusions of the present work are:

1) A detailed analysis of intermolecular interactions between the cytosine fragment of a ligand and its protein environment was carried out. From this analysis we derived the cytosine-backbone hydrogen bond motifs and compared them with those for adenine. Stacking and hydrophobic interactions were also analysed, showing that the former are important for cytosine recognition and that the latter are also important, but more difficult to use in a scoring function because cytosine does not show distinctive properties suitable for scoring hydrophobic interactions with the current approach.

2) Novel scoring functions were created based on various scoring methodologies.
Our novel scores show promising capacity on the computer-generated dataset of 2000 protein-ligand conformers (Dataset 2) for 50 structures known from X-ray crystallography. The best performance achieved on the limited RMSDcyt-range by our scores is an ME of 0.94 on the complete Dataset 2 during Complete training and test, by the score c_hbond c_stack c_emp1 (describing hydrogen bonds, stacking interactions, and knowledge-based potential, respectively). The score c_hbond c_stack c_mhp1 (the latter describing hydrophobic interactions), which was earlier used for scoring complexes with ligands containing adenine instead of cytosine, was shown to perform well also in the case of cytosine. This score achieved an ME of 0.9 on the limited RMSDcyt-range during Complete training and test on the complete Dataset 2. Both scores also show high performance during Cross-training and test, which is an indication of their robustness when used on our dataset.

3) The newly introduced knowledge-based potential component c_emp1 is shown to perform comparably well to the hydrophobic component c_mhp1, and shows a similar capacity to successfully complement more explicit scoring methods. The component c_emp1 does, however, rely on knowledge-based data, and the quality and availability of such data thus define its performance. Our other newly introduced scoring component, c_motif, does seem to enhance the scoring capacity of a score in some cases. We have also shown that, by using the newly introduced fitting function Term correlation, it was possible in this case to create scores with performance comparable to scores obtained through fitting functions based on traditional RMSD measurements.

5. ACKNOWLEDGMENTS

I would like to thank my supervisors Prof. Roman G. Efremov and Ph.D. Timothy V. Pyrkov for their assistance and guidance in this project and for giving me the fabulous opportunity to perform my degree project at the Laboratory of Molecular Modelling at IBCh, Moscow.
Many thanks to Prof. Efremov for his warm reception and his help in all the practical matters concerning my stay in Moscow, without which this would not have been possible. A eulogy to our program coordination office with Dr. Margareta Krabbe for the swift and much needed support and encouragement which helped to make this one of the first degree projects made in cooperation with a Russian institution. I would also like to express my gratitude to Ph.D. Torgeir R. Hvidsten for reviewing my report and to Hoda Ibrahim and Andreas Dahlsten for acting as opponents during the presentation. Last but not least – thanks to all co-workers and friends connected to IBCh. Without you I would have been just as lost as that Danish professor.

6. REFERENCES

1. Denessiouk K.A., Rantanen V., Johnson M.S., Adenine Recognition: A Motif Present in ATP-, CoA-, NAD-, NADP-, and FAD-Dependent Proteins, Proteins 2001;44:282-291.
2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E., The Protein Data Bank, Nucleic Acids Res 2000;28:235-242.
3. Cappello V., Tramontano A., Koch U., Classification of Proteins Based on the Properties of the Ligand-Binding Site: The Case of Adenine-Binding Proteins, Proteins 2002;47:106-115.
4. Pyrkov T.V., Kosinsky Y.A., Arseniev A.S., Priestle J.P., Jacoby E., Efremov R.G., Complementarity of Hydrophobic Properties in ATP-Protein Binding: A New Criterion to Rank Docking Solutions, Proteins 2007;66:388-398.
5. Martin A.C., Orengo C.A., Hutchinson E.G., Jones S., Karmirantzou M., Laskowski R.A., Mitchell J.B., Taroni C., Thornton J.M., Protein folds and functions, Structure 1998;6(7):875-884.
6. Erlandsen H., Abola E.E., Stevens R.C., Combining structural genomics and enzymology: completing the picture in metabolic pathways and enzyme active sites, Curr Opin Struct Biol 2000;10(6):719-730.
7. Mao L., Wang Y., Liu Y., Hu X., Molecular determinants for ATP-binding in proteins: a data mining and quantum chemical analysis, J Mol Biol 2004;336(3):787-807.
8. Kuttner Y.Y., Sobolev V., Raskind A., Edelman M., A consensus-binding structure for adenine at the atomic level permits searching for the ligand site in a wide spectrum of adenine-containing complexes, Proteins 2003;52(3):400-411.
9. Kitchen D.B., Decornez H., Furr J.R., Bajorath J., Docking and scoring in virtual screening for drug discovery: methods and applications, Nat Rev Drug Discov 2004;3(11):935-949.
10. Kuntz I.D., Blaney J.M., Oatley S.J., Langridge R., Ferrin T.E., A geometric approach to macromolecule-ligand interactions, J Mol Biol 1982;161(2):269-288.
11. Jorgensen W.L., The many roles of computation in drug discovery, Science 2004;303(5665):1813-1818.
12. Sousa S.F., Fernandes P.A., Ramos M.J., Protein-ligand docking: current status and future challenges, Proteins 2006;65(1):15-26.
13. Teague S.J., Implications of protein flexibility for drug discovery, Nat Rev Drug Discov 2003;2(7):527-541.
14. Connolly M.L., Solvent-accessible surfaces of proteins and nucleic acids, Science 1983;221(4612):709-713.
15. Ghose A.K., Viswanadhan V.N., Wendoloski J.J., Prediction of hydrophobic (lipophilic) properties of small organic molecules using fragmental methods: an analysis of ALOGP and CLOGP methods, J Phys Chem 1998;102:3762-3772.
16. Pyrkov T.V., Chugunov A.O., Krylov N.A., Nolde D.E., Efremov R.G., PLATINUM, Laboratory of Biomolecular Modeling at Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russia, 2008. http://model.nmr.ru/platinum
17. Bairoch A., The ENZYME database in 2000, Nucleic Acids Res 2000;28:304-305.
18. Jones G., Willett P., Glen R.C., Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation, J Mol Biol 1995;245:43-53.
19. Jones G., Willett P., Glen R.C., Leach A.R., Taylor R., Development and Validation of a Genetic Algorithm for Flexible Docking, J Mol Biol 1997;267:727-748.
20. Nissink J.W.M., Murray C., Hartshorn M., Verdonk M.L., Cole J.C., Taylor R., A new test set for validating predictions of protein-ligand interaction, Proteins 2002;49:457-471.
21. Verdonk M.L., Cole J.C., Hartshorn M.J., Murray C.W., Taylor R.D., Improved Protein-Ligand Docking Using GOLD, Proteins 2003;52:609-623.
22. Cole J.C., Nissink J.W.M., Taylor R., Protein-Ligand Docking and Virtual Screening with GOLD, in Virtual Screening in Drug Discovery (Eds. Shoichet B., Alvarez J.), Taylor & Francis CRC Press, Boca Raton, Florida, USA, 2005.
23. Verdonk M.L., Chessari G., Cole J.C., Hartshorn M.J., Murray C.W., Nissink J.W.M., Taylor R.D., Taylor R., Modeling Water Molecules in Protein-Ligand Docking Using GOLD, J Med Chem 2005;48:6504-6515.
24. Hartshorn M.J., Verdonk M.L., Chessari G., Brewerton S.C., Mooij W.T.M., Mortenson P.N., Murray C.W., Diverse, High-Quality Test Set for the Validation of Protein-Ligand Docking Performance, J Med Chem 2007;50:726-741.
25. Friesner R.A., Banks J.L., Murphy R.B., Halgren T.A., Klicic J.J., Mainz D.T., Repasky M.P., Knoll E.H., Shaw D.E., Shelley M., Perry J.K., Francis P., Shenkin P.S., Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy, J Med Chem 2004;47:1739-1749.
26. Halgren T.A., Murphy R.B., Friesner R.A., Beard H.S., Frye L.L., Pollard W.T., Banks J.L., Glide: A New Approach for Rapid, Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening, J Med Chem 2004;47:1750-1759.
27. Efremov R.G., Chugunov A.O., Pyrkov T.V., Priestle J.P., Arseniev A.S., Jacoby E., Molecular Lipophilicity in Protein Modeling and Drug Design, Current Medicinal Chemistry 2007;14:393-415.

APPENDIX I – Validation results for new scores

Tables AI.1–AI.8 show the results from the validations performed for our scores.
Each table shows the results of a certain validation strategy (Complete training and test, Cross-training and test) combined with a certain fitting function (Semi-step function, Negative RMSD, Negative RMSD with cut-off, Term correlation). Within each table, results are shown for separate subsets of Dataset 2 (chemscore, goldscore, glide) and for the complete Dataset 2 (all). Each score has results presented for the full RMSDcyt-range (full) and the reduced RMSDcyt-range (CUT). A number of measurements are presented for each score: the number of terms figuring in the score, excluding the constant term (terms), Mean Enrichment (ME), Mean Hit Rank (MHRk), Mean Random Hit Rank (MRHRk), Hit Rate (HRt), Random Hit Rate (RHRt), and the total number of complex types participating in the measurement (total). See Methods for the explanation of the strategies, fitting functions and validation measurements. Dockscore and c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock are special scores because they contain components derived from results obtained by the docking algorithms, and are thus only validated on the subset of Dataset 2 corresponding to the docking algorithm from which the subset originated.

TABLE AI.1. Validation of scoring approaches; Complete training and test, Semi-step function

Subset: chemscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.98 0.93 0.68 0.78 0.86 0.63 0.99 0.99 0.96 0.97 0.97 0.97 0.96
full MHRk    1 1.67 3.78 2.89 2 4.22 1.11 1.11 1.33 1.33 1.33 1.33 1.44
full MRHRk   2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11
full HRt     9 7 3 5 5 3 8 8 7 7 7 7 7
full RHRt    6 6 6 6 6 6 6 6 6 6 6 6 6
full total   9 9 9 9 9 9 9 9 9 9 9 9 9
CUT ME       0.99 0.92 0.78 0.78 0.89 0.71 1 1 1 1 1 1 1
CUT MHRk     1 1.5 1.83 1.83 1.17 2.33 1 1 1 1 1 1 1
CUT MRHRk    1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33
CUT HRt      6 5 1 4 5 2 6 6 6 6 6 6 6
CUT RHRt     5 5 5 5 5 5 5 5 5 5 5 5 5
CUT total    6 6 6 6 6 6 6 6 6 6 6 6 6

Subset: goldscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.83 0.86 0.39 0.73 0.75 0.49 0.8 0.83 0.79 0.8 0.79 0.85 0.84
full MHRk    2.35 1.95 6.6 3.2 3.1 5.3 2.55 2.35 2.75 2.6 2.7 2.15 2.2
full MRHRk   2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85
full HRt     14 13 3 9 7 7 11 12 10 12 12 14 14
full RHRt    7 7 7 7 7 7 7 7 7 7 7 7 7
full total   20 20 20 20 20 20 20 20 20 20 20 20 20
CUT ME       0.76 0.82 0.43 0.64 0.81 0.56 0.79 0.79 0.81 0.86 0.85 0.86 0.87
CUT MHRk     1.8 1.5 3.3 2.3 1.6 2.8 1.6 1.6 1.5 1.4 1.4 1.4 1.4
CUT MRHRk    1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3
CUT HRt      6 7 1 4 5 4 7 7 7 8 8 8 8
CUT RHRt     7 7 7 7 7 7 7 7 7 7 7 7 7
CUT total    10 10 10 10 10 10 10 10 10 10 10 10 10

Subset: glide
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.86 0.83 0.58 0.72 0.77 0.43 0.84 0.84 0.88 0.87 0.86 0.86 0.86
full MHRk    2.25 2.86 7.19 5.19 3.58 10.83 2.86 2.86 2.47 2.31 2.5 2.5 2.44
full MRHRk   3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03
full HRt     25 22 8 14 14 5 23 23 24 26 27 27 27
full RHRt    18 18 18 18 18 18 18 18 18 18 18 18 18
full total   36 36 36 36 36 36 36 36 36 36 36 36 36
CUT ME       0.87 0.87 0.61 0.72 0.82 0.49 0.87 0.87 0.9 0.9 0.88 0.88 0.88
CUT MHRk     1.85 2.06 5.55 4.21 2.36 7.45 2.06 2.06 1.88 1.73 1.97 1.97 1.97
CUT MRHRk    2.09 2.09 2.09 2.06 2.09 2.09 2.09 2.09 2.06 2.09 2.06 2.06 2.06
CUT HRt      24 25 5 15 19 3 25 25 28 28 27 27 27
CUT RHRt     24 24 24 24 24 24 24 24 24 24 24 24 24
CUT total    33 33 33 33 33 33 33 33 33 33 33 33 33

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0.85 0.54 0.71 0.77 0.4 0.8 0.83 0.81 0.85 0.87 0.9
full MHRk    4.02 13.95 8.95 5.95 20.76 5.46 4.78 5.2 3.85 3.54 3.15
full MRHRk   4.27 4.27 4.2 4.27 4.27 4.27 4.27 4.2 4.27 4.2 4.2
full HRt     23 7 12 11 9 19 21 21 23 27 29
full RHRt    14 14 15 14 14 14 14 15 14 15 15
full total   41 41 41 41 41 41 41 41 41 41 41
CUT ME       0.88 0.64 0.74 0.84 0.56 0.87 0.87 0.88 0.91 0.93 0.93
CUT MHRk     2.3 6.15 4.35 2.15 8.45 2.58 2.58 2.38 1.77 1.73 1.73
CUT MRHRk    1.83 1.83 1.8 1.83 1.83 1.83 1.83 1.8 1.83 1.8 1.8
CUT HRt      31 9 16 22 8 30 30 28 33 33 33
CUT RHRt     27 27 27 27 27 27 27 27 27 27 27
CUT total    40 40 40 40 40 40 40 40 40 40 40

TABLE AI.2.
Validation of scoring approaches; Complete training and test, Negative RMSD

Subset: chemscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.98 0.93 0.68 0.78 0.93 0.63 0.91 0.91 0.94 0.93 0.93 0.93 0.93
full MHRk    1 1.67 3.78 2.89 1.56 4.22 1.89 1.89 1.56 1.67 1.67 1.67 1.67
full MRHRk   2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11
full HRt     9 7 3 5 6 3 8 8 7 8 8 8 8
full RHRt    6 6 6 6 6 6 6 6 6 6 6 6 6
full total   9 9 9 9 9 9 9 9 9 9 9 9 9
CUT ME       0.99 0.92 0.78 0.78 0.98 0.71 0.89 0.89 1 0.92 0.92 0.92 0.92
CUT MHRk     1 1.5 1.83 1.93 1 2.33 1.67 1.67 1 1.5 1.5 1.5 1.5
CUT MRHRk    1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33
CUT HRt      6 5 1 4 6 2 5 5 6 5 5 5 5
CUT RHRt     5 5 5 5 5 5 5 5 5 5 5 5 5
CUT total    6 6 6 6 6 6 6 6 6 6 6 6 6

Subset: goldscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.83 0.86 0.39 0.73 0.77 0.49 0.84 0.84 0.88 0.91 0.87 0.91 0.91
full MHRk    2.35 1.9 6.6 3.2 3 5.3 2.25 2.2 1.8 1.6 2.05 1.7 1.6
full MRHRk   2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85
full HRt     14 13 3 9 9 7 13 14 12 15 12 14 16
full RHRt    7 7 7 7 7 7 7 7 7 7 7 7 7
full total   20 20 20 20 20 20 20 20 20 20 20 20 20
CUT ME       0.76 0.82 0.43 0.64 0.83 0.56 0.79 0.79 0.81 0.92 0.88 0.89 0.84
CUT MHRk     1.8 1.5 3.3 2.3 1.5 2.8 1.6 1.6 1.6 1.2 1.4 1.4 1.5
CUT MRHRk    1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3
CUT HRt      6 7 1 4 6 4 7 7 7 8 8 8 7
CUT RHRt     7 7 7 7 7 7 7 7 7 7 7 7 7
CUT total    10 10 10 10 10 10 10 10 10 10 10 10 10

Subset: glide
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.86 0.84 0.58 0.72 0.84 0.43 0.86 0.86 0.87 0.89 0.91 0.91 0.91
full MHRk    2.25 2.72 7.19 5.19 2.36 10.83 2.61 2.58 2.44 2 2.11 2.11 2.11
full MRHRk   3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03
full HRt     25 23 8 14 20 5 25 25 26 28 29 29 29
full RHRt    18 18 18 18 18 18 18 18 18 18 18 18 18
full total   36 36 36 36 36 36 36 26 36 36 36 36 36
CUT ME       0.87 0.86 0.61 0.72 0.84 0.49 0.88 0.88 0.88 0.9 0.91 0.91 0.92
CUT MHRk     1.85 2.12 5.55 4.21 2.21 7.45 2.09 2.06 2.09 1.7 1.91 1.91 1.94
CUT MRHRk    2.09 2.09 2.09 2.06 2.09 2.09 2.09 2.09 2.06 2.09 2.06 2.06 2.06
CUT HRt      24 24 5 15 22 3 25 26 25 27 28 28 27
CUT RHRt     24 24 24 24 24 24 24 24 24 24 24 24 24
CUT total    33 33 33 33 33 33 33 33 33 33 33 33 33

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0.85 0.5 0.71 0.84 0.4 0.85 0.86 0.87 0.92 0.91 0.92
full MHRk    3.88 15.02 8.95 4.24 20.76 4.15 4.12 3.8 2.56 2.78 2.78
full MRHRk   4.27 4.27 4.2 4.27 4.27 4.27 4.27 4.2 4.27 4.2 4.2
full HRt     26 4 12 17 9 25 26 27 30 30 30
full RHRt    14 14 15 14 14 14 14 15 14 15 15
full total   41 41 41 41 41 41 14 41 41 41 41
CUT ME       0.88 0.62 0.74 0.88 0.56 0.89 0.89 0.89 0.94 0.93 0.92
CUT MHRk     2.22 6.5 4.35 1.83 8.45 2.22 2.22 2.3 1.67 1.85 1.85
CUT MRHRk    1.83 1.83 1.8 1.83 1.83 1.83 1.83 1.8 1.83 1.8 1.8
CUT HRt      32 8 16 29 8 32 32 30 34 33 33
CUT RHRt     27 27 27 27 27 27 27 27 27 27 27
CUT total    40 40 40 40 40 40 40 40 40 40 40

TABLE AI.3.
Validation of scoring approaches; Complete training and test, Negative RMSD with cut-off

Subset: chemscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.98 0.94 0.68 0.78 0.92 0.63 1 1 0.98 1 1 1 0.98
full MHRk    1 1.56 3.78 2.89 1.56 4.22 1 1 1.22 1 1 1 1.22
full MRHRk   2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11
full HRt     9 8 3 5 6 3 9 9 8 9 9 9 7
full RHRt    6 6 6 6 6 6 6 6 6 6 6 6 6
full total   9 9 9 9 9 9 9 9 9 9 9 9 9
CUT ME       0.99 0.92 0.78 0.78 0.97 0.71 1 1 1 1 1 1 1
CUT MHRk     1 1.5 1.83 1.83 1 2.33 1 1 1 1 1 1 1
CUT MRHRk    1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33
CUT HRt      6 5 1 4 6 2 6 6 6 6 6 6 6
CUT RHRt     5 5 5 5 5 5 5 5 5 5 5 5 5
CUT total    6 6 6 6 6 6 6 6 6 6 6 6 6

Subset: goldscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.83 0.85 0.39 0.73 0.82 0.49 0.74 0.81 0.78 0.84 0.85 0.89 0.91
full MHRk    2.35 2 6.6 3.2 2.55 5.3 3.2 2.55 2.9 2.2 2.1 1.7 1.6
full MRHRk   2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85
full HRt     14 13 3 9 8 7 9 11 9 14 13 15 16
full RHRt    7 7 7 7 7 7 7 7 7 7 7 7 7
full total   20 20 20 20 20 20 20 20 20 20 20 20 20
CUT ME       0.76 0.82 0.43 0.64 0.86 0.56 0.77 0.78 0.79 0.86 0.83 0.86 0.84
CUT MHRk     1.8 1.5 3.3 2.3 1.5 2.8 1.7 1.7 1.7 1.4 1.5 1.4 1.5
CUT MRHRk    1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3
CUT HRt      6 7 1 4 5 4 6 6 6 8 8 8 7
CUT RHRt     7 7 7 7 7 7 7 7 7 7 7 7 7
CUT total    10 10 10 10 10 10 10 10 10 10 10 10 10

Subset: glide
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.86 0.85 0.55 0.72 0.82 0.43 0.86 0.86 0.85 0.89 0.88 0.87 0.88
full MHRk    2.25 2.47 7.75 5.19 2.67 10.83 2.47 2.47 2.56 2.19 2.42 2.42 2.39
full MRHRk   3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03
full HRt     25 22 7 14 17 5 24 24 25 26 24 24 24
full RHRt    18 18 18 18 18 18 18 18 18 18 18 18 18
full total   36 36 36 36 36 36 36 36 36 36 36 36 36
CUT ME       0.87 0.86 0.58 0.72 0.84 0.49 0.88 0.88 0.87 0.92 0.89 0.89 0.89
CUT MHRk     1.85 1.97 5.76 4.21 2.06 7.45 1.91 1.91 2.12 1.79 2.09 2.09 2.06
CUT MRHRk    2.09 2.09 2.09 2.06 2.09 2.09 2.09 2.09 2.06 2.09 2.06 2.06 2.06
CUT HRt      24 23 5 15 22 3 24 24 25 28 26 26 26
CUT RHRt     24 24 24 24 24 24 24 24 24 24 24 24 24
CUT total    33 33 33 33 33 33 33 33 33 33 33 33 33

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0.84 0.5 0.71 0.83 0.4 0.84 0.85 0.86 0.88 0.88 0.89
full MHRk    3.95 15.02 8.95 4.63 20.76 4.24 4.24 4.12 3.34 3.68 3.66
full MRHRk   4.27 4.27 4.2 4.27 4.27 4.27 4.27 4.2 4.27 4.2 4.2
full HRt     25 4 12 11 9 25 25 25 25 26 26
full RHRt    14 14 15 14 14 14 14 15 14 15 15
full total   41 41 41 41 41 41 41 41 41 41 41
CUT ME       0.88 0.62 0.74 0.86 0.56 0.89 0.89 0.89 0.92 0.9 0.9
CUT MHRk     2.3 6.5 4.35 2.2 8.45 2.3 2.3 2.33 1.7 2.12 2.12
CUT MRHRk    1.83 1.83 1.8 1.83 1.83 1.83 1.83 1.8 1.83 1.8 1.8
CUT HRt      31 8 16 22 8 30 30 29 34 31 31
CUT RHRt     27 27 27 27 27 27 37 27 27 27 27
CUT total    40 40 40 40 40 40 40 40 40 40 40

TABLE AI.4. Validation of scoring approaches; Complete training and test, Term correlation

Subset: chemscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.98 0.93 0.68 0.78 0.92 0.63 0.97 1 0.91 0.94 0.92 0.99 0.99
full MHRk    1 1.67 3.78 2.89 1.56 4.22 1.33 1 1.78 1.56 1.78 1.11 1.11
full MRHRk   2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11 2.11
full HRt     9 7 3 5 6 3 8 9 7 8 7 8 8
full RHRt    6 6 6 6 6 6 6 6 6 6 6 6 6
full total   9 9 9 9 9 9 9 9 9 9 9 9 9
CUT ME       0.99 0.92 0.78 0.78 0.98 0.71 0.94 1 0.83 0.94 0.92 1 1
CUT MHRk     1 1.5 1.83 1.83 1 2.33 1.33 1 1.67 1.33 1.5 1 1
CUT MRHRk    1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33 1.33
CUT HRt      6 5 1 4 6 2 5 6 4 5 5 6 6
CUT RHRt     5 5 5 5 5 5 5 5 5 5 5 5 5
CUT total    6 6 6 6 6 6 6 6 6 6 6 6 6

Subset: goldscore
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.83 0.85 0.39 0.73 0.8 0.49 0.84 0.83 0.87 0.92 0.83 0.87 0.88
full MHRk    2.35 2.05 6.6 3.2 2.8 5.3 2.25 2.25 1.95 1.5 2.3 2.05 1.95
full MRHRk   2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85
full HRt     14 13 3 9 8 7 13 14 14 13 12 14 14
full RHRt    7 7 7 7 7 7 7 7 7 7 7 7 7
full total   20 20 20 20 20 20 20 20 20 20 20 20 20
CUT ME       0.76 0.82 0.43 0.64 0.79 0.56 0.82 0.78 0.88 0.89 0.88 0.81 0.81
CUT MHRk     1.8 1.5 3.3 2.3 1.7 2.8 1.5 1.6 1.4 1.3 1.4 1.6 1.6
CUT MRHRk    1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3
CUT HRt      6 7 1 4 5 4 7 7 8 7 8 7 7
CUT RHRt     7 7 7 7 7 7 7 7 7 7 7 7 7
CUT total    10 10 10 10 10 10 10 10 10 10 10 10 10

Subset: glide
Scores (column order): dockscore; c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif; c_hbond c_stack c_mhp1 c_emp1 c_motif c_dock
terms        1 5 2 1 7 4 7 11 8 14 15 19 20
full ME      0.86 0.85 0.55 0.72 0.85 0.43 0.87 0.86 0.88 0.9 0.91 0.89 0.89
full MHRk    2.25 2.53 7.61 5.19 2.53 10.83 2.39 2.56 2.14 1.94 2.06 2.19 2.19
full MRHRk   3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03
full HRt     25 21 7 14 18 5 23 26 25 24 27 28 27
full RHRt    18 18 18 18 18 18 18 18 18 18 18 18 18
full total   36 36 36 36 36 36 36 36 36 36 36 36 36
CUT ME       0.87 0.86 0.58 0.72 0.87 0.49 0.88 0.88 0.89 0.92 0.91 0.91 0.9
CUT MHRk     1.85 2 5.73 4.21 1.94 7.45 1.97 2.06 1.94 1.48 1.85 1.94 1.94
CUT MRHRk    2.09 2.09 2.09 2.06 2.09 2.09 2.09 2.09 2.06 2.09 2.06 2.06 2.06
CUT HRt      24 22 5 15 23 3 23 26 24 26 26 28 27
CUT RHRt     24 24 24 24 24 24 24 24 24 24 24 24 24
CUT total    33 33 33 33 33 33 33 33 33 33 33 33 33

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0.85 0.52 0.71 0.79 0.4 0.86 0.86 0.88 0.89 0.89 0.9
full MHRk    3.66 14.17 8.95 5.34 20.76 3.46 3.56 3.15 2.83 3.22 3.1
full MRHRk   4.27 4.27 4.2 4.27 4.27 4.27 4.27 4.2 4.27 4.2 4.2
full HRt     25 4 12 10 9 26 28 29 23 25 29
full RHRt    14 14 15 14 14 14 14 15 14 15 15
full total   41 41 41 41 41 41 41 41 41 41 41
CUT ME       0.88 0.62 0.74 0.85 0.56 0.9 0.89 0.9 0.92 0.9 0.92
CUT MHRk     2.22 6.3 4.35 2.15 8.45 2.2 2.12 2.12 1.58 2.15 1.95
CUT MRHRk    1.83 1.83 1.8 1.83 1.83 1.83 1.83 1.8 1.83 1.8 1.8
CUT HRt      32 8 16 24 8 32 33 32 32 30 33
CUT RHRt     27 27 27 27 27 27 27 27 27 27 27
CUT total    40 40 40 40 40 40 40 40 40 40 40

TABLE AI.5a. Validation of scoring approaches; Cross-training and test (Mean results), Semi-step function

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,835 0,549 0,727 0,754 0,454 0,833 0,843 0,876 0,847 0,902 0,917
full MHRk    4,485 12,408 8,236 6,093 18,147 4,518 4,582 3,218 4,036 2,551 2,734
full MRHRk   4,920 4,010 3,966 4,406 3,467 4,220 3,446 3,929 4,333 3,718 3,503
full HRt     2,300 0,550 1,350 1,400 1,000 2,900 2,300 2,950 2,100 3,050 3,700
full RHRt    1,300 1,400 1,700 1,650 1,895 2,000 1,700 1,850 2,000 1,400 2,000
full total   4,500 4,300 4,050 5,250 4,105 5,900 4,200 5,100 4,650 4,600 4,850
CUT ME       0,875 0,668 0,737 0,825 0,613 0,876 0,891 0,912 0,914 0,933 0,950
CUT MHRk     2,070 5,950 4,091 2,150 7,209 2,680 2,662 1,692 1,851 1,377 1,372
CUT MRHRk    2,014 1,664 1,832 2,045 1,486 1,838 1,660 1,734 1,935 1,485 1,657
CUT HRt      3,150 1,000 1,600 2,650 0,947 4,300 3,300 3,900 3,850 3,750 4,200
CUT RHRt     2,700 3,000 2,800 3,250 3,053 4,000 3,050 3,100 3,050 3,100 3,250
CUT total    4,350 4,316 3,900 5,050 4,053 5,850 4,050 4,950 4,400 4,450 4,650

TABLE AI.5b. Validation of scoring approaches; Cross-training and test (Variances), Semi-step function

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,010 0,038 0,017 0,007 0,035 0,009 0,008 0,006 0,005 0,003 0,004
full MHRk    12,931 38,370 23,624 9,953 51,003 9,274 12,781 4,698 8,808 1,654 4,039
full MRHRk   7,351 3,831 8,918 3,727 5,017 4,275 3,150 3,444 5,014 2,954 1,443
full HRt     1,010 0,448 1,328 1,240 0,895 1,590 1,310 2,148 2,090 1,148 1,610
full RHRt    1,710 1,240 1,110 0,727 1,757 2,000 0,810 1,228 1,200 0,740 0,900
full total   1,850 4,010 1,747 7,488 4,665 4,390 3,360 3,590 3,728 1,840 1,628
CUT ME       0,005 0,044 0,013 0,003 0,045 0,009 0,011 0,003 0,006 0,002 0,001
CUT MHRk     2,473 19,306 3,758 0,906 10,729 4,376 8,570 0,715 4,121 0,336 0,396
CUT MRHRk    1,190 0,497 1,031 1,308 0,608 0,898 0,952 0,445 0,932 0,196 0,385
CUT HRt      1,828 1,000 1,240 2,628 0,834 3,810 1,910 2,290 2,928 1,488 1,460
CUT RHRt     1,410 3,000 1,360 4,388 3,698 3,300 1,248 1,490 2,847 1,290 1,888
CUT total    1,927 4,565 1,390 7,047 4,283 4,328 3,548 3,248 3,440 1,748 1,527

TABLE AI.6a. Validation of scoring approaches; Cross-training and test (Mean results), Negative RMSD

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,907 0,455 0,707 0,838 0,415 0,846 0,893 0,868 0,891 0,908 0,930
full MHRk    2,278 15,955 9,362 4,119 19,963 4,185 3,297 3,520 2,904 3,019 2,347
full MRHRk   4,050 4,278 4,220 4,090 3,925 3,910 3,459 4,064 4,303 3,900 4,103
full HRt     3,050 0,450 1,450 2,200 1,300 3,150 3,350 3,333 3,500 3,300 3,500
full RHRt    1,350 1,100 1,850 1,650 1,850 1,650 1,750 2,000 1,750 1,750 1,700
full total   4,150 4,200 5,050 5,200 5,550 4,700 4,350 5,667 4,950 4,600 4,600
CUT ME       0,935 0,576 0,720 0,885 0,561 0,882 0,926 0,863 0,940 0,921 0,939
CUT MHRk     1,358 7,286 4,287 1,756 8,440 2,263 2,094 2,616 1,568 1,865 1,655
CUT MRHRk    1,574 2,029 1,562 1,704 1,820 1,799 1,748 1,727 1,793 1,755 1,592
CUT HRt      3,300 0,750 1,750 3,700 1,000 3,750 3,900 3,722 4,150 3,800 3,900
CUT RHRt     2,650 2,600 3,350 3,100 3,600 3,100 3,300 3,667 3,400 3,200 3,400
CUT total    3,850 4,150 4,900 5,050 5,250 4,650 4,350 5,500 4,800 4,600 4,550

TABLE AI.6b. Validation of scoring approaches; Cross-training and test (Variances), Negative RMSD

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,004 0,026 0,038 0,009 0,015 0,008 0,008 0,091 0,004 0,008 0,002
full MHRk    1,645 33,285 43,321 7,159 25,560 10,115 11,048 7,907 3,945 8,943 2,598
full MRHRk   3,019 4,954 3,926 2,617 1,543 3,100 3,072 5,942 4,756 3,389 2,035
full HRt     2,048 0,448 1,047 1,160 1,210 1,528 2,328 5,568 2,650 3,710 2,350
full RHRt    1,628 0,890 1,628 0,928 1,127 0,927 0,888 3,333 1,688 1,088 1,110
full total   2,927 3,960 4,348 1,860 4,547 2,510 2,827 8,012 2,948 4,940 2,940
CUT ME       0,003 0,022 0,015 0,003 0,010 0,006 0,008 0,092 0,004 0,007 0,003
CUT MHRk     0,307 8,873 2,651 0,417 8,280 5,320 6,466 3,852 1,353 4,634 1,580
CUT MRHRk    0,422 0,861 0,248 0,352 0,571 0,574 0,793 0,867 0,795 1,029 0,553
CUT HRt      2,110 0,588 1,388 1,210 1,100 1,688 2,690 5,629 2,528 4,060 2,190
CUT RHRt     1,628 2,640 2,428 0,790 2,640 1,190 1,510 6,049 2,540 3,760 2,240
CUT total    2,927 3,727 4,590 1,948 3,988 2,228 2,827 7,722 2,660 4,940 2,948

TABLE AI.7a. Validation of scoring approaches; Cross-training and test (Mean results), Negative RMSD with cut-off

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,869 0,438 0,737 0,804 0,418 0,863 0,837 0,888 0,879 0,916 0,884
full MHRk    3,197 19,083 8,107 5,311 19,456 3,938 4,830 3,413 3,320 2,688 3,705
full MRHRk   4,864 4,985 4,094 4,363 4,223 3,709 4,336 4,757 4,004 3,988 4,204
full HRt     2,750 0,550 1,100 1,350 1,211 2,850 3,200 3,150 2,800 3,800 3,250
full RHRt    1,500 1,100 1,500 1,750 1,737 1,700 1,600 1,950 1,550 2,200 1,800
full total   4,800 4,200 4,200 5,400 4,684 4,850 5,200 5,150 4,400 5,450 5,000
CUT ME       0,913 0,562 0,778 0,828 0,558 0,920 0,882 0,910 0,921 0,950 0,908
CUT MHRk     1,508 7,609 3,868 2,416 7,928 1,831 2,770 1,781 1,634 1,422 1,849
CUT MRHRk    1,712 1,913 1,809 1,742 1,677 1,713 1,923 1,671 1,891 1,622 1,801
CUT HRt      3,750 0,750 1,750 2,800 0,947 3,900 4,150 3,800 3,750 4,500 3,800
CUT RHRt     3,100 2,400 2,650 3,450 3,263 3,400 3,600 3,500 2,800 3,800 3,200
CUT total    4,550 4,150 4,150 5,150 4,421 4,800 5,150 5,050 4,400 5,400 4,950

TABLE AI.7b. Validation of scoring approaches; Cross-training and test (Variances), Negative RMSD with cut-off

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,004 0,037 0,012 0,012 0,026 0,010 0,006 0,006 0,004 0,005 0,007
full MHRk    2,831 66,716 16,495 9,364 61,229 8,599 10,349 4,457 5,578 3,436 6,697
full MRHRk   4,484 5,606 5,709 2,953 6,836 3,235 2,611 10,090 3,707 4,095 3,518
full HRt     2,188 0,448 0,890 1,428 1,085 1,927 2,760 2,128 1,660 3,160 2,488
full RHRt    1,250 0,690 0,950 0,988 1,511 1,410 1,440 1,347 1,047 1,260 2,160
full total   5,360 3,760 2,060 2,840 3,371 4,328 5,260 3,228 3,440 4,248 4,700
CUT ME       0,005 0,015 0,008 0,009 0,037 0,003 0,007 0,006 0,002 0,001 0,004
CUT MHRk     0,482 14,403 3,829 1,371 14,771 2,025 6,442 1,813 0,774 0,231 1,041
CUT MRHRk    0,456 0,437 0,544 0,433 1,180 0,687 0,643 0,658 0,672 0,259 0,619
CUT HRt      3,588 0,988 0,888 3,460 0,939 2,990 4,627 2,360 2,388 3,150 2,860
CUT RHRt     2,390 2,840 1,528 2,647 2,860 2,840 2,640 2,550 1,660 2,460 2,760
CUT total    5,048 3,728 1,828 2,527 2,957 4,360 5,027 2,848 3,440 4,340 5,048

TABLE AI.8a. Validation of scoring approaches; Cross-training and test (Mean results), Term correlation

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,821 0,471 0,724 0,801 0,365 0,858 0,897 0,906 0,923 0,910 0,911
full MHRk    4,593 16,368 8,366 4,870 22,087 3,531 2,752 2,232 2,062 2,848 2,757
full MRHRk   4,507 4,417 3,753 4,118 4,219 4,395 3,683 3,394 3,622 3,605 3,193
full HRt     2,650 0,250 1,300 1,400 0,850 2,750 3,900 4,150 3,600 2,800 3,150
full RHRt    1,350 1,900 1,550 1,450 1,550 1,600 2,250 2,050 1,900 1,500 1,600
full total   4,550 4,750 4,100 4,250 4,650 4,550 5,050 5,350 5,400 4,500 4,250
CUT ME       0,851 0,610 0,764 0,850 0,525 0,897 0,928 0,930 0,947 0,909 0,919
CUT MHRk     2,920 6,820 3,985 2,011 9,229 2,210 1,430 1,567 1,204 1,998 1,920
CUT MRHRk    1,974 1,885 1,601 1,693 1,760 1,800 1,690 1,473 1,397 1,644 1,521
CUT HRt      3,250 0,800 1,800 2,800 0,750 3,650 4,400 4,500 4,700 3,400 3,450
CUT RHRt     2,750 2,950 2,750 2,800 2,850 2,800 3,700 3,750 3,900 2,850 3,100
CUT total    4,350 4,500 4,050 4,200 4,550 4,450 5,000 5,250 5,250 4,450 4,100

TABLE AI.8b. Validation of scoring approaches; Cross-training and test (Variances), Term correlation

Subset: all
Scores (column order): c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif
terms        5 2 1 7 4 7 11 8 14 15 19
full ME      0,010 0,013 0,012 0,009 0,025 0,010 0,005 0,003 0,002 0,004 0,004
full MHRk    8,930 17,659 14,992 7,188 42,155 11,837 2,829 3,193 0,619 4,605 5,107
full MRHRk   3,832 1,854 3,874 3,857 3,739 3,363 5,478 2,513 2,831 4,761 2,468
full HRt     2,027 0,188 1,410 1,340 0,628 2,288 3,290 2,528 3,540 2,660 1,828
full RHRt    0,827 1,290 1,148 1,148 1,147 1,340 2,688 2,248 1,890 1,050 0,940
full total   2,948 4,188 6,290 2,888 3,928 2,748 3,548 3,628 5,240 4,750 2,688
CUT ME       0,015 0,013 0,013 0,009 0,014 0,008
0,003 0,003 0,002 0,005 0,004 MHRk 6,537 8,608 4,140 0,891 8,269 6,006 1,094 1,145 0,127 1,638 3,217 CUT MRHRk 1,003 0,681 0,411 0,480 0,346 0,666 0,662 0,427 0,265 0,590 0,530 HRt 2,788 0,460 1,560 2,060 0,588 1,928 3,840 2,350 3,810 3,340 1,748 RHRt 2,588 2,148 2,488 1,860 1,727 1,660 5,510 2,388 2,890 2,528 1,590 total 3,028 4,250 6,148 2,760 3,948 2,448 3,600 2,888 5,088 4,348 2,290 Table AI.9 shows the mean values of the calculated weighting coefficients and their variances from the Cross- training and test – procedure for each score calculated on the complete Dataset 2. The term corresponding to the weight marked “w4_c_hbond(N3’)” is the result of a bug, which added an extra term to the c_hbond –component. This extra term captures hydrogen bond interactions in the same way as the N3-scoring term of c_hbond (with weight w2_c_hbond(N3)), but captures interactions of all aromatic-marked nitrogen atoms in the ligand, not only N3. The number of terms in the c_hbond -component is thus finally five, with two terms that capture the same hydrogen-bond effects of N3. Also, the combinatorial term in c_hbond (with weight w5_c_hbond(N3’,N4)) is dependant on this “aromatic” term rather than the term scoring N3. The use of ligands with other aromatic nitrogens than N3 in cytosine together with the bug-term is thus cause of a supposed “model error” which might have improved the performance of the c_hbond –component slightly. AI.10 and AI.11 through AI.14 illustrate weighting coefficients des cribbed in AI.9. Diagram AI.10 illustrates the mean weighting coefficients of the full score (c_hbond c_stack c_mhp1 c_emp1 c_motif) from the Cross- training and test –procedure. Diagram AI.11 through AI.14 illustrate the mean weighting coefficients of all scores from the Cross- training and test –procedure for each fitting function. 
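The mean values and variances reported for the weighting coefficients can be illustrated with a minimal sketch, assuming that each run of the Cross-training and test procedure yields one fitted value per weight. The function name summarize_weights, the dictionary layout and the example numbers are hypothetical and are not taken from the thesis data. The text does not state whether the sample or the population variance was used; the sample variance is assumed here.

```python
from statistics import mean, variance


def summarize_weights(runs):
    """Return {weight_name: (mean, variance)} over a list of runs.

    Each run is a dict mapping a weight name to its fitted value.
    A weight that is absent from a run is skipped for that run (assumption).
    """
    names = sorted({name for run in runs for name in run})
    summary = {}
    for name in names:
        values = [run[name] for run in runs if name in run]
        # The sample variance needs at least two values; report 0.0 otherwise.
        var = variance(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), var)
    return summary


# Hypothetical fitted weights from three runs (illustrative numbers only).
runs = [
    {"w0": 0.10, "w1_c_hbond(O2)": 1.0},
    {"w0": 0.12, "w1_c_hbond(O2)": 1.0},
    {"w0": 0.14, "w1_c_hbond(O2)": 1.0},
]
summary = summarize_weights(runs)
for name, (m, v) in summary.items():
    print(f"{name}: mean={m:.3f} variance={v:.4f}")
```

A weight whose fitted value is identical in every run, such as the constant 1.0 in this example, gets a variance of zero, which is one way the many 0,000 entries in a variance table could arise.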
TABLE AI.9a Weighting coefficients; Cross- training and test (Mean results) w0 semi- step-function c_hbond 0,116 c_stack 0,269 c_mhp1 0,208 c_emp1 -0,106 c_motif 0,355 c_hbond c_stack 0,107 c_hbond c_stack c_motif 0,120 c_hbond c_stack c_mhp1 -0,001 c_hbond c_stack c_emp1 -0,100 c_hbond c_stack c_mhp1 c_emp1 -0,150 c_hbond c_stack c_mhp1 c_emp1 c_motif -0,160 w1_c_hbond(O2) w2_c_hbond(N3) w3_c_hbond(N4)w4_c_hbond(N3’)w5_c_hbond(N3’,N4w6_c_stack(phe,tyr,trp,his) w7_cstack(arg) w8_c_mhp1w9_c_emp1(O2_HNx) w10_c_emp1(O2_HOx) w11_c_emp1(N3_HNx) w12_c_emp1(N3_HOx) w13_c_emp1(N4_Nx) w14_c_emp1(N4_Ox) w15_c_emp1(Ar) w16_c_motif(O2,N3,N4) w17_c_motif(O2,N3,-)w18_c_motif(O2,-,N4) w19_c_motif(-,N3,N4) 0,147 0,114 0,106 0,952 0,050 0,051 0,148 0,098 0,952 1,000 0,953 0,530 0,437 0,530 1,000 1,000 1,000 0,231 0,322 0,279 1,000 1,000 1,000 1,000 1,000 0,811 1,000 1,000 1,000 0,062 0,055 0,050 0,070 0,080 0,079 0,705 1,230 1,000 0,394 term correlation c_hbond -0,001 c_stack -0,001 c_mhp1 -0,001 c_emp1 -0,001 c_motif -0,001 c_hbond c_stack -0,001 c_hbond c_stack c_motif 0,000 c_hbond c_stack c_mhp1 -0,001 c_hbond c_stack c_emp1 -0,001 c_hbond c_stack c_mhp1 c_emp1 -0,001 c_hbond c_stack c_mhp1 c_emp1 c_motif 0,000 0,366 0,056 0,058 0,566 0,094 0,715 0,808 1,000 1,000 0,757 1,000 1,000 1,000 0,866 0,703 0,458 0,468 1,000 1,000 1,000 1,000 1,000 1,000 0,931 0,034 1,000 0,865 0,272 0,769 0,466 0,513 0,460 0,198 0,332 0,312 1,000 1,000 1,000 0,771 0,942 0,884 0,000 0,000 0,000 1,000 0,000 0,000 0,200 1,000 0,000 1,000 0,000 0,000 0,000 1,000 0,187 1,000 negative RMSD c_hbond -8,813 0,595 c_stack -7,117 c_mhp1 -8,045 c_emp1 -12,603 c_motif -6,347 c_hbond c_stack -8,733 0,601 c_hbond c_stack c_motif -8,783 0,610 c_hbond c_stack c_mhp1 -10,3530,581 c_hbond c_stack c_emp1 -12,8531,000 c_hbond c_stack c_mhp1 c_emp1 -12,9151,000 c_hbond c_stack c_mhp1 c_emp1 c_motif -13,0341,000 negative RMSD with cut-off c_hbond -3,850 c_stack -2,611 c_mhp1 -2,482 c_emp1 -5,377 c_motif -2,306 c_hbond 
c_stack -3,827 c_hbond c_stack c_motif -3,842 c_hbond c_stack c_mhp1 -3,941 c_hbond c_stack c_emp1 -4,884 c_hbond c_stack c_mhp1 c_emp1 -4,903 c_hbond c_stack c_mhp1 c_emp1 c_motif -4,990 0,101 0,227 1,000 -0,603 1,000 1,089 1,000 0,095 0,030 0,252 1,000 1,000 0,803 1,000 1,000 1,000 1,000 1,000 1,000 1,000 0,265 0,853 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 0,079 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 0,000 0,000 0,000 1,000 0,150 0,150 0,150 1,000 0,000 0,000 0,000 0,131 0,240 0,106 1,085 1,015 1,000 1,000 1,000 1,000 -1,010 0,541 0,863 1,000 1,000 1,000 0,784 0,447 0,392 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 0,603 1,000 4,891 1,047 0,868 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 0,988 1,000 1,000 4,799 4,530 4,777 1,070 1,000 1,000 0,644 1,000 1,000 1,891 1,000 -3,765 1,000 4,481 1,000 0,310 1,822 1,748 1,599 1,000 1,000 1,000 -0,956 -0,772 -0,639 1,000 1,000 1,000 1,838 1,803 1,774 1,000 1,000 0,982 1,000 0,959 1,000 0,800 0,450 0,200 0,550 1,000 0,350 0,000 1,000 1,000 1,000 1,000 1,000 1,000 0,900 0,700 1,000 0,000 0,200 1,000 0,100 0,300 1,000 1,000 0,950 1,000 1,000 0,800 1,000 2,004 0,955 0,977 0,937 1,000 1,000 1,000 0,967 0,996 1,202 0,970 1,008 1,023 1,000 1,000 0,769 1,000 1,000 1,000 0,571 0,569 0,452 0,574 0,896 0,792 0,397 0,618 1,000 0,444 1,711 1,254 1,396 0,257 0,268 0,211 0,352 0,919 0,958 0,961 0,432 0,437 0,443 0,548 0,608 0,511 0,577 0,577 0,572 0,508 0,565 0,493 1,000 1,000 1,000 0,857 0,867 0,867 0,454 0,405 0,647 0,545 0,425 0,621 0,000 0,000 1,000 1,000 0,000 1,000 0,932 0,896 0,001 0,000 1,000 0,000 0,100 0,250 1,000 0,000 1,000 0,000 0,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 0,000 0,150 0,000 0,000 0,000 0,000 1,000 1,000 1,000 TABLE AI.9b Weighting coefficients; Cross- training and test (Variances) w0 
w1_c_hbond(O2)w2_c_hbond(N3)w3_c_hbond(N4)w4_c_hbond(N3’)w5_c_hbond(N3’,N4w6_c_stack(phe,tyr,trp,his)w7_cstack(arg)w8_c_mhp1w9_c_emp1(O2_HNx) w10_c_emp1(O2_HOx)w11_c_emp1(N3_HNx) w12_c_emp1(N3_HOx) w13_c_emp1(N4_Nx) w14_c_emp1(N4_Ox) w15_c_emp1(Ar)w16_c_motif(O2,N3,N4)w17_c_motif(O2,N3,-)w18_c_motif(O2,-,N4)w19_c_motif(-,N3,N4) semi- step-function c_hbond 0,0000,081 0,042 0,000 0,044 0,000 c_stack 0,000 0,000 0,217 c_mhp1 0,000 0,000 c_emp1 0,001 0,000 0,000 0,004 0,000 0,003 0,000 0,000 c_motif 0,000 0,000 0,187 0,000 0,000 c_hbond c_stack 0,0010,000 0,221 0,105 0,000 0,001 0,000 0,000 c_hbond c_stack c_motif 0,0010,081 0,212 0,153 0,000 0,001 0,000 0,000 c_hbond c_stack c_mhp1 0,0020,043 0,221 0,130 0,143 0,001 0,230 0,177 0,000 c_hbond c_stack c_emp1 0,0000,044 0,000 0,000 0,000 0,000 0,043 0,000 0,011 0,000 0,781 0,000 0,023 0,000 0,000 c_hbond c_stack c_mhp1 c_emp1 0,0010,000 0,000 0,000 0,000 0,000 0,189 0,000 0,103 0,001 0,000 0,898 0,000 0,038 0,000 0,000 c_hbond c_stack c_mhp1 c_emp1 c_motif0,0000,042 0,000 0,000 0,000 0,000 0,148 0,000 0,000 0,000 0,000 0,168 0,000 0,010 0,000 0,000 0,000 0,000 0,000 0,000 negative RMSD c_hbond 0,0260,004 c_stack 0,025 c_mhp1 0,075 c_emp1 0,147 c_motif 0,019 c_hbond c_stack 0,0140,002 c_hbond c_stack c_motif 0,0170,002 c_hbond c_stack c_mhp1 0,1030,002 c_hbond c_stack c_emp1 0,1740,000 c_hbond c_stack c_mhp1 c_emp1 0,1650,000 c_hbond c_stack c_mhp1 c_emp1 c_motif0,1680,000 negative RMSD with cut-off c_hbond 0,0050,032 c_stack 0,004 c_mhp1 0,011 c_emp1 0,040 c_motif 0,006 c_hbond c_stack 0,0060,060 c_hbond c_stack c_motif 0,0120,033 c_hbond c_stack c_mhp1 0,0080,105 c_hbond c_stack c_emp1 0,0440,059 c_hbond c_stack c_mhp1 c_emp1 0,0860,033 c_hbond c_stack c_mhp1 c_emp1 c_motif0,0730,029 term correlation c_hbond 0,0000,000 c_stack 0,000 c_mhp1 0,000 c_emp1 0,000 c_motif 0,000 c_hbond c_stack 0,0000,000 c_hbond c_stack c_motif 0,0000,000 c_hbond c_stack c_mhp1 0,0000,000 c_hbond c_stack c_emp1 0,0000,090 c_hbond c_stack 
c_mhp1 c_emp1 0,0000,187 c_hbond c_stack c_mhp1 c_emp1 c_motif0,0000,000 0,061 0,007 0,000 0,009 0,018 0,010 0,022 0,000 0,000 0,000 0,008 0,010 0,030 0,005 0,001 0,004 0,000 0,000 0,045 0,000 0,000 0,000 0,011 0,008 0,132 0,025 0,033 0,053 0,007 0,002 0,000 0,253 0,001 0,000 0,001 0,088 0,103 0,081 0,001 0,001 0,002 0,029 0,050 0,015 0,000 0,000 0,000 0,082 0,070 0,071 0,244 0,236 0,232 0,213 0,189 0,219 0,000 0,000 0,000 0,000 0,001 0,134 0,002 0,002 0,000 0,000 0,000 0,000 0,000 0,000 0,043 0,402 0,000 0,164 0,001 0,160 0,190 0,194 0,194 0,034 0,112 0,119 0,000 0,000 0,000 0,209 0,063 0,122 0,000 0,000 0,000 0,000 0,000 0,000 0,160 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,021 0,000 0,000 1,424 0,000 0,920 0,041 0,007 0,000 0,123 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,127 0,127 0,127 0,000 0,000 0,000 0,000 0,016 0,013 0,011 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,003 0,000 0,000 0,503 0,096 0,139 0,093 0,000 0,000 0,024 0,000 0,000 1,966 0,000 4,618 0,000 0,109 0,000 0,001 0,005 0,000 0,042 0,061 2,780 2,298 1,441 0,000 0,000 0,000 9,706 9,297 7,388 0,000 0,000 0,000 0,676 0,748 0,678 0,000 0,000 0,004 0,000 0,031 0,000 0,160 0,247 0,160 0,247 0,000 0,227 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,090 0,210 0,000 0,000 0,160 0,000 0,090 0,210 0,000 0,000 0,047 0,000 0,000 0,160 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,000 0,127 0,000 0,000 0,000 0,000 0,000 0,000 0,000

Table AI.10 Weighting coefficients in the full score (c_hbond c_stack c_mhp1 c_emp1 c_motif) [Diagram; one series per fitting function: semi-step-function, negative RMSD, negative RMSD with cut-off, term correlation]

Table AI.11 Weighting coefficients, Semi-step-function [Diagram; one series per score: c_hbond; c_stack; c_mhp1; c_emp1; c_motif; c_hbond c_stack; c_hbond c_stack c_motif; c_hbond c_stack c_mhp1; c_hbond c_stack c_emp1; c_hbond c_stack c_mhp1 c_emp1; c_hbond c_stack c_mhp1 c_emp1 c_motif]

Table AI.12 Weighting coefficients, Negative RMSD [Diagram; same score series as in AI.11]

Table AI.13 Weighting coefficients, Negative RMSD with cut-off [Diagram; same score series as in AI.11]

Table AI.14 Weighting coefficients, Term correlation [Diagram; same score series as in AI.11]