Computational Aspects of High-Throughput
Screening planning and analysis
Introduction
Welcome to the ‘Living Document’ for the EMBL Advanced Course - Computational
Aspects of High-Throughput Screening planning and analysis. This document is
intended to be usable by all participants and presenters on the course. Please add any
information you wish to it so it can be shared with everyone else in attendance.
Introduction
Networking
Questions
Questions from ChEMBL intro
Questions about ZINC and from John Irwin’s session
Questions about Chemical Space from Peter Ertl’s session
Questions on Chemical Representation from George Papadatos
Questions on Designing Focussed Libraries from Steffen Renner’s session
Questions on virtual screening from Anna Linusson’s session
Questions on assay validation from David Murray’s session
Questions on Shaping a Screening file from Jeremy Everett’s session
Questions of Screening Workflows from Joe Lewis’s session
Questions on EU-Openscreen from Per-Anders session
Resources
Networking
If you are happy for others on the course to contact you, then please leave details here
(email, webpage, LinkedIn etc…)
Tom Hancocks - Course Organiser
 http://www.ebi.ac.uk/about/people/tom-hancocks
 http://uk.linkedin.com/pub/tom-hancocks/2a/956/bb2
 [email protected]
 Twitter: @tehancocks
Chris Kuffer - Participant
 PhD student / Max-Planck-Institute of Biochemistry
 [email protected]
Aurianne Lescure - Participant
 Curie Institute HCS platform - Robotic engineer, data analysis
 [email protected]
Silvia Lorente-Cebrián - Participant
 Post-doc / Karolinska Institute
 [email protected]
Tugrul DORUK - Participant
 Post-doc / Umea University
 [email protected]
David Andersson - Participant
 Researcher / Umea University
 [email protected]
Jeremy Everett - Speaker
 Professor at University of Greenwich
 http://www2.gre.ac.uk/about/schools/science/about/departments/pces/staff/professor-jeremy-everett
 [email protected]
Caroline BARETTE - Participant
 PhD, CMBA’s HTS facility operational manager, CEA-INSERM-Grenoble
University, France
 [email protected]
Anna-Lena Gustavsson - participant
 Computational Chemist at the Lab. for Chemical Biology Karolinska Institutet,
part of the Chemical Biology Consortium Sweden www.cbcs.se
 [email protected]
Arafath Najumudeen - participant
 Turku Centre for Biotechnology, Finland
 [email protected]
Matthias Kolberg – participant
 Researcher at Oslo University Hospital, Norway
 [email protected]
Ana Kitanovic - Participant
 Screening and Automation Scientist at Laboratory Automation Technology (LAT)
Screening Facility, German Center for Neurodegenerative Diseases (DZNE),
Bonn
 [email protected]
Pamela Gatto
 HTS Facility, Center for Integrative Biology, University of Trento
 http://www.unitn.it/en/cibio
 [email protected]
Guochao Wu- Participant
 Research engineer working on yeast screening with a collection of yeast
deletion mutants
 Department of Chemistry, Umea University, Sweden
 [email protected]
Bernd Boidol - Participant
 PhD Student (PLACEBO Lab), CeMM Research Center for Molecular Medicine
Vienna
 [email protected]
Martijn Fiers - Participant
 WUR\Plant Research International, Wageningen, The Netherlands
 [email protected]
Erik Vrij
 Tissue Regeneration - MIRA, University of Twente, the Netherlands
 [email protected]
Evgeny Kulesskiy - Participant
 High Throughput Biomedicine Unit, FIMM, Helsinki, Finland
 http://www.fimm.fi/en/technologycentre/htb/
 [email protected]
Steffen Renner - Speaker
- Novartis
- [email protected]
Anne Hersey - Speaker
- EMBL-EBI
- [email protected]
George Papadatos - Speaker
- EMBL-EBI
- [email protected]
- http://uk.linkedin.com/in/georgepapadatos
Questions
If you have a question (but are too afraid to ask!) then please type it up here. If you are
able to answer someone else’s question, or have additional things to ask, then please
feel free to add them to the document.
Questions from ChEMBL intro








How does ChEMBL create the chemical structure for each compound?
o Drawn by curators for each entry!
o There are computational methods - but they are not always accurate
o Need to map structures with biological value - difficult
o It would be nice if journals supplied chemical structure information when
publishing
o Errors are occasionally found in journal formulas - these are reported back to the authors
 So there is some checking of structures, but it is not foolproof
Can you enter a chemical structure and get all the targets out?
o Yes! Search against a name, or draw your own structure and search with that.
What IDs do you accept for protein searching?
o UniProt ID, Short Names
o ChEMBL assigns a ChEMBL ID to each UniProt entry
Does ChEMBL only return targets that are proteins?
o No, there is lots of data on cell-based assays as well
o Can search for compounds tested against cell lines
o Nucleic acids also valid targets
Do you have a plan to cover patents as well in the database?
o For example it could be really interesting to cover Chinese patents
as well.
o There are links to patent databases via UniChem in the compound report
card, e.g. https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL111
and scroll to the bottom
o ZINC also includes some patents, limited to openly available sources
o SureChem - https://surechem.com/ - is one of the best places to search
for patents
How can you submit multiple searches to ChEMBL?
o Best to use web services via the command line or to create a local installation
of ChEMBL
o There are examples of this in KNIME
How to search ChEMBL for “which target has the most compounds in common
with IRAK4 at 10uM or better”?
What other targets do IRAK4 compounds hit? (what polypharmacology should I
be looking out for?)
Are any of the compounds for IRAK4 also approved drugs? Are any of them
natural products? Are any of them metabolites? If not, what is the closest
metabolite to a known IRAK4 ligand?
How to take a query you have done in the ChEMBL interface and turn it into a
script?
Which compounds in ChEMBL are the most promiscuous?
For which targets is there at least one compound that binds at 1uM or better?
In the drawing tool, can you enter SMILES and then edit the molecule graphically
before submitting the search?
If you are doing a systematic survey for a list of compounds and their targets,
should you use ADMET property thresholds that apply to all the compounds, or
set individual thresholds for each compound?
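The "compounds in common with IRAK4" and "what other targets do IRAK4 compounds hit" questions can be sketched offline once activity records have been pulled out of ChEMBL (via the web services or a local installation). This is only a rough sketch: the record fields `molecule`, `target` and `pchembl` are hypothetical names, the records are made up for illustration, and the cut-off relies on pChEMBL 5 corresponding to 10 uM (6 corresponds to 1 uM).

```python
from collections import defaultdict


def actives_by_target(records, min_pchembl=5.0):
    """Map each target to the set of compounds at or above the potency cut-off.

    A pChEMBL of 5.0 corresponds to 10 uM; 6.0 corresponds to 1 uM.
    """
    actives = defaultdict(set)
    for rec in records:
        if rec["pchembl"] is not None and rec["pchembl"] >= min_pchembl:
            actives[rec["target"]].add(rec["molecule"])
    return actives


def shared_compound_ranking(records, query_target, min_pchembl=5.0):
    """Rank other targets by how many active compounds they share with query_target."""
    actives = actives_by_target(records, min_pchembl)
    query_set = actives.get(query_target, set())
    counts = {
        target: len(compounds & query_set)
        for target, compounds in actives.items()
        if target != query_target
    }
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(target, n) for target, n in ranked if n > 0]


# Made-up records for illustration only (not real ChEMBL data).
TOY_RECORDS = [
    {"molecule": "cpd1", "target": "IRAK4", "pchembl": 7.2},
    {"molecule": "cpd2", "target": "IRAK4", "pchembl": 5.5},
    {"molecule": "cpd3", "target": "IRAK4", "pchembl": 4.1},
    {"molecule": "cpd1", "target": "TargetA", "pchembl": 6.0},
    {"molecule": "cpd2", "target": "TargetA", "pchembl": 5.1},
    {"molecule": "cpd1", "target": "TargetB", "pchembl": 6.8},
    {"molecule": "cpd3", "target": "TargetB", "pchembl": 8.0},
]

ranking = shared_compound_ranking(TOY_RECORDS, "IRAK4")
print(ranking)
```

Raising `min_pchembl` to 6.0 gives the "1uM or better" version of the same question.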
Questions about ZINC and from John Irwin’s session





Things to consider when using ZINC to order compounds and libraries
o Remember to not trust your data! There are errors in every database!
o Also be careful and check the quality of compounds
 Good vendors can provide spectra to show the quality (it might
cost)
o Not all compounds are available as stock from some vendors and they
may have to prepare them when ordered.
 This can take lots of time!
o You get what you pay for - cheap libraries might not be great quality!
Can ZINC provide a list of suppliers for each compound?
o Yes
o They are listed alphabetically, so no weighting of which suppliers are
better
Does SEA store any search information?
o Yes, just like Google does
o For privacy concerns you can buy the software and run it locally
SEA is free to use for single cases, but limited by IP address to 40x a day
o For HTS, it can be purchased
If you have 30-50 compounds to test at once, can it tell you which pathways you
are hitting?
o No, but it will tell you the target you are hitting
o Compounds need to be in ChEMBL for SEA to actually return something
o Natural compounds will return null hits as they are not well defined in
chemical space
Questions about Chemical Space from Peter Ertl’s session

2000 viable drug targets, many possible drugs
o ‘Bioavailability area’ - small area of molecular property space
o Drug-likeness and NP-likeness
o ChemGPS - from AZ - molecules with “extreme properties” used as
reference
o http://chemgps.bmc.uu.se/batchelor/queue.php?show=submit
o Reference from current drugs on the market - these have already been
cleared for clinical use. So find something similar?
o Natural product-likeness - becoming increasingly important
 Molecules produced by living organisms - primary metabolites (fatty
acids, sugars…) and secondary metabolites (unique to an
organism, antibiotics, marine toxins…)
 Long natural selection process to optimize bio-interaction
o ‘Grey area’ between synthetic, natural and bioactive molecules - key area
for drug discovery
What is the root ring of Gleevec in Ertl’s hierarchy?
What about bioisosteres in Ertl’s classification system?
o Analysis of bioactivity databases (ChEMBL) can provide many ideas for
new bioisosteres
Isn’t the decision to use secondary metabolites and not primary metabolites
arbitrary?
In a real drug discovery project how much is found in the target class, how
much is found elsewhere?
o This process is used early on in the project to find basic information
o Later on incorporate 3D structures and modelling
o But classification on bioactivity is important?
Do natural products have to be considered with regards to membrane
transport?
o Some structural features are known to affect transport (like sugars) and
need to be thought about
o It isn’t easy, and needs to be considered afresh for each case
Will we ever have an overlap between synthetic chemistry and natural
products?
o Non-natural, natural products - making simpler compounds
Do you characterise cytotoxic compounds when creating potential libraries
and exclude them?
o Yes, a large set of substructures is excluded from screens due to these
properties
Questions on Chemical Representation from George Papadatos


Word of warning!
Chemical names - there are many names, different people use different names
o Trivial, IUPAC, synonyms, trade names…
o Language and spelling can confound the naming
o Large molecules have complex IUPAC names, easy to make a mistake
o CAS numbers, not derived from structures, proprietary and human
assigned
 Only 7900 freely available
Molecules as a graph - atoms as nodes
o SMILES
 Simplified Molecular Input Line Entry System
 Linear format, concise and convenient
o MDL mol files
 30+ lines of text can become very clumsy to use
o InChI
 International Chemical Identifier
 InChIKey - hashed representation
o SMARTS
 Extension of SMILES for substructure searching
 Find or filter out certain structures
Does CIR deal with large scale queries?
Molecular descriptors - lots of them, this is good and bad!
o Connectivity
o Pharmacophores
o Shapes
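For anyone wanting to script the representation conversions, here is a rough sketch of two small helpers. It assumes the CIR URL pattern `<base>/<identifier>/<representation>` on the cactus address listed under Resources, and the standard InChIKey layout (14 letters, 10 letters, 1 letter). The function names are made up for illustration, and nothing here contacts the server, so check usage limits with the CIR maintainers before sending large batches.

```python
import re
from urllib.parse import quote

# Base address as listed in the Resources section of this document.
CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation):
    """Build a CIR lookup URL; the identifier (name, SMILES...) is percent-encoded."""
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"


# Standard InChIKey: 14-letter block, 10-letter block, single-letter block.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def inchikey_skeleton(inchikey):
    """Return the first (connectivity) block of an InChIKey, or None if malformed."""
    match = INCHIKEY_PATTERN.match(inchikey)
    return match.group(1) if match else None


aspirin_smiles = "CC(=O)Oc1ccccc1C(=O)O"
print(cir_url("aspirin", "smiles"))
print(cir_url(aspirin_smiles, "stdinchikey"))
print(inchikey_skeleton("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Grouping records by the first InChIKey block is a quick way to spot entries that share the same connectivity but differ in stereochemistry or other layers.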
Questions on Designing Focussed Libraries from Steffen Renner’s session







Consciously decide what you want to screen before beginning
o Don’t screen everything!
How many compounds to screen? 200,000 is a focussed screen
o Balance between getting hits and testing redundant space
o Conduct pilot screens for larger screens
o Iterative screens and learn from results
o <450 MW; leadlikeness, water soluble, membrane permeable
o Hitlists
o Stereocentres
o Non-exclusive compounds - don’t want someone to beat you to it!
Drug likeness - the QED value is a parameter provided in the ChEMBL database
Combine results from all previous screens
o Gain data on #assays and times a compound has been active
o Can be done for each different technology
o PAINS filter
Screening at different concentrations and in different conditions
o Screen with a low concentration of detergent to remove aggregate interactions
o Still many potential factors
Is there a way to calculate redox screening potential?
o Not aware of it
o Can be confusing information, but could be useful
Large and lipophilic molecules might not be great targets
o Reasons why are not always clear
o Lots of debate on whether this is true or not!
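The "combine results from all previous screens" idea (number of assays and times a compound has been active) can be sketched as a simple frequent-hitter count. This is an illustration only: the tuple layout, the compound names and both thresholds (`min_assays`, `max_active_fraction`) are assumptions, not values from the session.

```python
from collections import Counter


def hit_frequencies(results):
    """Count, per compound, how many assays it was tested in and how often it was active.

    results: iterable of (compound_id, assay_id, was_active) tuples.
    Returns {compound_id: (n_assays_tested, n_times_active)}.
    """
    tested = Counter()
    active = Counter()
    for compound, _assay, was_active in results:
        tested[compound] += 1
        if was_active:
            active[compound] += 1
    return {compound: (tested[compound], active[compound]) for compound in tested}


def frequent_hitters(results, min_assays=3, max_active_fraction=0.5):
    """Flag compounds active in more than max_active_fraction of at least min_assays assays."""
    flagged = []
    for compound, (n_tested, n_active) in hit_frequencies(results).items():
        if n_tested >= min_assays and n_active / n_tested > max_active_fraction:
            flagged.append(compound)
    return sorted(flagged)


# Made-up screening history for illustration only.
TOY_RESULTS = [
    ("cpdA", "assay1", True), ("cpdA", "assay2", True),
    ("cpdA", "assay3", True), ("cpdA", "assay4", True),
    ("cpdB", "assay1", False), ("cpdB", "assay2", True),
    ("cpdB", "assay3", False), ("cpdB", "assay4", False),
    ("cpdC", "assay1", True), ("cpdC", "assay2", True),
]

print(hit_frequencies(TOY_RESULTS))
print(frequent_hitters(TOY_RESULTS))
```

Running the same count separately per assay technology would give the "can be done for each different technology" view.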
Questions on virtual screening from Anna Linusson’s session








Binding affinities are based on the free energy of the ligand/target system: the
energy of the complex vs the separate energies of both components
o Enthalpy - Imagine it as two lovers
o Binding entropy - what keeps the components together?
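To put numbers on the free energy picture, here is a rough sketch using the textbook relations dG = dH - T*dS and dG = RT ln(Kd) (Kd in mol/L against a 1 M standard state). These relations and the 298.15 K default are assumptions added here, not taken from the slides.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT_KCAL * temperature_k * math.log(kd_molar)


def entropy_term(delta_g, delta_h):
    """Return -T*dS in kcal/mol, rearranged from dG = dH - T*dS."""
    return delta_g - delta_h


dg_micromolar = binding_free_energy(1e-6)  # about -8.2 kcal/mol
dg_nanomolar = binding_free_energy(1e-9)   # about -12.3 kcal/mol
print(round(dg_micromolar, 2), round(dg_nanomolar, 2))
```

Each tenfold improvement in Kd is worth roughly 1.4 kcal/mol at room temperature, which is why small errors in scoring can reorder a hit list.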
Interactions occur in an aqueous environment
Need to develop and validate the screening approach before conducting a screen
o Base on existing knowledge
Virtual screening - has existed in the literature since the mid 1990s
Methodology
o Ligand-based
 Need known molecules
o Structure-based
 Need 3D model of target
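As a minimal illustration of the ligand-based route, the sketch below ranks a library by Tanimoto similarity to one known active. It assumes each molecule is already described by a set of fingerprint "on" bits (in practice these would come from a cheminformatics toolkit); the bit sets and molecule names here are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of fingerprint on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_by_similarity(known_active_bits, library):
    """Return (name, similarity) pairs, most similar to the known active first."""
    scored = [(name, tanimoto(known_active_bits, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Made-up fingerprints for illustration only.
KNOWN_ACTIVE = {1, 2, 3, 4}
TOY_LIBRARY = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 5},
    "m3": {7, 8},
}

similarity_ranking = rank_by_similarity(KNOWN_ACTIVE, TOY_LIBRARY)
print(similarity_ranking)
```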
Why is NMR not a better way to gain knowledge of a protein’s structure?
o NMR has low resolution
How useful is VS for finding targets you don’t want to bind to?
o The techniques in the presentation are not great for this
o Other methods might be useful
Could you quantify knowledge to prepare for a screen?
o Look in more detail and refine the question you are trying to ask
o Iteratively build up more information after you get first results
o Be careful though!
Questions on assay validation from David Murray’s session






A 0.1% hit rate in a 1 million compound library gives 1000 active compounds
o No assay is 100% accurate
o At 99% accuracy
 10 false negatives
 9990 false positives!
o Technology artifacts make things difficult
 Don’t always believe how good an assay claims to be!
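The arithmetic behind those numbers can be checked with a short sketch. It assumes "99% accurate" means the assay correctly calls 99% of the true actives and 99% of the true inactives; the function and key names are made up for illustration.

```python
def screening_outcomes(library_size, hit_rate, sensitivity, specificity):
    """Expected confusion counts for a screen, given assay sensitivity and specificity."""
    true_actives = round(library_size * hit_rate)
    true_inactives = library_size - true_actives
    true_positives = round(true_actives * sensitivity)
    false_negatives = true_actives - true_positives
    false_positives = round(true_inactives * (1 - specificity))
    apparent_hits = true_positives + false_positives
    return {
        "true_actives": true_actives,
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        # Fraction of apparent hits that are genuinely active.
        "fraction_real": true_positives / apparent_hits if apparent_hits else 0.0,
    }


outcome = screening_outcomes(1_000_000, 0.001, sensitivity=0.99, specificity=0.99)
print(outcome)
```

Under these assumptions only about 9% of the compounds flagged as hits are real actives, which is why confirmation screens matter.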
HTS isn’t rubbish though!
o Treat everything with caution
o Every HTS experiment will be a big investment, make sure you get what
you can
Controls
o Neutral controls
 Steady state control
o Scale-reference controls
 Supra-maximal concentration of a compound
Assay must be good enough to call activity
o Validate, don’t be pressured into running the HTS
Standard QC
o Signal to Background - remember error rates!
Z'/Z factor article on Wikipedia is [currently] incorrect!
o Z' of 1 is the best score
o Z' of 0.5 is a marginal assay score
o Some assays will be consistently above 0.5
 Cell-based assays are less accurate; 0.3 is often seen
 Replicate assays can reduce error a lot
 Once = 29%
 Four times = 50%
o Don’t go chasing Z-values and try and improve results to reach a
threshold
Robust statistics
o DON’T USE MEANS in your statistical analysis - use median instead.
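A small sketch of why the median helps: robust z-scores built from the median and the median absolute deviation (MAD) keep a single extreme well from hiding itself. The 1.4826 scaling (which makes the MAD comparable to a standard deviation for normally distributed data) and the example values are assumptions added here.

```python
from statistics import mean, median, stdev

MAD_SCALE = 1.4826  # scales the MAD to match a standard deviation for normal data


def robust_z_scores(values):
    """Z-scores based on the median and scaled MAD instead of the mean and SD."""
    values = list(values)
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero, robust z-scores are undefined")
    return [(v - centre) / (MAD_SCALE * mad) for v in values]


def classical_z_scores(values):
    """Ordinary z-scores based on the mean and sample standard deviation."""
    values = list(values)
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]


# Made-up well readings with one extreme value.
WELLS = [1, 2, 3, 4, 100]
robust = robust_z_scores(WELLS)
classical = classical_z_scores(WELLS)
print(round(robust[-1], 1), round(classical[-1], 1))
```

With the mean and SD, the outlier inflates the spread so much that its own z-score stays below 2; with the median and MAD it stands out clearly.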
Questions on Shaping a Screening file from Jeremy Everett’s session





Plate-based diversity subset
o Ordering of a subset on a plate in order of physical properties
o Rule of 40
 All criteria are multiples of 40
 200+ compounds on a plate
 Binned compounds on each plate
o BCUTS - cell based description of chemical space
 Plate-based properties vs individual compound basis
o Order plates in each subset to have maximal coverage of chemical space
 Aim for double coverage
 Computational problem for calculating this! Many permutations!
 Take top plates; best plates move to the top of the pile
 17 iterations gave optimal order
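The plate-ordering idea can be illustrated with a simple greedy stand-in (not the procedure described in the talk): describe each plate by the set of chemical-space cells its compounds fall into, then repeatedly pick the plate that adds the most cells not yet covered `target_coverage` times. Plate names and cell labels below are made up.

```python
from collections import Counter


def order_plates(plates, target_coverage=2):
    """Greedily order plates so that early plates cover as many cells as possible.

    plates: {plate_id: set of chemical-space cell ids occupied by that plate}.
    A cell stops counting towards a plate's gain once it has been covered
    target_coverage times (2 = aim for double coverage).
    """
    coverage = Counter()
    remaining = dict(plates)
    ordered = []
    while remaining:
        def gain(plate_id):
            return sum(1 for cell in remaining[plate_id] if coverage[cell] < target_coverage)

        # Iterating over sorted ids makes ties resolve to the smallest plate id.
        best = max(sorted(remaining), key=gain)
        ordered.append(best)
        for cell in remaining.pop(best):
            coverage[cell] += 1
    return ordered


# Made-up plates for illustration only.
TOY_PLATES = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b"},
    "P3": {"d", "e", "f", "g"},
}

plate_order = order_plates(TOY_PLATES)
print(plate_order)
```

A greedy pass like this avoids enumerating all permutations, at the cost of not guaranteeing the optimal order.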
Did you return to missed sample plates and find out why they were missed?
o No, not done. Very large screening file. Interesting question, but not one
that needed to be solved.
How many compounds do you need to screen?
o Having a validated structure and knowing it is important
o Knowing how much compound is in a well is essential to calculate IC50
values
Introduce molecular redundancy
o Relationship between molecular fingerprint similarity to a known active and
the chance of finding biological activity
o Belief Theory - work out the number of molecules needed to find one active
 Defined ‘sure’ as 95% certainty of activity
 Probability of getting an active
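A back-of-envelope version of the "how many molecules to be 95% sure of one active" calculation is sketched below. It assumes each similar molecule is independently active with the same probability `p_active`, which is a simplification and not the Belief Theory model from the session: the smallest n with 1 - (1 - p)^n >= certainty.

```python
import math


def compounds_needed(p_active, certainty=0.95):
    """Smallest number of compounds giving at least `certainty` chance of one active.

    Assumes independent compounds, each active with probability p_active.
    """
    if not 0 < p_active < 1:
        raise ValueError("p_active must be between 0 and 1 (exclusive)")
    if not 0 < certainty < 1:
        raise ValueError("certainty must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1 - certainty) / math.log(1 - p_active))


print(compounds_needed(0.3))  # 9 compounds when each has a 30% chance
print(compounds_needed(0.5))  # 5 compounds when each has a 50% chance
```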
Have you done the PBDS in practice?
o Yes, it has become a major part of Pfizer’s work and economic considerations
Questions of Screening Workflows from Joe Lewis’s session
Questions on EU-Openscreen from Per-Anders session



EU-Openscreen is a Chemical Biology Research Infrastructure for Europe
o EU-Openscreen - http://www.eu-openscreen.de/
o Facilitate and support basic research
o Currently in last month of preparatory phase
o Current funding by EU
 Operational phase - support from host nation research councils
 User pays for screens
o Allow access to equipment, experience and training for scientists across
Europe
 Support at each stage of pre/post-screening process
o Development of standards, automation and reproducible practices
o Training course in Stockholm - http://www.ucmr.umu.se/researchschool/course-catalogue/1557-assay-development-in-high-throughputscreening-2-ects.html
o Open access of tools and data
o Covers all of the life sciences - food, ecology, animals, pest control,
medicine
More information on ESFRI http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri
BioMedBridges co-ordinates interactions between ESFRI infrastructures, more
information can be found here: http://www.biomedbridges.eu/
Resources
If you have any useful links, resources or information to share with the rest of the course
then please add it here.
1. http://zinc.docking.org - Free tools for ligand discovery
a. JI - interaction of Medicinal Chemical space with ZINC. Can also include
analogs of these compounds.
2. https://www.ebi.ac.uk/chembl/ - Open access and free med. chemistry SAR data
a. http://www.ebi.ac.uk/training/online/course/chembl-quick-tour - ChEMBL
‘Quick Tour’ e-learning tutorial from EBI Train Online
b. pChEMBL score - this is -log10 of the molar IC50, Ki, EC50 etc. and
attempts to put a number of roughly comparable concentration dependent
endpoints on a standard scale
i. This is a subset of possible measured bioactivities
3. https://www.ebi.ac.uk/unichem/ - UniChem resource
4. http://www.ebi.ac.uk/pdbe/ - PDBe (Protein Data Bank in Europe) is a database of
3D protein structures curated at the EBI. There are links between PDBe and ChEMBL.
5. http://advisor.docking.org - Small molecule aggregation historical record
database.
6. http://sea.docking.org - Predict the biological targets of small molecules (e.g. from
phenotypic hits)
7. Characterization of Chemical Space http://eu.wiley.com/WileyCDA/WileyTitle/productCd-3527318526.html
8. SMARTSviewer http://smartsview.zbh.uni-hamburg.de/
9. http://cactus.nci.nih.gov/chemical/structure Chemical representation converters
10. Someone asked if we run similar courses to this at EBI. We run this one every
year but it is closed for applications this year.
https://registration.hinxton.wellcome.ac.uk/display_info.asp?id=386
11. http://www.ebi.ac.uk/training/online/
12. www.knime.org
13. Rajarshi Guha’s slideshare account. Check out some of his resources if you are
interested in seeing what he would have covered:
http://www.slideshare.net/rguha/presentations