
Free Energy Perturbation at Merck: Benchmarking against Faster Methods
Vertex Free Energy Workshop 2016
05/16/16
Andreas Verras
Merck FEP Evaluation and Benchmarking
Overview:
• Discussion of data set for validation
• Performance
• Domain of applicability
• Benchmarking against faster methods
• Future directions
FEP Data Set Generation
The initial data set was identified from an internal crystallographic database by near-neighbor searching.
The following criteria were also used:
• No charge state changes
• No qualified data; only well-behaved binding curves
• All chirality must be known absolutely
• No metals in the protein active site
• Symmetry events flagged but included
FEP Data Set Generation
• Initially, single-atom perturbations were identified.
• Because of the mapping workflow, we built maps containing multiple-atom perturbations.
• Hysteresis is assessed for cycles rather than back-and-forth transformations.
Data Set
• Completed 15 maps, ~200 total perturbations; data for the first 124 pairs are reviewed here.
• Perturbation rate is ~2 per 24 hours on 4 GPUs.
• All receptor structures were prepared by modelers.
• Receptor and water inclusion were determined by someone who had supported each project.
Performance on Validation Set
• Overall R² of about 0.3
• Mean unsigned error is about 1.5 kcal/mol
• Performance varies depending on target and map
FEP at Merck still an Enrichment Method
• Data were binned by ΔΔG in 1 kcal/mol increments.
• In cases where FEP predicts a change of at least 1 kcal/mol in either direction, the experimental result agrees or is essentially unchanged 95% of the time.
[Chart: FEP predictions binned by experimental ΔΔG: ≤ −1, −1 to 1, and ≥ 1 kcal/mol]
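A minimal sketch of the binning and agreement rate described above, assuming paired ΔΔG arrays in kcal/mol (hypothetical names; not the internal analysis code):

```python
import numpy as np

def bin_ddg(values, cutoff=1.0):
    """Map ddG values (kcal/mol) to -1 (<= -cutoff), 0 (within +/- cutoff),
    or +1 (>= cutoff)."""
    values = np.asarray(values, dtype=float)
    return np.where(values <= -cutoff, -1, np.where(values >= cutoff, 1, 0))

def confident_call_agreement(ddg_pred, ddg_exp, cutoff=1.0):
    """For perturbations where FEP predicts a change of at least `cutoff`,
    return the fraction where experiment agrees in direction or is
    essentially unchanged (within +/- cutoff)."""
    pred_bin = bin_ddg(ddg_pred, cutoff)
    exp_bin = bin_ddg(ddg_exp, cutoff)
    confident = pred_bin != 0
    agree = (pred_bin == exp_bin) | (exp_bin == 0)
    return float(np.mean(agree[confident]))
```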
Domain of Applicability – Perturbation Size
• Red box indicates perturbations observed N ≥ 10 times.
• Performance likely declines with increasing perturbation size.
Domain of Applicability – Ring vs. Substitution
• Perturbations were classified as Ring Changes if an atom was modified within a ring vs. Ring Substitutions if a ring substituent was modified.
• No difference in performance was observed.
Domain of Applicability – Cycle Closure Error
• Red box indicates perturbations observed N ≥ 10 times.
• Error may trend upward with increasing hysteresis.
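For reference, hysteresis can be estimated as the cycle closure error: the residual when signed edge ΔΔG values are summed around a closed cycle of perturbations. A minimal sketch with a hypothetical edge dictionary (not the vendor implementation):

```python
def cycle_closure_error(cycle, edge_ddg):
    """Absolute deviation from zero when ddG values are summed around a cycle.

    cycle    -- ordered ligand IDs, e.g. ["A", "B", "C"] (closed back to "A")
    edge_ddg -- dict mapping (ligand_i, ligand_j) to ddG(i -> j) in kcal/mol
    """
    total = 0.0
    for i, lig in enumerate(cycle):
        nxt = cycle[(i + 1) % len(cycle)]     # wrap around to close the cycle
        if (lig, nxt) in edge_ddg:
            total += edge_ddg[(lig, nxt)]
        else:
            total -= edge_ddg[(nxt, lig)]     # reverse edge: flip the sign
    return abs(total)

# A thermodynamically consistent 3-cycle sums to zero:
edges = {("A", "B"): 0.8, ("B", "C"): -0.3, ("C", "A"): -0.5}
assert cycle_closure_error(["A", "B", "C"], edges) < 1e-9
```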
Benchmarking – PhysProps and Fast Methods
Performance was compared against changes in physical properties, including:
• PSA and SASA
• LogD and Heavy Atom Count
• HBD and HBA
Performance was also compared against MMGBSA scoring as implemented in Schrodinger, with a 6 Å flexible active site around the ligand and default parameters.
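As an illustration of the property deltas used as fast baselines, a sketch using RDKit; MolLogP is a calculated logP stand-in for logD, SASA is omitted because it requires 3D coordinates, and the SMILES below are hypothetical:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def property_deltas(smiles_a, smiles_b):
    """Simple physical-property differences between the two ligands of a pair."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    return {
        "dPSA":  Descriptors.TPSA(mol_b) - Descriptors.TPSA(mol_a),
        "dLogP": Descriptors.MolLogP(mol_b) - Descriptors.MolLogP(mol_a),  # logD proxy
        "dHAC":  mol_b.GetNumHeavyAtoms() - mol_a.GetNumHeavyAtoms(),
        "dHBD":  Lipinski.NumHDonors(mol_b) - Lipinski.NumHDonors(mol_a),
        "dHBA":  Lipinski.NumHAcceptors(mol_b) - Lipinski.NumHAcceptors(mol_a),
    }

# Example perturbation: benzene -> phenol
print(property_deltas("c1ccccc1", "Oc1ccccc1"))
```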
Benchmark - SASA
[Chart: SASA benchmark, binned by experimental ΔΔG: ≤ −1, −1 to 1, and ≥ 1 kcal/mol]
Benchmark - MMGBSA
[Chart: MMGBSA benchmark, binned by experimental ΔΔG: ≤ −1, −1 to 1, and ≥ 1 kcal/mol]
Benchmarking – Visual inspection
There was concern that FEP may simply have been getting “easy” perturbations correct.
We developed a visual inspection tool and had 18 participants vote on the perturbations.
A vote of -1 indicated the left compound was more potent by at least 1 kcal/mol, +1 indicated the right compound was more potent, and 0 indicated the two compounds were within 1 kcal/mol in potency.
No one was allowed to vote on their own projects.
A voting mean was calculated for all perturbations.
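A minimal sketch of how the -1/0/+1 votes might be reduced to a mean and a directional call (the 0.5 threshold here is an arbitrary assumption, not taken from the study):

```python
import numpy as np

def consensus_vote(votes, threshold=0.5):
    """Reduce per-participant votes (-1, 0, +1) to a mean and a call:
    -1 (left compound more potent), +1 (right more potent), 0 (within ~1 kcal/mol)."""
    mean_vote = float(np.mean(votes))
    if mean_vote <= -threshold:
        return mean_vote, -1
    if mean_vote >= threshold:
        return mean_vote, 1
    return mean_vote, 0

print(consensus_vote([-1, -1, 0, -1, -1, 1, -1]))  # mean ~ -0.57: call -1 (left compound)
```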
Visual Inspection Tool
Benchmark - Visual Inspection
[Chart: visual inspection benchmark, binned by experimental ΔΔG: ≤ −1, −1 to 1, and ≥ 1 kcal/mol]
Consensus Methods
Performance of Random Forest with AP, DP, and MOE2D descriptors (a) compared with FEP (b) and an unweighted geometric mean (c).
While this is an initial evaluation on a single target, it suggests that a consensus approach may add predictivity.
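A hedged sketch of one way such a consensus could be formed; the slide does not specify exactly how the geometric mean was taken, so here it is assumed to be over predicted affinities, which corresponds to an arithmetic mean over ΔΔG values. The descriptors and training data are placeholders, with scikit-learn standing in for the Random Forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training set: descriptor vectors (e.g. AP/DP/MOE2D fingerprints)
# and measured ddG values for prior compounds on the same target.
rng = np.random.default_rng(0)
X_train = rng.random((50, 16))
y_train = rng.normal(size=50)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

def consensus_ddg(ddg_fep, descriptors):
    """Unweighted consensus of FEP and Random Forest ddG predictions.

    A geometric mean over predicted affinities corresponds to an arithmetic
    mean over the corresponding ddG values, so the consensus is taken
    directly in kcal/mol."""
    ddg_rf = float(rf.predict(np.asarray(descriptors, dtype=float).reshape(1, -1))[0])
    return 0.5 * (ddg_fep + ddg_rf)

print(consensus_ddg(-1.2, rng.random(16)))
```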
Conclusions
• FEP is still an enrichment method.
• Performance depends on both the target and the map.
• Error is likely dependent on perturbation size and hysteresis, but our understanding of the domain of applicability is incomplete.
• FEP outperforms cheaper methods and physical properties.
• FEP outperforms visual inspection.
• Including consensus scores may improve performance.
Future Directions
• Prospective application in projects.
• Work with others to better understand the domain of applicability and variation between users.
• Evaluate consensus approaches.
Acknowledgements
Merck
Alejandro Crespo
Kerim Babaoglu
John Sanders
Deping Wang
Xavier Fradera
Hakan Gunaydin
Hongwu Wang
Michael Altman
Sung-Sau So
Jennifer Johnston
Daniel Mcmasters
Matt Walker
Robert Sheridan
Zhuyan Guo
Yuan Hu
Chip Lesburg
Frank Brown
Brad Sherborne
Schrodinger
Alessandro Monge
Jeff Sanders
Fiona McRobb
Teng Lin
Thijs Beuming
Backup
Benchmark – MMGBSA
All protein–ligand complexes were rescored by MMGBSA. MMGBSA is not predictive of experimental affinities for our data set.
Benchmark - PhysProp
Benchmarks against ΔlogD between pairs (left) and Δheavy-atom count (right). Neither physical property is predictive of affinity for our data set.