Virtual Screening In A Desktop Grid:
Replication And The Optimal Quorum
Ilya Chernov
Natalia Nikitina
Institute of Applied Mathematical Research,
Karelian Research Centre of Russian Academy of Sciences
{chernov,nikitina}@krc.karelia.ru
BOINC::FAST(2015)
Petrozavodsk, Russia
Desktop grid: reliability problems
- Can we trust volunteers?
- Can we trust hardware?
- If an answer is valuable, then someone may want it.
- Non-deterministic algorithms are used.
- Precision, convergence, tolerance.
Replication can help!
Solve each task until ν identical results are obtained: a quorum.
A model project
- Solve many similar tasks.
- Each is a yes/no problem ("Is this ligand good or not?").
- Answers may have different value, different probability, and different risk of an error.
- The cost of an error can also differ and is normally high!
The problem
- q+, q− are the probabilities of correct answers; p± = 1 − q± are the error probabilities.
- The quorums: ν YES answers, µ = ν + γ NO answers.
- Penalties F+, F− are added to the computational time in case of a false answer.
- A priori probabilities are α+, α−.
- The unit of cost is the cost of an average task, so F± are high!
- What are the optimal ν, µ for given penalties?
- What penalties force the wanted quorums?
- How do they depend on the probabilities p±?
The simpler problems
- Similar answers:
  - q+ = q− = q, ν = µ, γ = 0, F+ = F− = F,
  - α± do not matter.
- One reliable answer:
  - The YES answers are reliable: p− = 0.
  - Then ν = 1, F+ does not matter.
The random cost function: the simple case
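The cost function is a random variable with a finite number of possible values:
\[
\begin{array}{ll}
\text{Values} & \text{Probabilities} \\
\nu + i,\quad i = 0 : \nu - 1 & \binom{\nu+i-1}{\nu-1}\, q^{\nu} p^{i} \\
\nu + i + F,\quad i = 0 : \nu - 1 & \binom{\nu+i-1}{\nu-1}\, p^{\nu} q^{i}
\end{array}
\]
Its expectation is
\[
E_F(\nu) = \nu + p^{\nu} F\, g_\nu(q) + q^{\nu} p\, g'_\nu(p) + p^{\nu} q\, g'_\nu(q),
\]
where the function
\[
g_\nu(x) = \sum_{i=0}^{\nu-1} \binom{\nu+i-1}{\nu-1} x^{i}.
\]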
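The expectation on this slide can be checked numerically. The sketch below is not from the slides: it assumes p + q = 1 and uses illustrative function names. It evaluates the expected cost once by enumerating the listed values and probabilities, and once through the closed form with g and its derivative.

```python
from math import comb

def g(nu, x):
    """g_nu(x) = sum_{i=0}^{nu-1} C(nu+i-1, nu-1) * x^i."""
    return sum(comb(nu + i - 1, nu - 1) * x**i for i in range(nu))

def dg(nu, x):
    """Derivative of g_nu with respect to x."""
    return sum(i * comb(nu + i - 1, nu - 1) * x**(i - 1) for i in range(1, nu))

def expected_cost_closed(nu, p, F):
    """Closed form: nu + p^nu F g(q) + q^nu p g'(p) + p^nu q g'(q)."""
    q = 1.0 - p
    return (nu + p**nu * F * g(nu, q)
            + q**nu * p * dg(nu, p) + p**nu * q * dg(nu, q))

def expected_cost_enumerated(nu, p, F):
    """Direct sum over the listed values and their probabilities."""
    q = 1.0 - p
    total = 0.0
    for i in range(nu):
        w = comb(nu + i - 1, nu - 1)
        total += (nu + i) * w * q**nu * p**i        # correct quorum reached
        total += (nu + i + F) * w * p**nu * q**i    # wrong quorum, penalty F added
    return total

if __name__ == "__main__":
    for nu in (1, 2, 3):
        print(nu, expected_cost_closed(nu, 0.1, 100.0),
              expected_cost_enumerated(nu, 0.1, 100.0))
```

Both functions should agree up to rounding, because the probabilities of all outcomes sum to one, which is what turns the leading term into ν.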
The optimal quorum
Consider the difference
\[
G(\nu) = E_F(\nu) - E_F(\nu + 1) = A(p) F - B(p).
\]
If G(ν) > 0, then ν + 1 replicas are better than ν.
Note that B(p) > 0,
so if A(p) < 0, then no further replication is needed, however large the penalty;
but if A(p) > 0, then G > 0 provided that F is sufficiently large.
A(p) and B(p)
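B is the "no penalty" case (F = 0):
\[
B(p) = E_0(\nu + 1) - E_0(\nu).
\]
A(p) can be evaluated (by no means easily):
\[
A(p) = \binom{2\nu-1}{\nu-1} p^{\nu} q^{\nu} (1 - 2p).
\]
Thus for p ≥ 0.5 no replication is useful.
For p < 0.5, quorum ν + 1 is better than ν when
\[
F > F_\nu = \frac{E_0(\nu + 1) - E_0(\nu)}{\binom{2\nu-1}{\nu-1} (1 - 2p)\, p^{\nu} q^{\nu}}.
\]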
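As a minimal sketch of how the threshold F_ν could be computed, the code below assumes the simple-case cost distribution (equal answers, p + q = 1) and illustrative function names that do not come from the slides. It also lets one check the closed form of A(p) against the difference of error probabilities for quorums ν and ν + 1.

```python
from math import comb

def error_probability(nu, p):
    """Probability that nu wrong answers arrive before nu correct ones."""
    q = 1.0 - p
    return p**nu * sum(comb(nu + i - 1, nu - 1) * q**i for i in range(nu))

def e0(nu, p):
    """Expected number of answers until nu identical ones (penalty F = 0)."""
    q = 1.0 - p
    return sum((nu + i) * comb(nu + i - 1, nu - 1)
               * (q**nu * p**i + p**nu * q**i) for i in range(nu))

def A(nu, p):
    """Closed form of the coefficient A(p) as stated on the slide."""
    q = 1.0 - p
    return comb(2 * nu - 1, nu - 1) * p**nu * q**nu * (1.0 - 2.0 * p)

def threshold_penalty(nu, p):
    """F_nu: above this penalty, quorum nu + 1 costs less on average than nu."""
    if not 0.0 < p < 0.5:
        raise ValueError("a finite threshold exists only for 0 < p < 0.5")
    return (e0(nu + 1, p) - e0(nu, p)) / A(nu, p)

def expected_cost(nu, p, F):
    """E_F(nu) = E_0(nu) + F * P(wrong quorum)."""
    return e0(nu, p) + F * error_probability(nu, p)

if __name__ == "__main__":
    for nu in range(1, 6):
        print(nu, threshold_penalty(nu, 0.1))
```

For a penalty above `threshold_penalty(nu, p)` the expected cost with quorum ν + 1 should drop below that with quorum ν, and for a smaller penalty the order should reverse.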
The penalty F
[Figure, three panels: log penalty F versus replication ν (curves for p = 0.05, 0.1, 0.2, 0.3); replication ν for P < p/2^ψ versus power ψ (curves for p = 0.05, 0.1, 0.2, 0.3); log penalty ln F(ν, p) versus error probability p (curves for ν = 2, 3, 4, 5).]
- F_ν grows almost exponentially with ν, less quickly for lower q;
- The replication needed to reduce the risk by a factor of 2^ψ grows with ψ more quickly for higher p.
- The penalty F(p) has exactly one minimum on [0, 0.5] and grows very quickly near p = 0 and p = 0.5.
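The relation between replication and risk reduction can be sketched as follows. This is an illustration under an assumption not spelled out on the slide: P is taken to be the probability that a quorum of ν wrong answers is reached before ν correct ones, in the simple case with equal answers. The function names are hypothetical.

```python
from math import comb

def error_probability(nu, p):
    """Probability that nu wrong answers arrive before nu correct ones."""
    q = 1.0 - p
    return p**nu * sum(comb(nu + i - 1, nu - 1) * q**i for i in range(nu))

def replication_for_risk(p, psi, nu_max=1000):
    """Smallest quorum nu with error probability below p / 2**psi."""
    if not 0.0 < p < 0.5:
        raise ValueError("replication reduces the risk only for 0 < p < 0.5")
    target = p / 2**psi
    for nu in range(1, nu_max + 1):
        if error_probability(nu, p) < target:
            return nu
    raise ValueError("no quorum up to nu_max reaches the target risk")

if __name__ == "__main__":
    for p in (0.05, 0.1, 0.2, 0.3):
        print(p, [replication_for_risk(p, psi) for psi in range(1, 7)])
```

The search is linear in ν because the error probability decreases with ν for p < 0.5, so the first quorum below the target is the smallest one.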
Unequal answers
Now let us consider a more general case:
- YES and NO answers are not equal:
  - NO is more likely (α− > 0.5) (the usual case);
  - YES is more important (rare things are searched for).
- So the penalty F− for a false NO is much less than F+ for a false YES.
- Replication: solve up to ν NOs or µ = ν + γ YESes.
- The cost function is
\[
\begin{array}{ll}
\text{Values} & \text{Probabilities} \\
\nu + i & \alpha \binom{\nu+i-1}{\nu-1}\, q^{\nu} p^{i} \\
\nu + i + F_- & \bar\alpha \binom{\nu+i-1}{\nu-1}\, p^{\nu} q^{i} \\
\mu + j & \bar\alpha \binom{\mu+j-1}{\mu-1}\, q^{\mu} p^{j} \\
\mu + j + F_+ & \alpha \binom{\mu+j-1}{\mu-1}\, p^{\mu} q^{j}
\end{array}
\]
Here i = 0 : µ − 1, j = 0 : ν − 1, ᾱ = 1 − α.
Let the expected cost be E(ν, γ).
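A minimal sketch of E(ν, γ), obtained by summing the four families of outcomes listed on this slide. It assumes that α is the a priori probability that the true answer is NO, that p is a single error probability with q = 1 − p, and that the index in the µ rows runs over j. The function names are illustrative.

```python
from math import comb

def outcomes(nu, gamma, p, alpha):
    """Yield (answers_used, penalty_kind, probability) for every outcome."""
    q = 1.0 - p
    mu = nu + gamma
    abar = 1.0 - alpha
    for i in range(mu):                  # quorum of nu NO answers, i YES answers
        w = comb(nu + i - 1, nu - 1)
        yield nu + i, None, alpha * w * q**nu * p**i       # correct NO
        yield nu + i, "minus", abar * w * p**nu * q**i     # false NO
    for j in range(nu):                  # quorum of mu YES answers, j NO answers
        w = comb(mu + j - 1, mu - 1)
        yield mu + j, None, abar * w * q**mu * p**j        # correct YES
        yield mu + j, "plus", alpha * w * p**mu * q**j     # false YES

def expected_cost(nu, gamma, p, alpha, F_minus, F_plus):
    """E(nu, gamma): mean number of answers plus mean penalty."""
    penalty = {None: 0.0, "minus": F_minus, "plus": F_plus}
    return sum((n + penalty[kind]) * prob
               for n, kind, prob in outcomes(nu, gamma, p, alpha))

if __name__ == "__main__":
    print(expected_cost(nu=2, gamma=1, p=0.1, alpha=0.9,
                        F_minus=10.0, F_plus=1000.0))
```

With γ = 0 and F− = F+ the result should not depend on α, which matches the remark on the simpler problem that α± do not matter there.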
Unequal answers, replication
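Consider the increment of the expected cost:
\[
G(\nu, \gamma) = E(\nu, \gamma) - E(\nu + 1, \gamma) = A_- F_- + A_+ F_+ - B.
\]
We can evaluate (even less easily!) the coefficients:
\[
A_- = \bar\alpha\, p^{\nu} q^{\nu+\gamma} \binom{2\nu+\gamma-1}{\nu} \left(1 - \frac{2\nu+\gamma}{\nu+\gamma}\, p\right),
\]
\[
A_+ = \alpha\, p^{\nu+\gamma} q^{\nu} \binom{2\nu+\gamma-1}{\nu+\gamma} \left(1 - \frac{2\nu+\gamma}{\nu}\, p\right).
\]
\[
A_- > 0 \ \text{for}\ p < 0.5, \qquad A_+ > 0 \ \text{if}\ p < \frac{\nu}{2\nu+\gamma} < \frac{1}{2}.
\]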
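The closed forms for A− and A+ can be compared with a direct computation. The sketch below is an illustration with hypothetical function names: it assumes that A− and A+ are the changes in the probabilities of a false NO and a false YES when ν grows by one at fixed γ, weighted by ᾱ and α respectively.

```python
from math import comb

def false_no(nu, mu, p):
    """P(nu wrong NO answers before mu correct YES answers | truth is YES)."""
    q = 1.0 - p
    return p**nu * sum(comb(nu + i - 1, nu - 1) * q**i for i in range(mu))

def false_yes(nu, mu, p):
    """P(mu wrong YES answers before nu correct NO answers | truth is NO)."""
    q = 1.0 - p
    return p**mu * sum(comb(mu + j - 1, mu - 1) * q**j for j in range(nu))

def A_minus(nu, gamma, p, alpha):
    """Closed form of the coefficient of F_minus."""
    q, n = 1.0 - p, 2 * nu + gamma
    return ((1.0 - alpha) * p**nu * q**(nu + gamma) * comb(n - 1, nu)
            * (1.0 - n * p / (nu + gamma)))

def A_plus(nu, gamma, p, alpha):
    """Closed form of the coefficient of F_plus."""
    q, n = 1.0 - p, 2 * nu + gamma
    return (alpha * p**(nu + gamma) * q**nu * comb(n - 1, nu + gamma)
            * (1.0 - n * p / nu))

if __name__ == "__main__":
    nu, gamma, p, alpha = 2, 1, 0.1, 0.9
    mu = nu + gamma
    direct_minus = (1 - alpha) * (false_no(nu, mu, p) - false_no(nu + 1, mu + 1, p))
    direct_plus = alpha * (false_yes(nu, mu, p) - false_yes(nu + 1, mu + 1, p))
    print(A_minus(nu, gamma, p, alpha), direct_minus)
    print(A_plus(nu, gamma, p, alpha), direct_plus)
```

The sign change of `A_plus` at p = ν/(2ν + γ) is visible by evaluating it on either side of that value.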
Unequal answers, replication
THUS:
- For reliable computers (p < ν/(2ν + γ)) additional replication is profitable if either or both penalties are high enough;
- For less reliable computers cheap mistakes should not be punished heavily: a high penalty discourages replication!
- Any replication ν can be made optimal by choosing a high enough penalty F− for expensive mistakes.
What about additional checks γ?
Additional check
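Consider the difference
\[
G(\gamma) = E(\nu, \gamma) - E(\nu, \gamma + 1) = a_- F_- + a_+ F_+ - b.
\]
Evaluate the coefficients (also not easy!):
\[
a_- = -\bar\alpha\, p^{\nu} q^{\nu+\gamma} \binom{2\nu+\gamma-1}{\nu-1},
\]
\[
a_+ = +\alpha\, p^{\nu+\gamma} q^{\nu} \binom{2\nu+\gamma-1}{\nu+\gamma}.
\]
So G > 0 needs
\[
a_- F_- + a_+ F_+ > 0 \iff \frac{F_-}{F_+} < \frac{\alpha}{\bar\alpha} \left(\frac{p}{q}\right)^{\gamma}.
\]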
Additional check
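THUS:
- For reliable computers we can always make the desired replication (ν, γ) optimal by choosing penalties F− and F+.
- For less reliable ones this is also possible:
\[
R(\nu, \gamma, p)\, \frac{\alpha}{\bar\alpha} \left(\frac{p}{q}\right)^{\gamma} \le \frac{F_-}{F_+} \le \frac{\alpha}{\bar\alpha} \left(\frac{p}{q}\right)^{\gamma},
\]
where
\[
R(\nu, \gamma, p) = \frac{(2\nu+\gamma)p - \nu}{(\nu+\gamma) - (2\nu+\gamma)p} < 1 \quad \text{for} \quad \frac{\nu}{2\nu+\gamma} < p < \frac{1}{2}.
\]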
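The window for the penalty ratio F−/F+ can be computed directly from these expressions. The sketch below uses illustrative names and treats the two bounds as necessary conditions only, since the cost terms B and b are ignored. It returns the window and lets one confirm that, inside it, both penalty-dependent sums are positive.

```python
from math import comb

def ratio_window(nu, gamma, p, alpha):
    """Return (lower, upper) bounds on F_minus / F_plus."""
    q, n = 1.0 - p, 2 * nu + gamma
    upper = alpha / (1.0 - alpha) * (p / q) ** gamma
    if n * p <= nu:            # reliable computers: A_plus >= 0, no lower bound
        return 0.0, upper
    R = (n * p - nu) / ((nu + gamma) - n * p)
    return R * upper, upper

def coefficients(nu, gamma, p, alpha):
    """Closed forms (A_minus, A_plus, a_minus, a_plus) as given on the slides."""
    q, n = 1.0 - p, 2 * nu + gamma
    abar = 1.0 - alpha
    A_minus = (abar * p**nu * q**(nu + gamma) * comb(n - 1, nu)
               * (1.0 - n * p / (nu + gamma)))
    A_plus = (alpha * p**(nu + gamma) * q**nu * comb(n - 1, nu + gamma)
              * (1.0 - n * p / nu))
    a_minus = -abar * p**nu * q**(nu + gamma) * comb(n - 1, nu - 1)
    a_plus = alpha * p**(nu + gamma) * q**nu * comb(n - 1, nu + gamma)
    return A_minus, A_plus, a_minus, a_plus

if __name__ == "__main__":
    lo, hi = ratio_window(nu=2, gamma=1, p=0.45, alpha=0.9)
    print(lo, hi)
```

At the lower end of the window the sum A−F− + A+F+ vanishes, and at the upper end a−F− + a+F+ vanishes, so ratios strictly inside keep both positive.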
Reliable answer
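Expressions for the coefficients become simpler:
\[
A_+ = a_+ = 0,
\]
\[
A_- = \alpha_-\, p_+^{\mu}\, q_- \left(1 - p_+ (1 + \mu)\right),
\]
\[
a_- = \alpha_-\, p_+^{\mu}\, q_-.
\]
The average cost is
\[
E = \alpha_- + \alpha_+ \mu + \alpha_-\, p_+^{\mu} F.
\]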
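A short sketch of how the average cost for the reliable-answer case could be used to pick a quorum. It takes the expression for E exactly as written on the slide, assumes α− = 1 − α+, and uses made-up parameter values; the function names are hypothetical, and the numbers it produces are not meant to reproduce any table from the talk.

```python
def expected_cost(mu, alpha_plus, p_plus, F):
    """E = alpha_minus + alpha_plus * mu + alpha_minus * p_plus**mu * F."""
    alpha_minus = 1.0 - alpha_plus
    return alpha_minus + alpha_plus * mu + alpha_minus * p_plus**mu * F

def best_quorum(alpha_plus, p_plus, F, mu_max=50):
    """Quorum mu in 1..mu_max with the smallest expected cost."""
    return min(range(1, mu_max + 1),
               key=lambda mu: expected_cost(mu, alpha_plus, p_plus, F))

if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the trade-off.
    for F in (0.0, 1e3, 1e6, 1e9):
        mu = best_quorum(alpha_plus=0.1, p_plus=0.01, F=F)
        print(F, mu, expected_cost(mu, 0.1, 0.01, F))
```

Because the penalty term shrinks geometrically in µ while the replication term grows only linearly, the best quorum grows slowly (roughly logarithmically) with F, so only a rough lower bound on F is needed.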
An application: virtual screening
- A computational technique to evaluate the binding energy between a complex molecule and a smaller one called a ligand.
- "Good" ligands should be tested in a laboratory.
- Hardware is reliable (Natalia tested), software also is.
- People are reliable (!): Enterprise Desktop Grid.
- A protein molecule often has pockets: local minima of the binding energy.
- A few tries can, possibly, reveal the mistake.
So:
- Good ligands are not lost.
- Bad ones can pretend to be good.
- Costly lab checks seem unnecessary.
- Little knowledge about F, only that it is high.
An application: virtual screening
- Statistics (Lübeck Inst. Experimental Dermatology, Univ. of Lübeck, Germany; virtual screening of ligands for one protein; open-source AutoDock Vina).
- Reliable answer.
- A priori probabilities α+ = 0.036, α− = 0.964.
- Probability of a false positive answer p+ = 0.004.
- Mean duration of a task was 11.37 s.
Table: Minimal penalty F* making quorum µ optimal.
µ     2      3      4         5         6         7
F*    1.9    230    57·10^3   14·10^6   35·10^8   89·10^10
There is no need to know F precisely: "not less than..." is enough.
An application: virtual screening
- These threshold values are important:
  - if the real penalties are close to them, then the variance can be used to reduce the cost a little.
  - For F = 1.4 · 10^7 the optimal mean µ = 4. Taking the task cost variance into account saves 0.13%, and the total duration grows by 0.15%, while the error probability and the expected loss are halved.
Table: Expected cost E with respect to quorum µ.
µ     1        2        3       4       5       7       10
E     55799    224.2    1.90    1.09    1.02    1.03    1.04
An application: precision
- Calculation can be performed with different precision;
- higher precision: more cost, less risk.
- E.g.: p+ = 0.044 for C s per task vs. p+ = 0.004 for 8C.
- However: µ = 2 or µ = 3 reduces the risk 10 or 100 times!
- At almost no price: the average cost of a task is 1.05.
- Thus: use lower precision with the optimal quorum.
Thank you for your attention!