Empirical investigations

KTH/CSC
Empirical investigations
of local search on random
KSAT for K = 3,4,5,6...
CDInfos0803 Program
Kavli Institute for Theoretical Physics China
Erik Aurell
KTH Royal Institute of Technology
Stockholm, Sweden
March 4, 2008
Erik Aurell, KTH Computational Biology
Circumspect descent prevails
in solving combinatorial
optimization problems
Mikko Alava, John Ardelius, E.A., Petteri Kaski, Supriya
Krishnamurthy, Pekka Orponen, Sakari Seitz,
arXiv:0711.4902 (Nov 30, 2007)
Earlier work by E.A., Scott Kirkpatrick and Uri Gordon
(2004), Alava, Orponen and Seitz (2005), Ardelius and E.A.
(2006), Ardelius, E.A. and Krishnamurthy (2007) ... and
many others
Why did we get into this?
Let me give three reasons
It is a fundamental and
practically important problem
...which I learnt about working for the Swedish railways
E.A., J. Ekman, Capacity of single rail yards [in Swedish], Swedish Railway
Authority Technical reports (2002)
They have potential, under-used
applications in systems biology
As an example I will describe
consulting work we did for
Global Genomics, a now
defunct Swedish biotech
company. They claimed to
have a new method to measure
global gene expression. Many of
their ideas were in fact from
S. Brenner and K. Livak, PNAS
86 (1989), 8902-06, and K.
Kato, Nucleic Acids Res. 23
(1995), 3685-3690.
The problem is that using only one
Type IIS restriction enzyme, there
is not enough information in the
data to determine which genes
were expressed (many genes
could have given rise to a given
peak).
Kato (1995) tried using several
enzymes of the same type
sequentially. Problem: loss of
accuracy, complicated.
Global Genomics AB’s invention
was to use several enzymes
in parallel.
The Global Genomics invention
led to an optimal matching problem
Matching the observations to a gene database
gives a bipartite graph, where a link between a
gene g and an observation o represents the
fact that o could be an observation of g.
[Figure: all possible matchings, drawn as a bipartite graph between the gene database (gene 1, gene 2, gene 3) and the observations; numeric labels 30, 70, 100]
The best matching can be represented as a
subgraph of the graph above + expression
levels.
[Figure: an optimal matching between the gene database (gene 1, gene 2, gene 3) and the observations; numeric labels 30, 70, 100]
A. Ameur, E.A., M. Carlsson, J. Orzechowski
Westholm, “Global gene expression analysis by
combinatorial optimization”, In Silico Biology 4
(0020) (2004)
Testing using the FANTOM data
base of mouse cDNA (RIKEN)
For in silico testing we used the
FANTOM data base of full-length
mouse cDNA, available at
genome.gsc.riken.go.jp
We used an early 2003 version of
60 770 RIKEN full-length clones,
partitioned into 33 409 groups
representing different genes.
This second list can be taken as a
proxy for all genes in mouse.
Principle of in silico tests:
1. Select a fraction of genes
2. Generate random expression levels
3. Generate random peak and length perturbations
4. Run the algorithm
5. Compare
both methods solve the optimization
according to the given criteria when
the perturbation parameters are
small enough
the methods are comparable at
low or moderate fraction of genes
expressed
local search is superior at high
fraction of genes expressed
Ameur et al (2004)
In theory,
combinatorial optimization
and constraint satisfiability
give rise to many of the
computationally hardest
problems
In practice,
combinatorial optimization
and constraint satisfaction
problems are routinely
solved by complete methods
(branch-and-bound), local
search heuristics, by mixed
integer programming, etc.
How is this possible?
Following many others
we will look at a simple model
Random K-satisfiability problems
Let there be N Boolean variables, and 2N literals
Let there be M logical propositions (clauses) $P_a = L_{a_1} \vee L_{a_2} \vee \dots \vee L_{a_K}$
A clause expresses that one out of $2^K$ possible configurations of K variables
is forbidden. Clauses are picked randomly (with replacement) from all
possible K-tuples of variables.
Can all M clauses be satisfied simultaneously? $P = P_1 \wedge P_2 \wedge \dots \wedge P_M$
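The definition above can be sketched in a few lines of Python. This is only an illustrative generator, not code from the talk: the function names `random_ksat`, `is_satisfied` and `count_unsat` are hypothetical, and literals are encoded as signed integers (+i for variable i, -i for its negation), which is an assumption borrowed from the common DIMACS convention.

```python
import random

def random_ksat(n_vars, n_clauses, k, rng=None):
    """Return a random K-SAT formula as a list of clauses.

    Each clause is a tuple of k nonzero integers: +i stands for variable i,
    -i for its negation (variables are numbered 1..n_vars).
    """
    rng = rng or random.Random()
    formula = []
    for _ in range(n_clauses):
        # k distinct variables per clause; the same clause may be drawn twice
        variables = rng.sample(range(1, n_vars + 1), k)
        # a random sign pattern forbids one of the 2^k configurations
        formula.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return formula

def is_satisfied(clause, assignment):
    """assignment[i] is the Boolean value of variable i (index 0 is unused)."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def count_unsat(formula, assignment):
    """Number of clauses violated by the assignment (the 'energy')."""
    return sum(not is_satisfied(clause, assignment) for clause in formula)

# Example: a 3SAT instance with N = 100 and M = 420 clauses
formula = random_ksat(100, 420, 3, random.Random(0))
```

The instance is satisfiable exactly when some assignment gives `count_unsat(formula, assignment) == 0`.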
The 4.3 Point
KSAT is characterized by the number of clauses per variable, α = M/N
There is a phase transition from almost surely SAT to almost surely UNSAT
Several simple algorithms take a.s. linear time for α small enough
Algorithms take the longest time (on average) close to the phase boundary
[Figure: number of DP calls (curves for 20, 40 and 50 variables) and probability of satisfiability (50% sat marked) versus the ratio of clauses to variables, from 2 to 8]
Mitchell, Selman, and Levesque 1991
Mitchell, Selman, Levesque (AAAI-92)
Kirkpatrick, Selman, Science 264:1297 (1994)
A statistical physics prediction, now about a decade old, for 3SAT
and other constraint satisfaction problems: a clustering transition
[Figure: phase diagram along α = M/N. SAT with one state below α_d, SAT with many states between α_d and α_cr, UNSAT (many states, no solutions) above α_cr]
3SAT threshold values: α_d = 3.92, α_cr = 4.27
The Mezard, Palassini and Rivoire 2005 prediction for 3COL
Obtained by entropic cavity method, computing within a 1RSB
scenario the number of states with a given number of solutions
one green state
many green states, but most solutions
in one or a few big states
The latest clustering predictions for KSAT, K > 3 are in F Krzakała, A.
Montanari, F. Ricci-Tersenghi, G. Semerjian, L. Zdeborová.
”Gibbs states and the set of solutions of random constraint satisfaction
problems” PNAS 2007 Jun 19;104(25):10318-23.
single cluster
many small clusters
but most solutions in
a few of them
many clusters and
solutions are found
in a large set of all
about equal size
The cluster condensation transition in F Krzakała et al (2007)
many clusters and
solutions are found
in a large set of all
about equal size
most clusters disappear, and
again most solutions are found
in a small number of them
So does clustering in
fact pose a problem to
simple local search?
Are the known features of
the static landscape
relevant to dynamics?
A landscape that could be difficult for local search
[Figure, courtesy Sui Huang: energy landscape with several local minima and a global minimum]
Papadimitriou invented a stochastic local search algorithm for
SAT problems in 1991, today often referred to as RandomWalksat:
Pick an unsatisfied clause
Pick a variable in that clause, flip it, loop
Not quite like an equilibrium physics process in detailed balance,
because only variables in unsatisfied clauses are updated
Solves 3SAT in linear time on average up to α about 2.7
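The two-line loop above can be written out as a minimal Python sketch. The name `random_walksat`, the `max_flips` cutoff and the signed-integer clause encoding (+i for variable i, -i for its negation) are assumptions made for illustration. The list of unsatisfied clauses is recomputed at every step for clarity, so this version costs O(M) per flip and would not reproduce the linear-time behaviour quoted above.

```python
import random

def random_walksat(formula, n_vars, max_flips, rng=None):
    """Focused random walk: flip a random variable of a random unsatisfied clause.

    Returns (assignment, flips) on success, (None, max_flips) on giving up.
    assignment[i] is the value of variable i; index 0 is unused.
    """
    rng = rng or random.Random()
    assignment = [None] + [rng.random() < 0.5 for _ in range(n_vars)]

    def satisfied(clause):
        return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

    for flips in range(max_flips):
        unsat = [clause for clause in formula if not satisfied(clause)]
        if not unsat:
            return assignment, flips
        clause = rng.choice(unsat)        # pick an unsatisfied clause
        var = abs(rng.choice(clause))     # pick a variable in that clause
        assignment[var] = not assignment[var]
    return None, max_flips

# Toy instance whose only solution is x1 = x2 = True
toy = [(1, 2), (-1, 2), (1, -2)]
solution, flips = random_walksat(toy, 2, 10_000, random.Random(0))
```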
A benchmark algorithm is Cohen-Kautz-Selman walksat
www.cs.washington.edu/homes/kautz/walksat
Pick an unsatisfied clause
Compute for each variable in the clause the breakclause
breakclause is the number of other, presently satisfied,
clauses, that would be broken if the variable is flipped
If any variable has breakclause zero, flip it, loop
With probability p, flip variable with least breakclause, loop
Else, with probability 1-p, flip random variable in clause, loop
Solves 3SAT in linear time on average up to α about 4.15
Using default parameters from the public repository
(Aurell, Gordon, Kirkpatrick (2004))
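The variable-selection rule described on this slide can be sketched as follows. This is not the code from the walksat distribution: `break_count` and `walksat_choose` are hypothetical names, clauses are assumed to be tuples of signed integers over distinct variables, and the roles of p and 1-p follow the wording of the slide.

```python
import random

def break_count(formula, assignment, var):
    """Number of currently satisfied clauses that would be broken by flipping var."""
    broken = 0
    for clause in formula:
        true_lits = [lit for lit in clause if assignment[abs(lit)] == (lit > 0)]
        # a clause breaks when its only true literal belongs to var
        if len(true_lits) == 1 and abs(true_lits[0]) == var:
            broken += 1
    return broken

def walksat_choose(formula, assignment, clause, p, rng):
    """Choose which variable of an unsatisfied clause to flip."""
    variables = [abs(lit) for lit in clause]
    counts = {v: break_count(formula, assignment, v) for v in variables}
    free = [v for v in variables if counts[v] == 0]
    if free:
        return rng.choice(free)           # a flip that breaks nothing
    if rng.random() < p:
        least = min(counts.values())      # greedy move, as worded on the slide
        return rng.choice([v for v in variables if counts[v] == least])
    return rng.choice(variables)          # random move

# x1 = x2 = x3 = False: clause (1, 2) is unsatisfied, flipping x2 breaks nothing
formula = [(1, 2), (-1, 3), (2, 3)]
assignment = [None, False, False, False]
choice = walksat_choose(formula, assignment, (1, 2), 0.5, random.Random(0))
```

Practical implementations keep per-clause counters of true literals so that break counts are updated incrementally instead of being recomputed from the whole formula.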
We have worked with the Focused Metropolis
Search (FMS) algorithm, and ASAT, an alternative version
ASAT: if you have a solution, output and stop
Pick an unsatisfied clause
Pick randomly a variable in the clause
If flipping that variable decreases the energy, do so
If not, flip the variable with probability p
Loop
Also not in detailed balance (also tries only unsat clauses)
Parameter p has to be optimized. The optimal
value depends on the problem class, e.g. about 0.2 for 3SAT
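A minimal sketch of the ASAT loop, following the steps as worded on this slide (a move is accepted outright only if it strictly decreases the energy, otherwise with probability p). The function name `asat`, the `max_flips` cutoff and the signed-integer clause encoding are assumptions for illustration. The energy is recomputed from scratch at each step, which is simple but far slower than an incremental implementation.

```python
import random

def asat(formula, n_vars, p, max_flips, rng=None):
    """Return (assignment, flips) on success, (None, max_flips) on giving up."""
    rng = rng or random.Random()
    assignment = [None] + [rng.random() < 0.5 for _ in range(n_vars)]

    def unsat_clauses():
        return [c for c in formula
                if not any(assignment[abs(lit)] == (lit > 0) for lit in c)]

    for flips in range(max_flips):
        unsat = unsat_clauses()
        if not unsat:                     # if you have a solution, output and stop
            return assignment, flips
        clause = rng.choice(unsat)        # pick an unsatisfied clause
        var = abs(rng.choice(clause))     # pick randomly a variable in the clause
        assignment[var] = not assignment[var]          # tentative flip
        new_energy = len(unsat_clauses())
        if new_energy < len(unsat) or rng.random() < p:
            continue                      # accept: downhill, or with probability p
        assignment[var] = not assignment[var]          # reject: undo the flip
    return None, max_flips

# Toy instance whose only solution is x1 = x2 = True; p = 0.2 as quoted for 3SAT
toy = [(1, 2), (-1, 2), (1, -2)]
solution, flips = asat(toy, 2, 0.2, 10_000, random.Random(0))
```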
We have a new algorithm ChainSAT which by
design never goes up in energy
Algorithm 1. ChainSAT
    S = random assignment of values to the variables
    chaining = FALSE
    while S is not a solution do
        if not chaining then
            C = a clause not satisfied by S selected uniformly at random
            V = a variable in C selected uniformly at random
        end if
        ΔE = change in the number of unsatisfied clauses if V is flipped in S
        if ΔE = 0 then
            flip V in S
        else if ΔE < 0 then
            with probability p1
                flip V in S
            end with
        end if
        chaining = FALSE
        if ΔE > 0 then
            with probability 1 - p2
                C = a clause that is satisfied only by V selected uniformly at random
                X = a variable in C other than V selected uniformly at random
                V = X
                chaining = TRUE
            end with
        end if
    end while
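Algorithm 1 can be transcribed almost line by line into Python. The sketch below is an illustration only, not the authors' implementation: the `max_steps` cutoff, the signed-integer clause encoding and the helper names are assumptions, clauses are assumed to contain at least two distinct variables, and ΔE is recomputed from the whole formula at each step rather than incrementally.

```python
import random

def chainsat(formula, n_vars, p1, p2, max_steps, rng=None):
    """Return (S, steps) on success, (None, max_steps) on giving up."""
    rng = rng or random.Random()
    S = [None] + [rng.random() < 0.5 for _ in range(n_vars)]

    def true_vars(clause):
        # variables whose literal in this clause is currently true
        return [abs(lit) for lit in clause if S[abs(lit)] == (lit > 0)]

    def delta_e(var):
        # change in the number of unsatisfied clauses if var is flipped
        change = 0
        for clause in formula:
            if var in (abs(lit) for lit in clause):
                satisfying = true_vars(clause)
                if not satisfying:
                    change -= 1           # clause becomes satisfied
                elif satisfying == [var]:
                    change += 1           # clause loses its only true literal
        return change

    chaining = False
    V = None
    for step in range(max_steps):
        unsat = [c for c in formula if not true_vars(c)]
        if not unsat:
            return S, step
        if not chaining:
            V = abs(rng.choice(rng.choice(unsat)))
        dE = delta_e(V)
        if dE == 0:
            S[V] = not S[V]               # neutral moves are always taken
        elif dE < 0 and rng.random() < p1:
            S[V] = not S[V]               # descent is taken with probability p1
        chaining = False
        if dE > 0 and rng.random() < 1 - p2:
            # uphill move refused: chain to another variable of a clause
            # that is satisfied only by V
            critical = [c for c in formula if true_vars(c) == [V]]
            C = rng.choice(critical)
            V = rng.choice([abs(lit) for lit in C if abs(lit) != V])
            chaining = True
    return None, max_steps

# Toy instance; the values of p1 and p2 here are arbitrary placeholders
toy = [(1, 2), (-1, 2), (1, -2)]
solution, steps = chainsat(toy, 2, 0.5, 0.5, 10_000, random.Random(0))
```

Since no branch ever flips a variable with ΔE > 0, the number of unsatisfied clauses is non-increasing along a run, which is the property the slide title refers to.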
Solution course of a good
local search (ASAT at 4.2)
Runtimes for ASAT on 3SAT
at α=4.21
Ardelius and E.A. (2006)
Runtimes for ASAT on 3SAT
at α=4.25
Ardelius and E.A. (2006)
FMS on 4SAT at α=9.6
ChainSAT on 4SAT, 5SAT, 6SAT
Do we know how local
search fails on hard CSPs?
The first guess would be that
local search fails if solutions
have little slackness which is
expressed by Parisi whitening
Several proposed clustering
transitions do not stop
circumspect descent
Not even an algorithm
which would be trapped in
a potential well of any depth
The reason why local search
eventually fails is unknown
Clustering has been
rigorously proven for
KSAT with K equal to 8 and greater
For K less than 8 there are
cavity method predictions
How do the numerics compare
to these?
Solve a 3SAT instance L times with a stochastic local search (ASAT)
Compute the overlaps between these L solutions
See how that quantity changes with α
average overlap
variance of the overlap
Ardelius, E.A. and Krishnamurthy (2007)
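The overlap statistics listed above can be computed as in the sketch below. The slide does not define the overlap, so the spin convention used here, q = (1/N) Σ_i s_i s'_i with s_i = +1 for True and -1 for False, is an assumption; under it the fraction of variables on which two solutions agree is (1 + q)/2. The function names are hypothetical, and solutions are plain lists of Booleans of equal length.

```python
from itertools import combinations
from statistics import mean, pvariance

def overlap(a, b):
    """Spin overlap between two assignments of equal length."""
    return sum(1 if x == y else -1 for x, y in zip(a, b)) / len(a)

def overlap_stats(solutions):
    """Average and variance of the overlap over all pairs of solutions."""
    qs = [overlap(a, b) for a, b in combinations(solutions, 2)]
    return mean(qs), pvariance(qs)

# Three toy 'solutions' on N = 4 variables
solutions = [
    [True, True, False, False],
    [True, True, False, True],
    [False, False, True, True],
]
average_q, variance_q = overlap_stats(solutions)
```

In the experiment described on the slide, `solutions` would hold the L assignments returned by independent ASAT runs on one instance, and the two statistics would be tracked as α increases.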
The rank-ordered plots of the overlaps in a chain of instances
with increasing number of clauses display a transition around 4.25
α ranges from 3.5 to 4.3
N is 2000
for α = 4.3 repeat until
solvable instance found
for α <= 4.3 repeat until
ASAT finds many solutions
on the instance
Ardelius, E.A. and Krishnamurthy (2007)
Generate many chains of instances, check for the α at which all
solutions found have an overlap of at least 80%
N is 100, 200, 400, 1000, 2000
Number of chains at each N is 110
If a chain does not reach the 80%
threshold, repeat
Threshold is between 4.25 and 4.27,
could in fact coincide with
SAT/UNSAT for 3SAT
This is not in contradiction with the
theoretical predictions of Krzakala
et al (2007) who do not address
3SAT
Ardelius, E.A. and Krishnamurthy (2007)
FMS diffusion 4SAT different α
FMS diffusion 4SAT α=9.6
FMS diffusion 4SAT different N
As far as numerics can
tell, if there are clusters
beyond the clustering
transitions in 4SAT, they
are not separated by
overlap
How does local search
compare to more
sophisticated (and
specialized) methods
that we will hear about
at this school?
(here I have to go to PDF)
A question to the experts:
Which is (or are) the good
metrics to compare runtimes?
Wall-clock time? Some
intrinsic count?
Conclusions
Local heuristics (walksat, Focused Metropolis Search,
Focused Record-to-Record Travel, ASAT, ChainSAT) are
effective on hard random 3SAT, 4SAT… problems
This is true even if the heuristic by design can never get out
of a potential well, of any depth (ChainSAT). Traps in the
landscape do not stop these algorithms.
There seems to be a “clustering condensation” transition in 3SAT
very close to SAT/UNSAT transition.
If there is a clustering transition in 4SAT, these clusters do not
seem to be separated in overlap (in contrast to K equal to 8 and greater)
Thanks to
John Ardelius
Supriya Krishnamurthy
KTH/CSC
Mikko Alava
Petteri Kaski
Pekka Orponen
Sakari Seitz
Is the search trapped in “potential wells” of metastable states?
[Figure panels: energy as function of time; distance to target]
N is 1000, α is 4.2
ASAT linear regime, solution in 1000 sweeps
Is the search trapped in “potential wells” of metastable states?
[Figure panels: energy as function of time; distance to target]
N is 1000, α is 4.3
ASAT nonlinear regime, no barrier seen
Is the search trapped in “potential wells” of metastable states?
[Figure panels: energy as function of time; distance to target]
N is 1000, α is 4.1
ASAT linear regime, solution in 20 sweeps