Coevolving Solutions to the Shortest Common Superstring Problem

Assaf Zaritsky & Moshe Sipper
Ben-Gurion University, Israel
www.cs.bgu.ac.il/~assafza
1
Outline

The “Shortest Common Superstring” problem.

DNA sequencing and the input domain.

Standard and cooperative coevolutionary genetic
algorithm (GA).

The Puzzle approach.

Conclusions and future work.

Messy Puzzle.
2
The Shortest Common Superstring
Problem (SCS)

Let S = {s1,…,sn} be a set of strings (blocks) over
some alphabet Σ. A superstring of S is a string x
such that each si in S is a substring of x.

Problem: Find shortest (common) superstring.

NP-hard (the decision version is NP-complete).

MAX-SNP hard.

Motivation: DNA sequencing, data compression.
3
SCS: Example




S = {ate, half, lethal, alpha, alfalfa}
A trivial superstring is “atehalflethalalphaalfalfa” of
length 25 (a simple concatenation of all blocks).
A shortest common superstring is “lethalphalfalfate”
of length 17.
Note that a “compressed” permutation of the blocks
is actually a superstring.
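The note about a “compressed” permutation can be checked with a minimal sketch. The helper names (overlap, compress, is_superstring) are illustrative and not from the slides; compress simply merges each block with the longest suffix/prefix overlap of the string built so far.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks):
    """Concatenate blocks in the given order, merging maximal overlaps."""
    result = ""
    for block in blocks:
        result += block[overlap(result, block):]
    return result


def is_superstring(x, blocks):
    """True if every block is a substring of x."""
    return all(b in x for b in blocks)


S = ["ate", "half", "lethal", "alpha", "alfalfa"]
trivial = "".join(S)
best = compress(["lethal", "alpha", "half", "alfalfa", "ate"])

print(len(trivial), trivial)  # 25 atehalflethalalphaalfalfa
print(len(best), best)        # 17 lethalphalfalfate
print(is_superstring(best, S))  # True
```

The order of the permutation decides how much overlap can be exploited, which is why the search space is a space of block orderings.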
4
Approximation Algorithms




Several linear approximations for SCS have been
proposed, most of which rely on greedy approaches.
GREEDY: the most widely used heuristic in DNA sequencing.
Conjecture [Blum 1994, Sweedyk 1999]: the superstring
produced by GREEDY is at most twice the length of the
optimal one.
We are not aware of any previous evolutionary
approach to the SCS problem.
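The slides do not spell GREEDY out. It is usually described as repeatedly merging the two strings with the largest overlap until one string remains. The sketch below follows that common description; the function names and the tie-breaking order are assumptions, not details taken from the talk.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks):
    """Repeatedly merge the pair of strings with the largest overlap."""
    # Drop duplicates and blocks already contained in another block.
    strings = sorted(
        s for s in set(blocks)
        if not any(s != t and s in t for t in blocks)
    )
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""


S = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(S)
print(len(result), result)
```

This version is quadratic in the number of strings per merge, which is adequate for inputs of the size used in the experiments (tens of blocks).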
5
DNA Sequencing
The most common application of the SCS problem.
7
DNA Sequencing (cont’d)



The problem: “read” a string of DNA.
Short DNA strands can be read in the laboratory.
To sequence a long DNA strand (the DNA sequence appears in many copies):
1. Cut the DNA into short fragments using restriction
enzymes.
2. Sequence each of the resulting fragments.
3. Order those fragments using an SCS algorithm.
8
The Input Domain
The input strings used in the experiments were
inspired by DNA sequencing:
9
Input Generation Setup: Parameters
Size of random string: 250 bits (~50 blocks) or 400 bits (~80 blocks)
Minimal block size: 20 bits
Maximal block size: 30 bits
Number of duplicates created from a random string: 5
NB: increasing the number of blocks results in exponential
growth of the problem’s complexity.
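One way to read these parameters is sketched below: draw a random bit string, make five copies, and cut each copy into consecutive blocks of 20 to 30 bits. The function name, the seed argument, and the handling of the last fragment of each copy (it may be shorter than the minimal block size) are assumptions for illustration; the slides do not specify them.

```python
import random


def generate_blocks(length=250, copies=5, min_size=20, max_size=30, seed=None):
    """Return (target string, list of blocks cut from several copies of it)."""
    rng = random.Random(seed)
    target = "".join(rng.choice("01") for _ in range(length))
    blocks = []
    for _ in range(copies):
        pos = 0
        while pos < length:
            size = rng.randint(min_size, max_size)
            # Assumption: the final fragment of a copy is kept even if short.
            blocks.append(target[pos:pos + size])
            pos += size
    return target, blocks


target, blocks = generate_blocks(seed=1)
print(len(target), len(blocks))
```

With an average block size of 25 bits, five copies of a 250-bit string give roughly 50 blocks, and five copies of a 400-bit string give roughly 80, which matches the figures quoted on the slide.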
10
Simple Genetic Algorithm
produce an initial population of individuals
evaluate fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine individuals
    mutate individuals
    evaluate fitness of modified individuals
    generate a new population
end while
12
Simple GA for the SCS Problem




Given a set of strings as input, generate initial
population of random candidate solutions.
The fitness of each individual depends on its
length and accuracy.
The GA uses selection, recombination, and
mutation to create the next generation, each
individual of which is then evaluated.
These steps are repeated a predefined number
of times or until the solution is deemed
satisfactory.
13
Simple GA for the SCS Problem
(cont’d)


Blocks of the input set are atomic components.
Representation: An individual’s genome is
represented as a sequence of blocks.
• An individual may have missing blocks or
contain duplicate copies of the same block.
• Permutation Representation: Good or Bad?
14
Simple GA for the SCS Problem
(cont’d)


Evaluation: fitness of an individual is the length of
its compressed genome + the total length of the
blocks that are not covered by the individual.
Genetic operators:
• Fitness-proportionate selection.
• Two-point recombination. Allows growth and
reduction in the genome’s length.
• Block-change mutation.
15
Simple GA for the SCS Problem
(example)





S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.
Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9.
Fitness(<s4,s2,s1,s4>) = |11100111| = 8.
Recombination:
• p1 = <s1,|s2,s3|,s4>
• p2 = <s4,|s1,s3,s2|>
• p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4>
• p4 = recombine2(p1,p2) = <s4,s2,s3>
Mutation: mutate(<s1,s2,s2>) = <s1,s4,s2>
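The numbers in this example can be reproduced with a short sketch. The function names and the explicit cut-point arguments are illustrative; in an actual run the cut points, the mutated position, and the replacement block would be chosen at random.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks):
    """Concatenate blocks in order, merging maximal overlaps."""
    result = ""
    for block in blocks:
        result += block[overlap(result, block):]
    return result


def fitness(individual, all_blocks):
    """Compressed genome length plus total length of uncovered blocks."""
    genome = compress(individual)
    penalty = sum(len(b) for b in all_blocks if b not in genome)
    return len(genome) + penalty


def two_point_recombine(p1, cuts1, p2, cuts2):
    """Swap the segments between the cut points; lengths may change."""
    (a1, b1), (a2, b2) = cuts1, cuts2
    child1 = p1[:a1] + p2[a2:b2] + p1[b1:]
    child2 = p2[:a2] + p1[a1:b1] + p2[b2:]
    return child1, child2


def block_change_mutation(individual, index, new_block):
    """Replace the block at index with another block from the input set."""
    return individual[:index] + [new_block] + individual[index + 1:]


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]

print(fitness([s2, s1], S))          # 9
print(fitness([s4, s2, s1, s4], S))  # 8

p3, p4 = two_point_recombine([s1, s2, s3, s4], (1, 3), [s4, s1, s3, s2], (1, 4))
print(p3 == [s1, s1, s3, s2, s4], p4 == [s4, s2, s3])  # True True
print(block_change_mutation([s1, s2, s2], 1, s4) == [s1, s4, s2])  # True
```

Note that lower fitness values are better here, since fitness is a length.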
16
Coevolution

Simultaneous evolution of two or more species
with coupled fitness.

Coevolving species either compete or
cooperate.

Competitive coevolution: Fitness of individual
based on direct competition with individuals of
other species, which in turn evolve separately in
their own populations (“prey-predator”).
17
Cooperative Coevolution
18
Cooperative Coevolution (cont’d)

Cooperative Coevolution involves a number of
independently evolving species.

Interaction between species occurs via fitness
function only.

The fitness of an individual depends on its ability
to collaborate with individuals from other species.
19
Cooperative Coevolution (cont’d)
Source: Potter & DeJong (1997)
20
Cooperative Coevolutionary
Algorithm for the SCS Problem

Two species evolve simultaneously.

First species contains prefixes of candidate
solutions to the SCS problem at hand.

Second species contains candidate suffixes.

Fitness of an individual in each species
depends on how well it interacts with
representatives from the other species to
construct a global solution.
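The evaluation of a prefix or suffix can be sketched as follows. This is only an illustration under stated assumptions: a single representative of the other species is passed in explicitly (how representatives are chosen is not stated on the slides), and the merged genome is scored with the simple-GA fitness.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks):
    """Concatenate blocks in order, merging maximal overlaps."""
    result = ""
    for block in blocks:
        result += block[overlap(result, block):]
    return result


def fitness(individual, all_blocks):
    """Compressed genome length plus total length of uncovered blocks."""
    genome = compress(individual)
    return len(genome) + sum(len(b) for b in all_blocks if b not in genome)


def evaluate_species(population, representative, all_blocks, is_prefix):
    """Score each individual by merging it with the other species' representative."""
    scores = []
    for individual in population:
        if is_prefix:
            merged = individual + representative
        else:
            merged = representative + individual
        scores.append(fitness(merged, all_blocks))
    return scores


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]
prefixes = [[s4, s2], [s2]]
suffix_representative = [s1, s4]  # hypothetical choice of representative

print(evaluate_species(prefixes, suffix_representative, S, is_prefix=True))  # [8, 7]
```

Each species therefore never sees a complete solution of its own; its individuals are only ever judged by what they contribute when merged.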
21
Cooperative Coevolutionary Algorithm
for the SCS Problem (evaluation process)
[Figure: merge step between the prefixes population and the suffixes population.]
22
Cooperative Coevolutionary Algorithm
for the SCS Problem (evaluation process)
[Figure: evaluation step between the prefixes population and the suffixes population.]
23
Experiments
Compare: GREEDY, Standard GA, Cooperative Coevolution
24
Experimental Setup
Population size: 500
Number of generations: 5000
Recombination rate: 0.8
Mutation rate: 0.03
Problem instances per experiment: 50
Each type of GA was executed twice on each problem instance;
the better run of the two was used for statistical purposes.
25
Results: Experiment I (~50 blocks)
26
Results: Experiment II (~80 blocks)
27
Results: Summary
Average of the best superstring lengths (distance from optimum in parentheses)
             GREEDY       Genetic      Cooperative
50 blocks    381 (131)    280 (30)     275 (25)
80 blocks    596 (196)    685 (285)    547 (147)
28
Conclusion:
The collaboration between the two
populations results in a good
decomposition of the problem into
two smaller sub-problems, each of which
is solved using a standard GA.
29
The Puzzle Algorithm
31
The Schema Theorem
“Short, low-order, above-average
schemata receive exponentially
increasing trials in subsequent
generations of a genetic algorithm.”
Holland (1975)
32
Building Blocks Hypothesis
“A genetic algorithm seeks near-optimal
performance through the juxtaposition
of short, low-order, high-performance
schemata, called the building blocks.”
33
Our Interpretation
“The success of GAs stems from
their ability to combine quality
sub-solutions (building blocks)
from separate individuals in order
to form better global solutions.”
34
The Main Assumption
Problems in nature have an
inherent structural design. Even
when the structure is not known
explicitly, GAs detect it
implicitly and gradually
enhance good building blocks.
35
A Problem
Recombination may
destroy quality building
blocks found by the GA.
36
Example
Brain Appearance
0010101010101010101000011110100010000
37
Example (cont’d)
Brain Appearance
0010101010101010101000011110100010000
1. Smart (presumably)
2. Blond
But not very beautiful…
38
The Preservation of
Favoured Building
Blocks in the Struggle
for Fitness: The
Puzzle Algorithm
39
Puzzle Algorithm: The Idea

Improve Recombination Operator.

Preserve good building blocks discovered by
GA using selection of recombination loci
that do not destroy good building blocks.

Result: Assembly of good building blocks to
construct better solutions (as in a puzzle).
40
Puzzle Algorithm (cont’d)

Two populations:
1. Candidate solutions: As in simple GA.
2. Building blocks: Each individual is a sequence of blocks
contained in at least one candidate solution.
[Figure: the building blocks population alongside the candidate solutions population.]
41
Puzzle Algorithm (cont’d)


Interaction between candidate solutions and
building blocks is through fitness function.
Interaction between building blocks and
candidate solutions is through constraints on
recombination points.
[Figure: the building blocks population and the candidate solutions population, with the links labeled “Fitness evaluation” and “Crossover location”.]
42
Puzzle Algorithm: Zoom In
[Figure: zoom in on the candidate solutions population; each individual is a sequence of blocks.]
43
Puzzle Algorithm: Zoom In
[Figure: zoom in on the building blocks population; each building block is contained in at least one individual in the solutions population. Building blocks may overlap.]
44
The Candidate Solutions Population



Representation, fitness evaluation, selection, and
mutation are identical to the simple GA.
Recombination-aid vector aids in selecting the
recombination loci.
Recombination-aid vector is updated by building
blocks individuals.
45
The Building Blocks Population



An individual is represented as a sequence of
blocks, contained in at least one candidate solution.
Fitness of an individual is the average of the fitness
of candidate solutions containing it.
Fitness-proportionate selection.
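The fitness rule for building blocks can be sketched in a few lines. The sketch assumes that “containing” means the building block occurs as a contiguous run of blocks in the candidate solution, and it returns None for a building block that no solution contains; both choices are assumptions, as are the function names.

```python
def contains(solution, building_block):
    """True if building_block occurs as a contiguous run inside solution."""
    m = len(building_block)
    return any(
        solution[i:i + m] == building_block
        for i in range(len(solution) - m + 1)
    )


def building_block_fitness(building_block, solutions, solution_fitness):
    """Average fitness of the candidate solutions that contain the block."""
    scores = [
        f for sol, f in zip(solutions, solution_fitness)
        if contains(sol, building_block)
    ]
    return sum(scores) / len(scores) if scores else None


# Hypothetical data: blocks are named by letters, fitness values are made up.
solutions = [["a", "b", "c"], ["b", "c", "d"], ["c", "a"]]
solution_fitness = [0.25, 0.75, 0.5]

print(building_block_fitness(["b", "c"], solutions, solution_fitness))  # 0.5
```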
46
The Building Blocks Population (cont’d)


“Unisex” individuals.
Two modification operators:
• Expansion: Increase its genome by one block.
Occurs with high probability.
• Exploration: “Die”, and start over as a new 2-block
individual. Occurs with low probability.
47
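The Expansion and Exploration operators of the building blocks population can be sketched as below. Several details are assumptions made for illustration: the added block is taken from a neighboring position in a candidate solution that already contains the building block (so the block stays contained in at least one solution), a new 2-block individual is cut from a randomly chosen candidate solution, and the two operators are applied as mutually exclusive events. The default rates 0.8 and 0.1 are the expansion and exploration rates listed in the experimental setup.

```python
import random


def expansion(bb, solutions, rng):
    """Grow bb by one adjacent block taken from a solution that contains it."""
    m = len(bb)
    options = []
    for sol in solutions:
        for i in range(len(sol) - m + 1):
            if sol[i:i + m] == bb:
                if i > 0:
                    options.append(sol[i - 1:i + m])   # extend to the left
                if i + m < len(sol):
                    options.append(sol[i:i + m + 1])   # extend to the right
    return rng.choice(options) if options else bb


def exploration(solutions, rng):
    """Start over as a new 2-block individual cut from a random solution."""
    host = rng.choice([s for s in solutions if len(s) >= 2])
    i = rng.randrange(len(host) - 1)
    return host[i:i + 2]


def modify(bb, solutions, rng, p_expand=0.8, p_explore=0.1):
    """Apply expansion, exploration, or neither (assumed mutually exclusive)."""
    r = rng.random()
    if r < p_expand:
        return expansion(bb, solutions, rng)
    if r < p_expand + p_explore:
        return exploration(solutions, rng)
    return bb


rng = random.Random(0)
solutions = [["a", "b", "c", "d"], ["b", "c", "e"]]  # hypothetical block names
grown = expansion(["b", "c"], solutions, rng)
fresh = exploration(solutions, rng)
print(grown, fresh)
```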
Building Blocks – Candidate Solutions
[Figure: fitness evaluation between the building blocks population and the candidate solutions population; fitness values labeled f1 to f4.]
48
Building Blocks – Candidate Solutions
[Figure: fitness evaluation between the building blocks population and the candidate solutions population (fitness values labeled f1 to f4), followed by the step “Update ‘recombination-aid’ vector”.]
49
Update Recombination-aid vector
Recombination-aid vector
0 0 0 0 0 0 0
Solution’s genome
building block #1
fitness = 0.3
building block #2
fitness = 0.4
building block #3
fitness = 0.6
50
Update Recombination-aid vector
Recombination-aid vector
0 0.3 0.3 0 0.4 0.6 0
Solution’s genome
building block #1
fitness = 0.3
building block #2
fitness = 0.4
building block #3
fitness = 0.6
51
Update Recombination-aid vector
Recombination-aid vector
0.3 0.3 0.3 0 0.4 0.6 0.6
Solution’s genome
building block #1
fitness = 0.3
building block #2
fitness = 0.4
building block #3
fitness = 0.6
52
Recombination-loci selection
Recombination-aid vector
0.3 0.3 0.3 0 0.4 0.6 0.6
Solution’s genome
* Ties are broken arbitrarily
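The update and selection steps shown on these slides can be sketched as follows. The details are assumptions made for illustration: the vector is taken to hold one entry per junction between adjacent blocks of the solution, each entry records the highest fitness among the building blocks spanning that junction, and recombination picks a junction with the lowest entry (ties broken at random). The genome layout below is hypothetical and was chosen so that the result matches the final vector on the slide.

```python
import random


def update_aid_vector(solution, building_blocks):
    """building_blocks: list of (block sequence, fitness) pairs."""
    aid = [0.0] * (len(solution) - 1)
    for bb, fit in building_blocks:
        m = len(bb)
        for start in range(len(solution) - m + 1):
            if solution[start:start + m] == bb:
                # Mark every junction that lies inside this occurrence.
                for j in range(start, start + m - 1):
                    aid[j] = max(aid[j], fit)
    return aid


def choose_locus(aid, rng):
    """Pick a junction with minimal value; ties are broken arbitrarily."""
    lowest = min(aid)
    return rng.choice([j for j, v in enumerate(aid) if v == lowest])


solution = list("ABCDEFGH")  # hypothetical genome of eight blocks
building_blocks = [
    (list("ABCD"), 0.3),  # building block #1
    (list("EF"), 0.4),    # building block #2
    (list("FGH"), 0.6),   # building block #3
]

aid = update_aid_vector(solution, building_blocks)
print(aid)  # [0.3, 0.3, 0.3, 0.0, 0.4, 0.6, 0.6]
print(choose_locus(aid, random.Random(0)))  # 3
```

Under these assumptions the crossover falls between blocks D and E, the only junction not protected by any building block.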
53
Experiments
Compare: GREEDY, Standard GA, Puzzle
54
Building Blocks - Experimental Setup
Population size: 1000
Expansion rate: 0.8
Exploration rate: 0.1
55
Results: Experiment III (~50 blocks)
Cooperative
56
Results: Experiment IV (~80 blocks)
Did we lose to cooperative?
NO!
Cooperative
57
Results: Summary
Average of the best superstring lengths (distance from optimum in parentheses)
             GREEDY       Genetic      Puzzle
50 blocks    381 (131)    280 (30)     253 (3)
80 blocks    596 (196)    685 (285)    571 (171)
58
Relations Between The Algorithms
Co-Puzzle
Cooperative
Puzzle
GA
59
The Co-Puzzle Algorithm
[Figure: a candidate prefixes population and a candidate suffixes population, linked by fitness evaluation; each is paired with its own population of possible building blocks through “Fitness eval” and “Crossover location” links.]
60
Experiments
Compare: GREEDY, Cooperative Coevolution, Co-Puzzle
61
Results: Experiment V (~80 blocks)
62
Results: Experiment VI (~50 blocks)
????
Puzzle
63
Results: Summary
Size of shortest common superstring (distance from optimum in parentheses)
             GREEDY       Cooperative   Co-puzzle
50 blocks    381 (131)    275 (25)      268 (18)
80 blocks    596 (196)    547 (147)     482 (82)
42% improvement over cooperative (80 blocks).
64
Results: Summary
Size of shortest common superstring (distance from optimum in parentheses)
              GREEDY       Cooperative   Puzzle       Co-puzzle    Slide note
50 blocks     381 (131)    275 (25)      253 (3)      268 (18)     83% better
80 blocks     596 (196)    547 (147)     571 (171)    482 (82)     42% better
90 blocks     677 (227)    673 (223)     683 (233)    617 (167)    25% better
100 blocks    768 (268)    768 (268)     813 (313)    732 (232)    13% better
20 problem instances per experiment (noted on the slide with the 90- and 100-block results).
66
Larger Problems - Using More Species
Size of shortest common superstring (distance from optimum in parentheses)
              GREEDY       Co-puzzle    3-Co-puzzle
110 blocks    836 (286)    867 (317)    ? (?)
120 blocks    906 (306)    992 (392)    906 (306)
67
Conclusions

Cooperative coevolution might prove
deleterious when too many species are
used (when close to optimum?).

When a suitable number of species is
used, cooperative coevolution improves
performance by decomposing the
problem into several easier subproblems.
68
Conclusions (cont’d)

Evolving a population of building blocks
to aid in the selection of recombination
loci drastically improves the
performance of a standard GA.

Cooperation between cooperative
coevolution and Puzzle ultimately
improves global performance.
69
Future Work

Test the (Co-) Puzzle approach on other
problem domains.

A hybrid GA.
• Tackle larger problems.
• Comparison to greedy-stochastically based
local-search algorithms.
70
The Messy Puzzle Algorithm
72
Static Detection of Building
Blocks for addressing the
Linkage Problem
Hillel Maoz
Ben-Gurion University, Israel
73
The Linkage Problem



A binary genome of size n = 14.
Genes a and b together encode important information.
Random crossover is applied.
Survival probability = the chance that a and b appear together in the offspring:
• Left genome: 4/15
• Right genome: 14/15
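The two fractions can be reproduced with a small calculation. The sketch assumes one-point crossover with n + 1 = 15 equally likely cut points (both ends of the genome included), which is what the denominator 15 suggests. The gene positions come from a figure that is not available here, so the positions used below are hypothetical: genes 11 positions apart in the left genome and adjacent in the right genome.

```python
from fractions import Fraction


def survival_probability(n, pos_a, pos_b):
    """Chance that one-point crossover does not cut between genes a and b."""
    cut_points = n + 1               # assumption: cuts 0..n, ends included
    separating = abs(pos_a - pos_b)  # cuts that fall between the two genes
    return Fraction(cut_points - separating, cut_points)


print(survival_probability(14, 1, 12))  # 4/15
print(survival_probability(14, 6, 7))   # 14/15
```

The farther apart two related genes sit in the genome, the more likely crossover is to separate them, which is the linkage problem in miniature.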
74
The Linkage Problem (cont’d)
In many cases it is hard
to know the optimal
representation
75
The MaxCut Problem
• Input: undirected weighted graph G = (V, E, W).
• Output: a partition of V into two disjoint sets (S, V\S).
• Goal: maximal sum of edge weights between the sets.
• NP-hard (the decision version is NP-complete).
76
MaxCut - Example
Cut = 34
Cut = 47
77
Simple GA for MaxCut

Population of candidate solutions
• Give each node a number.
• Assign ‘0’ or ‘1’ to indicate which set the node belongs to.
Iteration step
• Select any two parents.
• Recombine and create an offspring.
• Repeat until a new population is generated.
Fitness: the weight of the cut.
78
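The representation and fitness used by the simple GA for MaxCut can be sketched as below. The small graph and the function name are made up for illustration; a candidate solution is a list of 0/1 values indexed by node number.

```python
def cut_weight(assignment, edges):
    """Total weight of edges whose endpoints lie in different sets.

    assignment[v] is 0 or 1; edges is a list of (u, v, weight) triples.
    """
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


# Hypothetical weighted graph on four nodes.
edges = [(0, 1, 3), (1, 2, 5), (2, 3, 4), (3, 0, 2), (0, 2, 1)]

print(cut_weight([0, 1, 0, 1], edges))  # 14
print(cut_weight([0, 0, 1, 1], edges))  # 8
```

Because the genome is indexed by node number, the way nodes are numbered determines which genes are neighbors in the genome.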
The Representation Problem
“How to define the order of the
vertices within the genome ?”
79
Messy Genes

The main difficulty: identifying the related vertices.

A messy gene is an ordered pair <allele-locus, allele-value>.

Possible solution:

Use some sort of messy genes to detect related
genes.

Use the Puzzle approach to keep them together.
80
The Messy Puzzle Algorithm
A building block’s genome
is represented as a
sequence of messy genes
81
Messy Puzzle Algorithm



Two population setup as in the puzzle algorithm.
Enhanced recombination operator.
Evolved building blocks structure (similar to
puzzle).
<0,0>
<2,0>
<1,1>
<5,0>
<6,1>
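A building block made of messy genes, such as the one listed above, can be represented as a list of (locus, value) pairs. The sketch below shows two illustrative helpers (their names and the example genome are assumptions): one checks whether a genome agrees with every messy gene of a building block, the other writes the building block into a copy of a genome.

```python
def matches(genome, building_block):
    """True if the genome agrees with every <locus, value> messy gene."""
    return all(genome[locus] == value for locus, value in building_block)


def apply_block(genome, building_block):
    """Return a copy of genome with the building block's values written in."""
    child = list(genome)
    for locus, value in building_block:
        child[locus] = value
    return child


# The building block from the slide: <0,0> <2,0> <1,1> <5,0> <6,1>.
building_block = [(0, 0), (2, 0), (1, 1), (5, 0), (6, 1)]
genome = [1, 1, 1, 1, 1, 1, 1, 1]  # hypothetical 8-gene genome

child = apply_block(genome, building_block)
print(child)                           # [0, 1, 0, 1, 1, 0, 1, 1]
print(matches(genome, building_block))  # False
print(matches(child, building_block))   # True
```

Because each gene carries its own locus, the genes of a building block need not be adjacent in the genome, which is what lets related vertices be kept together regardless of numbering.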
82
Enhanced Recombination
[Figure: genomes over loci 1 to 8 and three building blocks labeled 0.8, 0.7, and 0.6.]
I) Add the 1st BB: success.
II) Add the 2nd BB: failure.
III) Add the 3rd BB: success.
IV) Simple crossover.
83
Static Detection of Building Blocks




Building blocks do not truly evolve.
No Expansion and Exploration operators.
Building blocks’ fitness is based on a number of
generations.
Purpose: to check and understand the core of the
messy puzzle algorithm.
84
Results
• Randomly generated graphs.
• 1000 generations.
• 10 separate experiments per problem instance.
[Figures: “Max Cut Size: Puzzle vs. GA”; “Max Cut Size: Bi-partite graphs” (legend: distance to optimum, Puzzle addition); “Avg Cut Size: different number of BB” for graph_300_0.1_1 and graph_300_0.5_1. Axis labels: graph number, number of BB, cut size difference, distance from GA.]
Graphs: 1 = graph_200_0.01_1, 2 = graph_200_0.05_1, 3 = graph_200_0.1_1, 4 = graph_200_0.3_1, 5 = graph_200_0.5_1, 6 = graph_300_0.01_1, 7 = graph_300_0.05_1, 8 = graph_300_0.1_1, 9 = graph_300_0.3_1, 10 = graph_300_0.5_1.
85
Conclusions and Future Work





Do messy work to solve the linkage problem.
Even a small population of building blocks
improves the GA performance.
Messy puzzle is better when inner structures
exist.
Applying evolution to the building blocks
population.
Comparing to different representation-search
techniques.
86
