Multi-Niche Crowding in Genetic Algorithms and its Application to

Multi-Niche Crowding in Genetic Algorithms and its Application to the
Assembly of DNA Restriction-Fragments
Walter Cedeño, V. Rao Vemuri, and Tom Slezak
Department of Applied Science
University of California, Davis and
Lawrence Livermore National Laboratory
Livermore, CA 94550
([email protected], [email protected] and [email protected])
Abstract
The determination of the sequence of all nucleotide base-pairs in a DNA molecule,
from restriction-fragment data, is a complex task and can be posed as the problem of
finding the optima of a multi-modal function. A genetic algorithm that uses multi-niche
crowding permits us to do this. Performance of this algorithm is first tested using a
standard suite of test functions. The algorithm is next tested using two data sets
obtained from the Human Genome Project at the Lawrence Livermore National
Laboratory. The new method holds promise in automating the sequencing
computations.
Key words: Genetic algorithms, multi-modal functions, DNA restriction fragment
assembly, human genome project;
1 Introduction
This paper describes how the concept of multi-niche crowding in a genetic algortihm
permits one to simultaneously find several extrema of a multi-modal function.
Although a complete theoretical analysis supported by performance data on the
method is still being worked out, initial results from experiments with a standard suite
of test functions are persuasive about the utility of this method to solve problems of
practical interest. The method is tested by applying it to solve a problem in molecular
biology, believed to be NP-hard.
Section 2 of this paper describes multi-modal function optimization using the new
GA model. The validity of the new method is demonstrated in Section 3 by applying it
on several multi-modal test functions. Section 4.1 gives a tutorial background on the the
Human Genome Project and the relevance of the restriction-fragment assembly
problem. Section 4.2 summarizes the experimental procedure used by one group of
biologists to gather the data used in this paper. Section 4.3 summarizes some of the
related work on restriction fragment map assembly. Section 5 addresses the
representational issues including a description of the genetic operators and the fitness
function used to solve this problem. Section 6 includes results and discussion and
Section 7 summarizes on-going work.
Because genetic algorithms are being used in this paper to solve a problem in
genetics, a word of caution about the terminology used is in order. Words like
"chromosomes" occur while describing the genetic algorithm as well as during the
description of the DNA fragment assembly process. To the biologist, a chromosome is
DNA. To the computational scientist, a chromosome is a data structure. The context will
make the intent abundantly clear.
2
2. Genetic Algorithms for Multi-modal Search
Searching for extrema in a multi-modal search space is different from locating the
extremum of a unimodal function. When a search technique proven to be useful for
unimodal functions is applied to multi-modal functions, the method tends to converge
to an optimum in the local neighborhood of the first guess. One can use ideas like
simulated annealing to escape from a local optimum. However, there are many
applications where the location of "k best extrema" of a multi-modal function are of
interest. Searching for these locations goes by the name "multi-modal optimization."
There are many versions of genetic algorithms, one differing from another in some
detail. For the purposes of this paper, it is sufficient to focus on two basic steps common
to all genetic algorithms. During the selection step, a decision is made as to who is
allowed to produce offspring, and during the replacement step another decision is made
as to which of the members from generation i are forced to perish (or vacate a slot) in
order to make room for an offspring.
The Simple GA (or, SGA) (Holland 1975), which will be used as a point of departure
for presenting our method, starts by randomly generating a population of N
individuals, that is individual solutions. These individuals are evaluated for their
fitness. Individuals with higher fitness scores are selected, with replacement, to create a
mating pool of size N. This method of selection is called fitness proportionate reproduction
(FPR). The genetic operators of crossover and mutation are applied at this stage in a
probabilistic manner which results in some individuals from the mating pool to
reproduce. The assumption here is that each pair of parents produce only one pair of
offspring through the crossover operation. That is, the second generation is comprised
of those first generation individuals who never got a chance to reproduce and the
offspring of those who got a chance to reproduce. The procedure continues until a
suitable termination condition is satisfied, say a specified number of generations. (For
the purposes of this paper, a generation is every N /2 mating operations, where N is the
3
population size.) All other versions of GAs differ from this basic method in some detail
or another.
The steady state GA (or, SSGA) (Whitley 1988; Syswerda 1989), differs from SGA
mainly in the replacement step, and to a lesser extent on the way the genetic operators
are applied. The SSGA selects two individuals using FPR and allows them to mate to
produce two offspring. This selection step is identical to the corresponding step of SGA.
However, in SSGA, the offspring are inserted into the population, thus replacing two
individuals, soon after they are generated. The SGA, on the other hand, generates N
offspring prior to replacing the entire population. In other words SGA uses simultaneous
replacement strategy whereas the SSGA uses the successive replacement strategy. Thus
SGA and SSGA are analogous, respectively, to the Jacobi and Gauss-Seidal methods of
solving systems of algebraic equations.
Both SGA and SSGA suffer from the possibility of premature convergence to a local
minimum, primarily due to the selection pressure exerted by the FPR rule. Simply
assigning an exponential number of mating opportunities to those members of the
population that exhibit above average survival traits, as the FPR rule does, is not a good
strategy for a thorough exploration of complex search spaces with multiple peaks. Due
to this reason as well as the deceptiveness exhibited by multi-modal search spaces
(Goldberg et al., 1992), SGA and SSGA are not suitable for multi-modal search and
optimization.
2.1 Multi-modal Search: Crowding-based Methods
The GA model described in this paper, called multi-niche crowding (or, MNC), has the
ability to converge simultaneously to multiple solutions by encouraging competition
between individuals within the same locally optimal group (Cedeño and Vemuri, 1992).
In MNC, both the selection and replacement steps of the SGA are modified with the
introduction of some form of crowding in order to render it suitable for searching spaces
4
characterized by multiple peaks or niches. For completeness, the concept of crowding is
briefly reviewed here.
Crowding (De Jong, 1975) is a generalization of preselection (Cavicchio, 1970). In
crowding, selection and reproduction are the same as in the SGA; but replacement is
different. For concreteness, it is assumed that two parents are selected to produce two
offspring. In order to make room for these offspring, it is necessary to identify two
members of the population for replacement. The policy of replacing a member of the
present generation by an offspring is carried out as follows, in two steps. First, a group
of C individuals is selected at random from the population. C, called the crowding factor,
also indicates the size of the group. A value of C = 2 or 3 appears to work well for De
Jong. Second, the bit strings in the offspring chromosomes are compared with those of
the C individuals in the group using Hamming distance as a measure of similarity. The
group member that is most similar to the offspring is now replaced by the offspring.
This procedure is repeated for the other offspring as well. This second offspring can
conceivably replace its own sibling that just entered the population pool, although such
a scenario is rather unlikely. In any event, crowding is essentially a successive
replacement strategy. This strategy maintains the diversity in the population and
postpones premature convergence. However crowding cannot maintain stable
subpopulations too long due to the selection pressure imparted by FPR. Summarizing,
in crowding offspring replace similar individuals from the population. Crowding slows
down premature convergence of the traditional GA and in most cases can find the
global optimum in a multi-modal search space. On the other hand it does not facilitate
convergence to multiple solutions and after many generations one of the peaks takes
over.
In deterministic crowding (Mahfound, 1992) selection pressure is eliminated and
preselection is introduced to obtain a very fast GA suitable for multi-modal function
5
optimization. In this method selection pressure is eliminated by allowing individuals to
mate at random with any other individual in the population. Pressure is applied,
however, during replacement step using preselection. Toward this goal, each of the two
offspring is first paired with one of the parents; this pairing is not done randomly,
rather the pairings are done in such a manner the offspring is paired with the most
similar parent. Then each offspring is compared with its paired parent and the
individual with the higher fitness is allowed to stay in the population and the other is
eliminated. It is not clear if multiple solutions can be maintained for many generations
using this method, although it appears that multiple solutions are sustained for more
generations than when crowding is used alone.
2.2 Multi-modal Search: Multi-niche Crowding
In multi-niche crowding (MNC), both the selection and replacement steps are
modified with some type of crowding. The idea is to eliminate the selection pressure
caused by FPR while allowing the population to maintain some diversity. This objective
is achieved, in part, by encouraging mating and replacement within members of the
same niche while allowing for some competition for population slots among the niches.
The result is an algorithm that (a) maintains stable subpopulations within different
niches, (b) maintains diversity throughout the search, and (c) converges to different
local optima.
In MNC, the FPR selection is replaced by what we call crowding selection. In
crowding selection, each individual in the population has the same chance for mating in
every generation. Application of this selection rule takes place in two steps. First, an
individual A is selected for mating. This selection can be either sequential or random.
Second, its mate M is selected, not from the entire population, but from a group of
individuals of size Cs, picked at random (with replacement) from the population. The
mate M thus chosen must be the one who is the most "similar" to A. The similarity
6
metric used here is not a genotypic metric such as the Hamming distance, but a suitably
defined phenotypic distance metric. Crowding selection promotes mating between
individuals from the same niche while allowing matings between individuals from
different niches.
During the replacement step, MNC uses a replacement policy called worst among the
most similar. The goal of this step is to pick an individual from the population for
replacement by an offspring. Implementation of this policy follows these steps. First, Cf
groups are created by randomly picking s individuals (with replacement) per group
from the population. These groups are called crowding factor groups. Second, one
individual from each group that is most phenotypically similar to the offspring is
identified. This gives Cf individuals that are candidates for replacement by virtue of
their similarity to the offspring that will replace them. From this group of most similar
individuals, we pick the one with the lowest fitness to die and that slot is filled with the
offspring. The offspring could possibly have a lower fitness than the individual being
replaced.
Both selection and replacement steps in the MNC are primarily based on a similarity
metric. Fitness is also considered during replacement to promote competition between
members of the same niche. Competition between members of different niches occurs as
well. The following pseudo-code summarizes the salient features of the MNC:
1.
2.
3.
4.
5.
6.
Generate initial population of N individuals
For gen = 1 to MAX_GEN
For i = 1 to N
Use crowding selection to find mate for parent i
Mate and mutate
Insert offspring in the population using worst among most similar replacement.
If FPR is used in Step 4 above and if the individual with the lowest fitness in the
population (Step 6) is replaced with the newly generated offspring, this model
corresponds to a steady-state GA. In contrast with the most common generational GA,
7
offspring are available for mating as soon as they are generated, and good individuals
can survive for many generations. A similar technique called enhanced crowding
(Goldberg, 1989) has been used before in classifier systems, but there the most similar
individual out of a group of worst candidates is replaced. A pictorial interpretation of
the MNC replacement policy is shown in Figure 1.
Population
..
.
Group 1 with
s individuals
Randomly form
crowding factor
(Cf ) groups
..
.
Offspring
Individual 1
Most similar
to offspring
..
.
Replaces
Lowest fitted
individual
Individual Cf
Group Cf with
s individuals
Figure 1: Schematic showing worst among most similar replacement policy.
In summary, MNC differs from the other crowding based methods in the following
salient features:
(a) Both selection and replacement use a form of crowding.
(b) Mating among members of the same niche is encouraged.
(c) Competition among members of the same niche is promoted.
These features appear to be responsible for the apparantly superior performance of this
method over other crowding based methods insofar as maintaining solutions in
multiple peaks. The next section shows some performance results when this method is
applied to some standard test functions.
8
3. Application of MNC to Multimodal Test Functions
To evaluate the performance of the MNC model described above, we used five
different test functions. The first three of these functions have been used earlier by other
investigators. Functions F (x) and F (x), defined by
1
2
F1 ( x ) = sin 6 ( 51
. π x + 0.5) and
F2 ( x ) = exp −4
(ln 2 )( x − 0.0667 )2 0 .64
sin 6 ( 51
. π x + 0.5)
are shown in Figure 2. These functions correspond to the five-optima sine functions
used by Goldberg and Richardson (1987) in their work on sharing. Both functions were
maximized using binary chromosome encoding for numbers in the interval [0, 1]. In
both cases the GA with sharing was able to maintain stable subpopulations and
diversity in the population. Convergence was good, but not all the peaks in F had the
1
individuals distributed close to the top. It is not clear that in F sharing will be able to
2
maintain proportional number of individuals for larger number of generations.
Figure 2: Test functions F (x) (left) and F (x) (right).
1
2
Function F (x,y), called “Shekel’s foxholes” with twenty five optima, was used by De
3
Jong in his work on crowding. The inverted version of F is shown on Figure 3. The
3
peaks are located on the grid intersections formed by the values x = [32, 16, 0, -16, -32]
and y = [32, 16, 0, -16, -32]. All 25 peaks have the same base width. The height of peak i
is given by the expression 0.002 + 1/i, where the peaks are located at (32, 32), (16, 32), ...,
9
(-32, -32). Using a value of C = 2 a GA using crowding consistently found the global
optima. Crowding allows the GA to escape local optima by maintaining diversity in the
population during the early stages.
Figure 3: Test Function F (x), Inverted Shekel's Foxholes.
3
We also considered other functions not exhibiting the symmetry present in the
above functions. Function F (x,y), shown on the left in Figure 4, contains two global
4
optima with the same height and width but located far apart. Function F (x,y), the
5
sample shown on the right in Figure 4, contains five optima with height, width, and
location chosen at random in every run. Both of these functions are defined by the
summation:
p
Ai
2
2
i =1 1 +Wi (( x − X i ) + ( y −Yi ) ) ,
∑
where p indicates the number of peaks in the function, (Xi, Yi) the coordinates of peak i,
Ai is the height of peak i, andWi determines how narrow or wide the base of peak i is.
Table 1 summarizes the parameters for functions F and F shown in Figure 4.
4
10
5
Table 1: Parameters for functions F and F .
4
Function
F4
F5
Peak Location
(45k, 2k)
(15k, 62k)
( 17.1, 34.4 )
( 8.7, 4.1 )
( 20.3, 11.0 )
( 38.3, 47.9 )
( 57.9, 54.1 )
Width
0.0004
0.0004
1.9
0.17
3.6
3.1
3.7
5
Height
100
100
50.5
c1.coord[
44.8
87.1
51.5
55.5
Figure 4: Test functions F4 (scaled version on left) and F5 (right).
The MNC method was applied on these test functions and the simulations were
done on a 486/33MHz PC. The variables x and y were encoded using two 32 bit
chromosomes. The initial population was generated at random. Interval crossover was
used for mating. In interval crossover only one offspring is generated. For each pair of
parent chromosomes, x1 and x2, (assume without loss of generality that the cardinal
value of binary string x1 is less than the cardinal value of binary string x2), the
offspring’s chromosome is selected at random from the interval [x1-δ, x2+δ]. The value
for δ is usually small. This allows the offspring to move outside the boundaries
delineated by their parents. For this test δ was set at the decimal value 1,000. We
obtained better results using interval crossover over single-point crossover. This is due
in part to the disruptive effect on niche preservation caused by single-point crossover.
11
For example, consider the individuals 0100000000000 and 0011111111111. If these
individuals are part of the same peak, a single-point crossover may conceivably
generate the offspring pair 0000000000000 and 0111111111111 which may not be in the
same peak as the parents. On the other hand interval crossover will generate an
offspring close to the parents and more likely to belong to the same niche. After mating
bit mutation is applied to the offspring.
The crossover probability (pc) was set at 0.95. Normally one tends to choose a lower
value for the crossover probability in order to allow individuals to survive from one
generation to the other. In contrast, in the MNC, pc is usually chosen in the interval [0.9,
1.0] for three reasons. First, the individuals with highest fitness will survive multiple
generations. Second, low pc values tend to increase the number of duplicate
chromosomes in the population. Third, offspring are likely to replace individuals with
similar phenotypic features and thus alleviate the need to carry other individuals. The
mutation probability (pm) was set at 0.001. Mutation may allow an individual to escape
from a niche to other areas not explored by the algorithm. Furthermore, it is useful to
note, in passing, that duplicates encountered after mating (mating pair do not undergo
crossover and mutation) or during replacement (offspring same as individual in
population) are not inserted in the population.
The phenotypical similarity between two individuals was determined by the
Euclidean distance between them. Like Deb and Goldberg (1989) we also had better
performance when the similarity was based on the phenotype rather than the genotype.
The MNC method was executed for 100 generations in each run. Other parameters are
summarized in Table 2.
12
Table 2: Function Specific Parameters used in the MNC method.
F1 & F2
Population size (N):
Number of chromosomes:
Crowding selection size (Cs ):
Crowding factor (Cf ):
Crowding factor group size (s):
F3
F4
F5
100
500
100
200
1
2
2
2
15
75
5
15
2
2
3
3
15
75
5
15
These parameters were chosen after a number of trial runs. Although the optimal
value of these parameters is unknown at this moment, a rule of thumb that worked fine
on these functions was the following. The population size was selected to be N ≥ 20p,
where p is the number of peaks. This value of N allowed small peaks to survive for
many generations and the need to use a large population is avoided.
For the crowding selection size, Cs, and the crowding size, s, the values where
chosen to be at least 2p. The idea is to get a group size that will likely contain an
individual from the same niche during selection or replacement. At the same time it
must be small enough to allow competition between different niches. A large group size
will restrict selection and replacement among members of the same niche only. For
these test function a value of 3p for Cs and s produced good results. The crowding
factor, Cf, was set at 2 or 3. Higher crowding factor values increased the chances of highfitness individuals to survive for more generations. On the other hand higher values
can cause niches from small peaks to dissapear from the population in less number of
generations.
These parameter values seem to balance competition between individuals in the
same niche and individuals from different niches. Consider for example a value of Cs =
1. This will reduce the selection step to basically random selection as is used in
13
deterministic crowding. Increasing the value of Cs will localize a mate closer to the
parent. For very large values, i.e. , Cs closer to N, the mate chosen will most likely be the
parent itself.
The same applies during replacement for the value of s. What is interesting is the
effect of the combined values of s and Cf during replacement. A value of s > 1 and Cf = 1
reduces the replacement step to crowding and fitness plays no role in the MNC method.
These settings will reduce the method to some form of random search since fitness is
not used during selection. Consider now s = 1 and Cf > 1 during replacement. In this
case similarity plays no role and only fitness is used during replacement. Higher peaks
will tend to take over the population with these settings. At a minimum both s and Cf
must be greater than 1 to allow multiple niches to survive. Increasing s will promote
more competition between members of the same niche. Increasing Cf will allow fitness
to play a bigger role during replacement. An analysis of the MNC method and the effect
of these parameters is ongoing and will be reported in another paper.
An interesting approach will be to allow the parameters Cs, s, and Cf to dynamically
change with the population. This will allow the algorithm to adapt itself to the
idiosyncracies of the search space. A related idea, based on implicit fitness sharing,
(Smith et al., 1993) effectively maintained diverse populations while modeling the
immune system.
3.1 Results on Multimodal Test Functions
For all test cases the MNC method was able to maintain stable subpopulations in
most of the peaks without converging prematurely to any one of them. Only very small
peaks in F3 and F5 did not have any significant number of individuals in the last
generation even though some were present during many earlier generations. All peaks
in functions F1, F2, and F4 were found. Even the local optima at x = 1 was found in F1 and
was kept until the last generation. Stable subpopulations were consistently maintained
14
in all peaks until the last generation. The distance between the peaks in F4 did not cause
any problem for the MNC method. Figure 5, shows for function F4, the average fitness
of the individuals (on the left) in each peak and the number of individuals (on the right)
in each peak. The results were averaged over 20 runs. An individual solution is
considered part of a peak if it is located in the region above 10% of the height of the
peak. The solid line represents individuals located in neither peak. As can be seen the
number of individuals stabilizes around the same time as the average fitness does. After
that the number of individuals in the peaks does not fluctuates greatly for the last 70+
generations.
100
90
80
70
60
50
40
30
20
10
0
Peak 1
Peak 2
Others
0
10
20
30 40 50 60 70
Generation Number
80
Peak 1
Number of Individuals
Average Fitness
100
90
80
70
60
50
40
30
20
10
0
Peak 2
Others
0
90 100
10
20
30 40 50 60 70
Generation Number
80
90 100
Figure 5: Average fitness (left) and number of individuals (right) for function F4.
In general, the number of individuals at the lower peaks decreased as the fitness of
the individuals in other peaks increased. During the initial generations all peaks had
about the same number of individuals since their average fitness was comparable. As
the average fitness increased in some peaks so did the number of individuals in those
peaks. In other words, the number of slots in the population occuppied by a peak
depends on its average fitness relative to that of other peaks. Of course there are other
factors affecting the peak count and that needs to be studied further. As the average
fitness approaches a peak’s optimum, the number of individuals in that peak stabilizes.
This can be observed in Figure 6 for peak 2 located near (0,0) and all other peaks in
function F5. The plot on the left has the average fitness of each peak for every
15
generation. The plot on the right has the number of individuals in each peak for every
generation. During the initial ten generations the wider peak (peak 2) had more
individuals in the population. As the fitness of the individuals in the other peaks
improved the number of individuals increased. After about 20 generations the number
individuals in the other peaks was greater than those in the wider peak. There are of
course other factors that contribute to this pattern and that needs to be investigated
90
70
80
60
Peak 1
Peak 3
Peak 2
Peak 4
50
Peak 5
Others
Number of Individuals
Average Fitness
further. It shows the ability of the MNC method to avoid premature convergence.
70
60
50
40
30
Peak 1
Peak 3
Peak 5
20
10
Peak 2
Peak 4
Other
0
40
30
20
10
0
0
10
20
30 40 50 60 70
Generation Num ber
80
90 100
0
10
20
30 40 50 60 70
Generation Num ber
80
90 100
Figure 6: Average fitness (left) and number of individuals (right) for function F5.
We also observed that the number of individuals in a peak is related to more than
just its average fitness. In function F3 where the optima are located on a 5x5 grid, peaks
along the same x and y axis as the global optima had more individuals than other peaks
with higher average fitness. Some of the extra individuals can be attributed to mutation
since a bit change in one of the chromosomes will cause an individual to move along the
x or y axis. We ran the same test with mutation set at 0.0 and no major changes were
observed. More analysis is needed to determine other factors affecting the number of
individuals in a peak.
The properties exhibited by the MNC method are encouraging. The population
converges to multiple optima and stable subpopulations evolve in different niches.
Three main factors appear to contribute to the success of the algorithm. First, using
16
crowding selection and worst among most similar replacement allowed niches to form
naturally and compete for population slots among them. Second, the mating operator
used preserved parents’ phenotype in the offspring and therefore allowed individuals
from a niche to generate offspring in the same niche. Third, phenotypic similarity
allowed individuals to form niches in the solution space. Although a more rigorous
analysis is necessary before the merits of this method can be substantiated, this method
is applied nevertheless to solve the problem of DNA restriction-fragment assembly, a
problem widely believed to be NP-hard.
4. Background on the DNA Restriction-Fragment Assembly Problem
The purpose of this section is to provide a simplified summary of the biological
background necessary to understand some of the central issues in the DNA restrictionfragment assembly problem as it arises in the Human Genome Project.
The genetic material contained in all the chromosomes of a cell is collectively called
the genome. Chromosomes are essentially DNA molecules, the ladder shaped, doublehelical structures. For the purpose of this paper, suffice it to say that the most important
parts of this double-helix are the "steps" of the ladder. These steps, called base-pairs,
denoted here by the letters A, C, G, T, can theoretically form sixteen pairs. However,
only AT, TA, CG and GC pairings are allowed. That is, if the left half of the base-pair is
known, the right half is uniquely determined and vice versa. It is estimated that the
human genome, contained in all the 23 pairs of chromosomes, is comprised of about
three
billion
base-pairs.
A
hypothetical
sequence
of
these
may
look
like
AATCTTCGGGCCT.... occupying three billion positions. Specific sub sequences of this
are called genes.
The monumental task of the Human Genome Project is to (a) associate with each
gene all the properties controlled by that gene, (b) associate each gene with one of the 23
chromosomes in the body, (c) identify the exact positioning of a gene on a chromosome
17
- known as the mapping problem or the genetic-linkage problem, and (d) decipher the
exact sequence of base pairs that constitute a given gene - known as the sequencing
problem. Although each human is uniquely characterized by his/her genome,
apparently one individual differs from another in only a small percentage (about 0.2%)
of this material. For that matter, the human genome differs from the simian genome by
only a few percentage points. Thus the Human Genome Project is only analyzing a sort
of composite genome: 23 chromosome pairs donated by a few European and U. S.
scientists. Thus, the focus is on understanding the common structure that runs through
the human species, although the actual DNA used in the experiments may come from a
specific individual.
Issues related to molecular biology, instrumentation and computations do play a
critical role in this effort. It is impractical even to attempt to summarize the scope of this
project here. A description of the technique used, at the Human Genome Center of the
Lawrence Livermore National Laboratory (LLNL), for example, can be found in
Genomics, 4, pp 129-136 (Carrano et al. 1989). To view on-line information about, say the
LLNL genome program, it is only necessary to point the WWW client at: http://wwwbio.llnl.gov/bbrp/genome.html. This Center is mapping human chromosome 19 which
is estimated to be approximately 60 million base-pairs long, a relatively small-sized
chromosome.
Current technology is forcing us to limit the sequencing task to small fragments of
DNA that are composed of approximately 0.5K base-pairs (Istvanick et al. 1993). In
order to divide the DNA into fragments up to this resolution level, techniques using
restriction enzymes are used. The restriction enzymes act on the DNA at specific
locations which are randomly distributed along the length of the chromosome.
Depending on the number of different restriction enzymes used in obtaining the
fragments, the data are called single-digest (one enzyme), double-digest (two enzymes),
18
or n-digest (n enzymes) data. Most mappings are done using single- and double-digest
data. Scientists use different restriction enzymes to obtain DNA fragments of the
appropriate size.
In practice, scientists try to map one chromosome at a time. Many techniques
generally require: (a) purifying chromosomal DNA, (b) cutting the DNA into pieces
called contigs using restriction enzymes, (c) inserting contigs into DNA cloning vectors,
(d) inserting cloning vectors into bacterial host cells for multiplication, (e) cutting clones
into fragments, (f) analyzing the fragments, and (g) reconstructing the original order of
the base-pairs on the chromosome.
4.1 Restriction Fragment Data for Chromosome 19
As the data used in this study are obtained from the Human Genome Center of
LLNL, it is useful to summarize the techniques used at LLNL to gather these data.
Chromosomes are sorted via a laser flow sorter; restriction enzymes are used to
shatter the sorted material into fragments. These fragments are inserted into a host
vector which can then be grown in large quantities. Due to the properties of the cloning
vector used at LLNL, our so-called cosmid clones average 40Kbp (forty thousand base
pairs) of DNA each. It is important to note that in each such cloning step, all original
order is lost.
The LLNL biologists cloned and selected 15,000 such cosmid clones for human
chromosome 19. A "cosmid fingerprint" was obtained for each clone by subjecting them
to a double-digest labeled with a fluorescent dye and measuring them on a modified
DNA sequencing gel apparatus. These crude fingerprints were compared between all
pairs (taking 2 days on a network of 40 workstations in parallel) and a set of 800
unordered islands or "contigs" were formed by an automated algorithm which also
determined a near-minimal spanning path for each island. (Note that 15,000 40Kbp
19
clones provides a nominal 10X depth of coverage for chromosome 19.) However, this
set of islands remains unordered by this technique.
A wide range of other techniques, not covered here, is used to merge, order, and
provide the distance between the contigs. This paper will focus on one technique used
to verify the accuracy of the contigs built via pair-wise overlap data, called EcoRI
restriction fragment mapping after the name of the restriction enzyme used. A subset of
the clones in a contig are chosen (i.e., the spanning path members plus other clones to
attempt to get at least a 2X coverage across the contig) and digested, with fragment
lengths in the range of about 0.5-20Kbp being generated, depending on the distribution
of EcoRI sites within each cosmid. These maps are of great utility in locating and
sequencing genes; since individual EcoRI fragments can be physically cut from gels and
prepared for sequencing, the cost of sequencing a gene can often be reduced by an
order of magnitude (assuming a gene can be localized to a ~4Kbp fragment, instead of
having to sequence an entire 40Kbp clone).
The order of the fragments within each clone is of course unknown. At LLNL, the
"EcoRI maps" are constructed rapidly by hand, since the existence of a reliable nearminimal spanning paths provides a tremendous hint towards obtaining a proper clone
ordering. This in effect reduces a 2-D optimzation problem to a 1-D one. Once clone
order is known, fragment ordering within clones is a much simpler problem (note that
fragment sizes can be measured to about 5%, and that fragments of the same size can
occur in multiple unrelated positions within a single map.)
The process of initially assembling the contigs and thereby deriving the minimal
spanning paths is itself quite slow and labor-intensive in terms of laboratory bench
work. LLNL researchers were interested in seeing how well a fully-automated system
could do in assembling EcoRI maps if a priori contig information were not available.
This is the problem discussed here: given a number of clones, in unknown order, each
20
with a set of restriction fragments, also in unknown order, construct a maximumlikelihood map of that region, subject to the constraints that, in general, the cosmids
should be contiguous in the resulting maps, and apart from end fragments, all
fragments should overlap others in their vertical "stack" within about 5% of their
lengths. If these maps could be build accurately over large regions (i.e., at least 200500Kbp), a potential exists for reducing the overall cost of obtaining a high-resolution
physical map.
Chromosome is divided into several islands (contigs)
Contig
Contig map of
overlapping
cosmid clones
( ~150 kbp)
Clone restriction fragments (.5 - 15 kbp)
Figure 7: Physical map for island using fragments from a set of overlapping clones.
Toward the goal stated above, in this paper, we propose to establish the relative
position of each cosmid clone on a contig by establishing the possible locations of the
fragments in each cosmid clone in such a manner that the fragment overlap among the
clones is maximized while a suitably defined total error between the overlapped
fragments is minimized. There are other constraints such as the total length of the
assembly be equal to the contig's original size. The problem is one of assembling cosmid
clone sequences as shown in Figure 7. The problem is complicated further by the
21
uncertainty in the data, the possibility of data loss (fragments of the same size are hard
to distinguish during fingerprinting), and the known fact that data related to corner
fragments (i.e., fragments near the fragment boundaries) is almost always unreliable.
Problems of this type are known to be hard (Opatrny, 1979). For example, the case with
10 cosmid clones, there are 10! / 2 possible clone sequences.
4.2 Related Work and Scope of Present Work
The DNA restriction fragment assembly, the subject matter of this paper resembles,
and is somewhat related to, the restriction-site mapping, which deals with the equivalent
problem of determining the absolute location of a fragment within a cosmid clone. Here
also one uses digestion data from restriction enzymes but the focus is on finding the
absolute location of a fragment on a clone. Stefik (1978) used a branch and bound
technique with rules to exhaustively eliminate wrong answers from the digest fragment
data. This approach is sensitive to error in the data and is computationally intensive.
Pearson (1982) exhaustively generated permutations of the single-digest data to
compute the error between the generated double-digest and the actual (experimental)
double-digest data. This approach is faster but it is limited to small number of
restriction sites also. Krawczak (1988) developed a divide and conquer technique that
groups the fragments into compatible clusters and then determine the order of the
fragments within each cluster. This approach can process a greater number of
restriction sites. Platt and Dix (1993) used Genetic Algorithms (GAs) for restriction-site
mapping using double digest data. In their work they did not consider operators suited
for multi-modal search spaces and mating which preserve adjacency information.
Other techniques are available to sequence larger DNA regions. Branscomb et al.
(1990) developed a greedy algorithm to order the most probable clone sequence using
overlap probabilities between the clones. The algorithm works well when a large
amount of overlap between the clones exists and the fragment data has small errors.
22
This approach is prone to getting stuck in local minima and does not use all the
available data gathered at great expense. Techniques using larger clones are also being
tried to order, orient, and connect the islands in the original DNA (Olson et al. 1986;
Waterman and Griggs 1986; Stallings et al. 1990; Fickett and Cinkosky 1993).
Cuticchia et al. (1992) constructed maps using simulated annealing techniques. In
their work clones are ordered according to a measure of similarity between them given
by the presence or absence of specific sequences. A signature is assigned to each clone
and the algorithm uses it to minimize the error between the actual length of the contig
and the given length by the hypothetical clone ordering. Matching signatures are use to
order the clones. In their work they only considered the relationship between
consecutive clones.
Recently, Parsons et al. (1993, 1995) applied the edge recombination crossover (Whitley
et al. 1989) in conjunction with several specialized operators and solved a 10 Kbp
sequencing problem consisting of 177 fragments with no manual intervention.
Previously, we too had experimented, although not reported, with a variety of
permutation-based crossover operators, such as those one finds in Traveling Sales
Person type problems. In that context we also found and reported that the edge
recombination operator
outperformed the permutation-based operators by a wide
margin (Cedeño and Vemuri, 1993). In this paper, we now describe a genetic algorithm
that uses single-digest restriction data on a set of overlapping clones to find multiple
solutions for the clone sequences. We use genetic operators suited for multi-modal
function optimization (Cedeño and Vemuri, 1992) to determine solutions. A mating
operator based on genetic edge recombination was used to preserve adjacency
information and improve convergence toward multiple solutions at the same time.
Results are given for two data sets of overlapping clones from human chromosome 19.
23
Before delving into the details of the application of MNC algorithm to the DNA
problem, it is useful to pause and justify the need for multi-modal optimization to solve
this problem. We are aware of the fact that molecular biologists, unlike engineers,
cannot use “sub-optimal” solutions. As any geneticist knows, one missing or misplaced
base-pair could lead to the synthesis of an entirely different protein. Ideally, what a
molecular biologist
wants is the exact sequence, which should correspond in our
mathematical formulation to the global optimum. But the search for this global
optimum is intimately tied up with the selection of the optimization criterion, say the
fitness function in GAs or the optimization criterion such as an energy function in
classical optimization. As long as this criterion is fabricated by our methods, we will
never be able to zero in on the “correct” solution using mathematical methods. The best
we can do is to ease the burden on the biologist by showing her (him) possible avenues
for further experimental investigation. We believe that this is the place the sub
optimization plays a role and that is the reason we were searching for the k-best sub
optima.
5. Problem Representation
It is worth repeating the cautionary word about the terminology used here.
Hereafter, the word DNA is used synonymously with the word chromosome(s) found
in a cell, whereas the word chromosome will be used to describe the computational data
structure used in genetic algorithms. This interesting situation arose because genetic
algorithms are computational processes that imitate natural processes of selection and
survival. Coincidentally, we are applying this metaphorical computational paradigm to
solve a problem in genetics! So it is important to distinguish meanings of terms used in
the problem domain from those used in the paradigm domain.
In this section some problem-dependent genetic operators are defined. First, the
encoding of the chromosomes to describe clone sequences is examined. Second, the use
24
of fragment sizes to define the fitness function is described. Third, the modified edge
recombination mating operator is described. And last, the function that measures
similarity between two clone sequences is described.
ALLELE CLONE
NUMBER ID
C0
5154
C1
7442
C2 21230
C3
8131
C4 18993
C5
5435
C6
7255
C7 12282
C8 27714
C9 10406
FRAGMENT SIZES (in thousands of base-pairs)
16.55, 4.4, 1.68, 1.07, 4.81, 8.5
0.79, 0.79, 2.6, 4.35, 8.24, 2.7, 6.9, 5.16
0.96, 1.68, 1.08, 4.77, 8.47, 1.44, 2.37, 6.29, 0.62
0.92, 3.73, 19.8, 4.43, 1.69, 1.25, 4.68, 5.63
0.96, 6.31, 5.48, 8.61, 7.29, 0.81, 0.81, 2.6, 4.36, 1.92
2.89, 8.24, 2.7, 6.9, 5.14, 5.14, 2.89, 1.54
1.04, 8.21, 2.69, 6.89, 5.12, 5.12, 2.88, 1.94, 2.42, 1.37, 3.33
4.52, 5.13, 5.13, 2.87, 1.94, 2.42, 1.39, 3.35, 5.41
6.69, 5.07, 5.41, 2.88, 1.92, 2.32, 1.4, 3.35, 5.46, 17.65, 1.0, 10.49, 0.58, 1.74
2.03, 1.43, 2.34, 6.28, 5.46, 8.58, 7.27
Figure 8: Cosmid clones with fragment data.
Before going into the details about the operators, it is important to show how the
data for the problem is presented to the GA. Figure 8 shows fragment sizes obtained
from fingerprinting for a set of overlapping cosmid clones. For example, cosmid clone
with the ID number 5154 which is also labelled as Allele Number C0, is known to be
comprised of six fragments, containing 16550, 4400, 1680, 1070, 4810 and 8500 basepairs, in that order. Also, cosmid clone with the ID number 8131 which is also labelled
as Allele Number C3, is known to be comprised of eight fragments, containing 920,
3730, 19800, 4430, 1690, 1250, 4680 and 5630 base-pairs, in that order. By comparing
these fragment sequences one can surmise that the third and fourth fragments of Allele
C0 are probably the same as the fifth and sixth fragments of Allele C3 mainly because
the fragment lengths are so nearly equal to each other. If this is true then Alleles C3 can
be "aligned" below Allele C0 in such a manner that the fifth and sixth fragments of
Allele C3 fall right below the third and fourth fragments of Allele C0 as shown in Figure
9. The matching of the fragment lengths is not perfect. Indeed, the mismatch at other
positions is large. The goal of this problem is to maximize this type of matching while
minimizing the number and degree of mismatches while keeping the total length of the
25
assembly within reasonable limits. The data for this problem consist of the n cosmid
clones with their fragment lengths and the tolerance measure e which is used to
determine if two fragments are of the same size. That is, two fragments F1 and F2 are
considered to be of the same size if |F1 - F2| < e.
Clone
C3 8131
C0 5154
C2 21230
Fragments
0.92 3.73 19.8 4.43 1.69 1.25 4.68 5.63
16.55 4.4 1.68 1.07 4.81 8.5
0.96 1.68 1.08 4.77 8.47 1.44 2.37 6.29 0.62
Figure 9. An example of fragment assembly.
5.1 Chromosome Encoding
The encoding for this problem is simple. Each allele in the chromosome has a label
between 0 and n - 1 corresponding to one of the cosmid clones. No two alleles have the
same label, and mating and mutation will preserve this constraint. In Figure 8, for
example, an allele with the label C0 corresponds to the clone with ID 5154 and an allele
with the label C9 corresponds to clone ID 10406. The clone sequence (5154, 21230, 10406,
7255, 12282, 27714, 8131, 18993, 7442, 5435), for example, is represented by the
chromosome (0 2 9 6 7 8 3 4 1 5). The initial population is generated by picking, at
random, values between 0 and n - 1 without replacement.
Number of matches between clones
CLONE C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
C0
6 1 4 2 1 0 1 0 2 1
C1
1 8 0 1 4 4 4 1 1 0
C2
4 0 9 3 2 1 3 2 5 3
C3
2 1 3 8 2 0 0 1 2 0
C4
1 4 2 2 10 1 3 2 3 4
C5
0 4 1 0 1 8 6 3 2 0
C6
1 4 3 0 3 6 11 7 7 3
C7
0 1 2 1 2 3 7 9 7 4
C8
2 1 5 2 3 2 7 7 14 3
C9
1 0 3 0 4 0 3 4 3 7
26
Total error in the matches
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 5 8 4 4 0 3 0 13 8
5 0 0 8 5 2 9 3 9 0
8 0 0 14 2 10 16 10 23 5
4 8 14 0 11 0 0 9 13 0
4 5 2 11 0 10 19 9 6 10
0 2 10 0 10 0 10 4 8 0
3 9 16 0 19 10 0 7 26 23
0 3 10 9 9 4 7 0 20 26
13 9 23 13 6 8 26 20 0 5
8 0 5 0 10 0 23 26 5 0
Figure 10: The M matrix on the left whose entries are the number of fragment
matches and the E matrix on the right whose entries are the total errors between any
two clones with ε = 10.
5.2 Fitness Function
To calculate the fitness of an individual, the number of fragment matches between
all consecutive clones and the error between the fragments is considered. The fragment
sizes are represented using integer numbers (by multiplying the number given in
Figure 8 by 100), to accelerate computation of the fitness function. Prior to the execution
of the GA, two matrices are calculated. One matrix, the match matrix M shown on the
left side of Figure 10, contains the number of fragments that match between two clones
Ci and Cj, within an error tolerance ε . The other matrix, the error matrix E shown on
the right side of Figure 10, contains the total error between the clones being matched.
The error between two clones is given by the sum of the errors between all fragments
that matched. For example, between clone No. 8131 (C3) and clone No. 5154 (C0) there
are two pairs of fragments that match within the specified tolerance of ε = 10. The
lengths of these fragments are 169 and 168 for one pair and 443 and 440 for the second
pair. Thus a 2 appears in (row 2, column 4) of the Match Matrix M. The total error
between both pairs of fragments is (169-168) + (443-440) = 4, which is shown in (row 1,
col. 4) of the Error Matrix, E.
Our goal is to arrange the cosmid clones as shown in Figure 7 so that the lengths of
the overlapping fragments match with each other as closely as possible. The necessary
matching information is already gathered in the matrix M and the degree of
accumulated mismatch per clone is gathered in the matrix E. However, we believe that
this information alone is not sufficient to establish which two clones are "adjacent" to
each other in the arrangement shown in Figure 7. For example, consider how clone C0
(i.e., allele No. 5154) matches with other clones. Inspection of the match matrix M
indicates that the degree of match between clone C0 and clone C2 (or, equivalently,
27
allele No. 21230), is 4 matches. Also, fragments in clone C0 match with fragments in
clone C3 as well as C8, each with 2 matches. By interpreting this to mean that C2 should
be placed nearer to C0 than C3 or C8, we are ignoring information contained in the C0C3 matches and C0-C8 matches. One more example suffices to make the point. Clone C3
should be placed closer to C2 because they match with each other the maximum
number of times, namely 3, although C3 matches with three other clones, each with
only 2 matches. This phenomena makes us to think that using only the number of
matches between clones is not sufficient to establish the partial order between clones
when they possess the same match count. We believe that part of this problem is due to
false matches, between fragments of similar sizes, that may occur by chance. We tried to
overcome this problem by incorporating the total error in the matches, shown in matrix
E, in order to enable our GA to discriminate further between clones. Using the same
example, notice that clone C3 has less total error when matched with C0 than clone C8
and therefore indicates that C0 is adjacent to C3. The following equation for fitness
captures the essence of the method described so far.
M [ ai , a i + j ] 
E [ ai , ai + j ] 
1 −


M [ai , ai + j ] ⋅ ε 
i = 0 j = −1 Count[ ai + j ] 
n −1 1
fitness = ∑ ∑
j≠0
Here ai refers to the cosmid clone placed in the ith position of the chromosome, ai+j
refers to the clone to the right or left (if any) of the ith position. The first term,
M[ai,ai+j]/Count[ai+j], give us a normalized count for the degree of match. The term
E[ai,ai+j]/M[ai,ai+j]⋅ε refers to the normalized error per fragment. When this normalized
error reaches unity, it means that the total error is so large that any apparent matches
are worthless. With this interpretation, the second term of the equation essentially tells
us the degree of confidence we can place on the normalized matches we are counting in
28
the first term. In the above equation, n refers to the number of alleles in the
chromosome which is equal to the number of clones.
By defining the fitness function as above, we are assigning a higher fitness to those
clone pairs that match a higher percentage of their regions. For example, Allele 0 with 6
fragments has two matches each with Alleles 3 and 8, each having 8 and 14 fragments
respectively. Since 2/8 represents a higher percentage than 2/14, we designed a fitness
function that prefers a configuration that places Allele 3 closer to Allele 0 than Allele 8.
This is achieved by dividing the number of matches by the number of fragments in the
clone.
Before settling on the fitness function described above, others were considered. For
example, fitness functions that just counted the number of matches between clones with
no regard to normalization failed to produce the correct answer. A fitness function that
just counted the number of matches and then subtracted the total error in those matches
also failed to give satifactory results. It is possible that other fitness functions may give
results that are even better than what are reported here. In the future, we plan to
include the number of matches as well as errors among groups of three clones or more
as components of the fitness function and study its effect on performance.
5.3 Mating and Mutation Operators
The mating operator used in this method is based on a slight modification to the
genetic edge recombination operator that was applied successfully to solve the TSP
(Traveling Sales Person) problem. As in the TSP problem, the important information
here is the adjacency of the alleles, although the order the alleles appear in the
chromosome can be derived from the adjacency information. The idea is to recombine
the links (pairs of clones) between two parents such that common links are inherited by
the offspring. This operator is implemented in two steps as shown in Figure 11. First,
those links (or traits) that are common to both the parents are identified and passed on
29
to the offspring and the links occupy the same absolute positions in the offspring
chromosome. In the example shown in Figure 11, the relevant link-pairs are 7-8, 8-1,
and 5-0 in the first parent and 1-8, 8-7 and 5-0 in the second parent. Notice that these
links are passed on to the two offspring undisturbed. Second, those alleles that are not
passed to the offspring (indicated by dashes, in Figure 11) are randomly assigned to the
available positions while observing the constraint that no link label is repeated.
Parent 1
(6 7 8 1 2 3 9 4 5 0)
Common links (traits)
(- 7 8 1 - - - - 5 0)
Offspring 1
(3 7 8 1 4 6 2 9 5 0)
Parent 2
(1 8 7 9 5 0 2 6 3 4)
Common links (traits)
(1 8 7 - 5 0 - - - -)
Offspring 2
(1 8 7 3 5 0 6 2 4 9)
Figure 11: Modified genetic edge recombination for clone sequencing.
The differences between this operator and the original edge recombination operator
are in the number of offspring generated and in the assignment of alleles not transferred
from the parent. We generated two offspring instead of one because the location of the
links in the clone sequence is important to our problem. In TSP the chromosome is
circular, thus the location did not matter. We allow both parents to pass the location of
the links to their offspring. To assign the other alleles we select them at random from
those clones not passed by their parents. In the original operator the links are assigned
from those present in any of the two parents. Alleles with fewer links are assigned first
to prevent from running out of links for a given allele.
In the mating operator used here there is excessive exploration of the search space
primarily due to the random filling of the unassigned slots, in the second step, while
creating the offspring chromosomes. Part of this exploration difficulty is alleviated by
the fact that mates are selected using crowding selection and therefore they have
common features between them. Exploration is therefore localize to a smaller region
within the entire search space. On the other hand, by allowing unassigned clones to be
30
chosen at random, we are allowing links to re-appear that might not have done so using
mutation alone. Other crossover operators based on clone positions alone did not
perform as well in this problem, we only give the results for the modified edge
recombinator operator.
Mutation is applied on an individual basis. After an offspring is generated it is
mutated if the outcome from the flip of a biased coin is true. When this happens, a link
from the offspring is selected at random and all alleles from that link to the last position
of the chromosome are reversed. For example, the offspring ( 1 8 7 3 5 0 6 2 4 9) after
mutation can result in ( 1 8 7 9 4 2 6 0 5 3) if the link between allele 7 and 3 is selected to
mutate. This mutation operator is known as inversion (Holland 1975).
The mating and mutation operators are compatible with each other in the sense that
they both operate on links. The building blocks of this problem are based on the links
between clones in the sequence. The GA operates on these links so that the most useful
ones are passed from generation to generation.
5.4 Similarity Function
The similarity function is very simple also. It counts the number of dissimilar links
between two individuals. Using the parents from Figure 11 once again as an example,
notice that there are six dissimilar links, corresponding to the five alleles not assigned to
the offspring. For concreteness, these six dissimilar links in Parent 1 chromosome are 67, 1-2, 2-3, 3-9, 9-4 and 4-5 and for Parent 2 are 7-9, 9-5, 0-2, 2-6, 6-3 and 3-4. This metric
measures the proximity between two clone sequences by counting the different links
they have and not the position of the alleles. For example, the sequences (0 1 2 3 4 5 6 7 8
9) and (9 8 7 6 5 4 3 2 1 0) have a distance of zero since all the links are the same. This
metric captures the essential aspect of the problem since both solutions are equivalent in
our problem.
31
6. Results and Discussion
The results presented in this section were obtained on a SGI IRIS 4D computer
under IRIX O.S. running the GA application written in C. The parameters for the GA
are the following:
Population size:
100
Mutation probability:
0.06
Crossover probability:
1.00
Crowding selection group size (Cs)
20
Crowding factor group size (s):
10
Crowding factor (Cf):
5
Maximum number of generations to execute:
Tolerance ε
100
10
These parameters were selected after various trials. In each trial, different values for
each of the six parameters were tried, varying one at a time while holding the others
constant. A population size of 100 was found satisfactory. Other sizes (50, 150, 200, 250,
and 300) were tried. Higher sizes did not provide new information about the problem,
although they produced, on average, the solution in less number of generations. Lower
sizes in some cases did not converge to the best solutions seen before in the allowed
number of generations. The population size of 100 is a compromise between speed of
convergence and execution time.
Mutation was set at 0.06, therefore an average of 6 individuals were mutated every
generation. This low value of mutation works well in this problem. It seems to allow
individuals to escape local optima without causing any major disruption on the
elements of a niche. In the MNC method diversity is maintained implicitly when the
population converges to different optima, reducing the need for higher mutation
values. On the other hand the mating probability was set to a high value of 1.0. As
described before the MNC method is basically a steady state GA and highly fit
32
individuals have a high chance of surviving for many generations. Also, crowding
selection exploits the similarity between individuals in the population by allowing
localize solutions to pass common traits between them.
The group size for crowding selection was set at 20. This high value emphasizes the
importance of similarity between parent and mate for this problem. Then the mating
operator will likely generate an offspring within the same region as their parents. The
size for the crowding factor groups was set at 10. This allowed competition between
multiple optima to occur while maintaining multiple solutions. Good results were
obtained with other Cs and s values (5, 10, 15, 20) also. The crowding factor was set at 5.
This value allowed the MNC method to eliminate low fitness individuals from the
niches more rapidly. These parameter values allowed a diverse population to co-exist
during the number of generations allowed and did not restrict competition between
individuals from different niches. The tolerance value ε was set to 10 to minimize false
matches due to chance. Higher values of ε increased the false matches more than true
matches and therefore more possible clone sequences were found.
The MNC method took an average of 50 seconds for each run. Some of the best
sequences obtained for two different data sets of overlapping clones are shown in
Figure 12. These results were obtained from 10 different runs. The figure shows the
actual sequence for the data sets and the clone sequences (with their fitness) obtained by
the algorithm.
Data for set 1 is shown in Figure 8 and data for set 2 is shown in the Appendix.
There was nothing really special about these data sets except that they were in regions
containing various gene(s) of interest to some of our collaborators. They had already
mapped them manually, so we could judge the answers our GA technique provided.
Finally, and perhaps most importantly, the data sets were large enough not to be toy
problems. One of the data sets contained two vertical "columns" in the final map that
33
had fragments of the same size. The existence of independent columns of the same size
is, of course, one of the things that makes this problem so tough, and interesting. Many
of the runs using data set 2 generated multiple solutions with relatively close fitness
values.
Data Set 1 actual sequence and its fitness:
(8131 5154 21230 10406 18993 7442 5435 7255 12282 27714) 764
The 3 Best sequences found by the GA and their corresponding fitnesses:
(8131 5154 21230 10406 18993 7442 5435 7255 12282 27714) 764
(8131 5154 21230 27714 12282 7255 5435 7442 18993 10406) 749
(8131 5154 21230 10406 27714 12282 7255 5435 7442 18993) 744
Data Set 2 actual sequence and its fitness:
(12595 6722 26999 29626 29064 18301 19811 29035 17755 28828 20235) 750
The 12 Best sequences found by the GA and their corresponding fitnesses:
(12595 26999 6722 29626 29064 18301 28828 20235 17755 29035 19811 )
(20235 28828 17755 29035 19811 18301 29064 29626 6722 26999 12595 )
(12595 26999 6722 29626 29064 18301 20235 28828 17755 29035 19811 )
(28828 20235 17755 29035 19811 18301 29064 29626 6722 26999 12595 )
(12595 26999 6722 29626 29064 18301 28828 20235 17755 19811 29035 )
(12595 26999 6722 29626 29064 18301 20235 28828 17755 19811 29035 )
(12595 6722 26999 29626 29064 18301 19811 29035 17755 28828 20235 )
(20235 28828 17755 29035 19811 18301 29064 29626 12595 6722 26999 )
(19811 29035 17755 20235 28828 18301 29064 29626 12595 6722 26999 )
(19811 29035 17755 20235 28828 18301 29626 29064 26999 6722 12595 )
(19811 29035 17755 28828 20235 18301 29064 29626 26999 6722 12595 )
(19811 29035 17755 28828 20235 18301 29064 29626 12595 6722 26999 )
757
757
756
755
754
753
750
750
750
750
749
749
Figure 12: Clone sequences obtained by GA and actual sequence.
For data set 1 the MNC method was able to find the actual clone sequence. The best
sequence as described by the fitness function did indeed match the solution. The best
sequence was found in all the runs. From the other sequences found, the fitness values
of the next best is 15 less than the actual sequence. Similar gaps exist between all the
sequences shown in Figure 12 for data set 1. From the solutions we can see that the
other sequences are a single mutation from the actual sequence.
34
Data set 2, shown in the Appendix, presented a more challenging problem for the
MNC method. In this case the best sequence found only had clones 6722 (C7) and 26999
(C2) transposed from the actual sequence. The fitness for the actual sequence is 750,
which is the sixth best score when compared with all solutions found. The actual
sequence was obtained in 8 of the 10 the runs. As shown in Figure 12, different solutions
with equal fitness values were found and maintained in the runs. Two different
solutions with a distance of 2 (two different links) were found with fitness equal to 757.
Four others were found with fitness equal to 750 (including the actual sequence).
Another observation is that there is a difference of 8 or less in the fitness between all the
sequences found. Some of the sequences are mutations of others, but there is more
diversity when compared with the solutions for data set 1. The biologists later
confirmed via independent techniques (i.e., hybridization probing) that the clones in
data set 2 were indeed from two distinct regions in the DNA. That is, the sequence
(6722 26999 12595 29626 29064 18301) and (19811 29035 17755 28828 20235) belong to
separate maps in the DNA. The best solutions found by the MNC method kept both
600
800
500
700
Best Fitness
Avg Fitness
sequences apart.
400
300
200
SGA
SSGA
100
600
500
SGA
SSGA
400
MNC
MNC
0
300
0
10
20
30
40
50
60
70
80
90 100
0
Generation
10
20
30
40
50
60
70
80
90 100
Generation
Figure 13: Population average fitness (left) and best fitness (right) for different methods.
To see how the MNC method compares with more traditional GAs we applied the
same problem (i.e., using data set 2) to the SGA and SSGA with crowding. For both the
35
SGA and SSGA the crossover probability was set at 0.6. The crowding factor was set to
10 for the SSGA. All other parameters are the same as in the MNC method. We
observed poor results by the SGA using these parameters alone. To improve the
performance of the SGA the 10 best individuals in the population (known as generation
gap (De Jong 1975)) were transferred from the current generation to the next. Figure 13
shows the population’s average fitness (on the left) and the fitness of the best solution
(on the right) for the MNC, SGA, and SSGA methods averaged over 10 runs.
The MNC method outperformed both the SGA and the SSGA. The population
converged faster and at the same time kept multiple solutions throughout the run. Use
of a generation gap allowed the SGA to retain the good individuals found in previous
runs, but did not improve convergence towards multiple solutions. The SSGA did not
find the best sequence in any of the runs. The SGA found one of the two best sequences
in 2 of the runs. As mentioned before the MNC found the two best sequences and others
in 8 of the runs. The ability of the MNC method to exploit similarity and increased
competition among localized solutions proved to be successful in this problem.
7. Comments and Conclusions
Two points deserve further discussion. First, data containing clones with fewer than
five fragments were normally sequenced erroneously by the GA. This is due to the lack
of opportunity for sufficient fragment matches. Also data pertaining to corner
fragments (i. e., fragments lying near the cosmid clone boundaries) is generally more
prone to errors. Consequently corner segments will not match well with a high
probability with their counterparts in the preceding and succeeding clones. For clones
with less than five fragments, this means that on the average, at least half of the data is
not useful and in some cases leads to more false matches. Clones with less than five
fragments were usually placed first or last in the clone sequence by the GA.
36
Second, when a large number of overlaps existed between 3 or 4 clones, the GA
experienced difficulty deciding the correct sequence. An example of this behavior was
observed with data set 2. This phenomenon, we believe, is happening because the
fitness function is only looking for matches between the clones to the left and right
without accounting for the fragments which are common to three or more clones.
Permutations of these clones generally had similar fitness values. An improved fitness
measure is needed to account for fragment matches between three or more clones.
Overall the MNC worked well with the data presented to it. Using the correct set of
genetic operators was very important to find a MNC model that will find good
solutions to the problem. Using a multi-modal approach was very useful for this
problem also since it prevented premature convergence and at the same time explored
the search space in a more efficient manner. Defining the operators for mating,
mutation, fitness, and similarity measure to work with adjacency information between
the clones rather than clone positions gave the MNC method the correct set of tools to
converge towards the most probable solutions. More information must be incorporated
into the fitness evaluation to distinguish even further between the best clone sequences
and other similar ones.
The results on multi-modal optimization and DNA fragment assembly shows the
ability of the MNC to converge to multiple solutions, maintain stable sub-populations,
and succeed in complex search spaces. Exploiting similarity during selection and
replacement allows a diverse population to coexist while competition for slots in the
population evolve naturally. Convergence towards the best solution(s) is not affected
with the increased diversity. The MNC method uses the diversity present in the
population to guide the population towards multiple optima in the search space.
37
Acknowledgments
The authors thank the three anonymous reviewers for their many valuable
comments and Dr. Ken De Jong for encouraging us to make the revisions and for
bringing to our attention some relevant unpublished work. This work was supported,
in part, by the Applied Mathematics Program of the Office of Energy Research (US.
Department of Energy) under contract number W-7405-Eng-48 to LLNL Lawrence
Livermore National Laboratory and in part by a grant from the Institute of Scientific
Computing Research of the Lawrence Livermore National Laboratory.
References
Branscomb, E., Slezak, T., Pae, R., Galas, D., Carrano, A. V., & Waterman, M. (1990).
Optimizing restriction fragment fingerprinting methods for ordering large genomic
libraries. Genomics, 8, 351-366.
Carrano, A. V., Lamerdin, J., Ashworth, L. K., Watkins, B., Branscomb, E., Slezak, T.,
Raff, M., De Jong, P. J., Keith, D., McBride, L., Meister, S., & Kronick, M. (1989). A
high-resolution,
fluorescence-based,
semiautomated
method
for
DNA
fingerprinting. Genomics, 4, 129-136.
Cavicchio, D. J. (1970). Adaptive search using simulated evolution. Ph.D. thesis, University
of Michigan, Ann Arbor, MI.
Cede o, W., & Vemuri, V. (1992). Dynamic multi-modal function optimization using
genetic algorithms. In Proceedings of the XVIII Latin-American Informatics Conference.
Las Palmas de Gran Canaria, Spain: University of Las Palmas, 292-301.
Cede o, W., & Vemuri, V. (1993). An investigation of DNA mapping with genetic
algorithms: Preliminary results. In Proceedings of the Fifth Workshop on Neural
Networks, SPIE 2204, 133-140.
Cuticchia, A. J., Arnold, J., & Timberlake, W. E. (1992). The use of simulated annealing
in chromosome reconstruction experiments based on binary scoring. Genetics 132,
591-601.
38
De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive
systems. Ph.D. thesis, University of Michigan, Dissertation Abstracts International
36(10), 5140B.
Deb, K., & Goldberg, D. E. (1989). An investigation of niche and species formation in
genetic function optimization. In J. D. Schaffer (Ed.), Proceedings of the Third
International Conference on Genetic Algorithms. San Mateo, CA: Morgan Kaufmann, 4250.
Fickett, J. W., & Cinkosky, M. J. (1993). A genetic algorithm for assembling
chromosome physical maps. In C. R. Cantor & R. Robbins, (Eds.), Proceedings of the
Second International Conference on Bioinformatics, Supercomputing, and Complex Genome
Analysis. River Edge, NJ: World Scientific, 272-285.
Goldberg, D. E., & Richardson, J. (1987). Genetic algorithms with sharing for
multimodal function optimization. In J. J. Grefenstette (Ed.), Proceedings of the Second
International Conference on Genetic Algorithms. Hillsdale, NJ: Lawrence Erlbaum
Associates, 41-49.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning.
Reading MA: Addison-Wesley.
Goldberg, D. E., Deb, K., & Horn, J. (1992). Massive multimodality, deception, and genetic
algorithms (IlliGAL Technical Report No. 92005). Urbana, IL: University of Illinois at
Urbana-Champaign.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. Ann Harbor MI: The
University of Michigan Press.
Istvanick, W., Kryder, A., Lewandoeski, G., Meidânis, J., Rang, A., Wyman, S., &
Joseph, D. (1993). Dynamic methods for fragment assembly in large scale genome
sequencing projects. In, T. N. Mudge, V. Milutinovic, & L. Hunter (Eds.), Proceedings
of the Twenty Sixth Annual Hawaii International Conference on System Sciences:
Architecture and Biotechnology Computing. Los Alamitos, CA: IEEE Computer Society
Press, 534-543.
Krawczak, M. (1988). Algorithms for the restriction-site mapping of DNA molecules.
In Proceedings of the National Academy of Sciences. USA 85, 7298-7301.
39
Mahfoud, S. W. (1992). Crowding and preselection revisited. In R. Männer & B.
Manderick (Eds.), Proceedings of Parallel Problem Solving from Nature 2. New York,
NY: Elsevier Science B. V., 27-36.
Olson, M. V., Dutchik, J. W., Graham, M. Y., Brodeur, G. M., Helms, C., Frank, M.,
MacCollin, M., Scheinman, R., & Frank, T. (1986). Random-clone strategy for
genomic restriction mapping yeast. In Proceedings of the National Academy of Sciences.
USA 83, 7826-7830.
Opatrny, J. (1979). The total ordering problem. SIAM Journal of Computing, 8:1, 111-114.
Parsons, R. J., Forrest, S., & Burks, C. (1993). Genetic algorithms for DNA sequence
assembly. In Proceedings 1st International Conference on Intelligent Systems for Molecular
Biology. Menlo Park, CA: AAAI Press, 310-318.
Parsons, R. J., Forrest, S., & Burks, C. (1995). Genetic algorithms, operators and DNA
fragment assembly. To appear in Machine Learning. Boston MA: Kluwer Academic
Publishers.
Pearson, W. (1982). Automatic construction of restriction site maps. Nucleic Acids
Research 10, 217-228.
Platt, M. D., & Dix, T. I. (1993). Construction of restriction maps using a genetic
algorithm, In T. N. Mudge, V. Milutinovic, & L. Hunter (Eds.), Proceedings of the
Twenty Sixth Annual Hawaii International Conference on System Sciences: Architecture
and Biotechnology Computing. Los Alamitos, CA: IEEE Computer Society Press, 756762.
Smith, R. E., Forrest, S., & Perelson, A. S. (1993). Searching for diverse, cooperative
populations with genetic algorithms. Evolutionary Computation, 1:2, 128-149.
Stallings, R. L., Torney, D. C., Hildebrand, C. E., Longmire, J. L., Deaven, L. L., Jett, J.
H., Doggett, N. A., & Moyzis, R. K. (1990). Physical mapping of human
chromosomes by repetitive sequence fingerprinting. In Proceedings of the National
Academy of Sciences. USA 87, 6218-6222.
Stefik, M. (1978). Inferring DNA structures from segmentation data. Artificial
Intelligence , 11, 85-114.
40
Syswerda, G. (1989). Uniform crossover in genetic algorithms. In J. D. Schaffer (Ed.),
Proceedings of the Third International Conference on Genetic Algorithms. San Mateo, CA:
Morgan Kaufmann, 2-9.
Waterman, M. S., & Griggs, J. R. (1986). Internal graphs and maps of DNA. Bulletin of
Mathematical Biology, 48, 189-195.
Whitley, D. (1988). GENITOR: a different genetic algorithm. In Proceedings of the Rocky
Mountain Conference on Artificial Intelligence. Denver Colorado, 118-130.
Whitley, D., Starkweather, T., & Fugway, D. (1989). Scheduling problems and
traveling salesmen: the genetic edge recombination operator. In J. D. Schaffer (Ed.),
Proceedings of the Third International Conference on Genetic Algorithms. San Mateo, CA:
Morgan Kaufmann, 133-140.
Appendix
Fragment data for set 2 of overlapping cosmid clones:
CLONE ID
29064
19811
26999
29626
17755
12595
20235
6722
28828
18301
29035
FRAGMENTS (k base-pairs)
15.42, 3.46, 1.50, 9.12, 4.30
3.13, 7.89, 7.89, 3.02, 4.35, 14.31, 2.65, 0.64
4.64, 19.69, 1.10, 1.48, 2.82, 0.77, 9.61
3.46, 1.50, 9.14, 6.47, 13.48,
1.26, 2.62, 6.32, 2.73, 3.54, 7.88, 7.88, 3.02, 4.35, 1.74
1.48, 2.83, 0.76, 9.68, 12.75, 1.48, 6.06
8.01, 12.56, 2.62, 6.32, 2.74, 3.54, 5.71
2.84, 19.72, 1.16, 1.49, 2.84, 0.77, 9.69, 9.20
12.45, 2.61, 6.27, 2.72, 3.52, 7.89, 3.89
2.53, 13.99, 2.64, 16.88, 3.44
2.45, 2.74, 3.53, 7.82, 7.82, 3.02, 4.35, 7.44
CLONE C0
C0 :
5
C1 :
1
C2 :
1
C3 :
3
C4 :
2
C5 :
1
C6 :
1
C7 :
2
C8 :
1
C9 :
1
C10:
2
C1
1
8
0
0
5
0
1
0
2
1
5
C2
1
0
7
1
1
4
1
6
1
0
1
C3
3
0
1
5
1
1
1
2
1
1
1
C4
2
5
1
1
10
1
4
1
5
2
6
C5
1
0
4
1
1
7
1
4
0
0
1
C6
1
1
1
1
4
1
7
1
4
2
2
C7
2
0
6
2
1
4
1
8
0
0
1
C8
1
2
1
1
5
0
4
0
7
2
3
C9
1
1
0
1
2
0
2
0
2
5
3
C10
2
5
1
1
6
1
2
1
3
3
8
CLONE C0
C0 :
0
C1 :
5
C2 :
2
C1
5
0
0
C2
2
0
0
C3
2
0
2
C4
13
5
9
C5
2
0
9
C6
8
3
8
C7
9
0
20
C8
6
4
10
C9
2
1
0
C10
12
23
8
41
C3 :
C4 :
C5 :
C6 :
C7 :
C8 :
C9 :
C10:
2
13
2
8
9
6
2
12
0
5
0
3
0
4
1
23
2
9
9
8
20
10
0
8
0
8
2
8
7
6
2
7
8
0
10
1
10
10
12
14
2
10
0
9
4
0
0
9
8
1
9
0
10
10
12
1
7
10
4
10
0
0
0
10
6
10
0
10
0
0
11
10
42
2
12
0
12
0
11
0
27
7
14
9
1
10
10
27
0

Download Report

Multi-Niche Crowding in Genetic Algorithms and its Application to

Paperzz.com

Your Paperzz