A GENETIC ALGORITHM FOR SOLVING THE EUCLIDEAN DISTANCE MATRICES COMPLETION PROBLEM

Wilson Rivera-Gallego
NSF Engineering Research Center for Computational Field Simulation
Mississippi State University
Mississippi State, MS 39762, USA
[email protected]
Key words. genetic algorithms, Euclidean distance matrices, molecular configurations

ABSTRACT
This paper presents a genetic algorithm to solve the Euclidean distance matrices completion problem and to determine three-dimensional configurations of points which generate the corresponding Euclidean distance matrices. This new approach is important for applications in molecular conformation problems, where it is necessary to determine molecular configurations from incomplete nuclear magnetic resonance data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC 99, San Antonio, Texas
©1998 ACM 1-58113-086-4/99/0001 $5.00

1 INTRODUCTION
One of the most important problems in computational chemistry is the design of molecules such as enzymes, anti-cancer agents and catalysts for material processing. The design of these molecules depends upon the precise determination of the three-dimensional configurations of atoms generating the molecules[2].
In the absence of crystallographic or spectroscopic experimental data, the search for a molecular structure can be formulated as an optimization problem under the assumption that the structure of the molecule corresponds to a conformation for which the energy is near the global minimum. Because there exists a large number of local minima, standard optimization methods do not work very well. Thus, Genetic Algorithms (GAs) have been applied to the molecular conformation problem[12,13].
Genetic algorithms are search algorithms based on the genetic processes observed in natural evolution. Theoretical concepts were first introduced by Holland[11] and later described by Goldberg[6]. GAs work with a population of individuals. Each individual (chromosome) consists of a string of characters (genes) and represents a possible solution in a search space. A fitness value is assigned to each individual according to a problem-specific fitness function. By using a selection operator, fitter individuals have a higher probability of being chosen to be recombined in order to get offspring with better fitness values. Offspring are generated by a crossover operator, which combines genes from two selected individuals, and a mutation operator, which changes genes in the chromosomes with a certain probability. Mutation allows for population diversity so that the GA does not converge prematurely and important information is not lost. This process is repeated in order to find an optimal solution to the particular problem.
The first references to Euclidean Distance Matrices (EDMs) appeared in multidimensional scaling[4]. Members of a multidimensional sample are represented by points in some geometrical space, usually Euclidean. It is possible to build up arrays of numbers, called ordering diagrams, where pairs of close points usually represent similar samples and faraway points represent dissimilar samples. The goal is to use this dissimilarity matrix in order to describe some characteristics of the sample. For example, in psychometry each element in the dissimilarity matrix may represent the dissimilarity of a pair of stimuli.
Schoenberg[16] found a natural parametrization of the EDMs cone from the symmetric positive semidefinite matrices. Gower[8,9] described some properties of the EDMs and pointed out some connections between these properties and the configurations
of points that generate the matrices. Critchley[3]
worked on the spectral decomposition of EDMs. Hayden et al.[10] obtained Gower's results, but in a simpler way that provided a more geometrical view of the EDMs cone. The author[14] studied some special structures of EDMs.
The relationship between EDMs and the molecular conformation problem appears when we seek the conformation of biological molecules from nuclear magnetic resonance (NMR) data[5]. Interatomic distances are measured using NMR and arranged in a rectangular matrix. Thus, the geometry of the atoms forming the molecule can be determined from the structure of the matrix[10,14]. In this context a GA for circulant EDMs was presented in [15].
This paper describes a GA for solving the Euclidean distance matrices completion problem. In the measuring process, information may be missed, giving rise to incomplete EDMs. In order to solve a matrix completion problem, we attempt to determine the unspecified entries of a partially defined matrix so that the complete matrix satisfies the desired properties. The positive definite completion problem, which is closely related to the EDMs completion problem, has been studied by Barrett et al.[1].
This paper is organized as follows. Section 2 introduces definitions and preliminary results. Section 3 describes the implementation of the GA for the EDMs completion problem. Section 4 shows practical examples. Finally, Section 5 gives the conclusions.
2 EUCLIDEAN DISTANCE MATRICES
If there exist $n$ points $p_1, p_2, \ldots, p_n$ in $R^k$ such that $\|p_i - p_j\|^2 = d_{ij}^2$, then $D = [d_{ij}^2] \in R^{n \times n}$ is a Euclidean Distance Matrix. The minimum value of $k$ is called the embedding dimension.
A symmetric matrix $A \in R^{n \times n}$ is positive semidefinite if $z^T A z \geq 0$ for all $z \in R^n$, $z \neq 0$. In particular, if it is possible to write a matrix $A \in R^{n \times n}$ as a product $A = CC^T$, then $A$ is symmetric positive semidefinite.
A matrix $C \in R^{n \times k}$ such that the $i$-th row corresponds to the coordinates of the point $p_i$ is called the Coordinates Matrix.
Note that given a configuration of $n$ points $p_1, p_2, \ldots, p_n$ in $R^k$ and the corresponding coordinates matrix $C$, the matrix $B = CC^T$ is symmetric positive semidefinite. Also
$$b_{ij} = p_i^T p_j \quad \text{for } i,j = 1,2,\ldots,n,$$
and in particular
$$b_{ii} = \|p_i\|^2 \quad \text{for } i = 1,2,\ldots,n.$$

Figure 1: Molecular configuration of the methane.

Example 2.1 Consider the configuration of the methane (CH4) in Fig. 1. The coordinates matrix $C$ is given by equation (1) [matrix garbled in the source; the only surviving entries are 0.62 and -0.62]. Then
$$B = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1.15 & -0.38 & -0.38 & -0.38 \\ 0 & -0.38 & 1.15 & -0.38 & -0.38 \\ 0 & -0.38 & -0.38 & 1.15 & -0.38 \\ 0 & -0.38 & -0.38 & -0.38 & 1.15 \end{bmatrix} \qquad (2)$$
and the EDM is
$$D = \begin{bmatrix} 0 & 1.15 & 1.15 & 1.15 & 1.15 \\ 1.15 & 0 & 3.07 & 3.07 & 3.07 \\ 1.15 & 3.07 & 0 & 3.07 & 3.07 \\ 1.15 & 3.07 & 3.07 & 0 & 3.07 \\ 1.15 & 3.07 & 3.07 & 3.07 & 0 \end{bmatrix} \qquad (3)$$

2.1 Parametrization of Euclidean Distance Matrices
Let us define $\Omega_n$ as the set of symmetric positive semidefinite matrices, and $\Lambda_n$ as the set of Euclidean distance matrices.
It is clear that
$$d_{ij}^2 = \|p_i\|^2 + \|p_j\|^2 - 2 p_i^T p_j = b_{ii} + b_{jj} - 2 b_{ij}. \qquad (4)$$
Because $B = [b_{ij}]$ is a positive semidefinite matrix, the above expression is a natural parametrization of the EDMs:
$$\kappa : \Omega_n \to \Lambda_n, \qquad [\kappa(B)]_{ij} = b_{ii} + b_{jj} - 2 b_{ij}. \qquad (5)$$
The map $\kappa$ can be expressed in matrix form by
$$\kappa(B) = b e^T + e b^T - 2B \qquad (6)$$
where $b = \mathrm{diag}(B)$ and $e$ is the vector whose components are all equal to 1.
Each EDM corresponds to a different positive semidefinite matrix depending on the coordinate system. Thus, the function $\kappa$ is not an injection.
In order to get an inverse transformation of $\kappa$, it is necessary to define the map
$$\tau_s(D) = -\tfrac{1}{2}\,(I - e s^T)\, D\, (I - s e^T) \qquad (7)$$
such that $s \in R^n$ and $s^T e = 1$, where $s$ is a vector in $R^n$ that fixes the coordinate system.
In case that $s = e/n$, the origin of the coordinate system corresponds to the centroid of the configuration of points[14].
In Example 2.1, it can be verified that $\kappa(B) = D$ and $\tau_s(D) = B$.
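The verification mentioned for Example 2.1 can be sketched numerically. The following is a minimal illustration, not the author's code: the function names `kappa`, `tau`, `matmul` and `max_abs_diff` are hypothetical, and the choice `s_carbon = e_1` (origin placed at the first point, the carbon atom) is an assumption made here so that `tau` returns the printed matrix B exactly. With the rounded entry -0.38, `kappa(B)` gives about 3.06 where the paper prints 3.07.

```python
def kappa(B):
    """Eq. (5): [kappa(B)]_ij = b_ii + b_jj - 2 b_ij."""
    n = len(B)
    return [[B[i][i] + B[j][j] - 2.0 * B[i][j] for j in range(n)]
            for i in range(n)]

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def tau(D, s):
    """Eq. (7): -1/2 (I - e s^T) D (I - s e^T), with s^T e = 1."""
    n = len(D)
    # (e s^T)[i][j] = s[j] and (s e^T)[i][j] = s[i]
    left = [[float(i == j) - s[j] for j in range(n)] for i in range(n)]
    right = [[float(i == j) - s[i] for j in range(n)] for i in range(n)]
    P = matmul(matmul(left, D), right)
    return [[-0.5 * x for x in row] for row in P]

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Matrix B of Example 2.1 (rounded entries as printed in the paper).
B = [[0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.15, -0.38, -0.38, -0.38],
     [0.0, -0.38, 1.15, -0.38, -0.38],
     [0.0, -0.38, -0.38, 1.15, -0.38],
     [0.0, -0.38, -0.38, -0.38, 1.15]]

D = kappa(B)
s_carbon = [1.0, 0.0, 0.0, 0.0, 0.0]   # assumption: origin at point 1
s_centroid = [0.2] * 5                 # s = e/n, origin at the centroid

B_back = tau(D, s_carbon)              # recovers B up to rounding error
D_round_trip = kappa(tau(D, s_centroid))
```

Because `kappa` ignores the choice of origin, the round trip `kappa(tau(D, s))` returns D for any admissible `s`, while `tau(D, s)` returns B itself only when the origin fixed by `s` matches the one used to build B.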
2.2 Conditions for a Euclidean Distance Matrix
The following theorem from [10] gives us necessary and sufficient conditions under which an arbitrary matrix is an EDM.
Theorem 2.1 A matrix $D \in R^{n \times n}$ is an EDM if and only if
$$-D = \sum_{j=1}^{r} v_j v_j^T + z e^T + e z^T + \alpha e e^T \qquad (8)$$
where the nonzero vectors $v_1, v_2, \ldots, v_r$ form an orthogonal set in $M = \{x \in R^n : x^T e = 0\}$, and
$$n\alpha = -\sum_{j=1}^{r} \|v_j\|^2, \qquad 2 z_i = -\alpha - \sum_{j=1}^{r} (v_{ij})^2, \qquad (9)$$
where $v_{ij}$ is the $i$-th component of $v_j$.
We use this result to define the fitness function for our GA.
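As a sanity check of Theorem 2.1, the sketch below assembles D from a small orthogonal set in M and compares it with directly computed squared distances. It assumes the reading of (8)-(9) in which the sign sits on D, that is, $-D = \sum_j v_j v_j^T + z e^T + e z^T + \alpha e e^T$; under that reading, comparing with (6) suggests that the points have coordinates $v_j/\sqrt{2}$ column by column. The function names and the test vectors are hypothetical choices for illustration.

```python
import math

def edm_from_vectors(vs):
    """Assemble D from orthogonal vectors v_1..v_r in M using (8)-(9)."""
    n = len(vs[0])
    # n * alpha = -sum_j ||v_j||^2
    alpha = -sum(x * x for v in vs for x in v) / n
    # 2 z_i = -alpha - sum_j (v_ij)^2
    z = [(-alpha - sum(v[i] ** 2 for v in vs)) / 2.0 for i in range(n)]
    # -D_ij = sum_j v_j[i] v_j[j] + z_i + z_j + alpha
    return [[-(sum(v[i] * v[j] for v in vs) + z[i] + z[j] + alpha)
             for j in range(n)] for i in range(n)]

def squared_distances(points):
    """Squared Euclidean distances between all pairs of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

# Three mutually orthogonal vectors in R^4, each with zero component sum.
vs = [[1.0, 1.0, -1.0, -1.0],
      [1.0, -1.0, 1.0, -1.0],
      [1.0, -1.0, -1.0, 1.0]]

D = edm_from_vectors(vs)

# Point i takes the i-th component of each vector, scaled by 1/sqrt(2).
points = [[v[i] / math.sqrt(2.0) for v in vs] for i in range(4)]
D_direct = squared_distances(points)
```

Here the four points form a regular tetrahedron, so both constructions give a zero diagonal and equal off-diagonal entries.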
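3 IMPLEMENTATION
Next, we discuss the implementation of the GA for the EDMs completion problem.

Figure 2: Structure of an individual (Vector 1, Vector 2, Vector 3).

3.1 Input and Data Structure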
The input data required are the population size, the maximum number of generations that can be reached, the crossover and mutation rates, and the precision or length of the chromosomes. In addition to the GA's parameters, the incomplete matrix information is necessary, including the known entries, the number of incomplete entries, and the location of the unknown entries in the matrix.
Each individual is represented by a structure as in Fig. 2. The chrom arrays are binary arrays that correspond to the genetic information. Initially, these arrays are randomly generated. The binary information is then translated, by a function called Decode, into the vector arrays. Each vector array corresponds to a column of the coordinates matrix. Recall that we are looking for three-dimensional configurations, so the dimension of the coordinates matrix is (number of points) × 3, and its i-th row corresponds to the coordinates of the i-th point in the configuration.
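The structure described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the encoding, so the fixed number of bits per coordinate (`bits`), the coordinate range `[lo, hi]`, the plain binary-to-real mapping, and the names `Individual`, `decode`, `random_individual` and `coordinates` are all hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Individual:
    chroms: list                                  # three binary arrays
    vectors: list = field(default_factory=list)   # three decoded columns
    fitness: float = float("inf")

def decode(chrom, n_points, bits, lo, hi):
    """Translate one binary array into n_points reals in [lo, hi]."""
    scale = (hi - lo) / (2 ** bits - 1)
    values = []
    for p in range(n_points):
        gene = chrom[p * bits:(p + 1) * bits]
        as_int = int("".join(str(b) for b in gene), 2)
        values.append(lo + as_int * scale)
    return values

def random_individual(n_points, bits, lo, hi, rng):
    """Random chrom arrays, decoded into the three vector arrays."""
    chroms = [[rng.randint(0, 1) for _ in range(n_points * bits)]
              for _ in range(3)]
    ind = Individual(chroms)
    ind.vectors = [decode(c, n_points, bits, lo, hi) for c in chroms]
    return ind

def coordinates(ind):
    """Coordinates matrix: row i holds the three coordinates of point i."""
    return [list(row) for row in zip(*ind.vectors)]

rng = random.Random(0)
ind = random_individual(n_points=5, bits=8, lo=-2.0, hi=2.0, rng=rng)
C = coordinates(ind)   # 5 rows (points) by 3 columns (vector arrays)
```

Keeping one binary array per coordinate column mirrors the statement that each vector array is a column of the coordinates matrix, and it lets the genetic operators act on each column separately.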
A function called Objfun assigns a fitness value to each individual. By using the necessary and sufficient conditions for an EDM and the information in the vector arrays, a new matrix is calculated, and the Frobenius norm of the difference between the input matrix and the new matrix is assigned as the fitness value of the individual. Given a matrix $A = [a_{ij}] \in R^{n \times n}$, the Frobenius norm is defined as $\|A\|_F = \left(\sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^2\right)^{1/2}$. Fitter individuals are chosen and crossover and mutation operators are applied in order to produce the next generation.
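A minimal sketch of such a fitness evaluation is given below. It is not the author's routine: here the candidate EDM is computed directly from the rows of the coordinates matrix rather than through Theorem 2.1, unknown positions are skipped through a boolean mask `known` (the paper treats unknown entries as zero in both matrices), and the tetrahedral sign pattern with entries ±0.62 is an assumed stand-in for the coordinates matrix of Example 2.1, whose printed form is not recoverable. All names are hypothetical.

```python
import math

def edm_from_coordinates(C):
    """Squared distances between the rows of the coordinates matrix."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in C] for p in C]

def fitness(D_input, known, C):
    """Frobenius norm of (D_input - D_candidate) over known entries only."""
    D_new = edm_from_coordinates(C)
    total = 0.0
    n = len(D_input)
    for i in range(n):
        for j in range(n):
            if known[i][j]:
                total += (D_input[i][j] - D_new[i][j]) ** 2
    return math.sqrt(total)

a = 0.62  # magnitude of the surviving entries of matrix (1)
# Assumed tetrahedral layout: carbon at the origin, four hydrogens.
C = [[0.0, 0.0, 0.0],
     [a, a, a],
     [a, -a, -a],
     [-a, a, -a],
     [-a, -a, a]]

D_ref = edm_from_coordinates(C)
known_all = [[True] * 5 for _ in range(5)]

# Incomplete input: one symmetric pair of entries is missing.
D_missing = [row[:] for row in D_ref]
known = [row[:] for row in known_all]
for i, j in ((1, 2), (2, 1)):
    D_missing[i][j] = 0.0
    known[i][j] = False

# A perturbed candidate: hydrogen 1 moved along the first axis.
C_bad = [row[:] for row in C]
C_bad[1][0] += 0.1
```

With this layout the squared carbon-hydrogen distance is about 1.153 and the squared hydrogen-hydrogen distance about 3.075, close to the 1.15 and 3.07 printed in matrix (3). A lower value of `fitness` means a better individual.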
3.2 Genetic Operators
Crossover and mutation are applied independently to each one of the chrom arrays. This makes the reproduction process more efficient and avoids a high level of interaction among genes in the chromosomes (epistasis).
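Applying the operators array by array can be sketched as below. The paper does not state which crossover or mutation variants were used, so single-point crossover and bit-flip mutation are assumptions here, as are the names `crossover`, `mutate` and `recombine`.

```python
import random

def crossover(a, b, rng):
    """Single-point crossover of two equal-length binary arrays."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chrom, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - g if rng.random() < p_mut else g for g in chrom]

def recombine(parent1, parent2, p_cross, p_mut, rng):
    """Each parent is a list of three binary arrays (one per column).

    The operators are applied to every pair of arrays separately, so a
    cut point chosen for one column never affects the others.
    """
    child1, child2 = [], []
    for a, b in zip(parent1, parent2):
        if rng.random() < p_cross:
            a2, b2 = crossover(a, b, rng)
        else:
            a2, b2 = a[:], b[:]
        child1.append(mutate(a2, p_mut, rng))
        child2.append(mutate(b2, p_mut, rng))
    return child1, child2

rng = random.Random(1)
zeros = [[0] * 8 for _ in range(3)]
ones = [[1] * 8 for _ in range(3)]
c1, c2 = recombine(zeros, ones, p_cross=1.0, p_mut=0.0, rng=rng)
```

In this usage example mutation is switched off, so every child array is a block of zeros followed by a block of ones (or the reverse), with an independent cut point per array.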
As pointed out before, the fitness values are assigned by calculating the Frobenius norm of the difference between the input matrix and the EDM corresponding to the individual. In this particular case, we do not account for unknown entries, so these positions are set to zero in both matrices.
There are two additional conditions on the vector arrays. First, $e^T v_j = 0$, and second, $\sum_{i,j=1,\, j \neq i}^{r} v_i^T v_j = 0$. These conditions are not present in the fitness function, but they are part of the selection process. This idea is an innovation, a combined effect of selection and recombination, as suggested by Goldberg in [16].
A tolerance parameter regulates the generation of individuals at each step. Usually, the initial value of the tolerance is 2Maxgen, where Maxgen is the maximum number of allowed generations. At each generation, the tolerance is reduced, guaranteeing that better individuals are selected. Initially, the tolerance constraint was applied to both offspring in the recombination, but it was observed that the performance of the GA seems to be better if the constraint is applied only to one offspring.
The crossover and mutation probability parameters are not fixed. The crossover probability decreases from 0.9 to 0.6, and the mutation probability decreases from 0.2 to 0.02 through the GA's run.
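One possible reading of these rules is sketched below. Several points are assumptions rather than facts from the paper: the decrease is taken to be linear in the generation number, the constraint is read as "the violation of the centering and orthogonality conditions on the vector arrays must not exceed the current tolerance", and the violation measure itself (a sum of absolute values) is a hypothetical choice. The initial tolerance is left as a parameter.

```python
def linear_schedule(start, end, generation, max_gen):
    """Value moving linearly from start (generation 0) to end (last one)."""
    frac = generation / max(1, max_gen - 1)
    return start + (end - start) * frac

def constraint_violation(vectors):
    """Sum of |e^T v_j| over j plus |v_i^T v_j| over all pairs i != j."""
    centering = sum(abs(sum(v)) for v in vectors)
    ortho = sum(abs(sum(x * y for x, y in zip(vectors[i], vectors[j])))
                for i in range(len(vectors))
                for j in range(len(vectors)) if i != j)
    return centering + ortho

def accept(vectors, tolerance):
    """Assumed acceptance rule: keep an offspring within the tolerance."""
    return constraint_violation(vectors) <= tolerance

max_gen = 100
p_cross_first = linear_schedule(0.9, 0.6, 0, max_gen)
p_cross_last = linear_schedule(0.9, 0.6, max_gen - 1, max_gen)
p_mut_last = linear_schedule(0.2, 0.02, max_gen - 1, max_gen)

# Orthogonal, centered vectors satisfy both conditions exactly.
good = [[1.0, 1.0, -1.0, -1.0],
        [1.0, -1.0, 1.0, -1.0],
        [1.0, -1.0, -1.0, 1.0]]
bad = [[1.0, 1.0, 1.0, 1.0]] * 3
```

Following the observation reported above, a run would call `accept` on only one of the two offspring produced by each recombination, with the tolerance itself lowered from generation to generation.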
3.3 Additional Comments
The inclusion of innovation, tolerance, and variable crossover and mutation probability parameters makes this GA a very dynamic one. Manipulating each improves the performance of the GA. However, further research is necessary to determine the influence of these factors on the GA's performance.
Also, the fitness calculation is slow. Theoretical results dealing with the conditions for EDMs should help to reduce the time of this part of the algorithm.
The code has been tested on Sun Sparc and Silicon Graphics workstations.

4 NUMERICAL EXAMPLES
Example 4.2 Consider again the configuration in Example 2.1, and suppose that one of the entries has been missed. For example, the input matrix is given by
[input matrix not preserved in the source]
The best configuration obtained by the GA is given by the coordinates matrix
[coordinates matrix not preserved in the source]
This coordinates matrix generates the following EDM
[matrix (12), only partly preserved in the source: its last row reads 1.149, 1.153, 1.147, 1.156, 0, and the other surviving off-diagonal entries are 3.056, 3.062 and 3.069]
whose fitness value is 0.0439.

Example 4.3 The code has been run for several incomplete matrices corresponding to molecules with different numbers of atoms and unknown distances. In Table 1, some of the results obtained are shown. In each case, the obtained fitness value implies that the relative error at each element of the resultant matrix is not greater than 0.01, so that a good completion of the input matrix is obtained.

Table 1: Completion of molecular configurations
[table body not preserved in the source]

5 CONCLUSIONS
A genetic algorithm to solve the Euclidean distance matrices completion problem has been developed. Also, the algorithm determines three-dimensional configurations that generate the complete matrices. This genetic algorithm has allowed us to obtain good results in several study cases.
The basic algorithm can be modified to work in applications regarding image enhancement, statistics, and mathematical programming.
Further research includes implementing parallel versions, improving performance using knowledge from the positive definite completion problem, and establishing optimal parameters for the genetic algorithm.

Acknowledgments
This work was supported in part by the National Science Foundation, through funding provided to the Engineering Research Center for Computational Field Simulation.
The author would like to thank Dr. Pablo Tarazaga for sharing his insights on Euclidean distance matrices, and the anonymous reviewers for their helpful suggestions and comments on this work.

References
[1] Barrett W.W., Johnson C.R., and Loewy R. "The real positive definite completion problem: cycle completability." Memoirs of the American Mathematical Society, 122 (1996): No 584.
[2] Crippen G.M., and Havel T.F. Distance Geometry and Molecular Conformation. New York: John Wiley and Sons Inc., 1988.
[3] Critchley F. "On certain linear mappings between inner product and squared-distance matrices." Linear Algebra and its Applications, 105 (1988): 91-107.
[4] De Leeuw J., and Heiser W. "Theory of Multidimensional Scaling." Handbook of Statistics, Edited by P. R. Krishnaiah. Amsterdam: North-Holland, 1982.
[5] Glunt W., Hayden T.L., and Raydan M. "Molecular conformations from distance matrices." Journal of Computational Chemistry, 14 (1993): 114-120.
[6] Goldberg D.E. Genetic Algorithms in Search, Optimization and Machine Learning. Reading: Addison-Wesley, 1989.
[7] Goldberg D.E. "A meditation on the applications of genetic algorithms." IlliGAL Report No. 98003, February 1998. University of Illinois at Urbana-Champaign.
[8] Gower J.C. "Euclidean distance geometry." Math. Scientist, 7 (1982): 1-14.
[9] Gower J.C. "Properties of Euclidean and non-Euclidean distance matrices." Linear Algebra and its Applications, 67 (1985): 81-97.
[10] Hayden T.L., Wells J., Liu W., and Tarazaga P. "The cone of distance matrices." Linear Algebra and its Applications, 144 (1991): 153-169.
[11] Holland J.H. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
[12] Jin A.Y., Leung F.Y., and Weaver D.F. "Development of a novel genetic algorithm search method (GAP1.0) for exploring peptide conformational space." Journal of Computational Chemistry, 18 (1997): 1971-1984.
[13] Judson R.S., Colvin M.E., Meza J.C., Huffer A., Gutierrez D. "Do intelligent configuration search techniques outperform random search for large molecules?" International Journal of Quantum Chemistry, 45 (1992): 503-528.
[14] Rivera-Gallego W. "Matrices de distancia con estructuras especiales." MSc. Thesis, University of Puerto Rico, Mayaguez, 1994.
[15] Rivera-Gallego W. "A genetic algorithm for circulant Euclidean distance matrices." Journal of Applied Mathematics and Computing, 97 (1998): 197-208.
[16] Schoenberg I. "Remarks to Maurice Frechet's Article, Sur la definition axiomatique d'une classe d'espaces vectoriels distanciés applicables vectoriellement sur l'espace de Hilbert." Ann. Math, 36 (1935): 724-732.

Author. Wilson Rivera-Gallego is a Ph.D. candidate in computational engineering at Mississippi State University. He has been granted a research assistantship from the NSF/Engineering Research Center for Computational Field Simulation. He received his M.Sc. degree in computational mathematics from the University of Puerto Rico. His current research interests include computational fluid dynamics, environmental quality modeling, and genetic algorithms.