A GENETIC ALGORITHM FOR SOLVING THE EUCLIDEAN DISTANCE MATRICES COMPLETION PROBLEM

Wilson Rivera-Gallego
NSF Engineering Research Center for Computational Field Simulation
Mississippi State University
Mississippi State, MS 39762, USA
[email protected]

Key words. genetic algorithms, Euclidean distance matrices, molecular configurations

ABSTRACT

This paper presents a genetic algorithm to solve the Euclidean distance matrices completion problem and to determine three-dimensional configurations of points which generate the corresponding Euclidean distance matrices. This new approach becomes important for applications in molecular conformation problems where it is necessary to determine molecular configurations from incomplete nuclear magnetic resonance data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC 99, San Antonio, Texas. ©1998 ACM 1-58113-086-4/99/0001 $5.00

1 INTRODUCTION

One of the most important problems in computational chemistry is the design of molecules such as enzymes, anti-cancer agents and catalysts for material processing. The design of these molecules depends upon the precise determination of the three-dimensional configurations of atoms generating the molecules[2].

In the absence of crystallographic or spectroscopic experimental data, the search for a molecular structure can be formulated as an optimization problem under the assumption that the structure of the molecule corresponds to a conformation for which the energy is near the global minimum. Because there exist a large number of local minima, standard optimization methods do not work very well. Thus, Genetic Algorithms (GAs) have been applied to the molecular conformation problem[12,13].

Genetic algorithms are search algorithms based on the genetic processes observed in natural evolution. Theoretical concepts were first introduced by Holland[11] and later described by Goldberg[6]. GAs work with a population of individuals. Each individual (chromosome) consists of a string of characters (genes) and represents a possible solution in a search space. A fitness value is assigned to each individual according to a problem-specific fitness function. By using a selection operator, fitter individuals have a higher probability of being chosen to be recombined in order to get offspring with better fitness values. Offspring are generated by a crossover operator, which combines genes from two selected individuals, and a mutation operator, which changes genes in the chromosomes with a certain probability. Mutation allows for population diversity so that the GA does not converge prematurely and important information is not lost. This process is repeated in order to find an optimal solution to the particular problem.

The first references to Euclidean Distance Matrices (EDMs) appeared in multidimensional scaling[4]. Members of a multidimensional sample are represented by points in some geometrical space, usually Euclidean. It is possible to build up arrays of numbers, called ordering diagrams, where pairs of close points usually represent similar samples and faraway points represent dissimilar samples. The goal is to use this dissimilarity matrix in order to describe some characteristics of the sample. For example, in psychometry each element in the dissimilarity matrix may represent the dissimilarity of a pair of stimuli.

Schoenberg[16] found a natural parametrization of the EDMs cone from the symmetric positive semidefinite matrices. Gower[8,9] described some properties of the EDMs and pointed out some connections between these properties and the configurations of points that generate the matrices. Critchley[3] worked on the spectral decomposition of EDMs. Hayden et al.[10] obtained Gower's results, but in a simple way that provided a more geometrical view of the EDMs cone. The author[14] studied some special structures of EDMs.

The relationship between EDMs and the molecular conformation problem appears when we seek the conformation of biological molecules from nuclear magnetic resonance (NMR) data[5]. Interatomic distances are measured using NMR and arranged in a matrix. Thus, the geometry of the atoms forming the molecule can be determined from the structure of the matrix[10,14]. In this context a GA for circulant EDMs was presented in [15].

This paper describes a GA for solving the Euclidean distance matrices completion problem. In the measuring process, information may be missed, giving rise to incomplete EDMs. In order to solve a matrix completion problem, we attempt to determine the unspecified entries of a partially defined matrix so that the complete matrix satisfies the desired properties. The positive definite completion problem, which is closely related to the EDMs completion problem, has been studied by Barrett et al.[1].

This paper is organized as follows. Section 2 introduces definitions and preliminary results. Section 3 describes the implementation of the GA for the EDMs completion problem. Section 4 shows practical examples. Finally, Section 5 gives the conclusions.

2 EUCLIDEAN DISTANCE MATRICES

If there exist $n$ points $p_1, p_2, \ldots, p_n$ in $R^k$ such that $\|p_i - p_j\|^2 = d_{ij}^2$, then $D = [d_{ij}^2] \in R^{n \times n}$ is a Euclidean Distance Matrix. The minimum value of $k$ is called the embedding dimension.

A symmetric matrix $A \in R^{n \times n}$ is positive semidefinite if $z^T A z \ge 0$ for all $z \in R^n$, $z \ne 0$. In particular, if it is possible to write a matrix $A \in R^{n \times n}$ as a product $A = CC^T$, then $A$ is symmetric positive semidefinite.

A matrix $C \in R^{n \times k}$ such that the $i$-th row corresponds to the coordinates of the point $p_i$ is called a Coordinates Matrix.

Note that given a configuration of $n$ points $p_1, p_2, \ldots, p_n$ in $R^k$ and the corresponding coordinates matrix $C$, the matrix $B = CC^T$ is symmetric positive semidefinite. Also

$$b_{ij} = p_i^T p_j \quad \text{for } i, j = 1, 2, \ldots, n,$$

and in particular

$$b_{ii} = \|p_i\|^2 \quad \text{for } i = 1, 2, \ldots, n.$$

Figure 1: Molecular configuration of the methane.

Example 2.1 Consider the configuration of the methane (CH4) in Fig. 1. The coordinates matrix is given by

C = [5 × 3 coordinates matrix; entries not fully legible, surviving values are ±0.62]   (1)

$$B = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1.15 & -0.38 & -0.38 & -0.38 \\ 0 & -0.38 & 1.15 & -0.38 & -0.38 \\ 0 & -0.38 & -0.38 & 1.15 & -0.38 \\ 0 & -0.38 & -0.38 & -0.38 & 1.15 \end{pmatrix} \qquad (2)$$

and the EDM is

$$D = \begin{pmatrix} 0 & 1.15 & 1.15 & 1.15 & 1.15 \\ 1.15 & 0 & 3.07 & 3.07 & 3.07 \\ 1.15 & 3.07 & 0 & 3.07 & 3.07 \\ 1.15 & 3.07 & 3.07 & 0 & 3.07 \\ 1.15 & 3.07 & 3.07 & 3.07 & 0 \end{pmatrix} \qquad (3)$$

2.1 Parametrization of Euclidean Distance Matrices

Let us define $\Omega_n$ as the set of symmetric positive semidefinite matrices, and $\Lambda_n$ as the set of Euclidean distance matrices.

It is clear that

$$d_{ij}^2 = \|p_i\|^2 + \|p_j\|^2 - 2 p_i^T p_j = b_{ii} + b_{jj} - 2 b_{ij} \qquad (4)$$

Because $B = [b_{ij}]$ is a positive semidefinite matrix, the above expression is a natural parametrization of the EDMs

$$\kappa : \Omega_n \to \Lambda_n, \qquad [\kappa(B)]_{ij} = b_{ii} + b_{jj} - 2 b_{ij}. \qquad (5)$$

The map $\kappa$ can be expressed in matrix form by

$$\kappa(B) = b e^T + e b^T - 2B \qquad (6)$$

where $b = \mathrm{diag}(B)$ and $e$ is the vector whose components are all equal to 1.

Each EDM corresponds to a different positive semidefinite matrix depending on the coordinate system. Thus, the function $\kappa$ is not an injection. In order to get an inverse transformation of $\kappa$, it is necessary to define the map

$$\tau_s(D) = -\tfrac{1}{2}\,(I - e s^T)\, D\, (I - s e^T) \qquad (7)$$

such that $s^T e = 1$, where $s$ is a vector in $R^n$ that fixes the coordinate system. When $s = e/n$, the origin of the coordinate system corresponds to the centroid of the configuration of points[14].

In Example 2.1, it can be verified that $\kappa(B) = D$ and $\tau_s(D) = B$.

2.2 Conditions for a Euclidean Distance Matrix

The following theorem from [10] gives necessary and sufficient conditions under which an arbitrary matrix is an EDM.

Theorem 2.1 A matrix $D \in R^{n \times n}$ is an EDM if and only if

$$D = -2 \sum_{j=1}^{r} v_j v_j^T + z e^T + e z^T + \alpha e e^T \qquad (8)$$

where the nonzero vectors $v_1, v_2, \ldots, v_r$ form an orthogonal set in $M = \{x \in R^n : x^T e = 0\}$, and

$$n\alpha = 2 \sum_{j=1}^{r} \|v_j\|^2, \qquad 2 z_i = -\alpha + 2 \sum_{j=1}^{r} (v_{ij})^2 \qquad (9)$$

where $v_{ij}$ is the $i$-th component of $v_j$.

We use this result to define the fitness function for our GA.

3 IMPLEMENTATION

Next, we discuss the implementation of the GA for the EDMs completion problem.

3.1 Input and Data Structure

The input data required are the population size, the maximum number of generations that can be reached, the crossover and mutation rates, and the precision or length of the chromosomes. In addition to the GA's parameters, the incomplete matrix information is necessary, including the known entries, the number of incomplete entries, and the location of the unknown entries in the matrix.

Each individual is represented by a structure as in Fig. 2.

Figure 2: Structure of an individual (Vector 1, Vector 2, Vector 3).

The chrom arrays are binary arrays that correspond to the genetic information. Initially, these arrays are randomly generated. The binary information is then translated, by using a function called Decode, into the vector arrays. Each vector array corresponds to a column of the coordinates matrix. Recall that we are looking for three-dimensional configurations, so the dimension of the coordinates matrix is (number of points) × 3, and its $i$-th row corresponds to the coordinates of the $i$-th point in the configuration. A function called Objfun assigns a fitness value to each individual.
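The data structure of Section 3.1 can be illustrated with a minimal sketch. The paper does not give the details of Decode, so everything below is an assumption made for illustration: the fixed-point decoding rule, the coordinate range `[lo, hi]`, the dictionary layout, and the names `decode`, `make_individual` and `coordinates_matrix` are all hypothetical.

```python
import random


def decode(chrom, n_points, lo=-2.0, hi=2.0):
    """Translate one binary chrom array into n_points real values in [lo, hi].

    Assumption: each value uses an equal share of the bits and is decoded
    as an unsigned integer scaled linearly onto [lo, hi].
    """
    bits_per_value = len(chrom) // n_points
    max_int = 2 ** bits_per_value - 1
    values = []
    for k in range(n_points):
        gene = chrom[k * bits_per_value:(k + 1) * bits_per_value]
        as_int = int("".join(map(str, gene)), 2)
        values.append(lo + (hi - lo) * as_int / max_int)
    return values


def make_individual(n_points, bits_per_value, rng):
    """Build an individual with three random chrom arrays, one per axis."""
    chrom = [[rng.randint(0, 1) for _ in range(n_points * bits_per_value)]
             for _ in range(3)]
    vector = [decode(c, n_points) for c in chrom]
    return {"chrom": chrom, "vector": vector, "fitness": None}


def coordinates_matrix(individual):
    """Use the three vector arrays as the columns of an n x 3 matrix."""
    vectors = individual["vector"]
    n_points = len(vectors[0])
    return [[vectors[axis][i] for axis in range(3)] for i in range(n_points)]


rng = random.Random(0)
individual = make_individual(n_points=5, bits_per_value=10, rng=rng)
C = coordinates_matrix(individual)  # row i holds the coordinates of point i
```

Keeping one chrom array per coordinate axis mirrors the statement that each vector array is a column of the coordinates matrix; the number of bits per value plays the role of the precision parameter.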
Using the necessary and sufficient conditions for an EDM and the information in the vector arrays, a new matrix is calculated, and the Frobenius norm of the difference between the input matrix and the new matrix is assigned as the fitness value of the individual. Given a matrix $A = [a_{ij}] \in R^{n \times n}$, the Frobenius norm is defined as

$$\|A\|_F = \Big( \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^2 \Big)^{1/2}.$$

Fitter individuals are chosen, and crossover and mutation operators are applied in order to produce the next generation.

3.2 Genetic Operators

Crossover and mutation are applied independently to each one of the chrom arrays. This makes the reproduction process more efficient and avoids a high level of interaction among genes in the chromosomes (epistasis).

As pointed out before, the fitness values are assigned by calculating the Frobenius norm of the difference between the input matrix and the EDM corresponding to the individual. In this particular case, we do not account for unknown entries, so these positions are set to zero in both matrices.

There are two additional conditions on the vector arrays, both taken from Theorem 2.1. First, $e^T v_j = 0$ for each $j$, and second, $\sum_{i,j=1,\, j \ne i}^{3} v_i^T v_j = 0$. These conditions are not present in the fitness function, but they are part of the selection process. This idea is an innovation, a combined effect of selection and recombination, as suggested by Goldberg in [7].

A tolerance parameter regulates the generation of individuals at each step. Usually, the initial value of the tolerance is 2Maxgen, where Maxgen is the maximum number of allowed generations. At each generation, the tolerance is reduced, so that progressively better individuals are selected. Initially, the tolerance constraint was applied to both offspring in the recombination, but it was observed that the performance of the GA seems to be better if the constraint is applied to only one offspring.

The crossover and mutation probability parameters are not fixed.
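The fitness evaluation described in Sections 3.1 and 3.2 can be sketched as follows. This is a simplified illustration, not the paper's Objfun: it builds the candidate EDM directly from the coordinates through equation (4) instead of through Theorem 2.1, and it marks unknown entries with `None` and skips them, which has the same effect as setting them to zero in both matrices. The coordinates `C` are an assumed tetrahedral layout for methane with the carbon atom at the origin, and the choice of which entry is missing is likewise hypothetical.

```python
import math


def edm_from_coordinates(C):
    """Squared-distance matrix with entries b_ii + b_jj - 2 b_ij, B = C C^T."""
    n = len(C)
    B = [[sum(x * y for x, y in zip(C[i], C[j])) for j in range(n)]
         for i in range(n)]
    return [[B[i][i] + B[j][j] - 2.0 * B[i][j] for j in range(n)]
            for i in range(n)]


def fitness(C, partial):
    """Frobenius norm of (partial - D) over the known entries only."""
    D = edm_from_coordinates(C)
    total = 0.0
    for i, row in enumerate(partial):
        for j, known in enumerate(row):
            if known is not None:  # unknown entries contribute nothing
                total += (known - D[i][j]) ** 2
    return math.sqrt(total)


# Assumed methane geometry: carbon at the origin, hydrogens on a tetrahedron.
h = 0.62
C = [[0.0, 0.0, 0.0],
     [h, h, h],
     [h, -h, -h],
     [-h, h, -h],
     [-h, -h, h]]

# Matrix (3) of Example 2.1 with one symmetric pair of entries missing.
partial = [[0.0, 1.15, 1.15, 1.15, 1.15],
           [1.15, 0.0, None, 3.07, 3.07],
           [1.15, None, 0.0, 3.07, 3.07],
           [1.15, 3.07, 3.07, 0.0, 3.07],
           [1.15, 3.07, 3.07, 3.07, 0.0]]

value = fitness(C, partial)
D = edm_from_coordinates(C)
completed_entry = D[1][2]  # the candidate's value for the missing distance
```

A small fitness value means the candidate configuration reproduces every known entry closely, and the missing entries are then read off the candidate's own EDM.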
The crossover probability decreases from 0.9 to 0.6, and the mutation probability decreases from 0.2 to 0.02, over the course of the GA's run.

3.3 Additional Comments

The inclusion of innovation, tolerance, and variable crossover and mutation probability parameters makes this GA a very dynamic one. Manipulating each improves the performance of the GA. However, further research is necessary to determine the influence of these factors on the GA's performance.

Also, the fitness calculation is slow. Theoretical results dealing with the conditions for EDMs should help to reduce the time of this part of the algorithm.

The code has been tested on Sun Sparc and Silicon Graphics workstations.

4 NUMERICAL EXAMPLES

Example 4.2 Consider again the configuration in Example 2.1, and suppose that one of the entries has been missed. For example, the input matrix is given by

[matrix (10): not legible]

The best configuration obtained by the GA is given by the coordinates matrix

[matrix (11): not legible]

This coordinates matrix generates the following EDM

[matrix (12): 5 × 5 EDM, only partly legible; last row: 1.149 1.153 1.147 1.156 0; other surviving entries: 3.056, 3.062, 3.069]   (12)

whose fitness value is 0.0439.

Example 4.3 The code has been run for several incomplete matrices corresponding to molecules with different numbers of atoms and unknown distances. In Table 1, some of the results obtained are shown. In each case, the obtained fitness value implies that the relative error at each element of the resultant matrix is not greater than 0.01, so that a good completion of the input matrix is obtained.

Table 1: Completion of molecular configurations

5 CONCLUSIONS

A genetic algorithm to solve the Euclidean distance matrices completion problem has been developed. Also, the algorithm determines three-dimensional configurations that generate the complete matrices. This genetic algorithm has allowed us to obtain good results in several study cases.

The basic algorithm can be modified to work in applications regarding image enhancement, statistics, and mathematical programming.

Further research includes implementing parallel versions, improving performance using knowledge from the positive definite completion problem, and establishing optimal parameters for the genetic algorithm.

Acknowledgments

This work was supported in part by the National Science Foundation, through funding provided to the Engineering Research Center for Computational Field Simulation.

The author would like to thank Dr. Pablo Tarazaga for sharing his insights on Euclidean distance matrices, and the anonymous reviewers for their helpful suggestions and comments on this work.

References

[1] Barrett W.W., Johnson C.R., and Loewy R. "The real positive definite completion problem: cycle completability." Memoirs of the American Mathematical Society, 122 (1996): No 584.

[2] Crippen G.M., and Havel T.F. Distance Geometry and Molecular Conformation. New York: John Wiley and Sons Inc., 1988.

[3] Critchley F. "On certain linear mappings between inner product and squared-distance matrices." Linear Algebra and its Applications, 105 (1988): 91-107.

[4] De Leeuw J., and Heiser W. "Theory of Multidimensional Scaling." Handbook of Statistics, Edited by P. R. Krishnaiah. Amsterdam: North-Holland, 1982.

[5] Glunt W., Hayden T.L., and Raydan M. "Molecular conformations from distance matrices." Journal of Computational Chemistry, 14 (1993): 114-120.

[6] Goldberg D.E. Genetic Algorithms in Search, Optimization and Machine Learning. Reading: Addison-Wesley, 1989.

[7] Goldberg D.E. "A meditation on the applications of genetic algorithms." IlliGAL Report No. 98003, February 1998. University of Illinois at Urbana-Champaign.

[8] Gower J.C. "Euclidean distance geometry." Math. Scientist, 7 (1982): 1-14.

[9] Gower J.C. "Properties of Euclidean and non-Euclidean distance matrices." Linear Algebra and its Applications, 67 (1985): 81-97.

[10] Hayden T.L., Wells J., Liu W., and Tarazaga P. "The cone of distance matrices." Linear Algebra and its Applications, 144 (1991): 153-169.

[11] Holland J.H. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.

[12] Jin A.Y., Leung F.Y., and Weaver D.F. "Development of a novel genetic algorithm search method (GAP1.0) for exploring peptide conformational space." Journal of Computational Chemistry, 18 (1997): 1971-1984.

[13] Judson R.S., Colvin M.E., Meza J.C., Huffer A., Gutierrez D. "Do intelligent configuration search techniques outperform random search for large molecules?" International Journal of Quantum Chemistry 45 (1992): 503-528.

[14] Rivera-Gallego W. "Matrices de distancia con estructuras especiales." MSc. Thesis, University of Puerto Rico, Mayaguez, 1994.

[15] Rivera-Gallego W. "A genetic algorithm for circulant Euclidean distance matrices." Journal of Applied Mathematics and Computing, 97 (1998): 197-208.

[16] Schoenberg I. "Remarks to Maurice Frechet's Article, Sur la definition axiomatique d'une classe d'espaces vectoriels distanciés applicables vectoriellement sur l'espace de Hilbert." Ann. Math, 36 (1935): 724-732.

Author. Wilson Rivera-Gallego is a Ph.D. candidate in computational engineering at Mississippi State University. He has been granted a research assistantship from the NSF Engineering Research Center for Computational Field Simulation. He received his M.Sc. degree in computational mathematics from University of Puerto Rico. His current research interests include computational fluid dynamics, environmental quality modeling, and genetic algorithms.