Localizing compact set of genes involved in cancer diseases using an evolutionary connectionist approach

M. A. Esseghir, S. Ben Yahia, and S. Abdelhak

Departement des Sciences de l’Informatique, Campus Universitaire, 1060 Tunis, Tunisie.
Institut Pasteur de Tunis

Abstract. In this paper, we present a model capable of predicting the cancerous or normal state associated with each biological situation, using a hybrid system based on genetic algorithms and artificial neural networks. The experiments carried out have shown very promising results in terms of predictive accuracy and generalization. Moreover, the compact set provided by the model includes well-known genes involved in different cancer diseases.

Keywords: Gene expression analysis, Genetic algorithms, Artificial neural networks, Supervised learning.

1 Introduction

Recent technological advances in microarrays allow us to measure simultaneously the expression level of thousands of genes in a given tissue at a given moment, resulting in massive data sets of gene expression profiles. Data mining tools can be used to promote the understanding of gene function, regulation and interactions. Several techniques have been used for the analysis of gene expression data, ranging from statistical models to artificial intelligence techniques, including hierarchical clustering, discriminant analysis and neural networks. A review of the literature shows a growing interest in artificial neural network approaches to gene expression analysis and modelling.

In this paper, we focus on two artificial intelligence techniques: genetic algorithms as a feature selector and neural networks as a classifier.

The remainder of the paper is organized as follows. The second section reviews neural-network-based models used in gene expression profiling. The third section presents an overview of genetic algorithms and their advantages in combinatorial search.
The fourth section details the proposed approach. The fifth section describes the methodology used to handle the available data from a biological point of view. The sixth section presents the experimental results yielded by the model.

2 Neural networks models

Artificial neural networks are stochastic models that simulate the brain’s abilities in learning, control and pattern recognition. The decisions they take are always based on previously learned experience. A variety of studies have addressed neural networks (NN) in gene expression and transcriptome analysis [1–3].

Supervised neural networks: MLP-BP

The majority of gene expression studies based on supervised neural networks use a multilayer feed-forward NN as the architecture and back-propagation, or one of its derivatives, as the learning algorithm. For instance, Wu [2] developed a system called gene classification artificial neural system (geneCANS), which is based on the back-propagation algorithm with a feed-forward architecture using one hidden layer. geneCANS was designed to classify new unknown sequences into predefined (known) classes.

Unsupervised neural networks: Self-Organizing-Map (SOM)

The Self-Organizing Map (SOM) is also known as the Kohonen network [4]. After a training stage, the SOM tries to separate its outputs into categories, called clusters. This technique has been applied to cluster together genes with closely related expression profiles. P. Toronen et al. [3] used the SOM to analyze changes in gene expression during a diauxic shift (shift from anaerobic to aerobic metabolism) in yeast. J. Huang et al. [5] developed a hybrid model combining the capabilities of supervised and unsupervised neural networks: the SOM was used for clustering, and the ANN was used to extract the interconnections between gene clusters.
In this paper, we propose a model, based on genetic algorithms and neural networks, capable of selecting a compact set of genes with the highest predictive accuracy in the detection and classification of cancer diseases. The experimental results show that the model achieves around 96% accuracy in predicting the cancer associated with each biological situation when using the set of genes it provides.

3 Genetic Algorithms

Genetic Algorithms (GA) are artificial intelligence global search techniques. GA attempt to apply evolutionary techniques to problem solving, notably function optimization [6]. They have proven to be valuable in searching large and complex problem spaces [7]. The GA process is based on natural selection, crossover and mutation, which are repeatedly applied to a population of potential solutions, simulating Darwinian evolution. Each solution is coded as a binary string called a chromosome. At each generation, a number of new solutions are created through crossover and mutation operations, and the best solutions are selected. This process evolves iteratively and comes to an end when a good solution is obtained. A genetic evolution process is described by Figure 1.

Fig. 1. Genetic evolution process

Each of these solutions is assigned a fitness value depending on how well it solves the problem. Individuals are evaluated according to their associated fitness value.

In the following section, we will use a model based on genetic algorithms and artificial neural networks, in which the GA plays the key role of feature selector.

4 The proposed model

Believing that combining classifiers and boosting methods can lead to improvements in performance, we propose in this article a new hybrid model based mainly on two stochastic models: artificial neural networks and genetic algorithms. The following paragraphs illustrate how ANN and GA can be combined and applied to gene expression modelling.
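As a concrete illustration of the selection, crossover and mutation operators described in Section 3, the following minimal Python sketch operates on binary chromosomes. It is not the authors' implementation: the function names are hypothetical, one-point crossover and bit-flip mutation are assumed because the paper does not name the operator variants, and a lower fitness value is assumed to be better.

```python
import random


def tournament_select(population, fitnesses, t, rng):
    """Sample t distinct individuals and return a copy of the fittest one.

    Assumption of this sketch: lower fitness is better.
    """
    contenders = rng.sample(range(len(population)), t)
    best = min(contenders, key=lambda i: fitnesses[i])
    return population[best][:]


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def bit_flip_mutation(chromosome, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in chromosome]


# Usage on a toy population of 6-bit chromosomes with made-up fitness values.
rng = random.Random(42)
population = [[rng.randint(0, 1) for _ in range(6)] for _ in range(4)]
fitnesses = [0.40, 0.10, 0.25, 0.30]
parent_a = tournament_select(population, fitnesses, 2, rng)
parent_b = tournament_select(population, fitnesses, 2, rng)
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
child_a = bit_flip_mutation(child_a, 0.2, rng)
```

Passing an explicit `random.Random` instance keeps runs reproducible, which is convenient when comparing parameter settings.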
The model presented in this study consists of a two-stage evolutionary optimization process [8]. The first stage selects the best set of inputs having a predictive relationship to the target (Fig. 2). The second stage consists in training the neural network constructed using the selected genes. The genetic algorithm presented here is based on the standard algorithm of Goldberg [6, 7]. We will apply it to find the optimal set of genes involved in the prediction of cancer diseases. A set of candidate solutions is evolved through a fixed number of generations.

Fig. 2. The proposed model

Representation

Each generation consists of a set of candidate solutions. Each solution (chromosome) is represented by a binary string of 822 bits, one per gene, where 1 means that the gene is selected as an ANN input and 0 means that it is not.

Initialization

An initial set of solutions is randomly generated. For each individual, a number of genes are randomly selected by setting the corresponding bits to 1. Once the initial set is generated, the evaluation process starts and a fitness value is assigned to each chromosome. The first generation of solutions is derived by applying tournament selection to the evaluated set.

Evaluation

Two steps are needed to evaluate each chromosome. First, a neural network taking the variables selected in the chromosome as inputs is built and trained. Next, the trained network is evaluated on the test set, which presents the network with new data that was not used for training. The chromosome evaluation thus assesses the generalization ability of the neural network and, consequently, of the set of genes involved. The idea of using the test set in the evaluation was taken from the work of M. Aleixandre et al. [9]. Our fitness involves two evaluation criteria: the number of incorrectly classified instances and the mean square error on the test set.
$$ \mathrm{fitness} = \frac{ICI + TMSE}{2} \qquad (1) $$

where ICI and TMSE designate, respectively, the proportion of incorrectly classified instances in the data set and the mean square error found on the test set. The predictive power of the ANN was assessed on the basis of the number of correctly classified instances (CCI) and the MSE on the test set.

Algorithm 1: Genes selector algorithm
Input:
  S: set of available genes
  Ni: initial population size
  N: population size
  t: tournament size
  p_mut: mutation probability
  p_cross: crossover probability
  Maxgen: number of generations
Output:
  S1: best subset of input predictors
Begin
  i = 0
  Population P0, P, Ptmp
  P0 = P = Ptmp = ∅
  P0 = GenerateInitialPopulation(Ni)
  Evaluate(P0)
  P = Select(P0, N, t)
  While i < Maxgen do
    Ptmp = Select(P, N, t)
    Crossover(Ptmp, p_cross)
    Mutate(Ptmp, p_mut)
    Evaluate(Ptmp)
    Replace(Ptmp, P)
    i = i + 1
  End While
  S1 = P.bestChromosome().decode()
  Return(S1)
End

5 Methodology

The longest association rules generated in the study of Becquet et al. [10] corresponded mainly to genes encoding ribosomal proteins. This is not surprising, since these genes have an important function in protein synthesis. For this reason, we decided to discard the set of tags corresponding to ribosomal genes. This might lead to the identification of predictive genes otherwise hidden by the presence of the large set of ribosomal proteins.

6 Experiments and interpretation

Data description and preprocessing

For our experiments, we decided to use the data set provided by the ’Challenge Discovery 2005’ site. It is based on the expression of 822 genes in 74 biological situations (the ”Small Data set”). We add to the matrix of expression levels the state of the corresponding biological situation, represented by the column ’Class 2’.
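To make Equation (1) concrete, the sketch below decodes a binary chromosome into the indices of the selected genes and computes the fitness from a network's outputs on a test set. It is only an illustration under stated assumptions: the function names are hypothetical, a single output in [0, 1] with a 0.5 decision threshold is assumed for the binary case (the paper does not give the output encoding), and since both terms are errors, a smaller fitness is taken to be better.

```python
def decode(chromosome):
    """Return the indices of the genes whose bit is 1 (the ANN inputs)."""
    return [index for index, bit in enumerate(chromosome) if bit == 1]


def fitness(targets, outputs, threshold=0.5):
    """Equation (1): average of the misclassification rate and the test MSE.

    targets: expected classes (0 or 1) for the test instances.
    outputs: network outputs in [0, 1] for the same instances.
    threshold: assumed decision threshold (not specified in the paper).
    """
    n = len(targets)
    # ICI: proportion of incorrectly classified instances.
    ici = sum((out >= threshold) != bool(t) for t, out in zip(targets, outputs)) / n
    # TMSE: mean square error on the test set.
    tmse = sum((t - out) ** 2 for t, out in zip(targets, outputs)) / n
    return (ici + tmse) / 2


# Usage with made-up values: one of four instances is misclassified.
selected_genes = decode([0, 1, 1, 0, 1])   # genes 1, 2 and 4 are selected
targets = [1, 0, 1, 0]
outputs = [0.9, 0.2, 0.4, 0.1]
score = fitness(targets, outputs)          # ICI = 0.25, TMSE = 0.105
```

In this toy case the fitness is approximately 0.1775. Averaging the two terms means a chromosome is penalized both for outright misclassifications and for low-confidence outputs that happen to fall on the right side of the threshold.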
All neural network classifiers evolved by the genetic algorithm proposed in our model are assessed according to their accuracy in predicting the state of each biological situation (normal or cancer), together with the respective cancer type. ANN require input data to be close to the range 0 to 1. The method selected for normalization is linear transform scaling.

Results of experiments

During the experiments we tested different values for each parameter. The best results were obtained using the neural network and genetic algorithm parameters given in Table 1.

GA parameters                                 ANN parameters
Parameter                          Value      Parameter                      Value
Number of generations              200        Number of iterations           50
Crossover probability (p_cross)    85%        Learning rate (η)              0.25
Mutation probability (p_mut)       20%        Number of hidden nodes         10
Initial population size            30         Weights initialization range   [−0.1..0.1]
Population size                    20         Architecture                   Feed forward fully-connected

Table 1. Model parameters

The runs of the algorithm provided the sets of genes presented in Table 3. The number of genes is reduced from 822 to 18, 18, 12 and 15, respectively. The genes resulting from each experiment are used to build a neural-network-based model. The models constructed using the selected sets of genes as inputs achieve high generalization performance on the test set: around 96% accuracy. The performance of each model is presented in Table 2.

Model   Number of genes   Ribosomal protein genes   Mean Square Error (MSE)
M1      18                included                  0.018980607
M2      18                included                  0.019634757
M3      12                discarded                 0.02600241
M4      15                discarded                 0.029648075

Table 2. Models generalization performance

Interpretation

As seen in the previous paragraph, reducing the number of genes from 822 to fewer than twenty has not resulted in a significant loss of information when building a model capable of predicting the cancerous or normal state associated with each biological situation.
Although the sets provided by the model using the whole set of genes achieve higher accuracy than those obtained without the ribosomal protein genes, the latter seem more interesting from a biological point of view. Both results obtained after discarding the ribosomal protein genes include a large number of genes involved in cancer diseases. Such a set was not obtained with the whole set of genes. Moreover, we should remark that the annotation (description) of some genes has changed since the latest update.

M1                     M2                     M3                     M4
TAG ID  TAG            TAG ID  TAG            TAG ID  TAG            TAG ID  TAG
103     AGCCTTTGTT     10      AACGCGGCCA     4       AAACTCTGTG     23      AAGACTGGCT
112     AGGGCTTCCA     14      AACTAATACT     77      ACTGAGGTGC     36      AAGGACCTTT
120     AGGTTTCCTC     65      ACCTGCCGAC     222     CCATTGCACT     47      AATTTGCAAC
121     AGTAGGTGGC     93      AGCAGATCAG     329     CTGCTGTGAT     113     AGGGGATTCC
168     CAAGGGCTTG     112     AGGGCTTCCA     335     CTGGCTGCAA     215     CCACTACACT
258     CCTCGCTCAG     156     ATTCTCCAGT     472     GCTAAGGAGA     293     CGGTTACTGT
307     CTCCAATAAA     195     CAGGAACGGG     517     GGCTCCTGGC     546     GGGGGTAACT
314     CTCTCACCCT     275     CGAGGGGCCA     564     GGTGGATGTG     565     GGTGGCACTC
365     GAACCCTGGG     359     CTTTTTGTGC     598     GTGCCTGAGA     598     GTGCCTGAGA
399     GAGGCCATCC     464     GCGCCGCCCC     652     TAGATAATGG     603     GTGGACCCCA
408     GAGTGAGTGA     472     GCTAAGGAGA     687     TCGAAGAACC     676     TCCAATACTG
409     GAGTGGGGGC     517     GGCTCCTGGC     693     TCTCCAGGAA     712     TGACCCCACA
426     GCCAAGATGC     554     GGGTTTGAAC                            773     TGTGCTCGGG
431     GCCAGGAAGC     599     GTGCGCTAGG                            784     TTCACAAAGG
508     GGCCCTAGGC     604     GTGGAGGTGC                            802     TTGGGGTTTC
552     GGGTGTGGTG     665     TCACAGCTGT
600     GTGCTGAATG     685     TCCTCCCTCC
749     TGGATCCTAG     801     TTGGGAGCAG

Table 3. List of selected genes for each model

7 Conclusion

We have designed a model capable of selecting a compact set of genes that predicts, with high neural-network-based accuracy, the cancerous or normal state associated with each biological situation. Moreover, discarding the ribosomal protein genes yields a more significant set that includes a large number of genes involved in cancer diseases.

References

1. Tan, A., Pan, H.: Predictive neural networks for gene expression data analysis. Neural Networks 18 (2005) 297–306
2. Wu, C.H.: Gene classification artificial neural system. In: Methods in Enzymology: Computer methods for macromolecular sequence analysis. (1995)
3. Toronen, P., Kolehmainen, M., Wong, G., Castren, E.: Analysis of gene expression data using self-organizing maps. Federation of European Biochemical Societies (1999) 142–146
4. Faure, A.: Cybernétique des réseaux neuronaux. Edition Hermès (1998)
5. Huang, J., Shimizu, H., Shioya, S.: Clustering gene expression pattern and extracting relationship in gene network based on artificial neural networks. Journal of Bioscience and Bioengineering 96 (2003) 421–428
6. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison Wesley (1989)
7. Cornuéjols, A., Miclet, L., Kodratoff, Y., Mitchell, T.: Apprentissage artificiel : concepts et algorithmes. Eyrolles (2002)
8. Esseghir, M.A.: New evolutionary bankruptcy forecasting model: design and implementation. Master’s thesis, Higher Institute of Management, Tunis, Tunisia (2005)
9. Aleixandre, M., Sayago, I., Horrilo, M.: Analysis of neural networks and analysis of feature selection with genetic algorithm to discriminate among pollutant gas. Sensors and Actuators B (2004) 122–128
10. Becquet, C., Blachon, S., Jeudy, B., Boulicaut, J., Gandrillon, O.: Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biol. 3 (2002)