Localizing compact set of genes involved in cancer diseases
using an evolutionary connectionist approach
M. A. Esseghir, S. Ben Yahia, and S. Abdelhak
Departement des Sciences de l’Informatique,
Campus Universitaire, 1060 Tunis, Tunisie.
[email protected]
[email protected]
Institut Pasteur de Tunis,
[email protected]
Abstract. In this paper, we present a model capable of predicting the cancerous or normal state associated with each biological situation, using a hybrid system based on genetic algorithms and artificial neural networks. The experiments carried out show very promising results in terms of predictive accuracy and generalization. Moreover, the compact set of genes provided by the model includes well-known genes involved in different cancer diseases.
Keywords: Gene expression analysis, Genetic algorithms, Artificial neural networks and supervised learning.
1 Introduction
Recent technological advances in microarrays allow us to measure simultaneously the expression level of thousands of genes in a given tissue at a given moment, resulting in massive data sets of gene expression profiles.
Data mining tools can be used to promote the understanding of gene function, regulation and interactions. Several techniques have been used for the analysis of gene expression data, ranging from statistical models to artificial intelligence techniques, including hierarchical clustering, discriminant analysis and neural networks. A review of the literature shows a growing interest in artificial neural network approaches to gene expression analysis and modelling.
In this paper, we focus on two artificial intelligence techniques: genetic algorithms as a feature selector and neural networks as a classifier. The remainder of the paper is organized as follows: the second section reviews models based on neural networks and used in gene expression profiling. The third section presents an overview of genetic algorithms and their advantages in combinatorial search. The fourth section details the proposed approach. The fifth section is concerned with the methodology used in handling the available data from a biological point of view. Section six deals with the experimental results yielded by the model.
2 Neural network models
Artificial neural networks are stochastic models that simulate the brain’s ability to learn, control and recognize patterns. Their decisions are always based on previously learned experience. A variety of studies have addressed neural networks (NN) in gene expression and transcriptome analysis [1–3].
Supervised neural networks: MLP-BP
The majority of gene expression studies based on supervised neural networks use a multilayer feed-forward NN as the architecture and back-propagation, or one of its derivatives, as the learning algorithm. For instance, Wu [2] developed a system called gene classification artificial neural system (geneCANS), which is based on the back-propagation algorithm with a feed-forward architecture using one hidden layer. geneCANS was designed to classify new unknown sequences into predefined (known) classes.
Unsupervised neural networks: Self-Organizing-Map (SOM)
The Self-Organizing Map (SOM) is also known as the Kohonen network [4]. After a training stage, the SOM tries to separate its outputs into categories, called clusters. This technique has been applied to cluster together genes with closely related expression profiles. P. Toronen et al. [3] used the SOM to analyze changes in gene expression during a diauxic shift (shift from anaerobic to aerobic metabolism) in yeast. J. Huang et al. [5] developed a hybrid model combining the capabilities of supervised and unsupervised neural networks: the SOM was used for clustering and the ANN for extracting the interconnections between gene clusters.
In this paper, we propose a model, based on genetic algorithms and neural networks, capable of selecting a compact set of genes having the highest predictive accuracy in the detection and classification of cancer diseases.
The experimental results show that the model achieves around 96% accuracy in predicting the cancer associated with each biological situation using the set provided by the model.
3 Genetic Algorithms
Genetic Algorithms (GA) are artificial intelligence global search techniques. GA attempt to apply evolutionary techniques to problem solving, notably function optimization [6]. They have proven to be valuable in searching large and complex problem spaces [7]. The GA process is based on natural selection, crossover and mutation, which are repeatedly applied to a population of potential solutions, simulating Darwinian evolution. Each solution is coded as a binary string called a chromosome. At each generation, a number of new solutions are created with the crossover and mutation operations, and the best solutions are selected. This process evolves iteratively and comes to an end when a good solution is obtained. The genetic evolution process is described in Figure 1:
Fig. 1. Genetic evolution process
Each of these solutions is assigned a fitness value depending on how well it solves the problem. Individuals are evaluated according to their associated fitness value. In the following section, we use a model based on genetic algorithms and artificial neural networks, in which the GA plays the key role of feature selector.
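The crossover and mutation operations described above can be sketched as follows. This is only an illustrative sketch: the text does not state which crossover variant is used, so one-point crossover is assumed here, and mutation is assumed to flip a single randomly chosen bit with probability `p_mut`. The function names are hypothetical.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two bit strings at a random cut point (assumed variant)."""
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, p_mut, rng):
    """With probability p_mut, flip one randomly chosen bit (assumed scheme)."""
    mutated = list(chromosome)
    if rng.random() < p_mut:
        j = rng.randrange(len(mutated))
        mutated[j] = 1 - mutated[j]
    return mutated

rng = random.Random(0)
a, b = [0] * 10, [1] * 10
child_a, child_b = one_point_crossover(a, b, rng)
print(child_a, child_b, mutate(child_a, 0.2, rng))
```

Crossover recombines existing bits without creating new ones, while mutation is the only operator that can introduce a bit value absent from both parents.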
4 The proposed model
Believing that combining classifiers and boosting methods can lead to improvements in performance, we propose in this article a new hybrid model based mainly on two stochastic models: artificial neural networks and genetic algorithms. The following paragraphs illustrate how ANN and GA can be combined and applied to gene expression modelling.
The model presented in this study consists of a two-stage evolutionary optimization process [8]. The first stage selects the best set of inputs having a predictive relationship to the target (Figure 2). The second stage trains the neural network constructed using the selected genes.
The genetic algorithm presented here is based on the standard algorithm of Goldberg [6, 7]. We apply it to find the optimal set of genes involved in the prediction of cancer diseases. A set of candidate solutions is evolved through a fixed number of generations.
Fig. 2. The proposed model
Representation Each generation consists of a set of candidate solutions, each represented as a binary string (chromosome) of 822 bits, one per gene: a 1 means that the corresponding gene is selected as an ANN input, and a 0 means that it is not chosen.
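As a small illustration of this encoding, the sketch below decodes a chromosome into the indices of the selected genes. The helper name `decode` and the example indices are hypothetical; only the length of 822 comes from the text.

```python
N_GENES = 822  # chromosome length given in the text

def decode(chromosome):
    """Return the indices of the genes whose bit is set to 1."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]

# Hypothetical chromosome selecting three genes as ANN inputs.
chromosome = [0] * N_GENES
for idx in (3, 111, 597):
    chromosome[idx] = 1

print(decode(chromosome))
```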
Initialization An initial set of solutions is randomly generated. For each individual a number
of genes are randomly selected by setting the corresponding bits equal to 1. Once the initial set
is generated, the evaluation process starts. A fitness value is assigned to each chromosome. The
first generation of solutions is derived by applying a tournament selection to the evaluated set.
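A minimal sketch of this initialization step is given below, under several assumptions: the number of genes switched on per individual (20) and the tournament size (3) are not given in the text, the fitness values are random placeholders standing in for the network-based evaluation, and a lower fitness is taken to be better because the fitness defined later is an error measure. The sizes 30 and 20 are the initial and regular population sizes reported in Table 1.

```python
import random

def random_chromosome(n_genes, n_selected, rng):
    """Create a bit string with exactly n_selected bits set to 1."""
    chromosome = [0] * n_genes
    for idx in rng.sample(range(n_genes), n_selected):
        chromosome[idx] = 1
    return chromosome

def tournament_select(population, fitnesses, n_winners, t, rng):
    """Pick n_winners individuals; each is the best (lowest fitness)
    among t contestants drawn at random from the population."""
    winners = []
    for _ in range(n_winners):
        contestants = rng.sample(range(len(population)), t)
        best = min(contestants, key=lambda i: fitnesses[i])
        winners.append(list(population[best]))  # copy, so later edits do not alias
    return winners

rng = random.Random(0)
initial = [random_chromosome(822, 20, rng) for _ in range(30)]
placeholder_fitness = [rng.random() for _ in initial]  # stand-in for the real evaluation
first_generation = tournament_select(initial, placeholder_fitness, 20, 3, rng)
print(len(first_generation))
```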
Evaluation Two steps are needed to evaluate each chromosome. First, a neural network taking the variables selected in the chromosome as inputs is built and trained. Next, the trained network is evaluated using the test set, which presents to the network new data that were not used for training. The chromosome evaluation thus assesses the generalization ability of the neural network and, consequently, of the set of genes involved. The idea of using the test set in the evaluation was taken from the work of M. Aleixandre et al. [9]. Our fitness involves two evaluation criteria: the proportion of incorrectly classified instances and the mean square error on the test set.
$$\mathrm{fitness} = \frac{ICI + TMSE}{2} \qquad (1)$$
where ICI and TMSE designate, respectively, the proportion of incorrectly classified instances in the data set and the mean square error found on the test set. The predictive power of the ANN was assessed on the basis of the number of correctly classified instances (CCI) and the MSE on the test set.
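Equation (1) can be illustrated with a short sketch. It assumes a single network output in [0, 1] and a decision threshold of 0.5, neither of which is specified in the text; the sample targets and outputs are made up for the example.

```python
def fitness(targets, outputs, threshold=0.5):
    """Average of the misclassification rate (ICI) and the mean square error (TMSE)."""
    n = len(targets)
    # An instance is misclassified when output and target fall on opposite sides of the threshold.
    ici = sum((o >= threshold) != (t >= threshold) for t, o in zip(targets, outputs)) / n
    tmse = sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n
    return (ici + tmse) / 2

# Hypothetical test set of four situations (1 = cancer, 0 = normal).
targets = [1, 0, 1, 0]
outputs = [0.9, 0.2, 0.4, 0.1]
print(fitness(targets, outputs))
```

In this example one instance out of four is misclassified (ICI = 0.25) and the squared errors average to 0.105, so the fitness is approximately 0.1775. Both terms lie in [0, 1], so a lower value indicates a better chromosome.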
Algorithm 1: Genes selector algorithm
Input:
  S: set of available genes
  Ni: initial population size
  N: population size
  t: tournament size
  p_mut: mutation probability
  p_cross: crossover probability
  Maxgen: number of generations
Output:
  S1: best subset of input predictors
Begin
  i = 0
  Population P0, P, Ptmp
  P0 = P = Ptmp = ∅
  P0 = GenerateInitialPopulation(Ni)
  Evaluate(P0)
  P = Select(P0, N, t)
  While i < Maxgen do
    Ptmp = Select(P, N, t)
    Crossover(Ptmp, p_cross)
    Mutate(Ptmp, p_mut)
    Evaluate(Ptmp)
    Replace(Ptmp, P)
    i = i + 1
  End While
  S1 = P.bestChromosome().decode()
  Return(S1)
End
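The loop of Algorithm 1 can be sketched in Python as follows. This is not the authors' implementation: the `evaluate` callback stands in for training and testing a neural network, one-point crossover and single-bit mutation are assumed, `Replace` is assumed to keep the N best individuals of the old and new populations, each parent generation is selected from the current population P, and the toy fitness (distance to a hypothetical target subset) exists only to make the example runnable.

```python
import random

def run_gene_selector(n_genes, evaluate, n_init=30, n_pop=20, t=3,
                      p_cross=0.85, p_mut=0.2, max_gen=200, seed=0):
    rng = random.Random(seed)

    def select(pop, fits, n):
        # Tournament selection: best (lowest fitness) of t random contestants.
        chosen = []
        for _ in range(n):
            contestants = rng.sample(range(len(pop)), t)
            chosen.append(list(pop[min(contestants, key=lambda i: fits[i])]))
        return chosen

    p0 = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(n_init)]
    pop = select(p0, [evaluate(c) for c in p0], n_pop)
    fits = [evaluate(c) for c in pop]

    for _ in range(max_gen):
        tmp = select(pop, fits, n_pop)
        # One-point crossover on consecutive pairs (assumed variant).
        for i in range(0, n_pop - 1, 2):
            if rng.random() < p_cross:
                point = rng.randrange(1, n_genes)
                tmp[i][point:], tmp[i + 1][point:] = tmp[i + 1][point:], tmp[i][point:]
        # Mutation: flip one random bit with probability p_mut (assumed scheme).
        for c in tmp:
            if rng.random() < p_mut:
                j = rng.randrange(n_genes)
                c[j] = 1 - c[j]
        tmp_fits = [evaluate(c) for c in tmp]
        # Replacement: keep the n_pop best of old and new individuals (assumption).
        merged = sorted(zip(fits + tmp_fits, pop + tmp), key=lambda pair: pair[0])[:n_pop]
        fits = [f for f, _ in merged]
        pop = [c for _, c in merged]

    best = min(range(n_pop), key=lambda i: fits[i])
    return [i for i, bit in enumerate(pop[best]) if bit], fits[best]

# Toy stand-in for the network-based evaluation: fraction of bits that
# differ from a hypothetical "ideal" subset of genes.
TARGET = {2, 5, 7}

def toy_fitness(chromosome):
    mismatches = sum(bit != (i in TARGET) for i, bit in enumerate(chromosome))
    return mismatches / len(chromosome)

genes, best_fit = run_gene_selector(12, toy_fitness, max_gen=50)
print(genes, best_fit)
```

The default parameter values mirror Table 1, except the tournament size, which the paper does not report.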
5 Methodology
The longest association rules generated in the study of Becquet et al. [10] corresponded mainly to genes encoding ribosomal proteins. This is not surprising, since these genes have an important function in protein synthesis. For this reason, we decided to discard the set of tags corresponding to ribosomal genes. This might lead to the identification of predictive genes hidden by the presence of the large set of ribosomal proteins.
6 Experiments and interpretation
Data description and preprocessing
For our experiments, we decided to use the ”Small Data set” provided by the ’Challenge Discovery 2005’ site. It is based on the expression of 822 genes in 74 biological situations. We add to the matrix of expression levels the state of the corresponding biological situation, represented by the column ’Class 2’. All neural network classifiers evolved by the genetic algorithm proposed in our model are assessed according to their accuracy in predicting the state of each biological situation (normal or cancer) with the respective cancer type. ANN requires data to be close to the range of 0 to 1. The method selected for normalization is linear transform scaling.
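One common reading of linear transform scaling is min-max scaling applied to each gene (column) separately. The sketch below assumes that reading; the paper does not give the exact formula, and the function names and sample values are hypothetical.

```python
def min_max_scale(column):
    """Linearly map the values of one gene to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to rescale
    return [(x - lo) / (hi - lo) for x in column]

def scale_matrix(rows):
    """Scale every column of an expression matrix (rows = biological situations)."""
    scaled_columns = [min_max_scale(col) for col in zip(*rows)]
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical expression levels: 3 situations x 2 genes.
expression = [[2, 100], [4, 300], [6, 200]]
print(scale_matrix(expression))
```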
Results of experiments
During the experiments, we tested different values for each parameter. The best results were obtained using the neural network and genetic algorithm parameters listed in Table 1.
GA parameters                                ANN parameters
Parameter                        Value       Parameter                     Value
Number of generations            200         Number of iterations          50
Crossover probability (p_cross)  85%         Learning rate (η)             0.25
Mutation probability (p_mut)     20%         Number of hidden nodes        10
Initial population size          30          Weights initialization range  [−0.1..0.1]
Population size                  20          Architecture                  Feed forward fully-connected
Table 1. Model parameters
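To make the ANN settings of Table 1 concrete, the following sketch builds a fully connected feed-forward network with 10 hidden nodes, weights drawn from [−0.1, 0.1], a learning rate of 0.25 and 50 training iterations. Several points are assumptions, since the paper does not state them: sigmoid activations, a single output unit, bias terms, online (per-sample) back-propagation, and one iteration meaning one pass over the training set. The class name `TinyMLP` and the toy data are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyMLP:
    """Inputs -> sigmoid hidden layer -> one sigmoid output (assumed layout)."""

    def __init__(self, n_inputs, n_hidden=10, init_range=0.1, seed=0):
        rng = random.Random(seed)
        # The last weight of every row is a bias term.
        self.w_hidden = [[rng.uniform(-init_range, init_range)
                          for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-init_range, init_range) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train(self, samples, targets, eta=0.25, iterations=50):
        for _ in range(iterations):
            for x, t in zip(samples, targets):
                hidden, out = self.forward(x)
                delta_out = (out - t) * out * (1.0 - out)
                # Hidden deltas are computed before the output weights change.
                delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden)]
                for j, v in enumerate(hidden + [1.0]):
                    self.w_out[j] -= eta * delta_out * v
                for j, row in enumerate(self.w_hidden):
                    for k, v in enumerate(x + [1.0]):
                        row[k] -= eta * delta_hidden[j] * v

    def mse(self, samples, targets):
        return sum((self.forward(x)[1] - t) ** 2
                   for x, t in zip(samples, targets)) / len(samples)

# Toy data: two scaled "genes", two situations per class.
samples = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
targets = [0, 0, 1, 1]

net = TinyMLP(n_inputs=2)
print("MSE before training:", net.mse(samples, targets))
net.train(samples, targets, eta=0.25, iterations=50)
print("MSE after training:", net.mse(samples, targets))
```

On such a tiny data set, 50 iterations with small initial weights may reduce the error only slightly; the sketch is meant to show the structure, not to reproduce the reported accuracy.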
The runs of the algorithm produced the sets of genes presented in Table 3. The number of genes is reduced from 822 to 18, 18, 12 and 15 for models M1 to M4, respectively. The models constructed using the selected sets of genes as inputs achieve high generalization performance on the test set: around 96% accuracy. The performance of each model is presented in Table 2. The genes resulting from each experiment are used to build a model based on a neural network.
Model          Ribosomal protein  Mean Square Error (MSE)
M1: 18 genes   included           0.018980607
M2: 18 genes   included           0.019634757
M3: 12 genes   discarded          0.02600241
M4: 15 genes   discarded          0.029648075
Table 2. Models generalization performance
Interpretation
As seen in the previous paragraph, reducing the number of genes from 822 to fewer than twenty has not resulted in a significant loss of information when building a model capable of predicting the cancerous or normal state associated with each biological situation. Although the sets obtained from the whole set of genes achieve higher accuracy than those obtained without the ribosomal protein genes, the latter seem more interesting from a biological standpoint. Both results obtained after discarding the ribosomal protein genes include a large number of genes involved in cancer diseases. Such a set was not obtained with the whole set of genes.
Moreover, we should remark that the annotation (description) of some genes has changed since the latest update.
M1                   M2                   M3                   M4
TAG ID  TAG          TAG ID  TAG          TAG ID  TAG          TAG ID  TAG
103     AGCCTTTGTT   10      AACGCGGCCA   4       AAACTCTGTG   23      AAGACTGGCT
112     AGGGCTTCCA   14      AACTAATACT   77      ACTGAGGTGC   36      AAGGACCTTT
120     AGGTTTCCTC   65      ACCTGCCGAC   222     CCATTGCACT   47      AATTTGCAAC
121     AGTAGGTGGC   93      AGCAGATCAG   329     CTGCTGTGAT   113     AGGGGATTCC
168     CAAGGGCTTG   112     AGGGCTTCCA   335     CTGGCTGCAA   215     CCACTACACT
258     CCTCGCTCAG   156     ATTCTCCAGT   472     GCTAAGGAGA   293     CGGTTACTGT
307     CTCCAATAAA   195     CAGGAACGGG   517     GGCTCCTGGC   546     GGGGGTAACT
314     CTCTCACCCT   275     CGAGGGGCCA   564     GGTGGATGTG   565     GGTGGCACTC
365     GAACCCTGGG   359     CTTTTTGTGC   598     GTGCCTGAGA   598     GTGCCTGAGA
399     GAGGCCATCC   464     GCGCCGCCCC   652     TAGATAATGG   603     GTGGACCCCA
408     GAGTGAGTGA   472     GCTAAGGAGA   687     TCGAAGAACC   676     TCCAATACTG
409     GAGTGGGGGC   517     GGCTCCTGGC   693     TCTCCAGGAA   712     TGACCCCACA
426     GCCAAGATGC   554     GGGTTTGAAC                        773     TGTGCTCGGG
431     GCCAGGAAGC   599     GTGCGCTAGG                        784     TTCACAAAGG
508     GGCCCTAGGC   604     GTGGAGGTGC                        802     TTGGGGTTTC
552     GGGTGTGGTG   665     TCACAGCTGT
600     GTGCTGAATG   685     TCCTCCCTCC
749     TGGATCCTAG   801     TTGGGAGCAG
Table 3. List of selected genes for each model
7 Conclusion
We have designed a model capable of selecting a compact set of genes that predicts, with high accuracy using a neural network based model, the cancerous or normal state associated with each biological situation. Moreover, discarding the ribosomal protein genes yields a more significant set that includes a large number of genes involved in cancer diseases.
References
1. Tan, A., Pan, H.: Predictive neural networks for gene expression data analysis. Neural Networks 18 (2005) 297–306
2. Wu, C.H.: Gene classification artificial neural system. In: Methods in Enzymology: Computer methods for macromolecular sequence analysis. (1995)
3. Toronen, P., Kolehmainen, M., Wong, G., Castren, E.: Analysis of gene expression data using self-organising maps. Federation of European Biochemical Societies (1999) 142–146
4. Faure, A.: Cybernétique des réseaux neuronaux. Edition Hermès (1998)
5. Huang, J., Shimizu, H., Shioya, S.: Clustering gene expression pattern and extracting relationship in gene network based on artificial neural networks. Journal of Bioscience and Bioengineering 96 (2003) 421–428
6. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison Wesley (1989)
7. Cornuéjols, A., Miclet, L., Kodratoff, Y., Mitchell, T.: Apprentissage artificiel : concepts et algorithmes. Eyrolles (2002)
8. Esseghir, M.A.: New evolutionary bankruptcy forecasting model: design and implementation. Master’s thesis, Higher Institute of Management, Tunis, Tunisia (2005)
9. Aleixandre, M., Sayago, I., Horrilo, M.: Analysis of neural networks and analysis of feature selection with genetic algorithm to discriminate among pollutant gas. Sensors and Actuators B (2004) 122–128
10. Becquet, C., Blachon, S., Jeudy, B., Boulicaut, J., Gandrillon, O.: Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biol. 3 (2002)