Bhattacharyya Distance for Identifying Differentially Expressed Genes in Colon Gene Experiments

Xue W. Tian
IT College, Gachon University, Republic of Korea

Joon S. Lim (Corresponding author)
IT College, Gachon University, Republic of Korea

Abstract— Identifying a small number of differentially expressed genes for accurate classification of gene samples is essential for the development of diagnostic tests. We present a cancer molecular feature selection method based on gene expression profiles. Tumor and normal colon tissues were classified in this research. The Bhattacharyya distance was used as the gene selection method to identify a small number of differentially expressed genes for colon cancer analysis. We selected 7 genes and obtained 95.16% accuracy using a fuzzy neural network classifier. Compared with other colon cancer analysis results, our method achieved the highest classification accuracy with one of the smallest sets of differentially expressed genes.

Keywords—colon; Bhattacharyya distance; differentially expressed genes; gene expression profiles; fuzzy neural networks

I. INTRODUCTION

With the successful completion of the Human Genome Project (HGP), we are entering the post-genomic era. Facing massive amounts of data, traditional biological experiments and data analysis techniques encounter great challenges. In this situation, cDNA microarrays and high-density oligonucleotide chips have emerged as global (genome-wide or system-wide) experimental approaches that are used effectively in the systematic analysis of large-scale genome data [1]. In recent years, with their ability to measure simultaneously the activities and interactions of thousands of genes, microarrays promise new insights into the mechanisms of living systems and are attracting more and more interest for solving scientific problems and in industrial applications. Meanwhile, further biological and medical research has also promoted the development and application of microarrays.

A common objective of microarray experiments is the detection of differential gene expression between samples obtained under different conditions. The task of identifying differentially expressed genes consists of two aspects: ranking and selection. Numerous statistics have been proposed to rank genes in order of evidence for differential expression. Different algorithms can select different relevant genes and different numbers of relevant genes, and can lead to different classification accuracies.

In this paper, we used the Bhattacharyya distance [2] as the method for identifying differentially expressed genes (DEGs). In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is a measure of the amount of overlap between two statistical samples or populations. In this paper, the Bhattacharyya distance of each gene was used as the criterion for ranking genes in the training dataset. We applied this differential gene detection method to colon cancer analysis. The colon cancer dataset was obtained from [3].

In our research, we applied a fuzzy neural network (FNN) named NEWFM (neuro-fuzzy network with a weighted fuzzy membership function) [4] to the classification. NEWFM is a kind of FNN that is modeled on the structure and behavior of neurons in the human brain and can be trained to recognize and categorize complex patterns [4].

With the Bhattacharyya distance feature selection method and the NEWFM classifier, we selected 7 DEGs for colon cancer analysis and obtained 95.16% accuracy. Compared with the colon cancer analysis results in [5,6,7], our method obtained the highest classification accuracy, using fewer genes than [5] and [6].

II. MATERIALS AND METHOD

A. Materials

In [3], the authors describe and study a data set that is available online.
Gene expression information was extracted from DNA microarray data, resulting, after pre-processing, in a table of 62 tissues × 2000 gene expression values. The 62 tissues include 22 normal and 40 colon cancer tissues. The matrix contains the expression of the 2000 genes with the highest minimal intensity across the 62 tissues. Some of the genes are non-human. Since there was no defined training and test set, we used all 62 samples for both training and testing.

B. DEGs Identification Method

In this paper, we used the Bhattacharyya distance as our feature selection method for microarray data analysis. The Bhattacharyya distance of a gene between the two classes is defined as follows [2]:

$$
B(g, i) = \frac{\left(\mu_{i=1}(g) - \mu_{i=2}(g)\right)^{2}}{4\left(\sigma_{i=1}^{2}(g) + \sigma_{i=2}^{2}(g)\right)} + \frac{1}{2}\ln\left(\frac{\sigma_{i=1}^{2}(g) + \sigma_{i=2}^{2}(g)}{2\,\sigma_{i=1}(g)\,\sigma_{i=2}(g)}\right) \tag{1}
$$

In formula (1), i = 1, 2 denotes the normal and colon cancer samples, respectively. $\mu_{i=1}(g)$ and $\mu_{i=2}(g)$ are the mean expression values of gene g in the normal and colon cancer samples, respectively. $\sigma_{i=1}^{2}(g)$ and $\sigma_{i=2}^{2}(g)$ are the variances of gene g in the normal and colon cancer samples, respectively. A gene with a larger distance is more differentially expressed [2].

Fig. 1. Structure of DEGs Identification Method

The structure of the DEGs identification method is shown in Fig. 1. We used the whole colon data set to calculate the Bhattacharyya distances. We then selected the 100 genes with the largest distances as the initial feature set and used these genes to classify normal and colon cancer samples with the NEWFM classifier. Next, we removed the poorest genes one at a time, working from the bottom to the top of the ranked list of 100 genes, until two genes remained. Finally, we selected the 7 differentially expressed genes that gave the highest accuracy, 95.16%, for the classification of normal and colon cancer samples.

III. EXPERIMENTAL RESULTS AND CONCLUDING REMARKS

This study achieved 95.16% accuracy with only 7 differentially expressed genes. In [5], normal and colon cancer samples were classified with 8 differentially expressed genes at 90.32% accuracy. In [7], they were classified with 3 differentially expressed genes at 91.9% accuracy. In [6], they were classified with 10 differentially expressed genes at 82.08% accuracy. The accuracy in our study is higher than in all of these studies, and the number of differentially expressed genes is smaller than in [5] and [6]. The comparison of experimental results is shown in Table I.

TABLE I. EXPERIMENTAL RESULTS COMPARISON

Related Research          Number of Genes    Accuracy
Guyon's [5]               8                  90.32%
Cho's [6]                 10                 82.08%
Wang's [7]                3                  91.9%
Bhattacharyya Distance    7                  95.16%

In this paper, we used the Bhattacharyya distance for DEG selection. We selected a small differential gene set and obtained a higher classification accuracy than the compared methods. In future research, we will apply this Bhattacharyya distance feature selection method to other microarray data analyses and improve the method to obtain fewer genes and higher accuracy.

ACKNOWLEDGMENT

This research was supported by MSIP (the Ministry of Science, ICT and Future Planning), Korea, under the ITCRSP (IT Convergence Research Support Program) (NIPA2013-H0401-13-1001) supervised by the NIPA (National IT Industry Promotion Agency). This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1A2044134).

REFERENCES

[1] D. W. Galbraith, “Global analysis of cell type-specific gene expression,” Comparative Functional Genomics, vol. 4, pp. 208-215, 2003.
[2] G. Xuan, “Bhattacharyya distance feature selection,” Proceedings of the 13th International Conference on Pattern Recognition, 1996.
[3] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, “Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays,” Proc. Nat. Acad. Sci. USA, vol. 96, pp. 6745-6750, 1999.
[4] J. S. Lim, “Finding Features for Real-Time Premature Ventricular Contraction Detection Using a Fuzzy Neural Network System,” IEEE Transactions on Neural Networks, pp. 522-527, 2009.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification Using Support Vector Machines,” Machine Learning, pp. 389-422, 2002.
[6] J. H. Cho, D. Lee, J. H. Park, and I. B. Lee, “Gene Selection and Classification from Microarray Data Using Kernel Machine,” FEBS Letters, vol. 571, pp. 93-98, 2004.
[7] Y. Wang, F. S. Makedon, J. C. Ford, and J. Pearlman, “HykGene: A Hybrid Approach for Selecting Marker Genes for Phenotype Classification Using Microarray Gene Expression Data,” Bioinformatics, vol. 21, pp. 1530-1537, 2005.

978-1-4799-0604-8/13/$31.00 ©2013 IEEE
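As a supplementary sketch of the DEGs identification method in Section II.B, the following Python code shows one way formula (1), the ranking of genes by distance, and the removal of genes from the bottom of the ranked list could be computed. The function names, the list-of-lists data layout, the class labels 1 and 2, and the use of the sample variance are assumptions made for illustration. The `evaluate` callback is a hypothetical stand-in for the NEWFM classifier, which is not reproduced here. The elimination loop is a simplified reading of the procedure (it does not model how a "bad" gene is judged), not the authors' implementation.

```python
import math
from statistics import fmean, variance


def bhattacharyya_distance(normal, tumor):
    """Formula (1) for one gene, given its expression values in each class."""
    mu1, mu2 = fmean(normal), fmean(tumor)
    # Assumption: sample variance (n - 1 denominator); the paper does not specify.
    var1, var2 = variance(normal), variance(tumor)
    if var1 <= 0 or var2 <= 0:
        raise ValueError("each class needs non-zero variance for this gene")
    mean_term = (mu1 - mu2) ** 2 / (4.0 * (var1 + var2))
    # sigma_1 * sigma_2 is the product of standard deviations, sqrt(var1 * var2).
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


def rank_genes(expression, labels, top_k=100):
    """Return gene indices sorted by decreasing distance, truncated to top_k.

    expression: list of samples, each a list of per-gene expression values.
    labels: 1 for normal, 2 for colon cancer (the i = 1, 2 of formula (1)).
    """
    n_genes = len(expression[0])
    scored = []
    for g in range(n_genes):
        normal = [row[g] for row, y in zip(expression, labels) if y == 1]
        tumor = [row[g] for row, y in zip(expression, labels) if y == 2]
        scored.append((bhattacharyya_distance(normal, tumor), g))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [g for _, g in scored[:top_k]]


def backward_eliminate(ranked, evaluate, min_genes=2):
    """Drop genes from the bottom of the ranked list and keep the best subset.

    evaluate: hypothetical callback mapping a list of gene indices to an
    accuracy; it stands in for training and testing the NEWFM classifier.
    """
    current = list(ranked)
    best_subset, best_acc = list(current), evaluate(current)
    while len(current) > min_genes:
        current = current[:-1]  # remove the lowest-ranked remaining gene
        acc = evaluate(current)
        if acc >= best_acc:  # on ties, prefer the smaller gene set
            best_subset, best_acc = list(current), acc
    return best_subset, best_acc
```

With the colon data set, `expression` would hold the 62 samples × 2000 gene values and `labels` the 22 normal and 40 colon cancer class labels; `rank_genes(expression, labels, top_k=100)` would then give the initial feature set passed to `backward_eliminate`.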