Bhattacharyya Distance for Identifying Differentially
Expressed Genes in Colon Gene Experiments
Xue W. Tian
IT College, Gachon University, Republic of Korea
[email protected]
Joon S. Lim
IT College, Gachon University, Republic of Korea
[email protected]
(Corresponding author)
Abstract— Identifying a small number of differentially expressed
genes that accurately classify gene samples is essential for the
development of diagnostic tests. We present a feature selection
approach for cancer molecular data based on gene expression
profiles. Tumor and normal colon tissues were classified in this
research. The Bhattacharyya distance was used as the gene
selection method to identify a small number of differentially
expressed genes for colon cancer analysis. We selected 7 genes
and obtained 95.16% accuracy using a fuzzy neural network
classifier. Compared with other colon cancer analysis results, our
method uses a small set of differentially expressed genes and
attains the highest classification accuracy.
Keywords—colon; Bhattacharyya distance; differentially
expressed genes; gene expression profiles; fuzzy neural networks
I. INTRODUCTION
With the successful completion of the Human Genome
Project (HGP), we are entering the post-genomic era. Faced with
massive amounts of data, traditional biological experiments and
data analysis techniques encounter great challenges. In this
situation, cDNA microarrays and high-density oligonucleotide
chips are novel biotechnologies that serve as global (genome-wide
or system-wide) experimental approaches and are used effectively
in the systematic analysis of large-scale genome data [1]. In
recent years, with their ability to measure simultaneously the
activities and interactions of thousands of genes, microarrays
promise new insights into the mechanisms of living systems
and are attracting growing interest for solving scientific
problems and in industrial applications. Meanwhile, further
biological and medical research has also promoted the
development and application of microarrays.
A common objective of microarray experiments is the
detection of differential gene expression between samples
obtained under different conditions. The task of identifying
differentially expressed genes consists of two aspects: ranking
and selection. Numerous statistics have been proposed to rank
genes in order of evidence for differential expression. Different
algorithms can select different relevant genes and different
numbers of relevant genes, and can lead to different
classification accuracies. In this paper, we used the Bhattacharyya
distance [2] as the method for identifying differentially expressed
genes (DEGs). In statistics, the Bhattacharyya distance
measures the similarity of two discrete or continuous
probability distributions. It is a measure of the amount of
overlap between two statistical samples or populations. In this
paper, the Bhattacharyya distance of each gene was used as the
criterion for ranking genes in the training dataset. We applied
this differential gene detection method to colon cancer
analysis. The colon cancer dataset was obtained from [3].
In our research, we applied a fuzzy neural network (FNN)
named NEWFM (neuro-fuzzy network with a weighted fuzzy
membership function) [4] to the classification. NEWFM is a
kind of FNN that is modeled on the structure and behavior of
neurons in the human brain and can be trained to recognize and
categorize complex patterns [4].
With the Bhattacharyya distance feature selection method
and the NEWFM classifier, we selected 7 DEGs for colon
cancer analysis and obtained 95.16% accuracy. Compared with
the colon cancer analysis results in [5,6,7], our method attains
the highest classification accuracy, with fewer genes than [5]
and [6].
II. MATERIALS AND METHOD
A. Materials
In [3], the authors describe and study a data set that is
available online. Gene expression information was extracted
from DNA microarray data, resulting, after pre-processing, in a
table of 62 tissues × 2000 gene expression values. The 62
tissues include 22 normal and 40 colon cancer tissues. The
matrix contains the expression of the 2000 genes with the highest
minimal intensity across the 62 tissues. Some genes are
non-human genes. Since there was no defined training and test
set, we used all 62 samples for both training and testing.
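Because all 62 samples serve as both training and test data, an accuracy figure corresponds to a whole number of correctly classified tissues out of 62. The short sketch below works this out for the 95.16% quoted in the abstract. It assumes that accuracy is simply the fraction of the 62 tissues classified correctly; the helper name `correct_count` is hypothetical.

```python
N_TISSUES = 62  # 22 normal + 40 colon cancer tissues


def correct_count(accuracy_percent, n=N_TISSUES):
    """Return the number of correct samples implied by an accuracy.

    Returns None if no whole count out of n reproduces the percentage
    to two decimal places.
    """
    k = round(accuracy_percent / 100 * n)
    if round(100 * k / n, 2) == round(accuracy_percent, 2):
        return k
    return None


print(correct_count(95.16))  # 59
```

Under this assumption, 95.16% corresponds to 59 of the 62 tissues being classified correctly.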
B. DEGs Identification Method
In this paper, we used the Bhattacharyya distance as our
feature selection method for microarray data analysis. The
Bhattacharyya distance of a gene between the two variants is
defined as follows [2]:

$$B(g, i) = \frac{\left(\mu_{i=1}(g) - \mu_{i=2}(g)\right)^{2}}{4\left(\sigma_{i=1}^{2}(g) + \sigma_{i=2}^{2}(g)\right)} + \frac{1}{2}\ln\left(\frac{\sigma_{i=1}^{2}(g) + \sigma_{i=2}^{2}(g)}{2\,\sigma_{i=1}(g)\,\sigma_{i=2}(g)}\right) \tag{1}$$

978-1-4799-0604-8/13/$31.00 ©2013 IEEE
In formula (1), i = 1, 2 denotes the variant: normal and
colon cancer samples, respectively. $\mu_{i=1}(g)$ and
$\mu_{i=2}(g)$ are the mean values of gene g on the normal and
colon cancer samples, respectively. $\sigma_{i=1}^{2}(g)$ and
$\sigma_{i=2}^{2}(g)$ are the variances of gene g on the normal
and colon cancer samples, respectively. A gene with a larger
distance is more differentially expressed [2].
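As a minimal sketch, formula (1) and the selection procedure of Fig. 1 (rank genes by distance, keep the 100 largest, then remove genes from the bottom of the list down to two) might be written in Python as follows. Several points are assumptions rather than details given in this paper: the function names, the use of the population variance, the reading of "deleting bad genes" as dropping the lowest-ranked remaining gene at each step, and the tie-breaking rule. NEWFM [4] is not implemented here; a simple nearest-centroid rule stands in for it purely for illustration.

```python
import numpy as np


def bhattacharyya_distance(X, y):
    """Per-gene distance of formula (1).

    X: (n_samples, n_genes) expression matrix.
    y: labels, 1 = normal, 2 = colon cancer.
    Assumes both class variances are strictly positive for every gene.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    normal, cancer = X[y == 1], X[y == 2]
    mu1, mu2 = normal.mean(axis=0), cancer.mean(axis=0)
    # Population variance (ddof=0) is an assumption; the paper does not say.
    var1, var2 = normal.var(axis=0), cancer.var(axis=0)
    mean_term = (mu1 - mu2) ** 2 / (4.0 * (var1 + var2))
    var_term = 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2)))
    return mean_term + var_term


def nearest_centroid_accuracy(X, y):
    """Hypothetical stand-in for NEWFM: accuracy of a nearest-centroid
    rule that is trained and tested on the same samples."""
    c1, c2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
    d1 = ((X - c1) ** 2).sum(axis=1)
    d2 = ((X - c2) ** 2).sum(axis=1)
    pred = np.where(d1 <= d2, 1, 2)
    return float((pred == y).mean())


def select_genes(X, y, n_start=100, n_min=2, score=nearest_centroid_accuracy):
    """Rank genes by distance, then shrink the list from the bottom up."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dist = bhattacharyya_distance(X, y)
    ranked = np.argsort(dist)[::-1][:n_start]  # largest distance first
    best_genes, best_acc = ranked, score(X[:, ranked], y)
    for k in range(len(ranked) - 1, n_min - 1, -1):
        genes = ranked[:k]  # drop the lowest-ranked remaining gene
        acc = score(X[:, genes], y)
        if acc >= best_acc:  # on ties, prefer the smaller gene set
            best_genes, best_acc = genes, acc
    return best_genes, best_acc
```

Calling `select_genes(X, y)` on a 62 × 2000 matrix with labels 1 and 2 returns the indices of the retained genes and the accuracy of the stand-in classifier on that set. The `score` argument can be replaced by any function with the same signature.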
The structure of the DEG identification method is shown in
Fig. 1. We used the whole colon data set to calculate the
Bhattacharyya distance of each gene, then selected the 100
genes with the largest distances as the initial feature set. These
genes were used to classify normal and colon cancer tissues
with the NEWFM classifier. We then deleted bad genes one by
one, from the bottom of the 100-gene list to the top, until two
genes remained. Finally, we selected the 7 differentially
expressed genes that gave the highest accuracy, 95.16%, for
classifying normal and colon cancer tissues.
Fig. 1. Structure of DEGs Identification Method
III. EXPERIMENTAL RESULTS AND CONCLUDING REMARKS
This study achieves 95.16% accuracy with only 7
differentially expressed genes. In [5], normal and colon cancer
tissues were classified with 8 differentially expressed genes at
90.32% accuracy. In [7], they were classified with 3
differentially expressed genes at 91.9% accuracy. In [6], they
were classified with 10 differentially expressed genes at 82.08%
accuracy. The accuracy in our study is higher than in all three
studies, and our gene set is smaller than those of [5] and [6].
The comparison of experimental results is shown in Table I.
TABLE I. EXPERIMENTAL RESULTS COMPARISON.
Related Research          Number of Genes   Accuracy
Guyon’s [5]               8                 90.32%
Cho’s [6]                 10                82.08%
Wang’s [7]                3                 91.9%
Bhattacharyya Distance    7                 95.16%
In this paper, we used the Bhattacharyya distance for DEG
selection. We selected a small differential gene set and obtained
a higher classification accuracy than the compared studies. In
future research, we will apply this Bhattacharyya distance
feature selection method to other microarray data sets and
improve the method to obtain fewer genes and higher accuracy.
ACKNOWLEDGMENT
This research was supported by MSIP (the Ministry of
Science, ICT and Future Planning), Korea, under the IT-CRSP
(IT Convergence Research Support Program)
(NIPA-2013-H0401-13-1001) supervised by the NIPA (National IT
Industry Promotion Agency).
This research was supported by the Basic Science Research
Program through the National Research Foundation of Korea
(NRF) funded by the Ministry of Education, Science and
Technology (2012R1A1A2044134).
REFERENCES
[1] D. W. Galbraith, “Global analysis of cell type-specific gene
expression,” Comparative Functional Genomics, vol. 4, pp. 208-215,
2003.
[2] G. Xuan, “Bhattacharyya distance feature selection,” Pattern
Recognition, Proceedings of the 13th International Conference, 1996.
[3] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A.
J. Levine, “Broad Patterns of Gene Expression Revealed by Clustering
Analysis of Tumor and Normal Colon Tissues Probed by
Oligonucleotide Arrays,” Proc. Nat. Acad. Sci. USA, vol. 96,
pp. 6745-6750, 1999.
[4] J. S. Lim, “Finding Features for Real-Time Premature Ventricular
Contraction Detection Using a Fuzzy Neural Network System,” IEEE
Transactions on Neural Networks, pp. 522-527, 2009.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for
Cancer Classification Using Support Vector Machines,” Machine
Learning, pp. 389-422, 2002.
[6] J. H. Cho, D. Lee, J. H. Park, and I. B. Lee, “Gene Selection and
Classification from Microarray Data Using Kernel Machine,” FEBS
Letters 571, pp. 93-98, 2004.
[7] Y. Wang, F. S. Makedon, J. C. Ford, and J. Pearlman, “HykGene: A
Hybrid Approach for Selecting Marker Genes for Phenotype
Classification Using Microarray Gene Expression Data,” Bioinformatics,
vol. 21, pp. 1530-1537, 2005.