Finding Significant Gene Sets with Weighted Distribution of
Gene Expression
ABSTRACT
Gene set analysis has clear advantages for finding significant gene categories whose genes
share similar biological functions. In this article we present a new method, named SGS, for finding
significant gene sets whose genes are differentially expressed in microarray experiments. The
method rests on the hypothesis that more strongly differentially expressed genes play a more
important part in a gene set, and it uses a weighted distribution of gene expression to
calculate the up-regulation and down-regulation of gene sets. Two kinds of cutoffs are used to
select gene sets that are both biologically sensible and statistically significant. Our method
can decrease the false positive predictions caused by large gene set size, and it is also suitable
for the analysis of microarray data with few repeat experiments. A type II diabetes microarray
dataset is used to test the performance of SGS, and we compare our tool with one of the most
widely used gene set analysis tools, GSEA. The results show that SGS finds more gene sets related
to oxidative phosphorylation and the ribosome, and excludes gene sets that do not belong to these
two categories. Our tool performs with high sensitivity and a low false positive rate.
INTRODUCTION
Microarray technology has been used to measure the expression of a huge number of genes under
various biological conditions simultaneously. The challenge is to uncover the biological meaning
hidden behind the massive data produced by microarray experiments. Gene set analysis was
developed to identify significant gene categories, represented as signaling pathways,
transcription modules or other functional relationships, in which the member genes are
differentially expressed. Evaluating genes as a group has clear advantages over the
traditional methods of individual gene analysis. Gene set analysis approaches have also been
applied in other fields, including functional network module analysis (Liu et al., 2007) and
microRNA target prediction (Creighton et al., 2008; Cheng et al., 2008). Compared to individual
gene-by-gene analysis, gene set analysis is better at uncovering more reasonable
gene relations (Mootha et al., 2003), reducing the dimensionality of the underlying statistical
problem (Ackermann et al., 2009) and improving the comparability of microarray data from different
platforms (Manoli et al., 2006); it is also more robust and capable of producing more reliable
results.
Gene sets are defined as lists of genes sharing similar attributes, such as belonging to the same
signaling pathway, being located close together in the cell, or encoding proteins with the same
function. A gene set can also consist of the genes in a network module or the targets of the same
transcription factor or microRNA. The definition of gene sets is based on prior knowledge,
depends on which biological property is considered, and is usually extracted from public
databases such as Gene Ontology (ref.), KEGG (ref.), PID (ref.), etc.
The aim of gene set analysis is to obtain significant gene sets by evaluating the differential
expression of the genes in them. A number of tools have been developed in recent years. Most of
them can be categorized into two classes. The first is known as over-representation analysis
(ORA), which uses a 2x2 contingency table to test the independence of genes belonging to the
functional category considered and genes being differentially expressed, usually by Fisher's
exact test (Khatri et al., 2005). This type of method focuses only on the part of the gene list
with an expression level over a cutoff value and excludes all other genes, so a lot of expression
detail is lost in the process. Also, the results are very sensitive to the selection of the
differentially expressed genes, which is often arbitrary (Tsai et al., 2009). Additionally, under
this model, large gene sets tend to get small p-values, which introduces more false
positives. Another limitation of ORA-based approaches is that they do not work for microarray
data with few differentially expressed genes.
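The 2x2 contingency test used by ORA can be illustrated in a few lines. The following is a minimal sketch, not code from any of the cited tools: it computes the one-sided Fisher's exact p-value as a hypergeometric tail, and the function name and the toy counts are assumptions chosen for the example.

```python
from math import comb

def ora_p_value(n_total, n_in_set, n_selected, n_overlap):
    """One-sided Fisher's exact p-value for over-representation.

    n_total    -- genes on the array
    n_in_set   -- genes belonging to the functional category
    n_selected -- genes called differentially expressed
    n_overlap  -- selected genes that belong to the category
    """
    upper = min(n_in_set, n_selected)
    denominator = comb(n_total, n_selected)
    # Probability of drawing n_overlap or more category genes by chance.
    tail = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denominator

# Toy example (hypothetical counts): 20 genes, 5 in the category,
# 4 differentially expressed, 3 of which fall in the category.
p = ora_p_value(20, 5, 4, 3)
print(round(p, 4))
```

Only the genes above the selection cutoff enter the table, which is why the expression values of all other genes play no role in this test.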
The second type of approach makes use of the whole expression structure.
A typical example is GSEA (Subramanian et al., 2005), one of the most popular tools for
gene set analysis. It uses a weighted Kolmogorov-Smirnov test to measure the degree of
differential expression of a gene set by calculating a running sum from the top of a gene
list ranked according to correlation with the biological conditions. It succeeded in finding that
oxidative phosphorylation was down-regulated in muscle tissue from type II diabetes patients,
where almost no differentially expressed genes could be found (ref.). However, GSEA relies on the
ordering of the gene-level statistics, and its power decreases when only a few repeat experiments
are available. Moreover, the cumulative value can cause false positives for large gene
sets (ref.).
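To make the running-sum idea concrete, here is a simplified sketch of a weighted Kolmogorov-Smirnov style enrichment score in the spirit of Subramanian et al. (2005). It is not the GSEA implementation; the function name, the exponent parameter and the toy ranking are assumptions made for illustration.

```python
def enrichment_score(ranked, gene_set, weight_exponent=1.0):
    """Maximum deviation from zero of a weighted running sum.

    ranked   -- list of (gene, correlation) pairs, sorted from most
                positively to most negatively correlated
    gene_set -- iterable of gene names
    """
    members = set(gene_set)
    hit_total = sum(abs(r) ** weight_exponent for g, r in ranked if g in members)
    n_miss = sum(1 for g, _ in ranked if g not in members)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")
    running = 0.0
    best = 0.0
    for gene, r in ranked:
        if gene in members:
            # Hits step up in proportion to their correlation.
            running += abs(r) ** weight_exponent / hit_total
        else:
            # Misses step down by a constant amount.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked list: the set members sit at the top.
ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0), ("e", -2.0)]
score = enrichment_score(ranked, ["a", "b"])
print(score)
```

Because the score depends on where the members fall in the ranking, it inherits any instability of the gene-level statistics used to build that ranking.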
It is accepted that genes with stronger differential expression play a more important role in
alterations at the transcription level. We propose a statistical model in which expression values
are taken as weights when evaluating the genes in a gene set. Under this model, more strongly
differentially expressed genes play a more important role in the assessment than those with
weaker differential expression. As a result, gene sets with differential expression are
discriminated from those without.
Current gene set analysis tools assess gene sets according to statistical significance, usually
obtained by permutations or simulations under certain parametric models (ref.). However, with
these procedures the significance is directly related to the size of the gene set. Large gene sets
can reach high significance even at low differential expression, which is a potential source of
false positives. On the other hand, the information carried by large gene sets is often too
general to be of much help for biological interpretation. To solve this problem, SGS uses two
cutoffs to ensure that selected gene sets are both differentially expressed and statistically
significant.
In general, SGS calculates the up-regulation and down-regulation of gene sets through a weighted
distribution of gene expression and then combines them into a gene-set-level statistic that
represents the regulation of the gene set. Two kinds of cutoffs are used to select gene sets that
are both biologically sensible and statistically significant. The advantage of our method is that
it can reduce the false positive predictions caused by large gene set size, and it is also
suitable for the analysis of microarray data with few repeat experiments. A type II diabetes
microarray dataset is used to test the performance of SGS in comparison with GSEA. The results
show that SGS finds more gene sets related to oxidative phosphorylation and the ribosome, and
excludes gene sets that do not belong to these two categories. Our tool performs with high
sensitivity and a low false positive rate.
METHOD
In this article, the expression value measured by microarray is taken to be the logarithm of the
fold change, so a positive value represents up-regulation and a negative value represents
down-regulation. Three types of microarray experiments are considered, as follows.
One biological condition and no repeat experiments
Under this experimental design, only a single expression value is measured for every
gene. The microarray data can be formatted as {G: e}, where G represents a gene and e is the
corresponding expression value. For a given gene set S, we denote by U and D the up-regulation
and the down-regulation defined in formula (1), where the gene expression x is taken as the weight
of the probability density function P(x) of the gene expression in the gene set.
$$U = \sum_{x>0} x\,P(x), \qquad D = \sum_{x<0} x\,P(x) \qquad (1)$$
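As an illustration of formula (1), the sketch below computes U and D for a single gene set. It assumes that P(x) is the empirical distribution of the expression values in the set (each gene contributes 1/|S|); the text does not specify how the distribution is estimated, so this choice, the function name and the toy data are assumptions.

```python
from collections import Counter

def up_down_regulation(expression, gene_set):
    """Return (U, D) for one gene set.

    expression -- {gene: log fold change}
    gene_set   -- iterable of gene names
    """
    values = [expression[g] for g in gene_set if g in expression]
    if not values:
        raise ValueError("gene set has no measured genes")
    # Assumed empirical distribution: P(x) = relative frequency of x.
    counts = Counter(values)
    n = len(values)
    up = sum(x * c / n for x, c in counts.items() if x > 0)
    down = sum(x * c / n for x, c in counts.items() if x < 0)
    return up, down

# Hypothetical log fold changes.
expression = {"g1": 2.0, "g2": 1.0, "g3": -0.5, "g4": -1.5, "g5": 0.0}
U, D = up_down_regulation(expression, ["g1", "g2", "g3", "g4"])
print(U, D)
```

With this reading, D is a sum of negative values and is therefore negative; a later step that takes a logarithm of U/D would need its magnitude.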
One biological condition and repeat experiments
The microarray data can be formatted as {G: e1, e2, …, en}, where n is the number of repeats.
For a given gene set S, P_i(x) is the probability density function of the gene expression in the
gene set in the ith repeat. The up-regulation and down-regulation of the gene set are the
accumulation of the individual regulation of every single repeat, as calculated in formula (2)
$$U = \sum_{i=1}^{n} \sum_{x_i>0} x_i\,P_i(x_i), \qquad D = \sum_{i=1}^{n} \sum_{x_i<0} x_i\,P_i(x_i) \qquad (2)$$
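Formula (2) can be sketched by accumulating the per-repeat contributions. As before, the empirical distribution within each repeat (weight 1/|S| per gene) is an assumption, and the function name and toy data are hypothetical.

```python
def up_down_with_repeats(expression, gene_set):
    """Return (U, D) summed over all repeats.

    expression -- {gene: [log fold change in repeat 1, ..., repeat n]}
    gene_set   -- iterable of gene names
    """
    genes = [g for g in gene_set if g in expression]
    if not genes:
        raise ValueError("gene set has no measured genes")
    n_repeats = len(expression[genes[0]])
    up = 0.0
    down = 0.0
    for i in range(n_repeats):
        column = [expression[g][i] for g in genes]
        weight = 1.0 / len(column)  # assumed empirical P_i(x)
        up += sum(x * weight for x in column if x > 0)
        down += sum(x * weight for x in column if x < 0)
    return up, down

# Hypothetical data with two repeats.
expression = {"g1": [1.0, 2.0], "g2": [-1.0, 1.0], "g3": [3.0, -3.0]}
U, D = up_down_with_repeats(expression, ["g1", "g2"])
print(U, D)
```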
More than one biological condition and repeat experiments
We group the different biological conditions into C classes. The microarray data are formatted as
$$\{G \mid e_{1,1}, e_{2,1}, \ldots, e_{n_1,1},\; e_{1,2}, e_{2,2}, \ldots, e_{n_2,2},\; \ldots,\; e_{1,C}, e_{2,C}, \ldots, e_{n_C,C}\}$$
where e_{i,c} is the expression value in the ith repeat in class c and n_c is the number of
repeats in class c. A label is added to each class to identify which kind of regulation pattern is
wanted: 1 is assigned if the same regulation pattern is needed and -1 is assigned if the reverse
regulation pattern is needed. The up-regulation and down-regulation of the gene set are the
accumulation of the individual regulation of every single repeat, as calculated in formula (3),
where j indexes the classes and i the repeats
$$U = \sum_{j=1}^{C} \sum_{i=1}^{n_j} \sum_{label_j\, x_{i,j} > 0} x_{i,j}\,P_{i,j}(x_{i,j}), \qquad D = \sum_{j=1}^{C} \sum_{i=1}^{n_j} \sum_{label_j\, x_{i,j} < 0} x_{i,j}\,P_{i,j}(x_{i,j}) \qquad (3)$$
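A possible reading of formula (3) is sketched below. It assumes that the class label flips the sign of the expression values, so that label * x is used both to decide whether a value counts towards U or D and as the weight in the sum; it also assumes the empirical distribution within each repeat. The data layout, function name and toy values are hypothetical.

```python
def up_down_multi_class(expression, labels, gene_set):
    """Accumulate (U, D) over every repeat of every class.

    expression -- {gene: {class_name: [value per repeat]}}
    labels     -- {class_name: 1 or -1}
    gene_set   -- iterable of gene names
    """
    genes = [g for g in gene_set if g in expression]
    if not genes:
        raise ValueError("gene set has no measured genes")
    up = 0.0
    down = 0.0
    for class_name, label in labels.items():
        n_repeats = len(expression[genes[0]][class_name])
        for i in range(n_repeats):
            # Assumption: flip the sign where the reverse pattern is wanted.
            signed = [label * expression[g][class_name][i] for g in genes]
            weight = 1.0 / len(signed)  # assumed empirical P_{i,j}(x)
            up += sum(x * weight for x in signed if x > 0)
            down += sum(x * weight for x in signed if x < 0)
    return up, down

# Hypothetical data: class B is expected to show the reverse pattern.
expression = {
    "g1": {"A": [1.0], "B": [-2.0]},
    "g2": {"A": [-1.0], "B": [2.0]},
}
labels = {"A": 1, "B": -1}
U, D = up_down_multi_class(expression, labels, ["g1", "g2"])
print(U, D)
```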
Gene set level statistic
We express the regulation of a gene set as the logarithm of the ratio of its up-regulation to its
down-regulation. The statistic for a gene set, denoted E, is the regulation of the gene set minus
the regulation of the background, as shown in formula (4).
$$E = \log(U/D) - \log(U_b/D_b) \qquad (4)$$
where U_b and D_b are the up-regulation and down-regulation of the whole set of background genes.
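Formula (4) can be evaluated as in the following sketch. Two assumptions are made that the text does not state: the magnitudes of D and D_b are used so that the logarithm is defined, and the natural logarithm is taken. The function name and the input values are hypothetical.

```python
import math

def e_statistic(up, down, up_bg, down_bg):
    """E = log(U/D) - log(U_b/D_b), using the magnitudes of D and D_b."""
    u, d, u_b, d_b = up, abs(down), up_bg, abs(down_bg)
    if min(u, d, u_b, d_b) <= 0:
        raise ValueError("U and U_b must be positive; D and D_b non-zero")
    return math.log(u / d) - math.log(u_b / d_b)

# Hypothetical values: the set is more up-regulated than the background.
e_up = e_statistic(0.75, -0.5, 0.5, -0.5)
# Mirror case: the set is more down-regulated than the background.
e_down = e_statistic(0.5, -0.75, 0.5, -0.5)
print(round(e_up, 3), round(e_down, 3))
```

Under these assumptions a positive E indicates net up-regulation relative to the background and a negative E indicates net down-regulation.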
Significance assessment
In a typical microarray experiment, the fold changes of every single experiment can be
approximated as following a normal distribution. We calculate the mean and standard deviation of
every experiment repeat and, based on the normal distribution, generate a new dataset of the same
size as the original data. The E value of every gene set is calculated on the simulated data. The
simulation is repeated a large number of times (e.g. 1000), and the p-value for each gene set is
the fraction of simulations in which the simulated value exceeds the real one.
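The simulation procedure can be sketched as follows. Several details are assumptions rather than part of the description above: the comparison is made on absolute E values, a small pseudocount keeps the logarithm defined when a simulated set has no positive or no negative values, the natural logarithm is used, and each gene in a set carries equal probability. All names and the toy data are hypothetical.

```python
import math
import random
import statistics

EPS = 1e-9  # assumed pseudocount, avoids log(0) and division by zero

def up_down(columns, rows):
    """U and |D| summed over repeats for the genes at the given rows."""
    up = down = 0.0
    for column in columns:
        values = [column[r] for r in rows]
        up += sum(x for x in values if x > 0) / len(values)
        down += -sum(x for x in values if x < 0) / len(values)
    return up, down

def e_value(columns, rows):
    """E of a gene set relative to the background of all genes."""
    up, down = up_down(columns, rows)
    up_b, down_b = up_down(columns, range(len(columns[0])))
    return (math.log((up + EPS) / (down + EPS))
            - math.log((up_b + EPS) / (down_b + EPS)))

def simulated_p_value(columns, rows, n_sim=1000, seed=0):
    """Fraction of simulated |E| values reaching the observed |E|."""
    rng = random.Random(seed)
    observed = e_value(columns, rows)
    # One normal distribution per repeat, fitted to that repeat.
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    n_genes = len(columns[0])
    hits = 0
    for _ in range(n_sim):
        sim = [[rng.gauss(mu, sd) for _ in range(n_genes)]
               for mu, sd in params]
        if abs(e_value(sim, rows)) >= abs(observed):
            hits += 1
    return observed, hits / n_sim

# Toy data: 200 genes, 3 repeats, first 20 genes shifted downwards.
data_rng = random.Random(1)
columns = [[data_rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(3)]
gene_set = list(range(20))
for column in columns:
    for r in gene_set:
        column[r] -= 1.5
observed, p_value = simulated_p_value(columns, gene_set, n_sim=200)
print(observed < 0, p_value)
```

Each column here holds one repeat, so the per-repeat mean and standard deviation are computed over all genes of that repeat before the normal data are drawn.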
The E value represents the regulation of the gene set and the p-value measures its significance,
so both E values and p-values are used as cutoffs to select gene sets that are statistically
significant as well as biologically sensible.
RESULT
We applied SGS to a public type II diabetes microarray dataset. The data were downloaded from
http://www.broad.mit.edu/gsea/ and have been widely used as a test dataset for many other
tools. The dataset covers two experimental conditions, normal glucose tolerance (NGT) and
type II diabetes mellitus (DM2). Each condition has 17 repeats. Few differentially expressed genes
are detected by t-test: 20 genes at a cutoff of 0.05 and 10 genes at a cutoff of 0.01, out of
15957 genes. We generated the fold-change matrix from the ratio of the mean DM2 value to every
expression value in NGT, resulting in a matrix of 15957 rows and 17 columns. Gene sets derived
from Gene Ontology are used as the biological catalog. The size of gene sets is restricted to
between 10 and 1000.
SGS found that oxidative phosphorylation is down-regulated in the type II diabetes samples, in
agreement with the result of GSEA, but SGS assigns it a higher significance (p < 0.001 vs. 0.047).
SGS also finds more detailed processes within oxidative phosphorylation, such as ATP synthesis and
the electron transport chain. Among cellular components, we found that sub-components of the
mitochondrion are significantly down-regulated, such as respiratory chain complex I, the
proton-transporting ATP synthase complex and the NADH dehydrogenase complex. Genes whose protein
products have oxidoreductase activity, ATP synthase activity, NADH dehydrogenase activity or
cytochrome-c oxidase activity are down-regulated. These down-regulated genes also take part in
processes such as the tricarboxylic acid cycle, acetyl-CoA catabolism, ATP metabolism and ATP
synthesis coupled electron transport. These gene sets are all evaluated as highly significant by
SGS. One mechanism of type II diabetes is the repression of glucose metabolism, mainly due to the
repression of oxidative phosphorylation.
In addition, gene sets related to the ribosome are also assessed as down-regulated. This suggests
that the translation of proteins involved in oxidative phosphorylation may be repressed. The
repression of oxidative phosphorylation would then result from a decrease in enzymes and
co-enzymes, itself due to a decrease in ribosomal protein or RNA.
Large gene sets tend to have small E values, while the E values of smaller gene sets are larger
and more scattered. Small gene sets may show strong differential expression, but the
probability that this is due to random effects also rises. Large gene sets, although they have
small E values, may still reach high significance; moreover, the biological information in such
gene sets may be too general to be useful for interpretation. Both p-value and E-value cutoffs
are therefore used to assess the results. They can be regarded as two views: E values represent
how strongly the gene set is differentially expressed, and p-values represent whether that
differential expression is statistically significant. Where gene sets are small, the p-value is
the main constraint, while where gene sets are large, the E value is the main constraint.
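The two-cutoff selection can be written as a simple filter. The sketch below uses |E| >= 1 and p < 0.01 as defaults, matching the boundaries drawn in Figure 1; whether the comparisons are strict is an assumption, and the function name and the example values are hypothetical.

```python
def select_gene_sets(results, e_cutoff=1.0, p_cutoff=0.01):
    """Keep gene sets passing both the E-value and the p-value cutoff.

    results -- {gene set name: (E value, p-value)}
    """
    return {
        name: (e, p)
        for name, (e, p) in results.items()
        if abs(e) >= e_cutoff and p < p_cutoff
    }

# Hypothetical values, for illustration only.
results = {
    "small_set": (-2.1, 0.20),      # strong E, not significant
    "large_set": (-0.3, 0.001),     # significant, weak E
    "selected_set": (-1.8, 0.001),  # passes both cutoffs
}
selected = select_gene_sets(results)
print(sorted(selected))
```

The first two entries mirror the two regimes described above: a small set rejected by the p-value cutoff and a large set rejected by the E-value cutoff.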
Figure 1. Distribution of the gene sets assessed by SGS. Black curves mark the boundary where p-value = 0.01. Straight
lines mark the boundary where E value = 1 or -1. Black dots represent significant gene sets.
The same dataset and GO catalog were analyzed with GSEA. The GSEA results were calculated at
p-value < 0.05, which is looser than the requirement of SGS. 23 gene sets are found by both SGS
and GSEA, 29 gene sets are found only by SGS and 24 gene sets are found only by GSEA (Figure 2).
Fig 2. Venn diagram of the gene sets found by SGS and GSEA.
The 23 gene sets found by both SGS and GSEA are related to oxidative phosphorylation. SGS assigns
these gene sets large absolute E values (20 out of 23 have E value < -1.7), showing that the genes
in these gene sets are strongly down-regulated. Most of these gene sets have p-values <
0.001.
Of the 29 gene sets found only by SGS, 15 are directly related to oxidative phosphorylation,
and 14 of these have E values less than -1.45. 8 gene sets are directly related to the ribosome,
and 4 of these have E values less than -1.45. The remaining 4 gene sets relate to purine
metabolism and have less negative E values. The last 6 gene sets are proteasome, threonine
endopeptidase activity, co-enzyme metabolism, and protein localization in the membrane and
nucleus. These gene sets may be indirectly related to protein synthesis and transcription.
Of the 24 gene sets found only by GSEA, 4 have E value < -1, two of which are related to the
mitochondrion; however, these two gene sets are small and receive low significance from SGS.
4 gene sets have high significance, of which 2 are related to the ribosome and 1 to the
mitochondrion. Two of these 4 gene sets are large (>200 genes); they have high
significance but little differential expression. In this category of gene sets, two groups can
be distinguished by size: small sets with about 13 genes and extremely large sets with
more than 200 genes.
Fig 3. Gene sets found by SGS and GSEA.
If gene sets related to oxidative phosphorylation and the ribosome are regarded as positive
results and sets unrelated to these two groups are treated as false positives, the discovery rate
is 0.885 (46/52) for SGS and 0.638 (30/47) for GSEA. SGS has higher sensitivity and fewer false
positives.
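The two discovery rates follow directly from the counts reported above; the short check below reproduces them. The variable names are chosen for illustration.

```python
# Related gene sets / all gene sets reported by each tool.
sgs_related, sgs_total = 46, 52    # 23 shared + 29 SGS-only sets
gsea_related, gsea_total = 30, 47  # 23 shared + 24 GSEA-only sets

sgs_rate = sgs_related / sgs_total
gsea_rate = gsea_related / gsea_total
print(round(sgs_rate, 3), round(gsea_rate, 3))
```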
DISCUSSION
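1. Use of the weighted distribution of expression values and the weight coefficients
2. How false positives are reduced, and why
3. Also effective for data with few repeats
4. Gene set analysis is robust
5. How significance relates to the size of the gene set
6. When the fold change is difficult to compute, t-values can be used instead
Can uncover underlying biological processes with moderate differential expression that cannot be
detected by single-gene analysis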
CONCLUSION
Gene set analysis is important for finding the biological processes affected. It uses prior
biological knowledge to evaluate the expression patterns of groups of genes. We proposed a tool
based on gene set analysis, named SGS, that evaluates the weighted distribution of gene
expression. SGS performs well at finding gene sets that are both statistically significant and
biologically sensible. SGS works for microarray data with few repeats and decreases the false
positives caused by gene set size. We believe our tool will help researchers obtain more
biological findings.