Supplementary Material - GSA-SNP

Please refer to the references in the main manuscript.
How to compute p-values in the restandardization GSA (pooled version)
N is the number of gene sets considered and P is the number of permutations taken. $T_j$,
$j = 1, 2, \ldots, N$, are the gene set scores. Let $T = \{T_j\}_{1 \le j \le N}$.
Step 1. Permute the sample labels and compute each gene set score to generate the $N \times P$
matrix of randomized summary set statistics
$$T_{perm} = \{T_{jk}\}, \quad 1 \le j \le N, \; 1 \le k \le P,$$
where $T_{jk}$ is the score of the jth gene set for the kth permutation.
Step 2. (Pooled and randomized gene set scores) Compute
$$g(T_{perm}) = \left\{ \frac{g(T_{jk}) - \mu^*}{\upsilon^*} \right\}, \quad 1 \le j \le N, \; 1 \le k \le P,$$
where $g(T_{jk})$ represents a generalized gene set statistic and $\mu^*$, $\upsilon^*$ are the mean and
standard deviation of this statistic for randomly drawn gene sets over all the permutations
performed.
Step 3. For each gene set, compute the generalized statistic $g(T_j)$ as well as its mean and
standard deviation $\mu$, $\upsilon$, and form the standardized score
$$\frac{g(T_j) - \mu}{\upsilon},$$
then compute the ratio of scores in $g(T_{perm})$, pooled over all gene sets i and permutations k,
that satisfy
$$\frac{g(T_{ik}) - \mu^*}{\upsilon^*} \ge \frac{g(T_j) - \mu}{\upsilon}.$$
This ratio is the p-value of the jth gene set.
If g is a linear transform of individual scores, such as the mean of absolute values, the pth
moment, or the maxmean statistic, $\mu^*$, $\upsilon^*$ and $\mu$, $\upsilon$ can simply be replaced by the
corresponding mean and standard deviation of the individual gene scores, without randomly
drawing gene sets.
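As a rough illustration, Steps 2 and 3 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the GSA-SNP implementation: the function name and its arguments are hypothetical, the generalized statistics g are assumed to be already computed, the two pairs of location and scale parameters are assumed to be supplied by the caller, and the comparison is assumed to be one-sided (a permuted score counts when it is at least as large as the observed one).

```python
from bisect import bisect_left


def pooled_pvalues(g_obs, g_perm, mu, upsilon, mu_star, upsilon_star):
    """Pooled p-values for N gene sets (hypothetical helper).

    g_obs  : list of N observed statistics g(T_j)
    g_perm : N x P nested list of permuted statistics g(T_jk)
    mu, upsilon           : location and scale used for the observed scores
    mu_star, upsilon_star : location and scale used for the permuted scores
    """
    # Step 2: standardize every permuted score and pool all N * P of them
    # into a single sorted null distribution.
    null = sorted(
        (g - mu_star) / upsilon_star for row in g_perm for g in row
    )

    pvalues = []
    for g in g_obs:
        # Step 3: standardize the observed score of this gene set.
        z = (g - mu) / upsilon
        # Count pooled null scores that are >= z (assumed one-sided test).
        n_at_least = len(null) - bisect_left(null, z)
        pvalues.append(n_at_least / len(null))
    return pvalues
```

For example, with N = 2 gene sets and P = 2 permutations, `pooled_pvalues([1.5, 3.0], [[0.0, 1.0], [2.0, 3.0]], 0.0, 1.0, 0.0, 1.0)` returns `[0.5, 0.25]`: each observed score is compared against all four pooled permuted scores rather than only the two permutations of its own gene set.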
Comparison of GWAS on height between European and Korean populations
Weedon et al. (25) and Gudbjartsson et al. (26) recently reported 20 ($P_{GWA} < 5 \times 10^{-7}$) and 27
($P_{GWA} < 1.6 \times 10^{-7}$) regions, respectively, that were highly associated with adult height in
multiple cohorts of primarily European ancestry. In contrast, the GWA analysis of Korean
samples identified only eight regions significantly associated with height ($P_{GWA} < 4 \times 10^{-6}$),
and five of them largely overlapped the results of the preceding European studies (25,26).
The European studies also analyzed the gene ontology (GO) terms or pathways of the genes
neighboring the loci associated with adult height and suggested several biological processes
that were likely implicated in height variation. In the Korean population, only seven genes
were identified within a 200-kb window of the significantly associated variants; five of them
overlapped with the European results. Although the small number of implicated genes makes
a GO analysis of them meaningless, most of them are also found in the gene list of the
European studies. This implies that some of the strongest signals in the European studies are
well captured in Korean samples, while other signals may be weaker in Korean samples and
fall below the threshold.
Imputation of KARE genotype data
The KARE genotype data were supplemented by imputing SNP genotypes based on the
genotypes of the JPT+CHB panel of the International HapMap Phase II (The International
HapMap Consortium 2005). Details of SNP imputation and filtering have been published
elsewhere (28). Briefly, the genotypes of a total of 2,168,896 SNPs were imputed using
PLINK, and the 799,492 SNPs passing $P_{HWE} > 10^{-6}$, IMPUTED > 0.9, and INFO > 0.9
(concordance > 99%) were kept for subsequent analyses. The GWA scan for height was carried
out with the trend test after adjustment for age and sex using PLINK. The GWA p-values for all
the filtered SNPs were gathered and used as input to GSA-SNP.
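The filtering rule described above can be sketched in a few lines of Python. This is only an illustration: the record layout and the field names `p_hwe`, `imputed`, and `info` are hypothetical placeholders for the three quantities named in the text, not actual PLINK column names, and the thresholds are simply those quoted above.

```python
def passes_filter(snp, min_p_hwe=1e-6, min_imputed=0.9, min_info=0.9):
    """Return True if an imputed SNP record passes all three thresholds.

    `snp` is assumed to be a dict with the hypothetical keys
    'p_hwe', 'imputed' and 'info'.
    """
    return (
        snp["p_hwe"] > min_p_hwe
        and snp["imputed"] > min_imputed
        and snp["info"] > min_info
    )


def filter_snps(snps):
    """Keep only the SNP records that pass the quality filter."""
    return [snp for snp in snps if passes_filter(snp)]
```

Only the SNPs returned by such a filter, together with their GWA p-values, would be passed on to the gene set analysis.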
Guidance on how to choose a GSA method in GSA-SNP
Broadly, there are three kinds of GSA methods: gene-randomizing methods, sample-randomizing
methods, and hybrids of the two. Briefly, gene-randomizing methods (the Z-statistic method of
GSA-SNP, GeSBAP) assess the enrichment of the association signal in a gene set compared
to the background genes, whereas sample-randomizing methods (SRT) assess the existence of
the signal in a gene set. In other words, the two approaches have slightly different foci.
From a practical perspective, the gene-randomizing methods have an advantage: they are
applicable to a small number of data samples, while the sample-randomizing methods
require more data samples. From a statistical perspective, the sample-randomizing methods
have an advantage. Gene-randomizing methods assume that each gene set is a collection of
independent samples (genes), which is not valid because most gene sets share a common
biological function and their gene expression may be correlated to some degree.
The sample-randomizing methods are free from this assumption. However,
the problem of independent sampling may be ameliorated in the context of GWAS, because
the correlation of association depends on the haplotype structures of the genome rather than
on the biological functions shared by gene sets.
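To make the independence assumption concrete, a gene-randomizing Z-statistic can be sketched as below. The exact formula is not given in this supplement, so the code assumes the common form in which the mean gene score of a set is compared with the mean of all background genes and scaled by the background standard deviation divided by the square root of the set size. The function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def z_statistic(set_scores, background_scores):
    """Z-score of a gene set against the background genes (assumed form).

    set_scores        : gene scores of the genes in the set
    background_scores : gene scores of all genes considered
    """
    m = mean(background_scores)
    s = pstdev(background_scores)
    # The s / sqrt(n) scaling treats the n genes of the set as
    # independent draws from the background, which is exactly the
    # assumption discussed above.
    return (mean(set_scores) - m) / (s / sqrt(len(set_scores)))
```

If the gene scores within a set are positively correlated, the true spread of the set mean is larger than s / sqrt(n), so a Z-score of this form would overstate the significance.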
In the context of GWAS, the sample-randomizing methods incur high computational costs
in both memory and time. For a p-value approach, for example, we should generate a
thousand simulated p-values to obtain a reasonable level of significance, which is very
time-consuming. For this reason, we devised two methods that hybridize gene- and
sample-randomizing methods (Restandardization and GSEA) in our software. By pooling the gene
set scores of different gene sets, we can obtain significant scores with only a hundred
simulated samples. Therefore, users who prefer sample-randomizing methods
may choose the Restandardization or GSEA method. Comparing the two hybrid methods,
the restandardization method yields many significant results, while GSEA provides the most
conservative results. See Nam and Kim (2008) for a detailed comparison of GSA methods.
Comparison of best, 2nd best and 3rd best p-values for height and triglyceride traits
Supplementary Figures 1 and 2 show scatter plots of the kth vs. (k+1)th best p-values,
k = 1, 2, 3, and 4, for the two traits: height and triglyceride. In both cases,
the correlation between the p-values increased as k increased. In particular, the difference
between the best and the second best p-values was pronounced for the triglyceride example,
where the use of the second best p-values removed many doubtful best p-values.
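Selecting the kth best p-value per gene can be sketched as follows. This is an illustrative sketch only: the function name and the input layout (a mapping from gene name to the p-values of the SNPs assigned to that gene) are hypothetical, and skipping genes with fewer than k SNPs is an assumption, not a documented GSA-SNP behavior.

```python
def kth_best_pvalues(snp_pvalues_by_gene, k=2):
    """Return the kth smallest SNP p-value for each gene.

    Genes with fewer than k SNPs are skipped (an assumption of this sketch).
    """
    result = {}
    for gene, pvalues in snp_pvalues_by_gene.items():
        if len(pvalues) >= k:
            # sorted() is ascending, so index k - 1 is the kth best p-value.
            result[gene] = sorted(pvalues)[k - 1]
    return result
```

With k = 2, a gene whose best p-value comes from a single isolated SNP is represented by its next best SNP instead, which is how doubtful best p-values are discounted.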
Comparison of corrected p-values
We also observed that using the second best p-values overall improved the corrected p-values
(Z-statistic) or FDR values (GSEA-Maxmean), while the predicted gene sets themselves were
similar between the best and the second best p-value results. However, when we used the third
or higher order best p-values, the FDRs in the GSEA method became worse than those for the
best p-value option.
Supplementary Figure 1. (Height example) Scatter plots of the kth vs. (k+1)th best p-values, k = 1, 2, 3,
and 4, for the height data. Blue areas represent densely distributed data points.
Supplementary Figure 2. (Triglyceride example) Scatter plots of the kth vs. (k+1)th best p-values,
k = 1, 2, 3, and 4, for the triglyceride data. Blue areas represent densely distributed data points.