Batch Effects and Pathway Analysis

Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Cancer
Epidemiology,
Biomarkers
& Prevention
Research Article
Batch Effects and Pathway Analysis: Two Potential Perils in
Cancer Studies Involving DNA Methylation Array Analysis
Kristin N. Harper, Brandilyn A. Peters, and Mary V. Gamble
Abstract
Background: DNA methylation microarrays have become an increasingly popular means of studying the
role of epigenetics in cancer, although the methods used to analyze these arrays are still being developed and
existing methods are not always widely disseminated among microarray users.
Methods: We investigated two problems likely to confront DNA methylation microarray users: (i) batch
effects and (ii) the use of widely available pathway analysis software to analyze results. First, DNA taken from
individuals exposed to low and high levels of drinking water arsenic were plated twice on Illumina’s Infinium
450 K HumanMethylation Array, once in order of exposure and again following randomization. Second, we
conducted simulations in which random CpG sites were drawn from the 450 K array and subjected to pathway
analysis using Ingenuity’s IPA software.
Results: The majority of differentially methylated CpG sites identified in Run One were due to batch effects;
few sites were also identified in Run Two. In addition, the pathway analysis software reported many significant
associations between our data, randomly drawn from the 450 K array, and various diseases and biological
functions.
Conclusions: These analyses illustrate the pitfalls of not properly controlling for chip-specific batch effects
as well as using pathway analysis software created for gene expression arrays to analyze DNA methylation
array data.
Impact: We present evidence that (i) chip-specific effects can simulate plausible differential methylation
results and (ii) popular pathway analysis software developed for expression arrays can yield spurious
results when used in tandem with methylation microarrays. Cancer Epidemiol Biomarkers Prev; 22(6);
1052–60. 2013 AACR.
Introduction
DNA methylation arrays have become a popular tool in
cancer research (1–4). The use of arrays and the popularity
of epigenetic studies have naturally attracted many investigators to this relatively new technology. Methylation
array studies are particularly prone to several problems
that, while well known to researchers who specialize in
the analysis of array data, are often not appreciated by
investigators who are less familiar with array analysis.
It has been understood for some time that batch effects
represent a significant problem in array analysis that must
be both avoided, as much as possible, and accounted for
(5–9). For example, it was found that confounding by
Authors' Affiliation: Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY
Note: Supplementary data for this article are available at Cancer Epidemiology, Biomarkers & Prevention Online (http://cebp.aacrjournals.org/).
Corresponding Author: Kristin N. Harper, Department of Environmental
Health Sciences, Mailman School of Public Health, Columbia University,
650 W. 168th St., Room 1618, New York, NY 10032. Phone: 212-305-1205;
Fax: 212 305-3857; E-mail: [email protected]
doi: 10.1158/1055-9965.EPI-13-0114
2013 American Association for Cancer Research.
1052
batch effects along with several other problems accounted
for an association reported between gene expression profiles and response to ovarian cancer treatment (10); subsequently, the majority of authors of the original article
(11) published a retraction. Similarly, another team of
researchers found that a surprisingly high estimate of
differentially expressed genes between individuals with
European and Yoruban ancestry (12) may have been
influenced by confounding by the year in which samples
were processed (13). Finally, re-analysis of a study of gene
expression in Plasmodium falciparum (14), the malaria
parasite, found that confounding by batch complicated
the evaluation of transcriptional diversity (15). Because
many array studies are biomedical in nature, the findings
often have important practical implications. Therefore,
understanding and controlling for batch effects requires
particular attention.
Thus far, attention has focused on relatively large differences in batch, such as differences in the laboratory in
which samples are run, differences in the day of run, or
differences in sample preparation methods. The implications of batch effects due to more subtle differences, such
as the location of samples on different chips run concurrently, may not be sufficiently appreciated. Here, we use
actual array data to illustrate how chip-specific effects can
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Batch Effects, Pathway Analysis, and DNA Methylation Arrays
simulate plausible and interesting differential methylation results.
Moreover, because software programs used in the
study of gene expression arrays are widely available
and user-friendly, many of these programs have been
"borrowed" for applications involving DNA methylation
array data. In some cases, however, this is problematic.
We present evidence that popular pathway analysis
software, developed for expression array results but
commonly used for methylation arrays as well (see, for
example, refs. 16, 17–22), can yield spurious results.
Our hope is that this pair of illustrations will help
investigators using DNA methylation arrays to avoid two
common problems that could yield biased results.
Materials and Methods
450 K array study
Sample collection and DNA extraction. Peripheral
blood mononuclear cells (PBMCs) were obtained from a
randomized, controlled trial intended to look at the effect
of folate supplementation on arsenic (As) metabolism.
This trial was approved by both the Bangladesh Medical
Research Council (BMRC) and the Institutional Review
Board at Columbia University Medical Center. Blood was
collected in EDTA-containing tubes at our field clinic in
Araihazar. Plasma was separated from roughly 8 mL of
blood, and the remainder was diluted with PBS to 15 mL.
Subsequently, 7.5 mL of the diluted blood was gently
layered atop 4 mL of Ficoll. The tube containing blood and
Ficoll was centrifuged at 400 g for 30 minutes, and the
mononuclear layer along with half of the underlying Ficoll
was transferred to a new tube. PBS was added in equal
volume, and the contents were centrifuged for 10 minutes
at 200 g. The cells were washed once more with PBS,
then the supernatant was removed and the cell pellet was
resuspended in 3.5 mL 5 PRIME ArchivePure DNA Cell
Lysis solution, after which proteinase K was added. The
cells were incubated at room temperature for 20 minutes,
then the samples were stored at 80 C. Samples were
transported on dry ice, first to Dhaka by car, then to
Columbia University for analysis.
After arriving at Columbia, 48 samples were chosen
from the PBMCs taken during the baseline visit based on
membership in either a "low" or "high" As exposure
group, members of the first having water As concentrations between 50 and 100 mg/L and members of the second
having water As concentrations over 100 mg/L. DNA was
extracted from the PBMCs using the 5 PRIME ArchivePure DNA Blood Kit. DNA was visualized on gels to
confirm quality and quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies).
The PBMC DNA samples were run twice on Illumina
Infinium 450 K HumanMethylation Arrays at Roswell
Park Cancer Institute. Each run was identical except for
the order in which samples were plated. In the first run,
Run One, "low" As group samples were plated, then "high"
As group samples were plated (Fig. 1). In the second run,
www.aacrjournals.org
Run Two, samples from the "low" and "high" As groups
were randomly assigned to positions on the plates (Fig. 1).
Statistical analysis. The b values, or the percentage
of CpGs at a given site that were methylated, were calculated for every sample at each CpG site. Observations
with detection P > 0.05 were set to missing, and any CpG
site with missing data was omitted from the analysis. The
R program ComBat (8), included in the SVA package (23),
was used to adjust for the chip on which samples were
run. This method uses a parametric empirical Bayes
framework to adjust data for batch effects; it is a particularly robust method for dealing with small samples sizes.
To determine whether ComBat effectively removed batch
effects, principal component analysis (PCA) was used to
determine the top 5 principal components (PCs) present
in the pre- and post-ComBat b values for Runs One
and Two. Afterward, the association between each PC
and (i) the chip on which samples were run and (ii) "low"
or "high" As group was determined using linear regression. In addition, Spearman’s correlation coefficients
between b values in Runs One and Two were calculated
for each CpG site, pre- and post-ComBat; we also calculated pre-Combat, between-run correlations stratified by
Run One chip.
To investigate whether As exposure status was associated with differential DNA methylation in Runs One
or Two, b values were transformed into M values
(log2(b/(1 b)) before statistical analysis (24). SVA was
used to calculate F-statistics and their associated P values
by comparing two nested linear models. The larger model
contained As group, coded as a factor variable, as well as
any other variables of interest, whereas the smaller model
contained only an intercept and the non-As variables of
interest. Q values were then calculated, also in SVA. In
this way, the number of differentially methylated sites
(q < 0.05) for Run One and Two were determined, both
before and after ComBat was used to adjust for the chips
on which samples were run. A clustering analysis was
conducted on the Run One data, using pre-ComBat
b values and the heatmap function in R. The top 100
differentially methylated sites, determined by q value,
were included in the analysis. Color coding for both
exposure group and chip was implemented. The same
process was conducted for Run Two data, after adjusting
models for chip, using ComBat, as well as two potential
confounders: betelnut use and land ownership.
Pathway analysis simulations
One hundred simulations were conducted, in which
random CpG sites from the 450 K array were chosen and
entered into Ingenuity’s IPA software. For each simulation, 100 CpG sites were randomly drawn from those
included on the 450 K array. To randomize the choice of
sites, all CpGs on the chip were assigned a random
number between 0 and 1 and the 100 sites assigned the
smallest values were chosen. The genes associated with
these random sites were determined, and the gene lists
were then uploaded into the IPA software. The Ingenuity
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
1053
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Harper et al.
Figure 1. Plating scheme: Run One (samples plated by exposure group) versus Run Two (samples randomized on chips). Samples are labeled by the same
number in both runs.
Knowledge Base used in this analysis consisted of genes
only, and both direct and indirect relationships were
considered. "Diseases and Disorders," "Molecular and
Cell Function," "Physiological System Development and
Function," and "Top Canonical Pathway" results linked to
a given dataset at P < 0.05 were tallied.
Deconstructing Illumina’s 450 K HumanMethylation
Array
The number of CpG sites associated with each refgene
accession number (i.e., gene transcript) included on the
450 K array was determined using the data file generated
by the GenomeStudio software for Run One. Refgene
accession number duplicates in each row of the data file
were removed and then all remaining accession numbers
were tallied, yielding the number of probes associated with
each transcript covered by the chip. Refgene accession
numbers were divided into 12 "bins": these consisted of
transcripts that had (i) <10; (ii) 11–20; (iii) 21–30; (iv) 31–40;
1054
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
(v) 41–50; (vi) 51–60; (vii) 61–70; (viii) 71–80; (ix) 81–90;
(x) 91–100; (xi) 101–200; or (xii) 201–1,300 CpG sites included on the array. The distribution of transcripts in the
histogram was then checked using the data contained in
the GenomeStudio data file generated for Run Two.
For each bin, 500 refgene accession numbers at a time
were uploaded into the IPA software, representing one
set. The number of sets included in each bin varied from 1
to 29, according to the number of transcripts it included.
Each set was loaded into a "My Pathway" analysis, and the
"overlay functions and diseases" tool was used to determine the number of genes within it that were associated
with (i) cancer, (ii) connective tissue disorders, and (iii)
developmental disorders.
The median number of CpG sites included in each set
was determined. This number was transformed using
the natural logarithm, and a square root transformation
of the percentage of genes in each set associated with a
given disease/disorder was also performed. The Pearson
Cancer Epidemiology, Biomarkers & Prevention
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Batch Effects, Pathway Analysis, and DNA Methylation Arrays
Figure 2. Principal Component
Analysis shows that ComBat
effectively removed batch effects
from randomized Run Two but not
from nonrandomized Run One.
Before ComBat, the first PC in Run
Two was significantly associated
with the chip on which samples were
run. After using ComBat, the first and
third PCs were significantly
associated with exposure group, and
batch effects were not associated
with any of the major PCs. The
percent variance explained by each
PC is indicated.
correlation coefficient between the log-transformed median number of associated CpG sites in each set and the
square root–transformed percentage of cancer, connective
tissue disorder, and developmental disorder–associated
genes was calculated.
Results
Batch effects in Illumina 450 K methylation array
analysis
PCA showed that ComBat effectively removed batch
effects in Run Two, in which samples were randomized
(Fig. 2). Before using ComBat, the first PC in Run Two was
significantly associated with the chip on which samples
were run (P < 0.001). After using ComBat, the first and
third PCs were significantly associated with As exposure
group (P < 0.05), and batch effects were not associated
with any of the first 5 PCs. By contrast, in nonrandomized
Run One, both chip and exposure group were significantly associated with PC 2 before using ComBat (P < 0.05),
and both remained significantly associated with PC 3 after
using ComBat (P < 0.001).
Before correction for batch effects, the results from Run
One included 1,203 CpG sites that were differentially
methylated according to As exposure group (Table 1).
A clustering analysis showed that by using the top 100
differentially methylated sites by q value, all but two of
the study participants could be assigned to the correct
exposure group (Fig. 3). However, the same analysis showed that samples clustered by the chip on which they were
run (Fig. 3). The two samples which could not be assigned to
the correct exposure group, 25 and 26, included a low-As
sample that was run on a chip that otherwise contained all
high-As samples and a high-As sample run on a chip that
contained predominately low-As samples (Fig. 1).
After adjusting for chip using the ComBat program, the
results from Run One included more than 24,000 CpG
sites that were differentially methylated according to As
exposure group (Table 1). However, when the same DNA
samples were re-analyzed after being plated in randomly
assigned order on new 450 K array chips (Run Two), only
449 differentially methylated sites were identified
(Table 1). Only 25 differentially methylated sites overlapped between the batch-adjusted results from both Run
One and Run Two (Table 1). They are associated with a
number of genes involved in cancer pathogenesis, as well
as DNA/RNA binding and the production of oxidative
stress (data not shown); in the future, we plan to verify
Table 1. Number of differentially methylated CpG sites (q < 0.05) in samples run on the Illumina 450 K
HumanMethylation Array two times: first, plated by exposure status, and then with samples randomized to chip
Before adjusting for chip
After adjusting for chipa
Number of sites identified in both runs one and two, after adjusting for chip
a
Run One
(nonrandomized)
Run Two
(randomized)
1,203
24,184
25
0
187
The program ComBat was used to adjust for the chip on which samples were run (8).
www.aacrjournals.org
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
1055
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Harper et al.
Figure 3. A seemingly strong
association between exposure
group and DNA methylation
patterns in Run One (in which
samples were plated by exposure
group) appears to be due to batch
effects. A clustering analysis was
conducted in which 48 samples
were grouped on the basis of
methylation patterns observed in
the top 100 differentially
methylated CpG sites (as
determined by q value). In Run
One: (A) all but 2 individuals
(samples 25 and 26) could be
correctly assigned to the correct
exposure group on the basis of b
values at these sites, but (B) there
was a strong relationship between
the chip on which a sample was
run and the other samples with
which it clustered. Later analysis
showed that the two misclassified
samples in (A) were run on chips
containing predominantly
samples from the other exposure
group. By contrast, in Run Two, in
which the same samples were
plated after being randomized
and the models used to determine
significance were adjusted for
batch effects as well as several
possible confounders: (C) 41 of 48
samples were correctly assigned
to the correct exposure group,
and (D) no relationship between
the chip on which a sample was
run and the clustering pattern was
evident.
these differentially methylated sites using NextGen bisulfite sequencing.
The pre-ComBat Spearman’s correlations between b
values from Runs One and Two were poor overall
(mean r ¼ 0.30) with a bimodal distribution, the larger
peak centered around zero (Supplementary Fig. S1). The
correlation between runs did not improve after using
ComBat and remained biomodal (mean r ¼ 0.29; Supplementary Fig. S1). The peak around zero was not
present when correlations between samples run on a
single chip in Run One and the corresponding samples
in Run Two were calculated (Supplementary Fig. S2); it
emerged only when correlations were calculated across
chips in Run One (Supplementary Fig. S1) and reflected
the prominence of batch effects in this nonrandomized
run.
Pathway analysis simulations
In this study, we ran 100 simulations, each consisting of
all of the genes linked to 100 CpG sites drawn randomly
from those occurring on the 450 K chip. Each list of genes
was entered into the IPA software to determine whether
it was significantly associated with any diseases or bio-
1056
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
logical functions in the IPA database. In our trials, the
software assigned each of the simulations a results profile
that contained five significantly associated (P < 0.05)
"Diseases and Disorders," "Molecular and Cellular Functions," and "Physiological System Development and Functions," as well as a varying number of significantly linked
"Top Canonical Pathways." Cancer (71% of simulations)
was the most common disease/disorder to be reported as
statistically associated with the gene lists, followed by
developmental disorders (50% of simulations), and connective tissue disorders (37% of simulations; Fig. 4). The
most common molecular/cellular functions were cellular
development (55% of simulations), cell morphology (52%
of simulations), and cellular assembly and organization
(45% of simulations; Fig. 4). In terms of physiological
system development and function, embryonic development (65% of simulations), tissue development (41% of
simulations), and nervous system development and function (37% of simulations) were most frequently encountered (Fig. 4). Finally, more than 200 canonical pathways
were significantly linked to the simulation results,
although no single pathway was linked to more than
6% of the simulations. Complete tables of the significant
Cancer Epidemiology, Biomarkers & Prevention
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Batch Effects, Pathway Analysis, and DNA Methylation Arrays
associated with a given transcript varied from 1 to 1,299. In
addition, we found that transcripts associated with genes
of particular interest to cancer researchers, such as TP53 or
KRAS, were disproportionally represented on the chip,
possessing a greater number of associated sites than the
Illumina-reported average of 17 (25). We also examined
the relationship between the number of CpG sites associated with a given gene and the likelihood that the gene
would be associated with some of the most common
diseases/disorders in our IPA results. To do this, we
examined the correlation between the median number
of gene-associated CpG sites and the percentage of
genes associated with a given disease/disorder in 69
"sets" of 500 transcripts or less, which incorporated
every refgene sequence represented on the 450 K chip.
We found that the median number of CpG sites in a set
was highly correlated with the percentage of genes
within that set that IPA linked to cancer (P < 0.0001),
connective tissue disorders (P ¼ 0.005), and developmental disorders (P < 0.0001; Fig. 6).
Discussion
Figure 4. The results of 100 simulations in which 100 random CpG sites
were drawn from Illumina's 450 K array and the associated genes were
entered into Ingenuity's IPA software for Pathway Analysis. The 10 most
frequent (A) diseases and disorders, (B) molecular and cellular functions,
and (C) physiological system development and function results that were
significantly associated with the random simulation data (P < 0.05) are
presented here. Significant links to cancer, developmental disorder,
cellular development, cell morphology, and embryonic development
were found in half or more of the simulations.
results from these simulations can be found in Supplementary Tables S1 to S4.
Deconstructing Illumina’s 450 K HumanMethylation
Array
To identify the reason why CpG sites chosen at random
from the chip would yield such results, we first examined
the distribution of transcript-associated CpG sites on the
450 K chip (Fig. 5). We found that the number of sites
www.aacrjournals.org
In this article, we provided illustrations of two problems that researchers studying gene-specific methylation
frequently encounter: batch effects in DNA methylation
arrays and difficulties associated with applying pathway
analysis software that was originally created to work with
gene expression arrays to DNA methylation arrays.
Although an entire book has been written about dealing
with batch effects in array data (26), investigators new to the
field often underestimate this challenge, which may have
serious ramifications upon their research. While it may be
clear to some that running samples in different laboratories
or preparing them in different ways invites batch effects,
the effect of simply plating samples on separate chips run at
the same time may not be adequately appreciated (Fig. 3).
Some researchers may accidentally confound exposure
group and batch, as in the example presented here; if they
are not aware of the power of batch effects, they may publish
spurious findings with real health implications (See Fig. 3
and Run One Results in Table 1). Especially when samples
are being run by a commercial or partner laboratory, it is
crucial for the researcher supplying the samples to assign
them run IDs that mask their identity and to specify the
order in which they should be plated. If samples are not
properly randomized across chips and batch and the phenotype of interest are confounded, even a program that is
very effective at removing batch effects, such as ComBat,
will not be able to salvage the data (Fig. 2, Supplementary
Figs. S1 and S2). Conversely, other researchers may properly randomize their samples but fail to appreciate the
potential of batch effects to swamp the signal in which
they are interested. If this is the case, they may find few
significant results in their data and abandon a project, not
realizing that by simply controlling for a factor such as the
chip on which samples were run, they could unmask real
differences between the 2 groups they are comparing (See
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
1057
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Harper et al.
Figure 5. Distribution of transcriptassociated CpG sites on Illumina's
450 K HumanMethylation Array.
While some transcripts were
associated with only a single CpG
site included on the array, others
were associated with many more.
For example, some cancer-linked
genes such as TP53 and KRAS
have more than 30 associated CpG
sites on the chip, whereas another
gene, PTPRN2, is associated with
1,299 CpG sites.
Run Two Results in Table 1 and Fig. 2). It is our hope that the
examples presented in this article, based on real data, will
provide convincing evidence that carefully adjusting for
batch effects is a necessary part of methylation array analysis. Batch effects are likely to be particularly problematic
when researchers anticipate that any observed differences
in methylation will be subtle, as when low-level exposures
are studied in healthy people. In these situations, the
signals of interest are more easily masked by batch effects.
Unlike the perils of batch effects, which have been
appreciated for some time, to the best of our knowledge,
no one has discussed publicly the problems associated
with using commercial pathway analysis software to analyze DNA methylation array data. This is such a common
practice in the literature that some reviewers may request
that it be done, even though these software packages were
created for use with gene expression arrays. Here, we
show that when used with Illumina HumanMethylation
450 K array data, the IPA software consistently results in
spurious, statistically significant links to various diseases,
disorders, and biological processes such as cancer, developmental disorders, and cellular and embryonic development. These problems appear to stem from the failure of
commercially available pathway analysis software to
account for the differential representation of genes on an
array. That is, some genes are associated with many more
CpG probes on the array than others (Fig. 5), and genes
that are interesting to researchers due to their links to
Figure 6. Genes with more CpG
probes on Illumina's 450 K
HumanMethylation Array were
more likely to be associated with
disease in Ingenuity's IPA
software. Here, we present the
correlation between the median
number of CpG sites in a "set" of
transcripts (see Materials and
Methods) and the percent of genes
associated with (A) cancer, (B)
connective tissue disorders, and
(C) developmental disorders,
which represent the disease
categories most commonly linked
to our data in 100 random-draw
trials. Each dot in the graph
represents a set of 500 or fewer
transcripts.
1058
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
Cancer Epidemiology, Biomarkers & Prevention
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Batch Effects, Pathway Analysis, and DNA Methylation Arrays
disease are disproportionately represented on the chip
(Fig. 6). These problems do not appear to be limited to
Illumina’s 450 K array or Ingenuity’s IPA software. For
example, Affymetrix states of their popular GeneChip
Human Promoter 1.0R Array, "for over 1,300 cancer-associated genes, coverage of promoter regions was expanded
to include additional genomic content" (27). Similarly, the
commonly used pathway analysis program DAVID (28)
allows users to specify the expression array used but does
not provide this option for methylation arrays, resulting in
an inability to control for the number of probes associated
with each gene. Unless a pathway analysis package has
been modified for use with gene methylation arrays, so
that differential representation of genes is considered
when calculating the observed versus expected ratios of
genes associated with diseases or gene networks, it is
unlikely that the program will yield reliable results.
In this article, we highlighted two problems associated
with DNA methylation array analysis. However, there are
additional methodologic challenges that should be noted.
First, although we demonstrated that pathway analysis
can yield spurious associations when genes are differentially represented on an array, it is important to highlight
another potential limitation inherent to this approach.
Pathway analysis works by detecting the statistical overrepresentation of genes belonging to a certain pathway in a
dataset. However, hypermethylation in the promoter
region of a single upstream regulator could shut down
an entire biologic pathway without being detected by
pathway analysis. By concentrating on pathway analysis
results, single-gene alterations in methylation associated
with important biological changes could be missed. Second, many methylation studies are carried out on PBMCs
or peripheral blood leukocyte (PBL) samples. Such samples are composed of a variety of white cell types, which
have distinct DNA methylation patterns (29). If a factor
being studied influences the distribution of these cell
types, alterations in DNA methylation due to a change in
cell-type composition may be mistakenly attributed to the
direct effects of the factor of interest. To avoid this problem, cell-type distribution may be determined at the time
of the blood draw via a differential cell count. Alternately,
a computational method of estimating the cell mixture
distribution in a sample has been developed by Houseman
and colleagues (29), and recently, it was used to determine
differentially methylated genes in the blood of patients
with rheumatoid arthritis (30). Third, single-nucleotide
polymorphisms (SNPs) in the sites targeted by an array’s
probes can prove problematic. Because the Illumina array
quantifies the level of methylation at each target site, a
polymorphism can result in a decline in methylation levels
that is not due to fewer unmethylated CpG sites (31). Thus,
the issues associated with population stratification that
complicate genome-wide association studies could also
complicate DNA methylation analyses. Finally, the best
method of controlling for multiple comparisons when
analyzing array data varies based on the type of study,
and under certain circumstances, it may be helpful to
reduce the dimensionality of array data before analyzing
association with a phenotype of interest (32). All of these
considerations make methylation array data analysis a
challenging endeavor.
Investigation of the role of DNA methylation in disease
holds much promise, and many scientists are currently
exploring epigenetic influences on a multitude of outcomes. Using a study that exploits the availability of
repeated measures and a set of in silico experiments, we
have provided illustrations of (i) the insidious nature of
batch effects and (ii) problems encountered when using
commercially available gene expression pathway analysis
software for the analysis of DNA methylation array data.
Collectively, these analyses highlight two common pitfalls associated with the analysis and interpretation of
DNA microarray data.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: K.N. Harper, M.V. Gamble
Development of methodology: K.N. Harper, B.A. Peters
Acquisition of data (provided animals, acquired and managed patients,
provided facilities, etc.): K.N. Harper, B.A. Peters, M.V. Gamble
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): K.N. Harper, M.V. Gamble
Writing, review, and/or revision of the manuscript: K.N. Harper, B.A.
Peters, M.V. Gamble
Study supervision: M.V. Gamble
Acknowledgments
The authors thank Jeff Conroy, Shuang Wang, Hokeun Sun, Julie
Herbstman, and Megan Hall for their help; Greg Gibson and John Storey
for their summer course on working with array data; the members of the
Epigenetics Working Group at CUMC; and two anonymous reviewers
whose comments helped improve this manuscript.
Grant Support
This work was supported by funding from a Robert Wood Johnson
Health & Society Scholars seed grant and Columbia University’s Cancer
Epidemiology Training Program (5-R25-CA 094061; to K.N. Harper) and
NIH grants R01CA133595, R01ESO17875, and P42ES10349 (to M.V.
Gamble).
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
Received January 30, 2013; revised March 26, 2013; accepted April 7,
2013; published OnlineFirst April 29, 2013.
References
1.
Marsit C, Koestler D, Christensen B, Karagas M, Houseman E, Kelsey
K. DNA methylation array analysis identifies profiles of blood-derived
DNA methylation associated with bladder cancer. J Clin Oncol
2011;29:1133–9.
www.aacrjournals.org
2.
Teschendorff A, Menon U, Gentry-Maharaj A, Ramus S, Weisenberger
D, Shen H, et al. Age-dependent DNA methylation of genes that are
suppressed in stem cells is a hallmark of cancer. Genome Res
2010;20:440–6.
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
1059
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Harper et al.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
1060
Hansen K, Timp W, Bravo H, Sabunciyan S, Langmead B, McDonald
O, et al. Increased methylation variation in epigenetic domains across
cancer types. Nat Genet 2011;43:768–75.
Breitling L, Yang R, Korn B, Burwinkel B, Brenner H. Tobacco-smoking-related differential DNA methylation: 27K discovery and replication. Am J Hum Genet 2011;88:450–7.
Leek J, Scharpf R, Bravo H, Simcha D, Langmead B, Johnson W, et al.
Tackling the widespread and critical impact of batch effects in highthroughput data. Nat Rev Genet 2010;11:733–9.
Sun Z, Chai H, Wu Y, White W, Donkena K, Klein C, et al. Batch effect
correction for genome-wide methylation data with Illumina Infinium
platform. BMC Med Genomics 2011;4:84.
Wang D, Zhang Y, Huang Y, Li P, Wang M, Wu R, et al. Comparison of
different normalization assumptions for analyses of DNA methylation
data from the cancer genome. Gene 2012;506:36–42.
Johnson W, Rabinovic A, Li C. Adjusting batch effects in microarray
expression data using empirical Bayes methods. Biostatistics
2006;8:118–27.
Yan L, Ma C, Wang D, Hu Q, Qin M, Conroy J, et al. OSAT: a tool for
sample-to-batch allocations in genomics experiments. BMC Genomics 2012;13:689.
Baggerly K, Coombes K, Neeley E. Run batch effects potentially
compromise the usefulness of genomic signatures for ovarian cancer.
J Clin Oncol 2008;26:1186–7.
Dressman H, Berchuck A, Chan G, Zhai J, Bild A, Sayer R, et al. An
integrated genomic-based approach to individualized treatment of
patients with advanced-stage ovarian cancer. J Clin Oncol 2007;25:
517–25.
Spielman R, Bastone L, Burdick J, Morley M, Ewens W, Cheung V.
Common genetic variants account for differences in gene expression
among ethnic groups. Nat Genet 2007;39:226–31.
Akey J, Biswas S, Leek J, Storey J. On the design and analysis of gene
expression studies in human populations. Nat Genet 2007;39:807–8.
Daily J, Scanfeld D, Pochet N, Roch KL, Plouffe D, Kamal M, et al.
Distinct physiological states of Plasmodium falciparum in malariainfected patients. Nature 2007;450:1091–5.
Lemieux J, Feller A, Holmes C, Newbold C. Reply to Wirth et al.: in vivo
profiles show continuous variation between two cellular populations.
PNAS 2009;106:E71–E2.
Einstein F, Thompson R, Bhagat T, Fazzari M, Verma A, Barzilai N, et al.
Cytosine methylation dysregulation in neonates following intrauterine
growth restriction. PLoS ONE 2010;5:e8887.
Kim S, Kelly W, Fu A, Haines K, Hoffman A, Zheng T, et al. Genome-wide
methylation analysis identifies involvement of TNF-alpha mediated
cancer pathways in prostate cancer. Cancer Lett 2011;302:47–53.
Sadikovic B, Yoshimoto M, Al-Romaih K, Maire G, Zielenska M, Squire
J. In vitro analysis of integrated global high-resolution DNA Methylation
Cancer Epidemiol Biomarkers Prev; 22(6) June 2013
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
profiling with genomic imbalance and gene expression in osteosarcoma. PLoS One 2008;3:e2834.
Thompson R, Atzmon G, Gheorghe C, Liang H, Lowes C, Greally J,
et al. Tissue-specific dysregulation of DNA methylation in aging. Aging
Cell 2010;9:506–18.
Zhu Y, Stevens R, Hoffman A, Tjonneland A, Vogel U, Zheng T, et al.
Epigenetic impact of long-term shiftwork: pilot evidence from circadian
genes and whole-genome methylation analysis. Chronobiol Int
2011;28:852–61.
€lk K, Vo
~ sa U, Roosipuu R, et al.
Lokk K, Vooder T, Kolde R, Va
Methylation markers of early-stage non-small cell lung cancer. PLoS
ONE 2012;7:e39813.
Novakovic B, Yuen R, Gordon L, Penaherrera M, Sharkey A, Moffett
A, et al. Evidence for widespread changes in promoter methylation
profile in human placenta in response to increasing gestational age
and environmental/stochastic factors. BMC Genomics 2011;
12:529.
Leek J, Johnson W, Parker H, Jaffe A, Storey J. SVA: surrogate variable
analysis. R package version 3.2.1. 2012.
Du P, Zhang X, Huang C, Jafari N, Kibbe W, Hou L, et al. Comparison of
Beta-value and M-value methods for quantifying methylation levels by
microarray analysis. BMC Bioinformatics 2010;11:587.
Illumina. Infinium HumanMethylation450 BeadChip Kit. 2012; Available from: http://www.illumina.com/products/methylation_450_beadchip_
kits.ilmn
Scherer A. Batch effects and noise in microarray experiments: sources
and solutions. Hoboken, NJ: John Wiley & Sons; 2009.
Affymetrix. GeneChip Human Promoter 1.0R Array. [cited 2013 Mar 18];
Available from: http://www.affymetrix.com/estore/browse/products
.jsp?productId¼131461#1_1.
Huang D, Sherman B, Lempicki R. Systematic and integrative analysis
of large gene lists using DAVID Bioinformatics Resources. Nat Protoc
2009;4:44–57.
Houseman E, Accomando W, Koestler D, Christensen B, Marsit C,
Nelson H, et al. DNA methylation arrays as surrogate measure of cell
mixture distribution. BMC Bioinformatics 2012;13:86.
Liu Y, Aryee M, Padyukov L, Fallin M, Hesselberg E, Runarsson A, et al.
Epigenome-wide association data implicate DNA methylation as an
intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol
2013;31:142–7.
Chen Y, Choufani S, Ferreira J, Grafodatskaya D, Butcher D,
Weksberg R. Sequence overlap between autosomal and sex-linked
probes on the Illumina HumanMethylation27 microarray. Genomics
2011;97:214–22.
Houseman E. Biostatisical methods in epigenetic epidemiology. In:
Michels K, editor. Epigenetic epidemiology. New York, NY: Springer
Verlag; 2012. p. 57–76.
Cancer Epidemiology, Biomarkers & Prevention
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.
Published OnlineFirst April 29, 2013; DOI: 10.1158/1055-9965.EPI-13-0114
Batch Effects and Pathway Analysis: Two Potential Perils in
Cancer Studies Involving DNA Methylation Array Analysis
Kristin N. Harper, Brandilyn A. Peters and Mary V. Gamble
Cancer Epidemiol Biomarkers Prev 2013;22:1052-1060. Published OnlineFirst April 29, 2013.
Updated version
Supplementary
Material
Cited articles
Citing articles
E-mail alerts
Reprints and
Subscriptions
Permissions
Access the most recent version of this article at:
doi:10.1158/1055-9965.EPI-13-0114
Access the most recent supplemental material at:
http://cebp.aacrjournals.org/content/suppl/2013/05/08/1055-9965.EPI-13-0114.DC1
This article cites 27 articles, 5 of which you can access for free at:
http://cebp.aacrjournals.org/content/22/6/1052.full#ref-list-1
This article has been cited by 7 HighWire-hosted articles. Access the articles at:
http://cebp.aacrjournals.org/content/22/6/1052.full#related-urls
Sign up to receive free email-alerts related to this article or journal.
To order reprints of this article or to subscribe to the journal, contact the AACR Publications Department
at [email protected].
To request permission to re-use all or part of this article, contact the AACR Publications Department at
[email protected].
Downloaded from cebp.aacrjournals.org on July 12, 2017. © 2013 American Association for Cancer Research.