BIOINFORMATICS
ORIGINAL PAPER
Vol. 24 no. 9 2008, pages 1161–1167
doi:10.1093/bioinformatics/btn096
Gene expression
An analytical pipeline for genomic representations used
for cytosine methylation studies
Reid F. Thompson1, Mark Reimers2, Batbayar Khulan1, Mathieu Gissot3,4,
Todd A. Richmond5, Quan Chen6,7, Xin Zheng6,7, Kami Kim3,4 and John M. Greally1,8,*
1Department of Molecular Genetics, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461,
2Department of Biostatistics, Virginia Commonwealth University, 730 East Broad Street, Richmond, VA 23298,
3Department of Medicine (Infectious Diseases), 4Department of Microbiology & Immunology, Albert Einstein College of
Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, 5Roche NimbleGen, 1 Science Court, Madison, WI 53711,
6Department of Pathology, 7Bioinformatics Shared Resource and 8Department of Medicine (Hematology), Albert
Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
Received on November 7, 2007; revised on January 24, 2008; accepted on March 7, 2008
Advance Access publication March 18, 2008
Associate Editor: Joaquin Dopazo
ABSTRACT
Motivation: Representations of the genome can be generated by the
selection of a subpopulation of restriction fragments using ligation-mediated
PCR. Such representations form the basis for a number of
high-throughput assays, including the HELP assay to study cytosine
methylation. We find that HELP data analysis is complicated not only
by PCR amplification heterogeneity but also by a complex and
variable distribution of cytosine methylation. To address this, we
created an analytical pipeline and novel normalization approach that
improves concordance between microarray-derived data and single
locus validation results, demonstrating the value of the analytical
approach. A major influence on the PCR amplification is the size of
the restriction fragment, requiring a quantile normalization approach
that reduces the influence of fragment length on signal intensity.
Here we describe all of the components of the pipeline, which can
also be applied to data derived from other assays based on genomic
representations.
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
There are several techniques described that rely on restriction
enzyme-generated representations to sample the genome. These
include representational oligonucleotide microarray analysis
with Roche NimbleGen microarrays (ROMA; Lucito et al.,
2003), whole-genome sampling analysis (WGSA) with
Affymetrix SNP microarrays (Kennedy et al., 2003), representational difference analysis (RDA; Lisitsyn and Wigler, 1993)
and other techniques to study cytosine methylation in the
genome (Hatada et al., 2006) including the HELP assay that we
have previously described (Khulan et al., 2006). Cytosine
methylation is an epigenetic mark maintained by DNA
*To whom correspondence should be addressed.
methyltransferases and important for transcriptional regulation
(Jones and Baylin, 2007). The HELP assay is based on HpaII
Tiny Fragment (HTF; Bird, 1986) Enrichment by Ligation-mediated
PCR (HELP). The approach relies on comparative
genomic hybridization of HpaII-digested and MspI-digested
DNA. While HpaII is only able to digest its unmethylated
recognition motif (5′-CCGG-3′), its isoschizomer, MspI, digests
any HpaII sites whether methylated or not at the central CG
dinucleotide (Waalwijk and Flavell, 1978). MspI therefore
serves as an internal control for the assay, representing all
HTFs equivalently unless the locus is deleted, amplified or
mutated to prevent restriction digest. We have shown that PCR
enrichment of the two genomic representations followed by
co-hybridization on a custom microarray provides a powerful
tool for determining local methylation status on a genome-wide
scale (Khulan et al., 2006).
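The enzyme logic described above can be illustrated with a small in-silico digest. This is a minimal sketch in Python, not part of the published R pipeline: the function names, the `methylated_sites` argument and the toy sequence are hypothetical, and the 200–2000 bp window is taken from the size range of the representation described in this paper. Both enzymes cut within CCGG after the first cytosine; HpaII is modeled as skipping any site listed as methylated, whereas MspI cuts every site.

```python
import re

SITE = "CCGG"  # recognition motif shared by HpaII and MspI


def cut_positions(seq, methylated_sites=()):
    """Return 0-based cut coordinates (C^CGG) for every unblocked site.

    methylated_sites holds 0-based motif start positions that the enzyme
    cannot cut; leave it empty to model MspI, fill it to model HpaII.
    """
    blocked = set(methylated_sites)
    return [m.start() + 1
            for m in re.finditer(SITE, seq.upper())
            if m.start() not in blocked]


def fragments(seq, methylated_sites=(), min_len=200, max_len=2000):
    """List (start, end, length) for fragments between adjacent cut sites
    whose length falls inside the amplifiable size window."""
    cuts = cut_positions(seq, methylated_sites)
    result = []
    for start, end in zip(cuts, cuts[1:]):
        length = end - start
        if min_len <= length <= max_len:
            result.append((start, end, length))
    return result


# Toy sequence with three CCGG sites (motif starts at 10, 310 and 364).
toy = "A" * 10 + "CCGG" + "T" * 296 + "CCGG" + "A" * 50 + "CCGG" + "G" * 5
mspi_fragments = fragments(toy)                            # all sites cut
hpaii_fragments = fragments(toy, methylated_sites={310})   # middle site blocked
```

In this toy case the MspI digest yields one 300 bp fragment, while blocking the middle site for HpaII removes that fragment and produces a longer one instead, which is the difference the HpaII/MspI comparison is designed to detect.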
However, PCR amplification of mixed templates, a step
inherent to representational techniques, can cause certain
fragments to amplify with efficiencies different from those of
other fragments. A number of important biases have been
described that affect PCR with increasing numbers of
amplification cycles (Mathieu-Daude et al., 1996; Suzuki and
Giovannoni, 1996). Furthermore, random artifacts introduced
at early stages of the PCR can have dramatic effects (Polz and
Cavanaugh, 1998; Wagner et al., 1994). The HELP amplification technique is limited to 20 cycles to minimize some of these
potential sources of bias. However, fragment length bias
(Carvalho et al., 2007; Nannya et al., 2005) is inherent to any
multi-template PCR reaction, as the efficiency of amplification
of each component template is affected by the length of that
template. In this report, we provide further evidence that
this phenomenon may be responsible for substantial differences
in PCR amplification efficiency, sometimes of the order of
the biological differences we seek to measure, complicating our
ability to compare results within and between different arrays.
We solve the amplicon length heterogeneity problem with a
novel quantile normalization method that we have developed
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
as part of a modular pipeline of analytical tools. We assess the
performance of this pipeline with extensive bisulphite pyrosequencing validation studies. Designed for the HELP assay
specifically, these tools can also be applied to other techniques
that use data from PCR amplification to create genomic
representations. The functions in this pipeline (Supplementary
Fig. 1) are publicly available and written for the R Statistical
Package (R Development Core Team, 2005) to allow adoption
and testing by other investigators.
2 SYSTEM AND METHODS
2.1 Samples
For the studies described here, samples were obtained from C57Bl/6J
mice (Charles River Laboratories). Highly-enriched spermatogenic cell
populations were isolated from 25 mice of ages 6–8 weeks using the
Sta-Put method based on sedimentation velocity and unit gravity (Bellve
et al., 1977; Romrell et al., 1976). Purity of germ cells (90%) was
established on the basis of cellular morphology using light microscopy.
Whole brain samples were obtained from male mice at age 8 weeks.
Additional samples were obtained from Sprague Dawley rat liver and
cultured Toxoplasma gondii RH strain tachyzoites. Genomic DNA was
purified from all cell or tissue types, and HELP samples were prepared
as we have described previously (Khulan et al., 2006).
2.2 The HELP assay
To perform a HELP experiment, high molecular weight genomic DNA
is isolated, digested to completion by HpaII and MspI separately, and
then ligated to an oligonucleotide adapter pair complementary to the
cohesive ends generated. The linkers then serve as a priming site for
a ligation-mediated PCR reaction that we have described to generate a
product ranging primarily in the 200–2000 bp size range (Khulan et al.,
2006). To avoid the PCR biases described above, we use a universal
primer, a relatively small number of amplification cycles (20), a
substantial quantity of initial DNA template (0.1 mg), and a pooled
mixture of (three or more) ‘hot start’ PCR (Chou et al., 1992) replicates.
Following PCR, the HpaII and MspI representations are labeled with
different fluorophores using random priming and are then cohybridized
on a customized genomic microarray representing the HpaII/MspI
fragments of 200–2000 bp in unique sequence.
2.3 Array designs and data import
For each HELP experiment, Cy3 and Cy5 signal intensities measure the
relative abundances for each HTF of MspI and HpaII representations,
respectively. In addition to gridding and other technical controls
supplied by Roche NimbleGen, the microarrays also report thousands
of random probes (50-mers of random nucleotides) which serve as
a metric of non-specific annealing and background fluorescence. By
design, all probes are randomly distributed across each microarray.
Signal intensity data for every spot on the array is read from flat files
(read.pairs()) and linked to its corresponding probe identifier.
Roche NimbleGen-formatted design files are then used to link probe
identifiers to their corresponding HTF, and provide genomic position
and probe sequence information (read.design()). From these
probe sequences, we calculate (G + C) content as %GC
(calc.gc()) and theoretical melting temperatures (calc.tm())
using the nearest-neighbor approach (Allawi and SantaLucia, 1997)
with the unified thermodynamic parameters (SantaLucia, 1998).
2.4 Inter- and intra-microarray quality assessment
Before we can consider biological variability, we have to address the
issues of array performance and quality control (QC). We screen for
spatial artifacts by comparing the average ratios of signal intensities as
a function of position on the array. We divide the array into sectors
(default is 25) and take summary measures of probes located within
each sector, then compare the distributions between sectors. High-quality
hybridizations yield a relatively uniform distribution of ratios
across all sectors (Supplementary Fig. 2A), whereas samples such as the
one shown in Supplementary Figure 2B demonstrate poor hybridizations
that require repeating.
Consideration of probe signal in the context of its performance across
multiple arrays improves our ability to discriminate finer deviations in
performance. We therefore perform additional quality assessment in a
manner analogous to our previous description (Reimers and Weinstein,
2005). We first define a prototypical signal intensity and ratio profile by
mean-centering each data set, followed by calculation of a summary
measure for each probe as the (20%-) trimmed mean of its values across
the arrays (calc.prototype()). We then subtract the log-intensity
or log-ratio prototype from the corresponding value for each
oligonucleotide on each microarray to obtain an array-specific data
set that compares its signal intensities with those for the population of
signals on all of the microarrays, plotting each set of values relative to
the technical variables under consideration. For instance, regional
heterogeneities of signal intensities on the microarrays can be color-coded
in terms of deviations from prototype (plot.chip.image()),
illustrating their positions on a depiction of the microarray (Fig. 1A–C
and F–H). We also study intensity-dependent biases (Yang et al., 2002)
(see Fig. 1D, E, I and J; plot.HELP.qc()), which may reflect sources
of technical variability such as labeling efficiency. The influence of
other variables can also be studied (e.g. probe melting temperature;
Supplementary Fig. 3).
These studies allow us to identify at an early stage of the analytical
phase of an experiment the extremes or outliers within the data set.
Occasionally, there are microarrays that demonstrate varying degrees of
spatial artifacts, of which we show a representative sample in
Figure 1A–C. This particular microarray also manifested dramatic
intensity-dependent biases that further mark it as an outlier in the data
set (Fig. 1D and E). Rehybridization of these samples on a new
microarray removed the regional artifacts observed in the original
hybridization (Fig. 1F–H), resulting in significant reduction of the
intensity-dependent bias (Fig. 1I and J). Rehybridization also improved
correlation of this sample with its other replicates in the data set
(R-values 0.94, up from 0.84).
2.5 Size-dependent intensities and definition of background
When we visualize signal intensities as a function of fragment size
(Fig. 2; plot.HELP.svi()), we observe several characteristics of the
representations. It is obvious that signal intensity is dependent on
fragment length, with maximal intensities for both MspI and HpaII
representations around 500 bp and weaker signals at the extremes
(Fig. 2A and B). We also note that MspI-derived representations show
amplification of all HTFs (Fig. 2A) whereas the HpaII-derived
distribution shows a second population of oligonucleotides with low-signal
intensities across all fragment sizes represented (Fig. 2B). This
second population corresponds to DNA sequences that did not digest
or amplify because of methylation at the flanking HpaII sites. As we
have shown previously (Khulan et al., 2006), the HpaII/MspI ratios
have a bimodal distribution, the lower ratios representing methylated
HTFs while the higher ratios represent HTFs that are relatively
hypomethylated (Fig. 2C).
Fig. 1. Quality assessment shows improvement of poor array data following rehybridization on a fresh array. Spatial artifacts for a poor-quality
hybridization are shown as the difference of MspI (A) and HpaII (B) signal intensities as well as HpaII/MspI ratios (C) from the probe-by-probe
averages across all arrays in the dataset. Green indicates signal or ratio data that is less than the multi-array average while red indicates signal or
ratios that exceed the average. Panels (D) and (E) show the MspI and HpaII signal intensities on the y-axis, respectively, versus their multi-array
averages shown along the x-axis. The yellow lines on these panels represent lowess-smoothing and highlight the non-linearity of the data, consistent
with intensity-dependent bias. The lower panels show improved quality for a rehybridization of the same sample on a fresh array.
For each HELP experiment, the level of background signal intensity
(‘noise’) is measured by thousands of random probes (50-mers of
random nucleotides). These probes measure non-specific annealing and
background fluorescence and enable definition of ‘failed’ probes, those
for which the levels of MspI and HpaII signal intensities are
indistinguishable from the background intensities defined by the
random probes (Fig. 2, yellow data points). In these cases, failed
probes represent the population of fragments that do not amplify by
PCR, whatever the biological or experimental cause (e.g. genomic
deletions). We remove these probes from further consideration,
typically affecting 10–20% of the probes and a smaller fraction of
HTFs. If the probe only ‘fails’ in the HpaII and not the MspI channel,
the cause is likely to be due to methylation of that locus, and we
maintain these data through subsequent analyses.
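The background definition above can be sketched as follows. This is an illustrative Python sketch, not the published R code. It assumes the cutoff is the median of the random-probe signals plus 2.5 median absolute deviations (the value quoted in the Fig. 2 legend) and that the MAD is unscaled; whether the pipeline applies a consistency constant to the MAD is not stated here. The function and label names are hypothetical.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation (scaling is an assumption)."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def background_cutoff(random_signals, n_mads=2.5):
    """Cutoff above which a signal is treated as distinguishable from noise."""
    return median(random_signals) + n_mads * mad(random_signals)


def flag_probes(mspi, hpaii, mspi_cutoff, hpaii_cutoff):
    """Label each probe from its two log-intensities.

    'failed'     : both channels at or below background (probe is dropped)
    'hpaii_only' : only HpaII at or below background (kept, likely methylated)
    'ok'         : HpaII above background
    """
    flags = []
    for m, h in zip(mspi, hpaii):
        if m <= mspi_cutoff and h <= hpaii_cutoff:
            flags.append("failed")
        elif h <= hpaii_cutoff:
            flags.append("hpaii_only")
        else:
            flags.append("ok")
    return flags


# Illustrative values only, not measured data.
cutoff = background_cutoff([8, 9, 7, 8, 10])
flags = flag_probes([9.0, 13.0, 12.5], [9.5, 9.0, 12.0], cutoff, cutoff)
```

In practice a separate cutoff would be computed per channel from the random probes of that channel; a single shared value is used here only to keep the example short.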
The HELP assay also makes use of mitochondrial probes as a high
copy number, hypomethylated control (Supplementary Fig. 4, red data
points). Mitochondrial DNA has been observed to be unmethylated
in germline (Hecht et al., 1984), somatic (Pollack et al., 1984) and
cancerous cells (Maekawa et al., 2004). As such, the mitochondrial loci
serve as a highly reliable control, the failure of which is a robust
indicator of major problems with the assay.
Fig. 2. Microarray signal characteristics of HELP data from a normal
mouse spermatogenic sample. (A) The log-intensity of data from the
MspI genomic representation is plotted versus HTF size (right) and the
corresponding density plot is shown (left). The horizontal red line
denotes a ‘failed’ cutoff, calculated as a 2.5 MAD deviation from the
median of random probe signals. The blue curve shows the distribution
of random probe signals, and the yellow curve and data points indicate
‘failed’ data. All other data is shown in black. HpaII signal data is
similarly shown in (B), although the horizontal red line in this case
denotes a ‘methylated’ cutoff. HpaII/MspI ratios are shown in panel
(C) with a horizontal red line denoting the median HpaII/MspI ratio for
random probes.
2.6 Quantile normalization
Despite our measures to avoid PCR bias during the amplification
process, we continue to see a size-dependence of signal intensities as
described above. The distribution of MspI intensities in Figure 3A
clearly demonstrates this size bias. The HTFs around 500–600 bp tend
to generate the highest MspI signals in the intensity distribution,
whereas HTFs at the tails of the size distribution have lower signal
intensities (Fig. 3A and B). In the mouse spermatogenic sample data
depicted (see also Fig. 2), the individual biases in the MspI (Fig. 3A)
and HpaII (Fig. 3B) intensity distributions do not compensate for each
other and there remains an overt fragment size bias in the distribution
of HpaII/MspI ratios (Fig. 3C).
Fig. 3. Fragment-size effect in pre- and post-normalized data from a normal mouse spermatogenic sample. The data is divided into 58 step-wise bins
each containing a comparable number of HTFs. The color key at the upper left illustrates the partitioning scheme for pre-normalized data where each
colored block corresponds to a bin of certain HTF sizes; color varies from blue to red with increasing fragment size (from 200–2000 bp). The density
of MspI signal intensities for each bin is shown in panel (A), where different color lines represent the density data for HTFs from each corresponding
bin. The black line represents the overall density of MspI intensity data. Panel (B) shows the same representation for HpaII signals and panel
(C) shows the same representation for HpaII/MspI ratios. The color key to the lower left shows the analogous partitioning scheme for normalized
data after failed probes have been identified and removed. Normalized MspI, HpaII, and HpaII/MspI ratio data are shown (without failed probes)
in panels (D), (E), and (F), respectively.
Previous reports have also shown fragment length bias in PCR and,
further, that reduction of this bias with linear modeling improves data
interpretation (Carvalho et al., 2007; Nannya et al., 2005). We address
the fragment length problem using a quantile normalization approach
(quantile.normalize()). The goal of this approach was to
normalize signal intensities across all fragment lengths, improving
within and between-array comparisons. This normalization corrects for
the fragment size-dependency of the MspI, HpaII and HpaII/MspI
ratio distributions (Fig. 3D, E and F). The approach is similar to inter-array
quantile normalization methods (e.g. RMA; Irizarry et al., 2003);
however, in this particular case we align the quantiles across density-dependent
sliding windows of size-sorted data.
All HTFs that are considered amplifiable (i.e. those not classified as
failed by the above criterion) are sorted according to increasing
fragment size. The resultant data are then divided into multiple bins (b)
and steps (s), resulting in a total number (n, where n = s(b − 1) + 1) of
sliding windows (w = {1, ..., n}). Minimum and maximum fragment size
boundaries for each window are calculated as the (w − 1)/n and w/n
quantiles, respectively. The data are then split according to these
boundaries into each window (overlapping windows are each assigned
a copy of the overlapping data). MspI signal intensity quantiles are
calculated for each window and are then averaged across all n windows
to produce an average quantile (Q). In order to track overlapping data,
each probe is assigned intensity-sorted position(s) (p) within the
window(s) in which it is included. These positions then determine
which values from Q to assign to each probe in a given window.
Because a probe may appear at different points within the quantile
distribution for two or more overlapping windows (e.g. p1 ≠ p2), final
quantile-normalized values are calculated for each probe as the average
of these values: mean(Q_p1, Q_p2, ...).
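The windowed procedure above can be sketched in a few lines. This is a hedged Python illustration, not the published quantile.normalize() function. It makes several assumptions that the text does not confirm: each window covers 1/b of the size-sorted probes and advances by 1/(b·s), so that adjacent windows overlap when s > 1; window quantiles are compared on a fixed grid of probabilities using linear interpolation; and a probe's position p is converted to a probability by dividing its rank by the window size minus one. All names and the toy data are hypothetical.

```python
def window_bounds(n_items, bins, steps):
    """Index ranges of n = steps*(bins-1)+1 sliding windows over sorted data."""
    n_windows = steps * (bins - 1) + 1
    denom = bins * steps
    return [((w * n_items) // denom, ((w + steps) * n_items) // denom)
            for w in range(n_windows)]


def quantile(sorted_vals, prob):
    """Linearly interpolated quantile of pre-sorted values, prob in [0, 1]."""
    pos = prob * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def quantile_normalize(sizes, signals, bins=4, steps=2, grid=101):
    """Align signal quantiles across sliding windows of size-sorted probes."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    windows = [order[lo:hi] for lo, hi in window_bounds(len(order), bins, steps)]
    probs = [k / (grid - 1) for k in range(grid)]

    # Quantiles of every window, averaged over windows: the reference Q.
    per_window = []
    for win in windows:
        vals = sorted(signals[i] for i in win)
        per_window.append([quantile(vals, p) for p in probs])
    reference = [sum(col) / len(col) for col in zip(*per_window)]

    # Each probe receives the reference value at its rank in each window,
    # and the final value is the mean over all windows that contain it.
    assigned = {i: [] for i in order}
    for win in windows:
        ranked = sorted(win, key=lambda i: signals[i])
        for rank, i in enumerate(ranked):
            prob = rank / (len(ranked) - 1) if len(ranked) > 1 else 0.5
            assigned[i].append(quantile(reference, prob))
    return [sum(assigned[i]) / len(assigned[i]) for i in range(len(sizes))]


# Toy data: the longer half of the fragments carries a +10 intensity offset.
toy_sizes = list(range(40))
toy_signals = [(i % 4) + (10 if i >= 20 else 0) for i in range(40)]
flat = quantile_normalize(toy_sizes, toy_signals, bins=4, steps=1)
smooth = quantile_normalize(toy_sizes, toy_signals, bins=4, steps=2)
```

With non-overlapping windows (steps=1) every window is mapped onto the same reference distribution, so the size-dependent offset in the toy data disappears entirely; with overlapping windows the averaging across windows makes the correction vary smoothly with fragment size.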
The same calculations are then applied to the HpaII signals, again
with failed probes (defined by MspI intensities) removed. However, for
HpaII data the (methylated) probes that fall within the random signal
distribution (≤99% quantile) are normalized separately from those that
exceed random probe signals (>99% quantile). This piecewise normalization
is performed to separate amplifying (hypomethylated) probes
from their unamplifying (methylated) counterparts in order to preserve
the potentially variable distributions of methylation across different
fragment sizes. This may be particularly relevant for the treatment of
CG-dense regions, which tend to occur at shorter fragment lengths
(Supplementary Fig. 5) and which may exhibit different distributions of
methylation. Quantile-normalized HpaII/MspI logratios are calculated
as the difference between normalized HpaII and MspI signals, and are
then centered to the average difference of random probe signal intensities (HpaII-MspI) to adjust for global differences in signal strength.
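The piecewise split and the centering of log ratios described above can be sketched as follows. This is a Python illustration under stated assumptions, not the pipeline's R code: the 99% quantile of the random probes is computed with the standard-library `statistics.quantiles` function using its "inclusive" method (the interpolation rule used by the authors is not given), and the function names and numbers are hypothetical.

```python
from statistics import quantiles


def split_by_background(hpaii, random_hpaii, percent=99):
    """Return (background_idx, amplifying_idx) for HpaII log-intensities.

    Probes at or below the chosen percentile of the random-probe signals
    are treated as non-amplifying; the two groups would then be quantile
    normalized separately.
    """
    cutoff = quantiles(random_hpaii, n=100, method="inclusive")[percent - 1]
    background = [i for i, h in enumerate(hpaii) if h <= cutoff]
    amplifying = [i for i, h in enumerate(hpaii) if h > cutoff]
    return background, amplifying


def centered_log_ratios(hpaii, mspi, random_hpaii, random_mspi):
    """HpaII - MspI log ratios, centered on the mean random-probe difference."""
    diffs = [h - m for h, m in zip(random_hpaii, random_mspi)]
    offset = sum(diffs) / len(diffs)
    return [h - m - offset for h, m in zip(hpaii, mspi)]


# Illustrative numbers only.
groups = split_by_background([50, 99, 100, 120], list(range(101)))
ratios = centered_log_ratios([10, 12], [11, 11], [8, 9], [7, 7])
```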
A useful experiment involved the HELP analysis of Toxoplasma
gondii, which we have found to lack methylation in its genome (Gissot
et al., 2008). This allowed us to consider size bias in the absence of
methylation and therefore due to technical sources alone (e.g. PCR). We
show that the HpaII and MspI distributions in this case exhibit different size-dependencies of the signal strength, causing a size-dependent
artifact in the HpaII/MspI ratio (Supplementary Fig. 6A, C and E).
Quantile normalization corrects this artifact (Supplementary Fig. 6B, D
and F) and improves inter-sample HpaII, MspI, and HpaII/MspI ratio
correlations in each of four replicate assays (by an average of 7.8%).
We tested whether normalization improves analysis of data from
methylating genomes, finding that it preserves HpaII/MspI ratio
correlation for technical replicates (R-values differ in pre- and
post-normalized data by an average of 0.1–0.2%) while enhancing the
differences between tissues (R-values comparing brain and sperm are
reduced by an average of 2% following normalization).
2.7 Data summarization
The methylation status of each HpaII fragment is typically measured by
a set of probes (up to 10, depending on the array design). Failed probes
are removed from consideration as described previously; however,
the remaining probes must be considered together, necessitating
a summarization approach (combine.data()). Currently, we
employ a 20%-trimmed mean, weighted by MspI signal intensities as
follows: for a given HTF, weights for each probe are assigned between 1
(for the lowest MspI signal intensity) and the magnitude of the range of
signal intensities (maximal weight is given to probes with the best
performance in the MspI channel). A weight of zero is assigned to the
20% most deviant probes per HTF. This enables us to make slight
adjustments for the performance of a given probe and also enables us
to take a more robust summary measure of the data (by removal of
outliers).
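One possible reading of this summarization is sketched below in Python; it is not the published combine.data() function. The assumptions are labeled in the code: "most deviant" is interpreted as farthest from the median ratio of the HTF, the number of dropped probes is the 20% fraction rounded down, and the weight of a kept probe is 1 plus its MspI log-intensity minus the lowest MspI log-intensity for that HTF. The exact weighting used by the pipeline may differ.

```python
from statistics import median


def summarize_htf(ratios, mspi, trim=0.2):
    """Weighted, trimmed mean of the probe-level log ratios for one HTF."""
    n_drop = int(trim * len(ratios))  # assumption: round the 20% down
    center = median(ratios)
    # Assumption: deviance is the distance from the median ratio.
    by_deviation = sorted(range(len(ratios)),
                          key=lambda i: abs(ratios[i] - center))
    kept = by_deviation[:len(ratios) - n_drop]  # dropped probes get weight 0
    lowest = min(mspi)
    # Assumption: weight rises linearly with MspI signal, starting at 1.
    weights = {i: 1.0 + (mspi[i] - lowest) for i in kept}
    total = sum(weights.values())
    return sum(weights[i] * ratios[i] for i in kept) / total


# Illustrative values: the outlying fifth probe is excluded.
with_outlier = summarize_htf([1.0, 1.0, 1.0, 1.0, 9.0],
                             [10.0, 11.0, 12.0, 13.0, 14.0])
two_probes = summarize_htf([0.0, 2.0], [10.0, 12.0])
```

In the second call no probe is trimmed and the probe with the stronger MspI signal dominates, giving (0·1 + 2·3)/4 = 1.5.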
2.8 Categorization
The HELP assay generates a bimodal distribution of HpaII intensities
and of HpaII/MspI ratio values as a consequence (Khulan et al., 2006)
(Fig. 2B and C). We explored whether this allowed us to categorize loci
as methylated or hypomethylated (categorize.HELP()), finding
agreement with methylation levels detected using validation studies. We
identified loci as methylated wherever HpaII signals fell below random
noise thresholds and the corresponding MspI data were above
background noise. A high-confidence hypomethylated population was
defined by HpaII signals above background with a corresponding
positive HpaII/MspI logratio. Some values, however, did not group
clearly into either the methylated or hypomethylated categories and
were therefore considered to have ‘indeterminate’ methylation status.
These categorizations are consistent with the bisulphite pyrosequencing
data we generated (Supplementary Table 1), which group into two
distinct populations: methylated and hypomethylated.
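The decision rules in this section can be written out as a short function. This Python sketch is illustrative rather than a transcription of categorize.HELP(): the argument names and category labels are hypothetical, and the thresholds are assumed to be the per-channel background cutoffs derived from the random probes.

```python
def categorize(hpaii, mspi, log_ratio, hpaii_cutoff, mspi_cutoff):
    """Assign a methylation category to one locus.

    hpaii, mspi : log-intensities of the two channels
    log_ratio   : normalized HpaII/MspI log ratio
    """
    hpaii_above = hpaii > hpaii_cutoff
    mspi_above = mspi > mspi_cutoff
    if not hpaii_above and not mspi_above:
        return "failed"          # neither representation amplified
    if not hpaii_above:
        return "methylated"      # MspI amplified, HpaII did not
    if log_ratio > 0:
        return "hypomethylated"  # HpaII amplified with a positive log ratio
    return "indeterminate"


# Illustrative values only.
calls = [
    categorize(9.0, 13.0, -4.0, 10.5, 10.5),
    categorize(13.0, 12.0, 1.0, 10.5, 10.5),
    categorize(11.0, 13.0, -2.0, 10.5, 10.5),
    categorize(9.0, 9.5, -0.5, 10.5, 10.5),
]
```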
2.9 Data interpretation
We analyze sample-to-sample relationships, including both similarities
and differences, at both the global and local levels. We determine the
global pairwise (Pearson) correlations between all combinations of
samples and show a representative pair plot for multiple technical
replicates of two tissue types that we have previously shown to have
distinctive methylation profiles (Khulan et al., 2006), brain and sperm
(Fig. 4, Supplementary Fig. 7). While pairwise comparison is a pre-existing
program written in R (pairs()), we combine this analysis
with an unsupervised clustering using Ward’s minimum variance
method and a Euclidean distance matrix (Fig. 4). The union of both
components (plot.pairs()) enables a novel visualization and
interpretation of the relatedness of different samples to each other.
The representative figure shows that replicates of both brain and sperm
are similar to each other (R 0.9) while comparison of one tissue with
the other shows dramatic global differences (R 0.4). In addition, the
data show that rehybridization of a poor technical replicate improves
correlation among spermatogenic cells (Fig. 4, ‘Sperm1re’).
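The two quantities that feed this combined view, pairwise Pearson correlations and a Euclidean distance matrix, can be computed as in the Python sketch below. It is an illustration only and does not reproduce plot.pairs(); the Ward clustering step itself is omitted and would normally be delegated to a clustering library. Sample names and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise(samples, func):
    """Apply func to every unordered pair of samples, keyed by sample names."""
    names = sorted(samples)
    return {(a, b): func(samples[a], samples[b])
            for i, a in enumerate(names) for b in names[i + 1:]}


# Made-up log ratios for three hypothetical arrays.
toy_samples = {
    "brain1": [1.0, 2.0, 3.0, 4.0],
    "brain2": [2.0, 4.0, 6.0, 8.0],
    "sperm1": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise(toy_samples, pearson)
distances = pairwise(toy_samples, euclidean)
```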
Fig. 4. Union of two comparative approaches: unsupervised clustering
and global pairwise (Pearson) correlations of normalized ratios from
mouse brain and spermatogenic samples. Three brain and four sperm
samples were compared using Ward’s minimum variance clustering,
with inter-array distances calculated as the Euclidean distance between
ratios. The resulting tree is shown in the lower left portion of the figure,
where the branching order is shown in solid lines, colored by group
(blue indicates brain samples, and red indicates sperm samples). The
diagonal dotted lines are numbered and indicate the Euclidean distance
scale. The dotted red line indicates the Euclidean distance cutoff used to
separate the individual groups of samples; this cutoff is calculated
automatically using the cutree() function. Pairwise correlations are
shown in the upper right portion of the figure, where R values indicate
the Pearson correlation for each pair and blue dotplots show a visual
representation of the differences between samples.
Additionally, we explore HELP data at the local level, by
chromosomal position. We generate BED-formatted tracks of the
data for visualization with the UCSC Genome Browser (Kent et al.,
2002). In Supplementary Figure 8, we show the H19/Igf2 imprinted
domain on mouse chromosome 7 and demonstrate tissue-specific
differences in methylation at the differentially-methylated CTCF-binding
site upstream from H19 (Supplementary Fig. 8, starred) (Bell
and Felsenfeld, 2000). The observed changes (methylation in sperm,
hypomethylation in brain) are consistent with the previous finding that
the H19 locus is methylated exclusively on the paternal chromosome
but hypomethylated in somatic cells due to the maternal copy of the
locus (Ferguson-Smith et al., 1993).
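Generating a BED track from per-fragment ratios can be sketched as follows. This Python example is a hypothetical illustration, not the pipeline's export code. It assumes input coordinates are already 0-based and half-open, as BED requires, and it maps log ratios onto the 0–1000 BED score range by clipping at ±4, which is an arbitrary choice made for this example. The record shown uses placeholder coordinates.

```python
def ratio_to_score(ratio, limit=4.0):
    """Map a log ratio onto the 0-1000 BED score range (clipped at +/-limit)."""
    clipped = max(-limit, min(limit, ratio))
    return round((clipped + limit) / (2 * limit) * 1000)


def bed_lines(records, track_name="HELP"):
    """Build BED text lines from (chrom, start, end, name, log_ratio) tuples.

    Coordinates are assumed to be 0-based, half-open.
    """
    lines = [f'track name="{track_name}" useScore=1']
    for chrom, start, end, name, ratio in sorted(records):
        fields = (chrom, str(start), str(end), name, str(ratio_to_score(ratio)))
        lines.append("\t".join(fields))
    return lines


# Placeholder record, not a real locus.
track = bed_lines([("chr1", 1000, 1500, "HTF_1", 2.0)])
```

The resulting lines can be written to a text file and loaded as a custom track; with `useScore=1` the browser shades each fragment according to its score.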
2.10 Validation of analytical approach
The best means of testing a novel analytical approach is in terms of
performance with reference to a validation data set. For cytosine
methylation, this validation is provided by quantitative analyses of
methylation at the HpaII sites generating the HELP signals, using
bisulphite conversion of DNA (Kerjean et al., 2001) and either
pyrosequencing (Biotage) or MassArray (Sequenom) to measure C/T
ratios in the population of molecules (Ehrich et al., 2005; Fakhrai-Rad
et al., 2002). A data set was prepared using bisulphite pyrosequencing
and MassArray on both brain and sperm samples. We investigated 11
loci with varying degrees of methylation as identified by HELP. Three
loci (Tyr, Ube3a, Kcnq1) were hypomethylated in all samples, while
three were hypomethylated in brain but methylated in sperm (H19/Igf2,
Hccs, Ube3a) and five showed the opposite pattern (Figla, Th-Ins2,
Fthl17, Xmr, Ott). Cytosine methylation was quantified using bisulphite
pyrosequencing at both HpaII sites flanking each of 10 loci. Two loci
(Ube3a and Kcnq1) were not amenable to pyrosequencing and were
therefore analyzed by MassArray. All pyrosequencing and MassArray
data were summarized as single values representing the maximum level
of methylation detected for each HTF (Supplementary Table 1).
Fig. 5. Quantitative validation of HELP microarray data demonstrates
improvement of accuracy through normalization. Twenty-seven raw
(A) and normalized (B) HpaII/MspI ratios (from HELP data) are
plotted against the bisulphite validation data (methylation percent) for
each locus. Small gray circles indicate spermatogenic cell samples while
black triangles indicate brain samples. The dotted curves in each panel
represent the density of HpaII/MspI ratios from all experiments with
the x-axis drawn to scale and the y-axis indicating relative density
values; both raw and normalized data exhibit a clear bimodal
distribution. The dashed vertical line in panel (B) is a cutoff that
enables discrete classification of methylated and hypomethylated loci;
such discrete classification cannot be performed for the raw data in
panel (A), demonstrating the value of the normalization.
The corresponding HELP data for each of these loci were averaged
across replicates and the results were compared to the validation data
(Supplementary Table 1). This was applied to both pre- and post-normalized
values. We show that HELP reliably discriminates between
two groups of loci, methylated and hypomethylated, for the vast
majority of loci (Fig. 5). While raw HpaII/MspI ratios are unable to
achieve complete concordance with the validation results (Fig. 5A),
normalization of HELP data improves the ability of the assay to
discriminate between methylation and hypomethylation and achieves
complete concordance with the validation data set (Fig. 5B).
Normalization also improves the correlation of HELP results with
those from the validation data set (R-values increased 0.5% for sperm
and 3% for brain).
sequences, and CG dinucleotide-dense CG clusters; Supplementary Figs 10 and 11).
Our prior report of the HELP assay described the measurement of cytosine methylation solely in terms of HpaII/MspI
ratios, with categorization of methylation status defined by the
use of mixture models to separate the bimodal distributions of
methylated from hypomethylated loci. We were prompted by
data from cancer specimens (not shown) to explore alternative
means of analysis and categorization, as the proportion of
methylated loci in some of these specimens became so small
that mixture models were insensitive to the presence of the
methylated population. The new approaches described here
improve accuracy over raw ratio values when measured against
the gold standard of bisulphite pyrosequencing data, although
it should be noted that a given ratio may not discriminate
a locus that is partially versus fully unmethylated (for example,
the imprinted loci in the brain samples; Fig. 5, Supplementary
Table 1).
In summary, we show how HELP microarray data can be
more accurately interpreted to measure cytosine methylation
states in the genome. We also note the potential application of
this approach to a number of other representational techniques
whenever PCR is used for their generation. With the increasing
study of epigenetic influences in general and cytosine methylation in particular, the value of careful analytical techniques
to complement high-throughput molecular assays is clearly of
importance.
4
IMPLEMENTATION
This analytical pipeline is implemented in the R Statistical
Package (R Development Core Team, 2005). The pipeline has
been tested on the Mac platform (OS X 10.4.10) using R
version 2.5.1 with grDevices, stats, utils and graphics packages
installed and enabled. R source code is publicly available online
at http://greallylab.aecom.yu.edu/greally/HELP_pipeline/.
ACKNOWLEDGEMENTS
3
DISCUSSION
We describe a series of functions that are assembled as a
pipeline for the analysis of HELP data. We demonstrate that
these functions improve insight into data quality, normalize for
potentially misleading technical influences and improve performance when tested against a large validation set. While PCR
amplification of genomic representations is used effectively for
a number of applications including HELP, it is critical that
technical variability does not influence the interpretation of
biology (e.g. cytosine methylation status). Fragment length
normalization allows intra- and inter-array comparisons to be
made in a more robust manner, allowing biological variability
to be tested independently of fragment length, as shown in
Supplementary Table 1. The normalization preserves the
relative proportion of fragments in the methylated and
hypomethylated categories for different fragment sizes even in
DNA samples with markedly different overall methylation
(Supplementary Figure 9), and maintains these distributions
in different genomic compartments (e.g. inter- and intragenic
The authors acknowledge the contribution of the Genomics
Core Facility at the Albert Einstein College of Medicine,
resources from the Albert Einstein Cancer Center, and Anton
Svetlanov and Paula Cohen (Cornell University) for the
spermatogenic cells samples. This work is supported by a
grant from the National Institutes of Health (NIH) to J. M. G.
(R01 HD044078). R. F. T. is supported by NIH MSTP Training
Grant GM007288. K. K. is supported by NIH NIAID R01
AI060496, the Albert Einstein College of Medicine Biodefense
Proteomics Research Center (NIH NIAID contract HSN266200400054C), and a pilot grant from the AECOM CFAR (NIH
NIAID 5 P30 AI051519). M. G. is supported by a Philippe
Foundation fellowship.
Conflict of Interest: none declared.
REFERENCES
Allawi,H.T. and SantaLucia,J. Jr. (1997) Thermodynamics and NMR of internal
G.T mismatches in DNA. Biochemistry, 36, 10581–10594.
Bell,A.C. and Felsenfeld,G. (2000) Methylation of a CTCF-dependent boundary
controls imprinted expression of the Igf2 gene. Nature, 405, 482–485.
Bellve,A.R. et al. (1977) Spermatogenic cells of the prepubertal mouse. Isolation
and morphological characterization. J. Cell Biol., 74, 68–85.
Bird,A.P. (1986) CpG-rich islands and the function of DNA methylation. Nature,
321, 209–213.
Carvalho,B. et al. (2007) Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics, 8, 485–499.
Chou,Q. et al. (1992) Prevention of pre-PCR mis-priming and primer dimerization improves low-copy-number amplifications. Nucleic Acids Res., 20,
1717–1723.
Ehrich,M. et al. (2005) Quantitative high-throughput analysis of DNA
methylation patterns by base-specific cleavage and mass spectrometry. Proc.
Natl Acad. Sci. USA, 102, 15785–15790.
Fakhrai-Rad,H. et al. (2002) Pyrosequencing: an accurate detection platform for
single nucleotide polymorphisms. Hum. Mutat., 19, 479–485.
Ferguson-Smith,A.C. et al. (1993) Parental-origin-specific epigenetic modification
of the mouse H19 gene. Nature, 362, 751–755.
Gissot,M. et al. (2008) Toxoplasma gondii and Cryptosporidium parvum lack
detectable DNA cytosine methylation. Eukaryot. Cell, 7, 537–540.
Hatada,I. et al. (2006) Genome-wide profiling of promoter methylation in
human. Oncogene, 25, 3059–3064.
Hecht,N.B. et al. (1984) Maternal inheritance of the mouse mitochondrial
genome is not mediated by a loss or gross alteration of the paternal
mitochondrial DNA or by methylation of the oocyte mitochondrial DNA.
Dev. Biol., 102, 452–461.
Irizarry,R.A. et al. (2003) Exploration, normalization, and summaries
of high density oligonucleotide array probe level data. Biostatistics, 4,
249–264.
Jones,P.A. and Baylin,S.B. (2007) The epigenomics of cancer. Cell, 128, 683–692.
Kennedy,G.C. et al. (2003) Large-scale genotyping of complex DNA. Nat.
Biotechnol., 21, 1233–1237.
Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12,
996–1006.
Kerjean,A. et al. (2001) Bisulfite genomic sequencing of microdissected cells.
Nucleic Acids Res., 29, e106.
Khulan,B. et al. (2006) Comparative isoschizomer profiling of cytosine
methylation: The HELP assay. Genome Res., 16, 1046–1055.
Lisitsyn,N. and Wigler,M. (1993) Cloning the differences between two complex
genomes. Science, 259, 946–951.
Lucito,R. et al. (2003) Representational oligonucleotide microarray analysis: a
high-resolution method to detect genome copy number variation. Genome
Res., 13, 2291–2305.
Maekawa,M. et al. (2004) Methylation of mitochondrial DNA is not a useful
marker for cancer detection. Clin. Chem., 50, 1480–1481.
Mathieu-Daude,F. et al. (1996) DNA rehybridization during PCR: the ‘Cot
effect’ and its consequences. Nucleic Acids Res., 24, 2080–2086.
Nannya,Y. et al. (2005) A robust algorithm for copy number detection using
high-density oligonucleotide single nucleotide polymorphism genotyping
arrays. Cancer Res., 65, 6071–6079.
Pollack,Y. et al. (1984) Methylation pattern of mouse mitochondrial DNA.
Nucleic Acids Res., 12, 4811–4824.
Polz,M.F. and Cavanaugh,C.M. (1998) Bias in template-to-product ratios in
multitemplate PCR. Appl. Environ. Microbiol., 64, 3724–3730.
R Development Core Team (2005) R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria.
Reimers,M. and Weinstein,J.N. (2005) Quality assessment of microarrays:
visualization of spatial artifacts and quantitation of regional biases. BMC
Bioinformatics, 6, 166.
Romrell,L.J. et al. (1976) Separation of mouse spermatogenic cells by
sedimentation velocity. A morphological characterization. Dev. Biol., 49,
119–131.
SantaLucia,J. Jr. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl Acad. Sci. USA, 95,
1460–1465.
Suzuki,M.T. and Giovannoni,S.J. (1996) Bias caused by template annealing in the
amplification of mixtures of 16S rRNA genes by PCR. Appl. Environ.
Microbiol., 62, 625–630.
Waalwijk,C. and Flavell,R.A. (1978) MspI, an isoschizomer of hpaII which
cleaves both unmethylated and methylated hpaII sites. Nucleic Acids Res., 5,
3231–3236.
Wagner,A. et al. (1994) Surveys of gene families using polymerase chain reaction:
PCR selection and PCR drift. Syst. Biol., 43, 250–261.
Yang,Y.H. et al. (2002) Normalization for cDNA microarray data: a robust
composite method addressing single and multiple slide systematic variation.
Nucleic Acids Res., 30, e15.