BIOINFORMATICS ORIGINAL PAPER Vol. 24 no. 9 2008, pages 1161–1167
doi:10.1093/bioinformatics/btn096

Gene expression

An analytical pipeline for genomic representations used for cytosine methylation studies

Reid F. Thompson1, Mark Reimers2, Batbayar Khulan1, Mathieu Gissot3,4, Todd A. Richmond5, Quan Chen6,7, Xin Zheng6,7, Kami Kim3,4 and John M. Greally1,8,*

1Department of Molecular Genetics, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, 2Department of Biostatistics, Virginia Commonwealth University, 730 East Broad Street, Richmond, VA 23298, 3Department of Medicine (Infectious Diseases), 4Department of Microbiology & Immunology, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, 5Roche NimbleGen, 1 Science Court, Madison, WI 53711, 6Department of Pathology, 7Bioinformatics Shared Resource and 8Department of Medicine (Hematology), Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA

Received on November 7, 2007; revised on January 24, 2008; accepted on March 7, 2008
Advance Access publication March 18, 2008
Associate Editor: Joaquin Dopazo

ABSTRACT
Motivation: Representations of the genome can be generated by the selection of a subpopulation of restriction fragments using ligation-mediated PCR. Such representations form the basis for a number of high-throughput assays, including the HELP assay to study cytosine methylation. We find that HELP data analysis is complicated not only by PCR amplification heterogeneity but also by a complex and variable distribution of cytosine methylation. To address this, we created an analytical pipeline and novel normalization approach that improves concordance between microarray-derived data and single locus validation results, demonstrating the value of the analytical approach.
A major influence on the PCR amplification is the size of the restriction fragment, requiring a quantile normalization approach that reduces the influence of fragment length on signal intensity. Here we describe all of the components of the pipeline, which can also be applied to data derived from other assays based on genomic representations.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

*To whom correspondence should be addressed.

1 INTRODUCTION
Several published techniques rely on restriction enzyme-generated representations to sample the genome. These include representational oligonucleotide microarray analysis with Roche NimbleGen microarrays (ROMA; Lucito et al., 2003), whole-genome sampling analysis (WGSA) with Affymetrix SNP microarrays (Kennedy et al., 2003), representational difference analysis (RDA; Lisitsyn and Wigler, 1993) and other techniques to study cytosine methylation in the genome (Hatada et al., 2006) including the HELP assay that we have previously described (Khulan et al., 2006). Cytosine methylation is an epigenetic mark maintained by DNA methyltransferases and important for transcriptional regulation (Jones and Baylin, 2007). The HELP assay is based on HpaII Tiny Fragment (HTF; Bird, 1986) Enrichment by Ligation-mediated PCR (HELP). The approach relies on comparative genomic hybridization of HpaII-digested and MspI-digested DNA. While HpaII is only able to digest its unmethylated recognition motif (5′-CCGG-3′), its isoschizomer, MspI, digests any HpaII sites whether methylated or not at the central CG dinucleotide (Waalwijk and Flavell, 1978). MspI therefore serves as an internal control for the assay, representing all HTFs equivalently unless the locus is deleted, amplified or mutated to prevent restriction digest.
We have shown that PCR enrichment of the two genomic representations followed by co-hybridization on a custom microarray provides a powerful tool for determining local methylation status on a genome-wide scale (Khulan et al., 2006).

However, PCR amplification of mixed templates, a step inherent to representational techniques, can cause certain fragments to amplify with efficiencies different from those of other fragments. A number of important biases have been described that affect PCR with increasing numbers of amplification cycles (Mathieu-Daude et al., 1996; Suzuki and Giovannoni, 1996). Furthermore, random artifacts introduced at early stages of the PCR can have dramatic effects (Polz and Cavanaugh, 1998; Wagner et al., 1994). The HELP amplification technique is limited to 20 cycles to minimize some of these potential sources of bias. However, fragment length bias (Carvalho et al., 2007; Nannya et al., 2005) is inherent to any multi-template PCR reaction, as the efficiency of amplification of each component template is affected by the length of that template.

In this report, we provide further evidence that this phenomenon may be responsible for substantial differences in PCR amplification efficiency, sometimes of the order of the biological differences we seek to measure, complicating our ability to compare results within and between different arrays. We solve the amplicon length heterogeneity problem with a novel quantile normalization method that we have developed as part of a modular pipeline of analytical tools. We assess the performance of this pipeline with extensive bisulphite pyrosequencing validation studies. Designed for the HELP assay specifically, these tools can also be applied to other techniques that use data from PCR amplification to create genomic representations.

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
The functions in this pipeline (Supplementary Fig. 1) are publicly available and written for the R Statistical Package (R Development Core Team, 2005) to allow adoption and testing by other investigators.

2 SYSTEM AND METHODS

2.1 Samples
For the studies described here, samples were obtained from C57Bl/6J mice (Charles River Laboratories). Highly-enriched spermatogenic cell populations were isolated from 25 mice of ages 6–8 weeks using the Sta-Put method based on sedimentation velocity and unit gravity (Bellve et al., 1977; Romrell et al., 1976). Purity of germ cells (90%) was established on the basis of cellular morphology using light microscopy. Whole brain samples were obtained from male mice at age 8 weeks. Additional samples were obtained from Sprague Dawley rat liver and cultured Toxoplasma gondii RH strain tachyzoites. Genomic DNA was purified from all cell or tissue types, and HELP samples were prepared as we have described previously (Khulan et al., 2006).

2.2 The HELP assay
To perform a HELP experiment, high molecular weight genomic DNA is isolated, digested to completion by HpaII and MspI separately, and then ligated to an oligonucleotide adapter pair complementary to the cohesive ends generated. The linkers then serve as a priming site for a ligation-mediated PCR reaction that we have described to generate a product ranging primarily in the 200–2000 bp size range (Khulan et al., 2006). To avoid the PCR biases described above, we use a universal primer, a relatively small number of amplification cycles (20), a substantial quantity of initial DNA template (0.1 mg), and a pooled mixture of (three or more) ‘hot start’ PCR (Chou et al., 1992) replicates. Following PCR, the HpaII and MspI representations are labeled with different fluorophores using random priming and are then cohybridized on a customized genomic microarray representing the HpaII/MspI fragments of 200–2000 bp in unique sequence.

2.3 Array designs and data import
For each HELP experiment, Cy3 and Cy5 signal intensities measure the relative abundances for each HTF of MspI and HpaII representations, respectively. In addition to gridding and other technical controls supplied by Roche NimbleGen, the microarrays also report thousands of random probes (50-mers of random nucleotides) which serve as a metric of non-specific annealing and background fluorescence. By design, all probes are randomly distributed across each microarray. Signal intensity data for every spot on the array are read from flat files (read.pairs()) and linked to the corresponding probe identifier. Roche NimbleGen-formatted design files are then used to link probe identifiers to their corresponding HTF, and provide genomic position and probe sequence information (read.design()). From these probe sequences, we calculate (G + C) content as %GC (calc.gc()) and theoretical melting temperatures (calc.tm()) using the nearest-neighbor approach (Allawi and SantaLucia, 1997) with the unified thermodynamic parameters (SantaLucia, 1998).

2.4 Inter- and intra-microarray quality assessment
Before we can consider biological variability, we have to address the issues of array performance and quality control (QC). We screen for spatial artifacts by comparing the average ratios of signal intensities as a function of position on the array. We divide the array into sectors (default is 25) and take summary measures of probes located within each sector, then compare the distributions between sectors. High-quality hybridizations yield a relatively uniform distribution of ratios across all sectors (Supplementary Fig. 2A), whereas samples such as the one shown in Supplementary Figure 2B demonstrate poor hybridizations that require repeating.

Consideration of probe signal in the context of its performance across multiple arrays improves our ability to discriminate finer deviations in performance. We therefore perform additional quality assessment in a manner analogous to our previous description (Reimers and Weinstein, 2005). We first define a prototypical signal intensity and ratio profile by mean-centering each data set, followed by calculation of a summary measure for each probe as the (20%-) trimmed mean of its values across the arrays (calc.prototype()). We then subtract the log-intensity or log-ratio prototype from the corresponding value for each oligonucleotide on each microarray to obtain an array-specific data set that compares its signal intensities with those for the population of signals on all of the microarrays, plotting each set of values relative to the technical variables under consideration. For instance, regional heterogeneities of signal intensities on the microarrays can be color-coded in terms of deviations from prototype (plot.chip.image()), illustrating their positions on a depiction of the microarray (Fig. 1A–C and F–H). We also study intensity-dependent biases (Yang et al., 2002) (see Fig. 1D, E, I and J; plot.HELP.qc()), which may reflect sources of technical variability such as labeling efficiency. The influence of other variables can also be studied (e.g. probe melting temperature; Supplementary Fig. 3). These studies allow us to identify the extremes or outliers within the data set at an early stage of the analytical phase of an experiment.

Occasionally, there are microarrays that demonstrate varying degrees of spatial artifacts, of which we show a representative sample in Figure 1A–C. This particular microarray also manifested dramatic intensity-dependent biases that further mark it as an outlier in the data set (Fig. 1D and E). Rehybridization of these samples on a new microarray removed the regional artifacts observed in the original hybridization (Fig. 1F–H), resulting in significant reduction of the intensity-dependent bias (Fig. 1I and J). Rehybridization also improved correlation of this sample with its other replicates in the data set (R-values 0.94, up from 0.84).

2.5 Size-dependent intensities and definition of background
When we visualize signal intensities as a function of fragment size (Fig. 2; plot.HELP.svi()), we observe several characteristics of the representations. It is obvious that signal intensity is dependent on fragment length, with maximal intensities for both MspI and HpaII representations around 500 bp and weaker signals at the extremes (Fig. 2A and B). We also note that MspI-derived representations show amplification of all HTFs (Fig. 2A) whereas the HpaII-derived distribution shows a second population of oligonucleotides with low-signal intensities across all fragment sizes represented (Fig. 2B). This second population corresponds to DNA sequences that did not digest or amplify because of methylation at the flanking HpaII sites. As we have shown previously (Khulan et al., 2006), the HpaII/MspI ratios have a bimodal distribution, the lower ratios representing methylated HTFs while the higher ratios represent HTFs that are relatively hypomethylated (Fig. 2C).

Fig. 1. Quality assessment shows improvement of poor array data following rehybridization on a fresh array. Spatial artifacts for a poor-quality hybridization are shown as the difference of MspI (A) and HpaII (B) signal intensities as well as HpaII/MspI ratios (C) from the probe-by-probe averages across all arrays in the dataset. Green indicates signal or ratio data that is less than the multi-array average while red indicates signal or ratios that exceed the average. Panels (D) and (E) show the MspI and HpaII signal intensities on the y-axis, respectively, versus their multi-array averages shown along the x-axis.
The yellow lines on these panels represent lowess-smoothing and highlight the non-linearity of the data, consistent with intensity-dependent bias. The lower panels show improved quality for a rehybridization of the same sample on a fresh array.

For each HELP experiment, the level of background signal intensity (‘noise’) is measured by thousands of random probes (50-mers of random nucleotides). These probes measure non-specific annealing and background fluorescence and enable definition of ‘failed’ probes, those for which the levels of MspI and HpaII signal intensities are indistinguishable from the background intensities defined by the random probes (Fig. 2, yellow data points). In these cases, failed probes represent the population of fragments that do not amplify by PCR, whatever the biological or experimental cause (e.g. genomic deletions). We remove these probes from further consideration, typically affecting 10–20% of the probes and a smaller fraction of HTFs. If the probe only ‘fails’ in the HpaII and not the MspI channel, the likely cause is methylation of that locus, and we retain these data through subsequent analyses.

The HELP assay also makes use of mitochondrial probes as a high copy number, hypomethylated control (Supplementary Fig. 4, red data points). Mitochondrial DNA has been observed to be unmethylated in germline (Hecht et al., 1984), somatic (Pollack et al., 1984) and cancerous cells (Maekawa et al., 2004). As such, the mitochondrial loci serve as a highly reliable control, the failure of which is a robust indicator of major problems with the assay.

Fig. 2. Microarray signal characteristics of HELP data from a normal mouse spermatogenic sample. (A) The log-intensity of data from the MspI genomic representation is plotted versus HTF size (right) and the corresponding density plot is shown (left). The horizontal red line denotes a ‘failed’ cutoff, calculated as a 2.5 MAD deviation from the median of random probe signals. The blue curve shows the distribution of random probe signals, and the yellow curve and data points indicate ‘failed’ data. All other data is shown in black. HpaII signal data is similarly shown in (B), although the horizontal red line in this case denotes a ‘methylated’ cutoff. HpaII/MspI ratios are shown in panel (C) with a horizontal red line denoting the median HpaII/MspI ratio for random probes.

Fig. 3. Fragment-size effect in pre- and post-normalized data from a normal mouse spermatogenic sample. The data is divided into 58 step-wise bins each containing a comparable number of HTFs. The color key at the upper left illustrates the partitioning scheme for pre-normalized data where each colored block corresponds to a bin of certain HTF sizes; color varies from blue to red with increasing fragment size (from 200–2000 bp). The density of MspI signal intensities for each bin is shown in panel (A), where different color lines represent the density data for HTFs from each corresponding bin. The black line represents the overall density of MspI intensity data. Panel (B) shows the same representation for HpaII signals and panel (C) shows the same representation for HpaII/MspI ratios. The color key to the lower left shows the analogous partitioning scheme for normalized data after failed probes have been identified and removed. Normalized MspI, HpaII, and HpaII/MspI ratio data are shown (without failed probes) in panels (D), (E), and (F), respectively.

2.6 Quantile normalization
Despite our measures to avoid PCR bias during the amplification process, we continue to see a size-dependence of signal intensities as described above. The distribution of MspI intensities in Figure 3A clearly demonstrates this size bias. The HTFs around 500–600 bp tend to generate the highest MspI signals in the intensity distribution, whereas HTFs at the tails of the size distribution have lower signal intensities (Fig. 3A and B). In the mouse spermatogenic sample data depicted (see also Fig. 2), the individual biases in the MspI (Fig. 3A) and HpaII (Fig. 3B) intensity distributions do not compensate for each other and there remains an overt fragment size bias in the distribution of HpaII/MspI ratios (Fig. 3C). Previous reports have also shown fragment length bias in PCR and, further, that reduction of this bias with linear modeling improves data interpretation (Carvalho et al., 2007; Nannya et al., 2005). We address the fragment length problem using a quantile normalization approach (quantile.normalize()). The goal of this approach was to normalize signal intensities across all fragment lengths, improving within- and between-array comparisons. This normalization corrects for the fragment size-dependency of the MspI, HpaII and HpaII/MspI ratio distributions (Fig. 3D, E and F). The approach is similar to interarray quantile normalization methods (e.g. RMA; Irizarry et al., 2003); however, in this particular case we align the quantiles across density-dependent sliding windows of size-sorted data. All HTFs that are considered amplifiable (i.e. those not classified as failed by the above criterion) are sorted according to increasing fragment size. The resultant data are then divided into multiple bins (b) and steps (s), resulting in a total number (n, where n = s(b − 1) + 1) of sliding windows (w = {1, ..., n}). Minimum and maximum fragment size boundaries for each window are calculated as the (w − 1)/n and w/n quantiles, respectively. The data are then split according to these boundaries into each window (overlapping windows are each assigned a copy of the overlapping data). MspI signal intensity quantiles are calculated for each window and are then averaged across all n windows to produce an average quantile (Q). In order to track overlapping data, each probe is assigned intensity-sorted position(s) (p) within the window(s) in which it is included.
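The window count and the quantile-averaging step described so far can be illustrated with a minimal Python sketch. This is not the authors' R function quantile.normalize(): the function names are hypothetical, and the windows are assumed to be equal-sized and non-overlapping purely for simplicity, so the handling of probes shared between overlapping windows is not shown.

```python
def window_count(bins, steps):
    """Number of sliding windows, n = s(b - 1) + 1."""
    return steps * (bins - 1) + 1


def quantile_normalize_windows(windows):
    """Align intensity quantiles across fragment-size windows.

    windows: list of lists of log-intensities, one list per window.
    Assumption for this sketch: all windows hold the same number of probes.
    Returns normalized values in the same layout as the input.
    """
    size = len(windows[0])
    if any(len(w) != size for w in windows):
        raise ValueError("this sketch assumes equal-sized windows")

    # Average quantile Q: mean of the k-th smallest value over all windows.
    sorted_windows = [sorted(w) for w in windows]
    q = [sum(sw[k] for sw in sorted_windows) / len(windows) for k in range(size)]

    normalized = []
    for w in windows:
        # Intensity-sorted position p of each probe within its window.
        order = sorted(range(size), key=lambda i: w[i])
        out = [0.0] * size
        for p, i in enumerate(order):
            out[i] = q[p]  # the probe at position p receives Q[p]
        normalized.append(out)
    return normalized
```

As a check on the formula, window_count(20, 3) returns 58, which matches the 58 step-wise bins of Figure 3; that b = 20 and s = 3 were the settings actually used is an assumption, since the text does not state them.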
These positions then determine which values from Q to assign to each probe in a given window. Because a probe may appear at different points within the quantile distribution for two or more overlapping windows (e.g. p1 ≠ p2), final quantile-normalized values are calculated for each probe as the average of these values: mean(Q_p1, Q_p2, ...).

The same calculations are then applied to the HpaII signals, again with failed probes (defined by MspI intensities) removed. However, for HpaII data the (methylated) probes that fall within the random signal distribution (≤99% quantile) are normalized separately from those that exceed random probe signals (>99% quantile). This piecewise normalization is performed to separate amplifying (hypomethylated) probes from their unamplifying (methylated) counterparts in order to preserve the potentially variable distributions of methylation across different fragment sizes. This may be particularly relevant for the treatment of CG-dense regions, which tend to occur at shorter fragment lengths (Supplementary Fig. 5) and which may exhibit different distributions of methylation. Quantile-normalized HpaII/MspI log-ratios are calculated as the difference between normalized HpaII and MspI signals, and are then centered to the average difference of random probe signal intensities (HpaII − MspI) to adjust for global differences in signal strength.

A useful experiment involved the HELP analysis of Toxoplasma gondii, which we have found to lack methylation in its genome (Gissot et al., 2008). This allowed us to consider size bias in the absence of methylation and therefore due to technical sources alone (e.g. PCR). We show that the HpaII and MspI distributions in this case exhibit different size-dependencies of the signal strength, causing a size-dependent artifact in the HpaII/MspI ratio (Supplementary Fig. 6A, C and E). Quantile normalization corrects this artifact (Supplementary Fig.
6B, D and F) and improves inter-sample HpaII, MspI, and HpaII/MspI ratio correlations in each of four replicate assays (by an average of 7.8%). We tested whether normalization improves analysis of data from methylating genomes, finding that it preserves HpaII/MspI ratio correlation for technical replicates (R-values differ in pre- and post-normalized data by an average of 0.1–0.2%) while enhancing the differences between tissues (R-values comparing brain and sperm are reduced by an average of 2% following normalization).

2.7 Data summarization
The methylation status of each HpaII fragment is typically measured by a set of probes (up to 10, depending on the array design). Failed probes are removed from consideration as described previously; however, the remaining probes must be considered together, necessitating a summarization approach (combine.data()). Currently, we employ a 20%-trimmed mean, weighted by MspI signal intensities as follows: for a given HTF, weights for each probe are assigned between 1 (for the lowest MspI signal intensity) and the magnitude of the range of signal intensities (maximal weight is given to probes with the best performance in the MspI channel). A weight of zero is assigned to the 20% most deviant probes per HTF. This enables us to make slight adjustments for the performance of a given probe and also enables us to take a more robust summary measure of the data (by removal of outliers).

2.8 Categorization
The HELP assay generates a bimodal distribution of HpaII intensities and of HpaII/MspI ratio values as a consequence (Khulan et al., 2006) (Fig. 2B and C). We explored whether this allowed us to categorize loci as methylated or hypomethylated (categorize.HELP()), finding agreement with methylation levels detected using validation studies. We identified loci as methylated wherever HpaII signals fell below random noise thresholds and the corresponding MspI data were above background noise.
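The thresholding logic of this section (the methylated rule just stated, together with the hypomethylated and indeterminate categories defined in the remainder of the section) can be sketched in Python as follows. This is an illustration, not the authors' categorize.HELP() function: the function names and signatures are hypothetical, the cutoff is assumed to lie 2.5 MADs above the median of the random probe signals (following the Figure 2 caption), and the MAD is left unscaled, whereas R's mad() applies a consistency constant by default.

```python
import statistics


def mad_cutoff(random_signals, k=2.5):
    """Background cutoff: median of random-probe signals plus k raw MADs.

    Assumption: the cutoff lies above the median and the MAD is unscaled.
    """
    values = list(random_signals)
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return med + k * mad


def categorize(hpaii, mspi, log_ratio, hpaii_cutoff, mspi_cutoff):
    """Assign one summarized locus to a methylation category.

    Failed loci (MspI indistinguishable from background) are assumed to
    have been removed before this function is called.
    """
    # Methylated: HpaII within noise while MspI is above background.
    if hpaii < hpaii_cutoff and mspi > mspi_cutoff:
        return "methylated"
    # Hypomethylated: HpaII above background with a positive log-ratio.
    if hpaii > hpaii_cutoff and log_ratio > 0:
        return "hypomethylated"
    # Everything else does not group clearly into either category.
    return "indeterminate"
```

Passing the cutoffs in as arguments keeps the sketch neutral about how each channel's threshold is derived; for example, the 99% quantile of random probe signals could be supplied for the HpaII channel instead of a MAD-based value.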
A high-confidence hypomethylated population was defined by HpaII signals above background with a corresponding positive HpaII/MspI log-ratio. Some values, however, did not group clearly into either the methylated or hypomethylated categories and were therefore considered to have ‘indeterminate’ methylation status. These categorizations are consistent with the bisulphite pyrosequencing data we generated (Supplementary Table 1), which group into two distinct populations: methylated and hypomethylated.

2.9 Data interpretation
We analyze sample-to-sample relationships, including both similarities and differences, at both the global and local levels. We determine the global pairwise (Pearson) correlations between all combinations of samples and show a representative pair plot for multiple technical replicates of two tissue types that we have previously shown to have distinctive methylation profiles (Khulan et al., 2006), brain and sperm (Fig. 4, Supplementary Fig. 7). While the pairwise comparison plot is an existing R function (pairs()), we combine this analysis with an unsupervised clustering using Ward’s minimum variance method and a Euclidean distance matrix (Fig. 4). The union of both components (plot.pairs()) enables a novel visualization and interpretation of the relatedness of different samples to each other. The representative figure shows that replicates of both brain and sperm are similar to each other (R 0.9) while comparison of one tissue with the other shows dramatic global differences (R 0.4). In addition, the data show that rehybridization of a poor technical replicate improves correlation among spermatogenic cells (Fig. 4, ‘Sperm1re’).

Additionally, we explore HELP data at the local level, by chromosomal position. We generate BED-formatted tracks of the data for visualization with the UCSC Genome Browser (Kent et al., 2002).
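The global comparison described at the start of this section reduces to a table of pairwise Pearson correlations between samples. A minimal Python sketch is given below; it is not the authors' plot.pairs() function, the function names and sample labels are hypothetical, and the Ward clustering and plotting steps are omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(samples):
    """Correlate every pair of samples.

    samples: dict mapping a sample name to its list of HpaII/MspI
    log-ratios, with loci in the same order for every sample.
    Returns a dict keyed by (name_a, name_b) with name_a < name_b.
    """
    names = sorted(samples)
    return {
        (a, b): pearson(samples[a], samples[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Each sample pair appears once in the result, which corresponds to the R values shown in one triangle of a pair plot.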
In Supplementary Figure 8, we show the H19/Igf2 imprinted domain on mouse chromosome 7 and demonstrate tissue-specific differences in methylation at the differentially-methylated CTCF-binding site upstream from H19 (Supplementary Fig. 8, starred) (Bell and Felsenfeld, 2000). The observed changes (methylation in sperm, hypomethylation in brain) are consistent with the previous finding that the H19 locus is methylated exclusively on the paternal chromosome but hypomethylated in somatic cells due to the maternal copy of the locus (Ferguson-Smith et al., 1993).

Fig. 4. Union of two comparative approaches: unsupervised clustering and global pairwise (Pearson) correlations of normalized ratios from mouse brain and spermatogenic samples. Three brain and four sperm samples were compared using Ward’s minimum variance clustering, with inter-array distances calculated as the Euclidean distance between ratios. The resulting tree is shown in the lower left portion of the figure, where the branching order is shown in solid lines, colored by group (blue indicates brain samples, and red indicates sperm samples). The diagonal dotted lines are numbered and indicate the Euclidean distance scale. The dotted red line indicates the Euclidean distance cutoff used to separate the individual groups of samples; this cutoff is calculated automatically using the cutree() function. Pairwise correlations are shown in the upper right portion of the figure, where R values indicate the Pearson correlation for each pair and blue dotplots show a visual representation of the differences between samples.

2.10 Validation of analytical approach
The best means of testing a novel analytical approach is in terms of performance with reference to a validation data set.
For cytosine methylation, this validation is provided by quantitative analyses of methylation at the HpaII sites generating the HELP signals, using bisulphite conversion of DNA (Kerjean et al., 2001) and either pyrosequencing (Biotage) or MassArray (Sequenom) to measure C/T ratios in the population of molecules (Ehrich et al., 2005; Fakhrai-Rad et al., 2002). A data set was prepared using bisulphite pyrosequencing and MassArray on both brain and sperm samples. We investigated 11 loci with varying degrees of methylation as identified by HELP. Three loci (Tyr, Ube3a, Kcnq1) were hypomethylated in all samples, while three were hypomethylated in brain but methylated in sperm (H19/Igf2, Hccs, Ube3a) and five showed the opposite pattern (Figla, Th-Ins2, Fthl17, Xmr, Ott). Cytosine methylation was quantified using bisulphite pyrosequencing at both HpaII sites flanking each of 10 loci. Two loci (Ube3a and Kcnq1) were not amenable to pyrosequencing and were therefore analyzed by MassArray. All pyrosequencing and MassArray data were summarized as single values representing the maximum level of methylation detected for each HTF (Supplementary Table 1). The corresponding HELP data for each of these loci were averaged across replicates and the results were compared to the validation data (Supplementary Table 1). This was applied to both pre- and post-normalized values. We show that HELP reliably discriminates between two groups of loci, methylated and hypomethylated, for the vast majority of loci (Fig. 5). While raw HpaII/MspI ratios are unable to achieve complete concordance with the validation results (Fig. 5A), normalization of HELP data improves the ability of the assay to discriminate between methylation and hypomethylation and achieves complete concordance with the validation data set (Fig. 5B). Normalization also improves the correlation of HELP results with those from the validation data set (R-values increased 0.5% for sperm and 3% for brain).

Fig. 5. Quantitative validation of HELP microarray data demonstrates improvement of accuracy through normalization. Twenty-seven raw (A) and normalized (B) HpaII/MspI ratios (from HELP data) are plotted against the bisulphite validation data (methylation percent) for each locus. Small gray circles indicate spermatogenic cell samples while black triangles indicate brain samples. The dotted curves in each panel represent the density of HpaII/MspI ratios from all experiments with the x-axis drawn to scale and the y-axis indicating relative density values; both raw and normalized data exhibit a clear bimodal distribution. The dashed vertical line in panel (B) is a cutoff that enables discrete classification of methylated and hypomethylated loci; such discrete classification cannot be performed for the raw data in panel (A), demonstrating the value of the normalization.

3 DISCUSSION
We describe a series of functions that are assembled as a pipeline for the analysis of HELP data. We demonstrate that these functions improve insight into data quality, normalize for potentially misleading technical influences and improve performance when tested against a large validation set. While PCR amplification of genomic representations is used effectively for a number of applications including HELP, it is critical that technical variability does not influence the interpretation of biology (e.g. cytosine methylation status). Fragment length normalization allows intra- and inter-array comparisons to be made in a more robust manner, allowing biological variability to be tested independently of fragment length, as shown in Supplementary Table 1. The normalization preserves the relative proportion of fragments in the methylated and hypomethylated categories for different fragment sizes, even in DNA samples with markedly different overall methylation (Supplementary Fig. 9), and maintains these distributions in different genomic compartments (e.g. inter- and intragenic sequences, and CG dinucleotide-dense CG clusters; Supplementary Figs 10 and 11).

Our prior report of the HELP assay described the measurement of cytosine methylation solely in terms of HpaII/MspI ratios, with categorization of methylation status defined by the use of mixture models to separate the bimodal distributions of methylated from hypomethylated loci. We were prompted by data from cancer specimens (not shown) to explore alternative means of analysis and categorization, as the proportion of methylated loci in some of these specimens became so small that mixture models were insensitive to the presence of the methylated population. The new approaches described here improve accuracy over raw ratio values when measured against the gold standard of bisulphite pyrosequencing data, although it should be noted that a given ratio may not discriminate a locus that is partially versus fully unmethylated (for example, the imprinted loci in the brain samples; Fig. 5, Supplementary Table 1).

In summary, we show how HELP microarray data can be more accurately interpreted to measure cytosine methylation states in the genome. We also note the potential application of this approach to a number of other representational techniques whenever PCR is used for their generation. With the increasing study of epigenetic influences in general and cytosine methylation in particular, careful analytical techniques that complement high-throughput molecular assays are clearly of importance.

4 IMPLEMENTATION
This analytical pipeline is implemented in the R Statistical Package (R Development Core Team, 2005). The pipeline has been tested on the Mac platform (OS X 10.4.10) using R version 2.5.1 with grDevices, stats, utils and graphics packages installed and enabled. R source code is publicly available online at http://greallylab.aecom.yu.edu/greally/HELP_pipeline/.

ACKNOWLEDGEMENTS
The authors acknowledge the contribution of the Genomics Core Facility at the Albert Einstein College of Medicine, resources from the Albert Einstein Cancer Center, and Anton Svetlanov and Paula Cohen (Cornell University) for the spermatogenic cells samples. This work is supported by a grant from the National Institutes of Health (NIH) to J. M. G. (R01 HD044078). R. F. T. is supported by NIH MSTP Training Grant GM007288. K. K. is supported by NIH NIAID RO1 AI060496, the Albert Einstein College of Medicine Biodefense Proteomics Research Center (NIH NIAID contract HSN266200400054C), and a pilot grant from the AECOM CFAR (NIH NIAID 5 P30 AI051519). M. G. is supported by a Philippe Foundation fellowship.

Conflict of Interest: none declared.

REFERENCES
Allawi,H.T. and SantaLucia,J. Jr. (1997) Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry, 36, 10581–10594.
Bell,A.C. and Felsenfeld,G. (2000) Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature, 405, 482–485.
Bellve,A.R. et al. (1977) Spermatogenic cells of the prepubertal mouse. Isolation and morphological characterization. J. Cell Biol., 74, 68–85.
Bird,A.P. (1986) CpG-rich islands and the function of DNA methylation. Nature, 321, 209–213.
Carvalho,B. et al. (2007) Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics, 8, 485–499.
Chou,Q. et al. (1992) Prevention of pre-PCR mis-priming and primer dimerization improves low-copy-number amplifications. Nucleic Acids Res., 20, 1717–1723.
Ehrich,M. et al. (2005) Quantitative high-throughput analysis of DNA methylation patterns by base-specific cleavage and mass spectrometry. Proc. Natl Acad. Sci. USA, 102, 15785–15790.
Fakhrai-Rad,H. et al. (2002) Pyrosequencing: an accurate detection platform for single nucleotide polymorphisms. Hum. Mutat., 19, 479–485.
Ferguson-Smith,A.C. et al.
(1993) Parental-origin-specific epigenetic modification of the mouse H19 gene. Nature, 362, 751–755.
Gissot,M. et al. (2008) Toxoplasma gondii and Cryptosporidium parvum lack detectable DNA cytosine methylation. Eukaryot. Cell, 7, 537–540.
Hatada,I. et al. (2006) Genome-wide profiling of promoter methylation in human. Oncogene, 25, 3059–3064.
Hecht,N.B. et al. (1984) Maternal inheritance of the mouse mitochondrial genome is not mediated by a loss or gross alteration of the paternal mitochondrial DNA or by methylation of the oocyte mitochondrial DNA. Dev. Biol., 102, 452–461.
Irizarry,R.A. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.
Jones,P.A. and Baylin,S.B. (2007) The epigenomics of cancer. Cell, 128, 683–692.
Kennedy,G.C. et al. (2003) Large-scale genotyping of complex DNA. Nat. Biotechnol., 21, 1233–1237.
Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996–1006.
Kerjean,A. et al. (2001) Bisulfite genomic sequencing of microdissected cells. Nucleic Acids Res., 29, e106.
Khulan,B. et al. (2006) Comparative isoschizomer profiling of cytosine methylation: The HELP assay. Genome Res., 16, 1046–1055.
Lisitsyn,N. and Wigler,M. (1993) Cloning the differences between two complex genomes. Science, 259, 946–951.
Lucito,R. et al. (2003) Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res., 13, 2291–2305.
Maekawa,M. et al. (2004) Methylation of mitochondrial DNA is not a useful marker for cancer detection. Clin. Chem., 50, 1480–1481.
Mathieu-Daude,F. et al. (1996) DNA rehybridization during PCR: the ‘Cot effect’ and its consequences. Nucleic Acids Res., 24, 2080–2086.
Nannya,Y. et al. (2005) A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res., 65, 6071–6079.
Pollack,Y. et al.
(1984) Methylation pattern of mouse mitochondrial DNA. Nucleic Acids Res., 12, 4811–4824.
Polz,M.F. and Cavanaugh,C.M. (1998) Bias in template-to-product ratios in multitemplate PCR. Appl. Environ. Microbiol., 64, 3724–3730.
R Development Core Team (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Reimers,M. and Weinstein,J.N. (2005) Quality assessment of microarrays: visualization of spatial artifacts and quantitation of regional biases. BMC Bioinformatics, 6, 166.
Romrell,L.J. et al. (1976) Separation of mouse spermatogenic cells by sedimentation velocity. A morphological characterization. Dev. Biol., 49, 119–131.
SantaLucia,J. Jr. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl Acad. Sci. USA, 95, 1460–1465.
Suzuki,M.T. and Giovannoni,S.J. (1996) Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl. Environ. Microbiol., 62, 625–630.
Waalwijk,C. and Flavell,R.A. (1978) MspI, an isoschizomer of hpaII which cleaves both unmethylated and methylated hpaII sites. Nucleic Acids Res., 5, 3231–3236.
Wagner,A. et al. (1994) Surveys of gene families using polymerase chain reaction: PCR selection and PCR drift. Syst. Biol., 43, 250–261.
Yang,Y.H. et al. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30, e15.