Optimal Replication and the Importance of Experimental Design for Gel-Based Quantitative Proteomics

Sybille M. N. Hunt,* Mervyn R. Thomas,† Lucille T. Sebastian, Susanne K. Pedersen, Rebecca L. Harcourt, Andrew J. Sloane, and Marc R. Wilkins

Proteome Systems Ltd, Locked Bag 2073, North Ryde, NSW 1670, Australia

Downloaded by BBWS CONSORTIA GERMANY on July 17, 2009. Published on May 7, 2005 on http://pubs.acs.org | doi: 10.1021/pr049758y

Received December 20, 2004

Quantitative proteomic studies, based on two-dimensional gel electrophoresis, are commonly used to find proteins that are differentially expressed between samples or groups of samples. These proteins are of interest as potential diagnostic or prognostic biomarkers, or as proteins associated with a trait. The complexity of proteomic data poses many challenges, so while experiments may reveal proteins that are differentially expressed, these are often not significant when subjected to rigorous statistical analysis. However, this can be addressed through appropriate experimental design. A good experimental design considers the impact of different sources of variation, both analytical and biological, on the statistical importance of the results. The design should address the number of samples that must be analyzed and the number of replicate gels per sample, in the context of a particular minimum difference that one is seeking to achieve. In this study, we explore the ways to improve the quality of protein expression data from 2-DE gels, and describe an approach for defining the number of samples required and the number of gels per sample. It has been developed for the simplest of situations, two groups of samples with variation at two levels: between samples and between gels. This approach will also be useful as a guide for more complex designs involving more than two groups of samples. We describe some Internet-accessible tools that can assist in the design of proteomic studies.
* To whom correspondence should be addressed. Phone: 61 2 9889 1830. Fax: 61 2 9889 1805. E-mail: [email protected].
† Emphron Informatics, 6 Geewan Place, Chapel Hill, Queensland 4069, Australia.
10.1021/pr049758y CCC: $30.25 2005 American Chemical Society

Keywords: quantitative proteomics • 2-DE • statistical power analysis • experimental design • analytical variation

Introduction

A key aim of proteomics is the expression analysis of large numbers of proteins. Technologies employed for this include two-dimensional polyacrylamide gel electrophoresis combined with visible or fluorescent stains and image analysis, and mass spectrometric approaches that use isotopic labeling techniques such as isotope-coded affinity tag peptide labeling (ref 1) or amino acid coded mass tagging (ref 2). These have been recently reviewed elsewhere (ref 3). Quantitative two-dimensional electrophoresis (2-DE) is commonly applied to differential display analysis, to find proteins that are differentially expressed between samples or groups of samples. If these proteins are shown to change consistently in a population, they may be associated with or responsible for a phenotype and are referred to as biomarkers. Biomarkers can form the basis of diagnostic and prognostic tests and as such they are of scientific and commercial interest.

In many quantitative proteomic studies based on 2-DE gels, where researchers are seeking to identify differentially expressed proteins, there is often inadequate attention paid to experimental design. There are special challenges with proteomic data that need consideration, similar to the challenges faced in the acquisition and analysis of mRNA expression data.
These arise due to the following: a very large number of measurements are usually generated for each sample; analytical variation is inherent to the protein separation, staining, image acquisition and processing steps; there is biological variation of environmental origin; and there is biological variation of genetic origin, for example in an out-bred population. Furthermore, the generation of proteomic data remains a relatively involved and multistep process, and as a consequence, experiments are usually modest in the number of samples analyzed. So while many experiments reveal proteins that are differentially expressed, these may be found to be not significant when subjected to rigorous statistical analysis. Many of the sources of variation simply cannot be easily controlled in proteomics (e.g., the study of human samples). However, a good experimental design can evaluate the impact that different sources of variation can have on the statistical importance of the results, and help assess the best course of action.

The simplest experimental design for differential display exhibits a hierarchical structure (see Figure 1). At the top of the hierarchy are the groups of samples for comparison, defined by sample characteristics such as disease state (healthy or disease population) or treatment applied (drug dosage). The middle level of the hierarchy assays variation between the samples within a given group, capturing the major source of biological variation. This variation will be smallest when dealing with simple bacterial cultures and at its greatest in studies dealing with samples from human subjects. The lowest level of the hierarchy involves replicate 2-DE gels run from the same sample, and captures the inherent analytical variation. It is important to recognize that variation is inherent across the experiment; however, a good experimental design can address the number of samples that must be analyzed and the number of replicate gels per sample, in the context of a particular minimum expression difference that one is seeking to discover.

Journal of Proteome Research 2005, 4, 809-819. Published on Web 05/07/2005.

Figure 1. Typical experimental paradigm displaying the hierarchical structure of a 2-group quantitative proteomic experiment. At the top of the hierarchy are the groups of samples for comparison of sample characteristics such as disease state. The middle level of the hierarchy allows for estimation of between-samples (biological) variation within a given group and the lowest level for estimation of between-replicate 2-DE gels (analytical) variation.

In this study, we describe an approach for understanding and controlling variation in gel-based proteomics, and propose a means of designing an experiment to define the number of samples required and the number of gels per sample. This approach has been developed for the simplest of situations (two groups of samples with variation at two levels: between samples and between gels); however, it can be extended to address more complex experimental situations. We also describe some Internet-accessible tools that can assist researchers in experimental design for their quantitative proteomic studies.

Materials and Methods

Materials. Whole blood was collected from patients and immediately stored on ice, then spun gently to sediment red blood cells. The supernatant was further spun at 6000 × g to leave clarified plasma that was then stored at -80 °C until required. Standard laboratory chemicals were obtained from Sigma-Aldrich (St. Louis, MO) unless specified otherwise.

Human Plasma Sample Preparation. Two mL aliquots of plasma (per subject) were quickly thawed at 37 °C and depleted of the three high-abundance proteins fibrinogen, immunoglobulin G and albumin.
Fibrinogen was removed using a venom cross-linking method (ref 4), immunoglobulin type G (IgG) with immuno-affinity chromatography using immobilized protein G sepharose beads (Amersham Biosciences, NSW, Australia), and human serum albumin by ethanol fractionation (ref 5). The triple-depleted plasma sample was then pre-fractionated into narrow range pI fractions, pI 3.0-5.5, 5.5-6.5 and 6.5-11.0, using an IsoelectrIQ2 multi compartment electrolyzer (MCE; Proteome Systems, Sydney, Australia). Throughout the fractionation, protein concentrations were quantified using a Coomassie-blue based Bradford protein assay. Only the pI 5.5-6.5 fractions were used in this study.

Three hundred µg of MCE-fractionated triple-depleted human plasma protein was made up to a final volume of 210 µL in a 7 M urea, 2 M thiourea, 10 mM Tris, 2% CHAPS sample buffer. The sample was then ultrasonicated for 30 s, and then reduced (by adding tributylphosphine to a final concentration of 5 mM and incubating for 1 h at ambient temperature) and alkylated (by adding iodoacetamide to a final concentration of 15 mM for 1 h at ambient temperature and protected from light). Before rehydration of immobilized pH gradient (IPG) strips, samples were ultrasonicated for 2 min and then centrifuged at 21 000 × g for 5 min. The supernatant was collected and 10 µL of Orange G added as an indicator dye.

Bacterial Sample Preparation. Twenty mg of lyophilized Escherichia coli bacterial cells (Sigma product EC-1, strain K12) were resuspended in 10 mL of sample buffer (7 M urea, 2 M thiourea, 40 mM Tris, 1% C7 BzO). The suspension was sonicated with the ultrasonic probe (Branson digital sonicator, model 450) for a total of 1 min (4 × 15 s pulses at 70% amplitude, chilled on ice between each step), and then centrifuged at 14 000 × g for 15 min at 15 °C to pellet cell debris.
The supernatant was transferred into a clean tube and reduced and alkylated as described in the above section for the plasma protein preparations.

Two-Dimensional Gel Electrophoresis. Dry 11 cm IPG strips (Amersham Biosciences, NSW, Australia) were rehydrated for 8 h with 210 µL of protein sample. Rehydrated strips were focused to 100 kVh on a Protean IEF Cell (Bio-Rad, Hercules, CA) or an IsoelectrIQ2 (Proteome Systems, Sydney, Australia) electrophoresis unit. Focused IPG strips were equilibrated for 20 min in 6 M urea, 2% SDS, 0.01% bromophenol blue in 50 mM Tris-acetate buffer pH 7.0. Equilibrated IPG strips were placed on top of 6-15% Tris-acetate sodium dodecyl sulfate polyacrylamide precast 10 cm × 15 cm × 1 mm gels (Proteome Systems, Sydney, Australia). Electrophoresis was performed at 50 mA per gel for 1.5 h or until the tracking dye front reached the bottom of the gel. Proteins were stained using SYPRO Ruby (Molecular Probes, Eugene, OR) according to the manufacturer's instructions, then destained for 4 to 7 h in 10% methanol, 7% glacial acetic acid.

Image Capture and Analysis. Images of gels were acquired using the AlphaImager 3300 software (Alpha Innotech Corporation, San Leandro, California). Aperture and exposure times were adjusted so that only the most abundant proteins on the gels reached saturation (at pixel intensity level as determined by the software). For the analysis of E. coli, the above procedure was used for the gels separating 320 µg of protein, and the same settings were used for all other gels with lesser protein loads. The gel images were saved as 16-bit tagged image file format (TIFF) files and analyzed using ImagepIQ version 1.0.1, a 2-DE image analysis software package (Proteome Systems, Sydney, Australia). The images were imported into the ImagepIQ database under separate experiments according to sample type.
Image manipulation was done within ImagepIQ and consisted of inverting the pixels to obtain an adsorption image (dark spots on a light background), flipping the image into the correct orientation and cropping the image where required. The spot detection parameters were optimized on one gel image representative of each experiment. To do this, spot detection was first applied to the image using the default settings. The optimal threshold values for the spot-intensity and spot-area parameters were then determined visually by applying real-time filters, the aim being to minimize the detection of artifacts and maximize the detection of real spots. These optimized settings were saved with the experiment and applied to all gel images within that experiment. The region of interest for each gel image and the settings were ultimately saved with the image.

After spot detection, manual spot editing was done for each image. Editing consisted of deleting spots at the periphery of the gel, and removal of obvious specks that escaped the filtering process. Missed spots were not edited at this stage. The images in each match set were then matched. In the case of the E. coli gels, all the images were selected and a single layer matching process applied. In the case of plasma gels for data sets 1 and 2, multilayer matches were done. The triplicate gels were grouped and matching was done for all the groups simultaneously using the batch process feature. A composite image (replicate composite) was generated for each group of triplicate gel images. Spots that were matched in 2 of the 3 replicate images were visually examined for mismatches and these were edited.
Spots that were only found in one of the three images were deemed to be artifacts and were deleted from the match and hence from the composite image. The edited composite images from each sample group were subsequently matched to each other to generate a composite image representative of each group. Post-match editing at this level was restricted to making sure that the correct spots were matched together. At the final level of matching, the two group composite images were matched to generate the final experiment composite image. The match IDs (numbers used as identifiers for all spots in a match) and normalized spot volume columns were exported from the generated match report as a text file for further manipulation using Microsoft Excel.

Statistical Analyses

Data Screening. The analyses used in this study assume that the data come from a normally distributed underlying population. To check the distribution of our data, we plotted frequency histograms for the nontransformed normalized spot volumes, and for the data after log transformation. This was done using the BiostatistIQ tools within the BioinformatIQ software platform (Proteome Systems, Sydney, Australia).

Estimation of Variance Components. The coefficient of variation (CV%, calculated as the standard deviation of the normalized spot volumes divided by the mean, expressed as a percent) was calculated as a measure of inter- and intra-sample variation. The correlation coefficient, as R², was also calculated. CV% and R² values are commonly used for measuring gel-to-gel variation and would therefore allow for a direct comparison with other publications. However, there are more powerful means of estimating variance. These variance components calculations are automated when using the tools at www.emphron.com. Here we describe the details of this analysis.
Data were analyzed using a mixed effects linear model (ref 6). The model had a fixed effect term representing the group of samples (e.g., healthy vs diseased), and random deviations representing the effects of samples within groups and gels within samples. Between-sample variation represents biological variation, and between-gel variation represents analytical variation in the experimental process. Variance components for between-samples and between-replicate-gels variation were estimated using the REML algorithm, as implemented in the R (Open Source statistical package) function lme.

Power Calculations. To assist in experimental design, and to understand the number of samples that need to be analyzed in order to confidently discover differentially expressed proteins, a power analysis was undertaken using specially developed tools. We have made these available at www.emphron.com. Note that for power analysis there cannot be any missing values. Here, we describe the steps used in the automated analysis.

Most biologists are familiar with the use of the significance test, and most understand the significance level of a test in terms of the probability of incorrectly rejecting the null hypothesis, that is, the probability of falsely deciding that there is an effect. This probability is often referred to as the type I error rate. By convention, the type I error rate is often fixed at 5%. In some studies, the experimenter is likely to make a different type of error: that of failing to reject the null hypothesis when a real effect exists. This is referred to as a type II error. Obviously, we wish to ensure that our experiments have a low probability of producing each type of error. Conventionally, rather than working with the type II error rate, we usually consider the power of a study. The power is the probability of correctly rejecting the null hypothesis, given that a difference exists. That is, the power is the probability of not making a type II error.
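As an illustration of the variance-components step described under Estimation of Variance Components, the following is a minimal sketch for a single spot. It is not the REML fit from the R function lme that the analysis here relies on: it swaps in the simpler balanced-design ANOVA (method-of-moments) estimator, which typically agrees with REML when the design is balanced and the estimates are non-negative. The function name, the data layout, and the example values are hypothetical.

```python
from statistics import mean

def variance_components(groups):
    """Estimate variance components for one spot in a balanced nested design.

    groups maps a group name to a list of samples; each sample is a list of
    (log-transformed, normalized) spot volumes, one per replicate gel.
    Returns (between_sample_variance, between_gel_variance).
    Assumption: every sample has the same number of replicate gels (m >= 2).
    """
    m = len(next(iter(groups.values()))[0])  # gels per sample
    ss_within, df_within = 0.0, 0            # gels within samples
    ss_between, df_between = 0.0, 0          # samples within groups

    for samples in groups.values():
        sample_means = [mean(s) for s in samples]
        group_mean = mean(sample_means)
        for gels, s_mean in zip(samples, sample_means):
            if len(gels) != m:
                raise ValueError("balanced design required")
            ss_within += sum((y - s_mean) ** 2 for y in gels)
            df_within += m - 1
        # Deviations are taken from the group mean, so the fixed group
        # effect (e.g., healthy vs diseased) does not inflate this term.
        ss_between += m * sum((s_mean - group_mean) ** 2 for s_mean in sample_means)
        df_between += len(samples) - 1

    ms_within = ss_within / df_within
    ms_between = ss_between / df_between
    # E[ms_between] = var_gel + m * var_sample, so solve for var_sample.
    between_sample = max(0.0, (ms_between - ms_within) / m)
    return between_sample, ms_within

# Hypothetical log-scale volumes: 2 groups x 2 samples x 3 replicate gels
example = {
    "healthy":  [[1.0, 1.2, 1.1], [2.0, 2.2, 2.1]],
    "diseased": [[3.0, 3.2, 3.1], [4.0, 4.2, 4.1]],
}
var_sample, var_gel = variance_components(example)
```

For unbalanced data or data with missing gels, this shortcut no longer applies and a proper mixed-model fit would be needed.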
Elementary statistics textbooks generally focus on the significance level (type I error) rather than on the power (type II error). This is because we can usually control the type I error rate by choosing an appropriate critical value for our test statistic. The power, however, requires us to know rather more about the system we are studying. Power is influenced by four factors. Increasing the effect size (the true difference between means) makes it more likely that we will reject the null hypothesis and therefore increases power. That is, it is easier to find big differences than small differences. Reducing the experimental variability increases the power. That is, differences are easier to detect when there is little variation. Increasing the sample size increases the power: we are more likely to detect effects when we have many observations than when we have few. Finally, the power is influenced by the significance level we require. The smaller the type I error rate we are prepared to tolerate, the smaller our chance of detecting real effects becomes. That is, decreasing the significance level decreases the power.

Power was calculated using the tools on www.emphron.com. These tools use estimates of analytical and biological variability to determine the standard error of differences between two treatment means for a given experimental design. The experimental design is defined by the number of samples per group and the number of gels per sample. These tools then generate estimates of the minimal detectable difference: the difference between means which will give a power of 80% for the given number of samples and gels.

Results

The determination of an appropriate design for a 2-DE based experiment requires high quality protein expression data from image analysis. However, there is relatively little attention paid to the issues that affect the quality of image analysis data.
Accordingly, we wished to carefully evaluate the approaches used in generating these data prior to their use for experimental design. Figure 2 shows the issues that are faced in generating high quality image analysis data and outlines some of the approaches that can be used to minimize their effects on data quality. These issues are explored in detail below. The image analysis steps that need to be robust are as follows: detecting most, if not all, of the spots arrayed on a 2-DE gel; correctly determining the boundary of these spots and hence obtaining an accurate measure of the spot volumes; generating normalized spot volumes in a gel or group of gels using a method that corrects for variations in absolute spot volumes (due to differences in protein loads per gel, in protein staining regime, and in image capture settings); and correctly aligning and matching all the images in a group and accurately determining the corresponding spots across all images.

Figure 2. Flowchart showing the steps involved in an experimental design for a typical quantitative proteomic experiment. This includes the hierarchical structure of the pilot experiment, and outlines the potential hazards associated with each step as well as the precautions that can be taken to minimize their effects on the experimental results.

Spot Detection. To check the accuracy of spot detection, 2-DE gel images from http://www.umbc.edu/proteome that have been used in other publications to test image analysis software (ref 7) were analyzed. Spot detection was undertaken as described in Materials and Methods; the spot detection parameters were optimized for the image type, but no manual spot editing was done.
The results of the spot detection were then visually compared to the same image annotated with the expected real spots (also available from the above website). A total of 900 and 1403 spots were detected on images gel-a and gel-b, respectively. Visual comparison with the expected results showed the detection of 93.6% and 95.2% of the expected real spots (true positives), the missing of 6.4% and 4.8% of the expected real spots (false negatives), and detection of 13.1% and 8.8% artifacts. These results compare favorably to the accuracy of spot detection reported in the literature for other image analysis software packages, the best reported value for percentage of spots missed being 6% for Melanie 3.0 software (ref 7).

Spot Quantitation. To evaluate the accuracy of spot quantitation, a set of eleven artificial images from Raman et al. (ref 7), designed to test the accuracy of spot quantitation of image analysis software, was downloaded (http://www.umbc.edu/proteome). The expected volume ratios of the center spot in images (b) through to (k), relative to the center spot in image (a), are 2, 4, 6, 10, 14, 18, 22, 26, 30, and 40, respectively. Spot detection was carried out on the 11 artificial images using default spot detection parameters (see Materials and Methods). The images were then matched to each other to generate a match report with the appropriate relative intensity ratios. The observed spot ratios correlated well with the expected spot ratios, giving a correlation coefficient R² value of 0.99. Although the images analyzed are artificial and do not completely mimic a set of 2-DE gels, they are a useful indicator of the complexity of spot quantitation and a valid test of quantitative image analysis approaches.

Spot Volume Normalization. Slight variations in protein load per gel, protein staining efficiency and image capture can have a considerable impact on the raw spot volumes generated by image analysis.
Normalization of raw spot volumes, necessary to minimize gel-to-gel variation, is imperative for quantitative proteomic studies using 2-DE. The efficiency of normalization methods in correcting for analytical variations in raw spot volumes due to uneven protein loading was tested. Increasing amounts of E. coli protein extracts (20, 40, 80, 160, or 320 µg) were subjected to 2-DE using 11-cm, pH 3-10 IPGs (see Materials and Methods). A representative image, annotated with the 278 spots that matched across all gels following automated matching, is shown in Figure 3. The range of spot volumes for the 278 matched spots relative to that of all the spots on the representative gel also shows that there was no bias in the selection of the spots for statistical analyses (Figure 3). The volumes for the matched spots, before and after normalization, were graphed using a box-and-whisker plot (Figure 4, graphs A and B). With an increase in protein load, there is a steady increase in the raw spot volumes, but there is little change in the normalized spot volume. These results show that the method of normalization employed can globally correct for differences in amount of protein loaded onto the gels.

Figure 3. E. coli 2-DE gel image annotated with the 278 matched spots (out of a total of 500 detected spots) used in the statistical analyses. The spot volumes for all the 500 spots detected, and for the 278 matched spots, were sorted in ascending order and plotted in a line graph. The graphs show that the 278 matched spots covered a similar range of volumes relative to the 500 spots detected, illustrating that we have a representative set of protein spots from the gel.

To investigate the efficiency of normalization in controlling analytical variation in replicate gels from the same sample, and in controlling variation from replicate samples run on different gels, we plotted the spot volumes, before and after normalization, for the 119 spots from data set 2 as described for the E. coli data in the above paragraph (Figure 4, graphs C and D). This plot shows that for raw spot volumes, there is random variation in the median and other quartiles across the 18 gels. After normalization, it can be seen that the random analytical variation in raw spot volumes has been largely corrected for.

Normalization should correct for differences in raw spot volumes such that, for any given matched spot across replicate gels of the same sample, the ratio of spot volumes is expected to be close to 1. Figure 5 shows X-Y plots of the ratio of the log transformed normalized spot volume for two sets of duplicate gels (E. coli_80 and E. coli_40 from the E. coli data set; and replicate gels 2.1.1 and 2.1.3 from data set 2) plotted against the log transformed normalized spot volume for one of the replicate gels (see section: Spot Volume Data Screening, for more information on logarithmic transformation). The ratios were close to 1 for all except the faintest spots, which had the lowest spot volumes.

Image Analysis Data for Statistical Analyses. Two plasma protein expression data sets were generated for the purpose of estimating between-sample and between-gel variance. Data set 1 consisted of 24 protein 2-DE gels (11-cm pH 4-7 IPGs) of triple-depleted plasma sample preparations from two groups of subjects (a healthy and a diseased group), 4 samples per group, and triplicate gels per sample. Data set 2 consisted of 18 gel images generated similarly from two other healthy and diseased groups of subjects (3 samples per group, triplicate gels per sample). Image analysis was carried out as described in the Materials and Methods. Any matching conflicts (for example nonmatching spots) were deliberately not resolved, and only spots that matched across all the images were used. This strategy was used because it was imperative, for the purpose of these statistical analyses, that there were no mismatches in the data set. It also ensured that the spots used in the analysis were randomly spread across the images, and that no subjective editing was applied that might affect the automated spot quantitation. Plasma data sets 1 and 2 consisted of 63 and 119 matched spots, respectively. These were randomly spread across the gels, indicating that there was no bias in the selection of the spots for statistical analyses, as shown in Figure 6 for a representative gel for data set 2.

Coefficient of Variation and Correlation Coefficient. Most published reports that study biological and analytical variation have been based on evaluation of the percent coefficient of variation (CV%) and correlation coefficient (R²) (see e.g., ref 8). We believe that these tests are not sufficiently robust, and so we explored alternative approaches based on variance analyses and power calculations, below. However, to allow for a direct comparison of our study with previously published work, we calculated CV% and R² from our plasma data sets 1 and 2. Table 1 shows the average CV% ± SD for normalized spot volumes for the replicate images from plasma data sets 1 and 2. These values may appear high but when compared to similar analyses of 2-DE image analysis data (see discussion), our results are very favorable. A better illustration of these data is, however, to graph the cumulative percent of spots that fall below given CV% values for the 8 sets of replicate gels from data set 1, and the 6 sets of triplicate gels from data set 2 (Figure 7). This reveals that there can be notable differences in the CV% from one sample to the next, and can aid the identification of outlying samples in a group.
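The CV% summaries described above can be sketched as follows. This is a minimal illustration using Python's standard library, not the procedure used to produce Table 1 or Figure 7: the function names are hypothetical, the spot volumes are invented for the example, and the sample (n - 1) standard deviation is assumed.

```python
from statistics import mean, stdev

def cv_percent(volumes):
    """CV% for one matched spot: sample SD across replicate gels / mean * 100."""
    return stdev(volumes) / mean(volumes) * 100.0

def cumulative_percent_below(cvs, thresholds):
    """For each threshold, the percentage of spots whose CV% is below it."""
    n = len(cvs)
    return {t: 100.0 * sum(cv < t for cv in cvs) / n for t in thresholds}

# Hypothetical normalized spot volumes for one replicate set:
# one row per matched spot, one column per replicate gel.
replicate_set = [
    [1.00, 1.10, 0.90],
    [2.00, 2.00, 2.00],
    [0.50, 0.80, 0.20],
    [3.00, 3.30, 2.70],
]

spot_cvs = [cv_percent(row) for row in replicate_set]

# Summary in the style of Table 1: average CV% and its SD over all spots
average_cv = mean(spot_cvs)
sd_cv = stdev(spot_cvs)

# Curve in the style of Figure 7: cumulative percent of spots below each CV%
curve = cumulative_percent_below(spot_cvs, thresholds=[5, 15, 25, 75])
```

Plotting one such curve per replicate set makes an outlying sample easy to spot, since its curve rises more slowly than the others.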
One simple, but widely used, means of evaluating within- and between-sample variance is to establish the correlation coefficient of nontransformed data. Accordingly, we used automatic image analysis to generate the correlation coefficient values (R²) of nontransformed normalized spot volumes, for every pairwise combination of gel images from plasma data sets 1 and 2. The means and standard deviations of the R² values for each replicate set were calculated for the healthy and diseased groups of data sets 1 and 2 (Table 2). The average R² (± SD) values for within-samples comparisons, which indicate analytical variation, ranged from 0.87 ± 0.09 to 0.99 ± 0.001, and from 0.94 ± 0.03 to 0.99 ± 0.002 for the healthy and diseased groups of data set 1, respectively. The equivalent values for data set 2 were 0.95 ± 0.01 to 0.98 ± 0.01, and 0.93 ± 0.03 to 0.98 ± 0.01. The values for between-samples comparisons, which indicate biological variation, ranged from 0.73 ± 0.03 to 0.88 ± 0.04, and from 0.93 ± 0.04 to 0.97 ± 0.01 for the healthy and diseased groups of data set 1, respectively. The equivalent values for data set 2 were 0.70 ± 0.06 to 0.90 ± 0.01, and 0.43 ± 0.02 to 0.66 ± 0.04. While this approach is not as powerful as other methods of analyzing variance (see below), it has revealed that our between-sample variation (biological variation) is clearly greater than our within-sample variation (analytical variation).

Figure 4. Efficiency of normalization in correcting for differences in spot volumes due to varying protein loads (A and B) or analytical variations (C and D). The log_e transformed spot volumes (graph A), and log_e transformed normalized spot volumes (graph B), for the 278 analyzed spots across the five E. coli gels with increasing protein load per gel, are plotted in a box-and-whisker format against the five gels. The log_e transformed spot volumes (graph C), and log_e transformed normalized spot volumes (graph D), for the 119 analyzed spots across the 18 gels from data set 2 are plotted in the same format against the 18 gels. The graphs show the effect of normalization on the median and the range of spot volumes across the gels.

Figure 5. Effectiveness of normalization of spot volumes in correcting for gel-to-gel variation between sets of replicate gels from E. coli (A) and human plasma (B, data set 2). The ratios of the log_e transformed normalized spot volumes are plotted, using an X-Y plot format, against the log_e transformed normalized spot volumes.

Figure 6. Representative 2-DE gel image of human plasma from data set 2 annotated with the 119 spots used in the statistical analyses. The spot volumes for all the 367 spots detected, and for the 119 matched spots, were sorted in ascending order and plotted in a line graph. The graphs show that the 119 matched spots covered a similar range of volumes relative to the 367 spots detected, illustrating that we have a representative set of protein spots from the gel.

Table 1. Average CV% for Normalized Spot Volumes Across Replicate Gels (a)

data set   replicate set   average CV% ± SD
1          1.1             16.1 ± 8.1
1          1.2             19.8 ± 13.0
1          1.3             13.0 ± 5.9
1          1.4             14.8 ± 16.4
1          1.5             26.3 ± 17.3
1          1.6             31.6 ± 22.9
1          1.7             11.6 ± 9.8
1          1.8             7.4 ± 4.5
2          2.1             17.20 ± 12.16
2          2.2             21.24 ± 20.13
2          2.3             19.84 ± 15.86
2          2.4             16.02 ± 12.43
2          2.5             22.41 ± 22.28
2          2.6             19.54 ± 15.23

(a) For each of the matched spots, CV% of normalized spot volumes was calculated across replicate 2-DE gels from data set 1 (replicate sets 1.1 to 1.8) and data set 2 (replicate sets 2.1 to 2.6). For each replicate set of gels, the average and SD values were then calculated and tabulated.

Figure 7. Percentage of spots that fall below a given CV% for normalized spot volume across replicate 2-DE gels. The CV% for normalized spot volumes across each of the 8 sets of triplicate gels from plasma data set 1 (replicate sets 1.1 to 1.8) was calculated and the cumulative percent of matched spots that fall below each of the given CV% values was plotted against the CV%. A shows the graphs for plasma data set 1, replicate sets 1.1 to 1.8; B shows the graphs that were similarly generated for the 6 sets of triplicate gels from plasma data set 2 (replicate sets 2.1 to 2.6).

Spot Volume Data Screening. On the basis of previous work in a separate study (data not shown), and on published work (ref 9), we understood that spot volume data from image analysis of 2-DE gels do not fit a normal distribution, but require logarithmic transformation. To confirm that this was the case with the data generated here, the normalized spot volumes, before and after log transformation, were plotted using a frequency histogram. Figure 8 shows the plots for a representative image from data set 1. These results showed that our log-transformed data are approximately normally distributed, compared to the nontransformed data. Accordingly, only transformed data were used for further analyses.

Figure 8. Comparison of frequency distribution of nontransformed and transformed normalized spot volumes. The median normalized spot volumes and log_e transformed median normalized spot volumes were calculated for each replicate set of gels from data sets 1 and 2. The median values were then plotted in a frequency histogram, with the kernel density estimate of the distribution superimposed. The graphs are shown for a representative replicate set (replicate set 1.1) from data set 1. Note that the nontransformed data do not fit a normal distribution, whereas the log_e transformation yields a data set that is approximately normally distributed. Similar effects were obtained with all other replicate sets of gels (data not shown).

Estimation of Variance Components. The normalized spot volumes from plasma data sets 1 and 2 were analyzed using statistical applications that we have made available under the Tools section at www.emphron.com. A table of between-samples and within-samples variance was generated, which was then viewed in a box-and-whisker plot (Figure 9). The median between-samples variance components for data sets 1 and 2 were calculated to be 0.22 and 0.29, respectively, and the median within-samples (between replicate gels) variance components were 0.05 and 0.04, respectively. When analyzed on a spot by spot basis, the within-samples (analytical) variance was smaller than the between-samples (biological) variance in 92% and 87% of the cases for data sets 1 and 2, respectively. The variance values were then used in the power calculations to determine how many gels and samples one should aim for in any given experimental design.

Power Calculations. The variance components for between-samples and within-samples can be used in power calculations, to predict the best experimental design for use in a two-population comparison study. Statistical power is defined as the probability of correctly identifying a difference between the groups.
The power is determined by the sample size (number of samples and number of gels), the variability (biological and analytical variation), the significance level of the test, and the effect size (the size of the difference one is seeking to identify). The significance level is the probability of incorrectly rejecting the null hypothesis, that is, the probability of making a type I statistical error. A type I error arises when we decide that there is a group difference when, in truth, there is not. A type II error is made when we incorrectly accept the null hypothesis, that is, when we decide that there is no group difference when, in truth, there is. The effect size is defined as a percentage increase: an effect size of 100% represents a doubling of spot volume between groups, and an effect size of 50% represents a between-group ratio of 1.5. The minimum detectable difference is the effect size needed to achieve a required power at a specified significance level, given the variance estimated for a particular number of samples and gels. It can be read from the plots of effect size % against number of samples for a given number of gels per sample.

To investigate what number of samples and analytical replicates would need to be analyzed in a differential display experiment where different levels of protein expression are sought, we analyzed plasma data sets 1 and 2 using the tool "Minimum Detectable Difference-Spot Data" from the web site www.emphron.com. The analysis generated a table of effect size % for 4 to 20 samples per group and 1 to 4 gels per sample, based on a given power of 80% and significance level of 5% (data not shown). We plotted the effect size % against number of samples for 1 to 4 gels per sample (Figure 10). The curves in Figure 10 show that the size of the difference that can be established with significance between two sample types is influenced predominantly by the number of samples analyzed. For example, approximately 9 samples (in data set 1) and 11 samples (in data set 2) are required from each of the control and experimental groups to detect a 2-fold difference between the populations with significance. Larger differences (e.g., a 3-fold difference) can be detected with fewer samples (in data set 1, approximately 5 samples from each of the control and experimental groups). The top graph in Figure 10 also makes clear that there is an asymptote toward ∼50% effect size, illustrating that even with a very large number of samples analyzed, there would be a minimum detectable difference of approximately 1.5-fold. Similar trends were observed in data set 2.

A further trend evident in both graphs (Figure 10) is that while an increase in the number of samples has a marked impact on the detectable differences observed, an increase in the number of replicates done per sample does not. This is not surprising, since the between-samples variance is so much larger than the within-samples variance, and it illustrates that duplicate or triplicate 2-DE gel runs per sample are appropriate for the data presented here. An added advantage of running replicate gels per sample is that it facilitates image analysis; artifacts detected as spots are usually unique to a single replicate gel and can therefore be filtered out on that basis.

In examining the graphs for plasma data sets 1 and 2, it can also be seen that the minimum detectable differences are dissimilar. Data set 2 has a greater variance for between-sample effects and a correspondingly larger minimum detectable effect size, with an asymptote toward 65%. This shows that the minimum detectable effect size is sensitive to features of the experimental data that vary from one experiment to another. Effective use of sample size calculations will depend on the availability of suitable pilot data for the system of interest.

Figure 9. Between-samples and within-sample variances in normalized spot volumes. The normalized spot volumes from plasma data sets 1 and 2 were analyzed using statistical applications that we have made available under the Tools section at www.emphron.com. The tool "Minimum detectable differences spot data" was selected, the data file was uploaded, and the following values were entered: Minimum number samples per group, 4; Maximum number samples per group, 20; Minimum number gels per sample, 1; and Maximum number gels per sample, 4. The default values of 0.05 and 0.8 were used for Required significance level and Required power. The variance values generated by the software were copied into an Excel spreadsheet and graphed using box-and-whisker plots. Graphs A and B show the between-samples and within-sample variances in data sets 1 and 2, respectively. The Y-axes are of different scale. Note that the within-sample variation is smaller than the between-samples variation in both cases.

Experimental Design

In the above sections, we have described approaches and tools required to formulate a good experimental design for quantitative 2-DE based proteomics. The main aim of the design is to determine how many samples and gels should be analyzed in a given study to ensure that any proteins identified as differentially displayed will hold true in a subsequent larger study such as a clinical trial. Figure 2 shows a schematic description of the steps to be followed to ensure good experimental design and highlights some hazards and precautions to be taken at each step.
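As a rough illustration of the question posed above (how many samples and gels are needed), the minimum detectable effect size can be approximated from the two variance components. The sketch below is not the calculation performed by the tools at www.emphron.com, whose internals are not described here. It assumes log_e transformed spot volumes, equal numbers of samples in both groups, and a normal approximation to the test statistic; the function name `min_detectable_effect_pct` is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def min_detectable_effect_pct(var_between, var_within, n_samples, n_gels,
                              alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size (%) for a two-group study.

    Assumptions of this sketch (not taken from the paper's tool): the data are
    log_e transformed, both groups have n_samples samples with n_gels gels
    each, and a normal approximation replaces the t distribution.
    """
    # Variance of one sample mean: biological variance plus the analytical
    # variance averaged over the replicate gels run for that sample.
    var_sample_mean = var_between + var_within / n_gels
    # Standard error of the difference between the two group means.
    se_diff = sqrt(2.0 * var_sample_mean / n_samples)
    z = NormalDist()
    # Quantiles for a two-sided significance level and the required power.
    multiplier = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    log_diff = multiplier * se_diff
    # Back-transform the log_e difference to a percentage increase.
    return (exp(log_diff) - 1.0) * 100.0


if __name__ == "__main__":
    # Median variance components reported in the text for plasma data set 1.
    var_between, var_within = 0.22, 0.05
    for n_gels in (1, 2, 3, 4):
        row = [round(min_detectable_effect_pct(var_between, var_within, n, n_gels))
               for n in (4, 8, 12, 16, 20)]
        print(f"{n_gels} gel(s) per sample: {row}")
```

With the median variance components of data set 1, this approximation gives roughly a 90% effect size for 9 samples per group and 3 gels per sample, which is broadly in line with the 2-fold difference read from Figure 10. A t-based calculation would give somewhat larger values for small numbers of samples. The sketch also shows why adding samples matters more than adding gels: the number of gels only reduces the smaller, within-sample term.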
The overall strategy is to do a pilot study consisting of triplicate 2-DE gels from each of 3 samples from each of two groups (for example, a healthy and a diseased group), perform image analysis and normalize the data, and then do statistical variance analysis of the data for spots that match across all the gels. Transformation of the normalized spot volumes is done automatically by the analysis tool. The between-sample and within-sample variance data can then be used to run a power analysis, which allows you to determine how many gels and samples you will need to analyze in a full experiment to confidently detect differences of a certain level between the two groups.

Discussion

We have presented an approach for assaying the quality of image analysis and a set of robust statistical approaches to measure variability in gel-based proteomics. We have also described a set of web-based tools that can be used for experimental design based on these measurements. We have shown how data from a pilot study can then be used to evaluate how many samples should be analyzed in a larger study and how many replicate gels need to be run to be confident that any differences detected in subsequent experiments will be significant. We believe these approaches, while straightforward, should be applied in proteomic experiments to ensure that the results reported reflect discoveries relevant to a biological system, and not just analytical variation. For example, a 2-fold up- or down-regulation of protein expression between two groups may be of little importance unless appropriate numbers of samples have been analyzed.

Table 2. Correlation Coefficient for Normalized Spot Volumes between Replicate Gels. Correlation coefficient R² values (average ± SD) were calculated for normalized spot volumes (not transformed) for all combinations of replicate sets of images for the healthy and diseased groups from data set 1 (A) and data set 2 (B).

Our gel-to-gel variability results (analytical variation) were very favorable when compared with previous reports. If we consider CV%, we have shown that an average of 50%, 83%, and 95% of spots matched across triplicate gels with CV% values of <15%, <30%, and <50%, respectively. Our CV% values ranged from 7.4% ± 4.5 to 31.6% ± 22.9, with an average of 18%, and from 16.0% ± 12.4 to 22.4% ± 22.2, with an average of 19%, for data sets 1 and 2, respectively. Others have reported CV% values ranging from 18.7% to 26.7%, with an average of 22.6% ± 2.9, and 35 to 38% of spots with CV% < 15% (ref 8). Another report, comparing the effect of automation of image analysis on gel-to-gel variability, showed that 76%, 54%, and 48% of spots that matched across 4 gel images had CV% < 30% for manual, semi-automated, and fully automated image analysis strategies, respectively (ref 10). Asirvatham et al. (ref 11) reported more favorable results; they analyzed 50 selected spots using a user-guided approach to image analysis and reported an average CV% of 16.2, with 52% of spots having CV% values of <15%. However, when they used a fully automated image analysis strategy, the average CV% rose to 39.4%. Nishihara et al. (ref 12) reported favorable results, but they analyzed only 20 well-defined, manually selected spots. Using three different image analysis software packages (Z3, Progenesis, and PDQuest) with varying degrees of user input, they reported average CV% of 15.2% ± 7.6, 18.1% ± 8.0, and 17.0% ± 7.6, respectively, with 35%, 40%, and 45% of spots having CV% of <15%.
Given that some of the above reports (refs 8, 11) used homogeneous instead of gradient gels, and the former have been shown to have less variability (ref 13), and others took care to minimize variability by focusing all the IPGs simultaneously (ref 11), the values we report here suggest that we had very good gel-to-gel variability data. We believe that the care taken in image acquisition, and our approaches for spot detection, quantitation, and normalization, have contributed substantially to this.

We reported R² values of between 0.87 and 0.99 for gel-to-gel variability, with average values of 0.97 ± 0.04 and 0.96 ± 0.02 for plasma data sets 1 and 2, respectively. These R² values compare favorably with those reported elsewhere. Molloy et al. (ref 8) reported R² values of between 0.83 and 0.91, with an average of 0.86, for replicate gels from 5 different sample types.

We should point out that we used a midi gel system, whereas most of the reports are based on large gel formats. Determining whether this contributed to the observed differences in gel-to-gel variability would require side-by-side experiments on the midi gel and the large format gel systems using the same sample preparations and image analysis strategies. Felley-Bosco et al. (ref 14) reported a comparison between small (6 × 7 cm) and large (16 × 18 cm) gels. Although they focused on qualitative rather than quantitative differences, they reported slightly better reproducibility with the large gel format, with R² values for spot volumes from triplicate gel comparisons of 0.87 ± 0.002 and 0.78 ± 0.07 for the large and small gel formats, respectively.

It is widely accepted, and supported by our data, that multiple samples must be analyzed to ensure statistical significance in quantitative proteomics. The sources of gel-to-gel and sample-to-sample variability in 2-DE based quantitative proteomics cannot be removed completely, but there are ways of minimizing these, as discussed below.
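The CV% summaries compared above (average CV% per replicate set, and the percentage of spots falling below a given CV%) can be computed along the following lines. This is a minimal sketch rather than the procedure used by any particular image analysis package. It assumes the sample standard deviation (n − 1 denominator) is used, and the spot volumes in the example are invented purely for illustration.

```python
from statistics import mean, stdev


def cv_percent(volumes):
    """Coefficient of variation (%) of one spot across replicate gels.

    Uses the sample standard deviation (n - 1), an assumption of this sketch.
    """
    return 100.0 * stdev(volumes) / mean(volumes)


def percent_below(cvs, threshold):
    """Percentage of spots whose CV% falls below the given threshold."""
    return 100.0 * sum(cv < threshold for cv in cvs) / len(cvs)


if __name__ == "__main__":
    # Hypothetical normalized spot volumes: one row per matched spot, one
    # column per replicate gel (triplicate gels). Not data from the study.
    spots = [
        [1.00, 1.10, 0.95],
        [0.40, 0.55, 0.35],
        [2.10, 2.00, 2.30],
        [0.08, 0.15, 0.05],
    ]
    cvs = [cv_percent(row) for row in spots]
    print("average CV% =", round(mean(cvs), 1), "SD =", round(stdev(cvs), 1))
    for threshold in (15, 30, 50):
        print(f"CV% < {threshold}: {percent_below(cvs, threshold):.0f}% of spots")
```

Applying `percent_below` over a range of thresholds yields the kind of cumulative curve described in the Figure 7 legend.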
The biological (sample-to-sample) variability will depend on the sample type. Previous reports show R² and CV% values of 0.75 and 31.2% for bacterial cells, 0.78 and 26.0% for mammalian tissue culture, and 0.32 and 46.6% for mammalian primary cell cultures (ref 8). Biological variation arises from two main sources: population heterogeneity based on genetic and environmental differences, and sample heterogeneity based on the way the samples are collected (ref 15). Care should be taken to minimize this source of heterogeneity for any given study. Variables other than those specific to the study should be similarly represented in all sample groups; all groups should be sampled in a similar fashion with respect to time and place; and any patients should be matched with respect to, for example, age and gender.

Some of the sources of analytical variation, on the other hand, can be more readily controlled than those of biological variability. The main sources are discussed below, together with approaches for minimizing them.

Variations in the protein load per gel, in the stain and staining regime used to visualize the proteins, and in the settings used during image acquisition can all affect the relative spot volumes in the gel images being compared. We have shown that normalization can correct for some of these differences, but the effectiveness of this correction will depend on the image analysis software used. However, it is important to avoid overloading or underloading the gels, since this can affect the way the spot boundaries are determined and hence introduce variability in spot volumes that may not be optimally corrected for by the normalization process. In our hands, with a bacterial whole cell lysate, normalization optimally corrected for differences between loads of 40, 80, and 160 µg protein per IPG. However, normalization was not as effective with higher or lower protein loads (20 µg and 320 µg). Too low a loading will mean that faint spots are near their threshold of detection and hence more prone to errors in quantitation, while high protein loads can mean that spots are prone to overlapping on the gel or saturation during image capture, which can affect the detection of spot boundaries and hence spot quantitation.

Direct proportionality between spot volumes and protein quantity is imperative for quantitative 2-DE gel analysis. To ensure that high-quality data are generated, it is therefore best to use protein stains that have an equilibrium end point, a broad linear dynamic range, and a staining intensity that is less dependent upon individual protein amino acid composition and post-translational modifications. We used SYPRO Ruby to stain the proteins in our gels, as it has all the above properties (ref 16). An equally suitable stain, now referred to as Deep Purple, could also be used (ref 17).

Other factors can also affect the variability of 2-DE gel-based proteomic work. Electrophoresis systems themselves have been shown to introduce variability, with batch-to-batch variability being higher than variability within the same batch (refs 11, 18). However, in these cases it is hard to determine whether the source of variability is the polyacrylamide gels or the electrophoresis system in use. Sample preparation has also been shown to introduce variability; greater variability is seen with multiple-step preparations of a single sample than with a single-step preparation of the same sample (ref 10). To assist in the control of these issues, a randomization strategy can be applied to sample preparation, 2-DE gel runs, image acquisition, and image analysis. This helps ensure that these sources of variation do not systematically bias the data. Analysis in many cases can also be blinded, to reduce experimenter bias.

Figure 10. Power curves for data sets 1 and 2. The analysis done using the Tools section at www.emphron.com (described in Figure 9 legend) generated values for effect size % for 4 to 20 samples per group and 1 to 4 gels per sample, based on a given power of 80% and significance level of 5%. The effect size % values were plotted against number of samples for 1 to 4 gels per sample for data set 1 (graph A) and data set 2 (graph B). These plots can be used to predict the number of samples that will need to be analyzed to confidently find a certain difference between two populations.

The importance of image acquisition in quantitative 2-DE proteomics is often underestimated, and yet there are several precautions that can be taken to minimize variability at this level. It is known, for example, that different scans of the same gel can give different results in terms of the number and size of spots that are detected (ref 19). This is mainly because the image analysis software will determine the spot boundaries differently due to subtle differences in the background pixel values across the same regions of the scans. This source of error can be minimized by acquiring images with optimal resolution and taking precautions to avoid the need to rotate the image, except for 90° or 180° rotations, as rotation can alter the pixels in the image and hence affect spot boundaries. We found that, with our image analysis software, a resolution of between 250 and 300 dpi is optimal. A lower resolution can compromise spot segmentation, whereas a higher resolution tends to increase the detection of artifacts without any significant improvement in spot segmentation (data not shown).
This will, of course, depend on the gel format used, and we recommend the acquisition of gel images at different resolutions, followed by spot detection, matching, and variance analysis, to determine an optimal strategy. To ensure high-quality data for lower abundance proteins, a good rule of thumb is to aim for an image resolution that equates to a minimum of 9 pixels per spot.

The image analysis software and strategy used can also contribute to gel-to-gel variation. It has been shown that accuracy can be compromised by speed in image analysis. Data produced using a manual, semi-automated, or automated image analysis strategy showed that for spots matching across 4 out of 4 gels, 76%, 54%, and 48% of the spots, respectively, had CV% of less than 30% (ref 10). These levels of automation refer mainly to editing of the gel-to-gel matching (which can dramatically improve the results) as opposed to editing of the spot boundaries (which is generally difficult to do accurately and consistently by hand). Manual editing of spot boundaries can dramatically increase quantitative errors (ref 19). Hence, a good recommendation is that spot detection should be automatic and gel-to-gel matching should be manually checked.

Conclusion

We recommend that a sound strategy be adopted when designing a gel-based quantitative proteomic study. Specifically, we recommend carrying out a pilot study with 3 samples per group and 3 gels per sample; carefully undertaking image capture, spot detection and matching, normalization, and data transformation; and then using the tools we are providing on the Internet to perform the power analysis and determine the minimum number of samples and gels per sample that should be run in the subsequent study. This approach will assist in discovering differentially displayed proteins that reflect true protein expression changes between two or more populations, and so assist in the identification of biomarkers.
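As an illustration of the variance analysis step in the recommended pilot study, the between-samples and within-samples variance components for a single spot can be estimated from a balanced design by the method of moments (a nested one-way analysis of variance). This is a sketch under stated assumptions and not necessarily the estimation method used by the tools at www.emphron.com. It assumes log_e transformed normalized spot volumes and the same number of gels for every sample; the function name `variance_components` and the example volumes are hypothetical.

```python
from math import log
from statistics import mean


def variance_components(groups):
    """Method-of-moments variance components for one spot, balanced design.

    `groups` is a list of groups; each group is a list of samples; each sample
    is a list of log_e normalized spot volumes from its replicate gels.
    Returns (between_samples_variance, within_samples_variance).
    """
    n_gels = len(groups[0][0])
    ss_within = 0.0   # gel-to-gel scatter around each sample mean
    ss_between = 0.0  # sample-to-sample scatter around each group mean
    df_within = 0
    df_between = 0
    for samples in groups:
        sample_means = [mean(gels) for gels in samples]
        group_mean = mean(sample_means)
        for gels, m in zip(samples, sample_means):
            ss_within += sum((y - m) ** 2 for y in gels)
            df_within += len(gels) - 1
        ss_between += n_gels * sum((m - group_mean) ** 2 for m in sample_means)
        df_between += len(samples) - 1
    ms_within = ss_within / df_within
    ms_between = ss_between / df_between
    # E[MS_between] = var_within + n_gels * var_between; truncate at zero.
    var_between = max(0.0, (ms_between - ms_within) / n_gels)
    return var_between, ms_within


if __name__ == "__main__":
    # Hypothetical normalized volumes of one spot: 2 groups x 3 samples x 3 gels.
    raw = [
        [[1.0, 1.1, 0.9], [1.6, 1.5, 1.8], [0.7, 0.8, 0.75]],
        [[2.2, 2.0, 2.4], [1.4, 1.5, 1.3], [3.0, 2.7, 3.1]],
    ]
    logged = [[[log(v) for v in gels] for gels in samples] for samples in raw]
    var_b, var_w = variance_components(logged)
    print(f"between-samples variance = {var_b:.3f}")
    print(f"within-samples variance  = {var_w:.3f}")
```

Repeating this for every matched spot gives a table of variance components of the kind summarized in Figure 9, and these values are the inputs to the power analysis.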
References

(1) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17 (10), 994-999.
(2) Harris, M. N.; Ozpolat, B.; Abdi, F.; Gu, S.; Legler, A.; Mawuenyega, K. G.; Tirado-Gomez, M.; Lopez-Berestein, G.; Chen, X. Blood 2004, 104 (5), 1314-1323.
(3) Righetti, P. G.; Campostrini, N.; Pascali, J.; Hamdan, M.; Astner, H. Eur. J. Mass Spectrom. (Chichester, Eng.) 2004, 10 (3), 335-348.
(4) Hatton, M. W. Biochem. J. 1973, 131 (4), 799-807.
(5) Wilson, N. L.; Schulz, B. L.; Karlsson, N. G.; Packer, N. H. J. Proteome Res. 2002, 1 (6), 521-529.
(6) Pinheiro, J.; Bates, D. Mixed-Effects Models in S and S-Plus; Springer-Verlag: New York, 2000.
(7) Raman, B.; Cheung, A.; Marten, M. R. Electrophoresis 2002, 23 (14), 2194-2202.
(8) Molloy, M. P.; Brzezinski, E. E.; Hang, J.; McDowell, M. T.; VanBogelen, R. A. Proteomics 2003, 3 (10), 1912-1919.
(9) Gustafsson, J. S.; Ceasar, R.; Glasbey, C. A.; Blomberg, A.; Rudemo, M. Proteomics 2004, 4 (12), 3791-3799.
(10) Choe, L. H.; Lee, K. H. Electrophoresis 2003, 24 (19-20), 3500-3507.
(11) Asirvatham, V. S.; Watson, B. S.; Sumner, L. W. Proteomics 2002, 2 (8), 960-968.
(12) Nishihara, J. C.; Champion, K. M. Electrophoresis 2002, 23 (14), 2203-2215.
(13) Voss, T.; Haberl, P. Electrophoresis 2000, 21 (16), 3345-3350.
(14) Felley-Bosco, E.; Demalte, I.; Barcelo, S.; Sanchez, J. C.; Hochstrasser, D. F.; Schlegel, W.; Reymond, M. A. Electrophoresis 1999, 20 (18), 3508-3513.
(15) Boguski, M. S.; McIntosh, M. W. Nature 2003, 422 (6928), 233-237.
(16) White, I. R.; Pickford, R.; Wood, J.; Skehel, J. M.; Gangadharan, B.; Cutler, P. Electrophoresis 2004, 25 (17), 3048-3054.
(17) Mackintosh, J. A.; Choi, H. Y.; Bae, S. H.; Veal, D. A.; Bell, P. J.; Ferrari, B. C.; Van Dyk, D. D.; Verrills, N. M.; Paik, Y. K.; Karuso, P. Proteomics 2003, 3 (12), 2273-2288.
(18) Zhan, X.; Desiderio, D. M. Electrophoresis 2003, 24 (11), 1834-1846.
(19) Mahon, P.; Dupree, P. Electrophoresis 2001, 22 (10), 2075-2085.

PR049758Y