Optimal Replication and the Importance of Experimental Design for
Gel-Based Quantitative Proteomics
Sybille M. N. Hunt,* Mervyn R. Thomas,† Lucille T. Sebastian, Susanne K. Pedersen,
Rebecca L. Harcourt, Andrew J. Sloane, and Marc R. Wilkins
Proteome Systems Ltd, Locked Bag 2073, North Ryde, NSW 1670, Australia
Downloaded by BBWS CONSORTIA GERMANY on July 17, 2009
Published on May 7, 2005 on http://pubs.acs.org | doi: 10.1021/pr049758y
Received December 20, 2004
Quantitative proteomic studies, based on two-dimensional gel electrophoresis, are commonly used to
find proteins that are differentially expressed between samples or groups of samples. These proteins
are of interest as potential diagnostic or prognostic biomarkers, or as proteins associated with a trait.
The complexity of proteomic data poses many challenges, so while experiments may reveal proteins
that are differentially expressed, these are often not significant when subjected to rigorous statistical
analysis. However, this can be addressed through appropriate experimental design. A good experimental
design considers the impact of different sources of variation, both analytical and biological, on the
statistical importance of the results. The design should address the number of samples that must be
analyzed and the number of replicate gels per sample, in the context of a particular minimum difference
that one is seeking to achieve. In this study, we explore the ways to improve the quality of protein
expression data from 2-DE gels, and describe an approach for defining the number of samples required
and the number of gels per sample. It has been developed for the simplest of situations, two groups
of samples with variation at two levels: between samples and between gels. This approach will also
be useful as a guide for more complex designs involving more than two groups of samples. We describe
some Internet-accessible tools that can assist in the design of proteomic studies.
Keywords: quantitative proteomics • 2-DE • statistical power analysis • experimental design • analytical variation
Introduction
A key aim of proteomics is the expression analysis of large
numbers of proteins. Technologies employed for this include
two-dimensional polyacrylamide gel electrophoresis combined
with visible or fluorescent stains and image analysis, and mass
spectrometric approaches that use isotopic labeling techniques
such as isotope-coded affinity tag peptide labeling1 or amino
acid coded mass tagging.2 These have been recently reviewed
elsewhere.3
Quantitative two-dimensional electrophoresis (2-DE) is commonly applied to differential display analysis, to find proteins
that are differentially expressed between samples or groups of
samples. If these proteins are shown to change consistently in
a population, they may be associated with or responsible for a
phenotype and are referred to as biomarkers. Biomarkers can
form the basis of diagnostic and prognostic tests and as such
they are of scientific and commercial interest.
In many quantitative proteomic studies based on 2-DE gels,
where researchers are seeking to identify differentially expressed proteins, inadequate attention is often paid to
experimental design. Proteomic data pose special challenges that need consideration, similar to the challenges
faced in the acquisition and analysis of mRNA expression data.
These arise for the following reasons: a very large number of
measurements is usually generated for each sample; analytical
variation is inherent to the protein separation, staining, image
acquisition and processing steps; there is biological variation
of environmental origin; and there is biological variation of
genetic origin, for example in an out-bred population. Furthermore, the generation of proteomic data remains a relatively
involved and multistep process, and as a consequence, experiments are usually modest in the number of samples analyzed.
So while many experiments reveal proteins that are differentially expressed, these may be found to be not significant when
subjected to rigorous statistical analysis.
* To whom correspondence should be addressed. Phone: 61 2 9889 1830.
Fax: 61 2 9889 1805. E-mail: [email protected].
† Emphron Informatics, 6 Geewan Place, Chapel Hill, Queensland 4069,
Australia.
10.1021/pr049758y CCC: $30.25 © 2005 American Chemical Society
Many of the sources of variation simply cannot be easily
controlled in proteomics (e.g., the study of human samples).
However, a good experimental design can evaluate the impact
that different sources of variation can have on the statistical
importance of the results, and help assess the best course of
action.
The simplest experimental design for differential display
exhibits a hierarchical structure (see Figure 1). At the top of
the hierarchy are the groups of samples for comparison, defined
by sample characteristics such as disease state (healthy or
disease population) or treatment applied (drug dosage). The
middle level of the hierarchy assays variation between the
samples within a given group, capturing the major source of
biological variation. This variation will be smallest when dealing
with simple bacterial cultures and greatest in studies
dealing with samples from human subjects. The lowest level
of the hierarchy involves replicate 2-DE gels run from the same
sample, and captures the inherent analytical variation. It is
important to recognize that variation is inherent across the
experiment; however, a good experimental design can address
the number of samples that must be analyzed and the number
of replicate gels per sample, in the context of a particular
minimum expression difference that one is seeking to discover.
Journal of Proteome Research 2005, 4, 809-819 (Vol. 4, No. 3). Published on Web 05/07/2005.
Figure 1. Typical experimental paradigm displaying the hierarchical structure of a 2-group quantitative proteomic experiment.
At the top of the hierarchy are the groups of samples for
comparison of sample characteristics such as disease state. The
middle level of the hierarchy allows for estimation of between-samples (biological) variation within a given group and the
lowest level for estimation of between-replicate 2-DE gels
(analytical) variation.
In this study, we describe an approach for understanding
and controlling variation in gel-based proteomics, and propose
a means of designing an experiment to define the number of
samples required and the number of gels per sample. This
approach has been developed for the simplest of situations (two
groups of samples with variation at two levels: between
samples and between gels); however, it can be extended to
address more complex experimental situations. We also describe some Internet-accessible tools that can assist researchers in experimental design for their quantitative proteomic studies.
Materials and Methods
Materials. Whole blood was collected from patients and
immediately stored on ice, then spun gently to sediment red
blood cells. The supernatant was further spun at 6000 × g to
leave clarified plasma that was then stored at -80 °C until
required. Standard laboratory chemicals were obtained from
Sigma-Aldrich (St. Louis, MO) unless specified otherwise.
Human Plasma Sample Preparation. Two mL aliquots of
plasma (per subject) were quickly thawed at 37 °C and depleted
of the three high-abundance proteins fibrinogen, immunoglobulin G and albumin. Fibrinogen was removed using a venom
cross-linking method,4 immunoglobulin type G (IgG) with
immuno-affinity chromatography using immobilized protein
G sepharose beads (Amersham Biosciences, NSW, Australia),
and human serum albumin by ethanol fractionation.5 The
triple-depleted plasma sample was then pre-fractionated
into narrow range pI fractions, pI 3.0-5.5, 5.5-6.5 and
6.5-11.0 using an IsoelectrIQ2 multi compartment electrolyzer (MCE; Proteome Systems, Sydney, Australia). Throughout
the fractionation, protein concentrations were quantified using
a Coomassie-blue based Bradford protein assay. Only the
pI 5.5-6.5 fractions were used in this study.
Three hundred µg of MCE-fractionated triple depleted human plasma protein was made up to a final volume of 210 µL
in a 7 M urea, 2 M thiourea, 10 mM Tris, 2% CHAPS sample
buffer. The sample was then ultrasonicated for 30 s, and then
reduced (by adding tributylphosphine to a final concentration
of 5 mM and incubating for 1 h at ambient temperature) and
alkylated (by adding iodoacetamide to a final concentration of
15 mM for 1 h at ambient temperature and protected from
light). Before rehydration of immobilized pH gradient (IPG)
strips, samples were ultrasonicated for 2 min and then centrifuged at 21 000 × g for 5 min. The supernatant was collected
and 10 µL of Orange G added as an indicator dye.
Bacterial Sample Preparation. Twenty mg of lyophilized
Escherichia coli bacterial cells (Sigma product EC-1, strain K12)
were resuspended in 10 mL of sample buffer (7 M urea, 2 M
thiourea, 40 mM Tris, 1% C7 BzO). The suspension was
sonicated with the ultrasonic probe (Branson digital sonicator,
model 450) for a total of 1 min (4 × 15 s pulses at 70%
amplitude, chilled on ice between pulses), and then centrifuged at
14 000 × g for 15 min at 15 °C to pellet cell debris. The
supernatant was transferred into a clean tube and reduced and
alkylated as described in the above section for the plasma
protein preparations.
Two-Dimensional Gel Electrophoresis. Dry 11 cm IPG strips
(Amersham Biosciences, NSW, Australia) were rehydrated for
8 h with 210 µL of protein sample. Rehydrated strips were
focused to 100 kVh using either a Protean IEF Cell (Bio-Rad, Hercules,
CA) or an IsoelectrIQ2 (Proteome Systems, Sydney, Australia)
electrophoresis unit. Focused IPG strips were equilibrated for 20 min in 6 M urea, 2% SDS, 0.01% bromophenol
blue in 50 mM Tris-acetate buffer pH 7.0.
Equilibrated IPG strips were placed on top of 6-15% Tris-acetate sodium dodecyl sulfate polyacrylamide precast 10 cm
× 15 cm × 1 mm gels (Proteome Systems, Sydney, Australia).
Electrophoresis was performed at 50 mA per gel for 1.5 h or
until the tracking dye front reached the bottom of the gel.
Proteins were stained using SYPRO Ruby (Molecular Probes,
Eugene, OR) according to the manufacturer’s instructions, then
destained for 4 to 7 h in 10% methanol, 7% glacial acetic acid.
Image Capture and Analysis. Images of gels were acquired
using the AlphaImager 3300 software (Alpha Innotech Corporation, San Leandro, California). Aperture and exposure times
were adjusted so that only the most abundant proteins on the
gels reached saturation (at pixel intensity level as determined
by the software). For the analysis of E. coli, the above procedure
was used for the gels separating 320 µg of protein, and the same
settings were used for all other gels with lesser protein loads.
The gel images were saved as 16-bit tagged image file format
(TIFF) and analyzed using ImagepIQ version 1.0.1, a 2-DE
image analysis software (Proteome Systems, Sydney, Australia).
The images were imported into the ImagepIQ database under
separate experiments according to sample type. Image manipulation was done within ImagepIQ and consisted of inverting the pixels to obtain an adsorption image (dark spots on a
light background), flipping the image into the correct orientation and cropping the image where required.
The spot detection parameters were optimized on one gel
image representative of each experiment. To do this, spot
detection was applied to the image using the default settings
in the first instance. The optimal threshold values for the spot-intensity and spot-area parameters were then determined by
applying real-time filters and visually assessing the result,
the aim being to minimize the detection of artifacts and
maximize the detection of real spots. These optimized settings
were saved with the experiment and applied to all gel images
within that experiment. The region of interest for each gel
image and the settings were ultimately saved with the image.
After spot detection, manual spot editing was done for each
image. Editing consisted of deleting spots at the periphery of
the gel, and removal of obvious specks that escaped the filtering
process. Missed spots were not edited at this stage.
The images in each match set were then matched. In the
case of the E. coli gels, all the images were selected and a single
layer matching process applied. In the case of plasma gels for
data sets 1 and 2, multilayer matches were done. The triplicate
gels were grouped and matching was done for all the groups
simultaneously using the batch process feature. A composite
image (replicate composite) was generated for each group of
triplicate gel images. Spots that were matched in 2 of the 3
replicate images were visually examined for mismatches and
these were edited. Spots that were only found in one of the
three images were deemed to be artifacts and were deleted
from the match and hence from the composite image. The
edited composite images from each sample group were subsequently matched to each other to generate a composite image
representative of each group. Post match editing at this level
was restricted to making sure that the correct spots were
matched together. At the final level of matching, the two group
composite images were matched to generate the final experiment composite image. The match IDs (numbers used as
identifiers for all spots in a match) and normalized spot
volumes columns were exported from the generated match
report as a text file for further manipulation using Microsoft
Excel.
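The replicate-level editing rule described above (spots found in all three replicate images are retained, spots matched in 2 of the 3 are visually examined, and spots found in only one image are treated as artifacts) can be sketched in a few lines of Python. This is an illustration only: the function name `classify_match`, the labels, and the presence table are hypothetical and are not part of ImagepIQ.

```python
def classify_match(present_in, n_replicates=3):
    """Classify a matched spot by the number of replicate gels containing it."""
    if not 1 <= present_in <= n_replicates:
        raise ValueError("present_in must be between 1 and n_replicates")
    if present_in == n_replicates:
        return "keep"        # found in every replicate image
    if present_in >= 2:
        return "review"      # e.g. 2 of 3: examine visually for mismatches
    return "artifact"        # found in a single image only: delete from the match


# Hypothetical presence table: match ID -> replicate gels in which the spot was found
presence = {
    101: {"gel_1", "gel_2", "gel_3"},
    102: {"gel_1", "gel_3"},
    103: {"gel_2"},
}
decisions = {match_id: classify_match(len(gels)) for match_id, gels in presence.items()}
print(decisions)
```

Only the "review" class needs manual attention, which keeps subjective editing to a small, well-defined subset of spots.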
Statistical Analyses
Data Screening. The analyses used in this study assume that
the data come from a normally distributed underlying population. To check the distribution of our data, we plotted frequency
histograms for the nontransformed normalized spot volumes,
and for the data after log transformation. This was done using
the BiostatistIQ tools within the BioinformatIQ software platform (Proteome Systems, Sydney, Australia).
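As a numerical companion to the histogram check described above, the following sketch compares the skewness of a set of volumes before and after log transformation. It is an assumption-laden illustration: the volumes are simulated from a log-normal distribution (they are not data from this study), skewness is used here as a simple stand-in for visual inspection of a histogram, and the function name `skewness` is hypothetical.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated stand-in for normalized spot volumes (NOT data from this study)
volumes = [random.lognormvariate(0.0, 1.0) for _ in range(500)]
log_volumes = [math.log(v) for v in volumes]

raw_skew = skewness(volumes)
log_skew = skewness(log_volumes)
print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after log transform:  {log_skew:.2f}")
```

A strongly positive skewness that falls close to zero after transformation is consistent with data that are approximately normal on the log scale.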
Estimation of Variance Components. The coefficient of
variation (CV%, calculated as the standard deviation of the
normalized spot volumes divided by the mean and expressed as a
percentage) was calculated as a measure of inter- and intra-sample
variation. The correlation coefficient, as R², was also calculated.
CV% and R² values are commonly used for measuring gel-to-gel variation and therefore allow a direct comparison
with other publications.
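The two summary measures defined above can be written out as a minimal sketch. The function names and the example volumes are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which form was used.

```python
import statistics


def cv_percent(volumes):
    """CV% of one spot across replicate gels: standard deviation / mean * 100."""
    return statistics.stdev(volumes) / statistics.fmean(volumes) * 100.0


def r_squared(x, y):
    """Squared Pearson correlation between matched spot volumes on two gels."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical normalized volumes of one spot on three replicate gels
spot_volumes = [0.90, 1.00, 1.10]
print(f"CV% = {cv_percent(spot_volumes):.1f}")

# Hypothetical normalized volumes of four matched spots on two gels
gel_a = [0.5, 1.2, 2.0, 3.1]
gel_b = [0.6, 1.1, 2.2, 2.9]
print(f"R2 = {r_squared(gel_a, gel_b):.3f}")
```

CV% is computed per spot across replicates, whereas R² is computed per pair of gels across spots, so the two measures summarize variation along different axes of the data.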
However, there are more powerful means of estimating
variance. These variance components calculations are automated when using the tools at www.emphron.com. Here we
describe the details of this analysis. Data were analyzed using
a mixed effects linear model.6 The model had a fixed effect term
representing the group of samples (e.g., healthy vs diseased),
and random deviations representing the effects of samples
within groups and gels within samples. Between sample
variation represents biological variation, and between gel
variation represents analytical variation in the experimental
process. Variance components for between samples and between replicate gels variation were estimated using the REML
algorithm, as implemented in the R (Open Source statistical
package) function lme.
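The nested model described above can be illustrated for a single spot with a simplified estimator. Note that this sketch does not reproduce the REML fit performed with lme: it substitutes the classical ANOVA (method-of-moments) estimator for a balanced nested design, which is generally expected to agree with REML for balanced data when the estimates are non-negative. The function name, data layout, and example values are hypothetical; in practice the inputs would be log-transformed normalized spot volumes.

```python
import statistics


def variance_components(data):
    """Between-sample and between-gel variance for ONE spot (balanced design).

    data maps each group to a list of samples; each sample is a list of
    replicate-gel values.
    """
    samples = [gels for group in data.values() for gels in group]
    m = len(samples[0])                     # gels per sample
    n = len(next(iter(data.values())))      # samples per group
    balanced = (all(len(g) == m for g in samples)
                and all(len(group) == n for group in data.values()))
    if not balanced or m < 2 or n < 2:
        raise ValueError("balanced design with >= 2 samples and >= 2 gels required")

    # Mean square between replicate gels within samples (analytical variation)
    ss_gel = sum((y - statistics.fmean(g)) ** 2 for g in samples for y in g)
    ms_gel = ss_gel / (len(samples) * (m - 1))

    # Mean square between samples within groups
    ss_sample = 0.0
    for group in data.values():
        sample_means = [statistics.fmean(g) for g in group]
        group_mean = statistics.fmean(sample_means)
        ss_sample += m * sum((x - group_mean) ** 2 for x in sample_means)
    ms_sample = ss_sample / (len(data) * (n - 1))

    var_gel = ms_gel
    var_sample = max(0.0, (ms_sample - ms_gel) / m)  # truncate negatives at zero
    return var_sample, var_gel


# Hypothetical values: 2 groups x 2 samples x 2 replicate gels
example = {
    "healthy": [[1.0, 3.0], [5.0, 7.0]],
    "diseased": [[2.0, 4.0], [6.0, 8.0]],
}
var_sample, var_gel = variance_components(example)
print(f"between-sample variance: {var_sample}, between-gel variance: {var_gel}")
```

Because the group means are removed before the between-sample sum of squares is formed, a genuine difference between groups does not inflate the estimate of biological variance.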
Power Calculations. To assist in experimental design, and
to understand the number of samples that need to be analyzed in order to confidently discover differentially expressed
proteins, a power analysis was undertaken using specially
developed tools. We have made these available at www.emphron.com.
Note that for power analysis there cannot be
any missing values. Here, we describe the steps used in the
automated analysis.
Most biologists are familiar with the use of the significance
test, and most understand the significance level of a test in
terms of the probability of incorrectly rejecting the null
hypothesis, that is, the probability of falsely deciding that there
is an effect. This probability is often referred to as the type I
error rate. By convention, the type I error rate is often fixed at
5%.
In some studies, the experimenter is likely to make a different
type of error: that of failing to reject the null hypothesis when
a real effect exists. This is referred to as a type II error.
Obviously, we wish to ensure that our experiments have a low
probability of producing each type of error. Conventionally,
rather than working with the type II error rate, we usually
consider the power of a study. The power is the probability of
correctly rejecting the null hypothesis, given that a difference
exists. That is, the power is the probability of not making a
type II error.
Elementary statistics text books generally focus on the
significance level (type I error) rather than on the power (type
II error). This is because we can usually control the type I error
rate by choosing an appropriate critical value for our test
statistic. The power, however, requires us to know rather more
about the system we are studying.
Power is influenced by four factors. Increasing the effect size
(the true difference between means) makes it more likely that
we will reject the null hypothesis and therefore increases power.
That is, it is easier to find big differences than small differences.
Reducing the experimental variability increases the power. That
is, differences are easier to detect when there is little variation.
Increasing the sample size increases the power: we are more
likely to detect effects when we have many observations than
when we have few. Finally, the power is influenced by the
significance level we require. The smaller the type I error rate
we are prepared to tolerate, the smaller our chance of detecting
real effects becomes. That is, decreasing the significance level
decreases the power.
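The four influences on power listed above can be seen directly in a small calculation. The sketch below uses a normal (z-test) approximation for a two-sided comparison of two group means with equal group sizes and a known common standard deviation; these simplifications, the function name `power_two_group`, and the example numbers are assumptions made for illustration, not the method used by the tools referred to in this paper.

```python
from statistics import NormalDist


def power_two_group(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test."""
    z = NormalDist()
    se = sd * (2.0 / n_per_group) ** 0.5        # SE of the difference in means
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)       # critical value for the test
    shift = abs(effect) / se                    # standardized true difference
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


baseline = power_two_group(effect=1.0, sd=1.0, n_per_group=10)
print(f"baseline:              {baseline:.2f}")
print(f"larger effect size:    {power_two_group(1.5, 1.0, 10):.2f}")
print(f"smaller variability:   {power_two_group(1.0, 0.5, 10):.2f}")
print(f"larger sample size:    {power_two_group(1.0, 1.0, 20):.2f}")
print(f"stricter significance: {power_two_group(1.0, 1.0, 10, alpha=0.01):.2f}")
```

When the true effect is zero, the function returns the significance level itself, which is the type I error rate.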
Power was calculated using the tools on www.emphron.com.
These tools use estimates of analytical and biological variability
to determine the standard error of differences between two
treatment means for a given experimental design. The experimental design is defined by the number of samples per group
and the number of gels per sample. These tools then generate
estimates of the minimal detectable difference, that is, the difference
between means that will give a power of 80% for the given
number of samples and gels.
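The calculation outlined above can be sketched as follows. This is a hedged approximation rather than a description of the tools at www.emphron.com: it assumes a balanced two-group design, uses the standard error implied by the two-level nested model, and replaces the t distribution with a normal approximation. The function names and the variance values are hypothetical.

```python
from statistics import NormalDist


def se_difference(var_sample, var_gel, n_samples, n_gels):
    """SE of the difference between two group means.

    Each group mean has variance var_sample / n + var_gel / (n * m), where
    n is the number of samples per group and m the number of gels per sample.
    """
    per_group = var_sample / n_samples + var_gel / (n_samples * n_gels)
    return (2.0 * per_group) ** 0.5


def minimal_detectable_difference(var_sample, var_gel, n_samples, n_gels,
                                  alpha=0.05, power=0.80):
    """Difference between means detectable with the requested power (z approx.)."""
    z = NormalDist()
    multiplier = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    return multiplier * se_difference(var_sample, var_gel, n_samples, n_gels)


# Hypothetical variance components on the log scale
VAR_SAMPLE, VAR_GEL = 0.04, 0.02

# Two designs that each use 24 gels per group
more_samples = minimal_detectable_difference(VAR_SAMPLE, VAR_GEL, n_samples=8, n_gels=3)
more_gels = minimal_detectable_difference(VAR_SAMPLE, VAR_GEL, n_samples=4, n_gels=6)
print(f"8 samples x 3 gels: {more_samples:.3f}")
print(f"4 samples x 6 gels: {more_gels:.3f}")
```

Under these assumptions, replicate gels reduce only the analytical term, so when biological variance dominates, additional samples lower the minimal detectable difference more than additional gels do for the same total number of gels.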
Results
The determination of an appropriate design for a 2-DE based
experiment requires high quality protein expression data from
image analysis. However, there is relatively little attention paid
to the issues that affect the quality of image analysis data.
Accordingly, we wished to carefully evaluate the approaches
used in generating these data prior to their use for experimental
design. Figure 2 shows the issues that are faced in generating
high quality image analysis data and outlines some of the
approaches that can be used to minimize their effects on data
quality. These issues are explored in detail below.
The following steps of the image analysis process need to be
robust: detecting most, if not all, of the spots
arrayed on a 2-DE gel; correctly determining the boundary of
these spots, and hence obtaining an accurate measure of the spot
volumes; generating normalized spot volumes in a gel or group
of gels using a method that corrects for variations in absolute
spot volumes (due to differences in protein loads per gel, in
protein staining regime, and in image capture settings); and
correctly aligning and matching all the images in a group to
accurately determine the corresponding spots across all images.
Figure 2. Flowchart showing the steps involved in an experimental design for a typical quantitative proteomic experiment. This includes
the hierarchical structure of the pilot experiment, and outlines the potential hazards associated with each step as well as the precautions
that can be taken to minimize their effects on the experimental results.
Spot Detection. To check the accuracy of spot detection,
2-DE gel images from http://www.umbc.edu/proteome, that
have been used in other publications to test image analysis
software,7 were analyzed. Spot detection was undertaken as
described in the Materials and Methods section: the spot detection parameters were optimized for the image type, but no manual spot
editing was done. The results of the spot detection were then
visually compared to the same image annotated with the
expected real spots (also available from the above website). A
total of 900 and 1403 spots were detected on images
gel-a and gel-b, respectively. Visual comparison with the expected results
showed that 93.6% and 95.2% of the expected real
spots were detected (true positives), that 6.4% and 4.8% of the
expected real spots were missed (false negatives), and that 13.1%
and 8.8% artifacts were detected. These results compare favorably to the
accuracy of spot detection reported in the literature for other image
analysis software, the best reported value for percentage of
spots missed being 6% for Melanie 3.0 software.7
Spot Quantitation. To evaluate the accuracy of spot quantitation, a set of eleven artificial images from Raman et al.,7
designed to test the accuracy of spot quantitation of
image analysis software, was downloaded (http://www.umbc.edu/proteome).
The expected volume ratios of the center
spot in images (b) through (k), relative to the center spot in
image (a), are 2, 4, 6, 10, 14, 18, 22, 26, 30, and 40, respectively.
Spot detection was carried out on the 11 artificial images
using default spot detection parameters (see Materials and
Methods). The images were then matched to each other to
generate a match report with the appropriate relative intensity
ratios. The observed spot ratios correlated well with the
expected spot ratios, giving a correlation coefficient (R²) value
of 0.99. Although the images analyzed are artificial and do not
completely mimic a set of 2-DE gels, they are a useful indicator
of the complexity of spot quantitation and a valid test of
quantitative image analysis approaches.
Spot Volume Normalization. Slight variations in protein load
per gel, protein staining efficiency and image capture can have
a considerable impact on the raw spot volumes generated by
image analysis. Normalization of raw spot volumes, necessary
to minimize gel to gel variation, is imperative for quantitative
proteomic studies using 2-DE.
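To make the idea concrete, the sketch below shows one common form of normalization, in which each spot volume is expressed as a percentage of the total spot volume on its gel. This choice is an assumption made for illustration: the text does not specify the normalization algorithm implemented in ImagepIQ, and the function name and volumes are hypothetical.

```python
def normalize_to_total(raw_volumes):
    """Express each raw spot volume as a percentage of the gel's total volume."""
    total = sum(raw_volumes)
    if total <= 0:
        raise ValueError("total spot volume must be positive")
    return [100.0 * v / total for v in raw_volumes]


# Hypothetical raw volumes for the same three spots at two protein loads
low_load = [120.0, 300.0, 80.0]
high_load = [2.0 * v for v in low_load]   # doubling the load doubles every raw volume

norm_low = normalize_to_total(low_load)
norm_high = normalize_to_total(high_load)
print(norm_low)
print(norm_high)
```

A uniform change in load, staining, or exposure scales every raw volume by the same factor and therefore cancels out, whereas a change confined to individual proteins is preserved.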
The efficiency of normalization methods in correcting for
analytical variations in raw spot volumes due to uneven protein
loading was tested. Increasing amounts of E. coli protein
extracts (20, 40, 80, 160, or 320 µg) were subjected to 2-DE using
11-cm, pH 3-10 IPGs (see Materials and Methods). A representative image, annotated with the 278 spots that matched
across all gels following automated matching, is shown in
Figure 3. The range of spot volumes for the 278 matched spots
relative to that of all the spots on the representative gel also
shows that there was no bias in the selection of the spots for
statistical analyses (Figure 3). The volumes for the matched
spots, before and after normalization, were graphed using a
box-and-whisker plot (Figure 4, graphs A and B). With an
increase in protein load, there is a steady increase in the raw
spot volumes, but there is little change in the normalized spot
volume. These results show that the method of normalization
employed can globally correct for differences in the amount of
protein loaded onto the gels.
Figure 3. E. coli 2-DE gel image annotated with the 278 matched
spots (out of a total of 500 detected spots) used in the statistical
analyses. The spot volumes for all the 500 spots detected, and
for the 278 matched spots, were sorted in ascending order and
plotted in a line graph. The graphs show that the 278 matched
spots covered a similar range of volumes relative to the 500 spots
detected, illustrating that we have a representative set of protein
spots from the gel.
To investigate the efficiency of normalization in controlling
analytical variation in replicate gels from the same sample, and
in controlling variation from replicate samples run on different
gels, we plotted the spot volumes, before and after normalization, for the 119 spots from data set 2 as described for the E.
coli data in the above paragraph (Figure 4, graphs C and D).
This plot shows that for raw spot volumes, there is random variation in the median and other quartiles across the 18
gels. After normalization, it can be seen that the random
analytical variations in raw spot volumes have been largely
corrected for.
Normalization should correct for differences in raw spot
volumes such that, for any given matched spot across replicate
gels of the same sample, the ratio of spot
volumes is expected to be close to 1. Figure 5 shows X-Y plots of the ratio
of the log transformed normalized spot volume for two sets of
duplicate gels (E. coli_80 and E. coli_40 from the E. coli data
set; and replicate gels 2.1.1 and 2.1.3 from data set 2) plotted
against the log transformed normalized spot volume for one
of the replicate gels (see section: Spot Volume Data Screening,
for more information on logarithmic transformation). The ratios
were close to 1 for all but the faintest spots, where a
departure from 1 was seen.
Image Analysis Data for Statistical Analyses. Two plasma
protein expression data sets were generated for the purpose
of estimating between sample and between gel variance. Data
set 1 consisted of 24 protein 2-DE gels (11-cm pH 4-7 IPGs)
of triple depleted plasma sample preparations from two groups
of subjects (a healthy and a diseased group), 4 samples per
group, and triplicate gels per sample. Data set 2 consisted of
18 gel images generated similarly from two other healthy and
diseased groups of subjects (3 samples per group, triplicate gels
per sample). Image analysis was carried out as described in
the Materials and Methods. Any matching conflicts (for example nonmatching spots) were deliberately not resolved, and
only spots that matched across all the images were used. This
strategy was used because it was imperative, for the purpose
of these statistical analyses, that there were no mismatches in
the data set. It also ensured that the spots used in the analysis
were randomly spread across the images, and that no subjective
editing was applied that might affect the automated spot
quantitation. Plasma data sets 1 and 2 consisted of 63 and 119
matched spots, respectively; these were randomly spread across
the gels, indicating that there was no bias in the selection of
the spots for statistical analyses, as shown in Figure 6 for a
representative gel for data set 2.
Coefficient of Variation and Correlation Coefficient. Most
published reports that study biological and analytical variation
have been based on evaluation of the percent coefficient of
variation (CV%) and correlation coefficient (R²) (see, e.g., ref
8). We believe that these tests are not sufficiently robust, and
so we explored alternative approaches based on variance
analyses and power calculations, below. However, to allow for
a direct comparison of our study with previously published
work, we calculated CV% and R² from our plasma data sets 1
and 2. Table 1 shows the average CV% ± SD for normalized
spot volumes for the replicate images from plasma data sets 1
and 2. These values may appear high, but they compare very
favorably with similar analyses of 2-DE image analysis data
(see Discussion). A better illustration of these data
is, however, to graph the cumulative percent of spots that fall
below given CV% values for the 8 sets of replicate gels from
data set 1, and the 6 sets of triplicate gels from data set 2 (Figure
7). This reveals that there can be notable differences in the CV%
from one sample to the next, and can aid the identification of
outlying samples in a group.
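The cumulative curve described above is straightforward to compute. In the sketch below, the function name, the thresholds, and the per-spot CV% values are hypothetical and serve only to illustrate the calculation.

```python
def cumulative_percent_below(cv_values, thresholds):
    """Percentage of spots whose CV% falls below each threshold."""
    n = len(cv_values)
    return [100.0 * sum(cv < t for cv in cv_values) / n for t in thresholds]


# Hypothetical per-spot CV% values for one set of replicate gels
spot_cvs = [5.0, 12.0, 18.0, 25.0, 40.0]
thresholds = [10, 20, 30, 50]

curve = cumulative_percent_below(spot_cvs, thresholds)
for t, pct in zip(thresholds, curve):
    print(f"CV% < {t}: {pct:.0f}% of spots")
```

Plotting one such curve per replicate set makes a sample whose curve rises more slowly than the others stand out as a candidate outlier.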
One simple, but widely used, means of evaluating within- and
between-sample variance is to establish the correlation coefficient of nontransformed data. Accordingly, we used automatic
image analysis to generate the correlation coefficient values
(R²) of nontransformed normalized spot volumes, for every
pairwise combination of gel images from plasma data sets 1
and 2. The means and standard deviations of the R² values for
each replicate set were calculated for the healthy and diseased
groups of data sets 1 and 2 (Table 2). The average R² (± SD)
values for within-sample comparisons, which indicate analytical variation, ranged from 0.87 ± 0.09 to 0.99 ± 0.001, and
from 0.94 ± 0.03 to 0.99 ± 0.002 for the healthy and diseased
groups of data set 1, respectively. The equivalent values for data
set 2 were 0.95 ± 0.01 to 0.98 ± 0.01, and 0.93 ± 0.03 to 0.98 ±
0.01. The values for between-sample comparisons, which
indicate biological variation, ranged from 0.73 ± 0.03 to 0.88
± 0.04, and from 0.93 ± 0.04 to 0.97 ± 0.01 for the healthy and
diseased groups of data set 1, respectively. The equivalent
values for data set 2 were 0.70 ± 0.06 to 0.90 ± 0.01, and 0.43
± 0.02 to 0.66 ± 0.04. While this approach is not as powerful
as other methods of analyzing variance (see below), it has
revealed that our between-sample variation (biological variation) is clearly greater than our within-sample variation
(analytical variation).
Figure 4. Efficiency of normalization in correcting for differences in spot volumes due to varying protein loads (A and B) or analytical
variations (C and D). The log_e transformed spot volumes (graph A), and log_e transformed normalized spot volumes (graph B), for the
278 analyzed spots across the five E. coli gels with increasing protein load per gel, are plotted in a box-and-whisker format against the
five gels. The log_e transformed spot volumes (graph C), and log_e transformed normalized spot volumes (graph D), for the 119 analyzed
spots across the 18 gels from data set 2 are plotted in the same format against the 18 gels. The graphs show the effect of normalization
on the median and the range of spot volumes across the gels.
Figure 5. Effectiveness of normalization of spot volumes in
correcting for gel-to-gel variation between sets of replicate gels
from E. coli (A) and human plasma (B, data set 2). The ratios of
the log_e transformed normalized spot volumes are plotted, using
an X-Y plot format, against the log_e transformed normalized spot
volumes.
Figure 6. Representative 2-DE gel image of human plasma from
data set 2 annotated with the 119 spots used in the statistical
analyses. The spot volumes for all the 367 spots detected, and
for the 119 matched spots, were sorted in ascending order and
plotted in a line graph. The graphs show that the 119 matched
spots covered a similar range of volumes relative to the 367 spots
detected, illustrating that we have a representative set of protein
spots from the gel.
Table 1. Average CV% for Normalized Spot Volumes Across Replicate Gels^a

data set    replicate set    average CV% ± SD
1           1.1              16.1 ± 8.1
1           1.2              19.8 ± 13.0
1           1.3              13.0 ± 5.9
1           1.4              14.8 ± 16.4
1           1.5              26.3 ± 17.3
1           1.6              31.6 ± 22.9
1           1.7              11.6 ± 9.8
1           1.8              7.4 ± 4.5
2           2.1              17.20 ± 12.16
2           2.2              21.24 ± 20.13
2           2.3              19.84 ± 15.86
2           2.4              16.02 ± 12.43
2           2.5              22.41 ± 22.28
2           2.6              19.54 ± 15.23

^a For each of the matched spots, CV% of normalized spot volumes was
calculated across replicate 2-DE gels from data set 1 (replicate sets 1.1 to
1.8) and data set 2 (replicate sets 2.1 to 2.6). For each replicate set of gels,
the average and SD values were then calculated and tabulated.
Figure 7. Percentage of spots that fall below a given CV% for
normalized spot volume across replicate 2-DE gels. The CV% for
normalized spot volumes across each of the 8 sets of triplicate
gels from plasma data set 1 (replicate sets 1.1 to 1.8) was
calculated and the cumulative percent of matched spots that fall
below each of the given CV% values were plotted against the
CV%. A shows the graphs for plasma data set 1, replicate sets
1.1 to 1.8; B shows the graphs that were similarly generated for
the 6 sets of triplicate gels from plasma data set 2 (replicate sets
2.1 to 2.6).
Spot Volume Data Screening. On the basis of previous work
in a separate study (data not shown), and on published work,9
we understood that spot volume data from image analysis of
2-DE gels do not fit a normal distribution, but require
logarithmic transformation. To confirm that this was the case
with the data generated here, the normalized spot volumes,
before and after log transformation, were plotted using a
frequency histogram. Figure 8 shows the plots for a representative image from data set 1. These results showed that our log-
Figure 8. Comparison of frequency distribution of nontransformed and transformed normalized spot volumes. The median
normalized spot volumes and loge transformed median normalized spot volumes was calculated for each replicate set of gels
from data sets 1 and 2. The median values were then plotted in
a frequency histogram, with the kernel density estimate of the
distribution superimposed. The graphs are shown for a representative replicate set (replicate set 1.1) from data set 1. Note
that the non transformed data does not fit a normal distribution,
and how the loge transformation yields a data set that has
approximately normal distribution. Similar effects were obtained
with all other replicate sets of gels (data not shown).
transformed data are approximately normally distributed,
compared to the nontransformed data. Accordingly, only
transformed data was used for further analyses.
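The effect of the transformation can be illustrated with a short sketch. It is not the procedure used in this study, which relied on visual inspection of frequency histograms with a kernel density estimate. As a stand-in, the example computes sample skewness before and after log_e transformation. The function name `skewness` and the synthetic lognormal values are assumptions made only for illustration.

```python
import math
from statistics import NormalDist, mean

def skewness(values):
    """Sample skewness (third standardized moment); near 0 for symmetric data."""
    mu = mean(values)
    m2 = mean((v - mu) ** 2 for v in values)
    m3 = mean((v - mu) ** 3 for v in values)
    return m3 / m2 ** 1.5

# Synthetic, illustrative "spot volumes": evenly spaced quantiles of a
# lognormal distribution. These are NOT data from the study.
n = 200
z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
volumes = [math.exp(0.8 * q) for q in z]

raw_skew = skewness(volumes)
log_skew = skewness([math.log(v) for v in volumes])
print(f"skewness of raw volumes:   {raw_skew:.2f}")
print(f"skewness of log_e volumes: {log_skew:.2f}")
```

A strongly positive skewness for the raw values and a value near zero after transformation is the numerical counterpart of the histogram shapes described for Figure 8.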
Estimation of Variance Components. The normalized spot
volumes from plasma data sets 1 and 2 were analyzed using
statistical applications that we have made available under the
Tools section at www.emphron.com. A table of between-samples and within-samples variance was generated, which
was then viewed in a box-and-whisker plot (Figure 9). The
median between-samples variance components for data sets
1 and 2 were calculated to be 0.22 and 0.29, respectively, and
the median within-samples (between replicate gels) variance
components were 0.05 and 0.04, respectively. When analyzed
on a spot by spot basis, the within-samples (analytical) variance
was smaller than the between-samples (biological) variance in
92% and 87% of the cases for data sets 1 and 2, respectively.
The variance values were then used in the power calculations
to determine exactly how many gels and samples one should
aim for in any given experimental design.
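To make the two variance components concrete, the following is a minimal sketch for a single spot. It assumes a balanced design within one group and uses the classical one-way random-effects ANOVA (method of moments) estimator on log_e transformed volumes. This is a swapped-in textbook technique, not the algorithm behind the web tool, and the function name and example values are hypothetical.

```python
import math
from statistics import mean

def variance_components(samples):
    """Estimate (between-samples, within-samples) variance for one spot.

    samples: list of lists; each inner list holds the normalized spot
    volumes measured on the replicate gels of one sample. The design must
    be balanced (same number of gels per sample). Volumes are log_e
    transformed before the analysis.
    """
    n_samples = len(samples)
    n_gels = len(samples[0])
    if n_samples < 2 or n_gels < 2 or any(len(s) != n_gels for s in samples):
        raise ValueError("need >= 2 samples, each with the same number (>= 2) of gels")

    logs = [[math.log(v) for v in s] for s in samples]
    sample_means = [mean(s) for s in logs]
    grand_mean = mean(sample_means)

    # Mean squares of the one-way ANOVA table.
    ms_between = n_gels * sum((m - grand_mean) ** 2 for m in sample_means) / (n_samples - 1)
    ms_within = sum(
        (y - m) ** 2 for s, m in zip(logs, sample_means) for y in s
    ) / (n_samples * (n_gels - 1))

    var_within = ms_within
    # Method-of-moments estimate; truncated at zero if it comes out negative.
    var_between = max(0.0, (ms_between - ms_within) / n_gels)
    return var_between, var_within

# Hypothetical volumes for one spot: 3 samples x 3 replicate gels.
example = [
    [math.exp(0.9), math.exp(1.0), math.exp(1.1)],
    [math.exp(1.9), math.exp(2.0), math.exp(2.1)],
    [math.exp(2.9), math.exp(3.0), math.exp(3.1)],
]
vb, vw = variance_components(example)
print(f"between-samples variance: {vb:.4f}, within-samples variance: {vw:.4f}")
```

For a two-group study, the between-samples component would be estimated within each group and pooled, so that the group difference itself is not counted as biological variation.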
Power Calculations. The variance components for between-samples and within-samples can be used in power calculations,
to predict the best experimental design for use in a two-population comparison study. Statistical power is defined as
the probability of correctly identifying a difference between the
groups. The power is determined by the sample size (number
of samples and number of gels), the variability (biological and
analytical variation), the significance level of the test, and the
effect size (being the size of the difference you are looking to
identify). The significance level is the probability of incorrectly
rejecting the null hypothesis, that is, the probability of making
a type I statistical error. A type I error arises when we decide
that there is a group difference, when, in truth, there is not. A
type II error is made when we incorrectly accept the null
hypothesis. That is, a type II error arises when we decide that
there is no group difference, when, in truth, there is. The effect
size is defined by percentage increase. That is, an effect size of
100% represents a doubling of spot volume between groups.
An effect size of 50% represents a between group ratio of 1.5.
The minimum detectable difference is the smallest effect size that
can be detected with the required power at a specified significance level,
given the variance components and a particular number of samples and
gels. It can be read from the plots of effect size %
against number of samples for a given number of gels per
sample.
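The relationship between these quantities can be sketched for a single spot as follows. The sketch assumes that the variance components are on the log_e scale, that both groups have the same number of samples, and that a two-sided test with a normal approximation is adequate. It is not the calculation performed by the web tool, and it will not reproduce features of the published power curves such as their asymptote. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def min_detectable_effect_pct(var_between, var_within, n_samples, n_gels,
                              alpha=0.05, power=0.8):
    """Approximate minimum detectable effect size (% increase) for one spot.

    n_samples is the number of samples per group and n_gels the number of
    replicate gels per sample. Averaging the gels of a sample leaves a
    variance of var_between + var_within / n_gels for each sample mean.
    """
    var_sample_mean = var_between + var_within / n_gels
    se_difference = math.sqrt(2.0 * var_sample_mean / n_samples)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    log_difference = z * se_difference
    # Convert a difference on the log_e scale to a percentage increase.
    return (math.exp(log_difference) - 1.0) * 100.0

# Median variance components reported for data set 1: 0.22 and 0.05.
for n_gels in (1, 2, 3, 4):
    row = [min_detectable_effect_pct(0.22, 0.05, n, n_gels) for n in (4, 9, 20)]
    print(n_gels, [f"{e:.0f}%" for e in row])
```

Because the within-samples term is divided by the number of gels while the whole expression is divided by the number of samples, adding samples reduces the detectable difference far more than adding replicate gels whenever the between-samples variance dominates.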
Figure 9. Between-samples and within-sample variances in
normalized spot volumes. The normalized spot volumes from
plasma data sets 1 and 2 were analyzed using statistical applications that we have made available under the Tools section at
www.emphron.com. The tool: Minimum detectable differences
spot data, was selected, the data file was uploaded, and the
following values were entered; Minimum number samples per
group: 4, Maximum number samples per group: 20, Minimum
number gels per sample: 1 and Maximum number gels per
sample: 4. The default values of 0.05 and 0.8 were used for
Required significance level and Required power. The variance
values generated by the software were copied into an Excel
spreadsheet and graphed using box-and-whisker plots. Graphs
A and B show the variance for between-samples and within-sample in data sets 1 and 2, respectively. Y-axes are of different
scale. Note that the within-sample variation is smaller than the
between-samples variation in both cases.

To investigate what number of samples and analytical
replicates would need to be analyzed in a differential display
experiment where different levels of protein expression are
sought, we analyzed plasma data sets 1 and 2 using the tool
“Minimum Detectable Difference-Spot Data” from the web site
www.emphron.com. The analysis generated a table of effect
size % for 4 to 20 samples per group and 1 to 4 gels per sample,
based on a given power of 80% and significance level of 5%
(data not shown). We plotted the effect size % against number
of samples for 1 to 4 gels per sample (Figure 10).

Figure 10 shows curves whereby the extent of the difference
that can be significantly established between two sample types
is influenced predominantly by the number of samples that
are analyzed. For example, there will be approximately 9
samples (in data set 1) and 11 samples (in data set 2) required
from control and experimental groups to significantly detect a
2-fold difference between the populations. However, more
substantial differences (e.g., a 3-fold difference) can be detected
significantly with smaller numbers of samples (in data set 1,
with approximately 5 samples from control and experimental
groups). The top graph in Figure 10 also makes clear that there
is an asymptote toward ∼50% effect size, illustrating that even
with a very large number of samples analyzed, there would be
a minimal detectable difference of approximately 1.5-fold.
Similar trends were observed in data set 2.

A further trend evident in both graphs (Figure 10) is that
while an increase in the number of samples has a marked
impact on the detectable differences observed, an increase in
the number of replicates done per sample does not. This is
not surprising, since the between-samples variance is so much
larger than the within-samples variance, and serves to illustrate
that duplicate or triplicate 2-DE gel runs per sample will be
appropriate in the data presented here. An added advantage
of doing replicate gels per sample is that it facilitates image
analysis; artifacts detected as spots are usually unique to a
replicate gel and can therefore be filtered out on that basis.

In examining the graphs for plasma data sets 1 and 2, it can
also be seen that the minimal detectable differences are
dissimilar. Data set 2 has a greater variance for between sample
effects, and a correspondingly larger minimum detectable effect
size with an asymptote toward 65%. This shows that the
minimum detectable effect size is sensitive to features of the
experimental data that vary from time to time. Effective use of
sample size calculations will depend on the availability of
suitable pilot data for the system of interest.
Experimental Design
In the above sections, we have described approaches and
tools required to formulate a good experimental design for
quantitative 2-DE based proteomics. The main aim of the
design is to determine how many samples and gels should be
analyzed in a given study to ensure that any proteins identified
as differentially displayed will hold true in a subsequent larger
study such as a clinical trial. Figure 2 shows a schematic
description of the steps to be followed to ensure good experimental design, and highlights some hazards and precautions
to be taken at each step.
The overall strategy is to do a pilot study consisting of
running triplicate 2-DE gels from each of 3 samples from each
of two groups (for example a healthy and a diseased group);
perform image analysis and normalize the data, and then do
statistical variance analysis of the data for spots that match
across all the gels. Transformation of the normalized spot
volumes is done automatically by the analysis tool. The
between and within sample variance data can then be used to
run power analysis which allows you to determine how many
gels and samples you will need to analyze in a total experiment
to confidently detect differences of a certain level between the
two groups.
Discussion
We have presented an approach for assaying the quality of
image analysis and a set of robust statistical approaches to
measure variability in gel-based proteomics. We have also
described a set of web-based tools that can be used for
experimental design based on these measurements. We have
shown how data from a pilot study can then be used to evaluate
how many samples should be analyzed in a larger study and
how many replicate gels need to be run to be confident that
any differences detected in subsequent experiments will be
significant. We believe these approaches, while straightforward,
should be applied in proteomic experiments to ensure that the
results reported reflect discoveries relevant to a biological
system, and not just analytical variation. For example, a 2-fold
up- or down-regulation of protein expression between two
groups may be of little importance unless appropriate numbers
of samples have been analyzed.
Table 2. Correlation Coefficient for Normalized Spot Volumes between Replicate Gels
Correlation coefficient R² values (average ± SD) were calculated for normalized spot volumes (not transformed) for all combinations of replicate sets of
images for the healthy and diseased groups from data set 1 (A) and data set 2 (B).

Our gel-to-gel variability results (analytical variation) were
very favorable when compared with previous reports. If we
consider CV%, we have shown that an average of 50%, 83%,
and 95% of spots matched across triplicate gels with CV%
values of <15%, <30%, and <50%, respectively. Our CV% values
ranged from 7.4% ± 4.5 to 31.6% ± 22.9, with an average of
18%, and from 16.0% ± 12.4 to 22.4% ± 22.2, with an average
of 19%, for data sets 1 and 2, respectively. Others have reported
CV% values ranging from 18.7% to 26.7%, with an average of
22.6% ± 2.9, and 35 to 38% of spots with CV% < 15%.8 Another
report, comparing the effect of automation of image analysis
on gel-to-gel variability, showed that 76%, 54% and 48% of
spots that matched across 4 gel images had CV% < 30% for
manual, semi-automated and fully automated image analysis
strategies.10 Asirvatham et al.11 reported more favorable results;
they analyzed 50 selected spots using a user-guided approach
to image analysis and reported an average CV% of 16.2, with
52% of spots having CV% values of <15%. However, when they
used a fully automated image analysis strategy, the average
CV% rose to 39.4%. Nishihara et al.12 reported favorable results,
but they analyzed only 20 well defined, manually selected spots.
Using three different image analysis software packages, Z3, Progenesis
and PDQuest respectively, with varying degrees of user input,
they reported average CV% of 15.2% ± 7.6, 18.1% ± 8.0 and
17.0% ± 7.6, with 35%, 40% and 45% of spots having CV% of
<15%. Given that some of the above reports11,8 used homogeneous instead of gradient gels, and the former have been shown
to have less variability,13 and others took care to minimize
variability by focusing all the IPGs simultaneously,11 the values
we report here suggest that we had very good gel-to-gel
variability data. We believe that the care taken in image
acquisition, and our approaches for spot detection, quantitation
and normalization, have contributed substantially to this.
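The CV% summaries compared above can be computed along the following lines. This is a minimal sketch that assumes CV% is the sample standard deviation divided by the mean, expressed as a percentage; the function names and the example volumes are hypothetical and are not data from this study.

```python
from statistics import mean, stdev

def cv_percent(volumes):
    """CV% of one spot across replicate gels: 100 * sample SD / mean."""
    return 100.0 * stdev(volumes) / mean(volumes)

def percent_below(cvs, threshold):
    """Percentage of spots whose CV% is strictly below the threshold."""
    return 100.0 * sum(cv < threshold for cv in cvs) / len(cvs)

# Hypothetical normalized volumes for four matched spots on triplicate gels.
replicate_set = [
    [100, 110, 90],
    [10, 12, 11],
    [200, 250, 150],
    [60, 100, 140],
]
cvs = [cv_percent(spot) for spot in replicate_set]
print(f"average CV% = {mean(cvs):.1f} ± {stdev(cvs):.1f}")
for threshold in (15, 30, 50):
    print(f"{percent_below(cvs, threshold):.0f}% of spots have CV% < {threshold}")
```

The first printed line corresponds to one row of Table 1, and the thresholded percentages correspond to points on the cumulative curves of Figure 7.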
We reported R² values of between 0.87 and 0.99 for gel-to-gel variability, with average values of 0.97 ± 0.04 and 0.96 ±
0.02 respectively for the plasma data sets 1 and 2. These R²
values compare favorably with those reported elsewhere.
Molloy et al.8 reported R² values of between 0.83 and 0.91, with
an average of 0.86 for replicate gels from 5 different sample
types. We should point out that we used a midi gel system as
opposed to most of the reports being based on large gel
formats. To determine whether this contributed to the observed
differences in gel-to-gel variability would require side-by-side
experiments on the midi gel and the large format gel systems
using the same sample preparations and image analysis strategies. Felley-Bosco et al.14 reported a comparison between small
(6 × 7 cm) and large (16 × 18 cm) gels. Although they focused
on qualitative rather than quantitative differences, they reported slightly better reproducibility with the large gel format,
with R² values for spot volumes from triplicate gels comparisons
of 0.87 ± 0.002 and 0.78 ± 0.07 for large and small gel format,
respectively.
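For readers who wish to compute a comparable statistic, the sketch below calculates R² as the squared Pearson correlation between the normalized volumes of matched spots on two gels, for every pair of gels in a replicate set. Whether this matches the exact procedure used by each cited report is an assumption; the function names and volumes are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev

def r_squared(x, y):
    """Squared Pearson correlation between two lists of matched spot volumes."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def pairwise_r_squared(gels):
    """R² for every pairwise combination of gels in a replicate set."""
    return [r_squared(g1, g2) for g1, g2 in combinations(gels, 2)]

# Hypothetical normalized volumes of five matched spots on triplicate gels.
gels = [
    [120.0, 45.0, 300.0, 80.0, 15.0],
    [110.0, 50.0, 320.0, 75.0, 18.0],
    [130.0, 40.0, 290.0, 90.0, 12.0],
]
values = pairwise_r_squared(gels)
print(f"R² = {mean(values):.3f} ± {stdev(values):.3f}")
```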
It is widely accepted, and supported by our data, that
multiple samples must be analyzed to ensure statistical significance in quantitative proteomics. The sources of gel-to-gel
and sample-to-sample variability in 2-DE based quantitative
proteomics cannot be removed completely, but there are ways
of minimizing these, as discussed below.
The biological (sample-to-sample) variability will be dependent on the sample type. Previous reports show R² and CV%
values of 0.75 and 31.2% for bacterial cells, 0.78 and 26.0% for
mammalian tissue culture, and 0.32 and 46.6% for mammalian
primary cell cultures.8 Biological variation arises from two
main sources; population heterogeneity based on genetic and
environmental differences and sample heterogeneity based on
the way the samples are collected.15 Care should be taken to
minimize this source of heterogeneity for any given study.
Variables, other than those specific to the study, should be
similarly represented in all sample groups; all groups should
be sampled in a similar fashion with respect to time and place;
and any patients should be matched with respect to, for
example, age and gender.
Some of the sources of analytical variation, on the other
hand, can be more readily controlled than those of biological
variability. The main sources are discussed below, together with
approaches as to how these can be minimized.
Variations in the protein load per gel, in the stain and
staining regime used to visualize the proteins, and in settings
used during image acquisition, can all affect the relative spots
volumes in the gel images being compared. We have shown
that normalization can correct for some of these differences,
but the effectiveness of this correction will be dependent on
the image analysis software used. However, it is important to
avoid over or under loading the gels since this can affect the
way the spot boundaries are determined, and hence introduce
variability in spot volumes that may not be optimally corrected
for by the normalization process.
Figure 10. Power curves for data sets 1 and 2. The analysis done
using the Tools section at www.emphron.com (described in
Figure 9 legend) generated values for effect size % for 4 to 20
samples per group and 1 to 4 gels per sample based on a given
power of 80% and significance level of 5%. The effect size %
values were plotted against number of samples for 1 to 4 gels
per sample for data set 1 (graph A) and data set 2 (graph B).
These plots can be used to predict the number of samples that
will need to be analyzed to confidently find a certain difference
between two populations.

In our hands, with a bacterial whole cell lysate, normalization
optimally corrected for differences between loads of 40, 80, and
160 µg protein per IPG. However, normalization was not as
effective with higher or lower protein loads (20 µg and 320 µg).
Too low a loading will mean that faint spots are near their
threshold of detection and hence more prone to errors in
quantitation, while high protein loads can mean that spots are
prone to overlapping on the gel or saturation during image
capture, which can affect the detection of spot boundaries and
hence spot quantitation.

Direct proportionality between spot volumes and protein
quantity is imperative for quantitative 2-DE gel analysis. To
ensure high-quality data is generated, it is therefore best to use
protein stains that have an equilibrium end point, a broad
linear dynamic range, and whose staining intensity is less
dependent upon individual protein amino acid composition
and post-translational modifications. We used SYPRO Ruby to
stain the proteins in our gels as it has all the above properties.16
An equally suitable stain, now referred to as Deep Purple, could
also be used.17

Other factors can also affect the variability of 2-DE gel-based
proteomic work. Electrophoresis systems themselves have been
shown to introduce variability, whereby batch-to-batch variability was higher than within the same batch.11,18 However, in
these cases it is hard to determine if the source of variability is
due to the polyacrylamide gels or the electrophoresis system
in use. Sample preparation has also been shown to introduce
variability; greater variability is seen with multiple-step preparations of a single sample than with a single-step preparation
of the same sample.10 To assist in the control of these issues,
a randomization strategy can be applied to sample preparation,
2-DE gel runs, image acquisition and image analysis. This will
in essence ensure that these sources of variations do not bias
the data in any way. Analysis in many cases can also be blinded,
to reduce experimenter bias.
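A randomization strategy for sample preparation, gel runs, image acquisition and image analysis can be as simple as shuffling the processing order of all sample and replicate combinations. The sketch below shows one way to do this; the function name, sample labels and seed are hypothetical, and it is not a description of the procedure used in this study.

```python
import random

def randomized_run_order(sample_ids, gels_per_sample, seed=None):
    """Return every (sample, replicate) pair in a random processing order.

    A fixed seed makes the order reproducible so that it can be recorded
    alongside the experiment.
    """
    runs = [(s, r) for s in sample_ids for r in range(1, gels_per_sample + 1)]
    rng = random.Random(seed)
    rng.shuffle(runs)
    return runs

# Hypothetical pilot study: 3 healthy and 3 diseased samples, triplicate gels.
samples = ["H1", "H2", "H3", "D1", "D2", "D3"]
for position, (sample, replicate) in enumerate(randomized_run_order(samples, 3, seed=2005), 1):
    print(f"run {position:2d}: sample {sample}, gel replicate {replicate}")
```

Drawing a separate order for each stage (preparation, electrophoresis, scanning, analysis) avoids the same samples always being processed first.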
The importance of image acquisition in quantitative 2-DE
proteomics is often underestimated, and yet there are several
precautions that can be taken to minimize variability at this
level. It is known, for example, that different scans of the same
gel can give different results in terms of the number and size
of spots that are detected.19 This is mainly because the image
analysis software will determine the spot boundaries differently due to subtle differences in the background pixel values
across the same regions of the scans. This source of error can
be minimized by acquiring images with optimal resolution and
taking precautions to avoid the need to rotate the image, except
for 90° or 180° rotations, as this can affect the pixels in the
image and hence affect spot boundaries. We found that, with
our image analysis software, a resolution of between 250 and
300 dpi is optimal. A lower resolution can compromise spot
segmentation yet a higher resolution tends to increase the
detection of artifacts without any significant improvement in
spot segmentation (data not shown). This will, of course, be
dependent on the gel format used, and we recommend the acquisition of gel images at different resolutions, followed by spot
detection, matching and variance analysis to determine an
optimal strategy. To ensure high quality data for lower abundance proteins, a good rule of thumb is to aim for an image
resolution that equates to a minimum of 9 pixels per spot.
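The rule of thumb can be turned into a quick check of scan resolution. The sketch below assumes a circular spot of known diameter and square pixels; the function names and the 0.5 mm example diameter are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

MM_PER_INCH = 25.4

def pixels_per_spot(spot_diameter_mm, dpi):
    """Approximate number of pixels covered by a circular spot at a given dpi."""
    pixel_side_mm = MM_PER_INCH / dpi
    spot_area_mm2 = math.pi * (spot_diameter_mm / 2.0) ** 2
    return spot_area_mm2 / pixel_side_mm ** 2

def min_dpi(spot_diameter_mm, min_pixels=9):
    """Lowest resolution at which the spot covers at least min_pixels pixels."""
    return MM_PER_INCH * math.sqrt(min_pixels / math.pi) / (spot_diameter_mm / 2.0)

# Hypothetical faint spot of 0.5 mm diameter.
for dpi in (150, 250, 300):
    print(f"{dpi} dpi: {pixels_per_spot(0.5, dpi):.1f} pixels per spot")
print(f"minimum resolution for 9 pixels: {min_dpi(0.5):.0f} dpi")
```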
The image analysis software and strategy used can also
contribute to gel-to-gel variation. It has been shown that
accuracy can be compromised by speed in image analysis. Data
produced using a manual, semi-automated, or automated
image analysis strategy showed that for spots matching across
4 out of 4 gels, 76%, 54%, and 48% of the spots respectively
had CV% of less than 30%.10 These levels of automation refer
mainly to editing of gel-to-gel matches (which can dramatically
improve the results) as opposed to editing of the spot boundaries (which is generally difficult to do accurately and
consistently by hand). Manual editing of spot boundaries can
dramatically increase quantitative errors.19 Hence, a good
recommendation is that spot detection should be automatic
and gel-to-gel matching should be manually checked.
Conclusion
We recommend that a sound strategy be adopted when
designing a gel based quantitative proteomic study. That is, to
carry out a pilot study with 3 samples per group and 3 gels per
sample, to carefully undertake image capture, spot detection
and matching, normalization and data transformation, and
then use the tools we are providing on the Internet to perform
the power analysis and determine the minimum number of
samples and gels per sample that should be run in the
subsequent study. This approach will assist in discovering
differentially displayed proteins that reflect true protein expression
changes between two or more populations, and so assist
in the identification of biomarkers.
References
(1) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.;
Aebersold, R. Nat. Biotechnol. 1999, 17 (10), 994-999.
(2) Harris, M. N.; Ozpolat, B.; Abdi, F.; Gu, S.; Legler, A.; Mawuenyega,
K. G.; Tirado-Gomez, M.; Lopez-Berestein, G.; Chen, X. Blood
2004, 104 (5), 1314-1323.
(3) Righetti, P. G.; Campostrini, N.; Pascali, J.; Hamdan, M.; Astner,
H. Eur J Mass Spectrom. (Chichester, Eng) 2004, 10 (3), 335-348.
(4) Hatton, M. W. Biochem. J. 1973, 131 (4), 799-807.
(5) Wilson, N. L.; Schulz, B. L.; Karlsson, N. G.; Packer, N. H. J.
Proteome Res. 2002, 1 (6), 521-529.
(6) Pinheiro, J.; Bates, D. Mixed-Effects Models in S and S-Plus;
Springer-Verlag: New York, 2000.
(7) Raman, B.; Cheung, A.; Marten, M. R. Electrophoresis 2002, 23
(14), 2194-2202.
(8) Molloy, M. P.; Brzezinski, E. E.; Hang, J.; McDowell, M. T.;
VanBogelen, R. A. Proteomics 2003, 3 (10), 1912-1919.
(9) Gustafsson, J. S.; Ceasar, R.; Glasbey, C. A.; Blomberg, A.; Rudemo,
M. Proteomics 2004, 4 (12), 3791-3799.
(10) Choe, L. H.; Lee, K. H. Electrophoresis 2003, 24 (19-20), 3500-3507.
(11) Asirvatham, V. S.; Watson, B. S.; Sumner, L. W. Proteomics 2002,
2 (8), 960-968.
(12) Nishihara, J. C.; Champion, K. M. Electrophoresis 2002, 23 (14),
2203-2215.
(13) Voss, T.; Haberl, P. Electrophoresis 2000, 21 (16), 3345-3350.
(14) Felley-Bosco, E.; Demalte, I.; Barcelo, S.; Sanchez, J. C.; Hochstrasser, D. F.; Schlegel, W.; Reymond, M. A. Electrophoresis 1999,
20 (18), 3508-3513.
(15) Boguski, M. S.; McIntosh, M. W. Nature 2003, 422 (6928), 233-237.
(16) White, I. R.; Pickford, R.; Wood, J.; Skehel, J. M.; Gangadharan,
B.; Cutler, P. Electrophoresis 2004, 25 (17), 3048-3054.
(17) Mackintosh, J. A.; Choi, H. Y.; Bae, S. H.; Veal, D. A.; Bell, P. J.;
Ferrari, B. C.; Van Dyk, D. D.; Verrills, N. M.; Paik, Y. K.; Karuso,
P. Proteomics 2003, 3 (12), 2273-2288.
(18) Zhan, X.; Desiderio, D. M. Electrophoresis 2003, 24 (11), 1834-1846.
(19) Mahon, P.; Dupree, P. Electrophoresis 2001, 22 (10), 2075-2085.