BIOINFORMATICS
ORIGINAL PAPER
Vol. 25 no. 8 2009, pages 1004–1011
doi:10.1093/bioinformatics/btn551
Gene expression
MS-specific noise model reveals the potential of iTRAQ in
quantitative proteomics
C. Hundertmark1,†, R. Fischer1,†, T. Reinl1, S. May2, F. Klawonn2,∗ and L. Jänsch1,∗
1 Helmholtz Centre for Infection Research, Department for Cell Biology, Inhoffenstr. 7, D-38124 Braunschweig and
2 University of Applied Sciences BS/WF, Department of Computer Science, Salzdahlumer Str. 46/48, D-38302
Wolfenbuettel, Germany
Received on May 15, 2008; revised on October 8, 2008; accepted on October 18, 2008
Advance Access publication October 24, 2008
Associate Editor: David Rocke
ABSTRACT
Motivation: Mass spectrometry (MS) data are impaired by noise
similar to many other analytical methods. Therefore, proteomics
requires statistical approaches to determine the reliability of
regulatory information if protein quantification is based on ion
intensities observed in MS.
Results: We suggest a procedure to model instrument and workflow-specific noise behaviour of iTRAQ™ reporter ions that can provide
regulatory information during automated peptide sequencing by
LC-MS/MS. The established mathematical model representatively
predicts possible variations of iTRAQ™ reporter ions in an MS
data-dependent manner. The model can be utilized to calculate the
robustness of regulatory information systematically at the peptide
level in so-called bottom-up proteome approaches. It allows determination
of the best-fitting regulation factor and, in addition, calculation of
the probability of alternative regulations. The result can be
visualized as likelihood curves summarizing both the quantity and
quality of regulatory information. Likelihood curves basically can be
calculated from all peptides belonging to different regions of proteins
if they are detected in LC-MS/MS experiments. Therefore, this
approach renders excellent opportunities to detect and statistically
validate dynamic post-translational modifications usually affecting
only particular regions of the whole protein. The detection of known
phosphorylation events at protein kinases served as a first proof of
concept in this study and underscores the potential for noise models
in quantitative proteomics.
Contact: [email protected]; f.klawonn@fh-wolfenbuettel.de
Supplementary information: Supplementary data are available at
Bioinformatics online.
∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.

1 INTRODUCTION
The success of comparative proteome studies depends decisively
on the capability to accurately quantify hundreds to thousands of
the investigated components. Traditional 2D-PAGE analyses allow
regulatory investigations directly at the level of proteins that can
be quantified based on their spot volumes. Alternatively, after
enzymatic cleavage of proteins the resulting peptides can
be automatically sequenced in so-called bottom-up proteome
approaches based on the combination of liquid chromatography
linked with a suitable MS2 device (LC-MS). Unfortunately, it is
still a challenging task to directly correlate peptide ion intensities in
LC-MS with the abundance of the corresponding protein (label-free
quantification) and the majority of quantitative LC-MS proteome
projects rely on protein or peptide labelling technologies (Bantscheff
et al., 2007).
Besides SILAC (Ong et al., 2002), chemical labelling of
peptides with iTRAQ™ is becoming an important and widespread
quantitative technology based on LC-MS. Contrary to SILAC,
iTRAQ™ labels peptides from biochemically purified proteins
and thus also allows the investigation of clinical samples obtained from
patients. iTRAQ™ molecules enable the multiplexed differential
labelling and relative peptide quantification of either up to 4 or
8 different samples in parallel (Pierce et al., 2007; Ross et al.,
2004). During the labelling process only one type of iTRAQ™
molecules is linked covalently to all peptides from one biological
sample. Equal peptides (with identical amino acid sequence) from
different biological samples, which were labelled differentially and
subsequently pooled, still exhibit identical biochemical properties
and total masses. Consequently, corresponding equal peptides from
different samples co-elute at the same time from chromatographic
columns, enter the MS device with the same molecular weight,
and are subjected concertedly to the fragmentation process. Under
the MS2 conditions of peptide sequencing these iTRAQ™ labels
also produce fragment ions that differ in mass and do serve
as sample-specific reporters (114.1, 115.1, 116.1 and 117.1 Da).
The ratios of the released iTRAQ™ reporter ions correlate
with the relative abundance of the analysed peptides as part
of the investigated samples. Since a bottom-up approach can nowadays
produce thousands of fragment ion patterns per hour, the
extraction of regulatory information from the individual peptide
fragmentation patterns requires bioinformatics strategies. Several
tools and services were recently suggested and established to
calculate and visualize regulatory data obtained by iTRAQ™. All of
these approaches clearly aim at state-of-the-art analyses of protein
expression information: They successfully presented mathematical
concepts considering isotopic impurities, normalizing regulatory
information and in particular performing statistical data analyses
and integration with a cumulative view on reporter intensities from
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
peptides belonging to the same protein (Boehm et al., 2007; Lin
et al., 2006). Furthermore, it became obvious in recent studies
that the intensity of the reporter ions positively correlates with
the quality of the derived regulatory information (Hu et al., 2006)
but the statistical basis of this relation has not been established so far.
Focusing on robust protein expression analyses it was recommended
to consider only reporter ions above a certain threshold in order to
reduce the number of unexpectedly regulated peptides. However,
proteins are not exclusively regulated at the expression level and
divergently regulated peptides can point to important biological
processes. More than 200 different post-translational modifications
(PTMs) are described in literature (Walsh et al., 2005) and modified
individual amino acids decisively control protein enzyme activities
and protein–protein interactions.
In this study, we present a noise model to describe the behaviour
of iTRAQ™ reporter intensities at the level of peptides. Noise
models for microarray data have been popular for several
years (Tu et al., 2002; Weng et al., 2006), whereas noise models for
MS data are only now beginning to appear (Du et al.,
2008). We demonstrate that analyses of iTRAQ™ labelled peptides
can representatively specify the noise characteristics for individual
proteome workflows. The resulting noise model allows simulating
the data-dependent distribution of iTRAQ™ reporter intensities.
Importantly, the established mathematical model in return can
provide information on the robustness of experimentally observed
reporter intensities. We present here an intuitive concept for the
visualization-aided exploration of regulatory iTRAQ™ data based
on likelihood curves precisely depicting the regulatory information
in an MS data-dependent way. This strategy supports data integration
as required for protein expression studies but in particular presents
the state-of-the-art in characterizing regulations at the peptide level.
As an example, we demonstrate that regulatory information from
peptides with significantly variant behaviour can be correlated
unambiguously with known regulatory phosphorylation events at
protein kinases.
2 METHODS AND RESULTS
2.1 Experimental settings
Peptides were digested from proteins using trypsin, automatically sequenced
by LC-MS and identified as described previously (Trost et al., 2005). For
iTRAQ™ experiments the protocol was adapted as follows: digestion was
always carried out in 50 mM TEAB, allowing direct application of the iTRAQ™
kit labelling reagents. Labelling by iTRAQ™ was done in all experiments
according to the manufacturer’s protocol (Applied Biosystems, Foster City,
CA, USA). In brief: aliquots of iTRAQ™ reagents recommended for the
labelling of 100 µg protein were applied to peptides corresponding to a
maximum of only 40 µg total protein and the reactions were carried out
for at least 1 h. The completeness of labelling reactions was tested along the
process of peptide and protein identification done by Mascot (V 2.1, Matrix
Science, Boston, MA, USA). Unlabelled peptides could not be observed for
any experiment described in this study. Peptides were purified by small C18
columns (ZipTip, Millipore, Billerica, MA, USA) and the equivalent of one
ZipTip was used for LC-MS experiments. Peptides were separated using a
nanoAcquity HPLC (Waters) and BEH material (Waters) as an analytical
chromatographic column (150 mm × 75 µm, 1.7 µm) that was linked to a Q-TOFmicro instrument equipped with PicoTip™ emitter (distal coated, New
Objective, Woburn, MA, USA). Double- and triple-charged peptide ions were
automatically and individually selected by the quadrupole unit and mainly
b- and y-ions were generated during their fragmentation in the collision
cell (MS/MS mode). Collision cell energy was increased by about 10%
compared with protocols for non-labelled peptides as recommended by Ross
et al. (2004). Post-processing of fragmentation data to yield
representative individual ion intensities for peptide sequencing and iTRAQ™
quantification was performed as previously described (Trost et al., 2005).
In brief, four MS scans were always combined into one spectrum and smoothed
using the Savitzky–Golay method (2 smoothing passes, 7 channels), and centroid spectra
were generated (minimum peak width at half height 4, top 80%).
To generate the training dataset that was used to estimate the
noise parameters the following unmodified and phosphorylated (*) peptides
were synthesized and labelled with iTRAQ™ 115: IADPEHDHTGFLTEYVATRWYR, VSDFGLTK, FVLDDQY*TSSTGTKFPVK, HERPAGPGT*PPPDSGPLAK, SST*VTEAPIAVVTSR. The peptides were diluted in 65%
methanol, 5% formic acid to a concentration of 4.4 µM and injected into a
Q-TOFmicro (Waters) by using a syringe pump. By using these peptides
and different collision energies (12 eV, 16 eV, 20 eV, 22 eV, 24 eV, 26 eV, 28 eV
and 30 eV) the variation of reporter signals was analysed at 24 different
intensity levels. About 100 fragmentation spectra were acquired for each
intensity level and data points for the training dataset were processed as
described before.
For the generation of a 1:1 peptide sample a protein mixture containing
10 µg of the proteins ovalbumin, conalbumin, aldolase, ferritin, l-lactate dehydrogenase A and thyroglobulin (GE Healthcare, Chalfont St.
Giles, Buckinghamshire, UK) was digested with Trypsin (1:50, Promega,
Sequencing Grade, Madison, Wisconsin, USA) in 50 mM TEAB. The
sample was split 1:1, differentially labelled with iTRAQ™ reagents and
the re-combined sample was analysed by LC-MS.
Additional details on the design of the biological approaches, i.e. (i) the
investigation of intensity variations of labelled peptides derived from total
cell lysates from immune cells (B cells) and (ii) the analysis of regulated
peptide modifications of protein kinases in HGF/Met-activated signalling
studies are included in Sections 2.3 (B cells) and 2.6 (HGF/Met) and in the
Supplementary Material (Figures S2/S3).
2.2 Noise model
In accordance with Lin et al. (2006) and Hu et al. (2006) we routinely observe
an ion intensity-dependent accuracy of our regulatory data. This effect can
be representatively investigated and visualized by using the following test
system: a protein sample was digested, split and the resulting peptides were
labelled differentially with iTRAQ™ tags 115 and 117 before both fractions
were re-combined in a 1:1 ratio and analysed by LC-MS/MS. Subsequently,
we isolated the iTRAQ™ reporter intensities from all spectra, corrected the
data based on the known isotopic impurities and compared the trimmed
mean values from both samples to calculate and apply a normalization
factor (<1). From such preprocessed data, we derived logarithmic peptide
ratios as presented in Figure 1. The peptide expression accuracy is
iTRAQ™-reporter ion intensity-dependent. Decreasing iTRAQ™-reporter
ion intensities coincide with increased deviations from the expected
1:1 ratios.
In Klawonn et al. (2006), we suggested a statistical approach to estimate
the noise of quantitative protein analyses based on iTRAQ™.
In brief, we assume that the available data
\[ x_1^{(1)},\ldots,x_1^{(k_1)},\;\ldots,\;x_n^{(1)},\ldots,x_n^{(k_n)} \qquad (1) \]
consist of n tuples $(x_i^{(1)},\ldots,x_i^{(k_i)})$, each representing $k_i$ ($k_i \ge 2$) noisy
measurements of the same unknown (preprocessed) logarithmic intensity
$\mu_i$. We assume that each tuple originates from independent samples with
normally distributed data, with unknown mean $\mu_i$ and unknown variance
$\sigma_i^2$. In accordance with recent studies (e.g. Hu et al., 2006; Lin et al., 2006),
our experiments have shown that the relative ratio of noise decreases for
larger iTRAQ™ reporter intensities.
Various physical causes, such as variable ion suppression effects, signal
amplifier circuits and iTRAQ™ fragmentation specificities might contribute
to this effect. Currently, this can hardly be dissected and calculated
individually to produce an exact model for iTRAQ™ data. So far, a rational
noise model could be established and experimentally validated only for the
detection of peptide ions in noisy MS spectra. Du et al. (2008) recently
demonstrated that the combination of a multinomial and Poisson model can
fit observed with expected isotopic distributions from peptide ions. This
concept basically belongs to de-isotoping approaches and provides equations
for calculating the proportion of ¹³C-containing peptide fractions according
to the natural frequency of this isotope. In contrast, iTRAQ™ signals reflect
the abundance of labelled peptides in biological samples and are thus not
predictable per se. Only a minor portion of the total intensities are formed
by expectable but even not calculable isotope distributions also known as
isotopic impurities.

Fig. 1. Normalized logarithmic reporter ion intensities from equal sets of
peptides that were differentially labelled with iTRAQ™ reagents 115 and
117 reveal intensity-dependent deviations.

Therefore, and until the rational basis of iTRAQ™ noise has been
established in detail, a representative approximation of noise characteristics
was achieved in this study using the equation
\[ \sigma(\mu) = h(\mu; a, r, \lambda) = a + r e^{-\lambda\mu} \qquad (2) \]
with $a, r, \lambda \ge 0$. The parameter a represents the absolute noise in the
measurement, while r and λ determine the intensity-dependent noise and
its decrease.
The estimation of the parameters a, r and λ based on the sample
(1) requires that we estimate the true unknown intensity $\mu_i$ for each
subsample. For the estimation of the unknown parameter vector θ = (a, r, λ)
the maximum likelihood principle leads to the likelihood function
\[ L\left(x_1^{(1)},\ldots,x_1^{(k_1)},\ldots,x_n^{(1)},\ldots,x_n^{(k_n)} \,\middle|\, \theta\right) = \prod_{i=1}^{n}\prod_{j=1}^{k_i} \frac{1}{h(\mu_i;\theta)\sqrt{2\pi}} \exp\left(-\frac{(x_i^{(j)}-\mu_i)^2}{2h^2(\mu_i;\theta)}\right). \qquad (3) \]
Since the maximization of the log-likelihood is equivalent to the
maximization of the likelihood itself, we consider the log-likelihood ($\tilde{L}$).
Assuming the parameters a, r and λ to be fixed at the moment, optimization
of the $\mu_i$-values is achieved by maximization of the log-likelihoods. By
setting $d\tilde{L}_i/d\mu_i = 0$ we finally get
\[ \sum_{j=1}^{k_i} \left[ (\lambda r e^{-\lambda\mu_i})(a + r e^{-\lambda\mu_i})^2 + (x_i^{(j)}-\mu_i)\left( (a + r e^{-\lambda\mu_i}) + (x_i^{(j)}-\mu_i)(-\lambda r e^{-\lambda\mu_i}) \right) \right] = 0. \qquad (4) \]
For optimization of the parameters a, r and λ the fitness of a parameter
combination θ = (a, r, λ) is given by (3).
The parameters were fitted in an expectation maximization (EM)-like
fashion with numerical steps if required.

2.3 Definition of workflow-specific noise parameters
The parameters a,r,λ are not universal and depend on the type of the
mass spectrometer in general as well as on data acquisition settings.
To estimate these parameters we generated a training dataset based on
five peptides to determine the intensity-dependent noise. Two unmodified
peptides were included in this approach to help cover a broad
dynamic reporter intensity range: IADPEHDHTGFLTEYVATRWYR can
only be labelled once with iTRAQ™ at the N-terminus. Its low reporter
signal capacity is further decreased by many fragmentation products causing
significant ionization suppression effects. In contrast, VSDFGLTK contains
two iTRAQ™ -reactive primary amino groups (N-terminus and lysine) and
exhibits significantly better ionization efficiency of the released iTRAQ™
115 reporter. Additionally, we included three further phosphorylated peptides
in this approach in order to substantiate the data pool of the training dataset
and to guarantee its representative character also to modified peptides.
Measurements were repeated for all five peptides by using different collision
energies, thus covering different absolute levels of iTRAQ™ reporter
intensities. In this way, we repeatedly measured reporter intensity variations
at 24 different intensity levels resulting in 552 individual reporter intensities
defining the training dataset (Supplementary Material, Fig. S1).
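The way a parameter combination θ = (a, r, λ) is scored on such a training dataset (Section 2.2) can be sketched as follows. This is a minimal illustration and not the Java implementation mentioned in this article: the function names, the grid search that stands in for solving Equation (4), and the synthetic tuples are assumptions made for the example.

```python
import math
import random

def noise_sd(mu, a, r, lam):
    """Intensity-dependent standard deviation of Eq. (2): a + r*exp(-lam*mu)."""
    return a + r * math.exp(-lam * mu)

def tuple_log_lik(xs, mu, theta):
    """Log of the inner product of Eq. (3) for one tuple of repeated log-intensities."""
    s = noise_sd(mu, *theta)
    return sum(-math.log(s) - 0.5 * math.log(2 * math.pi)
               - (x - mu) ** 2 / (2 * s * s) for x in xs)

def best_mu(xs, theta, steps=400):
    """Grid search for the mu_i that maximizes the tuple log-likelihood.

    Assumption: a plain grid replaces the root finding implied by Eq. (4).
    """
    lo, hi = min(xs) - 1.0, max(xs) + 1.0
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda m: tuple_log_lik(xs, m, theta))

def profile_log_lik(tuples, theta):
    """Fitness of theta: log-likelihood with every mu_i set to its best value."""
    return sum(tuple_log_lik(xs, best_mu(xs, theta), theta) for xs in tuples)

# Synthetic training data (hypothetical design): 12 intensity levels, 20 repeats each.
random.seed(0)
true_theta = (0.0103, 0.9908, 0.4751)
levels = [1.0 + 0.5 * k for k in range(12)]
data = [[random.gauss(m, noise_sd(m, *true_theta)) for _ in range(20)]
        for m in levels]

fit_true = profile_log_lik(data, true_theta)
fit_flat = profile_log_lik(data, (1.0, 0.0, 0.0))  # intensity-independent noise
```

In an EM-like scheme, the step shown in `best_mu` would alternate with an update of θ that increases `profile_log_lik`; the optimizer used for that update is not specified here and is left open in the sketch.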
From this dataset, we estimated the noise parameters for the used mass
spectrometer and individual settings as follows: a = 0.0103, r = 0.9908 and
λ = 0.4751. These parameters facilitate the noise estimation of iTRAQ™
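reporter ions dependent on their individual intensity. For the calculation of
a 95% interval for an unregulated sample ($\mu_1 = \mu_2$) we assume a normal
distribution with mean $\mu = \mu_1 - \mu_2 = 0$ and SD $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$, where
$\sigma_i = a + r e^{-\lambda\mu_i}$. Since $\sigma_1 = \sigma_2$, it follows that
$\sigma = \sqrt{2\sigma_1^2} = \sqrt{2}\,\sigma_1$.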
The noise parameters were estimated from data that were generated under
artificial analytical conditions. In order to validate the noise model in an LC-MS/MS workflow which is used for the analysis of complex protein samples
we labelled equal amounts of a digested protein mixture containing the
proteins ovalbumin, conalbumin, aldolase, ferritin, thyroglobulin and l-lactate
dehydrogenase A (GE Healthcare) with iTRAQ™ tags 115 and 117 and
combined the labelled peptides in a 1:1 ratio (test dataset). Using the model
parameters a, r and λ derived from the training dataset we calculated the
expression ratios for 563 MS/MS spectra that were recorded for the 1:1
peptide mixture (Fig. 2A). According to the training dataset we expected 5%
outliers. Of 563 peptide expression ratios 16 are located outside the borders
(2.84%) thus confirming the representative character of the applied training
strategy. Additionally, we analysed in a similar approach a real biological
sample, i.e. peptides derived from total cell lysates of immune cells (B cells).
Again, the 95% interval of the noise model was in good accordance with
the experimentally observed variations (1.9% outliers, Fig. S2). Thus, we
assume that five iTRAQ™ labelled peptides are sufficient to estimate the
noise parameters of an individual MS workflow. We have also simulated
the variation of iTRAQ™ reporter intensities based on our noise model
systematically at variable intensity levels. The resulting in silico data points
exhibit striking similarities with all experimental results (Fig. 2B).
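The 95% interval for an unregulated sample and the in silico simulation described above can be sketched in a few lines. The sketch assumes that the noise acts additively on logarithmic intensities, uses the parameter values reported in the text and a hypothetical random seed; it is an illustration, not the authors' implementation.

```python
import math
import random

A, R, LAM = 0.0103, 0.9908, 0.4751  # noise parameters reported in the text

def noise_sd(mu):
    """Standard deviation of a log-intensity with true value mu, Eq. (2)."""
    return A + R * math.exp(-LAM * mu)

def ratio_bounds_95(mu):
    """95% interval for ln(ratio) of an unregulated peptide at log-intensity mu."""
    sd = math.sqrt(2.0) * noise_sd(mu)  # sqrt(sigma_1^2 + sigma_2^2) with sigma_1 == sigma_2
    return -1.96 * sd, 1.96 * sd

def simulate_log_ratios(max_intensity=2000, seed=1):
    """For every integer intensity e^mu draw two noisy log-intensities.

    Returns a list of (mu, ln(sim1/sim2)) pairs.
    """
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    points = []
    for intensity in range(1, max_intensity + 1):
        mu = math.log(intensity)
        log_sim1 = rng.gauss(mu, noise_sd(mu))
        log_sim2 = rng.gauss(mu, noise_sd(mu))
        points.append((mu, log_sim1 - log_sim2))
    return points

points = simulate_log_ratios()
outside = 0
for mu, log_ratio in points:
    low, high = ratio_bounds_95(mu)
    if not (low <= log_ratio <= high):
        outside += 1
fraction_outside = outside / len(points)
```

Because the simulated points are drawn from the same model that defines the interval, `fraction_outside` should lie close to 5%; experimental datasets can deviate from this, as the 2.84% and 1.9% reported above show.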
We compared our noise model $\sigma(\mu) = h(\mu;\theta) = a + re^{-\lambda\mu}$ with further
potential noise models h(µ;θ ), which might also be able to simulate
a representative distribution of iTRAQ™ intensities as experimentally
observed (Table S3), but none of them produced a satisfying result. An
additionally performed estimation of parameters based on median absolute
deviation (MAD) instead of maximum likelihood estimation (MLE) yielded
similar results. However, since outliers have less effect on the MAD-based
estimation of the noise parameters, we selected the more conservative method
based on MLE, accepting some missed true positives.
In accordance with Boehm et al. (2007), we assume that the measurement
of each intensity results from the true intensity plus noise following a lognormal distribution (with mean zero, i.e. the measurement error is unbiased).
Fig. 2. (A) Experimental test dataset comprising expression ratios
(logarithms) of 563 peptides. Tryptically generated peptides from six
different proteins were differentially labelled with iTRAQ™ 115 or 117,
mixed in the ratio 1:1 and then analysed by MS. Of the peptide regulation
factors, 2.84% are located outside the 95% interval calculated from the
training dataset. (B) Simulation of a dataset based on the training dataset
shows striking similarities to the intensities detected in the experimental test
dataset. Every integer intensity between 1 and 2000 ($e^{\mu}$) was used for the
generation of two noisy intensities (sim1 and sim2) according to the noise
model with the parameters a = 0.0103, r = 0.9908 and λ = 0.4751. The values
ln(sim1/sim2) are plotted against µ.
For verification, we performed a power transformation with the training
dataset. Power transformations of the form
\[ T_p(x) = \begin{cases} \dfrac{x^p - 1}{p} & \text{if } p \neq 0, \\ \ln x & \text{if } p = 0 \end{cases} \]
are a very common technique in data analysis used for various purposes.
Here, the special case of a (power) transformation for equal spread (see,
for instance Hoaglin et al., 2000) is of interest, reducing the difference in
the variances at the different intensity levels. For the training dataset the
estimated exponent p was near zero (p = 0.0017), which corresponds to the
logarithmic case of the transformation and thus provides another justification for
the logarithmic transformations applied to the data. Java classes that were
used to estimate the noise parameters from test datasets are provided on
request.
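The relation between the power transformation and the logarithm can be illustrated with a short sketch. It assumes that the value 0.0017 quoted above refers to the exponent p of the transformation; the function name and the test value x are hypothetical.

```python
import math

def power_transform(x, p, eps=1e-12):
    """T_p(x) = (x**p - 1)/p for p != 0 and ln(x) for p == 0 (x must be positive)."""
    if x <= 0:
        raise ValueError("x must be positive")
    if abs(p) < eps:
        return math.log(x)
    return (x ** p - 1.0) / p

# With an exponent this close to zero the transform is almost the natural logarithm.
x = 500.0  # hypothetical reporter intensity
near_log = power_transform(x, 0.0017)
exact_log = power_transform(x, 0.0)
relative_gap = abs(near_log - exact_log) / exact_log
```

For this x the two values differ by well under 1%, which is why an estimated exponent near zero supports working with logarithmic intensities.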
2.4 Calculation of regulatory information
Similar to Shadforth et al. (2005) or Lin et al. (2006), we perform a standard
procedure to preprocess the acquired MS data. All fragmentation ions are
smoothed and centred using MassLynx 4.1 (Waters) before the peaks of the
actual iTRAQ™ reporters are detected using Mascot Parser 2.1.00 (Matrix
Science). Mass shifts were compensated automatically and in case of multiple
signals near the expected mass, we optimized peak detection by scanning
for a pattern that fits the specified iTRAQ™ label masses. Since each
iTRAQ™ reporter type can contribute to the neighbouring signals with a
few percent of its own ion intensity (isotopic impurity) a small fraction
of each iTRAQ™ molecule batch was routinely subjected to fragmentation
experiments individually in order to certify its signal specificities. Subsequently,
the certified percentage isotopic impurities of each applied iTRAQ™ molecule
type are considered in a linear system of equations, which makes it possible to calculate
precisely the actual ion abundances of differentially labelled peptides.
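The impurity correction can be pictured as solving observed = M · true for the true reporter abundances, where column j of M holds the fractions of reporter j's signal that appear in each channel. The sketch below is an assumption-laden illustration: the impurity values are invented, not certified batch values, and the small Gaussian elimination routine merely stands in for any linear solver.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("impurity matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

# Hypothetical impurity matrix: rows are the observed channels 114, 115, 116, 117;
# column j gives the fraction of reporter j's signal that lands in each channel.
IMPURITY = [
    [0.93, 0.02, 0.00, 0.00],
    [0.06, 0.92, 0.03, 0.00],
    [0.00, 0.05, 0.92, 0.04],
    [0.00, 0.00, 0.04, 0.92],
]

def correct_impurities(observed):
    """Recover the true reporter abundances from the observed channel intensities."""
    return solve_linear_system(IMPURITY, observed)

# Round trip: mix known abundances through the matrix, then undo the mixing.
true_signal = [100.0, 200.0, 50.0, 400.0]
observed = [sum(IMPURITY[i][j] * true_signal[j] for j in range(4)) for i in range(4)]
recovered = correct_impurities(observed)
```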
The last preprocessing step is the normalization of samples. This is
done by comparing the trimmed mean values of all samples and adjusting
by multiplicative correction the intensities of samples with high peptide
concentration to those of the lowest concentration. For the calculation of
the normalization factor the mean was calculated after discarding the upper
and lower 20% of the sample’s intensities (trimmed mean).
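The normalization step can be sketched as below. The function names and the example intensities are hypothetical; the 20% trimming on each side and the scaling towards the sample with the lowest trimmed mean follow the description above.

```python
def trimmed_mean(values, trim=0.20):
    """Mean after discarding the lowest and highest `trim` fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("too few values for the requested trimming")
    return sum(kept) / len(kept)

def normalize_to_lowest(samples):
    """Scale every sample so that its trimmed mean matches the lowest trimmed mean."""
    means = {name: trimmed_mean(vals) for name, vals in samples.items()}
    target = min(means.values())
    factors = {name: target / m for name, m in means.items()}
    scaled = {name: [v * factors[name] for v in vals]
              for name, vals in samples.items()}
    return scaled, factors

# Hypothetical reporter intensities; the extreme values are removed by the trimming.
samples = {
    "115": [10.0, 12.0, 11.0, 13.0, 9.0, 500.0, 11.5, 10.5, 12.5, 1.0],
    "117": [20.0, 24.0, 22.0, 26.0, 18.0, 900.0, 23.0, 21.0, 25.0, 2.0],
}
scaled, factors = normalize_to_lowest(samples)
```

With this choice of target every normalization factor is at most 1, so the sample with the lowest concentration is left unchanged.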
The expression ratio of a single peptide can simply be determined by
dividing the iTRAQ™ ion intensities if only one fragmentation experiment
(MS/MS) was generated within an experiment. However, and as discussed
in Hu et al. (2006) we need a different approach if multiple MS/MS spectra
match to one peptide. According to the noise model, expression ratios derived
from high intensities are more representative compared with expression
ratios derived from low intensities. Therefore, reporter ion ratios that all
have to be correlated to the same peptide should be integrated in a weighted
manner depending on their reporter intensities and their corresponding signal
qualities. In order to find the most likely expression ratio for a peptide we
scan all expression ratios cj between the lowest (cmin ) and the highest (cmax )
for all MS/MS spectra that match the actual considered peptide.
For each scanned expression ratio cj and each measured pair of intensities
(xi ,yi ), 1 ≤ i ≤ n, n = number of matched MS/MS spectra, a suitable pair of
intensities is calculated by setting µxi = cj µyi where µxi and µyi are the
true, unknown intensities of the measured intensities xi and yi , respectively.
To find the best expression ratio cj for all n matched MS/MS spectra, we
compute the likelihood function
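\[ L(x,y,c_j) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_{x_i}} \exp\left(-\frac{(x_i - c_j\mu_{y_i})^2}{2\sigma_{x_i}^2}\right) \frac{1}{\sqrt{2\pi}\,\sigma_{y_i}} \exp\left(-\frac{(y_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}\right) \qquad (5) \]
with $\sigma_x = a + re^{-\lambda\mu_x}$ and $\sigma_y = a + re^{-\lambda\mu_y}$.
Logarithmic transformation and omitting the constant parts leads to the
log-likelihood function
\[ \tilde{L}(x,y,c_j) = \sum_{i=1}^{n} \left[ -\ln(\sigma_{x_i}) - \frac{(x_i - c_j\mu_{y_i})^2}{2\sigma_{x_i}^2} - \ln(\sigma_{y_i}) - \frac{(y_i - \mu_{y_i})^2}{2\sigma_{y_i}^2} \right]. \qquad (6) \]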
The most suitable overall expression ratio cj for all matched MS/MS
spectra of a peptide found multiple times is, according to the noise model, that
one, which results in the maximum log-likelihood.
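The scan over candidate expression ratios can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the text: the intensities are handled on the logarithmic scale, so that the regulation factor enters as an additive shift ln(c); the unknown true intensity of each spectrum is profiled on a grid; and the function names and example intensities are hypothetical.

```python
import math

A, R, LAM = 0.0103, 0.9908, 0.4751  # noise parameters reported in the text

def noise_sd(mu):
    return A + R * math.exp(-LAM * mu)

def pair_log_lik(log_x, log_y, log_c, mu_y):
    """One summand of Eq. (6) on the log scale, with mu_x = mu_y + ln(c)."""
    mu_x = mu_y + log_c
    sx, sy = noise_sd(mu_x), noise_sd(mu_y)
    return (-math.log(sx) - (log_x - mu_x) ** 2 / (2 * sx * sx)
            - math.log(sy) - (log_y - mu_y) ** 2 / (2 * sy * sy))

def profiled_pair_log_lik(log_x, log_y, log_c, steps=200, half_width=0.5):
    """Maximize over the unknown true log-intensity mu_y on a grid (assumption)."""
    centre = 0.5 * (log_y + log_x - log_c)
    grid = (centre - half_width + 2 * half_width * k / steps
            for k in range(steps + 1))
    return max(pair_log_lik(log_x, log_y, log_c, m) for m in grid)

def scan_ratio(pairs, steps=200):
    """Scan ratios between the lowest and highest observed ratio; return best and curve."""
    logs = [(math.log(x), math.log(y)) for x, y in pairs]
    observed = [lx - ly for lx, ly in logs]
    lo, hi = min(observed), max(observed)
    if hi == lo:
        return math.exp(lo), [(math.exp(lo), 0.0)]
    curve = []
    for k in range(steps + 1):
        log_c = lo + (hi - lo) * k / steps
        ll = sum(profiled_pair_log_lik(lx, ly, log_c) for lx, ly in logs)
        curve.append((math.exp(log_c), ll))
    best = max(curve, key=lambda point: point[1])[0]
    return best, curve

# Hypothetical reporter intensity pairs (x_i, y_i) from four spectra of one peptide.
pairs = [(2100.0, 1000.0), (410.0, 200.0), (30.0, 10.0), (9.0, 6.0)]
best_ratio, curve = scan_ratio(pairs)
```

In this hypothetical example the two high-intensity spectra dominate the scan, so the maximum lies close to their observed ratios of about 2.1 rather than near the low-intensity ratios of 3.0 and 1.5.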
Calculation of a protein expression ratio takes place in the same manner.
Instead of regarding expression ratios of a single peptide matched repeatedly,
we compute the most representative regulation factor for all peptides of
the observed protein. This is done by calculation of the likelihoods for all
regulation factors cj to be considered (cmin ≤ cj ≤ cmax and a sufficiently small
step size for cj ) and choosing that one resulting in the maximum likelihood.
Hence, we have
n = the total number of all MS/MS spectra matched to a
peptide of the actual protein,
cmin = lowest expression ratio of a matched MS/MS spectrum and
cmax = highest expression ratio of a matched MS/MS spectrum
for the estimation of the protein expression ratio.
Thus, this approach can integrate information of all experimentally
observed results without the necessity to reject data by arbitrary threshold
levels. The quality of each iTRAQ™ reporter can be specified by the
established noise model based on its individual intensity. Low intensity
reporters in combination with those of better signal qualities, i.e. higher
intensities, have only a minor effect on the final result, and vice versa. In
Table S2 protein expression ratios (derived from the test dataset 1:1 and
1:10) are compared based on MLE, median and mean value. In summary,
the presented approach yields better results than median and mean value,
since the protein expression ratios calculated in this way average closest to 1. This
strategy basically allows scientists to observe all clues to regulatory events
in a complex dataset but also introduces the need to calculate and present the
actual robustness of regulatory information in parallel.

Table 1. Sequences and regulatory information of the peptides from the test
protein aldolase (ALDOA) as depicted in Fig. 3 and Fig. S4

                     Undiluted         Tenfold diluted
Sequence             Ratio     IR      Ratio     IR
ALDOA (aldolase)      1.02     0.07    −1.07     0.28
IVAPGK                1.05     0.19     1.22     1.28
GGVVGIK               1        0.18    −1.11     0.73
DGADFAK               1.04     0.25    −1.16     0.56
VLAAVYK               1.12     0.20     n.d.     n.d.
ELSDIAHR              1.03     0.33     n.d.     n.d.
ALQASALK              1.03     0.16     1.11     0.72
QLLLTADDR            −1.17     0.59     1.06     1.22
GILAADESTGSIAK       −1.07     0.40     n.d.     n.d.
AAQEEYVK              1.01     0.15    −1.19     0.52
ADDGRPFPQVIK          1.01     0.35     n.d.     n.d.
LQSIGTENTEENR         1.01     0.18    −1.12     1.59

The protein aldolase was digested, the sample was split into aliquots, the peptides
were differentially labelled by iTRAQ™ and recombined. The sample was measured
undiluted or 10-fold diluted, yielding 11 and 7 peptide sequences, respectively, and corresponding
regulatory information. Positive and negative ratios refer to the peak of the likelihood
curve and give the multiplicative increase and decrease, respectively. IR, interval of
robustness, i.e. the minimal interval which is covered by 80% of the area below each
likelihood curve. n.d., not detected in the diluted sample.
2.5 Visualization of regulatory information
Regulatory data are supposed to be especially reliable and robust if the
regulatory information of a peptide or protein is consistent and based
on relatively high reporter ion intensities. We have established a system
for the calculation of expression ratios for multiple measured peptides and
for proteins according to our noise model, but we have not yet included
information about the robustness of the underlying signal qualities. This is
of highest importance since we know that even unregulated objects can yield
significantly different regulatory results in an MS data-dependent manner
of the reporter signals (see Fig. 2). Therefore, a broader range of possible
regulations has to be considered if low reporter intensities were measured and
if a protein expression ratio is based on heterogeneously regulated peptides.
Fig. 3. Likelihood curves can representatively summarize the robustness of
regulatory information. The likelihood for variable regulations is calculated
according to the noise model and actual reporter intensities from peptides
belonging to the same protein (ALDOA, aldolase). Resulting likelihood
curves can be calculated and presented at the peptide level (A). Analysis
of a 10-fold diluted sample results in decreased reporter ion intensities and
less robust regulatory information, as presented by broadened peptide curves
(B). Legends at the bottom and the right top side of each plot refer to the
displayed peptide sequences and IR, respectively.

We selected a visualization-aided approach to display the likelihoods of
protein and peptide regulations according to the real data qualities: the x-axis
refers to the regulation factor, the y-axis refers to the likelihood given by (6).
An example for the visualization of the iTRAQ™ data quality is given for
the protein aldolase (ALDOA) that was analysed as part of the test dataset.
Similar but not identical regulatory ratios were observed for different aldolase
peptides (Table 1, Fig. 3). The established noise model allows us to calculate
the likelihood for regulatory values adjacent to the simply observable
regulatory ratios. Higher iTRAQ™ reporter intensities correlate positively
with the robustness of regulation and thus result in narrower likelihood
curves. To display positive and negative regulatory factors symmetrically
in the plots we have omitted the range [−1,+1], and regulations <1 are
calculated by inverting the measured intensities. For example,
the ratios +1.20 and −1.20 refer to an increase and decrease of 20%,
respectively. The area below the likelihood curve is always normalized to 1
and consequently gives an intuitive impression of the robustness. In addition,
concrete values referring to the minimal interval which is covered by 80%
of the area below each likelihood curve (interval of robustness, IR) are
displayed as a legend. In total, 11 peptides of the test protein were detected
exhibiting peak ratios ranging from −1.17 to 1.12 and IR values between
0.15 and 0.59. Next we diluted the same sample expecting reduced reporter
ion qualities. Automatic sequencing by LC-MS/MS now detected only seven
peptides exhibiting in general reduced reporter ion intensities (Table 1). The
calculated likelihood curves of the corresponding peptide and protein views
(Fig. 3 and Fig. S4) intuitively indicate the significantly lower quality of the
regulatory information. This finding is also in accordance with an increased
diversity of peak ratios, now varying between −1.19 and 1.22, and IR values up
to 1.59. The peak ratios from the undiluted and 10-fold diluted sample exhibit
only a distance of 0.09 (1.02 versus −1.07) emphasizing the relevance of
repeated peptide measurements in case of low data quality.
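The symmetric ratio scale and the IR described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in this study: the function names `signed_ratio`, `to_axis` and `interval_of_robustness` are hypothetical, the likelihood curve is assumed to be sampled on an evenly spaced grid, and the IR is taken as the shortest grid window holding 80% of the normalized area.

```python
import numpy as np

def signed_ratio(i_test, i_ref):
    """Ratio on the symmetric scale: values below 1 are inverted and negated."""
    r = i_test / i_ref
    return r if r >= 1.0 else -1.0 / r

def to_axis(s):
    """Plot coordinate that closes the omitted gap between -1 and +1."""
    if abs(s) < 1.0:
        raise ValueError("signed ratios between -1 and +1 are not defined")
    return s - 1.0 if s > 0 else s + 1.0

def interval_of_robustness(x, density, coverage=0.80):
    """Shortest window on an evenly spaced grid x that holds `coverage` of the area."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(density, dtype=float)
    w = w / w.sum()                              # normalise the area to 1
    cum = np.concatenate(([0.0], np.cumsum(w)))  # cum[k] = mass of bins 0..k-1
    best_lo, best_hi = 0, len(w) - 1
    for lo in range(len(w)):
        # first index j whose cumulative mass reaches the target for this start
        j = int(np.searchsorted(cum, cum[lo] + coverage - 1e-12, side="left"))
        if j > len(w):
            break                                # not enough mass left to the right
        hi = j - 1
        if x[hi] - x[lo] < x[best_hi] - x[best_lo]:
            best_lo, best_hi = lo, hi
    return x[best_lo], x[best_hi]

# Distance between the two peak ratios quoted in the text, on the gap-free axis.
print(abs(to_axis(1.02) - to_axis(-1.07)))
```

On this gap-free axis the distance between 1.02 and −1.07 evaluates to about 0.09, in line with the value quoted in the text. Passing a likelihood curve sampled on `to_axis` coordinates to `interval_of_robustness` returns the interval bounds, and their difference gives an IR-like width.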
Whereas in real experiments at least three parameters in concert contribute
to the overall regulation data quality, i.e. the (i) number, (ii) consistency and
(iii) reporter ion intensities of independently measured peptides, in silico
calculations allow their influence to be modelled individually. We hypothesize
a differentially labelled peptide that can release iTRAQ™ reporter ions
with variable intensities but always with exactly the same resulting ratio.
Applying our noise model and likelihood curve calculations to this scenario
allows us to investigate to what extent repeated measurements of the same
peptide or variable reporter ion intensities of only one peptide influence
the robustness of regulatory information. Notably, both the simulated
increase of independent measurements and the simulated increase of reporter
ion intensities exhibit a similar potential to improve the robustness of
regulatory information (Fig. S11). In conclusion, this model strongly suggests
that data quality, i.e. the intensities of iTRAQ™ reporters, can basically
substitute for repeated measurements in order to derive reliable
regulatory information from single peptides.
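The in silico scenario above can be illustrated with a small simulation. The sketch below is not the noise model established in this study: `noise_sd` is a placeholder in which relative noise shrinks with intensity, the likelihood over the log regulation factor is approximated as Gaussian via a first-order (delta-method) error propagation, and all function names and intensity values are hypothetical. The curve is computed on a log scale rather than on the signed scale used in the plots.

```python
import numpy as np

def noise_sd(intensity):
    # Placeholder noise model (assumption): absolute noise grows more slowly
    # than the signal, so relative noise shrinks at high intensity.
    return 2.0 + 0.5 * np.sqrt(intensity)

def log_ratio_sd(i_ref, i_test):
    # First-order (delta-method) approximation of the sd of log(i_test / i_ref).
    return np.hypot(noise_sd(i_ref) / i_ref, noise_sd(i_test) / i_test)

def likelihood_curve(measurements, grid):
    """Normalised likelihood over log regulation factors for (i_ref, i_test) pairs."""
    log_lik = np.zeros_like(grid)
    for i_ref, i_test in measurements:
        centre = np.log(i_test / i_ref)
        log_lik += -0.5 * ((grid - centre) / log_ratio_sd(i_ref, i_test)) ** 2
    curve = np.exp(log_lik - log_lik.max())
    return curve / (curve.sum() * (grid[1] - grid[0]))  # area normalised to 1

def central_width(grid, curve, coverage=0.80):
    """Width of the central interval holding `coverage` of the area."""
    cdf = np.cumsum(curve) * (grid[1] - grid[0])
    lo = np.interp((1.0 - coverage) / 2.0, cdf, grid)
    hi = np.interp((1.0 + coverage) / 2.0, cdf, grid)
    return hi - lo

grid = np.linspace(-1.0, 1.0, 4001)  # log regulation factor
# Same ratio (1.2) in all three cases: one measurement, four repeats, 4x intensity.
w_base = central_width(grid, likelihood_curve([(100.0, 120.0)], grid))
w_repeat = central_width(grid, likelihood_curve([(100.0, 120.0)] * 4, grid))
w_bright = central_width(grid, likelihood_curve([(400.0, 480.0)], grid))
print(w_base, w_repeat, w_bright)
```

Under these assumptions, both the repeated measurements and the brighter reporters give a narrower 80% interval than the single low-intensity measurement. How closely the two effects match depends entirely on the chosen noise model.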
2.6 Detection and visualization of regulated PTMs
We suppose that the statistical evaluation of regulatory information at the
peptide level provides excellent opportunities to detect dynamic PTMs
occurring only in particular regions of a protein. As a proof of concept,
we selected phosphorylation as one of the most frequent and important types
of PTMs. In humans, about 500 protein kinases mediate phosphorylations of
serine, threonine and tyrosine residues controlling cell-signalling processes.
Thus, the interaction of these kinases and their cross-phosphorylations
constitute one of the basic molecular mechanisms to regulate kinase activities
within these signalling networks. To substantiate our hypothesis, we aimed to
identify regulated phosphorylation sites on human protein
kinases from cells treated with the growth factor HGF, the physiological
ligand of the multifaceted, network-controlling receptor tyrosine kinase Met
(Birchmeier et al., 2003). Activation of the Met kinase by HGF should induce
phosphorylations at down-stream signalling components described in the
literature, namely the MK01 and GSK3α kinases. Protein kinase fractions
comprising MK01 and GSK3α were generated from HGF-stimulated human
epithelial cells and untreated control cells following established protocols
(Reinl et al., manuscript in preparation; Wissing et al., 2007). All purified
kinases were then digested and the resulting peptides from both samples
were differentially labelled with iTRAQ™ reagents (114, 117). LC-MS/MS
analyses of both unmodified and phosphorylated peptides were performed
on a nanoAcquity Ultra Performance LC-system (Waters Corp., Milford,
Massachusetts, USA) connected to a Qtof micro™ mass spectrometer
(Waters Corp., USA). MS data that were acquired and processed by
MassLynx version V4.1 (Waters) were searched for protein identification
using the UniProtKB/Swiss-Prot database (release 55.0 of February 26,
2008 with 356194 entries; taxonomy: Homo sapiens with 18610 entries) and
Mascot Daemon 2.1.6. Only peptides that were unambiguously sequenced
and identified (Mascot, P < 0.05) were included in the statistical evaluation
of regulatory data. MS spectra from phosphopeptides and phosphorylation
sites belonging to known Met signalling components, such as MK01 and
GSK3α, were inspected manually. Reporter ion intensities of the raw data
were in accordance with the upregulation of these phosphorylation events
in HGF-treated cells as described in the literature (Fig. S3). To assess the
robustness of these results, we applied our statistical approach as described
before. In brief, reporter intensities detected in all unambiguously identified
peptides from both samples were (i) corrected according to known isotopic
impurities and (ii) normalized by the trimmed mean algorithm. Then,
likelihood curves were calculated based on the established noise model for all
peptide regulations, and both phosphopeptide and non-phosphopeptide
curves belonging to one protein kinase were depicted comparatively. As
expected, non-modified peptides were detected with variable robustness but
generally formed clusters with overlapping likelihood curves, indicating
no significant alterations at the level of protein expression or stability.
In contrast, the curves of some phosphopeptides detected at known
signalling components did not overlap with these main clusters, in each
case indicating a significant regulation caused by HGF-activated Met
signalling. Figure 4 representatively summarizes the results obtained from
two well-described human kinases. MK01 and GSK3α revealed regulated
and differentially phosphorylated peptides, each at one particular protein
region (Fig. 4, peptides no. 2, 3 and 5; spectra of peptides 3 and 5 are
depicted in Fig. S4), whereas peptides corresponding to other protein regions
consistently demonstrated that the expression of the protein kinases themselves
was not significantly altered.
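The normalization step (ii) mentioned above can be sketched as follows. The text only states that a trimmed mean algorithm was used, so the details here are assumptions: the trimming fraction of 10% per tail, the use of log ratios, and the names `trimmed_mean` and `normalise_channels` are all hypothetical. The isotopic impurity correction of step (i) is not shown.

```python
import numpy as np

def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding `trim_fraction` of the values at each end."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(trim_fraction * len(v))        # number of values cut per tail
    return v[k:len(v) - k].mean()

def normalise_channels(i114, i117, trim_fraction=0.1):
    """Rescale channel 117 so the trimmed mean log ratio 117/114 becomes zero."""
    i114 = np.asarray(i114, dtype=float)
    i117 = np.asarray(i117, dtype=float)
    shift = trimmed_mean(np.log(i117 / i114), trim_fraction)
    return i114, i117 / np.exp(shift)

# Made-up intensities: a systematic 1.5-fold loading bias plus one extreme peptide.
i114 = np.array([100.0, 200.0, 300.0, 400.0, 500.0,
                 600.0, 700.0, 800.0, 900.0, 1000.0])
i117 = 1.5 * i114
i117[0] *= 10.0                            # a strongly regulated peptide
n114, n117 = normalise_channels(i114, i117)
print(n117 / n114)
```

Because the extreme peptide falls into the trimmed tail, it does not distort the correction factor, and the remaining peptides end up with a ratio close to 1 while the regulated peptide keeps its deviation.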
Interestingly, this strategy also revealed a significant downregulation of
one unmodified peptide each at MK01 and GSK3α. In the case of MK01,
this peptide (Fig. 4, peptide no. 1) corresponds to the protein region with the
induced phosphorylations. Opposite regulatory events of such peptide pairs
will generally co-exist during the post-translational modification of proteins.
However, many parameters, such as peptide ion suppression in complex
samples, determine whether LC-MS/MS can detect them automatically.
Analyses of GSK3α could not detect fragmentation data of the unmodified
peptide corresponding to the observed regulated phosphorylation at serine 21
(Fig. 4, peptide no. 5). However, a second protein region at GSK3α exhibits
a significant downregulation (Fig. 4, peptide no. 4), strongly suggesting
that a further, so far uncharacterized modification is induced. Thus, statistical
evaluation of regulated unmodified peptides can also be highly beneficial for
defining biologically relevant protein sequences for detailed PTM
characterizations in future studies.
Fig. 4. Likelihood curves of peptides from two human kinases reveal
regulated phosphopeptides and protein regions (for details see Table S1).
Unmodified and phosphorylated peptides belonging to the same protein
kinase are depicted comparatively in the same plot. Phosphorylated amino
acids are marked in red (Regulated Peptides). The majority of peptides from
the kinases MK01 and GSK3α indicate no regulation of the total protein.
Peptides 2, 3 and 5 carrying phosphorylated amino acids in each kinase are
upregulated with variable robustness. The corresponding unmodified protein
region (1) was detected by LC-MS/MS only in the case of MK01, exhibiting
a downregulation as expected. Accordingly, a significant downregulation of
the unmodified peptide 4 suggests an induced modification at this region in
GSK3α.
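The visual comparison of peptide curves against the main cluster, as used in this section, could in principle be supported by a simple overlap test on the IR intervals. The following sketch is only one possible heuristic and not a method from this study: it flags a peptide when its interval overlaps with fewer than half of the other peptides of the same protein. The function names and all interval values are made up for illustration and are not taken from Fig. 4.

```python
def overlaps(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

def flag_outlying_peptides(intervals):
    """Names of peptides whose interval overlaps fewer than half of the others."""
    flagged = []
    for name, interval in intervals.items():
        others = [iv for other, iv in intervals.items() if other != name]
        n_overlap = sum(overlaps(interval, iv) for iv in others)
        if others and n_overlap < len(others) / 2:
            flagged.append(name)
    return flagged

# Hypothetical IR bounds on the gap-free regulation axis (0 = unregulated).
example = {
    "p1": (-0.90, -0.40),   # candidate downregulated unmodified peptide
    "p2": (0.80, 1.50),     # candidate upregulated phosphopeptide
    "p3": (-0.10, 0.10),
    "p4": (-0.15, 0.20),
    "p5": (0.00, 0.30),
    "p6": (-0.20, 0.05),
    "p7": (-0.05, 0.15),
}
print(flag_outlying_peptides(example))
```

With these made-up intervals, only `p1` and `p2` are flagged, while the five mutually overlapping peptides form the main cluster. Any flagged peptide would still need the manual checks described in the text before being interpreted biologically.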
3 DISCUSSION
iTRAQ™ has gained great importance in basic research and, in
particular, is the method of choice for in vivo animal models and
clinical projects relying on patient samples. iTRAQ™ provides
regulatory information at the peptide level and initial bioinformatics
studies focused on data integration and the accurate calculation
of protein expressions. A previous study already described a
positive correlation between low iTRAQ™ reporter intensities
and low qualities of regulatory information and suggested an
intensity threshold to minimize the frequency of false positive
regulatory events (Lin et al., 2006). Actually, variations of regulatory
information might simply reflect ‘noise’ but might also indicate
meaningful post-translational modifications. Proteins are subject to
post-translational modifications, and these structural modifications
essentially mediate their enzymatic activities by intra- and
intermolecular interactions. In conclusion, high-impact proteome
studies should preferably provide regulatory data about protein
expression without losing indications of structural alterations
caused by PTMs.
This study therefore aims at the data-dependent evaluation of
regulatory information at the protein level and peptide level.
We looked for a bioinformatics concept that intuitively allows
scientists to explore the impact of regulatory data. Noise models
were already suggested for peptide ion detection in general (Du
et al., 2008) but also for iTRAQ™ reporter intensities (Klawonn
et al., 2006). We suggested and proved satisfactorily a procedure
to specify the signal-to-noise features of MS workflows based on
repeated measurements of iTRAQ™ reporters released from just
five different synthetic peptides. We presume that the quality of the
noise model directly profits from the number of data points that
cover the dynamic range of the used MS detector representatively.
In fact, our established noise model calculates a characteristic
simulation of iTRAQ™ intensities as revealed by comparative
proteome experiments. Instead of the expected 5% outliers, 2.84%
were counted indicating the robust and conservative character of the
established noise model.
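The outlier count reported above amounts to checking how many observed reporter intensities fall outside the band predicted by the noise model. The sketch below assumes that the expected 5% corresponds to a two-sided 95% band of ±1.96 predicted standard deviations under Gaussian noise; this reading, the function name `outlier_fraction` and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def outlier_fraction(observed, predicted, predicted_sd, z=1.96):
    """Fraction of observations lying outside predicted ± z * predicted_sd."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    predicted_sd = np.asarray(predicted_sd, dtype=float)
    outside = np.abs(observed - predicted) > z * predicted_sd
    return outside.mean()

# Toy data: two of the four values lie outside their band.
frac = outlier_fraction(observed=[10.0, 12.0, 30.0, 9.0],
                        predicted=[10.0, 10.0, 10.0, 10.0],
                        predicted_sd=[1.0, 1.0, 1.0, 1.0])
print(frac)
```

A fraction below the nominal 5% on real data, such as the 2.84% reported here, would indicate that the predicted bands are somewhat wider than necessary, i.e. that the model errs on the conservative side.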
Once a workflow-specific noise model is established, scientists
can utilize this as a tool to evaluate robustness of regulatory
information based on the underlying iTRAQ™ reporter ion
intensities. Regulatory information can be presented as likelihood
curves for each peptide that in addition to the most likely regulatory
factor exhibit the chance of variable regulations. The shape of
these curves depends on the specific signal-to-noise features of both
reporters from corresponding peptides. The scientist thus remains
in charge to select the level of data integration: likelihood curves at
the protein level can summarize cumulatively available information
from all protein-related peptides. This type of presentation can
indicate an altered protein expression or stability if the protein
curve does not overlap with those of unregulated proteins (RF = 1).
In contrast, a comparative view of likelihood curves established
for each individually measured peptide belonging to one protein
is the method of choice to reveal any inconsistently regulated
peptide (peptide curve does not overlap with the main cluster).
These regulations can refer to meaningful biological processes,
such as regulated PTMs, but also bear the risk of false positive
results. Scientists at least have to check whether the differentially
regulated peptide sequences (i) were identified unambiguously and (ii) can
be assigned to only one protein, and (iii) whether the fragmentation
spectra were compromised by reporter ions from co-eluting and
partially co-extracted peptides during the process of fragmentation.
Furthermore, the process of normalization might require adaptations
with respect to the investigated biological process. A trimmed mean
(instead of a mean or median approach) is recommended for complex
datasets and if regulations in one direction only cannot be ruled out.
By considering all these aspects a workflow-specified noise model
should be able to cover many sources of artificial results facilitating
the detection of biologically relevant events.
The investigated kinases of this study exhibit consistent regulatory
behaviour of many unmodified peptides underscoring the general
robustness of iTRAQ™. In addition, some phosphorylated and
unmodified peptides were observed to be differentially regulated, as
suggested by their corresponding likelihood curves. Actually,
these phosphorylation events are well-described structural elements
reversibly controlling the activation and inhibition of these kinases
(Cohen and Frame, 2001; Rubinfeld and Seger, 2005). Furthermore,
the regulated unmodified peptide of GSK3α might indicate further
induced phosphorylations. Scansite 2.0 (Obenauer et al., 2003)
predicts the first serine of peptide no. 4 as part of a potential
phosphorylation site utilized by DNA-dependent protein kinase
(DNA-PK).
In conclusion, noise model-dependent likelihood calculations of
MS data will help to ‘co-evolve’ state-of-the-art data interpretation
strategies for upcoming quantitative PTM studies (Kim and White,
2006; Rosenzweig et al., 2008). Phosphoproteome studies can nowadays
yield hundreds or even thousands of phosphorylation events,
and databases have already been established to summarize species- and
sample-specific results (Diella et al., 2004; Gnad et al., 2007). The
presented concept of likelihood curve calculations can serve as an
ideal basis for establishing visualization-aided data-mining tools and
in parallel can facilitate automated high-throughput screenings in
complex datasets.
4 OUTLOOK
Noise characteristics should be analyzed comparatively across different
standard MS workflows. This will help to understand the rational
basis of noise observed in iTRAQ™ data and, in parallel, might
yield workflow-specific equations with minimized numbers
of adjustable parameters. The present data already underscore that
increasing regulation factors necessitate an increasing difference in
reporter ion intensities, which necessarily requires one reporter with
low signal intensity and thus low signal quality. Consequently,
significant differences between biological samples will always tend
towards lower robustness concerning the true regulation factor.
Therefore, future studies should establish noise models at the level
of the MS-specific dynamic range of regulatory factors instead of
the dynamic range of iTRAQ™ reporter intensities. Furthermore,
cluster analysis strategies have to be implemented in order to
detect biologically relevant regulatory events in complex datasets
automatically. The characterization of PTMs following the detection
of regulated unmodified peptides is still a challenging task but at
least can be restricted to a particular protein region.
ACKNOWLEDGEMENTS
We thank Kirsten Minkhart and Reiner Munder for excellent
technical assistance, Thorsten Johl for supplementary work and
Dr Uwe Kärst for helpful comments and proof reading of the article.
Conflict of Interest: none declared.
REFERENCES
Bantscheff,M. et al. (2007) Quantitative mass spectrometry in proteomics: a critical
review. Anal. Bioanal. Chem., 389, 1017–1031.
Birchmeier,C. et al. (2003) Met, metastasis, motility and more. Nat. Rev. Mol. Cell
Biol., 4, 915–925.
Boehm,A.M. et al. (2007) Precise protein quantification based on peptide quantification
using itraq. BMC Bioinformatics, 8, 214.
Cohen,P. and Frame,S. (2001) The renaissance of gsk3. Nat. Rev. Mol. Cell Biol., 2,
769–776.
Diella,F. et al. (2004) Phospho.elm: a database of experimentally verified
phosphorylation sites in eukaryotic proteins. BMC Bioinformatics, 5, 79.
Du,P. et al. (2008) A noise model for mass spectrometry based proteomics.
Bioinformatics, 24, 1070–1077.
Gnad,F. et al. (2007) Phosida (phosphorylation site database): management, structural
and evolutionary investigation, and prediction of phosphosites. Genome Biol., 8,
R250.
Hoaglin,D.C. et al. (2000) Understanding Robust and Exploratory Data Analysis. Wiley,
New York.
Hu,J. et al. (2006) Optimized proteomic analysis of a mouse model of cerebellar
dysfunction using amine-specific isobaric tags. Proteomics, 6, 4321–4334.
Kim,J.-E. and White,F.M. (2006) Quantitative analysis of phosphotyrosine signaling
networks triggered by cd3 and cd28 costimulation in jurkat cells. J. Immunol., 176,
2833–2843.
Klawonn,F. et al. (2006) A maximum likelihood approach to noise estimation
for intensity measurements in biology. In Proceedings of the Sixth IEEE
International Conference on Data Mining Workshops ICDM Workshops 2006,
IEEE, Los Alamitos (CA), pp. 180–184.
Lin,W.-T. et al. (2006) Multi-q: a fully automated tool for multiplexed protein
quantitation. J. Proteome Res., 5, 2328–2338.
Obenauer,J.C. et al. (2003) Scansite 2.0: proteome-wide prediction of cell
signaling interactions using short sequence motifs. Nucleic Acids Res., 31,
3635–3641.
Ong,S.-E. et al. (2002) Stable isotope labeling by amino acids in cell culture, silac, as
a simple and accurate approach to expression proteomics. Mol. Cell Proteomics,
1, 376–386.
Pierce,A. et al. (2007) Eight-channel itraq enables comparison of the activity of 6
leukaemogenic tyrosine kinases. Mol. Cell Proteomics, 7, 853–863.
Rosenzweig,D. et al. (2008) Post-translational modification of cellular proteins during
leishmania donovani differentiation. Proteomics, 8, 1843–1850.
Ross,P.L. et al. (2004) Multiplexed protein quantitation in saccharomyces cerevisiae
using amine-reactive isobaric tagging reagents. Mol. Cell Proteomics, 3, 1154–1169.
Rubinfeld,H. and Seger,R. (2005) The erk cascade: a prototype of mapk signaling. Mol.
Biotechnol., 31, 151–174.
Shadforth,I.P. et al. (2005) i-tracker: for quantitative proteomics using itraq. BMC
Genomics, 6, 145.
Trost,M. et al. (2005) Comparative proteome analysis of secretory proteins from
pathogenic and nonpathogenic listeria species. Proteomics, 5, 1544–1557.
Tu,Y. et al. (2002) Quantitative noise analysis for gene expression microarray
experiments. Proc. Natl Acad. Sci. USA, 99, 14031–14036.
Walsh,C.T. et al. (2005) Protein posttranslational modifications: the chemistry of
proteome diversifications. Angew. Chem. Int. Ed. Engl., 44, 7342–7372.
Weng,L. et al. (2006) Rosetta error model for gene expression analysis. Bioinformatics,
22, 1111–1121.
Wissing,J. et al. (2007) Proteomics analysis of protein kinases by target class-selective
prefractionation and tandem mass spectrometry. Mol. Cell Proteomics, 6, 537–547.