BIOINFORMATICS ORIGINAL PAPER
Vol. 25 no. 8 2009, pages 1004–1011
doi:10.1093/bioinformatics/btn551

Gene expression

MS-specific noise model reveals the potential of iTRAQ in quantitative proteomics

C. Hundertmark1,†, R. Fischer1,†, T. Reinl1, S. May2, F. Klawonn2,∗ and L. Jänsch1,∗
1 Helmholtz Centre for Infection Research, Department for Cell Biology, Inhoffenstr. 7, D-38124 Braunschweig and 2 University of Applied Sciences BS/WF, Department of Computer Science, Salzdahlumer Str. 46/48, D-38302 Wolfenbuettel, Germany

Received on May 15, 2008; revised on October 8, 2008; accepted on October 18, 2008
Advance Access publication October 24, 2008
Associate Editor: David Rocke

ABSTRACT
Motivation: Mass spectrometry (MS) data, like data from many other analytical methods, are impaired by noise. Therefore, proteomics requires statistical approaches to determine the reliability of regulatory information if protein quantification is based on ion intensities observed in MS.
Results: We suggest a procedure to model the instrument- and workflow-specific noise behaviour of iTRAQ™ reporter ions, which can provide regulatory information during automated peptide sequencing by LC-MS/MS. The established mathematical model representatively predicts possible variations of iTRAQ™ reporter ions in an MS data-dependent manner. The model can be used to calculate the robustness of regulatory information systematically at the peptide level in so-called bottom-up proteome approaches. It allows the best-fitting regulation factor to be determined and, in addition, the probability of alternative regulations to be calculated. The result can be visualized as likelihood curves summarizing both the quantity and the quality of regulatory information. In principle, likelihood curves can be calculated from all peptides belonging to different regions of proteins if they are detected in LC-MS/MS experiments.
Therefore, this approach offers excellent opportunities to detect and statistically validate dynamic post-translational modifications, which usually affect only particular regions of the whole protein. The detection of known phosphorylation events at protein kinases served as a first proof of concept in this study and underscores the potential of noise models in quantitative proteomics.
Contact: [email protected]; f.klawonn@fh-wolfenbuettel.de
Supplementary information: Supplementary data are available at Bioinformatics online.

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

1 INTRODUCTION
The success of comparative proteome studies depends decisively on the capability to accurately quantify hundreds to thousands of the investigated components. Traditional 2D-PAGE analyses allow regulatory investigations directly at the level of proteins, which can be quantified based on their spot volumes. Alternatively, after enzymatic cleavage of proteins the resulting peptides can be automatically sequenced in so-called bottom-up proteome approaches based on the combination of liquid chromatography with a suitable MS2 device (LC-MS). Unfortunately, it is still a challenging task to directly correlate peptide ion intensities in LC-MS with the abundance of the corresponding protein (label-free quantification), and the majority of quantitative LC-MS proteome projects rely on protein or peptide labelling technologies (Bantscheff et al., 2007). Besides SILAC (Ong et al., 2002), chemical labelling of peptides with iTRAQ™ is becoming an important and widespread quantitative technology based on LC-MS. In contrast to SILAC, iTRAQ™ labels peptides from biochemically purified proteins and thus also allows the investigation of clinical samples obtained from patients.
iTRAQ™ molecules enable multiplexed differential labelling and relative peptide quantification of up to either four or eight different samples in parallel (Pierce et al., 2007; Ross et al., 2004). During the labelling process only one type of iTRAQ™ molecule is linked covalently to all peptides from one biological sample. Identical peptides (with the same amino acid sequence) from different biological samples that were labelled differentially and subsequently pooled still exhibit identical biochemical properties and total masses. Consequently, corresponding peptides from different samples co-elute from chromatographic columns, enter the MS device with the same molecular weight, and are subjected together to the fragmentation process. Under the MS2 conditions of peptide sequencing these iTRAQ™ labels also produce fragment ions that differ in mass and serve as sample-specific reporters (114.1, 115.1, 116.1 and 117.1 Da). The ratios of the released iTRAQ™ reporter ions correlate with the relative abundance of the analysed peptides in the investigated samples.

Since a bottom-up approach can nowadays produce thousands of fragment ion patterns per hour, the extraction of regulatory information from the individual peptide fragmentation patterns requires bioinformatics strategies. Several tools and services were recently suggested and established to calculate and visualize regulatory data obtained by iTRAQ™. All of these approaches clearly aim at state-of-the-art analyses of protein expression information: they successfully presented mathematical concepts that consider isotopic impurities, normalize regulatory information and, in particular, perform statistical data analyses and integration with a cumulative view on reporter intensities from peptides belonging to the same protein (Boehm et al., 2007; Lin et al., 2006). Furthermore, it became obvious in recent studies that the intensity of the reporter ions correlates positively with the quality of the derived regulatory information (Hu et al., 2006), but the statistical basis of this relation has not been established so far. With a focus on robust protein expression analyses, it was recommended to consider only reporter ions above a certain threshold in order to reduce the number of unexpectedly regulated peptides. However, proteins are not exclusively regulated at the expression level, and divergently regulated peptides can point to important biological processes. More than 200 different post-translational modifications (PTMs) are described in the literature (Walsh et al., 2005), and modified individual amino acids decisively control protein enzyme activities and protein–protein interactions.

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

In this study, we present a noise model to describe the behaviour of iTRAQ™ reporter intensities at the level of peptides. Noise models for microarray data have enjoyed great popularity for several years (Tu et al., 2002; Weng et al., 2006), whereas noise models for MS data are only now appearing (Du et al., 2008). We demonstrate that analyses of iTRAQ™ labelled peptides can representatively specify the noise characteristics of individual proteome workflows. The resulting noise model allows the data-dependent distribution of iTRAQ™ reporter intensities to be simulated. Importantly, the established mathematical model can in turn provide information on the robustness of experimentally observed reporter intensities. We present here an intuitive concept for the visualization-aided exploration of regulatory iTRAQ™ data based on likelihood curves that precisely depict the regulatory information in an MS data-dependent way.
This strategy supports data integration as required for protein expression studies and, in particular, represents the state of the art in characterizing regulations at the peptide level. As an example, we demonstrate that regulatory information from peptides with significantly variant behaviour can be correlated unambiguously with known regulatory phosphorylation events at protein kinases.

2 METHODS AND RESULTS
2.1 Experimental settings
Proteins were digested with trypsin, and the resulting peptides were automatically sequenced by LC-MS and identified as described previously (Trost et al., 2005). For iTRAQ™ experiments the protocol was adapted as follows: digestion was always carried out in 50 mM TEAB, which allows the iTRAQ™ kit labelling reagents to be applied directly. Labelling by iTRAQ™ was done in all experiments according to the manufacturer’s protocol (Applied Biosystems, Foster City, CA, USA). In brief: aliquots of iTRAQ™ reagents recommended for the labelling of 100 µg protein were applied to peptides corresponding to a maximum of only 40 µg total protein and the reactions were carried out for at least 1 h. The completeness of the labelling reactions was tested during peptide and protein identification by Mascot (V 2.1, Matrix Science, Boston, MA, USA). Unlabelled peptides were not observed in any experiment described in this study. Peptides were purified on small C18 columns (ZipTip, Millipore, Billerica, MA, USA) and the equivalent of one ZipTip was used for LC-MS experiments. Peptides were separated using a nanoAcquity HPLC (Waters) and BEH material (Waters) as an analytical chromatographic column (150 mm × 75 µm, 1.7 µm) that was linked to a QTOFmicro instrument equipped with a PicoTip™ emitter (distal coated, New Objective, Woburn, MA, USA). Double- and triple-charged peptide ions were automatically and individually selected by the quadrupole unit, and mainly b- and y-ions were generated during their fragmentation in the collision cell (MS/MS mode).
Collision cell energy was increased by about 10% compared with protocols for non-labelled peptides, as recommended by Ross et al. (2004). Post-processing of fragmentation data, which yielded representative individual ion intensities for peptide sequencing and iTRAQ™ quantification, was performed as previously described (Trost et al., 2005). In brief: four MS scans at a time were combined into one spectrum and smoothed based on Savitzky–Golay (no. of smoothes 2, 7 channels), and centroid spectra were generated (minimum peak width at half height 4, top 80%).

To generate the training dataset that was used to estimate the noise parameters, the following unmodified and phosphorylated (*) peptides were synthesized and labelled with iTRAQ™ 115: IADPEHDHTGFLTEYVATRWYR, VSDFGLTK, FVLDDQY*TSSTGTKFPVK, HERPAGPGT*PPPDSGPLAK, SST*VTEAPIAVVTSR. The peptides were diluted in 65% methanol, 5% formic acid to a concentration of 4.4 µM and injected into a Q-TOFmicro (Waters) using a syringe pump. Using these peptides and different collision energies (12 eV, 16 eV, 20 eV, 22 eV, 24 eV, 26 eV, 28 eV and 30 eV), the variation of reporter signals was analysed at 24 different intensity levels. About 100 fragmentation spectra were acquired for each intensity level and data points for the training dataset were processed as described before.

For the generation of a 1:1 peptide sample, a protein mixture containing 10 µg of the proteins ovalbumin, conalbumin, aldolase, ferritin, l-lactate dehydrogenase A and thyroglobulin (GE Healthcare, Chalfont St. Giles, Buckinghamshire, UK) was digested with trypsin (1:50, Promega, Sequencing Grade, Madison, Wisconsin, USA) in 50 mM TEAB. The sample was split 1:1 and differentially labelled with iTRAQ™ reagents, and the re-combined sample was analysed by LC-MS. Additional details on the design of the biological approaches, i.e.
(i) the investigation of intensity variations of labelled peptides derived from total cell lysates of immune cells (B cells) and (ii) the analysis of regulated peptide modifications of protein kinases in HGF/Met-activated signalling studies, are included in Sections 2.3 (B cells) and 2.6 (HGF/Met) and in the Supplementary Material (Figures S2/S3).

2.2 Noise model
In accordance with Lin et al. (2006) and Hu et al. (2006), we routinely observe an ion intensity-dependent accuracy of our regulatory data. This effect can be representatively investigated and visualized using the following test system: a protein sample was digested and split, and the resulting peptides were labelled differentially with iTRAQ™ tags 115 and 117 before both fractions were re-combined in a 1:1 ratio and analysed by LC-MS/MS. Subsequently, we isolated the iTRAQ™ reporter intensities from all spectra, corrected the data based on the known isotopic impurities and compared the trimmed mean values from both samples to calculate and apply a normalization factor (<1). From such preprocessed data, we derived logarithmic peptide ratios as presented in Figure 1. The peptide expression accuracy depends on the iTRAQ™ reporter ion intensity: decreasing iTRAQ™ reporter ion intensities coincide with increased deviations from the expected 1:1 ratios.

In Klawonn et al. (2006), we suggested a statistical approach to estimate the noise of quantitative protein analyses based on iTRAQ™. In brief, we assume that the available data

$$x_1^{(1)}, \ldots, x_1^{(k_1)}, \; \ldots, \; x_n^{(1)}, \ldots, x_n^{(k_n)} \tag{1}$$

consist of $n$ tuples $(x_i^{(1)}, \ldots, x_i^{(k_i)})$, each representing $k_i$ ($k_i \geq 2$) noisy measurements of the same unknown (preprocessed) logarithmic intensity $\mu_i$. We assume that each tuple originates from independent samples with normally distributed data, with unknown mean $\mu_i$ and unknown variance $\sigma_i^2$. In accordance with recent studies (e.g.
Hu et al., 2006; Lin et al., 2006), our experiments have shown that the relative ratio of noise decreases for larger iTRAQ™ reporter intensities. Various physical causes, such as variable ion suppression effects, signal amplifier circuits and iTRAQ™ fragmentation specificities, might contribute to this effect. Currently, these contributions can hardly be dissected and calculated individually to produce an exact model for iTRAQ™ data. So far, a rational noise model has been established and experimentally validated only for the detection of peptide ions in noisy MS spectra. Du et al. (2008) recently demonstrated that the combination of a multinomial and a Poisson model can fit observed to expected isotopic distributions of peptide ions. This concept basically belongs to de-isotoping approaches and provides equations for calculating the proportion of 13C-containing peptide fractions according to the natural frequency of this isotope. In contrast, iTRAQ™ signals reflect the abundance of labelled peptides in biological samples and are thus not predictable per se. Only a minor portion of the total intensities is formed by expected, though not calculable, isotope distributions, also known as isotopic impurities.

Fig. 1. Normalized logarithmic reporter ion intensities from equal sets of peptides that were differentially labelled with iTRAQ™ reagents 115 and 117 reveal intensity-dependent deviations.

Therefore, and until the rational basis of iTRAQ™ noise has been established in detail, a representative approximation of the noise characteristics was achieved in this study using the equation

$$\sigma(\mu) = h(\mu; a, r, \lambda) = a + r e^{-\lambda\mu} \tag{2}$$

with $a, r, \lambda \geq 0$. The parameter $a$ represents the absolute noise in the measurement, while $r$ and $\lambda$ determine the intensity-dependent noise and its decrease. The estimation of the parameters $a$, $r$ and $\lambda$ based on the sample (1) requires that we estimate the true unknown intensity $\mu_i$ for each subsample. For the estimation of the unknown parameter vector $\theta = (a, r, \lambda)$ the maximum likelihood principle leads to the likelihood function

$$L\left(x_1^{(1)}, \ldots, x_1^{(k_1)}, \ldots, x_n^{(1)}, \ldots, x_n^{(k_n)} \mid \theta\right) = \prod_{i=1}^{n} \prod_{j=1}^{k_i} \frac{1}{h(\mu_i;\theta)\sqrt{2\pi}} \exp\left(-\frac{(x_i^{(j)} - \mu_i)^2}{2h^2(\mu_i;\theta)}\right). \tag{3}$$

Since the maximization of the log-likelihood is equivalent to the maximization of the likelihood itself, we consider the log-likelihood ($\tilde{L}$). Assuming the parameters $a$, $r$ and $\lambda$ to be fixed for the moment, the $\mu_i$-values are optimized by maximizing the log-likelihoods $\tilde{L}_i$ of the individual tuples. By setting $d\tilde{L}_i / d\mu_i = 0$ we finally get

$$\sum_{j=1}^{k_i} \left[ \lambda r e^{-\lambda\mu_i}\,(a + r e^{-\lambda\mu_i})^2 + (x_i^{(j)} - \mu_i)\left( (a + r e^{-\lambda\mu_i}) + (x_i^{(j)} - \mu_i)(-\lambda r e^{-\lambda\mu_i}) \right) \right] = 0. \tag{4}$$

For optimization of the parameters $a$, $r$ and $\lambda$, the fitness of a parameter combination $\theta = (a, r, \lambda)$ is given by (3). The parameters were fitted in an expectation maximization (EM)-like fashion with numerical steps if required.

2.3 Definition of workflow specific noise parameter
The parameters $a$, $r$, $\lambda$ are not universal and depend on the type of mass spectrometer in general as well as on the data acquisition settings. To estimate these parameters we generated a training dataset based on five peptides to determine the intensity-dependent noise. Two unmodified peptides were included in this approach, which help to cover a broad dynamic range of reporter intensities: IADPEHDHTGFLTEYVATRWYR can only be labelled once with iTRAQ™, at the N-terminus. Its low reporter signal capacity is further decreased by many fragmentation products causing significant ionization suppression effects. In contrast, VSDFGLTK contains two iTRAQ™-reactive primary amino groups (N-terminus and lysine) and exhibits significantly better ionization efficiency of the released iTRAQ™ 115 reporter. Additionally, we included three phosphorylated peptides in this approach in order to substantiate the data pool of the training dataset and to guarantee that it is also representative of modified peptides.
Measurements were repeated for all five peptides using different collision energies, thus covering different absolute levels of iTRAQ™ reporter intensities. In this way, we repeatedly measured reporter intensity variations at 24 different intensity levels, resulting in 552 individual reporter intensities that define the training dataset (Supplementary Material, Fig. S1). From this dataset, we estimated the noise parameters for the mass spectrometer and individual settings used as follows: $a = 0.0103$, $r = 0.9908$ and $\lambda = 0.4751$. These parameters allow the noise of iTRAQ™ reporter ions to be estimated depending on their individual intensity. For the calculation of a 95% interval for an unregulated sample ($\mu_1 = \mu_2$) we assume a normal distribution with mean $\mu = \mu_1 - \mu_2 = 0$ and SD $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$, where $\sigma_i = a + r e^{-\lambda\mu_i}$. Since $\sigma_1 = \sigma_2$, it follows that $\sigma = \sqrt{2\sigma_1^2}$.

The noise parameters were estimated from data that were generated under artificial analytical conditions. In order to validate the noise model in an LC-MS/MS workflow of the kind used for the analysis of complex protein samples, we labelled equal amounts of a digested protein mixture containing the proteins ovalbumin, conalbumin, aldolase, ferritin, thyroglobulin and l-lactate dehydrogenase A (GE Healthcare) with iTRAQ™ tags 115 and 117 and combined the labelled peptides in a 1:1 ratio (test dataset). Using the model parameters $a$, $r$ and $\lambda$ derived from the training dataset, we calculated the expression ratios for 563 MS/MS spectra that were recorded for the 1:1 peptide mixture (Fig. 2A). According to the training dataset we expected 5% outliers. Of the 563 peptide expression ratios, 16 are located outside the borders (2.84%), thus confirming the representative character of the applied training strategy. Additionally, we analysed a real biological sample in a similar approach, i.e. peptides derived from total cell lysates of immune cells (B cells).
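The noise function of equation (2) and the 95% interval for an unregulated pair described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Java implementation: the function names are hypothetical, µ is assumed to be the natural logarithm of the reporter intensity, and the interval is assumed to be the central normal interval with quantile 1.96, which the text does not state explicitly.

```python
import math

# Noise parameters reported in the text for the authors' workflow.
A, R, LAM = 0.0103, 0.9908, 0.4751


def sigma(mu, a=A, r=R, lam=LAM):
    """Noise SD at logarithmic intensity mu: a + r * exp(-lam * mu), equation (2)."""
    return a + r * math.exp(-lam * mu)


def unregulated_interval(mu, z=1.959964):
    """Central 95% interval for the log ratio of an unregulated pair.

    Both channels are assumed to share the true log intensity mu, so the
    difference has mean 0 and SD sqrt(2) * sigma(mu). The quantile z is an
    assumption of this sketch.
    """
    half_width = z * math.sqrt(2.0) * sigma(mu)
    return -half_width, half_width


# The interval narrows as the reporter intensity grows.
for intensity in (10, 100, 1000):
    low, high = unregulated_interval(math.log(intensity))
    print(f"intensity {intensity:>4}: log-ratio interval [{low:+.3f}, {high:+.3f}]")
```

A peptide whose observed log ratio falls outside this interval at its intensity level would count as an outlier in the sense used for the test dataset.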
Again, the 95% interval of the noise model was in good accordance with the experimentally observed variations (1.9% outliers, Fig. S2). Thus, we assume that five iTRAQ™ labelled peptides are sufficient to estimate the noise parameters of an individual MS workflow. We have also simulated the variation of iTRAQ™ reporter intensities systematically at variable intensity levels based on our noise model. The resulting in silico data points exhibit striking similarities with all experimental results (Fig. 2B). We compared our noise model $\sigma(\mu) = h(\mu;\theta) = a + r e^{-\lambda\mu}$, $\theta = (a, r, \lambda)$, with further potential noise models $h(\mu;\theta)$ that might also be able to simulate a representative distribution of iTRAQ™ intensities as experimentally observed (Table S3), but none of them produced a satisfying result. An additional estimation of the parameters based on the median absolute deviation (MAD) instead of maximum likelihood estimation (MLE) yielded similar results. However, since outliers have less effect on the noise parameters estimated by MAD, we selected the more conservative method based on MLE, accepting some missed true positives. In accordance with Boehm et al. (2007), we assume that the measurement of each intensity results from the true intensity plus noise following a lognormal distribution (with mean zero, i.e. the measurement error is unbiased).

Fig. 2. (A) Experimental test dataset comprising expression ratios (logarithms) of 563 peptides. Tryptically generated peptides from six different proteins were differentially labelled with iTRAQ™ 115 or 117, mixed in the ratio 1:1 and then analysed by MS. Of the peptide regulation factors, 2.84% are located outside the 95% interval calculated from the training dataset. (B) Simulation of a dataset based on the training dataset shows striking similarities to the intensities detected in the experimental test dataset. Every integer intensity between 1 and 2000 ($e^{\mu}$) was used for the generation of two noisy intensities (sim1 and sim2) according to the noise model with the parameters $a = 0.0103$, $r = 0.9908$ and $\lambda = 0.4751$. The values $\ln(\mathrm{sim1}/\mathrm{sim2})$ are plotted against $\mu$.

For verification, we performed a power transformation with the training dataset. Power transformations of the form

$$T_p(x) = \begin{cases} \dfrac{x^p - 1}{p} & \text{if } p \neq 0, \\ \ln x & \text{if } p = 0 \end{cases}$$

are a very common technique in data analysis used for various purposes. Here, the special case of a (power) transformation for equal spread (see, for instance, Hoaglin et al., 2000) is of interest, which reduces the difference in the variances at the different intensity levels. For the training dataset the estimated value of $p$ was near zero ($p = 0.0017$); since $p = 0$ corresponds to the logarithm, this provides another justification for the logarithmic transformations applied to the data. Java classes that were used to estimate the noise parameters from test datasets are provided on request.

2.4 Calculation of regulatory information
Similar to Shadforth et al. (2005) or Lin et al. (2006), we perform a standard procedure to preprocess the acquired MS data. All fragmentation ions are smoothed and centred using MassLynx 4.1 (Waters) before the peaks of the actual iTRAQ™ reporters are detected using Mascot Parser 2.1.00 (Matrix Science). Mass shifts were compensated automatically and, in case of multiple signals near the expected mass, we optimized peak detection by scanning for a pattern that fits the specified iTRAQ™ label masses. Since each iTRAQ™ reporter type can contribute a few percent of its own ion intensity to the neighbouring signals (isotopic impurity), a small fraction of each iTRAQ™ molecule batch was routinely subjected to individual fragmentation experiments in order to certify its signal specificities.
Subsequently, the certified percentage of isotopic impurities of each applied iTRAQ™ molecule type is considered in a linear system of equations, which allows the actual ion abundances of differentially labelled peptides to be calculated precisely. The last preprocessing step is the normalization of samples. This is done by comparing the trimmed mean values of all samples and adjusting, by multiplicative correction, the intensities of samples with high peptide concentration to those of the sample with the lowest concentration. For the calculation of the normalization factor the mean was calculated after discarding the upper and lower 20% of the sample’s intensities (trimmed mean).

The expression ratio of a single peptide can simply be determined by dividing the iTRAQ™ ion intensities if only one fragmentation experiment (MS/MS) was generated within an experiment. However, as discussed in Hu et al. (2006), we need a different approach if multiple MS/MS spectra match one peptide. According to the noise model, expression ratios derived from high intensities are more representative than expression ratios derived from low intensities. Therefore, reporter ion ratios that all have to be assigned to the same peptide should be integrated in a weighted manner depending on their reporter intensities and their corresponding signal qualities. In order to find the most likely expression ratio for a peptide we scan all expression ratios $c_j$ between the lowest ($c_{\min}$) and the highest ($c_{\max}$) observed for all MS/MS spectra that match the peptide under consideration. For each scanned expression ratio $c_j$ and each measured pair of intensities $(x_i, y_i)$, $1 \leq i \leq n$, where $n$ is the number of matched MS/MS spectra, a suitable pair of intensities is calculated by setting $\mu_{x_i} = c_j \mu_{y_i}$, where $\mu_{x_i}$ and $\mu_{y_i}$ are the true, unknown intensities of the measured intensities $x_i$ and $y_i$, respectively.
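The trimmed-mean normalization described in the preprocessing steps above can be illustrated with a short Python sketch. The function names and the channel labels are hypothetical, and rounding the number of discarded values down to an integer is an assumption of this sketch; the text only specifies that the upper and lower 20% are discarded and that samples are scaled multiplicatively to the one with the lowest concentration.

```python
def trimmed_mean(values, trim=0.20):
    """Mean after discarding the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("too few values for the requested trimming")
    return sum(kept) / len(kept)


def normalize_channels(channels, trim=0.20):
    """Scale each channel so that its trimmed mean matches the lowest one.

    `channels` maps a (hypothetical) channel label to a list of reporter
    intensities. Returns the scaled intensities and the factors applied;
    every factor is <= 1 because the lowest channel is the target.
    """
    means = {label: trimmed_mean(vals, trim) for label, vals in channels.items()}
    target = min(means.values())
    factors = {label: target / m for label, m in means.items()}
    scaled = {label: [v * factors[label] for v in vals]
              for label, vals in channels.items()}
    return scaled, factors


# Made-up intensities for two reporter channels of the same peptides.
example = {"115": [10, 20, 30, 40, 50], "117": [20, 40, 60, 80, 100]}
scaled, factors = normalize_channels(example)
print(factors)
```

Using the trimmed rather than the plain mean keeps a few extreme reporter intensities from dominating the normalization factor.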
To find the best expression ratio $c_j$ for all $n$ matched MS/MS spectra, we compute the likelihood function

$$L(x, y, c_j) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_{x_i}} \exp\left(-\frac{(x_i - c_j\mu_{y_i})^2}{2\sigma_{x_i}^2}\right) \frac{1}{\sqrt{2\pi}\,\sigma_{y_i}} \exp\left(-\frac{(y_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}\right) \tag{5}$$

with $\sigma_{x_i} = a + r e^{-\lambda\mu_{x_i}}$ and $\sigma_{y_i} = a + r e^{-\lambda\mu_{y_i}}$. Logarithmic transformation and omitting the constant parts leads to the log-likelihood function

$$\tilde{L}(x, y, c_j) = \sum_{i=1}^{n} \left[ -\ln(\sigma_{x_i}) - \frac{(x_i - c_j\mu_{y_i})^2}{2\sigma_{x_i}^2} - \ln(\sigma_{y_i}) - \frac{(y_i - \mu_{y_i})^2}{2\sigma_{y_i}^2} \right]. \tag{6}$$

According to the noise model, the most suitable overall expression ratio $c_j$ for all matched MS/MS spectra of a peptide that was found multiple times is the one that results in the maximum log-likelihood. The calculation of a protein expression ratio takes place in the same manner. Instead of regarding the expression ratios of a single peptide matched repeatedly, we compute the most representative regulation factor for all peptides of the observed protein. This is done by calculating the likelihoods for all regulation factors $c_j$ to be considered ($c_{\min} \leq c_j \leq c_{\max}$ with a sufficiently small step size for $c_j$) and choosing the one that results in the maximum likelihood. Hence, for the estimation of the protein expression ratio, $n$ is the total number of all MS/MS spectra matched to a peptide of the protein under consideration, $c_{\min}$ is the lowest expression ratio of a matched MS/MS spectrum and $c_{\max}$ is the highest expression ratio of a matched MS/MS spectrum. Thus, this approach can integrate the information of all experimentally observed results without the necessity to reject data by arbitrary threshold levels. The quality of each iTRAQ™ reporter can be specified by the established noise model based on its individual intensity. Low intensity reporters in combination with those of better signal qualities, i.e. higher intensities, have only a minor effect on the final result, and vice versa. In Table S2, protein expression ratios (derived from the test datasets 1:1 and 1:10) are compared based on MLE, median and mean value. In summary, the presented approach yields better results than the median and the mean value, since the protein expression ratios calculated in this way average closest to 1. This strategy basically allows scientists to observe all clues to regulatory events in a complex dataset but also introduces the need to calculate and present the actual robustness of regulatory information in parallel.

Table 1. Sequences and regulatory information of the peptides from the test protein aldolase (ALDOA) as depicted in Fig. 3 and Fig. S4

Undiluted              Ratio     IR
ALDOA (aldolase)        1.02   0.07
IVAPGK                  1.05   0.19
GGVVGIK                 1      0.18
DGADFAK                 1.04   0.25
VLAAVYK                 1.12   0.20
ELSDIAHR                1.03   0.33
ALQASALK                1.03   0.16
QLLLTADDR              −1.17   0.59
GILAADESTGSIAK         −1.07   0.40
AAQEEYVK                1.01   0.15
ADDGRPFPQVIK            1.01   0.35
LQSIGTENTEENR           1.01   0.18

Tenfold diluted        Ratio     IR
ALDOA (aldolase)       −1.07   0.28
IVAPGK                  1.22   1.28
GGVVGIK                −1.11   0.73
DGADFAK                −1.16   0.56
ALQASALK                1.11   0.72
QLLLTADDR               1.06   1.22
AAQEEYVK               −1.19   0.52
LQSIGTENTEENR          −1.12   1.59

The protein aldolase was digested, the sample was split into aliquots, the peptides were differentially labelled by iTRAQ™ and recombined. The sample was measured undiluted or 10-fold diluted, yielding 11 and 7 peptide sequences, respectively, with corresponding regulatory information. Positive and negative ratios refer to the peak of the likelihood curve and give the multiplicative increase and decrease, respectively. IR: interval of robustness, i.e. the minimal interval that covers 80% of the area under each likelihood curve.

2.5 Visualization of regulatory information
Regulatory data are supposed to be especially reliable and robust if the regulatory information of a peptide or protein is consistent and based on relatively high reporter ion intensities.
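The likelihood scan of Section 2.4 (equations 5 and 6) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: intensities are handled as natural logarithms, so that the relation between the true intensities of the two channels becomes an additive shift of ln(c); the unknown true intensity of the reference channel is profiled out on a simple grid; and all function names, grid sizes and example intensities are hypothetical.

```python
import math

# Noise parameters reported in the text; sigma() follows equation (2).
A, R, LAM = 0.0103, 0.9908, 0.4751


def sigma(mu):
    return A + R * math.exp(-LAM * mu)


def log_normal_term(obs, mu):
    """Log of a normal density at obs with mean mu and SD sigma(mu), constants dropped."""
    s = sigma(mu)
    return -math.log(s) - (obs - mu) ** 2 / (2.0 * s * s)


def pair_loglik(lx, ly, log_c, half_width=1.0, steps=400):
    """Profile log-likelihood of one spectrum for a candidate log ratio.

    lx, ly are log reporter intensities. The true log intensity mu_y is
    unknown, so the best value on a grid around ly is used (sketch assumption).
    """
    best = -math.inf
    for k in range(steps + 1):
        mu_y = ly - half_width + 2.0 * half_width * k / steps
        ll = log_normal_term(lx, mu_y + log_c) + log_normal_term(ly, mu_y)
        best = max(best, ll)
    return best


def best_ratio(pairs, n_candidates=201):
    """Scan candidate ratios between the lowest and highest observed ratio."""
    logs = [(math.log(x), math.log(y)) for x, y in pairs]
    observed = [lx - ly for lx, ly in logs]
    lo, hi = min(observed), max(observed)
    best_c, best_ll = None, -math.inf
    for k in range(n_candidates):
        log_c = lo + (hi - lo) * k / (n_candidates - 1) if hi > lo else lo
        ll = sum(pair_loglik(lx, ly, log_c) for lx, ly in logs)
        if ll > best_ll:
            best_c, best_ll = math.exp(log_c), ll
    return best_c


# One high-intensity spectrum (ratio 2.0) and two low-intensity spectra
# (ratios 2.4 and 1.5): the estimate stays close to the high-intensity ratio.
print(best_ratio([(2000.0, 1000.0), (24.0, 10.0), (15.0, 10.0)]))
```

Because the noise SD shrinks with intensity, the high-intensity spectrum dominates the sum of log-likelihoods, which is the weighting effect described in the text. Passing the spectra of all peptides of one protein instead of one peptide gives the protein-level estimate in the same way.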
We have established a system for the calculation of expression ratios for peptides measured multiple times and for proteins according to our noise model, but we have not yet included information about the robustness of the underlying signal qualities. This is of the highest importance, since we know that even unregulated objects can yield significantly different regulatory results depending on the MS data quality of the reporter signals (see Fig. 2). Therefore, a broader range of possible regulations has to be considered if low reporter intensities were measured and if a protein expression ratio is based on heterogeneously regulated peptides. We selected a visualization-aided approach to display the likelihoods of protein and peptide regulations according to the real data qualities: the x-axis refers to the regulation factor, the y-axis refers to the likelihood given by (6). An example of the visualization of the iTRAQ™ data quality is given for the protein aldolase (ALDOA), which was analysed as part of the test dataset. Similar but not identical regulatory ratios were observed for different aldolase peptides (Table 1, Fig. 3). The established noise model allows us to calculate the likelihood of regulatory values adjacent to the directly observed regulatory ratios. Higher iTRAQ™ reporter intensities correlate positively with the robustness of regulation and thus result in narrower likelihood curves. To display positive and negative regulatory factors symmetrically in the plots we have omitted the range [−1,+1], and regulations <1 are calculated by inverting the measured intensities. For example, the ratios +1.20 and −1.20 refer to a 1.20-fold increase and a 1.20-fold decrease, respectively. The area below the likelihood curve is always normalized to 1 and consequently gives an intuitive impression of the robustness.
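The plotting conventions above can be made concrete with a small Python sketch. It covers the signed-ratio convention, the normalization of a likelihood curve to unit area and the interval of robustness (IR) defined in the note to Table 1. All function names are hypothetical. Collapsing the omitted range [−1,+1] into a single point of the axis is an assumption of this sketch, chosen so that the protein ratios 1.02 and −1.07 of Table 1 lie 0.09 apart, and the area is approximated by a Riemann sum on a uniform grid.

```python
import math


def signed_ratio(ratio):
    """Map a plain ratio to the signed convention: 1.2 -> +1.2, 1/1.2 -> -1.2."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio if ratio >= 1.0 else -1.0 / ratio


def axis_position(signed):
    """Position on a continuous axis with the range [-1, +1] collapsed (assumption)."""
    return signed - 1.0 if signed > 0 else signed + 1.0


def normalize_curve(likelihoods, step):
    """Rescale likelihood values on a uniform grid so that the area equals 1."""
    area = sum(likelihoods) * step
    return [v / area for v in likelihoods]


def interval_of_robustness(grid, density, coverage=0.80):
    """Width of the shortest grid window whose area reaches `coverage`."""
    step = grid[1] - grid[0]
    best = None
    start, mass = 0, 0.0
    for end, value in enumerate(density):
        mass += value * step
        # Shrink from the left while the window still holds enough area.
        while mass - density[start] * step >= coverage:
            mass -= density[start] * step
            start += 1
        if mass >= coverage:
            width = grid[end] - grid[start] + step
            if best is None or width < best:
                best = width
    return best


# Illustration with a bell-shaped curve of SD 0.1 on the collapsed axis.
grid = [-1.0 + 0.001 * i for i in range(2001)]
curve = normalize_curve([math.exp(-x * x / (2 * 0.1 ** 2)) for x in grid], 0.001)
print(round(interval_of_robustness(grid, curve), 3))
```

A narrow curve yields a small IR and a broad curve a large one, so the IR summarizes in one number what the width of the likelihood curve shows visually.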
In addition, concrete values referring to the minimal interval that covers 80% of the area under each likelihood curve (interval of robustness, IR) are displayed as a legend. In total, 11 peptides of the test protein were detected, exhibiting peak ratios ranging from −1.17 to 1.12 and IR values between 0.15 and 0.59. Next we diluted the same sample, expecting reduced reporter ion qualities. Automatic sequencing by LC-MS/MS now detected only seven peptides, which in general exhibited reduced reporter ion intensities (Table 1). The calculated likelihood curves of the corresponding peptide and protein views (Fig. 3 and Fig. S4) intuitively indicate the significantly lower quality of the regulatory information. This finding is also in accordance with an increased diversity of peak ratios, now varying between −1.19 and 1.22, and IR values up to 1.59. The peak ratios from the undiluted and the 10-fold diluted sample differ by only 0.09 (1.02 versus −1.07), emphasizing the relevance of repeated peptide measurements in case of low data quality.

Fig. 3. Likelihood curves can representatively summarize the robustness of regulatory information. The likelihood for variable regulations is calculated according to the noise model and the actual reporter intensities from peptides belonging to the same protein (ALDOA, aldolase). Resulting likelihood curves can be calculated and presented at the peptide level (A). Analyses of a 10-fold diluted sample result in decreased reporter ion intensities and less robust regulatory information, as presented by broadened peptide curves (B). Legends at the bottom and the right top side of each plot refer to the displayed peptide sequences and IR, respectively.

Whereas in real experiments at least three parameters in concert contribute to the overall regulation data quality, i.e.
the (i) number, (ii) consistency and (iii) the reporter ion intensities of independently measured peptides, in silico calculations allow their influence to be modelled individually. We assume a hypothetical differentially labelled peptide that can release iTRAQ™ reporter ions with variable intensities but always with exactly the same resulting ratio. Applying our noise model and likelihood curve calculations to this scenario allows us to investigate to which extent repeated measurements of the same peptide or variable reporter ion intensities of only one peptide influence the robustness of regulatory information. Notably, both the simulated increase of independent measurements and the simulated increase of reporter ion intensities exhibit a similar potential to improve the robustness of regulatory information (Fig. S11). In conclusion, this model strongly suggests that data quality, i.e. the intensities of iTRAQ™ reporters, can basically substitute for repeated measurements in order to derive reliable regulatory information from single peptides.

2.6 Detection and visualization of regulated PTMs

We suppose that the statistical evaluation of regulatory information at the peptide level provides excellent opportunities to detect dynamic PTMs occurring only in particular regions of a protein. For a proof of concept we selected phosphorylations as one of the most frequent and important types of PTMs. In humans, about 500 protein kinases mediate phosphorylations of serine, threonine and tyrosine residues controlling cell-signalling processes. Thus, the interaction of these kinases and their cross-phosphorylations constitute one of the basic molecular mechanisms to regulate kinase activities within these signalling networks.
To substantiate our hypothesis we selected the identification of regulated phosphorylation sites at human protein kinases from cells treated with the growth factor HGF, the physiological ligand of the receptor tyrosine kinase Met, which controls a multifaceted signalling network (Birchmeier et al., 2003). Activation of the Met kinase by HGF should induce phosphorylations at downstream signalling components described in the literature, namely the MK01 and GSK3α kinases. Protein kinase fractions comprising MK01 and GSK3α were generated from HGF-stimulated human epithelial cells and untreated control cells following established protocols (Reinl et al., manuscript in preparation; Wissing et al., 2007). All purified kinases were then digested and the resulting peptides from both samples were differentially labelled with iTRAQ™ reagents (114, 117). LC-MS/MS analyses of both unmodified and phosphorylated peptides were performed on a nanoAcquity Ultra Performance LC-system (Waters Corp., Milford, Massachusetts, USA) connected to a Qtof micro™ mass spectrometer (Waters Corp., USA). MS data that were acquired and processed by MassLynx version V4.1 (Waters) were searched for protein identification using the UniProtKB/Swiss-Prot database (release 55.0 of February 26, 2008 with 356194 entries; taxonomy: Homo sapiens with 18610 entries) and Mascot Daemon 2.1.6. Only peptides that were unambiguously sequenced and identified (Mascot, P < 0.05) were included in the statistical evaluation of regulatory data. MS spectra from phosphopeptides and phosphorylation sites belonging to known Met signalling components, such as MK01 and GSK3α, were inspected manually. Reporter ion intensities of the raw data were in accordance with the upregulation of these phosphorylation events in HGF-treated cells as described in the literature (Fig. S3). To prove the robustness of these results we applied our statistical approach as described before.
In brief, reporter intensities detected in all unambiguously identified peptides from both samples were (i) corrected according to known isotopic impurities and (ii) normalized by the trimmed mean algorithm. Then likelihood curves were calculated based on the established noise model for all peptide regulations, and both phosphopeptide curves and non-phosphopeptide curves belonging to one protein kinase were depicted comparatively. As expected, non-modified peptides were detected with variable robustness but generally constituted clusters with overlapping likelihood curves, indicating no significant alterations at the level of protein expression or stability. In contrast, the curves of some phosphopeptides detected at known signalling components did not overlap with these main clusters, in each case indicating their significant regulation caused by HGF-activated Met signalling. Figure 4 representatively summarizes the results obtained from two well-described human kinases. MK01 and GSK3α revealed regulated and differentially phosphorylated peptides, each at one particular protein region (Fig. 4, peptides no. 2, 3 and 5; spectra of peptides 3 and 5 are depicted in Fig. S4), whereas peptides corresponding to other protein regions consistently demonstrated that the expression of the protein kinases themselves was not significantly altered. Interestingly, this strategy also revealed a significant downregulation of one unmodified peptide each at MK01 and GSK3α. In the case of MK01 this peptide (Fig. 4, peptide no. 1) corresponds to the protein region with the induced phosphorylations. Opposite regulatory events of such peptide pairs generally will co-exist during the post-translational modification of proteins. However, many parameters such as peptide ion suppression in complex samples determine whether LC-MS/MS can detect them automatically.
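The normalization step (ii) named above can be sketched as follows. This is a minimal illustration and not the implementation used in this study: the trimming fraction of 10% per tail, the use of log2 ratios, the decision to rescale channel 117 against channel 114, and all function names are assumptions made for the example.

```python
import math


def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding the lowest and highest `trim_fraction` of values."""
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must lie in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("no values left after trimming")
    return sum(kept) / len(kept)


def normalize_channels(ch114, ch117, trim_fraction=0.1):
    """Rescale channel 117 so that the trimmed mean log2 ratio 117/114 is zero.

    Assumption for this sketch: most peptides are unregulated, so the
    trimmed mean of the log ratios estimates the global labelling offset.
    """
    log_ratios = [math.log2(b / a) for a, b in zip(ch114, ch117)]
    shift = trimmed_mean(log_ratios, trim_fraction)
    factor = 2.0 ** (-shift)
    return [b * factor for b in ch117], factor
```

Because the extreme ratios are discarded before averaging, a few strongly regulated peptides, in either direction, do not shift the estimated offset for the remaining peptides.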
Analyses of GSK3α could not detect fragmentation data of the unmodified peptide corresponding to the observed regulated phosphorylation at serine 21 (Fig. 4, peptide no. 5). However, a second protein region at GSK3α exhibits a significant downregulation (Fig. 4, peptide no. 4), strongly suggesting that a further, so far uncharacterized modification is induced. Thus, statistical evaluation of regulated unmodified peptides can be highly beneficial to define biologically relevant protein sequences for detailed PTM characterizations in future studies as well.

Fig. 4. Likelihood curves of peptides from two human kinases reveal regulated phosphopeptides and protein regions (for details see Table S1). Unmodified and phosphorylated peptides belonging to the same protein kinase are depicted comparatively in the same plot. Phosphorylated amino acids are marked in red (Regulated Peptides). The majority of peptides from the kinases MK01 and GSK3α indicate no regulation of the total protein. Peptides 2, 3 and 5 carrying phosphorylated amino acids in each kinase are upregulated with variable robustness. The corresponding unmodified protein region (1) was detected by LC-MS/MS only in the case of MK01, exhibiting a downregulation as expected. Accordingly, a significant downregulation of the unmodified peptide 4 suggests an induced modification at this region in GSK3α.

3 DISCUSSION

iTRAQ™ has gained great importance in basic research and in particular is the method of choice in in vivo animal models and clinical projects relying on patient samples. iTRAQ™ provides regulatory information at the peptide level, and initial bioinformatics studies focused on data integration and the accurate calculation of protein expressions. A previous study already described a positive correlation between low iTRAQ™ reporter intensities and low qualities of regulatory information and suggested an intensity threshold to minimize the frequency of false positive regulatory events (Lin et al., 2006).
Actually, variations of regulatory information might simply reflect ‘noise’ but might also indicate meaningful post-translational modifications. Proteins are subject to post-translational modifications, and these structural modifications essentially mediate their enzymatic activities by intra- and intermolecular interactions. In conclusion, high-impact proteome studies preferably should provide regulatory data about protein expression without losing indications for structural alterations caused by PTMs. This study therefore aims at the data-dependent evaluation of regulatory information at the protein level and peptide level. We looked for a bioinformatics concept that intuitively allows scientists to explore the impact of regulatory data. Noise models were already suggested for peptide ion detection in general (Du et al., 2008) but also for iTRAQ™ reporter intensities (Klawonn et al., 2006). We suggested and satisfactorily validated a procedure to specify the signal-to-noise features of MS workflows based on repeated measurements of iTRAQ™ reporters released from just five different synthetic peptides. We presume that the quality of the noise model directly profits from the number of data points that cover the dynamic range of the used MS detector representatively. In fact, our established noise model yields a characteristic simulation of iTRAQ™ intensities, as revealed by comparative proteome experiments. Instead of the expected 5% outliers, 2.84% were counted, indicating the robust and conservative character of the established noise model. Once a workflow-specific noise model is established, scientists can utilize this as a tool to evaluate the robustness of regulatory information based on the underlying iTRAQ™ reporter ion intensities.
Regulatory information can be presented as likelihood curves for each peptide that, in addition to the most likely regulatory factor, show the likelihood of alternative regulations. The shape of these curves depends on the specific signal-to-noise features of both reporters from corresponding peptides. The scientist thus remains in charge of selecting the level of data integration: likelihood curves at the protein level can cumulatively summarize the available information from all protein-related peptides. This type of presentation can indicate an altered protein expression or stability if the protein curve does not overlap with those of unregulated proteins (RF = 1). In contrast, a comparative view of likelihood curves established for each individually measured peptide belonging to one protein is the method of choice to reveal any inconsistently regulated peptide (peptide curve does not overlap with the main cluster). These regulations can refer to meaningful biological processes, such as regulated PTMs, but also bear the risk of false positive results. Scientists at least have to verify whether the differentially regulated peptide sequences (i) were identified unambiguously, (ii) can be assigned to only one protein and (iii) whether the fragmentation spectra were compromised by reporter ions from co-eluting and partially co-extracted peptides during the process of fragmentation. Furthermore, the process of normalization might require adaptations with respect to the investigated biological process. A trimmed mean (instead of a mean or median approach) is recommended for complex datasets and if regulations in one direction only cannot be ruled out. By considering all these aspects, a workflow-specific noise model should be able to cover many sources of artificial results, facilitating the detection of biologically relevant events. The investigated kinases of this study exhibit consistent regulatory behaviour of many unmodified peptides, underscoring the general robustness of iTRAQ™.
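The two levels of data integration discussed above can be sketched in a few lines. The example is illustrative only: the Gaussian curve shape on a log2 axis stands in for the likelihood (6) of the noise model, which is not reproduced here, the protein-level curve is assumed to be the renormalized product of independent peptide curves, and all function names are hypothetical. The shortest interval holding 80% of the area corresponds to the interval of robustness (IR), here expressed in log2 units.

```python
import math


def gaussian_curve(grid, centre, sd):
    """Placeholder likelihood curve; NOT the noise model of this study."""
    return [math.exp(-0.5 * ((x - centre) / sd) ** 2) for x in grid]


def normalize(grid, curve):
    """Scale a curve sampled on an evenly spaced grid to unit area."""
    step = grid[1] - grid[0]
    area = sum(curve) * step
    return [y / area for y in curve]


def combine(grid, curves):
    """Protein-level curve as renormalized product of peptide curves
    (assumes independent peptide measurements)."""
    product = [math.prod(ys) for ys in zip(*curves)]
    return normalize(grid, product)


def shortest_interval(grid, curve, mass=0.8):
    """Shortest grid interval whose area reaches `mass` (two-pointer scan)."""
    step = grid[1] - grid[0]
    weights = [y * step for y in curve]
    best = None
    left, acc = 0, 0.0
    for right, w in enumerate(weights):
        acc += w
        # Drop points on the left while the window still holds enough area.
        while acc - weights[left] >= mass:
            acc -= weights[left]
            left += 1
        if acc >= mass and (best is None or right - left < best[1] - best[0]):
            best = (left, right)
    if best is None:
        raise ValueError("curve does not contain the requested mass")
    return grid[best[0]], grid[best[1]]
```

With two hypothetical peptide curves centred at log2 ratios of 0.0 and 0.1, the combined curve is narrower than either peptide curve, and its 80% interval still contains 0 (RF = 1), which would be read as no evidence for an altered protein expression.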
In addition, some phosphorylated and unmodified peptides were observed to be differentially regulated, as suggested by their corresponding likelihood curves. Actually, these phosphorylation events are well-described structural elements reversibly controlling the activation and inhibition of these kinases (Cohen and Frame, 2001; Rubinfeld and Seger, 2005). Furthermore, the regulated unmodified peptide of GSK3α might indicate further induced phosphorylations. Scansite 2.0 (Obenauer et al., 2003) predicts the first serine of peptide no. 4 as part of a potential phosphorylation site utilized by DNA-dependent protein kinase (DNA-PK). In conclusion, noise model-dependent likelihood calculations of MS data will help to ‘co-evolve’ state-of-the-art data interpretation strategies for upcoming quantitative PTM studies (Kim and White, 2006; Rosenzweig et al., 2008). Phosphoproteome studies can nowadays yield hundreds or even thousands of phosphorylation events, and databases were already established to summarise species- and sample-specific results (Diella et al., 2004; Gnad et al., 2007). The presented concept of likelihood curve calculations can serve as an ideal basis for establishing visualization-aided data-mining tools and in parallel can facilitate automated high-throughput screenings in complex datasets.

4 OUTLOOK

Noise characteristics should be comparatively analyzed for different standard MS workflows. This will help to understand the rationale behind the noise observed in iTRAQ™ data and in parallel might yield workflow-specific equations with minimized numbers of adjustable parameters. Present data already underscore that increasing regulation factors necessitate an increasing difference in reporter ion intensities, which necessarily requires one reporter with low signal intensity and thus low signal quality. Consequently, significant differences between biological samples will always tend to lower robustness concerning the true regulation factor.
Therefore, future studies should establish noise models at the level of the MS-specific dynamic range of regulatory factors instead of the dynamic range of iTRAQ™ reporter intensities. Furthermore, cluster analysis strategies have to be implemented in order to detect biologically relevant regulatory events in complex datasets automatically. The characterization of PTMs following the detection of regulated unmodified peptides is still a challenging task but at least can be restricted to a particular protein region.

ACKNOWLEDGEMENTS

We thank Kirsten Minkhart and Reiner Munder for excellent technical assistance, Thorsten Johl for supplementary work and Dr Uwe Kärst for helpful comments and proofreading of the article.

Conflict of Interest: none declared.

REFERENCES

Bantscheff,M. et al. (2007) Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem., 389, 1017–1031.
Birchmeier,C. et al. (2003) Met, metastasis, motility and more. Nat. Rev. Mol. Cell Biol., 4, 915–925.
Boehm,A.M. et al. (2007) Precise protein quantification based on peptide quantification using iTRAQ. BMC Bioinformatics, 8, 214.
Cohen,P. and Frame,S. (2001) The renaissance of GSK3. Nat. Rev. Mol. Cell Biol., 2, 769–776.
Diella,F. et al. (2004) Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics, 5, 79.
Du,P. et al. (2008) A noise model for mass spectrometry based proteomics. Bioinformatics, 24, 1070–1077.
Gnad,F. et al. (2007) PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol., 8, R250.
Hoaglin,D.C. et al. (2000) Understanding Robust and Exploratory Data Analysis. Wiley, New York.
Hu,J. et al. (2006) Optimized proteomic analysis of a mouse model of cerebellar dysfunction using amine-specific isobaric tags. Proteomics, 6, 4321–4334.
Kim,J.-E. and White,F.M. (2006) Quantitative analysis of phosphotyrosine signaling networks triggered by CD3 and CD28 costimulation in Jurkat cells. J. Immunol., 176, 2833–2843.
Klawonn,F. et al. (2006) A maximum likelihood approach to noise estimation for intensity measurements in biology. In Proceedings of the Sixth IEEE International Conference on Data Mining Workshops ICDM Workshops 2006, IEEE, Los Alamitos (CA), pp. 180–184.
Lin,W.-T. et al. (2006) Multi-Q: a fully automated tool for multiplexed protein quantitation. J. Proteome Res., 5, 2328–2338.
Obenauer,J.C. et al. (2003) Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res., 31, 3635–3641.
Ong,S.-E. et al. (2002) Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell Proteomics, 1, 376–386.
Pierce,A. et al. (2007) Eight-channel iTRAQ enables comparison of the activity of 6 leukaemogenic tyrosine kinases. Mol. Cell Proteomics, 7, 853–863.
Rosenzweig,D. et al. (2008) Post-translational modification of cellular proteins during Leishmania donovani differentiation. Proteomics, 8, 1843–1850.
Ross,P.L. et al. (2004) Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell Proteomics, 3, 1154–1169.
Rubinfeld,H. and Seger,R. (2005) The ERK cascade: a prototype of MAPK signaling. Mol. Biotechnol., 31, 151–174.
Shadforth,I.P. et al. (2005) i-Tracker: for quantitative proteomics using iTRAQ. BMC Genomics, 6, 145.
Trost,M. et al. (2005) Comparative proteome analysis of secretory proteins from pathogenic and nonpathogenic Listeria species. Proteomics, 5, 1544–1557.
Tu,Y. et al. (2002) Quantitative noise analysis for gene expression microarray experiments. Proc. Natl Acad. Sci. USA, 99, 14031–14036.
Walsh,C.T. et al. (2005) Protein posttranslational modifications: the chemistry of proteome diversifications. Angew. Chem. Int. Ed. Engl., 44, 7342–7372.
Weng,L. et al. (2006) Rosetta error model for gene expression analysis. Bioinformatics, 22, 1111–1121.
Wissing,J. et al. (2007) Proteomics analysis of protein kinases by target class-selective prefractionation and tandem mass spectrometry. Mol. Cell Proteomics, 6, 537–547.