Wesleyan University

Evaluation of Mass Spectrometry-based Proteomic Search Algorithms: Parent Protein Profiling of 22 MS/MS Experiments in Saccharomyces cerevisiae

By Miin Sophia Lin
Faculty Advisor: Dr. Michael P. Weir

A Thesis submitted to the Faculty of Wesleyan University in partial fulfillment of the requirements for the degree of Master of Arts in Biology

Middletown, Connecticut
May 2013

Acknowledgements

I would like to express my gratitude to my research adviser and mentor, Dr. Michael Weir, for providing me the opportunity to conduct research since my senior year as an undergraduate student, and for his continuous guidance and support during my academic career at Wesleyan University. He has shown me not only how a bioinformaticist should approach problems, but also the importance of family.

I would like to thank Dr. Scott Holmes and Dr. Ruth Johnson for serving on my M.A. thesis committee. In addition, I would like to thank Dr. Holmes and Rebecca Ryznar for sparking my interest in academic research during the spring semester of my junior year through the independent project, and Dr. Johnson for being a mentor and listening to my ranting on med school plans.

I want to thank all the members of the Weir lab for their support and guidance over the past two years. I would especially like to thank Justin Cherny and Claire Fournier for being mentors in the bioinformatic and wet-lab aspects of my research. They have taught me a considerable amount of knowledge and skills ranging from troubleshooting experiments to presenting journal club papers. Furthermore, I would like to acknowledge Claire Fournier for developing the wet-lab protocol for the gel slice experiments conducted for this study, and her contribution of MS/MS experiments 1, 2, and 3. I would also like to thank Dr. Andrea Roberts and Dr. Hyejoo Back for their guidance and support during my academic career at Wesleyan University.
Last but not least, I would like to take this opportunity to thank God for giving me the strength and wisdom to complete this thesis. I would also like to thank my parents, Mahn-Lih and Yao-Feng Lin, and my brother, Chiarng, for shaping me into who I am today, and for their love and support through the ups and downs in my life.

TABLE OF CONTENTS

Introduction
I. Mass Spectrometry-based Proteomics
  1. Bottom-up vs. top-down approaches
    1.1 Tandem mass spectrometry
    1.2 Prospects in fragmentation methods and MS instrumentation
  2. Bioinformatic and computational approaches
    2.1 MS/MS spectral databases
    2.2 Identification of downPeptides in the yeast proteome
II. Confidence in Algorithm Matches of MS/MS Spectra to Peptide Sequences
  1. Peptide-spectrum matching search algorithms
  2. Methods of evaluation
    2.1 Target-decoy approach: False identification rate (FIR)
    2.2 Gel slice approach: Parent protein profiling of 22 MS/MS experiments
Materials and Methods
I. Gel Slice Approach
  1. Protein sample preparation
II. Peptide-spectrum Matching Search Algorithms
III. Conformance Score Computation
Results and Discussion
I. An Evaluation of Algorithm Performance using Conformance Scores
  1. Ion fragmentation: a search for a, b, and y ions
II. Factors Affecting Conformance Score Computation
  1. Post-translational modifications
  2. Random matching of peptides to parent proteins of the correct size range
III. Union of OMSSA Implementations
  1. Conformance scoring based on multiple OMSSA implementations
  2. Contribution of individual implementations to the pool of distinct peptides detected by multiple OMSSA implementations
IV. Overlap of Peptide Matches in a/b/y and b/y ion Screens
  1. SEQUEST
  2. OMSSA
V. Protein Expression
  1. Parent protein expression of peptide matches detected exclusively in the b/y or a/b/y ion screens, and in both the b/y and a/b/y ion screens
  2. Bootstrap analysis: Confidence in peptide matches dependent on parent protein expression
VI. Increasing the Yield of High Confidence Peptide Matches
  1. Distinct peptides detected by both SEQUEST and OMSSA algorithms
  2. Conformance scoring using algorithm-specific evaluation methods
Conclusion
I. Evaluation Methods for MS/MS Algorithms and Implementations
  1. Benchmark data: 22 MS/MS experiments for optimization of algorithm implementations
  2. Multiple algorithms, implementations, and ion screens: increase yield of high confidence peptide matches
  3. Practical limitations to benchmark data sets
II. Future Directions
  1. Assessment of MASCOT using parent protein conformance scoring
  2. Applications of mass spectrometry-based proteomics
    2.1 Elucidation of biological functions and pathways
    2.2 Mass spectrometry-based proteomics in the clinical setting
References
Appendix A: SEQUEST search parameters in Proteome Discoverer 1.2
Appendix B: Python and MS-SQL scripts

LIST OF FIGURES

Figure 1: Bottom-up and top-down approaches to mass spectrometry-based proteomics
Figure 2-1: Schematic of gel slice approach
Figure 2-2: Partitioning of parent proteins via SDS-PAGE
Figure 3: Experimental design and workflow
Figure 4: Glycosylation does not account for nonconforming peptide matches with parent proteins below the gel slice MW range
Figure 5: Union of multiple OMSSA implementations increases both the yield of detected peptides and the Conformance Score
Figure 6: Conformance scoring based on distinct peptides detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens
Figure 7: Peptides detected by SEQUEST in both the b/y and a/b/y ion screens give significantly higher conformance scores compared to peptides detected by b/y or a/b/y ion screens alone
Figure 8: Parent protein expression of distinct peptide matches (per MS Run) detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens
Figure 9: Confidence in peptides detected by SEQUEST is dependent on parent protein expression
Figure 10: Distinct peptides detected by both algorithms

LIST OF TABLES

Table 1: Comparison of SEQUEST and OMSSA parameter implementations
Table 2: Conformance scoring based on distinct peptide matches per MS Run at 10% mass tolerance
Table 3: Conformance scoring based on distinct peptide matches per MS Run at 25% mass tolerance
Table 4: Conformance scoring using algorithm-specific evaluation methods

LIST OF TERMINOLOGY

Algorithm search parameter implementation – The set of parameter settings applied to a mass spectrometry based protein identification search algorithm, such as SEQUEST.
Hereafter referred to as “implementation.”

Collision Induced Dissociation (CID) ion screens – Algorithm parameter settings searching for a, b, and y CID ions are hereafter referred to as the a/b/y ion screen. Algorithm parameter settings searching for b and y CID ions are hereafter referred to as the b/y ion screen.

Conforming peptides – Distinct forward peptides with parent proteins conforming to a given molecular weight size range (e.g., at 10% mass tolerance for a molecular weight size range of 25-37 kDa, detected peptides with parent proteins between 22.5 and 40.7 kDa are classified as conforming).

Distinct forward peptides – Peptide matches identified by the search algorithm that are counted once per MS Run or per Gel Slice Range.

False Identification Rate (FIR) – 100*(reverse peptides / forward peptides)

Gel Slice Range – The molecular weight size range, 25-37, 37-50, or 50-75 kDa, of proteins that were excised from SDS-PAGE gels based on a protein standard lane. Counting distinct peptide matches per Gel Slice Range allows only a single count of the same peptide from a particular molecular weight size range, thereby decreasing the weight of the peptide in the pool of distinct forward peptides.

Mass Tolerance – A widening of the molecular weight size range by either 10% or 25% of the minimum (m) and maximum (M) values [e.g., given a size range of m to M and a mass tolerance of 10%, the size range is widened to m - (0.1*m) and M + (0.1*M)].

MS Run – An individual tandem mass spectrometry (MS/MS) run performed on a single gel slice of a molecular weight size range of 25-37, 37-50, or 50-75 kDa. Counting distinct peptide matches per MS Run allows multiple counts of the same peptide from different MS/MS experiments of the same molecular weight size range, thereby increasing the weight of the peptide in the pool of distinct forward peptides.

Non-conforming peptides – Distinct forward peptides with parent proteins lying outside a given molecular weight size range.
Conformance Score (CS) – Equation (m 3.2)

Reverse peptides – Decoy peptides with reverse peptide sequences of proteins in a database file.

Introduction

Since the establishment of the central dogma in molecular biology (Crick, 1970), continuous efforts have been made to annotate and characterize the molecular building blocks of life, with the ultimate goal of understanding how biological systems function from DNA transcription to mRNA translation into protein products. The completion of the Human Genome Project in 2003 (IHGSC, 2004) was a major determinant in promoting a shift from genomic studies to proteomic studies. In particular, the development of DNA sequencing technologies, from Sanger’s electrophoresis-radiolabeling method (Sanger et al., 1977) to pyrosequencing (Hyman, 1988), has led to the completion of whole genome sequencing for many model organisms, and a call for the functional annotation of genomes. In response, techniques such as RT-PCR and DNA microarrays were developed to measure gene expression and transcript abundance (Etienne et al., 2004). However, as mRNA expression profiles do not necessarily correlate with protein expression (Gygi et al., 1999), quantitative analyses of mRNA can only provide a partial, and sometimes inaccurate, picture of biological systems.

The shift toward a direct characterization of protein products encoded by the genome, however, has been challenged by the enormous size and complexity of the proteome. Protein expression is regulated by multiple processes at various stages during transcription and translation. Not only do rates of translation and the targeting of mRNA for decay dictate protein expression levels, single pre-mRNA transcripts may also code for multiple protein products as seen in alternative splicing (Black, 2000).
Once proteins are expressed, post-translational modifications, degradation, localization, and the formation of complexes via protein-protein interactions further complicate the understanding of protein function within a system.

I. Mass Spectrometry-based Proteomics

Along with the advances in bioinformatics and computational biology, the development of two-dimensional electrophoresis (2D-PAGE; O’Farrell, 1975) and tandem mass spectrometry (MS/MS; Hunt et al., 1986) has paved the way for large-scale studies on protein expression and function. As the proteomic counterpart of DNA sequencing and microarray expression profiling, mass spectrometry, when used alongside spectra-matching search algorithms, can determine peptide fragment sequences and identify proteins in a sample. The principle behind mass spectrometry is the ability to characterize the mass of analytes in a given sample by measuring the mass to charge (m/z) ratios and abundance of ionized analytes. In a mass spectrometry experiment, following the introduction of analytes into the mass spectrometer by the ion source, the mass analyzer measures the m/z ratios of the ionized analytes, and the detector determines their relative abundance in the sample. These three components of a mass spectrometer determine the differences in resolution and sensitivity between the various types of instrumentation on the market, from time-of-flight (TOF) and ion trap to Fourier transform (FT) spectrometers (Aebersold and Mann, 2003). The choice as to which type of mass spectrometer to use is in turn determined by the objective of the proteomic study.

1. Bottom-up vs. top-down approaches

The characterization and identification of proteins through mass spectrometry can be accomplished through a bottom-up or a top-down approach.
In the bottom-up approach, proteins isolated from cell lysate are subjected to enzymatic digestion (e.g., trypsin), producing peptide fragments that are first separated by high-performance liquid chromatography (HPLC; Davis, 1998), then introduced into a mass spectrometer via soft-ionization techniques such as electrospray ionization (ESI; Fenn et al., 1989) or matrix-assisted laser desorption ionization (MALDI; Karas and Hillenkamp, 1988) (Coon et al., 2005). In the top-down approach, isolated proteins are instead directly introduced into a mass spectrometer via ESI for fragmentation (McLafferty et al., 2007).

1.1 Tandem mass spectrometry

Current methodologies for both approaches make use of tandem mass spectrometry (MS/MS; Hunt et al., 1986), where the most abundant molecular ions from the first MS spectrum are selected for further fragmentation to reveal additional information via subsequent MS spectra of the fragment ions. In fragmentation methods such as collision induced dissociation (CID; Johnson, 1987), cleavages of the peptide backbone typically give rise to N-terminal b ions and C-terminal y ions (Figure 1). However, other ions such as a ions, in which the C=O group is lost from the b ion, may also occur (Wysocki et al., 2005), and the ability to enhance detection of a1 fragment ions via protein reductive glutaraldehydation has allowed for determination of N-terminal sequences (Russo et al., 2008).

The MS/MS spectra obtained from a bottom-up approach and those from a top-down approach, however, focus on different aspects of protein characterization (Figure 1). In the bottom-up approach, the first MS measures the mass of the proteolytic fragments, while subsequent MSn provide clues to the sequence of the fragment. Consequently, the submission of MS/MS spectra to a search algorithm is limited to the identification of proteins in a sample via a piecing together of proteolytic fragment sequences, a method more prone to error.
In the top-down approach, the first MS measures the mass of the complete parent protein, while subsequent MSn provide clues not only to the sequence composition of downstream fragments, but also to post-translational modifications, predicted sequence errors, or other primary structural information when assessed with a search algorithm (McLafferty et al., 2007).

Figure 1. Bottom-up and top-down approaches to mass spectrometry-based proteomics

1.2 Prospects in fragmentation methods and MS instrumentation

The advantage of the top-down approach in providing a more complete picture of a protein, however, is limited by the requirement for higher resolution instrumentation and a more effective method for the fragmentation of larger proteins. Efforts to address these limitations have been made through the introduction of fragmentation techniques such as electron capture dissociation (ECD; Zubarev et al., 2000) in conjunction with hybrid mass spectrometers of increasing resolution, sensitivity, and m/z ranges, such as the ion trap-Fourier transform mass spectrometer (LTQ FT; Syka et al., 2004b). Furthermore, the development of electron transfer dissociation (ETD; Syka et al., 2004a) has contributed to reducing the higher cost often associated with ECD-Fourier transform mass spectrometry.

2. Bioinformatic and computational approaches

2.1 MS/MS spectral databases

The development of newer technologies in protein-based mass spectrometry has allowed for high-throughput data acquisition and the establishment of open-source databases containing large amounts of publicly deposited spectra, including PeptideAtlas, Proteomics IDEntifications Database, and NCBI’s Peptidome (Allmer, 2012).
Along with the use of relational databases hosted by local servers, bioinformatic and computational approaches can now be applied to analyze these large sets of data, allowing for the characterization and identification of proteins in ways that otherwise would have taken longer to accomplish through methods solely dependent on wet-lab experimentation.

2.2 Identification of downPeptides in the yeast proteome

As an example, a previous study (Fournier et al., 2012) used the SEQUEST algorithm to analyze spectra obtained from MS/MS experiments on glutaraldehyde-treated yeast cell lysates in addition to publicly deposited spectra obtained from the PeptideAtlas repository (http://www.peptideatlas.org/repository/). A total of 320 downPeptides with amino termini mapping to translation initiation at AUG codons downstream (dnAUGs) of the annotated start codon (annAUG) were identified in the budding yeast. In support of data obtained from MS/MS spectra analysis, bioinformatic approaches revealed poorer quality Kozak sequence contexts surrounding the annAUGs of downPeptide genes as compared to those of annPeptide genes, and suggested the occurrence of translation initiation at both the annAUG and dnAUG via tag densities from ribosome profiling. In this case study, the ability to integrate bioinformatic and computational approaches with large amounts of publicly deposited MS/MS spectra revealed added complexity to the current knowledge of protein expression in yeast. In particular, translation initiation at dnAUGs in frame to the annAUG would result in the expression of truncated proteins, while translation initiation at dnAUGs out of frame to the annAUG would potentially lead to the expression of proteins with novel functions and cellular localizations, increasing the repertoire of the current annotated yeast proteome (Kochetov et al., 2005).

II. Confidence in Algorithm Matches of MS/MS Spectra to Peptide Sequences

With increasing improvements in the resolution and sensitivity of mass spectrometers, there is little doubt that the emphasis placed on mass spectrometry-based proteomics will persist for many years to come. However, given the complexity of proteomic studies, an assessment of spectra-matching algorithms in accurately identifying and characterizing proteins from a given sample is important for the validation of both non-standard protein matches (e.g., downPeptides) and in-depth characterizations of annotated proteins. The variability seen between different spectra-matching algorithms, due to their unique design and structure, is further complicated by the numerous parameter settings (e.g., precursor and fragment mass tolerance, number of missed cleavage sites, or type of CID ions) that may be implemented within each algorithm (Balgley et al., 2007). Confidence in peptide-spectrum matches becomes problematic when the same spectral data submitted to different algorithms or implementations of algorithms yield different samplings of peptide matches (Wenger and Coon, 2013). Which of the algorithm-detected peptides are valid and which should be discarded due to low confidence in the match?

1. Peptide-spectrum matching search algorithms

Several types of spectra-matching search algorithms are available either on the market or as open-source software. Algorithms such as SEQUEST (Eng et al., 1994) are based on cross-correlation methods that match an MS/MS experimental spectrum with theoretical spectra, while algorithms such as OMSSA (Geer et al., 2004) and MASCOT (Perkins et al., 1999) rely on probability-based matching that assigns a score of statistical significance to the observed peptide for sequence correlation to theoretical peptides.
Although both are probability-based algorithms, OMSSA and MASCOT differ in various aspects: i) OMSSA is an open-source algorithm, which allows for manipulation of data on information such as the detected CID ions in a spectrum for user-defined peak analysis; ii) OMSSA was the fastest algorithm out of the algorithms mentioned prior to the introduction of Morpheus (Wenger and Coon, 2013). Despite the differences, all spectra-matching algorithms have their own score ranking systems, where peptide matches are assigned a score representing the probability of the match being a true match, allowing for the application of probability thresholds at certain false identification rates.

2. Methods of evaluation

2.1 Target-decoy approach: False identification rate (FIR)

The effectiveness of an algorithm or an algorithm implementation is typically assessed through decoy analysis (Fitzgibbon et al., 2008; Elias and Gygi, 2007). In the FASTA format sequence database provided as a reference for peptide-spectrum matching algorithms, a reverse sequence is present for every forward protein sequence. With no knowledge that the reverse sequences represent “ghost” proteins that do not exist in nature, an algorithm may still match spectra to these decoy sequences. The frequency with which reverse peptides are chosen by the algorithm, calculated as the number of reverse peptides over the number of forward peptides, is the false identification rate (FIR) for the implementation of the algorithm. To filter out lower confidence peptide matches, a target FIR of either 1% or 5% is set during the computation of a probability threshold value based on the scores assigned to individual peptide matches representing an algorithm’s confidence in the match. However, despite submission of the same spectra files and application of a probability threshold at a FIR of 1% or 5%, different algorithms or implementations of an algorithm still yield different samplings of detected peptides.
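The target-decoy calculation described in section 2.1 can be illustrated with a minimal Python sketch. It is not the MS-SQL stored procedure used in this study. The record layout (a score paired with a reverse/forward flag), the function names, and the convention that a higher score means a more confident match are all assumptions made for illustration; algorithms that report expectation values would need the ordering inverted.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record layout: (score, is_reverse). Not taken from the thesis.
Match = Tuple[float, bool]


def false_identification_rate(n_reverse: int, n_forward: int) -> float:
    """FIR = 100 * (reverse peptides / forward peptides)."""
    if n_forward == 0:
        raise ValueError("FIR is undefined without forward peptides")
    return 100.0 * n_reverse / n_forward


def threshold_for_target_fir(matches: Iterable[Match],
                             target_fir: float) -> Optional[float]:
    """Return the lowest score cutoff whose retained matches meet target_fir.

    Assumes a higher score means a more confident match. Returns None when
    no cutoff satisfies the target.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    n_forward = n_reverse = 0
    best = None
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Take in every match tied at this score before evaluating the cutoff.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                n_reverse += 1
            else:
                n_forward += 1
            i += 1
        if n_forward > 0 and false_identification_rate(n_reverse, n_forward) <= target_fir:
            best = score
    return best
```

For example, calling threshold_for_target_fir on a list of scored matches with target_fir=5.0 gives the cutoff above which reverse matches amount to at most 5% of forward matches. Because the FIR is not guaranteed to decrease monotonically with the score, the sketch scans every cutoff and keeps the lowest one that qualifies.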
2.2 Gel slice approach: Parent protein profiling of 22 MS/MS experiments

To address this issue, we developed additional evaluation methods to establish confidence in peptide-spectrum matches by algorithms. Using a bottom-up approach, where the algorithm has no knowledge of the masses of parent proteins prior to trypsin digestion, we partitioned proteins from yeast cell lysate according to mass using SDS-polyacrylamide gel electrophoresis, and subjected samples of known parent protein molecular weight size ranges to LC-MS/MS. Spectra from 22 MS/MS experiments were then used to evaluate peptides matched by an algorithm implementation for conformance to parent protein masses prior to trypsin digestion. Since this method can be applied to any algorithm or implementation of an algorithm, the computed overall Conformance Scores (eq. m 3.2) provide a means to compare the performance of different algorithms as well as to evaluate the performance of an individual implementation, allowing for optimization of parameter settings in an algorithm. In addition, making such algorithm benchmark data available for download would greatly benefit the research community (Allmer, 2012) in choosing the best algorithm or implementation to use with yeast proteomic studies.

Materials and Methods

I. Gel Slice Approach

In order to assess algorithm matches of spectra to tryptic peptides, algorithms were evaluated for the conformance of peptide matches to parent proteins prior to trypsin digestion. Yeast cell lysates were partitioned according to mass by SDS-polyacrylamide gel electrophoresis. Gel slices corresponding to molecular weight size ranges of 25-37, 37-50, and 50-75 kDa were excised, subjected to in-gel trypsin digestion, and assessed through LC-MS/MS. Spectra were then submitted to search algorithms (SEQUEST or OMSSA) for protein identification (Figure 2-1).

Figure 2-1. Schematic of gel slice approach.

1. Protein sample preparation

Protein samples were prepared by growing 100 ml of the wild-type S. cerevisiae strain (YSH474) to mid-log phase in YPD. Lysis was carried out using RIPA buffer (150 mM NaCl, 1% Igepal, 0.1% SDS, 50 mM Tris pH 8.0) and acid washed glass beads in addition to protease inhibitor cocktail tablets (Roche) and PMSF. After the lysate was spun at 5,000 RPM, samples at 500 µg or 1000 µg were run alongside protein standard markers (Bio-Rad) on 4-20% SDS-PAGE gels (Bio-Rad). Protein standard bands served as a guide for the excision of gel slices of various molecular weight size ranges (25-37, 37-50, and 50-75 kDa), which were subjected to reduction and alkylation followed by overnight in-gel trypsin digestion (Shevchenko et al., 1996, 2006). Extracted peptides were resuspended in 0.1% TFA, loaded onto a C18 (Michrom) nanospray column (Polymicro), and run with a 180 minute gradient on a Thermo-Finnigan LCQ Deca XP (3D ion trap) coupled to an Agilent 1100 series high-performance liquid chromatography (HPLC) system and a Thermo nanoelectrospray (nano-ESI) ion source. As preliminary tests indicated that gels with visible degradation (smearing of protein bands) had limited conformance to parent protein masses, gel slices chosen for analysis originated from gels lacking visible degradation (Figure 2-2).

Figure 2-2. Partitioning of parent proteins via SDS-PAGE.

II. Peptide-spectrum Matching Search Algorithms

Peptides were identified using the SEQUEST algorithm (Proteome Discoverer v.1.2) run on a Dell Alienware Aurora R4 server, and the Open Mass Spectrometry Search Algorithm (OMSSA) run on a 90-node cluster.
Algorithm parameters were set to search for either b and y ions or a, b, and y ions following CID, and to include mass increases to peptides: dynamic modifications include +42 Da for acetylation of any N-terminal amino acid residue, and +16 Da for oxidation of methionine residues; static modifications include +57 Da for carbamidomethyl modification of cysteine residues (Appendix A). Each SEQUEST parameter file included a precursor mass tolerance of 3.0 Da and a fragment mass tolerance of 1.0 Da. The OMSSA algorithm was run using five parameter implementations of the algorithm (Table 1). For the SEQUEST and OMSSA analysis, we required trypsin-cleavage sites at both ends of the precursor peptides (or one end if a terminal peptide). A sequence database file containing protein translations of annotated and downstream Open Reading Frames (dnORFs) in FASTA format was constructed as described previously (Fournier et al., 2012).

Table 1. Comparison of SEQUEST and OMSSA parameter implementations

                                 SEQUEST  OMSSA 0  OMSSA 1  OMSSA 2  OMSSA 3  OMSSA 4
Max. missed cleavage sites             1        1        1        1        1        1
Precursor mass tolerance (Da)        3.0      3.0      1.5      2.0      1.5      1.5
Fragment mass tolerance (Da)         1.0      1.0      0.5      0.8      0.5      1.0
Precursor ion search type           mono     avg.     avg.     avg.     mono     avg.
Fragment ion search type            mono     mono     mono     avg.     mono     avg.
Mass tolerance charge scaling        N/A     none     none     none     none   linear

Output from SEQUEST and OMSSA was uploaded to a relational database and analyzed with stored procedures written in MS-SQL to compute false identification rates (FIR) and conformance scores (Figure 3). In the stored procedure analysis (Appendix B), for each MS/MS experiment, we excluded peptide matches with internal trypsin sites, matches with an initial ranking (Rank) > 1 if a SEQUEST matched peptide, and matches with multiple parent protein references. In the case of OMSSA, we also excluded matches where multiple peptides matched to the same spectrum.
Finally, the peptide matches were filtered by applying the calculated probability threshold that gave a target FIR of 5%. Decoy analysis was performed as described previously (Fournier et al., 2012).

Figure 3. Experimental design and workflow. RAW files (1) from MS/MS experiments are either submitted to the SEQUEST algorithm (2A), or converted to ODTA format and submitted to the OMSSA algorithm (2B). The XLS output from SEQUEST (3A) and the XML output from OMSSA (3B) are then uploaded into the devGelSlice database for further analysis via stored procedures written in MS-SQL: application of 5% FIR (4); computation of overall conformance score (5).

III. Conformance Score Computation

Parent protein conformance scores were computed from distinct forward peptide matches classified as conforming or nonconforming according to the known molecular weight size range of the gel slice (25-37, 37-50, and 50-75 kDa). We analyzed data from 22 gel slices in 22 independent MS/MS experiments. Peptides were either counted once per MS Run or once per Gel Slice Range. To account for aberrant running on a gel or possible post-translational modifications, mass tolerances at ± 10% or ± 25% of the molecular weight size range were applied. Conformance scores for individual MS/MS experiments were used as an initial screen and computed as follows:

\[ CS_i = 100 \times \frac{C_i}{C_i + N_i} \qquad \text{(m 3.1)} \]

where C_i and N_i are the numbers of conforming and nonconforming distinct forward peptides in MS/MS experiment i.

An overall conformance score (for a single algorithm search parameter implementation) was computed by summing the number of conforming peptides and nonconforming peptides across all 22 MS/MS experiments:

\[ CS = 100 \times \frac{\sum_{i=1}^{22} C_i}{\sum_{i=1}^{22} \left( C_i + N_i \right)} \qquad \text{(m 3.2)} \]

Results and Discussion

Mass spectrometry-based protein identification algorithms that match MS/MS spectra of fragmented peptides to the theoretical spectra of peptide sequences from FASTA format databases are important tools in the advancement of proteomic studies, and the ability of algorithms to provide accurate protein identification is necessary for confidence in published data.
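The conformance scoring defined under Conformance Score Computation in Materials and Methods can be illustrated with a short Python sketch. It assumes that the score is the percentage of distinct forward peptides classified as conforming, which agrees with the values reported in Table 2 (for example, 3781 conforming out of 4480 SEQUEST peptides gives 84.4). The function names are hypothetical, and treating the widened range as inclusive at both ends is an assumption rather than something stated in the thesis.

```python
from typing import Iterable, Tuple


def tolerance_window(min_kda: float, max_kda: float,
                     tolerance: float) -> Tuple[float, float]:
    """Widen a gel slice range m..M to m - tol*m and M + tol*M."""
    return min_kda - tolerance * min_kda, max_kda + tolerance * max_kda


def is_conforming(parent_kda: float, min_kda: float, max_kda: float,
                  tolerance: float) -> bool:
    """True when the parent protein mass lies in the widened size range.

    Inclusive bounds are an assumption of this sketch.
    """
    low, high = tolerance_window(min_kda, max_kda, tolerance)
    return low <= parent_kda <= high


def conformance_score(n_conforming: int, n_nonconforming: int) -> float:
    """Percentage of distinct forward peptides classified as conforming."""
    total = n_conforming + n_nonconforming
    if total == 0:
        raise ValueError("no distinct forward peptides to score")
    return 100.0 * n_conforming / total


def overall_conformance_score(runs: Iterable[Tuple[int, int]]) -> float:
    """Sum (conforming, nonconforming) counts over all runs, then score."""
    runs = list(runs)
    return conformance_score(sum(c for c, _ in runs),
                             sum(n for _, n in runs))
```

Summing the counts before dividing, rather than averaging per-experiment scores, means that experiments yielding more peptides carry more weight in the overall score.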
The importance of confidence in algorithm-detected peptides is further illustrated by the discovery of special sets of peptides, such as downPeptides (Fournier et al., 2012), where localization and function of the proteins have yet to be characterized.

I. Evaluation of Algorithm Performance using Conformance Scores

As an initial method for evaluating algorithm performance, we computed relative conformance scores indicating how well peptide matches from the algorithms conformed to parent proteins of known molecular weight size ranges (25-37, 37-50, and 50-75 kDa). Similar in concept to single-blind studies in psychology, the evaluation method relied on the algorithms having no knowledge of the masses of parent proteins prior to trypsin digestion. The computed conformance scores, using spectra from 22 MS/MS experiments, therefore provide an unbiased evaluation of algorithm performance in matching theoretical tryptic peptides to mass spectra of CID fragments.

1. Ion fragmentation: a search for a, b, and y ions

Although b and y ions are most typically observed during CID, the type of ions detected in a mass spectrometry experiment is affected by factors such as the sensitivity of the mass spectrometer, the source of the ions, and the amino acid composition of the peptides (Wysocki et al., 2005). To account for the possibility of other less-prevalent ion fragmentations, we set our algorithm search parameters to also include the search for a ions, which result from the loss of C=O at the carboxy terminus of the b ion. Analysis of spectra from 22 independent MS/MS experiments using the SEQUEST algorithm and five implementations of the OMSSA algorithm included either a typical b/y ion screen or the a/b/y ion screen. As application of a decoy analysis using a false identification rate below 5% is generally accepted as the method to ensure the quality of peptide-spectrum matches, we pre-screened our data prior to computing conformance scores.
The overall false identification rates for SEQUEST were ~6.0% while rates for OMSSA ranged from 2.2% to 6.8% (Table 2). Due to experimental variations inherent in the gel slice experiments, such as aberrant running of proteins during gel electrophoresis, a tolerance of 10% was applied to the molecular weight size range of the parent proteins. Counting each peptide once per MS Run, conformance scores computed at a 10% mass tolerance indicated that around 84.5% of peptide matches from SEQUEST conformed to the expected parent protein size ranges in both types of ion screens, while an average of 88.3% and 87.5% of peptide matches from OMSSA conformed to the expected parent protein size ranges in the b/y and a/b/y ion screens respectively (Table 2). Although the average conformance scores for OMSSA were similar in both ion screens, individual scores fluctuated slightly between implementations, and the number of distinct peptides detected in a given ion screen (b/y: 2134-3644; a/b/y: 1702-3393) also depended on the search parameter implementation. In general, the conformance scores suggest high confidence in peptide matches from both the SEQUEST and OMSSA algorithms, regardless of the ion screen or the individual OMSSA implementation.

Table 2. Conformance scoring based on distinct peptide matches per MS Run at 10% mass tolerance. Overall scores include data from 22 MS/MS experiments. A. Distinct peptides detected in b/y ion screen. Threshold values for SEQUEST: 0.081; OMSSA runs 0-4: 1.0. B. Distinct peptides detected in a/b/y ion screen. Threshold values for SEQUEST: 0.164025; OMSSA runs 0,1,3,4: 1.0; OMSSA run 2: 0.9.

A.
| Algorithm | Implementation | # Distinct Peptides | # Conforming Peptides | # Nonconforming Peptides | Overall Conformance Score (%) | Overall Reverse Conformance Score (%) | Overall FIR (%) |
|---|---|---|---|---|---|---|---|
| SEQUEST | | 4480 | 3781 | 699 | 84.4 | 14.6 | 6.0 |
| OMSSA | 0 | 3060 | 2717 | 343 | 88.8 | 23.6 | 2.4 |
| OMSSA | 1 | 3644 | 3196 | 448 | 87.7 | 20.3 | 3.6 |
| OMSSA | 2 | 2134 | 1893 | 241 | 88.7 | 23.5 | 5.4 |
| OMSSA | 3 | 3295 | 2887 | 408 | 87.6 | 17.8 | 3.9 |
| OMSSA | 4 | 3035 | 2696 | 339 | 88.8 | 23.9 | 2.2 |

B.

| Algorithm | Implementation | # Distinct Peptides | # Conforming Peptides | # Nonconforming Peptides | Overall Conformance Score (%) | Overall Reverse Conformance Score (%) | Overall FIR (%) |
|---|---|---|---|---|---|---|---|
| SEQUEST | | 4757 | 4021 | 736 | 84.5 | 17.0 | 5.7 |
| OMSSA | 0 | 2583 | 2282 | 301 | 88.3 | 18.0 | 2.4 |
| OMSSA | 1 | 3393 | 2960 | 433 | 87.2 | 21.2 | 4.9 |
| OMSSA | 2 | 1702 | 1482 | 220 | 87.1 | 19.8 | 6.8 |
| OMSSA | 3 | 3065 | 2670 | 395 | 87.1 | 15.9 | 5.1 |
| OMSSA | 4 | 2556 | 2250 | 306 | 88.0 | 16.7 | 2.3 |

Table 3. Conformance scoring based on distinct peptide matches per MS Run at 25% mass tolerance. Overall scores include data from 22 MS/MS experiments. A. Distinct peptides detected in b/y ion screen. Threshold values for SEQUEST: 0.081; OMSSA runs 0-4: 1.0. B. Distinct peptides detected in a/b/y ion screen. Threshold values for SEQUEST: 0.164025; OMSSA runs 0,1,3,4: 1.0; OMSSA run 2: 0.9.

A.

| Algorithm | Implementation | # Distinct Peptides | # Conforming Peptides | # Nonconforming Peptides | Overall Conformance Score (%) | Overall Reverse Conformance Score (%) | Overall FIR (%) |
|---|---|---|---|---|---|---|---|
| SEQUEST | | 4480 | 4099 | 381 | 91.5 | 26.2 | 6.0 |
| OMSSA | 0 | 3060 | 2920 | 140 | 95.4 | 33.3 | 2.4 |
| OMSSA | 1 | 3644 | 3455 | 189 | 94.8 | 36.8 | 3.6 |
| OMSSA | 2 | 2134 | 2022 | 112 | 94.8 | 40.9 | 5.4 |
| OMSSA | 3 | 3295 | 3126 | 169 | 94.9 | 34.1 | 3.9 |
| OMSSA | 4 | 3035 | 2899 | 136 | 95.5 | 31.3 | 2.2 |

B.

| Algorithm | Implementation | # Distinct Peptides | # Conforming Peptides | # Nonconforming Peptides | Overall Conformance Score (%) | Overall Reverse Conformance Score (%) | Overall FIR (%) |
|---|---|---|---|---|---|---|---|
| SEQUEST | | 4757 | 4364 | 393 | 91.7 | 27.8 | 5.7 |
| OMSSA | 0 | 2583 | 2469 | 114 | 95.6 | 31.1 | 2.4 |
| OMSSA | 1 | 3393 | 3210 | 183 | 94.6 | 34.5 | 4.9 |
| OMSSA | 2 | 1702 | 1606 | 96 | 94.4 | 35.3 | 6.8 |
| OMSSA | 3 | 3065 | 2897 | 168 | 94.5 | 33.1 | 5.1 |
| OMSSA | 4 | 2556 | 2438 | 118 | 95.4 | 28.3 | 2.3 |

II.
Factors Affecting Conformance Score Computation

Although conformance scores indicate the relative peptide matching efficacy of algorithm implementations, the scores may be affected by additional factors, only some of which can be accounted or corrected for. After application of a 25% mass tolerance to the molecular weight size range of parent proteins, conformance scores increased regardless of the algorithm or search parameter implementation selected (Table 3). In an attempt to elucidate factors affecting the categorization of algorithm-matched peptides as conforming or nonconforming, we drafted a list of factors that may affect the computation of conformance scores. Unaccountable factors such as the algorithm-specific probability scoring of peptide matches may result in the exclusion of a matched peptide during decoy analysis, where peptides not meeting a certain probability threshold are discarded. Additionally, the algorithm may not have been able to match a given spectrum to a peptide sequence from the specified database file, therefore decreasing the yield of detected peptides. However, correctable factors affecting conformance scoring include post-translational modifications of parent proteins that alter their molecular weights, and random matching of peptides by the algorithm to parent proteins of the correct size range.

1. Post-translational modifications

Of the various post-translational modifications (PTMs), glycosylation is one of the major PTMs that may increase the mass of a parent protein by as much as several thousand Daltons through the addition of covalently linked saccharides (Parker et al., 2010). When partitioned via SDS-polyacrylamide gel electrophoresis, post-translationally modified parent proteins may be found at a position on the gel that is above their annotated molecular weight (Iakoucheva et al., 2001).
In order to investigate whether peptide matches were categorized as nonconforming due to post-translational modifications, we tested the parent proteins of nonconforming peptides for a significant elevation of the Asn-X-Ser/Thr motif (where X is any amino acid other than proline) that is typically observed in the sequence of proteins targeted for N-linked glycosylation, and a significant elevation of serine and threonine residues that are typically observed in the sequence of proteins targeted for O-linked glycosylation (Wildt and Gerngross, 2005). We expected to see an elevation of the Asn-X-Ser/Thr motif or serine and threonine residues in the parent proteins of nonconforming peptides with unmodified masses below the gel slice range. However, an elevation in both the Asn-X-Ser/Thr motif and the number of serine and threonine residues was observed for parent proteins of nonconforming peptides with molecular weight sizes above the gel slice range (Figure 4), suggesting that other factors may take precedence. A possible explanation is that the elevated levels of the Asn-X-Ser/Thr motif or serine and threonine residues do in fact indicate glycosylation of the parent proteins, but due to post-translational proteolytic cleavage, the parent protein runs at a molecular weight lower than expected (Shao and Kent, 1997).

2. Random matching of peptides to parent proteins of the correct size range

The ability of mass spectrometry-based search algorithms to randomly match spectra to peptide sequences is apparent in the matching of spectra to reverse decoy peptides. If algorithms randomly match spectra to incorrect peptides, algorithms can also randomly match spectra to peptides corresponding to parent proteins of the correct size range. For example, 25% of annotated yeast proteins have masses between 22.5 and 40.7 kDa, so about 25% of random matches would be expected to conform to this size range. The overall conformance scores for the reverse peptides may in part illustrate this phenomenon.
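The random-matching baseline in the example above can be sketched as the fraction of annotated proteins whose mass falls inside the tolerance-widened gel slice range. This is a rough illustration under a simplifying assumption that is not taken from the thesis: every protein is treated as equally likely to receive a random match, ignoring differences in the number of tryptic peptides per protein. The function name and the toy mass list are hypothetical.

```python
def expected_random_conformance(proteome_masses_kda, slice_range_kda, tolerance=0.10):
    """Percentage of proteins whose mass lies in the tolerance-widened
    slice range; used here as a crude estimate of how often a purely
    random match would be scored as conforming.

    Assumption: each protein is an equally likely random hit.
    """
    if not proteome_masses_kda:
        raise ValueError("empty proteome mass list")
    low, high = slice_range_kda
    lower_bound = low * (1.0 - tolerance)
    upper_bound = high * (1.0 + tolerance)
    in_range = sum(1 for mass in proteome_masses_kda
                   if lower_bound <= mass <= upper_bound)
    return 100.0 * in_range / len(proteome_masses_kda)


if __name__ == "__main__":
    # Toy masses (kDa), not real yeast annotations.
    toy_masses = [10.0, 23.0, 30.0, 40.0, 41.0, 60.0, 80.0, 100.0]
    print(expected_random_conformance(toy_masses, (25.0, 37.0), 0.10))
```

Comparing this baseline with the reverse conformance scores gives a sense of how much of a forward conformance score could arise by chance alone.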
Conformance scores computed at a 25% mass tolerance indicate that around 27% of reverse peptide matches from SEQUEST conformed to the expected parent protein size range in both b/y and a/b/y ion screens, while an average of 35.3% and 32.5% of reverse peptide matches from OMSSA conformed to the expected parent protein size range in b/y and a/b/y ion screens respectively (Table 3).

[Figure 4 panels. A. SEQUEST at 10% and 25% mass tolerance: "Effects of N-glycosylation on Peptide Conformance" (x-axis: Asn-X-Ser/Thr Motif Count; y-axis: Frequency) and "Effects of O-glycosylation on Peptide Conformance" (x-axis: Percent Serine and Threonine (%); y-axis: Frequency). B. OMSSA Implementations 0 and 1 at 10% and 25% mass tolerance: "Effects of O-glycosylation on Peptide Conformance" (x-axis: Percent Serine and Threonine (%); y-axis: Frequency). Legend: Conforming, Nonconforming Above, Nonconforming Below.]

Figure 4.
Glycosylation does not account for nonconforming peptide matches with parent proteins below the gel slice MW range. Yellow lines represent nonconforming peptides with parent protein MW above the gel slice range, light blue lines represent nonconforming peptides with parent protein MW below the gel slice range, and dark blue lines represent conforming peptide matches. Assessment for both SEQUEST (A) and OMSSA (B) indicated elevation of the Asn-X-Ser/Thr motif or serine and threonine residues in peptides having parent proteins with MW above the gel slice range.

[Figure 4B (continued). OMSSA Implementations 2, 3, and 4: "Effects of O-glycosylation on Peptide Conformance" (x-axis: Percent Serine and Threonine (%); y-axis: Frequency).]

III.
Union of OMSSA Implementations

Although overall conformance scores (10% mass tolerance) computed from peptide matches detected by the OMSSA algorithm (~88%) were slightly higher than those from peptide matches detected by the SEQUEST algorithm (~85%), the high confidence in algorithm peptide matches regardless of algorithm, implementation, or ion screen suggested that using multiple implementations of OMSSA can increase the yield of detected peptides. Counting detected peptides once even if seen in multiple implementations (b/y ion screen: 3,922 distinct peptides; a/b/y ion screen: 3,662 distinct peptides), we grouped peptides according to the number of OMSSA implementations in which they were detected.

1. Conformance scoring based on multiple OMSSA implementations

When a peptide was detected by a single OMSSA implementation subjected to the b/y ion screen, the conformance score (10% mass tolerance) indicated that only 61.1% of the detected peptides conformed to their respective parent protein molecular weight size ranges. As the number of OMSSA implementations that detected the peptide increased from two to all five, the conformance score rose to 78.9%, 84.1%, 89.8%, and 90.0% respectively (Figure 5A). Similar trends were seen in peptide matches detected by OMSSA implementations subjected to the a/b/y ion screen (Figure 5A), indicating that including the search for a ions in the OMSSA algorithm has a minimal effect on the yield of peptide matches.
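The grouping described above, where each distinct peptide is counted once and binned by the number of OMSSA implementations that detected it, can be sketched as follows. This is a simplified illustration rather than the analysis pipeline used for Figure 5: it ignores the per-MS-Run bookkeeping, and the function names, the dictionary layout, and the toy peptide labels are all hypothetical.

```python
from collections import defaultdict


def implementation_support(detections):
    """Map each peptide to the number of implementations that detected it.

    detections: dict of implementation id -> set of peptide sequences.
    """
    support = defaultdict(int)
    for peptides in detections.values():
        for peptide in peptides:
            support[peptide] += 1
    return dict(support)


def conformance_by_support(detections, is_conforming):
    """Conformance score (%) for peptides grouped by how many
    implementations detected them.

    is_conforming: dict of peptide sequence -> bool.
    """
    counts = defaultdict(lambda: [0, 0])  # support -> [conforming, nonconforming]
    for peptide, k in implementation_support(detections).items():
        counts[k][0 if is_conforming[peptide] else 1] += 1
    return {k: 100.0 * c / (c + n) for k, (c, n) in sorted(counts.items())}


def union_without_singletons(detections):
    """Union of all detections, dropping peptides seen in one implementation only."""
    return {p for p, k in implementation_support(detections).items() if k > 1}


if __name__ == "__main__":
    toy = {0: {"A", "B", "C"}, 1: {"A", "B"}, 2: {"A", "D"}}
    flags = {"A": True, "B": True, "C": False, "D": True}
    print(conformance_by_support(toy, flags))
    print(sorted(union_without_singletons(toy)))
```

The second helper corresponds to the option of taking the union across implementations and then discarding peptides supported by a single implementation.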
The increase in the conformance score of peptides detected by more than one OMSSA implementation and the low percentage of peptides (b/y ion screen: 5%; a/b/y ion screen: 7.8%) detected by a single OMSSA implementation suggest two methods for increasing the yield of detected peptides while maintaining high confidence in the matches: i) directly take the union of detected peptides from multiple OMSSA search implementations, or ii) discard peptide matches detected by a single OMSSA implementation after taking the union of detected peptides from multiple OMSSA implementations. In the first method, where distinct peptides are counted once per MS Run and once per OMSSA implementation, the overall conformance scores (10% mass tolerance) are 86.2% and 85.3% for the b/y ion screen and a/b/y ion screen respectively; these high confidence conformance scores are comparable to those of the 4,480 and 4,757 peptides detected by SEQUEST (84.4% and 84.5%).

2. Contribution of individual implementations to the pool of distinct peptides detected by multiple OMSSA implementations

Alternatively, a third approach is to discard peptide matches detected by the OMSSA implementation that contributes most to the pool of distinct peptides detected by a single OMSSA implementation. Comparing the contribution of each OMSSA implementation to the detection of distinct peptides found in multiple implementations, OMSSA implementations 1 and 2 contribute the most to the pool of peptides detected in a single OMSSA implementation (Figure 5B). However, implementations 1 and 2 not only contribute at levels similar to other implementations for peptide matches detected by multiple OMSSA implementations, they also have high overall conformance scores, 87.7% and 88.7% respectively.
This suggests that taking the union of detected peptides from multiple OMSSA implementations and discarding peptide matches detected by a single OMSSA implementation is more suitable for increasing the yield of high confidence peptide matches if the individual OMSSA implementations also have high conformance scores. (Figure 5 located on page 30)

Figure 5. Union of multiple OMSSA implementations increases both the yield of detected peptides and the Conformance Score. A. Conformance scores calculated from distinct peptides detected in multiple implementations of the b/y or a/b/y ion screens in OMSSA (peptides counted once even if seen in multiple implementations). B. Contribution of each OMSSA implementation to the distinct peptides detected in multiple implementations of the b/y or a/b/y ion screens.

[Figure 5A panels: "Conformance of Distinct Peptides detected by multiple OMSSA implementations (b/y ion screen)" and "(a/b/y ion screen)"; x-axis: Number of OMSSA Implementations (All, 1-5); y-axis: Conformance Score (%); series: 10% and 25% mass tolerance.]

Distinct peptides by number of OMSSA implementations (Figure 5A):

| Ion screen | Measure | All | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| b/y | # Distinct Peptides | 3922 | 198 | 667 | 276 | 1097 | 1684 |
| b/y | Percentage of all matches | 100.0 | 5.0 | 17.0 | 7.0 | 28.0 | 42.9 |
| a/b/y | # Distinct Peptides | 3662 | 286 | 766 | 255 | 1059 | 1296 |
| a/b/y | Percentage of all matches | 100.0 | 7.8 | 20.9 | 7.0 | 28.9 | 35.4 |
[Figure 5B panels: "Contribution of individual OMSSA implementations to the pool of distinct peptides detected by multiple OMSSA implementations (b/y ion screen)" and "(a/b/y ion screen)"; x-axis: Number of OMSSA Implementations (All, 1-5); y-axis: Percentage (%); legend: Implementation 0, Implementation 1, Implementation 2, Implementation 3, Implementation 4.]

IV. Overlap of Peptide Matches in a/b/y ion and b/y ion Screens

In contrast to the multiple implementations used in OMSSA, we submitted spectra from the 22 MS/MS experiments to SEQUEST using a single search algorithm parameter implementation with either b/y ion screening or a/b/y ion screening. To investigate whether including the search for a ions increases the yield of peptide detections, we grouped distinct peptides according to detection by the b/y ion screen exclusively (BY ions), the a/b/y ion screen exclusively (ABY ions), or by both the b/y and a/b/y ion screens (ABY ions & BY ions).

1. SEQUEST

The conformance score based on peptide matches detected by SEQUEST in both the b/y and a/b/y ion screens was significantly higher than that based on peptide matches detected exclusively by the b/y ion screen or the a/b/y ion screen (Figure 6A). Counting distinct peptides once per MS/MS experiment (per MS Run), peptides detected by SEQUEST in both the b/y and a/b/y ion screens had a conformance score of 87.4% (mass tolerance of 10%), while the conformance scores based on peptides detected by SEQUEST exclusively in the b/y or a/b/y ion screens were 34.7% and 61.6% respectively. A similar trend was seen when counting distinct peptides once per parent protein molecular weight size range (per Gel Slice Range).
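The three groups used above (BY ions, ABY ions, and ABY ions & BY ions) amount to a set partition of the peptides returned by the two ion screens. A minimal sketch, assuming each screen's results are available as a collection of peptide identifiers; the function name and the toy identifiers are hypothetical.

```python
def partition_by_ion_screen(by_peptides, aby_peptides):
    """Split peptides into those seen only in the b/y screen, only in
    the a/b/y screen, or in both screens."""
    by_set, aby_set = set(by_peptides), set(aby_peptides)
    return {
        "BY ions": by_set - aby_set,
        "ABY ions": aby_set - by_set,
        "ABY ions & BY ions": by_set & aby_set,
    }


if __name__ == "__main__":
    groups = partition_by_ion_screen({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
    for name, members in groups.items():
        print(name, sorted(members))
```

As a consistency check on the reported SEQUEST counts per MS Run, the 4,221 peptides seen in both screens plus the 259 seen only in the b/y screen give the 4,480 b/y peptides, and 4,221 plus the 536 seen only in the a/b/y screen give the 4,757 a/b/y peptides.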
The lower conformance scores based on distinct peptide matches per Gel Slice Range as compared to the corresponding conformance scores based on matches per MS Run (Figure 6A) are likely due to the decreased representation of a peptide in the pool of distinct forward peptides even if seen in multiple MS/MS experiments.

Using the SEQUEST algorithm, we detected 5,016 peptides, counting a peptide once per MS/MS experiment. Of the 5,016 detected peptides, 2,261 were distinct peptides. Of the 2,261 distinct peptides, 78% (1,752) were detected by both the b/y and the a/b/y ion screens (Figure 10), suggesting that confidence in SEQUEST peptide matches can be increased by screening spectra for both b/y and a/b/y ions and subsequently retaining only the peptide matches detected by both ion screens. Using bootstrap analysis, we found that the conformance scores for ABY ions (61.6%) and BY ions (34.7%) lie outside the distribution of conformance scores calculated after random sampling with replacement from a full set of 4,757 and 4,480 distinct peptides counted once per MS/MS experiment in the a/b/y and b/y ion screens respectively (Figure 7). This indicated that the conformance score observed for peptide matches detected by both the b/y and a/b/y ion screens using the SEQUEST algorithm is significantly higher than the conformance scores for peptide matches detected exclusively by either ion screen alone (Figure 7; p < 0.01, p < 0.001).

2. OMSSA

In contrast, the increase in the conformance score based on peptide matches detected by both b/y and a/b/y ion screens as compared to the conformance scores of peptide matches detected by either the b/y or a/b/y ion screens alone was not as pronounced in the OMSSA algorithm (Figure 6C).
Counting a peptide once per MS/MS experiment, the average conformance scores for peptide matches detected by OMSSA implementations were 78.7% and 86.1% for exclusive detection by a/b/y and b/y ion screens respectively, and 89% for detection by both a/b/y and b/y ion screens. Counting a peptide once per gel slice range, the average conformance scores for peptide matches detected by OMSSA implementations were 68% and 79.2% for exclusive detection by a/b/y and b/y ion screens respectively, and 86.3% for detection by both a/b/y and b/y ion screens. This suggests that including the search for a ions in the OMSSA algorithm has a minimal effect on not only the yield of peptide matches (Figure 5A) but also the confidence in algorithm peptide matches (Figure 6C). (Figure 6 located on pages 34-35)

[Figure 6A panels: "Conformance based on Distinct Peptide Matches per MS Run" and "Conformance based on Distinct Peptide Matches per Gel Slice Range"; x-axis: Distinct Peptides (ABY ions, ABY ions & BY ions, BY ions); y-axis: Conformance Score (%); series: 10% and 25% mass tolerance.]

Figure 6. Conformance scoring based on distinct peptides detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens. Algorithm detected peptides were selected based on distinct peptide matches counted once per MS Run or once per Gel Slice Range. B. SEQUEST detected 5,016 peptides, counting each peptide once per MS/MS experiment. Of the 5,016 peptides, 4,221 peptides were detected by both ion screens, while 536 and 259 peptides were detected by a/b/y and b/y ion screens respectively.

Figure 6 (continued): C. OMSSA conformance scoring based on distinct peptides detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens.
Conformance based on Distinct Peptide Matches per MS Run, 10% Mass Tolerance

| OMSSA Implementation | ABY Ions | ABY Ions & BY Ions | BY Ions |
|---|---|---|---|
| 0 | 82.0 (344) | 89.3 (2239) | 87.3 (821) |
| 1 | 78.1 (402) | 88.5 (2991) | 84.2 (653) |
| 2 | 75.8 (298) | 89.5 (1404) | 87.3 (730) |
| 3 | 77.7 (364) | 88.4 (2701) | 84.2 (594) |
| 4 | 79.8 (336) | 89.3 (2220) | 87.6 (815) |

Conformance based on Distinct Peptide Matches per Gel Slice Range, 10% Mass Tolerance

| OMSSA Implementation | ABY Ions | ABY Ions & BY Ions | BY Ions |
|---|---|---|---|
| 0 | 73.0 (152) | 86.7 (1204) | 80.3 (346) |
| 1 | 66.3 (181) | 85.5 (1604) | 77.6 (299) |
| 2 | 65.6 (154) | 87.3 (860) | 79.7 (344) |
| 3 | 66.5 (170) | 85.5 (1490) | 77.3 (278) |
| 4 | 68.6 (153) | 86.7 (1194) | 81.3 (348) |

Conformance based on Distinct Peptide Matches per MS Run, 25% Mass Tolerance

| OMSSA Implementation | ABY Ions | ABY Ions & BY Ions | BY Ions |
|---|---|---|---|
| 0 | 91.0 | 96.3 | 93.1 |
| 1 | 86.8 | 95.7 | 91.0 |
| 2 | 86.2 | 96.1 | 92.2 |
| 3 | 85.7 | 95.7 | 91.1 |
| 4 | 89.3 | 96.3 | 93.4 |

Conformance based on Distinct Peptide Matches per Gel Slice Range, 25% Mass Tolerance

| OMSSA Implementation | ABY Ions | ABY Ions & BY Ions | BY Ions |
|---|---|---|---|
| 0 | 83.6 | 94.9 | 89.6 |
| 1 | 75.7 | 94.3 | 88.0 |
| 2 | 77.9 | 95.3 | 87.8 |
| 3 | 74.7 | 94.4 | 87.4 |
| 4 | 80.4 | 95.0 | 90.5 |

[Figure 7 panels. A. "Conformance of SEQUEST distinct peptides per MS Run (10% Mass Tolerance, ABY ions, n = 536)", marked score: 61.6%. B. "Conformance of SEQUEST distinct peptides per MS Run (10% Mass Tolerance, BY ions, n = 259)", marked score: 34.7%. x-axis: Conformance Score (%); y-axis: Frequency; series: 100x and 1000x bootstrap cycles.]

Figure 7. Peptides detected by SEQUEST in both the b/y and a/b/y ion screens give significantly higher conformance scores compared to peptides detected by b/y or a/b/y ion screens alone. Blue dotted lines and percentages represent the conformance score based on distinct peptides detected by b/y or a/b/y screens alone. Number of bootstrap cycles: 100 (purple) or 1000 (dark blue). A.
Distribution of conformance scores calculated after random sampling with replacement of 536 samples from a full set of 4,757 distinct peptides per MS Run detected in the a/b/y ion screen. B. Distribution of conformance scores calculated after random sampling with replacement of 259 samples from a full set of 4,480 distinct peptides per MS Run detected in the b/y ion screen.

V. Protein Expression

Although the sample preparations for the 22 MS/MS experiments did not include modifications such as glutaraldehydation, the SEQUEST algorithm successfully matched a higher percentage of detected peptides to the correct parent protein molecular weight size ranges when the search space included a ions (Figure 6A). In contrast, OMSSA did not show a similar trend, indicating that the SEQUEST algorithm uses a scoring and matching system that is inherently different from the OMSSA algorithm. Indeed, SEQUEST is a cross-correlation based algorithm, while OMSSA is a probability-based algorithm. Unfortunately, unlike OMSSA, which is an open-source algorithm, SEQUEST is a proprietary algorithm and its source code is unavailable to the general public. In order to elucidate the inherent differences between SEQUEST and OMSSA when increasing the ion search space, we asked whether protein expression plays a role in the higher conformance score for peptide matches detected by SEQUEST in both the b/y and a/b/y ion screens.

1.
Parent protein expression of peptide matches detected exclusively in the b/y or a/b/y ion screens, and in both the b/y and a/b/y ion screens

Comparing the distribution of parent protein expression values for peptide matches detected in the b/y ion screen or a/b/y ion screen alone to that for peptide matches detected in both the b/y and a/b/y ion screens using the SEQUEST algorithm, we found that peptides detected exclusively by either the b/y or a/b/y ion screens had parent proteins with significantly lower protein expression levels as compared to the parent proteins of peptides detected by both the b/y and a/b/y ion screens (Figure 8A; p < 4 × 10^-14 for BY ions, p < 0.0004 for ABY ions). In contrast, the distribution of parent protein expression values for peptide matches detected by the OMSSA algorithm did not differ significantly between the ion screening groups (Figure 8B), possibly accounting for the high confidence in the peptide matches detected exclusively by either the b/y or a/b/y ion screens (Figure 6C).

2. Bootstrap analysis: Confidence in peptide matches dependent on parent protein expression

The significantly lower parent protein expression (Figure 8A) in combination with the lower conformance score of peptide matches detected exclusively by the b/y or a/b/y ion screens (Figure 6A) suggests that although the SEQUEST algorithm can detect peptides of lower abundance, the confidence in the quality of these peptide-spectrum matches also decreases. To investigate whether the matching of spectra to peptide sequences in the SEQUEST algorithm is dependent on protein expression, we ranked the parent proteins of peptide matches detected exclusively by the b/y or a/b/y ion screens according to protein expression, and asked whether the conformance scores computed from the top or bottom 1/3 pool of protein expressers lie outside a distribution of conformance scores computed from a bootstrap analysis (Figure 9).
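The bootstrap comparison described above can be sketched as repeated sampling with replacement from the full set of peptide matches, recomputing the conformance score each cycle, and then asking how often the resampled scores are at least as extreme as the score observed for a subset such as the top or bottom third of expressers. This is a generic sketch under stated assumptions, not the procedure used to produce Figures 7 and 9: the function names are hypothetical, the input is assumed to be one boolean conforming flag per peptide match, and reading "zero of N cycles as extreme" as p < 1/N is an interpretation rather than something stated in the text.

```python
import random


def bootstrap_conformance(conforming_flags, sample_size, cycles=1000, seed=0):
    """Return conformance scores (%) from `cycles` resamples of
    `sample_size` flags drawn with replacement from the full set."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(cycles):
        sample = rng.choices(conforming_flags, k=sample_size)
        scores.append(100.0 * sum(sample) / sample_size)
    return scores


def fraction_as_extreme(observed_score, bootstrap_scores):
    """Fraction of bootstrap scores at least as extreme as the observed
    score, taking the smaller of the two tails."""
    at_or_below = sum(s <= observed_score for s in bootstrap_scores)
    at_or_above = sum(s >= observed_score for s in bootstrap_scores)
    return min(at_or_below, at_or_above) / len(bootstrap_scores)


if __name__ == "__main__":
    # Toy full set: 85 conforming and 15 nonconforming matches.
    full_set = [True] * 85 + [False] * 15
    scores = bootstrap_conformance(full_set, sample_size=50, cycles=200)
    print(min(scores), max(scores))
    print(fraction_as_extreme(30.0, scores))
```

With 100 or 1,000 cycles, an observed score that falls outside every resampled score would correspond to a fraction of zero under this sketch.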
Bootstrap analysis (Figure 9A; p < 0.01, p < 0.001) revealed that the conformance scores of parent proteins in the top or bottom 1/3 pool of protein expressers for peptides detected by SEQUEST lie outside the distribution of bootstrap conformance score data, but the corresponding conformance scores of parent proteins for peptides detected by OMSSA lie within the distribution of the bootstrap data. The combined results from Figure 8A and Figure 9A support the idea that compared to OMSSA, the SEQUEST algorithm can detect peptides of lower abundance, but with the caveat that confidence in lower abundance peptides is also low. (Figure 8 located on page 40)

Figure 8. Parent protein expression of distinct peptide matches (per MS Run) detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens. Protein expression values were obtained from genomic-scale western analyses in yeast (Ghaemmaghami et al., 2003). Proteins of unknown expression or expression values of zero were disregarded. Pink distribution lines represent parent protein expression values of peptides detected by the b/y ion screen exclusively. Blue distribution lines represent parent protein expression values of peptides detected by the a/b/y ion screen exclusively. Green distribution lines represent parent protein expression values of peptides detected by both ion screens. A. SEQUEST. B. OMSSA implementations 0-4.
[Figure 8 panels. A. "Protein Expression based on SEQUEST Distinct Peptide Matches per MS Run". B. "Protein Expression based on Distinct Peptide Matches per MS Run" for OMSSA Implementations 0, 1, 2, 3, and 4. x-axis: Protein Expression log(pe); y-axis: Frequency; legend: ABY ions, BY ions, ABY ions & BY ions; each panel is annotated with two p-values.]
[Figure 9 panels. A. "Conformance of SEQUEST distinct peptides per MS Run (10% Mass Tolerance, BY ions, n = 1244)" and "(10% Mass Tolerance, ABY ions, n = 1244)", with the Bottom 1/3 and Top 1/3 scores marked. B. "Conformance of OMSSA distinct peptides per MS Run (BY ions: Implementation 0, n = 840)" and "(ABY ions: Implementation 0, n = 714)". x-axis: Conformance Score (%); y-axis: Frequency; series: 100x and 1000x bootstrap cycles.]

Figure 9. Confidence in peptides detected by SEQUEST is dependent on parent protein expression. Conformance scores computed at 10% mass tolerance. Purple and blue lines represent distribution of bootstrap analysis data from 100 cycles and 1,000 cycles respectively. Dotted lines and percentages represent the conformance score of the top and bottom 1/3 protein expressers in the full set of distinct peptides. A. SEQUEST algorithm: distribution of conformance scores calculated after random sampling with replacement of 1,244 samples from a full set of 3,731 distinct peptides per MS Run. B. OMSSA algorithm, Run 0: distribution of conformance scores calculated after random sampling with replacement of 840 samples from a full set of 2,520 distinct peptides per MS Run.
[Figure 9B (continued). "Conformance of OMSSA distinct peptides per MS Run" for BY ions: Implementation 1 (n = 1004), Implementation 2 (n = 581), Implementation 3 (n = 904), Implementation 4 (n = 834); and for ABY ions: Implementation 1 (n = 941), Implementation 2 (n = 470), Implementation 3 (n = 845), Implementation 4 (n = 706). x-axis: Conformance Score (%); y-axis: Frequency; series: 100x and 1000x bootstrap cycles.]

VI. Increasing the Yield of High Confidence Peptide Matches

1.
Distinct peptides detected by both SEQUEST and OMSSA algorithms

The qualitative difference in the results from the SEQUEST and OMSSA algorithms suggests that the two algorithms provide different samplings of the same MS/MS data. To investigate whether the use of multiple algorithms increases the yield of high confidence peptide matches, we counted each peptide once regardless of how many MS/MS experiments identified it, which CID ion screen was applied, and which implementation revealed it, and then computed the number of distinct peptides detected by both algorithms. Of the 2,261 distinct peptides detected by the SEQUEST algorithm and the 2,097 distinct peptides detected by the OMSSA algorithm, 1,516 distinct peptides (53.3%) were detected by both algorithms (Figure 10). The increased number of detected peptides (2,842 distinct peptides) obtained by combining results from both algorithms, together with the high conformance scores for the individual algorithms (Table 2), suggests that using multiple algorithms alongside the algorithm-specific methods for increasing confidence in peptide matches (retaining peptide matches detected by both b/y and a/b/y ion screens for SEQUEST; taking the union of peptide matches detected by multiple implementations of OMSSA) increases the yield of high confidence peptide-spectrum matches.

[Figure 10, panel B: pie chart titled "Distinct Peptides detected in Multiple Implementations of OMSSA", with slices labeled by number of implementations: 1 (10%), 2 (18%), 3 (6%), 4 (19%), 5 (47%).]

Figure 10. Distinct peptides detected by both algorithms. A. SEQUEST detected 5,016 peptides counting a peptide once per MS Run even if seen in both b/y and a/b/y ion screens. Of the 5,016 peptides, 2,261 are distinct (peptides found in multiple MS Runs are only counted once). Of the 2,261 distinct peptides, 309 peptides are seen in the a/b/y ion screen alone, 200 peptides are seen in the b/y ion screen alone, and 1,752 peptides are seen in both the b/y and a/b/y ion screens. B.
OMSSA detected 4,429 peptides counting a peptide once per MS Run even if seen in both b/y and a/b/y ion screens or seen in multiple implementations. Of the 4,429 peptides, 2,097 are distinct (peptides found in multiple MS Runs are only counted once). Of the 2,097 peptides, 203 peptides are seen in a single implementation, 372 are seen in two implementations, 118 are seen in three implementations, 392 are seen in four implementations, and 1,012 are seen in five implementations. C. 1,516 distinct peptides are found in both SEQUEST and OMSSA, while 745 distinct peptides are unique to SEQUEST and 581 distinct peptides are unique to OMSSA. Counting each peptide once, regardless of which implementation of the algorithm, how many MS/MS experiments, or which CID ion screens revealed the peptide, we find that 53.34% of the peptides are detected by both algorithms.

2. Conformance scoring using algorithm-specific evaluation methods

To verify this claim, we computed conformance scores based on distinct peptides detected by the SEQUEST and OMSSA algorithms after implementation of the algorithm-specific evaluation methods. Trends seen in peptides detected by both algorithms or by either algorithm alone are consistent with previous data, where peptides detected by both a/b/y and b/y ion screens in SEQUEST (Figure 6A) and peptides detected by multiple implementations of OMSSA (Figure 5A) had higher overall conformance scores than peptides detected by the individual ion screens in SEQUEST and by a single implementation of OMSSA, respectively. Peptides detected by both algorithms had a conformance score of 88.2%, with 93.1% of these peptides detected in both the a/b/y and b/y ion screens of SEQUEST, and only 1.5% detected in a single implementation of OMSSA.
In contrast, peptides detected by SEQUEST alone had a conformance score of 65.4%, with 54.4% of these peptides detected in either the a/b/y or the b/y screen alone, while peptides detected by OMSSA alone had a conformance score of 62.1%, with 31% of these peptides detected in a single implementation of OMSSA (Table 4A-C). Although peptides detected by both algorithms had the higher conformance score (88.2%), peptides detected by SEQUEST alone (65.4%) and by OMSSA alone (62.1%) had similar conformance scores, suggesting similar confidence in the two algorithms. Furthermore, when restrictions are applied, where peptides detected by both algorithms or by SEQUEST alone are required to be detected in both a/b/y and b/y ion screens of SEQUEST, and where peptides detected by both algorithms or by OMSSA alone are required to be detected in more than one implementation of OMSSA, there is a striking increase in the conformance score of peptides detected by each algorithm alone (SEQUEST: 65.4% to 80.6%; OMSSA: 62.1% to 73.4%), whereas the score for peptides detected by both algorithms is essentially unchanged (88.2% to 88.3%; Table 4C vs. 4D). In addition, applying such restrictions does not substantially decrease the number of detected peptides (20.8% decrease in number of distinct peptides; Table 4D), suggesting that using different algorithms and multiple implementations, in addition to requiring a/b/y and b/y ion screen detection for SEQUEST-detected peptides and detection in >1 implementation for OMSSA, not only increases the yield in peptide detection but also increases confidence in the peptide matches. (Table 4)

Table 4. Conformance scoring using algorithm-specific evaluation methods. A. Percentages of peptides detected by both algorithms or by SEQUEST alone that are also detected in both a/b/y and b/y ion screens in SEQUEST. B. Percentages of peptides detected by both algorithms or by OMSSA alone that are also detected in multiple implementations of OMSSA. C. Conformance Scores calculated from distinct peptides detected by both algorithms, SEQUEST only, or OMSSA only. D.
Conformance scores for peptides detected by both algorithms or by OMSSA alone, where peptides were limited to those detected in >1 implementations of OMSSA. Conformance scores for peptides detected by both algorithms or by SEQUEST alone, where peptides were limited to those detected in both a/b/y and b/y ion screens in SEQUEST.

A. Percentage of distinct peptides detected in SEQUEST ion screens that are detected by:

Ion screen       Both Algorithms    SEQUEST alone
a/b/y            5.7                29.8
a/b/y & b/y      93.1               45.6
b/y              1.1                24.6

B. Percentage of distinct peptides detected in multiple OMSSA implementations that are detected by:

# Implementations    Both Algorithms    OMSSA alone
1                    1.5                31.0
2                    10.4               37.0
3                    4.3                9.1
4                    21.8               10.7
5                    62.1               12.2

C. Conformance Scores of distinct peptides detected by:

SEQUEST (745)    Both Algorithms (1516)    OMSSA (581)
65.4             88.2                      62.1

D. Conformance Scores after applying evaluation methods to distinct peptides detected by:

SEQUEST (340)    Both Algorithms (1412, 1493)    OMSSA (401)
80.6             88.3                            73.4

Conclusion

The rise of proteomics has been made possible by a number of factors, including: i) the complete genome sequencing of many organisms; ii) the development of fragmentation techniques in conjunction with advances in the resolution, sensitivity, and m/z ratio ranges of mass spectrometry instrumentation; iii) the establishment of public repository databases for high-throughput data; iv) the increasing awareness of the importance of interdisciplinary methods in research (bioinformatics and computational biology); v) the development of spectra-matching algorithms that use protein sequence databases, derived from the translation of putative ORFs, as a resource.

I. Evaluation Methods for MS/MS Algorithms and Implementations

Prior to high-throughput proteomic studies, data from publications were easily available for validation through wet-lab experimentation.
For example, if a lab publishes a paper identifying a protein-protein interaction between proteins X and Z through co-immunoprecipitation, a different lab can either conduct the same experiment or use other methods such as pull-down assays to confirm the interaction. In proteomic studies today, many resources are not available, or are not made available, for validation by repeating the experiment. For example, lab A has a high resolution, high sensitivity, but extremely expensive LTQ FT mass spectrometer. Additionally, lab A has been using outdated proprietary software for its spectra analysis. Unfortunately, lab B has run out of grant funds and owns a first-generation time-of-flight mass spectrometer. Furthermore, to lab B’s disappointment, the terms of the journal did not require the raw spectral files and coding scripts used for the publication to be made available online, and lab A is unreachable. In this extreme situation, lab B has doubts about the analysis of high-resolution data with an outdated algorithm, but can only trust that the published data have been reviewed and are accurate. As seen in this example, the problems that have arisen with the advancements in proteomics today include: i) the inability to have confidence in peptide matches identified by spectra-matching algorithms, and ii) the inability to reproduce published data from raw MS/MS spectra files. Both problems impede the ultimate goal of proteomics: identifying and characterizing proteins so as to understand how the biological system functions. Fortunately, methods can be used to increase confidence in peptides and proteins identified by algorithms, and there is increasing awareness of, and effort placed in, making raw data and coding scripts publicly available. These are areas that this study hopes to have addressed through parent protein profiling of 22 MS/MS experiments performed on yeast cell lysates.
In current proteomic studies, the confidence in peptide-spectrum matching by algorithms is assessed primarily through the application of probability thresholds at a given false identification rate (e.g., FIR of 1% or 5%). Unfortunately, the typical use of a reverse set of peptide sequences for every ‘validated’ peptide sequence in a FASTA format sequence database has its drawbacks. Specifically, reverse sequences may contain homologous sequences to ‘validated’ peptide sequences, resulting in a false negative match (Nesvizhskii et al., 2007). In this study, despite applying a probability threshold of a 5% FIR, the conformance scores varied across different algorithms and implementations (Table 2), indicative of different samplings due to inherent differences between algorithms or due to slight changes in implementation parameter settings. Therefore, a reported FIR of 1% does not by itself provide high confidence in the published data.

1. Benchmark data: 22 MS/MS experiments for optimization of algorithm implementations

Although an alternative to decoy-based FIR would be to use an empirical Bayes approach, such as that implemented by PeptideProphet (Nesvizhskii et al., 2007), we suggest an algorithm evaluation method to be used in conjunction with the standard application of false identification rates. We have created a set of 22 MS/MS spectra files from 22 experiments that test an algorithm’s ability to detect peptides whose parent protein sizes fall within a known molecular weight range, despite the algorithm having no knowledge of the parent protein mass prior to trypsin digestion.
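The test described above can be sketched as follows. This is a minimal illustration under assumptions, not the conformance score computation used in this study: it assumes that each gel slice has a known lower and upper molecular weight bound in kDa, and that the mass tolerance widens both bounds proportionally. The function names and the example masses are hypothetical.

```python
def conforms(parent_mw_kda, slice_min_kda, slice_max_kda, tolerance=0.10):
    """True if the parent protein mass lies within the widened slice range.

    Assumption: the tolerance lowers the minimum and raises the maximum
    by the same fraction (10% by default).
    """
    low = slice_min_kda * (1.0 - tolerance)
    high = slice_max_kda * (1.0 + tolerance)
    return low <= parent_mw_kda <= high


def conformance_score(parent_mws_kda, slice_min_kda, slice_max_kda, tolerance=0.10):
    """Percentage of detected peptides whose parent protein conforms."""
    if not parent_mws_kda:
        raise ValueError("no peptides to score")
    hits = sum(
        conforms(mw, slice_min_kda, slice_max_kda, tolerance)
        for mw in parent_mws_kda
    )
    return 100.0 * hits / len(parent_mws_kda)


# Hypothetical parent protein masses (kDa) for peptides matched in one gel slice
# that is assumed to span 50 to 60 kDa; three of the four fall within 45 to 66 kDa.
example_mws = [52.0, 61.5, 47.0, 88.0]
score = conformance_score(example_mws, slice_min_kda=50.0, slice_max_kda=60.0)
print(score)  # 75.0
```

Because the search algorithm never sees the gel slice bounds, a high score computed this way reflects agreement between the peptide matches and independent experimental information.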
Our results have shown that despite the inherent differences between the design and structure of different algorithms (SEQUEST and OMSSA), and despite parameter setting variations between different algorithm implementations (OMSSA), the computed conformance scores indicate high confidence in peptide matches from all algorithms and implementations of algorithms. This suggests that investigators performing large-scale yeast proteomic studies may use our set of 22 MS/MS spectra as an optimization tool-kit for choosing the best algorithm to use or for tweaking the parameter settings in an individual algorithm.

2. Multiple algorithms, implementations, and ion screens: increase yield of high confidence peptide matches

Furthermore, analyzing the peptide matches from our implementations of algorithms revealed algorithm-specific methods for increasing confidence in algorithm-detected peptides. In addition to the typical search for b and y ions (b/y ion screen), an additional screen that includes a ions (a/b/y ion screen) may be performed on MS/MS spectra files. After application of a 5% FIR, peptide matches that are detected in both ion screens, and not exclusively in either screen, can be used for further analysis. Our analysis of SEQUEST output data indicated a significantly higher conformance score for peptide matches detected in both ion screens as compared to peptide matches detected in either screen alone (Figure 6A). This method, however, seems to be specific to SEQUEST and did not apply to peptide matches detected by the OMSSA algorithm (Figure 6B). Instead, our analysis of OMSSA output data suggested that taking the union of peptide matches from multiple implementations of an algorithm increases the yield of detected peptides while maintaining high confidence in the matches (Figure 5A; Table 2).
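The two algorithm-specific steps described above, retaining peptides seen in both SEQUEST ion screens and pooling peptides across OMSSA implementations, amount to simple set operations. The following Python sketch illustrates them under assumptions: the function names, the dictionary layout, and the placeholder peptide labels are hypothetical and are not taken from the stored procedures used in this study.

```python
from collections import Counter


def both_ion_screens(by_peptides, aby_peptides):
    """Peptides detected in both the b/y and the a/b/y ion screens."""
    return set(by_peptides) & set(aby_peptides)


def implementation_counts(peptides_by_implementation):
    """Count how many implementations reported each distinct peptide."""
    counts = Counter()
    for peptides in peptides_by_implementation.values():
        counts.update(set(peptides))  # each implementation counts once per peptide
    return counts


# Placeholder labels stand in for peptide sequences.
sequest_by = {"PEP_A", "PEP_B", "PEP_C"}
sequest_aby = {"PEP_B", "PEP_C", "PEP_D"}
consensus = both_ion_screens(sequest_by, sequest_aby)

omssa = {
    0: {"PEP_A", "PEP_B"},
    1: {"PEP_A"},
    2: {"PEP_A", "PEP_E"},
}
counts = implementation_counts(omssa)
union = set(counts)  # every peptide reported by any implementation
multiple = {pep for pep, n in counts.items() if n > 1}
print(sorted(consensus), sorted(union), sorted(multiple))
```

Keeping the per-peptide implementation count, rather than only the union, makes it possible to examine peptides reported by one implementation separately from those reported by several.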
In particular, our data suggested that peptide matches detected by a single OMSSA implementation have a low conformance score, indicative of low confidence in the match, and should be discarded. For both SEQUEST and OMSSA, implementation of the algorithm-specific methods to increase confidence in peptide matches indicated an increase in conformance scores for peptide matches detected by SEQUEST or OMSSA alone (Table 4C vs. 4D).

3. Practical limitations to benchmark data sets

In conclusion, our data suggest the use of multiple algorithms, implementations, and ion screens, in addition to the algorithm-specific filtering methods described, to increase the yield of high confidence peptide matches. Combining samplings from different algorithms so as to increase peptide yield has also been suggested by a number of other groups (Searle et al., 2008; Keller et al., 2005; Price et al., 2007). Furthermore, we plan to assemble a tool-kit for algorithm and implementation evaluation using the 22 MS/MS spectra files, and to make the raw spectral files available online. Admittedly, although such benchmark data sets allow for the tweaking of implementations and the choice of algorithms for large-scale proteomics investigations, their practicality is limited by the fact that they are heavily dependent on the type of mass spectrometer and the fragmentation method used to obtain the MS/MS spectra (Allmer, 2012). In a comparative study of SEQUEST and MASCOT spectrum-peptide matching performance on spectra obtained from LC-MS/MS analysis of the yeast proteome by different types of mass spectrometers (LTQ and QqTOF; QSTAR), Elias et al. found that while the two algorithms gave similar peptide matches using LTQ MS/MS spectra, the matches obtained from QSTAR MS/MS spectra were less similar (Elias et al., 2005).
Therefore, if our 22 MS/MS experiments were to act as a benchmark data set for proteomic studies in yeast, the computed conformance scores for different algorithms or implementations may be applicable only to investigators who aim to use CID fragmentation and a mass spectrometer of similar resolution and sensitivity to the LCQ-Deca XP in future proteomic experiments. Nonetheless, the parent protein-profiling approach is still a valid evaluation method to assess the performance of spectra-matching algorithms, and labs using other classes of mass spectrometers can apply the same methodology using new gel slice data sets.

II. Future Directions

1. Assessment of MASCOT using parent protein conformance scoring

With the recent acquisition of the MASCOT spectra-matching algorithm, we plan to use the spectra from the 22 MS/MS parent-profiling experiments as a benchmark data set for the optimization of search parameter implementations in MASCOT. As OMSSA and MASCOT are both probability-based spectra-matching algorithms, we expected similar numbers of detected peptides and high conformance scores for individual implementations set up to mirror the parameter settings in the OMSSA implementations. However, initial screenings performed using the MASCOT algorithm yielded considerably lower numbers of detected peptides in addition to slightly lower overall conformance scores, results that did not agree well with data from our analyses of the SEQUEST and OMSSA algorithms. As MASCOT is also proprietary software, a full understanding of the individual parameter settings may be impossible. However, additional screening and changes to the current parameter settings will allow us to find an optimal set or sets of implementations for future mass spectrometry-based yeast proteomic studies using LC-MS/MS on an ion trap mass spectrometer coupled with a nano-ESI probe and CID fragmentation.

2.
Applications of mass spectrometry-based proteomics

2.1 Elucidation of biological functions and pathways

Traditionally, individual protein-protein interactions in the budding yeast were revealed through transcriptional activation of a reporter gene in two-hybrid assays. Although comprehensive yeast two-hybrid studies have revealed a multitude of protein-protein interactions (Ito et al., 2001; Uetz et al., 2000), the use of such experimentation methods in the study of protein networks assumes an oversimplified model that disregards: i) other non-binding protein-protein interactions such as phosphorylation, ii) the downstream effects of protein-protein interactions, and iii) the interconnection between signaling pathways. Fortunately, mass spectrometry-based proteomic experiments can be designed to address all three issues by comparing differentially treated protein samples at the proteomic scale. In particular, the emergence of hybrid mass spectrometers and electron capture dissociation (ECD) has contributed to the identification of low abundance and post-translationally modified proteins, allowing for the elucidation of protein interaction networks and cellular pathways (Mumby and Brekken, 2005). In a recent phosphoproteomic study on the epidermal growth factor (EGF) signaling pathway, proteins from three populations of HeLa cells were differentially labeled with isotopic forms of lysine and arginine via SILAC (stable isotope labeling by amino acids in cell culture), and subsequently quantified using LC-MS/MS on a linear ion trap/Fourier transform mass spectrometer (LTQ-FT) after exposure of the HeLa cells to EGF for varying lengths of time (Olsen et al., 2006).
The temporal and kinetic profiling of the 6,600 phosphorylation sites on 2,244 proteins revealed clusters of phosphopeptides with defined functions at specific stages in the EGF receptor pathway, providing insight into the temporal dynamics and players involved in the initiation of the signaling pathway via autophosphorylation of the EGF receptor, the step-wise activation of the kinase cascades, and the regulation of transcription factors involved in cellular processes such as proliferation (Lemmon and Schlessinger, 2010).

In addition to revealing temporal aspects of protein-protein interactions in signaling pathways, mass spectrometry is also capable of elucidating higher order structural features involved in protein-protein or protein-nucleotide interactions. One of the first structural studies used limited proteolysis and mass spectrometry to identify the DNA-binding regions in the Max protein, confirming previous X-ray crystallographic results that suggested a DNA-binding site at the basic N-termini of the homodimer. Furthermore, the inherent design of the proteolytic protection assay, where the Max protein was subjected to proteolysis by six different endoproteases in the presence or absence of DNA, allowed for a comparison of MALDI-MS spectra that revealed a conformational change due to the binding of DNA (Cohen et al., 1995). More recently, improvements in cross-linking/mass spectrometry, where noncovalent protein-protein or protein-nucleotide interactions are converted to covalent bonds, have allowed for structural studies on protein complexes, such as the elucidation of heterodimeric coiled coils and tetramerization domain organization in the NDC80 complex (Maiolica et al., 2007).

2.2 Mass spectrometry-based proteomics in the clinical setting

With the maturation of mass spectrometry techniques and instrumentation in the lab setting, a shift toward clinical applications of mass spectrometry is likely to follow.
Just as the quantitative characterization of gene expression via DNA microarrays has led to the development of array comparative genomic hybridization (aCGH) in the diagnosis of diseases characterized by microdeletions or duplications of chromosomes (Galizia et al., 2012), quantitative characterization of protein expression via mass spectrometry can also be targeted toward the development of diagnostic and prognostic tools in medicine. One such tool, MALDI imaging mass spectrometry (MALDI-MS), has identified biomarkers for the diagnosis of gastric cancer at various pathologic stages by comparing endoscopic biopsy tissue samples from healthy individuals and cancer patients (Kim et al., 2010). Although MALDI-MS can directly analyze specific areas of tissue samples mounted on a MALDI matrix, thereby providing spatial information on protein expression, the application of the technique in clinical settings is limited by a number of factors: i) biomarkers identified by comparing sample tissue from “healthy” individuals and patients may be specific to only a subset of disease patients, resulting in false-positives or false-negatives (LaBaer, 2005); ii) the non-standardized workflow from patient sample collection, delivery to a proteomic lab, sample preparation, and mass spectrometry to data interpretation is open to multiple variables affecting disease prognosis (Beretta, 2007).

Nonetheless, the prospective advantages of mass spectrometric techniques in diagnosing patients at the early onset of a disease, or in assessing the effectiveness of a certain drug for chronic diseases, greatly outweigh the disadvantages listed above. With standardization of the workflow from sample collection to disease prognosis and highly regulated population-specific studies in the identification of biomarkers, mass spectrometry-based prognosis and diagnosis of disease states may play a major role in the development of preventive medicine.

References

Aebersold, R.
and Mann, M. (2003). Mass spectrometry-based proteomics. Nature 422, 198-207.
Allmer, J. (2012). A Call for Benchmark Data in Mass Spectrometry-Based Proteomics. JIOMICS 2, 1-5.
Balgley, B.M., Laudeman, T., Yang, L., Song, T., and Lee, C.S. (2007). Comparative Evaluation of Tandem MS Search Algorithms Using a Target-Decoy Search Strategy. Mol. Cell. Prot. 6.9, 1599-1608.
Beretta, L. (2007). Proteomics from the clinical perspective: many hopes and much debate. Nature Methods 4, 785-786.
Black, D.L. (2000). Protein Diversity from Alternative Splicing: A Challenge for Bioinformatics and Post-Genome Biology. Cell 103, 367-370.
Cohen, S.L., Ferré-D'Amaré, A.R., Burley, S.K., and Chait, B.T. (1995). Probing the solution structure of the DNA-binding protein Max by a combination of proteolysis and mass spectrometry. Protein Science 4, 1088-1099.
Coon, J.J., Syka, J.E.P., Shabanowitz, J., and Hunt, D.F. (2005). Tandem Mass Spectrometry for Peptide and Protein Sequence Analysis. BioTechniques 38, 519-523.
Crick, F. (1970). Central Dogma of Molecular Biology. Nature 227, 561-563.
Davis, M.T. and Lee, T.D. (1998). Rapid Protein Identification Using a Microscale Electrospray LC/MS System on an Ion Trap Mass Spectrometer. J. Am. Soc. Mass Spectrom. 9, 194-201.
Elias, J.E. and Gygi, S.P. (2007). Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods 4, 207-214.
Elias, J.E., Haas, W., Faherty, B.K., and Gygi, S.P. (2005). Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nature Methods 2, 667-675.
Eng, J.K., McCormack, A.L., and Yates, III, J.R. (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976-989.
Etienne, W., Meyer, M.H., Peppers, J., and Meyer, Jr., R.A. (2004). Comparison of mRNA gene expression by RT-PCR and DNA microarray. BioTechniques 36, 618-626.
Fenn, J.B., Mann, M., Meng, C.K., Wong, S.F., and Whitehouse, C.M. (1989). Electrospray ionization for mass spectrometry of large biomolecules. Science 246, 64-71.
Fitzgibbon, M., Li, Q., and McIntosh, M. (2008). Modes of inference for evaluating the confidence of peptide identifications. J. Proteome Res. 7, 35-39.
Fournier, C.T., Cherny, J.J., Truncali, K., Robbins-Pianka, A., Lin, M.S., Krizanc, D., and Weir, M.P. (2012). Amino Termini of Many Yeast Proteins Map to Downstream Start Codons. J. Proteome Res. 11, 5712-5719.
Galizia, E.C., Srikantha, M., Palmer, R., Waters, J.J., Lench, N., Ogilvie, C.M., Kasperavičiūtė, D., Nashef, L., and Sisodiya, S.M. (2012). Array comparative genomic hybridization: Results from an adult population with drug-resistant epilepsy and co-morbidities. European Journal of Medical Genetics 55, 342-348.
Geer, L.Y., Markey, S.P., Kowalak, J.A., Wagner, L., Xu, M., Maynard, D.M., Yang, X., Shi, W., and Bryant, S.H. (2004). Open Mass Spectrometry Search Algorithm. J. Proteome Res. 3, 958-964.
Ghaemmaghami, S., Huh, W., Bower, K., Howson, R.W., Belle, A., Dephoure, N., O’Shea, E.K., and Weissman, J.S. (2003). Global analysis of protein expression in yeast. Nature 425, 737-741.
Gygi, S.P., Rochon, Y., Franza, B.R., and Aebersold, R. (1999). Correlation between Protein and mRNA Abundance in Yeast. Mol. Cell. Biol. 19, 1720-1730.
Hunt, D.F., Yates, III, J.R., Shabanowitz, J., Winston, S., and Hauer, C.R. (1986). Protein sequencing by tandem mass spectrometry. Proc. Natl. Acad. Sci. 83, 6233-6237.
Hyman, E.D. (1988). A new method of sequencing DNA. Analytical Biochemistry 174, 423-436.
Iakoucheva, L.A., Kimzey, A.L., Masselon, C.D., Smith, R.D., Dunker, A.K., and Ackerman, E.J. (2001). Aberrant mobility phenomena of the DNA repair protein XPA. Protein Science 10, 1353-1362.
International Human Genome Sequencing Consortium. (2004). Finishing the euchromatic sequence of the human genome. Nature 431, 931-945.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. 98, 4569-4574.
Johnson, R.S., Martin, S.A., and Biemann, K. (1987). Novel Fragmentation Process of Peptides by Collision-Induced Decomposition in a Tandem Mass Spectrometer: Differentiation of Leucine and Isoleucine. Anal. Chem. 59, 2621-2625.
Karas, M. and Hillenkamp, F. (1988). Laser Desorption Ionization of Proteins with Molecular Masses Exceeding 10 000 Daltons. Anal. Chem. 60, 2299-2301.
Keller, A., Eng, J., Zhang, N., Li, X., and Aebersold, R. (2005). A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. [online] 1, E1-E8.
Kim, H.K., et al. (2010). Gastric Cancer-Specific Protein Profile Identified Using Endoscopic Biopsy Samples via MALDI Mass Spectrometry. J. Proteome Res. 9, 4123-4130.
Kochetov, A.V., Sarai, A., Rogozin, I.B., Shumny, V.K., and Kolchanov, N.A. (2005). The role of alternative translation start sites in the generation of human protein diversity. Mol. Genet. Genomics 273, 491-496.
LaBaer, J. (2005). So, You Want to Look for Biomarkers (Introduction to the Special Biomarkers Issue). J. Proteome Res. 4, 1053-1059.
Lemmon, M.A. and Schlessinger, J. (2010). Cell Signaling by Receptor Tyrosine Kinases. Cell 141, 1117-1134.
Maiolica, A., Cittaro, D., Borsotti, D., Sennels, L., Ciferri, C., Tarricone, C., Musacchio, A., and Rappsilber, J. (2007). Structural Analysis of Multiprotein Complexes by Cross-linking, Mass Spectrometry, and Database Searching. Mol. Cell. Prot. 6, 2200-2211.
McLafferty, F.W., Breuker, K., Jin, M., Han, X., Infusini, G., Jiang, H., Kong, X., and Begley, T.P. (2007). Top-down MS, a powerful complement to the high capabilities of proteolysis proteomics. FEBS J. 274, 6256-6258.
Mumby, M. and Brekken, D. (2005). Phosphoproteomics: new insights into cellular signaling. Genome Biology 6, 230.
Nesvizhskii, A.I., Vitek, O., and Aebersold, R. (2007). Analysis and validation of proteomic data generated by tandem mass spectrometry. Nature Methods 4, 787-797.
O’Farrell, P.H. (1975). High Resolution Two-Dimensional Electrophoresis of Proteins. The Journal of Biological Chemistry 250, 4007-4021.
Olsen, J.V., Blagoev, B., Gnad, F., Macek, B., Kumar, C., Mortensen, P., and Mann, M. (2006). Global, In Vivo, and Site-Specific Phosphorylation Dynamics in Signaling Networks. Cell 127, 635-648.
Parker, C.E., Mocanu, V., Mocanu, M., et al. (2010). Mass Spectrometry for Post-Translational Modifications. In: Alzate, O., editor. Neuroproteomics. Boca Raton (FL): CRC Press. Chapter 6. Available from: http://www.ncbi.nlm.nih.gov/books/NBK56012/
Perkins, D.N., Pappin, D.J.C., Creasy, D.M., and Cottrell, J.S. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551-3567.
Price, T.S., et al. (2007). EBP, a program for protein identification using multiple tandem mass spectrometry data sets. Mol. Cell. Proteomics 6, 527-536.
Russo, A., Chandramouli, N., Zhang, L., and Deng, H. (2008). Reductive Glutaraldehydation of Amine Groups for Identification of Protein N-termini. J. Proteome Res. 7, 4178-4182.
Sanger, F., Nicklen, S., and Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463-5467.
Searle, B.C., Turner, M., and Nesvizhskii, A.I. (2008). Improving Sensitivity by Probabilistically Combining Results from Multiple MS/MS Search Methodologies. J. Proteome Res. 7, 245-253.
Shao, Y. and Kent, S.B.H. (1997). Protein splicing: occurrence, mechanisms and related phenomena. Chemistry & Biology 4, 187-194.
Shevchenko, A., Tomas, H., Havliš, J., Olsen, J.V., and Mann, M. (2006). In-gel digestion for mass spectrometric characterization of proteins and proteomes. Nature Protocols 1, 2856-2860.
Shevchenko, A., Wilm, M., Vorm, O., and Mann, M. (1996). Mass Spectrometric Sequencing of Proteins from Silver-Stained Polyacrylamide Gels. Anal. Chem. 68, 850-858.
Syka, J.E.P., Coon, J.J., Schroeder, M.J., Shabanowitz, J., and Hunt, D.F. (2004a). Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proc. Natl. Acad. Sci. 101, 9528-9533.
Syka, J.E.P., et al. (2004b). Novel Linear Quadrupole Ion Trap/FT Mass Spectrometer: Performance Characterization and Use in the Comparative Analysis of Histone H3 Post-translational Modifications. J. Proteome Res. 3, 621-626.
Uetz, P., et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-627.
Wenger, C.D. and Coon, J.J. (2013). A Proteomics Search Algorithm Specifically Designed for High-Resolution Tandem Mass Spectra. J. Proteome Res. 12, 1377-1386.
Wildt, S. and Gerngross, T.U. (2005). The Humanization of N-Glycosylation Pathways in Yeast. Nature Reviews 3, 119-128.
Wysocki, V.H., Resing, K.A., Zhang, Q., and Cheng, G. (2005). Mass spectrometry of peptides and proteins. Methods 35, 211-222.
Zubarev, R.A., Horn, D.M., Fridriksson, E.K., Kelleher, N.L., Kruger, N.A., Lewis, M.A., Carpenter, B.K., and McLafferty, F.W. (2000). Electron Capture Dissociation for Structural Characterization of Multiply Charged Protein Cations. Anal. Chem. 72, 563-573.

Appendix A: SEQUEST search parameters in Proteome Discoverer 1.2

b/y ions screen
a/b/y ions screen

Appendix B: Python and MS-SQL scripts

Python Upload Script
SEQUEST_TestAndParseXLS.py
https://wesfiles.wesleyan.edu/home/mlin/mlin_BAMA_thesis_2013/SEQUEST_TestAndParseXLS.py

List of Stored Procedures
dt_2013_methods_assessGelSlices_SEQUEST_commented.txt
https://wesfiles.wesleyan.edu/home/mlin/mlin_BAMA_thesis_2013/dt_2013_methods_assessGelSlices_SEQUEST_commented.txt
dt_2013_methods_assessGelSlices_OMSSA_Commented.txt
https://wesfiles.wesleyan.edu/home/mlin/mlin_BAMA_thesis_2013/dt_2013_methods_assessGelSlices_OMSSA_Commented.txt