Evaluation of Mass Spectrometry-based Proteomic Search Algorithms

Wesleyan University
Evaluation of Mass Spectrometry-based Proteomic Search Algorithms:
Parent Protein Profiling of 22 MS/MS Experiments
in Saccharomyces cerevisiae
By
Miin Sophia Lin
Faculty Advisor: Dr. Michael P. Weir
A Thesis submitted to the Faculty of Wesleyan University in partial fulfillment of the
requirements for the degree of Master of Arts in Biology
Middletown, Connecticut
May 2013
Acknowledgements
I would like to express my gratitude to my research adviser and mentor, Dr.
Michael Weir, for providing me the opportunity to conduct research since my senior
year as an undergraduate student, and for his continuous guidance and support during
my academic career at Wesleyan University. He has shown me not only how a
bioinformaticist should approach problems, but also the importance of family.
I would like to thank Dr. Scott Holmes and Dr. Ruth Johnson for serving on my
M.A. thesis committee. In addition, I would like to thank Dr. Holmes and Rebecca
Ryznar for sparking my interest in academic research during the spring semester of
my junior year through the independent project, and Dr. Johnson for being a mentor
and listening to my ranting on med school plans. I want to thank all the members of
the Weir lab for their support and guidance over the past two years. I would especially
like to thank Justin Cherny and Claire Fournier for being mentors in the bioinformatic
and wet-lab aspects of my research. They have taught me a considerable amount of
knowledge and skills ranging from troubleshooting experiments to presenting journal
club papers. Furthermore, I would like to acknowledge Claire Fournier for developing
the wet-lab protocol for the gel slice experiments conducted for this study, and her
contribution of MS/MS experiments 1, 2, and 3. I would also like to thank Dr. Andrea
Roberts and Dr. Hyejoo Back for their guidance and support during my academic
career at Wesleyan University.
Last but not least, I would like to take this opportunity to thank God for giving me
the strength and wisdom to complete this thesis. I would also like to thank my parents,
Mahn-Lih and Yao-Feng Lin, and my brother, Chiarng, for shaping me into who I am
today, and for their love and support through the ups and downs in my life.
TABLE OF CONTENTS
Introduction
  I. Mass Spectrometry-based Proteomics
    1. Bottom-up vs. top-down approaches
      1.1 Tandem mass spectrometry
      1.2 Prospects in fragmentation methods and MS instrumentation
    2. Bioinformatic and computational approaches
      2.1 MS/MS spectral databases
      2.2 Identification of downPeptides in the yeast proteome
  II. Confidence in Algorithm Matches of MS/MS Spectra to Peptide Sequences
    1. Peptide-spectrum matching search algorithms
    2. Methods of evaluation
      2.1 Target-decoy approach: False identification rate (FIR)
      2.2 Gel slice approach: Parent protein profiling of 22 MS/MS experiments
Materials and Methods
  I. Gel Slice Approach
    1. Protein sample preparation
  II. Peptide-spectrum Matching Search Algorithms
  III. Conformance Score Computation
Results and Discussion
  I. An Evaluation of Algorithm Performance using Conformance Scores
    1. Ion fragmentation: a search for a, b, and y ions
  II. Factors Affecting Conformance Score Computation
    1. Post-translational modifications
    2. Random matching of peptides to parent proteins of the correct size range
  III. Union of OMSSA Implementations
    1. Conformance scoring based on multiple OMSSA implementations
    2. Contribution of individual implementations to the pool of distinct peptides detected by multiple OMSSA implementations
  IV. Overlap of Peptide Matches in a/b/y and b/y ion Screens
    1. SEQUEST
    2. OMSSA
  V. Protein Expression
    1. Parent protein expression of peptide matches detected exclusively in the b/y or a/b/y ion screens, and in both the b/y and a/b/y ion screens
    2. Bootstrap analysis: Confidence in peptide matches dependent on parent protein expression
  VI. Increasing the Yield of High Confidence Peptide Matches
    1. Distinct peptides detected by both SEQUEST and OMSSA algorithms
    2. Conformance scoring using algorithm-specific evaluation methods
Conclusion
  I. Evaluation Methods for MS/MS Algorithms and Implementations
    1. Benchmark data: 22 MS/MS experiments for optimization of algorithm implementations
    2. Multiple algorithms, implementations, and ion screens: increase yield of high confidence peptide matches
    3. Practical limitations to benchmark data sets
  II. Future Directions
    1. Assessment of MASCOT using parent protein conformance scoring
    2. Applications of mass spectrometry-based proteomics
      2.1 Elucidation of biological functions and pathways
      2.2 Mass spectrometry-based proteomics in the clinical setting
References
Appendix A: SEQUEST search parameters in Proteome Discoverer 1.2
Appendix B: Python and MS-SQL scripts
LIST OF FIGURES
Figure 1: Bottom-up and top-down approaches to mass spectrometry-based proteomics
Figure 2-1: Schematic of gel slice approach
Figure 2-2: Partitioning of parent proteins via SDS-PAGE
Figure 3: Experimental design and workflow
Figure 4: Glycosylation does not account for nonconforming peptide matches with parent proteins below the gel slice MW range
Figure 5: Union of multiple OMSSA implementations increases both the yield of detected peptides and the Conformance Score
Figure 6: Conformance scoring based on distinct peptides detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens
Figure 7: Peptides detected by SEQUEST in both the b/y and a/b/y ion screens give significantly higher conformance scores compared to peptides detected by b/y or a/b/y ion screens alone
Figure 8: Parent protein expression of distinct peptide matches (per MS Run) detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens
Figure 9: Confidence in peptides detected by SEQUEST is dependent on parent protein expression
Figure 10: Distinct peptides detected by both algorithms
LIST OF TABLES
Table 1: Comparison of SEQUEST and OMSSA parameter implementations
Table 2: Conformance scoring based on distinct peptide matches per MS Run at 10% mass tolerance
Table 3: Conformance scoring based on distinct peptide matches per MS Run at 25% mass tolerance
Table 4: Conformance scoring using algorithm-specific evaluation methods
LIST OF TERMINOLOGY
Algorithm search parameter implementation – The set of parameter settings
applied to a mass spectrometry-based protein identification search algorithm, such as
SEQUEST. Hereafter referred to as “implementation.”
Collision Induced Dissociation (CID) ion screens – Algorithm parameter settings
that search for a, b, and y CID ions are hereafter referred to as the a/b/y ion screen.
Algorithm parameter settings that search for b and y CID ions are hereafter referred to
as the b/y ion screen.
Conforming peptides – Distinct forward peptides with parent proteins conforming to
a given molecular weight size range (e.g., at 10% mass tolerance for a molecular
weight size range of 25-37 kDa, peptides whose parent proteins are between 22.5 and
40.7 kDa are classified as conforming).
Distinct forward peptides – Peptide matches identified by the search algorithm that
are counted once per MS Run or per Gel Slice Range.
False Identification Rate (FIR) – 100*(reverse peptides / forward peptides)
Gel Slice Range – The molecular weight size range, 25-37, 37-50, or 50-75 kDa, of
proteins that were excised from SDS-PAGE gels based on a protein standard lane.
Counting distinct peptide matches per Gel Slice Range allows only a single count of the
same peptide from a particular molecular weight size range, thereby decreasing the weight
of the peptide in the pool of distinct forward peptides.
Mass Tolerance – An expansion of the molecular weight size range by either 10% or 25% of
the minimum (m) and maximum (M) values [e.g., given a size range of m to M and a
mass tolerance of 10%, the size range is expanded to m - (0.1*m) through M + (0.1*M)].
MS Run – An individual tandem mass spectrometry (MS/MS) run performed on a
single gel slice of a molecular weight size range of 25-37, 37-50, or 50-75 kDa.
Counting distinct peptide matches per MS Run allows multiple counts of the same peptide
from different MS/MS experiments of the same molecular weight size range, thereby
increasing the weight of the peptide in the pool of distinct forward peptides.
Non-conforming peptides – Distinct forward peptides with parent proteins lying
outside a given molecular weight size range.
Conformance Score (CS) – The percentage of distinct forward peptides classified as
conforming; see Equation (m 3.2).
Reverse peptides – Decoy peptides with reverse peptide sequences of proteins in a
database file.
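The two counting schemes defined above (per MS Run and per Gel Slice Range) can be illustrated with a minimal Python sketch. The tuple layout, the function name `count_distinct`, and the example peptide strings and run labels are hypothetical and chosen only for illustration; they are not taken from the thesis scripts.

```python
def count_distinct(matches, per="ms_run"):
    """Count distinct forward peptides.

    matches: iterable of (peptide, ms_run, gel_slice_range) tuples
             (hypothetical layout, for illustration only).
    per:     "ms_run" counts a peptide once per MS Run;
             "gel_slice_range" counts it once per Gel Slice Range.
    """
    index = 1 if per == "ms_run" else 2
    return len({(m[0], m[index]) for m in matches})


# Made-up example: one peptide seen in two runs of the same size range.
matches = [
    ("PEPTIDEK", "run01", "25-37"),
    ("PEPTIDEK", "run02", "25-37"),
    ("PEPTIDEK", "run02", "25-37"),  # repeat within the same run
    ("ANOTHERK", "run03", "37-50"),
]

per_run = count_distinct(matches, per="ms_run")
per_range = count_distinct(matches, per="gel_slice_range")
```

In this made-up example the peptide seen in two runs of the 25-37 kDa range contributes twice under per-MS-Run counting and once under per-Gel-Slice-Range counting, which is the difference in weighting described in the definitions.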
Introduction
Since the establishment of the central dogma in molecular biology (Crick, 1970),
continuous efforts have been made to annotate and characterize the molecular
building blocks of life, with the ultimate goal of understanding how biological
systems function from DNA transcription to mRNA translation into protein products.
The completion of the Human Genome Project in 2003 (IHGSC, 2004) was a major
determinant in promoting a shift from genomic studies to proteomic studies. In
particular, the development of DNA sequencing technologies, from Sanger’s
electrophoresis-radiolabeling method (Sanger et al., 1977) to pyrosequencing (Hyman,
1988), has led to the completion of whole genome sequencing for many model
organisms, and a call for the functional annotation of genomes. In response,
techniques such as RT-PCR and DNA microarrays were developed to measure gene
expression and transcript abundance (Etienne et al., 2004). However, as mRNA
expression profiles do not necessarily correlate with protein expression (Gygi et al.,
1999), quantitative analyses of mRNA can only provide a partial, and sometimes
inaccurate, picture of biological systems.
The shift toward a direct characterization of protein products encoded by the
genome, however, has been challenged by the enormous size and complexity of the
proteome. Protein expression is regulated by multiple processes at various stages
during transcription and translation. Not only do rates of translation and the targeting
of mRNA for decay dictate protein expression levels, single pre-mRNA transcripts
may also code for multiple protein products as seen in alternative splicing (Black,
2000). Once proteins are expressed, post-translational modifications, degradation,
localization, and the formation of complexes via protein-protein interactions further
complicate the understanding of protein function within a system.
I. Mass Spectrometry-based Proteomics
Along with the advances in bioinformatics and computational biology, the
development of two-dimensional electrophoresis (2D-PAGE; O’Farrell, 1975) and
tandem mass spectrometry (MS/MS; Hunt et al., 1986) has paved the way for large-scale
studies on protein expression and function. As the proteomic counterpart of DNA
sequencing and microarray expression profiling, mass spectrometry, when used
alongside spectra-matching search algorithms, can determine peptide fragment
sequences and identify proteins in a sample. The principle behind mass spectrometry
is based on the ability to characterize the mass of analytes in a given sample via
measurements of the mass to charge (m/z) ratios and abundance of ionized analytes.
In a mass spectrometry experiment, following the introduction of analytes into the
mass spectrometer by the ion source, the mass analyzer measures the m/z ratios of the
ionized analytes, and the detector determines the relative abundance in a given
sample. These three components of a mass spectrometer determine the differences in
resolution and sensitivity between the various types of instrumentation on the market,
from time-of-flight (TOF) and ion trap to Fourier transform (FT) spectrometers
(Aebersold and Mann, 2003). The choice as to which type of mass spectrometer to
use is in turn determined by the objective of the proteomic study.
1. Bottom-up vs. top-down approaches
The characterization and identification of proteins through mass spectrometry can
be accomplished through a bottom-up or a top-down approach. In the bottom-up
approach, proteins isolated from cell lysate are subjected to enzymatic digestion (e.g.,
trypsin), producing peptide fragments that are first separated by high-performance
liquid chromatography (HPLC; Davis, 1998), then introduced into a mass
spectrometer via soft-ionization techniques such as electrospray ionization (ESI; Fenn
et al., 1989) or matrix-assisted laser desorption ionization (MALDI; Karas and
Hillenkamp, 1988) (Coon et al., 2005). In the top-down approach, isolated proteins
are instead directly introduced into a mass spectrometer via ESI for fragmentation
(McLafferty et al., 2007).
1.1 Tandem mass spectrometry
Current methodologies for both approaches make use of tandem mass
spectrometry (MS/MS; Hunt et al., 1986), where the most abundant molecular ions
from the first MS spectrum are selected for further fragmentation to reveal additional
information via subsequent MS spectra of the fragment ions. In fragmentation
methods such as collision induced dissociation (CID; Johnson, 1987), cleavages of
the peptide backbone typically give rise to N-terminal b ions and C-terminal y ions
(Figure 1). However, other ions such as a ions, where the C=O group is cleaved from
the b ion, may also occur (Wysocki et al., 2005), and the ability to enhance detection
of a1 fragment ions via protein reductive glutaraldehydation has allowed for
determination of N-terminal sequences (Russo et al., 2008).
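The relationship between a, b, and y ions described above can be made concrete with a short Python sketch that computes singly charged fragment m/z values for a peptide. It uses standard monoisotopic residue masses for a small subset of amino acids; the function name `fragment_ions` and the example peptide are hypothetical, and this is an illustration of the ion arithmetic only, not the scoring procedure of SEQUEST or OMSSA.

```python
# Standard monoisotopic residue masses (Da) for a few amino acids;
# the table is deliberately incomplete and can be extended as needed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # H2O, retained on the C-terminal (y) fragment
CO = 27.994915      # C=O, lost from a b ion to give the a ion


def fragment_ions(peptide):
    """Return (a_ions, b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] covers the first i+1 residues; y_ions[i] covers the
    remaining residues after that same backbone cleavage.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    a_ions = [b - CO for b in b_ions]
    return a_ions, b_ions, y_ions


a_ions, b_ions, y_ions = fragment_ions("GAK")
```

For the hypothetical tripeptide GAK, the last entry of `y_ions` is the y1 ion of the C-terminal lysine, and each a ion sits about 28 Da below its b ion, reflecting the loss of C=O.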
The MS/MS spectra obtained from a bottom-up approach and those from a top-down approach, however, focus on different aspects of protein characterization
(Figure 1). In the bottom-up approach, the first MS measures the mass of the
proteolytic fragments, while subsequent MSn provide clues to the sequence of the
fragment. Consequently, the submission of MS/MS spectra to a search algorithm is
limited to the identification of proteins in a sample via a piecing together of
proteolytic fragment sequences – a method more prone to error. In the top-down
approach, the first MS measures the mass of the complete parent protein, while
subsequent MSn provide clues to not only the sequence composition of downstream
fragments, but clues in the determination of post-translational modifications,
predicted sequence errors, or other primary structural information when assessed with
a search algorithm (McLafferty et al., 2007).
Figure 1. Bottom-up and top-down approaches to mass spectrometry-based proteomics
1.2 Prospects in fragmentation methods and MS instrumentation
The advantage of the top-down approach in providing a more complete picture of
a protein, however, is limited by the requirement for higher resolution
instrumentation and a more effective method in the fragmentation of larger proteins.
Efforts to address these limitations have been made through the introduction of
fragmentation techniques such as electron capture dissociation (ECD; Zubarev et al.,
2000) in conjunction with hybrid mass spectrometers of increasing resolution,
sensitivity, and m/z ranges, such as the ion trap-Fourier transform mass spectrometer
(LTQ FT; Syka et al., 2004b). Furthermore, the development of electron transfer
dissociation (ETD; Syka et al., 2004a) has contributed to reducing the higher cost
often associated with ECD-Fourier transform mass spectrometry.
2. Bioinformatic and computational approaches
2.1 MS/MS spectral databases
The development of newer technologies in protein-based mass spectrometry has
allowed for high-throughput data acquisition and the establishment of open-source
databases containing large amounts of publicly deposited spectra, including
PeptideAtlas, Proteomics IDEntifications Database, and NCBI’s Peptidome (Allmer,
2012). Along with the use of relational databases hosted by local servers,
bioinformatic and computational approaches can now be applied to analyze these
large sets of data, allowing for the characterization and identification of proteins in
ways that otherwise would have taken longer to accomplish through methods solely
dependent on wet-lab experimentation.
2.2 Identification of downPeptides in the yeast proteome
As an example, a previous study (Fournier et al., 2012) used the SEQUEST
algorithm to analyze spectra obtained from MS/MS experiments on glutaraldehyde-treated yeast cell lysates in addition to publicly deposited spectra obtained from the
PeptideAtlas repository (http://www.peptideatlas.org/repository/). A total of 320
downPeptides with amino termini mapping to translation initiation at AUG codons
downstream (dnAUGs) of the annotated start codon (annAUG) were identified in the
budding yeast. In support of data obtained from MS/MS spectra analysis,
bioinformatic approaches revealed poorer quality Kozak sequence contexts
surrounding the annAUGs of downPeptide genes as compared to those of annPeptide
genes, and suggested the occurrence of translation initiation at both the annAUG and
dnAUG via tag densities from ribosome profiling.
In this case study, the ability to integrate bioinformatic and computational
approaches with large amounts of publicly deposited MS/MS spectra revealed added
complexity to the current knowledge of protein expression in yeast. In particular,
translation initiation at dnAUGs in frame to the annAUG would result in the
expression of truncated proteins, while translation initiation at dnAUGs out of frame
to the annAUG would potentially lead to the expression of proteins with novel
functions and cellular localizations, increasing the repertoire of the current annotated
yeast proteome (Kochetov et al., 2005).
II. Confidence in Algorithm Matches of MS/MS Spectra to Peptide Sequences
With increasing improvements in the resolution and sensitivity of mass
spectrometers, there is little doubt that the emphasis placed on mass spectrometry-based proteomics will persist for many years to come. However, given the complexity
of proteomic studies, an assessment of spectra-matching algorithms in accurately
identifying and characterizing proteins from a given sample is important for the
validation of both non-standard protein matches (e.g., downPeptides) and in-depth
characterizations of annotated proteins. The variability seen between different
spectra-matching algorithms, due to their unique design and structure, is further
complicated by the numerous parameter settings (e.g., precursor and fragment mass
tolerance, number of missed cleavage sites, or type of CID ions) that may be
implemented within each algorithm (Balgley et al., 2007). Confidence in peptide-spectrum matches becomes problematic when the same spectral data submitted to
different algorithms or implementations of algorithms yield different samplings of
peptide matches (Wenger and Coon, 2013). Which of the algorithm-detected peptides
are valid and which should be discarded due to low confidence in the match?
1. Peptide-spectrum matching search algorithms
Several types of spectra-matching search algorithms are available either on the
market or as open-source software. Algorithms such as SEQUEST (Eng et al., 1994)
are based on cross-correlation methods that match an MS/MS experimental spectrum
with theoretical spectra, while algorithms such as OMSSA (Geer et al., 2004) and
MASCOT (Perkins et al., 1999) rely on probability-based matching that assigns a
score of statistical significance to the observed peptide for sequence correlation to
theoretical peptides. Although both fall under the category of probability-based
algorithms, OMSSA and MASCOT differ in several respects: i) OMSSA is an open-source algorithm, which allows for manipulation of information such as the
detected CID ions in a spectrum for user-defined peak analysis; ii) OMSSA was the
fastest of the algorithms mentioned here prior to the introduction of
Morpheus (Wenger and Coon, 2013). Despite the differences, all spectra-matching
algorithms have their own score ranking systems, where peptide matches are assigned
a score representing the probability of the match being a true match, allowing for the
application of probability thresholds at certain false identification rates.
2. Methods of evaluation
2.1 Target-decoy approach: False identification rate (FIR)
The effectiveness of an algorithm or an algorithm implementation is typically
assessed through decoy analysis (Fitzgibbon et al., 2008; Elias and Gygi, 2007). In
the FASTA format sequence database provided as a reference for peptide-spectrum
matching algorithms, a reverse sequence is present for every forward protein
sequence. With no knowledge that the reverse sequences represent “ghost” proteins
that do not exist in nature, an algorithm may still match spectra to these decoy
sequences. The frequency with which reverse peptides are chosen by the algorithm,
calculated as the number of reverse peptides over the number of forward peptides, is
the false identification rate (FIR) for the implementation of the algorithm. To filter
out lower confidence peptide matches, a target FIR of either 1% or 5% is set during
the computation of a probability threshold value based on the scores assigned to
individual peptide matches representing an algorithm’s confidence in the match.
However, despite submission of the same spectra files and application of a probability
threshold at a FIR of 1% or 5%, different algorithms or implementations of an
algorithm still yield different samplings of detected peptides.
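The FIR calculation and the choice of a score threshold for a target FIR can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, each match is represented as a `(score, is_reverse)` pair, and a higher score is assumed to mean a more confident match (for algorithms that report E-values, the comparison would need to be inverted). It is not the stored procedure used in this study.

```python
def false_identification_rate(n_reverse, n_forward):
    """FIR = 100 * (reverse peptides / forward peptides)."""
    if n_forward == 0:
        return 0.0
    return 100.0 * n_reverse / n_forward


def threshold_for_target_fir(matches, target_fir=5.0):
    """Return the most permissive score cutoff whose retained matches
    have an FIR at or below target_fir, or None if no cutoff qualifies.

    matches: list of (score, is_reverse) pairs; higher score is assumed
             to mean higher confidence (an assumption of this sketch).
    """
    best = None
    for cutoff in sorted({score for score, _ in matches}, reverse=True):
        kept = [is_reverse for score, is_reverse in matches if score >= cutoff]
        n_reverse = sum(kept)
        n_forward = len(kept) - n_reverse
        if n_forward and false_identification_rate(n_reverse, n_forward) <= target_fir:
            best = cutoff
    return best


# Made-up example: 20 forward matches and 2 lower-scoring decoy matches.
example = [(0.9, False)] * 20 + [(0.5, True), (0.4, True)]
cutoff = threshold_for_target_fir(example, target_fir=5.0)
```

In the made-up example, keeping matches with a score of at least 0.5 retains one reverse and 20 forward peptides (FIR of 5%), while lowering the cutoff to 0.4 would raise the FIR to 10%, so 0.5 is selected.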
2.2 Gel slice approach: Parent protein profiling of 22 MS/MS experiments
To address this issue, we developed additional evaluation methods to establish
confidence in peptide-spectrum matches by algorithms. Using a bottom-up approach,
where the algorithm has no knowledge of the masses of parent proteins prior to
trypsin digestion, we partitioned proteins from yeast cell lysate according to mass
using SDS-polyacrylamide gel electrophoresis, and subjected samples of known
parent protein molecular weight size ranges to LC-MS/MS. Spectra from 22 MS/MS
experiments were then used to evaluate peptides matched by an algorithm
implementation for conformance to parent protein masses prior to trypsin digestion.
Since this method can be applied to any algorithm or implementation of an algorithm,
the computed overall Conformance Scores (eq. m 3.2) provide a means to compare
the performance of different algorithms as well as to evaluate the performance of an
individual implementation – allowing for optimization of parameter settings in an
algorithm. In addition, making such algorithm benchmark data available for download
would greatly benefit the research community (Allmer, 2012) in choosing the best
algorithm or implementation to use in yeast proteomic studies.
Materials and Methods
I. Gel Slice Approach
In order to assess algorithm matches of spectra to tryptic peptides, algorithms were
evaluated for the conformance of peptide matches to parent proteins prior to trypsin
digestion. Yeast cell lysates were partitioned according to mass by SDS-polyacrylamide gel electrophoresis. Gel slices corresponding to molecular weight size
ranges of 25-37, 37-50, and 50-75 kDa were excised, subjected to in-gel trypsin
digestion, and assessed through LC-MS/MS. Spectra were then submitted to search
algorithms (SEQUEST or OMSSA) for protein identification (Figure 2-1).
Figure 2-1. Schematic of gel slice approach.
1. Protein sample preparation
Protein samples were prepared by growing 100 ml of the wild-type S. cerevisiae
strain (YSH474) to mid-log phase in YPD. Lysis was carried out using RIPA buffer
(150 mM NaCl, 1% Igepal, 0.1% SDS, 50 mM Tris pH 8.0) and acid washed glass
beads in addition to protease inhibitor cocktail tablets (Roche) and PMSF. After the
lysate was spun at 5,000 RPM, samples of 500 µg or 1000 µg were
run alongside protein standard markers (Bio-Rad) on 4-20% SDS-PAGE gels (Bio-Rad). Protein standard bands served as a guide for the excision of gel slices of various
molecular weight size ranges (25-37, 37-50, and 50-75 kDa), which were subjected to
reduction and alkylation followed by overnight in-gel trypsin digestion (Shevchenko et al.,
1996, 2006). Extracted peptides were resuspended in 0.1% TFA, loaded onto a C18
(Michrom) nanospray column (Polymicro), and run with a 180 minute gradient on a
Thermo-Finnigan LCQ Deca XP (3D ion trap) coupled to an Agilent 1100 series
high-performance liquid chromatography (HPLC) system and a Thermo nano-electrospray (nano-ESI) ion source. As preliminary tests indicated that gels with
visible degradation (smearing of protein bands) had limited conformance to parent
protein masses, gel slices chosen for analysis originated from gels lacking visible
degradation (Figure 2-2).
Figure 2-2. Partitioning of parent proteins via SDS-PAGE.
II. Peptide-spectrum Matching Search Algorithms
Peptides were identified using the SEQUEST algorithm (Proteome Discoverer
v.1.2) run on a Dell Alienware Aurora R4 server, and the Open Mass Spectrometry
Search Algorithm (OMSSA) run on a 90-node cluster. Algorithm parameters were set
to search for either b and y ions or a, b, and y ions following CID, and to include
the following mass increases to peptides. Dynamic modifications included +42 Da for
acetylation of any N-terminal amino acid residue and +16 Da for oxidation of methionine
residues; the static modification was +57 Da for carbamidomethylation of cysteine
residues (Appendix A). Each SEQUEST parameter file included a precursor mass
tolerance of 3.0 Da and a fragment mass tolerance of 1.0 Da. The OMSSA algorithm
was run using five parameter implementations of the algorithm (Table 1). For the
SEQUEST and OMSSA analysis, we required trypsin-cleavage sites at both ends of
the precursor peptides (or one end if a terminal peptide). A sequence database file
containing protein translations of annotated and downstream Open Reading Frames
(dnORFs) in FASTA format was constructed as described previously (Fournier et al.,
2012).
Table 1. Comparison of SEQUEST and OMSSA parameter implementations

Parameter                        SEQUEST   OMSSA 0   OMSSA 1   OMSSA 2   OMSSA 3   OMSSA 4
Max. missed cleavage sites       1         1         1         1         1         1
Precursor mass tolerance (Da)    3.0       3.0       1.5       2.0       1.5       1.5
Fragment mass tolerance (Da)     1.0       1.0       0.5       0.8       0.5       1.0
Precursor ion search type        mono      avg.      avg.      avg.      mono      avg.
Fragment ion search type         mono      mono      mono      avg.      mono      avg.
Mass tolerance charge scaling    N/A       none      none      none      none      linear

Output from SEQUEST and OMSSA was uploaded to a relational database and
analyzed with stored procedures written in MS-SQL to compute false identification
rates (FIR) and conformance scores (Figure 3). In the stored procedure analysis
(Appendix B), for each MS/MS experiment, we excluded peptide matches with
internal trypsin sites, SEQUEST matches with an initial ranking (Rank) > 1,
and matches with multiple parent protein references. In the case of
OMSSA, we also excluded matches where multiple peptides matched the same
spectrum. Finally, the peptide matches were filtered by applying the calculated
probability threshold that gave a target FIR of 5%. Decoy analysis was performed as
described previously (Fournier et al., 2012).
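The exclusion steps listed above can be sketched in Python for illustration. The actual analysis was carried out with MS-SQL stored procedures (Appendix B); the dictionary keys (`peptide`, `parent_proteins`, `rank`, `spectrum_id`) and function names below are hypothetical, and the sketch assumes that an internal trypsin site is a K or R that is followed by any residue other than P.

```python
import re
from collections import defaultdict


def has_internal_trypsin_site(peptide):
    """True if K or R is followed by a residue other than P
    (assumed definition of an internal trypsin site)."""
    return re.search(r"[KR](?=[^P])", peptide) is not None


def passes_filters(match, algorithm):
    """Apply the per-match exclusions to one peptide match (a dict)."""
    if has_internal_trypsin_site(match["peptide"]):
        return False
    if len(match["parent_proteins"]) != 1:   # multiple parent protein references
        return False
    if algorithm == "SEQUEST" and match["rank"] > 1:
        return False
    return True


def drop_multi_peptide_spectra(matches):
    """OMSSA only: drop every match whose spectrum matched more than one peptide."""
    peptides_by_spectrum = defaultdict(set)
    for m in matches:
        peptides_by_spectrum[m["spectrum_id"]].add(m["peptide"])
    return [m for m in matches
            if len(peptides_by_spectrum[m["spectrum_id"]]) == 1]
```

A score threshold at the target FIR would then be applied to the matches that survive these filters.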
Figure 3. Experimental design and workflow. RAW files (1) from MS/MS experiments are
either submitted to the SEQUEST algorithm (2A), or converted to ODTA format and
submitted to the OMSSA algorithm (2B). The XLS output from SEQUEST (3A) and the
XML output from OMSSA (3B) are then uploaded into the devGelSlice database for further
analysis via stored procedures written in MS-SQL: application of 5% FIR (4); computation of
overall conformance score (5).
III. Conformance Score Computation
Parent protein conformance scores were computed from distinct forward peptide
matches classified as conforming or nonconforming according to the known
molecular weight size range of the gel slice (25-37, 37-50, and 50-75 kDa). We
analyzed data from 22 gel slices in 22 independent MS/MS experiments. Peptides
were either counted once per MS Run or once per Gel Slice Range. To account for
aberrant running on a gel or possible post-translational modifications, mass tolerances
at ± 10% or ± 25% of the molecular weight size range were applied. Conformance
scores for individual MS/MS experiments were used as an initial screen and
computed as follows:
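\[
\mathrm{CS}_i = 100 \times \frac{C_i}{C_i + N_i} \tag{m 3.1}
\]
where C_i and N_i are the numbers of conforming and nonconforming distinct forward
peptides in MS/MS experiment i.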
An overall conformance score (for a single algorithm search parameter
implementation) was computed by summing the number of conforming peptides and
nonconforming peptides across all 22 MS/MS experiments:
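\[
\mathrm{CS}_{\mathrm{overall}} = 100 \times \frac{\sum_{i=1}^{22} C_i}{\sum_{i=1}^{22} \left( C_i + N_i \right)} \tag{m 3.2}
\]
where C_i and N_i are the numbers of conforming and nonconforming distinct forward
peptides in MS/MS experiment i.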
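A minimal Python sketch of the conformance score computation is given below. It assumes the score is the percentage of distinct forward peptides classified as conforming, which agrees with the SEQUEST b/y values in Table 2 (3781 conforming of 4480 distinct peptides gives 84.4). The function names and the per-experiment counts in the usage example are hypothetical.

```python
def is_conforming(parent_mass_kda, size_range, tolerance=0.10):
    """Classify a parent protein mass against a gel slice size range
    widened by a fractional mass tolerance (0.10 or 0.25)."""
    low, high = size_range
    return low - tolerance * low <= parent_mass_kda <= high + tolerance * high


def conformance_score(n_conforming, n_nonconforming):
    """Percentage of distinct forward peptides that conform."""
    total = n_conforming + n_nonconforming
    return 100.0 * n_conforming / total if total else 0.0


def overall_conformance_score(counts_per_experiment):
    """Sum conforming and nonconforming counts across experiments first,
    then compute a single score from the totals.

    counts_per_experiment: list of (n_conforming, n_nonconforming) pairs.
    """
    conforming = sum(c for c, _ in counts_per_experiment)
    nonconforming = sum(n for _, n in counts_per_experiment)
    return conformance_score(conforming, nonconforming)


# SEQUEST b/y totals from Table 2.
sequest_by = conformance_score(3781, 699)

# Made-up counts for two experiments, to show pooling across runs.
pooled = overall_conformance_score([(8, 2), (1, 9)])
```

Because the overall score pools counts before dividing, experiments with many detected peptides carry more weight than they would in a simple average of per-experiment scores.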
Results and Discussion
Mass spectrometry-based protein identification algorithms that match MS/MS
spectra of fragmented peptides to the theoretical spectra of peptide sequences from
FASTA format databases are important tools in the advancement of proteomic studies,
and the ability of algorithms to provide accurate protein identification is necessary for
confidence in published data. The importance of confidence in algorithm-detected
peptides is further illuminated by the discovery of special sets of peptides, such as
downPeptides (Fournier et al., 2012), where localization and function of the proteins
have yet to be characterized.
I. Evaluation of Algorithm Performance using Conformance Scores
As an initial method for evaluating algorithm performance, we computed relative
conformance scores indicating how well peptide matches from the algorithms
conformed to parent proteins of known molecular weight size ranges (25-37, 37-50,
and 50-75 kDa). Similar in concept to single-blind studies in psychology, the
evaluation method relies on the algorithms having no knowledge of the
masses of parent proteins prior to trypsin digestion. The computed conformance
scores, using spectra from 22 MS/MS experiments, therefore provide an unbiased
evaluation of algorithm performance in matching theoretical tryptic peptides to mass
spectra of CID fragments.
1. Ion fragmentation: a search for a, b, and y ions
Although b and y ions are most typically observed during CID, the type of ions
detected in a mass spectrometry experiment is affected by factors such as the
sensitivity of the mass spectrometer, the source of the ions, and the amino acid
composition of the peptides (Wysocki et al., 2005). To account for the possibility of
other less-prevalent ion fragmentations, we set our algorithm search parameters to
also include the search for a ions, which result from the loss of C=O at the carboxy
terminus of the b ion. Analysis of spectra from 22 independent MS/MS experiments
using the SEQUEST algorithm and five implementations of the OMSSA algorithm
included either a typical b/y ion screen or the a/b/y ion screen.
As application of a decoy analysis using a false identification rate below 5% is
generally accepted as the method to ensure the quality of peptide-spectrum matches,
we pre-screened our data prior to computing conformance scores. The overall false
identification rates for SEQUEST were ~6.0% while rates for OMSSA ranged from
2.2% to 6.8% (Table 2).
Due to experimental variations inherent in the gel slice experiments, such as
aberrant running of proteins during gel electrophoresis, a tolerance of 10% was
applied to the molecular weight size range of the parent proteins. Counting each
peptide once per MS Run, conformance scores computed at a 10% mass tolerance
indicated that around 84.5% of peptide matches from SEQUEST conformed to the
expected parent protein size ranges in both types of ion screens, while an average of
88.3% and 87.5% of peptide matches from OMSSA conformed to the expected parent
protein size ranges in the b/y and a/b/y ion screens respectively (Table 2). Although
the average conformance scores for OMSSA were similar in both ion screens, the scores
fluctuated slightly between implementations, and the number of distinct peptides
detected in a given ion screen (b/y: 2134-3644; a/b/y: 1702-3393) depended on the
particular search parameter implementation. In general, the conformance scores suggest
high confidence in peptide matches from both SEQUEST and OMSSA, regardless of the
ion screen or the individual OMSSA implementation used.
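The conformance score computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this study. The function names and the sample masses are hypothetical, and the sketch assumes that the tolerance widens each bound of the gel slice range proportionally and that both bounds are inclusive.

```python
def tolerant_range(low_kda, high_kda, tolerance):
    """Widen a gel slice range by a fractional tolerance on each bound (assumed rule)."""
    return low_kda * (1.0 - tolerance), high_kda * (1.0 + tolerance)


def conformance_score(parent_masses_kda, slice_range_kda, tolerance=0.10):
    """Percentage of peptide matches whose parent protein mass lies in the widened range."""
    low, high = tolerant_range(slice_range_kda[0], slice_range_kda[1], tolerance)
    conforming = sum(1 for mass in parent_masses_kda if low <= mass <= high)
    return 100.0 * conforming / len(parent_masses_kda)


# Hypothetical parent protein masses (kDa) for peptides matched in a 25-37 kDa slice.
masses = [24.0, 30.5, 36.9, 40.0, 52.3]
score = conformance_score(masses, (25.0, 37.0), tolerance=0.10)

# The overall score is conforming / distinct peptides, e.g. the SEQUEST b/y row of Table 2.
sequest_by_overall = round(100.0 * 3781 / 4480, 1)
```

With these toy masses, four of the five parent proteins fall inside the widened range. The last line simply recomputes one overall score from the conforming and distinct peptide counts reported in Table 2.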
Table 2. Conformance scoring based on distinct peptide matches per MS Run at 10%
mass tolerance. Overall scores include data from 22 MS/MS experiments. A. Distinct
peptides detected in b/y ion screen. Threshold values for SEQUEST: 0.081; OMSSA runs 0-4:
1.0. B. Distinct peptides detected in a/b/y ion screen. Threshold values for SEQUEST:
0.164025; OMSSA runs 0,1,3,4: 1.0; OMSSA run 2: 0.9.

A.
Algorithm  Implementation  # Distinct  # Conforming  # Nonconforming  Overall      Overall Reverse  Overall
                           Peptides    Peptides      Peptides         Conformance  Conformance      FIR (%)
                                                                      Score (%)    Score (%)
SEQUEST    -               4480        3781          699              84.4         14.6             6.0
OMSSA      0               3060        2717          343              88.8         23.6             2.4
OMSSA      1               3644        3196          448              87.7         20.3             3.6
OMSSA      2               2134        1893          241              88.7         23.5             5.4
OMSSA      3               3295        2887          408              87.6         17.8             3.9
OMSSA      4               3035        2696          339              88.8         23.9             2.2

B.
Algorithm  Implementation  # Distinct  # Conforming  # Nonconforming  Overall      Overall Reverse  Overall
                           Peptides    Peptides      Peptides         Conformance  Conformance      FIR (%)
                                                                      Score (%)    Score (%)
SEQUEST    -               4757        4021          736              84.5         17.0             5.7
OMSSA      0               2583        2282          301              88.3         18.0             2.4
OMSSA      1               3393        2960          433              87.2         21.2             4.9
OMSSA      2               1702        1482          220              87.1         19.8             6.8
OMSSA      3               3065        2670          395              87.1         15.9             5.1
OMSSA      4               2556        2250          306              88.0         16.7             2.3
Table 3. Conformance scoring based on distinct peptide matches per MS Run at 25%
mass tolerance. Overall scores include data from 22 MS/MS experiments. A. Distinct
peptides detected in b/y ion screen. Threshold values for SEQUEST: 0.081; OMSSA runs 0-4:
1.0. B. Distinct peptides detected in a/b/y ion screen. Threshold values for SEQUEST:
0.164025; OMSSA runs 0,1,3,4: 1.0; OMSSA run 2: 0.9.

A.
Algorithm  Implementation  # Distinct  # Conforming  # Nonconforming  Overall      Overall Reverse  Overall
                           Peptides    Peptides      Peptides         Conformance  Conformance      FIR (%)
                                                                      Score (%)    Score (%)
SEQUEST    -               4480        4099          381              91.5         26.2             6.0
OMSSA      0               3060        2920          140              95.4         33.3             2.4
OMSSA      1               3644        3455          189              94.8         36.8             3.6
OMSSA      2               2134        2022          112              94.8         40.9             5.4
OMSSA      3               3295        3126          169              94.9         34.1             3.9
OMSSA      4               3035        2899          136              95.5         31.3             2.2

B.
Algorithm  Implementation  # Distinct  # Conforming  # Nonconforming  Overall      Overall Reverse  Overall
                           Peptides    Peptides      Peptides         Conformance  Conformance      FIR (%)
                                                                      Score (%)    Score (%)
SEQUEST    -               4757        4364          393              91.7         27.8             5.7
OMSSA      0               2583        2469          114              95.6         31.1             2.4
OMSSA      1               3393        3210          183              94.6         34.5             4.9
OMSSA      2               1702        1606          96               94.4         35.3             6.8
OMSSA      3               3065        2897          168              94.5         33.1             5.1
OMSSA      4               2556        2438          118              95.4         28.3             2.3
II. Factors Affecting Conformance Score Computation
Although conformance scores indicate the relative peptide-matching efficacy of
algorithm implementations, the scores may be affected by additional factors, only some
of which can be accounted or corrected for. After application of a 25% mass tolerance to
the molecular weight size range of parent proteins, an increase in conformance scores
was observed regardless of the algorithm or search parameter implementation selected
(Table 3). In an attempt to elucidate factors affecting the categorization of
algorithm-matched peptides as conforming or nonconforming, we drafted a list of
factors that may affect the computation of conformance scores.
Unaccountable factors such as the algorithm-specific probability scoring of
peptide matches may result in the exclusion of a matched peptide during decoy
analysis, where peptides not meeting a certain probability threshold are discarded.
Additionally, the algorithm may not have been able to match a given spectrum to a
peptide sequence from the specified database file, thereby decreasing the yield of
detected peptides. In contrast, correctable factors affecting conformance scoring
include post-translational modifications of parent proteins that alter their molecular
weights, and random matching of peptides by the algorithm to parent proteins of the
correct size range.
1. Post-translational modifications
Of the various post-translational modifications (PTMs), glycosylation is one of the
major PTMs that may increase the mass of a parent protein by as much as several
thousand Daltons through the addition of covalently linked saccharides (Parker et al.,
2010). When partitioned via SDS-polyacrylamide gel electrophoresis, post-translationally
modified parent proteins may be found at a position on the gel that is above their
annotated molecular weight (Iakoucheva et al., 2001). In order to investigate whether
peptide matches were categorized as nonconforming due to post-translational
modifications, we tested the parent proteins of nonconforming peptides for a
significant elevation of the Asn-X-Ser/Thr motif (where X is any amino acid other
than proline) that is typically
observed in the sequence of proteins targeted for N-linked glycosylation, and a
significant elevation of serine and threonine residues that are typically observed in the
sequence of proteins targeted for O-linked glycosylation (Wildt and Gerngross, 2005).
We expected to see an elevation of the Asn-X-Ser/Thr motif or serine and
threonine residues in the parent proteins of nonconforming peptides with unmodified
masses below the gel slice range. However, an elevation in both the Asn-X-Ser/Thr
motif and the number of serine and threonine residues was observed for parent
proteins of nonconforming peptides with molecular weight sizes above the gel slice
range (Figure 4), suggesting that other factors may take precedence. A possible
explanation is that the elevated levels of the Asn-X-Ser/Thr motif or serine and
threonine residues do in fact indicate glycosylation of the parent proteins, but due to
post-translational proteolytic cleavage, the parent protein runs at a molecular weight
lower than expected (Shao and Kent, 1997).
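The two sequence features tested here can be computed directly from a parent protein sequence. The sketch below is illustrative only: the function names and the example sequence are hypothetical, and counting overlapping occurrences of the motif is an assumption, since the counting rule used in this study is not stated in the text.

```python
import re

# N-linked sequon: Asn, then any residue except Pro, then Ser or Thr.
# The zero-width lookahead lets overlapping occurrences (e.g. "NNSS") each be counted.
SEQUON = re.compile(r"(?=N[^P][ST])")


def count_n_glyc_motifs(sequence):
    """Number of Asn-X-Ser/Thr motifs (X is not proline) in a one-letter protein sequence."""
    return len(SEQUON.findall(sequence.upper()))


def percent_ser_thr(sequence):
    """Percentage of residues that are serine or threonine."""
    seq = sequence.upper()
    return 100.0 * (seq.count("S") + seq.count("T")) / len(seq)


# Hypothetical sequence fragment, for illustration only.
example = "MNGSANPTQNNSSK"
n_motifs = count_n_glyc_motifs(example)
st_percent = percent_ser_thr(example)
```

In the example, "NGS", "NNS", and "NSS" are counted, while "NPT" is excluded because proline occupies the X position. Histograms of these two quantities for conforming and nonconforming parent proteins correspond to the comparisons summarized in Figure 4.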
2. Random matching of peptides to parent proteins of the correct size range
The ability of mass spectrometry-based search algorithms to randomly match
spectra to peptide sequences is apparent in the matching of spectra to reverse decoy
peptides. If algorithms randomly match spectra to incorrect peptides, they can
also randomly match spectra to peptides corresponding to parent proteins of the
correct size range. For example, 25% of annotated yeast proteins have masses
between 22.5 and 40.7 kDa (the 25-37 kDa range widened by a 10% tolerance), so 25%
of random matches would conform to this size
range. The overall conformance scores for the reverse peptides may in part illustrate
this phenomenon. Conformance scores computed at a 25% mass tolerance indicate
that around 27% of reverse peptide matches from SEQUEST conformed to the
expected parent protein size range in both b/y and a/b/y ion screens, while an average
of 35.3% and 32.5% of reverse peptide matches from OMSSA conformed to the
expected parent protein size range in b/y and a/b/y ion screens respectively (Table 3).
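The baseline argument in this paragraph amounts to asking what fraction of the annotated proteome falls inside a tolerance-widened size range. A minimal sketch follows, assuming that a random match is equally likely to point to any annotated protein. That is a simplification, since larger proteins contribute more tryptic peptides. The function name and the toy mass list are hypothetical.

```python
def expected_random_conformance(proteome_masses_kda, slice_range_kda, tolerance):
    """Percentage of proteins whose mass lies in the tolerance-widened size range."""
    low = slice_range_kda[0] * (1.0 - tolerance)
    high = slice_range_kda[1] * (1.0 + tolerance)
    in_range = sum(1 for mass in proteome_masses_kda if low <= mass <= high)
    return 100.0 * in_range / len(proteome_masses_kda)


# Toy proteome of eight masses (kDa); a real analysis would use all annotated yeast proteins.
toy_proteome = [10.0, 23.0, 35.0, 40.5, 48.0, 61.0, 77.0, 120.0]
baseline = expected_random_conformance(toy_proteome, (25.0, 37.0), 0.10)
```

Comparing such a baseline with the reverse conformance scores gives a rough sense of how much of a forward conformance score could arise by chance alone.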
[Figure 4 plots. A. SEQUEST: "Effects of N-glycosylation on Peptide Conformance"
(x-axis: Asn-X-Ser/Thr Motif Count) and "Effects of O-glycosylation on Peptide
Conformance" (x-axis: Percent Serine and Threonine (%)), each shown under the column
headings 10% Mass Tolerance and 25% Mass Tolerance. B. OMSSA Implementations 0 and 1:
"Effects of O-glycosylation on Peptide Conformance" (x-axis: Percent Serine and
Threonine (%)). y-axis in all panels: Frequency. Series: Conforming, Nonconforming
Above, Nonconforming Below.]
Figure 4. Glycosylation does not account for nonconforming peptide matches with
parent proteins below the gel slice MW range. Yellow lines represent nonconforming
peptides with parent protein MW above the gel slice range, light blue lines represent
nonconforming peptides with parent protein MW below the gel slice range, and dark blue
lines represent conforming peptide matches. Assessment for both SEQUEST (A) and
OMSSA (B) indicated elevation of the Asn-X-Ser/Thr motif or serine and threonine residues
in peptides having parent proteins with MW above the gel slice range.
Figure 4B. (continued)
[Plots. OMSSA Implementations 2, 3, and 4: "Effects of O-glycosylation on Peptide
Conformance". x-axis: Percent Serine and Threonine (%); y-axis: Frequency.]
III. Union of OMSSA Implementations
Although overall conformance scores (10% mass tolerance) computed from
peptide matches detected by the OMSSA algorithm (~88%) were slightly higher than
those from peptide matches detected by the SEQUEST algorithm (~85%), the high
confidence in peptide matches regardless of algorithm, implementation, or
ion screen suggested that using multiple implementations of OMSSA could increase the
yield of detected peptides. Counting detected peptides once even if seen in multiple
implementations (b/y ion screen: 3,922 distinct peptides; a/b/y ion screen: 3,662
distinct peptides), we grouped peptides according to the number of OMSSA
implementations in which they were detected.
1. Conformance scoring based on multiple OMSSA implementations
When a peptide was detected by a single OMSSA implementation in the
b/y ion screen, the conformance score (10% mass tolerance) indicated that only
61.1% of the detected peptides conformed to their respective parent protein molecular
weight size ranges. As the number of OMSSA implementations that detected the
peptide increased from two to all five, the conformance scores increased to
78.9%, 84.1%, 89.8%, and 90.0% respectively (Figure 5A). Similar trends were seen in
peptide matches detected by OMSSA implementations in the a/b/y ion
screen (Figure 5A), indicating that including the search for a ions in the OMSSA
algorithm has a minimal effect on the yield of peptide matches.
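The grouping used for this analysis can be sketched as follows. This is a minimal illustration with hypothetical names and toy data, not the code used in this study. Peptides are represented by simple identifiers here, whereas the analysis above counts each peptide once per MS Run.

```python
from collections import Counter, defaultdict


def conformance_by_detection_count(detections, conforms):
    """Conformance score (%) for peptides grouped by how many implementations detected them.

    detections: dict mapping implementation id -> set of peptide ids
    conforms:   dict mapping peptide id -> True if the parent protein conforms
    """
    counts = Counter(p for peptides in detections.values() for p in peptides)
    grouped = defaultdict(list)
    for peptide, n_implementations in counts.items():
        grouped[n_implementations].append(conforms[peptide])
    return {n: 100.0 * sum(flags) / len(flags) for n, flags in sorted(grouped.items())}


# Toy data: three implementations and four peptides.
detections = {0: {"A", "B", "C"}, 1: {"A", "B"}, 2: {"A", "D"}}
conforms = {"A": True, "B": True, "C": False, "D": True}
scores_by_count = conformance_by_detection_count(detections, conforms)
```

In the toy data, peptides C and D are each seen by one implementation and only D conforms, so the single-implementation group scores 50%, while the groups seen by two and three implementations score 100%.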
The increase in the conformance score of peptides detected by more than one
OMSSA implementation and the low percentage of peptides (b/y ion screen: 5%;
a/b/y ion screen: 7.8%) detected by a single OMSSA implementation suggest two
methods for increasing the yield of detected peptides while maintaining high
confidence in the matches: i) directly take the union of detected peptides from
multiple OMSSA search implementations, or ii) discard peptide matches detected by
a single OMSSA implementation after taking the union of detected peptides from
multiple OMSSA implementations. In the first method, where distinct peptides are
counted once per MS Run and once per OMSSA implementation, the overall
conformance scores (10% mass tolerance) are 86.2% and 85.3% for the b/y ion screen
and a/b/y ion screen respectively. These high-confidence conformance scores are
comparable to those of the 4,480 and 4,757 peptides detected by SEQUEST (84.4% and
84.5%).
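Both methods reduce to a union followed by an optional filter on the number of implementations that detected each peptide. The sketch below uses a hypothetical function name and toy data and is offered only as an illustration of the idea.

```python
from collections import Counter


def union_of_implementations(detections, min_implementations=1):
    """Union of peptides across implementations, keeping those seen at least N times.

    min_implementations=1 corresponds to method i (plain union);
    min_implementations=2 corresponds to method ii (discard single-implementation matches).
    """
    counts = Counter(p for peptides in detections.values() for p in peptides)
    return {p for p, n in counts.items() if n >= min_implementations}


# Toy data: three implementations and four peptides.
detections = {0: {"A", "B", "C"}, 1: {"A", "B"}, 2: {"A", "D"}}
method_i = union_of_implementations(detections)
method_ii = union_of_implementations(detections, min_implementations=2)
```

Method i keeps all four toy peptides, while method ii keeps only A and B. The trade-off is the same as in the text: a stricter threshold lowers the yield but removes the group of matches with the lowest conformance.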
2. Contribution of individual implementations to the pool of distinct peptides detected
by multiple OMSSA implementations
Alternatively, a third approach is to discard peptide matches detected by the
OMSSA implementation that contributes most to the pool of distinct peptides
detected by a single OMSSA implementation. A comparison of the contribution of each
OMSSA implementation to the detection of distinct peptides found in multiple
implementations shows that OMSSA implementations 1 and 2 contribute the most to the
pool of peptides detected in a single OMSSA implementation (Figure 5B). However,
implementations 1 and 2 not only contribute at levels similar to the other
implementations for peptide matches detected by multiple OMSSA implementations,
but also have high overall conformance scores, 87.7% and 88.7% respectively. This
suggests that taking the union of detected peptides from multiple OMSSA
implementations and discarding peptide matches detected by a single OMSSA
implementation is more suitable for increasing the yield of high confidence peptide
matches if the individual OMSSA implementations also have high conformance
scores.
(Figure 5 located on page 30)
Figure 5. Union of multiple OMSSA implementations increases both the yield of
detected peptides and the Conformance Score. A. Conformance scores calculated
from distinct peptides detected in multiple implementations of the b/y or a/b/y ion
screens in OMSSA (peptides counted once even if seen in multiple implementations).
B. Contribution of each OMSSA implementation to the distinct peptides detected in
multiple implementations of the b/y or a/b/y ion screens.
[Figure 5 plots. A. "Conformance of Distinct Peptides detected by multiple OMSSA
implementations" for the b/y ion screen and the a/b/y ion screen. x-axis: Number of
OMSSA Implementations (All, 1, 2, 3, 4, 5); y-axes: Conformance Score (%) and
Percentage of all matches; series: 10% and 25% mass tolerance. The counts shown
beneath the panels are:

b/y ion screen
Number of OMSSA Implementations  All    1     2     3     4     5
# Distinct Peptides              3922   198   667   276   1097  1684
Percentage of all matches        100.0  5.0   17.0  7.0   28.0  42.9

a/b/y ion screen
Number of OMSSA Implementations  All    1     2     3     4     5
# Distinct Peptides              3662   286   766   255   1059  1296
Percentage of all matches        100.0  7.8   20.9  7.0   28.9  35.4

B. "Contribution of individual OMSSA implementations to the pool of distinct peptides
detected by multiple OMSSA implementations" for the b/y ion screen and the a/b/y ion
screen. x-axis: Number of OMSSA Implementations (All, 1, 2, 3, 4, 5); y-axis:
Percentage (%); series: Implementations 0, 1, 2, 3, and 4.]
IV. Overlap of Peptide Matches in a/b/y ion and b/y ion Screens
In contrast to the multiple implementations used in OMSSA, we submitted spectra
from the 22 MS/MS experiments to SEQUEST using a single search algorithm
parameter implementation with either b/y ion screening or a/b/y ion screening. To
investigate whether including the search for a ions increases the yield of peptide
detections, we grouped distinct peptides according to detection by the b/y ion screen
exclusively (BY ions), the a/b/y ion screen exclusively (ABY ions), or by both the b/y
and a/b/y ion screens (ABY ions & BY ions).
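This three-way grouping is a pair of set differences and an intersection. A minimal sketch follows, with a hypothetical function name and toy (peptide, MS Run) pairs standing in for the distinct peptides counted once per MS Run.

```python
def partition_by_ion_screen(by_peptides, aby_peptides):
    """Split peptides into those seen only in b/y, only in a/b/y, or in both screens."""
    by_set, aby_set = set(by_peptides), set(aby_peptides)
    return {
        "BY ions": by_set - aby_set,
        "ABY ions": aby_set - by_set,
        "ABY ions & BY ions": by_set & aby_set,
    }


# Toy data: each entry is a (peptide, MS Run) pair.
by_screen = {("P1", 1), ("P2", 1), ("P3", 2)}
aby_screen = {("P2", 1), ("P3", 2), ("P4", 3)}
groups = partition_by_ion_screen(by_screen, aby_screen)
```

The sizes of the three groups always satisfy the same bookkeeping as the counts in the Figure 6 caption, where 259 + 4,221 = 4,480 peptides for the b/y screen and 536 + 4,221 = 4,757 for the a/b/y screen.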
1. SEQUEST
The conformance score based on peptide matches detected by SEQUEST in both
the b/y and a/b/y ion screens was significantly higher than that based on peptide
matches detected exclusively by the b/y ion screen or the a/b/y ion screen (Figure 6A).
Counting distinct peptides once per MS/MS experiment (per MS Run), peptides
detected by SEQUEST in both the b/y and a/b/y ion screens had a conformance score
of 87.4% (mass tolerance of 10%), while the conformance scores based on peptides
detected by SEQUEST exclusively in the b/y or a/b/y ion screens were 34.7% and
61.6% respectively. A similar trend was seen when counting distinct peptides once
per parent protein molecular weight size range (per Gel Slice Range). The lower
conformance scores based on distinct peptide matches per Gel Slice Range, as
compared to the corresponding conformance scores based on matches per MS Run
(Figure 6A), are likely due to the decreased representation of a peptide in the pool of
distinct forward peptides, since it is counted once even if seen in multiple MS/MS
experiments.
Using the SEQUEST algorithm, we detected 5,016 peptides, counting a peptide once
per MS/MS experiment. Of the 5,016 detected peptides, 2,261 were distinct peptides.
Of the 2,261 distinct peptides, 78% (1,752) were detected by both the b/y and the
a/b/y ion screens (Figure 10), suggesting that confidence in SEQUEST peptide
matches can be increased by screening spectra for both b/y and a/b/y ions and
subsequently retaining only the peptide matches detected by both ion screens. Using
bootstrap analysis, we found that the conformance scores for ABY ions (61.6%) and
BY ions (34.7%) lie outside the distribution of conformance scores calculated after
random sampling with replacement from the full sets of 4,757 and 4,480 distinct
peptides counted once per MS/MS experiment in the a/b/y and b/y ion screens
respectively (Figure 7). This indicates that the conformance score for
peptide matches detected by both the b/y and a/b/y ion screens using the SEQUEST
algorithm is significantly higher than the conformance scores for peptide matches
detected exclusively by either ion screen alone (Figure 7; p < 0.01, p < 0.001).
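The bootstrap comparison can be sketched as below. This is an illustration under stated assumptions, not the code used in this study: the function names are hypothetical, the random seed is arbitrary, and the empirical p-value is taken as the fraction of bootstrap scores at or below the observed score. The full set is rebuilt from the SEQUEST a/b/y counts in Table 2 (4,021 conforming of 4,757 distinct peptides), and the sample size of 536 matches Figure 7A.

```python
import random


def bootstrap_conformance(conforms, sample_size, cycles, seed=0):
    """Conformance scores (%) of repeated samples drawn with replacement from the full set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(cycles):
        sample = rng.choices(conforms, k=sample_size)  # sampling with replacement
        scores.append(100.0 * sum(sample) / sample_size)
    return scores


def empirical_p_lower(scores, observed):
    """Fraction of bootstrap scores that are at or below the observed score."""
    return sum(1 for s in scores if s <= observed) / len(scores)


# Full set: one flag per distinct peptide, True if its parent protein conforms.
full_set = [True] * 4021 + [False] * 736
scores = bootstrap_conformance(full_set, sample_size=536, cycles=1000)
p_value = empirical_p_lower(scores, observed=61.6)
```

When none of the bootstrap scores is as low as the observed score, the p-value can only be bounded by the number of cycles, which is presumably why 100 and 1,000 cycles are reported as p < 0.01 and p < 0.001.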
2. OMSSA
In contrast, the increase in the conformance score based on peptide matches
detected by both b/y and a/b/y ion screens as compared to the conformance scores of
peptide matches detected by either the b/y or a/b/y ion screens alone was not as
pronounced in the OMSSA algorithm (Figure 6C). Counting a peptide once per
MS/MS experiment, the average conformance scores for peptide matches detected by
OMSSA implementations were 78.7% and 86.1% for exclusive detection by a/b/y and
b/y ion screens respectively, and 89% for detection by both a/b/y and b/y ion screens.
Counting a peptide once per gel slice range, the average conformance scores for
peptide matches detected by OMSSA implementations were 68% and 79.2% for
exclusive detection by a/b/y and b/y ion screens respectively, and 86.3% for detection
by both a/b/y and b/y ion screens. This suggests that including the search for a ions in
the OMSSA algorithm has a minimal effect on not only the yield of peptide matches
(Figure 5A) but also the confidence in algorithm peptide matches (Figure 6C).
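The averages quoted for the per MS Run comparison can be checked against the per-implementation scores in Figure 6C (10% mass tolerance). The short script below is only an arithmetic check; the dictionary layout and variable names are illustrative.

```python
from statistics import mean

# Conformance scores (%) for OMSSA implementations 0-4, per MS Run, 10% mass tolerance,
# transcribed from Figure 6C.
per_ms_run = {
    "ABY ions": [82.0, 78.1, 75.8, 77.7, 79.8],
    "ABY ions & BY ions": [89.3, 88.5, 89.5, 88.4, 89.3],
    "BY ions": [87.3, 84.2, 87.3, 84.2, 87.6],
}

# Average across the five implementations, rounded to one decimal place.
averages = {group: round(mean(values), 1) for group, values in per_ms_run.items()}
```

The resulting averages agree with the 78.7%, 89%, and 86.1% reported in the text.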
(Figure 6 located on pages 34-35)
[Figure 6A plots. "Conformance based on Distinct Peptide Matches per MS Run" and
"Conformance based on Distinct Peptide Matches per Gel Slice Range". x-axis: Distinct
Peptides (ABY ions, ABY ions & BY ions, BY ions); y-axis: Conformance Score (%);
series: 10% and 25% mass tolerance.]
Figure 6. Conformance scoring based on distinct peptides detected in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y
ion screens. Algorithm detected peptides were selected based on distinct peptide matches counted once per MS Run or once per Gel Slice
Range. B. SEQUEST detected 5,016 peptides, counting each peptide once per MS/MS experiment. Of the 5,016 peptides, 4,221 peptides
were detected by both ion screens, while 536 and 259 peptides were detected by a/b/y and b/y ion screens respectively.
Figure 6 (continued): C. OMSSA conformance scoring based on distinct peptides detected in the a/b/y ion screen, the b/y ion screen, or
both a/b/y and b/y ion screens.
C.
Values are conformance scores (%); where given, the number of distinct peptides is
shown in parentheses.

10% Mass Tolerance

Conformance based on Distinct Peptide Matches per MS Run
OMSSA Implementation  ABY Ions    ABY Ions & BY Ions  BY Ions
0                     82.0 (344)  89.3 (2239)         87.3 (821)
1                     78.1 (402)  88.5 (2991)         84.2 (653)
2                     75.8 (298)  89.5 (1404)         87.3 (730)
3                     77.7 (364)  88.4 (2701)         84.2 (594)
4                     79.8 (336)  89.3 (2220)         87.6 (815)

Conformance based on Distinct Peptide Matches per Gel Slice Range
OMSSA Implementation  ABY Ions    ABY Ions & BY Ions  BY Ions
0                     73.0 (152)  86.7 (1204)         80.3 (346)
1                     66.3 (181)  85.5 (1604)         77.6 (299)
2                     65.6 (154)  87.3 (860)          79.7 (344)
3                     66.5 (170)  85.5 (1490)         77.3 (278)
4                     68.6 (153)  86.7 (1194)         81.3 (348)

25% Mass Tolerance

Conformance based on Distinct Peptide Matches per MS Run
OMSSA Implementation  ABY Ions    ABY Ions & BY Ions  BY Ions
0                     91.0        96.3                93.1
1                     86.8        95.7                91.0
2                     86.2        96.1                92.2
3                     85.7        95.7                91.1
4                     89.3        96.3                93.4

Conformance based on Distinct Peptide Matches per Gel Slice Range
OMSSA Implementation  ABY Ions    ABY Ions & BY Ions  BY Ions
0                     83.6        94.9                89.6
1                     75.7        94.3                88.0
2                     77.9        95.3                87.8
3                     74.7        94.4                87.4
4                     80.4        95.0                90.5
[Figure 7 plots. A. "Conformance of SEQUEST distinct peptides per MS Run (10% Mass
Tolerance, ABY ions, n = 536)"; marked score: 61.6%. B. "Conformance of SEQUEST
distinct peptides per MS Run (10% Mass Tolerance, BY ions, n = 259)"; marked score:
34.7%. x-axis: Conformance Score (%); y-axis: Frequency; series: 100x and 1000x
bootstrap cycles.]
Figure 7. Peptides detected by SEQUEST in both the b/y and a/b/y ion screens give
significantly higher conformance scores compared to peptides detected by b/y or a/b/y
ion screens alone. Blue dotted lines and percentages represent the conformance score based
on distinct peptides detected by b/y or a/b/y screens alone. Number of bootstrap cycles: 100
(purple) or 1000 (dark blue). A. Distribution of conformance scores calculated after random
sampling with replacement of 536 samples from a full set of 4,757 distinct peptides per MS
Run detected in the a/b/y ion screen. B. Distribution of conformance scores calculated after
random sampling with replacement of 259 samples from a full set of 4,480 distinct peptides
per MS Run detected in the b/y ion screen.
V. Protein Expression
Although the sample preparations for the 22 MS/MS experiments did not include
modifications such as glutaraldehydation, the SEQUEST algorithm successfully
matched a higher percentage of detected peptides to the correct parent protein
molecular weight size ranges when the search space included a ions (Figure 6A). In
contrast, OMSSA did not show a similar trend, indicating that the SEQUEST
algorithm uses a scoring and matching system that is inherently different from that of
the OMSSA algorithm. Indeed, SEQUEST is a cross-correlation based algorithm, while
OMSSA is a probability-based algorithm. Unfortunately, unlike OMSSA, which is an
open-source algorithm, SEQUEST is a proprietary algorithm and its source code is
unavailable to the general public. In order to elucidate the inherent differences
between SEQUEST and OMSSA when increasing the ion search space, we asked
whether protein expression plays a role in the higher conformance score for peptide
matches detected by SEQUEST in both the b/y and a/b/y ion screens.
1. Parent protein expression of peptide matches detected exclusively in the b/y or
a/b/y ion screens, and in both the b/y and a/b/y ion screens
Comparing the distribution of parent protein expression values for peptide matches
detected in the b/y ion screen or a/b/y ion screen alone to that for peptide matches
detected in both the b/y and a/b/y ion screens using the SEQUEST algorithm, we
found that peptides detected exclusively by either the b/y or a/b/y ion screens had
parent proteins with significantly lower protein expression levels as compared to the
parent proteins of peptides detected by both the b/y and a/b/y ion screens (Figure 8A;
p < 4 × 10^-14 for BY ions, p < 0.0004 for ABY ions). In contrast, the distribution of
parent protein expression values for peptide matches detected by the OMSSA
algorithm did not differ significantly between the ion screening groups (Figure 8B),
possibly accounting for the high confidence in the peptide matches detected
exclusively by either the b/y or a/b/y ion screens (Figure 6C).
2. Bootstrap analysis: Confidence in peptide matches dependent on parent
protein expression
The significantly lower parent protein expression (Figure 8A) in combination with
the lower conformance score of peptide matches detected exclusively by the b/y or
a/b/y ion screens (Figure 6A) suggests that though the SEQUEST algorithm can
detect peptides of lower abundance, the confidence in the quality of peptide-spectrum
matches also decreases. To investigate whether the matching of spectra to peptide
sequences in the SEQUEST algorithm is dependent on protein expression, we ranked
the parent proteins of peptide matches detected exclusively by the b/y or a/b/y ion
screens according to protein expression, and asked if the conformance scores
computed from the top or bottom 1/3 pool of protein expressers lie outside a
distribution of conformance scores computed from a bootstrap analysis (Figure 9).
Bootstrap analysis (Figure 9A; p < 0.01, p<0.001) revealed that the conformance
scores of parent proteins in the top or bottom 1/3 pool of protein expressers for
peptides detected by SEQUEST lie outside the distribution of bootstrap conformance
score data, but the corresponding conformance scores of parent proteins for peptides
detected by OMSSA lie within the distribution of the bootstrap data. The combined
results from Figure 8A and Figure 9A support the idea that compared to OMSSA, the
SEQUEST algorithm can detect peptides of lower abundance, but with the caveat that
confidence in lower abundance peptides is also low.
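The top and bottom thirds used in this comparison can be obtained by ranking peptide matches on parent protein expression. The sketch below uses a hypothetical function name and toy records. Rounding the pool size to the nearest integer is an assumption, chosen because it reproduces the 1,244 samples drawn from 3,731 peptides reported for Figure 9A.

```python
def tertile_conformance(records):
    """Conformance scores (%) of the bottom and top thirds ranked by protein expression.

    records: list of (expression_value, conforms) pairs, one per distinct peptide match.
    """
    ranked = sorted(records, key=lambda record: record[0])
    third = round(len(ranked) / 3)
    if third == 0:
        raise ValueError("need at least two records to form a non-empty third")

    def score(rows):
        return 100.0 * sum(1 for _, conforms in rows if conforms) / len(rows)

    return score(ranked[:third]), score(ranked[-third:])


# Toy records: (protein expression, conforming?) pairs.
records = [(1.0, False), (2.0, True), (3.0, False), (4.0, True), (5.0, True), (6.0, True)]
bottom_score, top_score = tertile_conformance(records)
```

Each of the two scores can then be compared with a bootstrap distribution built from samples of the same size drawn from the full set.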
(Figure 8 located on page 40)
Figure 8. Parent protein expression of distinct peptide matches (per MS Run) detected
in the a/b/y ion screen, the b/y ion screen, or both a/b/y and b/y ion screens. Protein
expression values were obtained from genomic-scale western analyses in yeast
(Ghaemmaghami et al., 2003). Proteins of unknown expression or expression values of zero
were disregarded. Pink distribution lines represent parent protein expression values of
peptides detected by the b/y ion screen exclusively. Blue distribution lines represent parent
protein expression values of peptides detected by the a/b/y ion screen exclusively. Green
distribution lines represent parent protein expression values of peptides detected by both ion
screens. A. SEQUEST. B. OMSSA implementations 0-4.
[Figure 8 plots. A. "Protein Expression based on SEQUEST Distinct Peptide Matches per
MS Run". B. "Protein Expression based on Distinct Peptide Matches per MS Run" for
OMSSA Implementations 0, 1, 2, 3, and 4. x-axis: Protein Expression log(pe); y-axis:
Frequency; series: ABY ions, BY ions, ABY ions & BY ions.]
[Figure 9 plots. A. "Conformance of SEQUEST distinct peptides per MS Run" (10% Mass
Tolerance; BY ions, n = 1244; ABY ions, n = 1244), with dotted lines marking the
Bottom 1/3 and Top 1/3 scores. B. "Conformance of OMSSA distinct peptides per MS
Run" (Implementation 0; ABY ions, n = 714; BY ions, n = 840). x-axis: Conformance
Score (%); y-axis: Frequency; series: 100x and 1000x bootstrap cycles.]
Figure 9. Confidence in peptides detected by SEQUEST is dependent on parent protein
expression. Conformance scores computed at 10% mass tolerance. Purple and blue lines
represent distribution of bootstrap analysis data from 100 cycles and 1,000 cycles
respectively. Dotted lines and percentages represent the conformance score of the top and
bottom 1/3 protein expressers in the full set of distinct peptides. A. SEQUEST algorithm:
distribution of conformance scores calculated after random sampling with replacement of
1,244 samples from a full set of 3,731 distinct peptides per MS Run. B. OMSSA algorithm,
Run 0: distribution of conformance scores calculated after random sampling with replacement
of 840 samples from a full set of 2,520 distinct peptides per MS Run.
41
Figure 9B continued
[Histograms titled "Conformance of OMSSA distinct peptides per MS Run" for
Implementation 1 (BY ions, n = 1004; ABY ions, n = 941), Implementation 2 (BY ions,
n = 581; ABY ions, n = 470), Implementation 3 (BY ions, n = 904; ABY ions, n = 845),
and Implementation 4 (BY ions, n = 834; ABY ions, n = 706). x-axis: Conformance
Score (%); y-axis: Frequency; series: 100x and 1000x bootstrap cycles.]
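The bootstrap analysis summarized in Figure 9 can be sketched in Python as follows.
This is a minimal illustration and not the script used for this study. It assumes that
each distinct peptide has already been reduced to a boolean flag (True when the peptide
conforms); the function names, the fixed random seed, and the toy input are hypothetical.

```python
import random

def percent_conforming(flags):
    """Percentage of peptides flagged as conforming."""
    return 100.0 * sum(flags) / len(flags)

def bootstrap_conformance(flags, sample_size, cycles, seed=0):
    """Resample `sample_size` peptides with replacement, `cycles` times.

    Returns one conformance score per cycle. The seed is fixed only to
    make this illustration reproducible (an assumption of the sketch).
    """
    rng = random.Random(seed)
    return [
        percent_conforming(rng.choices(flags, k=sample_size))
        for _ in range(cycles)
    ]

# Toy input (not thesis data): 9 peptides, 6 of which conform.
toy_flags = [True] * 6 + [False] * 3
scores_100 = bootstrap_conformance(toy_flags, sample_size=3, cycles=100)
scores_1000 = bootstrap_conformance(toy_flags, sample_size=3, cycles=1000)
```

For Figure 9A, the sample size would be 1,244 (one third of the 3,731 distinct peptides
per MS Run), and the resulting distribution would be compared with the conformance
scores of the top and bottom thirds of protein expressers.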
VI. Increasing the Yield of High Confidence Peptide Matches
1. Distinct peptides detected by both SEQUEST and OMSSA algorithms
The qualitative difference in the results from the SEQUEST and OMSSA
algorithms suggests that the two algorithms provide different samplings of the same
MS/MS data. To investigate whether the use of multiple algorithms increases the
yield of high confidence peptide matches, we counted each peptide once regardless of
how many MS/MS experiments identified the peptide, which CID ion screen was
applied, and which implementation revealed the peptide, and subsequently computed
the number of distinct peptides detected by both algorithms. Of the 2,261 distinct
peptides detected by the SEQUEST algorithm and the 2,097 distinct peptides detected
by the OMSSA algorithm, 1,516 distinct peptides (53.3% of the 2,842 distinct peptides
in the combined set) were detected by both algorithms (Figure 10). The larger number
of detected peptides obtained by combining results from both algorithms (2,842
distinct peptides), together with the high conformance scores for the individual
algorithms (Table 2), suggests that using multiple algorithms, in combination with
algorithm-specific methods for increasing confidence in peptide matches (retaining
peptide matches detected by both b/y and a/b/y ion screens for SEQUEST; taking the
union of peptide matches detected by multiple implementations of OMSSA), increases
the yield of high confidence peptide-spectrum matches.
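The counting described above amounts to set operations on distinct peptide sequences.
The sketch below is illustrative only: the function name and the toy peptide sequences
are hypothetical, and the final lines simply re-derive the combined total and the
percentage from the counts reported in the text.

```python
def overlap_summary(sequest_peptides, omssa_peptides):
    """Count distinct peptides shared by, or unique to, each algorithm."""
    sequest, omssa = set(sequest_peptides), set(omssa_peptides)
    both = sequest & omssa
    combined = sequest | omssa
    return {
        "both": len(both),
        "sequest_only": len(sequest - omssa),
        "omssa_only": len(omssa - sequest),
        "combined": len(combined),
        "percent_both": 100.0 * len(both) / len(combined),
    }

# Hypothetical peptide sequences, used only to exercise the function.
# The repeated "CDR" is counted once, as in the text.
toy = overlap_summary(["AAK", "CDR", "EFK"], ["CDR", "EFK", "GHR", "CDR"])

# Arithmetic check using the counts reported in the text.
sequest_total, omssa_total, both = 2261, 2097, 1516
combined = sequest_total + omssa_total - both   # inclusion-exclusion
percent_both = 100.0 * both / combined
```

With the reported counts, the combined set contains 2,842 distinct peptides and the
shared fraction is 53.34%, in agreement with Figure 10.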
[Figure 10B pie chart: Distinct Peptides detected in Multiple Implementations of
OMSSA. Number of implementations (percentage of distinct peptides): 1 (10%), 2 (18%),
3 (6%), 4 (19%), 5 (47%).]
Figure 10. Distinct peptides detected by both algorithms. A. SEQUEST detected
5,016 peptides counting a peptide once per MS Run even if seen in both b/y and a/b/y
ion screens. Of the 5,016 peptides, 2,261 are distinct (peptides found in multiple MS
Runs are only counted once). Of the 2,261 distinct peptides, 309 peptides are seen in
the a/b/y ion screen alone, 200 peptides are seen in the b/y ion screen alone, and
1,752 peptides are seen in both the b/y and a/b/y ion screens. B. OMSSA detected
4,429 peptides counting a peptide once per MS Run even if seen in both b/y and a/b/y
ion screens or seen in multiple implementations. Of the 4,429 peptides, 2,097 are
distinct (peptides found in multiple MS Runs are only counted once). Of the 2,097
peptides, 203 peptides are seen in a single implementation, 372 are seen in two
implementations, 118 are seen in three implementations, 392 are seen in four
implementations, and 1,012 are seen by five implementations. C. 1,516 distinct
peptides are found in both SEQUEST and OMSSA, while 745 distinct peptides are
unique to SEQUEST and 581 distinct peptides are unique to OMSSA. Counting each
peptide once, regardless of which implementation of the algorithm, how many
MS/MS experiments, or which CID ion screens revealed the peptide, we find that
53.34% of the peptides are detected by both algorithms.
2. Conformance scoring using algorithm-specific evaluation methods
In order to verify this claim, we computed conformance scores based on distinct
peptides detected by the SEQUEST and OMSSA algorithms after implementation of
algorithm-specific evaluation methods. Trends seen in peptides detected by both
algorithms or by either algorithm alone are consistent with previous data where
peptides detected by both a/b/y and b/y ion screens in SEQUEST (Figure 6A) and
peptides detected by multiple implementations of OMSSA (Figure 5A) had higher
overall conformance scores compared to peptides detected by the individual ion
screens in SEQUEST and by a single implementation of OMSSA respectively.
Peptides detected by both algorithms had a conformance score of 88.2%, with 93.1%
of these peptides detected in both the a/b/y and b/y ion screens of SEQUEST, and with
only 1.5% detected in a single implementation of OMSSA. In contrast, peptides
detected by SEQUEST alone had a conformance score of 65.4%, with 54.4% of these
peptides detected in either the a/b/y or b/y screen alone, while peptides detected by
OMSSA alone had a conformance score of 62.1%, with 31.0% of these peptides detected
in a single implementation of OMSSA (Table 4A-C).
Although peptides detected by both algorithms had a higher conformance score (88.2%),
peptides detected by SEQUEST alone (65.4%) and by OMSSA alone (62.1%) had similar
conformance scores, suggesting similar confidence in the two algorithms. Furthermore,
when restrictions are applied, where peptides detected by both algorithms or by
SEQUEST alone are required to be detected in both a/b/y and b/y ion screens of
SEQUEST, and where peptides detected by both algorithms or OMSSA alone are required
to be detected in more than one implementation of OMSSA, there is a striking increase
in the conformance score of peptides detected by each algorithm alone (SEQUEST alone:
65.4% to 80.6%; OMSSA alone: 62.1% to 73.4%) but not of peptides detected by both
algorithms (88.2% to 88.3%; Table 4C-D). In addition, applying such restrictions does
not substantially decrease the number of detected peptides (20.8% decrease in number
of distinct peptides; Table 4D), suggesting that using different algorithms and
multiple implementations, in addition to requiring a/b/y and b/y ion screen detection
for SEQUEST detected peptides and detection in >1 implementation for OMSSA, not
only increases the yield in peptide detection but also increases confidence in the
peptide matches.
(Table 4)
Table 4. Conformance scoring using algorithm-specific evaluation methods. A.
Percentages of peptides detected by both algorithms or by SEQUEST alone that are also
detected in both a/b/y and b/y ion screens in SEQUEST. B. Percentages of peptides detected
by both algorithms or by OMSSA alone that are also detected in multiple implementations of
OMSSA. C. Conformance Scores calculated from distinct peptides detected by both
algorithms, SEQUEST only, or OMSSA only. D. Conformance scores for peptides detected
by both algorithms or by OMSSA alone, where peptides were limited to those detected in >1
implementations of OMSSA. Conformance scores for peptides detected by both algorithms or
by SEQUEST alone, where peptides were limited to those detected in both a/b/y and b/y ion
screens in SEQUEST.

A. Percentage of distinct peptides detected in SEQUEST ion screens
   that are detected by

   Ion screen         Both Algorithms    SEQUEST alone
   a/b/y                     5.7              29.8
   a/b/y & b/y              93.1              45.6
   b/y                       1.1              24.6

B. Percentage of distinct peptides detected in multiple OMSSA implementations
   that are detected by

   # Implementations  Both Algorithms    OMSSA alone
   1                         1.5              31.0
   2                        10.4              37.0
   3                         4.3               9.1
   4                        21.8              10.7
   5                        62.1              12.2

C. Conformance Scores of distinct peptides detected by

   SEQUEST (745)      Both Algorithms (1516)      OMSSA (581)
   65.4               88.2                        62.1

D. Conformance Scores after applying evaluation methods to distinct peptides
   detected by

   SEQUEST (340)      Both Algorithms (1412, 1493)      OMSSA (401)
   80.6               88.3                              73.4
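The restrictions behind Table 4D can be sketched as a pair of filters. This is an
illustration under stated assumptions rather than the stored procedures used for this
study: it assumes a mapping from each SEQUEST peptide to the set of ion screens that
detected it, and from each OMSSA peptide to the set of implementations that detected it.
The function names and the toy peptides are hypothetical.

```python
def filter_sequest(screens_by_peptide):
    """Keep peptides detected in both the b/y and a/b/y ion screens."""
    required = {"b/y", "a/b/y"}
    return {
        peptide
        for peptide, screens in screens_by_peptide.items()
        if required <= set(screens)
    }

def filter_omssa(implementations_by_peptide):
    """Keep peptides detected in more than one OMSSA implementation."""
    return {
        peptide
        for peptide, impls in implementations_by_peptide.items()
        if len(set(impls)) > 1
    }

# Hypothetical peptides and detections, used only to exercise the filters.
sequest_hits = {"AAK": {"b/y", "a/b/y"}, "CDR": {"a/b/y"}, "EFK": {"b/y"}}
omssa_hits = {"AAK": {0, 1, 2, 3, 4}, "GHR": {2}, "LMK": {0, 3}}

kept_sequest = filter_sequest(sequest_hits)
kept_omssa = filter_omssa(omssa_hits)
```

The retained counts in Table 4D are consistent with the percentages in Tables 4A and
4B: 45.6% of 745 is approximately 340, and 69.0% of 581 is approximately 401.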
Conclusion
The rise of proteomics has been made possible by a number of factors, including:
i) the complete genome sequencing of many organisms;
ii) the development of fragmentation techniques in conjunction with
    advances in the resolution, sensitivity, and m/z ratio ranges of mass
    spectrometry instrumentation;
iii) the establishment of public repository databases for high-throughput data;
iv) the increasing awareness of the importance of interdisciplinary methods in
    research (bioinformatics and computational biology);
v) the development of spectra-matching algorithms that use protein sequence
    databases, derived from the translation of putative ORFs, as a resource.
I. Evaluation Methods for MS/MS Algorithms and Implementations
Prior to high-throughput proteomic studies, data from publications were easily
available for validation through wet-lab experimentations. For example, if a lab
publishes a paper on identifying a protein-protein interaction between protein X and Z
through co-immunoprecipitation, a different lab can either conduct the same
experiment or use other methods such as pull-down assays to confirm the interaction.
In proteomic studies today, many resources are not available, or are not made
available, for validation by repeating the experiment. For example, suppose lab A has
a high resolution, high sensitivity, but extremely expensive LTQ FT mass
spectrometer, and has been using outdated proprietary software for its spectra
analysis. Lab B, in contrast, has run out of grant funds and owns a first
generation time of flight mass spectrometer. Furthermore, to lab B’s disappointment,
the terms of the journal did not require the raw spectral files and coding scripts
used for the publication to be made available online, and lab A is unreachable. In
this extreme situation, lab B has doubts about the analysis of high-resolution data
with an outdated algorithm, but can only trust that the published data have been
reviewed and are accurate.
As seen in this example, the problems that have arisen with the advancements in
proteomics today include: i) the inability to have confidence in peptide matches
identified by spectra-matching algorithms, and ii) the inability to reproduce published
data from raw MS/MS spectra files. Both problems impede the ultimate goal of
proteomics: identifying and characterizing proteins so as to understand how the
biological system functions. Fortunately, methods can be used to increase confidence
in peptides and proteins identified by algorithms, and there is increasing awareness
of, and effort placed in, making raw data and coding scripts publicly available. This
study hopes to have addressed both areas through parent protein profiling of 22 MS/MS
experiments performed on yeast cell lysates.
In current proteomic studies, the confidence in peptide-spectrum matching by
algorithms is assessed primarily through the application of probability thresholds at a
given false identification rate (e.g., FIR of 1% or 5%). Unfortunately, the typical use
of a reverse set of peptide sequences for every ‘validated’ peptide sequence in a
FASTA format sequence database has its drawbacks. Specifically, reverse sequences
may contain homologous sequences to ‘validated’ peptide sequences, resulting in a
false negative match (Nesvizhskii et al., 2007). In this study, despite applying a
probability threshold of a 5% FIR, the conformance scores varied across different
algorithms and implementations (Table 2), indicative of different samplings due to
inherent differences between algorithms or due to slight changes in implementation
parameter settings. Therefore, the fact that peptide matches boast an FIR of 1% does
not necessarily provide high confidence in the published data.
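The decoy-based estimate discussed above can be illustrated with a short sketch. It
assumes a simple estimator (decoy matches divided by target matches among matches
scoring at or above a threshold), which is one common convention and not necessarily
the calculation performed by SEQUEST or OMSSA. The function names and the toy scores
are hypothetical.

```python
def reverse_decoy(sequence):
    """Build a decoy by reversing a target sequence."""
    return sequence[::-1]

def estimate_fir(matches, threshold):
    """Estimate the false identification rate at a score threshold.

    `matches` holds (score, is_decoy) pairs, where a higher score means a
    better match. Assumed estimator: passing decoys / passing targets.
    """
    matches = list(matches)
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    if targets == 0:
        return None  # undefined when no target match passes the threshold
    return decoys / targets

# Hypothetical scores, used only to exercise the function.
toy_matches = [(9.1, False), (8.7, False), (8.2, False),
               (7.9, True), (7.5, False), (3.0, True)]
fir_at_7 = estimate_fir(toy_matches, threshold=7.0)
fir_at_8 = estimate_fir(toy_matches, threshold=8.0)

# A palindromic sequence is identical to its own reversed decoy, a simple
# case of a decoy that cannot be told apart from a target sequence.
palindrome_is_own_decoy = reverse_decoy("LEVEL") == "LEVEL"
```

Raising the threshold from 7.0 to 8.0 in this toy example lowers the estimate from 0.25
to 0.0, but neither value says anything about whether the surviving target matches
conform to the expected parent protein size, which is the gap that conformance scoring
is meant to address.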
1. Benchmark data: 22 MS/MS experiments for optimization of algorithm
implementations
Although an alternative to decoy-based FIR would be to use an empirical Bayes
approach, such as that implemented by PeptideProphet (Nesvizhskii et al., 2007), we
suggest an algorithm evaluation method to be used in conjunction with the standard
application of false identification rates. We have created a set of 22 MS/MS spectra
files from 22 experiments that test an algorithm’s ability to detect peptides with parent
protein sizes within a known molecular weight size range despite the algorithm
having no knowledge of the parent protein mass prior to trypsin digestion. Our results
have shown that despite the inherent differences between the design and structure of
different algorithms (SEQUEST and OMSSA), and despite parameter setting
variations between different algorithm implementations (OMSSA), the computed
conformance scores indicate high confidence in peptide matches from all algorithms
and implementations of algorithms. This suggests that investigators performing
large-scale yeast proteomic studies may use our set of 22 MS/MS spectra as an
optimization tool-kit for choosing the best algorithm to use or for tweaking the
parameter settings in an individual algorithm.
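As a rough illustration of how such a benchmark could be scored, the sketch below
computes a conformance score for the peptides matched in one gel slice. It rests on
assumptions that should be checked against the Methods: that a peptide conforms when
its parent protein mass falls inside the molecular weight range of the slice, and that
the mass tolerance widens each bound of that range proportionally. The function names,
the slice bounds, and the toy peptides are hypothetical.

```python
def conforms(parent_mass_kda, slice_min_kda, slice_max_kda, tolerance=0.10):
    """True if the parent protein mass lies within the widened slice range.

    Assumption of this sketch: the tolerance widens each bound
    proportionally (lower bound * 0.9, upper bound * 1.1 at 10%).
    """
    lower = slice_min_kda * (1.0 - tolerance)
    upper = slice_max_kda * (1.0 + tolerance)
    return lower <= parent_mass_kda <= upper

def conformance_score(peptides, slice_min_kda, slice_max_kda, tolerance=0.10):
    """Percentage of peptides whose parent protein mass conforms."""
    flags = [
        conforms(mass, slice_min_kda, slice_max_kda, tolerance)
        for _sequence, mass in peptides
    ]
    return 100.0 * sum(flags) / len(flags)

# Hypothetical (sequence, parent protein mass in kDa) pairs from a
# hypothetical gel slice spanning 40-50 kDa.
toy_peptides = [("AAK", 45.0), ("CDR", 37.0), ("EFK", 54.0), ("GHR", 80.0)]
score_with_tolerance = conformance_score(toy_peptides, 40.0, 50.0)
score_without_tolerance = conformance_score(toy_peptides, 40.0, 50.0,
                                            tolerance=0.0)
```

An investigator comparing implementations would compute this score for each candidate
set of search parameters on the same 22 spectra files and prefer the settings that
retain many peptides without lowering the score.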
2. Multiple algorithms, implementations, and ion screens: increase yield of high
confidence peptide matches
Furthermore, analyzing the peptide matches from our implementations of
algorithms revealed algorithm-specific methods for increasing confidence in
algorithm detected peptides. In addition to the typical search for b and y ions (b/y ion
screen), an additional screen for a ion inclusion (a/b/y ion screen) may be performed
on MS/MS spectra files. After application of a 5% FIR, peptide matches that are
detected in both ion screens, and not exclusively in either screen, can be used for
further analysis. Our analysis on SEQUEST output data indicated a significantly
higher conformance score for peptide matches detected in both ion screens as
compared to peptide matches detected in either screen alone (Figure 6A). This
method, however, seems to be specific to SEQUEST and did not apply to peptide
matches detected by the OMSSA algorithm (Figure 6B).
Instead, our analysis of OMSSA output data suggested that taking the union of
peptide matches from multiple implementations of an algorithm increases the yield of
detected peptides while maintaining high confidence in the matches (Figure 5A;
Table 2). In particular, our data suggested that peptide matches detected by a single
OMSSA implementation have a low conformance score, indicative of low confidence
in the match, and should be discarded. For both SEQUEST and OMSSA, applying the
algorithm-specific methods for increasing confidence in peptide matches raised the
conformance scores for peptide matches detected by SEQUEST alone (65.4% to 80.6%)
or OMSSA alone (62.1% to 73.4%) (Table 4C vs. 4D).
3. Practical limitations to benchmark data sets
In conclusion, our data suggest the use of multiple algorithms, implementations,
and ion screens, in addition to the algorithm specific filtering methods described, to
increase the yield of high confidence peptide matches. This methodology of
combining samplings from different algorithms so as to increase peptide yield,
however, has also been suggested by a number of other groups (Searle et al., 2008;
Keller et al., 2005; Price et al., 2007). Furthermore, we plan to assemble a tool-kit for
algorithm and implementation evaluation using the 22 MS/MS spectra files, and to
make the raw spectral files available online.
Admittedly, although such benchmark data sets allow for the tweaking of
implementations and the choice of algorithms for large-scale proteomics
investigations, limitations in practicality include the fact that these benchmark data
sets are heavily dependent on the type of mass spectrometer and the fragmentation
method used to obtain the MS/MS spectra (Allmer, 2012). In a comparative study of
SEQUEST and MASCOT spectrum-peptide matching performance on spectra
obtained from LC-MS/MS analysis of the yeast proteome by different types of mass
spectrometers (LTQ and QqTOF; QSTAR), Elias et al. found that while the two
algorithms gave similar peptide matches using LTQ MS/MS spectra, the matches
obtained from QSTAR MS/MS spectra were less similar (Elias et al., 2005).
Therefore, if our 22 MS/MS experiments were to act as a benchmark data set for
proteomic studies in yeast, computed conformance scores for different algorithms or
implementations may be limited to investigators who aim to use CID fragmentation
and a mass spectrometer of similar resolution and sensitivity to the LCQ-Deca XP in
future proteomic experiments. Nonetheless, the parent protein-profiling approach is
still a valid evaluation method to assess the performance of spectra-matching
algorithms, and labs using other classes of mass spectrometers can apply the same
methodology using new gel slice data sets.
II. Future Directions
1. Assessment of MASCOT using parent protein conformance scoring
With the recent acquisition of the MASCOT spectra-matching algorithm, we plan
to use the spectra from the 22 MS/MS parent-profiling experiments as a benchmark
data set for the optimization of search parameter implementations in MASCOT. As
OMSSA and MASCOT are both probability-based spectra-matching algorithms, we
expected similar numbers of detected peptides and high conformance scores for
individual implementations set up to mirror the parameter settings in the OMSSA
implementations. However, initial screenings performed using the MASCOT
algorithm yielded considerably lower numbers of detected peptides in addition to
slightly lower overall conformance scores – results that did not agree well with data
from our analyses of the SEQUEST and OMSSA algorithms. As MASCOT is also
proprietary software, a full understanding of the individual parameter settings may be
impossible. However, additional screening and changes to the current parameter
settings will allow us to find an optimal set or sets of implementations for future mass
spectrometry based yeast proteomic studies using LC-MS/MS on an ion trap mass
spectrometer coupled with a nano-ESI probe and CID fragmentation.
2. Applications of mass spectrometry-based proteomics
2.1 Elucidation of biological functions and pathways
Traditionally, individual protein-protein interactions in the budding yeast were
revealed through transcriptional activation of a reporter gene in two-hybrid assays.
Although comprehensive yeast two-hybrid studies have revealed a multitude of
protein-protein interactions (Ito et al., 2001; Uetz et al., 2000), the use of such
experimentation methods in the study of protein networks assumes an oversimplified
model that disregards: i) other non-binding protein-protein interactions such as
phosphorylation, ii) the downstream effects of protein-protein interactions, and iii) the
interconnection between signaling pathways. Fortunately, mass spectrometry-based
proteomic experiments can be manipulated to address all three issues by comparing
differentially treated protein samples at the proteomic-scale. In particular, the
emergence of hybrid mass spectrometers and electron capture dissociation (ECD) has
contributed to the identification of low abundance and post-translationally modified
proteins, allowing for the elucidation of protein interaction networks and cellular
pathways (Mumby and Brekken, 2005).
In a recent phosphoproteomic study on the epidermal growth factor (EGF)
signaling pathway, proteins from three populations of HeLa cells were differentially
labeled with isotopic forms of lysine and arginine via SILAC (stable isotope labeling
by amino acids in cell culture), and subsequently quantified using LC-MS/MS on a
linear ion trap/Fourier transform mass spectrometer (LTQ-FT) after exposure of the
HeLa cells to EGF for varying lengths of time (Olsen et al., 2006). The temporal and
kinetic profiling of the 6,600
phosphorylation sites on 2,244 proteins revealed clusters of phosphopeptides with
defined functions at specific stages in the EGF receptor pathway, providing insight
into the temporal dynamics and players involved in the initiation of the signaling
pathway via autophosphorylation of the EGF receptor, the step-wise activation of the
kinase cascades, and the regulation of transcription factors involved in cellular
processes such as proliferation (Lemmon and Schlessinger, 2010).
In addition to revealing temporal aspects of protein-protein interactions in
signaling pathways, mass spectrometry is also capable of elucidating higher order
structural features involved in protein-protein or protein-nucleotide interactions. One
of the first structural studies used limited proteolysis and mass spectrometry to
identify the DNA-binding regions in the Max protein, confirming previous X-ray
crystallographic results that suggested a DNA-binding site at the basic N-termini of
the homodimer. Furthermore, the inherent design of the proteolytic protection assay,
where the Max protein was subjected to proteolysis by six different endoproteases in
the presence or absence of DNA, allowed for a comparison of MALDI-MS spectra
that revealed a conformational change due to the binding of DNA (Cohen et al., 1995).
More recently, improvements in cross-linking/mass spectrometry, where noncovalent protein-protein or protein-nucleotide interactions are converted to covalent
bonds, have allowed for structural studies on protein complexes such as the
elucidation of heterodimeric coiled coils and tetramerization domain organization in
the NDC80 complex (Maiolica et al., 2007).
2.2 Mass spectrometry-based proteomics in the clinical setting
With the maturation of mass spectrometry techniques and instrumentation in the
lab setting, a shift toward clinical applications of mass spectrometry is likely to
follow. Just as the quantitative characterization of gene expression via DNA
microarrays has led to the development of array comparative genomic hybridization
(aCGH) in the diagnosis of diseases characterized by microdeletions or duplications
of chromosomes (Galizia et al., 2012), quantitative characterization of protein
expression via mass spectrometry can also be targeted toward the development of
diagnostic and prognostic tools in medicine. One such tool, MALDI imaging mass
spectrometry (MALDI-MS), has identified biomarkers for the diagnosis of gastric
cancer at various pathologic stages by comparing endoscopic biopsy tissue samples
from healthy individuals and cancer patients (Kim et al., 2010). Although MALDI-MS
can directly analyze specific areas of tissue samples mounted on a MALDI
matrix, thereby providing spatial information on protein expression, the application of
the technique in clinical settings is limited by a number of factors:
i) biomarkers identified by comparing sample tissue from “healthy” individuals and
   patients may be specific to only a subset of disease patients, resulting in
   false-positives or false-negatives (LaBaer, 2005);
ii) the non-standardized work flow from patient sample collection, delivery to a
   proteomic lab, sample preparation, and mass spectrometry to data interpretation
   is open to multiple variables affecting disease prognosis (Beretta, 2007).
Nonetheless, the prospective advantages of mass spectrometric techniques in
diagnosing patients at the early onset of a disease or in assessing the effectiveness of
a certain drug for chronic diseases greatly outweigh the disadvantages listed above.
With standardization of the workflow from sample collection to disease prognosis
and highly regulated population-specific studies in the identification of biomarkers,
mass spectrometry-based prognosis and diagnosis of disease-states may play a major
role in the development of preventive medicine.
References
Aebersold, R. and Mann, M. (2003). Mass spectrometry-based proteomics. Nature
422, 198-207.
Allmer, J. (2012). A Call for Benchmark Data in Mass Spectrometry-Based
Proteomics. JIOMICS 2, 1-5.
Balgley, B.M., Laudeman, T., Yang, L., Song, T., and Lee, C.S. (2007). Comparative
Evaluation of Tandem MS Search Algorithms Using a Target-Decoy Search Strategy.
Mol. Cell. Prot. 6.9, 1599-1608.
Beretta, L. (2007). Proteomics from the clinical perspective: many hopes and much
debate. Nature Methods 4, 785-786.
Black, D.L. (2000). Protein Diversity from Alternative Splicing: A Challenge for
Bioinformatics and Post-Genome Biology. Cell 103, 367-370.
Cohen, S.L., Ferré-D'Amaré, A.R., Burley, S.K., and Chait, B.T. (1995). Probing the
solution structure of the DNA-binding protein Max by a combination of proteolysis
and mass spectrometry. Protein Science 4, 1088-1099.
Coon, J.J., Syka, J.E.P., Shabanowitz, J., and Hunt, D.F. (2005). Tandem Mass
Spectrometry for Peptide and Protein Sequence Analysis. BioTechniques 38, 519-523.
Crick, F. (1970). Central Dogma of Molecular Biology. Nature 227, 561-563.
Davis, M.T. and Lee, T.D. (1998). Rapid Protein Identification Using a Microscale
Electrospray LC/MS System on an Ion Trap Mass Spectrometer. J. Am. Soc. Mass
Spectrom. 9, 194-201.
Elias, J.E. and Gygi, S.P. (2007). Target-decoy search strategy for increased
confidence in large-scale protein identifications by mass spectrometry. Nature
Methods 4, 207-214.
Elias, J.E., Haas, W., Faherty, B.K., and Gygi, S.P. (2005). Comparative evaluation
of mass spectrometry platforms used in large-scale proteomics investigations. Nature
Methods 2, 667-675.
Eng, J.K., McCormack, A.L., Yates, III, J.R. (1994). An approach to correlate tandem
mass spectral data of peptides with amino acid sequences in a protein database. J. Am.
Soc. Mass Spectrom. 5, 976-989.
Etienne, W., Meyer, M.H., Peppers, J., and Meyer, Jr., R.A. (2004). Comparison of
mRNA gene expression by RT-PCR and DNA microarray. BioTechniques 36, 618-626.
Fenn, J.B., Mann, M., Meng, C.K., Wong, S.F., and Whitehouse, C.M. (1989).
Electrospray ionization for mass spectrometry of large biomolecules. Science 246,
64-71.
Fitzgibbon, M., Li, Q., and McIntosh, M. (2008). Modes of inference for evaluating
the confidence of peptide identifications. J. Proteome Res. 7, 35-39.
Fournier, C.T., Cherny, J.J., Truncali, K., Robbins-Pianka, A., Lin, M.S., Krizanc, D.,
and Weir, M.P. (2012). Amino Termini of Many Yeast Proteins Map to Downstream
Start Codons. J. Proteome Res. 11, 5712-5719.
Galizia, E.C., Srikantha, M., Palmer, R., Waters, J.J., Lench, N., Ogilvie, C.M.,
Kasperavičiūtė, D., Nashef, L., and Sisodiya, S.M. (2012). Array comparative
genomic hybridization: Results from an adult population with drug-resistant epilepsy
and co-morbidities. European Journal of Medical Genetics 55, 342-348.
Geer, L.Y., Markey, S.P., Kowalak, J.A., Wagner, L., Xu, M., Maynard, D.M., Yang,
X., Shi, W., and Bryant, S.H. (2004). Open Mass Spectrometry Search Algorithm. J.
Proteome Res. 3, 958-964.
Ghaemmaghami, S., Huh, W., Bower, K., Howson, R.W., Belle, A., Dephoure, N.,
O’Shea, E.K., and Weissman, J.S. (2003). Global analysis of protein expression in
yeast. Nature 425, 737-741.
Gygi, S.P., Rochon, Y., Franza, B.R., and Aebersold, R. (1999). Correlation between
Protein and mRNA Abundance in Yeast. Mol. Cell. Biol. 19, 1720-1730.
Hunt, D.F., Yates, III, J.R., Shabanowitz, J., Winston, S., and Hauer, C.R. (1986).
Protein sequencing by tandem mass spectrometry. Proc. Natl. Acad. Sci. 83, 6233-6237.
Hyman, E. D. (1988). A new method of sequencing DNA. Analytical
Biochemistry 174, 423-436.
Iakoucheva, L.A., Kimzey, A.L., Masselon, C.D., Smith, R.D., Dunker, A.K., and
Ackerman, E.J. (2001). Aberrant mobility phenomena of the DNA repair protein
XPA. Protein Science 10, 1353-1362.
International Human Genome Sequencing Consortium. (2004). Finishing the
euchromatic sequence of the human genome. Nature 431, 931-945.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. (2001). A
comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc.
Natl. Acad. Sci. 98, 4569-4574.
Johnson, R.S., Martin, S.A., and Biemann, K. (1987). Novel Fragmentation Process
of Peptides by Collision-Induced Decomposition in a Tandem Mass Spectrometer:
Differentiation of Leucine and Isoleucine. Anal. Chem. 59, 2621-2625.
Karas, M. and Hillenkamp, F. (1988). Laser Desorption Ionization of Proteins with
Molecular Masses Exceeding 10 000 Daltons. Anal Chem 60, 2299-2301.
Keller, A., Eng, J., Zhang, N., Li, X., and Aebersold, R. (2005). A uniform
proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst.
Biol. [online] 1, E1-E8.
Kim, H.K., et al. (2010). Gastric Cancer-Specific Protein Profile Identified Using
Endoscopic Biopsy Samples via MALDI Mass Spectrometry. J. Proteome Res. 9,
4123-4130.
Kochetov, A.V., Sarai, A., Rogozin, I.B., Shumny, V.K., and Kolchanov, N.A. (2005).
The role of alternative translation start sites in the generation of human protein
diversity. Mol. Genet. Genomics 273, 491-496.
LaBaer, J. (2005). So, You Want to Look for Biomarkers (Introduction to the Special
Biomarkers Issue). J. Proteome Res. 4, 1053-1059.
Lemmon, M.A. and Schlessinger, J. (2010). Cell Signaling by Receptor Tyrosine
Kinases. Cell 141, 1117-1134.
Maiolica, A., Cittaro, D., Borsotti, D., Sennels, L., Ciferri, C., Tarricone, C.,
Musacchio, A., and Rappsilber, J. (2007). Structural Analysis of Multiprotein
Complexes by Cross-linking, Mass Spectrometry, and Database Searching. Mol. Cell.
Prot. 6, 2200-2211.
McLafferty, F.W., Breuker, K., Jin, M., Han, X., Infusini, G., Jiang, H., Kong, X.,
and Begley, T.P. (2007). Top-down MS, a powerful complement to the high
capabilities of proteolysis proteomics. FEBS J. 274, 6256-6258.
Mumby, M. and Brekken, D. (2005). Phosphoproteomics: new insights into cellular
signaling. Genome Biology 6, 230.
Nesvizhskii, A.I., Vitek, O., and Aebersold, R. (2007). Analysis and validation of
proteomic data generated by tandem mass spectrometry. Nature Methods 4, 787-797.
O’Farrell, P.H. (1975). High Resolution Two-Dimensional Electrophoresis of Proteins.
The Journal of Biological Chemistry. 250, 4007-4021.
Olsen, J.V., Blagoev, B., Gnad, F., Macek, B., Kumar, C., Mortensen, P., and Mann,
M. (2006). Global, In Vivo, and Site-Specific Phosphorylation Dynamics in Signaling
Networks. Cell 127, 635-648.
Parker, C.E., Mocanu, V., Mocanu, M., et al. (2010). Mass Spectrometry for
Post-Translational Modifications. In Neuroproteomics, O. Alzate, ed. (Boca Raton, FL:
CRC Press), Chapter 6. Available from: http://www.ncbi.nlm.nih.gov/books/NBK56012/
Perkins, D.N., Pappin, D.J.C., Creasy, D.M., and Cottrell, J.S. (1999).
Probability-based protein identification by searching sequence databases using mass
spectrometry data. Electrophoresis 20, 3551-3567.
Price, T.S., et al. (2007). EBP, a program for protein identification using multiple
tandem mass spectrometry data sets. Mol. Cell. Proteomics 6, 527-536.
Russo, A., Chandramouli, N., Zhang, L., and Deng, H. (2008). Reductive
Glutaraldehydation of Amine Groups for Identification of Protein N-termini. J.
Proteome Res. 7, 4178-4182.
Sanger, F., Nicklen, S., and Coulson, A.R. (1977). DNA sequencing with
chain-terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463-5467.
Searle, B.C., Turner, M., and Nesvizhskii, A.I. (2008). Improving Sensitivity by
Probabilistically Combining Results from Multiple MS/MS Search Methodologies. J.
Proteome Res. 7, 245-253.
Shao, Y. and Kent, S.B.H. (1997). Protein splicing: occurrence, mechanisms and
related phenomena. Chemistry & Biology 4, 187-194.
Shevchenko, A., Tomas, H., Havliš, J., Olsen, J.V., and Mann, M. (2006). In-gel
digestion for mass spectrometric characterization of proteins and proteomes. Nature
Protocols 1, 2856-2860.
Shevchenko, A., Wilm, M., Vorm, O., and Mann, M. (1996). Mass Spectrometric
Sequencing of Proteins from Silver-Stained Polyacrylamide Gels. Anal. Chem. 68,
850-858.
Syka, J.E.P., Coon, J.J., Schroeder, M.J., Shabanowitz, J., and Hunt, D.F. (2004a).
Peptide and protein sequence analysis by electron transfer dissociation mass
spectrometry. Proc. Natl. Acad. Sci. 101, 9528-9533.
Syka, J.E.P., et al. (2004b). Novel Linear Quadrupole Ion Trap/FT Mass
Spectrometer: Performance Characterization and Use in the Comparative Analysis of
Histone H3 Post-translational Modifications. J. Proteome Res. 3, 621-626.
Uetz, P., et al. (2000). A comprehensive analysis of protein-protein interactions in
Saccharomyces cerevisiae. Nature 403, 623-627.
Wenger, C.D. and Coon, J.J. (2013). A Proteomics Search Algorithm Specifically
Designed for High-Resolution Tandem Mass Spectra. J. Proteome Res. 12, 1377-1386.
Wildt, S. and Gerngross, T.U. (2005). The Humanization of N-Glycosylation
Pathways in Yeast. Nature Reviews 3, 119-128.
Wysocki, V.H., Resing, K.A., Zhang, Q., and Cheng, G. (2005). Mass spectrometry
of peptides and proteins. Methods 35, 211-222.
Zubarev, R.A., Horn, D.M., Fridriksson, E.K., Kelleher, N.L., Kruger, N.A., Lewis,
M.A., Carpenter, B.K., and McLafferty, F.W. (2000). Electron Capture Dissociation
for Structural Characterization of Multiply Charged Protein Cations. Anal. Chem. 72,
563-573.
Appendix A: SEQUEST search parameters in Proteome Discoverer 1.2
b/y ions screen
a/b/y ions screen
Appendix B: Python and MS-SQL scripts
Python Upload Script
SEQUEST_TestAndParseXLS.py
https://wesfiles.wesleyan.edu/home/mlin/mlin_BAMA_thesis_2013/SEQUEST_TestAndParseXLS.py
List of Stored Procedures
dt_2013_methods_assessGelSlices_SEQUEST_commented.txt
https://wesfiles.wesleyan.edu/home/mlin/mlin_BAMA_thesis_2013/dt_2013_methods_assessGelSlices_SEQUEST_commented.txt
dt_2013_methods_assessGelSlices_OMSSA_Commented.txt
https://wesfiles.wesleyan.edu/home/mlin/mlin_BAMA_thesis_2013/dt_2013_methods_assessGelSlices_OMSSA_Commented.txt