Target-decoy search strategy for increased confidence in

2015.06.19
김지형
Introduction
• Precursor peptides are dynamically selected for
fragmentation, with exclusion to prevent repeated
acquisition of MS/MS spectra for the same peptide.
• Increases the throughput of proteomic experiments
• Causes fragmentation of peptide ions with weak intensities
• Wrong interpretation
• Up to ∼40% of precursor ion masses are interpreted incorrectly.
• Overlapping isotopic clusters are often observed in complex
proteome samples and result in wrong interpretation of their masses
Introduction
• Determining isotopic clusters and their monoisotopic masses is the
first step in interpreting complex mass spectra generated by high-resolution mass spectrometers
• Fast, automated and accurate interpretation of the vast
amount of MS data
• is a fundamental and critical step in MS-based proteomic experiments
• remains the subject of much research activity.
Introduction
• Mann et al.: suggested a deconvolution algorithm to find charge
states.
• Senko et al.:
• introduced the notion of an “average” amino acid called averagine
• suggested a computational method for determining monoisotopic
masses using it.
• Zscore:
• a fast and automated isotopic cluster identification algorithm based on a
charge scoring scheme.
• ESI-ISOCONV
• MATCHING
• PepList
• LASSO
• AID-MS
• THRASH
Introduction
• THRASH
• One of the most widely used algorithms
• Employs the Fourier transform/Patterson method for charge
determination and least-squares fitting to compare a peak cluster with
an averagine isotopic distribution.
• Cons: least-squares fitting and/or the averagine isotopic
distribution often yields a monoisotopic mass that is off by 1-2
Da from the correct value
Methods
1. Present a probabilistic model of the isotopic distribution of a
polypeptide
2. Describe approximations of the intensity ratio and intensity ratio
product functions
• Intensity ratio function: intensity ratio of two adjacent peaks
• Intensity ratio product function: intensity ratio product of three adjacent
peaks
3. Determine isotopic clusters and monoisotopic masses with the
suggested algorithm
Methods
1. Isotopic Distribution Model
• Notations
• $A = \{C, H, N, O, S\}$: set of atoms that compose a polypeptide
• $X_a$: the +a isotope of an atom X (for each atom $X \in A$), with $a \in \{1, 2, 4\}$
• $P_{X_a}$: existence probability of the isotope $X_a$
• For example, $P_{C_1} \approx 0.0107$ (about 1.07%)
• $n_X$: the number of atoms X in the polypeptide
• $C_{n_C} H_{n_H} N_{n_N} O_{n_O} S_{n_S}$: elemental composition of a polypeptide
Methods
1. Isotopic Distribution Model
• Notations
• Isotopic distribution of a polypeptide:
the theoretical masses and intensities of the peaks generated
by all instances of the polypeptide
• $I_k$: the intensity of the $k$th peak in an isotopic distribution ($k \ge 0$)
• $I_0$: the intensity of the monoisotopic peak
• $I_k$ ($k \ge 1$): the intensity of the peak whose mass difference from the
monoisotopic peak is $k$ Da.
Methods
1. Isotopic Distribution Model
• Lemma1
• The intensity $I_k$ in an isotopic distribution is approximated by a
closed-form expression (equation not preserved in this copy of the slides)
Methods
1. Isotopic Distribution Model
• Lemma1
• $I_k$ can be computed as the coefficient of $x^k$ in the expansion of a
generating polynomial $P(x)$
• The intensity $I_k$ in an isotopic distribution of a polypeptide is regarded as the
sum of the existence probabilities of all polypeptide instances with mass
difference $k$
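The coefficient view can be tried numerically. The sketch below assumes the polynomial has the form $P(x) = \prod_{X \in A} (P_{X_0} + P_{X_1} x + P_{X_2} x^2 + P_{X_4} x^4)^{n_X}$, which is one reading consistent with the definitions on these slides rather than a formula taken from them. The abundance table holds rounded textbook values, and the names `ABUNDANCE` and `isotopic_intensities` are hypothetical.

```python
import numpy as np

# Approximate natural abundances indexed by mass shift (+0, +1, +2, +3, +4).
# Rounded textbook values; an assumption of this sketch, not data from the slides.
ABUNDANCE = {
    "C": [0.9893, 0.0107, 0.0, 0.0, 0.0],
    "H": [0.999885, 0.000115, 0.0, 0.0, 0.0],
    "N": [0.99636, 0.00364, 0.0, 0.0, 0.0],
    "O": [0.99757, 0.00038, 0.00205, 0.0, 0.0],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def isotopic_intensities(composition, n_peaks=6):
    """Return the first n_peaks coefficients I_0, I_1, ... of the assumed P(x).

    composition maps an element symbol to its integer atom count n_X.
    """
    poly = np.array([1.0])
    for atom, count in composition.items():
        factor = np.array(ABUNDANCE[atom])
        for _ in range(count):
            # Multiplying polynomials is a convolution of their coefficients.
            # Truncating is safe: low-order terms never depend on dropped ones.
            poly = np.convolve(poly, factor)[:n_peaks]
    return poly
```

For example, `isotopic_intensities({"C": 50, "H": 80, "N": 14, "O": 15, "S": 1})` returns the relative intensities of the first six peaks for that hypothetical composition.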
Methods
1. Isotopic Distribution Model
• Lemma1
• Intensity $I_0$
• the probability of there being no heavy isotopes in the polypeptide
• the constant term of the polynomial $P(x)$
Methods
1. Isotopic Distribution Model
• Lemma1
• Intensity $I_1$
• the probability of there being exactly one +1 isotope
• the coefficient of $x$ in $P(x)$
Methods
1. Isotopic Distribution Model
• Lemma1
• Intensity $I_2$
• the probability of there being two +1 isotopes or one +2 isotope
• the coefficient of $x^2$ in $P(x)$
Methods
1. Isotopic Distribution Model
• Lemma1
• Intensity $I_k$ for arbitrary $k \ge 1$
• $k$: mass difference from the monoisotopic peak
• $k_h$: the number of isotopes of +h Da
• $t_{X1}$: the number of +1 isotopes of atom X
• $k_1$: the sum of $t_{X1}$ over all atoms X
Methods
1. Isotopic Distribution Model
• Lemma1
• Simplify the mathematical form of $I_k$
• Assume linearity between the mass $m$ and the numbers of atoms $n_X$,
i.e., $n_X \approx a_X m$ ($a_X$: constant for each atom X)
• $T_1$, $T_2$, $T_4$: linear in mass $m$
• $I_k$: polynomial in mass $m$
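As a rough illustration of the linearity assumption $n_X \approx a_X m$, the sketch below derives the constants $a_X$ from the averagine unit of Senko et al. mentioned in the introduction. The slides do not say which constants the authors used, so the averagine values and the function names here are assumptions for illustration only.

```python
# Averagine unit (Senko et al.): C4.9384 H7.7583 N1.3577 O1.4773 S0.0417,
# with an average mass of 111.1254 Da. Used here only as an illustrative
# source of the constants a_X; the slides do not state the authors' values.
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
AVERAGINE_MASS = 111.1254


def mass_coefficients():
    """Return a_X, the assumed number of atoms X per Da of polypeptide mass."""
    return {atom: n / AVERAGINE_MASS for atom, n in AVERAGINE.items()}


def estimate_atom_counts(mass):
    """Estimate (fractional) atom counts n_X = a_X * m for a given mass in Da."""
    return {atom: a * mass for atom, a in mass_coefficients().items()}
```

Under this assumption, doubling the mass doubles every estimated atom count, which is what makes any quantity of the form $\sum_X n_X \cdot \text{const}_X$ linear in $m$.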
Methods
1. Isotopic Distribution Model
• Lemma 2
• In an isotopic distribution of a polypeptide, the
intensity $I_k$ is approximated by a polynomial in the mass $m$
of degree $k$,
• i.e., $I_k \approx c_k m^k + c_{k-1} m^{k-1} + \dots + c_1 m + c_0$
• Because of variations in elemental compositions, each of $T_1$, $T_2$ and $T_4$ has
a range of constants in its linear form
• Obtain both the minimum and maximum of $T_1$, $T_2$ and $T_4$
• From these, the range of $I_k$ can be estimated
Methods
2. Ratio Function and Ratio Product Functions
• Intensity ratio
• $I_{k+1}/I_k$: approximated by a linear function of polypeptide mass
• Intensity ratio product
• $I_k I_{k+2}/I_{k+1}^2$: approximated by a constant function of polypeptide mass
• Compute average, minimum, and maximum functions using
simulated spectra of tryptic polypeptides generated from a
protein database.
Methods
2. Ratio Function and Ratio Product Functions
• Theorem 1
• In an isotopic distribution of a polypeptide,
the ratio of two adjacent peaks, $I_{k+1}/I_k$,
can be approximated by a linear function of the polypeptide mass.
• $I_{k+1}/I_k \approx cm + b$
• From Lemma 2, $I_{k+1}/I_k$ is a ratio of two polynomials of degree $k+1$
and $k$.
• For a sufficiently large mass $m$, the highest-degree terms ($c_{k+1} m^{k+1}$
in $I_{k+1}$ and $c_k m^k$ in $I_k$) dominate:
$I_{k+1} = c_{k+1} m^{k+1} + \dots$ (terms of degree $\le k$)
$I_k = c_k m^k + \dots$ (terms of degree $\le k-1$)
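The step from "the leading terms dominate" to a linear function can be spelled out. In the sketch below, $d_k$ and $e_{k-1}$ are hypothetical names for the second-highest coefficients of $I_{k+1}$ and $I_k$ (with $e_{k-1} = 0$ when $k = 0$); they are introduced only for this expansion and do not appear on the slides.

```latex
\[
\frac{I_{k+1}}{I_k}
= \frac{c_{k+1} m^{k+1} + d_k m^{k} + O(m^{k-1})}
       {c_k m^{k} + e_{k-1} m^{k-1} + O(m^{k-2})}
= \underbrace{\frac{c_{k+1}}{c_k}}_{c}\, m
+ \underbrace{\left(\frac{d_k}{c_k} - \frac{c_{k+1}\, e_{k-1}}{c_k^{2}}\right)}_{b}
+ O\!\left(\frac{1}{m}\right)
\]
```

The remainder shrinks as $m$ grows, so the ratio approaches the line $cm + b$ for large masses.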
Methods
2. Ratio Function and Ratio Product Functions
• Theorem 1
• $I_{k+1}/I_k \approx cm + b$
• About 100,000 tryptic peptides of 400 Da to 5,200 Da, generated from
UniProt DB 8.0, were sampled and the ratio $I_{k+1}/I_k$ was computed for each peptide
• Solid line: $R_{avg}(k, m)$
• Dotted line: $R_{max}(k, m)$
• Dashed line: $R_{min}(k, m)$
• Lines computed by linear regression
using least-squares fitting
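A least-squares line fit of this kind can be sketched with NumPy. The function name `fit_ratio_line` and the input arrays are hypothetical; in practice the masses and ratios would come from simulated peptides such as those described above, and the slides do not specify how the minimum and maximum lines were derived, so only the average fit is shown.

```python
import numpy as np


def fit_ratio_line(masses, ratios):
    """Fit ratios ~ c * mass + b by ordinary least squares.

    masses: peptide masses in Da; ratios: observed I_{k+1}/I_k for a fixed k.
    Returns the slope c and intercept b.
    """
    # np.polyfit returns coefficients from highest degree to lowest.
    c, b = np.polyfit(np.asarray(masses, dtype=float),
                      np.asarray(ratios, dtype=float), deg=1)
    return float(c), float(b)
```

Calling `fit_ratio_line` once per value of $k$ gives one fitted line per adjacent-peak pair.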
Methods
2. Ratio Function and Ratio Product Functions
• Theorem 1
• $I_{k+1}/I_k \approx cm + b$
• Reason for choosing the threshold of 1800 Da:
for a polypeptide of up to 1800 Da, the monoisotopic peak is the first and
most abundant peak
Methods
2. Ratio Function and Ratio Product Functions
• Theorem 2
• In an isotopic distribution of a polypeptide,
the ratio product of three adjacent peaks, $I_k I_{k+2}/I_{k+1}^2$,
can be approximated by a constant.
• $I_k I_{k+2}/I_{k+1}^2 \approx C$
• From Lemma 2, $I_k I_{k+2}$ and $I_{k+1}^2$ have the same degree, $2k+2$
• For a sufficiently large mass $m$, $I_k I_{k+2}/I_{k+1}^2$ can be approximated by a
constant.
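The degree argument can be written out as a limit. The second line is a simplified special case, not a statement from the slides: it assumes that only +1 isotopes contribute and that the leading coefficient of $I_k$ is proportional to $t^k/k!$ for some constant $t$.

```latex
\[
\frac{I_k\, I_{k+2}}{I_{k+1}^{2}}
= \frac{c_k c_{k+2}\, m^{2k+2} + O(m^{2k+1})}{c_{k+1}^{2}\, m^{2k+2} + O(m^{2k+1})}
\;\longrightarrow\; \frac{c_k\, c_{k+2}}{c_{k+1}^{2}}
\quad (m \to \infty)
\]
\[
c_k \propto \frac{t^{k}}{k!}
\;\Longrightarrow\;
\frac{c_k\, c_{k+2}}{c_{k+1}^{2}}
= \frac{\big((k+1)!\big)^{2}}{k!\,(k+2)!}
= \frac{k+1}{k+2}
\]
```

Under that simplification the constant depends only on the peak index $k$, for example $1/2$ for $k = 0$ and $2/3$ for $k = 1$.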
Methods
2. Ratio Function and Ratio Product Functions
• Theorem 2
• $I_k I_{k+2}/I_{k+1}^2 \approx C$
• Plotted functions: $RP_{avg}(k, m)$, $RP_{max}(k, m)$, $RP_{min}(k, m)$
Methods
3. Algorithm Overview
1) Peak picking
2) Pseudocluster identification
3) Isotopic cluster identification and monoisotopic mass
determination
4) Duplicate cluster removal
Methods
3. Algorithm Overview
1) Peak picking
• Remove noise and select relatively high intensity peaks from raw
spectrum
• In our experiment, we used the peak picking algorithm of
Decon2LS
Methods
3. Algorithm Overview
2) Pseudocluster identification
• Identify pseudoclusters by scanning the selected peaks from low m/z to
high m/z
• Find all the pseudoclusters starting at every peak
• First find pseudoclusters with charge state 1+, then find the
pseudoclusters with higher charge states.
• Let X denote the m/z of the current peak. Then the range of the next peak’s
m/z is [ X + (D - E)/C, X + (D + E)/C ]
• C : the charge state
• D : estimated mass difference between two adjacent peaks in an
isotopic cluster
• E : the error bound
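The window rule above can be sketched as a scan over a sorted peak list. The default values `d=1.00235` and `e=0.01` (both in Da) are assumptions chosen for illustration, as is the tie-break of taking the peak closest to the window centre; the slides give neither. Function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def find_pseudocluster(mzs, start, charge, d=1.00235, e=0.01):
    """Chain peaks from index `start` whose spacing fits the given charge.

    mzs must be sorted ascending. Returns the indices of the chained peaks.
    d (isotope spacing) and e (error bound) are assumed values in Da.
    """
    cluster = [start]
    x = mzs[start]
    while True:
        # Next peak must fall in [x + (d - e)/charge, x + (d + e)/charge].
        lo = bisect_left(mzs, x + (d - e) / charge)
        hi = bisect_right(mzs, x + (d + e) / charge)
        if lo >= hi:
            return cluster
        center = x + d / charge
        nxt = min(range(lo, hi), key=lambda i: abs(mzs[i] - center))
        cluster.append(nxt)
        x = mzs[nxt]


def find_all_pseudoclusters(mzs, max_charge=4, d=1.00235, e=0.01):
    """Try charge 1+ first, then higher charges, starting at every peak."""
    found = []
    for charge in range(1, max_charge + 1):
        for start in range(len(mzs)):
            idx = find_pseudocluster(mzs, start, charge, d, e)
            if len(idx) >= 2:  # a single peak is not a cluster
                found.append((charge, idx))
    return found
```

Because every peak is tried as a starting point under every charge, the same isotopic cluster typically produces several pseudoclusters, which is why a later duplicate-removal step is needed.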
Methods
3. Algorithm Overview
3) Isotopic cluster identification and monoisotopic mass
determination
• From the pseudoclusters, we identify isotopic clusters whose intensity
patterns are similar to those of isotopic distributions.
• For each pseudocluster, we determine whether it is an isotopic cluster
by checking $I_{k+1}/I_k$ and $I_k I_{k+2}/I_{k+1}^2$
• Consider the case where some peaks are missing from the pseudocluster,
because the monoisotopic peak and its neighboring peaks are sometimes as
weak as the noise level
• Therefore, calculate scores for four cases (0 to 3 leftmost peaks missing) and
select the case with the highest score
Methods
3. Algorithm Overview
3) Isotopic cluster identification and monoisotopic mass
determination
• Monoisotopic Mass Calculation
• m : monoisotopic mass (most abundant peak in the pseudocluster)
• the most abundant peak : qth peak in the pseudocluster
• p : the number of missing peaks
• Score Calculation (score of pseudocluster)
• n : the number of peaks in the pseudocluster
Methods
3. Algorithm Overview
3) Isotopic cluster identification and monoisotopic mass
determination
• Score Calculation (score of pseudocluster)
• Score for the intensity ratio $I_{k+1}/I_k$
• Score for the intensity ratio product $I_k I_{k+2}/I_{k+1}^2$
Methods
3. Algorithm Overview
3) Isotopic cluster identification and monoisotopic mass
determination
• Score Calculation (score of pseudocluster)
• $I'_k$: the intensity of the $(k+1)$st peak in a pseudocluster
• $I'_k$ corresponds to $I_{k+p}$ in the isotopic distribution
• The ratio score $score_R(k, p, m)$ measures the similarity of the
intensity ratio $I'_{k+1}/I'_k$ to the intensity ratio $I_{k+p+1}/I_{k+p}$ in the isotopic
distribution whose monoisotopic mass is $m$
Methods
• $R_{min} < I'_{k+1}/I'_k < R_{max} \Rightarrow score_R(k, p, m) > 0$
Methods
• $RP_{min} < I'_k I'_{k+2}/I'^{2}_{k+1} < RP_{max} \Rightarrow score_{RP}(k, p, m) > 0$
Methods
3. Algorithm Overview
3) Isotopic cluster identification and monoisotopic mass
determination
• Score > 0: the pseudocluster is selected and becomes an isotopic cluster
• Score ≤ 0: the pseudocluster is discarded
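The slides state only the sign behaviour of the scores: positive when the observed ratio lies strictly between the minimum and maximum functions, with clusters kept when the score is positive. The sketch below is a hypothetical piecewise-linear score with that property, not the scoring formula of the presented method; `ratio_score` and `is_selected` are illustrative names.

```python
def ratio_score(observed, r_min, r_avg, r_max):
    """Hypothetical score: 1 at r_avg, 0 at r_min and r_max, negative outside.

    Assumes r_min < r_avg < r_max. This only mimics the stated property
    "r_min < observed < r_max implies score > 0".
    """
    if observed >= r_avg:
        return (r_max - observed) / (r_max - r_avg)
    return (observed - r_min) / (r_avg - r_min)


def is_selected(score):
    """Keep a pseudocluster only when its score is strictly positive."""
    return score > 0
```

The same shape could be reused for the ratio product by passing the $RP_{min}$, $RP_{avg}$ and $RP_{max}$ values instead.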
Methods
3. Algorithm Overview
4) Duplicate cluster removal
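• Because this algorithm considers all possible pseudoclusters, many
pseudoclusters can be generated from a single isotopic cluster.
• Remove one of each pair of duplicate clusters as follows:
1) Most abundant peak: remove the cluster whose most abundant peak is smaller; if the same, go to 2)
2) Charge state: remove the cluster with the lower charge state; if the same, go to 3)
3) Score: remove the cluster with the lower score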
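The three-level tie-break can be expressed as a tuple comparison. This sketch assumes that "most abundant peak" is compared by intensity and that the two clusters have already been judged duplicates of each other; how duplicates are detected is not described on the slides. The `Cluster` fields and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    top_intensity: float  # intensity of the cluster's most abundant peak
    charge: int
    score: float


def cluster_to_remove(a, b):
    """Return whichever of two duplicate clusters should be removed.

    Compares most-abundant-peak intensity first, then charge state, then
    score; the cluster with the smaller value at the first difference loses.
    If all three are equal, b is removed arbitrarily.
    """
    key_a = (a.top_intensity, a.charge, a.score)
    key_b = (b.top_intensity, b.charge, b.score)
    return a if key_a < key_b else b
```

Python compares tuples element by element, so later fields are consulted only when the earlier ones are equal, matching the order of the rules above.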
Results and Discussion
Three programs were compared: RAPID, ICR2LS, Decon2LS
• Count the number of identified isotopic clusters of known
peptides whose amino acid sequences were identified by MS/MS
• It is difficult to pick out the isotopic clusters of known peptides
• because the MS data can contain many peptides whose monoisotopic
masses are very similar.
Results and Discussion
So the following method is used
• For each known peptide, find isotopic clusters of this peptide in
the MS scan where this peptide was identified by MS/MS.
• If an isotopic cluster has a monoisotopic mass within a mass
tolerance of 10 ppm, consider it a potentially correct isotopic
cluster.
• Also look for the peptide in adjacent scans.
• If no isotopic cluster is found within any of 10 consecutive scans, the
cluster is discarded
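The 10 ppm criterion amounts to a relative mass comparison. A minimal sketch, assuming the tolerance is taken relative to the peptide's theoretical monoisotopic mass (the function name is hypothetical):

```python
def within_ppm(observed_mass, theoretical_mass, tol_ppm=10.0):
    """True if the observed mass is within tol_ppm of the theoretical mass."""
    error_ppm = abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6
    return error_ppm <= tol_ppm
```

At 2,000 Da, a 10 ppm tolerance corresponds to 0.02 Da, which is far smaller than the 1-2 Da errors discussed for averagine-based fitting.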
Results and Discussion
The number of isotopic clusters of 494 known peptides
Program       Isotopic clusters
New method    10,588
Decon2LS      10,104
ICR2LS         9,577
Results and Discussion
Reasons for different search results
• Some clusters are inherently ambiguous, and each program can
judge them differently.
• THRASH-based algorithms produce 1-2 Da errors
• when the position of the most abundant peak differs from that of
averagine
Results and Discussion
Identification of Overlapping Clusters
• There are many overlapping clusters
• Hundreds of isotopic clusters are crowded into a narrow range
Results and Discussion
Clusters without shared peaks were
identified by all programs.
Results and Discussion
Clusters with shared peaks (blue) were
identified only by the paper’s method.
Results and Discussion
Execution Time
Conclusion
• New probabilistic model and algorithm for determining isotopic
distributions and monoisotopic masses
• The suggested algorithm found more isotopic clusters of identified
peptides and
• successfully resolved the 1-2 Da mismatch problem.
• The suggested algorithm identified overlapping clusters well.
• The suggested algorithm was faster than the other algorithms.
