PepHMM: A Hidden Markov Model Based Scoring
Function for Mass Spectrometry Database Search
Laxman Yetukuri
T-61.6070: Modeling of Proteomics Data
Outline

Motivation

Basics: MS and MS/MS for Protein Identification

Computational Framework of Database Search

Scoring Algorithms

PepHMM

MOWSE

Results

Summary
Motivation

Proteomics studies are dynamic and context-sensitive

Speed and accuracy of omics-driven methods

High throughput MS-based approaches

Real analysis starts with protein identification

Protein identification is challenging

The heart of a protein identification algorithm is its scoring function
Protein Identification Is Challenging

Sample Contamination

Imperfect Fragmentation

Post-translational modifications

Low signal-to-noise ratio

Machine errors
Basics: MS and MS/MS for Protein Identification
[Figure: workflow with the steps Trypsin Digest, Liquid Chromatography, Mass Spectrometry, Precursor selection + collision-induced dissociation (CID), MS/MS, and Computational Problem]
Nesvizhskii and Aebersold, Drug Discovery Today, 2004, 9, 173-181
Peptide Fragmentation: b & y ions
[Figure: peptide backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-CH-R', with the b_i / y_{n-i} and b_{i+1} / y_{n-i-1} cleavage sites marked]
Peptide Fragmentation: b & y ions …
Residue   b ion (m/z)   y ion (m/z)
S         88            1166
G         145           1080
F         292           1022
L         405           875
E         534           762
E         663           633
D         778           504
E         907           389
L         1020          260
K         1166          147
[Figure: MS/MS spectrum, % intensity (0 to 100) against m/z (0 to 1000), with peaks labelled y2 to y9 and b3 to b9]
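The b- and y-ion ladder for this peptide can be recomputed with a minimal sketch. It assumes singly charged ions and standard monoisotopic residue masses; the function name `fragment_ions` is illustrative. The slide values appear to be nominal masses, so the computed values agree with them only to within about 1 Da, and the last b entry on the slide (1166) corresponds to the full peptide rather than to the sum of residue masses computed here.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[k] covers the first k+1 residues; y[k] covers the last k+1 residues.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    total = 0.0
    for m in masses:              # N-terminal prefixes
        total += m
        b.append(total + PROTON)
    total = 0.0
    for m in reversed(masses):    # C-terminal suffixes
        total += m
        y.append(total + WATER + PROTON)
    return b, y


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for k, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{k} = {b:8.2f}   y{k} = {y:8.2f}")
```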
Peptide Fragmentation with other ions
[Figure: the same backbone with additional fragment ion types marked: a_i, b_i, c_i and b_{i+1} (N-terminal), and x_{n-i}, y_{n-i}, z_{n-i} and y_{n-i-1} (C-terminal)]
Peptide Identification
Two main methods for tandem MS:

De novo interpretation

Sequence database search
De Novo Interpretation
[Figure: MS/MS spectrum, % intensity (0 to 100) against m/z (0 to 1000), with the gaps between peaks labelled by residue letters (S, G, F, L, E, D, K)]
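The idea behind de novo interpretation can be illustrated with a small sketch: mass differences between consecutive peaks of one ion series are matched against residue masses. The sketch assumes a clean, complete b-ion ladder (the nominal values from the b-ion example), a 0.5 Da tolerance, and a hypothetical helper `read_ladder`. Real spectra contain missing and noise peaks, which is what makes the problem hard.

```python
# Monoisotopic residue masses (Da); only the residues of the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}


def read_ladder(peaks, tol=0.5):
    """Translate consecutive peak differences into residue letters.

    Differences that match no residue within `tol` are reported as '?'.
    """
    letters = []
    for lo, hi in zip(peaks, peaks[1:]):
        diff = hi - lo
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - diff))
        letters.append(best if abs(RESIDUE_MASS[best] - diff) <= tol else "?")
    return "".join(letters)


b_ladder = [88, 145, 292, 405, 534, 663, 778, 907, 1020]  # b1..b9
print(read_ladder(b_ladder))  # prints "GFLEEDEL"
```

The first residue (S) and the last (K) are not recovered from the differences alone; they follow from the lowest peak and from the precursor mass.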
Sequence Database Search

Widely used approach

Compares peptides from a protein sequence database with
experimental spectra

The scoring function summarises the comparison


Critical for any search engine
Score each peptide against spectrum

Cross correlation (SEQUEST)

MOWSE scoring and its extensions (MASCOT)

Probabilistic scoring systems (OMSSA, OLAV, ProbID, …)
PepHMM is an HMM-based probabilistic scoring function
Computational Framework for pepHMM

MSDB based peptide extraction

Hypothetical spectrum generation


b, y, y-H2O, b-H2O, b2+ and y2+
Computing probabilistic scores

Initial classification: match, missing or noise

Compute pepHMM scores (discussed later)

Compute Z-score

Compute E-score
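As a rough illustration of the Z-score step, the sketch below standardises one peptide's score against the scores of all candidate peptides for the same spectrum. Whether PepHMM uses exactly this population, and how the E-score is then derived, is not stated on the slide, so the definition and the example scores are assumptions.

```python
import statistics


def z_score(score, candidate_scores):
    """Standardise `score` against the distribution of candidate scores."""
    mean = statistics.mean(candidate_scores)
    sd = statistics.stdev(candidate_scores)  # sample standard deviation
    if sd == 0:
        raise ValueError("all candidate scores are identical")
    return (score - mean) / sd


# Hypothetical log-scores of five candidate peptides for one spectrum.
scores = [-52.0, -50.5, -49.0, -51.2, -30.0]
best = max(scores)
print(f"Z-score of top candidate: {z_score(best, scores):.2f}")
```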
Contents of pepHMM Model

PepHMM combines the information on correlation among
the ions, peak intensity and match tolerance

Input – sets of matches, missing and noise

Model is based on b and y ions

Each match is associated with an observation (T, I): match tolerance and peak intensity

Observation state = observed (T, I)

Hidden state = true assignment of the observations
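The input sets of matches, missing and noise can be sketched as a tolerance-based comparison between theoretical fragment masses and observed peaks. The 0.5 Da tolerance, the nearest-peak rule and the name `classify_peaks` are assumptions for illustration; each match carries its observation (T, I), here the mass offset and the intensity.

```python
def classify_peaks(theoretical, observed, tol=0.5):
    """Split peaks into matches, missing ions and noise.

    theoretical : list of theoretical fragment m/z values
    observed    : list of (m/z, intensity) tuples
    Returns (matches, missing, noise), where each match is
    (theoretical m/z, T = observed - theoretical, I = intensity).
    """
    matches, missing = [], []
    used = set()
    for t in theoretical:
        # Unused observed peaks within the tolerance window of this ion.
        candidates = [(abs(mz - t), idx)
                      for idx, (mz, _) in enumerate(observed)
                      if idx not in used and abs(mz - t) <= tol]
        if candidates:
            _, idx = min(candidates)          # closest peak wins
            mz, intensity = observed[idx]
            used.add(idx)
            matches.append((t, mz - t, intensity))
        else:
            missing.append(t)
    # Observed peaks explained by no theoretical ion are treated as noise.
    noise = [peak for idx, peak in enumerate(observed) if idx not in used]
    return matches, missing, noise


theoretical = [147.0, 260.0, 389.0]
observed = [(147.1, 30.0), (200.0, 5.0), (389.3, 80.0)]
matches, missing, noise = classify_peaks(theoretical, observed)
print(matches, missing, noise)
```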
Model Structure
Four possible assignments corresponding to four hidden states
Model Computation
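Goal: find the highest-scoring peptide in the database:
\[ \max_{p \in D} \Pr(s \mid p, \theta), \qquad \theta = (a, \mu, \sigma, \lambda_b, \lambda_y) \]
Let a path in the HMM, $\pi = \pi_1 \pi_2 \ldots \pi_n$, represent a configuration of states. The probability of the path is
\[ \Pr(\pi, s \mid p, \theta) = \Pr(\pi, M \mid \theta) = \prod_{i=0}^{n-1} \left( a_{\pi_i, \pi_{i+1}} \, e_{\pi_{i+1}} \right) \times [\text{noise factor over the \# noise peaks}] \]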
Model Computation…
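Considering all possible paths:
\[ \Pr(s \mid p, \theta) = \sum_{\pi} \Pr(\pi, M \mid \theta). \]

Forward algorithm: the probability of all possible paths from the first position to state v at position i is
\[ f_v^{(i)} = \sum_{\pi_1 \ldots \pi_{i-1}} \Pr(\pi_1, \ldots, \pi_{i-1}, \pi_i = v, M \mid \theta). \]
It is computed recursively:
\[ f_v^{(i)} = e_v^{(i)} \sum_u \left( f_u^{(i-1)} \, a_{u,v}^{(i-1)} \right), \qquad \Pr(M \mid \theta) = \sum_v f_v^{(n)}. \]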
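A minimal sketch of the forward recursion, assuming a generic HMM whose emission probabilities have already been evaluated at each position. The list layout, the explicit initial distribution `init`, the position-independent transition matrix and the toy numbers are assumptions; the four PepHMM states and their transition structure are not reproduced here.

```python
def forward(init, trans, emit):
    """Sum the probability of all state paths through an HMM.

    init[v]     : probability of starting in state v
    trans[u][v] : transition probability a_{u,v}
    emit[i][v]  : emission probability e_v^{(i)} of the data at position i
    Returns Pr(M | theta) = sum over v of f_v^{(n)}.
    """
    n_states = len(init)
    # f holds f_v^{(i)} for the current position i.
    f = [init[v] * emit[0][v] for v in range(n_states)]
    for i in range(1, len(emit)):
        f = [emit[i][v] * sum(f[u] * trans[u][v] for u in range(n_states))
             for v in range(n_states)]
    return sum(f)


# Toy two-state example with three positions.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.5, 0.1], [0.4, 0.3], [0.2, 0.9]]
print(forward(init, trans, emit))
```

The recursion costs on the order of n times the squared number of states, instead of enumerating every path explicitly.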
Emission Probabilities
Probability of observing (Tb,Ib) and (Ty, Iy) for the
state 1 at position i
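\[ e_1^{(i)} = \Pr((T_b, I_b), (T_y, I_y) \mid \theta) = \Pr((T_b, I_b) \mid \theta) \, \Pr((T_y, I_y) \mid \theta) = \Pr(T_b \mid \theta) \Pr(I_b \mid \theta) \Pr(T_y \mid \theta) \Pr(I_y \mid \theta), \]
$\Pr(T_b \mid \theta)$, $\Pr(T_y \mid \theta)$: normal distribution
$\Pr(I_b \mid \theta)$, $\Pr(I_y \mid \theta)$: exponential distribution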
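The factorisation above can be sketched numerically: each match tolerance is scored with a normal density, each intensity with an exponential density, and the four factors are multiplied. All parameter values are made up for illustration, and sharing one (mu, sigma) between b and y ions while using separate exponential rates is an assumption.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def emission_both_matched(t_b, i_b, t_y, i_y, mu, sigma, rate_b, rate_y):
    """Emission probability when both the b and the y ion are matched."""
    return (normal_pdf(t_b, mu, sigma) * exponential_pdf(i_b, rate_b)
            * normal_pdf(t_y, mu, sigma) * exponential_pdf(i_y, rate_y))


# Hypothetical observation: offsets in Da, intensities on a relative scale.
e1 = emission_both_matched(t_b=0.1, i_b=0.4, t_y=-0.05, i_y=0.7,
                           mu=0.0, sigma=0.2, rate_b=2.0, rate_y=1.5)
print(e1)
```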
MOWSE Scoring System
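The MOWSE algorithm is implemented in the MASCOT software
\[ \text{Score} = \frac{50{,}000}{M_{\text{Prot}} \times \prod_n m_{i,j}}, \]
where
\[ m_{i,j} = \frac{f_{i,j}}{\left| f_{i,j} \right|_{\max \text{ in column } j}} \]
and $m_{i,j}$ are the elements of the MOWSE frequency matrix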
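The formula can be sketched as follows, assuming a toy frequency matrix whose rows are peptide-mass bins and whose columns are protein-mass bins. The matrix values, the bin indices and the function names are hypothetical, and the binning actually used by MOWSE or MASCOT is not reproduced.

```python
def normalise_columns(freq):
    """Compute m[i][j] = f[i][j] / (maximum of f in column j)."""
    n_cols = len(freq[0])
    col_max = [max(row[j] for row in freq) for j in range(n_cols)]
    return [[row[j] / col_max[j] for j in range(n_cols)] for row in freq]


def mowse_score(m, matched_rows, protein_col, protein_mass):
    """Score = 50,000 / (M_prot * product of m[i][j] over the matches)."""
    product = 1.0
    for i in matched_rows:
        product *= m[i][protein_col]
    return 50000.0 / (protein_mass * product)


# Toy frequency matrix: 2 peptide-mass bins x 2 protein-mass bins.
freq = [[2, 4],
        [4, 1]]
m = normalise_columns(freq)
print(mowse_score(m, matched_rows=[0, 1], protein_col=0, protein_mass=25000.0))
```

Rare peptide masses have small matrix elements, so matching them shrinks the product and raises the score.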
Data Sets
ISB data set:
1. A,B mixtures of 18 different proteins with modifications/relative
amounts
2. Analysed using SEQUEST and other in-house Software
3. Data set is curated
4. Final data set with charge 2+ for trypsin digestion contains 857
spectra
5. 5-fold cross validation by random selection
- Training set: 687 spectra
- Testing set: 170 spectra
6. EM algorithm is used for estimating parameters
Results: Distributions of Ions
[Figure: estimated distributions for b and y ions, match tolerance and noise, with the parameter estimates]
Comparative Studies
Data set selection was repeated 10 times, each time drawing both a
training and a test data set
For each group, the parameter estimates take similar values
A prediction is considered correct if the correct peptide has the highest score
Independent Data Set
A.Y.'s lab: a second, independent data set for comparison with
other tools such as SEQUEST and MASCOT
Size of data set = 20,980 spectra
False/True Positive Rates
Summary

Developed a probabilistic scoring function, PepHMM, to improve
protein identification

PepHMM outperforms other tools such as MASCOT, with a low false
positive rate (always?)

Can it handle ion types other than b and y ions?

Needs to handle post-translational modifications