Protein and gene model inference based on statistical modeling

+
Protein and gene model inference based on
statistical modeling in k-partite graphs
Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann
+
Problem Description

Given peptides and scores/probabilities, infer the set of
proteins present in the sample.
[Figure: peptides PERFGKLMQK, MLLTDFSSAWCR, TGYIPPPLJMGKR and FFRDESQINNR linked to Proteins A, B and C]
+
Previous Approaches

N-peptides rule

ProteinProphet (Nesvizhskii et al. 2003. Anal Chem)

Nested mixture model (Li et al. 2010. Ann Appl Statist)
    Rescores peptides while doing the protein inference.
    Does not allow shared peptides.
    Assumes peptide scores are independent.

Hierarchical statistical model (Shen et al. 2008. Bioinformatics)
    Assumes peptide scores are correct.
    Allows for shared peptides.
    Assumes PSM scores for the same peptide are independent.
    Impractical on normal datasets.

MSBayesPro (Li et al. 2009. J Comput Biol)
    Uses peptide detectabilities to determine peptide priors.
+
Markovian Inference of Proteins
and Gene Models (MIPGEM)

Inclusion of shared/degenerate peptides in the model.

Treats peptide scores/probabilities as random values

Model allows dependence of peptide scores.

Inference of gene models
+
Why scores as random values?
[Figure: peptides PERFGKLMQK, MLLTDFSSAWCR, TGYIPPPLJMGKR and FFRDESQINNR linked to Proteins A, B and C]
+
Building the bipartite graph
+
Shared peptides
+
Definitions

Let $p_i$ be the score/probability of peptide $i$. $I$ is the set of all peptides.

Let $Z_j$ be the indicator variable for protein $j$. $J$ is the set of all proteins.

Target quantity:
$$P[Z_j = 1 \mid \{p_i; i \in I\}]$$
+
Simple Probability Rules
$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
$$P(A) = \sum_{b} P(A, B = b)$$
$$\sum_{a} P(A = a \mid B) = 1$$
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$$
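These rules can be checked numerically on a small table. The sketch below uses a made-up joint distribution over two binary variables; the numbers and function names are illustrative assumptions only.

```python
# Toy joint distribution P(A=a, B=b) over binary A and B (made-up values).
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def marginal_a(a):
    """P(A=a) = sum over b of P(A=a, B=b)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    """P(B=b) = sum over a of P(A=a, B=b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_a_given_b(a, b):
    """P(A=a | B=b) = P(A=a, B=b) / P(B=b)."""
    return joint[(a, b)] / marginal_b(b)

def cond_b_given_a(b, a):
    """P(B=b | A=a) = P(A=a, B=b) / P(A=a)."""
    return joint[(a, b)] / marginal_a(a)

# Product rule: P(A|B)P(B) and P(B|A)P(A) should both equal P(A,B).
product_rule_gap = abs(
    cond_a_given_b(1, 1) * marginal_b(1) - cond_b_given_a(1, 1) * marginal_a(1)
)

# Conditional probabilities sum to one over the values of A.
cond_total = cond_a_given_b(0, 1) + cond_a_given_b(1, 1)

# Bayes rule: P(A|B) = P(B|A)P(A) / P(B).
bayes_rhs = cond_b_given_a(1, 1) * marginal_a(1) / marginal_b(1)
```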
+
Bayes Rule
$$P[Z_j = 1 \mid \{p_i; i \in I\}] = \frac{P[Z_j = 1, \{p_i; i \in I\}]}{P[\{p_i; i \in I\}]} = \frac{P[\{p_i; i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i; i \in I\}]}$$

$P[\{p_i; i \in I\} \mid Z_j = 1]$: probability of observing these peptide scores given that the protein is present.
$P[Z_j = 1]$: prior probability on the protein being present.
$P[\{p_i; i \in I\}]$: joint probability of seeing these peptide scores.
+
Assumptions

Prior probabilities of proteins are independent
$$P[\{Z_j; j \in J\}] = \prod_{j \in J} P(Z_j)$$

Dependencies can be included with a little more effort.

This does not mean that proteins are independent.

+
Assumptions

Connected components are independent
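Because components are treated independently, a natural first step is to split the peptide-protein graph into connected components. The following is a minimal sketch using a depth-first search; the peptide and protein names and the `peptide_to_proteins` mapping are hypothetical, and this is not taken from the authors' implementation.

```python
from collections import defaultdict

# Hypothetical mapping: peptide -> proteins it could originate from.
peptide_to_proteins = {
    "pep1": ["A"],
    "pep2": ["A", "B"],  # shared peptide joins A and B into one component
    "pep3": ["B"],
    "pep4": ["C"],
}

def connected_components(peptide_to_proteins):
    """Return a list of (peptide_set, protein_set), one per connected component."""
    # Undirected adjacency over nodes tagged by kind, so names cannot clash.
    adj = defaultdict(set)
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            adj[("peptide", pep)].add(("protein", prot))
            adj[("protein", prot)].add(("peptide", pep))

    seen = set()
    components = []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        peptides, proteins = set(), set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            kind, name = node
            (peptides if kind == "peptide" else proteins).add(name)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append((peptides, proteins))
    return components

components = connected_components(peptide_to_proteins)
```

Here the shared peptide `pep2` places proteins A and B in the same component, while protein C forms its own.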
+
Assumptions

Peptide scores are independent given their neighboring
proteins.

$Ne(i)$ is the set of proteins connected to peptide $i$ in the graph.

$I_r$ is the set of peptides belonging to the $r$th connected component.

$R(I_r)$ is the set of proteins connected to peptides in $I_r$.

$$P[\{p_i; i \in I_r\} \mid \{Z_j; j \in R(I_r)\}] = \prod_{i \in I_r} P[p_i \mid \{Z_j; j \in Ne(i)\}]$$
+
Assumptions

Conditional peptide probabilities are modeled by a mixture
model.

The specific mixture model they use is based on the peptide
scores used (from PeptideProphet).
+
Joint peptide score distribution

Assumption: peptides in different components are
independent
$$P(\{p_i; i \in I\}) = \prod_{r=1}^{R} P(\{p_i; i \in I_r\})$$

$$P(\{p_i; i \in I_r\}) = \sum_{\substack{Z_j \in \{0,1\} \\ j \in R(I_r)}} P(\{p_i; i \in I_r\} \mid \{Z_j; j \in R(I_r)\})\; P(\{Z_j; j \in R(I_r)\})$$

$I_r$ is the set of peptides in component $r$.
$R(I_r)$ is the set of proteins connected to peptides in $I_r$.
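The sum over protein configurations can be sketched by brute-force enumeration within one component. This is an illustration under stated assumptions: `toy_density` is a made-up stand-in for the conditional score density, and all names and numbers are hypothetical.

```python
from itertools import product

def toy_density(score, any_neighbor_present):
    """Made-up conditional density on [0.8, 1.0]: uniform if no neighboring
    protein is present, an increasing triangle density otherwise."""
    return 50.0 * (score - 0.8) if any_neighbor_present else 5.0

def component_marginal(peptides, proteins, neighbors, prior, cond_density):
    """P({p_i; i in I_r}) for one component, summing over all 0/1 protein states.

    peptides:  dict peptide -> score p_i
    proteins:  list of proteins in R(I_r)
    neighbors: dict peptide -> list of proteins Ne(i)
    prior:     dict protein -> P(Z_j = 1)
    """
    total = 0.0
    for states in product([0, 1], repeat=len(proteins)):
        z = dict(zip(proteins, states))
        weight = 1.0
        # Independent protein priors.
        for j in proteins:
            weight *= prior[j] if z[j] == 1 else 1.0 - prior[j]
        # Scores are independent given the states of their neighboring proteins.
        for i, score in peptides.items():
            present = any(z[j] == 1 for j in neighbors[i])
            weight *= cond_density(score, present)
        total += weight
    return total

marginal = component_marginal(
    peptides={"pep1": 0.95},
    proteins=["A"],
    neighbors={"pep1": ["A"]},
    prior={"A": 0.5},
    cond_density=toy_density,
)
```

The enumeration costs 2^|R(I_r)| terms per component, which is one reason to work component by component rather than on the whole graph.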
+
Conditional Probability

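Mixture model

$$P(p_i \mid \{Z_j; j \in Ne(i)\}) = \begin{cases} \dfrac{1}{u - l} & \text{if } \sum_{j \in Ne(i)} Z_j = 0 \\[1ex] f_1(p_i) & \text{if } \sum_{j \in Ne(i)} Z_j > 0 \end{cases}$$

$$l = \min_i(p_i), \qquad m = \operatorname{median}_i(p_i), \qquad u = \max_i(p_i)$$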
+
Conditional Probability

Mixture model

$$f_1(x) = \begin{cases} b_1 (x - l) & l \le x \le m \\ (b_1 + b_2)(x - m) + b_1 (m - l) & m < x \le u \end{cases}$$

$$l = \min_i(p_i), \qquad m = \operatorname{median}_i(p_i), \qquad u = \max_i(p_i)$$

$$\int_l^u f_1(x)\,dx = 1$$
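A short sketch of this piecewise-linear density, assuming the second segment has slope b1 + b2 and that b2 is fixed by the normalization constraint. The helper names and the parameter values (l, m, u, b1) are made up for illustration.

```python
def make_f1(l, m, u, b1):
    """Return (f1, b2), with b2 chosen so that f1 integrates to 1 over [l, u].

    The integral of f1 is b1*(u-l)^2/2 + b2*(u-m)^2/2; setting it to 1 gives b2.
    """
    b2 = (2.0 - b1 * (u - l) ** 2) / (u - m) ** 2

    def f1(x):
        if x < l or x > u:
            return 0.0
        if x <= m:
            return b1 * (x - l)
        return (b1 + b2) * (x - m) + b1 * (m - l)

    return f1, b2

def trapezoid(f, a, b, n=20000):
    """Composite trapezoid rule on n equal subintervals of [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

# Made-up parameters: scores between 0.8 and 1.0 with median 0.95.
l, m, u, b1 = 0.8, 0.95, 1.0, 10.0
f1, b2 = make_f1(l, m, u, b1)
area = trapezoid(f1, l, u)
```

The numerical integral serves as a sanity check that the chosen b2 makes f1 a proper density.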
+
f1(x): pdf of P(pi | {Zj})
[Figure: plot of f(x), with axis ticks from 0 to 0.4, against peptide score, with axis ticks from 0.8 to 0.99; the median is marked]

+
Choosing b1 and b2

Seek to maximize the log likelihood of observing the peptide scores.

$$\ell = \log P(\{p_i; i \in I\}) = \log \prod_{r=1}^{R} P(\{p_i; i \in I_r\}) = \sum_{r=1}^{R} \log P(\{p_i; i \in I_r\})$$

$$\ell = \sum_{r=1}^{R} \log \left( \sum_{\substack{z_j \in \{0,1\} \\ j \in R(I_r)}} \left[ \prod_{i \in I_r} P(p_i \mid \{Z_j = z_j; j \in Ne(i)\}) \right] \prod_{j \in R(I_r)} P(Z_j = z_j) \right)$$
+
Choosing b1 and b2

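It turns out that the normalization constraint fixes b2 once b1 is known, so only b1 has to be optimized:

$$\hat{b}_1 = \arg\min_{b_1} \; -\ell(b_1)$$

$$\hat{b}_2 = \frac{2 - \hat{b}_1 (u - l)^2}{(u - m)^2}$$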
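As an illustration of the one-dimensional optimization, the sketch below minimizes the negative log likelihood over b1 with a plain grid search. The slides do not say how the minimization is carried out, so the grid search is an assumption, as are the toy scores, the shared prior, and the simplification that every component holds exactly one peptide and one protein.

```python
import math

def f1(x, l, m, u, b1):
    """Piecewise-linear density with b2 fixed by normalization."""
    b2 = (2.0 - b1 * (u - l) ** 2) / (u - m) ** 2
    if x <= m:
        return b1 * (x - l)
    return (b1 + b2) * (x - m) + b1 * (m - l)

def neg_log_likelihood(b1, scores, prior, l, m, u):
    """-l(b1) when each component is a single peptide with a single protein."""
    f0 = 1.0 / (u - l)  # uniform density when the protein is absent
    nll = 0.0
    for p in scores:
        # Sum over the two protein states z = 1 and z = 0.
        nll -= math.log(prior * f1(p, l, m, u, b1) + (1.0 - prior) * f0)
    return nll

# Made-up peptide scores and a made-up common prior P(Z_j = 1).
scores = [0.80, 0.85, 0.90, 0.95, 0.97, 0.99, 1.00]
prior = 0.5
l, u = min(scores), max(scores)
m = sorted(scores)[len(scores) // 2]  # median of an odd-length list

# f1 stays non-negative on [l, u] only for 0 <= b1 <= 2 / ((u - l) * (m - l)).
b1_max = 2.0 / ((u - l) * (m - l))
candidates = [b1_max * k / 200 for k in range(201)]
best_b1 = min(candidates, key=lambda b: neg_log_likelihood(b, scores, prior, l, m, u))
best_b2 = (2.0 - best_b1 * (u - l) ** 2) / (u - m) ** 2
```

A real dataset would replace the single-peptide likelihood with the full sum over protein configurations in each component.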
+
Conditional Protein Probabilities
$$P[Z_j = 1 \mid \{p_i; i \in I\}] = \frac{P[Z_j = 1, \{p_i; i \in I\}]}{P[\{p_i; i \in I\}]} = \frac{P[\{p_i; i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i; i \in I\}]}$$
+
Conditional Protein Probabilities
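$$P(Z_j = 1 \mid \{p_i; i \in I\}) = \frac{P[\{p_i; i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i; i \in I\}]}$$

$$= \frac{\sum_{\substack{z_k \in \{0,1\} \\ k \in R(I_{d(j)})}} P[\{p_i; i \in I\} \mid Z_j = 1, Z_k = z_k]\; P[Z_j = 1, Z_k = z_k]}{P[\{p_i; i \in I\}]} = \frac{A(1)}{P[\{p_i; i \in I\}]}$$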
+
Conditional Protein Probabilities (NEC Correction)
$$P(Z_j = 1 \mid \{p_i; i \in I\}) = \frac{P[\{p_i; i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i; i \in I\}]}$$

$$= \frac{\sum_{z_k \in \{0,1\}} P[\{p_i; i \in I\} \mid Z_j = 1, \{Z_k; k \ne j\}]\; P[Z_j = 1, \{Z_k; k \ne j\}]}{P[\{p_i; i \in I\}]} = \frac{A(1)}{P[\{p_i; i \in I\}]}$$
+
Conditional Protein Probabilities
$$P(Z_j = 1 \mid \{p_i; i \in I\}) = \frac{A(1)}{P[\{p_i; i \in I\}]}$$

$$P(Z_j = 0 \mid \{p_i; i \in I\}) = \frac{A(0)}{P[\{p_i; i \in I\}]}$$
+
Conditional Protein Probabilities
$$P(Z_j = 1 \mid \{p_i; i \in I\}) + P(Z_j = 0 \mid \{p_i; i \in I\}) = 1$$

$$\frac{A(1)}{P(\{p_i; i \in I\})} + \frac{A(0)}{P(\{p_i; i \in I\})} = \frac{A(1) + A(0)}{P(\{p_i; i \in I\})} = 1$$

$$A(1) + A(0) = P(\{p_i; i \in I\})$$
+
Conditional Protein Probabilities
$$P(Z_j = 1 \mid \{p_i; i \in I\}) = \frac{A(1)}{P[\{p_i; i \in I\}]} = \frac{A(1)}{A(0) + A(1)}$$

$$P(Z_j = 0 \mid \{p_i; i \in I\}) = \frac{A(0)}{P[\{p_i; i \in I\}]} = \frac{A(0)}{A(0) + A(1)}$$
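The ratio A(1) / (A(0) + A(1)) can be computed by accumulating the configuration weights separately for Z_j = 1 and Z_j = 0. The following is a rough sketch for a single component; `toy_density` and all inputs are made-up assumptions rather than the model's fitted densities.

```python
from itertools import product

def toy_density(score, any_neighbor_present):
    """Made-up conditional density on [0.8, 1.0]."""
    return 50.0 * (score - 0.8) if any_neighbor_present else 5.0

def protein_posterior(target, peptides, proteins, neighbors, prior, cond_density):
    """P(Z_target = 1 | scores) = A(1) / (A(0) + A(1)) within one component."""
    a = {0: 0.0, 1: 0.0}  # a[z] accumulates A(z)
    for states in product([0, 1], repeat=len(proteins)):
        z = dict(zip(proteins, states))
        weight = 1.0
        for j in proteins:
            weight *= prior[j] if z[j] == 1 else 1.0 - prior[j]
        for i, score in peptides.items():
            present = any(z[j] == 1 for j in neighbors[i])
            weight *= cond_density(score, present)
        a[z[target]] += weight
    return a[1] / (a[0] + a[1])

posterior = protein_posterior(
    target="A",
    peptides={"pep1": 0.95},
    proteins=["A"],
    neighbors={"pep1": ["A"]},
    prior={"A": 0.5},
    cond_density=toy_density,
)
```

Note that the joint probability of the scores never has to be computed on its own, since it equals A(0) + A(1).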
+
Shared Peptides
$$A_{\text{unshared}}(1) = \sum_{\substack{z_k \in \{0,1\} \\ k \in R(I_A)}} P[\{p_i; i \in I_A\} \mid Z_1 = 1, Z_k = z_k]\; P[Z_1 = 1, Z_k = z_k]$$

$$A_{\text{unshared}}(1) = P[\{p_i; i \in I_A\} \mid Z_1 = 1]\; P[Z_1 = 1]$$
+
Shared Peptides
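$$A_{\text{shared}}(1) = \sum_{\substack{z_k \in \{0,1\} \\ k \in R(I_B)}} P[\{p_i; i \in I_B\} \mid Z_1 = 1, Z_k = z_k]\; P[Z_1 = 1, Z_k = z_k]$$

$$A_{\text{shared}}(1) = P[\{p_i; i \in I_B\} \mid Z_1 = 1, Z_2 = 1]\; P[Z_1 = 1]\; P[Z_2 = 1] + P[\{p_i; i \in I_B\} \mid Z_1 = 1, Z_2 = 0]\; P[Z_1 = 1]\; P[Z_2 = 0]$$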
+
Shared Peptides

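If the shared peptide has $p_i \ge$ median:
$$P_{\text{unshared}}[Z_1 = 1 \mid \{p_i; i \in I\}] \ge P_{\text{shared}}[Z_1 = 1 \mid \{p_i; i \in I\}]$$
$$P_{\text{unshared}}[Z_1 = 0 \mid \{p_i; i \in I\}] \le P_{\text{shared}}[Z_1 = 0 \mid \{p_i; i \in I\}]$$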
+
Shared Peptides

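If the shared peptide has $p_i <$ median:
$$P_{\text{unshared}}[Z_1 = 1 \mid \{p_i; i \in I\}] \le P_{\text{shared}}[Z_1 = 1 \mid \{p_i; i \in I\}]$$
$$P_{\text{unshared}}[Z_1 = 0 \mid \{p_i; i \in I\}] \ge P_{\text{shared}}[Z_1 = 0 \mid \{p_i; i \in I\}]$$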
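The effect of sharing can be illustrated with a toy case in which protein 1 has a single peptide that is either unique to it or shared with protein 2. This is a simplified sketch: the density values and priors are made up, and in this toy setup the direction of the comparison depends on whether f1(p_i) lies above or below the uniform density 1/(u - l), which stands in for the median threshold on the slides.

```python
def posterior_unshared(f1_p, f0, prior1):
    """P(Z_1 = 1 | p) when the peptide maps to protein 1 only."""
    a1 = f1_p * prior1            # A(1): protein 1 present
    a0 = f0 * (1.0 - prior1)      # A(0): protein 1 absent, uniform density
    return a1 / (a0 + a1)

def posterior_shared(f1_p, f0, prior1, prior2):
    """P(Z_1 = 1 | p) when the peptide maps to proteins 1 and 2."""
    # Protein 1 present: density is f1 whatever the state of protein 2.
    a1 = f1_p * prior1 * prior2 + f1_p * prior1 * (1.0 - prior2)
    # Protein 1 absent: f1 if protein 2 is present, uniform otherwise.
    a0 = f1_p * (1.0 - prior1) * prior2 + f0 * (1.0 - prior1) * (1.0 - prior2)
    return a1 / (a0 + a1)

f0 = 5.0                # uniform density 1/(u - l) for l = 0.8, u = 1.0
high, low = 20.0, 1.0   # made-up f1 values for a high- and a low-scoring peptide

high_unshared = posterior_unshared(high, f0, 0.5)
high_shared = posterior_shared(high, f0, 0.5, 0.5)
low_unshared = posterior_unshared(low, f0, 0.5)
low_shared = posterior_shared(low, f0, 0.5, 0.5)
```

In this example a high-scoring shared peptide supports protein 1 less than an unshared one would, and a low-scoring shared peptide penalizes it less.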
+
Gene Model Inference
+
Gene Model Inference

Assume a gene model, X, has only protein sequences which
belong to the same connected component.
[Figure: graph linking Peptides 1-4 to Proteins A and B, and both proteins to Gene X, within a single connected component]
+
Gene Model Inference

Assume a gene model, X, has only protein sequences which
belong to the same connected component.


$$P[X = 1 \mid \{p_i; i \in I\}] = 1 - P\Big[\bigcap_{j \in R(X)} \{Z_j = 0\} \;\Big|\; \{p_i; i \in I_{r(X)}\}\Big]$$

$R(X)$ is the set of proteins with edges to $X$.
$I_{r(X)}$ is the set of peptides with edges to proteins with edges to $X$.
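For a gene model whose proteins sit in one component, the probability that all of its proteins are absent can be sketched by the same enumeration used for protein posteriors. The function below is an illustrative assumption, not the authors' code; `toy_density`, the priors and the scores are made up.

```python
from itertools import product

def toy_density(score, any_neighbor_present):
    """Made-up conditional density on [0.8, 1.0]."""
    return 50.0 * (score - 0.8) if any_neighbor_present else 5.0

def gene_model_posterior(gene_proteins, peptides, proteins, neighbors, prior, cond_density):
    """P[X = 1 | scores] = 1 - P[all Z_j = 0 for j in R(X) | scores]."""
    total = 0.0       # joint probability of the scores in this component
    all_absent = 0.0  # part of it where every protein of X is absent
    for states in product([0, 1], repeat=len(proteins)):
        z = dict(zip(proteins, states))
        weight = 1.0
        for j in proteins:
            weight *= prior[j] if z[j] == 1 else 1.0 - prior[j]
        for i, score in peptides.items():
            present = any(z[j] == 1 for j in neighbors[i])
            weight *= cond_density(score, present)
        total += weight
        if all(z[j] == 0 for j in gene_proteins):
            all_absent += weight
    return 1.0 - all_absent / total

# Gene X encodes proteins A and B; pep2 is shared between them.
gene_posterior = gene_model_posterior(
    gene_proteins=["A", "B"],
    peptides={"pep1": 0.95, "pep2": 0.90},
    proteins=["A", "B"],
    neighbors={"pep1": ["A"], "pep2": ["A", "B"]},
    prior={"A": 0.5, "B": 0.5},
    cond_density=toy_density,
)
```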
+
Gene Model Inference

Gene model, X, has proteins from different connected
components of the peptide-protein graph.
[Figure: graph linking Peptides 1-4 to Proteins A and B, and both proteins to Gene X, with the two proteins in different connected components]
+
Gene Model Inference

Gene model, X, has proteins from different connected components of the peptide-protein graph.

$$P\Big[\bigcap_{j \in R(X)} \{Z_j = 0\} \;\Big|\; \{p_i; i \in I_{r(X)}\}\Big] = \prod_{l=1}^{m} P\Big[\bigcap_{j \in R_l(X)} \{Z_j = 0\} \;\Big|\; \{p_i; i \in I_l(X)\}\Big]$$

$R_l(X)$ is the set of proteins with edges to $X$ in component $l$.
$I_l(X)$ is the set of peptides with edges to proteins with edges to $X$ in component $l$.
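Since components are independent, the per-component absence probabilities simply multiply. A minimal sketch, assuming those per-component probabilities have already been computed; the function name and the numbers are made up.

```python
import math

def gene_model_posterior_multi(absent_probs):
    """P[X = 1 | scores] for a gene model spanning several components.

    absent_probs[l] is P[all Z_j = 0 for j in R_l(X) | scores in component l].
    """
    return 1.0 - math.prod(absent_probs)

# Two components with made-up absence probabilities 0.2 and 0.5.
multi_posterior = gene_model_posterior_multi([0.2, 0.5])
```

With a single component the product has one factor and the expression reduces to the same-component case.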
+
Datasets

Mixture of 18 purified proteins

Mixture of 49 proteins (Sigma49)

Drosophila melanogaster

Saccharomyces cerevisiae (~4200 proteins)

Arabidopsis thaliana (~4580 gene models)
+
Comparisons with other tools

Small datasets with a known answer
Mix of 18 proteins
Sigma49
+
Comparisons with other tools

One-hit wonders
[Figure panels: Sigma49; Sigma49 without one-hit wonders]
+
Comparison with other tools

Arabidopsis thaliana dataset has many proteins with high
sequence similarity.
+
Splice isoforms
+
Conclusion + Criticism

Developed a model for protein and gene model inference.

Comparisons with other tools do not justify complexity:

The value of a low false-positive (FP) rate at the expense of many false negatives (FN) is not shared by all applications.

Discards some useful information, such as the number of spectra per peptide.

Assumptions of parsimony from pruning may be too
aggressive.