+
Protein and gene model inference based on
statistical modeling in k-partite graphs
Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann
+
Problem Description
Given peptides and scores/probabilities, infer the set of
proteins present in the sample.
[Figure: example peptides PERFGKLMQK, MLLTDFSSAWCR, TGYIPPPLJMGKR, and FFRDESQINNR mapped to Proteins A, B, and C]
+
Previous Approaches
N-peptides rule
ProteinProphet (Nesvizhskii et al. 2003. Anal Chem)
Nested mixture model (Li et al. 2010. Ann Appl Statist)
  Rescores peptides while doing the protein inference
  Does not allow shared peptides
  Treats peptide scores as independent
Hierarchical statistical model (Shen et al. 2008. Bioinformatics)
  Assumes peptide scores are correct
  Allows for shared peptides
  Assumes PSM scores for the same peptide are independent
  Impractical on normal datasets
MSBayesPro (Li et al. 2009. J Comput Biol)
  Uses peptide detectabilities to determine peptide priors
+
Markovian Inference of Proteins
and Gene Models (MIPGEM)
Inclusion of shared/degenerate peptides in the model.
Treats peptide scores/probabilities as random values
Model allows dependence of peptide scores.
Inference of gene models
+
Why scores as random values?
[Figure: example peptides PERFGKLMQK, MLLTDFSSAWCR, TGYIPPPLJMGKR, and FFRDESQINNR mapped to Proteins A, B, and C]
+
Building the bipartite graph
+
Shared peptides
+
Definitions
Let $p_i$ be the score/probability of peptide i. I is the set of all peptides.
Let $Z_j$ be the indicator variable for protein j. J is the set of all proteins.
Quantity of interest: $P[Z_j = 1 \mid \{p_i;\ i \in I\}]$
+
Simple Probability Rules
$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
$P(A) = \sum_{b} P(A, B = b)$
$\sum_{a} P(A = a \mid B) = 1$
$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$
+
Bayes Rule
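$P[Z_j = 1 \mid \{p_i;\ i \in I\}] = \frac{P[Z_j = 1,\ \{p_i;\ i \in I\}]}{P[\{p_i;\ i \in I\}]} = \frac{P[\{p_i;\ i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i;\ i \in I\}]}$
$P[\{p_i;\ i \in I\} \mid Z_j = 1]$: probability of observing these peptide scores given that the protein is present
$P[Z_j = 1]$: prior probability of the protein being present
$P[\{p_i;\ i \in I\}]$: joint probability of seeing these peptide scores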
+
Assumptions
Prior probabilities of proteins are independent
$P[\{Z_j;\ j \in J\}] = \prod_{j \in J} P(Z_j)$
Dependencies can be included with a little more effort.
This does not mean that proteins are independent.
+
Assumptions
Connected components are independent
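Because each connected component is handled on its own, a first practical step is to split the peptide-protein graph into components. The following is a minimal sketch, not the authors' implementation; the edge-list input format and the function name `connected_components` are assumptions made for illustration.

```python
from collections import defaultdict

def connected_components(edges):
    """Split a peptide-protein bipartite graph into connected components.

    edges: iterable of (peptide, protein) pairs (hypothetical input format).
    Returns a list of (peptide_set, protein_set) tuples, one per component.
    """
    adjacency = defaultdict(set)
    for peptide, protein in edges:
        # Tag nodes by type so a peptide and a protein name can never collide.
        adjacency[("pep", peptide)].add(("prot", protein))
        adjacency[("prot", protein)].add(("pep", peptide))

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search from an unvisited node collects one component.
        stack, peptides, proteins = [start], set(), set()
        seen.add(start)
        while stack:
            node = stack.pop()
            kind, name = node
            (peptides if kind == "pep" else proteins).add(name)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append((peptides, proteins))
    return components

# Toy graph: pep2 is shared between proteins A and B; C stands alone.
edges = [("pep1", "A"), ("pep2", "A"), ("pep2", "B"), ("pep3", "C")]
components = connected_components(edges)
```

In this toy graph the shared peptide pep2 ties proteins A and B into one component, while protein C forms a component of its own.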
+
Assumptions
Peptide scores are independent given their neighboring
proteins.
Ne(i) is the set of proteins connected to peptide i in the graph.
$I_r$ is the set of peptides belonging to the rth connected component.
$R(I_r)$ is the set of proteins connected to peptides in $I_r$.
$P[\{p_i;\ i \in I_r\} \mid \{Z_j;\ j \in R(I_r)\}] = \prod_{i \in I_r} P[p_i \mid \{Z_j;\ j \in Ne(i)\}]$
+
Assumptions
Conditional peptide probabilities are modeled by a mixture
model.
The specific mixture model they use is based on the peptide
scores used (from PeptideProphet).
+
Joint peptide score distribution
Assumption: peptides in different components are independent
$P(\{p_i;\ i \in I\}) = \prod_{r=1}^{R} P(\{p_i;\ i \in I_r\})$
$P(\{p_i;\ i \in I_r\}) = \sum_{Z_j \in \{0,1\},\ j \in R(I_r)} P(\{p_i;\ i \in I_r\} \mid \{Z_j;\ j \in R(I_r)\})\; P(\{Z_j;\ j \in R(I_r)\})$
$I_r$ is the set of peptides in component r
$R(I_r)$ is the set of proteins connected to peptides in $I_r$
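The sum over protein states can be made concrete by brute-force enumeration. This is a sketch under stated assumptions: the dictionary-based inputs, the function name `component_likelihood`, and the toy density (uniform when no neighbour is present, 2p otherwise, on [0, 1]) are illustrative choices, not the model's fitted densities.

```python
from itertools import product

def component_likelihood(scores, neighbours, priors, density):
    """P({p_i; i in I_r}) by summing over all 2^|R(I_r)| protein states.

    scores:     {peptide: score p_i}
    neighbours: {peptide: list of proteins in Ne(i)}
    priors:     {protein: P(Z_j = 1)}
    density:    density(p, present) -> conditional density of a score,
                where present says whether any neighbouring protein is present
    """
    proteins = sorted(priors)
    total = 0.0
    for states in product((0, 1), repeat=len(proteins)):
        z = dict(zip(proteins, states))
        # Independent priors: P({Z_j}) is the product of the P(Z_j = z_j).
        weight = 1.0
        for j in proteins:
            weight *= priors[j] if z[j] else 1.0 - priors[j]
        # Peptide scores are independent given their neighbouring proteins.
        for i, p in scores.items():
            present = any(z[j] for j in neighbours[i])
            weight *= density(p, present)
        total += weight
    return total

# Hypothetical toy density on [0, 1]: uniform if absent, 2p if present.
toy_density = lambda p, present: 2.0 * p if present else 1.0

single = component_likelihood({"a": 0.9}, {"a": ["A"]}, {"A": 0.5}, toy_density)
shared = component_likelihood(
    {"s": 0.9}, {"s": ["A", "B"]}, {"A": 0.5, "B": 0.5}, toy_density
)
```

Enumeration costs 2^|R(I_r)| terms per component, so it is only practical when components are small.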
+
Conditional Probability
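Mixture model
$P(p_i \mid \{Z_j;\ j \in Ne(i)\}) = \begin{cases} \frac{1}{u - l} & \text{if } \sum_{j \in Ne(i)} Z_j = 0 \\ f_1(p_i) & \text{if } \sum_{j \in Ne(i)} Z_j > 0 \end{cases}$
$l = \min_i(p_i)$
$m = \operatorname{median}_i(p_i)$
$u = \max_i(p_i)$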
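The two-case rule can be sketched in a few lines. The function names and the triangular stand-in for f1 are assumptions for illustration only; the actual shape of f1 is specified separately in the slides.

```python
from statistics import median

def score_summary(scores):
    """Return (l, m, u): minimum, median and maximum of the peptide scores."""
    return min(scores), median(scores), max(scores)

def conditional_density(p, neighbour_states, f1, l, u):
    """P(p_i | {Z_j; j in Ne(i)}) under the two-case mixture model."""
    if sum(neighbour_states) == 0:
        # No neighbouring protein is present: uniform density on [l, u].
        return 1.0 / (u - l)
    # At least one neighbouring protein is present: density f1.
    return f1(p)

scores = [0.80, 0.85, 0.90, 0.95, 0.99]
l, m, u = score_summary(scores)

# Hypothetical stand-in for f1: a triangular density rising from l to u.
toy_f1 = lambda x: 2.0 * (x - l) / (u - l) ** 2

absent = conditional_density(0.95, [0, 0], toy_f1, l, u)
present = conditional_density(0.95, [0, 1], toy_f1, l, u)
```

With these toy values a score of 0.95 is more likely when a neighbouring protein is present than when none is.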
+
Conditional Probability
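Mixture model
$f_1(x) = \begin{cases} b_1 (x - l) & l \le x \le m \\ (b_1 + b_2)(x - m) + b_1 (m - l) & m < x \le u \end{cases}$
$l = \min_i(p_i)$
$m = \operatorname{median}_i(p_i)$
$u = \max_i(p_i)$
$\int_l^u f_1(x)\,dx = 1$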
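As a quick check of the piecewise-linear form, the sketch below evaluates f1 and integrates it numerically. The values of l, m, u, b1 and b2 are illustrative assumptions chosen so that the density integrates to 1; they are not fitted parameters.

```python
def f1(x, b1, b2, l, m, u):
    """Piecewise-linear density: slope b1 up to the median, b1 + b2 after it."""
    if l <= x <= m:
        return b1 * (x - l)
    if m < x <= u:
        return (b1 + b2) * (x - m) + b1 * (m - l)
    return 0.0  # outside the observed score range [l, u]

def integrate(func, a, b, steps=1000):
    """Midpoint rule; exact for linear pieces when m falls on a grid point."""
    width = (b - a) / steps
    return sum(func(a + (k + 0.5) * width) for k in range(steps)) * width

# Illustrative values only (not taken from any dataset).
l, m, u = 0.0, 0.5, 1.0
b1, b2 = 1.0, 4.0

area = integrate(lambda x: f1(x, b1, b2, l, m, u), l, u)
```

The two pieces meet at x = m with value b1 (m - l), so the density is continuous and only its slope changes at the median.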
+
$f_1(x)$: pdf of $P(p_i \mid \{Z_j\})$
[Plot of f(x) against x: x-axis ticks from 0.8 to 0.99, y-axis ticks from 0 to 0.4, with the median marked]
+
Choosing b1 and b2
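Seek to maximize the log likelihood of observing the peptide scores.
$\ell = \log P(\{p_i;\ i \in I\}) = \log \prod_{r=1}^{R} P(\{p_i;\ i \in I_r\})$
$\ell = \sum_{r=1}^{R} \log P(\{p_i;\ i \in I_r\})$
$\ell = \sum_{r=1}^{R} \log \left( \sum_{z_j \in \{0,1\},\ j \in R(I_r)} \ \prod_{i \in I_r} P(p_i \mid \{Z_j = z_j;\ j \in Ne(i)\}) \prod_{j \in R(I_r)} P(Z_j = z_j) \right)$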
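The log likelihood can be evaluated directly by looping over components and enumerating protein states. This is a sketch under assumptions: a single shared prior for all proteins, components given as lists of (score, neighbouring proteins) pairs, and illustrative parameter values.

```python
import math
from itertools import product

def f1(x, b1, b2, l, m, u):
    """Piecewise-linear density used when a neighbouring protein is present."""
    if x <= m:
        return b1 * (x - l)
    return (b1 + b2) * (x - m) + b1 * (m - l)

def log_likelihood(components, prior, b1, b2, l, m, u):
    """Sum over components of log P({p_i; i in I_r}).

    components: list of components, each a list of (score, [proteins in Ne(i)])
    prior:      P(Z_j = 1), assumed identical for every protein
    """
    total = 0.0
    for peptides in components:
        proteins = sorted({j for _, ne in peptides for j in ne})
        marginal = 0.0
        for states in product((0, 1), repeat=len(proteins)):
            z = dict(zip(proteins, states))
            term = 1.0
            for zj in states:
                term *= prior if zj else 1.0 - prior
            for p, ne in peptides:
                if any(z[j] for j in ne):
                    term *= f1(p, b1, b2, l, m, u)
                else:
                    term *= 1.0 / (u - l)
            marginal += term
        total += math.log(marginal)
    return total

# Two toy components, each with one peptide and one protein.
components = [[(0.9, ["A"])], [(0.2, ["B"])]]
ll = log_likelihood(components, 0.5, 1.0, 4.0, 0.0, 0.5, 1.0)
```

Here the two component marginals are 1.75 and 0.6, so the log likelihood is log(1.75) + log(0.6).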
+
Choosing b1 and b2
It turns out:
$\hat{b}_1 = \arg\min_{b_1} \left( -\ell(b_1) \right)$
$\hat{b}_2 = \frac{2 - \hat{b}_1 (u - l)^2}{(u - m)^2}$
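Since b2 follows from b1 through the normalization of f1, fitting reduces to a one-dimensional search. The sketch below uses a plain grid search as a stand-in for whatever optimizer the authors used, a toy likelihood in which every peptide forms its own component with one protein, and an upper bound on b1 that is an added assumption to keep f1 non-negative.

```python
import math
from statistics import median

def solve_b2(b1, l, m, u):
    """b2 implied by b1 through the constraint that f1 integrates to 1."""
    return (2.0 - b1 * (u - l) ** 2) / (u - m) ** 2

def f1(x, b1, b2, l, m, u):
    if x <= m:
        return b1 * (x - l)
    return (b1 + b2) * (x - m) + b1 * (m - l)

def fit_slopes(scores, prior=0.5, steps=400):
    """Grid search for b1 minimising the negative log likelihood."""
    l, m, u = min(scores), median(scores), max(scores)
    # Assumption: require f1(u) >= 0, which bounds b1 from above.
    b1_max = 2.0 / ((u - l) * (m - l))
    best = None
    for k in range(1, steps):
        b1 = b1_max * k / steps
        b2 = solve_b2(b1, l, m, u)
        # Toy likelihood: each peptide is its own component with one protein.
        nll = -sum(
            math.log((1.0 - prior) / (u - l) + prior * f1(p, b1, b2, l, m, u))
            for p in scores
        )
        if best is None or nll < best[0]:
            best = (nll, b1, b2)
    return best[1], best[2], (l, m, u)

scores = [0.80, 0.84, 0.90, 0.93, 0.97, 0.99]  # made-up scores
b1_hat, b2_hat, (l, m, u) = fit_slopes(scores)
```

A finer grid or a bounded scalar optimizer could replace the loop; the grid keeps the sketch free of external dependencies.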
+
Conditional Protein Probabilities
$P[Z_j = 1 \mid \{p_i;\ i \in I\}] = \frac{P[Z_j = 1,\ \{p_i;\ i \in I\}]}{P[\{p_i;\ i \in I\}]} = \frac{P[\{p_i;\ i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i;\ i \in I\}]}$
+
Conditional Protein Probabilities
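$P(Z_j = 1 \mid \{p_i;\ i \in I\}) = \frac{P[\{p_i;\ i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i;\ i \in I\}]}$
$= \frac{\sum_{k \in R(I_{d(j)})} P[\{p_i;\ i \in I\} \mid Z_j = 1, Z_k = z]\; P[Z_j = 1, Z_k = z]}{P[\{p_i;\ i \in I\}]}$
$= \frac{A(1)}{P[\{p_i;\ i \in I\}]}$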
+
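Conditional Protein Probabilities (NEC Correction)
$P(Z_j = 1 \mid \{p_i;\ i \in I\}) = \frac{P[\{p_i;\ i \in I\} \mid Z_j = 1]\; P[Z_j = 1]}{P[\{p_i;\ i \in I\}]}$
$= \frac{\sum_{z_k \in \{0,1\}} P[\{p_i;\ i \in I\} \mid Z_j = 1, \{Z_k;\ k \ne j\}]\; P[Z_j = 1, \{Z_k;\ k \ne j\}]}{P[\{p_i;\ i \in I\}]}$
$= \frac{A(1)}{P[\{p_i;\ i \in I\}]}$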
+
Conditional Protein Probabilities
$P(Z_j = 1 \mid \{p_i;\ i \in I\}) = \frac{A(1)}{P[\{p_i;\ i \in I\}]}$
$P(Z_j = 0 \mid \{p_i;\ i \in I\}) = \frac{A(0)}{P[\{p_i;\ i \in I\}]}$
+
Conditional Protein Probabilities
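$P(Z_j = 1 \mid \{p_i;\ i \in I\}) + P(Z_j = 0 \mid \{p_i;\ i \in I\}) = 1$
$\frac{A(0)}{P(\{p_i;\ i \in I\})} + \frac{A(1)}{P(\{p_i;\ i \in I\})} = \frac{A(1) + A(0)}{P(\{p_i;\ i \in I\})} = 1$
$A(1) + A(0) = P(\{p_i;\ i \in I\})$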
+
Conditional Protein Probabilities
$P(Z_j = 1 \mid \{p_i;\ i \in I\}) = \frac{A(1)}{P[\{p_i;\ i \in I\}]} = \frac{A(1)}{A(0) + A(1)}$
$P(Z_j = 0 \mid \{p_i;\ i \in I\}) = \frac{A(0)}{P[\{p_i;\ i \in I\}]} = \frac{A(0)}{A(0) + A(1)}$
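The ratio A(1) / (A(0) + A(1)) can be computed without ever forming the normalizing constant separately. The sketch below accumulates A(0) and A(1) by enumeration within one component; the input format and the toy density (uniform if absent, 2p if present, on [0, 1]) are assumptions for illustration.

```python
from itertools import product

def protein_posterior(target, scores, neighbours, priors, density):
    """P(Z_target = 1 | scores) = A(1) / (A(0) + A(1)) within one component."""
    proteins = sorted(priors)
    a = {0: 0.0, 1: 0.0}
    for states in product((0, 1), repeat=len(proteins)):
        z = dict(zip(proteins, states))
        weight = 1.0
        for j in proteins:
            weight *= priors[j] if z[j] else 1.0 - priors[j]
        for i, p in scores.items():
            weight *= density(p, any(z[j] for j in neighbours[i]))
        # Add this configuration to A(1) or A(0) depending on the target's state.
        a[z[target]] += weight
    return a[1] / (a[0] + a[1])

# Hypothetical toy density on [0, 1]: uniform if absent, 2p if present.
toy_density = lambda p, present: 2.0 * p if present else 1.0

# Protein A has a unique peptide "a" and shares peptide "s" with protein B.
scores = {"a": 0.9, "s": 0.9}
neighbours = {"a": ["A"], "s": ["A", "B"]}
priors = {"A": 0.5, "B": 0.5}

post_a = protein_posterior("A", scores, neighbours, priors, toy_density)
post_b = protein_posterior("B", scores, neighbours, priors, toy_density)
```

In this toy case protein A, which has its own high-scoring peptide, ends up with a higher posterior than protein B, which is supported only by the shared peptide.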
+
Shared Peptides
$A_{\text{unshared}}(1) = \sum_{k \in R(I_A)} P[\{p_i;\ i \in I_A\} \mid Z_1 = 1, Z_k = z]\; P[Z_1 = 1, Z_k = z]$
$A_{\text{unshared}}(1) = P[\{p_i;\ i \in I_A\} \mid Z_1 = 1]\; P[Z_1 = 1]$
+
Shared Peptides
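$A_{\text{shared}}(1) = \sum_{k \in R(I_B)} P[\{p_i;\ i \in I_B\} \mid Z_1 = 1, Z_k = z]\; P[Z_1 = 1, Z_k = z]$
$A_{\text{shared}}(1) = P[\{p_i;\ i \in I_B\} \mid Z_1 = 1, Z_2 = 1]\; P[Z_1 = 1]\; P[Z_2 = 1] + P[\{p_i;\ i \in I_B\} \mid Z_1 = 1, Z_2 = 0]\; P[Z_1 = 1]\; P[Z_2 = 0]$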
+
Shared Peptides
If the shared peptide has pi ≥ median
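$P_{\text{unshared}}[Z_1 = 1 \mid \{p_i;\ i \in I\}]$ vs. $P_{\text{shared}}[Z_1 = 1 \mid \{p_i;\ i \in I\}]$
$P_{\text{unshared}}[Z_1 = 0 \mid \{p_i;\ i \in I\}]$ vs. $P_{\text{shared}}[Z_1 = 0 \mid \{p_i;\ i \in I\}]$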
+
Shared Peptides
If the shared peptide has pi < median
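$P_{\text{unshared}}[Z_1 = 1 \mid \{p_i;\ i \in I\}]$ vs. $P_{\text{shared}}[Z_1 = 1 \mid \{p_i;\ i \in I\}]$
$P_{\text{unshared}}[Z_1 = 0 \mid \{p_i;\ i \in I\}]$ vs. $P_{\text{shared}}[Z_1 = 0 \mid \{p_i;\ i \in I\}]$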
+
Gene Model Inference
+
Gene Model Inference
Assume a gene model, X, has only protein sequences which
belong to the same connected component.
[Figure: Peptides 1 to 4 connect to Proteins A and B, which both connect to Gene X]
+
Gene Model Inference
Assume a gene model, X, has only protein sequences which
belong to the same connected component.
$P[X = 1 \mid \{p_i;\ i \in I\}] = 1 - P\left[ \bigcap_{j \in R(X)} \{Z_j = 0\} \ \middle|\ \{p_i;\ i \in I_{r(X)}\} \right]$
R(X) is the set of proteins with edges to X.
$I_{r(X)}$ is the set of peptides with edges to proteins with edges to X.
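For a gene model whose proteins sit in a single component, the probability that all of them are absent can be read off the same enumeration used for protein posteriors. This is a sketch with assumed input formats and a toy density (uniform if absent, 2p if present, on [0, 1]), not the authors' code.

```python
from itertools import product

def gene_model_probability(gene_proteins, scores, neighbours, priors, density):
    """P[X = 1 | scores] = 1 - P(all proteins of X absent | scores)."""
    proteins = sorted(priors)
    total = 0.0
    all_absent = 0.0
    for states in product((0, 1), repeat=len(proteins)):
        z = dict(zip(proteins, states))
        weight = 1.0
        for j in proteins:
            weight *= priors[j] if z[j] else 1.0 - priors[j]
        for i, p in scores.items():
            weight *= density(p, any(z[j] for j in neighbours[i]))
        total += weight
        # Keep the configurations in which every protein of the gene is absent.
        if not any(z[j] for j in gene_proteins):
            all_absent += weight
    return 1.0 - all_absent / total

# Hypothetical toy density on [0, 1]: uniform if absent, 2p if present.
toy_density = lambda p, present: 2.0 * p if present else 1.0

scores = {"a": 0.9, "s": 0.9}
neighbours = {"a": ["A"], "s": ["A", "B"]}
priors = {"A": 0.5, "B": 0.5}

gene_ab = gene_model_probability(["A", "B"], scores, neighbours, priors, toy_density)
gene_b = gene_model_probability(["B"], scores, neighbours, priors, toy_density)
```

When the gene model contains a single protein, the result reduces to that protein's posterior probability.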
+
Gene Model Inference
Gene model, X, has proteins from different connected
components of the peptide-protein graph.
[Figure: Peptides 1 to 4 connect to Proteins A and B, which lie in different connected components and both connect to Gene X]
+
Gene Model Inference
Gene model, X, has proteins from different connected
components of the peptide-protein graph.
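$P\left[ \bigcap_{j \in R(X)} \{Z_j = 0\} \ \middle|\ \{p_i;\ i \in I_{r(X)}\} \right] = \prod_{l=1}^{m} P\left[ \bigcap_{j \in R_l(X)} \{Z_j = 0\} \ \middle|\ \{p_i;\ i \in I_l(X)\} \right]$
$R_l(X)$ is the set of proteins with edges to X in component l.
$I_l(X)$ is the set of peptides with edges to proteins with edges to X in component l.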
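Because components are independent, the per-component absence probabilities simply multiply. The sketch below assumes those per-component probabilities have already been computed; the function name and the numbers are illustrative.

```python
import math

def gene_model_probability(absent_by_component):
    """P[X = 1 | scores] when the gene's proteins span several components.

    absent_by_component: for each component l, the probability that every
    protein of the gene in that component is absent (Z_j = 0), given the
    peptide scores of that component.
    """
    # Components are independent, so the joint absence probability factorises.
    all_absent = math.prod(absent_by_component)
    return 1.0 - all_absent

# Made-up absence probabilities for two components.
probability = gene_model_probability([0.2, 0.5])
```

With absence probabilities 0.2 and 0.5, the gene model is present with probability 1 - 0.2 * 0.5 = 0.9.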
+
Datasets
Mixture of 18 purified proteins
Mixture of 49 proteins (Sigma49)
Drosophila melanogaster
Saccharomyces cerevisiae (~4200 proteins)
Arabidopsis thaliana (~4580 gene models)
+
Comparisons with other tools
Small datasets with a known answer
Mix of 18 proteins
Sigma49
+
Comparisons with other tools
One-hit wonders
[Figure panels: Sigma49; Sigma49 without one-hit wonders]
+
Comparison with other tools
Arabidopsis thaliana dataset has many proteins with high
sequence similarity.
+
Splice isoforms
+
Conclusion + Criticism
Developed a model for protein and gene model inference.
Comparisons with other tools do not justify the model's complexity:
A low false-positive rate bought at the expense of many false negatives is not equally valuable in all applications.
Discards some useful information, such as the number of spectra per peptide.
The parsimony assumptions introduced by pruning may be too aggressive.