
Statistical Consideration for
Identification and Quantification
in Top-Down Proteomics
Discovery Omics with
Top Down Proteomics
Richard LeDuc
National Center for Genome Analysis Support
Acknowledgements
The Kelleher Research Group
Drs. Neil Kelleher, Paul Thomas, and
Andy Forbes, and ProSight Development Team
(past and present)
• Leonid Zamdborg
• Shannee Babai
• Bryon Early
• Ian Spauling
• Kevin Glowacz
• Eric Bluhm
• Vinayak Viswanathan
• Yong-Bin Kim
• Ryan Fellers
• Tom Januszyk
• Brian Cis
• Chris Strouse
• Seyoung Sohn
• Greg Taylor
• Joe Sola
• Lee Bynum
• Andrew Birck
National Center for
Genome Analysis Support
Le-Shin Wu,
Carrie Ganote,
Tom Doak,
Bill Barnett,
and a cast of thousands
Washington Univ. School of Medicine
Proteomics Core
Reid Townsend
Petra Gilmore
Cheryl Lichti
James Malone
Alan Davis
All the other numerous members of the KRG who have
contributed insights over the years.
Michael Gross (NCRR Mass Spec)
Henry Rohrs (NCRR Mass Spec)
Ron Bose (Oncology)
Mike Boyne (FDA)
Jeffry Hiken (Genetics)
Yury Bukhman, James McCurdy, Adam Halstead, Irene Ong (Area 3),
Mary Lipton (PNNL), Kathryn Richmond (Enabling Technologies) and others
Limbrick Laboratory
David Limbrick
Diego Morales
Holtzman Laboratory
David Holtzman
Rick Perrin
Jacqueline Payton
Chengjie Xiong (Biostatistics)
Differential Omics Studies
1. RNA-seq, Bottom-up
proteomics, metabolomics
2. Looking for a list of
discovered entities that
have different expression
levels between treatments
3. Very popular for target
discovery
4. Frequently done on
organisms before a
genome is completed
“Kelleher P-Score” Example
‘P score’ = P_{f,n}:
$$P_{f,n} = \frac{(xf)^{n}\, e^{-xf}}{n!}, \qquad x = \frac{1}{111.11} \times 2M_a \times 2$$
where
f is the number of input fragment ions,
n is the number of matches,
M_a is the mass accuracy.
$$p_{crude} = \sum_{i=n}^{\infty} P_{f,i} = 1 - \sum_{i=0}^{n-1} P_{f,i}$$
F. Meng, B. Cargile, L. Miller, J. Johnson, and N. Kelleher, Nat. Biotechnol., 2001, 19, 952-957.
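The P-score formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the ProSight implementation: the function names are hypothetical, the mass accuracy is assumed to be in Da, and the expression for x is taken as written on the slide.

```python
import math

def random_match_prob(mass_accuracy_da):
    # x = (1 / 111.11) * 2*Ma * 2, as written on the slide.
    # Assumption: Ma is expressed in Da.
    return (1.0 / 111.11) * (2.0 * mass_accuracy_da) * 2.0

def p_fn(f, n, x):
    # Poisson probability of exactly n random matches among f fragment ions.
    lam = x * f
    return lam ** n * math.exp(-lam) / math.factorial(n)

def p_crude(f, n, x):
    # Probability of n or more random matches: 1 - sum_{i=0}^{n-1} P_{f,i}.
    return 1.0 - sum(p_fn(f, i, x) for i in range(n))

# Hypothetical example: 50 fragment ions, 12 matches, 0.01 Da mass accuracy.
x = random_match_prob(0.01)
print(p_crude(50, 12, x))
```

For very small scores the subtraction from 1 loses floating-point precision; summing the upper tail directly would be the more careful choice in production code.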
Modeling the Scrambled P-Scores
Motivation
9,839 MS/MS Queries (MS1 and MS2 data)
Goodness of Fit
Better is better, but
the easy ones are easy
Computers Ask the Darndest Questions
Top Down Proteomics!
• Three pillars of proteomics:
• Identification
• Characterization
• Quantification.
• Top down proteomic studies are underway.
• These are large and complex studies
(At several institutions, a typical production bottom-up
study would have 200+ LC runs)
Top Down Proteomics Biometrics
Sources of Variation
1. Intensity calculation
2. LC alignment
3. Mass Spec Physics
4. Separation
Different fractions etc.
5. Protein Isolation
ChIP, RBC ghosts etc.
6. Tissue variation
7. Individual variation
8. Population variation
9. Random and systematic errors
Experimental Design
• Ronald A. Fisher (1926): "The Arrangement of Field Experiments"
• All measurements have errors
• All biological systems have individual variation
• The goal of experimental design is to design the experiment so that the variation can be partitioned
• Typically testing variation between groups against the variation within groups
Healthy Group
Sub 1: R1 R2 R3 R4
Sub 2: R5 R6 R7 R8
Diseased Group
Sub 1: R1 R2 R3 R4
Sub 2: R5 R6 R7 R8
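The idea of testing between-group variation against within-group variation can be sketched as a one-way F statistic. This is a minimal illustration under stated assumptions: the intensity values are made up, the function names are hypothetical, and technical replicates (R1 to R8) are averaged per subject first, because replicates from one subject are not independent samples of the group.

```python
from statistics import mean

def subject_means(group):
    # Collapse technical replicates to one value per subject.
    return [mean(replicates) for replicates in group]

def one_way_f(groups):
    # groups: list of lists of subject-level values, one list per group.
    values = [v for g in groups for v in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    # Ratio of between-group to within-group mean squares.
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical intensities: 2 subjects per group, 4 replicates per subject.
healthy = [[10.1, 9.8, 10.3, 10.0], [11.9, 12.2, 12.0, 11.7]]
diseased = [[14.2, 13.9, 14.0, 14.1], [15.8, 16.1, 16.0, 15.9]]
print(one_way_f([subject_means(healthy), subject_means(diseased)]))
```

With only two subjects per group the within-group degrees of freedom are very small, which is one reason the number of biological subjects matters more than the number of replicate runs.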
Typical Results: Human RBC Ghosts
[Figure: RAP1A in control samples (1-8) and PNH samples (9-13); panels labeled RAP1A, Catalase, Peroxiredoxin, and Coomassie]
Populations of Experiments
• Instead of doing 1
experiment, you are doing
an unknown number of
experiments
• Number of experiments
determined by how many
unique entities are
observed consistently
over the entire set of
observations
[Figure: control samples and PNH samples]
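Treating each consistently observed entity as its own experiment has a simple arithmetic consequence for false positives. The sketch below assumes independent tests at a fixed significance level, which real omics data only approximate; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    # Expected number of false positives if every null hypothesis is true.
    return n_tests * alpha

def prob_any_false_positive(n_tests, alpha=0.05):
    # Chance of at least one false positive, assuming independent tests.
    return 1.0 - (1.0 - alpha) ** n_tests

for m in (1, 10, 100, 1000):
    print(m, expected_false_positives(m), prob_any_false_positive(m))
```

Even a few hundred observed entities make at least one false positive nearly certain at an uncorrected threshold.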
Typical Results: Breast Cancer Model
Sources of Variation: The Model
$$I_{ijkl} = \mu + a_i + d_{j(i)} + r_{k(ij)} + e_{ijkl}$$
Where
i = 1 or 2 and represents the two preparations,
j = 1 to 3 for each digestion within a given preparation,
k = 1 to 3 for each injection (or run) within each digestion,
l = 1 to the number of peptides for the given protein.
Under this model,
$\mu$ is the overall mean,
$a_i$ is the effect for the i-th random preparation,
$d_{j(i)}$ is the effect for the j-th random digestion from the i-th preparation,
$r_{k(ij)}$ is the effect for the k-th random run from the j-th random digestion from the i-th preparation,
$e_{ijkl}$ is the residual.
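To make the nested model concrete, the sketch below simulates balanced data from it and recovers the variance components with the classical ANOVA (method-of-moments) estimators. This is an illustration only: the slides do not say which estimation method was used, and all standard deviations, the number of peptides, and the function names are hypothetical.

```python
import random
from statistics import mean

def simulate(n_prep=2, n_dig=3, n_run=3, n_pep=5, mu=20.0,
             sd_prep=1.0, sd_dig=0.5, sd_run=0.3, sd_res=0.2, seed=0):
    # Draw I_ijkl = mu + a_i + d_j(i) + r_k(ij) + e_ijkl as nested lists.
    rng = random.Random(seed)
    data = []
    for _ in range(n_prep):
        a = rng.gauss(0, sd_prep)
        prep = []
        for _ in range(n_dig):
            d = rng.gauss(0, sd_dig)
            dig = []
            for _ in range(n_run):
                r = rng.gauss(0, sd_run)
                dig.append([mu + a + d + r + rng.gauss(0, sd_res)
                            for _ in range(n_pep)])
            prep.append(dig)
        data.append(prep)
    return data

def variance_components(data):
    # Balanced design: a preparations, b digestions, c runs, n peptides.
    a, b, c, n = len(data), len(data[0]), len(data[0][0]), len(data[0][0][0])
    run_m = [[[mean(run) for run in dig] for dig in prep] for prep in data]
    dig_m = [[mean(runs) for runs in prep] for prep in run_m]
    prep_m = [mean(digs) for digs in dig_m]
    grand = mean(prep_m)
    idx = [(i, j, k) for i in range(a) for j in range(b) for k in range(c)]
    ss = {
        "prep": b * c * n * sum((m - grand) ** 2 for m in prep_m),
        "dig": c * n * sum((dig_m[i][j] - prep_m[i]) ** 2
                           for i in range(a) for j in range(b)),
        "run": n * sum((run_m[i][j][k] - dig_m[i][j]) ** 2 for i, j, k in idx),
        "res": sum((y - run_m[i][j][k]) ** 2
                   for i, j, k in idx for y in data[i][j][k]),
    }
    ms_a = ss["prep"] / (a - 1)
    ms_d = ss["dig"] / (a * (b - 1))
    ms_r = ss["run"] / (a * b * (c - 1))
    ms_e = ss["res"] / (a * b * c * (n - 1))
    # Solve the expected-mean-square equations; estimates may be negative.
    var = {"prep": (ms_a - ms_d) / (b * c * n), "dig": (ms_d - ms_r) / (c * n),
           "run": (ms_r - ms_e) / n, "res": ms_e}
    return var, ss

estimates, sums_of_squares = variance_components(simulate())
print(estimates)
```

With only two preparations, the preparation component rests on a single degree of freedom, so its estimate is very noisy; in practice a mixed-model package using restricted maximum likelihood is the more common choice.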
Variance Component Estimates
Power Calculations
Power Curves for High Subject Low Residual Variation
[Figure: power (0 to 1) versus effect size (in STD, 0 to 2.5), with curves for n=5 and n=20; labels: Inbred Mice, Human Subjects]
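Power curves of this kind can be approximated with a short calculation. The sketch below uses a normal approximation to a two-sided, two-sample comparison with the effect size expressed in standard deviations; the slides do not state which method produced the plotted curves, so treat this as an assumption, and the function name as hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect_size_sd, n_per_group, alpha=0.05):
    # Normal approximation to the power of a two-sided two-sample test.
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size_sd * sqrt(n_per_group / 2.0)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for d in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(d, round(two_sample_power(d, 5), 3), round(two_sample_power(d, 20), 3))
```

For small n a t-based calculation gives somewhat lower power than this approximation, so the numbers are optimistic for n=5.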
Systems Analysis
• What to do with the laundry lists
of significant genes?
• Gene Ontology Analysis
• Gene Set Enrichment Analysis
• Often paired with RNA or
metabolomic data.
• Creates a third level of analysis
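One common way to ask whether a list of significant genes is enriched for an ontology term is a hypergeometric over-representation test. The sketch below shows that test only; it is not Gene Set Enrichment Analysis proper, which is rank-based, and the function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    # P(overlap >= n_overlap) when n_selected genes are drawn at random
    # from a background in which n_annotated genes carry the term.
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    hits = sum(comb(n_annotated, k) * comb(n_background - n_annotated,
                                           n_selected - k)
               for k in range(n_overlap, upper + 1))
    return hits / total

# Hypothetical example: 40 of 200 significant genes carry a term that
# annotates 500 of 10,000 background genes.
print(enrichment_p_value(10000, 500, 200, 40))
```

Because many terms are tested at once, the resulting p-values would still need a multiple-testing correction.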
To Review
• Everything is in place for top-down proteomic
studies.
• In any discovery omic study, extreme care must
be taken – lots of pilot work to understand the
behavior of your analytic system
• Technology and mathematical formalism do
not trump biology.
(Bad experimental design results in bad experiments)
Shameless NCGAS Plug
• Funded by National Science Foundation
1. Large memory clusters for assembly
2. Bioinformatics consulting for biologists
3. Optimized software for better efficiency
Questions?
• Partner Institutions:
• Extreme Science and Engineering Discovery Environment (XSEDE)
• Texas Advanced Computing Center (TACC) at the University of Texas at
Austin
• San Diego Supercomputer Center (SDSC) at the University of California, San
Diego.
• Pittsburgh Supercomputing Center (PSC)
• Open for business at: http://ncgas.org