Statistical Consideration for Identification and Quantification in Top-Down Proteomics

Discovery Omics with Top Down Proteomics
Richard LeDuc
National Center for Genome Analysis Support

Acknowledgements

The Kelleher Research Group
Drs. Neil Kelleher, Paul Thomas, and Andy Forbes, and ProSight Development Team (past and present)
• Leonid Zamdborg
• Shannee Babai
• Bryon Early
• Ian Spauling
• Kevin Glowacz
• Eric Bluhm
• Vinayak Viswanathan
• Yong-Bin Kim
• Ryan Fellers
• Tom Januszyk
• Brian Cis
• Chris Strouse
• Seyoung Sohn
• Greg Taylor
• Joe Sola
• Lee Bynum
• Andrew Birck

National Center for Genome Analysis Support
Le-Shin Wu, Carrie Ganote, Tom Doak, Bill Barnett, and a cast of thousands

Washington Univ. School of Medicine Proteomics Core
Reid Townsend, Petra Gilmore, Cheryl Lichti, James Malone, Alan Davis

All the other numerous members of the KRG who have contributed insights over the years.

Michael Gross (NCRR Mass Spec), Henry Rohrs (NCRR Mass Spec), Ron Bose (Oncology), Mike Boyne (FDA), Jeffry Hiken (Genetics)
Yury Bukhman, James McCurdy, Adam Halstead, Irene Ong (Area 3), Mary Lipton (PNNL), Kathryn Richmond (Enabling Technologies) and others

Limbrick Laboratory
David Limbrick, Diego Morales

Holtzman Laboratory
David Holtzman, Rick Perrin, Jacqueline Payton, Chengjie Xiong (Biostatistics)

Differential Omics Studies
1. RNA-seq, bottom-up proteomics, metabolomics
2. Looking for a list of discovered entities that have different expression levels between treatments
3. Very popular for target discovery
4. Frequently done on organisms before a genome is completed

"Kelleher P-Score" Example

'P score' = P_{f,n} = \frac{(xf)^n \, e^{-xf}}{n!}, \qquad x = \frac{1}{111.11} \times 2 M_a \times 2

f is the # of input fragment ions, n is the # of matches, M_a is the Mass Accuracy

p_{crude} = \sum_{i=n}^{\infty} P_{f,i} = 1 - \sum_{i=0}^{n-1} P_{f,i}

F. Meng, B. Cargile, L. Miller, J. Johnson, and N. Kelleher, Nat. Biotechnol., 2001, 19, 952-957.
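The P-score formula on the slide above can be tried out numerically. The following is a minimal sketch, not the ProSight implementation: the function names are hypothetical, the constant 111.11 is copied as printed on the slide, and it assumes the mass accuracy M_a is given in Da (the slide does not state the unit). The crude p-value is taken here as the Poisson probability of n or more random matches.

```python
import math

# Constant as printed on the slide; treated here as an average residue mass in Da.
AVG_RESIDUE_MASS = 111.11


def random_match_probability(mass_accuracy_da):
    """x = (1 / 111.11) * 2 * Ma * 2, following the slide's formula.

    Assumption: Ma is expressed in Da.
    """
    return (1.0 / AVG_RESIDUE_MASS) * (2.0 * mass_accuracy_da) * 2.0


def poisson_term(f, n, x):
    """P_{f,n} = (x f)^n * exp(-x f) / n!"""
    lam = x * f
    return lam ** n * math.exp(-lam) / math.factorial(n)


def crude_p_score(f, n, x, max_terms=200):
    """Probability of n or more random matches among f fragment ions.

    Sums the upper Poisson tail directly (rather than 1 minus the lower
    tail) so that very small p-values are not lost to rounding.
    max_terms is an illustrative cutoff, adequate when x * f is small.
    """
    lam = x * f
    term = poisson_term(f, n, x)
    total = 0.0
    i = n
    for _ in range(max_terms):
        total += term
        i += 1
        term *= lam / i  # P_{f,i} from P_{f,i-1}
    return min(total, 1.0)


# Hypothetical example values: 30 fragment ions, 12 matches, 0.02 Da accuracy.
x = random_match_probability(0.02)
print("x =", x)
print("crude p-score =", crude_p_score(f=30, n=12, x=x))
```

Smaller values indicate that the observed number of fragment matches is less likely to have arisen by chance.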
Modeling the Scrambled P-Scores

Motivation

9,839 MS/MS Queries (MS1 and MS2 data)

Goodness of Fit

Better is better, but the easy ones are easy

Computers Ask the Darndest Questions

Top Down Proteomics!
• Three pillars of proteomics:
  • Identification
  • Characterization
  • Quantification
• Top-down proteomic studies are underway.
• These are large and complex studies. (At several institutions, a typical production bottom-up study would have 200+ LC runs.)

Top Down Proteomics Biometrics

Sources of variation
1. Intensity calculation
2. LC alignment
3. Mass spec physics
4. Separation (different fractions, etc.)
5. Protein isolation (ChIP, RBC ghosts, etc.)
6. Tissue variation
7. Individual variation
8. Population variation
9. Random and systematic errors

Experimental Design
• Ronald A. Fisher (1926): "The Arrangement of Field Experiments"
• All measurements have errors
• All biological systems have individual variation
• The goal of experimental design is to design the experiment so that the variation can be partitioned
• Typically testing variation between groups against the variation within groups

Healthy Group: Sub 1 (R1, R2, R3, R4); Sub 2 (R5, R6, R7, R8)
Diseased Group: Sub 1 (R1, R2, R3, R4); Sub 2 (R5, R6, R7, R8)

Typical Results: Human RBC Ghosts
[Figure: RAP1A levels in Control Samples (1-8) and PNH Samples (9-13); gel panels labeled RAP1A, Catalase, Peroxiredoxin, and Coomassie]

Populations of Experiments
• Instead of doing 1 experiment, you are doing an unknown number of experiments
• The number of experiments is determined by how many unique entities are observed consistently over the entire set of observations

Typical Results: Breast Cancer Model

Sources of Variation: The Model

I_{ijkl} = a_i + d_{j(i)} + r_{k(ij)} + e_{ijkl}

where i = 1 or 2 represents the two preparations, j = 1 to 3 indexes the digestions within a given preparation, k = 1 to 3 indexes the injections (or runs) within each digestion, and l = 1 to the number of peptides for the given
protein. Under this model:
a_i is the effect of the i-th random preparation
d_{j(i)} is the effect of the j-th random digestion from the i-th preparation
r_{k(ij)} is the effect of the k-th random run from the j-th random digestion from the i-th preparation
e_{ijkl} is the residual

Variance Component Estimates

Power Calculations

Power Curves for High Subject Low Residual Variation
[Figure: power curves for n=5 and n=20; x-axis: Effect Size (in STD), 0 to 2.5; y-axis: Power, 0 to 1; panels: Inbred Mice, Human Subjects]

Systems Analysis
• What to do with the laundry lists of significant genes?
• Gene Ontology Analysis
• Gene Set Enrichment Analysis
• Often paired with RNA or metabolomic data.
• Creates a third level of analysis

To Review
• Everything is in place for top-down proteomic studies.
• In any discovery omic study, extreme care must be taken: do lots of pilot work to understand the behavior of your analytic system.
• Technology and mathematical formalism do not trump biology. (Bad experimental design results in bad experiments.)

Shameless NCGAS Plug
• Funded by National Science Foundation
  1. Large memory clusters for assembly
  2. Bioinformatics consulting for biologists
  3. Optimized software for better efficiency
• Partner Institutions:
  • Extreme Science and Engineering Discovery Environment (XSEDE)
  • Texas Advanced Computing Center (TACC) at the University of Texas at Austin
  • San Diego Supercomputer Center (SDSC) at the University of California, San Diego
  • Pittsburgh Supercomputing Center (PSC)
• Open for business at: http://ncgas.org

Questions?
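Supplementary sketch for the "Power Calculations" slide: the curves shown there plot power against effect size (in standard deviations) for n=5 and n=20. The snippet below is only a rough illustration of that kind of curve, not the calculation behind the slide. It swaps in a simple two-sided, two-sample normal approximation with equal group sizes, ignores the nested variance components, and will be somewhat optimistic for small n compared with a t-test. The function name and the default alpha of 0.05 are assumptions.

```python
from statistics import NormalDist


def two_sample_power(effect_size_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size_sd: true difference between group means, in units of the
                    (assumed common) standard deviation.
    n_per_group:    number of subjects in each group.
    alpha:          two-sided significance level (assumed 0.05 here).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Standardized shift of the test statistic under the alternative.
    shift = effect_size_sd * (n_per_group / 2.0) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Tabulate power over the effect-size range shown on the slide's x-axis.
for n in (5, 20):
    for step in range(0, 6):
        d = 0.5 * step
        print(f"n={n:2d}  effect={d:.1f} SD  power={two_sample_power(d, n):.3f}")
```

At an effect size of zero the function returns alpha, and power rises toward 1 faster for the larger group size, which is the qualitative behavior the slide's curves illustrate.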