Data Validation and Annotation: PRIDEViewer and PIKE
Bioinformatics analysis from proteomics data
ProteoRed Bioinformatics Workshop, Salamanca
Alberto Medina-Aunon
March 15th, 2010

Main Topics
• Mass spectrometry and protein and peptide validation
– PRIDEViewer: Description.
– Examples: Use cases.
• Experiment context: Linking functional information to our proteins.
– PIKE: Description.
– Examples: Use cases.

MS Validation: The easiest way
• Starting from:
– Mass spectrum/spectra
– Tentative identification/sequence
– Search engine
Candidate: AFLLAMAARTGFRTR

How to do it
• By hand:
– Only feasible for a few sequences/spectra.
– We cannot read every file format (for instance, binary files).
• Semi-automatically:
– Using PRIDE files as input: PRIDEViewer

PRIDEViewer: Experiment info
PRIDEViewer: Sample and Instrument info
PRIDEViewer: Spectra and identifications
PRIDEViewer: Gel Separation
PRIDEViewer: Mascot interface

One Example: Identification using 5 peptides
Example Mascot output
Another example: 350 input spectra

Validation study
• Starting from one public proteomics repository (EBI PRIDE):
– Retrieve a set of available experiments.
– Check how complete the information in each experiment is.
– Repeat the protein and peptide identification.
VALIDATE THE EXPERIMENT…

Validation using PRIDE
http://www.ebi.ac.uk/pride/

PRIDE: Searching experiments: Biomart

Validation – First Round: Biomart

Validation – First Round: PRIDE Accession 1642
First View: Mascot Results

Validation – First Round: PRIDE Accession 1642

Protein Id    Database    Peptide Count   Identified
IPI00295598   IPI         2               No
Q15843        SwissProt   6               Yes
P62491        SwissProt   1               No

Why? If we explore the data, we'll find…

Protein Id    First Peptide         PRIDE mass   Calculated mass
IPI00295598   VISEPGEAEVFMTPEDFVR   2184.0375    2152.0267
Q15843        EIGPPQQQR             1052.5697    1052.5483
P62491        DHADSNIVIMLVGNK       1657.8186    1625.8316

Delta mass around 32 Da

Validation – First Round: PRIDE Accession 1642
• Hypothesis…
– The first and third sequences show a mass difference of around 32 Da.
• Is there a modification at the C or N terminus? If so, the second sequence would show it as well.
• Is any residue, or more than one, modified?
• We'll extract the amino acids common to both sequences: D, A, S, I, V, M and G.
• Compare them with the described modifications that have a mass variation of 32 Da.

Validation – First Round: PRIDE Accession 1642
Only this modification could explain a common property between both sequences, so we'll select it in the next round.

Validation – Second Round: Latest Experiments. Retrieved by hand
• PRIDE accession id: 10470 to 11257 (787 experiments).
– None is suitable to check.
– No information regarding the identification is available.
• PRIDE accession id: 10000 to 10074 (74 experiments).
– One dataset could be checked: 10042 to 10060 (dataset title: Low abundance proteome of human red blood cells captured by combinatorial peptide libraries).

PRIDE Accession 10053: Mascot output
PRIDE Accession 10060: Mascot output: No identification

Validation – Third Round: Recent Experiments. Retrieved by hand
• Experiment id: 9900 to 9999
• Two datasets are suitable to check:
– 9900 to 9942: LC-MALDI experiments (Tannerella forsythia).
– 9944 to 9949: Rattus norvegicus.
– 9984: Zebrafish. No spectra.
– 9985 to 9992: Homo sapiens. (No identifications).
– 44 not available.

Validation – Third Round: Experiment 9900

Validation – Third Round: Experiment 9900. Summary

Protein Id   Peptide Count   Identified   1st Peptide Mass   Theoretical Mass
TF2239       1               No           1228.5463          1228.6433
TF26612      13              Yes          --                 --
TF1259       1               No           1271.6478          1271.6783
TF2116       4               No           1139.5835          1139.6208
TF1741       16              No           1044.5144          1044.5473
TF0447       2               No           1092.4619          1092.5432
TF2663       7               Yes          --                 --
TF2592       2               No           1022.5306          1022.5782

Study summary
• Around 1000 PRIDE experiments were downloaded from the PRIDE central repository.
• Around 100 of them were suitable to test.
• Fewer than 50% were successfully validated.

In summary
• There is a lot of data within the repositories (PRIDE).
• There is a lot of missing information.
• It is not possible to check the data automatically.
• PRIDEViewer could help us save a lot of time.

Protein Set
• Sometimes a mistake in an identification is not so significant, as long as we can still reach the goal of the experiment.
• For instance, finding the proteins involved in a particular function or biological process.

PIKE
http://proteo.cnb.csic.es/

PIKE: Protein Information and Knowledge Extractor

DB id         Protein Name
gi|12857455   Heat shock protein
gi|14017768   FKB9_HUMAN
gi|12836587   Tubulin alpha homo sapiens
gi|15010550   Ubiquitin specific protease
gi|15489190   vinculin isoform VCL Homo sapiens
gi|9963904    selenium binding protein 1 Homo sapiens
…             …

PIKE: Information asked by user

PIKE output: CSV
[Figure: chart of Series1 and Series2 across Feature 1 to Feature 5]

PIKE output

First example
Medium-complexity protein list (containing 57 proteins)
J Proteome Res. 2005 Nov-Dec;4(6):2435-41.
#   Entry name                                                            UniProt ID  Manual searching  PIKE output (only keywords)
6   Integrin alpha-5 precursor                                            P08648      1 TM              Keyword: Transmembrane
7   Sodium/potassium-transporting ATPase alpha-1 chain precursor          P05023      10 TM             Keyword: Transmembrane
8   Short transient receptor potential channel 4                          Q9UBN4      8 TM              Keyword: Transmembrane
10  Band 3 anion transport protein                                        P02730      11 TM             Keyword: Transmembrane
11  Transferrin receptor protein 1                                        P02786      1 TM              Keyword: Transmembrane
17  calnexin precursor                                                    P27824      1 TM              Keyword: Transmembrane
19  5'-nucleotidase precursor                                             P21589      1 TM; GPI         Keywords: Transmembrane; GPI-anchor
21  Alkaline phosphatase, placental type precursor                        P05187      GPI               Keyword: GPI-anchor
22  4F2 cell-surface antigen heavy chain                                  P08195      1 TM              Keyword: Transmembrane
24  Solute carrier family 2, facilitated glucose transporter, member 1    P11166      12 TM             Keyword: Transmembrane
29  chloride intracellular channel protein 5                              Q9NZA1                        Keyword: Transmembrane
30  3beta-hydroxy-Delta5-steroid dehydrogenase multifunctional protein I  P14060      1 TM              Keyword: Transmembrane
41  myristoylated alanine-rich C-kinase substrate                         P29966      Myristoylation    Keyword: Myristate
42  Basigin precursor                                                     P35613      1 TM              Keyword: Transmembrane
47  Brain acid soluble protein 1                                          P80723      Myristoylation    Keywords: Transmembrane; Myristate
51  ADP-ribosylation factor 1                                             P84077                        Keywords: Transmembrane; Myristate

Second example
Human Plasma Proteins from PRIDE (HPPP). PRIDE Accession 65

25 MOST FREQUENT PROTEINS
Serum albumin [Precursor] - Serum albumin - ALB
Complement C3 [Precursor]
IGHA1 protein
Calcium/calmodulin-dependent protein kinase kinase 2
Inter-alpha-trypsin inhibitor heavy chain H1-H4 [Precursor]
Putative uncharacterized protein
IGL@ protein
ARF GTPase-activating protein GIT2
Complement factor B [Precursor]
PRO2275
IGHM protein
IGKC protein
Alpha-1B-glycoprotein [Precursor]
cDNA FLJ14473 fis, clone MAMMA1001080.
CDNA FLJ25298 fis, clone STM07683.
Fibronectin [Precursor]
IGHD protein
Trypsin
Apolipoprotein-L1 [Precursor]
HP protein
Alpha-2-macroglobulin [Precursor]
SNC66 protein
Ig kappa chain V-III region HAH [Precursor]

PROTEIN COUNT
356 273 225 100 99 97 96 90 90 90 78 64 62 58 58 58 56 55 54 53 52 52 50
2226

REDUNDANCY RATIO (protein count / non-redundant entries)
89.04%

Third example
The Human Plasma Proteome: A non-redundant list
Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.
>> We have merged four different views of the human plasma proteome, based on different methodologies, into a single nonredundant list of 1175 distinct gene products…

Conclusion
• PIKE represents a suitable and useful bioinformatics tool for small- or large-scale proteomics projects.
• PIKE's main characteristic is its ability to systematically access and automatically retrieve comprehensive biological information contained in common databases.
• The resulting information is output in a wide range of standard formats that can be directly viewed, exported, or downloaded for additional analysis.

Questions?
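
Backup: delta-mass check for PRIDE Accession 1642
The reasoning used in the first validation round (compare the PRIDE mass with the calculated mass, then look for residues shared by the peptides that are shifted) can be sketched in a few lines of Python. This is only an illustration, not part of PRIDEViewer: the masses and sequences are copied from the accession 1642 table, while the 0.5 Da threshold and all variable and function names are assumptions made for this sketch.

```python
# First peptide of each protein: (protein id, sequence, PRIDE mass, calculated mass).
# Values copied from the accession 1642 table.
peptides = [
    ("IPI00295598", "VISEPGEAEVFMTPEDFVR", 2184.0375, 2152.0267),
    ("Q15843", "EIGPPQQQR", 1052.5697, 1052.5483),
    ("P62491", "DHADSNIVIMLVGNK", 1657.8186, 1625.8316),
]

# Assumed threshold (not from the slides): a larger gap is treated as a mass shift.
TOLERANCE_DA = 0.5


def delta_mass(observed, calculated):
    """Difference between the reported and the calculated peptide mass, in Da."""
    return round(observed - calculated, 4)


# Delta mass per protein id.
deltas = {pid: delta_mass(obs, calc) for pid, _, obs, calc in peptides}

# Sequences whose delta mass exceeds the threshold.
shifted = [seq for _, seq, obs, calc in peptides
           if abs(delta_mass(obs, calc)) > TOLERANCE_DA]

# Residues present in every shifted sequence: candidates for a shared modification.
common_residues = set.intersection(*(set(seq) for seq in shifted))

for pid, delta in deltas.items():
    print(f"{pid}: delta mass = {delta:+.4f} Da")
print("Residues common to the shifted peptides:", "".join(sorted(common_residues)))
```

For these three peptides only the first and third exceed the threshold, and the residues they share are A, D, G, I, M, S and V. The next step described in the slides, comparing those residues with known modifications of about 32 Da, is left out because it needs a modification list that is not part of this deck.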