PIKE - CNB Proteomics Facility

Data Validation and Annotation:
PRIDEViewer and PIKE
Bioinformatics analysis from proteomics data
ProteoRed Bioinformatics Workshop
Salamanca
Alberto Medina-Aunon
March 15th, 2010
Main Topics
• Mass spectrometry: protein and peptide validation
– PRIDEViewer: Description.
– Examples: Use cases.
• Experiment context: Linking functional information to our proteins.
– PIKE: Description.
– Examples: Use cases.
MS Validation.
The easiest Way
• Starting from:
– Mass spectrum/spectra
– Tentative identification/Sequence
– Search Engine
Candidate: AFLLAMAARTGFRTR
How to do it
• By hand:
– Feasible only for a few sequences/spectra.
– We cannot read every file format (for instance, binary files).
• Semi-automatically:
– Using PRIDE files as input: PRIDEViewer
PRIDEViewer
– Experiment info
– Sample and Instrument info
– Spectra and identifications
– Gel Separation
– Mascot interface
One Example:
Identification using 5 peptides
Example
Mascot output
Another example:
350 input spectra
Validation study
• Starting from one public proteomics repository (EBI PRIDE):
– Retrieve a set of available experiments.
– Check how complete the information in each experiment is.
– Repeat the protein and peptide identification.
VALIDATE THE EXPERIMENT…
Validation using PRIDE
http://www.ebi.ac.uk/pride/
PRIDE: Searching experiments:
Biomart
Validation. First Round.
Biomart
Validation – First Round:
PRIDE Accession 1642
First View: Mascot Results
Validation – First Round:
PRIDE Accession 1642
Protein Id     Database    Peptide Count   Identified
IPI00295598    IPI         2               No
Q15843         SwissProt   6               Yes
P62491         SwissProt   1               No

Why? If we explore the data, we’ll find…

Protein Id     First Peptide          PRIDE mass   Calculated mass
IPI00295598    VISEPGEAEVFMTPEDFVR    2184.0375    2152.0267
Q15843         EIGPPQQQR              1052.5697    1052.5483
P62491         DHADSNIVIMLVGNK        1657.8186    1625.8316

Delta mass around 32 Da
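The calculated masses and the resulting delta masses can be reproduced from the peptide sequences alone. The sketch below assumes the listed values are monoisotopic masses of the singly protonated peptide ([M+H]+), which is consistent with the three rows shown; the helper names `mh_plus` and `delta_masses` are illustrative and are not part of PRIDEViewer.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O for the free N- and C-termini
PROTON = 1.00728   # one proton for a singly charged ion


def mh_plus(sequence):
    """Monoisotopic [M+H]+ mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


# (protein id, first peptide, mass reported in PRIDE), copied from the slide.
ROWS = [
    ("IPI00295598", "VISEPGEAEVFMTPEDFVR", 2184.0375),
    ("Q15843", "EIGPPQQQR", 1052.5697),
    ("P62491", "DHADSNIVIMLVGNK", 1657.8186),
]


def delta_masses(rows):
    """Return (protein id, calculated mass, reported - calculated) per row."""
    result = []
    for protein_id, peptide, reported in rows:
        calculated = mh_plus(peptide)
        result.append((protein_id, calculated, reported - calculated))
    return result


for protein_id, calculated, delta in delta_masses(ROWS):
    print(f"{protein_id:12s} calculated={calculated:.4f} delta={delta:+.4f} Da")
```

Only the two peptides that failed identification show the shift of about 32 Da; the identified peptide agrees with its calculated mass to within a few hundredths of a dalton.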
Validation – First Round:
PRIDE Accession 1642
• Hypothesis:
– The first and third sequences show a mass difference of around 32 Da.
• Is there a modification at the C or N terminus? If so, the second sequence would show it as well.
• Is one residue (or more than one) modified?
• We’ll extract the amino acids common to both sequences: D, A, S, I, V, M and G.
• Compare them with the described modifications that have a mass shift of 32 Da.
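The residue comparison in this hypothesis is a simple set operation. A minimal sketch, assuming the three first peptides from the slide (with the wrapped first sequence joined into one string); the extra step that removes residues also present in the second, correctly identified peptide is an added illustration, not something the slide states.

```python
# First peptides of the three proteins in PRIDE accession 1642.
FIRST = "VISEPGEAEVFMTPEDFVR"   # about 32 Da off
SECOND = "EIGPPQQQR"            # identified, no mass shift
THIRD = "DHADSNIVIMLVGNK"       # about 32 Da off


def shared_residues(a, b):
    """Amino acids that occur in both sequences."""
    return set(a) & set(b)


# Residues that could carry a modification common to both shifted peptides.
candidates = shared_residues(FIRST, THIRD)

# Optional narrowing: residues that never occur in the unshifted peptide.
absent_from_second = candidates - set(SECOND)

print("shared by first and third:", sorted(candidates))
print("not present in second:    ", sorted(absent_from_second))
```

Each candidate residue can then be checked against a list of known modifications with a mass shift close to 32 Da.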
Validation – First Round:
PRIDE Accession 1642.
Only this modification could explain a property common to both sequences.
So, we’ll select it in the next round.
Validation – First Round:
PRIDE Accession 1642
Validation – Second Round:
Latest Experiments. Retrieved by hand
Validation – Second Round:
Latest experiments
• PRIDE accession id: 10470 to 11257 (787 experiments).
– None is suitable for checking.
– No information regarding the identification is available.
• PRIDE accession id: 10000 to 10074 (74 experiments).
– One dataset could be checked: 10042 to 10060.
(Dataset title: Low abundance proteome of human red blood cells captured by combinatorial peptide libraries)
PRIDE Accession 10053
Mascot output
PRIDE Accession 10060
Mascot output: No identification
Validation – Third Round:
Recent Experiments. Retrieved by hand
• Experiment id: 9900 to 9999
• Two datasets are suitable for checking:
– 9900 to 9942: LC-MALDI experiments (Tannerella forsythia).
– 9944 to 9949: Rattus norvegicus.
– 9984: Zebrafish. No spectra.
– 9985 to 9992: Homo sapiens. (No identifications).
– 44 not available.
Validation – Third Round:
Experiment 9900
Validation – Third Round.
Experiment 9900
Validation – Third Round:
Experiment 9900. Summary
Protein Id   Peptide Count   Identified   1st Peptide Mass   Theoretical Mass
TF2239       1               No           1228.5463          1228.6433
TF26612      13              Yes          --                 --
TF1259       1               No           1271.6478          1271.6783
TF2116       4               No           1139.5835          1139.6208
TF1741       16              No           1044.5144          1044.5473
TF0447       2               No           1092.4619          1092.5432
TF2663       7               Yes          --                 --
TF2592       2               No           1022.5306          1022.5782
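To see how far each unidentified first peptide lies from its theoretical mass, the difference can be expressed both in Da and in ppm. A minimal sketch using the six rows of the Experiment 9900 summary that list both masses; the function name `mass_error` is illustrative, and the sketch makes no claim about the tolerance used in the original search.

```python
# (protein id, first peptide mass, theoretical mass) for the rows with masses.
ROWS_9900 = [
    ("TF2239", 1228.5463, 1228.6433),
    ("TF1259", 1271.6478, 1271.6783),
    ("TF2116", 1139.5835, 1139.6208),
    ("TF1741", 1044.5144, 1044.5473),
    ("TF0447", 1092.4619, 1092.5432),
    ("TF2592", 1022.5306, 1022.5782),
]


def mass_error(reported, theoretical):
    """Return the mass error as (Da, ppm) relative to the theoretical mass."""
    delta = reported - theoretical
    ppm = delta / theoretical * 1e6
    return delta, ppm


for protein_id, reported, theoretical in ROWS_9900:
    delta, ppm = mass_error(reported, theoretical)
    print(f"{protein_id:8s} {delta:+.4f} Da  {ppm:+.1f} ppm")
```

In all six rows the reported mass is lower than the theoretical one, by roughly 0.03 to 0.10 Da.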
Study summary
• Around 1000 PRIDE experiments were downloaded from the PRIDE central repository.
• Around 100 of them were suitable for testing.
• Fewer than 50% were successfully validated.
In summary
• There is a lot of data within the repositories (PRIDE).
• There is a lot of missing information.
• It is not possible to check the data automatically.
• PRIDEViewer could help us save a lot of time.
Protein Set
• At other times, a mistake in an identification is not so significant, provided we can still reach the goal of the experiment.
• For instance, finding the proteins involved in a particular function or biological process.
PIKE http://proteo.cnb.csic.es/
PIKE: Protein Information and Knowledge Extractor
DB id          Protein Name
gi|12857455    Heat shock protein
gi|14017768    FKB9_HUMAN
gi|12836587    Tubulin alpha homo sapiens
gi|15010550    Ubiquitin specific protease
gi|15489190    vinculin isoform VCL Homo sapiens
gi|9963904     selenium binding protein 1 Homo sapiens
…              …
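Identifiers such as `gi|12857455` combine a database prefix with an accession. A minimal sketch of how such a list could be split before submitting it to an annotation tool; the function name `parse_db_id` is hypothetical, and nothing here describes how PIKE itself parses its input.

```python
def parse_db_id(identifier):
    """Split 'prefix|accession' into (prefix, accession)."""
    prefix, separator, accession = identifier.strip().partition("|")
    if not separator or not prefix or not accession:
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return prefix, accession


# Identifiers copied from the example protein list.
PROTEIN_LIST = [
    "gi|12857455",
    "gi|14017768",
    "gi|12836587",
    "gi|15010550",
    "gi|15489190",
    "gi|9963904",
]

accessions = [parse_db_id(item)[1] for item in PROTEIN_LIST]
print(accessions)
```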
PIKE http://proteo.cnb.csic.es/
Information asked by user
PIKE http://proteo.cnb.csic.es/
PIKE output. CSV
[Bar chart: Series1 and Series2 across Feature 1 to Feature 5, vertical axis 0 to 18]
PIKE output
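A CSV export can be summarised with a few lines of code, for example to count how many proteins carry each feature. The sketch below assumes a hypothetical two-column layout (`accession`, `keywords`, with several keywords separated by semicolons); the real PIKE column names are not shown in these slides. The three sample rows reuse accessions and keywords from the first PIKE example in this talk.

```python
import csv
import io
from collections import Counter

# Hypothetical PIKE-style CSV; the column names are assumptions.
SAMPLE_CSV = """accession,keywords
P08648,Transmembrane
P02730,Transmembrane
P29966,Myristate
"""


def count_keywords(csv_text):
    """Count how many rows mention each keyword."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for keyword in row["keywords"].split(";"):
            keyword = keyword.strip()
            if keyword:
                counts[keyword] += 1
    return counts


for keyword, n in count_keywords(SAMPLE_CSV).most_common():
    print(keyword, n)
```

Counts of this kind are what a per-feature bar chart would plot.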
First example
medium-complexity protein list (containing 57 proteins)
J Proteome Res. 2005 Nov-Dec;4(6):2435-41.
#
Entry ID (UniProt ID)
Manual
searching
PIKE output -Only Keywords-
P08648
1 TM
KeyWord: Transmembrane
P05023
Q9UBN4
10 TM
8 TM
KeyWord: Transmembrane
KeyWord: Transmembrane
10 Band 3 anion transport protein
P02730
11 TM
KeyWord: Transmembrane
11 Transferrin receptor protein 1
17 calnexin precursor
P02786
P27824
1 TM
1 TM
KeyWord: Transmembrane
KeyWord: Transmembrane
19 5'-nucleotidase precursor
P21589
1 TM; GPI
21 Alkaline phosphatase, placental type precursor
P05187
GPI
Keyword: GPI-anchor
KeyWords: Transmembrane; GPI-anchor
22 4F2 cell-surface antigen heavy chain
Solute carrier family 2, facilitated glucose
24 transporter, member 1
P08195
1 TM
KeyWord: Transmembrane
P11166
12 TM
KeyWord: Transmembrane
29 chloride intracellular channel protein 5
3beta-hydroxy-Delta5-steroid dehydrogenase
30 multifunctional protein I
Q9NZA1
P14060
1 TM
KeyWord: Transmembrane
41 myristoylated alanine-rich C-kinase substrate
P29966
Myristoylation
Keyword: Myristate
42 Basigin precursor
P35613
1 TM
47 Brain acid soluble protein 1
P80723
Myristoylation
51 ADP-ribosylation factor 1
P84077
KeyWord: Transmembrane
KeyWords: Transmembrane; Myristate
KeyWords: Transmembrane; Myristate
Entry name
6 Integrin alpha-5 precursor
Sodium/potassium-transporting ATPase alpha-1
7 chain precursor
8 Short transient receptor potential channel 4
KeyWord: Transmembrane
Second example
Human Plasma Proteins from PRIDE (HPPP). PRIDE Accession 65
25 MOST FREQUENT PROTEINS
Protein                                                        Count
Serum albumin [Precursor] - Serum albumin - ALB                356
Complement C3 [Precursor]                                      273
IGHA1 protein                                                  225
Calcium/calmodulin-dependent protein kinase kinase 2           100
Inter-alpha-trypsin inhibitor heavy chain H1-H4 [Precursor]    99
Putative uncharacterized protein                               97
IGL@ protein                                                   96
ARF GTPase-activating protein GIT2                             90
Complement factor B [Precursor]                                90
PRO2275                                                        90
IGHM protein                                                   78
IGKC protein                                                   64
Alpha-1B-glycoprotein [Precursor]                              62
cDNA FLJ14473 fis, clone MAMMA1001080.                         58
CDNA FLJ25298 fis, clone STM07683.                             58
Fibronectin [Precursor]                                        58
IGHD protein                                                   56
Trypsin                                                        55
Apolipoprotein-L1 [Precursor]                                  54
HP protein                                                     53
Alpha-2-macroglobulin [Precursor]                              52
SNC66 protein                                                  52
Ig kappa chain V-III region HAH [Precursor]                    50

PROTEIN COUNT: 2226
REDUNDANCY RATIO (Protein count/non redundant entries): 89.04%
Third example
The Human Plasma Proteome: A non redundant list:
Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.
>> We have merged four different views of the human plasma
proteome, based on different methodologies, into a single
nonredundant list of 1175 distinct gene products ….
Conclusion
• PIKE is a suitable and useful bioinformatics tool for small- or large-scale proteomics projects.
• PIKE’s main characteristic is its ability to systematically access and automatically retrieve comprehensive biological information contained in common databases.
• The resulting information is output in a wide range of standard formats that can be directly viewed, exported, or downloaded for additional analysis.
Questions?