From DNA to Protein

Lecture 6.3:
From DNA to Protein
Dr. Joanne Fox
Day 6: Saturday February 21st, 2004
13:45 – 15:15
Objectives
• Review protein sequence features and databases
• Review the structural diversity of amino acids and
protein sequences
• Highlight several physicochemical and structural
features which can be calculated from protein
sequences
• Show how proteomics utilizes methods and
techniques for measuring, comparing and assessing
protein features
Outline:
• Protein sequence features
• Databases of protein sequences
• Basics of protein structure
– 1° structure, prediction of Mw and pI
– 2° structure, prediction methods
– 3° structure, methods for predicting folds
• Proteomics
– Current methods
– Cutting edge technology
Amino Acids
[Figure: general structure of an amino acid at pH 7, showing the amino group (H3N+), alpha carbon, carboxyl group, and side chain group (R)]
• The general formula for
an amino acid
• R is commonly one of
20 different side chains
• At pH 7 both the amino
and carboxyl groups are
ionized
Peptide Bonds
• Amino acids are joined together by an amide
linkage called a peptide bond.
• The two bonds on either side of the rigid planar peptide
unit exhibit a high degree of rotation
[Figure: backbone of a four-residue peptide with side chains R1 to R4; the peptide bonds are marked, along with the bonds where rotation occurs]
Families of Amino Acids
• The common amino acids are grouped according to
whether their side chains are:
– acidic: D, E
– basic: K, R, H
– uncharged polar: N, Q, S, T, Y
– nonpolar: G, A, V, L, I, P, F, M, W, C
• Hydrophilic amino acids (uncharged polar) are
usually on the outside of a protein, whereas nonpolar
residues cluster on the inside
• Basic or acidic amino acids are very polar and are
generally found on the outside of protein molecules
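The grouping above lends itself to a small lookup table. The following is a minimal Python sketch, assuming one-letter codes and the four families exactly as listed on this slide; the names `FAMILIES`, `FAMILY_OF` and `family_composition` are illustrative and not part of the lecture.

```python
from collections import Counter

# Families exactly as grouped on the slide (one-letter codes).
FAMILIES = {
    "acidic": "DE",
    "basic": "KRH",
    "uncharged polar": "NQSTY",
    "nonpolar": "GAVLIPFMWC",
}

# Reverse lookup: residue letter -> family name.
FAMILY_OF = {aa: family for family, members in FAMILIES.items() for aa in members}


def family_composition(sequence):
    """Return the fraction of residues that fall in each family.

    Hypothetical helper; raises KeyError on letters outside the 20
    standard amino acids and ZeroDivisionError on an empty sequence.
    """
    sequence = sequence.upper()
    counts = Counter(FAMILY_OF[aa] for aa in sequence)
    return {family: counts[family] / len(sequence) for family in FAMILIES}


if __name__ == "__main__":
    print(family_composition("ACDEFGHIKLMNPQRSTVWY"))
```

A sequence with a high nonpolar fraction is, by the rule of thumb above, more likely to bury many residues in its interior, though composition alone says nothing about where those residues sit.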
Protein Sequence Features
• Proteins exhibit far more sequence and
chemical complexity than DNA or RNA
• Properties and structure are defined by the
sequence and side chains of their constituent
amino acids
• The “engines” of life
• >95% of all drugs target proteins
• A favorite topic of the post-genomic era
Protein Sequence Databases
• Where does protein sequence information reside?
– Entrez Cross Database Search
• http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
– Swissprot & TrEMBL
• http://ca.expasy.org/sprot/
– PIR
• http://pir.georgetown.edu/
• As of December 2003, all of this information is
integrated into a unified protein database called Uniprot.
– Uniprot
• http://www.pir.uniprot.org/
Entrez Cross Database Search
• Protein: sequence database gives access to
translated protein sequences from
Genbank/EMBL/DDBJ
• Complete set of deduced protein sequences
• Redundancy problem
Swissprot & TrEMBL
• Swissprot is an expert curated database
– Function, domain structure, post-translational modifications,
variants, reactions, similarities
• TrEMBL (translated EMBL)
– Computer annotated supplement to Swissprot
PIR – Protein Information Resource
• Annotated database which includes protein family
classification information
The Uniprot Knowledgebase
• Contains all of the information in Swiss-Prot,
TrEMBL, and PIR. This new unified database was
launched in December 2003.
Basics of Protein Structure
• Primary
• Secondary
• Tertiary
[Figure: primary structure, shown as the sequence ACDEFGHIKLMNPQRSTVWY]
Molecular Weight
• Quick formula: Mw ≈ 110 Da ×
number of residues
• Accurate determination
of mass by mass
spectrometry
• Tools exist for
accurately calculating
mass of peptides
based on amino acid
composition
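Both estimates can be sketched in a few lines of Python. The block below is only an illustration: the average residue masses and the mass of water are standard reference values that do not appear in the lecture, and the function names are hypothetical. It ignores post-translational modifications and disulfide bonds.

```python
# Average residue masses in daltons (free amino acid minus one water).
# Standard reference values, NOT taken from the lecture slides.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini


def quick_mw(sequence):
    """Rule-of-thumb estimate from the slide: 110 Da per residue."""
    return 110 * len(sequence)


def composition_mw(sequence):
    """Average mass from amino acid composition (unmodified chain)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


if __name__ == "__main__":
    seq = "ACDEFGHIKLMNPQRSTVWY"
    print(quick_mw(seq), round(composition_mw(seq), 2))
```

The quick formula is adequate for reading a gel; the composition-based value is the kind of number one would compare against a mass spectrum.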
Molecular Weight & Proteomics
[Figures: 2-D gel; QTOF mass spectrometry]
Isoelectric Point
• The pH at which a protein has a net
charge of zero
pKa Values for Ionizable Amino Acids
Residue   pKa      Residue   pKa
C         10.28    H          6
D          3.65    K         10.53
E          4.25    R         12.43
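Given these pKa values, the isoelectric point can be estimated numerically by finding the pH at which the summed Henderson-Hasselbalch charges cancel. The sketch below is a rough illustration, not a reference implementation: the side-chain pKa values come from the table above, but the terminal pKa values (`n_term_pka`, `c_term_pka`) are assumed placeholders that the lecture does not give, and residues absent from the table are treated as uncharged.

```python
# Side-chain pKa values from the table above.
PKA_ACIDIC = {"C": 10.28, "D": 3.65, "E": 4.25}   # neutral -> negative
PKA_BASIC = {"H": 6.0, "K": 10.53, "R": 12.43}    # positive -> neutral


def net_charge(sequence, ph, n_term_pka=9.0, c_term_pka=2.0):
    """Net charge at a given pH (Henderson-Hasselbalch).

    n_term_pka and c_term_pka are ASSUMED illustrative values,
    not taken from the lecture.
    """
    charge = 1.0 / (1.0 + 10 ** (ph - n_term_pka))    # free amino terminus
    charge -= 1.0 / (1.0 + 10 ** (c_term_pka - ph))   # free carboxyl terminus
    for aa in sequence:
        if aa in PKA_BASIC:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_BASIC[aa]))
        elif aa in PKA_ACIDIC:
            charge -= 1.0 / (1.0 + 10 ** (PKA_ACIDIC[aa] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect on pH: net charge falls monotonically as pH rises."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positive, so the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    print(round(isoelectric_point("ACDEFGHIKLMNPQRSTVWY"), 2))
```

Published pI calculators differ mainly in which pKa set they use, so estimates for the same sequence can vary by a few tenths of a pH unit.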
Basics of Protein Structure
• Primary
• Secondary
• Tertiary
Common Secondary
Structure Elements
• The Alpha Helix
Common Secondary
Structure Elements
• The Beta Sheet
Secondary Structure:
Phi & Psi Angles Defined
• Rotational constraints emerge from interactions with
bulky groups (i.e., side chains).
• Phi & Psi angles define the secondary structure
adopted by a protein.
Ramachandran Plot
Supersecondary Structure
Secondary Structure &
Protein Folding
• Understanding the forces of hydrophobicity:
[Figure: an unfolded or partially folded polypeptide, carrying polar and nonpolar side chains, folds into the folded conformation]
• The hydrophobic core contains nonpolar side chains
• Hydrogen bonds can form with polar side chains
on the outside of the protein
Hydrophobicity is a property which can be
calculated for protein sequences
• Hydrophobicity Scales:
– Used to calculate
hydrophobicity
– Based on experimental
evidence indicating
hydrophobic/hydrophilic
properties of each aa
• Solubility, Stability,
Location and/or
Globularity of protein
sequences can be
predicted
Kyte / Doolittle Hydrophobicity Scale
Residue   Hphob    Residue   Hphob
A          1.8     M          1.9
C          2.5     N         -3.5
D         -3.5     P         -1.6
E         -3.5     Q         -3.5
F          2.8     R         -4.5
G         -0.4     S         -0.8
H         -3.2     T         -0.7
I          4.5     V          4.2
K         -3.9     W         -0.9
L          3.8     Y         -1.3
Hydrophobicity Profile
• Moving segment approach
• Correlation of this technique with 3D structure
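[Figure: hydrophobicity profile. Vertical axis: score, from hydrophobic (+3) to hydrophilic (-4), with interior and exterior residues labeled; horizontal axis: position along the protein sequence, from NH2 (residue 1) to COOH, with ticks every 50 residues up to 301]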
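The moving segment approach can be sketched as a sliding-window average over the Kyte / Doolittle values. This is a minimal illustration: the scale values are the ones tabulated in this lecture, while the default window of 9 residues and the function name are assumptions chosen for the example.

```python
# Kyte / Doolittle hydrophobicity values, as tabulated in the lecture.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def hydrophobicity_profile(sequence, window=9):
    """Average hydrophobicity over each full window along the sequence.

    The window size is an assumed parameter; the returned list has
    len(sequence) - window + 1 entries, one per window position.
    """
    if not 1 <= window <= len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    for position, score in enumerate(hydrophobicity_profile("AAAKKKLLLIIIDDD", window=5)):
        print(position, round(score, 2))
```

Short windows follow local variation closely, while longer windows smooth the profile and make extended hydrophobic stretches easier to spot.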
The α-helix is a common
secondary structure element
• A helical wheel is a
representation of the 3D
structure of the α-helix.
• Projection of aa side chains
onto a plane perpendicular
to axis of helix
• Hydrophobic arcs stabilize
helical interactions
• Amphipathic helices are
common
[Figure: helical wheel with acidic and nonpolar residues labeled]
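A helical wheel projection can be sketched by stepping 100° around the helix axis per residue, which corresponds to the 3.6 residues per turn of an ideal α-helix. The block below is an illustration only: the 100° step and the Eisenberg-style hydrophobic moment are additions not stated in the lecture, and the moment is computed here with the Kyte / Doolittle values from the lecture's table as an assumed choice of scale.

```python
import math

# Kyte / Doolittle hydrophobicity values, as tabulated in the lecture.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}

DEGREES_PER_RESIDUE = 100.0  # ideal alpha helix: 360 / 3.6 residues per turn


def helical_wheel(sequence):
    """Return (residue, angle in degrees) pairs for the wheel projection."""
    return [(aa, (i * DEGREES_PER_RESIDUE) % 360.0) for i, aa in enumerate(sequence)]


def hydrophobic_moment(sequence):
    """Length of the vector sum of hydrophobicities around the wheel.

    A large value suggests hydrophobic residues cluster on one face,
    as expected for an amphipathic helix.
    """
    x = y = 0.0
    for aa, angle in helical_wheel(sequence):
        h = KYTE_DOOLITTLE[aa]
        x += h * math.cos(math.radians(angle))
        y += h * math.sin(math.radians(angle))
    return math.hypot(x, y)


if __name__ == "__main__":
    for residue, angle in helical_wheel("LKELLKKL"):
        print(residue, angle)
    print(round(hydrophobic_moment("LKELLKKL"), 2))
```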
Secondary Structure Prediction
• The presence of secondary structure elements can
be predicted.
• Current algorithms rely on:
– statistics (Chou-Fasman, GOR)
– homology or nearest neighbor comparisons (Levin)
– physico-chemical properties (Lim, Eisenberg)
– pattern matching (Cohen, Rooman)
– neural networks (Qian & Sejnowski, Karplus)
– evolutionary methods (Barton, Niemann)
– and combined approaches (Rost, Levin, Argos)
Chou-Fasman Algorithm
1. Assign each residue a Pa (helix), Pb (sheet) and Pc (coil) value
2. Take a window of 7 residues and calculate a
window-averaged value for each of Pa, Pb, Pc
3. Assign the average value for each of the
secondary structures to the middle residue
4. Move down one residue and repeat steps 2
through 3 until finished
5. Scan the sequence and assign each residue the secondary
structure (SS) with the highest averaged P
Chou-Fasman Statistics
Table 8
Chou & Fasman Secondary Structure Propensity of the Amino Acids
Residue   Pa     Pb     Pc       Residue   Pa     Pb     Pc
A         1.42   0.83   0.75     M         1.45   1.05   0.50
C         0.70   1.19   1.11     N         0.67   0.89   1.44
D         1.01   0.54   1.45     P         0.57   0.55   1.88
E         1.51   0.37   1.12     Q         1.11   1.10   0.79
F         1.13   1.38   0.49     R         0.98   0.93   1.09
G         0.57   0.75   1.68     S         0.77   0.75   1.48
H         1.00   0.87   1.13     T         0.83   1.19   0.98
I         1.08   1.60   0.32     V         1.06   1.70   0.24
K         1.16   0.74   1.10     W         1.08   1.37   0.45
L         1.21   1.30   0.49     Y         0.69   1.47   0.84
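The windowed procedure described in this lecture can be sketched with the Table 8 propensities. This is a simplified illustration of the window-averaging steps only, not the full published Chou-Fasman method with its nucleation and extension rules. Truncating the window at the sequence ends is an assumption, as are the function name and the H/E/C state labels.

```python
# (Pa, Pb, Pc) propensities from Table 8: helix, sheet, coil.
PROPENSITY = {
    "A": (1.42, 0.83, 0.75), "C": (0.70, 1.19, 1.11), "D": (1.01, 0.54, 1.45),
    "E": (1.51, 0.37, 1.12), "F": (1.13, 1.38, 0.49), "G": (0.57, 0.75, 1.68),
    "H": (1.00, 0.87, 1.13), "I": (1.08, 1.60, 0.32), "K": (1.16, 0.74, 1.10),
    "L": (1.21, 1.30, 0.49), "M": (1.45, 1.05, 0.50), "N": (0.67, 0.89, 1.44),
    "P": (0.57, 0.55, 1.88), "Q": (1.11, 1.10, 0.79), "R": (0.98, 0.93, 1.09),
    "S": (0.77, 0.75, 1.48), "T": (0.83, 1.19, 0.98), "V": (1.06, 1.70, 0.24),
    "W": (1.08, 1.37, 0.45), "Y": (0.69, 1.47, 0.84),
}
STATES = "HEC"  # helix (Pa), sheet (Pb), coil (Pc)


def predict_secondary_structure(sequence, window=7):
    """Assign each residue the state with the highest window-averaged P."""
    half = window // 2
    prediction = []
    for center in range(len(sequence)):
        # Assumption: near the ends, use whatever part of the window exists.
        start = max(0, center - half)
        stop = min(len(sequence), center + half + 1)
        segment = sequence[start:stop]
        averages = [
            sum(PROPENSITY[aa][state] for aa in segment) / len(segment)
            for state in range(3)
        ]
        prediction.append(STATES[averages.index(max(averages))])
    return "".join(prediction)


if __name__ == "__main__":
    seq = "MAEELLKKAVVIVGSPGNG"
    print(seq)
    print(predict_secondary_structure(seq))
```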
The PHD Approach
[Figure: overview of the approach]
The PHD Algorithm
• Search the SWISS-PROT database and
select high scoring homologues
• Create a sequence “profile” from the resulting
multiple alignment
• Include global sequence info in the profile
• Input the profile into a trained two-layer
neural network to predict the structure and to
“clean-up” the prediction
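The profile-building step can be illustrated with a small sketch that turns a multiple alignment into per-column residue frequencies. This covers only that one step, under stated assumptions: the alignment is a made-up toy example, gaps are written as "-", and the function name is hypothetical. The database search, the global sequence information and the neural network itself are not shown.

```python
from collections import Counter


def sequence_profile(alignment):
    """Per-column residue frequencies for a gapped multiple alignment.

    Gaps ("-") are ignored when computing frequencies; a column made
    entirely of gaps yields an empty dictionary.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        profile.append({aa: n / len(residues) for aa, n in counts.items()})
    return profile


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    toy_alignment = ["ACDEK", "ACEEK", "A-EDR"]
    for position, frequencies in enumerate(sequence_profile(toy_alignment), start=1):
        print(position, frequencies)
```

Feeding column frequencies rather than a single sequence into the predictor is what lets conserved positions carry more weight than variable ones.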
Predicting via Neural Nets & PSSM
• PHDhtm
– http://www.embl-heidelberg.de/predictprotein/
• TMAP
– http://www.mbb.ki.se/tmap/index.html
• TMPred
– http://www.ch.embnet.org/software/TMPRED_form.html
Prediction Performance
[Figure: bar chart of scores (%), on an axis from 45 to 75, for the methods PHD, ZHANG, GOR III, JASEP7, PTIT, LEVIN, LIM, GOR I and CF]
Best of the Best
• PredictProtein-PHD (72%)
– http://cubic.bioc.columbia.edu/predictprotein
• Jpred (73-75%)
– http://www.compbio.dundee.ac.uk/~www-jpred/
• PREDATOR (75%)
– http://www.hgmp.mrc.ac.uk/Registered/Option/predator.html
• PSIpred (77%)
– http://bioinf.cs.ucl.ac.uk/psipred/
Basics of Protein Structure
• Primary
• Secondary
• Tertiary
Tertiary Structure
• Lactate Dehydrogenase: mixed α / β
• Immunoglobulin Fold: β
• Hemoglobin B Chain: α
Protein Structure Databases
• Where does protein structural information
reside?
– PDB:
• http://www.rcsb.org/pdb/
– MMDB:
• http://www.ncbi.nlm.nih.gov/Structure/
– FSSP:
• http://www.ebi.ac.uk/dali/fssp/
– SCOP:
• http://scop.mrc-lmb.cam.ac.uk/scop/
– CATH:
• http://www.biochem.ucl.ac.uk/bsm/cath_new/
Structural Proteomics
• Aim to delineate total repertoire of protein
folds
• Provide 3D portraits for all proteins in an
organism
• Goal: Use structure to infer function.
– Compare structure of unknown protein to known
set of structures
– More sensitive than primary sequence
comparisons
The Protein Fold Universe
• How big is it? 500? 2000? 10000? ∞?
Structures in PDB
• PDB = 19860 structures (Jan 03)
• PDB = 23997 structures (Jan 04)
• “structural genomics” search = 156 structures (Jan 03)
• “structural genomics” search = 478 structures (Jan 04)
[Figure: growth in the number of sequences and structures, 1980 to 2000, plotted on two vertical axes (0 to 500000 and 0 to 100000)]
Structural Proteomics
Unique folds in PDB
Prediction Methods for
3D structure
• Intermediate Steps
– Predict secondary structure
– Calculate solvent accessibility
• Methods for 3D structure prediction based on:
– Threading, Homology Modeling or Fold
recognition
• Similarity in amino acid sequence implies similar
structure/function
– Ab Initio Techniques
• Numerical methods designed to simulate the structure
and dynamics of macromolecules
Proteomics
• The study of the expression, location,
interaction, function and structure of all
the proteins in a given cell or organism
• Expressional Proteomics
• Functional Proteomics
• Structural Proteomics
Proteomics
• Expressional Proteomics
• 2D or Capillary Electrophoresis, protein chips
• Mass Spectrometry, Laser induced fluorescence
• Functional Proteomics
• Mass Spectrometry, micro-assays, protein chips
• Yeast or Bacterial 2-hybrid systems
• Structural Proteomics
• High throughput X-ray crystallography
• High throughput NMR spectroscopy
2D Gel Principles
[Figure: 2D gel schematic, labeled SDS-PAGE]
Mass Spec Principles
[Figure: mass spectrometer schematic: sample, ionizer, mass filter, detector]
Ionization Methods
• MALDI: 370 nm UV laser, cyano-hydroxy cinnamic acid matrix
• ESI: fluid (no salt), gold tip needle
Protein ID Protocol
Computational Tools for
Protein Identification
• PeptIdent
– http://us.expasy.org/tools/peptident.html
• Mascot
– http://www.matrixscience.com/search_form_select.html
(Covered in Lab 6.4)
• ProteinProspector
– http://prospector.ucsf.edu/
• MOWSE
– http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
• PeptideSearch
– http://www.mann.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html
• AACompSim/AACompIdent
– http://www.expasy.ch/tools
Proteomics
• Human proteome estimated to contain
500,000+ proteins
• The next “big wave” in bioinformatics
• How to deal with so much data?
• How to link structure to function to sequence?
• How to show or store temporal and spatial data?
• How to use it in drug discovery & development?
Proteomics Workshop
July 19 – 24th, 2004
Calgary, Alberta
The Cutting Edge of Proteomics
• Evolution of Proteomes
• Structural Genomics
• Quantitative Mass Spectrometry and
Protein Chip Technology
• Chemical Proteomics
• Proteome Scale Analysis of Networks, i.e.,
signal transduction, Y2H experiments
Global Proteome Interaction
Mapping in C. elegans
Science 303: 540 (23 January 2004)
see also Science 287: 116 (7 January 2000)
Yeast Two Hybrid (Y2H)
on the genomic scale
• Global interaction map
of C. elegans
• Use proteome as bait in
Y2H experiment
• Detect all pairwise
interactions
• Create global
protein:protein
interaction network
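Assembling pairwise bait/prey hits into a global network can be sketched as building an undirected graph. The block below is a toy illustration: the protein names and interactions are invented placeholders, not data from the C. elegans study, and the function names are hypothetical.

```python
from collections import defaultdict


def build_interaction_network(pairs):
    """Turn (bait, prey) pairs into an undirected adjacency mapping."""
    network = defaultdict(set)
    for bait, prey in pairs:
        network[bait].add(prey)
        network[prey].add(bait)  # an interaction is symmetric in the network
    return dict(network)


def proteins_by_degree(network):
    """List proteins from most to fewest interaction partners."""
    return sorted(network, key=lambda protein: (-len(network[protein]), protein))


if __name__ == "__main__":
    # Hypothetical Y2H hits, for illustration only.
    hits = [("protA", "protB"), ("protA", "protC"), ("protC", "protD"), ("protB", "protA")]
    network = build_interaction_network(hits)
    for protein in proteins_by_degree(network):
        print(protein, sorted(network[protein]))
```

Storing partners in sets means a pair detected in both bait/prey orientations is counted once, which matters when the same proteome is screened as both bait and prey.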
Protein:Protein Interaction
Networks
DNA vs Protein Chip Technology
• DNA microtechnology
– Can successfully read 1000’s of side by side
measurements of RNA levels
– BUT RNA ≠ protein = function
• Protein Microarray Technology
– Goal: develop protein chip with proteins in active
state.
• Proteins more challenging to prepare than DNA/RNA
• Protein functionality depends on state, modifications,
binding partners, localization etc.
Protein Chip - Methods
• Attachment Methods:
• Diffusion
• Adsorption
– nitrocellulose
• Covalent Crosslinking
– Reactive surfaces
• Affinity Attachment
– Affinity tags
Protein Chip - Applications
• Antibody Chip
– Detect Ag-Ab interactions
• Protein Chip
– Protein:protein
– Protein:drug
– Enzyme:substrate
• Ligand Chip
• And more….
Protein Chips
Summary
• Protein sequences, and consequently
protein sequence databases, are much
more complex than their DNA counterparts
• Prediction of protein structure is a
complex problem at both the secondary (2°)
and tertiary (3°) levels
• Proteomics initiatives based on different
technologies are making inroads into the
study of protein structure and function on
a global level