Information Encoding in Biological Molecules: DNA and protein

Protein Structure Threading
(Fold recognition)
Boris Steipe
University of Toronto
[email protected]
(Slides evolved from original material by Chris Hogue, David Wishart, Gary Van Domselaar and Boris Steipe)
3.3b
1
Concept 1:
Threading
methods can
sometimes find
similar folds.
3.3b
2
Definition
• Threading - A protein fold recognition
technique that involves incrementally
replacing the sequence of a known protein
structure with a query sequence of unknown
structure. The new “model” structure is
evaluated using a simple heuristic measure of
protein fold quality. The process is repeated
against all known 3D structures until an
optimal fit is found.
3.3b
3
Why Threading?
• Secondary structure is more conserved than
primary structure
• Tertiary structure is more conserved than
secondary structure
• Therefore very remote relationships can be
better detected through 2o or 3o structural
homology instead of sequence homology
3.3b
4
Fold recognition ("Threading")
Template Structure
Query Sequence
Query Sequence
Query Sequence
Query Sequence
3.3b
5
Threading
• Database of 3D structures and sequences
– Protein Data Bank (or non-redundant subset)
• Query sequence
– Sequence < 25% identity to known structures
• Alignment protocol
– Dynamic programming
• Evaluation protocol
– Distance-based potential or secondary structure
• Ranking protocol
3.3b
6
Concept 2:
Threading can be
done in 2D and 3D
(even 1D)
3.3b
7
2 Kinds of Threading
• 2D Threading or Prediction Based Methods
(PBM)
– Predict secondary structure (SS) or ASA of query
– Evaluate on basis of SS and/or ASA matches
• 3D Threading or Distance Based Methods
(DBM)
– Create a 3D model of the structure
– Evaluate using a distance-based “hydrophobicity” or
pseudo-thermodynamic (empirical) potential
3.3b
8
2D Threading Algorithm
• Convert PDB to a database containing
sequence, SS and ASA information
• Predict the SS and ASA for the query
sequence using a “high-end” algorithm
• Perform a dynamic programming alignment
using the query against the database (include
sequence, SS & ASA)
• Rank the alignments and select the most
probable fold
3.3b
9
2o Structure Identification
• DSSP - Database of Secondary Structures for Proteins
(swift.embl-heidelberg.de/dssp)
• VADAR - Volume Area Dihedral Angle Reporter
(redpoll.pharmacy.ualberta.ca)
• PDB - Protein Data Bank (www.rcsb.org)
QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA
HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC
3.3b
10
ASA Calculation
• DSSP - Database of Secondary Structures for Proteins
(swift.embl-heidelberg.de/dssp)
• VADAR - Volume Area Dihedral Angle Reporter
(www.redpoll.pharmacy.ualberta.ca/vadar/)
• GetArea - www.scsb.utmb.edu/getarea/area_form.html
QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCAMD
BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE
1056298799415251510478941496989999999
3.3b
11
Other ASA sites
• Connolly Molecular Surface Home Page
– http://www.biohedron.com/
• Naccess Home Page
– http://sjh.bi.umist.ac.uk/naccess.html
• ASA Parallelization
– http://cmag.cit.nih.gov/Asa.htm
• Protein Structure Database
– http://www.psc.edu/biomed/pages/research/PSdb/
3.3b
12
2D Threading Algorithm
• Convert PDB to a database containing
sequence, SS and ASA information
• Predict the SS and ASA for the query
sequence using a “high-end” algorithm
• Perform a dynamic programming alignment
using the query against the database (include
sequence, SS & ASA)
• Rank the alignments and select the most
probable fold
3.3b
13
ASA Prediction
• PredictProtein-PHDacc (58%)
– http://cubic.bioc.columbia.edu/predictprotein
• PredAcc (70%?)
– condor.urbb.jussieu.fr/PredAccCfg.html
QHTAW...
3.3b
QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB
14
2D Threading Algorithm
• Convert PDB to a database containing
sequence, SS and ASA information
• Predict the SS and ASA for the query
sequence using a “high-end” algorithm
• Perform a dynamic programming alignment
using the query against the database (include
sequence, SS & ASA)
• Rank the alignments and select the most
probable fold
3.3b
15
2D Threading Performance
• In test sets 2D threading methods can identify
30-40% of proteins having very remote
homologues (i.e. not detected by BLAST)
using “minimal” non-redundant databases
(<700 proteins)
• If the database is expanded ~4x the
performance jumps to 70-75%
• Performs best on true homologues as
opposed to postulated analogues
3.3b
16
2D Threading Advantages
• Algorithm is easy to implement
• Algorithm is much faster than 3D threading
approaches
• The 2D database is small (<500 kbytes)
compared to 3D database (>1.5 Gbytes)
• Appears to be just as accurate as DBM or
other 3D threading approaches
• Very amenable to web servers
3.3b
17
Servers - PredictProtein
3.3b
18
Servers - 123D
3.3b
19
Servers - GenThreader
3.3b
20
More Servers - www.bronco.ualberta.ca
3.3b
21
2D Threading Disadvantages
• Reliability is not 100% making most threading
predictions suspect unless experimental
evidence can be used to support the conclusion
• Does not produce a 3D model at the end of the
process
• Doesn’t include all aspects of 2o and 3o
structure features in prediction process
• PSI-BLAST may be just as good (faster too!)
(PSI-BLAST: 1D "threading")
3.3b
22
Making it Better
• Include 3D threading analysis as part of the
2D threading process -- offers another layer of
information
• Include more information about the “coil” state
(3-state prediction isn’t good enough)
• Include other biochemical (ligands, function,
binding partners, motifs) or phylogenetic
(origin, species) information
3.3b
23
3D Threading Servers
• Generate 3D models or coordinates of
possible models based on input sequence
• Loopp (version 2)
– http://ser-loopp.tc.cornell.edu/loopp.html
• 3D-PSSM
– http://www.sbg.bio.ic.ac.uk/~3dpssm/
• Prospect
– http://compbio.ornl.gov/prospect/
• All require email addresses since the process
may take hours to complete
3.3b
24
Threading Database Search
• Premise is that most sequences match some 3-D structure that
is already known (1/2)
• Given a database of known 3-D protein folds:
– align the test sequence to each known protein
– in real 3-D coordinate space (slow but exact)
– in parameterized 1-D space (fast but approximate)
• optimize some scoring function
• sort out best sequence-structure alignment
• assess alignments - statistically significant?
3.3b
25
Threading Statistics
• Z score (sequence composition correction)
• number of standard deviations the found alignment is off from the mode
of a randomized version of the structure or profile
• P value (sequence length correction)
• Shuffle the sequence - make a distribution of random threads…
• Is the unscrambled thread any better than a randomly optimized
sequence…
• Z score of Z scores
• Look for P values as a criterion for choosing a threading
method...
3.3b
26
Database Searching...
• Sensitivity
– High sensitivity implies finding all possible true
positive matches in the database
• Specificity
– High specificity implies finding no false positive
matches in the search.
3.3b
27
Interpret Threading Accordingly...
• In a ranked list of 10 matches, expect that
only one might be correct
• Expect that none may be correct
• Expect that the top ranked hit is a false
positive...
3.3b
28
How then does
Threading find things?
• If there is a true positive in a threading search
hit list - People find it ...
• It is most often found by FUNCTIONAL
similarity.
– Similar enzymatic mechanisms
– Motifs, DART ...
– Similar roles, cellular distributions ...
3.3b
29