Protein Structure Threading (Fold Recognition)

Boris Steipe
University of Toronto
(Slides evolved from original material by Chris Hogue, David Wishart, Gary Van Domselaar and Boris Steipe)
3.3b

Concept 1: Threading methods can sometimes find similar folds.

Definition
• Threading: a protein fold recognition technique that involves incrementally replacing the sequence of a known protein structure with a query sequence of unknown structure. The new "model" structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found.

Why Threading?
• Secondary structure is more conserved than primary structure
• Tertiary structure is more conserved than secondary structure
• Therefore very remote relationships can be detected better through 2° or 3° structural homology than through sequence homology

Fold recognition ("Threading")
[Figure: query sequences threaded onto a template structure]

Threading
• Database of 3D structures and sequences
  – Protein Data Bank (or non-redundant subset)
• Query sequence
  – Sequence with < 25% identity to known structures
• Alignment protocol
  – Dynamic programming
• Evaluation protocol
  – Distance-based potential or secondary structure
• Ranking protocol

Concept 2: Threading can be done in 2D and 3D (even 1D)

2 Kinds of Threading
• 2D Threading or Prediction Based Methods (PBM)
  – Predict secondary structure (SS) or accessible surface area (ASA) of the query
  – Evaluate on the basis of SS and/or ASA matches
• 3D Threading or Distance Based Methods (DBM)
  – Create a 3D model of the structure
  – Evaluate using a distance-based "hydrophobicity" or pseudo-thermodynamic (empirical) potential

2D Threading Algorithm
• Convert the PDB to a database containing sequence, SS and ASA information
• Predict the SS and ASA for the query sequence using a "high-end" algorithm
• Perform a dynamic programming alignment of the query against the database (include sequence, SS & ASA)
• Rank the alignments and select the most probable fold

2° Structure Identification
• DSSP - Database of Secondary Structures for Proteins (swift.embl-heidelberg.de/dssp)
• VADAR - Volume Area Dihedral Angle Reporter (redpoll.pharmacy.ualberta.ca)
• PDB - Protein Data Bank (www.rcsb.org)

  QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA
  HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC

ASA Calculation
• DSSP - Database of Secondary Structures for Proteins (swift.embl-heidelberg.de/dssp)
• VADAR - Volume Area Dihedral Angle Reporter (www.redpoll.pharmacy.ualberta.ca/vadar/)
• GetArea - www.scsb.utmb.edu/getarea/area_form.html

  QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCAMD
  BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE
  1056298799415251510478941496989999999

Other ASA sites
• Connolly Molecular Surface Home Page
  – http://www.biohedron.com/
• Naccess Home Page
  – http://sjh.bi.umist.ac.uk/naccess.html
• ASA Parallelization
  – http://cmag.cit.nih.gov/Asa.htm
• Protein Structure Database
  – http://www.psc.edu/biomed/pages/research/PSdb/

ASA Prediction
• PredictProtein-PHDacc (58%)
  – http://cubic.bioc.columbia.edu/predictprotein
• PredAcc (70%?)
  – condor.urbb.jussieu.fr/PredAccCfg.html

  QHTAW...
  QHTAWCLTSEQHTAAVIW
  BBPPBEEEEEPBPBPBPB

2D Threading Performance
• In test sets, 2D threading methods can identify 30-40% of proteins having very remote homologues (i.e. not detected by BLAST) using "minimal" non-redundant databases (<700 proteins)
• If the database is expanded ~4x, the performance jumps to 70-75%
• Performs best on true homologues as opposed to postulated analogues

2D Threading Advantages
• Algorithm is easy to implement
• Algorithm is much faster than 3D threading approaches
• The 2D database is small (<500 kbytes) compared to the 3D database (>1.5 Gbytes)
• Appears to be just as accurate as DBM or other 3D threading approaches
• Very amenable to web servers

Servers - PredictProtein

Servers - 123D

Servers - GenThreader

More Servers - www.bronco.ualberta.ca

2D Threading Disadvantages
• Reliability is not 100%, making most threading predictions suspect unless experimental evidence can be used to support the conclusion
• Does not produce a 3D model at the end of the process
• Doesn't include all aspects of 2° and 3° structure features in the prediction process
• PSI-BLAST may be just as good (faster too!)
(PSI-BLAST: 1D "threading")

Making it Better
• Include 3D threading analysis as part of the 2D threading process, which offers another layer of information
• Include more information about the "coil" state (3-state prediction isn't good enough)
• Include other biochemical (ligands, function, binding partners, motifs) or phylogenetic (origin, species) information

3D Threading Servers
• Generate 3D models or coordinates of possible models based on the input sequence
• Loopp (version 2)
  – http://ser-loopp.tc.cornell.edu/loopp.html
• 3D-PSSM
  – http://www.sbg.bio.ic.ac.uk/~3dpssm/
• Prospect
  – http://compbio.ornl.gov/prospect/
• All require email addresses since the process may take hours to complete

Threading Database Search
• Premise is that most sequences match some 3-D structure that is already known (1/2)
• Given a database of known 3-D protein folds:
  – align the test sequence to each known protein
  – in real 3-D coordinate space (slow but exact)
  – in parameterized 1-D space (fast but approximate)
• optimize some scoring function
• sort out the best sequence-structure alignment
• assess alignments - statistically significant?

Threading Statistics
• Z score (sequence composition correction)
• number of standard deviations by which the found alignment is off from the mode of a randomized version of the structure or profile
• P value (sequence length correction)
• Shuffle the sequence - make a distribution of random threads...
• Is the unscrambled thread any better than a randomly optimized sequence...
• Z score of Z scores
• Look for P values as a criterion for choosing a threading method...

Database Searching...
• Sensitivity
  – High sensitivity implies finding all possible true positive matches in the database
• Specificity
  – High specificity implies finding no false positive matches in the search

Interpret Threading Accordingly...
• In a ranked list of 10 matches, expect that only one might be correct
• Expect that none may be correct
• Expect that the top ranked hit is a false positive...

How then does Threading find things?
• If there is a true positive in a threading search hit list, people find it ...
• It is most often found by FUNCTIONAL similarity.
  – Similar enzymatic mechanisms
  – Motifs, DART ...
  – Similar roles, cellular distributions ...
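
The 2D threading steps and the shuffle-based Z score described in the slides above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any of the listed servers: the scoring weights, the linear gap penalty, the function names and the query's SS string are all hypothetical. The Z score here is taken against the mean of the shuffled scores (the usual convention), and the shuffle permutes whole (residue, SS, ASA) positions instead of re-predicting SS and ASA for each shuffled sequence. The template strings are the examples shown on the 2° Structure Identification and ASA Calculation slides.

```python
import random
from statistics import mean, stdev

# Illustrative weights and gap penalty (assumptions, not from the slides).
W_SEQ, W_SS, W_ASA, GAP = 1.0, 1.0, 0.5, -1.0


def make_profile(seq, ss, asa):
    """Zip per-residue strings into (residue, SS state, ASA class) tuples.
    zip() stops at the shortest string."""
    return list(zip(seq, ss, asa))


def match_score(q, t):
    """Score one query position against one template position."""
    return (W_SEQ * (q[0] == t[0])
            + W_SS * (q[1] == t[1])
            + W_ASA * (q[2] == t[2]))


def align_score(query, template, gap=GAP):
    """Global dynamic programming (Needleman-Wunsch) alignment score."""
    m = len(template)
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(query) + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + match_score(query[i - 1], template[j - 1]),
                prev[j] + gap,      # gap in the template
                cur[j - 1] + gap,   # gap in the query
            )
        prev = cur
    return prev[m]


def z_score(query, template, n_shuffles=200, seed=0):
    """Standard deviations between the real score and shuffled-query scores."""
    rng = random.Random(seed)
    real = align_score(query, template)
    shuffled = list(query)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        scores.append(align_score(shuffled, template))
    sd = stdev(scores)
    return (real - mean(scores)) / sd if sd > 0 else 0.0


# Template: sequence, SS and ASA class strings from the slides.
template = make_profile(
    "QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA",
    "HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC",
    "BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE",  # truncated to 35 by zip()
)
# Query: sequence and ASA prediction from the ASA Prediction slide.
query = make_profile(
    "QHTAWCLTSEQHTAAVIW",
    "HHHHHHCCEEEEEEEEEE",  # hypothetical predicted SS, for illustration only
    "BBPPBEEEEEPBPBPBPB",
)

if __name__ == "__main__":
    print("alignment score:", align_score(query, template))
    print("Z score:", round(z_score(query, template), 2))
```

To rank folds, one would compute this Z score for the query against every template in the database and sort the results. As the slides caution, even a top-ranked hit should be treated as a candidate to be checked against functional evidence.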