KNOWLEDGE-BASED MACHINE LEARNING
METHODS FOR MACROMOLECULAR 3D
STRUCTURE PREDICTION
By Zhiyong Wang
Submitted to:
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave, Chicago, IL, 60637
For the degree of
Doctor of Philosophy in Computer Science
Thesis Committee:
Jinbo Xu (Thesis Supervisor)
David McAllester
Jie Liang
ABSTRACT
Predicting the 3D structure of a macromolecule, such as a protein or an RNA molecule, ranks among the most difficult and attractive problems in bioinformatics and computational biology. Its importance comes from the relationship between the 3D structure and the function of a given protein or RNA. 3D structures also help to find the ligands of a protein, which are usually small molecules, a key step in drug design. Unfortunately, there is no shortcut to accurately obtaining the 3D structure of a macromolecule. Many physical measurements of macromolecular 3D structures cannot scale up, due to their large labor costs and their requirements for lab conditions.
In recent years, computational methods have made huge progress due to advances in computation speed and machine learning methods. These methods need only the sequence information to predict 3D structures by employing various mathematical models and machine learning methods. The success of computational methods is highly dependent on a large database of proteins and RNAs with known structures.
However, the performance of computational methods still needs to be improved. There are several reasons for this. First, we are facing, and will continue to face, sparseness of data. The number of known 3D structures has increased rapidly in the past few years, but still falls behind the number of sequences. Structure data is much more expensive than sequence data. Secondly, the 3D structure space is too large for our computational capability. Computing speed is not nearly enough to simulate the atom-level folding process when computing the physical energy among all the atoms.
These two obstacles can be addressed by knowledge-based methods, which combine knowledge learned from the known structures with biologists' knowledge of the folding process of a protein or RNA. In this dissertation, I present my results in building knowledge-based methods that use machine learning to tackle this problem. My methods include knowledge constraints on intermediate states, which can greatly reduce the solution space of a protein or RNA, in turn increasing the efficiency of the structure folding method and improving its accuracy.
ACKNOWLEDGEMENT
There are so many people I am grateful to. I cannot thank enough those people who gave me generous support during the past years.
First, I thank my advisor, Professor Jinbo Xu. Without your guidance and great patience, I could not have finished my PhD study. Thanks to Professor David McAllester and Professor Jie Liang, who advised me on my PhD thesis proposal and defense. Thanks to all the professors who taught me during the last 6 years.
Thanks to my family, who supported me so much.
Thanks to my teammates and classmates.
Avleen Bijral
Somaye Hashemifar
Taehwan Kim
Jianzhu Ma
Jian Peng
Xinghua Shi
Siqi Sun
Hao Tang
Qingming Tang
Sheng Wang
Payman Yadollahpour
Feng Zhao
It is my great honor to have had the chance to work and study together with you.
TABLE OF CONTENTS
Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables
Chapter 0. The informational life of a protein and an RNA
Chapter 1. Protein/RNA 3D Structure Research
Chapter 2. Problem Definition and Basic Concepts
Chapter 3. Intermediate states bridging sequences and 3D structures
Protein Secondary Structure
Protein contact map
Chapter 4. 3D structure prediction
RNA 3D structure sampling
Intermediate states used for protein 3D structure prediction
Method Overview
Feature vector
Neural Network Optimization and Model Selection
Training Dataset
Data preprocessing
Computing environment
Result
The folding result for CASP10 targets
Summary
Chapter 5. Conclusion
Reference
LIST OF FIGURES
Figure 1. Conditional neural field model for 8-class secondary structure prediction. The sigmoid neural network layer transforms the input features in a non-linear way, and the second-order conditional random field models neighborhood dependency as well as the output of the neural network.
Figure 2. Q8 accuracy of our CNF model and SSpro8 on the CB513 dataset. The accuracy is plotted against the Neff ranges of the sequences.
Figure 3. Flowchart of our mixture model with integer programming and random forests.
Figure 4. The top L/10 contact prediction accuracy (Y-axis) of our methods and other state-of-the-art methods on the CASP10 targets grouped by the Neff value (X-axis).
Figure 5. RMSD histogram of 1ESY shows that our method (left) produces more low-RMSD decoys than FARNA (right).
Figure 6. The neural network model used in our potential energy function.
LIST OF TABLES
Table 1. Q8 accuracy and SOV (Segment Overlap Value) for each type of 8-class secondary structure prediction results.
Table 2. Compared with FARNA, our method produces better decoys in terms of the best cluster centroid. The bold numbers indicate the better results. The best cluster centroid is clearly seen from the 5 clusters of all the decoys.
Table 3. Performance on the Rosetta dataset.
Table 4. An extensive comparison of our new method and the Epad energy function on 3 datasets.
Table 5. The sampling results on CASP10, 12 free modelling domains.
Table 6. The targets on which our method achieves a TM-score > 0.4.
CHAPTER 0. THE INFORMATIONAL LIFE OF A PROTEIN AND AN RNA
Before we dive into the study of protein/RNA structure prediction, we first need a map of where proteins and RNAs sit in a living organism. Though many college biology textbooks introduce what a protein or RNA is and how it is produced, employed, and decomposed in living cells, they usually present these topics in great detail. The low-level detail of biology can bury the high-level meaning of their functions. In fact, the high-level functions of proteins and RNAs give a whole picture of the biological network and its complexity. In order to understand the high-level meaning of protein and RNA function, we present the informational life of a protein and an RNA by zooming in on their roles in storing, carrying, and interpreting information in different phases of their life. It is easier to understand the complexity of the whole protein/RNA system once we find an isomorphism between their informational roles and a software system.
Proteins and RNAs are found in every species discovered on Earth, without a single exception. All living organisms complete their life cycles through a series of biochemical reactions with the help of various proteins or RNAs. In the advanced life form, eukaryotic cells, the first part of a life cycle takes place in the nucleus. DNA, the genetic information carrier, duplicates itself. The information in DNA can be copied to RNA when it needs to be passed outside the nucleus. The information in RNA is then consumed in the ribosome for protein synthesis. That is a simplified version of the central dogma (Crick, Francis. "Central dogma of molecular biology." Nature 227.5258 (1970): 561-563).
Beyond the central dogma, a real biochemical system consists of a large number of macromolecules and their interactions. These interactions include much interference with, and regulation of, the way information is transferred from DNA to RNA and from RNA to protein synthesis. As a result, the amount of a protein in a living cell can increase or decrease dramatically within a short time due to a feedback loop. In some species of virus, RNA can play the same role as DNA as the genetic information carrier. Thus the chain of biochemical reactions is shorter.
All these feedback and feed-forward loops formed by biochemical reactions make the biological system the most complex system known to humans. Over the past centuries of scientific history, people have invested a great deal to understand this complex system, which will benefit human health. The root cause of such complexity is the self-referential relationship, or feedback loop, as seen in many other complex systems.
We can compare the biological system with another complex system, the software system, which is entirely man-made and widely studied.
In the computer software world, data is a static object, which can be stored on a hard disk or in computer memory. A program is dynamic: it has the capability to process data. The complexity of programs arises not because they can change data on a hard disk or in memory, but because they can change their own behavior. Proteins and RNAs regulate their own production in the same way.
Data can be stored in a computer in a serialized format or in a hierarchical structure. Programs are usually written in a specific syntax, which also implies their complexity. The world of data is convex, which means that a convex combination of two scalars is still a scalar. Two syntactically correct C++ programs cannot be convexly combined into another syntactically correct C++ program. We can enumerate all 4x4 matrices with binary entries, and we can sample a random 8x8 matrix from a multivariate Gaussian distribution, but there is no comparably simple way to sample a random C++ program with correct syntax. The same is true for protein and RNA structures: we cannot enumerate all the possible structures of a protein or an RNA.
In a living cell, protein/RNA structure is closer to program than to data. A crystallized protein/RNA has a static structure, given by the coordinates of all its atoms, i.e. (x, y, z) for each atom. In a living cell environment, a protein has the functionality to turn some biochemical reactions on or off. Such a switching mechanism makes proteins behave like branching in a program. Protein structure defines which biochemical reactions a protein can change. Thus, the structure of a protein molecule is a program.
These observations hold for most, but not all, high-level programming languages. If we consider another high-level language, Lisp, it is not easy to draw a clear line between data and program. In Lisp, the only syntax one needs to follow is to write everything as lists. A valid Lisp program is a set of recursively defined lists. This is just how RNAs play their roles in a living organism: an RNA is both data and a functional unit consuming the genetic information stored in it.
To fulfill these complex roles, RNA and protein macromolecules have evolved great complexity in their structures. From strands to helices and sheets, their structures consist of many levels of sub-modules, like program libraries. By combining these sub-modules, protein and RNA molecules work together as a whole system. In the following parts, we will try to understand this complexity in RNA and protein structure by using machine learning methods.
The rest of this thesis is organized as follows. Chapter 1 introduces the background, motivation, and challenges of related studies. Chapter 2 lists basic concepts and problem definitions. Chapter 3 is about how to predict intermediate states, in both protein and RNA. Chapter 4 presents our results on applying intermediate state prediction to 3D structure prediction. Chapter 5 concludes with the difficulties and challenges of this study.
CHAPTER 1. PROTEIN/RNA 3D STRUCTURE RESEARCH
Brief History
From the discovery that X-rays could be used to determine the structure of matter, 45 years passed before Max Perutz and John Kendrew solved the first protein 3D structure, that of myoglobin, an iron- and oxygen-binding protein in muscle tissue. This work earned the two authors the Nobel Prize in Chemistry in 1962. Since then, the Chemistry Prize of 1964 was awarded for the structures of vitamin B12 and penicillin, and the Chemistry Prize of 1988 was awarded for the structure of a membrane protein. Recent milestone works include determining the structure of the ribosome (Venki Ramakrishnan, Thomas A. Steitz, Ada Yonath, Chemistry Prize of 2009), a large complex of protein and RNA, and of GPCRs (G-protein-coupled receptors) (Brian Kobilka and Robert Lefkowitz, Chemistry Prize of 2012). In addition to X-ray crystallography, other laboratory methods include Nuclear Magnetic Resonance spectroscopy (NMR) and cryo-electron microscopy (cryo-EM), each of which works under different laboratory conditions.
Along with the many successes in protein and RNA structure determination using different laboratory methods, rapid advances in high-throughput sequencing technology bring many more sequences with unknown structures. The speed of determining atom-level accurate structures by experimental methods always falls behind the speed at which new sequences accumulate. For this reason, computational methods became popular for providing a good approximate solution to this problem. Molecular dynamics methods were invented more than 50 years ago (Giddings and Byring, 1955), but were not designed for macromolecular structures. Even with huge computational resources, people still cannot fold a whole protein molecule using the molecular dynamics method. In this thesis, we focus on the homology-based methods pioneered by (Bowie, et al., 1991; Rost, 1996; Subramaniam, et al., 1996; Sutcliffe, et al., 1987), which make use of known structures.
Motivation
There is a close relationship between the 3D structure and the function of a macromolecule, such as a protein or an RNA. The function of a protein includes its capability to bind other molecules, which can be determined by the shape of pockets on the surface of the protein. With a predicted structure, we can model the binding affinity between a drug and its target, which helps evaluate the effect of the drug. 3D structure models can be used to estimate the sensitivity of a protein structure to mutations in the sequence, which helps explain differences in drug sensitivity among people. 3D structures can also explain the stability of a given protein or RNA. There are several ways to assess the stability of a protein, which is needed to evaluate an artificial design of a protein sequence. Other applications include drug repositioning, personalized diagnosis, and genomic analysis.
Challenge and difficulty
The biggest challenge for protein and RNA 3D structure determination is the gap between the sequence data and the structure data. Though we have solved more structures, many more sequences without known structures remain to be predicted. Another problem is how to determine and use appropriate methods to produce highly accurate results. A few experimental methods, such as electron microscopy, are available for obtaining low-accuracy 3D structures. However, it is not obvious how to integrate these results with current algorithms to improve 3D structure prediction accuracy.
All 3D structure prediction algorithms face much the same difficulty: the search space is extremely large. To enumerate structures, an algorithm would have to try all possible values of the three coordinates of each atom in space. In the coordinate representation, even if we consider only one atom for each amino acid or nucleotide, we have a search space of $\mathbb{R}^{3L}$, where L is the length, i.e. the number of amino acids in a protein or the number of nucleotides in an RNA. The coordinate representation of a macromolecular structure is too flexible because it ignores many properties of macromolecules. This means that when we enumerate a structure from its coordinate representation, we will with high probability get an invalid or unrealistic structure.
Structure search methods would be much more efficient if we knew how to distinguish the physically valid decoys without biological meaning from the biologically meaningful ones. Due to the lack of complete rules for how proteins and RNAs fold, we do not have efficient methods to test whether a decoy is biologically meaningful by computing an energy function on the whole decoy. Thus, many decoys produced by decoy sampling methods are rejected due to their poor quality.
CHAPTER 2. PROBLEM DEFINITION AND BASIC CONCEPTS
Definition of 3D structure
For a protein or an RNA molecule, its 3D structure consists of the coordinates of all its atoms. Consider a protein of a few hundred amino acids as an example. For simplicity, we can use the coordinates of the alpha carbon in each amino acid as a proxy for the position of the amino acid. This assumption comes from the observation that each amino acid has a relatively fixed 3D structure. Thus, the 3D structure of a protein is described by the coordinates of all the alpha carbons, or backbone atoms. For RNA, the difference is that each nucleotide has six backbone atoms, so we use two atoms per nucleotide as its representative backbone atoms in order to keep the representation simple and accurate (Cao, 2005; Jonikas 2009).
Using coordinates to describe 3D structures is good for visualization, but it is not a good idea for predictive modeling, for two reasons. First, the coordinates change with the coordinate system we choose. This makes it difficult to build comparable samples from different protein structures for training a machine learning model. Secondly, the coordinates of amino acids are not independent variables but are tightly coupled with each other. The distance between two neighboring amino acids has a nearly fixed value, which is a strict constraint between coordinates. If we built a learning model to directly predict the coordinates, we would have to consider all the constraints in the prediction step. For a 3D structure, the coordinates of an atom make little sense without considering its position relative to other atoms.
A better representation of the 3D structure is the set of dihedral angles along the chain of a protein or an RNA. The dihedral angle at a position in the sequence is computed between the two neighboring planes defined by four consecutive backbone atoms: the first three atoms define one plane and the last three define the other. The dihedral angles are invariant to the coordinate system we choose, and preserve the relative positions between backbone atoms.
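The dihedral angle described above can be sketched in a few lines of Python. This is a minimal illustration using the standard projection formula, not code from the thesis; the function name `dihedral` and the helper names are hypothetical, and the sign convention is an assumption.

```python
import math

def _sub(a, b):
    return [a[0] - b[0], a[1] - b[1], a[2] - b[2]]

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four consecutive atoms."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)          # central bond, shared by both planes
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = [c / norm for c in b1]
    # Components of b0 and b2 perpendicular to the central bond.
    v = [b0[k] - _dot(b0, b1) * b1[k] for k in range(3)]
    w = [b2[k] - _dot(b2, b1) * b1[k] for k in range(3)]
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

Because the angle depends only on differences between atom positions, rotating or translating all four atoms together leaves the result unchanged, which is the invariance property the text relies on.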
Another approach to representing the 3D structure is the fragment library method. A fragment library is a collection of real 3D structure fragments of very short sequences. This method builds a whole 3D structure decoy by assembling many fragments together. It thus excludes any structure containing an unphysical fragment; however, many valid structure decoys are also excluded from the representation, especially structures containing a fragment that is not in the library.
Definition of Intermediate states
The folding process from the initial sequence to the final functional structure includes several phases and transient states, which are still unclear in detail for most proteins and RNAs. Generally, the folding process spreads from groups of neighboring amino acids or nucleotides to the whole sequence. From the functional structures determined by experiments, scientists have also observed patterns of local 3D structure. These patterns may be associated with the intermediate states of folding, which include protein and RNA secondary structures and the protein contact map. Knowing these intermediate states for each local group of amino acids or nucleotides does not determine the accurate 3D structure of all atoms. However, intermediate states, such as a secondary structure or a contact map, are much simpler than the 3D structure and usually possess a nicer mathematical form. This simplicity invites a wider range of researchers to tackle the problem of predicting secondary structure rather than 3D structure. In turn, the results of intermediate state prediction can help improve the accuracy of 3D structure prediction.
Definition of common features
Before describing the detailed methods we propose for protein and RNA structure modeling, we present the definitions of some common symbols and notations used in the following sections.
Italic: Non-text notations, which can be variables, e.g. L; atom names, e.g. Cα; or secondary structure types
Å: Angstrom
Cα: Alpha carbon in an amino acid
Cβ: Beta carbon in an amino acid
H, E, C: α-helix, β-strand, and coil as secondary structure types
Sequence information:
In the following paragraphs, we denote by L the length of the query sequence.
PSSM, the position-specific scoring matrix, gives the mutation probability for each amino acid or nucleotide in the sequence of a protein or an RNA. For each position in the sequence, the mutation probability is a row vector indicating the probability for this position to mutate to each type of amino acid or nucleotide. The number of rows in a PSSM is the length of the protein or RNA. Thus, for a sequence of length L, the matrix will be L x 20 or L x 4 for a protein or an RNA, respectively. This matrix can be produced by using multiple sequence alignment programs. In our research, we use PSI-BLAST (Schäffer, et al., 2001) to compute the PSSM for protein sequences.
Distance Matrix: an L by L matrix, where the element at (i, j) is the distance between atom i and atom j in angstroms.
Contact: An atom pair has a contact if their distance is less than 8 Å. The contact between two amino acids is defined as the contact between their representative atoms. There are 3 ways to choose the representative atom of an amino acid: Cα, Cβ, or the closest two atoms of a pair of amino acids. An alternative distance threshold is 6 Å, which is not commonly used.
Contact Map: an L by L binary matrix describing whether there is a contact for each atom pair in the sequence. This binary contact can be relaxed into a probability. In the probability matrix, each element is a real value in [0, 1] indicating the probability that the two atoms have a contact.
Secondary Structure of a Protein: The secondary structure of a protein is a sequence of secondary structure elements, which can be 3-class or 8-class. Each element in the secondary structure sequence describes the local sub-structure pattern around the amino acid at the corresponding position in the protein sequence.
Secondary Structure of RNA: The secondary structure of an RNA describes the contact map of the RNA. It is a binary matrix, where each element represents whether two nucleotides are close in space and have a hydrogen bond between them.
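The distance matrix and contact map definitions above can be illustrated with a short Python sketch. It assumes one representative atom per residue (for example Cα) given as (x, y, z) tuples in angstroms; the function names are hypothetical, and excluding self-contacts on the diagonal is an assumption made here for convenience.

```python
import math

def distance_matrix(coords):
    """L x L matrix of pairwise distances between representative atoms."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(coords, threshold=8.0):
    """L x L binary matrix: 1 where two distinct residues are closer than threshold."""
    d = distance_matrix(coords)
    n = len(coords)
    return [[1 if i != j and d[i][j] < threshold else 0 for j in range(n)]
            for i in range(n)]

# Three residues on a line: the first two are 3.8 angstroms apart.
example = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(example)
```

Passing `threshold=6.0` gives the alternative cutoff mentioned in the definition.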
CHAPTER 3. INTERMEDIATE STATES BRIDGING SEQUENCES AND 3D STRUCTURES
The whole folding process of a protein/RNA molecule can be sliced into many different phases. Short-range interactions form after the sequence is assembled, followed by long-range interactions. These interactions can be grouped into several patterns, which are the intermediate states, including protein/RNA secondary structures and protein contact maps. For proteins, short-range interactions form secondary structures, and long-range interactions are associated with the protein contact map. For RNA, the secondary structure captures both short-range and long-range interaction patterns.
PROTEIN SECONDARY STRUCTURE
Each protein secondary structure element represents a simplified local pattern of the protein 3D structure. The local pattern involves amino acids that are consecutive within the sequence. In a simplified classification, there are 3 secondary structure types, α-helix, β-strand, and coil, as suggested by Linus Pauling and his co-workers more than 50 years ago (Pauling, et al., 1951). Among the three secondary structure types, helices and strands have a regular pattern, whereas coils are defined as the unclassified parts between helices and strands, without a stable pattern. This 3-class model was extended by Kabsch and Sander to 8 classes by adding sub-classes for each secondary structure class in the 3-class system (Kabsch and Sander, 1983). By using an 8-class classification, we have more information to distinguish a 4-helix from a 3-helix, and to distinguish among coil patterns.
We consider protein secondary structure as a kind of intermediate state for both practical reasons and biological evidence. The secondary structure of a protein provides partial information about its 3D structure beyond the sequence. Thus, it is used by biologists to identify protein function (Myers and Oas, 2001). It is also the basic unit in the process of folding and unfolding of a protein (Karplus and Weaver, 1994), where the secondary structures are an intermediate phase between the state of a sequence and the state of a 3D structure.
Many studies predict protein secondary structure from protein sequence information, including the sequence and PSSM (Pirovano and Heringa, 2010). Most methods focus on 3-class prediction. Neural network methods (Cuff and Barton, 1999; Holley and Karplus, 1989; Jones, 1999; Kneller, et al., 1990; Qian and Sejnowski, 1988; Rost, 1996; Rost and Sander, 1993; Rost and Sander, 1994) achieve Q3 accuracy of ~80%, among which PSIPRED is the most representative one. These methods take input features from each position of the sequence independently, and do not model the transition relationship from one position to the next. The Hidden Markov Model can capture this transition, but cannot model the non-linear relationship between sequence features and secondary structure labels. To fill this gap, we model this problem with a conditional neural field model, which is a probabilistic graphical model taking advantage of both neural networks and conditional random fields.
Compared with predicting 3-class secondary structure, the 8-class prediction problem is more challenging and more important. The 8-class secondary structure of a protein provides more information than the 3-class one: it distinguishes a 3-helix from a 4-helix and recognizes different types of loop regions. Predicting 8-class structure is much more difficult than predicting 3-class structure, because the distribution of the 8 classes is extremely unbalanced in the training data set.
By applying our second-order conditional neural fields to the 8-class prediction problem, we have significantly improved 8-class secondary structure prediction. Our second-order conditional neural field method is illustrated in Figure 1.
Figure 1. Conditional neural field model for 8-class secondary structure prediction. The sigmoid neural network layer transforms the input features in a non-linear way, and the second-order conditional random field models neighborhood dependency as well as the output of the neural network.
We evaluate our method by comparing it with other state-of-the-art methods, including SSpro8 (Pollastri, et al., 2002). Our numerical results show a significant improvement by our method and show how the Neff value affects the performance of secondary structure prediction. See Figure 2 and Table 1.
Figure 2. Q8 accuracy of our CNF model and SSpro8 on the CB513 dataset. The accuracy is plotted against the Neff ranges of the sequences.
Table 1. Q8 accuracy and SOV (Segment Overlap Value) for each type of 8-class secondary structure prediction results.
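Q3 and Q8 accuracy, the measures reported in this section, are both the fraction of positions whose predicted label matches the observed label. The sketch below shows that computation, together with one common convention for reducing the eight classes to three. The function names are hypothetical, and the particular 8-to-3 mapping is an assumption, since conventions vary between studies.

```python
def q_accuracy(predicted, observed):
    """Fraction of positions where the predicted label equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("label sequences must have the same length")
    if len(observed) == 0:
        raise ValueError("label sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Assumed reduction: helix-like classes to H, strand-like classes to E,
# everything else to coil (C).
EIGHT_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def reduce_to_three(labels):
    """Map an 8-class label string to a 3-class label string."""
    return "".join(EIGHT_TO_THREE.get(c, "C") for c in labels)

q8 = q_accuracy("HHGEEL", "HHHEEL")
q3 = q_accuracy(reduce_to_three("HHGEEL"), reduce_to_three("HHHEEL"))
```

In this toy case the prediction confuses a 3-helix (G) with a 4-helix (H), which counts as an error for Q8 but not for Q3 under the assumed reduction.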
PROTEIN CONTACT MAP
Another intermediate state in the process of a protein sequence folding into a 3D structure is the contact map. Compared with the local patterns of secondary structure, the protein contact map is a representation of long-range patterns in a 3D structure.
The protein contact map can be written as a binary matrix M, where $M_{i,j} = 0$ means the i-th amino acid and the j-th amino acid have no contact, i.e. their distance is not less than 8 Å, and $M_{i,j} = 1$ if and only if they have a contact. A contact between two amino acids usually implies a functional relationship; therefore, the prediction of the protein contact map is an important topic in structural biology (Ortiz, et al., 1999; Vassura, et al., 2008; Vendruscolo, et al., 1997; Wu, et al., 2011).
Most machine learning methods that have been applied to the protein contact prediction problem share two main pitfalls. The first is the failure to model the dependency between amino acid pairs. This dependency is very common for two beta strands forming a beta sheet or for two helices forming a parallel structure. The second is the difficulty of modeling the sparseness of a contact matrix. A protein contact map is sparse, and the total number of contacts scales linearly with the length of the sequence. Neither dependency nor sparseness had been considered by the machine learning methods available at the time we developed our own contact prediction method.
Besides parametric models, mutual information has been used to predict the protein contact map (Jones, et al., 2012; Morcos, et al., 2011); it measures the co-mutation relationship of each pair of amino acids. The problem with these methods is that they do not use template-based information. For most globular proteins, template information can provide a very accurate prediction of 3D structure. Conversely, for proteins without good homologs, it is difficult to use these two methods, which depend heavily on the homology search. These two methods also do not model the dependency between amino acid pairs.
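As an illustration of the co-mutation signal discussed above, the sketch below computes plain mutual information between two columns of a multiple sequence alignment. It is deliberately the textbook quantity, not the refined estimators of the cited methods; the function name and the toy columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns.

    Each column is a string with one residue per aligned sequence.
    """
    if len(col_i) != len(col_j) or len(col_i) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Two positions that always mutate together versus two that vary independently.
covarying = mutual_information("AACC", "GGTT")
independent = mutual_information("AACC", "GTGT")
```

The estimate is only meaningful when the alignment contains enough homologous sequences, which is why such methods depend so heavily on the homology search.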
The first method that systematically includes the pairwise dependency of amino acids is Astro-Fold (Klepeis and Floudas, 2003), which models the problem as an integer programming problem and encodes the dependency in linear constraints. However, Astro-Fold does not make use of evolutionary information, such as PSSM, which has proved very useful in many protein structure studies.
In order to take advantage of the homologous information of a protein sequence as well as to model the dependency between amino acid pairs, we created PhyCMAP, a mixture model integrating Random Forests and integer linear programming. In the first step of PhyCMAP, the random forest model predicts the contact probability between each pair of amino acids. These probabilities are the input to the second step, the integer linear programming model, which outputs filtered probabilities after removing the conflicts among amino acid pairs caused by the contact dependency.
Figure 3. Flowchart of our mixture model with integer programming and random forests.
By using this model, we obtain an improvement over other methods on the CASP10 target set (Wang and Xu, 2013).
Figure 4. The top L/10 contact prediction accuracy (Y-axis) of our methods and other state-of-the-art methods on the CASP10 targets grouped by the Neff value (X-axis).
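The top L/10 accuracy plotted in Figure 4 can be read as: rank all residue pairs by predicted contact probability, keep the best L/10, and report the fraction that are true contacts. A minimal sketch follows. The function name is hypothetical, and the minimum sequence separation of 6 is an assumed parameter, since the separation used for the figure is not stated here.

```python
def top_l_over_10_accuracy(prob, native, min_separation=6):
    """Accuracy of the L/10 highest-scoring predicted contacts.

    prob   : L x L matrix of predicted contact probabilities
    native : L x L binary matrix of true contacts
    """
    L = len(prob)
    pairs = [(prob[i][j], i, j)
             for i in range(L)
             for j in range(i + min_separation, L)]
    if not pairs:
        raise ValueError("sequence too short for the chosen separation")
    pairs.sort(key=lambda t: -t[0])   # highest probability first
    top = pairs[:max(1, L // 10)]
    correct = sum(native[i][j] for _, i, j in top)
    return correct / len(top)
```

For a sequence of length 20, for instance, only the two highest-scoring pairs that are at least `min_separation` positions apart are evaluated.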
CHAPTER 4. 3D STRUCTURE PREDICTION
RNA 3D STRUCTURE SAMPLING
In the world of RNA, secondary structure has a different meaning than in the protein world. RNA secondary structure describes both the short-range and long-range interactions of an RNA 3D structure. It is similar to a protein contact map, not to a protein secondary structure. Compared with the 20 amino acids in proteins, there are only 4 nucleotides in RNA sequences, and each nucleotide has its own strong preference for pairing with a particular nucleotide type. The RNA secondary structure prediction problem has been studied for a long time (Akutsu, 2000; Bindewald and Shapiro, 2006; Chen, et al., 2008; Do, et al., 2006; Eddy and Durbin, 1994; Hofacker, 2003; Knudsen and Hein, 2003; Zuker, 2003; Zuker and Sankoff, 1984).
Few RNA 3D structure prediction methods have been published; they include molecular dynamics and knowledge-based methods. Among these are FARNA (Das and Baker, 2007), MC-Sym (Parisien and Major, 2008), and BARNACLE (Frellsen, et al., 2009). In our study, the results showed that a predicted secondary structure can significantly improve the sampling efficiency of an RNA 3D structure prediction method (Wang and Xu, 2011). With predicted secondary structures, our method produces more high-quality decoys than other state-of-the-art methods.
Unlike other methods, our method takes the predicted secondary structure as input to guide the 3D structure sampling process. The RNA secondary structure defines the order in which sub-sequences are sampled. For each sub-sequence, our algorithm samples its 3D structure according to an energy function modelled by a conditional random field model.
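To make the secondary structure input concrete, the sketch below parses an RNA secondary structure written in dot-bracket notation into base pairs and into the binary matrix form defined in Chapter 2. Dot-bracket notation is an assumption made for illustration; the input format actually used by the method is not stated here, and the function names are hypothetical.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)

def pairs_to_matrix(pairs, length):
    """Symmetric binary matrix with 1 where two nucleotides are paired."""
    matrix = [[0] * length for _ in range(length)]
    for i, j in pairs:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

hairpin_pairs = parse_dot_bracket("((..))")
hairpin_matrix = pairs_to_matrix(hairpin_pairs, 6)
```

The nesting of the brackets gives a natural tree of stems and loops, which is one way a secondary structure could induce an order over sub-sequences.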
Table 2. Compared with FARNA, our method produces better decoys in terms of the best cluster centroid. The bold numbers indicate the better results. The best cluster centroid is clearly seen from the 5 clusters of all the decoys.
Figure 5. RMSD histogram of 1ESY shows that our method (left) produces more low-RMSD decoys than FARNA (right).
Our TreeFold algorithm shows that using the intermediate information, even if predicted, helps improve 3D structure prediction.
INTERMEDIATE STATES USED FOR PROTEIN 3D STRUCTURE PREDICTION
Our previous studies illustrate how to define macromolecular intermediate states, and show the results we have achieved on intermediate state prediction. Based on these results, we propose a novel method for protein 3D structure prediction that models the intermediate states. In this study, we formulate a new protein 3D structure potential involving a machine learning model between sequence information and intermediate states. The novel statistical energy function consists of a neural network taking as input the features used in protein secondary structure prediction and contact map prediction, including PSSM and mutual information. The programs for protein secondary structure and contact map prediction produce probabilities for each amino acid and each amino acid pair in the sequence. The potential energy function takes these probabilities as input to predict the distance for each pair of amino acids. With the pair distance probability, the potential energy function is defined as a likelihood of a given protein 3D structure candidate, a decoy.
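One simple way to read "energy as a likelihood of a decoy" is as a negative log-likelihood summed over residue pairs. The sketch below shows that reading only; it is an assumption made for illustration, it omits any reference state or weighting, and it should not be taken as the exact formula of this thesis. The function name and the `eps` smoothing constant are hypothetical.

```python
import math

def decoy_energy(bin_prob, decoy_bins, eps=1e-12):
    """Negative log-likelihood of a decoy's pairwise distance labels.

    bin_prob[i][j][k] : predicted probability that pair (i, j) falls in distance bin k
    decoy_bins[i][j]  : distance bin observed for pair (i, j) in the decoy
    Lower energy means the decoy agrees better with the predicted distances.
    """
    L = len(decoy_bins)
    energy = 0.0
    for i in range(L):
        for j in range(i + 1, L):
            p = bin_prob[i][j][decoy_bins[i][j]]
            energy -= math.log(p + eps)   # eps guards against log(0)
    return energy
```

Under this reading, a decoy whose pair distances fall in high-probability bins receives a lower energy than one whose distances fall in unlikely bins.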
Before introducing our method, we briefly introduce the background of how an energy function is used for protein 3D structure prediction. An energy function, also called a potential energy function, evaluates the quality of a protein 3D structure. Potential energy is derived from biophysics, where the potential is defined as the energy needed to decompose an ensemble of atoms. The physical potential of a crystallized protein 3D structure can be calculated as the total energy in all chemical bonds between pairs of atoms. According to physical theory, the crystallized structure of a protein molecule is its most stable state, which corresponds to the lowest potential energy. In molecular dynamics, the physical potential energy is computed during a simulated thermodynamic process that moves the positions of all the atoms. However, computing the physical potential is not feasible due to limited computational resources, and may not be necessary when considering the trade-off between computational complexity and the precision of protein 3D structure prediction.
The statistical potential energy is an approximation to the physical potential for evaluating the
stability of a given protein 3D structure. Unlike the physical potential energy, the statistical
potential energy is optimized from the observation of known 3D structures [cite dope].
Compared with the physical potential energy, a statistical energy function is simpler to compute
and can be defined at a coarser level of resolution, i.e. amino acids instead of atoms. Many
studies have been done on statistical potentials (Shen 2006).
METHOD OVERVIEW
Our new statistical energy function establishes a non-linear relationship between the features of
the protein sequence and the distances of all atom pairs by using a neural network model. In our
model, the response variable is the discretized distance between each pair of atoms, which
ranges from 0 to 11, corresponding to 12 distance intervals. Our features from the protein
sequence include the position-specific scoring matrix (PSSM), which gives the mutation rate at each
position and represents a short-range sequence feature. The global sequence features are
calculated from the co-evolutionary relationship between each pair of amino acids; these are
mutual information and decoupled mutual information.
The neural network used in our potential energy function predicts the categorical distance label
from the given protein amino acid sequence and its feature vectors. The output of the neural
network is a probability for each label category. This is done by a layer of restricted Boltzmann
machine (Salakhutdinov and Hinton, 2009). Thus, the general form of our statistical potential
energy can be written as the following formula.
$$P(Y = y \mid X = x) = \frac{\exp\left(-\sum_{k} w_{y,k}\, NN_k(x)\right)}{Z(x)} \qquad (1)$$
In the above, $NN_k$ is the output value of the k-th node of the last layer, $\{w_{y,k}\}$ is a 12x40 matrix of
weights for the linear combination in the Boltzmann machine layer, and $Z(x)$ is the normalization factor.
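Equation (1) can be illustrated with a minimal Python sketch. Only the 12x40 shape of the weight matrix and the form of the formula come from the text; the function name, the random weights and the random last-layer outputs are assumptions made for this example.

```python
import math
import random

def distance_label_probabilities(nn_output, w):
    """Return P(Y=y | X=x) for every label y, following the form of Equation (1).

    nn_output: outputs NN_k(x) of the last hidden layer (length 40 here).
    w: one weight row per distance label (12 rows here).
    """
    # Negative linear combination of the last-layer outputs for each label.
    scores = [-sum(w_yk * nn_k for w_yk, nn_k in zip(row, nn_output)) for row in w]
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow.
    shift = max(scores)
    unnormalized = [math.exp(s - shift) for s in scores]
    z = sum(unnormalized)  # plays the role of Z(x)
    return [u / z for u in unnormalized]

# Hypothetical inputs: random weights and random last-layer outputs.
random.seed(0)
w = [[random.gauss(0, 1) for _ in range(40)] for _ in range(12)]
nn_output = [random.gauss(0, 1) for _ in range(40)]
probs = distance_label_probabilities(nn_output, w)

# With all weights equal to zero every label is equally likely.
uniform = distance_label_probabilities(nn_output, [[0.0] * 40 for _ in range(12)])
```

The returned values are non-negative and sum to one, so they can be read directly as the label probabilities used by the energy function.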
With the probability of the distance of each atom pair, we can define the energy function of a
given protein 3D structure. The whole structure is decomposed into the pairwise distances $d_{i,j}$ of all the beta atom pairs on the given protein sequence. The statistical potential energy
function E of the given whole structure is defined by the following formula.
$$E = \sum_{i,j} E(d_{i,j}, x_{i,j}), \qquad E(d_{i,j}, x_{i,j}) = -\log\frac{P(Y = d_{i,j} \mid X = x_{i,j})}{R(i, j, d_{i,j}, L)}$$
In the energy function of the whole structure, $d_{i,j}$ is the discretized distance between
atoms i and j, $x_{i,j}$ is the feature of the atom pair (i, j), and $R(i, j, d_{i,j}, L)$ is the reference factor of
the pairwise distance, defined by the following formula.
$$R(i, j, d_{i,j}, L) = \frac{\text{Number of atom pairs with distance } d_{i,j}}{\text{Total number of atom pairs in the protein with length of } L}$$
Figure 6. The neural network model used in our potential energy function.
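The energy of a decoy can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the thesis implementation: the reference factor is reduced to a per-label frequency for one fixed protein length, and the helper names and toy numbers are hypothetical.

```python
import math

def reference_factor(label_counts):
    """Fraction of atom pairs carrying each distance label (simplified R)."""
    total = sum(label_counts)
    return [count / total for count in label_counts]

def structure_energy(pair_probs, pair_labels, ref):
    """E = sum over pairs of -log(P(Y = d_ij | x_ij) / R(d_ij)).

    pair_probs: predicted label distribution for each atom pair.
    pair_labels: discretized distance d_ij observed in the decoy for each pair.
    """
    return -sum(math.log(probs[d] / ref[d]) for probs, d in zip(pair_probs, pair_labels))

# Toy example with 3 distance labels and 3 atom pairs.
ref = reference_factor([20, 30, 50])
pair_labels = [0, 1, 2]

# Predictions equal to the reference carry no information: energy is zero.
e_flat = structure_energy([ref, ref, ref], pair_labels, ref)

# Predictions that favor the decoy's labels give a negative (better) energy.
confident = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
e_conf = structure_energy(confident, pair_labels, ref)
```

Dividing by the reference factor means a pair only lowers the energy when the network assigns its observed distance label more probability than the background frequency of that label.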
FEATURE VECTOR
The feature vector used in our model is built by the following procedure. First, we build the
multiple sequence alignment of the given protein sequence. From the multiple sequence
alignment, we compute the most important features used in our energy function. Evaluating
a decoy with our energy function also requires a reference factor for each pair of amino acids. In
other words, the potential energy of a given decoy is determined by both the energy function value
computed from each pair of amino acids and the reference states for the pairs.
The feature vector for each pair (i, j) consists of the position-specific scoring matrix (PSSM),
which is a 20-dimensional vector for each position. The PSSM results from a homologous sequence
search in a non-redundant sequence database by NCBI PSI-BLAST. The PSSM feature has proved
very helpful in many sequence-related studies (Jones, 1999; Källberg, et al., 2012; Ma, et al.,
2014; Peng and Xu, 2010; Peng and Xu, 2011; Söding, 2005; Wang and Xu, 2013; Wang, et al.,
2011). We also include the generalized pairwise mutual information feature set. This feature set
includes the mutual information, defined as follows.
$$Mu(i, j) = \sum_{a,b} \log\frac{f_{i,j}(a, b)}{f_i(a)\, f_j(b)}$$
Here $f_{i,j}$ is the frequency of the two amino acids at the position pair (i, j).
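For illustration, the following Python sketch computes a mutual information score between two alignment columns. It uses the textbook frequency-weighted form, in which each log-ratio term is multiplied by $f_{i,j}(a, b)$; that weighting is an assumption relative to the formula as printed here, and the function name and toy columns are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(col_i, col_j):
    """Frequency-weighted mutual information of two alignment columns.

    col_i, col_j: the residues observed at positions i and j, one per
    aligned sequence, in the same sequence order.
    """
    n = len(col_i)
    f_i = Counter(col_i)                # single-column counts at position i
    f_j = Counter(col_j)                # single-column counts at position j
    f_ij = Counter(zip(col_i, col_j))   # joint counts of residue pairs
    score = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        score += p_ab * log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return score

# Perfectly co-varying columns versus independent columns.
mi_coupled = mutual_information("AAGG", "AAGG")
mi_independent = mutual_information("AAGG", "AGAG")
```

Only residue pairs that actually occur in the alignment are visited, so no logarithm of zero is taken.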
NEURAL NETWORK OPTIMIZATION AND MODEL SELECTION
Our energy function contains a 4-layer neural network. The layers have
100, 80, 60, and 40 nodes, respectively. The first layer, with 100 nodes, takes its input from the
features, and the output of the last layer is turned into probability values for the 13 states with the soft-max
shown in Equation (1). The nodes of every two neighboring layers are fully connected,
i.e. each node in the second or a later layer takes input from all the outputs of its
previous layer.
There is an enormous number of atom pairs with the longest distance label. A
pair with this label has a distance larger than 15 angstrom (Å). Including all of them in the
training process not only slows down the optimization but also does not benefit the accuracy very much.
We monitored the optimization process with 30% and 10% of all the pairs with distance larger than
15 Å. The decrease of the objective function on a validation dataset is very similar between
the two optimization processes, so we only include 10% of the pairs with the last state in
our training set.
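The down-sampling step can be sketched as follows. The function name, the fixed random seed and the toy label list are assumptions made for the example; the only detail taken from the text is that 10% of the pairs with the last distance label are kept while all other pairs are retained.

```python
import random

def subsample_last_label(labels, last_label, keep_fraction=0.10, seed=0):
    """Return sorted indices of the training pairs to keep.

    All pairs whose label differs from last_label are kept; only a random
    keep_fraction of the pairs carrying last_label is kept.
    """
    rng = random.Random(seed)
    last_idx = [i for i, y in enumerate(labels) if y == last_label]
    other_idx = [i for i, y in enumerate(labels) if y != last_label]
    n_keep = round(keep_fraction * len(last_idx))
    return sorted(other_idx + rng.sample(last_idx, n_keep))

# Hypothetical label list: 100 far-apart pairs (label 11) and 20 closer pairs.
labels = [11] * 100 + [0, 3, 5, 7] * 5
selected = subsample_last_label(labels, last_label=11)
kept_last = sum(1 for i in selected if labels[i] == 11)
kept_other = sum(1 for i in selected if labels[i] != 11)
```

Sampling indices rather than copying feature vectors keeps the step cheap when the number of pairs grows with the square of the sequence length.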
The weights of our neural network model show the importance of the co-evolutionary
information we used. We estimate the importance of each pair of
feature and label by the product of all the edge weights from the feature to the label. The
importance for all 13 states is shown in the following figure. The features with indices
from 1360 to 1575 show stronger importance scores; these are associated with the
co-evolution features in our new method.
Figure 7. The importance score for each pair of feature and label. Each label corresponds to a subplot, from
top to bottom. X axis is the importance score for each feature and Y axis is the index of the feature.
From the network model weights, we also find that it is not necessary to subdivide distances
larger than 16 Å. We compute the correlation between the weights used in the soft-max layer
among the 13 states and show them in Figure 3. The last 4 states have highly correlated
weights in their soft-max layer, which implies that the model has difficulty distinguishing them.
Figure 8. The correlation among the weights of the soft-max layers for the 13 states.
TRAINING DATASET
The training set is built from 1200 non-redundant protein 3D structures. Each pair of
proteins in our dataset shares less than 25% sequence identity. Since the
number of amino acid pairs is proportional to the square of the length of the protein sequence,
we limit our training dataset to proteins of up to 350 amino acids.
DATA PREPROCESSING
We remove the sequences that have missing amino acids from the test datasets and training datasets.
We also remove sequences shorter than 50 amino acids, which usually have few distance contacts
in their 3D structures.
COMPUTING ENVIRONMENT
Our training program takes 300 cores and 72 hours to find a locally optimal solution for the
weights of the neural network. We used the L-BFGS algorithm to optimize the weights of the neural
network (Liu and Nocedal, 1989).
RESULT
We first compare our new energy function with state-of-the-art methods on decoy
discrimination. We measure the improvement of decoy discrimination by 5 metrics: the number of
correctly identified natives, the rank of the native structure, the GDT score of the first-ranked decoy,
the correlation between the energy and the decoy quality, and the Z-score of the native energy.
For each target there are dozens to hundreds of decoys in the datasets of these experiments. The
number of correctly identified natives is the number of proteins whose native structures
have the lowest energy among all the decoys. The rank of the native structure is its position
among all the decoys in the order of potential energy values. The GDT
score is used to compare the lowest-energy decoy with its native structure. The higher the GDT
score, the better the quality of the decoy. We also calculate the Pearson correlation
coefficient between the potential energy value and the GDT score. A good energy function
should produce low energy values for high-quality decoys and high energy values for low-quality decoys.
Numerically, the quality of an energy function is shown by the correlation between decoy
energy values and decoy GDT scores. We also calculate the "best GDT score", which is the GDT
score of the decoy with the lowest energy value under a given energy function. "Native rank" is the
place where the protein native structure lies in the order of energy values.
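The metrics described above can be sketched for a single target as follows. The function names and the toy energies are hypothetical, and the Z-score is taken here as the native energy minus the mean decoy energy, divided by the standard deviation of the decoy energies, which is a common convention but an assumption about the exact definition used in this study.

```python
import math
import statistics

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_u, mean_v = statistics.fmean(u), statistics.fmean(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def decoy_metrics(decoy_energy, decoy_gdt, native_energy):
    """Decoy discrimination metrics for one target."""
    lowest = min(range(len(decoy_energy)), key=decoy_energy.__getitem__)
    spread = statistics.pstdev(decoy_energy)
    return {
        # Energy should fall as quality rises, so a good value is negative.
        "correlation": pearson(decoy_energy, decoy_gdt),
        # GDT score of the decoy that the energy function ranks first.
        "best_gdt": decoy_gdt[lowest],
        # Rank 1 means no decoy has a lower energy than the native.
        "native_rank": 1 + sum(e < native_energy for e in decoy_energy),
        "native_identified": all(native_energy < e for e in decoy_energy),
        "z_native": (native_energy - statistics.fmean(decoy_energy)) / spread,
    }

# Toy target: four decoys whose energy decreases as their GDT score increases.
metrics = decoy_metrics([-4.0, -3.0, -2.0, -1.0], [0.8, 0.6, 0.4, 0.2], native_energy=-6.0)
```

Averaging these per-target values over all proteins in a decoy set gives summary numbers of the kind reported in the tables of this section.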
On the Rosetta dataset (Raman, et al., 2009), we compare our method with other state-of-the-art
methods, including DOPE, DFIRE, MyDope and OPUS. We first examine our method on
ranking the decoys. The Rosetta dataset contains 58 proteins and 120 decoys for each
protein. The decoys are generated by the Rosetta protein structure sampling method.
Table 3. Performance on Rosetta dataset.
Rosetta   Native identified   mean of correlation   z-score of native   mean of best GDT   Average native rank
DOPE      11                  -0.24                 -1.51               0.47               18.7
DFIRE     12                  -0.20                 -0.66               0.48               30.7
MyDope    10                  -0.21                 -1.23               0.48               21.7
OPUS      6                   -0.15                 0.25                0.46               55.3
Epad      17                  -0.44                 -1.02               0.64               31.9
Epmi      19                  -0.49                 -1.23               0.68               29.7
We also compare our new method with the Epad method on three datasets, the Rosetta
decoy set, the I-Tasser decoy set (Wu, et al., 2007) and the CASP5-8 targets, with an extensive set of
metrics.
Table 4. An extensive comparison of our new method and the Epad energy function on 3 datasets: Rosetta, CASP5-8 and I-Tasser. DFIRE, OPUS, RW, Epad and Epmi are the energy
functions for each column result.
opus
rw
epad epmi
itasser
dfire
epmi
dfire
epad
casp5-8
rw
-0.52 -0.61
opus
-0.49
dfire
-0.29
rosetta
-0.44
0.58
-0.52 -0.60
-0.58
0.60
-0.48
-0.55
0.57
-0.29
-0.46
0.60
-0.43
-0.46
opus rw epad epmi
0.49 0.65 0.71 0.84
0.45 0.63 0.72 0.84
-0.35
0.63 0.69 0.66 0.74
-0.44
native-free
0.67
-0.56
0.61
0.55
-0.53
0.51
-0.43
115.0 175.6 126.1 163.7 99.4
9
8
1
9
3
0.53
0.59
-0.44
0.56
0.53
0.57
-0.34
6.33 4.22 5.30 2.84
0.55
0.95
0.53
4.85
0.59
0.68
-0.43
26.90
0.71 0.77 0.75 0.82
0.98
correlation between GDT and
energy value
correlation between TMscore and
energy value
38.25
0.75
0.96
GDT score of the top ranked decoy
37.33
0.56
0.93
native-free
36.41
0.49
0.86 0.92 0.70 0.84
55
native-free
35.76
0.50
0.94
50
place(rank) of the lowest energy
decoy in the order of GDT
0.50
0.86
48
native-free
0.50
0.64
107
TMscore of the top ranked decoy
0.71
90
22
50
139.8 10.0
9
5
native-free
0.86
126
4.61
0.70
33
19.77 24.93
80
18
2.45 1.14 4.98 1.71
121
30
0.89
41
11.26
33
34.12
28.45
14.7
3
14.12
77.04
28.90
0.80
native-included GDT score of the top ranked decoy
number of native structure identified
native-included in the top 5 models
place(rank) of the lowest energy
native-included decoy in the order of GDT
24.50 26.71
32.49
5.13 0.74 3.89 2.46
22.61
0.38
10.82
17.12
19.57
0.94
native-included rank of the native structure
0.66
0.85
0.98
0.63
0.95
0.68
0.92
0.85
-1.55
0.95
0.68
-2.39
-1.17 -2.91
-1.02
-4.46
-1.66
-5.03
-2.74
-3.27
-1.73
0.89 0.94 0.78 0.90
0.68 1.50 0.74 1.02
native-included TMscore of the top ranked decoy
Z-score of the native structure
native-included energy value
The other important usage of a statistical energy function is in 3D structure sampling for protein structures (Wang
and Xu, 2011; Zhao, et al., 2008). We compared our novel energy function and the energy
function derived without using the direct information on the dataset of CASP10 free modeling
targets. The dataset we used is the CASP10 free modeling targets, excluding the domains for which Quark or
the Rosetta server has no submission and the long targets without domain cutting. Our method
is the best on 4 of the 12 target domains, and on 7 of the 12 domains our method is better than
both Quark and Rosetta.
Table 5. The sampling results on CASP10, 12 free modelling domains.
Highlighting legend: Our method outperforms the top model; Our method outperforms RaptorX-Roll; Our method outperforms Quark and Rosetta server.
Target     Our new method   Quark    Rosetta server   RaptorX-Roll   The top model submitted
T0653-D1   0.1861           0.4181   0.4368           0.1656         0.4280
T0658-D1   0.2229           0.2970   0.1920           0.2276         0.2910
T0666-D1   0.4409           0.2252   0.2499           0.2642         0.4160
T0684-D2   0.2652           0.2448   0.2711           0.2287         0.2820
T0693-D1   0.2527           0.3287   0.2410           0.2672         0.3460
T0719-D6   0.2366           0.2331   0.2220           N/A            0.3230
T0726-D3   0.3000           0.1804   0.2331           N/A            0.2570
T0734-D1   0.2975           0.2155   0.2446           N/A            0.2670
T0735-D2   0.4264           0.3549   0.3504           0.3971         0.3970
T0737-D1   0.2871           0.3175   0.3496           0.2816         0.3630
T0740-D1   0.3110           0.2678   0.2689           0.4778         0.3610
T0741-D1   0.1992           0.1394   0.1752           0.1627         0.2020
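The two counts quoted for Table 5 can be checked directly from its columns. The short Python sketch below copies the scores from the table in target order (T0653-D1 to T0741-D1); the variable names are chosen only for this example.

```python
# Scores copied from Table 5, one entry per target domain.
ours = [0.1861, 0.2229, 0.4409, 0.2652, 0.2527, 0.2366,
        0.3000, 0.2975, 0.4264, 0.2871, 0.3110, 0.1992]
quark = [0.4181, 0.2970, 0.2252, 0.2448, 0.3287, 0.2331,
         0.1804, 0.2155, 0.3549, 0.3175, 0.2678, 0.1394]
rosetta = [0.4368, 0.1920, 0.2499, 0.2711, 0.2410, 0.2220,
           0.2331, 0.2446, 0.3504, 0.3496, 0.2689, 0.1752]
top_model = [0.4280, 0.2910, 0.4160, 0.2820, 0.3460, 0.3230,
             0.2570, 0.2670, 0.3970, 0.3630, 0.3610, 0.2020]

# Domains where our method scores above the top submitted model.
beats_top = sum(o > t for o, t in zip(ours, top_model))

# Domains where our method scores above both Quark and the Rosetta server.
beats_both = sum(o > q and o > r for o, q, r in zip(ours, quark, rosetta))
```

The comparison yields 4 domains above the top submitted model and 7 domains above both Quark and the Rosetta server, which matches the counts stated in the text.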
The advantage of our method is also verified on the targets for which our method
achieves TMscore > 0.4. On these 5 targets, we ran the Rosetta server, which produced results
with TMscore < 0.4 on 3 of the 5 targets.
Table 6. The targets on which our method achieves TMscore > 0.4.
Target     Our method   Robetta
T0651-D1   0.4272       0.3336
T0651-D2   0.4527       0.2724
T0663-D1   0.4422       0.4312
T0663-D2   0.5781       0.7289
T0726-D2   0.5922       0.3160
THE FOLDING RESULT FOR CASP10 TARGETS
We test our method on the human-server CASP10 targets. Among 67 targets, we find that our
method improved the sampled decoy quality on 6 targets, as shown in the following figure.
Figure 9. Six CASP10 human-server targets (from upper left to lower right: T0713-D1, T0713-D2, T0651-D1, T0690-D1,
T0690-D2, and T0707-D1) have better decoys in the 3D structure sampling guided by our new energy function
than in the sampling guided by the EPAD energy function.
SUMMARY
In this study, we have proposed a novel potential energy function based on pairwise
co-evolutionary information. Our numerical experiments on different datasets showed its
improvement on both ranking decoys and 3D structure sampling. Our statistical energy function
models the atom pairs independently, i.e. we assume that the distance probabilities of different
amino acid pairs are independent of each other. This assumption was first used in the definition of the mean field
model (Le Boudec, et al., 2007), and it makes the computation possible on a large dataset.
However, the distances of all amino acid pairs may not be independent in the complicated
protein structure problem. Therefore we design a reference factor for each pair to correct the
total energy and compensate for the dependency.
There are several ways to improve the statistical potential function in the future based on
the results achieved in this study. It will be helpful to optimize the discretized labels for the
distance of each atom pair. Figure 3 suggests a correlation between several discretized labels. To
make use of this correlation, we can group some neighboring labels together, which would
increase the sample size for each label. The reference factor is not fully developed so far. It is
important for combining the independent pair probabilities into the evaluation function of the
whole structure. It would be helpful to model the reference factor with global constraints.
Previously we have demonstrated how global constraints can be used to improve contact
map prediction. This suggests combining the co-evolutionary information and global
constraints in the potential energy function in the future.
CHAPTER 5. CONCLUSIONS AND FUTURE WORKS
Protein and RNA 3D structure prediction is a very challenging problem in the family of machine
learning applications. Its difficulties include that it is a structural learning problem, that its objective
function is non-convex, and that the dataset of training samples is still sparse and expensive to obtain.
As a structural learning problem, the macromolecular 3D structure has many constraints, such
as physical clash constraints, which prevent different atoms from occupying the same position in
space, and the electrostatic preference between atoms. The objective function of 3D structure
prediction is a metric between a decoy and its native structure. Most metrics, including RMSD, GDT-score
and TM-score, are not convex, which makes it difficult to design an algorithm that uses the final
metric as an optimization target.
Another challenging side of the problem is the evolution of homologous information, which
has proved to be an important feature in many sequence-related problems. The homologous matrix is
highly dependent on the database of all the sequences people have discovered, which is
changing with the advance of technology. As a result, the input feature may vary if a new
database is used in building the feature. Changing the parameters of the
multiple sequence alignment algorithm also results in a different feature. Few works
investigate the relationship between the homologous search algorithm and the performance of
structure prediction.
The most important contribution of this thesis is the novel concept of intermediate state and its
application in 3D structure prediction. In the native structures, we can observe many patterns
and pattern combinations. These patterns are supposedly a higher level of intermediate states
than secondary structure and contact map. However, current 3D structure prediction methods
hardly model or search in a space with these patterns. One reason may be that these patterns
often involve many amino acids or nucleotides that are far from each other on the sequence. The other reason
is that we still do not understand how these patterns are formed during the real folding process. In
this thesis, we make a preliminary attempt to connect intermediate
state prediction and 3D structure prediction by making use of the observed
patterns.
Through a series of studies, we have demonstrated the improvement gained by integrating different levels of
information. From the secondary structure prediction, we have demonstrated the
effectiveness of using the homologous information. By using the secondary structure
prediction results in protein contact map prediction, we showed a method to predict the
intermediate states. The RNA sampling study showed a method to make use of secondary structure
prediction results for RNA 3D structure prediction. Finally, we have shown that the protein
intermediate state, including secondary structure and contact map, can be used for protein 3D
structure prediction.
Our work in this thesis builds a chain of machine learning models to improve
macromolecular 3D structure prediction. Further improvements within reach include using
deep learning methods to automatically discover the intermediate states and optimizing each
model on the logic chain. With the concept of intermediate state, we may understand the
macromolecular 3D structure better.
REFERENCE
Akutsu, T. Dynamic programming algorithms for RNA secondary structure prediction with
pseudoknots. Discrete Applied Mathematics 2000;104(1):45-62.
Bindewald, E. and Shapiro, B.A. RNA secondary structure prediction from sequence alignments
using a network of k-nearest neighbor classifiers. RNA 2006;12(3):342-352.
Bowie, J.U., Luthy, R. and Eisenberg, D. A method to identify protein sequences that fold into a
known three-dimensional structure. Science 1991;253(5016):164-170.
Chen, X., et al. FlexStem: improving predictions of RNA secondary structures with pseudoknots
by reducing the search space. Bioinformatics 2008;24(18):1994-2001.
Cuff, J.A. and Barton, G.J. Evaluation and improvement of multiple sequence methods for
protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics
1999;34(4):508-519.
Das, R. and Baker, D. Automated de novo prediction of native-like RNA tertiary structures.
Proceedings of the National Academy of Sciences 2007;104(37):14664-14669.
Do, C.B., Woods, D.A. and Batzoglou, S. CONTRAfold: RNA secondary structure prediction
without physics-based models. Bioinformatics 2006;22(14):e90-e98.
Eddy, S.R. and Durbin, R. RNA sequence analysis using covariance models. Nucleic Acids Research
1994;22(11):2079-2088.
Frellsen, J., et al. A probabilistic model of RNA conformational space. PLoS Computational
Biology 2009;5(6):e1000406.
Giddings, J.C. and Eyring, H. A molecular dynamic theory of chromatography. The Journal of
Physical Chemistry 1955;59(5):416-421.
Hofacker, I.L. Vienna RNA secondary structure server. Nucleic Acids Research 2003;31(13):3429-3431.
Holley, L.H. and Karplus, M. Protein secondary structure prediction with a neural network.
Proceedings of the National Academy of Sciences 1989;86(1):152-156.
Jones, D.T. Protein secondary structure prediction based on position-specific scoring matrices.
Journal of Molecular Biology 1999;292(2):195-202.
Jones, D.T., et al. PSICOV: precise structural contact prediction using sparse inverse covariance
estimation on large multiple sequence alignments. Bioinformatics 2012;28(2):184-190.
Kabsch, W. and Sander, C. Dictionary of protein secondary structure: pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers 1983;22(12):2577-2637.
Karplus, M. and Weaver, D.L. Protein folding dynamics: The diffusion-collision model and
experimental data. Protein Science 1994;3(4):650-668.
Klepeis, J. and Floudas, C. ASTRO-FOLD: a combinatorial and global optimization framework for
ab initio prediction of three-dimensional structures of proteins from the amino acid sequence.
Biophysical Journal 2003;85(4):2119-2146.
Kneller, D., Cohen, F. and Langridge, R. Improvements in protein secondary structure prediction
by an enhanced neural network. Journal of Molecular Biology 1990;214(1):171-182.
Knudsen, B. and Hein, J. Pfold: RNA secondary structure prediction using stochastic context-free
grammars. Nucleic Acids Research 2003;31(13):3423-3428.
Morcos, F., et al. Direct-coupling analysis of residue coevolution captures native contacts across
many protein families. Proceedings of the National Academy of Sciences 2011;108(49):E1293-E1301.
Myers, J.K. and Oas, T.G. Preorganized secondary structure as an important determinant of fast
protein folding. Nature Structural & Molecular Biology 2001;8(6):552-558.
Ortiz, A.R., et al. Ab initio folding of proteins using restraints derived from evolutionary
information. Proteins: Structure, Function, and Bioinformatics 1999;37(S3):177-185.
Parisien, M. and Major, F. The MC-Fold and MC-Sym pipeline infers RNA structure from
sequence data. Nature 2008;452(7183):51-55.
Pauling, L., Corey, R.B. and Branson, H.R. The structure of proteins: two hydrogen-bonded
helical configurations of the polypeptide chain. Proceedings of the National Academy of Sciences
1951;37(4):205-211.
Pirovano, W. and Heringa, J. Protein secondary structure prediction. In, Data Mining Techniques
for the Life Sciences. Springer; 2010. p. 327-348.
Pollastri, G., et al. Improving the prediction of protein secondary structure in three and eight
classes using recurrent neural networks and profiles. Proteins: Structure, Function, and
Bioinformatics 2002;47(2):228-235.
Qian, N. and Sejnowski, T.J. Predicting the secondary structure of globular proteins using neural
network models. Journal of Molecular Biology 1988;202(4):865-884.
Rost, B. PHD: predicting one-dimensional protein structure by profile-based
neural networks. Methods in Enzymology 1996;266:525-539.
Rost, B. and Sander, C. Prediction of protein secondary structure at better than 70% accuracy.
Journal of Molecular Biology 1993;232(2):584-599.
Rost, B. and Sander, C. Combining evolutionary information and neural networks to predict
protein secondary structure. Proteins: Structure, Function, and Bioinformatics 1994;19(1):55-72.
Schäffer, A.A., et al. Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements. Nucleic Acids Research 2001;29(14):2994-3005.
Subramaniam, S., Tcheng, D.K. and Fenton, J.M. A knowledge-based method for protein
structure refinement and prediction. Proceedings of International Conference on Intelligent
Systems for Molecular Biology; ISMB. International Conference on Intelligent Systems for
Molecular Biology 1996;4:218-229.
Sutcliffe, M., et al. Knowledge based modelling of homologous proteins, Part I: Three-dimensional
frameworks derived from the simultaneous superposition of multiple structures.
Protein Engineering 1987;1(5):377-384.
Vassura, M., et al. Reconstruction of 3D structures from protein contact maps. IEEE/ACM
Transactions on Computational Biology and Bioinformatics (TCBB) 2008;5(3):357-367.
Vendruscolo, M., Kussell, E. and Domany, E. Recovery of protein structure from contact maps.
Folding and Design 1997;2(5):295-306.
Wang, Z. and Xu, J. A conditional random fields method for RNA sequence-structure
relationship modeling and conformation sampling. Bioinformatics 2011;27(13):i102-i110.
Wang, Z. and Xu, J. Predicting protein contact map using evolutionary and physical constraints
by integer programming. Bioinformatics 2013;29(13):i266-i273.
Wu, S., Szilagyi, A. and Zhang, Y. Improving protein structure prediction using multiple
sequence-based contact predictions. Structure 2011;19(8):1182-1191.
Zuker, M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids
Research 2003;31(13):3406-3415.
Zuker, M. and Sankoff, D. RNA secondary structures and their prediction. Bulletin of
Mathematical Biology 1984;46(4):591-621.
Jonikas, M.A., Radmer, R.J. and Altman, R.B. Knowledge-based instantiation
of full atomic detail into coarse-grain RNA 3D structural models. Bioinformatics 2009;25(24):3259-3266.
Cao, S. and Chen, S.-J. Predicting RNA folding thermodynamics with a reduced chain
representation model. RNA 2005;11(12):1884-1897.
Shen, M. and Sali, A. Statistical potential for assessment and prediction of protein
structures. Protein Science 2006;15(11):2507-2524.