Where in a Genome Does DNA Replication Begin?
Algorithmic Warm-Up

Phillip Compeau and Pavel Pevzner
Bioinformatics Algorithms: an Active Learning Approach
©2013 by Compeau and Pevzner. All rights reserved

Before a Cell Divides, it Must Replicate its Genome
Replication begins in a region called the replication origin (oriC).
Where in a genome does it all begin?

Finding Origin of Replication
Finding oriC Problem: Finding oriC in a genome.
• Input. A genome.
• Output. The location of oriC in the genome.

OK, let's cut out this DNA fragment. Can the genome replicate without it?
This is not a computational problem!

How Does the Cell Know to Begin Replication in Short oriC?
Replication origin of Vibrio cholerae (≈500 nucleotides):

atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

There must be a hidden message telling the cell to start replication here.

The Hidden Message Problem
Hidden Message Problem. Finding a hidden message in a string.
• Input. A string Text (representing the replication origin, oriC).
• Output. A hidden message in Text.

This is not a computational problem either! The notion of "hidden message" is not precisely defined.

The Hidden Message Problem Revisited
Hint: For various biological signals, certain words appear surprisingly frequently in small regions of the genome.
AATTT is a surprisingly frequent 5-mer in:
ACAAATTTGCATAATTTCGGGAAATTTCCT

The Frequent Words Problem
Frequent Words Problem. Finding most frequent k-mers in a string.
• Input. A string Text and an integer k.
• Output. All most frequent k-mers in Text.

This is better, but where is the definition of "a most frequent k-mer"?

A k-mer Pattern is a most frequent k-mer in a text if no other k-mer is more frequent than Pattern.
AATTT is a most frequent 5-mer in:
ACAAATTTGCATAATTTCGGGAAATTTCCT

Son Pham, Ph.D., kindly gave us permission to use his photographs and greatly helped with preparing this presentation. Thank you Son!

Does the Frequent Words Problem Make Sense to Biologists?
Replication is performed by DNA polymerase, and the initiation of replication is mediated by a protein called DnaA.
DnaA binds to short (typically 9 nucleotides long) segments within the replication origin, each known as a DnaA box.
A DnaA box is a hidden message telling DnaA: "bind here!" And DnaA wants to see multiple DnaA boxes.

What is the simplest way to get most frequent k-mers?

FREQUENTWORDS(Text, k)
    FrequentPatterns <- an empty set
    for i <- 0 to |Text| - k
        Pattern <- the k-mer Text(i, k)
        COUNT(i) <- PATTERNCOUNT(Text, Pattern)
    maxCount <- maximum value in array COUNT
    for i <- 0 to |Text| - k
        if COUNT(i) = maxCount
            add Text(i, k) to FrequentPatterns
    remove duplicates from FrequentPatterns
    return FrequentPatterns

PATTERNCOUNT(Text, Pattern)
    count <- 0
    for i <- 0 to |Text| - |Pattern|
        if Text(i, |Pattern|) = Pattern
            count <- count + 1
    return count

What is the problem with the previous algorithm?
The human genome is about 3 billion base pairs.
$O(|Text|^2 \cdot k)$ will take forever!

How can we make FREQUENTWORDS faster?
What are the possible k-mers of length k = 3 in the alphabet A, T, C, G?
AAA AAT AAC AAG ATA ATT ATC ATG ACA ACT ACC ACG AGA AGT AGC AGG ...
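As a concrete reference point for the FREQUENTWORDS and PATTERNCOUNT pseudocode above, here is a minimal Python sketch. The function names `pattern_count` and `frequent_words` are illustrative choices, not names from the slides, and the early return for an out-of-range k is an added assumption. Each call to `pattern_count` compares up to |Text| windows of k characters, and it is called once per window, which is where the $O(|Text|^2 \cdot k)$ cost comes from.

```python
def pattern_count(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def frequent_words(text: str, k: int) -> set[str]:
    """Return all most frequent k-mers in text using the naive quadratic scan."""
    if k <= 0 or k > len(text):
        return set()  # assumption: no k-mers exist, so nothing is "most frequent"
    # COUNT(i) from the pseudocode: how often the k-mer starting at i occurs.
    counts = [pattern_count(text, text[i:i + k]) for i in range(len(text) - k + 1)]
    max_count = max(counts)
    # Using a set makes the "remove duplicates" step implicit.
    return {text[i:i + k] for i, c in enumerate(counts) if c == max_count}


text = "ACAAATTTGCATAATTTCGGGAAATTTCCT"
print(frequent_words(text, 5))       # {'AATTT'}
print(pattern_count(text, "AATTT"))  # 3
```

On the 30-nucleotide example this is instantaneous. The quadratic growth only becomes a problem at genome scale.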
Number of possible combinations at k = 3: $4^3 = 64$.
Generally, the number of possible combinations is $4^k$.

FASTERFREQUENTWORDS(Text, k)
    FrequentPatterns <- an empty set
    FREQUENCYARRAY <- COMPUTINGFREQUENCIES(Text, k)
    maxCount <- maximum value in array FREQUENCYARRAY
    for i <- 0 to 4^k - 1
        if FREQUENCYARRAY(i) = maxCount
            Pattern <- NumberToPattern(i, k)
            add Pattern to FrequentPatterns
    remove duplicates from FrequentPatterns
    return FrequentPatterns

COMPUTINGFREQUENCIES(Text, k)
    for i <- 0 to 4^k - 1
        FREQUENCYARRAY(i) <- 0
    for i <- 0 to |Text| - k
        Pattern <- Text(i, k)
        j <- PatternToNumber(Pattern)
        FREQUENCYARRAY(j) <- FREQUENCYARRAY(j) + 1
    return FREQUENCYARRAY

Another idea!
Sort all k-mers and then count their frequency.
Will this improve complexity?

FINDINGFREQUENTWORDSBYSORTING(Text, k)
    FrequentPatterns <- an empty set
    for i <- 0 to |Text| - k
        Pattern <- Text(i, k)
        INDEX(i) <- PatternToNumber(Pattern)
        COUNT(i) <- 1
    SORTEDINDEX <- SORT(INDEX)
    for i <- 1 to |Text| - k
        if SORTEDINDEX(i) = SORTEDINDEX(i - 1)
            COUNT(i) <- COUNT(i - 1) + 1
    maxCount <- maximum value in array COUNT
    for i <- 0 to |Text| - k
        if COUNT(i) = maxCount
            Pattern <- NumberToPattern(SORTEDINDEX(i), k)
            add Pattern to FrequentPatterns
    remove duplicates from FrequentPatterns
    return FrequentPatterns

How do we know that the frequencies are meaningful and not random?
Probabilities!
What is the probability of generating a palindromic string (e.g., ATCGAAGCTA)?

What is the probability that a k-mer with k = 2 appears at least once in a binary string of length 4?
Say we want the probability of 01:

0000 0001 0010 0011
0100 0101 0110 0111
1000 1001 1010 1011
1100 1101 1110 1111

Probability is $11/16$.

We made an assumption that occurrences of the pattern in the text do not overlap. What if the pattern is AAAAAAAA?

What is the probability that a k-mer with k = 2 appears at least once in a binary string of length 4?
Say we want the probability of 11:

0000 0001 0010 0011
0100 0101 0110 0111
1000 1001 1010 1011
1100 1101 1110 1111

Probability is $8/16$.

What is the probability that some k-mer appears t times in a text?
Let's define some variables:
• $\Pr(N, A, \mathit{Pattern}, t)$: probability that the k-mer $\mathit{Pattern}$ appears $t$ times in a text of length $N$ over an alphabet of $A$ letters.
• We count the number of ways to insert $t$ instances of the k-mer $\mathit{Pattern}$ into a fixed text of length $N$. Let $n = N - t \cdot k$ be the number of symbols not covered by $\mathit{Pattern}$.
• So we have $n + t$ options, of which we select $t$ for the placement of $\mathit{Pattern}$, giving a total of $\binom{n+t}{t}$.

What is the probability that some k-mer appears t times in a text? (cont.)
• We then multiply $\binom{n+t}{t}$ by the number of strings of length $n$ into which we can insert $t$ instances of $\mathit{Pattern}$, for an approximate total of $\binom{n+t}{t} A^n$.
• To get the probability, we divide by the number of strings of length $N$:

$$\Pr(N, A, \mathit{Pattern}, t) \approx \frac{\binom{n+t}{t} A^n}{A^N}$$

What is the probability of generating a palindromic string (e.g., ATCGAAGCTA) once in a DNA string of length 1000?
$\Pr(1000, 4, \mathrm{ATCGAAGCTA}, 1)$
What if the DNA has length $1 \times 10^6$?
$\Pr(1 \times 10^6, 4, \mathrm{ATCGAAGCTA}, 1)$

What is the probability that any k-mer of length k appears at least t times in a text?
• Let $p = \Pr(N, A, \mathit{Pattern}, t) \approx \frac{\binom{n+t}{t} A^n}{A^N}$.
• The approximate probability that a pattern doesn't appear $t$ or more times is $1 - p$.
• The probability that all patterns of length $k$ appear fewer than $t$ times in a random string is $(1 - p)^{A^k}$.
• The probability that there exists a k-mer appearing $t$ or more times is $1 - (1 - p)^{A^k}$.
• To simplify the above expression, let's assume $p$ is the same for any pattern and small, so that $(1 - p)^{A^k} \approx 1 - p \cdot A^k$. Now

$$\Pr(N, A, k, t) \approx p \cdot A^k \approx \frac{\binom{n+t}{t} A^n}{A^N} \cdot A^k$$
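To make the approximation above concrete, the following Python sketch compares it with exact brute-force counts for the binary examples from the slides. The function names are hypothetical. The code also uses the identity $A^n / A^N = 1 / A^{t \cdot k}$, which follows from $n = N - t \cdot k$, to avoid building huge integers. Treat it as a sanity check under those assumptions rather than a definitive implementation.

```python
from fractions import Fraction
from itertools import product
from math import comb


def exact_at_least_once(pattern: str, length: int, alphabet: str) -> Fraction:
    """Exact probability that pattern occurs in a random string, by enumeration."""
    hits = sum(pattern in "".join(s) for s in product(alphabet, repeat=length))
    return Fraction(hits, len(alphabet) ** length)


def approx_pattern_probability(N: int, A: int, k: int, t: int) -> Fraction:
    """Approximation C(n + t, t) * A^n / A^N with n = N - t*k."""
    n = N - t * k
    if n < 0:
        return Fraction(0)  # t copies of a k-mer do not fit into N symbols
    # A^n / A^N simplifies to 1 / A^(t*k), which keeps the integers small.
    return Fraction(comb(n + t, t), A ** (t * k))


def approx_any_kmer_probability(N: int, A: int, k: int, t: int) -> Fraction:
    """Simplified estimate p * A^k that some k-mer appears t or more times."""
    return approx_pattern_probability(N, A, k, t) * A ** k


print(exact_at_least_once("01", 4, "01"))      # 11/16
print(exact_at_least_once("11", 4, "01"))      # 1/2
print(approx_pattern_probability(4, 2, 2, 1))  # 3/4
print(float(approx_pattern_probability(1000, 4, 10, 1)))
```

The approximation returns 3/4 for both binary patterns, while the exact values are 11/16 for 01 and 8/16 for 11. The gap reflects the non-overlap assumption: 11 can overlap with itself, 01 cannot.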
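Returning to the FASTERFREQUENTWORDS and COMPUTINGFREQUENCIES pseudocode earlier in the deck: the slides use PatternToNumber and NumberToPattern without defining them. The sketch below assumes one common choice, reading a k-mer as a base-4 number with A=0, C=1, G=2, T=3. That ordering and the Python function names are assumptions made for illustration.

```python
SYMBOLS = "ACGT"  # assumed symbol order; the slides do not fix an encoding


def pattern_to_number(pattern: str) -> int:
    """Read a k-mer as a base-4 number under the assumed symbol order."""
    number = 0
    for symbol in pattern:
        number = 4 * number + SYMBOLS.index(symbol)
    return number


def number_to_pattern(number: int, k: int) -> str:
    """Inverse of pattern_to_number for k-mers of length k."""
    symbols = []
    for _ in range(k):
        number, remainder = divmod(number, 4)
        symbols.append(SYMBOLS[remainder])
    return "".join(reversed(symbols))


def computing_frequencies(text: str, k: int) -> list[int]:
    """Frequency array of size 4^k, indexed by pattern_to_number."""
    frequency_array = [0] * (4 ** k)
    for i in range(len(text) - k + 1):
        frequency_array[pattern_to_number(text[i:i + k])] += 1
    return frequency_array


def faster_frequent_words(text: str, k: int) -> set[str]:
    """Most frequent k-mers via one pass over text plus one pass over 4^k slots."""
    frequency_array = computing_frequencies(text, k)
    max_count = max(frequency_array)
    if max_count == 0:
        return set()  # assumption: text shorter than k has no frequent k-mers
    return {number_to_pattern(i, k)
            for i, count in enumerate(frequency_array) if count == max_count}


print(faster_frequent_words("ACAAATTTGCATAATTTCGGGAAATTTCCT", 5))  # {'AATTT'}
```

The array has $4^k$ entries, so this trades the quadratic scan for memory that grows exponentially in k. It is practical for small k such as the 9-mers of DnaA boxes, and less so for long patterns.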