Probabilities of patterns in strings

Where in a Genome Does DNA Replication Begin?
Algorithmic Warm-Up
Phillip Compeau and Pavel Pevzner
Bioinformatics Algorithms: an Active Learning Approach
©2013 by Compeau and Pevzner. All rights reserved
Before a Cell Divides, it Must Replicate its Genome
Replication begins in a region called
the replication origin (oriC)
Where in a genome does it all begin?
Finding Origin of Replication
Finding oriC Problem: Finding oriC in a genome.
• Input. A genome.
• Output. The location of oriC in the genome.
OK, let's cut out this DNA fragment.
Can the genome replicate without it?
This is not a computational problem!
How Does the Cell Know to Begin Replication in Short oriC?
Replication origin of Vibrio cholerae (≈500 nucleotides):
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc
There must be a hidden message telling the cell to start replication here.
The Hidden Message Problem
Hidden Message Problem. Finding a hidden message in a string.
• Input. A string Text (representing replication origin).
• Output. A hidden message in Text.
This is not a computational problem either!
The notion of "hidden message" is not precisely defined.
The Hidden Message Problem Revisited
Hint: For various biological signals, certain words appear surprisingly frequently in small regions of the genome.
AATTT is a surprisingly frequent 5-mer in:
ACAAATTTGCATAATTTCGGGAAATTTCCT
The Frequent Words Problem
Frequent Words Problem. Finding most frequent k-mers in a string.
• Input. A string Text and an integer k.
• Output. All most frequent k-mers in Text.
This is better, but where is the definition of "a most frequent k-mer"?
A k-mer Pattern is a most frequent k-mer in a text if no other k-mer is more frequent than Pattern.
AATTT is a most frequent 5-mer in:
ACAAATTTGCATAATTTCGGGAAATTTCCT
Son Pham, Ph.D., kindly gave us permission to use his photographs and greatly helped with preparing this presentation. Thank you Son!
Does the Frequent Words Problem Make Sense to Biologists?
Replication is performed by DNA polymerase, and the initiation of replication is mediated by a protein called DnaA.
DnaA binds to short (typically 9 nucleotides long) segments within the replication origin known as DnaA boxes.
A DnaA box is a hidden message telling DnaA: "bind here!" And DnaA wants to see multiple DnaA boxes.
What is the simplest way to get most frequent k-mers?
FREQUENTWORDS(Text, k)
    FrequentPatterns <- an empty set
    for i <- 0 to |Text| - k
        Pattern <- Text(i, k)
        COUNT(i) <- PATTERNCOUNT(Text, Pattern)
    maxCount <- maximum value in array COUNT
    for i <- 0 to |Text| - k
        if COUNT(i) = maxCount
            add Text(i, k) to FrequentPatterns
    remove duplicates from FrequentPatterns
    return FrequentPatterns

PATTERNCOUNT(Text, Pattern)
    count <- 0
    for i <- 0 to |Text| - |Pattern|
        if Text(i, |Pattern|) = Pattern
            count <- count + 1
    return count
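
The two routines above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `pattern_count` and `frequent_words` are chosen here, and collecting the results in a set stands in for the "remove duplicates" step.

```python
def pattern_count(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def frequent_words(text: str, k: int) -> set[str]:
    """Return every k-mer whose count in text is maximal (naive version)."""
    if k <= 0 or k > len(text):
        return set()
    starts = range(len(text) - k + 1)
    # One PATTERNCOUNT call per starting position, as in the pseudocode.
    counts = [pattern_count(text, text[i:i + k]) for i in starts]
    max_count = max(counts)
    # A set drops repeated k-mers automatically.
    return {text[i:i + k] for i in starts if counts[i] == max_count}


print(frequent_words("ACAAATTTGCATAATTTCGGGAAATTTCCT", 5))  # {'AATTT'}
```

On the example string from the earlier slide, AATTT occurs three times and no other 5-mer occurs more than twice.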
What is the problem with the previous algorithm?
The human genome is about 3 billion base pairs
$O(|Text|^2 \cdot k)$ will take forever!
How can we make FREQUENTWORDS faster?
What are the possible k-mers of length k = 3 in the alphabet A, T, C, G?
AAA
AAT
AAC
AAG
ATA
ATT
ATC
ATG
ACA
ACT
ACC
ACG
AGA
AGT
AGC
AGG
...
Number of possible combinations at k = 3: $4^3 = 64$
Generally, the number of possible combinations is $4^k$
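
As a quick check of the count, the k-mers can be enumerated directly. The sketch below uses a helper name, `all_kmers`, introduced here for illustration, and assumes the letter order A, T, C, G used in the list above.

```python
from itertools import product


def all_kmers(k: int, alphabet: str = "ATCG") -> list[str]:
    """List every string of length k over the alphabet, in alphabet order."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


kmers = all_kmers(3)
print(len(kmers))  # 64
print(kmers[:4])   # ['AAA', 'AAT', 'AAC', 'AAG']
```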
FASTERFREQUENTWORDS(Text, k)
    FrequentPatterns <- an empty set
    FREQUENCYARRAY <- COMPUTINGFREQUENCIES(Text, k)
    maxCount <- maximum value in array FREQUENCYARRAY
    for i <- 0 to 4^k - 1
        if FREQUENCYARRAY(i) = maxCount
            Pattern <- NumberToPattern(i, k)
            add Pattern to FrequentPatterns
    return FrequentPatterns

COMPUTINGFREQUENCIES(Text, k)
    for i <- 0 to 4^k - 1
        FREQUENCYARRAY(i) <- 0
    for i <- 0 to |Text| - k
        Pattern <- Text(i, k)
        j <- PatternToNumber(Pattern)
        FREQUENCYARRAY(j) <- FREQUENCYARRAY(j) + 1
    return FREQUENCYARRAY
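
The slides do not spell out PatternToNumber and NumberToPattern. The Python sketch below assumes one common choice: read a k-mer as a base-4 number with digits A=0, C=1, G=2, T=3. That digit order, the function names, and the restriction to uppercase A/C/G/T input are assumptions made for illustration.

```python
SYMBOLS = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3


def pattern_to_number(pattern: str) -> int:
    """Read a k-mer as a base-4 number (most significant symbol first)."""
    number = 0
    for symbol in pattern:
        number = 4 * number + SYMBOLS.index(symbol)
    return number


def number_to_pattern(number: int, k: int) -> str:
    """Inverse of pattern_to_number for k-mers of length k."""
    letters = []
    for _ in range(k):
        number, digit = divmod(number, 4)
        letters.append(SYMBOLS[digit])
    return "".join(reversed(letters))


def computing_frequencies(text: str, k: int) -> list[int]:
    """frequency_array[j] = number of occurrences of the k-mer numbered j."""
    frequency_array = [0] * 4 ** k
    for i in range(len(text) - k + 1):
        frequency_array[pattern_to_number(text[i:i + k])] += 1
    return frequency_array


def faster_frequent_words(text: str, k: int) -> set[str]:
    """Most frequent k-mers, found with a single pass over the text."""
    frequency_array = computing_frequencies(text, k)
    max_count = max(frequency_array)
    if max_count == 0:  # text shorter than k: no k-mers at all
        return set()
    return {number_to_pattern(j, k)
            for j, c in enumerate(frequency_array) if c == max_count}


print(pattern_to_number("ATGCAA"))  # 912
print(faster_frequent_words("ACAAATTTGCATAATTTCGGGAAATTTCCT", 5))  # {'AATTT'}
```

The text is scanned only once, but the array has $4^k$ entries, so this approach is practical only for fairly small k.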
Another idea!
Sort all k-mers and then count their frequency.
Will this improve complexity?
FINDINGFREQUENTWORDSBYSORTING(Text, k)
    FrequentPatterns <- an empty set
    for i <- 0 to |Text| - k
        Pattern <- Text(i, k)
        INDEX(i) <- PatternToNumber(Pattern)
        COUNT(i) <- 1
    SORTEDINDEX <- SORT(INDEX)
    for i <- 1 to |Text| - k
        if SORTEDINDEX(i) = SORTEDINDEX(i-1)
            COUNT(i) <- COUNT(i-1) + 1
    maxCount <- maximum value in array COUNT
    for i <- 0 to |Text| - k
        if COUNT(i) = maxCount
            Pattern <- NumberToPattern(SORTEDINDEX(i), k)
            add Pattern to FrequentPatterns
    return FrequentPatterns
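
A Python sketch of the sorting idea follows. As before, the base-4 encoding with digits A=0, C=1, G=2, T=3 is an assumption made here, since the slides leave PatternToNumber unspecified, and the function name is illustrative.

```python
def frequent_words_by_sorting(text: str, k: int) -> set[str]:
    """Most frequent k-mers, found by sorting the encoded k-mers of text."""
    symbols = "ACGT"  # assumed base-4 digit order, not fixed by the slides

    def pattern_to_number(pattern: str) -> int:
        number = 0
        for symbol in pattern:
            number = 4 * number + symbols.index(symbol)
        return number

    def number_to_pattern(number: int) -> str:
        letters = []
        for _ in range(k):
            number, digit = divmod(number, 4)
            letters.append(symbols[digit])
        return "".join(reversed(letters))

    # Encode every k-mer of the text, then sort so equal k-mers are adjacent.
    sorted_index = sorted(pattern_to_number(text[i:i + k])
                          for i in range(len(text) - k + 1))
    if not sorted_index:
        return set()

    # count[i] is the length of the run of equal values ending at position i.
    count = [1] * len(sorted_index)
    for i in range(1, len(sorted_index)):
        if sorted_index[i] == sorted_index[i - 1]:
            count[i] = count[i - 1] + 1

    max_count = max(count)
    return {number_to_pattern(sorted_index[i])
            for i in range(len(count)) if count[i] == max_count}


print(frequent_words_by_sorting("ACAAATTTGCATAATTTCGGGAAATTTCCT", 5))  # {'AATTT'}
```

Compared with the frequency array, this version needs memory proportional to |Text| rather than $4^k$. The price is the sort, roughly |Text| log |Text| comparisons for a comparison-based sort.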
How do we know that the frequencies are meaningful and not random?
Probabilities!
What is the probability of generating a palindromic string (e.g., ATCGAAGCTA)?
What is the probability that a k-mer (k = 2) appears at least once in a binary string of length 4?
Say we want the probability of 01
0000 0001 0010 0011 0100 0101 0110 0111
1000 1001 1010 1011 1100 1101 1110 1111
Probability is $\frac{11}{16}$
We made an assumption that occurrences of the pattern in the text do not overlap. What if the pattern is AAAAAAAA?
What is the probability that a k-mer (k = 2) appears at least once in a binary string of length 4?
Say we want the probability of 11
0000 0001 0010 0011 0100 0101 0110 0111
1000 1001 1010 1011 1100 1101 1110 1111
Probability is $\frac{8}{16}$
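
Both counts can be confirmed by brute force. The sketch below enumerates every string of the given length and checks which contain the pattern; the helper name `prob_at_least_once` is introduced here for illustration.

```python
from fractions import Fraction
from itertools import product


def prob_at_least_once(pattern: str, length: int, alphabet: str = "01") -> Fraction:
    """Exact probability that pattern occurs in a uniformly random string."""
    strings = ["".join(s) for s in product(alphabet, repeat=length)]
    hits = sum(1 for s in strings if pattern in s)
    return Fraction(hits, len(strings))


print(prob_at_least_once("01", 4))  # 11/16
print(prob_at_least_once("11", 4))  # 1/2
```

The pattern 11 can overlap itself (as in 111), so its occurrences are packed into fewer distinct strings than those of 01, which is why the two patterns have different probabilities despite having the same length.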
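What is the probability that some k-mer appears t times in a text?
Let's define some variables:
• $\Pr(N, A, \mathit{Pattern}, t)$: probability that the k-mer $\mathit{Pattern}$ appears $t$ times in a text of length $N$ over an alphabet of $A$ letters.
• Assume the $t$ occurrences of $\mathit{Pattern}$ do not overlap, and let $n$ be the number of characters of the length-$N$ text that lie outside those occurrences:
$n = N - t \cdot k$
• So we have $n + t$ positions from which we select $t$ for the placement of $\mathit{Pattern}$, giving a total of $\binom{n+t}{t}$ placements.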
What is the probability that some k-mer appears t times in a text? (cont.)
• We then multiply $\binom{n+t}{t}$ by the number of strings of length $n$ into which we can insert $t$ instances of $\mathit{Pattern}$, to have an approximate total of $\binom{n+t}{t} A^n$.
• To get the probability we divide by the number of strings of length $N$:
$$\Pr(N, A, \mathit{Pattern}, t) \approx \frac{\binom{n+t}{t} A^n}{A^N}$$
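
The approximation is easy to evaluate exactly with rational arithmetic. In the sketch below the function name `pr_approx` is chosen here, and the ratio $A^n / A^N$ is simplified to $1 / A^{t \cdot k}$ (since $n = N - t \cdot k$) to avoid very large intermediate numbers.

```python
from fractions import Fraction
from math import comb


def pr_approx(N: int, A: int, k: int, t: int) -> Fraction:
    """Approximate Pr(N, A, Pattern, t) for a k-mer, assuming no self-overlap."""
    n = N - t * k  # characters not covered by the t occurrences
    if n < 0:      # t copies of the k-mer do not fit into the text
        return Fraction(0)
    # C(n + t, t) * A**n / A**N, with A**n / A**N reduced to 1 / A**(t * k)
    return Fraction(comb(n + t, t), A ** (t * k))


print(pr_approx(4, 2, 2, 1))  # 3/4
```

For binary strings of length 4 and k = 2 the approximation gives 3/4, compared with the exact values 11/16 for 01 and 8/16 for 11. It overshoots because a string containing the pattern more than once is counted once per placement.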
What is the probability of generating a palindromic string (e.g., ATCGAAGCTA) in a DNA of length 1000 once?
$\Pr(1000, 4, \text{ATCGAAGCTA}, 1)$
What if the DNA has length $1 \times 10^6$?
$\Pr(1 \times 10^6, 4, \text{ATCGAAGCTA}, 1)$
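
Plugging the numbers into the approximation from the previous slide gives a worked example. Here $k = 10$ and $t = 1$, so $n = N - 10$; the second line takes the longer DNA to have length $10^6$, which is an assumed reading of the slide.

```latex
\Pr(1000, 4, \text{ATCGAAGCTA}, 1)
  \approx \binom{990 + 1}{1}\,\frac{4^{990}}{4^{1000}}
  = \frac{991}{4^{10}}
  = \frac{991}{1048576}
  \approx 9.5 \times 10^{-4}

\Pr(10^{6}, 4, \text{ATCGAAGCTA}, 1)
  \approx \binom{999990 + 1}{1}\,\frac{4^{999990}}{4^{1000000}}
  = \frac{999991}{1048576}
  \approx 0.954
```

A value this close to 1 is a sign that the approximation is crude for long texts: strings containing the pattern several times are counted once per occurrence, so the estimate overshoots.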
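What is the probability that any k-mer of length k appears at least t times in a text?
• Let $p = \Pr(N, A, \mathit{Pattern}, t) \approx \frac{\binom{n+t}{t} A^n}{A^N}$
• The approximate probability that a pattern doesn't appear $t$ or more times is $1 - p$
• The probability that all patterns of length $k$ appear fewer than $t$ times in a random string is $(1 - p)^{A^k}$
• The probability that there exists a k-mer appearing $t$ or more times is $1 - (1 - p)^{A^k}$
• To simplify the above equation, let's assume $p$ is the same for any pattern and small, so that $1 - (1 - p)^{A^k} \approx p \cdot A^k$; now
$$\Pr(N, A, k, t) \approx p \cdot A^k \approx \frac{\binom{n+t}{t} A^n}{A^N} \cdot A^k$$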
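
The two expressions can be compared numerically. In the sketch below the function names are illustrative. N = 500 and k = 9 echo the Vibrio cholerae origin and DnaA box lengths mentioned earlier, while t = 3 is a hypothetical threshold chosen here for the example.

```python
import math


def pr_pattern(N: int, A: int, k: int, t: int) -> float:
    """Approximate p = C(n + t, t) * A**n / A**N with n = N - t*k."""
    n = N - t * k
    if n < 0:
        return 0.0
    return math.comb(n + t, t) / A ** (t * k)


def pr_any_kmer(N: int, A: int, k: int, t: int) -> tuple[float, float]:
    """Return (1 - (1 - p)**(A**k), p * A**k) for the same p."""
    p = pr_pattern(N, A, k, t)
    simplified = p * A ** k
    if p >= 1.0:
        return 1.0, simplified
    # 1 - (1 - p)**M computed stably as -expm1(M * log1p(-p))
    full = -math.expm1(A ** k * math.log1p(-p))
    return full, simplified


full, simplified = pr_any_kmer(500, 4, 9, 3)
print(full, simplified)
```

With these inputs both values come out near 2.6e-4 and differ only slightly, which is what the small-$p$ assumption predicts. When $p \cdot A^k$ is not small, the simplified form can exceed 1 and should not be read as a probability.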