Markov Models

Predicting Functions in DNA
DNA sequences are not purely random
Fruit fly (“cheap date”):
GAAGGGGTGC GCACATCCTA AGTGCGCAAA
Human (breast cancer):
GAACCCCTAT ATGGAAGAAG AAAACTGAAC
Subsequences such as ‘CTGCA’ are suspected to serve some
biological function if they occur often:
• DNA replication
• RNA transcription (copying a subsequence to make RNA)
• Intron excision (removing a subsequence)
Likelihood of seeing a subsequence?
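As a warm-up, the question "how likely is a subsequence?" can start from simple counting. A minimal Python sketch (the sequence is the fruit-fly example above; overlapping matches are counted, and the helper name is invented for illustration):

```python
def count_occurrences(seq: str, sub: str) -> int:
    """Count (possibly overlapping) occurrences of sub in seq."""
    return sum(1 for i in range(len(seq) - len(sub) + 1)
               if seq[i:i + len(sub)] == sub)

dna = "GAAGGGGTGCGCACATCCTAAGTGCGCAAA"
print(count_occurrences(dna, "GC"))  # 4
```

Dividing such a count by the number of positions gives an empirical frequency, which later slides replace with model-based probabilities.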
Detecting Repeating Subsequences
in DNA
Research from U. Washington, published in Nature (2016)
• During cell division, DNA replicates
• DNA mutation: change in the sequence (e.g. during
DNA replication)
• “microsatellite” – short repetitive subsequences in a
DNA sequence (e.g. “TATATA”)
• Mutation more likely to occur in sequences that
contain many microsatellites
• Some mutations are likely to cause cancer
ACTAGATTATATAACGGGTACACTATATAACTGTACATATATACCTACGG
ACTAGATTCTATAACGGGTACACTATGTAACTGTACATATATACCTACGG
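The two sequences above illustrate both ideas at once: they differ at a few mutated positions, and both contain "TA"-repeat microsatellites. A small sketch of finding both (the "TA" motif and two-repeat threshold are illustrative assumptions, not taken from the study):

```python
import re

a = "ACTAGATTATATAACGGGTACACTATATAACTGTACATATATACCTACGG"
b = "ACTAGATTCTATAACGGGTACACTATGTAACTGTACATATATACCTACGG"

# Point mutations: positions where the two sequences disagree.
mutations = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
print(mutations)  # [(8, 'A', 'C'), (26, 'A', 'G')]

# Microsatellites: runs of two or more consecutive 'TA' units.
microsats = [(m.start(), m.group()) for m in re.finditer(r"(?:TA){2,}", a)]
print(microsats)  # three 'TATATA' runs
```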
Transition Probabilities
Markov Models for Classification
Goal: determine whether DNA replication or transcription is
more likely in a sequence
Suppose known that:
• CTGCA, ACTGA, AACTG, … indicate replication
• TGCAA, TTGCA, TACGT, … indicate transcription
Compute: probabilities of seeing those subsequences in our
given sequence
• Pr(CTGCA) + Pr(ACTGA) + Pr(AACTG)
• Pr(TGCAA) + Pr(TTGCA) + Pr(TACGT)
Apply the Maximum Likelihood Estimate to find which is more likely
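A minimal sketch of this classification step, assuming the subsequence probabilities have already been estimated (every number below is a made-up placeholder, not a real estimate):

```python
# Estimated probabilities of the indicator 5-mers (placeholder values).
replication = {"CTGCA": 0.004, "ACTGA": 0.003, "AACTG": 0.002}
transcription = {"TGCAA": 0.001, "TTGCA": 0.002, "TACGT": 0.001}

score_rep = sum(replication.values())
score_trans = sum(transcription.values())

# Maximum-likelihood decision: pick the class with the larger score.
label = "replication" if score_rep > score_trans else "transcription"
print(label)  # replication (0.009 > 0.004)
```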
Markov Models for
Handwriting Recognition
Goal: Determine what a piece of handwritten text says
Simplified approach first
Markov Models for
Handwriting Recognition
• Given:
– Initial probabilities of letters
Pr(S0 = 𝒯) = .05
Pr(S0 = 𝓘) = .01
Pr(S0 = 𝒵) = .002
– And transition probabilities between letters/characters:
𝓘𝓽: Pr(𝓽 | 𝓘) = .29
𝓘'𝓶: Pr('𝓶 | 𝓘) = .19
𝓘 (space): Pr(' ' | 𝓘) = .34
𝓘𝔃: Pr(𝔃 | 𝓘) = .001
• Compute the probabilities of text:
Pr(𝓘'𝓶 𝓭𝓸𝓲𝓷𝓰 𝔀𝓮𝓵𝓵) = .02
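The probability of a whole text follows from the chain rule: the initial probability of the first character times the transition probability of each following character. A sketch of that computation; for simplicity it works character by character (the slide treats "'m" as one unit), and all probability values and the default for unseen pairs are illustrative placeholders:

```python
# Placeholder model parameters, not the slide's real estimates.
initial = {"I": 0.01, "T": 0.05}
transition = {("I", "'"): 0.19, ("'", "m"): 0.9, ("I", " "): 0.34}

def sequence_probability(text, initial, transition, default=1e-4):
    """Pr(text) = Pr(first char) * product of Pr(next char | current char)."""
    p = initial.get(text[0], default)
    for prev, nxt in zip(text, text[1:]):
        p *= transition.get((prev, nxt), default)
    return p

p = sequence_probability("I'm", initial, transition)
print(p)  # 0.01 * 0.19 * 0.9
```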
Markov Models for
Handwriting Recognition
An “agent” (i.e. a computer program) can’t
simply determine that one handwritten glyph
is I, another is t, another is m, …
What information does the agent get?
• Each character is a 28 x 28 pixel image.
• Represented by a 28 x 28 array of 1s and 0s
(1 = pixel occupied, 0 = pixel unoccupied)
• Thousands of training examples: arrays labeled with
the corresponding letters
[figure: grids of labeled example arrays for letters D .. Z]
Use these to compute the probability that
a given glyph image is I, another is t,
another is m, …
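One simple way the labeled arrays could be used to estimate Pr(letter | image) is a naive per-pixel model: for each letter, estimate how often each pixel is occupied, then score a new image against each letter's pixel model. This is a sketch under that assumption, with tiny 3x3 "images" standing in for the 28 x 28 arrays and made-up toy examples:

```python
import math

# Toy labeled training examples: each image is a flat list of 0/1 pixels.
examples = {
    "I": [[0,1,0, 0,1,0, 0,1,0],
          [0,1,0, 0,1,0, 0,1,0]],
    "T": [[1,1,1, 0,1,0, 0,1,0]],
}

# Per-letter probability that each pixel is occupied (Laplace smoothing
# so no probability is exactly 0 or 1).
pixel_prob = {
    c: [(sum(img[i] for img in imgs) + 1) / (len(imgs) + 2)
        for i in range(len(imgs[0]))]
    for c, imgs in examples.items()
}

def log_score(image, probs):
    """Log-probability of the image under an independent-pixel model."""
    return sum(math.log(p if pix == 1 else 1 - p)
               for pix, p in zip(image, probs))

test_img = [0,1,0, 0,1,0, 0,1,0]
best = max(pixel_prob, key=lambda c: log_score(test_img, pixel_prob[c]))
print(best)  # the letter whose pixel model best explains the image
```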
Hidden Markov Model
• Agent cannot see the actual state (the character)
• Agent only observes something about each state
(the 2D arrays).
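A minimal sketch of this hidden-state idea: the letters are hidden states, and the agent only sees observations derived from the pixel arrays. The forward algorithm computes the probability of an observation sequence by summing over all hidden letter sequences. All numbers, and the "thin"/"wide" observation categories standing in for the 2D arrays, are invented for illustration:

```python
states = ["I", "T"]
initial = {"I": 0.5, "T": 0.5}
transition = {("I", "I"): 0.3, ("I", "T"): 0.7,
              ("T", "I"): 0.6, ("T", "T"): 0.4}
# Pr(observation | hidden letter), placeholder values.
emission = {("I", "thin"): 0.8, ("I", "wide"): 0.2,
            ("T", "thin"): 0.3, ("T", "wide"): 0.7}

def forward(obs):
    """Pr(obs sequence), summing over all hidden state sequences."""
    alpha = {s: initial[s] * emission[(s, obs[0])] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * transition[(p, s)] for p in states)
                    * emission[(s, o)]
                 for s in states}
    return sum(alpha.values())

print(forward(["thin", "wide"]))  # total probability of the observations
```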