Large Vocabulary Unconstrained Handwriting Recognition
J. Subrahmonia
Pen Technologies, IBM T. J. Watson Research Center
Pen-based interfaces in mobile computing
Mathematical Formulation
H : Handwriting evidence on the basis of which a recognizer will make its decision
– H = {h1, h2, h3, …, hm}
W : Word string from a large vocabulary
– W = {w1, w2, w3, …, wn}
Recognizer :
– Ŵ = argmax_W p(W | H)
Mathematical Formulation
Ŵ = argmax_W p(W | H)
  = argmax_W p(H | W) p(W) / p(H)
  = argmax_W p(H | W) p(W)      (p(H) does not depend on W, so it drops out of the maximization)
Source Channel Model
[Diagram: SOURCE (writer) → CHANNEL (digitizer → feature extractor) → H → DECODER → Ŵ]
Source Channel Model
Ŵ = argmax_W p(W | H) = argmax_W p(H | W) p(W)
– p(H | W) : Handwriting Modeling (HMMs)
– p(W) : Language Modeling
– argmax : Search Strategy
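The decomposition above can be read directly as a decoding rule. A toy sketch in Python, where `handwriting_model` and `language_model` are hypothetical placeholders standing in for p(H | W) and p(W):

```python
# Toy sketch of the decoder rule Ŵ = argmax_W p(H | W) p(W).
# handwriting_model and language_model are hypothetical stand-ins for
# the HMM score p(H | W) and the language-model score p(W).
def decode(H, vocabulary, handwriting_model, language_model):
    return max(vocabulary,
               key=lambda W: handwriting_model(H, W) * language_model(W))
```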
Hidden Markov Models
Memoryless Model –(add memory)→ Markov Model –(hide something)→ Hidden Markov Model
Memoryless Model –(hide something)→ Mixture Model –(add memory)→ Hidden Markov Model
Alan B. Poritz, "Hidden Markov Models: A Guided Tour", ICASSP 1988
Memoryless Model
COIN : Heads (1) : probability p
Tails (0) : probability 1-p
Flip the coin 10 times (IID Random sequence)
Sequence : 1 0 1 0 0 0 1 1 1 1
Probability = p·(1-p)·p·(1-p)·(1-p)·(1-p)·p·p·p·p = p^6 (1-p)^4
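A minimal sketch of this memoryless computation; the sequence is the one above, and the value of p is an arbitrary choice for illustration:

```python
# Probability of an IID 0/1 sequence: each flip contributes p or (1 - p)
# independently of all other flips.
def iid_prob(seq, p):
    prob = 1.0
    for outcome in seq:
        prob *= p if outcome == 1 else 1 - p
    return prob

# The slide's sequence has six 1s and four 0s, so this equals p**6 * (1-p)**4.
print(iid_prob([1, 0, 1, 0, 0, 0, 1, 1, 1, 1], p=0.6))
```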
Add Memory – Markov Model
2 Coins : COIN 1 => p(1) = 0.9, p(0) = 0.1
COIN 2 => p(1) = 0.1, p(0) = 0.9
Experiment :
Flip COIN 1 and note the outcome
Repeat : if the previous outcome was Heads, flip COIN 1; else flip COIN 2
Sequence 1 1 0 0 : Probability = 0.9*0.9*0.1*0.9
Sequence 1 0 1 0 : Probability = 0.9*0.1*0.1*0.1
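A small sketch of the two-coin experiment, reproducing the two probabilities above (the function and variable names are mine):

```python
# Two-coin Markov experiment: the coin used for each flip depends on the
# previous outcome, so the sequence probability is a product of
# conditional probabilities.
P_HEADS = {1: 0.9, 2: 0.1}   # p(1) for COIN 1 and COIN 2

def sequence_prob(seq):
    """seq is a list of 0/1 outcomes; the first flip always uses COIN 1."""
    prob, coin = 1.0, 1
    for outcome in seq:
        p1 = P_HEADS[coin]
        prob *= p1 if outcome == 1 else (1 - p1)
        coin = 1 if outcome == 1 else 2      # next coin depends on this outcome
    return prob

print(sequence_prob([1, 1, 0, 0]))   # 0.9*0.9*0.1*0.9 = 0.0729
print(sequence_prob([1, 0, 1, 0]))   # 0.9*0.1*0.1*0.1 = 0.0009
```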
State Sequence Representation
[State diagram: states 1 and 2; arcs labeled output : probability — from state 1, output 1 : 0.9 (stay in 1) and 0 : 0.1 (go to 2); from state 2, output 0 : 0.9 (stay in 2) and 1 : 0.1 (go to 1)]
Observed output sequence ⇒ unique state sequence
Hide the states ⇒ Hidden Markov Model
[Diagram: the two-coin model redrawn as a hidden Markov model — states s1 and s2 with transition probabilities 0.9 / 0.1 and output probabilities 0.9 / 0.1]
Why use Hidden Markov Models instead of non-hidden?
Hidden Markov Models can be smaller – fewer parameters to estimate
States may be truly hidden
– Position of the hand
– Positions of articulators
Summary of HMM Basics
We are interested in assigning probabilities p(H) to feature sequences
Memoryless model : p(H) = Π_{i=1..n} p(hi)
– This model has no memory of the past
Markov noticed that in some sequences the future depends on the past. He introduced the concept of a STATE – an equivalence class of the past that influences the future
p(hi | hi-1, …, h1) = p(hi | si-1)
Hide the states : HMM
p(H) = Σ_S p(H, S)
Hidden Markov Models
Given an observed sequence H :
– Compute p(H) for decoding
– Find the most likely state sequence for a given Markov model (Viterbi algorithm)
– Estimate the parameters of the Markov source (training)
Compute p(H)
[Example HMM with states s1, s2, s3 :
  s1 → s1 : prob 0.5, outputs p(a) = 0.8, p(b) = 0.2
  s1 → s2 : prob 0.3, outputs p(a) = 0.7, p(b) = 0.3
  s1 → s2 : prob 0.2, null transition (no output)
  s2 → s2 : prob 0.4, outputs p(a) = 0.5, p(b) = 0.5
  s2 → s3 : prob 0.5, outputs p(a) = 0.3, p(b) = 0.7
  s2 → s3 : prob 0.1, null transition (no output)
  s3 : final state]
Compute p(H) – contd.
Compute p(H) where H = a a b b
Enumerate all ways of producing h1=a
s1 –(0.5×0.8)→ s1 : 0.40
s1 –(0.3×0.7)→ s2 : 0.21
s1 –(0.2, null)→ s2 –(0.4×0.5)→ s2 : 0.04
s1 –(0.2, null)→ s2 –(0.5×0.3)→ s3 : 0.03
Compute p(H) – contd.
Enumerate all ways of producing h1 = a, h2 = a
[Diagram: each h1 = a path is extended by the emitting arcs (0.5×0.8, 0.3×0.7, 0.4×0.5, 0.5×0.3) and the null arcs (0.2, 0.1) – the number of distinct paths grows rapidly with the length of H]
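A brute-force sketch of this enumeration in Python. The arc list is my reconstruction of the example HMM above (treat it as an assumption); `prob` recursively sums the probability of every path that produces the observed string:

```python
# Exhaustive enumeration of state paths for the reconstructed 3-state HMM.
# Emitting arcs carry an output distribution; null arcs emit nothing.
ARCS = [("s1", "s1", 0.5, {"a": 0.8, "b": 0.2}),
        ("s1", "s2", 0.3, {"a": 0.7, "b": 0.3}),
        ("s1", "s2", 0.2, None),
        ("s2", "s2", 0.4, {"a": 0.5, "b": 0.5}),
        ("s2", "s3", 0.5, {"a": 0.3, "b": 0.7}),
        ("s2", "s3", 0.1, None)]

def prob(obs, state="s1", final="s3"):
    """p(obs) by enumerating every path (exponential in len(obs))."""
    if not obs:
        # no output left: count finishing here, plus null arcs toward the final state
        total = 1.0 if state == final else 0.0
        for src, dst, p, out in ARCS:
            if src == state and out is None:
                total += p * prob(obs, dst, final)
        return total
    total = 0.0
    for src, dst, p, out in ARCS:
        if src != state:
            continue
        if out is None:                       # null arc: no symbol consumed
            total += p * prob(obs, dst, final)
        else:                                 # emitting arc: consume obs[0]
            total += p * out[obs[0]] * prob(obs[1:], dst, final)
    return total

print(prob("aabb"))   # ~0.020156, the same value the trellis computation gives later
```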
Compute p(H)
Can save computation by combining paths
[Diagram: the enumeration tree redrawn so that paths reaching the same state at the same position are merged]
Compute p(H)
Trellis Diagram
[Trellis: columns 0, a, aa, aab, aabb; rows s1, s2, s3; arcs labeled with transition × output probabilities (.5×.8, .3×.7, .4×.5, .5×.3, .5×.2, .3×.3, .5×.7) and null transitions (.2, .1)]
Basic Recursion
Prob(node) = Σ over predecessors [ Prob(predecessor) × Prob(predecessor → node) ]
Boundary condition : Prob(s1, 0) = 1
Forward probabilities for H = a a b b :

         0        a        aa        aab       aabb
s1       1.0      0.4      0.16      0.016     0.0016
s2       0.2      0.33     0.182     0.054     0.01256
s3       0.02     0.063    0.0677    0.0691    0.020156

Example (s2 after "a") : s1,0 : .08 + s1,a : .21 + s2,a : .04 = 0.33
p(a a b b) = value at s3 after aabb = 0.020156
More Formally – Forward Algorithm
α_t(s) = Σ_{s'} α_{t-1}(s') · P(s | s') · P(h_t | s' → s)   [emitting transitions]
       + Σ_{s'} α_t(s') · P(s | s')                         [null transitions]
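A sketch of this forward recursion for the reconstructed example HMM (the arc list is my reconstruction, as before). Null arcs are handled within a time column; since they run s1 → s2 → s3 with no cycles, a single pass in that order suffices:

```python
# Forward ("basic recursion") pass for the reconstructed 3-state example HMM.
ARCS = [("s1", "s1", 0.5, {"a": 0.8, "b": 0.2}),
        ("s1", "s2", 0.3, {"a": 0.7, "b": 0.3}),
        ("s1", "s2", 0.2, None),                 # null transition
        ("s2", "s2", 0.4, {"a": 0.5, "b": 0.5}),
        ("s2", "s3", 0.5, {"a": 0.3, "b": 0.7}),
        ("s2", "s3", 0.1, None)]                 # null transition
STATES = ["s1", "s2", "s3"]

def forward(obs, start="s1", final="s3"):
    """Return p(obs) and the trellis of forward probabilities."""
    alpha = [{s: 0.0 for s in STATES} for _ in range(len(obs) + 1)]
    alpha[0][start] = 1.0
    for t in range(len(obs) + 1):
        # null arcs stay in column t; they are listed in topological order,
        # so one pass over ARCS is enough
        for src, dst, p, out in ARCS:
            if out is None:
                alpha[t][dst] += alpha[t][src] * p
        # emitting arcs consume obs[t] and move to column t + 1
        if t < len(obs):
            for src, dst, p, out in ARCS:
                if out is not None:
                    alpha[t + 1][dst] += alpha[t][src] * p * out[obs[t]]
    return alpha[len(obs)][final], alpha

prob, trellis = forward("aabb")
print(prob)   # ~0.020156, matching the worked trellis above
```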
Find Most Likely Path for aabb – Dynamic Programming or Viterbi
MaxProb(node) = max over predecessors [ MaxProb(predecessor) × Prob(predecessor → node) ]

Best-path probabilities for H = a a b b :

         0        a        aa        aab       aabb
s1       1.0      0.4      0.16      0.016     0.0016
s2       0.2      0.21     0.084     0.0168    0.00336
s3       0.02     0.03     0.0315    0.0294    0.00588

Example (s2 after "a") : max(s1,0 : .08 ; s1,a : .21 ; s2,a : .04) = 0.21
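The same trellis with max in place of sum gives the Viterbi pass. This sketch again uses the reconstructed arc list (an assumption) and returns only the best-path probability:

```python
# Viterbi (max-product) pass for the same reconstructed example HMM.
ARCS = [("s1", "s1", 0.5, {"a": 0.8, "b": 0.2}),
        ("s1", "s2", 0.3, {"a": 0.7, "b": 0.3}),
        ("s1", "s2", 0.2, None),
        ("s2", "s2", 0.4, {"a": 0.5, "b": 0.5}),
        ("s2", "s3", 0.5, {"a": 0.3, "b": 0.7}),
        ("s2", "s3", 0.1, None)]
STATES = ["s1", "s2", "s3"]

def viterbi(obs, start="s1", final="s3"):
    """Best single-path probability: the forward pass with sum replaced by max."""
    best = [{s: 0.0 for s in STATES} for _ in range(len(obs) + 1)]
    best[0][start] = 1.0
    for t in range(len(obs) + 1):
        for src, dst, p, out in ARCS:          # null arcs, same column
            if out is None:
                best[t][dst] = max(best[t][dst], best[t][src] * p)
        if t < len(obs):
            for src, dst, p, out in ARCS:      # emitting arcs, next column
                if out is not None:
                    best[t + 1][dst] = max(best[t + 1][dst],
                                           best[t][src] * p * out[obs[t]])
    return best[len(obs)][final]

print(viterbi("aabb"))   # ~0.00588 for the reconstructed example
```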
Training HMM parameters
[Diagram: example HMM with transitions t1, t2, t3 out of one state (initial probabilities 1/3 each) and t4, t5 out of the other (initial probabilities 1/2 each); every output distribution starts at p(a) = p(b) = 1/2]
H = a b a a
Seven state paths produce H; their probabilities are
.000385, .001302, .000578, .001157, .000868, .002604, .001736
p(H) = sum of the path probabilities = .008632
Training HMM parameters
[Diagram: the seven paths annotated with the transitions t1 … t5 they use]
ci = a posteriori probability of path i = pi / p(H)
c1 = .045, c2 = .067, c3 = .134, c4 = .100, c5 = .201, c6 = .150, c7 = .301
Count how often each transition is used, weighted by the path posteriors :
c(t1) = 3c1 + 2c2 + 2c3 + c4 + c5 = 0.838
c(t2) = c3 + c5 + c7 = 0.637
c(t3) = c1 + c2 + c4 + c6 = 0.363
New (normalized counts) : p(t1) = 0.46, p(t2) = 0.34, p(t3) = 0.20
Training HMM parameters
Output counts for transition t1 :
c(t1, a) = 2c1 + c2 + c3 + c4 + c5 = 0.592
c(t1, b) = c1 + c2 + c3 = 0.246
New : p(t1, a) = 0.71, p(t1, b) = 0.29
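A small sketch of this enumeration-based re-estimation step. The assignment of the seven path probabilities to p1 … p7 and the per-path usage counts are read off the c(·) formulas above, so they are reconstructions rather than given data:

```python
# Posterior path counts and count-based re-estimation for H = "abaa".
path_prob = [0.000385, 0.000578, 0.001157, 0.000868,
             0.001736, 0.001302, 0.002604]            # p1 .. p7 (reconstructed order)
p_H = sum(path_prob)                                  # ~0.008632
c = [p / p_H for p in path_prob]                      # c1 .. c7 = pi / p(H)

def count(uses):
    """Posterior-weighted usage count of a transition (or transition + output)."""
    return sum(u * ci for u, ci in zip(uses, c))

# how many times each path uses t1, t2, t3 (read off the c(t) formulas above)
c_t1 = count([3, 2, 2, 1, 1, 0, 0])                   # ~0.838
c_t2 = count([0, 0, 1, 0, 1, 0, 1])                   # ~0.637
c_t3 = count([1, 1, 0, 1, 0, 1, 0])                   # ~0.363
total = c_t1 + c_t2 + c_t3
print([round(x / total, 2) for x in (c_t1, c_t2, c_t3)])   # new p(t1..t3) ~ [0.46, 0.35, 0.2]

# output counts on transition t1
c_t1_a = count([2, 1, 1, 1, 1, 0, 0])                 # ~0.592
c_t1_b = count([1, 1, 1, 0, 0, 0, 0])                 # ~0.246
print(round(c_t1_a / c_t1, 2), round(c_t1_b / c_t1, 2))    # new p(t1,a), p(t1,b) ~ 0.71, 0.29
```

The small difference from the slide's p(t2) = 0.34 is only rounding of 0.637 / 1.838.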
Training HMM parameters
[Diagram: the HMM redrawn with the re-estimated parameters – transition probabilities .46, .34, .20 and .60, .40; output probabilities .71/.29, .68/.32, .64/.36, .40/.60]
With the new parameters the path probabilities become
p1 = 0.00108, p2 = 0.00129, p3 = 0.00404, p4 = 0.00212, p5 = 0.00537, p6 = 0.00253, p7 = 0.00791
p(H) = 0.02438 > 0.008632
Keep on repeating : after 600 iterations, p(H) = .037037037
Another initial parameter set : p(H) = 0.0625
Training HMM parameters
Converges to a local maximum
There are (at least) 7 local maxima
Final solution depends on the starting point
Speed of convergence depends on the starting point
Training HMM parameters :
Forward Backward algorithm
Improves on the enumeration algorithm by using the trellis
Reduces the computation from exponential to linear
Forward Backward Algorithm
[Diagram: trellis for the output h1 … hm; an arc ti from state sa to state sb produces hj; the forward probability α_{j-1}(sa) covers h1 … h(j-1) and the backward probability β_j(sb) covers h(j+1) … hm]
Forward Backward Algorithm
p_j(ti, H) = probability that hj is produced by transition ti and the complete output is H
           = α_{j-1}(sa) · P(ti) · P(hj | ti) · β_j(sb)
α_{j-1}(sa) = probability of being in state sa and producing the output h1, …, h(j-1)
β_j(sb) = probability of being in state sb and producing the output h(j+1), …, hm
Forward Backward Algorithm
α_t(s) = Σ_{s'} α_{t-1}(s') · P(s | s') · P(h_t | s' → s) + Σ_{s'} α_t(s') · P(s | s')   [null transitions]
(β_t(s) is computed by the analogous recursion run backwards over the trellis)
Transition count :
c(ti | H) = Σ_j p_j(ti, H) / p(H)
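A concrete sketch of these counts for the reconstructed example HMM: forward and backward passes over the trellis, then the posterior usage count of every arc. The arc list is my reconstruction, and processing null arcs in (reverse) topological order within a column is an implementation choice:

```python
# Forward-backward computation of posterior transition counts
# c(t | H) = sum_j p_j(t, H) / p(H) for the reconstructed 3-state example HMM.
ARCS = [("s1", "s1", 0.5, {"a": 0.8, "b": 0.2}),
        ("s1", "s2", 0.3, {"a": 0.7, "b": 0.3}),
        ("s1", "s2", 0.2, None),
        ("s2", "s2", 0.4, {"a": 0.5, "b": 0.5}),
        ("s2", "s3", 0.5, {"a": 0.3, "b": 0.7}),
        ("s2", "s3", 0.1, None)]
STATES = ["s1", "s2", "s3"]

def forward_backward_counts(obs, start="s1", final="s3"):
    n = len(obs)
    # forward pass (null arcs listed in topological order of their source state)
    alpha = [{s: 0.0 for s in STATES} for _ in range(n + 1)]
    alpha[0][start] = 1.0
    for t in range(n + 1):
        for src, dst, p, out in ARCS:
            if out is None:
                alpha[t][dst] += alpha[t][src] * p
        if t < n:
            for src, dst, p, out in ARCS:
                if out is not None:
                    alpha[t + 1][dst] += alpha[t][src] * p * out[obs[t]]
    # backward pass (null arcs handled in reverse topological order)
    beta = [{s: 0.0 for s in STATES} for _ in range(n + 1)]
    beta[n][final] = 1.0
    for t in range(n, -1, -1):
        if t < n:
            for src, dst, p, out in ARCS:
                if out is not None:
                    beta[t][src] += p * out[obs[t]] * beta[t + 1][dst]
        for src, dst, p, out in reversed(ARCS):
            if out is None:
                beta[t][src] += p * beta[t][dst]
    p_H = alpha[n][final]
    # posterior count of each arc: sum over all positions where it can be used
    counts = {}
    for i, (src, dst, p, out) in enumerate(ARCS):
        c = 0.0
        for t in range(n + 1):
            if out is None:
                c += alpha[t][src] * p * beta[t][dst]
            elif t < n:
                c += alpha[t][src] * p * out[obs[t]] * beta[t + 1][dst]
        counts[(src, dst, i)] = c / p_H
    return p_H, counts

p_H, counts = forward_backward_counts("aabb")
print(p_H)      # ~0.020156
print(counts)   # expected usage count of each arc given H; normalizing these counts
                # per source state gives the re-estimated probabilities
```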
Training HMM parameters
Guess initial values for all parameters
Compute forward and backward pass probabilities
Compute counts
Re-estimate probabilities
BAUM-WELCH, BAUM-EAGON, FORWARD-BACKWARD, E-M