CONTEXT DEPENDENT CLASSIFICATION
Remember: Bayes rule
$x \rightarrow \omega_i : P(\omega_i|x) > P(\omega_j|x), \quad \forall j \neq i$
Here: The class to which a feature vector belongs depends on:
• Its own value
• The values of the other features
• An existing relation among the various classes
This interrelation demands that the classification be performed simultaneously for all available feature vectors. Thus, we will assume that the training vectors $x_1, x_2, \ldots, x_N$ occur in sequence, one after the other, and we will refer to them as observations.
The Context Dependent Bayesian Classifier
Let $X : \{x_1, x_2, \ldots, x_N\}$
Let $\omega_i,\ i = 1, 2, \ldots, M$
Let $\Omega_i$ be a sequence of classes, that is
$\Omega_i : \omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_N}$
There are $M^N$ of those.
Thus, the Bayesian rule can equivalently be stated as
$X \rightarrow \Omega_i : P(\Omega_i|X) > P(\Omega_j|X) \quad \forall i \neq j, \quad i, j = 1, 2, \ldots, M^N$
Markov Chain Models (for class dependence)
$P(\omega_{i_k}|\omega_{i_{k-1}}, \omega_{i_{k-2}}, \ldots, \omega_{i_1}) = P(\omega_{i_k}|\omega_{i_{k-1}})$
NOW remember:
$P(\Omega_i) = P(\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_N}) = P(\omega_{i_N}|\omega_{i_{N-1}}, \ldots, \omega_{i_1}) \cdot P(\omega_{i_{N-1}}|\omega_{i_{N-2}}, \ldots, \omega_{i_1}) \cdots P(\omega_{i_1})$
or
$P(\Omega_i) = \left( \prod_{k=2}^{N} P(\omega_{i_k}|\omega_{i_{k-1}}) \right) P(\omega_{i_1})$
Assume:
• $x_k,\ k = 1, \ldots, N$, statistically mutually independent
• The pdf in one class independent of the others,
then
$p(X|\Omega_i) = \prod_{k=1}^{N} p(x_k|\omega_{i_k})$
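A minimal Python sketch of these two factorizations (illustrative only: P_init holds the initial class probabilities $P(\omega_{i_1})$, P_trans[i, j] stands for $P(\omega_i|\omega_j)$, and class_pdfs is an assumed list of class-conditional pdfs $p(x|\omega_i)$):

import numpy as np

def sequence_prior(omega, P_init, P_trans):
    # P(Omega_i) = P(omega_{i1}) * prod_{k=2..N} P(omega_{ik} | omega_{ik-1})
    p = P_init[omega[0]]
    for k in range(1, len(omega)):
        p *= P_trans[omega[k], omega[k - 1]]  # P(current class | previous class)
    return p

def sequence_likelihood(X, omega, class_pdfs):
    # p(X | Omega_i) = prod_{k=1..N} p(x_k | omega_{ik}), assuming independent observations
    return np.prod([class_pdfs[c](x) for x, c in zip(X, omega)])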
From the above, the Bayes rule is readily seen
to be equivalent to:
$P(\Omega_i|X) > (<)\ P(\Omega_j|X) \iff P(\Omega_i)\, p(X|\Omega_i) > (<)\ P(\Omega_j)\, p(X|\Omega_j)$
that is, it rests on
$p(X|\Omega_i)\, P(\Omega_i) = P(\omega_{i_1})\, p(x_1|\omega_{i_1}) \prod_{k=2}^{N} P(\omega_{i_k}|\omega_{i_{k-1}})\, p(x_k|\omega_{i_k})$
To find the above maximum by brute force we need $O(NM^N)$ operations!!
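To make the $O(NM^N)$ figure concrete, a brute-force sketch (same assumed containers as above) would have to score every one of the $M^N$ class sequences:

import itertools
import numpy as np

def brute_force_classify(X, P_init, P_trans, class_pdfs):
    # Evaluate p(X | Omega) P(Omega) for every candidate class sequence Omega.
    M, N = len(class_pdfs), len(X)
    best_score, best_seq = -np.inf, None
    for omega in itertools.product(range(M), repeat=N):  # M**N candidate sequences
        score = P_init[omega[0]] * class_pdfs[omega[0]](X[0])
        for k in range(1, N):                            # N factors per sequence
            score *= P_trans[omega[k], omega[k - 1]] * class_pdfs[omega[k]](X[k])
        if score > best_score:
            best_score, best_seq = score, omega
    return best_seq, best_score

The Viterbi algorithm, discussed next, avoids this exhaustive search.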
The Viterbi Algorithm
Thus, each $\Omega_i$ corresponds to one path through the trellis diagram. One of them is the optimum (e.g., the one drawn in black). The classes along the optimal path determine the classes to which the observations $x_k$ are assigned.
To each transition corresponds a cost. For our case:
• $\hat{d}(i_k, i_{k-1}) = P(\omega_{i_k}|\omega_{i_{k-1}})\, p(x_k|\omega_{i_k})$
• $\hat{d}(i_1, i_0) = P(\omega_{i_1})\, p(x_1|\omega_{i_1})$
• $\hat{D} = \prod_{k=1}^{N} \hat{d}(i_k, i_{k-1}) = p(X|\Omega_i)\, P(\Omega_i)$
• Equivalently
$\ln \hat{D} = \sum_{k=1}^{N} \ln \hat{d}(\cdot,\cdot) \Rightarrow D = \sum_{k=1}^{N} d(\cdot,\cdot)$
where
$d(i_k, i_{k-1}) = \ln \hat{d}(i_k, i_{k-1})$
• Define the cost up to a node $k$:
$D(i_k) = \sum_{r=1}^{k} d(i_r, i_{r-1})$
Bellman’s principle now states
$D_{\max}(i_k) = \max_{i_{k-1}} \left[ D_{\max}(i_{k-1}) + d(i_k, i_{k-1}) \right], \quad i_k, i_{k-1} = 1, 2, \ldots, M$
$D_{\max}(i_0) = 0$
The optimal path terminates at $i_N^*$:
$i_N^* = \arg\max_{i_N} D_{\max}(i_N)$
• Complexity: $O(NM^2)$
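A log-domain sketch of the recursion (assumed conventions: P_trans[i, j] = $P(\omega_i|\omega_j)$, class_pdfs as before); it keeps explicit back-pointers so the optimal class sequence can be recovered:

import numpy as np

def viterbi(X, P_init, P_trans, class_pdfs):
    # D_max(i_k) = max_{i_{k-1}} [ D_max(i_{k-1}) + d(i_k, i_{k-1}) ], in the log domain.
    M, N = len(class_pdfs), len(X)
    log_lik = np.array([[np.log(class_pdfs[i](x)) for i in range(M)] for x in X])  # N x M
    logA = np.log(P_trans)                      # logA[i, j] = ln P(omega_i | omega_j)
    D = np.log(P_init) + log_lik[0]             # initial costs d(i_1, i_0)
    backptr = np.zeros((N, M), dtype=int)
    for k in range(1, N):
        scores = D[None, :] + logA              # scores[i, j] = D(j) + ln P(i | j)
        backptr[k] = np.argmax(scores, axis=1)  # best predecessor of each state i
        D = np.max(scores, axis=1) + log_lik[k]
    path = [int(np.argmax(D))]                  # optimal terminal state i_N^*
    for k in range(N - 1, 0, -1):               # backtrack along the stored pointers
        path.append(int(backptr[k, path[-1]]))
    return path[::-1]                           # O(N M^2) operations overall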
Channel Equalization
The problem
• $x_k = f(I_k, I_{k-1}, \ldots, I_{k-n+1}) + \eta_k$
• $\boldsymbol{x}_k = [x_k, x_{k-1}, \ldots, x_{k-l+1}]^T$
• Given $\boldsymbol{x}_k$, decide $\hat{I}_k$ or, allowing a delay of $r$ samples, $\hat{I}_{k-r}$:
$\boldsymbol{x}_k \rightarrow \text{equalizer} \rightarrow \hat{I}_{k-r}$
Example
• $x_k = 0.5 I_k + I_{k-1} + \eta_k$
• $\boldsymbol{x}_k = [x_k, x_{k-1}]^T, \quad l = 2$
• In $\boldsymbol{x}_k$ three input symbols are involved: $I_k$, $I_{k-1}$, $I_{k-2}$
Ik   Ik-1   Ik-2   xk    xk-1   class
0    0      0      0     0      ω1
0    0      1      0     1      ω2
0    1      0      1     0.5    ω3
0    1      1      1     1.5    ω4
1    0      0      0.5   0      ω5
1    0      1      0.5   1      ω6
1    1      0      1.5   0.5    ω7
1    1      1      1.5   1.5    ω8
(xk and xk-1 are the noise-free channel outputs 0.5·Ik + Ik-1 and 0.5·Ik-1 + Ik-2, respectively.)
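The table can be regenerated directly from the channel equation; a small illustrative script for the noise-free cluster centres:

import itertools

def channel(Ik, Ik1):
    # Noise-free part of the example channel: x_k = 0.5*I_k + I_{k-1}
    return 0.5 * Ik + Ik1

print("Ik Ik-1 Ik-2   xk  xk-1  class")
for i, (Ik, Ik1, Ik2) in enumerate(itertools.product([0, 1], repeat=3), start=1):
    xk, xk1 = channel(Ik, Ik1), channel(Ik1, Ik2)
    print(f" {Ik}    {Ik1}    {Ik2}   {xk:3.1f}  {xk1:3.1f}    ω{i}")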
Not all transitions are allowed
• Let $(I_k, I_{k-1}, I_{k-2}) = (0, 0, 1)$, i.e., state $\omega_2$.
• Then $(I_{k+1}, I_k, I_{k-1})$ can only be $(0, 0, 0)$ (state $\omega_1$) or $(1, 0, 0)$ (state $\omega_5$).
• Hence, for equiprobable input bits,
$P(\omega_i|\omega_2) = \begin{cases} 0.5, & i = 1, 5 \\ 0, & \text{otherwise} \end{cases}$
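The allowed transitions, and hence the probabilities $P(\omega_i|\omega_j)$, follow mechanically by shifting the bit triple; a sketch under the equiprobable-bit assumption:

import itertools
import numpy as np

states = list(itertools.product([0, 1], repeat=3))  # omega_1 ... omega_8 as (I_k, I_{k-1}, I_{k-2})
M = len(states)
P = np.zeros((M, M))                                # P[i, j] = P(omega_{i+1} | omega_{j+1}), 0-based
for j, (Ik, Ik1, Ik2) in enumerate(states):
    for new_bit in (0, 1):                          # the next transmitted bit I_{k+1}
        i = states.index((new_bit, Ik, Ik1))        # next state (I_{k+1}, I_k, I_{k-1})
        P[i, j] = 0.5                               # only 2 of the 8 targets are reachable
print(np.flatnonzero(P[:, 1]) + 1)                  # from omega_2 = (0,0,1): prints [1 5]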
• In this context, ωi are related to states.
Given the current state and the transmitted
bit, Ik, we determine the next state. The
probabilities P(ωi|ωj) define the state
dependence model.
The transition cost:
• $d(i_k, i_{k-1}) \equiv d_{i_k}(\boldsymbol{x}_k)$, where
$d_{i_k}(\boldsymbol{x}_k) = \left( (\boldsymbol{x}_k - \boldsymbol{\mu}_{i_k})^T \Sigma_{i_k}^{-1} (\boldsymbol{x}_k - \boldsymbol{\mu}_{i_k}) \right)^{1/2}$
for all allowable transitions, with $\boldsymbol{\mu}_{i_k}$ and $\Sigma_{i_k}$ the mean and covariance matrix of the cluster associated with state $\omega_{i_k}$.
Assume:
• Noise white and Gaussian
• A channel impulse response estimate $\hat{f}$ to be available
• Then $x_k - \hat{f}(I_k, \ldots, I_{k-n+1}) \approx \eta_k$ and
$d(i_k, i_{k-1}) = -\ln p(x_k|\omega_{i_k}) = -\ln p(\eta_k) \propto \left( x_k - \hat{f}(I_k, \ldots, I_{k-n+1}) \right)^2$
• The states are determined by the values of the binary variables $I_{k-1}, \ldots, I_{k-n+1}$. For $n = 3$, there will be 4 states.
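Under these assumptions, the transition cost is just a squared prediction error; a tiny sketch (f_hat is a hypothetical channel estimate, not from the original text):

def transition_cost(x_k, bits, f_hat):
    # d(i_k, i_{k-1}) proportional to (x_k - f_hat(I_k, ..., I_{k-n+1}))**2 for white Gaussian noise
    return (x_k - f_hat(*bits)) ** 2

f_hat = lambda Ik, Ik1: 0.5 * Ik + Ik1      # estimated impulse response of the example channel
print(transition_cost(0.6, (1, 0), f_hat))  # small cost: 0.6 is close to the noise-free value 0.5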
Hidden Markov Models
In the channel equalization problem, the states are
observable and can be “learned” during the
training period
Now we shall assume that states are not
observable and can only be inferred from the
training data
Applications:
• Speech and Music Recognition
• OCR
• Blind Equalization
• Bioinformatics
An HMM is a stochastic finite-state automaton that generates the observation sequence $x_1, x_2, \ldots, x_N$. We assume that the observation sequence is produced as a result of successive transitions between states; upon arrival at a state, an observation is emitted.
This type of modeling is used for nonstationary stochastic processes that undergo distinct transitions among a set of different stationary processes.
Examples of HMM:
• The single-coin case: Assume a coin that is tossed behind a curtain. All that is available to us is the outcome, i.e., H or T. Assume the two states to be:
S = 1 ↔ H
S = 2 ↔ T
This is also an example of a random experiment
with observable states. The model is characterized
by a single parameter, e.g., P(H). Note that
P(1|1) = P(H)
P(2|1) = P(T) = 1 – P(H)
• The two-coins case: For this case, we observe a
sequence of H or T. However, we have no way of knowing which coin was tossed. Identify one state for each coin. This is an example where states are not observable. H or T can be emitted from either state. The model depends on four parameters:
P1(H), P2(H), P(1|1), P(2|2)
• The three-coins case is defined analogously, with one state per coin.
• Note that in all previous examples, specifying the
model is equivalent to knowing:
– The probability of each observation (H, T) being emitted from each state.
– The transition probabilities among states: P(i|j).
A general HMM model is characterized by the
following set of parameters
• $K$, number of states
• $P(i|j),\ i, j = 1, 2, \ldots, K$
• $p(x|i),\ i = 1, 2, \ldots, K$
• $P(i),\ i = 1, 2, \ldots, K$, initial state probabilities
That is:
$S = \{P(i|j),\ p(x|i),\ P(i),\ K\}$
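For the discrete-observation case used below, one convenient (purely illustrative) container for $S$:

import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:              # S = {P(i|j), p(x|i), P(i), K}
    A: np.ndarray       # A[i, j] = P(i|j), K x K state-transition probabilities
    B: np.ndarray       # B[i, r] = P(x = r|i), K x R emission probabilities (discrete x)
    pi: np.ndarray      # pi[i] = P(i), initial state probabilities

    @property
    def K(self):        # number of states
        return len(self.pi)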
What is the problem in Pattern Recognition
• Given M reference patterns, each described by
an HMM, find the parameters, S, for each of
them (training)
• Given an unknown pattern, find which one of the M known patterns it matches best (recognition)
Recognition: Any path method
• Assume the M models to be known (M
classes).
• A sequence of observations, X, is given.
• Assume observations to be emissions upon the arrival at successive states
• Decide in favor of the model S* (from the M
available) according to the Bayes rule
$S^* = \arg\max_S P(S|X)$
for equiprobable patterns
$S^* = \arg\max_S p(X|S)$
• For each model $S$ there is more than one possible set of successive state transitions $\Omega_i$, each with probability $P(\Omega_i|S)$.
Thus:
$P(X|S) = \sum_i p(X, \Omega_i|S) = \sum_i p(X|\Omega_i, S)\, P(\Omega_i|S)$
• For the efficient computation of the above, DEFINE
– $\alpha(i_{k+1}) \equiv p(x_1, \ldots, x_{k+1}, i_{k+1}|S) = \sum_{i_k} \alpha(i_k)\, P(i_{k+1}|i_k)\, p(x_{k+1}|i_{k+1})$
where $\alpha(i_k)$ carries the history and $P(i_{k+1}|i_k)\, p(x_{k+1}|i_{k+1})$ the local activity.
• Observe that
$P(X|S) = \sum_{i_N = 1}^{K_S} \alpha(i_N)$
Compute this for each $S$.
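A sketch of this forward recursion for discrete observations, using the conventions assumed above (A[i, j] = P(i|j), B[i, r] = P(x = r|i)):

import numpy as np

def forward(X, A, B, pi):
    # alpha[k, i] = p(x_1, ..., x_{k+1}, i_{k+1} = i | S), with 0-based row index k
    N, K = len(X), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, X[0]]                      # P(i) p(x_1 | i)
    for k in range(1, N):
        alpha[k] = (A @ alpha[k - 1]) * B[:, X[k]]  # sum over i_k: history times local activity
    return alpha, alpha[-1].sum()                   # P(X | S) = sum_i alpha(i_N)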
• Some more quantities:
– $\beta(i_k) \equiv p(x_{k+1}, x_{k+2}, \ldots, x_N|i_k, S) = \sum_{i_{k+1}} \beta(i_{k+1})\, P(i_{k+1}|i_k)\, p(x_{k+1}|i_{k+1})$
– $\gamma(i_k) \equiv p(x_1, \ldots, x_N, i_k|S) = \alpha(i_k)\, \beta(i_k)$
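A matching backward-pass sketch; $\gamma$ is then just the elementwise product of the forward and backward arrays:

import numpy as np

def backward(X, A, B):
    # beta[k, i] = p(x_{k+2}, ..., x_N | i_{k+1} = i, S), with 0-based row index k
    N, K = len(X), A.shape[0]
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                   # no observations remain after x_N
    for k in range(N - 2, -1, -1):
        # sum over the next state: beta * P(next | current) * p(x_{k+1} | next)
        beta[k] = A.T @ (beta[k + 1] * B[:, X[k + 1]])
    return beta

# gamma = alpha * beta (elementwise), i.e. gamma(i_k) = p(x_1, ..., x_N, i_k | S)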
Training
• The philosophy:
Given a training set $X$, known to belong to the specific model, estimate the unknown parameters of $S$, so that the output of the model, e.g.,
$p(X|S) = \sum_{i_N = 1}^{K_S} \alpha(i_N)$
is maximized.
This is an ML estimation problem with missing data.
Assumption: Data $x$ discrete:
$x \in \{1, 2, \ldots, r\} \Rightarrow p(x|i) \equiv P(x|i)$
Definitions:
• $\xi_k(i, j) \equiv \dfrac{\alpha(i_k = i)\, P(j|i)\, P(x_{k+1}|j)\, \beta(i_{k+1} = j)}{P(X|S)}$
• $\gamma_k(i) \equiv \dfrac{\alpha(i_k = i)\, \beta(i_k = i)}{P(X|S)}$
The Algorithm:
• Initial conditions for all the unknown parameters. Compute $P(X|S)$.
• Step 1: From the current estimates of the model parameters, re-estimate the new model $\bar{S}$ from
$\bar{P}(j|i) = \dfrac{\sum_{k=1}^{N-1} \xi_k(i, j)}{\sum_{k=1}^{N-1} \gamma_k(i)} \quad \left( \approx \dfrac{\#\text{ of transitions from } i \text{ to } j}{\#\text{ of transitions from } i} \right)$
$\bar{P}_x(r|i) = \dfrac{\sum_{k=1,\, x_k = r}^{N} \gamma_k(i)}{\sum_{k=1}^{N} \gamma_k(i)} \quad \left( \approx \dfrac{\#\text{ of times at state } i \text{ with } x = r}{\#\text{ of times at state } i} \right)$
$\bar{P}(i) = \gamma_1(i)$
• Step 2: Compute $P(X|\bar{S})$. If $P(X|\bar{S}) > P(X|S)$, set $S = \bar{S}$ and go to Step 1. Otherwise stop.
• Remarks:
– Each iteration improves the model: $P(X|\bar{S}) \geq P(X|S)$
– The algorithm converges to a maximum (local or global)
– The algorithm is an implementation of the EM algorithm
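A compact sketch of one re-estimation pass, under the discrete-observation conventions assumed earlier (A[i, j] = P(i|j), B[i, r] = P(x = r|i)); it is one possible rendering of the step above, not the only one:

import numpy as np

def baum_welch_step(X, A, B, pi):
    # One pass of Step 1: compute xi_k(i, j), gamma_k(i) and the re-estimated model S_bar.
    X = np.asarray(X)
    N, K = len(X), len(pi)
    alpha, beta = np.zeros((N, K)), np.zeros((N, K))      # forward / backward passes
    alpha[0] = pi * B[:, X[0]]
    for k in range(1, N):
        alpha[k] = (A @ alpha[k - 1]) * B[:, X[k]]
    beta[-1] = 1.0
    for k in range(N - 2, -1, -1):
        beta[k] = A.T @ (beta[k + 1] * B[:, X[k + 1]])
    PX = alpha[-1].sum()                                   # P(X | S)
    gamma = alpha * beta / PX                              # gamma[k, i] = gamma_k(i)
    xi = np.zeros((N - 1, K, K))                           # xi[k, i, j] = xi_k(i, j)
    for k in range(N - 1):
        xi[k] = alpha[k][:, None] * A.T * (B[:, X[k + 1]] * beta[k + 1])[None, :] / PX
    A_new = (xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]).T  # P_bar(j|i), stored as A_new[j, i]
    B_new = np.stack([gamma[X == r].sum(axis=0) for r in range(B.shape[1])], axis=1)
    B_new /= gamma.sum(axis=0)[:, None]                    # P_bar(x = r | i)
    return A_new, B_new, gamma[0], PX                      # gamma[0] = P_bar(i)

Iterating this step and comparing the new $P(X|\bar{S})$ with the previous value implements the stopping rule of Step 2.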