Human Action Recognition by Sequence of Movelet Codewords

Xiaolin Feng† and Pietro Perona†‡
† California Institute of Technology, 136-93, Pasadena, CA 91125, USA
‡ Università di Padova, Italy

Abstract

An algorithm for the recognition of human actions in image sequences is presented. The algorithm consists of 3 stages: background subtraction, body pose classification, and action recognition. A pose is represented in space-time; we call it a 'movelet'. A movelet is a collection of the shape, motion and occlusion of image patches corresponding to the main parts of the body. The (infinite) set of all possible movelets is quantized into a finite collection of codewords by vector quantization. For every pair of frames each codeword is assigned a probability. Recognition is performed by simultaneously estimating the most likely sequence of codewords and the action that took place in a sequence. This is done using hidden Markov models. Experiments on the recognition of 3 periodic human actions, each under 3 different viewpoints, and 8 nonperiodic human actions are presented. Training and testing are performed on different subjects with encouraging results. The influence of the number of codewords on algorithm performance is studied.

1 Introduction

Detecting and recognizing human actions and activities is one of the most important applications in machine vision. Many approaches have recently been proposed to address it from different points of view [5, 4, 6, 8, 9, 10, 1, 2, 11, 7, 3]. Both top-down methods starting with human body silhouettes [5] and bottom-up methods based on low-level image features such as points [9] and patches [6, 4] have been proposed for detecting the human body and measuring body pose. When patch features are used, a typical bottom-up method assumes that the color and/or texture of each body part is known in advance. We do not wish to make any assumptions on the style of clothing (or absence thereof) or on the presence of good boundaries between limbs and torso. Therefore we take a top-down approach. Our basic assumption is that a sequence of regions of interest (ROI) containing a foreground moving object is available. We discretize the space of poses into a number of codewords which are fitted to the ROI for recognition. In order to further improve the signal-to-noise ratio we consider poses in space-time, which we call movelets.

We model each action as a stochastic concatenation of an arbitrary number of movelet codewords, each of which may in turn generate images stochastically. Hidden Markov models are used to represent actions and the images they generate. The Viterbi algorithm is used to find the most likely sequence of codewords and the action associated with a given observed sequence. The movelet codewords are learned from training examples and are shared among all actions (similarly to an alphabet of letters shared by all words). Results on training of models and recognition of 3 periodic actions and 8 nonperiodic actions imaged from multiple viewpoints are presented. We also explore experimentally how many codewords are necessary to represent the movelet space and how action recognition performance varies as a function of the number of frames that are available.

2 Movelet and Codeword: Human Body Modeling

If we monitor only the main parts of the body, a human body configuration can be characterized by the shape and motion of 10 parts: the head, the torso, 4 parts of the upper limbs and 4 parts of the lower limbs. We name such a configuration representation a movelet.
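To make this representation concrete, the following is a minimal sketch of one way a movelet could be stored and flattened into a feature vector for clustering, written in Python with NumPy. The part names, the choice of 5 rectangle parameters (center, size, orientation) and the dictionary layout are illustrative assumptions, not the paper's implementation; the actual parameterization and the origin convention are described in the remainder of this section.

```python
import numpy as np

# Ten main body parts (names are illustrative).
PARTS = ["head", "torso",
         "l_upper_arm", "l_forearm", "r_upper_arm", "r_forearm",
         "l_thigh", "l_shin", "r_thigh", "r_shin"]

class Movelet:
    def __init__(self, shapes, deformed_shapes):
        # shapes, deformed_shapes: dicts mapping a part name to a length-5
        # rectangle parameter vector (assumed here: cx, cy, width, height, angle),
        # containing only the parts visible in both frames and expressed
        # relative to a common origin (the head center, see below).
        self.shapes = shapes                    # S  : rectangles in the first frame
        self.deformed_shapes = deformed_shapes  # S' : rectangles in the second frame

    def occlusion_pattern(self):
        # The subset of visible parts; movelets are later grouped by this pattern.
        return tuple(sorted(self.shapes))

    def to_vector(self):
        # Stack all rectangle parameters into one vector so that movelets with
        # the same occlusion pattern live in a common metric space and can be
        # clustered (e.g. by K-means) into codewords.
        parts = sorted(self.shapes)
        return np.concatenate([np.r_[self.shapes[p], self.deformed_shapes[p]]
                               for p in parts])
```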
The shape of each body part on the image plane is modeled as a rectangle S_j (5 degrees of freedom, DOF). Assume this shape deforms into S'_j (5 DOF) in the next frame. Because of loose clothing, occlusion, etc., this deformation is no longer a 3-dimensional rigid motion projected onto the image plane, but a non-rigid motion S_j → S'_j acting in the full parametric space. Therefore, a human body movelet can be represented as

M = (S, S') = ( ∪_{j=1..10} S_j , ∪_{j=1..10} S'_j ).

The advantage of this parameterization over a direct shape-and-motion representation is that its parameters all live in the same metric space.

Consider a set of human actions we are interested in. The actions are sequences of movelets, and different actions may contain similar movelets at certain times. We collect the set of movelets of all actions observed over a learning period and cluster them into N codewords, denoted C_i, i = 1 … N, using an unsupervised vector quantization method; in this paper we choose the K-means algorithm. The vector quantization coarsely divides the whole movelet space into N regions and represents each region by a codeword.

There are two practical issues. The first is origin selection: the body movelet is defined relative to an origin, which we set at the center of the rectangular shape of the head. The positions of all parts are positions relative to this origin, which reduces the DOF of the movelet by 2. The second issue is self-occlusion: we observe in our experiments that over half of the movelets have some of their parts occluded, so their dimension is reduced to a lower DOF. We divide the movelets into subgroups according to their occlusion patterns. The vector quantization is performed within each subgroup, with the number of codewords proportional to the relative frequency of the occlusion pattern. The union of the codewords trained from the subgroups gives the final dataset of codewords.

3 Recognition of a Configuration

To recognize the observed action, we assume that the images of the moving body have been segmented as foreground in the image sequence, which can often be done by background subtraction or by independent moving-object segmentation. For each pair of frames, an observed configuration is defined as the foreground pixels 𝒳 = {X, X'}, where X and X' respectively denote the 2D positions of the foreground pixels in the first and second frame. Although we model the human body by parts in the previous section, for recognition the observed configuration consists of the foreground pixels in their entirety and a top-down approach is applied, since no technique is available to segment human body parts robustly.

Given a codeword with shape S and deformed shape S', we wish to calculate the likelihood of observing X and X'. Given the shape S, we assume the a priori probability of detecting a pixel X_i in the image as foreground (fg) or background (bg) follows the Binomial distribution:

              X_i ∈ fg    X_i ∈ bg
  X_i ∈ S        α         1 − α
  X_i ∉ S        β         1 − β

Table 1: Binomial distribution for P(X_i | S), where α and β are fixed parameters (Section 5.2 uses α = 0.9 and β = 0.1).

Assuming the pixels are independent, the likelihood of observing X as the foreground data given the codeword shape S is P(X | S) = ∏_{X_i ∈ image} P(X_i | S). The likelihood P(X' | S') is estimated in the same way. A codeword C includes both the human body shape S and its deformed shape S' in the next frame.
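Because the per-pixel terms are independent, the frame likelihood is a product over pixels and is computed in log space in practice. Below is a minimal sketch in Python/NumPy, assuming the codeword shapes have been rasterized into binary masks at a candidate position; the function names and mask representation are illustrative, and the α, β values are those quoted in Section 5.2. The pair likelihood P(𝒳 | C) of Eq. 1 below is simply the product of the two per-frame terms.

```python
import numpy as np

def frame_log_likelihood(fg_mask, shape_mask, alpha=0.9, beta=0.1):
    """Log P(X | S) under the per-pixel Binomial model of Table 1.

    fg_mask    : boolean image, True where a pixel was labeled foreground.
    shape_mask : boolean image of the same size, True inside the codeword
                 shape S placed at a candidate position.
    """
    # Probability that each pixel is observed as foreground:
    # alpha inside the shape, beta outside it.
    p_fg = np.where(shape_mask, alpha, beta)
    # P(X_i | S) per pixel (Table 1), then sum of logs over the whole image.
    p = np.where(fg_mask, p_fg, 1.0 - p_fg)
    return np.log(p).sum()

def config_log_likelihood(fg1, fg2, shape1, shape2, alpha=0.9, beta=0.1):
    # Log P(X | C) for a codeword C = (S, S'): the two frames are combined
    # as in Eq. 1 by multiplying (adding in log space) the frame terms.
    return (frame_log_likelihood(fg1, shape1, alpha, beta)
            + frame_log_likelihood(fg2, shape2, alpha, beta))
```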
The likelihood of observing the configuration 𝒳 given a codeword C = (S, S') can then be estimated as:

P(𝒳 | C) = P(X, X' | S, S') = P(X | S) P(X' | S')        (1)

4 HMM for Action Recognition

Assume the observation is a sequence of foreground configurations (𝒳_1, 𝒳_2, …, 𝒳_T), where each 𝒳_t, t = 1 … T, is defined as 𝒳 in Section 3. To recognize the observed sequence as one of the learned actions, a slightly modified hidden Markov model (HMM) is applied. We define the states of the HMM to be the trained codewords. The probability of an observation being generated from a state is P(𝒳 | C), obtained in the previous section. This probability is not obtained through the usual HMM learning scheme, because here we know exactly, by construction, what the states are.

Assume there are K human actions we want to recognize: H_k, k = 1 … K. In training, each movelet has been assigned to one codeword; therefore, for each training action k, we obtain a codeword sequence that represents its movelet sequence. Its HMM state (codeword) transition matrix A_k can easily be trained from this codeword sequence.

To recognize which action the observed sequence (𝒳_1, 𝒳_2, …, 𝒳_T) represents and to find its best representative codeword sequence, the following HMM solutions are applied. First, for each action hypothesis k ∈ (1 … K), we apply the Viterbi algorithm to search for the single best state (codeword) sequence, which maximizes the following probability:

(q*_{k1}, q*_{k2}, …, q*_{kT}) = arg max_{q_{k1}, q_{k2}, …, q_{kT}} P(q_{k1}, q_{k2}, …, q_{kT} | 𝒳_1, 𝒳_2, …, 𝒳_T, A_k)

where q_{kt} ∈ {C_1, …, C_N}. The human action H_{k*} that best represents the observed sequence is then determined by:

k* = arg max_k P(q*_{k1}, q*_{k2}, …, q*_{kT}, 𝒳_1, 𝒳_2, …, 𝒳_T | A_k)

Correspondingly, the sequence (q*_{k*1}, q*_{k*2}, …, q*_{k*T}) is the most likely codeword sequence recognized to represent the observed sequence (𝒳_1, 𝒳_2, …, 𝒳_T).

5 Experiments

5.1 Experimental Design for Codeword Training

The objective of training is to obtain the dataset of codewords from the movelets of the training actions. The human body movelet is represented by the union of the body parts and their motions; therefore, we need to identify each body part in the image frames of the training actions. For this purpose, we made a set of special clothes in which each body part has a distinguishable color, and the training subject wears a black mask on the head. We make sure that any two neighboring parts have different colors and that any two parts sharing the same color are located far away from each other. Each body part can therefore be segmented from the others by its color and location and represented by a rectangle. A movelet is formed by stacking together the rectangular shapes of all body parts visible in a pair of consecutive frames. Within a movelet, the shapes in both frames are relative to the same origin, the center of the head in the first frame. An example of a pair of training frames and the extracted movelet is shown in Fig. 1. Given the movelets of all training actions, codewords are learned following the scheme proposed in Section 2.

Figure 1: Example of training images and movelet. Left two: two consecutive images in the training sequence. Right two: estimated rectangular shapes for both frames, which together form a movelet.
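Once the codewords and each action's transition matrix A_k are available, the recognition step of Section 4 amounts to one Viterbi pass per action hypothesis. The sketch below (Python/NumPy) assumes the per-frame emission terms log P(𝒳_t | C_i) have already been computed, for example with the matching procedure described next in Section 5.2, and assumes a uniform initial state distribution, which the paper does not specify; the function names are illustrative.

```python
import numpy as np

def viterbi_log(log_emis, log_A, log_pi):
    """Most likely codeword (state) sequence for one action hypothesis.

    log_emis : (T, N) array, log P(X_t | C_i) for each frame pair and codeword.
    log_A    : (N, N) array, log transition probabilities of the action's HMM.
    log_pi   : (N,)  array, log initial state probabilities.
    Returns the best state sequence and its joint log probability.
    """
    T, N = log_emis.shape
    delta = log_pi + log_emis[0]          # best log prob of a path ending in each state
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A   # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    # Trace back the best state sequence.
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1], float(delta.max())

def recognize(log_emis, log_As, log_pis):
    # Run Viterbi under every action hypothesis k and keep the hypothesis with
    # the highest joint probability of best path and observations (Section 4).
    results = [viterbi_log(log_emis, A, pi) for A, pi in zip(log_As, log_pis)]
    k_star = int(np.argmax([score for _, score in results]))
    return k_star, results[k_star][0]
```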
5.2 Matching a Codeword to a Foreground Configuration

To estimate the likelihood P(𝒳 | C) of observing a foreground configuration given a codeword C, we need to search efficiently for the image position at which to place the codeword so that it best fits the foreground configuration. Although the codewords are represented relative to the center of the head, the centroid is the most reliable point to detect on a foreground object. Therefore, instead of searching for the head to locate the codeword, we match the codeword to the foreground configuration according to their centroids. For each codeword, we estimate the centroid of its shapes. Given a pair of test frames, the centroid of the first frame's foreground data is computed. We then take the square of 49 × 49 pixels centered at this centroid as the search region for placing the codeword's centroid. Fig. 2 shows an example of how the log-likelihood varies as the centroid is placed at different positions in this region. For each codeword, we search this region for the location that maximizes the observation log-likelihood of the first frame given the shape. The deformed shape of the codeword is correspondingly placed on the second frame. After locating the codeword, the log-likelihood of P(𝒳 | C) can be determined by Eq. 1. We arbitrarily choose α = 0.9 and β = 0.1. In our experiments we actually search for the centroid location of a codeword at every third pixel in the search region, which results in 16 × 16 candidate locations. The computation time of this search for one codeword is 0.012 seconds on a 750 MHz Pentium machine.

Figure 2: Log-likelihood varies with centroid position. Left: the search region for placing the centroid of a codeword is the 49 × 49 square. Right: log-likelihood log P(𝒳 | C) as a function of the assumed centroid position in the search region, shown in gray scale; brighter indicates higher log-likelihood.

5.3 Periodic Action Recognition

In this section we illustrate the recognition of a set of 9 view-based periodic actions: 3 different gaits (stepping in place, walking and running) captured at 3 different view angles (45°, 90° and 135° between the camera and the subject's direction). To keep the action and view angle controllable, all of these periodic actions were performed on a treadmill. The image sequences were captured at 30 frames/sec.

Training the Actions

The training actions were performed by 1 subject wearing the special colored clothes. For each training action, 1800 frames were collected and input to the codeword training algorithm. Some key frames are shown in Fig. 3. Movelets are obtained from each pair of frames as described in Section 5.1. From this set of movelets, we first train N = 270 codewords. The recognition performance obtained when training different numbers N of codewords is reported at the end of this section. As described in Section 2, the movelets are grouped according to their occlusion patterns.

Figure 3: Key frames of the training periodic actions: 45° stepping, 90° walking, 135° running.

Figure 4: Samples of trained codewords (the deformed shape S' is shown in gray, superimposed with the shape S in black). Shown are samples of the codewords that the 90° walking action (first 5 plots) and the 90° running action (last 5 plots) can pass through, obtained from their HMM transition matrices.

Table 2: Confusion matrix for periodic action recognition (T = 20).
There are 15 occlusion patterns in this training set. From the group of movelets associated with each occlusion pattern, a proportional number of codewords is trained. Given the trained codewords and their associated movelets, we learn the HMM transition matrix for each action. The transition matrix represents the possible temporal paths through the codewords that an action can take. In Fig. 4 we show a few samples of the codewords that the 90° walking and running actions pass through.

Recognition of the Actions

6 subjects participated in the test experiments. A sequence of 1050 frames (35 seconds of video) was captured for each of the 9 actions performed by each subject. Foreground data were obtained by background subtraction. We answer three questions in this experiment: 1. How well can the actions be recognized? 2. How many frames are needed to recognize an action with a certain accuracy? 3. How does the number of trained codewords influence the recognition performance?

Human vision does not need to observe the whole sequence (1050 frames) to recognize the action. To test how many frames are necessary for good recognition, each sequence is split into segments T frames long; the tail frames are ignored. We classify each segment into one of the learned actions. Table 2 shows the confusion matrix at T = 20. We make two observations from Table 2. First, the direction of the action is more difficult to recognize than the gait: we have no problem telling whether the action is stepping, walking or running, but have some difficulty telling its direction. Secondly, we claim that the 45° view angle is essentially not separable from the 135° view angle using only shape and motion cues. Fig. 5 shows examples of original images and the best-fit codewords for a few test actions. In this figure, although the first two actions, 45° and 135° walking, were performed by two different subjects, their foreground data are so similar that they are represented by the same codeword. Therefore 45° and 135° walking can be regarded as one action, and likewise 45° and 135° stepping, as well as 45° and 135° running.

The left two plots in Fig. 6 show how the correct recognition rate varies as the segment length T varies from 1 to 40 for the six actions (regarding the 45° and 135° actions as one). From the plots, we observe that better performance is obtained with longer segment lengths T. An action can be reliably recognized within about 20 frames, which is 0.66 seconds of action.

Figure 5: Top row: sample images of subjects' 45° walking, 135° walking, 90° stepping and 90° running. Bottom row: best-fit codewords.

Figure 6: Performance for periodic actions. Left two: recognition rate vs. segment length T. Right two: recognition rate at T = 20 vs. number of trained codewords.

What is the appropriate number of codewords to train in order to represent the space of movelets? This question can be answered experimentally.
We train different numbers N = (5, 9, 18, 27, 36, 45, 90, 135, 180, 225, 270) of codewords from the training set of movelets, and repeat the recognition experiments with each group of trained codewords. The right two plots in Fig. 6 show how the recognition rate at segment length T = 20 varies with the number of trained codewords. For all actions, recognition rates above 80% are achieved with N ≥ 135.

5.4 Nonperiodic Action Recognition

In this section we describe experiments on modeling and recognizing nonperiodic reaching actions. The actions are reaches in 8 different directions performed with the right arm and captured from the front view. Starting from reaching toward the top and proceeding counterclockwise, we name the actions A000°, A045°, A090°, A135°, A180°, A225°, A270° and A315°.

Training the Actions

The training actions were performed by 1 subject wearing the specially designed colored clothes. The training subject repeated each reaching action 20 times, lasting 40 seconds. We first learned 240 codewords for this set of actions.

Recognition of the Actions

A set of 400 test sequences of reaches in the different directions, performed by 5 subjects, was captured. Each sequence is about 60 frames (2 seconds of video) long. For the nonperiodic actions, recognition uses all of the frames that a sequence contains. The confusion matrix is shown in Table 3. We also train different numbers N = (4, 8, 16, 24, 32, 40, 80, 120, 160, 200, 240) of codewords for this set of actions, and repeat the recognition experiments with each group of trained codewords. The results are shown in Fig. 8. For all actions, recognition rates above 80% are achieved with N ≥ 120, and a 100% recognition rate is achieved for 6 out of the 8 actions.

Table 3: Confusion matrix for nonperiodic action recognition.

Figure 7: Top row: sample images of reaching actions used for testing. Bottom row: best-fit codewords.

Figure 8: Performance for nonperiodic actions. Recognition rate vs. number of trained codewords.

6 Conclusion and Discussion

In this paper we proposed and tested an algorithm for the simultaneous recognition of human poses and actions in image sequences. The experiments show that a recognition rate above 80% is achieved for all test actions, while over half of the actions can be recognized with an accuracy higher than 98%. Our fundamental idea can be extended to other types of action recognition, as long as the experimental setup for training is properly designed. A possible direction for future work is to recognize actions from sequences without background subtraction.

References

[1] A. Bissacco, F. Nori, and S. Soatto, "Recognition of human gaits", in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2001.

[2] C. Bregler, "Learning and recognizing human dynamics in video sequences", in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 568-574, 1997.

[3] J. Davis and A. Bobick, "The representation and recognition of action using temporal templates", in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 928-934, 1997.
[4] P. Felzenszwalb and D. Huttenlocher, "Efficient matching of pictorial structures", in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages II:66-73, 2000.

[5] I. Haritaoglu, D. Harwood, and L. Davis, "Ghost: a human body part labeling system using silhouettes", in Proc. Int. Conf. Pattern Recognition, pages 77-82, 1998.

[6] S. Ioffe and D. Forsyth, "Human tracking with mixtures of trees", in Proc. 8th Int. Conf. Computer Vision, pages I:690-695, 2001.

[7] R. Polana and R. Nelson, "Detecting activities", Journal of Visual Communication and Image Representation, 5(2):172-180, 1994.

[8] R. Rosales and S. Sclaroff, "Inferring body pose without tracking body parts", in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages II:721-727, 2000.

[9] Y. Song, X. Feng, and P. Perona, "Towards the detection of human motion", in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 810-817, 2000.

[10] Y. Yacoob and M. Black, "Parameterized modeling and recognition of activities", Computer Vision and Image Understanding, 73(2):232-247, 1999.

[11] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model", in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 379-385, 1992.