Recovering the Linguistic Components of the Manual Signs in American Sign
Language
Liya Ding and Aleix M. Martinez
Dept. of Electrical and Computer Engineering
The Ohio State University
{dingl,aleix}@ece.osu.edu
Abstract
Manual signs in American Sign Language (ASL) are constructed using three building blocks: handshape, motion, and place of articulation. Only when these three are successfully estimated can a sign be uniquely identified. Hence, the use of pattern recognition techniques that rely on only a subset of these is inappropriate. To achieve accurate classifications, the motion, the handshape and their three-dimensional position need to be recovered. In this paper, we define an algorithm to determine these three components from a single video sequence of two-dimensional pictures of a sign. We demonstrate the use of our algorithm in describing and recognizing a set of manual signs in ASL.
1. Introduction
Sign languages are used all over the world as a primary means of communication by deaf people. American Sign Language (ASL) is one such language. According to current estimates, it is used regularly by more than 500,000 people, and up to 2 million people use it from time to time. There is thus a great need for systems that can be used with ASL (e.g., computer interfaces) or can serve as interpreters between ASL and English.
Like other sign languages, ASL has a manual component and a non-manual one (i.e., the face). The manual sign is further divided into three components: i) handshape, ii) motion, and iii) place of articulation [2, 13, 4]. Most manual signs can only be distinguished when all three components have been identified. An example is illustrated in Fig. 1. In this figure, the words (more accurately called concepts in ASL) “search” and “drink” share the same handshape, but have different motion and place of articulation; “family” and “class” share the same motion and place of articulation, but have a different handshape. Similarly, the concepts “onion” and “apple” would only be distinguished by the place of articulation: one at the mouth, the other at the chin.
If we are to build computers that can recognize ASL, it
is imperative that we develop algorithms that can identify
these three components of the manual sign. In this paper,
we present innovative algorithms for obtaining the motion, handshape and their three-dimensional position from
a single (frontal) video sequence of the sign. By position,
we mean that we can identify the 3D location of the hand
with respect to the face and torso of the signer. The motion is given by the path traveled by the dominant hand
from start to end of the sign. The dominant hand is that
which carries (most of) the meaning of the word – usually
the right hand for right-handed people. Finally, the handshape is given by the linguistically significant fingers [2].
In each ASL sign, only a subset of fingers is actually of
linguistic interest. While the position of the other fingers
is irrelevant, that of the linguistically significant fingers is
not. We address the issue as follows. First, we assume
that while not all the fingers are visible in the video sequence of a sign, those that are linguistically significant
are (even if only for a short period of time). This assumption is based on the observation that signers must provide
the necessary information to observers to make a sign unambiguous. Therefore, the significant fingers ought to be visible at some point. With this assumption in place, we can use structure-from-motion algorithms to recover the
3D structure and motion of the hand. To accomplish this
though, we need to define algorithms that are robust to
occlusions. This is necessary because although the significant fingers may be visible for some interval of time,
these may be occluded elsewhere. We thus need to extract
as much information as possible from each segment of our
sequence.
Once these three components of the sign are recovered,
we can construct a feature space to represent each sign.
This will also allow us to do recognition in new video
sequences. This is in contrast to most computer vision
systems which only use a single feature for representation and recognition [10, 1, 5]. In several of these algorithms, the discriminant information is generally searched
within a feature space constructed with appearance-based
features such as images of pre-segmented hands [1], hand
binary masks, and hand contours [12]. The other most
typically used feature set is motion [15].

Figure 1. Example signs with the same handshape but different motion (a-b); same motion but different handshapes (c-d).

Then, for recognition, one generally uses Hidden Markov Models [12], Neural Networks [15] and Multiple Discriminant Analysis [1]. These methods are limited to the identification of
signs clearly separated by the single feature in use. These
methods are thus not scalable to large systems.
Our first goal is thus to define a robust algorithm that
can recover the handshape and motion trajectory of each
manual sign as well as their 3D position. In our study, we
allow not only for self-occlusions but also for imprecise
localizations of the fiducial points. We restrict ourselves to
the case where the handshape does not change from start
to end, because this represents a sufficiently large number of the concepts in ASL [11] and allows us to use linear fitting algorithms. Derivations of our method are in
Section 2. The motion path of the hand can then be obtained using solutions to the three point resection problem
[6]. Since these solutions are generally very sensitive to mis-localized feature points, we introduce a robustified version that searches for the most stable computation.
Derivations for this method are in Section 3.
To recover the 3D position of the hand with respect to
the face, we make use of a face detection algorithm. We
find the distance from the face to the camera as the maximum distance traveled by the hand over all signs of that person. Using the perspective model, we can obtain the 3D position of the face in the camera coordinate system. Then,
the 3D position of the hand can be described using the face
coordinate system, providing the necessary information to
discriminate signs [4, 13, 2]. Derivations are in Section 4.
Experimental results are in Section 5, where we demonstrate the use of the three algorithms presented in this paper.
2. Handshape
We denote the set of all 3D world hand points (in particular, we will be using the knuckles) as $P_e = \{p_1, \dots, p_n\}$, where $p_i = (x_i, y_i, z_i)^T$ specifies the three-dimensional coordinates of the $i$th feature point in the Euclidean coordinate system. As is well known, the image points in camera $j$ are given by

$$Q_j = A_j P_e + b_j, \qquad j = 1, \dots, m, \quad \text{where} \quad Q_j = (q_{j1}, \dots, q_{jn}) = \begin{pmatrix} u_{j1} & \cdots & u_{jn} \\ v_{j1} & \cdots & v_{jn} \end{pmatrix} \qquad (1)$$

are the image points, and $A_j$ and $b_j$ are the parameters of the $j$th affine camera.
Since in our application the camera position does not change, $Q_j$ here are the image points in the $j$th frame of our video sequence. Our goal is to recover $P_e$ with respect to the object (i.e., hand) coordinate system from the known $Q_j$, $j = 1, 2, \dots, m$.
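As a concrete illustration of this affine imaging model, the sketch below (Python/NumPy; the variable names and the toy data are ours, not from the paper) projects a set of 3D knuckle points into several frames according to Eq. (1) and stacks the results into the measurement matrix used in what follows.

```python
import numpy as np

def affine_project(P_e, A_j, b_j):
    """Project 3D points with an affine camera: Q_j = A_j @ P_e + b_j.

    P_e : (3, n) 3D feature points (e.g., knuckles) in the object frame.
    A_j : (2, 3) affine camera matrix for frame j.
    b_j : (2, 1) affine translation for frame j.
    Returns Q_j : (2, n) image points (one (u, v) column per point).
    """
    return A_j @ P_e + b_j

# Toy example: n = 5 points observed in m = 2 frames.
P_e = np.random.randn(3, 5)
frames = []
for _ in range(2):
    A_j = np.random.randn(2, 3)   # stands in for the true camera parameters
    b_j = np.random.randn(2, 1)
    frames.append(affine_project(P_e, A_j, b_j))
D = np.vstack(frames)             # the 2m x n measurement matrix used below
```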
Jacobs [7] presents a method that uses the model defined above to recover the 3D shape even when not all
the feature points are visible in all frames, i.e., with occlusions. Let us first represent the set of affine equations
given above in a compact form as D = AP, where

$$D = \begin{pmatrix} Q_1 \\ Q_2 \\ \vdots \\ Q_m \end{pmatrix}, \qquad A = \begin{pmatrix} A_1 & b_1 \\ A_2 & b_2 \\ \vdots & \vdots \\ A_m & b_m \end{pmatrix}, \qquad P = \begin{pmatrix} p_1 & p_2 & \cdots & p_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}.$$
When there is neither noise nor missing data, D is of rank 4 or less, since it is the product of A (which has 4 columns) and P (which has 4 rows). If we consider a row vector of D as a point in $\mathbb{R}^n$, all the points from D lie in a 4-dimensional subspace of $\mathbb{R}^n$. This subspace, which is actually the row space of D, is denoted as L. Any four linearly independent rows of D span L.
When there is missing data in a row vector $D_i$ ($i = 1, 2, \dots, 2m$), all possible values that could occupy those positions have to be considered. The possible points in this row vector create an affine subspace denoted $E_i$. Assume we have four rows $D_h, D_i, D_j, D_l$ ($h, i, j, l = 1, 2, \dots, 2m$, with $h \neq i \neq j \neq l$), with or without missing data, and denote the set as $F_k = \{D_h, D_i, D_j, D_l\}$, $k \in \mathbb{N}$. There is a total of $n_f$ ($n_f \in \mathbb{N}$) possible sets $F_k$. If the four affine subspaces ($E_h, E_i, E_j, E_l$) corresponding to these four row vectors in $F_k$ do not intersect, then L should be a subset of $S_k = \mathrm{span}(E_h, E_i, E_j, E_l)$, $k = 1, 2, \dots, n_f$, and thus L should be a subset of the intersection of all possible spans of this kind. Hence, $S = \bigcap_{k=1,\dots,n_f} S_k$, and $L \subseteq S$.
Unfortunately, with localization noise and errors caused by inaccurate modelling, this relation between subsets is not retained. This can be solved by using the null space [7]. That is, the orthogonal complement of $S_k$ is denoted as $S_k^{\perp}$. If we have the matrix representation of $S_k^{\perp}$ as $N_k$, then $N = [N_1, N_2, \dots, N_{n_f}]$ is the matrix representation of $S^{\perp}$, and the null space of N is S. Using the SVD $N = UWV^T$, we can take the four columns of U corresponding to the four smallest singular values as the four rows of P. Next, we can find the matrix representation P' of the subspace L that is closest to being its null space according to
to the Frobenius norm. In this case, note that one of the
100
50
Y
0
−50
−100
−150
−100
0
100
X
−200
0
−100
100
200
Z
Figure 2. 3D handshape reconstructed from the 2D knuckle
points tracked over a video sequence.
vectors spanning D’s row space L is known to be a vector
with all 1s (because in homogeneous form, P has a row
with 1s).
If an image point is missing (i.e., both its (u, v) coordinates are missing), taking the two rows corresponding to the same image in $F_k$ proves beneficial for calculating $N_k$. Then, to calculate $N_k$ from $S_k$, we take $F_k = \{D_{2i-1}, D_{2i}, D_{2j-1}, D_{2j}, [1 \cdots 1]^T\}$, $i, j = 1, 2, \dots, m$, $i \neq j$, for better stability. We also eliminate the vectors with poor conditioning when calculating the null spaces $N_i$, $i = 1, \dots, n_f$, which are then combined into a more stable solution of N. This generally improves the performance of the algorithm defined above in practice.
If a column in D has a large amount of missing data, we may very well be unable to recover any 3-dimensional
information from it. Note, however, that this can only happen when one of the fingers is occluded during the entire
sequence, in which case, we will assume that this finger is
not linguistically significant. This means we do not need
to recover its 3D shape, because this will be irrelevant for
the analysis and understanding of the sign.
Once P has been properly estimated, we can use the non-missing data of each row in D to fill in the missing gaps as a linear combination of the rows of P. At the same time, we have decomposed the filled-in matrix $\hat{D}$ into P and $A = \hat{D} P^{+}$, where $P^{+}$ denotes the pseudo-inverse of P.
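The sketch below (Python/NumPy) is a simplified illustration of this null-space procedure under our own assumptions: it samples random quadruples of rows, uses a fixed rank threshold, and omits the refinements discussed above (pairing the two rows of the same image, appending the all-ones vector, and discarding poorly conditioned null spaces). It is not the authors' implementation.

```python
import numpy as np

def fit_rank4_with_missing(D, n_quadruples=300, rng=None):
    """Recover a rank-4 row-space basis of D (np.nan marks missing entries),
    fill in the missing entries, and recover the affine parameters.

    Returns P (4, n), the filled-in matrix D_hat, and A = D_hat @ pinv(P).
    The all-ones constraint and the Euclidean upgrade are not enforced here.
    """
    rng = np.random.default_rng(rng)
    two_m, n = D.shape
    nulls = []  # columns orthogonal to span(E_h, ..., E_l) for each quadruple
    for _ in range(n_quadruples):
        picked = rng.choice(two_m, size=4, replace=False)
        # Linear span of the four affine subspaces of possible completions:
        # zero-filled rows plus one standard basis vector per missing entry.
        span_vecs = []
        for r in picked:
            d = D[r].copy()
            miss = np.isnan(d)
            d[miss] = 0.0
            span_vecs.append(d)
            span_vecs.extend(np.eye(n)[np.where(miss)[0]])
        S = np.array(span_vecs)
        _, s, Vt = np.linalg.svd(S, full_matrices=True)
        rank = int(np.sum(s > 1e-8))
        nulls.append(Vt[rank:].T)            # basis of the orthogonal complement
    N = np.hstack(nulls)                      # matrix representation of S_perp
    # The row space L is (approximately) the null space of N^T: take the four
    # left singular vectors of N with the smallest singular values.
    U, _, _ = np.linalg.svd(N)
    P = U[:, -4:].T                           # (4, n)
    # Fill each row's missing entries as a linear combination of the rows of P.
    D_hat = D.copy()
    for r in range(two_m):
        known = ~np.isnan(D[r])
        coef, *_ = np.linalg.lstsq(P[:, known].T, D[r, known], rcond=None)
        D_hat[r, ~known] = coef @ P[:, ~known]
    A = D_hat @ np.linalg.pinv(P)             # recovered affine camera block
    return P, D_hat, A
```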
The above result P generates what is known as an affine shape. Two affine shapes are equivalent if there exists an affine transformation between them. To break this ambiguity, we can include the Euclidean constraints to find the Euclidean shape that best approximates the real one. One way to achieve this is to find a matrix H such that $M_j H$ ($j = 1, 2, \dots, m$) is orthographic, where $M_j = (A_j \; b_j)$. This is so because orthographic projections do not carry the shape ambiguity mentioned above. The Euclidean shape can be recovered using the Cholesky decomposition and non-linear optimization. Fig. 2 shows an example result of the recovery of the 3D handshape “W”.
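As an illustration of this metric upgrade step, the sketch below implements a standard Tomasi-Kanade style solution for the orthographic constraints on the 3x3 part of the motion matrix after the translations have been removed; the paper's own procedure (Cholesky decomposition plus non-linear optimization on $M_j H$) may differ in detail, so treat this only as a sketch of the idea.

```python
import numpy as np

def metric_upgrade(M):
    """Find H (3x3) such that, per frame, the two rows of M @ H have equal
    norm and are orthogonal (orthographic constraints).

    M : (2m, 3) affine motion block with the translations removed.
    """
    def coeffs(a, b):
        # a^T Q b written as a linear function of the 6 unique entries of Q.
        return np.array([a[0]*b[0],
                         a[0]*b[1] + a[1]*b[0],
                         a[0]*b[2] + a[2]*b[0],
                         a[1]*b[1],
                         a[1]*b[2] + a[2]*b[1],
                         a[2]*b[2]])

    rows, rhs = [], []
    for j in range(M.shape[0] // 2):
        m1, m2 = M[2*j], M[2*j + 1]
        rows.append(coeffs(m1, m1) - coeffs(m2, m2)); rhs.append(0.0)  # equal norms
        rows.append(coeffs(m1, m2)); rhs.append(0.0)                   # orthogonality
    rows.append(coeffs(M[0], M[0])); rhs.append(1.0)                   # fix global scale
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    # Factor Q = H H^T; an eigendecomposition lets us clip a slightly
    # indefinite estimate (caused by noise) back to positive semidefinite.
    w, V = np.linalg.eigh(Q)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
```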
3. Motion Reconstruction
We can now recover the 3D motion path of the hand by
finding the pose difference between each pair of consecutive frames. That means, we need to estimate the translation and rotation made by the object from frame to frame
using the camera coordinate system. A typical solution
is that given by the PNP resection problem [6]. Also, in
this case, the appropriate model to be used is perspective
projection.
In the three point perspective pose estimation problem, there are three object points, $p_1$, $p_2$, and $p_3$, with camera coordinates $p_i = (x_i, y_i, z_i)^T$. Our goal is to recover these values for each of the points. Since the 3D shape of the object has already been recovered, the interpoint distances, namely $a = \|p_2 - p_1\|$, $b = \|p_1 - p_3\|$, and $c = \|p_3 - p_2\|$, can be easily calculated.
As is well known, the perspective model is given by

$$u_i = f\,\frac{x_i}{z_i}, \qquad v_i = f\,\frac{y_i}{z_i}, \qquad i = 1, 2, 3, \qquad (2)$$

where $q_i = (u_i, v_i)^T$ is the $i$th image point and $f$ is the focal length of the camera. Then the object point $p_i$ lies in the direction specified by the following unit vector:

$$\tilde{q}_i = \frac{1}{\sqrt{u_i^2 + v_i^2 + f^2}} \begin{pmatrix} u_i \\ v_i \\ f \end{pmatrix}, \qquad i = 1, 2, 3.$$
Now, the task reduces to finding the scalars $s_1$, $s_2$ and $s_3$ such that $p_i = s_i \tilde{q}_i$, $i = 1, 2, 3$. The angles between these unit vectors can be calculated as

$$\cos\alpha = \tilde{q}_2 \cdot \tilde{q}_3, \qquad \cos\beta = \tilde{q}_1 \cdot \tilde{q}_3, \qquad \cos\gamma = \tilde{q}_1 \cdot \tilde{q}_2. \qquad (3)$$
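A small sketch (Python/NumPy; the helper name and array layout are our own) of the quantities entering the resection problem: the unit viewing directions of Eq. (2), the angles of Eq. (3), and the interpoint distances taken from the recovered 3D shape.

```python
import numpy as np

def resection_inputs(q, f, p_shape):
    """Compute the inputs to the three point resection problem.

    q       : (3, 2) image points (u_i, v_i) of the three features.
    f       : focal length of the camera.
    p_shape : (3, 3) the same three points in the recovered 3D hand shape
              (one point per row), used only for the interpoint distances.
    Returns (q_tilde, (cos_alpha, cos_beta, cos_gamma), (a, b, c)).
    """
    # Unit viewing directions: q_tilde_i proportional to (u_i, v_i, f).
    dirs = np.hstack([q, np.full((3, 1), float(f))])
    q_tilde = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    # Angles between the viewing directions, Eq. (3).
    cos_alpha = q_tilde[1] @ q_tilde[2]
    cos_beta = q_tilde[0] @ q_tilde[2]
    cos_gamma = q_tilde[0] @ q_tilde[1]
    # Interpoint distances of the rigid hand shape.
    a = np.linalg.norm(p_shape[1] - p_shape[0])
    b = np.linalg.norm(p_shape[0] - p_shape[2])
    c = np.linalg.norm(p_shape[2] - p_shape[1])
    return q_tilde, (cos_alpha, cos_beta, cos_gamma), (a, b, c)
```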
Grunert's solution to this problem is based on the substitutions $s_2 = \mu s_1$ and $s_3 = \nu s_1$, which allow us to reduce the three point resection problem to a fourth order polynomial in $\nu$:

$$A_4 \nu^4 + A_3 \nu^3 + A_2 \nu^2 + A_1 \nu + A_0 = 0, \qquad (4)$$

where the coefficients $A_4, A_3, A_2, A_1$ and $A_0$ are functions of the interpoint distances $a, b, c$ and the angles $\alpha, \beta, \gamma$ between the $\tilde{q}_i$ [6].
Such polynomials are known to have zero, two or four real roots. For each real root $\nu$, we can calculate $\mu$, $s_1$, $s_2$, $s_3$ and the values of $p_1$, $p_2$ and $p_3$. To recover the translation and rotation of the hand points, we use the nine equations given by

$$p_i = R\,{}^{w}p_i + t, \qquad i = 1, 2, 3, \qquad (5)$$
where ${}^{w}p_i$ is the hand point described in the world coordinate system, and R and $t = [t_x, t_y, t_z]^T$ are the rotation matrix and translation vector we want to recover. The nine entries of the rotation matrix are not independent. To reduce the twelve dependent parameters to nine and solve the nine equations, we use a two-step method as in [3]. Given the rotation matrix R, we further parameterize it as three rotation angles $r_x, r_y, r_z$. Altogether, we use six parameters, $r_x, r_y, r_z, t_x, t_y, t_z$, to represent the rotational and translational motion of each frame.
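For completeness, the sketch below shows one common closed-form way to recover R and t from the 3D-3D correspondences of Eq. (5), namely an SVD-based absolute orientation solution; the paper itself uses the two-step method of [3], so this is only an illustrative stand-in, and the angle convention shown is one of several possibilities.

```python
import numpy as np

def rigid_transform(p_world, p_cam):
    """Estimate R, t such that p_cam_i ~= R @ p_world_i + t from three (or
    more) point correspondences, via the SVD-based absolute orientation
    solution (not the two-step method of [3] used in the paper)."""
    p_world = np.asarray(p_world, float)   # (k, 3), points as rows
    p_cam = np.asarray(p_cam, float)       # (k, 3)
    cw, cc = p_world.mean(axis=0), p_cam.mean(axis=0)
    H = (p_world - cw).T @ (p_cam - cc)    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                     # guard against reflections
    t = cc - R @ cw
    return R, t

def rotation_to_angles(R):
    """Decompose R = Rz(rz) Ry(ry) Rx(rx) into three rotation angles."""
    ry = np.arcsin(-R[2, 0])
    rx = np.arctan2(R[2, 1], R[2, 2])
    rz = np.arctan2(R[1, 0], R[0, 0])
    return rx, ry, rz
```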
Since we regard the hand as a rigid object during the motion, the rotation and translation of any three-point group are identical. The polynomial defined in (4) may have more than one root and, unfortunately, it is not known in general which of these roots corresponds to the solution of our problem. To solve this, we calculate all the solutions given by all possible combinations of three feature points and describe them in a histogram. This allows us to select the result with the highest occurrence (i.e., with most votes) as our solution.

It is also known that geometric approaches to the three point resection problem are very sensitive to localization errors. We now define an approach to address this issue.

We assume that the correct localization is close to that given by the user or by an automatic tracking algorithm. We generate a set of candidate hand points by moving the original fiducial about a neighborhood of p x p pixels. The solutions of Grunert's polynomial for each candidate are then used to obtain all possible values for $r_x, r_y, r_z, t_x, t_y$ and $t_z$. Each of these results is described in a histogram, and the interval $I_0$ with most votes is selected. A wider interval centered at $I_0$ is then chosen, and the median of the results within this new interval corresponds to our final solution. Note that voting is first used to eliminate the outliers from the solutions of Grunert's polynomials; hence, our method is not affected by large deviations of the results. The median is then used to select the best result among the correct solutions obtained from the different image point localizations. The robustness of this algorithm has been studied in [3].
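The following sketch (Python/NumPy; the bin width and interval widening are our own choices, not values taken from the paper) illustrates the voting-plus-median selection applied independently to each pose parameter.

```python
import numpy as np

def robust_pose_parameter(candidates, bin_width=1.0, widen=3):
    """Select a robust value for one pose parameter (e.g., r_x or t_x).

    candidates : 1D array of estimates obtained from all perturbed fiducial
                 positions and all admissible roots of Grunert's polynomial.
    bin_width  : histogram bin width (problem dependent; our choice).
    widen      : half-width, in bins, of the enlarged interval around the
                 winning bin.
    Returns the median of the candidates falling in the widened interval.
    """
    candidates = np.asarray(candidates, float)
    lo, hi = candidates.min(), candidates.max()
    nbins = max(1, int(np.ceil((hi - lo) / bin_width)))
    counts, edges = np.histogram(candidates, bins=nbins)
    best = np.argmax(counts)                       # interval I0 with most votes
    left = edges[max(0, best - widen)]             # widened interval around I0
    right = edges[min(len(edges) - 1, best + 1 + widen)]
    inliers = candidates[(candidates >= left) & (candidates <= right)]
    return np.median(inliers)

# Applied independently to each of r_x, r_y, r_z, t_x, t_y, t_z.
```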
4. Place of articulation
To successfully represent the 3D handshape and motion trajectories recovered using the algorithms presented above, we need to be able to describe them with respect to the signer, not the camera. For this, we can set the face as a frame of reference. This is appropriate because the face provides a large amount of information and serves as the main center of attention [9]. For example, the sign “father” has handshape “5”, with the thumb tapping the forehead as the motion pattern. To represent this sign, we can use the center of the face as the origin of the 3D space. Without loss of generality, we can further assume the face defines the x-y plane of our 3D representation.

The cascade-based face detection method constructed from a set of Haar-like features [14] provides a fast and appropriate face detection algorithm for our application. Since the motion of the head during a sign is small, Gaussian models can be employed to fit the results. Given a model of the center of the face and its radius, false and bad detections can be readily corrected.

Figure 3. Face detection examples.

In our data set, extracted from [8], the signing person stands at a fixed position. Because there are no signs in ASL where the hand moves behind the face, for each person we can define the distance from the face to the camera as the maximum distance between the hand and the camera over all video sequences.

Once the face center $[u_f, v_f]^T$, the radius $r_f$, and the distance from the face to the camera $Z_f$ are known, we can calculate the center of the face in the camera coordinate system from $u_f = f X_f / Z_f$ and $v_f = f Y_f / Z_f$. We then define the x-y plane to be that provided by the face (which is from then on assumed to be a plane). Whenever the face is frontal, there is only a translation between the camera coordinate system and that of the face, i.e., $P_f = P_c - [X_f, Y_f, Z_f]^T$. This is the most common case in our examples, since the subjects in [8] were asked to sign while looking at the camera. Using this approach, we can define the place of articulation with respect to the subject's coordinate system. To normalize the 3D positions even further, we can scale this 3D space in the subject's coordinate system so that the radius of the face is always a pre-specified constant. This provides scale invariance.

Our 3D face model can be divided into eight regions: the forehead region, eye regions, nose region, mouth region, jaw region and cheek regions. This division allows us to represent the signs close to the face more precisely, allowing us, for example, to easily discriminate between “onion” and “apple”. At the same time, eye, nose and mouth detection could be employed to give a more accurate localization and a more detailed definition of the regions. This is illustrated in Fig. 4. The sign “water” is articulated with the tip of the index finger touching the jaw, while the sign “Wednesday” has a circular motion in front of the body (or face).

Figure 4. The places of articulation for the signs “water” and “Wednesday”.
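A minimal sketch (Python/NumPy; the function name, the target radius, and the back-projection of the face radius are our own assumptions) of the normalization just described: hand positions are translated into a face-centered coordinate system and scaled so that the face radius equals a pre-specified constant.

```python
import numpy as np

def to_face_coordinates(hand_cam, face_center_img, r_face, Z_face, f,
                        r_norm=100.0):
    """Express 3D hand positions in a scale-normalized face coordinate system.

    hand_cam        : (k, 3) hand positions in the camera coordinate system.
    face_center_img : (u_f, v_f) detected face center in the image.
    r_face          : detected face radius in the image (pixels).
    Z_face          : distance from the face to the camera (estimated as the
                      maximum hand-to-camera distance, as described above).
    f               : focal length.
    r_norm          : target face radius after normalization (our choice).
    """
    u_f, v_f = face_center_img
    # Back-project the face center with the perspective model:
    # u_f = f * X_f / Z_f and v_f = f * Y_f / Z_f.
    X_f, Y_f = u_f * Z_face / f, v_f * Z_face / f
    face_cam = np.array([X_f, Y_f, Z_face])
    # Frontal face: only a translation between camera and face frames.
    hand_face = hand_cam - face_cam
    # Approximate the 3D face radius from its image radius, then scale so
    # that it maps to the pre-specified constant.
    R_face = r_face * Z_face / f
    return hand_face * (r_norm / R_face)
```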
5. Experimental Results

In this section, we show some results using video sequences from the Purdue ASL Database [8]. This database contains 2576 video sequences. The first set of these videos includes a large number of video clips of motion primitives and handshapes. In particular, there is a subset of video clips in which each person signs two words (concepts) with distinct meanings. Although the signs have different meanings, they share a common handshape; the difference is given by the motion and/or place of articulation. These are used here to test the performance of our algorithms.

In general, it is very difficult to track the knuckles of the hand automatically, mainly due to the limited image resolution, inevitable self-occlusions and lack of salient characteristics present in the video sequences. Since our goal is to test the performance of the three algorithms presented in this paper, we opted for a manual detection of the fiducials. This allows us to demonstrate the use of our approach for representing and identifying signs.

Let us start by showing some example results of our algorithm. The signs “fruit” and “cat” share the same handshape and have a similar place of articulation. However, they are easily distinguished by their motion path. In Fig. 5, we can see the handshape of these two signs. A quantitative comparison further demonstrates their similarity. This is provided by the Euclidean distance between the knuckle points, once normalized by position and scale. Shift invariance is simply given by centering the position of the wrist. Then, we use a least-squares solution to match the rest of the two sets of points. The residual error (i.e., the sum of the squared distances), $R^2$, provides the error, which in our case is close to zero. For recognition of the handshape, we can compare the 3D shape of the reconstructed handshape with trained handshape models in the database.

Figure 5. The recovered handshapes of the sign “fruit” (left) and the sign “cat” (right).

For the places of articulation, we centered our 3D positions at the center of the face and normalized the radius of the face appropriately. The forehead region, eye regions, nose region, mouth region, chin regions and cheek regions of the face are inferred from a (learned) face model. At the same time, eye, nose and mouth detection can be employed to give a more accurate and
more detailed definition of the regions. For example, for the signs “fruit” and “cat,” the places of articulation are similar: “fruit” is articulated in the jaw region of the face, while “cat” starts at the jaw with a small motion to the right side of the face.
The motion of the sign “fruit” is a small rotation around the tips of the thumb and index finger (where the hand touches the face). The motion of the sign “cat” is a small repeated translation between the jaw region and the right-hand side of the jaw. The starting and ending points of the motion are detected as short pauses in signing (also known as zero-velocity crossings).
In Figs. 6 and 7, we show two sequences of images and
the corresponding projection of the reconstructed handshape. The 3D handshapes obtained by our algorithm (in
the face coordinate system) are shown above each image.
In addition, we provide the 3D trajectory recovered by the
robust method presented in this paper. The direction of
motion is marked along the trajectory. Since we have normalized the 3D representation with respect to the subjects’
coordinate system, the trajectory can be used for recognition. The algorithm we employ is the least squares solution presented above. In this case, we first discretize
the paths into an equal number of evenly separated points.
The residual of the least-squares fit provides the classification. This is sufficient to correctly classify the same signs as signed by other subjects.
Figure 6. Reconstruction of the 3D handshape and hand trajectory for the sign “fruit.” Four example frames with the 3D handshapes recovered by our method and its 3D trajectory.

Figure 7. Reconstruction of the handshape and hand trajectory for the sign “cat.” Five example frames with the 3D handshapes recovered by our method and the 3D trajectory of the sign.
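As an illustration of the trajectory comparison used for recognition, the sketch below (Python/NumPy; the function names and the resampling count are our own choices) resamples two recovered trajectories to the same number of evenly spaced points and uses the sum of squared distances as the classification residual.

```python
import numpy as np

def resample_path(path, n_points=30):
    """Resample a 3D trajectory (k, 3) to n_points evenly spaced by arc length."""
    path = np.asarray(path, float)
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, s[-1], n_points)
    return np.column_stack([np.interp(t, s, path[:, d]) for d in range(3)])

def trajectory_residual(path_a, path_b, n_points=30):
    """Sum of squared distances between two resampled trajectories. Both are
    assumed to be expressed in the normalized face coordinate system, so no
    further alignment is applied here."""
    A = resample_path(path_a, n_points)
    B = resample_path(path_b, n_points)
    return np.sum((A - B) ** 2)

# The test trajectory is assigned the label of the training trajectory with
# the smallest residual.
```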
6. Conclusions
To successfully represent and recognize a large number of ASL signs, one needs to be able to recover their 3D position, handshape and motion trajectory. In this paper we have presented a set of algorithms specifically designed to accomplish this. Since self-occlusions and imprecise fiducial detection are common in ASL, we have presented extensions of the structure-from-motion and resection algorithms that appropriately resolve these issues. We have also introduced the use of a face detector to identify the place of articulation of the sign. Together, these components allow us to uniquely identify a set of signs that differ in only one of these three components.
7. Acknowledgments

This research was supported in part by a grant from the National Institutes of Health.

References

[1] Y. Cui and J. Weng, “Appearance-Based Hand Sign Recognition from Intensity Image Sequences,” Computer Vision and Image Understanding, vol. 78, no. 2, pp. 157-176, 2000.
[2] D. Brentari, A Prosodic Model of Sign Language Phonology, MIT Press, 2000.
[3] L. Ding and A.M. Martinez, “Three-Dimensional Shape and Motion Reconstruction for the Analysis of American Sign Language,” in Proc. 2nd IEEE Workshop on Vision for Human Computer Interaction, 2006.
[4] K. Emmorey and J. Reilly (Eds.), Language, Gesture, and Space, Hillsdale, NJ: Lawrence Erlbaum, 1999.
[5] R.A. Foulds, “Piecewise Parametric Interpolation for Temporal Compression of Multijoint Movement Trajectories,” IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 1, pp. 199-206, 2006.
[6] R.M. Haralick, C. Lee, K. Ottenberg, and M. Nolle, “Review and Analysis of Solutions of the Three Point Perspective Pose Estimation Problem,” International Journal of Computer Vision, vol. 13, no. 3, pp. 331-356, 1994.
[7] D.W. Jacobs, “Linear Fitting with Missing Data for Structure-from-Motion,” in Proc. IEEE Computer Vision and Pattern Recognition, pp. 206-212, 1997.
[8] A.M. Martinez, R.B. Wilbur, R. Shay, and A.C. Kak, “The Purdue ASL Database for the Recognition of American Sign Language,” in Proc. IEEE Multimodal Interfaces, Pittsburgh (PA), November 2002.
[9] M.S. Messing and R. Campbell, Gesture, Speech, and Sign, Oxford University Press, 1999.
[10] S.C.W. Ong and S. Ranganath, “Automatic Sign Language Analysis: A Survey and the Future beyond Lexical Meaning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, June 2005.
[11] W.C. Stokoe, D.C. Casterline, and C.G. Croneberg, A Dictionary of American Sign Language on Linguistic Principles, Linstok Press, 1976.
[12] N. Tanibata, N. Shimada, and Y. Shirai, “Extraction of Hand Features for Recognition of Sign Language Words,” in Proc. International Conf. on Vision Interface, pp. 391-398, 2002.
[13] R.B. Wilbur, American Sign Language: Linguistic and Applied Dimensions, 2nd ed., Boston: Little, Brown, 1987.
[14] P. Viola and M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features,” in Proc. IEEE Computer Vision and Pattern Recognition, 2001.
[15] M. Yang, N. Ahuja, and M. Tabb, “Extraction of 2D Motion Trajectories and Its Application to Hand Gesture Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1061-1074, Aug. 2002.