
INTERNATIONAL CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2006
New York University, NY, June 17-22, 2006
USING BILINEAR MODELS FOR VIEW-INVARIANT ACTION AND IDENTITY RECOGNITION
Fabio Cuzzolin, UCLA Vision Lab, University of California at Los Angeles
• we want to recognize the identity of a walking person, regardless of the viewpoint or the gait performed
• bilinear models can be used to describe datasets in which each sequence possesses more than a single label
• a three-layer model, in which HMMs represent sequences and are fed to a bilinear model, is proposed as an effective view- and action-invariant approach to gaitID
VIEW-INVARIANCE IN IDENTITY
RECOGNITION FROM GAIT
• the problem: recognizing the identity of a person from the way he/she walks
• in a realistic setup, the person to identify would walk into the surveyed area from an arbitrary direction → view-invariance is required
• view-invariance is a particular case of the “style-invariance” issue
ASYMMETRIC BILINEAR MODELS
• an asymmetric bilinear model factors each observation as $y^{sc} = A^s b^c$
• $y^{sc}$ is a training set of $k$-dimensional observations with style label $s$ and content label $c$
• $b^c$ is a parameter vector representing the content, while $A^s$ is a style-specific linear map from the content space onto the observation space
• an asymmetric bilinear model can be learned from the training observations through the SVD of the stacked observation matrix $Y$
• when new motions are acquired in which a known person is seen walking from a different viewpoint (unknown style $\tilde{s}$), an iterative EM procedure can be set up to classify the content (identity), as in the sketch below:
E step: estimation of $p(c \mid \tilde{s})$, the probability of each content class given the current style estimate, through $p(y \mid \tilde{s}, c) \propto \exp\left(-\|y - A^{\tilde{s}} b^c\|^2 / 2\sigma^2\right)$
M step: re-estimation of the linear map $A^{\tilde{s}}$ for the new style
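As a rough illustration of this step, here is a minimal NumPy sketch of the SVD fit and the EM content classifier; the function names, array shapes, and the model dimension J are assumptions for illustration, not taken from the poster.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, J):
    """Fit y^{sc} = A^s b^c by SVD of the stacked observation matrix.

    Y: (S*k, C) matrix whose s-th row block holds the mean k-dim
       observation of each content class c in style s (assumed layout);
    J: model dimension (assumed hyperparameter).
    Returns A, the (S*k, J) stack of style maps, and B (J, C),
    one content vector b^c per column.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :J] * s[:J], Vt[:J, :]

def classify_content(y_new, B, sigma=1.0, n_iter=50, seed=0):
    """EM classification of observations shot in an unknown style.

    y_new: (k, N) observations of the new style; B: (J, C) content
    vectors. Alternates the E step p(c|y) ~ exp(-||y - A b^c||^2/2s^2)
    with a least-squares M step for the new style map A.
    """
    k, _ = y_new.shape
    A = np.random.default_rng(seed).standard_normal((k, B.shape[0]))
    for _ in range(n_iter):
        # E step: responsibility of each content class for each sample
        recon = A @ B                                   # (k, C)
        d2 = ((y_new[:, :, None] - recon[:, None, :]) ** 2).sum(0)
        p = np.exp(-(d2 - d2.min(1, keepdims=True)) / (2 * sigma**2))
        p /= p.sum(1, keepdims=True)                    # (N, C)
        # M step: weighted least squares for the new style map
        A = (y_new @ p @ B.T) @ np.linalg.pinv((B * p.sum(0)) @ B.T)
    return p.argmax(1)                                  # MAP content per sample
```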
FROM VIEW-INVARIANCE TO STYLE-INVARIANCE
• consider a training set of sequences in which each sequence is associated with more than a single label;
• each motion can in fact be classified according to the person who performed it, the category of action performed (e.g. walking, reaching out, pointing), or (if the number of cameras is finite) the viewpoint from which the sequence is shot;
• multilinear and bilinear models can be seen as tools for separating the "style" and "content" of the objects to classify (see the sketch after this list).
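As a toy illustration of this setup, the snippet below arranges triple-labeled sequences by a chosen (style, content) label pairing, treating the remaining label as a nuisance; the dictionary layout and function name are illustrative assumptions, not from the poster.

```python
# each sequence carries three labels: identity, action, viewpoint
sequences = [
    {"id": 3, "action": "walk_slow", "view": 1, "features": ...},
    # ... one entry per training sequence
]

def split_style_content(sequences, style_key, content_key):
    """Group sequence features by a chosen (style, content) pairing."""
    groups = {}
    for seq in sequences:
        key = (seq[style_key], seq[content_key])
        groups.setdefault(key, []).append(seq["features"])
    return groups

# e.g. view-invariant gaitID: style = viewpoint, content = identity
training = split_style_content(sequences, style_key="view", content_key="id")
```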
THREE-LAYER MODEL
• the training set of sequences is used to learn a three-layer model
• the model can thereafter be used to estimate the content of new image sequences shot in an unknown style
FEATURE EXTRACTION FROM SILHOUETTES
• in the first layer, features are extracted from the available silhouettes by simply projecting their contour onto a family of lines passing through their center, as in the sketch below;
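The poster only sketches this feature; a minimal NumPy reading of it, assuming the feature is the signed extent of the contour along each projection line (the number of lines is an illustrative parameter), is:

```python
import numpy as np

def contour_projection_features(contour, n_lines=8):
    """Project a silhouette contour onto lines through its center.

    contour: (N, 2) array of (x, y) boundary points (assumed format).
    Returns a 2*n_lines feature vector: max and min signed projection
    of the contour along each line direction.
    """
    centered = contour - contour.mean(axis=0)          # recenter contour
    feats = []
    for theta in np.linspace(0, np.pi, n_lines, endpoint=False):
        d = np.array([np.cos(theta), np.sin(theta)])   # line direction
        proj = centered @ d                            # signed projections
        feats.extend([proj.max(), proj.min()])         # extent along line
    return np.asarray(feats)
```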
EXPERIMENTS WITH THE MOBO DATABASE
• we chose the CMU Mobo database, in which 25 different people perform four different walking-related actions: walking slow, walking fast, walking along a slope, and walking while carrying a ball; cameras are more or less equally spaced around the treadmill.
VIEW- AND ACTION-INVARIANT GAIT ID
• view-invariant gaitID. Left: score as a function of the nuisance (action), test view 1. Right: score for the dataset of sequences of action "slow", different test views.
HIDDEN MARKOV MODELS
• in the second layer, each feature sequence is fed to an HMM with a fixed number of states, yielding a dataset of Markov models;
• as the dynamics is the same for all sequences, each sequence can be represented by the stacked C matrix of its HMM, as in the sketch below
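A minimal sketch of this second layer, assuming a Gaussian-output HMM whose state-output means form the C matrix; the hmmlearn dependency and the state count are assumptions, not named in the poster.

```python
from hmmlearn.hmm import GaussianHMM  # assumed dependency

def hmm_representation(feature_seq, n_states=4):
    """Fit a fixed-size Gaussian HMM to one feature sequence and
    return its stacked state-output matrix C as the representation.

    feature_seq: (T, d) array of per-frame features;
    n_states: fixed number of HMM states (illustrative value).
    """
    model = GaussianHMM(n_components=n_states, covariance_type="diag")
    model.fit(feature_seq)
    # with shared dynamics assumed across sequences, the emission means
    # (one d-dim column per state) summarize the sequence
    C = model.means_.T            # (d, n_states) "C matrix"
    return C.ravel()              # stacked into a single observation vector
```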
FEATURE PERFORMANCE COMPARISON
• sequences in the Mobo database have three different labels: identity, action, and viewpoint;
• we ran four series of tests, building bilinear models for different choices of content and style labels: view-invariant gaitID, action-invariant gaitID, view-invariant action recognition, and style-invariant action classification;
• we compared with two other approaches: a baseline algorithm, and the direct application of a nearest-neighbor (NN) classifier on the dataset of HMMs using the Kullback-Leibler divergence (a sketch of the latter follows).
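For context, a minimal sketch of the KL-based nearest-neighbor baseline; the poster does not specify which KL approximation between HMMs was used, so this sketch matches states one-to-one and uses a symmetrized Gaussian KL, which is only one plausible choice.

```python
import numpy as np

def gauss_kl_diag(m0, v0, m1, v1):
    """KL divergence between diagonal Gaussians N(m0, v0) and N(m1, v1)."""
    return 0.5 * np.sum(np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def hmm_divergence(h0, h1):
    """Symmetrized per-state KL between two HMMs with matched states.

    h0, h1: (means, vars) pairs, each of shape (n_states, d); assumes the
    fixed state count and shared dynamics let states be matched in order.
    """
    (m0, v0), (m1, v1) = h0, h1
    return sum(gauss_kl_diag(m0[i], v0[i], m1[i], v1[i]) +
               gauss_kl_diag(m1[i], v1[i], m0[i], v0[i])
               for i in range(len(m0)))

def nn_classify(test_hmm, train_hmms, train_labels):
    """1-NN classification of an HMM under the divergence above."""
    d = [hmm_divergence(test_hmm, h) for h in train_hmms]
    return train_labels[int(np.argmin(d))]
```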
• performance of the bilinear classifier in the ID vs. action experiment as a function of the nuisance (view = 1, ..., 5), averaged over all possible choices of the test action. The average best-match performance of the bilinear classifier is shown in solid red (minimum and maximum in magenta); the best-3-matches ratio is in dotted red. The average performance of the KL nearest-neighbor classifier is shown in solid black, with minimum and maximum in blue. Pure chance is in dashed black.
• Left: ID-invariant action recognition using the bilinear classifier. The entire dataset is considered, regardless of the viewpoint. The correct-classification percentage is shown as a function of the test identity in black (for models using Lee's features) and red (contour projections). The related mean levels are drawn as dotted lines. Right: view-invariant action recognition.