
Activity Detection Seminar
Sivan Edri

This capability of the human vision system
argues for recognition of movement directly
from the motion itself, as opposed to first
reconstructing a three-dimensional model
of a person and then recognizing the
motion of the model


First, I will present the construction of a
binary motion-energy image (MEI) which
represents where motion has occurred in an
image sequence – where there is motion.
Next, we generate a motion-history image
(MHI) which is a scalar-valued image where
intensity is a function of recency of motion
– how the motion is moving.


Taken together, the MEI and MHI can be
considered as a two-component version of a
temporal template, a vector-valued image
where each component of each pixel is
some function of the motion at that pixel
location.
These templates are matched against the
stored models of views of known
movements.
Example of someone sitting. The top row contains key frames; the bottom row
shows cumulative motion images starting from frame 0.


Let I(x, y, t) be an image sequence and let
D(x, y, t) be a binary image sequence
indicating regions of motion.
For many applications image differencing is
adequate to generate D.
Then, the binary MEI E_τ(x, y, t) is defined:

E_τ(x, y, t) = ⋃_{i=0}^{τ−1} D(x, y, t − i)
MEIs of the sitting movement over a 90° range of viewing angles.
The smooth change implies only a coarse sampling of viewing
direction is necessary to recognize the movement from all angles.


To represent how (as opposed to where) the
image motion is moving, we form a motion-history
image (MHI). In an MHI H_τ(x, y, t),
pixel intensity is a function of the temporal
history of motion at that point:

H_τ(x, y, t) = τ  if D(x, y, t) = 1
H_τ(x, y, t) = max(0, H_τ(x, y, t − 1) − 1)  otherwise

The result is a scalar-valued image where
more recently moving pixels are brighter.


Note that the MEI can be generated by
thresholding the MHI above zero.
Given this situation, one might ask:
why not use the MHI alone for recognition?

The computation is recursive.
The MHI at time t is computed from the MHI
at time t−1 and the current motion image
D(x, y, t), and the current MEI is computed
by thresholding the MHI.
The recursive definition implies that no
history of the previous images or their
motion fields need to be stored nor
manipulated, making the computation both
fast and space efficient.
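This recursive update can be sketched in a few lines of NumPy (a sketch, assuming a per-frame decay of 1 as in the definition above; D is the binary motion mask):

```python
import numpy as np

def mhi_update(H_prev, D, tau):
    """One recursive MHI step: pixels moving now are set to tau,
    all other pixels decay by one per frame, clamped at zero."""
    return np.where(D > 0, float(tau), np.maximum(0.0, H_prev - 1.0))

def mei_from_mhi(H):
    """The MEI is the MHI thresholded above zero."""
    return (H > 0).astype(np.uint8)

# A pixel that moved two frames ago ends up dimmer than one moving now.
H = np.zeros((1, 2))
for D in (np.array([[1, 0]]), np.array([[0, 0]]), np.array([[0, 1]])):
    H = mhi_update(H, D, tau=10)
print(H)  # [[ 8. 10.]]
```

Note that only the previous MHI is kept in memory, exactly the space-efficiency property described above.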


There is no consideration of optic flow, the
direction of image motion.
Note the relation between the construction
of the MHI and direction of motion.
Consider the waving example where the
arms fan upwards.


To evaluate the power of the temporal
template representation, 18 video
sequences of aerobic exercises were
recorded, performed several times by an
experienced aerobics instructor.
Seven views of the movement, −90° to +90°
in 30° increments in the horizontal plane,
were recorded.


The only preprocessing done on the data
was to reduce the image resolution
to 320 x 240 from the captured 640 x 480.
This step had the effect of not only
reducing the data set size,
but also of providing some
limited blurring which
enhances the stability of
the global statistics.



The Mahalanobis distance is a measure of
the distance between a point P and
a distribution D.
It is a multi-dimensional generalization of
the idea of measuring how many standard
deviations away P is from the mean of D.
This distance is zero if P is at the mean of
D, and grows as P moves away from the
mean.

The Mahalanobis distance of an observation
x from a group of observations with mean µ
and covariance matrix S is defined as:

D_M(x) = √( (x − µ)ᵀ S⁻¹ (x − µ) )

[Figure: contours of P(x) around the mean µ. With S = I the density decreases at the same rate in every direction; with S ≠ I it decreases fast along some directions and slowly along others.]






Intuitively, for one random variable the
Mahalanobis distance is |x − µ| / σ.
Let's say we have the following samples: 1, 1, 9, 9.
What is the mean?
What is the variance?
What is the standard deviation?
Let's compute the Mahalanobis distance of
sample 9:
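Working through the questions above numerically (using the population variance):

```python
import numpy as np

samples = np.array([1.0, 1.0, 9.0, 9.0])
mu = samples.mean()           # mean: 5.0
var = samples.var()           # population variance: 16.0
sigma = np.sqrt(var)          # standard deviation: 4.0

# 1-D Mahalanobis distance of sample 9: |x - mu| / sigma
d = abs(9.0 - mu) / sigma     # (9 - 5) / 4 = 1.0
print(mu, var, sigma, d)      # 5.0 16.0 4.0 1.0
```

So sample 9 lies exactly one standard deviation from the mean.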




Collect training examples of each
movement from a variety of viewing angles.
Compute statistical descriptions of the MEIs
& MHIs using moment-based features.
Our choice is the seven Hu moments.
To recognize an input movement, a
Mahalanobis distance is calculated between
the moment description of the input and
each of the known movements.
An example of MHIs with similar statistics. (a) Test input of
move 13 at 30°. (b) Closest match, which is move 6 at 0°.
(c) Correct match.
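The matching step can be sketched as follows. The per-movement means and covariances here are hypothetical placeholders; in practice the 7-dimensional feature vector would be the Hu moments of each MEI/MHI (obtainable, e.g., with OpenCV's cv2.HuMoments):

```python
import numpy as np

def mahalanobis(x, mu, S):
    """Mahalanobis distance of x from a distribution (mu, S)."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

# Hypothetical stored models: mean and covariance of the 7-dimensional
# Hu-moment feature vector for each known movement (values illustrative).
models = {
    "sit":  (np.zeros(7), np.eye(7)),
    "wave": (np.ones(7),  np.eye(7)),
}

def classify(features):
    """Pick the known movement with the smallest Mahalanobis distance."""
    return min(models, key=lambda name: mahalanobis(features, *models[name]))

print(classify(np.full(7, 0.1)))  # closer to the "sit" statistics -> sit
```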


For this experiment, two cameras are used,
placed such that they have orthogonal views
of the subject.
The recognition system now finds the
minimum sum of Mahalanobis distances
between the two input templates and two
stored views of a movement that have the
correct angular difference between them, in
this case 90°.


During the training phase, we measure the
minimum and maximum duration that a
movement may take, τmin and τmax.
If the test motions are performed at varying
speeds, we need to choose the right τ for
the computation of the MEI and the MHI.


At each time step, a new MHI H_τ(x, y, t) is
computed setting τ = τmax, where τmax is
the longest time window we want the
system to consider.
We choose Δτ = (τmax − τmin) / (n − 1), where
n is the number of temporal integration
windows to be considered.

A simple thresholding of MHI values less
than Δτ generates H_(τ−Δτ) from H_τ:

H_(τ−Δτ)(x, y, t) = max(0, H_τ(x, y, t) − Δτ)

[Figure: with τ = 20 and Δτ = 5, MHI values below Δτ = 5 map to zero and the rest shift down by 5, giving H_15 with values in 0 … 15.]

To compute the shape moments, we scale
H_τ by 1/τ. This scale factor causes all the
MHIs to range from 0 to 1 and provides
invariance with respect to the speed of the
movement. Iterating, we compute all n
MHIs; thresholding the MHIs yields the
corresponding MEIs.
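A sketch of this multi-timescale construction, assuming the backward-looking rule H_(τ−Δτ) = max(0, H_τ − Δτ) and the 1/τ scaling:

```python
import numpy as np

def mhi_set(H_max, tau_max, tau_min, n):
    """Derive the n scaled MHIs (and their MEIs) from the single MHI
    computed with tau = tau_max."""
    dtau = (tau_max - tau_min) / (n - 1)
    result = []
    for k in range(n):
        tau = tau_max - k * dtau
        H = np.maximum(0.0, H_max - k * dtau)   # H_tau from H_{tau_max}
        mhi = H / tau                           # scaled to the range [0, 1]
        mei = (H > 0).astype(np.uint8)          # MEI = thresholded MHI
        result.append((tau, mhi, mei))
    return result

# tau_max = 20, tau_min = 10, n = 3  ->  tau = 20, 15, 10
H_max = np.array([[20.0, 5.0, 0.0]])
for tau, mhi, mei in mhi_set(H_max, 20, 10, 3):
    print(tau, mhi, mei)
```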




Compute the various scaled MHIs and MEIs.
Compute the Hu moments for each image.
Check the Mahalanobis distance of the MEI
parameters against the known view/movement
pairs.
Any movement found to be within a threshold
distance of the input is tested for agreement of
the MHI. If more than one movement is matched,
we select the movement with the smallest
distance.


People can easily track individual players
and recognize actions such as running,
kicking, jumping etc. This is possible in
spite of the fact that the resolution is not
high – each player might be, say, just 30
pixels tall.
How do we develop computer programs
that can replicate this impressive human
ability?
Data flow for the algorithm. Starting with a stabilized figure-centric motion sequence, we compute the spatio-temporal
motion descriptor centered at each frame.
The descriptors are then matched to a database of
pre-classified actions using the k-nearest-neighbor
framework.
The retrieved matches can be used to obtain the correct
classification label, as well as other associated information.

Optical flow is the pattern of
apparent motion of objects, surfaces, and
edges in a visual scene caused by the
relative motion between an observer
(an eye or a camera) and the scene.

https://www.youtube.com/watch?v=JlLkkom6tWw
Constant Brightness Assumption - 2D Case:

I(x, y, t) = I(x + u, y + v, t + Δt)

Take the Taylor series expansion of I:

I(x + u, y + v, t + Δt) ≈ I(x, y, t) + (dI/dx)·u + (dI/dy)·v + (dI/dt)·Δt

Using the brightness assumption:

0 = I_t + I_x u + I_y v
* Taken from optical flow presentation by Hagit Hel-Or
Optical Flow Equation - Intuition

0 = I_t + I_x u + I_y v

The change in value I_t at a pixel P is dependent on:
- the distance moved (u), and
- the spatial gradient I_x ≈ ΔI/Δx.

[Figure: 1-D profiles I(x, t) and I(x, t + Δt); the temporal change I_t equals the slope I_x times the displacement u.]
Optical Flow Equation

I_x u + I_y v = −I_t
∇I · [u, v] = −I_t

Only the component of the flow in the gradient direction can be
determined.
The component of the flow parallel to an edge is unknown.

Optical Flow Equation

Solving for u, v:

I_x u + I_y v = −I_t

Shoot! One equation, two velocity unknowns (u, v).

Impose additional constraints
◦ Assume the pixel’s neighbors p1 … pN have the same (u, v):

I_x(p_i) u + I_y(p_i) v = −I_t(p_i),  i = 1 … N

In matrix form:

    [ I_x(p1)  I_y(p1) ]             [ I_t(p1) ]
    [ I_x(p2)  I_y(p2) ]   [ u ]     [ I_t(p2) ]
    [   ...      ...   ]   [ v ]  = −[   ...   ]
    [ I_x(pN)  I_y(pN) ]             [ I_t(pN) ]
          A (N×2)         x (2×1)      b (N×1)

A x = b
AᵀA x = Aᵀb
x = (AᵀA)⁻¹ Aᵀb

Equivalent to solving least squares:

(AᵀA) x = Aᵀb
 (2×2) (2×1) (2×1)

• The summations are over all pixels in the K × K window
• This technique was first proposed by Lucas & Kanade (1981)
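A minimal NumPy sketch of this single-window Lucas-Kanade solve (the derivatives and the eigenvalue cutoff 1e-6 are illustrative):

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Solve (A^T A) x = A^T b for the flow (u, v) in one window.
    Ix, Iy, It are the spatial/temporal derivatives over a K x K patch."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # N x 2
    b = -It.ravel()                                  # length N
    ATA = A.T @ A
    # Conditioning check: both eigenvalues should be comfortably nonzero.
    evals = np.linalg.eigvalsh(ATA)
    if evals[0] < 1e-6:
        return None          # flat region / aperture problem: unsolvable
    u, v = np.linalg.solve(ATA, A.T @ b)
    return u, v

# Synthetic patch translating by (u, v) = (1, 0), so It = -(Ix*u + Iy*v):
rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))
Iy = rng.normal(size=(5, 5))
It = -(Ix * 1.0 + Iy * 0.0)
print(lucas_kanade_patch(Ix, Iy, It))  # approximately (1.0, 0.0)
```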
When can we solve the LK equation?
The optimal (u, v) satisfies the Lucas-Kanade
equation when:
• AᵀA is invertible
• the eigenvalues of AᵀA are not too small
(noise)
• AᵀA is well-conditioned:
λ1/λ2 should not be too large
(λ1 = the larger eigenvalue)
Hessian matrix M = AᵀA (sums of gradient products over the window):

I_x = 0, I_y = 0 (flat region):
M = [0 0; 0 0] – non-invertible

I_x = k, I_y = 0 (vertical edge):
M = [k² 0; 0 0] – non-invertible

I_x = 0, I_y = k (horizontal edge):
M = [0 0; 0 k²] – non-invertible

I_x = k1, I_y = k2, k1 and k2 correlated (a rotated edge):
R M Rᵀ = [k² 0; 0 0] (R = rotation) – non-invertible

I_x = k1, I_y = k2, uncorrelated (Σ k1·k2 ≈ 0, e.g. a corner):
M = [k1² 0; 0 k2²] – invertible
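These invertibility cases can be checked numerically (a sketch; k = 2 and the sample gradients are illustrative):

```python
import numpy as np

# Rows of A are (Ix, Iy) gradient samples for one window.
flat   = np.zeros((10, 2))                                # Ix = Iy = 0
edge   = np.column_stack([np.full(10, 2.0), np.zeros(10)])  # Ix = k, Iy = 0
corner = np.random.default_rng(1).normal(size=(10, 2))    # uncorrelated

for name, A in [("flat", flat), ("edge", edge), ("corner", corner)]:
    M = A.T @ A
    evals = np.linalg.eigvalsh(M)   # ascending: evals[0] is the smaller
    print(name, "invertible" if evals[0] > 1e-9 else "non-invertible")
```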

Different motions – classified as similar
source: Ran Eshel

The algorithm starts by computing a figure-centric spatio-temporal volume for each
person. Such a representation can be
obtained by tracking the human figure and
then constructing a window in each frame
centered at the figure.
Track each player and recover a stabilized spatiotemporal volume,
which is the only data used by the algorithm.

Finding similarity between different motions
requires both spatial and temporal
information. This leads to the notion of the
spatio-temporal motion descriptor, an
aggregate set of features sampled in space
and time, that describe the motion over a
local time period.


The features are based on pixel-wise
optical flow as the most natural technique
for capturing motion independent of
appearance.
We think of the spatial arrangement of
optical flow vectors as a template that is to
be matched in a robust way.

Given a stabilized figure-centric sequence,
we first compute optical flow at each frame
using the Lucas-Kanade algorithm.
(a) original image
(b) optical flow Fx,y
(c) Separating the x and y components of optical flow
vectors
(d) Half-wave rectification of each component to produce
4 separate channels
(e) Final blurry motion channels
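The channel construction can be sketched in NumPy (a sketch; a small box blur stands in for the Gaussian blur used to make the channels "blurry"):

```python
import numpy as np

def box_blur(img, k=3):
    """Tiny box blur (a stand-in for a Gaussian blur)."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(k) for j in range(k)) / (k * k)

def motion_channels(Fx, Fy):
    """Half-wave rectify each optical-flow component into four
    non-negative channels (Fx+, Fx-, Fy+, Fy-), then blur each."""
    raw = [np.maximum(Fx, 0), np.maximum(-Fx, 0),
           np.maximum(Fy, 0), np.maximum(-Fy, 0)]
    return [box_blur(c) for c in raw]
```

Rectification loses no information: before blurring, the difference of the positive and negative channels reconstructs the signed flow component exactly.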


If the four motion channels for frame i of
sequence A are a¹ᵢ, a²ᵢ, a³ᵢ, a⁴ᵢ, and similarly
for frame j of sequence B, then the similarity
between motion descriptors centered at
frames i and j is:

S(i, j) = Σ_{t∈T} Σ_{(x,y)∈I} Σ_{c=1..4} aᶜ_{i+t}(x, y) · bᶜ_{j+t}(x, y)

where T and I are the temporal and spatial
extents of the motion descriptor respectively.
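This frame-centered similarity can be sketched as follows (a sketch; A and B hold the four blurry motion channels per frame, and the caller must keep i ± T/2 and j ± T/2 in bounds):

```python
import numpy as np

def descriptor_similarity(A, B, i, j, T):
    """Similarity between motion descriptors centered at frame i of A
    and frame j of B.  A, B have shape (frames, 4, H, W): the four
    blurry motion channels of each frame."""
    half = T // 2
    s = 0.0
    for t in range(-half, half + 1):             # temporal extent T
        s += float(np.sum(A[i + t] * B[j + t]))  # sum over channels, pixels
    return s

# Identical constant sequences, T = 3, four channels of 2x2 pixels:
A = np.ones((5, 4, 2, 2))
print(descriptor_similarity(A, A, 2, 2, T=3))  # 3 * 4 * 2 * 2 = 48.0
```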

To compare two sequences A and B, the
similarity computation must be done for
every pair of frames, one from A and one from B.


Ballet: choreographed actions, stationary
camera.
Clips of motions were digitized from an
instructional video for ballet showing
professional dancers, two men and two
women, performing mostly standard ballet
moves. The motion descriptors were
computed with 51 frames of temporal
extent.
(a) Ballet dataset (24800 frames).
Video of the male dancers was
used to classify the video of the
female dancers and vice versa.
Classification used 5-nearest-neighbors. The main diagonal
shows the fraction of frames
correctly classified for each class
and is as follows: [.94 .97 .88
.88 .97 .91 1 .74 .92 .82 .99 .62
.71 .76 .92 .96].


Tennis: real actions, stationary camera.
For this experiment, footage of two amateur
tennis players outdoors was shot. Each
player was video-taped on different days in
different locations with slightly different
camera positions. Motion descriptors were
computed with 7 frames of temporal extent.
(b) Tennis dataset.
The video was sub-sampled by a
factor of four, rendering the figures
approximately 50 pixels tall. Actions
were hand-labeled with six labels.
Video of the female tennis player
(4610 frames) was used to classify
the video of the male player (1805
frames). Classification used 5-nearest-neighbors.
The main diagonal is:
[.46 .64 .7 .76 .88 .42].


The visual quality of the motion descriptor
matching suggests that the method could
be used in graphics for action synthesis,
creating a novel video sequence of an actor
by assembling frames of existing footage.
The ultimate goal would be to collect a
large database of, say, Charlie Chaplin
footage and then be able to “direct” him in a
new movie.


Given a “target” actor database T,
and a “driver” actor sequence D, the goal is
to create a synthetic sequence S, that
contains the actor from T performing
actions described by D.
In practice, the synthesized motion
sequence S must satisfy two criteria:
◦ The actions in S must match the actions in
the “driver” sequence D.
◦ The “target” actor must appear natural when
performing the sequence S.
“Do as I Do” Action Synthesis.
The top row is a sequence of a “driver” actor, the bottom
row is the synthesized sequence of the “target” actor (one
of the authors) performing the action of the “driver”.


We can also synthesize a novel “target”
actor sequence by simply issuing
commands, or action labels, instead of
using the “driver” actor.
For example, one can imagine a video game
where pressing the control buttons will
make the real-life actor on the screen move
in the appropriate way.
We use the power of our data to correct imperfections in
each individual sample. The input frames (top row) are
automatically corrected to produce cleaned up figures
(bottom row).





The Recognition of Human Movement Using
Temporal Templates – Aaron F. Bobick and
James W. Davis, IEEE Computer Society.
Recognizing Action at a Distance – Alexei A. Efros,
Alexander C. Berg, Greg Mori, Jitendra Malik,
Computer Science Division, UC Berkeley, Berkeley,
CA 94720, USA.
http://en.wikipedia.org/wiki/Mahalanobis_distance
http://en.wikipedia.org/wiki/Optical_flow
Optical flow presentation by Hagit Hel-Or