Recognizing and Tracking Human Action
Josephine Sullivan and Stefan Carlsson
Define Tracking
Traditional tracking
• Kalman Filters
• Condensation
• HMM
• Matching articulated 3D models
• Similarities?
• Problems?
New approach
• What is the difference between tracking and recognition?
• Assume pose recognition and activity recognition are equivalent.
• Then track activity by repeatedly recognizing key frames.
Discussion: reasons for the previous approach
• Why the distinction between tracking and recognition?
• Applications?
– Projectile tracking
– Motion capture
Object descriptors
• Embedding global data in local descriptors
• Order Structure
• Shape context
Order Structure
• Problem: find correspondences between deformed shapes
• Solution
– Sample points on the contour
– Describe the shape using its order structure
• Order of points and intersections of tangent lines
Order Structure
• Many transformations preserve order structure
– A superset of affine and projective transformations
– Encodes perceptual similarity
• Encodes properties of point sets, lines, and combinations of points and lines.
• Descriptor for point sets: orientation
• A set {a, b, c} has + orientation if traversing a, b, c in order is an anticlockwise rotation
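The orientation of a triple can be computed as the sign of a 3×3 determinant of homogeneous point coordinates. A minimal sketch, assuming points are plain `(x, y)` tuples; `orientation`, `order_structure` and `affine` are illustrative helpers, not taken from the paper. It also checks the claim that an affine map with positive determinant leaves every triple's orientation unchanged, while a reflection flips every sign.

```python
from itertools import combinations


def orientation(a, b, c):
    """Return +1 if a -> b -> c turns anticlockwise, -1 if clockwise, 0 if collinear.

    Equal to the sign of det([[ax, bx, cx], [ay, by, cy], [1, 1, 1]]),
    i.e. the z-component of the cross product (b - a) x (c - a).
    """
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def order_structure(points):
    """Orientation of every index triple i < j < k (hypothetical helper)."""
    return {(i, j, k): orientation(points[i], points[j], points[k])
            for i, j, k in combinations(range(len(points)), 3)}


def affine(p, m=((2, 1), (1, 3)), t=(3, -1)):
    """Apply x -> M x + t; the default M has determinant 5 > 0."""
    return (m[0][0] * p[0] + m[0][1] * p[1] + t[0],
            m[1][0] * p[0] + m[1][1] * p[1] + t[1])


pts = [(0, 0), (4, 1), (3, 5), (1, 3), (2, 2)]
print(orientation((0, 0), (1, 0), (0, 1)))  # 1 (anticlockwise)
print(order_structure(pts) == order_structure([affine(p) for p in pts]))  # True
```

An affine map multiplies every cross product by det(M), so a positive determinant preserves all signs; this is one concrete family among the transformations that preserve order structure.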
Order Structure
• Descriptor for sets of lines
– Uses the projective duality of points and lines
– p: homogeneous coordinates of a point
– q: oriented homogeneous coordinates of a line through p, so that $q^T p = 0$
– $q = (a, 1, b)^T$ for the line $ax + y + b = 0$
– By duality with the point case, the order type of a set of 3 lines is then $\operatorname{sign}\det[q_1\; q_2\; q_3]$
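A sketch of the line order type, assuming (by duality with the point case) that it is the sign of the determinant whose columns are the three line vectors $q = (a, 1, b)$. Lines are stored as the pair `(a, b)` of $ax + y + b = 0$, which cannot represent vertical lines; `det3`, `line_coords` and `line_order_type` are hypothetical names.

```python
def det3(c1, c2, c3):
    """Determinant of the 3x3 matrix whose columns are c1, c2, c3."""
    return (c1[0] * (c2[1] * c3[2] - c3[1] * c2[2])
            - c2[0] * (c1[1] * c3[2] - c3[1] * c1[2])
            + c3[0] * (c1[1] * c2[2] - c2[1] * c1[2]))


def line_coords(a, b):
    """Oriented homogeneous coordinates q = (a, 1, b) of the line a*x + y + b = 0."""
    return (a, 1, b)


def line_order_type(l1, l2, l3):
    """Sign of det[q1 q2 q3]: +1, -1, or 0 when the three lines are concurrent."""
    d = det3(line_coords(*l1), line_coords(*l2), line_coords(*l3))
    return (d > 0) - (d < 0)


# y = 0, y = -x and y = -1 are not concurrent, so the sign is nonzero.
print(line_order_type((0, 0), (1, 0), (0, 1)))   # -1
# Swapping two lines swaps two columns and flips the sign.
print(line_order_type((1, 0), (0, 0), (0, 1)))   # 1
```

Concurrent lines (including parallel lines, which meet at infinity) have linearly dependent coordinate vectors, so their order type is 0.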
Order Structure
• Descriptor for combinations of points and lines
– Oriented coordinates: every line has a direction
• Assign a left/right position to every point with respect to every line, $\operatorname{sign}(q_i^T p_j)$ for line $q_i$ and point $p_j$
• Unique order structure for an arbitrary set of points
• The order structure of a set is characterized by an index
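The left/right assignment can be sketched as the sign of $q_i^T p_j$ with $p_j = (x, y, 1)$: the value is zero exactly when $q_i^T p_j = 0$, i.e. when the point lies on the line. This sketch assumes the $(a, 1, b)$ line parametrization, under which a positive sign means the point lies above the line ($y > -ax - b$); `side` and `point_line_table` are hypothetical names.

```python
def side(line, point):
    """Sign of q^T p for q = (a, 1, b) and p = (x, y, 1): +1, -1, or 0 on the line."""
    a, b = line
    x, y = point
    value = a * x + y + b
    return (value > 0) - (value < 0)


def point_line_table(lines, points):
    """Rows are lines q_i, columns are points p_j; entries are +1, -1 or 0."""
    return [[side(q, p) for p in points] for q in lines]


# Lines y = 0 and x + y - 2 = 0; points (0, 1) and (3, 0).
print(point_line_table([(0, 0), (1, -2)], [(0, 1), (3, 0)]))  # [[1, 0], [-1, 1]]
```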
Order Structure
• Algorithm
• Voting matrix
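The slide only names the voting matrix. As a hedged illustration of the general idea, and not the authors' exact procedure, the sketch below fills a matrix V in which V[i][j] counts how often point i of shape A and point j of shape B occupy the same position in index triples (taken in contour order) that have the same orientation; correspondences are then read off row by row. Only point-triple orientation is used here, whereas the descriptor above also uses tangent lines; all names are hypothetical.

```python
from itertools import combinations


def orientation(a, b, c):
    """+1 anticlockwise, -1 clockwise, 0 collinear."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def voting_matrix(shape_a, shape_b):
    """V[i][j]: votes for point i of shape_a corresponding to point j of shape_b.

    Both shapes are lists of (x, y) points sampled in contour order. Every pair
    of increasing index triples with equal orientation casts one vote for each
    of its three implied correspondences.
    """
    V = [[0] * len(shape_b) for _ in shape_a]
    triples_b = [(t, orientation(*(shape_b[k] for k in t)))
                 for t in combinations(range(len(shape_b)), 3)]
    for ta in combinations(range(len(shape_a)), 3):
        o_a = orientation(*(shape_a[k] for k in ta))
        for tb, o_b in triples_b:
            if o_a == o_b:
                for i, j in zip(ta, tb):
                    V[i][j] += 1
    return V


def best_matches(V):
    """For each row i, the column j with the most votes (ties go to the lowest j)."""
    return [max(range(len(row)), key=row.__getitem__) for row in V]


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(best_matches(voting_matrix(square, square)))  # [0, 1, 2, 3]
```

Because only orientations enter the votes, any transformation that preserves order structure (for example an affine map with positive determinant) applied to one shape leaves V unchanged.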
Order Structure
• Perceptual similarity example: human pose
Shape Context descriptor
• Sample points from edges in image
• Each point’s descriptor is a histogram of the relative coordinates of all other points.
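A minimal shape context sketch. The slide does not specify the binning; this assumes the 5 × 12 log-polar layout commonly used for shape context, with radii normalized by the mean pairwise distance. Clamping out-of-range radii into the first or last radial bin (so that every other point is counted) is a design choice of this sketch, and `shape_context` is a hypothetical name.

```python
import math


def shape_context(points, index, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Log-polar histogram of the positions of all other points relative to points[index].

    Radii are divided by the mean pairwise distance (scale normalization);
    radii outside [r_inner, r_outer] are clamped into the first/last radial bin.
    Assumes at least two distinct points.
    """
    px, py = points[index]
    pair_dists = [math.dist(p, q) for k, p in enumerate(points) for q in points[k + 1:]]
    mean_dist = sum(pair_dists) / len(pair_dists)
    log_lo, log_hi = math.log(r_inner), math.log(r_outer)
    hist = [[0] * n_theta for _ in range(n_r)]
    for j, (x, y) in enumerate(points):
        if j == index:
            continue
        r = math.hypot(x - px, y - py) / mean_dist
        theta = math.atan2(y - py, x - px) % (2 * math.pi)
        if r > 0:
            frac = (math.log(r) - log_lo) / (log_hi - log_lo)
            r_bin = min(n_r - 1, max(0, int(frac * n_r)))
        else:
            r_bin = 0  # coincident point: put it in the innermost bin
        # The modulo guards against theta rounding up to exactly 2*pi.
        t_bin = int(theta / (2 * math.pi) * n_theta) % n_theta
        hist[r_bin][t_bin] += 1
    return hist


pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
h = shape_context(pts, 0)
print(sum(map(sum, h)))  # 4: one count per other point
```

Since only relative coordinates enter the histogram, translating the whole point set leaves every descriptor unchanged.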
Action Recognition using Key Frames
• Deciding whether two images are related
– $p_i^a$ and $p_i^b$ are the coordinates of corresponding points in images A and B.
– T is the class of transformations (known a priori) that defines the relation between A and B.
– Matching Distance
• General case
• Using pure translation
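The matching-distance formulas themselves are not reproduced here. Assuming a least-squares criterion, the general case would read $D(A, B) = \min_{T \in \mathcal{T}} \sum_i \lVert p_i^a - T(p_i^b) \rVert^2$, and for pure translation the minimizing $T$ is the mean displacement, which gives a closed form. A sketch under that assumption; `translation_matching_distance` is a hypothetical name.

```python
def translation_matching_distance(pa, pb):
    """Least-squares distance between corresponding point lists under pure translation.

    Returns (distance, (tx, ty)), where (tx, ty) minimizes
    sum_i |pa_i - (pb_i + t)|^2; the minimizer is the mean displacement.
    """
    n = len(pa)
    tx = sum(a[0] - b[0] for a, b in zip(pa, pb)) / n
    ty = sum(a[1] - b[1] for a, b in zip(pa, pb)) / n
    residual = sum((a[0] - b[0] - tx) ** 2 + (a[1] - b[1] - ty) ** 2
                   for a, b in zip(pa, pb))
    return residual, (tx, ty)


pa = [(0, 0), (2, 1), (3, 4), (1, 5)]
pb = [(x + 3, y - 2) for x, y in pa]
print(translation_matching_distance(pa, pb))  # (0.0, (-3.0, 2.0))
```

A pure translation is matched with zero residual; any non-translational deformation leaves a positive residual.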
Action recognition using Key Frames
• 30-second tennis sequence
• “Coarse” automatic tracking
• Edge detection performed on the upper half of the player
– Background edges not removed
• Selected a key frame and computed its matching score with respect to every other frame.
• 9 local minima shown, each at the start of a forehand stroke.
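Finding the frames that best match the key frame amounts to locating local minima of the per-frame matching score (lower score means a closer match). A minimal sketch that reports frames whose score is strictly lower than every neighbour within a window; the slides do not say how the minima were selected, so `local_minima` and its `window` parameter are hypothetical.

```python
def local_minima(scores, window=1):
    """Indices t where scores[t] is strictly below all scores within +/- window frames."""
    minima = []
    for t in range(len(scores)):
        lo, hi = max(0, t - window), min(len(scores), t + window + 1)
        neighbours = list(scores[lo:t]) + list(scores[t + 1:hi])
        if neighbours and all(scores[t] < s for s in neighbours):
            minima.append(t)
    return minima


scores = [5, 3, 4, 6, 2, 7, 1, 8]
print(local_minima(scores))            # [1, 4, 6]
print(local_minima(scores, window=2))  # [1, 6]
```

A larger window suppresses minima that are close together in time, which is one way to keep a single detection per stroke.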
Action recognition using Key Frames
Tracking
• Point transferral
– Each key frame is marked manually
– For each point in the key frame, a subset of points in the image is chosen, and a translation is estimated.
[Figure: a simple local translation maps a point $P_k^R$ in key frame R to its corresponding point in image $I_t$.]
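Point transferral can be sketched under assumptions the slides do not state: the subset for a key-frame point is taken to be its k nearest key-frame neighbours that already have an image correspondence (for example from the voting matrix), and the local translation is the component-wise median of their displacements, chosen for robustness to a few bad matches. `transfer_point` and `k` are hypothetical.

```python
import math
from statistics import median


def transfer_point(p_key, matches, k=3):
    """Transfer a key-frame point into the image by a locally estimated translation.

    matches: list of (key_point, image_point) pairs already in correspondence.
    The translation is the median displacement of the k matched key-frame
    points nearest to p_key.
    """
    nearest = sorted(matches, key=lambda m: math.dist(m[0], p_key))[:k]
    tx = median(img[0] - key[0] for key, img in nearest)
    ty = median(img[1] - key[1] for key, img in nearest)
    return (p_key[0] + tx, p_key[1] + ty)


matches = [((1, 0), (3, 1)), ((0, 1), (2, 2)), ((-1, 0), (1, 1)), ((10, 10), (50, 50))]
print(transfer_point((0, 0), matches))  # (2, 1)
```

With a median, one grossly wrong neighbour among three does not move the estimate.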
Updating the Voting Matrix
• Extra information to improve accuracy
• Use a “standard tracker” for head and body localization (Brand, “Shadow Puppetry”).
• Set $V(p_i^R, p_j^t) = 0$ if the points are not close to the corresponding lines of the matched head/body quadrangles.
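The gating rule can be sketched as follows. The exact test is not given, so this assumes: quadrangle edges are indexed identically in the key frame and the image, a point is "close" to an edge if its distance to that edge segment is at most `tol`, and a vote is zeroed when the key-frame point lies near some edge but the image point lies near none of the same edges. Points near no edge keep their votes. `gate_votes` and its helpers are hypothetical names.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))


def near_edges(p, quad, tol):
    """Indices e of quadrangle edges quad[e] -> quad[e + 1] within tol of p."""
    return {e for e in range(4)
            if point_segment_distance(p, quad[e], quad[(e + 1) % 4]) <= tol}


def gate_votes(V, key_pts, img_pts, key_quad, img_quad, tol):
    """Copy of V with V[i][j] = 0 where point i is near an edge but j is not near the same one."""
    key_near = [near_edges(p, key_quad, tol) for p in key_pts]
    img_near = [near_edges(p, img_quad, tol) for p in img_pts]
    gated = [row[:] for row in V]
    for i, ei in enumerate(key_near):
        for j, ej in enumerate(img_near):
            if ei and not (ei & ej):
                gated[i][j] = 0
    return gated


key_quad = [(0, 0), (4, 0), (4, 4), (0, 4)]
img_quad = [(10, 0), (14, 0), (14, 4), (10, 4)]
print(gate_votes([[1, 1], [1, 1]], [(2, 0), (2, 2)], [(12, 0.2), (12, 4)],
                 key_quad, img_quad, tol=0.5))  # [[1, 0], [1, 1]]
```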
Further constraints
• Enforce a similar arrangement of interior points in images that are matched to key frames
• Also incorporate intensity around points
• Monte-Carlo smoothing is used to correct outlying points
Tracking using Shape Context
• Mori & Malik
• A very similar technique, using the shape context descriptor
• Frames are clearly processed independently
• Tested on standard data
Tracking with Shape Context
[Movie not reproduced]
Discussion & Questions
• Results: how effective?
• Effect of rate of motion?
• Efficiency of “closed loop system”?
• No need for background subtraction?
• Flexibility to multiple actions?
• Do they give a specific order to key frames?
• Is the coarse tracking too simple?
• What about poses facing away from the camera?