Human Pose Recognition from Depth Images
MS Research Cambridge
Goals

- Classify pixels into body-part categories
- Input is a single depth image from Kinect
Claimed contributions

- Fast classification: up to 200 frames per second with a GPU implementation on Xbox 360
- High accuracy on both synthetic and real-world datasets
Methods

- Based on a randomized forest (an ensemble of random decision trees)
- At each leaf node of tree t, a learned distribution P(c | x) over part labels c is stored (x = test pixel)
- The final classification is the average of these per-tree distributions
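The averaging step can be sketched as follows; `ToyTree` and `leaf_distribution` are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

class ToyTree:
    """Stand-in for a trained tree: always returns one fixed leaf distribution."""
    def __init__(self, dist):
        self.dist = np.asarray(dist, dtype=float)

    def leaf_distribution(self, depth_image, pixel):
        # A real tree would route the pixel to a leaf using its split tests.
        return self.dist

def classify_pixel(trees, depth_image, pixel):
    """Final P(c | x): the average of the per-tree leaf distributions."""
    dists = [t.leaf_distribution(depth_image, pixel) for t in trees]
    return np.mean(dists, axis=0)

forest = [ToyTree([0.8, 0.2]), ToyTree([0.6, 0.4]), ToyTree([0.4, 0.6])]
p = classify_pixel(forest, depth_image=None, pixel=(0, 0))
# p → [0.6, 0.4]
```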
Randomized forest
Features

- For a pixel x, define the feature as f(I, x) = dI(x + u / dI(x)) − dI(x + v / dI(x))
- dI(x) is the depth intensity at pixel x
- u, v are position offsets: the only two learned parameters
- Intuitively, the features represent random derivative values over 2D space
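A minimal sketch of computing one such feature, assuming the offsets are scaled by 1 / depth(x) for depth invariance and that out-of-image probes are simply clamped to the border (a simplification of the paper's handling):

```python
import numpy as np

def depth_feature(depth, x, u, v):
    """One feature response at pixel x = (row, col) with 2-D offsets u, v."""
    def probe(offset):
        # Scale the offset by 1 / depth(x), then clamp to the image bounds.
        r = min(max(int(x[0] + offset[0] / depth[x]), 0), depth.shape[0] - 1)
        c = min(max(int(x[1] + offset[1] / depth[x]), 0), depth.shape[1] - 1)
        return depth[r, c]
    return probe(u) - probe(v)

depth = np.array([[1.0, 2.0], [3.0, 4.0]])
f = depth_feature(depth, x=(0, 0), u=(1, 0), v=(0, 1))
# f → 1.0  (3.0 at the u probe minus 2.0 at the v probe)
```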
Training the randomized forest

- Each decision tree is trained separately, using different, disjoint training sets
- A parameter triple (u, v, τ) is associated with each node
  - u, v are the offsets; τ is the decision threshold
- A randomly selected set of candidate parameters is proposed
- Standard decision-tree training follows: the locally optimal parameters are selected by largest information gain
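The split selection can be sketched as below, assuming the feature responses for each candidate are precomputed; the function and variable names are illustrative:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(labels, left_mask):
    """Entropy reduction from splitting `labels` by `left_mask`."""
    left, right = labels[left_mask], labels[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

def best_split(responses, labels, candidates):
    """Pick the (u, v, tau) candidate with the largest information gain.

    responses: (n_pixels, n_candidates) feature values, one column per candidate.
    candidates: list of (u, v, tau) triples matching the columns.
    """
    best_j = max(range(len(candidates)),
                 key=lambda j: info_gain(labels, responses[:, j] < candidates[j][2]))
    return candidates[best_j]
```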
Implementation

- Training images: 300k synthetic body-pose images
- 2,000 training pixels per image
- 10,000 pre-selected random parameters
- By default, 3 trees and 20 levels of depth
Result
How I will apply this to RGB video

- A skin silhouette can be extracted easily in ASL video
- No background distortion
- One signer
- Mark pixels by class (hand / non-hand): two classes for now
Features to use

- Use the temporal dimension
- The feature will be a random 3-dimensional derivative
- Unsure what replaces d(x) (the depth intensity in the original paper). Candidates:
  - Skin score
  - A linear combination of skin and motion scores
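Under the assumption that d(x) is replaced by a per-pixel skin (or skin + motion) score, the 3-D derivative feature could look like this sketch; all names are illustrative, and the depth normalization is dropped since a score carries no metric scale:

```python
import numpy as np

def spatiotemporal_feature(scores, x, u, v):
    """Random 3-D derivative of a score volume.

    scores: (frames, rows, cols) per-pixel scores in [0, 1];
    x, u, v: (frame, row, col) position and two 3-D offsets.
    """
    def probe(offset):
        # Clamp each coordinate of x + offset to the volume bounds.
        idx = tuple(min(max(int(a + o), 0), s - 1)
                    for a, o, s in zip(x, offset, scores.shape))
        return scores[idx]
    return probe(u) - probe(v)

scores = np.zeros((2, 2, 2))
scores[1, 1, 1] = 1.0
f3 = spatiotemporal_feature(scores, x=(0, 0, 0), u=(1, 1, 1), v=(0, 0, 0))
# f3 → 1.0
```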
Contributions (if the result is successful)

- Using a randomized forest on RGB images instead of depth images
- Applying temporal information
Progress so far

- Completed pixel marking for the training set using segment cut
- The marking is not perfect, but good enough in my opinion
- So far, only one-handed sign gestures are extracted
- Will work on decision-tree training next week
Samples