
“Hello! My name is... Buffy” Automatic
Naming of Characters in TV Video
Mark Everingham, Josef Sivic and Andrew
Zisserman
Arun Shyam
Objective
To label television or movie footage with the identity of the
people present in each frame of the video.
Challenging problem owing to changes in scale, lighting, pose, hairstyle, etc.
To employ readily available textual annotation for TV, in the form
of subtitles and transcripts, to automatically assign the correct
name to each face image.
Outline
Three main parts:
1) Processing of subtitles and script to obtain proposals for the
names of the characters in the video.
2) Processing the video to extract face tracks and accompanying
descriptors, and to extract descriptors for clothing.
3) Combine the textual and visual information to assign labels to
detected faces in the video.
Test Data:
Two 40-minute episodes of the TV series “Buffy the Vampire
Slayer”.
Subtitle and Script Processing
Subtitles extracted using a simple OCR algorithm. Script obtained from a fan
site in HTML format.
Subtitles record what is being said and when but not by whom.
Script tells who says what but not when.
What we need – Who, What and When.
Solution is to align the script and subtitles by Dynamic Time Warping algorithm.
Write the subtitle text vertically, and the script text horizontally. The task then is
to find a path from top-left to bottom-right which moves only forward through
either text.
The word-level alignment is then mapped back onto the original subtitle units.
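As a concrete illustration, here is a minimal sketch of word-level alignment by dynamic programming (the classic DTW / edit-distance recurrence), assuming the subtitle and script texts have already been tokenized into word lists; the function name, costs, and toy example are illustrative, not the paper's implementation.

```python
def align_subtitles_to_script(sub_words, script_words,
                              match_cost=0, subst_cost=1, gap_cost=1):
    """Return (subtitle_index, script_index) pairs on the cheapest monotone
    path through the alignment grid (classic DTW / edit-distance DP)."""
    n, m = len(sub_words), len(script_words)
    INF = float("inf")
    # cost[i][j] = best cost of aligning first i subtitle words with first j script words
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # match or substitute a word pair (diagonal step)
                step = match_cost if sub_words[i].lower() == script_words[j].lower() else subst_cost
                if cost[i][j] + step < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = cost[i][j] + step
                    back[i + 1][j + 1] = (i, j)
            if i < n and cost[i][j] + gap_cost < cost[i + 1][j]:  # skip a subtitle word
                cost[i + 1][j] = cost[i][j] + gap_cost
                back[i + 1][j] = (i, j)
            if j < m and cost[i][j] + gap_cost < cost[i][j + 1]:  # skip a script word
                cost[i][j + 1] = cost[i][j] + gap_cost
                back[i][j + 1] = (i, j)
    # Trace the cheapest path back from the bottom-right corner
    path, ij = [], (n, m)
    while ij != (0, 0):
        pi, pj = back[ij[0]][ij[1]]
        if ij[0] == pi + 1 and ij[1] == pj + 1:
            path.append((pi, pj))  # keep only matched word pairs
        ij = (pi, pj)
    return list(reversed(path))

# Toy example: the script's speaker label "BUFFY" is skipped via a gap step.
subs = "hello my name is buffy".split()
script = "BUFFY hello my name is buffy".split()
print(align_subtitles_to_script(subs, script))
```

Each matched pair ties a timed subtitle word to a speaker-attributed script word, combining the “when” from the subtitles with the “who” from the script.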
Subtitle-Script Alignment
Face Detection and Tracking
A frontal face detector is run on every frame of the video; this is more reliable
than multi-view face detection or person detection.
Any individual who appears in a video for any length of time generates a face
track – that is, a sequence of face instances across time.
Track provides multiple examples of the character’s appearance.
Face tracks are obtained as a set of point tracks starting at some frame in the
shot and continuing until some later frame.
For a given pair of faces in different frames, the number of point tracks which
pass through both faces is counted; if this number is large relative to the
number of point tracks not common to both faces, a match is declared.
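A minimal sketch of this matching test, assuming face detections are given as (frame, bounding-box) records and point tracks as per-frame positions; the data layout and the 0.5 ratio threshold are assumptions, not the paper's values.

```python
def inside(box, pt):
    """box = (x, y, w, h); pt = (x, y)."""
    x, y, w, h = box
    return x <= pt[0] <= x + w and y <= pt[1] <= y + h

def faces_match(face_a, face_b, point_tracks, min_ratio=0.5):
    """face_a / face_b: dicts with 'frame' and 'box'.
    point_tracks: list of dicts mapping frame index -> (x, y) position."""
    through_a, through_b = set(), set()
    for tid, track in enumerate(point_tracks):
        if face_a["frame"] in track and inside(face_a["box"], track[face_a["frame"]]):
            through_a.add(tid)
        if face_b["frame"] in track and inside(face_b["box"], track[face_b["frame"]]):
            through_b.add(tid)
    common = len(through_a & through_b)
    total = len(through_a | through_b)
    # Declare a match when the shared point tracks dominate the non-shared ones.
    return total > 0 and common / total >= min_ratio
```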
Facial Feature Localization
Output of the face detector gives an approximate location and scale of the face.
After this the facial features are located.
Nine facial features are located: the left and right corners of each eye, the two
nostrils and the tip of the nose, and the left and right corners of the mouth.
To locate the feature positions a Gaussian mixture model is used, where the
covariance of each component is restricted to form a tree structure with each
variable dependent on a single parent variable.
This gives better performance under pose variation and poor lighting.
The appearance of each facial feature is assumed independent of the other
features and is modeled by a feature/non-feature classifier based on a variant
of the AdaBoost algorithm and Haar-like image features.
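The sketch below illustrates, in simplified form, how such a tree-structured model scores one candidate configuration of the nine feature points: independent appearance scores per feature plus Gaussian terms on each feature's position relative to its parent. The tree, the Gaussian parameters, and the appearance_score callable are placeholders; the actual model is learned from data and inference searches over many candidate positions.

```python
import numpy as np

# Parent of each of the nine features in the tree (None = root); the indexing
# and tree structure here are placeholders, not the learned model.
PARENT = [None, 0, 0, 1, 2, 0, 5, 5, 6]

def log_shape_prior(points, offset_means, offset_inv_covs):
    """points: (9, 2) candidate feature positions. offset_means[i] and
    offset_inv_covs[i] describe a Gaussian on the offset of feature i
    relative to its parent in the tree."""
    score = 0.0
    for i, parent in enumerate(PARENT):
        if parent is None:
            continue
        d = points[i] - points[parent] - offset_means[i]
        score += -0.5 * d @ offset_inv_covs[i] @ d
    return score

def configuration_score(points, appearance_score, offset_means, offset_inv_covs):
    """Total log-score = sum of per-feature appearance scores (e.g. from a
    boosted classifier over Haar-like features) + tree-structured shape prior.
    appearance_score(i, xy) is assumed to be supplied by the feature classifiers."""
    app = sum(appearance_score(i, points[i]) for i in range(len(points)))
    return app + log_shape_prior(points, offset_means, offset_inv_covs)
```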
Face and Feature Detection
Representing Face Appearance
Descriptors of the local appearance of the face are computed around each of the
located facial features. This gives robustness to pose variation, lighting, and partial
occlusion compared to a global face descriptor.
Before extracting descriptors, the face region is normalized to reduce scale
uncertainty and pose variation.
An affine transformation then maps the located facial feature points to a set of
canonical feature positions.
Two descriptors were investigated: (i) the SIFT descriptor; (ii) a simple pixel-wise
descriptor formed by taking a vector of locally normalized pixels to
obtain local photometric invariance.
The face descriptor is formed by concatenating the descriptors for each facial
feature.
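Below is a rough sketch of the simpler of the two descriptors: an affine warp of the face to canonical feature positions, followed by concatenated, locally normalized pixel patches around each feature. The canonical face size, patch size, and nearest-neighbour warp are illustrative simplifications; the SIFT variant would substitute a SIFT descriptor per feature.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares affine transform mapping src_pts (N, 2) onto dst_pts (N, 2)."""
    n = len(src_pts)
    A = np.zeros((2 * n, 6))
    b = np.zeros(2 * n)
    for i, ((x, y), (u, v)) in enumerate(zip(src_pts, dst_pts)):
        A[2 * i] = [x, y, 1, 0, 0, 0]
        A[2 * i + 1] = [0, 0, 0, x, y, 1]
        b[2 * i], b[2 * i + 1] = u, v
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p.reshape(2, 3)

def face_descriptor(gray, feature_pts, canonical_pts, patch=11):
    """Warp the face so that the detected features land on canonical positions,
    then concatenate a normalized pixel patch around each canonical feature."""
    M = estimate_affine(feature_pts, canonical_pts)            # image -> canonical
    Minv = np.linalg.inv(np.vstack([M, [0, 0, 1]]))[:2]        # canonical -> image
    h = w = 80                                                 # canonical face size (assumption)
    warped = np.zeros((h, w), dtype=np.float32)
    for v in range(h):                                         # nearest-neighbour warp, for brevity
        for u in range(w):
            x, y = Minv @ np.array([u, v, 1.0])
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < gray.shape[0] and 0 <= xi < gray.shape[1]:
                warped[v, u] = gray[yi, xi]
    descs = []
    r = patch // 2
    for (u, v) in np.asarray(canonical_pts, dtype=int):
        p = warped[max(v - r, 0): v + r + 1, max(u - r, 0): u + r + 1].ravel()
        p = (p - p.mean()) / (p.std() + 1e-6)                  # local photometric normalization
        descs.append(p)
    return np.concatenate(descs)
```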
Representing Clothing Appearance
Sometimes matching faces is very difficult because of different expression,
pose, lighting or motion blur.
Cues to matching identity can be derived by representing the appearance of
the clothing.
For each face detection a bounding box which is expected to contain the
clothing of the corresponding character is predicted relative to the position
and scale of the face detection.
A color histogram of that bounding box is computed as a descriptor, in YCbCr
color space, which de-correlates the color components better than RGB.
While similar clothing appearance suggests the same character, observing
different clothing does not necessarily imply a different character.
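A minimal sketch of such a clothing descriptor is shown below: a box placed relative to the face detection and a joint YCbCr color histogram of that box. The box geometry, the bin counts, and the color conversion constants (ITU-R BT.601) are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def clothing_box(face_box, img_shape):
    """face_box = (x, y, w, h); place the clothing box below the face,
    scaled relative to the face size."""
    x, y, w, h = face_box
    cx = x + w / 2
    bw, bh = 2.0 * w, 2.0 * h                      # assumed relative size
    bx, by = cx - bw / 2, y + 1.5 * h              # assumed offset below the face
    H, W = img_shape[:2]
    return (int(max(bx, 0)), int(max(by, 0)),
            int(min(bx + bw, W)), int(min(by + bh, H)))

def rgb_to_ycbcr(rgb):
    """ITU-R BT.601 full-range conversion; rgb is a float array in [0, 255]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def clothing_descriptor(img, face_box, bins=8):
    """Normalized joint YCbCr histogram of the predicted clothing region."""
    x0, y0, x1, y1 = clothing_box(face_box, img.shape)
    region = rgb_to_ycbcr(img[y0:y1, x0:x1].astype(np.float32))
    hist, _ = np.histogramdd(region.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / max(hist.sum(), 1)
```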
Clothing Appearance Aid
Speaker Detection
The annotation proposed by the combined subtitles and script is highly ambiguous because:
(i) there might be several detected faces present in the frame and we do not
know which one is speaking;
(ii) it might be a reaction shot, in which the speaker does not appear in the frame at all.
This ambiguity can be resolved using visual cues, i.e. movement of the lips.
A rectangular mouth region within each face detection is identified using the
located mouth corners and mean squared difference of the pixel values
within the region is computed between the current and previous frame.
If the difference is above a high threshold, the face detection is classified as
‘speaking’; if below a low threshold, as ‘non-speaking’; if in between, the
method refuses to predict.
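A minimal sketch of this two-threshold test, assuming grayscale frames and integer mouth-corner coordinates from the feature localizer; the padding factor and the threshold values are illustrative, not the paper's.

```python
import numpy as np

def mouth_region(gray, left_corner, right_corner, pad=0.4):
    """Rectangular mouth region built from the two located mouth corners."""
    (x0, y0), (x1, y1) = left_corner, right_corner
    m = int(pad * (x1 - x0))
    return gray[max(min(y0, y1) - m, 0): max(y0, y1) + m,
                max(x0 - m, 0): x1 + m]

def speaking_state(prev_gray, cur_gray, mouth_corners,
                   low_thresh=20.0, high_thresh=60.0):
    """Return 'speaking', 'non-speaking' or 'refuse' for one face detection,
    based on mean squared difference of the mouth region between frames."""
    prev = mouth_region(prev_gray, *mouth_corners).astype(np.float32)
    cur = mouth_region(cur_gray, *mouth_corners).astype(np.float32)
    msd = np.mean((cur - prev) ** 2)
    if msd > high_thresh:
        return "speaking"
    if msd < low_thresh:
        return "non-speaking"
    return "refuse"
```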
Speaker Detection Ambiguity
Lip Movement Detection
Classification by Exemplar Sets
Tracks for which a single identity is proposed are treated as exemplars with
which to label the other tracks which have no, or uncertain, proposed identity.
Each unlabeled face track F is represented as a set of face descriptors and
clothing descriptors {f,c}. Exemplar sets {λi} have the same representation but
are associated with a particular name.
For a given track F, the quasi-likelihood that the face corresponds to a
particular name λi is:
p(F|λi) = (1/Z) · exp{−df(F,λi)²/2σf²} · exp{−dc(F,λi)²/2σc²}
Classification by Exemplar Sets
The face distance df(F,λi) is defined as the minimum distance between the
descriptors in F and those in the exemplar tracks λi. The clothing distance
dc(F,λi) is defined similarly.
The quasi-likelihoods for each name λi are combined to obtain a posterior
probability of the name by assuming equal priors on the names and applying
Bayes’ rule:
P(λi|F) = p(F|λi) / Σj p(F|λj)
By thresholding the posterior, a “refusal to predict” mechanism is implemented.
The faces for which the certainty of naming does not reach some threshold
will be left unlabeled; this decreases the recall of the method but improves the
accuracy of the labeled tracks.
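The sketch below puts these pieces together: minimum set-to-set distances for face and clothing, the Gaussian-style quasi-likelihood, the posterior under equal priors, and thresholding for “refusal to predict”. The Euclidean distance, the σ values, and the refusal threshold are illustrative assumptions.

```python
import numpy as np

def min_set_distance(track_descs, exemplar_descs):
    """Minimum pairwise Euclidean distance between two sets of descriptors."""
    return min(np.linalg.norm(a - b) for a in track_descs for b in exemplar_descs)

def quasi_likelihood(track, exemplar, sigma_f=1.0, sigma_c=1.0):
    """track / exemplar: dicts with 'faces' and 'clothes' descriptor lists."""
    d_f = min_set_distance(track["faces"], exemplar["faces"])
    d_c = min_set_distance(track["clothes"], exemplar["clothes"])
    return (np.exp(-d_f ** 2 / (2 * sigma_f ** 2)) *
            np.exp(-d_c ** 2 / (2 * sigma_c ** 2)))

def name_track(track, exemplars, refuse_below=0.6):
    """exemplars: dict name -> exemplar set. Returns (name, posterior), or
    (None, posterior) when the refusal-to-predict threshold is not met."""
    likes = {name: quasi_likelihood(track, ex) for name, ex in exemplars.items()}
    total = sum(likes.values())
    if total == 0:
        return None, 0.0
    name, like = max(likes.items(), key=lambda kv: kv[1])
    posterior = like / total                 # equal priors + Bayes' rule
    return (name if posterior >= refuse_below else None), posterior
```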
Results
The speaker detection labels around 25% of face tracks with around 90%
accuracy. No manual annotation of any data is performed other than to
evaluate the method (ground truth label for each face track).
Recall here means the proportion of tracks which are assigned a name after
applying the “refusal to predict” mechanism.
Two baseline methods were compared to the proposed method: (i) “Prior” – label
all tracks with the name which occurs most often in the script, i.e. Buffy
(accuracy: 21.9%); (ii) “Subtitles only” – label tracks directly with the names
proposed by the aligned subtitles/script (accuracy: 45%).
Using the proposed method, if forced to assign a name to all face tracks, the
accuracy obtained is around 69% in both episodes. Requiring only 80% of
tracks to be labeled increases the accuracy to around 80%.
Results
Conclusion
Promising results obtained without any supervision except the readily available
annotation.
The detection method and appearance models used can be improved by
including weak cues such as hair or eye color, and by using a specific body
tracker rather than a generic point tracker in cases where face detection is very
difficult.