
Low-Level Fusion of Audio and
Video Feature for Multi-modal
Emotion Recognition
Chair for Image Understanding and Knowledge-based Systems
Institute for Informatics
Technische Universität München
Sylvia Pietzsch
Overview
• Video low-level descriptors
• Model-based image interpretation
• Structural features
• Temporal features
• Audio low-level descriptors
• Combining video and audio descriptors
• Experimental results
• Conclusion and outlook
2008, January 23rd
Model-based Image Interpretation
• The model
  Contains a parameter vector that represents the model's configuration.
• The objective function
  Calculates a value that indicates how accurately a parameterized model matches an image.
• The fitting algorithm
  Searches for the model parameters that describe the image best, i.e., it minimizes the objective function.
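The interplay of model, objective function, and fitting algorithm can be sketched with a deliberately simple stand-in. Everything concrete here is an assumption for illustration and not taken from the talk: the "image" is reduced to a set of edge points, the model is a circle with parameter vector (cx, cy, r), and the fitting algorithm is a basic pattern search.

```python
import numpy as np

def objective(params, edge_points):
    """Objective function: mean squared distance between the edge points
    and the circle described by params = (cx, cy, r). Lower is better."""
    cx, cy, r = params
    dist = np.hypot(edge_points[:, 0] - cx, edge_points[:, 1] - cy)
    return float(np.mean((dist - r) ** 2))

def fit(edge_points, start, step=4.0, min_step=1e-3):
    """Fitting algorithm (illustrative pattern search): try +/- step on each
    parameter, keep every improvement, halve the step when nothing improves."""
    params = np.array(start, dtype=float)
    best = objective(params, edge_points)
    while step > min_step:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                value = objective(trial, edge_points)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return params, best

# Hypothetical "image": 60 edge points on a circle with known parameters.
true_params = np.array([50.0, 40.0, 20.0])
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
edge_points = np.stack(
    [true_params[0] + true_params[2] * np.cos(angles),
     true_params[1] + true_params[2] * np.sin(angles)],
    axis=1,
)

fitted, value = fit(edge_points, start=(40.0, 45.0, 10.0))
print(fitted, value)
```

The pattern search only works here because this toy objective has a single minimum near the starting point; the next slides address what happens when that is not the case.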
Local Objective Functions
Ideal Objective Functions
P1: Correctness property:
Global minimum corresponds to the best fit.
P2: Uni-modality property:
The objective function has no local extrema.
[Figure: example objective functions that violate or satisfy the two properties (panels: ¬P1, P1, ¬P2, P2)]

• Ideal objective functions do not exist for real-world images.
• They can be defined only for annotated images: $f_n(I, x) = |c_n - x|$
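A small numerical check of P1 and P2 for the formula above. The sketch assumes that c_n is the annotated (correct) position of one point in image I and that x is a candidate position; the slide does not spell this out, and the coordinates used are hypothetical.

```python
import numpy as np

def ideal_objective(c_n, x):
    """f_n(I, x) = |c_n - x|. The image I enters only through its annotation c_n
    (assumption: c_n is the annotated position, x a candidate position)."""
    c_n = np.asarray(c_n, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(c_n - x, axis=-1)

c_n = np.array([12.0, 7.0])  # hypothetical annotated position

# Candidate positions along a horizontal line through the annotation.
candidates = np.stack([np.arange(0.0, 25.0), np.full(25, 7.0)], axis=1)
values = ideal_objective(c_n, candidates)

# P1 (correctness): the global minimum lies at the annotated position.
i_min = int(np.argmin(values))
best = candidates[i_min]

# P2 (uni-modality): strictly decreasing before the minimum, increasing after.
diffs = np.diff(values)
is_unimodal = bool(np.all(diffs[:i_min] < 0) and np.all(diffs[i_min:] > 0))
print(best, is_unimodal)
```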
Learning the Objective Function
• Ideal objective function generates training data
• Machine Learning technique generates calculation rules
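The two steps can be sketched on a toy problem. All specifics are assumptions for illustration: the images are synthetic 1-D intensity profiles with a soft edge, the features are raw intensities in a window (hypothetical), and a 1-nearest-neighbour regressor stands in for the machine learning technique, which the slide does not name.

```python
import numpy as np

def edge_profile(length, c, softness=3.0):
    """Synthetic 1-D 'image': soft intensity edge at integer position c."""
    t = np.arange(length)
    return 1.0 / (1.0 + np.exp(-(t - c) / softness))

def window_features(profile, x, half_width=5):
    """Hypothetical image features: intensities in a window around x."""
    return profile[x - half_width : x + half_width + 1]

def ideal_objective(c, x):
    """Needs the annotation c, so it is only available for training images."""
    return abs(c - x)

def build_training_set(edge_positions, length=100, max_disp=15):
    """Step 1: the ideal objective function labels displaced positions."""
    feats, targets = [], []
    for c in edge_positions:
        profile = edge_profile(length, c)
        for x in range(c - max_disp, c + max_disp + 1):
            feats.append(window_features(profile, x))
            targets.append(ideal_objective(c, x))
    return np.array(feats), np.array(targets, dtype=float)

def learned_objective(train_X, train_y, profile, x):
    """Step 2 (stand-in learner): 1-nearest-neighbour regression that maps
    image features to an objective value without using any annotation."""
    sq_dist = np.sum((train_X - window_features(profile, x)) ** 2, axis=1)
    return train_y[np.argmin(sq_dist)]

train_X, train_y = build_training_set(edge_positions=[30, 45, 60])

# Unannotated test image: the edge position 52 is not given to the learner.
test_profile = edge_profile(100, c=52)
candidates = np.arange(40, 65)
values = np.array(
    [learned_objective(train_X, train_y, test_profile, x) for x in candidates]
)
best_x = int(candidates[np.argmin(values)])
print(best_x)
```

The toy data is noise-free, so the lookup is exact; real images would need features and a learner that generalize.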
Skin Color Extraction
• Location of contour lines and skin-colored parts
• Adaptive to image context conditions
[Figure: original image, result of the fixed classifier, result of the adapted classifier]
Correctly detected pixels (three example images):
                      Image 1   Image 2   Image 3
  fixed classifier:     90.4%     74.8%     40.2%
  adapted classifier:   97.5%     87.5%     97.0%
Structural Features
• Deformation parameters describe a distinctive state of the face.
Temporal Features
• Facial expressions emerge from muscle activity.
• Optical flow vectors are calculated at equally distributed feature points connected to the shape model.
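A rough sketch of computing flow vectors at equally distributed points. The slide does not say which optical flow method is used; the Lucas-Kanade least-squares estimate below is a stand-in, and the regular grid and synthetic frames are assumptions (in the talk the points are tied to the fitted shape model).

```python
import numpy as np

def lucas_kanade(frame0, frame1, points, half_window=7):
    """One flow vector (u, v) per point via Lucas-Kanade least squares.
    Assumption: the motion between the two frames is small."""
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    gy0, gx0 = np.gradient(frame0)
    gy1, gx1 = np.gradient(frame1)
    gx, gy = 0.5 * (gx0 + gx1), 0.5 * (gy0 + gy1)
    gt = frame1 - frame0
    flows = []
    for px, py in points:
        win = (slice(py - half_window, py + half_window + 1),
               slice(px - half_window, px + half_window + 1))
        # Brightness constancy: gx * u + gy * v = -gt for every pixel in the window.
        A = np.stack([gx[win].ravel(), gy[win].ravel()], axis=1)
        b = -gt[win].ravel()
        flow, *_ = np.linalg.lstsq(A, b, rcond=None)
        flows.append(flow)
    return np.array(flows)

def texture(x, y):
    """Smooth synthetic image content (hypothetical test pattern)."""
    return np.sin(0.2 * x) + np.cos(0.15 * y) + np.sin(0.1 * (x + y))

ys, xs = np.mgrid[0:80, 0:80].astype(float)
true_flow = np.array([0.4, -0.3])           # (u, v) in pixels per frame
frame0 = texture(xs, ys)
frame1 = texture(xs - true_flow[0], ys - true_flow[1])

# Equally distributed feature points (regular grid as a stand-in).
points = [(x, y) for y in range(15, 70, 10) for x in range(15, 70, 10)]
flows = lucas_kanade(frame0, frame1, points)
median_flow = np.median(flows, axis=0)
print(median_flow)
```

Stacking the per-point vectors frame by frame would give one time series per flow component, which is the form the later slides operate on.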
Audio Low-level Descriptors
• Aiming at independence of phonetic content and speaker
• Coverage of prosodic, articulatory, and voice quality aspects
• 20 ms frames, 50% overlap, Hamming window function
  - Zero crossing rate (ZCR)
  - Pitch
  - 7 formants
  - Energy
  - Spectral development
  - Harmonics-to-noise ratio (HNR)
  - Durations of voiced sounds by HNR
  - Durations of silences by bi-state energy
• SMA (simple moving average) filtering of LLDs
• Addition of 1st and 2nd order LLD regression coefficients
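A minimal sketch of the processing chain for two of the listed descriptors (energy and ZCR). The frame length, overlap, and window follow the slide; the SMA width of 3, the regression span of 2 frames, the 16 kHz sample rate, and the test tone are assumptions made for illustration.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20.0, overlap=0.5):
    """Cut the signal into 20 ms frames with 50% overlap."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_energy(frames):
    """Log energy of each Hamming-windowed frame."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.log(np.sum(windowed ** 2, axis=1) + 1e-12)

def zero_crossing_rate(frames):
    """Fraction of neighbouring sample pairs whose sign differs."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def sma(contour, width=3):
    """Simple moving average over an odd number of frames (assumed width)."""
    padded = np.pad(contour, width // 2, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")

def delta(contour, k_max=2):
    """Regression coefficients over +/- k_max frames (assumed span)."""
    n = len(contour)
    padded = np.pad(contour, k_max, mode="edge")
    num = sum(
        k * (padded[k_max + k : k_max + k + n] - padded[k_max - k : k_max - k + n])
        for k in range(1, k_max + 1)
    )
    return num / (2.0 * sum(k * k for k in range(1, k_max + 1)))

sample_rate = 16000                              # assumption
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2.0 * np.pi * 220.0 * t)         # hypothetical 1 s test tone

frames = frame_signal(signal, sample_rate)
llds = np.stack([log_energy(frames), zero_crossing_rate(frames)], axis=1)
smoothed = np.apply_along_axis(sma, 0, llds)
d1 = np.apply_along_axis(delta, 0, smoothed)     # 1st order coefficients
d2 = np.apply_along_axis(delta, 0, d1)           # 2nd order coefficients
features_per_frame = np.hstack([smoothed, d1, d2])
print(frames.shape, features_per_frame.shape)
```

Each LLD thus contributes three contours per frame: the smoothed value and its first and second order regression coefficients.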
Combining Audio and Video LLDs
• Time series constructed for LLDs (audio and video separately)
• Application of functionals to combined low-level descriptors
  - Linear moments (mean, std. deviation)
  - Quartiles
  - Durations
• Resulting feature vector (classified by an SVM):
  - 276 audio features
  - 1048 video features
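A sketch of how functionals turn variable-length LLD time series into one fixed-length vector that can be concatenated across modalities. The array sizes and random data are hypothetical, the duration functionals are omitted, and the feature counts on the slide (276 and 1048) are not reproduced here.

```python
import numpy as np

def functionals(series):
    """Map a (n_frames, n_llds) time series to a fixed-length vector:
    mean, standard deviation, and the three quartiles of every LLD."""
    q1, q2, q3 = np.percentile(series, [25, 50, 75], axis=0)
    stats = [series.mean(axis=0), series.std(axis=0), q1, q2, q3]
    return np.concatenate(stats)

rng = np.random.default_rng(0)
audio_llds = rng.normal(size=(99, 6))    # hypothetical: 99 audio frames, 6 LLDs
video_llds = rng.normal(size=(25, 10))   # hypothetical: 25 video frames, 10 LLDs

# Low-level (early) fusion: one joint vector per recording for the classifier.
fused = np.concatenate([functionals(audio_llds), functionals(video_llds)])
print(fused.shape)
```

Because the functionals summarize over time, audio and video may run at different frame rates and still yield vectors of constant length.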
Experimental Results (1)
• Database: Airplane Behavior Corpus
  - Guided storyline
  - 8 subjects (25 to 48 years old)
  - 11.5 hours of video in total
• 10-fold stratified cross-validation
• Feature pre-selection by SVM-SFFS (sequential forward floating search)

               Audio   Video   Audiovisual
Features [#]      92     156           200
Accuracy [%]    73.7    61.1          81.8
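A minimal sketch of 10-fold stratified cross-validation. The data is synthetic, and a nearest-centroid classifier stands in for the SVM so that the example needs nothing beyond NumPy; neither reflects the actual experimental setup.

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign a fold index to every sample so that each class is spread
    as evenly as possible across the folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[members] = np.arange(len(members)) % n_folds
    return fold_of

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Stand-in classifier (not an SVM): assign the class of the closest centroid."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(classes[np.argmin(dists, axis=1)] == y_te))

def cross_validate(features, labels, train_and_score, n_folds=10):
    """Mean test score over all folds; each fold is held out once."""
    fold_of = stratified_folds(labels, n_folds)
    scores = []
    for k in range(n_folds):
        test = fold_of == k
        scores.append(train_and_score(features[~test], labels[~test],
                                      features[test], labels[test]))
    return float(np.mean(scores))

# Hypothetical data: 3 unbalanced classes, 4 features, well separated.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], [50, 30, 20])
features = rng.normal(size=(100, 4)) + 3.0 * labels[:, None]

fold_of = stratified_folds(labels)
accuracy = cross_validate(features, labels, nearest_centroid_accuracy)
print(accuracy)
```

Stratification keeps the class proportions of every fold close to those of the full data set, which matters when some classes are rare.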
Experimental Results (2)
• Main confusions:
  - neutral and nervous
  - cheerful and intoxicated
• Aggressive behavior recognized best
Conclusion and Outlook
• Combined feature set superior to either the audio or the video feature set alone
• Future work:
  - Investigation of further data sets
  - Comparison to late fusion approaches
  - Performance of asynchronous feature fusion
  - Application of hierarchical functionals
Thank you!