
Understanding and Predicting Interestingness of Videos
Yu-Gang Jiang, Yanran Wang, Rui Feng, Hanfang Yang, Yingbin Zheng, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
AAAI 2013, Bellevue, USA

The Problem
Can a computational model automatically analyze video contents and predict the interestingness of videos? We conduct a pilot study on this problem and demonstrate a simple method to identify more interesting videos.

Key Idea
Given two videos, build a computational model to predict which one is more interesting.

Applications:
• Web Video Search
• Video Recommendation Systems

Related Work:
• There are a few studies on predicting the aesthetics and interestingness of images.

Two New Datasets
Flickr Dataset:
• Source: Flickr.com
• Video Type: Consumer Videos
• Video Number: 1,200
• Categories: 15 (basketball, beach, ...)
• Duration: 20 hrs in total
• Label: top 10% as interesting videos; bottom 10% as uninteresting

YouTube Dataset:
• Source: YouTube.com
• Video Type: Advertisements
• Video Number: 420
• Categories: 14 (food, drink, ...)
• Duration: 4.2 hrs in total
• Label: 10 human assessors compare video pairs (a vote-aggregation sketch follows)
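
As a rough illustration of how such pairwise judgments can be reduced to labels (a sketch under our own assumptions, not the authors' exact protocol), a majority vote over the ten assessors per pair:

from collections import Counter

def aggregate_pair_votes(votes):
    # votes: list of 'A'/'B' judgments for one video pair (10 assessors here).
    # Returns the majority choice, or None on a tie (tied pairs can be dropped).
    counts = Counter(votes)
    if counts['A'] == counts['B']:
        return None
    return 'A' if counts['A'] > counts['B'] else 'B'

print(aggregate_pair_votes(['A'] * 7 + ['B'] * 3))  # -> 'A'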

Computational Framework
• Aim: train a model to compare the interestingness of two videos
• Pipeline: multi-modal feature extraction (visual features, audio features, and high-level attribute features) → multi-modal fusion → Ranking SVM → prediction results

Features (an extraction sketch follows the list):
• Visual features: Color Histogram, GIST, SIFT, HOG, SSIM
• Audio features: MFCC, Spectrogram SIFT, Audio-Six
• High-level attribute features: Classemes, ObjectBank, Style
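
To make the feature list concrete, here is a minimal sketch of two of the listed descriptors, assuming OpenCV for sampled frames and librosa for audio (the library choices are ours, not specified on the poster):

import cv2
import librosa
import numpy as np

def color_histogram(frame_bgr, bins=8):
    # 8x8x8 joint BGR histogram of one sampled frame, L1-normalized.
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-8)

def mfcc_feature(audio_path, n_mfcc=13):
    # Mean-pooled MFCCs over the soundtrack: one vector per video.
    y, sr = librosa.load(audio_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)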

Prediction & Evaluation
Prediction:
• Adopt Joachims' Ranking SVM (Joachims 2003) to train prediction models (see the sketch below)
• For both datasets, we use 2/3 of the videos for training and 1/3 for testing
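
A minimal sketch of the pairwise training idea: the poster uses Joachims' SVM^rank, which we approximate here with the standard pairwise-difference transform and a linear SVM (the data below is hypothetical):

import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y):
    # For each pair (i, j) with different labels, emit the difference
    # vector x_i - x_j, labeled +1 if video i is more interesting.
    diffs, labels = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue
            diffs.append(X[i] - X[j])
            labels.append(1 if y[i] > y[j] else -1)
    return np.asarray(diffs), np.asarray(labels)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))    # toy data: 100 videos, 64-d features
y = rng.integers(0, 2, size=100)  # 1 = interesting, 0 = uninteresting
X_pair, y_pair = pairwise_transform(X, y)
ranker = LinearSVC(C=1.0).fit(X_pair, y_pair)
# ranker.decision_function(x) now serves as an interestingness score.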
• Use kernel-level fusion with equal weights to combine multiple features (sketched below)
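
Kernel-level fusion with equal weights amounts to averaging the per-feature kernel (Gram) matrices before training; a sketch with toy positive semi-definite matrices standing in for real feature kernels:

import numpy as np
from sklearn.svm import SVC

def fuse_kernels(kernels):
    # Equal-weight kernel-level fusion: average the Gram matrices.
    return np.mean(np.stack(kernels), axis=0)

n = 80
rng = np.random.default_rng(1)
kernels = []
for _ in range(3):            # e.g., one kernel each for SIFT, MFCC, Classemes
    Z = rng.normal(size=(n, 16))
    kernels.append(Z @ Z.T)   # any valid (PSD) Gram matrix works here

K_fused = fuse_kernels(kernels)
y = rng.integers(0, 2, size=n)
model = SVC(kernel='precomputed').fit(K_fused, y)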
Evaluation:
• Accuracy: the percentage of correctly ranked test video pairs (sketched below)
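
The accuracy metric can be computed directly from predicted scores; a small sketch:

import numpy as np

def pairwise_accuracy(scores, y):
    # Fraction of test pairs (i, j) with y[i] != y[j] whose predicted
    # scores rank them in the correct order.
    correct = total = 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue
            total += 1
            if (scores[i] - scores[j]) * (y[i] - y[j]) > 0:
                correct += 1
    return correct / max(total, 1)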

Results
Visual Feature Results:
[Bar charts: ranking accuracy (%) of visual features — Flickr: 76.6, 74.5, 74.2; YouTube: 67.0, 67.1, 68.0]
• Overall, the visual features achieve very impressive performance on both datasets
• Among the five features, SIFT and HOG are very effective, and their combination performs best

Audio Feature Results:
[Bar charts: ranking accuracy (%) of audio features — Flickr: 76.4, 74.7; YouTube: 64.8, 65.7]
• The three audio features are effective and complementary; combining them gives the best performance

Attribute Feature Results:
[Bar charts: ranking accuracy (%) of attribute features — Flickr: 64.5; YouTube: 56.8]
• Attribute features do not work as well as we expected; Style in particular performs poorly. This is a very interesting observation, since Style has been claimed effective for predicting image interestingness

Visual+Audio+Attribute Fusion Results:
[Bar charts: ranking accuracy (%) after fusion — Flickr: 76.6 → 78.6; YouTube: 68.0 → 71.7]
• Fusing visual and audio features leads to substantial performance gains: a 2.6% increase on Flickr and a 5.4% increase on YouTube. Adding attribute features is not as effective

Contributions:
• Conducted a pilot study on video interestingness
• Built two new datasets to support this study
• Evaluated a large number of features, yielding interesting observations

Conclusion
We conducted a study on predicting video interestingness and built two new datasets. A large number of features have been evaluated, leading to interesting observations:
• Visual and audio features are effective in predicting video interestingness
• A few features that are useful for image interestingness do not extend to the video domain (e.g., Style)
Datasets are available at: www.yugangjiang.info/research/interestingness