Understanding and Predicting Interestingness of Videos
Yu-Gang Jiang, Yanran Wang, Rui Feng, Hanfang Yang, Yingbin Zheng, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
AAAI 2013, Bellevue, USA

The Problem
Can a computational model automatically analyze video content and predict the interestingness of videos? We conduct a pilot study on this problem and demonstrate a simple method to identify more interesting videos.

Applications:
• Web video search
• Video recommendation systems

Two New Datasets
Flickr Dataset:
• Source: Flickr.com
• Video type: consumer videos
• Number of videos: 1,200
• Categories: 15 (basketball, beach, ...)
• Duration: 20 hours in total
• Labels: top 10% ranked as interesting videos; bottom 10% as uninteresting

YouTube Dataset:
• Source: YouTube.com
• Video type: advertisements
• Number of videos: 420
• Categories: 14 (food, drink, ...)
• Duration: 4.2 hours in total
• Labels: 10 human assessors compared video pairs

Results
Visual feature results:
[Bar charts: pairwise accuracy (%) — Flickr: 76.6, 74.5, 74.2; YouTube: 67.0, 67.1, 68.0]
• Overall, the visual features achieve very impressive performance on both datasets.
• Among the five visual features, SIFT and HOG are very effective, and their combination performs best.

Audio feature results:
[Bar charts: pairwise accuracy (%) — Flickr: 76.4, 74.7; YouTube: 64.8, 65.7]
• The three audio features are effective and complementary; combining them gives the best performance.

Attribute feature results:
[Bar charts: pairwise accuracy (%) — Flickr: 64.5; YouTube: 56.8]
• Attribute features do not work as well as we expected; Style in particular performs poorly.
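The visual features above are standard frame-level descriptors aggregated over each video. As a minimal illustration (my sketch, not the authors' exact pipeline — the function name and the averaging scheme are assumptions), a per-channel RGB color histogram averaged over frames can be computed like this:

```python
import numpy as np

def color_histogram(frames, bins=8):
    """Average per-channel RGB color histogram over video frames.

    frames: array of shape (num_frames, H, W, 3), dtype uint8.
    Returns an L1-normalized feature vector of length 3 * bins.
    """
    per_frame = []
    for frame in frames:
        channel_hists = []
        for c in range(3):  # R, G, B channels
            h, _ = np.histogram(frame[..., c], bins=bins, range=(0, 256))
            channel_hists.append(h)
        per_frame.append(np.concatenate(channel_hists))
    feat = np.mean(per_frame, axis=0)
    return feat / (feat.sum() + 1e-8)  # L1-normalize

# Toy example: 5 random 32x32 "frames" standing in for a decoded video
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)
f = color_histogram(frames)
print(f.shape)  # (24,)
```

In practice the poster's other descriptors (SIFT, HOG, GIST, SSIM) would be extracted per frame and pooled in a similar fashion before training.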
This is a very interesting observation, since Style has been claimed effective for predicting image interestingness.

Related Work
• There are a few prior studies on predicting the aesthetics and interestingness of images.

Key Idea
The key idea is to build a computational model that, given two videos, predicts which one is more interesting.

Computational Framework
• Aim: train a model to compare the interestingness of two videos.
• Pipeline: multi-modal feature extraction → multi-modal fusion → Ranking SVM.

Features:
• Visual features: Color Histogram, SIFT, HOG, GIST, SSIM
• Audio features: MFCC, Spectrogram SIFT, Audio-Six
• High-level attribute features: Classemes, ObjectBank, Style

Prediction & Evaluation
• We adopt Joachims' Ranking SVM (Joachims 2003) to train prediction models.
• For both datasets, we use 2/3 of the videos for training and 1/3 for testing.
• We use kernel-level fusion with equal weights to combine multiple features.
• Evaluation metric: accuracy, the percentage of correctly ranked test video pairs.

Fusion results:
[Bar charts: pairwise accuracy (%) — Flickr: 76.6 → 78.6 (+2.6%); YouTube: 68.0 → 71.7 (+5.4%)]
• Fusing visual and audio features leads to substantial performance gains: a 2.6% increase on Flickr and a 5.4% increase on YouTube. Adding attribute features is not as effective.

Contributions
• Conducted a pilot study on video interestingness.
• Built two new datasets to support this study.
• Evaluated a large number of features, yielding interesting observations.

Conclusion
We conducted a pilot study on predicting video interestingness and built two new datasets. A large number of features were evaluated, leading to interesting observations:
• Visual and audio features are effective for predicting video interestingness.
• Some features useful for image interestingness (e.g., Style) do not extend to the video domain.

Datasets are available at: www.yugangjiang.info/research/interestingness
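The Ranking SVM step can be understood through the standard pairwise transform: each ordered pair (more interesting video i, less interesting video j) yields a difference vector x_i − x_j with label +1 (and the reverse with −1), and a linear model is trained on the hinge loss over these differences. The sketch below is my illustration, not the SVM^rank tool the authors used — it replaces Joachims' solver with plain subgradient descent in NumPy, and all function names and hyperparameters are assumptions:

```python
import numpy as np

def pairwise_transform(X, pairs):
    """Turn ordered pairs (i, j), meaning 'video i is more interesting
    than video j', into difference vectors with labels +1 / -1."""
    diffs, labels = [], []
    for i, j in pairs:
        diffs.append(X[i] - X[j]); labels.append(1.0)
        diffs.append(X[j] - X[i]); labels.append(-1.0)
    return np.array(diffs), np.array(labels)

def train_linear_rank_svm(X, pairs, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the regularized hinge loss of the
    pairwise ranking problem (a stand-in for SVM^rank)."""
    D, y = pairwise_transform(X, pairs)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (D @ w)
        viol = margins < 1          # pairs violating the margin
        grad = w - C * (y[viol, None] * D[viol]).sum(axis=0)
        w -= lr * grad
    return w

def pairwise_accuracy(w, X, pairs):
    """The poster's metric: fraction of pairs ranked correctly."""
    scores = X @ w
    return np.mean([scores[i] > scores[j] for i, j in pairs])

# Toy demo: "interestingness" grows with the first feature dimension
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_scores = X[:, 0]
pairs = [(i, j) if true_scores[i] > true_scores[j] else (j, i)
         for i, j in zip(range(0, 50), range(50, 100))]
w = train_linear_rank_svm(X, pairs)
print(pairwise_accuracy(w, X, pairs))
```

Evaluation follows the same pairwise scheme: a held-out pair counts as correct when the model scores the more interesting video higher.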
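Kernel-level fusion with equal weights, as used above, amounts to averaging the Gram matrices of the individual modalities before training. A minimal sketch (my illustration — the RBF kernel choice, gamma value, and feature dimensions are assumptions, not the poster's exact kernels):

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gram matrix of an RBF kernel for one feature modality."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# Hypothetical visual and audio features for 4 videos
rng = np.random.default_rng(2)
visual = rng.normal(size=(4, 16))
audio = rng.normal(size=(4, 8))

# Kernel-level fusion with equal weights: average the per-modality kernels
K_visual = rbf_kernel(visual)
K_audio = rbf_kernel(audio)
K_fused = 0.5 * K_visual + 0.5 * K_audio
print(K_fused.shape)  # (4, 4)
```

The fused matrix is then passed to the (ranking) SVM as a precomputed kernel; equal weights avoid tuning per-modality coefficients on these relatively small datasets.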