Understanding and Predicting Interestingness of Videos
Yu-Gang Jiang, Yanran Wang, Rui Feng, Hanfang Yang, Yingbin Zheng, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
AAAI 2013, Bellevue, USA

Applications:
• Web Video Search
• Video Recommendation System

Related Work:
• There are a few studies on predicting the aesthetics and interestingness of images

Key Idea:
• Build a computational model that, given two videos, predicts which one is more interesting
[Figure: two example videos compared ("VS.")]

Contributions:
• Conducted a pilot study on video interestingness
• Built two new datasets to support this study
• Evaluated a large number of features and obtained interesting observations

The Problem:
Can a computational model automatically analyze video content and predict the interestingness of videos? We conduct a pilot study on this problem and demonstrate a simple method for identifying the more interesting videos.

Two New Datasets

Flickr Dataset:
• Source: Flickr.com
• Video Type: Consumer Videos
• Number of Videos: 1200
• Categories: 15 (basketball, beach…)
• Duration: 20 hrs in total
• Labels: Top 10% as interesting videos; bottom 10% as uninteresting

YouTube Dataset:
• Source: YouTube.com
• Video Type: Advertisements
• Number of Videos: 420
• Categories: 14 (food, drink…)
• Duration: 4.2 hrs in total
• Labels: 10 human assessors compared video pairs

Prediction & Evaluation

Computational Framework:
• Aim: train a model to compare the interestingness of two videos

Feature:
• Visual features
• Audio features
• High-level attribute features

Prediction:
• Adopt Joachims' Ranking SVM (Joachims 2003) to train prediction models
• For both datasets, use 2/3 of the videos for training and 1/3 for testing
• Use kernel-level fusion with equal weights to fuse multiple features

Evaluation:
[Figure: framework diagram with the stages multi-modal feature extraction, multi-modal fusion, Ranking SVM, and results]
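Two pieces of the setup above can be sketched in a few lines of Python: equal-weight kernel-level fusion, and checking whether predicted scores order video pairs correctly. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`fuse_kernels`, `preference_pairs`, `pairwise_accuracy`) and the toy kernel values are hypothetical, the poster does not spell out its evaluation measure (pairwise accuracy is assumed here), and the Ranking SVM training step itself is omitted.

```python
from itertools import combinations


def fuse_kernels(kernels):
    """Equal-weight kernel-level fusion: element-wise mean of kernel matrices.

    Each kernel is a square list-of-lists computed from one feature
    (for example one from SIFT and one from MFCC) over the same videos.
    """
    if not kernels:
        raise ValueError("need at least one kernel matrix")
    n = len(kernels[0])
    for k in kernels:
        if len(k) != n or any(len(row) != n for row in k):
            raise ValueError("kernel matrices must be square and the same size")
    weight = 1.0 / len(kernels)  # equal weights, as stated on the poster
    return [
        [weight * sum(k[i][j] for k in kernels) for j in range(n)]
        for i in range(n)
    ]


def preference_pairs(interest):
    """Return (i, j) for every pair where video i is labeled more interesting."""
    pairs = []
    for i, j in combinations(range(len(interest)), 2):
        if interest[i] > interest[j]:
            pairs.append((i, j))
        elif interest[j] > interest[i]:
            pairs.append((j, i))
    return pairs  # ties carry no preference and are skipped


def pairwise_accuracy(scores, pairs):
    """Fraction of preference pairs ordered correctly by the predicted scores.

    Assumed evaluation measure; the poster text does not define one.
    """
    if not pairs:
        raise ValueError("no preference pairs to evaluate")
    correct = sum(1 for i, j in pairs if scores[i] > scores[j])
    return correct / len(pairs)


# Toy example: two 2x2 kernels (hypothetical values) fused with equal weights.
visual_kernel = [[1.0, 0.25], [0.25, 1.0]]
audio_kernel = [[1.0, 0.75], [0.75, 1.0]]
fused = fuse_kernels([visual_kernel, audio_kernel])

# Toy example: three videos with ground-truth interest and predicted scores.
pairs = preference_pairs([3, 1, 2])
accuracy = pairwise_accuracy([0.9, 0.5, 0.1], pairs)
```

The fused matrix could then be passed to any kernel-based ranker as a precomputed kernel; averaging keeps every feature on an equal footing, which matches the "equal weights" choice stated above.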
Multi-modal feature extraction
• Visual features: Color Histogram, SIFT, HOG, SSIM, GIST
• Audio features: MFCC, Spectrogram SIFT, Audio-Six
• High-level attribute features: Classemes, Objectbank, Style

Results

Visual Feature Results:
• Overall, the visual features achieve strong performance on both datasets
• Among the five features, SIFT and HOG are the most effective, and their combination performs best

Audio Feature Results:
• The three audio features are effective and complementary; combining them gives the best performance

Attribute Feature Results:
• Attribute features do not work as well as expected, and Style performs especially poorly. This is a notable observation, since Style is claimed to be effective for predicting image interestingness

Visual+Audio+Attribute Fusion Results:
• Fusing visual and audio features leads to substantial performance gains, with a 2.6% increase on Flickr and a 5.4% increase on YouTube. Adding attribute features is not as effective

[Figure: bar charts of results on Flickr and YouTube. Panels: visual features (SIFT, HOG, SSIM, GIST, Color Histogram, SIFT+HOG, SIFT+HOG+SSIM, SIFT+HOG+GIST, SIFT+HOG+Color); audio features (MFCC, Spectrogram SIFT, Audio-Six, MFCC+SS, MFCC+SS+Audio-Six); attribute features (Style, Classemes, Objectbank, Style+Classemes, Classemes+Objectbank); fusion. Values labeled in the charts, in extraction order: 74.2, 76.4, 76.6, 68.0, 74.5, 67.0, 67.1, 65.7, 64.8, 74.7, 64.5, 56.8, 71.7, 78.6, 76.6, 68.0; gains marked 2.6% and 5.4%.]

Datasets are available at: www.yugangjiang.info/research/interestingness

Conclusion
We conducted a study on predicting video interestingness and built two new datasets. A large number of features were evaluated, leading to interesting observations:
• Visual and audio features are effective in predicting video interestingness
• A few features useful for image interestingness do not extend to the video domain (Style…)
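The fusion gains quoted in the results (2.6% on Flickr, 5.4% on YouTube) can be read as relative improvements. The short Python check below is only a sketch: it assumes that 76.6 and 68.0 are the pre-fusion values and 78.6 and 71.7 the visual+audio fusion values for Flickr and YouTube respectively. That pairing is inferred from the chart labels and is not stated in the text; the helper name `relative_gain_percent` is hypothetical.

```python
def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Assumed pairing of chart values (not stated explicitly on the poster).
flickr_gain = relative_gain_percent(76.6, 78.6)
youtube_gain = relative_gain_percent(68.0, 71.7)

print(f"Flickr:  {flickr_gain:.1f}%")   # Flickr:  2.6%
print(f"YouTube: {youtube_gain:.1f}%")  # YouTube: 5.4%
```

Under that assumed pairing, both quoted gains are reproduced as relative rather than absolute differences (the absolute differences would be 2.0 and 3.7 points).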