Jan 06, 2018

Understanding and Predicting Interestingness of Videos
Yu-Gang Jiang, Yanran Wang, Rui Feng, Hanfang Yang, Yingbin Zheng, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
AAAI 2013, Bellevue, USA

Applications:
• Web Video Search
• Video Recommendation System

Related Work:
• There are a few studies on predicting the aesthetics and interestingness of images

Key Idea: build a computational model that, given two videos, predicts which one is more interesting.

Contributions:
• Conducted a pilot study on video interestingness
• Built two new datasets to support this study
• Evaluated a large number of features and obtained interesting observations

The problem:
Can a computational model automatically analyze video content and predict the interestingness of videos?
We conduct a pilot study on this problem and demonstrate a simple method to identify the more interesting of two videos.

Two New Datasets

Flickr Dataset:
• Source: Flickr.com
• Video Type: Consumer Videos
• Video Number: 1200
• Categories: 15 (basketball, beach…)
• Duration: 20 hrs in total
• Label: Top 10% as interesting videos; bottom 10% as uninteresting

YouTube Dataset:
• Source: YouTube.com
• Video Type: Advertisements
• Video Number: 420
• Categories: 14 (food, drink…)
• Duration: 4.2 hrs in total
• Label: 10 human assessors compared video pairs
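The Flickr labeling rule above (top 10% interesting, bottom 10% uninteresting) can be sketched as follows. This is a minimal illustration, assuming the videos of one category arrive as a list already ordered from most to least interesting; the function name `label_by_rank`, the `percent` parameter, and the video ids are hypothetical and do not come from the poster.

```python
def label_by_rank(ranked_ids, percent=10):
    """Label the top and bottom `percent` of a ranked list.

    ranked_ids: video ids ordered from most to least interesting
                (assumed input format, not specified on the poster).
    Returns a dict mapping id -> "interesting" or "uninteresting";
    videos in the middle of the ranking receive no label.
    """
    k = len(ranked_ids) * percent // 100  # videos taken from each end
    labels = {}
    for vid in ranked_ids[:k]:
        labels[vid] = "interesting"
    for vid in ranked_ids[len(ranked_ids) - k:]:
        labels[vid] = "uninteresting"
    return labels


# Toy ranking of 20 hypothetical videos for one category.
ranked = ["v%02d" % i for i in range(20)]
labels = label_by_rank(ranked)
```

With 20 ranked videos, two are labeled at each end and the remaining 16 stay unlabeled.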

Prediction & Evaluation

Computational Framework:
• Aim: train a model to compare the interestingness of two videos

Feature:
• Multi-modal: visual, audio, and high-level attribute features

Prediction:
• Adopt Joachims’ Ranking SVM (Joachims 2003) to train prediction models
• For both datasets, use 2/3 of the videos for training and 1/3 for testing
• Use kernel-level fusion with equal weights to fuse multiple features
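Kernel-level fusion with equal weights can be read as averaging the per-feature kernel matrices before training. A ranking SVM trained on preference pairs then only needs kernel values between difference vectors, and each such value expands into four entries of the fused kernel. The sketch below illustrates both steps on toy 3×3 kernels. The function names and the kernel values are illustrative assumptions, and the SVM solver itself is left out.

```python
def fuse_kernels(kernels):
    """Equal-weight kernel-level fusion: average the n x n kernel matrices."""
    n = len(kernels[0])
    m = len(kernels)
    return [[sum(k[i][j] for k in kernels) / m for j in range(n)]
            for i in range(n)]


def pair_kernel(K, pair_a, pair_b):
    """Kernel between two preference pairs (i, j) and (k, l).

    Each pair stands for the difference phi(i) - phi(j), so the inner
    product <phi(i) - phi(j), phi(k) - phi(l)> expands into four
    entries of the kernel matrix K.
    """
    i, j = pair_a
    k, l = pair_b
    return K[i][k] - K[i][l] - K[j][k] + K[j][l]


# Toy kernels for three videos (made-up values, one matrix per feature type).
visual = [[1.0, 0.2, 0.0],
          [0.2, 1.0, 0.4],
          [0.0, 0.4, 1.0]]
audio = [[1.0, 0.6, 0.2],
         [0.6, 1.0, 0.0],
         [0.2, 0.0, 1.0]]

fused = fuse_kernels([visual, audio])

# Preference pairs (more interesting, less interesting), as video indices.
pairs = [(0, 1), (1, 2)]
pair_gram = [[pair_kernel(fused, a, b) for b in pairs] for a in pairs]
```

`pair_gram` is the Gram matrix over preference pairs that a kernel SVM solver would take as a precomputed kernel.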

Evaluation:
• Accuracy (the percentage of correctly ranked test video pairs)
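The accuracy measure can be sketched in a few lines. The sketch assumes each test pair is stored as (more interesting, less interesting) and that the model assigns every video a real-valued score. The scores, ids, and function name are made up for illustration, and counting ties as errors is an assumption.

```python
def pairwise_accuracy(scores, test_pairs):
    """Percentage of test pairs whose predicted order matches the label.

    scores:     dict mapping video id -> predicted interestingness score
    test_pairs: list of (more_interesting_id, less_interesting_id)
    A tie in the predicted scores counts as an incorrectly ranked pair.
    """
    correct = sum(1 for better, worse in test_pairs
                  if scores[better] > scores[worse])
    return 100.0 * correct / len(test_pairs)


# Hypothetical model scores and ground-truth pairs.
scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.4}
test_pairs = [("a", "b"), ("a", "c"), ("d", "c"), ("c", "b")]
accuracy = pairwise_accuracy(scores, test_pairs)
```

Three of the four toy pairs are ordered correctly, so `accuracy` is 75.0.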

[Figure: computational framework, with boxes labeled multi-modal feature extraction (visual, audio, and high-level attribute features), multi-modal fusion, Ranking SVM, and results]

Multi-modal feature extraction:
• Visual features: Color Histogram, SIFT, HOG, SSIM, GIST
• Audio features: MFCC, Spectrogram SIFT, Audio-Six
• High-level attribute features: Classemes, Objectbank, Style

Results

Visual Feature Results:
• Overall, the visual features perform strongly on both datasets
• Among the five features, SIFT and HOG are the most effective, and their combination performs best

Audio Feature Results:
• The three audio features are effective and complementary; combining them gives the best performance

Attribute Feature Results:
• Attribute features do not work as well as we expected. Style in particular performs poorly, which is an interesting observation because style is claimed to be effective for predicting image interestingness

Visual+Audio+Attribute Fusion Results:
• Fusing visual and audio features leads to substantial performance gains, with a 2.6% increase on Flickr and a 5.4% increase on YouTube. Adding attribute features on top of this is not as effective
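The two stated gains are consistent with relative, not absolute, increases in accuracy. The check below assumes that the chart values 76.6 and 68.0 are the best visual-only accuracies on Flickr and YouTube, and that 78.6 and 71.7 are the corresponding visual+audio accuracies. That assignment is inferred from the numbers and is not stated explicitly on the poster.

```python
def relative_gain(baseline, fused):
    """Relative improvement of `fused` over `baseline`, in percent."""
    return 100.0 * (fused - baseline) / baseline


# Assumed assignment of chart values (see the note above).
flickr_gain = relative_gain(76.6, 78.6)   # visual-only vs. visual+audio
youtube_gain = relative_gain(68.0, 71.7)  # visual-only vs. visual+audio
```

Rounded to one decimal place, these evaluate to 2.6 and 5.4, matching the reported figures. The absolute differences under the same assumption would be 2.0 and 3.7 percentage points.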

[Figures: bar charts of prediction accuracy (%), vertical axis from 50 to 80, with one panel each for Flickr and YouTube]
• Visual features: SIFT, HOG, SSIM, GIST, Color Histogram, SIFT+HOG, SIFT+HOG+SSIM, SIFT+HOG+GIST, SIFT+HOG+Color
• Audio features: MFCC, Spectrogram SIFT, Audio-Six, MFCC+SS, MFCC+SS+Audio-Six
• Attribute features: Style, Classemes, Objectbank, Style+Classemes, Classemes+Objectbank
• Fusion: Visual (SIFT+HOG), Audio (MFCC+SS+Audio-Six), Attribute (Objectbank+Classemes), Visual+Audio, Visual+Audio+Attribute
• Values shown on the charts: 74.2, 76.4, 76.6, 68.0, 74.5, 67.0, 67.1, 65.7, 64.8, 74.7, 64.5, 56.8, 71.7, 78.6, 76.6, 68.0; fusion gains of 2.6% and 5.4%

Datasets are available at: www.yugangjiang.info/research/interestingness

Conclusion
We conducted a study on predicting video interestingness and built two new datasets. A large number of features were evaluated, leading to interesting observations:
• Visual and audio features are effective in predicting video interestingness
• A few features useful for image interestingness do not extend to the video domain (Style…)