Multimedia Segmentation and Summarization. Dr. Jia-Ching Wang, Honorary Fellow, ECE Department, UW-Madison
Page 1: Slides

Multimedia Segmentation and Summarization

Dr. Jia-Ching Wang

Honorary Fellow, ECE Department, UW-Madison

Page 2: Slides

2 / 47Multimedia Segmentation and Summarization

Outline

Introduction

Speaker Segmentation

Video Summarization

Conclusion

Page 3: Slides

What is Multimedia?

Image

Video

Speech

Audio

Text

Page 4: Slides

Multimedia Everywhere

Fax machines: transmission of binary images

Digital cameras: still images

iPod / iPhone & MP3

Digital camcorders: video sequences with audio

Digital television broadcasting

Compact disk (CD), digital video disk (DVD)

Personal video recorder (PVR, TiVo)

Images on the World Wide Web

Video streaming, video conferencing

Video on cell phones, PDAs

High-definition televisions (HDTV)

Medical imaging: X-ray, MRI, ultrasound

Military imaging: multi-spectral, satellite, microwave

Page 5: Slides

What is Multimedia Content?

Multimedia content: the syntactic and semantic information inherent in a digital material.

Example: text document

Syntactic content: chapter, paragraph

Semantic content: key words, subject, types of text document, etc.

Example: video document

Syntactic content: scene cuts, shots

Semantic content: motion, summary, index, caption, etc.

Page 6: Slides

Why We Need to Know Multimedia Content?

Information processing (archiving, indexing, delivering, accessing, and other processing) requires in-depth knowledge of the content to optimize performance.

Page 7: Slides

How to Know Multimedia Content?

Multimedia content analysis: the computerized understanding of the semantic/syntactic content of a multimedia document

Multimedia content analysis usually involves

Segmentation: segmenting the multimedia document into units

Classification: classifying each unit into a predefined type

Annotation: annotating the multimedia document

Summarization: summarizing the multimedia document

Page 8: Slides

Multimedia Segmentation and Summarization

Multimedia segmentation

Syntactic content

Multimedia summarization

Semantic/syntactic content

The result of temporal segmentation can benefit video summarization

Page 9: Slides

Multimedia Segmentation

Image segmentation

Video segmentation: scene change, shot change

Audio segmentation: audio class change

Speech segmentation: speaker change detection

Text segmentation: word segmentation, sentence segmentation, topic change detection

Page 10: Slides

Multimedia Summarization

Image summarization: region of interest

Video summarization: storyboard, highlight

Audio summarization: main theme in music, chorus in a song, event sound in an environmental sound stream

Speech summarization: speech abstract

Text summarization: abstract

Page 11: Slides

What is Speaker Segmentation?

It can also be called speaker change detection (SCD)

Assumption: no two speaker streams overlap

[Figure: an audio stream divided into consecutive segments labeled speaker1, speaker2, speaker3]

Page 12: Slides

Supervised vs. Unsupervised SCD

Supervised manner: acoustic data are made up of distinct speakers who are known a priori

Recognition based solution

Unsupervised manner: no prior knowledge about the number and identities of speakers

Metric-based criterion

Model selection-based criterion

Page 13: Slides

Supervised Speaker Segmentation -- Gaussian Mixture Model

Gaussian mixture modeling (GMM)

$$p(x) = \sum_{i=1}^{M} c_i \, \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$

x is a d-dimensional random vector. $c_i$, i = 1, …, M, is the mixture weight, $\mu_i$ the mean vector, and $\Sigma_i$ the covariance matrix.

$$D_t = \arg\max_{d = 1, 2, \dots, D} P_d(x_t)$$

The incoming audio stream is classified into one of D classes in a maximum likelihood manner at time t
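A minimal sketch of the maximum likelihood classification described above, assuming one already-trained GMM per known speaker. The slide's formula uses full covariance matrices; diagonal covariances are swapped in here to keep the example short, and all names and numbers (`classify_frame`, `speaker_a`, `speaker_b`, the feature values) are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + quad)

def gmm_log_likelihood(x, gmm):
    """log p(x) for a mixture given as a list of (weight, mean, var) tuples."""
    terms = [math.log(c) + log_gaussian_diag(x, mu, var) for c, mu, var in gmm]
    top = max(terms)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in terms))

def classify_frame(x, class_gmms):
    """Index of the class whose GMM assigns x the highest likelihood."""
    scores = [gmm_log_likelihood(x, gmm) for gmm in class_gmms]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical speaker models, each a 2-component GMM over 2-D features.
speaker_a = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
speaker_b = [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])]
frames = [[0.2, 0.1], [0.9, 1.2], [5.5, 5.1], [6.2, 5.8]]

labels = [classify_frame(x, [speaker_a, speaker_b]) for x in frames]
# A speaker change is hypothesized wherever the winning class switches.
change_points = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
```

In practice the per-frame decisions would likely be smoothed over several frames before a change is declared; that step is left out here.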

Page 14: Slides

Supervised Speaker Segmentation -- Hidden Markov Model

Page 15: Slides

Unsupervised Speaker Segmentation -- Sliding Window Strategy & Detection Criterion

Metric-based criterion (The dissimilarities between the acoustic feature vectors are measured)

Kullback-Leibler distance

Mahalanobis distance

Bhattacharyya distance

Model selection-based criterion

Bayesian information criterion (BIC)
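As an illustration of the metric-based criterion with a sliding window pair, the sketch below fits a 1-D Gaussian to each of two adjacent windows and measures a symmetric Kullback-Leibler distance between them. The 1-D features, the window length, and the function names are assumptions made for the example; real systems would use multi-dimensional acoustic features.

```python
import math

def gaussian_fit(window):
    """Maximum-likelihood mean and variance of a 1-D window."""
    n = len(window)
    mean = sum(window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    return mean, max(var, 1e-6)  # floor avoids division by zero

def symmetric_kl(left, right):
    """Symmetric KL distance between Gaussians fitted to two windows."""
    m1, v1 = gaussian_fit(left)
    m2, v2 = gaussian_fit(right)
    kl12 = 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * (math.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1.0)
    return kl12 + kl21

def distance_curve(features, win):
    """Distance between the windows on either side of each candidate point."""
    return {t: symmetric_kl(features[t - win:t], features[t:t + win])
            for t in range(win, len(features) - win + 1)}

# Hypothetical feature stream whose statistics change at index 6.
stream = [0.0, 0.2, -0.1, 0.1, 0.0, 0.2, 3.0, 3.2, 2.9, 3.1, 3.0, 3.2]
curve = distance_curve(stream, win=4)
best = max(curve, key=curve.get)  # candidate with the largest distance
```

Turning the distance curve into decisions still requires a threshold or a peak-picking rule, which this sketch deliberately leaves open.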

Page 16: Slides

Bayesian Information Criterion

Model selection: choose one among a set of candidate models $M_i$, i = 1, 2, ..., m, and the corresponding model parameters $\theta_i$ to represent a given data set D = (D1, D2, …, DN).

Model posterior probability

$$P(M_i \mid D) = \frac{P(D \mid M_i)\, P(M_i)}{P(D)} \propto P(D \mid M_i)$$

Bayesian information criterion: the maximized log data likelihood for the given model, with a model complexity penalty. The Bayesian information criterion of model $M_i$ is

$$\begin{aligned} \mathrm{BIC}(M_i) &\approx \log P(M_i \mid D) \\ &= \log P(D \mid \hat{\theta}_i, M_i) - \frac{d_i}{2} \log N \end{aligned}$$

where $d_i$ is the number of independent parameters in the model parameter set $\theta_i$.

For N samples modeled by a single d-dimensional Gaussian with estimated covariance $\hat{\Sigma}$:

$$\mathrm{BIC} = -\frac{d}{2} N \log 2\pi - \frac{N}{2} \log |\hat{\Sigma}| - \frac{dN}{2} - \frac{1}{2}\left(d + \frac{1}{2} d(d+1)\right) \log N$$

Page 17: Slides

Unsupervised Segmentation Using Bayesian Information Criterion

First model

$$M_1: x_1, x_2, \dots, x_N \sim N(\mu, \Sigma)$$

Second model

$$M_2: x_1, x_2, \dots, x_b \sim N(\mu_1, \Sigma_1); \quad x_{b+1}, x_{b+2}, \dots, x_N \sim N(\mu_2, \Sigma_2)$$

Bayesian information criterion

$$\mathrm{BIC}(M_1) = -\frac{d}{2} N \log 2\pi - \frac{N}{2} \log |\hat{\Sigma}| - \frac{dN}{2} - \frac{1}{2}\left(d + \frac{1}{2} d(d+1)\right) \log N$$

$$\mathrm{BIC}(M_2) = -\frac{d}{2} N \log 2\pi - \frac{b}{2} \log |\hat{\Sigma}_1| - \frac{N-b}{2} \log |\hat{\Sigma}_2| - \frac{dN}{2} - \frac{1}{2} \cdot 2\left(d + \frac{1}{2} d(d+1)\right) \log N$$

$$\Delta\mathrm{BIC}(b) = \mathrm{BIC}(M_2) - \mathrm{BIC}(M_1)$$
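A small sketch of the BIC difference between the two-Gaussian and one-Gaussian models for a hypothesized change point b, restricted to 1-D features (d = 1) so that the covariance determinant reduces to a variance. The function names, the `penalty_weight` tuning factor, and the sample data are assumptions for illustration; a positive value favors the two-speaker model.

```python
import math

def ml_variance(samples):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    n = len(samples)
    mean = sum(samples) / n
    return max(sum((v - mean) ** 2 for v in samples) / n, 1e-6)

def delta_bic(x, b, penalty_weight=1.0):
    """BIC(M2) - BIC(M1) for 1-D data split after the first b samples."""
    n, d = len(x), 1
    n_params = d + d * (d + 1) / 2  # mean plus covariance entries per Gaussian
    gain = 0.5 * (n * math.log(ml_variance(x))
                  - b * math.log(ml_variance(x[:b]))
                  - (n - b) * math.log(ml_variance(x[b:])))
    # M2 has one extra Gaussian, so it pays one extra parameter penalty.
    return gain - 0.5 * penalty_weight * n_params * math.log(n)

# Hypothetical streams: one with a change after 8 samples, one without.
changed = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1,
           5.0, 5.1, 4.9, 5.05, 4.95, 5.0, 5.1, 4.9]
steady = [0.1, -0.1] * 8

accept_changed = delta_bic(changed, 8) > 0
accept_steady = delta_bic(steady, 8) > 0
```

The constant terms shared by both models cancel in the difference, which is why only the log-variance terms and the penalty appear in `delta_bic`.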

Page 18: Slides

Disadvantages of Conventional Unsupervised Speaker Change Detection

Disadvantage:

For metric-based methods, it's not easy to decide a suitable threshold

For BIC, it's not easy to detect speaker segments shorter than 2 seconds

Page 19: Slides

Proposed Method -- Misclassification Error Rate

[Figure: sliding window pairs and their feature vector distributions, for the same-speaker case and the different-speakers case]

Page 20: Slides

Mathematical Analysis

Page 21: Slides

Mathematical Analysis

Page 22: Slides

Discussion

Generative and discriminant classifiers are both applicable

Key point: discriminant classifiers have the benefit that less data is required

We can have a smaller scanning window size

The ability to detect short speaker change segments increases

Page 23: Slides

Speaker Segmentation Using Misclassification Error Rate

Steps

Preprocessing

Framing, feature extraction

Hypothesized speaker change point selection

Forcing 2-class labels

Training a discriminant hyperplane

Inside data recognition & calculating misclassification error rate

Accept/reject the hypothesized speaker change point

Significance

The unsupervised speaker segmentation problem is solved by supervised classification

[Figure: features are extracted on both sides of the hypothesized speaker change point; frames on one side are tagged +1 and frames on the other side -1; a discriminant classifier is trained on the tagged frames and its misclassification error rate is used to accept or reject the hypothesized speaker change point]
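The steps of the misclassification-error-rate test can be sketched as follows. The slides do not name the discriminant classifier, so a simple nearest-class-mean linear discriminant is swapped in as a stand-in, and the acceptance threshold `max_error` is a hypothetical value, not one taken from the talk.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def misclassification_error_rate(left, right):
    """Force labels +1/-1 on the two windows, fit a hyperplane, re-score them."""
    m_pos, m_neg = mean_vector(left), mean_vector(right)
    w = [p - q for p, q in zip(m_pos, m_neg)]          # hyperplane normal
    mid = [(p + q) / 2 for p, q in zip(m_pos, m_neg)]  # point on the hyperplane

    def predict(x):
        score = sum(wi * (xi - ci) for wi, xi, ci in zip(w, x, mid))
        return 1 if score > 0 else -1

    # "Inside data recognition": classify the same frames used for training.
    errors = (sum(predict(x) != 1 for x in left)
              + sum(predict(x) != -1 for x in right))
    return errors / (len(left) + len(right))

def is_change_point(left, right, max_error=0.3):
    """Accept the hypothesized change point when the windows separate well."""
    return misclassification_error_rate(left, right) <= max_error

# Hypothetical 1-D feature windows around two candidate change points.
different = ([[0.0], [0.2], [-0.1], [0.1]], [[3.0], [3.2], [2.9], [3.1]])
same = ([[0.0], [1.0], [0.2], [0.8]], [[0.1], [0.9], [0.3], [1.1]])
```

When both windows come from the same speaker the forced labels are essentially arbitrary, so the error rate stays near 0.5; when the speakers differ the classes separate and the error rate drops toward 0.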

Page 24: Slides

Experimental Results

Method     F-score  Precision  Recall
Proposed   71.8     70.2       81.3
BIC        63.3     54.4       75.7

Page 25: Slides

Video Summarization

Dynamic vs. Static Video Summarization

Dynamic video summarization

Sport highlight, movie trailer

Static video summarization

Storyboard

– Visual-based approach

– Incorporation of the semantic Information

Page 26: Slides

Static Video Summarization -- Visual-Based Approach

Example

Problem

Is the summarization ratio adjustable?

How to generate an effective storyboard under a given summarization ratio?

Page 27: Slides

How to Generate Effective Storyboard

Question: Assume there are n frames and the summarization ratio is r/n. How do we select the best r frames?

Complexity:

There are C(n,r) different choices

Page 28: Slides

How to Generate Effective Storyboard

From a visual viewpoint

The most visually distinct frames should be extracted

Dissimilarity between two frames is measured by low-level visual features

How to select the best r frames from n frames

Solution: maximize the overall pairwise dissimilarities

Complexity: C(n,r) x C(r,2)

Infeasible: C(n,r) is usually huge

Fact

Human beings usually browse a storyboard in a sequential way

Optimal solution in a sequential sense

Maximize the sum of dissimilarities between sequentially adjacent images in a storyboard

Page 29: Slides

How to Maximize the Dissimilarity Sum of the Extracted Images

Lattice-based representative frame extraction approach

Extract key component from temporal sequence

Dynamic programming can be applied

Example: how to select the best 4 images from an 8-image sequence

Page 30: Slides

How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images

Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8)

Extracted images: E(1), E(2), E(3), E(4)

E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l); where i < j < k < l

[Figure: lattice of the original sequence (1 to 8) against the extracted sequence (1 to 4)]

Each legal left-to-right path represents a way to extract images

Each transition results in an adjacent dissimilarity

In this example, the adjacent dissimilarity sum of the extracted images is D[ O(1),O(3) ] + D[ O(3),O(4) ] + D[ O(4),O(7) ]
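A sketch of how dynamic programming over such a lattice could pick the frames, under the assumption that each frame is summarized by a feature on which a dissimilarity function is defined. Scalar "features" and absolute difference are used only to keep the example readable; `select_frames` and its arguments are hypothetical names, and indices are 0-based.

```python
def select_frames(frames, r, dissimilarity):
    """Pick r frames, in order, maximizing the sum of adjacent dissimilarities."""
    n = len(frames)
    if not 1 <= r <= n:
        raise ValueError("r must be between 1 and the number of frames")
    slack = n - r  # stage s may only use original indices s .. s + slack
    best = {i: 0.0 for i in range(slack + 1)}  # stage 0: nothing accumulated
    back = []                                  # back[s - 1][j] = best predecessor of j
    for s in range(1, r):
        new_best, new_back = {}, {}
        for j in range(s, s + slack + 1):
            for i in range(s - 1, j):          # legal left-to-right transitions
                score = best[i] + dissimilarity(frames[i], frames[j])
                if j not in new_best or score > new_best[j]:
                    new_best[j], new_back[j] = score, i
        best = new_best
        back.append(new_back)
    # Trace the best path backwards from the best final node.
    j = max(best, key=best.get)
    total, path = best[j], [j]
    for table in reversed(back):
        j = table[j]
        path.append(j)
    return path[::-1], total

# Hypothetical scalar features for an 8-image sequence.
features = [0, 1, 9, 2, 8, 3, 3, 3]
path, total = select_frames(features, 4, lambda a, b: abs(a - b))
```

Here `path` lists the original indices chosen for E(1) to E(4) and `total` is the corresponding adjacent dissimilarity sum.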

Page 31: Slides

How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images

[Figure: two lattice diagrams, each plotting the original sequence (1 to 8) against the extracted sequence (1 to 4)]

Page 32: Slides

Complexity Comparison

Select 4 images from an 8-image sequence

Lattice-based approach: 45 dissimilarity comparisons

Optimal approach: 420 dissimilarity comparisons
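The two counts can be reproduced with a short calculation. The exhaustive count follows the C(n,r) x C(r,2) expression from the earlier slide; for the lattice, the slides do not spell out the counting, so the sketch assumes one comparison per legal transition between consecutive lattice stages, which happens to give the same number.

```python
from math import comb

def exhaustive_comparisons(n, r):
    """Every r-subset, each scored on all of its C(r, 2) frame pairs."""
    return comb(n, r) * comb(r, 2)

def lattice_comparisons(n, r):
    """r - 1 stage pairs, each with 1 + 2 + ... + (n - r + 1) transitions."""
    width = n - r + 1
    return (r - 1) * width * (width + 1) // 2

print(exhaustive_comparisons(8, 4))  # 420
print(lattice_comparisons(8, 4))     # 45
```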

Page 33: Slides

Segment-Based Solution

[Figure: lattice for the segment-based solution, plotting the original sequence (frames 1 to 24) against the extracted sequence]

Page 34: Slides

Experimental Results

Page 35: Slides

Incorporation of the Semantic Information

Conventional

The static summarized images are extracted in accordance with low-level visual features

Disadvantage

It's difficult to catch the main story without the support of semantically significant information

We present a semantic-based static video summarization

Each extracted image has an annotation

Related images are connected by edges

All extracted images are listed using 'who', 'what', 'where', and 'when'

Page 36: Slides

The Proposed Architecture

Shot annotation: mapping visual content to text

Concept expansion: provides an alternative view and dependency information when measuring the relation of two annotations

Relational graph construction

Page 37: Slides

Concept Tree Construction

The concept tree denotes the dependency structure of the expanded words

Meronym: 'wheel' is a meronym of 'automobile'

Holonym: 'tree' is a holonym of 'bark', of 'trunk' and of 'limb'

Pencil (used for) Draw

Salesperson (location of) Store

Motorist (capable of) Drive

Eat breakfast (effect of) Full stomach

Page 38: Slides

Concept Tree Reorganization

Who: names of people, subset of "person" in WordNet

Where: "social group," "building," and "location" in WordNet

What: all the other words which do not belong to "who" and "where"

When: searching for time-period phrases

Page 39: Slides

Relational Graph Construction -- Relation of Two Concept Trees

The relation of the two concept trees

The relation of the two roots

The relation of the two children

[Equations: Relation(I, J) is built from Relation_root(I, J) and Relation_child(I, J); Relation_root(I, J) is normalized by the number of the sentences; Relation_child(I, J) combines per-type terms Relation_type(I, J), each normalized by the number of the pairs]

Page 40: Slides

Relational Graph Construction -- Remove Unimportant Vertices and Edges

Remove edges with smaller weights, i.e. lower relation

Remove vertices with smaller term frequency – inverse document frequency (TF-IDF)
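The two pruning rules can be sketched on a small graph, assuming the relation weights and TF-IDF scores have already been computed. The thresholds, the dictionary layout, and the vertex names are all hypothetical; the slides do not state how the cut-offs are chosen.

```python
def prune_graph(edges, tfidf, min_relation, min_tfidf):
    """Drop low-TF-IDF vertices and weak edges from a relational graph."""
    kept_vertices = {v for v, score in tfidf.items() if score >= min_tfidf}
    kept_edges = {
        (u, v): w
        for (u, v), w in edges.items()
        # An edge survives only if it is strong and both endpoints survive.
        if w >= min_relation and u in kept_vertices and v in kept_vertices
    }
    return kept_vertices, kept_edges

# Hypothetical annotated shots with TF-IDF scores and pairwise relations.
tfidf = {"shot1": 0.9, "shot2": 0.7, "shot3": 0.1, "shot4": 0.8}
edges = {("shot1", "shot2"): 0.6, ("shot1", "shot3"): 0.9,
         ("shot2", "shot4"): 0.2, ("shot1", "shot4"): 0.5}

vertices, kept = prune_graph(edges, tfidf, min_relation=0.4, min_tfidf=0.3)
```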

Page 41: Slides

The Final Relational Graph

Comparison with conventional storyboard

Page 42: Slides

Conclusion

A novel speaker segmentation criterion is proposed: misclassification error rate

The unsupervised speaker segmentation problem is solved by supervised classification with label-forcing

The discriminant classifier allows a smaller scanning window size, so the ability to detect short speaker change segments increases

Two new static video summarization approaches are proposed

Lattice-based representative frame extraction

Merely using low-level visual features

The summarization ratio is adjustable

Under a given summarization ratio, the dissimilarity sum from sequential adjacent images is maximized

Concept-organized representative frame extraction

Incorporating semantic information

Mining the four kinds of concept entities: who, what, where, and when

People can efficiently grasp the comprehensive structure of the story and understand the main points of the contents

Page 43: Slides

Future Work

Multimedia segmentation

Speech segmentation

Audio segmentation

Video segmentation

Multimedia summarization

Video summarization: static, dynamic

Speech summarization

Audio summarization

Page 44: Slides

Thank all of you for your attendance!