Content-based Video Indexing, Classification & Retrieval
Presented by HOI, Chu Hong
Nov. 27, 2002



Outline

- Motivation
- Introduction
- Two approaches for semantic analysis
  - A probabilistic framework [Naphade, Huang '01]
  - Object-based abstraction and modeling [Lee, Kim, Hwang '01]
- A multimodal framework for video content interpretation
- Conclusion

Motivation

- Digital video data has grown enormously in recent years.
- Tools to classify and retrieve video content are lacking.
- There exists a gap between low-level features and high-level semantic content.
- Making machines understand video is important and challenging.

Introduction

Content-based Video Indexing
- The process of attaching content-based labels to video shots
- Essential for content-based classification and retrieval
- Uses automatic analysis techniques:
  - shot detection, video segmentation
  - key frame selection
  - object segmentation and recognition
  - visual/audio feature extraction
  - speech recognition, video text, VOCR
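Among these techniques, shot detection is commonly done by comparing color histograms of consecutive frames. A minimal sketch (frames are modeled here as flat lists of 0-255 intensities, and the threshold is an illustrative assumption; real systems work on full RGB frames):

```python
def histogram(frame, bins=8):
    """Quantize pixel intensities (0-255) into a normalized histogram."""
    counts = [0] * bins
    for p in frame:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]

def detect_cuts(frames, threshold=0.5):
    """Return frame indices where a new shot is assumed to begin."""
    cuts = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        # L1 distance between consecutive histograms; a large jump
        # suggests an abrupt shot boundary (a hard cut).
        if sum(abs(a - b) for a, b in zip(prev, cur)) > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

Gradual transitions (fades, dissolves) need more elaborate tests, e.g. comparing frames several positions apart.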

Introduction

Content-based Video Classification
- Segment & classify videos into meaningful categories
- Classify videos based on predefined topics
- Useful for browsing and searching by topic
- Multimodal method: visual, audio, motion, and textual features
- Domain-specific knowledge

Introduction

Content-based Video Retrieval
- Simple visual feature query: retrieve video whose key frame has Color R(80%), G(10%), B(10%)
- Feature combination query: retrieve video with high upward motion (70%), blue (30%)
- Query by example (QBE): retrieve video similar to the example
- Localized feature query: retrieve video with a car running toward the right
- Object relationship query: retrieve video with a girl watching the sunset
- Concept query (query by keyword): retrieve "explosion", "White Christmas"

Introduction

Feature Extraction
- Color features
- Texture features
- Shape features
- Sketch features
- Audio features
- Camera motion features
- Object motion features

Semantic Indexing & Querying

Limitations of QBE
- Measures similarity using only low-level features
- Fails to reflect the user's perception
- High-level features are difficult to annotate

Syntactic to Semantic
- Bridge the gap between low-level features and semantic content
- Semantic indexing, Query By Keyword (QBK)
- Semantic description scheme (MPEG-7): describes semantic interactions between concepts, but offers no scheme to learn the model for individual concepts

Semantic Modeling & Indexing

Two approaches
- Probabilistic framework, 'Multiject' [Naphade, Huang '01]
- Object-based abstraction and indexing [Lee, Kim, Hwang '01]

A probabilistic approach (‘Multiject’ & ‘Multinet’) (Naphade, Huang ’01)

A multiject is a probabilistic multimedia object. Semantic concepts fall into three categories:

- Objects: face, car, animal, building
- Sites: sky, mountain, outdoor, cityscape
- Events: explosion, waterfall, gunshot, dancing

Multiject for a semantic concept (diagram): visual, audio, and text features, together with other multijects, feed the estimate

P( Outdoor = Present | features, other multijects ) = 0.7

How to create a Multiject
- Shot-boundary detection
- Spatio-temporal segmentation of within-shot frames
- Feature extraction (color, texture, edge direction, etc.)
- Modeling:
  - Sites: mixtures of Gaussians
  - Events: hidden Markov models (HMMs) with Gaussian-mixture observation densities
  - All audio events: modeled using HMMs
- Each segment is tested for each concept and the information is then composed at the frame level
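The "Sites: mixtures of Gaussians" step can be sketched as scoring a segment feature under per-concept Gaussian mixtures and picking the most likely concept. All mixture parameters below are invented for illustration, not taken from the paper:

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_likelihood(x, components):
    """components: list of (weight, mean, variance) triples."""
    return sum(w * gauss(x, mu, var) for w, mu, var in components)

# Toy 1-D site models (assumed parameters): 'sky' features cluster
# near 0.8, 'mountain' features near 0.3.
site_models = {
    "sky":      [(0.6, 0.8, 0.01), (0.4, 0.7, 0.02)],
    "mountain": [(1.0, 0.3, 0.02)],
}

def classify_site(x):
    """Assign feature x to the highest-likelihood site concept."""
    return max(site_models, key=lambda c: mixture_likelihood(x, site_models[c]))
```

In practice the features are multi-dimensional (color, texture, edge direction), so full covariance mixtures trained by EM would replace these scalar toys.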

Multiject : Hierarchical HMM

ss1 - ssm : state sequence for supervisor HMMsa1 - sam : state sequence for audio HMMxa1 - xam : audio observationssv1 - svm : state sequence for video HMMxv1 - xvm : video observations

Multinet: Concept Building based on Multiject

• A network of multijects modeling interaction between them

• + / - : positive/negative interaction between multijects

Bayesian Multinet

• Nodes : binary random variables (presence/absence of multiject)

• Layer 0 : frame-level multiject-based semantic features

• Layer 1 : inference from layer 0

• Layer 2 : higher level for performance improvement
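One simple way to compose frame-level multiject probabilities into a shot-level score is a noisy-OR rule. This is a common choice for such compositions, not necessarily the paper's exact inference:

```python
def noisy_or(frame_probs):
    """Shot-level presence probability from per-frame probabilities,
    treating each frame as an independent 'vote' for the concept."""
    absent = 1.0
    for p in frame_probs:
        absent *= (1.0 - p)  # probability no frame signals the concept
    return 1.0 - absent
```

A single confident frame dominates: if any frame has probability 1.0, the shot-level score is 1.0 regardless of the other frames.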

Object-based Semantic Video Modeling

Pipeline: Video Sequence → VO Extraction → Object-based Video Abstraction → Object-based Low-Level Feature Extraction → Semantic Feature Modeling → Indexing/Retrieving

Object Extraction based on Object Tracking [Kim, Hwang ‘00]

Diagram: frame In-1 and the previous video object von-1 drive motion projection; the model is updated by histogram backprojection against frame In; object post-processing yields von, which is fed back through a one-frame delay.

Semantic Feature Modeling

- Modeling based on the temporal variation of object features
- Boundary shape and motion statistics of the object area

Training pipeline: abstracted frame sequence → pre-processing → object features → HMM training.

HMM Modeling
1. Observation sequence: O1 ... OT (object features)
2. Left-right 1-D HMM modeling: state sequence S1, S2, ..., ST
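The likelihood of an observation sequence O1 ... OT under such an HMM is computed with the forward algorithm. A minimal discrete-observation sketch; the toy left-right parameters are assumptions for illustration:

```python
def forward_likelihood(obs, pi, A, B):
    """P(obs | HMM) via the forward algorithm.
    pi: initial state probabilities, A: transition matrix,
    B: per-state emission probabilities over discrete symbols."""
    n = len(pi)
    # alpha[s] = P(obs[0..t], state at t = s)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[s] * A[s][j] for s in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return sum(alpha)

# Toy left-right HMM: state 0 emits symbol 0, state 1 emits symbol 1,
# and transitions only move forward (no return to earlier states).
pi = [1.0, 0.0]
A = [[0.5, 0.5],
     [0.0, 1.0]]
B = [[1.0, 0.0],
     [0.0, 1.0]]
```

Real implementations use Gaussian-mixture emissions over continuous object features and rescale alpha at each step to avoid underflow on long sequences.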

Video Modeling: Three Layer Structure

Video understanding (semantic video modeling), compared to natural language processing:

- Content interpretation ↔ interpretation
- Frame-based & object-based structural modeling ↔ sentence structure & grammar
- Audio-visual feature extraction ↔ word recognition

Three-layer structure of video modeling, compared to NLP.

A Multimodal Framework for Video Content Interpretation

- Long-term goal: application to automatic TV program scouting, allowing users to request programs at the topic level
- Integrates multiple modalities: visual, audio, and textual information
- Multi-level concepts:
  - Low: low-level features
  - Mid: object detection, event modeling
  - High: classification result of semantic content
- Probabilistic model: a Bayesian network for classification (causal relationships, domain knowledge)

How to work with the framework?

Preprocessing
- Story segmentation (shot detection)
- VOCR, speech recognition
- Key frame selection

Feature Extraction
- Visual features, based on key frames: color, texture, shape, sketch, etc.
- Audio features: average energy, bandwidth, pitch, mel-frequency cepstral coefficients, etc.
- Textual features (transcript): knowledge tree with many keyword categories (politics, entertainment, stock, art, war, etc.); word spotting, vote histogram
- Motion features: camera operations (panning, tilting, zooming, tracking, booming, dollying); motion trajectories of moving objects; object abstraction and recognition

Building and training the Bayesian network
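The word-spotting / vote-histogram step for textual features can be sketched as counting transcript hits against a keyword tree. The categories and keywords below are illustrative assumptions:

```python
# Illustrative keyword tree; a real system would use far larger
# vocabularies per category (politics, entertainment, stock, ...).
keyword_tree = {
    "politics":      {"election", "senate", "vote"},
    "stock":         {"nasdaq", "shares", "market"},
    "entertainment": {"movie", "concert", "actor"},
}

def vote_histogram(transcript):
    """Count keyword hits per topic over the transcript words."""
    votes = {topic: 0 for topic in keyword_tree}
    for word in transcript.lower().split():
        for topic, keywords in keyword_tree.items():
            if word in keywords:
                votes[topic] += 1
    return votes
```

The resulting histogram becomes one textual evidence node feeding the Bayesian network, alongside the visual, audio, and motion features.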

Challenging points

- Preprocessing is significant in the framework: accuracy of key-frame selection, speech recognition, and VOCR.
- Good feature extraction is important for classification performance.
- Modeling semantic video objects and events.
- How to integrate multiple modalities still needs careful consideration.

Conclusion

- Introduced several basic concepts
- Semantic video modeling and indexing
- Proposed a multimodal framework for topic classification of video
- Discussed challenging problems

Q & A

Thank you!
