Multimedia Retrieval

Outline

• Audio Retrieval
  • Spoken information
  • Music

• Document Image Analysis and Retrieval

• Video Retrieval

A Taxonomy of Audio

Sound
  • Music
    • Classical: Orchestra, String Quartet, Choir, Piano
    • Country
    • Disco
    • Hip Hop
    • Jazz
    • Rock
  • Speech
    • Sports Announcer
    • Female
    • Male
  • Other?
    • ?

Spoken Document Retrieval


Speech Recognition Knowledge Sources

• Acoustic Modeling: describes the sounds that make up speech

• Lexicon: describes which sequences of speech sounds make up valid words

• Language Model: describes the likelihood of various sequences of words being spoken

All three knowledge sources feed Speech Recognition.

Speech Recognition in Brief

Speech → Signal Processing → Phonetic Probability Estimator (Acoustic Model) → Decoder (Language Model) → Words

Additional knowledge sources shown in the diagram: Pronunciation Lexicon, Grammar

Hints For Better Recognition

• Goal: improve the estimate of p(word | acoustic_signal)

• Main idea: condition on additional evidence X
  p(word | acoustic_signal) → p(word | acoustic_signal, X)

• What could be X?
  • Topical information
  • News of the day
  • Image information
    • Lip reading
    • Video Optical Character Recognition (VOCR)
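One way to read the main idea, as a minimal sketch: if the extra evidence X (for example on-screen text found by VOCR) is assumed to be conditionally independent of the acoustics given the word, then p(word | acoustic_signal, X) is proportional to p(word | acoustic_signal) · p(word | X) / p(word). The function name `rescore`, the independence assumption, and all probabilities below are illustrative assumptions, not taken from the slides.

```python
def rescore(p_word_given_acoustic, p_word_given_x, p_word):
    """Combine acoustic and side-information posteriors for each candidate word.

    Assumes the acoustic signal and X are conditionally independent given the
    word, so p(w | a, X) is proportional to p(w | a) * p(w | X) / p(w).
    """
    unnormalized = {
        w: p_word_given_acoustic[w] * p_word_given_x[w] / p_word[w]
        for w in p_word_given_acoustic
    }
    total = sum(unnormalized.values())
    return {w: score / total for w, score in unnormalized.items()}


# Toy numbers (hypothetical): the recognizer slightly prefers "hurts",
# but on-screen text (X) strongly suggests the name "Hertz".
acoustic = {"hurts": 0.55, "Hertz": 0.45}
side_info = {"hurts": 0.10, "Hertz": 0.90}
prior = {"hurts": 0.50, "Hertz": 0.50}

posterior = rescore(acoustic, side_info, prior)
best = max(posterior, key=posterior.get)
```

With these toy numbers the side information overturns the acoustic preference and `best` becomes "Hertz". A real recognizer would more likely fold X into the language model than rescore single words.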

Speech Recognition Accuracy: Word Error Rate

[Figure: bar chart of word error rate (%, axis 0 to 100) by recording condition; category labels: Benchmark, Lab, TV Studio, Dialog, News, Documentary, Commercials]
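For reference, word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, using a word-level edit distance. The sketch below is a plain dynamic-programming version; the function name and the two example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "harry hertz director of the national quality program",
    "harry hurts a director of national quality program",
)
```

Here the hypothesis has one substitution, one insertion and one deletion against eight reference words, so `wer` is 0.375. Because insertions are counted, WER can exceed 100%.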

Information Retrieval Precision vs. Speech Accuracy

[Figure: relative precision (% of text IR, axis 30 to 100) plotted against word error rate (axis 0 to 80)]

Indexing and Search of Multimodal Information, Hauptmann, A., Wactlar, H. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-97), Munich, Germany, April 1997.

Retrieval degrades only slightly when the word error rate is below 30%.


• Segmentation issue
  • Continuous speech data without story boundaries

• Typical segmentation approaches
  • Overlapping windows (30 sec for each segment)
  • Automatic detection of speaker changes
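The overlapping-window approach can be sketched as follows, assuming the recognizer emits (start time in seconds, word) pairs. The 30-second window comes from the slide; the 15-second step between window starts, the function name and the toy transcript are assumptions.

```python
def overlapping_windows(timed_words, window_sec=30.0, step_sec=15.0):
    """Split (start_time_sec, word) pairs into fixed-length, overlapping segments.

    window_sec follows the 30 s figure on the slide; step_sec (the hop between
    window starts) is an assumed value.
    """
    if not timed_words:
        return []
    end_time = max(t for t, _ in timed_words)
    segments = []
    start = 0.0
    while start <= end_time:
        words = [w for t, w in timed_words if start <= t < start + window_sec]
        if words:
            segments.append((start, start + window_sec, " ".join(words)))
        start += step_sec
    return segments


# Hypothetical transcript: one word every 10 seconds.
transcript = [(0, "good"), (10, "evening"), (20, "tonight"), (30, "in"), (40, "munich")]
segments = overlapping_windows(transcript)
```

Each segment is then indexed as its own "document". The overlap means a story that straddles a window boundary still appears whole in at least one neighbouring segment, at the cost of near-duplicate results.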

Spoken Document Retrieval: Document Expansion

• Motivation: speech-recognized documents contain errors

• Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents

• Similar to query expansion




[Figure: a speech-recognized transcript is run as a query against a clean document collection (web docs); common words are found in the top-ranked docs (doc1 to doc4)]




• Treat each speech document as a query

• Find clean documents that are relevant to speech documents

• Expand each speech document with the common words in the top ranked clean documents.

Document Expansion (Singhal & Pereira, 1999)
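The three steps above can be sketched with simple term counts. This is a simplified stand-in, not the weighting scheme of the cited paper: cosine similarity on raw counts, the `top_k` and `min_docs` thresholds, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_document(transcript, clean_docs, top_k=2, min_docs=2):
    """Append words that occur in at least min_docs of the top_k clean documents."""
    query = Counter(transcript.split())  # step 1: the speech document is the query
    ranked = sorted(clean_docs, key=lambda d: cosine(query, Counter(d.split())), reverse=True)
    top = ranked[:top_k]                 # step 2: most relevant clean documents
    doc_freq = Counter(word for d in top for word in set(d.split()))
    new_words = sorted(w for w, n in doc_freq.items() if n >= min_docs and w not in query)
    return transcript.split() + new_words  # step 3: expanded speech document


# Hypothetical data: the recognizer heard "hurts" instead of "hertz".
clean_docs = [
    "harry hertz director national quality program",
    "hertz leads the quality program at nist",
    "weather forecast rain tomorrow",
]
expanded = expand_document("quality program director hurts", clean_docs)
```

In this toy case the expansion adds "hertz", so a query for the name can now match the transcript even though the recognizer misheard it.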

Music Information Retrieval

Music Retrieval

• A textual retrieval approach
  • Using metadata: titles, artists, genres, …

• Content-based music retrieval
  • Query by audio
  • Query by score document/segment

Content-based Music Retrieval

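On-line processing (query): Microphone signal input → Sampling (11 kHz) → Center clipping → Short-term autocorrelation → Note segmentation → Mid-level representation

Off-line processing (database): Songs database → MIDI message extraction → Mid-level representation

Both paths → Similarity comparison → Query results (ranked song list)

Mid-level representation example: 67 64 65 62 60 (MIDI representation) → -3 1 -3 -2 (differences between consecutive notes)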
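The step from the MIDI note numbers to the second row of numbers is a difference between consecutive notes, which can be checked with a short sketch (the function name is an assumption):

```python
def pitch_intervals(midi_notes):
    """Return the semitone difference between each pair of consecutive MIDI notes."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]


intervals = pitch_intervals([67, 64, 65, 62, 60])

# The same melody sung two semitones higher gives the same intervals.
transposed = pitch_intervals([69, 66, 67, 64, 62])
```

Working with intervals rather than absolute pitches makes the representation independent of the key in which the query is sung or hummed.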

Content-based Music Retrieval

• N-gram representation

  Doc 1: 1 1 2 0 -2 0 1 2 0
  Doc 2: -3 1 1 2

  N-gram    Term  Doc 1  Doc 2
  1 1 2     C1    1      1
  1 2 0     C2    2      0
  2 0 -2    C3    1      0
  0 -2 0    C4    1      0
  -3 1 1    C5    0      1

• A vector representation for each music document

• A typical information retrieval problem
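The counts in the n-gram table can be reproduced with a short sketch, reading the two interval sequences on the slide as two music documents and C1 to C5 as a fixed vocabulary of 3-grams. The function and variable names are assumptions.

```python
from collections import Counter


def ngram_counts(intervals, n=3):
    """Count every run of n consecutive intervals in a melody."""
    return Counter(tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1))


doc1 = ngram_counts([1, 1, 2, 0, -2, 0, 1, 2, 0])
doc2 = ngram_counts([-3, 1, 1, 2])

# Fixed vocabulary C1..C5 from the table on the slide.
vocabulary = [(1, 1, 2), (1, 2, 0), (2, 0, -2), (0, -2, 0), (-3, 1, 1)]
vector1 = [doc1[g] for g in vocabulary]
vector2 = [doc2[g] for g in vocabulary]
```

Once each song is a count vector over n-gram "terms", standard text retrieval machinery (term weighting, cosine ranking) applies unchanged. The first sequence also contains 3-grams that the slide's table does not list, such as -2 0 1.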

Document Image Analysis and Retrieval

Document Image Analysis

• Recognize text (OCR)
  • convert page images to Unicode
  • machine-printed, handwritten

• Analyze page layout geometry
  • a 2-D problem (unlike speech, text)
  • good 'language-free' algorithms

• Capture logical structure
  • output marked-up text (XML, etc.)
  • exploit non-textual clues

Video/Image OCR Block Diagram

Video or Image → Text Area Detection → Text Area Preprocessing → Commercial OCR → UTF8 Text

Text Detection

• Low resolution (as low as 10 pixels of height per character)
  • limited by NTSC (352x248)/PAL/SECAM TV standards

• Complex background

• Character hue and brightness similar to background

Video OCR

VOCR Preprocessing Problems

[Figure: video frames (1/2 s intervals) → filtered frames → AND-ed frames]
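A plausible reading of the "AND-ed frames" step, offered only as a sketch: once each sampled frame has been filtered and thresholded to a binary mask, a pixel-wise AND keeps pixels that are set in every frame (static caption text) and drops pixels that change (moving background). The binary-mask input, the function name and the toy frames are assumptions, not details given on the slide.

```python
def and_frames(frames):
    """Pixel-wise AND of equally sized binary frames (lists of rows of 0/1)."""
    result = [row[:] for row in frames[0]]  # copy so the input is not modified
    for frame in frames[1:]:
        for y, row in enumerate(frame):
            for x, pixel in enumerate(row):
                result[y][x] &= pixel
    return result


# Hypothetical 3x4 masks from three frames sampled 1/2 s apart.
# The middle row (static caption) is set in every frame; other pixels change.
frames = [
    [[1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0]],
    [[0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1]],
    [[0, 0, 1, 0], [1, 1, 1, 1], [1, 0, 0, 0]],
]
stable = and_frames(frames)
```

Only the middle row survives in `stable`, which is the region that would be passed on to OCR.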

OCR Document Retrieval

• Task: find OCR-recognized documents relevant to an information need

• Challenge: erroneous documents
  • Retrieval needs to handle word errors

OCR Document Retrieval

• Correction-based approaches
  • Find potential word errors and replace each with the most likely correct one

• Partial matching approaches
  • Word → a set of n-grams
  • Word matches → n-gram matches
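Partial matching can be sketched by turning each word into a set of character n-grams and scoring overlap instead of exact equality. The trigram size, the "#" padding at word edges, the Dice coefficient and the example words are assumptions chosen for illustration.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded so word edges also form n-grams."""
    padded = "#" * (n - 1) + word.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words (1.0 = identical sets)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


clean = ngram_similarity("Director", "Director")
noisy = ngram_similarity("Director", "Dlrector")  # OCR confused i with l
unrelated = ngram_similarity("Director", "Program")
```

A single OCR character error destroys only the few n-grams that contain it, so the misrecognized word still scores well above an unrelated one, whereas exact word matching would score both as zero.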

Video Retrieval

Video Retrieval - Application of Diverse Technologies

• Speech understanding for automatically derived transcripts

• Image understanding for video “paragraphing”; face, text and other object recognition

• Natural language for query expansion, topic detection and content summarization

• Human computer interaction for video display, navigation and reuse

• Integration overcomes the limitations of each

Introduction to TREC Video Retrieval Track

• NIST TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/

• Video Retrieval Track started in 2001
  • Investigation of content-based retrieval from digital video

• Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip

The TRECVID Collections

2001 - 11 hours, 74 queries, 8000 shots

2002 - 40 hours, 25 queries, 14000 shots
  Video from the Internet Archive between the '50s and '70s
  Advertising, educational, industrial and amateur films
  Common shot boundaries

2003 - 56 hours, 25 queries, 32000 shots
  1998 Broadcast News (CNN, ABC, C-SPAN)
  + Common Speech Recognition
  + Common Annotations

2004 - 61 hours, 24 queries, 33000 shots
  More 1998 Broadcast News

Sample Query and Target

Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST

Speech: We’re looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular …

OCR: H,arry Hertz a Director aro 7 wa-,i,,ty Program,Harry Hertz a Director

System Architecture (TREC Video Track 2001)

• Combine video, audio and text retrieval scores

Query → Retrieval Agents (Text, Image, Audio) → Text Score, Image Score, Audio Score → Final Score
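The slides do not say how the three agent scores are merged into the final score. One common choice is a weighted sum, sketched below. The weights, the shot identifiers, the scores and the function name are all hypothetical.

```python
def combine_scores(agent_scores, weights):
    """Weighted sum of per-agent scores for each shot; missing scores count as 0."""
    shots = set().union(*(scores.keys() for scores in agent_scores.values()))
    final = {
        shot: sum(weights[agent] * scores.get(shot, 0.0) for agent, scores in agent_scores.items())
        for shot in shots
    }
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores in [0, 1] for three shots from each retrieval agent.
agent_scores = {
    "text": {"shot_1": 0.2, "shot_2": 0.9},
    "image": {"shot_1": 0.8, "shot_2": 0.3, "shot_3": 0.5},
    "audio": {"shot_3": 0.4},
}
weights = {"text": 0.5, "image": 0.3, "audio": 0.2}  # assumed, not from the slides
ranking = combine_scores(agent_scores, weights)
```

A weighted sum only makes sense if the agents' scores are on comparable scales, so in practice each agent's scores would be normalized first.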

Results for TREC01

                   ARR      Recall
ASR Transcripts    1.84%    13.2%
VOCR               5.93%    7.52%
Image Retrieval    14.99%   24.45%
Combined           18.9%    28.25%
