Multimedia Retrieval
Jan 11, 2016
Outline
• Audio Retrieval
  • Spoken information
  • Music
• Document Image Analysis and Retrieval
• Video Retrieval
A Taxonomy of Audio
[Figure: tree diagram rooted at Sound, with branches Speech, Music and Other?]
• Speech: Male, Female, Sports Announcer
• Music: Classical (Orchestra, String Quartet, Choir, Piano), Country, Disco, Hip Hop, Jazz, Rock
• Other: ?
Spoken Document Retrieval
Speech Recognition
Speech Recognition Knowledge Sources
• Acoustic Modeling: describes the sounds that make up speech
• Lexicon: describes which sequences of speech sounds make up valid words
• Language Model: describes the likelihood of various sequences of words being spoken
Speech Recognition in Brief
[Diagram: Speech → Signal Processing → Phonetic Probability Estimator (Acoustic Model) → Decoder (Language Model) → Words; the Pronunciation Lexicon and the Grammar are shown as additional inputs]
Hints For Better Recognition
• Goal: improve the estimate of p(word | acoustic_signal)
• Main idea: condition on additional evidence X
    p(word | acoustic_signal) → p(word | acoustic_signal, X)
• What could be X?
  • Topical information
  • News of the day
  • Image information
    • Lip reading
    • Video Optical Character Recognition (VOCR)
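One way to read the main idea is through the usual Bayes decomposition of speech recognition. The sketch below is an illustration, not something stated on the slide. It writes A for the acoustic signal and W for the word sequence, and it assumes that A is conditionally independent of the side information X given W, so that X only reshapes the language-model prior.

```latex
\hat{W} = \arg\max_{W} \, p(W \mid A, X)
        = \arg\max_{W} \, \frac{p(A \mid W, X)\, p(W \mid X)}{p(A \mid X)}
        \approx \arg\max_{W} \, p(A \mid W)\, p(W \mid X)
```

Under that assumption, topical information, news of the day or VOCR output would enter the recognizer as an adapted language model p(W | X) while the acoustic model stays unchanged.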
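Speech Recognition Accuracy
[Figure: bar chart of Word Error Rate (axis 0 to 100) by recording condition; category labels as extracted: Benchmark, Lab, TV Studio, Dialog, News, Documentary, Commercials. Bar values are not recoverable.]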
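For reference, word error rate is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The function below is a minimal sketch of that standard definition. The function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 (100%) when the hypothesis is much longer than the reference.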
Information Retrieval Precision vs. Speech Accuracy
[Figure: Relative Precision (% of Text IR, axis 30 to 100) plotted against Word Error Rate (axis 0 to 80)]
Indexing and Search of Multimodal Information, Hauptmann, A., Wactlar, H. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-97), Munich, Germany, April 1997.
Retrieval degrades only slightly when the word error rate is below 30%.
Spoken Document Retrieval
• Segmentation issue
  • Continuous speech data without story boundaries
• Typical segmentation approaches
  • Overlapping windows (30 sec for each segment)
  • Automatic detection of speaker changes
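The overlapping-window approach can be sketched as follows. The 30-second window comes from the slide. The 15-second step, the function name and the `(start_time_sec, word)` input format are assumptions made for illustration.

```python
def overlapping_windows(words, window_sec=30.0, step_sec=15.0):
    """Group (start_time_sec, word) pairs into overlapping pseudo-documents."""
    if window_sec <= 0 or step_sec <= 0:
        raise ValueError("window_sec and step_sec must be positive")
    if not words:
        return []
    end_time = max(t for t, _ in words)
    segments = []
    start = 0.0
    while start <= end_time:
        # Collect every word whose timestamp falls inside the current window.
        seg = [w for t, w in words if start <= t < start + window_sec]
        if seg:
            segments.append((start, seg))
        start += step_sec
    return segments
```

Each returned segment can then be indexed as its own document, so a story that straddles one window boundary is still covered in full by a neighbouring window.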
Spoken Document Retrieval: Document Expansion
• Motivation: documents are erroneous
• Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents
• Similar to query expansion
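[Diagram: the speech-recognized transcript is run as a query against a clean document collection (web docs); the top-ranked documents (doc1, doc2, doc3, doc4) are retrieved and the words they have in common are found]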
• Treat each speech document as a query
• Find clean documents that are relevant to speech documents
• Expand each speech document with the common words in the top ranked clean documents.
Document Expansion (Singhal & Pereira, 1999)
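The three steps above can be sketched in a few lines. This is a toy illustration only: ranking by raw term overlap and the parameters `top_k`, `min_docs` and `max_terms` are assumptions, not the weighting scheme of the cited paper, and no stop-word filtering is applied.

```python
from collections import Counter

def expand_document(transcript, clean_docs, top_k=2, min_docs=2, max_terms=5):
    """Expand a recognized transcript with words shared by its top-ranked clean docs."""
    doc_terms = Counter(transcript.lower().split())

    # Step 1: treat the transcript as a query and score clean docs by term overlap.
    def overlap(doc):
        terms = Counter(doc.lower().split())
        return sum(min(doc_terms[t], terms[t]) for t in doc_terms)

    # Step 2: keep the top-ranked clean documents.
    ranked = sorted(clean_docs, key=overlap, reverse=True)[:top_k]

    # Step 3: add words that occur in at least min_docs of them
    # and are not already in the transcript.
    doc_freq = Counter()
    for doc in ranked:
        doc_freq.update(set(doc.lower().split()))
    common = sorted(w for w, n in doc_freq.items()
                    if n >= min_docs and w not in doc_terms)
    return transcript.split() + common[:max_terms]
```

In the usage below the recognizer has produced "hurts" for "Hertz"; the expansion adds the correct word back because it appears in both top-ranked clean documents.

```python
clean = ["harry hertz director of the national quality program",
         "hertz named director of quality program at nist",
         "weather forecast rain tomorrow"]
expanded = expand_document("harry hurts director quality program", clean)
```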
Music Information Retrieval
Music Retrieval
• A textual retrieval approach
  • Using metadata: titles, artists, genres, …
• Content-based music retrieval
  • Query by audio
  • Query by score document/segment
Content-based Music Retrieval
[Diagram, on-line processing: Microphone Signal input → Sampling (11 kHz) → Center Clipping → Short-term Autocorrelation → Note Segmentation → Mid-level Representation → Similarity Comparison → Query results (ranked song list)]
[Diagram, off-line processing: Songs Database → MIDI message Extraction → Mid-level Representation]
Example of the mid-level representation:
  MIDI representation: 67 64 65 62 60
  Pitch intervals:     -3 1 -3 -2
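The step from MIDI note numbers to the interval sequence is a first difference between consecutive notes. A minimal sketch follows; the function name is an assumption.

```python
def pitch_intervals(midi_notes):
    """Return the semitone difference between each pair of consecutive MIDI notes."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]

# The slide's example: 67 64 65 62 60 becomes -3 1 -3 -2.
example = pitch_intervals([67, 64, 65, 62, 60])
```

Because only differences are kept, the same melody sung in a different key yields the same interval sequence, which is useful when the query comes from a hummed microphone signal.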
Content-based Music Retrieval
Doc 1: 1 1 2 0 -2 0 1 2 0
Doc 2: -3 1 1 2
• N-gram representation (interval trigrams and their counts in each document)
  N-gram     Term   Doc 1   Doc 2
  1 1 2      C1     1       1
  1 2 0      C2     2       0
  2 0 -2     C3     1       0
  0 -2 0     C4     1       0
  -3 1 1     C5     0       1
• A vector representation for each music document
• A typical information retrieval problem
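The n-gram table can be reproduced with a short sketch. The function names and the choice of cosine similarity for comparing the resulting vectors are assumptions; the slide only says that the problem reduces to ordinary information retrieval.

```python
import math
from collections import Counter

def interval_ngrams(intervals, n=3):
    """Count every run of n consecutive intervals in a sequence."""
    return Counter(tuple(intervals[i:i + n])
                   for i in range(len(intervals) - n + 1))

def to_vector(counts, vocabulary):
    """Project n-gram counts onto a fixed, ordered vocabulary."""
    return [counts[g] for g in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc1 = interval_ngrams([1, 1, 2, 0, -2, 0, 1, 2, 0])
doc2 = interval_ngrams([-3, 1, 1, 2])
vocab = [(1, 1, 2), (1, 2, 0), (2, 0, -2), (0, -2, 0), (-3, 1, 1)]  # C1..C5
v1 = to_vector(doc1, vocab)  # [1, 2, 1, 1, 0]
v2 = to_vector(doc2, vocab)  # [1, 0, 0, 0, 1]
```

The vocabulary here contains only the five terms shown in the table; the first sequence also contains the trigrams (-2, 0, 1) and (0, 1, 2), which the table omits.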
Document Image Analysis and Retrieval
Document Image Analysis
• Recognize text (OCR)
  • convert page images to Unicode
  • machine-printed, handwritten
• Analyze page layout geometry
  • a 2-D problem (unlike speech, text)
  • good ‘language-free’ algorithms
• Capture logical structure
  • output marked-up text (XML, etc.)
  • exploit non-textual clues
Video/Image OCR Block Diagram
[Diagram: Video or Image → Text Area Detection → Text Area Preprocessing → Commercial OCR → UTF8 Text]
Text Detection
• Low resolution (as low as 10 pixel height/character)
• limited by NTSC (352x248) /PAL/SECAM TV standard
• Complex background
• Character Hue and Brightness similar to background
Video OCR
VOCR Preprocessing Problems
[Figure: Video Frames (1/2 s intervals) → Filtered Frames → AND-ed Frames]
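The idea behind AND-ing frames can be sketched as follows: a caption stays fixed across consecutive frames while a complex background changes, so a pixel-wise AND keeps the stable text pixels and suppresses the rest. The sketch makes several assumptions that are not on the slide: frames are grayscale lists of rows, the text is brighter than the background, and a fixed threshold stands in for the filtering step.

```python
def binarize(frame, threshold=128):
    """Turn a grayscale frame (list of rows of 0..255 values) into a 0/1 mask."""
    return [[1 if px >= threshold else 0 for px in row] for row in frame]

def and_frames(frames, threshold=128):
    """Pixel-wise AND of the binarized frames; only pixels set in every frame survive."""
    if not frames:
        raise ValueError("at least one frame is required")
    masks = [binarize(f, threshold) for f in frames]
    result = masks[0]
    for mask in masks[1:]:
        result = [[a & b for a, b in zip(row_a, row_b)]
                  for row_a, row_b in zip(result, mask)]
    return result
```

A pixel that is bright in only one of the sampled frames (moving background) is cleared, while a pixel that is bright in all of them (static overlay text) is kept.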
OCR Document Retrieval
• Task: find OCR-recognized documents relevant to an information need
• Challenge: erroneous documents
  • Retrieval needs to handle word errors
OCR Document Retrieval
• Correction-based approaches
  • Find potential word errors and replace each with the most likely correct one
• Partial matching approaches
  • Word → a set of n-grams
  • Word matches → n-gram matches
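The partial matching idea can be sketched with character bigrams. The padding character `#`, the function names and the Dice coefficient as the match score are assumptions chosen for illustration; other n-gram sizes and overlap measures would fit the slide equally well.

```python
def char_ngrams(word, n=2):
    """Represent a word as the set of its character n-grams, with boundary padding."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

# An OCR error that changes one character still leaves most bigrams intact.
close = ngram_similarity("Director", "Directar")
far = ngram_similarity("Director", "Program")
```

With this representation a query word can still match a misrecognized document word, because a single character error disturbs only the few n-grams that contain it.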
Video Retrieval
Video Retrieval - Application of Diverse Technologies
• Speech understanding for automatically derived transcripts
• Image understanding for video “paragraphing”; face, text and other object recognition
• Natural language for query expansion, topic detection and content summarization
• Human computer interaction for video display, navigation and reuse
• Integration overcomes the limitations of each
Introduction to TREC Video Retrieval Track
• NIST TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/
• Video Retrieval Track started in 2001
• Investigation of content-based retrieval from digital video
• Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip
The TRECVID Collections
2001 - 11 hours, 74 queries, 8000 shots
2002 - 40 hours, 25 queries, 14000 shots
  Video from the Internet Archive between the ’50s and ’70s
  Advertising, educational, industrial and amateur films
  Common shot boundaries
2003 - 56 hours, 25 queries, 32000 shots
  1998 Broadcast News (CNN, ABC, C-SPAN)
  + Common Speech Recognition
  + Common Annotations
2004 - 61 hours, 24 queries, 33000 shots
  More 1998 Broadcast News
Sample Query and Target
Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST
Speech: We’re looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular …
OCR:H,arry Hertz a Director aro 7 wa-,i,,ty Program,Harry Hertz a Director
System Architecture (TREC Video Track 2001)
• Combine video, audio and text retrieval scores
[Diagram: Query → Retrieval Agents (Text, Image, Audio) → Text Score, Image Score, Audio Score → Final Score]
Results for TREC01
                     ARR       Recall
  ASR Transcripts    1.84%     13.2%
  VOCR               5.93%     7.52%
  Image Retrieval    14.99%    24.45%
  Combine            18.9%     28.25%
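The slides do not say how the per-agent scores are merged into the final score. One common choice, shown here purely as an assumed illustration, is to min-max normalize each agent's scores and take a weighted sum. The function names, the weights and the shot identifiers are all hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one agent's scores to the range 0..1 (all zeros if they are constant)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}

def combine_scores(agent_scores, weights):
    """Weighted sum of normalized per-agent scores, returned as a ranked list."""
    combined = {}
    for agent, scores in agent_scores.items():
        if not scores:
            continue
        for shot, s in min_max_normalize(scores).items():
            combined[shot] = combined.get(shot, 0.0) + weights.get(agent, 0.0) * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores from three retrieval agents for three shots.
agent_scores = {
    "text":  {"s1": 2.0, "s2": 1.0},
    "image": {"s2": 0.9, "s3": 0.1},
    "audio": {"s1": 5.0, "s3": 5.0},
}
ranked = combine_scores(agent_scores, {"text": 0.5, "image": 0.3, "audio": 0.2})
```

Normalizing first keeps an agent with a large raw score range from dominating the sum; a shot missing from an agent's result list simply receives no contribution from that agent.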