Illinois Group in Star Challenge
PART I: Visual Data Processing
PART II: Audio Search
Liangliang Cao, Xiaodan Zhuang
University of Illinois at Urbana-Champaign
Dec 23, 2015
What is Star Challenge?
• Competition to Develop World’s Next-Generation Multimedia Search Technology
• Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore.
• A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries
Round 1: 8 teams
Round 2: 7 teams
Round 3: 5 teams
Grand Final in Singapore
No rewards at any round; only one team can win US$100,000
Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi, Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao
But we have a team with no fears…
Let’s go over our experience and stories…
Outline
• Problems of Visual Retrieval
• Data
• Features
• Algorithms
• Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1. Query: IPA sequence. Target: segments that contain the query IPA sequence, regardless of language. Metric: Mean Average Precision.
AT2. Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of spoken language. Metric: Mean Average Precision.
AT3. Query: none. Target: all recurrent segments at least 1 second in length. Metric: F-measure.
Data set: 25-hour monolingual database in Round 1; 13-hour multilingual database in Round 3.
$\mathrm{MAP} = \frac{1}{L}\sum_{i=1}^{L}\frac{1}{R_i}\sum_{j=1}^{R_i}\frac{j}{R_{i,j}}$, where $L$ is the number of queries, $R_i$ the number of relevant segments for query $i$, and $R_{i,j}$ the rank of the $j$-th relevant segment retrieved for query $i$.
$F = \frac{1}{D}\sum_{d=1}^{D}\frac{2 P_d R_d}{P_d + R_d}$, where $P_d$ and $R_d$ are the precision and recall for query $d$, averaged over the $D$ queries.
Xiaodan will talk about this part…
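For concreteness, here is a minimal Python sketch of the MAP metric above, assuming ranked result lists and relevance sets per query; the function names are illustrative, not from the challenge's evaluation kit.

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of the precision values at each relevant hit."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(results_per_query, relevant_per_query):
    """MAP: average of per-query AP values."""
    aps = [average_precision(r, g)
           for r, g in zip(results_per_query, relevant_per_query)]
    return sum(aps) / len(aps)

# Example: relevant segments retrieved at ranks 1 and 3 -> AP = (1 + 2/3) / 2
print(average_precision(["s1", "s7", "s3"], ["s1", "s3"]))  # 0.833...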
3 Video Retrieval Tasks
VT1. Query: a single image (20 queries). Target: all similar (short) video segments. Criterion: "visually similar". Metric: Mean Average Precision. Data set: 20 categories, multiple labels possible.
VT2. Query: a short video shot (<10 s; 20 queries). Target: all similar (long) video segments. Criterion: perceptually similar. Metric: Mean Average Precision. Data set: 10 categories, multiple labels possible.
VT3. Query: videos with sound (3~10 s; on the order of 10K) plus a category number. Target: learning the common visual characteristics. Metric: classification accuracy. Data set: 10 (20) categories, including one "others" category.
$\mathrm{AP} = \frac{1}{R}\sum_{r=1}^{N} P(r)\,\mathrm{rel}(r)$, where $P(r) = \frac{1}{r}\sum_{i=1}^{r}\mathrm{rel}(i)$, $\mathrm{rel}(r)$ is 1 if the item at rank $r$ is relevant and 0 otherwise, $R$ is the number of relevant items, and $N$ is the number of retrieved items.
20 VT1 Categories
100. Not applicable, none of the labels
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
103. Mobile devices, including handphone/PDA
104. Flag
105. Electronic chart, e.g., stock charts, airport departure chart
106. TV chart overlay, including graphs, text, PowerPoint style
107. Person using computer, both visible
108. Track and field, sports
109. Company trademark, including billboard, logo
110. Badminton court, sports
111. Swimming pool, sports
112. Closeup of hand, e.g., using mouse, writing, etc.
113. Business meeting (>2 people), mostly seated down, table visible
114. Natural scene, e.g., mountain, trees, sea, no people
115. Food on dishes, plates
116. Face closeup, occupying about 3/4 of screen, frontal or side
117. Traffic scene, many cars, trucks, road visible
118. Boat/ship, over sea, lake
119. PC webpages, screen of PC visible
120. Airplane
10 Categories for VT2
201. People entering/exiting door/car
202. Talking face with introductory caption
203. Fingers typing on a keyboard
204. Inside a moving vehicle, looking outside
205. Large camera movement, tracking an object, person, car, etc.
206. Static or minute camera movement, people walking, legs visible
207. Large camera movement, panning left/right, top/down of a scene
208. Movie ending credit
209. Woman monologue
210. Sports celebratory hug
5 Categories for VT3
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
107. Person using computer, both visible
112. Closeup of hand, e.g., using mouse, writing, etc.
116. Face closeup, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2): 5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1): 5 queries will be given and the participants are required to solve 4.
3) Audio + video search (AT1 + VT2): the search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data in Round 2
31 MPEG videos, ~20 hours
17,289 frames for VT1 in total
40,994 frames for VT2 in total (32,508 pseudo key frames + 8,486 real key frames)
Evaluation Video Data in Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10,580 .jpg files
Key frames for VT2: 64,546 files in total, including 10,580 true key frames + 53,966 pseudo key frames (.jpg)
Video resolution: 352×288
Computation Power
Workstations in IFP: 10 servers, 2-4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL clusters:
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz and 1.5 GB of RAM per node
TeraGrid:
Time Cost for Video Tasks
Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes
Feature extraction:
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (Matlab)
Semantic Feature 1: 24 hours (Matlab)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (Matlab)
Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier training:
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Matlab codes to CParallel computingGPU Acceleration:
Patch based featuresLoad time is the major issueExtracting all the features after one load
Features for Round 2 - VT1
Image features: SIFT, HOG, GIST, APC, LBP, color, texture, etc.
Semantic features
Features for Round 2 - VT2
Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on spatial intensity gradient
Gender recognition: SODA-Boost based
Motion history image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on nearest neighbors (NN)
Feature normalization/combination (see the sketch below)
Result visualization
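As a rough illustration of the normalization/combination step, here is a minimal sketch assuming z-score normalization per feature type followed by concatenation; the actual GUFE implementation details are not documented on this slide.

import numpy as np

def normalize(feats, eps=1e-8):
    """Z-score normalize each dimension across all images."""
    mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

def combine(feature_blocks):
    """Normalize each feature type, then concatenate into one vector per image."""
    return np.hstack([normalize(f) for f in feature_blocks])

# e.g., 100 images with a 128-dim and a 512-dim feature -> 640-dim combined
combined = combine([np.random.rand(100, 128), np.random.rand(100, 512)])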
Observations
1. Samples under the same category are more semantically similar to each other.
2. The shot boundaries are not well defined.
3. Some of the key frames are not labeled correctly, e.g., VT1 101, 103 (26-141).
Algorithms: Query Expansion
Input: a query image and its category number.
0. Preprocessing: compute the matching between the evaluation and the development data.
1. Expand the query image by retrieving all images with the same category from the development data set.
2. Search the evaluation set with the expanded query (see the sketch below).
Output: return the top 50/20 results.
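A minimal sketch of this query-expansion retrieval, assuming precomputed feature vectors and Euclidean nearest-neighbor matching (the matching used in step 0 is not specified on the slide; all names here are illustrative):

import numpy as np

def retrieve(query_vec, query_category, dev_feats, dev_labels,
             eval_feats, top_k=50):
    # Step 1: expand the query with all development images of the same category.
    expanded = np.vstack([query_vec[None, :],
                          dev_feats[dev_labels == query_category]])
    # Step 2: score each evaluation image by its distance to the nearest
    # member of the expanded query set.
    dists = np.linalg.norm(eval_feats[:, None, :] - expanded[None, :, :],
                           axis=2).min(axis=1)
    # Output: indices of the top 50 (or 20) closest evaluation images.
    return np.argsort(dists)[:top_k]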
Algorithms: GMM-Based Approach
Motivation: use a GMM to model the distribution of patches.
1. Train a UBM (Universal Background Model) on patches from all training images.
2. MAP-estimate the distribution of the patches belonging to one image, given the UBM.
3. Compute pairwise image distances based on a patch kernel and within-class covariance normalization.
4. Retrieve images based on the normalized distance (a simplified sketch follows below).
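A simplified sketch of steps 1-2 with scikit-learn, using mean-only relevance-MAP adaptation and plain Euclidean distances between mean supervectors; the patch kernel and within-class covariance normalization of step 3 are omitted here for brevity.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_patches, n_components=64):
    """Step 1: UBM trained on patch descriptors pooled from all images."""
    return GaussianMixture(n_components, covariance_type='diag').fit(all_patches)

def map_adapt_means(ubm, patches, tau=10.0):
    """Step 2: relevance-MAP adaptation of the UBM means to one image."""
    resp = ubm.predict_proba(patches)               # (n_patches, n_components)
    n_k = resp.sum(axis=0)                          # soft counts per component
    ex_k = resp.T @ patches / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + tau))[:, None]            # adaptation weights
    return alpha * ex_k + (1 - alpha) * ubm.means_  # adapted means

def image_distance(means_a, means_b):
    """Euclidean distance between the mean supervectors of two images."""
    return np.linalg.norm(means_a.ravel() - means_b.ravel())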
VT1 Performance (#2 of 8)
Category: MAP
101. Crowd (>10 people): 0.8419
102. Building with sky as backdrop, clearly visible: 0.977
103. Mobile devices, including handphone/PDA: 0.028
107. Person using computer, both visible: 0.2281
109. Company trademark, including billboard, logo: 0.96
112. Closeup of hand, e.g., using mouse, writing, etc.: 0.4584
113. Business meeting (>2 people), mostly seated down, table visible: 0.0644
115. Food on dishes, plates: 0.2285
116. Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783
117. Traffic scene, many cars, trucks, road visible: 0.2901
VT2 Performance (#1 of 8)
Category: MAP
202. Talking face with introductory caption: 0.8432
206. Static or minute camera movement, people walking, legs visible: 0.0581
207. Large camera movement, panning left/right, top/down of a scene: 0.7789
208. Movie ending credit: 0.2782
209. Woman monologue (Zhen): 0.9756
Performance in Round 3 (#1 of 7)
Task 2 (VT1), estimated MAP (R=20):
101. Crowd (>10 people): 0.64
102. Building with sky as backdrop, clearly visible: 1
107. Person using computer, both visible: 0.7
112. Closeup of hand, e.g., using mouse, writing, etc.: 0.527
116. Face closeup, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2), video MAP at R=20 (VT2 only / AT1 + VT2):
202. Talking face with introductory caption: 1 / 0.03
209. Woman monologue: 0.35 / 0.1
201. People entering door: N/A
We are: 2nd in audio search, 4th in video search, 2nd in AV search, and 1st overall.
Illinois Group in Star Challenge, Part II: Audio Search
A general indexing/retrieval approach leveraging speech recognition output lattices
Experience in a real-world audio retrieval task: the Star Challenge
Experience in speech retrieval in an unknown language
(Audio) Information Retrieval: Problem Definition
Task description: given a query, find the "most relevant" segments in a database.
Example: the query /k r u: d p r ai s ^ z/ ("CRUDE PRICES") is run against thousands of audio files, and the top N file IDs are returned.
(Audio) Information Retrieval: "Standard" Methods
Published algorithms:
Exact match: segment = argmin_i d(query, segment_i), where d is the string edit distance. Fast.
Summary statistics: segment = argmax_i p(query | segment_i); bag-of-words, no concept of "sequence". Good for text, e.g., Google, Yahoo, etc.
Transform and infer: segment = argmax_i p(query | segment_i) ≈ argmax_i E(count(query) | segment); word order matters. Flexible, but slow…
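A minimal sketch of the exact-match baseline above, assuming each segment is represented by the phone string of its one-best recognition output; the names are illustrative.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over symbol lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def best_segment(query_phones, segments):
    """segment = argmin_i d(query, segment_i)."""
    return min(segments, key=lambda seg: edit_distance(query_phones, seg))

best = best_segment(list("krudprais"), [list("krudoil"), list("krudprais")])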
Example: Language-Independent Speech Information Retrieval
Frontend: voice activity detection, perceptual frequency warping, Gaussian mixtures.
Likelihood vector: $b_i = p(\mathrm{observation}_t \mid \mathrm{state}_t = i)$.
Retrieval ranking = E(count(query) | segment observations).
Inference algorithm: a finite state transducer built from ASR lattices, computing E(count(query) | observations).
STAR Challenge Audio Retrieval Tasks
AT1. Query: IPA sequence. Target: segments that contain the query IPA sequence, regardless of language.
AT2. Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of spoken language.
Data set: 25-hour monolingual database in Round 1; 13-hour multilingual database in Round 3.
STAR Challenge Audio Retrieval Tasks
Genuine retrieval tasks, without pre-defined categories
The queries are human speech or IPA sequences, in one or multiple languages
Queries might be only part of the speech in the provided segments in the database
The returned hits must be ordered, and only the first 50 or 20 are submitted
(System diagram.) Speech recognition: the audio archive goes through feature extraction on short-time windows, producing spectral features, then through speech recognition, producing lattices. Indexing: FSA generation builds an FSM index for each audio segment; segment indices are combined into group indices. Query: FSM-based query construction builds a query FSM, using empirical and knowledge-based phone confusion. Retrieval: FSM-based retrieval matches the query FSM against all group indices to produce the retrieval results.
Automatic Speech Recognition
Language-specific systems perform better than language-independent systems: there is no inter-language mismatch between training and testing (acoustic model, pronunciation model, language model).
E.g., for English data, we can use all knowledge of English (training data, pronunciation dictionary, language model) in language-specific English word recognition.
Automatic Speech Recognition
Multilingual database and queries:
We might fail to identify which particular language the test data is in; even if we know, we might not have speech recognition models for that language.
We might encounter unseen languages (Tamil, Malay…).
Hence: language-independent phone-based recognition instead of word-based recognition.
Summary of corpora of different languages
Recognition results: all languages/datasets are not equal
Acoustic Model
Spectral features: 39-dim PLP, with cepstral mean/variance normalization (CMVN) per speaker (a CMVN sketch follows below).
Modeling: HMMs with {11,13,15,17}-mixture Gaussians.
Context-dependent, language-dependent modeling: left context - CENTRAL PHONE + right context % language, referred to as language-dependent "triphones"; e.g., the sound /A/ in different contexts and different languages:
^'-A+b%Eng
^'-A+b'%Eng
>-A+cm%Chinese
…
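A minimal sketch of per-speaker CMVN over precomputed feature frames; the PLP extraction itself is outside this sketch, and the data layout is an assumption.

import numpy as np

def cmvn_per_speaker(frames_by_speaker, eps=1e-8):
    """frames_by_speaker: dict speaker_id -> (n_frames, 39) PLP feature array."""
    normalized = {}
    for spk, frames in frames_by_speaker.items():
        mu = frames.mean(axis=0)       # per-speaker cepstral mean
        sigma = frames.std(axis=0)     # per-speaker cepstral std
        normalized[spk] = (frames - mu) / (sigma + eps)
    return normalized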
Acoustic model - Context-Dependent Phones
Acoustic Model: Clustering
Categories for decision tree questions:
Right or left context
Distinctive phone features (manner/place of articulation)
Language identity
Lexical stress
Punctuation mark
e.g., ^'-A+b%Eng, ^'-A+b'%Eng, >-A+cm%Chinese, …
Language/Sequence Model: N-gram
If there is a pronunciation model, a particular sequence of context-dependent phone models (acoustic models) can be converted to a particular word sequence, which is modeled by an N-gram.
If there is no pronunciation model, the context-dependent phone sequence is directly modeled by an N-gram:
$P(W_i \mid W_{i-1}, \ldots, W_{i-N+1})$
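A minimal sketch of an unsmoothed maximum-likelihood N-gram over phone (or word) sequences, matching the conditional probability above:

from collections import Counter

def train_ngram(sequences, n=3):
    """Count n-grams and their (n-1)-gram contexts; return P(w | context)."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
        for i in range(n - 1, len(padded)):
            ngrams[tuple(padded[i - n + 1:i + 1])] += 1
            contexts[tuple(padded[i - n + 1:i])] += 1
    def prob(w, context):
        """P(w | context), where context is the previous n-1 symbols."""
        c = contexts[tuple(context)]
        return ngrams[tuple(context) + (w,)] / c if c else 0.0
    return prob

prob = train_ngram([["k", "r", "u:", "d"], ["k", "r", "ai", "s"]], n=3)
print(prob("u:", ["k", "r"]))  # 0.5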
Solutions for English Speech Retrieval
Speech recognition frontend (audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices)
Features:
Perceptual Linear Predictive cepstra + energy
First/second-order regression coefficients
Speech recognizer:
Triphone-clustered English speech recognizer
English acoustic model, English dictionary, English language model
Lattice generation using a Viterbi decoder
Using Lattices as the Representation of Speech Data
I. A compact way to represent the numerous alternative hypotheses output by the speech recognizer.
A simple example: a single lattice encoding the alternatives "a b a" or "b a".
Some more complex examples: the lattices of Mangu et al. (1999) and James (1995).
Using Lattices as the Representation of Speech Data
II. Enables more robust speech retrieval:
The single best hypothesis from speech recognition is not reliable enough.
Lattices can be represented as finite state machines, which can be used in speech retrieval and take advantage of general weighted finite state machine algorithms.
Robust matching between query and audio files.
Solutions for English Speech Retrieval: FSM-Based Indexing
Indexing the audio archive. Input from the ASR frontend:
Lattices of database segments
English vocabulary
Steps (FSA generation → FSM index per segment → group indices; a simplified sketch follows below):
Construct log-semiring automata for each segment
Construct FST-based segment index files
Combine segment index files into a few group index files for retrieval
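As a much-simplified stand-in for the FST factor index (Allauzen et al., 2004), here is a sketch of an inverted index from phone n-grams to segments, weighted by expected counts from lattice path posteriors. A real factor transducer indexes all substrings of all lattice paths exactly; this toy version only indexes n-grams of enumerated path hypotheses, and all names are illustrative.

from collections import defaultdict

def build_group_index(lattice_paths_by_segment, n=3):
    """lattice_paths_by_segment: seg_id -> list of (phone_seq, posterior)."""
    index = defaultdict(lambda: defaultdict(float))
    for seg_id, paths in lattice_paths_by_segment.items():
        for phones, posterior in paths:
            for i in range(len(phones) - n + 1):
                # expected count of this n-gram in this segment
                index[tuple(phones[i:i + n])][seg_id] += posterior
    return index

def lookup(index, query_phones, n=3):
    """Score segments by summed expected counts of the query's n-grams."""
    scores = defaultdict(float)
    for i in range(len(query_phones) - n + 1):
        for seg_id, c in index[tuple(query_phones[i:i + n])].items():
            scores[seg_id] += c
    return sorted(scores.items(), key=lambda kv: -kv[1])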
Indexing the audio archive: example (Allauzen et al., 2004)
Speech recognition output: lattices for two audio files
Index built for the two files
Solutions for English Speech Retrieval: Making FST-Based Queries
Queries provided as IPA sequences:
Build an automaton from the IPA sequence and expand each IPA arc into alternative arcs for 'similar' IPA symbols
Build a query FSA, incorporating constraints
Queries provided as audio waveforms:
Process with the ASR frontend
Build log-semiring automata
Build the query FSA
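A toy sketch of the 'similar IPA' expansion, assuming a made-up confusion table; in the real system each variant corresponds to an alternative weighted arc in the query FSA rather than an enumerated sequence.

import itertools

# Hypothetical confusion table, for illustration only; the challenge system
# used empirical and knowledge-based phone-confusion data.
CONFUSABLE = {"p": ["b"], "s": ["z"], "u:": ["u"]}

def expand_query(ipa_seq, max_substitutions=1):
    """Yield (variant_sequence, n_substitutions) within a substitution budget."""
    options = [[p] + CONFUSABLE.get(p, []) for p in ipa_seq]
    for variant in itertools.product(*options):
        n_subs = sum(v != o for v, o in zip(variant, ipa_seq))
        if n_subs <= max_substitutions:
            yield list(variant), n_subs

for variant, cost in expand_query(["k", "r", "u:", "d"]):
    print(variant, cost)   # the original plus the /u:/ -> /u/ variant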
Example of query expansion (phone sequence to word FSA): the query /k r u: d p r ai s ^ z/ ("CRUDE PRICES") is expanded by FSM-based query construction, using both empirical and knowledge-based phone confusion.
Solutions for English Speech Retrieval: Retrieval Using Queries
Parallel retrieval in all group index files
Order the retrieved segment IDs
Truncate/format the results
Fuse results with different precision/recall tradeoffs obtained under different settings (a simple fusion sketch follows below)
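One simple way to fuse several ranked lists is reciprocal-rank fusion, sketched below; the fusion rule actually used in the challenge is not specified on the slide.

from collections import defaultdict

def fuse(ranked_lists, k=60, top_n=50):
    """Score each segment by the sum of 1/(k + rank) over all input rankings."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda s: -scores[s])
    return fused[:top_n]

print(fuse([["s3", "s1", "s2"], ["s1", "s3", "s9"]], top_n=3))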
Solutions for Multilingual Speech Retrieval
ASR frontend:
Multilingual word/phone speech recognizer
Language-independent phone recognizer
Indexing the audio archive:
Multilingual index
Language-independent index
Making FST-based queries:
Multilingual queries
Language-independent queries
Retrieval using queries:
Same as in language-specific retrieval
Audio Retrieval: What Is the "Optimal" Model Set?
(Example: the query /k r u: d p r ai s ^ z/, "CRUDE PRICES", against thousands of audio files, returning the top N file IDs.)
Candidate units:
Language-specific phones/words
Language-independent phones
General sub-word units
Size of inventory: using only the frequent symbols is better
Data-driven: units selected by a clustering tree
Automatic Speech Retrieval in an Unknown Language: FSM-Based Fuzzy Matching and Retrieval
(The same pipeline as before: feature extraction on short-time windows → spectral features → speech recognition → lattices → FSM-based indexing into group indices; FSM-based query construction → FSM-based retrieval → retrieval results.)
For a query such as /k r u: d p r ai s ^ z/ ("CRUDE PRICES") against thousands of audio files: when both the speech and the query are in an unknown language, what should they be represented as? Words, triphones, phones, or other acoustic units?
Automatic Speech Retrieval in an Unknown Language
Modeled as a special case of the cognitive process called assimilation: language-dependent (LD) phones (English, Russian, Spanish, …) with their LD models are grouped into language-independent (LID) phone clusters based on pairwise KL divergence, and speech in the unknown language is represented as LID phone lattices. Accommodation (introducing new models) is left as an open question.
Multilingual Subword Units in an Unknown Language
Language-dependent versions of the same IPA symbol can end up:
in one cluster, e.g., /z/ or /t∫/
in different clusters, e.g., /j/ or /I/
Different IPAs may be similar.
(Result charts: retrieval with an audio query; retrieval with an IPA-sequence query.) Clustering gives a more compact phone set and performs better.
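A minimal sketch of KL-based phone clustering, assuming each phone model is summarized by a single diagonal Gaussian (the real models are mixture HMMs, for which KL divergence must be approximated):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1)
    return kl12 + kl21

def cluster_phones(means, variances, n_clusters):
    """Agglomerative clustering of phone models by pairwise symmetric KL."""
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = sym_kl_diag(means[i], variances[i],
                                                  means[j], variances[j])
    return fcluster(linkage(squareform(dist), method='average'),
                    n_clusters, criterion='maxclust')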
(Chart: Croatian phone recognition, comparing training with Croatian data against treating Croatian as an unknown language.)
• Thank you.