Illinois Group in Star Challenge
PART I: Visual Data Processing
PART II: Audio Search
Liangliang Cao, Xiaodan Zhuang
University of Illinois at Urbana-Champaign
Dec 23, 2015
What is Star Challenge?
• Competition to Develop World’s Next-Generation Multimedia Search Technology
• Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore.
• A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries
Round 1: 8 teams
Round 2: 7 teams
Round 3: 5 teams
Grand Final in Singapore
No rewards at any round; only one team can win US$100,000
Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi, Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao
But we have a team with no fears…
Let’s go over our experience and stories…
Outline
• Problems of Visual Retrieval
• Data
• Features
• Algorithms
• Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1. Query: IPA sequence. Target: segments that contain the query IPA sequence, regardless of language. Metric: Mean Average Precision.
AT2. Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of spoken language. Metric: Mean Average Precision.
AT3. Query: none. Target: all recurrent segments at least 1 second in length. Metric: F-measure.
Data set: 25-hour monolingual database in Round 1; 13-hour multilingual database in Round 3.
$\mathrm{MAP} = \frac{1}{L}\sum_{i=1}^{L}\frac{1}{R_i}\sum_{j=1}^{R_i}\frac{j}{R_{i,j}}$, where $L$ is the number of queries, $R_i$ the number of relevant segments for query $i$, and $R_{i,j}$ the rank of the $j$-th relevant segment retrieved for query $i$.
$F = \frac{1}{D}\sum_{d=1}^{D}\frac{2 P_d R_d}{P_d + R_d}$, where $P_d$ and $R_d$ are the precision and recall for query $d$, averaged over the $D$ queries.
Xiaodan will talk about this part…
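For concreteness, here is a minimal Python sketch of the MAP metric above, assuming ranked result lists and relevance sets per query; the function names are illustrative, not from the challenge's evaluation kit.

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of the precision values at each relevant hit."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(results_per_query, relevant_per_query):
    """MAP: average of per-query AP values."""
    aps = [average_precision(r, g)
           for r, g in zip(results_per_query, relevant_per_query)]
    return sum(aps) / len(aps)

# Example: relevant segments retrieved at ranks 1 and 3 -> AP = (1 + 2/3) / 2
print(average_precision(["s1", "s7", "s3"], ["s1", "s3"]))  # 0.833...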
3 Video Retrieval Tasks
VT1. Query: a single image (20 queries). Target: all similar (short) video segments. Criterion: "visually similar". Metric: Mean Average Precision. Data set: 20 categories, multiple labels possible.
VT2. Query: a short video shot (<10 s; 20 queries). Target: all similar (long) video segments. Criterion: perceptually similar. Metric: Mean Average Precision. Data set: 10 categories, multiple labels possible.
VT3. Query: videos with sound (3~10 s; on the order of 10K) plus a category number. Target: learning the common visual characteristics. Metric: classification accuracy. Data set: 10 (20) categories, including one "others" category.
$\mathrm{AP} = \frac{1}{R}\sum_{r=1}^{N} P(r)\,\mathrm{rel}(r)$, where $P(r) = \frac{1}{r}\sum_{i=1}^{r}\mathrm{rel}(i)$, $\mathrm{rel}(r)$ is 1 if the item at rank $r$ is relevant and 0 otherwise, $R$ is the number of relevant items, and $N$ is the number of retrieved items.
20 VT1 Categories
100. Not applicable, none of the labels
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
103. Mobile devices, including handphone/PDA
104. Flag
105. Electronic chart, e.g., stock charts, airport departure chart
106. TV chart overlay, including graphs, text, PowerPoint style
107. Person using computer, both visible
108. Track and field, sports
109. Company trademark, including billboard, logo
110. Badminton court, sports
111. Swimming pool, sports
112. Closeup of hand, e.g., using mouse, writing, etc.
113. Business meeting (>2 people), mostly seated down, table visible
114. Natural scene, e.g., mountain, trees, sea, no people
115. Food on dishes, plates
116. Face closeup, occupying about 3/4 of screen, frontal or side
117. Traffic scene, many cars, trucks, road visible
118. Boat/ship, over sea, lake
119. PC webpages, screen of PC visible
120. Airplane
10 Categories for VT2
201. People entering/exiting door/car
202. Talking face with introductory caption
203. Fingers typing on a keyboard
204. Inside a moving vehicle, looking outside
205. Large camera movement, tracking an object, person, car, etc.
206. Static or minute camera movement, people walking, legs visible
207. Large camera movement, panning left/right, top/down of a scene
208. Movie ending credit
209. Woman monologue
210. Sports celebratory hug
5 Categories for VT3
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
107. Person using computer, both visible
112. Closeup of hand, e.g., using mouse, writing, etc.
116. Face closeup, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2): 5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1): 5 queries will be given and the participants are required to solve 4.
3) Audio + video search (AT1 + VT2): the search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data in Round 2
31 MPEG videos, ~20 hours
17,289 frames for VT1 in total
40,994 frames for VT2 in total (32,508 pseudo key frames + 8,486 real key frames)
Evaluation Video Data in Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10,580 .jpg files
Key frames for VT2: 64,546 files in total, including 10,580 true key frames + 53,966 pseudo key frames (.jpg)
Video resolution: 352×288
Computation Power
Workstations in IFP: 10 servers, 2-4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL clusters:
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz and 1.5 GB of RAM per node
TeraGrid:
Time Cost for Video Tasks
Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes
Feature extraction:
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (Matlab)
Semantic Feature 1: 24 hours (Matlab)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (Matlab)
Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier training:
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Matlab codes to CParallel computingGPU Acceleration:
Patch based featuresLoad time is the major issueExtracting all the features after one load
Features for Round 2 - VT1
Image features: SIFT, HOG, GIST, APC, LBP, color, texture, etc.
Semantic features
Features for Round 2 - VT2
Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on spatial intensity gradient
Gender recognition: SODA-Boost based
Motion history image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on nearest neighbors (NN)
Feature normalization/combination (see the sketch below)
Result visualization
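As a rough illustration of the normalization/combination step, here is a minimal sketch assuming z-score normalization per feature type followed by concatenation; the actual GUFE implementation details are not documented on this slide.

import numpy as np

def normalize(feats, eps=1e-8):
    """Z-score normalize each dimension across all images."""
    mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

def combine(feature_blocks):
    """Normalize each feature type, then concatenate into one vector per image."""
    return np.hstack([normalize(f) for f in feature_blocks])

# e.g., 100 images with a 128-dim and a 512-dim feature -> 640-dim combined
combined = combine([np.random.rand(100, 128), np.random.rand(100, 512)])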
Observations
1. Samples under the same category are more semantically similar to each other.
2. The shot boundaries are not well defined.
3. Some of the key frames are not labeled correctly, e.g., VT1 101, 103 (26-141).
Algorithms: Query Expansion
Input: a query image and its category number.
0. Preprocessing: compute the matching between the evaluation and the development data.
1. Expand the query image by retrieving all images with the same category from the development data set.
2. Search the evaluation set with the expanded query (see the sketch below).
Output: return the top 50/20 results.
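A minimal sketch of this query-expansion retrieval, assuming precomputed feature vectors and Euclidean nearest-neighbor matching (the matching used in step 0 is not specified on the slide; all names here are illustrative):

import numpy as np

def retrieve(query_vec, query_category, dev_feats, dev_labels,
             eval_feats, top_k=50):
    # Step 1: expand the query with all development images of the same category.
    expanded = np.vstack([query_vec[None, :],
                          dev_feats[dev_labels == query_category]])
    # Step 2: score each evaluation image by its distance to the nearest
    # member of the expanded query set.
    dists = np.linalg.norm(eval_feats[:, None, :] - expanded[None, :, :],
                           axis=2).min(axis=1)
    # Output: indices of the top 50 (or 20) closest evaluation images.
    return np.argsort(dists)[:top_k]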
Algorithms: GMM-Based Approach
Motivation: use a GMM to model the distribution of patches.
1. Train a UBM (Universal Background Model) on patches from all training images.
2. MAP-estimate the distribution of the patches belonging to one image, given the UBM.
3. Compute pairwise image distances based on a patch kernel and within-class covariance normalization.
4. Retrieve images based on the normalized distance (a simplified sketch follows below).
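A simplified sketch of steps 1-2 with scikit-learn, using mean-only relevance-MAP adaptation and plain Euclidean distances between mean supervectors; the patch kernel and within-class covariance normalization of step 3 are omitted here for brevity.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_patches, n_components=64):
    """Step 1: UBM trained on patch descriptors pooled from all images."""
    return GaussianMixture(n_components, covariance_type='diag').fit(all_patches)

def map_adapt_means(ubm, patches, tau=10.0):
    """Step 2: relevance-MAP adaptation of the UBM means to one image."""
    resp = ubm.predict_proba(patches)               # (n_patches, n_components)
    n_k = resp.sum(axis=0)                          # soft counts per component
    ex_k = resp.T @ patches / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + tau))[:, None]            # adaptation weights
    return alpha * ex_k + (1 - alpha) * ubm.means_  # adapted means

def image_distance(means_a, means_b):
    """Euclidean distance between the mean supervectors of two images."""
    return np.linalg.norm(means_a.ravel() - means_b.ravel())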
VT1 Performance (#2 of 8)
Category: MAP
101. Crowd (>10 people): 0.8419
102. Building with sky as backdrop, clearly visible: 0.977
103. Mobile devices, including handphone/PDA: 0.028
107. Person using computer, both visible: 0.2281
109. Company trademark, including billboard, logo: 0.96
112. Closeup of hand, e.g., using mouse, writing, etc.: 0.4584
113. Business meeting (>2 people), mostly seated down, table visible: 0.0644
115. Food on dishes, plates: 0.2285
116. Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783
117. Traffic scene, many cars, trucks, road visible: 0.2901
VT2 Performance (#1 of 8)
Category: MAP
202. Talking face with introductory caption: 0.8432
206. Static or minute camera movement, people walking, legs visible: 0.0581
207. Large camera movement, panning left/right, top/down of a scene: 0.7789
208. Movie ending credit: 0.2782
209. Woman monologue (Zhen): 0.9756
Performance in Round 3 (#1 of 7)
Task 2 (VT1), estimated MAP (R=20):
101. Crowd (>10 people): 0.64
102. Building with sky as backdrop, clearly visible: 1
107. Person using computer, both visible: 0.7
112. Closeup of hand, e.g., using mouse, writing, etc.: 0.527
116. Face closeup, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2), video MAP at R=20 (VT2 only / AT1 + VT2):
202. Talking face with introductory caption: 1 / 0.03
209. Woman monologue: 0.35 / 0.1
201. People entering door: N/A
We are: 2nd in audio search, 4th in video search, 2nd in AV search, and 1st overall.
Illinois Group in Star Challenge, Part II: Audio Search
A general indexing/retrieval approach leveraging speech recognition output lattices
Experience in a real-world audio retrieval task: the Star Challenge
Experience in speech retrieval in an unknown language
(Audio) Information Retrieval: Problem Definition
Task description: given a query, find the "most relevant" segments in a database.
Example: the query /k r u: d p r ai s ^ z/ ("CRUDE PRICES") is run against thousands of audio files, and the top N file IDs are returned.
(Audio) Information Retrieval: "Standard" Methods
Published algorithms:
Exact match: segment = argmin_i d(query, segment_i), where d is the string edit distance. Fast.
Summary statistics: segment = argmax_i p(query | segment_i); bag-of-words, no concept of "sequence". Good for text, e.g., Google, Yahoo, etc.
Transform and infer: segment = argmax_i p(query | segment_i) ≈ argmax_i E(count(query) | segment); word order matters. Flexible, but slow…
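A minimal sketch of the exact-match baseline above, assuming each segment is represented by the phone string of its one-best recognition output; the names are illustrative.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over symbol lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def best_segment(query_phones, segments):
    """segment = argmin_i d(query, segment_i)."""
    return min(segments, key=lambda seg: edit_distance(query_phones, seg))

best = best_segment(list("krudprais"), [list("krudoil"), list("krudprais")])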
Example: Language-Independent Speech Information Retrieval
Frontend: voice activity detection, perceptual frequency warping, Gaussian mixtures.
Likelihood vector: $b_i = p(\mathrm{observation}_t \mid \mathrm{state}_t = i)$.
Retrieval ranking = E(count(query) | segment observations).
Inference algorithm: a finite state transducer built from ASR lattices, computing E(count(query) | observations).
STAR Challenge Audio Retrieval Tasks
AT1. Query: IPA sequence. Target: segments that contain the query IPA sequence, regardless of language.
AT2. Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of spoken language.
Data set: 25-hour monolingual database in Round 1; 13-hour multilingual database in Round 3.
STAR Challenge Audio Retrieval Tasks
Genuine retrieval tasks, without pre-defined categories
The queries are human speech or IPA sequences, in one or multiple languages
Queries might be only part of the speech in the provided segments in the database
The returned hits must be ordered, and only the first 50 or 20 are submitted
(System diagram.) Speech recognition: the audio archive goes through feature extraction on short-time windows, producing spectral features, then through speech recognition, producing lattices. Indexing: FSA generation builds an FSM index for each audio segment; segment indices are combined into group indices. Query: FSM-based query construction builds a query FSM, using empirical and knowledge-based phone confusion. Retrieval: FSM-based retrieval matches the query FSM against all group indices to produce the retrieval results.
Automatic Speech Recognition
Language-specific systems perform better than language-independent systems: there is no inter-language mismatch between training and testing (acoustic model, pronunciation model, language model).
E.g., for English data, we can use all knowledge of English (training data, pronunciation dictionary, language model) in language-specific English word recognition.
Automatic Speech Recognition
Multilingual database and queries:
We might fail to identify which particular language the test data is in; even if we know, we might not have speech recognition models for that language.
We might encounter unseen languages (Tamil, Malay…).
Hence: language-independent phone-based recognition instead of word-based recognition.
Summary of corpora of different languages
Recognition results: all languages/datasets are not equal
Acoustic Model
Spectral features: 39-dim PLP, with cepstral mean/variance normalization (CMVN) per speaker (a CMVN sketch follows below).
Modeling: HMMs with {11,13,15,17}-mixture Gaussians.
Context-dependent, language-dependent modeling: left context - CENTRAL PHONE + right context % language, referred to as language-dependent "triphones"; e.g., the sound /A/ in different contexts and different languages:
^'-A+b%Eng
^'-A+b'%Eng
>-A+cm%Chinese
…
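A minimal sketch of per-speaker CMVN over precomputed feature frames; the PLP extraction itself is outside this sketch, and the data layout is an assumption.

import numpy as np

def cmvn_per_speaker(frames_by_speaker, eps=1e-8):
    """frames_by_speaker: dict speaker_id -> (n_frames, 39) PLP feature array."""
    normalized = {}
    for spk, frames in frames_by_speaker.items():
        mu = frames.mean(axis=0)       # per-speaker cepstral mean
        sigma = frames.std(axis=0)     # per-speaker cepstral std
        normalized[spk] = (frames - mu) / (sigma + eps)
    return normalized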
Acoustic model - Context-Dependent Phones
Acoustic Model: Clustering
Categories for decision tree questions:
Right or left context
Distinctive phone features (manner/place of articulation)
Language identity
Lexical stress
Punctuation mark
e.g., ^'-A+b%Eng, ^'-A+b'%Eng, >-A+cm%Chinese, …
Language/Sequence Model: N-gram
If there is a pronunciation model, a particular sequence of context-dependent phone models (acoustic models) can be converted to a particular word sequence, which is modeled by an N-gram.
If there is no pronunciation model, the context-dependent phone sequence is directly modeled by an N-gram:
$P(W_i \mid W_{i-1}, \ldots, W_{i-N+1})$
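A minimal sketch of an unsmoothed maximum-likelihood N-gram over phone (or word) sequences, matching the conditional probability above:

from collections import Counter

def train_ngram(sequences, n=3):
    """Count n-grams and their (n-1)-gram contexts; return P(w | context)."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
        for i in range(n - 1, len(padded)):
            ngrams[tuple(padded[i - n + 1:i + 1])] += 1
            contexts[tuple(padded[i - n + 1:i])] += 1
    def prob(w, context):
        """P(w | context), where context is the previous n-1 symbols."""
        c = contexts[tuple(context)]
        return ngrams[tuple(context) + (w,)] / c if c else 0.0
    return prob

prob = train_ngram([["k", "r", "u:", "d"], ["k", "r", "ai", "s"]], n=3)
print(prob("u:", ["k", "r"]))  # 0.5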
Solutions for English Speech Retrieval
Speech recognition frontend (audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices)
Features:
Perceptual Linear Predictive cepstra + energy
First/second-order regression coefficients
Speech recognizer:
Triphone-clustered English speech recognizer
English acoustic model, English dictionary, English language model
Lattice generation using a Viterbi decoder
Using Lattices as the Representation of Speech Data
I. A compact way to represent the numerous alternative hypotheses output by the speech recognizer.
A simple example: a single lattice encoding the alternatives "a b a" or "b a".
Some more complex examples: the lattices of Mangu et al. (1999) and James (1995).
Using Lattices as the Representation of Speech Data
II. Enables more robust speech retrieval:
The single best hypothesis from speech recognition is not reliable enough.
Lattices can be represented as finite state machines, which can be used in speech retrieval and take advantage of general weighted finite state machine algorithms.
Robust matching between query and audio files.
Solutions for English Speech Retrieval: FSM-Based Indexing
Indexing the audio archive. Input from the ASR frontend:
Lattices of database segments
English vocabulary
Steps (FSA generation → FSM index per segment → group indices; a simplified sketch follows below):
Construct log-semiring automata for each segment
Construct FST-based segment index files
Combine segment index files into a few group index files for retrieval
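As a much-simplified stand-in for the FST factor index (Allauzen et al., 2004), here is a sketch of an inverted index from phone n-grams to segments, weighted by expected counts from lattice path posteriors. A real factor transducer indexes all substrings of all lattice paths exactly; this toy version only indexes n-grams of enumerated path hypotheses, and all names are illustrative.

from collections import defaultdict

def build_group_index(lattice_paths_by_segment, n=3):
    """lattice_paths_by_segment: seg_id -> list of (phone_seq, posterior)."""
    index = defaultdict(lambda: defaultdict(float))
    for seg_id, paths in lattice_paths_by_segment.items():
        for phones, posterior in paths:
            for i in range(len(phones) - n + 1):
                # expected count of this n-gram in this segment
                index[tuple(phones[i:i + n])][seg_id] += posterior
    return index

def lookup(index, query_phones, n=3):
    """Score segments by summed expected counts of the query's n-grams."""
    scores = defaultdict(float)
    for i in range(len(query_phones) - n + 1):
        for seg_id, c in index[tuple(query_phones[i:i + n])].items():
            scores[seg_id] += c
    return sorted(scores.items(), key=lambda kv: -kv[1])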
Indexing the audio archive: example (Allauzen et al., 2004)
Speech recognition output: lattices for two audio files
Index built for the two files
Solutions for English Speech Retrieval: Making FST-Based Queries
Queries provided as IPA sequences:
Build an automaton from the IPA sequence and expand each IPA arc into alternative arcs for 'similar' IPA symbols
Build a query FSA, incorporating constraints
Queries provided as audio waveforms:
Process with the ASR frontend
Build log-semiring automata
Build the query FSA
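A toy sketch of the 'similar IPA' expansion, assuming a made-up confusion table; in the real system each variant corresponds to an alternative weighted arc in the query FSA rather than an enumerated sequence.

import itertools

# Hypothetical confusion table, for illustration only; the challenge system
# used empirical and knowledge-based phone-confusion data.
CONFUSABLE = {"p": ["b"], "s": ["z"], "u:": ["u"]}

def expand_query(ipa_seq, max_substitutions=1):
    """Yield (variant_sequence, n_substitutions) within a substitution budget."""
    options = [[p] + CONFUSABLE.get(p, []) for p in ipa_seq]
    for variant in itertools.product(*options):
        n_subs = sum(v != o for v, o in zip(variant, ipa_seq))
        if n_subs <= max_substitutions:
            yield list(variant), n_subs

for variant, cost in expand_query(["k", "r", "u:", "d"]):
    print(variant, cost)   # the original plus the /u:/ -> /u/ variant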
Example of query expansion (phone sequence to word FSA): the query /k r u: d p r ai s ^ z/ ("CRUDE PRICES") is expanded by FSM-based query construction, using both empirical and knowledge-based phone confusion.
Solutions for English Speech Retrieval: Retrieval Using Queries
Parallel retrieval in all group index files
Order the retrieved segment IDs
Truncate/format the results
Fuse results with different precision/recall tradeoffs obtained under different settings (a simple fusion sketch follows below)
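One simple way to fuse several ranked lists is reciprocal-rank fusion, sketched below; the fusion rule actually used in the challenge is not specified on the slide.

from collections import defaultdict

def fuse(ranked_lists, k=60, top_n=50):
    """Score each segment by the sum of 1/(k + rank) over all input rankings."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda s: -scores[s])
    return fused[:top_n]

print(fuse([["s3", "s1", "s2"], ["s1", "s3", "s9"]], top_n=3))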
Solutions for Multilingual Speech Retrieval
ASR frontend:
Multilingual word/phone speech recognizer
Language-independent phone recognizer
Indexing the audio archive:
Multilingual index
Language-independent index
Making FST-based queries:
Multilingual queries
Language-independent queries
Retrieval using queries:
Same as in language-specific retrieval
Audio Retrieval: What Is the "Optimal" Model Set?
(Example: the query /k r u: d p r ai s ^ z/, "CRUDE PRICES", against thousands of audio files, returning the top N file IDs.)
Candidate units:
Language-specific phones/words
Language-independent phones
General sub-word units
Size of inventory: using only the frequent symbols is better
Data-driven: units selected by a clustering tree
Automatic Speech Retrieval in an Unknown Language: FSM-Based Fuzzy Matching and Retrieval
(The same pipeline as before: feature extraction on short-time windows → spectral features → speech recognition → lattices → FSM-based indexing into group indices; FSM-based query construction → FSM-based retrieval → retrieval results.)
For a query such as /k r u: d p r ai s ^ z/ ("CRUDE PRICES") against thousands of audio files: when both the speech and the query are in an unknown language, what should they be represented as? Words, triphones, phones, or other acoustic units?
Automatic Speech Retrieval in an Unknown Language
Modeled as a special case of the cognitive process called assimilation: language-dependent (LD) phones (English, Russian, Spanish, …) with their LD models are grouped into language-independent (LID) phone clusters based on pairwise KL divergence, and speech in the unknown language is represented as LID phone lattices. Accommodation (introducing new models) is left as an open question.
Multilingual Subword Units in an Unknown Language
Language-dependent versions of the same IPA symbol can end up:
in one cluster, e.g., /z/ or /t∫/
in different clusters, e.g., /j/ or /I/
Different IPAs may be similar.
(Result charts: retrieval with an audio query; retrieval with an IPA-sequence query.) Clustering gives a more compact phone set and performs better.
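A minimal sketch of KL-based phone clustering, assuming each phone model is summarized by a single diagonal Gaussian (the real models are mixture HMMs, for which KL divergence must be approximated):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1)
    return kl12 + kl21

def cluster_phones(means, variances, n_clusters):
    """Agglomerative clustering of phone models by pairwise symmetric KL."""
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = sym_kl_diag(means[i], variances[i],
                                                  means[j], variances[j])
    return fcluster(linkage(squareform(dist), method='average'),
                    n_clusters, criterion='maxclust')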
(Chart: Croatian phone recognition, comparing training with Croatian data against treating Croatian as an unknown language.)
• Thank you.