www.beckman.uiuc.edu Illinois Group in Star Challenge PART I: visual data processing PART II. Audio Search Liangliang Cao, Xiaodan Zhuang University of Illinois at Urbana-Champ

Dec 23, 2015

Samson Holmes
Page 1

www.beckman.uiuc.edu

Illinois Group in Star Challenge

PART I: visual data processing

PART II. Audio Search

Liangliang Cao, Xiaodan Zhuang
University of Illinois at Urbana-Champaign

Page 2

What is Star Challenge?

• Competition to Develop World’s Next-Generation Multimedia Search Technology

• Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore.

• A real-world computer vision task which requires large amounts of computation power

Page 3

But low rewards

56 teams from 17 countries → Round 1 → 8 teams → Round 2 → 7 teams → Round 3 → 5 teams (Grand Final in Singapore)

Page 4

But low rewards

56 teams from 17 countries → Round 1 → 8 teams → Round 2 → 7 teams → Round 3 → 5 teams (Grand Final in Singapore)

No rewards at every stage

Only one team can win US$100,000

Page 5

Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi

Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao

But we have a team with no fears…

Page 7

Let’s go over our experience and stories…

Page 8

Outline

• Problems of Visual Retrieval

• Data

• Features

• Algorithms

• Results (first 3 rounds)

Page 9

3 Audio Retrieval Tasks

Task AT1
  Query: IPA sequence
  Target: segments that contain the query IPA sequence, regardless of language
Task AT2
  Query: an utterance spoken by different speakers
  Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
Task AT3
  Query: no queries
  Target: extract all recurrent segments that are at least 1 second in length
  Metric: F-measure
Metric (AT1, AT2): Mean Average Precision (MAP)
Data set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

[Equations: definitions of MAP and of the F-measure]

Xiaodan will talk about this part……

Page 10

3 Video Retrieval Tasks

Task VT1
  Query: single image; 20 queries
  Target: (short) video segments; all the similar segments
  Criteria: “visually similar”
  Data set: 20 categories, multiple labels possible
Task VT2
  Query: short video shot (<10 s); 20 queries
  Target: (long) video segments; all the similar segments
  Criteria: perceptually similar
  Data set: 10 categories, multiple labels possible
Task VT3
  Query: videos with sound (3~10 s); order of 10K
  Target: category number
  Criteria: learning the common visual characteristics
  Metric: classification accuracy
  Data set: 10 (20) categories, including one “others” category
Metric (VT1, VT2): Mean Average Precision

$$AP = \frac{1}{R}\sum_{r=1}^{N} P(r)\,rel(r)$$

$$P(r) = \frac{1}{r}\sum_{i=1}^{r} rel(i)$$
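The average-precision metric used for VT1 and VT2 can be sketched in a few lines of Python. This is a minimal illustration rather than the official scoring script: the function names are made up here, `rel` is a ranked list of 0/1 relevance flags, and `num_relevant` plays the role of R, whose exact definition in the challenge is an assumption left to the caller.

```python
from typing import List, Sequence, Tuple


def average_precision(rel: Sequence[int], num_relevant: int) -> float:
    """AP = (1/R) * sum over ranks r of P(r) * rel(r), where P(r) is precision at rank r."""
    if num_relevant <= 0:
        return 0.0
    hits = 0
    total = 0.0
    for r, is_rel in enumerate(rel, start=1):
        if is_rel:
            hits += 1            # number of relevant items in the top r
            total += hits / r    # P(r), counted only where rel(r) = 1
    return total / num_relevant


def mean_average_precision(runs: List[Tuple[Sequence[int], int]]) -> float:
    """Mean of AP over all queries; each run is (relevance flags, R)."""
    return sum(average_precision(rel, r) for rel, r in runs) / len(runs)


if __name__ == "__main__":
    print(average_precision([1, 0, 1, 0], 2))
    print(mean_average_precision([([1, 0, 1, 0], 2), ([0, 1], 1)]))
```

A ranked list that puts relevant items at ranks 1 and 3 scores (1/1 + 2/3) / 2, so early hits are rewarded more than late ones.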

Page 11

20 VT1 Categories

100. Not-Applicable, none of the labels
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
103. Mobile devices including handphone/PDA
104. Flag
105. Electronic chart, e.g. stock charts, airport departure chart
106. TV chart overlay, including graphs, text, PowerPoint style
107. Person using computer, both visible
108. Track and field, sports
109. Company trademark, including billboard, logo
110. Badminton court, sports
111. Swimming pool, sports
112. Closeup of hand, e.g. using mouse, writing, etc.
113. Business meeting (>2 people), mostly seated down, table visible
114. Natural scene, e.g. mountain, trees, sea, no people
115. Food on dishes, plates
116. Face closeup, occupying about 3/4 of screen, frontal or side
117. Traffic scene, many cars, trucks, road visible
118. Boat/ship, over sea, lake
119. PC webpages, screen of PC visible
120. Airplane

Page 12

10 Categories for VT2

201. People entering/exiting door/car
202. Talking face with introductory caption
203. Fingers typing on a keyboard
204. Inside a moving vehicle, looking outside
205. Large camera movement, tracking an object, person, car, etc.
206. Static or minute camera movement, people(s) walking, legs visible
207. Large camera movement, panning left/right, top/down of a scene
208. Movie ending credit
209. Woman monologue
210. Sports celebratory hug

Page 13

5 Categories for VT3

• 101. Crowd (>10 people)
• 102. Building with sky as backdrop, clearly visible
• 107. Person using computer, both visible
• 112. Closeup of hand, e.g. using mouse, writing, etc.
• 116. Face closeup, occupying about 3/4 of screen, frontal or side

Page 14

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Page 15

Examples of Images

Page 16

More samples

Page 17

Evaluation Video Data in Round 2

• 31 MPEG videos, ~20 hours
• 17289 frames for VT1 in total
• 40994 frames for VT2 in total: 32508 pseudo key frames, 8486 real key frames

Page 18

Evaluation Video Data of Round 3

• Video files: 27 MPEG-1 files (13 hours of video/audio in total)
• Key frames for VT1: 10580 .jpg files
• Key frames for VT2: 64546 files in total, including 10580 .jpg files (true key frames) + 53966 .jpg files (pseudo key frames)
• Video resolution: 352×288

Page 19

Computing Power

Workstations in IFP
• 10 servers, 2~4 CPUs each, 36 CPUs in total
• IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster
• Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
• Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid:

Page 20

Time Cost for Video Tasks

• Data decompression: 15 minutes
• Video format conversion: 2 hours
• Video segmentation (for VT2): 40 minutes
• Sound track extraction: 30 minutes
• Feature extraction:
  – Global Feature 2: 2 hours (C)
  – Global Feature 1: 2 hours (C)
  – Patch-based Feature 1: 2 hours (C)
  – Patch-based Feature 2: 5 hours (Matlab)
  – Semantic Feature 1: 24 hours (Matlab)
  – Semantic Feature 2: 3 hours (C)
  – Semantic Feature 3: 4 hours (C)
  – Motion Feature 1: 24 hours (Matlab)
  – Motion Feature 2: 3 hours on t-Illiac
• Classifier training:
  – Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
  – Classifier 2: 20 minutes
  – Classifier 3: less than 10 minutes

Page 21

Possible Accelerations for Video

• Matlab code to C
• Parallel computing
• GPU acceleration:
  – Patch-based features
  – Load time is the major issue
  – Extracting all the features after one load
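The "extract all the features after one load" idea can be illustrated with a small Python sketch. Everything named here (`extract_all`, the two toy extractors, the `load` callback) is hypothetical and stands in for the team's real loaders and feature code; the only point is that each frame is loaded once and then handed to every extractor.

```python
from typing import Callable, Dict, Iterable, List

Frame = List[List[int]]  # a grayscale image as rows of pixel values (assumption)


def mean_intensity(frame: Frame) -> float:
    """Toy global feature: average pixel value."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)


def intensity_range(frame: Frame) -> int:
    """Toy global feature: max minus min pixel value."""
    pixels = [p for row in frame for p in row]
    return max(pixels) - min(pixels)


def extract_all(
    paths: Iterable[str],
    load: Callable[[str], Frame],
    extractors: Dict[str, Callable[[Frame], object]],
) -> Dict[str, list]:
    """Load each frame once, then run every extractor on the in-memory frame."""
    features: Dict[str, list] = {name: [] for name in extractors}
    for path in paths:
        frame = load(path)  # the expensive step happens once per frame
        for name, extractor in extractors.items():
            features[name].append(extractor(frame))
    return features
```

Running each extractor as its own pass over the data would instead pay the load cost once per feature type, which is the situation the slide flags as the major issue.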

Page 22

Features for Round 2 - VT1

Image features
• SIFT
• HOG
• GIST
• APC
• LBP
• Color, texture, etc.

Semantic feature

Page 23

Features for Round 2 - VT2

Character detector
– Harris corner
– Morphological operations

Optical flow
– Lucas-Kanade on spatial intensity gradient

Gender recognition
– SODA-boost based

Motion History Image
– Spatial interest points

Page 24

GUFE: Grand Unified Feature Extractor

• Designed by Dennis
• Collects features generated by team members into one standard format
• Retrieval by query expansion based on NN
• Feature normalization/combination
• Result visualization

Page 25

Observations

1. Samples under the same category are more semantically similar to each other.
2. The shot boundaries are not well defined.
3. Some of the key frames are not labeled correctly, e.g., VT1 101, 103 (26-141).

Page 26

Algorithms: Query Expansion

Input: a query image and its category number.

0. Preprocessing: compute the matching between the evaluation and the development data

1. Expand the query image by retrieving all the images from the development data set with the same category.

2. Search the evaluation set with the expanded query.

Output: return the top 50/20 results.
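The query-expansion steps above can be sketched as follows. This is an illustrative reading, not the team's implementation: feature vectors are plain tuples, the development set is a list of `(vector, category)` pairs with a single label each, the distance is Euclidean, and an evaluation item is scored by its distance to the nearest member of the expanded query. All of those choices, and the function name, are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def expanded_query_search(
    query_vec: Vector,
    category: int,
    dev_set: List[Tuple[Vector, int]],
    eval_set: Dict[str, Vector],
    top_k: int,
) -> List[str]:
    """Expand the query with same-category development items, then rank the evaluation set."""
    # Step 1: the expanded query is the query plus all development items of its category.
    expanded = [query_vec] + [vec for vec, label in dev_set if label == category]

    # Step 2: score each evaluation item by its nearest expanded-query member.
    scored = []
    for seg_id, vec in eval_set.items():
        nearest = min(math.dist(vec, q) for q in expanded)
        scored.append((nearest, seg_id))

    # Output: the top_k closest evaluation items (50 or 20 in the challenge).
    scored.sort()
    return [seg_id for _, seg_id in scored[:top_k]]
```

Step 0 of the slide (precomputing matches between evaluation and development data) would replace the inner `min` with a table lookup; it is omitted here to keep the sketch short.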

Page 27

Algorithms: GMM-Based Approach

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
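Step 2 (MAP estimation given the UBM) is commonly done by adapting only the component means. The sketch below shows that standard mean-only update for a one-dimensional GMM; the 1-D setting, the `relevance` factor of 16, and the function names are simplifying assumptions made here, not details taken from the slides.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(
    patches: Sequence[float],
    weights: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
    relevance: float = 16.0,
) -> List[float]:
    """Mean-only MAP adaptation of a UBM to the patches of one image."""
    k = len(means)
    soft_counts = [0.0] * k   # n_k: expected number of patches per component
    soft_sums = [0.0] * k     # sum of posterior-weighted patch values

    for x in patches:
        likes = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # patch is numerically unreachable from every component
        for j in range(k):
            post = likes[j] / total
            soft_counts[j] += post
            soft_sums[j] += post * x

    adapted = []
    for j in range(k):
        alpha = soft_counts[j] / (soft_counts[j] + relevance)
        data_mean = soft_sums[j] / soft_counts[j] if soft_counts[j] > 0.0 else means[j]
        # Interpolate between the image's own statistics and the UBM prior.
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[j])
    return adapted
```

Components that see many patches move toward the image's data, while components that see few stay near the UBM, so every image ends up with a comparable, fixed-size description.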

Page 28

VT1 Performance (#2 of 8)

Category: MAP

• 101. Crowd (>10 people): 0.8419
• 102. Building with sky as backdrop, clearly visible: 0.977
• 103. Mobile devices including handphone/PDA: 0.028
• 107. Person using computer, both visible: 0.2281
• 109. Company trademark, including billboard, logo: 0.96
• 112. Closeup of hand, e.g. using mouse, writing, etc.: 0.4584
• 113. Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115. Food on dishes, plates: 0.2285
• 116. Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783
• 117. Traffic scene, many cars, trucks, road visible: 0.2901

Page 29

VT2 Performance (#1 of 8)

Category: MAP

• 202. Talking face with introductory caption: 0.8432
• 206. Static or minute camera movement, people(s) walking, legs visible: 0.0581
• 207. Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208. Movie ending credit: 0.2782
• 209. Woman monologue, Zhen: 0.9756

Page 30

Performance of Round 3 (#1 of 7)

Task 2 (VT1)

Target: Estimated MAP (R=20)
• 101. Crowd (>10 people): 0.64
• 102. Building with sky as backdrop, clearly visible: 1
• 107. Person using computer, both visible: 0.7
• 112. Closeup of hand, e.g. using mouse, writing, etc.: 0.527
• 116. Face closeup, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2)

Retrieval target (video, R=20): VT2 only / AT1 + VT2
• 202, face with introductory caption: 1 / 0.03
• 209, woman monologue: 0.35 / 0.1
• 201, people entering door: N/A

We are:
• 2nd in audio search
• 4th in video search
• 2nd in AV search
• 1st overall

Page 31

Illinois Group in Star Challenge
Part II. Audio Search

• A general index/retrieval approach leveraging speech recognition output lattices
• Experience in a real-world audio retrieval task: the Star Challenge
• Experience in speech retrieval in an unknown language

Page 32

(Audio) Information Retrieval: Problem Definition

• Task description: given a query, find the “most relevant” segments in a database

[Figure: a query, given as the IPA sequence “k r u: d p r ai s ^ z” or the spoken phrase “CRUDE PRICES”, is searched against thousands of audio files to return the top N file IDs]

Page 33

(Audio) Information Retrieval: “Standard” Methods

• Published algorithms:
– EXACT MATCH: segment = argmin_i d(query, segment_i), where d is the string edit distance
  • Fast
– SUMMARY STATISTICS: segment = argmax_i p(query | segment_i); bag-of-words, no concept of “sequence”
  • Good for text, e.g., Google, Yahoo, etc.
– TRANSFORM AND INFER: segment = argmax_i p(query | segment_i) ≈ argmax_i E(count(query) | segment_i); word order matters
  • Flexible, but slow
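The EXACT MATCH method can be illustrated with a standard Levenshtein edit distance over phone symbols. This is a minimal sketch: the segment IDs and phone strings are made-up examples, and comparing the query against a whole segment (rather than searching for it inside a longer segment) is a simplification.

```python
from typing import Dict, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def best_segment(query: Sequence[str], segments: Dict[str, Sequence[str]]) -> str:
    """segment = argmin_i d(query, segment_i)."""
    return min(segments, key=lambda seg_id: edit_distance(query, segments[seg_id]))


if __name__ == "__main__":
    segments = {"s1": "k r u: d".split(), "s2": "p r ai s".split()}
    print(best_segment("k r u: t".split(), segments))
```

The method is fast because each comparison is a small dynamic program, but it depends on a single recognized phone string per segment, which is the weakness the lattice-based approach addresses.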

Page 34

Example: Language-Independent Speech Information Retrieval

• Voice activity detection
• Perceptual frequency warping
• Gaussian mixtures
• Likelihood vector: b_i = p(observation_t | state_t = i)
• Retrieval ranking = E(count(query) | segment observations)
• Inference algorithm: finite state transducer built from ASR lattices, E(count(query) | observations)
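To make the ranking criterion E(count(query) | observations) concrete, here is a deliberately simplified sketch. Instead of a full ASR lattice and finite state transducers, it assumes each segment is a confusion network: a list of time slots, each holding phone posteriors, with slots treated as independent. Under that assumption the expected count is a sum over start positions of the product of per-slot posteriors. The data and function names are hypothetical.

```python
from typing import Dict, List, Sequence

Slot = Dict[str, float]  # phone -> posterior probability within one time slot


def expected_count(query: Sequence[str], slots: List[Slot]) -> float:
    """Expected number of occurrences of the query phone sequence in the segment."""
    total = 0.0
    for start in range(len(slots) - len(query) + 1):
        prob = 1.0
        for offset, phone in enumerate(query):
            prob *= slots[start + offset].get(phone, 0.0)
        total += prob  # linearity of expectation over start positions
    return total


def rank_segments(query: Sequence[str], archive: Dict[str, List[Slot]]) -> List[str]:
    """Order segment IDs by decreasing expected count, ties broken by ID."""
    return sorted(archive, key=lambda seg: (-expected_count(query, archive[seg]), seg))
```

Unlike a one-best phone string, this keeps alternative hypotheses alive: a segment still receives credit when the correct phone was only the recognizer's second choice.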

Page 35

STAR Challenge Audio Retrieval Tasks

Task AT1
  Query: IPA sequence
  Target: segments that contain the query IPA sequence, regardless of language
Task AT2
  Query: an utterance spoken by different speakers
  Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
Data set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Page 36

STAR Challenge Audio Retrieval Tasks

• Genuine retrieval tasks
– without pre-defined categories
• The queries are human speech or IPA sequences
– one or multiple languages
• Queries might be only part of the speech in the provided segments in the database
• The returned hits should be ordered and only the first 50 or 20 are submitted.

Page 37

[Figure: system diagram in two stages. Speech recognition stage: Audio Archive, Feature Extraction on Short-time Windows, Spectral Features, Speech Recognition, Lattices. Indexing / Query / Retrieval stage: All Audio Segments, FSA Generation, FSM Index, Group Index, All Group Indices (FSM-based Indexing); Query, FSM-based Query Construction with empirical and knowledge-based phone confusion; FSM-based Retrieval; Retrieval Results]

Page 38

Automatic speech recognition

• Better performance with language-specific systems than language-independent systems
– No inter-language mismatch between training and testing (acoustic model, pronunciation model, language model)
– e.g., for English data, we can use all knowledge of English (training data, pronunciation dictionary, language model)

Language-specific (e.g., English) word recognition

Page 39

Automatic speech recognition

Multilingual database and queries
○ We might fail to identify which particular language the testing data is in; even if we know, we might not have the speech recognition models for that language.
○ We might encounter unseen languages (Tamil, Malay…)

Language-independent phone-based recognition instead of word-based recognition

Page 40

Summary of corpora of different languages

Page 41

Recognition results: all languages/datasets are not equal

Page 42

Acoustic model

Spectral features:
• 39-dim PLP, cepstral mean/variance normalization per speaker

Modeling:
• HMMs with {11,13,15,17}-mixture Gaussians

Context-dependent, language-dependent modeling:
• Label format: left context - CENTRAL PHONE + right context % language
• Referred to as language-dependent “triphones”
• e.g., sound /A/ in different contexts and in different languages:
  ^’-A+b%Eng
  ^’-A+b’%Eng
  >-A+cm%Chinese
  …
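The language-dependent triphone labels follow the pattern left context, "-", central phone, "+", right context, "%", language. A small parser makes the format explicit. This is an illustrative sketch: the `Triphone` type and `parse_triphone` name are invented here, and it assumes that "-", "+", and "%" never occur inside a context or phone symbol.

```python
from typing import NamedTuple


class Triphone(NamedTuple):
    left: str      # left-context symbol
    center: str    # central phone being modeled
    right: str     # right-context symbol
    language: str  # language tag after the '%'


def parse_triphone(label: str) -> Triphone:
    """Split a label such as '>-A+cm%Chinese' into its four parts."""
    body, percent, language = label.rpartition("%")
    if not percent:
        raise ValueError(f"missing language tag in {label!r}")
    left, dash, rest = body.partition("-")
    center, plus, right = rest.partition("+")
    if not (dash and plus):
        raise ValueError(f"not a left-CENTER+right label: {label!r}")
    return Triphone(left, center, right, language)


if __name__ == "__main__":
    print(parse_triphone(">-A+cm%Chinese"))
    print(parse_triphone("^’-A+b%Eng"))
```

Having the language as an explicit field is what lets the clustering stage ask questions about language identity alongside the usual context questions.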

Page 43

Acoustic model - Context-Dependent Phones

Page 44

Acoustic model - clustering

Categories for decision tree questions:
• Right or left context
• Distinctive phone features (manner/place of articulation)
• Language identity
• Lexical stress
• Punctuation mark

Example labels:
  ^’-A+b%Eng
  ^’-A+b’%Eng
  >-A+cm%Chinese
  …

Page 45

Language/Sequence model: N-gram

If there is a pronunciation model, a particular sequence of context-dependent phone models (acoustic models) can be converted to a particular word sequence, which is modeled by an N-gram.

If there is no pronunciation model, the context-dependent phone sequence is directly modeled by an N-gram.

$$P(W_i \mid W_{i-1}, \ldots, W_{i-N+1})$$
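An N-gram model of this kind can be estimated by counting. The sketch below uses unsmoothed maximum-likelihood estimates and a `<s>` padding symbol, both of which are simplifications chosen here for brevity (a practical recognizer would use smoothing or back-off). The symbols can be words or context-dependent phones; the function names are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def train_ngram(sequences: Iterable[Sequence[str]], n: int) -> Tuple[Counter, Counter]:
    """Count N-grams and their (N-1)-symbol contexts over training sequences."""
    ngram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for seq in sequences:
        padded: List[str] = ["<s>"] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def ngram_prob(ngram_counts: Counter, context_counts: Counter,
               context: Sequence[str], symbol: str) -> float:
    """Maximum-likelihood estimate of P(symbol | context); 0.0 for unseen contexts."""
    seen = context_counts[tuple(context)]
    if seen == 0:
        return 0.0
    return ngram_counts[tuple(context) + (symbol,)] / seen
```

With `n=2` and training sequences `a b a` and `a b b`, the estimate of P(b | a) is 1.0 and P(a | b) is 0.5.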

Page 46

Solutions for English Speech Retrieval

Speech recognition frontend

Features:
○ Perceptual Linear Predictive cepstra + energy
○ First- and second-order regression coefficients

Speech recognizer:
○ Triphone-clustered English speech recognizer
○ English acoustic model, English dictionary, English language model
○ Lattice generation using a Viterbi decoder

[Figure: Audio Archive, Feature Extraction on Short-time Windows, Spectral Features, Speech Recognition, Lattices]

Page 47

Using lattices as the representation of speech data

I. A compact way to represent numerous alternative hypotheses output by the speech recognizer
• A simple example: “a b a” or “b a”
• Some more complex examples:

Page 48

Mangu et al. 1999

James 1995

• Some more complex examples:

Page 49

Using lattices as the representation of speech data

II. Enable more robust speech retrieval
• The one-best hypothesis from speech recognition is not reliable enough.
• Lattices can be represented as finite state machines, which can be used in speech retrieval and take advantage of general weighted finite state machine algorithms.
• Robust matching between query and audio files

Page 50

Solutions for English Speech Retrieval

Indexing the audio archive

Input from ASR frontend:
○ Lattices of database segments
○ English vocabulary

Construct log-semiring automata for each segment

Construct FST-based segment index files

Combine segment index files into a few group index files for retrieval

[Figure: FSM-based Indexing. All Audio Segments, FSA Generation, FSM Index, Group Index, All Group Indices]

Page 51

Indexing the audio archive (example)

Allauzen et al., 2004

Speech recognition output: lattices for two audio files

Index for the two files

Page 52

Solutions for English Speech Retrieval

Making FST-based queries

Provided queries, each as an IPA sequence:
○ Build an automaton from the IPA sequence and expand each IPA arc into alternative arcs for 'similar' IPA symbols
○ Build a query FSA, incorporating constraints

Provided queries, each as an audio waveform:
○ Processed by ASR frontend
○ Build a log-semiring automaton
○ Build the query FSA
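Expanding each IPA arc into alternative arcs for similar symbols amounts to building a linear automaton whose every position accepts a small set of phones. The sketch below represents that automaton as a list of sets and matches it against a single phone string. The `SIMILAR` table is a hypothetical stand-in for the empirical and knowledge-based phone confusions, and matching against one phone string (rather than a lattice index) is a simplification.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical confusion table: each phone maps to phones treated as 'similar'.
SIMILAR: Dict[str, Set[str]] = {"s": {"z"}, "z": {"s"}, "d": {"t"}, "t": {"d"}}


def expand_query(phones: Sequence[str], similar: Dict[str, Set[str]]) -> List[Set[str]]:
    """One set of accepted symbols per query position (parallel arcs in the automaton)."""
    return [{p} | set(similar.get(p, ())) for p in phones]


def matches(expanded: List[Set[str]], segment: Sequence[str]) -> bool:
    """True if some contiguous window of the segment is accepted by the expanded query."""
    m = len(expanded)
    return any(
        all(segment[start + k] in expanded[k] for k in range(m))
        for start in range(len(segment) - m + 1)
    )


if __name__ == "__main__":
    query = expand_query("k r u: d".split(), SIMILAR)
    print(matches(query, "^ k r u: t p".split()))
```

A weighted version would attach a cost to each alternative arc so that exact matches rank above confusable ones.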

Page 53

Example of query expansion (phone sequence to word FSA)

[Figure: the query “k r u: d p r ai s ^ z” / “CRUDE PRICES” passes through FSM-based query construction, using empirical and knowledge-based phone confusion]

Page 54

Solutions for English Speech Retrieval

Retrieval using queries
• Parallel retrieval in all group index files
• Order retrieved segment IDs
• Truncate/format results
• Fuse results with different precision/recall tradeoffs obtained by different settings

[Figure: All Group Indices, FSM-based Retrieval]
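The retrieval loop (search every group index in parallel, order the hits, truncate) can be sketched with Python's standard thread pool. The group-index layout and the `search_group` callback are hypothetical placeholders for the FSM-based search; keeping the best score per segment ID when merging is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, TypeVar

Index = TypeVar("Index")


def retrieve(
    query: str,
    group_indices: Sequence[Index],
    search_group: Callable[[str, Index], Dict[str, float]],
    top_n: int,
) -> List[str]:
    """Search all group indices in parallel, then order and truncate the hits."""
    with ThreadPoolExecutor() as pool:
        partial_hits = list(pool.map(lambda idx: search_group(query, idx), group_indices))

    # Merge: keep the best score seen for each segment ID.
    merged: Dict[str, float] = {}
    for hits in partial_hits:
        for seg_id, score in hits.items():
            merged[seg_id] = max(score, merged.get(seg_id, float("-inf")))

    # Order by decreasing score (ties by ID) and keep the first top_n.
    ranked = sorted(merged, key=lambda seg_id: (-merged[seg_id], seg_id))
    return ranked[:top_n]
```

In the challenge setting `top_n` would be 50 or 20, matching the number of hits that could be submitted.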

Page 55

• ASR frontend

• Multi-lingual word/phone speech recognizer

• Language-independent phone recognizer

• Indexing the audio archive

– Multi-lingual index

– Language-independent index

• Making FST-based queries

– Multi-lingual queries

– Language-independent queries

• Retrieval using queries

– Same as in language-specific retrieval

Solutions for Multilingual Speech Retrieval

Page 56

Audio Retrieval

[Figure: a query, given as the IPA sequence “k r u: d p r ai s ^ z” or the spoken phrase “CRUDE PRICES”, is searched against thousands of audio files to return the top N file IDs]

What’s the “optimal” model set?

Language-specific phones/words

Language-independent phones

General sub-word units

Size of inventory - using only the frequent symbols is better

Data driven - selected by clustering tree

Page 57

Automatic Speech Retrieval in an Unknown Language
FSM-based fuzzy matching and retrieval

[Figure: the retrieval system diagram (Audio Archive, Feature Extraction on Short-time Windows, Spectral Features, Speech Recognition, Lattices, FSM-based Indexing, Group Indices, FSM-based Query Construction with empirical and knowledge-based phone confusion, FSM-based Retrieval, Retrieval Results) applied to speech and queries in an unknown language. Candidate units: word, triphone, phone, other acoustic units. Open question: how should speech and queries in an unknown language be represented?]

Page 58

Automatic Speech Retrieval in an Unknown Language

Modeled as a special case (below) of the cognitive process called assimilation (above).

[Figure: English, Russian, and Spanish phones form language-dependent (LD) phones with LD models; these are grouped into language-independent (LID) phone clusters based on pairwise KL divergence; speech in an unknown language is represented as LID phone lattices. Open question: accommodation (introducing new models)?]
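Grouping language-dependent phone models by pairwise KL divergence can be illustrated with a toy example. The sketch assumes each phone model is a single one-dimensional Gaussian `(mean, variance)`, uses the symmetrized KL divergence, and merges any two models whose divergence falls below a threshold (single-link grouping). The real models are HMMs with Gaussian mixtures and the actual clustering procedure is not specified on the slide, so all of these choices, and the example model values, are assumptions.

```python
import math
from typing import Dict, List, Tuple

Gaussian = Tuple[float, float]  # (mean, variance)


def kl_gauss(m1: float, v1: float, m2: float, v2: float) -> float:
    """KL(N(m1, v1) || N(m2, v2)) for one-dimensional Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def sym_kl(a: Gaussian, b: Gaussian) -> float:
    """Symmetrized KL divergence between two models."""
    return kl_gauss(a[0], a[1], b[0], b[1]) + kl_gauss(b[0], b[1], a[0], a[1])


def cluster_phones(models: Dict[str, Gaussian], threshold: float) -> List[List[str]]:
    """Merge phone models whose pairwise symmetric KL is below the threshold."""
    names = sorted(models)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if sym_kl(models[a], models[b]) < threshold:
                parent[find(b)] = find(a)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())
```

With two nearly identical /z/ models from different languages and a distant /j/ model, the /z/ models share a cluster and /j/ stays alone, mirroring the observation that language-dependent versions of one IPA symbol may or may not end up together.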

Page 59

Multilingual Subword Units in an Unknown Language

• Language-dependent versions of the same IPA symbols can end up:
– in one cluster, e.g., /z/ or /t∫/
– in different clusters, e.g., /j/ or /I/
• Different IPAs may be similar

Clustering gives a more compact phone set, and performs better.

[Figure: results for retrieval with audio query and retrieval with IPA sequence query; phone recognition on Croatian, with Croatian used for training vs. Croatian unknown]

Page 60

• Thank you.