18.0 Some Recent Developments in NTU
Reference:
1. “Segmental Eigenvoice with Delicate Eigenspace for Improved Speaker Adaptation”, IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, May 2005, pp. 399-411.
2. “Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada, May 2004, pp. 197-200.
3. “Extension and Further Analysis of Higher Order Cepstral Moment Normalization (HOCMN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
4. “Powered Cepstral Normalization (P-CN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
5. “Improved Spontaneous Mandarin Speech Recognition by Disfluency Interruption Point (IP) Detection Using Prosodic Features”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp. 1621-1624.
6. “Prosodic Modeling in Large Vocabulary Mandarin Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
7. “Latent Prosodic Modeling (LPM) for Speech with Applications in Recognizing Spontaneous Mandarin Speech with Disfluencies”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
8. “Entropy-based Feature Parameter Weighting for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
9. “A New Framework for System Combination Based on Integrated Hypothesis Space”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
10. “Improved Spoken Document Summarization Using Probabilistic Latent Semantic Analysis (PLSA)”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
11. “Analytical Comparison between Position Specific Posterior Lattices and Confusion Networks Based on Words and Subword Units for Spoken Document Indexing”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.
12. “A Multi-Modal Dialogue System for Information Navigation and Retrieval across Spoken Document Archives with Topic Hierarchies”, IEEE Automatic Speech Recognition and Understanding Workshop, San Juan, Nov.-Dec. 2005.
13. “Efficient Interactive Retrieval of Spoken Documents with Key Terms Ranked by Reinforcement Learning”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
14. “Type-Ⅱ Dialogue Systems for Information Access from Unstructured Knowledge Sources”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.
15. “Histogram-Based Quantization (HQ) for Robust and Scalable Distributed Speech Recognition”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp. 957-960.
16. “Joint Uncertainty Decoding (JUD) with Histogram-Based Quantization (HQ) for Robust and/or Distributed Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
Role of Spoken Language Processing under Network Environment
[Figure: user ↔ user interface ↔ Internet ↔ content analysis, with user–content interaction across the network]
• User Interface — when keyboards/mice are inadequate
• Content Analysis — helps in browsing/retrieval of multimedia content
• User-Content Interaction — all text-based interaction can be accomplished by spoken language
Hierarchy of Research Areas
[Figure: hierarchy of research areas, with the numbers 1–15 marking where the references above fit]
• Applications: Multimedia Technologies; Spoken Dialogue; Speech-based Information Retrieval; Dictation & Transcription
• Integrated Technologies: Distributed Speech Recognition and Wireless Environment; Multilingual Speech Processing; Information Indexing & Retrieval; Spoken Document Understanding and Organization; Text-to-Speech Synthesis; Speech/Language Understanding
• Applied Technologies: Decoding & Search Algorithms; Linguistic Processing & Language Modeling; Wireless Transmission & Network Environment; Keyword Spotting; Speech Recognition Core
• Basic Technologies: Robustness (noise/channel, feature/model); Hands-free Interaction (acoustic reception, microphone array, etc.); Speaker Adaptation & Recognition; Acoustic Processing (features, modeling, etc.); Prosodic Modeling; Spontaneous Speech Processing (pronunciation modeling, disfluencies, etc.)
Segmental Eigenvoice
– Decompose the supervectors into sub-supervectors, from which sub-eigenspaces can be constructed; better performance can therefore be obtained with more adaptation data
[Figures: Segmental Eigenvoice (1/3)–(3/3)]
Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition
— to reduce the mismatch between the statistical characteristics of training and testing corpora by normalizing the cepstral moments
Cepstral Moment Normalization
• Moment Estimation — time average as the N-th moment of the MFCC parameters about the origin:
  E[X(n)^N] ≈ (1/T) Σ_{k=0}^{T−1} X(k)^N
• Cepstral Normalization:
  – For odd order L: E[X(n)^L] = 0
  – For even order N: E[X(n)^N] = M_N
• Example: CMS for L = 1
• Example: CMVN for N = 1 and 2
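As a rough sketch of the idea (not the exact update rule from the paper — the iterative shift-and-scale scheme and all names here are illustrative), odd-order moments can be driven toward 0 and even-order moments toward those of a standard Gaussian:

```python
import math
import numpy as np

def hocmn(cepstra, odd_orders=(1, 3), even_orders=(2,), n_iter=5):
    """Approximate higher-order cepstral moment normalization.

    Each odd-order moment of every feature dimension is shifted toward 0,
    and each even-order moment is scaled toward the corresponding moment
    of N(0,1); iterating is needed because each correction slightly
    perturbs the other moments.
    cepstra: (T, D) array of MFCC frames.
    """
    x = np.asarray(cepstra, dtype=float).copy()
    for _ in range(n_iter):
        for L in odd_orders:
            m = (x ** L).mean(axis=0)              # E[X^L], per dimension
            x -= np.sign(m) * np.abs(m) ** (1.0 / L)
        for N in even_orders:
            # N-th moment of N(0,1): (N-1)!! = N! / (2^(N/2) * (N/2)!)
            target = math.factorial(N) / (2 ** (N // 2) * math.factorial(N // 2))
            m = (x ** N).mean(axis=0)              # E[X^N], per dimension
            x /= (m / target) ** (1.0 / N)
    return x
```

With `odd_orders=(1,)` and `even_orders=(2,)` this reduces exactly to CMS followed by variance normalization, i.e. CMVN, matching the two examples above.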
Higher Order Cepstral Moment Normalization (HOCMN)
• Aurora 2, clean-condition training; word accuracy averaged over 0–20 dB and all types of noise (sets A, B, C)
[Figure: accuracy comparison of CN and CMVN (each also with l = 86) against CTN = HOCMN[1,2,3] and CTN = HOCMN[1,3,2]]
Skewness and Kurtosis (1)
• Skewness
  – Third moment about the mean, normalized to the standard deviation
  – Departure of the pdf from symmetry
    • Positive/negative indicates skew to right/left
    • Zero indicates symmetric
• Kurtosis
  – Fourth moment about the mean, normalized to the standard deviation
  – Peaked, or “flat with tails of large size”, as compared to a standard Gaussian
    • “3” is the fourth moment of N(0,1)
    • Above/below 3 indicates more peaked with heavy tails / flatter
Skewness and Kurtosis (2)
• Define: Generalized Skewness of Odd Order L
  S_L = E[(X − μ)^L] / σ^L,  L an odd integer
  – L not necessarily 3
  – Similar meaning as skewness (skew to right or left), except in the sense of the L-th moment
• Define: Generalized Kurtosis of Even Order N
  K_N = E[(X − μ)^N] / σ^N,  N an even integer
  – N not necessarily 4
  – Similar meaning as kurtosis (peaked or flat), except in the sense of the N-th moment
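A minimal sketch of these two statistics (the function names are mine):

```python
import numpy as np

def generalized_skewness(x, L=3):
    """L-th central moment normalized by sigma**L; L odd.

    Reduces to ordinary skewness for L = 3.
    """
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return ((x - mu) ** L).mean() / sigma ** L

def generalized_kurtosis(x, N=4):
    """N-th central moment normalized by sigma**N; N even.

    Reduces to ordinary kurtosis for N = 4.
    """
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return ((x - mu) ** N).mean() / sigma ** N
```

For a standard Gaussian sample, `generalized_skewness` is near 0 for any odd L, and `generalized_kurtosis(x, 4)` is near 3, the reference value quoted above.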
Skewness and Kurtosis (3)
• Normalizing an odd-order moment constrains the pdf to be symmetric about the origin
  – In the sense of the L-th moment
• Normalizing an even-order moment constrains the pdf to be “equally flat, with tails of equal size”, as compared to a standard Gaussian
  – In the sense of the N-th moment
• The orders of the normalized moments are not necessarily integers
Generalized Moments with Non-integer Orders
• Generalized Moments
  – Type 1: reduces to the odd-order moment when u is an odd integer L (example: L = 1 or 3)
  – Type 2: reduces to the even-order moment when u is an even integer N (example: N = 2 or 4)
  – HOCMN with non-integer moment orders
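The Type 1/Type 2 formulas did not survive extraction; a natural reconstruction consistent with the reduction properties stated above uses sign(x)·|x|^u and |x|^u respectively — treat these definitions as my assumption:

```python
import numpy as np

def generalized_moment_type1(x, u):
    """Type 1 (assumed form): E[sign(X) * |X|**u].

    Equals the ordinary moment E[X**u] whenever u is an odd integer.
    """
    x = np.asarray(x, dtype=float)
    return (np.sign(x) * np.abs(x) ** u).mean()

def generalized_moment_type2(x, u):
    """Type 2 (assumed form): E[|X|**u].

    Equals the ordinary moment E[X**u] whenever u is an even integer.
    """
    x = np.asarray(x, dtype=float)
    return (np.abs(x) ** u).mean()
```

Both are well defined for any real u > 0, which is what allows HOCMN to use non-integer moment orders.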
PDF Analysis
• HEQ
  – Over-fitted to Gaussian; original statistics lost
• HOCMN
  – Fits the generalized skewness and kurtosis of a few orders only
  – Retains more of the original characteristics
[Figure: pdfs of the original C0 & C1 features, after HEQ, and after HOCMN]
Use of Prosody in Recognition and Handling Disfluencies in Spontaneous Speech
— prosody may be useful in recognition, and in particular in handling disfluencies in spontaneous speech
Prosodic Features (Ⅰ) — Pitch-related Features
• The average pitch value within the syllable
• The maximum difference of pitch values within the syllable
• The average of the absolute values of pitch variations within the syllable
• The magnitude of pitch reset at boundaries
• The differences of such feature values across adjacent syllable boundaries (P1−P2, d1−d2, etc.)
• A total of 54 pitch-related features were obtained

Prosodic Features (Ⅱ) — Duration-related Features
[Figure: syllables A, B, C, D, E around two syllable boundaries, with pause durations a and b, between the beginning and end of the utterance]
• Pause duration: b
• Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2
• Average syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C
• Combinations of pause & syllable features (ratio or product): C·b, D·b, C/b, D/b
• Lengthening: C/((A+B)/2)
• Standard deviation of the feature values
• A total of 38 duration-related features were obtained
Recognition Framework with Prosodic Modeling
• Rescoring Formula:
  S(W) = log P(X|W) + λl·log P(W) + λp·log P(F|W)
  – λl, λp: weighting coefficients; P(F|W) is given by the prosodic model
• Two-pass Recognition
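In the second pass, this rescoring formula re-ranks the first-pass hypotheses; a sketch with made-up scores and weights:

```python
def rescore(acoustic_lp, lm_lp, prosody_lp, lam_l=0.8, lam_p=0.3):
    """S(W) = log P(X|W) + lam_l * log P(W) + lam_p * log P(F|W)."""
    return acoustic_lp + lam_l * lm_lp + lam_p * prosody_lp

# N-best list: (hypothesis, log P(X|W), log P(W), log P(F|W)) -- illustrative numbers
hyps = [
    ("hyp A", -120.0, -30.0, -12.0),
    ("hyp B", -118.0, -35.0, -15.0),
]
best = max(hyps, key=lambda h: rescore(h[1], h[2], h[3]))
```

Here "hyp B" wins on the acoustic score alone, but the language-model and prosodic terms flip the decision to "hyp A"; tuning λl and λp on held-out data is what makes the combination useful.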
Prosodic Feature Extraction from Paths in the Word Graph
• PSPL: locate a word in a segment according to the order of the word in a path
• CN: cluster several words into a segment according to similar time spans and word pronunciations
OOV/Rare Word Problem
• OOV word W = w1w2w3w4 and a lattice L of document D
  – wi: subword units
• W never appears in L
  – D can never be found under word-based PSPL
• But W = w1w2w3w4 is hidden in L at the subword level
• Subword-based PSPL (S-PSPL)
[Figure: word lattice L over a time index; arcs such as “a·w1w2”, “w1w2”, “w2w3”, “w3w4·b”, “w3w4·bcd”, “w3w4·e” contain w1w2w3w4 only as a subword sequence spanning several arcs]
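A toy sketch of this idea (one lattice path per entry, no posterior weighting; all names are illustrative): index every subword unit by its position along a path, so that a query absent at the word level can still be matched as a run of consecutive subword positions.

```python
from collections import defaultdict

def subword_pspl_index(paths):
    """Index: subword unit -> set of positions along a path.

    paths: list of lattice paths; each path is a list of words, and each
    word is a list of its subword units. A stand-in for real
    posterior-weighted S-PSPL position bins.
    """
    index = defaultdict(set)
    for path in paths:
        pos = 0
        for word in path:
            for unit in word:
                index[unit].add(pos)
                pos += 1
    return index

def contains_query(index, query_units):
    """True if the query's subword units occur at consecutive positions."""
    if not all(u in index for u in query_units):
        return False
    return any(all(s + i in index[u] for i, u in enumerate(query_units))
               for s in index[query_units[0]])
```

For a path whose words are ["a"], ["w1","w2"], ["w3","w4","b"], the word w1w2w3w4 never appears as a lattice arc, yet `contains_query` finds it at the subword level — exactly the OOV case above.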
Subword-based PSPL and CN
[Figure: a lattice represented by subword arcs (w1_1, w1_2, …, w10_2) over a time index; the resulting S-PSPL structure lists subword units with probabilities per position bin (cluster 1, cluster 2, …, cluster 8), and the S-CN structure lists subword units with probabilities per cluster]
Performance Comparison
[Figure: MAP (0.54–0.89) versus index size (0–20 MB) for PSPL and CN indexing based on words, characters, and syllables]
Interactive Retrieval of Spoken Documents by Topic Hierarchy
• Interactive Process between User and Content for Spoken Document Retrieval
• Given the User’s Initial Query, the Extracted Key Terms can be Many, even in a Hierarchy
  – Ranking the key terms will be helpful for efficient retrieval
[Figure: the user interacts through multi-modal dialogue (query/instruction) with a retrieval system over a spoken document archive; retrieved documents are organized by a topic hierarchy]
[Figure: key terms ti, tj, tk, tl in the key-term space define dialogue states s1 = [ti], s2 = [ti, tj], s3 = [ti, tk], …, sn = [ti, tj, tl], which map to document groups G1 = C(ti), G2 = C(ti + tj), G3 = C(ti + tk), …, Gn = C(ti + tj + tl) in the archive space]
Query Term Suggestions and Improved Interaction by Dialogue Modeling
• The mapping from key-term states to document groups is defined by some IR function (e.g., PLSA)
• States: s1, s2, s3, …; actions: ti, tj, tk, … (state s1 plus action tj leads to state s2)
[Figure: key-term space, archive space and document space connected by this mapping]
• A State Transition Diagram Generated for Each User Given the Initial Query s1
• User Assumed Satisfied (Double Circles) when Recall Rate = L/|D| > τ0
  – L: number of relevant documents appearing in the top K retrieved documents
  – D: desired document set
  – m(s) = Minimum Number of Steps or Queries to Arrive at the Final State
Learning User’s Behavior in Retrieval by a Large Number of Simulated Users
[Figure: example state transition diagram over states s1–s15, annotated with minimum step counts such as m(s4) = 2, m(s7) = 3, m(s9) = 3, m(s12) = 4, m(s13) = 4, m(s15) = 5]
Goal: to minimize the number of steps to arrive at the final state
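On such a diagram, m(s) is simply a shortest-path quantity; the ranking itself is learned by reinforcement learning, but m(s) can be computed by breadth-first search, as in this sketch over a made-up toy transition graph:

```python
from collections import deque

def min_steps_to_final(graph, start, finals):
    """m(s): minimum number of key-term queries from state `start` to any
    satisfied (final) state, by BFS over the state-transition diagram."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s in finals:
            return dist[s]
        for t in graph.get(s, []):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return None  # no final state reachable

# toy state-transition diagram (illustrative only)
graph = {"s1": ["s2", "s3"], "s2": ["s4"], "s3": ["s4", "s7"], "s4": ["s7"]}
```

Here `min_steps_to_final(graph, "s1", {"s7"})` is 2 (s1 → s3 → s7); averaging such counts over many simulated users gives the quantity the key-term ranking is trained to minimize.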
Type-Ⅰ and Type-Ⅱ Dialogue Systems
• Type-Ⅰ: dialogue over a well-organized database
[Figure: the input speech utterance U goes through ASR (words, lattices) and spoken language understanding (dialogue act classification, semantic frame); the dialogue manager tracks the dialogue state, interprets the user act, and selects a system action over the well-organized database; an output generator returns speech, graphs, and tables]
• Type-Ⅱ: spoken-language-based information access over unstructured knowledge sources
[Figure: an input spoken query q goes through ASR into word/phone lattices (one-best, N-best); a retrieval engine searches an inverted index file built by indexing the ASR output (word/phone lattices, one-best, N-best) of the spoken/multimedia document archive; dialogue modeling maintains an internal state, and the dialogue manager presents related documents and the information obtained through a multi-modal user interface]
Improved Performance by Dialogue Modeling
[Figure: task success rate and average number of key terms needed for successful trials, versus ASR character accuracy in % for queries (74/88/92/100), comparing dialogue modeling, wpq, and tf-idf]
Histogram-based Quantization (HQ) for Robust Distributed Speech Recognition
– quantization dynamically determined by local statistics, thus automatically absorbing the various disturbances
Distributed Speech Recognition (DSR) and Wireless Environment
• An example partition of the speech recognition processes between client and server
  – encoded feature parameters are transmitted in packets
[Figure: front-end signal processing on the client produces feature vectors, which are sent over the network to the server; there, linguistic decoding and search algorithms use acoustic models (trained from speech corpora), a lexicon (from a lexical knowledge base), and a language model (constructed from text corpora with a grammar) to produce the output sentence]
• Client/server structure
[Figure: multiple clients connected to servers through the network]
Problems with Conventional Vector Quantization (VQ)
• Conventional VQ (e.g., SVQ) is popularly used in DSR
• Dynamic environmental noise and codebook mismatch jointly degrade the performance of SVQ
  – Noise moves clean speech to another partition cell (X to Y)
  – Mismatch between the fixed VQ codebook and the test data increases distortion
  – Quantization increases the difference between clean and noisy features
Histogram-based Quantization (HQ) (Ⅰ)
– Decision boundaries yi, i = 1, …, N, are dynamically defined by C(y), the cumulative histogram (CDF) of the local data
– Representative values zi, i = 1, …, N, are fixed, obtained through a standard Gaussian: D = {(zi, bi) on the vertical scale, i = 1, …, N} is determined by Lloyd-Max quantization and a standard Gaussian distribution
Histogram-based Quantization (HQ) (Ⅱ)
– With a new histogram C′(y′), the decision boundaries automatically change to (y′i−1, y′i)
– Quantization rule: xt → zi if bi−1 < C′(xt) ≤ bi, or equivalently y′i−1 < xt ≤ y′i, i = 1, 2, …, N
– Decision boundaries are adjusted according to the local statistics, so there is no codebook mismatch problem
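A one-dimensional sketch of this rule (assumptions: uniform-in-probability boundaries bi on the vertical scale, representatives taken from standard-Gaussian quantiles as a stand-in for the Lloyd-Max codebook, and C′ estimated as the empirical CDF of the current utterance):

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gaussian_ppf(p):
    """Invert the standard-normal CDF by bisection (self-contained, no SciPy)."""
    lo, hi = -8.0, 8.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if gaussian_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def hq_quantize(frames, n_levels=8):
    """Histogram-based quantization: x_t -> z_i if b_{i-1} < C'(x_t) <= b_i."""
    x = np.asarray(frames, dtype=float)
    b = np.linspace(0.0, 1.0, n_levels + 1)              # fixed vertical boundaries b_i
    z = np.array([gaussian_ppf((b[i] + b[i + 1]) / 2.0)  # fixed representatives z_i
                  for i in range(n_levels)])
    ranks = x.argsort().argsort()                        # empirical CDF C'(x_t)
    cdf = (ranks + 0.5) / len(x)
    idx = np.searchsorted(b, cdf, side="left") - 1
    return z[np.clip(idx, 0, n_levels - 1)]
```

Because only the ranks of the samples matter, `hq_quantize(x)` and `hq_quantize(2*x + 5)` give identical outputs: any monotonic shift or scaling of the feature axis (e.g. a channel effect) is absorbed, which is the codebook-mismatch robustness argued above.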
Histogram-based Quantization (HQ) (Ⅲ)
• Based on the CDF on the vertical scale and the histogram, hence less sensitive to noise on the horizontal scale
• Disturbances are automatically absorbed into the HQ blocks
[Figure: the dynamic nature of HQ — the hidden codebook {zi}, fixed on the vertical scale, is transformed by the dynamic CDF C(y) into boundaries {yi} that move on the horizontal scale; compared with histogram-based VQ (HVQ)]
[Figure: performance under different types of noise, averaged over all SNR values]