7- Speech Recognition

7-Speech Recognition

Speech Recognition Concepts
Speech Recognition Approaches
Recognition Theories
Bayes Rule
Simple Language Model
P(A|W) Network Types

7-Speech Recognition (Cont’d)

HMM Calculating Approaches
Neural Components
Three Basic HMM Problems
Viterbi Algorithm
State Duration Modeling
Training In HMM

Recognition Tasks

Isolated Word Recognition (IWR)
Connected Word (CW) and Continuous Speech Recognition (CSR)
Speaker Dependent, Multiple Speaker, and Speaker Independent
Vocabulary Size
– Small: <20
– Medium: >100, <1000
– Large: >1000, <10000
– Very Large: >10000

Speech Recognition Concepts

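[Figure: block diagram linking Text and Speech through NLP and Speech Processing blocks, with a Phone Sequence as the intermediate representation. The paths are labeled Speech Synthesis, Speech Recognition, and Speech Understanding.]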

Speech recognition is the inverse of speech synthesis.

Speech Recognition Approaches

Bottom-Up Approach
Top-Down Approach
Blackboard Approach

Bottom-Up Approach

[Figure: bottom-up pipeline. Processing stages: Signal Processing, Feature Extraction, Segmentation, leading to the Recognized Utterance. Knowledge sources: Voiced/Unvoiced/Silence, Sound Classification Rules, Phonotactic Rules, Lexical Access, Language Model.]

Top-Down Approach

[Figure: top-down system. Blocks: Feature Analysis, Unit Matching System, Lexical Hypothesis, Syntactic Hypothesis, Semantic Hypothesis, Utterance Verifier/Matcher, leading to the Recognized Utterance. Knowledge sources: Inventory of speech recognition units, Word Dictionary, Grammar, Task Model.]

Blackboard Approach

[Figure: a central Blackboard shared by Environmental Processes, Acoustic Processes, Lexical Processes, Syntactic Processes, and Semantic Processes.]

Recognition Theories

Articulatory Based Recognition
– Uses the articulatory system for recognition
– This theory has been the most successful so far
Auditory Based Recognition
– Uses the auditory system for recognition
Hybrid Based Recognition
– A hybrid of the two theories above
Motor Theory
– Models the intended gesture of the speaker

Recognition Problem

We have a sequence of acoustic symbols and want to find the words that the speaker uttered.

Solution: find the most probable word sequence given the acoustic symbols.

Recognition Problem

A : Acoustic Symbols
W : Word Sequence

We should find \hat{W} so that
\[ P(\hat{W} \mid A) = \max_{W} P(W \mid A) \]

Bayes Rule

\[ P(x \mid y)\, P(y) = P(x, y) \]

\[ P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)} \]

\[ P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)} \]

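Bayes Rule (Cont’d)

\[ P(\hat{W} \mid A) = \max_{W} P(W \mid A) = \max_{W} \frac{P(A \mid W)\, P(W)}{P(A)} \]

P(A) does not depend on W, so it can be dropped from the maximization:

\[ \hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W) \]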
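
The decision rule above can be sketched in a few lines. This is only an illustration: the candidate sentences and their probabilities are made-up values, and a real recognizer searches a much larger hypothesis space rather than an explicit list. Log probabilities are used so that the product P(A|W)P(W) becomes a sum.

```python
import math

def decode(acoustic_logprob, lm_logprob, candidates):
    """Return the candidate W that maximizes log P(A|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# Hypothetical scores for one fixed acoustic observation A.
acoustic_logprob = {
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.025),
}
lm_logprob = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(acoustic_logprob, lm_logprob, list(acoustic_logprob))
print(best)
```

In this toy case the second candidate fits the acoustics slightly better, but the language model term P(W) makes the first one win.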

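Simple Language Model

\[ W = w_1 w_2 w_3 \ldots w_n \]

\[
\begin{aligned}
P(W) &= \prod_{i=1}^{n} P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_1) \\
     &= P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1)\, P(w_4 \mid w_3, w_2, w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_1) \\
     &= P(w_n, w_{n-1}, w_{n-2}, \ldots, w_1)
\end{aligned}
\]

Computing this probability is very difficult and requires a very large database, so trigram and bigram models are used instead.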

Simple Language Model (Cont’d)

Trigram :
\[ P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}, w_{i-2}) \]

Bigram :
\[ P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \]

Monogram :
\[ P(W) = \prod_{i=1}^{n} P(w_i) \]
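
As a minimal sketch of the bigram case, the code below estimates each P(w_i | w_{i-1}) by relative frequency on a tiny made-up corpus and multiplies the factors. The start token "<s>" standing in for the missing w_0, the corpus, and the function names are assumptions for illustration. No smoothing is applied, so any unseen bigram gives probability zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and bigrams, padding each sentence with a start token."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words
        contexts.update(padded[:-1])                 # every w_{i-1} that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))
    return contexts, bigrams

def bigram_prob(words, contexts, bigrams):
    """P(W) = product over i of P(w_i | w_{i-1}), relative-frequency estimates."""
    p = 1.0
    padded = ["<s>"] + words
    for prev, cur in zip(padded[:-1], padded[1:]):
        if contexts[prev] == 0:
            return 0.0                               # context never observed
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p

corpus = [["open", "the", "door"], ["open", "the", "window"], ["close", "the", "door"]]
contexts, bigrams = train_bigram(corpus)
p = bigram_prob(["open", "the", "door"], contexts, bigrams)
print(p)
```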

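Simple Language Model (Cont’d)

\[ P(w_3 \mid w_1, w_2) \]

Computing Method :
\[ P(w_3 \mid w_1, w_2) = \frac{\text{number of occurrences of } w_3 \text{ after } w_1 w_2}{\text{total number of occurrences of } w_1 w_2} \]

AdHoc Method :
\[ P(w_3 \mid w_1, w_2) = \lambda_1 f(w_3 \mid w_1, w_2) + \lambda_2 f(w_3 \mid w_2) + \lambda_3 f(w_3) \]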
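
A small sketch of the ad hoc (interpolated) estimate follows, taking f to be relative frequency counted on a toy word sequence. The weights (0.6, 0.3, 0.1) are arbitrary illustrative values that sum to one, not values from the slides, and the function name is hypothetical.

```python
from collections import Counter

def interpolated_trigram(w1, w2, w3, words, lambdas=(0.6, 0.3, 0.1)):
    """Estimate P(w3 | w1, w2) as l1*f(w3|w1,w2) + l2*f(w3|w2) + l3*f(w3)."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))

    def ratio(num, den):
        return num / den if den else 0.0

    # Relative frequencies; a context at the very end of the text is counted
    # even though nothing follows it, which is a small boundary approximation.
    f_tri = ratio(tri[(w1, w2, w3)], bi[(w1, w2)])
    f_bi = ratio(bi[(w2, w3)], uni[w2])
    f_uni = ratio(uni[w3], len(words))

    l1, l2, l3 = lambdas
    return l1 * f_tri + l2 * f_bi + l3 * f_uni

text = "a b c a b d a b c".split()
p = interpolated_trigram("a", "b", "c", text)
print(p)
```

The lower-order terms keep the estimate above zero for a trigram that was never seen, as long as the word itself occurs in the training text.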

Error Production Factor

Prosody (recognition should be prosody independent)
Noise (noise should be prevented)
Spontaneous Speech

P(A|W) Computing Approaches

Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems

Dynamic Time Warping

Search Limitation :
- First & End Interval
- Global Limitation
- Local Limitation

Global Limitation : [figure not recovered]

Local Limitation : [figure not recovered]
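
The three search limitations can be illustrated with a minimal DTW sketch. The slide figures that define the actual constraints did not survive, so the choices here are assumptions: the path is pinned to the first and last frames, the global limitation is a band |i - j| <= window, and the local limitation allows the three usual predecessor cells. Sequences are 1-D for brevity; real systems compare feature vectors.

```python
def dtw_distance(ref, test, window=None):
    """Accumulated DTW cost between two 1-D sequences.

    Endpoint limitation: the path runs from the first to the last frame of both.
    Global limitation: cells with |i - j| > window are excluded (if window is set).
    Local limitation: a cell is reached from (i-1, j), (i, j-1) or (i-1, j-1).
    """
    n, m = len(ref), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if window is not None and abs(i - j) > window:
                continue                      # outside the allowed band
            d = abs(ref[i - 1] - test[j - 1])  # local frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([1, 2, 3], [1, 2, 2, 3], window=0))
```

With window=0 only the diagonal is allowed, so sequences of different length cannot be aligned and the cost stays infinite.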

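Artificial Neural Network

[Figure: inputs x_0, x_1, ..., x_{N-1} with weights w_0, w_1, ..., w_{N-1} feeding a single unit with output y.]

\[ y = \varphi\left( \sum_{i=0}^{N-1} w_i x_i \right) \]

Simple Computation Element of a Neural Network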
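
The computation element amounts to a weighted sum followed by a nonlinearity. The sketch below assumes a hard threshold or tanh as the activation; the slide does not say which function is used, so both are illustrative choices.

```python
import math

def step(s):
    """Hard-threshold activation (an illustrative choice)."""
    return 1.0 if s >= 0 else 0.0

def neuron(x, w, activation=math.tanh):
    """y = activation(sum_i w_i * x_i) for inputs x and weights w of equal length."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return activation(sum(wi * xi for wi, xi in zip(w, x)))

y = neuron([1.0, 0.5, -1.0], [0.2, 0.4, 0.1], activation=step)
print(y)
```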

Artificial Neural Network (Cont’d)

Neural Network Types
– Perceptron
– Time Delay
– Time Delay Neural Network Computational Element (TDNN)

[Figure: Single Layer Perceptron with inputs x_0, ..., x_{N-1} and outputs y_0, ..., y_{M-1}.]

[Figure: Three Layer Perceptron.]

2.5.4.2 Neural Network Topologies

TDNN

2.5.4.6 Neural Network Structures for Speech Recognition

2.5.4.6 Neural Network Structures for Speech Recognition

Hybrid Methods

Hybrid Neural Network and Matched Filter for Recognition

[Figure: Speech, Acoustic Features, and Delays feeding a Pattern Classifier with Output Units.]

Neural Network Properties

The system is simple, but training needs many iterations
It does not impose a specific structure
Despite its simplicity, the results are good
The training set is large, so training should be done offline
Accuracy is relatively good

Pre-processing

Different preprocessing techniques are employed as the front end for speech recognition systems

The choice of preprocessing method is based on the task, the noise level, the modeling tool, etc.

The MFCC Method

The MFCC method is based on how the human ear perceives sounds.

Compared with other features, MFCC performs better in noisy environments.

MFCC was originally proposed for speech recognition applications, but it also performs well in speaker recognition.

The unit of human auditory perception is the Mel, which is obtained from the following relation: [equation not recovered]
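
The relation on the original slide is not available. As a hedged stand-in, the sketch below uses one widely used form of the Mel mapping, mel = 2595 log10(1 + f/700); other variants exist, and this may not be the exact formula the slide showed.

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to Mel (common 2595 * log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (100.0, 1000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1))
```

The mapping is roughly linear at low frequencies and logarithmic at high frequencies, and 1000 Hz lands at about 1000 Mel.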

Steps of the MFCC Method

Step 1: Map the signal from the time domain to the frequency domain using the short-time FFT.

Z(n) : speech signal
W(n) : window function, such as the Hamming window
\[ W_F = e^{-j 2\pi / F} \]
m : 0, …, F – 1
F : speech frame length
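
Step 1 can be sketched for a single frame as below. The transform equation itself is missing from the slide, so the form X(m) = sum over n of Z(n) W(n) W_F^(nm) is an assumption built from the symbols the slide defines. A direct DFT is used instead of an FFT to keep the example dependency-free; the result is the same, only slower.

```python
import cmath
import math

def hamming(F):
    """Hamming window W(n), n = 0..F-1 (requires F > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (F - 1)) for n in range(F)]

def frame_power_spectrum(frame):
    """Return |X(m)|^2 for m = 0..F-1, with X(m) = sum_n Z(n) W(n) W_F^(n*m)."""
    F = len(frame)
    w = hamming(F)
    W_F = cmath.exp(-2j * math.pi / F)
    spectrum = []
    for m in range(F):
        X_m = sum(frame[n] * w[n] * W_F ** (n * m) for n in range(F))
        spectrum.append(abs(X_m) ** 2)
    return spectrum

# Usage: a pure tone with exactly 4 cycles per frame peaks at bin m = 4.
F = 32
tone = [math.cos(2 * math.pi * 4 * n / F) for n in range(F)]
spec = frame_power_spectrum(tone)
print(max(range(F // 2), key=lambda m: spec[m]))
```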

Step 2: Find the energy of each filter-bank channel.

M is the number of filters in the Mel-scale filter bank.

W_k(j), k = 0, 1, ..., M-1, is the transfer function of the k-th filter of the filter bank.
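
A hedged sketch of Step 2 follows. It assumes triangular filters spaced evenly on the Mel axis and channel energies E_k = sum over j of W_k(j) |X(j)|^2; the slide's own equation and filter shapes are not available, and the Mel mapping used is the common 2595 log10(1 + f/700) form.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, n_bins, sample_rate):
    """Return M triangular filters W_k(j), j = 0..n_bins-1, evenly spaced in Mel.

    The n_bins spectrum bins are assumed to cover 0 Hz up to the Nyquist frequency.
    """
    nyquist = sample_rate / 2.0
    # M + 2 edge frequencies: filter k spans edges k, k+1 (peak) and k+2.
    edges = [mel_to_hz(hz_to_mel(nyquist) * i / (M + 1)) for i in range(M + 2)]
    bin_hz = [nyquist * j / (n_bins - 1) for j in range(n_bins)]
    bank = []
    for k in range(M):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))    # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def channel_energies(power_spectrum, bank):
    """E_k = sum_j W_k(j) * |X(j)|^2 for each channel k."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]

bank = mel_filterbank(M=10, n_bins=129, sample_rate=8000)
energies = channel_energies([1.0] * 129, bank)
print([round(e, 2) for e in energies])
```

On a flat spectrum the energies grow with the channel index, because filters that are equally wide in Mel become wider in Hz at higher frequencies.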

Filter distribution based on the Mel scale

[Figure not recovered]

Step 4: Compress the spectrum and apply the DCT to obtain the MFCC coefficients.

In the relation above, n = 0, ..., L is the order of the MFCC coefficients.
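
Step 4 can be sketched as a logarithm (the compression) followed by a DCT of the channel energies. The relation the slide refers to is missing, so the unnormalized DCT-II below is an assumed form; scaling conventions differ between toolkits.

```python
import math

def mfcc_from_energies(energies, L):
    """c_n = sum_k log(E_k) * cos(pi * n * (k + 0.5) / M), for n = 0..L."""
    M = len(energies)
    # Floor the energies to avoid log(0) on silent channels.
    log_e = [math.log(max(e, 1e-12)) for e in energies]
    return [
        sum(log_e[k] * math.cos(math.pi * n * (k + 0.5) / M) for k in range(M))
        for n in range(L + 1)
    ]

# Usage: a flat log-spectrum puts everything into c_0.
coeffs = mfcc_from_energies([math.e] * 8, L=4)
print([round(c, 6) for c in coeffs])
```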

The Mel-Cepstrum Method

[Figure: block diagram. Time-domain signal, Framing, |FFT|^2, Mel-scaling, Logarithm, IDCT, Cepstra. The Low-order coefficients feed a Differentiator that yields the Delta & Delta Delta Cepstra.]
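
The Differentiator block can be approximated with the usual regression formula over neighbouring frames. The window width of 2 and the edge handling (repeating the first and last frame) are assumptions, since the diagram does not specify them. Applying the same function to its own output gives the delta-delta features.

```python
def deltas(frames, width=2):
    """Regression-based time derivative of a feature sequence.

    frames: list of equal-length coefficient vectors, one per time frame.
    """
    T = len(frames)
    denom = 2 * sum(d * d for d in range(1, width + 1))
    out = []
    for t in range(T):
        vec = []
        for c in range(len(frames[t])):
            num = 0.0
            for d in range(1, width + 1):
                nxt = frames[min(t + d, T - 1)][c]   # repeat the last frame at the end
                prv = frames[max(t - d, 0)][c]       # repeat the first frame at the start
                num += d * (nxt - prv)
            vec.append(num / denom)
        out.append(vec)
    return out

cepstra = [[float(t)] for t in range(6)]   # one coefficient rising by 1 per frame
delta = deltas(cepstra)
delta_delta = deltas(delta)
print(delta)
```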

Mel-Cepstrum Coefficients (MFCC)

[Figure not recovered]

Properties of the Mel-Cepstrum (MFCC)

Maps the Mel filter-bank energies onto the directions in which their variance is maximal (using the DCT)
Makes the speech features partially, though not fully, independent of one another (an effect of the DCT)
Good performance in clean environments
Reduced performance in noisy environments

Time-Frequency analysis

Short-term Fourier Transform
– Standard way of frequency analysis: decompose the incoming signal into its constituent frequency components.
– W(n): windowing function
– N: frame length
– p: step size
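
The roles of N and p can be shown with a minimal framing sketch: frames of N samples are taken every p samples, so consecutive frames overlap by N - p samples when p < N. Dropping the incomplete tail frame is an assumption; some front ends pad it instead. Each frame would then be multiplied by W(n) and transformed.

```python
def split_frames(x, N, p):
    """Cut signal x into frames of length N taken every p samples."""
    return [x[s:s + N] for s in range(0, len(x) - N + 1, p)]

signal = list(range(10))
frames = split_frames(signal, N=4, p=2)
print(frames)
```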

Critical band integration

Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise

Frequency components within a critical band are not resolved. The auditory system interprets the signals within a critical band as a whole

Bark scale

[Figure not recovered]
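
Since the slide content for the Bark scale is missing, the sketch below uses one published approximation of the Hz-to-Bark mapping (Zwicker and Terhardt). Other approximations exist, and this may not be the one the slide showed.

```python
import math

def hz_to_bark(f_hz):
    """Critical-band rate in Bark (Zwicker and Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_bark(f), 2))
```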

Feature orthogonalization

Spectral values in adjacent frequency channels are highly correlated
The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated
Decorrelation is useful to improve the parameter estimation.
