Introduction to Speech Recognition
Steve Renals & Hiroshi Shimodaira
Automatic Speech Recognition (ASR), Lecture 1, 11 January 2016
Page 1:

Introduction to Speech Recognition

Steve Renals & Hiroshi Shimodaira

Automatic Speech Recognition — ASR Lecture 1, 11 January 2016


Page 2:

Automatic Speech Recognition — ASR

Course details

About 18 lectures, plus a couple of extra lectures giving a basic introduction to neural networks

Coursework:

Lab: build an ASR system using HTK (20%)
Literature review (10%)

Exam in April or May (worth 70%)

Books and papers:

Jurafsky & Martin (2008), Speech and Language Processing, Pearson Education (2nd edition). (J&M)
Some review and tutorial articles, readings for specific topics

If you haven’t taken Speech Processing (SP): Read J&M, chapter 7 (Phonetics), and look at Simon King’s SP course material (mail [email protected] to get access)

http://www.inf.ed.ac.uk/teaching/courses/asr/

Page 3:

Automatic Speech Recognition — ASR

Course content

Introduction to statistical speech recognition

The basics

Speech signal processing
Acoustic modelling with HMMs using Gaussian mixture models and neural networks
Pronunciations and language models
Search

Advanced topics:

Adaptation
Neural network language models
“HMM-free” speech recognition


Page 5:

Overview

Introduction to Speech Recognition

Today:

Overview
Statistical Speech Recognition
Hidden Markov Models (HMMs)


Page 6:

What is ASR?

Speech-to-text transcription

Transform recorded audio into a sequence of words

Just the words, no meaning....

But: “Will the new display recognise speech?” or “Will the nudist play wreck a nice beach?”

Speaker diarization: Who spoke when?

Speech recognition: what did they say?

Paralinguistic aspects: how did they say it? (timing, intonation, voice quality)


Page 7:

Why is speech recognition difficult?


Page 8:

Variability in speech recognition

Several sources of variation

Size: Number of word types in vocabulary, perplexity

Speaker: Tuned for a particular speaker, or speaker-independent? Adaptation to speaker characteristics and accent

Acoustic environment: Noise, competing speakers, channel conditions (microphone, phone line, room acoustics)

Style: Continuously spoken or isolated? Planned monologue or spontaneous conversation?


Page 12:

Spontaneous vs. Planned

“Oh [laughter] he he used to be pretty crazy but I think now that he’s kind of gotten his act together now that he’s mentally uh sharp he he doesn’t go in for that anymore.”

[Figure: spectrograms of this utterance spoken in dictated, imitated, and spontaneous styles]


Page 16:

Linguistic Knowledge or Machine Learning?

Intense effort needed to derive and encode linguistic rules that cover all the language

Very difficult to take account of the variability of spoken language with such approaches

Data-driven machine learning: Construct simple models of speech which can be learned from large amounts of data (thousands of hours of speech recordings)


Page 17:

Statistical Speech Recognition

Thomas Bayes (1701-1761)

A. A. Markov (1856-1922)
Claude Shannon (1916-2001)


Page 18:

Fundamental Equation of Statistical Speech Recognition

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by

    W* = argmax_W P(W | X)

Applying Bayes’ theorem:

    P(W | X) = p(X | W) P(W) / p(X)
             ∝ p(X | W) P(W)

    W* = argmax_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.

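The decision rule above can be made concrete with a toy sketch, worked in log probabilities. The two candidate transcriptions reuse the “wreck a nice beach” example; the acoustic-model and language-model scores are invented numbers (a real system would obtain log p(X|W) from an acoustic model and log P(W) from a language model), chosen so that the language model resolves the acoustic ambiguity.

```python
# Toy illustration of W* = argmax_W p(X|W) P(W), in log probabilities.
# Each candidate maps to (log p(X|W), log P(W)); the values are invented.
candidates = {
    "will the new display recognise speech": (-310.2, -18.5),
    "will the nudist play wreck a nice beach": (-309.8, -27.1),
}

def decode(cands):
    # argmax of log p(X|W) + log P(W) equals argmax of p(X|W) P(W)
    return max(cands, key=lambda w: cands[w][0] + cands[w][1])

print(decode(candidates))  # language model score favours the first candidate
```

Although the second candidate has the slightly better acoustic score here, its much worse language-model score means the first candidate wins the combined argmax.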

Page 20:

Statistical speech recognition

Statistical models offer a statistical “guarantee” — typical commercial speech recognition license conditions:

Licensee understands that speech recognition is a statistical process and that recognition errors are inherent in the process. Licensee acknowledges that it is licensee’s responsibility to correct recognition errors before using the results of the recognition.


Page 22:

Statistical Speech Recognition

[Figure: system architecture. Recorded speech passes through signal analysis into the search space, where an acoustic model (a hidden Markov model), a lexicon, and a language model (an n-gram model), all estimated from training data, combine to produce the decoded text (transcription).]

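The n-gram language model labelled in the diagram can be sketched in a few lines. Below is a bigram (n=2) model with maximum-likelihood estimates; the tiny training text is invented for illustration, and a real model would be trained on large text corpora and would smooth the counts to handle unseen word pairs.

```python
from collections import Counter

# Invented toy training text for a bigram language model
text = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(text, text[1:]))   # counts of (w_prev, w) pairs
unigrams = Counter(text[:-1])            # counts of contexts w_prev

def p_bigram(w_prev, w):
    # Maximum-likelihood estimate: P(w | w_prev) = c(w_prev, w) / c(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_sequence(words):
    # P(w1..wn) approximated as a product of bigram probabilities
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

print(p_bigram("the", "cat"))  # 2 of the 3 "the" contexts are followed by "cat"
```

Multiplying these conditional probabilities along a word sequence gives the P(W) term in the fundamental equation.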

Page 24:

Hierarchical modelling of speech

[Figure: generative model hierarchy for the utterance “No right”. Utterance: “No right” → Word: NO RIGHT → Subword (phones): n oh r ai t → HMM states → Acoustics.]

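The generative view in the figure can be sketched as a toy HMM that emits an observation sequence state by state. The three-state left-to-right topology, state names, and all probabilities below are invented for illustration, not taken from the lecture; real acoustic models emit continuous feature vectors rather than labels.

```python
import random

random.seed(0)

# Hypothetical 3-state left-to-right HMM (e.g. one phone model).
transitions = {  # P(next state | state); "end" leaves the model
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s2": 0.6, "s3": 0.4},
    "s3": {"s3": 0.6, "end": 0.4},
}
emissions = {  # each state emits a discrete "observation" label
    "s1": ["onset"], "s2": ["steady"], "s3": ["offset"],
}

def sample():
    """Generate an observation sequence by walking the Markov chain."""
    state, obs = "s1", []
    while state != "end":
        obs.append(random.choice(emissions[state]))
        r, acc = random.random(), 0.0
        for nxt, p in transitions[state].items():
            acc += p
            if r < acc:
                state = nxt
                break
    return obs

print(sample())
```

Because the topology is left-to-right, every generated sequence starts in s1 and must pass through s3 before exiting, so durations vary but the state order does not; this is the sense in which the HMM is a generative model of the acoustics.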

Page 25:

Neural networks and deep learning


Page 28:

Neural networks and ASR (the course)

Big advances in speech recognition over the past 3–4 years

Neural networks are state-of-the-art for acoustic modelling

Although (as we shall see) this does not mean throwing away HMMs...

Neural networks in ASR (the course): Need to have a basic understanding of feed-forward neural networks for classification

If you have done MLP, MLPR, or learned about this from somewhere else: great!
If not: a couple of extra informal lectures about neural networks assuming no prior knowledge:

Monday 18 January, 17:00 (FH-3.D01)
Thursday 28 January, 17:00 (FH-3.D01)


Page 29:

Data

The statistical framework is based on learning from data

Standard corpora with agreed evaluation protocols are very important for the development of the ASR field

TIMIT corpus (1986): first widely used corpus, still in use

Utterances from 630 North American speakers
Phonetically transcribed, time-aligned
Standard training and test sets, agreed evaluation metric (phone error rate)

Many standard corpora released since TIMIT: DARPA Resource Management, read newspaper text (eg Wall St Journal), human-computer dialogues (eg ATIS), broadcast news (eg Hub4), conversational telephone speech (eg Switchboard), multiparty meetings (eg AMI)

Corpora have real value when closely linked to evaluation benchmark tests (with new test data from the same domain)


Page 32:

Evaluation

How accurate is a speech recognizer?

Use dynamic programming to align the ASR output with a reference transcription

Three types of error: insertion, deletion, substitution

Word error rate (WER) sums the three types of error. If there are N words in the reference transcript, and the ASR output has S substitutions, D deletions and I insertions, then:

    WER = 100 · (S + D + I) / N %        Accuracy = (100 − WER) %

Speech recognition evaluations: common training and development data, release of new test sets on which different systems may be evaluated using word error rate

NIST evaluations enabled an objective assessment of ASR research, leading to consistent improvements in accuracy

May have encouraged incremental approaches at the cost of subduing innovation (“Towards increasing speech recognition error rates”)

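The WER calculation above can be sketched directly: the dynamic-programming alignment is a Levenshtein edit distance over words, whose minimum cost jointly counts substitutions, deletions and insertions. The example reuses the “wreck a nice beach” sentence pair from earlier in the lecture.

```python
def wer(reference, hypothesis):
    """Word error rate: WER = 100 * (S + D + I) / N, via edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum word edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("will the new display recognise speech",
          "will the nudist play wreck a nice beach"))
# → 100.0 (4 substitutions + 2 insertions over 6 reference words)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than a bounded accuracy score.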

Page 36:

Next Lecture

[Figure: the system architecture diagram again: recorded speech, signal analysis, search space (acoustic model, lexicon, language model), training data, decoded text (transcription)]


Page 37:

Reading

Jurafsky and Martin (2008). Speech and Language Processing(2nd ed.): Chapter 9 to end of sec 9.3.

Renals and Hain (2010). “Speech Recognition”, Computational Linguistics and Natural Language Processing Handbook, Clark, Fox and Lappin (eds.), Blackwells. (on website)
