Speech Recognition
Page 1: Speech Recognition
Page 2: Speech Recognition
Page 3: Speech Recognition

Feature extractor

Page 4: Speech Recognition

Feature extractor

Mel-Frequency Cepstral Coefficients (MFCCs)

Feature vectors
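As a minimal sketch of what this stage produces (Sphinx4's own frontend is implemented in Java; here the third-party librosa library and a hypothetical utterance.wav file stand in):

import librosa

# Load audio at 16 kHz, the rate most Sphinx4 acoustic models expect.
signal, sample_rate = librosa.load("utterance.wav", sr=16000)

# 13 coefficients per frame is a typical configuration; the result is
# one feature vector per short frame of speech.
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)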

Page 5: Speech Recognition

Acoustic Observations

Page 6: Speech Recognition

Acoustic Observations

Hidden States

Page 7: Speech Recognition

Acoustic Observations

Hidden States

Acoustic Observation likelihoods
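As a toy illustration of these three pieces, the HMM for a short word such as "five" can be written out as plain Python data. The probabilities and the observation alphabet below are invented for illustration, not taken from any trained model:

# Hidden states: the phones of the word "five".
states = ["f", "ay", "v"]

# Transition probabilities P(next state | current state); self-loops
# let one phone span several acoustic frames.
transition_prob = {
    "f":  {"f": 0.6, "ay": 0.4},
    "ay": {"ay": 0.7, "v": 0.3},
    "v":  {"v": 1.0},
}

# Acoustic observation likelihoods P(observation | state), here over a
# made-up discrete observation alphabet.
emission_prob = {
    "f":  {"O_fric": 0.8, "O_vowel": 0.2},
    "ay": {"O_fric": 0.1, "O_vowel": 0.9},
    "v":  {"O_fric": 0.7, "O_vowel": 0.3},
}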

Page 8: Speech Recognition

“Six”

Page 9: Speech Recognition
Page 10: Speech Recognition

Constructs the HMMs of phones

Produces observation likelihoods

Page 11: Speech Recognition

Constructs the HMMs for units of speech

Produces observation likelihoods

Sampling rate is critical! WSJ vs. WSJ_8k

Page 12: Speech Recognition

Constructs the HMMs for units of speech

Produces observation likelihoods

Sampling rate is critical! WSJ vs. WSJ_8k

TIDIGITS, RM1, AN4, HUB4
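An acoustic model can only score audio at the rate it was trained on, so mismatched input should be resampled first. A hedged sketch, again assuming librosa and a hypothetical 16 kHz input file:

import librosa

# An 8 kHz model (e.g. WSJ_8k) cannot score 16 kHz input directly;
# downsample the signal before recognition.
signal, _ = librosa.load("utterance.wav", sr=16000)
signal_8k = librosa.resample(signal, orig_sr=16000, target_sr=8000)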

Page 13: Speech Recognition

Word likelihoods

Page 14: Speech Recognition

ARPA format example:

\1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
\2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
\3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
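Each entry is the log10 probability, the n-gram itself, and an optional log10 backoff weight (absent at the highest order). A minimal parser sketch for a single entry line; this helper is hypothetical, not part of Sphinx4:

def parse_arpa_line(line, order):
    fields = line.split()
    logprob = float(fields[0])
    ngram = tuple(fields[1:1 + order])
    # The trailing backoff weight is optional.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return ngram, logprob, backoff

print(parse_arpa_line("-0.7782 as the -0.2717", order=2))
# (('as', 'the'), -0.7782, -0.2717)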

Page 15: Speech Recognition

public <basicCmd> = <startPolite> <command> <endPolite>;

public <startPolite> = (please | kindly | could you ) *;

public <endPolite> = [ please | thanks | thank you ];

<command> = <action> <object>;

<action> = (open | close | delete | move);

<object> = [the | a] (window | file | menu);
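Read as a whole, this grammar accepts commands such as "open the window", "could you delete a file", and "please close the menu thanks", and nothing outside the listed actions and objects.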

Page 16: Speech Recognition

Maps words to phoneme sequences

Page 17: Speech Recognition

Example from cmudict.06d

POULTICE P OW L T AH S
POULTICES P OW L T AH S IH Z
POULTON P AW L T AH N
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T
POUNCEY P AW N S IY
POUNCING P AW N S IH NG
POUNCY P UW NG K IY
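Loading such a dictionary is straightforward. A minimal sketch, assuming a local cmudict-format file where each line is a word followed by its phones:

def load_dict(path):
    pronunciations = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or line.startswith(";;;"):  # skip blanks/comments
                continue
            pronunciations[fields[0]] = fields[1:]
    return pronunciations

# load_dict("cmudict.06d")["POULTRY"]  ->  ['P', 'OW', 'L', 'T', 'R', 'IY']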

Page 18: Speech Recognition

Constructs the search graph of HMMs from:
- Acoustic model
- Statistical Language model ~or~ Grammar
- Dictionary

Page 19: Speech Recognition
Page 20: Speech Recognition
Page 21: Speech Recognition

Can be statically or dynamically constructed

Page 22: Speech Recognition

FlatLinguist

Page 23: Speech Recognition

FlatLinguist

DynamicFlatLinguist

Page 24: Speech Recognition

FlatLinguist

DynamicFlatLinguist

LexTreeLinguist

Page 25: Speech Recognition

Maps feature vectors to search graph

Page 26: Speech Recognition

Searches the graph for the “best fit”

Page 27: Speech Recognition

Searches the graph for the “best fit”

P(sequence of feature vectors | word/phone)

a.k.a. P(O|W)

-> "how likely is the input to have been generated by the word"

Page 28: Speech Recognition

f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…

Page 29: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

Page 30: Speech Recognition

Uses algorithms to weed out low-scoring paths during decoding

Page 31: Speech Recognition

Words!

Page 32: Speech Recognition

Most common metric

Measures the # of modifications needed to transform the recognized sentence into the reference sentence

Page 33: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.”

Page 34: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.”

Requires 2 deletions, 1 substitution

Page 35: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.”

WER = 100 × (deletions + substitutions + insertions) / length

Page 36: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.” (a → D, reference → S, sentence → D)

WER = 100 × (2 + 1 + 0) / 5 = 100 × 3/5 = 60%
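The same computation in code: a standard dynamic-programming edit distance over words. This is a generic sketch, not any particular toolkit's scorer:

def wer(reference, result):
    ref, hyp = reference.split(), result.split()
    # dist[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return 100.0 * dist[-1][-1] / len(ref)

print(wer("this is a reference sentence", "this is neuroscience"))  # 60.0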

Page 37: Speech Recognition
Page 38: Speech Recognition
Page 39: Speech Recognition
Page 40: Speech Recognition
Page 41: Speech Recognition
Page 42: Speech Recognition
Page 43: Speech Recognition
Page 44: Speech Recognition
Page 45: Speech Recognition
Page 46: Speech Recognition
Page 47: Speech Recognition
Page 48: Speech Recognition

Limited Vocab Multi-Speaker

Page 49: Speech Recognition

Limited Vocab Multi-Speaker

Extensive Vocab Single Speaker

Page 50: Speech Recognition

*If you have noisy audio input, multiply the expected error rate by 2.

Page 51: Speech Recognition

Other variables:
- Continuous vs. Isolated
- Conversational vs. Read
- Dialect

Page 52: Speech Recognition

Questions?

Page 53: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

Page 54: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

P(ay | f) * P(O2 | ay)

P(f | f) * P(O2 | f)

Page 55: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

P(O1) * P(ay | f) * P(O2 | ay)

Page 56: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

Page 57: Speech Recognition

Common Sphinx4 FAQs can be found online: http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html

What follows are some less frequently asked questions.

Page 58: Speech Recognition

Q. Is a search graph created for every recognition result or one for the recognition app?

A. This depends on which Linguist is used. The FlatLinguist generates the entire search graph and holds it in memory; it is only useful for small-vocabulary recognition tasks. The LexTreeLinguist dynamically generates search states, allowing it to handle very large vocabularies.

Page 59: Speech Recognition

Q. How does the Viterbi algorithm save computation over exhaustive search?

A. The Viterbi algorithm saves memory and computation by reusing subproblems already solved within the larger solution. In this way, probability calculations that repeat in different paths through the search graph are not calculated multiple times.

Viterbi cost = n^2 – n^3

Exhaustive search cost = 2^n – 3^n
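A toy Viterbi decoder (an illustration, not Sphinx4's implementation) makes the reuse explicit: the best score for each state at each time step is computed once and then shared by every path that extends through it. The probability tables are parameters shaped like the toy HMM shown earlier:

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Each best[t-1][p] was computed once and is reused here.
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p].get(s, 0.0))
                 for p in states),
                key=lambda x: x[1])
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the single best path backwards.
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))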

Page 60: Speech Recognition

Q. Does the linguist use a grammar to construct the search graph if it is available?

A. Yes, a grammar graph is created.

Page 61: Speech Recognition

Q. What algorithm does the Pruner use?

A. Sphinx4 uses absolute and relative beam pruning.

Page 62: Speech Recognition

Absolute Beam Width - # active search paths

<property name="absoluteBeamWidth" value="5000"/>

Page 63: Speech Recognition

Absolute Beam Width - # active search paths

<property name="absoluteBeamWidth" value="5000"/>

Relative Beam Width – probability threshold

<property name="relativeBeamWidth" value="1E-120"/>

Page 64: Speech Recognition

Absolute Beam Width - # active search paths

<property name="absoluteBeamWidth" value="5000"/>

Relative Beam Width – probability threshold

<property name="relativeBeamWidth" value="1E-120"/>

Word Insertion Probability – Word break likelihood

<property name="wordInsertionProbability" value="0.7"/>

Page 65: Speech Recognition

Absolute Beam Width – # active search paths
<property name="absoluteBeamWidth" value="5000"/>

Relative Beam Width – probability threshold
<property name="relativeBeamWidth" value="1E-120"/>

Word Insertion Probability – Word break likelihood
<property name="wordInsertionProbability" value="0.7"/>

Language Weight – Boosts language model scores
<property name="languageWeight" value="10.5"/>
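A hedged sketch of how the two beam settings interact during pruning (toy code, not Sphinx4's Pruner): first cap the number of active paths, then drop any path whose score falls too far below the current best:

def prune(scored_paths, absolute_beam_width=5000, relative_beam_width=1e-120):
    # scored_paths: list of (path, probability) pairs for the current frame
    if not scored_paths:
        return []
    scored_paths = sorted(scored_paths, key=lambda p: p[1], reverse=True)
    scored_paths = scored_paths[:absolute_beam_width]      # absolute beam
    threshold = scored_paths[0][1] * relative_beam_width   # relative beam
    return [p for p in scored_paths if p[1] >= threshold]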

Page 66: Speech Recognition

Silence Insertion Probability – Likelihood of inserting silence

<property name="silenceInsertionProbability" value=".1"/>

Page 67: Speech Recognition

Silence Insertion Probability – Likelihood of inserting silence

<property name="silenceInsertionProbability" value=".1"/>

Filler Insertion Probability – Likelihood of inserting filler words

<property name="fillerInsertionProbability" value="1E-10"/>

Page 68: Speech Recognition

To call a Java example from Python:

import subprocess

subprocess.call(["java", "-mx1000m", "-jar",
                 "/Users/Username/sphinx4/bin/Transcriber.jar"])

Page 69: Speech Recognition

Speech and Language Processing, 2nd Ed. Daniel Jurafsky and James Martin. Pearson, 2009.

Artificial Intelligence, 6th Ed. George Luger. Addison Wesley, 2009.

Sphinx Whitepaper: http://cmusphinx.sourceforge.net/sphinx4/#whitepaper

Sphinx Forum: https://sourceforge.net/projects/cmusphinx/forums