Speech Recognition
Page 1: Speech Recognition
Page 2: Speech Recognition
Page 3: Speech Recognition

Feature extractor

Page 4: Speech Recognition

Feature extractor

Mel-Frequency Cepstral Coefficients (MFCCs)

Feature vectors
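As a minimal sketch of what this stage produces (Sphinx4's own frontend is implemented in Java; here the third-party librosa library and a hypothetical utterance.wav file stand in):

import librosa

# Load audio at 16 kHz, the rate most Sphinx4 acoustic models expect.
signal, sample_rate = librosa.load("utterance.wav", sr=16000)

# 13 coefficients per frame is a typical configuration; the result is
# one feature vector per short frame of speech.
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)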

Page 5: Speech Recognition

Acoustic Observations

Page 6: Speech Recognition

Acoustic Observations

Hidden States

Page 7: Speech Recognition

Acoustic Observations

Hidden States

Acoustic Observation likelihoods
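As a toy illustration of these three pieces, the HMM for a short word such as "five" can be written out as plain Python data. The probabilities and the observation alphabet below are invented for illustration, not taken from any trained model:

# Hidden states: the phones of the word "five".
states = ["f", "ay", "v"]

# Transition probabilities P(next state | current state); self-loops
# let one phone span several acoustic frames.
transition_prob = {
    "f":  {"f": 0.6, "ay": 0.4},
    "ay": {"ay": 0.7, "v": 0.3},
    "v":  {"v": 1.0},
}

# Acoustic observation likelihoods P(observation | state), here over a
# made-up discrete observation alphabet.
emission_prob = {
    "f":  {"O_fric": 0.8, "O_vowel": 0.2},
    "ay": {"O_fric": 0.1, "O_vowel": 0.9},
    "v":  {"O_fric": 0.7, "O_vowel": 0.3},
}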

Page 8: Speech Recognition

“Six”

Page 9: Speech Recognition
Page 10: Speech Recognition

Constructs the HMMs of phones

Produces observation likelihoods

Page 11: Speech Recognition

Constructs the HMMs for units of speech

Produces observation likelihoods

Sampling rate is critical! WSJ vs. WSJ_8k

Page 12: Speech Recognition

Constructs the HMMs for units of speech

Produces observation likelihoods

Sampling rate is critical! WSJ vs. WSJ_8k

TIDIGITS, RM1, AN4, HUB4
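An acoustic model can only score audio at the rate it was trained on, so mismatched input should be resampled first. A hedged sketch, again assuming librosa and a hypothetical 16 kHz input file:

import librosa

# An 8 kHz model (e.g. WSJ_8k) cannot score 16 kHz input directly;
# downsample the signal before recognition.
signal, _ = librosa.load("utterance.wav", sr=16000)
signal_8k = librosa.resample(signal, orig_sr=16000, target_sr=8000)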

Page 13: Speech Recognition

Word likelihoods

Page 14: Speech Recognition

ARPA format example:

\1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
\2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
\3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
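Each entry is the log10 probability, the n-gram itself, and an optional log10 backoff weight (absent at the highest order). A minimal parser sketch for a single entry line; this helper is hypothetical, not part of Sphinx4:

def parse_arpa_line(line, order):
    fields = line.split()
    logprob = float(fields[0])
    ngram = tuple(fields[1:1 + order])
    # The trailing backoff weight is optional.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return ngram, logprob, backoff

print(parse_arpa_line("-0.7782 as the -0.2717", order=2))
# (('as', 'the'), -0.7782, -0.2717)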

Page 15: Speech Recognition

public <basicCmd> = <startPolite> <command> <endPolite>;

public <startPolite> = (please | kindly | could you ) *;

public <endPolite> = [ please | thanks | thank you ];

<command> = <action> <object>;

<action> = (open | close | delete | move);

<object> = [the | a] (window | file | menu);
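Read as a whole, this grammar accepts commands such as "open the window", "could you delete a file", and "please close the menu thanks", and nothing outside the listed actions and objects.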

Page 16: Speech Recognition

Maps words to phoneme sequences

Page 17: Speech Recognition

Example from cmudict.06d

POULTICE P OW L T AH S
POULTICES P OW L T AH S IH Z
POULTON P AW L T AH N
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T
POUNCEY P AW N S IY
POUNCING P AW N S IH NG
POUNCY P UW NG K IY
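Loading such a dictionary is straightforward. A minimal sketch, assuming a local cmudict-format file where each line is a word followed by its phones:

def load_dict(path):
    pronunciations = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or line.startswith(";;;"):  # skip blanks/comments
                continue
            pronunciations[fields[0]] = fields[1:]
    return pronunciations

# load_dict("cmudict.06d")["POULTRY"]  ->  ['P', 'OW', 'L', 'T', 'R', 'IY']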

Page 18: Speech Recognition

Constructs the search graph of HMMs from:
- Acoustic model
- Statistical Language model ~or~ Grammar
- Dictionary

Page 19: Speech Recognition
Page 20: Speech Recognition
Page 21: Speech Recognition

Can be statically or dynamically constructed

Page 22: Speech Recognition

FlatLinguist

Page 23: Speech Recognition

FlatLinguist

DynamicFlatLinguist

Page 24: Speech Recognition

FlatLinguist

DynamicFlatLinguist

LexTreeLinguist

Page 25: Speech Recognition

Maps feature vectors to search graph

Page 26: Speech Recognition

Searches the graph for the “best fit”

Page 27: Speech Recognition

Searches the graph for the “best fit”

P(sequence of feature vectors | word/phone)

a.k.a. P(O|W)

-> "how likely is the input to have been generated by the word"

Page 28: Speech Recognition

f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…

Page 29: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

Page 30: Speech Recognition

Uses algorithms to weed out low-scoring paths during decoding

Page 31: Speech Recognition

Words!

Page 32: Speech Recognition

Most common metric

Measures the # of modifications needed to transform the recognized sentence into the reference sentence

Page 33: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.”

Page 34: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.”

Requires 2 deletions, 1 substitution

Page 35: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.”

WER = 100 × (deletions + substitutions + insertions) / length

Page 36: Speech Recognition

Reference: “This is a reference sentence.”

Result: “This is neuroscience.” (a → D, reference → S, sentence → D)

WER = 100 × (2 + 1 + 0) / 5 = 100 × 3/5 = 60%
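The same computation in code: a standard dynamic-programming edit distance over words. This is a generic sketch, not any particular toolkit's scorer:

def wer(reference, result):
    ref, hyp = reference.split(), result.split()
    # dist[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return 100.0 * dist[-1][-1] / len(ref)

print(wer("this is a reference sentence", "this is neuroscience"))  # 60.0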

Page 37: Speech Recognition
Page 38: Speech Recognition
Page 39: Speech Recognition
Page 40: Speech Recognition
Page 41: Speech Recognition
Page 42: Speech Recognition
Page 43: Speech Recognition
Page 44: Speech Recognition
Page 45: Speech Recognition
Page 46: Speech Recognition
Page 47: Speech Recognition
Page 48: Speech Recognition

Limited Vocab Multi-Speaker

Page 49: Speech Recognition

Limited Vocab Multi-Speaker

Extensive Vocab Single Speaker

Page 50: Speech Recognition

*If you have noisy audio input, multiply the expected error rate by 2.

Page 51: Speech Recognition

Other variables:
- Continuous vs. Isolated
- Conversational vs. Read
- Dialect

Page 52: Speech Recognition

Questions?

Page 53: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

Page 54: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

P(ay | f) * P(O2 | ay)

P(f | f) * P(O2 | f)

Page 55: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

P(O1) * P(ay | f) * P(O2 | ay)

Page 56: Speech Recognition

[Trellis diagram: observations O1, O2, O3 over time]

Page 57: Speech Recognition

Common Sphinx4 FAQs can be found online: http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html

What follows are some less frequently asked questions.

Page 58: Speech Recognition

Q. Is a search graph created for every recognition result or one for the recognition app?

A. This depends on which Linguist is used. The FlatLinguist generates the entire search graph and holds it in memory; it is only useful for small-vocabulary recognition tasks. The LexTreeLinguist dynamically generates search states, allowing it to handle very large vocabularies.

Page 59: Speech Recognition

Q. How does the Viterbi algorithm save computation over exhaustive search?

A. The Viterbi algorithm saves memory and computation by reusing subproblems already solved within the larger solution. In this way, probability calculations that repeat in different paths through the search graph are not calculated multiple times.

Viterbi cost = n^2 – n^3

Exhaustive search cost = 2^n – 3^n
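A toy Viterbi decoder (an illustration, not Sphinx4's implementation) makes the reuse explicit: the best score for each state at each time step is computed once and then shared by every path that extends through it. The probability tables are parameters shaped like the toy HMM shown earlier:

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Each best[t-1][p] was computed once and is reused here.
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p].get(s, 0.0))
                 for p in states),
                key=lambda x: x[1])
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the single best path backwards.
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))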

Page 60: Speech Recognition

Q. Does the linguist use a grammar to construct the search graph if it is available?

A. Yes, a grammar graph is created.

Page 61: Speech Recognition

Q. What algorithm does the Pruner use?

A. Sphinx4 uses absolute and relative beam pruning.

Page 62: Speech Recognition

Absolute Beam Width - # active search paths

<property name="absoluteBeamWidth" value="5000"/>

Page 63: Speech Recognition

Absolute Beam Width - # active search paths

<property name="absoluteBeamWidth" value="5000"/>

Relative Beam Width – probability threshold

<property name="relativeBeamWidth" value="1E-120"/>

Page 64: Speech Recognition

Absolute Beam Width - # active search paths

<property name="absoluteBeamWidth" value="5000"/>

Relative Beam Width – probability threshold

<property name="relativeBeamWidth" value="1E-120"/>

Word Insertion Probability – Word break likelihood

<property name="wordInsertionProbability" value="0.7"/>

Page 65: Speech Recognition

Absolute Beam Width – # active search paths
<property name="absoluteBeamWidth" value="5000"/>

Relative Beam Width – probability threshold
<property name="relativeBeamWidth" value="1E-120"/>

Word Insertion Probability – Word break likelihood
<property name="wordInsertionProbability" value="0.7"/>

Language Weight – Boosts language model scores
<property name="languageWeight" value="10.5"/>
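A hedged sketch of how the two beam settings interact during pruning (toy code, not Sphinx4's Pruner): first cap the number of active paths, then drop any path whose score falls too far below the current best:

def prune(scored_paths, absolute_beam_width=5000, relative_beam_width=1e-120):
    # scored_paths: list of (path, probability) pairs for the current frame
    if not scored_paths:
        return []
    scored_paths = sorted(scored_paths, key=lambda p: p[1], reverse=True)
    scored_paths = scored_paths[:absolute_beam_width]      # absolute beam
    threshold = scored_paths[0][1] * relative_beam_width   # relative beam
    return [p for p in scored_paths if p[1] >= threshold]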

Page 66: Speech Recognition

Silence Insertion Probability – Likelihood of inserting silence

<property name="silenceInsertionProbability" value=".1"/>

Page 67: Speech Recognition

Silence Insertion Probability – Likelihood of inserting silence

<property name="silenceInsertionProbability" value=".1"/>

Filler Insertion Probability – Likelihood of inserting filler words

<property name="fillerInsertionProbability" value="1E-10"/>

Page 68: Speech Recognition

To call a Java example from Python:

import subprocess

subprocess.call(["java", "-mx1000m", "-jar",
                 "/Users/Username/sphinx4/bin/Transcriber.jar"])

Page 69: Speech Recognition

Speech and Language Processing, 2nd Ed. Daniel Jurafsky and James Martin. Pearson, 2009.

Artificial Intelligence, 6th Ed. George Luger. Addison Wesley, 2009.

Sphinx Whitepaper: http://cmusphinx.sourceforge.net/sphinx4/#whitepaper

Sphinx Forum: https://sourceforge.net/projects/cmusphinx/forums