Pronunciation Scoring for Language Learners Using a Phone Recognition System
CS626-460: Lecture 34
Presented by: K. L. Srinivas (M.Tech, 2nd year)
Guided by: Prof. Preeti Rao (Electrical Engineering Dept.)
Department of Electrical Engineering, IIT Bombay, Mumbai, India
Introduction
Pronunciation refers to the manner in which a particular word of a language is uttered.
Motivation
Accurate pronunciation (articulation) is a vital component of the language acquisition process.
The fluency of a non-native speaker of a language can be judged by pronunciation and prosody.
A classroom environment is not always available to learners.
Subjective Evaluation
Word spoken: Kaleidoscopic (audio samples: Speaker 1, Speaker 2)
Problem statement
Develop a computer-based automatic pronunciation scoring system.
Assess the closeness of the language learner's pronunciation to that of a reference speaker (already stored in the system).
Provide the language learner with a pronunciation score and feedback.
A brief on Automatic Speech Recognition
Introduction
Automatic speech recognition (ASR) is a process by which an
acoustic speech signal is converted into a set of words.
Getting a computer to understand spoken language.
Approaches to ASR
Template matching
Knowledge-based (rule-based approach)
Statistical approach (machine learning)
Statistical based approach :
Collect a large corpus of transcribed speech recordings.
Train the computer to learn the corresponding instances (Machine
learning).
At run time, apply statistical processes to search through the
space of all possible solutions and pick the statistically most
likely one.
Speech recognition tool kits :
Sphinx and HTK are two widely used speech recognition toolkits.
CMU Sphinx: Carnegie Mellon University (CMU)
HTK: Cambridge University
Both the frameworks are used for developing, training and
testing a speech model from existing corpus speech data.
Both use Hidden Markov Modeling techniques.
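In HMM-based recognizers, "picking the statistically most likely" sequence is typically done with the Viterbi algorithm. The sketch below is a toy illustration for a discrete-observation HMM. The two states, the `lo`/`hi` symbols and all probabilities are made-up assumptions, not values taken from Sphinx or HTK, which use continuous-density models over feature vectors.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete-observation HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical two-phone model (illustrative numbers only).
STATES = ["aa", "ee"]
LOG_START = logs({"aa": 0.5, "ee": 0.5})
LOG_TRANS = {"aa": logs({"aa": 0.8, "ee": 0.2}), "ee": logs({"aa": 0.2, "ee": 0.8})}
LOG_EMIT = {"aa": logs({"lo": 0.9, "hi": 0.1}), "ee": logs({"lo": 0.1, "hi": 0.9})}

log_prob, path = viterbi(["lo", "lo", "hi", "hi"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path)  # ['aa', 'aa', 'ee', 'ee']
```

Working in log-probabilities avoids numerical underflow on long utterances, which is also why decoder scores are usually reported as log-likelihoods.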
MFCC feature vector :
Mel-Frequency Cepstral Coefficients (MFCCs) are a popular choice of feature.
Frame size: 25 ms
Hop size: 10 ms
39 features per 10 ms frame:
Absolute: log frame energy (1) and MFCCs (12)
Delta: first-order derivatives of the 13 absolute coefficients
Delta-Delta: second-order derivatives of the 13 absolute coefficients
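The 13 + 13 + 13 layout can be sketched as follows. This is a minimal illustration that assumes the 13 absolute coefficients per frame are already computed. The regression-style delta over a window of +/- 2 frames and the edge handling (repeating the first and last frame) are assumptions chosen for the example, not settings taken from the slides.

```python
def delta(frames, n=2):
    """Regression-based derivative over +/- n frames; edge frames are repeated."""
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        row = []
        for j in range(dim):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, num_frames - 1)][j]
                behind = frames[max(t - k, 0)][j]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def stack_39(static_13):
    """Concatenate absolute, delta and delta-delta coefficients per frame."""
    d1 = delta(static_13)   # first-order derivatives
    d2 = delta(d1)          # second-order derivatives
    return [s + a + b for s, a, b in zip(static_13, d1, d2)]

# Ten synthetic frames whose 13 coefficients all ramp up by 1 per frame.
static = [[float(t)] * 13 for t in range(10)]
feats = stack_39(static)
print(len(feats), len(feats[0]))  # 10 39
```

For the ramp input, interior frames get a delta of 1 and a delta-delta of 0, which is a quick sanity check on the derivative computation.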
Sphinx 3 :
Training : [block diagram not recovered]
Testing / Decoding : [block diagram not recovered]
Decoder output :
Recognition hypothesis: the single best recognition result for each utterance processed.
It is a linear word sequence with time segmentation and scores.
Output format : [example not recovered]
Automatic Speech Recognition for non-native speech
Non-native speech characteristics:
Phone substitutions: "S" in the word "she" pronounced as "s".
Phonotactic constraints: the stop cluster "sk" in "school" pronounced as "iskUl".
Use of a language model masks out the non-nativeness during recognition.
The accuracy of state-of-the-art phone recognition systems is as low as 50%-70%.
Traditional ASR techniques cannot be used as they are for non-native speech.
Phone recognition is to be carried out in constrained mode.
Back to pronunciation scoring
Pronunciation Scoring System
[Block diagram: inputs are the input speech signal and the canonical transcription of the utterance; outputs are the prosody score, the articulation score and the pronunciation score]
Pronunciation Variants
Challenges
No ready database of speakers of Indian English.
Multiple L1s (first languages) among Indian speakers pose further challenges.
Native Hindi and native English databases are available.
Word: Fundamentals
Canonical form: SIL f aa n d aa m ee n clt t aa l s SIL
Variant_1: SIL f aa n d a m ee n clt t aa l s SIL
Variant_2: SIL f aa n d ee m ee n clt t aa l s SIL
Constrained Phone Decoding
HMM-based recognizers: HTK 3.4, Sphinx 3
[Block diagram: the input speech utterance, the acoustic models from training and the variants are fed to Decoding, which outputs an aligned phone sequence with a likelihood for each variant; Variant Selection then leads to the visual feedback and articulation score]
Past Work
Strik and Cucchiarini (2000): pronunciation variations and modeling.
Goronzy, Rapp and Kompe (2004): non-native pronunciation variations and generation (native English speakers speaking German).
Wesenick and Schiel (1994), Wesenick (1996): generation of rules for German pronunciation variations.
Franco et al. (1997), Franco et al. (2000): a paradigm for automatic assessment of pronunciation quality.
Witt and Young (2000): likelihood-based goodness of pronunciation scheme.
Databases
TIMIT database
630 speakers of 8 major dialects of American English.
Each speaking 10 phonetically rich sentences.
TIFR database
100 native speakers of Hindi.
Each speaking 10 phonetically rich sentences.
Indian English database - Testing
30 Indian college students, each speaking the 2 common sentences from the TIMIT database.
Acoustic Phone Models (Training)
47-class TIMIT models: the entire phone set from TIMIT.
52-class Union models: the entire TIMIT phone set (47 phones) and 5 additional phones from the TIFR phone set, making a total of 52 phones.
48-class Union models: the entire TIFR Hindi phone set (36 phones) and 12 phones from TIMIT.
Experiments and Evaluation
The focus of this work is to investigate the effect of selecting the phone models from one of the 47-class, 52-class Union and 48-class Union phone models.
Evaluation Measures
Method I: the number of instances in which the surface transcription is within the top N decoded sequences in terms of likelihood score.
Method II: the edit distance between the most likely phone sequence and the surface transcription, in terms of %Correct and %Accuracy.
Method III: normalized likelihood error. A value of 0 for this measure indicates the best achievable performance.
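Methods I and II can be sketched in a few lines. The example assumes the usual HTK-style definitions, %Corr = (N - D - S) / N and %Acc = (N - D - S - I) / N, where N is the number of phones in the reference and S, D, I are the substitutions, deletions and insertions of a minimum-edit alignment. Unit edit costs and the function names are assumptions for illustration; the actual scoring tools may weight the edits differently.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total_edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, ins)
            else:
                diag = (t + 1, s + 1, d, ins)
            t, s, d, ins = cost[i - 1][j]
            delete = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insert = (t + 1, s, d, ins + 1)
            cost[i][j] = min(diag, delete, insert)
    return cost[n][m][1:]

def percent_correct_accuracy(reference, hypothesis):
    """Method II: (%Corr, %Acc) of the decoded sequence against the reference."""
    subs, dels, ins = align_counts(reference, hypothesis)
    n = len(reference)
    corr = 100.0 * (n - subs - dels) / n
    acc = 100.0 * (n - subs - dels - ins) / n
    return corr, acc

def in_top_n(surface, ranked_variants, n):
    """Method I: is the surface transcription among the n most likely variants?"""
    return surface in ranked_variants[:n]

CANONICAL = "SIL f aa n d aa m ee n clt t aa l s SIL".split()
VARIANT_2 = "SIL f aa n d ee m ee n clt t aa l s SIL".split()
corr, acc = percent_correct_accuracy(CANONICAL, VARIANT_2)
print(round(corr, 1), round(acc, 1))  # 93.3 93.3
```

%Acc is never higher than %Corr because it additionally penalizes inserted phones.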
Performances of Method I and II
Tabulation of Method I and Method II of evaluation for HTK 3.4 and Sphinx 3. Method I columns give the number of utterances whose reference transcription is in the Top 1 / Top 5 decoded sequences; Method II columns give %Corr and %Acc.
Sentence | Decoder models | # of unique variants | HTK 3.4 Top 1 | HTK 3.4 Top 5 | HTK 3.4 %Corr | HTK 3.4 %Acc | Sphinx 3 Top 1 | Sphinx 3 Top 5 | Sphinx 3 %Corr | Sphinx 3 %Acc
SA1 | 47 class | 636 | 5 | 7 | 82.4 | 79.4 | 2 | 6 | 83.8 | 80.2
SA1 | 52 class | 1263 | 6 | 9 | 81.8 | 80.0 | 1 | 6 | 82.5 | 78.4
SA1 | 48 class | 763 | 21 | 24 | 96.2 | 94.6 | 12 | 17 | 92.2 | 89.1
SA2 | 47 class | 1026 | 6 | 11 | 88.0 | 85.8 | 5 | 9 | 88.8 | 86.4
SA2 | 52 class | 1026 | 7 | 13 | 88.6 | 87.0 | 5 | 10 | 85.9 | 83.8
SA2 | 48 class | 1026 | 16 | 20 | 92.8 | 92.2 | 7 | 9 | 89.4 | 87.7
The 48-class models perform best for both sentences and for both HTK and Sphinx.
Performance of Method III
[Figure: distribution of the likelihood scores across the 60 utterances]
The 48-class phone set has the average likelihood error closest to zero of the three phone sets.
The results are consistent with Methods I and II.
Articulation Scoring methods :
The articulation score indicates the closeness of the language learner's pronunciation to the pronunciation of a native speaker of the target language.
It detects phoneme-level mispronunciation and the extent to which a phoneme has been mispronounced.
The algorithm uses speech models derived from a speech database of native speakers.
It uses forced-alignment tools in the background to get acoustic scores (a quantitative measure indicating the acoustic fit for a particular speech segment).
Two methods investigated:
GOP (Goodness of Pronunciation) score [2].
Method by Sunil K. Gupta [9].
GOP scoring method :
The confidence with which a particular phone has been recognized, also called the Goodness Of Pronunciation (GOP) score.
The GOP score is given by the normalized log posterior probability of the phone.
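As a rough illustration of the GOP idea, the sketch below assumes the commonly used approximation from Witt and Young [2]: the log-likelihood of the segment under the expected phone (from forced alignment) minus the best log-likelihood over all phones (from a free phone loop), in absolute value, divided by the number of frames in the segment. The function names, the example numbers and the threshold are hypothetical.

```python
def gop_score(forced_loglik, free_logliks, num_frames):
    """|log p(O | expected phone) - max_q log p(O | q)| / number of frames.

    forced_loglik: log-likelihood of the segment under the expected phone.
    free_logliks:  dict mapping each candidate phone to its log-likelihood.
    """
    best = max(free_logliks.values())
    return abs(forced_loglik - best) / num_frames

def is_mispronounced(gop, threshold):
    """Flag the phone when its GOP exceeds a (hypothetical) threshold."""
    return gop > threshold

# Made-up segment of 10 frames in which "ee" fits better than the expected "aa".
candidates = {"aa": -250.0, "ee": -240.0, "a": -260.0}
gop = gop_score(candidates["aa"], candidates, num_frames=10)
print(gop)  # 1.0
```

A GOP of 0 means the expected phone was also the best-matching one; larger values mean some other phone fits the segment better.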
GOP scoring method (cont.) :
[Block diagram for articulation scoring]
Method by Sunil K. Gupta :
Shortcomings of the GOP score:
Threshold selection was based on the subjective ratings of human judges.
It does not provide any quantitative measure of the extent of mispronunciation.
The free decoder is not accurate enough, leading to alignment errors.
In this method two speech models have to be prepared:
48-class phone models (36 TIFR Hindi + 12 TIMIT English)
Garbage model (all phonemes of the speech data combined into one speech model)
Garbage model :
A single speech model combining all the phonemes of the speech data.
The entire speech corpus is trained with a garbage transcription.
Methodology (cont.) :
The utterance is force-aligned with the reference transcription using Sphinx3_align and the 48-class phone models.
Each phoneme of the transcription gets its own acoustic score.
These log-likelihood scores are duration-normalized.
Similarly, the utterance is force-aligned with the garbage transcription using the garbage model.
The difference between these two likelihoods is the current phoneme likelihood.
This difference score (d) is mapped to a phone articulation score using a lookup score table (explained on the next slide).
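The difference score d can be sketched as below. The slide equations are not available here, so the normalization (log-likelihood divided by the number of frames in the segment) and the pairing of target and garbage scores segment by segment are assumptions; the tuple layouts and all numbers are hypothetical.

```python
def difference_scores(target_segments, garbage_segments):
    """Per-phone difference d between duration-normalized log-likelihoods.

    target_segments:  list of (phone, num_frames, loglik) from forced alignment
                      with the reference transcription and the phone models.
    garbage_segments: list of (num_frames, loglik) for the same segments,
                      scored with the garbage model (assumed pairing).
    """
    scores = []
    for (phone, frames, target_ll), (g_frames, garbage_ll) in zip(
        target_segments, garbage_segments
    ):
        d = target_ll / frames - garbage_ll / g_frames
        scores.append((phone, d))
    return scores

# Made-up alignment output for two phones.
target = [("aa", 10, -500.0), ("n", 5, -300.0)]
garbage = [(10, -600.0), (5, -250.0)]
print(difference_scores(target, garbage))  # [('aa', 10.0), ('n', -10.0)]
```

A positive d means the phone model explains the segment better than the garbage model; a negative d suggests a poor fit.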
Formation of score table :
For each utterance an in-grammar and an out-grammar transcription are formed.
In-grammar: the transcription conforms to the target acoustic waveform.
Out-grammar: the transcription is some random phrase from the training database that does not conform to the target acoustic waveform.
In-grammar and out-grammar transcriptions are force-aligned to obtain log-likelihood scores:
In-grammar :
Out-grammar :
Score table (cont.) :
Using all the in-grammar points and out-grammar points, a pdf is formed for each phoneme.
These probability density functions are then used to build the score table (the table is shown in the results section).
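One way to turn the two pdfs into a 0-100 score is sketched below. It fits a Gaussian to the in-grammar and to the out-grammar difference scores of a phoneme and maps a new d to 100 * f_in(d) / (f_in(d) + f_out(d)). This mapping is a stand-in chosen for illustration, not the score table used in the slides or in [9], and the sample points are made up.

```python
from statistics import NormalDist

def fit_pdfs(in_points, out_points):
    """Fit one Gaussian to the in-grammar d values and one to the out-grammar ones."""
    return NormalDist.from_samples(in_points), NormalDist.from_samples(out_points)

def articulation_score(d, in_pdf, out_pdf):
    """Score in [0, 100]: relative weight of the in-grammar density at d."""
    f_in, f_out = in_pdf.pdf(d), out_pdf.pdf(d)
    total = f_in + f_out
    if total == 0.0:
        # Both densities underflow far from the data; return a neutral score.
        return 50.0
    return 100.0 * f_in / total

# Made-up difference scores for one phoneme.
in_pdf, out_pdf = fit_pdfs([8, 9, 10, 11, 12], [-2, -1, 0, 1, 2])
for d in (10.0, 5.0, 0.0):
    print(d, round(articulation_score(d, in_pdf, out_pdf), 1))
```

A d value close to the in-grammar mean scores near 100, a value midway between the two means scores 50, and a value close to the out-grammar mean scores near 0.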
Results :
[Figure: histograms and Gaussian pdfs (approximating the data points) for both in-grammar and out-grammar for phoneme aa]
Results (cont.) :
[Figure: histograms and Gaussian pdfs (approximating the data points) for both in-grammar and out-grammar for phoneme ee]
Results (cont.) :
[Figure: combined pdf of in-grammar and out-grammar for aa]
Results (cont.) :
[Figure: combined pdf of in-grammar and out-grammar for ee]
Results (score table) :
The calculations and table below are for phoneme aa.
f denotes the probability density function.
[symbols not recovered] are the in-grammar and out-grammar means respectively.
For in-grammar and out-grammar points: [equations not recovered]
Score table for phoneme aa on the next slide.
Score table (phoneme aa) :
Dh(x) | Score
1.242 | 100
1.185 | 90
0.748 | 80
0.457 | 70
0.219 | 60
0 | 50
-0.048 | 40
-0.1001 | 30
-0.164 | 20
-0.259 | 10
-0.272 | 0
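Result (Speaker 1 : Fundamentals) :
Scores are listed only for phoneme aa.
Phone | Correct pronun. d | Correct pronun. score | Incorrect pronun. d | Incorrect pronun. score
h | -8647 | | -90232 |
aa | -14397 | 60% | -2990 | 80%
n | -99.5 | | -8224 |
d | -23238 | | -27363 |
aa | -15280 | 60% | -4030 | 40%
m | -756 | | -813 |
ee | -19837 | | -19767 |
n | -7571 | | -5941 |
SI | -12570 | | -5250 |
t | -2023 | | -6451 |
aa | -8659.5 | 80% | -10531 | 70%
l | -10920 | | -2014 |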
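Result (Speaker 2 : Fundamentals) :
Scores are listed only for phoneme aa.
Phone | Correct pronun. d | Correct pronun. score | Incorrect pronun. d | Incorrect pronun. score
h | 370 | | -9969 |
aa | -10269 | 70% | -10499 | 70%
n | -9675 | | -4115 |
d | -25162 | | -17556 |
aa | -2295 | 80% | -4677 | 80%
m | -4100 | | -11144 |
ee | -24043 | | -34685 |
n | 4284 | | -5253 |
SI | -4595 | | -10971 |
t | -10534 | | -5146 |
aa | -14271 | 50% | -7271 | 80%
l | -10917 | | -18839 |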
Duration scoring :
The duration score provides feedback on the normalized relative duration difference between the language learner's speech and the reference speaker's speech.
It denotes whether a particular syllable is stressed or not.
If Li and Ri are their respective durations corresponding to phoneme qi, then an utterance consisting of N phones can be denoted by: [equation not recovered]
Normalized durations are given by: [equation not recovered]
Duration scoring (cont.) :
The overall duration score is given by: [equation not recovered]
The maximum duration score is 1 and the minimum is 0.
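A duration score with the stated range can be sketched as follows. Since the slide equations are not available here, the formulas are assumptions: each phone duration is normalized by the total duration of its utterance, and the score is 1 minus half the summed absolute differences between learner and reference, which is exactly 1 for identical relative timing and 0 in the worst case.

```python
def duration_score(learner_durations, reference_durations):
    """Assumed form: 1 - 0.5 * sum_i |L_i / sum(L) - R_i / sum(R)|, in [0, 1]."""
    if len(learner_durations) != len(reference_durations):
        raise ValueError("both utterances must have the same number of phones")
    learner_total = sum(learner_durations)
    reference_total = sum(reference_durations)
    diff = sum(
        abs(l / learner_total - r / reference_total)
        for l, r in zip(learner_durations, reference_durations)
    )
    return 1.0 - 0.5 * diff

# Made-up phone durations in frames.
print(duration_score([30, 10], [20, 20]))  # 0.75
```

Normalizing by the utterance total makes the score insensitive to overall speaking rate: a learner who speaks uniformly slower than the reference still scores 1.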
Duration scoring (Results) :
Speaker_1 was taken as the reference and duration scores were calculated for the other speakers.
[Figure: Speaker_1]
Duration scoring (Results) :
Speaker_1 vs Speaker_2: duration score = 0.573 (low due to differences in f, a and s durations)
Duration scoring (Results) :
Speaker_1 vs Speaker_3: duration score = 0.485 (low due to differences in f, a and s durations)
Feedback, Articulation and Duration Score
Speaker 2
Canonical transcription (reference speaker): SIL f aa n d aa m ee n clt t aa l s SIL
Speaker 1
Transcription: SIL f aa n d ee m ee n clt t aa l s SIL
Feedback: SIL f aa n d ee m ee n clt t aa l s SIL
Articulation score: 72%
Duration score: 0.573
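The phone-level feedback amounts to locating where the selected variant differs from the canonical transcription. A minimal sketch using Python's standard `difflib` is given below; the function name and the output layout are hypothetical, and the two phone strings are the canonical form and the transcription from the slide above.

```python
from difflib import SequenceMatcher

def pronunciation_feedback(canonical, decoded):
    """List (position_in_canonical, expected_phones, spoken_phones) per mismatch."""
    matcher = SequenceMatcher(a=canonical, b=decoded, autojunk=False)
    return [
        (i1, canonical[i1:i2], decoded[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

canonical = "SIL f aa n d aa m ee n clt t aa l s SIL".split()
decoded = "SIL f aa n d ee m ee n clt t aa l s SIL".split()
print(pronunciation_feedback(canonical, decoded))  # [(5, ['aa'], ['ee'])]
```

Here the only mismatch is the second vowel, where "ee" was spoken in place of "aa", which is the phone a visual display would highlight.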
References
[1] Strik, H., Neri, A., and Cucchiarini, C. 2008. Speech technology for language tutoring. In Proceedings of LangTech (Rome, Italy, February 28-29, 2008).
[2] Witt, S., and Young, S. 2000. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, Vol. 30, pp. 95-108.
[3] Franco, H., et al. 2000. Automatic scoring of pronunciation quality. Speech Communication, Vol. 30, pp. 83-93.
[4] Kawai, G., and Hirose, K. 1998. A method for measuring the intelligibility and non-nativeness of phone quality in foreign language pronunciation training. In Proceedings of ICSLP-98 (Sydney, Australia, November 30 - December 4, 1998), pp. 1823-1826.
[5] Goronzy, S., Rapp, S., and Kompe, R. 2004. Generating non-native pronunciation variants for lexicon adaptation. Speech Communication, Vol. 42, pp. 109-123.
[6] Lee, K.F. 1988. Large-vocabulary speaker-independent continuous speech recognition: The SPHINX system. Ph.D. dissertation, Comput. Sci. Dep., Carnegie Mellon University.
[7] Young, S., et al. 2006. The HTK Book v3. Cambridge University.
[8] Samudravijaya, K., Rawat, K.D., and Rao, P.V.S. 1998. Design of Phonetically Rich Sentences for Hindi Speech Database. J. Ac. Soc. Ind., Vol. XXVI, December 1998, pp. 466-471.
[9] Gupta, S.K., Lu, Z., and Zhao, F. Automatic Pronunciation Scoring for Language Learning. U.S. Patent 7,219,059, May 15, 2007.