-
Page 1 of 36
salsman-culnane-specification 6/2/17, 4:47 PM
PRONUNCIATION ASSESSMENT FOR INTELLIGIBILITY REMEDIATION
Utility Patent Specification
by James Salsman, Fort Lupton; and Lance Culnane, Westminster,
both of Colorado.
April 22, 2017
BACKGROUND CITATIONS
U.S. Patent Documents:
5,679,001: Russell, et al. (1997) “Children's speech training aid.”
5,920,838: Mostow, et al. (1999) “Reading and Pronunciation Tutor.”
6,634,887: Heffernan, III, et al. (2003) “Methods and Systems for Tutoring Using a Tutorial Model with Interactive Dialog.”
6,963,841: Handal, et al. (2005) “Speech Training Method with Alternative Proper Pronunciation Database.”
7,752,045: Eskenazi, et al. (2010) “Systems and Methods for Comparing Speech Elements.”
8,109,765: Beattie, et al. (2012) “Intelligent Tutoring Feedback.”
8,271,281: Jayadeva, et al. (2012) “Method for Assessing Pronunciation Abilities.”
8,744,856: Ravishankar (2014) “Computer implemented system and method and computer program product for evaluating pronunciation of phonemes in a language.”
9,520,068: Beattie, et al. (2016) “Sentence Level Analysis in a Reading Tutor.”
Other References:
Chen and Li (2016) “Computer-assisted pronunciation training: From pronunciation scoring towards spoken language learning,” in Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference. http://www.apsipa.org/proceedings_2016/HTML/paper2016/227.pdf
Cole, et al. (1999) “A platform for multilingual research in spoken dialogue systems,” in Proceedings of the Multilingual Interoperability in Speech Technology Conference (Leusden, Netherlands). http://www.cslu.ogi.edu/people/hosom/pubs/cole_MIST-platform_1999.pdf
Hawkins, J.A., and Filipović, L. (2012) Criterial Features in L2 English: Specifying the Reference Levels of the Common European Framework (United Kingdom: Cambridge University Press). https://drive.google.com/open?id=0B73LgocyHQnfcEVacmZRc2xEQ3VIZ0tkMHNmdjhNOXVsS1VR
Huggins-Daines, et al. (2006) “PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://www.cs.cmu.edu/~awb/papers/ICASSP2006/0100185.pdf
Kibishi, et al. (2014) “A statistical method of evaluating the pronunciation proficiency/intelligibility of English presentations by Japanese speakers,” ReCALL (European Association for Computer Assisted Language Learning), doi:10.1017/S0958344014000251. http://www.slp.ics.tut.ac.jp/Material_for_Our_Studies/Papers/shiryou_last/e2014-Paper-01.pdf
Loukina, et al. (2015) “Pronunciation accuracy and intelligibility of non-native speech,” in Interspeech 2015, the Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association (Dresden, Germany: Educational Testing Service). http://www.oeft.com/su/pdf/interspeech2015b.pdf
Panayotov, V., et al. (2015) “LibriSpeech: An ASR Corpus Based on Public Domain Audio Books,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015). http://www.danielpovey.com/files/2015_icassp_librispeech.pdf
Proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training, June 6–8, 2012, KTH, Stockholm, Sweden. http://www.speech.kth.se/isadept/ISADEPT-proceedings.pdf
Proceedings of the Workshop on Speech and Language Technology in Education, September 4–5, 2015 (Satellite Workshop of Interspeech 2015 and the ISCA Special Interest Group SLaTE), Leipzig, Germany. https://www.slate2015.org/files/SLaTE2015-Proceedings.pdf
Ronanki, S.; Salsman, J.; and Bo, L. (December 2012) “Automatic Pronunciation Evaluation and Mispronunciation Detection Using CMUSphinx,” in the Proceedings of the 24th International Conference on Computational Linguistics (Mumbai, India: COLING 2012), pp. 61–67. http://www.aclweb.org/anthology/W12-5808
Salsman, J. (July 2014) “Development challenges in automatic speech recognition for computer assisted pronunciation teaching and language learning,” in Proceedings of the Research Challenges in Computer Aided Language Learning Conference (Antwerp, Belgium: CALL 2014). http://talknicer.com/Salsman-CALL-2014.pdf
Computer-Assisted Pronunciation Teaching (CAPT) Bibliography: http://liceu.uab.es/~joaquim/applied_linguistics/L2_phonetics/CALL_Pron_Bib.html
FIELD OF THE INVENTION
This invention relates to the field of computer-assisted pronunciation training (CAPT) using automatic speech recognition for language learning, speech-language pathology, and reading tutoring, such as described by Russell, et al. (1997) “Children's speech training aid,” U.S. Patent 5,679,001. The assessment and remediation of the authentic intelligibility of learners' spoken language, as measured by agreement with panels of non-expert word transcriptionists including both native and non-native listeners, provides substantial advantages over the current state of the art, which instead typically assesses formal pronunciation agreement with a panel of native-language pronunciation experts, because formal mispronunciations are associated with only 16% of the measured authentic intelligibility of words, according to Loukina, et al. (2015.)
DISCUSSION OF PRIOR ART
While Kibishi et al. (2014) have demonstrated the achievement of
75% agreement with
authentic word transcription, even earlier work by Ronanki,
Salsman, and Bo (2012)
produced open source software implementing means of more precise
discrimination
between consequential and incidental errors by allowing accent
and dialect adaptation
using physiologically neighboring phones (phonemes and diphones)
derived from the
adjacency of vocal tract components, e.g., in the positions and
configuration of the lips,
teeth, tongue, jaw, vocal folds, nasal flap, and diaphragm.
As stated in Salsman (2014), “To best support language
instruction, we have been
developing the use of physiologically neighboring phonemes,
i.e., sounds produced with
similar vocal tract articulations, to identify and discern
between serious
mispronunciations and incidental errors (Ronanki et al., 2012.)
We are using diphones,
i.e. the last half of one phoneme followed by the first half of
the next, as alternatives and
supplements to phonemes and triphones for both automatic speech
recognition and
pronunciation scoring (Cole, et al., 1999.) We plan to model
learner fluency and select
the sequence of self-study practice exercises using cumulative
diphone scores. We are
scoring segment durations to indicate syllables and words
pronounced too quickly
relative to exemplary pronunciations. We have measured
substantial potential
improvements from all of these techniques.
“The language instructor’s experience of computer-assisted
pronunciation assessment can
be enhanced by offering comparisons of students’ utterances to
exemplary pronunciations
in ways that illustrate the measurements of physiologically
neighboring phonemes,
diphones, and speech segment durations. For example,
mispronunciations might be
annotated with International Phonetic Alphabet symbols for both
the expected
pronunciation and its physiologically neighboring phoneme which
most closely matched
the observed speech. Diphones can be used to highlight difficult
phonetic transitions, for
example when two adjacent phonemes are both mispronounced.
Duration scoring can
annotate not just words and sub-word segments given insufficient
emphasis, e.g. such as
might confuse ‘fourteen’ with ‘forty,’ and can highlight missing
glottal stops essential to
discern, for example, ‘harder’ from ‘hard or.’”
Pronunciation assessment and CAPT responses should be based on at least 44 exemplary pronunciations for each response word or phrase, comprising both genders, two age groups (such as speakers in their 20s and 50s), and, for English, at least eleven geographic regions, in order to provide world-wide English accent and dialect adaptation coverage. For
English, such exemplary pronunciations should be recorded from
native language
speakers selected from, for example, Australia, Canada, Ireland,
New Zealand, South
Africa, London (Standard Southern English), London (Cockney),
London (Received
Pronunciation), Birmingham, Cornwall, East Anglia, East
Yorkshire, North Wales,
Edinburgh, Ulster, Dublin, Boston, Midwestern US (i.e., in or
west of Michigan,
Pennsylvania, Missouri, or New Mexico), New England, New York
City, and the
Southern U.S. Gulf Coast region.
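The 44-exemplar figure above follows directly from the coverage factors: two genders times two age groups times at least eleven geographic regions. A minimal sketch of this arithmetic (constant names are illustrative, not from the specification):

```python
# Minimal coverage calculation for exemplar recordings per prompt.
GENDERS = 2      # both genders
AGE_GROUPS = 2   # e.g., speakers in their 20s and 50s
REGIONS = 11     # at least eleven geographic regions for English

def min_exemplars(genders=GENDERS, age_groups=AGE_GROUPS, regions=REGIONS):
    """Minimum exemplar pronunciations needed per response word or phrase."""
    return genders * age_groups * regions

print(min_exemplars())  # 2 * 2 * 11 = 44
```

Broader regional coverage raises the minimum proportionally; the specification's region list is a floor, not a ceiling.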
Learner analytics (scoring pronunciation for CAPT and grading
authentic intelligibility)
may include log-normal means and variances of phoneme, diphone,
word, and phrase
acoustic scores and durations, along with cumulative phoneme and
diphone scores;
mispronunciations ranked by consequential interference with
intelligibility for each word
in an utterance and for the whole utterance; tonality scores for
tonal languages; language
grammar, morphology, and vocabulary criterial feature coverage
scores; and subject
matter topic correctness and coverage scores.
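The log-normal means and variances mentioned above can be computed directly from log-transformed segment durations. A minimal sketch, assuming durations in seconds (function name is illustrative):

```python
import math

def lognormal_stats(durations):
    """Log-normal mean and variance of positive segment durations
    (e.g., phoneme, diphone, word, or phrase durations in seconds)."""
    logs = [math.log(d) for d in durations]
    n = len(logs)
    mu = sum(logs) / n                          # mean of log-durations
    var = sum((x - mu) ** 2 for x in logs) / n  # variance of log-durations
    return mu, var

# Illustrative phoneme durations for one segment across four utterances:
mu, var = lognormal_stats([0.09, 0.11, 0.10, 0.12])
```

Working in the log domain is the conventional choice for durations, which are positive and right-skewed.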
The intelligibility scoring system should agree with a panel of
non-expert, authentic
native and non-native language word transcriptionists. Beyond
the logistic regression of
word intelligibility by such transcriptionists, other machine
learning techniques may
include, but are not limited to, those of Kibishi et al. (2014),
such as symbolic regression,
general and nonlinear regression, classification, artificial
neural networks, support vector
machines, learning vector quantization, or self-organizing maps.
Quality assurance
should be performed by measuring the extent to which the
resulting intelligibility scores
match those of an actual panel of such non-expert native and
non-native word
transcriptionists, preferably using blind or double-blind
analysis. Transcriptionist data
may be enhanced with automatic spelling correction.
Intelligibility determination may be
enhanced with word frequency-based phonological similarity
measures of speech
ambiguity.
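The logistic regression of word intelligibility mentioned above can be sketched in a few lines: fit the probability that a transcriptionist recovers a word from its acoustic score. The training pairs and gradient-descent fit below are purely illustrative; real weights would be fit against actual panel transcription data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(scores, transcribed, lr=0.1, epochs=2000):
    """Fit p(word transcribed correctly | acoustic score) by logistic
    regression with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(scores, transcribed):
            p = sigmoid(w * x + b)
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Illustrative data: higher acoustic scores are transcribed correctly more often.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(scores, labels)
```

The other techniques listed (neural networks, support vector machines, and so on) would replace this fitting step while keeping the same inputs and labels.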
Learner remediation may include audio and visual feedback using
expected and observed
phones and their durations to show vocal tract sagittal sections
and front-facing lip static
graphic diagrams and animations along with spoken audio and text
describing corrective
vocal tract motions in the learner's preferred language with
examples in that language.
OBJECTIVES AND ADVANTAGES
The invention eliminates pronunciation assessment feedback which does not involve a consequential mispronunciation interfering with the student's authentic intelligibility, and provides feedback as a pair of audio words in the learner's first language, the first containing the correct phoneme and the second containing the mistaken sound produced.
To achieve those goals, we collect transcriptions of learner
utterances. For example, while
displaying, “Please listen to this phrase and type in the
English words you hear,” play this
audio for the phrase: “I'm here on behalf of the Excellence
Hotel group.” For this
example, let's say that in the audio, “behalf” was mispronounced
as “beh-alf” and
“Excellence” was mispronounced as “Excellent” but everything
else was good. The
learner types in the text: “I'm here on behalf of the excellent
hotel group.” (I.e., the
transcribing advanced learner gets “behalf” right, but doesn't
transcribe Excellence
correctly because it was mispronounced.) The system sees that
“Excellence” was not
transcribed correctly, while the SR system reports two
mispronunciations. Therefore, the system updates the database entry for this phrase, tallying the corresponding phonemes in “behalf” as inconsequential, but the final phoneme /s/ in “excellence” as consequential if mispronounced.
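The tallying step above can be sketched as follows. A mispronunciation flagged by the recognizer is counted as inconsequential when the transcriptionist still recovered the word, and consequential when the word was lost. The data structure and function name are illustrative:

```python
from collections import defaultdict

# Per-phrase tallies: how often a mispronounced phoneme did or did not
# break a transcriptionist's recognition of the word.
tallies = defaultdict(lambda: {"consequential": 0, "inconsequential": 0})

def update_tallies(mispronounced, transcribed_words):
    """mispronounced: (word, phoneme) pairs flagged by the recognizer.
    A mispronunciation is inconsequential when the word still came
    through in the transcription, consequential when it did not."""
    transcribed = {w.lower() for w in transcribed_words}
    for word, phoneme in mispronounced:
        key = "inconsequential" if word.lower() in transcribed else "consequential"
        tallies[(word.lower(), phoneme)][key] += 1

# The "behalf"/"excellence" example from the text:
update_tallies(
    mispronounced=[("behalf", "hh"), ("excellence", "s")],
    transcribed_words="I'm here on behalf of the excellent hotel group".split(),
)
```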
After sufficient data is collected, inconsequential mispronunciations can be ignored. The database of prompting phrases will have a probability associated with each phoneme, by which we can scale (or "weight," per Figure 2) each mispronunciation's acoustic score to establish the cut-off point for the scaled values which will not be scored as wrong, e.g., by displaying the word as green or yellow instead of orange or red.
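The scaling and color-banding just described can be sketched as below. The specific cut-off value and band boundaries are illustrative assumptions, not prescribed by the specification:

```python
def weighted_score(acoustic_score, consequence_probability):
    """Scale a mispronunciation's acoustic score by the tallied probability
    that mispronouncing this phoneme breaks intelligibility."""
    return acoustic_score * consequence_probability

def feedback_color(scaled, cutoff=0.5):
    """Map a scaled score to a display color band (bands are illustrative)."""
    if scaled < cutoff / 2:
        return "green"    # not scored as wrong
    if scaled < cutoff:
        return "yellow"   # borderline, still not scored as wrong
    if scaled < cutoff * 2:
        return "orange"
    return "red"
```

Scores below the cut-off thus render as green or yellow and generate no corrective feedback.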
Using a recorded audio library of words in each learner's first language, each containing a given phoneme near the front, instead of showing green/yellow/orange the audio recording of, e.g., a Spanish word which starts with an /s/ sound can be played. For example, a recording saying, in Spanish audio, “When you said excellence [the target word in English] you needed the sound that [a Spanish word starting with /s/] starts with, but instead you pronounced the sound that [a Spanish word starting with /t/] starts with. Listen to what you said. [Playing the audio of the learner's mispronounced word.] You were supposed to say excellence [the word in English again]. Click Replay to hear this again,” can be played while displaying the word “Excellence” and, e.g., two buttons labeled Replay and Continue.
The specific advantageous improvements of the invention
include:
Learner analytics: Learners are scored by any combination of the quality and intelligibility of their phoneme, diphone, syllable, and word production; their word and phrase comprehension; and their ability to both comprehend and use grammatical forms, word stem morphology, "can-do" criteria (for both production and comprehension), and other criterial aspects of the instructional interactions (please see, for example, Hawkins and Filipović, 2012.) In addition to accuracy
for each of those aspects,
the learner's confidence, effort, and independence are measured
too. For example,
confidence can be self-reported, derived from vocal and timing
features, or both. Effort
corresponds to the number and duration of attempts to perform
exercises. And
independence can be measured by the number and frequency of
learner requests for help.
Integrated content development system: Both instructors and peer learners can add to and extend branching scenario instructional interactions, which are multiple-choice response instructional content, such as is used in the Twine Twee formalism or "Choose Your Own Adventure" role-play interactions. This branching scenario instructional content can be added and removed by editing the database of interactions in a manner similar to editing a wiki such as Wikipedia or Wiktionary.
Phonetic disambiguation of homographs (equivalently, heterophones: words that are spelled identically but pronounced differently, such as the past and present tenses of the word “read”) is automatically presented as an integrated part of the instructional content development subsystem. This allows instructors and peer learners to encode their instructional content prompting response phrases, of which there are typically three per branching scenario node, although there can be any natural number: zero responses ends the instructional interaction module, one response requires the production of a particular prompted response, and two or more choices allow for transitions to (usually other) nodes.
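The zero/one/many response structure of branching scenario nodes can be sketched as a small data structure. The class and field names below are illustrative, not from the specification:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Response:
    prompt_phrase: str        # what the learner must say
    next_node: Optional[str]  # target node id for the transition

@dataclass
class ScenarioNode:
    node_id: str
    content: str              # multiple-choice instructional content
    responses: list = field(default_factory=list)

    def kind(self):
        """Zero responses ends the module; one requires a particular
        prompted response; two or more offer transitions to nodes."""
        n = len(self.responses)
        if n == 0:
            return "end"
        if n == 1:
            return "prompted"
        return "choice"
```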
Part of speech labeling: The instructional interaction
development support subsystem
also assists in labeling the part of speech (e.g., noun, verb,
article, adjective, conjunction,
preposition, adverb, etc.) of each word of the prompt phrases in
new instructional content
to assist with pronunciation assessment for intelligibility
remediation.
Peer consensus-based validation of instructional content: Each
node and each transition
between nodes in the branching scenario instructional
interactions are separately
validated by instructor data entry and review or peer learner
review or both.
Caching stand-alone exercises for offline execution: The system network interface caches both instructional interactions during download and their results in nonvolatile storage, so that the system remains usable when disconnected from the network, or when downloads or uploads or both are inhibited, and the entire system can perform in a manner consistent with stand-alone operation compatible with free, freemium, or paid content accession models.
Extensible vocabulary: Each of the prompting phrases is composed
of one or more
words, each of which is in turn composed of one or more
syllables, diphones, and
phonemes. The number and type of words may be increased by
length, subject matter,
vocational or other topic, geography, languages, morphological
features, and other
aspects.
Extensible prompting phrases: The number and type of prompting
phrases associated
with each of the branching scenario transitions may be increased
by length, subject
matter, vocational topic, geographies, languages, grammatical
features, "can-do" criteria,
and other criteria and aspects. The branching scenario
interaction modules in which the
transitions are contained may similarly be increased by each of
those aspects.
Instructional interaction sequencing: A registration and sign-in system which records the learners' proficiency with each phoneme, diphone, word, and other learner analytics allows the instructional content modules, such as branching scenarios and prompting phrases, which the learner most needs to practice to be selected and provided in sequence. While the sequence is often
determined by the
branching scenario interaction transitions, sequencing can also
be performed with
adaptive instruction, by selecting prompting phrases based on
how much the learner
analytics database indicates that the learner needs to practice
words or criterial aspects
contained in the selected phrases.
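The adaptive selection step above can be sketched as picking the prompting phrase whose diphones have the weakest cumulative scores. The phrase table, score values, and function name are illustrative assumptions:

```python
def next_phrase(phrases, diphone_scores):
    """Pick the prompting phrase whose diphones the learner has practiced
    least well: lowest mean cumulative diphone score wins. `phrases` maps
    phrase text to its diphone list; `diphone_scores` maps each diphone to
    the learner's cumulative score (higher is better)."""
    def mean_score(diphones):
        return sum(diphone_scores.get(d, 0.0) for d in diphones) / len(diphones)
    return min(phrases, key=lambda p: mean_score(phrases[p]))

# Illustrative data: this learner is weak on the "ih-dh" and "dh-hh" diphones.
phrases = {
    "with her": ["w-ih", "ih-dh", "dh-hh", "hh-er"],
    "say it":   ["s-ey", "ey-ih", "ih-t"],
}
scores = {"w-ih": 0.9, "ih-dh": 0.2, "dh-hh": 0.3, "hh-er": 0.8,
          "s-ey": 0.9, "ey-ih": 0.8, "ih-t": 0.9}
```

Unseen diphones default to a score of zero, which naturally prioritizes phrases containing material the learner has never practiced.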
Collecting exemplar and student pronunciation audio recordings:
The instructional
interaction development subsystem also includes support for
collecting, evaluating the
authentic intelligibility of, and storing audio recordings from
students, instructors, and
paid voice artists.
Collecting transcriptions of recorded phrases from both first- and subsequent-language transcriptionists: Both the instructional interactions and the interaction development system collect transcriptions of the words that both native and non-native listeners can hear when they listen to recorded audio from instructors, voiceover artists, and learners. Such
transcriptions are scored by the extent to which they match the
words that the speaker
was trying to say when recording the audio.
Authentic intelligibility remediation: This groundbreaking
technique was developed
independently by researchers and software engineers in Japan and
the U.S. Educational
Testing Service. Please see Kibishi, et al. (2014) and Loukina,
et al. (2015.) This
advantage is a monumental improvement over the commercial state
of the art, much if
not most of which is two or three substantial generations behind
(see Figure 2.) The
invention's specific remediation process emphasizes audio
feedback of spoken words in
the learners' first language containing the sounds of the
correct and mistaken
pronunciations, as opposed to merely visual feedback alone.
Multiple pass automatic speech recognition: The learner analytics assessment process includes the temporal endpoints (and thus the durations) and acoustic scores of the words, syllables, diphones, and phonemes in prompting phrases, using anomalous durations of those speech segments to guide multiple passes of automatic speech recognition against the audio input, with different speech recognition grammars representing utterance expectations and different overall endpoints.
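The duration-anomaly trigger for additional recognition passes can be sketched as a z-score test in the log-duration domain, consistent with the log-normal statistics described earlier. The segment tuple layout, threshold, and function name are illustrative:

```python
import math

def anomalous_segments(segments, exemplar_stats, z_threshold=2.0):
    """Flag speech segments whose log-duration deviates from the exemplar
    log-normal distribution; flagged segments would guide re-recognition
    passes with different grammars and endpoints. `segments` is a list of
    (label, start, end) in seconds; `exemplar_stats` maps each label to
    (mu, sigma) of exemplar log-durations."""
    flagged = []
    for label, start, end in segments:
        dur = end - start
        mu, sigma = exemplar_stats[label]
        z = (math.log(dur) - mu) / sigma
        if abs(z) > z_threshold:
            flagged.append((label, start, end))
    return flagged

# Illustrative: an /ih/ of 20 ms is far shorter than exemplars near 100 ms.
flagged = anomalous_segments(
    [("ih", 0.0, 0.02), ("ih", 0.50, 0.61)],
    {"ih": (-2.3026, 0.2)},  # mu = ln(0.1), sigma illustrative
)
```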
Speech-language pathology reporting: The reports, statistics,
and alerts produced from
the learner analytics are designed to provide data in the terms,
manner, form, order, and
with the information contained in reports familiar to practicing
speech-language
pathologists. However, the same reports are also annotated and
provided with context
available by, for example, clickable links to additional text,
or similar explanatory
information such that the learners themselves, their teachers,
parents, school
administrators, and peers can understand and interpret those
reports, statistics, and alerts
produced from the analytics database.
BRIEF DESCRIPTION OF DRAWINGS
Figure 1 depicts the databases and dataflow for the
voice-response instructional
application, comprising a client-server networked computer
system composed of: (#1) an
integrated instructional interaction development system; (#2) an
instructional interaction
database server process and database; (#3) an interaction and
prompting phrase selection
server process; (#4) a network connection from the server to the
client; (#5) a client
computer system which may include a web browser in which the
client software is
implemented; (#6) an instruction delivery application composed
of: (#7) an interaction and prompting phrase selection client process, (#8) a display for
interaction multimedia and
prompting phrases, (#9) a microphone for speech audio input and
recording, and (#10) a
client process to record speech, determine learner analytics;
(#11) a network connection
from the client to the server, (#12) a server process to update
speech recognition results
and learner analytics; (#13) a learner analytics database server
process and database;
(#14) a server process to calculate and update learner analytics
results, reports, and
statistics; and (#15) a server process to produce, display, and
send reports, statistics, and
alerts.
Figure 2 depicts the motivation for collecting intelligibility
transcriptions, as opposed to
text-independent pronunciation assessment or pronunciation
assessment based solely on
exemplar pronunciations of students or voiceover talent.
Figure 3 depicts an example use of logistic regression for
intelligibility remediation.
Figure 4 depicts the main database records in an asynchronous
intelligibility remediation
peer learning and data collection system.
Figure 5 depicts learner analytics-based instructional prompting
phrase sequencing and
branching scenario transitions.
DESCRIPTION OF THE PREFERRED EMBODIMENT
In its preferred embodiment, the invention consists of software
modules to extend
software systems such as Moodle, a free open source
instructional course management
system, Wikipedia, a free open editable online encyclopedia,
Wiktionary, a free open
editable online dictionary, or Wikiversity, a free open editable
online instructional course
creation system. The user of such software, who typically
intends to learn the meaning,
pronunciation, grammar, morphology, and associated aspects of
words and phrases, will
be shown user interface elements to allow audio recording and
subsequent evaluation of
the audio phrase.
For example, a Wiktionary user may be presented with buttons
labeled "Record," "Stop,"
"Play," "Evaluate," and "Try in phrase." The Record button
would begin storing audio
data from the microphone, perhaps with a visual audio level
meter indicator. The Stop button would terminate the recording; the Play button would
allow the learner to listen to
the recording, perhaps to ascertain the loudness of background
noise in order to decide
whether to evaluate the recording. The Evaluate button would perform the pronunciation assessment, determine the intelligibility of the phrase, and use that information to select, compose, and produce audio or visual feedback, or both, for the learner to review in order to remediate whatever pronunciation intelligibility issues could be identified.
Finally, the "Try in phrase" button should provide an
opportunity for the learner to
practice the word in a phrase, and may link the user to a
registration and sign-in system
which records their proficiency with each phoneme, diphone,
word, and phrase in the
system so that the exercises which the learner needs to practice
the most can be provided
to them in a sequence beginning with trying to pronounce the
word in a phrase.
OPERATION AND EXPLANATION
One well-known automatic speech recognition system capable of
providing the data on
which the processes of the invention rely is the Carnegie Mellon
Sphinx Speech
Recognition Project’s PocketSphinx free open source software
described in Huggins-
Daines, et al. (2006.) The operation of the PocketSphinx system
to provide pronunciation
assessment data is described on this CMUsphinx Wiki page
tutorial describing the use of
PocketSphinx for pronunciation evaluation:
https://cmusphinx.github.io/wiki/pocketsphinx_pronunciation_evaluation
One of the most important advances of the invention over
essentially all of the prior art is
the use of physiologically nearby neighboring phonemes, which
are shown on that wiki
page as the following file encoding the speech recognition
results grammar comprised of
the physiologically nearby neighboring phonemes of the word
“with,” along with those of
the other phonemes in alphabetical order:
#JSGF V1.0;
grammar neighbors;
public <phrase> = sil <w> <ih> <dh> [sil];
<aa> = aa | ah | er | ao;
<ae> = ae | eh | er | ah;
<ah> = ah | ae | er | aa;
<ao> = ao | aa | er | uh;
<aw> = aw | aa | uh | ow;
<ay> = ay | aa | iy | oy | ey;
<b> = b | p | d;
<ch> = ch | sh | jh | t;
<dh> = dh | th | z | v;
<d> = d | t | jh | g | b;
<eh> = eh | ih | er | ae;
<er> = er | eh | ah | ao;
<ey> = ey | eh | iy | ay;
<f> = f | hh | th | v;
<g> = g | k | d;
<hh> = hh | th | f | p | t | k;
<ih> = ih | iy | eh;
<iy> = iy | ih;
<jh> = jh | ch | zh | d;
<k> = k | g | t | hh;
<l> = l | r | w;
<m> = m | n;
<ng> = ng | n;
<n> = n | m | ng;
<ow> = ow | ao | uh | aw;
<oy> = oy | ao | iy | ay;
<p> = p | t | b | hh;
<r> = r | y | l;
<s> = s | sh | z | th;
<sh> = sh | s | zh | ch;
<t> = t | ch | k | d | p | hh;
<th> = th | s | dh | f | hh;
<uh> = uh | ao | uw | uw;
<uw> = uw | uh | uw;
<v> = v | f | dh;
<w> = w | l | y;
<y> = y | w | r;
<z> = z | s | dh | z;
<zh> = zh | sh | z | jh;
The phonemes shown above are encoded in the CMUBET phonetic
alphabet, which is
described and explained on this wiki page:
https://cmusphinx.github.io/wiki/cmubet
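A neighbor grammar of the kind shown above can be generated mechanically from a phoneme-to-neighbors table. The sketch below carries only a few entries of such a table, and the `<phrase>` rule name and function name are illustrative assumptions; a full table would cover every CMUBET phoneme:

```python
# Partial neighbor table, matching three rules of the grammar above.
NEIGHBORS = {
    "w":  ["l", "y"],
    "ih": ["iy", "eh"],
    "dh": ["th", "z", "v"],
}

def jsgf_neighbor_grammar(word_phonemes, name="neighbors"):
    """Emit a JSGF grammar whose public rule matches the word's phonemes,
    each expandable to its physiologically neighboring phones."""
    lines = ["#JSGF V1.0;", f"grammar {name};"]
    body = " ".join(f"<{p}>" for p in word_phonemes)
    lines.append(f"public <phrase> = sil {body} [sil];")
    for p in sorted(NEIGHBORS):
        alts = " | ".join([p] + NEIGHBORS[p])
        lines.append(f"<{p}> = {alts};")
    return "\n".join(lines)

print(jsgf_neighbor_grammar(["w", "ih", "dh"]))  # grammar for "with"
```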
Another important advance of the invention is the use of
diphones. A diphone is the last
part of one phoneme followed by the first part of another. There
are over 1,000 diphones
in spoken English, but only about 650 of those occur with
substantial frequency. English
diphones in the CMUBET phonetic alphabet are explained and
listed with their
frequencies on this wiki page:
http://cmusphinx.github.io/wiki/diphones
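Because a diphone spans the boundary between two consecutive phonemes, a sequence of n phonemes yields n-1 diphones. A minimal sketch (the hyphenated naming is an illustrative convention):

```python
def diphones(phonemes):
    """A diphone is the last part of one phoneme followed by the first
    part of the next, so n phonemes yield n-1 diphones."""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

# "with" in CMUBET phonemes:
print(diphones(["w", "ih", "dh"]))  # ['w-ih', 'ih-dh']
```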
The use of logistic regression for intelligibility remediation
is explained by Figure 3. The
primary database records for asynchronous intelligibility
remediation using peer learning
and data collection are depicted in Figure 4. The use of learner
analytics for instructional
prompt phrase sequencing and branching scenario transitions are
explained by Figure 5.
CONCLUSION
The invention provides better speaking-skills instructional software than is presently commercially available in the state of the art. Language
students can use thousands of
free web and stand-alone software applications for learning
reading, writing, and
listening. But speaking skills instruction is limited to
expensive, cumbersome, and often
inaccurate commercial software for pronunciation assessment. The
interactive language
pronunciation assessment and remediation software of the
invention may be able to
improve students’ pronunciation of words perhaps six times
faster than commercially
available products. Millions of people worldwide currently wish
to improve their
pronunciation in order to gain access to better jobs and succeed
at more opportunities to
speak in public, on teleconferences, or to groups.
Unfortunately, the state of the art often
frustrates students by putting too much emphasis on
inconsequential mistakes. The
invention solves those problems by allowing adaptive instruction.
While the description above contains many specifics, they should
not be considered as
limitations on the scope of the invention, but rather as
exemplification of one preferred
embodiment thereof. Many other variations are possible. For
example, a children's toy
to teach speaking skills may be provided as a device with a
microphone and display, or
the software system may run in internet web browsers as software
executed by the
browsers as, for example, program code in the JavaScript
computer programming
language. Accordingly, the scope of the invention should be
determined not by the
embodiments as described and illustrated, but by the following
claims.
CLAIMS
What is claimed is:
(1) A networked client-server computer system composed of:
(a) an instructional interaction database server process and
database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process
(#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting
phrases (#8),
(e)(3) a microphone for speech audio input and recording
(#9),
(e)(4) a client process to record speech, determine learner
analytics, such as the quality
and intelligibility of the learner’s phoneme, diphone, syllable,
and word production; their
word and phrase comprehension; their ability to both comprehend
and use grammatical
forms; word stem morphology production and comprehension;
"can-do" criteria such as
arbitrary instructional objectives and subject matter; the
learner's measured confidence,
effort, and independence; and use those analytics to assess
resulting achievement and
-
Page 20 of 36
salsman-culnane-specification 6/2/17, 4:47 PM
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server
(#11);
(g) a server process to update speech recognition results and
learner analytics, such as the
quality and intelligibility of the learner’s phoneme, diphone,
syllable, and word
production; their word and phrase comprehension; their ability
to both comprehend and
use grammatical forms; word stem morphology production and
comprehension; "can-do"
criteria, including arbitrary instructional objectives and
subject matter; the learner's
measured confidence, effort, and independence; and use those
analytics to assess
resulting achievement and progress scores from the learner’s
audio input (#12);
(h) a learner analytics database server process and database
(#13);
(i) a server process to calculate and update learner analytics
results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports,
statistics, and alerts (#15).
(2) The computer system of Claim 1 with an integrated
instructional interaction
development system (#1) composed of a means to input, edit, and extend branching
scenario instructional interactions composed of multiple choice
response instructional
content, such as: the Twine (twinery.org) Twee language and
"Choose Your Own
Adventure" role-play interactions, which can be added, changed,
and removed by editing
a database of interactions in a manner similar to editing a wiki
such as Wikipedia or
Wiktionary.
(3) The computer system and instructional interaction
development system of Claim 2,
with a means of phonetic disambiguation of homographs (words
that are spelled
identically but pronounced differently) presented to the
instructional interaction
developer for disambiguation by selection of alternative
pronunciations during input and
editing.
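By way of illustrative, non-limiting example, the homograph disambiguation step of Claim 3 might present the developer with alternatives drawn from a pronunciation table; the table below and its ARPAbet-style transcriptions are hypothetical, not part of the claimed system:

```python
# Hypothetical homograph table mapping spellings to alternative
# ARPAbet-style pronunciations for developer selection during editing.
HOMOGRAPHS = {
    "read": ["R IY D", "R EH D"],   # present tense vs. past tense
    "lead": ["L IY D", "L EH D"],   # the verb vs. the metal
}

def pronunciations(word):
    """Return the alternative pronunciations of a homograph for the
    instructional interaction developer to choose among; a word that is
    not a homograph has nothing to disambiguate."""
    return HOMOGRAPHS.get(word.lower(), [])
```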
(4) The computer system and instructional interaction
development system of Claim 2,
with a means of part of speech (e.g., noun, verb, article,
adjective, conjunction,
preposition, adverb, etc.) labeling of each word of the
instructional interaction prompting
phrases presented for selection of each word’s part of speech
during instructional
interaction input and editing.
(5) The computer system of Claim 1, with a means of peer
consensus-based validation of instructional content, composed of a
way for learners, instructors, parents, and administrators to verify
that each node and each transition between nodes in the branching
scenario instructional interactions is separately validated by
instructor data entry and review, peer learner review, or both.
(6) The computer system of Claim 1, with a means of caching
stand-alone exercises for offline execution, comprised of a process
reading instructional interactions and associated data from the
system network input interface (#4), which caches instructional
interactions during download, allowing them to be used when the
network becomes disconnected; and a process storing results in
nonvolatile storage when the system network output interface is
unavailable, so that the system will still be usable when
disconnected from the network, or when downloads or uploads or both
are inhibited, such that the system can perform in a manner
consistent with stand-alone operation compatible with free, freemium,
or paid content accession models.
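By way of non-limiting example, the offline caching behavior of Claim 6 might be sketched as follows; the cache file names and the `fetch` callable are hypothetical stand-ins for the system network interfaces:

```python
import json
import os

CACHE_FILE = "interactions_cache.json"   # hypothetical download cache
PENDING_FILE = "pending_results.json"    # results queued while offline

def load(path):
    """Read a JSON file from nonvolatile storage, or return an empty dict."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def save(path, data):
    with open(path, "w") as f:
        json.dump(data, f)

def get_interaction(key, fetch=None):
    """Return an instructional interaction, refreshing the cache from the
    network when a fetch callable is available, and falling back to the
    cached copy when the network is disconnected (fetch is None or fails)."""
    cache = load(CACHE_FILE)
    if fetch is not None:
        try:
            cache[key] = fetch(key)
            save(CACHE_FILE, cache)
        except OSError:
            pass                         # network failed; use cached copy
    return cache.get(key)                # None if never downloaded

def record_result(result):
    """Queue a learner result in nonvolatile storage for later upload
    when the network output interface is unavailable."""
    pending = load(PENDING_FILE)
    pending.setdefault("results", []).append(result)
    save(PENDING_FILE, pending)
```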
(7) The computer system of Claim 1, with a means of extensible
vocabulary, composed of
processes to assist in increasing the number and type of words
contained in prompting
phrases by length, subject matter, vocational topic, geography,
languages, morphological
features, and other topics and aspects.
(8) The computer system of Claim 1, with a means of extensible
prompting phrases and
branching scenario interaction modules, allowing for increasing
the number and type of
prompting phrases and branching scenario interaction modules by
length, subject matter,
vocational topic, geographies, languages, grammatical features,
"can-do" criteria, and
other criteria and aspects.
(9) The computer system of Claim 1, with a means of
instructional interaction sequencing
composed of processes for registration and sign-in, a process to
allow recording learners'
proficiency with each phoneme, diphone, word, and other learner
analytics, and a process
to determine which instructional content modules, such as branching
scenarios and prompting phrases, the learner needs to practice the most,
and a process to provide
learners those instructional content modules in sequence (Figure
5.)
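By way of non-limiting example, the sequencing determination of Claim 9 might order modules by the learner's recorded proficiency with the segments each module exercises; the data shapes below are illustrative assumptions:

```python
def sequence_modules(modules, proficiency):
    """Order instructional content modules so those exercising the
    learner's weakest segments (lowest recorded proficiency) come first.

    modules: {module_name: [segments practiced, e.g. phonemes or diphones]}
    proficiency: {segment: score in 0..1}; unseen segments default to 0.0,
    i.e. most in need of practice.
    """
    def need(segments):
        return sum(proficiency.get(s, 0.0) for s in segments) / len(segments)
    return sorted(modules, key=lambda m: need(modules[m]))
```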
(10) The computer system of Claim 1, with a means of authentic
intelligibility
remediation composed of two processes:
(a) to obtain recorded audio prompting phrase utterances and their
transcriptions from native and foreign language transcriptionists,
and to create a predictive model of the consequence of observed
mispronunciations as follows:
(a)(1) obtain learner attempts at pronouncing a number of
phrases, each associated with a
branching scenario instructional interaction transition in the
form of recorded audio;
(a)(2) using the recorded audio attempts, categorize each word
as having been transcribed
either correctly or incorrectly;
(a)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(a)(4) using the recorded audio of each word and the proportion of
the time that it was transcribed correctly, use logistic regression
to model the
consequence of each
mispronunciation for prediction of the likelihood that the word
was correctly transcribed,
from the independent variables produced by the automatic speech
recognition results
(Figure 3); and
(a)(5) store the results of the logistic regression predictive
model as weight coefficients
for each of the independent variables of each word of each
prompting phrase in the
predictive model; and
(b) to provide learner exercise interaction as follows:
(b)(1) display one or more prompting phrases;
(b)(2) record audio from the learner;
(b)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(b)(4) scale the results of the automatic speech recognition
according to the weights
stored in step (a)(5) to determine the expected probability that
each word is intelligible;
(b)(5) rank each of the predicted unintelligible words by
consequence according to part of
speech and predictive model probability magnitude;
(b)(6) provide audio or audio and visual feedback to the learner
based on their most
consequential pronunciation mistake as expected by the
predictive model; and
(b)(7) as part of the audio feedback, replay the learner's most
consequential
mispronunciation followed by another two prerecorded audio
words, one of which
includes the phoneme or diphone associated with the observed
sound constituting the
mispronunciation, followed by a word with the phoneme or diphone
associated with the
correct pronunciation.
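By way of illustrative, non-limiting example, the model of steps (a)(4), (a)(5), (b)(4), and (b)(5) might be sketched as follows; the particular feature set, the plain gradient-ascent fit, and the part-of-speech weights are assumptions for exposition, not the specification's required implementation:

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Steps (a)(4)-(a)(5): fit per-word weights predicting whether a
    word was transcribed correctly. Each row of X holds ASR-derived
    independent variables, e.g. [duration, acoustic confidence, score of
    the nearest competing segment]; y[i] is 1 if transcribers recovered
    the word. Returns the stored weight vector (bias first)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = yi - p                  # gradient of the log-likelihood
            w[0] += lr * g
            for j, xj in enumerate(xi):
                w[j + 1] += lr * g * xj
    return w

def intelligibility(w, features):
    """Step (b)(4): scale new ASR results by the stored weights to obtain
    the expected probability that the word is intelligible."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def rank_by_consequence(words, pos_weight):
    """Step (b)(5): order predicted-unintelligible words (p < 0.5), most
    consequential first, weighting by part of speech.
    words: [(word, part_of_speech, p_intelligible)]."""
    return sorted((w for w in words if w[2] < 0.5),
                  key=lambda w: pos_weight.get(w[1], 1.0) * (1.0 - w[2]),
                  reverse=True)
```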
(11) The computer system of Claim 1, with a means of multiple
pass automatic speech
recognition composed of learner analytics assessment processes
to determine temporal
endpoints, and thereby the duration, and acoustic scores for
speech segments such as
phonemes, diphones, syllables, and words of prompting phrases,
wherein anomalous
durations of those segments guide multiple passes of automatic
speech recognition of the
same audio input using different speech recognition grammars
representing utterance
expectations.
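By way of non-limiting example, the duration-anomaly test that triggers additional recognition passes in Claim 11 might be sketched as follows; the fractional tolerance and the expected-duration table are illustrative assumptions:

```python
def plan_passes(segments, expected, tolerance=0.5):
    """Flag speech segments whose recognized duration deviates from
    expectation by more than `tolerance` (as a fraction of the typical
    duration); each flagged segment warrants another recognition pass of
    the same audio under a different speech recognition grammar.

    segments: {segment: observed duration in seconds}
    expected: {segment: typical duration in seconds}
    """
    anomalous = []
    for seg, dur in segments.items():
        typical = expected.get(seg)
        if typical and abs(dur - typical) / typical > tolerance:
            anomalous.append(seg)
    return anomalous
```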
(12) The computer system of Claim 1, with a means of
speech-language pathology
reporting composed of surveying the current terms used in, the manner
of presentation of, the printed forms composing, the order of
presentation of, and the
information contained in
reports used by practicing speech-language pathologists, and
then formatting reports,
statistics, and alerting messages according to the surveyed
descriptions of those reports.
(13) A networked client-server computer system composed of:
(a) an instructional interaction database server process and
database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process
(#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting
phrases (#8),
(e)(3) a microphone for speech audio input and recording
(#9),
(e)(4) a client process to record speech, determine learner
analytics, including the quality
and intelligibility of the learner’s phoneme, diphone, syllable,
and word production; their
word and phrase comprehension; their ability to both comprehend
and use grammatical
forms; word stem morphology production and comprehension;
"can-do" criteria such as
arbitrary instructional objectives and subject matter; the
learner's measured confidence,
effort, and independence; and use those analytics to assess
resulting achievement and
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server
(#11);
(g) a server process to update speech recognition results and
learner analytics, including
the quality and intelligibility of the learner’s phoneme,
diphone, syllable, and word
production; their word and phrase comprehension; their ability
to both comprehend and
use grammatical forms; word stem morphology production and
comprehension; "can-do"
criteria, including arbitrary instructional objectives and
subject matter; the learner's
measured confidence, effort, and independence; and use those
analytics to assess
resulting achievement and progress scores from the learner’s
audio input (#12);
(h) a learner analytics database server process and database
(#13);
(i) a server process to calculate and update learner analytics
results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports,
statistics, and alerts (#15);
(k) an integrated instructional interaction development system
(#1) composed of a means
to input, edit, and extend branching scenario instructional
interactions composed of
multiple choice response instructional content, such as: the
Twine (twinery.org) Twee
language and "Choose Your Own Adventure" role-play interactions,
which can be added,
changed, and removed by editing a database of interactions in a
manner similar to editing
a wiki such as Wikipedia or Wiktionary;
(l) a means of phonetic disambiguation of homographs (words that
are spelled identically
but pronounced differently) presented to the instructional
interaction developer for
disambiguation by selection of alternative pronunciations during
input and editing;
(m) a means of part of speech (e.g., noun, verb, article,
adjective, conjunction,
preposition, adverb, etc.) labeling of each word of the
instructional interaction prompting
phrases presented for selection of each word’s part of speech
during instructional
interaction input and editing;
(n) a means of peer consensus-based validation of instructional
content, composed of a way for learners, instructors, parents, and
administrators to verify that each node and each transition between
nodes in the branching scenario instructional interactions is
separately validated by instructor data entry and review, peer
learner review, or both;
(o) a means of caching stand-alone exercises for offline execution,
comprised of a process reading instructional interactions and
associated data from the system network input interface (#4), which
caches instructional interactions during download, allowing them to
be used when the network becomes disconnected; and a process storing
results in nonvolatile storage when the system network output
interface is unavailable, so that the system will still be usable
when disconnected from the network, or when downloads or uploads or
both are inhibited, such that the system can perform in a manner
consistent with stand-alone operation compatible with free, freemium,
or paid content accession models;
(p) a means of extensible vocabulary, composed of processes to
assist in increasing the
number and type of words contained in prompting phrases by
length, subject matter,
vocational topic, geography, languages, morphological features,
and other topics and
aspects.
(q) a means of extensible prompting phrases and branching
scenario interaction modules,
allowing for increasing the number and type of prompting phrases
and branching scenario
interaction modules by length, subject matter, vocational topic,
geographies, languages,
grammatical features, "can-do" criteria, and other criteria and
aspects.
(r) a means of instructional interaction sequencing composed of
processes for registration
and sign-in, a process to allow recording learners' proficiency
with each phoneme,
diphone, word, and other learner analytics, and a process to
determine which instructional content modules, such as branching
scenarios and prompting phrases, the learner needs to practice the
most, and a process to provide learners
those instructional content
modules in sequence (Figure 5.)
(s) a means of authentic intelligibility remediation composed of
two processes:
(s)(1) to obtain recorded audio prompting phrase utterances and their
transcriptions from native and foreign language transcriptionists,
and to create a predictive model of the consequence of observed
mispronunciations as follows:
(s)(1)(a) obtain learner attempts at pronouncing a number of
phrases, each associated
with a branching scenario instructional interaction transition
in the form of recorded
audio;
(s)(1)(b) using the recorded audio attempts, categorize each
word as having been
transcribed either correctly or incorrectly;
(s)(1)(c) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(s)(1)(d) using the recorded audio of each word and the proportion of
the time that it was transcribed correctly, use logistic regression
to model the
consequence of each
mispronunciation for prediction of the likelihood that the word
was correctly transcribed,
from the independent variables produced by the automatic speech
recognition results
(Figure 3); and
(s)(1)(e) store the results of the logistic regression
predictive model as weight coefficients
for each of the independent variables of each word of each
prompting phrase in the
predictive model; and
(s)(2) to provide learner exercise interaction as follows:
(s)(2)(a) display one or more prompting phrases;
(s)(2)(b) record audio from the learner;
(s)(2)(c) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(s)(2)(d) scale the results of the automatic speech recognition
according to the weights
stored in step (s)(1)(e) to determine the expected probability
that each word is
intelligible;
(s)(2)(e) rank each of the predicted unintelligible words by
consequence according to part
of speech and predictive model probability magnitude;
(s)(2)(f) provide audio or audio and visual feedback to the
learner based on their most
consequential pronunciation mistake as expected by the
predictive model; and
(s)(2)(g) as part of the audio feedback, replay the learner's
most consequential
mispronunciation followed by another two prerecorded audio
words, one of which
includes the phoneme or diphone associated with the observed
sound constituting the
mispronunciation, followed by a word with the phoneme or diphone
associated with the
correct pronunciation;
(t) a means of multiple pass automatic speech recognition
composed of learner analytics
assessment processes to determine temporal endpoints, and
thereby the duration, and
acoustic scores for speech segments such as phonemes, diphones,
syllables, and words of
prompting phrases, wherein anomalous durations of those segments
guide multiple passes
of automatic speech recognition of the same audio input using
different speech
recognition grammars representing utterance expectations;
and
(u) a means of speech-language pathology reporting composed of
surveying the current terms used in, the manner of presentation of,
the printed forms composing,
the order of presentation
of, and the information contained in reports used by practicing
speech-language
pathologists, and then formatting reports, statistics, and
alerting messages according to
the surveyed descriptions of those reports.
(14) A networked client-server computer system composed of:
(a) an instructional interaction database server process and
database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process
(#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting
phrases (#8),
(e)(3) a microphone for speech audio input and recording
(#9),
(e)(4) a client process to record speech, determine learner
analytics, such as the quality
and intelligibility of the learner’s phoneme, diphone, syllable,
and word production; their
word and phrase comprehension; their ability to both comprehend
and use grammatical
forms; word stem morphology production and comprehension;
"can-do" criteria such as
arbitrary instructional objectives and subject matter; the
learner's measured confidence,
effort, and independence; and use those analytics to assess
resulting achievement and
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server
(#11);
(g) a server process to update speech recognition results and
learner analytics, such as the
quality and intelligibility of the learner’s phoneme, diphone,
syllable, and word
production; their word and phrase comprehension; their ability
to both comprehend and
use grammatical forms; word stem morphology production and
comprehension; "can-do"
criteria, including arbitrary instructional objectives and
subject matter; the learner's
measured confidence, effort, and independence; and use those
analytics to assess
resulting achievement and progress scores from the learner’s
audio input (#12);
(h) a learner analytics database server process and database
(#13);
(i) a server process to calculate and update learner analytics
results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports,
statistics, and alerts (#15).
(15) The computer system of Claim 14 with an integrated
instructional interaction
development system (#1) composed of a means to input, edit, and
extend branching
scenario instructional interactions composed of multiple choice
response instructional
content, such as: the Twine (twinery.org) Twee language and
"Choose Your Own
Adventure" role-play interactions, which can be added, changed,
and removed by editing
a database of interactions in a manner similar to editing a wiki
such as Wikipedia or
Wiktionary.
(16) The computer system of Claim 14, with a means of caching
stand-alone exercises for offline execution, comprised of a process
reading instructional interactions and associated data from the
system network input interface (#4), which caches instructional
interactions during download, allowing them to be used when the
network becomes disconnected; and a process storing results in
nonvolatile storage when the system network output interface is
unavailable, so that the system will still be usable when
disconnected from the network, or when downloads or uploads or both
are inhibited, such that the system can perform in a manner
consistent with stand-alone operation compatible with free, freemium,
or paid content accession models.
(17) The computer system of Claim 14, with a means of
instructional interaction
sequencing composed of processes for registration and sign-in, a
process to allow
recording learners' proficiency with each phoneme, diphone,
word, and other learner
analytics, and a process to determine which instructional content
modules, such as branching scenarios and prompting phrases, the
learner needs to practice the most,
and a process to provide learners those instructional content
modules in sequence (Figure
5.)
(18) The computer system of Claim 14, with a means of authentic
intelligibility
remediation composed of two processes:
(a) to obtain recorded audio prompting phrase utterances and their
transcriptions from native and foreign language transcriptionists,
and to create a predictive model of the consequence of observed
mispronunciations as follows:
(a)(1) obtain learner attempts at pronouncing a number of
phrases, each associated with a
branching scenario instructional interaction transition in the
form of recorded audio;
(a)(2) using the recorded audio attempts, categorize each word
as having been transcribed
either correctly or incorrectly;
(a)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(a)(4) using the recorded audio of each word and the proportion of
the time that it was transcribed correctly, use logistic regression
to model the
consequence of each
mispronunciation for prediction of the likelihood that the word
was correctly transcribed,
from the independent variables produced by the automatic speech
recognition results
(Figure 3); and
(a)(5) store the results of the logistic regression predictive
model as weight coefficients
for each of the independent variables of each word of each
prompting phrase in the
predictive model; and
(b) to provide learner exercise interaction as follows:
(b)(1) display one or more prompting phrases;
(b)(2) record audio from the learner;
(b)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(b)(4) scale the results of the automatic speech recognition
according to the weights
stored in step (a)(5) to determine the expected probability that
each word is intelligible;
(b)(5) rank each of the predicted unintelligible words by
consequence according to part of
speech and predictive model probability magnitude;
(b)(6) provide audio or audio and visual feedback to the learner
based on their most
consequential pronunciation mistake as expected by the
predictive model; and
(b)(7) as part of the audio feedback, replay the learner's most
consequential
mispronunciation followed by another two prerecorded audio
words, one of which
includes the phoneme or diphone associated with the observed
sound constituting the
mispronunciation, followed by a word with the phoneme or diphone
associated with the
correct pronunciation.
(19) The computer system of Claim 14, with a means of multiple
pass automatic speech
recognition composed of learner analytics assessment processes
to determine temporal
endpoints, and thereby the duration, and acoustic scores for
speech segments such as
phonemes, diphones, syllables, and words of prompting phrases,
wherein anomalous
durations of those segments guide multiple passes of automatic
speech recognition of the
same audio input using different speech recognition grammars
representing utterance
expectations.
(20) The computer system of Claim 14, with a means of
speech-language pathology
reporting comprised of surveying the current terms used in, the
manner of presentation of, the printed forms composing, the order of
presentation of, and the
information contained in
reports used by practicing speech-language pathologists, and
then formatting reports,
statistics, and alerting messages according to the surveyed
descriptions of those reports.
ABSTRACT
This invention is a method of interactive computer-aided
instruction for general education
including speaking skills. Learners are asked to read text
prompting phrases into a
microphone in response to multiple choice questions. Automatic
speech recognition is
used to assess the pronunciation and provide remediation, in the
form of audio or visual
responses or both, based on the authentic intelligibility of the
learners' spoken responses
determined from transcriptions of other learners' utterances of
the same prompting
phrases.
PROVISIONAL PATENT APPLICATION AND DISCLOSURE DOCUMENT
REFERENCES
The foregoing utility patent application specification claims the
earlier date of James
Salsman's U.S. provisional patent application of March 4, 2016,
entitled, “Pronunciation
Assessment for Intelligibility Remediation.” The delay in filing
the present application
beyond the one-year statutory limit was unavoidable, but was less
than the two-month regulatory exemption for unavoidable delay. The
present
application also makes reference
to U.S. Patent and Trademark Office Disclosure Document number
S00867 filed by
James Salsman on October 23, 1998, entitled, “Solar-powered
Portable Reading
Instruction System.”