-
Page 1 of 36
salsman-culnane-specification 6/2/17, 4:47 PM
PRONUNCIATION ASSESSMENT FOR INTELLIGIBILITY REMEDIATION
Utility Patent Specification
by James Salsman, Fort Lupton; and Lance Culnane, Westminster,
both of Colorado.
April 22, 2017
BACKGROUND CITATIONS
U.S. Patent Documents:
5,679,001: Russell, et al. (1997) “Children's speech training aid.”
5,920,838: Mostow, et al. (1999) “Reading and Pronunciation Tutor.”
6,634,887: Heffernan, III, et al. (2003) “Methods and Systems for Tutoring Using a Tutorial Model with Interactive Dialog.”
6,963,841: Handal, et al. (2005) “Speech Training Method with Alternative Proper Pronunciation Database.”
7,752,045: Eskenazi, et al. (2010) “Systems and Methods for Comparing Speech Elements.”
8,109,765: Beattie, et al. (2012) “Intelligent Tutoring Feedback.”
8,271,281: Jayadeva, et al. (2012) “Method for Assessing Pronunciation Abilities.”
8,744,856: Ravishankar (2014) “Computer implemented system and method and computer program product for evaluating pronunciation of phonemes in a language.”
9,520,068: Beattie, et al. (2016) “Sentence Level Analysis in a Reading Tutor.”
Other References:
Chen and Li (2016) “Computer-assisted pronunciation training: From pronunciation scoring towards spoken language learning,” in Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference. http://www.apsipa.org/proceedings_2016/HTML/paper2016/227.pdf
Cole, et al. (1999) “A platform for multilingual research in spoken dialogue systems,” in Proceedings of the Multilingual Interoperability in Speech Technology Conference (Leusden, Netherlands). http://www.cslu.ogi.edu/people/hosom/pubs/cole_MIST-platform_1999.pdf
Hawkins, J.A., and Filipović, L. (2012) Criterial Features in L2 English: Specifying the Reference Levels of the Common European Framework (United Kingdom: Cambridge University Press). https://drive.google.com/open?id=0B73LgocyHQnfcEVacmZRc2xEQ3VIZ0tkMHNmdjhNOXVsS1VR
Huggins-Daines, et al. (2006) “PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://www.cs.cmu.edu/~awb/papers/ICASSP2006/0100185.pdf
Kibishi, et al. (2014) “A statistical method of evaluating the pronunciation proficiency/intelligibility of English presentations by Japanese speakers,” ReCALL (European Association for Computer Assisted Language Learning), doi:10.1017/S0958344014000251. http://www.slp.ics.tut.ac.jp/Material_for_Our_Studies/Papers/shiryou_last/e2014-Paper-01.pdf
Loukina, et al. (2015) “Pronunciation accuracy and intelligibility of non-native speech,” in Interspeech 2015, the Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association (Dresden, Germany: Educational Testing Service). http://www.oeft.com/su/pdf/interspeech2015b.pdf
Panayotov, V., et al. (2015) “LibriSpeech: An ASR Corpus Based on Public Domain Audio Books,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015). http://www.danielpovey.com/files/2015_icassp_librispeech.pdf
Proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training, June 6–8, 2012, KTH, Stockholm, Sweden. http://www.speech.kth.se/isadept/ISADEPT-proceedings.pdf
Proceedings of the Workshop on Speech and Language Technology in Education, September 4–5, 2015 (Satellite Workshop of Interspeech 2015 and the ISCA Special Interest Group SLaTE), Leipzig, Germany. https://www.slate2015.org/files/SLaTE2015-Proceedings.pdf
Ronanki, S.; Salsman, J.; and Bo, L. (December 2012) “Automatic Pronunciation Evaluation and Mispronunciation Detection Using CMUSphinx,” in the Proceedings of the 24th International Conference on Computational Linguistics (Mumbai, India: COLING 2012), pp. 61–67. http://www.aclweb.org/anthology/W12-5808
Salsman, J. (July 2014) “Development challenges in automatic speech recognition for computer assisted pronunciation teaching and language learning,” in Proceedings of the Research Challenges in Computer Aided Language Learning Conference (Antwerp, Belgium: CALL 2014). http://talknicer.com/Salsman-CALL-2014.pdf
Computer-Assisted Pronunciation Teaching (CAPT) Bibliography: http://liceu.uab.es/~joaquim/applied_linguistics/L2_phonetics/CALL_Pron_Bib.html
FIELD OF THE INVENTION
This invention relates to the field of computer-assisted pronunciation training (CAPT) using automatic speech recognition for language learning, speech-language pathology, and reading tutoring, such as described by Russell, et al. (1997) “Children's speech training aid,” U.S. Patent 5,679,001. The assessment and remediation of the authentic intelligibility of learners' spoken language, as measured by agreement with panels of non-expert word transcriptionists including both native and non-native listeners, provides substantial advantages over the current state of the art, which instead typically assesses formal pronunciation agreement with a panel of native-language pronunciation experts, because formal mispronunciations are associated with only 16% of the measured authentic intelligibility of words, according to Loukina, et al. (2015.)
DISCUSSION OF PRIOR ART
While Kibishi et al. (2014) have demonstrated the achievement of
75% agreement with
authentic word transcription, even earlier work by Ronanki,
Salsman, and Bo (2012)
produced open source software implementing means of more precise
discrimination
between consequential and incidental errors by allowing accent
and dialect adaptation
using physiologically neighboring phones (phonemes and diphones)
derived from the
adjacency of vocal tract components, e.g., in the positions and
configuration of the lips,
teeth, tongue, jaw, vocal folds, nasal flap, and diaphragm.
As stated in Salsman (2014), “To best support language
instruction, we have been
developing the use of physiologically neighboring phonemes,
i.e., sounds produced with
similar vocal tract articulations, to identify and discern
between serious
mispronunciations and incidental errors (Ronanki et al., 2012.)
We are using diphones,
i.e. the last half of one phoneme followed by the first half of
the next, as alternatives and
supplements to phonemes and triphones for both automatic speech
recognition and
pronunciation scoring (Cole, et al., 1999.) We plan to model
learner fluency and select
the sequence of self-study practice exercises using cumulative
diphone scores. We are
scoring segment durations to indicate syllables and words
pronounced too quickly
relative to exemplary pronunciations. We have measured
substantial potential
improvements from all of these techniques.
“The language instructor’s experience of computer-assisted
pronunciation assessment can
be enhanced by offering comparisons of students’ utterances to
exemplary pronunciations
in ways that illustrate the measurements of physiologically
neighboring phonemes,
diphones, and speech segment durations. For example,
mispronunciations might be
annotated with International Phonetic Alphabet symbols for both
the expected
pronunciation and its physiologically neighboring phoneme which
most closely matched
the observed speech. Diphones can be used to highlight difficult
phonetic transitions, for
example when two adjacent phonemes are both mispronounced.
Duration scoring can
annotate not just words and sub-word segments given insufficient
emphasis, e.g. such as
might confuse ‘fourteen’ with ‘forty,’ and can highlight missing
glottal stops essential to
discern, for example, ‘harder’ from ‘hard or.’”
Pronunciation assessment and CAPT responses should be based on at least 44 exemplary pronunciations for each response word or phrase, comprising both genders, two age groups (such as speakers in their 20s and 50s), and, for English, at least eleven geographic regions, in order to provide world-wide English accent and dialect adaptation coverage. For
English, such exemplary pronunciations should be recorded from
native language
speakers selected from, for example, Australia, Canada, Ireland,
New Zealand, South
Africa, London (Standard Southern English), London (Cockney),
London (Received
Pronunciation), Birmingham, Cornwall, East Anglia, East
Yorkshire, North Wales,
Edinburgh, Ulster, Dublin, Boston, Midwestern US (i.e., in or
west of Michigan,
Pennsylvania, Missouri, or New Mexico), New England, New York
City, and the
Southern U.S. Gulf Coast region.
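The 44-exemplar figure above follows directly from the coverage factors: two genders times two age groups times at least eleven geographic regions. A minimal sketch of this arithmetic (constant names are illustrative, not from the specification):

```python
# Minimal coverage calculation for exemplar recordings per prompt.
GENDERS = 2      # both genders
AGE_GROUPS = 2   # e.g., speakers in their 20s and 50s
REGIONS = 11     # at least eleven geographic regions for English

def min_exemplars(genders=GENDERS, age_groups=AGE_GROUPS, regions=REGIONS):
    """Minimum exemplar pronunciations needed per response word or phrase."""
    return genders * age_groups * regions

print(min_exemplars())  # 2 * 2 * 11 = 44
```

Broader regional coverage raises the minimum proportionally; the specification's region list is a floor, not a ceiling.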
Learner analytics (scoring pronunciation for CAPT and grading
authentic intelligibility)
may include log-normal means and variances of phoneme, diphone,
word, and phrase
acoustic scores and durations, along with cumulative phoneme and
diphone scores;
mispronunciations ranked by consequential interference with
intelligibility for each word
in an utterance and for the whole utterance; tonality scores for
tonal languages; language
grammar, morphology, and vocabulary criterial feature coverage
scores; and subject
matter topic correctness and coverage scores.
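The log-normal means and variances mentioned above can be computed directly from log-transformed segment durations. A minimal sketch, assuming durations in seconds (function name is illustrative):

```python
import math

def lognormal_stats(durations):
    """Log-normal mean and variance of positive segment durations
    (e.g., phoneme, diphone, word, or phrase durations in seconds)."""
    logs = [math.log(d) for d in durations]
    n = len(logs)
    mu = sum(logs) / n                          # mean of log-durations
    var = sum((x - mu) ** 2 for x in logs) / n  # variance of log-durations
    return mu, var

# Illustrative phoneme durations for one segment across four utterances:
mu, var = lognormal_stats([0.09, 0.11, 0.10, 0.12])
```

Working in the log domain is the conventional choice for durations, which are positive and right-skewed.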
The intelligibility scoring system should agree with a panel of
non-expert, authentic
native and non-native language word transcriptionists. Beyond
the logistic regression of
word intelligibility by such transcriptionists, other machine
learning techniques may
include, but are not limited to, those of Kibishi et al. (2014),
such as symbolic regression,
general and nonlinear regression, classification, artificial
neural networks, support vector
machines, learning vector quantization, or self-organizing maps.
Quality assurance
should be performed by measuring the extent to which the
resulting intelligibility scores
match those of an actual panel of such non-expert native and
non-native word
transcriptionists, preferably using blind or double-blind
analysis. Transcriptionist data
may be enhanced with automatic spelling correction.
Intelligibility determination may be
enhanced with word frequency-based phonological similarity
measures of speech
ambiguity.
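The logistic regression of word intelligibility mentioned above can be sketched in a few lines: fit the probability that a transcriptionist recovers a word from its acoustic score. The training pairs and gradient-descent fit below are purely illustrative; real weights would be fit against actual panel transcription data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(scores, transcribed, lr=0.1, epochs=2000):
    """Fit p(word transcribed correctly | acoustic score) by logistic
    regression with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(scores, transcribed):
            p = sigmoid(w * x + b)
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Illustrative data: higher acoustic scores are transcribed correctly more often.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(scores, labels)
```

The other techniques listed (neural networks, support vector machines, and so on) would replace this fitting step while keeping the same inputs and labels.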
Learner remediation may include audio and visual feedback using
expected and observed
phones and their durations to show vocal tract sagittal sections
and front-facing lip static
graphic diagrams and animations along with spoken audio and text
describing corrective
vocal tract motions in the learner's preferred language with
examples in that language.
OBJECTIVES AND ADVANTAGES
The invention eliminates pronunciation assessment feedback which does not involve a consequential mispronunciation interfering with the student's authentic intelligibility, and provides feedback as a pair of audio words in the learner's first language, the first containing the correct phoneme and the second containing the mistaken sound produced.
To achieve those goals, we collect transcriptions of learner
utterances. For example, while
displaying, “Please listen to this phrase and type in the
English words you hear,” play this
audio for the phrase: “I'm here on behalf of the Excellence
Hotel group.” For this
example, let's say that in the audio, “behalf” was mispronounced
as “beh-alf” and
“Excellence” was mispronounced as “Excellent” but everything
else was good. The
learner types in the text: “I'm here on behalf of the excellent
hotel group.” (I.e., the
transcribing advanced learner gets “behalf” right, but doesn't
transcribe Excellence
correctly because it was mispronounced.) The system sees that
“Excellence” was not
transcribed correctly, while the SR system reports two
mispronunciations. Therefore, the system updates the database entry for this phrase, tallying the corresponding phonemes in “behalf” as inconsequential, but the final phoneme /s/ in “excellence” as consequential if mispronounced.
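The tallying step above can be sketched as follows. A mispronunciation flagged by the recognizer is counted as inconsequential when the transcriptionist still recovered the word, and consequential when the word was lost. The data structure and function name are illustrative:

```python
from collections import defaultdict

# Per-phrase tallies: how often a mispronounced phoneme did or did not
# break a transcriptionist's recognition of the word.
tallies = defaultdict(lambda: {"consequential": 0, "inconsequential": 0})

def update_tallies(mispronounced, transcribed_words):
    """mispronounced: (word, phoneme) pairs flagged by the recognizer.
    A mispronunciation is inconsequential when the word still came
    through in the transcription, consequential when it did not."""
    transcribed = {w.lower() for w in transcribed_words}
    for word, phoneme in mispronounced:
        key = "inconsequential" if word.lower() in transcribed else "consequential"
        tallies[(word.lower(), phoneme)][key] += 1

# The "behalf"/"excellence" example from the text:
update_tallies(
    mispronounced=[("behalf", "hh"), ("excellence", "s")],
    transcribed_words="I'm here on behalf of the excellent hotel group".split(),
)
```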
After sufficient data is collected, inconsequential mispronunciations can be ignored. The database of prompting phrases will have a probability associated with each phoneme, by which we can scale (or "weight," per Figure 2) each mispronunciation's acoustic score to establish the cut-off point for the scaled values which will not be scored as wrong, e.g., by displaying the word as green or yellow instead of orange or red.
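The scaling and color-banding just described can be sketched as below. The specific cut-off value and band boundaries are illustrative assumptions, not prescribed by the specification:

```python
def weighted_score(acoustic_score, consequence_probability):
    """Scale a mispronunciation's acoustic score by the tallied probability
    that mispronouncing this phoneme breaks intelligibility."""
    return acoustic_score * consequence_probability

def feedback_color(scaled, cutoff=0.5):
    """Map a scaled score to a display color band (bands are illustrative)."""
    if scaled < cutoff / 2:
        return "green"    # not scored as wrong
    if scaled < cutoff:
        return "yellow"   # borderline, still not scored as wrong
    if scaled < cutoff * 2:
        return "orange"
    return "red"
```

Scores below the cut-off thus render as green or yellow and generate no corrective feedback.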
Using a recorded audio library of words in each learner's first language, each containing a given phoneme near the front, instead of showing green/yellow/orange the audio recording of, e.g., a Spanish word which starts with an /s/ sound can be played. For example, a recording saying, in Spanish audio, “When you said excellence [the target word in English] you needed the sound that [a Spanish word starting with /s/] starts with, but instead you pronounced the sound that [a Spanish word starting with /t/] starts with. Listen to what you said. [Playing the audio of the learner's mispronounced word.] You were supposed to say excellence [the word in English again]. Click Replay to hear this again,” can be played while displaying the word “Excellence” and, e.g., two buttons labeled Replay and Continue.
The specific advantageous improvements of the invention
include:
Learner analytics: Learners are scored by any combination of the quality and intelligibility of their phoneme, diphone, syllable, and word production; their word and phrase comprehension; and their ability to both comprehend and use grammatical forms, word stem morphology, "can-do" criteria (for both production and comprehension), and other criterial aspects of the instructional interactions (please see, for example, Hawkins and Filipović, 2012.) In addition to accuracy
for each of those aspects,
the learner's confidence, effort, and independence are measured
too. For example,
confidence can be self-reported, derived from vocal and timing
features, or both. Effort
corresponds to the number and duration of attempts to perform
exercises. And
independence can be measured by the number and frequency of
learner requests for help.
Integrated content development system: Both instructors and peer learners can add to and extend branching scenario instructional interactions, which are multiple-choice response instructional content, such as is used in the Twine Twee formalism or "Choose Your Own Adventure" role-play interactions. This branching scenario instructional content can be added and removed by editing the database of interactions in a manner similar to editing a wiki such as Wikipedia or Wiktionary.
Phonetic disambiguation of homographs (equivalently, heterophones: words that are spelled identically but pronounced differently, such as the past and present tenses of the word “read”) is automatically presented as an integrated part of the instructional content development subsystem. This allows instructors and peer learners to encode their instructional content prompting response phrases, of which there are typically three per branching scenario node, although there can be any natural number: zero responses ends the instructional interaction module, one response requires the production of a particular prompted response, and two or more choices allow for transitions to (usually other) nodes.
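The zero/one/many response structure of branching scenario nodes can be sketched as a small data structure. The class and field names below are illustrative, not from the specification:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Response:
    prompt_phrase: str        # what the learner must say
    next_node: Optional[str]  # target node id for the transition

@dataclass
class ScenarioNode:
    node_id: str
    content: str              # multiple-choice instructional content
    responses: list = field(default_factory=list)

    def kind(self):
        """Zero responses ends the module; one requires a particular
        prompted response; two or more offer transitions to nodes."""
        n = len(self.responses)
        if n == 0:
            return "end"
        if n == 1:
            return "prompted"
        return "choice"
```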
Part of speech labeling: The instructional interaction
development support subsystem
also assists in labeling the part of speech (e.g., noun, verb,
article, adjective, conjunction,
preposition, adverb, etc.) of each word of the prompt phrases in
new instructional content
to assist with pronunciation assessment for intelligibility
remediation.
Peer consensus-based validation of instructional content: Each
node and each transition
between nodes in the branching scenario instructional
interactions are separately
validated by instructor data entry and review or peer learner
review or both.
Caching stand-alone exercises for offline execution: The system network interface caches both instructional interactions during download and their results in nonvolatile storage, so that the system remains usable when disconnected from the network, or when downloads or uploads or both are inhibited, and the entire system can perform in a manner consistent with stand-alone operation compatible with free, freemium, or paid content accession models.
Extensible vocabulary: Each of the prompting phrases is composed
of one or more
words, each of which is in turn composed of one or more
syllables, diphones, and
phonemes. The number and type of words may be increased by
length, subject matter,
vocational or other topic, geography, languages, morphological
features, and other
aspects.
Extensible prompting phrases: The number and type of prompting
phrases associated
with each of the branching scenario transitions may be increased
by length, subject
matter, vocational topic, geographies, languages, grammatical
features, "can-do" criteria,
and other criteria and aspects. The branching scenario
interaction modules in which the
transitions are contained may similarly be increased by each of
those aspects.
Instructional interaction sequencing: A registration and sign-in system which records the learners' proficiency with each phoneme, diphone, word, and other learner analytics allows the instructional content modules, such as branching scenarios and prompting phrases, which the learner most needs to practice to be selected and provided in sequence. While the sequence is often
determined by the
branching scenario interaction transitions, sequencing can also
be performed with
adaptive instruction, by selecting prompting phrases based on
how much the learner
analytics database indicates that the learner needs to practice
words or criterial aspects
contained in the selected phrases.
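The adaptive selection step above can be sketched as picking the prompting phrase whose diphones have the weakest cumulative scores. The phrase table, score values, and function name are illustrative assumptions:

```python
def next_phrase(phrases, diphone_scores):
    """Pick the prompting phrase whose diphones the learner has practiced
    least well: lowest mean cumulative diphone score wins. `phrases` maps
    phrase text to its diphone list; `diphone_scores` maps each diphone to
    the learner's cumulative score (higher is better)."""
    def mean_score(diphones):
        return sum(diphone_scores.get(d, 0.0) for d in diphones) / len(diphones)
    return min(phrases, key=lambda p: mean_score(phrases[p]))

# Illustrative data: this learner is weak on the "ih-dh" and "dh-hh" diphones.
phrases = {
    "with her": ["w-ih", "ih-dh", "dh-hh", "hh-er"],
    "say it":   ["s-ey", "ey-ih", "ih-t"],
}
scores = {"w-ih": 0.9, "ih-dh": 0.2, "dh-hh": 0.3, "hh-er": 0.8,
          "s-ey": 0.9, "ey-ih": 0.8, "ih-t": 0.9}
```

Unseen diphones default to a score of zero, which naturally prioritizes phrases containing material the learner has never practiced.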
Collecting exemplar and student pronunciation audio recordings:
The instructional
interaction development subsystem also includes support for
collecting, evaluating the
authentic intelligibility of, and storing audio recordings from
students, instructors, and
paid voice artists.
Collecting transcriptions of recorded phrases from both first- and subsequent-language transcriptionists: Both the instructional interactions and the interaction development system collect transcriptions of the words that both native and non-native listeners can hear when they listen to recorded audio from instructors, voiceover artists, and learners. Such
transcriptions are scored by the extent to which they match the
words that the speaker
was trying to say when recording the audio.
Authentic intelligibility remediation: This groundbreaking
technique was developed
independently by researchers and software engineers in Japan and
the U.S. Educational
Testing Service. Please see Kibishi, et al. (2014) and Loukina,
et al. (2015.) This
advantage is a monumental improvement over the commercial state
of the art, much if
not most of which is two or three substantial generations behind
(see Figure 2.) The
invention's specific remediation process emphasizes audio
feedback of spoken words in
the learners' first language containing the sounds of the
correct and mistaken
pronunciations, as opposed to merely visual feedback alone.
Multiple pass automatic speech recognition: The learner analytics assessment process includes the temporal endpoints (and thus the durations) and acoustic scores of the words, syllables, diphones, and phonemes in prompting phrases, using anomalous durations of those speech segments to guide multiple passes of automatic speech recognition against the audio input, with different speech recognition grammars representing utterance expectations and different overall endpoints.
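The duration-anomaly trigger for additional recognition passes can be sketched as a z-score test in the log-duration domain, consistent with the log-normal statistics described earlier. The segment tuple layout, threshold, and function name are illustrative:

```python
import math

def anomalous_segments(segments, exemplar_stats, z_threshold=2.0):
    """Flag speech segments whose log-duration deviates from the exemplar
    log-normal distribution; flagged segments would guide re-recognition
    passes with different grammars and endpoints. `segments` is a list of
    (label, start, end) in seconds; `exemplar_stats` maps each label to
    (mu, sigma) of exemplar log-durations."""
    flagged = []
    for label, start, end in segments:
        dur = end - start
        mu, sigma = exemplar_stats[label]
        z = (math.log(dur) - mu) / sigma
        if abs(z) > z_threshold:
            flagged.append((label, start, end))
    return flagged

# Illustrative: an /ih/ of 20 ms is far shorter than exemplars near 100 ms.
flagged = anomalous_segments(
    [("ih", 0.0, 0.02), ("ih", 0.50, 0.61)],
    {"ih": (-2.3026, 0.2)},  # mu = ln(0.1), sigma illustrative
)
```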
Speech-language pathology reporting: The reports, statistics,
and alerts produced from
the learner analytics are designed to provide data in the terms,
manner, form, order, and
with the information contained in reports familiar to practicing
speech-language
pathologists. However, the same reports are also annotated and
provided with context
available by, for example, clickable links to additional text,
or similar explanatory
information such that the learners themselves, their teachers,
parents, school
administrators, and peers can understand and interpret those
reports, statistics, and alerts
produced from the analytics database.
BRIEF DESCRIPTION OF DRAWINGS
Figure 1 depicts the databases and dataflow for the
voice-response instructional
application, comprising a client-server networked computer
system composed of: (#1) an
integrated instructional interaction development system; (#2) an
instructional interaction
database server process and database; (#3) an interaction and
prompting phrase selection
server process; (#4) a network connection from the server to the
client; (#5) a client
computer system which may include a web browser in which the
client software is
implemented; (#6) an instruction delivery application composed
of: (#7) an interaction and prompting phrase selection client process, (#8) a display for
interaction multimedia and
prompting phrases, (#9) a microphone for speech audio input and
recording, and (#10) a
client process to record speech, determine learner analytics;
(#11) a network connection
from the client to the server, (#12) a server process to update
speech recognition results
and learner analytics; (#13) a learner analytics database server
process and database;
(#14) a server process to calculate and update learner analytics
results, reports, and
statistics; and (#15) a server process to produce, display, and
send reports, statistics, and
alerts.
Figure 2 depicts the motivation for collecting intelligibility
transcriptions, as opposed to
text-independent pronunciation assessment or pronunciation
assessment based solely on
exemplar pronunciations of students or voiceover talent.
Figure 3 depicts an example use of logistic regression for
intelligibility remediation.
Figure 4 depicts the main database records in an asynchronous
intelligibility remediation
peer learning and data collection system.
Figure 5 depicts learner analytics-based instructional prompting
phrase sequencing and
branching scenario transitions.
DESCRIPTION OF THE PREFERRED EMBODIMENT
In its preferred embodiment, the invention consists of software
modules to extend
software systems such as Moodle, a free open source
instructional course management
system, Wikipedia, a free open editable online encyclopedia,
Wiktionary, a free open
editable online dictionary, or Wikiversity, a free open editable
online instructional course
creation system. The user of such software, who typically
intends to learn the meaning,
pronunciation, grammar, morphology, and associated aspects of
words and phrases, will
be shown user interface elements to allow audio recording and
subsequent evaluation of
the audio phrase.
For example, a Wiktionary user may be presented with buttons
labeled "Record," "Stop,"
"Play," "Evaluate," and "Try in phrase." The Record button
would begin storing audio
data from the microphone, perhaps with a visual audio level
meter indicator. The Stop button would terminate the recording; the Play button would
allow the learner to listen to
the recording, perhaps to ascertain the loudness of background
noise in order to decide
whether to evaluate the recording. The Evaluate button would perform the pronunciation assessment, determine the intelligibility of the phrase, and use that information to select, compose, and produce audio or visual feedback, or both, for the learner to review in order to remediate whatever pronunciation intelligibility issues could be identified.
Finally, the "Try in phrase" button should provide an
opportunity for the learner to
practice the word in a phrase, and may link the user to a
registration and sign-in system
which records their proficiency with each phoneme, diphone,
word, and phrase in the
system so that the exercises which the learner needs to practice
the most can be provided
to them in a sequence beginning with trying to pronounce the
word in a phrase.
OPERATION AND EXPLANATION
One well-known automatic speech recognition system capable of
providing the data on
which the processes of the invention rely is the Carnegie Mellon
Sphinx Speech
Recognition Project’s PocketSphinx free open source software
described in Huggins-
Daines, et al. (2006.) The operation of the PocketSphinx system
to provide pronunciation
assessment data is described on this CMUsphinx Wiki page
tutorial describing the use of
PocketSphinx for pronunciation evaluation:
https://cmusphinx.github.io/wiki/pocketsphinx_pronunciation_evaluation
One of the most important advances of the invention over
essentially all of the prior art is
the use of physiologically nearby neighboring phonemes, which
are shown on that wiki
page as the following file encoding the speech recognition
results grammar comprised of
the physiologically nearby neighboring phonemes of the word
“with,” along with those of
the other phonemes in alphabetical order:
#JSGF V1.0;
grammar neighbors;
public <phrase> = sil <w> <ih> <dh> [sil];
<aa> = aa | ah | er | ao;
<ae> = ae | eh | er | ah;
<ah> = ah | ae | er | aa;
<ao> = ao | aa | er | uh;
<aw> = aw | aa | uh | ow;
<ay> = ay | aa | iy | oy | ey;
<b> = b | p | d;
<ch> = ch | sh | jh | t;
<dh> = dh | th | z | v;
<d> = d | t | jh | g | b;
<eh> = eh | ih | er | ae;
<er> = er | eh | ah | ao;
<ey> = ey | eh | iy | ay;
<f> = f | hh | th | v;
<g> = g | k | d;
<hh> = hh | th | f | p | t | k;
<ih> = ih | iy | eh;
<iy> = iy | ih;
<jh> = jh | ch | zh | d;
<k> = k | g | t | hh;
<l> = l | r | w;
<m> = m | n;
<ng> = ng | n;
<n> = n | m | ng;
<ow> = ow | ao | uh | aw;
<oy> = oy | ao | iy | ay;
<p> = p | t | b | hh;
<r> = r | y | l;
<s> = s | sh | z | th;
<sh> = sh | s | zh | ch;
<t> = t | ch | k | d | p | hh;
<th> = th | s | dh | f | hh;
<uh> = uh | ao | uw | uw;
<uw> = uw | uh | uw;
<v> = v | f | dh;
<w> = w | l | y;
<y> = y | w | r;
<z> = z | s | dh | z;
<zh> = zh | sh | z | jh;
The phonemes shown above are encoded in the CMUBET phonetic
alphabet, which is
described and explained on this wiki page:
https://cmusphinx.github.io/wiki/cmubet
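A neighbor grammar of the kind shown above can be generated mechanically from a phoneme-to-neighbors table. The sketch below carries only a few entries of such a table, and the `<phrase>` rule name and function name are illustrative assumptions; a full table would cover every CMUBET phoneme:

```python
# Partial neighbor table, matching three rules of the grammar above.
NEIGHBORS = {
    "w":  ["l", "y"],
    "ih": ["iy", "eh"],
    "dh": ["th", "z", "v"],
}

def jsgf_neighbor_grammar(word_phonemes, name="neighbors"):
    """Emit a JSGF grammar whose public rule matches the word's phonemes,
    each expandable to its physiologically neighboring phones."""
    lines = ["#JSGF V1.0;", f"grammar {name};"]
    body = " ".join(f"<{p}>" for p in word_phonemes)
    lines.append(f"public <phrase> = sil {body} [sil];")
    for p in sorted(NEIGHBORS):
        alts = " | ".join([p] + NEIGHBORS[p])
        lines.append(f"<{p}> = {alts};")
    return "\n".join(lines)

print(jsgf_neighbor_grammar(["w", "ih", "dh"]))  # grammar for "with"
```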
Another important advance of the invention is the use of
diphones. A diphone is the last
part of one phoneme followed by the first part of another. There
are over 1,000 diphones
in spoken English, but only about 650 of those occur with
substantial frequency. English
diphones in the CMUBET phonetic alphabet are explained and
listed with their
frequencies on this wiki page:
http://cmusphinx.github.io/wiki/diphones
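Because a diphone spans the boundary between two consecutive phonemes, a sequence of n phonemes yields n-1 diphones. A minimal sketch (the hyphenated naming is an illustrative convention):

```python
def diphones(phonemes):
    """A diphone is the last part of one phoneme followed by the first
    part of the next, so n phonemes yield n-1 diphones."""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

# "with" in CMUBET phonemes:
print(diphones(["w", "ih", "dh"]))  # ['w-ih', 'ih-dh']
```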
The use of logistic regression for intelligibility remediation
is explained by Figure 3. The
primary database records for asynchronous intelligibility
remediation using peer learning
and data collection are depicted in Figure 4. The use of learner
analytics for instructional
prompt phrase sequencing and branching scenario transitions are
explained by Figure 5.
CONCLUSION
The invention provides better speaking-skills instructional software than is presently commercially available in the state of the art. Language
students can use thousands of
free web and stand-alone software applications for learning
reading, writing, and
listening. But speaking skills instruction is limited to
expensive, cumbersome, and often
inaccurate commercial software for pronunciation assessment. The
interactive language
pronunciation assessment and remediation software of the
invention may be able to
improve students’ pronunciation of words perhaps six times
faster than commercially
available products. Millions of people worldwide currently wish
to improve their
pronunciation in order to gain access to better jobs and succeed
at more opportunities to
speak in public, on teleconferences, or to groups.
Unfortunately, the state of the art often
frustrates students by putting too much emphasis on
inconsequential mistakes. The
invention solves those problems by allowing adaptive instruction.
While the description above contains many specifics, they should
not be considered as
limitations on the scope of the invention, but rather as
exemplification of one preferred
embodiment thereof. Many other variations are possible. For
example, a children's toy
to teach speaking skills may be provided as a device with a
microphone and display, or
the software system may run in internet web browsers as software
executed by the
browsers as, for example, program code in the JavaScript
computer programming
language. Accordingly, the scope of the invention should be
determined not by the
embodiments as described and illustrated, but by the following
claims.
CLAIMS
What is claimed is:
(1) A networked client-server computer system composed of:
(a) an instructional interaction database server process and
database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process
(#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting
phrases (#8),
(e)(3) a microphone for speech audio input and recording
(#9),
(e)(4) a client process to record speech, determine learner
analytics, such as the quality
and intelligibility of the learner’s phoneme, diphone, syllable,
and word production; their
word and phrase comprehension; their ability to both comprehend
and use grammatical
forms; word stem morphology production and comprehension;
"can-do" criteria such as
arbitrary instructional objectives and subject matter; the
learner's measured confidence,
effort, and independence; and use those analytics to assess
resulting achievement and
-
Page 20 of 36
salsman-culnane-specification 6/2/17, 4:47 PM
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server
(#11);
(g) a server process to update speech recognition results and
learner analytics, such as the
quality and intelligibility of the learner’s phoneme, diphone,
syllable, and word
production; their word and phrase comprehension; their ability
to both comprehend and
use grammatical forms; word stem morphology production and
comprehension; "can-do"
criteria, including arbitrary instructional objectives and
subject matter; the learner's
measured confidence, effort, and independence; and use those
analytics to assess
resulting achievement and progress scores from the learner’s
audio input (#12);
(h) a learner analytics database server process and database
(#13);
(i) a server process to calculate and update learner analytics
results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports,
statistics, and alerts (#15).
(2) The computer system of Claim 1 with an integrated
instructional interaction
development system (#1) composed of a means to input, edit, and extend branching
scenario instructional interactions composed of multiple choice
response instructional
content, such as: the Twine (twinery.org) Twee language and
"Choose Your Own
Adventure" role-play interactions, which can be added, changed,
and removed by editing
a database of interactions in a manner similar to editing a wiki
such as Wikipedia or
Wiktionary.
(3) The computer system and instructional interaction
development system of Claim 2,
with a means of phonetic disambiguation of homographs (words
that are spelled
identically but pronounced differently) presented to the
instructional interaction
developer for disambiguation by selection of alternative
pronunciations during input and
editing.
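By way of illustrative, non-limiting example, the homograph disambiguation step of Claim 3 might present the developer with alternatives drawn from a pronunciation table; the table below and its ARPAbet-style transcriptions are hypothetical, not part of the claimed system:

```python
# Hypothetical homograph table mapping spellings to alternative
# ARPAbet-style pronunciations for developer selection during editing.
HOMOGRAPHS = {
    "read": ["R IY D", "R EH D"],   # present tense vs. past tense
    "lead": ["L IY D", "L EH D"],   # the verb vs. the metal
}

def pronunciations(word):
    """Return the alternative pronunciations of a homograph for the
    instructional interaction developer to choose among; a word that is
    not a homograph has nothing to disambiguate."""
    return HOMOGRAPHS.get(word.lower(), [])
```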
(4) The computer system and instructional interaction
development system of Claim 2,
with a means of part of speech (e.g., noun, verb, article,
adjective, conjunction,
preposition, adverb, etc.) labeling of each word of the
instructional interaction prompting
phrases presented for selection of each word’s part of speech
during instructional
interaction input and editing.
(5) The computer system of Claim 1, with a means of peer
consensus-based validation of instructional content, composed of a
way for learners, instructors, parents, and administrators to verify
that each node and each transition between nodes in the branching
scenario instructional interactions is separately validated by
instructor data entry and review, peer learner review, or both.
(6) The computer system of Claim 1, with a means of caching
stand-alone exercises for offline execution, comprised of a process
reading instructional interactions and associated data from the
system network input interface (#4), which caches instructional
interactions during download, allowing them to be used when the
network becomes disconnected; and a process storing results in
nonvolatile storage when the system network output interface is
unavailable, so that the system will still be usable when
disconnected from the network, or when downloads or uploads or both
are inhibited, such that the system can perform in a manner
consistent with stand-alone operation compatible with free, freemium,
or paid content accession models.
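By way of non-limiting example, the offline caching behavior of Claim 6 might be sketched as follows; the cache file names and the `fetch` callable are hypothetical stand-ins for the system network interfaces:

```python
import json
import os

CACHE_FILE = "interactions_cache.json"   # hypothetical download cache
PENDING_FILE = "pending_results.json"    # results queued while offline

def load(path):
    """Read a JSON file from nonvolatile storage, or return an empty dict."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def save(path, data):
    with open(path, "w") as f:
        json.dump(data, f)

def get_interaction(key, fetch=None):
    """Return an instructional interaction, refreshing the cache from the
    network when a fetch callable is available, and falling back to the
    cached copy when the network is disconnected (fetch is None or fails)."""
    cache = load(CACHE_FILE)
    if fetch is not None:
        try:
            cache[key] = fetch(key)
            save(CACHE_FILE, cache)
        except OSError:
            pass                         # network failed; use cached copy
    return cache.get(key)                # None if never downloaded

def record_result(result):
    """Queue a learner result in nonvolatile storage for later upload
    when the network output interface is unavailable."""
    pending = load(PENDING_FILE)
    pending.setdefault("results", []).append(result)
    save(PENDING_FILE, pending)
```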
(7) The computer system of Claim 1, with a means of extensible
vocabulary, composed of
processes to assist in increasing the number and type of words
contained in prompting
phrases by length, subject matter, vocational topic, geography,
languages, morphological
features, and other topics and aspects.
(8) The computer system of Claim 1, with a means of extensible
prompting phrases and
branching scenario interaction modules, allowing for increasing
the number and type of
prompting phrases and branching scenario interaction modules by
length, subject matter,
vocational topic, geographies, languages, grammatical features,
"can-do" criteria, and
other criteria and aspects.
(9) The computer system of Claim 1, with a means of
instructional interaction sequencing
composed of processes for registration and sign-in, a process to
allow recording learners'
proficiency with each phoneme, diphone, word, and other learner
analytics, and a process
to determine which instructional content modules, such as branching
scenarios and prompting phrases, the learner needs to practice the most,
and a process to provide
learners those instructional content modules in sequence (Figure
5.)
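By way of non-limiting example, the sequencing determination of Claim 9 might order modules by the learner's recorded proficiency with the segments each module exercises; the data shapes below are illustrative assumptions:

```python
def sequence_modules(modules, proficiency):
    """Order instructional content modules so those exercising the
    learner's weakest segments (lowest recorded proficiency) come first.

    modules: {module_name: [segments practiced, e.g. phonemes or diphones]}
    proficiency: {segment: score in 0..1}; unseen segments default to 0.0,
    i.e. most in need of practice.
    """
    def need(segments):
        return sum(proficiency.get(s, 0.0) for s in segments) / len(segments)
    return sorted(modules, key=lambda m: need(modules[m]))
```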
(10) The computer system of Claim 1, with a means of authentic
intelligibility
remediation composed of two processes:
(a) to obtain recorded audio prompting phrase utterances and their
transcriptions from native and foreign language transcriptionists,
and to create a predictive model of the consequence of observed
mispronunciations as follows:
(a)(1) obtain learner attempts at pronouncing a number of
phrases, each associated with a
branching scenario instructional interaction transition in the
form of recorded audio;
(a)(2) using the recorded audio attempts, categorize each word
as having been transcribed
either correctly or incorrectly;
(a)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(a)(4) using the recorded audio of each word and the proportion of
the time that it was transcribed correctly, use logistic regression
to model the
consequence of each
mispronunciation for prediction of the likelihood that the word
was correctly transcribed,
from the independent variables produced by the automatic speech
recognition results
(Figure 3); and
(a)(5) store the results of the logistic regression predictive
model as weight coefficients
for each of the independent variables of each word of each
prompting phrase in the
predictive model; and
(b) to provide learner exercise interaction as follows:
(b)(1) display one or more prompting phrases;
(b)(2) record audio from the learner;
(b)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(b)(4) scale the results of the automatic speech recognition
according to the weights
stored in step (a)(5) to determine the expected probability that
each word is intelligible;
(b)(5) rank each of the predicted unintelligible words by
consequence according to part of
speech and predictive model probability magnitude;
(b)(6) provide audio or audio and visual feedback to the learner
based on their most
consequential pronunciation mistake as expected by the
predictive model; and
(b)(7) as part of the audio feedback, replay the learner's most
consequential
mispronunciation followed by another two prerecorded audio
words, one of which
includes the phoneme or diphone associated with the observed
sound constituting the
mispronunciation, followed by a word with the phoneme or diphone
associated with the
correct pronunciation.
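By way of illustrative, non-limiting example, the model of steps (a)(4), (a)(5), (b)(4), and (b)(5) might be sketched as follows; the particular feature set, the plain gradient-ascent fit, and the part-of-speech weights are assumptions for exposition, not the specification's required implementation:

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Steps (a)(4)-(a)(5): fit per-word weights predicting whether a
    word was transcribed correctly. Each row of X holds ASR-derived
    independent variables, e.g. [duration, acoustic confidence, score of
    the nearest competing segment]; y[i] is 1 if transcribers recovered
    the word. Returns the stored weight vector (bias first)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = yi - p                  # gradient of the log-likelihood
            w[0] += lr * g
            for j, xj in enumerate(xi):
                w[j + 1] += lr * g * xj
    return w

def intelligibility(w, features):
    """Step (b)(4): scale new ASR results by the stored weights to obtain
    the expected probability that the word is intelligible."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def rank_by_consequence(words, pos_weight):
    """Step (b)(5): order predicted-unintelligible words (p < 0.5), most
    consequential first, weighting by part of speech.
    words: [(word, part_of_speech, p_intelligible)]."""
    return sorted((w for w in words if w[2] < 0.5),
                  key=lambda w: pos_weight.get(w[1], 1.0) * (1.0 - w[2]),
                  reverse=True)
```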
(11) The computer system of Claim 1, with a means of multiple
pass automatic speech
recognition composed of learner analytics assessment processes
to determine temporal
endpoints, and thereby the duration, and acoustic scores for
speech segments such as
phonemes, diphones, syllables, and words of prompting phrases,
wherein anomalous
durations of those segments guide multiple passes of automatic
speech recognition of the
same audio input using different speech recognition grammars
representing utterance
expectations.
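By way of non-limiting example, the duration-anomaly test that triggers additional recognition passes in Claim 11 might be sketched as follows; the fractional tolerance and the expected-duration table are illustrative assumptions:

```python
def plan_passes(segments, expected, tolerance=0.5):
    """Flag speech segments whose recognized duration deviates from
    expectation by more than `tolerance` (as a fraction of the typical
    duration); each flagged segment warrants another recognition pass of
    the same audio under a different speech recognition grammar.

    segments: {segment: observed duration in seconds}
    expected: {segment: typical duration in seconds}
    """
    anomalous = []
    for seg, dur in segments.items():
        typical = expected.get(seg)
        if typical and abs(dur - typical) / typical > tolerance:
            anomalous.append(seg)
    return anomalous
```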
(12) The computer system of Claim 1, with a means of
speech-language pathology
reporting composed of surveying the current terms used in, the manner
of presentation of, the printed forms composing, the order of
presentation of, and the
information contained in
reports used by practicing speech-language pathologists, and
then formatting reports,
statistics, and alerting messages according to the surveyed
descriptions of those reports.
(13) A networked client-server computer system composed of:
(a) an instructional interaction database server process and
database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process
(#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting
phrases (#8),
(e)(3) a microphone for speech audio input and recording
(#9),
(e)(4) a client process to record speech, determine learner
analytics, including the quality
and intelligibility of the learner’s phoneme, diphone, syllable,
and word production; their
word and phrase comprehension; their ability to both comprehend
and use grammatical
forms; word stem morphology production and comprehension;
"can-do" criteria such as
arbitrary instructional objectives and subject matter; the
learner's measured confidence,
effort, and independence; and use those analytics to assess
resulting achievement and
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server
(#11);
(g) a server process to update speech recognition results and
learner analytics, including
the quality and intelligibility of the learner’s phoneme,
diphone, syllable, and word
production; their word and phrase comprehension; their ability
to both comprehend and
use grammatical forms; word stem morphology production and
comprehension; "can-do"
criteria, including arbitrary instructional objectives and
subject matter; the learner's
measured confidence, effort, and independence; and use those
analytics to assess
resulting achievement and progress scores from the learner’s
audio input (#12);
(h) a learner analytics database server process and database
(#13);
(i) a server process to calculate and update learner analytics
results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports,
statistics, and alerts (#15);
(k) an integrated instructional interaction development system
(#1) composed of a means
to input, edit, and extend branching scenario instructional
interactions composed of
multiple choice response instructional content, such as: the
Twine (twinery.org) Twee
language and "Choose Your Own Adventure" role-play interactions,
which can be added,
changed, and removed by editing a database of interactions in a
manner similar to editing
a wiki such as Wikipedia or Wiktionary;
(l) a means of phonetic disambiguation of homographs (words that
are spelled identically
but pronounced differently) presented to the instructional
interaction developer for
disambiguation by selection of alternative pronunciations during
input and editing;
(m) a means of part of speech (e.g., noun, verb, article,
adjective, conjunction,
preposition, adverb, etc.) labeling of each word of the
instructional interaction prompting
phrases presented for selection of each word’s part of speech
during instructional
interaction input and editing;
(n) a means of peer consensus-based validation of instructional
content, composed of a way for learners, instructors, parents, and
administrators to verify that each node and each transition between
nodes in the branching scenario instructional interactions is
separately validated by instructor data entry and review, peer
learner review, or both;
(o) a means of caching stand-alone exercises for offline execution,
comprised of a process reading instructional interactions and
associated data from the system network input interface (#4), which
caches instructional interactions during download, allowing them to
be used when the network becomes disconnected; and a process storing
results in nonvolatile storage when the system network output
interface is unavailable, so that the system will still be usable
when disconnected from the network, or when downloads or uploads or
both are inhibited, such that the system can perform in a manner
consistent with stand-alone operation compatible with free, freemium,
or paid content accession models;
(p) a means of extensible vocabulary, composed of processes to
assist in increasing the
number and type of words contained in prompting phrases by
length, subject matter,
vocational topic, geography, languages, morphological features,
and other topics and
aspects.
(q) a means of extensible prompting phrases and branching
scenario interaction modules,
allowing for increasing the number and type of prompting phrases
and branching scenario
interaction modules by length, subject matter, vocational topic,
geographies, languages,
grammatical features, "can-do" criteria, and other criteria and
aspects.
(r) a means of instructional interaction sequencing composed of
processes for registration
and sign-in, a process to allow recording learners' proficiency
with each phoneme,
diphone, word, and other learner analytics, and a process to
determine which instructional content modules, such as branching
scenarios and prompting phrases, the learner needs to practice the
most, and a process to provide learners
those instructional content
modules in sequence (Figure 5.)
(s) a means of authentic intelligibility remediation composed of
two processes:
(s)(1) to obtain recorded audio prompting phrase utterances and their
transcriptions from native and foreign language transcriptionists,
and to create a predictive model of the consequence of observed
mispronunciations as follows:
(s)(1)(a) obtain learner attempts at pronouncing a number of
phrases, each associated
with a branching scenario instructional interaction transition
in the form of recorded
audio;
(s)(1)(b) using the recorded audio attempts, categorize each
word as having been
transcribed either correctly or incorrectly;
(s)(1)(c) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(s)(1)(d) using the recorded audio of each word and the proportion of
the time that it was transcribed correctly, use logistic regression
to model the
consequence of each
mispronunciation for prediction of the likelihood that the word
was correctly transcribed,
from the independent variables produced by the automatic speech
recognition results
(Figure 3); and
(s)(1)(e) store the results of the logistic regression
predictive model as weight coefficients
for each of the independent variables of each word of each
prompting phrase in the
predictive model; and
(s)(2) to provide learner exercise interaction as follows:
(s)(2)(a) display one or more prompting phrases;
(s)(2)(b) record audio from the learner;
(s)(2)(c) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(s)(2)(d) scale the results of the automatic speech recognition
according to the weights
stored in step (s)(1)(e) to determine the expected probability
that each word is
intelligible;
(s)(2)(e) rank each of the predicted unintelligible words by
consequence according to part
of speech and predictive model probability magnitude;
(s)(2)(f) provide audio or audio and visual feedback to the
learner based on their most
consequential pronunciation mistake as expected by the
predictive model; and
(s)(2)(g) as part of the audio feedback, replay the learner's
most consequential
mispronunciation followed by another two prerecorded audio
words, one of which
includes the phoneme or diphone associated with the observed
sound constituting the
mispronunciation, followed by a word with the phoneme or diphone
associated with the
correct pronunciation;
(t) a means of multiple pass automatic speech recognition
composed of learner analytics
assessment processes to determine temporal endpoints, and
thereby the duration, and
acoustic scores for speech segments such as phonemes, diphones,
syllables, and words of
prompting phrases, wherein anomalous durations of those segments
guide multiple passes
of automatic speech recognition of the same audio input using
different speech
recognition grammars representing utterance expectations;
and
(u) a means of speech-language pathology reporting composed of
surveying the current terms used in, the manner of presentation of,
the printed forms composing,
the order of presentation
of, and the information contained in reports used by practicing
speech-language
pathologists, and then formatting reports, statistics, and
alerting messages according to
the surveyed descriptions of those reports.
(14) A networked client-server computer system composed of:
(a) an instructional interaction database server process and
database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process
(#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting
phrases (#8),
(e)(3) a microphone for speech audio input and recording
(#9),
(e)(4) a client process to record speech, determine learner
analytics, such as the quality
and intelligibility of the learner’s phoneme, diphone, syllable,
and word production; their
word and phrase comprehension; their ability to both comprehend
and use grammatical
forms; word stem morphology production and comprehension;
"can-do" criteria such as
arbitrary instructional objectives and subject matter; the
learner's measured confidence,
effort, and independence; and use those analytics to assess
resulting achievement and
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server
(#11);
(g) a server process to update speech recognition results and
learner analytics, such as the
quality and intelligibility of the learner’s phoneme, diphone,
syllable, and word
production; their word and phrase comprehension; their ability
to both comprehend and
use grammatical forms; word stem morphology production and
comprehension; "can-do"
criteria, including arbitrary instructional objectives and
subject matter; the learner's
measured confidence, effort, and independence; and use those
analytics to assess
resulting achievement and progress scores from the learner’s
audio input (#12);
(h) a learner analytics database server process and database
(#13);
(i) a server process to calculate and update learner analytics
results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports,
statistics, and alerts (#15).
(15) The computer system of Claim 14 with an integrated
instructional interaction
development system (#1) composed of a means to input, edit, and
extend branching
scenario instructional interactions composed of multiple choice
response instructional
content, such as: the Twine (twinery.org) Twee language and
"Choose Your Own
Adventure" role-play interactions, which can be added, changed,
and removed by editing
a database of interactions in a manner similar to editing a wiki
such as Wikipedia or
Wiktionary.
(16) The computer system of Claim 14, with a means of caching
stand-alone exercises for offline execution, comprised of a process
reading instructional interactions and associated data from the
system network input interface (#4), which caches instructional
interactions during download, allowing them to be used when the
network becomes disconnected; and a process storing results in
nonvolatile storage when the system network output interface is
unavailable, so that the system will still be usable when
disconnected from the network, or when downloads or uploads or both
are inhibited, such that the system can perform in a manner
consistent with stand-alone operation compatible with free, freemium,
or paid content accession models.
(17) The computer system of Claim 14, with a means of
instructional interaction
sequencing composed of processes for registration and sign-in, a
process to allow
recording learners' proficiency with each phoneme, diphone,
word, and other learner
analytics, and a process to determine which instructional content
modules, such as branching scenarios and prompting phrases, the
learner needs to practice the most,
and a process to provide learners those instructional content
modules in sequence (Figure
5.)
(18) The computer system of Claim 14, with a means of authentic
intelligibility
remediation composed of two processes:
(a) to obtain recorded audio prompting phrase utterances and their
transcriptions from native and foreign language transcriptionists,
and to create a predictive model of the consequence of observed
mispronunciations as follows:
(a)(1) obtain learner attempts at pronouncing a number of
phrases, each associated with a
branching scenario instructional interaction transition in the
form of recorded audio;
(a)(2) using the recorded audio attempts, categorize each word
as having been transcribed
either correctly or incorrectly;
(a)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(a)(4) using the recorded audio of each word and the proportion of
the time that it was transcribed correctly, use logistic regression
to model the
consequence of each
mispronunciation for prediction of the likelihood that the word
was correctly transcribed,
from the independent variables produced by the automatic speech
recognition results
(Figure 3); and
(a)(5) store the results of the logistic regression predictive
model as weight coefficients
for each of the independent variables of each word of each
prompting phrase in the
predictive model; and
(b) to provide learner exercise interaction as follows:
(b)(1) display one or more prompting phrases;
(b)(2) record audio from the learner;
(b)(3) using automatic speech recognition, evaluate the
pronunciation of the recorded
audio to determine the temporal endpoints and duration, along
with the acoustic
confidence probability, and alternative nearby physiologically
neighboring speech
segments such as phonemes, diphones, and syllables which may
have matched the
recorded audio more closely than the expected segments;
(b)(4) scale the results of the automatic speech recognition
according to the weights
stored in step (a)(5) to determine the expected probability that
each word is intelligible;
(b)(5) rank each of the predicted unintelligible words by
consequence according to part of
speech and predictive model probability magnitude;
(b)(6) provide audio or audio and visual feedback to the learner
based on their most
consequential pronunciation mistake as expected by the
predictive model; and
(b)(7) as part of the audio feedback, replay the learner's most
consequential
mispronunciation followed by another two prerecorded audio
words, one of which
includes the phoneme or diphone associated with the observed
sound constituting the
mispronunciation, followed by a word with the phoneme or diphone
associated with the
correct pronunciation.
(19) The computer system of Claim 14, with a means of multiple
pass automatic speech
recognition composed of learner analytics assessment processes
to determine temporal
endpoints, and thereby the duration, and acoustic scores for
speech segments such as
phonemes, diphones, syllables, and words of prompting phrases,
wherein anomalous
durations of those segments guide multiple passes of automatic
speech recognition of the
same audio input using different speech recognition grammars
representing utterance
expectations.
(20) The computer system of Claim 14, with a means of
speech-language pathology
reporting comprised of surveying the current terms used in, the
manner of presentation of, the printed forms composing, the order of
presentation of, and the
information contained in
reports used by practicing speech-language pathologists, and
then formatting reports,
statistics, and alerting messages according to the surveyed
descriptions of those reports.
ABSTRACT
This invention is a method of interactive computer-aided
instruction for general education
including speaking skills. Learners are asked to read text
prompting phrases into a
microphone in response to multiple choice questions. Automatic
speech recognition is
used to assess the pronunciation and provide remediation, in the
form of audio or visual
responses or both, based on the authentic intelligibility of the
learners' spoken responses
determined from transcriptions of other learners' utterances of
the same prompting
phrases.
PROVISIONAL PATENT APPLICATION AND DISCLOSURE DOCUMENT
REFERENCES
The foregoing utility patent application specification claims the
earlier date of James
Salsman's U.S. provisional patent application of March 4, 2016,
entitled, “Pronunciation
Assessment for Intelligibility Remediation.” The delay in filing
the present application
beyond the one-year statutory limit was unavoidable, but was less
than the two-month regulatory exemption for unavoidable delay. The
present
application also makes reference
to U.S. Patent and Trademark Office Disclosure Document number
S00867 filed by
James Salsman on October 23, 1998, entitled, “Solar-powered
Portable Reading
Instruction System.”