( 12 ) United States Patent Ramanarayanan et alantikenschlacht.com/su/pdf/patent2020.pdfUS010607504B1 ( 12 ) United States Patent Ramanarayanan et al . ( 10 ) Patent No .: US 10,607,504

US010607504B1

( 12 ) United States Patent Ramanarayanan et al .

( 10 ) Patent No .: US 10,607,504 B1 ( 45 ) Date of Patent : Mar. 31 , 2020

( 54 ) COMPUTER - IMPLEMENTED SYSTEMS AND METHODS FOR A CROWD SOURCE - BOOTSTRAPPED SPOKEN DIALOG SYSTEM

( 58 ) Field of Classification Search CPC

( Continued ) GO9B 19/04

( 56 ) References Cited ( 71 ) Applicant : Educational Testing Service , Princeton ,

NJ ( US ) U.S. PATENT DOCUMENTS

2006/0080101 A1 * 4/2006 Chotimongkol G06F 17/278 704/257

GIOL 15/22 704/275

2006/0149555 A1 * 7/2006 Fabbrizio

( Continued )

( 72 ) Inventors : Vikram Ramanarayanan , San Francisco , CA ( US ) ; David Suendermann - Oeft , San Francisco , CA ( US ) ; Patrick Lange , San Francisco , CA ( US ) ; Alexei V. Ivanov , Redwood City , CA ( US ) ; Keelan Evanini , Pennington , NJ ( US ) ; Yao Qian , San Francisco , CA ( US ) ; Zhou Yu , Pittsburgh , PA ( US )

OTHER PUBLICATIONS

( 73 ) Assignee : Educational Testing Service , Princeton , NJ ( US )

Bohus , Dan , Raux , Antoine , Harris , Thomas , Eskenazi , Maxine , Rudnicky , Alexander ; Olympus : An Open - Source Framework for Conversational Spoken Language Interface Research ; Proceedings of the Workshop on Bridging the Gap : Academic and Industrial Research in Dialog Technolgies ; pp . 32-39 ; 2007 .

( Continued ) ( * ) Notice : Subject to any disclaimer , the term of this

patent is extended or adjusted under 35 U.S.C. 154 ( b ) by 377 days .

Primary Examiner Thomas J Hong ( 74 ) Attorney , Agent , or Firm Jones Day

( 21 ) Appl . No .: 15 / 272,903

( 22 ) Filed : Sep. 22 , 2016

Related U.S. Application Data ( 60 ) Provisional application No. 62 / 232,537 , filed on Sep.

25 , 2015 .

( 57 ) ABSTRACT Systems and methods are provided for implementing an educational dialog system . An initial task model is accessed that identifies a plurality of dialog states associated with a task , a language model configured to identify a response meaning associated with a received response , and a language understanding model configured to select a next dialog state based on the identified response meaning . The task is provided to a plurality of persons for training . The task model is updated by revising the language model and the language understanding model based on responses received to prompts of the provided task , and the updated task is provided to a student for development of speaking capabili ties .

( 51 ) Int . Ci . GOIB 19/04 ( 2006.01 ) GIOL 15/22 ( 2006.01 )

( Continued ) ( 52 ) U.S. Cl .

CPC G09B 19/04 ( 2013.01 ) ; GIOL 15/063 ( 2013.01 ) ; GIOL 15/1815 ( 2013.01 ) ; GIOL

15/22 ( 2013.01 ) ; GIOL 2015/0635 ( 2013.01 ) 17 Claims , 12 Drawing Sheets

1002

ACCESS INITIAL TASK MODEL

1004

PROVIDE TASK REPRESENTED BY TASK MODEL TO PERSONS

FOR CROWOSOURCED TRAINING

1006

UPDATE TASK MODEL BY REVISING LANGUAGE MODEL AND SPOKEN

LANGUAGE UNDERSTANDING MODEL

1008

PROVIDE UPDATE TASK TO STUDENT FOR DEVELOPMENT OF SPEAKING CAPABILITIES

US 10,607,504 B1 Page 2

( 51 ) Int . Ci . GIOL 15/18 ( 2013.01 ) GIOL 15/06 ( 2013.01 )

( 58 ) Field of Classification Search USPC 434/185 See application file for complete search history .

( 56 ) References Cited

U.S. PATENT DOCUMENTS

2014/0379326 A1 * 12/2014 Sarikaya GIOL 15/18 704/9

G06F 17/289 704/8

2015/0363393 A1 * 12/2015 Williams

OTHER PUBLICATIONS

Bohus , Dan , Saw , Chit , Horvitz , Eric ; Directions Robot : In - the Wild Experiences and Lessons Learned ; Proceedings of the Inter national Conference on Autonomous Agents and Multi - Agent Sys tems ; pp . 637-644 ; 2014 . Buchholz , Sabine , Latorre , Javier ; Crowdsourcing Preference Tests , and How to Detect Cheating ; INTERSPEECH ; pp . 3053-3056 ; Aug. 2011 . Eskenazi , Maxine , Black , Alan , Raux , Antoine , Langner , Brian ; Let's Go Lab : a Platform for Evaluation of Spoken Dialog Systems with Real World Users ; 9th Annual Conference of the International Speech Communications Association ; p . 219 ; Sep. 2008 . Evanini , Keelan , Higgins , Derrick , Zechner , Klaus ; Using Amazon Mechanical Turk for Transcription of Non - Native Speech ; Proceed ings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk ; pp . 53-56 ; Jun . 2010 . Jurcicek , Filip , Keizer , Simon , Gasic , Milica , Mairesse , Francois , Thomson , Blaise , Yu , Kai , Young , Steve ; Real User Evaluation of Spoken Dialogue Systems Using Amazon Mechanical Turk ; Pro ceedings of INTERSPEECH ; pp . 3061-3064 ; 2011 . Kousidis , Spyros , Kennington , Casey , Baumann , Timo , Buschmeier , Hendrik , Kopp , Stefan , Schlangen , David ; A Multimodal In - Car Dialogue System That Tracks the Driver's Attention ; Proceedings of the 16th International Conference on Multimodal Interaction ; pp . 26-33 ; Nov. 2014 . Lamere , Paul , Kwok , Philip , Gouvea , Evandro , Raj , Bhiksha , Singh , Rita , Walker , William , Warmuth , Manfred , Wolf , Peter ; The CMU SPHINX - 4 Speech Recognition System ; Proceedings of the ICASSP ; Hong Kong , China ; 2003 . McGraw , Ian , Lee , Chia - ying , Hetherington , Lee , Seneff , Stephanie , Glass , James ; Collecting Voices from the Cloud ; LREC ; pp . 1576 1583 ; 2010 . Minessale , Anthony , Collins , Michael , Schreiber , Darren , Chandler , Raymond ; FreeSWITCH Cookbook ; Packt Publishing ; 2012 . Pappu , Aasish , Rudnicky , Alexander ; Deploying Speech Interfaces to the Masses ; Proceedings of the Companion Publication of the 2013 International Conference on Intelligent User Interfaces Com panion ; pp . 41-42 ; Mar. 2013 . Povey , Daniel , Ghoshal , Amab , Boulianne , Gilles , Burget , Lukas , Glembek , Ondrej , Goel , Nagendra , Hannemann , Mirko , Motlicek , Petr , Qian , Yanmin , Schwarz , Petr , Silovsky , Jan , Stemmer , Georg , Vesely , Karel ; The Kaldi Speech Recognition Toolkit ; Proceedings of the ASRU Workshop ; 2011 .

Prylipko , Dmytro , Schnelle - Walka , Dirk , Lord , Spencer , Wendemuth , Andreas ; Zanzibar OpenIVR : an Open - Source Framework for Devel opment of Spoken Dialog Systems ; Proceedings of the TSD Work shop ; 2011 . Ramanarayanan , Vikram , Suendermann - Oeft , David , Ivanov , Alexei , Evanini , Keelan ; A Distributed Cloud - Based Dialog System for Conversational Application Development ; Proceedings of the SIGDIAL Conference ; pp . 432-434 ; Sep. 2015 . Rayner , Manny , Frank , Ian , Chua , Cathy , Tsourakis , Nikos , Bouil lon , Pierrette ; For a Fistful of Dollars : Using Crowd - Sourcing to Evaluate a Spoken Language CALL Application ; Proceedings of the SLATE Workshop ; Aug. 2011 . Schnelle - Walka , Dirk , Radomski , Stefan , Muhlhauser , Max ; JVoiceXML as a Modality Component in the W3C Multimodal Architecture ; Journal on Multimodal User Interfaces , 7 ( 3 ) ; pp . 183-194 ; Nov. 2013 . Schroder , Marc , Trouvain , Jurgen ; The German Text - to - Speech Synthesis System Mary : a Tool for Research , Development and Teaching ; International Journal of Speech Technology , 6 ( 4 ) ; pp . 365-377 ; 2003 . Sciutti , Alessandra , Schilingmann , Lars , Palinko , Oskar , Nagai , Yukie , Sandini , Giulio ; A Gaze - Contingent Dictating Robot to Study Turn - Taking ; Proceedings of the 10th Annual ACM / IEEE International Conference on Human - Robot Interaction Extended Abstracts ; pp . 137-138 ; 2015 . Suendermann , David , Liscombe , Jackson , Pieraccini , Roberto ; How to Drink from a Fire Hose : One Person can Annoscribe 693 Thousand Utterances in One Month ; Proceedings of SIGDIAL 2010 : the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue ; pp . 257-260 ; Sep. 2010 . Suendermann , David , Liscombe , Jackson , Pieraccini , Roberto , Evanini , Keelan ; How Am I Doing ?: A New Framework to Effectively Measure the Performance of Automated Customer Care Contact Centers ; Ch . 7 in Advances in Speech Recognition : Mobile Envi ronments , A. Neustein ( Ed . ) ; Springer ; pp . 155-179 ; Aug. 2010 . Suendermann - Oeft , David , Ramanarayanan , Vikram , Teckenbrock , Moritz , Neutatz , Felix , Schmidt , Dennis ; HALEF : An Open - Source Standard - Compliant Telephony - Based Modular Spoken Dialog Sys tem — A Review and an Outlook ; Proceedings of the International Workshop on Spoken Dialog Systems ; Jan. 2015 . Taylor , Paul , Black , Alan , Caley , Richard ; The Architecture of the Festival Speech Synthesis System ; Proceedings of the ESCA Work shop on Speech Synthesis ; 1998 . Van Meggelen , Jim , Madsen , Leif , Smith , Jared ; Asterisk : The Future of Tel ony ; Sebastopol , CA : O'Reilly Media ; 2007 . Vinciarelli , Alessandro , Pantic , Maja , Bourlard , Herve ; Social Sig nal Processing : Survey of an Emerging Domain ; Image and Vision Computing Journal , 27 ( 12 ) ; pp . 1743-1759 ; 2009 . Wolters , Maria , Isaac , Karl , Renals , Steve ; Evaluating Speech Synthesis Intelligibility Using Amazon Mechanical Turk ; pp . 136 141 ; Jan. 2010 . Yu , Zhou , Bonus , Dan , Horvitz , Eric ; Incremental Coordination : Attention - Centric Speech Production in a Physically situated Con versational Agent ; Proceedings of the SIGDIAL 2015 Conference ; pp . 402-406 ; Sep. 2015 . Yu , Zhou , Papangelis , Alexandros , Rudnicky , Alexander ; TickTock : A Non - Goal - Oriented Multimodal Dialog System with Engagement Awareness ; Proceedings of the Association for the Advancement of Artificial Intelligence Spring Symposium ; pp . 108-111 ; 2015 .

* cited by examiner

102

U.S. Patent

PROMPT RESPONSE

Mar. 31 , 2020

TRAINING 104

EDUCATIONAL DIALOG SYSTEM

PROMPT

Sheet 1 of 12

108

RESPONSE

SCORE

EVALUATION 901

Fig . 1

US 10,607,504 B1

206

208

202

204

CONTINUE

CONTINUE

CONTINUE

BEGIN

WELCOME

COULD YOU TELL ME MORE ABOUT YOUR EDUCATION ?

00 YOU PLAN TO RETURN TO SCHOOL FOR HIGHER STUDIES ?

U.S. Patent

T CONTINUE 210

CONTINUE HAVE YOU EVER QUIT A JOB BEFORE ?

IS THE ANSWER AFFIRMATIVE ?

CONTINUE CONTINUE


INTERESTING YES

CONTINUE

Mar. 31 , 2020

CONTINUE

DEFAULT

212

LET'S TALK ABOUT YOUR EXPERIENCE

DEFAULT

BRANCH NO

CONTINUE

CONTINUE I SEE

YES

1 THAT'S UNFORTUNATE

CONTINUE

DEFAULT

CONTINUE

CONTINUE

BRANCH

HAVE YOU EVER SPOKEN BEFORE A GROUP OF PEOPLE ?

Sheet 2 of 12

MOVE ON

IS THE ANSWER CONTINUE AFFIRMATIVE ?

BRANCH

NO

CONTINUE OKAY , THAT'S GREAT

YES

NO

THANKS , WAS A PLEASURE

DEFAULT

NO

BRANCH

GREAT !

OKAY , MOVE ON

I SEE

CONTINUE CONTINUE

YES

GET BACK TO YOU LATER

CONTINUE

CONTINUE

CONTINUE

CONTINUE

RETURN

CONTINUE

WELL I HAVE BEEN ASKING YOU A LOT OF QUESTIONS . DO YOU HAVE ANY QUESTIONS FOR ME ?


US 10,607,504 B1

Fig . 2

312

310

302

PROMPT

LANGUAGE MODEL

U.S. Patent

RESPONSE 314

RESPONSE

304

RESPONSE MEANING

322

TRAINING 306

TASK ADMINISTRATOR ( DIALOG MANAGER )

TASK MODELS

Mar. 31 , 2020

318

PROMPT

NEXT DIALOG STATE

316

RESPONSE

Sheet 3 of 12

314

EVALUATION

RESPONSE MEANING

( SPOKEN ) LANGUAGE UNDERSTANDING MODEL

308

320 SCORE


US 10,607,504 B1

3

Fig . 3

414

412

402

PROMPT

LANGUAGE MODEL

U.S. Patent

RESPONSE

410

-412

416

RESPONSE

404

RESPONSE MEANING

408

TRAINING 406

TASK AOMINISTRATOR ( DIALOG MANAGER )

TASK MODELS

Mar. 31 , 2020

420

PROMPT

NEXT DIALOG STATE

.

418

RESPONSE

Sheet 4 of 12

416

EVALUATION

RESPONSE MEANING

LANGUAGE UNDERSTANDING MODEL SCORE


US 10,607,504 B1

Fig . 4

510

502

PROMPT

LANGUAGE MODEL

U.S. Patent

RESPONSE

RESPONSE

504

RESPONSE MEANING

508

TRAINING

TASK ADMINISTRATOR ( DIALOG MANAGER

TASK MODELS

Mar. 31 , 2020

PROMPT

NEXT DIALOG STATE

514

516

512

RESPONSE

Sheet 5 of 12

EVALUATION

RESPONSE MEANING


506

SCORE


US 10,607,504 B1

Fig . 5

SPEECH CONTENT ; GESTURES ; FACIAL EXPRESSIONS U.S. Patent

PROMPT

RESPONSE

LANGUAGE MODEL

RESPONSE

RESPONSE MEANING

}

TRAINING

TASK ADMINISTRATOR

Mar. 31 , 2020

TASK MODELS

PROMPT

NEXT DIALOG STATE

Sheet 6 of 12

RESPONSE EVALUATION

RESPONSE MEANING


MICROPHONE AND ASR ; MOTION DETECTOR ; VIDEO CAMERA

SCORE


US 10,607,504 B1

6

Fig . 6

uula botin

PSIN

U.S. Patent

SIP

TELEPHONY SERVERS

VOICE BROWSER

( 2 ) OATA LOGGING & ORGANIZATION

SIP WebRTC

SIP

SIP

RTP ( AUDIO )

Mar. 31 , 2020

HTTP

MRCPI v2 )

TCP

HTTP

SPEECH SERVER

WEB SERVER

Sheet 7 of 12

( ASR )

( 3 ) ITERATIVE REFINEMENT OF MODELS & CALLFLOWS SPEECH TRANSCRIPTION , ANNOTATION & RATING ( STAR ) PORTAL

LOGGING DATABASE ( MySOL ) US 10,607,504 B1

Fig . 7

3

und DILON

PSIN

( 1 ) CROWDSOURCED DATA COLLECTION USING AMAZON MECHANICAL TURK

U.S. Patent

VOICE BROWSER SIP

w

TELEPHONY SERVERS WITH VIDEO SUPPORT ( ASTERISK & FREESWITCH )

JVoiceXML

( 2 ) OATA LOGGING & ORGANIZATION

SIP WebRTC

Zanzibar SIP

SIP

RTP ( AUDIO )

Mar. 31 , 2020

HTTP

MRCP ( v2 )

SPEECH SERVER

WEB SERVER

HTTP

$

APACHE

$

CAIRO ( MRCP server

FESTIVAL ( TIS )

Sheet 8 of 12

SPHINX ( ASR )

VXML , JSGF , ARPA , SRGS , WAV

MARY ( TS )

KALDI ( ASR ) SERVER

1

( 3 ) ITERATIVE REFINEMENT OF MODELS & CALLFLOWS

<

SPEECH TRANSCRIPTION , ANNOTATION & RATING ( STAR ) PORTAL

LOGGING DATABASE ( MySQL ) US 10,607,504 B1

Fig . 8

NO . DIALOG STATES

NO . CALLS

mont

ITEM

PRAGMATICS ( FOOD OFFER ) PRAGMATICS ( SCHEDULING ) JOB INTERVIEW

PIZZA CUSTOMER SERVICE

131 166 192 187

???

COMPLETION RATE ( % ) 61.83 66.87 35.42 47.06

U.S. Patent

8 7

0.8

Mar. 31 , 2020

0.64 PROPORTION OF COMPLETED CALLS

0.4

Sheet 9 of 12

0.2 0.0

5

15

20

10

DAY OF DATA COLLECTION

US 10,607,504 B1

Fig . 9

U.S. Patent Mar. 31 , 2020 Sheet 10 of 12 US 10,607,504 B1

1002

ACCESS INITIAL TASK MODEL

1004

PROVIDE TASK REPRESENTED BY TASK MODEL TO PERSONS

FOR CROWOSOURCED TRAINING

1006

UPDATE TASK MODEL BY REVISING LANGUAGE MODEL AND SPOKEN


1008

PROVIDE UPDATE TASK TO STUDENT FOR DEVELOPMENT OF SPEAKING CAPABILITIES

Fig . 10


1100 1107 1110

COMPUTER - READABLE MEMORY

TASK MODEL

1102

PROCESSING SYSTEM 1108

COMPUTER - IMPLEMENTED EDUCATIONAL DIALOG SYSTEM

DATA STORE ( S )

1112

1104 SCORES

Fig . 11A

1120

1130 1134

USER PCK COMPUTER - READABLE MEMORY

TASK MODEL

1122 1128 1124 1122

USER PC NETWORK ( S ) SERVER ( S ) DATA STORE ( S )

1132

SCORES USER PC 1127 1138

1122 PROCESSING SYSTEM

1137 COMPUTER - IMPLEMENTED


Fig . 11B


1150

1180 1179 1181 DISPLAY

KEYBOARD MICROPHONE

1154 1187 1188

CPU INTERFACE DISPLAY INTERFACE

1152

1190

ROM RAM DISK CONTROLLER

COMMUNICATION PORTS

1158 1159 1182

1184

CD ROM HARD DRIVE

1185

1183 FLOPPY ORIVE

Fig . 11C

a

US 10,607,504 B1 2

COMPUTER - IMPLEMENTED SYSTEMS AND based on the identified response meaning . The task is METHODS FOR A CROWD provided to a plurality of persons for training , where pro

SOURCE - BOOTSTRAPPED SPOKEN viding the task includes providing a prompt for a particular DIALOG SYSTEM one of the dialog states , receiving a response to the prompt ,

5 using the language model to determine the response mean CROSS - REFERENCE TO RELATED ing based on the received response , and selecting a particular

APPLICATIONS next dialog state based on the determined response meaning . The task model is updated by revising the language model

This application claims priority to U.S. Provisional Appli and the language understanding model based on responses cation No. 62 / 232,537 , entitled “ Bootstrapping Develop- 10 received to prompts of the provided task , and the updated ment of a Cloud - Based Multimodal Dialog System in the task is provided to a student for development of speaking Educational Domain , ” filed Sep. 25 , 2015 , the entirety of capabilities . each of which is incorporated herein by reference . As another example , a system for implementing an edu cational dialog system includes a processing system that

FIELD 15 includes one or more data processors and a computer readable medium encoded with instructions for command

The technology described in this patent document relates ing the processing system to execute steps of a method . In generally to interaction evaluation and more particularly to the method , an initial task model is accessed that identifies development of a spoken dialog system for teaching and a plurality of dialog states associated with a task , a language evaluation of interactions . 20 model configured to identify a response meaning associated

with a received response , and language understanding BACKGROUND model configured to select a next dialog state based on the

identified response meaning . The task is provided Spoken dialog systems ( SDSs ) consist of multiple sub plurality of persons for training , where providing the task

systems , such as automatic speech recognizers ( ASRs ) , 25 includes providing a prompt for a particular one of the dialog spoken language understanding ( SLU ) modules , dialog states , receiving a response to the prompt , using the lan managers ( DMs ) , and spoken language generators , among guage model to determine the response meaning based on others , interacting synergistically and often in real time . the received response , and selecting a particular next dialog Each of these subsystems is complex and brings with it state based on the determined response meaning . The task design challenges and open research questions in its own 30 model is updated by revising the language model and the right . Rapidly bootstrapping a complete , working dialog language understanding model based on responses received system from scratch is therefore a challenge of considerable to prompts of the provided task , and the updated task is magnitude . Apart from the issues involved in training rea provided to a student for development of speaking capabili sonably accurate models for ASR and SLU that work well in ties . the domain of operation in real time , one should review that 35 As a further example , a computer - readable medium is the individual systems also work well in sequence such that encoded with instructions for commanding a processing the overall SDS performance does not suffer and provides an system to implement a method associated with an educa effective interaction with interlocutors who call into the tional dialog system . In the method , an initial task model is system . accessed that identifies a plurality of dialog states associated

The ability to rapidly prototype and develop such SDSs is 40 with a task , a language model configured to identify a important for applications in the educational domain . For response meaning associated with a received response , and example , in automated conversational assessment , test a language understanding model configured to select next developers might design several conversational items , each dialog state based on the identified response meaning . The in a slightly different domain or subject area . One can , in task is provided to a plurality of persons for training , where such situations , be able to rapidly develop models and 45 providing the task includes providing a prompt for a par capabilities to ensure that the SDS can handle each of these ticular one of the dialog states , receiving a response to the diverse conversational applications gracefully . This is also prompt , using the language model to determine the response true in the case of learning applications and so - called meaning based on the received response , and selecting a formative assessments : One should be able to quickly and particular next dialog state based on the determined response accurately bootstrap SDSs that can respond to a wide variety 50 meaning . The task model is updated by revising the lan of learner inputs across domains and contexts . Language guage model and the language understanding model based learning and assessments add yet another complication in on responses received to prompts of the provided task , and that systems need to deal gracefully with non - native speech . the updated task is provided to a student for development of Despite these challenges , the increasing demand for non speaking capabilities . native conversational learning and assessment applications 55 makes this avenue of research an important one to pursue ; BRIEF DESCRIPTION OF THE DRAWINGS however , this requires us to find a way to rapidly obtain data for model building and refinement an iterative cycle . FIG . 1 is a diagram depicting a processor - implemented

educational dialog system . SUMMARY FIG . 2 is a diagram depicting example dialog states

associated with a task . Systems and methods are provided for implementing an FIG . 3 is a diagram depicting example components of an

educational dialog system . An initial task model is accessed educational dialog system . that identifies a plurality of dialog states associated with a FIG . 4 is a diagram depicting active entities of an edu task , a language model configured to identify a response 65 cational dialog system in a training mode . meaning associated with a received response , and a language FIG . 5 is a diagram depicting active entities of an edu understanding model configured to select a next dialog state cational dialog system in an educational or evaluation mode .

60

15

30

US 10,607,504 B1 3 4

FIG . 6 is a diagram depicting a multi - modal educational model for understanding the response meaning associated dialog system . with a received response , and a language ( e.g. , spoken FIGS . 7-8 are diagrams depicting example components of language ) understanding model configured to select a next

an educational dialog system . dialog state based on the identified response meaning . An FIG . 9 is a diagram depicting four example tasks and 5 initial task model can take the form of a set of dialog states

improved performance of task models as those task models ( e.g. , as depicted in FIG . 2 ) and a base / default language are trained ( e.g. , using the crowd sourcing techniques model and language understanding model . The educational described herein . dialog system 102 functions in two modes , a training mode FIG . 10 is a flow diagram depicting a processor - imple 104 , and an active learning / evaluation mode 106 . mented method for implementing an educational dialog 10 In the training mode , a variety of persons interact with the system . task to further develop the default language and language FIGS . 11A , 11B , and 11C depict example systems for understanding models . This additional training enables the implementing the approaches described herein for imple dialog system 102 to better understand context and nuances menting a computer - implemented educational dialog sys associated with the task being developed and implemented . tem . For example , in different tasks , the phrase “ I don't know ”

DETAILED DESCRIPTION can have different meanings . In a task where an interactor is under police interrogation , that phrase likely means that the

FIG . 1 is a diagram depicting a processor - implemented subject has no knowledge of the topic . But , where the task educational dialog system . The educational dialog system 20 potentially includes flirtatious behavior by an interactor , the 102 is configured to interactively provide prompts associ phrase “ I don't know ” combined with a smile and a shrug ated with dialog states to an interactor , prompting the could imply coy behavior , where the subject really does have interactor to provide a response . In this way , the interactor knowledge on the topic . The base language model and can participate in a simulated conversation with the dialog language understanding model may not be able to under system 102. The dialog system 102 may provide its prompts 25 stand such nuances , but trained versions of those models , in a voice - only fashion ( e.g. , via a speaker ) , or in a multi which adjust their behavior based on successful / unsuccess modal fashion using an avatar ( e.g. , via a graphical user ful completions of tasks and indicated survey approvals or interface , a puppet , an artificial life form ) that communicates disapprovals by training - interactors will gain an understand both voice and information via other modalities , such as ing of these factors over time . facial expressions and body movements . In one embodiment , the training mode 104 for a task

The educational dialog system 102 of FIG . 1 is configured model for a task is crowd sourced ( e.g. , using the Amazon initially with a base task model that identifies the dialog Mechanical Turk platform ) . A plurality of training - interac states associated with a conversation task . The dialog states tors interact with a task in training mode 104 via prompts indicate an anticipated path of a task that will be facilitated and responses , where those responses ( e.g. , speech , facial by the educational dialog system 102. FIG . 2 is a diagram 35 expressions , gestures ) are captured and evaluated to deter depicting example dialog states associated with a task . The mine whether the language model and / or language under task begins at 202 where the dialog system provides a standing model should be adjusted . Once the task model has welcome at 204 and asks initial questions at 206 and 208 . been refined via a number of interactions with the public , The question at 208 is the first time in the dialog states that crowd - sourced participants , the improved task model can be the conversation branches based on response given by the 40 provided for educational purposes , such as for developing interactor , as indicated by the evaluation at 210 and the speaking and interaction skills of non - native language corresponding branch at 212 to one of three different pos speakers in a drilling or even an evaluation context , where sible paths . In one embodiment , the evaluation of the a score 108 is provided . interactor response to the prompt question at 208 is per FIG . 3 is a diagram depicting example components of an formed in two steps . First , the response to the prompt is 45 educational dialog system . The dialog system 302 includes processed ( e.g. , voice responses are decoded via automatic a task administrator ( dialog manager ) 304. The task admin speech recognition , facial expressions are determined via istrator 304 monitors traversal of the dialog states of a task video processing , body language is detected via infrared conversation . It provides corresponding prompts , whether in motion capture ) to determine a response meaning associated training 306 or evaluation 308 mode and receives corre with the response ( e.g. , based on the totality of data received 50 sponding responses . Responses 310 ( e.g. , text from auto in the response , such as audio , video , and motion capture ) matic speech recognition performed on a response ) are using a language model . Once the meaning is determined at provided to the language model 312 which determines a 208-210 , that meaning is utilized at 210-212 to select an response meaning 314 that is returned to the task adminis appropriate branch using a meaning understanding ( spoken trator 304 ( or directly to the language understanding model language ) model . While the example of FIG . 2 generally 55 316 ) . The response meaning 314 is received by the language depicts a single path task conversation , more complicated understanding model 316 and determines the next dialog sets of dialog states ( e.g. , tree shaped ) can be implemented . state 318 that should be taken in the task , where that next For example , voice , gesture , and facial expression data can state 318 is returned to the task administrator 304 . be used to measure an engagement level of an interactor In a training mode 306 , the task administrator 304 is ( e.g. , based on head pose , gaze , and facial expressions to 60 configured to adjust the language model 312 and language identify smiles or indications of boredom , such as yawns , as understanding model 316 to improve their performance in well as content of detected speech ) , where positive feedback subsequent interactor iterations . The task model , which or other encouragement is given to an interactor whose includes the dialog states , the current language model , and engagement is determined to have waned . the current language understanding model , is accessed from

With reference back to FIG . 1 , the educational dialog 65 a task model data store 322 before each training iteration and system 102 utilizes a task model , which includes the plu is returned , when any of those entities are altered , for rality of dialog states associated with a task , the language storage . In an evaluation mode 308 , the task administrator

US 10,607,504 B1 5 6

304 may be configured to output a score 320 indicating a ( SIP ) , Public Switched Telephone Network ( PSTN ) , quality of responses received from the evaluation - interactor . and web Real - Time Communications ( WebRTC ) stan

FIG . 4 is a diagram depicting active entities of an edu dards and include support for voice and video ; cational dialog system in a training mode . In training mode A voice browser ( e.g. , JVoiceXML ) , which is compatible 406 , the task administrator 404 accesses the task model for 5 with VoiceXML 2.1 and can process SIP traffic and the task to be administered from the task model database which incorporates support for multiple grammar stan 408. The task administrator 404 traverses the dialog states dards , such as Java Speech Grammar Format ( JSGF ) , associated with the task model , providing prompts 410 and Advanced Research Projects Agency ( ARPA ) , and receiving response 412 from the training - interactor . The Weighted Finite State Transducer ( WFST ) ; responses 412 are provided to the language model 414 10 A Media Resource Control Protocol ( MRCP ) speech which determines response meanings 416 , where that server , which allows the voice browser to initiate SIP or response meaning 416 is used by the language understand Real - Time Transport Protocol ( RTP ) connections from / ing model 418 to determine the next dialog state 420 for the to the telephony server and incorporates two speech task . recognizers and synthesizers ;

Following conclusion of the task ( via a completion of the 15 An Apache Tomcat - based web server which can host entirety of the task or a failure to complete the task ) , the task dynamic VoiceXML pages , web services , and media administrator 404 adjusts the language model 414 and the libraries containing grammars and audio files ; language understanding model 418 based on the training OpenVXML , a VoiceXML - based voice application interactions . For example , if a task is not completed or if an authoring suite : generates dynamic web applications interactor states via a survey that they were dissatisfied with 20 that can be housed on the web server ; the task ( e.g. , the task did not provide a next prompt that was A MySQL database server for storing call logs ; appropriate for their current response ) , then the task admin A speech transcription , annotation , and rating portal that istrator 404 determines that one of the models 414 , 418 allows one to listen to and transcribe full - call record should be adjusted to better function . For example , the ings , rate them on a variety of dimensions such as caller language model 414 may be refined to apply a different 25 experience and latency , and perform various semantic response meaning 416 to a particular response 412 from the annotation tasks to train ASR and SLU modules . training - interactor that resulted in an erroneous dialog state FIG . 9 is a diagram depicting four example tasks and path . If the training - interactor completes the task or indicates improved performance of task models as those task models a positive experience , then that data is utilized to strengthen are trained ( e.g. , using the crowd sourcing techniques the models 414 , 418 ( e.g. , weights associated with potential 30 described herein ) . In the example of FIG . 9 , four tasks are paths or factors in a neural network model ) based on the described , having between 1 and 8 dialog states . The final confirmation that those models 414 , 418 behaved appropri two examples are longer , having 8 and 7 dialog states , ately to the responses 412 received from the training respectively . Task models were trained over the number of interactor . The adjusted task model is then returned by the iterations shown in the middle column , where completion task administrator 404 to the task model data store 408 . 35 rate is illustrated in a final column as an indicator of quality

FIG . 5 is a diagram depicting active entities of an edu of the task models . The graph at the bottom of FIG.9 shows cational dialog system in an educational or evaluation mode . an improvement of completion rate associated with the job In the evaluation mode 506 , the task administrator 504 interview and pizza customer service tasks over time , as the accesses a task model from the task model data store 508 and associated language and language understanding modules uses that data to set up the language model 510 and language 40 were trained using crowd source participation . understanding model 512. The task administrator 504 tra FIG . 10 is a flow diagram depicting a processor - imple verses the dialog states of the task as informed by responses mented method for implementing an educational dialog 516 to prompts 514 with the aid of the language model 510 system . At 1002 , an initial task model is accessed that and the language understanding model 512. The task admin identifies a plurality of dialog states associated with a task , istrator 504 tracks the appropriateness of responses 516 45 a language model configured to identify a response meaning received from the evaluation - interactor as well as other associated with a received response , and a language under metrics associated with those responses ( e.g. , pronunciation , standing model configured to select a next dialog state based grammar ) to determine a score 518 indicative of the quality on the identified response meaning . The task is provided to of the interactor's communication with the educational a plurality of persons for training at 1004 , where providing dialog system 502 . 50 the task includes providing a prompt for a particular one of

FIG . 6 is a diagram depicting a multi - modal educational the dialog states , receiving a response to the prompt , using dialog system . FIG . 6 indicates that speech response data is the language model to determine the response meaning captured via automatic speech recognition , as well as gesture based on the received response , and selecting a particular data via motion detection ( e.g. , using an XBOX Kinect next dialog state based on the determined response meaning . motion capture device ) and facial expression data via video 55 The task model is updated at 1006 by revising the language capture and analysis . The language model uses all or a model and the language understanding model based on portion of that data to extract meaning associated with responses received to prompts of the provided task , and the responses , where extracted text associated with speech may updated task is provided to a student at 1008 for develop be assigned different meanings depending on the context of ment of speaking capabilities . gesture and facial expression data that is identified . FIGS . 11A , 11B , and 11C depict example systems for FIGS . 7-8 are diagrams depicting example components of implementing the approaches described herein for imple

an educational dialog system . Certain embodiments use a menting a computer - implemented educational dialog sys HALEF dialog system to develop conversational applica tem . For example , FIG . 11A depicts an exemplary system tions within the crowd sourcing framework . These systems 1100 that includes a standalone computer architecture where can include one or more of the following components : 65 a processing system 1102 ( e.g. , one or more computer

Telephony servers ( e.g. , Asterisk and FreeSWITCH ) , processors located in a given computer or in multiple which are compatible with Session Initiation Protocol computers that may be separate and distinct from one

60

US 10,607,504 B1 7 8

another ) includes a computer - implemented educational dia board 1179 , or other input device 1181 , such as a micro log system 1104 being executed on the processing system phone , remote control , pointer , mouse and / or joystick . 1102. The processing system 1102 has access to a computer Additionally , the methods and systems described herein readable memory 1107 in addition to one or more data stores may be implemented on many different types of processing 1108. The one or more data stores 1108 may include task 5 devices by program code comprising program instructions models 1110 as well as scores 1112. The processing system that are executable by the device processing subsystem . The 1102 may be a distributed parallel computing environment , software program instructions may include source code , which may be used to handle very large - scale data sets . object code , machine code , or any other stored data that is

FIG . 11B depicts a system 1120 that includes a client operable to cause a processing system to perform the meth server architecture . One or more user PCs 1122 access one 10 ods and operations described herein and may be provided in or more servers 1124 running a computer - implemented any suitable language such as C , C ++ , JAVA , for example , or any other suitable programming language . Other imple educational dialog system 1137 on a processing system 1127 mentations may also be used , however , such as firmware or via one or more networks 1128. The one or more servers even appropriately designed hardware configured to carry 1124 may access a computer - readable memory 1130 as well 15 out the methods and systems described herein . as one or more data stores 1132. The one or more data stores The systems ' and methods ' data ( e.g. , associations , map 1132 may include task models 1134 as well as scores 1138 . pings , data input , data output , intermediate data results , final FIG . 11C shows a block diagram of exemplary hardware data results , etc. ) may be stored and implemented in one or for a standalone computer architecture 1150 , such as the more different types of computer - implemented data stores , architecture depicted in FIG . 11A that may be used to 20 such as different types of storage devices and programming include and / or implement the program instructions of sys constructs ( e.g. , RAM , ROM , Flash memory , flat files , tem embodiments of the present disclosure . A bus 1152 may databases , programming data structures , programming vari serve as the information highway interconnecting the other ables , IF - THEN ( or similar type ) statement constructs , etc. ) . illustrated components of the hardware . A processing system It is noted that data structures describe formats for use in 1154 labeled CPU ( central processing unit ) ( e.g. , one or 25 organizing and storing data in databases , programs , memory , more computer processors at a given computer or at multiple or other computer - readable media for use by a computer computers ) , may perform calculations and logic operations program . required to execute a program . A non - transitory processor The computer components , software modules , functions , readable storage medium , such as read only memory ( ROM ) data stores and data structures described herein may be 1158 and random access memory ( RAM ) 1159 , may be in 30 connected directly or indirectly to each other in order to communication with the processing system 1154 and may allow the flow of data needed for their operations . It is also include one or more programming instructions for perform noted that a module or processor includes but is not limited ing the method of implementing a computer - implemented to a unit of code that performs a software operation , and can educational dialog system . Optionally , program instructions be implemented for example as a subroutine unit of code , or may be stored on a non - transitory computer - readable storage 35 as a software function unit of code , or as an object ( as in an medium such as a magnetic disk , optical disk , recordable object - oriented paradigm ) , or as an applet , or in a computer memory device , flash memory , or other physical storage script language , or as another type of computer code . The medium . software components and / or functionality may be located on

In FIGS . 11A , 11B , and 11C , computer readable memo a single computer or distributed across multiple computers ries 1108 , 1130 , 1158 , 1159 or data stores 1108 , 1132 , 1183 , 40 depending upon the situation at hand . 1184 , 1188 may include one or more data structures for While the disclosure has been described in detail and with storing and associating various data used in the example reference to specific embodiments thereof , it will be appar systems for implementing a computer - implemented educa ent to one skilled in the art that various changes and tional dialog system . For example , a data structure stored in modifications can be made therein without departing from any of the aforementioned locations may be used to store 45 the spirit and scope of the embodiments . Thus , it is intended data from XML files , initial parameters , and / or data for other that the present disclosure cover the modifications and variables described herein . A disk controller 1190 interfaces variations of this disclosure provided they come within the one or more optional disk drives to the system bus 1152 . scope of the appended claims and their equivalents . For These disk drives may be external or internal floppy disk example , in one embodiment , in addition to or in the drives such as 1183 , external or internal CD - ROM , CD - R , 50 alternative to adjusting language models and language CD - RW or DVD drives such as 1184 , or external or internal understanding models , systems and methods can be config hard drives 1185. As indicated previously , these various disk ured to adjust acoustic models ( models that relate how drives and disk controllers are optional devices . probable a given sequence of words correspond to the actual

Each of the element managers , real - time data buffer , speech signal received ) , dialog management models ( models conveyors , file input processor , database index shared access 55 that inform what the optimal next action of the dialog system memory loader , reference data buffer and data managers should be based on the given state ) , and engagement pre may include a software application stored in one or more of diction models ( models that inform the dialog manager how the disk drives connected to the disk controller 1190 , the to react given a current engagement state of the user ) . ROM 1158 and / or the RAM 1159. The processor 1154 may The invention claimed is : access one or more components as required . 1. A processor - implemented method for implementing an

A display interface 1187 may permit information from the educational dialog system , comprising : bus 1152 to be displayed on a display 1180 in audio , graphic , accessing an initial task model that identifies a plurality of or alphanumeric format . Communication with external dialog states associated with a task , a language model devices may optionally occur using various communication configured to identify a response meaning associated ports 1182 . with a received response , and a language understanding

In addition to these computer - type components , the hard model configured to select a next dialog state based on ware may also include data input devices , such as a key the identified response meaning ;

60

65

5

10

25

US 10,607,504 B1 9 10

wherein the language model identifies the response model configured to select a next dialog state based on meaning based on content of speech of the response the identified response meaning ; received from a person , wherein the content of the wherein the language model identifies the response speech is determined using automatic speech recog meaning based on content of speech of the response nition ; received from a person , wherein the content of the

wherein the language model further identifies the speech is determined using automatic speech recog response meaning based on gestures associated with nition ; the response received from the person , wherein the wherein the language model further identifies the gestures are captured via a video capture device or an response meaning based on gestures associated with infrared capture device ; the response received from the person , wherein the

providing the task to a plurality of persons for training , gestures are captured via a video capture device or an wherein providing the task includes providing a prompt infrared capture device ; for a particular one of the dialog states , receiving a providing the task to a plurality of persons for training , response to the prompt , using the language model to wherein providing the task includes providing a prompt determine the response meaning based on the received 15 for a particular one of the dialog states , receiving a response , and selecting a particular next dialog state response to the prompt , using the language model to based on the determined response meaning ; determine the response meaning based on the received

providing a survey to each of the plurality of persons after response , and selecting a particular next dialog state interaction with the provided task , wherein the survey based on the determined response meaning ; requests evaluation data regarding the quality of the 20 providing a survey to each of the plurality of persons after interaction ; interaction with the provided task , wherein the survey

updating the task model by revising the language model requests evaluation data regarding the quality of the and the language understanding model based on interaction ; responses received to prompts of the provided task and updating the task model by revising the language model the evaluation data from the surveys ; and the language understanding model based on

providing an updated task to a student for development of responses received to prompts of the provided task and speaking capabilities ; and the evaluation data from the surveys ;

scoring the student's speaking capabilities based on the providing an updated task to a student for development of student's interaction with the updated task . speaking capability ; and

2. The method of claim 1 , wherein the updated task is 30 scoring the student's speaking capability based on the provided to the student by an educational organization ; student's interaction with the updated task . wherein said providing the task to a plurality of persons 8. The system of claim 7 , wherein the updated task is

for training includes providing the task to a pool of provided to the student by an educational organization ; public persons unaffiliated with the educational orga wherein said providing the task to a plurality of persons nization . for training includes providing the task to a pool of

3. The method of claim 1 , further comprising : public persons unaffiliated with the educational orga providing revised tasks to further persons for additional nization .

training prior to providing the updated task to the 9. The system of claim 7 , wherein the steps further student . include :

4. The method of claim 3 , further comprising : providing revised tasks to further persons for additional tracking whether a first person participating in a round of training prior to providing the updated task to the

training completes all of the dialog states associated student . with the task as a first metric ; 10. The system of claim 9 , wherein the steps further

tracking whether a second person participating in a round include : of training completes all of the dialog states associated 45 tracking whether a first person participating in a round of with the updated task as a second metric ; training completes all of the dialog states associated

comparing the first metric and the second metric to with the task as a first metric ; determine whether the task model is improving based tracking whether a second person participating in a round on additional training . of training completes all of the dialog states associated

5. The method of claim 4 , wherein updates to the task 50 with the updated task as a second metric ; model are retained when the task model is determined to comparing the first metric and the second metric to have improved , wherein updates are reverted when the task determine whether the task model is improving based model is determined not to have improved . on additional training .

6. The method of claim 1 , wherein the updated task is 11. The system of claim 10 , wherein updates to the task provided to a student for development of non - native speak- 55 model are retained when the task model is determined to ing capabilities . have improved , wherein updates are reverted when the task

7. A processor - implemented system for implementing an model is determined not to have improved . educational dialog system , comprising : 12. The system of claim 10 , wherein the updated task is

a processing system comprising one or more data proces provided to a student for development of non - native speak sors ; 60 ing capabilities .

a non - transitory computer - readable medium encoded with 13. A non - transitory computer - readable medium encoded instructions for commanding the processing system to with instructions for commanding one or more data proces execute steps of a method that include : sors to execute steps of a method for implementing an

accessing an initial task model that identifies a plurality of educational dialog system , the method comprising : dialog states associated with a task , a language model 65 accessing an initial task model that identifies a plurality of configured to identify a response meaning associated dialog states associated with a task , a language model with a received response , and a language understanding configured to identify a response meaning associated

35

40

10

US 10,607,504 B1 11 12

with a received response , and a language understanding updating the task model by revising the language model model configured to select a next dialog state based on and the language understanding model based on the identified response meaning ; responses received to prompts of the provided task and wherein the language model identifies the response the evaluation data from the surveys ; meaning based on content of speech of the response 5 received from a person , wherein the content of the providing an updated task to a student for development of speech is determined using automatic speech recog speaking capabilities ; and nition ; scoring the student's speaking capabilities based on the wherein the language model further identifies the student's interaction with the updated task . response meaning based on gestures associated with the response received from the person , wherein the 14. The method of claim 1 , wherein the gestures captured gestures are captured via a video capture device or an include facial expressions . infrared capture device ; 15. The method of claim 1 , wherein the language model providing the task to a plurality of persons for training , further identifies the response meaning based on a detected wherein providing the task includes providing a prompt 15 engagement level of the person . for a particular one of the dialog states , receiving a

response to the prompt , using the language model to 16. The system of claim 7 , wherein the gestures captured determine the response meaning based on the received include facial expressions . response , and selecting a particular next dialog state 17. The system of claim 7 , wherein the language model based on the determined response meaning ; further identifies the response meaning based on a detected providing a survey to each of the plurality of persons after 20 engagement level of the person . interaction with the provided task , wherein the survey requests evaluation data regarding the quality of the interaction ;

( 12 ) United States Patent Ramanarayanan et alantikenschlacht.com/su/pdf/patent2020.pdfUS010607504B1 ( 12 ) United States Patent Ramanarayanan et al . ( 10 ) Patent No .: US 10,607,504

Documents