-
US010607504B1

(12) United States Patent
     Ramanarayanan et al.

(10) Patent No.: US 10,607,504 B1
(45) Date of Patent: Mar. 31, 2020

(54) COMPUTER-IMPLEMENTED SYSTEMS AND METHODS FOR A CROWD SOURCE-BOOTSTRAPPED SPOKEN DIALOG SYSTEM

(71) Applicant: Educational Testing Service, Princeton, NJ (US)

(72) Inventors: Vikram Ramanarayanan, San Francisco, CA (US); David Suendermann-Oeft, San Francisco, CA (US); Patrick Lange, San Francisco, CA (US); Alexei V. Ivanov, Redwood City, CA (US); Keelan Evanini, Pennington, NJ (US); Yao Qian, San Francisco, CA (US); Zhou Yu, Pittsburgh, PA (US)

(73) Assignee: Educational Testing Service, Princeton, NJ (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 377 days.

(21) Appl. No.: 15/272,903

(22) Filed: Sep. 22, 2016

Related U.S. Application Data
(60) Provisional application No. 62/232,537, filed on Sep. 25, 2015.

(51) Int. Cl.
     G09B 19/04 (2006.01)
     G10L 15/22 (2006.01)
     (Continued)

(52) U.S. Cl.
     CPC G09B 19/04 (2013.01); G10L 15/063 (2013.01); G10L 15/1815 (2013.01); G10L 15/22 (2013.01); G10L 2015/0635 (2013.01)

(58) Field of Classification Search
     CPC G09B 19/04
     (Continued)

Primary Examiner: Thomas J Hong
(74) Attorney, Agent, or Firm: Jones Day

(57) ABSTRACT
Systems and methods are provided for implementing an educational dialog system. An initial task model is accessed that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning. The task is provided to a plurality of persons for training. The task model is updated by revising the language model and the language understanding model based on responses received to prompts of the provided task, and the updated task is provided to a student for development of speaking capabilities.

17 Claims, 12 Drawing Sheets
[Representative drawing: flow diagram of FIG. 10, steps 1002-1008.]
-
US 10,607,504 B1, Page 2

(51) Int. Cl.
     G10L 15/18 (2013.01)
     G10L 15/06 (2013.01)

(58) Field of Classification Search
     USPC 434/185
     See application file for complete search history.
-
[FIG. 1 (Sheet 1 of 12): Educational dialog system 102 exchanging prompts and responses with interactors in a training mode 104 and in an evaluation mode 106, with a score 108 output in the evaluation mode.]
-
[FIG. 2 (Sheet 2 of 12): Flow of example dialog states 202-212 for an interview-style task. The flow begins (202) with a welcome (204) and proceeds through prompts including "Could you tell me more about your education?", "Do you plan to return to school for higher studies?", "Have you ever quit a job before?", "Have you ever spoken before a group of people?", and "Well, I have been asking you a lot of questions. Do you have any questions for me?". After several prompts, an "Is the answer affirmative?" evaluation branches on YES, NO, or DEFAULT to acknowledgements such as "Interesting", "I see", "That's unfortunate", "Okay, that's great", "Great!", "Okay, move on", "Let's talk about your experience", "Get back to you later", and "Thanks, was a pleasure" before the flow returns.]
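
The branching flow of FIG. 2 can be pictured as a small graph of dialog states. The following is a minimal sketch only: the state names, the `DIALOG_STATES` table, and the `traverse` helper are hypothetical, and the assignment of acknowledgement prompts to the YES, NO, and DEFAULT branches is a guess, since the drawing's edges are not recoverable from the text.

```python
# Hypothetical encoding of a few FIG. 2 dialog states (names are illustrative).
# Which acknowledgement follows which branch is an assumption.
DIALOG_STATES = {
    "welcome": {"prompt": "Welcome.", "next": {"default": "education"}},
    "education": {
        "prompt": "Could you tell me more about your education?",
        "next": {"default": "higher_studies"},
    },
    "higher_studies": {
        "prompt": "Do you plan to return to school for higher studies?",
        "next": {"yes": "ack_yes", "no": "ack_no", "default": "ack_default"},
    },
    "ack_yes": {"prompt": "Interesting.", "next": {"default": "end"}},
    "ack_no": {"prompt": "I see.", "next": {"default": "end"}},
    "ack_default": {
        "prompt": "Let's talk about your experience.",
        "next": {"default": "end"},
    },
}


def traverse(states, start, meanings):
    """Walk the graph, consuming one meaning label per prompt.

    Returns the list of prompts issued. Unknown or missing labels fall
    back to the state's "default" branch.
    """
    prompts, state = [], start
    labels = iter(meanings)
    while state != "end":
        node = states[state]
        prompts.append(node["prompt"])
        label = next(labels, "default")
        state = node["next"].get(label, node["next"]["default"])
    return prompts
```

Calling `traverse(DIALOG_STATES, "welcome", ["default", "default", "yes"])` walks the welcome, the two questions, and the YES acknowledgement. The meaning labels are supplied directly here; how a label is derived from a spoken response is a separate step.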
-
[FIG. 3 (Sheet 3 of 12): Educational dialog system 302. A task administrator (dialog manager) 304 exchanges prompts and responses with interactors in training 306 and evaluation 308 modes. Responses 310 go to a language model 312, which produces a response meaning 314; a (spoken) language understanding model 316 uses the response meaning to select the next dialog state 318. A score 320 is output in evaluation mode, and task models are kept in a data store 322.]
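
The two-stage handling shown in FIG. 3 (a language model producing a response meaning, then a language understanding model selecting the next dialog state) might be sketched as follows. Everything named here is an assumption for illustration: `TaskModel`, `run_turn`, the keyword matcher, and the lookup table are hypothetical stand-ins, not the models the patent describes.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TaskModel:
    """Hypothetical container for the two models of a task."""
    language_model: Callable[[str], str]            # response -> meaning
    understanding_model: Callable[[str, str], str]  # (state, meaning) -> next state


def keyword_language_model(response: str) -> str:
    """Toy stand-in: map recognized text to a coarse meaning label."""
    cleaned = response.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    if words & {"yes", "yeah", "sure"}:
        return "affirmative"
    if words & {"no", "nope", "never"}:
        return "negative"
    return "other"


def table_understanding_model(state: str, meaning: str) -> str:
    """Toy stand-in: look up the next state, defaulting to "move_on"."""
    table = {
        ("quit_job", "affirmative"): "ask_why",
        ("quit_job", "negative"): "move_on",
    }
    return table.get((state, meaning), "move_on")


def run_turn(model: TaskModel, state: str, response: str) -> Tuple[str, str]:
    """One turn: derive the meaning, then pick the next dialog state."""
    meaning = model.language_model(response)
    return meaning, model.understanding_model(state, meaning)
```

Keeping the two models as separate callables mirrors the figure's separation of meaning identification from state selection, so either one could be revised without touching the other.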
-
[FIG. 4 (Sheet 4 of 12): Educational dialog system in training mode 406. The task administrator (dialog manager) 404 loads a task model from the task model data store 408, issues prompts 410, and receives responses 412. The language model 414 produces a response meaning 416, and the language understanding model 418 selects the next dialog state 420.]
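
The training-mode adjustment associated with FIG. 4 (strengthening the models after a completed or positively rated task, and revising them after a failed or negatively rated one) could look roughly like the sketch below. It is an assumption-laden toy: the `MeaningWeights` class and its additive weight table are hypothetical and far simpler than the neural network weights the description mentions as one possibility.

```python
from collections import defaultdict


class MeaningWeights:
    """Toy language model: one weight per (response text, meaning) pair."""

    def __init__(self):
        self.weights = defaultdict(float)

    def best_meaning(self, response, candidates):
        """Return the candidate meaning with the highest weight.

        Ties go to the earliest candidate in the list.
        """
        return max(candidates, key=lambda m: self.weights[(response, m)])

    def update(self, turns, success, step=1.0):
        """Strengthen or weaken every (response, meaning) pair used in a task.

        `success` would be True when the interactor completed the task or
        reported a positive experience, and False otherwise.
        """
        delta = step if success else -step
        for response, meaning in turns:
            self.weights[(response, meaning)] += delta
```

For instance, if sessions that read "I don't know" as `no_knowledge` tend to fail while sessions that read it as `coy` tend to succeed, repeated calls to `update` shift `best_meaning` toward `coy` for that task.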
-
[FIG. 5 (Sheet 5 of 12): Educational dialog system in educational or evaluation mode 506. The task administrator (dialog manager) 504 loads a task model from the task model data store 508, sets up the language model 510 and the language understanding model 512, issues prompts 514, receives responses 516, and outputs a score 518.]
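
In the evaluation mode of FIG. 5, the score reflects the appropriateness of responses together with other metrics such as pronunciation and grammar. How those metrics are combined is not stated, so the weighted average below is purely an assumed example; `interaction_score`, the metric keys, and the default weights are all hypothetical.

```python
def interaction_score(turn_metrics, weights=None):
    """Combine per-turn metrics (each in [0, 1]) into one score in [0, 1].

    Each turn is a dict with the keys named in `weights`. The score is the
    mean over turns of the weighted average of that turn's metrics.
    """
    if weights is None:
        # Assumed weighting; the source does not specify one.
        weights = {"appropriateness": 0.5, "pronunciation": 0.25, "grammar": 0.25}
    if not turn_metrics:
        return 0.0
    total_weight = sum(weights.values())
    per_turn = [
        sum(weights[key] * turn[key] for key in weights) / total_weight
        for turn in turn_metrics
    ]
    return sum(per_turn) / len(per_turn)
```

A session with one turn scored 1.0, 0.5, 0.5 and another scored 1.0 on every metric would average to 0.875 under these assumed weights.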
-
[FIG. 6 (Sheet 6 of 12): Multi-modal educational dialog system. Responses carry speech content, gestures, and facial expressions, captured by a microphone and ASR, a motion detector, and a video camera. The task administrator, language model, language understanding model, task models, training and evaluation modes, next dialog state, and score are arranged as in FIGS. 3-5.]
-
[FIG. 7 (Sheet 7 of 12): Example system components. Callers connect over PSTN, SIP, or WebRTC to telephony servers, which communicate with a voice browser over SIP and carry audio over RTP. The voice browser reaches a speech server (ASR) over MRCP (v2) and a web server over HTTP; TCP links are also shown. Labeled stages: (2) data logging and organization into a logging database (MySQL); (3) iterative refinement of models and callflows through a Speech Transcription, Annotation & Rating (STAR) portal.]
-
[FIG. 8 (Sheet 8 of 12): Example system components with named software. Labeled stages: (1) crowdsourced data collection using Amazon Mechanical Turk; (2) data logging and organization; (3) iterative refinement of models and callflows. Components: telephony servers with video support (Asterisk and FreeSWITCH) reached over PSTN, SIP, and WebRTC; voice browser (JVoiceXML with Zanzibar) connected by SIP; speech server comprising Cairo (MRCP server), Sphinx (ASR), Kaldi (ASR) server, Festival (TTS), and Mary (TTS), reached over MRCP (v2) with RTP audio; Apache web server serving VXML, JSGF, ARPA, SRGS, and WAV resources over HTTP; logging database (MySQL); Speech Transcription, Annotation & Rating (STAR) portal.]
-
[FIG. 9 (Sheet 9 of 12)]

Item                      No. dialog states   No. calls   Completion rate (%)
Pragmatics (food offer)   ?                   131         61.83
Pragmatics (scheduling)   ?                   166         66.87
Job interview             8                   192         35.42
Pizza customer service    7                   187         47.06

[Plot below the table: proportion of completed calls (vertical axis, 0.0 to 0.8) against day of data collection (horizontal axis, ticks at 5, 10, 15, 20).]
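
As a quick consistency check on the FIG. 9 table, the completion rates can be turned back into whole numbers of completed calls. This sketch assumes that completion rate means completed calls divided by total calls, and that the calls and rates pair up with the items in the order listed; the names `FIG9_ROWS` and `completed_calls` are hypothetical.

```python
# (item, number of calls, completion rate in percent), in the order listed.
FIG9_ROWS = [
    ("Pragmatics (food offer)", 131, 61.83),
    ("Pragmatics (scheduling)", 166, 66.87),
    ("Job interview", 192, 35.42),
    ("Pizza customer service", 187, 47.06),
]


def completed_calls(calls, rate_percent):
    """Nearest whole number of completed calls implied by a percentage."""
    return round(calls * rate_percent / 100)


for item, calls, rate in FIG9_ROWS:
    done = completed_calls(calls, rate)
    # Recompute the percentage from the whole-number count for comparison.
    print(f"{item}: {done}/{calls} = {100 * done / calls:.2f}%")
```

Under those assumptions the implied counts are 81, 111, 68, and 88 completed calls, and each count reproduces the tabulated percentage to two decimal places.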
-
[FIG. 10 (Sheet 10 of 12): Flow diagram.
1002: Access initial task model.
1004: Provide task represented by task model to persons for crowdsourced training.
1006: Update task model by revising language model and spoken language understanding model.
1008: Provide updated task to student for development of speaking capabilities.]
-
[FIG. 11A (Sheet 11 of 12): Standalone system 1100 in which a processing system runs the computer-implemented educational dialog system and accesses a computer-readable memory and data store(s) holding a task model and scores.]

[FIG. 11B (Sheet 11 of 12): Client-server system 1120 in which user PCs connect through network(s) to server(s); a processing system on the server(s) runs the computer-implemented educational dialog system and accesses a computer-readable memory and data store(s) holding a task model and scores.]
-
[FIG. 11C (Sheet 12 of 12): Hardware block diagram 1150 showing a CPU, ROM, RAM, disk controller, communication ports, CD-ROM, hard drive, floppy drive, display and display interface, and an interface for a keyboard and microphone.]
-
COMPUTER-IMPLEMENTED SYSTEMS AND METHODS FOR A CROWD SOURCE-BOOTSTRAPPED SPOKEN DIALOG SYSTEM

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 62/232,537, entitled "Bootstrapping Development of a Cloud-Based Multimodal Dialog System in the Educational Domain," filed Sep. 25, 2015, the entirety of which is incorporated herein by reference.

FIELD

The technology described in this patent document relates generally to interaction evaluation and more particularly to development of a spoken dialog system for teaching and evaluation of interactions.

BACKGROUND

Spoken dialog systems (SDSs) consist of multiple subsystems, such as automatic speech recognizers (ASRs), spoken language understanding (SLU) modules, dialog managers (DMs), and spoken language generators, among others, interacting synergistically and often in real time. Each of these subsystems is complex and brings with it design challenges and open research questions in its own right. Rapidly bootstrapping a complete, working dialog system from scratch is therefore a challenge of considerable magnitude. Apart from the issues involved in training reasonably accurate models for ASR and SLU that work well in the domain of operation in real time, one should also ensure that the individual systems work well in sequence, such that the overall SDS performance does not suffer and the system provides an effective interaction with interlocutors who call into it.

The ability to rapidly prototype and develop such SDSs is important for applications in the educational domain. For example, in automated conversational assessment, test developers might design several conversational items, each in a slightly different domain or subject area. One should, in such situations, be able to rapidly develop models and capabilities to ensure that the SDS can handle each of these diverse conversational applications gracefully. This is also true in the case of learning applications and so-called formative assessments: one should be able to quickly and accurately bootstrap SDSs that can respond to a wide variety of learner inputs across domains and contexts. Language learning and assessments add yet another complication in that systems need to deal gracefully with non-native speech. Despite these challenges, the increasing demand for non-native conversational learning and assessment applications makes this avenue of research an important one to pursue; however, this requires us to find a way to rapidly obtain data for model building and refinement in an iterative cycle.

SUMMARY

Systems and methods are provided for implementing an educational dialog system. An initial task model is accessed that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning. The task is provided to a plurality of persons for training, where providing the task includes providing a prompt for a particular one of the dialog states, receiving a response to the prompt, using the language model to determine the response meaning based on the received response, and selecting a particular next dialog state based on the determined response meaning. The task model is updated by revising the language model and the language understanding model based on responses received to prompts of the provided task, and the updated task is provided to a student for development of speaking capabilities.

As another example, a system for implementing an educational dialog system includes a processing system that includes one or more data processors and a computer-readable medium encoded with instructions for commanding the processing system to execute steps of a method. In the method, an initial task model is accessed that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning. The task is provided to a plurality of persons for training, where providing the task includes providing a prompt for a particular one of the dialog states, receiving a response to the prompt, using the language model to determine the response meaning based on the received response, and selecting a particular next dialog state based on the determined response meaning. The task model is updated by revising the language model and the language understanding model based on responses received to prompts of the provided task, and the updated task is provided to a student for development of speaking capabilities.

As a further example, a computer-readable medium is encoded with instructions for commanding a processing system to implement a method associated with an educational dialog system. In the method, an initial task model is accessed that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning. The task is provided to a plurality of persons for training, where providing the task includes providing a prompt for a particular one of the dialog states, receiving a response to the prompt, using the language model to determine the response meaning based on the received response, and selecting a particular next dialog state based on the determined response meaning. The task model is updated by revising the language model and the language understanding model based on responses received to prompts of the provided task, and the updated task is provided to a student for development of speaking capabilities.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram depicting a processor-implemented educational dialog system.

FIG. 2 is a diagram depicting example dialog states associated with a task.

FIG. 3 is a diagram depicting example components of an educational dialog system.

FIG. 4 is a diagram depicting active entities of an educational dialog system in a training mode.

FIG. 5 is a diagram depicting active entities of an educational dialog system in an educational or evaluation mode.
-
FIG. 6 is a diagram depicting a multi-modal educational dialog system.

FIGS. 7-8 are diagrams depicting example components of an educational dialog system.

FIG. 9 is a diagram depicting four example tasks and improved performance of task models as those task models are trained (e.g., using the crowd sourcing techniques described herein).

FIG. 10 is a flow diagram depicting a processor-implemented method for implementing an educational dialog system.

FIGS. 11A, 11B, and 11C depict example systems for implementing the approaches described herein for implementing a computer-implemented educational dialog system.

DETAILED DESCRIPTION

FIG. 1 is a diagram depicting a processor-implemented educational dialog system. The educational dialog system 102 is configured to interactively provide prompts associated with dialog states to an interactor, prompting the interactor to provide a response. In this way, the interactor can participate in a simulated conversation with the dialog system 102. The dialog system 102 may provide its prompts in a voice-only fashion (e.g., via a speaker), or in a multimodal fashion using an avatar (e.g., via a graphical user interface, a puppet, an artificial life form) that communicates both voice and information via other modalities, such as facial expressions and body movements.

The educational dialog system 102 of FIG. 1 is configured initially with a base task model that identifies the dialog states associated with a conversation task. The dialog states indicate an anticipated path of a task that will be facilitated by the educational dialog system 102. FIG. 2 is a diagram depicting example dialog states associated with a task. The task begins at 202, where the dialog system provides a welcome at 204 and asks initial questions at 206 and 208. The question at 208 is the first point in the dialog states at which the conversation branches based on the response given by the interactor, as indicated by the evaluation at 210 and the corresponding branch at 212 to one of three different possible paths. In one embodiment, the evaluation of the interactor response to the prompt question at 208 is performed in two steps. First, the response to the prompt is processed (e.g., voice responses are decoded via automatic speech recognition, facial expressions are determined via video processing, body language is detected via infrared motion capture) to determine a response meaning associated with the response (e.g., based on the totality of data received in the response, such as audio, video, and motion capture) using a language model. Once the meaning is determined at 208-210, that meaning is utilized at 210-212 to select an appropriate branch using a meaning understanding (spoken language) model. While the example of FIG. 2 generally depicts a single path task conversation, more complicated sets of dialog states (e.g., tree shaped) can be implemented. For example, voice, gesture, and facial expression data can be used to measure an engagement level of an interactor (e.g., based on head pose, gaze, and facial expressions to identify smiles or indications of boredom, such as yawns, as well as content of detected speech), where positive feedback or other encouragement is given to an interactor whose engagement is determined to have waned.

With reference back to FIG. 1, the educational dialog system 102 utilizes a task model, which includes the plurality of dialog states associated with a task, the language model for understanding the response meaning associated with a received response, and a language (e.g., spoken language) understanding model configured to select a next dialog state based on the identified response meaning. An initial task model can take the form of a set of dialog states (e.g., as depicted in FIG. 2) and a base/default language model and language understanding model. The educational dialog system 102 functions in two modes, a training mode 104 and an active learning/evaluation mode 106.

In the training mode, a variety of persons interact with the task to further develop the default language and language understanding models. This additional training enables the dialog system 102 to better understand context and nuances associated with the task being developed and implemented. For example, in different tasks, the phrase "I don't know" can have different meanings. In a task where an interactor is under police interrogation, that phrase likely means that the subject has no knowledge of the topic. But, where the task potentially includes flirtatious behavior by an interactor, the phrase "I don't know" combined with a smile and a shrug could imply coy behavior, where the subject really does have knowledge on the topic. The base language model and language understanding model may not be able to understand such nuances, but trained versions of those models, which adjust their behavior based on successful/unsuccessful completions of tasks and indicated survey approvals or disapprovals by training-interactors, will gain an understanding of these factors over time.

In one embodiment, the training mode 104 for a task model for a task is crowd sourced (e.g., using the Amazon Mechanical Turk platform). A plurality of training-interactors interact with a task in training mode 104 via prompts and responses, where those responses (e.g., speech, facial expressions, gestures) are captured and evaluated to determine whether the language model and/or language understanding model should be adjusted. Once the task model has been refined via a number of interactions with the public, crowd-sourced participants, the improved task model can be provided for educational purposes, such as for developing speaking and interaction skills of non-native language speakers in a drilling or even an evaluation context, where a score 108 is provided.

FIG. 3 is a diagram depicting example components of an educational dialog system. The dialog system 302 includes a task administrator (dialog manager) 304. The task administrator 304 monitors traversal of the dialog states of a task conversation. It provides corresponding prompts, whether in training 306 or evaluation 308 mode, and receives corresponding responses. Responses 310 (e.g., text from automatic speech recognition performed on a response) are provided to the language model 312, which determines a response meaning 314 that is returned to the task administrator 304 (or directly to the language understanding model 316). The response meaning 314 is received by the language understanding model 316, which determines the next dialog state 318 that should be taken in the task, where that next state 318 is returned to the task administrator 304.

In a training mode 306, the task administrator 304 is configured to adjust the language model 312 and language understanding model 316 to improve their performance in subsequent interactor iterations. The task model, which includes the dialog states, the current language model, and the current language understanding model, is accessed from a task model data store 322 before each training iteration and is returned, when any of those entities are altered, for storage. In an evaluation mode 308, the task administrator
-
304 may be configured to output a score 320 indicating a quality of responses received from the evaluation-interactor.

FIG. 4 is a diagram depicting active entities of an educational dialog system in a training mode. In training mode 406, the task administrator 404 accesses the task model for the task to be administered from the task model database 408. The task administrator 404 traverses the dialog states associated with the task model, providing prompts 410 and receiving responses 412 from the training-interactor. The responses 412 are provided to the language model 414, which determines response meanings 416, where each response meaning 416 is used by the language understanding model 418 to determine the next dialog state 420 for the task.

Following conclusion of the task (via a completion of the entirety of the task or a failure to complete the task), the task administrator 404 adjusts the language model 414 and the language understanding model 418 based on the training interactions. For example, if a task is not completed or if an interactor states via a survey that they were dissatisfied with the task (e.g., the task did not provide a next prompt that was appropriate for their current response), then the task administrator 404 determines that one of the models 414, 418 should be adjusted to function better. For example, the language model 414 may be refined to apply a different response meaning 416 to a particular response 412 from the training-interactor that resulted in an erroneous dialog state path. If the training-interactor completes the task or indicates a positive experience, then that data is utilized to strengthen the models 414, 418 (e.g., weights associated with potential paths or factors in a neural network model) based on the confirmation that those models 414, 418 behaved appropriately in response to the responses 412 received from the training-interactor. The adjusted task model is then returned by the task administrator 404 to the task model data store 408.

FIG. 5 is a diagram depicting active entities of an educational dialog system in an educational or evaluation mode. In the evaluation mode 506, the task administrator 504 accesses a task model from the task model data store 508 and uses that data to set up the language model 510 and language understanding model 512. The task administrator 504 traverses the dialog states of the task as informed by responses 516 to prompts 514 with the aid of the language model 510 and the language understanding model 512. The task administrator 504 tracks the appropriateness of responses 516 received from the evaluation-interactor as well as other metrics associated with those responses (e.g., pronunciation, grammar) to determine a score 518 indicative of the quality of the interactor's communication with the

(SIP), Public Switched Telephone Network (PSTN), and web Real-Time Communications (WebRTC) standards and include support for voice and video;

A voice browser (e.g., JVoiceXML), which is compatible with VoiceXML 2.1, can process SIP traffic, and incorporates support for multiple grammar standards, such as Java Speech Grammar Format (JSGF), Advanced Research Projects Agency (ARPA), and Weighted Finite State Transducer (WFST);

A Media Resource Control Protocol (MRCP) speech server, which allows the voice browser to initiate SIP or Real-Time Transport Protocol (RTP) connections from/to the telephony server and incorporates two speech recognizers and synthesizers;

An Apache Tomcat-based web server, which can host dynamic VoiceXML pages, web services, and media libraries containing grammars and audio files;

OpenVXML, a VoiceXML-based voice application authoring suite, which generates dynamic web applications that can be housed on the web server;

A MySQL database server for storing call logs;

A speech transcription, annotation, and rating portal that allows one to listen to and transcribe full-call recordings, rate them on a variety of dimensions such as caller experience and latency, and perform various semantic annotation tasks to train ASR and SLU modules.

FIG. 9 is a diagram depicting four example tasks and improved performance of task models as those task models are trained (e.g., using the crowd sourcing techniques described herein). In the example of FIG. 9, four tasks are described, having between 1 and 8 dialog states. The final two examples are longer, having 8 and 7 dialog states, respectively. Task models were trained over the number of iterations shown in the middle column, where completion rate is illustrated in a final column as an indicator of quality of the task models. The graph at the bottom of FIG. 9 shows an improvement of completion rate associated with the job interview and pizza customer service tasks over time, as the associated language and language understanding modules were trained using crowd source participation.

FIG. 10 is a flow diagram depicting a processor-implemented method for implementing an educational dialog system. At 1002, an initial task model is accessed that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning. The task is provided to
educational a plurality of persons for training at 1004 , where
providing dialog system 502 . 50 the task includes providing a
prompt for a particular one of
FIG . 6 is a diagram depicting a multi - modal educational the
dialog states , receiving a response to the prompt , using dialog
system . FIG . 6 indicates that speech response data is the
language model to determine the response meaning captured via
automatic speech recognition , as well as gesture based on the
received response , and selecting a particular data via motion
detection ( e.g. , using an XBOX Kinect next dialog state based on
the determined response meaning . motion capture device ) and
facial expression data via video 55 The task model is updated at
1006 by revising the language capture and analysis . The language
model uses all or a model and the language understanding model
based on portion of that data to extract meaning associated with
responses received to prompts of the provided task , and the
responses , where extracted text associated with speech may updated
task is provided to a student at 1008 for develop be assigned
different meanings depending on the context of ment of speaking
capabilities . gesture and facial expression data that is
identified . FIGS . 11A , 11B , and 11C depict example systems for
FIGS . 7-8 are diagrams depicting example components of
implementing the approaches described herein for imple
an educational dialog system . Certain embodiments use a menting
a computer - implemented educational dialog sys HALEF dialog system
to develop conversational applica tem . For example , FIG . 11A
depicts an exemplary system tions within the crowd sourcing
framework . These systems 1100 that includes a standalone computer
architecture where can include one or more of the following
components : 65 a processing system 1102 ( e.g. , one or more
computer
Telephony servers ( e.g. , Asterisk and FreeSWITCH ) ,
processors located in a given computer or in multiple which are
compatible with Session Initiation Protocol computers that may be
separate and distinct from one
another) includes a computer-implemented educational dialog system 1104 being executed on the processing system 1102. The processing system 1102 has access to a computer-readable memory 1107 in addition to one or more data stores 1108. The one or more data stores 1108 may include task models 1110 as well as scores 1112. The processing system 1102 may be a distributed parallel computing environment, which may be used to handle very large-scale data sets.

FIG. 11B depicts a system 1120 that includes a client server architecture. One or more user PCs 1122 access one or more servers 1124 running a computer-implemented educational dialog system 1137 on a processing system 1127 via one or more networks 1128. The one or more servers 1124 may access a computer-readable memory 1130 as well as one or more data stores 1132. The one or more data stores 1132 may include task models 1134 as well as scores 1138.

FIG. 11C shows a block diagram of exemplary hardware for a standalone computer architecture 1150, such as the architecture depicted in FIG. 11A that may be used to include and/or implement the program instructions of system embodiments of the present disclosure. A bus 1152 may serve as the information highway interconnecting the other illustrated components of the hardware. A processing system 1154 labeled CPU (central processing unit) (e.g., one or more computer processors at a given computer or at multiple computers), may perform calculations and logic operations required to execute a program. A non-transitory processor-readable storage medium, such as read only memory (ROM) 1158 and random access memory (RAM) 1159, may be in communication with the processing system 1154 and may include one or more programming instructions for performing the method of implementing a computer-implemented educational dialog system. Optionally, program instructions may be stored on a non-transitory computer-readable storage medium such as a magnetic disk, optical disk, recordable memory device, flash memory, or other physical storage medium.

In FIGS. 11A, 11B, and 11C, computer readable memories 1108, 1130, 1158, 1159 or data stores 1108, 1132, 1183, 1184, 1188 may include one or more data structures for storing and associating various data used in the example systems for implementing a computer-implemented educational dialog system. For example, a data structure stored in any of the aforementioned locations may be used to store data from XML files, initial parameters, and/or data for other variables described herein. A disk controller 1190 interfaces one or more optional disk drives to the system bus 1152. These disk drives may be external or internal floppy disk drives such as 1183, external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 1184, or external or internal hard drives 1185. As indicated previously, these various disk drives and disk controllers are optional devices.

Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 1190, the ROM 1158 and/or the RAM 1159. The processor 1154 may access one or more components as required.

A display interface 1187 may permit information from the bus 1152 to be displayed on a display 1180 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 1182.

In addition to these computer-type components, the hardware may also include data input devices, such as a keyboard 1179, or other input device 1181, such as a microphone, remote control, pointer, mouse and/or joystick.

Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein and may be provided in any suitable language such as C, C++, JAVA, for example, or any other suitable programming language. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.

The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.

The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.

While the disclosure has been described in detail and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the embodiments. Thus, it is intended that the present disclosure cover the modifications and variations of this disclosure provided they come within the scope of the appended claims and their equivalents. For example, in one embodiment, in addition to or in the alternative to adjusting language models and language understanding models, systems and methods can be configured to adjust acoustic models (models that relate how probable a given sequence of words correspond to the actual speech signal received), dialog management models (models that inform what the optimal next action of the dialog system should be based on the given state), and engagement prediction models (models that inform the dialog manager how to react given a current engagement state of the user).

The invention claimed is:
1. A processor-implemented method for implementing an educational dialog system, comprising:
accessing an initial task model that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning;
wherein the language model identifies the response meaning based on content of speech of the response received from a person, wherein the content of the speech is determined using automatic speech recognition;
wherein the language model further identifies the response meaning based on gestures associated with the response received from the person, wherein the gestures are captured via a video capture device or an infrared capture device;
providing the task to a plurality of persons for training, wherein providing the task includes providing a prompt for a particular one of the dialog states, receiving a response to the prompt, using the language model to determine the response meaning based on the received response, and selecting a particular next dialog state based on the determined response meaning;
providing a survey to each of the plurality of persons after interaction with the provided task, wherein the survey requests evaluation data regarding the quality of the interaction;
updating the task model by revising the language model and the language understanding model based on responses received to prompts of the provided task and the evaluation data from the surveys;
providing an updated task to a student for development of speaking capabilities; and
scoring the student's speaking capabilities based on the student's interaction with the updated task.
2. The method of claim 1, wherein the updated task is provided to the student by an educational organization;
wherein said providing the task to a plurality of persons for training includes providing the task to a pool of public persons unaffiliated with the educational organization.
3. The method of claim 1, further comprising:
providing revised tasks to further persons for additional training prior to providing the updated task to the student.
4. The method of claim 3, further comprising:
tracking whether a first person participating in a round of training completes all of the dialog states associated with the task as a first metric;
tracking whether a second person participating in a round of training completes all of the dialog states associated with the updated task as a second metric;
comparing the first metric and the second metric to determine whether the task model is improving based on additional training.
5. The method of claim 4, wherein updates to the task model are retained when the task model is determined to have improved, wherein updates are reverted when the task model is determined not to have improved.
6. The method of claim 1, wherein the updated task is provided to a student for development of non-native speaking capabilities.
7. A processor-implemented system for implementing an educational dialog system, comprising:
a processing system comprising one or more data processors;
a non-transitory computer-readable medium encoded with instructions for commanding the processing system to execute steps of a method that include:
accessing an initial task model that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning;
wherein the language model identifies the response meaning based on content of speech of the response received from a person, wherein the content of the speech is determined using automatic speech recognition;
wherein the language model further identifies the response meaning based on gestures associated with the response received from the person, wherein the gestures are captured via a video capture device or an infrared capture device;
providing the task to a plurality of persons for training, wherein providing the task includes providing a prompt for a particular one of the dialog states, receiving a response to the prompt, using the language model to determine the response meaning based on the received response, and selecting a particular next dialog state based on the determined response meaning;
providing a survey to each of the plurality of persons after interaction with the provided task, wherein the survey requests evaluation data regarding the quality of the interaction;
updating the task model by revising the language model and the language understanding model based on responses received to prompts of the provided task and the evaluation data from the surveys;
providing an updated task to a student for development of speaking capability; and
scoring the student's speaking capability based on the student's interaction with the updated task.
8. The system of claim 7, wherein the updated task is provided to the student by an educational organization;
wherein said providing the task to a plurality of persons for training includes providing the task to a pool of public persons unaffiliated with the educational organization.
9. The system of claim 7, wherein the steps further include:
providing revised tasks to further persons for additional training prior to providing the updated task to the student.
10. The system of claim 9, wherein the steps further include:
tracking whether a first person participating in a round of training completes all of the dialog states associated with the task as a first metric;
tracking whether a second person participating in a round of training completes all of the dialog states associated with the updated task as a second metric;
comparing the first metric and the second metric to determine whether the task model is improving based on additional training.
11. The system of claim 10, wherein updates to the task model are retained when the task model is determined to have improved, wherein updates are reverted when the task model is determined not to have improved.
12. The system of claim 10, wherein the updated task is provided to a student for development of non-native speaking capabilities.
13. A non-transitory computer-readable medium encoded with instructions for commanding one or more data processors to execute steps of a method for implementing an educational dialog system, the method comprising:
accessing an initial task model that identifies a plurality of dialog states associated with a task, a language model configured to identify a response meaning associated
with a received response, and a language understanding model configured to select a next dialog state based on the identified response meaning;
wherein the language model identifies the response meaning based on content of speech of the response received from a person, wherein the content of the speech is determined using automatic speech recognition;
wherein the language model further identifies the response meaning based on gestures associated with the response received from the person, wherein the gestures are captured via a video capture device or an infrared capture device;
providing the task to a plurality of persons for training, wherein providing the task includes providing a prompt for a particular one of the dialog states, receiving a response to the prompt, using the language model to determine the response meaning based on the received response, and selecting a particular next dialog state based on the determined response meaning;
providing a survey to each of the plurality of persons after interaction with the provided task, wherein the survey requests evaluation data regarding the quality of the interaction;
updating the task model by revising the language model and the language understanding model based on responses received to prompts of the provided task and the evaluation data from the surveys;
providing an updated task to a student for development of speaking capabilities; and
scoring the student's speaking capabilities based on the student's interaction with the updated task.
14. The method of claim 1, wherein the gestures captured include facial expressions.
15. The method of claim 1, wherein the language model further identifies the response meaning based on a detected engagement level of the person.
16. The system of claim 7, wherein the gestures captured include facial expressions.
17. The system of claim 7, wherein the language model further identifies the response meaning based on a detected engagement level of the person.
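
By way of non-limiting illustration only, the crowd-sourced training rounds described with reference to FIG. 10 and recited in claims 3-5 might be sketched in Python as follows. The sketch is not taken from the disclosure and rests on several assumptions: every name (TaskModel, run_task, completion_rate, train_round, demo_model) is hypothetical; a keyword lookup stands in for the language model; a transition table stands in for the language understanding model; prompt text, speech recognition, gestures, and surveys are omitted; and a completion rate over a small crowd stands in for the first and second metrics.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class TaskModel:
    # Stand-in "language model": raw response text -> response meaning.
    meanings: Dict[str, str]
    # Stand-in "language understanding model":
    # (dialog state, response meaning) -> next dialog state; None ends the task.
    transitions: Dict[Tuple[str, str], Optional[str]]
    start: str = "greet"


def run_task(model: TaskModel, answers: List[str]) -> bool:
    """Traverse dialog states for one interactor; True if the task completes."""
    state: Optional[str] = model.start
    for response in answers:
        meaning = model.meanings.get(response.lower(), "unknown")
        key = (state, meaning)
        if key not in model.transitions:
            return False  # no appropriate next dialog state: task not completed
        state = model.transitions[key]
        if state is None:
            return True  # all dialog states completed
    return False  # interactor stopped before the task ended


def completion_rate(model: TaskModel, crowd: List[List[str]]) -> float:
    """Fraction of crowd interactors who complete the task."""
    return sum(run_task(model, answers) for answers in crowd) / len(crowd)


def train_round(
    model: TaskModel,
    update: Callable[[TaskModel], object],
    crowd: List[List[str]],
) -> Tuple[TaskModel, float]:
    """Apply an update to a copy; retain it if the metric improved, else revert."""
    before = completion_rate(model, crowd)  # first metric (original task)
    candidate = copy.deepcopy(model)
    update(candidate)
    after = completion_rate(candidate, crowd)  # second metric (updated task)
    if after > before:
        return candidate, after  # update retained
    return model, before  # update reverted


def demo_model() -> TaskModel:
    """Hypothetical two-state pizza-ordering task used only for illustration."""
    return TaskModel(
        meanings={"pizza": "order_pizza", "large": "size_large"},
        transitions={
            ("greet", "order_pizza"): "size",
            ("size", "size_large"): None,
        },
    )
```

In this sketch, a crowd such as `[["pizza", "large"], ["a pie", "large"]]` completes the demo task half the time; an update that maps "a pie" to the `order_pizza` meaning raises the completion rate and is retained, whereas an update that removes the "large" entry lowers it and is reverted. Comparing the candidate against a deep copy keeps the original task model untouched until an improvement is confirmed.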