Page 1
http://www.naist.jp/ Infinite possibilities, here at the cutting edge -Outgrow your limits-
Semi-supervised Learning by Machine Speech Chain for
Multilingual Speech Processing, and Recent Progress
on Automatic Speech Interpretation
Satoshi Nakamura,
Sakriani Sakti, and Katsuhito Sudoh
Graduate School of Science and Technology,
Nara Institute of Science and Technology, Japan
LT4ALL ©Satoshi Nakamura, AHC Lab, NAIST, Japan
1
Dec. 6. 2019
Page 2
Topics
Recent advances in speech processing
– ASR and TTS research
– Machine Speech Chain unifies ASR and TTS
– Application to code-switching speech
Speech Translation
– Recent Progress on Automatic Speech Interpretation
Page 3
Motivation and Background
In human communication, speaking and listening form a closed loop: the speech chain includes a critical auditory feedback mechanism (“Speech Chain”, Denes and Pinson, 1973).
[Figure: the human speech chain. Labels: speaking, listening, motor nerves, sensory nerves, auditory feedback.]
Page 4
Machine Speech Chain
Proposed Method
Develop a closed-loop speech chain model based on deep learning
[Figure: closed-loop machine speech chain. Labels: speaking, auditory feedback, sensory nerves, motor nerves. Example utterances: “Good afternoon”, “How are you?”]
Use the closed loop for ASR and TTS
Page 5
Machine Speech Chain
Definition:
– $x$ = original speech, $y$ = original text
– $\hat{x}$ = predicted speech, $\hat{y}$ = predicted text
– $\mathrm{ASR}(x): x \to \hat{y}$ (seq2seq model that transforms speech to text)
– $\mathrm{TTS}(y): y \to \hat{x}$ (seq2seq model that transforms text to speech)
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, “Listening while Speaking: Speech Chain by Deep Learning”, Proc. IEEE ASRU 2017
Losses: $L_{ASR}(y, \hat{y})$ and $L_{TTS}(x, \hat{x})$
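The definitions above can be made concrete with a small sketch of how the two losses are used in the closed loop. This is an illustration of the data flow only, not the authors' implementation: the toy losses (token mismatch rate, mean squared error), the toy models, and all function names are assumptions, and no parameters are actually trained. With paired data both losses are computed directly; with speech only, ASR transcribes and TTS reconstructs the speech; with text only, TTS synthesizes and ASR reconstructs the text.

```python
from typing import Callable, Dict, List, Optional, Tuple

Speech = List[float]  # toy stand-in for acoustic feature frames (assumption)
Text = List[int]      # toy stand-in for character or token ids (assumption)


def asr_loss(y: Text, y_hat: Text) -> float:
    """Toy L_ASR(y, y_hat): fraction of positions where the tokens differ."""
    n = max(len(y), len(y_hat))
    if n == 0:
        return 0.0
    mismatches = sum(a != b for a, b in zip(y, y_hat)) + abs(len(y) - len(y_hat))
    return mismatches / n


def tts_loss(x: Speech, x_hat: Speech) -> float:
    """Toy L_TTS(x, x_hat): mean squared error over equal-length frame lists."""
    if len(x) != len(x_hat):
        raise ValueError("this toy loss assumes equal-length speech")
    if not x:
        return 0.0
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def chain_losses(
    asr: Callable[[Speech], Text],
    tts: Callable[[Text], Speech],
    paired: Optional[Tuple[Speech, Text]] = None,
    speech_only: Optional[Speech] = None,
    text_only: Optional[Text] = None,
) -> Dict[str, float]:
    """Compute the losses available for each kind of training data."""
    losses: Dict[str, float] = {}
    if paired is not None:
        x, y = paired
        losses["asr_paired"] = asr_loss(y, asr(x))   # L_ASR(y, ASR(x))
        losses["tts_paired"] = tts_loss(x, tts(y))   # L_TTS(x, TTS(y))
    if speech_only is not None:
        x = speech_only
        y_hat = asr(x)                               # ASR transcribes ...
        losses["tts_speech_only"] = tts_loss(x, tts(y_hat))  # ... TTS reconstructs x
    if text_only is not None:
        y = text_only
        x_hat = tts(y)                               # TTS synthesizes ...
        losses["asr_text_only"] = asr_loss(y, asr(x_hat))    # ... ASR reconstructs y
    return losses


# Toy models (hypothetical): TTS halves each token id, ASR doubles and rounds.
def toy_tts(y: Text) -> Speech:
    return [0.5 * t for t in y]


def toy_asr(x: Speech) -> Text:
    return [round(2 * v) for v in x]


def biased_asr(x: Speech) -> Text:
    """A deliberately wrong recognizer, to show non-zero loop losses."""
    return [t + 1 for t in toy_asr(x)]
```

With the consistent toy pair, every loss is zero; swapping in `biased_asr` makes both unpaired losses non-zero, which is the signal the closed loop would use to keep training when no transcription or recording is available.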
Page 6
Speech Chain with One-shot Speaker Adaptation
Proposed model
– Train ASR and TTS models using unpaired data and a small amount of paired data.
– Speaker individuality is provided by a speaker recognition (SPKREC) embedding.
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, “Machine Speech Chain with One-shot Speaker Adaptation”, Proc. INTERSPEECH 2018
Page 7
Code-switching Challenges for ASR
Standard ASR is monolingual
– Japanese speech → ASR → “こんにちは” (“Hello”)
– English speech → ASR → “Hello”
Challenge with code-switching: mixed multilingual input
– “これはstill waterですか?” (“Is this still water?”; Japanese, English, Japanese) → ASR → output text?
• A typical case where paired speech and transcriptions are difficult to collect.
Page 8
Code-switching Challenges
[Figure: the ASR-TTS loop unfolded into two training cases, “Given only text” and “Given only speech”, each with a loss computed on the reconstructed text or speech.]
S. Nakayama, A. Tjandra, S. Sakti, S. Nakamura, “Speech chain for semi-supervised learning of Japanese-English code-switching ASR and TTS”, Proc. IEEE SLT 2018
Character error rate (CER): 18.11% → 5.08%
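For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between the reference transcription and the ASR output, divided by the reference length. The sketch below uses a standard Levenshtein distance; the function names and the example hypothesis are illustrative assumptions, and the evaluation script behind the reported numbers may normalize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Code-switching example: a hypothetical ASR output that hears "steel" for "still".
reference = "これはstill waterですか"
hypothesis = "これはsteel waterですか"
print(f"CER = {cer(reference, hypothesis):.2%}")
```

Counting errors per character rather than per word suits mixed Japanese-English text, since Japanese is written without spaces between words.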
Page 9
Topics
Recent advances in speech processing
– ASR and TTS research
– Machine Speech Chain unifies ASR and TTS
– Application to code-switching speech
Speech Translation
– Recent Progress on Automatic Speech Translation
Page 10
Speech Translation
2019/06/15 CLI9 Keynote Satoshi Nakamura, NAIST
[Figure: Japanese-to-English speech translation pipeline: Multilingual Speech Recognition → Spoken Language Translation → Multilingual Speech Synthesis. Example: speech 「私は学校に行く」 is recognized as “Watashi wa Gakko he iku”, translated to “I go to school”, and synthesized as English speech.]
• Cascaded process of speech recognition, machine translation, and speech synthesis.
• Machine translation of ASR transcripts.
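The cascade described above amounts to composing three components. The following is a minimal sketch with lookup-table stubs in place of real models; `cascade`, the stub functions, and the string placeholders for audio are all hypothetical and exist only to show the data flow.

```python
from typing import Callable


def cascade(
    asr: Callable[[str], str],
    mt: Callable[[str], str],
    tts: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose ASR, MT, and TTS into one speech-to-speech translation function."""
    def translate_speech(source_audio: str) -> str:
        transcript = asr(source_audio)   # speech recognition
        translation = mt(transcript)     # MT sees only the ASR transcript
        return tts(translation)          # speech synthesis
    return translate_speech


# Stub components (assumptions): audio is represented by a string placeholder.
def stub_asr(audio: str) -> str:
    return {"ja-audio-001": "Watashi wa Gakko he iku"}[audio]


def stub_mt(text: str) -> str:
    return {"Watashi wa Gakko he iku": "I go to school"}[text]


def stub_tts(text: str) -> str:
    return f"<synthesized speech: {text}>"


translate = cascade(stub_asr, stub_mt, stub_tts)
print(translate("ja-audio-001"))
```

Because the translation step receives only the transcript, any recognition error is passed on to translation and synthesis, and anything not written in the transcript, such as prosody, is lost at the first stage.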
Page 11
Cross-lingual Communication
[Figure: cross-lingual communication as an end-to-end process from source language to target language.]
Source language
– Input: text, speech, video, gesture
– Speech ⇒ text: ASR (real-time, incremental)
– Image ⇒ text: PR
– Linguistic information
– Paralinguistic information: emotion, style, personality, prosody, gesture
Conversion
– MT
– Dialog control
– Discourse context
– Domain knowledge
Target language
– Output: text, speech, video, gesture
– Text ⇒ speech: TTS
– Text ⇒ image: image generation
– Linguistic information
– Paralinguistic information: emotion, style, personality, prosody, gesture
Example
– Speech: “to o kyo e i ku”
– MT results: /I/go/to/Tokyo/
– TTS results: “ai go tu tokyo”
– Personality and prosody are carried from source to target
End-to-end process
Communication
① Simultaneity, incrementality, latency
② Para- and non-linguistic information
Page 12
Human Interpreter [A.Mizuno 2016]
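E-J Translation Example
Source (English): (1) The relief workers (2) say (3) they don’t have (4) enough food, water, shelter, and medical supplies (5) to deal with (6) the gigantic wave of refugees (7) who are ransacking the countryside (8) in search of the basics (9) to stay alive.
Translation A (chunk order 1, 9, 8, 7, 6, 5, 4, 3, 2): (1) 救援担当者は (9) 生きるための (8) 食料を求めて (7) 村を荒らし回っている (6) 大量の難民達の (5) 世話をするための (4) 十分な食料や水,宿泊施設,医療品が (3) 無いと (2) 言っています.
Number of memory chunks needed: more than 3
Translation B (chunk order 1, 2, 4, 3, 6, 5, 7, 9, 8): (1) 救援担当者達の (2) 話では (4)食料,水,宿泊施設,医薬品が, (3) 足りず (6) 大量の難民達の (5) 世話が出来ないとのことです.(7) 難民達は今村々を荒らし回って, (9) 生きるための (8) 食料を求めているのです.
English gloss of B: (1) According to the relief workers, (4) food, water, shelter, and medical supplies (3) are insufficient, so (6) the large number of refugees (5) cannot be cared for. (7) The refugees are now ransacking the villages, (8) looking for food (9) to stay alive.
Number of memory chunks needed: fewer than 3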
Page 13
Katsuki Chousa, Katsuhito Sudoh, and Satoshi Nakamura. 2019. Simultaneous Neural Machine Translation using Connectionist Temporal Classification. arXiv preprint, 1911.1193
Simultaneous Speech Translation with Adaptive Delay
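Define a special token <wait>, which the model outputs when it chooses to read more input instead of emitting a target word.
Source: ブッシュ大統領 は プーチン と 会談 する (“President Bush meets with Putin”)
Output: President Bush <wait> meets with Putin <wait> <wait>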
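As a small illustration of how such an output sequence can be post-processed, the sketch below removes the `<wait>` tokens to obtain the final translation and counts how many waits precede each emitted word. It assumes only that `<wait>` means "emit nothing at this step"; the helper names are hypothetical, and the sketch does not model how the translation system decides when to wait.

```python
from typing import List, Tuple

WAIT = "<wait>"


def strip_wait(tokens: List[str]) -> List[str]:
    """Drop <wait> tokens, leaving only the words that are actually output."""
    return [t for t in tokens if t != WAIT]


def waits_before_each_word(tokens: List[str]) -> List[Tuple[str, int]]:
    """Pair each emitted word with the number of <wait> tokens output before it."""
    result: List[Tuple[str, int]] = []
    waits = 0
    for token in tokens:
        if token == WAIT:
            waits += 1  # a wait step delays every later word
        else:
            result.append((token, waits))
    return result


# Output sequence as it appears on the slide.
decoder_output = ["President", "Bush", WAIT, "meets", "with", "Putin", WAIT, WAIT]

print(" ".join(strip_wait(decoder_output)))
print(waits_before_each_word(decoder_output))
```

The per-word wait count is one simple way to inspect where delay accumulates in a simultaneous translation output.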
Page 14
Paralinguistic Speech Translation
Q. T. Do, S. Sakti, S. Nakamura, “Sequence-to-Sequence Models for Emphasis Speech Translation”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1873–1883, 2018
Page 15
Summary
Machine Speech Chain
– Semi-supervised training using unpaired data
– Code-switching speech, under-resourced languages, and continuous learning
Speech Translation
– Simultaneous speech translation
• Segmentation, anticipation, rewording, evaluation
– Paralinguistic speech translation
• Emphasis and emotions
Future work
– Understanding and interpretation
– Context, situation and multi-modality
– Common sense, knowledge, and cross-cultural knowledge