04/19/23 1
Spoken Language Processing: Summing Up
Julia Hirschberg
CS 4706
What We’ve Studied
• Speech phenomena
  – What can people convey by varying the way they say something?
  – How do we identify this kind of variation?
  – What tools do we have for analysis?
• Speech generation (TTS)
• Speech recognition (ASR) and understanding (ASRU)
• Applications for speech technologies
What phenomena vary in speech?
• Intonational contours (ToBI)
  – Phrasing: scope
  – Accent: focus, given/new
  – Overall contour: speech acts
• Pitch range, timing
  – Topic structure
• Voice quality, intensity, …
  – Emotion
  – Deception?
  – Charisma?
Analyzing Speech: At the Acoustic Level
• How do we capture speech data for analysis?
  – Digitizing: sampling, quantization, filtering
• How can we distinguish one speech sound from another?
  – Periodic vs. aperiodic waveforms
    • Characterizing periodic waveforms: cycle, period, phase
  – Displaying and analyzing spectra, pitch tracks
  – Comparing intensity (dB)
• Tools to do all this and more: Praat
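The digitizing and measurement steps listed above can be illustrated with a small sketch. It is not how Praat does it; the function names, the 16 kHz sampling rate, the 16-bit depth, and the 100 Hz test tone are all assumptions chosen for illustration.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Sampling: evaluate a pure (periodic) tone at evenly spaced time points."""
    n = round(sample_rate_hz * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

def quantize(samples, bits):
    """Quantization: round samples in [-1, 1] to signed integer levels."""
    max_level = 2 ** (bits - 1) - 1
    return [round(s * max_level) for s in samples]

def rms(samples):
    """Root-mean-square amplitude of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def db_difference(samples_a, samples_b):
    """Intensity of a relative to b in decibels: 20 * log10(rms_a / rms_b)."""
    return 20 * math.log10(rms(samples_a) / rms(samples_b))

# Hypothetical test signal: a 100 Hz tone, 0.1 s long, sampled at 16 kHz.
FREQ_HZ = 100
SAMPLE_RATE_HZ = 16000
tone = sample_sine(FREQ_HZ, SAMPLE_RATE_HZ, 0.1)

period_s = 1 / FREQ_HZ                        # duration of one cycle
samples_per_cycle = SAMPLE_RATE_HZ / FREQ_HZ  # samples spanning one cycle
quantized = quantize(tone, 16)                # 16-bit integer samples

# Doubling the amplitude raises intensity by about 6 dB.
louder = [2 * s for s in tone]
gain_db = db_difference(louder, tone)
```

Decibels are always a comparison between two signals (or a signal and a reference level), which is why `db_difference` takes two arguments.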
Analyzing Speech: At the Phonetic Level
• Can we distinguish different languages in terms of their phoneme sets? Are there universal constraints on possible speech sounds?
  – Articulatory constraints
• How do we characterize the sounds of a given language:
  – Acoustic differences associated with place and manner of articulation distinguish consonants
  – Vowels differ in their formant frequencies
• Do we use such information in speech technologies?
Articulators in action
“Why did Ken set the soggy net on top of his deck?”
(Sample from the Queen’s University / ATR Labs X-ray Film Database)
Articulatory parameters for English consonants (in ARPAbet)
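Columns: PLACE OF ARTICULATION. Rows: MANNER OF ARTICULATION. Empty cells are marked "-".

MANNER    bilabial  labio-dental  inter-dental  alveolar  palatal  velar  glottal
stop      p b       -             -             t d       -        k g    q
fric.     -         f v           th dh         s z       sh zh    -      h
affric.   -         -             -             -         ch jh    -      -
nasal     m         -             -             n         -        ng     -
approx    w         -             -             l/r       y        -      -
flap      -         -             -             dx        -        -      -

VOICING: within each pair, the voiceless symbol precedes the voiced one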
American English vowel space
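[Figure: vowel chart with a FRONT to BACK horizontal axis and a HIGH to LOW vertical axis, plotting the ARPAbet vowels iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux and the diphthongs ey, ow, aw, oy, ay]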
Analyzing Speech: At the Phonological Level
• How do people develop models of intonation?
• ToBI
  – Tones: pitch accents, phrase accents, boundary tones
  – Break indices
• Hand labeling vs. automatic analysis
  – Which provides more useful information?
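The three tone types can be told apart by the diacritic on the label: "*" marks a pitch accent, "-" a phrase accent, and "%" a boundary tone. The sketch below sorts labels on that basis. The function names are hypothetical, and it covers only tone labels, not break indices.

```python
def classify_tone(label):
    """Classify a single ToBI tone label by its diacritic."""
    if label.endswith("%"):
        return "boundary tone"
    if label.endswith("-"):
        return "phrase accent"
    if "*" in label:
        return "pitch accent"
    raise ValueError(f"not a recognized tone label: {label!r}")

def split_phrase_final(label):
    """Split a phrase-final pair such as 'L-H%' into ('L-', 'H%')."""
    head, sep, tail = label.partition("-")
    if not sep or not tail.endswith("%"):
        raise ValueError(f"not a phrase accent + boundary tone pair: {label!r}")
    return head + "-", tail

def describe(label):
    """Return (tone, category) pairs for a single tone or a phrase-final pair."""
    if "-" in label and label.endswith("%"):
        parts = split_phrase_final(label)
    else:
        parts = (label,)
    return [(part, classify_tone(part)) for part in parts]

# Example labels; a real annotation tier would supply these.
for label in ["H*", "L+H*", "H+!H*", "L-L%", "H-H%"]:
    print(label, describe(label))
```

A rule this simple only checks label syntax. Deciding which label a stretch of speech deserves is the hard part, which is where the hand labeling vs. automatic analysis question comes in.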
[Figure: example contours. Pitch accents: L*+H, L*, H*. Phrase accent and boundary tone combinations: H-H%, H-L%, L-H%, L-L%]
[Figure: example contours. Pitch accents: H* !H*, H+!H*, L+H*. Phrase accent and boundary tone combinations: H-H%, H-L%, L-H%, L-L%]
Speech Generation
• Synthesis then and now
• Open problems in TTS:
  – Pronunciation modeling: OOV words, homographs, abbreviations
  – Predicting pitch accents and phrase boundaries: corpus-based approaches
  – Information status: focus, given/new
  – Modeling discourse structure
  – Producing emotional speech
  – Evaluation
Speech Recognition/Understanding
• ASR then and now: from speaker-dependent digit recognition using analog circuits to HMM-based, speaker-independent recognition of spontaneous speech by computer
• Open problems
  – Segmentation: sentence, speaker, topic
  – OOV recognition
  – Handling disfluencies
  – Evaluation: transcription, semantic, task-based?
  – Recognizing emotion and other types of speaker state
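For the transcription level of evaluation, one widely used measure is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the reference length. The slide does not name a metric, so treating WER as the transcription measure is an assumption here. The function name and the example strings are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical recognizer output with one substitution and one deletion.
wer = word_error_rate("set the soggy net on top", "set a soggy net top")
```

WER treats every word as equally important, which is one reason semantic and task-based evaluation appear as alternatives.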
Spoken Dialogue Systems
• Integrating TTS and ASR with dialogue management and task-based components
• Open questions:
  – Improving ASR accuracy
  – Recognizing dialogue acts
  – Turn-taking behavior
  – Confirmation strategies and initiative
  – Entrainment and ‘personality’
  – Evaluation
Recognizing Speaker State and Diagnosis
• Emotional speech
  – Voice quality
• Deceptive speech
• Charismatic speech
• Customer care rep evaluation
• Medical diagnosis
  – Paranoia and other psychiatric disorders
  – Cancer patient prognosis
Take-Home Final
• Due: May 14 by 4:10 pm
• Submission instructions:
  – This examination is designed to test your ability to synthesize information and to perform critical analysis of published research. Choose 3 of the following 4 questions to answer. Each question should be answered with specific reference to the readings specified, all of which are linked to the syllabus for the class on the date given. (I.e., cite articles with page numbers to support claims about authors’ findings or claims, as in “McLeod et al. (1998) claim that existing Spoken Dialogue Systems’ major drawback is their lack of delightful personalities (p. 4).”) Do not attempt to answer the questions until you have read and understood the specified articles. Essays that do not show evidence of this understanding will not receive high marks.
  – Each essay will be worth 33 1/3 points. Each essay should be no more than 1200 words in length; only the first 1200 words of each essay will be graded, so please do not exceed this limit. If you can answer the question in a shorter essay, feel free to do so. Please use plain ASCII or Word and report word counts for each essay.
Sample Question
Agree or disagree: “It is more difficult to recognize deception automatically from acoustic/prosodic and lexical cues than from visual cues obtained from face or body gesture.” Use the readings assigned for April 28 to support your answer.
1. Show that you understand the question and are answering it
  • E.g. “I believe that it is more difficult to recognize deception automatically from visual cues than from acoustic/prosodic and lexical cues.”
2. For agree/disagree questions, decide whether you basically agree or disagree
  • E.g. “While there are difficulties recognizing deception from both types of cues, I believe it is more difficult to recognize deception from visual cues than from language-based cues.”
3. Provide evidence on both sides of the question
  • “While both audio and visual cues require high-quality recordings, audio recordings must be obtained in a quiet environment whereas video recordings can be obtained in a wider variety of situations, provided that equipment is available.”
  • “While Mehrabian (1971) found significant effects for both visual and language-based cues, the particular language cues he identified in this study would seem to be easier to recognize automatically than the visual cues: For example, it should be easier to identify amount of speech and speaking rate than features such as ‘rocking gestures’ and ‘leg and foot movements’.”
4. Support your statements with specific reference to your sources
  • E.g. “DePaulo et al. (1983) find that…”
  • Or, “Motivation greatly influences subjects’ ability to control their verbal cues (DePaulo et al., 1983).”
• When in doubt, cite