Spoken Language Processing: Summing Up

04/19/23 1

Julia Hirschberg

CS 4706

What We’ve Studied

• Speech phenomena
  – What can people convey by varying the way they say something?
  – How do we identify this kind of variation?
  – What tools do we have for analysis?
• Speech generation (TTS)
• Speech recognition (ASR) and understanding (ASRU)
• Applications for speech technologies

What phenomena vary in speech?

• Intonational contours (ToBI)
  – Phrasing: scope
  – Accent: focus, given/new
  – Overall contour: speech acts
• Pitch range, timing
  – Topic structure
• Voice quality, intensity, …
  – Emotion
  – Deception?
  – Charisma?

Analyzing Speech: At the Acoustic Level

• How do we capture speech data for analysis?
  – Digitizing: sampling, quantization, filtering
• How can we distinguish one speech sound from another?
  – Periodic vs. aperiodic waveforms
    • Characterizing periodic waveforms: cycle, period, phase
  – Displaying and analyzing spectra, pitch tracks
  – Comparing intensity (dB)
• Tools to do all this and more: Praat
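
The digitizing and intensity ideas listed above can be illustrated with a small sketch. It is not how Praat works internally; it only assumes a pure sine tone as the "speech" signal, and the function names (`sample_sine`, `quantize`, `level_difference_db`) are hypothetical helpers introduced for this example.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Sampling: measure a sine wave at evenly spaced instants."""
    n = int(round(sample_rate_hz * duration_s))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

def quantize(samples, bits):
    """Quantization: map samples in [-1, 1] to signed integers of the given bit depth."""
    max_int = 2 ** (bits - 1) - 1
    return [round(s * max_int) for s in samples]

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_difference_db(samples_a, samples_b):
    """Intensity of a relative to b in decibels: 20 * log10(rms_a / rms_b)."""
    return 20 * math.log10(rms(samples_a) / rms(samples_b))

# A 100 Hz tone sampled at 16 kHz for 0.1 s (all values are assumptions).
FREQ_HZ, RATE_HZ = 100, 16000
tone = sample_sine(FREQ_HZ, RATE_HZ, 0.1)

period_s = 1 / FREQ_HZ                  # duration of one cycle
samples_per_cycle = RATE_HZ / FREQ_HZ   # 160 samples cover one cycle

quiet = [0.5 * s for s in tone]         # same tone at half the amplitude
difference = level_difference_db(tone, quiet)  # roughly 6 dB

pcm8 = quantize(tone, 8)                # 8-bit version of the same tone
```

Halving the amplitude lowers the level by about 6 dB, which is why intensity comparisons are usually stated in decibels rather than raw amplitude.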

Analyzing Speech: At the Phonetic Level

• Can we distinguish different languages in terms of their phoneme sets? Are there universal constraints on possible speech sounds?
  – Articulatory constraints
• How do we characterize the sounds of a given language?
  – Acoustic differences associated with place and manner of articulation distinguish consonants
  – Vowels differ in their formant frequencies
• Do we use such information in speech technologies?

Articulators in action

“Why did Ken set the soggy net on top of his deck?”

(Sample from the Queen’s University / ATR Labs X-ray Film Database)

http://psyc.queensu.ca/~munhallk/05_database.htm

http://psyc.queensu.ca/~munhallk/05_why_did_ken.mov

Articulatory parameters for English consonants (in ARPAbet)

PLACE OF ARTICULATION (columns) by MANNER OF ARTICULATION (rows)

          bilabial  labio-dental  inter-dental  alveolar  palatal  velar  glottal
stop      p b                                   t d                k g    q
fric.               f v           th dh         s z       sh zh           h
affric.                                                   ch jh
nasal     m                                     n                  ng
approx.   w                                     l/r       y
flap                                            dx

VOICING: within each pair, the first symbol is voiceless and the second is voiced
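
One way to make the consonant chart above usable in code is to store each ARPAbet symbol with its place, manner, and voicing. The sketch below is only an illustration: the helper names (`natural_class`, `differing_features`) are hypothetical, and the voicing values for unpaired symbols (q, h, the nasals, approximants, and dx) follow standard phonetic descriptions rather than anything stated on the slide.

```python
# (place, manner, voicing) for each ARPAbet consonant symbol in the chart.
CONSONANTS = {
    "p": ("bilabial", "stop", "voiceless"),
    "b": ("bilabial", "stop", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),
    "d": ("alveolar", "stop", "voiced"),
    "k": ("velar", "stop", "voiceless"),
    "g": ("velar", "stop", "voiced"),
    "q": ("glottal", "stop", "voiceless"),        # voicing assumed
    "f": ("labio-dental", "fricative", "voiceless"),
    "v": ("labio-dental", "fricative", "voiced"),
    "th": ("inter-dental", "fricative", "voiceless"),
    "dh": ("inter-dental", "fricative", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
    "sh": ("palatal", "fricative", "voiceless"),
    "zh": ("palatal", "fricative", "voiced"),
    "h": ("glottal", "fricative", "voiceless"),   # voicing assumed
    "ch": ("palatal", "affricate", "voiceless"),
    "jh": ("palatal", "affricate", "voiced"),
    "m": ("bilabial", "nasal", "voiced"),         # voicing assumed
    "n": ("alveolar", "nasal", "voiced"),         # voicing assumed
    "ng": ("velar", "nasal", "voiced"),           # voicing assumed
    "w": ("bilabial", "approximant", "voiced"),   # voicing assumed
    "l": ("alveolar", "approximant", "voiced"),   # voicing assumed
    "r": ("alveolar", "approximant", "voiced"),   # voicing assumed
    "y": ("palatal", "approximant", "voiced"),    # voicing assumed
    "dx": ("alveolar", "flap", "voiced"),         # voicing assumed
}

FEATURES = ("place", "manner", "voicing")

def natural_class(**wanted):
    """Return the sorted symbols that match every feature value given."""
    for name in wanted:
        if name not in FEATURES:
            raise ValueError(f"unknown feature: {name!r}")
    return sorted(
        symbol for symbol, values in CONSONANTS.items()
        if all(values[FEATURES.index(name)] == value
               for name, value in wanted.items())
    )

def differing_features(a, b):
    """List the features on which two consonants differ."""
    return [name for name, x, y in zip(FEATURES, CONSONANTS[a], CONSONANTS[b])
            if x != y]

nasals = natural_class(manner="nasal")
velar_stops = natural_class(place="velar", manner="stop")
p_vs_b = differing_features("p", "b")
```

A lookup like this makes it easy to ask which sounds form a class or which single feature separates two consonants, such as voicing for p and b.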

American English vowel space

[Figure: vowel chart with axes FRONT/BACK and HIGH/LOW. ARPAbet vowels shown: iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux; diphthongs: ey, ow, aw, oy, ay.]

Analyzing Speech: At the Phonological Level

• How do people develop models of intonation?
• ToBI
  – Tones: pitch accents, phrase accents, boundary tones
  – Break indices
• Hand labeling vs. automatic analysis
  – Which provides more useful information?
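
The three ToBI tone categories listed above can be told apart by their diacritics: pitch accents carry `*`, phrase accents end in `-`, and boundary tones carry `%`. The function below is a minimal sketch of that rule for labels such as `L*+H` or `L-H%`; `parse_tone_label` is a hypothetical name, and a real labeling tool would need to handle more cases (uncertainty marks, break indices, and so on).

```python
def parse_tone_label(label):
    """Return a list of (tone, category) pairs for one ToBI tone label."""
    if "*" in label:
        # The starred tone marks a pitch accent, e.g. H*, L+H*, H+!H*.
        return [(label, "pitch accent")]
    if "-" in label and label.endswith("%"):
        # Combined labels such as L-H% hold a phrase accent and a boundary tone.
        cut = label.index("-") + 1
        return [(label[:cut], "phrase accent"),
                (label[cut:], "boundary tone")]
    if label.endswith("-"):
        return [(label, "phrase accent")]
    if "%" in label:
        return [(label, "boundary tone")]
    raise ValueError(f"unrecognized ToBI tone label: {label!r}")

# Hypothetical tone tier for one intonational phrase.
tier = ["H*", "L+H*", "L-L%"]
parsed = [pair for label in tier for pair in parse_tone_label(label)]
```

Here `parsed` holds two pitch accents followed by the phrase accent `L-` and the boundary tone `L%`.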

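[Figure: ToBI contour examples. Pitch accents: L*+H, L*, H*. Phrase accent and boundary tone combinations: H-H%, H-L%, L-H%, L-L%.]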

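[Figure: ToBI contour examples. Pitch accents: H* !H*, H+!H*, L+H*. Phrase accent and boundary tone combinations: H-H%, H-L%, L-H%, L-L%.]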

Speech Generation

• Synthesis then and now
• Open problems in TTS:
  – Pronunciation modeling: OOV words, homographs, abbreviations
  – Predicting pitch accents and phrase boundaries: corpus-based approaches
  – Information status: focus, given/new
  – Modeling discourse structure
  – Producing emotional speech
  – Evaluation

Speech Recognition/Understanding

• ASR then and now: from speaker-dependent digit recognition using analog circuits to HMM-based speaker-independent recognition of spontaneous speech by computer
• Open problems
  – Segmentation: sentence, speaker, topic
  – OOV recognition
  – Handling disfluencies
  – Evaluation: transcription, semantic, task-based?
  – Recognizing emotion and other types of speaker state
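
For the transcription side of evaluation, one widely used measure is word error rate (WER): the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The slide does not name a specific metric, so this is offered only as a sketch; `word_error_rate` is a hypothetical helper and the recognizer output below is invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Reference words come from the sample sentence earlier in the deck;
# the hypothesis is an invented recognizer output.
reference = "why did ken set the soggy net"
hypothesis = "why did can set the net"
wer = word_error_rate(reference, hypothesis)  # one substitution, one deletion
```

WER scores only the transcript; the semantic and task-based evaluations mentioned above ask whether the errors actually mattered for understanding or task success.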

Spoken Dialogue Systems

• Integrating TTS and ASR with dialogue management and task-based components
• Open questions:
  – Improving ASR accuracy
  – Recognizing dialogue acts
  – Turn-taking behavior
  – Confirmation strategies and initiative
  – Entrainment and ‘personality’
  – Evaluation

Recognizing Speaker State and Diagnosis

• Emotional speech
  – Voice quality
• Deceptive speech
• Charismatic speech
• Customer care rep evaluation
• Medical diagnosis
  – Paranoia and other psychiatric disorders
  – Cancer patient prognosis

Take-Home Final

• Due: May 14 by 4:10 pm
• Submission instructions:
  – This examination is designed to test your ability to synthesize information and to perform critical analysis of published research. Choose 3 of the following 4 questions to answer. Each question should be answered with specific reference to the readings specified, all of which are linked to the syllabus for the class on the date given. (I.e., cite articles with page numbers to support claims about authors’ findings or claims, as “McLeod et al. (1998) claim that existing Spoken Dialogue Systems’ major drawback is their lack of delightful personalities (p. 4).”) Do not attempt to answer the questions until you have read and understood the specified articles. Essays that do not show evidence of this understanding will not receive high marks.
  – Each essay will be worth 33 1/3 points. Each essay should be no more than 1200 words in length; only the first 1200 words of each essay will be graded, so please do not exceed this limit. If you can answer the question in a shorter essay, feel free to do so. Please use plain ASCII or Word and report word counts for each essay.

Sample Question

Agree or disagree: “It is more difficult to recognize deception automatically from acoustic/prosodic and lexical cues than from visual cues obtained from face or body gesture.” Use the readings assigned for April 28 to support your answer.

1. Show that you understand the question and are answering it
  • E.g. “I believe that it is more difficult to recognize deception automatically from visual cues than from acoustic/prosodic and lexical cues.”
2. For agree/disagree questions, decide whether you basically agree or disagree
  • E.g. “While there are difficulties recognizing deception from both types of cues, I believe it is more difficult to recognize deception from visual cues than from language-based cues.”

3. Provide evidence on both sides of the question
  • “While both audio and visual cues require high quality recordings, audio recordings must be obtained in a quiet environment whereas video recordings can be obtained in a wider variety of situations, providing that equipment is available.”
  • “While Mehrabian (1971) found significant effects for both visual and language-based cues, the particular language cues he identified in this study would seem to be easier to recognize automatically than the visual cues: For example, it should be easier to identify amount of speech and speaking rate than features such as ‘rocking gestures’ and ‘leg and foot movements’.”
4. Support your statements with specific reference to your sources
  • E.g. “DePaulo et al. (1983) find that…”
  • Or, “Motivation greatly influences subjects’ ability to control their verbal cues (DePaulo et al., 1983).”

• When in doubt, cite
