Speech Comprehension
Speech Comprehension. A few words on acoustics Given a source, how it is heard is a function of the resonant cavities through which it is filtered The.

Dec 18, 2015

Speech Comprehension

A few words on acoustics
• Given a source, how it is heard is a function of the resonant cavities through which it is filtered
• The shape of a cavity in which a sound occurs determines several measurable properties of that sound
  – This is easy to see when you have a deformable sound cavity, such as a wind instrument
  – The sound that comes out is the one which is most resonant: where the sound waves are ‘in sync’
  – This is a function of the length and shape of the resonating chamber: simple if the chamber is a simple tube; complex if the chamber reflects sound in complex ways
• Speaking is a highly controlled deformation of the resonating chamber which is our vocal tract
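
The 'simple tube' case can be made concrete with a small sketch. It assumes an idealized uniform tube closed at one end and uses the standard quarter-wavelength formula f_n = (2n - 1) * c / (4 * L). The function name, the 17.5 cm length and the speed of sound are illustrative assumptions, not values from the slides.

```python
def tube_resonances(length_m, speed_of_sound=350.0, count=3):
    """Resonant frequencies (Hz) of an idealized uniform tube closed at one end.

    Quarter-wavelength formula: f_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


# Assumed lengths: 17.5 cm is a rough figure for an adult vocal tract at rest;
# 14 cm stands in for a shortened chamber.
neutral = tube_resonances(0.175)
shorter = tube_resonances(0.140)

print("17.5 cm tube:", [round(f) for f in neutral])
print("14.0 cm tube:", [round(f) for f in shorter])
```

Shortening the tube raises every resonance, which is the simplest version of the point above: changing the chamber changes which frequencies come out.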

Structure of sound
• Most natural sounds are not one pure resonating frequency, but multiple resonating frequencies stacked up on each other
  – These can divide up the sound spectrum more or less cleanly
  – 'Hissy' noises (like fricatives) send out waves at many frequencies at the same time, resulting in a complex spectrum of resonance
    • We can see fricatives as smears of high-frequency bands interspersed with the more clearly multiple-frequency (and low-frequency) bands of vowels
    • Note that the 'sh' sound is characterized by slightly lower frequencies
  – Clean sounds (like vowels) send out controlled bands of frequencies at different ranges, resulting in a cleaner spectrum of resonance
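
The contrast between a 'clean' and a 'hissy' spectrum can be sketched numerically. The example below builds a vowel-like signal from three harmonics and a hiss-like signal from white noise, then counts how many frequency bins carry a substantial share of the energy. The sample rate, signal length, harmonic frequencies and the 10% threshold are all arbitrary assumptions chosen for illustration; real vowels and fricatives are far richer.

```python
import cmath
import math
import random


def magnitude_spectrum(samples):
    """Magnitudes of the discrete Fourier transform for bins 0 .. N/2 - 1."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        spectrum.append(abs(total))
    return spectrum


def count_strong_bins(spectrum, fraction=0.1):
    """Number of bins whose magnitude is at least `fraction` of the largest."""
    peak = max(spectrum)
    return sum(1 for m in spectrum if m >= fraction * peak)


SAMPLE_RATE = 4000  # Hz (assumed)
N = 400             # samples, so each bin is 10 Hz wide

# 'Clean' sound: three harmonics of a 100 Hz fundamental.
vowel_like = [sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE)
                  for f in (100, 200, 300))
              for t in range(N)]

# 'Hissy' sound: Gaussian white noise from a seeded generator.
rng = random.Random(0)
hiss_like = [rng.gauss(0.0, 1.0) for _ in range(N)]

vowel_bins = count_strong_bins(magnitude_spectrum(vowel_like))
hiss_bins = count_strong_bins(magnitude_spectrum(hiss_like))
print("strong bins, vowel-like:", vowel_bins)
print("strong bins, hiss-like:", hiss_bins)
```

The harmonic signal concentrates its energy in a few narrow bands, while the noise spreads energy across most of the spectrum, which is the 'smear' described above.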

Formants
• When we deform our mouths, we are manipulating which frequencies will resonate

• The ones that resonate are called ‘formants’ [see pg. 121], which appear as visible bands in a spectrogram

• Before we said: Speaking is a highly controlled deformation of the resonating chamber which is our vocal tract

• We can equivalently say: Speaking is a method for manipulating the resonance of formants.

Image from: http://www.umanitoba.ca/faculties/arts/linguistics/russell/138/sec4/specgram.htm

“We were away a year ago”

What’s in a phoneme?

• As soon as we were able to electronically manipulate the signal, it was found that the speech signal could be greatly simplified: much of the information carried is not necessary
  – Why is it a good thing to have (why might natural selection have favoured) unnecessary information in a signal system?

• The question of interest is: What are the components of the speech signal that carry necessary/sufficient information?

What’s in a phoneme?

• The first and second formants are sufficient for comprehensible speech

• In fact, subjects can get some discriminating information from only the first formant: low-frequency formants were associated with back vowels (o, u) and higher-frequency formants with front vowels (i, e)

What’s in a phoneme?

• We use sound (the formants we extract) to deduce information about how the vocal tract was positioned when that sound was produced
  – F1 largely reflects tongue body height, which (as we saw previously) changes with different vowels
  – F2 reflects whether the tongue body is more front or more back
    • The difference between F1 and F2 is a better indicator
• In this way the sound encodes information about the state of the system that produced it
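
As a rough illustration of reading vocal tract position off the formants, the sketch below maps an (F1, F2) pair to a tongue description and to the nearest vowel prototype. The prototype values are approximate adult-male averages of the kind quoted in phonetics textbooks, and the two thresholds are invented for this example; none of these numbers come from the slides.

```python
import math

# Approximate (F1, F2) averages in Hz; illustrative assumptions only.
VOWEL_PROTOTYPES = {
    "i": (270, 2290),   # as in 'heed'
    "ae": (660, 1720),  # as in 'had'
    "a": (730, 1090),   # as in 'hod'
    "u": (300, 870),    # as in 'who'd'
}


def describe_tongue(f1, f2, height_cutoff=450, backness_cutoff=1000):
    """Guess tongue body position from F1 and from the F2 - F1 difference.

    A low F1 goes with a high tongue body; a large F2 - F1 gap goes with a
    front tongue body. Both cutoffs are hypothetical.
    """
    height = "high" if f1 < height_cutoff else "low"
    backness = "front" if (f2 - f1) > backness_cutoff else "back"
    return height, backness


def nearest_vowel(f1, f2):
    """Return the prototype vowel closest to (f1, f2) in plain Hz distance."""
    return min(VOWEL_PROTOTYPES,
               key=lambda v: math.dist((f1, f2), VOWEL_PROTOTYPES[v]))


print(describe_tongue(280, 2200), nearest_vowel(280, 2200))
print(describe_tongue(700, 1100), nearest_vowel(700, 1100))
```

A plain distance in Hz is the simplest possible choice here; perceptual frequency scales would weight F1 and F2 differently.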

A complication
• Vowel sounds are dependent on the consonants that flank them
• We make different sounds by changing the shape of our mouth, and our mouth has to change in different ways to get to a particular vowel sound from one position than from another

• In other words: The very process of getting into position to make a sound involves manipulating exactly those elements which are manipulated to change sound

What’s in a vowel?
• If you make ‘CVC’ words and then chop out the V, people make many mistakes in guessing what that missing sound is supposed to be
  – They are much better at guessing what a vowel sound is if you give them only the flanking consonants
  – They were as good at silent center (the V taken out) as they were with the original word!
• V recognition is worse if you discard temporal information, so that subjects only hear a small, constant-length portion of the missing vowel
  – This suggests that temporal information (how long a vowel lasts) is one of the clues used in vowel identification.

Consonants too.

• The same is true for consonants
  – If you take a stop consonant off the front of a vowel (the b in BA) it is utterly impossible to recognize what the consonant was (a beep or chirp?): it was never a ‘b’ but a ‘b merging rapidly into an a’
  – Both the stop consonant and a chunk of the formant transition into the next vowel are necessary for comprehension

Coarticulation

• A phoneme that merges with its adjacent neighbour is called an encoded phoneme
  – We can also say the two phonemes are coarticulated
• Since an encoded phoneme is a single indistinguishable sound which encodes two phonemes (the encoded one and its neighbour), we say there is ‘parallel transmission’

Information compression

• Coarticulation is a feature, not a bug

• The informational compression it offers is one way we get up to the informational transfer rates that I mentioned last time, of 25-30 phonemes per second

More information, please
• In normal sentence-level decoding of the phonetic stream, we have higher-level informational cues
• Early work showed that words masked with noise are better recognized in sentences than in isolation
• A classic experiment from the 1970s showed that people are amazingly smooth at using these cues to restore missing phonemic segments
  – Parts of a sentence were chopped out (in mid-word) and replaced with the sound of someone coughing
  – Subjects reported that they didn’t hear the cough cover any part of the speech signal at all: they claimed to have heard the entire word, with the cough in the background

Our favourite theme
• Yet another linguistic phenomenon (phoneme identification) that superficially appears to be a single function is in fact a complex function that uses many independent and redundant cues:
  – Formant transitions [from C to V and V to C]
  – Individual formants [of V]
  – Durational information
  – Amount of energy in the burst [the release of pressure after a stop]
  – Onset frequency of the formant
  – Sentence and word level information

The McGurk Effect

Models of speech perception

• i.) Motor theory of speech perception (Liberman)
• ii.) Analysis by synthesis (Stevens)
• iii.) Fuzzy logic model (Massaro)
• iv.) Cohort model (Marslen-Wilson)
• v.) TRACE model (Elman & McClelland)

i.) Motor Theory Of Speech Perception (Liberman)

• Main idea: we interpret speech input by tying it to motor articulation required to produce it

• Pros:
  – Provides a nice evolutionary story: phonetic comprehension built on a more 'primitive' (evolutionarily older) level of sound production.
  – Ties into 'hardware'
  – Explains McGurk effect
  – Explains how we deal with coarticulation so easily
  – Explains how we deal with invariance
  – Explains categorical perception

• i.e. we use motor information to constrain possible sounds; use motor invariance to counter acoustic variance

i.) Motor Theory Of Speech Perception

• Cons:

– Animals also show categorical perception but can’t produce phonemes

– Humans with deformed mouths can comprehend speech

– We can comprehend sounds we cannot make

– Says nothing about semantic and pragmatic constraints

ii.) Analysis by synthesis (Stevens)

• Main idea: We synthesize speech from phonetic features; we have 'rules' for synthesizing, which can be absolute when the signal is clear, and less absolute (more dependent on contextual cues) when there is known ambiguity

• The synthesized version is compared with the heard version, but not at the level of motor articulations

ii.) Analysis by synthesis
• Pros
  – Tries to capture the fact that not all phonemes are created equal
    • ambiguous sounds must be more carefully analyzed (because they are subject to a greater variety of constraints) than unambiguous sounds
    • early phonemes have greater weight than later phonemes
  – The idea that rules across phonetic features underlie comprehension means that the problem will be tractable
    • Since we have a good handle on what those features are, there is hope we could specify how they combine

ii.) Analysis by synthesis

• Cons
  – Can’t explain McGurk effect, since everything is acoustically specified
  – Pretty vague without the rules actually being specified
  – Very abstract: hard to falsify or confirm experimentally, because it makes claims about what is happening internally that cannot be tested easily
  – Says nothing about semantic and pragmatic constraints

iii.) Fuzzy Logic Model (Massaro)
• Main idea: speech perception is a special case of pattern recognition (analysis by features)
• There are four steps:
  – i.) Feature identification/extraction: Identify the relevant features
  – ii.) Feature evaluation: Match those features to prototypes in memory, i.e. generate a list of partial matches with feature sets that contain some of the identified features
  – iii.) Feature integration: Rank order the candidates according to the degree to which they match
  – iv.) Feature decision: Make a ‘goodness of match’ decision and return the best candidate
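
The evaluation, integration and decision steps can be sketched in a few lines. This is a minimal sketch under two assumptions that are not stated on the slide: the fuzzy truth values for each candidate are combined by multiplication, and the decision is a 'relative goodness' normalization across candidates. The candidate syllables and their support values are invented, loosely in the spirit of an audio-visual conflict.

```python
def integrate(support_values):
    """Feature integration: multiply the fuzzy truth values for one candidate."""
    product = 1.0
    for value in support_values:
        product *= value
    return product


def decide(candidates):
    """Feature decision: goodness of each candidate relative to all of them."""
    goodness = {name: integrate(values) for name, values in candidates.items()}
    total = sum(goodness.values())
    return {name: g / total for name, g in goodness.items()}


# Hypothetical output of feature evaluation, as fuzzy truth values in [0, 1]:
# (support from the acoustic cue, support from the visual cue).
evaluated = {
    "ba": (0.8, 0.1),
    "da": (0.6, 0.7),
    "ga": (0.1, 0.9),
}

probabilities = decide(evaluated)
best = max(probabilities, key=probabilities.get)
print(best, {name: round(p, 2) for name, p in probabilities.items()})
```

With these made-up numbers the candidate that is moderately supported by both cues beats the ones strongly supported by only one, which shows how a continuous, more-or-less-good match can settle on a compromise percept.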

iii.) Fuzzy Logic Model (Massaro)
• Pros
  – Takes speech recognition out of the special-case category and puts it into the category of general pattern recognition, thereby tying it in to work in other subfields, including other areas of language, and into a general theory
    • This could also be a con, since speech recognition does seem to be special…
  – Stresses continuous (quantitative) rather than discontinuous (qualitative) information, so a match can be more or less good, more or less certain

iii.) Fuzzy Logic Model (Massaro)
• Cons
  – Very abstract: hard to falsify or confirm experimentally, because it makes claims about what is happening internally that cannot be tested easily
  – Says nothing about semantic and pragmatic constraints (but perhaps it could…?)

iv.) Cohort Model (Marslen-Wilson)

• Basic idea: A spreading activation model
  – Stage 1: Initial Access
    • Access cohort: Bottom-up, based on first 150-200 ms
  – Stage 2: Selection
    • Elimination of candidates that fail for reasons other than phonology, so we can weed out using semantic/pragmatic and syntactic constraints, as well as later-stage phonology
  – Stage 3: Integration of semantic and syntactic information
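
The first two stages can be sketched as successive filters over a lexicon. The toy lexicon, the use of spelling as a stand-in for phonemes, and the single 'expected category' constraint are all simplifying assumptions made for this example; the model itself concerns spoken input and much richer context.

```python
# Toy lexicon (hypothetical): word -> syntactic category.
LEXICON = {
    "captain": "noun",
    "capital": "noun",
    "capture": "verb",
    "cat": "noun",
    "dog": "noun",
}


def initial_access(onset):
    """Stage 1: bottom-up cohort of every word consistent with the onset."""
    return {word for word in LEXICON if word.startswith(onset)}


def selection(cohort, heard_so_far, expected_category):
    """Stage 2: drop candidates that fail on context or on later input."""
    return {word for word in cohort
            if word.startswith(heard_so_far)
            and LEXICON[word] == expected_category}


cohort = initial_access("cap")
after_context = selection(cohort, "cap", "noun")          # e.g. after "the ..."
after_more_input = selection(after_context, "capt", "noun")
recognized = (next(iter(after_more_input))
              if len(after_more_input) == 1 else None)

print(sorted(cohort))
print(sorted(after_context))
print(recognized)
```

Note that the context constraint removes a candidate before the acoustic input alone could have done so, which is the sense in which selection uses more than phonology.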

iv.) Cohort Model (Marslen-Wilson)

• Pros
  – Does take into account semantics and pragmatics
  – Well-supported by a variety of experimental evidence: frequency effects, neighbourhood effects, word/non-word (NW) reaction time (RT) effects

iv.) Cohort Model (Marslen-Wilson)

• Cons

  – Says nothing about mechanisms
  – Says nothing about word segmentation
    • The model assumes listeners pick out the words, but we have seen that word boundaries are not usually specified in the speech stream
  – Not incompatible with other models, since it takes phonemic activation (the selection of the initial cohort) for granted (maybe this is a pro?)

v.) TRACE Model (Elman & McClelland)

• Basic idea: A parallel distributed processing (PDP) model: degree of activation/inhibition from units at each of three levels (phonemic feature, phoneme, word) is determined by the resting activation level of word units
  – Each gets input directly from a constant sequence of phonemes, all equally valuable
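
The flavour of activation and inhibition among units can be shown with a heavily reduced sketch. It is not the TRACE model: it has only a word level, takes phoneme evidence as given, uses spelling in place of phonemes, and has no feature level or top-down feedback. The word list, the evidence values and all rate parameters are assumptions chosen for illustration.

```python
def run_network(words, evidence, cycles=30, excite=0.1, inhibit=0.3, decay=0.1):
    """Iterate word-unit activations with bottom-up input and mutual inhibition.

    `evidence` is a list with one dict per input position, mapping a
    phoneme-like symbol to the strength of evidence for it at that position.
    """
    activation = {word: 0.0 for word in words}  # resting level assumed to be 0
    for _ in range(cycles):
        updated = {}
        for word in words:
            bottom_up = sum(evidence[i].get(symbol, 0.0)
                            for i, symbol in enumerate(word)
                            if i < len(evidence))
            rivals = sum(a for other, a in activation.items() if other != word)
            value = (activation[word] + excite * bottom_up
                     - inhibit * rivals - decay * activation[word])
            updated[word] = min(1.0, max(0.0, value))  # keep within [0, 1]
        activation = updated
    return activation


# Degraded input: the final segment is ambiguous between 't' and 'p'.
evidence = [{"c": 1.0}, {"a": 1.0}, {"t": 0.6, "p": 0.4}]
activations = run_network(["cat", "cap", "dog"], evidence)
winner = max(activations, key=activations.get)
print(winner, {word: round(a, 2) for word, a in activations.items()})
```

Because units compete, a small advantage in the evidence grows over cycles until one word dominates, so a decision emerges from overall goodness of fit even when part of the input is unclear.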

v.) TRACE Model

• Pros
  – Decision based on overall goodness of fit, so degraded input is not problematic
  – Consistent in principle with cohort activation models (and so well-supported by experimental evidence)
  – Does take into account semantics and pragmatics: activation of overlapping lexical levels is explicable

v.) TRACE Model

• Cons
  – Treats all features as equal, which we know they are not
  – Says nothing about mechanisms
  – ‘Cheats’ by building in phonemic activation (the selection of the initial cohort) by direct activation of those features
    • One big part of the puzzle (how do we specify and recognize these features?) is thereby glossed over
  – Highly over-simplified, both at the level of language and neurology

• Are these models incompatible?

• Can they be synthesized into a ‘meta-model’?

• How could we test for which parts of each were best?