
A Simulation of Final Stop Consonants in Speech Perception Using the Bicameral Neural Network Model

Michael C. Stinson Department of Computer Science

Central Michigan University Mt. Pleasant, Michigan, USA

Dan Foster Department of English University of Winnipeg

Winnipeg, Manitoba, Canada

Abstract

This paper demonstrates the integration of contextual information in a neural network for speech perception. Neural networks have been unable to integrate such information successfully because they cannot implement conditional rule structures. The Bicameral neural network employs an asynchronous controller which allows conditional rules to choose neurons for update rather than updating them randomly. The Bicameral model is applied to the perception of word-final plosives, an ongoing problem for machine recognition of speech.

Neural networks allow computers to learn information and then use that information for interpretation of subsequent input. This ability has proven useful in the simulation of human perception in various modes (e.g. visual). However, systems have found it difficult to take contextual factors into account in the recognition of data, since to do so efficiently would require conditional rule structures, something not easy to implement in neural networks. Such an ability would be useful for the simulation of perception since contextual factors apparently play a major part in human perception. We offer a solution to the problem of conditional rule structures in neural networks by using the Bicameral model [Stinson 88, Stinson 88a, Stinson 88c], which merges some traditional artificial intelligence techniques with neural network methods.

The processing of linguistic information is often conditional on contextual factors that may greatly influence interpretation. An example of this would be how we might interpret polysemous terms. For instance, if a person meets a friend walking down a road carrying a fishing pole, he might suggest that they go down to the bank. Alternately, if that person is wearing a business suit on Wall Street, the same suggestion would likely have a different meaning. This tendency might be written as a pair of conditional rules:

If fishing pole, then bank suggests river.        (1)
If business suit, then bank suggests building.    (2)

In either case, the interpretation of bank might be wrong, but the perception would be influenced by the contextual information and not simply the word.

To demonstrate an important application of this technique, we have applied conditional rule structures to a problem in speech recognition, namely the recognition of post-vocalic, final voiced and unvoiced stop consonants. These stop consonants are referred to as plosives in Kohonen's work [Kohonen 88]. Recognition of these plosives requires that conditional rules be applied to the acoustic input to aid interpretation of the physical signal into phonemes.

The first part of the paper will introduce the implementation of the Bicameral model. The second part will apply the model to phonological interpretation, providing examples of the implementation of conditional rules.

The implementation of conditional rule structures in neural networks could be done in two ways. First, the connection weights between neurons could be modified under varying conditions. The modification of connection strengths opens some interesting theoretical problems. If one considers the approach from a topographical standpoint, the changing of the weights changes the topography of the system, allowing the user to increase the gradient which attracts the probe toward a learned result. However, modifying the connection strengths is eliminated as a solution for practical reasons, primarily because a system of n neurons contains n² connections, making such an approach too costly.


The second approach, the one we have chosen, is to control selection of neurons for update rather than allowing the system to choose them randomly as is currently done in the Hopfield model [Hopfield 82]. This allows the system to emphasize certain learned memories over others. This method does not change the topography of the neural network, but does affect the movement of a probe within the original system. This approach need only have n connections for n neurons, making it considerably more practical for implementation.

2.1. Background

Consider the Hopfield model as a content-addressable associative memory. Each neuron contains one bit of information, since it can be in only one of two states: on (+1) or off (-1). The state of each neuron is defined by the value of the neuron. The system is the set of neurons and the connection weights, and the state of the system is the n-tuple formed by the n neurons in the system. The neurons are completely connected; that is, neuron i is connected by a synapse to neuron j for all i and j. The set of connections forms a linear connection matrix Tij, also known as the transition matrix. The connections are reciprocal; that is, the matrix (Tij) is symmetric (tij = tji) in its components (the weights). One additional constraint on the system is that the weights of the diagonals in Tij are all zeros (tii = 0). This implies that a neuron is not connected with itself, and therefore does not affect, at least directly, its own value.

Whether or not the neurons fire is determined using the threshold decision rule.

xi = sgn [ Σj Tij xj ] = +1 if Σj Tij xj ≥ 0, and -1 if Σj Tij xj < 0   (Eq. 1)

The value of each neuron is computed using Equation 1, where the information to be learned is the vector (x1, x2, ..., xn). Individual neurons sum the values of all other connected neurons in the system. If the summation equals or exceeds a predetermined threshold, zero in this example, the neuron assumes the value +1. If the summation is less than the prescribed threshold, the neuron assumes a value of -1.
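The update rule can be written in a few lines. This is a minimal sketch in Python with NumPy (the paper specifies no implementation language; the function name is ours):

    import numpy as np

    def update_neuron(T, x, i):
        # Threshold decision rule (Eq. 1): neuron i sums the values of all
        # connected neurons, weighted by row i of the transition matrix.
        # T[i, i] is zero, so the neuron receives no feedback from itself.
        s = T[i] @ x
        return 1 if s >= 0 else -1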

In the accompanying picture (Figure 1) each neuron is connected with every other neuron, denoted by a closed connection. In this model the connection from a neuron to itself is left open. This eliminates feedback to a neuron from itself, as per the definition of this model.

Figure 1: A Circuit Diagram of a Hopfield Neural Net

In the Hopfield model, information learned from an external source is referred to loosely as a learned memory and should be a stable point. However, learned memories and stable points must be differentiated since systems sometimes develop other stable points (spurious memories) in addition to or in lieu of the learned memories. This poses a problem of differentiating between learned memories and extraneous stable points. The use of the Bicameral model's control structure reduces the risk of extraneous stable points [Stinson 88c, Youn 89]. Reducing the risk of extraneous stable points is important because stable points act as attractors.

An attractor forms a region about it, an attraction basin, and acts as a magnet to probes in the space formed by the memory. The strength of this attraction depends on the proximity of the probe to the information and the strength of the other attractors.

To test whether the system has learned certain information, non-intrusive probes may be introduced into the space of the memory. A probe is an external stimulus that queries the system as to whether it, the probe, is a stable point. The probe does this without having any effect on the connection weights of the system and therefore does not change the system. The probe therefore enters the space of the associative memory and moves to the attractors


without changing the space in any manner. Ideally, an update system shifts the position of the probe to a point closer to the location of an attractor, until eventually the probe stabilizes.

One important concept for consideration is the timing that the system uses to update each neuron. Neurons may be updated either synchronously (every neuron at once) or asynchronously (one neuron at a time). If a synchronous network stabilizes, its resultant stable points are similar to those of an asynchronous network. However, asynchronous update guarantees stabilization of the probe, whereas synchronous update may lead to oscillations between non-stable states. The Hopfield model allows asynchronous update.
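The difference can be seen in a toy two-neuron system (our own illustration, not taken from the paper): under synchronous update the probe oscillates, while asynchronous update settles.

    import numpy as np

    T = np.array([[0, 1],
                  [1, 0]])                  # symmetric, zero diagonal

    def threshold(v):
        return np.where(v >= 0, 1, -1)

    x = np.array([1, -1])
    for _ in range(4):                      # synchronous: all neurons at once
        x = threshold(T @ x)                # oscillates: (-1,+1), (+1,-1), ...

    x = np.array([1, -1])
    for i in (0, 1, 0, 1):                  # asynchronous: one neuron at a time
        x[i] = 1 if T[i] @ x >= 0 else -1   # settles at the stable point (-1,-1)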

We define the learned memory x(k) = (x1, x2, ..., xn) to be the kth memory, an n-dimensional column vector. Each vector consists solely of -1's and +1's. Each memory vector is used to form an n x n matrix using the outer product rule:

T(k) = x(k) x(k)T - In   (Eq. 2)

In this equation In is an n x n identity matrix. Using this technique, the matrix stores the outer product of the vectors with every diagonal element xi xi equal to zero. The Hopfield transition matrix is then defined as the sum of these outer product matrices:

T = Σk T(k)   (Eq. 3)

The retrieval process is accomplished using the T matrix. A probe p is presented with each position either a +1 or a -1. The probe is then multiplied by T, resulting in a new, unthresholded vector pu':

pT = pu'   (Eq. 4)

The vector pu' is then thresholded, resulting in a new vector p'. This binary vector is then compared with the original vector p for differences in the values of individual positions. If p' matches p, it has stabilized. If not, a neuron is chosen at random from p and updated according to Equation 1. Notice that the neurons could be classified as either in the set that would not change (i.e. those that match the value in the same position in p') or those that could effect change in the probe (i.e. those that do not match the value in the same position in p'). The latter set is called the effective set [Stinson 88]. In the Hopfield model, the random updating continues without change until an effective neuron is chosen, thus changing the probe. This process is repeated until stabilization is achieved.
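The storage and retrieval procedure just described can be summarized in a short sketch (Python with NumPy; the function names and the seeded random generator are our assumptions):

    import numpy as np

    def store(memories):
        # Eqs. 2 and 3: sum the outer products of the memory vectors,
        # with the diagonal zeroed so no neuron feeds back to itself.
        n = len(memories[0])
        T = np.zeros((n, n), dtype=int)
        for x in memories:
            x = np.asarray(x)
            T += np.outer(x, x) - np.eye(n, dtype=int)
        return T

    def recall(T, p, rng=np.random.default_rng(0)):
        # Eq. 4 and the update loop: threshold pT, find the effective set
        # (positions where p' disagrees with p), then update one effective
        # neuron at random until the probe stabilizes.
        p = np.array(p)
        while True:
            p_prime = np.where(p @ T >= 0, 1, -1)
            effective = np.flatnonzero(p_prime != p)
            if effective.size == 0:
                return p                  # p is a stable point
            i = rng.choice(effective)
            p[i] = p_prime[i]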

2.1.1. Example

The problem is illustrated through an example system that uses the Hopfield model. There are n = 5 neurons and three learned memories:

x(1) = (+1 +1 +1 +1 +1)
x(2) = (-1 -1 +1 +1 -1)
x(3) = (+1 -1 -1 -1 -1)

yielding the following individual transition matrices using the outer product rule:

T1 = [  0 +1 +1 +1 +1 ]
     [ +1  0 +1 +1 +1 ]
     [ +1 +1  0 +1 +1 ]
     [ +1 +1 +1  0 +1 ]
     [ +1 +1 +1 +1  0 ]

T2 = [  0 +1 -1 -1 +1 ]
     [ +1  0 -1 -1 +1 ]
     [ -1 -1  0 +1 -1 ]
     [ -1 -1 +1  0 -1 ]
     [ +1 +1 -1 -1  0 ]

T3 = [  0 -1 -1 -1 -1 ]
     [ -1  0 +1 +1 +1 ]
     [ -1 +1  0 +1 +1 ]
     [ -1 +1 +1  0 +1 ]
     [ -1 +1 +1 +1  0 ]

Note that the individual transition matrices are created using the outer product rule for the three individual learned memories. The diagonal elements are then set to zero to represent the lack of feedback from the neuron to itself.


The individual matrices T1, T2, and T3 are summed, resulting in the transition matrix

T = [  0 +1 -1 -1 +1 ]
    [ +1  0 +1 +1 +3 ]
    [ -1 +1  0 +3 +1 ]
    [ -1 +1 +3  0 +1 ]
    [ +1 +3 +1 +1  0 ]

If the resulting system is probed using a learned memory as probe p, the calculations should return the learned memory

x(3) = p = (+1 -1 -1 -1 -1)

Multiplying this by the transition matrix T yields pu':

pu' = (0 -4 -6 -6 -4)

The values are then thresholded to obtain:

p' = (+1 -1 -1 -1 -1)

which is the original information vector. We obtain the original vector x(3) because it is both a learned memory and an attractor (a stable point) for the system. The probe simply stayed at its position, thus both defining and exemplifying a stable point. Ideally, when a system learns a memory the learned memory becomes a stable point and attracts probes that are near it.

When a system is faced with a probe containing values that do not match a learned memory, it attempts to resolve the input to one of its learned memories. To demonstrate this, we probe the system with another vector which is not a learned memory.

p = (+1 +1 -1 +1 +1)

Multiplying the probe by the transition matrix, we obtain:

pu' = (+2 +4 +4 -2 +4)

which will threshold to:

p' = (+1 +1 +1 -1 +1)

The elements of p are then compared to p'. The values in positions 3 and 4 in p' are different from those positions in p. These positions form the effective set for updating. If position 3 is updated, the probe becomes p = (+1 +1 +1 +1 +1), which is the learned memory x(1). However, updating position 4 yields p = (+1 +1 -1 -1 +1), which is not a learned memory. This latter point turns out to be a stable point, since updating yields a p' which equals p. Notice that in this instance p is not a learned memory; instead it is the complement of the learned memory x(2). This is an example of the spurious memories mentioned earlier. Another type of spurious state is a linear combination of memories, caused by interference between stored memories. The Bicameral model has been shown capable of significantly reducing spurious states [Stinson 88, Stinson 88c, Youn 89].
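The outcome of this example can be checked numerically; a sketch (Python with NumPy, our construction) reproducing the probe above:

    import numpy as np

    memories = [np.array(m) for m in ((+1, +1, +1, +1, +1),
                                      (-1, -1, +1, +1, -1),
                                      (+1, -1, -1, -1, -1))]
    T = sum(np.outer(m, m) - np.eye(5, dtype=int) for m in memories)

    p = np.array([+1, +1, -1, +1, +1])
    pu = p @ T                                # (+2, +4, +4, -2, +4)
    p_prime = np.where(pu >= 0, 1, -1)        # (+1, +1, +1, -1, +1)
    effective = np.flatnonzero(p_prime != p)  # positions 3 and 4 (indices 2, 3)

    # Updating position 3 reaches the learned memory x(1) = (+1,+1,+1,+1,+1);
    # updating position 4 lands on the spurious stable point (+1,+1,-1,-1,+1),
    # the complement of x(2).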

2.2. The Bicameral Network Model

The underlying model for the new memory and recall mechanism is the Bicameral neural network model proposed by Stinson and Kak [Stinson 88, Stinson 88a, Stinson 88b, Stinson 88c, Kak 89]. Instead of a single neural network which acts on global characteristics, the Bicameral model adds a companion neural structure, called the controller, which additionally takes advantage of certain subset characteristics of the stored memories. For a VLSI-implemented neural network the controller block can be easily placed in the feedback loop.

In a network running asynchronously, the updating of the probe generally offers a choice of several neurons. As we have seen, if the right neurons are not updated the network may converge to a spurious stable point. The asynchronous controller together with the basic neural net forms a bicameral network that can be programmed in various ways to exploit global and local characteristics of stored memories and contextual information.

The controller determines the effective set prior to any update of the neurons (see Figure 2). It then determines which neuron should be updated to facilitate a "best" answer for the system based on contextual factors. The controller, using a rule-based decision process, not only helps avoid spurious states, but may also increase the probability of reaching the appropriate stable point.


Figure 2: A Hopfield Neural Network with Controller

In 2.1.1, when the system was probed with p = (+1 +1 -1 +1 +1), it had two choices for update: positions 3 and 4. The purpose of the controller is to improve the eventual solution by improving the update choice at each stage. A decision must therefore be made between position 3 and position 4. A simple rule structure might be chosen: e.g. select the leftmost position. This arbitrary rule simply chooses the lowest numbered element in any effective set. In this particular instance it happens to lead to the correct choice, but its arbitrariness makes it a poor candidate for a general rule. A better rule would be one that uses contextual or global information available to the system, e.g. select the neuron whose update has a higher probability of leading to a learned memory. Devising the particular rules is of course dependent on information from the domain of reference. Let us consider a case of phonological interpretation to determine such a rule structure.
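A sketch of this arrangement (Python with NumPy; the rule interface is our assumption): the controller is simply a function that selects from the effective set, so different conditional rule structures can be plugged in.

    import numpy as np

    def bicameral_recall(T, p, choose):
        # Asynchronous recall in which a controller rule, rather than
        # chance, selects which effective neuron to update.
        p = np.array(p)
        while True:
            p_prime = np.where(p @ T >= 0, 1, -1)
            effective = np.flatnonzero(p_prime != p)
            if effective.size == 0:
                return p
            i = choose(effective, p)      # the controller's decision
            p[i] = p_prime[i]

    def leftmost(effective, p):
        # The arbitrary rule discussed above: lowest-numbered element.
        return effective[0]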

In the endeavor to create computers that can deal with natural language, scientists have dealt separately with the production and perception of speech. NETtalk is a neural net approach to speech production developed by Sejnowski and Rosenberg [Sejnowski and Rosenberg 86]. It uses a supervised learning algorithm that trains on a chosen body of text. The neural network automatically embodies the phonological regularities that are used for pronunciation.

In one approach to speech perception, a neural net transforms acoustic information into orthographic information [Kohonen 88], thereby obtaining words. While this approach has achieved some success with vocals (vowels and /l/ and /r/) and fricatives (e.g. /f/ and /s/) [Kohonen 88, Kohonen 88a], Kohonen notes that plosives (e.g. stops such as /t/ and /d/) are difficult to deal with [Kohonen 88]. It has long been known that there is no one-to-one relationship between acoustic events and phonological segments [Denes 63, Parker 77, Repp 81, Foster 85]; therefore, a direct translation from acoustic information to phonological interpretation is not in general possible. Systems attempting a direct translation must be trained to each individual voice they are expected to respond to and can be prone to misinterpretations.

We propose a neural net with an asynchronous controller which creates a bicameral structure, allowing the conditional rules necessary to solve certain problems in the perception of word-final plosives. This application to phonology will now be described.

3.1. The Phonological Model

In phonology, we consider at least three levels of representation: (a) systematic phonemic, (b) systematic phonetic, and (c) physical phonetic (both acoustic and articulatory) [Foster 85]. A fourth level between a and b, called a', classical phonemic, is not actually a true level of representation [Chomsky 64], but it is well known in the field. For our purposes here, we do not need to distinguish between levels a and a'.

The levels a and b are psychological/mental [Foster 85] and are defined in terms of distinctive features (DFs), each segment being designated by a DF matrix. (Not all phonologists agree that features at the systematic phonetic level (level b) are distinctive; however, this question need not concern us here.) Levels a and b are related to one another by phonological rules and, with a few exceptions, can be mapped isomorphically one to another.

It is frequently emphasized [Denes 63, Parker 77, Repp 81, Foster 85] that there is no isomorphic mapping from levels a and b to either speech production (articulation) or to the properties of sound (acoustic information). We are most concerned here with the relationship between (physical) acoustic information (level c) and the (mental) phonemic interpretation (level a) since the former level is the raw material of speech perception and resolution of the information into a phonemic representation is the immediate goal of machine recognition of speech. Perception of


phonological segments (mental entities) depends on acoustic cues, physical entities which may not be uniquely or invariantly associated with a particular phonological segment, either in English or across languages (see [Buckingham and Yule 87] regarding problems this non-isomorphism has caused for linguists working in languages not their own).

Thus, there are two relevant components of phonology that must be modeled: psychological/mental (systematic and classical phonemics and systematic phonetics) and physical (physical phonetics, both articulatory and acoustic). Relations among the psychological levels may, with a few exceptions, be mapped isomorphically, but those between the physical and psychological levels cannot be. One exception to this isomorphism between levels is the relationship between the alveolar flap [D] and the phonemes /t/ and /d/. An instance of [D] may be interpreted as either a /t/ or a /d/, depending on a variety of acoustic and non-acoustic contextual factors. Consequently, the relationship between the phonetic and phonemic levels is not strictly isomorphic.

3.1.1. Example

The following example will demonstrate the interaction of the various levels. Consider the distinction between pairs of voiced and unvoiced stops (plosives): /b,p/, /d,t/, /g,k/ (slashes indicate that the segments are phonemes). Each voiced/unvoiced pair has the same distinctive feature (DF) matrix except for the DF [±voice]: /b/ is [+voice] and /p/ is [-voice]. Likewise, with /d,t/ and /g,k/, the first member of each pair is [+voice] and the second is [-voice]. It is important to keep in mind that these DFs are mental distinctions, having no necessary connection to any aspect of the physical speech event, either the acoustic signal or the articulatory movements [Parker 77, Foster 85]. That is, the perceptual distinction [±voice] is not tied in any necessary way to the presence or absence of glottal pulsation or any other single articulatory or acoustic factor. Furthermore, the feature [±voice] is not itself an acoustic cue. Rather, it is a mental distinction that is cued by some aspect of the acoustic signal, but not always the same aspect.

At the systematic phonetic level, which is a second mental level, we can predict more precisely the phonological character of a stop segment. For instance, in English when a /p/ occurs in initial position in a stressed syllable immediately before a vowel, it is always realized as an aspirated p, [pʰ]. (Square brackets indicate that the relevant level is the systematic phonetic level, or, in other words, a segment in square brackets is an allophone, or variant, of the phoneme.) An unaspirated p ([p]) in that position will be, at best, judged odd by a native speaker and, at worst, interpreted as /b/. The relationship between /p/ and [pʰ] may be expressed by the following rule:

[-continuant, -voice] --> [+aspirated] / # __ V́   (3)

(A voiceless stop is aspirated in syllable-initial position before a stressed vowel.)

The mental segment [pʰ], then, is cued by certain physical acoustic signals, for instance, a delay in Voice Onset Time (VOT), the start of vocal cord vibration. However, delayed VOT is not invariably associated with [pʰ]. While the initial /p/ in poker is [pʰ] at the systematic phonetic level, the initial /p/ in potato is not aspirated, in spite of the fact that it also has a delay in VOT. Potato has stress on the second syllable, so it is the first /t/ that fits the aspiration rule and is, consequently, aspirated. The /p/ does not fit the rule and so is not aspirated. A delay in VOT is an acoustic cue, a physical correlate of the interpretation. It is not isomorphic with the feature [±aspiration], which is a mental distinction operating at the systematic phonetic level. Neither is it isomorphic with the feature [+voice], another mental distinction which operates at the systematic phonemic level, since [-voice] segments in other positions are not associated with a delayed VOT. Furthermore, [-voice] segments in other positions (e.g. post-vocalic, word final) are normally not associated with aspiration.

What is important to note here is that human perception of phonemes is not simply a matter of recognizing chunks of acoustic signals. Instead, humans must actively interpret the signal that reaches them, taking into account information not only from the portion of the acoustic signal which corresponds to the segment in question, but from throughout the signal for the resolution of any part of the interpretation. In the case above, the system must not only note a delay in VOT, but must also determine whether the syllable in question is stressed before it can decide whether the segment is aspirated. In the case of potato, the stress information is not available until well after the VOT information.


Grammatical and semantic/pragmatic information also affect interpretation of acoustic signals [Denes 63, Raphael 72, Church 87], though we will consider only acoustic information here. Furthermore, information from whatever source may be conclusive in itself or it may simply bias us, strongly or weakly, toward a particular interpretation, and it may be supported or contradicted by other information available from inside or outside the acoustic stream. In other words, the system must be able to handle gradations of conclusiveness and contradictory information. We propose that the Bicameral model can accomplish these tasks efficiently.


3.2. Conditional Rules in Phonological Interpretation

If the Hopfield model is used, salient information from the acoustic signal must be placed in a vector and the system must learn that vector. This could be done in three ways. If all the information were to be placed on one vector, the system could search for the combination of the levels represented on that vector. However, we cannot place the phonological information on the same vector as the acoustic information because the relationship is not one-to-one. One might consider placing the information on multiple vectors of a similar nature. However, the acoustic and phonological information are intrinsically linearly dependent and, therefore, prone to create spurious states. Additionally, an overload might occur if too many information vectors were learned.

A better solution is to have the neural net learn only the acoustic cues. The phonetic probe would then seek a solution in the neural net containing the cues. To augment the update decision structure, a set of rules may be placed in a controller. The Hopfield model, which may be used for learning the acoustic cues, has no structure to contain the rule set. Without the ability to implement conditional rules, the system has no ability to control the variations in the mappings from the physical to the mental. On the other hand, an asynchronous control structure can accomplish this task.

To clarify, let us consider the implications of the Duration of Periodic Noise (DPN) associated with the perception of vowels on the perception of the [±voice] distinction in post-vocalic, word-final stops. In most accounts, what we are calling DPN, or duration of periodic noise associated with vowel perception, is called VL, or vowel length. We have chosen the somewhat bulkier term because it is more accurate. That is, "vowel" is a psychological term, not truly applicable to the physical signal. By using a term that is appropriate for the description of the physical level, we hope to avoid some of the confusion of levels that one frequently finds in accounts of human speech perception [Repp 81, Foster 85].

Kohonen has noted the difficulties of handling plosives in a speech recognition system [Kohonen 88]. Furthermore, the voicing distinction is a crucial one in speech communication [Revoile et al. 82]. Church [Church 87] has noted that voicing may be difficult to detect in certain phonetic environments. He further notes that he is "not aware of a concrete proposal describing how [acoustic] cues are to be integrated in a practical recognition system" [Church 87: xiii]. At present, we are looking only at the voicing distinction in post-vocalic, word-final stops, but Raphael's work [Raphael 72] suggests that our method could be generalized to voicing distinctions in fricatives and affricates as well. Furthermore, it is likely that our method can also be generalized to intervocalic voicing distinctions.

It is generally agreed by phonologists that the [±voice] distinction in post-vocalic, word-final stop segments can be made even when the part of the acoustic stream that corresponds to the perceived final stop is not clearly identifiable in isolation as either voiced or unvoiced [Denes 55, Raphael 72, Walsh and Parker 83, Walsh and Parker 84, Walsh and Parker 87]. Cues to the voicing distinction may be contained in that portion of the acoustic signal associated with the final stop consonant; however, these are apparently not necessary for perception of the final stop. Walsh et al. note, "since stops in [final] position are often produced without release or glottal pulsation during closure, all acoustic cues to the nature (and even to the presence of the final stop) must occur prior to the closure" [Walsh and Parker 87]. In fact, the closure itself may be removed [Walsh and Parker 84], yet the signal will remain interpretable. In the synthesized speech Walsh et al. used in their experiment, the F2 and F3 (second and third formants, harmonics of the fundamental frequency) transitions were sufficient to signal the presence of an alveolar stop (i.e. /t/ or /d/) [Walsh and Parker 87], and whether the listener perceives /d/ or /t/, that is, the voicing distinction, is likewise governed by some aspect of the periodic noise preceding closure.

It is worth noting here that at level b, the systematic phonetic level, final /t/ and /d/ segments would be represented exactly the same. Just as the alveolar flap, [D], is an allophone of both /t/ and /d/, the final segment here is indistinguishable at the systematic phonetic level between /t/ and /d/. We might represent it, by analogy with the alveolar flap, as [T], that is, an allophone of both /t/ and /d/ which may occur in final position. Here is another case where the relationship between the phonemic and systematic phonetic levels is not isomorphic. Notice also that in this case, the acoustic signal impinges directly on a distinction made at the phonemic level without intervention of the systematic phonetic level.

That the voicing distinction is cued by the portion of the signal associated with vowel perception is generally accepted. DPN is most widely accepted as the primary cue. However, the rate of decline of the frequency of the F1 (first formant) transition (F1 slope) has also been suggested as a cue. A lengthy DPN cues a [+voice] interpretation while a short DPN cues a [-voice] interpretation; a rapidly falling F1 slope cues a [+voice] interpretation while a flat F1 slope cues a [-voice] interpretation. In fact, there is general agreement that both cues are active, but controversy exists over which is the critical cue.

In disputing the general contention that DPN is the critical cue, Walsh and Parker [Walsh and Parker 83] claim that if the DPN is less than a critical duration then the following stop will be perceived as [-voice]. If the DPN is greater than a critical duration, the following stop will be perceived as [+voice]. But between those critical durations, DPN is not an accurate cue. Since F1 slope operates in very long and very short vowels as well as intermediate cases, they claim that it is the most important cue while DPN "as a cue to [±voice] in postvocalic stops in English is (at best) redundant and (at worst) misleading" [Walsh and Parker 83]. Their observations with regard to the role of DPN as a cue could be expressed by the following rules:


DPN < CD1 --> [-voice]   (4)
DPN > CD2 --> [+voice]   (5)

where CD means critical duration and CD1 < CD2.
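Rules (4) and (5) are conclusive only at the extremes; between the critical durations they yield no decision. A sketch of this conditional structure (Python; the millisecond values for the CDs are hypothetical placeholders, since the paper does not fix them):

    def voice_from_dpn(dpn_ms, cd1_ms=150, cd2_ms=310):
        # Rule (4): a DPN below the lower critical duration cues [-voice].
        if dpn_ms < cd1_ms:
            return "-voice"
        # Rule (5): a DPN above the upper critical duration cues [+voice].
        if dpn_ms > cd2_ms:
            return "+voice"
        # Between CD1 and CD2, DPN only biases; other cues must decide.
        return None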

Walsh et al. [Walsh and Parker 87] present results which show that subjects listening to tokens of bad and bat strongly preferred a [-voice] interpretation of the final stop regardless of the rate of F1 decline when the DPN was extremely short (100 ms, apparently less than or close to CD1). At the greatest DPNs they test, their results do not indicate a strong preference for a [+voice] interpretation regardless of rate of F1 decline. Instead, a flat F1 transition with a DPN of 250 ms results in near random interpretations since the cues are contradictory. However, 250 ms is less than CD2 for /d/ and /t/ according to the results presented by Raphael [Raphael 72]. His results also show a significant percentage of final alveolar stops judged [-voice] when the preceding DPN is 250 ms. However, when the preceding DPN is slightly more than 300 ms, judgments were consistently [+voice], indicating that CD2 is somewhat greater than the longest durations tested by Walsh et al. [Walsh and Parker 87]. We will assume, then, that the formulation presented in Walsh and Parker's earlier work [Walsh and Parker 83, Walsh and Parker 84] is essentially correct with regard to the role of DPN as a cue. That is, a DPN of less than CD1 conclusively cues a [-voice] interpretation while a DPN of greater than CD2 conclusively cues a [+voice] interpretation regardless of F1 slope. A DPN between CD1 and CD2 will act to bias the interpretation, but it may conflict with the F1 slope information. In cases of conflict, F1 slope may be the stronger cue as Walsh and Parker have suggested.

It is worth noting that DPN differs depending on what sort of segment follows. While the voicing distinction is the most important factor, manner and place of articulation also have an effect [House and Fairbanks 53]. Consequently, the CDs for any given voiced/unvoiced pair will differ, if only slightly, from any other pair. Furthermore, dialectal variations in length of vowels will also affect the values of the CDs.

It is interesting to consider the point of disagreement between Walsh and Parker and others. As noted, DPN is generally accepted as the primary cue. Walsh and Parker, however, argue that because F1 slope acts as a cue at all DPNs, it should be considered the primary cue, though it may be overridden by extreme DPNs [Walsh and Parker 83]. Walsh et al. [Walsh and Parker 87] do not argue strongly for a primary cue, noting instead that both cues operate. The Bicameral neural net allows us to make use of the notion of primary cue while relieving us of the necessity of determining which cue is primary in all cases. We can proceed on the assumption that no single cue is primary in all instances. What is relevant in this case is that there are at least two features of the preceding vowel that can affect interpretation of the following consonant. Moreover, they may affect interpretation differently. Furthermore, each of them may be the primary cue under different circumstances. Finally, extreme DPNs are conclusive while F1 slope is not claimed to be conclusive under any circumstances. We wish to show that neural nets with a bicameral nature due to the addition of a controller can handle multiple (and possibly misleading) cues of varying strengths quite naturally.

With regard to human speech perception, it is interesting to note that in normal speech most DPNs fall between the two CDs [Walsh and Parker 84]. Revoile [Revoile et al. 82] found that impaired-hearing subjects were more dependent on DPN as a cue than were normal-hearing subjects. We might also speculate that DPN might be an important cue when there is a great deal of noise interfering with perception. Duration would seem likely to be easier to distinguish in such conditions than spectral characteristics of vowels. In any case, whether DPN is used frequently in normal speech does not affect the point of this example, which is that DPN varies in importance as a cue under differing conditions.

3.3. The Neural Net Implementation

Assume that the asynchronous control of a neural net has learned the phonological rules for our example. This means that it contains stable points for post-vocalic, word-final stops at [-voice] for a preceding DPN less than CD1 and [+voice] for a preceding DPN greater than CD2. The physical correlates of the [+voice] or [-voice] interpretation to which a speaker is normally exposed may be considered the prototypical cue. Each prototypical cue would create one stable point. Physical stimuli not fitting the prototypical cue exactly must be resolved to one of the available points to allow interpretation. DPNs less than CD1 and greater than CD2 fix the attraction basin, but if the probe is in the intermediate range, then the system must not fix on the corresponding [±voice] on the basis of DPN alone.

Using the asynchronous controller, the system can consider particular conditions of the probe and fix portions of the probe in accordance with those conditions. For example, if the portion of the information vector corresponding to DPN is greater than the upper bound (CD2), the system should fix the portion of the probe (and the solution) to the attractor for CD2 and [+voice]. Conversely, if the portion of the information vector corresponding to DPN is less than the lower bound (CD1), the system should fix the portion of the probe (and the solution) to the attractor for CD1 and [-voice]. However, if DPN is between CD1 and CD2, the system should not fix any portion of the probe on the basis of DPN alone, though DPN may still act as an attractor.

This can be interpreted as saying that as an attractor DPN is at times a conclusive attractor and at other times strong or weak, but not conclusive. Depending on its value, DPN can serve to direct a portion of the probe to a particular solution. For other values, the same portion of the probe is not fixed and the probe is allowed to stabilize freely in the neural network. Without the asynchronous controller, neural networks cannot handle the conditional rules necessary for dealing with the variable importance of DPN as an attractor.


3.3.1. Example

Digitization of acoustic information is commonplace. To simplify our examples, we assume that the DPN information is contained in the first four bits of an information vector representing the number of milliseconds. In addition, to simplify our simulation, we assume that CD1 = (-1 -1 -1 +1) and CD2 = (-1 +1 +1 +1). The F1 slope information is contained in the last four bits representing the Hz/millisecond. In both cases the leftmost bits represent more significant bits; therefore all -1's will represent a very slow F1 decline, and all +1's will represent a very rapid F1 decline. The two groups of four bits are combined to form prototypes by placing the DPN before the F1 slope to form DPN/F1S.

For this example, the following learned memories are prototypes:

/t/ is cued when DPN/F1S = (-1 -1 -1 +1 +1 -1 -1 +1)   (6)
/d/ is cued when DPN/F1S = (-1 +1 +1 +1 +1 +1 -1 +1)   (7)

These formulations may be considered prototypical cues: the physical correlates closest to what a particular speaker finds in speech to which he is normally exposed (or perhaps that to which he was exposed during the critical period of language acquisition). Notice that /t/ is cued by a fairly short DPN and medium F1 slope, whereas /d/ is cued by a fairly long DPN and a fairly rapid F1 slope.
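Under the encoding just described, the prototypes can be built up from 0/1 bit patterns (a sketch in Python; the helper names are ours):

    def encode(bits):
        # Map 0/1 bits to the -1/+1 neuron values used throughout the paper.
        return [+1 if b else -1 for b in bits]

    def dpn_f1s(dpn_bits, f1s_bits):
        # Four DPN bits followed by four F1-slope bits, most significant
        # bit leftmost, forming the eight-neuron DPN/F1S information vector.
        return encode(dpn_bits) + encode(f1s_bits)

    t_prototype = dpn_f1s([0, 0, 0, 1], [1, 0, 0, 1])  # (6): short DPN, medium F1 slope
    d_prototype = dpn_f1s([0, 1, 1, 1], [1, 1, 0, 1])  # (7): long DPN, rapid F1 slope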

Now consider two different probes to observe how the system works to stabilize either one. Let a probe p = DPN/F1S = (-1 -1 +1 +1 +1 -1 -1 -1). This probe has a medium DPN and a slow F1 slope. In this case, then, there is no conclusive cue and we are assuming that DPN and F1 slope are equally weighted cues. By applying pT = p' the following effective set is obtained:

p  = (-1 -1 +1 +1 +1 -1 -1 -1)
p' = (-1 +1 -1 +1 +1 +1 -1 +1)
         x  x        x     x
         2  3        6     8

The x's mark where p' differs from p, and those positions comprise the effective set, in this instance (2, 3, 6, 8). Position 2 is randomly selected for update, thus causing the second position in p to change from -1 to +1. By applying the new probe to the transition matrix T in the same manner (pT = p') the new choices of 6 and 8 are obtained. Randomly choosing position 6 then 8, the system updates to a stable point cuing /d/. The probe could have updated to cue a /t/ if the choice of neurons for update had been 3 and 8 in that order. In other words, the stimulus in this case did not definitively favor either a /d/ or a /t/ interpretation. It is worth noting that when Walsh et al. presented subjects with an F1 slope of 3.7 Hz/ms, the subjects heard /d/ 34% of the time when the DPN was 150 ms and 75% of the time when the DPN was 200 ms [Walsh and Parker 87]. If we assume that our medium DPN falls between 150 ms and 200 ms, then the results for this probe appear to correspond with their findings. In this case, neither of the available acoustic cues was strong enough to force a particular interpretation. In such cases, grammatical or semantic/pragmatic information would have to be used for interpretation. Notice that if we want to consider F1 slope a more important cue at durations between the CDs, we would simply have to increase the probability that the system would update DPN neurons rather than making the choice random.

Now consider the probe p = DPN/F1S = (-1 -1 -1 -1 +1 +1 -1 -1). Notice that the DPN is very short and that the system, therefore, should stabilize at a /t/ interpretation. By multiplying the probe with the transition matrix, the effective set obtained is (1, 2, 3, 4, 6, 7, 8). Unfortunately, if we update randomly the very short DPN is easily lost as a critical cue. For example, updating position 2, we obtain a choice for the next pass of (1, 3, 4, 7, 8). The system will stabilize at /d/ if in the following passes we choose 3, 4, and 8 respectively. However, the very short DPN (less than CD1) suggests we should stabilize at /t/.

If we use a controller containing rules (4) and (5), the CD rules, positions 1 to 4 will only be updated as a last resort because the controller notes that DPN < CD1. Therefore, the set of choices is narrowed to (6, 7, 8). Additionally, the system determines that updating 6 and 8 has a high probability of leading to a learned memory (1.0 in this case). Randomly choosing from these two, the system updates 8 (6 works equally well). The new probe p' = (-1 -1 -1 -1 +1 +1 -1 +1) leaves positions 4 and 6 to update. By updating these positions, the system stabilizes, and additionally it stabilizes at the desired learned memory: p'' = (-1 -1 -1 +1 +1 -1 -1 +1), which corresponds to the prototypical cue for /t/ in (6). This result agrees with the results of tests using extreme DPNs. The controller allows us to use DPN as a conclusive cue under certain conditions while leaving the possibility that F1 slope is a strong but not conclusive cue under other conditions. Thus our system seems to mimic previous results in human speech perception studies.
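The conditional rule used here can be expressed as a controller function for the recall loop sketched earlier (Python; the names and the simple tie-breaking choice are our assumptions, standing in for the probability-weighted selection the paper describes):

    def code_value(bits):
        # Read a 4-bit -1/+1 code as an unsigned integer, MSB leftmost.
        return sum((1 if b > 0 else 0) << (3 - k) for k, b in enumerate(bits))

    CD1 = code_value([-1, -1, -1, +1])   # = 1
    CD2 = code_value([-1, +1, +1, +1])   # = 7

    def cd_rule(effective, p):
        # Rules (4) and (5) as update control: when DPN is conclusive
        # (outside [CD1, CD2]), positions 1-4 are updated only as a last
        # resort, so the short or long DPN is preserved as a fixed cue.
        dpn = code_value(p[:4])
        if dpn < CD1 or dpn > CD2:
            f1s_positions = [i for i in effective if i >= 4]
            if f1s_positions:
                return f1s_positions[0]
        return effective[0]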


The use of the Bicameral neural net model allows us to handle the essential non-isomorphism between the physical acoustic signal and the phonological segments perceived by human speakers of English. This model can handle not only multiple and contradictory cues, but its controller also allows the use of conditional rules. This latter capability allows us to choose DPN or F1 slope as the primary cue to the [±voice] distinction in post-vocalic, word-final stops, depending on the circumstances. The ability to make use of various bits of acoustic information for a single interpretation, to stabilize to an interpretation regardless of contradictory cues, and to vary the importance of cues, makes this model valuable for the solution of problems in phonology caused by the non-isomorphic relationship between the acoustic signal and the perceived phonological segments. These abilities also suggest directions to proceed in the pursuit of speech-recognition machines.

Conditional rule structures appear to be necessary in a wide variety of applications: in machine speech perception, linguistic processing, and other areas of simulated human perception. We have presented one such application. Not only does our method promise improved machine performance, but it may also give insight into the functioning of human cognition.

References

Buckingham, H.W. and Yule, G. 1987. "Phonemic False Evaluation: Theoretical and Clinical Aspects," Clinical Linguistics and Phonetics, vol. 1, pp. 113-125.

Chomsky, N. 1964. Current Issues in Linguistic Theory, Mouton, The Hague.

Church, K. W. 1987. Phonological Parsing in Speech Recognition, Kluwer Academic Publishers.

Denes, P. 1955. "Effects of Duration on the Perception of Voicing," Journal of the Acoustical Society of America, vol. 27, pp. 761-764.

Denes, P. B. 1963. "On the Statistics of Spoken English," Journal of the Acoustical Society of America, vol. 35, pp. 892-904.

Foster, D., Riley, K., and Parker, F. 1985. "Some Problems in the Clinical Application of Phonological Theory," Journal of Speech and Hearing Disorders, pp. 294-297.

Hopfield, J. J. 1982. "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proceedings of the National Academy of Sciences, USA, vol. 79, pp. 2554-2558.

House, A. S. and Fairbanks, G. 1953. "The Influence of Consonant Environment upon Secondary Acoustical Characteristics of Vowels," Journal of the Acoustical Society of America, vol. 25, pp. 105-113.

Kak, S.C. and Stinson, M.C. 1988. "Bicameral Neural Networks Where Information can be Indexed," Electronics Letters, vol. 25, pp. 203-205.

Kohonen, T., 1988. "The 'Neural' Phonetic Typewriter," Computer, vol. 21, no. 3, (March) pp. 11-24.

Kohonen, T., 1988. Self-Organization and Associative Memory, 2nd ed., Springer-Verlag.

Parker, F., 1977. "Distinctive Features and Acoustic Cues," Journal of the Acoustical Society of America, vol. 62, pp. 1051-1054.

Raphael, L.J. 1972. "Preceding Vowel Duration as a Cue to the Perception of Voicing in Word-final Consonants in American English," Journal of the Acoustical Society of America, vol. 51, pp. 1296-1303.

Repp, B.H. 1981. "On Levels of Description in Speech Research," Journal of the Acoustical Society of America, vol. 69, pp. 1462-1464.

Revoile, S., Pickett, J.M., Holden, L.D., and Talkin, D. 1982. "Acoustic Cues to Final Stop Voicing for Impaired and Normal-hearing Listeners," Journal of the Acoustical Society of America, vol. 72, pp. 1145-1154.

Sejnowski, T. and Rosenberg, C.R. 1986. "NETtalk: A Parallel Network that Learns to Read Aloud," Johns Hopkins University Technical Report JHU/EECS-86/01.

Stinson, M.C. 1988. "Neural Network with Asynchronous Control," Ph.D. Dissertation, Louisiana State University.

Stinson, M.C. and Kak, S.C. 1988. "A Bicameral Neural Network that Improves Convergence," presented at the International Neural Network Society First Annual Meeting, Boston, MA, September 1988. Abstracted in the Proceedings of the INNS.

Stinson, M.C. and Kak, S.C. 1988. "An Asynchronous Controller for Neural Networks," Proceedings of the 1988 Association for Computing Machinery Southeast Regional Conference.


Stinson, M.C. and Kak, S.C. 1988. "Bicameral Neural Computing," Proceedings of the 1988 International Conference on Communication and Control, Baton Rouge, LA, October 1988.

Walsh, T. and Parker, F. 1983. "Vowel Length and Vowel Transition: Cues to [±voice] in Post-vocalic Stops," Journal of Phonetics, vol. 11, pp. 407-412.

Walsh, T. and Parker, F. 1984. "A Review of the Vocalic Cues to [±voice] in Post-vocalic Stops in English," Journal of Phonetics, vol. 12, pp. 207-218.

Walsh, T., Parker, F., and Miller, C.J. 1987. "The Contribution of Rate of F1 Decline to the Perception of [±voice]," Journal of Phonetics, vol. 15, pp. 101-103.

Youn, C.H. and Kak, S.C. 1989. "Continuous Unlearning in Neural Networks," Electronics Letters, vol. 25, pp. 202-203.
