
Contextually dependent cue realization and cue weighting for a laryngeal contrast in Shanghai Wu a)

Jie Zhang b)

Department of Linguistics, University of Kansas, 1541 Lilac Lane, Lawrence, Kansas 66045, USA

Hanbo Yan

School of Chinese Studies and Exchange, Shanghai International Studies University, Shanghai 200083, China

(Received 2 November 2017; revised 3 August 2018; accepted 16 August 2018; published online 11 September 2018)

Phonological categories are often differentiated by multiple phonetic cues. This paper reports a production and perception study of a laryngeal contrast in Shanghai Wu that is not only cued in multiple dimensions, but also cued differently on different manners (stops, fricatives, sonorants) and in different positions (non-sandhi, sandhi). Acoustic results showed that, although this contrast has been described as phonatory in earlier literature, its primary cue is in tone in the non-sandhi context, with vowel phonation and consonant properties appearing selectively for specific manners of articulation. In the sandhi context where the tonal distinction is neutralized, these other cues may remain depending on the manner of articulation. Sonorants, in both contexts, embody the weakest cues. The perception results were largely consistent with the aggregate acoustic results, indicating that speakers adjust the perceptual weights of individual cues for a contrast according to manner and context. These findings support the position that phonological contrasts are formed by the integration of multiple cues in a language-specific, context-specific fashion and should be represented as such. © 2018 Acoustical Society of America. https://doi.org/10.1121/1.5054014

[MS] Pages: 1293–1308

I. INTRODUCTION

A standard assumption about phonological contrast is

that it is categorical, based on either segments (/p/ vs /b/) or

features ([−voice] for /p/, [+voice] for /b/; Jakobson et al., 1952; Chomsky and Halle, 1968; Stevens, 2002; Clements,

2009). A major challenge for phoneticians and phonologists

alike is to account for how speakers categorize gradient and

variable acoustic signals into such discrete entities. Two

salient aspects of this challenge relate to how featural con-

trasts are instantiated acoustically. First, contrasts are often

differentiated by multiple acoustic cues. The stop voicing

contrast in English, for example, is associated with differ-

ences in voice-onset time (VOT), closure duration, f0 of the

following vowel, and a host of other acoustic properties

(Lisker, 1986). Second, the acoustic cues for the same con-

trast often depend on the phonological context in which the

contrast appears. For instance, the English voicing contrast

would not benefit from the f0 cue of the following vowel in

the final position, but would benefit from a duration differ-

ence on the vowel preceding it (Chen, 1970; Raphael, 1972).

The investigations of how a contrast is acoustically realized

in a multidimensional fashion, how the different acoustic

cues are weighted in the perception of the contrast, and how

the weighting is affected by the acoustic dimensions along

which the cues vary, the distributional characteristics of the

acoustic cues, the context in which the contrast appears, and

the listeners’ language background have contributed to sig-

nificant theoretical issues in phonetics and phonology, such

as the mode of speech perception (Repp, 1983; Parker et al.,

1986; Massaro, 1987), the nature of distinctive features

(Halle and Stevens, 1971; Kingston, 1992; Stevens and

Keyser, 2010), the production-perception link (Newman,

2003; Shultz et al., 2012; DiCanio, 2014), the influence of

phonological knowledge of a language on perception

(Massaro and Cohen, 1983; Flege and Wang, 1989; Dupoux

et al., 1999; Hallé and Best, 2007), the theories of perceptual

contribution of secondary cues (Holt et al., 2001; Francis

et al., 2008; Kingston et al., 2008; Llanos et al., 2013), and

the mechanisms of phonetic category learning (Clayards

et al., 2008; Toscano and McMurray, 2010; McMurray

et al., 2011).

This paper contributes to this scholarship by presenting

a case study on the cue realization and cue weighting of a

laryngeal contrast on different segments in different contexts

in Shanghai Wu. Like many Wu dialects of Chinese,

Shanghai has a three-way distinction among voiceless aspi-

rated, voiceless unaspirated, and voiced stops. The voiced

series, however, is not realized with typical closure voicing,

but is known as “voiceless with voiced aspiration” (Chao,

1967), indicating the involvement of breathy phonation. On

fricatives, there is a two-way voicing contrast, whereby the

voiced fricatives are truly voiced, and on sonorants, there is

a modal-murmured distinction that corresponds to the

a)Portions of this work were presented at the 18th International Congress of

Phonetic Sciences, Glasgow, Scotland, UK; the 89th annual meeting of the

Linguistic Society of America, Portland, OR; and the 22nd annual meeting

of the International Association of Chinese Linguistics in conjunction with

the 26th North American Conference on Chinese Linguistics, College

Park, MD.

b) Electronic mail: [email protected]

J. Acoust. Soc. Am. 144 (3), September 2018. 0001-4966/2018/144(3)/1293/16/$30.00 © 2018 Acoustical Society of America


voiceless-voiced distinction in obstruents (Chao, 1967; Xu

and Tang, 1988; Zhu, 1999, 2006).

Shanghai Wu, like other Chinese dialects, is also tonal.

There are three phonetic tones on open or sonorant-closed

syllables, transcribed as 53, 34, and 13, and two phonetic

tones on ʔ-closed syllables, 55 and 12. But there is a co-occurrence restriction between tones and onset laryngeal features in that the higher tones 53, 34, and 55 only occur on

syllables with voiceless obstruent or modal sonorant onsets,

and the lower tones only occur with phonologically voiced

obstruent or murmured sonorant onsets (Xu and Tang, 1988;

Zhu, 1999, 2006). Therefore, in Shanghai, there is a minimal

contrast between tɔ34 “to arrive” and dɔ13 “news,” and this

contrast is cued by both the voice quality of the initial conso-

nant and f0. The examples in Table I illustrate the co-

occurrence of the two rising tones 34 and 13 with the laryn-

geal features in Shanghai.
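The co-occurrence restriction just described can be written down as a small lookup. The following is only an illustrative sketch: the tone sets come from the description above, while the onset class labels and the function name `is_licit` are hypothetical conveniences rather than notation used in the literature.

```python
# Tone sets as stated in the text: higher tones pair with voiceless/modal
# onsets, lower tones with voiced/murmured onsets.
HIGH_TONES = {"53", "34", "55"}
LOW_TONES = {"13", "12"}

# Hypothetical labels for the onset laryngeal classes (an assumption).
HIGH_TONE_ONSETS = {"voiceless", "modal"}
LOW_TONE_ONSETS = {"voiced", "murmured"}


def is_licit(onset_class: str, tone: str) -> bool:
    """Return True if the onset class and the tone may co-occur."""
    if tone in HIGH_TONES:
        return onset_class in HIGH_TONE_ONSETS
    if tone in LOW_TONES:
        return onset_class in LOW_TONE_ONSETS
    raise ValueError(f"unknown tone: {tone}")
```

Under this sketch, a voiceless onset with tone 34 is licit, whereas a voiced onset with tone 34 is not.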

Tones in connected speech are affected by a tone change

process called tone sandhi in Shanghai. Polysyllabic com-

pound words undergo a rightward spreading tone sandhi pro-

cess by extending the tone on the first syllable over the

entire compound domain and consequently wiping out the

tonal contrasts in non-initial syllables (Zee and Maddieson,

1980; Xu and Tang, 1988; Zhu, 1999, 2006). For example,

tɔ34 “to arrive” and dɔ13 “news,” when appearing as the second syllable of a disyllabic compound, are reported to lose their tonal difference, as shown in the following examples: /pɔ34-tɔ34/ → [pɔ33-tɔ44] “check-in”; /pɔ34-dɔ13/ → [pɔ33-dɔ44] “news report.” The voicing difference between

the onset consonants on the second syllable, however,

remains, and the voiced stops have been reported to have clo-

sure voicing in this position (Cao and Maddieson, 1992; Ren,

1992; Shen and Wang, 1995; Chen, 2011; Wang, 2011; Gao,

2015; Gao and Hallé, 2017).

The data pattern in Shanghai, therefore, presents a clear

example in which a phonological contrast is realized differ-

ently on different manners and different positions: stops, fri-

catives, and sonorants can all carry the contrast, but via

different sets of cues; the monosyllabic context is significant

in that it is the only context in which the phonation-tone co-occurrence, as illustrated in Table I, is fully manifested, while

the second syllable of disyllables constitutes a position

where the cues for the contrast are considerably altered by a

tone sandhi process. We specifically focus on the contrast

between voiceless unaspirated/modal and voiced/murmured

consonants co-occurring with a high-rising and a low-rising

tone, respectively (e.g., tɔ34 vs dɔ13; me34 vs m̤e13). As we

review in Sec. I B below, although previous studies have

established the multidimensional nature of this contrast, as

well as the fact that the cues for the contrast vary by prosodic

position, no study has expressly compared the realization of

cues in different manners or studied how the cues are

weighted in perception across manners and positions. This

study aims to achieve these goals. In so doing, it has the

potential to make the following unique contributions. First,

previous studies on the perceptual contributions of voicing

and f0 of a contrast have primarily been conducted on non-

tone languages like English and Spanish, and in these lan-

guages, voicing has been found to be the primary cue

(Abramson and Lisker, 1985; Shultz et al., 2012; Llanos

et al., 2013). Shanghai, being from a tone-language family,

could work in the opposite way with tone as a primary cue

and voicing/voice quality a secondary cue, similar to

Southern Vietnamese (Brunelle, 2009) and Eastern Cham

(Brunelle, 2012). This provides an opportunity to observe

the influence of language background on how cues are

weighted and the limit and potential reasons for the primacy

of a particular cue (see also Francis et al., 2008; Llanos

et al., 2013). Second, the positional dependency of the realization of this contrast results from not only the position per se, but also a phonological alternation process that, at least

according to the descriptive literature, categorically neutral-

izes one of the cues (tone) in the non-initial context. This

puts the context scenario here, phonologically, between full

realization (e.g., voicing in final position in English) and full

neutralization (e.g., manner contrast in final position in

Korean; Kim and Jongman, 1996) and allows it to contribute

to the large literature on incomplete neutralization (e.g.,

Dinnsen and Charles-Luce, 1984; Port and Crawford, 1989;

Warner et al., 2004; Dmitrieva et al., 2010). Third, phonetic

studies of phonation have primarily focused on vowels (e.g.,

Huffman, 1987; Andruski and Ratliff, 2000; Blankenship,

2002; Wayland and Jongman, 2003; Esposito, 2010a, 2012;

Khan, 2012) and obstruent consonants (e.g., Davis, 1994;

Mikuteit and Reetz, 2007; Dutta, 2009; Berkson, 2016a);

studies on sonorant consonant phonation (e.g., Aoki, 1970;

Traill and Jackson, 1988; Berkson, 2016b) are relatively

rare, presumably due to their typological rarity and the weak

acoustic cues they embody (Berkson, 2016b). Shanghai fur-

nishes an example that has a laryngeal contrast in both

obstruents and sonorants, and thus provides a rare venue to

compare the acoustics and perception of the contrast on the

two types of segments.

A. Acoustic correlates of breathiness

During the production of breathy phonation, the vocal

folds are in a relatively abducted configuration with low lon-

gitudinal tension. Articulatorily, this results in a higher open

quotient of the glottal cycle and a less abrupt glottal closing

gesture; aerodynamically, the increased airflow volume and

the loose vibratory mode of the vocal fold cause turbulence

noise at the glottis, which gives the auditory perception of

breathy voice (Gordon and Ladefoged, 2001).

A host of acoustic parameters that result from these

articulatory and aerodynamic properties have been identified

TABLE I. Examples of laryngeal and tone co-occurrence restrictions in Shanghai. Voiceless obstruents or modal sonorants co-occur with the high-rising tone 34; voiced obstruents or murmured sonorants co-occur with the low-rising tone 13.

Stops               Fricatives      Sonorants
pu34 “cloth”        fi34 “fee”      me34 “beautiful”
phu34 “tattered”
bu13 “division”     vi13 “fat”      m̤e13 “plum”


in the literature. In terms of spectral measures, Klatt and

Klatt (1990) and Holmberg et al. (1995) showed that a

higher open quotient correlates with a greater difference

between the amplitude of the first two harmonics (H1-H2),

and Stevens (1977) and Hanson et al. (2001) demonstrated

that the more gradual glottal closure results in a steeper spec-

tral tilt that can be measured by the amplitude differences

between f0 and F1-F3 (H1-A1, H1-A2, H1-A3). In terms of

periodicity measures, Hillenbrand et al. (1994) advocated

the use of cepstral-peak prominence (CPP), a measure of

peak harmonic amplitude adjusted for the overall amplitude,

of which breathy phonation is expected to have lower values

than modal phonation; the harmonics-to-noise ratio (HNR)

has also been used, with breathy phonation having lower

HNR values (de Krom, 1993). In studies of phonological

breathiness crosslinguistically, these measures have often

been shown to be relevant acoustic and perceptual correlates.

For instance, increased H1-H2 and spectral tilt measures

have been found to be acoustic correlates of breathy vowels

in Hmong (Huffman, 1987; Andruski and Ratliff, 2000;

Esposito, 2012; Garellek et al., 2013), Khmer (Wayland and

Jongman, 2003), Juǀ’hoansi (Miller, 2007), Hindi (Dutta,

2009), Gujarati (Khan, 2012), Jalapa Mazatec (Blankenship,

2002; Esposito, 2010b; Garellek and Keating, 2011), and

Santa Ana del Valle Zapotec (Esposito, 2010a). Esposito

(2010b) and Garellek et al. (2013), in addition, found that

these measures directly contribute to the perception of

breathiness. Lower CPP values have been found for breathy

vowels in Jalapa Mazatec (Blankenship, 2002; Garellek and

Keating, 2011), White Hmong (Esposito, 2012), and

Gujarati (Khan, 2012). Lower HNR values were found for

breathy vowels in Juǀ’hoansi (Miller, 2007), but not in

Khmer (Wayland and Jongman, 2003).
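As a rough illustration of the simplest of these spectral measures, the sketch below computes an uncorrected H1-H2 from a synthetic signal with a known f0. It is not the procedure of any study cited above: the function names, the 10% search band around each harmonic, and the Hann window are all assumptions, and no formant correction is applied.

```python
import numpy as np


def harmonic_level_db(signal, fs, freq, tol=0.1):
    """Peak spectral level (dB) within +/- tol*freq of a target frequency."""
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= freq * (1 - tol)) & (freqs <= freq * (1 + tol))
    return 20.0 * np.log10(spectrum[band].max())


def h1_minus_h2(signal, fs, f0):
    """Uncorrected H1-H2: level of the first harmonic minus the second."""
    return harmonic_level_db(signal, fs, f0) - harmonic_level_db(signal, fs, 2 * f0)


# Synthetic check: two harmonics whose amplitudes differ by a factor of 2,
# so the expected difference is about 20*log10(2) dB.
fs, f0 = 16000, 100.0
t = np.arange(fs) / fs  # one second of samples
toy = 1.0 * np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
h1h2_toy = h1_minus_h2(toy, fs, f0)
```

A breathier voice source would be expected to yield a larger value of this difference, all else being equal; real speech additionally requires reliable f0 tracking and, for cross-vowel comparison, correction for formant effects.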

Duration measures have also been found to correlate

with breathiness. For stops, breathy stops have shorter clo-

sure durations than their plain counterparts in Bengali

(Mikuteit and Reetz, 2007), Hindi (Dutta, 2009), and

Marathi (Berkson, 2016a), and the shorter closure duration

of voiced stops compared to voiceless stops is well known

(e.g., Lisker, 1986).1 For fricatives, Jongman et al. (2000)

showed that voiced fricatives generally have shorter frication

duration than their voiceless counterparts. The duration pat-

tern for sonorant phonation is scantily documented, but there

is some evidence that breathy sonorants tend to be longer

than their modal counterparts, as reported for Marathi

(Berkson, 2013).

Finally, the phonological co-occurrence between

breathy phonation and lower tones found in Shanghai is

attested elsewhere as well, e.g., in Santa Ana del Valle

Zapotec (Esposito, 2010a) and Hmong (Andruski and

Ratliff, 2000; Esposito, 2012). This may be rooted in the

general f0 lowering effect of breathiness (Laver, 1980;

Gordon and Ladefoged, 2001), which has been well attested,

e.g., in Khmu’ (Abramson et al., 2007), Hindi (Dutta, 2009),

and Marathi (Berkson, 2013). But whether this effect is a

phonetic universal remains controversial, as there are studies

that have shown either an f0 raising effect (Wayland and

Jongman, 2003, for Khmer) or the lack of an f0 correlate

(Garellek and Keating, 2011, for Jalapa Mazatec) for

breathiness.

B. Previous research on the phonation–tone interaction in Shanghai Wu

As previously stated, existing literature on phonation–

tone interaction in Shanghai has firmly established that the

cues for the laryngeal contrast of interest here are multidi-

mensional in both non-sandhi and sandhi positions. Cao and

Maddieson (1992) showed that for syllables in isolation, i.e.,

the non-sandhi context, H1-H2 and H1-A12 were signifi-

cantly higher at vowel onset after the voiced stop than after

the voiceless unaspirated stop, but the differences disap-

peared at the mid and end points of the vowel; for syllables

in the sandhi context (e.g., second syllable in disyllables),

only the H1-H2 difference remained at vowel onset, and the

magnitude of the difference was smaller; but the voiced

stops were “phonetically voiced.” The acoustic study by Ren

(1992) also showed tapering H1-H2 and H1-A1 differences

on the vowel after voiced and voiceless unaspirated stops in

the non-sandhi position; but in the sandhi position, Ren

found an H1-A1 difference instead of an H1-H2 difference.

Ren (1992) also conducted a perception study in which H1-

H2 was varied in ten steps and f0 in three steps on the initial

portion of the vowel after a stop in the sandhi position

(ä13–ta34 “shoelace” to ä13–da13 “shoe (is) big”). Results

showed that both H1-H2 and f0 had an effect on the percep-

tion of the second syllable: the /d/ response was more likely

with a higher H1-H2; a raised f0 shifted response toward /t/,

while a lowered f0 shifted the response toward /d/. Shen and

Wang (1995) focused on the roles of the closure and release

durations of the stop as the acoustic correlates of stop voic-

ing. They showed that, although the two types of stops did

not differ in their release duration (duration between the stop

burst and the beginning of vowel periodicity), the voiceless

stops had a significantly longer closure duration than the

voiced stops in both initial and medial positions, and the

voiced stops had closure voicing medially. The acoustic

study by Wang (2011) returned similar duration results to

Shen and Wang’s except that she did not find a closure dura-

tion difference based on voicing in the initial position. In a

series of perception studies that manipulated closure dura-

tion and f0, Wang showed that when restricting the tones to

the two rising tones 34 (co-occurs with voiceless) and 13

(co-occurs with voiced), f0 was the primary perceptual cue

for the contrast in initial position; in the medial position,

both f0 and closure duration were used perceptually for the

contrast, but closure voicing was not. Chen (2011) focused

on the f0 perturbation effect from the stop voicing contrast in

the sandhi context and found that the effect was minimal,

and that its size was partly determined by the underlying

tone of the preceding syllable. Chen argued that these pat-

terns potentially serve the purpose of maximizing the tonal

contrast on the preceding syllable, which determines the

pitch contour of the entire sandhi domain; therefore, the f0 perturbation here is speaker controlled, at least in part. For

H1-H2, Chen only found the expected difference in the /o/


context, with the voiced stops inducing greater H1-H2; for

the /i/ context, the effect was the reverse.

In a dissertation (Gao, 2015) and a series of related publications (Gao and Hallé, 2013, 2015, 2016, 2017), Gao and Hallé presented the most comprehensive study of Shanghai

phonation–tone interaction to date. Their acoustic investiga-

tion included all three manners (stops, fricatives, nasals) as

onsets in monosyllables as well as both syllables of disyl-

lables. In terms of duration, a consonant-vowel (CV) syllable

with a voiceless fricative onset had a longer consonant and a

shorter vowel than one with a corresponding voiced fricative

onset (Gao, 2015); voiced stops had a significantly longer

VOT than voiceless unaspirated stops by around 2–4 ms

(Gao, 2015; Gao and Hallé, 2017). In terms of voicing in the

initial position, voiced stops rarely had voicing, while voiced

fricatives had voicing ratios (percentages of consonant dura-

tion being voiced) of around 30%–40%; in medial position,

voiced stops and fricatives had over 90% voicing ratios,

compared to around 20%–30% for voiceless ones (Gao,

2015; Gao and Hallé, 2017). For spectral and periodicity

measures, they showed that for monosyllables, H1-H2, H1-

A1, and H1-A2 were generally higher, while CPP was gener-

ally lower following voiced/murmured onsets than voiceless/

modal ones, but the differences were the greatest and the

most consistent for elder male speakers; linear discriminant

analyses (LDAs) showed that H1-H2 was the most consistent

cue across age and gender groups and in different tonal con-

texts. Only H1-H2 results were reported for the two syllables

in disyllables. Results showed that for the first syllable, H1-

H2 was higher after voiced/murmured onsets, but the differ-

ence was less clear-cut than in monosyllables; for the second

syllable, no H1-H2 difference based on the voicing difference was found (Gao, 2015; Gao and Hallé, 2017).

Perceptually, two experiments were conducted to investigate

the effect of duration and voicing patterns on the identifica-

tion of the laryngeal contrast. The first experiment created

“congruent” and “incongruent” monosyllabic stimuli by

imposing the f0 of one CV onto another when the two onsets

differed in voicing, and the results showed that the congru-

ence factor significantly affected the accuracy and reaction

time of tone identification when the onsets were labial frica-

tives, which had the largest voicing difference. The second

experiment created tonal continua between the two rising

tones on both the long C-short V and short C-long V dura-

tion patterns and showed that the duration pattern shifted the

listeners’ identification response toward the category with

that duration pattern, and the incongruence between tone and

duration pattern slowed down the reaction time (Gao and

Hallé, 2013; Gao, 2015). An additional experiment was carried out to investigate the effect of voice quality on perception. Tonal continua were again created between the two

rising tones and imposed onto modal and breathy syllables

(both synthesized and naturally produced modal and breathy

syllables were used). Identification results showed that the

voice quality of the syllable shifted the listeners’ identifica-

tion response toward the category with that phonation type,

and the incongruence between tone and phonation slowed

down the reaction time with the exception of naturally

produced tokens with nasal onsets (Gao and Hallé, 2015;

Gao, 2015).

With the exception of the work of Gao and Hallé, the previous studies only investigated a subset of the cues for stops. But even in the studies by Gao and Hallé, there was

no direct comparison among the different manners, and their

perception studies were restricted to monosyllables. In the

present work, the goal is to provide a comprehensive look at

the acoustic realization and perception of the contrast

between voiceless unaspirated/modal and voiced/murmured

consonants co-occurring with a high-rising and a low-rising

tone, respectively, across different manners (e.g., tɔ34 vs dɔ13; fi34 vs vi13; me34 vs m̤e13) and different contexts (sandhi, non-sandhi) using a consistent set of methods, and consequently shed light on the language- and context-dependent

nature of contrast realization and perceptual cue weighting,

especially when a phonological alternation process is

involved, as well as the production-perception link. In Secs.

II and III, a production study and a perception study con-

ducted to this end are reported.

II. EXPERIMENT 1: PRODUCTION STUDY

A. Methods

Thirteen monosyllabic voiceless/modal vs voiced/mur-

mured minimal pairs were used for the non-sandhi context

(six stop pairs, four fricative pairs, three sonorant pairs); all

voiceless/modal syllables occurred with the high rising tone

34 and all voiced/murmured syllables with the low rising tone

13 (e.g., pu34 and bu13). The same pairs were then used as the

second syllable of disyllabic compounds with matched first

syllable for the sandhi context (e.g., fən53-pu34 and fən53-bu13). Both the monosyllabic and disyllabic words were embedded in the carrier sentence ŋu34 ɕja34 ___ gəʔ12 əʔ55 zɿ13 “I write the character/word ___.” The reason the target

stimuli were put in sentence-medial position was to allow the

measurement of closure duration for onset stops, as duration

has been more consistently shown as a perceptual cue for the

contrast in previous studies (Wang, 2011; Gao and Hallé,

2013; Gao, 2015). The trade-off, however, is that this creates

an environment that may also facilitate consonant voicing for

the voiced obstruents even for the monosyllables. Tone sandhi

(or lack thereof) on the target words, however, is not expected

to be affected by the sentential context, as the preceding verb

ɕja34 “to write” and the following demonstrative gəʔ12 “this”

do not belong to the same prosodic word as the target. The

full word list is given in Table II.

Ten native speakers (5 male, 5 female) with an age range

of 19–30 and a mean age of 25 were recorded in a quiet room

in Shanghai using an Electro-Voice N/D767 cardioid micro-

phone (Burnsville, MN) and a Marantz portable solid state

recorder (PMD 671, Cumberland, RI). Each of them read the

stimuli twice. Subsequent measurements for the two repeti-

tions were averaged before the statistical analyses.

Consonant durations were measured in Praat (Boersma

and Weenink, 2012) by the second author. The duration for

stops was the closure duration and was measured from the

end of the previous syllable to the stop release. For fricatives

and sonorants, the segments themselves were identified from


the spectrograms and their durations measured. Durations

were analyzed with linear mixed-effects models, with the

laryngeal feature (referred to as voicing for brevity below) as

fixed effects and subject and item as random effects. P-values

were calculated using the lmerTest package in R (Kuznetsova

et al., 2016). Monosyllables and disyllables were analyzed

separately. Stops and fricatives were classified as “voiced”

or “voiceless” depending on whether 50% or more of the

consonant duration (closure for stops, frication duration for

fricatives) had voicing, as determined from the waveforms

and spectrograms in Praat by the first author.
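The 50% criterion for classifying obstruent tokens can be stated compactly. The sketch below assumes that the voiced portion and the total consonant duration are each available in milliseconds for a token; the function names and that input format are illustrative assumptions, not a description of the actual annotation workflow.

```python
def voicing_ratio(voiced_ms: float, consonant_ms: float) -> float:
    """Proportion of the consonant interval (closure or frication) that is voiced."""
    if consonant_ms <= 0:
        raise ValueError("consonant duration must be positive")
    return voiced_ms / consonant_ms


def classify_voicing(voiced_ms: float, consonant_ms: float, threshold: float = 0.5) -> str:
    """Label a token 'voiced' if at least `threshold` of its duration is voiced."""
    ratio = voicing_ratio(voiced_ms, consonant_ms)
    return "voiced" if ratio >= threshold else "voiceless"
```

For example, a stop with 40 ms of voicing in an 80 ms closure sits exactly at the threshold and would count as voiced under this criterion.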

The spectral measure H1*-H2* (corrected H1-H2 based

on the frequencies and bandwidths of formants; Shue et al., 2011) and the periodicity measure CPP were selected to estimate the breathiness induced by the contrast.3 H1*-H2* and

CPP values were measured every millisecond in VoiceSauce v1.12 (Shue et al., 2011), and the measurements over every

9.1% of the vowel duration were averaged, yielding 11 data

points for each vowel for statistical analysis. The Snack

Sound Toolkit (Sjölander, 2004) was used by VoiceSauce to

find the frequencies and bandwidths of the formants with the

covariance method, a pre-emphasis of 0.96, and a window

length of 25 ms with a frame shift of 1 ms. Fundamental fre-

quencies were measured at 10% intervals during the vowel

using the ProsodyPro Praat script (Xu, 2005–2013). The

Maxf0 and Minf0 parameters in the script, as well as the

octave-jump cost, were adjusted for each speaker, and the f0 measurements were manually checked by the second author

against pitch tracks and narrowband spectrograms in Praat to

correct any measurement errors by the script. The f0 values in

Hz were then converted into semitones and z-scored. Growth

curve analyses (Mirman, 2014) were conducted on the H1*-

H2*, CPP, and f0 curves over the vowel using third-order

(cubic) orthogonal polynomials. The models were built up

from the base model that only included subject, item, and sub-

ject-by-voicing random effects. Voicing and its interaction

with the time terms were subsequently added step-wise, and

their effects on model fit were evaluated using log-likelihood

model comparison. Parameter estimates for the full model

were then tested for significance using t-tests, and p-values

were again estimated by the lmerTest package. Different man-

ners and different positions were analyzed separately, and the

voiceless/modal category was used as the baseline. H1*-H2*

and CPP were similarly compared for sonorant consonants,

but the measurements were averaged over every 20% of the

sonorant duration, yielding only five data points for each

sonorant. All statistical analyses were performed using the

lme4 package (Bates et al., 2015) in R (R Core Team, 2014).
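The data preparation steps in this paragraph can be sketched as follows. This is a hedged illustration in Python rather than the R workflow actually used: the function names are hypothetical, the 100 Hz semitone reference and the z-scoring grouping are assumptions (the text does not specify them), and the QR-based time terms are only analogous to orthogonal polynomials as typically used in growth curve analysis.

```python
import numpy as np


def bin_means(track, n_bins=11):
    """Average a measurement track over n_bins equal stretches of the vowel.

    The track should contain at least n_bins samples.
    """
    chunks = np.array_split(np.asarray(track, dtype=float), n_bins)
    return np.array([chunk.mean() for chunk in chunks])


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert Hz to semitones relative to ref_hz (the reference is an assumption)."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)


def zscore(values):
    """Standardize values, e.g., within one speaker (the grouping is an assumption)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)


def orthogonal_time_terms(n_points=11, degree=3):
    """Orthonormal linear, quadratic, and cubic time terms via QR decomposition."""
    steps = np.arange(1, n_points + 1, dtype=float)
    raw = np.vander(steps - steps.mean(), degree + 1, increasing=True)
    q, _ = np.linalg.qr(raw)
    return q[:, 1:]  # drop the constant column
```

Because the resulting time terms are mutually orthogonal and orthogonal to the intercept, the estimate for each term (and for its interaction with voicing) can be interpreted independently of the others, which is the usual motivation for orthogonal rather than raw polynomials in this type of analysis.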

To investigate the relative contribution of the different

acoustic cues in the laryngeal contrast for each manner in

monosyllables and disyllables, LDAs were conducted to

explore the extent to which the laryngeal category can be

predicted from the acoustic cues. The greedy.wilks function

in the klaR package (Weihs et al., 2005) in R was used to

conduct stepwise forward variable selection for significant

predictors (p < 0.05), and the lda function in the MASS

package (Venables et al., 2002) was used to derive the coef-

ficients for the variables for the linear discriminant functions.

The overall Wilks’s lambda values (from 0 to 1) for the dis-

crimination (0 means total discrimination, 1 means no dis-

crimination), as well as their F and p values, were calculated

using the manova function (see also Gao, 2015).
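To make the interpretation of Wilks's lambda concrete, the sketch below computes the statistic directly as the ratio of the determinants of the within-group and total scatter matrices. It is an illustration in Python with fabricated toy data, not a reimplementation of the greedy.wilks, lda, or manova functions; the function name `wilks_lambda` and the data layout are assumptions.

```python
import numpy as np


def wilks_lambda(features, labels):
    """Wilks's lambda = det(W) / det(T) for a set of labeled observations.

    W is the pooled within-group sums-of-squares-and-cross-products matrix
    and T is the total one. Values near 0 mean the groups are well separated;
    values near 1 mean the cues do not discriminate the groups.
    """
    x = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    centered_total = x - x.mean(axis=0)
    total = centered_total.T @ centered_total
    within = np.zeros_like(total)
    for group in np.unique(y):
        rows = x[y == group]
        centered = rows - rows.mean(axis=0)
        within += centered.T @ centered
    return np.linalg.det(within) / np.linalg.det(total)


# Toy data (fabricated for illustration only, not the study's measurements).
rng = np.random.default_rng(0)
labels = np.array(["voiceless"] * 50 + ["voiced"] * 50)
noise = rng.normal(size=(100, 2))
separated = noise + np.where(labels[:, None] == "voiced", 10.0, 0.0)
lam_separated = wilks_lambda(separated, labels)  # close to 0
lam_overlap = wilks_lambda(noise, labels)        # close to 1
```

In the toy data, shifting one category by a large constant on both cue dimensions drives lambda toward 0, whereas two categories drawn from the same distribution give a value near 1.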

B. Results

1. Duration and voicing measures

The consonant duration results are given in Fig. 1. For

both the monosyllables and the second syllable of disyl-

lables, the best model included the interaction between voic-

ing and manner. An analysis with voicing nested under

manner as fixed effects for monosyllables and disyllables,

respectively, was then conducted to get voicing estimates for

the different manners in the same model. For monosyllables,

the effect of voicing is significant for fricatives (estimate = −59.168, standard error (SE) = 13.344, degrees of freedom (df) = 25.246, t = −4.434, p < 0.001), but not for stops (estimate = −11.073, SE = 10.930, df = 25.544, t = −1.013, p = 0.321) or sonorants (estimate = −0.783, SE = 15.393,

TABLE II. Word list used in the production experiment. Tone transcriptions reflect the base tones before the application of tone sandhi.

             Monosyllables                            Disyllables
             Voiceless/modal     Voiced/murmured      Voiceless/modal                      Voiced/murmured
Stops        pin34 “pancake”     bin13 “bottle”       ma13-pin34 “to sell pancakes”        ma13-bin13 “to sell bottles”
             pu34 “to spread”    bu13 “section”       fən53-pu34 “distribution”            fən53-bu13 “division”
             tɔ34 “to arrive”    dɔ13 “news”          pɔ34-tɔ34 “check-in”                 pɔ34-dɔ13 “news report”
             ti34 “emperor”      di13 “brother”       ɦɑ̃13-ti34 “emperor”                  ɦɑ̃13-di13 “royal brother”
             kue34 “rail”        gue13 “hoop”         tʰiɪʔ55-kue34 “rail”                 tʰiɪʔ55-gue13 “iron hoop”
             koŋ34 “arch”        goŋ13 “together”     iɪʔ55-koŋ34 “an arch”                iɪʔ55-goŋ13 “all together”
Fricatives   fi34 “fee”          vi13 “fat”           ke34-fi34 “to reduce the fee”        ke34-vi13 “to lose weight”
             fən34 “hard work”   vən13 “article”      faʔ55-fən34 “to work hard”           faʔ55-vən13 “to publish an article”
             sɿ34 “water”        zɿ13 “porcelain”     dɑ̃13-sɿ34 “sugar water”              dɑ̃13-zɿ13 “porcelain”
             su34 “lock”         zu13 “seat”          tɕin53-su34 “golden lock”            tɕin53-zu13 “golden seat”
Sonorants    min34 “chirp”       m̤in13 “name”         ɲjɔ34-min34 “bird’s chirps”          ɲjɔ34-m̤in13 “bird’s name”
             me34 “America”      m̤e13 “plum”          ly34-me34 “traveling in the US”      ly34-m̤e13 (proper name)
             ɲjɔ34 “bird”        ɲ̤jɔ13 “around”       le13-ɲjɔ34 “blue bird”               le13-ɲ̤jɔ13 “indiscriminate”


df = 25.151, t = −0.051, p = 0.960). For the second syllable of disyllables, likewise, the effect of voicing is significant for fricatives (estimate = −66.554, SE = 12.558, df = 25.792, t = −5.300, p < 0.001), but not for stops (estimate = −8.870, SE = 10.250, df = 25.755, t = −0.865, p = 0.395) or sonorants (estimate = 0.182, SE = 14.484, df = 25.676, t = 0.013, p = 0.990).

In terms of voicing, 89% of the voiced stops and 100% of the voiced fricatives in the second syllable of disyllable tokens were classified as voiced. This generally agrees with the results of earlier studies (Shen and Wang, 1995; Chen, 2011; Wang, 2011; Gao, 2015; Gao and Hallé, 2017). In monosyllables, due to the intervocalic position in which the consonant appears in the sentential context, 33% of the voiced stop onsets were also classified as voiced. For fricatives, 100% of the voiced bilabial fricatives and 50% of the coronal fricatives were classified as voiced. The tendency for bilabial fricatives to have more voicing in this position has also been documented in Gao (2015) and Gao and Hallé (2017). Voiceless obstruents were occasionally voiced (11% for monosyllables, 31% for the second syllable of disyllables), contrary to traditional descriptions. We have no good explanation for this except that the intervocalic or post-nasal positions in which they appear perhaps encouraged phonetic voicing.

2. Spectral and periodicity measures

The H1*-H2* and CPP results for the vowels after the three consonant manners in monosyllables are given in Figs. 2 and 3, respectively. Model comparisons for H1*-H2* showed that the model did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms for any manner (p > 0.15 for all comparisons). For CPP, the interaction between voicing and the quadratic time term did significantly improve the model for fricatives [χ²(1) = 8.455, p = 0.004]. Parameter estimates for the quadratic interaction (estimate = 2.160, SE = 0.610, t = 3.538, p = 0.006) indicated that voiceless fricatives induced a sharper peak for the CPP curve on the following vowel than voiced ones; no other model comparisons were significant (all p > 0.07).

For the second syllable of disyllables, stops and sonorants again did not exhibit any phonatory difference in H1*-H2* or CPP on the following vowel based on their laryngeal features (p > 0.18 for all model comparisons). For fricatives, however, model comparisons showed that for H1*-H2* the effect of voicing on the intercept significantly improved the model [χ²(1) = 9.564, p = 0.002], and parameter estimates (estimate = 2.241, SE = 0.568, t = 3.942, p = 0.002) indicated that voiceless fricatives induced a lower H1*-H2* than voiced fricatives; for CPP, the effects of the laryngeal feature on the intercept and quadratic time terms both significantly improved the model [intercept: χ²(1) = 8.752, p = 0.003; quadratic: χ²(1) = 6.353, p = 0.012], and parameter estimates showed a significant effect for the quadratic interaction (estimate = 2.881, SE = 0.943, t = 3.054, p = 0.011), indicating that voiceless fricatives again induced a sharper peak for the CPP curve on the following vowel. These results are given in Figs. 4 and 5.⁴
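The model comparisons above are reported as chi-square statistics with one degree of freedom. As a reading aid, the following minimal sketch recomputes the p values from the reported statistics, assuming the comparisons are standard likelihood ratio tests between nested models (the function name is illustrative and not taken from the study's analysis scripts).

```python
import math


def lrt_p_value_df1(chi_square):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2.0))


# Chi-square statistics reported in the text for the fricative CPP models.
reported = {
    "monosyllable, voicing x quadratic": 8.455,
    "disyllable, voicing on intercept": 8.752,
    "disyllable, voicing x quadratic": 6.353,
}

for label, statistic in reported.items():
    p_value = lrt_p_value_df1(statistic)
    print(f"{label}: chi2(1) = {statistic:.3f}, p = {p_value:.3f}")
```

The recomputed values should agree with the reported p values up to rounding.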

FIG. 1. Duration of onset consonants in monosyllables and the second syllable of disyllables. *: p< 0.05; **: p< 0.01; ***: p< 0.001.

FIG. 2. H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.


For the spectral and periodicity measures on the sonorant consonants themselves, for monosyllables, the model for H1*-H2* did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.75 for all comparisons), but the model for CPP did improve with the addition of voicing on the intercept [χ²(1) = 4.818, p = 0.028] and the quadratic time term [χ²(1) = 4.064, p = 0.044]. Parameter estimates indicated that the modal sonorants had an overall higher CPP value than the murmured sonorants (voicing intercept: estimate = −1.815, SE = 0.510, t = 3.561, p = 0.005), and the murmured sonorants had a more U-shaped curve than the modal sonorants (voicing and quadratic time term interaction: estimate = 0.890, SE = 0.395, t = 2.256, p = 0.041). For sonorant onsets on the second syllable of disyllables, the models for H1*-H2* and CPP did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.33 for all comparisons). The monosyllabic and disyllabic results are given in Figs. 6 and 7, respectively.

3. f0

The f0 results for the monosyllables and the second syllable of disyllables are given in Figs. 8 and 9, respectively. For monosyllables, the addition of voicing improved the model for the stops [χ²(1) = 8.350, p = 0.004] and fricatives [χ²(1) = 15.153, p < 0.001], and the addition of its interaction with the linear time term improved the model for the fricatives [χ²(1) = 11.224, p < 0.001] and sonorants [χ²(1) = 4.472, p = 0.034]. Parameter estimates for the full model, which include the effects of voicing and its interaction with the linear, quadratic, and cubic time terms for the three manners, are summarized in Table III. With the voiceless/modal category as the baseline, the negative intercepts indicated that the f0s after the voiced/murmured consonants were significantly lower than those after the voiceless/modal consonants, and the positive coefficients for the interaction between voicing and the linear time term indicated that the f0s after the voiced/murmured consonants had sharper rising slopes than those after the voiceless/modal consonants; therefore, the f0 difference between the two types of onsets decreased over the duration of the vowel. For the second syllable in disyllables, however, only for the fricatives did the addition of the laryngeal feature significantly improve the model [χ²(1) = 3.849, p = 0.050]. No other model comparisons were significant (all p > 0.12). Parameter estimates for the full models indicated that the effects of voicing on the intercept or higher time terms were not significant for any manner, including the fricatives.
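The growth curve analyses rely on linear, quadratic, and cubic time terms that are mutually orthogonal, so that the intercept, slope, and curvature effects of voicing can be read independently. The sketch below shows one way such terms could be constructed, by Gram-Schmidt orthogonalization of the powers of time. The number of time points (10) and the function name are assumptions for illustration; the scaling used in the study's own models may differ.

```python
def orthogonal_polynomials(n_points, degree=3):
    """Return `degree` orthonormal polynomial time terms over equally spaced steps."""
    if n_points <= degree:
        raise ValueError("need more time points than the polynomial degree")
    time = [float(i) for i in range(n_points)]
    basis = [[1.0] * n_points]  # constant term, used only for orthogonalization
    for power in range(1, degree + 1):
        vec = [t ** power for t in time]
        # Remove the components along every lower-order term.
        for b in basis:
            norm_sq = sum(x * x for x in b)
            proj = sum(v * x for v, x in zip(vec, b)) / norm_sq
            vec = [v - proj * x for v, x in zip(vec, b)]
        norm = sum(v * v for v in vec) ** 0.5
        basis.append([v / norm for v in vec])
    return basis[1:]  # drop the constant term


# Hypothetical example: 10 measurement points per vowel.
linear, quadratic, cubic = orthogonal_polynomials(10)
print([round(v, 3) for v in linear])
```

Because the terms are centered and uncorrelated, a voicing effect on the intercept reflects the overall height of the curve, while the voicing-by-linear interaction reflects a difference in slope.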

4. Linear discriminant analysis

Consonant duration and CPP and f0 values averaged

over the entire vowel duration were used as the acoustic vari-

ables in the linear discriminant analysis. These variables

were selected as representatives of the acoustic properties of

the consonant, vowel phonation, and vowel f0. Consonant

duration was selected as the consonant cue as previous stud-

ies have primarily shown the perceptual effect of duration

(e.g., Wang, 2011; Gao and Hallé, 2013; Gao, 2015), and

Wang (2011) has shown that listeners did not use closure

voicing as a perceptual cue for stops. CPP was selected as

the phonation cue as our acoustic results above showed

stronger CPP effects than H1*-H2*. The variables were cen-

tered and scaled before being submitted to the discriminant

analysis.

FIG. 3. CPP results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

FIG. 4. H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

FIG. 5. CPP results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

FIG. 6. H1*-H2* and CPP results over the duration of sonorant onsets for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

FIG. 7. H1*-H2* and CPP results over the duration of sonorant onsets for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

FIG. 8. Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

FIG. 9. Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.

Table IV summarizes the coefficients for the variables for the linear discriminant functions as well as the Wilks’s lambda, F, and p values for the discriminations. Significant predictors, as indicated by stepwise variable selection, are given in bold. “Voiceless/modal” was dummy coded as 0. Therefore, a negative coefficient for a factor indicates that a higher value for that factor is more likely to lead to a “voiceless/modal” classification. For monosyllables (non-sandhi), the only consistent predictor was f0; but for fricatives, both CPP and duration were significant as well, and the stepwise analysis selected f0 first, then CPP, followed by duration. For the second syllable in disyllables (sandhi), only the fricatives could be significantly discriminated, and the stepwise analysis selected duration first, then CPP.
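To make the logic of the discriminant analysis concrete, the following is a minimal sketch of a two-class Fisher discriminant on centered and scaled predictors. It is not the procedure used in the study: it uses only two predictors, omits stepwise selection and Wilks’s lambda, and the eight data points are invented for illustration. As in Table IV, class 0 stands for voiceless/modal, so a negative weight means that higher values of that predictor favor a voiceless/modal classification.

```python
from statistics import mean, stdev


def zscore(values):
    """Center and scale a list of values (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def fisher_lda_2d(class0, class1):
    """Return discriminant weights for two predictors: w = S_within^-1 (m1 - m0)."""
    def centroid(rows):
        return [mean(r[j] for r in rows) for j in range(2)]

    m0, m1 = centroid(class0), centroid(class1)
    scatter = [[0.0, 0.0], [0.0, 0.0]]  # pooled within-class scatter
    for rows, m in ((class0, m0), (class1, m1)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    scatter[i][j] += d[i] * d[j]
    det = scatter[0][0] * scatter[1][1] - scatter[0][1] * scatter[1][0]
    inv = [[scatter[1][1] / det, -scatter[0][1] / det],
           [-scatter[1][0] / det, scatter[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]


# Hypothetical tokens: mean f0 (Hz) and consonant duration (ms).
f0 = [220, 215, 225, 218, 190, 195, 185, 192]
duration = [120, 131, 118, 127, 122, 119, 130, 125]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = voiceless/modal, 1 = voiced/murmured

rows = list(zip(zscore(f0), zscore(duration)))
class0 = [r for r, y in zip(rows, labels) if y == 0]
class1 = [r for r, y in zip(rows, labels) if y == 1]
weights = fisher_lda_2d(class0, class1)
print("f0 weight:", round(weights[0], 2), "duration weight:", round(weights[1], 2))
```

In this toy data set the two categories differ mainly in f0, so the f0 weight is negative and larger in magnitude than the duration weight, which parallels the pattern reported for monosyllables.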

C. Discussion

The acoustic results above indicate that this laryngeal

contrast in Shanghai is primarily a tone contrast in the non-

sandhi context (monosyllables), as although the H1*-H2* and

CPP comparisons between the voiceless/modal and voiced/

murmured categories were generally in the expected direction,

with the voiceless/modal consonants exhibiting numerically

lower H1*-H2* and higher CPP on the following vowel than

the voiced/murmured ones, only the CPP comparison for fri-

catives reached significance under the growth curve analysis;

f0 curves on the vowels after voiceless/modal and voiced/mur-

mured consonants, however, differed significantly on both the

intercept and slope for all three manners except for the slope

for stops. There are indications that the consonants themselves

still played a role in the contrast as the fricatives exhibited a

duration difference, while the sonorants exhibited a CPP dif-

ference based on the contrast. Moreover, the attenuation of

the f0 difference over the vowel after voiceless/modal vs

voiced/murmured consonants also suggests that the f0 differ-

ence, at least in part, stems from the onset consonants. The

LDAs provided the relative weighting of the acoustic cues

from consonant duration, vowel phonation, and vowel f0 and

corroborated the acoustic finding that the laryngeal contrast in

the non-sandhi context is primarily tonal, with secondary cues

from CPP and consonant duration for the fricatives.

In the sandhi context (second syllable of disyllables), the f0 difference was neutralized, but the stops gained a voicing difference despite losing the closure duration difference, and the fricatives exhibited both duration and voicing differences. For the

sonorants, however, no difference between the modal and mur-

mured categories was detected in consonant duration, consonant

phonation, vowel phonation, or f0. The LDAs did not encode

the effect of voicing, but confirmed that f0 cannot be used to dis-

criminate the contrast, and that fricatives have enough second-

ary cues in duration and CPP to be differentiated.

These results show that the acoustic cues for the contrast

indeed vary by the manner and position in which the contrast

is realized. In the sandhi position where a phonological pro-

cess presumably neutralizes the main cue for the contrast—

f0, the contrast itself is incompletely neutralized for frica-

tives and arguably for stops, but completely neutralized for

sonorants as far as the measures included here are concerned.

The weakness of this contrast on sonorants hence finds some

support in the results.

Unlike in previous studies (e.g., Cao and Maddieson, 1992;

Ren, 1992; Gao, 2015), the H1*-H2* and CPP results here gen-

erally did not show a significant effect of the laryngeal feature.

For f0, although we showed that it significantly covaried with

the consonant feature in the non-sandhi context—a result shared

by all previous research—we did not find incomplete neutraliza-

tion in the sandhi context indicated by Ren (1992), Chen

(2011), and Wang (2011). There are two potential reasons for

these disparities. One is that, given our speakers were consider-

ably younger than the speakers used in earlier studies, it is possi-

ble that Shanghai is gradually losing the phonation difference,

and the contrast is now primarily cued by tone in the younger

generations (see Gao, 2015; Gao and Hallé, 2016, 2017, for age

and gender-based differences that support this contention).

Another possibility is that the different results are partly due to

the different statistical methods used. In the linear mixed-

effects-based growth curve analyses, the random effects struc-

ture included not only subject and item, but also subject-by-

voicing interaction. This helps reduce the type I error rate in hypothesis testing (Barr et al., 2013), in this case for the effect of voicing.

TABLE III. Parameter estimates for the monosyllable f0 analysis. Baseline = voiceless.

| Manner | Term | Estimate | SE | t | p |
|---|---|---|---|---|---|
| Stop | Voicing: Intercept | −0.805 | 0.190 | −4.228 | <0.001 |
| | Voicing: Linear | 0.533 | 0.271 | 1.967 | 0.068 |
| | Voicing: Quadratic | 0.330 | 0.208 | 1.588 | 0.137 |
| | Voicing: Cubic | −0.144 | 0.140 | −1.028 | 0.314 |
| Fricative | Voicing: Intercept | −1.180 | 0.176 | −6.699 | <0.001 |
| | Voicing: Linear | 1.153 | 0.228 | 5.045 | <0.001 |
| | Voicing: Cubic | −0.228 | 0.177 | −1.283 | 0.231 |
| Sonorant | Voicing: Intercept | −0.682 | 0.244 | −2.789 | 0.019 |
| | Voicing: Linear | 0.973 | 0.400 | 2.431 | 0.043 |
| | Voicing: Cubic | −0.175 | 0.142 | −1.233 | 0.232 |

TABLE IV. Coefficients for the variables for the linear discriminant functions, as well as the Wilks’s lambda, F, and p values for the discriminations. Significant predictors (p < 0.05) are in bold.

| Context | Manner | Duration (coefficient) | CPP (coefficient) | f0 (coefficient) | Wilks’s lambda | F | p |
|---|---|---|---|---|---|---|---|
| Monosyllable (non-sandhi) | Stop | −0.124 | −0.061 | **−1.245** | 0.684 | 16.627 | <0.001 |
| | Fricative | **−0.604** | **−0.761** | **−1.207** | 0.303 | 56.080 | <0.001 |
| | Sonorant | 0.026 | −0.156 | **−1.314** | 0.716 | 7.402 | <0.001 |
| Disyllable (sandhi) | Stop | −0.973 | 0.200 | −0.723 | 0.961 | 1.531 | 0.210 |
| | Fricative | **−1.464** | **−0.403** | −0.041 | 0.434 | 30.013 | <0.001 |
| | Sonorant | 2.136 | 0.490 | −0.377 | 0.998 | 0.042 | 0.988 |


III. EXPERIMENT 2: PERCEPTION STUDY

A. Methods

The perception study investigated how the different

acoustic cues for the laryngeal contrast are weighted in per-

ception and how the weightings are affected by the manner

and position of the contrast. The stimuli were monosyllabic

and disyllabic words in which the target syllables had a full

cross-classification of three sets of cues—consonant proper-

ties, vowel phonation, and vowel f0. These syllables were

constructed by cross-splicing consonant and vowel portions

of different syllables and superimposing the f0 contour from

one vowel onto another in Praat. For instance, from two base

tokens [pu34] (no. 1) and [bu13] (no. 8), six additional stimuli

(no. 2–no. 7) were constructed, as shown in Table V. Three

monosyllabic pairs, one from each manner, were selected as

the original base tokens (pu34–bu13, fi34–vi13, and me34–m̤e13), and their corresponding disyllabic pairs (f@n53pu34–f@n53bu13, ke34fi34–ke34vi13, and ly34me34–ly34m̤e13) were selected as the originals for the disyllables. With eight stimuli per pair, there were therefore 24 monosyllables and 24 disyllables in

total as the perceptual stimuli. There are three main reasons

why we used the cross-spliced stimuli in the perception

experiment instead of the acoustic continua often used in

similar studies. First, this method allows for a complete par-

allel for the investigation of different manners in different

positions. The acoustic-continuum method necessitates the

use of different values along the continuous scale due to dif-

ferent acoustic properties depending on context, and hence

loses some of the parallelism. Second, the manipulation is

easily executable. The acoustic-continuum method may not

allow effective continua to be built due to the small acoustic

differences in some contexts. Third, the method is symmetri-

cal among the three sets of cues and hence makes no

assumption about the importance of any particular one.
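The full cross-classification of the three cue sets amounts to a 2 × 2 × 2 factorial design for each base pair. The following minimal sketch enumerates that design; the function and field names are illustrative only, and the actual stimuli were built by cross-splicing and f0 superimposition in Praat rather than by any script shown here.

```python
from itertools import product


def build_design(voiceless_token, voiced_token):
    """Enumerate every combination of cue sources for one base pair."""
    sources = (voiceless_token, voiced_token)
    design = []
    combos = product(sources, repeat=3)  # consonant x phonation x f0
    for number, (consonant, phonation, f0) in enumerate(combos, start=1):
        design.append({
            "stimulus": number,
            "consonant": consonant,
            "phonation": phonation,
            "f0": f0,
        })
    return design


monosyllable_pairs = [("pu34", "bu13"), ("fi34", "vi13"), ("me34", "m̤e13")]
all_stimuli = [row for pair in monosyllable_pairs for row in build_design(*pair)]

for row in build_design("pu34", "bu13"):
    print(row)
print("monosyllabic stimuli:", len(all_stimuli))
```

For the pu34–bu13 pair, the enumeration order matches the stimulus numbering in Table V, with stimuli 1 and 8 being the unmodified originals.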

The base tokens were selected from a female speaker’s

production data, and a number of considerations went into

the selection of these tokens. First, it was ensured that these

tokens were representative of the overall acoustic patterns

reported in Sec. II. Second, given that the f0 contour was

either stretched or compressed when superimposed onto a

vowel of a different duration, the original syllable pairs were

selected such that their vowel durations were as similar as

possible. Third, after f0 was superimposed onto a different

vowel, H1*-H2* and CPP of the new token were remeas-

ured, and we selected the base tokens for which these mea-

sures were minimally affected by the f0 manipulation. A

summary of the acoustic measures for the 12 base tokens, as

well as when the f0 of the base tokens was switched to that

of the other laryngeal category, is given in Table VI, and all

48 test stimuli are provided as supplemental material online.⁵

All stimuli were embedded in the same carrier sentence and

auditorily presented to the subjects through headphones for a

two-alternative forced choice (2AFC) task, where they had

to choose on a monitor the Chinese character(s) they heard.⁶

The entire stimulus list was presented four times, and the

order of the stimuli was randomized each time. Forty-one

native speakers (16 male, 25 female) with an age range of

19–37 yr and a mean age of 24.4 yr participated in the exper-

iment in a quiet office at Fudan University in Shanghai.
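The stretching or compression of an f0 contour onto a vowel of a different duration can be pictured as resampling the contour on a new time base. The sketch below does this with simple linear interpolation over equally spaced points. It is only a conceptual illustration under that assumption; the study performed the manipulation in Praat, and the sample values here are hypothetical.

```python
def resample_contour(f0_values, n_out):
    """Linearly interpolate an f0 contour onto n_out equally spaced time points."""
    n_in = len(f0_values)
    if n_in < 2 or n_out < 2:
        raise ValueError("need at least two input and two output points")
    resampled = []
    for k in range(n_out):
        # Fractional index of output point k within the source contour.
        position = k * (n_in - 1) / (n_out - 1)
        low = min(int(position), n_in - 2)
        fraction = position - low
        value = f0_values[low] * (1 - fraction) + f0_values[low + 1] * fraction
        resampled.append(value)
    return resampled


# Hypothetical rising contour (Hz) stretched from 3 to 5 points.
print(resample_contour([200, 210, 220], 5))
```

The shape of the contour is preserved while its time course changes, which is why base pairs with similar vowel durations keep the distortion small.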

TABLE V. Examples of stimulus construction for the perception experiment from original tokens [pu34] and [bu13].

| Stimulus number | C properties | V phonation | V f0 | Method |
|---|---|---|---|---|
| 1 | pu34 | pu34 | pu34 | Original |
| 2 | pu34 | pu34 | bu13 | Superimpose f0 of [bu13] onto [pu34] |
| 3 | pu34 | bu13 | pu34 | Cross-splice C of [pu34] to V of [bu13], then superimpose f0 of [pu34] onto the vowel |
| 4 | pu34 | bu13 | bu13 | Cross-splice C of [pu34] to the V of [bu13] |
| 5 | bu13 | pu34 | pu34 | Cross-splice C of [bu13] to the V of [pu34] |
| 6 | bu13 | pu34 | bu13 | Cross-splice C of [bu13] to V of [pu34], then superimpose f0 of [bu13] onto the vowel |
| 7 | bu13 | bu13 | pu34 | Superimpose f0 of [pu34] onto [bu13] |
| 8 | bu13 | bu13 | bu13 | Original |

TABLE VI. Acoustic measures of the base tokens for the perception experiment as well as when the f0 of the base tokens was switched to that of the other laryngeal category (given in parentheses). H1*-H2*, CPP, and f0 were the average values over the vowel.

| Context | Token | C duration (ms) | H1*-H2* (dB) | CPP (dB) | f0 (Hz) |
|---|---|---|---|---|---|
| Monosyllable (non-sandhi) | pu34 | 126 | −1.55 (0.75) | 16.66 (17.28) | 217 (201) |
| | bu13 | 124 | 2.72 (1.77) | 16.01 (18.24) | 201 (217) |
| | fi34 | 196 | −1.18 (1.22) | 18.90 (19.28) | 229 (191) |
| | vi13 | 126 | 3.45 (0.58) | 17.24 (18.86) | 191 (229) |
| | me34 | 122 | −1.83 (4.19) | 22.98 (24.00) | 211 (211) |
| | m̤e13 | 118 | 8.38 (10.03) | 17.44 (21.56) | 172 (172) |
| Disyllable (sandhi) | f@n53-pu34 | 57 | −1.28 (0.09) | 17.38 (19.41) | 217 (198) |
| | f@n53-bu13 | 39 | 6.12 (6.43) | 17.47 (18.92) | 198 (217) |
| | ke34-fi34 | 147 | −3.21 (4.79) | 19.11 (21.37) | 205 (187) |
| | ke34-vi13 | 70 | 7.81 (2.03) | 20.73 (23.60) | 187 (205) |
| | ly34-me34 | 136 | 8.37 (8.47) | 20.54 (22.42) | 192 (194) |
| | ly34-m̤e13 | 99 | 12.27 (11.53) | 20.88 (21.77) | 194 (192) |


For each stimulus type defined by manner and position, a

mixed-effects logistic regression was conducted with the sub-

jects’ binary responses as the dependent variable and the voic-

ing specifications of consonant, phonation, and f0 cues as

categorical predictors with random intercept by subject.⁷ A

non-parametric analysis—the Classification and Regression

Tree (CART) analysis (Breiman et al., 1984)—was also con-

ducted using the rpart package in R to further investigate how

the listeners classified the stimuli based on these cues. CART

is a recursive partitioning technique that outlines the decision

process for a category membership based on categorical pre-

dictors. The splits in a classification tree are selected so that

the descendant subsets are “purer” than the current set, and

the parameters for the splits can be considered as significant

predictors for the classification. For our analysis, we con-

structed the classification trees by using consonant, phonation,

and f0 cues as categorical predictors for the subjects’ response

for each manner and position by using the rpart function. We

then conducted cost-complexity pruning for each tree based

on the relative errors generated by tenfold cross-validation

using the plotcp and prune functions (Baayen, 2008).

B. Results

The accuracy and d′ results for the listeners’ classification of the natural tokens are given in Fig. 10. These results

indicate that the subjects had near perfect identification of

the contrast in the non-sandhi context regardless of manner

and in the sandhi context for fricatives. For stops in the san-

dhi context, the identification was weaker, but well above

chance; for sonorants, however, identification was at chance.
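For readers less familiar with the sensitivity index d′, the sketch below shows one common way to compute it from response counts, as the difference between the z-transformed hit rate and false-alarm rate. The log-linear correction and the example counts are assumptions made for illustration; the text does not specify which formula or correction was used in the study.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Log-linear correction keeps the rates away from exactly 0 or 1,
    # where the inverse normal CDF would be undefined.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts: "voiced" responses to voiced tokens count as hits,
# "voiced" responses to voiceless tokens count as false alarms.
print(round(d_prime(95, 5, 5, 95), 2))   # high sensitivity
print(round(d_prime(50, 50, 50, 50), 2))  # chance-level performance
```

A d′ near zero corresponds to identification at chance, as reported for sonorants in the sandhi context, whereas near-perfect identification yields a large positive d′.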

The coefficients for the consonants, phonation, and f0 cues

in the mixed-effects logistic regressions for different manners

and positions are given in Tables VII and VIII. “Voiceless/

modal” was dummy coded as 0 for both the response variable

and all the categorical predictors. Therefore, the intercept in the

models indicates the log odds [ln(p/(1 − p))] of the segment

being given a “voiced/murmured” response when the conso-

nant, phonation, and f0 cues all came from the voiceless/

modal category, and the coefficients for consonant, phona-

tion, and f0 indicate the increase of the log odds when these

cues came from the voiced category, respectively. For mono-

syllables (non-sandhi), f0 was the only consistent factor that

significantly affected the response, and its coefficient was

the largest among the three cues for all three manners; but

for stops, phonation also had a significant effect, and for fri-

catives, both the consonant and phonation cues were signifi-

cant as well. For the second syllable in disyllables (sandhi),

all factors contributed significantly to the response for stops

and fricatives, with phonation and consonant cues having the

largest coefficient for stops and fricatives, respectively; for

sonorants, none of the factors was significant. All significant effects were in the expected direction, i.e., the cues from the voiced/murmured category elicited more voiced/murmured responses.

FIG. 10. Perceptual accuracy and d′ for the natural tokens in the perception experiment.

TABLE VII. Parameter estimates for the mixed-effects logistic regressions for monosyllables (non-sandhi context). Baseline = voiceless.

| Manner | Term | Estimate | SE | z | p |
|---|---|---|---|---|---|
| Stop | (Intercept) | −5.007 | 0.429 | −11.667 | <0.001 |
| | Consonant | −0.0984 | 0.222 | −0.443 | 0.658 |
| | Phonation | 0.945 | 0.232 | 4.059 | <0.001 |
| | f0 | 6.816 | 0.396 | 17.195 | <0.001 |
| Fricative | (Intercept) | −0.945 | 0.213 | −4.428 | <0.001 |
| | Consonant | 2.523 | 0.204 | 12.384 | <0.001 |
| | Phonation | 1.126 | 0.177 | 6.374 | <0.001 |
| | f0 | 2.551 | 0.205 | 12.464 | <0.001 |
| Sonorant | (Intercept) | −4.429 | 0.419 | −10.585 | <0.001 |
| | Consonant | 0.284 | 0.286 | 0.992 | 0.321 |
| | Phonation | −0.284 | 0.286 | −0.992 | 0.321 |
| | f0 | 7.411 | 0.450 | 16.486 | <0.001 |

TABLE VIII. Parameter estimates for the mixed-effects logistic regressions for the second syllable of disyllables (sandhi context). Baseline = voiceless.

| Manner | Term | Estimate | SE | z | p |
|---|---|---|---|---|---|
| Stop | (Intercept) | −2.8715 | 0.298 | −9.632 | <0.001 |
| | Consonant | 0.292 | 0.146 | 1.996 | 0.046 |
| | Phonation | 1.484 | 0.155 | 9.577 | <0.001 |
| | f0 | 1.015 | 0.150 | 6.756 | <0.001 |
| Fricative | (Intercept) | −2.270 | 0.244 | −9.323 | <0.001 |
| | Consonant | 4.517 | 0.267 | 16.952 | <0.001 |
| | Phonation | 0.957 | 0.179 | 5.343 | <0.001 |
| | f0 | 2.406 | 0.202 | 11.885 | <0.001 |
| Sonorant | (Intercept) | 0.762 | 0.224 | 3.402 | <0.001 |
| | Consonant | 0.057 | 0.127 | 0.450 | 0.652 |
| | Phonation | −0.221 | 0.127 | −1.734 | 0.083 |
| | f0 | 0.172 | 0.127 | 1.350 | 0.177 |
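The log-odds coefficients in Tables VII and VIII can be converted into predicted response probabilities. The sketch below does this for the monosyllabic stop model using the reported fixed-effect estimates. It ignores the by-subject random intercepts, so the resulting values are best read as rough predictions for a typical listener and not as observed response rates; the function name is illustrative.

```python
import math


def predicted_probability(intercept, coefficients, cue_is_voiced):
    """Probability of a voiced/murmured response given which cues are voiced."""
    logit = intercept
    for cue, is_voiced in cue_is_voiced.items():
        if is_voiced:  # baseline (voiceless/modal) cues contribute nothing
            logit += coefficients[cue]
    return 1.0 / (1.0 + math.exp(-logit))


# Fixed-effect estimates for stops in monosyllables (Table VII).
stop_intercept = -5.007
stop_coefficients = {"consonant": -0.0984, "phonation": 0.945, "f0": 6.816}

all_voiceless = {"consonant": False, "phonation": False, "f0": False}
only_f0_voiced = {"consonant": False, "phonation": False, "f0": True}
all_but_f0_voiced = {"consonant": True, "phonation": True, "f0": False}

for label, cues in [("all cues voiceless", all_voiceless),
                    ("only f0 voiced", only_f0_voiced),
                    ("consonant and phonation voiced", all_but_f0_voiced)]:
    p = predicted_probability(stop_intercept, stop_coefficients, cues)
    print(f"{label}: P(voiced response) = {p:.3f}")
```

Under these estimates, switching only the f0 cue to the voiced category moves the predicted response from almost always voiceless to mostly voiced, whereas switching the consonant and phonation cues together changes it very little.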

The CART analyses after pruning are given in Fig. 11.

The only pruning necessary was for fricatives in monosyllables

for which the original tree from the rpart function also included

branches based on phonation. Relative errors generated by ten-

fold cross-validation under different cost-complexity measures

using the plotcp function indicate that the structural complexity

introduced by these branches is not warranted. These branches

were then subsequently pruned using the prune function.

For stops and sonorants in disyllables, only the root

node was obtained, indicating that none of the cues was a

significant factor in the partition. For stops and sonorants in

monosyllables, f0 was the sole significant predictor for the

subjects’ classification (to read the Monosyllable_Stop

graph, for instance: among the 656 tokens with f0 cues com-

ing from voiceless stop onsets, 641 were classified as voice-

less and 15 were classified as voiced; among the 656 tokens

with f0 cues coming from voiced stop onsets, 549 were clas-

sified as voiced and 107 were classified as voiceless); for fri-

catives in monosyllables, f0 and consonant cues contributed

significantly, but their roles differed: f0> consonant; for fri-

catives in disyllables, only the consonant and f0 cues were

relevant, and the former was more important.
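The notion of a split producing “purer” subsets can be illustrated with the response counts quoted above for the Monosyllable_Stop tree. The sketch below computes the Gini impurity before and after splitting on the f0 cue. Gini impurity is one common purity criterion for classification trees; treating it as the criterion here is an assumption, since the text does not state which splitting index was used.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(children):
    """Reduction in Gini impurity achieved by splitting a parent into children."""
    parent = [sum(column) for column in zip(*children)]
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# Counts from the text, as (voiceless responses, voiced responses).
f0_from_voiceless = (641, 15)
f0_from_voiced = (107, 549)
split = [f0_from_voiceless, f0_from_voiced]

print("impurity of root node:", round(gini([641 + 107, 15 + 549]), 3))
print("impurity decrease from the f0 split:", round(impurity_decrease(split), 3))
```

The large drop in impurity reflects the fact that knowing the source of the f0 cue alone sorts most stop responses correctly.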

C. Discussion

Both the logistic regression and CART analysis of the

perception data showed that f0 was the primary cue that the

listeners relied on in making category judgment for the laryn-

geal contrast in monosyllables (non-sandhi context). For the

second syllable of disyllables (sandhi context), both analyses

showed that the consonant and f0 cues contributed signifi-

cantly to the voicing classification of fricatives. The logistic

regression analysis, however, identified additional significant

predictors: phonation for stops and fricatives in monosyl-

lables, consonant, phonation, and f0 for stops in disyllables,

and phonation for fricatives in disyllables. For a relatively

small dataset with only a few predictors like ours, it seems

that the CART analysis returned a more conservative estimate

of what predictors are significant in the classification. Logistic

regression and CART differ in that the former is able to pro-

vide an estimate of the average effect of a predictor while

accounting for other predictors, whereas the latter’s hierarchi-

cal structure does not allow the net effect of a predictor, in

general, to be estimated (Lemon et al., 2003). Without a priori assumptions about how our perception data would pattern,

it is perhaps worthwhile to consider both analyses to provide

a more comprehensive view of the data.

The perception results were generally consistent with

the aggregate production results: the laryngeal contrast in

question was primarily cued by f0 in the non-sandhi context,

and the f0 cue was able to override conflicting cues in the

consonant or vowel phonation; for the sandhi context, f0 became ineffective in stops and sonorants, but still had an effect on fricative classification. Different manners relied on different cues, and classification was the most robust for fricatives. For stops in the sandhi context, the fact that the listeners were able to classify the natural tokens at a high rate indicates the relevance of the consonant cue, but the effect of the cue was not strong enough to override conflicting cues from f0 and phonation, if any. For sonorants in this context, however, both the natural token identification and the classification of all stimuli demonstrated that there was simply no reliable cue for the contrast.

FIG. 11. CART analyses for stops, fricatives, and sonorants in monosyllables and fricatives in the second syllable of disyllables.

It is worth noting that the coefficients in the LDA per-

formed on the acoustic data are not directly comparable with

the coefficients in the logistic regression analysis of the per-

ception data as they mean very different things in the two

analyses (logistic regression was not used for the acoustic

data due to convergence problems). Moreover, the predictors

in the acoustic study were continuous, while those in the per-

ception study were categorical. However, comparisons

among the coefficients within each analysis consistently

point out how the cues are implemented and perceptually

used differently based on manner and position and the

importance of f0 cues in monosyllables and consonant cues

for fricatives in the second syllable of disyllables.

IV. GENERAL DISCUSSION

Both the production and perception results here clearly

show that, at least for the younger speakers that we tested,

the laryngeal contrast in question in Shanghai is primarily

realized as a tone difference acoustically in the non-sandhi

position, and listeners accordingly attend to the f0 cues in

classifying the contrast in this position. However, the fact

that the f0 difference over the vowel diminishes over time

indicates that the voicing/voice quality property of the onset

consonant contributes to the contrast. This is also consistent

with the weakness of the contrast for sonorants, which is a

known crosslinguistic tendency for laryngeal contrasts for

consonants, but would be difficult to explain if the contrast

were purely tonal. Taken together with the acoustic and per-

ceptual results of voicing and f0 cues in tonal and non-tonal

languages elsewhere, the findings are consistent with the

position that the perceptual system is tuned to the distribu-

tion of cues in the particular language.

Our results also shed light on whether certain cues are

inherently better perceptually for a contrast. For instance,

there is some evidence that consonant voicing is better cued

on fricatives than on stops, as in the non-initial position,

although both stops and fricatives exhibited an acoustic dif-

ference in voicing for the contrast, stop voicing did not seem

to be a strong perceptual cue and was not able to override

conflicting cues from the vowel, a finding also reported in

Wang (2011), while fricative voicing was able to stand out

as a cue for the listeners even when conflicting cues were

present. This is potentially because the voicing contrast on

fricatives is cued not only by consonant voicing, but also by

the spectral peak and spectral moments provided by the fri-

cation noise (Jongman et al., 2000). It is also interesting to

note that the voicing difference on fricatives is concomitant

with a larger f0 difference than the voicing difference on

stops in the second syllable sandhi position, as shown in the

growth curve analysis, and the perception results showed

that the f0 difference can be used by listeners. This indicates

that the strength of one cue for a contrast may enhance

another cue for the contrast realized elsewhere.

The presence of phonological tone sandhi had an inter-

esting effect on the acoustic realization and perception of the

laryngeal contrast in question. Although the intervocalic

position is typically a prime position for laryngeal contrasts

for consonants due to the transitional cues that vowels pro-

vide (Steriade, 1997), the fact that this contrast in Shanghai

is primarily cued by f0 in non-sandhi contexts, and the f0 cue

can potentially be lost due to tone sandhi in this position,

makes this a special case. The f0 result on the second sylla-

ble of disyllables indicates that the tonal difference concomi-

tant with the voicing difference of the onset consonant was

indeed neutralized with fricative-onset syllables as marginal

exceptions, but the contrast was only fully lost for the sonor-

ants. For fricatives, there was a voicing and duration differ-

ence between the voiceless and voiced consonants, and the

vowels also differed in the phonation and periodicity mea-

sures; perceptually, both consonant and f0 cues were able to

drown out conflicting cues. For stops, the voiceless and

voiced stops differed in closure voicing in this position; this

voicing difference potentially led to the high d′ score for the

classification of the natural tokens, but was ineffective when

there were conflicting cues. The complexity of the situation

indicates that there is more nuance to incomplete neutraliza-

tion of a phonological contrast, as the “neutralizing” context,

e.g., the non-initial sandhi context, may need to be further

divided up, in this case, by manners of articulation of the

onset consonant.
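For readers less familiar with the d′ (d-prime) sensitivity index mentioned above, the sketch below shows one common way to compute it from identification counts, as the difference between the z-transformed hit rate and false-alarm rate. It is a minimal illustration, not necessarily the exact procedure used in this study; the function name `d_prime`, the log-linear correction, and the counts in the usage lines are assumptions made for the example.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (0.5 added to each count) keeps both rates
    strictly between 0 and 1, so the z-transform is always defined.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts: chance-level responses versus good discrimination.
print(d_prime(50, 50, 50, 50))
print(d_prime(90, 10, 10, 90))
```

A value near zero corresponds to no discriminability, as for the sonorants in the sandhi position, and larger values correspond to better separation of the two categories.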

The weakness of the voice quality contrast for sonorant

consonants was evident in both the production and percep-

tion results. In the non-sandhi position, the contrast was cued

by f0 on the following vowel, and there was a CPP differ-

ence on the consonant itself, but the CPP cue was so weak

that it was not able to compete with conflicting cues in the

perception experiment. In the sandhi position, the sonorants

were the only manner that lost all acoustic cues reported

here between the contrasting pair, and the perceptual results

also showed that there was no discriminability between the

modal and murmured sonorants in this position. These

results, on the one hand, support the contention of Berkson

(2016b) that phonation contrasts tend to be more weakly

cued on sonorants than obstruents, which potentially contrib-

utes to their typological rarity (see also Gao, 2015; Gao and

Hallé, 2015); on the other hand, they also support the phonological theory of “licensing-by-cue” and its variations

(Steriade, 1997, 2008), which contend that phonological

contrasts are better licensed in contexts of better perceptibil-

ity and more susceptible to loss when the cues are endan-

gered. The complete loss of the laryngeal contrast for

sonorants in the sandhi position in Shanghai is a case in

point. A caveat to the current results is that the acoustic and

perceptual data both come from nasals, and it is possible that

other sonorants, such as liquids, may behave differently,

especially given that nasalization and spread glottis share

similar acoustic consequences of increased amplitude of the

first harmonic and increased bandwidth of the first formant

(Keyser and Stevens, 2006), and have been shown to be per-

ceptually confusable with each other (Klatt and Klatt, 1990).

However, the confusion in the source of an increased first

harmonic reported in Klatt and Klatt (1990) was for a female


voice whose first harmonic is close to the value of the nasal

pole; and Berkson (2013) showed in her study of breathiness

in Marathi that only males cued breathiness with H1*-H2*,

while females used CPP. This indicates that the confusion

between nasalization and breathiness can potentially be

avoided. Moreover, if the weakness of the phonation cues

for sonorants is entirely due to the confusability between

nasalization and breathiness, then the typological rarity of

phonation contrasts on sonorants, in general, remains unac-

counted for.

Although the results of our perception study by and large

match the results of the acoustic study in the aggregate, we

are not in a position to make generalizations about how the

production and perception of this laryngeal contrast in

Shanghai are related to each other on an individual speaker’s

level as the subjects in the two experiments were two distinct

sets. It is possible that individual subjects tune their perception

to the aggregate input in their environment, but we do not

exclude the possibility that individual subjects’ perception is

disproportionally biased by their own production. It must also

be acknowledged that both our production and perception

studies were conducted with relatively young speakers of

Shanghai, and as previously mentioned, it is possible that the

voicing/voice quality contrast has undergone or is undergoing

restructuring (see Gao, 2015; Gao and Hallé, 2016, 2017), the

investigation of which requires a design that incorporates

sociolinguistic factors, which our study does not.

Finally, if we consider this contrast to stem from a sin-

gle distinctive feature, then this set of data lays out clearly

the challenges for how the instantiation of this feature in a

particular language can be acquired, as the issue concerns

not only the weighting and integration of multiple cues in

potentially unsupervised learning, but also how this learning

can overcome the contextual dependency of cue weighting,

especially when phonological processes intervene. The cur-

rent work does not provide an answer to a difficult problem

like this, but it does suggest that the learning of phonological

contrast realization is likely guided by the morphophonolog-

ical alternation in the language as well as the distributional

properties of the acoustic dimensions along which the con-

trast manifests itself.

V. CONCLUSION

This paper presents a case study on how a phonological

contrast is cued in multiple phonetic dimensions, both acousti-

cally and perceptually. What is of particular interest is that the

contrast in question—a laryngeal contrast in Shanghai Wu—

is cued differently when realized on different manners (stops,

fricatives, sonorants) and in different positions (non-sandhi,

sandhi). Acoustic results showed that, although this contrast

has been described as phonatory in earlier literature, its pri-

mary cue is in tone, at least in the younger speakers that were

tested. In the non-sandhi position, phonation correlates only

appear on fricative-onset syllables and sonorant consonants;

stops and fricatives have consonant duration cues, and frica-

tives also have a frication voicing cue. In the sandhi position,

tone sandhi neutralizes the f0 difference, but the contrast is

maintained in fricatives by both consonant and vowel

phonation cues, marginally maintained in stops by closure

voicing, and lost in sonorants. The perception results were

largely consistent with the aggregate acoustic results, indicat-

ing that speakers adjust the perceptual weights of individual

cues for a contrast according to contexts. These findings sup-

port the position that phonological contrasts are formed by the

integration of multiple cues in a language-specific, context-

specific fashion and should be represented as such.

ACKNOWLEDGMENTS

We are grateful to Dan Yuan and Zhongmin Chen for

hosting us at Fudan University for data collection, Yifeng Li

and Zhenzhen Xu for serving as our Shanghai consultants,

Kelly Berkson, Christina Esposito, and Goun Lee for

helping us with VoiceSauce, Mingxing Li for helping us

with the linear discriminant analysis, and the University of

Kansas General Research Fund No. 2301618 for financial

support. We also thank the Associate Editor Megha Sundara

and four anonymous reviewers for their many insightful

comments, which helped improve both the content and the

presentation of the paper. All remaining errors are our own.

¹We focus on the closure instead of the post-release portion of the stops here as the previous literature in Shanghai has shown that the difference in release duration between voiceless unaspirated and voiced stops in either initial or medial position is minimal (Shen and Wang, 1995; Chen, 2011).

²The authors reported H2-H1 and F1-H1. These were converted to H1-H2 and H1-F1, and F1 was changed, notationally, to A1, to be consistent with the rest of the paper.

³For spectral measures, we focus on H1*-H2* for two reasons. First, although different spectral measures have been shown to be effective voice quality measures in different languages, H1-H2 is the most consistently used parameter in the literature and is found to be effective in the majority of languages with phonation contrasts. Gao (2015) and Gao and Hallé (2017) also found that H1-H2 was the most consistently used acoustic parameter for the laryngeal contrast in Shanghai by speakers of different age groups and genders and in different tonal contexts. Second, H1*-A1*, H1*-A2*, and H1*-A3* were also measured and analyzed for our study, and they did not reveal additional differences for the contrast in question not shown by H1*-H2*.

⁴An anonymous reviewer asked whether the word pairs with a nasal coda behaved similarly to those that are open in the phonation measures due to the potential confusion between breathiness and nasality reported in the literature (Klatt and Klatt, 1990; Keyser and Stevens, 2006). We reran the growth curve analyses for H1*-H2* and CPP on the vowels for the stimuli without nasal codas, and the statistical patterns were identical to the ones reported here except that for the CPP in sonorant onsets in the monosyllabic (no sandhi) context, the addition of the voicing intercept [χ²(1) = 4.451, p = 0.035] and the interaction between voicing and the linear time term [χ²(1) = 4.522, p = 0.033] both significantly improved the model, with the modal sonorants inducing a greater CPP and a slower CPP decrease on the following vowel.

⁵See supplementary material at https://doi.org/10.1121/1.5052364 for the acoustic files used in the perception experiment.

⁶An anonymous reviewer raised the issue of whether any prosodic effects on the onset consonants (e.g., as documented in Chen, 2011) could have influenced the perception results. In the production, the speaker read all items in the same carrier sentence, effectively putting the items in a focus position. In the perception study, the listeners also listened to the same carrier sentence and therefore performed the identification in the same focus position. Therefore, the entire study can be conceived as an investigation of this laryngeal contrast in focus position.

⁷Additional analyses that included random slopes by subject for each factor were also conducted for different manners in the two positions. Models that included the random slopes for all factors all failed to converge. Models that included a subset of the random slopes were attempted as well, and there was no consistent random slope structure that converged for all manners and positions. We therefore opted to report the models with random intercept by subject only.
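The model-comparison statistics in footnote 4 can be checked by hand. For a likelihood-ratio test with one degree of freedom, the p value is the upper tail of a χ² distribution, which equals erfc(√(x/2)). The Python sketch below uses only the standard library; the function name `chi2_sf_df1` is introduced here for illustration, and the identity holds for one degree of freedom only.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Test statistics as reported in footnote 4.
for stat in (4.451, 4.522):
    print(stat, round(chi2_sf_df1(stat), 3))
```

The values printed should agree with the reported p values to about three decimal places.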

Abramson, A. S., and Lisker, L. (1985). “Relative power of cues: F0 shift

versus voice timing,” in Linguistic Phonetics, edited by V. Fromkin

(Academic, New York), pp. 25–33.

Abramson, A. S., Nye, P. W., and Luangthongkum, T. (2007). “Voice regis-

ter in Khmu’: experiments in production and perception,” Phonetica 64,

80–104.

Andruski, J., and Ratliff, M. (2000). “Phonation types in production of pho-

nological tone: The case of Green Mong,” J. Int. Phon. Assoc. 30, 37–61.

Aoki, H. (1970). “A note on glottalized consonants,” Phonetica 21, 65–75.

Baayen, R. H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R (Cambridge University Press, Cambridge, UK).

Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). “Random effects

structure for confirmatory hypothesis testing: Keep it maximal,” J. Mem.

Lang. 68, 255–278.

Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015). “Fitting linear

mixed-effects models using lme4,” J. Stat. Software 67, 1–48.

Berkson, K. (2013). “Phonation types in Marathi: An acoustic inves-

tigation,” Ph.D. dissertation, University of Kansas.

Berkson, K. (2016a). “Durational properties of Marathi obstruents,” Indian

Linguist. 76(3-4), 7–25.

Berkson, K. (2016b). “Production, perception, and distribution of breathy

sonorants in Marathi,” in Formal Approaches to South Asian Languages, Vol. 2, edited by M. Menon and S. Syed (Open Journal Systems 2.4.6.0,

University of Konstanz, Konstanz, Germany), pp. 4–14.

Blankenship, B. (2002). “The timing of nonmodal phonation in vowels,”

J. Phonetics 30, 163–191.

Boersma, P., and Weenink, D. (2012). “Praat: Doing phonetics by computer

(version 5.3.14) [computer program],” http://www.praat.org/ (Last viewed

5 July 2012).

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984).

Classification and Regression Trees (Wadsworth, Belmont, CA).

Brunelle, M. (2009). “Tone perception in Northern and Southern

Vietnamese,” J. Phonetics 37, 79–96.

Brunelle, M. (2012). “Dialect experience and perceptual integrality in pho-

nological registers: Fundamental frequency, voice quality and the first for-

mant in Cham,” J. Acoust. Soc. Am. 131, 3088–3102.

Cao, J.-F., and Maddieson, I. (1992). “An exploration of phonation types in

Wu dialects of Chinese,” J. Phonetics 20, 77–92.

Chao, Y.-R. (1967). “Contrastive aspects of the Wu dialects,” Language 43,

92–101.

Chen, M. Y. (1970). “Vowel length variation as a function of the voicing of

the consonant environment,” Phonetica 22, 129–159.

Chen, Y.-Y. (2011). “How does phonology guide phonetics in segment-f0 interaction?,” J. Phonetics 39, 612–625.

Chomsky, N., and Halle, M. (1968). The Sound Pattern of English (Harper

and Row, New York).

Clayards, M., Tanenhaus, M. K., Aslin, R. N., and Jacobs, R. A. (2008).

“Perception of speech reflects optimal use of probabilistic speech cues,”

Cognition 108, 804–809.

Clements, G. N. (2009). “The role of features in phonological inventories,”

in Contemporary Views on Architecture and Representations in Phonology, edited by E. Raimy and C. Cairns (MIT Press, Cambridge,

MA), pp. 19–68.

Davis, K. (1994). “Stop voicing in Hindi,” J. Phonetics 22, 177–193.

de Krom, G. (1993). “A cepstrum-based technique for determining a har-

monics-to-noise ratio in speech signals,” J. Speech Hear. Res. 36,

224–266.

DiCanio, C. (2014). “Cue weight in the perception of Trique glottal con-

sonants,” J. Acoust. Soc. Am. 135, 884–895.

Dinnsen, D. A., and Charles-Luce, J. (1984). “Phonological neutralization,

phonetic implementation, and individual differences,” J. Phonetics 12,

49–60.

Dmitrieva, O., Jongman, A., and Sereno, J. (2010). “Phonological neutrali-

zation by native and non-native speakers: The case of Russian final

devoicing,” J. Phonetics 38, 483–492.

Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., and Mehler, J. (1999).

“Epenthetic vowels in Japanese: A perceptual illusion?,” J. Exp. Psych.:

Human Percept. Perform. 25, 1568–1578.

Dutta, I. (2009). Acoustics of Stop Consonants in Hindi: Voicing, Fundamental Frequency and Spectral Intensity (Verlag Dr. Müller, Saarbrücken, Germany).

Esposito, C. M. (2010a). “Variation in contrastive phonation in Santa Ana

Del Valle Zapotec,” J. Int. Phonetic Assoc. 40, 181–198.

Esposito, C. M. (2010b). “The effects of linguistic experience on the percep-

tion of phonation,” J. Phonetics 38, 306–316.

Esposito, C. M. (2012). “An acoustic and electroglottographic study of

White Hmong tone and phonation,” J. Phonetics 40, 466–476.

Flege, J. E., and Wang, C. (1989). “Native-language phonotactic constraints

affect how well Chinese subjects perceive the word-final /t/-/d/ contrast,”


Francis, A. L., Kaganovich, N., and Driscoll-Huber, C. (2008). “Cue-specific

effects of categorization training on the relative weighting of acoustic cues

to consonant voicing in English,” J. Acoust. Soc. Am. 124, 1234–1251.

Gao, J.-Y. (2015). “Interdependence between tones, segments and phonation types in Shanghai Chinese: Acoustics, articulation, perception and evolution,” Ph.D. dissertation, Université Sorbonne Nouvelle–Paris III, Paris, France.

Gao, J.-Y., and Hallé, P. (2013). “Duration as a secondary cue for perception of voicing and tone in Shanghai Chinese,” in Proc. of Interspeech 14, Lyon, France, pp. 3157–3162.

Gao, J.-Y., and Hallé, P. (2015). “The role of voice quality in Shanghai tone perception,” in Proc. of ICPhS 18, Glasgow, Scotland, UK, paper no. 448.

Gao, J.-Y., and Hallé, P. (2016). “Sociolinguistic motivations in sound change: On-going loss of low tone breathy voice in Shanghai Chinese,” Papers Hist. Phonology 1, 166–186.

Gao, J.-Y., and Hallé, P. (2017). “Phonetic and phonological properties of tones in Shanghai Chinese,” Cahiers de Linguistique Asie Orientale 46, 1–31.

Garellek, M., and Keating, P. (2011). “The acoustic consequences of phona-

tion and tone interactions in Jalapa Mazatec,” J. Int. Phonetic Assoc. 41,

185–205.

Garellek, M., Keating, P., Esposito, C. M., and Kreiman, J. (2013). “Voice

quality and tone identification in White Hmong,” J. Acoust. Soc. Am. 133,

1078–1089.

Gordon, M., and Ladefoged, P. (2001). “Phonation types: A crosslinguistic

overview,” J. Phonetics 29, 383–406.

Halle, M., and Stevens, K. (1971). “A note on laryngeal features,” Q.

Progress Rep. Res. Lab. Electron. (MIT) 101, 198–213.

Hallé, P., and Best, C. (2007). “Dental-to-velar perceptual assimilation: A cross-linguistic study of the perception of dental stop+/l/ clusters,” J. Acoust. Soc. Am. 121, 2899–2914.

Hanson, H. M., Stevens, K. N., Kuo, H.-K. J., Chen, M. Y., and Slifka, J.

(2001). “Towards models of phonation,” J. Phonetics 29, 451–480.

Hillenbrand, J. M., Cleveland, R. A., and Erickson, R. L. (1994). “Acoustic

correlates of breathy vocal quality,” J. Speech Hear. Res. 37, 769–778.

Holmberg, E., Hillman, R., Perkell, J., Guiod, P., and Goldman, S. (1995).

“Comparisons among aerodynamic, electroglottographic, and acoustic

spectral measures of female voice,” J. Speech Hear. Res. 38, 1212–1223.

Holt, L. L., Lotto, A. J., and Kluender, K. R. (2001). “Influence of funda-

mental frequency on stop-consonant voicing perception: A case of learned

covariation or auditory enhancement,” J. Acoust. Soc. Am. 109, 764–774.

Huffman, M. K. (1987). “Measures of phonation in Hmong,” J. Acoust. Soc.

Am. 81, 495–504.

Jakobson, R., Fant, G., and Halle, M. (1952). Preliminaries to Speech Analysis (MIT Press, Cambridge, MA).

Jongman, A., Wayland, R., and Wong, S. (2000). “Acoustic characteristics

of English fricatives,” J. Acoust. Soc. Am. 108, 1252–1263.

Keyser, S. J., and Stevens, K. N. (2006). “Enhancement and overlap in the

speech chain,” Language 82, 33–63.

Khan, S. D. (2012). “The phonetics of contrastive phonation in Gujarati,”


Kim, H., and Jongman, A. (1996). “Acoustic and perceptual evidence for

complete neutralization of manner of articulation in Korean,” J. Phonetics

24, 295–312.

Kingston, J. (1992). “The phonetics and phonology of perceptually moti-

vated articulatory covariation,” Lang. Speech 35, 99–113.

Kingston, J., Diehl, R. L., Kirk, C. J., and Castleman, W. A. (2008). “On the

internal perceptual structure of distinctive features: The [voice] contrast,”


Klatt, D. H., and Klatt, L. C. (1990). “Analysis, synthesis and perception of

voice quality variations among male and female talkers,” J. Acoust. Soc.

Am. 87, 820–856.



Kuznetsova, A., Brockhoff, B., and Christensen, H. (2016). “Tests in linear

mixed effects models,” available at https://cran.r-project.org/web/packages/lmerTest/index.html (Last viewed August 3, 2018).

Laver, J. (1980). The Phonetic Description of Voice Quality (Cambridge

University Press, Cambridge, UK).

Lemon, S. C., Roy, J., Clark, M. A., Friedmann, P. D., and Rakowski, W.

(2003). “Classification and Regression Tree analysis in public health:

Methodological review and comparison with logistic regression,” Ann. Behav. Med. 26, 172–180.

Lisker, L. (1986). “‘Voicing’ in English: A catalogue of acoustic features

signalling /b/ versus /p/ in trochees,” Lang. Speech 29, 3–11.

Llanos, F., Dmitrieva, O., Shultz, A., and Francis, A. L. (2013). “Auditory

enhancement and second language experience in Spanish and English

weighting of secondary voicing cues,” J. Acoust. Soc. Am. 134,

2213–2224.

Massaro, D. W. (1987). “Psychophysics versus specialized processes in

speech perception: An alternative perspective,” in The Psychophysics of Speech Perception, edited by M. E. H. Schouten (Martinus Nijhoff,

Boston), pp. 46–65.

Massaro, D., and Cohen, M. (1983). “Phonological context in speech

perception,” Percept. Psychophys. 34, 338–348.

McMurray, B., Cole, J. S., and Munson, C. (2011). “Features as an emergent

product of computing perceptual cues relative to expectations,” in Where Do Phonological Features Come From?: Cognitive, Physical and Developmental Bases of Distinctive Speech Categories, edited by G. N.

Clements, and R. Ridouane (John Benjamins, Amsterdam/Philadelphia),

pp. 197–235.

Mikuteit, S., and Reetz, H. (2007). “Caught in the ACT: The timing of aspi-

ration and voicing in Bengali,” Lang. Speech 50, 247–277.

Miller, A. L. (2007). “Guttural vowels and guttural co-articulation in

Ju|’hoansi,” J. Phonetics 35, 56–84.

Mirman, D. (2014). Growth Curve Analysis and Visualization Using R (CRC Press, Boca Raton, FL).

Newman, R. S. (2003). “Using links between speech perception and speech

production to evaluate different acoustic metrics: A preliminary report,”

J. Acoust. Soc. Am. 113, 2850–2860.

Parker, E. M., Diehl, R. L., and Kluender, K. R. (1986). “Trading relations

in speech and nonspeech,” Percept. Psychophys. 39, 129–142.

Port, R., and Crawford, P. (1989). “Incomplete neutralization and pragmat-

ics in German,” J. Phonetics 17, 257–282.

R Core Team (2014). “R: A language and environment for statistical com-

puting (version 3.1.0),” (R Foundation for Statistical Computing, Vienna),

available at http://www.R-project.org/ (Last viewed October 10, 2017).

Raphael, L. J. (1972). “Preceding vowel duration as a cue to the perception

of the voicing characteristic of word-final consonants in American

English,” J. Acoust. Soc. Am. 51, 1296–1303.

Ren, N.-Q. (1992). “Phonation types and stop consonant distinctions:

Shanghai Chinese,” Ph.D. dissertation, University of Connecticut, Storrs.

Repp, B. H. (1983). “Trading relations among acoustic cues in speech per-

ception are largely a result of phonetic categorization,” Speech Commun.

2, 341–361.

Shen, Z.-W., and Wang, W. S. (1995). “Wuyu zhuoseyin de yanjiu—Tongji

shang de fenxi he lilun shang de kaolü” (“A study of voiced stops in the

Wu dialects—Statistical analysis and theoretical considerations”), in

Wuyu Yanjiu (Studies of the Wu Dialects), edited by E. Zee (New Asia

Books, Hong Kong), pp. 219–238.

Shue, Y.-L., Keating, P., Vicenik, C., and Yu, K. (2011). “VoiceSauce: A

Program for Voice Analysis,” available at http://www.ee.ucla.edu/~spapl/

voicesauce/ (Last viewed November 1, 2015).

Shultz, A. A., Francis, A. L., and Llanos, F. (2012). “Differential cue

weighting in perception and production of consonant voicing,” J. Acoust.

Soc. Am. 132, EL95–EL101.

Sjölander, K. (2004). “The Snack Sound Toolkit,” available at http://www.speech.kth.se/snack/ (Last viewed March 2, 2018).

Steriade, D. (1997). “Phonetics in phonology: The case of laryngeal neu-

tralization,” UCLA Work. Pap. Phonetics 3, 25–146.

Steriade, D. (2008). “The phonology of perceptibility effects: The P-map

and its consequences for constraint organization,” in The Nature of the Word: Essays in Honor of Paul Kiparsky, edited by K. Hanson, and S.

Inkelas (MIT Press, Cambridge, MA), pp. 151–180.

Stevens, K. N. (1977). “Physics of laryngeal behavior and larynx modes,”

Phonetica 34, 264–279.

Stevens, K. N. (2002). “Toward a model for lexical access based on acoustic

landmarks and distinctive features,” J. Acoust. Soc. Am. 111, 1872–1891.

Stevens, K. N., and Keyser, S. J. (2010). “Quantal theory, enhancement, and

overlap,” J. Phonetics 38, 10–19.

Toscano, J. C., and McMurray, B. (2010). “Cue integration with categories:

Weighting acoustic cues in speech using unsupervised learning and distri-

butional statistics,” Cogn. Sci. 34, 434–464.

Traill, A., and Jackson, M. (1988). “Speaker variation and phonation type in

Tsonga nasals,” J. Phonetics 16, 385–400.

Venables, W. N., and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th ed. (Springer, New York).

Wang, Y.-Z. (2011). “Acoustic measurements and perceptual studies on initial stops in Wu dialects: Take Shanghainese for example,” Ph.D. dissertation, Zhejiang University, China.

Warner, N., Jongman, A., Sereno, J., and Kemper, R. (2004). “Incomplete

neutralization of sub-phonemic durational differences in production and

perception of Dutch,” J. Phonetics 32, 251–276.

Wayland, R., and Jongman, A. (2003). “Acoustic correlates of breathy and

clear vowels: The case of Khmer,” J. Phonetics 31, 181–201.

Weihs, C., Ligges, U., Luebke, K., and Raabe, N. (2005). “klaR analyzing

German business cycles,” in Data Analysis and Decision Support, edited by

D. Baier, R. Decker, and L. Schmidt-Thieme (Springer, Berlin), pp. 335–343.

Xu, B.-H., and Tang, Z.-Z. (1988). Shanghai Shiqu Fangyan Zhi (A Description of the Urban Shanghai Dialect) (Shanghai Educational Press,

Shanghai).

Xu, Y. (2005–2013). “ProsodyPro.praat,” available at http://www.phon.ucl.

ac.uk/home/yi/ProsodyPro/ (Last viewed November 1, 2015).

Zee, E., and Maddieson, I. (1980). “Tones and tone sandhi in Shanghai:

Phonetic evidence and phonological analysis,” Glossa 14, 45–88.

Zhu, X.-N. (1999). Shanghai Tonetics (Lincom Europa, München).

Zhu, X.-N. (2006). A Grammar of Shanghai Wu (Lincom Europa, München).

